primes2 := {p in {2..10000} | forall i in {2..fix(sqrt(p))} | p mod i /= 0}; print(primes2);One feature I like (and use a lot) is test things from the command line:
$ setl 'time0:=time();primes:= {p in {2..100000} | forall i in {2..fix(sqrt(p))} | p mod i /= 0}; print("Num primes:",#primes);print("It took", (time()-time0)/1000,"seconds");' Num primes: 9592 It took 2.222 secondsA variant of prime number generation using
not exists
instead of forall
:
$ setl 'print({n in {2..100} | (not (exists m in{2..n - 1} | n mod m = 0))}); ' {2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97}Still another variant using intersection of
{2..n}
and the compound numbers:
$ setl 'n := 150; print({2..n} - {x : x in {2..n} | exists y in {2..fix(sqrt(x))} | x mod y = 0});' {2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97 101 103 107 109 113 127 131 137 139 149}Here are some other examples of set/array comprehensions.
$ setl 'f:= [1,1]; r := [f(i) := f(i-1)+f(i-2) : i in [3..10]]; print(f);'Pythagorean triplets as a "one-liner" (not very fast for say [1..300]).
$ setl 'print({[a, b, h]: b in {1..30}, a in {1..b - 1} | (exists h in {2..a + b} | (a*a + b*b = h*h)) and (not (exists d in {2..b - 1} | ((b mod d) = 0 and (a mod d) = 0)))}); ' {[3 4 5] [5 12 13] [7 24 25] [8 15 17] [20 21 29]}Creation of a power set (all subsets of a set), with the intermediate values printed;
$ setl 'a := {1,2,3}; p := { {}}; (for x in A, y in P) p with:= Y with x; print(p); end; print(p);' {{} {1}} {{} {1} {2}} {{} {1} {2} {1 2}} {{} {1} {2} {3} {1 2}} {{} {1} {2} {3} {1 2} {1 3}} {{} {1} {2} {3} {1 2} {1 3} {2 3}} {{} {1} {2} {3} {1 2} {1 3} {2 3} {1 2 3}} {{} {1} {2} {3} {1 2} {1 3} {2 3} {1 2 3}}Collect values from a tuple to a map (hash table).
[key, value]
.
a := [1,1,2,2,3,3,3,4,4,4,4]; m:={ [i, #[j : j in [1..#a] | a(j) = i ]] : i in { i : i in a}};Then a faster version:
$ setl 'a := [1,1,2,2,3,3,3,4,4,4,4]; m:= {}; for i in a loop m(i) +:= 1; end loop; print(m);' {[1 2] [2 2] [3 3] [4 4]}Index and value of a map
for x = s(i) in ...
in a map (hash table) loop gives both the index (i) and the value (x). Here we also see how to represent ranges with increment other than 1 (much like Haskell).
setl 's := {[i,i**2] : i in [1,3..15]}; for x = s(i) loop print(i,x); end loop;' 1 1 3 9 5 25 7 49 9 81 11 121 13 169 15 225Multi-map (
m{value}
)a(2)
gives OM
, the special undefined value (represented as "*" i GNU SETL).
setl 'a := {[1,["a"]], [2, ["b"]], [1, ["c"]]}; print(a);print(a(2));print(a(1));print(a{1});' {[1 [a]] [1 [c]] [2 [b]]} [b] * {[a] [c]}Compound operators
op/tuple
or op/map
) makes it possible to write quite sparse code (somewhat akin to APL and J). Here is the factorial of 100, also showing the support for arbitrary precision.
$ setl 'print(*/[1..20]);' 2432902008176640000There is no built-in
max
for tuples. Instead we use the compound operator version, which is possible since max
is a binary operator:
$ setl 'setrandom(0); print(max/[random(10000) : i in [1..100]]);' 9898Another example of compound operators is from Project Euler problem #5 (the smallest number that is evenly divisible by all of the numbers from 1 to 20). In my solution (project_euler5.setl) lcm and gcd is defined as operators (in contrast to procedures):
print(lcm/[2..20]); -- Prints the answer. op lcm(a,b); g := a gcd b; return (a*b) div g; end op lcm; op gcd(u, v); return if v = 0 then abs u else v gcd u mod v end; end op;Speaking of Project Euler problems, here is the SETL program for the first problem (Find the sum of all the multiples of 3 or 5 below 1000):
print(+/{i : i in [1..999] | i mod 3 = 0 or i mod 5 = 0});In averages_pythagorean_means.setl, three different version of mean are defined (as procedures) using compound operators (maybe not the most efficient way).
-- arithmetic mean proc mean_A(x); return +/x/#x; end proc; -- geometric mean proc mean_G(x); return (*/x)**(1/#x); end proc; -- harmonic mean proc mean_H(x); return #x/+/[1/i:i in x]; end proc;Randomization The
setrandom(0)
is for creating random variables starting with an "arbitrary" seed.
setl 'setrandom(0); s := [1,3,5,8]; print([random(s) : i in [1..10]]);' [5 1 8 8 5 3 3 1 5 3]With a set we get a value only once:
$ setl 's1 := {1..10}; setrandom(0); print({ random(s1) : i in [1..10]});' {3 5 6 7 8}In GNU SETL the order of the set is always presented as sorted, but this is not a requirement in the SETL language.
$ setl 's:="nonabstractedness"; m:=s("a.*b.*c*.d*.e*"); print(s);print(m);' nonabstractedness abstractednessAlso see read_test2.setl that search for words like this in a word file.
gsub
for global substitution):
$ setl 's:="nonabstractedness"; m:=sub(s,"a.*b.*c*.d*.e*",""); print(s);print(m);' non abstractednessNote that GNU SETL don't support non-greedy regular expressions (i.e. the ".+?" constructs from Perl etc), so the plain old
[^...]
construct must be used.:
$ setl 's:="nonabstractedness"; m:=s("a[^s]+s"); print(s);print(m);' nonabstractedness absA small drawback is that GNU SETL don't have support for national characters in strings. The only acceptable characters are the "plain ASCII".
$ setl 'x := "12345 12345 12345"; print(any(x, "123"));print(x);' 1 2345 12345 12345However, I miss the
many
function which takes many characters from the beginning, not just the first and it is quite easy to create it. First let's see how it works, where we will take all the characters from the beginning of the string if they are any of "123":
$ setl 'x := "12345 12345 12345"; print(x); while any(x, "123") /= "" loop print(x); end;;print(x);' 12345 12345 12345 2345 12345 12345 345 12345 12345 45 12345 12345 45 12345 12345(The corresponding regular expression for this is, of course
^[123]+
.)
many
is defined below. The first argument is defined as read-write (rw
) so we can modify the string s
. The value returned (z
) contains all the matched characters.
proc many(rw s,p); z := ""; while (zz := any(s,p)) /= "" and zz /= "" loop z +:= zz; end loop; return z; end proc;'And here is
many
in action. Note: procedures must always be placed last in a program.
x := "12345 12345 12345"; print(x); z:=many(x, "123"); print("x",x); print("z",z); proc many(rw s,p); print(s); print(p); while (zz := any(s,p)) /= "" and zz /= "" loop z +:= zz; end loop; return z; end proc;Result:
12345 12345 12345 12345 12345 12345 123 x 45 12345 12345 z 123In look_and_say_sequence.setl
many
is used, as well as a direct approach and one using regular expression.
$ setl 'f := filter("ls p*.setl"); print(f);s := split(f,"\n");print([s,#s]);' perm.setl pointer.setl primes2.setl primes3.setl primes.setl printprimes.setl [['perm.setl' 'pointer.setl' 'primes2.setl' 'primes3.setl' 'primes.setl' 'printprimes.setl' ''] 7]Reading a file directly is by
getline
. Note split()
.
x := split(getfile("file.txt"), "\n"); print(#x);
print(send_more_money1()); proc send_more_money1; ss := {0..9}; smm := [[S,E,N,D,M,O,R,Y] : -- ensure that all numbers are different S in ss , E in ss - {S} , N in ss - {S,E} , D in ss - {S,E,N} , M in ss - {S,E,N,D} , O in ss - {S,E,N,D,M} , R in ss - {S,E,N,D,M,O} , Y in ss - {S,E,N,D,M,O,R} | S > 0 and M > 0 and (S * 1000 + E * 100 + N * 10 + D) + (M * 1000 + O * 100 + R * 10 + E) = (M * 10000 + O * 1000 + N * 100 + E * 10 + Y )]; return smm; end proc;For some other (and slower) variants, see send_more_money.setl.
/
) returns a real number, whereas in SETL2 /
returns an integer. So here we use div
instead of /
.
procedure prime_factors(n); facts := []; while even(n) loop facts with:= 2; n := n div 2; end loop; while exists k in [3,5..ceil(sqrt(float(n)))] | n mod k = 0 loop facts with:= k; n := n div k; end loop; facts with:= n; return facts; end prime_factors;Quick sort
proc qsort(a); if #a > 1 then pivot := a(#a div 2 + 1); a := qsort([x in a | x < pivot]) + [x in a | x = pivot] + qsort([x in a | x > pivot]); end if; return a; end proc;In the programs anagrams.setl and sorting.setl I compare some different sort algorithms.
{log}
(setlog) program Clique.slog):
proc clique(G); V := { vv : p in G, vv in p}; -- the vertices cliques := {}; for C in pow(V) loop if forall I in C | forall J in C | {I,J} in {{I}} + G then cliques with:= C; end if; end loop; return cliques; end proc;Luhn test of credit card numbers
proc isluhn10(num); x := [val(i) : i in reverse(num)]; m := {[i,val("0246813579"(i+1))] : i in [0..9]}; return +/[x(i) + m(x(i+1)?0) : i in [1,3..#num]] mod 10 = 0; end proc;Pancake sort
procedure pancake_sort(rw nums); for i in [#nums,#nums-1..1] loop -- find the index of the largest element not yet sorted -- this variant is sligtly faster [this_max, max_idx] := find_max(nums(1..i)); if max_idx = i then continue; -- element already in place end if; -- flip this max element to index 1 if max_idx > 1 then nums(1..max_idx) := rev(nums(1..max_idx)); end if; -- then flip the max element to its place nums(1..i) := rev(nums(1..i)); end loop; end procedure; -- reverse a tuple procedure rev(a); return [a(i) : i in [#a,#a-1..1]]; end procedure; -- -- find the (first) index of the max value -- in a tuple. -- Returns [max_value, index] procedure find_max(a); max_idx := 1; this_max := a(1); for j in [2..#a] loop if a(j) > this_max then this_max := a(j); max_idx := j; end if; end loop; return [this_max, max_idx]; end procedure;
is_tuple
)
minNodes
nodes (akin to the existing maxNodes
minNodes
.alldifferentVariables
.I talked there about building a node validator that restricted the programs with these constraints. However, a better way - and more genetic programming-ish - is to "penalty" programs that do not satisfy these restrictions. And this is the way I have taken.
The new options are both used in the recreational problem by Richard Wiseman (Friday puzzle 2010-02-26) of finding an equation with result 24 using the numbers 5, 5, 5, 1
exactly once, and the four arithmetic operators (+,-,*,/). Richard Wiseman's solution can be read Answer to the Friday puzzle…. (2010-03-01).
The problem is modeled in number_puzzle4.conf. Here is the configuration file, where the new options are marked in bold (see below for the ForLoopD
option).
presentation: Puzzle
return_type: DoubleClass
num_input_variables: 4
variable_names: a b c d e
# With ForLoopD
# functions: Multiply,Divide,Add,Subtract,ForLoopD
functions: Multiply,Divide,Add,Subtract
# We don't use any numeric terminals
no_terminals: true
max_init_depth: 4
population_size: 1000
max_crossover_depth: 4
num_evolutions: 400
max_nodes: 7
min_nodes: 7 100
alldifferent_variables: true 100
show_similiar: true
similiar_sort_method: length
data
5 5 5 1 24
Here we require that there ought to be minimum of 7 nodes (as well as maximum number of nodes), i.e. the 4 variables (a, b, c, d
), and 3 operators between them. If a program has less number of nodes, then we "penalty" the program with 100 (the second value) points (errors). Note that there is no guarantee that the constraint is held, just quite probable with this large penalty.
The other option, alldifferentVariables
, is used in the same way: If there is an variable in the program that has already been use, we penalty it by 100 (the second value) points.
Also, I had increased the number of evolution to 400 (from 100) because of these constraints.
With these new options the required solution to the problem is found rather easy, though maybe not on each run. Remember that a=5, b=5, c=5, and d = 1, and the target is the number 24.
(c - (d / a)) * b
b * (c - (d / a))
The numeric solution of the problem is 5*(5-1/5)
, and the two programs are just permutations of this solutions.
minNodes
and maxNodes
to 8 and ran again.
min_nodes: 8
max_nodes: 8
Since there can be no solution with 8 nodes there must be some penalty. Then the following solutions came, all with an error of 100, the penalty for not been minimum 8 nodes. The variables are, however, all different as they should so there is no penalty.
All solutions with the best fitness (100.0):
Sort method: occurrence
(a - (d / c)) * b [42831]
b * (a - (d / c)) [28531]
(b - (d / c)) * a [72]
a * (b - (d / c)) [9]
b * (c - (d / a)) [9]
(a - (d / b)) * c [2]
It was 6 different solutions with fitness 100.0
It is interesting that there are more solutions with the constraint of 8 nodes than with 7.
Increasing the the min, and max number of nodes to 9 and 9, respectively, then there are solutions with the stated number of nodes. But now there is a penalty of 100 for not been all different, and they are - of course - not a real solution to the problem.
All solutions with the best fitness (100.0):
Sort method: occurrence
(c * b) - ((c / d) / a) [17588]
(c - (d - (b * b))) - a [2]
It was 2 different solutions with fitness 100.0
data
1 2 3 4 5
One run give the following 46 solutions with 0 errors. The number in [] is the number of found occurrences of the specific solution.
All solutions with the best fitness (0.0):
Sort method: occurrence
d - (a / (b - c)) [70602]
(c + d) - (a * b) [22871]
(c - (b * a)) + d [8724]
c + (d / (b * a)) [3109]
d - (b - (c * a)) [116]
((d - b) * a) + c [107]
(d - b) + (a * c) [58]
(c - b) + (a * d) [40]
(d * b) - (c * a) [35]
((c / b) * d) - a [24]
(c - b) * (d + a) [24]
c + ((a / b) * d) [19]
(b * d) - (c / a) [16]
d + ((c - a) / b) [15]
(d + a) / (c - b) [15]
(d + c) - (b / a) [13]
(c - b) + (d / a) [12]
(c / a) + (d - b) [8]
(b * d) - (a * c) [8]
(d * a) - (b - c) [8]
(d / (a * b)) + c [8]
(a * d) - (b - c) [7]
(a * d) + (c - b) [7]
(a + d) * (c - b) [6]
(a * c) + (d / b) [6]
(c / a) + (d / b) [5]
(c + d) - (b * a) [4]
(d + c) - (a * b) [4]
(c + (d / b)) * a [4]
(a + d) / (c - b) [4]
(d + a) * (c - b) [3]
(d / a) - (b - c) [3]
(b * d) - (c * a) [2]
((d / b) * c) - a [2]
(d / b) + (c / a) [2]
(d * a) + (c - b) [1]
((c - b) / a) + d [1]
c + (d - (a * b)) [1]
(c * a) - (b - d) [1]
(a * c) - (b - d) [1]
((d * a) + c) - b [1]
(d * b) - (a * c) [1]
(d + c) - (b * a) [1]
(d / a) + (c - b) [1]
a + ((c - b) * d) [1]
((c + d) - b) / a [1]
It was 46 different solutions with fitness 0.0
Here is much more solution, which indicates that it is a simpler program than the above.
ForLoop
, which also can be used in this program. (This was done by copying the code in the JGAP distribution, org.jgap.gp.function.ForLoop, and then do some small changes.)
The logic of this function is to create a for loop and for each loop add the result of the code in the body of the loop ("some code") to the final result which is then returned as a value of the loop. In a normal programming language this should be coded like this. The number of loops (the variable a
) is dynamic selected.
double x = 0.0d;
for(int i=0;i<a;i++) { x += some code }
return x;
As a test, I added ForLoopD
to the function list in the 5 5 5 1 24
problem (see the configuration above):
functions: Multiply,Divide,Add,Subtract,ForLoopD
One solution is the following with 0 errors:
for(int i=0;i<b;i++) { (c - (d / a)) }
Which is - of course - just another way of stating the following solution:
b*(c - (d / a))
Well, I have to see if this function is of any real use...
show_similar
: Alternative name of show_similiar
.
similiar_sort_method
: Method of sorting the similiar solutions when using show_similiar
, which shows all solutions that has the same fitness value as the best found solution. Alternative name: similar_sort_method
. Valid options are:
occurrence
: descending number of occurrences (default)
length
: length of solutions (ascending)
error_method
: Error method to use. Valid options are
totalError
: sum of (absolute) errors (default)
minError
: minimum error
meanError
: mean error
medianError
: median error
maxError
: max error
no_terminals
: If true then no Terminal is used, i.e. no numbers, just variables. Default: false.
make_time_series
: Make a time series of the first line of data. The value of num_input_variable
determines the number of laps (+1 for the output variable. See below for some examples.
make_time_series_with_index
: As make_time_series
with an extra input variable for the index of the series. (Somewhat experimental.)The fitness cases consists of all years 1890..2030, and 1200, 1300, 1400, 1500, 1600, 1700, and 1800.
The functions used are: Multiply,Divide,Add,Subtract,ModuloD,IfElseD
where IfElseD
may be replaced with IfLessThanOrEqualD
, or removed completely.
ModuloD
is not the normal modulo operator. Instead it is "protected modulo" where the arguments are first converted to integers and then taken modulo. However, if the second argument is 0 (zero), the result is 0 (zero). This function is represented as either modp
or %
below.
The program found a lot of solutions with error 1 (for year 1900).
Using IfLessThanOrEqualD
if(y <= ((modp(y,(y / 471.0))) * (296.0 * y))) { (y - y) } else { (327.0 / 327.0) }
Without IfElseD
:
(326.0 / (((((y - 536.0) % 536.0) + y) % (y / 226.0)) + 326.0)) % (283.0 % y)
(y / (((y * 654.0) % (24.0 % y)) + y)) % y
(y / (((y * (330.0 % y)) % (24.0 % y)) + y)) % y
Note also that this configuration uses only one fitness case and let the program find any solution that comply to the equation. It also use the new option no_terminals
for using just variables (no Terminal numbers) which was implemented for this example.
Here is a result from a sample run. The number in [] is the number of occurrences of the specific programs. In this example we also see the new option similiar_sort_method: length
at work, which sorts the similiar solutions according to length (normally it it sorted on the number of occurrences). The variables in the solutions means: a = 5, b = 5, c = 5 and d = 1.
All solutions with the best fitness (0.0):
Sort method: length
(b * c) - d [5]
(a * c) - d [4162]
(b * b) - d [4]
(c * a) - d [251]
(a * a) - d [10]
(c * c) - d [424]
(c * b) - d [1]
(b * a) - d [36]
(c - d) * (a + d) [1]
(b * a) - (b / c) [121]
(b * a) - (a / c) [2]
(c * b) - (c / c) [5]
(b * b) - (a / a) [3]
(c * a) - (b / b) [2]
(a * c) - (d * d) [633]
(a - d) * (d + b) [4]
(c * b) - (a / c) [1]
(a * b) - (c / b) [2]
(c * c) - (b / b) [1]
It was 19 different solutions with fitness 0.0
None of these are a solution to Wiseman's puzzle.
Here we have limited the number of nodes with max_modes: 7
(4 variables + 3 terminals), but there is no standard option in JGAP to state the minimum number of nodes. However, with a "node validator" this could probably be done. I plan to experiment more with node validators for these kind of constraints and "global functions" mentioned above.
make_time_series
. See below for more about this option.
make_time_series
. See below.
make_time_series
may require some explanation.
The following configuration file is all that is needed for the Fibonacci problem (in time series representation). Actually, the two lines in bold are the only needed, since the other options has defaults that would work well here.
make_time_series: true
num_input_variables: 4
terminal_range: -10 10
functions: Multiply,Divide,Add,Subtract
max_init_depth: 4
population_size: 100
num_evolutions: 100
max_crossover_depth: 8
max_nodes: 21
data
1,1,2,3,5,8,13,21,34,55,89,144,233,377,610,987,1597,2584,4181,6765,10946,17711,28657,46368
The option make_time_series
will then transform the data into a data set and then proceed as if the data set has been stated explicit. Note: the SymbolicRegression program works with double, hence the somewhat unusual presentation.
The number of time lags is the number of input variables (num_input_variables
) + 1 for the output variable; here 4 + 1 = 5 time lags. The program prints the transformed data first, i.e.:
Making timeseries, #elements: 24
1.0 1.0 2.0 3.0 5.0
1.0 2.0 3.0 5.0 8.0
2.0 3.0 5.0 8.0 13.0
3.0 5.0 8.0 13.0 21.0
5.0 8.0 13.0 21.0 34.0
8.0 13.0 21.0 34.0 55.0
13.0 21.0 34.0 55.0 89.0
21.0 34.0 55.0 89.0 144.0
34.0 55.0 89.0 144.0 233.0
55.0 89.0 144.0 233.0 377.0
89.0 144.0 233.0 377.0 610.0
144.0 233.0 377.0 610.0 987.0
233.0 377.0 610.0 987.0 1597.0
377.0 610.0 987.0 1597.0 2584.0
610.0 987.0 1597.0 2584.0 4181.0
987.0 1597.0 2584.0 4181.0 6765.0
1597.0 2584.0 4181.0 6765.0 10946.0
2584.0 4181.0 6765.0 10946.0 17711.0
4181.0 6765.0 10946.0 17711.0 28657.0
It was 19 data rows
And then, as mentioned above, the program proceeds as usual. See Symbolic regression (using genetic programming) with JGAP
eureqa_cli
without any arguments shows the valid options:
Syntax:
eureca_cli datafile relationship functions fitness_method population_size crossover_prob mutation_prob
where only the data file and relationship must be stated.
...
It then lists all the valid options for functions and fitness methods, see below under Full help notice. Also see Eureqa's API for more information about Eureqa's options. I have not added any function of my own (because this is not possible at the moment) and so use what is available in Eureqa.
eureqa_cli
are:The following parameters are set as the default values from Eureqa, but are not options to the program:
eureqa_cli number_puzzle1.txt "z = f(x,y)"
eureqa_cli fib_38_ix.txt "t1 = f(ix)" "a a+b a-b a*b a/b a^b sqrt(a)"
eureqa_cli boyles_law.txt "PV = f(P,V)"
eureqa_cli p4_1.txt "y = f(x)" "a a+b a-b a*b a/b" "absolute_error"
eureqa_cli two_spirals.txt "z = f(x,y)" "a a+b a-b a*b a/b sin(a) cos(a) exp(a) log(a)"
eureqa_cli fib_38_ix.txt "t1 = f(ix)" "a a+b a-b a*b a/b a^b sqrt(a)" "squared_error" 1000 0.9 0.10
(with populations size 1000, crossover probability 0.9, and mutation probability 0.10)See Experimenting with Eureqa's API for output of similar problems.
eureqa_server
) has been started. See Experimenting with Eureqa's API for some more about this.
eureqa_cli is a command line interface to Eureqa's eureqa_server
Syntax:
eureca_cli datafile relationship functions fitness_method population_size crossover_prob mutation_prob
where only the data file and relationship must be stated
Valid functions (building blocks):
* constant: 1.34
* data variable: x
* addition: x+y
* subtraction: x-y
* multiplication: x*y
* division: x/y
* power: x^y
* exponential: exp(x)
* logarithm: log(x)
* sine: sin(x)
* cosine: cos(x)
* absolute value: abs(x)
* tangent: tan(x)
* two-input arctangent: atan2(x,y)
* minimum of two: min(x,y)
* maximum of two: max(x,y)
* square root: sqrt(x)
* gamma function: gamma(x)
* gaussian function: gauss(x)
* logistic function: logistic(x)
* hill function, power 2: hill2(x)
* step function: step(x)
* sign function: sign(x)
* arcsine: asin(x)
* arccosine: acos(x)
* arctangent: atan(x)
* hyperbolic sine: sinh(x)
* hyperbolic cosine: cosh(x)
* hyperbolic tangent: tanh(x)
* inverse hyperbolic sine: asinh(x)
* inverse hyperbolic cosine: acosh(x)
* inverse hyperbolic tangent: atanh(x)Special building blocks:
* equals: y = f(x)
* search formula: y = f(x)
* derivative: D(y,t) = f(x,y)
Valid fitness methods:
* absolute_error
* squared_error
* root_squared_error
* logarithmic_error
* explog_error
* correlation
* minimize_difference
* akaike_information
* bayesian_information
* maximum_error
* median_error
* implicit_error
* count
For more information about this program, see http://www.hakank.org/eureqa/
Eureqa's homepage: http://ccsl.mae.cornell.edu/eureqa/
Before starting anything Eureqa related, I had to install a newer version of
the Boost library since Eureqa requires version 1.42.0. It did take about half an hour but there where no problems during this step.
The Eureqa API archive must be downloaded, and unpacked.
After these preliminaries, I first tested the simplest example: minimal_client
. Unfortunately it didn't work right from the box on my Mandriva Linux machine, and I had to add two things (bold below) in the Makefile:
minimal_client: minimal_client.o
g++ minimal_client.o \
$(BOOST_LIBRARY_PATH)libboost_thread.a \
$(BOOST_LIBRARY_PATH)libboost_system.a \
$(BOOST_LIBRARY_PATH)libboost_serialization.a \
-o minimal_client -lpthread
The Makefile for other example basic_client
, already has these lines, and worked without any problems.
Before running the program, a running Eureqa standalone server is needed. It can be downloaded from Eureqa's download page, or from the directory ./server
in the installed API archive. The real work is done in the Eureqa server. The client program first tells the conditions of the run to the server (what data, variables, functions, to use), and later on ask the server for new/better solutions which is then presented by the client program.
To start the server:
./eureqa_server &
Now we are ready to start the minimal_client program. This example reads the data file ../data_sets/default_data.txt
(it seems to be the same as the default data set as in the Eureqa GUI).
./minimal_client
Here is the first lines of output from the program. If you have running the GUI version of Eureqa (which is really recommended) you will recognize most of this output.
Data: 100 data points, 3 variables
Options: "y = f(x)", 8 building-block types, Absolute Error fitness
Connection: Connected to 127.0.0.1
Server: xxxxxxxx, Eureqa 0.78 (linux), 2 CPU cores
0 generations, 1864 evaluations
Size: Fitness: Equation:
----- -------- ---------
7 -1.4854 f(x) = -1.50204e-07 + sin(-1.50204e-07 + x)
39 generations, 764432 evaluations
Size: Fitness: Equation:
----- -------- ---------
7 -1.4854 f(x) = -1.50204e-07 + sin(-1.50204e-07 + x)
1 -1.73044 f(x) = x
173 generations, 4.04115e+06 evaluations
304 generations, 7.28129e+06 evaluations
458 generations, 1.04966e+07 evaluations
Size: Fitness: Equation:
----- -------- ---------
7 -1.4854 f(x) = -1.50204e-07 + sin(-1.50204e-07 + x)
1 -1.73044 f(x) = x
5 -1.61304 f(x) = sin(x/x)
...
A small issue: I don't understand why the fitness is negative here; absolute error should always be positive. Maybe it's just a tiny presentation bug, with a misplaced "-"?
The program eureqa_apitest1.cpp is based on the example eureqa_api_1_00_0/examples/minimal_client/minimal_client.cpp
mentioned above. The changes are not big, but some common options has been explicit:
a^b
), and sqrt (sqrt
) was added (the sin and cosine functions was removed).options.building_blocks_.clear();
options.building_blocks_.push_back("a"); // variables
options.building_blocks_.push_back("a+b"); // adds
options.building_blocks_.push_back("a-b"); // subtracts
options.building_blocks_.push_back("a*b"); // multiplies
options.building_blocks_.push_back("a/b"); // divides
options.building_blocks_.push_back("a^b"); // power
options.building_blocks_.push_back("sqrt(a)"); // sqrt
Note that the names in the building blocks don't have to match the variable names in the data file.
t1 = f(ix)
:options.search_relationship_ = "t1 = f(ix)";
options.fitness_metric_ = eureqa::fitness_types::absolute_error;
There are more fitness metrics to use, see Fitness Metric Identifiers.
Well, that's about it.
The program reads the file fib_38_ix.txt consisting of the first 38 Fibonacci numbers with the index (1..38). Note: In this problem we just use the first two variables in the file ix
, and t1
. The instances for 39..50 has been commented out to make it simpler.
The object is to find the closed form of the Fibonacci numbers, which is usually stated as:
(phi^n - (1-phi)^n)/sqrt(5)
where phi = (1+sqrt(5))/2 = ~ 1.61803 (golden ratio), and sqrt(5) ~ 2.2361.
See Fibonacci_number#Closed_form_expression (Wikipedia) for more about this.
Here is one solution (the 6 best solutions) from running the program a couple of minutes. Since the program don't have any stop criteria it will run forever if not manually stopped.
Size: Fitness: Equation:
----- -------- ---------
7 -104.178 f(ix) = 1.61808^(ix - 1.67436)
9 -103.999 f(ix) = 1.61808^(ix - 1.67436) + 1.61808
11 -101.371 f(ix) = 1.61808^(ix - 1.67436) + ix - 1.67436
5 -79382.2 f(ix) = 1.58323^ix
1 -2.55834e+06 f(ix) = ix
3 -2.53729e+06 f(ix) = ix/0.00018853
The first solution in the list has an fitness error of about 104: 1.61808^(ix - 1.67436)
.
Note the constant 1.61808 which is quite close to phi (1.61803).
When rounded, this program (solution) gives the following results for ix = 1..38. It is correct for the first 15 numbers (1..15), but will then deviate.
1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 986(987), 1596(1597), 2583(2584), 4179(4181), 6762(6765), 10941(10946), 17703(...), 28646, 46351, 75000, 121355, 196362, 317730, 514113, 831876, 1346042, 2178003, 3524183, 5702410, 9226955, 14929952, 24157857, 39089345
The correct sequence is:
1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025, 121393, 196418, 317811, 514229, 832040, 1346269, 2178309, 3524578, 5702887,9227465, 14930352, 24157817, 39088169
Here is the deviation from the correct sequence:
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, -1, -2, -3, -5, -8, -11, -17, -25, -38, -56, -81, -116, -164, -227, -306, -395, -477, -510, -400, 40, 1176
Maybe this is a wrong track, but it is nice to see the solutions evolve, which is one advantage of symbolic regression, and genetic programming in general.
Some useful pages:
Also, see my Eureqa page.
Other updated pages: Also, see: My Eureqa page and Eureqa: Equation discovery with genetic programming. ]]>
- reduced lag that servers report new solutions
- projects now save the smoothing preprocessing
- improved the ordering/display of the best solutions list
- improved the seeding previous solution method
- improved the AIC and BIC fitness metrics
- added ability to right-click a plot and copy its data to the clipboard
- added ability to start a search from the command line
- added ability to chose the training/validation data split in the advanced options
- added check to normalize data values with large offset or scale
- fixed bug when loading projects that could clear results
- fixed bug where resuming a search could fail to keep the previous results
- fixed bug where seeded equations were not recognized
- fixed bug where the fitness metric weighting was ignored
- fixed several minor user interface annoyances
- made compatible with the new open-source API
If we assume that: 2 + 3 = 10 7 + 2 = 63 6 + 5 = 66 8 + 4 = 96 How much is? 9 + 7 = ????This can be somewhat tricky to solve this by hand (or head), or maybe we simply are too lazy to solve it by hand. Using symbolic regression is easier (although probably not that fun): Just create a configuration file like the one below. In this problem there is a specific unknown instance that we want to solved for, so we can write an instance with the
?
(question mark) in the place of the (unknown) output.
presentation: Puzzle num_input_variables: 2 variable_names: x y z functions: Multiply,Divide,Add,Subtract terminal_range: -10 10 terminal_wholenumbers: true population_size: 100 num_evolutions: 100 show_similiar: true data 2 3 10 7 2 63 6 5 66 8 4 96 # the unknown instance 9 7 ?We use the four arithmetic function (*,/, +,-), coded as
Multiply,Divide,Add,Subtract
and have a small population size (100) and just 100 generations. The other options is explained more below.
It was 4 data rows It was 1 data rows in the user defined data set Presentation: Puzzle output_variable: z (index: 2) input variable: x input variable: y function1: &1 * &2 function1: / function1: &1 + &2 function1: &1 - &2 function1: 10.0 Creating initial population Evolving generation 0/100(time from start: 0,05s) Best solution fitness: 35.0 Best solution: x + ((y - x) + (10.0 * x)) Depth of chrom: 3. Number of functions/terminals: 9 (4 functions, 5 terminals) Correlation coefficient: 0.979073833348314 Evolving generation 3/100(time from start: 0,15s) Best solution fitness: 31.0 Best solution: (10.0 * x) + ((x * y) - (8.0 + y)) Depth of chrom: 3. Number of functions/terminals: 11 (5 functions, 6 terminals) Correlation coefficient: 0.9945949940306454 Evolving generation 11/100(time from start: 0,39s) Best solution fitness: 26.0 Best solution: ((10.0 * x) + (x + (-7.0 * y))) + (x * y) Depth of chrom: 4. Number of functions/terminals: 13 (6 functions, 7 terminals) Correlation coefficient: 0.969830602937701 Evolving generation 14/100(time from start: 0,50s) Best solution fitness: 22.0 Best solution: (x * x) + (4.0 * x) Depth of chrom: 2. Number of functions/terminals: 7 (3 functions, 4 terminals) Correlation coefficient: 0.9726783536388712 Evolving generation 17/100(time from start: 0,58s) Best solution fitness: 0.0 Best solution: (x * y) + (x * x) Depth of chrom: 2. Number of functions/terminals: 7 (3 functions, 4 terminals) Correlation coefficient: 1.0 All time best (from generation 17) Evolving generation 101/100(time from start: 1,71s) Best solution fitness: 0.0 Best solution: (x * y) + (x * x) Depth of chrom: 2. Number of functions/terminals: 7 (3 functions, 4 terminals) Correlation coefficient: 1.0 Total time 1,71s All solutions with the best fitness (0.0): (x * x) + (x * y) (26) (x * x) + (y * x) (2) (x + y) * x (2) (x * y) + (x * x) (98) ((x * y) + (x * x)) * ((2.0 - 2.0) + (2.0 / 2.0)) (1) ((x / (y / x)) + x) * y (1) It was 6 different solutions with fitness 0.0 Testing the fittest program with user defined test data: 9.0 7.0 Result: 144.0Since it is a genetic programming system, the first generation - generation 0 - is a completely random population of programs. Note that the configuration states very few limits in size, and number of population, and there is really no limits of the structure (see below).
x + ((y - x) + (10.0 * x))
has a quite bad fitness measure: 35; rather a long way from the goal of fitness 0 (the perfect score). The fitness is calculated by the sum of the differences between the program's output for each data point and the real data point. (Note: One of my TODO:s is to have more alternatives of error measures.)
(x * y) + (x * x)
with a fitness (error) of 0.0, and a correlation coefficient of 1.0 (perfect fit between the input variable and output variable). After the 100 generations, the best solution is printed again with the total time (about 1.7 seconds).
show_similar: true
was set, all solutions with the same fitness score as the best is also shown:
(x * x) + (x * y) (26) (x * x) + (y * x) (2) (x + y) * x (2) (x * y) + (x * x) (98) ((x * y) + (x * x)) * ((2.0 - 2.0) + (2.0 / 2.0)) (1) ((x / (y / x)) + x) * y (1)Some of these solutions are just permutation of the best solution, i.e. the places of the variable names or expressions are changed. Other are not very interesting either, or, like the 5th one, is not at all useful in this example. The numbers in parenthesis after the solution is the number of occurrences of the solution. Luckily the 5th solution was generated only once.
functions
to use. In this problem it would suffice with just the functions Add
, and Multiply
, but it is - of course - a special case.
1,1,2,3 1,2,3,5 2,3,5,8 ...After doing this transformation, we proceed in the same manner as the number problem above, i.e. create a configuration file (fib1.conf) with the options and the data:
# Fibonacci with 3 variables num_input_variables: 3 variable_names: F1 F2 F3 F4 functions: Multiply,Divide,Add,Subtract terminal_range: -10 10 max_init_depth: 4 population_size: 20 max_crossover_depth: 8 num_evolutions: 100 max_nodes: 21 show_population: false stop_criteria_fitness: 0 data 1,1,2,3 1,2,3,5 2,3,5,8 3,5,8,13 5,8,13,21 8,13,21,34 13,21,34,55 21,34,55,89 34,55,89,144 55,89,144,233 89,144,233,377 144,233,377,610 233,377,610,987 377,610,987,1597 610,987,1597,2584 987,1597,2584,4181 1597,2584,4181,6765 2584,4181,6765,10946 4181,6765,10946,17711 6765,10946,17711,28657 10946,17711,28657,46368And here is a complete run (slightly edited) where the correct solution is found in generation 10:
F3+F2
. Also, here we added the option stop_criteria_fitness: 0
which makes the program to exit after the criteria have been reached.
It was 21 data rows Presentation: This is the Fibonacci series output_variable: F4 (index: 3) input variable: F1 input variable: F2 input variable: F3 function1: &1 * &2 function1: / function1: &1 + &2 function1: &1 - &2 function1: 4.0 Evolving generation 0/100(time from start: 0,01s) Best solution fitness: 17619.64 Best solution: ((F3 + F1) + 7.0) + (F1 / F3) Depth of chrom: 3. Number of functions+terminals: 9 (4 functions, 5 terminals) Correlation coefficient: 0.9999999998872628 Evolving generation 2/100(time from start: 0,06s) Best solution fitness: 7050.70 Best solution: ((F2 + F2) * (F1 + F1)) / (((F2 + 7.0) + F2) + (F1 - F3)) Depth of chrom: 4. Number of functions+terminals: 17 (8 functions, 9 terminals) Correlation coefficient: 0.9999999122259431 Evolving generation 9/100(time from start: 0,19s) Best solution fitness: 6749.0 Best solution: (F2 / F2) + (4.0 * F1) Depth of chrom: 2. Number of functions+terminals: 7 (3 functions, 4 terminals) Correlation coefficient: 0.99999999956117 Evolving generation 10/100(time from start: 0,21s) Best solution fitness: 0.0 Best solution: F3 + F2 Depth of chrom: 1. Number of functions+terminals: 3 (1 functions, 2 terminals) Correlation coefficient: 1.0 Fitness stopping criteria (0.0) reached with fitness 0.0 at generation 10 All time best (from generation 10) Evolving generation 10/100(time from start: 0,21s) Best solution fitness: 0.0 Best solution: F3 + F2 Depth of chrom: 1. Number of functions+terminals: 3 (1 functions, 2 terminals) Correlation coefficient: 1.0 Total time 0,21s All solutions with the best fitness (0.0): F3 + F2 (1) It was 1 different solutions with fitness 0.0A variant where we remove the stopping criteria, and instead use
show_similar: true
results in the following similar
solutions:
All solutions with the best fitness (0.0): (F2 / 1.0) + F3 (1) (3.0 / 3.0) * (F3 + F2) (1) F3 + F2 (406) F2 + F3 (7) (-2.0 / -2.0) * (F3 + F2) (1) It was 5 different solutions with fitness 0.0TODO: I have some plans to automatically generate this kind of lagged time serie for a sequence, and hope it will be finished soon. This is not rocket science but it can be tricky in the details.
functions: Multiply,Divide,Add,Subtract,IfLessThanOrEqualDThe function
IfLessThanOrEqualD
may be worth a comment: I am currently reading John Koza' first Genetic Programming book Genetic Programming: v. 1 On the Programming of Computers by Means of Natural Selection (ISBN: 9780262111706). He mentions the IFLTE
function (if a <= b then c else d
), which I implemented - by cloning and mutation of my IfElseD.java function (see, programmers also use these operators :). As Koza writes in page 365 about the function, it can be used instead of the following function:
==
) in this list, but I'm not so convinced that IFLTE can properly replace this.
validation_pct
). See below for a discussion of this.
population_size: 100 num_evolutions: 100Other configuration options in the file:
input_variables: 4
variable_names: sl sw pl pw class
class
) is the output variable. Note that it it possible to change what variable to use as output variable, with the option output_variable
and state the (0-based) position in the variable list. Here we could have the following option: output_variable: 4
.
terminal_range: -20 20
Terminal
(ephemeral terminal), i.e. the numbers used in the solution expression.
terminal_wholenumbers: false
max_nodes: 31
mutation_prob: 0.01
crossover_prob: 0.9
show_progression: true
show_results: true
result_precision: 5
show_results
: The precision of the results.
hits_criteria: 0.5
Round
or Floor
?)
It was 150 data rows Presentation: Iris output_variable: class (index: 4) input variable: sl input variable: sw input variable: pl input variable: pw function1: &1 * &2 function1: &1 + &2 function1: &1 - &2 function1: / function1: if(&1 <= &2) then (&3) else(&4) function1: 16.624795863695333 Evolving generation 0/100(time from start: 0,16s) Best solution fitness: 64.0 Best solution: ((sw * pl) - (pw * pw)) / ((sl - pl) * (sl - sw)) Depth of chrom: 3. Number of functions+terminals: 15 (7 functions, 8 terminals) Correlation coefficient: 0.6621167550765015 Number of hits (<= 0.5): 86 (of 150 = 0,57) Results for this program: ... (22) 1.0: 0,98889 (diff: 0,01111) (23) 1.0: 0,87582 (diff: 0,12418) (24) 1.0: 1,58128 (diff: -0,58128) > 0.5! (25) 1.0: 0,70000 (diff: 0,30000) ... total diff: 111.08622273795162 (no abs diff: -42.32778738529426 #hits: 86 (of 150) Evolving generation 1/100(time from start: 0,29s) Best solution fitness: 48.0 Best solution: ((sw + pl) * sl) / (18.98720292178895 + pl) Depth of chrom: 3. Number of functions+terminals: 9 (4 functions, 5 terminals) Correlation coefficient: 0.8492617928834261 Number of hits (<= 0.5): 102 (of 150 = 0,68) Results for this program: ... total diff: 60.411404633990145 (no abs diff: 35.03754935483205 #hits: 102 (of 150) Evolving generation 4/100(time from start: 0,61s) Best solution fitness: 7.0 Best solution: (19.035483741865328 / 19.035483741865328) + pw Depth of chrom: 2. Number of functions+terminals: 5 (2 functions, 3 terminals) Correlation coefficient: 0.9564638238016164 Number of hits (<= 0.5): 143 (of 150 = 0,95) ... total diff: 39.80000000000001 (no abs diff: -29.800000000000004 #hits: 143 (of 150) Evolving generation 16/100(time from start: 0,94s) Best solution fitness: 6.0 Best solution: (((19.035483741865328 / 19.035483741865328) + pw) / (sw / sw)) - ((if(pw <= pl) then (sw) else(pl)) / (sw + 14.495973076477487)) Depth of chrom: 4. Number of functions+terminals: 19 (8 functions, 11 terminals) Correlation coefficient: 0.9582679236481598 Number of hits (<= 0.5): 144 (of 150 = 0,96) Results for this program: ... total diff: 26.831398327892167 (no abs diff: -3.7720516041580767 #hits: 144 (of 150) Evolving generation 39/100(time from start: 1,54s) Best solution fitness: 5.0 Best solution: (((19.035483741865328 / 19.035483741865328) + pw) / (sw / sw)) - ((if(pw <= pl) then (sw) else(pl)) / (((14.495973076477487 - (sl + sl)) / sw) + 14.495973076477487)) Depth of chrom: 6. Number of functions+terminals: 25 (11 functions, 14 terminals) Correlation coefficient: 0.958786524094953 Number of hits (<= 0.5): 145 (of 150 = 0,97) Results for this program: (0) 1.0: 0,97740 (diff: 0,02260) (1) 1.0: 1,01322 (diff: -0,01322) (2) 1.0: 1,00110 (diff: -0,00110) (3) 1.0: 1,00869 (diff: -0,00869) ... (69) 2.0: 1,94192 (diff: 0,05808) (70) 2.0: 2,59137 (diff: -0,59137) > 0.5! (71) 2.0: 2,11718 (diff: -0,11718) ... (118) 3.0: 3,11623 (diff: -0,11623) (119) 3.0: 2,35925 (diff: 0,64075) > 0.5! (120) 3.0: 3,08251 (diff: -0,08251) ... (128) 3.0: 2,91459 (diff: 0,08541) (129) 3.0: 2,39350 (diff: 0,60650) > 0.5! (130) 3.0: 2,70539 (diff: 0,29461) (131) 3.0: 2,73150 (diff: 0,26850) (132) 3.0: 3,01459 (diff: -0,01459) (133) 3.0: 2,31546 (diff: 0,68454) > 0.5! (134) 3.0: 2,23094 (diff: 0,76906) > 0.5! (135) 3.0: 3,08865 (diff: -0,08865) (136) 3.0: 3,17414 (diff: -0,17414) ... (148) 3.0: 3,07502 (diff: -0,07502) (149) 3.0: 2,60513 (diff: 0,39487) total diff: 26.224671119186706 (no abs diff: -0.04627940721407109 #hits: 145 (of 150) ... Total time 2,91sThe solutions of the best fit program (generation 39) is this.
Solution:(((19.035483741865328 / 19.035483741865328) + pw) / (sw / sw)) - ((if(pw <= pl) then (sw) else(pl)) / (((14.495973076477487 - (sl + sl)) / sw) + 14.495973076477487))which is kind of funny looking. Here are some comments:
19.035483741865328 / 19.035483741865328
is 1. The program don't simplify such expressions, but this would be nice to have. (As I understand it, Eureqa has some kind of simplification process.)
if then else
is used as an expression with the returning value is used directly in the solution. Something we don't see here but may happen with other settings is that the logical operators (or expressions using these operators) returns 1.0 (true) or 0.0 (false) and these values is used directly in the calculations as any other expression.
Best solution fitness: 5.0 ... Number of hits (<= 0.5): 145 (of 150 = 0,97)We defined fitness as the number of differences <= 0.5, and we see that there are 5 wrongly classified instances (150-144=6), with a hit rate of 97%. Not too bad, but not very good either.
validation_pct
does exactly that. The value of the option is the percentage of the data that will be placed in the validation set. Or more exactly: it is the probability that a specific fitness case will be in the validation set.
MathProblem.java
but extended with a lot of bells & whistles.
javac -Xlint:unchecked -classpath "jgap/jgap.jar:jgap/lib/log4j.jar:jgap/lib/xstream-1.2.2.jar:jgap/lib/commons-lang-2.1.jar:$CLASSPATH" SymbolicRegression.javaand run with:
java -server -Xmx1024m -Xss2M -classpath "jgap/jgap.jar:jgap/lib/log4j.jar:jgap/lib/xstream-1.2.2.jar:jgap/lib/commons-lang-2.1.jar:$CLASSPATH" SymbolicRegression [config file]Here is my log4j.properties file.
IfElse
etc). However, for the ADF functions (defined by setting adf_arity
to > 0, but see below) more functions is supported. Please note that some of these functions are experimental (or very experimental) and the result may not make sense in this context.
D
.
Multiply
(double)
Multiply3
(double)
Add
(double)
Add3
(double)
Add4
(double)
Divide
(double)
Subtract
(double)
Sine
(double)
ArcSine
(double)
Tangent
(double)
ArcTangent
(double)
Cosine
(double)
ArcCosine
(double)
Exp
(double)
Log
(double)
Abs
(double)
Pow
(double)
Round
(double), compare with my RoundD
Ceil
(double)
Floor
(double)
Modulo
(double), implements Java's %
operator for double. See ModuloD for a variant
Max
(double)
Min
(double)
LesserThan
(boolean)
GreaterThan
(boolean)
If
(boolean)
IfElse
(boolean), cf the IfElseD
IfDyn
(boolean)
Loop
(boolean), cf the experimental LoopD
Equals
(boolean), cf EqualsD
ForXLoop
(boolean)
ForLoop
(boolean)
Increment
(boolean)
Pop
(boolean)
Push
(boolean)
And
(boolean), cf the double variant AndD
Or
(boolean), cf the double variant OrD
Xor
(boolean), cf the double variant XorD
Not
(boolean), cf the double variant NotD
SubProgram
(boolean, experimental)
Tupel
(boolean, experimental)
double
as input and returns a double
(0.0d or 1.0d). Some of these functions are tested in odd_parity.conf.
And
Different
Equals
GreaterThan
GreaterThanOrEqual
IfElse
If Less Than Or Equal Then .. Else
(if a < b then c else d). Inspired by Koza's function IFLTE
LesserThan
LesserThanOrEqual
Not
Or
Xor
double
as input and output. First the input is converted to integers and then an integer modulo is done which is returned as a double. (The standard %
operator on double is not what I wanted.) This is tested in isbn_test.conf.
mod_replace
. This should be considered highly experimental.
Divide
where the division is done by first converting to Integer
then doing an integer division. Also, if the divisor is 0 (zero), the result is 1 (i.e. protected).
Divide
the result is 1 (i.e. protected) if the divisor is 0 (zero), else standard double division.
Cube
(x^3)
Gamma
Gaussian
Hill
Logistic
RoundD
, my version of round()
Sigmoid
Sign
Sqrt
Square
(x^2)
Step
Identity function
Loop
for double
. Highly experimental.
Sqrt
function (which is here).
org.jgap.gp.function.Log
from the JGAP distribution.
toString()
: "sqrt &1". If a function has more arguments, the different arguments are presented as "&1", "&2", "&3", etc. E.g. the ModuloD
function has the following presentation "&1 mod &2", but it can be "mod(&1,&2)" or even "(mod &1 &2)" depending on the style of output. (Hmm, maybe there should be an option in all functions how to represent the names, e.g. mathematical, Java version, Lisp version. I have to think about this more.)
getName()
. We use Sqrt
.
exectute_double
. Since double is the only type that is supported right now, it suffices to change for exectute_double
. However, in some of the files, there are also support for other types, e.g. exectute_float
, exectute_int
, etc.
execute_object
: here we use execute_sqrt
as the call name. This same name is to be used in Compatible
.
makeCommands
method in SymbolicRegression.java. Recompile.
Exp
in the function list.
output_variable
to 0 for the reverse conversion (C -> F).
#
, %
: Line comments; lines that start with the characters "#" or "%" will be ignored.
presentation
: A text which is shown first in the run.
num_input_variables
: Number of input variables in the data set.
output_variable
: The index (0-based) of the output variable. Default is the last variable.
variable_names
: The name of the variables, in order. Default is "V0", "V1", etc
data
: Starts the data
section, where each row is presented per line. The attributes may be separated by "," or some space. Decimal point is a .
(dot).?
(question mark) in the position of the output variable, then it is considered a "user defined test" and the fittest program will be tested against this data last in the run.
terminal_range
: The range for the Terminal
as lower upper
. Note: Only one Terminal is used.
terminal_wholenumbers
: If the Terminal
should use wholenumbers or not (boolean)
constant
: Define a Constant
with this value
functions
: Define the functions, with the same name as in JGAP (or own defined functions).
adf_arity
: If > 0 then ADF is used. This is somewhat experimental as I am still try to understand how ADF:s works.
adf_function
: The functions used for ADF.
adf_type
: Either double or boolean. If set to boolean, we can use the boolean and logical operators.
max_init_depth
: JGAP parameter maxInitDepth
min_init_depth
: JGAP parameter minInitDepth
program_creation_max_tries
: JGAP parameter programCreationMaxTries
population_size
: JGAP parameter populationSize
max_crossover_depth
: JGAP parameter maxCrossoverDepth
function_prob
: JGAP parameter functionProb
reproduction_prob
: JGAP parameter reproductionProb
mutation_prob
: JGAP parameter mutationProb
crossover_prob
: JGAP parameter crossoverProb
dynamize_arity_prob
: JGAP parameter dynamizeArityProb
no_command_gene_cloning
: JGAP parameter no_command_gene_cloning
use_program_cache
: JGAP parameter use_program_cache
new_chroms_percent
: JGAP parameter newChromsPercent
num_evolutions
: JGAP parameter numEvolution
tournament_selector_size
: JGAP parameter tournamentSelectorSize
max_nodes
: JGAP parameter maxNodes
scale_error
: Sometimes the data values are very small which gives small fitness values (i.e. errors), making it hard to get any progress. Setting this parameter will multiply the errors by this value.
stop_criteria_fitness
: If set (>= 0) then the program will run "forever" (instead of num_evolution
) until fitness is less or equal to the value.
show_population
: This shows the whole population in each generation. Mainly for debugging purposes.
show_similar
: Shows all the solutions (programs) with the same fitness value as the best solution.
show_progression
: boolean. If true then the generation number is shown for all generations when nothing is happening (i.e. no gain in fitness).
sample_pct
: (float) Takes a (sample) percentage of the data set if > 0.0.
validation_pct
: Withheld a percentage of the test cases for a validation set. This fitness of this validation set is shown.
show_all_generations
: Show info of all generations, not just when fitness is changed.
hits_criteria
: Criteria of a hit: if the difference is <= this value, it is considered a hit. The number of non-hits is then used as a fitness measure instead of the sum of errors. Setting this function also shows the number of programs which is <= this value.
mod_replace
: Setting the replacement value of 0 (zero) for the ModuloIntD
function (see above).
showResults
: boolean. If set then all the fitness cases is shown with the output of the fitted program, with difference to the correct values.
resultPrecision
: the precision in the output used in showResult
, default 5
ignore_variables
: (TBW) It would be nice to be able to ignore some variables in the data set. But this is yet to be written.
return_type
: (TWB) This should be the type of the "main" return value. Note: it is now hard coded in the program as double/DoubleClass
.
adf_arity: 0 adf_type: boolean adf_functions: IfElse,GreaterThan,LesserThanOne of the problems I have with ADF is that many of the interesting ADF functions, e.g.
Loop
, ForLoop
, ForXLoop
, requires a different representation that SymbolicRegression supports. In spite of this, it can be interesting to experiment with the existing support for ADF.
Explanations of the ADF related options:
adf_arity
: When > 0 ADF is activated and all the ADF functions has this arity (number of arguments)
adf_type
: return type for the ADF functions. Can be either boolean
or double
. In order to work, the ADF function must support the stated type (and it is here I have some problems).
adf_function
: a list if functions to be used as ADF.
If you have read any of my other blogs, you may ask: why yet another blog? Well, the other two blogs is not enough for my recent plans. One of the blogs, hakank.blogg, is in Swedish and it's probably not a good thing to blend different languages in one blog; the other, My Constraint Programming Blog, is targeted to a very specific topic: constraint programming. Since I want to be able to write about other things in English, it seems to be a good idea to start a new blog.
Something about the name Arrays in Flux. One of the first names that came to mind was Panta Rei (meaning "everything flows"). I like the philosophical idea that everything is in a steady flow of changes, with a famous saying from Heraclit: You can not step twice into the same river. Related to programming it seems to be a good description of a lot of programs: the code changes all the time, either by adding new functions or changing old, and by correcting bugs.
Well the name "Panta Rei" really don't have the associations I wanted. After some playing with that phrase, a sound-alike come up: "Pant Array" which was immediately discarded. However, the "Array" kind of stuck, since it's a nice reference to programming. Then it was not very long for the final version: Arrays in Flux. (One alternative was to have a sub title: pantA reI, where the upper case "A" and "I" in should allude one of my big interests: AI. I reckon that was too far fetched.)
Also, it helps that the name is right now (almost) a unique search phrase in one of the search engines (namely Google).
What will be published here? Everything is possible, but there probably will be in some of these areas (with the Misc as a nice catch all category):
It will not be updated very often, so it's safe to subscribe to it :-).
Welcome to Arrays in Flux!
Hakan Kjellerstrand (hakank@bonetmail.com),
http://www.hakank.org/ .