<?xml version="1.0" encoding="iso-8859-1"?>
<rss version="2.0">
   <channel>
      <title>Arrays in Flux</title>
      <link>http://www.hakank.org/arrays_in_flux/</link>
      <description>Author: Hakan Kjellerstrand (hakank@gmail.com). 
This blog contains my English writings about programming, machine learning/data mining, AI, and other things. Also see my other blogs: hakank.blogg (Swedish) and My Constraint Programming Blog, or some other page on my site http://www.hakank.org/. </description>
      <language>en</language>
      <copyright>Copyright 2012</copyright>
      <lastBuildDate>Fri, 30 Nov 2012 21:04:37 +0100</lastBuildDate>
      <generator>http://www.sixapart.com/movabletype/?v=3.2</generator>
      <docs>http://blogs.law.harvard.edu/tech/rss</docs> 

            <item>
         <title>Changing email address</title>
         <description><![CDATA[<p>Just wanted to let you know that I have now completely switched my email address to <a href="mailto:hakank@gmail.com">hakank@gmail.com</a>, in case you have problems reaching me via the older bonetmail address.</p>]]></description>
         <link>http://www.hakank.org/arrays_in_flux/2012/11/changing_email_address_1.html</link>
         <guid>http://www.hakank.org/arrays_in_flux/2012/11/changing_email_address_1.html</guid>
         <category>Misc</category>
         <pubDate>Fri, 30 Nov 2012 21:04:37 +0100</pubDate>
      </item>
            <item>
         <title>SETL - The SET Programming Language</title>
         <description><![CDATA[For the last few weeks I have played with the programming language <a href="http://setl.org/">SETL</a> (<b>Set L</b>anguage). I like learning these kinds of "paradigmatic" programming languages even if they are not very much in use anymore. There are almost always new things to learn from them, or they make one see well-known things in a new light.
<br><br>
SETL was created in the late 1960s and can be considered one of the early very high level languages (VHLL), using sets as the guiding principle (much like a mathematical formulation) together with a Pascal-like syntax. Some trivia:
<ul>
 <li> The first validated Ada compiler was written in SETL.
 <li> ABC, one of the inspirations for Python, was inspired by SETL.
 <li> SETL is credited as the first programming language that supported list (set/array) comprehensions, a very handy concept. Haskell's list comprehensions were inspired by SETL.
</ul>

For more about the history of SETL, see <a href="http://cs.nyu.edu/~bacon/phd-thesis/diss/node7.html">A Brief History of SETL</a> in David Bacon's dissertation "SETL for Internet Data Processing". David Bacon is the person behind GNU SETL. In the section <a href="http://cs.nyu.edu/~bacon/phd-thesis/diss/node52.html">Comparison with Other Languages</a> Bacon compares SETL with some other languages (Perl, Icon, Functional Languages, Python, Rexx, and Java).
<br><br>
I like SETL, much for its handling of sets and tuples (arrays), which makes prototyping of some kinds of problems easy, especially those with a mathematical bent. However, the advantage SETL once had as a VHLL, before the "agile" languages (e.g. Perl, Python, Ruby, Haskell) arrived, is not so big anymore. (I should probably mention that I'm at least acquainted with these languages.)
<br><br>
In case I forgot it: See my <a href="http://www.hakank.org/setl/">SETL page</a> with links and my SETL programs (and maybe some not mentioned here).

<h3>Different versions of SETL</h3>
There are some different versions (or offspring) of SETL:
 <ul>
   <li> <a href="http://setl.org/">GNU SETL</a>. This is the version I use here, and it seems to be the only publicly available and working version.
   <li> SETL2. Documented as a draft at <a href="http://www.settheory.com/">The Restored Eye</a> (settheory.com).
   <li> ISETL. See <a href="http://raider.muc.edu/~kirchmjf/isetlj/isetlj.html">ISETLJ</a> (ISETL in Java), which is to be released in May/June this year. ISETL has been used in teaching mathematics, e.g. abstract algebra.
 </ul>


<h2>Examples of SETL</h2>
I will not go through all features of SETL here, but just show some examples of what I have done and what I like about the language. See <a href="http://www.settheory.com/">Programming in SETL. (Draft in Progress)</a> (at settheory.com) for an in-depth tutorial of the language (it covers SETL2, but much of it also applies to SETL), or Robert B. K. Dewar's <a href="http://www.setl-lang.org/docs/setlprog.pdf">The SETL Programming Language</a> (PDF) for an overview, or <a href="http://www.linuxjournal.com/article/6805?page=0,2">An Invitation to SETL</a>.

<br><br>
All the examples below work with GNU SETL. Many of the smaller examples are shown as command-line one-liners, since I often test different features this way. And as you may notice, quite a few of the examples are not very unlike programs written in Python or Haskell.
<br><br>
The mandatory <b>prime generation</b> program:
<pre>
 primes2 := {p in {2..10000} | forall i in {2..fix(sqrt(p))} | p mod i /= 0};
 print(primes2);
</pre>

One feature I like (and use a lot) is testing things from the command line:
<pre>
$ setl 'time0:=time();primes:= {p in {2..100000} | forall i in {2..fix(sqrt(p))} | p mod i /= 0}; 
   print("Num primes:",#primes);print("It took", (time()-time0)/1000,"seconds");'
Num primes: 9592
It took 2.222 seconds
</pre>

A variant of <b>prime number generation</b> using <code>not exists</code> instead of <code>forall</code>:
<pre>
$ setl 'print({n in {2..100} | (not (exists m in{2..n - 1} | n mod m = 0))}); '
{2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97}
</pre>

Still another variant, using the set difference between <code>{2..n}</code> and the composite numbers:
<pre>
$ setl 'n := 150; print({2..n} - {x : x in {2..n} | exists y in {2..fix(sqrt(x))} | x mod y = 0});'
{2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97 101 103 107 109 113 127 131 137 139 149}
</pre>
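For comparison, here is a rough Python sketch of the <code>forall</code> formulation (it is not part of the original SETL programs, and the name <code>primes_upto</code> is made up for illustration). The set former with <code>forall</code> maps fairly directly onto a comprehension with <code>all()</code>:
<pre>
from math import isqrt

def primes_upto(n):
    """Primes in 2..n by trial division, in the spirit of the forall version."""
    return [p for p in range(2, n + 1)
            if all(p % i != 0 for i in range(2, isqrt(p) + 1))]

print(primes_upto(50))
</pre>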
Here are some other examples of set/array comprehensions.<br><br>
<b>Fibonacci sequence</b><br>
As a one-liner:
<pre>
$ setl 'f:= [1,1]; r := [f(i) := f(i-1)+f(i-2) : i in [3..10]];  print(f);'
</pre>

<b>Pythagorean triplets</b> as a "one-liner" (not very fast for say [1..300]).
<pre>
$ setl 'print({[a, b, h]: b in {1..30}, a in {1..b - 1} | 	
		(exists h in {2..a + b} | (a*a + b*b = h*h)) and 
		(not (exists d in {2..b - 1} | ((b mod d) = 0 and (a mod d) = 0)))}); '
{[3 4 5] [5 12 13] [7 24 25] [8 15 17] [20 21 29]}
</pre>

Creation of a <b>power set</b> (all subsets of a set), with the intermediate values printed:
<pre>
$ setl 'a := {1,2,3}; p := { {}}; (for x in A, y in P) p with:= Y with x; print(p); end; print(p);'
{{} {1}}
{{} {1} {2}}
{{} {1} {2} {1 2}}
{{} {1} {2} {3} {1 2}}
{{} {1} {2} {3} {1 2} {1 3}}
{{} {1} {2} {3} {1 2} {1 3} {2 3}}
{{} {1} {2} {3} {1 2} {1 3} {2 3} {1 2 3}}
{{} {1} {2} {3} {1 2} {1 3} {2 3} {1 2 3}}
</pre>
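The same incremental idea can be sketched in Python (an illustration only; <code>power_set</code> is a hypothetical helper, not something from the SETL code). Every subset found so far is extended with each new element:
<pre>
def power_set(items):
    """Build all subsets by extending every subset found so far with each new item."""
    subsets = [frozenset()]
    for x in items:
        # the comprehension is fully evaluated before the list is extended
        subsets += [s | {x} for s in subsets]
    return subsets

for s in power_set([1, 2, 3]):
    print(sorted(s))
</pre>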

<b>Collect</b> values from a tuple to a map (hash table).<br>
A map is represented as a set of tuples of <code>[key, value]</code>. 
<br>
First a slow solution:
<pre>
a := [1,1,2,2,3,3,3,4,4,4,4];
m:={ [i, #[j : j in [1..#a] | a(j) = i ]] :  i in { i : i in a}};
</pre>
Then a faster version:
<pre>
$ setl 'a := [1,1,2,2,3,3,3,4,4,4,4];  m:= {}; for i in a loop m(i) +:= 1; end loop; print(m);'
{[1 2] [2 2] [3 3] [4 4]}
</pre>
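For comparison, a rough Python sketch of the faster counting idea (not part of the original SETL code; the name <code>collect</code> is made up for illustration). Like the SETL loop, it makes a single pass over the tuple and bumps one counter per key:
<pre>
from collections import Counter

def collect(values):
    """Count occurrences of each value in a single pass (hypothetical helper)."""
    counts = Counter()
    for v in values:
        counts[v] += 1  # same idea as m(i) +:= 1 in the SETL loop
    return sorted(counts.items())

print(collect([1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4]))
</pre>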

<b>Index and value of a map</b><br>
The construct <code>for x = s(i) in ...</code> in a loop over a map (hash table) gives both the index (i) and the value (x). Here we also see how to represent ranges with an increment other than 1 (much like Haskell).
<pre>
setl 's := {[i,i**2] : i in [1,3..15]}; for x = s(i) loop print(i,x); end loop;'
1 1
3 9
5 25
7 49
9 81
11 121
13 169
15 225
</pre>

<b>Multi-map</b> (<code>m{value}</code>)<br>
SETL has a special syntax for multi-maps, i.e. maps where a key has more than one value: use "{}" instead of the parentheses "()" for access. Here the key <i>1</i> has two values (<i>a</i> and <i>c</i>). Using "single-map" access on such a key (<code>a(1)</code>) gives <code>OM</code>, the special undefined value (represented as "*" in GNU SETL).
<pre>
setl 'a := {[1,["a"]], [2, ["b"]], [1, ["c"]]}; print(a);print(a(2));print(a(1));print(a{1});'
{[1 [a]] [1 [c]] [2 [b]]}
[b]
*
{[a] [c]}
</pre>
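To make the two kinds of access concrete, here is a small Python sketch that treats a map as a plain set of (key, value) pairs. It is only a model of the behaviour shown above, under the assumption that single-valued access yields "undefined" whenever the key does not have exactly one value; the helper names <code>single</code> and <code>multi</code> are made up, and the values are simplified to plain strings:
<pre>
def single(m, key):
    """Mimic m(key): the value if key has exactly one value, else None (like OM)."""
    values = [v for (k, v) in m if k == key]
    return values[0] if len(values) == 1 else None

def multi(m, key):
    """Mimic m{key}: the set of all values stored under key."""
    return {v for (k, v) in m if k == key}

a = {(1, "a"), (2, "b"), (1, "c")}
print(single(a, 2))
print(single(a, 1))
print(sorted(multi(a, 1)))
</pre>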


<b>Compound operators</b><br>
Compound operators (written as <code><b>op</b>/tuple</code> or <code><b>op</b>/map</code>) make it possible to write quite terse code (somewhat akin to APL and <a href="http://www.jsoftware.com/">J</a>). Here is the factorial of 20; integers in SETL have arbitrary precision, so larger factorials work the same way.
<pre>
$ setl 'print(*/[1..20]);'
2432902008176640000
</pre>

There is no built-in <code>max</code> for tuples. Instead we use the compound operator version, which is possible since <code>max</code> is a binary operator:
<pre>
$ setl 'setrandom(0); print(<b>max</b>/[random(10000) : i in [1..100]]);'
9898
</pre>

Another example of compound operators is from Project Euler problem #5 (<i>the smallest number that is evenly divisible by all of the numbers from 1 to 20</i>). In my solution (<a href="http://www.hakank.org/setl/project_euler5.setl">project_euler5.setl</a>) lcm and gcd are defined as operators (in contrast to procedures):
<pre>
print(lcm/[2..20]); -- Prints the answer.

op lcm(a,b);
  g := a gcd b;
  return (a*b) div g;
end op lcm;

op gcd(u, v);
  return if v = 0 then abs u else v gcd u mod v end;
end op;
</pre>
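In Python terms, a compound operator <code>op/tuple</code> behaves roughly like a fold over the tuple. The following sketch (an analogy, not a translation of the SETL runtime) uses <code>functools.reduce</code> for the three compound operators shown so far:
<pre>
from functools import reduce
from math import gcd
import operator

def lcm(a, b):
    """Least common multiple, mirroring the SETL lcm operator above."""
    return (a * b) // gcd(a, b)

factorial_20 = reduce(operator.mul, range(1, 21))  # like */[1..20]
euler5 = reduce(lcm, range(2, 21))                 # like lcm/[2..20]
largest = reduce(max, [3, 9, 4])                   # like max/[3, 9, 4]
print(factorial_20, euler5, largest)
</pre>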

Speaking of Project Euler problems, here is the SETL program for the first problem (<i>Find the sum of all the multiples of 3 or 5 below 1000</i>):
<pre>
print(+/{i : i in [1..999] | i mod 3 = 0 or i mod 5 = 0});
</pre>

In <a href="http://www.hakank.org/setl/averages_pythagorean_means.setl">averages_pythagorean_means.setl</a>, three different versions of the mean are defined (as procedures) using compound operators (maybe not the most efficient way).
<pre>
-- arithmetic mean
proc mean_A(x);   
  return +/x/#x; 
end proc;

-- geometric mean
proc mean_G(x); 
  return (*/x)**(1/#x);
end proc;

-- harmonic mean
proc mean_H(x);
  return #x/+/[1/i:i in x];
end proc;
</pre>

<b>Randomization</b><br>
<code>setrandom(0)</code> seeds the random number generator with an "arbitrary" seed.
<pre>
setl 'setrandom(0); s := [1,3,5,8]; print([random(s) : i in [1..10]]);'
[5 1 8 8 5 3 3 1 5 3]
</pre>

With a set we get a value only once:
<pre>
$ setl 's1 := {1..10}; setrandom(0); print({ random(s1) : i in [1..10]});'
{3 5 6 7 8}
</pre>

In GNU SETL the <b>order</b> of the set is always presented as sorted, but this is not a requirement in the SETL language.

<br><br>
<b>Regular expressions</b><br>
GNU SETL has built-in support for regular expressions (which standard SETL does not have). Some examples:
<pre>
$ setl 's:="nonabstractedness"; m:=s("a.*b.*c*.d*.e*"); print(s);print(m);'
nonabstractedness
abstractedness
</pre>

Also see <a href="http://www.hakank.org/setl/read_test2.setl">read_test2.setl</a>, which searches for words like this in a word file.
<br><br>
Substitution (cf. <code>gsub</code> for global substitution):
<pre>
$ setl 's:="nonabstractedness"; m:=sub(s,"a.*b.*c*.d*.e*",""); print(s);print(m);'
non
abstractedness
</pre>

Note that GNU SETL doesn't support non-greedy regular expressions (i.e. the ".+?" constructs from Perl etc.), so the plain old <code>[^...]</code> construct must be used:
<pre>
$ setl 's:="nonabstractedness"; m:=s("a[^s]+s"); print(s);print(m);'
nonabstractedness
abs
</pre>

A small drawback is that GNU SETL doesn't support national characters in strings. The only acceptable characters are "plain ASCII".

<br><br>
<b>SNOBOL-like pattern matching</b><br>
SETL also has SNOBOL/SPITBOL-like patterns (but not as nicely integrated as in SNOBOL). Except for experiments, I tend to use regular expressions rather than these functions.
<br><br>
Example: <a href="http://cs1.cs.nyu.edu/~bacon/setl-doc.html#any">any</a> is used like this:
<pre>
$ setl 'x := "12345 12345 12345"; print(any(x, "123"));print(x);'
1
2345 12345 12345
</pre>

However, I miss a <code>many</code> function, which takes <b>many</b> characters from the beginning of the string, not just the first. It is quite easy to create one. First let's see how it works: we take all the characters from the beginning of the string as long as they are any of "123":
<pre>
$ setl 'x := "12345 12345 12345"; print(x); while any(x, "123") /= "" loop print(x); end;;print(x);'
12345 12345 12345
2345 12345 12345
345 12345 12345
45 12345 12345
45 12345 12345
</pre>

(The corresponding regular expression for this is, of course <code>^[123]+</code>.)

<br><br>
A SETL procedure for <code>many</code> is defined below. The first argument is defined as read-write (<code>rw</code>) so we can modify the string <code>s</code>. The value returned (<code>z</code>) contains all the matched characters.
<pre>
proc many(rw s,p); 
   z := "";
   while (zz := any(s,p)) /= "" loop 
        z +:= zz; 
   end loop; 
   return z; 
end proc;
</pre>

And here is <code>many</code> in action. Note: procedures must always be placed last in a program.
<pre>
x := "12345 12345 12345";
print(x);
z:=many(x, "123");
print("x",x);
print("z",z);
proc many(rw s,p);
   print(s); print(p);
   z := "";
   while (zz := any(s,p)) /= "" loop
      z +:= zz;
   end loop;
   return z;
end proc;
</pre>

Result:
<pre>
12345 12345 12345
12345 12345 12345
123
x 45 12345 12345
z 123
</pre>

In <a href="http://www.hakank.org/setl/look_and_say_sequence.setl">look_and_say_sequence.setl</a> <code>many</code> is used, as well as a direct approach and one using regular expressions.

<br><br>
<b>(Shell) filters</b><br>
GNU SETL has a lot of extensions for system (UNIX) handling, e.g. filter.
<pre>
$ setl 'f := filter("ls p*.setl"); print(f);s := split(f,"\n");print([s,#s]);'
perm.setl
pointer.setl
primes2.setl
primes3.setl
primes.setl
printprimes.setl
[['perm.setl' 'pointer.setl' 'primes2.setl' 'primes3.setl' 'primes.setl' 'printprimes.setl' ''] 7]
</pre>

Reading a whole file directly is done with <code>getfile</code>; <code>split()</code> then breaks the content into lines.
<pre>
x := split(getfile("file.txt"), "\n");
print(#x);
</pre>


<h3>Some larger examples</h3>
<b>SEND + MORE = MONEY</b><br>
This is rather slow since it has to loop through a lot of values. However, it doesn't loop through all combinations of digits, since for each variable we exclude the values of the previously stated variables.
<pre>
print(send_more_money1());

proc send_more_money1;
   ss := {0..9};

   smm := [[S,E,N,D,M,O,R,Y] : 
    -- ensure that all numbers are different
    S in ss ,
    E in ss - {S} ,
    N in ss - {S,E} , 
    D in ss - {S,E,N} , 
    M in ss - {S,E,N,D} , 
    O in ss - {S,E,N,D,M} , 
    R in ss - {S,E,N,D,M,O} ,  
    Y in ss - {S,E,N,D,M,O,R} | 
    S > 0 and M > 0 and
    (S * 1000 + E * 100 + N * 10 + D) +
    (M * 1000 + O * 100 + R * 10 + E) = 
    (M * 10000 + O * 1000 + N * 100 + E * 10 + Y )];

   return smm;
end proc;
</pre>
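For comparison, a Python sketch of the same brute-force search (not one of the original programs; the function name is made up). Here <code>itertools.permutations</code> plays the role of the nested <code>ss - {...}</code> iterators, so every candidate already consists of eight distinct digits:
<pre>
from itertools import permutations

def send_more_money():
    """Brute force over ordered choices of 8 distinct digits."""
    solutions = []
    for s, e, n, d, m, o, r, y in permutations(range(10), 8):
        if s == 0 or m == 0:
            continue  # leading digits must be non-zero
        send = 1000 * s + 100 * e + 10 * n + d
        more = 1000 * m + 100 * o + 10 * r + e
        money = 10000 * m + 1000 * o + 100 * n + 10 * e + y
        if send + more == money:
            solutions.append([s, e, n, d, m, o, r, y])
    return solutions

print(send_more_money())
</pre>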

For some other (and slower) variants, see <a href="http://www.hakank.org/setl/send_more_money.setl">send_more_money.setl</a>.
<br><br>

<b>prime factors</b><br>
A rather fast version of calculating the prime factors of a number. Note that in GNU SETL division (<code>/</code>) returns a real number, whereas in SETL2 <code>/</code> returns an integer. So here we use <code>div</code> instead of <code>/</code>.
<pre>
procedure prime_factors(n);
    facts := [];
    while even(n) loop facts with:= 2; n := n div 2; end loop;
    while exists k in [3,5..ceil(sqrt(float(n)))] | n mod k = 0 loop
       facts with:= k; 
       n := n div k;
    end loop;
   -- whatever remains is a prime factor, unless n has been reduced to 1
   if n > 1 then facts with:= n; end if;
   return facts;
end prime_factors;
</pre>

<b>Quick sort</b><br>
Somewhat surprisingly, (GNU) SETL doesn't have a built-in sort function, so I have to implement it myself (SETL2 has a package with a lot of different sort methods, though). Here is the quicksort we know from Haskell, Python etc., using list/array comprehensions:
<pre>
proc qsort(a);
  if #a > 1 then
    pivot := a(#a div 2 + 1);
    a := qsort([x in a | x < pivot]) +
         [x in a | x = pivot] +
         qsort([x in a | x > pivot]);
  end if;
  return a;
end proc;
</pre>

In the programs <a href="http://www.hakank.org/setl/anagrams.setl">anagrams.setl</a> and <a href="http://www.hakank.org/setl/sorting.setl">sorting.setl</a> I compare some different sort algorithms. 

<br><br>
<b>Clique</b><br>
A rather inefficient but elegant way of finding the cliques in a graph is shown in <a href="http://www.hakank.org/setl/cliques.setl">cliques.setl</a> (inspired by the <code>{log}</code> (setlog) program <a href="http://www.math.unipr.it/~gianfr/SETLOG/SamplePrograms/Clique.slog">Clique.slog</a>):
<pre>
proc clique(G);
  V := { vv : p in G, vv in p}; -- the vertices
  cliques := {};
  for C in pow(V) loop
    if forall I in C | forall J in C | {I,J} in {{I}} + G  then
      cliques with:= C;
    end if;
  end loop;
  return cliques;
end proc;
</pre>
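The same brute-force idea, sketched in Python for comparison (an illustration under the assumption that the graph is given as a list of vertex pairs; the function name <code>cliques</code> is made up). Every subset of the vertices is tested, and a subset is kept if all pairs of distinct members are joined by an edge, so the empty set and the single vertices count as cliques too, as in the SETL version:
<pre>
from itertools import combinations

def cliques(edges):
    """Return every vertex subset whose members are pairwise adjacent."""
    graph = {frozenset(e) for e in edges}
    vertices = sorted({v for e in graph for v in e})
    found = []
    for size in range(len(vertices) + 1):
        for subset in combinations(vertices, size):
            if all(frozenset(pair) in graph for pair in combinations(subset, 2)):
                found.append(set(subset))
    return found

print(cliques([(1, 2), (2, 3), (1, 3), (3, 4)]))
</pre>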

<b>Luhn test of credit card numbers</b><br>
This problem is from <a href="http://rosettacode.org/wiki/Luhn_test_of_credit_card_numbers">Rosetta Code</a> (from which I have taken some other problems as well). The SETL program is <a href="http://www.hakank.org/setl/luhn_tests_of_credit_card_numbers.setl">luhn_tests_of_credit_card_numbers.setl</a>, where the procedure is as follows:
<pre>
proc isluhn10(num);  
  x := [val(i) : i in reverse(num)];
  m := {[i,val("0246813579"(i+1))] : i in [0..9]};
  return  +/[x(i) + m(x(i+1)?0) : i in [1,3..#num]] mod 10 = 0; 
end proc;
</pre>
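Since the SETL version is quite dense, here is a more spelled-out Python sketch of the same check (an illustration, not the original program). The string <code>"0246813579"</code> is used as a lookup table: position <i>d</i> holds the digit sum of <i>2*d</i>, which is what the Luhn test needs for every second digit counted from the right:
<pre>
def is_luhn10(num):
    """Luhn check for a string of decimal digits."""
    digits = [int(ch) for ch in reversed(num)]
    doubled = [int(ch) for ch in "0246813579"]  # digit sum of 2*d for d = 0..9
    total = 0
    for pos, d in enumerate(digits):
        # counting from the right: the 1st, 3rd, ... digits are added as-is,
        # the 2nd, 4th, ... digits are replaced via the lookup table
        total += d if pos % 2 == 0 else doubled[d]
    return total % 10 == 0

print(is_luhn10("49927398716"), is_luhn10("49927398717"))
</pre>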

<b>Pancake sort</b><br>
Pancake sort (see <a href="http://en.wikipedia.org/wiki/Pancake_sorting">Wikipedia</a> and <a href="http://rosettacode.org/wiki/Sorting_algorithms/Pancake_sort">Rosetta code</a>) is a constrained method of sorting, where you may only flip a range of numbers in sequence. Here is one way to do it in SETL (see <a href="http://www.hakank.org/setl/pancake_sort.setl">pancake_sort.setl</a> for tests).
<pre>
procedure pancake_sort(rw nums);
  for i in [#nums,#nums-1..1] loop
     -- find the index of the largest element not yet sorted
     -- this variant is slightly faster
     [this_max, max_idx] := find_max(nums(1..i));
     if max_idx = i then
       continue; -- element already in place
     end if;
     -- flip this max element to index 1
     if max_idx > 1 then
       nums(1..max_idx) := rev(nums(1..max_idx));
     end if;
     -- then flip the max element to its place
     nums(1..i) := rev(nums(1..i));
  end loop;
end procedure;

-- reverse a tuple
procedure rev(a);
  return [a(i) : i in [#a,#a-1..1]];
end procedure;

--
-- find the (first) index of the max value 
-- in a tuple.
-- Returns [max_value, index]
procedure find_max(a);
  max_idx := 1;
  this_max := a(1);
  for j in [2..#a] loop
    if a(j) > this_max then
      this_max := a(j);
      max_idx := j;
    end if;
  end loop;
  return [this_max, max_idx];
end procedure;

</pre>


<h2>Some SETL links</h2>
Here are some SETL references.
<ul>
  <li> <a href="http://setl.org/setl">GNU SETL</a>: This is the SETL implementation I use.
  <li> <a href="http://quincy.inria.fr/data/courses/hpl2000/setl_report.ps">setl_report.ps</a> (PostScript)
  <li> <a href="http://www.setl-lang.org/wiki">SETL Wiki</a>
  <li> <a href="http://setl.org/setl/doc/setl-lib.html">Library reference in GNU SETL</a>
  <li> <a href="http://www.settheory.com/">www.settheory.com</a>: "Programming in SETL", Jack Schwartz's draft of a book on SETL2 (not everything is supported in GNU SETL)
  <li> David Bacon's PhD thesis <a href="http://cs.nyu.edu/~bacon/phd-thesis/">SETL for Internet Data Processing</a>
  <li> Rosetta Code's entry <a href="http://rosettacode.org/wiki/Category:SETL">SETL</a>
  <li> <a href="http://quincy.inria.fr/data/courses/hpl2000/setl.html">The SETL Programming Language</a>: Lecture notes
  <li> Wikipedia: <a href="http://en.wikipedia.org/wiki/SETL">SETL</a>
  <li> <a href="http://setl.org/setl-server.html">Dave's Famous Original SETL Server</a>
  <li> Robert B. K. Dewar: <a href="http://www.setl-lang.org/docs/setlprog.pdf">The SETL Programming Language</a> (PDF)
</ul>



<h2>My SETL programs</h2>
I have collected some of my SETL programs at my <a href="http://www.hakank.org/setl/">SETL page</a>. They are mostly small examples and experiments, and a lot are from <a href="http://projecteuler.net/">Project Euler</a> and <a href="http://rosettacode.org">Rosetta Code</a>.
<ul>
<li><a href="http://www.hakank.org/setl/all_pairs.setl">all_pairs.setl</a>: All pairs (a slow variant of <code>is_tuple</code>)
<li><a href="http://www.hakank.org/setl/anagram.setl">anagram.setl</a>: Anagram of a given word from a word list
<li><a href="http://www.hakank.org/setl/anagrams.setl">anagrams.setl</a>: Largest sets of anagrams given a word list (Rosetta code)
<li><a href="http://www.hakank.org/setl/array_concatenation.setl">array_concatenation.setl</a>: Array concatenation (Rosetta code)
<li><a href="http://www.hakank.org/setl/averages_pythagorean_means.setl">averages_pythagorean_means.setl</a>: Averages/Pythagorean means (Rosetta code)
<li><a href="http://www.hakank.org/setl/binary_search.setl">binary_search.setl</a>: Binary search (Rosetta code)
<li><a href="http://www.hakank.org/setl/binomial.setl">binomial.setl</a>: Binomial coefficients
<li><a href="http://www.hakank.org/setl/clique.setl">clique.setl</a>: Clique. Sample data: <a href="http://www.hakank.org/setl/clique.in">clique.in</a>
<li><a href="http://www.hakank.org/setl/closest_pair_problem.setl">closest_pair_problem.setl</a>: Closest pair problem
<li><a href="http://www.hakank.org/setl/collect.setl">collect.setl</a>: Collect the number of occurrences in a tuple
<li><a href="http://www.hakank.org/setl/comb_sort.setl">comb_sort.setl</a>: Comb sort
<li><a href="http://www.hakank.org/setl/equation_sys.setl">equation_sys.setl</a>: Equation system
<li><a href="http://www.hakank.org/setl/evolutionary_algorithm.setl">evolutionary_algorithm.setl</a>: Evolutionary Algorithm (Rosetta code)
<li><a href="http://www.hakank.org/setl/fibonacci_sequence.setl">fibonacci_sequence.setl</a>: Fibonacci sequence (different implementations)
<li><a href="http://www.hakank.org/setl/find_the_missing_permutation.setl">find_the_missing_permutation.setl</a>: Find the missing permutation (Rosetta code)  
<li><a href="http://www.hakank.org/setl/fizzbuzz.setl">fizzbuzz.setl</a>: FizzBuzz (Rosetta code)
<li><a href="http://www.hakank.org/setl/flatten_a_list.setl">flatten_a_list.setl</a>: Flatten a list (Rosetta code)
<li><a href="http://www.hakank.org/setl/forward_difference.setl">forward_difference.setl</a>: Forward difference (Rosetta code)
<li><a href="http://www.hakank.org/setl/gnome_sort.setl">gnome_sort.setl</a>: Gnome sort
<li><a href="http://www.hakank.org/setl/greatest_subsequential_sum.setl">greatest_subsequential_sum.setl</a>: Greatest subsequential sum (Rosetta code)
<li><a href="http://www.hakank.org/setl/hailstone_sequence.setl">hailstone_sequence.setl</a>: Hailstone sequence (Collatz sequence) (Rosetta code)
<li><a href="http://www.hakank.org/setl/happy_numbers.setl">happy_numbers.setl</a>: Happy numbers (Rosetta code)
<li><a href="http://www.hakank.org/setl/hash_from_two_arrays.setl">hash_from_two_arrays.setl</a>: Hash from two arrays
<li><a href="http://www.hakank.org/setl/in_difference.setl">in_difference.setl</a>: In difference
<li><a href="http://www.hakank.org/setl/knuth_shuffle.setl">knuth_shuffle.setl</a>: Knuth shuffle (Rosetta code)
<li><a href="http://www.hakank.org/setl/longest_common_subsequence.setl">longest_common_subsequence.setl</a>: Longest common sub sequence
<li><a href="http://www.hakank.org/setl/look_and_say_sequence.setl">look_and_say_sequence.setl</a>: Look and say sequence (Rosetta code)
<li><a href="http://www.hakank.org/setl/luhn_tests_of_credit_card_numbers.setl">luhn_tests_of_credit_card_numbers.setl</a>: Luhn tests of credit card numbers (Rosetta code)
<li><a href="http://www.hakank.org/setl/mandelbrot.setl">mandelbrot.setl</a>: Mandelbrot set
<li><a href="http://www.hakank.org/setl/median.setl">median.setl</a>: Median
<li><a href="http://www.hakank.org/setl/min_max.setl">min_max.setl</a>: Min and max
<li><a href="http://www.hakank.org/setl/minimum_common_multiple.setl">minimum_common_multiple.setl</a>: Minimum Common Multiple
<li><a href="http://www.hakank.org/setl/pancake_sort.setl">pancake_sort.setl</a>: Pancake sort
<li><a href="http://www.hakank.org/setl/pangram_checker.setl">pangram_checker.setl</a>: Pangram checker (Rosetta code)
<li><a href="http://www.hakank.org/setl/perfect_numbers.setl">perfect_numbers.setl</a>: Perfect numbers (Rosetta code)
<li><a href="http://www.hakank.org/setl/primes4.setl">primes4.setl</a>: Primes (one of many different implementations, this is not very efficient...)
<li><a href="http://www.hakank.org/setl/project_euler1.setl">project_euler1.setl</a>: Project Euler, problem 1, multiples of 3 or 5
<li><a href="http://www.hakank.org/setl/project_euler2.setl">project_euler2.setl</a>: Project Euler, problem 2, sum of all even-valued terms in Fibonacci sequence
<li><a href="http://www.hakank.org/setl/project_euler3.setl">project_euler3.setl</a>: Project Euler, problem 3, largest prime factor of 600851475143
<li><a href="http://www.hakank.org/setl/project_euler4.setl">project_euler4.setl</a>: Project Euler, problem 4, largest palindromic number from product of two 3-digits numbers
<li><a href="http://www.hakank.org/setl/project_euler5.setl">project_euler5.setl</a>: Project Euler, problem 5, smallest number evenly divisible by 1..20
<li><a href="http://www.hakank.org/setl/project_euler6.setl">project_euler6.setl</a>: Project Euler, problem 6, difference between sum of squares and squares of sums for 1..100
<li><a href="http://www.hakank.org/setl/project_euler7.setl">project_euler7.setl</a>: Project Euler, problem 7, 10001st prime number
<li><a href="http://www.hakank.org/setl/project_euler8.setl">project_euler8.setl</a>: Project Euler, problem 8, greatest product of five consecutive digits in a 1000-digit number
<li><a href="http://www.hakank.org/setl/project_euler9.setl">project_euler9.setl</a>: Project Euler, problem 9, Pythagorean triplet a+b+c=1000
<li><a href="http://www.hakank.org/setl/project_euler10.setl">project_euler10.setl</a>: Project Euler, problem 10, sum of all primes below 2 million
<li><a href="http://www.hakank.org/setl/project_euler11.setl">project_euler11.setl</a>: Project Euler, problem 11, greatest product of four adjacent numbers in a 20x20 grid
<li><a href="http://www.hakank.org/setl/project_euler12.setl">project_euler12.setl</a>: Project Euler, problem 12, first triangle number with over 500 divisors
<li><a href="http://www.hakank.org/setl/project_euler13.setl">project_euler13.setl</a>: Project Euler, problem 13, first 10 digits of a sum of 100 50-digit numbers
<li><a href="http://www.hakank.org/setl/project_euler14.setl">project_euler14.setl</a>: Project Euler, problem 14, Collatz problem (Hailstone sequence): longest sequence for n < 1000000
<li><a href="http://www.hakank.org/setl/project_euler15.setl">project_euler15.setl</a>: Project Euler, problem 15, how many routes through a 20x20 grid
<li><a href="http://www.hakank.org/setl/project_euler16.setl">project_euler16.setl</a>: Project Euler, problem 16, sum of the digits of 2^1000
<li><a href="http://www.hakank.org/setl/project_euler20.setl">project_euler20.setl</a>: Project Euler, problem 20, sum of the digits in 100! (factorial)
<li><a href="http://www.hakank.org/setl/project_euler21.setl">project_euler21.setl</a>: Project Euler, problem 21, sum of all amicable numbers under 10000
<li><a href="http://www.hakank.org/setl/project_euler22.setl">project_euler22.setl</a>: Project Euler, problem 22, total of all name scores in a file
<li><a href="http://www.hakank.org/setl/project_euler25.setl">project_euler25.setl</a>: Project Euler, problem 25, first Fibonacci term containing 1000 digits
<li><a href="http://www.hakank.org/setl/project_euler28.setl">project_euler28.setl</a>: Project Euler, problem 28, sum of numbers in a 1001x1001 spiral
<li><a href="http://www.hakank.org/setl/project_euler30.setl">project_euler30.setl</a>: Project Euler, problem 30, sum of all numbers that can be written as the sum of fifth powers of their digits
<li><a href="http://www.hakank.org/setl/project_euler31.setl">project_euler31.setl</a>: Project Euler, problem 31, in how many different ways can £2 be made using any number of coins
<li><a href="http://www.hakank.org/setl/project_euler32.setl">project_euler32.setl</a>: Project Euler, problem 32, sum of 1..9-pandigital numbers
<li><a href="http://www.hakank.org/setl/project_euler34.setl">project_euler34.setl</a>: Project Euler, problem 34, sum of all numbers that are equal to the sum of the factorial of their digits
<li><a href="http://www.hakank.org/setl/project_euler35.setl">project_euler35.setl</a>: Project Euler, problem 35, how many circular primes are there under 1000000
<li><a href="http://www.hakank.org/setl/project_euler36.setl">project_euler36.setl</a>: Project Euler, problem 36, sum of all numbers, less than one million, which are palindromic in base 10 and base 2
<li><a href="http://www.hakank.org/setl/project_euler48.setl">project_euler48.setl</a>: Project Euler, problem 48, find the last ten digits of the series 1^(1) + 2^(2) + 3^(3) + ... + 1000^(1000)
<li><a href="http://www.hakank.org/setl/read_test2.setl">read_test2.setl</a>: Reading a dictionary with regular expressions (e.g. "a.*b.*c.*d", "b.*c.*d.*e", etc.). This is one of my standard tests when learning a new programming language.
<li><a href="http://www.hakank.org/setl/rot13.setl">rot13.setl</a>: ROT-13
<li><a href="http://www.hakank.org/setl/send_more_money.setl">send_more_money.setl</a>: SEND + MORE = MONEY
<li><a href="http://www.hakank.org/setl/shell_sort.setl">shell_sort.setl</a>: Shell sort
<li><a href="http://www.hakank.org/setl/shur_numbers.setl">shur_numbers.setl</a>: Schur numbers
<li><a href="http://www.hakank.org/setl/sort_map.setl">sort_map.setl</a>: Sorting a map
<li><a href="http://www.hakank.org/setl/sorting.setl">sorting.setl</a>: Some sorting methods
<li><a href="http://www.hakank.org/setl/soundex.setl">soundex.setl</a>: Soundex (Rosetta code)
<li><a href="http://www.hakank.org/setl/squares.setl">squares.setl</a>: Squares
<li><a href="http://www.hakank.org/setl/tree_traversal.setl">tree_traversal.setl</a>: Tree traversal (Rosetta code)
</ul>
]]></description>
         <link>http://www.hakank.org/arrays_in_flux/2010/04/setl_the_set_programming_language_1.html</link>
         <guid>http://www.hakank.org/arrays_in_flux/2010/04/setl_the_set_programming_language_1.html</guid>
         <category>Programming/Programming languages</category>
         <pubDate>Tue, 27 Apr 2010 19:01:41 +0100</pubDate>
      </item>
            <item>
         <title>Symbolic Regression with JGAP - further improvements: minNodes, alldifferent, ForLoopD</title>
         <description><![CDATA[<p>In <a href="http://www.hakank.org/arrays_in_flux/2010/02/symbolic_regression_with_jgap_some_improvements.html">Symbolic Regression with JGAP - some improvements</a> I mentioned some small improvements that would be nice to have in my <a href="http://www.hakank.org/jgap/">Symbolic Regression</a> program:<ul>  <li> constraint that a program has at least <code>minNodes</code> nodes (akin to the existing <code>maxNodes</code>)<br />
This has been implemented with the option <code>minNodes</code>.<br />
  <li> constraint that the variables in a program should be unique.<br />
This has been implemented with the option <code>alldifferentVariables</code>.</ul></p>

<p>I talked there about building a node validator that restricted the programs with these constraints. However, a better way - and more genetic programming-ish - is to penalize programs that do not satisfy these restrictions. And this is the way I have taken.</p>

<p>The new options are both used in the recreational <a href="http://richardwiseman.wordpress.com/2010/02/26/its-the-friday-puzzle-48/">problem</a> by Richard Wiseman (Friday puzzle 2010-02-26) of finding an equation with result 24 using the numbers <code>5, 5, 5, 1</code> each exactly once, together with the four arithmetic operators (+,-,*,/). Richard Wiseman's solution can be read in <a href="http://richardwiseman.wordpress.com/2010/03/01/answer-to-the-friday-puzzle-41/">Answer to the Friday puzzle….</a> (2010-03-01).</p>

<p>The problem is modeled in <a href="http://www.hakank.org/jgap/number_puzzle4.conf">number_puzzle4.conf</a>. Here is the configuration file, where the new options are marked in bold (see below for the <code>ForLoopD</code> option).</p>

<p><code>presentation: Puzzle<br />
return_type: DoubleClass<br />
num_input_variables: 4<br />
variable_names: a b c d e<br />
# With ForLoopD<br />
# functions: Multiply,Divide,Add,Subtract,<b>ForLoopD</b><br />
functions: Multiply,Divide,Add,Subtract<br />
# We don't use any numeric terminals<br />
no_terminals: true<br />
max_init_depth: 4<br />
population_size: 1000<br />
max_crossover_depth: 4<br />
num_evolutions: 400<br />
max_nodes: 7<br />
<b>min_nodes: 7 100</b><br />
<b>alldifferent_variables: true 100</b><br />
show_similiar: true<br />
similiar_sort_method: length<br />
data<br />
5 5 5 1  24<br />
</code></p>

<p>Here we require a minimum of 7 nodes (as well as a maximum of 7 nodes), i.e. the 4 variables (<code>a, b, c, d</code>) and 3 operators between them. If a program has fewer nodes, it is penalized with 100 points (errors), the second value of the option. Note that there is no guarantee that the constraint holds; it is just quite probable with this large penalty.</p>

<p>The other option, <code>alldifferentVariables</code>, is used in the same way: if a variable occurs more than once in the program, the program is penalized by 100 points (the second value).</p>
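<p>The penalty idea can be sketched roughly as follows. This is a minimal Python illustration, not the actual JGAP/Java code: the tuple representation of a program, the protected division and all function names are assumptions made for the example, and the penalties are assumed to be simply added to the error.</p>

```python
# Hypothetical representation: a program is either a variable name (str)
# or a tuple (operator, left, right).
OPS = {
    "+": lambda x, y: x + y,
    "-": lambda x, y: x - y,
    "*": lambda x, y: x * y,
    "/": lambda x, y: x / y if y != 0 else 1.0,  # protected division (assumption)
}

def evaluate(prog, env):
    """Evaluate a program tree with the variable values in env."""
    if isinstance(prog, str):
        return env[prog]
    op, left, right = prog
    return OPS[op](evaluate(left, env), evaluate(right, env))

def count_nodes(prog):
    """Number of nodes (operators plus variables) in the tree."""
    if isinstance(prog, str):
        return 1
    return 1 + count_nodes(prog[1]) + count_nodes(prog[2])

def variables(prog):
    """All variable occurrences, in order, including repeats."""
    if isinstance(prog, str):
        return [prog]
    return variables(prog[1]) + variables(prog[2])

def penalized_error(prog, env, target, min_nodes=7,
                    min_nodes_penalty=100.0, alldifferent_penalty=100.0):
    """Absolute error plus a fixed penalty per violated constraint."""
    error = abs(evaluate(prog, env) - target)
    if min_nodes > count_nodes(prog):
        error += min_nodes_penalty
    used = variables(prog)
    if len(set(used)) != len(used):
        error += alldifferent_penalty
    return error

env = {"a": 5.0, "b": 5.0, "c": 5.0, "d": 1.0}
good = ("*", ("-", "c", ("/", "d", "a")), "b")    # (c - (d / a)) * b
short = ("-", ("*", "a", "c"), "d")               # (a * c) - d, only 5 nodes
reused = ("-", ("*", "b", "a"), ("/", "b", "c"))  # (b * a) - (b / c), b used twice
```

<p>All three programs evaluate to 24, but in this sketch only the first one keeps an error of 0; the other two each pick up one penalty of 100.</p>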

<p>Also, I have increased the number of evolutions to 400 (from 100) because of these constraints.</p>

<p>With these new options the required solution to the problem is found rather easily, though maybe not on every run. Remember that a = 5, b = 5, c = 5, and d = 1, and the target is the number 24.<br />
<code><br />
(c - (d / a)) * b<br />
b * (c - (d / a))<br />
</code></p>

<p>The numeric solution of the problem is <code>5*(5-1/5)</code>, and the two programs are just permutations of this solution.</p>
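<p>As a cross-check of the puzzle itself (not of the genetic programming run), a small exhaustive search can confirm that <code>5*(5-1/5)</code> reaches 24 with each number used exactly once. The following Python sketch is an assumption-laden illustration: the function name <code>solve</code> is made up for this example, and exact fractions are used instead of doubles to avoid rounding issues.</p>

```python
from fractions import Fraction

def solve(values, target):
    """Return expression strings that reach target using every value exactly once."""
    found = set()

    def search(items):
        # items is a list of (value, expression text) pairs still to be combined
        if len(items) == 1:
            value, text = items[0]
            if value == target:
                found.add(text)
            return
        for i in range(len(items)):
            for j in range(len(items)):
                if i == j:
                    continue
                x, xs = items[i]
                y, ys = items[j]
                rest = [items[k] for k in range(len(items)) if k not in (i, j)]
                candidates = [
                    (x + y, f"({xs} + {ys})"),
                    (x - y, f"({xs} - {ys})"),
                    (x * y, f"({xs} * {ys})"),
                ]
                if y != 0:
                    candidates.append((x / y, f"({xs} / {ys})"))
                for candidate in candidates:
                    search(rest + [candidate])

    search([(Fraction(v), str(v)) for v in values])
    return sorted(found)

solutions = solve([5, 5, 5, 1], 24)
for expression in solutions:
    print(expression)
```

<p>The same function can be called with <code>solve([1, 2, 3, 4], 5)</code> for the second puzzle further down.</p>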

<h3>Further experiments</h3>
As an experiment I also set both the <code>minNodes</code> and <code>maxNodes</code> to 8 and ran again.
<code>
min_nodes: 8
max_nodes: 8
</code>

<p>Since all functions are binary, a program always has an odd number of nodes, so there can be no solution with 8 nodes and there must be some penalty. The following solutions came up, all with an error of 100, the penalty for not having at least 8 nodes. The variables are, however, all different as they should be, so there is no penalty for that constraint.</p>

<p><code>All solutions with the best fitness (100.0):<br />
Sort method: occurrence<br />
(a - (d / c)) * b [42831]<br />
b * (a - (d / c)) [28531]<br />
(b - (d / c)) * a [72]<br />
a * (b - (d / c)) [9]<br />
b * (c - (d / a)) [9]<br />
(a - (d / b)) * c [2]<br />
It was 6 different solutions with fitness 100.0<br />
</code></p>

<p>It is interesting that there are more solutions with the constraint of 8 nodes than with 7.</p>

<p>When the minimum and maximum number of nodes are both increased to 9, there are solutions with the stated number of nodes. But now there is a penalty of 100 because the variables are not all different, and they are - of course - not real solutions to the problem.<br />
<code><br />
All solutions with the best fitness (100.0):<br />
Sort method: occurrence<br />
(c * b) - ((c / d) / a) [17588]<br />
(c - (d - (b * b))) - a [2]<br />
It was 2 different solutions with fitness 100.0<br />
</code></p>

<h3>Another example: 1 2 3 4  5</h3>
Yet another example using the same configuration file is the following problem, i.e. the result should be 5 using the numbers 1, 2, 3, and 4 and the four operators. 

<p><code>data<br />
1 2 3 4  5</code></p>

<p>One run gives the following 46 solutions with 0 errors. The number in [] is the number of found occurrences of the specific solution.</p>

<p><code>All solutions with the best fitness (0.0):<br />
Sort method: occurrence<br />
d - (a / (b - c)) [70602]<br />
(c + d) - (a * b) [22871]<br />
(c - (b * a)) + d [8724]<br />
c + (d / (b * a)) [3109]<br />
d - (b - (c * a)) [116]<br />
((d - b) * a) + c [107]<br />
(d - b) + (a * c) [58]<br />
(c - b) + (a * d) [40]<br />
(d * b) - (c * a) [35]<br />
((c / b) * d) - a [24]<br />
(c - b) * (d + a) [24]<br />
c + ((a / b) * d) [19]<br />
(b * d) - (c / a) [16]<br />
d + ((c - a) / b) [15]<br />
(d + a) / (c - b) [15]<br />
(d + c) - (b / a) [13]<br />
(c - b) + (d / a) [12]<br />
(c / a) + (d - b) [8]<br />
(b * d) - (a * c) [8]<br />
(d * a) - (b - c) [8]<br />
(d / (a * b)) + c [8]<br />
(a * d) - (b - c) [7]<br />
(a * d) + (c - b) [7]<br />
(a + d) * (c - b) [6]<br />
(a * c) + (d / b) [6]<br />
(c / a) + (d / b) [5]<br />
(c + d) - (b * a) [4]<br />
(d + c) - (a * b) [4]<br />
(c + (d / b)) * a [4]<br />
(a + d) / (c - b) [4]<br />
(d + a) * (c - b) [3]<br />
(d / a) - (b - c) [3]<br />
(b * d) - (c * a) [2]<br />
((d / b) * c) - a [2]<br />
(d / b) + (c / a) [2]<br />
(d * a) + (c - b) [1]<br />
((c - b) / a) + d [1]<br />
c + (d - (a * b)) [1]<br />
(c * a) - (b - d) [1]<br />
(a * c) - (b - d) [1]<br />
((d * a) + c) - b [1]<br />
(d * b) - (a * c) [1]<br />
(d + c) - (b * a) [1]<br />
(d / a) + (c - b) [1]<br />
a + ((c - b) * d) [1]<br />
((c + d) - b) / a [1]<br />
It was 46 different solutions with fitness 0.0<br />
</code></p>

<p>Here there are many more solutions, which indicates that this is a simpler problem than the one above.</p>

<h2>ForLoopD</h2>
I have also implemented a double version of JGAP's existing <code>ForLoop</code>, which can also be used in this program. (This was done by copying the code in the JGAP distribution, org.jgap.gp.function.ForLoop, and then making some small changes.)

<p>The logic of this function is to create a for loop and, for each iteration, add the result of the code in the body of the loop ("some code") to the final result, which is then returned as the value of the loop. In a normal programming language this would be coded as follows. The number of iterations (the variable <code>a</code>) is selected dynamically.</p>

<p><code>  double x = 0.0d;<br />
   for(int i=0;i&lt;a;i++) { x += some code }<br />
  return x;<br />
</code></p>

<p>As a test, I added <code>ForLoopD</code> to the function list in the <code>5 5 5 1 24</code> problem (see the configuration above):<br />
<code><br />
functions: Multiply,Divide,Add,Subtract,<b>ForLoopD</b><br />
</code></p>

<p>One solution is the following with 0 errors:<br />
<code><br />
   for(int i=0;i&lt;b;i++) { (c - (d / a)) }<br />
</code></p>

<p>Which is - of course - just another way of stating the following solution:<br />
<code><br />
   b*(c - (d / a))<br />
</code></p>
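<p>The equivalence of the loop and the product can be checked with a small Python sketch. The helper <code>for_loop_d</code> is a hypothetical stand-in for the semantics described above, not the Java implementation; in particular, truncating the loop count to an integer is an assumption.</p>

```python
def for_loop_d(count, body):
    """Sum the value of body() over int(count) iterations."""
    total = 0.0
    for _ in range(int(count)):  # truncation of the count is an assumption
        total += body()
    return total

a, b, c, d = 5.0, 5.0, 5.0, 1.0
looped = for_loop_d(b, lambda: c - (d / a))  # for(i=0; i below b; i++) { c - (d / a) }
direct = b * (c - (d / a))                   # b * (c - (d / a))
print(looped, direct)
```

<p>Both values agree with the target 24 up to floating point rounding.</p>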

<p>Well, I have to see if this function is of any real use...</p>

<h2>Download and more info</h2>
The symbolic regression program can be downloaded from <a href="http://www.hakank.org/jgap/">my JGAP page</a> which also contains more information about the program and <a href="http://jgap.sourceforge.net/">JGAP</a>.
]]></description>
         <link>http://www.hakank.org/arrays_in_flux/2010/03/symbolic_regression_with_jgap_further_improvements_minnodes_alldifferent_forloopd.html</link>
         <guid>http://www.hakank.org/arrays_in_flux/2010/03/symbolic_regression_with_jgap_further_improvements_minnodes_alldifferent_forloopd.html</guid>
         <category>Symbolic regression</category>
         <pubDate>Wed, 03 Mar 2010 19:00:01 +0100</pubDate>
      </item>
            <item>
         <title>Symbolic Regression with JGAP - some improvements</title>
         <description><![CDATA[<p>The <a href="http://www.hakank.org/jgap/">SymbolicRegression</a> program (using JGAP in Java) has been updated with some improvements. </p>

<h2>New configuration options</h2>
Some of these new options are explained in the examples below.
<ul><li> <code>show_similar</code>: Alternative name of <code>show_similiar</code>.
<li> <code>similiar_sort_method</code>: Method of sorting the similar solutions when using <code>show_similiar</code>, which shows all solutions that have the same fitness value as the best found solution. Alternative name: <code>similar_sort_method</code>. Valid options are:
  <ul>
    <li> <code>occurrence</code>: descending number of occurrences (default)
    <li> <code>length</code>: length of solutions (ascending)
 </ul>
<li> <code>error_method</code>: Error method to use. Valid options are
     <ul>
       <li> <code>totalError</code>: sum of (absolute) errors (default)
       <li> <code>minError</code>: minimum error
       <li> <code>meanError</code>: mean error
       <li> <code>medianError</code>: median error
       <li> <code>maxError</code>: max error
      </ul>
<li> <code>no_terminals</code>: If true then no Terminal is used, i.e. no numbers, just variables. Default: false. 
<li> <code>make_time_series</code>: Make a time series of the first line of data. The value of <code>num_input_variables</code> determines the number of lags (+1 for the output variable). See below for some examples.
<li> <code>make_time_series_with_index</code>: As <code>make_time_series</code> with an extra input variable for the index of the series. (Somewhat experimental.)</ul>
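<p>To make the <code>error_method</code> options concrete, here is a minimal Python sketch of how the per-case errors could be aggregated. It is not taken from the Java program: the function name is hypothetical, and it is assumed that every method works on the absolute errors of the fitness cases.</p>

```python
from statistics import mean, median

# Option names as listed above, mapped to an aggregation function.
ERROR_METHODS = {
    "totalError": sum,
    "minError": min,
    "meanError": mean,
    "medianError": median,
    "maxError": max,
}

def aggregate_error(predictions, targets, method="totalError"):
    """Combine the absolute errors of all fitness cases into one number."""
    errors = [abs(p - t) for p, t in zip(predictions, targets)]
    return ERROR_METHODS[method](errors)

predictions = [1.0, 2.0, 5.0]
targets = [1.0, 4.0, 1.0]
for name in ERROR_METHODS:
    print(name, aggregate_error(predictions, targets, name))
```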

<h2>New examples</h2>
Some new examples have been published as well.

<ul> <li><a href="http://www.hakank.org/jgap/leap_years.conf">leap_years.conf</a><br>
This example tries to figure out how to calculate the leap years. See <a href="http://en.wikipedia.org/wiki/Leap_year">Leap_year</a> (Wikipedia) for more on leap years.

<p>The fitness cases consist of all years 1890..2030, and 1200, 1300, 1400, 1500, 1600, 1700, and 1800.</p>

<p>The functions used are: <code>Multiply,Divide,Add,Subtract,ModuloD,IfElseD</code> where <code>IfElseD</code> may be replaced with <code>IfLessThanOrEqualD</code>, or removed completely.</p>

<p><code>ModuloD</code> is not the normal modulo operator. Instead it is "protected modulo" where the arguments are first converted to integers and then taken modulo. However, if the second argument is 0 (zero), the result is 0 (zero). This function is represented as either <code>modp</code> or <code>%</code> below. </p>

<p>The program found a lot of solutions with error 1 (for year 1900). </p>

<p>Using <code>IfLessThanOrEqualD</code><br />
<code><br />
  if(y &lt;= ((modp(y,(y / 471.0))) * (296.0 * y))) { (y - y) } else { (327.0 / 327.0) }<br />
</code></p>

<p>Without <code>IfElseD</code>:<br />
<code><br />
(326.0 / (((((y - 536.0) % 536.0) + y) % (y / 226.0)) + 326.0)) % (283.0 % y)<br />
(y / (((y * 654.0) % (24.0 % y)) + y)) % y<br />
(y / (((y * (330.0 % y)) % (24.0 % y)) + y)) % y<br />
</code></p>

<p> <li><a href="http://www.hakank.org/jgap/number_puzzle4.conf">number_puzzle4.conf</a><br><br />
Number puzzle inspired by Richard Wiseman's <a href="http://richardwiseman.wordpress.com/2010/02/26/its-the-friday-puzzle-48/">It's the Friday Puzzle</a> (2010-02-26). The problem is to find the result 24 from the numbers 5,5,5,1 and the operators +,-,*,/. However, the requirement that the numbers should be used exactly once is not enforced here. (It would be quite useful to have this kind of "global function" requiring that all variables should be different, or used exactly once etc. Compare with "global constraints" in <a href="http://www.hakank.org/constraint_programming_blog/global_constraints/">constraint programming</a>.)</p>

<p>Note also that this configuration uses only one fitness case and lets the program find any solution that complies with the equation. It also uses the new option <code>no_terminals</code> for using just variables (no Terminal numbers) which was implemented for this example.</p>

<p>Here is a result from a sample run. The number in [] is the number of occurrences of the specific program. In this example we also see the new option <code>similiar_sort_method: length</code> at work, which sorts the similar solutions according to length (normally they are sorted on the number of occurrences). The variables in the solutions mean: a = 5, b = 5, c = 5 and d = 1.<br />
<code><br />
All solutions with the best fitness (0.0):<br />
Sort method: length<br />
(b * c) - d [5]<br />
(a * c) - d [4162]<br />
(b * b) - d [4]<br />
(c * a) - d [251]<br />
(a * a) - d [10]<br />
(c * c) - d [424]<br />
(c * b) - d [1]<br />
(b * a) - d [36]<br />
(c - d) * (a + d) [1]<br />
(b * a) - (b / c) [121]<br />
(b * a) - (a / c) [2]<br />
(c * b) - (c / c) [5]<br />
(b * b) - (a / a) [3]<br />
(c * a) - (b / b) [2]<br />
(a * c) - (d * d) [633]<br />
(a - d) * (d + b) [4]<br />
(c * b) - (a / c) [1]<br />
(a * b) - (c / b) [2]<br />
(c * c) - (b / b) [1]<br />
It was 19 different solutions with fitness 0.0<br />
</code></p>

<p>None of these is a solution to Wiseman's puzzle, since each of them either leaves a variable out or uses a variable more than once. </p>

<p>Here we have limited the number of nodes with <code>max_nodes: 7</code> (4 variables + 3 operators), but there is no standard option in JGAP to state the minimum number of nodes. However, with a "node validator" this could probably be done. I plan to experiment more with node validators for this kind of constraint and the "global functions" mentioned above.</p>

<p>  <li><a href="http://www.hakank.org/jgap/sunspots_timeseries.conf">sunspots_timeseries.conf</a><br><br />
Two versions of sunspots data using <code>make_time_series</code>. See below for more about this option.</p>

<p> <li><a href="http://www.hakank.org/jgap/timeseries_test1.conf">timeseries_test1.conf</a><br><br />
Some other examples of <code>make_time_series</code>. See below.</p>

<p> <li><a href="http://www.hakank.org/jgap/timeseries_dailyisbn.conf">timeseries_dailyisbn.conf</a><br><br />
Another time series example: the classic time series "Daily closing price of IBM stock, Jan 1, 1980 to Oct. 8, 1992" , <a href="http://www.robjhyndman.com/TSDL/data/DAILYIBM.DAT">DAILYIBM.DAT</a> from Rob J Hyndman's <a href="http://www.robjhyndman.com/TSDL/">TSDL</a> (Time Series Data Library)<br />
</ul></p>
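<p>For the <code>leap_years.conf</code> example above, the "protected modulo" and the fitness cases can be sketched in Python as follows. This is only an illustration under assumptions: the sign convention for negative arguments (here that of <code>math.fmod</code>, which matches Java's <code>%</code>) and the encoding of the target as 1.0 for leap years and 0.0 otherwise are guesses, not taken from the configuration file.</p>

```python
import calendar
import math

def modp(x, y):
    """Protected modulo: truncate both arguments to integers; 0 if the divisor is 0."""
    xi, yi = int(x), int(y)
    if yi == 0:
        return 0.0
    return math.fmod(xi, yi)  # sign convention is an assumption

# Fitness cases: all years 1890..2030 plus the listed century years.
years = list(range(1890, 2031)) + [1200, 1300, 1400, 1500, 1600, 1700, 1800]
cases = [(year, 1.0 if calendar.isleap(year) else 0.0) for year in years]

def is_leap(year):
    """The textbook rule, written with the protected modulo."""
    return modp(year, 4) == 0 and (modp(year, 100) != 0 or modp(year, 400) == 0)

print(len(cases), modp(7.9, 2.1), modp(7, 0))
```

<p>The year 1900, for which many of the found programs have their single error, is divisible by 4 and by 100 but not by 400, so it is not a leap year.</p>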

<h2>make_time_series</h2>
The option <code>make_time_series</code> may require some explanation.

<p>The following configuration file is all that is needed for the Fibonacci problem (in time series representation). Actually, the two lines in bold are the only ones needed, since the other options have defaults that would work well here.<br />
<code><br />
<b>make_time_series: true</b><br />
<b>num_input_variables: 4</b><br />
terminal_range: -10 10<br />
functions: Multiply,Divide,Add,Subtract<br />
max_init_depth: 4<br />
population_size: 100<br />
num_evolutions: 100<br />
max_crossover_depth: 8<br />
max_nodes: 21<br />
data<br />
1,1,2,3,5,8,13,21,34,55,89,144,233,377,610,987,1597,2584,4181,6765,10946,17711,28657,46368<br />
</code></p>

<p>The option <code>make_time_series</code> will then transform the data into a data set and then proceed as if the data set had been stated explicitly. Note: the SymbolicRegression program works with doubles, hence the somewhat unusual presentation.</p>

<p>The number of time lags is the number of input variables (<code>num_input_variables</code>) + 1 for the output variable; here 4 + 1 = 5 time lags.  The program prints the transformed data first, i.e.:<br />
<code><br />
Making timeseries, #elements: 24<br />
1.0 1.0 2.0 3.0 5.0<br />
1.0 2.0 3.0 5.0 8.0<br />
2.0 3.0 5.0 8.0 13.0<br />
3.0 5.0 8.0 13.0 21.0<br />
5.0 8.0 13.0 21.0 34.0<br />
8.0 13.0 21.0 34.0 55.0<br />
13.0 21.0 34.0 55.0 89.0<br />
21.0 34.0 55.0 89.0 144.0<br />
34.0 55.0 89.0 144.0 233.0<br />
55.0 89.0 144.0 233.0 377.0<br />
89.0 144.0 233.0 377.0 610.0<br />
144.0 233.0 377.0 610.0 987.0<br />
233.0 377.0 610.0 987.0 1597.0<br />
377.0 610.0 987.0 1597.0 2584.0<br />
610.0 987.0 1597.0 2584.0 4181.0<br />
987.0 1597.0 2584.0 4181.0 6765.0<br />
1597.0 2584.0 4181.0 6765.0 10946.0<br />
2584.0 4181.0 6765.0 10946.0 17711.0<br />
4181.0 6765.0 10946.0 17711.0 28657.0<br />
It was 19 data rows<br />
</code></p>
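<p>The transformation is essentially a sliding window over the series. The following Python sketch shows the idea; the function is a hypothetical re-implementation, not the Java code. Note that a plain sliding window of width 5 over 24 values yields 20 rows (including a final row ending in 46368), whereas the listing above shows 19 rows, so the program's exact handling of the end of the series is not reproduced here.</p>

```python
def make_time_series(series, num_input_variables):
    """Turn a flat series into rows of num_input_variables inputs plus one output."""
    width = num_input_variables + 1  # inputs plus the output column
    rows = []
    for start in range(len(series) - width + 1):
        window = series[start:start + width]
        rows.append([float(value) for value in window])
    return rows

fib = [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987,
       1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368]
rows = make_time_series(fib, 4)
for row in rows[:3]:
    print(" ".join(str(value) for value in row))
```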

<p>And then, as mentioned above, the program proceeds as usual. See <a href="http://www.hakank.org/arrays_in_flux/2010/02/symbolic_regression_using_genetic_programming_with_jgap_1.html">Symbolic regression (using genetic programming) with JGAP</a><br />
</p>]]></description>
         <link>http://www.hakank.org/arrays_in_flux/2010/02/symbolic_regression_with_jgap_some_improvements.html</link>
         <guid>http://www.hakank.org/arrays_in_flux/2010/02/symbolic_regression_with_jgap_some_improvements.html</guid>
         <category>Symbolic regression</category>
         <pubDate>Sun, 28 Feb 2010 18:36:23 +0100</pubDate>
      </item>
            <item>
         <title>Experimenting with Eureqa&apos;s API II: eureqa_cli</title>
         <description><![CDATA[<p>In <a href="http://www.hakank.org/arrays_in_flux/2010/02/experimenting_with_eureqas_api.html">Experimenting with Eureqa's API</a>, I mentioned a simple C++ program using <a href="http://ccsl.mae.cornell.edu/eureqa">Eureqa</a>'s  <a href="http://code.google.com/p/eureqa-api/">API</a>. Now I have written a program with more command line options and flexibility: <a href="http://www.hakank.org/eureqa/eureqa_cli.cpp">eureqa_cli.cpp</a>. It is also available from <a href="http://www.hakank.org/eureqa/">my Eureqa page</a>.</p>

<h3>eureqa_cli</h3>
Running the program <code>eureqa_cli</code> without any arguments shows the valid options:

<p><code><br />
Syntax:<br />
&nbsp;&nbsp;&nbsp;&nbsp; eureca_cli datafile relationship functions fitness_method population_size crossover_prob mutation_prob<br />
where only the data file and relationship must be stated.</p>

<p>...</p>

<p></code></p>

<p>It then lists all the valid options for functions and fitness methods, see below under <b>Full help notice</b>. Also see Eureqa's <a href="http://code.google.com/p/eureqa-api/wiki/doc_intro">API</a> for more information about Eureqa's options. I have not added any function of my own (because this is not possible at the moment) and so use what is available in Eureqa.</p>

<h3>Default values</h3>
The default values of <code>eureqa_cli</code> are:<ul>  <li> functions (building blocks): "a a+b a-b a*b a/b"
  <li> fitness: "absolute_error"
  <li> population_size: 100
  <li> crossover_probability = 0.5
  <li> mutation_probability = 0.01</ul>
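<p>How the optional positional arguments fall back to these defaults can be sketched as follows. This is a Python illustration only (the real program is C++); the function name <code>parse_cli</code> and the dictionary keys are assumptions for the example.</p>

```python
# (name, default value, converter) in the order given in the syntax line.
DEFAULTS = [
    ("functions", "a a+b a-b a*b a/b", str),
    ("fitness_method", "absolute_error", str),
    ("population_size", 100, int),
    ("crossover_prob", 0.5, float),
    ("mutation_prob", 0.01, float),
]

def parse_cli(argv):
    """Map positional arguments to options; only the first two are required."""
    datafile, relationship, *optional = argv  # raises ValueError if fewer than two
    options = {"datafile": datafile, "relationship": relationship}
    for index, (name, default, convert) in enumerate(DEFAULTS):
        options[name] = convert(optional[index]) if len(optional) > index else default
    return options

minimal = parse_cli(["number_puzzle1.txt", "z = f(x,y)"])
full = parse_cli(["fib_38_ix.txt", "t1 = f(ix)", "a a+b a-b a*b a/b a^b sqrt(a)",
                  "squared_error", "1000", "0.9", "0.10"])
print(minimal)
print(full)
```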

<p>The following parameters are set as the default values from Eureqa, but are not options to the program:<ul>  <li> normalize_fitness_by_ = 10.0;<br />
  <li> predictor_population_size_ = 10;<br />
  <li> trainer_population_size_ = 10;<br />
  <li> predictor_crossover_probability_ = 0.5;<br />
  <li> predictor_mutation_probability_ = 0.2;<br />
  <li> implicit_derivative_dependencies_ = "";</ul></p>

<h3>Examples</h3>
Here are some examples using the program. The data files are at <a href="http://www.hakank.org/eureqa/">my Eureqa page</a>.<ul> <li> <code>eureqa_cli number_puzzle1.txt "z = f(x,y)"</code>
 <li> <code>eureqa_cli fib_38_ix.txt "t1 = f(ix)" "a a+b a-b a*b a/b a^b sqrt(a)"</code>
 <li> <code>eureqa_cli boyles_law.txt "PV = f(P,V)"</code>
 <li> <code>eureqa_cli p4_1.txt "y = f(x)" "a a+b a-b a*b a/b" "absolute_error"</code>
 <li> <code>eureqa_cli two_spirals.txt "z = f(x,y)" "a a+b a-b a*b a/b sin(a) cos(a) exp(a) log(a)"</code>
 <li>  <code>eureqa_cli fib_38_ix.txt "t1 = f(ix)" "a a+b a-b a*b a/b a^b sqrt(a)" "squared_error"  1000 0.9 0.10 </code> (with population size 1000, crossover probability 0.9, and mutation probability 0.10)</ul>

<p>See <a href="http://www.hakank.org/arrays_in_flux/2010/02/experimenting_with_eureqas_api.html">Experimenting with Eureqa's API</a> for output of similar  problems.</p>

<h3>Eureqa server</h3>
This program requires that the <b>Eureqa server</b> (the program <code>eureqa_server</code>) has been started. See <a href="http://www.hakank.org/arrays_in_flux/2010/02/experimenting_with_eureqas_api.html">Experimenting with Eureqa's API</a> for some more about this.

<h3>Full help notice</h3>
Here is the full help notice of the program:

<p><code><br />
eureqa_cli is a command line interface to Eureqa's eureqa_server<br />
Syntax:<br />
&nbsp;&nbsp;&nbsp;&nbsp;eureca_cli datafile relationship functions fitness_method population_size crossover_prob mutation_prob<br />
where only the data file and relationship must be stated</p>

<p>Valid functions (building blocks):<br />
 * constant: 1.34<br />
 * data variable: x<br />
 * addition: x+y<br />
 * subtraction: x-y<br />
 * multiplication: x*y<br />
 * division: x/y<br />
 * power: x^y<br />
 * exponential: exp(x)<br />
 * logarithm: log(x)<br />
 * sine: sin(x)<br />
 * cosine: cos(x)<br />
 * absolute value: abs(x)<br />
 * tangent: tan(x)<br />
 * two-input arctangent: atan2(x,y)<br />
 * minimum of two: min(x,y)<br />
 * maximum of two: max(x,y)<br />
 * square root: sqrt(x)<br />
 * gamma function: gamma(x)<br />
 * gaussian function: gauss(x)<br />
 * logistic function: logistic(x)<br />
 * hill function, power 2: hill2(x)<br />
 * step function: step(x)<br />
 * sign function: sign(x)<br />
 * arcsine: asin(x)<br />
 * arccosine: acos(x)<br />
 * arctangent: atan(x)<br />
 * hyperbolic sine: sinh(x)<br />
 * hyperbolic cosine: cosh(x)<br />
 * hyperbolic tangent: tanh(x)<br />
 * inverse hyperbolic sine: asinh(x)<br />
 * inverse hyperbolic cosine: acosh(x)<br />
 * inverse hyperbolic tangent: atanh(x)<br />
Special building blocks:<br />
 * equals: y = f(x)<br />
 * search formula: y = f(x)<br />
 * derivative: D(y,t) = f(x,y)</p>

<p>Valid fitness methods:<br />
 * absolute_error<br />
 * squared_error<br />
 * root_squared_error<br />
 * logarithmic_error<br />
 * explog_error<br />
 * correlation<br />
 * minimize_difference<br />
 * akaike_information<br />
 * bayesian_information<br />
 * maximum_error<br />
 * median_error<br />
 * implicit_error<br />
 * count</p>

<p>For more information about this program, see http://www.hakank.org/eureqa/<br />
Eureqa's homepage: http://ccsl.mae.cornell.edu/eureqa/<br />
</code><br />
</p>]]></description>
         <link>http://www.hakank.org/arrays_in_flux/2010/02/experimenting_with_eureqas_api_ii_eureca_cli_1.html</link>
         <guid>http://www.hakank.org/arrays_in_flux/2010/02/experimenting_with_eureqas_api_ii_eureca_cli_1.html</guid>
         <category>Symbolic regression</category>
         <pubDate>Sun, 28 Feb 2010 09:55:08 +0100</pubDate>
      </item>
            <item>
         <title>Experimenting with Eureqa&apos;s API</title>
         <description><![CDATA[<p>In <a href="http://www.hakank.org/arrays_in_flux/2010/02/eureqa_version_078beta_released.html">Eureqa version 0.78beta released</a> I mentioned that there is an <a href="http://code.google.com/p/eureqa-api/">API</a> for connecting to the Eureqa server. Now I have tested it, and it is really nice.</p>

<h2>Installation</h2>
I followed the steps in <a href="http://code.google.com/p/eureqa-api/wiki/getting_started_on_unix_variants">Getting Started on Linux or Mac</a> (the Windows variant is <a href="http://code.google.com/p/eureqa-api/wiki/getting_started_on_windows">here</a>). Here are some comments and findings during this installation and preparation step.

<p>Before starting anything Eureqa related, I had to install a newer version of the <a href="http://www.boost.org/">Boost library</a> since Eureqa requires version 1.42.0. It took about half an hour, but there were no problems during this step.</p>

<p>The <a href="http://eureqa-api.googlecode.com/files/eureqa_api_1_00_0.zip">Eureqa API archive</a> must be downloaded, and unpacked.</p>

<p>After these preliminaries, I first tested the simplest example: <code>minimal_client</code>. Unfortunately it didn't work out of the box on my Mandriva Linux machine, and I had to add two things (bold below) in the Makefile:<br />
<pre>minimal_client: minimal_client.o<br />
	g++ minimal_client.o \<br />
	<b>$(BOOST_LIBRARY_PATH)libboost_thread.a \</b><br />
	$(BOOST_LIBRARY_PATH)libboost_system.a \<br />
	$(BOOST_LIBRARY_PATH)libboost_serialization.a \<br />
	-o minimal_client <b>-lpthread</b></pre></p>

<p>The Makefile for the other example, <code>basic_client</code>, already has these lines, and worked without any problems.</p>

<p>Before running the program, a running Eureqa standalone <b>server</b> is needed. It can be downloaded from <a href="http://ccsl.mae.cornell.edu/eureqa_download">Eureqa's download page</a>, or from the directory <code>./server</code> in the installed API archive. The real work is done in the Eureqa server. The client program first tells the server the conditions of the run (what data, variables, and functions to use), and later on asks the server for new/better solutions, which are then presented by the client program.</p>

<p>To start the server:<br />
<code>  ./eureqa_server &amp; </code></p>

<p>Now we are ready to start the <code>minimal_client</code> program. This example reads the data file <code>../data_sets/default_data.txt</code> (it seems to be the same as the default data set in the Eureqa GUI). </p>

<p><code>./minimal_client</code></p>

<p>Here are the first lines of output from the program. If you have run the GUI version of Eureqa (which is really recommended) you will recognize most of this output.</p>

<p><code>Data: 100 data points, 3 variables<br />
Options: "y = f(x)", 8 building-block types, Absolute Error fitness<br />
Connection: Connected to 127.0.0.1<br />
Server: xxxxxxxx, Eureqa 0.78 (linux), 2 CPU cores<br />
0 generations, 1864 evaluations<br />
Size:   Fitness:        Equation:<br />
-----   --------        ---------<br />
7       -1.4854 f(x) = -1.50204e-07 + sin(-1.50204e-07 + x)</p>

<p>39 generations, 764432 evaluations<br />
Size:   Fitness:        Equation:<br />
-----   --------        ---------<br />
7       -1.4854 f(x) = -1.50204e-07 + sin(-1.50204e-07 + x)<br />
1       -1.73044        f(x) = x</p>

<p>173 generations, 4.04115e+06 evaluations<br />
304 generations, 7.28129e+06 evaluations<br />
458 generations, 1.04966e+07 evaluations<br />
Size:   Fitness:        Equation:<br />
-----   --------        ---------<br />
7       -1.4854 f(x) = -1.50204e-07 + sin(-1.50204e-07 + x)<br />
1       -1.73044        f(x) = x<br />
5       -1.61304        f(x) = sin(x/x)<br />
...</code></p>

<p>A small issue: I don't understand why the fitness is negative here; an absolute error should never be negative. Maybe it's just a tiny presentation bug, with a misplaced "-"?</p>

<h2>Example: Closed form of Fibonacci number</h2>
In order to test the API more, I tried one of the problems from <a href="http://translate.google.se/translate?u=http%3A%2F%2Fwww.hakank.org%2Fwebblogg%2Farchives%2F001354.html&sl=sv&tl=en&hl=&ie=UTF-8">Eureqa: Equation discovery with genetic programming</a>, namely trying to find a closed form of the Fibonacci numbers.

<p>The program <a href="http://www.hakank.org/eureqa/eureqa_apitest1.cpp">eureqa_apitest1.cpp</a> is based on the example <code>eureqa_api_1_00_0/examples/minimal_client/minimal_client.cpp</code> mentioned above. The changes are not big, but some common options have been made explicit:<br />
<ul>  <li> building_blocks<br />
All the building blocks that are in the GUI client seem to be supported via the API, see <a href="http://code.google.com/p/eureqa-api/wiki/doc_building_blocks">building blocks</a> for a full list. Instead of relying on the default building blocks, they have been stated explicitly, and the functions power (<code>a^b</code>) and square root (<code>sqrt(a)</code>) were added (the sine and cosine functions were removed).<br />
<code>options.building_blocks_.clear();<br />
options.building_blocks_.push_back("a"); // variables<br />
options.building_blocks_.push_back("a+b"); // adds<br />
options.building_blocks_.push_back("a-b"); // subtracts<br />
options.building_blocks_.push_back("a*b"); // multiplies<br />
options.building_blocks_.push_back("a/b"); // divides<br />
options.building_blocks_.push_back("a^b"); // power<br />
options.building_blocks_.push_back("sqrt(a)"); // sqrt</code></p>

<p>Note that the names in the building blocks don't have to match the variable names in the data file.</p>

<p>  <li> search_relationship<br />
The relationship, i.e. the formula we want to find, is stated in the same way as in the GUI: <code>t1 = f(ix)</code>:<br />
<pre>options.search_relationship_ = "t1 = f(ix)";</pre></p>

<p>  <li> fitness_metric<br />
Also, I stated the fitness metric (which happens to be the default):<br />
<code>options.fitness_metric_ = eureqa::fitness_types::absolute_error;</code></p>

<p>There are more fitness metrics to use, see <a href="http://code.google.com/p/eureqa-api/wiki/doc_fitness_types">Fitness Metric Identifiers</a>.</ul></p>

<p>Well, that's about it.</p>

<p><br />
The program reads the file <a href="http://www.hakank.org/eureqa/fib_38_ix.txt">fib_38_ix.txt</a> consisting of the first 38 Fibonacci numbers with the index (1..38). Note: In this problem we just use the first two variables in the file, <code>ix</code> and <code>t1</code>. The instances for 39..50 have been commented out to make it simpler.</p>

<p>The objective is to find the closed form of the Fibonacci numbers, which is usually stated as:<br />
<code><br />
(phi^n - (1-phi)^n)/sqrt(5)<br />
</code>  <br />
where phi = (1+sqrt(5))/2 ~ 1.61803 (the golden ratio), and sqrt(5) ~ 2.2361.</p>

<p>See <a href="http://en.wikipedia.org/wiki/Fibonacci_number#Closed_form_expression">Fibonacci_number#Closed_form_expression</a> (Wikipedia) for more about this.</p>
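<p>The closed form can be checked numerically against the recurrence for the 38 indices used in the data file. The following Python sketch is only a sanity check of the formula above; the function names are made up for the example.</p>

```python
import math

def fib_closed_form(n):
    """Closed form: (phi^n - (1 - phi)^n) / sqrt(5)."""
    phi = (1 + math.sqrt(5)) / 2
    return (phi ** n - (1 - phi) ** n) / math.sqrt(5)

def fib_iterative(count):
    """The first count Fibonacci numbers, starting 1, 1, 2, ..."""
    values, a, b = [], 1, 1
    for _ in range(count):
        values.append(a)
        a, b = b, a + b
    return values

exact = fib_iterative(38)
approx = [round(fib_closed_form(n)) for n in range(1, 39)]
print(exact == approx)
```

<p>With doubles the rounded closed form agrees with the recurrence for all of ix = 1..38.</p>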

<p>Here is one result (the 6 best solutions) from running the program for a couple of minutes. Since the program doesn't have any stop criteria it will run forever if not manually stopped.<br />
<pre>    Size:   Fitness:        Equation:<br />
    -----   --------        ---------<br />
    7       -104.178        f(ix) = 1.61808^(ix - 1.67436)<br />
    9       -103.999        f(ix) = 1.61808^(ix - 1.67436) + 1.61808<br />
    11      -101.371        f(ix) = 1.61808^(ix - 1.67436) + ix - 1.67436<br />
    5       -79382.2        f(ix) = 1.58323^ix<br />
    1       -2.55834e+06    f(ix) = ix<br />
    3       -2.53729e+06    f(ix) = ix/0.00018853</pre><br />
 </p>

<p><br />
The first solution in the list has a fitness error of about 104: <code>1.61808^(ix - 1.67436)</code>.<br />
Note the constant 1.61808 which is quite close to phi (1.61803).</p>

<p>When rounded, this program (solution) gives the following results for ix = 1..38. It is correct for the first 15 numbers (1..15), but will then deviate.<br />
<code>  1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 986(987), 1596(1597), 2583(2584), 4179(4181), 6762(6765), 10941(10946), 17703(...), 28646, 46351, 75000, 121355, 196362, 317730, 514113, 831876, 1346042, 2178003, 3524183, 5702410, 9226955, 14929952, 24157857, 39089345</code></p>

<p>The correct sequence is:<br />
<code>1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025, 121393, 196418, 317811, 514229, 832040, 1346269, 2178309, 3524578, 5702887,9227465, 14930352, 24157817, 39088169</code></p>

<p>Here is the deviation from the correct sequence:<br />
<code> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, -1, -2, -3, -5, -8, -11, -17, -25, -38, -56, -81, -116, -164, -227, -306, -395, -477, -510, -400, 40, 1176</code></p>
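<p>The comparison can be reproduced with a few lines of Python. Note the assumption: the constants below are the ones printed in the result table, which are presumably rounded by the program, so for large ix the values may differ somewhat from the numbers listed above.</p>

```python
def evolved(ix):
    """The best evolved expression, with the constants as printed in the table."""
    return 1.61808 ** (ix - 1.67436)

def fibonacci(count):
    """The first count Fibonacci numbers, starting 1, 1, 2, ..."""
    values, a, b = [], 1, 1
    for _ in range(count):
        values.append(a)
        a, b = b, a + b
    return values

exact = fibonacci(38)
rounded = [round(evolved(ix)) for ix in range(1, 39)]
deviation = [r - e for r, e in zip(rounded, exact)]
# First index (1-based) where the rounded expression misses the Fibonacci number.
first_miss = next(ix for ix, dev in enumerate(deviation, start=1) if dev != 0)
print(first_miss, deviation)
```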

<p>Maybe this is a wrong track, but it is nice to see the solutions evolve, which is one advantage of symbolic regression, and genetic programming in general.</p>

<h2>Documentation</h2>
The <a href="http://code.google.com/p/eureqa-api/">API documentation</a> is well structured and all pages have small examples, making it easy to start programming. Later experiments may require some reading in the included C++ header files.

<p>Some useful pages:<ul><li> <a href="http://code.google.com/p/eureqa-api/wiki/doc_search_options">search options</a><br />
<li> <a href="http://code.google.com/p/eureqa-api/wiki/doc_building_blocks">building-blocks</a><br />
<li> <a href="http://code.google.com/p/eureqa-api/wiki/doc_fitness_types">fitness types</a></ul></p>

<h2>Other comments</h2>
I will continue experimenting with Eureqa and its API by writing a more general program, etc. However, it will probably not be as general as my <a href="http://www.hakank.org/jgap/">JGAP symbolic regression</a> program.

<p>Also, see <a href="http://www.hakank.org/eureqa/">my Eureqa page</a>.<br />
</p>]]></description>
         <link>http://www.hakank.org/arrays_in_flux/2010/02/experimenting_with_eureqas_api.html</link>
         <guid>http://www.hakank.org/arrays_in_flux/2010/02/experimenting_with_eureqas_api.html</guid>
         <category>Genetic programming/algorithms</category>
         <pubDate>Thu, 25 Feb 2010 18:54:29 +0100</pubDate>
      </item>
            <item>
         <title>Eureqa version 0.78beta released</title>
         <description><![CDATA[Yesterday version 0.78beta of <a href="http://ccsl.mae.cornell.edu/eureqa">Eureqa</a> was released, and can be downloaded <a href="http://ccsl.mae.cornell.edu/eureqa_download">here</a>.
<br><br>
One new feature (which I haven't tested yet) is the Library and API. There are some examples in the distribution. See <a href="http://code.google.com/p/eureqa-api/">Eureqa API</a> site at Google Code for more information about this.
<br><br>
Other new features in this version (from the <a href="http://ccsl.mae.cornell.edu/eureqa_download">Download page</a>):<blockquote><ul> <li>reduced lag that servers report new solutions
 <li>projects now save the smoothing preprocessing
 <li>improved the ordering/display of the best solutions list
 <li>improved the seeding previous solution method
 <li>improved the AIC and BIC fitness metrics
 <li>added ability to right-click a plot and copy its data to the clipboard
 <li>added ability to start a search from the command line
 <li>added ability to chose the training/validation data split in the advanced options
 <li>added check to normalize data values with large offset or scale
 <li>fixed bug when loading projects that could clear results
 <li>fixed bug where resuming a search could fail to keep the previous results
 <li>fixed bug where seeded equations were not recognized
 <li>fixed bug where the fitness metric weighting was ignored
 <li>fixed several minor user interface annoyances
 <li>made compatible with the new open-source API</ul></blockquote>

Other updated pages:<ul> <li> <a href="http://ccsl.mae.cornell.edu/eureqa_bugs">Known issues</a>
 <li> <a href="http://ccsl.mae.cornell.edu/eureqa_requests">Requested features</a></ul>

Also, see: <a href="http://www.hakank.org/eureqa/">My Eureqa page</a> and <a href="http://translate.google.se/translate?u=http%3A%2F%2Fwww.hakank.org%2Fwebblogg%2Farchives%2F001354.html&sl=sv&tl=en&hl=&ie=UTF-8">Eureqa: Equation discovery with genetic programming</a>.
]]></description>
         <link>http://www.hakank.org/arrays_in_flux/2010/02/eureqa_version_078beta_released.html</link>
         <guid>http://www.hakank.org/arrays_in_flux/2010/02/eureqa_version_078beta_released.html</guid>
         <category>Genetic programming/algorithms</category>
         <pubDate>Mon, 22 Feb 2010 17:52:20 +0100</pubDate>
      </item>
            <item>
         <title>Symbolic regression (using genetic programming) with JGAP</title>
         <description><![CDATA[Some weeks ago I tested the data analysis system <a href="http://ccsl.mae.cornell.edu/eureqa">Eureqa</a> and was very delighted by its functions and performance, and especially the concept of Symbolic Regression. To quote from Eureqa's home page, Eureqa is <i> detecting equations and hidden mathematical relationships in your data. Its primary goal is to identify the simplest mathematical formulas which could describe the underlying mechanisms that produced the data.</i>
<br><br>
I blogged about this in my Swedish blog <a href="http://www.hakank.org/webblogg/archives/001354.html">Eureqa: equation discovery med genetisk programmering</a> (Google translation to English <a href="http://translate.google.se/translate?u=http%3A%2F%2Fwww.hakank.org%2Fwebblogg%2Farchives%2F001354.html&sl=sv&tl=en&hl=&ie=UTF-8">Eureqa: Equation discovery with genetic programming</a>). Eureqa uses Symbolic Regression (using genetic programming) for calculating a mathematical formula given some data points. For more about Eureqa, see <a href="http://www.hakank.org/eureqa/">my Eureqa page</a>.

<h2>Symbolic Regression with JGAP</h2>
After that, I started to write my own system for symbolic regression using the genetic algorithm/genetic programming system <a href="http://jgap.sourceforge.net/">JGAP</a>, written in Java. You will find Java code and example configuration files on <a href="http://www.hakank.org/jgap/">My JGAP page</a>. 
<br><br>
Here I give some examples of the usage of symbolic regression with my program. At the end there are full lists of the defined functions, options, and configuration files.
<br><br>
I wrote my own system instead of just using Eureqa for a couple of reasons:
<ul>
<li> learning genetic programming, and symbolic regression, is probably better done by writing code
<li> it is easy to write new functions with JGAP, and I want to experiment with different things, new functions and options
<li> in some cases (like this) I like command line tools better than GUIs, together with a configuration file where the options for the learning and the data are collected
</ul>

<h2>What is symbolic regression (SR)?</h2>
Symbolic regression is, simply put, a way of using genetic programming (GP) to generate a mathematical formula given some data points. In some cases of genetic programming the results can be "real programs", but in this context they are mostly mathematical expressions. Note that symbolic regression is just one thing you can do with genetic programming.
<br><br>
For an example of symbolic regression, see <a href="http://www.genetic-programming.com/gpquadraticexample.html">this quadratic example</a> (from <a href="http://www.genetic-programming.com">www.genetic-programming.com</a>). Also see Wikipedia's <a href="http://en.wikipedia.org/wiki/Genetic_programming">Genetic Programming</a>.
<br><br>
What fascinates me most with genetic programming is that the result is a program which can be understood; it is a "white box" (clear box) technique. For some other machine learning techniques, say neural networks, it is very hard to understand the result; it's just a black box, where you may use the results but don't get any insights into the solution. Also, I tend to prefer other white box techniques such as decision trees and rule-based systems.
<br><br>
My SymbolicRegression program has a lot of options, but I will not explain or exemplify them all here. A full list of the options with a short description is presented below.

<br><br>
Let us start with some simple examples to understand what symbolic regression can do and how to do it with the program SymbolicRegression.

<h2>Simple example: number puzzle</h2>
Let's say we got the following puzzle:
<pre>
If we assume that:
2 + 3 = 10
7 + 2 = 63
6 + 5 = 66
8 + 4 = 96

How much is?
9 + 7 = ????
</pre>

This can be somewhat tricky to solve by hand (or head), or maybe we simply are too lazy to solve it by hand. Using symbolic regression is easier (although probably not that fun): Just create a configuration file like the one below. In this problem there is a specific unknown instance that we want to solve for, so we can write an instance with the <code>?</code> (question mark) in the place of the (unknown) output.
<pre>
presentation: Puzzle
num_input_variables: 2
variable_names: x y z
functions: Multiply,Divide,Add,Subtract
terminal_range: -10 10
terminal_wholenumbers: true
population_size: 100
num_evolutions: 100
show_similiar: true
data
2 3 10
7 2 63
6 5 66
8 4 96

# the unknown instance
9 7 ?
</pre>
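
The layout of such a file is simple enough to read with a few lines of code. The following Python sketch is not the parser used in SymbolicRegression.java; it is an assumed, simplified reading of the format (options as <code>key: value</code> lines, a <code>data</code> line, then one row per line, <code>#</code> for comments, and <code>?</code> marking an unknown output). The function name <code>parse_config</code> is hypothetical.

```python
def parse_config(text):
    """Split a configuration text into options, known rows and unknown rows."""
    options, rows, unknown = {}, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if not in_data:
            if line == "data":
                in_data = True  # everything after this line is data
            else:
                key, _, value = line.partition(":")
                options[key.strip()] = value.strip()
            continue
        # Data rows may be separated by spaces or commas.
        fields = line.replace(",", " ").split()
        if fields[-1] == "?":
            unknown.append([float(f) for f in fields[:-1]])
        else:
            rows.append([float(f) for f in fields])
    return options, rows, unknown


puzzle = """
presentation: Puzzle
num_input_variables: 2
variable_names: x y z
data
2 3 10
7 2 63
6 5 66
8 4 96

# the unknown instance
9 7 ?
"""

options, rows, unknown = parse_config(puzzle)
print(options["variable_names"].split())
print(len(rows), "data rows,", len(unknown), "unknown")
```

The sketch keeps the rows with a known output apart from the rows to be solved for, which mirrors the two row counts reported at the start of a run.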

We use the four arithmetic functions (*, /, +, -), coded as <code>Multiply,Divide,Add,Subtract</code>, and have a small population size (100) and just 100 generations. The other options are explained in more detail below.
<br><br>
Here is an (edited) sample run of the problem. 
<pre>
It was 4 data rows
It was 1 data rows in the user defined data set
Presentation: Puzzle
output_variable: z (index: 2)
input variable: x
input variable: y
function1: <b>&1 * &2</b>
function1: <b>/</b>
function1: <b>&1 + &2</b>
function1: <b>&1 - &2</b>
function1: 10.0
Creating initial population

Evolving generation 0/100(time from start:  0,05s)
Best solution fitness: <b>35.0</b>
Best solution: <b>x + ((y - x) + (10.0 * x))</b>
Depth of chrom: 3. Number of functions/terminals: 9 (4 functions, 5 terminals)
Correlation coefficient: <b>0.979073833348314</b>

Evolving generation 3/100(time from start:  0,15s)
Best solution fitness: <b>31.0</b>
Best solution: <b>(10.0 * x) + ((x * y) - (8.0 + y))</b>
Depth of chrom: 3. Number of functions/terminals: 11 (5 functions, 6 terminals)
Correlation coefficient: <b>0.9945949940306454</b>

Evolving generation 11/100(time from start:  0,39s)
Best solution fitness: <b>26.0</b>
Best solution: <b>((10.0 * x) + (x + (-7.0 * y))) + (x * y)</b>
Depth of chrom: 4. Number of functions/terminals: 13 (6 functions, 7 terminals)
Correlation coefficient: <b>0.969830602937701</b>

Evolving generation 14/100(time from start:  0,50s)
Best solution fitness: <b>22.0</b>
Best solution: <b>(x * x) + (4.0 * x)</b>
Depth of chrom: 2. Number of functions/terminals: 7 (3 functions, 4 terminals)
Correlation coefficient: <b>0.9726783536388712</b>

Evolving generation 17/100(time from start:  0,58s)
Best solution fitness: <b>0.0</b>
Best solution: <b>(x * y) + (x * x)</b>
Depth of chrom: 2. Number of functions/terminals: 7 (3 functions, 4 terminals)
Correlation coefficient: <b>1.0</b>

All time best (from generation 17)

Evolving generation 101/100(time from start:  1,71s)
Best solution fitness: <b>0.0</b>
Best solution: <b>(x * y) + (x * x)</b>
Depth of chrom: 2. Number of functions/terminals: 7 (3 functions, 4 terminals)
Correlation coefficient: <b>1.0</b>

Total time  1,71s

<b>All solutions with the best fitness (0.0):</b>
(x * x) + (x * y) (26)
(x * x) + (y * x) (2)
(x + y) * x (2)
(x * y) + (x * x) (98)
((x * y) + (x * x)) * ((2.0 - 2.0) + (2.0 / 2.0)) (1)
((x / (y / x)) + x) * y (1)
It was 6 different solutions with fitness 0.0

Testing the fittest program with user defined test data:
9.0 7.0    Result: 144.0
</pre>

Since it is a genetic programming system, the first generation - generation 0 - is a completely random population of programs. Note that the configuration states very few limits on program size and population size, and there are really no limits on the structure (see below).
<br><br>
The best fit program in this first generation, <code>x + ((y - x) + (10.0 * x))</code> has a quite bad <b>fitness</b> measure: 35; rather a long way from the goal of fitness 0 (the perfect score). The fitness is calculated as the sum of the absolute differences between the program's output for each data point and the real data point. (Note: One of my TODO:s is to have more alternatives of error measures.)
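<br><br>
To make the fitness measure concrete, here is a small Python sketch (not the Java code of the program) that scores some of the candidate expressions from the run above as the sum of absolute differences over the four data rows. The helper name <code>fitness</code> and the dictionary <code>candidates</code> are made up for this illustration.

```python
# The four known data rows of the puzzle: (x, y, z).
data = [(2, 3, 10), (7, 2, 63), (6, 5, 66), (8, 4, 96)]


def fitness(program):
    """Sum of absolute differences between the program's output and z."""
    return sum(abs(program(x, y) - z) for x, y, z in data)


# Best programs of three generations, copied from the run above.
candidates = {
    "generation 0": lambda x, y: x + ((y - x) + (10.0 * x)),
    "generation 14": lambda x, y: (x * x) + (4.0 * x),
    "generation 17": lambda x, y: (x * y) + (x * x),
}

for name, program in candidates.items():
    print(name, fitness(program))

# The unknown instance 9 + 7 evaluated with the perfect solution.
print(candidates["generation 17"](9, 7))
```

The values agree with the fitness reported in the log (35, 22 and 0), and the unknown instance evaluates to 144.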
<br><br>
Generation 3 has a somewhat better solution, as do generations 11 and 14. A perfect solution is found in generation 17: <code>(x * y) + (x * x)</code> with a fitness (error) of 0.0 and a correlation coefficient of 1.0 (a perfect fit between the program's output and the output variable). After the 100 generations, the best solution is printed again with the total time (about 1.7 seconds).
<br><br>
Since the option <code>show_similar: true</code> was set, all solutions with the same fitness score as the best are also shown:
<pre>
(x * x) + (x * y) (26)
(x * x) + (y * x) (2)
(x + y) * x (2)
(x * y) + (x * x) (98)
((x * y) + (x * x)) * ((2.0 - 2.0) + (2.0 / 2.0)) (1)
((x / (y / x)) + x) * y (1)
</pre> 

Some of these solutions are just permutations of the best solution, i.e. the places of the variable names or expressions are changed. Others are not very interesting either, or, like the 5th one, are not at all useful in this example. The numbers in parentheses after each solution are the number of occurrences of that solution. Luckily the 5th solution was generated only once.
<br><br>
During the evolution we also see that the <b>correlation coefficient</b> changes from <b>0.979073833348314</b> (which is rather good and unusual except in these easy problems) to a perfect fit <b>1.0</b>. 
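<br><br>
The reported correlation coefficient appears to be the Pearson correlation between the program's output and the real output values; at least, recomputing it that way for the generation 0 program reproduces the number in the log. A small Python sketch of that check (the function name <code>pearson</code> is only an illustrative helper, not part of the program):

```python
from math import sqrt

# The four known data rows of the puzzle: (x, y, z).
data = [(2, 3, 10), (7, 2, 63), (6, 5, 66), (8, 4, 96)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Output of the generation 0 program and the real output values.
predicted = [x + ((y - x) + (10.0 * x)) for x, y, _ in data]
actual = [z for _, _, z in data]
print(pearson(predicted, actual))
```

A high correlation together with a poor fitness, as in generation 0, means that the program follows the shape of the data but is still off in scale or offset.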

<br><br>
Other notes: 
<ul>
<li> I consciously selected a rather bad run for this simple example to be able to show the development over the generations. In other runs the problem was solved in generation 2 or 3, and sometimes even in generation 0. This indicates that it is a very easy problem that could actually have been solved by plain random search. For more serious (larger) problems this is not the case.
<li> The options in the configuration file are explained below. One of the trickiest things about genetic programming is to select the <code>functions</code> to use. In this problem it would suffice with just the functions <code>Add</code> and <code>Multiply</code>, but it is - of course - a special case.
</ul>

Note: This problem was taken from <a href="http://rogeralsing.com/">Roger Alsing</a>'s blog post the other day <a href="http://rogeralsing.com/2010/02/14/genetic-programming-code-smarter-than-you/">Genetic Programming: Code smarter than you</a>. Roger has also done a great Mona Lisa application where a picture of Mona Lisa is evolved using genetic programming: <a href="http://rogeralsing.com/2008/12/07/genetic-programming-evolution-of-mona-lisa/">Genetic Programming: Evolution of Mona Lisa</a>. Note: There is an example of a Mona Lisa application in the JGAP distribution.


<h3>Example: Fibonacci sequence, as lagged "time series"</h3>
Let's take another example, one that I wrote about in my <a href="http://www.hakank.org/webblogg/archives/001354.html">Eureqa posting</a>: Fibonacci numbers. The first numbers are 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946.
<br><br>
The relation between the numbers is, of course, that the next number F(n) is F(n-1) + F(n-2), e.g. 1 + 2 = 3, 2 + 3 = 5, 3 + 5 = 8 etc. It is a <b>recursive</b> equation. 
<br><br>
In order to analyze this kind of recursive sequence (in this program) we need to translate the sequence into a lagged "time series". Let us assume that we suspect that it is a recursive sequence but we don't know the exact relation or the number of previous elements that are needed. So we make a series where the output variable can be considered dependent on either 1, 2, or 3 variables. Like this:
<pre>
1,1,2,3
1,2,3,5
2,3,5,8
...
</pre>
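
This transformation is mechanical, so it is easy to script. A minimal Python sketch, assuming the sequence is given as a plain list and that each row should hold <code>lags</code> previous values followed by the value to predict (the function name <code>lagged_rows</code> is hypothetical and not part of the program):

```python
def lagged_rows(sequence, lags):
    """Return rows of `lags` consecutive values followed by the next value."""
    width = lags + 1
    return [sequence[i:i + width] for i in range(len(sequence) - width + 1)]


fib = [1, 1, 2, 3, 5, 8, 13, 21]
for row in lagged_rows(fib, lags=3):
    print(",".join(str(v) for v in row))
```

Each printed line has the same comma-separated layout as the data rows of the configuration file.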

After doing this transformation, we proceed in the same manner as the number problem above, i.e. create a configuration file (<a href="http://www.hakank.org/jgap/fib1.conf">fib1.conf</a>) with the options and the data:
<pre>
# Fibonacci with 3 variables
num_input_variables: 3
variable_names: F1 F2 F3 F4
functions: Multiply,Divide,Add,Subtract
terminal_range: -10 10
max_init_depth: 4
population_size: 20
max_crossover_depth: 8
num_evolutions: 100
max_nodes: 21
show_population: false
stop_criteria_fitness: 0
data
1,1,2,3
1,2,3,5
2,3,5,8
3,5,8,13
5,8,13,21
8,13,21,34
13,21,34,55
21,34,55,89
34,55,89,144
55,89,144,233
89,144,233,377
144,233,377,610
233,377,610,987
377,610,987,1597
610,987,1597,2584
987,1597,2584,4181
1597,2584,4181,6765
2584,4181,6765,10946
4181,6765,10946,17711
6765,10946,17711,28657
10946,17711,28657,46368
</pre>

And here is a complete run (slightly edited) where the correct solution is found in generation 10: <code>F3+F2</code>. Also, here we added the option <code>stop_criteria_fitness: 0</code> which makes the program exit as soon as the criterion has been reached.

<pre>
It was 21 data rows
Presentation: This is the Fibonacci series
output_variable: F4 (index: 3)
input variable: F1
input variable: F2
input variable: F3
function1: &1 * &2
function1: /
function1: &1 + &2
function1: &1 - &2
function1: 4.0

Evolving generation 0/100(time from start:  0,01s)
Best solution fitness: 17619.64
Best solution: ((F3 + F1) + 7.0) + (F1 / F3)
Depth of chrom: 3. Number of functions+terminals: 9 (4 functions, 5 terminals)
Correlation coefficient: 0.9999999998872628

Evolving generation 2/100(time from start:  0,06s)
Best solution fitness: 7050.70
Best solution: ((F2 + F2) * (F1 + F1)) / (((F2 + 7.0) + F2) + (F1 - F3))
Depth of chrom: 4. Number of functions+terminals: 17 (8 functions, 9 terminals)
Correlation coefficient: 0.9999999122259431

Evolving generation 9/100(time from start:  0,19s)
Best solution fitness: 6749.0
Best solution: (F2 / F2) + (4.0 * F1)
Depth of chrom: 2. Number of functions+terminals: 7 (3 functions, 4 terminals)
Correlation coefficient: 0.99999999956117

Evolving generation 10/100(time from start:  0,21s)
Best solution fitness: 0.0
Best solution: F3 + F2
Depth of chrom: 1. Number of functions+terminals: 3 (1 functions, 2 terminals)
Correlation coefficient: 1.0

Fitness stopping criteria (0.0) reached with fitness 0.0 at generation 10

All time best (from generation 10)

Evolving generation 10/100(time from start:  0,21s)
Best solution fitness: 0.0
Best solution: F3 + F2
Depth of chrom: 1. Number of functions+terminals: 3 (1 functions, 2 terminals)
Correlation coefficient: 1.0

Total time  0,21s

All solutions with the best fitness (0.0):
F3 + F2 (1)
It was 1 different solutions with fitness 0.0
</pre>


A variant where we remove the stopping criteria, and instead use 
<code>show_similar: true</code> results in the following similar
solutions:
<pre>
All solutions with the best fitness (0.0):
(F2 / 1.0) + F3 (1)
(3.0 / 3.0) * (F3 + F2) (1)
F3 + F2 (406)
F2 + F3 (7)
(-2.0 / -2.0) * (F3 + F2) (1)
It was 5 different solutions with fitness 0.0
</pre>

TODO: I have some plans to automatically generate this kind of lagged time series for a sequence, and hope it will be finished soon. This is not rocket science but it can be tricky in the details.


<h3>Example: Classification with the Iris data set</h3>
The <a href="http://en.wikipedia.org/wiki/Iris_flower_data_set">Iris data set</a> is a classic classification example. It is used to classify three different species (classes) of the Iris flower: Iris setosa, Iris virginica, and Iris versicolor. Since my system cannot - yet - handle nominal data, the three different classes are translated to the numbers 1.0, 2.0, and 3.0 respectively.
<br><br>
In the configuration file <a href="http://www.hakank.org/jgap/iris.conf">iris.conf</a> we use the following functions:
<pre>
 functions: Multiply,Divide,Add,Subtract,IfLessThanOrEqualD
</pre>

The function <code>IfLessThanOrEqualD</code> may be worth a comment: I am currently reading John Koza's first Genetic Programming book <a href="http://www.bokus.com/b/9780262111706.html">Genetic Programming: v. 1 On the Programming of Computers by Means of Natural Selection </a> (ISBN: 9780262111706). He mentions the <code>IFLTE</code> function (<code>if a &lt;= b then c else d</code>), which I implemented - by cloning and mutation of my <a href="http://www.hakank.org/jgap/IfElseD.java">IfElseD.java</a> function (see, programmers also use these operators :). As Koza writes on page 365 about the function, it can be used instead of the following functions:
 <ul>
   <li> &lt;
   <li> &lt;= 
   <li> &gt;
   <li> &gt;= 
   <li> if then else
 </ul>

However, it may be convenient to instead use these functions explicitly since the generated code may be clearer for the user. Koza also lists equals (<code>==</code>) in this list, but I'm not so convinced that IFLTE can properly replace this.
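<br><br>
How a single <code>IFLTE</code> can stand in for the other comparison functions can be sketched in Python. This is only an illustration of the idea, not the Java code of <code>IfLessThanOrEqualD</code>; all function names are hypothetical, and true/false are assumed to be encoded as 1.0/0.0. Equality can be expressed too, but only by nesting two calls.

```python
def iflte(a, b, c, d):
    """IFLTE: if a <= b then c else d."""
    return c if a <= b else d


# Comparisons expressed with IFLTE, returning 1.0 for true and 0.0 for false.
def less_or_equal(a, b):
    return iflte(a, b, 1.0, 0.0)


def greater(a, b):
    return iflte(a, b, 0.0, 1.0)


def greater_or_equal(a, b):
    return iflte(b, a, 1.0, 0.0)


def less(a, b):
    return iflte(b, a, 0.0, 1.0)


def if_then_else(cond, c, d):
    # cond is 1.0 (true) or 0.0 (false); "cond <= 0" means false.
    return iflte(cond, 0.0, d, c)


def equal(a, b):
    # Needs two nested IFLTE calls: a <= b and b <= a.
    return iflte(a, b, iflte(b, a, 1.0, 0.0), 0.0)


print(less(2.0, 3.0), greater(2.0, 3.0), equal(2.0, 2.0), equal(2.0, 3.0))
```

Note that exact equality of floating point values is rarely what one wants in a regression setting, which is one more reason to treat the equality case with some care.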
<br><br>
Some notes: 
<ul>
<li> this is a selected - but real - run; not all runs are this good (or bad)
<li> here we use the complete data set as fitness cases. Normally a certain percentage of the data should be used as a validation set (with the option <code>validation_pct</code>). See below for a discussion of this.
</ul>

We use a population of 100 and 100 generations for a rather fast (about 3 seconds) run. If the population were increased to - say - 1000 and with 1000 generations we would most surely get better results, although it takes some more time.
<pre>
population_size: 100
num_evolutions: 100
</pre>

Other configuration options in the file:

<ul>
<li> <code>input_variables: 4</code><br>
Here we define that the number of input variables is 4 (there is always one output variable). (Note: This should probably be automatically detected.)
<li> <code>variable_names: sl sw pl pw class</code><br>
Here are the names of the variables. They should be rather short to not clutter the output expression too much. The last variable (<code>class</code>) is the output variable. Note that it is possible to change which variable to use as output variable, with the option <code>output_variable</code>, stating the (0-based) position in the variable list. Here we could have the following option: <code>output_variable: 4</code>.
<li> <code>terminal_range: -20 20</code><br>
The range of the <code>Terminal</code> (ephemeral terminal), i.e. the numbers used in the solution expression.
<li> <code>terminal_wholenumbers: false</code><br>
This is a special option for stating that only whole numbers (or rather double representations of integers) should be used. In this example, we don't want to use only integers.
<li> <code>max_nodes: 31</code><br>
This is about the only structural restriction there is in genetic programming: This states the maximum number of nodes in the expression trees. Nodes are either functions or terminals (numbers). So, here we allow up to 31 functions or terminals.
<li> <code>mutation_prob: 0.01</code><br>
Probability of mutation. The probabilities of mutation and crossover are very important and may make the difference between an effective run and a slow progression. One has to experiment with these for larger problems.
<li> <code>crossover_prob: 0.9</code><br>
Probability of crossover. See comment above.
<li> <code>show_progression: true</code><br>
If this option is set, the generation number is printed on the last line of the command. Can be useful for larger problems.
<li> <code>show_results: true</code><br>
If true, the results of the fittest programs are shown.
<li> <code>result_precision: 5</code><br>
For <code>show_results</code>: The precision of the results.
<li> <code>hits_criteria: 0.5</code><br>
If the absolute difference between the result and the real value is equal to or below this value, it is considered a hit. If set, the fitness measure is the number of non-hits. The value of 0.5 is chosen since the three classes are represented as the values 1, 2, and 3, and if the difference is &lt;= 0.5 we have a selected class. (Thoughtful and late note: Maybe it would suffice to add the function <code>Round</code> or <code>Floor</code>?) 
</ul>
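
The hits-based fitness can be sketched as follows. This is an illustration in Python rather than the program's Java code; the four sample values are rows 22-25 of the generation 0 output in the run shown further down, and the function name <code>count_hits</code> is hypothetical.

```python
def count_hits(actual, predicted, hits_criteria=0.5):
    """Number of cases whose absolute error is at most hits_criteria."""
    return sum(1 for a, p in zip(actual, predicted) if abs(a - p) <= hits_criteria)


# Real class values and the program's results for four instances of class 1.
actual = [1.0, 1.0, 1.0, 1.0]
predicted = [0.98889, 0.87582, 1.58128, 0.70000]

hits = count_hits(actual, predicted)
fitness = len(actual) - hits  # fitness is the number of non-hits
print(hits, fitness)
```

Only the third instance is off by more than 0.5, so this small sample has three hits and a fitness of 1.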

Here is a very truncated run.

<pre>
It was 150 data rows
Presentation: Iris
output_variable: class (index: 4)
input variable: sl
input variable: sw
input variable: pl
input variable: pw
function1: &1 * &2
function1: &1 + &2
function1: &1 - &2
function1: /
function1: if(&1 <= &2) then (&3) else(&4)
function1: 16.624795863695333

Evolving generation 0/100(time from start:  0,16s)
Best solution fitness: 64.0
Best solution: ((sw * pl) - (pw * pw)) / ((sl - pl) * (sl - sw))
Depth of chrom: 3. Number of functions+terminals: 15 (7 functions, 8 terminals)
Correlation coefficient: 0.6621167550765015
Number of hits (<= 0.5): 86 (of 150 =  0,57)
Results for this program:

...

(22) 1.0: 0,98889 (diff: 0,01111)
(23) 1.0: 0,87582 (diff: 0,12418)
(24) 1.0: 1,58128 (diff: -0,58128) > 0.5!
(25) 1.0: 0,70000 (diff: 0,30000)

...

total diff: 111.08622273795162 (no abs diff: -42.32778738529426 #hits: 86 (of 150)


Evolving generation 1/100(time from start:  0,29s)
Best solution fitness: 48.0
Best solution: ((sw + pl) * sl) / (18.98720292178895 + pl)
Depth of chrom: 3. Number of functions+terminals: 9 (4 functions, 5 terminals)
Correlation coefficient: 0.8492617928834261
Number of hits (<= 0.5): 102 (of 150 =  0,68)
Results for this program:

...

total diff: 60.411404633990145 (no abs diff: 35.03754935483205 #hits: 102 (of 150)

Evolving generation 4/100(time from start:  0,61s)
Best solution fitness: 7.0
Best solution: (19.035483741865328 / 19.035483741865328) + pw
Depth of chrom: 2. Number of functions+terminals: 5 (2 functions, 3 terminals)
Correlation coefficient: 0.9564638238016164
Number of hits (<= 0.5): 143 (of 150 =  0,95)

...

total diff: 39.80000000000001 (no abs diff: -29.800000000000004 #hits: 143 (of 150)

Evolving generation 16/100(time from start:  0,94s)
Best solution fitness: 6.0
Best solution: (((19.035483741865328 / 19.035483741865328) + pw) / (sw / sw)) - ((if(pw <= pl) then (sw) else(pl)) / (sw + 14.495973076477487))
Depth of chrom: 4. Number of functions+terminals: 19 (8 functions, 11 terminals)
Correlation coefficient: 0.9582679236481598
Number of hits (<= 0.5): 144 (of 150 =  0,96)
Results for this program:

...

total diff: 26.831398327892167 (no abs diff: -3.7720516041580767 #hits: 144 (of 150)

Evolving generation 39/100(time from start:  1,54s)
Best solution fitness: 5.0
Best solution: (((19.035483741865328 / 19.035483741865328) + pw) / (sw / sw)) - ((if(pw <= pl) then (sw) else(pl)) / (((14.495973076477487 - (sl + sl)) / sw) + 14.495973076477487))
Depth of chrom: 6. Number of functions+terminals: 25 (11 functions, 14 terminals)
Correlation coefficient: 0.958786524094953
Number of hits (<= 0.5): 145 (of 150 =  0,97)
Results for this program:
(0) 1.0: 0,97740 (diff: 0,02260)
(1) 1.0: 1,01322 (diff: -0,01322)
(2) 1.0: 1,00110 (diff: -0,00110)
(3) 1.0: 1,00869 (diff: -0,00869)

...

(69) 2.0: 1,94192 (diff: 0,05808)
(70) 2.0: 2,59137 (diff: -0,59137) > 0.5!
(71) 2.0: 2,11718 (diff: -0,11718)

...

(118) 3.0: 3,11623 (diff: -0,11623)
(119) 3.0: 2,35925 (diff: 0,64075) > 0.5!
(120) 3.0: 3,08251 (diff: -0,08251)

...

(128) 3.0: 2,91459 (diff: 0,08541)
(129) 3.0: 2,39350 (diff: 0,60650) > 0.5!
(130) 3.0: 2,70539 (diff: 0,29461)
(131) 3.0: 2,73150 (diff: 0,26850)
(132) 3.0: 3,01459 (diff: -0,01459)
(133) 3.0: 2,31546 (diff: 0,68454) > 0.5!
(134) 3.0: 2,23094 (diff: 0,76906) > 0.5!
(135) 3.0: 3,08865 (diff: -0,08865)
(136) 3.0: 3,17414 (diff: -0,17414)

...

(148) 3.0: 3,07502 (diff: -0,07502)
(149) 3.0: 2,60513 (diff: 0,39487)
total diff: 26.224671119186706 (no abs diff: -0.04627940721407109 #hits: 145 (of 150)

...


Total time  2,91s
</pre>


The solution of the best fit program (generation 39) is this:
<pre>Solution:(((19.035483741865328 / 19.035483741865328) + pw) / (sw / sw)) - ((if(pw <= pl) then (sw) else(pl)) / (((14.495973076477487 - (sl + sl)) / sw) + 14.495973076477487))
</pre>

which is kind of funny looking. Here are some comments:
<ul>
  <li> <code>19.035483741865328 / 19.035483741865328</code> is 1. The program does not simplify such expressions, but this would be nice to have. (As I understand it, Eureqa has some kind of simplification process.)
  <li> the <code>if then else</code> is used as an expression, and its return value is used directly in the solution. Something we don't see here but may happen with other settings is that the logical operators (or expressions using these operators) return 1.0 (true) or 0.0 (false) and these values are used directly in the calculations like any other expression.
</ul>


<pre>
Best solution fitness: 5.0
...
Number of hits (<= 0.5): 145 (of 150 =  0,97)
</pre>

We defined fitness as the number of instances where the absolute difference is &gt; 0.5 (the non-hits), and we see that there are 5 wrongly classified instances (150-145=5), with a hit rate of 97%. Not too bad, but not very good either. 
<br><br>
If we study the instances more closely for the best fit program of generation 39 (and the overall best), we see that class 1 (instances 1-50) is classified very well, i.e. none are misclassified. For class 2 (instances 51-100) there is one misclassified instance (#70), and the remaining misclassified instances belong to class 3 (#119, #129, #133, and #134). However, due to the very random nature of genetic programming, another run could give another number of bad classifications, and other misclassified instances. Still, these same instances tend to be misclassified across runs, especially #70, which is often classified as class 3.

<h3>Validation set</h3>
In this Iris run we used quite low values for the population size and the number of generations. With higher values, say population size=1000 and 1000 generations, it may well be possible to get a perfect hit, and that happened sometimes when I experimented. However, this perfect fit could be bad since the fittest program has <b>over fitted</b> the data: it learned the problem too well, so it may not be usable for calculating new instances.
<br><br>
A standard procedure in machine learning or other data analysis to remedy over fitting is to move some of the fitness cases into a <b>validation set</b>, i.e. instances not seen in the training session, and then test the solution against the validation set. The option <code>validation_pct</code> does exactly that. The value of the option is the percentage of the data that will be placed in the validation set. Or more exactly: it is the <b>probability</b> that a specific fitness case will be in the validation set. 
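<br><br>
The probabilistic split can be sketched like this in Python. It is not the code of SymbolicRegression.java; the function name <code>split_cases</code> and the <code>seed</code> parameter are hypothetical, and the sketch assumes that the option value is given as a percentage between 0 and 100.

```python
import random


def split_cases(cases, validation_pct, seed=None):
    """Put each case in the validation set with probability validation_pct percent."""
    rng = random.Random(seed)
    training, validation = [], []
    for case in cases:
        # Each case is drawn independently of all the others.
        if rng.random() * 100 >= validation_pct:
            training.append(case)
        else:
            validation.append(case)
    return training, validation


cases = list(range(150))  # stand-ins for the 150 Iris fitness cases
training, validation = split_cases(cases, validation_pct=20, seed=1)
print(len(training), len(validation))
```

Since each case is drawn independently, the size of the validation set varies from run to run and is only on average the stated percentage of the data.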


<h2>Compiling the program</h2>
All files mentioned here (and at <a href="http://www.hakank.org/jgap/">my JGAP page</a>) are collected in the file <a href="http://www.hakank.org/jgap/symbolic_regression.zip">symbolic_regression.zip</a>. As of now, it is not packaged into a nice Jar file, so you have to compile it manually. There is no file structure either so the files should be unzipped in the same directory.
<br><br>
<a href="http://www.hakank.org/jgap/SymbolicRegression.java">SymbolicRegression.java</a> is the main program. It is based on JGAP's example <code>MathProblem.java</code> but extended with a lot of bells &amp; whistles.
<br><br>
The program is compiled (on a Linux box) like this. Note that you must have <a href="http://jgap.sourceforge.net/">JGAP</a> installed.
<pre>
javac -Xlint:unchecked -classpath "jgap/jgap.jar:jgap/lib/log4j.jar:jgap/lib/xstream-1.2.2.jar:jgap/lib/commons-lang-2.1.jar:$CLASSPATH" SymbolicRegression.java
</pre>
and run with:
<pre>
java -server -Xmx1024m -Xss2M  -classpath "jgap/jgap.jar:jgap/lib/log4j.jar:jgap/lib/xstream-1.2.2.jar:jgap/lib/commons-lang-2.1.jar:$CLASSPATH" SymbolicRegression [config file]
</pre>
Here is my <a href="http://www.hakank.org/jgap/log4j.properties">log4j.properties</a> file.

<h2>Supported functions from JGAP</h2>
The SymbolicRegression program has support for many of the GP functions from JGAP. The "main" type is double, so not all functions are applicable there (e.g. <code>IfElse</code> etc). However, for the ADF functions (defined by setting <code>adf_arity</code> to &gt; 0, but see below) more functions are supported. Please note that some of these functions are experimental (or <b>very</b> experimental) and the result may not make sense in this context.
<br><br>
For some functions, I have made a similar one that returns double instead of - say - Boolean. These variants are mentioned in this list and have the suffix <code>D</code>.
<ul>
<li> <code>Multiply</code> (double)
<li> <code>Multiply3</code> (double)
<li> <code>Add</code> (double)
<li> <code>Add3</code> (double)
<li> <code>Add4</code> (double)
<li> <code>Divide</code> (double)
<li> <code>Subtract</code> (double)
<li> <code>Sine</code> (double)
<li> <code>ArcSine</code> (double)
<li> <code>Tangent</code> (double)
<li> <code>ArcTangent</code> (double)
<li> <code>Cosine</code> (double)
<li> <code>ArcCosine</code> (double)
<li> <code>Exp</code> (double)
<li> <code>Log</code> (double)
<li> <code>Abs</code> (double)
<li> <code>Pow</code> (double)
<li> <code>Round</code> (double), compare with my <code>RoundD</code>
<li> <code>Ceil</code> (double)
<li> <code>Floor</code> (double)
<li> <code>Modulo</code> (double), implements Java's <code>%</code> operator for double. See ModuloD for a variant
<li> <code>Max</code> (double)
<li> <code>Min</code> (double)
<li> <code>LesserThan</code> (boolean)
<li> <code>GreaterThan</code> (boolean)
<li> <code>If</code> (boolean)
<li> <code>IfElse</code> (boolean), cf the <code>IfElseD</code>
<li> <code>IfDyn</code> (boolean)
<li> <code>Loop</code> (boolean), cf the experimental <code>LoopD</code>
<li> <code>Equals</code> (boolean), cf <code>EqualsD</code>
<li> <code>ForXLoop</code> (boolean)
<li> <code>ForLoop</code> (boolean)
<li> <code>Increment</code> (boolean)
<li> <code>Pop</code> (boolean)
<li> <code>Push</code> (boolean)
<li> <code>And</code> (boolean), cf the double variant <code>AndD</code>
<li> <code>Or</code> (boolean), cf the double variant <code>OrD</code>
<li> <code>Xor</code> (boolean), cf the double variant <code>XorD</code>
<li> <code>Not</code> (boolean), cf the double variant <code>NotD</code>
<li> <code>SubProgram</code> (boolean, experimental)
<li> <code>Tupel</code> (boolean, experimental)
</ul>


<h2>My own functions</h2>
Here is a complete list of the functions I have written (the list will hopefully grow). Some of these should be considered experimental, but may be of some use in experimental settings (or just for learning). 
<ul>
  <li> Boolean operators for DoubleClass<br>
  Here are the Boolean operators for use with DoubleClass, i.e. they take <code>double</code> as input and return a <code>double</code> (0.0d or 1.0d). Some of these functions are tested in <a href="http://www.hakank.org/jgap/odd_parity.conf">odd_parity.conf</a>.
  <ul>
    <li> <a href="http://www.hakank.org/jgap/AndD.java">AndD.java</a>: <code>And</code>
    <li> <a href="http://www.hakank.org/jgap/DifferentD.java">DifferentD.java</a>: <code>Different</code>
    <li> <a href="http://www.hakank.org/jgap/EqualsD.java">EqualsD.java</a>: <code>Equals</code>
    <li> <a href="http://www.hakank.org/jgap/GreaterThanD.java">GreaterThanD.java</a>: <code>GreaterThan</code>
    <li> <a href="http://www.hakank.org/jgap/GreaterThanOrEqualD.java">GreaterThanOrEqualD.java</a>: <code>GreaterThanOrEqual</code>
    <li> <a href="http://www.hakank.org/jgap/IfElseD.java">IfElseD.java</a>: <code>IfElse</code>
    <li> <a href="http://www.hakank.org/jgap/IfLessThanOrEqualD.java">IfLessThanOrEqualD.java</a>: <code>If Less Than Or Equal Then .. Else</code> (if a &lt;= b then c else d). Inspired by Koza's function <code>IFLTE</code>
    <li> <a href="http://www.hakank.org/jgap/LesserThanD.java">LesserThanD.java</a>: <code>LesserThan</code>
    <li> <a href="http://www.hakank.org/jgap/LesserThanOrEqualD.java">LesserThanOrEqualD.java</a>: <code>LesserThanOrEqual</code>
    <li> <a href="http://www.hakank.org/jgap/NotD.java">NotD.java</a>: <code>Not</code>
    <li> <a href="http://www.hakank.org/jgap/OrD.java">OrD.java</a>: <code>Or</code>
    <li> <a href="http://www.hakank.org/jgap/XorD.java">XorD.java</a>: <code>Xor</code>
  </ul>

  <li> <a href="http://www.hakank.org/jgap/ModuloD.java">ModuloD.java</a>: Modulo with <code>double</code> as input and output. First the input is converted to integers and then an integer modulo is done which is returned as a double. (The standard <code>%</code> operator on double is not what I wanted.) This is tested in <a href="http://www.hakank.org/jgap/isbn_test.conf">isbn_test.conf</a>.
  <li> <a href="http://www.hakank.org/jgap/ModuloReplaceD.java">ModuloReplaceD.java</a>: Sometimes we want the Modulo function not to return 0 but some other value (e.g. the highest possible value in the data set). Then this function may be tried. Note: The replacement value is manually set in the configuration option <code>mod_replace</code>. This should be considered highly experimental.
  <li> <a href="http://www.hakank.org/jgap/DivideIntD.java">DivideIntD.java</a>: A protected variant of <code>Divide</code> where the division is done by first converting to <code>Integer</code> and then doing an integer division. Also, if the divisor is 0 (zero), the result is 1 (i.e. protected).
  <li> <a href="http://www.hakank.org/jgap/DivideProtected.java">DivideProtected.java</a>: A protected variant of <code>Divide</code>: the result is 1 (i.e. protected) if the divisor is 0 (zero), else standard double division.
  <li> Mathematical functions.
  <ul>
  <li> <a href="http://www.hakank.org/jgap/Cube.java">Cube.java</a>: <code>Cube</code> (x^3)
  <li> <a href="http://www.hakank.org/jgap/Gamma.java">Gamma.java</a>: <code>Gamma</code>
  <li> <a href="http://www.hakank.org/jgap/Gaussian.java">Gaussian.java</a>: <code>Gaussian</code>
  <li> <a href="http://www.hakank.org/jgap/Hill.java">Hill.java</a>: <code>Hill</code>
  <li> <a href="http://www.hakank.org/jgap/Logistic.java">Logistic.java</a>: <code>Logistic</code>
    <li> <a href="http://www.hakank.org/jgap/RoundD.java">RoundD.java</a>: <code>RoundD</code>, my version of round()
  <li> <a href="http://www.hakank.org/jgap/Sigmoid.java">Sigmoid.java</a>: <code>Sigmoid</code>
  <li> <a href="http://www.hakank.org/jgap/Sign.java">Sign.java</a>: <code>Sign</code>
  <li> <a href="http://www.hakank.org/jgap/Sqrt.java">Sqrt.java</a>: <code>Sqrt</code>
  <li> <a href="http://www.hakank.org/jgap/Square.java">Square.java</a>: <code>Square</code> (x^2)
  <li> <a href="http://www.hakank.org/jgap/Step.java">Step.java</a>: <code>Step</code>    
  </ul>
  <li> Other functions
  <ul>
    <li> <a href="http://www.hakank.org/jgap/Id.java">Id.java</a>: <code>Identity function</code>
    <li> <a href="http://www.hakank.org/jgap/LoopD.java">LoopD.java</a>: <code>Loop</code> for <code>double</code>. Highly experimental.
  </ul>
</ul>
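The integer-based and protected variants in the list above can be sketched in plain Java as follows. This is only a minimal sketch of the described behaviour, not the code in the linked files: the class name <code>ProtectedOps</code> and the method names are made up for illustration, the conversion to integer is assumed to be a plain cast (truncation), and the handling of a zero divisor in the modulo case is an assumption (the text only states it for the division functions).

```java
// Sketch of the protected/integer variants described above.
// ProtectedOps and its method names are hypothetical, not part of JGAP.
final class ProtectedOps {

    // ModuloD-style: convert both inputs to integers (assumed: by a cast),
    // do an integer modulo, and return the result as a double.
    // Assumption: a zero divisor yields 0.0 (the text does not say).
    static double moduloD(double a, double b) {
        int x = (int) a;
        int y = (int) b;
        if (y == 0) {
            return 0.0;
        }
        return (double) (x % y);
    }

    // ModuloReplaceD-style: like moduloD, but a result of 0 is replaced
    // by a configured value (cf. the mod_replace option).
    static double moduloReplaceD(double a, double b, double replacement) {
        double m = moduloD(a, b);
        return m == 0.0 ? replacement : m;
    }

    // DivideIntD-style: integer division, with 1 as result for divisor 0.
    static double divideIntD(double a, double b) {
        int x = (int) a;
        int y = (int) b;
        if (y == 0) {
            return 1.0;
        }
        return (double) (x / y);
    }

    // DivideProtected-style: standard double division, 1 for divisor 0.
    static double divideProtected(double a, double b) {
        if (b == 0.0) {
            return 1.0;
        }
        return a / b;
    }

    public static void main(String[] args) {
        System.out.println(moduloD(7.9, 3.2));          // integer modulo of 7 and 3
        System.out.println(7.9 % 3.2);                  // Java's % on doubles, for comparison
        System.out.println(divideIntD(7.0, 2.0));       // integer division
        System.out.println(divideProtected(7.0, 0.0));  // protected case
    }
}
```

The comparison in <code>main</code> shows why a separate integer modulo can be useful: Java's <code>%</code> on doubles keeps the fractional parts, which is rarely what one wants for checksum-like problems.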

<h3>Making new functions</h3>
As you see, there are about 30 new functions written for this package, and it is quite simple to write new ones. My own method for writing a new function is this. Let's see how to write the <code>Sqrt</code> function (which is <a href="http://www.hakank.org/jgap/Sqrt.java">here</a>).
<ul>
<li> decide what the new function will do: the square root of its argument.
<li> copy a similar function, if possible. Here I just copied the function <code>org.jgap.gp.function.Log</code> from the JGAP distribution. 
<li> change the old name to the new function name : "Log" -&gt; "Sqrt"
<li> change the way the function should be presented in <code>toString()</code>: "sqrt &amp;1". If a function has more arguments, the different arguments are presented as "&amp;1", "&amp;2", "&amp;3", etc. E.g. the <code>ModuloD</code> function has the following presentation "&amp;1 mod &amp;2", but it can be "mod(&amp;1,&amp;2)" or even "(mod &amp;1 &amp;2)" depending on the style of output. (Hmm, maybe there should be an option in all functions how to represent the names, e.g. mathematical, Java version, Lisp version. I have to think about this more.)
<li> change the textual representation in <code>getName()</code>. We use <code>Sqrt</code>.
<li> state the logic of the function in <code>execute_double</code>. Since double is the only type that is supported right now, it suffices to change <code>execute_double</code>. However, in some of the files, there is also support for other types, e.g. <code>execute_float</code>, <code>execute_int</code>, etc.
<li> change the number of arguments and call name in <code>execute_object</code>: here we use <code>execute_sqrt</code> as the call name. This same name is to be used in <code>Compatible</code>.
<li> Add the function name in the <code>makeCommands</code> method in SymbolicRegression.java. Recompile.
</ul>

This is the basic procedure. If there are other arguments or types, some more tweaking may have to be done.
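<br><br>
To illustrate the <code>&amp;1</code>, <code>&amp;2</code> placeholders used in <code>toString()</code>, here is a small stand-alone sketch of how such a template could be expanded with the textual form of the arguments. It does not use JGAP at all; <code>TemplateDemo</code> and <code>render</code> are hypothetical names, and JGAP's own rendering code may well work differently.

```java
// Stand-alone sketch: expanding "&1", "&2", ... placeholders in a
// toString()-style template. TemplateDemo and render are hypothetical
// names; this is not JGAP code.
final class TemplateDemo {

    // Replace each "&i" in the template with the i-th argument (1-based).
    // Higher indices are replaced first so that "&1" does not clobber "&10".
    // Assumes the argument texts themselves contain no "&i" sequences.
    static String render(String template, String... args) {
        String result = template;
        for (int i = args.length; i >= 1; i--) {
            result = result.replace("&" + i, args[i - 1]);
        }
        return result;
    }

    public static void main(String[] args) {
        // The three presentation styles mentioned above, plus Sqrt.
        System.out.println(render("sqrt &1", "X"));
        System.out.println(render("&1 mod &2", "X", "3"));
        System.out.println(render("mod(&1,&2)", "X", "3"));
        System.out.println(render("(mod &1 &2)", "X", "3"));
    }
}
```

Since only the template string differs between the styles, an option for choosing the presentation (mathematical, Java, Lisp) could in principle be reduced to selecting one template per function.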
<h2>Configuration files</h2>
One of the primary goals of writing my own Symbolic Regression program was to be able to use a configuration file to state the problem and the data. Below are some examples. Please note that some of these are experimental (and use experimental parameters/operators), and also that they may not give any interesting or good results. More info about the data/problem is usually in the header of the file.
<br><br>
Some of these problems were first tested with Eureqa and were commented on in <a href="http://translate.google.se/translate?u=http%3A%2F%2Fwww.hakank.org%2Fwebblogg%2Farchives%2F001354.html&sl=sv&tl=en&hl=&ie=UTF-8">Eureqa: Equation discovery with genetic programming</a> (a Google Translation of my original Swedish blog post <a href="http://www.hakank.org/webblogg/archives/001354.html">Eureqa: equation discovery med genetisk programmering</a>).
<ul>
 <li><a href="http://www.hakank.org/jgap/alldifferent3.conf">alldifferent3.conf</a>: All variables should be different
 <li><a href="http://www.hakank.org/jgap/bolts.conf">bolts.conf</a>: Bolts. A machine learning example
 <li><a href="http://www.hakank.org/jgap/boyles_law.conf">boyles_law.conf</a>: Boyle's law.
 <li><a href="http://www.hakank.org/jgap/catalan.conf">catalan.conf</a>: Catalan numbers
 <li><a href="http://www.hakank.org/jgap/circle_1.conf">circle_1.conf</a>: Circle
 <li><a href="http://www.hakank.org/jgap/exp_formula.conf">exp_formula.conf</a>: Test of Exp function
 <li><a href="http://www.hakank.org/jgap/exp_formula_no_exp.conf">exp_formula_no_exp.conf</a>: Test of Exp function, but without <code>Exp</code> in the function list.
 <li><a href="http://www.hakank.org/jgap/fahrenheit_celsius.conf">fahrenheit_celsius.conf</a>: Fahrenheit to Celsius conversion. You may experiment by changing <code>output_variable</code> to 0 for the reverse conversion (C -&gt; F).
 <li><a href="http://www.hakank.org/jgap/fib1.conf">fib1.conf</a>: Fibonacci series as a time series.
 <li><a href="http://www.hakank.org/jgap/fib2.conf">fib2.conf</a>: Fibonacci series as a time series.
 <li><a href="http://www.hakank.org/jgap/fib_50.conf">fib_50.conf</a>: Fibonacci numbers, where I try to find the closed formula for the Fibonacci number.
 <li><a href="http://www.hakank.org/jgap/func1.conf">func1.conf</a>: Unknown function (from a homework in a course in <a href="http://www.cs.bris.ac.uk/Teaching/Resources/COMSM0302">Evolutionary Computing</a>)
 <li><a href="http://www.hakank.org/jgap/gamma_test.conf">gamma_test.conf</a>: Test of Gamma function
 <li><a href="http://www.hakank.org/jgap/gelman.conf">gelman.conf</a>: Linear regression
 <li><a href="http://www.hakank.org/jgap/henon_100.conf">henon_100.conf</a>: Henon, 100 data points 
 <li><a href="http://www.hakank.org/jgap/heron_formula.conf">heron_formula.conf</a>: Heron formula
 <li><a href="http://www.hakank.org/jgap/intro_page_262.conf">intro_page_262.conf</a>: A simple problem
 <li><a href="http://www.hakank.org/jgap/iris.conf">iris.conf</a>: Iris data set
 <li><a href="http://www.hakank.org/jgap/isbn_test.conf">isbn_test.conf</a>: Trying to get the program to calculate the checksum for ISBN13
 <li><a href="http://www.hakank.org/jgap/longley.conf">longley.conf</a>: Longley's employment data set
 <li><a href="http://www.hakank.org/jgap/majority_on_3.conf">majority_on_3.conf</a>: Boolean 3-majority on 
 <li><a href="http://www.hakank.org/jgap/mod_test.conf">mod_test.conf</a>: Test of modulus operator.
 <li><a href="http://www.hakank.org/jgap/moons.conf">moons.conf</a>: Moons data
 <li><a href="http://www.hakank.org/jgap/multiplexer_3.conf">multiplexer_3.conf</a>: 3-multiplexer (i.e. IfThenElse)
 <li><a href="http://www.hakank.org/jgap/multiplexer_6.conf">multiplexer_6.conf</a>: 6-multiplexer
 <li><a href="http://www.hakank.org/jgap/multiplexer_11.conf">multiplexer_11.conf</a>: 11-multiplexer
 <li><a href="http://www.hakank.org/jgap/mysterious.conf">mysterious.conf</a>: Mysterious function
 <li><a href="http://www.hakank.org/jgap/number_puzzle1.conf">number_puzzle1.conf</a>: Number puzzle from Roger Alsing's blog post <a href="http://rogeralsing.com/2010/02/14/genetic-programming-code-smarter-than-you/">Genetic Programming: Code smarter than you!</a>
 <li><a href="http://www.hakank.org/jgap/odd_parity.conf">odd_parity.conf</a>: Odd parity, using the double variants of the boolean functions, i.e. AndD, OrD, NotD (see above)
  <li><a href="http://www.hakank.org/jgap/odd_parity_double.conf">odd_parity_double.conf</a>: Odd parity, using the arithmetic functions +,-,*,/
 <li><a href="http://www.hakank.org/jgap/odd_parity2.conf">odd_parity2.conf</a>: Odd parity for two inputs
 <li><a href="http://www.hakank.org/jgap/p10.conf">p10.conf</a>: Polynomial P(10)
 <li><a href="http://www.hakank.org/jgap/p4.conf">p4.conf</a>: Polynomial P(4)
 <li><a href="http://www.hakank.org/jgap/p4_2.conf">p4_2.conf</a>: Polynomial P(4)
 <li><a href="http://www.hakank.org/jgap/p4_jgap.conf">p4_jgap.conf</a>: Polynomial P(4). This is the version in the JGAP example MathFormula.java
 <li><a href="http://www.hakank.org/jgap/p6_2.conf">p6_2.conf</a>: Polynomial P(6)
 <li><a href="http://www.hakank.org/jgap/planets.conf">planets.conf</a>: Planets, i.e. Kepler's third law.
 <li><a href="http://www.hakank.org/jgap/quintic.conf">quintic.conf</a>: Quintic polynomial
 <li><a href="http://www.hakank.org/jgap/regression_koza.conf">regression_koza.conf</a>: Regression (0.5 * x^2, from John R. Koza's Lisp implementation)
 <li><a href="http://www.hakank.org/jgap/regression_psh.conf">regression_psh.conf</a>: Regression (y = 12x^2 + 5, from Psh)
 <li><a href="http://www.hakank.org/jgap/seq_ind1.conf">seq_ind1.conf</a>: Sequence induction problem: 5*j^4+4*j^3+3*j^2+2*j+1 (for integers 0..10)
 <li><a href="http://www.hakank.org/jgap/sigmoid_test.conf">sigmoid_test.conf</a>: Test of Sigmoid function
 <li><a href="http://www.hakank.org/jgap/sin_formula.conf">sin_formula.conf</a>: Test of Sine
 <li><a href="http://www.hakank.org/jgap/sin_formula_rand20.conf">sin_formula_rand20.conf</a>: Test of Sine
 <li><a href="http://www.hakank.org/jgap/sine_tiny_gp.conf">sine_tiny_gp.conf</a>: Test of Sine from TinyGP
 <li><a href="http://www.hakank.org/jgap/sorted_3.conf">sorted_3.conf</a>: Sorting 3 variables.
 <li><a href="http://www.hakank.org/jgap/sqrt_formula2.conf">sqrt_formula2.conf</a>: Yet another test of Sqrt
 <li><a href="http://www.hakank.org/jgap/sqrt_formula3.conf">sqrt_formula3.conf</a>: Yet another test of Sqrt
 <li><a href="http://www.hakank.org/jgap/sqrt_formula.conf">sqrt_formula.conf</a>: Test of Sqrt function
 <li><a href="http://www.hakank.org/jgap/sunspots.conf">sunspots.conf</a>: Sunspots data as time series
 <li><a href="http://www.hakank.org/jgap/test1.conf">test1.conf</a>: A test of many functions.
 <li><a href="http://www.hakank.org/jgap/test2.conf">test2.conf</a>: A test of new functions.
 <li><a href="http://www.hakank.org/jgap/tic_tac_toe.conf">tic_tac_toe.conf</a>: Tic-tac-toe
 <li><a href="http://www.hakank.org/jgap/triangular_numbers.conf">triangular_numbers.conf</a>: Triangular numbers
 <li><a href="http://www.hakank.org/jgap/weather.conf">weather.conf</a>: Weather (classic classification example)
 <li><a href="http://www.hakank.org/jgap/x2.conf">x2.conf</a>: A simple polynomial: x^2
 <li><a href="http://www.hakank.org/jgap/x4-x3+x2-x.conf">x4-x3+x2-x.conf</a>: Polynomial: x^4-x^3+x^2-x
 <li><a href="http://www.hakank.org/jgap/zoo2.conf">zoo2.conf</a>: Zoo (classic classification example)
</ul>

<h2>The configuration parameters</h2>
The configuration file consists of the following parameters. Here is a short explanation; the full story is in the code: <a href="http://www.hakank.org/jgap/SymbolicRegression.java">SymbolicRegression.java</a>. Most of the parameters have reasonable default values, taken from either MathProblem.java or GPConfiguration.
<ul>
<li> <code>#</code>, <code>%</code>: Line comments; lines that start with the characters "#" or "%" will be ignored. 
<li> <code>presentation</code>: A text which is shown first in the run.
<li> <code>num_input_variables</code>: Number of input variables in the data set.
<li> <code>output_variable</code>: The index (0-based) of the output variable. Default is the last variable.
<li> <code>variable_names</code>: The names of the variables, in order. Default is "V0", "V1", etc
<li> <code>data</code>: Starts the <code>data</code> section, where each row is presented per line. The attributes may be separated by "," or some space. Decimal point is a <code>.</code> (dot).<br> If a data row contains a <code>?</code> (question mark) in the position of the output variable, then it is considered a "user defined test" and the fittest program will be tested against this data last in the run. 
<li> <code>terminal_range</code>: The range for the <code>Terminal</code> as <code>lower upper</code>. Note: Only one Terminal is used.
<li> <code>terminal_wholenumbers</code>: Whether the <code>Terminal</code> should use whole numbers or not (boolean)
<li> <code>constant</code>: Define a <code>Constant</code> with this value
<li> <code>functions</code>: Define the functions, with the same name as in JGAP (or own defined functions).
<li> <code>adf_arity</code>: If > 0 then ADF is used. This is somewhat experimental as I am still trying to understand how ADFs work.
<li> <code>adf_function</code>: The functions used for ADF.
<li> <code>adf_type</code>:  Either double or boolean. If set to boolean, we can use the boolean and logical operators.
<li> <code>max_init_depth</code>: JGAP parameter <code>maxInitDepth</code>
<li> <code>min_init_depth</code>: JGAP parameter <code>minInitDepth</code>
<li> <code>program_creation_max_tries</code>: JGAP parameter <code>programCreationMaxTries</code>
<li> <code>population_size</code>: JGAP parameter <code>populationSize</code>
<li> <code>max_crossover_depth</code>: JGAP parameter <code>maxCrossoverDepth</code>
<li> <code>function_prob</code>: JGAP parameter <code>functionProb</code>
<li> <code>reproduction_prob</code>: JGAP parameter <code>reproductionProb</code>
<li> <code>mutation_prob</code>: JGAP parameter <code>mutationProb</code>
<li> <code>crossover_prob</code>: JGAP parameter <code>crossoverProb</code>
<li> <code>dynamize_arity_prob</code>: JGAP parameter <code>dynamizeArityProb</code>
<li> <code>no_command_gene_cloning</code>: JGAP parameter <code>noCommandGeneCloning</code>
<li> <code>use_program_cache</code>: JGAP parameter <code>useProgramCache</code>
<li> <code>new_chroms_percent</code>: JGAP parameter <code>newChromsPercent</code>
<li> <code>num_evolutions</code>: JGAP parameter <code>numEvolution</code>
<li> <code>tournament_selector_size</code>: JGAP parameter <code>tournamentSelectorSize</code>
<li> <code>max_nodes</code>: JGAP parameter <code>maxNodes</code>
<li> <code>scale_error</code>: Sometimes the data values are very small which gives small fitness values (i.e. errors), making it hard to get any progress. Setting this parameter will multiply the errors by this value.
<li> <code>stop_criteria_fitness</code>: If set (>= 0) then the program will run "forever" (instead of <code>num_evolutions</code> generations) until the fitness is less than or equal to the value.
<li> <code>show_population</code>: This shows the whole population in each generation. Mainly for debugging purposes.
<li> <code>show_similar</code>: Shows all the solutions (programs) with the same fitness value as the best solution.
<li> <code>show_progression</code>: boolean. If true then the generation number is shown for all generations when nothing is happening (i.e. no gain in fitness).
<li> <code>sample_pct</code>: (float) Takes a (sample) percentage of the data set if > 0.0.
<li> <code>validation_pct</code>: Withholds a percentage of the fitness cases as a validation set. The fitness on this validation set is shown.
<li> <code>show_all_generations</code>: Show info of all generations, not just when fitness is changed.
<li> <code>hits_criteria</code>: Criteria of a <b>hit</b>: if the difference is &lt;= this value, it is considered a hit. The number of <b>non-hits</b> is then used as a fitness measure instead of the sum of errors. Setting this option also shows the number of programs which are &lt;= this value.
<li> <code>mod_replace</code>: Sets the replacement value for 0 (zero) in the <code>ModuloReplaceD</code> function (see above).
<li> <code>showResults</code>: boolean. If set then all the fitness cases are shown with the output of the fitted program, together with the difference to the correct values.
<li> <code>resultPrecision</code>: the precision in the output used in <code>showResults</code>, default 5
<li> <code>ignore_variables</code>: (TBW) It would be nice to be able to ignore some variables in the data set. But this is yet to be written.
<li> <code>return_type</code>: (TBW) This should be the type of the "main" return value. Note: it is now hard coded in the program as <code>double/DoubleClass</code>.
</ul>
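To make the file format rules above concrete (line comments, options, the <code>data</code> section and the <code>?</code> marker), here is a minimal sketch of a reader in plain Java. It is not the parser in SymbolicRegression.java: the class <code>ConfSketch</code> is hypothetical, the <code>name: value</code> form of an option is assumed from the ADF example further down, and the sample lines in <code>main</code> are a made-up miniature example rather than one of the linked files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of a reader for the configuration format described above.
// ConfSketch is a hypothetical name; the real parsing is done in
// SymbolicRegression.java and may differ in its details.
final class ConfSketch {
    final Map<String, String> options = new LinkedHashMap<>();
    final List<double[]> rows = new ArrayList<>();       // complete data rows
    final List<String[]> userTests = new ArrayList<>();  // rows with "?" as output

    static ConfSketch parse(List<String> lines) {
        ConfSketch conf = new ConfSketch();
        boolean inData = false;
        for (String raw : lines) {
            String line = raw.trim();
            // Blank lines and lines starting with "#" or "%" are ignored.
            if (line.isEmpty() || line.startsWith("#") || line.startsWith("%")) {
                continue;
            }
            if (!inData) {
                // Assumption: a line "data" (or "data:") starts the data section.
                if (line.equals("data") || line.equals("data:")) {
                    inData = true;
                } else {
                    // Assumption: options are written as "name: value".
                    int colon = line.indexOf(':');
                    if (colon > 0) {
                        conf.options.put(line.substring(0, colon).trim(),
                                         line.substring(colon + 1).trim());
                    }
                }
                continue;
            }
            // Data row: attributes separated by "," or whitespace.
            String[] fields = line.split("[,\\s]+");
            // output_variable is 0-based; the default is the last variable.
            String out = conf.options.get("output_variable");
            int outIdx = (out == null) ? fields.length - 1 : Integer.parseInt(out);
            if (fields[outIdx].equals("?")) {
                conf.userTests.add(fields);   // a "user defined test"
            } else {
                double[] row = new double[fields.length];
                for (int i = 0; i < fields.length; i++) {
                    row[i] = Double.parseDouble(fields[i]);
                }
                conf.rows.add(row);
            }
        }
        return conf;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "# Fahrenheit to Celsius, a made-up miniature example",
            "presentation: Fahrenheit to Celsius",
            "num_input_variables: 1",
            "functions: Multiply,Divide,Add,Subtract",
            "data",
            "32, 0",
            "212 100",
            "50,10",
            "100,?");
        ConfSketch conf = parse(lines);
        System.out.println(conf.options);
        System.out.println(conf.rows.size() + " data rows, "
                + conf.userTests.size() + " user defined test(s)");
    }
}
```

The rows with a <code>?</code> are kept as text and stored separately, since they are not fitness cases but inputs for which the fittest program is asked to give an answer at the end of the run.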

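<br><br>
The interplay of <code>scale_error</code>, <code>hits_criteria</code> and <code>stop_criteria_fitness</code> can be sketched like this. Again this is a hedged illustration and not the fitness function in SymbolicRegression.java: <code>FitnessSketch</code> and its methods are hypothetical names, and using the absolute difference as the "error" of a fitness case is an assumption.

```java
// Sketch of the two fitness measures described above. FitnessSketch and
// its method names are hypothetical; taking the absolute difference as
// the "error" of a fitness case is an assumption.
final class FitnessSketch {

    // Sum of the errors over all fitness cases, each multiplied by
    // scaleError (cf. scale_error; use 1.0 for no scaling).
    static double errorFitness(double[] predicted, double[] actual, double scaleError) {
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            sum += Math.abs(predicted[i] - actual[i]) * scaleError;
        }
        return sum;
    }

    // Number of non-hits: a case is a hit if the difference is less than
    // or equal to hitsCriteria (cf. hits_criteria).
    static int nonHits(double[] predicted, double[] actual, double hitsCriteria) {
        int count = 0;
        for (int i = 0; i < actual.length; i++) {
            if (Math.abs(predicted[i] - actual[i]) > hitsCriteria) {
                count++;
            }
        }
        return count;
    }

    // cf. stop_criteria_fitness: only active if set (>= 0); the run then
    // stops once the fitness is less than or equal to the value.
    static boolean shouldStop(double fitness, double stopCriteriaFitness) {
        return stopCriteriaFitness >= 0.0 && fitness <= stopCriteriaFitness;
    }

    public static void main(String[] args) {
        double[] predicted = {1.0, 2.5, 4.0};
        double[] actual = {1.0, 2.0, 2.0};
        System.out.println(errorFitness(predicted, actual, 1.0));
        System.out.println(nonHits(predicted, actual, 0.5));
        System.out.println(shouldStop(nonHits(predicted, actual, 0.5), 0.0));
    }
}
```

In both cases a lower value is better, and a fitness of 0 means that every fitness case is matched (exactly, or within the hit tolerance).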

<h2>ADF - Automatically Defined Function</h2>
JGAP has support for ADF (Automatically Defined Function), and this is a very interesting topic. See the section <a href="http://cswww.essex.ac.uk/staff/poli/gp-field-guide/61EvolvingModularandHierarchicalStructures.html">6.1 Evolving Modular and Hierarchical Structures</a> from <a href="http://dces.essex.ac.uk/staff/rpoli/gp-field-guide/">A Field Guide to Genetic Programming</a> for more info.
<br><br>
The SymbolicRegression program has some support for ADFs, but it is not very well tested yet. I have tested ADF in some configurations but I am not very happy with the results. One example is <a href="http://www.hakank.org/jgap/sunspots.conf">sunspots.conf</a> which has the following ADF related options:
<pre>
adf_arity: 0
adf_type: boolean
adf_functions: IfElse,GreaterThan,LesserThan
</pre>

One of the problems I have with ADF is that many of the interesting ADF functions, e.g. <code>Loop</code>, <code>ForLoop</code>, <code>ForXLoop</code>, require a different representation than SymbolicRegression supports. In spite of this, it can be interesting to experiment with the existing support for ADF.
<br><br>
Explanations of the ADF related options:
<ul>
  <li> <code>adf_arity</code>: When &gt; 0, ADF is activated and all the ADF functions have this arity (number of arguments)
  <li> <code>adf_type</code>: return type for the ADF functions. Can be either <code>boolean</code> or <code>double</code>. In order to work, the ADF function must support the stated type (and it is here I have some problems).
  <li> <code>adf_function</code>: a list of functions to be used as ADF. 
</ul>

I hope to come back with better working support for ADF.


<h2>Todo</h2>
Here are some TODO:s, or things nice to have. This list was simply copied from my <a href="http://www.hakank.org/jgap">JGAP page</a> when writing this blog post.
<ul>  
 <li> option for ignoring specific variables
 <li> option for stopping:
      <ul>
      <li> running forever
      <li> after a specific time
      </ul>
 <li> accept nominal values in the data section; then converted to numeric values.
 <li> add more fitness metrics.
 <li> better handling of punishing longer solutions (parsimony pressure).
 <li> support for different "main" return classes, i.e. not just DoubleClass
 <li> correlation coefficient, and other statistical measures, e.g. 
     R-squared, mean squared error, mean absolute error, minimum error,
     maximum error
 <li> more/better error checks
 <li> more building blocks, a la Eureqa http://ccsl.mae.cornell.edu/eureqa_ops
 <li> support for derivatives (a la Eureqa)?
 <li> incorporate in Weka?
 <li> simplify the best solution with a CAS?
</ul>
 

<h2>Also see</h2>
During my current journey with Genetic Programming I have read the following two books, in this order.

<ul>
<li> Wolfgang Banzhaf, Peter Nordin, Robert E Keller, Frank D Francone: <a href="http://www.bokus.com/b/9781558605107.html">Genetic Programming - An Introduction</a> (Elsevier Science & Technology, 1998, ISBN: 9781558605107). This is a great introduction to GP.
<li> Riccardo Poli, William B. Langdon, Nicholas F. McPhee, with contributions by John R. Koza: 
<a href="http://dces.essex.ac.uk/staff/rpoli/gp-field-guide/">A Field Guide to Genetic Programming</a>, the HTML version of the book <a href="http://www.bokus.com/b/9781409200734.html">A Field Guide to Genetic Programming</a> (Lulu.com, 2008, ISBN: 9781409200734). This is a great summary of the current state of genetic programming.
</ul>
Currently I am reading John Koza's first Genetic Programming book <a href="http://www.bokus.com/b/9780262111706.html">Genetic Programming: v. 1 On the Programming of Computers by Means of Natural Selection </a> (ISBN: 9780262111706). This is the first and groundbreaking book about GP. It is very inspiring and detailed. Many of the ideas there are not applicable to Symbolic Regression, so I hope to implement other GP programs as well.
<br><br>
See also:
<ul>
  <li> <a href="http://translate.google.se/translate?u=http%3A%2F%2Fwww.hakank.org%2Fwebblogg%2Farchives%2F001354.html&sl=sv&tl=en&hl=&ie=UTF-8">Eureqa: Equation discovery with genetic programming</a> (a Google Translation of my original Swedish blog post <a href="http://www.hakank.org/webblogg/archives/001354.html">Eureqa: equation discovery med genetisk programmering</a>)
  <li> <a href="http://www.hakank.org//eureqa/">My Eureqa page</a>
  <li> <a href="http://ccsl.mae.cornell.edu/eureqa">Eureqa</a>, the homepage of Eureqa
  <li> <a href="http://www.hakank.org/weka/">My Weka page</a>, where other machine learning topics are shown
</ul>
]]></description>
         <link>http://www.hakank.org/arrays_in_flux/2010/02/symbolic_regression_using_genetic_programming_with_jgap.html</link>
         <guid>http://www.hakank.org/arrays_in_flux/2010/02/symbolic_regression_using_genetic_programming_with_jgap.html</guid>
         <category>Genetic programming/algorithms</category>
         <pubDate>Sun, 21 Feb 2010 18:33:48 +0100</pubDate>
      </item>
            <item>
         <title>Arrays in Flux: My third blog, and second English</title>
         <description><![CDATA[<p>Here is a - not so short - presentation of this new blog, my third on the domain <a href="http://hakank.org/">hakank.org</a>. </p>

<p>If you have read any of my other blogs, you may ask: why yet another blog? Well, the other two blogs are not enough for my recent plans. One of the blogs, <a href="http://www.hakank.org/webblogg/">hakank.blogg</a>, is in Swedish and it's probably not a good thing to blend different languages in one blog; the other, <a href="http://www.hakank.org/constraint_programming_blog/">My Constraint Programming Blog</a>, is targeted to a very specific topic: constraint programming. Since I want to be able to write about other things in English, it seems to be a good idea to start a new blog.</p>

<p>Something about the name <b>Arrays in Flux</b>. One of the first names that came to mind was <a href="http://en.wikipedia.org/wiki/Panta_Rei">Panta Rei</a> (meaning "everything flows"). I like the philosophical idea that everything is in a steady flow of changes, with a famous saying from Heraclitus: <i>You cannot step twice into the same river</i>. Related to programming, it seems to be a good description of a lot of programs: the code changes all the time, either by adding new functions or changing old ones, and by correcting bugs.</p>

<p>Well, the name "Panta Rei" really doesn't have the associations I wanted. After some playing with that phrase, a sound-alike came up: "Pant Array", which was immediately discarded. However, the "Array" kind of stuck, since it's a nice reference to programming. Then it was not very long to the final version: <b>Arrays in Flux</b>. (One alternative was to have a subtitle: <b>pantA reI</b>, where the upper case "A" and "I" should allude to one of my big interests: AI. I reckon that was too far-fetched.)</p>

<p>Also, it helps that the name is right now (almost) a <a href="http://www.google.com/#hl=en&source=hp&q=%22arrays+in+flux%22&btnG=Google+Search&aq=f&oq=%22arrays+in+flux%22&fp=2d78c962cb3eaf7a">unique</a> search phrase in one of the search engines (namely Google).</p>

<p>What will be published here? Everything is possible, but it will probably be in some of these areas (with Misc as a nice catch-all category):<ul>  <li>AI <br />
  <li> Genetic programming/algorithms<br />
  <li> Machine learning/data mining<br />
  <li> Mathematics<br />
  <li> Programming, and programming languages<br />
  <li> Puzzles and fun <br />
  <li> Findings (e.g. links) in popular science<br />
  <li> Misc. (this may be quite large)<br />
</ul></p>

<p>It will not be updated very often, so it's safe to <a href="http://www.hakank.org/arrays_in_flux/atom.xml">subscribe</a> to it :-). </p>

<p><br />
Welcome to Arrays in Flux!</p>

<p>Hakan Kjellerstrand (<a href="mailto:hakank@bonetmail.com">hakank@bonetmail.com</a>), <br />
<a href="http://www.hakank.org/">http://www.hakank.org/</a> .<br />
</p>]]></description>
         <link>http://www.hakank.org/arrays_in_flux/2010/02/arrays_in_flux_my_third_blog_and_second_english.html</link>
         <guid>http://www.hakank.org/arrays_in_flux/2010/02/arrays_in_flux_my_third_blog_and_second_english.html</guid>
         <category>Misc</category>
         <pubDate>Tue, 16 Feb 2010 18:59:59 +0100</pubDate>
      </item>
      
   </channel>
</rss>
