Symbolic regression (using genetic programming) with JGAP
Some weeks ago I tested the data analysis system Eureqa and was delighted by its functionality and performance, and especially by the concept of Symbolic Regression. To quote Eureqa's home page, it detects equations and hidden mathematical relationships in your data, and its primary goal is to identify the simplest mathematical formulas which could describe the underlying mechanisms that produced the data.
I blogged about this in my Swedish blog Eureqa: equation discovery med genetisk programmering (Google translation to English: Eureqa: Equation discovery with genetic programming). Eureqa uses Symbolic Regression (using genetic programming) to calculate a mathematical formula given some data points. For more about Eureqa, see my Eureqa page.
Symbolic Regression with JGAP
After that, I started to write my own system for symbolic regression using the genetic algorithm/genetic programming system JGAP, written in Java. You will find the Java code and example configuration files on My JGAP page. Here I give some examples of the usage of symbolic regression with my program. At the end there are full lists of the defined functions, options, and configuration files.
I wrote my own system instead of just using Eureqa for a couple of reasons:
- learning genetic programming, and symbolic regression, is probably done better by writing code
- it is easy to write new functions with JGAP, and I want to experiment with different things: new functions and options
- in some cases (like this one) I like command line tools better than GUIs, and I like using a configuration file where the options for the learning and the data are collected
What is symbolic regression (SR)?
Symbolic regression is, simply put, a way of using genetic programming (GP) to generate a mathematical formula given some data points. In some uses of genetic programming the results can be "real programs", but in this context they are mostly mathematical expressions. Note that symbolic regression is just one of the things you can do with genetic programming. For an example of symbolic regression, see the example at www.genetic-programming.com. Also see Wikipedia's article on Genetic Programming.
What fascinates me most about genetic programming is that the result is a program which can be understood; it is a "white box" (clear box) technique. For some other machine learning techniques, say neural networks, it is very hard to understand the result; it's just a black box, where you may use the results but don't get any insight into the solution. Also, I tend to prefer other white box techniques, such as decision trees and rule-based systems.
My SymbolicRegression program has a lot of options, but I will not explain or exemplify them all here. A full list of the options with short descriptions is presented below.
Let us start with some simple examples to understand what symbolic regression can do and how to do it with the SymbolicRegression program.
Simple example: number puzzle
Let's say we have the following puzzle. If we assume that:

2 + 3 = 10
7 + 2 = 63
6 + 5 = 66
8 + 4 = 96

how much is 9 + 7?

This can be somewhat tricky to solve by hand (or in the head), or maybe we are simply too lazy to solve it that way. Using symbolic regression is easier (although probably not as fun): just create a configuration file like the one below. In this problem there is a specific unknown instance that we want to solve for, so we write an instance with a ? (question mark) in the place of the (unknown) output.
presentation: Puzzle
num_input_variables: 2
variable_names: x y z
functions: Multiply,Divide,Add,Subtract
terminal_range: -10 10
terminal_wholenumbers: true
population_size: 100
num_evolutions: 100
show_similar: true
data
2 3 10
7 2 63
6 5 66
8 4 96
# the unknown instance
9 7 ?

We use the four arithmetic functions (*, /, +, -), coded as
Multiply,Divide,Add,Subtract
and have a small population size (100) and just 100 generations. The other options are explained below.
Here is an (edited) sample run of the problem.
It was 4 data rows
It was 1 data rows in the user defined data set
Presentation: Puzzle
output_variable: z (index: 2)
input variable: x
input variable: y
function1: &1 * &2
function1: /
function1: &1 + &2
function1: &1 - &2
function1: 10.0
Creating initial population
Evolving generation 0/100(time from start: 0,05s)
Best solution fitness: 35.0
Best solution: x + ((y - x) + (10.0 * x))
Depth of chrom: 3. Number of functions/terminals: 9 (4 functions, 5 terminals)
Correlation coefficient: 0.979073833348314
Evolving generation 3/100(time from start: 0,15s)
Best solution fitness: 31.0
Best solution: (10.0 * x) + ((x * y) - (8.0 + y))
Depth of chrom: 3. Number of functions/terminals: 11 (5 functions, 6 terminals)
Correlation coefficient: 0.9945949940306454
Evolving generation 11/100(time from start: 0,39s)
Best solution fitness: 26.0
Best solution: ((10.0 * x) + (x + (-7.0 * y))) + (x * y)
Depth of chrom: 4. Number of functions/terminals: 13 (6 functions, 7 terminals)
Correlation coefficient: 0.969830602937701
Evolving generation 14/100(time from start: 0,50s)
Best solution fitness: 22.0
Best solution: (x * x) + (4.0 * x)
Depth of chrom: 2. Number of functions/terminals: 7 (3 functions, 4 terminals)
Correlation coefficient: 0.9726783536388712
Evolving generation 17/100(time from start: 0,58s)
Best solution fitness: 0.0
Best solution: (x * y) + (x * x)
Depth of chrom: 2. Number of functions/terminals: 7 (3 functions, 4 terminals)
Correlation coefficient: 1.0
All time best (from generation 17)
Evolving generation 101/100(time from start: 1,71s)
Best solution fitness: 0.0
Best solution: (x * y) + (x * x)
Depth of chrom: 2. Number of functions/terminals: 7 (3 functions, 4 terminals)
Correlation coefficient: 1.0
Total time 1,71s
All solutions with the best fitness (0.0):
(x * x) + (x * y) (26)
(x * x) + (y * x) (2)
(x + y) * x (2)
(x * y) + (x * x) (98)
((x * y) + (x * x)) * ((2.0 - 2.0) + (2.0 / 2.0)) (1)
((x / (y / x)) + x) * y (1)
It was 6 different solutions with fitness 0.0
Testing the fittest program with user defined test data:
9.0 7.0 Result: 144.0

Since it is a genetic programming system, the first generation (generation 0) is a completely random population of programs. Note that the configuration sets very few limits on size and population, and there are really no limits on the structure (see below).
The best fit program in this first generation,
x + ((y - x) + (10.0 * x))
has a quite bad fitness measure: 35; rather a long way from the goal of fitness 0 (the perfect score). The fitness is calculated as the sum of the differences between the program's output for each data point and the real data point. (Note: one of my TODOs is to have more alternative error measures.)
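Expressed as code, the idea is simply this (a minimal sketch, under the assumption that plain absolute differences are summed; the actual implementation is in SymbolicRegression.java):

static double fitness(double[] predicted, double[] actual) {
    // Sum of the absolute differences between the program's output
    // and the real data points; 0.0 means a perfect fit.
    double sum = 0.0;
    for (int i = 0; i < predicted.length; i++) {
        sum += Math.abs(predicted[i] - actual[i]);
    }
    return sum;
}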
Generation 3 has a somewhat better solution, as has generations 11, and 14. A perfect solution is found in generation 17:
(x * y) + (x * x)
with a fitness (error) of 0.0, and a correlation coefficient of 1.0 (perfect fit between the input variable and output variable). After the 100 generations, the best solution is printed again with the total time (about 1.7 seconds).
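Note that (x * y) + (x * x) = x * (x + y), so the answer to the puzzle is 9 * (9 + 7) = 9 * 16 = 144, which matches the Result: 144.0 printed for the user defined test data in the run above.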
Since the option
show_similar: true
was set, all solutions with the same fitness score as the best are also shown:
(x * x) + (x * y) (26)
(x * x) + (y * x) (2)
(x + y) * x (2)
(x * y) + (x * x) (98)
((x * y) + (x * x)) * ((2.0 - 2.0) + (2.0 / 2.0)) (1)
((x / (y / x)) + x) * y (1)

Some of these solutions are just permutations of the best solution, i.e. the places of the variable names or expressions are swapped. Others are not very interesting either, or, like the 5th one, not at all useful in this example. The number in parentheses after each solution is the number of occurrences of that solution. Luckily the 5th solution was generated only once.
During the evolution we also see that the correlation coefficient improves from 0.979073833348314 (which is rather good, and unusual except in easy problems like this) to a perfect fit of 1.0.
Other notes:
- I consciously selected a rather bad run for this simple example to be able to show the development over the generations. In other runs the problem was solved in generation 2 or 3, and sometimes even in generation 0. This indicates that it is a very easy problem that could actually have been solved by plain random search. For more serious (larger) problems this is not the case.
- The options in the configuration file are explained below. One of the trickiest things in genetic programming is selecting the functions to use. In this problem the functions Add and Multiply would suffice, but that is - of course - a special case.
Example: Fibonacci sequence, as lagged "time series"
Let's take another example, one that I wrote about in my Eureqa posting: Fibonacci numbers. The first numbers in the series are 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946.

The relation between the numbers is, of course, that the next number F(n) is F(n-1) + F(n-2), e.g. 1 + 2 = 3, 2 + 3 = 5, 3 + 5 = 8, etc. It is a recursive equation.
In order to analyze this kind of recursive sequence (in this program) we need to translate the sequence into a lagged "time series". Let us assume that we suspect it is a recursive sequence, but we don't know the exact relation or the number of previous elements that are needed. So we make a series where the output variable can be considered dependent on either 1, 2, or 3 variables. Like this:
1,1,2,3
1,2,3,5
2,3,5,8
...

After doing this transformation, we proceed in the same manner as for the number puzzle above, i.e. create a configuration file (fib1.conf) with the options and the data:
# Fibonacci with 3 variables
num_input_variables: 3
variable_names: F1 F2 F3 F4
functions: Multiply,Divide,Add,Subtract
terminal_range: -10 10
max_init_depth: 4
population_size: 20
max_crossover_depth: 8
num_evolutions: 100
max_nodes: 21
show_population: false
stop_criteria_fitness: 0
data
1,1,2,3
1,2,3,5
2,3,5,8
3,5,8,13
5,8,13,21
8,13,21,34
13,21,34,55
21,34,55,89
34,55,89,144
55,89,144,233
89,144,233,377
144,233,377,610
233,377,610,987
377,610,987,1597
610,987,1597,2584
987,1597,2584,4181
1597,2584,4181,6765
2584,4181,6765,10946
4181,6765,10946,17711
6765,10946,17711,28657
10946,17711,28657,46368

And here is a complete run (slightly edited) where the correct solution, F3 + F2, is found in generation 10. Also, here we added the option stop_criteria_fitness: 0, which makes the program exit once that criterion has been reached.
It was 21 data rows
Presentation: This is the Fibonacci series
output_variable: F4 (index: 3)
input variable: F1
input variable: F2
input variable: F3
function1: &1 * &2
function1: /
function1: &1 + &2
function1: &1 - &2
function1: 4.0
Evolving generation 0/100(time from start: 0,01s)
Best solution fitness: 17619.64
Best solution: ((F3 + F1) + 7.0) + (F1 / F3)
Depth of chrom: 3. Number of functions+terminals: 9 (4 functions, 5 terminals)
Correlation coefficient: 0.9999999998872628
Evolving generation 2/100(time from start: 0,06s)
Best solution fitness: 7050.70
Best solution: ((F2 + F2) * (F1 + F1)) / (((F2 + 7.0) + F2) + (F1 - F3))
Depth of chrom: 4. Number of functions+terminals: 17 (8 functions, 9 terminals)
Correlation coefficient: 0.9999999122259431
Evolving generation 9/100(time from start: 0,19s)
Best solution fitness: 6749.0
Best solution: (F2 / F2) + (4.0 * F1)
Depth of chrom: 2. Number of functions+terminals: 7 (3 functions, 4 terminals)
Correlation coefficient: 0.99999999956117
Evolving generation 10/100(time from start: 0,21s)
Best solution fitness: 0.0
Best solution: F3 + F2
Depth of chrom: 1. Number of functions+terminals: 3 (1 functions, 2 terminals)
Correlation coefficient: 1.0
Fitness stopping criteria (0.0) reached with fitness 0.0 at generation 10
All time best (from generation 10)
Evolving generation 10/100(time from start: 0,21s)
Best solution fitness: 0.0
Best solution: F3 + F2
Depth of chrom: 1. Number of functions+terminals: 3 (1 functions, 2 terminals)
Correlation coefficient: 1.0
Total time 0,21s
All solutions with the best fitness (0.0):
F3 + F2 (1)
It was 1 different solutions with fitness 0.0

A variant where we remove the stopping criterion and instead use show_similar: true results in the following similar solutions:
All solutions with the best fitness (0.0):
(F2 / 1.0) + F3 (1)
(3.0 / 3.0) * (F3 + F2) (1)
F3 + F2 (406)
F2 + F3 (7)
(-2.0 / -2.0) * (F3 + F2) (1)
It was 5 different solutions with fitness 0.0

TODO: I have some plans to automatically generate this kind of lagged time series for a sequence, and hope it will be finished soon. This is not rocket science, but it can be tricky in the details.
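As a sketch of what that generation could look like, here is a hypothetical helper (not part of the current program) that turns a sequence into lagged rows with a given number of input columns plus one output column:

import java.util.ArrayList;
import java.util.List;

public class LagSeries {

    // With lags = 3, each row is (x[i], x[i+1], x[i+2], x[i+3]),
    // i.e. three input variables and one output variable.
    public static List<double[]> lag(double[] seq, int lags) {
        List<double[]> rows = new ArrayList<>();
        for (int i = 0; i + lags < seq.length; i++) {
            double[] row = new double[lags + 1];
            for (int j = 0; j <= lags; j++) {
                row[j] = seq[i + j];
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        double[] fib = {1, 1, 2, 3, 5, 8, 13, 21};
        for (double[] row : lag(fib, 3)) {
            StringBuilder sb = new StringBuilder();
            for (int j = 0; j < row.length; j++) {
                if (j > 0) sb.append(",");
                sb.append((long) row[j]);
            }
            System.out.println(sb); // prints e.g. "1,1,2,3"
        }
    }
}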
Example: Classification with the Iris data set
The Iris data set is a classic classification example. It is used to classify three different species (classes) of the Iris flower: Iris setosa, Iris virginica, and Iris versicolor. Since my system cannot - yet - handle nominal data, the three different classes are translated to the numbers 1.0, 2.0, and 3.0 respectively. In the configuration file iris.conf we use the following functions:
functions: Multiply,Divide,Add,Subtract,IfLessThanOrEqualD

The function IfLessThanOrEqualD may be worth a comment: I am currently reading John Koza's first Genetic Programming book, Genetic Programming: v. 1 On the Programming of Computers by Means of Natural Selection (ISBN: 9780262111706). He mentions the IFLTE function (if a <= b then c else d), which I implemented by cloning and mutating my IfElseD.java function (see, programmers also use these operators :). As Koza writes on page 365, it can be used instead of the following functions:
- <
- <=
- >
- >=
- if then else
Equality (==) is also in this list, but I'm not so convinced that IFLTE can properly replace it.
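To illustrate (my own worked examples, not taken from the book), the comparisons can all be expressed by swapping the arguments and/or the branches of IFLTE:

if a <= b then c else d  :  IFLTE(a, b, c, d)
if a >  b then c else d  :  IFLTE(a, b, d, c)   (negation: swap the branches)
if a >= b then c else d  :  IFLTE(b, a, c, d)   (a >= b is the same as b <= a)
if a <  b then c else d  :  IFLTE(b, a, d, c)   (negation of b <= a)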
Some notes:
- this is a selected - but real - run; all runs are not this good (or bad)
- here we use the complete data set as fitness cases. Normally a certain percentage of the data should be used as a validation set (with the option validation_pct). See below for a discussion of this.
population_size: 100
num_evolutions: 100

Other configuration options in the file:
- num_input_variables: 4
  Here we define that the number of input variables is 4 (there is always exactly one output variable). (Note: this should probably be detected automatically.)
- variable_names: sl sw pl pw class
  The names of the variables, in order. They should be rather short so they don't clutter the output expressions too much. The last variable (class) is the output variable. Note that it is possible to change which variable is used as the output variable with the option output_variable, stating its (0-based) position in the variable list. Here we could have used: output_variable: 4.
- terminal_range: -20 20
  The range of the Terminal (ephemeral terminal), i.e. the numbers used in the solution expressions.
- terminal_wholenumbers: false
  A special option stating whether only whole numbers (or rather, double representations of integers) should be used. In this example we don't want to restrict the terminals to integers.
- max_nodes: 31
  This is about the only structural restriction there is in genetic programming: it states the maximum number of nodes in the expression trees. Nodes are either functions or terminals (numbers), so here we allow up to 31 functions or terminals.
- mutation_prob: 0.01
  Probability of mutation. The probabilities of mutation and crossover are very important and may make the difference between an effective run and slow progress. One has to experiment with these for larger problems.
- crossover_prob: 0.9
  Probability of crossover. See the comment above.
- show_progression: true
  If this option is set, the generation number is printed on the last line of the output. Can be useful for larger problems.
- show_results: true
  If true, the results of the fittest program are shown.
- result_precision: 5
  For show_results: the precision of the results.
- hits_criteria: 0.5
  If the absolute difference between the result and the real value is at or below this value, it is considered a hit; when set, the fitness measure is the number of non-hits (see the sketch after this list). The value 0.5 is chosen since the three classes are represented as the values 1, 2, and 3, so a difference <= 0.5 singles out one class. (Thoughtful and late note: maybe it would suffice to add the function Round or Floor?)
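The non-hits fitness mentioned for hits_criteria can be sketched like this (my illustration of the idea; the actual implementation is in SymbolicRegression.java):

static double hitsFitness(double[] predicted, double[] actual, double hitsCriteria) {
    // Count the fitness cases that are NOT hits, i.e. where the
    // absolute difference exceeds hits_criteria; 0 means all hits.
    int nonHits = 0;
    for (int i = 0; i < predicted.length; i++) {
        if (Math.abs(predicted[i] - actual[i]) > hitsCriteria) {
            nonHits++;
        }
    }
    return nonHits;
}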
It was 150 data rows
Presentation: Iris
output_variable: class (index: 4)
input variable: sl
input variable: sw
input variable: pl
input variable: pw
function1: &1 * &2
function1: &1 + &2
function1: &1 - &2
function1: /
function1: if(&1 <= &2) then (&3) else(&4)
function1: 16.624795863695333
Evolving generation 0/100(time from start: 0,16s)
Best solution fitness: 64.0
Best solution: ((sw * pl) - (pw * pw)) / ((sl - pl) * (sl - sw))
Depth of chrom: 3. Number of functions+terminals: 15 (7 functions, 8 terminals)
Correlation coefficient: 0.6621167550765015
Number of hits (<= 0.5): 86 (of 150 = 0,57)
Results for this program:
...
(22) 1.0: 0,98889 (diff: 0,01111)
(23) 1.0: 0,87582 (diff: 0,12418)
(24) 1.0: 1,58128 (diff: -0,58128) > 0.5!
(25) 1.0: 0,70000 (diff: 0,30000)
...
total diff: 111.08622273795162 (no abs diff: -42.32778738529426 #hits: 86 (of 150)
Evolving generation 1/100(time from start: 0,29s)
Best solution fitness: 48.0
Best solution: ((sw + pl) * sl) / (18.98720292178895 + pl)
Depth of chrom: 3. Number of functions+terminals: 9 (4 functions, 5 terminals)
Correlation coefficient: 0.8492617928834261
Number of hits (<= 0.5): 102 (of 150 = 0,68)
Results for this program:
...
total diff: 60.411404633990145 (no abs diff: 35.03754935483205 #hits: 102 (of 150)
Evolving generation 4/100(time from start: 0,61s)
Best solution fitness: 7.0
Best solution: (19.035483741865328 / 19.035483741865328) + pw
Depth of chrom: 2. Number of functions+terminals: 5 (2 functions, 3 terminals)
Correlation coefficient: 0.9564638238016164
Number of hits (<= 0.5): 143 (of 150 = 0,95)
...
total diff: 39.80000000000001 (no abs diff: -29.800000000000004 #hits: 143 (of 150)
Evolving generation 16/100(time from start: 0,94s)
Best solution fitness: 6.0
Best solution: (((19.035483741865328 / 19.035483741865328) + pw) / (sw / sw)) - ((if(pw <= pl) then (sw) else(pl)) / (sw + 14.495973076477487))
Depth of chrom: 4. Number of functions+terminals: 19 (8 functions, 11 terminals)
Correlation coefficient: 0.9582679236481598
Number of hits (<= 0.5): 144 (of 150 = 0,96)
Results for this program:
...
total diff: 26.831398327892167 (no abs diff: -3.7720516041580767 #hits: 144 (of 150)
Evolving generation 39/100(time from start: 1,54s)
Best solution fitness: 5.0
Best solution: (((19.035483741865328 / 19.035483741865328) + pw) / (sw / sw)) - ((if(pw <= pl) then (sw) else(pl)) / (((14.495973076477487 - (sl + sl)) / sw) + 14.495973076477487))
Depth of chrom: 6. Number of functions+terminals: 25 (11 functions, 14 terminals)
Correlation coefficient: 0.958786524094953
Number of hits (<= 0.5): 145 (of 150 = 0,97)
Results for this program:
(0) 1.0: 0,97740 (diff: 0,02260)
(1) 1.0: 1,01322 (diff: -0,01322)
(2) 1.0: 1,00110 (diff: -0,00110)
(3) 1.0: 1,00869 (diff: -0,00869)
...
(69) 2.0: 1,94192 (diff: 0,05808)
(70) 2.0: 2,59137 (diff: -0,59137) > 0.5!
(71) 2.0: 2,11718 (diff: -0,11718)
...
(118) 3.0: 3,11623 (diff: -0,11623)
(119) 3.0: 2,35925 (diff: 0,64075) > 0.5!
(120) 3.0: 3,08251 (diff: -0,08251)
...
(128) 3.0: 2,91459 (diff: 0,08541)
(129) 3.0: 2,39350 (diff: 0,60650) > 0.5!
(130) 3.0: 2,70539 (diff: 0,29461)
(131) 3.0: 2,73150 (diff: 0,26850)
(132) 3.0: 3,01459 (diff: -0,01459)
(133) 3.0: 2,31546 (diff: 0,68454) > 0.5!
(134) 3.0: 2,23094 (diff: 0,76906) > 0.5!
(135) 3.0: 3,08865 (diff: -0,08865)
(136) 3.0: 3,17414 (diff: -0,17414)
...
(148) 3.0: 3,07502 (diff: -0,07502)
(149) 3.0: 2,60513 (diff: 0,39487)
total diff: 26.224671119186706 (no abs diff: -0.04627940721407109 #hits: 145 (of 150)
...
Total time 2,91s

The solution of the best fit program (generation 39) is this.
Solution: (((19.035483741865328 / 19.035483741865328) + pw) / (sw / sw)) - ((if(pw <= pl) then (sw) else(pl)) / (((14.495973076477487 - (sl + sl)) / sw) + 14.495973076477487))

which is kind of funny looking. Here are some comments:
- 19.035483741865328 / 19.035483741865328 is 1. The program doesn't simplify such expressions, but this would be nice to have. (As I understand it, Eureqa has some kind of simplification process.)
- the if then else is used as an expression whose return value is used directly in the solution. Something we don't see here, but which may happen with other settings, is that the logical operators (or expressions using these operators) return 1.0 (true) or 0.0 (false), and these values are then used directly in the calculations like any other expression.
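For the record, simplifying by hand (19.035.../19.035... = 1, sw / sw = 1, and sl + sl = 2*sl) gives the somewhat more readable:

(1 + pw) - ((if(pw <= pl) then (sw) else(pl)) / (((14.495973076477487 - 2*sl) / sw) + 14.495973076477487))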
Best solution fitness: 5.0
...
Number of hits (<= 0.5): 145 (of 150 = 0,97)

We defined a hit as an absolute difference <= 0.5 (and the fitness as the number of non-hits), and we see that there are 5 wrongly classified instances (150 - 145 = 5), i.e. a hit rate of about 97%. Not too bad, but not very good either.
If we study the instances for the best fit program of generation 39 (and the overall best) more closely, we see that class 1 (instances 1-50) is classified very well, i.e. none are misclassified. For class 2 (instances 51-100) there is one misclassified instance (#70), and the rest of the misclassified instances are of class 3 (#119, #129, #133, and #134). However, due to the very random nature of genetic programming, another run could give a different number of misclassifications, and different misclassified instances. Still, these instances - especially #70 - are often misclassified as class 3.
Validation set
In this Iris run we used quite low values for the population size and the number of generations. With higher values, say a population size of 1000 and 1000 generations, it may well be possible to get a perfect score, and that happened sometimes when I experimented. However, this perfect fit could be bad, since the fittest program has then overfitted the data: it has learned the problem too well and may not be usable for classifying new instances.

A standard procedure in machine learning (and other data analysis) to remedy overfitting is to move some of the test cases into a validation set, i.e. instances not seen during the training session, and then test the solution against this validation set. The option validation_pct does exactly that. The value of the option is the percentage of the data that will be placed in the validation set. Or more exactly: it is the probability that a specific fitness case will be placed in the validation set.
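A minimal sketch of that probabilistic split (my illustration of the idea, not the program's actual code; data is assumed to hold the fitness cases and validationPct the percentage as a fraction):

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Each fitness case lands in the validation set with probability
// validationPct, so the exact split sizes vary between runs.
Random rnd = new Random();
List<double[]> train = new ArrayList<>();
List<double[]> validation = new ArrayList<>();
for (double[] row : data) {
    if (rnd.nextDouble() < validationPct) {
        validation.add(row);
    } else {
        train.add(row);
    }
}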
Compiling the program
All files mentioned here (and at my JGAP page) are collected in the file symbolic_regression.zip. As of now, it is not packaged into a nice Jar file, so you have to compile it manually. There is no file structure either, so the files should be unzipped into the same directory.

SymbolicRegression.java is the main program. It is based on JGAP's example MathProblem.java but extended with a lot of bells & whistles.
The program is compiled (on a Linux box) like this. Note that you must have JGAP installed.
javac -Xlint:unchecked -classpath "jgap/jgap.jar:jgap/lib/log4j.jar:jgap/lib/xstream-1.2.2.jar:jgap/lib/commons-lang-2.1.jar:$CLASSPATH" SymbolicRegression.java

and run with:

java -server -Xmx1024m -Xss2M -classpath "jgap/jgap.jar:jgap/lib/log4j.jar:jgap/lib/xstream-1.2.2.jar:jgap/lib/commons-lang-2.1.jar:$CLASSPATH" SymbolicRegression [config file]

Here is my log4j.properties file.
Supported functions from JGAP
The SymbolicRegression program has support for many of the GP functions in JGAP. The "main" type is double, so not all functions are applicable (e.g. IfElse etc.). However, for the ADF functions (defined by setting adf_arity to > 0, see below) more functions are supported. Please note that some of these functions are experimental (or very experimental) and the results may not make sense in this context.

For some functions I have made a similar one that returns double instead of - say - Boolean. These variants are mentioned in the list below and have the suffix D.
- Multiply (double)
- Multiply3 (double)
- Add (double)
- Add3 (double)
- Add4 (double)
- Divide (double)
- Subtract (double)
- Sine (double)
- ArcSine (double)
- Tangent (double)
- ArcTangent (double)
- Cosine (double)
- ArcCosine (double)
- Exp (double)
- Log (double)
- Abs (double)
- Pow (double)
- Round (double), compare with my RoundD
- Ceil (double)
- Floor (double)
- Modulo (double), implements Java's % operator for double. See ModuloD for a variant
- Max (double)
- Min (double)
- LesserThan (boolean)
- GreaterThan (boolean)
- If (boolean)
- IfElse (boolean), cf IfElseD
- IfDyn (boolean)
- Loop (boolean), cf the experimental LoopD
- Equals (boolean), cf EqualsD
- ForXLoop (boolean)
- ForLoop (boolean)
- Increment (boolean)
- Pop (boolean)
- Push (boolean)
- And (boolean), cf the double variant AndD
- Or (boolean), cf the double variant OrD
- Xor (boolean), cf the double variant XorD
- Not (boolean), cf the double variant NotD
- SubProgram (boolean, experimental)
- Tupel (boolean, experimental)
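To use any of these, just list them in the functions option of a configuration file, e.g. (a hypothetical fragment using only functions from the list above):

functions: Multiply,Divide,Add,Subtract,Sine,Cosine,Exp,Log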
My own functions

Here is a complete list of the functions I have written (the list will hopefully grow). Some of these may be considered experimental, but they may be of some use in experimental settings (or just for learning).

- Boolean operators for DoubleClass. These Boolean operators are for use with DoubleClass, i.e. they take double as input and return a double (0.0d or 1.0d). Some of these functions are tested in odd_parity.conf.
  - AndD.java: And
  - DifferentD.java: Different
  - EqualsD.java: Equals
  - GreaterThanD.java: GreaterThan
  - GreaterThanOrEqualD.java: GreaterThanOrEqual
  - IfElseD.java: IfElse
  - IfLessThanOrEqualD.java: If Less Than Or Equal Then .. Else (if a <= b then c else d). Inspired by Koza's IFLTE function
  - LesserThanD.java: LesserThan
  - LesserThanOrEqualD.java: LesserThanOrEqual
  - NotD.java: Not
  - OrD.java: Or
  - XorD.java: Xor
- ModuloD.java: Modulo with double as input and output. First the input is converted to integers, then an integer modulo is calculated, which is returned as a double. (The standard % operator on double is not what I wanted.) This is tested in isbn_test.conf.
- ModuloReplaceD.java: Sometimes we want the Modulo function not to return 0 but some other value (e.g. the highest possible value in the data set). Then this function may be tried. Note: the replacement value is set manually in the configuration option mod_replace. This should be considered highly experimental.
- DivideIntD.java: A protected variant of Divide where the division is done by first converting to Integer and then doing an integer division. Also, if the divisor is 0 (zero), the result is 1 (i.e. protected).
- DivideProtected.java: A protected variant of Divide: the result is 1 (i.e. protected) if the divisor is 0 (zero), otherwise standard double division.
- Mathematical functions:
  - Cube.java: Cube (x^3)
  - Gamma.java: Gamma
  - Gaussian.java: Gaussian
  - Hill.java: Hill
  - Logistic.java: Logistic
  - RoundD.java: RoundD, my version of round()
  - Sigmoid.java: Sigmoid
  - Sign.java: Sign
  - Sqrt.java: Sqrt
  - Square.java: Square (x^2)
  - Step.java: Step
- Other functions:
  - Id.java: Identity function
  - LoopD.java: Loop for double. Highly experimental.
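Since the D variants take and return double (0.0d or 1.0d), they can be listed just like the arithmetic functions. For instance, an odd parity configuration could declare (a sketch; see odd_parity.conf in the list of configuration files below for the real file):

functions: AndD,OrD,NotD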
Making new functions
As you can see, there are about 30 new functions written for this package, and it is quite simple to write new ones. My own method for writing a new function is as follows (a sketch of the resulting class follows after the list). Let's see how to write the Sqrt function (which is here).

- decide what the new function will do: Sqrt of a function.
- copy a similar function, if possible. Here I just copied the function org.jgap.gp.function.Log from the JGAP distribution.
- change the old name to the new function name: "Log" -> "Sqrt"
- change the way the function is presented in toString(): "sqrt &1". If a function has more arguments, the arguments are written as "&1", "&2", "&3", etc. E.g. the ModuloD function has the presentation "&1 mod &2", but it could be "mod(&1,&2)" or even "(mod &1 &2)" depending on the preferred style of output. (Hmm, maybe there should be an option in all functions for how to represent the names, e.g. mathematical, Java version, Lisp version. I have to think about this more.)
- change the textual representation in getName(). We use Sqrt.
- state the logic of the function in execute_double. Since double is the only type supported right now, it suffices to change execute_double. However, in some of the files there is also support for other types, e.g. execute_float, execute_int, etc.
- change the number of arguments and the call name in execute_object: here we use execute_sqrt as the call name. The same name is to be used in Compatible.
- add the function name to the makeCommands method in SymbolicRegression.java. Recompile.
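Putting the steps together, the Sqrt class could look roughly like this (a sketch modelled on my reading of JGAP's Log function; check it against the actual Sqrt.java and your JGAP version, since the exact signatures may differ):

import org.jgap.InvalidConfigurationException;
import org.jgap.gp.MathCommand;
import org.jgap.gp.impl.GPConfiguration;
import org.jgap.gp.impl.ProgramChromosome;

public class Sqrt extends MathCommand {

    public Sqrt(final GPConfiguration a_conf, Class a_returnType)
            throws InvalidConfigurationException {
        super(a_conf, 1, a_returnType); // arity 1: one argument
    }

    // How the function is presented in the evolved expressions.
    public String toString() {
        return "sqrt &1";
    }

    public String getName() {
        return "Sqrt";
    }

    // The logic for the double type: sqrt of the single argument.
    public double execute_double(ProgramChromosome c, int n, Object[] args) {
        return Math.sqrt(c.execute_double(n, 0, args));
    }

    // The call name used for other object types; the same name is
    // used in the Compatible interface.
    public Object execute_object(ProgramChromosome c, int n, Object[] args) {
        return ((Compatible) c.execute_object(n, 0, args)).execute_sqrt();
    }

    public interface Compatible {
        Object execute_sqrt();
    }
}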
Configuration files
One of the primary tasks in writing my own Symbolic Regression program was to be able to use a configuration file to state the problem and the data. Below are some examples. Please note that some of these are experimental (and use experimental parameters/operators), and they may not give any interesting or good results. More info about the data/problem is usually in the header of the file.

Some of these problems were first tested with Eureqa and were commented on in Eureqa: Equation discovery with genetic programming (a Google Translation of my original Swedish blog post Eureqa: equation discovery med genetisk programmering).
- alldifferent3.conf: All variables should be different
- bolts.conf: Bolts. A machine learning example
- boyles_law.conf: Boyle's law.
- catalan.conf: Catalan numbers
- circle_1.conf: Circle
- exp_formula.conf: Test of Exp function
- exp_formula_no_exp.conf: Test of Exp function, but without Exp in the function list.
- fahrenheit_celsius.conf: Fahrenheit to Celsius conversion. You may experiment by changing output_variable to 0 for the reverse conversion (C -> F).
- fib1.conf: Fibonacci series as a time series.
- fib2.conf: Fibonacci series as a time series.
- fib_50.conf: Fibonacci numbers, where I try to find the closed formula for the Fibonacci number.
- func1.conf: Unknown function (from a homework in a course in Evolutionary Computing)
- gamma_test.conf: Test of the Gamma function
- gelman.conf: Linear regression
- henon_100.conf: Henon, 100 data points
- heron_formula.conf: Heron formula
- intro_page_262.conf: A simple problem
- iris.conf: Iris data set
- isbn_test.conf: Trying to get the program to calculate the checksum for ISBN13
- longley.conf: Longley's data set of employment numbers
- majority_on_3.conf: Boolean 3-majority-on
- mod_test.conf: Test of modulus operator.
- moons.conf: Moons data
- multiplexer_3.conf: 3-multiplexer (i.e. IfThenElse)
- multiplexer_6.conf: 6-multiplexer
- multiplexer_11.conf: 11-multiplexer
- mysterious.conf: Mysterious function
- number_puzzle1.conf: Number puzzle from Roger Alsing's blog post Genetic Programming: Code smarter than you!
- odd_parity.conf: Odd parity, using the double variants of the boolean functions, i.e. AndD, OrD, NotD (see above)
- odd_parity_double.conf: Odd parity, using the arithmetic functions +,-,*,/
- odd_parity2.conf: Odd parity for two inputs
- p10.conf: Polynomial P(10)
- p4.conf: Polynomial P(4)
- p4_2.conf: Polynomial P(4)
- p4_jgap.conf: Polynomial P(4). This is the version in the JGAP example MathProblem.java
- p6_2.conf: Polynomial P(6)
- planets.conf: Planets, i.e. Kepler's third law.
- quintic.conf: Quintic polynomial
- regression_koza.conf: Regression (0.5 * x^2, from John R. Koza's Lisp implementation)
- regression_psh.conf: Regression (y = 12x^2 + 5, from Psh)
- seq_ind1.conf: Sequence induction problem: 5*j^4+4*j^3+3*j^2+2*j+1 (for integers 0..10)
- sigmoid_test.conf: Test of Sigmoid function
- sin_formula.conf: Test of Sine
- sin_formula_rand20.conf: Test of Sine
- sine_tiny_gp.conf: Test of Sine from TinyGP
- sorted_3.conf: Sorting 3 variables.
- sqrt_formula2.conf: Yet another test of Sqrt
- sqrt_formula3.conf: Yet another test of Sqrt
- sqrt_formula.conf: Test of Sqrt function
- sunspots.conf: Sunspots data as time series
- test1.conf: A test of many functions.
- test2.conf: A test of new functions.
- tic_tac_toe.conf: Tic-tac-toe
- triangular_numbers.conf: Triangular numbers
- weather.conf: Weather (classic classification example)
- x2.conf: A simple polynomial: x^2
- x4-x3+x2-x.conf: Polynomial: x^4-x^3+x^2-x
- zoo2.conf: Zoo (classic classification example)
The configuration parameters
The configuration file consists of the following parameters. Here is a short explanation; the full story is in the code: SymbolicRegression.java. Most of the parameters have reasonable default values, taken from either MathProblem.java or GPConfiguration.

- #, %: Line comments; lines that start with "#" or "%" are ignored.
- presentation: A text which is shown first in the run.
- num_input_variables: Number of input variables in the data set.
- output_variable: The index (0-based) of the output variable. Default is the last variable.
- variable_names: The names of the variables, in order. Default is "V0", "V1", etc.
- data: Starts the data section, with one data row per line. The attributes may be separated by "," or some space. The decimal point is a "." (dot). If a data row contains a "?" (question mark) in the position of the output variable, it is considered a "user defined test" and the fittest program will be tested against this data at the end of the run.
- terminal_range: The range for the Terminal, as "lower upper". Note: only one Terminal is used.
- terminal_wholenumbers: If the Terminal should use whole numbers or not (boolean).
- constant: Define a Constant with this value.
- functions: Define the functions, with the same names as in JGAP (or my own defined functions).
- adf_arity: If > 0 then ADF is used. This is somewhat experimental, as I am still trying to understand how ADFs work.
- adf_function: The functions used for ADF.
- adf_type: Either double or boolean. If set to boolean, we can use the boolean and logical operators.
- max_init_depth: JGAP parameter maxInitDepth.
- min_init_depth: JGAP parameter minInitDepth.
- program_creation_max_tries: JGAP parameter programCreationMaxTries.
- population_size: JGAP parameter populationSize.
- max_crossover_depth: JGAP parameter maxCrossoverDepth.
- function_prob: JGAP parameter functionProb.
- reproduction_prob: JGAP parameter reproductionProb.
- mutation_prob: JGAP parameter mutationProb.
- crossover_prob: JGAP parameter crossoverProb.
- dynamize_arity_prob: JGAP parameter dynamizeArityProb.
- no_command_gene_cloning: JGAP parameter no_command_gene_cloning.
- use_program_cache: JGAP parameter use_program_cache.
- new_chroms_percent: JGAP parameter newChromsPercent.
- num_evolutions: JGAP parameter numEvolution.
- tournament_selector_size: JGAP parameter tournamentSelectorSize.
- max_nodes: JGAP parameter maxNodes.
- scale_error: Sometimes the data values are very small, which gives small fitness values (i.e. errors), making it hard to see any progress. Setting this parameter multiplies the errors by this value.
- stop_criteria_fitness: If set (>= 0), the program will run "forever" (instead of num_evolutions generations) until the fitness is less than or equal to this value.
- show_population: Shows the whole population in each generation. Mainly for debugging purposes.
- show_similar: Shows all the solutions (programs) with the same fitness value as the best solution.
- show_progression: boolean. If true, the generation number is shown for all generations, even when nothing is happening (i.e. no gain in fitness).
- sample_pct: (float) Takes a (sample) percentage of the data set if > 0.0.
- validation_pct: Withholds a percentage of the fitness cases as a validation set. The fitness of this validation set is shown.
- show_all_generations: Show info for all generations, not just when the fitness changes.
- hits_criteria: Criterion for a hit: if the difference is <= this value, it is considered a hit. The number of non-hits is then used as the fitness measure instead of the sum of errors. Setting this option also shows the number of hits.
- mod_replace: Sets the replacement value of 0 (zero) for the ModuloReplaceD function (see above).
- show_results: boolean. If set, all the fitness cases are shown with the output of the fitted program and the difference to the correct values.
- result_precision: The precision of the output used in show_results; default 5.
- ignore_variables: (TBW) It would be nice to be able to ignore some variables in the data set, but this is yet to be written.
- return_type: (TBW) This should be the type of the "main" return value. Note: it is now hard coded in the program as double/DoubleClass.
ADF - Automatically Defined Function
JGAP, and thereby the SymbolicRegression program, supports ADF (Automatically Defined Functions), which is a very interesting topic. See section 6.1 Evolving Modular and Hierarchical Structures in A Field Guide to Genetic Programming for more info.

The SymbolicRegression program's support for ADFs is not very well tested yet. I have tested ADF in some configurations, but I am not very happy with the results. One example is sunspots.conf, which has the following ADF related options:
adf_arity: 0
adf_type: boolean
adf_functions: IfElse,GreaterThan,LesserThan

One of the problems I have with ADF is that many of the interesting ADF functions, e.g. Loop, ForLoop, and ForXLoop, require a different representation than SymbolicRegression supports. In spite of this, it can be interesting to experiment with the existing support for ADF.
Explanations of the ADF related options:
- adf_arity: When > 0, ADF is activated and all the ADF functions have this arity (number of arguments).
- adf_type: Return type for the ADF functions. Can be either boolean or double. In order to work, the ADF functions must support the stated type (and it is here I have some problems).
- adf_function: A list of functions to be used as ADFs.
Todo
Here are some TODOs, or things that would be nice to have. This list was simply copied from my JGAP page when writing this blog post.

- option for ignoring specific variables
- option for stopping:
  - running forever
  - after a specific time
- accept nominal values in the data section, converting them to numeric values.
- add more fitness metrics.
- better handling of punishing longer solutions (parsimony pressure).
- support for different "main" return classes, i.e. not just DoubleClass
- correlation coefficient, and other statistical measures, e.g. R-squared, mean squared error, mean absolute error, minimum error, maximum error
- more/better error checks
- more building blocks, a la Eureqa http://ccsl.mae.cornell.edu/eureqa_ops
- support for derivatives (a la Eureqa)?
- incorporate in Weka?
- simplify the best solution with a CAS?
Also see
During this current journey into Genetic Programming I have read the following two books, in this order.

- Wolfgang Banzhaf, Peter Nordin, Robert E. Keller, Frank D. Francone: Genetic Programming - An Introduction (Elsevier Science & Technology, 1998, ISBN: 9781558605107). This is a great introduction to GP.
- Riccardo Poli, William B. Langdon, Nicholas F. McPhee, with contributions by John R. Koza: A Field Guide to Genetic Programming (Lulu.com, 2008, ISBN: 9781409200734; also available as a free HTML version). This is a great summary of the current state of genetic programming.
See also:
- Eureqa: Equation discovery with genetic programming (a Google Translation of my original Swedish blog post Eureqa: equation discovery med genetisk programmering)
- My Eureqa page
- Eureqa, the homepage of Eureqa
- My Weka page, where other machine learning topics are shown