My JGAP page: Genetic Programming and Symbolic Regression
JGAP (Java Genetic Algorithms Package) is a Java based system for genetic algorithms and genetic programming.
For more about JGAP:
Symbolic Regression
2010-02-21: I blogged about this package in Symbolic regression (using genetic programming) with JGAP.
Download all Java and configuration files mentioned below: symbolic_regression.zip.
Here are the results of my experiments with Symbolic Regression using Genetic Programming in JGAP. Right now I'm learning both JGAP and genetic programming, so there are, for sure, peculiarities in these files. As I learn more, new features will be added and old ones removed.
SymbolicRegression.java is the main program. As you may recognize, it is based on JGAP's example MathProblem.java, but extended with some bells & whistles.
The program is compiled (on a Linux box) like this:
javac -Xlint:unchecked -classpath "jgap/jgap.jar:jgap/lib/log4j.jar:jgap/lib/xstream-1.2.2.jar:jgap/lib/commons-lang-2.1.jar:$CLASSPATH" SymbolicRegression.java
and run with:
java -server -Xmx1024m -Xss2M -classpath "jgap/jgap.jar:jgap/lib/log4j.jar:jgap/lib/xstream-1.2.2.jar:jgap/lib/commons-lang-2.1.jar:$CLASSPATH" SymbolicRegression [config file]
For compiling, all the Java files below must be downloaded and placed in the same directory as this file. Here is my log4j.properties file.
See below for more about the configuration files.
Note: Most of these files were incorporated (in some cases with changes by Klaus Meffert) in the JGAP distribution, version 3.6 (directory examples/src/examples/gp/symbolicRegression).
Defined functions
Here are my own defined functions. Some of these may be considered experimental, but they may be of some use. Please note that the code is not fully tidied up.
 Boolean operators for DoubleClass: boolean operators for use with DoubleClass, i.e. they take a double as input and return a double (0.0d or 1.0d). Some of these functions are tested in odd_parity.conf.
 ModuloD.java: Modulo with double as input and output. First the inputs are converted to integers, then an integer modulo is done, and the result is returned as a double. (The standard % operator on double is not what I wanted.) This is tested in isbn_test.conf.
 ModuloReplaceD.java: Sometimes we want the modulo function to return not 0 but some other value (e.g. the highest possible value in the data set). Then this function may be tried. Note: the replacement value is set manually with the configuration option mod_replace. This should be considered highly experimental.
 DivideIntD.java: A protected variant of Divide where the division is done by first converting to integers and then doing an integer division. Also, if the divisor is 0 (zero), the result is 1 (i.e. protected).
 DivideProtected.java: A protected variant of Divide: the result is 1 (i.e. protected) if the divisor is 0 (zero), otherwise standard double division.
 Mathematical functions.
 Other functions
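The semantics of the protected operators above can be sketched in plain Java. These are not the actual JGAP CommandGene classes, just the arithmetic they apply; in particular, returning 0.0 for a zero divisor in moduloD is my assumption.

```java
// Sketch of the semantics of the protected operators described above.
// NOT the actual JGAP command classes, just the arithmetic they apply.
public class ProtectedOps {

    // DivideProtected: ordinary double division, but return 1.0
    // when the divisor is 0 (zero).
    static double divideProtected(double a, double b) {
        return b == 0.0 ? 1.0 : a / b;
    }

    // DivideIntD: convert both arguments to int, do an integer
    // division, and return 1.0 if the divisor becomes 0.
    static double divideIntD(double a, double b) {
        int bi = (int) b;
        return bi == 0 ? 1.0 : (double) ((int) a / bi);
    }

    // ModuloD: convert to int, take the integer modulo, return as
    // double. (Returning 0.0 for a zero divisor is an assumption
    // of this sketch.)
    static double moduloD(double a, double b) {
        int bi = (int) b;
        return bi == 0 ? 0.0 : (double) ((int) a % bi);
    }

    // AndD: boolean AND on doubles, where a nonzero value counts as
    // true; the result is 0.0d or 1.0d as described above.
    static double andD(double a, double b) {
        return (a != 0.0 && b != 0.0) ? 1.0 : 0.0;
    }

    public static void main(String[] args) {
        System.out.println(divideProtected(10.0, 0.0)); // 1.0
        System.out.println(divideIntD(7.9, 2.1));       // 3.0  (7 / 2)
        System.out.println(moduloD(10.7, 3.2));         // 1.0  (10 % 3)
        System.out.println(andD(1.0, 0.0));             // 0.0
    }
}
```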
Configuration files
One of the primary tasks was to be able to use a configuration file to state the problem and the data. Below are some examples. Please note that some of these are experimental (and use experimental parameters/operators), and also they may not give any interesting or good results. More info about the data/problem is usually in the header of the file.
Some of the problems were first tested with Eureqa and commented on in Eureqa: Equation discovery with genetic programming (a Google translation of my original Swedish blog post Eureqa: equation discovery med genetisk programmering).
 alldifferent3.conf: All variables should be different
 bolts.conf: Bolts. A machine learning example
 boyles_law.conf: Boyle's law.
 catalan.conf: Catalan numbers
 circle_1.conf: Circle
 exp_formula.conf: Test of Exp function
 exp_formula_no_exp.conf: Test of Exp function, but without Exp in the function list.
 fahrenheit_celsius.conf: Fahrenheit to Celsius conversion. You may experiment by changing output_variable to 0 for the reverse conversion (C -> F).
 fib1.conf: Fibonacci series as a time series.
 fib2.conf: Fibonacci series as a time series.
 fib_50.conf: Fibonacci numbers, where I try to find the closed-form formula for the Fibonacci numbers.
 func1.conf: Unknown function (from a homework in a course in Evolutionary Computing)
 gamma_test.conf: Test of Gamma function
 gelman.conf: Linear regression
 henon_100.conf: Henon, 100 data points
 heron_formula.conf: Heron formula
 intro_page_262.conf: A simple problem
 iris.conf: Iris data set
 isbn_test.conf: Trying to get the program to calculate the checksum for ISBN13
 leap_years.conf: Leap years
 longley.conf: Longley's data set of employment numbers
 majority_on_3.conf: Boolean 3-majority-on
 mod_test.conf: Test of modulus operator.
 moons.conf: Moons data
 multiplexer_3.conf: 3-multiplexer (i.e. If/Then/Else)
 multiplexer_6.conf: 6-multiplexer
 multiplexer_11.conf: 11-multiplexer
 mysterious.conf: Mysterious function
 number_puzzle1.conf: Number puzzle from Roger Alsing's blog post Genetic Programming: Code smarter than you!
 number_puzzle4.conf: Number puzzle inspired by Richard Wiseman's It's the Friday Puzzle (2010-02-26) of finding the result 24 from the numbers 5,5,5,1 and the operators +,-,*,/. Note: the requirement that each number should be used exactly once is not enforced here. This uses only one fitness case (and the option no_terminals).
 odd_parity.conf: Odd parity, using the double variants of the boolean functions, i.e. AndD, OrD, NotD (see above)
 odd_parity_double.conf: Odd parity, using the arithmetic functions +,-,*,/
 odd_parity2.conf: Odd parity for two inputs
 p10.conf: Polynomial P(10)
 p4.conf: Polynomial P(4)
 p4_2.conf: Polynomial P(4)
 p4_jgap.conf: Polynomial P(4). This is the version in the JGAP example MathProblem.java
 p6_2.conf: Polynomial P(6)
 planets.conf: Planets, i.e. Kepler's third law.
 quintic.conf: Quintic polynomial
 regression_koza.conf: Regression (0.5 * x^2, from John R. Koza's Lisp implementation)
 regression_psh.conf: Regression (y = 12x^2 + 5, from Psh)
 seq_ind1.conf: Sequence induction problem: 5*j^4+4*j^3+3*j^2+2*j+1 (for integers 0..10)
 sigmoid_test.conf: Test of Sigmoid function
 sin_formula.conf: Test of Sine
 sin_formula_rand20.conf: Test of Sine
 sine_tiny_gp.conf: Test of Sine from TinyGP
 sorted_3.conf: Sorting 3 variables.
 sqrt_formula2.conf: Yet another test of Sqrt
 sqrt_formula3.conf: Yet another test of Sqrt
 sqrt_formula.conf: Test of Sqrt function
 sunspots.conf: Sunspots data as time series
 sunspots_timeseries.conf: Two versions of sunspots data using make_time_series
 test1.conf: A test of many functions.
 test2.conf: A test of new functions.
 tic_tac_toe.conf: Tic-tac-toe
 timeseries_test1.conf: Time series with make_time_series
 timeseries_dailyisbn.conf: Time series using the classic time series "Daily closing price of IBM stock, Jan 1, 1980 to Oct. 8, 1992" (DAILYIBM.DAT from TSDL)
 triangular_numbers.conf: Triangular numbers
 weather.conf: Weather (classic classification example)
 x2.conf: A simple polynomial: x^2
 x4-x3+x2-x.conf: Polynomial: x^4 - x^3 + x^2 - x
 zoo2.conf: Zoo (classic classification example)
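Several of the configurations above use the make_time_series option (described under the configuration parameters). Conceptually it slides a window over a single series to build fitness cases. Here is a sketch of that transformation; the method name is mine, not the one in SymbolicRegression.java:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of what make_time_series does conceptually: slide a window of
// length num_input_variables + 1 over one series, producing one fitness
// case per position (inputs = lagged values, output = the next value).
public class TimeSeriesWindow {

    static List<double[]> makeTimeSeries(double[] series, int numInputs) {
        List<double[]> cases = new ArrayList<>();
        for (int i = 0; i + numInputs < series.length; i++) {
            // Copy numInputs input values plus the output value.
            cases.add(Arrays.copyOfRange(series, i, i + numInputs + 1));
        }
        return cases;
    }

    public static void main(String[] args) {
        double[] fib = {1, 1, 2, 3, 5, 8, 13};
        // With 3 input variables this yields rows of the same shape as
        // the Fibonacci example later on this page: 1,1,2,3 / 1,2,3,5 / ...
        for (double[] row : makeTimeSeries(fib, 3)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```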
The configuration parameters
The configuration file consists of the following parameters. Here is a short explanation; the full story is in the code: SymbolicRegression.java. Most of the parameters have reasonable default values, taken from either MathProblem.java or GPConfiguration.
 #, %: Line comments; lines that start with "#" or "%" are ignored.
 presentation: A text which is shown first in the run.
 num_input_variables: Number of input variables in the data set.
 output_variable: The index (0-based) of the output variable. Default is the last variable.
 variable_names: The names of the variables, in order. Default is "V0", "V1", etc.
 data: Starts the data section, with one data row per line. The attributes may be separated by "," or whitespace. The decimal point is a "." (dot). If a data row contains a "?" (question mark) in the position of the output variable, it is considered a "user defined test" and the fittest program is tested against this data at the end of the run.
 terminal_range: The range for the Terminal, as "lower upper". Note: only one Terminal is used.
 terminal_wholenumbers: Whether the Terminal should use whole numbers or not (boolean).
 constant: Define a Constant with this value.
 functions: Define the functions, with the same names as in JGAP (or my own defined functions).
 adf_arity: If > 0 then ADF is used. This is somewhat experimental, as I am still trying to understand how ADFs work.
 adf_function: The functions used for ADF.
 adf_type: Either double or boolean. If set to boolean, the boolean and logical operators can be used.
 max_init_depth: JGAP parameter maxInitDepth
 min_init_depth: JGAP parameter minInitDepth
 program_creation_max_tries: JGAP parameter programCreationMaxTries
 population_size: JGAP parameter populationSize
 max_crossover_depth: JGAP parameter maxCrossoverDepth
 function_prob: JGAP parameter functionProb
 reproduction_prob: JGAP parameter reproductionProb
 mutation_prob: JGAP parameter mutationProb
 crossover_prob: JGAP parameter crossoverProb
 dynamize_arity_prob: JGAP parameter dynamizeArityProb
 no_command_gene_cloning: JGAP parameter no_command_gene_cloning
 use_program_cache: JGAP parameter use_program_cache
 new_chroms_percent: JGAP parameter newChromsPercent
 num_evolutions: JGAP parameter numEvolution
 tournament_selector_size: JGAP parameter tournamentSelectorSize
 max_nodes: JGAP parameter maxNodes
 scale_error: Sometimes the data values are very small, which gives small fitness values (i.e. errors) and makes it hard to see any progress. Setting this parameter multiplies the errors by this value.
 stop_criteria_fitness: If set (>= 0), the program runs "forever" (instead of num_evolutions generations) until the fitness is less than or equal to this value.
 show_population: Shows the whole population in each generation. Mainly for debugging purposes.
 show_similiar: Shows all the solutions (programs) with the same fitness value as the best solution. Alternative name: show_similar.
 similiar_sort_method: Method of sorting the similar solutions when using show_similiar. Alternative name: similar_sort_method. Valid options:
   occurrence: descending number of occurrences (default)
   length: ascending length of solutions
 show_progression: Boolean. If true, the generation number is shown for all generations, even when nothing is happening (i.e. no gain in fitness).
 sample_pct: (float) If > 0.0, takes a (sample) percentage of the data set.
 validation_pct: Withholds a percentage of the fitness cases as a validation set. The fitness on this validation set is shown.
 show_all_generations: Show info for all generations, not just when the fitness changes.
 hits_criteria: Criterion for a hit: if the difference is <= this value, it is considered a hit. The number of non-hits is then used as the fitness measure instead of the sum of errors. Setting this option also shows the number of programs with difference <= this value.
 mod_replace: Sets the replacement value for 0 (zero) in the ModuloReplaceD function (see above).
 showResults: Boolean. If set, all the fitness cases are shown with the output of the fitted program, along with the difference from the correct values.
 resultPrecision: The precision of the output used in showResults. Default 5.
 error_method: Error method to use. Valid options are:
   totalError: sum of (absolute) errors (default)
   minError: minimum error
   meanError: mean error
   medianError: median error
   maxError: max error
 no_terminals: If true, no Terminal is used, i.e. no numbers, just variables. Default false.
 make_time_series: Makes a time series of the first line of data. The value of num_input_variables determines the number of lags (+1 for the output variable).
 make_time_series_with_index: As make_time_series, but with an extra input variable for the index of the series. (Somewhat experimental.)
 minNodes: "value penalty": minimum number of nodes (terminals + functions). If the number of nodes in a program is less than value, then a penalty of penalty is added.
 alldifferent_variables: "true/false penalty": all the variables (terminals) should be different. If a variable occurs more than once in a program, then a penalty of penalty is added (for each extra occurrence).
 ignore_variables: (TBW) It would be nice to be able to ignore some variables in the data set, but this is yet to be written.
 return_type: (TBW) This should be the type of the "main" return value. Note: it is currently hard-coded in the program as double/DoubleClass.
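The error_method options can be sketched in plain Java, assuming each error is the absolute difference between the program's output and the target value on one fitness case. The method names here mirror the option names; they are not the actual code in SymbolicRegression.java:

```java
import java.util.Arrays;

// Sketch of the error_method options. Each element of errors[] is
// assumed to be |program output - target| for one fitness case.
public class ErrorMethods {

    static double totalError(double[] e)  { return Arrays.stream(e).sum(); }
    static double minError(double[] e)    { return Arrays.stream(e).min().getAsDouble(); }
    static double maxError(double[] e)    { return Arrays.stream(e).max().getAsDouble(); }
    static double meanError(double[] e)   { return Arrays.stream(e).average().getAsDouble(); }

    // Median: middle value of the sorted errors (mean of the two
    // middle values for an even number of cases).
    static double medianError(double[] e) {
        double[] s = e.clone();
        Arrays.sort(s);
        int n = s.length;
        return n % 2 == 1 ? s[n / 2] : (s[n / 2 - 1] + s[n / 2]) / 2.0;
    }

    public static void main(String[] args) {
        double[] errors = {1.0, 3.0, 2.0, 4.0};
        System.out.println(totalError(errors));  // 10.0 (the default fitness)
        System.out.println(medianError(errors)); // 2.5
        System.out.println(maxError(errors));    // 4.0
    }
}
```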
Supported functions
The program has support for the following functions from JGAP. The "main" type is double, so not all functions are applicable there (e.g. IfElse etc). However, for the ADF functions (enabled by setting adf_arity > 0) many more functions are supported. Please note that some of these are (very) experimental and may not even make sense in this context.
 Multiply (double)
 Multiply3 (double)
 Add (double)
 Add3 (double)
 Add4 (double)
 Divide (double)
 Subtract (double)
 Sine (double)
 ArcSine (double)
 Tangent (double)
 ArcTangent (double)
 Cosine (double)
 ArcCosine (double)
 Exp (double)
 Log (double)
 Abs (double)
 Pow (double)
 Round (double)
 Ceil (double)
 Floor (double)
 Modulo (double), implements Java's % operator for double. See ModuloD for a variant.
 Max (double)
 Min (double)
 LesserThan (boolean)
 GreaterThan (boolean)
 If (boolean)
 IfElse (boolean)
 IfDyn (boolean)
 Loop (boolean)
 Equals (boolean)
 ForXLoop (boolean)
 ForLoop (boolean), cf the double variant ForLoopD
 Increment (boolean)
 Pop (boolean)
 Push (boolean)
 And (boolean), cf the double variant AndD
 Or (boolean), cf the double variant OrD
 Xor (boolean), cf the double variant XorD
 Not (boolean), cf the double variant NotD
 SubProgram (boolean)
 Tupel (boolean)
Also, see my own functions defined in the Java files above.
Examples
Here are two small examples of the program, including the configuration file and a sample run.
Polynomial
Here is a simple example of a configuration file. It happens to be the same problem as the JGAP example MathProblem: the polynomial x^4 + x^3 + x^2 - x.
#
# Polynom x^4 + x^3 + x^2 - x
# The JGAP example
#
presentation: P(4) x^4 + x^3 + x^2 - x (the JGAP example)
num_input_variables: 1
variable_names: x y
functions: Add,Subtract,Multiply,Divide,Pow,Log,Sine
terminal_range: -10 10
max_init_depth: 4
population_size: 1000
max_crossover_depth: 8
num_evolutions: 800
max_nodes: 20
stop_criteria_fitness: 0.1
data
-2.378099 26.567495
4.153756 382.45743
2.6789956 75.23481
5.336802 986.33777
2.4132318 51.379707
-1.7993588 9.693933
3.9202332 307.8775
2.9227705 103.56364
-0.1422224 0.159982
4.9111285 719.39545
1.2542424 4.76668
1.5987749 11.577456
4.7125554 615.356
-1.1101999 2.493538
-1.7379236 8.631802
3.8303614 282.29697
5.158349 866.7222
3.6650343 239.42934
0.3196721 -0.17437163
-2.3650131 26.014963
A simple run, slightly edited.
It was 20 data rows
Presentation: P(4) x^4 + x^3 + x^2  x (the JGAP example)
output_variable: y (index: 1)
input variable: x
function1: &1 + &2
function1: &1 - &2
function1: &1 * &2
function1: /
function1: &1 ^ &2
function1: log &1
function1: sine &1
function1: 1.0
[19:52:57] INFO GPGenotype  Creating initial population
[19:52:57] INFO GPGenotype  Mem free: 10.5 MB
[19:52:57] INFO GPPopulation  Prototype program set
[19:52:57] INFO GPGenotype  Mem free after creating population: 10.5 MB
Creating initial population
Mem free: 10.5 MB
Evolving generation 1/800, memory free: 6.7 MB (time from start: 0,42s)
Best solution fitness: 968.56
Best solution: x ^ 4.0
Depth of chrom: 1
Correlation coefficient: 0.9995009838030151
Evolving generation 4/800, memory free: 11.4 MB (time from start: 0,84s)
Best solution fitness: 813.62
Best solution: ((4.0 + 4.0) + (log 4.0)) * ((x ^ 3.0) / (9.0 / x))
Depth of chrom: 3
Correlation coefficient: 0.999500983803015
Evolving generation 6/800, memory free: 6.7 MB (time from start: 1,07s)
Best solution fitness: 712.97
Best solution: ((5.0 * x) + ((x ^ 4.0) + x)) + x
Depth of chrom: 4
Correlation coefficient: 0.999781550858714
Evolving generation 7/800, memory free: 40.3 MB (time from start: 1,20s)
Best solution fitness: 582.77
Best solution: ((9.0 * x) + (x * 9.0)) - ((9.0 - x) - (x ^ 4.0))
Depth of chrom: 3
Correlation coefficient: 0.9965296891627895
Evolving generation 8/800, memory free: 24.9 MB (time from start: 1,32s)
Best solution fitness: 471.58
Best solution: (((x + 7.0) * (x * x)) * x) * (x / 9.0)
Depth of chrom: 4
Correlation coefficient: 0.9980245718601822
Evolving generation 11/800, memory free: 29.6 MB (time from start: 1,69s)
Best solution fitness: 364.73
Best solution: ((x * 8.0) + ((x ^ 4.0) + x)) - (4.0 - ((x * x) * x))
Depth of chrom: 4
Correlation coefficient: 0.9988207761993971
Evolving generation 16/800, memory free: 33.2 MB (time from start: 2,24s)
Best solution fitness: 317.84
Best solution: ((x * ((x * x) + (log 9.0))) * x) + (x * 9.0)
Depth of chrom: 5
Correlation coefficient: 0.9993672319814505
Evolving generation 17/800, memory free: 19.4 MB (time from start: 2,38s)
Best solution fitness: 169.76
Best solution: (x ^ 4.0) + (x * (x * x))
Depth of chrom: 3
Correlation coefficient: 0.9999752402614274
Evolving generation 22/800, memory free: 24.2 MB (time from start: 3,10s)
Best solution fitness: 136.21
Best solution: ((x * x) * x) + (x + (x * (x * (x * x))))
Depth of chrom: 5
Correlation coefficient: 0.9999485732269622
Evolving generation 23/800, memory free: 10.4 MB (time from start: 3,20s)
Best solution fitness: 3.7509724195622374E-4
Best solution: (x * ((((x * x) + x) * x) + x)) - x
Depth of chrom: 6
Correlation coefficient: 0.9999999999999939
Fitness stopping criteria (0.1) reached with fitness 3.7509724195622374E-4 at generation 23
All time best (from generation 23)
Evolving generation 23/800, memory free: 10.4 MB (time from start: 3,20s)
Best solution fitness: 3.7509724195622374E-4
Best solution: (x * ((((x * x) + x) * x) + x)) - x
Depth of chrom: 6
Correlation coefficient: 0.9999999999999939
Total time 3,20s
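The evolved program is in fact exactly the target polynomial: (x * ((((x * x) + x) * x) + x)) - x = x * (x^3 + x^2 + x) - x = x^4 + x^3 + x^2 - x, which is why the fitness is essentially zero. A quick numerical check of that identity:

```java
public class CheckBestSolution {

    // The best solution from the run above.
    static double evolved(double x) {
        return (x * ((((x * x) + x) * x) + x)) - x;
    }

    // The target polynomial x^4 + x^3 + x^2 - x.
    static double target(double x) {
        return x * x * x * x + x * x * x + x * x - x;
    }

    public static void main(String[] args) {
        // The two expressions agree (up to floating-point rounding)
        // for every x, since they are algebraically identical.
        for (double x = -5.0; x <= 5.0; x += 0.25) {
            if (Math.abs(evolved(x) - target(x)) > 1e-9)
                throw new AssertionError("mismatch at x = " + x);
        }
        System.out.println("evolved program == x^4 + x^3 + x^2 - x");
    }
}
```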
Fibonacci series
Here is another example: the Fibonacci series as a time series. The objective is symbolic regression on the fourth value (F4), which is solved quite quickly.
#
# Fibonacci with 3 variables
#
presentation: This is the Fibonacci series
return_type: DoubleClass
num_input_variables: 3
variable_names: F1 F2 F3 F4
functions: Multiply,Divide,Add,Subtract
terminal_range: -10 10
max_init_depth: 4
population_size: 20
max_crossover_depth: 8
num_evolutions: 100
max_nodes: 21
show_similiar: true
show_population: false
stop_criteria_fitness: 0
data
1,1,2,3
1,2,3,5
2,3,5,8
3,5,8,13
5,8,13,21
8,13,21,34
13,21,34,55
21,34,55,89
34,55,89,144
55,89,144,233
89,144,233,377
144,233,377,610
233,377,610,987
377,610,987,1597
610,987,1597,2584
987,1597,2584,4181
1597,2584,4181,6765
2584,4181,6765,10946
4181,6765,10946,17711
6765,10946,17711,28657
10946,17711,28657,46368
And a sample run:
Presentation: This is the Fibonacci series
output_variable: F4 (index: 3)
input variable: F1
input variable: F2
input variable: F3
function1: &1 * &2
function1: /
function1: &1 + &2
function1: &1 - &2
function1: 5.0
[19:56:31] INFO GPGenotype  Creating initial population
[19:56:31] INFO GPGenotype  Mem free: 7.5 MB
[19:56:31] INFO GPPopulation  Prototype program set
[19:56:31] INFO GPGenotype  Mem free after creating population: 7.5 MB
Creating initial population
Mem free: 7.5 MB
Evolving generation 1/100, memory free: 6.7 MB (time from start: 0,03s)
Best solution fitness: 17649.74
Best solution: (F3 - (6.0 - F1)) + ((F2 - F2) - (6.0 / F1))
Depth of chrom: 3
Correlation coefficient: 0.9999999838289084
Evolving generation 6/100, memory free: 6.0 MB (time from start: 0,13s)
Best solution fitness: 147.0
Best solution: 7.0 + (F2 + F3)
Depth of chrom: 2
Correlation coefficient: 1.0
Evolving generation 7/100, memory free: 5.8 MB (time from start: 0,14s)
Best solution fitness: 0.0
Best solution: F2 + F3
Depth of chrom: 1
Correlation coefficient: 1.0
Fitness stopping criteria (0.0) reached with fitness 0.0 at generation 7
All time best (from generation 7)
Evolving generation 7/100, memory free: 5.8 MB (time from start: 0,15s)
Best solution fitness: 0.0
Best solution: F2 + F3
Depth of chrom: 1
Correlation coefficient: 1.0
Total time 0,15s
All solutions with the best fitness (0.0):
F2 + F3 (1)
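The solution F2 + F3 is of course just the Fibonacci recurrence itself. A quick check that it reproduces the data (using a few rows from the data section above):

```java
public class CheckFibSolution {

    // The evolved program: F2 + F3.
    static double predict(double f2, double f3) {
        return f2 + f3;
    }

    public static void main(String[] args) {
        // A few rows from the data section above: F1, F2, F3, F4.
        double[][] data = {
            {1, 1, 2, 3}, {1, 2, 3, 5}, {2, 3, 5, 8},
            {3, 5, 8, 13}, {5, 8, 13, 21}
        };
        for (double[] row : data) {
            if (predict(row[1], row[2]) != row[3])
                throw new AssertionError("miss on row starting with " + row[0]);
        }
        // Zero error on every fitness case, i.e. fitness 0.0.
        System.out.println("F2 + F3 fits all rows exactly");
    }
}
```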
Todo
Here are some TODOs, or things that would be nice to have.
 option for ignoring specific variables
 options for stopping:
   running forever
   after a specific time
   when a specific fitness value is reached. Fixed (by stop_criteria_fitness)
 calculate the number of data rows automatically (i.e. skip num_row). Fixed.
 accept nominal values in the data section, converted to numeric values
 show some progress output. Fixed (by show_progression)
 make the output come from this program instead of from log4j. Fixed.
 for larger data sets, an option to use just a sample (percentage) of the full data. Fixed (by sample_pct)
 withhold a certain percentage of the fitness cases for validation of the best fitted program. Fixed (by validation_pct)
 show the number of programs less than or equal to a hits criterion. Fixed (by hits_criteria)
 show info for all generations. Fixed (by show_all_generations)
 support for user defined test instances. Fixed (by "?" in a data row)
 show the number of functions/terminals of the best program. Fixed.
 add more fitness metrics
 better handling of punishing longer solutions (parsimony pressure)
 support for different "main" return classes, i.e. not just DoubleClass
 more statistical measures, e.g. R-squared, mean squared error, mean absolute error, minimum error, maximum error
 more/better error checks
 more building blocks, a la Eureqa (http://ccsl.mae.cornell.edu/eureqa_ops)
 support for derivatives (a la Eureqa)?
 incorporate in Weka?
 simplify the best solution with a CAS?
See also
Mention/Usage
Here are some mentions and uses of my Symbolic Regression program:
This page was created by Hakan Kjellerstrand (hakank@bonetmail.com), homepage.