Symbolic Regression with JGAP - some improvements

The SymbolicRegression program (using JGAP in Java) has been updated with some improvements.

New configuration options

Some of these new options are explained in the examples below.
  • show_similar: Alternative name of show_similiar.
  • similiar_sort_method: Method of sorting the similar solutions when using show_similiar, which shows all solutions that have the same fitness value as the best solution found. Alternative name: similar_sort_method. Valid options are:
    • occurrence: descending number of occurrences (default)
    • length: length of solutions (ascending)
  • error_method: Error method to use. Valid options are:
    • totalError: sum of (absolute) errors (default)
    • minError: minimum error
    • meanError: mean error
    • medianError: median error
    • maxError: max error
  • no_terminals: If true then no Terminal is used, i.e. no numbers, just variables. Default: false.
  • make_time_series: Make a time series of the first line of data. The value of num_input_variables determines the number of lags (+1 for the output variable). See below for some examples.
  • make_time_series_with_index: As make_time_series with an extra input variable for the index of the series. (Somewhat experimental.)
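
To make the error_method names concrete, here is a minimal Java sketch of how the five aggregations could be computed from the per-case errors. This is not the code from SymbolicRegression: the class and method names are hypothetical, and it assumes that every method (not only totalError) works on absolute errors and that the median of an even number of cases is the mean of the two middle values.

```java
import java.util.Arrays;

// Hypothetical illustration only; not part of SymbolicRegression or JGAP.
final class ErrorMethodSketch {

    /** Absolute error for each fitness case. */
    static double[] absoluteErrors(double[] expected, double[] predicted) {
        if (expected.length != predicted.length || expected.length == 0) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        double[] errors = new double[expected.length];
        for (int i = 0; i < expected.length; i++) {
            errors[i] = Math.abs(expected[i] - predicted[i]);
        }
        return errors;
    }

    /** Aggregates the (non-empty) error array according to the error_method name. */
    static double aggregate(String errorMethod, double[] errors) {
        switch (errorMethod) {
            case "totalError":
                return Arrays.stream(errors).sum();
            case "minError":
                return Arrays.stream(errors).min().getAsDouble();
            case "meanError":
                return Arrays.stream(errors).average().getAsDouble();
            case "medianError": {
                // Assumption: for an even count, average the two middle values.
                double[] sorted = errors.clone();
                Arrays.sort(sorted);
                int mid = sorted.length / 2;
                return sorted.length % 2 == 1
                        ? sorted[mid]
                        : (sorted[mid - 1] + sorted[mid]) / 2.0;
            }
            case "maxError":
                return Arrays.stream(errors).max().getAsDouble();
            default:
                throw new IllegalArgumentException("unknown error_method: " + errorMethod);
        }
    }

    public static void main(String[] args) {
        double[] expected = {1.0, 2.0, 3.0, 4.0};
        double[] predicted = {1.0, 2.5, 1.0, 4.0};
        double[] errors = absoluteErrors(expected, predicted);
        String[] methods = {"totalError", "minError", "meanError", "medianError", "maxError"};
        for (String method : methods) {
            System.out.println(method + " = " + aggregate(method, errors));
        }
    }
}
```

The choice of method changes what the search rewards: totalError and meanError penalize every miss, while maxError only looks at the worst fitness case and minError only at the best one.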

New examples

Some new examples have been published as well.
  • leap_years.conf
    This example tries to figure out how to calculate the leap years. See Leap_year (Wikipedia) for more on leap years.

    The fitness cases consist of all years 1890..2030, plus 1200, 1300, 1400, 1500, 1600, 1700, and 1800.

    The functions used are: Multiply,Divide,Add,Subtract,ModuloD,IfElseD where IfElseD may be replaced with IfLessThanOrEqualD, or removed completely.

    ModuloD is not the normal modulo operator. Instead it is a "protected modulo": the arguments are first converted to integers, and then the modulo is taken. However, if the second argument is 0 (zero), the result is 0 (zero). This function is represented as either modp or % below.

    The program found a lot of solutions with error 1 (for year 1900).

    Using IfLessThanOrEqualD:

    if(y <= ((modp(y,(y / 471.0))) * (296.0 * y))) { (y - y) } else { (327.0 / 327.0) }

    Without IfElseD:

    (326.0 / (((((y - 536.0) % 536.0) + y) % (y / 226.0)) + 326.0)) % (283.0 % y)
    (y / (((y * 654.0) % (24.0 % y)) + y)) % y
    (y / (((y * (330.0 % y)) % (24.0 % y)) + y)) % y

  • number_puzzle4.conf

    Number puzzle inspired by Richard Wiseman's It's the Friday Puzzle (2010-02-26). The problem is to find the result 24 from the numbers 5,5,5,1 and the operators +,-,*,/. However, the requirement that each number should be used exactly once is not enforced here. (It would be quite useful to have this kind of "global function" requiring that all variables should be different, or used exactly once etc. Compare with "global constraints" in constraint programming.)

    Note also that this configuration uses only one fitness case and lets the program find any solution that satisfies the equation. It also uses the new option no_terminals for using just variables (no Terminal numbers), which was implemented for this example.

    Here is a result from a sample run. The number in [] is the number of occurrences of the specific program. In this example we also see the new option similiar_sort_method: length at work, which sorts the similar solutions according to length (normally they are sorted by the number of occurrences). The variables in the solutions mean: a = 5, b = 5, c = 5 and d = 1.

    All solutions with the best fitness (0.0):
    Sort method: length
    (b * c) - d [5]
    (a * c) - d [4162]
    (b * b) - d [4]
    (c * a) - d [251]
    (a * a) - d [10]
    (c * c) - d [424]
    (c * b) - d [1]
    (b * a) - d [36]
    (c - d) * (a + d) [1]
    (b * a) - (b / c) [121]
    (b * a) - (a / c) [2]
    (c * b) - (c / c) [5]
    (b * b) - (a / a) [3]
    (c * a) - (b / b) [2]
    (a * c) - (d * d) [633]
    (a - d) * (d + b) [4]
    (c * b) - (a / c) [1]
    (a * b) - (c / b) [2]
    (c * c) - (b / b) [1]
    It was 19 different solutions with fitness 0.0

    None of these is a solution to Wiseman's puzzle, since none of them uses each of the four numbers exactly once.

    Here we have limited the number of nodes with max_nodes: 7 (4 variables + 3 functions), but there is no standard option in JGAP to state the minimum number of nodes. However, with a "node validator" this could probably be done. I plan to experiment more with node validators for this kind of constraint and the "global functions" mentioned above.

  • sunspots_timeseries.conf

    Two versions of the sunspots data using make_time_series. See below for more about this option.

  • timeseries_test1.conf

    Some other examples of the make_time_series option. See below.

  • timeseries_dailyisbn.conf

    Another time series example: the classic time series "Daily closing price of IBM stock, Jan 1, 1980 to Oct. 8, 1992", DAILYIBM.DAT, from Rob J Hyndman's TSDL (Time Series Data Library).
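
As a footnote to the leap_years.conf example: the "protected modulo" described there can be sketched in a few lines of Java. This is not the actual ModuloD source. The class name is hypothetical, and two details are assumptions: the conversion to integers is done by truncation (a cast), and the zero test is applied after the conversion, so that a second argument such as 0.4 also yields 0.

```java
// Hypothetical illustration only; not the ModuloD class from SymbolicRegression.
final class ProtectedModuloSketch {

    /** Protected modulo: integer modulo that returns 0 instead of failing on a zero divisor. */
    static double modp(double a, double b) {
        long x = (long) a; // assumption: convert by truncation
        long y = (long) b;
        if (y == 0) {
            return 0.0;    // protected case: no division by zero
        }
        return (double) (x % y);
    }

    public static void main(String[] args) {
        System.out.println(modp(2012.0, 4.0)); // divisible by 4
        System.out.println(modp(7.9, 3.2));    // computed as 7 % 3
        System.out.println(modp(5.0, 0.0));    // protected case
    }
}
```

The point of the protection is that a randomly generated program can never abort the evaluation of a fitness case: every pair of arguments produces some number.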

make_time_series

The option make_time_series may require some explanation.

The following configuration file is all that is needed for the Fibonacci problem (in time series representation). Actually, the two lines in bold are the only ones needed, since the other options have defaults that would work well here.

make_time_series: true
num_input_variables: 4
terminal_range: -10 10
functions: Multiply,Divide,Add,Subtract
max_init_depth: 4
population_size: 100
num_evolutions: 100
max_crossover_depth: 8
max_nodes: 21
data
1,1,2,3,5,8,13,21,34,55,89,144,233,377,610,987,1597,2584,4181,6765,10946,17711,28657,46368

The option make_time_series will then transform the data into a data set and proceed as if that data set had been stated explicitly. Note: the SymbolicRegression program works with double values, hence the somewhat unusual presentation (1.0 instead of 1).

The number of time lags is the number of input variables (num_input_variables) + 1 for the output variable; here 4 + 1 = 5 time lags. The program prints the transformed data first, i.e.:

Making timeseries, #elements: 24
1.0 1.0 2.0 3.0 5.0
1.0 2.0 3.0 5.0 8.0
2.0 3.0 5.0 8.0 13.0
3.0 5.0 8.0 13.0 21.0
5.0 8.0 13.0 21.0 34.0
8.0 13.0 21.0 34.0 55.0
13.0 21.0 34.0 55.0 89.0
21.0 34.0 55.0 89.0 144.0
34.0 55.0 89.0 144.0 233.0
55.0 89.0 144.0 233.0 377.0
89.0 144.0 233.0 377.0 610.0
144.0 233.0 377.0 610.0 987.0
233.0 377.0 610.0 987.0 1597.0
377.0 610.0 987.0 1597.0 2584.0
610.0 987.0 1597.0 2584.0 4181.0
987.0 1597.0 2584.0 4181.0 6765.0
1597.0 2584.0 4181.0 6765.0 10946.0
2584.0 4181.0 6765.0 10946.0 17711.0
4181.0 6765.0 10946.0 17711.0 28657.0
It was 19 data rows

And then, as mentioned above, the program proceeds as usual. See Symbolic regression (using genetic programming) with JGAP.
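
The transformation done by make_time_series is essentially a sliding window over the series. The following Java sketch shows the idea; it is not the program's own code, and the class and method names are hypothetical. Each row holds num_input_variables consecutive values as inputs, followed by the next value as the output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the code used by SymbolicRegression.
final class TimeSeriesWindowSketch {

    /** Turns a series into rows of numInputVariables inputs plus one output value. */
    static List<double[]> makeTimeSeries(double[] series, int numInputVariables) {
        int width = numInputVariables + 1; // inputs + output variable
        List<double[]> rows = new ArrayList<>();
        for (int start = 0; start + width <= series.length; start++) {
            rows.add(Arrays.copyOfRange(series, start, start + width));
        }
        return rows;
    }

    public static void main(String[] args) {
        double[] fib = {1, 1, 2, 3, 5, 8, 13};
        for (double[] row : makeTimeSeries(fib, 4)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Note that a plain sliding window of width 5 over 24 values gives 20 rows, whereas the program output above lists 19, so the actual program evidently handles the end of the series differently. The sketch does not try to reproduce that detail.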