Learning DFA from Noisy Samples

A Contest for GECCO 2004

The Challenge

The original challenge was to learn a 50 state machine, but this seems too far beyond the state of the art; on May 14th smaller machines have been added, from 10 states upward)

The challenge is as follows: given a training set of binary sequences, each with a binary class label, learn a predictor that will classify unlabelled sequences in the test set.  The training and test sets consist of random binary sequences labelled by a randomly constructed Deterministic Finite Automata (DFA).  Note that your predictor can take any form you want: a neural net, an expression tree, etc., it does not have to be a DFA.

This contest takes its inspiration from previous DFA contests: namely Abbadingo One and Gowachin

Our contest focuses on a fairly small DFA (nominally between 10 and 50 states), but with a high level of label noise (10%), and a moderate training sample size (5,000).  The noise is added to the class labels in the training set.  That is, we iterate over all the class labels in the training set, flipping each one with a probability of 0.1.

All programs in this contest have to run on the same machine within the same time constraints - a level playing field.

Rules

Note: (changed May 14th to cope with progressive target sizes)

Your program will be given a maximum of 10 minutes CPU time per dataset (on a 2.4ghz Pentium 4 running Windows or Linux), after which time it will be terminated.  Therefore, it is very important that your program should produce the output before its time is up.  We will run each program initially on the same 10 randomly generated problems.

The winning entry will be the program that is able to solve the largest problem.

A problem is considered solved if the learner can get better than 99% test set accuracy at least 50% of the time.  In the event of a tie, the winner will be the one that solves the problem with a higher probability.

File Format

Note: this was updated June 3rd to add an optional parameter into the training set: the nominal size of the target machine (highlighted with yellow background) - thanks to Conor Ryan for querying this.

We use a plain ASCII file format - in fact very similar to the format as used by Gowachin.

The first few lines of an example training and test set pair are shown below:

Sample training set:

```5000 2 10
0 9 1 1 0 1 0 0 1 1 0
1 15 1 1 1 1 0 1 0 1 1 1 0 0 0 1 0
0 15 1 1 0 1 1 0 1 0 0 0 1 1 1 0 1
1 15 1 0 0 1 1 1 1 1 1 0 0 0 1 1 0
0 15 0 0 1 1 1 1 1 0 1 1 0 0 1 1 1
0 13 1 1 0 0 1 0 0 0 0 1 1 1 0
1 15 1 0 0 0 1 0 1 1 0 1 0 0 0 1 0
1 15 1 1 1 0 0 1 0 0 1 1 0 1 0 1 1
0 15 0 0 0 0 0 1 0 0 1 1 1 0 0 1 1
0 15 1 0 0 0 0 1 1 1 0 1 0 0 1 0 0
0 15 1 1 1 0 1 0 0 0 1 1 0 0 0 1 1
1 15 1 0 0 1 1 1 0 1 0 0 0 1 1 1 1
0 15 1 0 1 1 1 1 0 0 1 0 0 1 0 0 1
0 15 0 1 0 0 0 1 1 0 1 1 1 0 0 1 0
0 14 1 1 0 1 0 1 1 0 0 1 1 1 1 0
....```

Sample test set:

```1800 2
-1 14 1 0 0 0 1 1 1 1 1 0 0 1 0 1
-1 11 0 1 0 1 0 0 1 1 1 0 1
-1 14 0 0 0 1 1 0 1 0 0 1 1 1 1 1
-1 14 1 1 0 0 0 1 0 1 1 0 1 1 1 0
-1 13 0 0 0 0 0 0 1 1 1 0 1 0 0
-1 15 0 0 0 0 1 0 0 0 1 0 0 0 0 1 1
-1 15 1 0 1 1 0 0 0 0 1 1 0 0 0 1 1
-1 15 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1
-1 6 0 1 1 1 1 0
-1 14 0 0 1 1 0 1 1 1 0 1 1 1 0 1
....```

The first line states the number of sequences in the file, the number of pattern classes (the latter will always be two), and the nominal number of states in the target machine.

Each subsequent line gives the pattern class followed by the sequence length, followed by the actual sequence.

The output format is even simpler: specify the number of patterns on the first line, followed by the class label on each subsequent line for the sequence on the corresponding line of the test set.  E.g:

```1800
1
1
0
1
....```

Sample Data Sets

The sample data sets are in this directory.  The number in file name indicates the nominal number of states in the target DFA (so gecco-train-10.txt is the training set for a 10 state target DFA, and gecco-test-10.txt is the corresponding test set).

Note that in the sample test data files, the class of each pattern has been left in, for you to get a better idea of how well your learner is doing.

Submission Details

Submit an executable file (Windows or Linux) or a Java file (source or classes in a .jar) that takes three filenames as input (names given as command line arguments), for example: train.txt, test.txt and out.txt.  In this case your program would read the training and test data from train.txt and test.txt respectively, and write the class label output to out.txt.  If your submission was called learner, then we should be able to invoke it like this:

`>learner train.txt test.txt out.txt`

We also strongly encourage you to submit a brief (not more that two page) description of your algorithm (either in plain text, HTML or PDF).

Make your submissions by email to sml at essex.ac.uk.

• Q. Must I use an evolutionary algorithm?
A. No, we encourage submissions of any type.