Noisy DFA Results

There were three entries, and the winning entry was by Jonatan Gomez.


All three entries are summarised below.  The entries directory contains the jar file or Linux executable for each entry, together with the PDF descriptions.


DFA (Deterministic Finite Automaton) induction is a classic machine learning problem, and has been studied for over thirty years (at least).

Heuristic state merging algorithms (such as EDSM) start by building the prefix tree acceptor for the training set.  This perfectly memorises the training data, but handles nothing else.  State merges are then performed to reduce the size of the machine (while remaining consistent with the training data), and consequently increase its generality. 
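The prefix tree acceptor construction can be sketched as follows.  This is an illustrative sketch only (the representation and names are not taken from any contest entry): each training string is traced from the root, creating fresh states for unseen prefixes, and the final state records the string's label.

```python
# Sketch of building a prefix tree acceptor (PTA) from labeled strings.
# Representation is illustrative: states are integers, 0 is the root.

def build_pta(samples):
    """samples: list of (string, label) pairs over alphabet {'0', '1'}.
    Returns (transitions, labels): transitions maps (state, symbol) -> state;
    labels maps state -> True/False for accept/reject (None if unlabelled)."""
    transitions = {}
    labels = {0: None}
    next_state = 1
    for string, accept in samples:
        state = 0
        for symbol in string:
            if (state, symbol) not in transitions:
                # unseen prefix: grow the tree with a fresh state
                transitions[(state, symbol)] = next_state
                labels[next_state] = None
                next_state += 1
            state = transitions[(state, symbol)]
        labels[state] = accept  # the reached state memorises the label
    return transitions, labels
```

Heuristic algorithms such as EDSM then repeatedly merge pairs of compatible states in this tree, shrinking the machine while keeping it consistent with the training labels.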

The problem can be solved easily given a complete sample of training data (all strings up to a certain length, which depends on the depth of the machine), but becomes harder as the training data gets sparser.  Nonetheless, heuristic state merging algorithms can successfully learn machines with hundreds of states from sparse data.

The twist for the GECCO contest was to add a high level of noise to the data.  The feeling was that evolutionary algorithms should be better able to cope with noise than heuristic state merging methods.

This was borne out by the results, in that the statistical state merging algorithm was unable to solve the size 10 problem.


Tests were run on randomly generated targets with a nominal 10, 20 and 30 states.  For each machine size we generated 10 problem instances, where a problem instance is defined by a randomly generated target together with randomly generated training and test sets.  The full training and test sets, and the estimates produced by each algorithm, are listed in the data directory.
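A problem-instance generator along these lines can be sketched as below.  The details (uniform random transitions, strings drawn uniformly up to a maximum length, labels flipped independently with the given noise probability) are assumptions for illustration; the contest's actual generator may differ.

```python
import random

def random_dfa(n_states, alphabet="01", seed=0):
    """A randomly generated target: uniform random transitions and
    random accept/reject labels.  Illustrative only."""
    rng = random.Random(seed)
    delta = {(s, a): rng.randrange(n_states)
             for s in range(n_states) for a in alphabet}
    accept = {s: rng.random() < 0.5 for s in range(n_states)}
    return delta, accept

def classify(delta, accept, string):
    """Run the DFA from state 0 and return the label of the final state."""
    state = 0
    for symbol in string:
        state = delta[(state, symbol)]
    return accept[state]

def noisy_sample(delta, accept, n, max_len, noise, seed=1):
    """Label random strings with the target, then flip each label
    independently with probability `noise`."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        s = "".join(rng.choice("01") for _ in range(rng.randrange(max_len + 1)))
        label = classify(delta, accept, s)
        if rng.random() < noise:
            label = not label  # noise corruption of the training label
        data.append((s, label))
    return data
```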

Here are the results summaries.  A problem was considered solved if an algorithm got better than 99% test set accuracy at least 50% of the time.  Hence, by this definition, both Gauge and Gomez solve the 10 state problem, but Gomez goes on to solve the 20 and 30 state problems also.
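The "solved" criterion above amounts to a simple check over the per-instance accuracies (the function name and defaults here are illustrative):

```python
def solved(accuracies, threshold=0.99, fraction=0.5):
    """A problem counts as solved if test-set accuracy exceeds
    `threshold` on at least `fraction` of the problem instances."""
    hits = sum(1 for a in accuracies if a > threshold)
    return hits >= fraction * len(accuracies)
```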

Note that while each algorithm was given a 10 minute upper limit on elapsed time per problem instance, Blue in particular often used much less time than this.


The Gomez entry was significantly better than the other two.  Gauge was significantly better than Blue on the 10 state problem, but Blue started to edge ahead of Gauge on the larger problems.  Gomez still does well on the 30 state problem, but starts to run out of steam beyond that.

No algorithm has yet solved the 50 state problem (given 5,000 training patterns with 10% noise corruption): this remains an interesting challenge.

What gives Gomez its lead over Gauge?  Is it the biologically inspired coding scheme, or the fact that Gomez only evolves the state transition matrix and assigns state labels optimally?  Interesting future research...