
10.4 Running GP on Parallel Hardware

In contrast to much of computer science, evolutionary computation can be readily run on parallel computer hardware; indeed it is "embarrassingly parallel" (Andre and Koza, 1998). For example, when Openshaw and Turton (1994) ran GP on a Cray supercomputer they obtained about 30% of its theoretical peak performance, embarrassing their supercomputer-savvy colleagues, who rarely got better than a few percent out of it.

In Sections 10.4.1-10.4.3 we look at three ways of running GP on parallel hardware. Section 10.4.4 shows how to get 32 or 64 parallel operations from standard hardware.

10.4.1 Master-slave GP

If the objective is purely to speed up runs, we may want our parallel GP to work exactly the same as it did on a single computer. This is possible, but to achieve it we have to be very careful to ensure that, even if some parts of the population are evaluated more quickly, parallelisation does not change how we apply selection and which GP individual crosses over with which. Probably the easiest way to implement this is the master-slave model.

In the master-slave model (Oussaidène, Chopard, Pictet, and Tomassini, 1997) breeding, selection, crossover, mutation, etc., occur just as they would on a single computer and only fitness evaluation is spread across a network of computers. Each GP individual and its fitness cases are sent across the network to a different compute node. The central node waits for the compute nodes to return their individuals' fitnesses. Since individuals and fitness values are typically stored in small data structures, transmission overheads are limited and this can be quite efficient.
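
To make this concrete, here is a minimal master-slave sketch in C++. Only fitness evaluation is farmed out; selection and breeding stay on the master, so the run behaves exactly as it would on a single machine. In this illustration std::async stands in for dispatching work to remote compute nodes, and evaluate() is a hypothetical stand-in for running a program on its fitness cases.

    #include <cstddef>
    #include <functional>
    #include <future>
    #include <string>
    #include <vector>

    struct Individual { std::string program; double fitness = 0.0; };

    // Hypothetical stand-in: a real system would run ind.program on all its
    // fitness cases, typically on a remote compute node.
    double evaluate(const Individual& ind)
    {
        return static_cast<double>(ind.program.size());
    }

    // Only fitness evaluation is parallelised; selection and breeding remain
    // on the master, so results match a purely serial run.
    void evaluate_population(std::vector<Individual>& pop)
    {
        std::vector<std::future<double>> jobs;
        jobs.reserve(pop.size());
        for (const Individual& ind : pop)              // master sends each individual out
            jobs.push_back(std::async(std::launch::async, evaluate, std::cref(ind)));
        for (std::size_t i = 0; i < pop.size(); ++i)   // ...then waits for every fitness
            pop[i].fitness = jobs[i].get();            // before breeding the next generation
    }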

The central node is an obvious bottleneck. Also, a slow compute node or a lengthy fitness case will slow down the whole GP population, since its result will eventually be needed before moving on to the next generation.

10.4.2 GP Running on GPUs

Modern PC graphics cards contain powerful graphics processing units (GPUs), which include a large number of computing components. For example, it is not atypical to have 128 streaming processors on a single PC's graphics card. In the last few years there has been an explosion of interest in porting scientific or general purpose computation to mass market graphics cards (Owens, Luebke, Govindaraju, Harris, Kruger, Lefohn, and Purcell, 2007).


Figure 10.1: nVidia 8800 block diagram. The 128 1360 MHz Stream Processors are arranged in 16 blocks of 8. Blocks share 16 KB memory (not shown), an 8/1 KB L1 cache, 4 Texture Address units and 8 Texture Filters. The 6×64 bit bus (dashed) links off-chip RAM at 900 MHz. (Since there are two chips for each of the six off-chip memory banks, the bus is effectively running at up to 1800 MHz per bank.) There are 6 Raster Operation Partitions. (nVidia, 2007).


Indeed, the principal manufacturers (nVidia and ATI) claim a faster-than-Moore's-Law (Moore, 1965) increase in performance, suggesting that GPU floating point performance will continue to double every twelve months, rather than the 18-24 months observed for electronic circuits in general and personal computer CPUs in particular. In fact, the apparent failure of PC CPUs to keep up with Moore's Law in the last few years makes GPU computing even more attractive. Even today's bottom-of-the-range GPUs greatly exceed the floating point performance of their host's CPU. However, this speed comes at a price, since GPUs provide a restricted type of parallel processing, often referred to as single instruction multiple data (SIMD) or single program multiple data (SPMD). Each of the many processors simultaneously runs the same program on different data items.

There have been a few genetic programming experiments with GPUs (Chitty, 2007; Ebner, Reinhardt, and Albert, 2005; Harding and Banzhaf, 2007; Langdon and Banzhaf, 2008; Langdon and Harrison, 2008; Loviscach and Meyer-Spradow, 2003; Meyer-Spradow and Loviscach, 2003; Reggia, Tagamets, Contreras-Vidal, Jacobs, Weems, Naqvi, Winder, Chabuk, Jung, and Yang, 2006). So far, in GP, GPUs have just been used for fitness evaluation.

Harding and Banzhaf (2007) used the Microsoft Research GPU development DirectX™ tools to compile (using a technique originally developed by Harris and Buxton (1996)) a whole population of Cartesian GP network programs into a single GPU program, which was loaded onto a laptop's GPU to run the fitness cases. Chitty (2007) used a conversion technique, somewhat like an interpreter, to automatically convert each GP tree into a program that could be compiled for the GPU on the host PC. The compiled programs were transferred one at a time to a GPU for fitness evaluation. Both groups obtained impressive speedups by running many test cases in parallel.

Langdon and Banzhaf (2008) and Langdon and Harrison (2008) created a SIMD interpreter (Juille and Pollack, 1996) using RapidMind's GNU C++ OpenGL framework to simultaneously run up to a quarter of a million GP trees on an nVidia GPU (see Figure 10.1). As discussed in Section 7.1.2, GP trees can be linearised. This avoids pointers and yields a very compact data structure; reducing the amount of memory needed in turn facilitates the use of large populations. To avoid recursive calls in the interpreter, Langdon used reverse polish notation (RPN), i.e., a post-fix rather than a pre-fix notation. Only small modifications to crossover and mutation are needed so that they act directly on the RPN expressions. This means the same representation is used on both the host and the GPU. A single graphics card can interpret almost a billion GP primitives per second. In both Cartesian and tree-based GP the genetic operations are done by the host CPU, although it has been shown, for a genetic algorithm, that these too can be run on the GPU.
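
The following C++ sketch shows the idea of such a non-recursive postfix interpreter: the linearised program is a flat array of opcodes walked with an explicit stack, so there are no pointers and no recursive calls. The opcode set and encoding are purely illustrative, not Langdon's actual representation.

    #include <vector>

    // Illustrative opcodes for a linearised (postfix) GP expression.
    enum Op { PUSH_X, PUSH_1, ADD, SUB, MUL, DIV };

    // Evaluate a postfix program for one value of the input x, using an
    // explicit stack instead of recursion.
    double eval_rpn(const std::vector<Op>& prog, double x)
    {
        std::vector<double> stack;
        for (Op op : prog) {
            switch (op) {
            case PUSH_X: stack.push_back(x);   break;
            case PUSH_1: stack.push_back(1.0); break;
            default: {                         // binary operators pop two operands
                double b = stack.back(); stack.pop_back();
                double a = stack.back(); stack.pop_back();
                if      (op == ADD) stack.push_back(a + b);
                else if (op == SUB) stack.push_back(a - b);
                else if (op == MUL) stack.push_back(a * b);
                else                stack.push_back(b == 0.0 ? 1.0 : a / b);  // protected division
            }
            }
        }
        return stack.back();
    }

    // Example: the tree (x + 1) * x becomes the postfix program  x 1 + x *
    // std::vector<Op> prog = {PUSH_X, PUSH_1, ADD, PUSH_X, MUL};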

Although each of the GPU's processors may be individually quite fast and the manufacturers claim huge aggregate FLOPS ratings, GPUs are optimised for graphics work and in practice it is hard to keep all the processors fully loaded. Nevertheless, 30 GFLOPS has been achieved (Langdon and Harrison, 2008). Given the differences in CPU and GPU architectures and clock speeds, the speedup from using a GPU rather than the host CPU is often the most useful statistic. This is obviously determined by many factors, including the relative amounts of computation and data. The measured RPN tree speedups were 7.6-fold (Langdon and Harrison, 2008) and 12.6-fold (Langdon and Banzhaf, 2008).

10.4.3 GP on FPGAs

Field programmable gate arrays (FPGAs) are chips which contain large arrays of simple logic processing units whose functionality and connectivity can be changed via software in microseconds, simply by writing a configuration into static memory. Once an FPGA is configured, it can update all of its thousands of logic elements in parallel at the clock speed of the circuit. Although an FPGA's clock speed is often an order of magnitude slower than that of a modern CPU, its massive parallelism makes it a very powerful computational device. Because of this, and because of their flexibility, there has been significant interest in using FPGAs in GP.

Work has ranged from the use of FPGAs to speed up fitness evaluation (Koza, Bennett, Hutchings, Bade, Keane, and Andre, 1997; Seok, Lee, and Zhang, 2000) to the definition of specialised operators (Martin and Poli, 2002). It is even possible to implement a complete GP on FPGAs, as suggested in (Heywood and Zincir-Heywood, 2000; Martin, 2001, 2002; Sidhu, Mei, and Prasanna, 1998). A massively parallel GP implementation has also been proposed by Eklund (2001, 2004), although to date all tests with that architecture have only been performed in simulation.

10.4.4 Sub-machine-code GP

We are nowadays so used to writing programs in high-level sequential languages that it is very easy to forget that, underneath, computers have a high degree of parallelism. Internally, CPUs are made up of bit-slices, which make it possible for the CPU to process all of the bits of the operands of an instruction in one go, in a single clock tick.

Sub-machine-code GP (SMCGP) (Poli and Langdon, 1999) is a technique to speed up GP and to extend its scope by exploiting the internal parallelism of sequential CPUs. In Boolean classification problems, SMCGP allows the parallel evaluation of 32 or 64 (depending on the CPU's word size) fitness cases per program execution, thereby providing a significant speed-up. This has made it possible to solve parity problems with up to 4 million fitness cases (Poli and Page, 2000). SMCGP has also been applied successfully to binary image classification problems (Adorni, Cagnoni, and Mordonini, 2002; Quintana, Poli, and Claridge, 2003). The technique has also been extended to process multiple fitness cases per program execution in continuous symbolic regression problems, where inputs and outputs are real-valued numbers (Poli, 1999b).
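
As an illustration, the C++ sketch below packs 64 Boolean fitness cases into a single machine word, so that each bitwise instruction evaluates all 64 cases in one go. The program, (x0 AND x1) XOR x2, and the input bit patterns are chosen purely for illustration.

    #include <bitset>
    #include <cstdint>
    #include <iostream>

    int main()
    {
        // Bit position i of each input word holds that input's value for
        // fitness case i, so one 64-bit word carries 64 test cases.
        std::uint64_t x0 = 0xAAAAAAAAAAAAAAAAULL;   // cases where input 0 is true
        std::uint64_t x1 = 0xCCCCCCCCCCCCCCCCULL;   // cases where input 1 is true
        std::uint64_t x2 = 0xF0F0F0F0F0F0F0F0ULL;   // cases where input 2 is true

        // One pass over the program evaluates all 64 fitness cases at once:
        // each bitwise operation acts on every bit (fitness case) in parallel.
        std::uint64_t result = (x0 & x1) ^ x2;

        std::cout << std::bitset<64>(result) << '\n';  // bit i = output for case i
        return 0;
    }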

