Popcorn and Genetic Algorithms
Genetic algorithms
Genetic Algorithms (GA) is a way of approximating hard problems by improving a population of individuals. Each individual is a vector over a finite alphabet. Starting from a randomly chosen population, we generate each new generation by genetic operations such as crossover and mutation. Each individual in the population has a fitness value, and our target is to find a 'good enough' individual who represents an approximation to our problem (Holland: Adaptation in natural and artificial systems, MIT Press, Cambridge MA, 1975).
Genetic operations
Cross over is an operator which mixes two individuals genes (vector elements) to create a new individual, while mutation involves single individual modification. By sending sub-populations to remote machine we hope to speed up the search process.
Selecting children
In order to improve the population, we need a way to generate better individuals. One commonly used way is selecting parents according to their fitness function. In this work we check whether a child's fitness is better than the previous old population average fitness. If it has better fitness we add it to the new population, otherwise we generate a new child. maxTrial parameters is an upper bound on the number of trials per child we perform (Baum, Boneh & Garrett: On Genetic Algorithms, ACM 1995).
Approximating TSP
In our case, we try to find an approximation to the Travelling Salesman Problem (TSP). Having n cities and a distance matrix, each individual in our population is a permutation of n elements, and our fitness function is the sum of distances represented by this permutation.
Cross over between individuals A and B to create C is done by:
- choosing a coordinate k (1 <= k <= n).
- copying A(1) .. A(k) into C(1) .. C(k) respectively.
- copying B(k+1) .. B(n) into C(k+1) .. C(n). If we copy B(l) into C(l) for some k < l <= n and B(l) already exists in C B(l) = C(p), we put C(l) in C(p). This way we ensure C would be a permutation.
Mutating an individual is done by swapping two of its elements.
Using a parallel engine
Speeding up the process can be done by sending sub-populations to remote machines and improving those sub-populations on the remote machines for few generations. Applying the genetic operators on a small population we expect that fitness improvement would be slower than for a big population because we decrease the number of sample points in our searching space. On the other hand, sending big sub-populations to a remote machine is expensive due to communication time. In order to find the best sub-populations size and the number of sub-generations they should live on the remote machines, I checked the fitness improvement for several population sizes on one machine, and checked the fitness improvement for several values of sub-populations and sub-generations. Each sub-population with population size / sub-populations individuals was randomly chosen, which means that sub-populations can overlap.
Software description
The package was designed to use genetic algorithm to approximate discrete problems. In order to approximate a problem one has to implement the interface Functionable in the class Function so it would fit the problem. The classes tree can be found here. It contains the following classes:
- Main class creates a population and calls OnePopComputation to generate the following generations. Main API can be found here
- Functionable: Interface that has to be implemented in Function class for each specific problem. Functionable API can be found here. It has the following methods:
- start - user pre-calculations (if needed)
- done - user final calculation (if needed)
- indSize - individual size (number of elements)
- abcSize - alphabet size (elements would have the values 0..abcSize-1)
- popSize - total population size
- generations - total number of generations on local machines
- subPopulations - number of subPopulations
- subGenerations - number of generations on remote machine
- maxTrials - max trials for finding 'good' child
- mutationRatio - 1/mutationRatio individuals would be mutate
- better - which score is better
- calc - calculate individual fitness
- crossOver - generate child
- mutate - mutate individual
- finished - return true if stopping criteria achieved
- Pop class responsible for maintaining a vector of individuals and their total fitness value. Pop's API can be found here.
- Ind class responsible for maintaining a vector of genes (integers) and a fitness value. Ind's API can be found here.
- OnePopComputation class responsible to improve a population by dividing it into subPopulations, and generating computation packet for each such subpopulation. OnePopComputation's API can be found here
- OnePopPacket class responsible for generating 'subGenerations' generations starting from a given population. OnePopPacket's API can be found here
The test
I tried to approximate the TSP for the given 20X20 symmetric distances matrix. The parameters and methods used can be found in geneticAlgorithm/Function.java. The genetic operation are described above (cross-over and mutation). I made two benchmarks:
- Testing a population of 2400 individuals along a total number of 36 generations. The population was sent to 1, 2, 3, 4, 5 and 6 remote machines, each time for 1, 2, 3, 4, and 6 sub-generations. For example, if the sub-population parameter was 3 and the sub-generation parameter was 4, 3 sub-populations were sampled from the entire population, sent to remote machines, and collected together into a new population. In this case, this process was done 36/4 times.
- Testing populations without splitting or sending them to remote machine. This was done to find out about the ratio between the population size and its average fitness convergance. The populations sizes ranged between 10000 and 100000 individuals.
I tried each benchmark several times.
results
Graph 1 describes the results for one population of 2400 individuals along 36 generations.
- Tests 1-6 refers to 1 sub-generation with 1-6 remote machine respectively
- Tests 7-12 refers to 2 sub-generation with 1-6 remote machine respectively
- Tests 13-18 refers to 3 sub-generation with 1-6 remote machine respectively
- Tests 19-24 refers to 4 sub-generation with 1-6 remote machine respectively
- Tests 25-30 refers to 6 sub-generation with 1-6 remote machine respectively
The number of the total generations in each test were the same since the number of generations in the main loop was some constant divided by the number of sub-generations. It can be seen from the graph that sending sub-populations only improves the population average fitness in spite our expectation. Other tests resulted the same behaviour. Graph 2 shows that bigger population results faster convergence of the population average fitness.
Conclusions
Against evoluntary sense, splitting a population into sub-populations appears to improve the population average fitness convergence. Those results encourage the usage of parallel processing in genetic algorithms. I have no explain for those results, further study should include checking bigger populstions with heavier usage of remote machines.