Why we care
Let’s start with the simple question: What is a genetic algorithm? Simply put, a genetic algorithm uses heuristics inspired by biology to find an optimal or near-optimal solution to a problem that is too large to handle by brute force.
Let’s give an example of why we might care. Consider the situation where we need to classify an extensive list of items as either “yes” or “no.” A fundamental binary classification problem can be as simple as buying this item. In our, let’s think one level up. Suppose we have a data set where we’ve collected thousands of features and only have a small number of samples by which to make our judgment. If we want to go through a single decision tree, each feature we use will cut our items in two. So if we use five elements, we will have 32 unique answers. If we’re looking at 1 million customers, we will only need 19 features before we have classified everyone. So we want to pick our features carefully. For the sake of ease, we collect 100 pieces of information for each customer. Picking out only 19 parts gives us 130 quadrillion choices. This will take a few minutes, even on the most powerful supercomputers available in 2022. If we are good data stewards and collect 1000 pieces of information on each customer, we will have roughly 1040 possibilities which are a thousand trillion trillion trillion possibilities. Today’s computing power will take about 220 trillion years to run at total capacity. So it is in our best interest to select carefully.
There are many heuristic optimizers, and the genetic algorithm is one such. Still, it’s a lovely little algorithm that will code quickly, is highly malleable, dashes (with proper parameter tuning), and gives a reliable solution. Since it is heuristic, there is no guarantee that we’ll land on a global optimum, but we won’t be too far off.
The selection offers explainability, of utmost importance, as to why we care about using a genetic algorithm for the feature. Many feature selector techniques involve feature engineering, calculating characteristics, statistical differential geometry, eigenvalue methods (Singular values and principal components), quantum thermodynamics, etc. The difference with a genetic algorithm is that it’s easy to explain and implement.
How to use this Technique, and how CDL 1000 uses it
The genetic algorithm consists of a few supporting functions • Creating an initial population
- A “breeding” method to produce new potential solutions
- A “genetic” mutation for creating new solutions
- A scoring function to know how we’re doing
- A new population selector so we can promote a “survival of the fittest” strategy
- A stopping criterion
In our case, the solution will be a binary vector, that is, do we keep this function or not. For example if we were selecting 5 features of 14, a potential solution may look like
v = [0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0]
To create an initial population, we simply pick a bunch of vectors with 19 ones and 81 zeros. In Python, we can use the choice function from NumPy.
initialPopulation = zeros([50, 100])
This will give us 50 vectors (rows) as our initial population of all zeros. Now we fill in the indices
Build Initial Population
procedure GenerateInitialPopulation for idx in range(populationSize) do rowIndex = choice(np.arange(100), size = 19, replace = False) initialPopulation[idx, rowIndex] = 1 end for end procedure
The breeding portion is exactly what it sounds like. We take two potential solutions, which we will call “parents,” and use the 0/1 entries as “chromosomes.” We select from the chromosomes where they match and randomly select where they mismatch until we have filled in the desired number of traits.
For example, let’s again select five traits from 14.
parent1 = [0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0]
parent2 = [0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
Let’s pick all the matching chromosomes
child = [0, 0, ?, 0, 1, ?, 0, ?, ?, 1, 0, 0, 1, 0]
We have three features with the value 1 which will remain, then four mismatched. From the mismatched, we will randomly select 2 to become 1s. Thus our selection will be
child = [0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0]
We flipped coins until we got two heads, the first two came up heads in this case, so we filled in the first two items.
Now for the mutation. We will randomly swap out a 1 for a 0.
mutant = [0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0]
Notice we swapped the 6th (or 5th in python) with the 11th (or 10th). This is how we produce new solutions. That’s the simple part. Now we need a scoring function.
The genetic algorithm looks for the “fittest” solution; that is, we’re approaching this as the survival of the fittest heuristic.
We’ll build a decision tree (classifier) and score it using an elementary out-of-the-box metric in this particular case. From sklearn in python and Julia, we can use the f1 score or receiver operating characteristic. Pick the metric you prefer. I like f1.
The algorithm is simple. Go through each of the items in a population, score them all, and keep the top 20 or so. I like to work in powers of 2. So usually, I’d start an initial population of 64. I’d breed and mutate solutions for each future generation until we get approximately 256 keys, then keep the top 32.
The pseudo-code looks something like this.
Now we need a stopping criterion. With a genetic algorithm, the stopping criterion we usually choose is generating a fixed number of new populations. For example, we may want to stop after 100 generations, or if the best score at a new age has not improved in the last three generations.
Build a new population
procedure GetNewPopulation(oldPopulation, populationSize= N) for k in range(populationSize) do pick two random parents breed two children add a Mutant end for split train and test datasets for item in potentialPopulation do train and test populations use only features in the present item compute (f1) score for the current test set if current score is in top N then keep item in newPopulation else discard item end if end for return Best N items as newPopulation end procedure