F1 Score

One of the difficulties of learning statistics is that you quickly realize “accuracy” is a much less useful metric than you were taught.

Let’s give an example. 

Consider a scenario where we’re trying to predict an infrequent event such as a tire blowout on a particular stretch of highway. If it happens, this is a severe problem, but tire technology is quite good, and blowouts are pretty rare. For the sake of ease, let’s say that one car in 5,000 has a tire blowout each day (that’s a drastic overestimate, but work with me here).

With four tires per car, this means that only 1 in 20,000 tires will blow out on a given day. That’s 0.005%. The easiest way to be “accurate” is to guess that no tires will ever blow out, and then you’re 99.995% accurate. Unfortunately, that guess is useless to the person or persons whose tire does blow out, and for them the consequences are severe.
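The arithmetic above can be checked in a few lines. This is only a sketch: the four-tires-per-car figure is an assumption implied by the numbers in the text rather than stated in it, and the variable names are made up for illustration.

```python
# Figures from the text: 1 blowout per 5,000 cars per day.
# Assumption (implied, not stated): 4 tires per car.
cars_per_blowout = 5_000
tires_per_car = 4

tires = cars_per_blowout * tires_per_car   # tires observed per blowout
blowouts = 1                               # actual positives among them

# A "model" that always predicts "no blowout" is right on every tire
# that holds up and wrong on every tire that blows out.
correct = tires - blowouts
accuracy = correct / tires

print(f"tires per blowout: {tires:,}")
print(f"blowout rate: {blowouts / tires:.3%}")
print(f"accuracy of always guessing 'no': {accuracy:.3%}")
```

The high accuracy comes entirely from the rarity of the event, not from any predictive skill.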

So, pure accuracy is not a good measure in general. We learn there are other measures: sensitivity and specificity for the statisticians among you, and recall and precision for the machine learning types. These measures can most easily be described as “yes means yes” and “no means no” standards.

For example, in the world of coronavirus, we may want a “no means no” test, that is, one with high sensitivity. We’re willing to accept a few false positives (that is, a yes prediction but no actual virus) rather than let someone with a highly communicable virus walk around falsely confident that they are not contagious. On the other hand, suppose we’re testing for pipe safety in our city. Here we’re willing to miss a few unsafe pipes, but we don’t want false positives (pipes flagged as unsafe when they are fine). This is when we want a highly specific (“yes means yes”) test.

Accuracy will fail to deliver the information we want and need in both cases.
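To make these measures concrete, here is a minimal sketch that computes them from the four cells of a confusion matrix. The helper names `safe_div` and `rates` are hypothetical, and the counts are invented purely for illustration (a test given to 1,000 people, 50 of whom are infected); they are not data from any real test.

```python
from typing import Optional


def safe_div(num: float, den: float) -> Optional[float]:
    """Return num / den, or None when the ratio is undefined (den == 0)."""
    return num / den if den else None


def rates(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Common rates computed from confusion-matrix counts."""
    return {
        "accuracy": safe_div(tp + tn, tp + fp + fn + tn),
        "sensitivity (recall)": safe_div(tp, tp + fn),  # how many real yeses we catch
        "specificity": safe_div(tn, tn + fp),           # how many real noes we clear
        "precision": safe_div(tp, tp + fp),             # how many predicted yeses are real
    }


# Hypothetical counts: 50 infected (45 caught, 5 missed),
# 950 healthy (855 cleared, 95 falsely flagged).
example = rates(tp=45, fp=95, fn=5, tn=855)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts, accuracy, sensitivity, and specificity all come out at 0.9, yet precision is only about 0.32, because the rare positives are outnumbered by false alarms. Note also that precision and specificity are different quantities.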

In machine learning, when building a classification model, there are more measures: the receiver operating characteristic (ROC) curve, the precision-recall curve, the f beta score, measures derived from the confusion matrix (something we use at CDL1000, but maybe less well known), and, of course, the f1 score. Each of these metrics has its place in the broader scheme of machine learning. But the one measure I’d like to talk about is the f1 score. This metric gives a good balance between false positives and false negatives. The f1 score is the harmonic average of precision and recall. (Recall is the same quantity as sensitivity; precision is not the same as specificity.)
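For readers who want to see the relationship between the f1 and f beta scores, here is a small sketch using the standard F-beta formula, (1 + beta^2) * P * R / (beta^2 * P + R). The function name `f_beta` and the example values are illustrative assumptions, and returning 0 when the denominator is zero is a convention rather than something the formula dictates.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score from precision and recall.

    beta = 1 gives the f1 score (the harmonic average of the two).
    beta > 1 weights recall more heavily; beta < 1 weights precision.
    Returns 0.0 when the denominator is zero (a convention).
    """
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + b2) * precision * recall / denom


# Illustrative values only: a model with modest precision and high recall.
p, r = 0.5, 0.9
print(f"f1   = {f_beta(p, r):.3f}")
print(f"f2   = {f_beta(p, r, beta=2):.3f}")    # rewards the high recall
print(f"f0.5 = {f_beta(p, r, beta=0.5):.3f}")  # punishes the low precision
```

Choosing beta is a way of saying how much more a false negative costs than a false positive; the f1 score treats them evenly.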

The harmonic average penalizes a model for a low score more than a simple average does.

Quickly, if we have two numbers, say precision p and recall r,

the average is (p + r)/2

the geometric average is sqrt(p*r)

and the harmonic average is 2*p*r/(p + r), or zero if both are zero

A well-known fact is that, for non-negative numbers, harmonic average <= geometric average <= average.
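The three averages and the ordering between them can be tried out with a short sketch. The function names and the pairs of scores are made up for illustration; the zero-denominator case follows the "zero if both are zero" rule given above.

```python
import math


def arithmetic(a: float, b: float) -> float:
    return (a + b) / 2


def geometric(a: float, b: float) -> float:
    return math.sqrt(a * b)


def harmonic(a: float, b: float) -> float:
    # Defined as zero when both inputs are zero, as in the text.
    return 2 * a * b / (a + b) if (a + b) else 0.0


# Illustrative pairs of scores between 0 and 1.
pairs = [(0.9, 0.9), (0.9, 0.5), (0.9, 0.1), (1.0, 0.0)]
for a, b in pairs:
    print(
        f"a={a}, b={b}: "
        f"harmonic={harmonic(a, b):.3f}  "
        f"geometric={geometric(a, b):.3f}  "
        f"average={arithmetic(a, b):.3f}"
    )
```

For 0.9 and 0.1, the simple average is 0.5, the geometric average is 0.3, and the harmonic average is only 0.18. The more lopsided the two scores, the further the harmonic average falls below the simple one.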

So we use a harmonic average to keep our models in check. It is pulled toward the lower of the two scores, so a model cannot hide a poor recall behind a good precision. For example, if we guess there are no blowouts, our confusion matrix looks like this

              predicted no   predicted yes
actual no           19,999               0
actual yes               1               0

(counts per 20,000 tires, using the blowout rate from the example above)

our accuracy is 0.99995 (AWESOME)

precision is undefined (0/0, because we never predict yes)

recall is 0

f1 is 0 (by convention, since precision is undefined and recall is 0)

roc_auc_score = 0.5 (this is area under the curve of true positive rate vs false positive rate)

In this case, if we want to minimize false negatives, we should use the f1 score or recall. ROC AUC is not a good metric for this, precision doesn’t make any sense, and accuracy gives us an overly confident and potentially catastrophic answer.
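The numbers for the "no blowouts ever" guess can be reproduced with a standard-library-only sketch. The name `roc_auc_score` in the list above suggests scikit-learn, but nothing here depends on it; every helper below is a hypothetical stand-in written for this example. The sample of 20,000 tires with one blowout is an assumption taken from the rate in the text, and reporting f1 as 0 when precision is undefined is a convention.

```python
def safe_div(num, den):
    """num / den, or None when the ratio is undefined."""
    return num / den if den else None


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for 0/1 labels and predictions."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if t and p:
            tp += 1
        elif p:
            fp += 1
        elif t:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def f1_from(precision, recall):
    """Harmonic average of precision and recall; 0.0 when undefined."""
    if precision is None or recall is None or precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win, which equals the area under the ROC curve.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Assumed sample: 20,000 tires, exactly one blowout, and a model that
# always answers "no" (score 0 for every tire).
y_true = [0] * 19_999 + [1]
y_pred = [0] * 20_000

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = safe_div(tp + tn, len(y_true))
precision = safe_div(tp, tp + fp)   # None: we never predicted "yes"
recall = safe_div(tp, tp + fn)
f1 = f1_from(precision, recall)
auc = roc_auc(y_true, y_pred)

print(f"accuracy={accuracy}, precision={precision}, "
      f"recall={recall}, f1={f1}, roc_auc={auc}")
```

Recall and f1 both drop to zero the moment the single blowout is missed, while accuracy barely moves.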
