Binary classification is the task of, given a set of objects, classifying them into two groups on the basis of whether they have some property or not. Some typical binary classification tasks are

  • medical testing to determine if a patient has certain disease or not (the classification property is the disease)
  • quality control in factories; ie. deciding if a new product is good enough to be sold, or if it should be discarded (the classification property is being good enough)
  • deciding whether a page or an article should be in the result set of a search not (the classification property is the relevance of the article - typically the presense of a certain word in it)

Classification in general is one of the problems studied in computer science, in order to automatically learn classification systems; some methods suitable for learning binary classifiers include the decision trees, Bayesian networks, support vector machines, and neural networks.

Sometimes, classification tasks are trivial. Given 100 balls, some of then red and some blue, a human with normal color vision can easily separate them into red ones and blue ones. However, some tasks, like those in practical medicine, and those interesting from the computer science point-of-view, are far from trivial, and produce also faulty results.

Evaluation of binary classifiers

To measure the performance of a medical test, concepts sensitivity and specificity are often used; these concepts are readily usable for the evaluation of any binary classifier. Say we test some people for the presense of a disease. Some of these people have the disease, and our test says they are positive. They are called true positives. Some have the disease, but the test claims they don't. They are called false negatives. Some don't have the disease, and the test says they don't - true negatives. Finally, we might have healthy people who have a positive test result false positives.

Sensitivity is the proportion of people that tested positive of all the positive people tested; that is (true positives) / (true positives + false negatives). It can be seen as the probability that the test is positive given that the patient is sick. The higher the sensitivity, the less real cases of diseases go undetected (or, in the case of the factory quality control, the less faulty products go to the market).

Specificity is the proportion of people that tested negative of all the negative people tested; that is (true negatives) / (true negatives + false positives). As with sensitivity, it can be looked at as the probability that the test is negative given that the patient is not sick. The higher the specificity, the less healthy people are labeled as sick (or, in the factory case, the less money the factory loses by discarding good products instead of selling them).

In theory, sensitivity and specificity are independent in the sense that it is possible to achieve 100 % in both (for instance, the human classifying the red and blue balls most likely does). In practice, there often is a tradeoff, and you can't achieve both. [Explanation why should go here.]

In addition to sensitivity and specificity, the performance of a binary classification test can be measured with positive and negative prediction values. These are possibly more intuitively clear: the positive prediction value answers the question "how likely it is that I really have the disease, given that my test result was positive?". It is calculated as (true positives) / (true positives + false positives); that is, it is the proportion of true positives out of all positive results. (The negative prediction value is the same, but for negatives, naturally.)

One should note, though, one important difference between the these concepts. That is, sensitivity and specificity are independent from the population in the sense that they don't change depending on what the proportion of positives and negatives tested are. Indeed, you can determine the sensitivity of the test by testing only positive cases. However, the prediction values are dependent on the population.

As an example, say that you have a test for a disease with 99 % sensitivity and 99 % specificity. Say you test 2000 people, and 1000 of them are sick and 1000 of them are healthy. You are likely to get about 990 true positives, 990 true negatives, and 10 of false positives and negatives each. The positive and negative prediction values would be 99 %, so the people can be quite confident about the result.

Say, however, that of the 2000 people only 100 are really sick. Now you are likely to get 99 true positives, 1 false negative, 1881 true negatives and 190 false positives. Of the 190+99 people tested positive, only 99 really have the disease - that means, intuitively, that given that your test result is positive, there's only 34.2 % chance that you really have the disease. On the other hand, given that your test result is negative, you can really be reassured: there's only 1 chance in 1881, or 0.05% probability, that you have the disease despite of your test result.

The receiver operating characteristic is a graphical way of visualizing the performance of binary classifiers.

See also: