The Mann-Whitney U test is one of the best-known non-parametric statistical significance tests. It is sometimes also called the Mann-Whitney-Wilcoxon test.

The test is appropriate to the case of two independent samples of observations that are measured at least at an ordinal level, i.e. we can at least say, of any two observations, which is the greater. The test assesses whether the degree of overlap between the two observed distributions is less than would be expected by chance, on the null hypothesis that the two samples are drawn from a single population.

The test involves the calculation of a statistic, usually called U, whose distribution under the null hypothesis is known. In the case of small samples, the distribution is tabulated, but for samples above about 20 there is a good approximation using the normal distribution. Some books tabulate statistics other than U, such as the sum of ranks in one of the samples, but this deviation from standard practice is unhelpful.

The U test is included in most modern statistical packages. However, it is easily calculated by hand, especially for small samples. There are two ways of doing this:

  • Arrange all the observations into a single ranked series, and then add up the ranks in the smaller group (the sum of ranks in the other group follows by calculation, since the sum of all the ranks equals N(N+1)/2, where N is the total number of observations). U is then given by the following formula:

U = n1n2 + n1(n1+1)/2 - R1

where n1 and n2 are the two sample sizes, and R1 is the sum of the ranks in sample 1.

  • For small samples, a direct method is quicker, and it also gives much more insight into the meaning of the U statistic. Choose the sample for which the observations seem to be smaller (or the smaller sample - the choice is relevant only to ease of computation). Call this sample 1, and call the other sample sample 2. Taking each observation in sample 1, count the number of observations in sample 2 that are smaller than it. The total of these counts is U.

Note that the maximum value of U is the product of the two sample sizes (6*6=36 in the example that follows), and if the value obtained by either of the methods above is more than half of this maximum, it should be subtracted from the maximum to obtain the value to look up in tables.
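The two hand methods and the final reduction to the tabulated value can be sketched in Python. This is a minimal illustration that assumes there are no tied observations; the function names (u_from_ranks, u_by_counting, u_for_tables) and the data are hypothetical and chosen only for the example.

```python
def u_from_ranks(sample1, sample2):
    """Rank-sum method: U = n1*n2 + n1*(n1+1)/2 - R1 (assumes no ties)."""
    n1, n2 = len(sample1), len(sample2)
    pooled = sorted(sample1 + sample2)
    # Rank 1 goes to the smallest observation in the pooled series.
    rank = {value: i + 1 for i, value in enumerate(pooled)}
    r1 = sum(rank[x] for x in sample1)
    return n1 * n2 + n1 * (n1 + 1) // 2 - r1


def u_by_counting(sample1, sample2):
    """Direct method: for each observation in sample 1, count the
    observations in sample 2 that are smaller than it."""
    return sum(1 for x in sample1 for y in sample2 if y < x)


def u_for_tables(u, n1, n2):
    """Reduce U to the smaller of U and n1*n2 - U before consulting tables."""
    return min(u, n1 * n2 - u)


# Hypothetical data with no ties.
a = [1.2, 3.4, 2.2, 5.0]
b = [2.9, 4.1, 6.3, 7.7, 8.0]
u_rank = u_from_ranks(a, b)
u_count = u_by_counting(a, b)
print(u_rank, u_count, u_for_tables(u_rank, len(a), len(b)))
```

On untied data the two methods return complementary values that add up to n1n2, which is why the reduction to the smaller value leads both of them to the same table entry.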

For example, let us suppose that Aesop is dissatisfied with his classic experiment in which one tortoise was found to beat one hare in a race, and decides to carry out a significance test to discover whether the results could be extended to tortoises in general and hares in general. He collects a sample of 6 tortoises and 6 hares, and makes them all run his race. The order in which they reach the finishing post is as follows, writing T for a tortoise and H for a hare:

T H H H H H T T T T T H

(his original tortoise still goes at warp speed, and his original hare is still lazy, but the others run truer to stereotype). What is the value of U? We take each tortoise in turn, and count the number of hares it beats, getting the following results: 6,1,1,1,1,1. So U=6+1+1+1+1+1=11. Consulting the table referenced below, we find that this result does not confirm the greater speed of tortoises, though nor does it show any significant speed advantage for hares. It is left as an exercise for the reader to establish that statistical packages will give the same result, at rather greater expense.
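Aesop's count can be checked with a few lines of Python. The counting step follows the text directly; the enumeration at the end is an addition that stands in for the table lookup, assuming a two-sided test in which every assignment of six finishing positions to the tortoises is equally likely under the null hypothesis. The names used (hares_beaten, p_exact and so on) are hypothetical.

```python
from itertools import combinations

race = "T H H H H H T T T T T H".split()  # finishing order, first to last

# Finishing positions (1 = first home) for each species.
tortoises = [pos for pos, animal in enumerate(race, start=1) if animal == "T"]
hares = [pos for pos, animal in enumerate(race, start=1) if animal == "H"]


def hares_beaten(tortoise_positions, hare_positions):
    """U: for each tortoise, count the hares that finish after it."""
    return sum(1 for t in tortoise_positions for h in hare_positions if h > t)


n1, n2 = len(tortoises), len(hares)
u = hares_beaten(tortoises, hares)
u_table = min(u, n1 * n2 - u)  # value to look up in tables

# Exact two-sided p-value by enumerating every possible set of
# finishing positions for the tortoises (assumption: two-sided test).
positions = range(1, n1 + n2 + 1)
extreme = total = 0
for chosen in combinations(positions, n1):
    rest = [p for p in positions if p not in chosen]
    u_perm = hares_beaten(chosen, rest)
    total += 1
    if min(u_perm, n1 * n2 - u_perm) <= u_table:
        extreme += 1
p_exact = extreme / total

print(u, u_table, total)
print(p_exact)
```

The enumeration is feasible here only because the samples are small; for larger samples the tabulated values or the normal approximation are the practical route.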

For large samples, the normal approximation:

z = (U - mU)/σU

can be used, where z is a standard normal deviate whose significance can be checked in tables of the normal distribution. mU and σU are the mean and standard deviation of U if the null hypothesis is true, and are given by the following formulae:

mU = n1n2/2

σU = √(n1n2(n1+n2+1)/12)
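As a rough sketch, the approximation amounts to standardising U by its null mean and standard deviation and reading the tail area from the normal distribution. The function name, the two-sided tail convention, the absence of a continuity correction, and the input values below are all assumptions made for illustration.

```python
import math


def u_normal_approximation(u, n1, n2):
    """Return (z, two-sided p) for U under the null hypothesis, ignoring ties."""
    m_u = n1 * n2 / 2
    sigma_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - m_u) / sigma_u
    # Two-sided tail area of the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# Hypothetical values: two samples of 25 observations with U = 200.
z, p = u_normal_approximation(200, 25, 25)
print(round(z, 3), round(p, 4))
```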

All the formulae given here are made more complicated in the presence of tied ranks, but if the number of ties is small (and especially if there are no large tie bands) they can be ignored when doing calculations by hand. Computer statistical packages will apply the tie corrections as a matter of routine.
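To give an idea of what changes when ties are present, the sketch below uses the commonly quoted conventions, neither of which is spelled out in the text above: tied observations share the average of the ranks they occupy (midranks), and the standard deviation is reduced by a term in t^3 - t for each tie band of size t. The function names and the scores are hypothetical.

```python
import math
from collections import Counter


def midranks(pooled):
    """Map each distinct value to the average of the ranks it occupies."""
    ordered = sorted(pooled)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        # 0-based positions i..j-1 hold ranks i+1..j; average them.
        ranks[ordered[i]] = (i + 1 + j) / 2
        i = j
    return ranks


def u_with_ties(sample1, sample2):
    """Return (U, sigma_U) using midranks and a tie-corrected sigma."""
    n1, n2 = len(sample1), len(sample2)
    n = n1 + n2
    pooled = list(sample1) + list(sample2)
    ranks = midranks(pooled)
    r1 = sum(ranks[x] for x in sample1)
    u = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    # Sum of t^3 - t over tie bands of size t (untied values contribute 0).
    tie_term = sum(t**3 - t for t in Counter(pooled).values())
    sigma = math.sqrt(n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1))))
    return u, sigma


# Hypothetical ordinal scores containing ties.
u, sigma = u_with_ties([1, 2, 2, 3], [2, 3, 4, 4, 5])
print(u, round(sigma, 3))
```

When there are no ties the correction term is zero and the expression reduces to the σU formula given earlier.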

The U test is useful in the same situations as the independent samples Student's t test, and the question arises of which should be preferred. Before electronic calculators and computer packages made calculations easy, the U test was preferred on grounds of speed of calculation. It remains the logical choice when the data are inherently ordinal; and it is much less likely than the t-test to give a spuriously significant result because of one or two outliers. On the other hand, the U test is often recommended for situations where the distributions of the two samples are very different. This is an error: it tests whether the two samples come from a common distribution, and Monte Carlo methods have shown that it is capable of giving erroneously significant results in some situations where they are drawn from distributions with the same mean and different variances. In that situation, the version of the t-test that allows for the samples to come from populations of different variance is likely to give more reliable results.
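A Monte Carlo check of this kind can be sketched as follows. This is not a reproduction of any published study: the normal populations, the sample sizes, the standard deviations, the seed and the number of trials are arbitrary assumptions, and the test is carried out with the normal approximation rather than exact tables.

```python
import math
import random


def u_test_rejects(sample1, sample2, z_crit=1.96):
    """Two-sided U test via the normal approximation (ties not handled)."""
    n1, n2 = len(sample1), len(sample2)
    u = sum(1 for x in sample1 for y in sample2 if y < x)
    m_u = n1 * n2 / 2
    sigma_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return abs((u - m_u) / sigma_u) > z_crit


def false_positive_rate(n1, sd1, n2, sd2, trials=1000, seed=1):
    """Fraction of trials in which the U test rejects at the nominal 5%
    level when both populations are normal with mean 0."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        s1 = [rng.gauss(0.0, sd1) for _ in range(n1)]
        s2 = [rng.gauss(0.0, sd2) for _ in range(n2)]
        if u_test_rejects(s1, s2):
            rejections += 1
    return rejections / trials


# Same mean throughout; only the spread of the smaller sample changes.
same_spread = false_positive_rate(20, 1.0, 60, 1.0)
different_spread = false_positive_rate(20, 5.0, 60, 1.0)
print(same_spread, different_spread)
```

Comparing the two printed rates shows how far the rejection rate drifts from the nominal level once the variances differ, even though the means are identical in both settings.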

The U test is related to a number of other nonparametric statistical procedures; for example, it is equivalent to using Kendall's τ correlation coefficient in a situation where one of the variables being correlated can only take two values.

A statistic linearly related to U, the ρ statistic proposed by Richard Herrnstein, is widely used in studies of categorization (discrimination learning involving concepts) in birds: see animal cognition.

External links