The **principle of maximum entropy** is a method for analyzing the available information in order to determine a unique epistemic probability distribution. Claude E. Shannon, the originator of information theory, defined a measure of uncertainty for a probability distribution, $H(\mathbf{p}) = -\sum_i p_i \log p_i$, which he called information entropy. In his work, information entropy was determined from (i.e. was a function of) a given probability distribution. The principle of maximum entropy tells us that the converse is also possible: a probability distribution can be determined using the information entropy concept. It states that the probability distribution that uniquely represents or encodes our state of information is the one that maximizes the uncertainty measure $H(\mathbf{p})$ while remaining consistent with our information.

Naturally, this rule is meaningless to those who espouse the frequency interpretation of probability, for whom probabilities are relative frequencies rather than degrees of belief in uncertain propositions, conditional upon a state of information.
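As a small illustration of the uncertainty measure, the following sketch computes $H(\mathbf{p})$ for a few distributions. The function name `entropy` and the three example distributions are chosen here for illustration only; the natural logarithm is assumed, so the result is in nats.

```python
import math

def entropy(p):
    """Shannon entropy H(p) = -sum_i p_i log p_i, in nats (natural log)."""
    if any(pi < 0 for pi in p) or not math.isclose(sum(p), 1.0, abs_tol=1e-9):
        raise ValueError("p must be non-negative and sum to 1")
    # Terms with p_i = 0 contribute nothing, by the convention 0 log 0 = 0.
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# Illustrative distributions over four outcomes (not taken from the text).
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]
certain = [1.0, 0.0, 0.0, 0.0]

for name, p in [("uniform", uniform), ("skewed", skewed), ("certain", certain)]:
    print(name, entropy(p))
```

The uniform distribution gives the largest value (log 4 for four outcomes), and a distribution that puts all its mass on one outcome gives zero.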

### Testable information

- "$p_2 + p_3 > 0.9$"

Given testable information, the maximum entropy procedure consists of seeking the probability distribution which maximizes information entropy, subject to the constraints of the information. This constrained optimization problem is typically solved using the method of Lagrange multipliers.

Entropy maximization with no testable information takes place under a single constraint: the sum of the probabilities must be one. Under this constraint, the maximum entropy probability distribution over $n$ possible outcomes is the uniform distribution,

$$p_i = \frac{1}{n}, \qquad i = 1, \ldots, n.$$
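A short sketch of how the method of Lagrange multipliers leads to the uniform distribution in this case, assuming $n$ possible outcomes and writing the single multiplier as $\lambda_0$ (a symbol introduced only for this illustration):

$$
\begin{aligned}
\mathcal{L}(p_1, \ldots, p_n, \lambda_0) &= -\sum_{i=1}^{n} p_i \log p_i + \lambda_0 \Bigl( \sum_{i=1}^{n} p_i - 1 \Bigr), \\
\frac{\partial \mathcal{L}}{\partial p_i} = -\log p_i - 1 + \lambda_0 = 0 \quad &\Longrightarrow \quad p_i = e^{\lambda_0 - 1}, \\
\sum_{i=1}^{n} p_i = 1 \quad &\Longrightarrow \quad n\, e^{\lambda_0 - 1} = 1 \quad \Longrightarrow \quad p_i = \frac{1}{n}.
\end{aligned}
$$

Every $p_i$ takes the same value because the stationarity condition does not depend on $i$; the resulting entropy is $\log n$.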

### General solution for the maximum entropy distribution with linear constraints

#### Discrete case

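Suppose the testable information about a quantity *x* taking values in $\{x_1, \ldots, x_n\}$ is expressed as *m* constraints on the expectations of the functions $f_k$, with prescribed values $F_k$:

$$\sum_{i=1}^{n} p_i f_k(x_i) = F_k, \qquad k = 1, \ldots, m.$$

Furthermore, the probabilities must sum to one, giving the constraint

$$\sum_{i=1}^{n} p_i = 1.$$

The probability distribution with maximum information entropy subject to these constraints is

$$p_i = \frac{1}{Z(\lambda_1, \ldots, \lambda_m)} \exp\left[ \lambda_1 f_1(x_i) + \cdots + \lambda_m f_m(x_i) \right],$$

with the normalization constant

$$Z(\lambda_1, \ldots, \lambda_m) = \sum_{i=1}^{n} \exp\left[ \lambda_1 f_1(x_i) + \cdots + \lambda_m f_m(x_i) \right].$$

The $\lambda_k$ parameters are Lagrange multipliers whose particular values are determined by the constraints according to

$$F_k = \frac{\partial}{\partial \lambda_k} \log Z(\lambda_1, \ldots, \lambda_m).$$

These *m* simultaneous equations do not generally possess a closed form solution, and are usually solved by numerical methods.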
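To illustrate the numerical step, here is a minimal sketch for the simplest case of one constraint ($m = 1$, $f_1(x) = x$), assuming a distribution of the exponential form $p_i \propto \exp(\lambda x_i)$. The six-sided die with a prescribed mean of 4.5 is an illustrative choice, as are the function names; bisection is used only because it is easy to verify, not because the text prescribes it.

```python
import math

def maxent_distribution(values, lam):
    """p_i proportional to exp(lam * x_i), normalized by Z(lam)."""
    weights = [math.exp(lam * x) for x in values]
    z = sum(weights)  # normalization constant Z(lam)
    return [w / z for w in weights]

def expected_value(values, p):
    return sum(x * pi for x, pi in zip(values, p))

def solve_multiplier(values, target_mean, lo=-50.0, hi=50.0, tol=1e-12):
    """Find lam such that the distribution's mean equals target_mean.

    The mean is strictly increasing in lam (its derivative is the variance),
    so bisection on [lo, hi] converges when the target lies strictly between
    min(values) and max(values) and the root lies inside the bracket.
    """
    if not min(values) < target_mean < max(values):
        raise ValueError("target_mean must lie strictly inside the value range")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if expected_value(values, maxent_distribution(values, mid)) < target_mean:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Illustrative problem: a die whose long-run average is constrained to 4.5.
faces = [1, 2, 3, 4, 5, 6]
lam = solve_multiplier(faces, 4.5)
p = maxent_distribution(faces, lam)
print(lam, p)
```

With several constraints the same idea applies, but a multidimensional root finder or optimizer would be needed in place of bisection. Setting the target mean to 3.5 recovers a multiplier of zero and the uniform distribution.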

#### Continuous case

For continuous distributions, Jaynes (1963, 1968, 2003) finds that the limiting form of the entropy expression as a discrete distribution approaches a continuous distribution is

$$H_c = -\int p(x) \log \frac{p(x)}{m(x)} \, dx,$$

where $m(x)$, which Jaynes called the "invariant measure", is proportional to the limiting density of discrete points. For now, we shall assume that it is known; we will discuss it further after the solution equations are given.

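We have some testable information *I* about a quantity *x* which takes values in some interval of the real numbers (all integrals below are over this interval). We express this information as *m* constraints on the expectations of the functions $f_k$, i.e. we require our epistemic probability density function $p(x)$ to satisfy

$$\int p(x) f_k(x) \, dx = F_k, \qquad k = 1, \ldots, m,$$

where the $F_k$ are the prescribed expectation values, together with the normalization condition $\int p(x) \, dx = 1$. The probability density function with maximum $H_c$ subject to these constraints is

$$p(x) = \frac{1}{Z(\lambda_1, \ldots, \lambda_m)} \, m(x) \exp\left[ \lambda_1 f_1(x) + \cdots + \lambda_m f_m(x) \right],$$

with the normalization constant

$$Z(\lambda_1, \ldots, \lambda_m) = \int m(x) \exp\left[ \lambda_1 f_1(x) + \cdots + \lambda_m f_m(x) \right] dx.$$

The values of the $\lambda_k$ parameters are determined by the constraints according to

$$F_k = \frac{\partial}{\partial \lambda_k} \log Z(\lambda_1, \ldots, \lambda_m).$$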

The invariant measure function $m(x)$ can be best understood by supposing that *x* is known to take values only in the bounded interval (*a*, *b*), and that no other information is given. Then the maximum entropy probability density function is

$$p(x) = A \, m(x), \qquad a < x < b,$$

where *A* is a normalization constant. The invariant measure function is actually the prior density function encoding 'lack of relevant information'. It cannot be determined by the principle of maximum entropy, and must be determined by some other logical method, such as the principle of transformation groups or marginalization theory.
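As a brief worked example, assume the density on (*a*, *b*) has the form $p(x) = A\,m(x)$. Requiring it to integrate to one fixes the normalization constant; the constant-measure case is included only as an illustration and is not singled out by the text:

$$
\int_a^b A \, m(x) \, dx = 1 \quad \Longrightarrow \quad A = \left( \int_a^b m(x) \, dx \right)^{-1},
\qquad \text{and if } m(x) \text{ is constant on } (a, b), \quad p(x) = \frac{1}{b - a}.
$$

In that special case the result is the continuous analogue of the uniform distribution obtained in the discrete case with no testable information.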

### Justifications for the principle of maximum entropy

#### Information entropy as a measure of 'uninformativeness'

#### The Wallis derivation

The following argument is the result of a suggestion made by Graham Wallis to E. T. Jaynes in 1962 (Jaynes, 2003). It is essentially the same mathematical argument used for the derivation of the partition function in statistical mechanics, although the conceptual emphasis is quite different. It has the advantage of being strictly combinatorial in nature, making no reference to information entropy as a measure of 'uncertainty', 'uninformativeness', or any other imprecisely defined concept. The information entropy function is not assumed *a priori*, but rather is found in the course of the argument; and the argument leads naturally to the procedure of maximizing the information entropy, rather than treating it in some other way.

Suppose an individual wishes to make an epistemic probability assignment among *m* mutually exclusive propositions. She has some testable information, but is not sure how to go about including this information in her probability assessment. She therefore conceives of the following random experiment. She will distribute *N* quanta of epistemic probability (each worth 1/*N*) at random among the *m* possibilities. (One might imagine that she will throw *N* balls into *m* buckets while blindfolded. In order to be as fair as possible, each throw is to be independent of any other, and every bucket is to be the same size.) Once the experiment is done, she will check if the probability assignment thus obtained is consistent with her information. If not, she will reject it and try again. Otherwise, her assessment will be

$$p_i = \frac{n_i}{N},$$

where $n_i$ is the number of quanta that were assigned to the $i$-th proposition.
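The random experiment can be sketched in a few lines. The names `wallis_trial` and `wallis_assignment`, the fixed seed, the cap on the number of attempts, and the example constraint (the first proposition has probability at least 0.5) are all assumptions made for this illustration.

```python
import random

def wallis_trial(N, m, rng):
    """Throw N quanta at random into m equal buckets; return p_i = n_i / N."""
    counts = [0] * m
    for _ in range(N):
        counts[rng.randrange(m)] += 1  # independent, equally likely throws
    return [n / N for n in counts]

def wallis_assignment(N, m, is_consistent, rng, max_tries=100_000):
    """Repeat the experiment until the result agrees with the information."""
    for _ in range(max_tries):
        p = wallis_trial(N, m, rng)
        if is_consistent(p):
            return p
    raise RuntimeError("no consistent assignment found within max_tries")

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical testable information: p[0] >= 0.5.
p = wallis_assignment(N=20, m=3, is_consistent=lambda q: q[0] >= 0.5, rng=rng)
print(p)
```

With only 20 quanta the accepted assignment is visibly grainy (every probability is a multiple of 1/20), which is the motivation for using a large *N*.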

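Now, in order to reduce the 'graininess' of the epistemic probability assignment, it will be necessary to use quite a large number of quanta of epistemic probability. Rather than actually carry out, and possibly have to repeat, the rather long random experiment, our protagonist decides to simply calculate and use the most probable result. The probability of any particular result is given by the multinomial distribution,

$$\Pr(\mathbf{p}) = W \cdot m^{-N}, \qquad W = \frac{N!}{n_1! \, n_2! \cdots n_m!},$$

where $W$ is the multiplicity of the outcome, i.e. the number of ways of distributing the quanta that produce the same counts.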

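The most probable result is the one which maximizes the multiplicity *W*. Rather than maximizing *W* directly, our protagonist could equivalently maximize any monotonically increasing function of *W*. She decides to maximize

$$\frac{1}{N} \log W = \frac{1}{N} \log \frac{N!}{n_1! \, n_2! \cdots n_m!} = \frac{1}{N} \left( \log N! - \sum_{i=1}^{m} \log \bigl( (N p_i)! \bigr) \right),$$

using $n_i = N p_i$. To simplify the expression, she takes the limit as $N \to \infty$, i.e. as the epistemic probability levels go from grainy discrete values to smooth continuous values. Using Stirling's approximation, she finds

$$
\begin{aligned}
\lim_{N \to \infty} \frac{1}{N} \log W
&= \lim_{N \to \infty} \frac{1}{N} \left( N \log N - \sum_{i=1}^{m} N p_i \log (N p_i) \right) \\
&= \lim_{N \to \infty} \left( \log N - \log N \sum_{i=1}^{m} p_i - \sum_{i=1}^{m} p_i \log p_i \right) \\
&= -\sum_{i=1}^{m} p_i \log p_i = H(\mathbf{p}).
\end{aligned}
$$

The quantity to be maximized is therefore the information entropy, subject to the constraints of her testable information.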
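The limit used in the Wallis derivation can be checked numerically: for fixed proportions $p_i = n_i / N$, the quantity $(1/N) \log W$ with $W = N! / (n_1! \cdots n_m!)$ should approach $-\sum_i p_i \log p_i$ as *N* grows. The sketch below uses `math.lgamma` to evaluate the log-factorials; the distribution `[0.5, 0.3, 0.2]` and the chosen values of *N* are illustrative assumptions.

```python
import math

def log_multiplicity_per_quantum(counts):
    """(1/N) log W for W = N! / (n_1! ... n_m!), using lgamma(n + 1) = log n!."""
    N = sum(counts)
    log_w = math.lgamma(N + 1) - sum(math.lgamma(n + 1) for n in counts)
    return log_w / N

def entropy(p):
    """Information entropy -sum_i p_i log p_i, in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

p = [0.5, 0.3, 0.2]  # illustrative proportions
H = entropy(p)

gaps = []
for N in (10, 1_000, 100_000):
    counts = [round(pi * N) for pi in p]  # whole numbers of quanta for these N
    gap = H - log_multiplicity_per_quantum(counts)
    gaps.append(gap)
    print(N, gap)
```

The gap between the entropy and $(1/N) \log W$ stays positive and shrinks as *N* increases, in line with the Stirling approximation.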

### References

Jaynes, E. T., 1963, 'Information Theory and Statistical Mechanics', in Statistical Physics, K. Ford (ed.), Benjamin, New York, p. 181.

Jaynes, E. T., 1968, 'Prior Probabilities', IEEE Trans. on Systems Science and Cybernetics, SSC-4, 227.

Jaynes, E. T., 2003, 'Probability Theory: The Logic of Science', Cambridge University Press.