Arithmetic coding

Arithmetic coding in data compression is the process of encoding a stream of data into a large binary fraction. It can achieve near-optimal entropy encoding.

Arithmetic coding actually refers to half of an Arithmetic Coding data compression system.

It has two parts:

An arithmetic coder
A data model (e.g., Markovian model)

At any point in time in coding the input stream of data,

There is a new data character (a fixed number of bits per character)
There is a set of probabilities for each possible character

The number of bits used to encode a character depends on how much of the range (0, 1) the character's probability is.

The larger the range, the less bits it takes to code the character. The smaller the range, the more bits it takes to code the character.

Typically, the model used to code the data changes based on the data input stream contents. This is known as adaptive coding.

See also : Data compression, Range encoder

The following text has been moved from Arithmetic encoding:

An arithmetic encoder takes a string of symbols as input and produces a rational number in the interval [0, 1) as output. As each symbol is processed, the encoder will restrict the output to a smaller interval.

Let N be the number of distinct symbols in the input; let x₁, x₂ ... x_N represent the symbols, and let P₁, P₂ ... P_N represent the probability of each symbol appearing. At each step in the process, the output is restricted to the current interval [y, y+R). Partition this interval into N disjoint subintervals:

I₁ = [y, y + P₁R)

I₂ = [y + P₁R, y + P₁R + P₂R)

Therefore the size of I_i is P_iR. If the next symbol is x_i, then restrict the output to the new interval I_i.

Note that at each stage, all the possible intervals are pairwise disjoint. Therefore a specific sequence of symbols produces exactly one unique output range, and the process can be reversed.

Since arithmetic encoders are typically implemented on binary computers, the actual output of the encoder is generally the shortest sequence of bits representing the fractional part of a rational number in the final interval.

Suppose our entire input string contains M symbols: then x_i appears exactly P_iM times in the input. Therefore, the size of the final interval will be

By Shannon's theorem, this is the total entropy of the original message. Therefore arithmetic encoding is near-optimal entropy encoding.

However, IBM and other companies own patents in the United States and other countries on algorithms essential for implementing an arithmetic encoder. But are those patent holders willing to license the patents royalty-free for use in open-source software?

An earlier version of the above article was posted on PlanetMath. This article is open content