Data compression

Data compression, a fundamental topic of computer science, is the process of encoding data so that it takes less storage space or less transmission time than it would if it were not compressed. This is possible because most real-world data is very redundant or not most concisely represented in its obvious form.

One very simple means of compression, for example, is run-length encoding, wherein large runs of consecutive identical data values are replaced by a simple code with the data value and length of the run. This is an example of lossless data compression, where the data is compressed in such a way that it can be recovered exactly. For symbolic data such as spreadsheets, text, executable programs, etc., losslessness is essential because changing even a single bit cannot be tolerated (except in some limited cases).

In other kinds of data such as sounds and pictures, a small loss of quality can be tolerated without losing the essential nature of the data, so lossy data compression methods can be used. These frequently offer a range of compression efficiencies, where the user can choose whether he wants highly-compressed data with noticeable loss of quality or higher-quality data with less compression. In particular, compression of images and sounds can take advantage of limitations of the human sensory system to compress data in ways that are lossy, but nearly indistinguishable from the original.

Many data compression systems are best viewed with a four-stage model.

Closely allied with data compression are the fields of coding theory and cryptography. Theoretical background is provided by information theory and algorithmic information theory. When compressing information in the form of signals we often use digital signal processing methods. The idea of data compression is deeply connected with statistical inference and particularly with the maximum likelihood principle.

Data compression topics:

Common Data compression algorithms:

Lossless Compression
- Run-length encoding
  - PackBits
  - PCX
- Minimum-redundancy coding, also called entropy encoding
  - Huffman coding (simple entropy coding)
  - Arithmetic coding (more advanced entropy coding; encumbered by patents as of October 2001)
- Dictionary coders
- Burrows-Wheeler transform (block sorting preprocessing that makes compression easier)
- Prediction by partial matching (PPM)
Lossy Compression
- Discrete cosine transform based coders
  - MPEG (the founder of video compression and still in use today, uses DCT and delta)
  - MP3 (sound and music compression)
  - JPEG (image compression using a windowed cosine transform, then quantization, then Huffman coding)
- Fractal compression
  - Fractal transform
- Wavelet compression

The Lempel-Ziv (LZ) compression methods are the most popular algorithms for lossless storage. DEFLATE is a variation on LZ which is optimized for decompression speed and compression ratio. Compression can be slow. DEFLATE is used in PKZIP, gzip and PNG. LZW (Lempel-Ziv-Welch) was patented by Unisys until June of 2003, and is used in GIF images. This patent is the main reason for GIF's increasing obsolescence. Also noteworthy are the LZR (LZ-Renau) methods, which serve as the basis of the Zip method. LZ methods utilize a table based compression model where table entries are subsitituted for redundant data. For most LZ methods, this table is generated dynamically from earlier data in the input. The table itself is often Huffman encoded (eg. SHRI, LZX). The current LZ based code that performs best is the obsolete LZX, although RAR and ACE are now coming close. LZX was purchased by Microsoft, slightly reduced in potency, and used in the CAB format.

Compression of sounds is generally called audio compression, where methods of psychoacoustics are used to remove non-audible components of the signal to make compression more efficient. Audio compression is therefore lossy compression. Different audio compression standards are listed under audio codecs.

External Links

Public domain article on data compression