Voice production is the generation of sound in the human speech organs.

For simplicity we begin our inquiry with the pure vowels, but our results will apply to other sounds as well, since diphthongs and many consonants can also be wholly or partly understood through the formant analysis we develop here for the pure vowels.

The sound production with which we are concerned begins with compression of the lungs to create a reservoir of relatively high pressure and an (often) continuous airflow up and out through the vocal tract. The larynx, or voice box, is a cylindrical framework of cartilage that anchors the vocal cords, which close like a pair of lips to impede that airflow until they are periodically forced apart by the increasing air pressure from the lungs. Voiced phonemes such as the pure vowels are by definition distinguished by the buzzing sound of this periodic oscillation of the vocal cords. The lips of the mouth can be used in a similar way to create a similar sound, as any toddler or trumpeter can demonstrate. The same toddler may be entertained with a rubber balloon, inflated but not tied off, whose neck is stretched tight so that it squeaks or buzzes, depending on the tension across the neck and the level of pressure inside the balloon. Similar actions, with similar results, occur when the vocal cords are contracted or relaxed across the voice box. That is a sufficient description of what singers call chest voice.

What is known as head voice is produced when the vocal cords are relaxed further, until the elastic oscillation of vocal-cord muscle tissue matters less than the behavior of the opening as an approximately stiff-walled whistle. A yodel or an adolescent voice break is what happens when the voice crosses the discontinuity between these two modes of vocal-cord action.

The well-defined base frequency provided by the vocal cords in voiced phonemes is only a convenience, however, not a necessity, since a strictly unvoiced whisper is still quite intelligible. Our interest is therefore focused mainly on the further modulations of, and additions to, the base tone by other parts of the vocal apparatus, as determined by the variable dimensions of the oral, pharyngeal, and even nasal cavities.

The linguistic function of a vowel is to separate and frame consonants; the vowel is the sound of longest duration in a syllable. That is the context in which, for English, we say the vowels are $a, e, i, o,$ and $u$. However, different sounds perform the linguistic function of vowels in different languages, even to the point of recategorizing some vowels as consonants, and our models of voice production can be applied much more precisely if we use an alternative definition of a vowel in terms of physiological rather than linguistic behavior. From this viewpoint we define a vowel as any phoneme in which airflow is impeded only or mostly by the voicing action of the vocal cords.

A phoneme itself, however, is too abstract and context-dependent to have a simple frequency decomposition. We define a phoneme as one of the abstract units of the phonetic system of a language: it corresponds to a set of similar speech sounds that speakers of the language perceive as a single distinctive sound. Compare the allophone, which is one of those similar speech sounds. An allophone is the contextually specific realization of a phoneme, and a phoneme is the (language-dependent) smallest distinguishable unit of sound. In a particular context, a habitual approximation of the phonemic ideal usually becomes so familiar as to be conventional.

With these definitions in mind we can state without further explanation that the information which humans require to distinguish between vowels has a purely quantitative representation in the frequency decomposition of the vowel sounds. Adjusting the volume and dimensions of the vocal cavities changes the coefficients of the transfer function that the vocal tract applies to the sound coming from the larynx.
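As a rough illustration of how cavity dimensions shape the frequency response, the simplest textbook idealization treats the whole vocal tract as a uniform tube, closed at the larynx and open at the lips, which resonates at odd multiples of a quarter wavelength. This is only a sketch: the tube lengths, the sound speed of 350 m/s, and the name `tube_resonances` are assumptions chosen for illustration, not values given in this article, and a real vocal tract is far from uniform.

```python
def tube_resonances(length_m, n_resonances=3, sound_speed=350.0):
    """Resonant frequencies (Hz) of a uniform tube closed at one end.

    Quarter-wavelength resonator: F_n = (2n - 1) * c / (4 * L).
    length_m and sound_speed are illustrative assumptions.
    """
    return [
        (2 * n - 1) * sound_speed / (4.0 * length_m)
        for n in range(1, n_resonances + 1)
    ]


# Hypothetical tube lengths: a longer and a shorter vocal tract.
longer = tube_resonances(0.17)
shorter = tube_resonances(0.14)

for name, freqs in (("0.17 m", longer), ("0.14 m", shorter)):
    print(name, [round(f) for f in freqs])
```

Under these assumptions the 0.17 m tube resonates near 515, 1544, and 2574 Hz, and shortening the tube raises every resonance. Vowels differ because the real tract is narrowed and widened along its length, which moves these resonances away from the evenly spaced values of the uniform tube.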

Formants are the characteristic frequency bands that tell a listener which vowel she hears. Most formants are produced by tube and chamber resonance, which reinforces whichever harmonics of the base frequency fall near the resonance, but a few are whistle effects derived from the periodic collapse of Venturi-effect low-pressure zones. Those whistle formants are presumably not harmonics of the base frequency from the larynx, unless by accident or deliberation.
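One way to picture the relation between the harmonics of the base frequency and a resonance is to list the harmonics and see which one lies closest to a given formant. The sketch below assumes a formant near 500 Hz and base frequencies of 120 Hz and 220 Hz; these numbers and the function names are hypothetical, chosen only for illustration.

```python
def harmonics(f0, f_max):
    """Integer multiples of the base frequency f0 up to f_max (Hz)."""
    return [k * f0 for k in range(1, int(f_max // f0) + 1)]


def nearest_harmonic(f0, formant_hz):
    """Return (k, k * f0) for the harmonic of f0 closest to formant_hz."""
    k = max(1, round(formant_hz / f0))
    return k, k * f0


# Hypothetical values: one formant, two different base frequencies.
formant = 500
for f0 in (120, 220):
    k, freq = nearest_harmonic(f0, formant)
    print(f"f0={f0} Hz: harmonic {k} at {freq} Hz is nearest to {formant} Hz")
```

With these assumed values, a 120 Hz voice has its fourth harmonic (480 Hz) nearest the formant, while a 220 Hz voice has its second (440 Hz). The resonance itself does not move when the base frequency changes, which is consistent with the observation that vowels remain intelligible across different pitches and even in a whisper.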

See also: Vocal loading, Phonetics, Speech processing