How Stereo Works

(Updated 03/06/18 to include results for Blumlein Pair microphone arrangement.)

initial

A computer simulation of stereo speakers plus listener, showing the listener’s perception of the directions of three sources that have previously been ‘recorded’. The original source positions are shown overlaid with the loudspeakers.

Ever since building DSP-based active speakers and hearing real stereo imaging effectively for the first time, it has seemed to me that ordinary stereo produces a much better effect than we might expect. In fact, it has intrigued me, and it has been hard to find a truly satisfactory explanation of how and why it works so well.

My experience of stereo audio is this:

  • When sitting somewhere near the middle between two speakers and listening to a ‘purist’ stereo recording, I perceive a stable, compelling 3D space populated by the instruments and voices in different positions.
  • The scene can occasionally extend beyond the speakers (and this is certainly the case with recordings made using Q-Sound and other such processes).
  • Turning my head, the image stays plausible.
  • If I move position, the image falls apart somewhat, but when I stop moving it stabilises again into a plausible image – although not necessarily resembling what I might have expected it to be prior to moving.
  • If I move left or right, the image shifts in the direction of the speaker I am moving towards.

An article in Sound On Sound magazine may contain the most perceptive explanation I have seen:

The interaction of the signals from both speakers arriving at each ear results in the creation of a new composite signal, which is identical in wave shape but shifted in time. The time‑shift is towards the louder sound and creates a ‘fake’ time‑of‑arrival difference between the ears, so the listener interprets the information as coming from a sound source at a specific bearing somewhere within a 60‑degree angle in front.

This explanation is more elegant than the one that simply says that if the sound from one speaker is louder we will tend to hear it as if coming from that direction – I have always found it hard to believe that such a ‘blunt’ mechanism could give rise to a precise, sharp 3D image. Similarly, it is hard to believe that time-of-arrival differences on their own could somehow be relayed satisfactorily from two speakers unless the listener’s head was locked into a fixed central position.

The Sound On Sound explanation says that by reproducing the sound from two spaced transducers that can reach both ears, the relative amplitude also controls the relative timing of what reaches the ears, thus giving a timing-based stereo image that, it appears, is reasonably stable with position and head rotation. This is not a psychoacoustic effect where volume difference is interpreted as a timing difference, but the literal creation of a physical timing difference from a volume difference.
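This physical summation can be checked with a few lines of arithmetic. The sketch below is my own toy example, not from the article: for a single frequency, two copies of the same tone arriving with different gains and delays sum to one tone whose effective delay lies between the two delays, pulled towards the louder copy – the ‘fake’ time-of-arrival difference the Sound On Sound explanation describes.

```python
# Toy demonstration (my own numbers): the phase of the sum of two delayed
# copies of a tone gives an effective delay biased towards the louder copy.
import cmath
import math

f = 500.0                      # tone frequency, Hz
w = 2 * math.pi * f

def effective_delay(a1, t1, a2, t2):
    """Phase delay (seconds) of the sum of two delayed copies of a tone."""
    s = a1 * cmath.exp(-1j * w * t1) + a2 * cmath.exp(-1j * w * t2)
    return -cmath.phase(s) / w

t_near, t_far = 0.0, 0.00025   # 0.25 ms path difference between the copies

equal = effective_delay(1.0, t_near, 1.0, t_far)
loud_near = effective_delay(2.0, t_near, 1.0, t_far)

print(equal)       # midway between the two delays
print(loud_near)   # pulled towards the louder (nearer) copy
```

With equal amplitudes the composite sits exactly midway; doubling one amplitude shifts the effective delay towards that copy, without any timing manipulation of the inputs.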

There must be timbral distortion because of the mixing of the two separately-delayed renditions of the same impulse at each ear, but experience seems to suggest that this is either not significant or that the brain handles it transparently, perhaps because of the way it affects both ears.

Blumlein’s Patent

Blumlein’s original 1933 patent is reproduced here. The patent discusses how time-of-arrival may take precedence over volume-based cues depending on frequency content.

It is not immediately apparent to me that what is proposed in the patent is exactly what goes on in most stereo recordings. As far as I am aware, most ‘purist’ stereo recordings don’t exaggerate the level differences between channels, but simply record the straight signal from a pair of microphones. However, the patent goes on to make a distinction between “pressure” and “velocity” microphones which, I think, corresponds to omni-directional and directional microphones. It is stated that in the case of velocity microphones no amplitude manipulation may be needed. The microphones should be placed close together but facing in different directions (often called the ‘Blumlein Pair’) as opposed to being spaced as “artificial ears”.

Blumlein-Stereo.png

Blumlein Pair microphone arrangement

The Blumlein microphones are bi-directional, i.e. they also respond to sound from the back.

Going by the SoS description, this type of arrangement would record no timing-based information (from the direct sound of the sources, at any rate), just like ‘panpot stereo’, but the speaker arrangement would convert orientation-induced volume variations into a timing-based image, derived from the acoustic summation at each ear of the two differently-delayed, differently-scaled signals. This may be the brilliant step that turns a rather mundane invention (voices come from different sides of the cinema screen) into a seemingly holographic rendering of 3D space when played over loudspeakers.

Thus the explanation becomes one of geometry plus some guesswork regarding the way the ears and brain correlate what they are hearing, presumably utilising both time-of-arrival and the more prosaic volume-based mechanism which says that sounds closer to one ear than the other will be louder – enhanced by the shadowing effect of the listener’s head in the way. Is this sufficient to plausibly explain the brilliance of stereo audio? Does a stereo recording in any way resemble the space in which it was recorded?

A Computer Simulation

In order to help me understand what is going on I have created a computer simulation which works as follows (please skip this section unless you are interested in very technical details):

  • It is a floor plan view of a 2D slice through the system. Objects can be placed at any XY location, measured in metres from an origin.
  • There are no reflections; only direct sound.
  • The system comprises
    • a recording system:
      • Three acoustic sources, each of which generates an identical musical transient (loaded from a mono WAV file at CD quality). Each source is considered in isolation from the others.

      • Two microphones that can be spaced and positioned as desired. They can be omni-directional or have a directional response. In the former case, volume is attenuated with distance from the source while in the latter it is attenuated by both distance and orientation to the source.
    • a playback system:

      • Two omni-directional speakers

      • A listener with two ears and the ability to move around and turn their head.

  • The directions and distances from sources to microphones are calculated based on their relative positions, and from these the delays and attenuations of the signals at the microphones are derived. These signals are ‘recorded’.

  • During ‘playback’, the positions of the listener’s ears are calculated based on XY position of the head and its rotation.

  • The distances from speakers to each ear are calculated, and from these, the corresponding delays and attenuations.

  • The composite signal from each source that reaches each ear via both speakers is calculated and from this is found:

    • relative amplitude ratio at the ears
    • relative time-of-arrival difference at the ears. This is currently obtained by correlating one ear’s summed signal for that source (from both speakers) against the other and looking for the delay corresponding to peak output of this. (There may be methods more representative of the way human hearing ascertains time-of-arrival, and this might be part of a future experiment).

  • There is currently no attempt to simulate HRTF or the attenuating effect of ‘head shadow’. Attenuation is purely based on distance to each ear.

  • The system then simulates the signals that would arrive at each ear from a virtual acoustic source were the listener hearing it live rather than via the speakers.

    • This virtual source is swept through the XY space in fine increments and at each position the ‘real’ relative timings and volume ratio that would be experienced by the listener are calculated.

    • The results are compared to the results previously found for each of the three sources as recorded and played back over the speakers, and plotted as colour and brightness in order to indicate the position the listener might perceive the recorded sources as emanating from, and the strength of the similarity.

  • The listener’s location and rotation can be incremented and decremented in order to animate the display, showing how the system changes dynamically with head rotation or position.
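To make the pipeline above concrete, here is a heavily condensed sketch of my own (all names and the toy geometry are mine, not taken from the author’s simulator): it records one source with two spaced omni microphones, replays the two channels over two speakers, and estimates the inter-aural time difference at the listener by cross-correlation, as described.

```python
# Condensed sketch of the simulation geometry (my own code and layout):
# source -> mics ('recording'), channels -> both ears via both speakers
# ('playback'), then cross-correlate the ear signals for the ITD.
import numpy as np

FS = 44100          # CD-quality sample rate, as in the article
C = 343.0           # speed of sound, m/s

def propagate(signal, src, dst):
    """Delay and 1/r-attenuate a signal travelling from src to dst (2D)."""
    r = np.hypot(dst[0] - src[0], dst[1] - src[1])
    delay = int(round(FS * r / C))
    out = np.zeros(len(signal) + delay)
    out[delay:] = signal / max(r, 1e-6)
    return out

def pad_sum(a, b):
    """Sum two signals of different lengths by zero-padding the shorter."""
    n = max(len(a), len(b))
    return np.pad(a, (0, n - len(a))) + np.pad(b, (0, n - len(b)))

def itd_samples(left_ear, right_ear):
    """Lag (samples) of the cross-correlation peak between the two ears."""
    n = max(len(left_ear), len(right_ear))
    l = np.pad(left_ear, (0, n - len(left_ear)))
    r = np.pad(right_ear, (0, n - len(right_ear)))
    corr = np.correlate(l, r, mode="full")
    return int(np.argmax(corr)) - (n - 1)

# Toy geometry (metres): omnis 0.3 m apart at the origin, speakers at
# +/-1.5 m in front, listener centred with a 0.15 m ear spacing.
src = (-1.0, 2.0)                          # source off to the left
mic_l, mic_r = (-0.15, 0.0), (0.15, 0.0)
spk_l, spk_r = (-1.5, 3.0), (1.5, 3.0)
ear_l, ear_r = (-0.075, 0.0), (0.075, 0.0)

impulse = np.zeros(256)
impulse[0] = 1.0                           # stand-in for the WAV transient

# 'Record': source -> each microphone.
ch_l = propagate(impulse, src, mic_l)
ch_r = propagate(impulse, src, mic_r)

# 'Playback': each channel reaches BOTH ears; sum the two arrivals.
at_left = pad_sum(propagate(ch_l, spk_l, ear_l), propagate(ch_r, spk_r, ear_l))
at_right = pad_sum(propagate(ch_l, spk_l, ear_r), propagate(ch_r, spk_r, ear_r))

itd = itd_samples(at_left, at_right)
print(itd)   # negative here: the left ear's signal leads, source to the left
```

Note that each ear receives two arrivals (one per speaker); the correlation peak of the summed signals is what stands in for the listener’s time-of-arrival judgement, just as in the description above.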

The results are very interesting!

Here are some images from the system, plus some small animations.

Spaced omni-directional microphones

In these images, the (virtual) signal was picked up by a pair of (virtual) omnidirectional microphones on either side of the origin, spaced 0.3m apart. This is neither a binaural recording (which would at least have the microphones a little closer together) nor the Blumlein Pair arrangement, but does seem to be representative of some types of purist stereo recording.

The positions of the three sources during (virtual) recording are shown overlaid with the two speakers, plus the listener’s head and ears. Red indicates response to SRC0; green SRC1; and blue SRC2.

head_rotation

Effect of head rotation on perceived direction of sources based on inter-aural timing when listener is close to the ‘sweet spot’.

side_to_side

Effect of side-to-side movement of listener on perceived imaging based on inter-aural timing.

compound_movement

Compound movement of listener, including front-to-back movement and head rotation.

amplitude

Effect of listener movement on perceived image based on inter-aural amplitudes.

Coincident directional microphones (Blumlein Pair)

Here, directional microphones are set at the origin at right angles to each other, as shown in the earlier diagram. They follow Blumlein’s description in the patent, i.e. output is proportional to the cosine of the angle of incidence.
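For concreteness, the cosine law can be written down in a few lines. This is my own sketch; the angle conventions (capsule axes at ±45° either side of straight ahead) are assumptions for illustration.

```python
# Blumlein pair response sketch (my own framing): two coincident
# bi-directional capsules, axes at +/-45 degrees, each with output
# proportional to cos(angle of incidence) as the patent describes.
import math

def blumlein_gains(bearing_deg):
    """L/R channel gains for a source at `bearing_deg`
    (0 = straight ahead, positive = to the listener's left)."""
    left = math.cos(math.radians(bearing_deg - 45))
    right = math.cos(math.radians(bearing_deg + 45))
    return left, right

for bearing in (0, 30, 45, -45):
    print(bearing, blumlein_gains(bearing))
```

A centre source produces identical gains in both channels; a source 45° to the left lands entirely in the left channel. The two channels carry the same waveform at amplitude ratios set purely by source bearing – no timing information at all.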

blumlein_timing

Time-of-arrival based perception of direction as captured by a coincident pair of directional microphones (Blumlein Pair) and played back over stereo speakers, with compound movement of the listener.

blumlein_amplitude

A similar test, but showing perceived locations of the three sources based on inter-aural volume level.

In no particular order, some observations on the results:

  • A stereo image based on time-of-arrival differences at the ears can be created with two spaced omni-directional microphones or coincident directional microphones. Note, the aim is not to ‘track’ the image with the user’s head movement (like headphones would), but to maintain stable positions in space even as the user turns away from ‘the stage’.
  • The Blumlein Pair gives a stable image with listener movement based on time-of-arrival. The image based on inter-aural amplitude may not be as stable, however.
  • Interaural timing can only give a direction, not distance.

  • A phantom mirror image of equal magnitude also accompanies the frontwards time-of-arrival-derived direction, but this would also be true of ‘real life’. The way this behaves with dynamic head movement isn’t necessarily correct; at some locations and listener orientations maybe the listener could be confused by this.

  • Relative volume at the two ears (as a ratio) gives a ‘blunt’ image that behaves differently from the time-of-arrival based image when the listener moves or turns their head. The plot shows that the same ratio can be achieved for different combinations of distance and angle, so on its own it is ambiguous.

  • Even if the time-of-arrival image stays meaningful with listener movement, the amplitude-based image may not.

  • Combined with timing, relative interaural volume might provide some cues for distance (not necessarily the ‘true’ distance).

  • No doubt other cues combining indirect ‘ambient’ reflections in the recording, comb-filtering, dynamic phase shifts with head movement, head-related transfer function, etc. are also used by the listener and these all contribute to the perception of depth.

  • The cues may not all ‘hang together’, particularly in the situation of movement of the listener, but the human brain seems to make reasonable sense of them once the movement stops.

  • The Blumlein Pair does, indeed, create a time-of-arrival-based image from amplitude variations alone. And this image is stable with movement of the listener – a truly remarkable result, I think.
  • Choice of microphone arrangement may influence the sound and stability of the image.
  • Maybe there is also an issue regarding the validity of different recording techniques when played back over headphones versus speakers. The Blumlein Pair gives no time-of-arrival cues when played over headphones.
  • The audio scene is generally limited to the region between the two speakers.
  • The simulation does not address ‘panpot’ stereo yet, although as noted earlier, the Blumlein microphone technique is doing something very similar.
  • In fact, over loudspeakers, the ‘panpot’ may actually be the most correct way of artificially placing a source in the stereo field, yielding a stable, time-of-arrival-based position.
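The ambiguity in the volume ratio is easy to demonstrate numerically. The sketch below is my own, using the same 1/r attenuation and absence of head shadow as the simulation: it finds two quite different source positions (different distances and bearings) that produce an identical inter-aural level ratio.

```python
# My own numerical check: under 1/r attenuation with no head shadow, the
# inter-aural level ratio alone cannot pin down a direction, because the
# same ratio recurs at other distance/bearing combinations.
import math

EAR = 0.075   # half the ear spacing, metres

def level_ratio(dist, bearing_deg):
    """Far-ear / near-ear distance ratio (= amplitude ratio) for a source
    at (dist, bearing), bearing measured to the listener's left."""
    x = -dist * math.sin(math.radians(bearing_deg))
    y = dist * math.cos(math.radians(bearing_deg))
    d_l = math.hypot(x + EAR, y)          # left ear at (-EAR, 0)
    d_r = math.hypot(x - EAR, y)          # right ear at (+EAR, 0)
    return d_r / d_l

target = level_ratio(1.0, 30.0)           # a source 1 m away, 30 deg left

# Bisect for the bearing giving the identical ratio at only 0.3 m away.
lo, hi = 0.0, 30.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if level_ratio(0.3, mid) < target:
        lo = mid
    else:
        hi = mid

print(target, mid)   # same ratio, but at a much smaller bearing
```

A source 1 m away at 30° and a source 0.3 m away at roughly a third of that bearing present the listener with the same level ratio – the distance/angle ambiguity noted above.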

Perhaps the thing that I find most exciting is that the animations really do seem to reflect what happens when I listen to certain recordings on a stereo system and shift position while concentrating on what I am hearing. I think that the directions of individual sources do indeed sometimes ‘flip’ or become ambiguous, and sometimes you need to ‘lock on’ to the image after moving, and from then on it seems stable and you can’t imagine it sounding any other way. Time-of-arrival and volume-based cues (which may be in conflict in certain listening positions), as well as the ‘mirror image’ time-of-arrival cue, may be contributing to this confusion. These factors may differ with signal content, e.g. the frequency ranges it covers.

It has occurred to me that in creating this simulation I might have been in danger of shattering my illusions about stereo, spoiling the experience forever, but in the end I think my enthusiasm remains intact. What looked like a defect with loudspeakers (the acoustic cross-coupling between channels) turns out to be the reason why it works so compellingly.

In an earlier post I suggested that maybe plain stereo from speakers was the optimal way to enjoy audio and I think I am more firmly persuaded of that now. Without having to wear special apparatus, have one’s ears moulded, make sure one’s face is visible to a tracking camera, or dedicate a large space to a central hot-seat, one or several listeners can enjoy a semi-‘holographic’ rendering of an acoustic recording that behaves in a logical way even as the listener turns their head. The system blends the listening room’s acoustics with the recording, meaning that there is a two-way element to the experience whereby listeners can talk and move around and remain connected with the recording in a subtle, transparent way.

Conclusion

Stereo over speakers produces a seemingly realistic three-dimensional ‘image’ that remains stable with listener movement. How this works is perhaps more subtle than is sometimes thought.

The Blumlein Pair microphone arrangement records no timing differences between left and right, but by listening over loudspeakers, the directional volume variations are converted into time-of-arrival differences at the listener’s ears. The acoustic cross-coupling from each speaker to ‘the wrong ear’ is a necessary factor in this.

Some ‘purist’ microphone techniques may not be as valid as others when it comes to stability of the image or the positioning of sources within the field. Techniques that are appropriate for headphones may not be valid for speakers, and vice versa.

 

The First Lossy Codec

(probably).

Nowadays we are used to the concept of the lossy codec that can reduce the bit rate of CD-quality audio by a factor of, say, 5 without much audible degradation. We are also accustomed to lossless compression which can halve the bit rate without any degradation at all.

But many people may not realise that they were listening to digital audio and a form of lossy compression in the 1970s and 80s!

Early BBC PCM

As described here, the BBC were experimenting with digital audio as early as the 1960s, and in the early 70s they wired up much of the UK FM transmitter network with PCM links in order to eliminate the hum, noise, distortion and frequency response errors that were inevitable with the previous analogue links.

So listeners were already hearing 13-bit audio at a sample rate of 32 kHz when they tuned into FM radio in the 1970s. I was completely unaware of this at the time, and it is ironic that many audiophiles still think that FM radio sounds good but wouldn’t touch digital audio with a bargepole.

13 bits was pretty high quality in terms of signal-to-noise ratio, and the 32 kHz sample rate gave something approaching 15 kHz audio bandwidth which, for many people’s hearing, would be more than adequate. The quality was, however, objectively inferior to that of the Compact Disc that came later.

Reducing to 10 bits

In the later 70s, in order to multiplex more stations into a lower bandwidth, the BBC wanted to compress higher-quality 14-bit audio down to 10 bits.

As you may be aware, reducing the bit depth leads to a higher level of background noise due to the reduced resolution and the mandatory addition of dither noise. For 10 bits with dither, the best that could be achieved would be a signal-to-noise ratio of 54 dB (I think I am right in saying), although the modern technique of noise-shaping the dither can reduce the audibility of the quantisation noise.
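To put rough numbers on this, here is a small sketch of my own that requantises a test tone to 14 and 10 bits with TPDF dither and measures the resulting signal-to-noise ratio. The tone level and dither scheme are my own choices for illustration, not the BBC’s.

```python
# My own illustration: requantising a tone to fewer bits with TPDF dither
# raises the noise floor by roughly 6 dB per bit removed.
import numpy as np

rng = np.random.default_rng(0)
n = 1 << 16
t = np.arange(n)
signal = 0.5 * np.sin(2 * np.pi * 997 * t / 32000)   # -6 dBFS tone, 32 kHz

def requantise(x, bits):
    """Scale to `bits` of resolution, add TPDF dither, round back."""
    q = 2.0 ** (bits - 1)
    dither = rng.uniform(-0.5, 0.5, len(x)) + rng.uniform(-0.5, 0.5, len(x))
    return np.round(x * q + dither) / q

snrs = {}
for bits in (14, 10):
    noise = requantise(signal, bits) - signal
    snrs[bits] = 10 * np.log10(np.mean(signal ** 2) / np.mean(noise ** 2))

print({b: round(s, 1) for b, s in snrs.items()})
```

The 10-bit version measures roughly 24 dB noisier than the 14-bit version (about 6 dB per discarded bit), which is the penalty the BBC needed to avoid.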

This would not have been acceptable audible quality for the BBC.

Companding Noise Reduction

Compression-expansion is a noise reduction technique that was already used with analogue tape recorders, e.g. the dbx noise reduction system. Here, the signal’s dynamic range is squashed during recording, i.e. the quiet sections are boosted in level, following a specific ‘law’. Upon replay, the inverse ‘law’ is followed in order to restore the original dynamic range. In doing so, any noise which has been added during recording is pushed down in level, reducing its audibility.

With such a system, the recorded signal itself carries the information necessary to control the expander, so compressor and expander need to track each other accurately in terms of the relationships between gain, level and time. Different time constants may be used for ‘attack’ and ‘release’ and these are a compromise between rapid noise reduction and audible side effects such as ‘pumping’ and ‘breathing’. The noise itself is being modulated in level, and this can be audible against certain signals more than others. Frequency selective pre- and de-emphasis can also help to tailor the audible quality of the result.
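For illustration, here is a toy 2:1 compander of my own in the spirit of such systems. All constants, the level detector and the square-root gain law are assumptions for the sketch, not the actual dbx design; it just shows how the compressed signal itself carries the level information the expander needs.

```python
# Toy 2:1 compander sketch (my own, not the dbx design): the compressor
# divides by the square root of a smoothed envelope; the expander
# multiplies by the envelope of what it receives, so no side-channel
# is needed -- the signal carries its own gain law.
import numpy as np

def envelope(x, coeff=0.999):
    """One-pole smoothed absolute level -- a crude level detector."""
    env = np.empty_like(x)
    e = 1e-6
    for i, v in enumerate(np.abs(x)):
        e = coeff * e + (1 - coeff) * v
        env[i] = e
    return np.maximum(env, 1e-6)

def compress(x):
    return x / np.sqrt(envelope(x))

def expand(y):
    return y * envelope(y)

t = np.arange(32000) / 32000.0
quiet = 0.01 * np.sin(2 * np.pi * 440 * t)   # a quiet passage
sent = compress(quiet)                        # boosted on the 'tape'
restored = expand(sent)                       # dynamic range restored

# Measure after the detectors have settled (skip the attack transient).
peak_sent = np.abs(sent[16000:]).max()
peak_restored = np.abs(restored[16000:]).max()
print(round(peak_sent, 4), round(peak_restored, 4))
```

The quiet passage rides well above the tape noise in transit and comes back at its original level – and, as the article notes, any real compander’s detectors must track each other closely or ‘pumping’ and ‘breathing’ result.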

The BBC investigated conventional analogue companding before they turned to the pure digital equivalent.

NICAM

The BBC called their digital equivalent of analogue companding ‘NICAM’ (Near Instantaneous Companded Audio Multiplex). It is much, much simpler, and more precise and effective than the analogue version.

It is as simple as this:

  • Sample the signal at full resolution (14 bits for the BBC);
  • Divide the digitised stream into time-based chunks (1ms was the duration they decided upon);
  • For each chunk, find the maximum absolute level within it;
  • For all samples in that chunk, do a binary shift sufficient to bring all the samples down to within the target bit depth (e.g. 10 bits);
  • Transmit the shifted samples, plus a single value indicating by how much they have been shifted;
  • At the other end, restore the full range by shifting samples in the opposite direction by the appropriate number of bits for each chunk.

Using this system, all ‘quiet chunks’, i.e. those already within the 10-bit range, are sent unchanged. Chunks containing values that exceed the 10-bit range lose their least significant bits, but this loss of resolution is masked by the louder signal level. Compared to modern lossy codecs, this method requires minimal DSP and could be performed without software using dedicated circuits based on logic gates, shift registers and memory chips.
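The steps above translate almost line-for-line into code. This is my own minimal sketch, using the chunk size and bit depths mentioned in the article (32 samples = 1 ms at 32 kHz, 14-bit source, 10-bit transmission) and synthetic samples in place of real audio.

```python
# My own sketch of the NICAM steps described above: per-chunk peak
# detection, a binary shift into the 10-bit range, and the inverse
# shift at the receiver.
import numpy as np

CHUNK = 32                    # 1 ms at 32 kHz
SRC_BITS, TX_BITS = 14, 10

def nicam_encode(samples):
    """Return (shifted chunks, per-chunk shift counts)."""
    chunks, shifts = [], []
    for i in range(0, len(samples), CHUNK):
        chunk = samples[i:i + CHUNK]
        peak = int(np.max(np.abs(chunk)))
        shift = 0
        while peak >> shift >= 1 << (TX_BITS - 1):   # fit signed 10 bits
            shift += 1
        chunks.append(chunk >> shift)   # the shift drops the LSBs
        shifts.append(shift)
    return chunks, shifts

def nicam_decode(chunks, shifts):
    """Restore full range by shifting each chunk back up."""
    return np.concatenate([c << s for c, s in zip(chunks, shifts)])

rng = np.random.default_rng(1)
audio = rng.normal(0, 2000, 320).astype(np.int64)    # fake 14-bit samples
audio = np.clip(audio, -(1 << 13), (1 << 13) - 1)

restored = nicam_decode(*nicam_encode(audio))
err = np.abs(audio - restored)
print(err.max())   # bounded by the bits shifted away in the loudest chunks
```

The error in any chunk is bounded by the number of bits shifted away, and those bits are only shifted away when the chunk is loud enough to mask the loss – exactly the masking argument made above.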

You may be surprised at how effective it is. I have written a program to demonstrate it, and in order to really emphasise how good it is, I have compressed the original signal into 8 bits, not the 10 that the BBC used.

In the following clip, a CD-quality recording has been converted as follows:

  • 0-10s is the raw full-resolution data
  • 10-20s is the sound of the signal reduced to 8 bits with dither – notice the noise!
  • 20-40s is the signal compressed NICAM-style into 8 bits and restored at the other end.

I think it is much better than we might have expected…

(I wanted to start with high quality, so I got the music extract from here:

http://www.2l.no/hires/index.html

This is the web site of a label providing extracts of their own high quality recordings in various formats for evaluation purposes. I hope they don’t mind me using one of their excellent recorded extracts as the source for my experiment).