A difference doesn’t have to be audible to matter

A common view among scientifically-oriented audiophiles is that controlled, double blind listening tests are equivalent to objective measurements. Such people may be further subdivided into those who believe that ‘preference’ is a genuine indicator of what matters, and those who believe that only ‘difference’ can count as real science in listening tests.

I can think of many, many philosophical objections to the whole notion of assuming that listening tests are ‘scientific’, but concentrating just on that supposedly rigorous idea of ‘difference’ being scientific, we might suggest the following analogy:

Suppose there is a scene in a film that shows a thousand birds wheeling over a landscape. The emotional response is to see the scene as ‘magnificent’. In this case, the ‘magnificence’ stems from the complexity; the order emerging out of what looks like chaos; the amazing spectacle of so many similar creatures in one place. It would be reasonable, perhaps, to suggest that the ‘magnificence’ is more-or-less proportional to the number of birds.

Well, suppose we wish to stream that scene over the internet in high definition. The bandwidth required to do this would be prohibitive so we feed it into a lossy compression algorithm. One of the things it does is to remove noise and grain, and it finds the birds to be quite noise-like. So it removes a few of them, or fuses a few of them together into a single ‘blob’. Would the viewer identify the difference?

I suggest not. Within such complexity, they might only be able to see it if you pointed it out to them, and even after they knew where to look they might not see it the next time. But the ‘magnificence’ would have been diminished nevertheless. By turning up the compression ratio, we might remove more and more of the birds.

This sensation of ‘magnificence’ is not something you can put into words, and it is not something you are consciously aware of. But in this case, it would be reasonable to suggest that the ‘magnificence’ was being reduced progressively. The complexity is such that the viewer wouldn’t consciously see the difference when asked to spot it, yet the emotional impact would nevertheless be diminished or altered.

For all their pretensions to scientific rigour, double blind listening tests fundamentally fail in what they purport to do. They can only access the listener’s conscious perception, while the main aim of listening to music is to affect the subconscious. Defects in audio hardware (distortion, non-flat frequency response, phase shifts, etc.) all tend to blur the separation between individual sources and in so doing reduce the complexity of what we are hearing – it becomes a flavoured paste rather than maintaining its original granularity and texture, but we cannot necessarily hear the difference consciously. Nevertheless, we can work out rationally that complexity is one of the things we respond to emotionally. So even though we cannot hear a difference, the emotional impact is affected anyway.

How Stereo Works

(Updated 03/06/18 to include results for Blumlein Pair microphone arrangement.)

[Image: initial simulation view]

A computer simulation of stereo speakers plus listener, showing the listener’s perception of the directions of three sources that have previously been ‘recorded’. The original source positions are shown overlaid with the loudspeakers.

Ever since building DSP-based active speakers and hearing real stereo imaging effectively for the first time, it has seemed to me that ordinary stereo produces a much better effect than we might expect. In fact, it has intrigued me, and it has been hard to find a truly satisfactory explanation of how and why it works so well.

My experience of stereo audio is this:

  • When sitting somewhere near the middle between two speakers and listening to a ‘purist’ stereo recording, I perceive a stable, compelling 3D space populated by the instruments and voices in different positions.
  • The scene can occasionally extend beyond the speakers (and this is certainly the case with recordings made using Q-Sound and other such processes).
  • Turning my head, the image stays plausible.
  • If I move position, the image falls apart somewhat, but when I stop moving it stabilises again into a plausible image – although not necessarily resembling what I might have expected it to be prior to moving.
  • If I move left or right, the image shifts in the direction of the speaker I am moving towards.

An article in Sound On Sound magazine may contain the most perceptive explanation I have seen:

The interaction of the signals from both speakers arriving at each ear results in the creation of a new composite signal, which is identical in wave shape but shifted in time. The time‑shift is towards the louder sound and creates a ‘fake’ time‑of‑arrival difference between the ears, so the listener interprets the information as coming from a sound source at a specific bearing somewhere within a 60‑degree angle in front.

This explanation is more elegant than the one that simply says that if the sound from one speaker is louder we will tend to hear it as if coming from that direction – I have always found it hard to believe that such a ‘blunt’ mechanism could give rise to a precise, sharp 3D image. Similarly, it is hard to believe that time-of-arrival differences on their own could somehow be relayed satisfactorily from two speakers unless the user’s head was locked into a fixed central position.

The Sound On Sound explanation says that by reproducing the sound from two spaced transducers that can reach both ears, the relative amplitude also controls the relative timing of what reaches the ears, thus giving a timing-based stereo image that, it appears, is reasonably stable with position and head rotation. This is not a psychoacoustic effect where volume difference is interpreted as a timing difference, but the literal creation of a physical timing difference from a volume difference.
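
This effect is easy to reproduce numerically. Below is a minimal sketch (my own illustration, not the magazine’s analysis or any code from this site) in which a single click is sent to the left speaker 6 dB louder than to the right, the two delayed copies are summed at each ear using invented but plausible path lengths, and the resulting inter-aural delay is read off by cross-correlation:

```python
import numpy as np

fs = 96_000                      # sample rate, Hz
c = 343.0                        # speed of sound, m/s

# Test signal: a short band-limited click (Hann-windowed 800 Hz burst).
t = np.arange(0, 0.01, 1 / fs)
burst = np.sin(2 * np.pi * 800 * t) * np.hanning(len(t))
sig = np.zeros(int(0.05 * fs))
sig[:len(burst)] = burst

def delayed(x, seconds):
    """Delay x by a (possibly fractional) number of seconds via an FFT phase shift."""
    n = len(x)
    freqs = np.fft.rfftfreq(n, 1 / fs)
    return np.fft.irfft(np.fft.rfft(x) * np.exp(-2j * np.pi * freqs * seconds), n)

# Invented symmetric geometry: each speaker is 1.00 m from the near ear and
# 1.15 m from the far ear, so the cross-coupled path is ~0.44 ms longer.
t_near, t_far = 1.00 / c, 1.15 / c

# Pan the source towards the left: left speaker 6 dB louder than the right.
g_left, g_right = 1.0, 0.5

left_ear  = g_left * delayed(sig, t_near) + g_right * delayed(sig, t_far)
right_ear = g_left * delayed(sig, t_far)  + g_right * delayed(sig, t_near)

# Cross-correlate the two ear signals to find the apparent inter-aural delay.
xcorr = np.correlate(left_ear, right_ear, mode="full")
lag_s = (np.argmax(xcorr) - (len(right_ear) - 1)) / fs
print(f"Apparent inter-aural delay: {lag_s * 1e6:+.0f} us (negative = left ear leads)")
```

With these invented figures the apparent shift comes out at roughly a couple of hundred microseconds – the same order as a genuine time-of-arrival cue – though the exact number depends entirely on the geometry and the signal.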

There must be timbral distortion because of the mixing of the two separately-delayed renditions of the same impulse at each ear, but experience seems to suggest that this is either not significant or that the brain handles it transparently, perhaps because of the way it affects both ears.

Blumlein’s Patent

Blumlein’s original 1933 patent is reproduced here. The patent discusses how time-of-arrival may take precedence over volume-based cues depending on frequency content.

It is not immediately apparent to me that what is proposed in the patent is exactly what goes on in most stereo recordings. As far as I am aware, most ‘purist’ stereo recordings don’t exaggerate the level differences between channels, but simply record the straight signal from a pair of microphones. However, the patent goes on to make a distinction between “pressure” and “velocity” microphones which, I think, corresponds to omni-directional and directional microphones. It is stated that in the case of velocity microphones no amplitude manipulation may be needed. The microphones should be placed close together but facing in different directions (often called the ‘Blumlein Pair’) as opposed to being spaced as “artificial ears”.

[Image: Blumlein-Stereo.png]

Blumlein Pair microphone arrangement

The Blumlein microphones are bi-directional i.e. they also respond to sound from the back.

Going by the SoS description, this type of arrangement would record no timing-based information (from the direct sound of the sources, at any rate), just like ‘panpot stereo’, but the speaker arrangement would convert the orientation-induced volume differences into a timing-based image through the acoustic summation, at each ear, of the two differently delayed speaker signals. This may be the brilliant step that turns a rather mundane invention (voices come from different sides of the cinema screen) into a seemingly holographic rendering of 3D space when played over loudspeakers.
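
To make the recording half of that concrete, here is a tiny sketch (my own illustration of the cosine law described in the patent, not code from it) of the channel gains a coincident Blumlein pair would capture for a source at a given bearing. The output is pure level difference; because the capsules occupy the same point, there is no time-of-arrival difference in the recording itself:

```python
import numpy as np

def blumlein_gains(source_bearing_deg):
    """Left/right channel gains for a coincident Blumlein pair (figure-of-eight
    capsules aimed at +45 and -45 degrees; output = cos(angle of incidence)).
    Bearing is measured anticlockwise from straight ahead, so +45 degrees is
    on-axis for the left capsule."""
    theta = np.radians(source_bearing_deg)
    g_left  = np.cos(theta - np.radians(+45.0))
    g_right = np.cos(theta - np.radians(-45.0))
    return g_left, g_right

for bearing in (-45, -20, 0, 20, 45):
    gl, gr = blumlein_gains(bearing)
    print(f"source at {bearing:+3d} deg:  L = {gl:+.2f},  R = {gr:+.2f}")
```

A source straight ahead lands equally in both channels; a source 45 degrees to one side lands entirely in one channel. That level-only encoding is exactly what the loudspeaker geometry then converts back into a timing difference at the ears.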

Thus the explanation becomes one of geometry plus some guesswork about the way the ears and brain correlate what they are hearing, presumably utilising both time-of-arrival and the more prosaic volume-based mechanism which says that sounds closer to one ear will be louder – an effect enhanced by the shadowing of the listener’s head. Is this sufficient to plausibly explain the brilliance of stereo audio? Does a stereo recording in any way resemble the space in which it was recorded?

A Computer Simulation

In order to help me understand what is going on I have created a computer simulation which works as follows (please skip this section unless you are interested in very technical details; a condensed code sketch of the same steps follows the list):

  • It is a floor plan view of a 2D slice through the system. Objects can be placed at any XY location, measured in metres from an origin.
  • There are no reflections; only direct sound.
  • The system comprises
    • a recording system:
      • Three acoustic sources, each of which generates an identical musical transient (loaded from a mono WAV file at CD quality). Each source is considered in isolation from the others.

      • Two microphones that can be spaced and positioned as desired. They can be omni-directional or have a directional response. In the former case, volume is attenuated with distance from the source while in the latter it is attenuated by both distance and orientation to the source.
    • a playback system:

      • Two omni-directional speakers

      • A listener with two ears and the ability to move around and turn his head.

  • The directions and distances from sources to microphones are calculated based on their relative positions, and from these the delays and attenuations of the signals at the microphones are derived. These signals are ‘recorded’.

  • During ‘playback’, the positions of the listener’s ears are calculated based on XY position of the head and its rotation.

  • The distances from speakers to each ear are calculated, and from these, the delays and attenuation thereof.

  • The composite signal from each source that reaches each ear via both speakers is calculated and from this is found:

    • relative amplitude ratio at the ears
    • relative time-of-arrival difference at the ears. This is currently obtained by correlating one ear’s summed signal for that source (from both speakers) against the other and looking for the delay corresponding to peak output of this. (There may be methods more representative of the way human hearing ascertains time-of-arrival, and this might be part of a future experiment).

  • There is currently no attempt to simulate HRTF or the attenuating effect of ‘head shadow’. Attenuation is purely based on distance to each ear.

  • The system then simulates the signals that would arrive at each ear from a virtual acoustic source were the listener hearing it live rather than via the speakers.

    • This virtual source is swept through the XY space in fine increments and at each position the ‘real’ relative timings and volume ratio that would be experienced by the listener are calculated.

    • The results are compared to the results previously found for each of the three sources as recorded and played back over the speakers, and plotted as colour and brightness in order to indicate the position the listener might perceive the recorded sources as emanating from, and the strength of the similarity.

  • The listener’s location and rotation can be incremented and decremented in order to animate the display, showing how the system changes dynamically with head rotation or position.
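
As promised above, here is a heavily condensed sketch of those steps. It is my reconstruction for illustration only: the positions, test signal and 1/r attenuation law are invented, and reflections, head shadow and HRTF are omitted, just as in the simulation described.

```python
import numpy as np

FS = 96_000          # sample rate, Hz
C = 343.0            # speed of sound, m/s

def propagate(signal, src, rcv):
    """Delay and attenuate `signal` travelling from point `src` to point `rcv` (1/r law)."""
    r = float(np.linalg.norm(np.asarray(rcv) - np.asarray(src)))
    n = len(signal)
    freqs = np.fft.rfftfreq(n, 1 / FS)
    shifted = np.fft.irfft(np.fft.rfft(signal) * np.exp(-2j * np.pi * freqs * r / C), n)
    return shifted / max(r, 0.1)

def itd_ild(left, right):
    """Inter-aural time difference (s) and left/right RMS ratio for two ear signals."""
    xcorr = np.correlate(left, right, mode="full")
    itd = (np.argmax(xcorr) - (len(right) - 1)) / FS
    ild = np.sqrt(np.sum(left ** 2) / np.sum(right ** 2))
    return itd, ild

# --- invented geometry (metres) ----------------------------------------------
mics     = [(-0.15, 0.0), (0.15, 0.0)]    # spaced omni pair, 0.3 m apart
source   = (0.8, 2.0)                     # one acoustic source, off to one side
speakers = [(-1.2, 2.0), (1.2, 2.0)]      # playback: stereo pair
ears     = [(-0.09, 0.0), (0.09, 0.0)]    # listener roughly on the centre line

# --- test signal: short windowed click ---------------------------------------
t = np.arange(0, 0.005, 1 / FS)
click = np.sin(2 * np.pi * 1000 * t) * np.hanning(len(t))
sig = np.zeros(int(0.06 * FS))
sig[:len(click)] = click

# --- "recording": source -> each microphone (omni, distance attenuation only) -
rec = [propagate(sig, source, m) for m in mics]

# --- "playback": each recorded channel -> each ear via its speaker ------------
ear_signals = []
for ear in ears:
    ear_signals.append(sum(propagate(rec[ch], speakers[ch], ear) for ch in range(2)))

itd_repro, ild_repro = itd_ild(*ear_signals)

# --- reference: the same source heard "live" from the listening position ------
live = [propagate(sig, source, ear) for ear in ears]
itd_live, ild_live = itd_ild(*live)

print(f"reproduced: ITD = {itd_repro * 1e6:+6.0f} us, L/R level ratio = {ild_repro:.2f}")
print(f"live:       ITD = {itd_live * 1e6:+6.0f} us, L/R level ratio = {ild_live:.2f}")
```

The same propagate-and-correlate core, swept over a grid of trial source positions, is all that is needed to build the colour maps shown below.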

The results are very interesting!

Here are some images from the system, plus some small animations.

Spaced omni-directional microphones

In these images, the (virtual) signal was picked up by a pair of (virtual) omni-directional microphones on either side of the origin, spaced 0.3m apart. This is neither a binaural recording (which would at least have the microphones a little closer together) nor the Blumlein Pair arrangement, but it does seem to be representative of some types of purist stereo recording.

The positions of the three sources during (virtual) recording are shown overlaid with the two speakers, plus the listener’s head and ears. Red indicates response to SRC0; green SRC1; and blue SRC2.

[Animation: head_rotation]

Effect of head rotation on perceived direction of sources based on inter-aural timing when listener is close to the ‘sweet spot’.

[Animation: side_to_side]

Effect of side-to-side movement of listener on perceived imaging based on inter-aural timing.

[Animation: compound_movement]

Compound movement of listener, including front-to-back movement and head rotation.

[Animation: amplitude]

Effect of listener movement on perceived image based on inter-aural amplitudes.

Coincident directional microphones (Blumlein Pair)

Here, directional microphones are set at the origin at right angles to each other, as shown in the earlier diagram. They copy Blumlein’s description in the patent, i.e. output is proportional to the cosine of the angle of incidence.

[Animation: blumlein_timing]

Time-of-arrival based perception of direction as captured by a coincident pair of directional microphones (Blumlein Pair) and played back over stereo speakers, with compound movement of the listener.

[Animation: blumlein_amplitude]

A similar test, but showing perceived locations of the three sources based on inter-aural volume level.

In no particular order, some observations on the results:

  • A stereo image based on time-of-arrival differences at the ears can be created with two spaced omni-directional microphones or coincident directional microphones. Note, the aim is not to ‘track’ the image with the user’s head movement (like headphones would), but to maintain stable positions in space even as the user turns away from ‘the stage’.
  • The Blumlein Pair gives a stable image with listener movement based on time-of-arrival. The image based on inter-aural amplitude may not be as stable, however.
  • Interaural timing can only give a direction, not distance.

  • A phantom mirror image of equal magnitude also accompanies the frontwards time-of-arrival-derived direction, but this would also be true of ‘real life’. The way this behaves with dynamic head movement isn’t necessarily correct; at some locations and listener orientations maybe the listener could be confused by this.

  • Relative volume at the two ears (as a ratio) gives a ‘blunt’ image that behaves differently from the time-of-arrival based image when the listener moves or turns their head. The plot shows that the same ratio can be achieved for different combinations of distance and angle so on its own it is not unambiguous.

  • Even if the time-of-arrival image stays meaningful with listener movement, the amplitude-based image may not.

  • Combined with timing, relative interaural volume might provide some cues for distance (not necessarily the ‘true’ distance).

  • No doubt other cues combining indirect ‘ambient’ reflections in the recording, comb-filtering, dynamic phase shifts with head movement, head-related transfer function, etc. are also used by the listener and these all contribute to the perception of depth.

  • The cues may not all ‘hang together’, particularly in the situation of movement of the listener, but the human brain seems to make reasonable sense of them once the movement stops.

  • The Blumlein Pair does, indeed, create a time-of-arrival-based image from amplitude variations only. And this image is stable with movement of the listener – a truly remarkable result, I think.
  • Choice of microphone arrangement may influence the sound and stability of the image.
  • Maybe there is also an issue regarding the validity of different recording techniques when played back over headphones versus speakers. The Blumlein Pair gives no time-of-arrival cues when played over headphones.
  • The audio scene is generally limited to the region between the two speakers.
  • The simulation does not address ‘panpot’ stereo yet, although as noted earlier, the Blumlein microphone technique is doing something very similar.
  • In fact, over loudspeakers, the ‘panpot’ may actually be the most correct way of artificially placing a source in the stereo field, yielding a stable, time-of-arrival-based position.

Perhaps the thing that I find most exciting is that the animations really do seem to reflect what happens when I listen to certain recordings on a stereo system and shift position while concentrating on what I am hearing. I think that the directions of individual sources do indeed sometimes ‘flip’ or become ambiguous, and sometimes you need to ‘lock on’ to the image after moving, and from then on it seems stable and you can’t imagine it sounding any other way. Time-of-arrival and volume-based cues (which may be in conflict in certain listening positions), as well as the ‘mirror image’ time-of-arrival cue may be contributing to this confusion. These factors may differ with signal content e.g. the frequency ranges it covers.

It has occurred to me that in creating this simulation I might have been in danger of shattering my illusions about stereo, spoiling the experience forever, but in the end I think my enthusiasm remains intact. What looked like a defect with loudspeakers (the acoustic cross-coupling between channels) turns out to be the reason why it works so compellingly.

In an earlier post I suggested that maybe plain stereo from speakers was the optimal way to enjoy audio and I think I am more firmly persuaded of that now. Without having to wear special apparatus, have one’s ears moulded, make sure one’s face is visible to a tracking camera, or dedicate a large space to a central hot-seat, one or several listeners can enjoy a semi-‘holographic’ rendering of an acoustic recording that behaves in a logical way even as the listener turns their head. The system blends the listening room’s acoustics with the recording, meaning that there is a two-way element to the experience whereby listeners can talk and move around and remain connected with the recording in a subtle, transparent way.

Conclusion

Stereo over speakers produces a seemingly realistic three-dimensional ‘image’ that remains stable with listener movement. How this works is perhaps more subtle than is sometimes thought.

The Blumlein Pair microphone arrangement records no timing differences between left and right, but by listening over loudspeakers, the directional volume variations are converted into time-of-arrival differences at the listener’s ears. The acoustic cross-coupling from each speaker to ‘the wrong ear’ is a necessary factor in this.

Some ‘purist’ microphone techniques may not be as valid as others when it comes to stability of the image or the positioning of sources within the field. Techniques that are appropriate for headphones may not be valid for speakers, and vice versa.

 

Reverberation of a point source, compared with a ‘distributed’ loudspeaker

Here’s a fascinating speaker:

CBT36

It uses many transducers arranged in a specific curve, driven in parallel and with ‘shading’ i.e. graduated volume settings along the curve, to reduce vertical dispersion but maintain wide dispersion in the horizontal. I can see how this might appear quite appealing for use in a non-ideal room with low ceilings or whatever.

It is a variation on the phased array concept, where the outputs of many transducers combine to produce a directional beam. It is effectively relying on differing path lengths from the different transducers producing phase cancellation or reinforcement in the air at different angles as you move off axis. All the individual wavefronts sum correctly at the listener’s ear to reproduce the signal accurately.

At a smaller scale, a single transducer of finite size can be thought of as many small transducers being driven simultaneously. At high frequencies (as the wavelengths being reproduced become short compared to the diameter of the transducer) differing path lengths from various parts of the transducer combine in the air to cause phase cancellation as you move off axis. This is known as beaming and is usually controlled in speaker design by using drivers of the appropriate size for the frequencies they are reproducing. Changes in directivity with frequency are regarded as undesirable in speaker design, because although the on-axis measurements can be perfect, the ‘room sound’ (reverberation) has the ‘wrong’ frequency response.
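
The phasor arithmetic behind both the array ‘shading’ and ordinary beaming is compact enough to show directly. The sketch below is my own illustration – the element count, spacing and Hann taper are invented, not taken from the CBT design – and it simply sums the contributions of a straight line of drivers at a given off-axis angle:

```python
import numpy as np

C = 343.0   # speed of sound, m/s

def array_response_db(freq_hz, theta_deg, n_elems=16, spacing_m=0.05, shading=None):
    """Relative far-field level (dB) of a straight line array at angle theta off axis."""
    theta = np.radians(theta_deg)
    k = 2 * np.pi * freq_hz / C                                   # wavenumber
    positions = (np.arange(n_elems) - (n_elems - 1) / 2) * spacing_m
    weights = np.ones(n_elems) if shading is None else np.asarray(shading)
    # Each element contributes a phasor whose phase is set by its extra path length,
    # which at angle theta is position * sin(theta).
    phasors = weights * np.exp(1j * k * positions * np.sin(theta))
    return 20 * np.log10(np.abs(phasors.sum()) / weights.sum())

hann = np.hanning(16)   # purely illustrative amplitude "shading"
for angle in (0, 15, 30, 45, 60):
    flat   = array_response_db(2000, angle)
    shaded = array_response_db(2000, angle, shading=hann)
    print(f"{angle:2d} deg off axis @ 2 kHz: unshaded {flat:6.1f} dB, Hann-shaded {shaded:6.1f} dB")
```

The unshaded sum shows the deep, frequency-dependent off-axis ripple that is heard as beaming; the shaded sum trades a slightly wider main lobe for much smoother off-axis decay, which is essentially the trade that shading exploits.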

A large panel speaker suffers from beaming in the extreme, but with Quad electrostatics Peter Walker introduced a clever trick, where phase is shifted selectively using concentric circular electrodes as you move outwards from the centre of the panel. At the listener’s position, this simulates the effect of a point source emanating from some distance behind the panel, increasing the size of the ‘sweet spot’ and effectively reducing the high frequency beaming.

There are other ways of harnessing the power of phase cancellation and summation. Dipole speakers’ lower frequencies cancel out at the sides (and top and bottom) as the antiphase rear pressure waves meet those from the front. This is supposed to be useful acoustically, cutting down on unwanted reflections from floor, walls and ceiling. A dipole speaker may be realised by mounting a single driver on a panel of wood with a hole in it, but it behaves effectively as two transducers, one of which is in anti-phase to the other. Some people say they prefer the sound of such speakers over conventional box speakers.

This all works well in terms of the direct sound reaching the listener and, as in the CBT speaker above, may provide a very uniform dispersion with frequency compared to conventional speakers. But beyond the measurements of the direct sound, does the reverberation sound quite ‘right’? What if the overall level of reverberation doesn’t approximate the ‘liveness’ of the room that the listeners notice as they talk or shuffle their feet? If the vertical reflections are reduced but not the horizontal, does this sound unnatural?

Characterising a room from its sound

The interaction of a room and an acoustic source could be thought of as a collection of simultaneous equations – acoustics can be modelled and simulated for computer games, and it is possible for a computer to do the reverse and work out the size and shape of the room from the sound.  If the acoustic source is, in fact, multiple sources separated by certain distances, the computer can work that out, too.
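
I won’t attempt the full inverse problem here, but as a small, hedged illustration of recovering even one property of a room purely from its sound, the sketch below estimates reverberation time from a measured impulse response using Schroeder’s backward-integration method (the function is mine; the impulse response would have to come from an actual measurement):

```python
import numpy as np

def rt60_from_impulse_response(ir, fs, fit_range_db=(-5.0, -25.0)):
    """Estimate RT60 (seconds) from a room impulse response via Schroeder integration."""
    energy = np.asarray(ir, dtype=float) ** 2
    # Schroeder curve: energy remaining after each instant, in dB re. the total.
    remaining = np.cumsum(energy[::-1])[::-1]
    decay_db = 10 * np.log10(remaining / remaining[0])
    hi, lo = fit_range_db
    idx = np.where((decay_db <= hi) & (decay_db >= lo))[0]
    times = idx / fs
    slope, _ = np.polyfit(times, decay_db[idx], 1)   # decay rate in dB per second
    return -60.0 / slope                             # extrapolate the fit to -60 dB
```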

Does the human hearing system do something similar? I would say “probably”. A human can work quite a lot out about a room from just its sound – you would certainly know whether you were in an anechoic chamber, a normal room or a cathedral. Even in a strange environment, a human rarely mistakes the direction and distance from which sound is coming. Head movements may play a part.

And this is where listening to a ‘distributed speaker’ in a room becomes a bit strange.

Stereo speakers can be regarded as a ‘distributed speaker’ when playing a centrally-placed sound. This is unavoidable – if we are using stereo as our system. Beyond that, what is the effect of spreading each speaker itself out, or deliberately creating phased ‘beams’ of sound?

Even though the combination of direct sounds adds up to the familiar sound at the listener’s position as though emanating from its original source, there is information within the reflections that is telling the listener that the acoustic source is really a radically different shape. Reverberation levels and directions may be ‘asymmetric’ with the apparent direct sound.

In effect, the direct sound says we are listening to this:

[Image: Zoe Wanamaker as Cassandra]

but the reverberation says it is something different.

[Image: Zoe Wanamaker as Cassandra]

Might there be audible side effects from this? In the case of the dipole speaker, for example, the rear (antiphase) signal reflects off the back wall and some of it does make its way forwards to the listener. In my experience, this comes through as a certain ‘phasiness’ but it doesn’t seem to bother other people.

From a normal listening distance, most musical sources are small and appear close to being a ‘point source’. If we are going to add some more reverberation, should it not appear to be emanating as much as possible from a point source?

It is easy to say that reverberation is so complex that it is just a wash of ‘ambience’ and nothing more; all we need to do is give it the right ‘colour’ i.e. frequency response. And one of the reasons for using a ‘distributed speaker’ may be to reduce the amount of reverberation anyway. But I don’t think we should overdo it: we surely want to listen in real rooms because of the reverberation, not despite it. What is the most side-effect-free way to introduce this reverberation?

Clearly, some rooms are not ideal and offer too much of the wrong sort of reverberation. Maybe a ‘distributed speaker’ offers a solution, but is it as good as a conventional speaker in a suitable room? And is it really necessary, anyway? I think some people may be misguidedly attempting to achieve ‘perfect’ measurements by, effectively, eliminating the room from the sound even though their room is perfectly fine. How many people are intrigued by the CBT speaker above simply because it offers ‘better’ conventional in-room measurements, regardless of whether it is necessary?

Conclusion

‘Distributed speakers’ that use large, or multiple, transducers may achieve what they set out to do superficially, but are they free of side-effects?

I don’t have scientific proof, but I remain convinced that the ‘Rolls Royce’ of listening remains ‘point source’ monopole speakers in a large, carpeted, furnished room with a high ceiling. Box speakers with multiple drivers of different sizes are small and can be regarded as being very close to a single transducer, but are not so omnidirectional that they create too much reverberation. The acoustic ‘throw’ they produce is fairly ‘natural’. In other words, for stereo perfection, I think there is still a good chance that the types of rooms and speakers people were listening to in the 1970s remain optimal.

[Last edited 17.30 BST 09/05/17]

The Logic of Listening Tests

Casual readers may not believe this, but in the world of audiophilia there are people who enjoy organising scientific listening tests – or more aptly ‘trials’. These involve assembling panels of human ‘subjects’ to listen to snippets of music played through different setups in double blind tests, pressing buttons or filling in forms to indicate audible differences and preferences. The motivation is often to use science to debunk the ideas of a rival group, who may be known as ‘subjectivists’ or ‘objectivists’, or to confirm the ideas of one’s own group.

There are many, many inherent reasons why such listening tests may not be valid e.g.

  • no one can demonstrate that the knowledge that you are taking part in an experiment doesn’t impede your ability to hear differences
  • a participant who has his own agenda may choose to ‘lie’ in order to pretend he is not hearing differences when he, in fact, is.
  • etc. etc.

The tests are difficult and tedious for the participants, and no one who holds the opposing viewpoint will be convinced by the results. At a logical level, they are dubious. So why bother to do the tests? I think it is an ‘appeal to a higher authority’ to arbitrate an argument that cannot be solved any other way. ‘Science’ is that higher authority.

But let’s look at just the logic.

We are told that there are two basic types of listening test:

  1. Determining or identifying audible difference
  2. Determining ‘preference’

Presumably the idea is that (1) suggests whether two or more devices or processes are equivalent, or whether their insertion into the audio chain is audibly transparent. If a difference is identified, then (2) can make the information useful and tell us which permutation sounds best to a human. Perhaps there is a notion that in the best case scenario a £100 DAC is found to sound identical to a £100,000 DAC, or that if they do sound different, the £100 DAC is preferred by listeners. Or vice versa.

But would anything actually have been gained by a listening test over simple measurements? A DAC has a very specific, well-defined job to do – we are not talking about observing the natural world and trying to work out what is going on. With today’s technology, it is trivial to make a DAC that is accurate to very close objective tolerances for £100 – it is not necessary to listen to it to know whether it works.

For two DACs to actually sound different, they must be measurably quite far apart. At least one of them is not even close to being a DAC: it is, in fact, an effects box of some kind. And such are the fundamental uncertainties in all experiments that involve asking humans how they feel, it is entirely possible that in a preference-based listening test, the listeners are found to prefer the sound of the effects box.

Or not. It depends on myriad unstable factors. An effects box that adds some harmonic distortion may make certain recordings sound ‘louder’ or ‘more exciting’ thus eliciting a preference for it today – with those specific recordings. But the experiment cannot show that the listeners wouldn’t be bored with the effect three hours, days or months down the line. Or that they wouldn’t hate it if it happened to be raining. Or if the walls were painted yellow, not blue. You get the idea: it is nothing but aesthetic judgement, the classic condition where science becomes pseudoscience no matter how ‘scientific’ the methodology.

The results may be fed into statistical formulae and the handle cranked, allowing the experimenter to declare “statistical significance”, but this is just the usual misunderstanding of statistics, which are only valid under very specific mathematical conditions. If your experiment is built on invalid assumptions, the statistics mean nothing.

If we think it is acceptable for a ‘DAC’ to impose its own “effects” on the sound, where do we stop? Home theatre amps often have buttons labelled ‘Super Stereo’ or ‘Concert Hall’. Before we go declaring that the £100,000 DAC’s ‘effect’ is worth the money, shouldn’t we also verify that our experiment doesn’t show that ‘Super Stereo’ is even better? Or that a £10 DAC off Amazon isn’t even better than that? This is the open-ended illogicality of preference-based listening tests.

If the device is supposed to be a “DAC”, it can do no more than meet the objective definition of a DAC to a tolerably close degree. How do we know what “tolerably close” is? Well, if we were to simulate the known, objective, measured error, and amplify it by a factor of a hundred, and still fail to be able to hear it at normal listening levels in a quiet room, I think we would have our answer. This is the one listening test that I think would be useful.
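
For what it’s worth, such a test is easy to set up in software. The sketch below is my own rough interpretation of it, with invented file names: the ideal samples are subtracted from a gain-matched capture of the DAC’s output and the residual error is boosted by a factor of one hundred for auditioning. (A real test would also need sample-accurate time alignment between the two files, which is glossed over here.)

```python
import numpy as np
import soundfile as sf   # assumes the 'soundfile' package is available

ideal, fs = sf.read("ideal_source.wav")              # samples sent to the DAC (hypothetical file)
captured, fs2 = sf.read("dac_output_capture.wav")    # what the DAC produced (hypothetical file)
assert fs == fs2

n = min(len(ideal), len(captured))
ideal, captured = ideal[:n], captured[:n]

# Gain-match the capture to the ideal signal with a least-squares scale factor.
scale = np.dot(captured.ravel(), ideal.ravel()) / np.dot(captured.ravel(), captured.ravel())
error = ideal - scale * captured

# Amplify the residual by a factor of 100 (+40 dB) and save it for listening.
boosted = np.clip(100.0 * error, -1.0, 1.0)
sf.write("dac_error_x100.wav", boosted, fs)
print(f"Residual RMS before boost: {np.sqrt(np.mean(error ** 2)):.2e} of full scale")
```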

The Secret Science of Pop

[Image: secret-science-of-pop]

In The Secret Science of Pop, evolutionary biologist Professor Armand Leroi tells us that he sees pop music as a direct analogy for natural selection. And he salivates at the prospect of a huge, complete, historical data set that can be analysed in order to test his theories.

He starts off by bringing in experts in data analysis from some prestigious universities, and has them crunch the numbers on the past 50 years of chart music, analysing the audio data for numerous characteristics including “rhythmic intensity” and “aggressiveness”. He plots a line on a giant computer monitor showing the rate of musical change based on an aggregate of these values. The line shows that the 60s were a time of revolution – although he claims that the Beatles were pretty average and “sat out” the revolution. Disco, and to a lesser extent punk, made the 70s a time of revolution, but the 80s were not.

He is convinced that he is going to be able to use his findings to influence the production of new pop music. The results are not encouraging: no matter how he formulates his data he finds he cannot predict a song’s chart success with much better than random accuracy. The best correlation seems to be that a song’s closeness to a particular period’s “average” predicts high chart success. It is, he says, “statistically significant”.

Armed with this insight he takes on the role of producer and attempts to make a song (a ballad) being recorded at Trevor Horn’s studio as average as possible by, amongst other things, adjusting its tempo and adding some rap. It doesn’t really work, and when he measures the results with his computer, he finds that he has manoeuvred the song away from average with this manual intervention.

He then shifts his attention to trying to find the stars of tomorrow by picking out the most average song from 1200 tracks that have been sent into BBC Radio 1 Introducing. The computer picks out a particular band who seem to have a very danceable track, and in the world’s least scientific experiment ever, he demonstrates that a BBC Radio 1 producer thinks it’s OK, too.

His final conclusion: “We failed spectacularly this time, but I am sure the answer is somewhere in the data if we can just find it”.

My immediate thoughts on this programme:

-An entertaining, interesting programme.

-The rule still holds: science is not valid in the field of aesthetic judgement.

-If your system cannot predict the future stars of the past, it is very unlikely to be able to predict the stars of the future.

-The choice of which aspects of songs to measure is purely subjective, based on the scientist’s own assumptions about what humans like about music. The chances of the scientist not tweaking the algorithms in order to reflect their own intuitions are very remote. To claim that “The computer picked the song with no human intervention” is stretching it! (This applies to any ‘science’ whose main output is based on computer modelling).

-The lure of data is irresistible to scientists but, as anyone who has ever experimented with anything but the simplest, most controlled, pattern recognition will tell you, there is always too much, and at the same time never enough, training data. It slowly dawns on you that although theoretically there may be multidimensional functions that really could spot what you are looking for, you are never going to present the training data in such a way that you find a function with 100%, or at least ‘human’ levels of, reliability.

-Add to that the myriad paradoxes of human consciousness, and of humans modifying their tastes temporarily in response to novelty and fashion – even to the data itself (the charts) – and the reality is that it is a wild goose chase.

(very relevant to a post from a few months ago)

Image is Everything

I have a couple of audiophile friends for whom ‘imaging’ is very much a secondary hi-fi goal, but I wonder if this is because they’ve never really heard it from their audio systems.

What do we mean by the term anyway? My definition would be the (illusion of) precise placement of acoustic sources in three dimensions in front of the listener – including the acoustics of the recording venue(s). It isn’t a fragile effect that only appears at one infinitesimal position in space or collapses at the merest turn of the head, either.

It is something that I am finding is trivially easy for DSP-based active speakers. Why? Well I think that it just falls out naturally from accurate matching between the channels and phase & time-corrected drivers. Logically, good imaging will only occur when everything in a system is working more-or-less correctly.

I can imagine all kinds of mismatches and errors that might occur with passive crossovers, exacerbated by the compromises that are forced on the designer such as having to use fewer drivers than ideal, or running the drivers outside their ideal frequency ranges.

Imaging is affected by the speaker’s interaction with the room, of course. The ultimate imaging accuracy may occur when we eliminate the room’s contribution completely, and sit in a very tight ‘sweet spot’, but this is not the most practical or pleasant listening situation. The room’s contribution may also enhance an illusion of a palpable image, so it is not desirable to eliminate it completely. Ultimately, we are striking a balance between direct sound and ambient reflections through speaker directivity and positioning relative to walls.

A real audiophile scientist would no doubt be interested in how exactly stereo imaging works, and whether listening tests could be devised to show the relative contributions of poor damping, phase errors, Doppler distortion, timing misalignment etc. Maybe we could design a better passive speaker as a result. But I would say: why bother? The DSP active version is objectively more correct, and now that we have finally progressed to such technology and can actually listen to it, it clearly doesn’t need to do anything but reproduce left and right correctly – no need for any other tricks or the forlorn hope of some accidental magic from natural, organic, passive technology.

An ‘excuse’ for poor imaging is that in many real musical situations, imaging is not nearly as sharp as can be obtained from a good audio system. This is true: if you go to a classical concert and consciously listen for where a solo brass instrument (for example) is coming from, you often can’t really tell. I presume this is because you are generally seated far from the stage with a lot of people in the way and much ‘ambience’ thrown in. I presume that the conductor is hearing much stronger ‘imaging’ than we are – and many recordings are made with the mics much closer than a typical person sitting in the auditorium; the sharper imaging in the recording may well be largely artificial.

However, to cite this as a reason for deliberately blurring the image in some arbitrary way is surely a red herring. The image heard by the audience member is still ‘coherent’ even if it is not sharp. And the ‘artificially imaged’ recording contains extra information that is allowing us to separate the various acoustic sources by a different mechanism than the one that might allow us to tease out the various sources in a mono recording, say. It reduces effort and vastly increases the clarity of the audio ‘scene’.

I think that good imaging due to superior time alignment and phase is going to be much more important than going to the Nth degree to obtain ultra-low harmonic distortion.

If we mess up the coherence between the channels we are getting the worst of all worlds: something that arbitrarily munges the various acoustic sources and their surroundings in response to signal content. An observation that is sometimes made is that the music “sticks to the speakers” rather than appearing in between. What are our brains to make of it? It must increase the effort of listening and blur the detail of what we are hearing.

Not only this, but good imaging is compelling. Solid voices and instruments that float in mid air grab the attention. The listener immediately understands that there is a lot more information trapped in a stereo recording than they ever knew.

Neural Adaptation

Just an interesting snippet regarding a characteristic of human hearing (and all our senses). It is called neural adaptation.

Neural adaptation or sensory adaptation is a change over time in the responsiveness of the sensory system to a constant stimulus. It is usually experienced as a change in the stimulus. For example, if one rests one’s hand on a table, one immediately feels the table’s surface on one’s skin. Within a few seconds, however, one ceases to feel the table’s surface. The sensory neurons stimulated by the table’s surface respond immediately, but then respond less and less until they may not respond at all; this is an example of neural adaptation. Neural adaptation is also thought to happen at a more central level such as the cortex.

Fast and slow adaptation
One has to distinguish fast adaptation from slow adaptation. Fast adaptation occurs immediately after stimulus presentation, i.e. within hundreds of milliseconds. Slow adaptive processes take minutes, hours or even days. The two classes of neural adaptation may rely on very different physiological mechanisms.

Auditory adaptation, as perceptual adaptation with other senses, is the process by which individuals adapt to sounds and noises. As research has shown, as time progresses, individuals tend to adapt to sounds and tend to distinguish them less frequently after a while. Sensory adaptation tends to blend sounds into one, variable sound, rather than having several separate sounds as a series. Moreover, after repeated perception, individuals tend to adapt to sounds to the point where they no longer consciously perceive it, or rather, “block it out”.

What this says to me is that perceived sound characteristics are variable depending on how long the person has been listening, and to what sequence of ‘stimuli’. Our senses, to some extent, are change detectors rather than ‘direct-coupled’.

Something of a conundrum for listening-based audio equipment testing? Our hearing begins to change the moment we start listening. It becomes desensitised with repeated exposure to a sound – and repeated exposure is one of the cornerstones of many types of listening-based testing.

The Machine Learning delusion

This morning my personal biological computer detected a correlation between these two articles:

Sony’s SenseMe™ – A Superior Smart Shuffle

Machine learning: why we mustn’t be slaves to the algorithm

In the first article, the author is praising a “smart shuffle” algorithm that sequences tracks in your music collection with various themes such as “energetic, relax, upbeat”. It does this by analysing the music’s mood and tempo. It sounds amazing:

“I would never think of playing Steve Earl’s “Loretta” right after listening to the Boulder Philharmonic’s performance of “Olvidala,” or Ry Cooder’s “Crazy About an Automobile” followed by Doc and Merle Watson playing “Take Me Out to the Ballgame,” but I enjoyed not only the selections themselves but the way SensMe™ juxtaposes one after another, like a DJ who knows your collection better than you do…what will “he” play next? Surprise! It’s all good.”

And the algorithm’s effects go beyond mere music:

“SenseMe™ has brought domestic harmony – interesting selections for me and music with a similar mood for her. That’s better than marriage counseling! “

The author of the second article takes a more sceptical view. He notes the dumbness of Machine Learning™ algorithms, but says that

“…because these outputs are computer-generated, they are currently regarded with awe and amazement by bemused citizens …”

He quotes someone who is aware of the limitations:

“Machine learning is like a deep-fat fryer. If you’ve never deep-fried something before, you think to yourself: ‘This is amazing! I bet this would work on anything!’ And it kind of does. In our case, the deep fryer is a toolbox of statistical techniques. The names keep changing – it used to be unsupervised learning, now it’s called big data or deep learning or AI. Next year it will be called something else. But the core ideas don’t change. You train a computer on lots of data, and it learns to recognise structure.”

“But,” continues Cegłowski, “the fact that the same generic approach works across a wide range of domains should make you suspicious about how much insight it’s adding.”

I have been there. Machine learning is one of the most seductive branches of computer science, and in my experience is a very “easy sell” to people – I use it in my job in actual engineering applications where it can be eerily effective.

But if algorithms are so clever and know us so well, why are we using them only to shuffle the order of music? Why not cut out the middleman and get the computer to compose the music for us directly? The answer is obvious: it doesn’t work because we don’t know how the human brain works, and it is not predictable. By extension, the algorithms that purport to help us in matters of taste don’t actually work either. As the Guardian article says, all we are responding to is the novelty of the idea.

Auditory Scene Analysis

There is a field of study called Auditory Scene Analysis (ASA) that postulates that humans interpret “scenes” using sound just as they do using vision. I am not sure that it necessarily has any particular bearing on the way that audio hardware should be designed: basically the scene is all the clearer if the reproduction of the audio is clean in terms of noise, channel separation, distortion, frequency response and (seemingly controversial to hi-fi folk) the time domain.

However, the seminal work in this field includes the following analogy for hearing:

Your friend digs two narrow channels up from the side of a lake. Each is a few feet long and a few inches wide and they are spaced a few feet apart. Halfway up each one, your friend stretches a handkerchief and fastens it to the sides of the channel. As the waves reach the side of the lake they travel up the channels and cause the two handkerchiefs to go into motion. You are allowed to look only at the handkerchiefs and from their motions to answer a series of questions: How many boats are there on the lake and where are they? Which is the most powerful one? Which one is closer? Is the wind blowing? Has any large object been dropped suddenly into the lake?

Of course, when we listen to reproduced music with an audio system we are, in effect, duplicating the motion of the handkerchiefs using two paddles in another lake (our listening room) and watching the motion of a new pair of handkerchiefs. Amazingly, it works! But the key to this is that the two lakes are well-defined linear systems. Our brains can ‘work back’ to the original sounds using a process akin to ‘blind deconvolution’.

If we want to, we can eliminate the ‘second lake’ by using headphones, or we can almost eliminate it by using an anechoic chamber. We could theoretically eliminate it at a single point in space by deconvolving the reproduced signal with the measured impulse response of the room at that point. Listening with headphones works OK, but listening to speakers in a dead acoustic sounds terrible – probably to do with ‘head related transfer function’ (HRTF) telling us that we are listening to a ‘real’ acoustic but with an absence of the expected acoustic cues when we move our heads. By adding the ‘second lake’ we create enough ‘real acoustic’ to overcome that.
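
For completeness, the ‘eliminate it at a single point’ idea is essentially inverse filtering, and a hedged sketch of it is below (variable names and the regularisation constant are mine): divide out the room’s measured transfer function where it is strong and suppress it where it is weak. It is only valid at the exact point where the impulse response was measured.

```python
import numpy as np

def deconvolve(received, room_ir, eps=1e-3):
    """Estimate the source signal from the signal received at one point, given the
    room impulse response measured at that same point (regularised inverse filter)."""
    n = len(received) + len(room_ir) - 1
    R = np.fft.rfft(received, n)
    H = np.fft.rfft(room_ir, n)
    # Divide out the room where its response is strong; suppress where it is weak.
    estimate = np.fft.irfft(R * np.conj(H) / (np.abs(H) ** 2 + eps), n)
    return estimate[:len(received)]
```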

But here is why ‘room correction’ is flawed. The logical conclusion of room correction is to simulate headphones, but this cannot be achieved – and is not what most listeners want anyway, even if they don’t know it. Instead, an incomplete ‘correction’ is implemented based on the idea of trying to make the motion of the two sets of ‘handkerchiefs’ closer to each other than they (in naive measurements) appear to be. If the idea of the brain ‘working back’ to the original sound is correct, it will ‘work back’ to a seemingly arbitrarily modified recording. Modifying the physical acoustics of the room is valid whereas modifying the signal is not.

I think the problem stems ultimately from an engineering tool (frequency domain measurement) proliferating due to cheap computing power. There is a huge difference in levels of understanding between the author of the ASA book and the audiophiles and manufacturers who think that the sound is improved by tweaking graphic equalisers in an attempt to compensate for delays that the brain has compensated for already.

The Man in the White Suit

[Image: man-in-the-white-suit]

There’s a brilliant film from the 1950s called The Man in the White Suit. It’s a satire on capitalism, the power of the unions, and the story of how the two sides find themselves working together to oppose a new invention that threatens to make several industries redundant.

I wonder if there’s a tenuous resemblance between the film’s new wonder-fabric and the invention of digital audio? I hesitate to say that it’s exactly the same, because someone will point out that in the end, the wonder-fabric isn’t all it seems and falls apart, but I think they do have these similarities:

  1. The new invention is, for all practical purposes, ‘perfect’, and is immediately superior to everything that has gone before.
  2. It is cheap – very cheap – and can be mass-produced in large quantities.
  3. It has the properties of infinite lifespan, zero maintenance and non-obsolescence.
  4. It threatens the profits not only of the industry that invented it, but other related industries.

In the film it all turns a bit dark, with mobs on the streets and violence imminent. Only the invention’s catastrophic failure saves the day.

In the smaller worlds of audio and music, things are a little different. Digital audio shows no signs of failing, and it has taken quite a few years for someone to finally come up with a comprehensive, feasible strategy for monopolising the invention while also shutting the Pandora’s box that was opened when it was initially released without restrictions.

The new strategy is this:

  1. Spread rumours that the original invention was flawed
  2. Re-package the invention as something brand new, with a vagueness that allows people to believe whatever they want about it
  3. Deviate from the rigid mathematical conditions of the original invention, opening up possibilities for future innovations in filtering and “de-blurring”. The audiophile imagination is a potent force, so this may not be the last time you can persuade them to re-purchase their record collections, after all.
  4. Offer to protect the other, affected industries – for a fee
  5. Appear to maintain compatibility with the original invention – for now – while substituting a more inconvenient version with inferior quality for unlicensed users
  6. Through positive enticements, nudge users into voluntarily phasing out the original invention over several years.
  7. Introduce stronger protection once the window has been closed.

It’s a very clever strategy, I think. Point (2) is the master stroke.