Audio – Literature Analogy

An audio recording is a bit like a book: created through artistic or intellectual endeavour, then ‘fixed’ as a collection of pure information and distributed to customers for them to ‘consume’ in their own environments. In the case of digital audio, a recording is literally the same as a book, being stored as numbers in a file; you could store a book as a WAV, or an audio recording as a MSWORD file if you wanted.

In rendering the content to be read, there are things you could do to detract from the content of a book:

  • printed too big/too small
  • lighting too dim/too bright
  • inappropriate use of colour
  • blotchy printout
  • typeface varies with content, or randomly
  • corrupted: missing/duplicated/erroneous characters
  • peculiar paper
  • non-neutral typeface – difficult to read or inappropriate e.g. science fiction font for a Jane Austen novel
  • in the case of some ‘boutique’ printing, an appropriate analogy might be a book that spontaneously becomes too hot to touch, or occasionally ruins valuable furniture.

The emotional or intellectual force of the book would actually be reduced because of these problems. In other words, it is not true to say that the quality of reproduction doesn’t matter.

However, there is a finite envelope of neutral, even ‘mundane’, reproduction which achieves an optimal result for the reader – after reading the book they can’t tell you anything about the quality of the printing; all they remember is the content, and the content was thrilling.

Maybe the author specifies the typeface. Some books may include fine illustrations or intricate frontispieces which are intrinsic to the book. In these cases, the reproduction needs to be particularly accurate in order to do justice to what the author has created.

Beyond this, is there anything that the printer can do to enhance the appeal of the book? Well, they can create a fancy binding that the reader notices before they start reading; they can use particularly high quality paper; they can print the characters with micron precision. But only a book collector or printing technology enthusiast would care about these refinements – they have no effect on the actual experience of reading the content, and could easily detract from it.

The manufacturers of the ink and the mains cable that powers the printing press could read lots of books in their spare time, attend evening classes in English Literature, study the physiology of the eye, get diplomas in grammar, and tell us in interviews with speciality magazines about how it all informs their craft. But clearly the results would do nothing whatsoever to change the reading experience.

The printer might decide to dabble in science for the first time since they left printing college. They could do scientific trials in aspects of book reproduction where lucky participants get to read snippets of text or passages from ‘typical’ books, responding with their perceptions of differences, preferences, or even ‘emotional stimulation level’ in aspects such as:

  • typeface
  • ink
  • reading light
  • paper texture and weight
  • reading room shape/dimensions/finishes

But the results would be rather obvious and predictable, with anything slightly interesting being clearly the result of fashion, novelty and human fickleness rather than being a universal law.

The only way to actually enhance the book would be to change its content. An algorithm that replaces certain words? Re-writes sections to make them longer or shorter? Clearly in the case of literature, such a thing would be meaningless and idiotic. It is not so different in the case of audio. There is nothing but the recording: there is no technology, effect or algorithm that can meaningfully enhance it.


Domestic hi-fi is no more than the equivalent of rendering the printed content of a book: it can be done adequately or badly, and beyond that there is no meaningful way of improving on it. People become deluded by the idea that the rendering technology can enhance the content – which is obviously ridiculous in the case of books, but less obvious with audio.

But this is not to say that hi-fi is, in itself, boring: achieving ‘adequate’ is not trivial.

Many people are simply not used to hearing adequate reproduction regardless of how much money they spend, so they are not aware that the experience vs. quality graph has a horizontal flat top. And needless to say, the audiophile quality vs. cost graph is more-or-less random, which makes it even more confusing.

The audio enthusiast would be much happier and richer if they got a sense of proportion of what matters, then put all their creativity (and money if they’ve got nothing else to spend it on) into building the equivalent of a pleasant reading room, comfy chair and attractive bookcases rather than a solid gold and diamond reading light.

[Last edited  30/05/17]

Reverberation of a point source, compared with a ‘distributed’ loudspeaker

Here’s a fascinating speaker:

CBT36 Manufacturer of loudspeakers that focus on elimination of box resonances.

It uses many transducers arranged in a specific curve, driven in parallel and with ‘shading’ i.e. graduated volume settings along the curve, to reduce vertical dispersion but maintain wide dispersion in the horizontal. I can see how this might appear quite appealing for use in a non-ideal room with low ceilings or whatever.

It is a variation on the phased array concept, where the outputs of many transducers combine to produce a directional beam. It is effectively relying on differing path lengths from the different transducers producing phase cancellation or reinforcement in the air at different angles as you move off axis. All the individual wavefronts sum correctly at the listener’s ear to reproduce the signal accurately.

At a smaller scale, a single transducer of finite size can be thought of as many small transducers being driven simultaneously. At high frequencies (as the wavelengths being reproduced become short compared to the diameter of the transducer) differing path lengths from various parts of the transducer combine in the air to cause phase cancellation as you move off axis. This is known as beaming and is usually controlled in speaker design by using drivers of the appropriate size for the frequencies they are reproducing. Changes in directivity with frequency are regarded as undesirable in speaker design, because although the on-axis measurements can be perfect, the ‘room sound’ (reverberation) has the ‘wrong’ frequency response.

A large panel speaker suffers from beaming in the extreme, but with Quad electrostatics Peter Walker introduced a clever trick, where phase is shifted selectively using concentric circular electrodes as you move outwards from the centre of the panel. At the listener’s position, this simulates the effect of a point source emanating from some distance behind the panel, increasing the size of the ‘sweet spot’ and effectively reducing the high frequency beaming.

There are other ways of harnessing the power of phase cancellation and summation. Dipole speakers’ lower frequencies cancel out at the sides (and top and bottom) as the antiphase rear pressure waves meet those from the front. This is supposed to be useful acoustically, cutting down on unwanted reflections from floor, walls and ceiling. A dipole speaker may be realised by mounting a single driver on a panel of wood with a hole in it, but it behaves effectively as two transducers, one of which is in anti-phase to the other. Some people say they prefer the sound of such speakers over conventional box speakers.

This all works well in terms of the direct sound reaching the listener and, as in the CBT speaker above, may provide a very uniform dispersion with frequency compared to conventional speakers. But beyond the measurements of the direct sound, does the reverberation sound quite ‘right’? What if the overall level of reverberation doesn’t approximate the ‘liveness’ of the room that the listeners notice as they talk or shuffle their feet? If the vertical reflections are reduced but not the horizontal, does this sound unnatural?

Characterising a room from its sound

The interaction of a room and an acoustic source could be thought of as a collection of simultaneous equations – acoustics can be modelled and simulated for computer games, and it is possible for a computer to do the reverse and work out the size and shape of the room from the sound.  If the acoustic source is, in fact, multiple sources separated by certain distances, the computer can work that out, too.

Does the human hearing system do something similar? I would say “probably”. A human can work quite a lot out about a room from just its sound – you would certainly know whether you were in an anechoic chamber, a normal room or a cathedral. Even in a strange environment, a human rarely mistakes the direction and distance from which sound is coming. Head movements may play a part.

And this is where listening to a ‘distributed speaker’ in a room becomes a bit strange.

Stereo speakers can be regarded as a ‘distributed speaker’ when playing a centrally-placed sound. This is unavoidable – if we are using stereo as our system. Beyond that, what is the effect of spreading each speaker itself out, or deliberately creating phased ‘beams’ of sound?

Even though the combination of direct sounds adds up to the familiar sound at the listener’s position as though emanating from its original source, there is information within the reflections that is telling the listener that the acoustic source is really a radically different shape. Reverberation levels and directions may be ‘asymmetric’ with the apparent direct sound.

In effect, the direct sound says we are listening to this:

Image result for zoe wanamaker cassandra

but the reverberation says it is something different.

Image result for zoe wanamaker cassandra

Might there be audible side effects from this? In the case of the dipole speaker, for example, the rear (antiphase) signal reflects off the back wall and some of it does make its way forwards to the listener. In my experience, this comes through as a certain ‘phasiness’ but it doesn’t seem to bother other people.

From a normal listening distance, most musical sources are small and appear close to being a ‘point source’. If we are going to add some more reverberation, should it not appear to be emanating as much as possible from a point source?

It is easy to say that reverberation is so complex that it is just a wash of ‘ambience’ and nothing more; all we need to do is give it the right ‘colour’ i.e. frequency response. And one of the reasons for using a ‘distributed speaker’ may be to reduce the amount of reverberation anyway. But I don’t think we should overdo it: we surely want to listen in real rooms because of the reverberation, not despite it. What is the most side effect-free way to introduce this reverberation?

Clearly, some rooms are not ideal and offer too much of the wrong sort of reverberation. Maybe a ‘distributed speaker’ offers a solution, but is it as good as a conventional speaker in a suitable room? And is it really necessary, anyway? I think some people may be misguidedly attempting to achieve ‘perfect’ measurements by, effectively, eliminating the room from the sound even though their room is perfectly fine. How many people are intrigued by the CBT speaker above simply because it offers ‘better’ conventional in-room measurements, regardless of whether it is necessary?


‘Distributed speakers’ that use large, or multiple, transducers may achieve what they set out to do superficially, but are they free of side-effects?

I don’t have scientific proof, but I remain convinced that the ‘Rolls Royce’ of listening remains ‘point source’ monopole speakers in a large, carpeted, furnished room with a high ceiling. Box speakers with multiple drivers of different sizes are small and can be regarded as being very close to a single transducer, but are not so omnidirectional that they create too much reverberation. The acoustic ‘throw’ they produce is fairly ‘natural’. In other words, for stereo perfection, I think there is still a good chance that the types of rooms and speakers people were listening to in the 1970s remain optimal.

[Last edited 17.30 BST 09/05/17]

The Logic of Listening Tests

Casual readers may not believe this, but in the world of audiophilia there are people who enjoy organising scientific listening tests – or more aptly ‘trials’. These involve assembling panels of human ‘subjects’ to listen to snippets of music played through different setups in double blind tests, pressing buttons or filling in forms to indicate audible differences and preferences. The motivation is often to use science to debunk the ideas of a rival group, who may be known as ‘subjectivists’ or ‘objectivists’, or to confirm the ideas of one’s own group.

There are many, many inherent reasons why such listening tests may not be valid e.g.

  • no one can demonstrate that the knowledge you are taking part in an experiment doesn’t impede your ability to hear differences
  • a participant who has his own agenda may choose to ‘lie’ in order to pretend he is not hearing differences when he, in fact, is.
  • etc. etc.

The tests are difficult and tedious for the participants, and no one who holds the opposing viewpoint will be convinced by the results. At a logical level, they are dubious. So why bother to do the tests? I think it is an ‘appeal to a higher authority’ to arbitrate an argument that cannot be solved any other way. ‘Science’ is that higher authority.

But let’s look at just the logic.

We are told that there are two basic types of listening test:

  1. Determining or identifying audible difference
  2. Determining ‘preference’

Presumably the idea is that (1) suggests whether two or more devices or processes are equivalent, or whether their insertion into the audio chain is audibly transparent. If a difference is identified, then (2) can make the information useful and tell us which permutation sounds best to a human. Perhaps there is a notion that in the best case scenario a £100 DAC is found to sound identical to a £100,000 DAC, or that if they do sound different, the £100 DAC is preferred by listeners. Or vice versa.

But would anything actually have been gained by a listening test over simple measurements? A DAC has a very specific, well-defined job to do – we are not talking about observing the natural world and trying to work out what is going on. With today’s technology, it is trivial to make a DAC that is accurate to very close objective tolerances for £100 – it is not necessary to listen to it to know whether it works.

For two DACs to actually sound different, they must be measurably quite far apart. At least one of them is not even close to being a DAC: it is, in fact, an effects box of some kind. And such are the fundamental uncertainties in all experiments involving the asking of humans how they feel, it is entirely possible that in a preference-based listening test, the listeners are found to prefer the sound of the effects box.

Or not. It depends on myriad unstable factors. An effects box that adds some harmonic distortion may make certain recordings sound ‘louder’ or ‘more exciting’ thus eliciting a preference for it today – with those specific recordings. But the experiment cannot show that the listeners wouldn’t be bored with the effect three hours, days or months down the line. Or that they wouldn’t hate it if it happened to be raining. Or if the walls were painted yellow, not blue. You get the idea: it is nothing but aesthetic judgement, the classic condition where science becomes pseudoscience no matter how ‘scientific’ the methodology.

The results may be fed into statistical formulae and the handle cranked, allowing the experimenter to declare “statistical significance”, but this is just the usual misunderstanding of statistics, which are only valid under very specific mathematical conditions. If your experiment is built on invalid assumptions, the statistics mean nothing.

If we think it is acceptable for a ‘DAC’ to impose its own “effects” on the sound, where do we stop? Home theatre amps often have buttons labelled ‘Super Stereo’ or ‘Concert Hall’. Before we go declaring that the £100,000 DAC’s ‘effect’ is worth the money, shouldn’t we also verify that our experiment doesn’t show that ‘Super Stereo’ is even better? Or that a £10 DAC off Amazon isn’t even better than that? This is the open-ended illogicality of preference-based listening tests.

If the device is supposed to be a “DAC”, it can do no more than meet the objective definition of a DAC to a tolerably close degree. How do we know what “tolerably close” is? Well, if we were to simulate the known, objective, measured error, and amplify it by a factor of a hundred, and still fail to be able to hear it at normal listening levels in a quiet room, I think we would have our answer. This is the one listening test that I think would be useful.

The Secret Science of Pop


In The Secret Science of Pop, evolutionary biologist Professor Armand Leroi tells us that he sees pop music as a direct analogy for natural selection. And he salivates at the prospect of a huge, complete, historical data set that can be analysed in order to test his theories.

He starts off by bringing in experts in data analysis from some prestigious universities, and has them crunch the numbers on the past 50 years of chart music, analysing the audio data for numerous characteristics including “rhythmic intensity” and “agressiveness”. He plots a line on a giant computer monitor showing the rate of musical change based on an aggregate of these values. The line shows that the 60s were a time of revolution – although he claims that the Beatles were pretty average and “sat out” the revolution. Disco, and to a lesser extent punk, made the 70s a time of revolution but the 80s were not.

He is convinced that he is going to be able to use his findings to influence the production of new pop music. The results are not encouraging: no matter how he formulates his data he finds he cannot predict a song’s chart success with much better than random accuracy. The best correlation seems to be that a song’s closeness to a particular period’s “average” predicts high chart success. It is, he says, “statistically significant”.

Armed with this insight he takes on the role of producer and attempts to make a song (a ballad) being recorded at Trevor Horn’s studio as average as possible by, amongst other things, adjusting its tempo and adding some rap. It doesn’t really work, and when he measures the results with his computer, he finds that he has manoeuvred the song away from average with this manual intervention.

He then shifts his attention to trying to find the stars of tomorrow by picking out the most average song from 1200 tracks that have been sent into BBC Radio 1 Introducing. The computer picks out a particular band who seem to have a very danceable track, and in the world’s least scientific experiment ever, he demonstrates that a BBC Radio 1 producer thinks it’s OK, too.

His final conclusion: “We failed spectacularly this time, but I am sure the answer is somewhere in the data if we can just find it”.

My immediate thoughts on this programme:

-An entertaining, interesting programme.

-The rule still holds: science is not valid in the field of aesthetic judgement.

-If your system cannot predict the future stars of the past, it is very unlikely to be able to predict the stars of the future.

-The choice of which aspects of songs to measure is purely subjective, based on the scientist’s own assumptions about what humans like about music. The chances of the scientist not tweaking the algorithms in order to reflect their own intuitions are very remote. To claim that “The computer picked the song with no human intervention” is stretching it! (This applies to any ‘science’ whose main output is based on computer modelling).

-The lure of data is irresistible to scientists but, as anyone who has ever experimented with anything but the simplest, most controlled, pattern recognition will tell you, there is always too much, and at the same time never enough, training data. It slowly dawns on you that although theoretically there may be multidimensional functions that really could spot what you are looking for, you are never going to present the training data in such a way that you find a function with 100%, or at least ‘human’ levels of, reliability.

-Add to that the myriad paradoxes of human consciousness, and of humans modifying their tastes temporarily in response to novelty and fashion – even to the data itself (the charts) – and the reality is that it is a wild goose chase.

(very relevant to a post from a few months ago)

Image is Everything

I have a couple of audiophile friends for whom ‘imaging’ is very much a secondary hi-fi goal, but I wonder if this is because they’ve never really heard it from their audio systems.

What do we mean by the term anyway? My definition would be the (illusion of) precise placement of acoustic sources in three dimensions in front of the listener – including the acoustics of the recording venue(s). It isn’t a fragile effect that only appears at one infinitesimal position in space or collapses at the merest turn of the head, either.

It is something that I am finding is trivially easy for DSP-based active speakers. Why? Well I think that it just falls out naturally from accurate matching between the channels and phase & time-corrected drivers. Logically, good imaging will only occur when everything in a system is working more-or-less correctly.

I can imagine all kinds of mismatches and errors that might occur with passive crossovers, exacerbated by the compromises that are forced on the designer such as having to use fewer drivers than ideal, or running the drivers outside their ideal frequency ranges.

Imaging is affected by the speaker’s interaction with the room, of course. The ultimate imaging accuracy may occur when we eliminate the room’s contribution completely, and sit in a very tight ‘sweet spot’, but this is not the most practical or pleasant listening situation. The room’s contribution may also enhance an illusion of a palpable image, so it is not desirable to eliminate it completely. Ultimately, we are striking a balance between direct sound and ambient reflections through speaker directivity and positioning relative to walls.

A real audiophile scientist would no doubt be interested in how exactly stereo imaging works, and whether listening tests could be devised to show the relative contributions of poor damping, phase errors, Doppler distortion, timing misalignment etc. Maybe we could design a better passive speaker as a result. But I would say: why bother? The DSP active version is objectively more correct, and now that we have finally progressed to such technology and can actually listen to it, it clearly doesn’t need to do anything but reproduce left and right correctly – no need for any other tricks or the forlorn hope of some accidental magic from natural, organic, passive technology.

An ‘excuse’ for poor imaging is that in many real musical situations, imaging is not nearly as sharp as can be obtained from a good audio system. This is true: if you go to a classical concert and consciously listen for where a solo brass instrument (for example) is coming from, you often can’t really tell. I presume this is because you are generally seated far from the stage with a lot of people in the way and much ‘ambience’ thrown in. I presume that the conductor is hearing much stronger ‘imaging’ than we are – and many recordings are made with the mics much closer than a typical person sitting in the auditorium; the sharper imaging in the recording may well be largely artificial.

However, to cite this as a reason for deliberately blurring the image in some arbitrary way is surely a red herring. The image heard by the audience member is still ‘coherent’ even if it is not sharp. And the ‘artificially imaged’ recording contains extra information that is allowing us to separate the various acoustic sources by a different mechanism than the one that might allow us to tease out the various sources in a mono recording, say. It reduces effort and vastly increases the clarity of the audio ‘scene’.

I think that good imaging due to superior time alignment and phase is going to be much more important than going to the Nth degree to obtain ultra-low low harmonic distortion.

If we mess up the coherence between the channels we are getting the worst of all worlds: something that arbitrarily munges the various acoustic sources and their surroundings in response to signal content. An observation that is sometimes made is that the music “sticks to the speakers” rather than appearing in between. What are our brains to make of it? It must increase the effort of listening and blur the detail of what we are hearing.

Not only this, but good imaging is compelling. Solid voices and instruments that float in mid air grab the attention. The listener immediately understands that there is a lot more information trapped in a stereo recording than they ever knew.

Neural Adaptation

Just an interesting snippet regarding a characteristic of human hearing (and all our senses). It is called neural adaptation.

Neural adaptation or sensory adaptation is a change over time in the responsiveness of the sensory system to a constant stimulus. It is usually experienced as a change in the stimulus. For example, if one rests one’s hand on a table, one immediately feels the table’s surface on one’s skin. Within a few seconds, however, one ceases to feel the table’s surface. The sensory neurons stimulated by the table’s surface respond immediately, but then respond less and less until they may not respond at all; this is an example of neural adaptation. Neural adaptation is also thought to happen at a more central level such as the cortex.

Fast and slow adaptation
One has to distinguish fast adaptation from slow adaptation. Fast adaptation occurs immediately after stimulus presentation i.e., within 100s of milliseconds. Slow adaptive processes that take minutes, hours or even days. The two classes of neural adaptation may rely on very different physiological mechanisms.

Auditory adaptation, as perceptual adaptation with other senses, is the process by which individuals adapt to sounds and noises. As research has shown, as time progresses, individuals tend to adapt to sounds and tend to distinguish them less frequently after a while. Sensory adaptation tends to blend sounds into one, variable sound, rather than having several separate sounds as a series. Moreover, after repeated perception, individuals tend to adapt to sounds to the point where they no longer consciously perceive it, or rather, “block it out”.

What this says to me is that perceived sound characteristics are variable depending on how long the person has been listening, and to what sequence of ‘stimulii’. Our senses, to some extent, are change detectors not ‘direct coupled’.

Something of a conundrum for listening-based audio equipment testing..? Our hearing begins to change the moment we start listening. It becomes desensitised to repeated exposure to a sound – one of the cornerstones of many types of listening-based testing.

The Machine Learning delusion

This morning my personal biological computer detected a correlation between these two articles:

Sony’s SenseMe™ – A Superior Smart Shuffle

Machine learning: why we mustn’t be slaves to the algorithm

In the first article, the author is praising a “smart shuffle” algorithm that sequences tracks in your music collection with various themes such as “energetic, relax, upbeat”. It does this by analysing the music’s mood and tempo. It sounds amazing:

“I would never think of playing Steve Earl’s “Loretta” right after listening to the Boulder Philharmonic’s performance of “Olvidala,” or Ry Cooder’s “Crazy About an Automobile” followed by Doc and Merle Watson playing “Take Me Out to the Ballgame,” but I enjoyed not only the selections themselves but the way SensMe™ juxtaposes one after another, like a DJ who knows your collection better than you do…what will “he” play next? Surprise! It’s all good.”

And the algorithm’s effects go beyond mere music:

“SenseMe™ has brought domestic harmony – interesting selections for me and music with a similar mood for her. That’s better than marriage counseling! “

The author of the second article takes a more sceptical view. He notes the dumbness of Machine LearningTM algorithms, but says that

“…because these outputs are computer-generated, they are currently regarded with awe and amazement by bemused citizens …”

He quotes someone who is aware of the limitations:

“Machine learning is like a deep-fat fryer. If you’ve never deep-fried something before, you think to yourself: ‘This is amazing! I bet this would work on anything!’ And it kind of does. In our case, the deep fryer is a toolbox of statistical techniques. The names keep changing – it used to be unsupervised learning, now it’s called big data or deep learning or AI. Next year it will be called something else. But the core ideas don’t change. You train a computer on lots of data, and it learns to recognise structure.”

“But,” continues Cegłowski, “the fact that the same generic approach works across a wide range of domains should make you suspicious about how much insight it’s adding.”

I have been there. Machine learning is one of the most seductive branches of computer science, and in my experience is a very “easy sell” to people – I use it in my job in actual engineering applications where it can be eerily effective.

But if algorithms are so clever and know us so well, why are we using them only to shuffle the order of music? Why not cut out the middleman and get the computer to compose the music for us directly? The answer is obvious: it doesn’t work because we don’t know how the human brain works, and it is not predictable. By extension, the algorithms that purport to help us in matters of taste don’t actually work either. As the Guardian article says, all we are responding to is the novelty of the idea.

Auditory Scene Analysis

There is a field of study called Auditory Scene Analysis (ASA) that postulates that humans interpret “scenes” using sound just as they do using vision. I am not sure that it necessarily has any particular bearing on the way that audio hardware should be designed: basically the scene is all the clearer if the reproduction of the audio is clean in terms of noise, channel separation, distortion, frequency response and (seemingly controversial to hi-fi folk) the time domain.

However, the seminal work in this field includes the following analogy for hearing:

Your friend digs two narrow channels up from the side of a lake. Each is a few feet long and a few inches wide and they are spaced a few feet apart. Halfway up each one, your friend stretches a handkerchief and fastens it to the sides of the channel. As the waves reach the side of the lake they travel up the channels and cause the two handkerchiefs to go into motion. You are allowed to look only at the handkerchiefs and from their motions to answer a series of questions: How many boats are there on the lake and where are they? Which is the most powerful one? Which one is closer? Is the wind blowing? Has any large object been dropped suddenly into the lake?

Of course, when we listen to reproduced music with an audio system we are, in effect, duplicating the motion of the handkerchiefs using two paddles in another lake (our listening room) and watching the motion of a new pair of handkerchiefs. Amazingly, it works! But the key to this is that the two lakes are well-defined linear systems. Our brains can ‘work back’ to the original sounds using a process akin to ‘blind deconvolution’.

If we want to, we can eliminate the ‘second lake’ by using headphones, or we can almost eliminate it by using an anechoic chamber. We could theoretically eliminate it at a single point in space by deconvolving the reproduced signal with the measured impulse response of the room at that point. Listening with headphones works OK, but listening to speakers in a dead acoustic sounds terrible – probably to do with ‘head related transfer function’ (HRTF) telling us that we are listening to a ‘real’ acoustic but with an absence of the expected acoustic cues when we move our heads. By adding the ‘second lake’ we create enough ‘real acoustic’ to overcome that.

But here is why ‘room correction’ is flawed. The logical conclusion of room correction is to simulate headphones, but this cannot be achieved – and is not what most listeners want anyway, even if they don’t know it. Instead, an incomplete ‘correction’ is implemented based on the idea of trying to make the motion of the two sets of ‘handkerchiefs’ closer to each other than they (in naive measurements) appear to be. If the idea of the brain ‘working back’ to the original sound is correct, it will ‘work back’ to a seemingly arbitrarily modified recording. Modifying the physical acoustics of the room is valid whereas modifying the signal is not.

I think the problem stems ultimately from an engineering tool (frequency domain measurement) proliferating due to cheap computing power. There is a huge difference in levels of understanding between the author of the ASA book and the audiophiles and manufacturers who think that the sound is improved by tweaking graphic equalisers in an attempt to compensate for delays that the brain has compensated for already.

The Man in the White Suit


There’s a brilliant film from the 1950s called The Man in the White Suit. It’s a satire on capitalism, the power of the unions, and the story of how the two sides find themselves working together to oppose a new invention that threatens to make several industries redundant.

I wonder if there’s a tenuous resemblance between the film’s new wonder-fabric and the invention of digital audio? I hesitate to say that it’s exactly the same, because someone will point out that in the end, the wonder-fabric isn’t all it seems and falls apart, but I think they do have these similarities:

  1. The new invention is, for all practical purposes, ‘perfect’, and is immediately superior to everything that has gone before.
  2. It is cheap – very cheap – and can be mass-produced in large quantities.
  3. It has the properties of infinite lifespan, zero maintenance and non-obsolescence
  4. It threatens the profits not only of the industry that invented it, but other related industries.

In the film it all turns a bit dark, with mobs on the streets and violence imminent. Only the invention’s catastrophic failure saves the day.

In the smaller worlds of audio and music, things are a little different. Digital audio shows no signs of failing, and it has taken quite a few years for someone to finally come up with a comprehensive, feasible strategy for monopolising the invention while also shutting the Pandora’s box that was opened when it was initially released without restrictions.

The new strategy is this:

  1. Spread rumours that the original invention was flawed
  2. Re-package the invention as something brand new, with a vagueness that allows people to believe whatever they want about it
  3. Deviate from the rigid mathematical conditions of the original invention, opening up possibilities for future innovations in filtering and “de-blurring”. The audiophile imagination is a potent force, so this may not be the last time you can persuade them to re-purchase their record collections, after all.
  4. Offer to protect the other, affected industries – for a fee
  5. Appear to maintain compatibility with the original invention – for now – while substituting a more inconvenient version with inferior quality for unlicensed users
  6. Through positive enticements, nudge users into voluntarily phasing out the original invention over several years.
  7. Introduce stronger protection once the window has been closed.

It’s a very clever strategy, I think. Point (2) is the master stroke.

The pickle that listening-based ‘science’ gets us into

fmri-salmonJust expanding on an earlier post: some thoughts on ‘audio science’ and its observation that human perception of sound is often influenced by our imagination. Blind testing doesn’t eliminate our imagination, merely prevents it from biasing the result of the test. We can still imagine anything we like when switching between A and B – and under such conditions, the imagination is likely to flourish. In amongst these high levels of imaginary ‘noise’, audio ‘scientists’ think that the magical powers of statistical formulae can enable them to discern audible differences that the test subjects didn’t even know they had heard. Or they can confidently state that no difference was heard. Such confidence in the validity of their statistics brings to mind a study of the brain of a dead fish that, with the dumb application of dumb formulae, could be interpreted as responding to images flashed up in front of its eyes.

Outside the laboratory, there is an awkward shift when the people who espouse ‘audio science’ want to sell us their products, or even to buy something themselves. It is their implicit position that any demonstration of the product in a showroom or the customer’s own home, is a sham. Customers – including themselves – are malleable creatures that imagine what they are persuaded to hear. Even the ‘science’ that has been used in the equipment’s creation and is promoted in advertising (or, indeed is advertising), feeds back into how people perceive the sound. The audio scientist/objectivist is in a completely paradoxical position where they cannot even know whether they actually like something! They must acknowledge that they only think they like something on that particular day in that particular showroom, or in their own workshop as they tweak their crossover design. They could conduct their own blind listening tests to establish their preference scientifically, but how many of these would they have to run in order to cover every permutation of the variables when they change a setting? Much more than a lifetime’s-worth. And as discussed before, there is no way to tell whether taking part in a listening trial affects our ability to discern differences, anyway.

The only way out of this impasse while maintaining the listening trial dogma is to argue that statistics from blind listening tests carried out by others can tell us what to like on the basis of sheer scale, and the probability that our hearing preferences are the same as everyone else’s. But do we then allow just anyone off the street to tell us what is best, or do we use “trained” listeners? The former would seem just silly, but the latter leaves the whole scheme open to accusations of incestuousness and circularity. In a deft move, a claim is made that trained listeners still register the same average preferences that ordinary people would over thousands of tests, but that they do it more clearly and decisively. Attempting to determine whether this is in fact the case would be such a circular absurdity that people can only accept it on faith. This is one of audio science’s self-deluding sleights of hand: being as rigorous as anyone could like in the execution of the actual trials, but basing the premises of the experiment and the conclusions to be drawn from it on the flimsiest of hand-waving. Of course, as this is science, anyone can challenge those conclusions or conduct their own experiments, but this just replaces one flimsy assertion with another.

Why is this all so difficult and farcical? Well, I think it is because science has no meaning in the world of ‘art’ so our troubles start there. Technically, perhaps it can be argued that we are only attempting to use science to create hardware for reproducing art – which doesn’t sound too difficult. But when we say “reproducing” do we mean “most accurate”, or “most preferred”? The fact that anyone would go to the lengths of using listening trials is a giveaway that they are not sure whether they know (a) how to determine “most accurate” objectively, (b) whether listeners actually want accuracy in the context of listening to music in their own homes. and (c) whether recordings created while monitoring on existing speakers should be reproduced accurately anyway.  

At this point, the entire enterprise is doomed to circularity and farce. The human trial participant is subjected to reproduced ‘art’ (but not the original!) and either by directly registering preferences, or indirectly by registering differences, is assumed to be capable of determining the ‘best’ method of reproducing that art. ‘Art’ is the thing that no one can define – the thing that is supposed to affect us emotionally in ways that cannot be predicted. Using it as the stimulus to gauge human reaction to the hardware is not obviously compatible with science is it?

In contrast, it is perfectly rational to admit that scientific experiments cannot tell us the best way to reproduce art. It is perfectly rational to simply work out on paper a likely way of doing it, then build it and listen to it. We will never know scientifically whether we actually like it, because this is beyond the remit of science. But this doesn’t stop us from enjoying it, anyway. In a normal setting we are not entirely slave to our imaginations – we can make a fair assessment of when something is obviously good or bad.

Rather than the (pseudo)scientific blind listening test, I think there is a much more fruitful test. It is the ultimate ‘sighted’ test, that suppresses imaginary differences, and is only possible because of DSP – which can be used to simulate the characteristics of real world hardware in many ways. The test is this: While listening, be allowed to change whatever parameter you like using DSP and hear the result instantaneously. Change one variable and flick backwards and forwards between two values while listening. Or change several variables simultaneously if you like. Close your eyes while pressing the supplied ‘random’ button and see if you were right. Such a test would condense a lifetime’s worth of exhaustive listening trials into a few minutes or hours of ‘fun’ that is much more representative of normal listening than the dreary alternative. (For example, with my own system I can make instantaneous radical changes to the crossovers that other people can only achieve in a much more limited way with huge effort and long intervals of silent soldering in between). It isn’t science. It won’t tell you definitively what you prefer, or what you are sensitive to in normal listening, but it will certainly put into context the scale of the changes you have to make in order to hear a ‘night and day’ difference. It allows an instantaneous comparison between various types of technology that could never be achieved otherwise. It could help lay to rest a few audio demons.