Audio Objects

Some audio pessimists are convinced that because a stereo recording and reproduction system can only sample a couple of infinitesimal points within the overall ‘sound field’, it is futile to imagine that the result can be anything but a pale imitation of the real thing.

Others are convinced that although the efforts of recording engineers mean that the recording itself is passable, the problem is that speakers playing in a real room are not conveying it to their ears accurately enough. They attempt to alter what comes out of the speakers in order to compensate for the room.

And stereo itself when reproduced over speakers is assumed to be so flawed due to crosstalk to the ‘wrong’ ear that it can’t possibly work, and we must be deluding ourselves if we think it does.

These are assumptions made by people who cannot allow themselves to enjoy their audio systems. I suggest they are fixated on the wrong things and the situation is much better than they imagine. A different way to view the problem of audio is this:

It is a mistake to think that the aim of the system is to recreate the precise waveform that would have reached the listener’s ear at the actual performance. That is not practically achievable, would not necessarily reproduce a realistic perception of the performance in the context of the listener’s own room anyway, and is not necessary. Most people couldn’t even tell you which of two plausible versions of the waveform is absolutely correct, because they’re not hearing a waveform; they’re hearing musical and acoustic ‘objects’. It is the relationship between those objects that is paramount.

An ‘object’ could be:

  • A voice
  • A choir
  • Silence
  • A sad note
  • A happy chord
  • Song lyrics
  • A violin
  • A rhythm
  • An orchestra
  • A concert hall
  • Tension

The primary aim of a hi-fi system (as opposed to a kitchen radio, for example) is to maintain the integrity of single objects and the separation of different objects.

The secondary aim of the hi-fi system is to present the objects in a plausible way that allows for the normal behaviour of the listener: the sound basically appearing to emanate from in front of the listener, separable by distance and direction, without strange acoustic sensations if they turn to talk to their neighbour.

And that’s it. Everything flows from there.

  • Harmonic distortion (and the corresponding intermodulation distortion) smears objects together.
  • Bumps and dips in the frequency and phase response of a speaker smear objects together and punch holes in the integrity of the objects.
  • Noise smears itself over all the objects, obscuring their separation.
  • Limited bass damages the integrity of certain objects and smears those objects together.
  • Timing errors smear objects together. Resonators in speakers (e.g. bass reflex) that take time to ‘get going’ and time to ‘stop’ damage objects and smear them together.
  • Stereo obviously aids in separating objects. Just a pair of speakers provides a continuous spread of individual, separate acoustic sources. And stereo over speakers isn’t flawed; the crosstalk to the ‘wrong ear’ is how it produces the image in the first place.
  • Realistic volume helps to elevate objects above the noise floor, with a more natural sound due to our hearing’s volume-dependent frequency sensitivity.

So some objects make it out of a kitchen radio OK: a rhythm, a melody or the words of a song. But other objects may be severely damaged or smeared together. On a hi-fi system you might hear two separate guitars but on the radio they’re just a wash over the whole sound. On the hi-fi you hear a startling, deep bass note, but on the radio there’s nothing.

And the hi-fi system does things ‘without trying’ – which is why some people can’t believe it’s doing them. The stereo system with speakers automatically creates a two-way interaction between the listeners and the performance because both are subject to the listening room’s acoustics. This also solves the problem of how to cram a concert hall into the listener’s room as well as the more intimate performances. Is the aim for the musicians and venue to come to the listener or for the listener to go to the performance? The stereo system with speakers creates a hybrid: regard it as the listener’s room being transported to the venue and its end wall being opened up.

The Definition of ‘High Fidelity’

How would various people define the term ‘high fidelity’?

Average person in the street

“Recreates the sound of being at the performance”

I think an imaginary typical person would probably say something like this, especially after being told that audiophiles pay as much as they earn in a year for a piece of wire.

Unfortunately, high fidelity audio doesn’t reproduce the actual sound of the performance unless, perhaps, through binaural recording and playback over headphones. This technique doesn’t pretend to maintain the illusion as you turn your head and move around, though.

And, of course, for a studio creation rather than live performance, there is no performance as such to recreate.

Average slightly technical person

“The speaker reproduces the recorded signal precisely”.

As I imagine it, the technically-literate layman’s definition of high fidelity would be more realistic and in fact correct, but incomplete because it does not specify how the speaker should interact with the acoustic environment.

Traditional audio enthusiast

“Low distortion, low noise, flat frequency response from the speaker”.

The typical audio enthusiast would translate the goal into audio-centric terms that aspire to nothing more than reproducing the signal with the right frequency content on average – which ignores the unavoidable timing and phase distortion that occurs in traditional passive speakers. It also admits horrors such as bass reflex resonators, which further smear transients (as opposed to the perfect results they may give on steady-state sinusoids).

Computer-literate audio enthusiast

“Low distortion, low noise, and a target frequency response at the listener’s ear”

The modern audio enthusiast who has discovered laptops, microphones and FFTs thinks that the smoothed, simplified frequency response measurement displayed on their screen is the way a human hears sound. It has to be, because the alternatives – the complex frequency domain representation and its equivalent, the time domain waveform – are visually incomprehensible.

My definition

“The recorded signal is reproduced precisely, from a small acoustic source with equal, controlled directivity at all frequencies”.

This definition is based on logical deductions.

The perceptive audio enthusiast would observe that they can always recognise voices, instruments and musical sounds regardless of acoustics, and can turn their heads towards those sounds. They would therefore deduce that humans have the ability to ‘focus’ on audio sources regardless of acoustics. Clearly, then, we don’t just hear the composite frequency response of source and room combined, but have other interesting hearing abilities, probably related to binaural hearing, head movements, and phase and timing.

If we can focus on the source of a sound i.e. hear through the room, the room is not a problem to be solved but simply something normal and natural that exists. It is puzzling to think that we can improve the sound of one thing (the room) by changing something we perceive as separate from it (the sound of the source).

If the frequency response of the source is modified because of some characteristic of the room (tantamount to changing the frequency response of a musical performer in a live venue), we will hear the source as sounding unnatural. Thus ‘room correction’ based on EQ is illogical. Thus the idea of the ‘target frequency response’ is simply wrong.

If we use phase and timing in our hearing, and/or have unknown hearing abilities, there is no excuse for modifying the source’s phase and timing, arbitrarily or otherwise. Thus, if it is possible, the speaker should not modify the recording’s phase or timing. DSP makes this possible. But because of the laws of physics, this would require the speaker to look into the future, and this is only possible if we introduce a delay in the output i.e. latency. For listening to recordings (as opposed to live monitoring) latency is acceptable.
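
This trade-off can be made concrete with a linear-phase FIR filter: a symmetric set of taps applies an arbitrary magnitude correction as a pure delay, with no phase or timing distortion, and that delay is the latency. A minimal Python sketch (the target response and tap count here are arbitrary illustrations, not any particular speaker’s correction):

```python
import numpy as np
from scipy.signal import firwin2

FS = 48_000          # sample rate (assumed for the example)
N_TAPS = 4001        # odd tap count -> exact linear phase, integer-sample delay

# Illustrative target: a gentle 4 dB high-frequency shelf cut.
freqs = [0, 3_000, 22_000, FS / 2]
gains_db = [0, 0, -4, -4]
gains = [10 ** (g / 20) for g in gains_db]

# firwin2 designs a symmetric (linear-phase) FIR matching the magnitude
# response; its phase is a pure delay of (N_TAPS - 1) / 2 samples.
taps = firwin2(N_TAPS, freqs, gains, fs=FS)

latency_ms = (N_TAPS - 1) / 2 / FS * 1000
print(f"latency: {latency_ms:.1f} ms")   # ~41.7 ms for 4001 taps at 48 kHz
```

An odd, symmetric tap count gives an exact integer-sample delay; longer filters allow finer low-frequency control at the cost of more latency – fine for playback, problematic for live monitoring.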

The final part of the puzzle is how the ideal speaker should interact with the room. The speaker is not intended to recreate the exact acoustic characteristics of a literal musical instrument, but to reproduce the audio field that was picked up by a microphone – possibly a composite of many musical sources plus acoustics. There is only one logical ideal in terms of dispersion (i.e. the angle through which the sound emerges from the front of the speaker) and that is: uniform at all frequencies.

What the size of that constant dispersion angle should be is open to debate and the taste of the listener – as discussed in the Grimm LS1 design paper. Most people seem to prefer something that is a compromise between omni-directional and a super-directional beam.

This is exemplified by modern cardioid speakers such as the Kii Three or the D&D 8C. To quote the designer of the 8C:

No voicing required. Other loudspeakers usually require voicing. Based on listening to a lot of recordings, the tonal balance of the loudspeaker is changed so that most recordings sound good. Voicing is required to balance differences between direct and off-axis sound. The 8c has very even dispersion. It is the first loudspeaker I ever designed that did not benefit from voicing. The tonal balance is purely based on anechoic measurements.

Where some confusion may appear is when a real world speaker (almost all existing types until now) does not possess this ideal uniform dispersion characteristic. In this situation, the reverberant sound fails to point back to the source. Effectively the frequency response of the reverberant sound does not correspond with that of the source, and the listener perceives this discrepancy due to their ability to ‘read’ the acoustic environment in terms of phase, timing and frequency response.

Of course a single musical instrument might have any dispersion characteristic, but if the recording is a composite of several musical sources in their acoustic environment, and a single, unvarying non-neutral dispersion characteristic is applied to all of them, it sounds false. Only neutral dispersion will do.

Some EQ can help here, but it is not true ‘correction’. All that can be done is to steer a middle course between neutral frequency response for the direct sound and the same for the reverberant sound. A commonly known version of this is baffle step compensation which is often applied as a frequency response ‘shelf’ whose frequency is defined by the speaker’s baffle width, and whose depth is dependent on speaker positioning and the room.
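
As a rough illustration of what such a shelf does, here is a toy first-order baffle step shelf in Python. The 115/width rule of thumb for the step frequency and the 4dB depth are assumptions made for the sake of the example; in practice the depth is tuned to the speaker’s position and the room:

```python
import numpy as np

def baffle_step_shelf(f, f_step, depth_db):
    """Magnitude (dB) of a gentle first-order shelf centred on the
    baffle-step frequency: 0 dB well below it, -depth_db well above.
    (Applied here as a cut above the step rather than a boost below,
    which preserves headroom.)"""
    ratio = f / f_step
    return -depth_db * (ratio**2 / (1 + ratio**2))

WIDTH_M = 0.25                 # baffle width in metres (illustrative)
F_STEP = 115 / WIDTH_M         # common rule of thumb: f3 ~ 115 / width(m)
for f in (100, F_STEP, 5_000):
    print(f"{f:7.0f} Hz: {baffle_step_shelf(f, F_STEP, 4.0):+5.2f} dB")
```

At the step frequency itself the shelf is half-depth (-2dB here), with the full depth only reached well above it – a gentle transition rather than a sharp filter.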

The required compensation cannot be deduced from an in-room measurement of the speaker, because that measurement inextricably shows a combination of the room and the speaker’s unknown dispersion characteristics interacting with it. Only some a priori knowledge of the speaker can help to formulate the optimum correction curve.

N.B. the goal is not a flat, or any other ‘target’, in-room response; the goal is minimal deviation from flat direct sound while achieving the most natural in-room sound possible. DSP allows this EQ curve to be applied without distorting the speaker’s phase or timing.

Stereo

It seems reasonable to extend the logic of accurate playback of the signal and uniform dispersion, from mono (one speaker), to stereo (two speakers).

But stereo is where obvious logic gives way to an element of “It has to be heard to be believed”. The operation of stereo is not obvious. Despite all the talk of the human ability to interpret the acoustic environment, stereo relies on fooling human hearing into believing that a sound reproduced from two locations simultaneously is, in fact, coming from a phantom location. This simultaneous reproduction is something that does not occur in nature, hence the potential for this to work.

Aspects that might be potential ‘show stoppers’ include:

  1. Crosstalk from each speaker to ‘the wrong ear’
  2. Nausea-inducing collapse of the stereo image as the listener turns their head or moves off-centre
  3. Room reverberation from individual speakers not pointing back to the phantom stereo source and so sounding unnatural

It turns out that (1) is a fundamental part of the way stereo works over speakers – as opposed to headphones.

And this leads to a very benign situation regarding (2), where the stereo image remains stable and plausible with listener movement.

Because (1) and (2) lead to a counter-intuitively good result where the listener is simply unaware of the location of the speakers, a listening room with reasonable symmetry extends this effect to give a good result for (3) – effectively phantom reverberation. If one speaker were sitting next to a marble wall, floor and ceiling, and the other surrounded by cushions, maybe the result wouldn’t be so good. As it is, a reasonable listening setup does not give rise to any noticeably unnatural reverberation for stereo phantom images.

What High Fidelity Over Speakers Gives Us

The result of high fidelity stereo is remarkable, and could even be the ultimate way to listen to recorded music, being even better than the notion of perfect ‘holographic’ recreation of the listening venue.

The issue is one of compatibility with domestic life and the cognitive dissonance aspect of recreating a large space in your living room. Donning special apparatus in order to listen is a bit of a mood killer; having to sit in a restrictive central location likewise.

Not hearing one’s own voice or that of a companion while listening would seem weird and artificial. Hearing no acoustic link between one’s own voice and the musical performance would also seem peculiar: imagine listening over headphones to an organ playing in a cathedral and speaking to your companion over a lip mic.

Listening to stereo in a living room gets around all these issues naturally and elegantly. It’s good enough for really serious, critical listening, but is effortlessly compatible with more social, casual listening. The addition of the ambient reverberation of the listening room acts as a two-way bridge between the performance and the listeners.

Predictable audio

Various audiophiles and forums are very measurements-oriented. The implication is that any audio system or device can be approached as a blank sheet of paper, open to testing to reveal its performance, and that ‘anything is possible’: until you measure it you just don’t know what it’s capable of, but once you have made a set of basic measurements you will know all about it.

The truth, surely, is going to be that two brands of similar-sized two-way speakers with 8″ and 1″ drivers are going to behave very similarly to each other in terms of distortion, dispersion and so on. The exact choice of crossover frequencies and slopes may make differences, but they will be entirely predictable trade-offs. Making the speakers slim-fronted floor standers with 6″ and 1″ drivers will change the characteristics in entirely predictable ways.

The same, surely, will be true of DACs that use the same chipsets. Only a bad mistake will change the core performance – there is no ‘trade secret’ that will magically improve what goes on in a chip. Sure, measurements will be a check on the absence of that mistake, but we shouldn’t expect any amazing differences.

Amplifiers of the same type will all behave similarly to each other. If someone claims to have drastically improved the performance of an existing type of amplifier, scepticism is the order of the day. Do these innovations only work with a resistive load, are they stable with temperature, do they progressively shut the amplifier down on long organ notes? Basic measurements won’t necessarily show these limitations up.

So I would say that the most interesting aspects of audio are going to be the motivations of the designers, their obsessions and what they dismiss as unimportant. For me, a speaker that derives from the quote “I realised our product line-up had a gap for a simple passive two-way, ported compact monitor” is not going to be worth measuring. It simply won’t provide any surprises – unless it’s been really messed up badly.

This is something I didn’t know ten years ago. Back then, I somehow assumed that audio was so mysterious that it was, to use an American expression, a crapshoot. A two-way ported speaker might somehow hit just the right balance if designed by a genius, and might measure and sound fundamentally better than a larger, more sophisticated speaker with DSP. The reason I was so mistaken was the astonishing range of prices for audio of various types, and the work of reviewers who simply weren’t capable of discriminating good sound from bad; who would use the same glowing vocabulary to describe the sound of both a small two-way monitor and a large three-way speaker when in reality they were miles apart.

There are no unpredictable miracles unless the design does something fundamental that has not been done before. The world’s best DAC and amplifier are not going to transform your audio experience; the world’s most expensive two-way passive direct radiator speaker is not going to sound as good as a competent three-way active DSP speaker. On the other hand, innovations like motion feedback and driver power compression elimination might sound different from anything you’ve heard before; as might phase and timing alignment, active cardioid response, BACCH etc.

Such radical innovations defy measurement unless the design fundamentals and implementation are known; and if you know the design fundamentals and implementation, the measurements are going to be predictable and yet will still not tell you how they will sound.

I get to hear the Kii Threes

Thanks to a giant favour from a new friend, I finally get to hear the Kii Threes…


A couple of Sundays ago, a large van arrived at my house containing two Kii Threes and their monumentally heavy stands, plus a pair of Linkwitz LX Minis with subwoofers, along with their knowledgeable owner, John. It was our intention to spend the day comparing speakers.

We first set up the Kiis to compare against my ‘Keph’ speakers, and to do this we had to ‘interleave’ the speaker pairs with slightly less stereo separation and symmetry than ideal.


Setting up went remarkably smoothly, and we soon had the Kiis running off Tidal on a laptop while the Kephs were fed with Spotify Premium – most tracks seemed to be available from both services. The Kiis are elegant in the simplicity of cabling and the lack of extraneous boxes.

John had set up the Kiis with his preferred downward frequency response slope, which starts at 3kHz and ends 4dB down (at 22kHz?). I can’t say what significance this might have had for our listening experiment.

The original idea was to match the SPLs using pink noise and a sound level meter. This we did, but we didn’t maintain such discipline for long. We were listening rather louder than I normally would, but this was inevitable because of the Kiis’ amazing volume capabilities.

The bottom line is that the Kiis are spectacular! The main differences for me were that the Kiis were ‘smoother’ and the bass went deeper, and they seemed to show up the ‘ambience’ in many recordings more than the Kephs – more about that later. An SPL meter revealed that what sounded like equal volume required, in fact, a measurably higher SPL from the Kephs. Could this be our hearing registering the direct sound, but the Kiis’ superior dispersion abilities resulting in less reverberant sound – ignored by our conscious hearing but no doubt obscuring detail? Or possibly an artefact of their different frequency responses? We didn’t really have time to investigate this any further.

When standing a long way from both sets of speakers at the back of the room, the Kephs appeared to be emphasising the midrange more, and at the moment of changeover between speakers that contrast didn’t sound good; on a certain classical piano track, at the moment of changeover the Kephs seemed to render the sound* of the piano as kind of ‘plinky-plonk’ or toy-like compared to the Kiis – but then after about 10 seconds I got used to it. Without the Kiis to compare against, I would have said my Kephs sounded quite good! But the Kiis were clearly doing something very special.

I did try some ad hoc modifications of the Keph driver gains, baffle step slopes and so on, and we maybe got a bit closer in that regard. But I forgot about the -4dB slope that had been applied to the Kiis, and if I had thought about it, I already had an option in the Kephs’ config file for doing just that. Really, though, I wish I had had the courage of my convictions and left the frequency response ‘as is’.

Ultimately, I think that we were running into the very reason why the Kiis are designed the way they are: to resemble a big speaker. As the blurb for the Kii says:

“The THREE’s ability to direct bass is comparable to, but much better controlled than that of a traditional speaker several meters wide.”

It’s about avoiding reflections that blur bass detail, but as R.E. Greene explains, it’s also about frequency response:

“What is true of the mini-monitor, that it cannot be EQed to sound right, is also true of narrow-front floor-standers. They sound too midrange-oriented because of the nature of the room sound. This is something about the geometry of the design. It cannot be substantially altered by crossover decisions and so on.”

A conventional small speaker (and the Kephs are relatively small) cannot be equalised to give a flat direct sound and flat room sound. It has to be a compromise and as I described before, I apply baffle step compensation to help bridge this discrepancy between the direct and ambient frequency balances. The results are, so I thought, rather acceptable, but the compromise shows up against a speaker with more controlled dispersion.

This must always be a factor in the sound of conventional speakers unless sitting very close to them. I do believe Bruno Putzeys when he says that large speakers (or those that cleverly simulate largeness) will always sound different from small ones. It would be interesting also to have compared the Kiis against my bigger speakers whose baffle step is almost an octave lower.

However, there was another difference that bothered me (with the usual sighted listening caveats) and this was ‘focus’. With the Kiis I heard lots of ‘ambience’ – almost ‘surround sound’ – but I didn’t hear a super-precise image. When the Kephs were substituted I heard a sudden snap into focus, and everything moved to mainly between and beyond the speakers. The sound was less ‘smooth’ but it was, to me, more focused.

And this is a question I still have about the Kiis and other speakers that utilise anti-phase. I see the animations on the Kii web site that show how the rear drivers cancel out the sound that would otherwise go behind the speaker. To do this, the rear drivers must deliver a measured quantity of accurately-timed anti-phase. This is a brilliant idea.

My question is, though: how complete is this cancellation if you partially obscure one of the side drivers (with another speaker, in this case)? I do wonder if I was hearing the results of anti-phase escaping into the room and messing up the imaging because of the way we had arranged the speakers – along with a mildly (possibly imaginary!) uncomfortable sensation in my ears and head.

To a person oriented towards frequency response measurements, it doesn’t matter whether sound is anti- or in-phase; it is just ‘frequency response material’ that gets chucked into bins and totted up at the end of the measurement. If it is delayed and reflected, then in graphs its effects appear no different from the visually chaotic results of all room reflections; this is the usual argument against phase accuracy in forum discussions: “How can phase matter if it is shifted arbitrarily by reflections in the room, anyway?”

However, to the person who acknowledges that the time domain is also important, anti-phase is a problem. If human hearing has the ability to separate direct sound from room sound, it is dependent on being able to register the time-delayed similarity between direct and reflected sound. If the reflected sound is inverted relative to the direct, that similarity is not as strong (we are talking about transients more than steady state waveforms). In fact, the reflected sound may partially register as a different source of sound.
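
A toy numpy experiment illustrates the point, if we take the largest positive normalised cross-correlation as a crude proxy for that ‘time-delayed similarity’ (the burst, delay and attenuation below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
burst = rng.standard_normal(256) * np.hanning(256)   # a noise-burst 'transient'
direct = np.concatenate([burst, np.zeros(1_000)])

reflected = 0.5 * np.roll(direct, 500)   # delayed, attenuated, in-phase copy
inverted = -reflected                    # the same reflection, but anti-phase

def peak_similarity(a, b):
    """Largest positive normalised cross-correlation over all lags."""
    c = np.correlate(b, a, mode="full")
    return c.max() / np.sqrt(np.dot(a, a) * np.dot(b, b))

sim_in = peak_similarity(direct, reflected)    # exact delayed match
sim_anti = peak_similarity(direct, inverted)   # much weaker positive match
print(sim_in, sim_anti)
```

The in-phase reflection matches the direct sound perfectly at the delay lag (similarity 1.0); the inverted copy never produces a strong positive match, only the chance sidelobes of the noise burst.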

Anti-phase is surely going to sound weird – and indeed it does, as anyone who has heard stereo speakers wired out of phase will attest. Where the listener registers in-phase stereo speakers as producing a precise image located at one point in space, out-of-phase speakers produce an image located nowhere and/or everywhere. The makers of pseudo-surround sound systems such as Q-Sound exploit this in order to create images that are not restricted to between the stereo speakers. This may be a factor in the open baffle sound that some people like (but I don’t!).

So I would suggest that allowing anti-phase to bounce around the room is going to produce unpredictable results. This is one reason why I am suspicious of any speaker that releases the backwave of the driver cone into the room. The more this can be attenuated (and its bandwidth restricted) the better.

With the Kiis, was I hearing the effect of less-than-perfect cancellation because of the obscuring of one of the side drivers? Or imagining it? Most people who have heard the Kiis remark on the precise imaging, so I fear that we managed to change something with our layout. Despite the Kiis’ very clever dispersion control system which supposedly makes them placement-independent, does it pay to be a little careful of placement and/or symmetry, anyway? For it not to matter would be miraculous, I would say.

In a recent review of the Kiis (not available online without a subscription), Martin Colloms says that with the Kiis he heard:

“…sometimes larger than life, out-of-the-box imaging”

I wonder if that could be a trace of what I was hearing..? Or maybe he means it as a pure compliment. In the same review he describes how the cardioid cancellation mechanism extends as far as 1kHz, so it is not just a bass phenomenon.


Next, John set up his DIY Linkwitz LX Mini speakers (which look very attractive, being based on vertical plastic tubes with small ‘pods’ on top), as well as their compact-but-heavy subwoofers. These were fed with analogue signals from a Raspberry Pi-based streamer and, again, sounded excellent. They also seek to control dispersion, in this case by purely acoustic means that I don’t yet understand. And they may also dabble a bit in backwave anti-phase.

If I had any criticism, it was that the very top end wasn’t quite as good as a conventional tweeter’s – but that might be my imagination and expectation bias. Also, our ears and critical faculties were pretty far gone by that point…

Really, we had three systems all of which, to me, sounded good in isolation – but with the Kiis revealing their superior performance at the point of changeover. There were certainly moments of confusion when I didn’t know which system was operating and only the changeover gave the game away. I think all three systems were much better than what you often get at audio shows.

What we didn’t have were any problems with distortion, hum or noise. In these respects, all three systems just worked. The biggest source of any such problem was a laptop fan which kicked in sometimes when running Tidal.

There were lots of things we didn’t do. We didn’t try the speakers in different positions; we didn’t try different toe-in angles; we didn’t make frequency response measurements and do things in a particularly scientific way; we listened at pretty high volume and didn’t have the self-control to listen at lower volumes – which might have been more appropriate for some of the classical music. The room was ‘as it comes’: 6 x 3.4 x 2.4m, carpeted, plaster walls and ceiling, and floor-to-ceiling windows behind the speakers with a few boxes and bits of furniture lying about.


So my conclusion is that I have heard the Kiis and am highly impressed, but there might possibly be an extra level of focus and integrity I have yet to experience. I never got to the point where I could listen through the speakers rather than to them, but I am sure that this will happen at some point.

In the meantime I am having to learn to love my Kephs again – which actually isn’t too hard without the Kiis in the same room showing them up!


Footnotes:

*Since writing that paragraph I have found what may be a mention of that very phenomenon:

“…even a brief comparison with a real piano, say, will reveal the midrange-orientation of the narrow-front wide radiators.”

Still listening…

I’m still here, and still listening to my KEF-derived speakers most days. I honestly think that they are built to the right formula – although I keenly await the time when I get to hear the Kii Threes.

It all now seems so obvious, but it took me quite a while to disentangle myself from the frequency-domain-centric view that most audio design people are committed to – a view their minds (and possibly ears) have been warped into believing.

It is clear that human hearing does perform frequency domain analysis, but that it also uses other methods and ‘hardware’ in parallel to characterise what it is hearing. This means that an audio system needs to reproduce the signal without changing it in either the time or frequency domains.

The alternative is to second guess how human hearing works and to assume that arbitrary distortion of phase and timing has no effect. In fact, I would say it is not even as rational as that: what seems to have happened is that while carpentry-and-coil-based technology doesn’t explicitly control phase and timing, conventional 1970s speakers still sounded pretty good. The results have been retrospectively analysed and justified, and a model of human hearing developed to fit the speakers rather than the other way round.

This faulty model leads to ideas like bass reflex and ‘room correction’ that, viewed through the prism of not trying to second guess human hearing, seem as confused and deluded as they sound.

The result is the weird variability in audio systems that all ‘measure well’ – using the subset of measurements that satisfy the model – but sound disappointing even while costing the price of a car. It might even be worse than that: maybe recordings are being made while being monitored through ‘room correction’ resulting in the demise of high fidelity recordings as we know them.

And there’s another delusional idea that stems from the faulty model and the occasionally serendipitous characteristics of old technology: the notion that we listen to a signal rather than through a channel.

The conventional view is that we must change the signal to give the best sound – whether by equalisation or, bizarrely, by deliberately adding distortion, e.g. with valves or vinyl. If you do this, you are really changing the characteristics of the channel. In real music and acoustics there is no such thing as ‘a signal’, and whatever automatic processing you do to it is, in the general case, arbitrary and meaningless. For sure, you may find that distortion is a pleasing artistic effect on a particular (probably very simple) recording. But are you an artist? If so, you might be much better served by playing with a freeware PC recording studio app rather than churning through equipment that represents several years of the retirement you may never get to enjoy.

The only coherent strategy is to reproduce the signal without touching it. In my experience, if you get anywhere near to this, it sounds magnificent. Not ‘neutral’; not ‘clinical’ but deep, open, rich, colourful – like real music.

How to re-sample an audio signal

As I mentioned earlier, I would like the flexibility of using digital audio data that originates outside the PC that is performing the DSP, and such data will necessarily have a different sample clock. Something has got to give!

If the input were analogue, you would simply sample it with an ADC locked to your DAC’s sample rate, and then the source’s own sample rate wouldn’t matter to you. With a standard digital audio source (e.g. S/PDIF) you need to do the same thing, but purely in software. The incoming sampled data points are notionally turned into a continuous waveform in memory by duplicating a DAC reconstruction filter using floating point maths. You can then sample that waveform wherever you want, at a rate locked to the DAC’s sample rate.

You still ‘eat’ the incoming data at the rate at which it comes in, but you vary the number of samples that you ‘decimate’ from it (very, very slightly).

The control algorithm for locking this re-sampling to the DAC’s sample rate is not completely trivial, because the PC’s only knowledge of the sample rates of the DAC and S/PDIF is via notifications that large chunks of data have arrived or left, with unknown amounts of jitter. It is only possible to establish an accurate measure of relative sample rates with a very long time constant average. In reality the program never actually calculates the sample rate at all, but merely maintains a roughly constant difference between the read and write pointer positions of a circular buffer. It relies on adequate latency, and on the two sample rates being reasonably stable by virtue of being derived from crystal oscillators. The corrections will, in practice, be tiny and/or occasional.
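As a sketch of this control idea (the names and constants here are illustrative, not taken from the real program), the resampling step can be nudged by a tiny amount whenever the buffer fill level drifts from its target:

```python
# Illustrative sketch of locking the resampling step to buffer occupancy.
# Rather than measuring either sample rate, we keep the distance between
# the circular buffer's write and read pointers roughly constant.

TARGET_FILL = 8192   # desired write/read pointer separation (samples)
GAIN = 1e-9          # tiny: both clocks are crystal-derived and stable

def update_step(step, fill_level):
    """Return a very slightly adjusted resampling step.

    A buffer that is filling up means the source is fast relative to the
    DAC, so we consume input fractionally faster, and vice versa.
    """
    error = fill_level - TARGET_FILL
    return step * (1.0 + GAIN * error)

# Example: a buffer 1000 samples over-full nudges the step up by
# one part in a million -- an inaudible correction.
step = update_step(1.0, TARGET_FILL + 1000)
```

With a gain this small the loop has a very long time constant, which is exactly what the jittery chunk notifications demand.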

How is the interesting problem of re-sampling solved?

In order to experiment with it I have created a program that runs on a PC and does the following:

  1. Synthesises a test signal as an array of floating point values at a notional sample rate of 44.1 kHz. This can be a sine wave, or a combination of sine waves at different frequencies.
  2. Plots the incoming waveform as time domain dots.
  3. Plots the waveform as it would appear when reconstructed with the sinc filter. This is a sanity check that the filter is doing approximately the right thing.
  4. Resamples the data at a different sample rate (can be specified with any arbitrary step size e.g. 0.9992 or 1.033 or whatever), using floating point maths. The method can be nearest-neighbour, linear interpolation, or sinc & linear interpolation.
  5. Plots the resampled waveform as time domain dots.
  6. Passes the result into an FFT (65536 points), windowing the data with a raised cosine window.
  7. Plots the resulting resampled spectrum in terms of frequency and amplitude in dB.

This is an ideal test bed for experimenting with different algorithms and getting a feel for how accurate they are.
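The skeleton of such a test bed is simple. Here is a minimal sketch (in Python, purely for illustration – the names are invented, and it measures error numerically rather than plotting an FFT) comparing nearest-neighbour and linear interpolation against the mathematically exact resampled waveform:

```python
import math

FS = 44100.0   # notional input sample rate

def synth(freq, n):
    """Step 1: synthesise a test sine as floating point samples."""
    return [math.sin(2 * math.pi * freq * i / FS) for i in range(n)]

def resample_nn(x, step):
    """Step 4, nearest neighbour: pick the closest input sample."""
    out, pos = [], 0.0
    while pos < len(x) - 1:
        out.append(x[int(pos + 0.5)])
        pos += step
    return out

def resample_linear(x, step):
    """Step 4, linear: blend the two neighbouring input samples."""
    out, pos = [], 0.0
    while pos < len(x) - 1:
        i = int(pos)
        frac = pos - i
        out.append(x[i] * (1.0 - frac) + x[i + 1] * frac)
        pos += step
    return out

# Compare each method against the mathematically exact waveform at the
# new sample instants.
freq, step = 1000.0, 0.9992
x = synth(freq, 4096)
n_out = len(resample_nn(x, step))
exact = [math.sin(2 * math.pi * freq * (i * step) / FS) for i in range(n_out)]
err_nn = max(abs(a - b) for a, b in zip(resample_nn(x, step), exact))
err_li = max(abs(a - b) for a, b in zip(resample_linear(x, step), exact))
```

Even without an FFT, the worst-case errors make the ranking obvious: linear interpolation beats nearest neighbour by a wide margin, and neither is anywhere near ‘hi-fi’.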

Nearest-neighbour and linear interpolation are pretty self-explanatory methods; the sinc method is similar to that described here:

https://www.dsprelated.com/freebooks/pasp/Windowed_Sinc_Interpolation.html

I haven’t completely reproduced their method, but I was inspired by this image:

[Image: ‘Waveforms’ figure from the linked article]

The sinc function is the ideal ‘brick wall’ low pass filter and is calculated as sin(x*PI)/(x*PI). In theory it extends from minus to plus infinity, but for practical use it is windowed so that it tapers to zero at plus or minus the desired width – which should be as wide as practical.

The filter can be set at a lower cutoff frequency than Nyquist by stretching it out horizontally, and this would be necessary to avoid aliasing if wishing to re-sample at an effectively slower sample rate.

If the kernel is slid along the incoming sample points and a point-by-point multiply and sum is performed, the result is the reconstructed waveform. What the above diagram shows is that the kernel can be in the form of discrete sampled points, calculated as the values they would be if the kernel was centred at any arbitrary point.

So resampling is very easy: simply synthesise a sinc kernel in the form of sampled points based on the non-integer position you want to reconstruct, and multiply-and-add all the points corresponding to it.

A complication is the necessity to shorten the filter to a practical length, which involves windowing the filter, i.e. multiplying it by a smooth function that tapers to zero at the edges. I did previously mention the Lanczos kernel, which apparently uses a widened copy of the central lobe of the sinc function as the window. But looking at it, I don’t know why this is supposed to be a good window function: it doesn’t taper gradually to zero, so at non-integer sample positions you would either have to extend it with zeroes abruptly, or accept non-zero values at its edges.

Instead, I have decided to use a simple raised cosine as the windowing function, and to reduce its width slightly to give me some leeway in the kernel’s position between input samples. At the extremities I ensure it is set to zero. It seems to give a purer output than my version of the Lanczos kernel.
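Putting these pieces together, a sketch of the windowed-sinc reconstruction looks like this (simplified and with invented names – my program’s details differ, but the multiply-and-add idea is the same):

```python
import math

HALF_WIDTH = 25   # kernel spans +/- 25 input samples, i.e. 'width 50'

def sinc(x):
    """The ideal brick-wall low-pass impulse response."""
    if x == 0.0:
        return 1.0
    return math.sin(x * math.pi) / (x * math.pi)

def raised_cosine(x, half_width=HALF_WIDTH):
    """Window tapering smoothly to exactly zero at +/- half_width."""
    if abs(x) >= half_width:
        return 0.0
    return 0.5 * (1.0 + math.cos(math.pi * x / half_width))

def interpolate(samples, pos):
    """Reconstruct the waveform at non-integer position `pos` by centring
    a windowed sinc kernel there and multiply-and-adding the samples."""
    centre = int(math.floor(pos))
    total = 0.0
    for i in range(centre - HALF_WIDTH, centre + HALF_WIDTH + 2):
        if 0 <= i < len(samples):
            x = pos - i   # kernel argument at this input sample
            total += samples[i] * sinc(x) * raised_cosine(x)
    return total
```

At integer positions the kernel collapses to the sample itself (sinc is zero at every other integer); in between, it reconstructs the band-limited waveform.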

Pre-calculating the kernel

Although very simple, calculating the kernel on-the-fly at every new position would be extremely costly in terms of computing power, so the obvious solution is to use lookup tables. The pre-calculated kernels on either side of the desired sample position are evaluated to give two output values. Linear interpolation can then be used to find the value at the exact position. Because memory is plentiful in PCs, there is no need to skimp on the number of pre-calculated kernels – you could use a thousand of them. For this reason, the errors associated with this linear interpolation can be reduced to negligible levels.

The horizontal position of the raised cosine window follows the position of the centre of the kernel for all the versions that are calculated to lie in between the incoming sample points.
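A sketch of this table-driven version (again illustrative – the names are invented, though the 500 tables match the figure mentioned later; a small kernel width keeps the example quick):

```python
import math

HALF_WIDTH = 25    # +/- input samples covered by the kernel
TABLES = 500       # pre-calculated kernels between adjacent sample points

def windowed_sinc(x):
    """Sinc multiplied by a raised cosine window that slides with the
    kernel centre, as described above."""
    if abs(x) >= HALF_WIDTH:
        return 0.0
    s = 1.0 if x == 0.0 else math.sin(x * math.pi) / (x * math.pi)
    return s * 0.5 * (1.0 + math.cos(math.pi * x / HALF_WIDTH))

# KERNELS[t] is the kernel for a reconstruction point t/TABLES of the way
# between two input samples.
KERNELS = [[windowed_sinc(t / TABLES - (k - HALF_WIDTH))
            for k in range(2 * HALF_WIDTH + 1)]
           for t in range(TABLES + 1)]

def interpolate(samples, pos):
    """Evaluate the two pre-calculated kernels either side of `pos`,
    then linearly interpolate between their two outputs."""
    base = int(math.floor(pos))
    t = (pos - base) * TABLES
    lo = int(t)
    hi = min(lo + 1, TABLES)
    blend = t - lo
    out_lo = out_hi = 0.0
    for k in range(2 * HALF_WIDTH + 1):
        i = base + k - HALF_WIDTH
        if 0 <= i < len(samples):
            out_lo += samples[i] * KERNELS[lo][k]
            out_hi += samples[i] * KERNELS[hi][k]
    return out_lo * (1.0 - blend) + out_hi * blend
```

With 500 tables the difference between this and evaluating the kernel exactly at every position is far below audibility.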

All that remains is to decide how wide the kernel needs to be for adequate accuracy in the reconstruction – and this is where my demo program comes in. I apologise that there now follows a whole load of similar looking graphs, demonstrating the results with various signals and kernel sizes, etc.

1 kHz sine wave

First we can look at the standard test signal: a 1 kHz sine wave. In the following image, the original sine wave points are shown joined with straight lines at the top right, followed by how the points would look when emerging from a DAC that has a sinc-based reconstruction filter (in this case, the two images look very similar).

Next down among the three time domain waveforms comes the resampled waveform, after we have resampled it to shift its frequency by a factor of 0.9 (a much larger shift than we will use in practice). In this first example, the resampling method being used is ‘nearest neighbour’. As you can see, the results are disastrous!

sin_1k_nn

1kHz sine wave, frequency shift 0.9, nearest neighbour interpolation

The discrete steps in the output waveform are obvious, and the FFT shows huge spikes of distortion.

Linear interpolation is quite a bit better in terms of the FFT, and the time domain waveform at the bottom right looks much better.

sin_1k_li

1kHz sine wave, frequency shift 0.9, linear interpolation

However, the FFT magnitude display reveals that it is clearly not ‘hi-fi’.

Now, compare the results using sinc interpolation:

sin_1k_sinc_50_0.9

1kHz sine wave, frequency shift 0.9, sinc interpolation, kernel width 50

As you can see, the FFT plot is absolutely clean, indicating that this result is close to distortion-free.

Next we can look at something very different: a 20 kHz sine wave.

20 kHz sine wave

sin_20k_nn

20 kHz sine wave, frequency shift 0.9, nearest neighbour interpolation

With nearest neighbour resampling, the results are again disastrous. At the right hand side, though, the middle of the three time domain plots shows something very interesting: even though the discrete points look nothing like a sine wave at this frequency, the reconstruction filter ‘rings’ in between the points, producing a perfect sine wave with absolutely uniform amplitude. This is what is produced by any normal DAC – and is something that most people don’t realise; they often assume that digital audio falls apart at the top end, but it doesn’t: it is perfect.

Linear interpolation is better than nearest-neighbour, but pretty much useless for our purposes.

sin_20k_li

20kHz sine wave, frequency shift 0.9, linear interpolation

Sinc interpolation is much better!

sin_20k_sinc_50

20kHz sine wave, frequency shift 0.9, sinc interpolation, kernel size 50

However, there is an unwanted spike at the right hand side (note the main signal is at 18 kHz because it has been shifted down by a factor of 0.9). This spike appears because of the inadequate width of the sinc kernel, which in this case has been set at 50 (with 500 pre-calculated versions at different time offsets between sample points).

If we increase the width of the kernel to 200 (actually 201 because the kernel is always symmetrical about a central point with value 1.0), we get this:

sin_20k_sinc_200

20kHz sine wave, frequency shift 0.9, sinc interpolation, kernel size 200

The spike is almost at acceptable levels. Increasing the width to 250 we get this:

sin_20k_sinc_250

20 kHz sine wave, frequency shift 0.9, sinc interpolation, kernel size 250

And at 300 we get this:

sin_20k_sinc_300

20 kHz sine wave, frequency shift 0.9, sinc interpolation, kernel size 300

Clearly the kernel width does need to be in this region for the highest quality.

For completeness, here is the system working on a more complex waveform comprising the sum of three frequencies (14, 18 and 19 kHz), all at the same amplitude, with a frequency shift of 1.01.

14 kHz, 18 kHz, 19 kHz sum

Nearest neighbour:

sin_14_18_19_nn

14, 18, 19 kHz sine wave, nearest neighbour interpolation

Linear interpolation:

sin_14_18_19_li

14, 18, 19 kHz sine wave, linear interpolation

Sinc interpolation with a kernel width of 50:

sin_14_18_19_sinc_50

14, 18, 19 kHz sine wave, sinc interpolation, kernel width 50

Kernel width increased to 250:

sin_14_18_19_sinc_250

14, 18, 19 kHz sine wave, sinc interpolation, kernel width 250

More evidence that the kernel width needs to be in this region.

Ready made solutions

Re-sampling is often done in dedicated hardware like Analog Devices’ AD1896. Some advanced sound cards like the Creative X-Fi can re-sample everything internally to a common sample rate using powerful dedicated processors – this is the solution that makes connecting digital audio sources together almost as simple as analogue.

In theory, stuff like this goes on inside Linux already, in systems like JACK – apparently. But it just feels too fragile: I don’t know how to make sure it is working, and I don’t really have any handle on the quality of it. This is a tricky problem to solve by trial-and-error because a system can run for ages without any sign that clocks are drifting.

In Windows, there is a product called “Virtual Audio Cable” that I know performs re-sampling using methods along these lines.

There are libraries around that supposedly can do resampling, but the quality is unknown – I was looking at one that said “Not the best quality” so I gave up on that one.

I have a feeling that much of the code was developed at a time when processors were much less powerful than they are now and so the algorithms are designed for economy rather than quality.

Software-based sinc resampling in practice

I have grafted the code from my demo program into my active crossover application and set it running with TOSLink from a CD player going into a cheap USB sound card (Maplin) which my software uses to acquire the stream, and my software’s output going to a better multichannel sound card (the Xonar U7). The TOSLink data is being resampled in order to keep it aligned with the U7’s sample rate. I have had it running for 20 hours without incident.

Originally, before developing the test bed program, I set the kernel size at 50, fearing that anything larger would stress the Intel Atom CPU. However, I now realise that a width of at least 250 is necessary, so with trepidation I upped it to this value. The CPU load trace went up a bit in the Ubuntu system monitor, but not much; the cores are still running cool. The power of modern CPUs is ridiculous! Remember that for each incoming sample, in each of the two channels arriving at 44.1 kHz, the algorithm is performing 500 floating point multiplications and additions, yet it hardly breaks into a sweat. There are absolutely no clever efficiencies in the programming. Amazing.
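As a back-of-envelope check of that load (assuming the figures above: two channels at 44.1 kHz, kernel width 250, and the two pre-calculated kernels blended per output sample):

```python
# Rough arithmetic-load estimate for the resampler described above.
SAMPLE_RATE = 44100
CHANNELS = 2
KERNEL_WIDTH = 250
KERNELS_BLENDED = 2   # the two table entries either side of each position

# 500 multiply-accumulates per sample per channel, as mentioned above.
macs_per_sample = KERNEL_WIDTH * KERNELS_BLENDED
macs_per_second = SAMPLE_RATE * CHANNELS * macs_per_sample
print(macs_per_second / 1e6, "million multiply-accumulates per second")
```

About 44 million multiply-accumulates per second – a small fraction of what even a modest modern CPU core can sustain, which is why the load trace barely moves.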

Two hobbies

An acoustic event occurs; a representative portion of the sound pressure variations produced is stored and then replayed via a loudspeaker. The human hearing system picks it up and, using a lifetime’s experience, works out a likely candidate for what might have produced that sequence of sound pressure variations. It is like finding a solution from simultaneous equations. Maybe there is more than enough information there, or maybe the human brain has to interpolate over some gaps. The addition of a room doesn’t matter, because its contribution still allows the brain to work back to that original event.

If this has any truth in it, I would guess that an unambiguous solution would be the most satisfying for the human brain on all levels. On the other hand, no solution at all would lead to a different perception: the reproduction system itself being heard, not what it is reproducing – and people could still enjoy that for what it is, like an old radiogram.

In between, an ambiguous or varying solution might be in an ‘uncanny valley’ where the brain can’t lock onto a fixed solution but nor can it entirely switch off and enjoy the sound at the level of the old radiogram.

I think a big question is: what are the chances that a deviation from neutrality in the reproduction system will result in an improvement in the ability of the brain to work out an unambiguous solution to the simultaneous equations? The answer has got to be: zero. Adding noise, phase shifts, glitches or distortion cannot possibly lead to more ‘realism’; the equations don’t work any more.

But here’s a thought: what if most ‘audiophile’ systems out there are in the ‘uncanny valley’? Speakers in particular do strange things to the sound with their passive crossovers; ‘high end’ ones are low in nonlinear distortion, but high in linear distortion.

What if some non-neutral technologies ‘work’ by pushing the system out of the uncanny valley and into the realm of the clearly artificial? That is certainly the impression I get from some systems at the few audio shows I go to. People ooh-ing and aah-ing at sounds that, to me, are being generated by the audio system and not through it. I suspect that different ‘audiophiles’ may think they are all talking about the same things, but that in fact there are effectively two separate hobbies: one that seeks to hear through an audio system, and one that enjoys the warm, reassuring sound of the audio system itself.

The First CD Player

sony cdp-101

There’s an amazing online archive of vintage magazines that I have only just begun rummaging through. I was pleased to see this 1982 review of the Sony CDP-101, the first commercial CD player. The reviewer gets hold of a unit even before they go on sale commercially, saying:

I feel as though I am a witness to the birth of a new audio era.

This was the first time that the public had encountered disc loading drawers, instant track selection, digital readouts and digital fast forward and rewind, so he goes into great detail on how these work.

And at that time, the mechanics of the disc playing mechanism seemed inextricably linked with the nature of digital audio itself, so, after reading the more technical sections of the article, the reader’s mind would be awhirl with microscopic dots, collimators and laser focusing servos – possibly not really grasping the fundamentals of what is going on.

Audio measurements are shown, though, and of course these are at levels of performance hitherto unknown. (He is not able to make his own measurements this time, but a month later he has received the necessary test disc and is able to do so).

As I write these numbers, I find it difficult to remember that I am talking about a disc player!

Towards the end, the reviewer finally listens to some music. He is impressed:

I was fortunate enough to get my hands on seven different compact digital disc albums. Some of the selections on these albums were obviously dubbed from analog master tapes, but even these were so free of any kind of background noise that they could, for the first time, be thoroughly enjoyed as music. There’s a cut of the beginning of Also Sprach Zarathustra by Richard Strauss, with the Boston Symphony conducted by Ozawa, that delivers the gut-massaging opening bass note with a depth and clarity that I never thought possible for any music reproduction system. But never mind the specific notes or passages. Listening to the complete soundtrack recording of “Chariots of Fire,” the images and scenes of that marvelous film were re-created in my mind with an intensity that would just not have been possible if the music had been heard behind a veil of surface noise and compressed dynamic range.

He talks about

…the sheer magnificence of the sound delivered by Compact Discs

and concludes:

…after my experiences with this first digital audio disc player and the few sample discs that were loaned to me, I am convinced that, sooner or later, the analog LP will have to go the way of the 78 shellac record. I can’t tell you how long the transition will take, but it will happen!

A couple of months later he reviews a Technics player:

Voices and orchestral sounds were so utterly clean and lifelike that every once in a while we just had to pause, look up, and confirm that this heavenly music was, indeed, pouring forth from a pair of loudspeaker systems. As many times as I’ve heard this noise-free, wide dynamic-range sound, it’s still thrilling to hear new music reproduced this way…

…the cleanest, most inspiring sound you have ever heard in your home

So here we are at the very start of the CD era, with an experienced reviewer finding absolutely no problems with the measurements or sound.

In audiophile folklore, however, we are now led to believe that he was deluded. It is very common for audiophiles to sneer about the advertising slogan “Perfect Sound Forever”.

Stereophile in 1995:

When some unknown copywriter coined that immortal phrase to promote the worldwide launch of Compact Disc in late 1982, little did he or she foresee how quickly it would become a term of ridicule.

But in an earlier article from 1983 they had reviewed the Sony player saying that with one particular recording it gave:

…the most realistic reproduction of an orchestra I have heard in my home in 20-odd years of audio listening!

…on the basis of that Decca disc alone, I am now fairly confident about giving the Sony player a clean bill of health, and declaring it the best thing that has happened to music in the home since The Coming of Stereo.

For sure, there were/are many bad CDs and recordings, but it is now commonly held that early CD was fundamentally bad. I don’t believe it was. I would bet that virtually no one could tell the difference between an early CD player and modern ‘high res’.

Both magazines seemed aware that their own livings could be in jeopardy if ‘all CD players sound the same’, but I think that CD’s main problem was the impossibility of divorcing the perceived sound from the physical form of the players. 1980s audio equipment looked absolutely terrible – as a browse through the magazines of the time will attest.

Within a couple of years, CD players turned from being expensive, heavy and solid, to cheap, flimsy and with the cheesiest appearance of any audio equipment. They all measured pretty much the same, however, regardless of cost or appearance. Digital audio was revealed to be what it is: information technology that is affordable by everyone.

This, of course, killed it in the eyes and ears of many audiophiles.

KEF Concord Step Response

I recently decided to measure my converted KEF Concords to check their time alignment. In theory, they should be time aligned because the individual drivers have been corrected for linear phase and then delayed appropriately based on distance to the listener, but I hadn’t quite ‘closed the loop’ by making a direct measurement.

In order to do this, I measured with a microphone at tweeter height and 1m away from the speaker – just to make it the standard measurement position. I didn’t change anything about the normal crossover setup I have been using. I used REW to make the impulse response measurement, using a sweep from 10 Hz to 20 kHz with a duration of about 24s. Without completely re-arranging the room I could manage about 3.5ms before the first (major) reflection – it would be good to try it in a bigger room or even outdoors. I am curious about what sort of windowing people normally apply just before the main impulse: the choice influences just how clean everything looks leading up to the impulse and, to some extent, how clean the leading edge is. Some of the Stereophile graphs look suspiciously ‘sharp’ at the start.

IMG_2170

This is the result I got:

concored step response

I am assuming that the above graph shows that the time alignment of my speakers is pretty reasonable. In Stereophile’s article on measuring speakers they show a similar image:

Fig.11 shows a good step response produced by a time-coherent, three-way loudspeaker, with the outputs of the three drive-units adding in-phase at the microphone position. There are not that many speakers that produce this good a step response. Of the speakers I have measured for Stereophile, only about 10—models from Quad, Thiel, Dunlavy, Spica, and Vandersteen—have step responses this good.

Fig.12 shows a more typical step response, again of a three-way loudspeaker. This time there are actually three step responses apparent in the graph: a narrow, positive-going step response from the tweeter; the next, negative-going step is the midrange unit (as will be seen, it’s connected with opposite polarity to the tweeter); with finally a slow, wide positive pulse from the woofer.

Stereophile is the go-to publication for these sorts of things.

If you do a Google Image Search for ‘stereophile step response’ the results are quite interesting: true step responses are still quite rare. DSP should make it trivial, but for a passive speaker it can generally only be achieved using first order crossover filters, and these, of course, result in the drivers having to cope with substantial bleed of frequencies outside their comfort zone as well as being inflexible.

Strangely, the Beolab 90 looks nothing like a step! – although extenuating circumstances are listed.

117Beo90fig5.jpg

The Kii Three is more like it:

917Kii3fig1.jpg


Reverberation of a point source, compared with a ‘distributed’ loudspeaker

Here’s a fascinating speaker:

CBT36

It uses many transducers arranged in a specific curve, driven in parallel and with ‘shading’ i.e. graduated volume settings along the curve, to reduce vertical dispersion but maintain wide dispersion in the horizontal. I can see how this might appear quite appealing for use in a non-ideal room with low ceilings or whatever.

It is a variation on the phased array concept, where the outputs of many transducers combine to produce a directional beam. It is effectively relying on differing path lengths from the different transducers producing phase cancellation or reinforcement in the air at different angles as you move off axis. All the individual wavefronts sum correctly at the listener’s ear to reproduce the signal accurately.

At a smaller scale, a single transducer of finite size can be thought of as many small transducers being driven simultaneously. At high frequencies (as the wavelengths being reproduced become short compared to the diameter of the transducer) differing path lengths from various parts of the transducer combine in the air to cause phase cancellation as you move off axis. This is known as beaming and is usually controlled in speaker design by using drivers of the appropriate size for the frequencies they are reproducing. Changes in directivity with frequency are regarded as undesirable in speaker design, because although the on-axis measurements can be perfect, the ‘room sound’ (reverberation) has the ‘wrong’ frequency response.
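As a rough illustration of what ‘appropriate size’ means (this is a common rule of thumb, not a figure from the article): beaming becomes significant once the reproduced wavelength approaches the driver diameter, i.e. above roughly f = c/d:

```python
# Rule-of-thumb beaming onset: wavelength equals driver diameter.
SPEED_OF_SOUND = 343.0  # m/s in air at room temperature

def beaming_onset_hz(diameter_m):
    """Frequency above which a driver of this diameter starts to beam,
    using the wavelength-equals-diameter rule of thumb."""
    return SPEED_OF_SOUND / diameter_m

for name, d in [("25 mm tweeter", 0.025),
                ("130 mm midrange", 0.13),
                ("250 mm woofer", 0.25)]:
    print(f"{name}: beams above roughly {beaming_onset_hz(d):.0f} Hz")
```

So a 250 mm woofer starts to beam somewhere above 1.4 kHz, which is why it hands over to a smaller driver well before that behaviour becomes severe.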

A large panel speaker suffers from beaming in the extreme, but with Quad electrostatics Peter Walker introduced a clever trick, where phase is shifted selectively using concentric circular electrodes as you move outwards from the centre of the panel. At the listener’s position, this simulates the effect of a point source emanating from some distance behind the panel, increasing the size of the ‘sweet spot’ and effectively reducing the high frequency beaming.

There are other ways of harnessing the power of phase cancellation and summation. Dipole speakers’ lower frequencies cancel out at the sides (and top and bottom) as the antiphase rear pressure waves meet those from the front. This is supposed to be useful acoustically, cutting down on unwanted reflections from floor, walls and ceiling. A dipole speaker may be realised by mounting a single driver on a panel of wood with a hole in it, but it behaves effectively as two transducers, one of which is in anti-phase to the other. Some people say they prefer the sound of such speakers over conventional box speakers.

This all works well in terms of the direct sound reaching the listener and, as in the CBT speaker above, may provide a very uniform dispersion with frequency compared to conventional speakers. But beyond the measurements of the direct sound, does the reverberation sound quite ‘right’? What if the overall level of reverberation doesn’t approximate the ‘liveness’ of the room that the listeners notice as they talk or shuffle their feet? If the vertical reflections are reduced but not the horizontal, does this sound unnatural?

Characterising a room from its sound

The interaction of a room and an acoustic source could be thought of as a collection of simultaneous equations – acoustics can be modelled and simulated for computer games, and it is possible for a computer to do the reverse and work out the size and shape of the room from the sound.  If the acoustic source is, in fact, multiple sources separated by certain distances, the computer can work that out, too.

Does the human hearing system do something similar? I would say “probably”. A human can work quite a lot out about a room from just its sound – you would certainly know whether you were in an anechoic chamber, a normal room or a cathedral. Even in a strange environment, a human rarely mistakes the direction and distance from which sound is coming. Head movements may play a part.

And this is where listening to a ‘distributed speaker’ in a room becomes a bit strange.

Stereo speakers can be regarded as a ‘distributed speaker’ when playing a centrally-placed sound. This is unavoidable – if we are using stereo as our system. Beyond that, what is the effect of spreading each speaker itself out, or deliberately creating phased ‘beams’ of sound?

Even though the combination of direct sounds adds up to the familiar sound at the listener’s position as though emanating from its original source, there is information within the reflections that is telling the listener that the acoustic source is really a radically different shape. Reverberation levels and directions may be ‘asymmetric’ with the apparent direct sound.

In effect, the direct sound says we are listening to this:

Image result for zoe wanamaker cassandra

but the reverberation says it is something different.

Image result for zoe wanamaker cassandra

Might there be audible side effects from this? In the case of the dipole speaker, for example, the rear (antiphase) signal reflects off the back wall and some of it does make its way forwards to the listener. In my experience, this comes through as a certain ‘phasiness’ but it doesn’t seem to bother other people.

From a normal listening distance, most musical sources are small and appear close to being a ‘point source’. If we are going to add some more reverberation, should it not appear to be emanating as much as possible from a point source?

It is easy to say that reverberation is so complex that it is just a wash of ‘ambience’ and nothing more; all we need to do is give it the right ‘colour’, i.e. frequency response. And one of the reasons for using a ‘distributed speaker’ may be to reduce the amount of reverberation anyway. But I don’t think we should overdo it: we surely want to listen in real rooms because of the reverberation, not despite it. What is the most side-effect-free way to introduce this reverberation?

Clearly, some rooms are not ideal and offer too much of the wrong sort of reverberation. Maybe a ‘distributed speaker’ offers a solution, but is it as good as a conventional speaker in a suitable room? And is it really necessary, anyway? I think some people may be misguidedly attempting to achieve ‘perfect’ measurements by, effectively, eliminating the room from the sound even though their room is perfectly fine. How many people are intrigued by the CBT speaker above simply because it offers ‘better’ conventional in-room measurements, regardless of whether it is necessary?

Conclusion

‘Distributed speakers’ that use large, or multiple, transducers may achieve what they set out to do superficially, but are they free of side-effects?

I don’t have scientific proof, but I remain convinced that the ‘Rolls Royce’ of listening remains ‘point source’ monopole speakers in a large, carpeted, furnished room with a high ceiling. Box speakers with multiple drivers of different sizes are small and can be regarded as being very close to a single transducer, but are not so omnidirectional that they create too much reverberation. The acoustic ‘throw’ they produce is fairly ‘natural’. In other words, for stereo perfection, I think there is still a good chance that the types of rooms and speakers people were listening to in the 1970s remain optimal.

[Last edited 17.30 BST 09/05/17]