I get to hear the Kii Threes

Thanks to a giant favour from a new friend, I finally get to hear the Kii Threes…


A couple of Sundays ago, a large van arrived at my house containing two Kii Threes and their monumentally heavy stands, plus a pair of Linkwitz LX Minis with subwoofers along with their knowledgeable owner, John. It was our intention to spend the day comparing speakers.

We first set up the Kiis to compare against my ‘Keph’ speakers, and to do this we had to ‘interleave’ the two speaker pairs, giving slightly less stereo separation and symmetry than would have been ideal:

[Photo of the interleaved speaker setup, 2018-11-18]

Setting up went remarkably smoothly, and we soon had the Kiis running off Tidal on a laptop while the Kephs were fed with Spotify Premium – most tracks seemed to be available from both services. The Kiis are elegant in the simplicity of cabling and the lack of extraneous boxes.

John had set up the Kiis with his preferred downward frequency response slope, which starts at 3 kHz and ends 4 dB down (at 22 kHz?). I can’t say what significance this might have had for our listening comparison.

The original idea was to match the SPLs using pink noise and a sound level meter. This we did, but didn’t maintain such discipline for long. We were listening rather louder than I would normally, but this was inevitable because of the Kii’s amazing volume capabilities.

The bottom line is that the Kiis are spectacular! The main differences for me were that the Kiis were ‘smoother’ and the bass went deeper, and they seemed to show up the ‘ambience’ in many recordings more than the Kephs – more about that later. An SPL meter revealed that what sounded like equal volume required, in fact, a measurably higher SPL from the Kephs. Could this be our hearing registering the direct sound, but the Kiis’ superior dispersion abilities resulting in less reverberant sound – ignored by our conscious hearing but no doubt obscuring detail? Or possibly an artefact of their different frequency responses? We didn’t really have time to investigate this any further.

When standing a long way from both sets of speakers at the back of the room, the Kephs appeared to be emphasising the midrange more, and at the moment of changeover between speakers that contrast didn’t sound good; with a certain classical piano track, the Kephs seemed to render the sound* of the piano kind of ‘plinky plonk’ or toy-like compared to the Kiis – but then after about 10 seconds I got used to it. Without the Kiis to compare against I would have said my Kephs sounded quite good..! But the Kiis were clearly doing something very special.

I did try some ad hoc modifications of the Keph driver gains, baffle step slopes and so on, and we maybe got a bit closer in that regard. But I forgot about the -4dB slope that had been applied to the Kiis, and if I had thought about it, I already had an option in the Kephs’ config file for doing just that. But really, I wish I had had the courage of my convictions and left the frequency response ‘as is’.

Ultimately, I think that we were running into the very reason why the Kiis are designed the way they are: to resemble a big speaker. As the blurb for the Kii says:

“The THREE’s ability to direct bass is comparable to, but much better controlled than that of a traditional speaker several meters wide.”

It’s about avoiding reflections that blur bass detail, but as R.E. Greene explains, it’s also about frequency response:

“What is true of the mini-monitor, that it cannot be EQed to sound right, is also true of narrow-front floor-standers. They sound too midrange-oriented because of the nature of the room sound. This is something about the geometry of the design. It cannot be substantially altered by crossover decisions and so on.”

A conventional small speaker (and the Kephs are relatively small) cannot be equalised to give both a flat direct sound and a flat room sound. It has to be a compromise and, as I described before, I apply baffle step compensation to help bridge this discrepancy between the direct and ambient frequency balances. The results were, so I thought, rather acceptable, but the compromise shows up against a speaker with more controlled dispersion.

This must always be a factor in the sound of conventional speakers unless sitting very close to them. I do believe Bruno Putzeys when he says that large speakers (or those that cleverly simulate largeness) will always sound different from small ones. It would be interesting also to have compared the Kiis against my bigger speakers whose baffle step is almost an octave lower.

However, there was another difference that bothered me (with the usual sighted listening caveats) and this was ‘focus’. With the Kiis I heard lots of ‘ambience’ – almost ‘surround sound’ – but I didn’t hear a super-precise image. When the Kephs were substituted I heard a sudden snap into focus, and everything moved to mainly between and beyond the speakers. The sound was less ‘smooth’ but it was, to me, more focused.

And this is a question I still have about the Kiis and other speakers that utilise anti-phase. I see the animations on the Kii web site that show how the rear drivers cancel out the sound that would otherwise go behind the speaker. To do this, the rear drivers must deliver a measured quantity of accurately-timed anti-phase. This is a brilliant idea.

My question is, though: how complete is this cancellation if you partially obscure one of the side drivers (with another speaker, in this case)? I do wonder if I was hearing the results of anti-phase escaping into the room and messing up the imaging because of the way we had arranged the speakers – along with a mildly (possibly imaginary!) uncomfortable sensation in my ears and head.

To a person focused on frequency response measurements, it doesn’t matter whether sound is anti-, or in-, phase; it is just ‘frequency response material’ that gets chucked into bins and totted up at the end of the measurement. If it is delayed and reflected, then in graphs its effects appear no different from the visually-chaotic results of all room reflections; this is the usual argument against phase accuracy in forum discussions. “How can phase matter if it is shifted arbitrarily by reflections in the room, anyway?”.

However, to the person who acknowledges that the time domain is also important, anti-phase is a problem. If human hearing has the ability to separate direct sound from room sound, it is dependent on being able to register the time-delayed similarity between direct and reflected sound. If the reflected sound is inverted relative to the direct, that similarity is not as strong (we are talking about transients more than steady state waveforms). In fact, the reflected sound may partially register as a different source of sound.

Anti-phase is surely going to sound weird – and indeed it does, as anyone who has heard stereo speakers wired out of phase will attest. Where the listener registers in-phase stereo speakers as producing a precise image located at one point in space, out-of-phase speakers produce an image located nowhere and/or everywhere. The makers of pseudo-surround sound systems such as Q-Sound exploit this in order to create images that are not restricted to between the stereo speakers. This may be a factor in the open baffle sound that some people like (but I don’t!).

So I would suggest that allowing anti-phase to bounce around the room is going to produce unpredictable results. This is one reason why I am suspicious of any speaker that releases the backwave of the driver cone into the room. The more this can be attenuated (and its bandwidth restricted) the better.

With the Kiis, was I hearing the effect of less-than-perfect cancellation because of the obscuring of one of the side drivers? Or imagining it? Most people who have heard the Kiis remark on the precise imaging, so I fear that we managed to change something with our layout. Despite the Kiis’ very clever dispersion control system which supposedly makes them placement-independent, does it pay to be a little careful of placement and/or symmetry, anyway? For it not to matter would be miraculous, I would say.

In a recent review of the Kiis (not available online without a subscription), Martin Colloms says that with the Kiis he heard:

“…sometimes larger than life, out-of-the-box imaging”

I wonder if that could be a trace of what I was hearing..? Or maybe he means it as a pure compliment. In the same review he describes how the cardioid cancellation mechanism extends as far as 1kHz, so it is not just a bass phenomenon.


Next, John set up his DIY Linkwitz LX Mini speakers (which look very attractive, being based on vertical plastic tubes with small ‘pods’ on top), as well as their compact-but-heavy subwoofers. These were fed with analogue signals from a Raspberry Pi-based streamer and, again, sounded excellent. They also seek to control dispersion, in this case by purely acoustic means that I don’t yet understand. And they may also dabble a bit in backwave anti-phase.

If I had any criticism, it was that the very top end wasn’t quite as good as that of a conventional tweeter..? But it might be my imagination and expectation bias. Also, our ears and critical faculties were pretty far gone by that point…

Really, we had three systems all of which, to me, sounded good in isolation – but with the Kiis revealing their superior performance at the point of changeover. There were certainly moments of confusion when I didn’t know which system was operating and only the changeover gave the game away. I think all three systems were much better than what you often get at audio shows.

What we didn’t have were any problems with distortion, hum or noise. In these respects, all three systems just worked. The biggest source of any such problem was a laptop fan which kicked in sometimes when running Tidal.

There were lots of things we didn’t do. We didn’t try the speakers in different positions; we didn’t try different toe-in angles; we didn’t make frequency response measurements and do things in a particularly scientific way; we listened at pretty high volume and didn’t have the self-control to listen at lower volumes – which might have been more appropriate for some of the classical music. The room was ‘as it comes’: 6 x 3.4 x 2.4m, carpeted, plaster walls and ceiling, and floor-to-ceiling windows behind the speakers with a few boxes and bits of furniture lying about.


So my conclusion is that I have heard the Kiis and am highly impressed, but there might possibly be an extra level of focus and integrity I have yet to experience. I never got to the point where I could listen through the speakers rather than to them, but I am sure that this will happen at some point.

In the meantime I am having to learn to love my Kephs again – which actually isn’t too hard without the Kiis in the same room showing them up!


Footnotes:

*Since writing that paragraph I have found a mention of what may be that very phenomenon:

“…even a brief comparison with a real piano, say, will reveal the midrange-orientation of the narrow-front wide radiators.”


Still listening…

I’m still here, and still listening to my KEF (-derived) speakers most days. I honestly think that they are built to the right formula – although I keenly await the time when I get to hear the Kii Threes.

It all now seems so obvious, but it took me quite a while to disentangle myself from the frequency-domain-centric view to which most audio design people are committed – and into which their minds (and possibly ears) have been warped.

It is clear that human hearing does perform frequency domain analysis, but that it also uses other methods and ‘hardware’ in parallel to characterise what it is hearing. This means that an audio system needs to reproduce the signal without changing it in either the time or frequency domains.

The alternative is to second guess how human hearing works and to assume that arbitrary distortion of phase and timing has no effect. In fact, I would say it is not even as rational as that: what seems to have happened is that while carpentry-and-coil-based technology doesn’t explicitly control phase and timing, conventional 1970s speakers still sounded pretty good. The results have been retrospectively analysed and justified, and a model of human hearing developed to fit the speakers rather than the other way round.

This faulty model leads to ideas like bass reflex and ‘room correction’ that, viewed through the prism of not trying to second guess human hearing, seem as confused and deluded as they sound.

The result is the weird variability in audio systems that all ‘measure well’ – using the subset of measurements that satisfy the model – but sound disappointing even while costing the price of a car. It might even be worse than that: maybe recordings are being made while being monitored through ‘room correction’ resulting in the demise of high fidelity recordings as we know them.

And there’s another delusional idea that stems from the faulty model and the occasionally serendipitous characteristics of old technology: the notion that we listen to a signal rather than through a channel.

The conventional view is that we must change the signal to give the best sound – whether by equalisation or – bizarrely – adding distortion deliberately e.g. with valves or vinyl. If you do this, you are really changing the characteristics of the channel. In real music and acoustics there is no such thing as ‘a signal’ and whatever automatic processing you do of it is, in the general case, arbitrary and meaningless. For sure, you may find that distortion is a pleasing artistic effect on a particular (probably very simple) recording. But are you an artist? If so, you might be much better served by playing with a freeware PC recording studio app rather than churning equipment that represents several years of the retirement you may never get to enjoy.

The only coherent strategy is to reproduce the signal without touching it. In my experience, if you get anywhere near to this, it sounds magnificent. Not ‘neutral’; not ‘clinical’ but deep, open, rich, colourful – like real music.

Audiophile listening rooms: the hard floor phenomenon

[Image: a selection of Google image search results for “audiophile listening room”]

If you Google image search for audiophile listening room, you bring up a selection of images that often have a certain resemblance to each other. A large number of rooms feature exposed wood or stone tiled floors with relatively small rugs in the middle. They also don’t have much furniture in them. Some people choose to sit a long way away from the speakers.

These don’t immediately look like my idea of a good listening room. To me they look echoey – how my room sounded before the carpet was fitted.

This leads me to wonder whether some discussions over room treatments, room measurements and room correction may be at cross purposes. I often don’t understand why people become so interested in these things when my system seems to sound perfectly OK without any of that stuff. This may be the explanation.

The way I look at it, a wall-to-wall carpet may be the best room treatment you can fit, covering a very large area with minimal effort. And some other furniture may also be a good thing. Kind of like a 1970s living room – the sound you’ve been trying to recreate for the last few decades and had convinced yourself was just a false memory…

How Stereo Works

(Updated 03/06/18 to include results for Blumlein Pair microphone arrangement.)

initial

A computer simulation of stereo speakers plus listener, showing the listener’s perception of the directions of three sources that have previously been ‘recorded’. The original source positions are shown overlaid with the loudspeakers.

Ever since building DSP-based active speakers and hearing real stereo imaging effectively for the first time, it has seemed to me that ordinary stereo produces a much better effect than we might expect. In fact, it has intrigued me, and it has been hard to find a truly satisfactory explanation of how and why it works so well.

My experience of stereo audio is this:

  • When sitting somewhere near the middle between two speakers and listening to a ‘purist’ stereo recording, I perceive a stable, compelling 3D space populated by the instruments and voices in different positions.
  • The scene can occasionally extend beyond the speakers (and this is certainly the case with recordings made using Q-Sound and other such processes).
  • Turning my head, the image stays plausible.
  • If I move position, the image falls apart somewhat, but when I stop moving it stabilises again into a plausible image – although not necessarily resembling what I might have expected it to be prior to moving.
  • If I move left or right, the image shifts in the direction of the speaker I am moving towards.

An article in Sound On Sound magazine may contain the most perceptive explanation I have seen:

The interaction of the signals from both speakers arriving at each ear results in the creation of a new composite signal, which is identical in wave shape but shifted in time. The time‑shift is towards the louder sound and creates a ‘fake’ time‑of‑arrival difference between the ears, so the listener interprets the information as coming from a sound source at a specific bearing somewhere within a 60‑degree angle in front.

This explanation is more elegant than the one that simply says that if the sound from one speaker is louder we will tend to hear it as if coming from that direction – I have always found it hard to believe that such a ‘blunt’ mechanism could give rise to a precise, sharp 3D image. Similarly, it is hard to believe that time-of-arrival differences on their own could somehow be relayed satisfactorily from two speakers unless the user’s head was locked into a fixed central position.

The Sound On Sound explanation says that by reproducing the sound from two spaced transducers that can reach both ears, the relative amplitude also controls the relative timing of what reaches the ears, thus giving a timing-based stereo image that, it appears, is reasonably stable with position and head rotation. This is not a psychoacoustic effect where volume difference is interpreted as a timing difference, but the literal creation of a physical timing difference from a volume difference.
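
To make that mechanism concrete, here is a minimal sketch of my own (not from the article): the tone frequency, path lengths and pan law are all assumed values, and the point is simply that the sum of the two speaker signals at each ear acquires an effective arrival time pulled towards the louder speaker.

```c
/* A minimal sketch (my own, not from the article) of the mechanism described
 * above: the same tone leaves both speakers at different levels, each ear sums
 * the two arrivals, and the summed signal's effective arrival time is pulled
 * towards the louder speaker. Tone frequency, path lengths and the pan law are
 * illustrative assumptions. */
#include <stdio.h>
#include <math.h>

/* effective delay of  aA*sin(w*(t-dA)) + aB*sin(w*(t-dB)),  via phasor addition */
static double composite_delay(double w, double aA, double dA, double aB, double dB)
{
    double re = aA * cos(-w * dA) + aB * cos(-w * dB);
    double im = aA * sin(-w * dA) + aB * sin(-w * dB);
    return -atan2(im, re) / w;                       /* seconds */
}

int main(void)
{
    const double PI = acos(-1.0);
    const double c = 343.0;                          /* speed of sound, m/s */
    const double f = 500.0, w = 2.0 * PI * f;
    const double near = 2.00 / c, far = 2.12 / c;    /* near-ear and far-ear path delays */

    for (int i = 0; i <= 4; i++) {
        double panL = 0.5 + 0.1 * i, panR = 1.0 - panL;   /* simple amplitude pan */
        /* left ear: left speaker via the near path, right speaker via the far path;
         * right ear: the mirror image of that */
        double tLeft  = composite_delay(w, panL, near, panR, far);
        double tRight = composite_delay(w, panR, near, panL, far);
        printf("pan L=%.1f  inter-aural time difference = %+6.1f us\n",
               panL, (tRight - tLeft) * 1e6);
    }
    return 0;
}
```

With equal levels the computed inter-aural difference is zero; even a modest level imbalance produces tens of microseconds of genuine timing difference between the ears, which is the kind of cue hearing uses for localisation.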

There must be timbral distortion because of the mixing of the two separately-delayed renditions of the same impulse at each ear, but experience seems to suggest that this is either not significant or that the brain handles it transparently, perhaps because of the way it affects both ears.

Blumlein’s Patent

Blumlein’s original 1933 patent is reproduced here. The patent discusses how time-of-arrival may take precedence over volume-based cues depending on frequency content.

It is not immediately apparent to me that what is proposed in the patent is exactly what goes on in most stereo recordings. As far as I am aware, most ‘purist’ stereo recordings don’t exaggerate the level differences between channels, but simply record the straight signal from a pair of microphones. However, the patent goes on to make a distinction between “pressure” and “velocity” microphones which, I think, corresponds to omni-directional and directional microphones. It is stated that in the case of velocity microphones no amplitude manipulation may be needed. The microphones should be placed close together but facing in different directions (often called the ‘Blumlein Pair’) as opposed to being spaced as “artificial ears”.

Blumlein -Stereo.png

Blumlein Pair microphone arrangement

The Blumlein microphones are bi-directional i.e. they also respond to sound from the back.

Going by the SoS description, this type of arrangement would record no timing-based information (from the direct sound of the sources at any rate), just like ‘panpot stereo’, but the speaker arrangement would convert orientation-induced volume variations into a timing-based image derived from the acoustic summation of different volume levels via acoustic delays to each ear. This may be the brilliant step that turns a rather mundane invention (voices come from different sides of the cinema screen) into a seemingly holographic rendering of 3D space when played over loudspeakers.

Thus the explanation becomes one of geometry plus some guesswork regarding the way the ears and brain correlate what they are hearing, presumably utilising both time-of-arrival and the more prosaic volume-based mechanism which says that sounds closer to one ear than the other will be louder – enhanced by the shadowing effect of the listener’s head in the way. Is this sufficient to plausibly explain the brilliance of stereo audio? Does a stereo recording in any way resemble the space in which it was recorded?

A Computer Simulation

In order to help me understand what is going on I have created a computer simulation which works as follows (please skip this section unless you are interested in very technical details; a rough code sketch of the geometry step follows the list):

  • It is a floor plan view of a 2D slice through the system. Objects can be placed at any XY location, measured in metres from an origin.
  • There are no reflections; only direct sound.
  • The system comprises
    • a recording system:
      • Three acoustic sources, each of which generate an identical musical transient (loaded from a mono WAV file at CD quality). Each source is considered in isolation from the others.

      • Two microphones that can be spaced and positioned as desired. They can be omni-directional or have a directional response. In the former case, volume is attenuated with distance from the source while in the latter it is attenuated by both distance and orientation to the source.
    • a playback system:

      • Two omni-directional speakers

      • A listener with two ears and the ability to move around and turn his head.

  • The directions and distances from sources to microphones are calculated based on their relative positions, and from these the delays and attenuations of the signals at the microphones are derived. These signals are ‘recorded’.

  • During ‘playback’, the positions of the listener’s ears are calculated based on XY position of the head and its rotation.

  • The distances from speakers to each ear are calculated, and from these, the delays and attenuation thereof.

  • The composite signal from each source that reaches each ear via both speakers is calculated and from this is found:

    • relative amplitude ratio at the ears
    • relative time-of-arrival difference at the ears. This is currently obtained by correlating one ear’s summed signal for that source (from both speakers) against the other and looking for the delay corresponding to peak output of this. (There may be methods more representative of the way human hearing ascertains time-of-arrival, and this might be part of a future experiment).

  • There is currently no attempt to simulate HRTF or the attenuating effect of ‘head shadow’. Attenuation is purely based on distance to each ear.

  • The system then simulates the signals that would arrive at each ear from a virtual acoustic source were the listener hearing it live rather than via the speakers.

    • This virtual source is swept through the XY space in fine increments and at each position the ‘real’ relative timings and volume ratio that would be experienced by the listener are calculated.

    • The results are compared to the results previously found for each of the three sources as recorded and played back over the speakers, and plotted as colour and brightness in order to indicate the position the listener might perceive the recorded sources as emanating from, and the strength of the similarity.

  • The listener’s location and rotation can be incremented and decremented in order to animate the display, showing how the system changes dynamically with head rotation or position.
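
As a rough illustration of the geometry step in the list above, here is a minimal sketch in C. It is my own reconstruction, not the simulation’s actual code: the positions and the simple 1/distance attenuation law are assumptions.

```c
/* A rough sketch of the recording/playback geometry described above: source ->
 * microphone gives the 'recorded' channel delay and gain, speaker -> ear gives
 * the playback delay and gain, and each ear ends up with two arrivals whose sum
 * forms the composite signal. Positions and the 1/distance law are assumptions. */
#include <stdio.h>
#include <math.h>

typedef struct { double x, y; } Pt;
static const double C = 343.0;                       /* speed of sound, m/s */

static double dist(Pt a, Pt b)    { return hypot(a.x - b.x, a.y - b.y); }
static double delay_s(Pt a, Pt b) { return dist(a, b) / C; }
static double gain(Pt a, Pt b)    { return 1.0 / dist(a, b); }   /* assumed spreading law */

int main(void)
{
    Pt src    = { -1.5, 3.0 };                                  /* one acoustic source */
    Pt mic[2] = { { -0.15, 0.0 }, { 0.15, 0.0 } };              /* spaced omnis, 0.3 m apart */
    Pt spk[2] = { { -1.2, 3.0 }, { 1.2, 3.0 } };                /* playback speakers */
    Pt ear[2] = { { -0.08, 0.0 }, { 0.08, 0.0 } };              /* listener's ears */

    for (int e = 0; e < 2; e++) {
        printf("ear %d:\n", e);
        for (int ch = 0; ch < 2; ch++) {
            /* total path: source -> microphone 'ch' (recording), then speaker 'ch' -> ear */
            double d = delay_s(src, mic[ch]) + delay_s(spk[ch], ear[e]);
            double g = gain(src, mic[ch]) * gain(spk[ch], ear[e]);
            printf("  via channel %d: delay %.3f ms, gain %.3f\n", ch, d * 1e3, g);
        }
    }
    return 0;
}
```

From the two arrivals per ear, the full simulation then forms the composite signals and extracts the inter-aural timing (by correlation) and amplitude ratio, as described in the list.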

The results are very interesting!

Here are some images from the system, plus some small animations.

Spaced omni-directional microphones

In these images, the (virtual) signal was picked up by a pair of (virtual) omnidirectional microphones on either side of the origin, spaced 0.3m apart. This is neither a binaural recording (which would at least have the microphones a little closer together) nor the Blumlein Pair arrangement, but does seem to be representative of some types of purist stereo recording.

The positions of the three sources during (virtual) recording are shown overlaid with the two speakers, plus the listener’s head and ears. Red indicates response to SRC0; green SRC1; and blue SRC2.

head_rotation

Effect of head rotation on perceived direction of sources based on inter-aural timing when listener is close to the ‘sweet spot’.

side_to_side

Effect of side-to-side movement of listener on perceived imaging based on inter-aural timing.

compound_movement

Compound movement of listener, including front-to-back movement and head rotation.

amplitude

Effect of listener movement on perceived image based on inter-aural amplitudes.

Coincident directional microphones (Blumlein Pair)

Here, directional microphones are set at the origin at right angles to each other, as shown in the earlier diagram. They copy Blumlein’s description in the patent, i.e. output is proportional to the cosine of the angle of incidence.
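
As a tiny illustration of that cosine law (my own sketch, with the microphone axes at the usual ±45 degrees; the rest is assumed), the level each channel assigns to a source depends only on the source’s bearing:

```c
/* A small sketch of the figure-of-eight response described above: two coincident
 * microphones with axes at +/-45 degrees, each outputting cos(angle of incidence)
 * times the source level. Level differences encode direction; there is no timing
 * difference between the channels. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    const double PI = acos(-1.0);
    const double axisL = +45.0, axisR = -45.0;       /* microphone axes, degrees */

    for (int deg = -90; deg <= 90; deg += 30) {
        double gL = cos((deg - axisL) * PI / 180.0);
        double gR = cos((deg - axisR) * PI / 180.0);
        printf("source at %+3d deg: L gain %+.3f, R gain %+.3f\n", deg, gL, gR);
    }
    return 0;
}
```

A source straight ahead gets equal gains of about 0.707 in both channels; a source at +45 degrees appears in the left channel only – pure amplitude encoding of direction.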

blumlein_timing

Time-of-arrival based perception of direction as captured by a coincident pair of directional microphones (Blumlein Pair) and played back over stereo speakers, with compound movement of the listener.

blumlein_amplitude

A similar test, but showing perceived locations of the three sources based on inter-aural volume level

In no particular order, some observations on the results:

  • A stereo image based on time-of-arrival differences at the ears can be created with two spaced omni-directional microphones or coincident directional microphones. Note, the aim is not to ‘track’ the image with the user’s head movement (like headphones would), but to maintain stable positions in space even as the user turns away from ‘the stage’.
  • The Blumlein Pair gives a stable image with listener movement based on time-of-arrival. The image based on inter-aural amplitude may not be as stable, however.
  • Interaural timing can only give a direction, not distance.

  • A phantom mirror image of equal magnitude also accompanies the frontwards time-of-arrival-derived direction, but this would also be true of ‘real life’. The way this behaves with dynamic head movement isn’t necessarily correct; at some locations and listener orientations maybe the listener could be confused by this.

  • Relative volume at the two ears (as a ratio) gives a ‘blunt’ image that behaves differently from the time-of-arrival based image when the listener moves or turns their head. The plot shows that the same ratio can be achieved for different combinations of distance and angle, so on its own it is ambiguous.

  • Even if the time-of-arrival image stays meaningful with listener movement, the amplitude-based image may not.

  • Combined with timing, relative interaural volume might provide some cues for distance (not necessarily the ‘true’ distance).

  • No doubt other cues combining indirect ‘ambient’ reflections in the recording, comb-filtering, dynamic phase shifts with head movement, head-related transfer function, etc. are also used by the listener and these all contribute to the perception of depth.

  • The cues may not all ‘hang together’, particularly in the situation of movement of the listener, but the human brain seems to make reasonable sense of them once the movement stops.

  • The Blumlein Pair does, indeed, create a time-of-arrival-based image from amplitude variations alone. And this image is stable with movement of the listener – a truly remarkable result, I think.
  • Choice of microphone arrangement may influence the sound and stability of the image.
  • Maybe there is also an issue regarding the validity of different recording techniques when played back over headphones versus speakers. The Blumlein Pair gives no time-of-arrival cues when played over headphones.
  • The audio scene is generally limited to the region between the two speakers.
  • The simulation does not address ‘panpot’ stereo yet, although as noted earlier, the Blumlein microphone technique is doing something very similar.
  • In fact, over loudspeakers, the ‘panpot’ may actually be the most correct way of artificially placing a source in the stereo field, yielding a stable, time-of-arrival-based position.

Perhaps the thing that I find most exciting is that the animations really do seem to reflect what happens when I listen to certain recordings on a stereo system and shift position while concentrating on what I am hearing. I think that the directions of individual sources do indeed sometimes ‘flip’ or become ambiguous, and sometimes you need to ‘lock on’ to the image after moving, and from then on it seems stable and you can’t imagine it sounding any other way. Time-of-arrival and volume-based cues (which may be in conflict in certain listening positions), as well as the ‘mirror image’ time-of-arrival cue may be contributing to this confusion. These factors may differ with signal content e.g. the frequency ranges it covers.

It has occurred to me that in creating this simulation I might have been in danger of shattering my illusions about stereo, spoiling the experience forever, but in the end I think my enthusiasm remains intact. What looked like a defect with loudspeakers (the acoustic cross-coupling between channels) turns out to be the reason why it works so compellingly.

In an earlier post I suggested that maybe plain stereo from speakers was the optimal way to enjoy audio and I think I am more firmly persuaded of that now. Without having to wear special apparatus, have one’s ears moulded, make sure one’s face is visible to a tracking camera, or dedicate a large space to a central hot-seat, one or several listeners can enjoy a semi-‘holographic’ rendering of an acoustic recording that behaves in a logical way even as the listener turns their head. The system blends the listening room’s acoustics with the recording meaning that there is a two-way element to the experience whereby listeners can talk and move around and remain connected with the recording in a subtle, transparent way.

Conclusion

Stereo over speakers produces a seemingly realistic three-dimensional ‘image’ that remains stable with listener movement. How this works is perhaps more subtle than is sometimes thought.

The Blumlein Pair microphone arrangement records no timing differences between left and right, but by listening over loudspeakers, the directional volume variations are converted into time-of-arrival differences at the listener’s ears. The acoustic cross-coupling from each speaker to ‘the wrong ear’ is a necessary factor in this.

Some ‘purist’ microphone techniques may not be as valid as others when it comes to stability of the image or the positioning of sources within the field. Techniques that are appropriate for headphones may not be valid for speakers, and vice versa.

 

The First Lossy Codec

(probably).

Nowadays we are used to the concept of the lossy codec that can reduce the bit rate of CD-quality audio by a factor of, say, 5 without much audible degradation. We are also accustomed to lossless compression which can halve the bit rate without any degradation at all.

But many people may not realise that they were listening to digital audio and a form of lossy compression in the 1970s and 80s!

Early BBC PCM

As described here, the BBC were experimenting with digital audio as early as the 1960s, and in the early 70s they wired up much of the UK FM transmitter network with PCM links in order to eliminate the hum, noise, distortion and frequency response errors that were inevitable with the previous analogue links.

So listeners were already hearing 13-bit audio at a sample rate of 32 kHz when they tuned into FM radio in the 1970s. I was completely unaware of this at the time, and it is ironic that many audiophiles still think that FM radio sounds good but wouldn’t touch digital audio with a bargepole.

13 bits was pretty high quality in terms of signal-to-noise ratio, and the 32 kHz sample rate gave something approaching 15 kHz audio bandwidth which, for many people’s hearing, would be more than adequate. The quality was, however, objectively inferior to that of the Compact Disc that came later.

Downsampling to 10 bits

In the later 70s, in order to multiplex more stations into a lower bandwidth, the BBC wanted to compress higher quality 14-bit audio down to 10 bits.

As you may be aware, downsampling to a lower bit depth (i.e. requantising) leads to a higher level of background noise due to the reduced resolution and the mandatory addition of dither noise. For 10 bits with dither, the best that could be achieved would be a signal-to-noise ratio of 54 dB (I think I am right in saying), although the modern technique of noise shaping the dither can reduce the audibility of the quantisation noise.

This would not have been acceptable audible quality for the BBC.
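
To make the noise penalty of straight requantisation concrete, here is a minimal sketch of my own; the TPDF dither amplitude and rounding details are illustrative assumptions, not the BBC’s specification.

```c
/* A minimal sketch of plain requantisation with TPDF dither: every sample keeps
 * only the top 'bits' bits, so quiet passages sit permanently close to the new,
 * much higher noise floor. Details here are illustrative assumptions. */
#include <stdio.h>
#include <stdlib.h>

/* requantise a 14-bit sample to 'bits' bits, returning it rescaled to the
 * 14-bit range so that the added error is easy to see */
static int requantise(int sample14, int bits)
{
    int shift  = 14 - bits;
    int step   = 1 << shift;                          /* new quantisation step */
    int dither = (rand() % step) - (rand() % step);   /* triangular (TPDF) dither */
    int biased = sample14 + 8192 + step / 2 + dither; /* offset binary, rounded */
    return ((biased >> shift) << shift) - 8192;
}

int main(void)
{
    /* a quiet 14-bit ramp: at 10 bits the error is a sizeable fraction of the signal */
    for (int s = -40; s <= 40; s += 10) {
        int out = requantise(s, 10);
        printf("in %4d -> out %4d (error %+d)\n", s, out, out - s);
    }
    return 0;
}
```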

Companding Noise Reduction

Compression-expansion is a noise reduction technique that was already used with analogue tape recorders, e.g. the dbx noise reduction system. Here, the signal’s dynamic range is squashed during recording, i.e. the quiet sections are boosted in level, following a specific ‘law’. Upon replay, the inverse ‘law’ is followed in order to restore the original dynamic range. In doing so, any noise which has been added during recording is pushed back down in level, reducing its audibility.

With such a system, the recorded signal itself carries the information necessary to control the expander, so compressor and expander need to track each other accurately in terms of the relationships between gain, level and time. Different time constants may be used for ‘attack’ and ‘release’ and these are a compromise between rapid noise reduction and audible side effects such as ‘pumping’ and ‘breathing’. The noise itself is being modulated in level, and this can be audible against certain signals more than others. Frequency selective pre- and de-emphasis can also help to tailor the audible quality of the result.

The BBC investigated conventional analogue companding before they turned to the pure digital equivalent.

N.I.C.A.M

The BBC called their digital equivalent of analogue companding ‘NICAM’ (Near Instantaneously Companded Audio Multiplex). It is much, much simpler, and more precise and effective than the analogue version.

It is as simple as this:

  • Sample the signal at full resolution (14 bits for the BBC)
  • Divide the digitised stream into time-based chunks (1ms was the duration they decided upon);
  • For each chunk, find the maximum absolute level within it;
  • For all samples in that chunk, do a binary shift sufficient to bring all the samples down to within the target bit depth (e.g. 10 bits);
  • Transmit the shifted samples, plus a single value indicating by how much they have been shifted;
  • At the other end, restore the full range by shifting samples in the opposite direction by the appropriate number of bits for each chunk.

Using this system, all ‘quiet chunks’ i.e. those already below the 10 bit maximum value are sent unchanged. Chunks containing values that are higher in level than 10 bits lose their least significant bits, but this loss of resolution is masked by the louder signal level. Compared to modern lossy codecs, this method requires minimal DSP and could be performed without software using dedicated circuits based on logic gates, shift registers and memory chips.
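
Here is a minimal sketch of that scheme in C, written from the description above rather than from the broadcast specification; the block length of 32 samples corresponds to 1 ms at 32 kHz, and everything else is illustrative.

```c
/* A minimal sketch of near-instantaneous companding as described above: 14-bit
 * samples in 32-sample (1 ms at 32 kHz) blocks, squeezed to 10 bits per sample
 * plus one shift value per block. (Assumes the usual arithmetic right shift for
 * negative values.) Written from the blog's description, not the NICAM spec. */
#include <stdio.h>
#include <stdlib.h>

#define BLOCK 32

/* compress one block: writes 10-bit samples to out10, returns the shift applied */
static int nicam_compress(const int in14[BLOCK], int out10[BLOCK])
{
    int peak = 0;
    for (int i = 0; i < BLOCK; i++)
        if (abs(in14[i]) > peak) peak = abs(in14[i]);

    int shift = 0;                        /* number of least significant bits to drop */
    while ((peak >> shift) > 511)         /* 511 = largest magnitude representable in 10 bits */
        shift++;

    for (int i = 0; i < BLOCK; i++)
        out10[i] = in14[i] >> shift;      /* quiet blocks (shift == 0) pass through unchanged */
    return shift;
}

/* expand one block: restore the nominal level; the dropped LSBs are gone for good */
static void nicam_expand(const int in10[BLOCK], int shift, int out14[BLOCK])
{
    for (int i = 0; i < BLOCK; i++)
        out14[i] = in10[i] << shift;
}

int main(void)
{
    int loud[BLOCK], coded[BLOCK], decoded[BLOCK];
    for (int i = 0; i < BLOCK; i++)
        loud[i] = (i - 16) * 400;         /* a made-up loud 14-bit block */

    int shift = nicam_compress(loud, coded);
    nicam_expand(coded, shift, decoded);
    printf("shift = %d; first sample in %d, out %d\n", shift, loud[0], decoded[0]);
    return 0;
}
```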

You may be surprised at how effective it is. I have written a program to demonstrate it, and in order to really emphasise how good it is, I have compressed the original signal into 8 bits, not the 10 that the BBC used.

In the following clip, a CD-quality recording has been converted as follows:

  • 0-10s is the raw full-resolution data
  • 10-20s is the sound of the signal downsampled to 8 bits with dither – notice the noise!
  • 20-40s is the signal compressed NICAM-style into 8 bits and restored at the other end.

I think it is much better than we might have expected…

(I was wanting to start with high quality, so I got the music extract from here:

http://www.2l.no/hires/index.html

This is the web site of a label providing extracts of their own high quality recordings in various formats for evaluation purposes. I hope they don’t mind me using one of their excellent recorded extracts as the source for my experiment).

“Great Midrange”

If you do a Google search for audio “great midrange” you get a lot of hits – such descriptions of audio systems’ performance are common currency in audiophilia.

But what a peculiar idea: that something that is supposed to sound like music should be judged on the basis of the sound of its “midrange”. What is “midrange”? It has nothing whatsoever to do with music, art, performance, experiencing a concert. Has anyone ever said “This orchestra has great midrange” or “This hall has great midrange”?

I think that the common use of such descriptions reveals some assumptions:

  • all audio systems are assumed to have distinct hardware-related, non-musical, non-acoustic characteristics; they can be compared and ranked on that basis
  • audiophiles are not holding out for the whole, but are prepared to live with systems on the basis of isolated arbitrary non-music related characteristics like “great midrange”
  • the signal is built from non-acoustic, non-musical components like “midrange” rather than being a unique, complex composite of sources and acoustics
  • we can manipulate components in the signal like “midrange” in a meaningful way; the signal is like soup and we can flavour it any way we like
  • audio systems are so poor that it is better to rate them as a collection of components like “midrange” rather than the ways in which they deviate from perfection – it’s quicker that way.
  • it is not possible to describe meaningfully the sound of most audio systems (it’s not as if it’s like listening to real music), so we have to devise a new language to describe it.
  • audiophilia is about listening to the system not through it

I think this pitiful set of assumptions would mystify and put off the intelligent novice who might be curious about audio. As soon as they began to research the subject (as you’re surely supposed to do?) they would experience cognitive dissonance between their existing notion of what an audio system is for, and the apparent priorities and language used by the experts.

I think it is easy to see that not many people would persist in trying to find their way around this weird sub-culture and would simply buy an Apple Homepod instead. The audiophiles have, in effect, passively appropriated the world of high quality recorded music for themselves, and the only outsiders granted access are those prepared to go through a bewildering, arduous re-orientation process.

How to re-sample an audio signal

As I mentioned earlier, I would like to have the flexibility of using digital audio data that originates outside the PC that is performing the DSP, and this necessarily will have a different sample clock. Something has got to give!

If the input was analogue, you would just sample it with an ADC locked to your DAC’s sample rate, and then the source’s own sample rate wouldn’t matter to you. With a standard digital audio source (e.g. S/PDIF) you need to be able to do the same thing but purely in software. The incoming sampled data points are notionally turned into a continuous waveform in memory by duplicating a DAC reconstruction filter using floating point maths. You can then sample it wherever you want at a rate locked to the DAC’s sample rate.

You still ‘eat’ the incoming data at the rate at which it comes in, but you vary the number of samples that you ‘decimate’ from it (very, very slightly).

The control algorithm for locking this re-sampling to the DAC’s sample rate is not completely trivial, because the PC’s only knowledge of the sample rates of the DAC and S/PDIF is via notifications that large chunks of data have arrived or left, with unknown amounts of jitter. It is only possible to establish an accurate measure of relative sample rates with a very long time constant average. In reality the program never actually calculates the sample rate at all, but merely maintains a constant-ish difference between the read and write pointer positions of a circular buffer. It relies on adequate latency and the two sample rates being reasonably stable by virtue of being derived from crystal oscillators. The corrections will, in practice, be tiny and/or occasional.
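
That control idea can be sketched as a very small feedback loop. The following is my own toy model, not the blog’s program; the buffer size, smoothing constant and gain are illustrative guesses.

```c
/* A toy model of the buffer-level control described above: never measure either
 * clock directly, just nudge the resampling ratio so that the gap between the
 * circular buffer's write and read pointers stays near a target. All constants
 * are illustrative guesses. */
#include <stdio.h>

#define BUF_LEN     65536                 /* circular buffer length in frames */
#define TARGET_FILL (BUF_LEN / 2)         /* the latency we try to hold */

static double smoothed_fill = TARGET_FILL;
static double ratio = 1.0;                /* input frames consumed per output frame */

/* called regularly (e.g. once per output frame) with the current buffer fill */
static double update_ratio(double fill)
{
    /* very long time constant, so chunk-sized deliveries and jitter average out */
    smoothed_fill += 0.001 * (fill - smoothed_fill);
    /* tiny proportional nudge towards the target fill level */
    ratio = 1.0 + 1e-6 * (smoothed_fill - TARGET_FILL);
    return ratio;
}

int main(void)
{
    /* pretend the source runs 100 ppm fast relative to the DAC: the buffer starts
     * to fill, and the ratio creeps up until consumption matches supply */
    double write_pos = TARGET_FILL, read_pos = 0.0;
    for (long n = 0; n < 10000000; n++) {
        write_pos += 1.0001;                              /* frames arriving from S/PDIF */
        read_pos  += update_ratio(write_pos - read_pos);  /* frames consumed by the resampler */
    }
    printf("settled ratio = %.6f (the source was 1.0001x nominal)\n", ratio);
    return 0;
}
```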

How is the interesting problem of re-sampling solved?

Well, it’s pretty new to me, so in order to experiment with it I have created a program that runs on a PC and does the following:

  1. Synthesises a test signal as an array of floating point values at a notional sample rate of 44.1 kHz. This can be a sine wave, or combination of different frequency sine waves.
  2. Plots the incoming waveform as time domain dots.
  3. Plots the waveform as it would appear when reconstructed with the sinc filter. This is a sanity check that the filter is doing approximately the right thing.
  4. Resamples the data at a different sample rate (can be specified with any arbitrary step size e.g. 0.9992 or 1.033 or whatever), using floating point maths. The method can be nearest-neighbour, linear interpolation, or sinc & linear interpolation.
  5. Plots the resampled waveform as time domain dots.
  6. Passes the result into an FFT (65536 points), windowing the data with a raised cosine window.
  7. Plots the resulting resampled spectrum in terms of frequency and amplitude in dB.

This is an ideal test bed for experimenting with different algorithms and getting a feel for how accurate they are.

Nearest-neighbour and linear interpolation are pretty self explanatory methods; the sinc method is similar to that described here:

https://www.dsprelated.com/freebooks/pasp/Windowed_Sinc_Interpolation.html

I haven’t completely reproduced (or necessarily understood) their method, but I was inspired by this image:

[Figure: the windowed-sinc interpolation ‘Waveforms’ diagram from the linked article]

The sinc function is the ideal ‘brick wall’ low pass filter and is calculated as sin(x*PI)/(x*PI). In theory it extends from minus to plus infinity, but for practical uses is windowed so that it tapers to zero at plus or minus the desired width – which should be as wide as practical.

The filter can be set at a lower cutoff frequency than Nyquist by stretching it out horizontally, and this would be necessary to avoid aliasing if wishing to re-sample at an effectively slower sample rate.

If the kernel is slid along the incoming sample points and a point-by-point multiply and sum is performed, the result is the reconstructed waveform. What the above diagram shows is that the kernel can be in the form of discrete sampled points, calculated as the values they would be if the kernel was centred at any arbitrary point.

So resampling is very easy: simply synthesise a sinc kernel in the form of sampled points based on the non-integer position you want to reconstruct, and multiply-and-add all the points corresponding to it.
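
Here is a minimal sketch of that idea in C – my own illustration rather than the program described in this post, using the raised cosine window discussed just below, and with no cutoff stretching (so it assumes we are not resampling to a significantly lower rate):

```c
/* A minimal sketch of windowed-sinc interpolation at an arbitrary non-integer
 * position: build a raised-cosine-windowed sinc centred on the wanted position
 * and multiply-and-add it against the neighbouring input samples. */
#include <stdio.h>
#include <math.h>

static double sinc(double x)
{
    const double PI = acos(-1.0);
    return (fabs(x) < 1e-12) ? 1.0 : sin(PI * x) / (PI * x);
}

/* interpolate input[] at non-integer position 'pos', using 'half' taps each side */
static double sinc_interp(const double *input, long len, double pos, int half)
{
    const double PI = acos(-1.0);
    long centre = (long)floor(pos);
    double frac = pos - centre, sum = 0.0;

    for (int k = -half; k <= half; k++) {
        long idx = centre + k;
        if (idx < 0 || idx >= len) continue;                 /* treat outside as silence */
        double x = (double)k - frac;                         /* distance from wanted position */
        double w = 0.5 * (1.0 + cos(PI * x / (half + 1)));   /* raised cosine, zero at the edges */
        sum += input[idx] * sinc(x) * w;
    }
    return sum;
}

int main(void)
{
    const double PI = acos(-1.0);
    double in[200];
    for (int n = 0; n < 200; n++) in[n] = sin(2.0 * PI * 0.05 * n);   /* test tone */

    /* step through the 'reconstructed' waveform at 0.9 input samples per output sample */
    for (double pos = 50.0; pos < 54.0; pos += 0.9)
        printf("pos %6.2f: interpolated %+.6f, analytic %+.6f\n",
               pos, sinc_interp(in, 200, pos, 50), sin(2.0 * PI * 0.05 * pos));
    return 0;
}
```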

A complication is the necessity to shorten the filter to a practical length, which involves windowing the filter i.e. multiplying it by a smooth function that tapers to zero at the edges. I did previously mention the Lanczos kernel which apparently uses a widened copy of the central lobe of the sinc function as the window. But looking at it, I don’t know why this is supposed to be a good window function because it doesn’t taper gradually to zero, and at non-integer sample positions you would either have to extend it with zeroes abruptly, or accept non-zero values at its edges.

Instead, I have decided to use a simple raised cosine as the windowing function, and to reduce its width slightly to give me some leeway in the kernel’s position between input samples. At the extremities I ensure it is set to zero. It seems to give a purer output than my version of the Lanczos kernel.

Pre-calculating the kernel

Although very simple, calculating the kernel on-the-fly at every new position would be extremely costly in terms of computing power, so the obvious solution is to use lookup tables. The pre-calculated kernels on either side of the desired sample position are evaluated to give two output values. Linear interpolation can then be used to find the value at the exact position. Because memory is plentiful in PCs, there is no need to skimp on the number of pre-calculated kernels – you could use a thousand of them. For this reason, the errors associated with this linear interpolation can be reduced to a negligible level.

The horizontal position of the raised cosine window follows the position of the centre of the kernel for all the versions that are calculated to lie in between the incoming sample points.
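
Putting the last two sections together, here is a minimal sketch of the table-driven version – again my own illustration: the kernel half-width of 50 and the 500 fractional offsets echo numbers used later in the post, but everything else is assumed.

```c
/* A minimal sketch of the pre-calculated-kernel approach described above: build
 * a table of raised-cosine-windowed sinc kernels at many fractional offsets,
 * then for each output sample sweep the two kernels either side of the wanted
 * offset and linearly interpolate between the two results. */
#include <stdio.h>
#include <math.h>

#define HALF   50                         /* taps each side of centre */
#define TAPS   (2 * HALF + 1)
#define PHASES 500                        /* fractional offset steps between adjacent samples */

static double kernel[PHASES + 1][TAPS];

static void build_kernels(void)
{
    const double PI = acos(-1.0);
    for (int p = 0; p <= PHASES; p++) {
        double frac = (double)p / PHASES;                        /* offset, 0..1 sample */
        for (int k = -HALF; k <= HALF; k++) {
            double x = (double)k - frac;
            double s = (fabs(x) < 1e-12) ? 1.0 : sin(PI * x) / (PI * x);
            double w = 0.5 * (1.0 + cos(PI * x / (HALF + 1)));   /* raised cosine window */
            kernel[p][k + HALF] = s * w;
        }
    }
}

static double resample_at(const double *in, long len, double pos)
{
    long centre = (long)floor(pos);
    double frac = pos - centre;
    int p0 = (int)(frac * PHASES);                               /* kernels bracketing 'frac' */
    double blend = frac * PHASES - p0;

    double y0 = 0.0, y1 = 0.0;
    for (int k = -HALF; k <= HALF; k++) {
        long idx = centre + k;
        if (idx < 0 || idx >= len) continue;
        y0 += in[idx] * kernel[p0][k + HALF];
        y1 += in[idx] * kernel[p0 + 1][k + HALF];
    }
    return y0 + blend * (y1 - y0);                               /* linear interpolation between kernels */
}

int main(void)
{
    const double PI = acos(-1.0);
    static double in[1000];
    for (int n = 0; n < 1000; n++) in[n] = sin(2.0 * PI * 0.1 * n);

    build_kernels();
    double pos = 100.0;
    for (int n = 0; n < 5; n++, pos += 0.9992)                   /* an illustrative rate ratio */
        printf("out[%d] = %+.6f (analytic %+.6f)\n",
               n, resample_at(in, 1000, pos), sin(2.0 * PI * 0.1 * pos));
    return 0;
}
```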

All that remains is to decide how wide the kernel needs to be for adequate accuracy in the reconstruction – and this is where my demo program comes in. I apologise that there now follows a whole load of similar looking graphs, demonstrating the results with various signals and kernel sizes, etc.

1 kHz sine wave

First we can look at the standard test signal: a 1 kHz sine wave. In the following image, the original sine wave points are shown joined with straight lines at the top right, followed by how the points would look when emerging from a DAC that has a sinc-based reconstruction filter (in this case, the two images look very similar).

Next down among the three time domain waveforms comes the resampled waveform, after we have resampled it to shift its frequency by a factor of 0.9 (a much larger ratio than we will use in practice). In this first example, the resampling method being used is ‘nearest neighbour’. As you can see, the results are disastrous!

sin_1k_nn

1kHz sine wave, frequency shift 0.9, nearest neighbour interpolation

The discrete steps in the output waveform are obvious, and the FFT shows huge spikes of distortion.

Linear interpolation is quite a bit better in terms of the FFT, and the time domain waveform at the bottom right looks much better.

sin_1k_li

1kHz sine wave, frequency shift 0.9, linear interpolation

However, the FFT magnitude display reveals that it is clearly not ‘hi-fi’.

Now, compare the results using sinc interpolation:

sin_1k_sinc_50_0.9

1kHz sine wave, frequency shift 0.9, sinc interpolation, kernel width 50

As you can see, the FFT plot is absolutely clean, indicating that this result is close to distortion-free.

Next we can look at something very different: a 20 kHz sine wave.

20 kHz sine wave

sin_20k_nn

20 kHz sine wave, frequency shift 0.9, nearest neighbour interpolation

With nearest neighbour resampling, the results are again disastrous. At the right hand side, though, the middle of the three time domain plots shows something very interesting: even though the discrete points look nothing like a sine wave at this frequency, the reconstruction filter ‘rings’ in between the points, producing a perfect sine wave with absolutely uniform amplitude. This is what is produced by any normal DAC – and is something that most people don’t realise; they often assume that digital audio falls apart at the top end, but it doesn’t: it is perfect.

Linear interpolation is better than nearest-neighbour, but pretty much useless for our purposes.

sin_20k_li

20kHz sine wave, frequency shift 0.9, linear interpolation

Sinc interpolation is much better!

sin_20k_sinc_50

20kHz sine wave, frequency shift 0.9, sinc interpolation, kernel size 50

However, there is an unwanted spike at the right hand side (note the main signal is at 18 kHz because it has been shifted down by a factor of 0.9). This spike appears because of inadequate width of the sinc kernel which in this case has been set at 50 (with 500 pre-calculated versions of it with different time offsets, between sample points).

If we increase the width of the kernel to 200 (actually 201 because the kernel is always symmetrical about a central point with value 1.0), we get this:

sin_20k_sinc_200

20kHz sine wave, frequency shift 0.9, sinc interpolation, kernel size 200

The spike is almost at acceptable levels. Increasing the width to 250 we get this:

sin_20k_sinc_250

20 kHz sine wave, frequency shift 0.9, sinc interpolation, kernel size 250

And at 300 we get this:

sin_20k_sinc_300

20 kHz sine wave, frequency shift 0.9, sinc interpolation, kernel size 300

Clearly the kernel width does need to be in this region for the highest quality.

For completeness, here is the system working on a more complex waveform comprising the sum of three frequencies: 14, 18 and 19 kHz, all at the same amplitude and a frequency shift of 1.01.

14 kHz, 18 kHz, 19 kHz sum

Nearest neighbour:

sin_14_18_19_nn

14, 18, 19 kHz sine wave, nearest neighbour interpolation

Linear interpolation:

sin_14_18_19_li
14, 18, 19 kHz sine wave, linear interpolation

Sinc interpolation with a kernel width of 50:

sin_14_18_19_sinc_50

14, 18, 19 kHz sine wave, sinc interpolation, kernel width 50

Kernel width increased to 250:

sin_14_18_19_sinc_250
14, 18, 19 kHz sine wave, sinc interpolation, kernel width 250

More evidence that the kernel width needs to be in this region.

Ready made solutions

Re-sampling is often done in dedicated hardware like Analog Devices’ AD1896. Some advanced sound cards like the Creative X-Fi can re-sample everything internally to a common sample rate using powerful dedicated processors – this is the solution that makes connecting digital audio sources together almost as simple as analogue.

In theory, stuff like this goes on inside Linux already, in systems like JACK – apparently. But it just feels too fragile: I don’t know how to make sure it is working, and I don’t really have any handle on the quality of it. This is a tricky problem to solve by trial-and-error because a system can run for ages without any sign that clocks are drifting.

In Windows, there is a product called “Virtual Audio Cable” that I know performs re-sampling using methods along these lines.

There are libraries around that supposedly can do resampling, but the quality is unknown – I was looking at one that said “Not the best quality” so I gave up on that one.

I have a feeling that much of the code was developed at a time when processors were much less powerful than they are now and so the algorithms are designed for economy rather than quality.

Software-based sinc resampling in practice

I have grafted the code from my demo program into my active crossover application and set it running with TOSLink from a CD player going into a cheap USB sound card (Maplin) which my software uses to acquire the stream, and my software’s output going to a better multichannel sound card (the Xonar U7). The TOSLink data is being resampled in order to keep it aligned with the U7’s sample rate. I have had it running for 20 hours without incident.

Originally, before developing the test bed program, I set the kernel size at 50, fearing that anything larger would stress the Intel Atom CPU. However, I now realise that a width of at least 250 is necessary, so with trepidation I upped it to this value. The CPU load trace went up a bit in the Ubuntu system monitor, but not much; the cores are still running cool. The power of modern CPUs is ridiculous!! Remember that for each of the two samples arriving at 44.1 kHz, the algorithm is performing 500 floating point multiplications and sums, yet it hardly breaks into a sweat. There are absolutely no clever efficiencies in the programming. Amazing.

Active crossover with Raspberry Pi?

I was a bit bored this afternoon and finally managed to put myself into the frame of mind to try transplanting my active crossover software onto a Raspberry Pi.

It turns out it works, but it’s a bit delicate: although CPU usage seems to be about 30% on average, extra activity on the RPi can cause glitches in the audio. But I have established in principle that the RPi can do it, and that the software can simply be transplanted from a PC to the RPi – quite an improbable result I think!

A future-proof DSP box?

What I’d like to do is: build a box that can implement my DSP ‘formula’, that isn’t connected to the internet, takes in stereo S/PDIF, and gives out six channels of analogue.

Is this the way to get a future-proof DSP box that the Powers-That-Be can’t continually ‘upgrade’ into obsolescence? In other words, I would always be able to connect the latest PCs, streamers, Chromecast to it without relying on the same box having to be the source of the stereo audio itself (which currently means that every time it is booted up it could stop working because of some trivial – or major – change that breaks the system). Witness only this week, when Spotify ‘upgraded’ its system and consigned many dedicated smart speakers’ streaming capability to oblivion. The only way to keep up with such changes is to be an IT-support person, staying current with updates and potentially making changes to code.

To avoid this, surely there will always have to be cheap boxes that connect to the internet and give out S/PDIF or TOSLink, maintained by genuine IT-support people, rather than me having to do it. (Maybe not…. It’s possible that if fitment of MQA-capable chips becomes universal in all future consumer audio hardware, they could eventually decide it is viable to enable full data encryption and/or restrict access to unencrypted data to secure, licensed hardware only).

It’s unfortunate, because it automatically means an extra layer of resampling in the system (because the DAC’s clock is not the same as the source’s clock), but I can persuade myself that it’s transparent. If the worst comes to the very worst in future, the box could also have analogue inputs, but I hope it doesn’t come to that.

This afternoon’s exercise was really just to see if it could be done with an even cheaper box than a fanless PC and, amazingly, it can! I don’t know if anyone else out there is like me, but while I understand the guts of something like DSP, it’s the peripheral stuff I am very hazy on. To me, to be able to take a system that runs on an Intel-based PC and make it run on a completely different processor and chipset without major changes is so unlikely that I find the whole thing quite pleasing.

[UPDATE 18/02/18] This may not be as straightforward as I thought. I have bought one of these for its S/PDIF input (TOSLink, actually). This works (being driven by a 30-year-old CD player for testing), but it has focused my mind on the problem of sample clock drift:

My own resampling algorithm?

S/PDIF runs at the sender’s own rate, and my DAC will run at a slightly different rate. It is a very specialised thing to be able to reconcile the two, and I am no longer convinced that Linux/Alsa has a ready-made solution. I am feeling my way towards implementing my own resampling algorithm..!

At the moment, I regulate the sample rate of a dummy loopback driver that draws data from any music player app running on the Linux PC. Instead of this, I will need to read data in at the S/PDIF sample rate and store it in the circular buffer I currently use. The same mechanism that regulates the rate of the loopback driver will now control the rate at which data is drawn from this circular buffer for processing, and the values will need to be interpolated in between the stored values using convolution with a windowed sinc kernel. It’s a horrendous amount of calculation that the CPU will have to do for each and every output sample – probably way beyond the capabilities of the Raspberry Pi, I’m afraid. This problem is solved in some sound cards by using dedicated hardware to do resampling, but if I want to make a general purpose solution to the problem, I will need to bite the bullet and try to do it in software. Hopefully my Intel Atom-based PC will be up to the job. It’s a good job that I know that high res doesn’t sound any different to 16/44.1, otherwise I could be setting myself up for needing a supercomputer.

[UPDATE 20/02/18] I couldn’t resist doing some tests and trials with my own resampling code.

Resampling Experiments

First, to get a feel for the problem and how much computing power it will take, I tried running some basic multiplies and adds on a Windows laptop, programmed in ‘C’. Using a small filter kernel size of 51, and assuming two sweeps per sample (one for each of the two bracketing pre-calculated kernels, then a trivial interpolation between), it could only just keep up with stereo CD in real time. Disappointing, and a problem if the PC is having to do other stuff. But then I realised that the compiler had all optimisations turned off. Optimising for maximum speed, it was blistering! At least 20x real time.

I tried the same test on a Raspberry Pi: even that could keep up, at about 3x real time.
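
For anyone curious what “basic multiplies and adds” means here, the test was essentially something like this (a rough reconstruction, not the exact code): two 51-tap dot products per output sample plus a linear interpolation, timed over ten seconds’ worth of one channel. Build it with and without optimisation (e.g. gcc -O3) to see the difference I saw.

/* Rough benchmark sketch: two 51-tap kernel sweeps plus a lerp per
   output sample, roughly what one channel of the resampler costs. */
#include <stdio.h>
#include <time.h>

#define TAPS   51
#define NSAMP  (44100 * 10)             /* ten seconds of one channel */

int main(void)
{
    static float in[NSAMP + TAPS];
    static float k0[TAPS], k1[TAPS];    /* two adjacent pre-calculated kernels */
    float acc = 0.0f;                   /* keeps the optimiser honest */

    for (int i = 0; i < TAPS; i++) { k0[i] = 0.01f * i; k1[i] = 0.02f * i; }
    for (int i = 0; i < NSAMP + TAPS; i++) in[i] = (float)(i & 255);

    clock_t t0 = clock();
    for (int n = 0; n < NSAMP; n++) {
        float s0 = 0.0f, s1 = 0.0f;
        for (int t = 0; t < TAPS; t++) {
            s0 += in[n + t] * k0[t];
            s1 += in[n + t] * k1[t];
        }
        acc += s0 + 0.5f * (s1 - s0);   /* trivial interpolation between sweeps */
    }
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    printf("%.3f s for 10 s of audio (x%.1f real time), acc=%f\n",
           secs, 10.0 / secs, acc);
    return 0;
}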

There may be other tricks to try as well, including processor-specific optimisations, programming for ‘SIMD’ (where the CPU does identical calculations on vectors, i.e. arrays of values, simultaneously), or kicking off threads to work on parts of the calculation so that the operating system can share the tasks across the processor cores. Or maybe that’s what the compiler optimisation is doing already.
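
As an illustration of the SIMD idea, the inner kernel sweep could be written with x86 SSE intrinsics to process four floats at a time. This is purely illustrative (the Raspberry Pi would need NEON instead, and I haven’t tried it in the real program); an optimising compiler may well auto-vectorise the plain loop anyway.

/* SSE sketch of the kernel sweep: four multiply-adds per iteration. */
#include <immintrin.h>

static float dot_sse(const float *x, const float *k, int n)
{
    __m128 acc = _mm_setzero_ps();
    int i = 0;
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(x + i),
                                         _mm_loadu_ps(k + i)));
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    float sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < n; i++)                  /* leftover taps (51 is not a multiple of 4) */
        sum += x[i] * k[i];
    return sum;
}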

There is also the possibility that for a larger (higher quality) kernel (say >256 values), an FFT might be a more economical way of doing the convolution.

Either way, it seems very promising.

Lanczos Kernel

I then wrote a basic system for testing the actual resampling in non-real time. The idea is, effectively, to perform the job of a DAC reconstruction filter in software, and then to be able to pick off the reconstructed value at any non-integer sample time. To do this ‘properly’ it is necessary to sweep a sinc kernel across the samples on either side of the desired sample time, i.e. to convolve them. Here’s where it gets interesting: the kernel’s coefficients can be calculated as though the sinc function were centred on the exact non-integer sample time desired, even though the taps themselves are aligned with the integer sample times.

It would be possible to calculate a new, exact kernel on the fly for every output sample, but this would be very processor intensive. Instead, it is possible to pre-calculate a range of kernels representing a number of fractional positions between adjacent samples. In operation, the two kernels on either side of the desired non-integer sample time are each swept and accumulated, and linear interpolation between the two results is then used to find the value at the exact sample time.

You may be horrified at the thought of linear interpolation until you realise that several thousand kernels could be pre-calculated and stored in memory, so that the error of the linear interpolation would be extremely small indeed.

Of course a true sinc function extends to plus and minus infinity, so for practical filtering it needs to be windowed, i.e. shortened and tapered to zero at the edges. Apparently – and I am no mathematician – a very good window is a widened copy of the sinc function’s own central lobe, and the windowed result is known as the Lanczos kernel.
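
Putting the last few paragraphs together, here is a condensed sketch of the approach (my own reconstruction of the idea rather than the exact code; the kernel width, the number of stored phases and the lack of per-kernel normalisation are all illustrative): a table of Lanczos-windowed sinc kernels is pre-calculated at many fractional offsets, and each output sample is formed by sweeping the two kernels either side of the required fractional position and interpolating linearly between the two results.

/* Sketch of fractional resampling with pre-calculated Lanczos kernels.
   A real implementation would also normalise each kernel so its taps sum to one. */
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define A       8                       /* Lanczos 'a' (half-width in samples) */
#define TAPS    (2 * A + 1)
#define PHASES  4096                    /* fractional positions stored */

static float kernels[PHASES][TAPS];

static double sinc(double x)
{
    return x == 0.0 ? 1.0 : sin(M_PI * x) / (M_PI * x);
}

/* Lanczos window: the sinc's central lobe stretched over the kernel width. */
static double lanczos(double x)
{
    return fabs(x) < A ? sinc(x) * sinc(x / A) : 0.0;
}

static void build_kernels(void)
{
    for (int p = 0; p < PHASES; p++) {
        double frac = (double)p / PHASES;           /* 0 .. <1 between samples */
        for (int t = 0; t < TAPS; t++)
            kernels[p][t] = (float)lanczos((t - A) - frac);
    }
}

/* x points at the sample at the integer part of the position;
   frac (0..1) is how far towards the next sample we want to be. */
static float resample_at(const float *x, double frac)
{
    int    p  = (int)(frac * PHASES);
    double mu = frac * PHASES - p;                  /* residue for the lerp */
    int    p2 = (p + 1 < PHASES) ? p + 1 : p;       /* clamp at the table edge */
    float  s0 = 0.0f, s1 = 0.0f;
    for (int t = 0; t < TAPS; t++) {
        float in = x[t - A];
        s0 += in * kernels[p][t];
        s1 += in * kernels[p2][t];
    }
    return (float)(s0 + mu * (s1 - s0));
}

With thousands of phases stored, the error introduced by the final linear interpolation should be well below the 16-bit noise floor.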

Using this arrangement I have been resampling some floating point sine waves at different pitches and examining the results in Audacity. When the spectrum is plotted, the results seem to be flawless.
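
For anyone wanting to reproduce the test, the sort of thing I mean is: generate a floating-point sine, resample it by a small ratio offset, and dump raw 32-bit floats that Audacity can import as raw data and plot the spectrum of. Something along these lines, building on the kernel-table sketch above (it reuses build_kernels() and resample_at(); the test frequency and the 44100/44056 ratio are arbitrary choices for illustration):

/* Sketch of the non-real-time test: resample a sine by a small ratio
   offset and write raw 32-bit floats for inspection in Audacity. */
#include <stdio.h>
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* build_kernels() and resample_at() as in the earlier sketch */

int main(void)
{
    const double fs = 44100.0, f = 997.0, ratio = 44100.0 / 44056.0;
    static float in[44100 * 5 + 64];                /* five seconds plus margin */
    FILE *out = fopen("resampled_f32.raw", "wb");
    if (!out) return 1;

    build_kernels();
    for (int n = 0; n < (int)(sizeof in / sizeof in[0]); n++)
        in[n] = (float)sin(2.0 * M_PI * f * n / fs);

    double pos = 64.0;                              /* stay clear of the buffer edges */
    while (pos < 44100.0 * 5) {
        int   n = (int)pos;
        float s = resample_at(&in[n], pos - n);
        fwrite(&s, sizeof s, 1, out);
        pos += ratio;
    }
    fclose(out);
    return 0;
}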

The exact width (and therefore quality) of the kernel and how many filters to create are yet to be determined.

[Another update] I have put the resampling code into the active crossover program running on an Intel Atom fanless PC. It has no trouble performing the resampling in real time – much to my amazement – so I now have a fully functional system that can take in TOSLink (from a CD player at the moment) and generate six analogue output channels for the two KEF-derived three-way speakers. Not as truly ‘perfect’ as the previous system that controls the rate at which data arrives, but not far off.

[Update 01/03/18] Everything has worked out OK, including the re-sampling described in a later post. I actually had it working before I had fully grasped how it worked! But the necessary mental adjustments have now been made.

However, I am finding that the number of platforms that provide S/PDIF or TOSLink outputs ‘out-of-the-box’ without problems is very small.

I would simply have bought a Chromecast Audio as the source, but apparently its lossy Ogg Vorbis bit rate is limited to 256 kbps when Spotify is the source (which is what I might be planning to use for these tests), as opposed to the 320 kbps it uses with a PC.

So I thought I could just use a cheap USB sound card with a PC, but found that under Linux it did a very stupid thing: it turned off the TOSLink output when no data was being written to it – which is, of course, a nightmare for the receiver software to deal with, especially if it is planning to base its resampling ratio on the received sample rate.

I then began messing around with old desktop machines and PCI sound cards. The Asus Xonar DS did the same ridiculous muting thing in Linux. The Creative X-Fi looked as though it was going to work, but then sent out 48 kHz when idling, and switched to the desired 44.1 kHz when sending music. Again, impossible for the receiver to deal with, and I could find no solution.

Only one permutation is working: a Creative X-Fi PCI card in a Windows 7 machine with a freeware driver and app, because Creative seemingly couldn’t be bothered to support anything after XP. The free driver and app is called ‘PAX’ and looks like an original Creative app – my thanks to Robert McClelland. Using it, it is possible to ensure bit perfect output, and in the Windows Control Panel it is possible to force the output to 16 bit 44.1 kHz, which is exactly what I need.

[Update 03/03/18] The general situation with TOSLink, PCs and consumer grade sound cards is dire, as far as I can tell. I bought one of these ubiquitous devices thinking that Ubuntu/Linux/Alsa would, of course, just work with it and TOSLink.

USB 6 Channel 5.1 External SPDIF Optical Digital Sound Card Audio Adapter for PC

It is reputedly based on the CM6206. At least the TOSLink output stays on all the time with this card, but it doesn’t work properly at 44.1 kHz even though Alsa seems happy at both ends: if you listen to a 1 kHz sine wave played over this thing, it has a cyclic discontinuity somewhere – as though it’s doing nearest-neighbour resampling from 48 to 44.1 kHz, or something like that…? As a receiver it seems to work fine.

With Windows, it automatically installs drivers, but Control Panel->Manage Audio Devices->Properties indicates that it will only do 48 kHz sample rate. Windows probably does its own resampling so that Spotify happily works with it, and if I run my application expecting a 48 kHz sample rate, it all works – but I don’t want that extra layer of resampling.

As mentioned earlier, I also bought one of these from Maplin (now about to go out of business). It, too, is supposedly based on the CM6206.

Under Linux/Alsa I can make it work as a TOSLink receiver, but I cannot make its output turn on, except for a brief flash when it is plugged in.

In Windows you have to install the driver (and large ‘app’ unfortunately) from the supplied CD. This then gives you the option to select various sample rates, etc. including the desired 44.1 kHz. Running Spotify, everything works except… when you pause, the TOSLink output turns off after a few seconds. Aaaaaghhh!

This really does seem very poor to me. The default should be that TOSLink stays on all the time, at a fixed, selected sample rate. Anything else is just a huge mess. Why are they turning it off? Some pathetic ‘environmental’ gesture? I may have to look into whether S/PDIF from other types of sound card runs all the time, in which case a USB S/PDIF sound card feeding a super-simple hardware S/PDIF-to-TOSLink converter would be a reliable solution – or I could simply use S/PDIF throughout, but I quite like the electrical isolation that TOSLink gives.

It’s not that I need this in order to listen to music, you understand – the original ‘bit perfect’ solution still works for now, and maybe always will – but I am just trying to make S/PDIF/TOSLink work in principle so that I have a more general purpose, future-proof system.

Two hobbies

An acoustic event occurs; a representative portion of the sound pressure variations produced is stored and then replayed via a loudspeaker. The human hearing system picks it up and, using the experience of a lifetime, works out a likely candidate for what might have produced that sequence of sound pressure variations. It is like finding a solution from simultaneous equations. Maybe there is more than enough information there, or maybe the human brain has to interpolate over some gaps. The addition of a room doesn’t matter, because its contribution still allows the brain to work back to that original event.

If this has any truth in it, I would guess that an unambiguous solution would be the most satisfying for the human brain on all levels. On the other hand, no solution at all would lead to a different perception: the reproduction system itself being heard, not what it is reproducing – and people could still enjoy that for what it is, like an old radiogram.

In between, an ambiguous or varying solution might be in an ‘uncanny valley’ where the brain can’t lock onto a fixed solution but nor can it entirely switch off and enjoy the sound at the level of the old radiogram.

I think a big question is: what are the chances that a deviation from neutrality in the reproduction system will result in an improvement in the brain’s ability to work out an unambiguous solution to the simultaneous equations? The answer has got to be: zero. Adding noise, phase shifts, glitches or distortion cannot possibly lead to more ‘realism’; the equations no longer work.

But here’s a thought: what if most ‘audiophile’ systems out there are in the ‘uncanny valley’? Speakers in particular doing strange things to the sound with their passive crossovers; ‘high end’ ones being low in nonlinear distortion, but high in linear distortion.

What if some non-neutral technologies ‘work’ by pushing the system out of the uncanny valley and into the realm of the clearly artificial? That is certainly the impression I get from some systems at the few audio shows I go to. People ooh-ing and aah-ing at sounds that, to me, are being generated by the audio system rather than reproduced through it. I suspect that different ‘audiophiles’ may think they are all talking about the same things, but that in fact there are effectively two separate hobbies: one that seeks to hear through an audio system, and one that enjoys the warm, reassuring sound of the audio system itself.

The problem with IT…

…is that you can never rely on things staying the same. Here’s what happened to me last night.

By default I start Spotify when my Linux audio PC boots up. I often leave it running for days. Last night I was listening to something on Spotify (but I suspect it wouldn’t have mattered if it had been a CD or other source). I got a few glitches in the audio – something that never happens. This threatened to spoil my evening – I thought everything was perfect.

I immediately plugged in a keyboard and mouse to begin to investigate and it was at that moment that I noticed that the Intel Atom-based PC was red hot.

Using the Ubuntu system monitor app I could see that the processor cores were running close to flat out. Spotify was running, and on the default opening page was a snazzy animated advert referring to some artist I have no interest in. The basic appearance was a sparkly oscilloscope type display pulsing in time with the music. I had not seen anything like that on Spotify before. I had an inkling that this might be the problem and so I clicked to a more pedestrian page with my playlists on it. The CPU load went down drastically.

Yes, Spotify had decided they needed to jazz up their front page with animation and this had sent my CPU cores into meltdown. Now, my PC is the same chipset as loads of tablets out there. Maybe Ubuntu’s version of flash (or whatever ‘technology’ the animation was based on) is really inefficient or something, but it looks to me as though there is a strong possibility that this Spotify ‘innovation’ might have suddenly resulted in millions of tablets getting hot and their batteries flattening in minutes.

The animation is now gone from their front page. Will it return? I can’t now check whether any changes I make to Spotify’s opening behaviour (opening up minimised?) will prevent the issue.

This is the problem with modern computer-based stuff that is connected to the internet. It’s brilliant, but they can never stop meddling with things that work perfectly as they are.

[06/01/18] Of course it can get worse. Much worse. Since then, we now know that practically every computer in the world will need to be slowed down in order to patch over a security issue that has been designed into the processors at hardware level. At worst it could be a 50% slowdown. Will my audio PC cope? Will it now run permanently hot? I installed an update yesterday and it didn’t seem to cause a problem. Was this patch in it, or is the worst yet to come?

[04/02/18] I defaulted to Spotify opening up minimised when the PC is switched on. Everything still working, and the PC running cool.

But I would like to get to the point where I have a box that always works. I would like to be able to give my code to other people without needing to be an IT support person – believe me, I don’t know enough about that sort of thing.

It now seems to me that the only way to guarantee that a box will always be future-proof, without constant updates and the need for IT support, is to bite the bullet and accept that the system cannot be bit-perfect. Once that psychological hurdle is overcome, it becomes easy: send the data via S/PDIF, resample it in software (Linux will do this automatically if you let it), and Bob’s your uncle: a box that isn’t even attached to the internet, that takes in S/PDIF and gives you six analogue outputs or variations thereof; a box with a video monitor output and USB sockets, allowing you to change settings, import WAV files to define filters and so on, then disconnect the keyboard and mouse. Or a box that is accessible over a standard network in a web browser – or does that render it not future-proof? Presumably a very simple web interface will always be valid. I think this is going to be the direction I head in…