The First Lossy Codec

(probably).

Nowadays we are used to the concept of the lossy codec that can reduce the bit rate of CD-quality audio by a factor of, say, 5 without much audible degradation. We are also accustomed to lossless compression which can halve the bit rate without any degradation at all.

But many people may not realise that they were listening to digital audio and a form of lossy compression in the 1970s and 80s!

Early BBC PCM

As described here, the BBC were experimenting with digital audio as early as the 1960s, and in the early 70s they wired up much of the UK FM transmitter network with PCM links in order to eliminate the hum, noise, distortion and frequency response errors that were inevitable with the previous analogue links.

So listeners were already hearing 13-bit audio at a sample rate of 32 kHz when they tuned into FM radio in the 1970s. I was completely unaware of this at the time, and it is ironic that many audiophiles still think that FM radio sounds good but wouldn’t touch digital audio with a bargepole.

13 bits was pretty high quality in terms of signal-to-noise ratio, and the 32 kHz sample rate gave something approaching 15 kHz audio bandwidth which, for many people’s hearing, would be more than adequate. The quality was, however, objectively inferior to that of the Compact Disc that came later.

Reducing to 10 bits

In the late 70s, in order to multiplex more stations into less bandwidth, the BBC wanted to compress higher-quality 14-bit audio down to 10 bits.

As you may be aware, reducing the bit depth leads to a higher level of background noise due to the reduced resolution and the mandatory addition of dither noise. The rule of thumb is roughly 6 dB of signal-to-noise ratio per bit, less a few dB for the dither itself, so for 10 bits with dither the best that could be achieved would be a signal-to-noise ratio of around 54 dB (I think I am right in saying), although the modern technique of noise shaping the dither can reduce the audibility of the quantisation noise.

This would not have been acceptable audible quality for the BBC.

Companding Noise Reduction

Compression-expansion (companding) is a noise reduction technique that was already used with analogue tape recorders, e.g. the dbx noise reduction system. Here, the signal’s dynamic range is squashed during recording, i.e. the quiet sections are boosted in level, following a specific ‘law’. Upon replay, the inverse ‘law’ is followed in order to restore the original dynamic range. In doing so, any noise which has been added during recording is pushed down in level, reducing its audibility.

With such a system, the recorded signal itself carries the information necessary to control the expander, so compressor and expander need to track each other accurately in terms of the relationships between gain, level and time. Different time constants may be used for ‘attack’ and ‘release’ and these are a compromise between rapid noise reduction and audible side effects such as ‘pumping’ and ‘breathing’. The noise itself is being modulated in level, and this can be audible against certain signals more than others. Frequency selective pre- and de-emphasis can also help to tailor the audible quality of the result.

The BBC investigated conventional analogue companding before they turned to the pure digital equivalent.

NICAM

The BBC called their digital equivalent of analogue companding ‘NICAM’ (Near Instantaneously Companded Audio Multiplex). It is much, much simpler, and more precise and effective than the analogue version.

It is as simple as this:

  • Sample the signal at full resolution (14 bits for the BBC);
  • Divide the digitised stream into time-based chunks (1 ms was the duration they decided upon);
  • For each chunk, find the maximum absolute level within it;
  • For all samples in that chunk, do a binary shift sufficient to bring them all down to within the target bit depth (e.g. 10 bits);
  • Transmit the shifted samples, plus a single value indicating by how much they have been shifted;
  • At the other end, restore the full range by shifting samples in the opposite direction by the appropriate number of bits for each chunk.

Using this system, all ‘quiet chunks’, i.e. those whose samples already fit within 10 bits, are sent unchanged. Chunks containing values that are higher in level than 10 bits lose their least significant bits, but this loss of resolution is masked by the louder signal level. Compared to modern lossy codecs, this method requires minimal DSP and could be performed without software using dedicated circuits based on logic gates, shift registers and memory chips.
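For the curious, here is a minimal sketch in C of the shift-and-restore idea (my own illustration rather than the BBC’s actual circuitry; the chunk length of 32 samples, the bit depths and the function names are assumptions):

#include <stdlib.h>

#define CHUNK     32   /* 1 ms at 32 kHz */
#define SENT_BITS 10   /* target bit depth for transmission */

/* Compress one chunk: find the largest magnitude, then right-shift every
   sample just enough to fit the signed 10-bit range. The shift count is
   transmitted alongside the shifted samples.
   (Assumes two's complement and arithmetic right shift of negatives.) */
int nicam_compress(const int *in, int *out)
{
    int peak = 0, shift = 0;
    for (int i = 0; i < CHUNK; i++)
        if (abs(in[i]) > peak) peak = abs(in[i]);
    while ((peak >> shift) >= (1 << (SENT_BITS - 1)))
        shift++;
    for (int i = 0; i < CHUNK; i++)
        out[i] = in[i] >> shift;               /* quiet chunks: shift = 0 */
    return shift;
}

/* Expand one chunk at the receiving end: scale back up by the same factor.
   The discarded least significant bits are, of course, gone for good. */
void nicam_expand(const int *in, int shift, int *out)
{
    for (int i = 0; i < CHUNK; i++)
        out[i] = in[i] * (1 << shift);
}

A real system would also have to pack the per-chunk shift value into the transmitted multiplex and protect it against transmission errors, but the core of the idea really is this small.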

You may be surprised at how effective it is. I have written a program to demonstrate it, and in order to really emphasise how good it is, I have compressed the original signal into 8 bits, not the 10 that the BBC used.

In the following clip, a CD-quality recording has been converted as follows:

  • 0-10s is the raw full-resolution data
  • 10-20s is the sound of the signal reduced to 8 bits with dither – notice the noise!
  • 20-40s is the signal compressed NICAM-style into 8 bits and restored at the other end.

I think it is much better than we might have expected…

(I wanted to start with high quality, so I got the music extract from here:

http://www.2l.no/hires/index.html

This is the web site of a label providing extracts of their own high quality recordings in various formats for evaluation purposes. I hope they don’t mind me using one of their excellent recorded extracts as the source for my experiment).

“Great Midrange”

If you do a Google search for audio “great midrange” you get a lot of hits – such descriptions of audio systems’ performance are common currency in audiophilia.

But what a peculiar idea: that something that is supposed to sound like music should be judged on the basis of the sound of its “midrange”. What is “midrange”? It has nothing whatsoever to do with music, art, performance, experiencing a concert. Has anyone ever said “This orchestra has great midrange” or “This hall has great midrange”?

I think that the common use of such descriptions reveals some assumptions:

  • all audio systems are assumed to have distinct hardware-related, non-musical, non-acoustic characteristics; they can be compared and ranked on that basis
  • audiophiles are not holding out for the whole, but are prepared to live with systems on the basis of isolated arbitrary non-music related characteristics like “great midrange”
  • the signal is built from non-acoustic, non-musical components like “midrange” rather than being a unique, complex composite of sources and acoustics
  • we can manipulate components in the signal like “midrange” in a meaningful way; the signal is like soup and we can flavour it any way we like
  • audio systems are so poor that it is better to rate them as a collection of components like “midrange” rather than the ways in which they deviate from perfection – it’s quicker that way.
  • it is not possible to describe meaningfully the sound of most audio systems (it’s not as if it’s like listening to real music), so we have to devise a new language to describe it.
  • audiophilia is about listening to the system not through it

I think this pitiful set of assumptions would mystify and put off the intelligent novice who might be curious about audio. As soon as they began to research the subject (as you’re surely supposed to do?) they would experience cognitive dissonance between their existing notion of what an audio system is for, and the apparent priorities and language used by the experts.

I think it is easy to see that not many people would persist in trying to find their way around this weird sub-culture and would simply buy an Apple HomePod instead. The audiophiles have, in effect, passively appropriated the world of high quality recorded music for themselves, and the only outsiders granted access are those prepared to go through a bewildering, arduous re-orientation process.

How to re-sample an audio signal

As I mentioned earlier, I would like to have the flexibility of using digital audio data that originates outside the PC that is performing the DSP, and this will necessarily have a different sample clock from the DAC. Something has got to give!

If the input was analogue, you would just sample it with an ADC locked to your DAC’s sample rate, and then the source’s own sample rate wouldn’t matter to you. With a standard digital audio source (e.g. S/PDIF) you need to be able to do the same thing but purely in software. The incoming sampled data points are notionally turned into a continuous waveform in memory by duplicating a DAC reconstruction filter using floating point maths. You can then sample it wherever you want at a rate locked to the DAC’s sample rate.

You still ‘eat’ the incoming data at the rate at which it comes in, but you vary the number of samples that you ‘decimate’ from it (very, very slightly).

The control algorithm for locking this re-sampling to the DAC’s sample rate is not completely trivial, because the PC’s only knowledge of the sample rates of the DAC and S/PDIF is via notifications that large chunks of data have arrived or left, with unknown amounts of jitter. It is only possible to establish an accurate measure of relative sample rates with a very long time constant average. In reality the program never actually calculates the sample rate at all, but merely maintains a constant-ish difference between the read and write pointer positions of a circular buffer. It relies on adequate latency and the two sample rates being reasonably stable by virtue of being derived from crystal oscillators. The corrections will, in practice, be tiny and/or occasional.
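To make that concrete, here is a sketch in C of the sort of control loop I mean (the names and constants are invented for illustration): it never measures a sample rate, it just nudges the resampling ratio so that the circular buffer’s fill level drifts back towards a target.

#define BUFFER_FRAMES 16384                  /* circular buffer size (example) */
#define TARGET_FILL   (BUFFER_FRAMES / 2)    /* aim to stay about half full    */

static double resample_ratio = 1.0;          /* step through the input per output sample */

/* Called occasionally, e.g. once per output block. 'fill' is the current
   distance between the write and read pointers, in frames. */
void update_ratio(long fill)
{
    double error = (double)(fill - TARGET_FILL) / TARGET_FILL;

    /* Tiny proportional gain: if the buffer is filling up we consume slightly
       faster (ratio > 1); if it is emptying we consume slightly slower.
       In practice the error would also be heavily low-pass filtered, giving
       the very long time constant mentioned above. */
    resample_ratio = 1.0 + 1.0e-4 * error;
}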

How is the interesting problem of re-sampling solved?

Well, it’s pretty new to me, so in order to experiment with it I have created a program that runs on a PC and does the following:

  1. Synthesises a test signal as an array of floating point values at a notional sample rate of 44.1 kHz. This can be a sine wave, or combination of different frequency sine waves.
  2. Plots the incoming waveform as time domain dots.
  3. Plots the waveform as it would appear when reconstructed with the sinc filter. This is a sanity check that the filter is doing approximately the right thing.
  4. Resamples the data at a different sample rate (can be specified with any arbitrary step size e.g. 0.9992 or 1.033 or whatever), using floating point maths. The method can be nearest-neighbour, linear interpolation, or sinc & linear interpolation.
  5. Plots the resampled waveform as time domain dots.
  6. Passes the result into an FFT (65536 points), windowing the data with a raised cosine window.
  7. Plots the resulting resampled spectrum in terms of frequency and amplitude in dB.

This is an ideal test bed for experimenting with different algorithms and getting a feel for how accurate they are.

Nearest-neighbour and linear interpolation are pretty self-explanatory methods; the sinc method is similar to that described here:

https://www.dsprelated.com/freebooks/pasp/Windowed_Sinc_Interpolation.html

I haven’t completely reproduced (or necessarily understood) their method, but I was inspired by this image:

[Image: Waveforms – from the page linked above]

The sinc function is the ideal ‘brick wall’ low pass filter and is calculated as sin(x*PI)/(x*PI). In theory it extends from minus to plus infinity, but for practical uses is windowed so that it tapers to zero at plus or minus the desired width – which should be as wide as practical.

The filter can be set at a lower cutoff frequency than Nyquist by stretching it out horizontally, and this would be necessary to avoid aliasing if wishing to re-sample at an effectively slower sample rate.

If the kernel is slid along the incoming sample points and a point-by-point multiply and sum is performed, the result is the reconstructed waveform. What the above diagram shows is that the kernel can be in the form of discrete sampled points, calculated as the values they would be if the kernel was centred at any arbitrary point.

So resampling is very easy: simply synthesise a sinc kernel in the form of sampled points based on the non-integer position you want to reconstruct, and multiply-and-add all the points corresponding to it.
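A bare-bones version of that in C might look like this (no windowing or pre-calculation yet; purely to show the multiply-and-add, with names of my own choosing):

#include <math.h>

static const double PI = 3.14159265358979323846;

/* Reconstruct the value of the signal x[0..n-1] at fractional position 'pos'
   (measured in samples) by centring a sinc kernel there and summing over the
   nearby input samples. 'halfwidth' is the number of samples used each side. */
double sinc_interpolate(const double *x, int n, double pos, int halfwidth)
{
    int centre = (int)floor(pos);
    double sum = 0.0;
    for (int i = centre - halfwidth; i <= centre + halfwidth + 1; i++) {
        if (i < 0 || i >= n) continue;       /* ignore samples off the ends */
        double d = pos - i;                  /* distance from the kernel centre */
        double s = (d == 0.0) ? 1.0 : sin(PI * d) / (PI * d);
        sum += x[i] * s;
    }
    return sum;
}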

A complication is the necessity to shorten the filter to a practical length, which involves windowing the filter i.e. multiplying it by a smooth function that tapers to zero at the edges. I did previously mention the Lanczos kernel which apparently uses a widened copy of the central lobe of the sinc function as the window. But looking at it, I don’t know why this is supposed to be a good window function because it doesn’t taper gradually to zero, and at non-integer sample positions you would either have to extend it with zeroes abruptly, or accept non-zero values at its edges.

Instead, I have decided to use a simple raised cosine as the windowing function, and to reduce its width slightly to give me some leeway in the kernel’s position between input samples. At the extremities I ensure it is set to zero. It seems to give a purer output than my version of the Lanczos kernel.

Pre-calculating the kernel

Although very simple, calculating the kernel on-the-fly at every new position would be extremely costly in terms of computing power, so the obvious solution is to use lookup tables. The pre-calculated kernels on either side of the desired sample position are evaluated to give two output values. Linear interpolation can then be used to find the value at the exact position. Because memory is plentiful in PCs, there is no need to skimp on the number of pre-calculated kernels – you could use a thousand of them. For this reason, the errors associated with this linear interpolation can be reduced to negligible levels.

The horizontal position of the raised cosine window follows the position of the centre of the kernel for all the versions that are calculated to lie in between the incoming sample points.
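Putting the last few paragraphs together, here is a sketch of how the tables might be built and used (again my own illustration; the half-width of 125, the 1000 fractional steps and the exact window shape are just example choices):

#include <math.h>

#define HALF    125                  /* kernel half-width -> ~250-tap kernel */
#define KWIDTH  (2 * HALF + 1)
#define NPHASES 1000                 /* pre-calculated fractional offsets    */

static const double PI = 3.14159265358979323846;
static double kernels[NPHASES + 1][KWIDTH];

/* Build NPHASES+1 windowed sinc kernels, each centred a little further along
   between two input samples. The raised cosine window follows the kernel
   centre and is forced to zero at the extremities. */
void build_kernels(void)
{
    for (int p = 0; p <= NPHASES; p++) {
        double frac = (double)p / NPHASES;            /* 0.0 .. 1.0 */
        for (int k = 0; k < KWIDTH; k++) {
            double d = (k - HALF) - frac;             /* distance from centre */
            double s = (d == 0.0) ? 1.0 : sin(PI * d) / (PI * d);
            double w = (fabs(d) >= HALF) ? 0.0
                       : 0.5 * (1.0 + cos(PI * d / HALF));
            kernels[p][k] = s * w;
        }
    }
}

/* Value at integer index 'centre' plus fraction 'frac' (0 <= frac < 1):
   sweep the two nearest pre-calculated kernels, then linearly interpolate
   between the two results. The caller must keep centre +/- HALF in range. */
double interpolate(const double *x, int centre, double frac)
{
    double fp = frac * NPHASES;
    int p = (int)fp;
    if (p >= NPHASES) p = NPHASES - 1;
    double t = fp - p;
    double a = 0.0, b = 0.0;
    for (int k = 0; k < KWIDTH; k++) {
        a += x[centre - HALF + k] * kernels[p][k];
        b += x[centre - HALF + k] * kernels[p + 1][k];
    }
    return a + t * (b - a);
}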

All that remains is to decide how wide the kernel needs to be for adequate accuracy in the reconstruction – and this is where my demo program comes in. I apologise that there now follows a whole load of similar looking graphs, demonstrating the results with various signals and kernel sizes, etc.

1 kHz sine wave

First we can look at the standard test signal: a 1 kHz sine wave. In the following image, the original sine wave points are shown joined with straight lines at the top right, followed by how the points would look when emerging from a DAC that has a sinc-based reconstruction filter (in this case, the two images look very similar).

Next down in the three time domain waveforms comes the resampled waveform after we have resampled it to shift its frequency by a factor of 0.9 (a much larger ratio than we will use in practice). In this first example, the resampling method being used is ‘nearest neighbour’. As you can see, the results are disastrous!

[Image: sin_1k_nn]

1 kHz sine wave, frequency shift 0.9, nearest neighbour interpolation

The discrete steps in the output waveform are obvious, and the FFT shows huge spikes of distortion.

Linear interpolation is quite a bit better in terms of the FFT, and the time domain waveform at the bottom right looks much better.

[Image: sin_1k_li]

1 kHz sine wave, frequency shift 0.9, linear interpolation

However, the FFT magnitude display reveals that it is clearly not ‘hi-fi’.

Now, compare the results using sinc interpolation:

[Image: sin_1k_sinc_50_0.9]

1 kHz sine wave, frequency shift 0.9, sinc interpolation, kernel width 50

As you can see, the FFT plot is absolutely clean, indicating that this result is close to distortion-free.

Next we can look at something very different: a 20 kHz sine wave.

20 kHz sine wave

[Image: sin_20k_nn]

20 kHz sine wave, frequency shift 0.9, nearest neighbour interpolation

With nearest neighbour resampling, the results are again disastrous. At the right hand side, though, the middle of the three time domain plots shows something very interesting: even though the discrete points look nothing like a sine wave at this frequency, the reconstruction filter ‘rings’ in between the points, producing a perfect sine wave with absolutely uniform amplitude. This is what is produced by any normal DAC – and is something that most people don’t realise; they often assume that digital audio falls apart at the top end, but it doesn’t: it is perfect.

Linear interpolation is better than nearest-neighbour, but pretty much useless for our purposes.

[Image: sin_20k_li]

20 kHz sine wave, frequency shift 0.9, linear interpolation

Sinc interpolation is much better!

[Image: sin_20k_sinc_50]

20 kHz sine wave, frequency shift 0.9, sinc interpolation, kernel width 50

However, there is an unwanted spike at the right hand side (note the main signal is at 18 kHz because it has been shifted down by a factor of 0.9). This spike appears because of the inadequate width of the sinc kernel, which in this case has been set at 50 (with 500 pre-calculated versions of it at different time offsets between sample points).

If we increase the width of the kernel to 200 (actually 201 because the kernel is always symmetrical about a central point with value 1.0), we get this:

[Image: sin_20k_sinc_200]

20 kHz sine wave, frequency shift 0.9, sinc interpolation, kernel width 200

The spike is almost at acceptable levels. Increasing the width to 250 we get this:

[Image: sin_20k_sinc_250]

20 kHz sine wave, frequency shift 0.9, sinc interpolation, kernel width 250

And at 300 we get this:

[Image: sin_20k_sinc_300]

20 kHz sine wave, frequency shift 0.9, sinc interpolation, kernel width 300

Clearly the kernel width does need to be in this region for the highest quality.

For completeness, here is the system working on a more complex waveform comprising the sum of three frequencies: 14, 18 and 19 kHz, all at the same amplitude and a frequency shift of 1.01.

14 kHz, 18 kHz, 19 kHz sum

Nearest neighbour:

[Image: sin_14_18_19_nn]

14, 18, 19 kHz sine waves, nearest neighbour interpolation

Linear interpolation:

[Image: sin_14_18_19_li]

14, 18, 19 kHz sine waves, linear interpolation

Sinc interpolation with a kernel width of 50:

[Image: sin_14_18_19_sinc_50]

14, 18, 19 kHz sine waves, sinc interpolation, kernel width 50

Kernel width increased to 250:

[Image: sin_14_18_19_sinc_250]

14, 18, 19 kHz sine waves, sinc interpolation, kernel width 250

More evidence that the kernel width needs to be in this region.

Ready made solutions

Re-sampling is often done in dedicated hardware like Analog Devices’ AD1896. Some advanced sound cards like the Creative X-Fi can re-sample everything internally to a common sample rate using powerful dedicated processors – this is the solution that makes connecting digital audio sources together almost as simple as analogue.

In theory, stuff like this goes on inside Linux already, in systems like JACK – apparently. But it just feels too fragile: I don’t know how to make sure it is working, and I don’t really have any handle on the quality of it. This is a tricky problem to solve by trial-and-error because a system can run for ages without any sign that clocks are drifting.

In Windows, there is a product called “Virtual Audio Cable” that I know performs re-sampling using methods along these lines.

There are libraries around that supposedly can do resampling, but the quality is unknown – I was looking at one that said “Not the best quality” so I gave up on that one.

I have a feeling that much of the code was developed at a time when processors were much less powerful than they are now and so the algorithms are designed for economy rather than quality.

Software-based sinc resampling in practice

I have grafted the code from my demo program into my active crossover application and set it running with TOSLink from a CD player going into a cheap USB sound card (Maplin), and the output going to a better multichannel sound card (the Xonar U7). The TOSLink data is being resampled in order to keep it aligned with the DAC’s sample rate. I have had it running for 20 hours without incident.

Originally, before developing the test bed program, I set the kernel size at 50, fearing that anything larger would stress the Intel Atom CPU. However, I now realise that a width of at least 250 is necessary, so with trepidation I upped it to this value. The CPU load trace went up a bit in the Ubuntu system monitor, but not much; the cores are still running cool. The power of modern CPUs is ridiculous!! Remember that for each of the two samples arriving at 44.1 kHz, the algorithm is performing 500 floating point multiplications and sums, yet it hardly breaks into a sweat. There are absolutely no clever efficiencies in the programming. Amazing.

Active crossover with Raspberry Pi?

I was a bit bored this afternoon and finally managed to put myself into the frame of mind to try transplanting my active crossover software onto a Raspberry Pi.

It turns out it works, but it’s a bit delicate: although CPU usage seems to be about 30% on average, extra activity on the RPi can cause glitches in the audio. But I have established in principle that the RPi can do it, and that the software can simply be transplanted from a PC to the RPi – quite an improbable result I think!

A future-proof DSP box?

What I’d like to do is: build a box that can implement my DSP ‘formula’, that isn’t connected to the internet, takes in stereo S/PDIF, and gives out six channels of analogue.

Is this the way to get a future-proof DSP box that the Powers-That-Be can’t continually ‘upgrade’ into obsolescence? In other words, I would always be able to connect the latest PCs, streamers or Chromecast to it without relying on the same box having to be the source of the stereo audio itself (which currently means that every time it is booted up it could stop working because of some trivial – or major – change that breaks the system). Witness this very week, when Spotify ‘upgraded’ its system and consigned many dedicated smart speakers’ streaming capability to oblivion. The only way to keep up with such changes is to be an IT-support person, staying current with updates and potentially making changes to code.

To avoid this, surely there will always have to be cheap boxes that connect to the internet and give out S/PDIF or TOSLink, maintained by genuine IT-support people, rather than me having to do it. (Maybe not…. It’s possible that if fitment of MQA-capable chips becomes universal in all future consumer audio hardware, they could eventually decide it is viable to enable full data encryption and/or restrict access to unencrypted data to secure, licensed hardware only).

It’s unfortunate, because it automatically means an extra layer of resampling in the system (because the DAC’s clock is not the same as the source’s clock), but I can persuade myself that it’s transparent. If the worst comes to the very worst in future, the box could also have analogue inputs, but I hope it doesn’t come to that.

This afternoon’s exercise was really just to see if it could be done with an even cheaper box than a fanless PC and, amazingly, it can! I don’t know if anyone else out there is like me, but while I understand the guts of something like DSP, it’s the peripheral stuff I am very hazy on. To me, to be able to take a system that runs on an Intel-based PC and make it run on a completely different processor and chipset without major changes is so unlikely that I find the whole thing quite pleasing.

[UPDATE 18/02/18] This may not be as straightforward as I thought. I have bought one of these for its S/PDIF input (TOSLink, actually). This works (being driven by a 30-year old CD player for testing), but it has focused my mind on the problem of sample clock drift:

My own resampling algorithm?

S/PDIF runs at the sender’s own rate, and my DAC will run at a slightly different rate. It is a very specialised thing to be able to reconcile the two, and I am no longer convinced that Linux/Alsa has a ready-made solution. I am feeling my way towards implementing my own resampling algorithm..!

At the moment, I regulate the sample rate of a dummy loopback driver that draws data from any music player app running on the Linux PC. Instead of this, I will need to read data in at the S/PDIF sample rate and store it in the circular buffer I currently use. The same mechanism that regulates the rate of the loopback driver will now control the rate at which data is drawn from this circular buffer for processing, and the values will need to be interpolated in between the stored values using convolution with a windowed sinc kernel. It’s a horrendous amount of calculation that the CPU will have to do for each and every output sample – probably way beyond the capabilities of the Raspberry Pi, I’m afraid. This problem is solved in some sound cards by using dedicated hardware to do resampling, but if I want to make a general purpose solution to the problem, I will need to bite the bullet and try to do it in software. Hopefully my Intel Atom-based PC will be up to the job. It’s a good job that I know that high res doesn’t sound any different to 16/44.1, otherwise I could be setting myself up for needing a supercomputer.

[UPDATE 20/02/18] I couldn’t resist doing some tests and trials with my own resampling code.

Resampling Experiments

First, to get a feel for the problem and how much computing power it will take, I tried running some basic multiplies and adds on a Windows laptop, programmed in ‘C’. Using a small filter kernel size of 51 and assuming two sweeps of pre-calculated kernels per sample (then a trivial interpolation between the two results), it could only just keep up with stereo CD in real time. Disappointing, and a problem if the PC is having to do other stuff. But then I realised that the compiler had all optimisations turned off. Optimising for maximum speed, it was blistering! At least 20x real time.

I tried it on a Raspberry Pi. Even that could keep up, at 3x real time.
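For anyone wanting to repeat the experiment, the kind of throwaway benchmark I mean is sketched below (my own code, not the actual crossover program; compile with optimisation enabled, e.g. gcc -O3 bench.c, to see the difference the optimiser makes):

#include <stdio.h>
#include <time.h>

#define KSIZE 51                     /* small kernel, as in the first trial  */
#define NSAMP (44100 * 10)           /* ten seconds' worth of output samples */

int main(void)
{
    static double kernel[2][KSIZE];
    static double input[NSAMP + KSIZE];
    double acc = 0.0;

    for (int k = 0; k < KSIZE; k++) { kernel[0][k] = 0.01 * k; kernel[1][k] = 0.02 * k; }
    for (int i = 0; i < NSAMP + KSIZE; i++) input[i] = (i % 100) * 0.001;

    clock_t t0 = clock();
    for (int n = 0; n < NSAMP; n++) {
        double a = 0.0, b = 0.0;
        for (int k = 0; k < KSIZE; k++) {          /* two kernel sweeps per sample */
            a += input[n + k] * kernel[0][k];
            b += input[n + k] * kernel[1][k];
        }
        acc += a + 0.5 * (b - a);                  /* trivial interpolation between */
    }
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    if (secs < 1e-6) secs = 1e-6;
    printf("%d samples in %.3f s (%.1fx real time at 44.1 kHz), checksum %g\n",
           NSAMP, secs, (NSAMP / 44100.0) / secs, acc);
    return 0;
}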

There may be other tricks to try as well, including processor-specific optimisations and programming for ‘SIMD’ (apparently where the CPU does identical calculations on vectors i.e. arrays of values, simultaneously) or kicking off threads to work on parts of the calculation where the operating system is able to share the tasks optimally across the processor cores. Or maybe that’s what the optimisation is doing, anyway.

There is also the possibility that for a larger (higher quality) kernel (say >256 values), an FFT might be a more economical way of doing the convolution.

Either way, it seems very promising.

Lanczos Kernel

I then wrote a basic system for testing the actual resampling in non-real time. This is based on the idea of wanting, effectively, to perform the job of a DAC reconstruction filter in software, and then to be able to pick the reconstructed value at any non-integer sample time. To do this ‘properly’ it is necessary to sweep the samples on either side of the desired sample time with a sinc kernel, i.e. convolve with it. Here’s where it gets interesting: the kernel can be created so that its element values are those of a sinc centred on the exact non-integer sample time desired, even though the elements themselves are aligned with, and calculated at, the integer sample times.

It would be possible to calculate on-the-fly a new, exact kernel for every new sample, but this would be very processor intensive, involving many calculations. Instead, it is possible to pre-calculate a range of kernels that represent a few fractional positions between adjacent samples. In operation, the two kernels on either side of the desired non-integer sample time are swept and accumulated, and then linear interpolation between these two values is used to find the value representing the exact sample time.

You may be horrified at the thought of linear interpolation until you realise that several thousand kernels could be pre-calculated and stored in memory, so that the error of the linear interpolation would be extremely small indeed.

Of course a true sinc function would extend to plus and minus infinity, so for practical filtering it needs to be windowed i.e. shortened and tapered to zero at the edges. Apparently – and I am no mathematician – the best window is a widened duplicate of the sinc function’s central lobe, and this is known as the Lanczos Kernel.

Using this arrangement I have been resampling some floating point sine waves at different pitches and examining the results in the program Audacity. The results when the spectrum is plotted seem to be flawless.

The exact width (and therefore quality) of the kernel and how many filters to create are yet to be determined.

[Another update] I have put the resampling code into the active crossover program running on an Intel Atom fanless PC. It has no trouble performing the resampling in real time – much to my amazement – so I now have a fully functional system that can take in TOSLink (from a CD player at the moment) and generate six analogue output channels for the two KEF-derived three-way speakers. Not as truly ‘perfect’ as the previous system that controls the rate at which data arrives, but not far off.

[Update 01/03/18] Everything has worked out OK, including the re-sampling described in a later post. I actually had it working before I managed to grasp fully in my head how it worked! But the necessary mental adjustments have been made, now.

However, I am finding that the number of platforms that provide S/PDIF or TOSLink outputs ‘out-of-the-box’ without problems is very small.

I would simply have bought a Chromecast Audio as the source, but apparently its Ogg Vorbis-encoded lossy bit rate is limited to 256 kbps with Spotify as the source (which is what I might be planning to use for these tests), as opposed to the 320 kbps that it uses with a PC.

So I thought I could just use a cheap USB sound card with a PC, but found that with Linux it did a very stupid thing: turned off the TOSLink output when no data was being written to it – which is, of course, a nightmare for the receiver software to deal with, especially if it is planning to base its resampling ratio on the received sample rate.

I then began messing around with old desktop machines and PCI sound cards. The Asus Xonar DS did the same ridiculous muting thing in Linux. The Creative X-Fi looked as though it was going to work, but then sent out 48 kHz when idling, and switched to the desired 44.1 kHz when sending music. Again, impossible for the receiver to deal with, and I could find no solution.

Only one permutation is working: Creative X-Fi PCI card in a Windows 7 machine with a freeware driver and app because Creative seemingly couldn’t be bothered to support anything after XP. The free app and driver is called ‘PAX’ and looks like an original Creative app – my thanks to Robert McClelland. Using it, it is possible to ensure bit perfect output, and in the Windows Control Panel app it is possible to force the output to 16 bit 44.1 kHz which is exactly what I need.

[Update 03/03/18] The general situation with TOSLink, PCs and consumer grade sound cards is dire, as far as I can tell. I bought one of these ubiquitous devices thinking that Ubuntu/Linux/Alsa would, of course, just work with it and TOSLink.

USB 6 Channel 5.1 External SPDIF Optical Digital Sound Card Audio Adapter for PC

It is reputedly based on the CM6206. At least the TOSLink output stays on all the time with this card, but it doesn’t work properly at 44.1 kHz even though Alsa seems happy at both ends: if you listen to a 1 kHz sine wave played over this thing, it has a cyclic discontinuity somewhere – like it’s doing nearest neighbour resampling from 48 to 44.1 kHz or something like that? As a receiver it seems to work fine.

With Windows, it automatically installs drivers, but Control Panel->Manage Audio Devices->Properties indicates that it will only do 48 kHz sample rate. Windows probably does its own resampling so that Spotify happily works with it, and if I run my application expecting a 48 kHz sample rate, it all works – but I don’t want that extra layer of resampling.

As mentioned earlier I also bought one of these from Maplin (now about to go out of business). It, too, is supposedly based on the CM6206:

Under Linux/Alsa I can make it work as TOSLink receiver, but cannot make its output turn on except for a brief flash when plugging it in.

In Windows you have to install the driver (and large ‘app’ unfortunately) from the supplied CD. This then gives you the option to select various sample rates, etc. including the desired 44.1 kHz. Running Spotify, everything works except… when you pause, the TOSLink output turns off after a few seconds. Aaaaaghhh!

This really does seem very poor to me. The default should be that TOSLink stays on all the time, at a fixed, selected sample rate. Anything else is just a huge mess. Why are they turning it off? Some pathetic ‘environmental’ gesture? I may have to look into whether S/PDIF from other types of sound card is constantly running all the time, in which case a USB-S/PDIF sound card feeding a super-simple hardware-based S/PDIF-to-TOSLink converter would be a reliable solution – or simply use S/PDIF throughout, but I quite like the idea of the electrical isolation from TOSLink.

It’s not that I need this in order to listen to music, you understand – the original ‘bit perfect’ solution still works for now, and maybe always will – but I am just trying to make SPDIF/TOSLink work in principle so that I have a more general purpose, future-proof, system.

Two hobbies

An acoustic event occurs; a representative portion of the sound pressure variations produced is stored and then replayed via a loudspeaker. The human hearing system picks it up and, using the experience of a lifetime, works out a likely candidate for what might have produced that sequence of sound pressure variations. It is like finding a solution from simultaneous equations. Maybe there is more than enough information there, or maybe the human brain has to interpolate over some gaps. The addition of a room doesn’t matter, because its contribution still allows the brain to work back to that original event.

If this has any truth in it, I would guess that an unambiguous solution would be the most satisfying for the human brain on all levels. On the other hand, no solution at all would lead to a different perception: the reproduction system itself being heard, not what it is reproducing – and people could still enjoy that for what it is, like an old radiogram.

In between, an ambiguous or varying solution might be in an ‘uncanny valley’ where the brain can’t lock onto a fixed solution but nor can it entirely switch off and enjoy the sound at the level of the old radiogram.

I think a big question is: what are the chances that a deviation from neutrality in the reproduction system will result in an improvement in the ability of the brain to work out an unambiguous solution to the simultaneous equations? The answer has got to be: zero. Adding noise, phase shifts, glitches or distortion cannot possibly lead to more ‘realism’; the equations don’t work any more.

But here’s a thought: what if most ‘audiophile’ systems out there are in the ‘uncanny valley’? Speakers in particular doing strange things to the sound with their passive crossovers; ‘high end’ ones being low in nonlinear distortion, but high in linear distortion.

What if some non-neutral technologies ‘work’ by pushing the system out of the uncanny valley and into the realm of the clearly artificial? That is certainly the impression I get from some systems at the few audio shows I go to. People ooh-ing and aah-ing at sounds that, to me, are being generated by the audio system and not through it. I suspect that different ‘audiophiles’ may think they are all talking about the same things, but that in fact there are effectively two separate hobbies: one that seeks to hear through an audio system, and one that enjoys the warm, reassuring sound of the audio system itself.

The problem with IT…

…is that you can never rely on things staying the same. Here’s what happened to me last night.

By default I start Spotify when my Linux audio PC boots up. I often leave it running for days. Last night I was listening to something on Spotify (but I suspect it wouldn’t have mattered if it had been a CD or other source). I got a few glitches in the audio – something that never happens. This threatened to spoil my evening – I thought everything was perfect.

I immediately plugged in a keyboard and mouse to begin to investigate and it was at that moment that I noticed that the Intel Atom-based PC was red hot.

Using the Ubuntu system monitor app I could see that the processor cores were running close to flat out. Spotify was running, and on the default opening page was a snazzy animated advert referring to some artist I have no interest in. The basic appearance was a sparkly oscilloscope type display pulsing in time with the music. I had not seen anything like that on Spotify before. I had an inkling that this might be the problem and so I clicked to a more pedestrian page with my playlists on it. The CPU load went down drastically.

Yes, Spotify had decided they needed to jazz up their front page with animation and this had sent my CPU cores into meltdown. Now, my PC is the same chipset as loads of tablets out there. Maybe Ubuntu’s version of flash (or whatever ‘technology’ the animation was based on) is really inefficient or something, but it looks to me as though there is a strong possibility that this Spotify ‘innovation’ might have suddenly resulted in millions of tablets getting hot and their batteries flattening in minutes.

The animation is now gone from their front page. Will it return? I can’t now check whether any changes I make to Spotify’s opening behaviour (opening up minimised?) will prevent the issue.

This is the problem with modern computer-based stuff that is connected to the internet. It’s brilliant, but they can never stop meddling with things that work perfectly as they are.

[06/01/18] Of course it can get worse. Much worse. Since then, we now know that practically every computer in the world will need to be slowed down in order to patch over a security issue that has been designed into the processors at hardware level. At worst it could be a 50% slowdown. Will my audio PC cope? Will it now run permanently hot? I installed an update yesterday and it didn’t seem to cause a problem. Was this patch in it, or is the worst yet to come?

[04/02/18] I defaulted to Spotify opening up minimised when the PC is switched on. Everything still working, and the PC running cool.

But I would like to get to the point where I have a box that always works. I would like to be able to give my code to other people without needing to be an IT support person – believe me, I don’t know enough about that sort of thing.

It now seems to me that the only way to guarantee that a box will always be future-proof without constant updates and the need for IT support is to bite the bullet and accept that the system cannot be bit-perfect. Once that psychological hurdle is overcome, it becomes easy: send the data via S/PDIF. Resample the data in software (Linux will do this automatically if you let it), and bob’s your uncle: a box that isn’t even attached to the internet, that takes in S/PDIF and gives you six analogue outputs or variations thereof; a box with a video monitor output and USB sockets, allowing you to change settings, import WAV files to define filters, etc. then disconnect the keyboard and mouse. Or a box that is accessible over a standard network in a web browser – or does that render it not future-proof? Presumably a very simple web interface will always be valid. I think this is going to be the direction I head in…

What more do we want?

As I sit here listening to some big symphonic music playing on my ‘KEF’ DSP-based active crossover stereo system, I am struck by the thought: how could it be any better?

I sometimes read columns where people wonder about the future of audio, as though continuous progress is natural and inevitable – and as though we are accustomed to such progress. But it does occur to me that there is no reason why we cannot have reached the point of practical perfection already.

I think the desire for exotic improvements over what we have now has to be seen within the context of most people having not yet heard a good stereo system. They imagine that if the system they heard was expensive, it must therefore represent the state of the art, but in audio I think they could well be wrong. Some time ago, the audio industry and enthusiasts may even have subconsciously sniffed that they were reaching a plateau and begun to stall or reverse progress just to make life more interesting for themselves.

At the science fiction level, people dream of systems that reproduce live events exactly, including the acoustics of the performance venue. Even if this were possible, would it be worth it without the corresponding visuals? (and smells, temperature, humidity, etc.?)

Something like it could probably be achieved using the techniques of the computer games industry: synthesis of the acoustics from first principles, headphones with head tracking, or maybe even some system of printed transducer array wall coverings that could create the necessary sound fields in mid-air if there was no furniture in the room (and knowing the audio industry, it would also supplement the system with some conventional subwoofers). My prediction is that you would try it a couple of times, find it a rather contrived, unnatural experience, and next time revert to your stereo system with two speakers.

On a more practical level, the increasing use of conventional DSP is predicted. We are now seeing the introduction of systems that aim to reduce the (supposedly) unwanted stereo crosstalk that occurs from stereo speakers. The idea is to send out a slightly attenuated antiphase impulse from one speaker for every impulse from the other speaker, that will cancel out the crosstalk at the ‘wrong ear’. It then needs to send out an anti-antiphase impulse from the other speaker to cancel out that impulse as it reaches the other ear, and so on. My gut instinct is that this will only work perfectly at one precise location, and at all other locations there will be ‘residue’ possibly worse than the crosstalk. In fact we don’t seem bothered by the crosstalk from ordinary stereo – I am not convinced we hear it as “colouration”. Maybe it results in a narrowing of the width of the ‘scene’, but with the benefit of increasing its stability. (Hand-waving justification of the status quo, maybe, but I have tried ambiophonic demonstrations, and I was eventually happy to go back to ordinary stereo).

Other predictions include the increasing use of automatic room correction, ultra-sophisticated tone controls and loudness profiles that allow the user to tailor every recording to their own preferences.

Tiny speakers will generate huge SPLs flat down to 20 Hz – the Devialet Phantom is the first example of this, along with the not-so-futuristic drawback of needing to burn huge amounts of energy to do it. Complete multi-channel surround envelopment will come from hidden speakers.

At the hardware fetish end, no doubt some people imagine that even higher resolution sample rates and bit depths must result in better audible quality. Some people probably think that miniaturised valves will transform the listening experience. High resolution vinyl is on the horizon. Who knows what metallurgical miracles await in the science of audio interconnects?

For the IT-oriented audiophile, what is left to do? Multi-room audio, streaming from the cloud, complete control from handheld devices are all here, to a level of sophistication and ease of use limited only by the ‘cognitive gap’ between computer people and normal human users that sometimes results in clunky user interfaces. The technology is not a limiting factor. Do you want the album artwork to dissolve as one track fades out and the new artwork to spiral in and a CGI gatefold sleeve to open as the new track fades in? The ability to talk to your device and search on artist, genre, label, composer, producer, key signature? Swipe with hand gestures like Minority Report? Trivial. There really is no limit to this sort of thing already.

In fact, for the real music lover, I don’t think there is anything left to do. Truth be told, we were most of the way there in 1968.

The basic test is: how much better do you want the experience of summoning invisible musicians to your living room to be? I can’t imagine many worthwhile improvements over what we have now. The sound achievable from a current neutral stereo system is already at ‘hologram’ level; the solidity of the phantom image is total – the speakers disappear. It isn’t a literal hologram that reproduces the acoustics in absolute terms, allowing you to walk around it, of course, but it is a plausible ‘hologram’ from any static listening position, allowing you to ‘walk around it’ in your mind, and it stays plausible as you turn your head.

It isn’t complete surround envelopment, but there is reverberation from your own room all around you, and it seems natural to sit down and face the music. You will hear fully-formed, discrete, musical parts emerging from an open, three dimensional space, with acoustics that may not be related to the space you are listening in. You have been transported to a different venue – if that is what the recording contains. In terms of volume and dynamics, a modern system can give you the same visceral dynamics as the real performance.

And all this is happening in your living room, but without any visuals of the performance – it is music that you are wanting to listen to after all. If the requirement is to experience a literal night at the opera, then short of a synthesised Star Trek type ‘holodeck’ experience you will be out of luck.

You could always watch a high resolution DVD of some performance or the BBC’s Proms programmes, for example, and such visuals may give you a different experience. They will, however, destroy the pure recreation of the acoustic space in front of you because, by necessity, the visuals jump around from location to location, scene to scene in order to maintain the interest level, and your attention will be split between the sound and the imagery. Anyway, a huge TV will cost you about £200 from Tescos these days so that aspect is pretty well covered, too.

The natural partner to a huge TV is multi-channel surround sound. Quadraphonic sound seemed like the next big thing in the 1970s, but didn’t take off at the time. We now have five or seven channel surround sound. Does this improve the musical experience? Some people say so, but that could just be the gimmick factor, or an inferior stereo system being jazzed up a bit. While the correlation between two good speakers produces an unambiguous ‘solution’ to the equations thereof, multiple sources referring to the same ‘impulse’ could result in no clear ‘solution’ – that is, a fuzzy and indistinct ‘hologram’ that our ears struggle to make sense of. Mr. Linkwitz surmises something similar in the case of the centre speaker, plus he finds it visually distracting; with just two speakers, the space between them becomes a virtual blank space in which it is easier to imagine the audio scene. Most recordings are stereo and are likely to remain that way with a large proportion of listeners using headphones. For these reasons, I am happy that stereo is the best way to carry on listening to music.

Can DSP improve the listening experience further? Hardly at all I would say. So-called ‘room correction’ cannot transform a terrible room into a great one, and it doesn’t even transform a so-so one into a slightly better one. It starts from a faulty assumption: that human hearing is just a frequency response analyser for which real acoustics (the room) are an error, rather than human hearing having a powerful acoustics interpreter at the front end. If you attempt to ‘fix’ the acoustics by changing the source you just end up with a strange-sounding source. At a pinch, the listener could listen in the near(er) field to get rid of the room, anyway.

I am convinced that the audiophile obsession with tailoring recordings to the listener’s exact requirements is a red herring: the listener doesn’t want total predictability, and a top notch system shouldn’t be messed about with. As a reviewer of the Kii Three said:

…the traditional kind of subjective analysis we speaker reviewers default to — describing the tonal balance and making a judgement about the competence of a monitor’s basic frequency response — is somehow rendered a little pointless with the Kii Three. It sounds so transparent and creates such fundamentally believable audio that thoughts of ‘dull’ or ‘bright’ seem somehow superfluous.

The user doesn’t have access to the individual elements of the recording. What can be done in terms of, say, reducing the volume of the hi-hats (or whatever) is crude and unnatural and bleeds over every other element of the recording. The only chance of reproducing a natural sound, maintaining the separation between fully-formed elements and reproducing a three dimensional ‘scene’, is for the system to be neutral. When this happens, the level of the hi-hats likely just becomes part of the performance. Audiophiles who, without any caveat, say they want DSP tone controls in order to fiddle about with recordings have already given up on that natural sound.

In summary, I see the way music was ‘consumed’ 40 or even 50 years ago as already pretty much at the pinnacle: two large speakers at one side or end of a comfortably-furnished living room, filling the space with beautiful sound – at once combining compatibility with domestic living and the ability to summon musicians to perform in the space in a comprehensible form that one or several people can enjoy without having to don special apparatus or sit in a super-critical location. And the fitted carpets of those times were great for the acoustics!

All that has happened in the meantime is just the ‘mopping up’ of the remaining niggles. We (can) now have better performance with respect to distortion, frequency response, dynamic range, and a more solid, holographic audio ‘scene’; no scratches and pops; instant selection of our choice of the world’s total music library. The incentives for the music lover to want anything more than this are surely extremely limited.

The active crossover in 1952

In the archive of magazines mentioned earlier, I decided to try to find the earliest reference to active crossovers. By sheer good luck, the first magazine I clicked on at random contained an article on triamplification (not yet named “active crossover”) from 1968.

[Image: six amplifiers]

It lists the following advantages of active crossovers:

  1. Improved damping
  2. Lower intermodulation distortion
  3. Improved frequency handling by drivers
  4. Higher power handling
  5. Smoother response
  6. Adjustable crossover frequencies and slopes

It mentions that there were several biamplification products in the late fifties, but that when stereo came along the concept was forgotten.

This article then led me to one on biamplification from 1956, and finally to possibly the earliest article on active hi-fi crossovers, from 1952.

[Image: biamp title 1952]

[Image: biamplify 1952]

In this article, they design and build their own low level crossover.

[Image: 1952 xover]

Switching back and forth produced a subtle but distinct difference in listening pleasure. The low frequencies seemed a little more pure and less obscured, the middles and highs cleaner. The overall effect was that we had moved one step forward toward exact reproduction of the music as inscribed on the phonograph disk. There was a definite improvement in sound over a considerably better than average single amplifier system with a carefully designed dividing network and well balanced speakers.

They find that other compelling reasons to use the system are the freedom it gives to mix and match drivers without having to worry about their relative sensitivities, and the ability to adjust crossover frequencies easily and quickly.

Conclusion

Hi-fi manufacturers and customers alike are still struggling with passive crossovers despite the problem having been solved 65 years ago! This is as much to do with the ‘culture’ of audio as any technical or economic reasons.

Vinyl worship at the extreme

Hats off to the people who thought of this wheeze:

…a £6,300 lacquer of Sarah Vaughan that only survives one play

Yes, it’s a recording on a lacquer-coated aluminium disc, such as is used in the manufacture of LPs. It’s soft, and if it is played it is destroyed in the process. You can buy one of a limited edition of thirty for £6,300, to be played just once. And if you like Sarah Vaughan that would be a bonus.

Presumably the idea is that it gets you one step closer to the original musical event.

But not so fast. This one is derived from a digital transfer. And not just a straight transfer. They digitise the original live recording tapes and then do a bunch of signal processing, explicitly removing some of the original event in the process.

Once the signal is digitised, it’s treated using processing algorithms to try and reduce residual noise – a process that isn’t always easy. While the tapes were in good condition, the Peterson performance proved the most difficult. The tapes hadn’t been opened since 1962, and had much more analog noise than the others.

D’Oria-Nicolas also told us how, in the Evans’ recording, “the drums were too close to the piano and some frequencies did make some drum skins vibrate… We successfully managed to delete that.”

Obviously, the closest you can get to the original event is by playing the analogue tapes, and a straight digital transfer of these will be indistinguishable from the tape. Noise, drop-outs and all.

‘Photoshopping’ is the next stage, and you can actually download the photoshopped version and listen to it. Digital cleaning-up of scratched, dusty images can be a very positive thing, and the audio equivalent may be too. This version may or may not need some further manipulation in order to cut the lacquer master on a lathe, plus it needs filtering for RIAA equalisation.

As I understand it, in the LP process (which I view with affection, rather like any other ‘heritage’ industry such as keeping steam trains going), the lacquer is then coated in metal and the two layers separated to produce a metal negative of the lacquer disc. This is then coated in metal and the two layers separated to produce a metal positive copy of the lacquer. This is then coated in metal and the two layers separated to produce a negative: the stamper. Multiple stampers are produced – stampers wear out. The stampers are then used to press blobs of hot vinyl to produce the final LPs! It is amazing to me that it works so well.

You can then play the vinyl record using a tiny stylus, a cantilever, and a coil/magnet arrangement to produce a tiny voltage. This is amplified and filtered with the reverse RIAA curve before sending it via the volume control to the power amp and speakers.

A vinyl record is quite a long way from the original event!

In this case, the earliest point in the chain that we have access to is the processed digital file. This is regarded by audiophiles as the poor man’s version of the recording. We pay extra (a lot extra) to listen to the output of the next stage – the self-destructing lacquer. Or, for somewhat less, we can buy the result at the end of the chain: the standard vinyl LP.

Obviously, the people behind this scheme understand exactly what they are doing, and have a good sense of humour. But it does highlight a particular audiophile belief, I think: that music – even the devil’s own digital music – can be purified and cleansed if it is passed through ‘heritage’ technology built by craftsmen and artisans.

The rational person might assume that the earlier in the chain you go should give you the best quality, but audiophiles will pay more – much more – to hear the music passed through extra layers of sanctified materials, such as wood, oil, cellulose, varnish, bakelite, animal glue, silver wire, diamond, waxed paper and plastic vinyl.

The First CD Player

[Image: sony cdp-101]

There’s an amazing online archive of vintage magazines that I have only just begun rummaging through. I was pleased to see this 1982 review of the Sony CDP-101, the first commercial CD player. The reviewer gets hold of a unit even before they go on sale commercially, saying:

I feel as though I am a witness to the birth of a new audio era.

This was the first time that the public had encountered disc loading drawers, instant track selection, digital readouts and digital fast forward and rewind, so he goes into great detail on how these work.

And at that time, the mechanics of the disc playing mechanism seemed inextricably linked with the nature of digital audio itself, so, after reading the more technical sections of the article, the reader’s mind would be awhirl with microscopic dots, collimators and laser focusing servos – possibly not really grasping the fundamentals of what is going on.

Audio measurements are shown, though, and of course these are at levels of performance hitherto unknown. (He is not able to make his own measurements this time, but a month later he has received the necessary test disc and is able to do so).

As I write these numbers, I find it difficult to remember that I am talking about a disc player!

Towards the end, the reviewer finally listens to some music. He is impressed:

I was fortunate enough to get my hands on seven different compact digital disc albums. Some of the selections on these albums were obviously dubbed from analog master tapes, but even these were so free of any kind of background noise that they could, for the first time, be thoroughly enjoyed as music. There’s a cut of the beginning of Also Sprach Zarathustra by Richard Strauss, with the Boston Symphony conducted by Ozawa, that delivers the gut-massaging opening bass note with a depth and clarity that I never thought possible for any music reproduction system. But never mind the specific notes or passages. Listening to the complete soundtrack recording of “Chariots of Fire,” the images and scenes of that marvelous film were re-created in my mind with an intensity that would just not have been possible if the music had been heard behind a veil of surface noise and compressed dynamic range.

He talks about

…the sheer magnificence of the sound delivered by Compact Discs

and concludes:

…after my experiences with this first digital audio disc player and the few sample discs that were loaned to me, I am convinced that, sooner or later, the analog LP will have to go the way of the 78 shellac record. I can’t tell you how long the transition will take, but it will happen!

A couple of months later he reviews a Technics player:

Voices and orchestral sounds were so utterly clean and lifelike that every once in a while we just had to pause, look up, and confirm that this heavenly music was, indeed, pouring forth from a pair of loudspeaker systems. As many times as I’ve heard this noise-free, wide dynamic-range sound, it’s still thrilling to hear new music reproduced this way…

…the cleanest, most inspiring sound you have ever heard in your home

So here we are at the very start of the CD era, and an experienced reviewer finding absolutely no problems with the measurements or sound.

In audiophile folklore, however, we are now led to believe that he was deluded. It is very common for audiophiles to sneer about the advertising slogan “Perfect Sound Forever”.

Stereophile in 1995:

When some unknown copywriter coined that immortal phrase to promote the worldwide launch of Compact Disc in late 1982, little did he or she foresee how quickly it would become a term of ridicule.

But in an earlier article from 1983 they had reviewed the Sony player saying that with one particular recording it gave:

…the most realistic reproduction of an orchestra I have heard in my home in 20-odd years of audio listening!

…on the basis of that Decca disc alone, I am now fairly confident about giving the Sony player a clean bill of health, and declaring it the best thing that has happened to music in the home since The Coming of Stereo.

For sure, there were/are many bad CDs and recordings, but it is now commonly held that early CD was fundamentally bad. I don’t believe it was. I would bet that virtually no one could tell the difference between an early CD player and modern ‘high res’.

Both magazines seemed aware that their own livings could be in jeopardy if ‘all CD players sound the same’, but I think that CD’s main problem was the impossibility of divorcing the perceived sound from the physical form of the players. 1980s audio equipment looked absolutely terrible – as a browse through the magazines of the time will attest.

Within a couple of years, CD players turned from being expensive, heavy and solid, to cheap, flimsy and with the cheesiest appearance of any audio equipment. They all measured pretty much the same, however, regardless of cost or appearance. Digital audio was revealed to be what it is: information technology that is affordable by everyone.

This, of course, killed it in the eyes and ears of many audiophiles.