lossless audio compression Song Director Music player

lossless audio compression

Song Director Software will automatically enter all your digital audio files into a database for easy cataloguing, sorting, and organization.  Song Director is also a music player allowing you to create playlists of songs.  Download Song Director now for free!

Here is a good aticle on lossless audio compression

By: Mat Hans and
Ronald W. Schafer

Although lossless audio compression is not
likely to become a dominating technology,
it may become a useful complement to lossy compression
algorithms in some applications. This is because,
as we will see, lossless compression algorithms rarely obtain
a compression ratio larger than 3:1, while lossy compression
algorithms allow compression ratios to range up
to 12:1 and higher. For lossy algorithms, the higher the
compression ratio becomes, the lower the resulting final
audio quality, but when the lowest possible data rate is required,
lossy techniques are the only alternative.
However, lossless audio
coding of stereo CD quality digital audio
signals sampled at 44.1 kHz and
quantized to 16 bits could become an essential
technology for digital music distribution over the
Internet because some consumers will want to acquire the
best possible quality of an audio recording for their
high-fidelity stereo system. Lossy audio compression
technologies such as MPEG or MP3 may not be acceptable
for this application.

Internet resources are limited and will likely remain
constrained because the demand for service continues to
increase at least as fast as the rate of growth of transmission
capacity. Therefore, compression will continue to be
important for Internet transmission of audio signals. It is
reasonable to suggest, for example, that music distribution
applications will offer highly compressed audio clips
to the consumer for browsing and selection. After selection,
the consumer who is accustomed to audio CD quality
may require access to a losslessly compressed copy of
the original—a copy without any distortion due to the
compression algorithm.
Music distribution over the Internet is not the only application
for which lossless audio compression technology
is of interest. This technology can also be used for the
archiving and mixing of high-fidelity recordings in professional
environments. In this case, lossless compression
avoids the degradations incurred in multiple encoding
and decoding when using lossy compression coders. This
problem arises particularly in the post production industry
and occurs when the signals are modified or when different
coding formats are used. Lossless audio coding is
also used in the new DVD standard for storage of audio
signals at higher resolution and sampling rates [6].
The first section of this article is a survey and a classification
of the current state-of-the-art lossless audio compression
algorithms (lossless coders). This study finds
that lossless audio coders have reached a limit in what can
be achieved for lossless compression of audio.

 

The second
section is a description of a new lossless audio coder called
AudioPaK, which has low algorithmic complexity and
performs as well or even better than most of the lossless
audio coders that have been described in the literature.
State-of-the-Art for Lossless Audio Coders
Lossless compression is not a new problem in the general
context of computing. A variety of utilities such as PkZip
are available and widely used to compress text and program
files for storage or transmission over the Internet.
These programs are designed to be very effective for text
data. For example, a PostScript version of this article occupied
4,560,999 bytes of storage while the PkZipped
version occupied only 696,041 bytes of storage— a compression
ratio of 4,560,999/696,041 = 6.55. On the
other hand, track 04, one of the audio files consisting of
16-bit samples used in our tests, occupied 33,715,964
bytes of storage before compression and 31,573,915
bytes after compression with PkZip. In this case, the compression
ratio was only 1.07. What is needed are algorithms
specifically designed for digital audio files.
Although lossless coding of audio has not received a great
deal of attention so far, a number of different
algorithms have been developed, and code for
testing these algorithms has been made available
in almost all cases. The compression algorithms
to be discussed in this paper are listed
in Table 1. This list represents all the lossless
coders that we found in the literature or
through searching the Web. We believe that
these coders represent the state-of-the-art in
lossless audio compression. Shorten [21],
MUSICompress [23], LTAC [19], DVD [4]—[6] and
Philips’ algorithms [1] are described in the literature. Information
concerning the other coders—Sonarc [22],
OggSquish v98.9 [18], and WA [16]—comes from personal
communications with the designers of these algorithms.
AudioPaK is a new lossless audio coder described
later in this article.
Basic Principles of Lossless Compression
Fig. 1 is a block diagram representation of the operations
involved in compressing a single audio channel. All of the
techniques studied are based on the principle of first removing
redundancy from the signal and then coding the
resulting signal with an efficient digital coding scheme.
Each of the steps shown in Fig. 1 will be discussed in general
terms and in terms of specific coders.
Fig. 1 shows the operations for a single channel of
lossless audio compression. Although the two channels of
a stereo recording are generally not independent, the dependence
is often weak and difficult to take into account.
Therefore, the multichannel case of lossless compression

Audio
Signal
Framing Intrachannel
Decorrelation
Entropy
Coding
Compressed
Signal
x [n] e [n]
 1. The basic operations in most lossless compression algorithms.
Table 1. List of Lossless Compression Algorithms.
Name of Algorithm Authors
AudioPaK M. Hans, R. Schafer [11]
DVD P. Craven et al. [4]-[ 6]
LTAC M. Purat, T. Liebchen,
P. Noll [19]
MUSICompress A. Wegener [23]
OggSquish C. Montgomery [18]
Philips A. Bruekers, A. Oomen,
R. van der Vleuten [1]
Shorten T. Robinson [21]
Sonarc R. Sprague [22]
WA D. Lee [16]
has not received much attention in the literature. Generally,
stereo channels are compressed separately without
attempting to take advantage of the correlation that
might exist between the left and right channels or by simply
coding the left channel and the difference between the
two channels.
Framing
The framing operation in Fig. 1 is introduced to provide
for editibility, an important and necessary property for
most applications dealing with digital audio. It is often
important to quickly and simply edit a compressed audio
bit stream, and the sheer volume of data prohibits repetitive
decompression of the entire signal preceding the region
to be edited. Therefore, a practical solution is to
divide the digital audio signal into independent frames of
equal time duration. This duration should not be too
short, since significant overhead may result from the
header that is prefixed to each frame. This header is important,
since it defines the compression algorithm parameters,
which can change on a frame-to-frame basis to
follow the varying characteristics of the input signal over
time. Furthermore, depending on the application, this
header may include additional data, such as multimedia
and synchronization information. On the other hand, the
frame duration should not be too long, since this would
limit the temporal adaptivity and would make editing of
the compressed audio bit stream more difficult. The algorithms
surveyed compromised with a 13 to 26 ms frame
duration [1], [3], [11], [21]. This time interval translates
to 576 and 1152 samples for a sampling rate of 44.1 kHz.
Intrachannel Decorrelation
The purpose of the second stage of the typical lossless audio
coder in Fig. 1 is to remove redundancy by
decorrelating the samples within a frame. The lossless audio
coders in Table 1 use two basic approaches for
intrachannel decorrelation. Most algorithms remove redundancy
by some type of modified linear predictive
modeling of the signal. In this approach, a linear predictor
is applied to the signal samples in each block resulting
in a sequence of prediction error samples. The predictor
parameters, therefore, represent the redundancy that is
removed from the signal and the losslessly coded predictor
parameters and prediction error together
represent the signal in each block. A second,
less common, approach is to obtain a low
bit-rate quantized or lossy representation of
the signal, and then losslessly compress the
difference between the lossy version and the
original signal. These approaches are discussed
in more detail below.
Predictive Modeling: The basic principle of
predictive modeling is to predict the value of a
sample x n [ ] using the preceding samples
x[n −1], x[n −2], etc. Fig. 2 is a diagram of
intrachannel decorrelation using the predictive modeling
scheme. If the predictor performs well, the prediction errore
n [ ]will be uncorrelated and, therefore, have a flat frequency
spectrum. Likewise, e n [ ] will on average be
smaller than x n [ ], and it will, therefore, require fewer bits
for its exact digital representation.
The prediction models that are used in lossless compression
of audio are based on the structure of Fig. 3. This
diagram depicts

 3. General structure for prediction.
Our survey of currently available
lossless audio coders suggests
that the current technology has
reached a limit in what can be
achieved for lossless
compression of audio.

−= Σ1
terms in (1), respectively. If the polynomial B$(z)=0, the
feedback terms are omitted, and the prediction error filter
has finite impulse response (FIR), neglecting the
quantizer Qfor the moment. Thus, a predictor with only
$A(z)is an FIRpredictor. If the feedback terms are present
(B$(z)≠0), the predictor is an infinite impulse response
(IIR) predictor.
The quantization operation in Fig. 3 makes the predictor
a nonlinear predictor, but, since the quantization is to
16-bit precision, it is reasonable to neglect the
quantization in understanding the first-order effects of
the predictor and in developing methods for estimating
the predictor parameters. The quantization is necessary
for lossless compression since we want to be able to reconstruct
x n [ ]exactly from e n [ ]remotely and possibly on
a different machine architecture. Therefore,e n [ ]is usually
represented with the same fixed-point integer
quantization scheme as x n [ ], so as not to introduce new
quantization levels below the least significant bit. Reconstructing
x n [ ]frome n [ ]can be done by simply solving (1)

Linear predictors are widely used in speech and audio
processing [15], [20]. In most applications, an FIR predictor
is used, and the coefficients of the prediction filter
$A(z) are determined to minimize the mean-squared prediction
error. Without the quantizer and with B$(z)=0 in
Fig. 3, the prediction coefficients can be found for the
FIR case by simply solving a set of linear equations [20].
When FIR linear prediction is used in lossless audio compression,
the prediction coefficients can be found by standard
methods and then quantized for use in Fig. 3 (with
B$(z)=0). As shown in Fig. 4, the same coefficients are
used in the reconstruction of x n [ ] from e n [ ]. Therefore,
the prediction coefficients must be quantized and encoded
as part of the lossless representation. Generally, a
new set of coefficients is determined for each frame of the
signal thereby adapting the predictor to the changing local
structure of the signal. The method of determination
of the predictor coefficients ranges from estimation of the
minimum mean-squared predictor for each block to the
simple choice of a predictor from a small library of fixed
predictors.
In many of the lossless audio compression algorithms
in our survey, the predictor for a given block is chosen
from a finite library of coefficient sets to avoid the computation
required to estimate the optimum predictor coefficients.
In these cases, the identity of the predictor that was
used for a given frame is all that need be encoded with the
compressed prediction error. This is the approach that is
followed, for example, in the case where both
feed forward and feedback terms are involved in the prediction.
This is because the general solution for the minimum
mean-squared predictor is much more complicated
in the IIR case. For this reason, IIR prediction has not
been widely used in other applications, such as speech
processing. In lossless compression, Craven et al. [4], [5]
have shown that, while IIR prediction has the
potential to perform better because it can
match a wider range of spectral shapes than
FIR prediction with the same number of coefficients,
the improvement in compression performance
demonstrated so far is modest at
best.
In general, a minimum mean-squared predictor
for audio signals will involve fractional
coefficients. These must be quantized and
coded as part of the lossless representation,
and the predictor must be implemented with
fixed-point arithmetic. However, when the
prediction coefficients are integers, it is particularly
simple to compute the prediction error
with integer arithmetic that can easily be done
exactly the same way on different computation
platforms.
The following list shows the algorithms
that are based on predictive modeling according
to the type of prediction that is used:
 FIR model: Shorten [21], Sonarc [22],
WA [16], Philips [1].
 IIR model: OggSquish [18], DVD [4],
[5].
24 IEEE SIGNAL PROCESSING MAGAZINE JULY 2001
+
+
+
+
+
e [n] x [n]
Q
B z ( )
^
A^(z) –
 4. General structure for reconstruction.
+
+

x [n] e [n]
c [k]
A Q AT Q
^c [k] ^x [k]
^c [k]
 5. Details of the intrachannel decorrelation block proposed by Purat et al. in
their LTAC. Both signals c$[k ] and e[n] are entropy coded and transmitted. Q
stands for integer truncation.
 Integer coefficients FIR model (fixed or adaptive):
Shorten [21], HEAR [3], WA [16], MUSICompress
[23], AudioPaK.
The details of how the predictor model is chosen at
each frame can be found in each of the above references.
Lossy Coding Model: The second approach to lossless
compression is based on obtaining an approximation to
x n [ ]by some sort of lossy coding scheme that maintains
the signal wave shape with no time shift. Although this
approach has some potentially attractive features, only
one algorithm that clearly used this approach was found.
Purat et al. [19] proposed a method based on transform
coding of the audio signal [15]. The basic idea of this
method is presented in Fig. 5. This algorithm uses an
orthonormal transform, a discrete cosine transform, to
obtain a quantized version of the signal. The cosine transform
and its inverse are indicated by the matrix multiplications
by A and AT in Fig. 5. The use of the cosine
transform is motivated by the well-known energy compaction
property of the cosine transform, which permits a
good reconstruction from only a few quantized values of
the cosine transform [15]. The input signal x n [ ]is transformed
and quantized with an unitary quantization step
size Δ =1. The resulting coefficients c$[k] are then entropy
coded. Because the inverse transformation followed by a
dequantization step results in an approximation x$[n] of
the original signal x n [ ], the decompression steps are duplicated
at the coder side to compute the residual error
e[n]= x[n]− x$[n]. This error is entropy coded and transmitted
along with the coded coefficients c$[k]. At the
decoder, the original signal can be obtained from
x[n]= x$[n]+ e[n], provided that the implementation of
the inverse cosine transformation and quantization at the
decoder is identical to that at the coder.
A potential advantage of this and other lossy-coding-
based approaches is that the lossless representation
contains within it a lossy lower data rate representation of
the audio signal, which could serve as a “thumbnail
sketch” of the signal. MUSICompress [23] also is structured
to provide for a lossy compression mode of operation.
This approach could be useful in situations where
progressive transmission of the signal is desired or where
audio signals are broadcast over the Internet to destinations
with varying computational resource and data rate
[13].
Entropy Coding
Entropy coding removes redundancy from the residual
signale n [ ](and the discrete cosine transform (DCT) coefficients
in the transform-based method). In this process,
no information is lost. The following three methods are
used in the algorithms in Table 1: Huffman, run length,
and Rice coding. The first two methods are well known
[9]. On the other hand, Rice coding, which is used in
some of the coders, is less well known.
The Rice code is characterized by only one parameter,
denoted henceforth asm. In fact, this code is the Huffman
code for a Laplacian probability density function, which
is found to be a good approximation for the distribution
of the prediction residual samples e n [ ] for all the
intrachannel decorrelation operations discussed above
[1], [3], [19], [21]. To form this code, a number is divided
into three parts; a sign bit, themlow-order bits, and
the remaining high-order bits. The first part of a code
word is a single bit indicating the sign of e n [ ]. The second
part consists of the m least-significant bits of the binary
representation of the absolute value ofe n [ ]. The third part
of the code word is made of N consecutive zeroes, where
N has the same binary representation as the yet unused
most significant bits of the absolute value of e n [ ]. Finally,
to end this series of zeroes, a bit set to 1 is inserted. Table
2 gives examples of Rice codes for m =3.
All the lossless audio coders presented in the literature
that use the Rice code (Lossless Transform Audio Compression
(LTAC) [19], Philips [1], Shorten [21], WA
[16]) define the single parameter m to be of constant
value over an entire frame. This value is found by means
of a full search or by estimation. Such an estimation was
first given in [21] as
m ( E( e n )) e = log log ( ) | [ ]| 2 2 , (3)
where E()is the mathematical expectation function (average).
Summary of Compression Algorithm Characteristics
The characteristics of the compression algorithms of this
study are summarized in this section.

A potential advantage of
lossy-coding-based approaches
is that the lossless
representation contains a lossy
lower data rate representation of
the audio signal, serving as a
“thumbnail sketch” of the signal.
Table 2. Examples of Rice Codes for m= 3.
Number Sign
Bit
m
Lower
Bits
Number
of 0s
Full
Code
0 0 000 0 00001
18 0 010 2 0010001
−12 1 100 1 110001
AudioPaK (Version 1.1): This coder is described in detail
later in this article.
DVDStandard: This algorithm is based on the work of
Craven et al. [4]-[6]. It uses IIR prediction and Huffman
coding for the prediction error signal.
LTAC (Version 1.01 for DOS): LTAC is the only coder
that uses transform coding for decorrelation. The Rice
coding scheme is used to pack the transform coefficients
and the residual errors.
MUSICompress (Version 1.2 for Windows):
MUSICompress uses a predictive type computational
structure where adaptively varied approximations are
subtracted from the original signal to produce an error
signal that is losslessly coded. The nature of the approximations
is not specified in [23], so it is possible that
MUSICompress should be placed in the lossy-coding-
based class. MUSICompress uses a block floating-
point representation and Huffman coding to code
both the approximation and the residual signal.
OggSquish (Version 98.9 for Unix): The lossless audio
coder that comes with OggSquish version 98.9 uses IIR
linear prediction and Huffman coding. OggSquish is not
exactly an audio coder, it is a DSP engine that runs any
coding program the user wants [18]. For example,
OggSquish is capable of running scripts that generate
ADPCM, MPEG, TwinVQ, etc. The 98.9 version of
OggSquish includes a lossless coder designed by the authors
of OggSquish. Here we refer to this coder as
OggSquish.
Philips: This algorithm designed by Philips [1] uses a
tenth-order FIR linear predictor along with the Rice coding
scheme.
Shorten (Version 2.1 for DOS): Two commands were
used:
-p 0: shorten -b 1152 -v 2 -t s16 -p 0
FILE.pcm FILEP0.shn, and
-p 10: shorten -b 1152 -v 2 -t s16 -p 10
FILE.pcm FILEP10.shn.
These two commands define different coders, which differ
by their intrachannel decorrelation block. The program
option -p 0 uses an FIR predictor with integer coefficients
while -p 10 uses a tenth-order minimum
mean-squared linear FIR predictor. Both coders use the
Rice coding scheme to encode the prediction error signal.
Sonarc (Version 2.1h for DOS): Sonarc uses FIR linear
prediction and Huffman coding. If the default arguments
are used, a fixed-point integer version of Schur’s algorithm
is used to compute the FIR prediction coefficients.
If the -x argument is included in the arguments then the
prediction coefficients are determined by the “winner” of
a contest between autocorrelation, covariance, Schur’s,
and Burg’s algorithms.
Waveform Archiver (Version 1.1 for DOS): WA uses
FIR linear predictors for the intrachannel decorrelation
block. The entropy coding block uses Rice coding. WA
offers five levels of coding: -c1, -c2, -c3, -c4, and
-c5. The compression ratio was found to increase modestly
with the level number. The setting -c5 gives the
best compression ratios with significant increase in computation
time over the setting -c1.
Performance Evaluation
With the exception of the DVD standard, all the compression
algorithms discussed so far either had executable
files available for downloading or they had been previously
tested on material that was also accessible to us. To
compare their performance, a representative set of six audio
files was used to test each of the algorithms. All these
files have CD format, i.e., 44.1 kHz, stereo, 16 bits, and
are briefly described in the following list and in Table 3.
 Track 4 of [17] (3 min 11 s): Madonna’s “Like a Virgin.”
 Track 20 of [7] (39 s): Saxophone, low frequency musical
signal.
 Track 27 of [7] (20 s): Castanets, high frequency musical
signal.
 Track 48 of [7] (28 s): Voice quartet.
 Track 66 of [7] (18 s): Wind ensemble.
 Track L=R (18 s): This track was constructed using the
left channel of track 66 of [7]. The right
channel is an exact copy of the left channel.
Therefore, the difference between the left
and right channel is zero. This track serves
as a probe to allow us to determine if a
given compression scheme takes advantage
of interchannel correlation.
Table 3 summarizes the original file
size in bytes and the estimated first-order
entropy H0 for both left and right channels.
The estimate was obtained by computing
a histogram with 216 bins and
normalizing by the number of samples to
obtain probability estimates pi . The entropy
estimate is then the sum over all bins
ofp p i i log 2 . Recall that H0 is the entropy
of a source where each symbol
(sample) is independent and identically
26 IEEE SIGNAL PROCESSING MAGAZINE JULY 2001
Table 3. File Size and First-Order Entropy
for Left and Right Channels (x[n]).
Track File Size
(bytes)
H0 Left Channel
(bits per sample)
H0 Right Channel
(bits per sample)
Track 04 33,715,920 14.33 14.28
Track 20 6,879,600 10.86 11.03
Track 27 3,528,000 9.27 9.26
Track 48 4,939,200 11.59 11.48
Track 66 3,175,200 10.81 10.92
Track L=R 3,175,200 10.83 10.83
distributed. For signals such as digital audio, where there
is considerable sample-to-sample correlation, the entropy
will be significantly lower than H0 . However, these entropy
estimates, which are all less than 16 bits, suggest
that some degree of lossless compression should be possible
in all cases. Indeed, our results show that compressed
files average about 5 bits/sample.
Tables 4 and 5 group the lossless coders depending on
the arithmetic used. Table 4 groups the integer and
fixed-point arithmetic coders, while Table 5 groups the
floating-point arithmetic coders. AudioPaK is the only
coder to use only integer arithmetic. These tables give the
compression ratio
r =
original file size
compressed file size
for the six experimental audio files. To convert these
numbers into bits/sample to compare to the entropy estimates,
simply divide 16 by r. For example, r =3is equivalent
to 5.33 bits/sample.
At the outset, it is important to note that most of the
recordings from [7] contain a significant amount of silence.
In fact, we found that a compression ratio of about
1.3 may be easily obtained by compressing only these silent
segments for tracks 20, 27, 48, and 66.
The data in Tables 4 and 5 show several things. First
note that the compression ratios obtained by the various
state-of-the-art lossless audio coders for the same audio
file are very similar. For example, track 20 compression
ratios are in the range 2.72 to 3.28, which is equivalent to
at most a 1 bit per sample difference. For the other experimental
audio files, these differences are at most 0.85 bits
per sample for track 27, 1.12 bits for track 48, 0.80 bits
for track 66, and 0.72 bits for track 4.
Compression ratios associated with track L=R show
that only for AudioPaK, MUSICompress, OggSquish,
and WA is the compression ratio twice that of track 66,
indicating that these coders attempt to take advantage of
the dependence between channels in a way that works
perfectly if the two channels are perfectly correlated.
Finally, these tables show that coderWAwith the -c5
argument consistently gives the best compression ratio
for the six experimental audio files, although in most cases
it performed only slightly better than other algorithms.
This is one of the few coders, which could not provide
real-time compression on our 166 MHz MMX Pentium
machine. In general, floating point computations are not
desirable for data compression. They generally are slower
than integer computations, and floating point formats
can vary between computational platforms.
The Philips algorithm [2] and theDVDstandard algorithm
[4]-[6] are the only algorithms in the list for which
no executable was available. We have not attempted a
comparison with the DVD algorithm since it is designed
JULY 2001 IEEE SIGNAL PROCESSING MAGAZINE 27
Table 4. Compression Ratios for Integer and
Fixed-Point Arithmetic Lossless Audio Coders (Frame
Size Set to 1152 Samples for the Integer Coders, and
Default Arguments Used for the Fixed-Point Coders).
Track
Integer
Arithmetic Fixed-Point Arithmetic
AudioPaK MUSICompress Sonarc
Track 04 1.39 1.37 1.42
Track 20 3.12 2.93 3.13
Track 27 2.47 2.46 2.71
Track 48 2.56 2.41 2.65
Track 66 2.46 2.33 2.55
Track L=R 4.93 4.58 2.56
Table 5. Compression Ratios for Floating-Point Arithmetic Lossless Audio Coders (Frame Size Set to 1152
Samples for Shorten and Default Arguments Used for OggSquish, Sonarc, LTAC, and WA).
Track
Floating-Point Arithmetic
Shorten −p0 Shorten −p10 OggSquish LTAC Sonarc −x WA −c5
Track 04 1.38 1.43 1.43 1.41 1.42 1.46
Track 20 3.11 2.72 3.01 3.11 3.16 3.28
Track 27 2.46 2.69 2.67 2.70 2.72 2.83
Track 48 2.54 2.32 2.53 2.65 2.69 2.77
Track 66 2.46 2.42 2.51 2.54 2.57 2.64
Track L=R 2.47 2.44 5.01 2.70 2.58 5.28
for higher sampling rates and was evaluated on signals
that were not available to us. However, the work on
which the DVD algorithm is based [5] shows little if any
advantage of the IIR predictor over an FIR predictor. In
the case of the Philips algorithm, however, the compression
ratio was published in [1] (frame size was set to 1024
samples) for compression of all 70 tracks of [7]. To allow
a relative comparison with the other coders we also compressed
all 70 tracks using AudioPaK. Table 6 summarizes
results of this comparison. The italicized numbers
were published in [1], and the other number represents
the performance of AudioPaK on the entire Sound Quality
Assessment Material (SQAM) CD. The results in Table
6 must be interpreted with care. Since we were not
able to compare the Philips compression algorithm on
the individual files of our test set, we cannot draw the conclusion
that this algorithm will perform uniformly better
than all the other algorithms. The SQAM CD contains
many simple sounds such as pure tones, single instruments,
and silence which might be more easily compressed
by some algorithms and not others. Other results
in [1] show the Philips algorithm performing similarly to
OggSquish on individual files that were unavailable for
our study. These earlier individual comparisons showed
significantly lower compression ratios than the 3.31 reported
for the entire SQAM disk.
Conclusion of the Survey
Our survey of currently available lossless audio coders
suggests that the current technology has reached a limit in
what can be achieved for lossless compression of audio.
Furthermore, in spite of a wide range of approaches and
complexities, the different algorithms perform similarly
on the same audio signals with generally less than one bit
per sample difference in compressed bit rate.
Therefore, a coder compressing at this limit with the
least number of arithmetic operations would define the
best technology for lossless audio compression. That is,
the design of a lossless audio coder should now focus
on reducing algorithm complexity. With this in mind,
we designed a simple, lossless audio coder, which we
called AudioPaK. This coder uses only a few integer
arithmetic operations on both the coder and the decoder
side. The main operations are prediction with integer
coefficients and Golomb coding, which are done
on a frame basis. As summarized in Tables 4-6, our algorithm
performs as well, or even better, than most
state-of-the-art lossless coders.
AudioPaK:
A Low Complexity Lossless Coder
AudioPaK is well suited for Internet transmission because
of its low instruction complexity and good compression
performance. The algorithm is conveniently
described in terms of the same structure (Fig. 1) that was
used to discuss all the other lossless compression algorithms.
AudioPaK differs from the other algorithms in
how the redundancy removal and coding are implemented.
The details of the algorithm are discussed below.
28 IEEE SIGNAL PROCESSING MAGAZINE JULY 2001
Table 6. Compression Ratio for Complete CD [7].
Philips Shorten OggSquish AudioPaK
3.31 2.33 3.16 2.88
Table 7. Compression Ratios for the Left Channel of
Experimental Audio Files Using AudioPaK with
Different Frame Sizes (in Number of Samples).
Track 192 576 1152 2304 4608
Track 04 1.38 1.39 1.39 1.38 1.36
Track 20 3.02 3.14 3.17 3.17 3.17
Track 27 2.50 2.49 2.46 2.42 2.37
Track 48 2.48 2.55 2.56 2.56 2.56
Track 66 2.42 2.47 2.47 2.46 2.42
x [n]
x [n–3]
x [n]
x3[n]
x [n–2]
x [n–1]
x1[n]
x2[n]
x0[n]
n
Audio Samples
Polynomial Predictors
^
^
^
^
 6. Polynomial approximation interpretation of the predictors
used by AudioPaK for intrachannel decorrelation.
Framing
As in the case of all the other lossless audio coders in our
study, AudioPaK divides the input audio signal into independent
frames. The length of a frame is a coder parameter
and may be set at coding time; however, there are
good practical reasons for using frame sizes that are a
multiple of 192, the number of samples carried by a standard
interconnection digital interface established by the
Audio Engineering Society and the European Broadcasting
Union (AES/EBU). It is a serial transmission format
for linearly represented digital audio data, which permits
transmission of two-channel digital audio information,
including both audio and nonaudio data, from one professional
audio device to another. Table 7 shows the results
for frame lengths ranging from 192 to 4608 in
multiples of 192. This data shows that the
compression ratios are not very sensitive to
the frame length and suggests that a good
length for 44.1 kHz, 16-bit audio sampled
streams is 1152 samples.
Intrachannel Decorrelation
Much of the arithmetic in the algorithms described
earlier is in the computation of the
prediction error signal, so this area is a prime
target for simplification. Therefore, the
intrachannel decorrelation operation uses a
very simple FIR adaptive prediction method
using only integer coefficients. This approximation
was first proposed in Shorten [21] and will be discussed
in detail here.
The prediction method uses the following four simple
FIR predictors, each of which has only integer coefficients:
$ [ ]
$ [ ] [ ]
$ [ ] [ ] [ ]
$ [ ]
x n
x n x n
x n x n x n
x n
0
1
2
3
0
1
2 1 2
3
=
= −
= − − −
= x[n − ]− x[n − ]+ x[n − ].





1 3 2 3
These predictors are FIR predictors since they involve
linear combinations of a finite number of past samples.
However, they also have an interesting interpretation in
terms of the pth-order polynomials that they define. Recall
that a ( p −1)-order polynomial can be found that
passes through the previous p data points
x[n −1], x[n −2],K, x[n − p]. This polynomial can be
evaluated at the nth sample time to obtain the predicted
value x$[n]. These polynomials and the predicted values
that they produce are illustrated in Fig. 6 for a typical set
of previous samples.

This permits the entire set of prediction error signals to
be computed without any multiplications.This algorithm
can be parallelized to take advantage of Intel’s MMX
SIMD architecture, as shown in [11].
For each frame, the four prediction residuals
e n e n e n e n 0 1 2 3 [ ], [ ], [ ],and [ ] are computed at every sample
in the frame, and the absolute values of these residuals
are averaged (summed) over the complete frame. The residual
with the smallest sum of magnitudes over all samples
in the frame is then defined as the best approximation
for that frame, and that predictor is used for that frame.
The information on which predictor was used can be
coded with two bits of information, and that becomes
part of the overhead information for the frame. Table 8
summarizes how many times each predictor was used in
coding the experimental audio tracks.
As seen in Table 8, there is no strong preference for any
of the predictors; i.e., there is an advantage in varying the
predictor from frame-to-frame. More light is shed on
these predictors by considering their frequency

Table 8. Number of Times Each Residual Is Chosen
During Coding of a Left Channel Frame.
Track e0 e1 e2 e3
Total
Frames
Track 04 80 4948 5694 3912 14634
Track 20 621 364 1300 701 2986
Track 27 334 300 666 232 1532
Track 48 288 252 337 1267 2144
Track 66 301 30 407 641 1379
Magnitude
p=3
10
8
6
4
2
0
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
p=2
p=1
p=0
Integer Coefficient Predictor Frequency Responses
Normalized Frequency = ωT/π
 7. Frequency responses of FIR predictors used by AudioPaK for intrachannel
decorrelation.
sponses. It is easily shown that the z-transform system
function of the pth-order predictor is
A z z p
( )=(1− −1 ) p ,
and the magnitude of the frequency response of the
pth-order predictor is
A (e ) T p
jωT = 2sin(ω / 2) p , (4)
whereT is the sampling rate. Fig. 7 shows plots of (4) for
the four predictors used in AudioPaK. (The frequency
axis in this plot is normalized so that half the sampling
rate is at normalized frequency one.)
Note that, because of the pth-order zero at z= −1, the
predictors for p =1,2,3 all have a high frequency boost
that increases with increasing p. Specifically, the gain at
frequency 22.05 kHz is 0, 20, 40, and 60 dB. Thus, the
choice of predictor can also be thought of in terms of selecting
the right amount of high frequency boost to flatten
the overall spectrum of the prediction error.
The pth-order pole at z= −1 would be problematic in
many applications of linear predictive modeling because
the corresponding reconstruction system of Fig. 4 would
be unstable. In the lossless case, however, if we start the
reconstruction system with p exact samples from the previous
frame, there are no quantization errors to accumulate,
and the reconstruction is exact.
The question naturally arises as to how much performance
is sacrificed by using simple predictors. The evaluations
of the Shorten algorithm suggest that
simplification of the prediction operation by using
low-order predictors with integer coefficients does not
degrade performance appreciably. Indeed, as Table 5
shows, there is little difference in performance between
the simple integer predictors and the optimum tenth-order
linear predictor, and, in fact, in some cases the
high-order predictor performs slightly worse.
Interchannel Decorrelation
AudioPaK can take advantage of interchannel dependencies
present in stereo channel audio streams. This is done
when the coder is set in stereo mode. In this mode, in addition
to approximating the right stereo channel samples,
we simply compute the polynomial approximation of the
difference between the left and right channel along with
the associated residuals, and the smallest value among the
eight sums of absolute residuals (four from the right
channel and four from the difference) defines the best approximation.
Table 9 presents the estimated first-order
entropy H0 for the left error channel, the
right error channel in dual mode, and the right
error channel in stereo mode. The dual channel
mode compresses the left and right channels
separately.
These measures show very little improvement
in compression when the interchannel
decorrelation is switched on. This is not surprising
since the interchannel decorrelation
operation does not take into account time delays
and other differences that are common
between channels of a stereo signal. Because
we seek a coder with minimum algorithm
complexity, this suggests using AudioPaK in
dual mode.
Entropy Coding
Silent frames can easily be detected with residualse
n e n 0 1 [ ], [ ]and efficiently entropy coded
with an escape code. If the silent frame is made
of all 0 value samples then the sum of e n 0[ ] is
30 IEEE SIGNAL PROCESSING MAGAZINE JULY 2001
Table 9. First-Order Entropy H0 in Bits per Sample
for Left and Right Error Channels (e[n]).
Track
Left
Error
Channel
Right Error Channel
Dual
Mode Stereo Mode
Track 04 12.10 12.27 12.09
Track 20 6.03 6.17 6.17
Track 27 7.58 7.52 7.52
Track 48 7.18 7.16 7.16
Track 66 7.82 7.90 7.87
Compression Ratio Per Frame
Minimum, Maximum Sample Values and Average Value Per Frame
Variance Per Frame

otherwise. (5)
An estimate for the parameter k is given in [24] and is
used in AudioPaK. It is based on the expectation E e n i ( [ ])
already computed in the intrachannel decorrelation block
and is
k  (E(e n )) i = log [ ] 2 . (6)
The parameter k is defined to be constant over an entire
frame and takes values between 0 and(b −1)in the case
of bbit audio samples. This parameter can be estimated efficiently
without any floating-point operations as follows:
for (k = 0, N = framesize; N<AbsError;
k++, N *= 2) {NULL;}
where framesize is the number of samples in a frame
and AbsError is the sum of the absolute values of the residual
signal over the complete frame.
Context Modeling: Experiments varying the value of
the parameter k within a frame were also conducted.
These experiments followed the context modeling idea
found in the new JPEG-LS standard [14]. Context modeling
suggests defining regions (contexts) for which the
statistical behavior of the residual signal e n [ ]are similar.
In the new lossless image compression standard,
JPEG-LS, the contexts are built using the local gradient,
which captures the level of smoothness or edginess surrounding
a pixel. For example, pixels in an edge area are
grouped together as are pixels in flat regions. Once these
regions are defined, the coder adaptively chooses the best
entropy coding parameter for each region.
The maximum compression ratio possible was measured
using context modeling for AudioPaK. This maximum
was found by using the best entropy coding
parameter for every residual sample e n [ ]. This means setting
the parameter k for each residual e n [ ] to give the
smallest code word length. Using our experimental audio
files we found a maximum compression ratio improvement
ranging from 6 to 10%. To achieve this rather small
improvement, it would be necessary to infer the context
from the signal data at a computational cost that is not
justified by the savings in bit rate.
Bit Rate Variability
Because of the nonstationary nature of audio signals, the
compression scheme has a variable bit rate. Fig. 8 illustrates
this variability. In this figure the compression ratios
over frames number 3550 and 3599 for the left channel of
track 04 of [17] are presented. Also depicted are the minimum
and maximum sample values in each frame along
with the mean and variance value over each frame. Note
the relationship between the variance (which also defines
the energy) and the compression ratio. As the energy of a
frame increases, the compression ratio decreases.
In some applications, it may be of interest to have a
smoothed peak rate instead of a strong fluctuation in instantaneous
data rate.Asolution to this problem was proposed
by Craven et al. in [5], who suggested inserting
buffers at both the coder and the decoder sides.
Conclusion
Lossless audio compression is likely to play an important
part in music distribution over the Internet, DVD audio,
digital audio archiving, and mixing. Because common
lossless compression utilities such as PkZip do a poor job
of compressing audio streams, a different technology
adapted to these signals is required.
All current state-of-the-art lossless audio compression
algorithms are built around the removal of local dependency
of the audio samples. These algorithms have reached
a limit in compression that is very modest compared to
lossy compression technology. Using the basic principles
of framing, prediction, and entropy coding, we designed
a new coder—AudioPaK—to be implemented using only
a small number of integer arithmetic operations. As
shown in Tables 4 and 5, AudioPaK performs as well or
better than most state-of-the-art coders.

Link to original article

Leave a Reply

Your email address will not be published. Required fields are marked *