Lossy audio codecs MP3, OGG, WMA, AAC and MPC in more detail
Several modern lossy coders are in use today: MPEG-1 Layer 3 (MP3), Windows Media Audio (WMA), Ogg Vorbis (OGG), MusePack (MPC), and MPEG-2/4 AAC. We shall consider these five codecs, the most widely used today, in detail.
MP3 - MPEG-1 Layer 3
MPEG standards in general and MP3 in particular
MPEG-1 Layer 3 (known as "MP3") is the most widespread and popular lossy codec today. It has earned its popularity deservedly: it was the first widespread lossy codec to reach such a high data compression factor together with very good sound quality. A little bit of history. MPEG stands for "Moving Picture Experts Group". The group was established in January 1988. Since its first meeting in May 1988 the group has grown into an unusually large body of experts. A typical MPEG meeting is attended by about 350 experts from more than 200 companies. The largest share of the participants are experts working in scientific and academic institutions. To date the MPEG group has developed the following standards and algorithms:
MPEG-1 (November 1992) - the standard for coding, storage and decoding of moving pictures and audio data;
MPEG-2 (November 1994) - the standard for data coding for digital TV;
MPEG-4 - the standard for multimedia applications;
MPEG-7 - a universal standard for multimedia, intended for processing, filtering and management of multimedia data.
Let us consider the MPEG-1 set of standards. According to the ISO (International Organization for Standardization) standards, this set includes three algorithms of different levels of complexity: Layer 1, Layer 2 and Layer 3. The exact designation of our well-known friend MP3 is "MPEG-1 Layer 3". The general structure of the encoding process is identical in all Layers. At the same time, despite the similarity of the general approach, the Layers differ in their target use and internal mechanisms. By the way, this fact determines the degree of similarity of the algorithms which have "grown" from MPEG-1 (such as Ogg Vorbis and MusePack). Each Layer has its own data stream format and its own decoding algorithm. MPEG-1 algorithms are mainly based on known properties of how the human auditory system perceives sound signals (these techniques were mentioned above).
Briefly about the encoding algorithm used in MPEG-1. At the start of encoding, a bank of filters splits the source audio stream into frequency sub-bands. What happens next depends on the Layer used.
In the case of Layer 3 (MP3), the signal in each sub-band is decomposed into frequency components by applying the MDCT (Modified Discrete Cosine Transform, a transform closely related to the Fourier transform), which yields a set of coefficients. Further processing aims to simplify the signal so that its spectral coefficients can be re-quantized. The spectrum is first cleared (by filtering) of obviously inaudible components: low-frequency noise and imperceptibly high spectral components. At the next stage, a considerably more complex psychoacoustic analysis (described earlier) is applied to the audible part of the spectrum. After all these manipulations, the source signal has lost more than half of its information. Finally, the resulting stream is compressed with a simplified analogue of the Huffman algorithm (a lossless compression method), which noticeably reduces the stream size.
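As a rough illustration of the transform step only, the MDCT of one block can be written directly from its textbook definition. The sketch below is a minimal, unoptimized version in plain Python. The function names mdct, imdct and overlap_add are chosen for this example, blocks are assumed to have an even length, and the windowing and fast algorithms used by real encoders are omitted.

```python
import math

def mdct(block):
    """Direct (slow) MDCT: 2N input samples -> N coefficients."""
    n = len(block) // 2
    return [
        sum(x * math.cos(math.pi / n * (i + 0.5 + n / 2) * (k + 0.5))
            for i, x in enumerate(block))
        for k in range(n)
    ]

def imdct(coeffs):
    """Direct inverse MDCT: N coefficients -> 2N time-aliased samples."""
    n = len(coeffs)
    return [
        sum(c * math.cos(math.pi / n * (i + 0.5 + n / 2) * (k + 0.5))
            for k, c in enumerate(coeffs)) / n
        for i in range(2 * n)
    ]

def overlap_add(first, second):
    """Add the second half of one inverse block to the first half of the next."""
    n = len(first) // 2
    return [first[n + i] + second[i] for i in range(n)]

if __name__ == "__main__":
    signal = [float(v) for v in range(1, 13)]
    # Two blocks of 8 samples that overlap by 4 samples.
    out1 = imdct(mdct(signal[0:8]))
    out2 = imdct(mdct(signal[4:12]))
    # The aliasing in the overlapping halves cancels, restoring signal[4:8].
    print([round(v, 6) for v in overlap_add(out1, out2)])
```

A single inverse block does not reproduce its input; only the sum of two overlapping blocks does. This is why the transform can halve the number of coefficients per block without losing information before quantization.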
In the case of Layer 2 the simplification process is quite similar. The difference lies in what is re-quantized: the amplitude signal in each sub-band rather than the spectral coefficients (some non-MP3 lossy encoders are based on the same technique).
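Re-quantization is the step where information is actually discarded, whether it is applied to spectral coefficients or to sub-band amplitudes. The following is a minimal sketch using a plain uniform quantizer. The functions quantize and dequantize and the step values are illustrative assumptions; real encoders use more elaborate quantizers whose step sizes are driven by the psychoacoustic model.

```python
def quantize(values, step):
    """Map each value to the index of the nearest multiple of `step`."""
    return [round(v / step) for v in values]

def dequantize(indices, step):
    """Rebuild approximate values from the quantizer indices."""
    return [q * step for q in indices]

if __name__ == "__main__":
    coefficients = [0.26, -0.74, 1.0]
    for step in (0.5, 1.0):
        indices = quantize(coefficients, step)
        restored = dequantize(indices, step)
        # A larger step needs fewer distinct indices but restores less accurately.
        print(step, indices, restored)
```

The restored values differ from the originals by at most half a step, so a coarser step costs fewer bits and introduces a larger error.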
The complete MPEG-1 set is intended for coding signals with sample rates of 32, 44.1 and 48 kHz. The three MPEG-1 Layers mentioned above differ in their encoding mechanisms and therefore provide different compression factors and different sound quality of the resulting streams. Layer 1 can store a 44.1 kHz / 16-bit signal without significant loss of quality at a bitrate of 384 Kbps, which reduces the data size roughly 4 times. Layer 2 provides subjectively the same quality at 192-224 Kbps, while Layer 3 (MP3) gives the same results at 128-160 Kbps. It makes little sense to speak of the advantages and disadvantages of one Layer compared to another, because each Layer was developed for its own purpose. For example, the advantage of Layer 3 is that it compresses data 8-12 times (depending on bitrate) without significant loss of the original sound quality. At the same time, its compression speed is the lowest of the three (although on modern CPUs this limitation is not noticeable at all). Layer 2 is potentially capable of higher coding quality thanks to its "lighter" internal signal processing during transformation. However, Layer 2 cannot reach the high compression factors that Layer 3 can.
The technique of audio coding is complex and has many nuances. Not all of them can be explained within one article; however, the most important ones should be considered, as almost every user runs into them when encoding.
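The compression factors quoted above can be checked with simple arithmetic. The sketch below assumes a two-channel (stereo) source, which the text does not state explicitly, and takes 1 Kbps as 1000 bits per second; the function names are chosen for this example.

```python
def pcm_bitrate_kbps(sample_rate_hz=44100, bits_per_sample=16, channels=2):
    """Bitrate of uncompressed PCM audio in Kbps (1 Kbps = 1000 bit/s)."""
    return sample_rate_hz * bits_per_sample * channels / 1000

def compression_factor(coded_kbps, source_kbps=None):
    """How many times smaller the coded stream is than the PCM source."""
    if source_kbps is None:
        source_kbps = pcm_bitrate_kbps()  # assumed: 44.1 kHz / 16 bits / stereo
    return source_kbps / coded_kbps

if __name__ == "__main__":
    print("PCM source:", pcm_bitrate_kbps(), "Kbps")
    for layer, kbps in [("Layer 1", 384), ("Layer 2", 192),
                        ("Layer 3", 160), ("Layer 3", 128)]:
        print(layer, kbps, "Kbps ->", round(compression_factor(kbps), 2), "times")
```

Under these assumptions the source stream is 1411.2 Kbps, so 384 Kbps corresponds to a factor of about 3.7, and 160 and 128 Kbps to about 8.8 and 11, in line with the "roughly 4 times" and "8-12 times" figures.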
Encoding into MP3 (as well as into WMA and OGG) is performed in blocks: the file being coded is divided into so-called frames of equal length, and each frame is encoded separately and stored in the target stream. Thus, the target stream also has a frame structure. A frame cannot be encoded at an arbitrary bitrate, but only at one of those listed in the standard table for MPEG-1 Layer 3 (Kbps): 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320 (coding at intermediate bitrates is not stipulated by the standard, though it is possible). Because each frame is processed individually, data can be compressed with a constant (CBR) or a variable (VBR) bitrate.
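To make the frame structure more concrete, the size of a single frame can be estimated from its bitrate. The sketch below relies on two facts that are not given in the text above: an MPEG-1 Layer 3 frame carries 1152 samples per channel, and its length in bytes is commonly computed as 144 * bitrate / sample rate, rounded down, plus an optional padding byte. The function names are chosen for this example.

```python
SAMPLES_PER_FRAME = 1152  # samples per channel in one MPEG-1 Layer 3 frame

def frame_length_bytes(bitrate_kbps, sample_rate_hz, padding=0):
    """Length of one frame in bytes; 144 is SAMPLES_PER_FRAME / 8."""
    return 144 * bitrate_kbps * 1000 // sample_rate_hz + padding

def frame_duration_ms(sample_rate_hz):
    """Playing time covered by one frame, in milliseconds."""
    return 1000 * SAMPLES_PER_FRAME / sample_rate_hz

if __name__ == "__main__":
    for kbps in (32, 128, 320):
        print(kbps, "Kbps:", frame_length_bytes(kbps, 44100), "bytes per frame")
    print("frame duration at 44.1 kHz:", round(frame_duration_ms(44100), 2), "ms")
```

Every frame covers the same playing time, so a higher bitrate simply means more bytes per frame. This is what makes it possible to pick the bitrate frame by frame.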
CBR (Constant Bitrate) is a way of encoding in which all frames are encoded at the same bitrate. In other words, the bitrate remains constant along the whole encoded stream.
VBR (Variable Bitrate) is a way of encoding in which each frame is encoded at its own bitrate. The encoder chooses the bitrate for each frame according to the results of its psychoacoustic analysis.
There is one more encoding mode - ABR (Average Bitrate). In this mode (at least for MP3 coders) the bitrate varies from frame to frame as in VBR, but the encoder keeps the average over the stream at a given target, so the size of the result is as predictable as in CBR. Without going into technical details, we shall note that VBR and ABR encoding is much more flexible and often gives better quality for the same size than CBR.
It is important to note that the ABR, VBR and CBR modes are also used in many coders other than MP3.
We shall now consider the techniques for encoding stereo data stipulated in the MPEG-1 Layer 1, 2 and 3 standards. These methods, possibly with somewhat different interpretations, are used not only in MPEG but also in other codecs.
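The difference between constant and variable bitrate can be sketched as a per-frame choice from the standard bitrate table. In the example below, the per-frame "demand" values stand in for the result of psychoacoustic analysis and are purely hypothetical, as are the function names; a real encoder decides in a far more sophisticated way.

```python
# Standard MPEG-1 Layer 3 bitrate table (Kbps).
MP3_BITRATES_KBPS = [32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320]

def cbr_bitrates(num_frames, bitrate_kbps):
    """CBR: every frame uses the same bitrate from the table."""
    if bitrate_kbps not in MP3_BITRATES_KBPS:
        raise ValueError("bitrate is not in the standard table")
    return [bitrate_kbps] * num_frames

def vbr_bitrates(demands_kbps):
    """VBR: each frame gets the smallest table bitrate covering its demand."""
    chosen = []
    for demand in demands_kbps:
        fitting = [b for b in MP3_BITRATES_KBPS if b >= demand]
        chosen.append(fitting[0] if fitting else MP3_BITRATES_KBPS[-1])
    return chosen

def average_kbps(bitrates):
    """Average bitrate of a stream, the quantity an ABR encoder steers."""
    return sum(bitrates) / len(bitrates)

if __name__ == "__main__":
    demands = [30, 100, 200, 500]  # hypothetical per-frame needs in Kbps
    vbr = vbr_bitrates(demands)
    print("CBR:", cbr_bitrates(len(demands), 128))
    print("VBR:", vbr, "average:", average_kbps(vbr))
```

Simple frames get a low bitrate and complex frames a high one, so the bits go where they are needed instead of being spread evenly.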
Dual Channel. This mode encodes the audio in two channels as completely independent signals. In other words, each channel is encoded separately, without tracking any dependence between the channels. As the name implies, this mode is mainly intended for data with two parallel independent channels (for example, speech in English and in German), and NOT for two channels carrying stereo information. In general, this mode is not recommended for coding a stereo signal.
Stereo. This mode differs from Dual Channel mode in its use of the reservoir. The reservoir is the mechanism responsible for assigning bits to the encoded frames in the target stream. In Stereo mode both channels are processed using the same reservoir, while in Dual Channel mode each channel is encoded using its own independent reservoir. There are no other differences between the modes.
Joint Stereo is the common name for methods of encoding stereo information that exploit its redundancy. MPEG-1 describes two versions of this method.
MS Stereo. In this mode the signal is re-divided into a middle channel (the component common to the left and right channels) and a side channel (the difference between the channels) and then processed as in Stereo mode, using some additional tricks.
Intensity Stereo. In this mode the signal is divided into frequency bands. Only the lower bands are actually encoded. In the upper bands, the encoder only records the average signal power in each band and does not encode the signal itself. Stereo information in the lower bands is encoded using MS Stereo or Stereo mode.
It should be noted that MS Stereo mode does not introduce any additional errors into the signal. Re-dividing the <left> + <right> channels into <middle> + <side> channels involves nothing but harmless and fully reversible mathematical calculations. At the same time, this simple technique of stereo encoding lets the coder realize its potential more effectively than in Stereo mode.
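The reversibility of the middle/side conversion is easy to demonstrate. The following is a minimal sketch that uses the common convention mid = (L + R) / 2 and side = (L - R) / 2; the exact scaling used by a particular encoder may differ, and the function names are chosen for this example.

```python
def to_mid_side(left, right):
    """Convert left/right samples to middle (common) and side (difference)."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side

def to_left_right(mid, side):
    """Inverse conversion: restore left/right from middle/side."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right

if __name__ == "__main__":
    left = [10, -3, 7, 0]
    right = [8, -3, 1, 4]
    mid, side = to_mid_side(left, right)
    print("mid:", mid, "side:", side)
    # The round trip gives back the original channels.
    print(to_left_right(mid, side))
```

When the two channels are similar, the side channel stays close to zero and can be coded with very few bits, which is where the gain over plain Stereo mode comes from.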
OGG - Ogg Vorbis
One unpleasant feature of the MP3 codec has always been its commercial nature: every manufacturer creating a new MP3 encoder is obliged to pay royalties to the "fathers" of the codec. Such a situation was bound to prompt new developments in audio compression, and indeed this happened.
The Ogg Vorbis codec was published in June 2000. Ogg Vorbis is part of the Ogg Squish project, whose aim is to develop a completely independent and open multimedia system. In other words, the whole project, and Ogg Vorbis in particular, is open and free to distribute and to use as part of new software. According to the developers' FAQ (Xiphophorus group), Ogg Vorbis is based on the same ideas as the well-known MPEG-1 Layer 2. However, OGG uses its own original mathematical algorithms and its own psychoacoustic model, which frees it from the need to pay license fees or make any other payments to outside owners of audio formats. Ogg Vorbis is designed to compress data at any bitrate without restriction, from 8 Kbps up to 512 Kbps, but only in VBR mode; CBR mode was not implemented in Ogg Vorbis. The format allows detailed comments about the audio material to be stored in OGG files (containers), along with all the standard information (artist, title, year and so on). Ogg Vorbis also provides for coding audio with several channels (more than two, theoretically up to 255), for editing file contents, and for so-called "scalable bitrate" - the ability to change the bitrate of a stream without decoding it. Ogg Vorbis also supports streaming playback (an audio stream can be played while it is being downloaded from the Internet) and uses its own universal file format, which can store any multimedia data of the Ogg Squish system.
WMA - Windows Media Audio
Windows Media Audio (WMA for short) is today Microsoft Corp.'s own development, and work on the codec is progressing successfully. Initially, WMA was developed by Voxware under the name "Voxware Audio Codec", but the company later abandoned it, stopping at v4.0. Nevertheless, the codec was not left to decay: it was bought outright by Microsoft. Its programmers substantially rewrote and improved the codec, and the company renamed it Windows Media Audio. WMA is free of charge for users, but it is closed to outside development.
MPEG-1 Layer 3 was standardized from the outset with fixed bitrate values and other key parameters, whereas WMA has changed as it grew and developed. Several versions of the WMA codec are available today: v1, v2, v7, v8 and v9. Version 7 differs from its predecessors in its range of supported bitrates (up to 192 Kbps versus 164 Kbps for v1 and v2), slightly worse encoding quality and a different data structure of the output stream. Version 8 differs from all previous versions in its clearly revised and more advanced psychoacoustic model, thanks to which encoding quality increased significantly. At 96 Kbps, WMA v8.0 may compete in quality with MP3 at 128 Kbps when encoding undemanding audio material (such as pop music). Of course, the quality strongly depends on the particular composition and on the listening equipment. The ninth version of WMA is a logical continuation of the eighth. The developers announced a significant improvement in encoding quality compared with WMA v8. Version 9 includes a new technology called "Fast Streaming", which aims to reduce buffering time in the end user's client software when a WMA stream is transmitted over the Internet. In addition, WMA 9 is actually a set of codecs: besides the lossy coder it includes a number of specialized codecs, such as a voice codec and a lossless codec.
Judging by various tests and by the parameters used to configure the WMA encoder, its mechanism is quite similar to that of MPEG-1 Layer 3: the same frame-by-frame compression with presumably the same signal processing methods.
MPC - MusePack
The MusePack codec (MPC) is one more lossy codec. Its original name was MPEGplus (MPEG+), but the author was forced to rename the project to MusePack because of problems caused by the similarity of the name to the abbreviation "MPEG". MusePack did not evolve from MPEG-1 Layer 3; it has grown from MPEG-1 Layer 2 (as Ogg Vorbis has). MusePack was created through the enthusiasm of two people: Andree Buschmann and Frank Klemm. Because the codec is based on MPEG-1 Layer 2, it is oriented mainly towards coding at high bitrates (unlike MP3). At the same time, the codec is a completely independent development. It supports coding only in VBR mode. Its compression and decompression speeds are higher than those of MPEG-1 Layer 3.
On average, the quality of MPC encoding at high bitrates (160 Kbps and higher) is noticeably (if not "considerably") better than the quality provided by MP3. This can be explained by the differences in the encoding mechanisms. During encoding, MP3 divides the signal into sub-bands, then decomposes the signal in each sub-band into a set of cosine coefficients (applying the MDCT) and re-quantizes the obtained coefficients under the control of psychoacoustics. MPC works similarly to MPEG-1 Layer 2: after splitting the signal into frequency sub-bands, it re-quantizes the amplitude signal in each sub-band (again under the control of psychoacoustics). This difference between MPC and MP3 also explains the noticeable difference in the encoding speed of the codecs.
AAC - MPEG-2/4 AAC (Advanced Audio Coding)
MPEG-2 AAC Standard
MPEG-2 was developed especially for TV broadcasting. In April 1997 this set of standards received an extension, namely MPEG-2 AAC (MPEG-2 Advanced Audio Coding). The MPEG-2 AAC standard is the result of joint efforts by a number of companies, such as Sony, NEC, Dolby and the Fraunhofer Institute. MPEG-2 AAC is a technological continuation of MPEG-1. Because considerable time passed between the publication of MPEG-2 AAC and its standardization, there are several versions (implementations) of this codec: Homeboy AAC, AT&T a2b AAC, Astrid/Quartex AAC, Liquifier AAC, FAAC (Freeware Audio Coder), Mayah AAC and PsyTEL AAC. Liquifier AAC, FAAC and PsyTEL AAC are the ones that provide the highest sound quality in comparison with MPEG-1 Layer 3. Almost all of the codecs mentioned above are incompatible with one another.
The main coding technique used in AAC is similar to MP3 and is based on psychoacoustics. At the same time, AAC is furnished with extensions that improve the output sound quality. In particular, a different type of transform is used, the noise processing methods were improved, and a new filter bank and a different technique for storing the output stream are used. Besides, AAC allows so-called "watermarks" to be included in the encoded stream. "Watermarks" are information (copyright notices, for instance) built into the output stream that cannot be deleted without destroying the integrity of the audio data. This technology (a part of the Multimedia Protection Protocol) makes it possible to control the distribution of audio materials. By the way, the inclusion of this technology in AAC became a serious obstacle to its adoption. It should also be noted that the codec is not backward compatible with MPEG-1 Layers 1/2/3.
MPEG-2 AAC provides three encoding modes (profiles): Main, LC (Low Complexity) and SSR (Scalable Sampling Rate). Encoding time and the quality of the output stream depend on the profile used. The Main profile provides the best sound quality at the slowest compression speed, because it includes all the sound analysis and processing mechanisms available in AAC. The LC profile is simplified in comparison with Main, which affects the sound quality of the output stream but also increases the speed of compression and decompression. The SSR profile is also a simplified variant of Main.
As for the sound quality provided by the codec, an AAC (Main) stream at 96 Kbps sounds comparable to MPEG-1 Layer 3 at 128 Kbps. At 128 Kbps AAC distinctly surpasses MPEG-1 Layer 3 at the same bitrate.
MPEG-4 AAC is a part of the MPEG-4 standard. MPEG-4 describes ways of object-oriented representation of multimedia data. The standard operates with objects, organizes them into hierarchies and classes, builds scenes and controls their transfer. Several standards serve as the basis of audio compression in MPEG-4: an improved MPEG-2 AAC, the TwinVQ codec, and special speech encoders such as HVXC (Harmonic Vector eXcitation Coding) and CELP (Code Excited Linear Prediction). In addition, MPEG-4 AAC has a set of mechanisms which provide scalability. But as a whole, MPEG-4 AAC is a continuation of MPEG-2 AAC, providing rules and methods of audio coding (http://faac.sourceforge.net/wiki/index.php?page=AAC). MPEG-4 AAC standardizes the following types of objects (the notion of "profile" in MPEG-2 AAC was replaced by the notion of "object" in MPEG-4 AAC):
MPEG-4 AAC LC (Low Complexity)
MPEG-4 AAC Main
MPEG-4 AAC SSR (Scalable Sampling Rate)
MPEG-4 AAC LTP (Long Term Prediction)
MPEG-4 Version 2
MPEG-4 Version 3 (including HE-AAC)
As can be seen, the first three are borrowed from MPEG-2 AAC, while the fourth is an innovation. LTP is based on methods of signal prediction and is more complex than the others. Version 2 is a set of standards which extend the encoding tools of MPEG-4. Version 3 is one more extension of the standard. Its main innovation is HE-AAC (High Efficiency AAC) - a new standard (May 2003) also known as aacPlus.
aacPlus was announced by Coding Technologies on October 9, 2002. aacPlus is based on SBR (Spectral Band Replication) technology, which is intended to improve the transmission of high frequencies. Audio codecs based on psychoacoustics share one drawback: the sound quality of encoded files starts to degrade quickly when the bitrate falls below 112-128 Kbps. SBR is intended to supplement psychoacoustics and to remove this drawback. When SBR is used, the high frequencies of the source signal are not encoded; only the average intensity of the high frequencies in several sub-bands is recorded instead. During decoding, the decoder synthesizes (replicates) the high frequencies by copying the low frequencies into the high range and multiplying them by the recorded intensity factor in each sub-band.
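The replication idea described above can be sketched on a toy magnitude spectrum. This is only an illustration of the principle, not the actual SBR algorithm: the functions sbr_encode and sbr_decode, the equal-width sub-bands and the sample values are all assumptions made for this example, and the spectrum length is assumed to divide evenly into the chosen number of sub-bands.

```python
def sbr_encode(spectrum, num_bands):
    """Keep the low half; store only the average level of each high sub-band."""
    half = len(spectrum) // 2
    low, high = spectrum[:half], spectrum[half:]
    width = half // num_bands
    envelope = [
        sum(abs(v) for v in high[i * width:(i + 1) * width]) / width
        for i in range(num_bands)
    ]
    return low, envelope

def sbr_decode(low, envelope):
    """Rebuild the high half by copying low sub-bands and scaling them."""
    width = len(low) // len(envelope)
    high = []
    for i, target in enumerate(envelope):
        patch = low[i * width:(i + 1) * width]
        level = sum(abs(v) for v in patch) / width
        gain = target / level if level else 0.0
        high.extend(v * gain for v in patch)
    return list(low) + high

if __name__ == "__main__":
    spectrum = [8.0, 4.0, 4.0, 8.0, 2.0, 2.0, 1.0, 1.0]
    low, envelope = sbr_encode(spectrum, num_bands=2)
    print("stored:", low, envelope)
    print("decoded:", sbr_decode(low, envelope))
```

The decoded high half has the right average level in each sub-band but not the original fine structure. That is the trade-off: a handful of intensity values replace the whole upper part of the spectrum.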
About quality and practical applicability of codecs MP3, OGG, WMA, MPC and AAC
Despite the completely different origins of the codecs considered, their mechanisms are all based on the same idea of "simplifying" the input signal and then compressing the simplified data. Each codec has its own innovations and a completely independent implementation; however, because the codecs are based on approximately the same idea, their average compression results under identical conditions (evaluated as the size/quality ratio) are at approximately the same level.
MP3 was the first codec to use the idea of simplifying the signal by means of psychoacoustics. Today, despite the ingenuity of its competitors, MP3 remains one of the most popular audio codecs. Certainly, it is wrong to talk about MP3 in general, as there are various independent implementations of it. One of the most successful and continuously developing implementations is the LAME encoder (it is developed by a group of independent enthusiasts and distributed free of charge). LAME has a set of configuration parameters that allow encoding to be fine-tuned individually for each piece of material. If you are going to encode pop music for listening on low or average quality audio equipment, you can obtain reasonably good sound at 128 Kbps. At 160 Kbps you may get even better results without dramatically reducing the compression factor. If you need "audiophile quality" to store the material in an audio collection and listen to it on high-quality equipment, you may need to use 320 Kbps. This choice of bitrate provides the highest sound quality of the compressed material. Thus, each time audio material has to be compressed, the user should first consider the purpose of the compression and only then choose the bitrate (as well as other parameters). Practice shows that encoding at 160-192 Kbps is enough to obtain good quality for pop and classical music. When encoding electronic or instrumental music, the bitrate requirements may be higher. When encoding voice material only (lectures, for example), ultra-low bitrates (below 64 Kbps) are sufficient, because in this case it is not the sound quality that matters but only the intelligibility of speech at playback.
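When weighing quality against size, it helps to estimate how large the encoded file will be at a given bitrate. The sketch below assumes a constant (or average) bitrate, takes 1 Kbps as 1000 bits per second and ignores tags and container overhead; the function name and the four-minute track are chosen for this example.

```python
def encoded_size_bytes(bitrate_kbps, duration_seconds):
    """Approximate size of an encoded stream: bitrate * time / 8 bits per byte."""
    return bitrate_kbps * 1000 * duration_seconds // 8

if __name__ == "__main__":
    duration = 4 * 60  # a four-minute track, in seconds
    for kbps in (64, 128, 160, 192, 320):
        size_mb = encoded_size_bytes(kbps, duration) / 1_000_000
        print(f"{kbps:>3} Kbps -> {size_mb:.2f} MB")
```

Size grows linearly with bitrate, so going from 128 to 320 Kbps makes every file two and a half times larger.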
Ogg Vorbis and MusePack, being followers of MPEG-1 Layer 2, yield on average appreciably better coding results at high bitrates than MP3. This holds for bitrates of 160 Kbps and higher. Using Ogg Vorbis and MusePack at low bitrates is not recommended.
The WMA codec, especially versions 8 and 9, yields slightly better coding results than MP3. It should be noted that, as practice and various tests show, WMA is of particular value at low bitrates. For example, MP3 at 32 Kbps sounds just awful (at such a low bitrate the signal is heavily distorted), while WMA sounds quite acceptable. This means that at low bitrates WMA is preferable to MP3.
The MPEG-2/4 AAC codec is a direct continuation of MP3. Thus, in general, AAC wins against MP3 at all bitrates, although this result varies from one coder to another.
Thus, the user's choice of codec and compression parameters should be guided by practicality, plans for future use and compatibility (in particular, MP3 and WMA files are accepted by many hardware players, while OGG, AAC and MPC files are not accepted by most). There is no need to fear that the original quality of the material will be irrevocably lost after compression: by using high bitrates it is possible to obtain almost the original sound quality together with a great gain in data size.