Formatting and Source Coding

Formatting is the process whereby source data is prepared for the ensuing digital processing. Sometimes this process is included in the functionality of source coding. The purpose of source coding is to remove redundant or unnecessary information from the raw data.

Bits are assembled into patterns or code words with a certain length which is expressed in number of bits. The code words represent all or a part of the entire alphabet including letters, numbers, special characters and control codes, or represent the pixels of a fax or samples of digitized speech.

Code words are assembled into alphabets or codes. In some codes the code words are of unequal length. A distinction should be made between source coding, which is the coding used to communicate between a data source or sink (a teleprinter, a PC) and data communication equipment, e.g., a modem or a decoder, and channel coding, which is the coding used on the channel between the transmitting and receiving data communication equipment. Sometimes the source code is also used as the channel code.

The Morse code is an unequal-length code. Code words are composed of dots (the smallest unit), dashes and spaces, one dash being equal in duration to three dots. The character "E" is represented by the shortest code word, "dot", equal to one dot or '1' in binary notation. The character zero (0) is represented by the longest code word, "dash-dash-dash-dash-dash", equal to 19 dots (five dashes of three dots each plus four one-dot spaces) or '1110111011101110111' in binary notation. The reason for the unequal length of the code words was the desire to reduce the operator's workload when transmitting many messages. Samuel Morse found, by visiting a Philadelphia printing office, that the compositors had sorted the lead types in such a way that the types most frequently used were the ones most easily accessible. The most frequently used characters were accordingly given the shortest code words.
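
The dot counts quoted above can be checked with a short sketch. It assumes that '1' stands for one dot length of key-down time and '0' for one dot length of key-up time, with a one-dot space between the elements of a character; the function name morse_to_bits is purely illustrative.

```python
# Timing pattern of a Morse code word, one character per dot length:
# '1' = key down, '0' = key up (assumed notation).
DOT = "1"
DASH = "111"   # one dash lasts three dots
GAP = "0"      # one-dot space between elements of the same character


def morse_to_bits(elements: str) -> str:
    """Convert a string of '.' and '-' into its on/off timing pattern."""
    units = [DOT if e == "." else DASH for e in elements]
    return GAP.join(units)


e_bits = morse_to_bits(".")          # letter E
zero_bits = morse_to_bits("-----")   # figure 0

print(e_bits, len(e_bits))
print(zero_bits, len(zero_bits))
```

The spaces between characters and between words are left out here because they do not belong to the code word itself.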

An example of an equal-length, but non-integral code is the Baudot or ITA-2 alphabet, which was formerly in use on the majority of the world's land lines and radio links. It is still the basis for many codes constructed later, as compatibility with existing equipment and networks was essential.

In the ITA-2 code a character is represented by five bits. For instance, the letter "D" is represented by the codeword '10110'. As we have five bits, each of which can assume one of two possible states, we are able to represent 2^5 = 32 characters. However, the number of all letters, figures, and special characters adds up to more than 32. Therefore a trick is employed: ITA-2 makes a distinction between two cases, lower (letters) case and upper (figures) case. Shifting between these cases is accomplished by special shift characters. In this manner it is possible to transfer (2 x 32) - 6 = 58 characters (six are subtracted because they have the same function in either case). Shift characters are also used to toggle between Latin and non-Latin alphabets in the same transmission, e.g., Latin-Cyrillic and Latin-Arabic alphabets.
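
The case-shift mechanism can be sketched as follows. This is a simplified illustration, not the ITA-2 table itself: the shift characters are represented by the labels "LTRS" and "FIGS", the figures case is reduced to the ten digits, and only space, carriage return and line feed are treated as common to both cases. The function name insert_shifts is hypothetical.

```python
# Simplified character classes (assumptions, not the full ITA-2 table).
LETTERS = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
FIGURES = set("0123456789")
BOTH_CASES = set(" \r\n")   # same function in either case

CODE_WORDS = 2 ** 5                 # five bits -> 32 code words
CHARACTERS = 2 * CODE_WORDS - 6     # two cases, six shared -> 58


def insert_shifts(text: str) -> list:
    """Return the characters of text with shift characters inserted
    wherever the case changes."""
    out = []
    case = "LTRS"   # assume the receiver starts in letters case
    for ch in text.upper():
        if ch in BOTH_CASES:
            out.append(ch)          # no shift needed
            continue
        if ch in LETTERS:
            needed = "LTRS"
        elif ch in FIGURES:
            needed = "FIGS"
        else:
            raise ValueError(f"character {ch!r} not in this simplified table")
        if needed != case:
            out.append(needed)      # transmit the shift character first
            case = needed
        out.append(ch)
    return out


tokens = insert_shifts("AB 12 C")
print(CODE_WORDS, CHARACTERS)
print(tokens)
```

Each case change costs one extra five-bit character on the line, which is why text that alternates frequently between letters and figures is transmitted less efficiently.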

The alphabet most widely used in modern data communication is the ASCII code (American Standard Code for Information Interchange), which is internationalized as ITU ITA-5. The alphabet is originally based on 7-bit words, but normally 8 bits are used, either to expand the alphabet or to include a parity bit. Because of the number of bits available for each codeword, it is unnecessary to use special case shift characters as in ITA-2. Both capital and lowercase letters can be accommodated, as well as non-printing commands and, if 8-bit words are used, completely transparent binary data.

7-bit ASCII code. Normally eight bits are transmitted, with the 8th bit either fixed at 1 or 0, used for odd or even parity, or used to expand the alphabet.
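
As an illustration of the parity option, the following sketch appends a parity bit to a 7-bit ASCII character. It assumes that the parity bit occupies the most significant (8th) position; the function name add_parity is illustrative only.

```python
def add_parity(ch: str, even: bool = True) -> int:
    """Return the 8-bit word for a 7-bit ASCII character,
    with the parity bit placed in bit 8 (assumed position)."""
    code = ord(ch)
    if code > 0x7F:
        raise ValueError("not a 7-bit ASCII character")
    ones = bin(code).count("1")
    parity = ones % 2        # 1 makes the total number of ones even
    if not even:
        parity ^= 1          # odd parity: total number of ones is odd
    return (parity << 7) | code


word_even = add_parity("A")              # 'A' = 1000001, two ones
word_odd = add_parity("A", even=False)

print(format(word_even, "08b"))
print(format(word_odd, "08b"))
```

A receiver applying the same rule can detect any single-bit error in a word, but not which bit is wrong.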

An example of source coding for analogue input is the process of transforming analogue voice to digital bits by way of sampling the input signal, quantizing it into discrete amplitude levels, and finally converting the quantized signal into 8-bit data words. This process is used for the conversion of ordinary analogue telephone speech into standard PCM (Pulse Code Modulation) digital signals used globally in the Public Switched Telephone Network. Other examples of voice coding are the coding used for GSM mobile telephones, and LPC (Linear Predictive Coding) used for narrow band digital voice.
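
The three steps (sampling, quantizing, conversion to 8-bit words) can be sketched as below. This is a minimal sketch under stated assumptions: a sampling rate of 8000 samples per second, an input amplitude range of -1 to +1, and uniform quantization for simplicity, whereas telephone PCM applies logarithmic companding (A-law or mu-law). The test tone and all function names are illustrative.

```python
import math

SAMPLE_RATE = 8000   # samples per second (assumed telephony rate)
LEVELS = 256         # 8-bit words -> 2**8 amplitude levels


def sample(signal, duration_s, rate=SAMPLE_RATE):
    """Read the analogue signal at regular intervals."""
    n = round(duration_s * rate)
    return [signal(i / rate) for i in range(n)]


def quantize(x, levels=LEVELS):
    """Map an amplitude in [-1, 1] to an integer level 0..levels-1
    (uniform steps; a simplification of real telephone PCM)."""
    x = max(-1.0, min(1.0, x))
    return min(levels - 1, int((x + 1.0) / 2.0 * levels))


def tone(t):
    """Illustrative input: a 1 kHz sine wave at 80 % of full scale."""
    return 0.8 * math.sin(2 * math.pi * 1000 * t)


samples = sample(tone, 0.001)                    # 1 ms of signal
words = bytes(quantize(s) for s in samples)      # one 8-bit word per sample
bit_rate = SAMPLE_RATE * 8                       # bits per second

print(list(words))
print(bit_rate)
```

With these assumptions one voice channel produces 8000 x 8 = 64000 bits per second.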

To reduce redundancy, i.e., to use the transmission medium more efficiently, the formatted data is further processed through compression. As we saw above, some codes, e.g., the Morse code, inherently reduce redundancy because they were designed from statistical observation of the source data. Statistical reduction is also the basis for Huffman coding used in fax communication, where the most frequently occurring bit combinations are transformed into symbols having the lowest number of bits. Huffman coding is thus an example of variable-length coding.
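
The principle of Huffman coding can be sketched as follows. This is a generic textbook construction built from the symbol frequencies of one message, not the fixed code tables used in fax equipment; the message and the function name huffman_code are illustrative.

```python
import heapq
from collections import Counter


def huffman_code(text):
    """Build a prefix-free code in which frequent symbols get short code words."""
    freq = Counter(text)
    # Heap entries: (weight, tie_breaker, {symbol: code_so_far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                       # degenerate one-symbol source
        return {sym: "0" for sym in heap[0][2]}
    counter = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent groups, prefixing their codes.
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in t1.items()}
        merged.update({s: "1" + c for s, c in t2.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]


message = "AAAAAAAABBBBCCD"
code = huffman_code(message)
encoded = "".join(code[s] for s in message)

print(code)
print(len(encoded), "bits instead of", 2 * len(message))
```

In this message "A" occurs most often and receives a one-bit code word, so the 15 symbols need 25 bits instead of the 30 bits required by a fixed two-bit code.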

The degree of compression achievable, the compression ratio, is related to the properties of the data to be compressed.
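
The dependence of the compression ratio on the data can be illustrated with a small experiment. The sketch uses Python's standard zlib module (the DEFLATE method, which combines a Lempel-Ziv stage with Huffman coding) purely as a convenient stand-in for a compressor; the two test inputs are arbitrary.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Uncompressed size divided by compressed size."""
    return len(data) / len(zlib.compress(data))


repetitive = b"CQ CQ CQ DE TEST " * 200        # highly redundant text
random_like = os.urandom(len(repetitive))      # no exploitable redundancy

ratio_repetitive = compression_ratio(repetitive)
ratio_random = compression_ratio(random_like)

print(round(ratio_repetitive, 1))
print(round(ratio_random, 3))
```

The redundant text shrinks to a small fraction of its size, whereas the random data cannot be compressed and even grows slightly because of the container overhead.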

Other examples of compression codes are ARJ, Lempel-Ziv, JPEG and MPEG, the latter two being used for image, video, voice and music compression.

PACTOR and G-TOR are examples of the use of redundancy removal source coding for radio communication.

Baseband waveforms can be formatted in various ways. The most common method is called Non-Return-to-Zero Level (NRZ-L), meaning that each bit is represented by one of two voltage levels held for the whole bit period. NRZ-M, also called differential encoding, uses a change in level for a logical 1 and no change for a logical 0. NRZ-S is complementary to NRZ-M: the level changes for a logical 0 and stays the same for a logical 1. Unipolar-RZ represents a logical 1 with a positive half-bit-wide pulse and a logical 0 with no pulse, i.e., at 0 level. Bipolar-RZ has half-bit-wide pulses of opposite polarity for 1 and 0. The Manchester code or Biphase-φ-L is a subtype of NRZ coding and has a level transition in the middle of every bit, negative-going for 1s and positive-going for 0s. Differential Manchester also has a transition in the middle of every bit, but encodes the data in the presence or absence of an additional transition at the start of the bit period.
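
The waveform rules above can be expressed compactly in code. In this sketch the two-level codes use 1 for the high level and 0 for the low level, Bipolar-RZ uses +1, 0 and -1, and the RZ and Manchester codes produce two half-bit values per data bit. The choice of high for a logical 1 in NRZ-L, the initial level of 0 for NRZ-M and NRZ-S, and all function names are assumptions made for illustration.

```python
def nrz_l(bits):
    """One level per bit: 1 -> high, 0 -> low (assumed polarity)."""
    return list(bits)


def nrz_m(bits, start=0):
    """Change of level for a 1, no change for a 0."""
    level, out = start, []
    for b in bits:
        if b == 1:
            level ^= 1
        out.append(level)
    return out


def nrz_s(bits, start=0):
    """Complement of NRZ-M: change of level for a 0."""
    level, out = start, []
    for b in bits:
        if b == 0:
            level ^= 1
        out.append(level)
    return out


def unipolar_rz(bits):
    """1 -> half-bit positive pulse, 0 -> no pulse."""
    return [h for b in bits for h in ((1, 0) if b else (0, 0))]


def bipolar_rz(bits):
    """1 -> positive half-bit pulse, 0 -> negative half-bit pulse."""
    return [h for b in bits for h in ((1, 0) if b else (-1, 0))]


def manchester(bits):
    """Mid-bit transition: high-to-low for a 1, low-to-high for a 0."""
    return [h for b in bits for h in ((1, 0) if b else (0, 1))]


data = [1, 0, 1, 1, 0, 0]
for encoder in (nrz_l, nrz_m, nrz_s, unipolar_rz, bipolar_rz, manchester):
    print(f"{encoder.__name__:12s}", encoder(data))
```

Note that the Manchester output changes level within every bit, even for the repeated bits at the end of the example, whereas the NRZ-L output stays constant during such runs.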

The exact waveform to be used depends on the application. For instance, systems needing self-clocking would use Manchester coding because transitions are always available, even if the transmission consists of long runs of successive 1s or 0s. The BBC radio data system used on long wave utilizes Manchester coding, as do Ethernet LANs. The AIS system used on VHF uses differential encoding to resolve polarity ambiguity, and this form of encoding is also commonly used in satellite transmission systems.
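
How differential encoding resolves polarity ambiguity can be shown with a short sketch. It assumes the rule "change of level for a 1, no change for a 0" and a leading reference level sent ahead of the data; the function names are illustrative and not taken from any particular system.

```python
def diff_encode(bits, reference=0):
    """Differential encoding: a 1 toggles the level, a 0 keeps it.
    The reference level is sent first so the receiver has a starting point."""
    levels = [reference]
    for b in bits:
        levels.append(levels[-1] ^ b)
    return levels


def diff_decode(levels):
    """A bit is 1 wherever two successive levels differ."""
    return [a ^ b for a, b in zip(levels, levels[1:])]


data = [1, 0, 1, 1, 0, 0, 1]
line = diff_encode(data)
inverted = [lv ^ 1 for lv in line]     # receiver sees the opposite polarity

decoded_normal = diff_decode(line)
decoded_inverted = diff_decode(inverted)

print(decoded_normal)
print(decoded_inverted)
```

Both outputs equal the original data, because only the changes between successive levels carry information. With plain level coding the inverted signal would instead be read as the complement of the data.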

The illustration below depicts spectral density, i.e., efficiency, as a function of pulse bandwidth.

[Figure fig2_23: spectral density as a function of pulse bandwidth]