July 28, 2006

How VoIP works: Protocols, codecs, and more

Learn the basics of packetization, network latency and bandwidth, transport and media protocols, and speech and video codecs

The age of voice-over-Internet-protocol (VoIP) is here, bringing together telephony and data communications to provide packetized voice and fax data streamed over low-cost Internet links. The transition from circuit-switched to packet-switched networking, continuing right now at breakneck speed, is encouraging applications that go far beyond simple voice transmission, embracing other forms of data and allowing them to all travel over the same infrastructure.

What is VoIP?
Today's voice networks—such as the public switched telephone network (PSTN)—utilize digital switching technology to establish a dedicated link between the caller and the receiver. While this connection offers only limited bandwidth, it does provide an acceptable quality level without the burden of a complicated encoding algorithm.

The VoIP alternative uses Internet protocol (IP) to send digitized voice traffic over the Internet or private networks. An IP packet consists of a train of digits containing a control header and a data payload. The header provides network navigation information for the packet, and the payload contains the compressed voice data.

While circuit-switched telephony deals with the entire message, VoIP-based data transmission is packet-based, so that chunks of data are packetized (separated into units for transmission), compressed, and sent across the network—and eventually re-assembled at the designated receiving end. The key point is that there is no need for a dedicated link between transmitter and receiver.

Packetization is a good match for transporting data (for example, a JPEG file or email) across a network, because the delivery falls into a non-time-critical "best-effort" category. For voice applications, however, "best-effort" is not adequate, because variable-length delays as the packets make their way across the network can degrade the quality of the decoded audio signal at the receiving end. For this reason, VoIP protocols, via quality-of-service (QoS) techniques, focus on managing network bandwidth to prevent delays from degrading voice quality.

Packetizing voice data involves adding header and trailer information to the data blocks. Packetization overhead (the additional time and data this process introduces) must be kept low to minimize added latencies (time delays through the system). The process therefore has to balance transmission delay against bandwidth efficiency: smaller packets can be filled and sent sooner, but each one repeats the same header and trailer information, while larger packets amortize that overhead across a bigger chunk of voice data, using network bandwidth more efficiently at the cost of a longer wait while each packet is composed.
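
To make the tradeoff concrete, here is a quick back-of-the-envelope sketch (not from the original article) that computes the on-the-wire bandwidth of a hypothetical 8-kbps voice stream carried in IPv4/UDP/RTP packets, whose fixed headers total 20 + 8 + 12 = 40 bytes:

```c
/* Packetization math: effective on-the-wire bandwidth of an 8-kbps
 * voice stream (e.g., G.729) carried in IPv4/UDP/RTP packets.
 * Header sizes are the standard ones: 20 (IPv4) + 8 (UDP) + 12 (RTP). */
#include <stdio.h>

int main(void)
{
    const double codec_bps    = 8000.0;            /* payload bit rate   */
    const int    header_bytes = 20 + 8 + 12;
    const int    frame_ms[]   = { 10, 20, 40, 80 }; /* payload per packet */

    for (int i = 0; i < 4; ++i) {
        double payload_bytes = codec_bps / 8.0 * frame_ms[i] / 1000.0;
        double packets_per_s = 1000.0 / frame_ms[i];
        double total_bps = (payload_bytes + header_bytes) * 8.0 * packets_per_s;
        printf("%2d ms payload: %4.0f B/pkt + %d B hdr -> %5.1f kbps on the wire\n",
               frame_ms[i], payload_bytes, header_bytes, total_bps / 1000.0);
    }
    return 0;
}
```

With 10-ms packets, the 40-byte header swells the 8-kbps stream to 40 kbps on the wire; 80-ms packets cut that to 12 kbps, but add 80 ms of packetization delay before the first bit can even be sent.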

By their nature, packet networks deliver data with transit delays that vary quite a bit from packet to packet. This variation, known as jitter, is removed by buffering the packets long enough to ensure that the slowest packets arrive in time to be decoded in the correct sequence. Naturally, a larger jitter buffer contributes more overall system latency.
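
As a rough illustration, the sketch below shows one minimal way a fixed-depth jitter buffer might be organized (the API, depth, and frame size are hypothetical, and a production implementation would adapt its depth to the jitter it measures). Packets are filed by RTP-style sequence number and released to the decoder in order:

```c
/* Minimal fixed-depth jitter buffer sketch; fixed-size voice frames
 * are assumed. */
#include <stdint.h>
#include <string.h>

#define SLOTS     8         /* buffer depth, in packets         */
#define FRAME_LEN 160       /* e.g., 20 ms of 8-kHz G.711       */

typedef struct {
    int      used;
    uint16_t seq;
    uint8_t  data[FRAME_LEN];
} slot_t;

static slot_t   buf[SLOTS];
static uint16_t next_seq;   /* next sequence number to play out */

/* Called when a packet arrives (possibly out of order). */
void jb_put(uint16_t seq, const uint8_t *payload)
{
    slot_t *s = &buf[seq % SLOTS];
    s->used = 1;
    s->seq  = seq;
    memcpy(s->data, payload, FRAME_LEN);
}

/* Called once per frame period by the decoder. Returns the payload
 * for the expected sequence number, or NULL for a late/lost packet,
 * in which case the decoder runs its loss-concealment routine. */
const uint8_t *jb_get(void)
{
    slot_t *s = &buf[next_seq % SLOTS];
    const uint8_t *out = (s->used && s->seq == next_seq) ? s->data : NULL;
    s->used = 0;
    next_seq++;
    return out;
}
```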

As mentioned above, latency represents the time delay through the IP system. A one-way latency is the time from when a word is spoken to when the person on the other end of the call hears it. Round-trip latency is simply the sum of the two one-way latencies. The lower the latency value, the more natural a conversation will sound. For the PSTN phone system in North America, the round-trip latency is less than 150 ms.

For VoIP systems, a one-way latency of up to 200 ms is considered acceptable. The largest contributors to latency in a VoIP system are the network and the gateways at either end of the call. The voice encoders and decoders (codecs) add some latency—but this is usually small by comparison (<20 ms).

When the delay is large in a voice network application, the main challenges are to cancel echoes and eliminate overlap. Echo cancellation directly affects perceived quality; it becomes important when the round-trip delay exceeds 50 ms. Voice overlap becomes a concern when the one-way latency is more than 200 ms.

Because most of the time elapsed during a voice conversation is "dead time"—during which no speaker is talking—codecs take advantage of this silence by not transmitting any data during these intervals. Such "silence compression" techniques detect voice activity, stop sending packets during the gaps, and generate "comfort noise" at the receiving end to ensure that the line does not appear dead when no one is talking.
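
A real voice-activity detector, such as the one specified in G.729 Annex B, weighs several signal features; the fragment below is only a crude energy-threshold stand-in to show the basic idea:

```c
/* Crude energy-threshold voice-activity detector: a simplified
 * stand-in for standardized VADs (e.g., G.729 Annex B). Frames whose
 * mean energy falls below a threshold are flagged as silence, and the
 * sender can replace them with occasional comfort-noise descriptors. */
#include <stdint.h>

/* Returns 1 if the 16-bit PCM frame of n samples likely contains speech. */
int frame_is_voice(const int16_t *pcm, int n, int32_t threshold)
{
    int64_t energy = 0;
    for (int i = 0; i < n; ++i)
        energy += (int32_t)pcm[i] * pcm[i];
    return (energy / n) > threshold;   /* mean squared amplitude */
}
```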

In a standard PSTN telephone system, echoes that degrade perceived quality can happen for a variety of reasons. The two most common causes are impedance mismatches in the circuit-switched network ("line echo") and acoustic coupling between the microphone and speaker in a telephone ("acoustic echo"). Line echoes are common when there is a two-wire-to-four-wire conversion in the network (e.g., where analog signaling is converted into a T1 system).

Because VoIP systems can link to the PSTN, they must be able to deal with line echo, and IP phones can also fall victim to acoustic echo. Echo cancellers can be optimized to operate on line echo, acoustic echo, or both. The effectiveness of the cancellation depends directly on the quality of the algorithm used.

An important parameter for an echo canceller is the length of the echo tail (the time window) over which it operates. Put simply, the echo canceller keeps a copy of the signal that was transmitted. For a given time after the signal is sent, it seeks to correlate and subtract the transmitted signal from the returning reflected signal—which is, of course, delayed and diminished in amplitude. A standard correlation window size (e.g., 32 ms, 64 ms, or 128 ms) usually suffices for effective cancellation, but larger sizes may be necessary.
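
In practice the canceller is an adaptive FIR filter whose impulse response converges on the echo path. The following simplified sketch (hypothetical; a normalized-LMS update with a 256-tap filter covering a 32-ms tail at 8 kHz) shows the core loop. Production cancellers add double-talk detection and nonlinear processing:

```c
/* Minimal NLMS adaptive echo-canceller sketch. */
#define TAPS 256              /* 32 ms tail at 8 kHz                   */

static float w[TAPS];         /* adaptive estimate of the echo path    */
static float x[TAPS];         /* history of the far-end (sent) signal  */

/* far: latest far-end sample; mic: near-end sample containing echo.
 * Returns the echo-cancelled near-end sample. */
float ec_process(float far, float mic)
{
    /* shift the far-end history */
    for (int i = TAPS - 1; i > 0; --i)
        x[i] = x[i - 1];
    x[0] = far;

    /* estimate the echo and subtract it from the near-end signal */
    float est = 0.0f, power = 1e-6f;
    for (int i = 0; i < TAPS; ++i) {
        est   += w[i] * x[i];
        power += x[i] * x[i];
    }
    float err = mic - est;

    /* NLMS update: step size normalized by far-end signal power */
    const float mu = 0.5f;
    for (int i = 0; i < TAPS; ++i)
        w[i] += mu * err * x[i] / power;

    return err;
}
```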

Emerging and current VoIP-based applications
Because the high-speed network as a whole (rather than a dedicated channel) is used as the transport mechanism, a major advantage of VoIP systems is the lower cost per communication session. Moreover, VoIP calls allow network operators to avoid most interconnect charges associated with circuit-switched telephony networks, and the additional infrastructure required to complete a VoIP phone call is minimal, because it uses the network already in place for the home or business personal computer (PC). Yet another reason for lower costs is that data-network operators often have unused bandwidth available, so VoIP services currently add an inconsequential cost-overhead burden.

VoIP users tend to think of their connection as being "free," since they can call anywhere in the world, as often as they want, for just pennies per minute. Although they are also paying a monthly fee to their Internet service provider, it can be amortized over both data and voice services.

Besides the low cost relative to the circuit-switched domain, many new features of IP services become available. For instance, incoming phone calls on the PSTN can be automatically rerouted to a user's VoIP phone, as long as it's connected to a network node. This arrangement has clear advantages over a globally enabled cellphone, since no roaming charges are involved. From the VoIP standpoint, the end user's location is irrelevant; it is simply seen as just another network-connection point. This is especially useful where wireless local-area networks (LANs) are available; IEEE 802.11-enabled VoIP handsets allow conversations at Wi-Fi hotspots worldwide without the need to worry about mismatched communications infrastructure and transmission standards.

Everything discussed so far in relation to voice-over-IP extends to other forms of data-based communication as well. After all, once data is digitized and packetized, the nature of the content doesn't much matter, as long as it is appropriately encoded and decoded with adequate bandwidth. Because of this, the VoIP infrastructure facilitates an entirely new set of networked real-time applications, such as:

  • Videoconferencing
  • Remote video surveillance
  • Analog telephone adapters
  • Multicasting
  • Instant messaging
  • Gaming
  • Electronic whiteboards

A closer look at the VoIP system
Figure 1 shows key components of a VoIP system: the signaling process, the encoder/decoder, the transport mechanism, and the switching gateway.


1. (a) Simplified representation of possible IP telephony network connections.
(b) Signaling and transport flows between endpoints.

The signaling process involves creating, maintaining, and terminating connections between nodes. To learn more about signaling and the IP network, see "Voice over IP (VoIP)—The basics."

In order to reduce network bandwidth requirements, audio and video are encoded before transmission and decoded during reception. This compression and conversion process is governed by various codec standards for both audio and video streams.

The compressed packets move through the network governed by one or more transport protocols. A switching gateway ensures that the packet set is interoperable at the destination with another IP-based system or a PSTN system. At its final destination, the packet set is decoded and converted back to an audio/video signal, at which point it is played through the receiver's speakers and/or display unit.

The Open Systems Interconnection (OSI) seven-layer model (Figure 2) specifies a framework for networking. If there are two parties to a communication session, data generated by each starts at the top, undergoing any required configuration and processing through the layers, and is finally delivered to the physical layer for transmission across the medium. At the destination, processing occurs in the reverse direction, until the packets are finally reassembled and the data is provided to the second user.


2. Open Systems Interconnection and TCP/IP models.

Session control: H.323 vs. SIP
The first requirement in a VoIP system is a session-control protocol to establish presence and locate users, as well as to set up, modify, and terminate sessions. There are two protocols in wide use today. Historically, the H.323 protocol was used, but Session Initiation Protocol (SIP) is rapidly becoming the main standard. Let's take a look at the role played by each.

International Telecommunication Union (ITU) H.323
H.323 is an ITU standard originally developed for real-time multimedia (voice and video) conferencing and supplementary data transfer. It has rapidly evolved to meet the requirements of VoIP networks. It is technically a container for a number of network and media codec standards. The connection signaling part of H.323 is handled by the H.225 protocol, while feature negotiation is supported by H.245.

Session Initiation Protocol (SIP)
SIP is defined by the Internet Engineering Task Force (IETF) under RFC 3261. It was developed specifically for IP telephony and other Internet services—and though it overlaps H.323 in many ways, it is usually considered a more streamlined solution.

SIP is used with Session Description Protocol (SDP) for user discovery; it provides feature negotiation and call management. SDP is essentially a format for describing initialization parameters for streaming media during session announcement and invitation. The SIP/SDP pair is somewhat analogous to the H.225/H.245 protocol set in the H.323 standard.
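
For illustration, a minimal SDP body might look like the following (the user, session name, and addresses are hypothetical; payload types 0 and 18 are the static RTP assignments for G.711 µ-law and G.729):

```
v=0
o=alice 2890844526 2890844526 IN IP4 192.0.2.10
s=VoIP call
c=IN IP4 192.0.2.10
t=0 0
m=audio 49170 RTP/AVP 0 18
a=rtpmap:0 PCMU/8000
a=rtpmap:18 G729/8000
```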

SIP can be used in a system with only two endpoints and no server infrastructure. However, in a public network, special proxy and registrar servers are utilized for establishing connections. In such a setup, each client registers itself with a server, in order to allow callers to find it from anywhere on the Internet.

Transport layer protocols
The signaling protocols above are responsible for configuring multimedia sessions across a network. Once the connection is set up, media flow between network nodes is established by one or more data-transport protocols, such as UDP or TCP.

User Datagram Protocol (UDP)
UDP is a connectionless transport protocol: datagrams are simply sent out, and there is no acknowledgement that a packet has been received at the other end. Since delivery is not guaranteed, voice transmission will not work very well over UDP alone when there are peak loads on a network. That is why a media transport protocol, like RTP (discussed below), usually runs on top of UDP.
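
UDP's fire-and-forget character is visible at the sockets API: sendto() returns as soon as the datagram is queued locally, telling the caller nothing about delivery. A minimal POSIX sketch, with placeholder address and port:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int send_voice_frame(const void *frame, size_t len)
{
    int s = socket(AF_INET, SOCK_DGRAM, 0);  /* datagram socket */
    if (s < 0)
        return -1;

    struct sockaddr_in dst;
    memset(&dst, 0, sizeof dst);
    dst.sin_family = AF_INET;
    dst.sin_port   = htons(5004);            /* placeholder RTP port */
    inet_pton(AF_INET, "192.0.2.20", &dst.sin_addr);

    /* Returns once the kernel accepts the datagram; no ACK follows. */
    ssize_t n = sendto(s, frame, len, 0,
                       (struct sockaddr *)&dst, sizeof dst);
    close(s);
    return n < 0 ? -1 : 0;
}
```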

Transmission Control Protocol (TCP)
TCP uses a client/server communication model: the client requests, and is provided, a service by another computer (a server) in the network. Unlike UDP, TCP is connection-oriented; the endpoints establish a connection before data flows, and the receiver acknowledges the data it receives.

TCP breaks a message into smaller packets that are transmitted over the Internet and received by a TCP layer at the other end of the call, where they are reassembled into the original message. The IP layer interprets the address field of each packet so that it arrives at the correct destination.

Unlike UDP, TCP does guarantee complete receipt of packets at the receiving end. However, it does this by allowing packet retransmission, which adds latencies that are not helpful for real-time data. For voice, a late packet due to retransmission is as bad as a lost packet. Because of this characteristic, TCP is usually not considered an appropriate transport for real-time streaming media transmission.

Figure 2 (above) shows how the TCP/IP Internet model, and its associated protocols, compares with and utilizes various layers of the OSI model.

Media transport
As noted above, sending media data directly over a transport protocol is not very efficient for real-time communication. Because of this, a media transport layer is usually responsible for handling this data in an efficient manner.

Real-Time Transport Protocol (RTP)
RTP provides delivery services for real-time packetized audio and video data, and it is the standard way to transport real-time data over IP networks. The protocol resides on top of UDP to minimize packet header overhead—but at a cost: there is no guarantee of reliability or packet ordering. Compared to TCP, RTP is less reliable, but it incurs less latency in packet transmission, since it dispenses with TCP's connection setup, acknowledgements, and retransmissions; its fixed header is also compact (Figure 3).


3. Header structure and payload of an RTP frame.
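
The fixed RTP header is only 12 bytes. As a C sketch of its layout (fields per RFC 3550; real code packs the bits explicitly and converts to network byte order rather than trusting compiler bitfield ordering):

```c
#include <stdint.h>

typedef struct {
    uint8_t  vpxcc;      /* version(2) | padding(1) | extension(1) | CSRC count(4) */
    uint8_t  mpt;        /* marker(1) | payload type(7)                            */
    uint16_t seq;        /* sequence number, increments per packet                 */
    uint32_t timestamp;  /* sampling-clock time of the first payload sample        */
    uint32_t ssrc;       /* synchronization-source identifier                      */
} rtp_header_t;          /* 12 bytes, followed by optional CSRCs and the payload   */

/* Example: first octet for RTP version 2, no padding/extension/CSRCs. */
enum { RTP_V2_FIRST_OCTET = 2u << 6 };
```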

In order to maintain a given QoS level, RTP marks each packet sent with a timestamp and a sequence number; delivery statistics are reported back through the companion RTCP protocol (described below). RTP also supports a number of error-correction schemes for increased robustness, as well as some basic security options for encrypting packets.

Figure 4 compares performance and reliability of UDP, RTP, and TCP.


4. Performance vs. reliability.

RTP Control Protocol (RTCP)
RTCP is a complementary protocol used to communicate control information, such as the number of packets sent and lost, jitter, delay, and endpoint descriptions. It is most useful for managing session time bases and for analyzing the QoS of an RTP stream. It also can provide a backchannel for limited retransmission of RTP packets.

Media codecs
At the top of the VoIP stack are protocols to handle the actual media being transported. There are potentially quite a few audio and video codecs that can feed into the media transport layer. A sampling of the most common ones is listed below.

A number of factors help determine how desirable a codec is—including how efficiently it makes use of available system bandwidth, how it handles packet loss, and what costs are associated with it, including intellectual-property royalties.

Audio codecs

G.711
Introduced in 1972, G.711—the international standard for encoding telephone audio on a 64-kbps channel—is the simplest standard among the options presented here. The only compression used in G.711 is companding (using either the µ-law or A-law standards), which compresses each data sample to an 8-bit word; at the standard 8-kHz sampling rate, that yields an output bit rate of 8 bits × 8 kHz = 64 kbps. The H.323 standard specifies that G.711 must be present as a baseline for voice communication.
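
The µ-law companding step is small enough to show in full. The routine below is a condensed version of the widely circulated C encoder: it folds a 16-bit linear sample into a sign bit, a 3-bit segment (exponent), and a 4-bit mantissa:

```c
#include <stdint.h>

uint8_t linear_to_ulaw(int16_t pcm)
{
    const int BIAS = 0x84;            /* 132: shifts the segment edges     */
    int v = pcm;
    int sign = (v >> 8) & 0x80;       /* high-order bit = sign             */
    if (sign) v = -v;
    if (v > 32635) v = 32635;         /* clip to avoid overflow after bias */
    v += BIAS;

    /* segment = position of the highest set bit among bits 14 down to 8 */
    int seg = 7;
    for (int mask = 0x4000; (v & mask) == 0 && seg > 0; mask >>= 1)
        seg--;

    /* sign | 3-bit segment | 4-bit mantissa, inverted per G.711 */
    uint8_t uval = (uint8_t)((seg << 4) | ((v >> (seg + 3)) & 0x0F));
    return (uint8_t)~(uval | sign);
}
```

As a sanity check, a zero input maps to 0xFF and full-scale positive input maps to 0x80, matching the standard µ-law code table.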

G.723.1
G.723.1 is an algebraic code-excited linear-prediction (ACELP)-based dual-bit-rate codec, released in 1996 to target VoIP applications. The encoding time frame for G.723.1 is 30 ms. Each frame can be encoded in 20 bytes or 24 bytes, thus translating to 5.3-kbps or 6.3-kbps streams, respectively. The bit rates can be effectively reduced through voice-activity detection and comfort-noise generation. The codec offers good immunity against network imperfections—like lost frames and bit errors. G.723.1 is suitable for video-conferencing applications, as described by the H.324 family of international standards for multimedia communication.

G.729
Another speech codec, released in 1996, is the low-latency G.729 audio data-compression algorithm, which partitions speech into 10 ms frames. It uses an algorithm called conjugate-structure ACELP (CS-ACELP). G.729 compresses 16-bit signals sampled at 8 kHz via 10 ms frames into a standard bit rate of 8 kbps, but it also supports 6.4 kbps and 11.8 kbps rates. In addition, it supports voice-activity detection and comfort-noise generation.

GSM
The GSM speech codecs find use in cellphone systems around the world. The governing body for these standards is the European Telecommunications Standards Institute (ETSI). Standards in this domain have evolved since the first one, GSM Full Rate (GSM-FR), which uses a linear-predictive coder with regular pulse excitation and long-term prediction (RPE-LTP). The input speech signal is divided into 20-ms frames, and each frame is encoded as 260 bits, producing a total bit rate of 260 bits / 20 ms = 13 kbps. Free GSM-FR implementations are available for use under certain restrictions.

Speex
Speex, an Open Source/Free Software audio compression format designed for speech, was released by Xiph.org with the goal of being a totally patent-free speech solution. Like many other speech codecs, Speex is based on CELP with residue coding. It can code 8-kHz, 16-kHz, and 32-kHz linear PCM signals into bit rates ranging from 2 kbps to 44 kbps. Speex is resilient to network errors, and it supports voice-activity detection. Besides allowing variable bit rates, Speex also offers stereo encoding, a feature unusual among speech codecs.

Video codecs

H.261
This standard, developed in 1990, was the first widely used video codec. It introduced the idea of segmenting a frame into 16 × 16 "macroblocks" that are tracked between frames to establish motion-compensation vectors. It is mainly targeted at videoconferencing applications over integrated-services-digital-network (ISDN) lines, which operate at p × 64 kbps, where p ranges from 1 to 30. Input frames are typically CIF resolution (352 × 288) at 30 frames per second (fps), and compressed output streams occupy 64 kbps to 128 kbps at 10 fps. Although still used today, it has been largely superseded by H.263. Nevertheless, H.323 specifies that H.261 must be present as a baseline for video communication.

H.263
This codec is ubiquitous in videoconferencing, outperforming H.261 at all bit rates. Input sources are usually quarter-common intermediate format (QCIF, i.e., 176 × 144) or CIF at 30 fps, and output bit rates can be less than 28.8 kbps at 10 fps for the same performance as H.261. So whereas H.261 needs an ISDN line, H.263 can use ordinary phone lines. H.263 finds use in end markets such as video telephony and networked surveillance, and it is popular in IP-based applications.

Conclusion
Clearly, VoIP technology has the potential to revolutionize the way people communicate—whether they're at home or at work, plugged-in or untethered, video-enabled or just plain audio-minded. The power and versatility of VoIP will make it increasingly pervasive in embedded environments, creating value-added features in many markets that are not yet experiencing the benefits of this exciting technology.

About the authors
David Katz is Blackfin Applications Manager for New Product Development at Analog Devices, Inc. He is co-author of Embedded Media Processing (Newnes 2005). Previously, he worked at Motorola, Inc., as a senior design engineer in cable modem and factory automation groups. David holds both a B.S. and an M. Eng. in Electrical Engineering from Cornell University. He can be reached at david.katz@analog.com.

Tom Lukasiak joined Analog Devices in 2000. His current focus is Blackfin systems design. He received an Sc.B. in 2000 and an Sc.M. in 2002, both in Electrical Engineering, from Brown University. He can be reached at tomasz.lukasiak@analog.com.

Rick Gentile joined ADI in 2000 as a Senior DSP Applications Engineer, and he currently leads the Blackfin DSP Applications Group. Prior to joining ADI, Rick was a Member of the Technical Staff at MIT Lincoln Laboratory, where he designed several signal processors used in a wide range of radar sensors. He received a B.S. in 1987 from the University of Massachusetts at Amherst and an M.S. in 1994 from Northeastern University, both in Electrical and Computer Engineering. He can be reached at richard.gentile@analog.com.

Wayne Meyer is the Product Marketing Manager for Blackfin processors. He has over 15 years of experience in the semiconductor industry working for both Analog Devices and Motorola (Freescale) Semiconductor. Previous positions include Product Engineering and Product Manager roles for 8-bit and 16-bit microcontrollers and Business Development Manager for 24-bit DSP and 32-bit RISC microprocessors. He can be reached at wayne.meyer@analog.com.