Reader Question: Why on a VoIP network does it sometimes sound like I am talking to a robot? Why do voices get a metallic sound?

Here's another reader question that was submitted here via the weblog:

Why on a VoIP network does it sometimes sound like I am talking to a robot? Why do voices get a metallic sound?

There are a number of factors related to digitizing and packetizing voice that come into play. I don't think we can give a specific reason without a technical evaluation of the connection in question, but here's an explanation of the factors that can lead to robotic or metallic voice quality.

Since the legacy PSTN transmits digital signals, and VoIP then packetizes that signal, analog-to-digital conversion comes first. Voice communication starts as an analog signal, the human voice. The voice is converted into electrical signals, which are digitized and then packetized for transmission in IP packets.

Voice quality was traditionally measured by placing a group of people in a room and having them listen to sound in headphones. The evaluators rate the quality of the sound from 1 to 5. A 5 is the highest rating, being what might be called “pin drop” quality. Many providers refer to this as “toll quality voice”. This is the highest-grade voice quality, and has always been the benchmark for conducting corporate business calls. A rating of 1 equates more to the scratchy sound quality of an intercom speaker in a warehouse or at the drive-thru hamburger stand. This rating, from 1 to 5, is referred to as the Mean Opinion Score (MOS). While the statistical validity may be questionable to some, the process worked satisfactorily for a number of years, and this method has long been accepted worldwide in telephony networks. In the real world, the human ear can clearly distinguish between a 4 MOS and a 4.5 MOS. Today there are a variety of tools that can measure MOS systematically rather than relying on a statistically questionable group of people.
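
As a rough illustration of the arithmetic behind a MOS, the sketch below simply averages a panel's ratings. The function name and the eight ratings are made up for the example; a real listening test follows a far more controlled procedure.

```python
from statistics import mean

def mean_opinion_score(ratings):
    """Average a panel's ratings on the 1-to-5 listening scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    for r in ratings:
        if not 1 <= r <= 5:
            raise ValueError(f"rating {r} is outside the 1-5 scale")
    return mean(ratings)

# Hypothetical panel of eight listeners rating the same call.
panel = [4, 5, 4, 4, 3, 5, 4, 4]
print("MOS:", mean_opinion_score(panel))
```

A single low rating pulls the average down only slightly, which is one reason panel size matters for this kind of test.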

A MOS of 4 to 5 is considered toll quality voice, rated suitable for long distance business use such as negotiating deals. Ratings below 3 are generally considered synthetic in quality, and may be described as a “robotic” sounding voice. This matters because the fewer bits used in the encoding scheme, the poorer the voice quality has always been. Since the PCM encoding scheme drives the network to a 64 kbps voice channel, a sampling algorithm that requires less bandwidth could improve network efficiency. With the enhancements in digital signal processor technology and improvements in electronics

The complete process for Pulse Code Modulation (PCM), widely used in the PSTN, consists of four steps to ready the signal for transmission.

  1. Filter unwanted frequencies. Only frequencies in the 0 Hz to 4,000 Hz range need be sampled because that’s the range used to transmit voice. This is also sometimes referred to as a 4 kHz voice channel.
  2. Sample the analog signal using Nyquist’s sampling theorem, which calls for sampling at twice the highest frequency. Since the maximum bandwidth is 4,000 Hz, we sample 8,000 times per second. The output is a series of analog pulses called the PAM signal.
  3. Quantize using a companding scheme to assign a discrete digital value to each sample.
  4. Encode. Pulse Code Modulation (PCM) proper is the process of assigning an 8-bit value (called a PCM word) to each quantized sample.
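
The sampling, companding, and encoding steps can be sketched in a few lines of Python. This is a minimal illustration, not bit-exact G.711: it assumes the input is already band-limited (step 1), uses the continuous mu-law formula with mu = 255, and all function names are my own.

```python
import math

SAMPLE_RATE = 8000   # Nyquist: twice the 4,000 Hz voice channel
MU = 255             # mu-law companding constant (assumed for this sketch)

def sample_tone(freq_hz, duration_s, amplitude=0.5):
    """Step 2: take 8,000 amplitude samples per second of a test tone."""
    n = round(SAMPLE_RATE * duration_s)
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
            for i in range(n)]

def compand(x):
    """Step 3: mu-law compression of a sample in the range -1.0 to 1.0."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def expand(y):
    """Inverse of compand(), applied at the receiving end."""
    return math.copysign(((1 + MU) ** abs(y) - 1) / MU, y)

def encode(x):
    """Step 4: map the companded sample to an 8-bit word (0 to 255)."""
    return round((compand(x) + 1) / 2 * 255)

def decode(word):
    """Recover an approximate sample value from an 8-bit word."""
    return expand(word / 255 * 2 - 1)

samples = sample_tone(1000, 0.02)        # 20 ms of a 1 kHz tone
words = [encode(s) for s in samples]     # one 8-bit word per sample
print(len(words), "words,", len(words) * 8, "bits")
print("channel bit rate:", SAMPLE_RATE * 8, "bps")
```

Companding spends more of the 256 available levels on quiet samples, so a soft voice survives the trip through 8 bits better than it would with evenly spaced levels.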

The chart below compares several encoding schemes that are in common use for various applications. It lists the codec types and algorithms used, the bit rate and sample size, the algorithmic encoding delay, and the mean opinion scores for the various digitization approaches.

| ITU-T Codec Standard | Coding Scheme Used | Bit Rate | Sample Size (bits) | Encoding Delay Time | Mean Opinion Score |
|---|---|---|---|---|---|
| G.711 | PCM | 64 kbps | 8 | <1 msec | 4.4 |
| G.722 | SB-ADPCM | 64 kbps | 8 | 4 msec | n/a |
| G.726 | ADPCM | 32 kbps | 4 | 1 msec | 4.2 |
| G.728 | LD-CELP | 16 kbps | 40 | 2 msec | 4.2 |
| G.729 | CS-ACELP | 8 kbps | 80 | 15 msec | 4.2 |
| G.723.1 | MPMLQ | 6.3 kbps | 192 | 37.5 msec | 3.98 |
| G.723.1 | ACELP | 5.3 kbps | 160 | 37.5 msec | 3.5 |


The encoding delay time is directly tied to the algorithm used. There are many factors that influence delay, but the processing time of the algorithm itself must be considered in terms of the total system delay. In any IP network, the overall transport delays of moving packets are unpredictable and variable. These factors alone may make the network unsuitable for real-time voice traffic. The nodal processing delay involved in encoding and decoding could add enough overhead to the end-to-end delay that the threshold of acceptable service is crossed and the network becomes unusable. The sum total of all the delays cannot exceed 300 milliseconds for an interactive voice network, and many providers strive for 200 milliseconds total delay or less.
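
To make the cumulative nature of delay concrete, here is a small sketch that adds up a delay budget and checks it against the 300 and 200 millisecond figures above. Apart from the 15 msec G.729 encoding delay taken from the chart, every component value is a made-up assumption for illustration, as is the function name.

```python
MAX_DELAY_MS = 300     # upper limit cited above for interactive voice
TARGET_DELAY_MS = 200  # what many providers aim for

def assess_delay(components_ms):
    """Sum every delay component and classify the end-to-end total."""
    total = sum(components_ms.values())
    if total <= TARGET_DELAY_MS:
        verdict = "within target"
    elif total <= MAX_DELAY_MS:
        verdict = "acceptable"
    else:
        verdict = "unusable"
    return total, verdict

# Hypothetical one-way budget, in milliseconds.
budget = {
    "encoding (G.729)": 15,
    "packetization": 20,
    "network transit": 150,
    "jitter buffer": 40,
    "decoding": 5,
}
total, verdict = assess_delay(budget)
print(f"{total} ms: {verdict}")
```

Swapping in a codec with a longer encoding delay, or a congested path with higher transit time, shows how quickly the budget is consumed.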

Each algorithm is documented in ITU-T standards and a wide variety of papers.

Pulse Code Modulation, or G.711, is the approach still most widely used in the PSTN. It's worth noting that even this "toll quality voice" falls somewhat short of the perfect Mean Opinion Score of 5. The 64 kbps PCM time slot provides the basic framework for contemporary public telephone services and equipment. This encoding scheme was widely used in most early VoIP systems. It is supported by virtually every equipment vendor in the VoIP sector.

The G.722 codec is used for FM radio and does not have an MOS associated with it. It is included for comparison. It's simply another method of encoding sound waves for transmission.

Adaptive Differential Pulse Code Modulation (ADPCM) offers a solution that could reduce the bandwidth requirements by half while only sacrificing 0.2 of a point on perceived quality in the MOS.
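
The halving follows directly from the sample size. A quick check of the arithmetic, assuming the same 8,000 samples per second for both codecs (the function name is mine):

```python
SAMPLE_RATE = 8000  # samples per second on a 4 kHz voice channel

def bit_rate_kbps(bits_per_sample):
    """Bit rate of a codec that sends a fixed number of bits per sample."""
    return SAMPLE_RATE * bits_per_sample / 1000

pcm = bit_rate_kbps(8)    # G.711 PCM, 8 bits per sample
adpcm = bit_rate_kbps(4)  # G.726 ADPCM, 4 bits per sample
print(pcm, "kbps vs", adpcm, "kbps:", f"{1 - adpcm / pcm:.0%} saved")
```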

Low-Delay Code Excited Linear Prediction (LD-CELP) coding, the G.728 codec, has been widely used in voice mail systems for digitizing voice messages stored on a hard drive.

Conjugate Structure Algebraic-Code-Excited Linear Prediction (CS-ACELP) can deliver an 8 kbps stream with less than 16 msec of processing time. This G.729 codec standard has been widely used in digital telephony, satellite transmission and wireless communications. It is also used in Voice over Frame Relay (VoFR) and is supported by many frame relay equipment vendors.

Multipulse Maximum Likelihood Quantization (MPMLQ), while published as an ITU-T standard, has also seen several proprietary implementations at lower bit rates, as low as 4.8 kbps. MPMLQ is able to maintain reliable performance despite a high bit error rate and has been deployed in many Russian telephony implementations over data networks.

Algebraic Code Excited Linear Prediction (ACELP) coding can produce a stream with a bit rate of only 5.3 kbps. This approach has been deployed in many frame relay networks. It can be adjusted to encode at several bit rates. At 8 kbps, ACELP measures a 4.2 MOS, and it’s able to adapt its rate on the fly. This capability could provide a mechanism for adapting to a network that doesn’t offer consistent, predictable performance. ACELP actually creates models of the human voice, then predicts what the next sound will be. It encodes the difference between the actual sound and the predicted sound, and the difference is transmitted to the receiving end. Since the other end of the call is also running ACELP, the calculation of the differences allows for an acceptable recreation of the human voice at the receiving end. In the past there were some complaints that ACELP techniques created a less accurate representation of women’s and children’s voices, which are generally higher in pitch than male voices. With improvements in digital signal processors, this drawback is less of an issue today.
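
The "predict, then send the difference" idea can be shown with a toy example. To be clear, this is not ACELP: the real codec models the vocal tract and searches a codebook, while the sketch below assumes a trivial predictor that extends a straight line through the last two samples. The sample values and function names are invented for illustration.

```python
def predict(history):
    """Guess the next sample by extending the line through the last two."""
    return 2 * history[-1] - history[-2]

def encode(samples):
    """Sender: transmit only the prediction error for each sample."""
    history = [0, 0]
    residuals = []
    for s in samples:
        residuals.append(s - predict(history))
        history.append(s)
    return residuals

def decode(residuals):
    """Receiver: run the same predictor and add back each error."""
    history = [0, 0]
    for r in residuals:
        history.append(predict(history) + r)
    return history[2:]

# A smooth, made-up waveform segment.
samples = [0, 100, 196, 284, 361, 425, 473, 504, 517, 511]
residuals = encode(samples)
print("residuals:", residuals)
print("round trip ok:", decode(residuals) == samples)
```

Once the predictor has a couple of samples to work with, the differences are far smaller than the samples themselves, so they need fewer bits. Both ends must run the same predictor, which is why the far end has to be running the same codec.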

While PCM is still the most widely used voice digitization technology, others are often used in VoIP. G.729, or CS-ACELP, is very popular in many VoIP implementations because it requires far less bandwidth than G.711 PCM.
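
For a sense of scale, the sketch below estimates the per-call IP bandwidth in one direction for the two codecs. It assumes 20 msec of voice per packet and counts only the IPv4, UDP, and RTP headers (20 + 8 + 12 bytes); link-layer overhead, silence suppression, and header compression are ignored, and the function name is my own.

```python
HEADER_BITS = (20 + 8 + 12) * 8  # IPv4 + UDP + RTP headers, in bits

def call_bandwidth_kbps(codec_kbps, packet_ms=20):
    """One-way IP bandwidth for a call, including per-packet headers."""
    payload_bits = codec_kbps * packet_ms      # kbps * msec = bits per packet
    packets_per_second = 1000 / packet_ms
    return (payload_bits + HEADER_BITS) * packets_per_second / 1000

print("G.711:", call_bandwidth_kbps(64), "kbps")
print("G.729:", call_bandwidth_kbps(8), "kbps")
print("G.729 at 10 msec packets:", call_bandwidth_kbps(8, packet_ms=10), "kbps")
```

Under these assumptions the headers outweigh the G.729 payload, so the real saving over G.711 is smaller than the raw 8 to 1 codec ratio suggests, and shorter packets make the overhead worse.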

The MOS can be affected by a number of factors. Delay, loss, and jitter all play a role. In general, robotic or tinny voice quality is a result of underprovisioning the resources of the network and demonstrates a shortfall in engineering. As I've noted many times, performing a comprehensive readiness assessment of the network can help avoid creating this sort of problem.





Comments

You omitted the packetization delay, which is encountered even by G.711. This delay is inversely proportional to the bandwidth overhead.

I am not an acoustic guy; but I would have guessed that the robotic tone comes from sampling limited frequency and the quality of the mic and speakers. I think so because the other factors you identify will either introduce clipping or echo. But then I am not certain.

Absolutely on target as always, Aswath makes a great point that I overlooked in my haste to get a response posted.

It's crucial to remember that delay, any delay, is cumulative. Packetization delay is codec independent. They all suffer to some degree. I'm also not an acoustic guy, so I can't address the sampling rate of a softphone implementation on a PC, mic and speakers, but I'd certainly expect that results vary depending on system resources available. Even a high-end Windows machine would encounter issues if too many applications are loaded in active memory. The swap pages still use resources swapping in and out.

An angle, perhaps related to the acoustic issue Aswath noted, is using a laptop with built-in speakers and mic. We've all noted the echo quality, but I think that also leads to the tinny sound. Seems more common when I'm talking to someone on a laptop who isn't using a headset.

The bottom line is that all delay is cumulative and impacts the MOS in VoIP traffic.

Thanks Aswath, for catching an angle I completely missed!


Ken Camp's Bio:

Ken Camp has more than 25 years of experience in information technology. Ken spent 17 years with AT&T and Lucent Technologies successfully designing and implementing voice and data networks. He later worked in the security marketplace and played a key role in early IPSec VPN deployments. As an independent consultant, Ken's primary focal areas include network performance improvement, security practices and the design and deployment of integrated voice and data solutions. He may be contacted at: ken_camp@realtimepublishers.net
