Summary:
Jigar Dani, Principal PM Manager, Microsoft
Sriram Srinivasan, Principal Software Engineering Manager, Microsoft
Over a decade ago, Skype invented the Silk audio codec to transmit speech over the internet and catalyzed the voice over internet protocol (VoIP) industry. The primary codec used in VoIP at the time was G.722, which required 64 kbps to transmit wideband (16 kHz) speech; Silk, on the other hand, offered wideband quality starting at 14 kbps. Additionally, Silk was an adaptive variable-bitrate codec that seamlessly switched from delivering narrowband (8 kHz) speech at an ultra-low bandwidth of 6 kbps to offering near-transparent speech quality at higher bitrates. This was critical for the dial-up and limited broadband internet that existed at the time, and Silk has served us well as the default codec for Skype and Microsoft Teams. It is also the basis of the voice mode of the Opus codec, which has been predominantly used in VoIP solutions over the last decade.
As we enter a new decade, users can choose from several high-end connectivity options such as high-speed broadband, optical fiber and 5G. Yet large segments of our user base are still limited to low cable internet speeds or 3G/4G cellular networks. They encounter constrained network situations with over 50% packet loss, and sporadic loss of coverage when moving between cell towers during a commute or switching between network types. Network availability becomes unpredictable even at home, when the connection is shared with family members who are streaming video, gaming, working remotely or attending online school. Meanwhile, user expectations and essential needs, especially during the pandemic, sometimes outpace the improvements in network connectivity. We need to communicate and collaborate on the go: on every device, every network, and in every environment. Thus, efficient utilization of the available bitrate is every bit as important today as it was in the dial-up world. Bitrate savings can be used to provide additional resiliency and/or to improve the experience of other workloads such as video and content sharing. We have considered these aspects to holistically address the challenges and deliver a virtual voice experience that is as good as talking in person, even in ultra-low bandwidth and highly constrained network conditions.
Today we share details on Satin, our new AI-powered audio codec, which can deliver super wideband speech starting at a bitrate of 6 kbps and full-band stereo music starting at a bitrate of 17 kbps, with progressively higher quality at higher bitrates. Satin has been designed to provide great audio quality even under high packet loss. Here is the net effect of our improved resiliency algorithms and the new Satin codec (use your favorite headset to hear the audio files):
Silk at 6 kbps, burst packet loss: [audio sample]
Satin at 6 kbps with improved resilience, burst packet loss: [audio sample]
We built this codec by combining multiple decades of algorithmic experience with advanced machine learning techniques. In this blog, we take a deeper look at getting this codec ready for our users.
What’s narrowband, wideband, and super wideband voice?
The human ear can generally perceive sounds that range in frequency from 20 Hz to 20 kHz. When dealing with discrete-time signals, we need to sample the audio waveform at a minimum of twice the highest frequency we wish to reproduce. This is generally why CD-quality music is sampled at 44.1 kHz (44100 samples per second) or 48 kHz. Early telephony systems used a sampling rate of 8 kHz and could reproduce frequencies up to 4 kHz (in practice up to 3.4 kHz), which was considered sufficient at the time for speech communication. While a lower sampling rate implies fewer bits per second to transmit over the wire, it resulted in the all-too-familiar tinny voice quality over the phone, as the higher vocal frequencies present in natural speech could not be reproduced. VoIP solutions, which were no longer limited by the narrowband telephony infrastructure, introduced us to the magic of wideband speech (reproducing frequencies up to 8 kHz, sampled at 16 kHz), and users were immediately able to appreciate the crisper, more natural and intelligible sound.
Codecs such as Silk and Opus (the default audio codec in WebRTC) took this a step further with the introduction of super wideband voice, capturing frequencies up to 12 kHz, sampled at 24 kHz (energy drops off rapidly at frequencies above 12 kHz for human voice). As mentioned earlier, higher sampling rates imply a higher bitrate. Satin re-defines super wideband to cover frequencies up to 16 kHz (sampled at 32 kHz) for greater clarity and sibilance, and its efficient compression enables super wideband voice at 6 kbps.
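The relationship between sampling rate, reproducible bandwidth and bitrate described above can be worked through in a few lines. The following is a minimal sketch: the band names and sampling rates come from this post, while the 16-bit mono PCM baseline used to compute a raw bitrate is an assumption chosen purely for comparison.

```python
# Sampling rates (Hz) for the bandwidth classes named in this post.
SAMPLING_RATES_HZ = {
    "narrowband": 8_000,
    "wideband": 16_000,
    "super wideband (Silk/Opus)": 24_000,
    "super wideband (Satin)": 32_000,
    "full-band": 48_000,
}

# Assumption for illustration: uncompressed 16-bit mono PCM as the baseline.
PCM_BITS_PER_SAMPLE = 16


def nyquist_bandwidth_hz(sampling_rate_hz: int) -> float:
    """Highest frequency that can be reproduced at this sampling rate."""
    return sampling_rate_hz / 2


def raw_pcm_kbps(sampling_rate_hz: int,
                 bits_per_sample: int = PCM_BITS_PER_SAMPLE) -> float:
    """Bitrate of the uncompressed signal in kbps."""
    return sampling_rate_hz * bits_per_sample / 1000


def compression_ratio(sampling_rate_hz: int, codec_kbps: float) -> float:
    """How many times smaller the coded stream is than raw PCM."""
    return raw_pcm_kbps(sampling_rate_hz) / codec_kbps


for name, rate in SAMPLING_RATES_HZ.items():
    bandwidth_khz = nyquist_bandwidth_hz(rate) / 1000
    print(f"{name}: sampled at {rate} Hz, reproduces up to "
          f"{bandwidth_khz:g} kHz, raw PCM {raw_pcm_kbps(rate):g} kbps")
```

Under that baseline, 32 kHz audio is 512 kbps uncompressed, so coding it at 6 kbps corresponds to a stream roughly 85 times smaller.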
Frequency components of the sound /t/ in the word "suit." There is a significant amount of energy well beyond the narrowband cutoff of 4 kHz and even the wideband cutoff of 8 kHz. Preserving energy in the higher spectral components results in more natural sounding speech.
Listen to the two samples below in your favorite headphones. The Satin super wideband speech sample sounds a lot more natural and intelligible, much like what you will hear when you are talking to someone in person.
Silk narrowband at 6 kbps: [audio sample]
Satin super wideband at 6 kbps: [audio sample]
How do you get super wideband at 6 kbps?
To achieve super wideband quality at 6 kbps, Satin uses a deep understanding of speech production, modelling and psychoacoustics to extract and encode a sparse representation of the signal. To further reduce the required bitrate, Satin only encodes and transmits certain parameters in the lower frequency bands. At the decoder, Satin uses deep neural networks to estimate the high-band parameters from the received low-band parameters and a minimal amount of side information sent over the wire. This approach solved the primary challenge of reproducing super wideband voice at ultra-low bitrates but introduced a new challenge: computational complexity. The analysis of the input speech signal to extract a low-dimensional representation is computationally intensive, and real-time inference on deep neural networks adds to the complexity. The team then focused on reducing the complexity through both algorithmic optimizations and techniques such as loop vectorization beyond what the compiler could achieve. This resulted in close to a 40% reduction in computational complexity and allowed us to run on all our users' devices.
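To make the decoder-side data flow concrete, here is a toy sketch of estimating high-band parameters from low-band parameters plus side information. It is not Satin's actual model: the parameter counts, the single hidden layer and the random untrained weights are all hypothetical stand-ins, and the only point is the shape of the mapping (low band and side information in, high band out).

```python
import math
import random

# All dimensions below are hypothetical, chosen only for illustration.
LOW_BAND_DIM = 8    # low-band parameters received per frame
SIDE_INFO_DIM = 1   # small amount of side information sent over the wire
HIDDEN_DIM = 16     # width of the single hidden layer
HIGH_BAND_DIM = 4   # high-band parameters estimated at the decoder


def make_layer(n_in, n_out, rng):
    """Random, untrained weights standing in for a trained network layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(inputs, layer, activation=None):
    """Fully connected layer: weights * inputs + bias, optional activation."""
    weights, biases = layer
    outputs = [sum(w * x for w, x in zip(row, inputs)) + b
               for row, b in zip(weights, biases)]
    return [activation(v) for v in outputs] if activation else outputs


def estimate_high_band(low_band, side_info, layers):
    """Map received low-band parameters and side info to a high-band estimate."""
    hidden = dense(low_band + side_info, layers[0], math.tanh)
    return dense(hidden, layers[1])


rng = random.Random(0)  # fixed seed so the sketch is reproducible
layers = [
    make_layer(LOW_BAND_DIM + SIDE_INFO_DIM, HIDDEN_DIM, rng),
    make_layer(HIDDEN_DIM, HIGH_BAND_DIM, rng),
]
low_band = [rng.random() for _ in range(LOW_BAND_DIM)]
side_info = [0.5]
high_band = estimate_high_band(low_band, side_info, layers)
print(f"estimated {len(high_band)} high-band parameters")
```

The bitrate saving in such a scheme comes from the fact that only the low-band parameters and the side information are transmitted; the price is the extra inference work at the decoder, which is the complexity challenge described above.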
As with all features, we A/B tested Satin before widely rolling it out, both to ensure there were no regressions and to quantify the positive impact for our users. The A/B tests showed a highly statistically significant increase in call duration for Satin compared to Silk at these low bitrates. Offline crowdsourced subjective tests to evaluate codec quality at 6 kbps showed the mean opinion score (MOS) rating of Satin to be 1.7 MOS higher than that of Silk.
How resilient is Satin to packet loss?
The majority of our calls are on Wi-Fi and mobile networks, where packet loss is common and can adversely affect call quality. Satin is uniquely positioned to compensate for packet loss. Unlike most other voice codecs, Satin encodes each packet independently, so losing one packet does not affect the quality of subsequent packets. The codec is also designed to facilitate high-quality packet loss concealment in an internal parametric domain. These features help Satin gracefully handle random losses where one or two packets are lost at a time.
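The benefit of independent packets can be seen in a toy comparison. In the sketch below, frames are plain integers rather than audio, the "predictive" scheme sends only the difference from the previous frame, and a lost packet is concealed by holding the previous output. None of this reflects how Satin or any real codec represents audio, and real predictive codecs typically recover after a while, whereas this toy never resynchronizes.

```python
def encode_independent(frames):
    """Each packet carries the complete value of its own frame."""
    return list(frames)


def encode_predictive(frames):
    """Each packet carries only the change from the previous frame."""
    packets, previous = [], 0
    for frame in frames:
        packets.append(frame - previous)
        previous = frame
    return packets


def drop(packets, lost_indices):
    """Simulate the network: lost packets are replaced by None."""
    return [None if i in lost_indices else p for i, p in enumerate(packets)]


def decode_independent(packets):
    """On loss, repeat the last good frame; later frames are unaffected."""
    decoded, last_good = [], 0
    for packet in packets:
        if packet is not None:
            last_good = packet
        decoded.append(last_good)
    return decoded


def decode_predictive(packets):
    """On loss, assume no change; the decoder state stays wrong afterwards."""
    decoded, state = [], 0
    for packet in packets:
        if packet is not None:
            state += packet
        decoded.append(state)
    return decoded


def count_wrong(decoded, frames):
    """Number of frames that differ from the original signal."""
    return sum(d != f for d, f in zip(decoded, frames))


frames = [10, 12, 15, 19, 24, 30, 37, 45]
lost = {3}  # a single lost packet

independent = decode_independent(drop(encode_independent(frames), lost))
predictive = decode_predictive(drop(encode_predictive(frames), lost))

print("independent, wrong frames:", count_wrong(independent, frames))
print("predictive, wrong frames:", count_wrong(predictive, frames))
```

With one packet lost, the independent scheme gets exactly one frame wrong, while in the predictive toy every frame from the loss onward is wrong because the decoder state has drifted.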
Another type of packet loss, which is even more detrimental to perceived quality, is where several packets are lost in a burst. Here, Satin’s ability to deliver great audio at a low rate of 6 kbps provides the flexibility to use some of the available bitrate for adding redundancy and forward error correction that helps us recover from burst packet loss. Satin allows us to do this without having to compromise overall audio quality.
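The trade-off between a low codec bitrate and room for redundancy can be sketched as follows. This is a deliberately simple repetition scheme in which each packet also carries copies of the preceding frames; it is an assumption for illustration, not the redundancy or forward error correction scheme used in Teams. The 24 kbps budget is hypothetical, and packet header overhead and the added delay of late recovery are ignored.

```python
CODEC_KBPS = 6  # Satin's lowest bitrate, from the text


def max_redundant_copies(budget_kbps, codec_kbps=CODEC_KBPS):
    """How many earlier frames fit in each packet alongside the current one."""
    return max(budget_kbps // codec_kbps - 1, 0)


def build_packets(num_frames, copies):
    """Packet i carries frame i plus the `copies` frames that preceded it."""
    return [
        [j for j in range(i, i - copies - 1, -1) if j >= 0]
        for i in range(num_frames)
    ]


def recovered_frames(packets, lost_indices):
    """Frames available at the receiver from all packets that arrived."""
    available = set()
    for index, payload in enumerate(packets):
        if index not in lost_indices:
            available.update(payload)
    return available


num_frames = 10
burst = {3, 4, 5}                   # three consecutive packets lost
copies = max_redundant_copies(24)   # hypothetical 24 kbps budget

without_redundancy = recovered_frames(build_packets(num_frames, 0), burst)
with_redundancy = recovered_frames(build_packets(num_frames, copies), burst)

print("redundant copies per packet:", copies)
print("frames recovered without redundancy:", len(without_redundancy))
print("frames recovered with redundancy:", len(with_redundancy))
```

In this toy, a 24 kbps budget leaves room for three redundant copies of a 6 kbps stream, enough for the first packet after a three-packet burst to carry all the missing frames. A codec that needed the full budget for a single copy would have no such headroom.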
Satin is already used for all Teams and Skype two-party calls, and we are rolling it out for meetings soon. Satin currently operates in wideband voice mode within a bitrate range of 6 – 36 kbps and will soon be extended to support full-band stereo music at a maximum sampling rate of 48 kHz. We are very excited for you to try this new codec. Let us know what you think.
Subscribe to the Teams Engineering Tag RSS feed to stay in touch with the latest updates from our engineering teams.
Want to work on the team that builds bleeding-edge AI technology? See AI Jobs in M365 Intelligent Conversations and Communications Cloud Team.
Date: 2021-02-15 16:00:00Z
Link: https://techcommunity.microsoft.com/t5/microsoft-teams-blog/satin-microsoft-s-latest-ai-powered-audio-codec-for-real-time/ba-p/2119234