Designing to Meet New Expectations for Audio
August 16, 2016, Intel Developer Forum, San Francisco—A group of experts talked about the changes in capabilities and characteristics that computer systems will need as consumer devices increasingly move to voice-activated interfaces. Users expect greater personal mobility and no longer want to be tethered to their devices, even when the interface devices themselves are wireless.
Most audio interfaces to computers require fairly restrictive operating conditions. The microphone needs to be in close proximity to the mouth, and the speaker must be in a quiet environment. These near-field conditions result in very good signal-to-noise ratios (SNRs) and minimal influence from noise, echoes, and reverberation.
The situation changes dramatically for far-field audio, where the speaker is more than 1 m from the input device. Performing at distances of 1 to 4 m requires significant changes in the input and output characteristics of the audio system. Most speakers talk at a sound pressure level of about 90 dBA at a distance of 1 cm from the mouth; at 4 m, this level drops to about 54 dBA.
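To first order, the falloff with distance follows the inverse-square law for a point source in free field: every doubling of distance costs about 6 dB. A minimal sketch of that model (the helper name `spl_drop` is illustrative; the point-source model breaks down in the extreme near field close to the mouth, and real rooms add reverberation, so measured figures like those quoted above will differ):

```python
import math

def spl_drop(d_ref, d, spl_ref):
    """Free-field SPL at distance d, given spl_ref measured at d_ref.

    Point-source model: the level falls by 20 * log10(d / d_ref) dB.
    This is a first-order approximation only.
    """
    return spl_ref - 20.0 * math.log10(d / d_ref)

# Every doubling of distance costs about 6 dB in free field:
print(round(spl_drop(1.0, 2.0, 90.0), 1))  # 84.0
print(round(spl_drop(1.0, 4.0, 90.0), 1))  # 78.0
```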
The system must also be able to output sound levels in the same range, since people cannot hear sounds below a certain threshold. Ideally, a system should be able to provide 90 to 100 dB at the speaker, which results in a sound pressure level of about 70 dB at the user's ear. Therefore, next-generation systems with voice-activated interfaces need much better input and output capabilities, as well as better algorithms to improve the signal-to-noise ratio enough for the system to actually recognize speech.
Most existing algorithms can handle sound inputs over a 25 to 30 dB range, but this is obviously inadequate for a sound field with 40-50 dB of loss. In addition, most of these devices will operate in a sonic environment that includes reverberation, noise, distortion, and multiple speakers, and will have difficulty using time-delay processing to localize the speaker.
One use case for a far-field interface is waking a sleeping device with a key phrase. When you walk into the living room and say "hey television, let's watch a show," your expectation is that the TV set will turn on and give you a verbal response within a very short timeframe. This function can be very challenging due to the amount of activity required to change from a low-power state to an active one.
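The wake flow described above amounts to a small state machine: a low-power detector listens only for the key phrase, and on a match the device transitions to an active state and acknowledges the user. A hypothetical sketch (the phrase, states, and responses are illustrative, not from any product):

```python
def make_wake_device(key_phrase="hey television"):
    """Tiny two-state wake-word machine: SLEEPING until the key
    phrase is heard, then AWAKE and handling spoken commands."""
    state = {"mode": "SLEEPING"}

    def hear(utterance):
        text = utterance.lower().strip()
        if state["mode"] == "SLEEPING":
            if text.startswith(key_phrase):
                state["mode"] = "AWAKE"
                return "OK, I'm listening."  # quick verbal acknowledgment
            return None  # stay asleep; ignore unrelated speech
        return f"Handling command: {text}"

    return hear
```

In a real device the "SLEEPING" detector runs on a tiny always-on DSP, and the expensive recognizer only powers up after the key phrase fires, which is exactly the low-power-to-active transition the paragraph calls challenging.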
Speaker localization is a nontrivial problem, since the speaker's voice can be affected by attenuation, noise, and reverberation, and both the speaker and the noise sources may be non-stationary. The microphone system must distinguish between user-facing and world-facing inputs, and must have underlying processing that helps reduce interference from other sound sources. Multiple-microphone systems can use beamforming techniques to minimize the volume of space accepted for voice input. This active focusing process can improve the SNR from a base of 5 dB to well over 50 dB.
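Beamforming in its simplest form is delay-and-sum: each microphone signal is delayed so that sound from the chosen direction arrives in phase across the array, then the channels are averaged, which reinforces the target and attenuates off-axis sources. A minimal sketch for a uniform linear array with integer-sample delays (function names are illustrative; real systems use fractional delays and adaptive weights):

```python
import math

SPEED_OF_SOUND = 343.0  # m/s at room temperature

def steering_delays(n_mics, spacing_m, angle_deg, fs):
    """Per-mic delays (in samples) that align a plane wave arriving
    from angle_deg off broadside on a linear array. Assumes
    angle_deg >= 0 so all delays are non-negative."""
    tau = spacing_m * math.sin(math.radians(angle_deg)) / SPEED_OF_SOUND
    return [round(i * tau * fs) for i in range(n_mics)]

def delay_and_sum(channels, delays):
    """Average the channels after applying non-negative
    integer-sample delays; output is trimmed to the overlap."""
    n = min(len(ch) - d for ch, d in zip(channels, delays))
    return [sum(ch[d + t] for ch, d in zip(channels, delays)) / len(channels)
            for t in range(n)]
```

For a pulse that reaches the second microphone two samples after the first, `delay_and_sum([ch0, ch1], [0, 2])` re-aligns the copies so they add coherently, while uncorrelated noise averages down.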
The speech recognition algorithms must have sufficient processing capability to not only perform speech recognition but also reduce noise and echoes. Signal processing algorithms also need fast and accurate automatic gain control (AGC) to adjust levels for the speech recognition stages. In conjunction with beamforming, the sound system greatly improves SNR and rejects close-proximity noise sources. One requirement for the microphones is that they have good frequency and dynamic response, and all the microphones must be fairly closely matched.
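A simple AGC tracks a running estimate of the signal level and scales the input toward a target, with a fast attack so loud onsets are tamed quickly and a slow release so quiet passages are boosted without pumping. A minimal sketch (parameter values are illustrative, not from the talk):

```python
def agc(samples, target=0.3, attack=0.5, release=0.01, eps=1e-6):
    """Scale samples toward a target peak level.

    'attack' sets how fast the level estimate rises (fast, to catch
    loud onsets); 'release' sets how fast it falls (slow, to avoid
    pumping during quiet passages).
    """
    level = target
    out = []
    for x in samples:
        mag = abs(x)
        rate = attack if mag > level else release
        level += rate * (mag - level)          # one-pole level tracker
        out.append(x * target / max(level, eps))  # normalize toward target
    return out
```

Fed a loud constant signal or a quiet one, the output converges to the same target level in both cases, which is what keeps the recognizer's input inside its usable range.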
Different configurations of microphones provide different types of spatial response. A linear array does poorly at rejecting side noise and tends to be limited to 1-D tracking. Two microphones are adequate for near-field work, but a four-microphone linear array provides better tracking. An L-shaped array, with three microphones in the corner of the display, can provide 2-D tracking and much better direction finding than a linear array; this configuration provides good performance even in high-noise environments. One concern is that the packaging can alter the microphone response, so a recommendation is to use sealed mounting to isolate the microphones from other components.
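The tracking these arrays perform reduces, for each microphone pair, to measuring the time difference of arrival (TDOA): a plane wave from angle θ off broadside reaches one microphone d·sin θ / c seconds before the other, so θ = arcsin(c·τ/d). A minimal sketch using brute-force cross-correlation to find the lag (names are illustrative; production systems use FFT-based correlation such as GCC-PHAT):

```python
import math

SPEED_OF_SOUND = 343.0  # m/s

def tdoa_samples(a, b):
    """Lag (in samples) of signal b relative to signal a, found by
    brute-force cross-correlation over all possible lags."""
    best_lag, best_score = 0, float("-inf")
    n = len(a)
    for lag in range(-n + 1, n):
        score = sum(a[i] * b[i + lag]
                    for i in range(n) if 0 <= i + lag < n)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def arrival_angle(lag_samples, fs, spacing_m):
    """Angle off broadside (degrees) from a TDOA in samples,
    clamped to the physically valid range."""
    s = lag_samples * SPEED_OF_SOUND / (fs * spacing_m)
    return math.degrees(math.asin(max(-1.0, min(1.0, s))))
```

An L-shaped array repeats this estimate on two perpendicular pairs, which is what turns the 1-D bearing of a linear array into 2-D tracking.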
Even though the requirements for audio fidelity differ between speech and music, the speaker system must be developed to handle the full dynamic range and frequency response of CD quality music. Unfortunately, device packaging places severe constraints on the ability of speakers to provide these high-level outputs. The speaker systems must not only overcome size and power constraints of the device, but must also compensate for room acoustics.
Getting high-volume, high-fidelity sound from very small speakers is not easy. One way to improve the low-end frequency response is to use chassis-attached speakers to increase the physical area driving the sound. Smart amplifiers can improve dynamic range and deliver a better crest factor than other amplifier designs. Advances in signal processing make digitally driven, multi-coil speakers capable of producing a wide dynamic range with good frequency response. Another facet of the whole aural experience is using the onboard signal processing capabilities to tune the output frequency response to compensate for room acoustics.
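Crest factor, the ratio of peak to RMS level, determines how much headroom a signal demands from the amplifier; speech and music have high crest factors, so an amplifier that tracks the signal can deliver more average loudness from the same peak capability. A minimal sketch of the calculation (the helper name is illustrative):

```python
import math

def crest_factor_db(samples):
    """Peak-to-RMS ratio of a signal, expressed in dB."""
    peak = max(abs(x) for x in samples)
    rms = math.sqrt(sum(x * x for x in samples) / len(samples))
    return 20.0 * math.log10(peak / rms)

# A pure sine wave has a crest factor of sqrt(2), about 3 dB;
# speech and music are typically far higher.
sine = [math.sin(2 * math.pi * i / 100) for i in range(100)]
print(round(crest_factor_db(sine), 1))  # 3.0
```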