New Algorithms for Speech to Text
April 5, 2016, GPU Technology Conference, San Jose, CA - William Chan from Carnegie Mellon University described a different way to extract textual content from speech. The conversational speech recognizer and text translator may be better suited to mobile devices because it requires a much smaller memory footprint.
Conversational speech recognizers attempt to automatically transform acoustic signals into text. Current systems use trained pronunciation dictionaries and several separate models to discern the text. These models resolve context-dependent phonemes, extract sequences, and map these data to a language model. All of the various models are stand-alone and independently optimized.
Current applications all rest on implicit assumptions, such as conditional independence, Markov chains, and phoneme matching, to reach their conclusions about which sounds map to which text. Unfortunately, many of these assumptions do not actually hold; they are made to simplify the computational problem.
A different approach is to model the mapping from the acoustics directly to the underlying characters. Signal processing can use a Listen, Attend and Spell (LAS) approach to map the sounds to a language model. This flow can be optimized end to end, since it is a self-contained process. The LAS model uses conditional distributions over the acoustic signal to emit the text one character at a time. The underlying mathematics requires each probability to be conditionally dependent on previous inputs.
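The conditioning described above is commonly written with the chain rule of probability. As a sketch, assuming x stands for the sequence of acoustic features and y_i for the i-th output character (symbols chosen here for illustration, not taken from the talk):

```latex
P(y \mid x) = \prod_{i} P(y_i \mid x,\; y_{<i})
```

Each factor is a distribution over the next character given the audio and every character emitted before it, so no independence between output characters has to be assumed.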
The basic flow has more efficient gradient flow, which derives from the explicit alignment of the acoustic signal and the text characters. The downside of the current approach is that the system takes a very long time to train. One version ran for over a month and still had not converged. As a result, they changed the flow from a linear flow to a time-scale convolution pyramid to reduce the processing time, since each layer cuts the time by half.
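The halving effect of a pyramid can be illustrated with a minimal Python sketch. It assumes each layer merges adjacent pairs of frames into one longer frame. The function names and the padding rule are hypothetical, and the learned layers that would sit between the reductions are omitted, so this shows only the time reduction and not the speaker's implementation.

```python
def halve_time_resolution(frames):
    """Merge each pair of adjacent frames into one longer frame."""
    if len(frames) % 2 == 1:
        # Assumption: pad an odd-length sequence by repeating the last frame.
        frames = frames + [frames[-1]]
    return [frames[i] + frames[i + 1] for i in range(0, len(frames), 2)]


def pyramid(frames, num_layers):
    """Apply the halving step once per layer."""
    for _ in range(num_layers):
        frames = halve_time_resolution(frames)
    return frames


# Eight one-dimensional frames pushed through three layers.
features = [[float(t)] for t in range(8)]
reduced = pyramid(features, num_layers=3)
print(len(features), "->", len(reduced))  # prints "8 -> 1"
```

With three layers the later stages see one eighth as many time steps, which is where the saving in processing time would come from.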
The latest version converges much better, taking only 2-3 weeks to train, but clearly still needs more work to become user friendly. Other work changes how the model is trained. A "sample trick" draws a sample from the model and uses it as the conditioning input for the next prediction. In addition, they added language model rescoring to leverage the vast quantity of available text.
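The sample trick can be sketched in a few lines of Python. This is an illustration under assumptions rather than the speaker's code: the function name, the `sample_rate` parameter, and the dictionary format for the model's predictions are all hypothetical.

```python
import random


def next_input_char(ground_truth_char, model_probs, sample_rate, rng):
    """Choose the character fed back as context for the next prediction.

    model_probs maps each character to the probability the model assigned
    to it at the previous step (hypothetical format).
    """
    if rng.random() < sample_rate:
        # Feed back the model's own guess, drawn from its predicted distribution.
        chars = list(model_probs)
        weights = [model_probs[c] for c in chars]
        return rng.choices(chars, weights=weights, k=1)[0]
    # Otherwise feed back the correct character from the transcript.
    return ground_truth_char


rng = random.Random(0)
predicted = {"a": 0.7, "o": 0.2, "e": 0.1}
print(next_input_char("a", predicted, sample_rate=0.1, rng=rng))
```

Feeding back the model's own samples during training exposes it to the kind of mistakes it will have to recover from at decoding time, when no transcript is available.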
They also ran an experiment using data from Google voice search repositories. They obtained 2,000 hours of content, accounting for 3 million training utterances. A subset of 16 hours with 22,000 test utterances served as the basis for testing. Training used stochastic gradient descent with 32 replicas. Compared to Google's 8 percent word error rate (WER), the basic LAS system scores 16 percent. Adding the sample trick improves LAS to 14 percent, and using LAS with a language model reduces the WER to 12.6 percent. Combining LAS with sampling and a language model brings the WER to 10.3 percent.
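For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal Python sketch of that standard definition; the function name and the example sentences are made up for illustration and do not come from the experiment.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("call me at ten", "call me a ten")
print(f"{wer:.1%}")  # prints "25.0%": one wrong word out of four
```

On this scale, a WER of 10.3 percent means roughly one word error for every ten words in the reference transcripts.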
The decoding process does not use a dictionary. LAS learns an implicit dictionary and, surprisingly, makes few spelling mistakes. As a result, the translator does not need to perform searches to match words. An improved and extended language model would help improve accuracy. Since the algorithm directly converts acoustic signals to English text characters, the system needs only a very small database. The change to a single, comprehensive data model has the potential to influence all apps that use voice input.
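Dictionary-free decoding can be pictured as emitting one character at a time until the model predicts an end-of-sentence marker. The Python sketch below uses greedy selection for simplicity, which is an assumption and not necessarily the procedure used in the talk. `toy_model` is a hypothetical stand-in for the trained network, and the `</s>` marker is likewise chosen for illustration.

```python
def greedy_decode(next_char_probs, max_len=50, eos="</s>"):
    """Emit the most likely character at each step until end-of-sentence.

    next_char_probs takes the characters decoded so far and returns a dict
    mapping each candidate character to its probability.
    """
    decoded = []
    for _ in range(max_len):
        probs = next_char_probs(decoded)
        best = max(probs, key=probs.get)
        if best == eos:
            break
        decoded.append(best)
    return "".join(decoded)


def toy_model(prefix):
    """Hypothetical stand-in for the network: spells "hi", then stops."""
    target = "hi"
    if len(prefix) < len(target):
        return {target[len(prefix)]: 0.9, "</s>": 0.1}
    return {"</s>": 1.0}


print(greedy_decode(toy_model))  # prints "hi"
```

No word list is consulted at any point, so the output can be any character string, including words the system never saw spelled out during training.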