Speech Recognition using Dynamical Systems Models

Project Overview

Summary

Current state-of-the-art speech recognition systems generally use Hidden Markov Models (HMMs) with frame-based spectral measures (often cepstral coefficients) as the primary features. Traditional spectral analysis techniques have been used for many years, with progress in recognition accuracy over the last 10-15 years being primarily incremental. This research project focuses on the development of a significantly different approach to characterizing speech signals, based on state-of-the-art techniques for time-series modeling. These time-series techniques combine state-space embedding methods and learning algorithms to create highly accurate non-linear models of a system's state. This research integrates a dynamical systems approach with a continuous speech recognition system, changing the analytical focus from the frequency domain to the time domain. The time-delay embedding technique, taken from dynamical systems theory, is used to reconstruct the state spaces of the speech waveforms. The resulting state spaces are then characterized to generate a set of features, which are evaluated with respect to their ability to differentiate the individual phonemes that are the building blocks of speech.
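
As a rough illustration of the time-delay embedding step described above, the following Python sketch maps a one-dimensional sample sequence to reconstructed state vectors of the form [x(n), x(n + lag), ..., x(n + (dim - 1) * lag)]. The function name delay_embed, its parameter names, and the toy input are assumptions made for this example. The embedding dimension, lag, and implementation actually used in the project are not specified here.

```python
def delay_embed(signal, dim, lag):
    """Return the time-delay embedding of a 1-D sequence as a list of points.

    Each reconstructed state vector is
    [signal[n], signal[n + lag], ..., signal[n + (dim - 1) * lag]].
    """
    if dim < 1 or lag < 1:
        raise ValueError("dim and lag must be positive integers")
    span = (dim - 1) * lag        # samples spanned beyond the first coordinate
    count = len(signal) - span    # number of complete state vectors
    if count <= 0:
        raise ValueError("signal is too short for this dim and lag")
    return [
        [signal[n + k * lag] for k in range(dim)]
        for n in range(count)
    ]


# Toy input standing in for a short run of speech samples (illustrative only).
samples = list(range(10))
points = delay_embed(samples, dim=3, lag=2)
print(len(points), points[0], points[-1])
```

Each row of the result is one point in the reconstructed state space, so a segment of speech becomes a cloud of points whose shape can then be characterized.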

Objectives

The project uses time-domain analysis of speech to create new modeling techniques and to gain a better understanding of speech signals, with the goal of improving speech recognition accuracy. The primary research objectives are the application of the time-delay embedding approach to the characterization of speech signals, the development of an effective model for measuring differences between the signals, and the integration of this model with an HMM-based speech recognition system. These objectives are pursued on two speech tasks: isolated phoneme recognition and continuous word recognition.

Methods

Achieving these objectives requires the development of several new technologies. To characterize speech signals in the time domain, the Time Series Data Mining approach, which has been successfully applied to event prediction, is adapted to speech waveforms; this includes developing techniques for identifying optimal lag times for the time-delay embedding process. Stochastic methods, including clustering techniques for learning parametric densities such as Gaussian Mixture Models, are used to identify appropriate feature representations of the embedded waveforms. To integrate these features with a recognition system, an HMM-based speech system is modified to use the new time-domain features when computing state occupancy likelihoods within the training and recognition algorithms.
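
Two of the steps named above, choosing an embedding lag and scoring an embedded point under a Gaussian Mixture Model, can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the project's method: the first-zero-crossing-of-autocorrelation rule is one common lag heuristic and may differ from the criterion developed in this research, the mixture is assumed to have diagonal covariances, and the function names, parameters, and toy sine-wave input are hypothetical.

```python
import math


def first_zero_autocorr_lag(signal, max_lag):
    """Return the smallest lag whose autocorrelation is <= 0, else max_lag.

    Assumption: this heuristic stands in for whatever lag-selection
    technique the project actually uses.
    """
    if not 1 <= max_lag < len(signal):
        raise ValueError("max_lag must be in [1, len(signal) - 1]")
    mean = sum(signal) / len(signal)
    centered = [s - mean for s in signal]
    for lag in range(1, max_lag + 1):
        acc = sum(centered[n] * centered[n + lag]
                  for n in range(len(centered) - lag))
        if acc <= 0.0:
            return lag
    return max_lag


def gmm_log_likelihood(point, weights, means, variances):
    """Log-density of one embedded point under a diagonal-covariance GMM."""
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        log_p = math.log(w)
        for x, m, v in zip(point, mu, var):
            log_p += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        log_terms.append(log_p)
    top = max(log_terms)  # log-sum-exp keeps the sum numerically stable
    return top + math.log(sum(math.exp(t - top) for t in log_terms))


# Toy periodic input (period 40 samples), standing in for voiced speech.
wave = [math.sin(2.0 * math.pi * n / 40.0) for n in range(400)]
lag = first_zero_autocorr_lag(wave, max_lag=30)

# Hypothetical two-component mixture over 2-D embedded points.
score = gmm_log_likelihood(
    point=[0.0, 1.0],
    weights=[0.5, 0.5],
    means=[[0.0, 1.0], [0.0, -1.0]],
    variances=[[0.1, 0.1], [0.1, 0.1]],
)
print(lag, score)
```

For the sine wave, the selected lag falls near a quarter of the period, where successive coordinates of the embedding are least linearly correlated. A per-point log-likelihood such as this is the kind of quantity an HMM state could use in place of a cepstral-feature density; how the project combines such scores across the points of a segment is not stated here.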

Impact

The impact of these new technologies and their application to the speech recognition task extends into both the machine learning and signal processing communities. The development of time-domain characterization methods is directly applicable to many problems of interest in the chaos and non-linear modeling domains, because these methods provide a concrete measure of the differences between the phase-space representations of dynamical systems. Speech recognition is a particularly appropriate application for this research: the approach is novel in a field where traditional linear systems approaches have been unable to achieve fully satisfactory results. The experiments are expected to lead to significant gains in the fundamental understanding of the characteristics and analysis of speech signals, with potential long-term application to other areas of speech processing such as speech coding and synthesis.

Demonstration videos (in AVI format)

Contact information

Michael T. Johnson
web: http://johnson.engineering.uky.edu/
email: mike.johnson@uky.edu

Richard J. Povinelli
web: http://povinelli.eece.mu.edu
email: richard.povinelli@marquette.edu

This material is based upon work supported by the National Science Foundation under Grant No. 0113508.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.