Nonlinear feature transformations for noise robust speech recognition

Robustness against external noise is an important requirement for automatic speech recognition (ASR) systems, when it comes to deploying them for practical applications. This thesis proposes and evaluates new feature-based approaches for improving the ASR noise robustness. These approaches are based on nonlinear transformations that, when applied to the spectrum or feature, aim to emphasize the part of the speech that is relatively more invariant to noise and/or deemphasize the part that is more sensitive to noise. Spectral peaks constitute high signal-to-noise ratio part of the speech. Thus an efficient parameterization of the components only from the peak locations is expected to improve the noise robustness. An evaluation of this requires estimation of the peak locations. Two methods proposed in this thesis for the peak estimation task are: 1) frequency-based dynamic programming (DP) algorithm, that uses the spectral slope values of single time frame, and 2) HMM/ANN based algorithm, that uses distinct time-frequency (TF) patterns in the spectrogram (thus imposing temporal constraints during the peak estimation). The learning of the distinct TF patterns in an unsupervised manner makes the HMM/ANN based algorithm sensitive to energy fluctuations in the TF patterns, which is not the case with frequency-based DP algorithm. For an efficient parameterization of spectral components around the peak locations, parameters describing activity pattern (energy surface) within local TF patterns around the spectral peaks are computed and used as features. These features, referred to as spectro-temporal activity pattern (STAP) features, show improved noise robustness, however they are inferior to the standard features in clean speech. The main reason for this is the complete masking of the non-peak regions in the spectrum, which also carry significant information required for clean speech recognition. This leads to a development of a new approach that utilizes a soft-masking procedure instead of discarding the non-peak spectral components completely. In this approach, referred to as phase autocorrelation (PAC) approach, the noise robustness is actually addressed in the autocorrelation domain (time-domain Fourier equivalent of the power spectral domain). It uses phase (i.e., angle) variation of the signal vector over time as a measure of correlation, as opposed to the regular autocorrelation which uses dot product. This alternative measure of autocorrelation is referred to as PAC, and is motivated by the fact that angle gets less disturbed by the additive disturbances than the dot product. Interestingly, the use of PAC has an effect of emphasizing the peaks and smoothing out the valleys, in the spectral domain, without explicitly estimating the peak locations. PAC features exhibit improved noise robustness. However, even the soft masking strategy tends to degrade the clean speech recognition performance. This points to the fact that externally designed transfo

Nonlinear feature transformations for noise robust speech recognition

Graph Chatbot

Functional-Basis Analysis of Non-Stationary Signals in Modern Power Grids: Theory and Implementation in Embedded Systems

Spectral Estimators for High-Dimensional Matrix Inference

Random matrix methods for high-dimensional machine learning models

Functional-Basis Analysis of Non-Stationary Signals in Modern Power Grids: Theory and Implementation in Embedded Systems

Spectral Estimators for High-Dimensional Matrix Inference

Random matrix methods for high-dimensional machine learning models