Speaker recognition systems achieve acceptable performance in controlled laboratory conditions. In real-life environments, however, their performance degrades drastically, principally because of the mismatch between the training and testing recording conditions. The two main sources of this mismatch are the degradation introduced by background noise and the distortion produced by the transmission channel. Despite major advances in the speaker recognition field, no optimal solution to this problem has yet been found. This thesis proposes new methods for speaker recognition systems that use auxiliary information to reduce the influence of background noise and transmission channel distortions. For this purpose, statistical models capable of taking several sources of information into account in a unified framework are provided. While most state-of-the-art speaker recognition systems use spectral envelope features alone, other features can both complement the information about the speaker's individuality and give information about the conditions under which testing takes place. Both kinds of information can help improve the performance of the speaker recognition system. In this thesis, we focus on three auxiliary sources of information: the pitch, the voicing status, and the reliability status of the spectral envelope features. These auxiliary features are used together with the spectral envelope features. Algorithms for efficiently extracting the pitch and the voicing status from noisy telephone-quality speech are developed, and an algorithm for extracting the reliability status of the spectral envelope features is also provided.
Two new statistical modeling approaches for handling auxiliary sources of information are proposed: the state-dependent transitions (SDT) model and the state-dependent states (SDS) model. Both models take into account the temporal dependencies between features of a given source of information, as well as the dependencies between features that belong to different sources of information. Speaker identification experiments were conducted to evaluate the SDT modeling approach. Experiments were also performed to evaluate the novel pitch-dependent GMMs system, which is based on the SDS modeling approach. The results of all these experiments show that the modeling techniques proposed in this thesis are capable of capturing the key characteristics of the speech features and their dependencies. The concept of conditional independence and the use of conditional models are central to the SDT and SDS models. One of the major drawbacks of these models is that the dependencies between features are fixed. To eliminate this drawback, a more flexible approach using Bayesian networks is introduced. Bayesian networks have the ability to manage the dependencies between feature