Automatic text-independent speaker verification using convolutional deep belief network
I.A. Rakhmanenko 1, A.A. Shelupanov 1, E.Y. Kostyuchenko 1

Tomsk State University of Control Systems and Radioelectronics,
prospect Lenina 40, 634050, Tomsk, Russia

DOI: 10.18287/2412-6179-CO-621

Pages: 596-605.

Full text of article: Russian language.

This paper investigates the convolutional deep belief network (CDBN) as a speech feature extractor for automatic text-independent speaker verification. It outlines the scope and problems of automatic speaker verification systems and reviews the types of modern speaker verification systems and the speech features they use. The structure and learning algorithm of convolutional deep belief networks are described. The paper proposes using speech features extracted from three layers of a trained convolutional deep belief network. The proposed features were evaluated experimentally on two speech corpora: the authors' own corpus, containing audio recordings of 50 speakers, and the TIMIT corpus, containing audio recordings of 630 speakers. The accuracy of the proposed features was assessed with several types of classifiers. Used directly, these features did not improve accuracy over traditional spectral speech features such as mel-frequency cepstral coefficients. However, using them in a classifier ensemble reduced the equal error rate to 0.21% on the 50-speaker corpus and to 0.23% on the TIMIT corpus.
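
The equal error rate (EER) quoted above is the operating point at which the false acceptance rate and the false rejection rate coincide. The following is a minimal sketch of how such a figure can be estimated from verification scores. The function name `equal_error_rate` and the toy scores are illustrative assumptions and are not taken from the paper, which does not describe its scoring code here.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Estimate the EER from two lists of verification scores.

    target_scores:   scores of genuine (same-speaker) trials
    impostor_scores: scores of impostor (different-speaker) trials
    A trial is accepted when its score is >= the threshold.
    Returns (eer, threshold) at the threshold where the false
    acceptance and false rejection rates are closest.
    """
    # Every observed score is a candidate decision threshold.
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    best = None
    for t in thresholds:
        # False rejection: genuine trials scoring below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: impostor trials scoring at or above it.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            # Report the midpoint of FAR and FRR as the EER estimate.
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Hypothetical toy scores, for illustration only.
genuine = [0.9, 0.8, 0.7, 0.35]
impostor = [0.4, 0.3, 0.2, 0.1]
eer, threshold = equal_error_rate(genuine, impostor)
print(f"EER = {eer:.2%} at threshold {threshold:.2f}")
```

With a finite set of trials the two error rates rarely match exactly, so this sketch reports their midpoint at the closest threshold. Production toolkits typically interpolate along the detection error trade-off curve instead.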

Keywords: speaker recognition, speaker verification, Gaussian mixture models, GMM-UBM system, speech features, speech processing, deep learning, neural networks, pattern recognition.

Rakhmanenko IA, Shelupanov AA, Kostyuchenko EYu. Automatic text-independent speaker verification using convolutional deep belief network. Computer Optics 2020; 44(4): 596-605. DOI: 10.18287/2412-6179-CO-621.

This work was funded under the basic part of a government project of the Ministry of Education and Science of the Russian Federation, project 8.9628.2017/8.9.


