
A new approach to training neural networks using natural gradient descent with momentum based on Dirichlet distributions
R.I. Abdulkadirov 1, P.A. Lyakhov 2

1 North-Caucasus Center for Mathematical Research, 355009, Russia, Stavropol, Pushkin str. 1;
2 North-Caucasus Federal University, 355009, Russia, Stavropol, Pushkin str. 1


DOI: 10.18287/2412-6179-CO-1147

Pages: 160-169.

Full text of article: Russian language.

In this paper, we propose a natural gradient descent algorithm with momentum based on Dirichlet distributions to speed up the training of neural networks. The approach takes into account not only the direction of the gradients but also the convexity of the minimized function, which significantly accelerates the search for extrema. We derive natural gradients based on Dirichlet distributions and incorporate the proposed approach into the error backpropagation scheme. Experimental results on image recognition and time series forecasting show that the proposed approach achieves higher accuracy and requires fewer iterations to minimize the loss functions than stochastic gradient descent, adaptive moment estimation (Adam) and the adaptive parameter-wise diagonal quasi-Newton method for nonconvex stochastic optimization (Apollo).

Keywords: pattern recognition, machine learning, optimization, Dirichlet distributions, natural gradient descent.

Citation: Abdulkadirov RI, Lyakhov PA. A new approach to training neural networks using natural gradient descent with momentum based on Dirichlet distributions. Computer Optics 2023; 47(1): 160-169. DOI: 10.18287/2412-6179-CO-1147.

Acknowledgements: The authors would like to thank the North-Caucasus Federal University for the award of funding in the contest of competitive projects of scientific groups and individual scientists of the North-Caucasus Federal University. The research in section 2 was supported by the North-Caucasus Center for Mathematical Research through the Ministry of Science and Higher Education of the Russian Federation (Project No. 075-02-2022-892). The research in section 3 was supported by the Russian Science Foundation (Project No. 21-71-00017). The research in section 4 was supported by the Russian Science Foundation (Project No. 22-71-00009).


  1. Gardner WA. Learning characteristics of stochastic-gradient-descent algorithms: A general study, analysis, and critique. Signal Proces 1984; 6(2): 113-133. DOI: 10.1016/0165-1684(84)90013-6.
  2. Loizou N, Richtárik P. Momentum and stochastic momentum for stochastic gradient, Newton, proximal point and subspace descent methods. Comput Optim Appl 2020; 77: 653-710. DOI: 10.1007/s10589-020-00220-z.
  3. Gao S, Pei Z, Zhang Y, Li T. Bearing fault diagnosis based on adaptive convolutional neural network with nesterov momentum. IEEE Sens J 2021; 21(7): 9268-9276. DOI: 10.1109/JSEN.2021.3050461.
  4. Hadgu AT, Nigam A, Diaz-Aviles E. Large-scale learning with AdaGrad on Spark. 2015 IEEE Int Conf on Big Data (Big Data) 2015: 2828-2830. DOI: 10.1109/BigData.2015.7364091.
  5. Wang Y, Liu J, Mišić J, Mišić VB, Lv S, Chang X. Assessing optimizer impact on DNN model sensitivity to adversarial examples. IEEE Access 2019; 7: 152766-152776. DOI: 10.1109/ACCESS.2019.2948658.
  6. Xu D, Zhang S, Zhang H, Mandic DP. Convergence of the RMSProp deep learning method with penalty for nonconvex optimization. Neural Netw 2021; 139: 17-23. DOI: 10.1016/j.neunet.2021.02.011.
  7. Melinte DO, Vladareanu L. Facial expressions recognition for human–robot interaction using deep convolutional neural networks with rectified Adam optimizer. Sensors 2020; 20: 2393. DOI: 10.3390/s20082393.
  8. Noh S-H. Performance comparison of CNN models using gradient flow analysis. Informatics 2021; 8: 53. DOI: 10.3390/informatics8030053.
  9. Huang Y, Zhang Y, Chambers JA. A novel Kullback–Leibler divergence minimization-based adaptive Student's t-filter. IEEE Trans Signal Process 2019; 67(20): 5417-5432. DOI: 10.1109/TSP.2019.2939079.
  10. Asperti A, Trentin M. Balancing reconstruction error and Kullback-Leibler divergence in variational autoencoders. IEEE Access 2020; 8: 199440-199448. DOI: 10.1109/ACCESS.2020.3034828.
  11. Martens J. New insights and perspectives on the natural gradient method. J Mach Learn Res 2020; 21(146): 1-76.
  12. Ma X. Apollo: An adaptive parameter-wise diagonal quasi-Newton method for nonconvex stochastic optimization. arXiv Preprint. 2021. Source: <https://arxiv.org/abs/2009.13586>.
  13. Li W, Montúfar G. Natural gradient via optimal transport. Information Geometry 2018; 1: 181-214. DOI: 10.1007/s41884-018-0015-3.
  14. Alvarez F, Bolte J, Brahic O. Hessian Riemannian gradient flows in convex programming. SIAM J Control Optim 2004; 43(2): 68-73. DOI: 10.1137/S0363012902419977.
  15. Abdulkadirov RI, Lyakhov PA. Improving extreme search with natural gradient descent using Dirichlet distribution. In Book: Tchernykh A, Alikhanov A, Babenko M, Samoylenko I, eds. Mathematics and its applications in new computer systems. Cham: Springer Nature Switzerland AG; 2022: 19-28. DOI: 10.1007/978-3-030-97020-8_3.
  16. Graf M. Regression for compositions based on a generalization of the Dirichlet distribution. Stat Methods Appl 2020; 29: 913-936. DOI: 10.1007/s10260-020-00512-y.
  17. Li Y. Goodness-of-fit tests for Dirichlet distributions with applications. PhD dissertation; 2015.
  18. Haykin SS. Neural networks: a comprehensive foundation. Prentice Hall; 1999.
  19. Aghdam HH, Heravi EJ. Guide to convolutional neural networks: A practical application to traffic-sign detection and classification. Cham: Springer International Publishing AG; 2017.

© 2009, IPSI RAS
151, Molodogvardeiskaya str., Samara, 443001, Russia; E-mail: journal@computeroptics.ru ; Tel: +7 (846) 242-41-24 (Executive secretary), +7 (846) 332-56-22 (Issuing editor), Fax: +7 (846) 332-56-20