Communications in Mathematical Sciences
Volume 18 (2020)
Number 1
Uniform-in-time weak error analysis for stochastic gradient descent algorithms via diffusion approximation
Pages: 163 – 188
DOI: https://dx.doi.org/10.4310/CMS.2020.v18.n1.a7
Authors
Abstract
Diffusion approximation provides a weak approximation of stochastic gradient descent (SGD) algorithms over a finite time horizon. In this paper, we introduce new tools, motivated by the backward error analysis of numerical stochastic differential equations, into the theoretical framework of diffusion approximation, extending the validity of the weak approximation from a finite to an infinite time horizon. The new techniques developed in this paper enable us to characterize the asymptotic behavior of constant-step-size SGD algorithms near a local minimum around which the objective functions are locally strongly convex, a goal previously unreachable within the diffusion approximation framework. Our analysis builds upon a truncated formal power expansion of the solution of a Kolmogorov equation arising from diffusion approximation, where the main technical ingredient is uniform-in-time bounds controlling the long-term behavior of the expansion coefficient functions near the local minimum. We expect these new techniques to bring new understanding of the behavior of SGD near local minima and to greatly expand the range of applicability of diffusion approximation, covering wider and deeper aspects of stochastic optimization algorithms in data science.
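To illustrate the setting the abstract describes, the following is a minimal numerical sketch, not the paper's construction: constant-step-size SGD on a hypothetical strongly convex quadratic objective, compared against the standard first-order diffusion approximation (an Ornstein–Uhlenbeck-type SDE) on the same time horizon. All parameters (mu, sigma, eta, the quadratic objective itself) are illustrative assumptions; the comparison of a test observable's expectation is the kind of weak-error quantity the paper controls uniformly in time.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D objective f(x) = 0.5 * mu * x^2 (strongly convex near x = 0),
# with stochastic gradients  mu * x + sigma * xi,  xi ~ N(0, 1).
mu, sigma = 1.0, 0.5
eta = 0.05          # constant step size
n_steps = 2000      # number of SGD iterations
n_paths = 5000      # Monte Carlo sample paths
x0 = 2.0

# Constant-step-size SGD: x_{k+1} = x_k - eta * (mu * x_k + sigma * xi_k).
x_sgd = np.full(n_paths, x0)
for _ in range(n_steps):
    x_sgd = x_sgd - eta * (mu * x_sgd + sigma * rng.standard_normal(n_paths))

# Standard first-order diffusion approximation (assumed form, common in this
# literature):  dX_t = -mu * X_t dt + sqrt(eta) * sigma dW_t,
# simulated by Euler-Maruyama with dt = eta over the horizon T = eta * n_steps.
dt = eta
x_sde = np.full(n_paths, x0)
for _ in range(n_steps):
    x_sde = (x_sde - mu * x_sde * dt
             + np.sqrt(eta) * sigma * np.sqrt(dt) * rng.standard_normal(n_paths))

# Weak error: compare expectations of a test observable, e.g. phi(x) = x^2.
# Near the strongly convex minimum both processes equilibrate at a variance of
# order eta * sigma^2 / (2 * mu), i.e. O(eta).
print("E[x^2]  SGD:", np.mean(x_sgd**2))
print("E[x^2]  SDE:", np.mean(x_sde**2))
```

Run over long horizons (large n_steps), the two printed expectations stay within O(eta) of each other; the paper's contribution is to prove that this kind of weak-error bound does not degrade as the time horizon grows.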
Keywords
stochastic gradient descent, weak error analysis, diffusion approximation, stochastic differential equation, backward Kolmogorov equation
2010 Mathematics Subject Classification
60J20, 90C15
Received 8 May 2019
Accepted 3 September 2019
Published 1 April 2020