Gradient Descent's Last Iterate is Often (slightly) Suboptimal

By Guy Kornowski, Ohad Shamir

ArXiv · AI/CL/LG · Apr 15, 2026

The paper proves that, without prior knowledge of the time horizon T, no stepsize sequence can guarantee the optimal error for the last iterate of SGD, or even of noiseless GD, settling a conjecture of Jain et al. [2019].

Categories: Research

Excerpt

We consider the well-studied setting of minimizing a convex Lipschitz function using either gradient descent (GD) or its stochastic variant (SGD), and examine the last iterate convergence. By now, it is known that standard stepsize choices lead to a last iterate convergence rate of $\log T/\sqrt{T}$ after $T$ steps. A breakthrough result of Jain et al. [2019] recovered the optimal $1/\sqrt{T}$ rate by constructing a non-standard stepsize sequence. However, this sequence requires choosing $T$ in advance, as opposed to common stepsize schedules which apply for any time horizon. Moreover, Jain et al. conjectured that without prior knowledge of $T$, no stepsize sequence can ensure the optimal error for SGD's last iterate, a claim which so far remained unproven. We prove this conjecture, and in fact show that even in the noiseless case of GD, it is impossible to avoid an excess poly-log factor in $T$ when considering an anytime last iterate guarantee. Our proof further suggests that such (slightly) suboptimal stopping times are unavoidably common.
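To make the setting concrete, here is a minimal sketch of subgradient descent with an "anytime" stepsize schedule, one that does not depend on the horizon $T$. Everything in it is an illustrative assumption rather than something taken from the paper: the objective $f(x) = |x|$ (convex and 1-Lipschitz, with minimum value 0), the starting point, the textbook schedule $\eta_t = 1/\sqrt{t}$, and all function names. It implements neither the stepsize sequence of Jain et al. nor the lower-bound construction; it only shows how the last iterate and the averaged iterate are computed so that their errors can be compared.

```python
import math


def subgradient_descent(subgrad, x0, stepsize, num_steps):
    """Run x_{t+1} = x_t - stepsize(t) * subgrad(x_t); return all iterates."""
    xs = [x0]
    for t in range(1, num_steps + 1):
        x = xs[-1]
        xs.append(x - stepsize(t) * subgrad(x))
    return xs


def f(x):
    # Illustrative objective (an assumption): convex, 1-Lipschitz, min value 0.
    return abs(x)


def subgrad_f(x):
    # A valid subgradient of |x|: sign(x), choosing 0 at x = 0.
    return (x > 0) - (x < 0)


def anytime_stepsize(t):
    # Textbook anytime schedule (an assumption): depends on t only, not on T.
    return 1.0 / math.sqrt(t)


def compare(num_steps, x0=0.5):
    """Return (last-iterate error, averaged-iterate error) after num_steps."""
    xs = subgradient_descent(subgrad_f, x0, anytime_stepsize, num_steps)
    last_error = f(xs[-1])              # the minimum value of f is 0
    avg_error = f(sum(xs) / len(xs))    # error of the uniform average
    return last_error, avg_error


if __name__ == "__main__":
    for T in (10, 100, 1000, 10000):
        last_error, avg_error = compare(T)
        print(f"T={T:>6}  last={last_error:.5f}  avg={avg_error:.5f}")
```

Because `anytime_stepsize` never sees `num_steps`, the same run can be stopped at any time, which is the kind of guarantee the abstract calls an anytime last iterate guarantee. A one-dimensional toy problem like this is not expected to exhibit the extra poly-log factor; it is meant only to fix notation.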

Read at source: https://arxiv.org/abs/2604.13870v1