Recursion in optimization

In this blog post, I aim to provide an overview of the various recursions I have encountered in convex optimization. Optimization methods often yield a sequence $\{x_t\}_{t\geq 0}$, which calls for a careful analysis of the convergence properties of these sequences.

Convergent sequence

The notions of convergent and divergent sequences are fundamental in analysis (think of the well-known Cauchy criterion). In the following discussion, I will narrow the focus to specific recursions and shed light on their convergence properties. Let's start with a particular recursion:

$$x_{t+1} = \gamma_t x_t + \varepsilon_t. \tag{1}$$

The recursion above is ubiquitous in optimization (see Polyak, Ch. 2.2.3).

Now, let’s delve into two fundamental lemmas that play a crucial role in determining the convergence of a sequence governed by Eq.(1).

Lemma 1. Let $x_t \geq 0$ and

$$x_{t+1} = \gamma_t x_{t} + \varepsilon_t,$$

where $0\leq \gamma_t < 1$, $\varepsilon_t \geq 0$, $\sum_{t=0}^{\infty} (1-\gamma_t) = \infty$, and $\varepsilon_t/(1-\gamma_t) \rightarrow 0$; then

$$x_{t+1}\to 0.$$

The condition $\frac{\varepsilon_t}{1-\gamma_t} \to 0$ says that $\varepsilon_t = o(1-\gamma_t)$.
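To see Lemma 1 in action, here is a minimal numerical sketch. The particular choices $\gamma_t = 1 - \frac{1}{t+2}$ and $\varepsilon_t = \frac{1}{(t+2)^2}$ are my own illustrative ones: they satisfy the lemma's assumptions, since $\sum_t (1-\gamma_t) = \sum_t \frac{1}{t+2} = \infty$ and $\varepsilon_t/(1-\gamma_t) = \frac{1}{t+2} \to 0$.

```python
# Minimal numerical illustration of Lemma 1 (the constants are illustrative, not from the post).

def run_lemma1(x0=5.0, T=100_000):
    x = x0
    for t in range(T):
        gamma = 1.0 - 1.0 / (t + 2)   # 0 <= gamma_t < 1, and sum(1 - gamma_t) diverges
        eps = 1.0 / (t + 2) ** 2      # eps_t / (1 - gamma_t) = 1/(t+2) -> 0
        x = gamma * x + eps           # recursion (1)
    return x

print(run_lemma1())  # tends to 0 as T grows (slowly, roughly like log(T)/T here)
```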

Lemma 2. Let $x_t \geq 0$ and

$$x_{t+1} = (1+\alpha_t) x_{t} + \varepsilon_t,$$

where $\sum_{t=0}^{\infty} \alpha_t < \infty$ and $\sum_{t=0}^\infty \varepsilon_t < \infty$; then

$$x_{t+1}\to \bar{x}.$$

Building on these two lemmas, a variety of convergent recursions can be derived, as illustrated in Franci, B.
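Similarly, a quick sketch for Lemma 2; the summable choices $\alpha_t = \varepsilon_t = \frac{1}{(t+1)^2}$ are again mine, picked only to satisfy the assumptions.

```python
# Minimal numerical illustration of Lemma 2 (illustrative constants).

def run_lemma2(x0=1.0, T=100_000):
    x = x0
    for t in range(T):
        alpha = 1.0 / (t + 1) ** 2    # summable expansion factors
        eps = 1.0 / (t + 1) ** 2      # summable perturbations
        x = (1.0 + alpha) * x + eps
    return x

print(run_lemma2())  # settles at a finite limit x_bar (not necessarily 0)
```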

Convergence rate

When dealing with a convergent sequence, a critical question arises: how many iterations are needed to obtain an $\epsilon$-approximate solution? In other words, what is the convergence rate of the sequence? Understanding the convergence rate allows us to select the most efficient method, i.e., the one requiring the fewest iterations.

To begin our analysis, let's consider the scenario where both $\gamma_t \equiv \gamma$ and $\varepsilon_t \equiv \varepsilon$ are fixed. While this violates the assumptions of Lemma 1, it provides a starting point for our exploration. Expanding the recursion (1), we obtain:

$$x_{t+1} \leq \gamma^{t+1} x_0 + \frac{\varepsilon}{1-\gamma}.$$

This indicates that $x_{t+1}$ converges to the $\frac{\varepsilon}{1-\gamma}$-neighborhood of $0$. To obtain an actual convergence guarantee, we must drive $\frac{\varepsilon}{1-\gamma}$ to $0$, as outlined in Lemma 1.
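For completeness, here is the short unrolling behind the bound above; it only uses the geometric-series estimate $\sum_{i=0}^{t}\gamma^i \leq \frac{1}{1-\gamma}$:

$$x_{t+1} = \gamma x_t + \varepsilon = \gamma^{t+1} x_0 + \varepsilon \sum_{i=0}^{t} \gamma^i \leq \gamma^{t+1} x_0 + \frac{\varepsilon}{1-\gamma}.$$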

As highlighted in Bottou, Thm. 4.6, a constant learning rate yields linear convergence of the expected objective values to a neighborhood of the optimal value. A smaller stepsize worsens the contraction constant in the rate, but it shrinks that neighborhood, bringing the iterates closer to the optimal value.

To ensure exact convergence, we instead opt for a diminishing stepsize, which leads to a sublinear convergence rate of $\mathcal{O}\left(\frac{1}{T}\right)$.

In general, we reformulate (1) as follows:

$$x_{t+1} \leq (1-\eta_t) x_t + \eta_t^2\sigma^2.$$

Theorem 1. Let $x_t \geq 0$ and

$$x_{t+1} \leq (1-\eta_t) x_t + \eta_t^2\sigma^2,$$

where $\eta_t = \frac{b}{a + t}$ with $b > 1$, $\sigma^2 > 0$, and $a \geq 0$; then

$$x_{T} \leq \frac{v}{a + T},$$

where $v=\max\left\{(a + 1)x_0, \frac{b^2\sigma^2}{b-1}\right\}$.

We prove it by induction. When $t=1$, it obviously holds. Assume that it holds for $t$; then

$$x_{t+1} \leq \frac{t + a - 1}{(t+a)^2}\,v - \frac{b - 1}{(a+t)^2}\,v + \frac{b^2\sigma^2}{(t+a)^2},$$

which, since $(b-1)v \geq b^2\sigma^2$ and $(t+a-1)(t+a+1) \leq (t+a)^2$, leads to $x_{t+1} \leq \frac{v}{t+1+a}$.
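A rough numerical sanity check of Theorem 1 (the constants $x_0=10$, $a=2$, $b=2$, $\sigma^2=1$ are hypothetical choices of mine, picked so that $b>1$):

```python
# Sanity-check Theorem 1: iterate x_{t+1} = (1 - eta_t) x_t + eta_t^2 * sigma^2
# with eta_t = b/(a+t) and verify x_T <= v/(a+T). Constants are illustrative.

def check_theorem1(x0=10.0, a=2.0, b=2.0, sigma2=1.0, T=10_000):
    v = max((a + 1) * x0, b ** 2 * sigma2 / (b - 1))
    x = x0
    for t in range(T):
        eta = b / (a + t)
        x = (1 - eta) * x + eta ** 2 * sigma2
        assert x <= v / (a + t + 1) + 1e-12, f"bound violated at t={t}"
    return x, v / (a + T)

print(check_theorem1())  # final iterate vs. the O(1/T) bound v/(a+T)
```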

Polyak, Lemma 2.2.4 gives the convergence rate of the recursion

$$x_{t+1} = \left(1-\frac{c}{t}\right)x_t + \frac{d}{t^{p+1}},$$

$$\begin{matrix} x_t &\leq& d(c-p)^{-1}t^{-p} + o(t^{-p}), & c > p,\\ x_t &=& \mathcal{O}(t^{-c}\log t), & p = c,\\ x_t &=& \mathcal{O}(t^{-c}), & p > c. \end{matrix}$$
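The sketch below simulates this recursion in the three regimes (starting from $t=1$ so that $c/t$ is well defined); the constants $c$, $p$, $d$ are hypothetical and only meant to make the different decay slopes visible.

```python
# Illustrate the three regimes of Polyak's recursion x_{t+1} = (1 - c/t) x_t + d/t^(p+1).
import math

def polyak_recursion(c, p, d=1.0, x1=1.0, T=10**6):
    x = x1
    for t in range(1, T):
        x = (1 - c / t) * x + d / t ** (p + 1)
    return x

for c, p in [(2.0, 1.0), (1.0, 1.0), (0.5, 1.0)]:   # c > p, c = p, c < p
    xT = polyak_recursion(c, p)
    slope = -math.log(xT) / math.log(10**6)          # rough log-log decay exponent
    print(f"c={c}, p={p}: x_T = {xT:.2e}, decay exponent ~ {slope:.2f}")
```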

As shown in Bottou, Thm. 5.1, when $\varepsilon_t$ decreases geometrically, the recursion converges at a linear rate even with a constant stepsize; this technique is often referred to as dynamic sampling.
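A small illustration of this regime, with hypothetical constants of my own: a fixed contraction factor $\gamma$ together with a geometrically decaying $\varepsilon_t = \varepsilon_0 \rho^t$ makes $x_t$ itself decay geometrically, roughly at rate $\max(\gamma, \rho)$.

```python
# Constant gamma + geometrically decaying eps_t => geometric (linear) convergence of x_t.

def run_geometric(x0=1.0, gamma=0.9, eps0=1.0, rho=0.8, T=200):
    x = x0
    for t in range(T):
        x = gamma * x + eps0 * rho ** t
    return x

print(run_geometric())  # on the order of max(gamma, rho)^T
```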

References