10.1. Basic Subgradient Method

The theory of subgradients for convex functions is discussed in Subgradients. This section discusses the basic subgradient method for optimization. The primary reference for this section is [8].

Recall that a vector $g$ is a subgradient of $f$ at a point $x$ if

$$f(y) \geq f(x) + \langle y - x, g \rangle \quad \forall y \in \mathbb{R}^n.$$

10.1.1. Unconstrained Minimization

We start with the simple case of minimizing a convex function $f : \mathbb{R}^n \to \mathbb{R}$.

  1. The optimal value shall be denoted by $f^*$.

  2. We shall assume that the optimal value is finite (i.e., $f^* > -\infty$).

  3. The set of optimal minimizers shall be denoted by $X^*$.

  4. We shall further assume that $f$ attains its optimal value, so that $X^*$ is nonempty.

  5. A global minimizer will be denoted by $x^*$.

The basic subgradient method uses the following iteration.

(10.1)
$$x_{k+1} = x_k - t_k g_k.$$

In this iteration,

  1. $x_k$ is the current ($k$-th) iterate.

  2. $g_k$ is any subgradient of $f$ at $x_k$; i.e., $g_k \in \partial f(x_k)$.

  3. $t_k > 0$ is the step size for the $k$-th iteration in the negative subgradient direction $-g_k$.

  4. $x_{k+1}$ is the new ($(k+1)$-th) iterate.

  5. $\|t_k g_k\|_2$ denotes the step length.

Thus, in each iteration of the subgradient method, we take a step in the direction of a negative subgradient.

10.1.1.1. Descent Directions

Observation 10.1 (Negative subgradient need not be a descent direction)

  1. Recall from Definition 9.22 that a descent direction of $f$ at $x$ is a direction along which the directional derivative is negative; i.e., $f'(x; d) < 0$.

  2. If $d$ is a descent direction at $x$, then there exists a step size $t > 0$ such that

     $$f(x + t d) < f(x).$$

  3. If $f$ is a differentiable function and $d = -\nabla f(x)$, then

     $$f'(x; -\nabla f(x)) = -\|\nabla f(x)\|_2^2 \leq 0.$$

  4. Hence, the negative gradient is always a descent direction if the gradient is nonzero.

  5. However, for a nondifferentiable function $f$, a negative subgradient may not be a descent direction.

  6. Recall from the max formula (Theorem 8.219) that

     $$f'(x; d) = \sup\{ \langle d, g \rangle \mid g \in \partial f(x) \}.$$

  7. Let $\tilde{g}$ be the subgradient chosen at an iteration of the subgradient method.

  8. Then

     $$f'(x; -\tilde{g}) = \sup_{g \in \partial f(x)} \langle -\tilde{g}, g \rangle = -\inf_{g \in \partial f(x)} \langle \tilde{g}, g \rangle.$$

  9. It is possible that the quantity $\langle \tilde{g}, g \rangle$ is negative for some other subgradient $g$ at $x$.

  10. Hence the directional derivative along $-\tilde{g}$ may be positive.

  11. This argument shows that a negative subgradient is not necessarily a descent direction.
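As an illustration (the function and point are my own example, not from the text): for $f(x) = |x_1| + 2|x_2|$ at $x = (1, 0)$, the subdifferential is $\{1\} \times [-2, 2]$, so $\tilde{g} = (1, 2)$ is a valid subgradient, yet moving along $-\tilde{g}$ increases $f$:

```python
def f(x1, x2):
    # A nondifferentiable convex function.
    return abs(x1) + 2 * abs(x2)

x = (1.0, 0.0)
g = (1.0, 2.0)   # a valid subgradient of f at x (since 2 lies in [-2, 2])

t = 0.01         # a small positive step along -g
y = (x[0] - t * g[0], x[1] - t * g[1])

# f(y) = |0.99| + 2*|-0.02| = 1.03 > f(x) = 1,
# so -g is not a descent direction at x.
print(f(*x), f(*y))
```

No matter how small $t > 0$ is chosen, $f(x - t\tilde{g}) = (1 - t) + 4t = 1 + 3t > f(x)$.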

10.1.1.2. Tracking of the Best Estimate

  1. Since the negative subgradient $-g_k$ may not be a descent direction, it is quite possible that $f(x_{k+1}) > f(x_k)$.

  2. Even if the selected $-g_k$ is a descent direction at $x_k$, the step size may be such that $f(x_{k+1}) > f(x_k)$.

  3. This happens because typically no line search is invoked to select an appropriate step size in a subgradient method.

  4. Since the negative subgradient itself may not be a descent direction, running a line search would be futile anyway.

  5. The alternative is to track the best estimate of the optimal value so far using the rule

     $$f_{\text{best}}^k = \min\{ f_{\text{best}}^{k-1}, f(x_k) \}.$$

  6. If the best estimate has decreased, we also update the estimate of the minimizer: $x_{\text{best}}^k = x_k$.

  7. It is easy to see that

     $$f_{\text{best}}^k = \min\{ f(x_1), \dots, f(x_k) \}.$$

  8. We can see that the sequence $\{ f_{\text{best}}^k \}$ is nonincreasing.

  9. Hence it has a limit.

  10. The success of the subgradient method depends on whether this limit equals the optimal value.

10.1.1.3. Step Size Selection

In a gradient descent method, one typically uses a line search to select the step size; hence the step size depends on the current iterate and the function values in its neighborhood.

In a subgradient method, step size selection is different. Typically, the step sizes (or step lengths) are determined beforehand and do not depend on data computed during the execution of the algorithm. Following are some of the common methods for step size selection.

  1. Constant step size. $t_k = t$ for some constant $t > 0$ independent of $k$.

  2. Constant step length.

     $$t_k = \frac{c}{\|g_k\|_2}$$

     where $c > 0$ is a predefined step length. Note that with this choice of step size, we have

     $$\|x_{k+1} - x_k\|_2 = \|t_k g_k\|_2 = c.$$

  3. Square summable but not summable. We choose step sizes that satisfy the constraints

     $$t_k > 0 \;\forall k, \quad \sum_{k=1}^{\infty} t_k^2 < \infty, \quad \sum_{k=1}^{\infty} t_k = \infty.$$

     As an example, fix some $a > 0$ and $b \geq 0$ and let

     $$t_k = \frac{a}{b + k}.$$
  4. Nonsummable diminishing. We choose step sizes that satisfy the constraints

     $$t_k > 0 \;\forall k, \quad \lim_{k \to \infty} t_k = 0, \quad \sum_{k=1}^{\infty} t_k = \infty.$$

     As an example, fix some $a > 0$ and let

     $$t_k = \frac{a}{\sqrt{k}}.$$

  5. Nonsummable diminishing step lengths. We choose step sizes as $t_k = \frac{c_k}{\|g_k\|_2}$ where

     $$c_k > 0, \quad \lim_{k \to \infty} c_k = 0, \quad \sum_{k=1}^{\infty} c_k = \infty.$$
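As a sketch (function names are my own, not from the text), the predetermined step size rules above can be written as small generators:

```python
import itertools
import math

def constant_step(t):
    # Constant step size: t_k = t for every k.
    while True:
        yield t

def square_summable(a, b=0.0):
    # Square summable but not summable: t_k = a / (b + k).
    for k in itertools.count(1):
        yield a / (b + k)

def diminishing(a):
    # Nonsummable diminishing: t_k = a / sqrt(k).
    for k in itertools.count(1):
        yield a / math.sqrt(k)

# The step-length rules depend on the current subgradient g_k,
# so they are computed inside the iteration as t_k = c / ||g_k||_2.

steps = square_summable(1.0)
print([next(steps) for _ in range(4)])  # first four steps: 1, 1/2, 1/3, 1/4
```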

10.1.1.4. The Subgradient Method

We now present the overall template for the subgradient method.

Algorithm 10.1 (The subgradient method)

Inputs

  1. $f$: function to be minimized

  2. $x_1$: initial guess for the minimizer

  3. $t_1$: initial step size

Outputs

  1. $f_{\text{best}}$: best estimate of the minimum value of $f$

  2. $x_{\text{best}}$: best estimate of the minimizer

Initialization

  1. $f_{\text{best}} = f(x_1)$.

  2. $x_{\text{best}} = x_1$.

General iteration: for $k = 1, 2, \dots$, execute the following steps

  1. Select a subgradient $g_k$ from $\partial f(x_k)$.

  2. Select a step size $t_k$.

  3. Update the minimizer estimate: $x_{k+1} = x_k - t_k g_k$.

  4. $k = k + 1$.

  5. Compute $f(x_k)$.

  6. If $f_{\text{best}}^{k-1} > f(x_k)$:

     1. Update the best estimate of the optimal value: $f_{\text{best}}^k = f(x_k)$.

     2. Update the best estimate of the minimizer: $x_{\text{best}}^k = x_k$.

  7. Otherwise, retain the current values:

     1. $f_{\text{best}}^k = f_{\text{best}}^{k-1}$.

     2. $x_{\text{best}}^k = x_{\text{best}}^{k-1}$.

  8. If a stopping criterion is met, STOP.

For a concrete implementation, we shall need the following:

  1. A way to compute f(x) at every x.

  2. A way to pick a subgradient $g \in \partial f(x)$ at every $x$.

  3. A step size selection strategy.

  4. A stopping criterion.

Note that there is no need to specify the complete subdifferential at every x. The stopping criterion will be developed in the following as part of the convergence analysis of the algorithm.
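A minimal sketch of Algorithm 10.1 (the function names and the test problem are my own, not from the text), minimizing the nondifferentiable convex function $f(x) = \|x\|_1$ with a diminishing step size rule:

```python
import math

def subgradient_method(f, subgrad, x1, step_size, num_iters):
    """Basic subgradient method, tracking the best iterate seen so far."""
    x = list(x1)
    f_best, x_best = f(x), list(x)
    for k in range(1, num_iters + 1):
        g = subgrad(x)            # any subgradient from the subdifferential at x
        t = step_size(k, g)       # predetermined step size rule
        x = [xi - t * gi for xi, gi in zip(x, g)]
        fx = f(x)
        if fx < f_best:           # track the best estimate so far
            f_best, x_best = fx, list(x)
    return f_best, x_best

# Example: f(x) = ||x||_1; the componentwise sign of x is a valid subgradient.
f = lambda x: sum(abs(xi) for xi in x)
subgrad = lambda x: [(xi > 0) - (xi < 0) for xi in x]

# Nonsummable diminishing rule t_k = 1 / sqrt(k).
step = lambda k, g: 1.0 / math.sqrt(k)

f_best, x_best = subgradient_method(f, subgrad, [3.0, -4.0], step, 2000)
print(f_best)  # close to the optimal value f* = 0
```

The stopping criterion here is simply a fixed iteration budget; the convergence analysis below justifies this choice for the diminishing rule.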

10.1.2. Convergence Analysis

Our goal is to show that $f_{\text{best}}^k \downarrow f^*$. Towards this, we will establish a suboptimality bound on the estimate of the optimal value after $k$ iterations.

We shall make the following assumptions in the analysis.

  1. $f$ is a real valued convex function.

  2. The optimal value $f^*$ is finite and a minimizer $x^*$ exists.

  3. The norms of the subgradients are bounded; i.e., there exists $G > 0$ such that

     $$\|g\|_2 \leq G \quad \forall g \in \partial f(x) \quad \forall x.$$

     Recall from Theorem 8.232 that if a convex function is Lipschitz continuous, then its subgradients are bounded.

  4. A number $R$ satisfying

     $$R \geq \|x^* - x_1\|_2$$

     for some $x^* \in X^*$ is known beforehand. $R$ can be interpreted as an upper bound on the distance of the initial point to the set of optimal minimizers:

     $$R \geq \text{dist}(x_1, X^*).$$

Theorem 10.1 (Suboptimality bound for the subgradient method)

Let $\|x_1 - x^*\|_2 \leq R$. After $k$ iterations of the subgradient method, we have

(10.2)
$$f_{\text{best}}^k - f^* \leq \frac{R^2 + \sum_{i=1}^k t_i^2 \|g_i\|_2^2}{2 \sum_{i=1}^k t_i}.$$

If the subgradients of $f$ are bounded as $\|g\|_2 \leq G$ for all $g \in \partial f(x)$ for every $x$, then this simplifies to

(10.3)
$$f_{\text{best}}^k - f^* \leq \frac{R^2 + G^2 \sum_{i=1}^k t_i^2}{2 \sum_{i=1}^k t_i}.$$

Proof. Let $x^*$ be a minimizer of $f$. From the subgradient inequality, we have

$$f^* = f(x^*) \geq f(x_k) + \langle x^* - x_k, g_k \rangle.$$

Hence

$$\langle x^* - x_k, g_k \rangle \leq f^* - f(x_k).$$

We first develop an upper bound on the distance between the $(k+1)$-th iterate and a minimizer:

$$\begin{aligned}
\|x_{k+1} - x^*\|_2^2 &= \|x_k - t_k g_k - x^*\|_2^2 \\
&= \|(x_k - x^*) - t_k g_k\|_2^2 \\
&= \|x_k - x^*\|_2^2 - 2 t_k \langle x_k - x^*, g_k \rangle + t_k^2 \|g_k\|_2^2 \\
&\leq \|x_k - x^*\|_2^2 - 2 t_k (f(x_k) - f^*) + t_k^2 \|g_k\|_2^2.
\end{aligned}$$

Applying this inequality recursively, we get

$$\|x_{k+1} - x^*\|_2^2 \leq \|x_1 - x^*\|_2^2 - 2 \sum_{i=1}^k t_i (f(x_i) - f^*) + \sum_{i=1}^k t_i^2 \|g_i\|_2^2.$$

Using the facts that $\|x_{k+1} - x^*\|_2^2 \geq 0$ and $\|x_1 - x^*\|_2 \leq R$, we have

$$2 \sum_{i=1}^k t_i (f(x_i) - f^*) \leq R^2 + \sum_{i=1}^k t_i^2 \|g_i\|_2^2.$$

On the L.H.S., we can see that

$$\sum_{i=1}^k t_i (f(x_i) - f^*) \geq \left( \sum_{i=1}^k t_i \right) \min_{i=1,\dots,k} (f(x_i) - f^*) = \left( \sum_{i=1}^k t_i \right) (f_{\text{best}}^k - f^*).$$

Combining this with the previous inequality, we get

$$2 \left( \sum_{i=1}^k t_i \right) (f_{\text{best}}^k - f^*) \leq R^2 + \sum_{i=1}^k t_i^2 \|g_i\|_2^2.$$

This can be rewritten as

$$f_{\text{best}}^k - f^* \leq \frac{R^2 + \sum_{i=1}^k t_i^2 \|g_i\|_2^2}{2 \sum_{i=1}^k t_i}$$

as desired.

Applying the upper bound $\|g\|_2 \leq G$ on the subgradients, we have

$$f_{\text{best}}^k - f^* \leq \frac{R^2 + G^2 \sum_{i=1}^k t_i^2}{2 \sum_{i=1}^k t_i}$$

as desired.

Based on this result, we can provide upper bounds on the estimation error for different step size selection strategies.

10.1.2.1. Constant Step Size

Corollary 10.1 (Convergence of the subgradient method with constant step size)

If $t_k = t$ for all $k$, then we have

$$f_{\text{best}}^k - f^* \leq \frac{R^2 + G^2 t^2 k}{2 t k}.$$

The subgradient method converges to within $G^2 t / 2$ of the optimal value $f^*$. In other words,

$$\lim_{k \to \infty} f_{\text{best}}^k - f^* \leq \frac{G^2 t}{2}.$$

Also, $f_{\text{best}}^k - f^* \leq G^2 t$ within at most $R^2 / (G^2 t^2)$ steps.

Proof. Putting $t_k = t$ in (10.3), we obtain the desired upper bound. Taking the limit $k \to \infty$, we see that

$$\lim_{k \to \infty} (f_{\text{best}}^k - f^*) \leq \frac{G^2 t}{2}.$$

Now,

$$\frac{R^2 + G^2 t^2 k}{2 t k} \leq G^2 t \iff R^2 + G^2 t^2 k \leq 2 G^2 t^2 k \iff R^2 \leq G^2 t^2 k \iff k \geq \frac{R^2}{G^2 t^2}.$$

Since $\{ f_{\text{best}}^k \}$ is a nonincreasing sequence, for all $k \geq R^2 / (G^2 t^2)$ we must have

$$f_{\text{best}}^k - f^* \leq G^2 t.$$
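As a quick numerical check (the test function $f(x) = |x|$ and all parameter values are my own illustration), a constant step on a function with $G = 1$ keeps $f_{\text{best}}^k$ within the predicted $G^2 t$ after $R^2 / (G^2 t^2)$ steps:

```python
# Constant step size on f(x) = |x| (so G = 1), starting at x1 = R = 1.
t, R, G = 0.01, 1.0, 1.0
x = R                              # ||x1 - x*||_2 = R with x* = 0
f_best = abs(x)
for k in range(int(R**2 / (G**2 * t**2))):   # R^2/(G^2 t^2) = 10000 steps
    g = (x > 0) - (x < 0)          # subgradient of |x|; 0 is valid at x = 0
    x -= t * g
    f_best = min(f_best, abs(x))
print(f_best <= G**2 * t)          # True: the corollary's bound holds
```

Note that the iterates eventually oscillate in a neighborhood of the minimizer of width comparable to $t$; only the best estimate is guaranteed to be good.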

10.1.2.2. Constant Step Length

Corollary 10.2 (Convergence of the subgradient method with constant step length)

Let $c > 0$ be the constant step length for the subgradient method; i.e., $t_k = \frac{c}{\|g_k\|_2}$ for every $k$. Then we have

$$f_{\text{best}}^k - f^* \leq \frac{R^2 + c^2 k}{2 c k / G}.$$

The subgradient method converges to within $G c / 2$ of the optimal value $f^*$. In other words,

$$\lim_{k \to \infty} f_{\text{best}}^k - f^* \leq \frac{G c}{2}.$$

Proof. Putting $t_i = \frac{c}{\|g_i\|_2}$ in (10.2), we see that

$$f_{\text{best}}^k - f^* \leq \frac{R^2 + \sum_{i=1}^k c^2}{2 \sum_{i=1}^k t_i} = \frac{R^2 + c^2 k}{2 \sum_{i=1}^k t_i}.$$

Also, note that

$$t_i = \frac{c}{\|g_i\|_2} \implies t_i \geq \frac{c}{G} \implies \sum_{i=1}^k t_i \geq \frac{c k}{G} \implies \frac{1}{\sum_{i=1}^k t_i} \leq \frac{G}{c k}.$$

Putting this back, we have

$$f_{\text{best}}^k - f^* \leq \frac{R^2 + c^2 k}{2 c k / G}.$$

Taking the limit $k \to \infty$, we see that

$$\lim_{k \to \infty} (f_{\text{best}}^k - f^*) \leq \frac{G c}{2}.$$

10.1.2.3. Square Summable Step Sizes

Corollary 10.3 (Convergence of the subgradient method with square summable step sizes)

Let the step sizes satisfy the rule

$$t_k > 0 \;\forall k, \quad \sum_{k=1}^{\infty} t_k^2 < \infty, \quad \sum_{k=1}^{\infty} t_k = \infty.$$

Further, let $T = \sum_{k=1}^{\infty} t_k^2$. Then we have

$$f_{\text{best}}^k - f^* \leq \frac{R^2 + G^2 T}{2 \sum_{i=1}^k t_i}.$$

The subgradient method converges to $f^*$.

Proof. Since $\sum_{i=1}^k t_i^2 \leq T$ for every $k$, putting this into (10.3) we obtain

$$f_{\text{best}}^k - f^* \leq \frac{R^2 + G^2 T}{2 \sum_{i=1}^k t_i}.$$

Taking the limit $k \to \infty$, the denominator grows to $\infty$ while the numerator stays bounded by $R^2 + G^2 T$. Hence

$$\lim_{k \to \infty} f_{\text{best}}^k - f^* = 0.$$
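A small numerical illustration (the test function and parameters are my own, not from the text): with the square summable rule $t_k = 1/k$ on $f(x) = |x|$, the best value keeps improving toward $f^* = 0$:

```python
# Square-summable-but-not-summable rule t_k = 1/k on f(x) = |x|.
x, f_best = 5.0, 5.0
for k in range(1, 5001):
    g = (x > 0) - (x < 0)          # subgradient of |x| at x
    x -= g / k                     # t_k = 1/k
    f_best = min(f_best, abs(x))
print(f_best)  # small; f_best decreases toward f* = 0
```

Nonsummability matters here: the partial sums of $1/k$ must grow without bound for the iterates to travel the initial distance of $5$ to the minimizer.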

10.1.2.4. Diminishing Step Sizes

Corollary 10.4 (Convergence of the subgradient method with diminishing step sizes)

Let the step sizes satisfy the rule

$$t_k > 0 \;\forall k, \quad \lim_{k \to \infty} t_k = 0, \quad \sum_{k=1}^{\infty} t_k = \infty.$$

Then the subgradient method converges to $f^*$.

Proof. We shall show that the R.H.S. of (10.3) converges to $0$ as $k \to \infty$.

  1. Let $\epsilon > 0$.

  2. Since $t_k \to 0$, there exists an integer $n_1$ such that $t_i \leq \epsilon / G^2$ for all $i > n_1$.

  3. Since $\{ t_i \}$ is nonsummable, there also exists an integer $n_2$ such that

     $$\sum_{i=1}^{n_2} t_i \geq \frac{1}{\epsilon} \left( R^2 + G^2 \sum_{i=1}^{n_1} t_i^2 \right).$$

  4. Let $n = \max\{ n_1, n_2 \}$.

  5. Then, for every $k > n$, we have

     $$\frac{R^2 + G^2 \sum_{i=1}^k t_i^2}{2 \sum_{i=1}^k t_i} = \frac{R^2 + G^2 \sum_{i=1}^{n_1} t_i^2}{2 \sum_{i=1}^k t_i} + \frac{G^2 \sum_{i=n_1+1}^{k} t_i^2}{2 \sum_{i=1}^{n_1} t_i + 2 \sum_{i=n_1+1}^{k} t_i}.$$

  6. We can see that

     $$\sum_{i=1}^k t_i \geq \sum_{i=1}^{n_2} t_i \geq \frac{1}{\epsilon} \left( R^2 + G^2 \sum_{i=1}^{n_1} t_i^2 \right).$$

  7. Hence

     $$\frac{R^2 + G^2 \sum_{i=1}^{n_1} t_i^2}{2 \sum_{i=1}^k t_i} \leq \frac{\epsilon}{2}.$$

  8. For every $i > n_1$, we have $t_i \leq \epsilon / G^2$.

  9. Hence

     $$G^2 \sum_{i=n_1+1}^{k} t_i^2 \leq \epsilon \sum_{i=n_1+1}^{k} t_i.$$

  10. We can now see that

      $$\frac{G^2 \sum_{i=n_1+1}^{k} t_i^2}{2 \sum_{i=1}^{n_1} t_i + 2 \sum_{i=n_1+1}^{k} t_i} < \frac{G^2 \sum_{i=n_1+1}^{k} t_i^2}{2 \sum_{i=n_1+1}^{k} t_i} \leq \frac{\epsilon \sum_{i=n_1+1}^{k} t_i}{2 \sum_{i=n_1+1}^{k} t_i} = \frac{\epsilon}{2}.$$

  11. Combining, we have

      $$\frac{R^2 + G^2 \sum_{i=1}^k t_i^2}{2 \sum_{i=1}^k t_i} < \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon$$

      for every $k > n$.

  12. Thus, for every $\epsilon > 0$, there exists $n$ such that for all $k > n$ we have

      $$\frac{R^2 + G^2 \sum_{i=1}^k t_i^2}{2 \sum_{i=1}^k t_i} < \epsilon.$$

  13. Hence

      $$\lim_{k \to \infty} \frac{R^2 + G^2 \sum_{i=1}^k t_i^2}{2 \sum_{i=1}^k t_i} = 0.$$

  14. Hence the subgradient method converges.

10.1.2.5. Diminishing Step Lengths

Corollary 10.5 (Convergence of the subgradient method with diminishing step lengths)

Let the step sizes $t_k = \frac{c_k}{\|g_k\|_2}$ be chosen such that

$$c_k > 0, \quad \lim_{k \to \infty} c_k = 0, \quad \sum_{k=1}^{\infty} c_k = \infty$$

where $c_k$ is the step length for the $k$-th iteration. Then the subgradient method converges to $f^*$.

Proof. From (10.2), we have

$$f_{\text{best}}^k - f^* \leq \frac{R^2 + \sum_{i=1}^k t_i^2 \|g_i\|_2^2}{2 \sum_{i=1}^k t_i} = \frac{R^2 + \sum_{i=1}^k c_i^2}{2 \sum_{i=1}^k t_i}.$$

  1. We have

     $$\sum_{i=1}^k t_i = \sum_{i=1}^k \frac{c_i}{\|g_i\|_2} \geq \sum_{i=1}^k \frac{c_i}{G}.$$

  2. Hence we have

     $$f_{\text{best}}^k - f^* \leq \frac{R^2 + \sum_{i=1}^k c_i^2}{(2/G) \sum_{i=1}^k c_i}.$$

  3. Following an argument similar to the proof of Corollary 10.4, the R.H.S. converges to $0$ as $k \to \infty$.

  4. Hence

     $$\lim_{k \to \infty} f_{\text{best}}^k - f^* = 0.$$

10.1.2.6. On the Suboptimality Bound

If we run the subgradient method for $k$ iterations, we get the suboptimality bound (10.3):

$$f_{\text{best}}^k - f^* \leq \frac{R^2 + G^2 \sum_{i=1}^k t_i^2}{2 \sum_{i=1}^k t_i}.$$

A natural question is how tight this bound can be made by an appropriate selection of the step sizes $t_1, \dots, t_k$.

  1. Note that the R.H.S. is a convex and symmetric function of $t_1, \dots, t_k$.

  2. Hence, at the optimum, we must have $t_1 = \dots = t_k$.

  3. Let $t = t_1 = \dots = t_k$.

  4. Then the suboptimality bound reduces to

     $$\frac{R^2 + G^2 k t^2}{2 k t}.$$

  5. This is minimized at $t = \frac{R}{G \sqrt{k}}$.

  6. In other words, the finite sequence of step sizes $t_1, \dots, t_k$ that minimizes the suboptimality bound (10.3) after exactly $k$ iterations is given by

     $$t_i = \frac{R}{G \sqrt{k}} \quad \forall i = 1, \dots, k.$$

  7. Note that this choice of step sizes depends on the number of iterations $k$.

  8. This choice of constant step size gives the bound

     $$f_{\text{best}}^k - f^* \leq \frac{R G}{\sqrt{k}}.$$
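A quick sketch (variable names are my own) checking numerically that the constant step $t = R/(G\sqrt{k})$ minimizes the bound and attains $RG/\sqrt{k}$:

```python
import math

def bound(t, R, G, k):
    # Suboptimality bound (10.3) with constant step size t over k iterations.
    return (R**2 + G**2 * k * t**2) / (2 * k * t)

R, G, k = 1.0, 2.0, 100
t_opt = R / (G * math.sqrt(k))   # optimal constant step size
print(bound(t_opt, R, G, k))     # R*G/sqrt(k) = 0.2 (up to floating point)

# Any perturbation of the step size gives a worse bound.
print(bound(1.1 * t_opt, R, G, k) > bound(t_opt, R, G, k))  # True
```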

Theorem 10.2 (A bound on the suboptimality bound)

Let $t_1, \dots, t_k$ be any choice of step sizes for the subgradient method for $k$ iterations. Then we must have

$$\frac{R^2 + G^2 \sum_{i=1}^k t_i^2}{2 \sum_{i=1}^k t_i} \geq \frac{R G}{\sqrt{k}}$$

where the L.H.S. is the step-size-selection dependent suboptimality bound on $f_{\text{best}}^k - f^*$.

Accordingly, the number of steps required to achieve a guaranteed accuracy of $f_{\text{best}}^k - f^* \leq \epsilon$ for some $\epsilon > 0$ is at least $(R G / \epsilon)^2$ as per the suboptimality bound (10.3), irrespective of the step size selection.

Proof. We have already shown that the suboptimality bound is minimized by the constant step size selection $t_i = \frac{R}{G \sqrt{k}}$ for every $i = 1, \dots, k$, and that the minimum value of the bound is $\frac{R G}{\sqrt{k}}$.

  1. Let $\epsilon > 0$ be the required guaranteed accuracy.

  2. Then $\epsilon \geq \frac{R G}{\sqrt{k}}$ since $\frac{R G}{\sqrt{k}}$ is the minimum guaranteed suboptimality bound after $k$ iterations as per (10.3).

  3. Hence we have $\sqrt{k} \geq \frac{R G}{\epsilon}$.

  4. Hence we have $k \geq \left( \frac{R G}{\epsilon} \right)^2$.

We note that the suboptimality bound of $\epsilon$ can be guaranteed in $k$ iterations only if we use the constant step sizes $t = \frac{R}{G \sqrt{k}}$.

Observation 10.2 (Interpretation of $R G$)

Initial uncertainty

  1. Assume that $f$ is Lipschitz continuous with constant $G$.

  2. By Lipschitz continuity, we have

     $$f(x_1) - f^* \leq G \|x_1 - x^*\|_2 \leq R G.$$

  3. Hence $R G$ is an estimate of the initial uncertainty in $f^*$.

Reduction in uncertainty

  1. If our goal is to achieve $f_{\text{best}}^k - f^* \leq \epsilon$, then $\epsilon$ is an estimate of the final uncertainty in $f^*$.

  2. Hence $R G / \epsilon$ is the ratio of the initial uncertainty in $f^*$ to the final uncertainty in $f^*$.

  3. By Theorem 10.2, we require a minimum of $(R G / \epsilon)^2$ iterations to achieve this reduction in uncertainty.

This analysis shows that the subgradient method is very slow. To achieve a 1000x reduction in the uncertainty of $f^*$, we need at least a million iterations.
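The closing claim is simple arithmetic; as a trivial sketch (function name mine):

```python
# Minimum iterations to shrink the uncertainty RG down to epsilon,
# per Theorem 10.2: k >= (RG / epsilon)^2.
def min_iterations(uncertainty_ratio):
    return uncertainty_ratio ** 2

print(min_iterations(1000))  # a 1000x reduction needs 1,000,000 iterations
```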