8.18. Smoothness

The primary reference for this section is [6].

8.18.1. L-Smooth Functions

Definition 8.80 (L-smooth functions)

For some L ≥ 0, a function f : V → (−∞, ∞] is called L-smooth over a set D ⊆ V if it is differentiable over D and satisfies

‖∇f(x) − ∇f(y)‖_∗ ≤ L‖x − y‖ for every x, y ∈ D.

The constant L is called the smoothness parameter.

  • Since f is differentiable over D, we must have D ⊆ int dom f.

  • If f is L-smooth over the entire V, we simply say that f is L-smooth.

  • L-smooth functions are also known as functions with Lipschitz gradient with constant L.

  • The class of functions which are L-smooth over a set D is denoted by C^{1,1}_L(D).

  • When D = V, the class is simply denoted C^{1,1}_L.

  • The class of functions which are L-smooth for some L ≥ 0 (where L may not be known) is denoted by C^{1,1}.

  • By definition, if a function is L₁-smooth, then it is L₂-smooth for every L₂ ≥ L₁.

  • Thus, it is often useful to identify the smallest possible value of L for which a function is L-smooth.

Example 8.82 (Zero smoothness of affine functions)

Let b ∈ V∗ and c ∈ R. Let f : V → R be given by:

f(x) = ⟨x, b⟩ + c.

Then, f is 0-smooth.

To see this, we note that ∇f(x) = b. Consequently,

‖∇f(x) − ∇f(y)‖_∗ = ‖b − b‖_∗ = ‖0‖_∗ = 0 ≤ 0 · ‖x − y‖.

Theorem 8.263 (Smoothness of quadratic functions)

Let A ∈ S^n, b ∈ R^n and c ∈ R. Assume that R^n is endowed with the ℓ_p-norm for some 1 ≤ p ≤ ∞. Let f : R^n → R be given by:

f(x) = (1/2) xᵀAx + bᵀx + c.

Then, f is L-smooth with L = ‖A‖_{p,q}, where q ∈ [1, ∞] is the conjugate exponent satisfying

1/p + 1/q = 1

and ‖A‖_{p,q} is the induced norm given by

‖A‖_{p,q} = sup { ‖Ax‖_q | ‖x‖_p ≤ 1 }.

Proof. We note that the dual norm of the ℓ_p norm is the ℓ_q norm. Now,

‖∇f(x) − ∇f(y)‖_∗ = ‖∇f(x) − ∇f(y)‖_q = ‖Ax + b − Ay − b‖_q = ‖Ax − Ay‖_q ≤ ‖A‖_{p,q} ‖x − y‖_p.

Thus, f is ‖A‖_{p,q}-smooth.

We next show that ‖A‖_{p,q} is the smallest smoothness parameter for f.

  1. Assume that f is L-smooth for some L.

  2. By the definition of ‖A‖_{p,q} (the supremum over the compact set ‖x‖_p ≤ 1 is attained), there exists a vector x̃ such that ‖x̃‖_p = 1 and

    ‖Ax̃‖_q = ‖A‖_{p,q} ‖x̃‖_p = ‖A‖_{p,q}.
  3. Then,

    ‖A‖_{p,q} = ‖Ax̃‖_q = ‖Ax̃ + b − A0 − b‖_q = ‖∇f(x̃) − ∇f(0)‖_q ≤ L‖x̃ − 0‖_p = L.
  4. Thus, ‖A‖_{p,q} ≤ L.

Thus, ‖A‖_{p,q} is indeed the smallest smoothness parameter for f.
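
As a quick numerical companion (not part of the original text), the sketch below checks the claim for the Euclidean case p = q = 2, where ‖A‖_{2,2} is the spectral norm; the NumPy usage, seed and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
A = (A + A.T) / 2                  # symmetric A in S^n
b = rng.standard_normal(n)
grad = lambda x: A @ x + b         # gradient of f(x) = x^T A x / 2 + b^T x + c
L = np.linalg.norm(A, 2)           # induced (2,2)-norm = spectral norm

# the ratio of gradient difference to point difference should never exceed L
ratios = []
for _ in range(1000):
    x, y = rng.standard_normal(n), rng.standard_normal(n)
    ratios.append(np.linalg.norm(grad(x) - grad(y)) / np.linalg.norm(x - y))
print(max(ratios) <= L + 1e-12)    # True
```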

8.18.1.1. Descent Lemma

Theorem 8.264 (Descent lemma)

Let f : V → (−∞, ∞] be L-smooth for some L ≥ 0 over some convex set D. Then for any x, y ∈ D,

f(y) ≤ f(x) + ⟨y − x, ∇f(x)⟩ + (L/2)‖x − y‖².

Proof. We proceed as follows:

  1. By the fundamental theorem of calculus,

    f(y) − f(x) = ∫₀¹ ⟨y − x, ∇f(x + t(y − x))⟩ dt.
  2. By adding and subtracting ⟨y − x, ∇f(x)⟩, we get:

    f(y) − f(x) = ⟨y − x, ∇f(x)⟩ + ∫₀¹ ⟨y − x, ∇f(x + t(y − x)) − ∇f(x)⟩ dt.
  3. This gives us

    |f(y) − f(x) − ⟨y − x, ∇f(x)⟩|
    = |∫₀¹ ⟨y − x, ∇f(x + t(y − x)) − ∇f(x)⟩ dt|
    ≤ ∫₀¹ |⟨y − x, ∇f(x + t(y − x)) − ∇f(x)⟩| dt
    ≤ ∫₀¹ ‖y − x‖ ‖∇f(x + t(y − x)) − ∇f(x)‖_∗ dt   (a)
    ≤ ∫₀¹ ‖y − x‖ · tL‖y − x‖ dt                     (b)
    = ∫₀¹ tL‖y − x‖² dt
    = L‖y − x‖² ∫₀¹ t dt
    = (L/2)‖y − x‖².
    • (a) is an application of the generalized Cauchy-Schwarz inequality (Theorem 4.110).

    • (b) is an application of the L-smoothness of f (Definition 8.80).

  4. Thus,

    |f(y) − f(x) − ⟨y − x, ∇f(x)⟩| ≤ (L/2)‖y − x‖²
    ⟹ f(y) − f(x) − ⟨y − x, ∇f(x)⟩ ≤ (L/2)‖y − x‖²
    ⟹ f(y) ≤ f(x) + ⟨y − x, ∇f(x)⟩ + (L/2)‖y − x‖².
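
For a concrete feel of the descent lemma (an illustrative addition, not from the source), the snippet below checks the upper bound for the scalar function f(x) = ln(1 + eˣ), which is (1/4)-smooth since f″(x) ≤ 1/4; the test points and seed are assumptions.

```python
import numpy as np

# descent lemma check for f(x) = ln(1 + e^x), an L-smooth function with L = 1/4
f = lambda x: np.logaddexp(0.0, x)          # ln(1 + e^x), numerically stable
g = lambda x: 1.0 / (1.0 + np.exp(-x))      # f'(x)
L = 0.25

rng = np.random.default_rng(1)
x, y = rng.normal(0, 3, 10000), rng.normal(0, 3, 10000)
rhs = f(x) + g(x) * (y - x) + 0.5 * L * (y - x) ** 2
print(np.all(f(y) <= rhs + 1e-12))          # True: f(y) never exceeds the bound
```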

8.18.1.2. Characterization of L-Smooth Functions

Theorem 8.265 (Characterization of L-smooth functions)

Let f : V → R be convex and differentiable over V. Let L > 0. The following claims are equivalent:

  1. f is L-smooth.

  2. f(y) ≤ f(x) + ⟨y − x, ∇f(x)⟩ + (L/2)‖x − y‖² for every x, y ∈ V.

  3. f(y) ≥ f(x) + ⟨y − x, ∇f(x)⟩ + (1/(2L))‖∇f(x) − ∇f(y)‖_∗² for every x, y ∈ V.

  4. ⟨x − y, ∇f(x) − ∇f(y)⟩ ≥ (1/L)‖∇f(x) − ∇f(y)‖_∗² for every x, y ∈ V.

  5. f(tx + (1 − t)y) ≥ tf(x) + (1 − t)f(y) − (L/2)t(1 − t)‖x − y‖² for every x, y ∈ V and t ∈ [0, 1].

Proof. (1) ⟹ (2). This is a direct implication of the descent lemma (Theorem 8.264).

(2) ⟹ (3)

  1. We are given that (2) is satisfied.

  2. If ∇f(x) = ∇f(y), then the inequality is trivial due to the convexity of f. Hence, we consider the case where ∇f(x) ≠ ∇f(y).

  3. Fix an x ∈ V.

  4. Consider the function g_x : V → R given by

    g_x(y) = f(y) − f(x) − ⟨y − x, ∇f(x)⟩.
  5. Then,

    ∇g_x(y) = ∇f(y) − ∇f(x).
  6. By the hypothesis in property (2), for any z ∈ V,

    f(z) ≤ f(y) + ⟨z − y, ∇f(y)⟩ + (L/2)‖z − y‖².
  7. Now,

    g_x(z) = f(z) − f(x) − ⟨z − x, ∇f(x)⟩
    ≤ f(y) + ⟨z − y, ∇f(y)⟩ + (L/2)‖z − y‖² − f(x) − ⟨z − x, ∇f(x)⟩
    = f(y) − f(x) − ⟨z − x, ∇f(x)⟩ + ⟨z − y, ∇f(x)⟩ − ⟨z − y, ∇f(x)⟩ + ⟨z − y, ∇f(y)⟩ + (L/2)‖z − y‖²
    = f(y) − f(x) − ⟨y − x, ∇f(x)⟩ + ⟨z − y, ∇f(y) − ∇f(x)⟩ + (L/2)‖z − y‖²
    = g_x(y) + ⟨z − y, ∇g_x(y)⟩ + (L/2)‖z − y‖².
  8. Thus, g_x also satisfies the inequality in property (2).

  9. We note in particular that ∇g_x(x) = ∇f(x) − ∇f(x) = 0.

  10. Since g_x is convex, x is a global minimizer of g_x.

  11. In other words,

    g_x(x) ≤ g_x(z) for every z ∈ V.
  12. We can also see that g_x(x) = f(x) − f(x) − ⟨x − x, ∇f(x)⟩ = 0.

  13. Let y ∈ V.

  14. Let v ∈ V be a unit norm vector satisfying ‖∇g_x(y)‖_∗ = ⟨v, ∇g_x(y)⟩.

  15. Choose

    z = y − (‖∇g_x(y)‖_∗ / L) v.
  16. Then,

    0 = g_x(x) ≤ g_x(z) = g_x(y − (‖∇g_x(y)‖_∗ / L) v).
  17. Using property (2) on g_x(z), we get

    0 ≤ g_x(z)
    ≤ g_x(y) + ⟨z − y, ∇g_x(y)⟩ + (L/2)‖z − y‖²
    = g_x(y) − (‖∇g_x(y)‖_∗ / L) ⟨v, ∇g_x(y)⟩ + (L/2)‖(‖∇g_x(y)‖_∗ / L) v‖²
    = g_x(y) − (1/L)‖∇g_x(y)‖_∗² + (1/(2L))‖∇g_x(y)‖_∗²
    = g_x(y) − (1/(2L))‖∇g_x(y)‖_∗²
    = f(y) − f(x) − ⟨y − x, ∇f(x)⟩ − (1/(2L))‖∇f(y) − ∇f(x)‖_∗².
  18. Simplifying this, we get

    f(y) ≥ f(x) + ⟨y − x, ∇f(x)⟩ + (1/(2L))‖∇f(y) − ∇f(x)‖_∗²

    as desired.

(3) ⟹ (4)

  1. Applying property (3) to the pair (x, y) gives us:

    f(y) ≥ f(x) + ⟨y − x, ∇f(x)⟩ + (1/(2L))‖∇f(x) − ∇f(y)‖_∗².
  2. Applying property (3) to the pair (y, x) gives us:

    f(x) ≥ f(y) + ⟨x − y, ∇f(y)⟩ + (1/(2L))‖∇f(x) − ∇f(y)‖_∗².
  3. Adding the two inequalities and canceling the term f(x) + f(y) gives us

    0 ≥ ⟨x − y, ∇f(y) − ∇f(x)⟩ + (1/L)‖∇f(x) − ∇f(y)‖_∗².
  4. Rearranging, we get

    ⟨x − y, ∇f(x) − ∇f(y)⟩ ≥ (1/L)‖∇f(x) − ∇f(y)‖_∗²

    as desired.

(4) ⟹ (1)

  1. When ∇f(x) = ∇f(y), the Lipschitz condition in (1) holds trivially. Hence, we consider the case where ∇f(x) ≠ ∇f(y).

  2. By the generalized Cauchy-Schwarz inequality (Theorem 4.110),

    ⟨x − y, ∇f(x) − ∇f(y)⟩ ≤ ‖x − y‖ ‖∇f(x) − ∇f(y)‖_∗.
  3. Thus, combining with hypothesis (4), we obtain

    (1/L)‖∇f(x) − ∇f(y)‖_∗² ≤ ‖x − y‖ ‖∇f(x) − ∇f(y)‖_∗.
  4. Since ∇f(x) ≠ ∇f(y), we have ‖∇f(x) − ∇f(y)‖_∗ > 0.

  5. Canceling it from both sides, we get

    ‖∇f(x) − ∇f(y)‖_∗ ≤ L‖x − y‖

    as desired.

We have shown so far that (1), (2), (3) and (4) are equivalent statements. We are left with showing that (5) is equivalent to the other statements.

(2) ⟹ (5)

  1. Pick x, y ∈ V and t ∈ [0, 1].

  2. Let z = tx + (1 − t)y.

  3. By hypothesis (2),

    f(x) ≤ f(z) + ⟨x − z, ∇f(z)⟩ + (L/2)‖x − z‖²;
    f(y) ≤ f(z) + ⟨y − z, ∇f(z)⟩ + (L/2)‖y − z‖².
  4. Note that x − z = (1 − t)(x − y) and y − z = t(y − x).

  5. Thus, the previous two inequalities are the same as

    f(x) ≤ f(z) + (1 − t)⟨x − y, ∇f(z)⟩ + (L(1 − t)²/2)‖x − y‖²;
    f(y) ≤ f(z) + t⟨y − x, ∇f(z)⟩ + (Lt²/2)‖x − y‖².
  6. Multiplying the first inequality by t, the second by (1 − t), and adding (the inner product terms cancel), we get

    tf(x) + (1 − t)f(y) ≤ f(z) + (Lt(1 − t)/2)‖x − y‖².
  7. Rearranging, we get

    f(tx + (1 − t)y) = f(z) ≥ tf(x) + (1 − t)f(y) − (L/2)t(1 − t)‖x − y‖².

(5) ⟹ (2)

  1. Pick x, y ∈ V and t ∈ (0, 1).

  2. By the hypothesis in inequality (5),

    f(tx + (1 − t)y) ≥ tf(x) + (1 − t)f(y) − (L/2)t(1 − t)‖x − y‖².
  3. Rearranging the terms, we obtain

    (1 − t)f(y) ≤ f(tx + (1 − t)y) − tf(x) + (L/2)t(1 − t)‖x − y‖²
    ⟺ (1 − t)f(y) ≤ f(tx + (1 − t)y) − f(x) + (1 − t)f(x) + (L/2)t(1 − t)‖x − y‖²
    ⟺ f(y) ≤ f(x) + [f(tx + (1 − t)y) − f(x)]/(1 − t) + (L/2)t‖x − y‖².

    Division by (1 − t) is fine since (1 − t) ∈ (0, 1).

  4. Recalling the definition of the directional derivative (Definition 8.70):

    lim_{t→1⁻} [f(tx + (1 − t)y) − f(x)]/(1 − t)
    = lim_{s→0⁺} [f((1 − s)x + sy) − f(x)]/s
    = lim_{s→0⁺} [f(x + s(y − x)) − f(x)]/s
    = f′(x; y − x).
  5. Since the previous inequality is valid for every t ∈ (0, 1), taking the limit t → 1⁻ on the R.H.S., we obtain

    f(y) ≤ f(x) + f′(x; y − x) + (L/2)‖x − y‖².
  6. Recall from Theorem 8.203 that f′(x; y − x) = ⟨y − x, ∇f(x)⟩.

  7. Thus, we get

    f(y) ≤ f(x) + ⟨y − x, ∇f(x)⟩ + (L/2)‖x − y‖²

    as desired.
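
The following sketch (an illustrative addition; the quadratic test function, seed and NumPy usage are assumptions) spot-checks property (4), the co-coercivity inequality, for f(x) = ½xᵀAx with A ⪰ 0 and L = λ_max(A).

```python
import numpy as np

# co-coercivity (property 4) for f(x) = 1/2 x^T A x with A PSD and L = lambda_max(A)
rng = np.random.default_rng(2)
n = 4
M = rng.standard_normal((n, n))
A = M.T @ M                        # positive semidefinite, so f is convex
L = np.linalg.eigvalsh(A).max()

ok = True
for _ in range(1000):
    x, y = rng.standard_normal(n), rng.standard_normal(n)
    gx, gy = A @ x, A @ y          # gradients of f
    ok &= (x - y) @ (gx - gy) >= (gx - gy) @ (gx - gy) / L - 1e-10
print(ok)                          # True
```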

8.18.1.3. Second Order Characterization

We now restrict our attention to the vector space R^n equipped with an ℓ_p norm, p ≥ 1.

Theorem 8.266 (L-smoothness and the boundedness of the Hessian)

Let f : R^n → R be a twice continuously differentiable function over R^n. Then, for any L ≥ 0, the following claims are equivalent:

  1. f is L-smooth w.r.t. the ℓ_p-norm (p ∈ [1, ∞]).

  2. ‖∇²f(x)‖_{p,q} ≤ L for every x ∈ R^n, where q ≥ 1 satisfies 1/p + 1/q = 1.

Proof. (2) ⟹ (1)

  1. We are given that ‖∇²f(x)‖_{p,q} ≤ L for every x ∈ R^n.

  2. By the fundamental theorem of calculus,

    ∇f(y) − ∇f(x) = ∫₀¹ ∇²f(x + t(y − x))(y − x) dt = (∫₀¹ ∇²f(x + t(y − x)) dt)(y − x).
  3. Taking the (dual) norm on both sides,

    ‖∇f(y) − ∇f(x)‖_q
    = ‖(∫₀¹ ∇²f(x + t(y − x)) dt)(y − x)‖_q
    ≤ ‖∫₀¹ ∇²f(x + t(y − x)) dt‖_{p,q} ‖y − x‖_p
    ≤ (∫₀¹ ‖∇²f(x + t(y − x))‖_{p,q} dt) ‖y − x‖_p
    ≤ (∫₀¹ L dt) ‖y − x‖_p
    = L‖y − x‖_p.
  4. Thus, ‖∇f(y) − ∇f(x)‖_q ≤ L‖y − x‖_p as desired.

(1) ⟹ (2)

  1. We are given that f is L-smooth w.r.t. the ℓ_p norm.

  2. By the fundamental theorem of calculus, for any d ∈ R^n and s > 0,

    ∇f(x + sd) − ∇f(x) = ∫₀ˢ ∇²f(x + td) d dt.
  3. Taking the ℓ_q norm on both sides,

    ‖(∫₀ˢ ∇²f(x + td) dt) d‖_q = ‖∇f(x + sd) − ∇f(x)‖_q ≤ L‖x + sd − x‖_p = sL‖d‖_p.
  4. Dividing both sides by s and taking the limit s → 0⁺, we get

    ‖∇²f(x) d‖_q ≤ L‖d‖_p.
  5. Since this is valid for every d ∈ R^n, hence

    ‖∇²f(x)‖_{p,q} ≤ L.

Corollary 8.27 (L-smoothness and largest eigenvalue of Hessian)

Let f : R^n → R be a twice continuously differentiable convex function over R^n. Then f is L-smooth w.r.t. the ℓ₂-norm if and only if

λ_max(∇²f(x)) ≤ L for every x ∈ R^n.

Proof. Since f is convex, it follows that ∇²f(x) ⪰ O for every x. Thus,

‖∇²f(x)‖_{2,2} = √λ_max(∇²f(x)ᵀ∇²f(x)) = √λ_max(∇²f(x)²) = λ_max(∇²f(x)).

From Theorem 8.266, L-smoothness of f is equivalent to the condition that

λ_max(∇²f(x)) = ‖∇²f(x)‖_{2,2} ≤ L for every x ∈ R^n.
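
As a small illustration of the corollary (not part of the source; the log-cosh test function is an assumption), f(x) = Σᵢ ln cosh(xᵢ) has Hessian diag(1 − tanh²(xᵢ)) ⪯ I, so the corollary predicts 1-smoothness w.r.t. the ℓ₂-norm; the gradient-difference ratios below indeed stay below 1.

```python
import numpy as np

# f(x) = sum_i ln cosh(x_i): grad f(x) = tanh(x), Hessian = diag(1 - tanh^2(x_i)) <= I
grad = np.tanh
rng = np.random.default_rng(3)
ratios = []
for _ in range(2000):
    x, y = 2 * rng.standard_normal((2, 6))
    ratios.append(np.linalg.norm(grad(x) - grad(y)) / np.linalg.norm(x - y))
print(max(ratios) <= 1.0)          # True, consistent with 1-smoothness
```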

8.18.2. Strong Convexity

Definition 8.81 (Strong convexity)

A function f : V → (−∞, ∞] is called σ-strongly convex for σ > 0 if dom f is convex and the following holds for any x, y ∈ dom f and t ∈ [0, 1]:

(8.10)    f(tx + (1 − t)y) ≤ tf(x) + (1 − t)f(y) − (σ/2)t(1 − t)‖x − y‖².

8.18.2.1. Strong Convexity ⟹ Convexity

Strongly convex functions are convex. In fact, we have a stronger result available.

Theorem 8.267 (Strong convexity and convexity)

Assume that the ambient space V is Euclidean.

A function f : V → (−∞, ∞] is σ-strongly convex if and only if the function f(·) − (σ/2)‖·‖² is convex.

Proof. Let us define a function g : V → (−∞, ∞] as

g(x) = f(x) − (σ/2)‖x‖².

We need to show that f is σ-strongly convex if and only if g is convex.

  1. We first note that dom g = dom f.

  2. Thus, dom g is convex if and only if dom f is convex.

  3. Now, g is convex if and only if dom g = dom f is convex and for any x, y ∈ dom f and t ∈ (0, 1),

    g(tx + (1 − t)y) ≤ tg(x) + (1 − t)g(y).
  4. Now,

    g(tx + (1 − t)y) ≤ tg(x) + (1 − t)g(y)
    ⟺ f(tx + (1 − t)y) − (σ/2)‖tx + (1 − t)y‖² ≤ tf(x) + (1 − t)f(y) − (σ/2)[t‖x‖² + (1 − t)‖y‖²]
    ⟺ f(tx + (1 − t)y) ≤ tf(x) + (1 − t)f(y) + (σ/2)[‖tx + (1 − t)y‖² − t‖x‖² − (1 − t)‖y‖²].
  5. Since the norm is Euclidean, we have

    ‖tx + (1 − t)y‖² − t‖x‖² − (1 − t)‖y‖²
    = ⟨tx + (1 − t)y, tx + (1 − t)y⟩ − t‖x‖² − (1 − t)‖y‖²
    = t²‖x‖² + (1 − t)²‖y‖² + 2t(1 − t)⟨x, y⟩ − t‖x‖² − (1 − t)‖y‖²
    = −t(1 − t)‖x‖² − t(1 − t)‖y‖² + 2t(1 − t)⟨x, y⟩
    = −t(1 − t)(‖x‖² + ‖y‖² − 2⟨x, y⟩)
    = −t(1 − t)‖x − y‖².
  6. Thus, the convexity inequality for g is equivalent to

    f(tx + (1 − t)y) ≤ tf(x) + (1 − t)f(y) − (σ/2)t(1 − t)‖x − y‖²,

    which is nothing but the σ-strong convexity condition on f.
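
A quick numerical check of the definition (an illustrative addition, not from the text): f(x) = x⁴ + x² is 2-strongly convex on R, since f(x) − (2/2)x² = x⁴ is convex; the sketch verifies inequality (8.10) at random points.

```python
import numpy as np

# f(x) = x^4 + x^2 is 2-strongly convex on R: f(x) - x^2 = x^4 is convex
f = lambda x: x**4 + x**2
sigma = 2.0
rng = np.random.default_rng(4)
x, y, t = rng.uniform(-3, 3, 10000), rng.uniform(-3, 3, 10000), rng.uniform(0, 1, 10000)
lhs = f(t * x + (1 - t) * y)
rhs = t * f(x) + (1 - t) * f(y) - 0.5 * sigma * t * (1 - t) * (x - y) ** 2
print(np.all(lhs <= rhs + 1e-9))   # True: inequality (8.10) holds
```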

8.18.2.2. Quadratic Functions

Theorem 8.268 (Strong convexity of quadratic functions)

Let A ∈ S^n, b ∈ R^n and c ∈ R. Let f : R^n → R be given by:

f(x) = (1/2) xᵀAx + bᵀx + c.

Then f is σ-strongly convex if and only if A is positive definite and σ ≤ λ_min(A).

Proof. Due to Theorem 8.267, f is σ-strongly convex with σ > 0 if and only if g(x) = f(x) − (σ/2)‖x‖² is convex.

  1. We note that

    g(x) = (1/2) xᵀAx + bᵀx + c − (σ/2)‖x‖² = (1/2) xᵀ(A − σI)x + bᵀx + c.
  2. As shown in Example 8.40, g is convex if and only if A − σI ⪰ O.

  3. This is equivalent to σ ≤ λ_min(A); since σ > 0, it also forces A to be positive definite.
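
As a numerical companion (illustrative, not from the text): for a randomly generated positive definite A, taking σ = λ_min(A) makes A − σI positive semidefinite with smallest eigenvalue zero.

```python
import numpy as np

# For a positive definite A, sigma = lambda_min(A) is the largest sigma with A - sigma*I PSD
rng = np.random.default_rng(5)
M = rng.standard_normal((4, 4))
A = M.T @ M + 0.5 * np.eye(4)                            # positive definite
sigma = np.linalg.eigvalsh(A).min()                      # lambda_min(A)
print(np.linalg.eigvalsh(A - sigma * np.eye(4)).min())   # ~ 0 (up to rounding)
```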

8.18.2.3. Coerciveness

Theorem 8.269 (Strong convexity and coerciveness)

Assume that the ambient space V is Euclidean and that f : V → (−∞, ∞] is a Fréchet-differentiable function. If f is σ-strongly convex, then it is coercive.

Proof. We proceed as follows.

  1. Define

    g(x) = f(x) − (σ/2)‖x‖².
  2. By Theorem 8.267, g is convex.

  3. Since f is differentiable, g is also differentiable.

  4. Specifically, ∇g(x) = ∇f(x) − σx.

  5. Fix some x ∈ int dom f.

  6. Then ∂g(x) = {∇g(x)}.

  7. By the subgradient inequality, for any y ∈ V,

    g(y) ≥ g(x) + ⟨y − x, ∇g(x)⟩.
  8. Expanding g and ∇g:

    f(y) − (σ/2)‖y‖² ≥ f(x) − (σ/2)‖x‖² + ⟨y − x, ∇f(x) − σx⟩.
  9. Let v = ∇f(x) − σx.

  10. Rearranging terms,

    f(y) ≥ (σ/2)‖y‖² + ⟨y, v⟩ + K_x

    where K_x = f(x) − (σ/2)‖x‖² − ⟨x, v⟩.

  11. We note that the term K_x depends solely on x, which is fixed. Hence K_x is a fixed quantity.

  12. By the Cauchy-Schwarz inequality,

    ⟨y, v⟩ ≥ −‖v‖ ‖y‖.
  13. Hence

    f(y) ≥ (σ/2)‖y‖² − ‖v‖ ‖y‖ + K_x.
  14. It is easy to see that the R.H.S. goes to ∞ as ‖y‖ → ∞.

  15. Hence f is coercive.
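
The following tiny sketch (an illustrative assumption, not from the source) shows the coercive behaviour for the 1-strongly convex function f(x) = ½‖x‖² + bᵀx: its values blow up along any fixed direction as the radius grows.

```python
import numpy as np

# f(x) = 1/2 ||x||^2 + b^T x is 1-strongly convex; values grow without bound with the radius
rng = np.random.default_rng(6)
b = rng.standard_normal(5)
f = lambda x: 0.5 * x @ x + b @ x
d = rng.standard_normal(5)
d /= np.linalg.norm(d)                                    # unit direction
print([round(f(r * d), 1) for r in (1, 10, 100, 1000)])   # values blow up as r grows
```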

8.18.2.4. Sum Rule

Theorem 8.270 (Sum of strongly convex and convex functions)

Let f be σ-strongly convex and g be convex. Then f + g is σ-strongly convex.

Proof. Since both f and g are convex, their domains are convex. Hence, dom(f + g) = dom f ∩ dom g is also convex.

We further need to show that f+g satisfies (8.10).

  1. Let x, y ∈ V and t ∈ (0, 1).

  2. Since f is σ-strongly convex,

    f(tx + (1 − t)y) ≤ tf(x) + (1 − t)f(y) − (σ/2)t(1 − t)‖x − y‖².
  3. Since g is convex,

    g(tx + (1 − t)y) ≤ tg(x) + (1 − t)g(y).
  4. Then,

    (f + g)(tx + (1 − t)y) = f(tx + (1 − t)y) + g(tx + (1 − t)y)
    ≤ tf(x) + (1 − t)f(y) − (σ/2)t(1 − t)‖x − y‖² + tg(x) + (1 − t)g(y)
    = t(f + g)(x) + (1 − t)(f + g)(y) − (σ/2)t(1 − t)‖x − y‖².
  5. Thus, f + g is also σ-strongly convex.

Example 8.83 (Strong convexity of (1/2)‖·‖² + I_C)

Let V be a Euclidean space.

  1. The function (1/2)‖x‖² is 1-strongly convex due to Theorem 8.268.

  2. Let C be a convex set.

  3. Then, the indicator function I_C is convex.

  4. Due to Theorem 8.270, the function

    g(x) = (1/2)‖x‖² + I_C(x)

    is also 1-strongly convex.

8.18.2.5. First Order Characterization

Recall that dom(∂f) denotes the set of points at which f is subdifferentiable.

Theorem 8.271 (First order characterization of strong convexity)

Let f : V → (−∞, ∞] be a proper, closed and convex function. For a given σ > 0, the following statements are equivalent.

  1. f is σ-strongly convex.

  2. For every x ∈ dom(∂f), y ∈ dom f and g ∈ ∂f(x), the following holds true:

    f(y) ≥ f(x) + ⟨y − x, g⟩ + (σ/2)‖y − x‖².
  3. For any x, y ∈ dom(∂f) and g_x ∈ ∂f(x), g_y ∈ ∂f(y), the following holds true:

    ⟨x − y, g_x − g_y⟩ ≥ σ‖x − y‖².

Proof. We shall prove the equivalence of these statements in the following order: (2) ⟹ (1), (1) ⟹ (3), (3) ⟹ (2).

(2) ⟹ (1)

  1. We assume that (2) is true.

  2. Let x, y ∈ dom f and t ∈ (0, 1).

  3. We need to show that (8.10) holds for f.

  4. Since dom f is convex, its relative interior is not empty (see Theorem 8.142).

  5. Let z ∈ ri dom f.

  6. Choose some α ∈ (0, 1].

  7. Let x̃ = (1 − α)x + αz.

  8. By the line segment property (Theorem 8.143), x̃ ∈ ri dom f.

  9. Let x_t = tx̃ + (1 − t)y.

  10. Again, by the line segment property, x_t ∈ ri dom f.

  11. Since f is a proper convex function, the subdifferential of f at relative interior points is nonempty (Theorem 8.217).

  12. Thus, ∂f(x_t) ≠ ∅ and x_t ∈ dom(∂f).

  13. Take some g ∈ ∂f(x_t).

  14. By hypothesis (2),

    f(x̃) ≥ f(x_t) + ⟨x̃ − x_t, g⟩ + (σ/2)‖x̃ − x_t‖².
  15. Substituting x_t = tx̃ + (1 − t)y, we have x̃ − x_t = (1 − t)(x̃ − y). Thus,

    f(x̃) ≥ f(x_t) + (1 − t)⟨x̃ − y, g⟩ + (σ(1 − t)²/2)‖x̃ − y‖².
  16. Similarly, by hypothesis (2),

    f(y) ≥ f(x_t) + ⟨y − x_t, g⟩ + (σ/2)‖y − x_t‖².
  17. y − x_t = y − tx̃ − (1 − t)y = t(y − x̃).

  18. This gives us

    f(y) ≥ f(x_t) + t⟨y − x̃, g⟩ + (σt²/2)‖y − x̃‖².
  19. Multiplying the first inequality by t and the second one by (1 − t) and adding them together, we get

    tf(x̃) + (1 − t)f(y) ≥ f(x_t) + (σt(1 − t)/2)‖x̃ − y‖².
  20. Thus,

    f(tx̃ + (1 − t)y) = f(x_t) ≤ tf(x̃) + (1 − t)f(y) − (σt(1 − t)/2)‖x̃ − y‖².
  21. Expanding x̃,

    tx̃ + (1 − t)y = t((1 − α)x + αz) + (1 − t)y = t(1 − α)x + (1 − t)y + tαz.
  22. Define g₁(α) = f(tx̃ + (1 − t)y) = f(t(1 − α)x + (1 − t)y + tαz).

  23. Define g₂(α) = f(x̃) = f((1 − α)x + αz).

  24. Substituting these into the previous inequality, we obtain

    g₁(α) ≤ tg₂(α) + (1 − t)f(y) − (σt(1 − t)/2)‖(1 − α)x + αz − y‖².
  25. The functions g₁ and g₂ are one dimensional, proper, closed and convex functions.

  26. By Theorem 8.173, both g₁ and g₂ are continuous on their domain.

  27. Therefore, taking the limit α → 0⁺, it follows that

    g₁(0) ≤ tg₂(0) + (1 − t)f(y) − (σt(1 − t)/2)‖x − y‖².
  28. Now g₁(0) = f(tx + (1 − t)y) and g₂(0) = f(x).

  29. Thus,

    f(tx + (1 − t)y) ≤ tf(x) + (1 − t)f(y) − (σt(1 − t)/2)‖x − y‖².
  30. This establishes that f is indeed σ-strongly convex.

(1) ⟹ (3)

  1. We are given that f is σ-strongly convex.

  2. Let x, y ∈ dom(∂f).

  3. Pick any g_x ∈ ∂f(x) and g_y ∈ ∂f(y).

  4. Let t ∈ [0, 1) and denote x_t = tx + (1 − t)y.

  5. By the hypothesis,

    f(x_t) ≤ tf(x) + (1 − t)f(y) − (σt(1 − t)/2)‖x − y‖².
  6. This is the same as

    f(x_t) − f(x) ≤ (1 − t)[f(y) − f(x)] − (σt(1 − t)/2)‖x − y‖².
  7. We can see that (1 − t) ∈ (0, 1].

  8. Dividing both sides of the inequality by (1 − t), we obtain

    [f(x_t) − f(x)]/(1 − t) ≤ f(y) − f(x) − (σt/2)‖x − y‖².
  9. Since g_x ∈ ∂f(x), by the subgradient inequality,

    f(x_t) ≥ f(x) + ⟨x_t − x, g_x⟩.
  10. We can rewrite this as

    [f(x_t) − f(x)]/(1 − t) ≥ ⟨x_t − x, g_x⟩/(1 − t).
  11. Note that x_t − x = (1 − t)(y − x).

  12. Thus,

    [f(x_t) − f(x)]/(1 − t) ≥ ⟨y − x, g_x⟩.
  13. Thus,

    ⟨y − x, g_x⟩ ≤ f(y) − f(x) − (σt/2)‖x − y‖².
  14. This inequality holds for every t ∈ [0, 1).

  15. Taking the limit t → 1⁻, we obtain

    ⟨y − x, g_x⟩ ≤ f(y) − f(x) − (σ/2)‖x − y‖².
  16. Identical reasoning with the roles of x and y switched gives us

    ⟨x − y, g_y⟩ ≤ f(x) − f(y) − (σ/2)‖y − x‖².
  17. Adding these two inequalities gives us

    ⟨x − y, g_y − g_x⟩ ≤ −σ‖x − y‖².
  18. Multiplying both sides by −1 (and flipping the inequality accordingly), we get

    ⟨x − y, g_x − g_y⟩ ≥ σ‖x − y‖²

    as desired.

(3) ⟹ (2)

  1. We are given that (3) is satisfied.

  2. Let x ∈ dom(∂f), y ∈ dom f and g ∈ ∂f(x).

  3. Pick any z ∈ ri dom f.

  4. Pick some α ∈ (0, 1).

  5. Define ỹ = (1 − α)y + αz.

  6. By the line segment property, ỹ ∈ ri dom f.

  7. Define x_t = (1 − t)x + tỹ.

  8. Consider the one dimensional function

    φ(t) = f(x_t), t ∈ [0, 1].
  9. Pick any t ∈ (0, 1).

  10. Then, by the line segment principle, x_t ∈ ri dom f.

  11. Due to Theorem 8.217, ∂f(x_t) ≠ ∅ and x_t ∈ dom(∂f).

  12. Take some g_t ∈ ∂f(x_t).

  13. By the subgradient inequality,

    f(u) ≥ f(x_t) + ⟨u − x_t, g_t⟩ for every u ∈ V.
  14. In particular, for x_s = (1 − s)x + sỹ with s ∈ [0, 1], we have

    f(x_s) ≥ f(x_t) + ⟨(1 − s)x + sỹ − (1 − t)x − tỹ, g_t⟩
    ⟹ φ(s) ≥ φ(t) + ⟨(s − t)(ỹ − x), g_t⟩
    ⟹ φ(s) ≥ φ(t) + (s − t)⟨ỹ − x, g_t⟩.
  15. Since this is valid for every s, hence ⟨ỹ − x, g_t⟩ ∈ ∂φ(t).

  16. Applying the mean value theorem (Theorem 8.234),

    f(ỹ) − f(x) = φ(1) − φ(0) = ∫₀¹ ⟨ỹ − x, g_t⟩ dt.
  17. Since g ∈ ∂f(x) and g_t ∈ ∂f(x_t), applying the hypothesis (3), we get

    ⟨x_t − x, g_t − g⟩ ≥ σ‖x_t − x‖².
  18. But x_t − x = t(ỹ − x).

  19. Hence

    t⟨ỹ − x, g_t − g⟩ ≥ σt²‖ỹ − x‖².
  20. This simplifies to

    ⟨ỹ − x, g_t⟩ ≥ ⟨ỹ − x, g⟩ + σt‖ỹ − x‖².

    Dividing by t on both sides doesn't change the direction of the inequality since t > 0.

  21. Applying this inequality to the integral above,

    f(ỹ) − f(x) ≥ ∫₀¹ [⟨ỹ − x, g⟩ + σt‖ỹ − x‖²] dt.
  22. Integrating, we get

    f(ỹ) − f(x) ≥ ⟨ỹ − x, g⟩ + (σ/2)‖ỹ − x‖².
  23. Expanding ỹ, for any α ∈ (0, 1) we have

    f((1 − α)y + αz) ≥ f(x) + ⟨(1 − α)y + αz − x, g⟩ + (σ/2)‖(1 − α)y + αz − x‖².
  24. The one dimensional function h(α) = f((1 − α)y + αz) is continuous, again due to Theorem 8.173.

  25. Taking the limit α → 0⁺ on both sides, we obtain

    f(y) ≥ f(x) + ⟨y − x, g⟩ + (σ/2)‖y − x‖²,

    which is the desired result.
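
The sketch below (an illustrative addition, not from the text) spot-checks property (2) of the theorem for the 2-strongly convex function f(x) = x⁴ + x², whose subdifferential at x is the singleton {4x³ + 2x}.

```python
import numpy as np

# property (2) check for the 2-strongly convex f(x) = x^4 + x^2, with g(x) = 4x^3 + 2x
f = lambda x: x**4 + x**2
g = lambda x: 4 * x**3 + 2 * x
sigma = 2.0
rng = np.random.default_rng(7)
x, y = rng.uniform(-2, 2, 10000), rng.uniform(-2, 2, 10000)
print(np.all(f(y) >= f(x) + g(x) * (y - x) + 0.5 * sigma * (y - x) ** 2 - 1e-9))  # True
```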

8.18.2.6. Minimization

Theorem 8.272 (Existence and uniqueness of a minimizer of a closed strongly convex function)

Let f : V → (−∞, ∞] be a proper, closed and σ-strongly convex function with σ > 0. Then:

  1. f has a unique minimizer a ∈ dom f, i.e., f(x) > f(a) for every x ∈ dom f with x ≠ a.

  2. The increase in the value of f w.r.t. its minimum satisfies

    f(x) − f(a) ≥ (σ/2)‖x − a‖²

    where a ∈ dom f is the unique minimizer of f.

Proof. (1) Existence of the minimizer

  1. Since f is proper and convex, dom f is nonempty and convex.

  2. Since dom f is nonempty and convex, its relative interior is nonempty (Theorem 8.142).

  3. Pick y ∈ ri dom f.

  4. By Theorem 8.214, ∂f(y) is nonempty.

  5. Pick some g ∈ ∂f(y).

  6. Then, by property 2 of Theorem 8.271,

    f(x) ≥ f(y) + ⟨x − y, g⟩ + (σ/2)‖x − y‖²

    holds true for every x ∈ V.

  7. Let ‖·‖₂ ≜ √⟨·, ·⟩ denote the Euclidean norm associated with the inner product of the space V. This might be different from the endowed norm ‖·‖.

  8. Since all norms on a finite dimensional space are equivalent, there exists a constant C > 0 such that

    ‖z‖² ≥ C‖z‖₂²

    for every z ∈ V.

  9. Therefore,

    f(x) ≥ f(y) + ⟨x − y, g⟩ + (σC/2)‖x − y‖₂² for every x ∈ V.
  10. Completing the square, this in turn is the same as

    f(x) ≥ f(y) − (1/(2Cσ))‖g‖₂² + (Cσ/2)‖x − (y − (1/(Cσ))g)‖₂² for every x ∈ V.
  11. Let S_t denote the sublevel set {x | f(x) ≤ t}.

  12. Consider the sublevel set S_{f(y)}.

  13. Let x ∈ S_{f(y)}.

  14. Then, f(x) = f(y) − r for some r ≥ 0.

  15. But then

    f(y) − r ≥ f(y) − (1/(2Cσ))‖g‖₂² + (Cσ/2)‖x − (y − (1/(Cσ))g)‖₂².
  16. This simplifies to

    r ≤ (1/(2Cσ))‖g‖₂² − (Cσ/2)‖x − (y − (1/(Cσ))g)‖₂².
  17. Since r must be nonnegative, the R.H.S. must be nonnegative as well.

  18. Thus, we require that

    (1/(2Cσ))‖g‖₂² ≥ (Cσ/2)‖x − (y − (1/(Cσ))g)‖₂².
  19. This simplifies to

    ‖x − (y − (1/(Cσ))g)‖₂ ≤ (1/(Cσ))‖g‖₂.
  20. In other words, x must belong to the closed ℓ₂ ball

    B_{‖·‖₂}[y − (1/(Cσ))g, (1/(Cσ))‖g‖₂].
  21. Since this is valid for every x ∈ S_{f(y)}, hence

    S_{f(y)} ⊆ B_{‖·‖₂}[y − (1/(Cσ))g, (1/(Cσ))‖g‖₂].
  22. Since f is closed, all its sublevel sets are closed.

  23. Since S_{f(y)} is contained in a ball, it is bounded.

  24. Thus, S_{f(y)} is closed and bounded.

  25. Since V is finite dimensional, S_{f(y)} is compact.

  26. S_{f(y)} is also nonempty since y ∈ S_{f(y)}.

  27. Thus, the problem of minimizing f over dom f reduces to the problem of minimizing f over the nonempty compact set S_{f(y)}.

  28. Since f is closed, it is also lower semicontinuous.

  29. By Theorem 3.121, f attains a minimum on S_{f(y)} at some point a ∈ S_{f(y)}.

  30. Thus, we have established the existence of a minimizer of f at some a ∈ S_{f(y)} ⊆ dom f.

(1) Uniqueness of the minimizer

  1. To show uniqueness, assume for contradiction that u and v are two different minimizers of f with f(u) = f(v) = p∗, the optimal value.

  2. Let w = (1/2)u + (1/2)v.

  3. We must have f(w) ≥ p∗.

  4. By the strong convexity of f,

    f(w) ≤ (1/2)f(u) + (1/2)f(v) − (σ/2)(1/2)(1/2)‖u − v‖² = p∗ − (σ/8)‖u − v‖².
  5. If u ≠ v, then f(w) < p∗; a contradiction.

  6. Hence, the minimizer must be unique.

(2) Increase in the value of f

  1. Let a be the unique minimizer of f.

  2. By Fermat's optimality condition, 0 ∈ ∂f(a).

  3. Since f is σ-strongly convex, by property (2) in Theorem 8.271,

    f(x) − f(a) ≥ ⟨x − a, 0⟩ + (σ/2)‖x − a‖² = (σ/2)‖x − a‖²

    holds true for any x ∈ dom f.
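
As a numerical illustration of the quadratic growth property (not part of the text; the quadratic test function and seed are assumptions): for f(x) = ½xᵀAx + bᵀx with A ≻ 0, the unique minimizer is a = −A⁻¹b and σ = λ_min(A).

```python
import numpy as np

# growth bound f(x) - f(a) >= (sigma/2) ||x - a||^2 for a strongly convex quadratic
rng = np.random.default_rng(8)
M = rng.standard_normal((4, 4))
A = M.T @ M + 0.1 * np.eye(4)          # positive definite
b = rng.standard_normal(4)
f = lambda x: 0.5 * x @ A @ x + b @ x
a = -np.linalg.solve(A, b)             # unique minimizer
sigma = np.linalg.eigvalsh(A).min()
ok = all(f(x) - f(a) >= 0.5 * sigma * (x - a) @ (x - a) - 1e-9
         for x in rng.standard_normal((1000, 4)))
print(ok)                              # True
```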

8.18.3. Smoothness and Strong Convexity

8.18.3.1. The Conjugate Correspondence Theorem

The notions of smoothness and strong convexity are closely connected. Roughly speaking, a function is strongly convex if and only if its conjugate is smooth.

Theorem 8.273 (Conjugate correspondence theorem)

Let σ > 0. Then:

  1. If f : V → R is a (1/σ)-smooth convex function, then f∗ is σ-strongly convex w.r.t. the dual norm ‖·‖_∗.

  2. If f : V → (−∞, ∞] is a proper, closed, σ-strongly convex function, then f∗ : V∗ → R is (1/σ)-smooth.

Proof. (1) Smooth convex to strongly convex conjugate

  1. We are given that f : V → R is a (1/σ)-smooth convex function.

  2. Due to Theorem 8.239, f∗ is closed and convex.

  3. Since f is proper and convex, due to Theorem 8.240, f∗ is proper.

  4. Thus f∗ is a proper, closed and convex function.

  5. Pick any y₁, y₂ ∈ dom(∂f∗).

  6. Let v₁ ∈ ∂f∗(y₁) and v₂ ∈ ∂f∗(y₂).

  7. Since f is proper and convex, by the conjugate subgradient theorem (Theorem 8.246),

    y₁ ∈ ∂f(v₁) and y₂ ∈ ∂f(v₂).
  8. Since f is smooth, it is differentiable. Hence, due to Theorem 8.220,

    y₁ = ∇f(v₁) and y₂ = ∇f(v₂).
  9. Following the characterization of smoothness (Theorem 8.265), by its property (4) with smoothness parameter 1/σ,

    ⟨v₁ − v₂, y₁ − y₂⟩ ≥ σ‖y₁ − y₂‖_∗².
  10. Since the last inequality holds for any y₁, y₂ ∈ dom(∂f∗) and any v₁ ∈ ∂f∗(y₁), v₂ ∈ ∂f∗(y₂), following the first order characterization of strong convexity in Theorem 8.271, f∗ is a σ-strongly convex function.

(2) Strongly convex to smooth conjugate

  1. We are given that f is proper, closed and σ-strongly convex.

  2. Pick any y ∈ V∗.

  3. The conjugate is given by

    f∗(y) = sup_{x ∈ V} {⟨x, y⟩ − f(x)}.
  4. Define g(x) = f(x) − ⟨x, y⟩.

  5. We can see that

    f∗(y) = −inf_{x ∈ V} g(x).
  6. Due to the sum rule (Theorem 8.270), g is σ-strongly convex.

  7. Due to Theorem 8.272, g has a unique minimizer.

  8. Hence f∗(y) is finite.

  9. Since this is valid for any y ∈ V∗, hence dom f∗ = V∗.

  10. This justifies writing the signature of f∗ as the real valued function f∗ : V∗ → R.

  11. Let us continue with an arbitrary y.

  12. Since dom f∗ = V∗, hence y ∈ int dom f∗.

  13. Now, by the second formulation of the conjugate subgradient theorem (Corollary 8.26),

    ∂f∗(y) = argmax_{x ∈ V} {⟨x, y⟩ − f(x)}.
  14. We can see that

    ∂f∗(y) = argmin_{x ∈ V} g(x).
  15. Since g has a unique minimizer, ∂f∗(y) is a singleton.

  16. Due to Theorem 8.220, f∗ is differentiable at y.

  17. Since y is arbitrary, f∗ is differentiable over the entire V∗.

  18. We now pick two points y₁, y₂ ∈ V∗ and denote v₁ = ∇f∗(y₁), v₂ = ∇f∗(y₂).

  19. By the conjugate subgradient theorem (Theorem 8.246), this is equivalent to y₁ ∈ ∂f(v₁) and y₂ ∈ ∂f(v₂).

  20. Following the first order characterization of strong convexity in Theorem 8.271,

    ⟨v₁ − v₂, y₁ − y₂⟩ ≥ σ‖v₁ − v₂‖².
  21. In other words,

    ⟨∇f∗(y₁) − ∇f∗(y₂), y₁ − y₂⟩ ≥ σ‖∇f∗(y₁) − ∇f∗(y₂)‖².
  22. By the generalized Cauchy-Schwarz inequality (Theorem 4.110),

    ⟨∇f∗(y₁) − ∇f∗(y₂), y₁ − y₂⟩ ≤ ‖∇f∗(y₁) − ∇f∗(y₂)‖ ‖y₁ − y₂‖_∗.
  23. Thus, the previous inequality simplifies to

    ‖∇f∗(y₁) − ∇f∗(y₂)‖ ≤ (1/σ)‖y₁ − y₂‖_∗.
  24. This establishes that f∗ is (1/σ)-smooth.
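
A minimal sanity check of the correspondence (illustrative; the grid-based conjugate is an assumption): f(x) = (σ/2)x² is σ-strongly convex and its conjugate should come out as f∗(y) = y²/(2σ), which is (1/σ)-smooth.

```python
import numpy as np

# numerically approximate the conjugate of f(x) = (sigma/2) x^2 by a sup over a grid
sigma = 3.0
xs = np.linspace(-50, 50, 20001)
fx = 0.5 * sigma * xs**2
conj = lambda y: np.max(y * xs - fx)               # sup_x { x*y - f(x) } on the grid
ys = np.linspace(-5, 5, 11)
print(np.allclose([conj(y) for y in ys], ys**2 / (2 * sigma), atol=1e-3))   # True
```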

8.18.4. Examples

Example 8.84 (Smoothness of √(1 + ‖·‖₂²))

Let f : R^n → R be given by

f(x) = √(1 + ‖x‖₂²).

f is 1-smooth w.r.t. the ℓ₂-norm.

  1. Note that for any x ∈ R^n, the gradient is given by

    ∇f(x) = x / √(1 + ‖x‖₂²).
  2. The Hessian is given by

    ∇²f(x) = I / √(1 + ‖x‖₂²) − xxᵀ / (1 + ‖x‖₂²)^{3/2} ⪯ I / √(1 + ‖x‖₂²) ⪯ I.
  3. Therefore, λ_max(∇²f(x)) ≤ 1 for every x ∈ R^n.

  4. Hence, by Corollary 8.27, f is 1-smooth w.r.t. the ℓ₂-norm.
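
The snippet below (an illustrative addition, not from the text) evaluates the Hessian formula above at random points and confirms numerically that its largest eigenvalue never exceeds 1.

```python
import numpy as np

# Hessian of f(x) = sqrt(1 + ||x||^2): I/s - x x^T / s^3 with s = sqrt(1 + ||x||^2)
rng = np.random.default_rng(9)

def hessian(x):
    s = np.sqrt(1 + x @ x)
    return np.eye(len(x)) / s - np.outer(x, x) / s**3

lams = [np.linalg.eigvalsh(hessian(3 * rng.standard_normal(5))).max()
        for _ in range(1000)]
print(max(lams) <= 1.0)            # True
```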

8.18.4.1. Log-Sum-Exp

Example 8.85 (Smoothness of log-sum-exp)

Consider the log-sum-exp function f : R^n → R given by

f(x) = ln(Σ_{i=1}^n e^{x_i}).

f is 1-smooth w.r.t. the ℓ₂ and ℓ_∞ norms.

Smoothness w.r.t. the ℓ₂ norm

  1. The partial derivatives of f are

    ∂f/∂x_i (x) = e^{x_i} / Σ_{k=1}^n e^{x_k}.
  2. The second order partial derivatives are

    ∂²f/(∂x_i ∂x_j)(x) = −e^{x_i} e^{x_j} / (Σ_{k=1}^n e^{x_k})²  for i ≠ j;
    ∂²f/∂x_i²(x) = −(e^{x_i})² / (Σ_{k=1}^n e^{x_k})² + e^{x_i} / Σ_{k=1}^n e^{x_k}  for i = j.
  3. The Hessian can be written as

    ∇²f(x) = diag(w) − wwᵀ

    where w_i = e^{x_i} / Σ_{k=1}^n e^{x_k}.

  4. We can now see that

    ∇²f(x) = diag(w) − wwᵀ ⪯ diag(w) ⪯ I.
  5. Hence λ_max(∇²f(x)) ≤ 1 for every x ∈ R^n.

  6. Hence, by Corollary 8.27, f is 1-smooth w.r.t. the ℓ₂-norm.

Smoothness w.r.t. the ℓ_∞ norm

  1. We first show that for any v ∈ R^n,

    vᵀ∇²f(x)v ≤ ‖v‖_∞².
  2. To see this, we expand the L.H.S. as

    vᵀ∇²f(x)v = vᵀ(diag(w) − wwᵀ)v = vᵀdiag(w)v − (wᵀv)²
    ≤ vᵀdiag(w)v = Σ_{i=1}^n w_i v_i² ≤ ‖v‖_∞² Σ_{i=1}^n w_i = ‖v‖_∞².
  3. Since f is twice differentiable over R^n, by the linear approximation theorem (Theorem 5.5), for any x, y ∈ R^n there exists z ∈ [x, y] such that

    f(y) − f(x) = ∇f(x)ᵀ(y − x) + (1/2)(y − x)ᵀ∇²f(z)(y − x).
  4. Let v = y − x.

  5. Then from the above,

    (y − x)ᵀ∇²f(z)(y − x) ≤ ‖v‖_∞² = ‖y − x‖_∞².
  6. Putting this back into the approximation, we have

    f(y) ≤ f(x) + ∇f(x)ᵀ(y − x) + (1/2)‖y − x‖_∞².
  7. Following the characterization of smoothness (Theorem 8.265), f is indeed 1-smooth w.r.t. the ℓ_∞-norm.
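
A short numerical confirmation (an illustrative assumption) of the Hessian bound used above: with w = softmax(x), the matrix diag(w) − wwᵀ has largest eigenvalue at most 1.

```python
import numpy as np

# Hessian of log-sum-exp at x is diag(w) - w w^T with w = softmax(x)
rng = np.random.default_rng(10)
lams = []
for _ in range(1000):
    x = 4 * rng.standard_normal(6)
    e = np.exp(x - x.max())
    w = e / e.sum()                          # softmax(x)
    H = np.diag(w) - np.outer(w, w)
    lams.append(np.linalg.eigvalsh(H).max())
print(max(lams) <= 1.0)                      # True
```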

8.18.4.2. Negative Entropy

Example 8.86 (Strong convexity of negative entropy over the unit simplex)

Let f : R^n → (−∞, ∞] be given by:

f(x) ≜ Σ_{i=1}^n x_i ln x_i  if x ∈ Δ_n,  and  ∞ otherwise.

f is 1-strongly convex w.r.t. both the ℓ₁ and ℓ₂ norms.

  1. By Theorem 8.257, its conjugate is given by

    f∗(y) = ln(Σ_{j=1}^n e^{y_j}),

    which is the log-sum-exp function.

  2. By Example 8.85, the log-sum-exp function is 1-smooth w.r.t. both the ℓ₂ and ℓ_∞ norms.

  3. Hence, by the conjugate correspondence theorem (Theorem 8.273), f is 1-strongly convex w.r.t. both the ℓ₁ and ℓ₂ norms.
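
As a numerical companion to this example (an illustrative addition; the seed and the random simplex samples are assumptions), the conjugate value f∗(y) should equal the log-sum-exp of y, with the supremum over the simplex attained at x = softmax(y).

```python
import numpy as np

# conjugate of negative entropy over the simplex: sup attained at softmax(y)
rng = np.random.default_rng(11)
y = rng.standard_normal(4)
e = np.exp(y - y.max())
x = e / e.sum()                                   # softmax(y), lies in the simplex
val = x @ y - np.sum(x * np.log(x))               # <x, y> - sum_i x_i ln x_i
print(np.isclose(val, np.log(np.exp(y).sum())))   # True: matches log-sum-exp(y)

# no randomly drawn simplex point does better
r = rng.dirichlet(np.ones(4), 2000)
print(np.all(r @ y - np.sum(r * np.log(r), axis=1) <= val + 1e-9))   # True
```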