6.8. Subgaussian Distributions

In this section we review subgaussian distributions and matrices drawn from subgaussian distributions.

Examples of subgaussian distributions include

  • Gaussian distribution

  • Rademacher distribution taking values $\pm 1$

  • Any zero-mean distribution with bounded support

Definition 6.2

A random variable X is called subgaussian if there exists a constant $c > 0$ such that

$$M_X(t) = \mathbb{E}[\exp(Xt)] \leq \exp\left(\frac{c^2 t^2}{2}\right) \tag{6.1}$$

holds for all $t \in \mathbb{R}$. We use the notation $X \sim \mathrm{Sub}(c^2)$ to denote that X satisfies the constraint (6.1). We also say that X is c-subgaussian.

$\mathbb{E}[\exp(Xt)]$ is the moment generating function of X.

$\exp\left(\frac{c^2 t^2}{2}\right)$ is the moment generating function of a Gaussian random variable with variance $c^2$.

The definition means that for a subgaussian variable X, its M.G.F. is bounded by the M.G.F. of a Gaussian random variable $\mathcal{N}(0, c^2)$.

Example 6.1 (Gaussian r.v. as subgaussian r.v.)

Consider a zero-mean Gaussian random variable $X \sim \mathcal{N}(0, \sigma^2)$ with variance $\sigma^2$. Then

$$\mathbb{E}[\exp(Xt)] = \exp\left(\frac{\sigma^2 t^2}{2}\right).$$

Putting $c = \sigma$ we see that (6.1) is satisfied (with equality). Hence $X \sim \mathrm{Sub}(\sigma^2)$, i.e. X is a subgaussian r.v., or X is $\sigma$-subgaussian.

Example 6.2 (Rademacher distribution)

Consider X with

$$P_X(x) = \frac{1}{2}\delta(x - 1) + \frac{1}{2}\delta(x + 1),$$

i.e. X takes the value 1 with probability 0.5 and the value $-1$ with probability 0.5.

Then

$$\mathbb{E}[\exp(Xt)] = \frac{1}{2}\exp(t) + \frac{1}{2}\exp(-t) = \cosh t \leq \exp\left(\frac{t^2}{2}\right),$$

where the last inequality follows by comparing the Taylor series term by term, since $(2n)! \geq 2^n n!$.

Thus $X \sim \mathrm{Sub}(1)$, i.e. X is 1-subgaussian.
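As a quick numerical sanity check, the following sketch (assuming NumPy is available; the grid of t values is an arbitrary choice) compares the Rademacher M.G.F. $\cosh t$ against the bound $\exp(t^2/2)$.

```python
import numpy as np

# Compare the Rademacher M.G.F. cosh(t) with the subgaussian bound exp(t^2/2)
# over an arbitrary grid of t values.
t = np.linspace(-5, 5, 1001)
mgf = np.cosh(t)              # E[exp(Xt)] for the Rademacher distribution
bound = np.exp(t**2 / 2)      # M.G.F. of N(0, 1)

assert np.all(mgf <= bound)
print("max ratio cosh(t)/exp(t^2/2):", np.max(mgf / bound))  # never exceeds 1
```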

Example 6.3 (Uniform distribution)

Consider X uniformly distributed over the interval $[-a, a]$ for some $a > 0$, i.e.

$$f_X(x) = \begin{cases} \frac{1}{2a} & -a \leq x \leq a \\ 0 & \text{otherwise}. \end{cases}$$

Then

$$\mathbb{E}[\exp(Xt)] = \frac{1}{2a}\int_{-a}^{a}\exp(xt)\,dx = \frac{1}{2at}\left[e^{at} - e^{-at}\right] = \sum_{n=0}^{\infty}\frac{(at)^{2n}}{(2n+1)!}.$$

But $(2n+1)! \geq n!\,2^n$. Hence we have

$$\sum_{n=0}^{\infty}\frac{(at)^{2n}}{(2n+1)!} \leq \sum_{n=0}^{\infty}\frac{(at)^{2n}}{n!\,2^n} = \sum_{n=0}^{\infty}\frac{(a^2t^2/2)^n}{n!} = \exp\left(\frac{a^2t^2}{2}\right).$$

Thus

$$\mathbb{E}[\exp(Xt)] \leq \exp\left(\frac{a^2t^2}{2}\right).$$

Hence $X \sim \mathrm{Sub}(a^2)$, i.e. X is a-subgaussian.
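The same kind of numerical check works here; the sketch below (NumPy assumed; the value of $a$ and the grid are arbitrary choices) compares the uniform M.G.F. $\sinh(at)/(at)$ with $\exp(a^2t^2/2)$.

```python
import numpy as np

# Uniform[-a, a]: E[exp(Xt)] = sinh(at)/(at). Compare with exp(a^2 t^2 / 2).
a = 2.0                              # arbitrary half-width of the support
t = np.linspace(-4, 4, 1001)
t = t[t != 0]                        # the closed-form M.G.F. needs t != 0
mgf = np.sinh(a * t) / (a * t)
bound = np.exp(a**2 * t**2 / 2)

assert np.all(mgf <= bound)
print("max ratio:", np.max(mgf / bound))   # stays below 1 on the grid
```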

Example 6.4 (Random variable with bounded support)

Consider X as a zero-mean, bounded random variable, i.e.

$$\mathbb{P}(|X| \leq B) = 1$$

for some $B \in \mathbb{R}^+$ and

$$\mathbb{E}(X) = 0.$$

Then, the following upper bound holds:

$$\mathbb{E}[\exp(Xt)] = \int_{-B}^{B}\exp(xt) f_X(x)\,dx \leq \exp\left(\frac{B^2t^2}{2}\right).$$

This result can be proven with some careful calculus; it is a special case of Hoeffding's lemma. Thus $X \sim \mathrm{Sub}(B^2)$, i.e. X is B-subgaussian.
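As a sanity check, the sketch below (NumPy assumed) applies the bound to a made-up asymmetric, zero-mean, bounded two-point distribution and compares its exact M.G.F. with $\exp(B^2t^2/2)$.

```python
import numpy as np

# A made-up zero-mean two-point distribution: X = -1 w.p. 2/3 and X = 2 w.p. 1/3.
# It is bounded with B = 2, so the claim is E[exp(Xt)] <= exp(B^2 t^2 / 2).
values = np.array([-1.0, 2.0])
probs = np.array([2 / 3, 1 / 3])
B = np.max(np.abs(values))

t = np.linspace(-3, 3, 601)
mgf = (probs * np.exp(np.outer(t, values))).sum(axis=1)   # exact M.G.F. on the grid
bound = np.exp(B**2 * t**2 / 2)

assert abs(values @ probs) < 1e-12    # zero mean
assert np.all(mgf <= bound)
print("max ratio:", np.max(mgf / bound))
```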

There are some useful properties of subgaussian random variables.

Lemma 6.2 (Mean and variance of subgaussian random variables)

If $X \sim \mathrm{Sub}(c^2)$ then

$$\mathbb{E}(X) = 0$$

and

$$\mathbb{E}(X^2) \leq c^2.$$

Thus subgaussian random variables are always zero-mean. Their variance is always bounded by the variance of the bounding Gaussian distribution.

Proof. We proceed as follows:

  1. Note that

    $$\sum_{n=0}^{\infty}\frac{t^n}{n!}\mathbb{E}(X^n) = \mathbb{E}\left(\sum_{n=0}^{\infty}\frac{(Xt)^n}{n!}\right) = \mathbb{E}(\exp(Xt)).$$
  2. But since $X \sim \mathrm{Sub}(c^2)$, hence

    $$\sum_{n=0}^{\infty}\frac{t^n}{n!}\mathbb{E}(X^n) \leq \exp\left(\frac{c^2t^2}{2}\right) = \sum_{n=0}^{\infty}\frac{c^{2n}t^{2n}}{2^n n!}.$$
  3. Restating (the constant terms on both sides equal 1 and the higher order terms are $o(t^2)$):

    $$\mathbb{E}(X)t + \mathbb{E}(X^2)\frac{t^2}{2!} \leq \frac{c^2t^2}{2} + o(t^2) \quad \text{as } t \to 0.$$
  4. Dividing throughout by $t > 0$ and letting $t \to 0$ we get $\mathbb{E}(X) \leq 0$.

  5. Dividing throughout by $t < 0$ (which reverses the inequality) and letting $t \to 0$ we get $\mathbb{E}(X) \geq 0$.

  6. Thus $\mathbb{E}(X) = 0$. So $\operatorname{Var}(X) = \mathbb{E}(X^2)$.

  7. Now we are left with

    $$\mathbb{E}(X^2)\frac{t^2}{2!} \leq \frac{c^2t^2}{2} + o(t^2) \quad \text{as } t \to 0.$$
  8. Dividing throughout by $t^2$ and letting $t \to 0$ we get $\operatorname{Var}(X) \leq c^2$.
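A quick Monte Carlo illustration of the lemma, using the uniform distribution on $[-a, a]$ from Example 6.3, which is $\mathrm{Sub}(a^2)$ (NumPy assumed; the sample size, seed, and $a$ are arbitrary choices): the sample mean is close to 0 and the sample variance stays below $c^2 = a^2$.

```python
import numpy as np

# Uniform[-a, a] is Sub(a^2); Lemma 6.2 predicts E(X) = 0 and E(X^2) <= a^2.
rng = np.random.default_rng(0)
a = 2.0
samples = rng.uniform(-a, a, size=1_000_000)

print("sample mean    :", samples.mean())   # close to 0
print("sample variance:", samples.var())    # close to a^2/3, below the bound a^2
print("bound c^2      :", a**2)
```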

Subgaussian variables have a linear structure.

Theorem 6.8 (Linearity of subgaussian variables)

If $X \sim \mathrm{Sub}(c^2)$, i.e. X is c-subgaussian, then for any $\alpha \in \mathbb{R}$, the r.v. $\alpha X$ is $|\alpha| c$-subgaussian.

If $X_1, X_2$ are r.v.s such that $X_i$ is $c_i$-subgaussian, then $X_1 + X_2$ is $(c_1 + c_2)$-subgaussian.

Proof. Scalar multiplication:

  1. Let X be c-subgaussian.

  2. Then

    $$\mathbb{E}[\exp(Xt)] \leq \exp\left(\frac{c^2t^2}{2}\right).$$
  3. Now for any $\alpha \in \mathbb{R}$, we have

    $$\mathbb{E}[\exp(\alpha X t)] = \mathbb{E}[\exp(X(\alpha t))] \leq \exp\left(\frac{c^2(\alpha t)^2}{2}\right) = \exp\left(\frac{(|\alpha| c)^2 t^2}{2}\right).$$
  4. Hence $\alpha X$ is $|\alpha| c$-subgaussian.

Addition:

  1. Consider $X_1$ as $c_1$-subgaussian and $X_2$ as $c_2$-subgaussian.

  2. Thus

    $$\mathbb{E}(\exp(X_i t)) \leq \exp\left(\frac{c_i^2 t^2}{2}\right).$$
  3. Let $p, q > 1$ be two numbers s.t. $\frac{1}{p} + \frac{1}{q} = 1$.

  4. Using Hölder's inequality, we have

    $$\begin{aligned}\mathbb{E}(\exp((X_1 + X_2)t)) &\leq \left[\mathbb{E}\left(\exp(X_1 t)^p\right)\right]^{\frac{1}{p}}\left[\mathbb{E}\left(\exp(X_2 t)^q\right)\right]^{\frac{1}{q}}\\ &= \left[\mathbb{E}(\exp(p X_1 t))\right]^{\frac{1}{p}}\left[\mathbb{E}(\exp(q X_2 t))\right]^{\frac{1}{q}}\\ &\leq \left[\exp\left(\frac{(p c_1)^2 t^2}{2}\right)\right]^{\frac{1}{p}}\left[\exp\left(\frac{(q c_2)^2 t^2}{2}\right)\right]^{\frac{1}{q}}\\ &= \exp\left(\frac{t^2}{2}\left(p c_1^2 + q c_2^2\right)\right) = \exp\left(\frac{t^2}{2}\left(p c_1^2 + \frac{p}{p - 1} c_2^2\right)\right).\end{aligned}$$
  5. Since this is valid for any $p > 1$, we can minimize the r.h.s. over $p > 1$.

  6. It suffices to minimize the term

    $$r = p c_1^2 + \frac{p}{p - 1} c_2^2.$$
  7. We have

    $$\frac{\partial r}{\partial p} = c_1^2 - \frac{1}{(p - 1)^2} c_2^2.$$
  8. Equating it to 0 gives us

    $$p - 1 = \frac{c_2}{c_1} \implies p = \frac{c_1 + c_2}{c_1} \implies \frac{p}{p - 1} = \frac{c_1 + c_2}{c_2}.$$
  9. Taking the second derivative, we can verify that this is indeed a minimum.

  10. Thus

    $$r_{\min} = (c_1 + c_2)^2.$$
  11. Hence we have the result

    $$\mathbb{E}(\exp((X_1 + X_2)t)) \leq \exp\left(\frac{(c_1 + c_2)^2 t^2}{2}\right).$$
  12. Thus $X_1 + X_2$ is $(c_1 + c_2)$-subgaussian.
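The optimization over p carried out in steps 6-10 can be double-checked numerically. The sketch below (NumPy assumed; $c_1$, $c_2$ and the grid are arbitrary choices) minimizes $r(p) = pc_1^2 + \frac{p}{p-1}c_2^2$ on a grid and compares the result with the closed form.

```python
import numpy as np

# Minimize r(p) = p*c1^2 + p/(p-1)*c2^2 over p > 1 and compare with (c1 + c2)^2.
c1, c2 = 1.5, 0.7                        # arbitrary subgaussian constants
p = np.linspace(1.001, 20.0, 200_001)
r = p * c1**2 + p / (p - 1) * c2**2

print("numerical minimizer p :", p[np.argmin(r)])   # ~ (c1 + c2) / c1
print("analytical minimizer p:", (c1 + c2) / c1)
print("numerical minimum r   :", r.min())           # ~ (c1 + c2)^2
print("analytical minimum    :", (c1 + c2)**2)
```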

If $X_1$ and $X_2$ are independent, then $X_1 + X_2$ is $\sqrt{c_1^2 + c_2^2}$-subgaussian.

If X is c-subgaussian then, naturally, X is d-subgaussian for any $d \geq c$. A question arises as to what is the minimum value of c such that X is c-subgaussian.

Definition 6.3 (Subgaussian moment)

For a centered random variable X, the subgaussian moment of X, denoted by $\sigma(X)$, is defined as

$$\sigma(X) = \inf\left\{c \geq 0 \;\middle|\; \mathbb{E}(\exp(Xt)) \leq \exp\left(\frac{c^2t^2}{2}\right)\ \forall t \in \mathbb{R}\right\}.$$

X is subgaussian if and only if $\sigma(X)$ is finite.

We can also show that $\sigma(\cdot)$ is a norm on the space of subgaussian random variables, and that this normed space is complete.

For a centered Gaussian r.v. $X \sim \mathcal{N}(0, \sigma^2)$, the subgaussian moment coincides with the standard deviation: $\sigma(X) = \sigma$.
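Since the constraint in Definition 6.3 can be rewritten as $c^2 \geq \frac{2\ln \mathbb{E}(\exp(Xt))}{t^2}$ for every $t \neq 0$, the squared subgaussian moment equals the supremum of this ratio. The sketch below (NumPy assumed; the grid is an arbitrary choice) evaluates the ratio for the Rademacher distribution, where Example 6.2 and Lemma 6.2 together give $\sigma(X) = 1$.

```python
import numpy as np

# Estimate sigma(X)^2 = sup_{t != 0} 2*ln(E[exp(Xt)]) / t^2 for the Rademacher
# distribution, whose M.G.F. is cosh(t). By symmetry, t > 0 suffices.
t = np.linspace(1e-4, 10.0, 100_001)
ratio = 2 * np.log(np.cosh(t)) / t**2

print("sup of the ratio:", ratio.max())   # approaches 1, attained in the limit t -> 0
```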

Sometimes it is useful to consider a more restrictive class of subgaussian random variables.

Definition 6.4 (Strictly subgaussian distribution)

A random variable X is called strictly subgaussian if $X \sim \mathrm{Sub}(\sigma^2)$ where $\sigma^2 = \mathbb{E}(X^2)$, i.e. the inequality

$$\mathbb{E}(\exp(Xt)) \leq \exp\left(\frac{\sigma^2 t^2}{2}\right)$$

holds true for all $t \in \mathbb{R}$.

We will denote strictly subgaussian variables by $X \sim \mathrm{SSub}(\sigma^2)$.

Example 6.5 (Gaussian distribution)

If $X \sim \mathcal{N}(0, \sigma^2)$ then $X \sim \mathrm{SSub}(\sigma^2)$.

6.8.1. Characterization

We quickly review Markov’s inequality which will help us establish the results in this subsection.

Let X be a non-negative random variable and let $t > 0$. Then

$$\mathbb{P}(X \geq t) \leq \frac{\mathbb{E}(X)}{t}.$$

Theorem 6.9

For a centered random variable X, the following statements are equivalent:

  1. Moment generating function condition: there exists $c > 0$ such that

    $$\mathbb{E}[\exp(Xt)] \leq \exp\left(\frac{c^2t^2}{2}\right) \quad \forall t \in \mathbb{R}.$$
  2. Subgaussian tail estimate: there exists $a > 0$ such that

    $$\mathbb{P}(|X| \geq \lambda) \leq 2\exp(-a\lambda^2) \quad \forall \lambda > 0.$$
  3. $\psi_2$-condition: there exists some $b > 0$ such that

    $$\mathbb{E}[\exp(bX^2)] \leq 2.$$

Proof. (1) $\Rightarrow$ (2):

  1. Using Markov's inequality, for any $t > 0$ we have

    $$\mathbb{P}(X \geq \lambda) = \mathbb{P}(tX \geq t\lambda) = \mathbb{P}(e^{tX} \geq e^{t\lambda}) \leq \frac{\mathbb{E}(e^{tX})}{e^{t\lambda}} \leq \exp\left(-t\lambda + \frac{c^2t^2}{2}\right) \quad \forall t > 0.$$
  2. Since this is valid for all $t > 0$, it also holds for the value of t minimizing the r.h.s.

  3. The minimum value is obtained at $t = \frac{\lambda}{c^2}$.

  4. Thus we get

    $$\mathbb{P}(X \geq \lambda) \leq \exp\left(-\frac{\lambda^2}{2c^2}\right).$$
  5. Since X is c-subgaussian, $-X$ is also c-subgaussian.

  6. Hence

    $$\mathbb{P}(X \leq -\lambda) = \mathbb{P}(-X \geq \lambda) \leq \exp\left(-\frac{\lambda^2}{2c^2}\right).$$
  7. Thus

    $$\mathbb{P}(|X| \geq \lambda) = \mathbb{P}(X \geq \lambda) + \mathbb{P}(X \leq -\lambda) \leq 2\exp\left(-\frac{\lambda^2}{2c^2}\right).$$
  8. Thus we can choose $a = \frac{1}{2c^2}$ to complete the proof.
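The tail estimate just derived can be checked by simulation. The sketch below (NumPy assumed; the sample size, seed, $c$, and the thresholds are arbitrary choices) compares the empirical two-sided tails of $\mathcal{N}(0, c^2)$, which is $c$-subgaussian, with $2\exp(-\lambda^2/(2c^2))$.

```python
import numpy as np

# Empirical two-sided tails of N(0, c^2) versus the bound 2*exp(-lambda^2 / (2 c^2)).
rng = np.random.default_rng(1)
c = 1.0
x = rng.normal(0.0, c, size=2_000_000)

for lam in [0.5, 1.0, 2.0, 3.0]:
    empirical = np.mean(np.abs(x) >= lam)
    bound = 2 * np.exp(-lam**2 / (2 * c**2))
    print(f"lambda = {lam}: empirical = {empirical:.5f}, bound = {bound:.5f}")
```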

(2) $\Rightarrow$ (3):

TODO PROVE THIS

$$\mathbb{E}(\exp(bX^2)) = 1 + \int_0^{\infty} 2bt\exp(bt^2)\,\mathbb{P}(|X| > t)\,dt$$

Combining this identity with the tail estimate in (2) and choosing $b$ small enough (e.g. $b \leq \frac{a}{3}$) bounds the r.h.s. by 2.

(3) $\Rightarrow$ (1):

TODO PROVE THIS

6.8.2. More Properties

We also have the following result on the exponential moment of a subgaussian random variable.

Lemma 6.3

Suppose $X \sim \mathrm{Sub}(c^2)$. Then

$$\mathbb{E}\left[\exp\left(\frac{\lambda X^2}{2c^2}\right)\right] \leq \frac{1}{\sqrt{1 - \lambda}}$$

for any $\lambda \in [0, 1)$.

Proof. We are given that

$$\mathbb{E}(\exp(Xt)) \leq \exp\left(\frac{c^2t^2}{2}\right) \iff \int_{-\infty}^{\infty}\exp(tx) f_X(x)\,dx \leq \exp\left(\frac{c^2t^2}{2}\right) \quad \forall t \in \mathbb{R}.$$

Multiplying both sides by $\exp\left(-\frac{c^2t^2}{2\lambda}\right)$:

$$\int_{-\infty}^{\infty}\exp\left(tx - \frac{c^2t^2}{2\lambda}\right) f_X(x)\,dx \leq \exp\left(\frac{c^2t^2}{2}\cdot\frac{\lambda - 1}{\lambda}\right) = \exp\left(-\frac{t^2}{2}\cdot\frac{c^2(1 - \lambda)}{\lambda}\right).$$

Integrating both sides w.r.t. t we get:

$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\exp\left(tx - \frac{c^2t^2}{2\lambda}\right) f_X(x)\,dx\,dt \leq \int_{-\infty}^{\infty}\exp\left(-\frac{t^2}{2}\cdot\frac{c^2(1 - \lambda)}{\lambda}\right)dt,$$

which reduces to:

$$\frac{\sqrt{2\pi\lambda}}{c}\int_{-\infty}^{\infty}\exp\left(\frac{\lambda x^2}{2c^2}\right) f_X(x)\,dx \leq \frac{\sqrt{2\pi\lambda}}{c}\cdot\frac{1}{\sqrt{1 - \lambda}} \implies \mathbb{E}\left[\exp\left(\frac{\lambda X^2}{2c^2}\right)\right] \leq \frac{1}{\sqrt{1 - \lambda}},$$

which completes the proof.
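For $X \sim \mathcal{N}(0, c^2)$ one can compute $\mathbb{E}[\exp(\lambda X^2/(2c^2))] = 1/\sqrt{1-\lambda}$ exactly, so the bound of Lemma 6.3 is tight for Gaussians. The Monte Carlo sketch below (NumPy assumed; $\lambda$ is kept well below 1 so the estimator has finite variance; sample size, seed and $c$ are arbitrary choices) illustrates this.

```python
import numpy as np

# For X ~ N(0, c^2): E[exp(lambda * X^2 / (2 c^2))] = 1 / sqrt(1 - lambda).
rng = np.random.default_rng(2)
c, lam = 1.5, 0.3
x = rng.normal(0.0, c, size=2_000_000)

estimate = np.mean(np.exp(lam * x**2 / (2 * c**2)))
print("Monte Carlo estimate:", estimate)
print("1/sqrt(1 - lambda)  :", 1 / np.sqrt(1 - lam))
```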

6.8.3. Subgaussian Random Vectors

The linearity property of subgaussian r.v.s can be extended to random vectors as well. This is stated more formally in the following result.

Theorem 6.10

Suppose that $X = [X_1, X_2, \dots, X_N]$, where the $X_i$ are i.i.d. with $X_i \sim \mathrm{Sub}(c^2)$. Then for any $\alpha \in \mathbb{R}^N$, $\langle X, \alpha\rangle \sim \mathrm{Sub}(c^2\|\alpha\|_2^2)$. Similarly, if each $X_i \sim \mathrm{SSub}(\sigma^2)$, then for any $\alpha \in \mathbb{R}^N$, $\langle X, \alpha\rangle \sim \mathrm{SSub}(\sigma^2\|\alpha\|_2^2)$.

Norm of a subgaussian random vector

  1. Let X be a random vector whose entries $X_i$ are i.i.d. with $X_i \sim \mathrm{Sub}(c^2)$.

  2. Consider the $\ell_2$ norm $\|X\|_2$. It is a random variable in its own right.

  3. It would be useful to understand the average behavior of the norm.

  4. Suppose $N = 1$. Then $\|X\|_2 = |X_1|$.

  5. Also $\|X\|_2^2 = X_1^2$. Thus $\mathbb{E}(\|X\|_2^2) = \sigma^2$, where $\sigma^2 = \mathbb{E}(X_1^2)$.

  6. It looks like $\mathbb{E}(\|X\|_2^2)$ should be connected with $\sigma^2$ in general.

  7. The norm can be larger or smaller than its average value.

  8. A ratio-based measure between the actual value and the average value would be useful.

  9. What is the probability that the norm increases beyond a given factor?

  10. What is the probability that the norm decreases beyond a given factor?

These bounds are stated formally in the following theorem.

Theorem 6.11

Suppose that $X = [X_1, X_2, \dots, X_N]$, where the $X_i$ are i.i.d. with $X_i \sim \mathrm{Sub}(c^2)$ and $\sigma^2 = \mathbb{E}(X_i^2)$. Then

$$\mathbb{E}(\|X\|_2^2) = N\sigma^2. \tag{6.2}$$

Moreover, for any $\alpha \in (0, 1)$ and for any $\beta \in \left[\frac{c^2}{\sigma^2}, \beta_{\max}\right]$, there exists a constant $\kappa \geq 4$ depending only on $\beta_{\max}$ and the ratio $\frac{\sigma^2}{c^2}$ such that

$$\mathbb{P}(\|X\|_2^2 \leq \alpha N\sigma^2) \leq \exp\left(-\frac{N(1 - \alpha)^2}{\kappa}\right) \tag{6.3}$$

and

$$\mathbb{P}(\|X\|_2^2 \geq \beta N\sigma^2) \leq \exp\left(-\frac{N(\beta - 1)^2}{\kappa}\right). \tag{6.4}$$
  • The first equation gives the average value of the squared norm.

  • The second inequality (6.3) bounds the probability that the squared norm falls below the average by a factor $\alpha < 1$.

  • The third inequality (6.4) bounds the probability that the squared norm exceeds the average by a factor $\beta > 1$.

  • Note that if the $X_i$ are strictly subgaussian, then $c = \sigma$; hence $\beta$ ranges over $[1, \beta_{\max}]$. A quick numerical check of these bounds is sketched below, before the proof.
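As a numerical illustration (NumPy assumed; $N$, the number of trials, $\alpha$, $\beta$, and $\beta_{\max}$ are arbitrary choices), the sketch below uses standard Gaussian entries, which are strictly subgaussian with $\sigma = c = 1$ by Example 6.5, and compares the empirical probabilities with the bounds, using the value of $\kappa$ constructed in the proof that follows.

```python
import numpy as np

# Monte Carlo illustration of Theorem 6.11 with standard Gaussian entries
# (strictly subgaussian, sigma = c = 1, so gamma_max = beta_max).
rng = np.random.default_rng(3)
N, trials = 100, 100_000
alpha, beta, beta_max = 0.8, 1.2, 1.5

# kappa = max(4, f(gamma_max)) with f(g) = 2 (g - 1)^2 / ((g - 1) - ln g).
g = beta_max
kappa = max(4.0, 2 * (g - 1)**2 / ((g - 1) - np.log(g)))

sq_norms = np.sum(rng.normal(size=(trials, N))**2, axis=1)

print("mean of ||X||^2       :", sq_norms.mean(), "(theory:", N, ")")
print("P(||X||^2 <= alpha N) :", np.mean(sq_norms <= alpha * N),
      "  bound:", np.exp(-N * (1 - alpha)**2 / kappa))
print("P(||X||^2 >= beta N)  :", np.mean(sq_norms >= beta * N),
      "  bound:", np.exp(-N * (beta - 1)**2 / kappa))
```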

Proof. Since the $X_i$ are identically distributed with $\mathbb{E}(X_i^2) = \sigma^2$, linearity of expectation gives

$$\mathbb{E}[\|X\|_2^2] = \mathbb{E}\left[\sum_{i=1}^N X_i^2\right] = \sum_{i=1}^N \mathbb{E}[X_i^2] = N\sigma^2.$$

This proves the first part.

Now let us look at (6.4).

By applying Markov's inequality, for any $\lambda > 0$ we have:

$$\mathbb{P}(\|X\|_2^2 \geq \beta N\sigma^2) = \mathbb{P}\left(\exp(\lambda\|X\|_2^2) \geq \exp(\lambda\beta N\sigma^2)\right) \leq \frac{\mathbb{E}(\exp(\lambda\|X\|_2^2))}{\exp(\lambda\beta N\sigma^2)} = \frac{\prod_{i=1}^N \mathbb{E}(\exp(\lambda X_i^2))}{\exp(\lambda\beta N\sigma^2)},$$

where the last step uses the independence of the $X_i$.

Since $X_i$ is c-subgaussian, from Lemma 6.3 we have, for any $\lambda \in \left(0, \frac{1}{2c^2}\right)$,

$$\mathbb{E}(\exp(\lambda X_i^2)) = \mathbb{E}\left(\exp\left(\frac{2c^2\lambda X_i^2}{2c^2}\right)\right) \leq \frac{1}{\sqrt{1 - 2c^2\lambda}}.$$

Thus:

$$\prod_{i=1}^N \mathbb{E}(\exp(\lambda X_i^2)) \leq \left(\frac{1}{1 - 2c^2\lambda}\right)^{\frac{N}{2}}.$$

Putting it back we get:

$$\mathbb{P}(\|X\|_2^2 \geq \beta N\sigma^2) \leq \left(\frac{\exp(-2\lambda\beta\sigma^2)}{1 - 2c^2\lambda}\right)^{\frac{N}{2}}.$$

Since the above is valid for all $\lambda \in \left(0, \frac{1}{2c^2}\right)$, we can minimize the r.h.s. over $\lambda$ by setting the derivative w.r.t. $\lambda$ to 0.

Thus we get the optimum $\lambda$ as:

$$\lambda = \frac{\beta\sigma^2 - c^2}{2\beta c^2\sigma^2}.$$

Plugging this back we get:

$$\mathbb{P}(\|X\|_2^2 \geq \beta N\sigma^2) \leq \left(\frac{\beta\sigma^2}{c^2}\exp\left(1 - \frac{\beta\sigma^2}{c^2}\right)\right)^{\frac{N}{2}}.$$

Similarly proceeding for (6.3) we get

$$\mathbb{P}(\|X\|_2^2 \leq \alpha N\sigma^2) \leq \left(\frac{\alpha\sigma^2}{c^2}\exp\left(1 - \frac{\alpha\sigma^2}{c^2}\right)\right)^{\frac{N}{2}}.$$

We need to simplify these expressions; a bit of algebraic juggling follows. Consider the function

$$f(\gamma) = \frac{2(\gamma - 1)^2}{(\gamma - 1) - \ln\gamma}, \quad \gamma > 0.$$

By differentiating twice, we can show that this is a strictly increasing function. Let $\gamma \in (0, \gamma_{\max}]$. Define

$$\kappa = \max\left(4, \frac{2(\gamma_{\max} - 1)^2}{(\gamma_{\max} - 1) - \ln\gamma_{\max}}\right).$$

Clearly

$$\kappa \geq \frac{2(\gamma - 1)^2}{(\gamma - 1) - \ln\gamma} \quad \forall \gamma \in (0, \gamma_{\max}].$$

This gives us:

$$\ln\gamma \leq (\gamma - 1) - \frac{2(\gamma - 1)^2}{\kappa}.$$

Hence, exponentiating both sides, we get:

$$\gamma \leq \exp\left[(\gamma - 1) - \frac{2(\gamma - 1)^2}{\kappa}\right].$$

Multiplying both sides by $\exp(1 - \gamma)$ we get:

$$\gamma\exp(1 - \gamma) \leq \exp\left[-\frac{2(1 - \gamma)^2}{\kappa}\right].$$

We now choose

$$\gamma = \frac{\alpha\sigma^2}{c^2}.$$

Substituting, we get:

$$\mathbb{P}(\|X\|_2^2 \leq \alpha N\sigma^2) \leq (\gamma\exp(1 - \gamma))^{\frac{N}{2}} \leq \exp\left[-\frac{N(1 - \gamma)^2}{\kappa}\right].$$

Finally,

$$c \geq \sigma \implies \frac{\sigma^2}{c^2} \leq 1 \implies \gamma \leq \alpha \implies 1 - \gamma \geq 1 - \alpha.$$

Thus we get

$$\mathbb{P}(\|X\|_2^2 \leq \alpha N\sigma^2) \leq \exp\left[-\frac{N(1 - \alpha)^2}{\kappa}\right].$$

Similarly, choosing $\gamma = \frac{\beta\sigma^2}{c^2}$ proves the other bound.

We can now map $\gamma_{\max}$ to $\beta_{\max}$ by:

$$\gamma_{\max} = \frac{\beta_{\max}\sigma^2}{c^2}.$$

This result tells us that, given a vector with entries drawn from a subgaussian distribution, we can expect the squared norm of the vector to concentrate around its expected value $N\sigma^2$.
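A minimal sketch of this concentration effect (NumPy assumed; the dimensions, trial count and seed are arbitrary choices): with standard Gaussian entries ($\sigma^2 = 1$), the typical relative deviation of $\|X\|_2^2$ from $N\sigma^2$ shrinks as $N$ grows.

```python
import numpy as np

# Relative deviation of ||X||^2 from N*sigma^2 for increasing N (Gaussian entries).
rng = np.random.default_rng(4)
trials = 10_000
for N in [10, 100, 1000]:
    sq_norms = np.sum(rng.normal(size=(trials, N))**2, axis=1)
    rel_dev = np.abs(sq_norms / N - 1.0)
    print(f"N = {N:4d}  median relative deviation: {np.median(rel_dev):.4f}")
```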