12.5. Sparse and Redundant Representations

12.5.1. Dictionaries

Definition 12.3 (Dictionary)

A dictionary for $\mathbb{C}^N$ is a finite collection $\mathcal{D}$ of unit-norm vectors that spans the whole space.

The elements of a dictionary are called atoms and are denoted by $\phi_\omega$, where $\omega$ is drawn from an index set $\Omega$.

The dictionary is written as

$$\mathcal{D} = \{ \phi_\omega : \omega \in \Omega \}$$

where

$$\| \phi_\omega \|_2 = 1 \quad \forall \omega \in \Omega,$$

and every signal $x \in \mathbb{C}^N$ can be expressed as

$$x = \sum_{\omega \in \Omega} c_\omega \phi_\omega.$$

We use the letter $D$ to denote the number of elements in the dictionary; i.e.,

$$D = |\Omega|.$$

This definition is adapted from [35].

The indices may have an interpretation, such as the time-frequency or time-scale localization of an atom, or they may simply be labels without any underlying meaning.

Note that the dictionary need not provide a unique representation for any vector $x \in \mathbb{C}^N$, but it provides at least one representation for each $x \in \mathbb{C}^N$.

When $D = N$, we have a set of unit-norm vectors which spans the whole of $\mathbb{C}^N$; thus we have a basis (not necessarily an orthonormal basis). A dictionary cannot have $D < N$, since fewer than $N$ vectors cannot span $\mathbb{C}^N$. The more interesting case is when $D > N$.
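As an illustration, here is a minimal NumPy sketch (our own construction, not from the text) that builds an over-complete dictionary for $\mathbb{C}^8$ by concatenating the standard (Dirac) basis with an orthonormal Fourier basis, giving $D = 2N$ unit-norm atoms that span the space:

```python
import numpy as np

N = 8
# Dirac atoms: columns of the identity matrix.
dirac = np.eye(N)
# Fourier atoms: columns of the unitary DFT matrix.
fourier = np.fft.fft(np.eye(N)) / np.sqrt(N)
# Concatenating the two bases gives a dictionary with D = 2N atoms.
Phi = np.hstack([dirac, fourier])

print(Phi.shape)                                    # (8, 16): D = 16 > N = 8
print(np.allclose(np.linalg.norm(Phi, axis=0), 1))  # True: all atoms unit norm
print(np.linalg.matrix_rank(Phi))                   # 8: the atoms span C^8
```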

12.5.2. Redundant Dictionaries and Sparse Signals

With $D > N$, there are clearly more atoms than necessary to provide a representation of every signal $x \in \mathbb{C}^N$. Thus such a dictionary is able to provide multiple representations of the same vector $x$. We call such dictionaries redundant dictionaries or over-complete dictionaries.

In contrast, a basis with $D = N$ is called a complete dictionary. A special class of signals is those which have a sparse representation in a given dictionary $\mathcal{D}$.

Definition 12.4 ($(\mathcal{D},K)$-sparse signals)

A signal $x \in \mathbb{C}^N$ is called $(\mathcal{D}, K)$-sparse if it can be expressed as a linear combination of at most $K$ atoms from the dictionary $\mathcal{D}$.

Normally, for sparse signals, we have $K \ll D$. It is usually expected that $K \ll N$ also holds.

Let $\Lambda \subset \Omega$ be a subset of indices with $|\Lambda| = K$.

Let $x$ be any signal in $\mathbb{C}^N$ such that $x$ can be expressed as

$$x = \sum_{\lambda \in \Lambda} b_\lambda \phi_\lambda \quad \text{where } b_\lambda \in \mathbb{C}.$$

Note that this need not be the only possible representation of $x$ in $\mathcal{D}$; it is just one of them. The special feature of this representation is that it is $K$-sparse; i.e., at most $K$ atoms from the dictionary are being used.

Now there are $\binom{D}{K}$ ways in which we can choose a set of $K$ atoms from the dictionary $\mathcal{D}$.

Thus the set of $(\mathcal{D}, K)$-sparse signals is given by

$$\Sigma_{(\mathcal{D},K)} = \left\{ x \in \mathbb{C}^N \;\middle|\; x = \sum_{\lambda \in \Lambda} b_\lambda \phi_\lambda, \quad \Lambda \subset \Omega, \quad |\Lambda| = K \right\}.$$

This set $\Sigma_{(\mathcal{D},K)}$ depends on the chosen dictionary $\mathcal{D}$. In the sequel, we will simply refer to it as $\Sigma_K$.

Example 12.15 ($K$-sparse signals for standard basis)

For the special case where $\mathcal{D}$ is the standard basis of $\mathbb{C}^N$, we have

$$\Sigma_K = \{ x \in \mathbb{C}^N \mid \|x\|_0 \le K \};$$

i.e., the set of signals which have $K$ or fewer non-zero elements.

Example 12.16 ($K$-sparse signals for orthonormal basis)

In contrast, if we choose an orthonormal basis $\Psi$ such that every $x \in \mathbb{C}^N$ can be expressed as

$$x = \Psi a,$$

then with the dictionary $\mathcal{D} = \Psi$, the set of $K$-sparse signals is given by

$$\Sigma_K = \{ x = \Psi a \mid \|a\|_0 \le K \}.$$

We also note that for a specific choice of $\Lambda \subset \Omega$ with $|\Lambda| = K$, the set of vectors

$$\left\{ x \in \mathbb{C}^N \;\middle|\; x = \sum_{\lambda \in \Lambda} b_\lambda \phi_\lambda \right\} = \operatorname{span} \{ \phi_\lambda \mid \lambda \in \Lambda \}$$

forms a subspace of $\mathbb{C}^N$.

So we have $\binom{D}{K}$ $K$-sparse subspaces arising from the dictionary $\mathcal{D}$, and the $K$-sparse signals lie in the union of these subspaces.

12.5.3. Sparse Approximation Problem

In the sparse approximation problem, we attempt to approximate a given signal $x \in \mathbb{C}^N$ as a linear combination of $K$ atoms from the dictionary $\mathcal{D}$, where $K \ll N$ and typically $N \ll D$; i.e., the number of atoms in the dictionary $\mathcal{D}$ is typically much larger than the ambient signal space dimension $N$.

Naturally, we wish to obtain a best possible sparse representation of $x$ over the atoms $\phi_\omega \in \mathcal{D}$, one which minimizes the approximation error.

Let $\Lambda$ denote the index set of atoms which are used to create a $K$-sparse representation of $x$, where $\Lambda \subset \Omega$ with $|\Lambda| = K$.

Let $x_\Lambda$ denote an approximation of $x$ over the set of atoms indexed by $\Lambda$.

Then we can write $x_\Lambda$ as

$$x_\Lambda = \sum_{\lambda \in \Lambda} b_\lambda \phi_\lambda \quad \text{where } b_\lambda \in \mathbb{C}.$$

We put all the complex-valued coefficients $b_\lambda$ in the sum into a list $b_\Lambda$. The approximation error is given by

$$e = \| x - x_\Lambda \|_2.$$

Clearly, we would like to minimize the approximation error over all possible choices of $K$ atoms and the corresponding coefficient lists $b_\Lambda$.

Thus the sparse approximation problem can be cast as a minimization problem given by

$$\min_{|\Lambda| = K} \; \min_{b_\Lambda} \left\| x - \sum_{\lambda \in \Lambda} b_\lambda \phi_\lambda \right\|_2. \tag{12.17}$$

If we choose a particular $\Lambda$, the inner minimization problem becomes a straightforward least squares problem. But there are $\binom{D}{K}$ possible choices of $\Lambda$, and solving the inner least squares problem for each of them becomes prohibitively expensive.
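To make the combinatorial cost concrete, here is a naive brute-force sketch of (12.17): it loops over all $\binom{D}{K}$ supports and solves the inner least squares problem for each. The helper name `sparse_approx_exhaustive` is our own; this is a reference implementation feasible only for tiny $D$ and $K$, not a practical algorithm.

```python
import itertools
import numpy as np

def sparse_approx_exhaustive(Phi, x, K):
    """Brute-force solution of the sparse approximation problem (12.17):
    try every support of size K and solve the inner least squares
    problem. Cost grows as binom(D, K); illustration only."""
    N, D = Phi.shape
    best_err, best_support, best_coeffs = np.inf, None, None
    for support in itertools.combinations(range(D), K):
        cols = Phi[:, support]
        # Inner minimization: ordinary least squares on the chosen atoms.
        b, *_ = np.linalg.lstsq(cols, x, rcond=None)
        err = np.linalg.norm(x - cols @ b)
        if err < best_err:
            best_err, best_support, best_coeffs = err, support, b
    return best_support, best_coeffs, best_err
```

Even for the small Dirac-Fourier dictionary above ($D = 16$) and $K = 2$, this loop already solves $\binom{16}{2} = 120$ least squares problems.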

We re-emphasize that in this formulation we are using a fixed dictionary $\mathcal{D}$ while the vector $x \in \mathbb{C}^N$ is arbitrary.

This problem is known as the $(\mathcal{D}, K)$-SPARSE approximation problem.

A related problem is known as the $(\mathcal{D}, K)$-EXACT-SPARSE problem, where it is known a priori that $x$ is a linear combination of at most $K$ atoms from the given dictionary $\mathcal{D}$; i.e., $x$ is a $K$-sparse signal in the dictionary $\mathcal{D}$ as defined above.

This formulation simplifies the minimization problem (12.17), since it is known a priori that a zero approximation error can be achieved. The only problem is to find, out of the $\binom{D}{K}$ possible $K$-sparse subspaces, those which provide a $K$-sparse representation of $x$, and among them to choose one. It is imperative to note that even the $K$-sparse representation need not be unique.

Clearly the EXACT-SPARSE problem is simpler than the SPARSE approximation problem. Thus, if the EXACT-SPARSE problem is NP-hard, then so is the harder SPARSE approximation problem. It is expected that solving the EXACT-SPARSE problem will provide insights into solving the SPARSE problem.

In Theorem 12.8 we identified conditions under which a sparse representation of a given vector $x$ in a two-ortho basis is unique. It would be useful to obtain similar conditions for general dictionaries; such conditions would help us guarantee the uniqueness of a solution to the EXACT-SPARSE problem.

12.5.4. Synthesis and Analysis

The atoms of a dictionary $\mathcal{D}$ can be organized into an $N \times D$ matrix as follows:

$$\Phi \triangleq \begin{bmatrix} \phi_{\omega_1} & \phi_{\omega_2} & \dots & \phi_{\omega_D} \end{bmatrix},$$

where $\Omega = \{\omega_1, \omega_2, \dots, \omega_D\}$ is the index set for the atoms of $\mathcal{D}$. We recall that $\phi_\omega \in \mathbb{C}^N$; hence each atom has a column vector representation in the standard basis of $\mathbb{C}^N$. The order of columns doesn't matter as long as it remains fixed once chosen.

Thus, in matrix terminology, a representation of $x \in \mathbb{C}^N$ in the dictionary can be written as

$$x = \Phi b,$$

where $b \in \mathbb{C}^D$ is a vector of coefficients used to produce the superposition $x$ from the atoms of the dictionary $\mathcal{D}$. Clearly, with $D > N$, $b$ is not unique. Rather, for every vector $z \in \mathcal{N}(\Phi)$ (the null space of $\Phi$), we have

$$\Phi (b + z) = \Phi b + \Phi z = x + 0 = x.$$
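A quick numerical check of this non-uniqueness, assuming a generic random $\Phi$ with full row rank: adding any null-space vector to $b$ leaves the synthesized $x$ unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 4, 6
Phi = rng.standard_normal((N, D))
Phi /= np.linalg.norm(Phi, axis=0)         # unit-norm atoms

b = rng.standard_normal(D)
x = Phi @ b

# A basis for N(Phi) from the SVD: the last D - N rows of Vh
# (assuming Phi has full row rank N, which holds generically).
_, _, Vh = np.linalg.svd(Phi)
z = Vh[N:].conj().T @ rng.standard_normal(D - N)

print(np.allclose(Phi @ z, 0))             # True: z lies in the null space
print(np.allclose(Phi @ (b + z), x))       # True: a second representation of x
```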

Definition 12.5 (Synthesis matrix)

The matrix $\Phi$ is called a synthesis matrix, since $x$ is synthesized from the columns of $\Phi$ with the coefficient vector $b$.

We can also view the synthesis matrix $\Phi$ as a linear operator from $\mathbb{C}^D$ to $\mathbb{C}^N$.

There is another way to look at $x$ through $\Phi$.

Definition 12.6 (Analysis matrix)

The conjugate transpose $\Phi^H$ of the synthesis matrix $\Phi$ is called the analysis matrix. It maps a given vector $x \in \mathbb{C}^N$ to a list of inner products with the dictionary atoms:

$$c = \Phi^H x,$$

where $c \in \mathbb{C}^D$.

Note that in general $x \neq \Phi (\Phi^H x)$ unless $\mathcal{D}$ is an orthonormal basis.
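The following snippet illustrates this numerically: analysis followed by synthesis recovers $x$ for an orthonormal basis, but not for the redundant Dirac-Fourier dictionary (for which $\Phi \Phi^H = 2I$, so the round trip doubles $x$):

```python
import numpy as np

N = 8
F = np.fft.fft(np.eye(N)) / np.sqrt(N)    # orthonormal Fourier basis
Phi = np.hstack([np.eye(N), F])           # redundant Dirac-Fourier dictionary

x = np.random.default_rng(1).standard_normal(N)

# Orthonormal basis: synthesis undoes analysis.
print(np.allclose(F @ (F.conj().T @ x), x))        # True
# Redundant dictionary: it does not (here Phi Phi^H = 2I).
print(np.allclose(Phi @ (Phi.conj().T @ x), x))    # False
```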

Definition 12.7 ($(\mathcal{D},K)$-EXACT-SPARSE problem)

With the help of the synthesis matrix $\Phi$, the $(\mathcal{D}, K)$-EXACT-SPARSE problem can now be written as

$$\begin{aligned} & \underset{a}{\text{minimize}} & & \|a\|_0 \\ & \text{subject to} & & x = \Phi a \\ & \text{and} & & \|a\|_0 \le K. \end{aligned} \tag{12.18}$$

If $x \notin \Sigma_K$, then the EXACT-SPARSE problem is infeasible. Otherwise, we are looking for the sparsest possible solution.

Definition 12.8 ($(\mathcal{D},K)$-SPARSE approximation problem)

With the help of the synthesis matrix $\Phi$, the $(\mathcal{D}, K)$-SPARSE approximation problem can now be written as

$$\begin{aligned} & \underset{a}{\text{minimize}} & & \| x - \Phi a \|_2 \\ & \text{subject to} & & \|a\|_0 \le K. \end{aligned} \tag{12.19}$$

This problem can be visualized as a projection of $x$ onto the set $\Sigma_K$. Hence, it always has a solution.

12.5.5. P-Norms

This section lists some simple and useful results on the relationships between different $\ell_p$ norms. We also discuss some interesting properties of the $\ell_1$ norm specifically.

Definition 12.9 (Complex sign vector)

Let $v \in \mathbb{C}^N$. Let the entries of $v$ be represented as

$$v_k = r_k \exp(i \theta_k),$$

where $r_k = |v_k|$, with the convention that $\theta_k = 0$ whenever $r_k = 0$.

The sign vector of $v$, denoted by $\operatorname{sgn}(v)$, is defined as

$$\operatorname{sgn}(v) = \begin{bmatrix} \operatorname{sgn}(v_1) \\ \vdots \\ \operatorname{sgn}(v_N) \end{bmatrix},$$

where

$$\operatorname{sgn}(v_k) = \begin{cases} \exp(i \theta_k) & \text{if } r_k \neq 0; \\ 0 & \text{if } r_k = 0. \end{cases}$$

Theorem 12.9 ($\ell_1$ norm as product of a vector with its sign)

For any $v \in \mathbb{C}^N$:

$$\|v\|_1 = \operatorname{sgn}(v)^H v = \langle v, \operatorname{sgn}(v) \rangle.$$

Proof. This follows from:

$$\|v\|_1 = \sum_{k=1}^N r_k = \sum_{k=1}^N \left[ r_k e^{i \theta_k} \right] e^{-i \theta_k} = \sum_{k=1}^N v_k e^{-i \theta_k} = \operatorname{sgn}(v)^H v.$$

Note that whenever $v_k = 0$, the corresponding $0$ entry in $\operatorname{sgn}(v)$ has no effect on the sum.
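A small numerical check of this identity; the helper `csgn` (our own name) implements the complex sign vector of Definition 12.9:

```python
import numpy as np

def csgn(v):
    """Complex sign vector: v_k / |v_k| for non-zero entries, 0 otherwise."""
    out = np.zeros(v.shape, dtype=complex)
    nz = v != 0
    out[nz] = v[nz] / np.abs(v[nz])
    return out

v = np.array([3 - 4j, 0, 1j, -2])
print(np.vdot(csgn(v), v))      # (8+0j): np.vdot conjugates its first argument
print(np.linalg.norm(v, 1))     # 8.0, matching sgn(v)^H v
```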

Theorem 12.10 (Equivalence of $\ell_1$ and $\ell_2$ norms)

Suppose $v \in \mathbb{C}^N$. Then

$$\|v\|_2 \le \|v\|_1 \le \sqrt{N} \|v\|_2.$$

Proof. For the lower bound, we proceed as follows:

$$\|v\|_2^2 = \sum_{i=1}^N |v_i|^2 \le \sum_{i=1}^N |v_i|^2 + 2 \sum_{i < j} |v_i| |v_j| = \left( \sum_{i=1}^N |v_i| \right)^2 = \|v\|_1^2.$$

This gives us

$$\|v\|_2 \le \|v\|_1.$$

We can write the $\ell_1$ norm as

$$\|v\|_1 = \langle v, \operatorname{sgn}(v) \rangle.$$

By the Cauchy-Schwarz inequality we have

$$\langle v, \operatorname{sgn}(v) \rangle \le \|v\|_2 \, \|\operatorname{sgn}(v)\|_2.$$

Since $\operatorname{sgn}(v)$ can have at most $N$ non-zero entries, each with magnitude $1$,

$$\|\operatorname{sgn}(v)\|_2^2 \le N \implies \|\operatorname{sgn}(v)\|_2 \le \sqrt{N}.$$

Thus, we get

$$\|v\|_1 \le \sqrt{N} \|v\|_2.$$

Theorem 12.11 (Equivalence of $\ell_2$ and $\ell_\infty$ norms)

Let $v \in \mathbb{C}^N$. Then

$$\|v\|_2 \le \sqrt{N} \|v\|_\infty.$$

Proof. This follows from:

$$\|v\|_2^2 = \sum_{i=1}^N |v_i|^2 \le N \max_{1 \le i \le N} \left( |v_i|^2 \right) = N \|v\|_\infty^2.$$

Thus

$$\|v\|_2 \le \sqrt{N} \|v\|_\infty.$$
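Both equivalences are easy to sanity-check numerically on a random complex vector:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 100
v = rng.standard_normal(N) + 1j * rng.standard_normal(N)

l1, l2 = np.linalg.norm(v, 1), np.linalg.norm(v, 2)
linf = np.linalg.norm(v, np.inf)

print(l2 <= l1 <= np.sqrt(N) * l2)    # True (Theorem 12.10)
print(l2 <= np.sqrt(N) * linf)        # True (Theorem 12.11)
```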

Theorem 12.12 (Relationship between $p$-norms)

Let $v \in \mathbb{C}^N$ and let $1 \le p, q \le \infty$. Then

$$\|v\|_q \le \|v\|_p \quad \text{whenever } p \le q.$$

Proof. The case $q = \infty$ is immediate, since $\|v\|_\infty = \max_i |v_i| \le \left( \sum_{i=1}^N |v_i|^p \right)^{1/p} = \|v\|_p$. For $p \le q < \infty$, assume $v \neq 0$ (the claim is trivial otherwise) and let $u = v / \|v\|_p$, so that $\|u\|_p = 1$. Then $|u_i| \le 1$ for every $i$, hence $|u_i|^q \le |u_i|^p$, and

$$\|u\|_q^q = \sum_{i=1}^N |u_i|^q \le \sum_{i=1}^N |u_i|^p = 1.$$

Thus $\|u\|_q \le 1 = \|u\|_p$, and multiplying both sides by $\|v\|_p$ gives $\|v\|_q \le \|v\|_p$.

Theorem 12.13

Let $\mathbf{1} \in \mathbb{C}^N$ be the vector of all ones; i.e., $\mathbf{1} = (1, \dots, 1)$. Let $v \in \mathbb{C}^N$ be some arbitrary vector. Let $|v|$ denote the vector of absolute values of entries in $v$; i.e., $|v|_i = |v_i|$ for $1 \le i \le N$. Then

$$\|v\|_1 = \mathbf{1}^T |v| = \mathbf{1}^H |v|.$$

Proof. This follows from:

$$\mathbf{1}^T |v| = \sum_{i=1}^N |v|_i = \sum_{i=1}^N |v_i| = \|v\|_1.$$

Finally, since $\mathbf{1}$ consists only of real entries, its transpose and Hermitian transpose are the same.

Theorem 12.14

Let $\mathbf{1} \in \mathbb{C}^{N \times N}$ be the square matrix of all ones. Let $v \in \mathbb{C}^N$ be some arbitrary vector. Then

$$|v|^T \mathbf{1} |v| = \|v\|_1^2.$$

Proof. We know that

$$\mathbf{1} = \mathbf{1} \mathbf{1}^T,$$

where on the right-hand side $\mathbf{1}$ denotes the all-ones vector. Thus,

$$|v|^T \mathbf{1} |v| = |v|^T \mathbf{1} \mathbf{1}^T |v| = \left( \mathbf{1}^T |v| \right)^T \left( \mathbf{1}^T |v| \right) = \|v\|_1 \|v\|_1 = \|v\|_1^2.$$

We used the fact that $\|v\|_1 = \mathbf{1}^T |v|$ (Theorem 12.13).

Theorem 12.15 (An upper bound on the $k$-th largest value)

The $k$-th largest entry (in magnitude) of a vector $x \in \mathbb{C}^N$, denoted by $x_{(k)}$, obeys

$$|x_{(k)}| \le \frac{\|x\|_1}{k}. \tag{12.20}$$

Proof. Let $n_1, n_2, \dots, n_N$ be a permutation of $\{1, 2, \dots, N\}$ such that

$$|x_{n_1}| \ge |x_{n_2}| \ge \dots \ge |x_{n_N}|.$$

Thus, the $k$-th largest entry of $x$ is $x_{n_k}$. It is clear that

$$\|x\|_1 = \sum_{i=1}^N |x_i| = \sum_{i=1}^N |x_{n_i}|.$$

Obviously

$$|x_{n_1}| \le \sum_{i=1}^N |x_{n_i}| = \|x\|_1.$$

Similarly

$$k |x_{n_k}| = \underbrace{|x_{n_k}| + \dots + |x_{n_k}|}_{k \text{ times}} \le |x_{n_1}| + \dots + |x_{n_k}| \le \sum_{i=1}^N |x_{n_i}| = \|x\|_1.$$

Thus

$$|x_{n_k}| \le \frac{\|x\|_1}{k}.$$
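A quick numerical verification of (12.20) for all $k$ at once:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(50)
sorted_mags = np.sort(np.abs(x))[::-1]    # |x_(1)| >= |x_(2)| >= ...
k = np.arange(1, x.size + 1)
print(np.all(sorted_mags <= np.linalg.norm(x, 1) / k))   # True
```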

12.5.6. Sparse Signals

In this subsection we explore some useful properties of $\Sigma_K$, the set of $K$-sparse signals in the standard basis of $\mathbb{C}^N$.

We recall that

$$\Sigma_K = \{ x \in \mathbb{C}^N \mid \|x\|_0 \le K \}.$$

We established before that this set is a union of $\binom{N}{K}$ subspaces of $\mathbb{C}^N$, each of which is determined by an index set $\Lambda \subset \{1, \dots, N\}$ with $|\Lambda| = K$ choosing $K$ specific dimensions of $\mathbb{C}^N$.

We first present some results which connect the $\ell_1$, $\ell_2$ and $\ell_\infty$ norms of vectors in $\Sigma_K$.

Theorem 12.16 (Relation between norms of sparse vectors)

Suppose $u \in \Sigma_K$. Then

$$\|u\|_1 \le \sqrt{K} \|u\|_2 \le K \|u\|_\infty.$$

Proof. Due to Theorem 12.9, we can write the $\ell_1$ norm as

$$\|u\|_1 = \langle u, \operatorname{sgn}(u) \rangle.$$

By the Cauchy-Schwarz inequality we have

$$\langle u, \operatorname{sgn}(u) \rangle \le \|u\|_2 \, \|\operatorname{sgn}(u)\|_2.$$

Since $u \in \Sigma_K$, $\operatorname{sgn}(u)$ can have at most $K$ non-zero entries, each with magnitude $1$. Thus, we have

$$\|\operatorname{sgn}(u)\|_2^2 \le K \implies \|\operatorname{sgn}(u)\|_2 \le \sqrt{K}.$$

Thus we get the first inequality:

$$\langle u, \operatorname{sgn}(u) \rangle \le \sqrt{K} \|u\|_2 \implies \|u\|_1 \le \sqrt{K} \|u\|_2.$$

Now $|u_i| \le \max_i (|u_i|) = \|u\|_\infty$. So we have

$$\|u\|_2^2 = \sum_{i=1}^N |u_i|^2 \le K \|u\|_\infty^2,$$

since there are at most $K$ non-zero terms in the expansion of $\|u\|_2^2$. This establishes the second inequality:

$$\|u\|_2 \le \sqrt{K} \|u\|_\infty \implies \sqrt{K} \|u\|_2 \le K \|u\|_\infty.$$

12.5.7. Compressible Signals

In this subsection, we first look at some general results and definitions related to $K$-term approximations of arbitrary signals $x \in \mathbb{C}^N$. We then define the notion of a compressible signal and study properties related to it.

12.5.7.1. K-term Approximation of General Signals

Definition 12.10 (Restriction of a signal)

Let $x \in \mathbb{C}^N$. Let $T \subset \{1, 2, \dots, N\}$ be any index set. Further, let

$$T = \{ t_1, t_2, \dots, t_{|T|} \}$$

such that

$$t_1 < t_2 < \dots < t_{|T|}.$$

Let $x_T \in \mathbb{C}^{|T|}$ be defined as

$$x_T = \begin{pmatrix} x_{t_1} \\ x_{t_2} \\ \vdots \\ x_{t_{|T|}} \end{pmatrix}. \tag{12.21}$$

Then $x_T$ is a restriction of the signal $x$ on the index set $T$.

Alternatively, let $x_T \in \mathbb{C}^N$ be defined as

$$x_T(i) = \begin{cases} x(i) & \text{if } i \in T; \\ 0 & \text{otherwise.} \end{cases} \tag{12.22}$$

In other words, $x_T \in \mathbb{C}^N$ keeps the entries of $x$ indexed by $T$ while setting all other entries to $0$. Then we say that $x_T$ is obtained by masking $x$ with $T$.

As an abuse of notation, we will use either of the two definitions whenever we refer to $x_T$. The definition being used should be obvious from the context.
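Both versions are straightforward to implement; here is a minimal sketch with our own helper names `restrict` and `mask`, using 0-based NumPy indexing in place of the text's 1-based indices (it reproduces the example that follows):

```python
import numpy as np

def restrict(x, T):
    """Restriction x_T in C^{|T|}: keep only the entries indexed by T."""
    return x[np.sort(np.asarray(T))]

def mask(x, T):
    """Mask x_T in C^N: zero out every entry outside T."""
    out = np.zeros_like(x)
    idx = np.asarray(T)
    out[idx] = x[idx]
    return out

x = np.array([1, 5, 8, 0, 0, 3, 0, 0, 0, 0])
T = [0, 2, 6, 7]           # the set {1, 3, 7, 8} in 0-based indexing
print(restrict(x, T))      # [1 8 0 0]
print(mask(x, T))          # [1 0 8 0 0 0 0 0 0 0]
```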

Example 12.17 (Restrictions on index sets)

Let

$$x = (1, 5, 8, 0, 0, 3, 0, 0, 0, 0) \in \mathbb{C}^{10}.$$

Let

$$T = \{1, 3, 7, 8\}.$$

Then, using the mask version,

$$x_T = (1, 0, 8, 0, 0, 0, 0, 0, 0, 0) \in \mathbb{C}^{10}.$$

Since $|T| = 4$, sometimes we will also write the restriction version

$$x_T = (1, 8, 0, 0) \in \mathbb{C}^4.$$

Definition 12.11 (K-term signal approximation)

Let $x \in \mathbb{C}^N$ be an arbitrary signal. Consider any index set $T \subset \{1, \dots, N\}$ with $|T| = K$. Then $x_T$ is a $K$-term approximation of $x$.

Clearly, for any $x \in \mathbb{C}^N$ there are $\binom{N}{K}$ possible $K$-term approximations of $x$.

Example 12.18 (K-term approximation)

Let

$$x = (1, 5, 8, 0, 0, 3, 0, 0, 0, 0) \in \mathbb{C}^{10}.$$

Let $T = \{1, 6\}$. Then

$$x_T = (1, 0, 0, 0, 0, 3, 0, 0, 0, 0)$$

is a $2$-term approximation of $x$.

If we choose $T = \{7, 8, 9, 10\}$, the corresponding $4$-term approximation of $x$ is

$$x_T = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0).$$

Definition 12.12 (K-largest entries approximation)

Let $x \in \mathbb{C}^N$ be an arbitrary signal. Let $\lambda_1, \dots, \lambda_N$ be indices of entries in $x$ such that

$$|x_{\lambda_1}| \ge |x_{\lambda_2}| \ge \dots \ge |x_{\lambda_N}|.$$

In case of ties, the order is resolved lexicographically; i.e., if $|x_i| = |x_j|$ and $i < j$, then $i$ will appear first in the sequence $\{\lambda_k\}$.

Consider the index set $\Lambda_K = \{\lambda_1, \lambda_2, \dots, \lambda_K\}$. The restriction of $x$ on $\Lambda_K$, given by $x_{\Lambda_K}$, keeps the $K$ largest entries of $x$ (in magnitude) while setting all other entries to $0$. This is known as the $K$-largest entries approximation of $x$.

This signal is denoted henceforth as $x|_K$; i.e.,

$$x|_K = x_{\Lambda_K}, \tag{12.23}$$

where $\Lambda_K$ is the index set corresponding to the $K$ largest entries of $x$ (magnitude wise).
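Computing $x|_K$ is a hard-thresholding operation: sort by magnitude and keep the top $K$ entries. A minimal sketch (helper name our own), using a stable sort so that ties resolve lexicographically as in the definition:

```python
import numpy as np

def largest_entries_approximation(x, K):
    """Return x|_K: keep the K entries of largest magnitude, zero the rest.
    A stable argsort resolves ties lexicographically, as in Definition 12.12."""
    order = np.argsort(-np.abs(x), kind="stable")
    out = np.zeros_like(x)
    out[order[:K]] = x[order[:K]]
    return out

x = np.array([1, 5, 8, 0, 0, 3, 0, 0, 0, 0])
print(largest_entries_approximation(x, 2))    # [0 5 8 0 0 0 0 0 0 0]
```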

Example 12.19 (Largest entries approximation)

Let

$$x = (1, 5, 8, 0, 0, 3, 0, 0, 0, 0).$$

Then

$$x|_1 = (0, 0, 8, 0, 0, 0, 0, 0, 0, 0),$$

$$x|_2 = (0, 5, 8, 0, 0, 0, 0, 0, 0, 0),$$

$$x|_3 = (0, 5, 8, 0, 0, 3, 0, 0, 0, 0),$$

$$x|_4 = x.$$

All further $K$-largest entries approximations are the same as $x$.

A pertinent question at this point is: which $K$-term approximation of $x$ is the best one? Certainly, in order to compare two approximations we need some criterion. Let us choose the $\ell_p$ norm as the criterion. The next theorem characterizes the best $K$-term approximations in the $\ell_p$ norm sense.

Theorem 12.17 (Best $K$-term approximation for $\ell_p$ norms)

Let $x \in \mathbb{C}^N$. Let a best $K$-term approximation of $x$ be obtained by the following optimization program:

$$\begin{aligned} & \underset{T \subset \{1, \dots, N\}}{\text{maximize}} & & \|x_T\|_p \\ & \text{subject to} & & |T| = K, \end{aligned} \tag{12.24}$$

where $p \in [1, \infty]$.

Let an optimal solution of this optimization problem be denoted by $x_T$. Then

$$\| x|_K \|_p = \| x_T \|_p;$$

i.e., the $K$-largest entries approximation of $x$ is an optimal solution to (12.24).

Proof. For $p = \infty$, the result is obvious. In the following, we focus on $p \in [1, \infty)$.

We note that maximizing $\|x_T\|_p$ is equivalent to maximizing $\|x_T\|_p^p$.

Let $\lambda_1, \dots, \lambda_N$ be indices of entries in $x$ such that

$$|x_{\lambda_1}| \ge |x_{\lambda_2}| \ge \dots \ge |x_{\lambda_N}|.$$

Further, let $\{\omega_1, \dots, \omega_N\}$ be any permutation of $\{1, \dots, N\}$. Clearly

$$\| x|_K \|_p^p = \sum_{i=1}^K |x_{\lambda_i}|^p \ge \sum_{i=1}^K |x_{\omega_i}|^p.$$

Thus, if $T$ corresponds to an optimal solution of (12.24), then

$$\| x|_K \|_p^p = \| x_T \|_p^p.$$

Thus $x|_K$ is an optimal solution to (12.24).

This result establishes that whenever we are looking for a best $K$-term approximation of $x$ under any $\ell_p$ norm, all we have to do is pick the $K$ largest entries of $x$.

Definition 12.13 (Restriction of a matrix)

Let $\Phi \in \mathbb{C}^{M \times N}$. Let $T \subset \{1, 2, \dots, N\}$ be any index set. Further, let

$$T = \{ t_1, t_2, \dots, t_{|T|} \}$$

such that

$$t_1 < t_2 < \dots < t_{|T|}.$$

Let $\Phi_T \in \mathbb{C}^{M \times |T|}$ be defined as

$$\Phi_T = \begin{bmatrix} \phi_{t_1} & \phi_{t_2} & \dots & \phi_{t_{|T|}} \end{bmatrix}. \tag{12.25}$$

Then $\Phi_T$ is a restriction of the matrix $\Phi$ on the index set $T$.

Alternatively, let $\Phi_T \in \mathbb{C}^{M \times N}$ be defined as

$$(\Phi_T)_i = \begin{cases} \phi_i & \text{if } i \in T; \\ 0 & \text{otherwise.} \end{cases} \tag{12.26}$$

In other words, $\Phi_T \in \mathbb{C}^{M \times N}$ keeps the columns of $\Phi$ indexed by $T$ while setting all other columns to $0$. Then we say that $\Phi_T$ is obtained by masking $\Phi$ with $T$.

As an abuse of notation, we will use either of the two definitions whenever we refer to $\Phi_T$. The definition being used should be obvious from the context.

Theorem 12.18

Let $\operatorname{supp}(x) = \Lambda$. Then

$$\Phi x = \Phi_\Lambda x_\Lambda.$$

Proof. This follows from:

$$\Phi x = \sum_{i=1}^N x_i \phi_i = \sum_{\lambda_i \in \Lambda} x_{\lambda_i} \phi_{\lambda_i} = \Phi_\Lambda x_\Lambda.$$

The result remains valid whether we use the restriction or the mask version of the $x_\Lambda$ notation, as long as the same version is used for both $\Phi$ and $x$.
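A quick numerical check of Theorem 12.18, using the restriction versions of both $\Phi_\Lambda$ (column selection) and $x_\Lambda$ (entry selection) on a random matrix:

```python
import numpy as np

rng = np.random.default_rng(4)
M, N, K = 5, 12, 3
Phi = rng.standard_normal((M, N))

x = np.zeros(N)
support = np.sort(rng.choice(N, size=K, replace=False))
x[support] = rng.standard_normal(K)

# Restriction versions: M x K matrix times length-K coefficient vector.
print(np.allclose(Phi @ x, Phi[:, support] @ x[support]))   # True
```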

Corollary 12.1

Let $S$ and $T$ be two disjoint index sets such that for some $x \in \mathbb{C}^N$

$$x = x_T + x_S,$$

using the mask version of the $x_\Lambda$ notation. Then the following holds:

$$\Phi x = \Phi_T x_T + \Phi_S x_S.$$

Proof. This is a straightforward application of Theorem 12.18:

$$\Phi x = \Phi x_T + \Phi x_S = \Phi_T x_T + \Phi_S x_S.$$

Theorem 12.19

Let $T$ be any index set. Let $\Phi \in \mathbb{C}^{M \times N}$ and $y \in \mathbb{C}^M$. Then

$$\left[ \Phi^H y \right]_T = \Phi_T^H y.$$

Proof. Note that

$$\Phi^H y = \begin{bmatrix} \langle \phi_1, y \rangle \\ \vdots \\ \langle \phi_N, y \rangle \end{bmatrix}.$$

Now let

$$T = \{ t_1, \dots, t_K \}.$$

Then

$$\left[ \Phi^H y \right]_T = \begin{bmatrix} \langle \phi_{t_1}, y \rangle \\ \vdots \\ \langle \phi_{t_K}, y \rangle \end{bmatrix} = \Phi_T^H y.$$

The result remains valid whether we use the restriction or the mask version of the $\Phi_T$ notation.

12.5.7.2. Compressible Signals

We will now define the notion of a compressible signal in terms of the decay rate of the magnitudes of its entries when sorted in descending order.

Definition 12.14 (Compressible signal)

Let $x \in \mathbb{C}^N$ be an arbitrary signal. Let $\lambda_1, \dots, \lambda_N$ be indices of entries in $x$ such that

$$|x_{\lambda_1}| \ge |x_{\lambda_2}| \ge \dots \ge |x_{\lambda_N}|.$$

In case of ties, the order is resolved lexicographically; i.e., if $|x_i| = |x_j|$ and $i < j$, then $i$ will appear first in the sequence $\{\lambda_k\}$. Define

$$\hat{x} = (x_{\lambda_1}, x_{\lambda_2}, \dots, x_{\lambda_N}). \tag{12.27}$$

The signal $x$ is called $p$-compressible with magnitude $R$ if there exists $p \in (0, 1]$ such that

$$|\hat{x}_i| \le R \cdot i^{-\frac{1}{p}} \quad \forall \, i = 1, 2, \dots, N. \tag{12.28}$$
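The following sketch constructs a signal whose sorted magnitudes sit exactly on the envelope $R \, i^{-1/p}$, with random signs and a random permutation to hide the ordering, and then verifies (12.28):

```python
import numpy as np

rng = np.random.default_rng(5)
N, p, R = 1000, 0.5, 1.0
i = np.arange(1, N + 1)

# Magnitudes on the compressibility envelope R * i^(-1/p).
magnitudes = R * i ** (-1.0 / p)
x = rng.permutation(magnitudes * rng.choice([-1.0, 1.0], size=N))

x_hat = np.sort(np.abs(x))[::-1]             # sorted magnitudes, descending
print(np.all(x_hat <= R * i ** (-1.0 / p)))  # True: x is p-compressible
```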

Theorem 12.20 (1-compressible signals)

Let $x$ be $p$-compressible with $p = 1$. Then

$$\|x\|_1 \le R (1 + \ln(N)).$$

Proof. Recalling $\hat{x}$ from (12.27), it is straightforward to see that

$$\|x\|_1 = \|\hat{x}\|_1,$$

since the $\ell_1$ norm doesn't depend on the ordering of the entries of $x$.

Now, since $x$ is $1$-compressible, from (12.28) we have

$$|\hat{x}_i| \le R \frac{1}{i}.$$

This gives us

$$\|\hat{x}\|_1 \le \sum_{i=1}^N R \frac{1}{i} = R \sum_{i=1}^N \frac{1}{i}.$$

The sum on the R.H.S. is the $N$-th harmonic number $H_N$ (the sum of the reciprocals of the first $N$ natural numbers). A simple upper bound on harmonic numbers is

$$H_k \le 1 + \ln(k).$$

This completes the proof.

We now demonstrate how a compressible signal is well approximated by a sparse signal.

Theorem 12.21 (Sparse approximation of compressible signals)

Let $x$ be a $p$-compressible signal and let $x|_K$ be its best $K$-term approximation. Then the $\ell_1$ norm of the approximation error satisfies

$$\| x - x|_K \|_1 \le C_p \cdot R \cdot K^{1 - \frac{1}{p}} \tag{12.29}$$

with

$$C_p = \left( \frac{1}{p} - 1 \right)^{-1}.$$

Moreover, the $\ell_2$ norm of the approximation error satisfies

$$\| x - x|_K \|_2 \le D_p \cdot R \cdot K^{\frac{1}{2} - \frac{1}{p}} \tag{12.30}$$

with

$$D_p = \left( \frac{2}{p} - 1 \right)^{-1/2}.$$

Proof. Expanding the $\ell_1$ approximation error, we get

$$\| x - x|_K \|_1 = \sum_{i=K+1}^N |x_{\lambda_i}| \le R \sum_{i=K+1}^N i^{-\frac{1}{p}}.$$

We now bound the sum on the R.H.S. by an integral:

$$\sum_{i=K+1}^N i^{-\frac{1}{p}} \le \int_K^N x^{-\frac{1}{p}} \, dx \le \int_K^\infty x^{-\frac{1}{p}} \, dx.$$

Now, since $\frac{1}{p} > 1$ for $p < 1$, the integral converges and

$$\int_K^\infty x^{-\frac{1}{p}} \, dx = \left[ \frac{x^{1 - \frac{1}{p}}}{1 - \frac{1}{p}} \right]_K^\infty = \frac{K^{1 - \frac{1}{p}}}{\frac{1}{p} - 1} = C_p K^{1 - \frac{1}{p}}.$$

We can similarly show the result for the $\ell_2$ norm.

Example 12.20 (Sparse approximation for $\frac{1}{2}$-compressible signals)

Let $p = \frac{1}{2}$. Then

$$C_p = \left( \frac{1}{p} - 1 \right)^{-1} = 1 \quad \text{and} \quad D_p = \left( \frac{2}{p} - 1 \right)^{-1/2} = \frac{1}{\sqrt{3}}.$$

Hence

$$\| x - x|_K \|_1 \le \frac{R}{K}$$

and

$$\| x - x|_K \|_2 \le \frac{1}{\sqrt{3}} \, R \, K^{-\frac{3}{2}}.$$

Both the $\ell_1$ and $\ell_2$ approximation error bounds decrease as $K$ increases, with the $\ell_2$ bound decaying faster ($K^{-3/2}$ versus $K^{-1}$).
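We can check these rates numerically on the same $p = \frac{1}{2}$ model signal used earlier, with sorted entries $x_i = R \, i^{-2}$; the tail beyond the $K$-th entry is exactly $x - x|_K$:

```python
import numpy as np

N, R = 10_000, 1.0
i = np.arange(1, N + 1)
x = R * i ** (-2.0)          # a 1/2-compressible signal, already sorted

for K in (10, 100, 1000):
    tail = x[K:]             # x - x|_K keeps the entries beyond the K-th
    err1 = np.linalg.norm(tail, 1)
    err2 = np.linalg.norm(tail, 2)
    print(K, err1 <= R / K, err2 <= R / np.sqrt(3) * K ** -1.5)
# Each line prints True, True: both bounds hold and tighten as K grows.
```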