Conjugate Gradient 1

Gradient is one the key ingredients when it comes to optimization. In some cases, one might employ second order methods, i.e. using Hessian; however, it is not common in machine learning to use such methods because we simply have too many variables.

In this article, I am going to talk about the conjugate gradient method for optimizing quadratic functions, i.e.

f(x)=12xTAxbTxx=argminxf(x)\begin{aligned} f(x) &= \frac{1}{2} x^TAx - b^Tx \\ x^* &= \underset{x}\text{argmin} f(x) \end{aligned}

where ARn×nA \in \mathbb{R}^{n\times n} is symmetric positive semi-definite and bRnb \in \mathbb{R}^n. AA and bb are coefficients of the functions. In this setting, the method ensures that we will converge to the solution within nn steps, where nn is the dimensions of variables.

Steepest Gradient Descent

Let first revisit the gradient descent method. The method iteratively finds f(x)f(x)'s possible minimizer x^\hat x by using xi+1=xiαtf(xi),x_{i+1} = x_i - \alpha_t \nabla f(x_i), where αi\alpha_i is a learning rate at each iteration and f(xi)\nabla f(x_i) denotes f(x)\nabla f(x) evaluated at xix_{i}. Noting that, in general, the gradient descent update does not guarantee that x^=x\hat x = x^*, i.e. x^\hat x might be at some local optima.

Naively, one can set αi\alpha_i at some constate value, but a better way might be to choose the value based on the situation at hand. Consider performing a line search at each iteration to find the best value. In other word, we seek αi\alpha_i by finding the first derivative at f(xi+1)f(x_{i+1}) and set it to zero:

αix=xi+1f(x)=f(xi+1)T(αixi+αif(xi))=f(xi+1)Tf(xi)=!0.\begin{aligned} \frac{\partial}{\partial \alpha_i} \bigg|_{x=x_{i+1}} f(x) &= \nabla f(x_{i+1})^T \bigg( \frac{\partial}{\partial \alpha_i} x_i + \alpha_i \nabla f(x_i) \bigg) \\ &= \nabla f(x_{i+1})^T \nabla f(x_i) \\ &\overset{!}{=} 0. \\ \end{aligned}

This derivation tells us that, along the search direction f(xt)\nabla f(x_t), the optimal value of αt\alpha_t is the one leading to the location that two gradients are orthogonal. Solving the equation above for α\alpha

f(xi+1)Tf(xi)=(Axi+1b)Tf(xi)=(A(xiαf(xi))b)Tf(xi)=(AxiαAf(xi)b)Tf(xi)=(f(xi)αAf(xi))Tf(xi)α=f(xi)Tf(xi)f(xi)TAf(xi).\begin{aligned} \nabla f(x_{i+1})^T \nabla f(x_i) &= (Ax_{i+1} - b)^T\nabla f(x_i) \\ &= (A(x_i - \alpha \nabla f(x_i))-b)^T\nabla f(x_i) \\ &= (Ax_i - \alpha A \nabla f(x_i)-b)^T\nabla f(x_i) \\ &= (\nabla f(x_i) - \alpha A \nabla f(x_i))^T\nabla f(x_i) \\ \alpha &= \frac{\nabla f(x_i)^T \nabla f(x_i)}{\nabla f(x_i)^T A \nabla f(x_i)}. \end{aligned}

Residual and Error

Assume xx^* is known. Error at each step ete_t is

ei=xxi,e_i = x^* - x_i,

which simply tells us how far we are from the answer. In contrast, residual at each step is the difference between the minimum value and our current value:

ri=bAxi=AxAxi=Aei\begin{aligned} r_i &= b - Ax_i \\ &= Ax^* - Ax_i \\ &= Ae_i \end{aligned}

Combining the definitions of error and residual yields

ri=bAxi=AxAxi=A(xxi)=Aei.\begin{aligned} r_i &= b - Ax_i \\ &= Ax^* - A x_i \\ &= A(x^* - x_i) \\ &= Ae_i. \end{aligned}

Conjugate Directions

One problem of Steepest Descent is that every move inteferes previous moves; hence slow convergence. It would be good to have search directions that do not affect each other.

Consider an update

xi+1=xi+αidi,x_{i+1} = x_i + \alpha_i d_i,

where αi\alpha_i is update step and did_i is a search direction. We can to find x^\hat x in nn updates (using nn directions d1,,dnd_1, \dots, d_n), if these directions are orthogonal to each other

d0d1d2dn.d_0 \perp d_1 \perp d_2 \perp \dots \perp d_n.

Therefore, the error at each step ete_t is also orthogonal to previous search directions, e.g.

e1d0e2d1d0.\begin{aligned} e_1 &\perp d_0 \\ e_2 &\perp d_1 \perp d_0. \end{aligned}

Using this fact, we have

diTei+1=diT(xxi+1)=diT(x(xi+αidi))=diT(eiαidi)=0\begin{aligned} d_i^T e_{i+1} &= d_i^T(x^* - x_{i+1}) \\ &= d_i^T(x^* - (x_i + \alpha_i d_i)) \\ &= d_i^T(e_i - \alpha_i d_i) \\ &= 0 \end{aligned}

The derivation above yields αi=diTeidiTdi.\alpha_i = \frac{d_i^T e_i}{d_i^T d_i}.

A-Orthogonal (Conjugacy)

However, as we do not know xx^* so do αi\alpha_i. Instead of requiring orthogonality diTdjd_i^T d_j, we construct search directions that are orthogonal under the transformation of the matrix AA. In other word, we want did_i and djd_j to be A-Conjugate (Orthogonal):

diTAdj=0.d_i^T A d_j = 0.

If we find the minimum along the search direction did_i, we have

αif(xi+1)=f(xi+1)T(αixi+1)=ri+1Tdi=(Aei+1)Tdi=!0\begin{aligned} \frac{\partial}{\partial \alpha_i} f(x_{i+1}) &= \nabla f(x_{i+1})^T \bigg (\frac{\partial}{\partial \alpha_i} x_{i+1} \bigg ) \\ &= - r_{i+1}^T d_{i} \\ &= - (A e_{i+1})^T d_{i} \\ &\overset{!}{=} 0 \end{aligned}

Using this fact, we have

diTAei+1=diTA(eiαidi)=!0    αi=diTAeidiTAdi=diTridiTAdi.\begin{aligned} d_{i}^T A e_{i+1} &= d_i^T A (e_i - \alpha_i d_i) \\ &\overset{!}{=} 0 \\ \implies \alpha_i &= \frac{d_i^T A e_i}{d_i^T A d_i} \\ &= \frac{d_i^T r_i}{d_i^T A d_i}. \end{aligned}

Now, αi\alpha_i is computable. By construction, taking the αidi\alpha_i d_i step ensures that diTri+1=0d_i^Tr_{i+1}=0. The only thing that we need to find is the set of A-conjugate search directions {di}\{d_i\}.

Gram-Schmidt Conjugation

Consider linearly independent vectors u0,u1,,uj1u_0, u_1, \dots, u_{j-1}. One can construct A-orthogonal basis {d0,d1,,dj1}\{d_0, d_1, \dots, d_{j-1}\} from these vectors using (Conjugate) Gram-Schmidt Process

dj=ujk=1j1βjkdk,d_j = u_j - \sum_{k=1}^{j-1} \beta_{jk} d_k,

where βjk\beta_{jk} is a scaling factor that we seek to find. One issue immediately arises: How should one construct uju_j? It turns out that we can use rjr_j as uju_j (Proof). Therefore, we have

dj=rjk=1j1βjkdk.(1)d_j = r_j - \sum_{k=1}^{j-1} \beta_{jk} d_k. \tag{1}

Based on our conjugatecy requirement, this new direction djd_j should be A-orthogonal to all previous search directions did_i for i<ji < j, hence

0=!diTAdj=diTA(rjk=1j1βjkdk)=diTArjk=1j1βjkdiTAdk=diTArjβjidiTAdi.\begin{aligned} 0 \overset{!}{=} d_i^T A d_j &= d_i^TA ( r_j - \sum_{k=1}^{j-1} \beta_{jk} d_k ) \\ &= d_i^TA r_j - \sum_{k=1}^{j-1} \beta_{jk} d_i^T A d_k \\ &= d_i^TA r_j - \beta_{ji} d_i^T A d_i. \end{aligned}

Therefore, we have βji=diTArjdiTAdi.\beta_{ji} = \frac{d_i^T A r_j}{d_i^T A d_i}.

The derivation above shows how we construct A-orthogonal search directions. However, another problem is that the contruction uses all previous search directions in the process, hence large memory complexity. Fortunately, we can make this memory aspect more efficient.

Consider that diTrj=0d_i ^T r_j = 0 for i<j\forall i < j (Proof). Multiplying rjTr_{j}^T to Equation (1) yields

0=!rjTdi=rjT(rik=0i1βikdk)=rjTrik=0i1βikrjTdk.\begin{aligned} 0 \overset{!}{=}r_{j}^T d_i &= r_{j}^T \bigg( r_i - \sum_{k=0}^{i-1} \beta_{ik} d_k \bigg) \\ &= r_{j}^Tr_i - \sum_{k=0}^{i-1} \beta_{ik} \cancel{r_{j}^T d_k}. \\ \end{aligned}

Essentially, the derivation above tells us that riTrj=0,i<jr_i^T r_j = 0, \forall i < j. Using this fact, one can derive

riTrj+1=riT(rjαjAdj)=riTrjαjriTAdj.\begin{aligned} r_{i}^T r_{j+1} &= r_i^T (r_j - \alpha_jA d_j ) \\ &= r_i^T r_j - \alpha_j r_i^T A d_j. \end{aligned}

Because riTrj=0r_i^T r_j = 0 if iji \ne j,

riTAdj={1αiriTri;  i=j1αi1riTri;  i=j+10;  otherwiser_i^T A d_j = \begin{cases} \frac{1}{\alpha_i} r_i^T r_i &; \ \ i = j \\ - \frac{1}{\alpha_{i-1}} r_i^T r_i &; \ \ i = j + 1 \\ 0 &; \ \ \text{otherwise} \\ \end{cases}

Recall that βik=riTAdkdkTAdk\beta_{ik} = \frac{r_i^T A d_k}{d_k^T A d_k} for k=0,,i1k = 0, \dots, i-1. Because riTAdk=0r_i^T A d_k = 0 when i>k+1i > k+1, we have

di=rik=0i1βikdk=riβi,i1di1=ri+(riTriαi1di1TAdi1)βidi1.\begin{aligned} d_i &= r_i - \sum_{k=0}^{i-1} \beta_{ik} d_k \\ &= r_i - \beta_{i,i-1} d_{i-1} \\ &= r_i + \underbrace{ \bigg( \frac{r_i^Tr_i}{\alpha_{i-1} d_{i-1}^T A d_{i-1}} \bigg) }_{\beta_i } d_{i-1}. \end{aligned}

Because αi1=di1Tri1di1Adi1\alpha_{i-1} = \frac{d_{i-1}^Tr_{i-1}}{d_{i-1}A d_{i-1}}, we can derive

βi=riTriαi1di1TAdi1=riTridi1Tri1.\begin{aligned} \beta_{i} &= \frac{r_{i}^Tr_{i}}{\alpha_{i-1} d_{i-1}^T A d_{i-1} } \\ &= \frac{r_{i}^T r_{i}}{d_{i-1}^Tr_{i-1}}. \end{aligned}

Using diTrj=0,i<jd_i ^T r_j = 0, \forall i < j yields

di=ri+βidi1riTdi=riT(ri+βidi1)=riTri+βiriTdi1\begin{aligned} d_{i} &= r_{i} + \beta_{i} d_{i-1} \\ r_i^T d_{i} &= r_i^T( r_{i} + \beta_{i} d_{i-1}) \\ &= r_i^Tr_{i} + \beta_{i} \cancel{r_i^T d_{i-1}} \end{aligned}

Therefore, we have

βi=riTriri1Tri1.\beta_{i} = \frac{r_{i}^T r_{i}}{r_{i-1}^Tr_{i-1}}.

Summary: Conjugate Gradient Algorithm

Consider a random starting point x0x_0 and r0=d0=f(x0)r_0 = d_0 = -\nabla f(x_0). The update step for the conjugate gradient algorithm contains

αi=riTridiTAdixi+1=xi+αidiri+1=riαiAdiβi+1=ri+1Tri+1riTridi+1=ri+1+βi+1di\begin{aligned} \alpha_i &= \frac{r_i^T r_i}{d_i^T A d_i} \\ x_{i+1} &= x_{i} + \alpha_i d_i \\ r_{i+1} &= r_i - \alpha_i A d_i \\ \beta_{i+1} &= \frac{r_{i+1}^T r_{i+1}}{r_{i}^Tr_i} \\ d_{i+1} &= r_{i+1} + \beta_{i+1} d_i \end{aligned}


Othogonality of residual and search directions

We know that

e1=e0α0d0e2=e1α1d1=e0α0d0α1d1  ej=e0k=0j1αjdj.\begin{aligned} e_1 &= e_0 - \alpha_0 d_0 \\ e_2 &= e_1 - \alpha_1 d_1 \\ &= e_0 - \alpha_0 d_0 - \alpha_1 d_1 \\ &\ \ \vdots \\ e_j &= e_0 - \sum_{k=0}^{j-1} \alpha_j d_j. \end{aligned}

Assume that e0=k=0n1δkdke_0 = \sum_{k=0}^{n-1} \delta_k d_k. Multiplying the equation by djTAd_j^TA yields

djTAe0=k=0n1δkdjTAdk=δjdjTAdj    δj=djTAe0djTAdj=djTA(ej+k=0j1(αkdk))djTAdj=djTAej+k=0j1αkdjTAdkdjTAdj=djTrjdjTAdj=αj.\begin{aligned} d_j^T A e_0 &= \sum_{k=0}^{n-1} \delta_k d_j^T A d_k \\ &= \delta_j d_j^T A d_j \\ \implies \delta_j &= \frac{d_j^T A e_0}{d_j^T A d_j} \\ &= \frac{d_j^T A (e_j + \sum_{k=0}^{j-1}(\alpha_k d_k))}{d_j^T A d_j} \\ &= \frac{d_j^T A e_j + \sum_{k=0}^{j-1}\alpha_k \cancel{d_j^T A d_k}}{d_j^T A d_j} \\ &= \frac{d_j^T r_j }{d_j^T A d_j} \\ &= \alpha_j. \end{aligned}


ej=k=0n1αkdkk=0j1αkdk=k=jn1αkdk\begin{aligned} e_j &= \sum_{k=0}^{n-1} \alpha_k d_k - \sum_{k=0}^{j-1} \alpha_k d_k \\ &= \sum_{k=j}^{n-1} \alpha_k d_k \end{aligned}

Consider i<ji < j and multiplying the equation above by diTAd_i^TA. We have

diTAej=k=jn1αkdiTAdk=0.\begin{aligned} d_i^T A e_j &= \sum_{k=j}^{n-1} \alpha_k \cancel{d_i^T A d_k} \\ &= 0. \end{aligned}

Therefore, we prove that diTrj=0,i<jd_i^T r_j = 0, \forall i < j. ■

Conjugate Gradient needs only n steps

Consider the following identity (from the previous proof)

ej=k=0n1αkdkk=0j1αkdk.\begin{aligned} e_j &= \sum_{k=0}^{n-1} \alpha_k d_k - \sum_{k=0}^{j-1} \alpha_k d_k. \end{aligned}

One can see that the error after nn step (ene_n) is zero. ■

Independency of residuals

From the Gram-Schmidt construction, we can show that

ri=di+βidi1rjTri=rjT(di+βidi1)=rjTdi+βirjTdi1)=0.\begin{aligned} r_i &= d_i + \beta_i d_{i-1} \\ r_j^T r_i &= r_j^T(d_i + \beta_i d_{i-1}) \\ &= \cancel{r_j^Td_i} + \beta_i \cancel{r_j^T d_{i-1})} \\ &=0. \end{aligned}


This article is based on content from

Figures are made from Google Colab.