One challenge in classification is the curse of dimensionality: the number of samples required to properly estimate the underlying distribution of the data grows exponentially with the dimension of the data. Therefore, it can be reasonable and efficient to first perform dimensionality reduction and operate in the resulting subspace.
However, one issue that might arise is that we may lose information that is meaningful for classification. To illustrate, the figure below shows the effect of the projection direction on our ability to distinguish two groups of data.
Fig. 1: Projection direction that does not separate the two distributions well.
How could we find such a direction?
Let $\mathbf{w} \in \mathbb{R}^d$ be the direction that we aim to find; we also assume that $\|\mathbf{w}\| = 1$. Consider two groups of data $N_1$ and $N_2$, each a set of samples $\mathbf{x} \in \mathbb{R}^d$. The projection of a sample is $\mathbf{w}^T\mathbf{x}$. Note that we implicitly represent the bias term in the inner product, i.e. the last feature of every sample $\mathbf{x} \in N_1 \cup N_2$ is always 1.
From Fig. 1, we can observe that the direction that allows us to separate the two classes best is the one that maximizes the distance between the means ($\hat\mu_1, \hat\mu_2 \in \mathbb{R}$) of the data after projection. Moreover, the projection is more discriminative if the two distributions after projection are narrow; in other words, the variances $\sigma_1^2, \sigma_2^2$ of the projected distributions should be small. With this, we can construct the objective function:
$$J(\mathbf{w}) = \frac{(\hat\mu_1 - \hat\mu_2)^2}{\sigma_1^2 + \sigma_2^2},$$
where our goal is $\arg\max_{\mathbf{w}\in\mathbb{R}^d,\,\|\mathbf{w}\|=1} J(\mathbf{w})$. For the numerator, let $\boldsymbol{\mu}_1, \boldsymbol{\mu}_2 \in \mathbb{R}^d$ be the class means, so that $\hat\mu_i = \mathbf{w}^T\boldsymbol{\mu}_i$. We can rewrite it as
$$(\hat\mu_1 - \hat\mu_2)^2 = \big(\mathbf{w}^T(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)\big)^2 = \mathbf{w}^T \underbrace{(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T}_{\triangleq S_B} \mathbf{w},$$
where $S_B$ is called the between-class scatter matrix. Similarly, for the denominator, each projected variance can be written as
$$\sigma_i^2 = \sum_{\mathbf{x}\in N_i} (\mathbf{w}^T\mathbf{x} - \hat\mu_i)^2 = \mathbf{w}^T \underbrace{\Big(\sum_{\mathbf{x}\in N_i} (\mathbf{x} - \boldsymbol{\mu}_i)(\mathbf{x} - \boldsymbol{\mu}_i)^T\Big)}_{\triangleq S_i} \mathbf{w}.$$
With this derivation, we have $\sigma_1^2 + \sigma_2^2 = \mathbf{w}^T (S_1 + S_2) \mathbf{w}$. Let us define $S_W \triangleq S_1 + S_2$, called the within-class scatter matrix. Therefore, the objective function becomes
$$J(\mathbf{w}) = \frac{\mathbf{w}^T S_B \mathbf{w}}{\mathbf{w}^T S_W \mathbf{w}}.$$
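To make these quantities concrete, here is a minimal NumPy sketch (my own illustration on synthetic data; all names are hypothetical, and the bias feature mentioned above is omitted for simplicity) that estimates $S_W$ and $S_B$ and evaluates $J(\mathbf{w})$ for a candidate direction:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic two-class data in R^2 (illustrative only).
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))
X2 = rng.normal(loc=[3.0, 1.0], scale=1.0, size=(100, 2))

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)

# Within-class scatter: sum of the two class scatter matrices.
S1 = (X1 - mu1).T @ (X1 - mu1)
S2 = (X2 - mu2).T @ (X2 - mu2)
S_W = S1 + S2

# Between-class scatter: outer product of the mean difference.
diff = (mu1 - mu2).reshape(-1, 1)
S_B = diff @ diff.T

def J(w):
    """Fisher criterion J(w) = (w^T S_B w) / (w^T S_W w)."""
    return (w @ S_B @ w) / (w @ S_W @ w)

w = np.array([1.0, 0.0])          # an arbitrary candidate direction
print(J(w / np.linalg.norm(w)))
```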
Solving the optimization problem
Under the assumption that the underlying distribution of $\mathbf{x}$ is a multivariate Gaussian, $S_W$ is positive definite; therefore, we can use the Cholesky decomposition $S_W = R^T R$, where $R \in \mathbb{R}^{d\times d}$ is an upper-triangular matrix. Let us define $\mathbf{v} \triangleq R\mathbf{w}$ and rewrite $\mathbf{w}^T S_W \mathbf{w}$:
$$\mathbf{w}^T S_W \mathbf{w} = \mathbf{w}^T R^T R \mathbf{w} = (R\mathbf{w})^T (R\mathbf{w}) = \mathbf{v}^T\mathbf{v}.$$
Because $\mathbf{w} = R^{-1}\mathbf{v}$, we have
$$\mathbf{w}^T S_B \mathbf{w} = (R^{-1}\mathbf{v})^T S_B (R^{-1}\mathbf{v}) = \mathbf{v}^T \underbrace{(R^{-1})^T S_B R^{-1}}_{\triangleq A}\, \mathbf{v}.$$
With this reparameterization, the optimization problem is simplified to
$$\mathbf{v}^* = \arg\max_{\mathbf{v}\in\mathbb{R}^d,\,\|\mathbf{v}\|=1} \frac{\mathbf{v}^T A \mathbf{v}}{\mathbf{v}^T \mathbf{v}}.$$
This is the Rayleigh quotient, and it can be shown that the maximum is attained when $\mathbf{v}^*$ is the eigenvector corresponding to the largest eigenvalue $\lambda^*$ of $A$ (see the proof at the end of this post). Therefore, we have $A\mathbf{v}^* = \lambda^*\mathbf{v}^*$, which, after substituting $\mathbf{v}^* = R\mathbf{w}^*$, is equivalent to the generalized eigenvalue problem
$$S_B \mathbf{w}^* = \lambda^* S_W \mathbf{w}^*.$$
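Continuing the NumPy sketch above (a hypothetical illustration, not the only way to do this), this route can be carried out with a Cholesky factor and a symmetric eigensolver:

```python
# Cholesky factor: np.linalg.cholesky returns lower-triangular L with S_W = L @ L.T,
# so R = L.T is the upper-triangular factor with S_W = R.T @ R.
R = np.linalg.cholesky(S_W).T
R_inv = np.linalg.inv(R)

# A = (R^{-1})^T S_B R^{-1} is symmetric, so eigh (ascending eigenvalues) applies.
A = R_inv.T @ S_B @ R_inv
eigvals, eigvecs = np.linalg.eigh(A)

v_star = eigvecs[:, -1]        # eigenvector of the largest eigenvalue
w_star = R_inv @ v_star        # map back: w = R^{-1} v
w_star /= np.linalg.norm(w_star)
```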
Note here that $S_B$ is the outer product of the vector $\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2$; hence $\operatorname{rank}(S_B) = 1$. Therefore, $S_B\mathbf{w}^*$ is always in the direction of $\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2$. With this observation, we do not need to solve the eigenvalue problem above and can instead directly compute
$$\mathbf{w}^* \propto S_W^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2).$$
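In code (again continuing the sketch above), the closed form reduces to a single linear solve, and it agrees with the eigenvector route up to sign:

```python
# Closed form: w* ∝ S_W^{-1} (mu1 - mu2); use a linear solve instead of an explicit inverse.
w_closed = np.linalg.solve(S_W, mu1 - mu2)
w_closed /= np.linalg.norm(w_closed)

# Should match the eigenvector solution up to sign.
print(np.isclose(abs(w_closed @ w_star), 1.0))
```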
Fig. 2: Projection direction that maximizes the class separation and minimizes the variances of the distributions along the projection direction.
Decision Boundary
So far, what we have is a direction in space that maximally separates the two classes once they are projected onto it. To predict which class a sample belongs to, we can derive the discriminant function of this projection. Let $\omega_1$ and $\omega_2$ be the two classes we consider. We know that
$$P(\omega_i \mid \mathbf{x}) \propto p(\mathbf{x} \mid \omega_i)\, P(\omega_i).$$
Using Bayes' decision theory, we assign $\mathbf{x}$ to $\omega_1$ if $P(\omega_1 \mid \mathbf{x}) > P(\omega_2 \mid \mathbf{x})$ and to $\omega_2$ otherwise; hence the decision boundary lies at $P(\omega_1 \mid \mathbf{x}) - P(\omega_2 \mid \mathbf{x}) = 0$. Taking the logarithm, we have
$$\log p(\mathbf{x} \mid \omega_1) + \log P(\omega_1) - \log p(\mathbf{x} \mid \omega_2) - \log P(\omega_2) = 0.$$
Let us assume that after the projection the data of each class follows a (one-dimensional) Gaussian distribution with mean $\mu_i$ and standard deviation $\sigma_i$. Writing $z = \mathbf{w}^T\mathbf{x}$ for the projected sample, the equation reduces to
$$-\log\sigma_1 - \frac{(z - \mu_1)^2}{2\sigma_1^2} + \log P(\omega_1) + \log\sigma_2 + \frac{(z - \mu_2)^2}{2\sigma_2^2} - \log P(\omega_2) = 0.$$
If we assume that $P(\omega_1) = P(\omega_2)$, the decision boundary depends only on $\mu_1, \mu_2, \sigma_1, \sigma_2$, which can be estimated using maximum likelihood. Furthermore, one might additionally assume that $\sigma_1 = \sigma_2$, in which case the variance terms can be dropped from the equation. The decision boundary then becomes the midpoint between the $\mu_i$'s, which yields a linear decision boundary; this is Linear Discriminant Analysis (LDA). On the other hand, if we estimate the $\sigma_i$'s separately and use them, we obtain a quadratic decision boundary, called Quadratic Discriminant Analysis (QDA), i.e. it can curve to follow the data distributions.
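A minimal sketch of this 1-D discriminant (my own helper, not a library API; it assumes the projected samples and priors are given) could look like the following, where `equal_variance=True` corresponds to the LDA case and `False` to the QDA case:

```python
import numpy as np

def fit_1d_discriminant(z1, z2, equal_variance=True, p1=0.5, p2=0.5):
    """Fit per-class 1-D Gaussians to projected samples z_i = X_i @ w."""
    mu1, mu2 = z1.mean(), z2.mean()
    if equal_variance:                       # LDA: shared (pooled) variance
        var1 = var2 = np.concatenate([z1 - mu1, z2 - mu2]).var()
    else:                                    # QDA: per-class variances
        var1, var2 = z1.var(), z2.var()

    def decision(z):
        # log p(z|w1) + log P(w1) - log p(z|w2) - log P(w2); positive -> class 1
        g1 = -0.5 * np.log(var1) - (z - mu1) ** 2 / (2 * var1) + np.log(p1)
        g2 = -0.5 * np.log(var2) - (z - mu2) ** 2 / (2 * var2) + np.log(p2)
        return g1 - g2

    return decision

# Usage with the earlier sketch: decide = fit_1d_discriminant(X1 @ w_star, X2 @ w_star);
# a test point x is then assigned to class 1 when decide(x @ w_star) > 0.
```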
Below are decision boundaries derived from these two cases on three different datasets.
Fig. 3: Decision boundaries from LDA and QDA on three different datasets. Numbers on the bottom right of each plot are test set accuracies.
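A comparison in this spirit can be reproduced with scikit-learn's LinearDiscriminantAnalysis and QuadraticDiscriminantAnalysis; the datasets below are hypothetical stand-ins, not necessarily the ones used for the figure:

```python
from sklearn.datasets import make_blobs, make_moons
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.model_selection import train_test_split

# Hypothetical two-class datasets (the post's actual datasets are not specified here).
datasets = {
    "blobs": make_blobs(n_samples=400, centers=2, random_state=0),
    "moons": make_moons(n_samples=400, noise=0.2, random_state=0),
}

for name, (X, y) in datasets.items():
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    for clf in (LinearDiscriminantAnalysis(), QuadraticDiscriminantAnalysis()):
        acc = clf.fit(X_tr, y_tr).score(X_te, y_te)
        print(f"{name}: {clf.__class__.__name__} test accuracy = {acc:.3f}")
```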
Maximizing the Rayleigh quotient
Consider a symmetric and positive definite matrix $A \in \mathbb{R}^{d\times d}$. Denote by $\mathbf{v} \in \mathbb{R}^d$ the vector that we seek under the following objective:
$$\mathbf{v}^* = \arg\max_{\mathbf{v}\in\mathbb{R}^d} J(\mathbf{v}), \qquad J(\mathbf{v}) \triangleq \frac{\mathbf{v}^T A \mathbf{v}}{\mathbf{v}^T \mathbf{v}}.$$
Because $J(\mathbf{v})$ is invariant to the scale of $\mathbf{v}$, we only need to search over the unit sphere, i.e. $\|\mathbf{v}\| = 1$. Because $A$ is symmetric and positive definite, it is diagonalizable; in particular, $A = Q\Sigma Q^T$, where $Q$ is an orthogonal matrix and $\Sigma$ is a diagonal matrix of eigenvalues. We can write the numerator as follows:
$$\mathbf{v}^T A \mathbf{v} = \mathbf{v}^T Q \Sigma Q^T \mathbf{v} = (Q^T\mathbf{v})^T\, \Sigma\, \underbrace{(Q^T\mathbf{v})}_{\triangleq \mathbf{u}} = \mathbf{u}^T \Sigma \mathbf{u}.$$
We can see that
$$\|\mathbf{u}\|^2 = \mathbf{u}^T\mathbf{u} = \mathbf{v}^T Q Q^T \mathbf{v} = \mathbf{v}^T\mathbf{v} = 1.$$
Without loss of generality, let us assume that the eigenvalues of $A$ are ordered as $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d$. Because $\Sigma$ is a diagonal matrix, we know that
$$\mathbf{u}^T \Sigma \mathbf{u} = \sum_{i=1}^{d} \lambda_i u_i^2.$$
Since $\sum_{i} u_i^2 = 1$ and the eigenvalues are in descending order, this sum is maximized by putting all the weight on the first coordinate, i.e. $u_1^2 = 1$ and $u_i^2 = 0$ for all $i \neq 1$. Hence, $\max_{\mathbf{u}} \mathbf{u}^T \Sigma \mathbf{u} = \lambda_1$, attained at $\mathbf{u}^* = [1, 0, \dots, 0]^T$, the basis vector (after the rotation into the eigenbasis) corresponding to $\lambda_1$. Therefore, $\mathbf{v}^* = Q\mathbf{u}^*$, which is the eigenvector corresponding to $\lambda_1$. ◾️
Conversely, if the problem is a minimization, the minimum $\lambda_d$ is attained at the eigenvector corresponding to $\lambda_d$.
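As a quick numerical sanity check of this claim (a hypothetical example, not from the original post), one can compare a brute-force search over random unit vectors against the extreme eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a random symmetric positive definite matrix A.
B = rng.normal(size=(5, 5))
A = B @ B.T + 5 * np.eye(5)

eigvals, eigvecs = np.linalg.eigh(A)       # eigenvalues in ascending order

def rayleigh(v):
    return (v @ A @ v) / (v @ v)

# Random directions never exceed the largest eigenvalue ...
samples = rng.normal(size=(10000, 5))
print(max(rayleigh(v) for v in samples) <= eigvals[-1] + 1e-12)  # True

# ... and the top eigenvector attains it (the bottom one attains the minimum).
print(np.isclose(rayleigh(eigvecs[:, -1]), eigvals[-1]))         # True
print(np.isclose(rayleigh(eigvecs[:, 0]), eigvals[0]))           # True
```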