One challenge in classification is the curse of dimensionality: the number of samples required to properly estimate the underlying distribution of the data grows exponentially with the dimension of the data. Therefore, it can be reasonable and efficient to first perform dimensionality reduction and operate in that subspace.
However, one issue that might arise is that we might lose information that is meaningful for classification. To illustrate, the figure below shows the effect of projection on distinguishing two groups of data.
Fig. 1: Projection direction that does not separate the two distributions well.
How could we find such a direction?
Let $\mathbf{w}$ be the direction that we aim to find; we also assume that $\|\mathbf{w}\| = 1$. Consider two groups of data $\mathcal{C}_1$ and $\mathcal{C}_2$ with samples $\mathbf{x}_i$; the projection of these samples is $z_i = \mathbf{w}^\top \mathbf{x}_i$. Note here that we implicitly represent the bias term in the inner product, i.e. the last feature of these samples is always 1.
From Fig. 1, we can observe that the direction that would allow us to separate the two classes best is the one that maximizes the distance between the means ($m_1$ and $m_2$, where $m_k = \mathbf{w}^\top \boldsymbol{\mu}_k$ and $\boldsymbol{\mu}_k$ is the mean of class $\mathcal{C}_k$) of these data after projection. Moreover, the projection would be more discriminative if the two distributions after projection are narrow; in other words, the variances of these distributions should be small. With this, we can construct the objective function:
$$J(\mathbf{w}) = \frac{(m_2 - m_1)^2}{s_1^2 + s_2^2},$$

where our goal is $\mathbf{w}^* = \arg\max_{\mathbf{w}} J(\mathbf{w})$, and $s_k^2 = \sum_{i \in \mathcal{C}_k} (z_i - m_k)^2$ is the scatter of class $\mathcal{C}_k$ after projection. For the numerator, we can rewrite it as

$$(m_2 - m_1)^2 = \left(\mathbf{w}^\top(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)\right)^2 = \mathbf{w}^\top (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)^\top \mathbf{w} = \mathbf{w}^\top \mathbf{S}_B \mathbf{w},$$
where $\mathbf{S}_B = (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)^\top$ is called the between-class covariance matrix. Similarly, we can expand the denominator term. Let's look at $s_k^2$: we can complete the square and get the outer product:

$$s_k^2 = \sum_{i \in \mathcal{C}_k} (z_i - m_k)^2 = \sum_{i \in \mathcal{C}_k} \left(\mathbf{w}^\top \mathbf{x}_i - \mathbf{w}^\top \boldsymbol{\mu}_k\right)^2 = \mathbf{w}^\top \left[\sum_{i \in \mathcal{C}_k} (\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^\top\right] \mathbf{w} = \mathbf{w}^\top \mathbf{S}_k \mathbf{w}.$$

With this derivation, we have $s_1^2 + s_2^2 = \mathbf{w}^\top (\mathbf{S}_1 + \mathbf{S}_2) \mathbf{w}$. Let us define $\mathbf{S}_W = \mathbf{S}_1 + \mathbf{S}_2$, called the within-class covariance matrix. Therefore, the objective function becomes

$$J(\mathbf{w}) = \frac{\mathbf{w}^\top \mathbf{S}_B \mathbf{w}}{\mathbf{w}^\top \mathbf{S}_W \mathbf{w}}.$$
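To make these quantities concrete, here is a minimal NumPy sketch (the two toy clusters and the candidate direction are made up for illustration) that computes $\mathbf{S}_B$, $\mathbf{S}_W$, and evaluates $J(\mathbf{w})$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two toy 2-D classes (hypothetical data, for illustration only).
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))
X2 = rng.normal(loc=[3.0, 1.0], scale=1.0, size=(100, 2))

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)

# Between-class covariance matrix: outer product of the mean difference.
d = (mu2 - mu1).reshape(-1, 1)
S_B = d @ d.T

# Within-class covariance matrix: sum of the per-class scatter matrices.
S_W = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)

def fisher_criterion(w):
    """J(w) = (w^T S_B w) / (w^T S_W w)."""
    return (w @ S_B @ w) / (w @ S_W @ w)

# Evaluate J for an arbitrary candidate direction.
w = np.array([1.0, 0.0])
print(fisher_criterion(w))
```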
Solving the optimization problem
With the assumption that the underlying distribution of each class is a multivariate Gaussian distribution, $\mathbf{S}_W$ is positive definite; therefore, we can use the Cholesky decomposition $\mathbf{S}_W = \mathbf{R}^\top \mathbf{R}$, where $\mathbf{R}$ is an upper-triangular matrix. Let us define $\mathbf{u} = \mathbf{R}\mathbf{w}$ and rewrite $J(\mathbf{w})$:

$$J(\mathbf{w}) = \frac{\mathbf{w}^\top \mathbf{S}_B \mathbf{w}}{\mathbf{w}^\top \mathbf{R}^\top \mathbf{R} \mathbf{w}} = \frac{\mathbf{w}^\top \mathbf{S}_B \mathbf{w}}{\mathbf{u}^\top \mathbf{u}}.$$

Because $\mathbf{w} = \mathbf{R}^{-1}\mathbf{u}$, we have

$$J(\mathbf{u}) = \frac{\mathbf{u}^\top (\mathbf{R}^{-1})^\top \mathbf{S}_B \mathbf{R}^{-1} \mathbf{u}}{\mathbf{u}^\top \mathbf{u}} = \frac{\mathbf{u}^\top \mathbf{A} \mathbf{u}}{\mathbf{u}^\top \mathbf{u}}, \quad \text{where } \mathbf{A} = (\mathbf{R}^{-1})^\top \mathbf{S}_B \mathbf{R}^{-1}.$$

With this reparameterization, the optimization problem is simplified to

$$\mathbf{u}^* = \arg\max_{\mathbf{u}} \frac{\mathbf{u}^\top \mathbf{A} \mathbf{u}}{\mathbf{u}^\top \mathbf{u}}.$$
This is the Rayleigh quotient, and it can be shown that the maximum is attained when $\mathbf{u}$ is the eigenvector corresponding to the largest eigenvalue $\lambda_1$ of $\mathbf{A}$ (see the proof at the end of this post). Therefore, we have

$$\mathbf{A}\mathbf{u} = (\mathbf{R}^{-1})^\top \mathbf{S}_B \mathbf{R}^{-1} \mathbf{u} = \lambda_1 \mathbf{u},$$

which, after substituting $\mathbf{u} = \mathbf{R}\mathbf{w}$ and multiplying both sides by $\mathbf{R}^\top$, is equivalent to the generalized eigenvalue problem $\mathbf{S}_B \mathbf{w} = \lambda_1 \mathbf{S}_W \mathbf{w}$.
Note here that $\mathbf{S}_B$ is the outer product of the vector $(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)$; hence $\mathbf{S}_B \mathbf{w} = (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)\left[(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)^\top \mathbf{w}\right]$ always points in the direction of $\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1$. Therefore, $\mathbf{w} = \frac{1}{\lambda_1}\mathbf{S}_W^{-1}\mathbf{S}_B\mathbf{w}$ is in the direction of $\mathbf{S}_W^{-1}(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)$. With this observation, we do not need to solve the eigenvalue problem above and can rather compute

$$\mathbf{w} \propto \mathbf{S}_W^{-1}(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1).$$
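As a sanity check, the sketch below (with the same kind of made-up toy data as before) computes the closed-form direction $\mathbf{S}_W^{-1}(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)$ and verifies that it is collinear with the direction recovered from the Cholesky-reparameterized eigenvalue problem:

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))
X2 = rng.normal(loc=[3.0, 1.0], scale=1.0, size=(100, 2))
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
d = (mu2 - mu1).reshape(-1, 1)
S_B = d @ d.T
S_W = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)

# Route 1: eigenvector of A = R^{-T} S_B R^{-1}, where S_W = R^T R.
R = np.linalg.cholesky(S_W).T          # upper-triangular Cholesky factor
R_inv = np.linalg.inv(R)
A = R_inv.T @ S_B @ R_inv
eigvals, eigvecs = np.linalg.eigh(A)
u = eigvecs[:, -1]                     # eigenvector of the largest eigenvalue
w_eig = R_inv @ u                      # map back: w = R^{-1} u

# Route 2: closed form, no eigen-decomposition needed.
w_closed = np.linalg.solve(S_W, mu2 - mu1)

# Both directions coincide up to scale and sign.
print(w_eig / np.linalg.norm(w_eig))
print(w_closed / np.linalg.norm(w_closed))
```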
Fig. 2: Projection direction that maximizes the class separation and minimizes the variance of the distributions along the projection direction.
So far, what we have is a direction in space that maximally separates the two classes once they are projected onto it. To predict which class a sample belongs to, we can derive the discriminant function of this projection. Let $\mathcal{C}_1$ and $\mathcal{C}_2$ be the two classes we consider. We know from Bayes' rule that

$$p(\mathcal{C}_k \mid z) = \frac{p(z \mid \mathcal{C}_k)\, p(\mathcal{C}_k)}{p(z)}, \quad \text{where } z = \mathbf{w}^\top \mathbf{x}.$$
Using Bayes' decision theory, we decide that $\mathbf{x}$ belongs to $\mathcal{C}_1$ if $p(\mathcal{C}_1 \mid z) > p(\mathcal{C}_2 \mid z)$ and vice versa; hence the decision boundary lies at $p(\mathcal{C}_1 \mid z) = p(\mathcal{C}_2 \mid z)$. Applying the logarithm (the evidence $p(z)$ cancels on both sides), we have

$$\ln p(z \mid \mathcal{C}_1) + \ln p(\mathcal{C}_1) = \ln p(z \mid \mathcal{C}_2) + \ln p(\mathcal{C}_2).$$
Let's assume that after the projection the data of each class follows a Gaussian distribution (in this case one-dimensional), $p(z \mid \mathcal{C}_k) = \mathcal{N}(z; m_k, \sigma_k^2)$. The equation reduces to

$$-\frac{(z - m_1)^2}{2\sigma_1^2} - \ln \sigma_1 + \ln p(\mathcal{C}_1) = -\frac{(z - m_2)^2}{2\sigma_2^2} - \ln \sigma_2 + \ln p(\mathcal{C}_2).$$
If we assume that $\sigma_1 = \sigma_2 = \sigma$, the decision boundary then depends only on $m_1$, $m_2$, and $\sigma$, which can be estimated using maximum likelihood. Furthermore, one might further assume that $p(\mathcal{C}_1) = p(\mathcal{C}_2)$, hence the prior terms can be dropped from the equation. In this case, the decision boundary becomes the midpoint between the $m_k$'s, which yields a linear decision boundary; this is called Linear Discriminant Analysis (LDA). On the other hand, if we estimate the $\sigma_k$'s separately and use them, we have a quadratic decision boundary, called Quadratic Discriminant Analysis (QDA), i.e. it can curve following the data distributions.
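A minimal sketch of the two decision rules on the projected data (the 1-D samples and their parameters below are hypothetical) might look as follows: with a shared variance and equal priors the boundary is the midpoint of the projected means, while per-class variances lead to a comparison of the Gaussian log-densities.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical projected samples z = w^T x for the two classes.
z1 = rng.normal(loc=-1.0, scale=0.6, size=200)
z2 = rng.normal(loc=2.0, scale=1.4, size=200)

m1, m2 = z1.mean(), z2.mean()
s1, s2 = z1.std(), z2.std()

def lda_predict(z):
    """Equal variances and equal priors: boundary is the midpoint of the means."""
    return np.where(np.abs(z - m1) < np.abs(z - m2), 1, 2)

def gaussian_log_density(z, m, s):
    # Log-density of N(m, s^2) up to an additive constant.
    return -0.5 * ((z - m) / s) ** 2 - np.log(s)

def qda_predict(z):
    """Per-class variances (equal priors): compare the Gaussian log-densities."""
    return np.where(gaussian_log_density(z, m1, s1) > gaussian_log_density(z, m2, s2), 1, 2)

z_test = np.array([-0.5, 0.4, 0.6, 3.0])
print(lda_predict(z_test))   # linear (midpoint) rule
print(qda_predict(z_test))   # quadratic rule; can differ near the overlap
```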
Below are decision boundaries derived from these two cases on three different datasets.
Fig. 3: Decision boundaries from LDA and QDA on three different datasets. Numbers on the bottom right of each plot are test set accuracies.
Google Colab used for making Figure 3.
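For reference, a comparable comparison can be run with scikit-learn's built-in classifiers; the sketch below uses a synthetic dataset and split that are assumptions, not the exact setup behind Figure 3.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.model_selection import train_test_split

# A synthetic two-class dataset; not the data used in Figure 3.
X, y = make_blobs(n_samples=400, centers=2, cluster_std=[1.0, 2.5], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for model in (LinearDiscriminantAnalysis(), QuadraticDiscriminantAnalysis()):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))
```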
Proof: Maximization of the Rayleigh Quotient
Consider a symmetric and positive definite matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$. Denote by $\mathbf{u}^*$ the vector that we seek under the following objective:

$$\mathbf{u}^* = \arg\max_{\mathbf{u}} \frac{\mathbf{u}^\top \mathbf{A} \mathbf{u}}{\mathbf{u}^\top \mathbf{u}}.$$
Because the quotient is scale invariant, we only need to find the solution on the unit sphere, i.e. $\|\mathbf{u}\| = 1$. Because $\mathbf{A}$ is symmetric and positive definite, we know that it is diagonalizable. In particular, the diagonalization is $\mathbf{A} = \mathbf{Q} \boldsymbol{\Lambda} \mathbf{Q}^\top$, where $\mathbf{Q}$ is an orthogonal matrix and $\boldsymbol{\Lambda}$ is an $n \times n$ diagonal matrix. We can write the numerator as follows:

$$\mathbf{u}^\top \mathbf{A} \mathbf{u} = \mathbf{u}^\top \mathbf{Q} \boldsymbol{\Lambda} \mathbf{Q}^\top \mathbf{u} = \mathbf{v}^\top \boldsymbol{\Lambda} \mathbf{v}, \quad \text{where } \mathbf{v} = \mathbf{Q}^\top \mathbf{u}.$$
We can see that $\|\mathbf{v}\| = \|\mathbf{Q}^\top \mathbf{u}\| = \|\mathbf{u}\| = 1$, because $\mathbf{Q}$ is orthogonal.
Without loss of generality, let's first assume that the eigenvalues of $\mathbf{A}$ are ordered as $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n > 0$. Because $\boldsymbol{\Lambda}$ is a diagonal matrix, we know that

$$\mathbf{v}^\top \boldsymbol{\Lambda} \mathbf{v} = \sum_{i=1}^{n} \lambda_i v_i^2.$$
With the assumption that the eigenvalues are in descending order, it implies that $\sum_{i=1}^{n} \lambda_i v_i^2 \le \lambda_1 \sum_{i=1}^{n} v_i^2 = \lambda_1$, and the bound is attained when $v_1 = 1$ and $v_i = 0$ for $i > 1$. Hence, we have $\max_{\|\mathbf{u}\| = 1} \mathbf{u}^\top \mathbf{A} \mathbf{u} = \lambda_1$, and $\mathbf{v}^* = \mathbf{e}_1$ is the basis vector (after the rotation with the eigenbasis) corresponding to $\lambda_1$, i.e. $\boldsymbol{\Lambda} \mathbf{e}_1 = \lambda_1 \mathbf{e}_1$. Therefore, we have $\mathbf{u}^* = \mathbf{Q} \mathbf{v}^* = \mathbf{q}_1$, which is the eigenvector corresponding to $\lambda_1$. ◾️
Conversely, if the problem is minimization, the minimum is attained at the eigenvector corresponding to the smallest eigenvalue $\lambda_n$.
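A quick numerical check of this result, using a randomly generated symmetric positive definite matrix (all values below are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

# Random symmetric positive definite matrix A = B^T B + I.
B = rng.normal(size=(5, 5))
A = B.T @ B + np.eye(5)

eigvals, eigvecs = np.linalg.eigh(A)   # eigenvalues in ascending order

def rayleigh(u):
    return (u @ A @ u) / (u @ u)

# The quotient at the top eigenvector equals the largest eigenvalue ...
print(rayleigh(eigvecs[:, -1]), eigvals[-1])

# ... and no random vector exceeds it (up to numerical error).
us = rng.normal(size=(1000, 5))
print(max(rayleigh(u) for u in us) <= eigvals[-1] + 1e-9)
```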