A summary of Duda et al. (2000), Pattern Classification, Chapter 2.
Machine learning is about extracting useful information from the data we have; this is often referred to as modeling. Consider a situation where we have collected fish data and want to build a classifier that decides whether a given fish is a salmon.
Let us assume that each fish has $d$ continuous attributes, such as weight and body length, that we measure; mathematically, we write $\mathbf{x} \in \mathbb{R}^d$. The simplest way to build such a classifier is to choose a decision threshold based on each class's histogram of a certain attribute.
Fig. 1: Distributions of each group's feature; dashed line is the decision boundary.
Although the process is straightforward, it is quite tedious if $d$ is large, i.e. we would not be able to inspect $d$ plots efficiently. Moreover, there might be some correlation between attributes within each class, which thresholding one attribute at a time ignores.
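As a rough sketch of this histogram-threshold idea (not from the book; the synthetic lengths and all parameter values below are made up for illustration), one could pick the threshold on a single attribute that minimizes the number of misclassified training samples:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic body lengths (cm) for the two groups; purely illustrative values.
salmon     = rng.normal(loc=70, scale=8, size=500)
non_salmon = rng.normal(loc=50, scale=8, size=500)

# Sweep candidate thresholds and count training errors,
# classifying a fish as salmon whenever its length exceeds the threshold.
candidates = np.linspace(30, 90, 601)
errors = [np.sum(salmon <= t) + np.sum(non_salmon > t) for t in candidates]
best_t = candidates[np.argmin(errors)]
print(f"chosen decision threshold: {best_t:.1f} cm")
```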
Bayesian Decision Theory
First, denote by $\omega_1$ and $\omega_2$ the class labels, i.e. salmon and non-salmon. Using Bayes' rule, we write

$$P(\omega_i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_i)\, P(\omega_i)}{p(\mathbf{x})},$$
where $P(\cdot)$ and $p(\cdot)$ denote probability and likelihood (density) respectively. So, the decision whether we classify $\mathbf{x}$ as $\omega_1$ (salmon) or $\omega_2$ (non-salmon) can be based on comparing the posteriors $P(\omega_1 \mid \mathbf{x})$ and $P(\omega_2 \mid \mathbf{x})$. More precisely, we classify $\mathbf{x}$ as $\omega_1$ if $P(\omega_1 \mid \mathbf{x}) > P(\omega_2 \mid \mathbf{x})$ and vice versa. With this decision rule, the probability of making an error, $P(\text{error} \mid \mathbf{x})$, would be

$$P(\text{error} \mid \mathbf{x}) = \min\bigl[P(\omega_1 \mid \mathbf{x}),\, P(\omega_2 \mid \mathbf{x})\bigr].$$
Therefore, the expected error would be

$$P(\text{error}) = \int P(\text{error} \mid \mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}.$$
Because our decision rule yields the smallest error for each $\mathbf{x}$, the expected error is also minimized; this decision rule is called Bayes' decision rule.
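To make the rule concrete, here is a minimal numeric sketch (the 1-D class-conditional densities and priors are invented for illustration, not taken from the book) that evaluates both posteriors at a point and reports the decision together with its conditional error:

```python
import numpy as np
from scipy.stats import norm

x = 62.0                                              # observed attribute, e.g. length in cm
prior = np.array([0.6, 0.4])                          # P(w1), P(w2): hypothetical priors
likelihood = np.array([norm.pdf(x, loc=70, scale=8),  # p(x | w1), salmon
                       norm.pdf(x, loc=50, scale=8)]) # p(x | w2), non-salmon

joint = likelihood * prior
posterior = joint / joint.sum()                       # Bayes' rule: P(wi | x)
decision = np.argmax(posterior) + 1                   # pick the larger posterior
print(f"P(w1|x) = {posterior[0]:.3f}, P(w2|x) = {posterior[1]:.3f}")
print(f"decide w{decision}; P(error|x) = {posterior.min():.3f}")
```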
Recall that the posterior $P(\omega_i \mid \mathbf{x})$ is proportional to a product of two terms: the likelihood $p(\mathbf{x} \mid \omega_i)$ and the class probability $P(\omega_i)$. Because both terms are between zero and one and computers sometimes cannot represent small values well (i.e. the underflow issue), operating on this form directly might lead to numerical instability. Fortunately, we observe that the decision based on comparing $P(\omega_1 \mid \mathbf{x})$ and $P(\omega_2 \mid \mathbf{x})$ is insensitive to monotonic transformations such as the logarithm; therefore, we can instead operate on the transformed quantities. In particular, we define a discriminant function as

$$g_i(\mathbf{x}) = \ln p(\mathbf{x} \mid \omega_i) + \ln P(\omega_i),$$

and classify $\mathbf{x}$ as $\omega_1$ if $g_1(\mathbf{x}) > g_2(\mathbf{x})$.
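As a quick illustration of why the log form helps (the numbers are arbitrary), a product of many tiny likelihood factors underflows in double precision, while the sum of their logarithms stays representable:

```python
import numpy as np

rng = np.random.default_rng(1)
# Many tiny factors, e.g. pieces of a likelihood for a very high-dimensional x.
tiny_factors = rng.uniform(1e-200, 1e-150, size=2000)

product = np.prod(tiny_factors)            # underflows to 0.0
log_sum = np.sum(np.log(tiny_factors))     # finite; usable inside g_i(x)
print(product, log_sum)
```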
Multivariate Gaussian Cases
Imagine now that we want to build the fish classifier with all $d$ attributes. We assume that, within each category, these attributes follow a multivariate Gaussian distribution, i.e. the two groups (salmon and non-salmon) have their own set of parameters for the distribution. We will look at three cases:
- C1: The two distributions have the same covariance $\Sigma$, and it is $\Sigma = \sigma^2 I_d$;
- C2: The two distributions have the same covariance $\Sigma$, but we do not assume any particular structure for it;
- C3: The two distributions have different covariances $\Sigma_1$ and $\Sigma_2$.
Let us recall the density function of a multivariate Gaussian distribution $\mathcal{N}(\boldsymbol{\mu}, \Sigma)$:

$$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}\, |\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})\right).$$
Because we assume that $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}_1, \Sigma_1)$ if $\mathbf{x}$ is salmon and $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}_2, \Sigma_2)$ otherwise, we can write the discriminant function as

$$g_i(\mathbf{x}) = -\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^\top \Sigma_i^{-1} (\mathbf{x}-\boldsymbol{\mu}_i) \underbrace{-\frac{d}{2}\ln 2\pi}_{\text{❶}} \underbrace{-\frac{1}{2}\ln|\Sigma_i|}_{\text{❷}} + \ln P(\omega_i).$$
From the derivation, we see that ❶ does not depend on any distribution parameter ($\boldsymbol{\mu}_i$, $\Sigma_i$) or on $i$; thus, it can be dropped.
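Below is a minimal sketch of this discriminant in code, dropping the constant term ❶ as just discussed. The means, covariances, priors, and the query point are placeholder values of my own, not from the book:

```python
import numpy as np

def gaussian_discriminant(x, mu, Sigma, prior):
    """g_i(x) = -0.5 (x-mu)^T Sigma^{-1} (x-mu) - 0.5 ln|Sigma| + ln P(w_i),
    with the constant -(d/2) ln(2 pi) term dropped."""
    diff = x - mu
    maha = diff @ np.linalg.solve(Sigma, diff)   # squared Mahalanobis distance
    _, logdet = np.linalg.slogdet(Sigma)         # stable log-determinant
    return -0.5 * maha - 0.5 * logdet + np.log(prior)

# Hypothetical 2-D parameters (e.g. length and weight) for the two classes.
mu1, mu2 = np.array([70.0, 4.0]), np.array([50.0, 2.5])
S1 = np.array([[64.0, 3.0], [3.0, 1.0]])
S2 = np.array([[49.0, 2.0], [2.0, 0.8]])

x = np.array([60.0, 3.0])
g1 = gaussian_discriminant(x, mu1, S1, prior=0.6)
g2 = gaussian_discriminant(x, mu2, S2, prior=0.4)
print("salmon" if g1 > g2 else "non-salmon")
```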
Case C1: $\Sigma_1 = \Sigma_2 = \Sigma = \sigma^2 I_d$
In this case, ❷ can also be dropped, leaving us with $g_i(\mathbf{x}) = -\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^\top \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu}_i) + \ln P(\omega_i)$. Then, we have

$$g_i(\mathbf{x}) = -\frac{\lVert \mathbf{x}-\boldsymbol{\mu}_i \rVert^2}{2\sigma^2} + \ln P(\omega_i) = -\frac{1}{2\sigma^2}\bigl(\mathbf{x}^\top\mathbf{x} - 2\boldsymbol{\mu}_i^\top\mathbf{x} + \boldsymbol{\mu}_i^\top\boldsymbol{\mu}_i\bigr) + \ln P(\omega_i).$$
Furthermore, the term $\mathbf{x}^\top\mathbf{x}$ is the same for the two groups, so we can also drop it. Using the fact that $\boldsymbol{\mu}_i^\top\mathbf{x} = \mathbf{x}^\top\boldsymbol{\mu}_i$, we can rewrite the discriminant function as a linear function of $\mathbf{x}$:

$$g_i(\mathbf{x}) = \mathbf{w}_i^\top\mathbf{x} + w_{i0}, \qquad \text{where } \mathbf{w}_i = \frac{1}{\sigma^2}\boldsymbol{\mu}_i \text{ and } w_{i0} = -\frac{1}{2\sigma^2}\boldsymbol{\mu}_i^\top\boldsymbol{\mu}_i + \ln P(\omega_i).$$
With this result, we can find the decision boundary from $g_1(\mathbf{x}) = g_2(\mathbf{x})$:

$$\mathbf{w}^\top(\mathbf{x} - \mathbf{x}_0) = 0, \qquad \text{where } \mathbf{w} = \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2 \text{ and } \mathbf{x}_0 = \frac{1}{2}(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2) - \frac{\sigma^2}{\lVert \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2 \rVert^2} \ln\frac{P(\omega_1)}{P(\omega_2)}\,(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2).$$

This is a hyperplane through $\mathbf{x}_0$ orthogonal to the line joining the two means.
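A small sketch of computing the hyperplane parameters $\mathbf{w}$ and $\mathbf{x}_0$ above; all numeric values are placeholders chosen for illustration:

```python
import numpy as np

# Hypothetical case-C1 parameters: shared covariance sigma^2 * I.
mu1, mu2 = np.array([70.0, 4.0]), np.array([50.0, 2.5])
sigma2 = 25.0
p1, p2 = 0.6, 0.4

w = mu1 - mu2
x0 = 0.5 * (mu1 + mu2) - sigma2 / np.sum(w ** 2) * np.log(p1 / p2) * w

# A point is classified by the side of the hyperplane w^T (x - x0) = 0 it falls on.
x = np.array([58.0, 3.0])
print("salmon" if w @ (x - x0) > 0 else "non-salmon")
```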
Fig. 2: Case 1's decision boundary
Case C2: $\Sigma_1 = \Sigma_2 = \Sigma$
In this case, the derivation of the decision boundary is similar to the first case. More precisely, we have

$$g_i(\mathbf{x}) = \mathbf{w}_i^\top\mathbf{x} + w_{i0}, \qquad \text{where } \mathbf{w}_i = \Sigma^{-1}\boldsymbol{\mu}_i \text{ and } w_{i0} = -\frac{1}{2}\boldsymbol{\mu}_i^\top\Sigma^{-1}\boldsymbol{\mu}_i + \ln P(\omega_i),$$

and the decision boundary $g_1(\mathbf{x}) = g_2(\mathbf{x})$ is again a hyperplane:

$$\mathbf{w}^\top(\mathbf{x} - \mathbf{x}_0) = 0, \qquad \text{where } \mathbf{w} = \Sigma^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) \text{ and } \mathbf{x}_0 = \frac{1}{2}(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2) - \frac{\ln\bigl[P(\omega_1)/P(\omega_2)\bigr]}{(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^\top \Sigma^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)}\,(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2).$$
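The analogous sketch for a shared full covariance; again the parameter values are made up:

```python
import numpy as np

# Hypothetical case-C2 parameters: one full covariance shared by both classes.
mu1, mu2 = np.array([70.0, 4.0]), np.array([50.0, 2.5])
Sigma = np.array([[64.0, 3.0], [3.0, 1.0]])
p1, p2 = 0.6, 0.4

diff = mu1 - mu2
w = np.linalg.solve(Sigma, diff)                       # Sigma^{-1} (mu1 - mu2)
x0 = 0.5 * (mu1 + mu2) - np.log(p1 / p2) / (diff @ w) * diff

x = np.array([58.0, 3.0])
print("salmon" if w @ (x - x0) > 0 else "non-salmon")
```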
Fig. 3: Case 2's decision boundary
Case C3: $\Sigma_1 \ne \Sigma_2$
In this case, we can only drop ❶. Following the same derivation except without dropping ❷, we get

$$g_i(\mathbf{x}) = -\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^\top \Sigma_i^{-1} (\mathbf{x}-\boldsymbol{\mu}_i) - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i).$$
This can be written as

$$g_i(\mathbf{x}) = \mathbf{x}^\top W_i\,\mathbf{x} + \mathbf{w}_i^\top\mathbf{x} + w_{i0}, \qquad \text{where } W_i = -\frac{1}{2}\Sigma_i^{-1},\quad \mathbf{w}_i = \Sigma_i^{-1}\boldsymbol{\mu}_i,\quad w_{i0} = -\frac{1}{2}\boldsymbol{\mu}_i^\top\Sigma_i^{-1}\boldsymbol{\mu}_i - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i).$$

This is quadratic in $\mathbf{x}$; hence, the decision boundary $g_1(\mathbf{x}) = g_2(\mathbf{x})$ is a hyperquadric surface.
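Finally, a sketch of the quadratic form with the $W_i$, $\mathbf{w}_i$, and $w_{i0}$ given above (placeholder parameters once more):

```python
import numpy as np

def quadratic_discriminant(x, mu, Sigma, prior):
    """g_i(x) = x^T W_i x + w_i^T x + w_i0 for the general-covariance case."""
    Sigma_inv = np.linalg.inv(Sigma)
    W = -0.5 * Sigma_inv
    w = Sigma_inv @ mu
    _, logdet = np.linalg.slogdet(Sigma)
    w0 = -0.5 * mu @ Sigma_inv @ mu - 0.5 * logdet + np.log(prior)
    return x @ W @ x + w @ x + w0

mu1, mu2 = np.array([70.0, 4.0]), np.array([50.0, 2.5])
S1 = np.array([[64.0, 3.0], [3.0, 1.0]])       # each class keeps its own covariance
S2 = np.array([[25.0, -2.0], [-2.0, 0.6]])

x = np.array([60.0, 3.0])
print("salmon" if quadratic_discriminant(x, mu1, S1, 0.6)
                > quadratic_discriminant(x, mu2, S2, 0.4) else "non-salmon")
```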
Fig. 4: Case 3's decision boundary
Figures are produced from a Google Colab notebook.