A summary of Duda et al. (2020), Pattern Classification. Chapter 2.
Machine learning is about extracting useful information from the data we have; this is often referred to as modeling. Consider a situation where we have collected fish data and want to build a classifier that decides whether a fish is a salmon.
Let us assume that each fish has d continuous attributes that we measure, such as weight and body length; mathematically, we write x ∈ ℝ^d. The simplest way to build such a classifier is to choose a decision threshold based on each class's histogram of a single attribute.
Fig. 1: Distributions of each group's feature; dashed line is the decision boundary.
Although this process is straightforward, it becomes tedious when d is large: we would not be able to inspect d plots efficiently. Moreover, the attributes may be correlated within each class, which a single-attribute threshold cannot capture.
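For concreteness, here is a minimal sketch of the single-attribute approach described above. The weight data below are synthetic and made up for illustration; they are not the book's dataset, and the threshold search is just one simple way to place the dashed line in Fig. 1.

```python
import numpy as np

# Synthetic, made-up weights (kg) for illustration only.
rng = np.random.default_rng(0)
salmon = rng.normal(loc=4.0, scale=0.8, size=500)      # class omega_1
non_salmon = rng.normal(loc=6.0, scale=1.0, size=500)  # class omega_2

# Candidate thresholds spanning both samples.
candidates = np.linspace(min(salmon.min(), non_salmon.min()),
                         max(salmon.max(), non_salmon.max()), 200)

# For each threshold t, classify "salmon" if weight < t and count the errors.
errors = [np.sum(salmon >= t) + np.sum(non_salmon < t) for t in candidates]
threshold = candidates[int(np.argmin(errors))]
print(f"chosen threshold: {threshold:.2f} kg")
```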
Bayesian Decision Theory
First, denote by ω1 and ω2 the two class labels, i.e. salmon and non-salmon. Using Bayes' rule, we write
$$P(\omega_1 \mid x) = \frac{p(x \mid \omega_1)\,P(\omega_1)}{p(x)},$$
where P(⋅) and p(⋅) denote a probability and a likelihood (density), respectively. The decision whether to classify x as ω1 (salmon) or ω2 (non-salmon) can then be based on comparing the posteriors P(ω1∣x) and P(ω2∣x). More precisely, we classify x as ω1 if P(ω1∣x) > P(ω2∣x) and vice versa. With this decision rule, the probability of making an error (Err) for a given x would be
$$P(\text{Err} \mid x) = \min\big[P(\omega_1 \mid x),\, P(\omega_2 \mid x)\big].$$
Because this rule yields the smallest error for each individual x, the expected error ∫P(Err∣x)p(x)dx is also minimized; the rule is called Bayes' decision rule.
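As a sketch of the rule itself, the decision reduces to comparing the two (unnormalized) posteriors for a given x. The class-conditional densities and priors below are illustrative assumptions, not quantities from the book:

```python
import numpy as np
from scipy.stats import norm

# Illustrative class-conditional densities p(x|omega_i) and priors P(omega_i).
likelihoods = [norm(loc=4.0, scale=0.8), norm(loc=6.0, scale=1.0)]
priors = [0.6, 0.4]

def classify(x):
    # P(omega_i|x) is proportional to p(x|omega_i) * P(omega_i);
    # the evidence p(x) is common to both classes, so it can be ignored.
    scores = [lik.pdf(x) * prior for lik, prior in zip(likelihoods, priors)]
    return int(np.argmax(scores))  # 0 -> omega_1 (salmon), 1 -> omega_2

print(classify(4.5))  # a weight of 4.5 is more likely omega_1 here
```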
Discriminant Function
Recall that the posterior P(ω1∣x) is proportional to the product of two terms: the likelihood p(x∣ω1) and the prior probability P(ω1). Because both terms lie between zero and one and computers cannot represent very small values well (the underflow issue), operating in this form can be numerically unstable. Fortunately, the decision based on comparing P(ω1∣x) and P(ω2∣x) is unchanged by any monotonically increasing transformation, such as the logarithm; therefore, we can operate on the transformed quantities instead. In particular, we define a discriminant function g_i for each class ω_i as
$$g_i(x) \triangleq \log p(x \mid \omega_i) + \log P(\omega_i).$$
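A minimal sketch of this discriminant in d dimensions, assuming known (made-up) Gaussian parameters and priors, and using SciPy's log-density so that everything stays in log space:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Assumed (illustrative) per-class parameters and priors.
means = [np.array([4.0, 40.0]), np.array([6.0, 55.0])]
covs = [np.eye(2), np.eye(2)]
priors = [0.6, 0.4]

def g(x, i):
    # g_i(x) = log p(x|omega_i) + log P(omega_i), computed in log space
    # to avoid underflow of very small densities.
    return multivariate_normal(mean=means[i], cov=covs[i]).logpdf(x) + np.log(priors[i])

x = np.array([4.5, 42.0])
predicted = 0 if g(x, 0) > g(x, 1) else 1  # pick the class with the larger g_i(x)
print(predicted)
```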
Multivariate Gaussian Cases
Imagine now that we want to build the fish classifier using all d attributes. We assume that within each category the attributes follow a multivariate Gaussian distribution, i.e. the two groups (salmon and non-salmon) each have their own set of parameters for the distribution. We will look at three cases:
C1: The two distributions share the same covariance Σ, and it has the form Σ ≜ σ²I_d;
C2: The two distributions share the same covariance Σ, but it is an arbitrary covariance matrix;
C3: The two distributions have different covariances Σ1 and Σ2.
Let us recall the density function of a multivariate Gaussian distribution N(μ, Σ):
$$p(x) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right).$$
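As a quick sanity check of this formula (the mean and covariance below are arbitrary choices for illustration), a direct NumPy implementation agrees with SciPy's density:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_pdf(x, mu, sigma):
    # Direct implementation of the multivariate Gaussian density above.
    d = len(mu)
    diff = x - mu
    norm_const = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma)))
    quad = diff @ np.linalg.solve(sigma, diff)  # (x - mu)^T Sigma^{-1} (x - mu)
    return norm_const * np.exp(-0.5 * quad)

mu = np.array([4.0, 40.0])
sigma = np.array([[0.8, 0.1], [0.1, 2.0]])
x = np.array([4.5, 42.0])
print(np.isclose(gaussian_pdf(x, mu, sigma),
                 multivariate_normal(mean=mu, cov=sigma).pdf(x)))  # True
```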
Because we assume that x ∼ N(μ1, Σ1) if x is salmon and x ∼ N(μ2, Σ2) otherwise, we can write the discriminant function as
$$g_i(x) = -\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i) - \frac{d}{2}\log 2\pi - \frac{1}{2}\log|\Sigma_i| + \log P(\omega_i).$$
Consider case C1, where Σ1 = Σ2 = σ²I_d. The terms −(d/2)log 2π and −(1/2)log|Σ_i| are the same for both classes, so they can be dropped. Furthermore, after expanding the quadratic form, the term xᵀΣ⁻¹x is also the same for the two groups, so we can drop it as well. Using the fact that Σ⁻¹ = (1/σ²)I_d, we can rewrite the discriminant function as
$$g_i(x) = \frac{1}{\sigma^2}\mu_i^T x - \frac{1}{2\sigma^2}\mu_i^T\mu_i + \log P(\omega_i),$$
which is linear in x.
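A minimal sketch of case C1 with assumed (made-up) parameters, checking that the simplified linear discriminant gives the same decision as comparing the full log-posteriors:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Assumed (illustrative) parameters for case C1: shared covariance sigma^2 * I_d.
sigma2 = 1.5
means = [np.array([4.0, 40.0]), np.array([6.0, 55.0])]
priors = [0.6, 0.4]

def g_linear(x, i):
    # g_i(x) = w_i^T x + w_i0 with w_i = mu_i / sigma^2 and
    # w_i0 = -mu_i^T mu_i / (2 sigma^2) + log P(omega_i).
    w = means[i] / sigma2
    w0 = -means[i] @ means[i] / (2 * sigma2) + np.log(priors[i])
    return w @ x + w0

def g_full(x, i):
    # Full discriminant log p(x|omega_i) + log P(omega_i) for comparison.
    cov = sigma2 * np.eye(len(x))
    return multivariate_normal(mean=means[i], cov=cov).logpdf(x) + np.log(priors[i])

x = np.array([5.1, 47.0])
same = (g_linear(x, 0) > g_linear(x, 1)) == (g_full(x, 0) > g_full(x, 1))
print(same)  # True: the dropped terms are class-independent
```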