Gaussian Discriminant Analysis

So we see the word 'discriminant' again, that means we're still dealing with conditional probability P(Y|X). To be more specific, our target is to maximize the posterior P(Y=y|X=x):

so we can compare P(Y = C | X) and P(Y = D | X) , where C and D represent different classes.

For the rest of the whole blag post, everything is based upon this fundamental assumption:
each class comes from a normal distribution (Gaussian), like this:

So let's say for each class, we collect some data and construct the normal distribution, so for a class C, which has its own mean and standard deviation,

and call

Compared to P(X=x|Y=C), P(Y=C) is much easier to find. Consider a 2-class classification, and we collect 600 samples for class C and 400 samples for class D,
P(Y=C) = 600/(600 + 400) = 0.6 and  P(Y=D) = 400/(600 + 400) = 0.4

Now all we need is P(X) and we''ll be done filling every piece of the equation. Actually, we don't need that. Our mission is to compare P(Y=C|X=x) and P(Y=D|X=x), which actually share the same denominator P(X) according to Bayes' Theorem. Therefore, we're actually comparing P(X=x|Y=C)P(Y=C) and P(X=x|Y=D)P(Y=D).

Given x,Bayes' rule r*(x) return class that maximizes P(X=x|Y=C)P(Y=C).

And to make the computation easier, we're gonna do something about P(X=x|Y=C)P(Y=C) and P(X=x|Y=D)P(Y=D). Since ln(z) is monotonically increasing for z > 0,

Therefore,

And so, we have

Note that this is a quadratic function. So we can move on to the next part.

Suppose  we have only two classes C and D. Then we have the Bayes' Decision Rule

Prediction function is quadratic in x, which we've just calculated.
Bayes Decision Boundary is when

In 1-D, Bayes Decision Boundary may have one or two points.
In d-D, Bayes Decision Boundary is a quadratic.

Note that we're dealing with those two quadratic functions instead of the original P(Y=C|X=x) and P(Y=D|X=x). Therefore, to recover those posterior probabilities in 2-class case, we write them in terms of Q(x) using Bayes' Theorem:

s(-) is the logistic function aka sigmoid function. This is what it looks like:

• s(0) = 0.5 => Bayesian Decision Boundary
• s(infinity) = 1 => Class C
• s(-infinity) = 0 => Class D
Since s(-) is always within 0 and 1, it's perfect for modeling probabilities.

So again, we can think about this as sort of a 2-step process. First, we'll find these quadratic functions. Then we take difference between them and run this difference through the sigmoid function to get a probability.

Linear Discriminant Analysis (LDA)

If we make this fundamental assumption: all the Gaussians have same variance.

We'll have

So what happened here is I take the expressions of C and D, subtract them from each other and use the fact that when both of their sigmas are the same. σC and σD are now equal, which makes certain terms in the equation drop out. One of the certain terms dropped out was the x^2 term. So this equation is now linear instead of quadratic.

This is nice for two reasons. One is the simplicity, it's a little faster to compute compared to using quadratic functions. But the other more important reason is if QDA is overfitting, then switching to a simpler model makes it less likely to overfit, because a linear model has fewer parameters (but also less possibility and less flexible).

So what we now have is a linear classifier! Choose C that maximizes

Here we are just comparing expression of C and expression of D. Therefore, if we wanna compare a million different classes, I just need to compute one scalar value for each class and figure out which one is biggest.

In 2-class case, decision boundary is when ωx + α=0

Another nice simplification happens if the two priors are equal. So suppose you have two classes, and 50% of your samples are in one class and the other 50% belong to another class then this simplifies the equation into this:

If P(Y=C) = P(Y=D) = 0.5, we'll have

This is the centroid method.

Also point out that just like what we did in QDA, we can get the Bayes posterior. Why do we want that? We want that because sometimes we don't want just what class we're predicting, but what the probability is in that class. And so the Bayes posterior is we take that logistic function I showed you before and we apply it to our predictor function ωx + α. And now we have a probability!

Maximum Likelihood Estimation of Parameters

Now the question remaining is how do we find the mean and standard deviation of the distributions we have based on the data we collect. We're gonna use something called maximum likelihood estimation.

If we flip some biased coins 10 times, heads with probability p; tails with probability 1-p, and we got 8 heads and 2 tails. What is the most likely value of p?

Using Binomial Distribution: X~B(n,p)

Back to our example, n = 10, x = 8:

We define this as L(p), and call it the likelihood function.

So what we basically just did is we wrote the probability of 8 heads in 10 flips as a function L(p) of distribution parameter(s); this is the likelihood function.

Maximum Likelihood Estimation (MLE):
A method of estimating the parameters of a statistical model by picking the params that maximize the likelihood function. (The idea is what parameter would let us find the maximum possible chance of getting 8 heads.)

So back to our example, we 're trying to find p that maximizes L(p):
By setting  derivative = 0,

Likelihood of a Gaussian

Moving onto Gaussian distributions,

The log likelihood l(.) is the natural of the likelihood L(.), and maximizing likelihood is like maximizing log-likelihood, and vice versa. Therefore, to make our life easier, we're gonna take the log-likelihood here.

And of course, what we're gonna do next to setting the derivative = 0 to find our precious, precious, mean and standard deviation.

Therefore, the 'mean' we have is the mean of all the samples we got.
(Note that we use this triangle thingy because the mean may be a vector.)

And since we don't exactly  know what the true mean is, we're just gonna substitute it with the μ hat we just calculated.

Therefore, the 'variance' (just take a square root of it and we'll get the standard deviation) we have is the variance of all the sample points.

In short, we use mean & variance of samples in class C to estimate mean & variance of Gaussian for class C.

Now that we have every piece of the puzzle, we can go back to LDA and QDA.

QDA

In the case of QDA, we can simply estimate the mean and variance of each class using the previous results that we have and estimate the priors for each class C:

LDA

However, in the case of LDA, things are a little bit complicated, since we assume every class shares the same standard deviation (and variance, of course).

Small Summary

Let's end this blag by wrapping up everything:

• We assume every class comes from a Gaussian distribution
• Our target is to compare posteriors for different classes
• We do this by using QDA and LDA, and the sigmoid function
• In the meantime, a Gaussian distribution needs a mean and standard deviation, and we find them using maximum likelihood estimation

What's next? Anisotropic normal distributions, that means more Gaussians. Crap...

Kev
My Machine Learning Learning Experience (Part 6): Gaussian Discriminant Analysis And Maximum Likelihood Estimation Reviewed by Unknown on 11:17:00 PM Rating: 5