Table of Contents
- 2.1 Measuring Classifier Performance
- 2.1.1 Displaying Error Performance (Optional Not Completed)
- 2.1.2 Computational Performance
- 2.2 k-Nearest Neighbors
- 2.3 Bayes Classifiers: ECE313 Redux
- 2.3.1 Binary Classifiers
- 2.3.2 M-ary Classifiers
- 2.3.3 The Bayes Classifier for Multivariate Gaussian Distributions
- 2.4 Linear Classifiers
- 2.5 Linear Discriminant Analysis
- 2.5.1 Estimating the Unknown Parameters
- 2.5.2 The LDA Algorithm
- 2.6 Naive Bayes Classifier
- 2.6.1 Choosing a Model for the Features
- 2.7 Kernel tricks
- 2.8 Logistic Regression
- 2.9 Support Vector Machines
- 2.9.1 Connection to Logistic Regression
NOTE
- Unless stated otherwise, the loss functions L below are the 0-1 loss function
Vocab
- Loss Function: A function L(y_hat, y) used to evaluate the performance of a classifier; it gives the cost of predicting y_hat when the true label is y
- Training Error: For a classifier f it is the average loss over a given training set
- Overfitting: The phenomenon in which a classifier is tuned specifically for one specific training set and will thus perform badly with other sets
- Dissimilarity Measure: Function that calculates dissimilarity between two vectors. Larger = more dissimilar
- Bayes Classifier: A MAP-style decision rule that uses the priors
- Linear classifier: A classifier for which the decision between each pair of labels (l, m) is made through a linear discriminant function of the form g(x_bar) = w_bar.T x_bar + b
2.1 Measuring Classifier Performance
- To gauge performance we use a loss function L(y_hat, y)
- Tells us the cost we incur when estimating y based on a given x
- Training error is also used to gauge quality
- It is the average of the loss function over a given training set T: (1/N) sum_{i=1}^{N} L(f(x_bar_i), y_i)
- 0-1 Loss Function
- L(y_hat, y) = 1 if y_hat != y (error), 0 if y_hat = y (correct)
- The training error under this loss function is the fraction of misclassified training samples: (1/N) sum_{i=1}^{N} 1{f(x_bar_i) != y_i}
- Overfitting: An overfit classifier is tuned specifically to one training set and may not serve other sets of data well
- Because of the overfitting phenomenon it is unwise to rely solely on training error
- Validation Sets: A set of data, not the training data, used to verify the accuracy of the classifier
- Prediction/Generalization Error: This is the correct way to measure accuracy
- Recall that the classifier f was trained using training data T
- Meaning the prediction f(x_bar) is a function of both x_bar and T
- Prediction error should then be estimated as the expected loss given T: Err = E[L(f(X), Y) | T]
- Recalling that an expected value is a summation of probability*value, we can see this is an average over the true joint distribution p(x_bar, y), which we do not have
- Thus we run the classifier on a validation set and use that to estimate Err
- Validation Set Error Checking: Err_V = (1/|V|) sum over (x_bar, y) in V of L(f(x_bar), y)
- This is still implicitly a function of T, as f itself depends on T
- In the 0-1 loss function case this is the same fraction-of-errors formula as before, only computed over V instead of T (see the sketch below)
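As a concrete illustration of training vs. validation error under the 0-1 loss, here is a minimal Python sketch; the threshold classifier and the data points are made up for illustration:

```python
import numpy as np

def zero_one_error(f, X, y):
    """Average 0-1 loss of classifier f over a labeled set (X, y)."""
    y_hat = np.array([f(x) for x in X])
    return np.mean(y_hat != y)

# Toy 1-D threshold classifier, purely illustrative
f = lambda x: 1 if x > 0.5 else -1

X_train = np.array([0.1, 0.4, 0.6, 0.9]); y_train = np.array([-1, -1, 1, 1])
X_val = np.array([0.3, 0.45, 0.55, 0.8]); y_val = np.array([-1, 1, -1, 1])

print("training error:", zero_one_error(f, X_train, y_train))  # 0.0 on T
print("validation error:", zero_one_error(f, X_val, y_val))    # 0.5 on V, our estimate of Err
```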
2.1.1 Displaying Error Performance
- Confusion matrices, or contingency tables, are used to display the performance of a classifier; they look like a Karnaugh map
- True Positives (TP)
- True Negatives (TN)
- False Positives (FP)
- False Negatives (FN)
- New Metrics
- True Positive Rate (TPR) = TP / (TP + FN), a.k.a. recall or sensitivity
- False Positive Rate (FPR) = FP / (FP + TN)
- Specificity (True Negative Rate) = TN / (TN + FP)
- Precision (Positive Predictive Value) = TP / (TP + FP)
- Accuracy = (TP + TN) / (TP + TN + FP + FN)
- Harmonic mean / F-Measure / F-Score / F1-Score = 2 * Precision * TPR / (Precision + TPR), the harmonic mean of precision and recall
- Receiver Operating Characteristic (ROC): the curve of TPR vs. FPR as the decision threshold varies
- Area Under an ROC Curve (AUC)
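A short Python sketch computing the metrics above from the four confusion-matrix counts; the formulas are the standard definitions, and the counts are invented for illustration:

```python
# Made-up confusion-matrix counts for a binary classifier
TP, FP, TN, FN = 40, 10, 45, 5

tpr = TP / (TP + FN)                    # true positive rate (recall, sensitivity)
fpr = FP / (FP + TN)                    # false positive rate
specificity = TN / (TN + FP)            # true negative rate
precision = TP / (TP + FP)              # positive predictive value
accuracy = (TP + TN) / (TP + TN + FP + FN)
f1 = 2 * precision * tpr / (precision + tpr)  # harmonic mean of precision and recall

print(f"TPR={tpr:.2f} FPR={fpr:.2f} specificity={specificity:.2f}")
print(f"precision={precision:.2f} accuracy={accuracy:.2f} F1={f1:.2f}")
```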
2.1.2 Computational Performance
- Computational Performance: Concerned with time and space (memory) requirements
- In general, during training we have a lot of resources to play with
- In general, during testing fewer resources are available (testing must be quick)
- Often linear support vector machines and nearest-neighbor classifiers are tried first
- Linear support vector machines have lower training computational requirements while having relatively good statistical performance
- Nearest neighbor can have high test-time computational requirements and acceptable statistical performance (for little work/coding)
2.2 k-Nearest Neighbors
- Given a training set and an input feature vector x_bar, find the k feature vectors in the training set closest to x_bar
- Dissimilarity Measure: Function that calculates dissimilarity between two vectors. Larger = more dissimilar
- Typically it's the Euclidean distance
- Other choices (see 2.7): radial basis functions, or 1 - cos(angle(x_bar_1, x_bar_2)), i.e., 1 - cosine similarity
```
Algorithm: k-Nearest Neighbors
function kNearestNeighbors(test sample x_bar, training set T, dissimilarity measure delta(., .))
    for i = 1, ..., N do
        calculate delta_i = delta(x_bar, x_bar_i)
    end for
    find indices {i_1, ..., i_k} corresponding to the k smallest values of delta_i (break ties arbitrarily)
    return the most frequently occurring label in {y_i1, ..., y_ik}
end function
```
```
Algorithm: Nearest Neighbor with Euclidean Distance
function NearestNeighbor(test sample x_bar, training set T)
    for i = 1, ..., N do
        calculate delta_i = ||x_bar - x_bar_i||
    end for
    find i* such that delta_i* is smallest (break ties arbitrarily)
    return y_i*
end function
```
- In the k-NN classifier, the higher k is, the smoother the decision boundaries are
- In the naive implementation of k-NN you calculate the distance between the test sample and all training feature vectors, i.e., the complexity of the algorithm grows linearly in the size of the training set
- Note: The whole training set must be kept in memory
- k-NN does not tell us much about the structure of the data, only what's closest
- This is known as instance-based learning
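A runnable Python version of the k-NN pseudocode above; the NumPy array layout (X as an N x d matrix, y as a label vector) and the default Euclidean dissimilarity are illustrative assumptions:

```python
import numpy as np
from collections import Counter

def knn_classify(x, X, y, k, dissimilarity=None):
    """Return the majority label among the k training points closest to x."""
    if dissimilarity is None:
        dissimilarity = lambda a, b: np.linalg.norm(a - b)  # Euclidean distance
    deltas = np.array([dissimilarity(x, xi) for xi in X])   # distance to every training sample
    nearest = np.argsort(deltas)[:k]                        # indices of the k smallest deltas
    return Counter(y[nearest]).most_common(1)[0][0]         # most frequent label among them

# Usage on toy data:
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([-1, -1, 1, 1])
print(knn_classify(np.array([0.2, 0.1]), X, y, k=3))  # -> -1
```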
2.3 Bayes Classifiers: ECE 313 Redux
2.3.1 Binary Classifiers
- Binary Hypothesis Testing: Observations X drawn from 2 hypotheses (labels, classes)
- X ~ p(x|-1) is used to denote that X has pdf/pmf p(x|-1)
- The probability distributions p(x|-1) and p(x|1) are assumed to be known
- Bayesian Setting: We are also given priors pi_{-1} and pi_1
- Bayes Classifier: A likelihood ratio test (the Maximum A Posteriori (MAP) decision rule, or Bayes decision rule): decide y_hat = 1 if pi_1 p(x|1) >= pi_{-1} p(x|-1), else y_hat = -1
- The difference between the hypothesis testing and classification problems is that in classification p(x|-1), p(x|1), and the priors are not known
- Thus everything is based on T
- Using 0-1 loss, the probability of error is P_e = pi_{-1} P(f(X) = 1 | Y = -1) + pi_1 P(f(X) = -1 | Y = 1)
- Bayes classifier minimizes P_e
- The Bayes classifier is the best possible in terms of prediction error because it is given the joint probability distribution, which supervised learning algorithms don't have
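For concreteness, a small Python sketch of the binary MAP rule when the class conditionals are known 1-D Gaussians; the priors, means, and test points are all made up, and SciPy supplies the pdfs:

```python
from scipy.stats import norm

pi_neg, pi_pos = 0.6, 0.4           # priors pi_{-1}, pi_{+1} (assumed values)
p_neg = norm(loc=-1.0, scale=1.0)   # p(x | y = -1)
p_pos = norm(loc=+1.0, scale=1.0)   # p(x | y = +1)

def bayes_classify(x):
    """MAP rule: pick the label with the larger pi_y * p(x|y)."""
    return 1 if pi_pos * p_pos.pdf(x) > pi_neg * p_neg.pdf(x) else -1

print(bayes_classify(0.0))  # likelihoods tie at the midpoint, prior favors -1
print(bayes_classify(1.5))  # -> +1
```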
2.3.2 M-ary Classifiers
- For M >= 2 classes, Bayes classifiers can be constructed essentially the same way as in the binary case
- We now have M hypotheses and M priors
- Giving the following estimate: y_hat = argmax over y in {1, ..., M} of pi_y p(x_bar|y)
- Classifiers are generated by writing down a model of p(x_bar, y) and using estimates of the priors pi and p(x|y) generated from the data
- The challenge is estimating the values efficiently
2.3.3 The Bayes Classifier for Multivariate Gaussian Distributions
- A random vector X is d-dimensional multivariate Gaussian with mean miu_bar and covariance matrix C, where C is positive definite (symmetric, all positive eigenvalues, invertible), if it has the joint pdf p(x_bar) = (2 pi)^(-d/2) |C|^(-1/2) exp(-(1/2) (x_bar - miu_bar).T C_inv (x_bar - miu_bar))
- To denote X has the above pdf we write X ~ N(miu_bar, C)
- Note: miu_l is the l-th element of miu_bar and is equal to E[X_l]
- The covariance matrix contains all pairwise covariances of the elements of X
- The diagonal entries are the variances of the features (cov(a,a)=var(a))
- Off diagonal entries are the "linear part" of the relationship between features
- E.g., with 2 variables, height and weight: the diagonals are the variances, and the off-diagonals determine the slope (scaled by the variance of the other variable) of the linear minimum mean square error estimator of height from weight
- Now consider an M-ary hypothesis problem with hypotheses H_y: X ~ N(miu_bar_y, C_y)
- Then the Bayes classifier becomes y_hat = argmax over y of [ ln(pi_y) - (1/2) ln|C_y| - (1/2) (x_bar - miu_bar_y).T C_y_inv (x_bar - miu_bar_y) ]
- If the C_y's are different, the objective function is quadratic in x_bar
- Else if C_y = C for all y, the objective function is linear: the ln|C_y| term is common to all classes and the x_bar.T C_inv x_bar term cancels, giving y_hat = argmax over y of [ ln(pi_y) + miu_bar_y.T C_inv x_bar - (1/2) miu_bar_y.T C_inv miu_bar_y ]
- If the C's are all equal we can compare the hypotheses in pairs (l, m) using the following decision rule: decide l over m if ln(pi_l) + miu_bar_l.T C_inv x_bar - (1/2) miu_bar_l.T C_inv miu_bar_l > ln(pi_m) + miu_bar_m.T C_inv x_bar - (1/2) miu_bar_m.T C_inv miu_bar_m
- And the boundary exists where the two values are equal
- Rearranging the terms with x_bar, miu_bar, and C into w_bar and the pi's (plus the remaining constants) into beta, we get g(x_bar) = w_bar.T x_bar + beta, where w_bar = C_inv (miu_bar_l - miu_bar_m) and beta = ln(pi_l / pi_m) - (1/2) (miu_bar_l.T C_inv miu_bar_l - miu_bar_m.T C_inv miu_bar_m), and the boundary as g(x_bar) = 0
- The function g is called a linear discriminant function, as it is linear in x_bar, and the boundary exists where g = 0
- Putting this all together: with Gaussian class-conditional distributions, equal covariance matrices give a linear Bayes classifier, while unequal covariance matrices give quadratic decision boundaries
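A NumPy sketch of this Gaussian Bayes classifier; the class parameters are toy assumptions, and constants common to all classes are dropped from the score:

```python
import numpy as np

def gaussian_score(x, pi, mu, C):
    """Discriminant score ln(pi) - (1/2) ln|C| - (1/2) (x-mu)^T C_inv (x-mu)."""
    diff = x - mu
    return (np.log(pi)
            - 0.5 * np.log(np.linalg.det(C))
            - 0.5 * diff @ np.linalg.solve(C, diff))

# Two classes with different means and a shared covariance (the linear case):
mus = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
pis = [0.5, 0.5]
C = np.eye(2)

x = np.array([1.5, 1.0])
scores = [gaussian_score(x, pi, mu, C) for pi, mu in zip(pis, mus)]
print("predicted class:", int(np.argmax(scores)))  # -> 1
```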
2.4 Linear Classifiers
- Linear classifier: A classifier for which the decision between each pair of labels (l, m) is made through a linear discriminant function of the form g(x_bar) = w_bar.T x_bar + b
- with decision boundary g = 0
- g = 0 describes a hyperplane in R^d
- The plane thus divides the space into two parts
- Binary Linear Classifier: Uses the above hyperplane to designate -1 or 1 for either side of the hyperplane
- An M-ary classifier can be built from nCr(M, 2) = M(M-1)/2 binary classifiers, one per pair of labels
- Combining the pairwise decisions allows us to build non-linear (piecewise-linear) decision boundaries using linear classifiers
- Cheap to use on test data
- Useful for building neural networks
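A minimal sketch of a binary linear classifier: the sign of g(x_bar) = w.T x + b picks the side of the hyperplane. The values of w, b, and the sample points are arbitrary illustrative assumptions:

```python
import numpy as np

def linear_classify(x, w, b):
    """Return +1 or -1 depending on which side of the hyperplane g(x)=0 x falls."""
    return 1 if w @ x + b > 0 else -1

w = np.array([1.0, -2.0]); b = 0.5
print(linear_classify(np.array([2.0, 0.0]), w, b))  # g = 2.5 > 0 -> +1
print(linear_classify(np.array([0.0, 2.0]), w, b))  # g = -3.5 < 0 -> -1
```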
2.5 Linear Discriminant Analysis
- Gaussian Discriminant Analysis (GDA), also called Quadratic Discriminant Analysis (QDA)
- Assume feature vectors are multivariate Gaussian distributions w/ unknown means and covariance matrices
- Using the Bayes Classifier we need to estimate the prior probabilities, means, and covariance matrices
- If the covariance matrices are the same the problem simplifies and we have an LDA problem (Linear Discriminant Analysis)
2.5.1 Estimating the Unknown Parameters
- miu_bar, pi, and C are estimated from the training data
- Note: M = number of classes, N = number of training samples
- Partition the data by label: with N_l samples of class l, the estimates are pi_hat_l = N_l / N, miu_hat_l = (1/N_l) sum over i with y_i = l of x_bar_i, and (in maximum-likelihood form) C_hat = (1/N) sum over all i of (x_bar_i - miu_hat_{y_i})(x_bar_i - miu_hat_{y_i}).T
2.5.2 The LDA Algorithm
```
Algorithm: Linear Discriminant Analysis
# run to build the model
function TrainLinearDiscriminantAnalysis(Training Set T)
    for each class l = 1, ..., M do
        calculate the estimate of the prior probability of class l, pi_hat_l
        calculate the estimate of the mean of class l, miu_hat_l
    end for
    calculate the estimate of the covariance matrix, C_hat
end function
```
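A runnable NumPy sketch of this training routine plus the corresponding linear-discriminant prediction rule from 2.3.3; the array layout (X as an N x d matrix, integer labels in y) and the toy data are assumptions:

```python
import numpy as np

def train_lda(X, y):
    """Estimate priors, class means, and the shared covariance matrix."""
    N, d = X.shape
    labels = np.unique(y)
    pis = {l: np.mean(y == l) for l in labels}          # pi_hat_l = N_l / N
    mus = {l: X[y == l].mean(axis=0) for l in labels}   # miu_hat_l
    centered = X - np.stack([mus[l] for l in y])        # subtract each sample's class mean
    C = centered.T @ centered / N                       # pooled (ML) covariance C_hat
    return pis, mus, C

def lda_predict(x, pis, mus, C):
    """Pick the class maximizing ln(pi_l) + miu_l.T C_inv x - (1/2) miu_l.T C_inv miu_l."""
    def score(l):
        a = np.linalg.solve(C, mus[l])  # C_inv miu_l
        return np.log(pis[l]) + a @ x - 0.5 * a @ mus[l]
    return max(pis, key=score)

# Usage on toy data:
X = np.array([[0.0, 0.0], [0.2, 0.1], [2.0, 2.0], [2.1, 1.9]])
y = np.array([1, 1, 2, 2])
pis, mus, C = train_lda(X, y)
print(lda_predict(np.array([1.8, 2.0]), pis, mus, C))  # -> 2
```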