Chapter 2: Classification
Table of Contents
 2.1 Measuring Classifier Performance
 2.1.1 Displaying Error Performance (Optional Not Completed)
 2.1.2 Computational Performance
 2.2 kNearest Neighbors
 2.3 Bayes Classifiers: ECE313 Redux
 2.3.1 Binary Classifiers
 2.3.2 Mary Classifiers
 2.3.3 The Bayes Classifier for Multivariate Gaussian Distributions
 2.4 Linear Classifiers
 2.5 Linear Discriminant Analysis
 2.5.1 Estimating the Unknown Parameters
 2.5.2 The LDA Algorithm
 2.6 Naive Bayes Classifier
 2.6.1 Choosing a Model for the Features
 2.7 Kernel tricks
 2.8 Logistic Regression
 2.9 Support Vector Machines
 2.9.1 Connection to Logistic Regression
NOTE
 Unless stated the following L’s are using the 01 loss function
Vocab
 Loss Function: A function used to evaluate the performance of a training
 Training Error: For a classifier f it is the average loss over a given training set
 Overfitting: The phenomenon in which a classifier is tuned specifically for one specific training set and will thus perform badly with other sets
 Dissimilarity Measure: Function that calculates dissimilarity between two vectors. Larger = more dissimilar
 Bayes Classifier: A MAPish rule using priors
 Linear classifier: A classifier for which the decision between each pair of label pair (l,m) is made through a linear discriminant function of the form
2.1 Measuring Classifier Performance
 To properly gauge performance we use a loss function to evaluate performance
 Tells us cost we incur when estimating y based on given x
 Training error is also used to gauge quality
 It is the average of the loss function for a training set T
 01 Loss Function
 1 if error 0 if correct
 The training error of this loss function is
 Overfitting: An overfit set is fit specifically to meet the needs of the training set and may not serve other sets of data well
 Because of the overfitting phenomenon it is unwise to rely solely on training error
 Validation Sets: A set of data, not the training data, used to verify the accuracy of the classifier
 Prediction/Generalization Error: This is the correct way to measure accuracy
 Recall that the classification “f” was trained using training data T
 Meaning the classifier is a function of both T and f
 Prediction error then should be estimated using average loss given T
 Recalling that expected value is a summation of probability*value then we can see this is an average over the true joint distribution p(x,y) which we do not have
 Thus we run on a validation set and use that to calculate Err
 Validation Set Error Checking
 This is still implicitly a function of T as f itself depends on T
 In the 01 loss function case the fraction remains the same as previous only in terms of V not T
2.1.1 Displaying Error Performance
 Confusion matrices or Contingency tables are used to display performance of a classifier looks like a karnough map
 True Positives
 True Negatives
 False Negatives
 New Metrics
 True Positive Rate (TPR)
 False Positive Rate (FPR)
 Specificity (True negative Rate)
 Precision (Positive Predictive Value)
 Accuracy
 Harmonic mean / FMeasure / FScore / F1Score
 Reciever Operating Characteristic (ROC)
 Area Under an ROC Curve (AUC)
2.1.2 Computational Performance
 Computational Performance: Concered with time and space requirements
 In general, during training we have a lot of resources to play with
 In general, during testing there should be less resources available (quick testing)
 Often linear support vector machines and nearest neighbor classifiers are first used
 Linear support vector machines have lower training computational requirements while having relatively good statistical performance
 Neartest neighbor can have high test computational requirements and acceptable statistical performance (for little work/coding)
2.2 kNearest Neighbors
 Given a training set and an input feature vector compute k feature vectors in the training set closest to x
 Dissimilarity Measure: Function that calculates dissimilarity between two vectors. Larger = more dissimilar
 Typically it’s the Euclidian distance
 Other choices (2.7) radial basis functions, 1 – cos(angle(x1, x2))
 1 – cosine similarity

Plain Text>Algo for kNearest Neighborsfunction kNearestNeighbors(test sample x_bar, Training Set T, dissimilarity measure between feature vectors delta(.,.))for i=1 in range(N) doCalculate delta_i=dissimilarity(x_bar, x_bar_i)end for
Find indicies {i_1, …, i_k} corresponding to ksmallest values of delta_i (break ties arbitrarily)return most frequently occuring element in {y_i1, …, y_ik}end function[/code][/code][/code][/code][/code][/code][/code][/code][/code][/code][/code][/code][/code][/code][/code][/code]
Plain Text>Plain Text>Algo for Nearest Neighbor with Euclidean Distancefunction NearestNeighbors(test sample x, Training Set T)for i=1 in range(N) doCalculate delta_i=x_barx_bar_iend forFind i* such that delta_i is smallest (break ties arbitrarily) return y_i* end function In the kNN classifier the higher the k is the smoother the boundaries are In the naive implementation of kNN you calculate distance between test sample and all feature vectors. aka complexity of algo grows linearly in the size of the training set Note: The whoel training set must be kept in memory kNN does not tell us much about the structure of the data only what's closest This is known as instancebased learning
2.3 Bayes Classifiers: ECE 313 Redux
2.3.1 Binary Classifiers
 Binary Hypothesis Testing: Observations X drawn from 2 hypotheses (labels, classes)

 p(x1) used to denote X has pdf/pmf p(x1)
 Assumed probabilitiy distributions p(x1) & p(x1) are known

 Bayesian Setting: We are also given priors
 Bayes Classifier: Is like a likelihood ratio test (Maximum A Posteriori (MAP) decision rule or Bayes Decision Rule)
 Difference between hypothesis testing and classification problem is that are not known
 Thus everything is based on T
 Bayes Classifier: Is like a likelihood ratio test (Maximum A Posteriori (MAP) decision rule or Bayes Decision Rule)
 Using 0,1 loss the probability of error is
 Bayes classifier minimizes P_e
 Bayes classifier is best in terms of prediction eror because it was given the joint probability which supervised learning algorithms don't have
2.3.2 Mary Classifiers
 M>=2 Bayes classifiers can be constructed essentially the same way as binary
 We now have M hypotheses, M priors
 Giving the following estimate
 Classifiers are generated by writing down a model of p(x_bar, y) and using estimates of the priors pi and p(xy) generated from the data
 The challenge is estimating the values efficiently
2.3.3 The Bayes Classifier for Multivariate Gaussian Distributions
 Vectors are ddimensional with mean miu_bar, and covariance matrix C where C is positive definite (symmetric, all positive eigenvalues, invertable) if it has the joint pdf
 To denote X has the above pdf use
 Note: miu_l is the lth element of miu_bar and is equal to E[X_l]
 The cov matrix contains all pairwaise covariances of the elements of X
 The diagonal entries are the variances of the features (cov(a,a)=var(a))
 Off diagonal entries are the "linear part" of the relationship between features
 Eg. 2 vars height and weight, diagonals are the variances and the offdiagonals determine the slope (scaled by variance of the other var) of the linear minimum mean square estimator of height by weight
 To denote X has the above pdf use
 Now for a Mary Hypothesis with y hypotheses Hy
 Then...

 If Cy's are different the objective function is quadratic in x
 Else if Cy=C then the objective function is linear and the lnCy and x.T C_inv x term can be ignored giving the following
 If C's are all equal we can compare each variable in pairs using the following decision rule
 And the boundary exists where the two values are equal
 Rearranging the terms with x, miu, and C into w and all pi's into beta we get
 and the boundary as
 The function g is called a linear discriminant function as it is linear in x and boundary exists when g=0
 Putting this all together we can summarize and conclude
2.4 Linear Classifiers
 Linear classifier: A classifier for which the decision between each pair of label pair (l,m) is made through a linear discriminant function of the form with decision boundary g=0
 g describes a hyperplane in
 The plane thus divides the space into two parts
 Binary Linear Classifier: Uses the above hyper plane to designate 1 or 1 for either side of the hyper plane
 An Mary classifier can be built in nCr(M, 2) pairs of labels
 Allows us to build nonlinear decision boundaries using a linear classifier
 Cheap to use on test data
 Useful for building neural networks
2.5 Linear Discriminant Analysis
 Gaussian Discriminant Analysis (GDA or Quadratic Discriminant Analysis (QDA)
 Assume feature vectors are multivariate Gaussian distributions w/ unknown means and covariance matrices
 Using the Bayes Classifier we need to estimate the prior probabilites, means, and covariance matrices
 If the covariance matrices are the same the problem simplifies and we have an LDA problem (Linear Discriminant Analysis)
2.5.1 Estimating the Unknown Parameters
 miu, pi and C are estimated from the training data

 Note M = number of features, N = number of samples
 Partition the data by label
2.5.2 The LDA Algorithm

Plain Text>Algorithm: Linear Discriminant Analysis# Run to make modelfunction TrainLinearDiscriminantAnalaysis(Training Set T)for each class l=1,...,M doCalc est of the prior probability class l, pi_hat_lCalc est of mean of class l, miu_hat_lend forCalc est of cov matrix C_hatend function
# Run to classify test samplefunction ClassifyLinearDiscriminantAnalysis(Test sample x_bar, Parameter estimates pi_bar, miu_bar, C_hat)for each class 1=1,...M doCalc function delta_l(x_bar)=x_bar.T*C_hat_inv*miu_hat_bar_l  .5*miu_hat_bar_l.T*C_hat_inv*miu_hat_bar_l + ln(pi_hat_l)end forreturn the l that has the largest delta_l(x_bar)# break ties arbitrarilyend function</code></li></code></ul><p><code><strong>2.6 Naive Bayes Classifier</strong></code><br></p>Not a linear classifierIt is an "online learning/streaming algorithm" in which training data arrives in seqeuence and the classifier can be updated over time and as such works well for slowly changing feature distributions
 Naive Bayes works on the assumption that features are conditionally independent under each class
 This will result in Md distributions d=cardinality(x) and M=cardinality(y)
 Reductio in number of distributions means that it is not as taxing for large datasets while reducing overfitting
 Naive Bayes also allows you to mix categorical and continuous features
 Categorical: Values in a finite set (eg. T/F)
 Continuous: Realvalues
 Joint features will have distributions as one needs to take into account the relation between two distributions themselves
 The estimate and estimates on now becomes
 Note: N_y is the number of training samples in class y
 Of course the joint distribution estimate depends on your model of choice
2.6.1 Choosing a Model for the Features
 Assumed that all features come from the same family of distributions for all classes with different parameters
 Distribution of feature j under class y is modeled as a distribution on the parameter vector theta
 This will replace the p(xy) in the Naive Bayes Classifer
 Furthermore in addition to that theta will be estimated itself by
 For continuous data common choice is the Gaussian distribution
 The second image represents the parameter vector containing the mean and variance of feature j on data y
 The third image represents the estimates of these values
 Take the jth feature for all the data in class y and average it to find miu_jy
 Then use this estimate to calculate variance of the jth feature for all data in class y
 For categorical data, common choice is the Multinoulli or Categorical distribution where feature vectors take values 0, …, l1
 Each of the probabilities is essentially based on the histogram of the data
 If we have a probability for that is equal to 1 we call this degenerate
 This means that in our training data every example had the occurence of “X”
 During validation if any data had information regarding “X” but never mentioned “X” itself it the model would still mark it as having nothing to do with “X"
 One way to solve this is through Laplace smoothing or additive smoothing
 This essentially rebalances the estimated probabilities by adding “samples” that we pretend to have for each value the feature can take on and normalizing it
 Typically alpha=1
 Feature selection is also an alternative technique where we determine which feature contains the most informational about the class
 Calculating “mutual information” allows us to weight the features based on the amount of information it has
 Mutual information measures how much information the presence/absence of a term contributes to making the correct classification decision on y
 Revisited in 3.5.4
2.7 Kernel Tricks
 A trick to adapt linear classification algorithms to give nonlinear decision boundaries
 Let the nonlinear mapping function be known as
 Often these mappings are to a larger feature space where the classes can then be linearly separated
 For example in the following example using the mapping we can create the linear separation
 Recall that classifiers can be defined by the form below
 Note recall that w and beta both depend on the terms miu_l, miu_m, and C
 Now to talk about kernel tricks
 The idea is to train a model that depends only on dot products (like the pictures above)
 Then we can replace the dot products with a kernel function and then perform a linear classification after mapping to a bigger feature space
 Here are the steps to obtain the kernel version of the model
 All computations with the model should only be written with dot products between pairs of feature vectors
 Replace with the following map

 Essentially it needs to act like its doing a dot product in a different space
 Something that uses this kernel trick is called a kernel method
 There are various different families of kernels that are valid for the kernel to act like its just doing a dot product in a different space
 One is Mercer Kernels
 Mercer Kernels follow the following properties
 Other common kernels are shown below
 It is possible to overfit when enlarging a feature space if there is insufficient data
2.8 Logistic Regression
 One of the most common classification approaches
 Can do Mary class, ordinal cases (ordered cases), determining which features are statistically significant and more
 Logistic regression is a discriminative model
 It tries to directly model which we also use to fit by maximum likelihood
 The maximum likelihood function L is based on training data and is
 Note: Maximizing this function cannot be represented in a formula and must be solved iteratively
 In addition to that the values of beta_0 and beta cannot be calculated via an equation as well and must be solved iteratively
 The classifier is a linear classifier where
 The left hand side is nonnegative iff which is known as a logit transformation
 And finally we obtain the binary logistic regression
2.9 Support Vector Machines
 Linear support vector machines/Support vector classifiers are binary linear classifiers
 SVC + kernel trick gives a support vector machine (SVM)
2.9.1 Connection to Logistic Regression
 Naive Bayes works on the assumption that features are conditionally independent under each class