# Chapter 2: Classification

• 2.1 Measuring Classifier Performance
• 2.1.1 Displaying Error Performance (Optional Not Completed)
• 2.1.2 Computational Performance
• 2.2 k-Nearest Neighbors
• 2.3 Bayes Classifiers: ECE313 Redux
• 2.3.1 Binary Classifiers
• 2.3.2 M-ary Classifiers
• 2.3.3 The Bayes Classifier for Multivariate Gaussian Distributions
• 2.4 Linear Classifiers
• 2.5 Linear Discriminant Analysis
• 2.5.1 Estimating the Unknown Parameters
• 2.5.2 The LDA Algorithm
• 2.6 Naive Bayes Classifier
• 2.6.1 Choosing a Model for the Features
• 2.7 Kernel tricks
• 2.8 Logistic Regression
• 2.9 Support Vector Machines
• 2.9.1 Connection to Logistic Regression

NOTE

• Unless stated the following L’s are using the 0-1 loss function

Vocab

• Loss Function: A function used to evaluate the performance of a training
• Training Error: For a classifier f it is the average loss over a given training set
• Overfitting: The phenomenon in which a classifier is tuned specifically for one specific training set and will thus perform badly with other sets
• Dissimilarity Measure: Function that calculates dissimilarity between two vectors. Larger = more dissimilar
• Bayes Classifier: A MAPish rule using priors
• Linear classifier: A classifier for which the decision between each pair of label pair (l,m) is made through a linear discriminant function of the form

2.1 Measuring Classifier Performance

• To properly gauge performance we use a loss function to evaluate performance
• 
• Tells us cost we incur when estimating y based on given x
• Training error is also used to gauge quality
• It is the average of the loss function for a training set T
• 
• 0-1 Loss Function
• 1 if error 0 if correct
• The training error of this loss function is
• 
• Overfitting: An overfit set is fit specifically to meet the needs of the training set and may not serve other sets of data well
• Because of the overfitting phenomenon it is unwise to rely solely on training error
• Validation Sets: A set of data, not the training data, used to verify the accuracy of the classifier
• Prediction/Generalization Error: This is the correct way to measure accuracy
• Recall that the classification “f” was trained using training data T
• Meaning the classifier is a function of both T and f
• Prediction error then should be estimated using average loss given T
• 
• Recalling that expected value is a summation of probability*value then we can see this is an average over the true joint distribution p(x,y) which we do not have
• Thus we run on a validation set and use that to calculate Err
• Validation Set Error Checking
• 
• This is still implicitly a function of T as f itself depends on T
• In the 0-1 loss function case the fraction remains the same as previous only in terms of V not T

2.1.1 Displaying Error Performance

• Confusion matrices or Contingency tables are used to display performance of a classifier looks like a karnough map
• True Positives
• True Negatives
• False Negatives
• New Metrics
• True Positive Rate (TPR)
• False Positive Rate (FPR)
• Specificity (True negative Rate)
• Precision (Positive Predictive Value)
• Accuracy
• Harmonic mean / F-Measure / F-Score / F1-Score
• Reciever Operating Characteristic (ROC)
• Area Under an ROC Curve (AUC)

2.1.2 Computational Performance

• Computational Performance: Concered with time and space requirements
• In general, during training we have a lot of resources to play with
• In general, during testing there should be less resources available (quick testing)
• Often linear support vector machines and nearest neighbor classifiers are first used
• Linear support vector machines have lower training computational requirements while having relatively good statistical performance
• Neartest neighbor can have high test computational requirements and acceptable statistical performance (for little work/coding)

2.2 k-Nearest Neighbors

• Given a training set and an input feature vector compute k feature vectors in the training set closest to x
• Dissimilarity Measure: Function that calculates dissimilarity between two vectors. Larger = more dissimilar
• Typically it’s the Euclidian distance
• Other choices (2.7) radial basis functions, 1 – cos(angle(x1, x2))
• 1 – cosine similarity
• `Plain Text>Algo for k-Nearest Neighborsfunction k-NearestNeighbors(test sample x_bar, Training Set T, dissimilarity measure between feature vectors delta(.,.))for i=1 in range(N) doCalculate delta_i=dissimilarity(x_bar, x_bar_i)end for`

Find indicies {i_1, …, i_k} corresponding to ksmallest values of delta_i (break ties arbitrarily)return most frequently occuring element in {y_i1, …, y_ik}end function[/code][/code][/code][/code][/code][/code][/code][/code][/code][/code][/code][/code][/code][/code][/code][/code]

• ```	Plain Text>Plain Text>Algo for Nearest Neighbor with Euclidean Distancefunction NearestNeighbors(test sample x, Training Set T)for i=1 in range(N) doCalculate delta_i=||x_bar-x_bar_i||end forFind i* such that delta_i is smallest (break ties arbitrarily)

return y_i*
end function
In the k-NN classifier the higher the k is the smoother the boundaries are
In the naive implementation of k-NN you calculate distance between test sample and all feature vectors.  aka complexity of algo grows linearly in the size of the training set	Note: The whoel training set must be kept in memory	k-NN does not tell us much about the structure of the data only what's closest	This is known as instance-based learning```
```2.3 Bayes Classifiers: ECE 313 Redux
```

`2.3.1 Binary Classifiers`

``` Binary Hypothesis Testing: Observations X drawn from 2 hypotheses (labels, classes)  p(x|-1) used to denote X has pdf/pmf p(x|-1) Assumed probabilitiy distributions p(x|-1) & p(x|1) are known Bayesian Setting: We are also given priors   Bayes Classifier: Is like a likelihood ratio test (Maximum A Posteriori (MAP) decision rule or Bayes Decision Rule)  Difference between hypothesis testing and classification problem is that  are not known Thus everything is based on T Using 0,1 loss the probability of error is  Bayes classifier minimizes P_e Bayes classifier is best in terms of prediction eror because it was given the joint probability which supervised learning algorithms don't have ```

`2.3.2 M-ary Classifiers`

``` M>=2 Bayes classifiers can be constructed essentially the same way as binary We now have M hypotheses, M priors Giving the following estimate  Classifiers are generated by writing down a model of p(x_bar, y) and using estimates of the priors pi and p(x|y) generated from the data The challenge is estimating the values efficiently ```

`2.3.3 The Bayes Classifier for Multivariate Gaussian Distributions`

``` Vectors are d-dimensional with mean miu_bar, and covariance matrix C where C is positive definite (symmetric, all positive eigenvalues, invertable) if it has the joint pdf  To denote X has the above pdf use  Note: miu_l is the l-th element of miu_bar and is equal to E[X_l] The cov matrix contains all pairwaise covariances of the elements of X  The diagonal entries are the variances of the features (cov(a,a)=var(a)) Off diagonal entries are the "linear part" of the relationship between features Eg. 2 vars height and weight, diagonals are the variances and the off-diagonals determine the slope (scaled by variance of the other var) of the linear minimum mean square estimator of height by weight Now for a M-ary Hypothesis with y hypotheses Hy Then...   If Cy's are different the objective function is quadratic in x Else if Cy=C then the objective function is linear and the ln|Cy| and x.T C_inv x term can be ignored giving the following  If C's are all equal we can compare each variable in pairs using the following decision rule And the boundary exists where the two values are equal Rearranging the terms with x, miu, and C into w and all pi's into beta we get  and the boundary as  The function g is called a linear discriminant function as it is linear in x and boundary exists when g=0 Putting this all together we can summarize and conclude ```

`2.4 Linear Classifiers`

``` Linear classifier: A classifier for which the decision between each pair of label pair (l,m) is made through a linear discriminant function of the form with decision boundary g=0 g describes a hyperplane in  The plane thus divides the space into two parts  Binary Linear Classifier: Uses the above hyper plane to designate -1 or 1 for either side of the hyper plane An M-ary classifier can be built in nCr(M, 2) pairs of labels Allows us to build non-linear decision boundaries using a linear classifier Cheap to use on test data Useful for building neural networks ```

`2.5 Linear Discriminant Analysis`

``` Gaussian Discriminant Analysis (GDA or Quadratic Discriminant Analysis (QDA) Assume feature vectors are multivariate Gaussian distributions w/ unknown means and covariance matrices Using the Bayes Classifier we need to estimate the prior probabilites, means, and covariance matrices If the covariance matrices are the same the problem simplifies and we have an LDA problem (Linear Discriminant Analysis) ```

`2.5.1 Estimating the Unknown Parameters`

``` miu, pi and C are estimated from the training data Note M = number of features, N = number of samples Partition the data by label ```

`2.5.2 The LDA Algorithm`

``` Plain Text>Algorithm: Linear Discriminant Analysis# Run to make modelfunction TrainLinearDiscriminantAnalaysis(Training Set T)for each class l=1,...,M doCalc est of the prior probability class l, pi_hat_lCalc est of mean of class l, miu_hat_lend forCalc est of cov matrix C_hatend function ```
``` Written by pokhym Leave a comment Posted in Big Data Tagged with ece398, ece398bd, ml ```
``` September 6, 2017 Chapter 1: Introduction Table of Contents 1.1 What is Machine Learning Notation Capital Letters: Random variables, matrices, number of classes (clear by context) Bold: Vectors (column vectors, unless transposed) Hats: estimates Vocab i.i.d: Independently and identically distributed Joint Distribution: Probablility of something occuring taking into account 2 or more random variables Supervised Learning: Learning with a training set Unsupervised Learning: Learning with a set of data without values Feature Vectors: These are the vectors that “describe” a defining feature of the data Feature Space: Vector space containing the feature vectors Finite Set: Labels occur from {-1, 1} Non-Finite Set: Labels occur on real number line Classification (Pattern Recognition): Supervised learning with a finite set of labels Labels are called classes Clustering: Data grouped together giving away some sort of pattern 1.1 What is Machine Learning Supervised Learning: Machine Learning with a training set T. Feature vectors x generate labels y $T=\{\{\textbf{x}_i, y_i\}\limits_{i=1}^N\}$ Data is drawn i.i.d from some joint distribution $p(\textbf{x}, y)$ Vectors x_i are feature vectors or inputs Values y_i are known as labels  (outputs, responses) Feature Space: Vector space containing the feature vectors The goal of supervised learning is to create a map from the training data This allows us to create a model to predict output values This process is called learning or training a model Classification (Pattern Recognition) Labels are a finite-set {-1, 1} Labels are now known as classses Then the supervised problem is known as classification or pattenr recognition  Labels are  a non-finite set, any real number Then the supervised problem is known as regression Unsupervised Learning: Machine Learning without a training set but only a set of data (feature vectors) $D=\{\textbf{x}_i\}_{i=1}^N$ Data is drawn i.i.d and we want to infer something useful from the data Clustering is used to identify such patterns Take vectors from D and group them into similar groups Written by pokhym Leave a comment Posted in Machine Learning Tagged with classification, ece398, ece398bd, Machine Learning, ml ```