Tagged: ml
Chapter 2: Classification
Table of Contents
 2.1 Measuring Classifier Performance
 2.1.1 Displaying Error Performance (Optional Not Completed)
 2.1.2 Computational Performance
 2.2 kNearest Neighbors
 2.3 Bayes Classifiers: ECE313 Redux
 2.3.1 Binary Classifiers
 2.3.2 Mary Classifiers
 2.3.3 The Bayes Classifier for Multivariate Gaussian Distributions
 2.4 Linear Classifiers
 2.5 Linear Discriminant Analysis
 2.5.1 Estimating the Unknown Parameters
 2.5.2 The LDA Algorithm
 2.6 Naive Bayes Classifier
 2.6.1 Choosing a Model for the Features
 2.7 Kernel tricks
 2.8 Logistic Regression
 2.9 Support Vector Machines
 2.9.1 Connection to Logistic Regression
NOTE
 Unless stated the following L’s are using the 01 loss function
Vocab
 Loss Function: A function used to evaluate the performance of a training
 Training Error: For a classifier f it is the average loss over a given training set
 Overfitting: The phenomenon in which a classifier is tuned specifically for one specific training set and will thus perform badly with other sets
 Dissimilarity Measure: Function that calculates dissimilarity between two vectors. Larger = more dissimilar
 Bayes Classifier: A MAPish rule using priors
 Linear classifier: A classifier for which the decision between each pair of label pair (l,m) is made through a linear discriminant function of the form
2.1 Measuring Classifier Performance
 To properly gauge performance we use a loss function to evaluate performance
 Tells us cost we incur when estimating y based on given x
 Training error is also used to gauge quality
 It is the average of the loss function for a training set T
 01 Loss Function
 1 if error 0 if correct
 The training error of this loss function is
 Overfitting: An overfit set is fit specifically to meet the needs of the training set and may not serve other sets of data well
 Because of the overfitting phenomenon it is unwise to rely solely on training error
 Validation Sets: A set of data, not the training data, used to verify the accuracy of the classifier
 Prediction/Generalization Error: This is the correct way to measure accuracy
 Recall that the classification “f” was trained using training data T
 Meaning the classifier is a function of both T and f
 Prediction error then should be estimated using average loss given T
 Recalling that expected value is a summation of probability*value then we can see this is an average over the true joint distribution p(x,y) which we do not have
 Thus we run on a validation set and use that to calculate Err
 Validation Set Error Checking
 This is still implicitly a function of T as f itself depends on T
 In the 01 loss function case the fraction remains the same as previous only in terms of V not T
2.1.1 Displaying Error Performance
 Confusion matrices or Contingency tables are used to display performance of a classifier looks like a karnough map
 True Positives
 True Negatives
 False Negatives
 New Metrics
 True Positive Rate (TPR)
 False Positive Rate (FPR)
 Specificity (True negative Rate)
 Precision (Positive Predictive Value)
 Accuracy
 Harmonic mean / FMeasure / FScore / F1Score
 Reciever Operating Characteristic (ROC)
 Area Under an ROC Curve (AUC)
2.1.2 Computational Performance
 Computational Performance: Concered with time and space requirements
 In general, during training we have a lot of resources to play with
 In general, during testing there should be less resources available (quick testing)
 Often linear support vector machines and nearest neighbor classifiers are first used
 Linear support vector machines have lower training computational requirements while having relatively good statistical performance
 Neartest neighbor can have high test computational requirements and acceptable statistical performance (for little work/coding)
2.2 kNearest Neighbors
 Given a training set and an input feature vector compute k feature vectors in the training set closest to x
 Dissimilarity Measure: Function that calculates dissimilarity between two vectors. Larger = more dissimilar
 Typically it’s the Euclidian distance
 Other choices (2.7) radial basis functions, 1 – cos(angle(x1, x2))
 1 – cosine similarity

Plain Text>Algo for kNearest Neighborsfunction kNearestNeighbors(test sample x_bar, Training Set T, dissimilarity measure between feature vectors delta(.,.))for i=1 in range(N) doCalculate delta_i=dissimilarity(x_bar, x_bar_i)end for
Find indicies {i_1, …, i_k} corresponding to ksmallest values of delta_i (break ties arbitrarily)return most frequently occuring element in {y_i1, …, y_ik}end function[/code][/code][/code][/code][/code][/code][/code][/code][/code][/code][/code][/code][/code][/code][/code][/code]
Plain Text>Plain Text>Algo for Nearest Neighbor with Euclidean Distancefunction NearestNeighbors(test sample x, Training Set T)for i=1 in range(N) doCalculate delta_i=x_barx_bar_iend forFind i* such that delta_i is smallest (break ties arbitrarily) return y_i* end function In the kNN classifier the higher the k is the smoother the boundaries are In the naive implementation of kNN you calculate distance between test sample and all feature vectors. aka complexity of algo grows linearly in the size of the training set Note: The whoel training set must be kept in memory kNN does not tell us much about the structure of the data only what's closest This is known as instancebased learning
2.3 Bayes Classifiers: ECE 313 Redux
2.3.1 Binary Classifiers
 Binary Hypothesis Testing: Observations X drawn from 2 hypotheses (labels, classes)

 p(x1) used to denote X has pdf/pmf p(x1)
 Assumed probabilitiy distributions p(x1) & p(x1) are known

 Bayesian Setting: We are also given priors
 Bayes Classifier: Is like a likelihood ratio test (Maximum A Posteriori (MAP) decision rule or Bayes Decision Rule)
 Difference between hypothesis testing and classification problem is that are not known
 Thus everything is based on T
 Bayes Classifier: Is like a likelihood ratio test (Maximum A Posteriori (MAP) decision rule or Bayes Decision Rule)
 Using 0,1 loss the probability of error is
 Bayes classifier minimizes P_e
 Bayes classifier is best in terms of prediction eror because it was given the joint probability which supervised learning algorithms don't have
2.3.2 Mary Classifiers
 M>=2 Bayes classifiers can be constructed essentially the same way as binary
 We now have M hypotheses, M priors
 Giving the following estimate
 Classifiers are generated by writing down a model of p(x_bar, y) and using estimates of the priors pi and p(xy) generated from the data
 The challenge is estimating the values efficiently
2.3.3 The Bayes Classifier for Multivariate Gaussian Distributions
 Vectors are ddimensional with mean miu_bar, and covariance matrix C where C is positive definite (symmetric, all positive eigenvalues, invertable) if it has the joint pdf
 To denote X has the above pdf use
 Note: miu_l is the lth element of miu_bar and is equal to E[X_l]
 The cov matrix contains all pairwaise covariances of the elements of X
 The diagonal entries are the variances of the features (cov(a,a)=var(a))
 Off diagonal entries are the "linear part" of the relationship between features
 Eg. 2 vars height and weight, diagonals are the variances and the offdiagonals determine the slope (scaled by variance of the other var) of the linear minimum mean square estimator of height by weight
 To denote X has the above pdf use
 Now for a Mary Hypothesis with y hypotheses Hy
 Then...

 If Cy's are different the objective function is quadratic in x
 Else if Cy=C then the objective function is linear and the lnCy and x.T C_inv x term can be ignored giving the following
 If C's are all equal we can compare each variable in pairs using the following decision rule
 And the boundary exists where the two values are equal
 Rearranging the terms with x, miu, and C into w and all pi's into beta we get
 and the boundary as
 The function g is called a linear discriminant function as it is linear in x and boundary exists when g=0
 Putting this all together we can summarize and conclude
2.4 Linear Classifiers
 Linear classifier: A classifier for which the decision between each pair of label pair (l,m) is made through a linear discriminant function of the form with decision boundary g=0
 g describes a hyperplane in
 The plane thus divides the space into two parts
 Binary Linear Classifier: Uses the above hyper plane to designate 1 or 1 for either side of the hyper plane
 An Mary classifier can be built in nCr(M, 2) pairs of labels
 Allows us to build nonlinear decision boundaries using a linear classifier
 Cheap to use on test data
 Useful for building neural networks
2.5 Linear Discriminant Analysis
 Gaussian Discriminant Analysis (GDA or Quadratic Discriminant Analysis (QDA)
 Assume feature vectors are multivariate Gaussian distributions w/ unknown means and covariance matrices
 Using the Bayes Classifier we need to estimate the prior probabilites, means, and covariance matrices
 If the covariance matrices are the same the problem simplifies and we have an LDA problem (Linear Discriminant Analysis)
2.5.1 Estimating the Unknown Parameters
 miu, pi and C are estimated from the training data

 Note M = number of features, N = number of samples
 Partition the data by label
2.5.2 The LDA Algorithm

Plain Text>Algorithm: Linear Discriminant Analysis# Run to make modelfunction TrainLinearDiscriminantAnalaysis(Training Set T)for each class l=1,...,M doCalc est of the prior probability class l, pi_hat_lCalc est of mean of class l, miu_hat_lend forCalc est of cov matrix C_hatend function
Chapter 1: Introduction
Table of Contents
 1.1 What is Machine Learning
Notation
 Capital Letters: Random variables, matrices, number of classes (clear by context)
 Bold: Vectors (column vectors, unless transposed)
 Hats: estimates
Vocab
 i.i.d: Independently and identically distributed
 Joint Distribution: Probablility of something occuring taking into account 2 or more random variables
 Supervised Learning: Learning with a training set
 Unsupervised Learning: Learning with a set of data without values
 Feature Vectors: These are the vectors that “describe” a defining feature of the data
 Feature Space: Vector space containing the feature vectors
 Finite Set: Labels occur from {1, 1}
 NonFinite Set: Labels occur on real number line
 Classification (Pattern Recognition): Supervised learning with a finite set of labels
 Labels are called classes
 Clustering: Data grouped together giving away some sort of pattern
1.1 What is Machine Learning
 Supervised Learning: Machine Learning with a training set T. Feature vectors x generate labels y
 Data is drawn i.i.d from some joint distribution
 Vectors x_i are feature vectors or inputs
 Values y_i are known as labels (outputs, responses)
 Feature Space: Vector space containing the feature vectors
 The goal of supervised learning is to create a map from the training data
 This allows us to create a model to predict output values
 This process is called learning or training a model
 Classification (Pattern Recognition)
 Labels are a finiteset {1, 1}
 Labels are now known as classses
 Then the supervised problem is known as classification or pattenr recognition
 Labels are a nonfinite set, any real number
 Then the supervised problem is known as regression
 Unsupervised Learning: Machine Learning without a training set but only a set of data (feature vectors)
 Data is drawn i.i.d and we want to infer something useful from the data
 Clustering is used to identify such patterns
 Take vectors from D and group them into similar groups