Chapter 2: Classification

Table of Contents

  • 2.1 Measuring Classifier Performance
    • 2.1.1 Displaying Error Performance (Optional Not Completed)
    • 2.1.2 Computational Performance
  • 2.2 k-Nearest Neighbors
  • 2.3 Bayes Classifiers: ECE313 Redux
    • 2.3.1 Binary Classifiers
    • 2.3.2 M-ary Classifiers
    • 2.3.3 The Bayes Classifier for Multivariate Gaussian Distributions
  • 2.4 Linear Classifiers
  • 2.5 Linear Discriminant Analysis
    • 2.5.1 Estimating the Unknown Parameters
    • 2.5.2 The LDA Algorithm
  • 2.6 Naive Bayes Classifier
    • 2.6.1 Choosing a Model for the Features
  • 2.7 Kernel tricks
  • 2.8 Logistic Regression
  • 2.9 Support Vector Machines
    • 2.9.1 Connection to Logistic Regression

NOTE

  • Unless stated otherwise, the loss function L used below is the 0-1 loss function

Vocab

  • Loss Function: A function used to evaluate the quality of a prediction; it gives the cost incurred when the classifier outputs y_hat for a sample whose true label is y
  • Training Error: For a classifier f it is the average loss over a given training set
  • Overfitting: The phenomenon in which a classifier is tuned specifically for one specific training set and will thus perform badly with other sets
  • Dissimilarity Measure: Function that calculates dissimilarity between two vectors. Larger = more dissimilar
  • Bayes Classifier: A MAP-like decision rule that combines the class priors with the class-conditional distributions
  • Linear classifier: A classifier for which the decision between each pair of labels (l, m) is made through a linear discriminant function of the form g_{l,m}(x) = w^T x + beta, with decision boundary g_{l,m}(x) = 0

2.1 Measuring Classifier Performance

  • To properly gauge performance we use a loss function to evaluate performance
    • Tells us cost we incur when estimating y based on given x
  • Training error is also used to gauge quality
    • It is the average of the loss function for a training set T
  • 0-1 Loss Function
    • The loss is 1 if the prediction is wrong and 0 if it is correct: l(y_hat, y) = 1 if y_hat != y, else 0
    • The training error under this loss is the fraction of misclassified training samples, (1/N) * sum_i 1{f(x_i) != y_i}
  • Overfitting: An overfit set is fit specifically to meet the needs of the training set and may not serve other sets of data well
    • Because of the overfitting phenomenon it is unwise to rely solely on training error
  • Validation Sets: A set of data, not the training data, used to verify the accuracy of the classifier
  • Prediction/Generalization Error: This is the correct way to measure accuracy
    • Recall that the classifier f was trained using the training data T
    • So its output is a function of both the input x and T
    • The prediction error should be estimated as the expected loss on a new sample, Err = E[l(f(X), Y) | T]
      • This expectation is taken over the true joint distribution p(x, y), which we do not have
      • Thus we instead run the classifier on a validation set and use the average loss over it as an estimate of Err
    • Validation Set Error Checking
      • This is still implicitly a function of T, since f itself depends on T
      • In the 0-1 loss case it is the same fraction-of-errors formula as before, only computed over V instead of T (see the sketch after this list)
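  • A minimal sketch (assuming NumPy arrays and a classifier callable f, both hypothetical placeholders) of how the 0-1 training and validation errors would be computed:

      import numpy as np

      def zero_one_error(f, X, y):
          """Average 0-1 loss of classifier f over the samples (X, y):
          the fraction of samples that f labels incorrectly."""
          y_hat = np.array([f(x) for x in X])
          return np.mean(y_hat != y)

      # Hypothetical usage: (X_train, y_train) is the training set T and
      # (X_val, y_val) is a held-out validation set V.
      # train_err = zero_one_error(f, X_train, y_train)   # training error
      # val_err   = zero_one_error(f, X_val, y_val)       # estimate of the prediction error Err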

2.1.1 Displaying Error Performance

  • Confusion matrices (also called contingency tables) are used to display the performance of a classifier; they look like a Karnaugh map
    • True Positives
    • False Positives
    • True Negatives
    • False Negatives
  • New Metrics
    • True Positive Rate (TPR)
    • False Positive Rate (FPR)
    • Specificity (True negative Rate)
    • Precision (Positive Predictive Value)
    • Accuracy
    • Harmonic mean / F-Measure / F-Score / F1-Score
  • Receiver Operating Characteristic (ROC)
    • Area Under an ROC Curve (AUC)
  • The basic metrics above are computed in the sketch below
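  • A small sketch of how these metrics follow from the four confusion-matrix counts (the counts below are made-up numbers, not data from the notes):

      def binary_metrics(tp, fp, tn, fn):
          """Basic metrics derived from the confusion-matrix counts."""
          tpr = tp / (tp + fn)                    # true positive rate (recall / sensitivity)
          fpr = fp / (fp + tn)                    # false positive rate
          tnr = tn / (tn + fp)                    # specificity / true negative rate
          precision = tp / (tp + fp)              # positive predictive value
          accuracy = (tp + tn) / (tp + fp + tn + fn)
          f1 = 2 * precision * tpr / (precision + tpr)   # harmonic mean of precision and recall
          return dict(TPR=tpr, FPR=fpr, TNR=tnr, precision=precision, accuracy=accuracy, F1=f1)

      print(binary_metrics(tp=40, fp=10, tn=45, fn=5))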

2.1.2 Computational Performance

  • Computational Performance: Concerned with time and space requirements
  • In general, during training we have a lot of resources to play with
  • In general, during testing there should be less resources available (quick testing)
  • Often linear support vector machines and nearest neighbor classifiers are tried first
    • Linear support vector machines have lower training computational requirements while having relatively good statistical performance
    • Nearest neighbor classifiers can have high test-time computational requirements but give acceptable statistical performance for very little work/coding

2.2 k-Nearest Neighbors

  • Given a training set and an input feature vector x, find the k feature vectors in the training set closest to x and output the most common label among them
  • Dissimilarity Measure: Function that calculates dissimilarity between two vectors. Larger = more dissimilar
    • Typically it’s the Euclidean distance
    • Other choices (see 2.7): radial basis functions, or 1 - cos(angle(x_1, x_2))
      • i.e., 1 - cosine similarity
  • Algorithm: k-Nearest Neighbors

      function kNearestNeighbors(test sample x_bar, training set T, dissimilarity measure delta(., .))
          for i = 1, ..., N do
              calculate delta_i = delta(x_bar, x_bar_i)
          end for
          find the indices {i_1, ..., i_k} corresponding to the k smallest values of delta_i (break ties arbitrarily)
          return the most frequently occurring label in {y_i1, ..., y_ik}
      end function

  • Algorithm: Nearest Neighbor with Euclidean Distance

      function NearestNeighbors(test sample x_bar, training set T)
          for i = 1, ..., N do
              calculate delta_i = ||x_bar - x_bar_i||
          end for
          find i* such that delta_i* is smallest (break ties arbitrarily)
          return y_i*
      end function

  • In the k-NN classifier, the larger k is, the smoother the decision boundaries are
  • In the naive implementation of k-NN you calculate the distance between the test sample and every training feature vector, so the complexity of the algorithm grows linearly in the size of the training set
    • Note: the whole training set must be kept in memory
  • k-NN does not tell us much about the structure of the data, only what is closest
    • This is known as instance-based learning
  • A runnable sketch of the k-NN classifier is given below
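  • A runnable sketch of the k-NN classifier above (the toy data is made up for illustration; Euclidean distance is used as the default dissimilarity):

      import numpy as np
      from collections import Counter

      def knn_classify(x, X_train, y_train, k=3, dissimilarity=None):
          """Label x by majority vote among its k nearest training samples."""
          if dissimilarity is None:
              dissimilarity = lambda a, b: np.linalg.norm(a - b)   # Euclidean distance
          deltas = [dissimilarity(x, xi) for xi in X_train]
          nearest = np.argsort(deltas)[:k]             # indices of the k smallest dissimilarities
          votes = Counter(y_train[i] for i in nearest)
          return votes.most_common(1)[0][0]            # most frequent label; ties broken arbitrarily

      # Hypothetical toy data: two 2-D classes
      X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.2]])
      y_train = np.array([-1, -1, 1, 1])
      print(knn_classify(np.array([0.8, 0.9]), X_train, y_train, k=3))   # -> 1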
2.3 Bayes Classifiers: ECE 313 Redux

 

2.3.1 Binary Classifiers

  • Binary Hypothesis Testing: Observations X drawn under one of 2 hypotheses (labels, classes)
    • X ~ p(x|-1) is used to denote that under hypothesis -1, X has pdf/pmf p(x|-1)
    • The probability distributions p(x|-1) & p(x|1) are assumed to be known
  • Bayesian Setting: We are also given the priors pi_{-1} = P(Y = -1) and pi_1 = P(Y = 1)
    • Bayes Classifier: A likelihood-ratio-style test (the Maximum A Posteriori (MAP) decision rule or Bayes Decision Rule): decide y_hat = 1 if pi_1 p(x|1) >= pi_{-1} p(x|-1), else y_hat = -1
      • The difference between hypothesis testing and the classification problem is that in classification the distributions p(x|y) and the priors are not known
      • Thus everything must be estimated from T
  • Using the 0-1 loss, the probability of error is P_e = P(f(X) != Y) = E[l(f(X), Y)]
    • The Bayes classifier minimizes P_e (see the sketch below)
    • The Bayes classifier is the best possible in terms of prediction error because it uses the true joint distribution, which supervised learning algorithms don't have
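  • A minimal sketch of the Bayes (MAP) rule when the priors and class-conditional pmfs are known exactly; the discrete pmfs below are invented for illustration:

      # Hypothetical discrete example: X takes values in {0, 1, 2}
      priors = {-1: 0.6, 1: 0.4}                        # pi_{-1}, pi_1
      pmf = {-1: {0: 0.70, 1: 0.20, 2: 0.10},           # p(x | -1)
              1: {0: 0.10, 1: 0.35, 2: 0.55}}           # p(x | 1)

      def bayes_classify(x):
          """MAP rule: pick the label y maximizing pi_y * p(x | y)."""
          return max(priors, key=lambda y: priors[y] * pmf[y][x])

      print([bayes_classify(x) for x in (0, 1, 2)])     # -> [-1, 1, 1]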

2.3.2 M-ary Classifiers

  • For M >= 2 classes, Bayes classifiers can be constructed essentially the same way as in the binary case
  • We now have M hypotheses and M priors pi_1, ..., pi_M
  • Giving the following estimate: y_hat = argmax_y pi_y p(x|y)
  • Classifiers are generated by writing down a model of p(x_bar, y) and using estimates of the priors pi_y and of p(x|y) generated from the data
  • The challenge is estimating these quantities efficiently

2.3.3 The Bayes Classifier for Multivariate Gaussian Distributions

  • A d-dimensional random vector X with mean mu and covariance matrix C, where C is positive definite (symmetric, all positive eigenvalues, invertible), is multivariate Gaussian if it has the joint pdf p(x) = (2 pi)^(-d/2) |C|^(-1/2) exp(-0.5 (x - mu)^T C^{-1} (x - mu))
    • To denote that X has the above pdf we write X ~ N(mu, C)
      • Note: mu_l is the l-th element of mu and is equal to E[X_l]
    • The covariance matrix contains all pairwise covariances of the elements of X
      • The diagonal entries are the variances of the features (cov(a, a) = var(a))
      • Off-diagonal entries are the "linear part" of the relationship between features
        • E.g. with 2 variables, height and weight, the diagonals are the variances and the off-diagonals determine the slope (scaled by the variance of the other variable) of the linear minimum mean square error estimator of height from weight
  • Now consider an M-ary hypothesis problem with hypotheses H_y: X ~ N(mu_y, C_y), y = 1, ..., M
    • Then the Bayes classifier picks the y maximizing ln(pi_y) - 0.5 ln|C_y| - 0.5 (x - mu_y)^T C_y^{-1} (x - mu_y)
      • If the C_y's are different, the objective function is quadratic in x
      • Else if C_y = C for all y, the objective is linear in x: the ln|C_y| and x^T C^{-1} x terms are the same for every class and can be ignored, giving delta_y(x) = x^T C^{-1} mu_y - 0.5 mu_y^T C^{-1} mu_y + ln(pi_y)
  • If the C's are all equal we can compare the classes in pairs using the following decision rule
    • Decide l over m if x^T C^{-1} (mu_l - mu_m) - 0.5 (mu_l + mu_m)^T C^{-1} (mu_l - mu_m) + ln(pi_l / pi_m) >= 0
    • And the boundary exists where the two sides are equal
    • Collecting the terms multiplying x into w, and the remaining constants (including the priors) into beta, we get
      • g_{l,m}(x) = w^T x + beta, with w = C^{-1} (mu_l - mu_m) and beta = -0.5 (mu_l + mu_m)^T C^{-1} (mu_l - mu_m) + ln(pi_l / pi_m), and the boundary g_{l,m}(x) = 0
      • The function g is called a linear discriminant function, as it is linear in x, and the boundary exists where g = 0
  • Putting this all together we can summarize and conclude
    • For multivariate Gaussian class-conditional densities, the Bayes classifier is quadratic in x in general, and reduces to a linear classifier when all classes share the same covariance matrix C (a sketch of the pairwise discriminant follows below)
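  • A small sketch of the pairwise linear discriminant for equal covariances (the means, covariance, and priors are made-up values):

      import numpy as np

      def pairwise_gaussian_discriminant(mu_l, mu_m, C, pi_l, pi_m):
          """Return g(x) = w^T x + beta for the Bayes decision between classes l and m
          when both classes share the covariance matrix C."""
          C_inv = np.linalg.inv(C)
          w = C_inv @ (mu_l - mu_m)
          beta = -0.5 * (mu_l + mu_m) @ C_inv @ (mu_l - mu_m) + np.log(pi_l / pi_m)
          return lambda x: x @ w + beta                 # decide l if g(x) >= 0, else m

      # Hypothetical 2-D example with a shared covariance matrix
      mu_l, mu_m = np.array([1.0, 1.0]), np.array([-1.0, -1.0])
      C = np.array([[1.0, 0.2], [0.2, 1.0]])
      g = pairwise_gaussian_discriminant(mu_l, mu_m, C, pi_l=0.5, pi_m=0.5)
      print(g(np.array([0.5, 0.3])))                    # positive -> decide class l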

2.4 Linear Classifiers

  • Linear classifier: A classifier for which the decision between each pair of labels (l, m) is made through a linear discriminant function of the form g_{l,m}(x) = w^T x + beta, with decision boundary g_{l,m}(x) = 0
  • g describes a hyperplane in R^d
    • The hyperplane thus divides the space into two parts, {x : g(x) > 0} and {x : g(x) < 0}
  • Binary Linear Classifier: Uses the above hyperplane to assign -1 or 1 to either side of the hyperplane
    • f(x) = 1 if g(x) = w^T x + beta >= 0
    • f(x) = -1 if g(x) = w^T x + beta < 0
  • An M-ary classifier can be built from nCr(M, 2) pairwise binary classifiers, one for each pair of labels (see the sketch after this list)
  • This allows us to build non-linear overall decision boundaries using linear classifiers
  • Cheap to use on test data
  • Useful for building neural networks
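  • A sketch of the one-vs-one construction: nCr(M, 2) pairwise linear discriminants vote for a label (the three discriminants below are hand-picked, hypothetical examples, and majority voting is one common way to combine them):

      import numpy as np
      from collections import Counter
      from itertools import combinations

      def one_vs_one_classify(x, discriminants, labels):
          """M-ary classification from C(M, 2) pairwise linear discriminants.
          discriminants[(l, m)](x) >= 0 means "prefer l over m"; the label with
          the most pairwise wins is returned."""
          votes = Counter()
          for l, m in combinations(labels, 2):
              winner = l if discriminants[(l, m)](x) >= 0 else m
              votes[winner] += 1
          return votes.most_common(1)[0][0]

      # Hypothetical example: three classes with hand-picked (w, beta) pairs
      discriminants = {
          (0, 1): lambda x: np.array([1.0, 0.0]) @ x,    # x_1 >= 0 prefers 0 over 1
          (0, 2): lambda x: np.array([0.0, 1.0]) @ x,    # x_2 >= 0 prefers 0 over 2
          (1, 2): lambda x: np.array([-1.0, 1.0]) @ x,   # x_2 >= x_1 prefers 1 over 2
      }
      print(one_vs_one_classify(np.array([0.5, 0.4]), discriminants, labels=[0, 1, 2]))   # -> 0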

2.5 Linear Discriminant Analysis

  • Gaussian Discriminant Analysis (GDA), also known as Quadratic Discriminant Analysis (QDA) in the general case
    • Assume the feature vectors follow multivariate Gaussian distributions with unknown means and covariance matrices
    • To use the Bayes classifier we need to estimate the prior probabilities, means, and covariance matrices
    • If the covariance matrices are assumed to be the same for all classes, the problem simplifies and we have Linear Discriminant Analysis (LDA)
    • The resulting discriminant is delta_l(x) = x^T C^{-1} mu_l - 0.5 mu_l^T C^{-1} mu_l + ln(pi_l), and we classify to the class with the largest delta_l(x)

2.5.1 Estimating the Unknown Parameters

  • mu, pi, and C are estimated from the training data
  • pi_hat_l = N_l / N, mu_hat_l = (1/N_l) * sum over {i : y_i = l} of x_bar_i, and C_hat = (1/(N - M)) * sum over l of sum over {i : y_i = l} of (x_bar_i - mu_hat_l)(x_bar_i - mu_hat_l)^T
    • Note: M = number of classes, N = number of training samples, N_l = number of training samples with label l
  • Partition the data by label

2.5.2 The LDA Algorithm

  • Algorithm: Linear Discriminant Analysis

      # Run to make the model
      function TrainLinearDiscriminantAnalysis(Training Set T)
          for each class l = 1, ..., M do
              calculate the estimate of the prior probability of class l, pi_hat_l
              calculate the estimate of the mean of class l, mu_hat_l
          end for
          calculate the estimate of the covariance matrix, C_hat
      end function

      # Run to classify a test sample
      function ClassifyLinearDiscriminantAnalysis(Test sample x_bar, parameter estimates pi_hat, mu_hat, C_hat)
          for each class l = 1, ..., M do
              calculate delta_l(x_bar) = x_bar^T C_hat^{-1} mu_hat_l - 0.5 mu_hat_l^T C_hat^{-1} mu_hat_l + ln(pi_hat_l)
          end for
          return the l with the largest delta_l(x_bar)   # break ties arbitrarily
      end function

  • A runnable sketch of these two functions is given below
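  • A runnable sketch of the two LDA functions above (NumPy-based; the variable names are placeholders, not from the notes):

      import numpy as np

      def train_lda(X, y):
          """Estimate the LDA parameters: priors, class means, and the pooled covariance."""
          classes = np.unique(y)
          N = X.shape[0]
          pis = {l: np.mean(y == l) for l in classes}              # pi_hat_l
          mus = {l: X[y == l].mean(axis=0) for l in classes}       # mu_hat_l
          C = sum((X[y == l] - mus[l]).T @ (X[y == l] - mus[l]) for l in classes)
          C = C / (N - len(classes))                               # pooled covariance C_hat
          return pis, mus, C

      def classify_lda(x, pis, mus, C):
          """Return the class l maximizing delta_l(x) = x^T C^-1 mu_l - 0.5 mu_l^T C^-1 mu_l + ln(pi_l)."""
          C_inv = np.linalg.inv(C)
          deltas = {l: x @ C_inv @ mus[l] - 0.5 * mus[l] @ C_inv @ mus[l] + np.log(pis[l])
                    for l in pis}
          return max(deltas, key=deltas.get)                       # break ties arbitrarily

      # Hypothetical usage, with X an N x d array and y a length-N label array:
      # pis, mus, C = train_lda(X, y); y_hat = classify_lda(x_test, pis, mus, C)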
    

2.6 Naive Bayes Classifier

  • Not a linear classifier
  • It is an "online learning/streaming algorithm": training data arrives in sequence and the classifier can be updated over time, so it works well for slowly changing feature distributions
  • Naive Bayes works on the assumption that the features are conditionally independent given the class
    • This results in M*d one-dimensional distributions, where d = number of features (the dimension of x) and M = number of classes
    • The reduction in the number of distributions means that it is not as taxing for large datasets, while also reducing overfitting
    • Naive Bayes also allows you to mix categorical and continuous features
      • Categorical: Values in a finite set (e.g. T/F)
      • Continuous: Real-valued
  • Modeling features jointly would instead require on the order of M*d^2 distributions, since one needs to take into account the relationships between pairs of features
  • The estimates on T = {(\textbf{x}_i, y_i)}_{i=1}^N become pi_hat_y = N_y / N and p_hat(x|y) = product over j of p_hat(x_j|y), where each p_hat(x_j|y) is estimated from the class-y training samples
    • Note: N_y is the number of training samples in class y

  • Of course the joint distribution estimate depends on your model of choice

2.6.1 Choosing a Model for the Features

  • It is assumed that all features come from the same family of distributions for all classes, but with different parameters

  • The distribution of feature j under class y is modeled as p(x_j | y; theta_jy), a distribution with parameter vector theta_jy
  • This replaces the p(x|y) term in the Naive Bayes classifier (via the product over features j)
  • Furthermore, theta_jy itself is estimated from the class-y training data, e.g. by maximum likelihood: theta_hat_jy = argmax_theta product over {i : y_i = y} of p(x_ij; theta)

  • For continuous data a common choice is the Gaussian distribution: p(x_j | y) = N(x_j; mu_jy, sigma^2_jy)
    • The parameter vector for feature j under class y is theta_jy = (mu_jy, sigma^2_jy), its mean and variance
    • The estimates of these values are
      • mu_hat_jy = (1/N_y) * sum over {i : y_i = y} of x_ij, i.e. take the j-th feature for all the data in class y and average it
      • sigma_hat^2_jy = (1/N_y) * sum over {i : y_i = y} of (x_ij - mu_hat_jy)^2, i.e. use this mean estimate to calculate the variance of the j-th feature over all data in class y (see the sketch below)
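  • A short sketch of these Gaussian per-feature estimates and the resulting naive Bayes decision rule (array shapes and variable names are assumptions, not from the notes):

      import numpy as np

      def fit_gaussian_nb(X, y):
          """For every class y, estimate the prior pi_y and, for every feature j,
          the Gaussian parameters (mu_jy, sigma^2_jy)."""
          params = {}
          for c in np.unique(y):
              Xc = X[y == c]
              params[c] = {
                  "prior": np.mean(y == c),     # pi_hat_y = N_y / N
                  "mu": Xc.mean(axis=0),        # mu_hat_jy for every feature j
                  "var": Xc.var(axis=0),        # sigma_hat^2_jy for every feature j
              }
          return params

      def gaussian_nb_classify(x, params):
          """Pick the class maximizing ln(pi_y) + sum_j ln N(x_j; mu_jy, sigma^2_jy)."""
          def log_post(p):
              return (np.log(p["prior"])
                      - 0.5 * np.sum(np.log(2 * np.pi * p["var"]) + (x - p["mu"]) ** 2 / p["var"]))
          return max(params, key=lambda c: log_post(params[c]))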
  • For categorical data, a common choice is the Multinoulli or Categorical distribution, where feature j takes values in {0, ..., l-1}
    • Each of the probabilities \widehat{p}_{k,jy} is essentially based on the histogram of the class-y data: the fraction of class-y samples whose feature j equals k
    • If some estimated probability \widehat{p}_{k,jy} is equal to 1 (so the other values get probability 0), we call the estimate degenerate
    • This means that in our training data every class-y example had the same feature value "X"
    • At validation/test time, any sample whose feature takes a value never seen in the class-y training data is assigned probability 0 under class y, so the model can never assign it to that class no matter what the other features say

  • One way to solve this is Laplace smoothing, or additive smoothing
    • This rebalances the estimated probabilities by pretending we have alpha extra "samples" for each value the feature can take on and then normalizing: \widehat{p}_{k,jy} = (count of class-y samples with x_j = k + alpha) / (N_y + alpha * l)
    • Typically alpha = 1 (see the sketch below)
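  • A tiny sketch of the smoothed estimate for one categorical feature (the counts are a made-up example):

      import numpy as np

      def smoothed_categorical_estimate(values, num_values, alpha=1.0):
          """Laplace / additive smoothing for a feature taking values 0, ..., num_values - 1:
          p_hat_k = (count_k + alpha) / (N_y + alpha * num_values), so no value gets probability 0."""
          counts = np.bincount(values, minlength=num_values)
          return (counts + alpha) / (len(values) + alpha * num_values)

      # Hypothetical class-y training values of one categorical feature with 3 possible values
      print(smoothed_categorical_estimate(np.array([0, 0, 0, 2]), num_values=3))
      # -> approximately [0.571, 0.143, 0.286]; value 1 was never observed but still gets some mass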

  • Feature selection is an alternative technique, where we determine which features contain the most information about the class
    • Calculating the "mutual information" allows us to weight the features based on the amount of information each one carries
    • Mutual information measures how much information the presence/absence of a term contributes to making the correct classification decision on y
    • Revisited in 3.5.4

2.7 Kernel Tricks

  • A trick to adapt linear classification algorithms so that they give non-linear decision boundaries
  • Let the non-linear mapping function be denoted \phi
  • Often these mappings are to a larger feature space, where the classes can then be linearly separated
  • For example, a suitable mapping can turn data that is not linearly separable in the original feature space into data that is linearly separable in the mapped space

  • Recall that the linear classifiers above can be written in the form g(x) = w^T x + beta

  • Note: recall that w and beta both depend on the terms mu_l, mu_m, and C
  • Now to talk about kernel tricks
    • The idea is to train a model that depends only on dot products between feature vectors (like the linear discriminant above)
    • Then we can replace the dot products with a kernel function, which amounts to performing linear classification after mapping to a bigger feature space

  • Here are the steps to obtain the kernel version of a model
    1. Write all computations with the model using only dot products x_i^T x_j between pairs of feature vectors
    2. Replace each dot product x_i^T x_j with a kernel evaluation k(x_i, x_j) = \phi(x_i)^T \phi(x_j)

    • Essentially the kernel needs to act like it is doing a dot product in a different space
  • Something that uses this kernel trick is called a kernel method
  • Only certain families of kernel functions are valid, i.e. genuinely act like a dot product in some feature space
    • One such family is the Mercer kernels
    • Mercer kernels are symmetric, k(x, x') = k(x', x), and produce positive semidefinite Gram matrices [k(x_i, x_j)]_{i,j} for any set of points
    • Other common kernels include the polynomial kernel k(x, x') = (x^T x' + c)^d and the Gaussian (RBF) kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)); a sketch follows at the end of this section

  • It is possible to overfit when enlarging a feature space if there is insufficient data
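  • A small sketch illustrating the kernel idea: the degree-2 polynomial kernel equals an ordinary dot product after an explicit feature map phi, while the RBF kernel corresponds to an implicit (infinite-dimensional) map. The feature map below is a standard textbook example, not taken from these notes:

      import numpy as np

      def phi(x):
          """Explicit feature map for 2-D x whose dot products match the
          homogeneous degree-2 polynomial kernel k(x, z) = (x^T z)^2."""
          x1, x2 = x
          return np.array([x1 * x1, np.sqrt(2) * x1 * x2, x2 * x2])

      def poly2_kernel(x, z):
          return (x @ z) ** 2

      def rbf_kernel(x, z, sigma=1.0):
          """Gaussian (RBF) kernel: acts like a dot product in an infinite-dimensional space."""
          return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

      x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
      print(phi(x) @ phi(z), poly2_kernel(x, z))   # both equal (x^T z)^2 = 2.25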

2.8 Logistic Regression

  • One of the most common classification approaches
  • Can handle the M-ary case, ordinal cases (ordered classes), determining which features are statistically significant, and more
  • Logistic regression is a discriminative model
    • It tries to directly model p(y|x)
  • With the modeled posterior probability one can measure risk/loss, combine multiple models, and reject classification
  • ALL CLASSIFIERS DISCUSSED IN THIS CLASS CAN PRODUCE ESTIMATES OF THE POSTERIOR PROBABILITY EXCEPT SVM
  • In the binary case, where the two classes are -1 and 1, the posterior probabilities are given by
    • p(y = 1 | x) = e^{beta_0 + beta^T x} / (1 + e^{beta_0 + beta^T x}) and p(y = -1 | x) = 1 / (1 + e^{beta_0 + beta^T x})
    • The function e^t / (e^t + 1) is known as the logistic function
    • The posterior probabilities are based on a linear model beta_0 + beta^T x, which we fit by maximum likelihood
      • The likelihood function L is based on the training data: L(beta_0, beta) = product over i of p(y_i | x_i; beta_0, beta)

      • Note: the maximizer of this likelihood has no closed-form expression; the values of beta_0 and beta must be found iteratively (e.g. by gradient methods)
  • The resulting classifier is a linear classifier: decide y_hat = 1 if p(y = 1 | x) >= p(y = -1 | x), else y_hat = -1
    • The left-hand side is at least the right-hand side iff beta_0 + beta^T x >= 0; the log-odds ln(p(y = 1 | x) / p(y = -1 | x)) = beta_0 + beta^T x is known as the logit transformation

  • And finally we obtain the binary logistic regression classifier: f(x) = 1 if beta_0 + beta^T x >= 0, and f(x) = -1 otherwise (a runnable sketch follows below)
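  • A minimal sketch of binary logistic regression fit by gradient ascent on the log-likelihood (the learning rate, iteration count, and toy data are arbitrary choices, not from the notes):

      import numpy as np

      def sigmoid(t):
          return 1.0 / (1.0 + np.exp(-t))             # the logistic function e^t / (e^t + 1)

      def train_logistic_regression(X, y, lr=0.1, n_iters=1000):
          """Fit beta_0, beta by gradient ascent on the log-likelihood
          (there is no closed-form maximizer). Labels are assumed to be in {-1, +1}."""
          N, d = X.shape
          beta0, beta = 0.0, np.zeros(d)
          for _ in range(n_iters):
              p1 = sigmoid(beta0 + X @ beta)          # p(y = 1 | x_i) for every training sample
              grad = (y == 1) - p1                    # per-sample gradient term of the log-likelihood
              beta0 += lr * grad.mean()
              beta += lr * (X.T @ grad) / N
          return beta0, beta

      def classify(x, beta0, beta):
          return 1 if beta0 + x @ beta >= 0 else -1   # linear classifier: sign of the logit

      # Hypothetical toy data: two well-separated 1-D clusters
      X = np.array([[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]])
      y = np.array([-1, -1, -1, 1, 1, 1])
      beta0, beta = train_logistic_regression(X, y)
      print(classify(np.array([1.2]), beta0, beta))   # -> 1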

       

2.9 Support Vector Machines

  • Linear support vector machines/support vector classifiers are binary linear classifiers
  • SVC + kernel trick gives a support vector machine (SVM)

2.9.1 Connection to Logistic Regression
