Tagged: ml

Chapter 2: Classification

Table of Contents

  • 2.1 Measuring Classifier Performance
    • 2.1.1 Displaying Error Performance (Optional Not Completed)
    • 2.1.2 Computational Performance
  • 2.2 k-Nearest Neighbors
  • 2.3 Bayes Classifiers: ECE313 Redux
    • 2.3.1 Binary Classifiers
    • 2.3.2 M-ary Classifiers
    • 2.3.3 The Bayes Classifier for Multivariate Gaussian Distributions
  • 2.4 Linear Classifiers
  • 2.5 Linear Discriminant Analysis
    • 2.5.1 Estimating the Unknown Parameters
    • 2.5.2 The LDA Algorithm
  • 2.6 Naive Bayes Classifier
    • 2.6.1 Choosing a Model for the Features
  • 2.7 Kernel tricks
  • 2.8 Logistic Regression
  • 2.9 Support Vector Machines
    • 2.9.1 Connection to Logistic Regression

NOTE

  • Unless stated the following L’s are using the 0-1 loss function

Vocab

  • Loss Function: A function used to evaluate the performance of a training
  • Training Error: For a classifier f it is the average loss over a given training set
  • Overfitting: The phenomenon in which a classifier is tuned specifically for one specific training set and will thus perform badly with other sets
  • Dissimilarity Measure: Function that calculates dissimilarity between two vectors. Larger = more dissimilar
  • Bayes Classifier: A MAPish rule using priors
  • Linear classifier: A classifier for which the decision between each pair of label pair (l,m) is made through a linear discriminant function of the formScreenshot from 2017-09-07 00-42-49

2.1 Measuring Classifier Performance

  • To properly gauge performance we use a loss function to evaluate performance
    • Tells us cost we incur when estimating y based on given x
  • Training error is also used to gauge quality
    • It is the average of the loss function for a training set T
  • 0-1 Loss Function
    • 1 if error 0 if correct
    • The training error of this loss function is
  • Overfitting: An overfit set is fit specifically to meet the needs of the training set and may not serve other sets of data well
    • Because of the overfitting phenomenon it is unwise to rely solely on training error
  • Validation Sets: A set of data, not the training data, used to verify the accuracy of the classifier
  • Prediction/Generalization Error: This is the correct way to measure accuracy
    • Recall that the classification “f” was trained using training data T
    • Meaning the classifier is a function of both T and f
    • Prediction error then should be estimated using average loss given T
      • Recalling that expected value is a summation of probability*value then we can see this is an average over the true joint distribution p(x,y) which we do not have
      • Thus we run on a validation set and use that to calculate Err
    • Validation Set Error Checking
      • This is still implicitly a function of T as f itself depends on T
      • In the 0-1 loss function case the fraction remains the same as previous only in terms of V not T

2.1.1 Displaying Error Performance

  • Confusion matrices or Contingency tables are used to display performance of a classifier looks like a karnough map
    • True Positives
    • True Negatives
    • False Negatives
  • New Metrics
    • True Positive Rate (TPR)
    • False Positive Rate (FPR)
    • Specificity (True negative Rate)
    • Precision (Positive Predictive Value)
    • Accuracy
    • Harmonic mean / F-Measure / F-Score / F1-Score
  • Reciever Operating Characteristic (ROC)
    • Area Under an ROC Curve (AUC)

2.1.2 Computational Performance

  • Computational Performance: Concered with time and space requirements
  • In general, during training we have a lot of resources to play with
  • In general, during testing there should be less resources available (quick testing)
  • Often linear support vector machines and nearest neighbor classifiers are first used
    • Linear support vector machines have lower training computational requirements while having relatively good statistical performance
    • Neartest neighbor can have high test computational requirements and acceptable statistical performance (for little work/coding)

2.2 k-Nearest Neighbors

  • Given a training set and an input feature vector compute k feature vectors in the training set closest to x
  • Dissimilarity Measure: Function that calculates dissimilarity between two vectors. Larger = more dissimilar
    • Typically it’s the Euclidian distance
    • Other choices (2.7) radial basis functions, 1 – cos(angle(x1, x2))
      • 1 – cosine similarity
  • Plain Text>Algo for k-Nearest Neighborsfunction k-NearestNeighbors(test sample x_bar, Training Set T, dissimilarity measure between feature vectors delta(.,.))for i=1 in range(N) doCalculate delta_i=dissimilarity(x_bar, x_bar_i)end for

    Find indicies {i_1, …, i_k} corresponding to ksmallest values of delta_i (break ties arbitrarily)return most frequently occuring element in {y_i1, …, y_ik}end function[/code][/code][/code][/code][/code][/code][/code][/code][/code][/code][/code][/code][/code][/code][/code][/code]

  • 	Plain Text>Plain Text>Algo for Nearest Neighbor with Euclidean Distancefunction NearestNeighbors(test sample x, Training Set T)for i=1 in range(N) doCalculate delta_i=||x_bar-x_bar_i||end forFind i* such that delta_i is smallest (break ties arbitrarily)
    
    return y_i*
    end function
    	In the k-NN classifier the higher the k is the smoother the boundaries are
    	In the naive implementation of k-NN you calculate distance between test sample and all feature vectors.  aka complexity of algo grows linearly in the size of the training set	Note: The whoel training set must be kept in memory	k-NN does not tell us much about the structure of the data only what's closest	This is known as instance-based learning
2.3 Bayes Classifiers: ECE 313 Redux

 

2.3.1 Binary Classifiers

  • Binary Hypothesis Testing: Observations X drawn from 2 hypotheses (labels, classes)
      • p(x|-1) used to denote has pdf/pmf p(x|-1)
    • Assumed probabilitiy distributions p(x|-1) & p(x|1) are known
  • Bayesian Setting: We are also given priors 
    • Bayes Classifier: Is like a likelihood ratio test (Maximum A Posteriori (MAP) decision rule or Bayes Decision Rule)
      • Difference between hypothesis testing and classification problem is that  are not known
      • Thus everything is based on T
  • Using 0,1 loss the probability of error is
    • Bayes classifier minimizes P_e
    • Bayes classifier is best in terms of prediction eror because it was given the joint probability which supervised learning algorithms don't have

2.3.2 M-ary Classifiers

  • M>=2 Bayes classifiers can be constructed essentially the same way as binary
  • We now have M hypotheses, M priors
  • Giving the following estimate
  • Classifiers are generated by writing down a model of p(x_bar, y) and using estimates of the priors pi and p(x|y) generated from the data
  • The challenge is estimating the values efficiently

2.3.3 The Bayes Classifier for Multivariate Gaussian Distributions

  • Vectors are d-dimensional with mean miu_bar, and covariance matrix C where C is positive definite (symmetric, all positive eigenvalues, invertable) if it has the joint pdf
    • To denote X has the above pdf use 
      • Note: miu_l is the l-th element of miu_bar and is equal to E[X_l]
    • The cov matrix contains all pairwaise covariances of the elements of X
      • The diagonal entries are the variances of the features (cov(a,a)=var(a))
      • Off diagonal entries are the "linear part" of the relationship between features
        • Eg. 2 vars height and weight, diagonals are the variances and the off-diagonals determine the slope (scaled by variance of the other var) of the linear minimum mean square estimator of height by weight
  • Now for a M-ary Hypothesis with y hypotheses Hy
    • Then...
      • If Cy's are different the objective function is quadratic in x
      • Else if Cy=C then the objective function is linear and the ln|Cy| and x.T C_inv x term can be ignored giving the following
  • If C's are all equal we can compare each variable in pairs using the following decision rule
    • Screenshot from 2017-09-07 00-39-18
    • And the boundary exists where the two values are equal
    • Rearranging the terms with x, miu, and C into w and all pi's into beta we get
      • Screenshot from 2017-09-07 00-42-39 and the boundary as Screenshot from 2017-09-07 00-42-49
      • The function g is called a linear discriminant function as it is linear in x and boundary exists when g=0
  • Putting this all together we can summarize and conclude
    • Screenshot from 2017-09-07 00-46-03

2.4 Linear Classifiers

  • Linear classifier: A classifier for which the decision between each pair of label pair (l,m) is made through a linear discriminant function of the formScreenshot from 2017-09-07 00-42-49 with decision boundary g=0
  • g describes a hyperplane in 
    • The plane thus divides the space into two parts Screenshot from 2017-09-07 00-52-03
  • Binary Linear Classifier: Uses the above hyper plane to designate -1 or 1 for either side of the hyper plane
    • Screenshot from 2017-09-07 00-54-38
    • Screenshot from 2017-09-07 00-57-43
  • An M-ary classifier can be built in nCr(M, 2) pairs of labels
  • Allows us to build non-linear decision boundaries using a linear classifier
  • Cheap to use on test data
  • Useful for building neural networks

2.5 Linear Discriminant Analysis

  • Gaussian Discriminant Analysis (GDA or Quadratic Discriminant Analysis (QDA)
    • Assume feature vectors are multivariate Gaussian distributions w/ unknown means and covariance matrices
    • Using the Bayes Classifier we need to estimate the prior probabilites, means, and covariance matrices
    • If the covariance matrices are the same the problem simplifies and we have an LDA problem (Linear Discriminant Analysis)
    • Screenshot from 2017-09-07 09-46-10

2.5.1 Estimating the Unknown Parameters

  • miu, pi and C are estimated from the training data
  • Screenshot from 2017-09-07 09-47-22Screenshot from 2017-09-07 09-47-49Screenshot from 2017-09-07 09-48-12
    • Note M = number of features, N = number of samples
  • Partition the data by label

2.5.2 The LDA Algorithm

  • Plain Text>Algorithm: Linear Discriminant Analysis# Run to make modelfunction TrainLinearDiscriminantAnalaysis(Training Set T)for each class l=1,...,M doCalc est of the prior probability class l, pi_hat_lCalc est of mean of class l, miu_hat_lend forCalc est of cov matrix C_hatend function

Chapter 1: Introduction

Table of Contents

  • 1.1 What is Machine Learning

Notation

  • Capital Letters: Random variables, matrices, number of classes (clear by context)
  • Bold: Vectors (column vectors, unless transposed)
  • Hats: estimates

Vocab

  • i.i.d: Independently and identically distributed
  • Joint Distribution: Probablility of something occuring taking into account 2 or more random variables
  • Supervised Learning: Learning with a training set
  • Unsupervised Learning: Learning with a set of data without values
  • Feature Vectors: These are the vectors that “describe” a defining feature of the data
  • Feature Space: Vector space containing the feature vectors
  • Finite Set: Labels occur from {-1, 1}
  • Non-Finite Set: Labels occur on real number line
  • Classification (Pattern Recognition): Supervised learning with a finite set of labels
    • Labels are called classes
  • Clustering: Data grouped together giving away some sort of pattern

1.1 What is Machine Learning

  • Supervised Learning: Machine Learning with a training set T. Feature vectors x generate labels y
    • Data is drawn i.i.d from some joint distribution 
    • Vectors x_i are feature vectors or inputs
    • Values y_i are known as labels  (outputs, responses)
    • Feature Space: Vector space containing the feature vectors
    • The goal of supervised learning is to create a map from the training data
      • This allows us to create a model to predict output values
      • This process is called learning or training a model
  • Classification (Pattern Recognition)
    • Labels are a finite-set {-1, 1}
      • Labels are now known as classses
      • Then the supervised problem is known as classification or pattenr recognition
    •  Labels are  a non-finite set, any real number
      • Then the supervised problem is known as regression
  • Unsupervised Learning: Machine Learning without a training set but only a set of data (feature vectors)
    • Data is drawn i.i.d and we want to infer something useful from the data
    • Clustering is used to identify such patterns
      • Take vectors from D and group them into similar groups