Tagged: ece398
Chapter 2: Classification
Table of Contents
- 2.1 Measuring Classifier Performance
- 2.1.1 Displaying Error Performance (Optional Not Completed)
- 2.1.2 Computational Performance
- 2.2 k-Nearest Neighbors
- 2.3 Bayes Classifiers: ECE313 Redux
- 2.3.1 Binary Classifiers
- 2.3.2 M-ary Classifiers
- 2.3.3 The Bayes Classifier for Multivariate Gaussian Distributions
- 2.4 Linear Classifiers
- 2.5 Linear Discriminant Analysis
- 2.5.1 Estimating the Unknown Parameters
- 2.5.2 The LDA Algorithm
- 2.6 Naive Bayes Classifier
- 2.6.1 Choosing a Model for the Features
- 2.7 Kernel tricks
- 2.8 Logistic Regression
- 2.9 Support Vector Machines
- 2.9.1 Connection to Logistic Regression
NOTE
- Unless stated otherwise, the loss functions L below refer to the 0-1 loss function
Vocab
- Loss Function: A function used to evaluate the performance of a classifier; it gives the cost of predicting y_hat when the true label is y
- Training Error: For a classifier f it is the average loss over a given training set
- Overfitting: The phenomenon in which a classifier is tuned specifically for one specific training set and will thus perform badly with other sets
- Dissimilarity Measure: Function that calculates dissimilarity between two vectors. Larger = more dissimilar
- Bayes Classifier: A MAP-like decision rule that uses the priors and the class-conditional distributions
- Linear classifier: A classifier for which the decision between each pair of labels (l, m) is made through a linear discriminant function of the form g(x_bar) = w_bar.T x_bar + beta
2.1 Measuring Classifier Performance
- To properly gauge a classifier's performance we use a loss function
- It tells us the cost we incur when estimating y based on a given x
- Training error is also used to gauge quality
- It is the average of the loss function for a training set T
- 0-1 Loss Function
- L(y_hat, y) = 1 if y_hat != y (an error), 0 if y_hat = y (correct)
- The training error under this loss is the fraction of misclassified training samples, (1/N) * sum_i 1{f(x_bar_i) != y_i}
- Overfitting: An overfit classifier is fit specifically to meet the needs of the training set and may not serve other sets of data well
- Because of the overfitting phenomenon it is unwise to rely solely on training error
- Validation Sets: A set of data, not the training data, used to verify the accuracy of the classifier
- Prediction/Generalization Error: The quantity we actually want to measure, the expected loss on new data
- Recall that the classifier f was trained using the training data T
- This means f, and hence its error, is itself a function of T
- The prediction error is the expected loss Err = E[L(f(X_bar), Y)], where (X_bar, Y) is a fresh sample
- Recalling that an expected value is a sum of probability*value, this is an average over the true joint distribution p(x_bar, y), which we do not have
- Thus we run the classifier on a validation set and use that to estimate Err
- Validation Set Error Checking: estimate Err by the average loss of f over a validation set V (see the sketch below)
- This is still implicitly a function of T, since f itself depends on T
- In the 0-1 loss case this is simply the fraction of validation samples that f misclassifies, the same formula as the training error but computed over V rather than T
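A minimal Python sketch of the 0-1 training and validation errors described above, assuming numpy arrays; the names zero_one_error, f_hat, X_train, X_val, etc. are illustrative, not from the notes.

```python
import numpy as np

def zero_one_error(f, X, y):
    """Average 0-1 loss of classifier f over the samples (X, y)."""
    y_hat = np.array([f(x) for x in X])
    return np.mean(y_hat != y)

# Hypothetical usage: f_hat was trained on (X_train, y_train) only.
# train_err = zero_one_error(f_hat, X_train, y_train)  # training error
# val_err   = zero_one_error(f_hat, X_val, y_val)      # estimate of the prediction error
```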
2.1.1 Displaying Error Performance
- Confusion matrices (also called contingency tables) are used to display the performance of a classifier; the table looks like a Karnaugh map
- True Positives (TP)
- True Negatives (TN)
- False Positives (FP)
- False Negatives (FN)
- New Metrics
- True Positive Rate (TPR)
- False Positive Rate (FPR)
- Specificity (True negative Rate)
- Precision (Positive Predictive Value)
- Accuracy
- Harmonic mean / F-Measure / F-Score / F1-Score
- Receiver Operating Characteristic (ROC)
- Area Under an ROC Curve (AUC)
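The metrics above are all derived from the four confusion-matrix counts. A short Python sketch, assuming binary labels and nonzero denominators; the function name and dictionary keys are illustrative.

```python
import numpy as np

def binary_metrics(y_true, y_pred, positive=1):
    """Confusion-matrix counts and the derived rates listed above (binary case)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))
    tn = np.sum((y_pred != positive) & (y_true != positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    tpr = tp / (tp + fn)                    # true positive rate (sensitivity, recall)
    fpr = fp / (fp + tn)                    # false positive rate
    tnr = tn / (tn + fp)                    # specificity (true negative rate)
    ppv = tp / (tp + fp)                    # precision (positive predictive value)
    acc = (tp + tn) / (tp + tn + fp + fn)   # accuracy
    f1 = 2 * ppv * tpr / (ppv + tpr)        # F1 score: harmonic mean of precision and recall
    return {"TPR": tpr, "FPR": fpr, "TNR": tnr, "precision": ppv, "accuracy": acc, "F1": f1}
```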
2.1.2 Computational Performance
- Computational Performance: Concerned with time and space requirements
- In general, during training we have a lot of resources to play with
- In general, during testing there should be less resources available (quick testing)
- Often linear support vector machines and nearest neighbor classifiers are first used
- Linear support vector machines have lower training computational requirements while having relatively good statistical performance
- Nearest neighbor can have high test-time computational requirements but acceptable statistical performance for very little implementation work
2.2 k-Nearest Neighbors
- Given a training set and an input feature vector compute k feature vectors in the training set closest to x
- Dissimilarity Measure: Function that calculates dissimilarity between two vectors. Larger = more dissimilar
- Typically it is the Euclidean distance
- Other choices (see 2.7): radial basis functions, or 1 - cos(angle(x_bar_1, x_bar_2)), i.e. 1 - cosine similarity
```
Algorithm: k-Nearest Neighbors
function kNearestNeighbors(test sample x_bar, Training Set T, dissimilarity measure delta(.,.))
    for i = 1, ..., N do
        Calculate delta_i = delta(x_bar, x_bar_i)
    end for
    Find indices {i_1, ..., i_k} corresponding to the k smallest values of delta_i (break ties arbitrarily)
    return the most frequently occurring label in {y_i1, ..., y_ik}
end function
```
```
Algorithm: Nearest Neighbor with Euclidean Distance
function NearestNeighbor(test sample x_bar, Training Set T)
    for i = 1, ..., N do
        Calculate delta_i = ||x_bar - x_bar_i||
    end for
    Find i* such that delta_i* is smallest (break ties arbitrarily)
    return y_i*
end function
```
- In the k-NN classifier, the larger k is, the smoother the decision boundaries are
- In the naive implementation of k-NN you calculate the distance between the test sample and all training feature vectors, so the complexity of the algorithm grows linearly in the size of the training set (a numpy sketch follows below)
- Note: the whole training set must be kept in memory
- k-NN does not tell us much about the structure of the data, only what is closest; this is known as instance-based learning
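A minimal numpy sketch of the naive k-NN classifier from the pseudocode above, defaulting to Euclidean distance; the signature and names are illustrative assumptions, not the course's reference implementation.

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=3, dissimilarity=None):
    """Naive k-NN: compute the dissimilarity to every training vector, then vote among the k closest."""
    if dissimilarity is None:
        dissimilarity = lambda a, b: np.linalg.norm(a - b)   # default: Euclidean distance
    deltas = np.array([dissimilarity(x, x_i) for x_i in X_train])
    nearest = np.argsort(deltas)[:k]                         # indices of the k smallest dissimilarities
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]                        # most frequent label (ties broken arbitrarily)
```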
2.3 Bayes Classifiers: ECE 313 Redux
2.3.1 Binary Classifiers
- Binary Hypothesis Testing: Observations X drawn from 2 hypotheses (labels, classes)
- X ~ p(x|-1) is used to denote that X has pdf/pmf p(x|-1) under hypothesis -1
- The probability distributions p(x|-1) and p(x|1) are assumed to be known
- Bayesian Setting: We are also given priors
- Bayes Classifier: A likelihood ratio test, also known as the Maximum A Posteriori (MAP) decision rule or Bayes decision rule (see the reconstruction after this list)
- The difference between hypothesis testing and the classification problem is that in classification the distributions p(x|y) and the priors are not known
- Thus everything is based on T
- Using the 0-1 loss, the prediction error is the probability of error, P_e = P(f(X_bar) != Y)
- Bayes classifier minimizes P_e
- The Bayes classifier is the best possible classifier in terms of prediction error because it is given the true joint distribution, which supervised learning algorithms do not have access to
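A LaTeX reconstruction of the binary Bayes (MAP) rule and the probability of error referenced above; the notation is the standard one and may differ slightly from the notes' figures.

```latex
\hat{y}(\bar{x}) =
\begin{cases}
  1,  & \pi_1 \, p(\bar{x}\mid 1) \;\ge\; \pi_{-1} \, p(\bar{x}\mid -1)\\
  -1, & \text{otherwise}
\end{cases}
\qquad\qquad
P_e = \Pr\{\hat{y}(\bar{X}) \ne Y\}
```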
2.3.2 M-ary Classifiers
- For M >= 2 classes, Bayes classifiers can be constructed in essentially the same way as in the binary case
- We now have M hypotheses and M priors
- Giving the estimate y_hat(x_bar) = argmax_y pi_hat_y * p_hat(x_bar | y) (see the sketch after this list)
- Classifiers are generated by writing down a model of p(x_bar, y) and using estimates of the priors pi and p(x|y) generated from the data
- The challenge is estimating the values efficiently
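A small Python sketch of the resulting plug-in M-ary Bayes classifier, assuming the estimated priors and class-conditional densities are available as callables; this interface is an illustrative assumption.

```python
import numpy as np

def plugin_bayes_classify(x, prior_hats, density_hats):
    """Plug-in M-ary Bayes classifier: argmax_y  pi_hat_y * p_hat(x | y).
    prior_hats[y] is the estimated prior, density_hats[y] a callable returning p_hat(x | y)."""
    scores = [np.log(pi) + np.log(p(x)) for pi, p in zip(prior_hats, density_hats)]
    return int(np.argmax(scores))   # classes assumed to be labeled 0, ..., M-1
```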
2.3.3 The Bayes Classifier for Multivariate Gaussian Distributions
- A d-dimensional random vector X_bar is multivariate Gaussian with mean miu_bar and covariance matrix C, where C is positive definite (symmetric, all positive eigenvalues, invertible), if it has the joint pdf p(x_bar) = (2*pi)^(-d/2) * |C|^(-1/2) * exp(-0.5*(x_bar - miu_bar).T C_inv (x_bar - miu_bar))
- To denote that X_bar has the above pdf we write X_bar ~ N(miu_bar, C)
- Note: miu_l is the l-th element of miu_bar and is equal to E[X_l]
- The covariance matrix contains all pairwise covariances of the elements of X_bar
- The diagonal entries are the variances of the features (cov(a,a)=var(a))
- Off diagonal entries are the "linear part" of the relationship between features
- E.g., with two variables, height and weight, the diagonal entries are the variances and the off-diagonal entry determines the slope (scaled by the variance of the other variable) of the linear minimum mean square error estimator of height from weight
- Now consider an M-ary hypothesis problem where, under hypothesis H_y, X_bar ~ N(miu_bar_y, C_y) with prior pi_y
- Then the Bayes classifier is y_hat(x_bar) = argmax_y [ ln(pi_y) - 0.5*ln|C_y| - 0.5*(x_bar - miu_bar_y).T C_y_inv (x_bar - miu_bar_y) ]
- If the C_y's are different, the objective function is quadratic in x_bar
- Else if C_y = C for all y, the objective is linear in x_bar: the ln|C_y| and x_bar.T C_inv x_bar terms are the same for every class and can be dropped, giving y_hat(x_bar) = argmax_y [ x_bar.T C_inv miu_bar_y - 0.5*miu_bar_y.T C_inv miu_bar_y + ln(pi_y) ]
- When the C's are all equal we can also compare classes in pairs: decide l over m if the linear expression for l is at least as large as the one for m
- The boundary between the two classes is where the two values are equal
- Rearranging, the terms involving x_bar, miu_bar, and C collect into w_bar and the remaining constants (including the pi's) into beta, giving g(x_bar) = w_bar.T x_bar + beta with w_bar = C_inv (miu_bar_l - miu_bar_m) and beta = -0.5*(miu_bar_l.T C_inv miu_bar_l - miu_bar_m.T C_inv miu_bar_m) + ln(pi_l / pi_m)
- and the boundary is the set where g(x_bar) = 0
- The function g is called a linear discriminant function as it is linear in x and boundary exists when g=0
- Putting this all together: the Bayes classifier for Gaussian classes has quadratic decision boundaries in general, and linear decision boundaries when all covariance matrices are equal (a numpy sketch follows below)
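A numpy sketch of the Bayes classifier for Gaussian class-conditional densities summarized above (quadratic boundaries in general, linear when all covariances coincide); the function names are illustrative.

```python
import numpy as np

def gaussian_log_pdf(x, mu, C):
    """Log of the multivariate Gaussian density N(mu, C) evaluated at x."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(C)
    quad = diff @ np.linalg.solve(C, diff)   # (x - mu)^T C^-1 (x - mu)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + quad)

def gaussian_bayes_classify(x, priors, means, covs):
    """argmax_y of ln(pi_y) + ln N(x; mu_y, C_y); covariances may differ (quadratic
    boundaries) or all be equal (linear boundaries). Classes labeled 0, ..., M-1."""
    scores = [np.log(pi) + gaussian_log_pdf(x, mu, C)
              for pi, mu, C in zip(priors, means, covs)]
    return int(np.argmax(scores))
```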
2.4 Linear Classifiers
- Linear classifier: A classifier for which the decision between each pair of labels (l, m) is made through a linear discriminant function of the form g(x_bar) = w_bar.T x_bar + beta
- with decision boundary g(x_bar) = 0
- g describes a hyperplane in R^d
- The plane thus divides the space into two parts
- Binary Linear Classifier: Uses the above hyperplane to assign -1 or 1 to the two sides of the hyperplane
- An M-ary classifier can be built by combining binary linear classifiers for the nCr(M, 2) pairs of labels (see the sketch after this list)
- Allows us to build non-linear decision boundaries using a linear classifier
- Cheap to use on test data
- Useful for building neural networks
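One common way to combine the nCr(M, 2) pairwise linear discriminants into an M-ary decision is majority voting; the voting scheme and data layout below are illustrative assumptions, not necessarily the construction intended in the notes.

```python
from collections import Counter
from itertools import combinations

def pairwise_linear_classify(x, discriminants, M):
    """M-ary classification from C(M, 2) binary linear discriminants.
    discriminants[(l, m)] = (w, beta) defines g(x) = w @ x + beta;
    g(x) >= 0 casts a vote for class l, otherwise for class m."""
    votes = Counter()
    for l, m in combinations(range(M), 2):
        w, beta = discriminants[(l, m)]
        votes[l if w @ x + beta >= 0 else m] += 1
    return votes.most_common(1)[0][0]   # most-voted class (ties broken arbitrarily)
```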
2.5 Linear Discriminant Analysis
- Gaussian Discriminant Analysis (GDA), also called Quadratic Discriminant Analysis (QDA)
- Assume feature vectors are multivariate Gaussian distributions w/ unknown means and covariance matrices
- Using the Bayes Classifier we need to estimate the prior probabilities, means, and covariance matrices
- If the covariance matrices are the same the problem simplifies and we have an LDA problem (Linear Discriminant Analysis)
2.5.1 Estimating the Unknown Parameters
- miu, pi and C are estimated from the training data
- Note: M = number of classes, N = number of training samples
- Partition the data by label
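The usual plug-in estimates, written out in LaTeX; here N_l is the number of training samples with label l, and the pooled-covariance normalization (1/N) is an assumption that may differ from the notes (e.g. 1/(N - M)).

```latex
\hat{\pi}_l = \frac{N_l}{N}, \qquad
\hat{\bar{\mu}}_l = \frac{1}{N_l} \sum_{i:\, y_i = l} \bar{x}_i, \qquad
\hat{C} = \frac{1}{N} \sum_{l=1}^{M} \sum_{i:\, y_i = l}
          \left(\bar{x}_i - \hat{\bar{\mu}}_l\right)\left(\bar{x}_i - \hat{\bar{\mu}}_l\right)^{\mathsf{T}}
```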
2.5.2 The LDA Algorithm
```
Algorithm: Linear Discriminant Analysis

# Run to make the model
function TrainLinearDiscriminantAnalysis(Training Set T)
    for each class l = 1, ..., M do
        Calculate the estimate of the prior probability of class l, pi_hat_l
        Calculate the estimate of the mean of class l, miu_hat_l
    end for
    Calculate the estimate of the covariance matrix, C_hat
end function

# Run to classify a test sample
function ClassifyLinearDiscriminantAnalysis(Test sample x_bar, Parameter estimates pi_hat, miu_hat, C_hat)
    for each class l = 1, ..., M do
        Calculate delta_l(x_bar) = x_bar.T*C_hat_inv*miu_hat_l - 0.5*miu_hat_l.T*C_hat_inv*miu_hat_l + ln(pi_hat_l)
    end for
    return the l with the largest delta_l(x_bar)   # break ties arbitrarily
end function
```
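A numpy sketch corresponding to the pseudocode above, assuming classes are labeled 0, ..., M-1 and the training features are the rows of a matrix X; the pooled-covariance normalization is an assumption.

```python
import numpy as np

def train_lda(X, y, M):
    """Estimate the LDA parameters: priors, class means, and the shared covariance."""
    N, d = X.shape
    priors = np.array([np.mean(y == l) for l in range(M)])
    means = np.array([X[y == l].mean(axis=0) for l in range(M)])
    C = np.zeros((d, d))
    for l in range(M):
        diff = X[y == l] - means[l]
        C += diff.T @ diff
    C /= N                        # pooled covariance (normalization is an assumption)
    return priors, means, C

def classify_lda(x, priors, means, C):
    """delta_l(x) = x^T C^-1 mu_l - 0.5 mu_l^T C^-1 mu_l + ln(pi_l); return the largest."""
    C_inv = np.linalg.inv(C)
    deltas = [x @ C_inv @ mu - 0.5 * mu @ C_inv @ mu + np.log(pi)
              for pi, mu in zip(priors, means)]
    return int(np.argmax(deltas))
```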
2.6 Naive Bayes Classifier
- Not a linear classifier
- It is an "online learning"/streaming algorithm: training data arrives in sequence and the classifier can be updated over time, so it works well for slowly changing feature distributions
- Naive Bayes works on the assumption that features are conditionally independent under each class (see the block after this list)
- This results in M*d distributions, where d = cardinality(x_bar) (number of features) and M = cardinality(y) (number of classes)
- The reduction in the number of distributions means the method is not as taxing for large datasets, and it also reduces overfitting
- Naive Bayes also allows you to mix categorical and continuous features
- Categorical: Values in a finite set (eg. T/F)
- Continuous: Real-values
- Modeling the features jointly would require far more distributions, since one would need to take into account the relationships between the features themselves
- The estimates now become pi_hat_y = N_y / N and, for each feature j, an estimate p_hat(x_j | y) computed from the training samples in class y
- Note: N_y is the number of training samples in class y
- Of course the joint distribution estimate depends on your model of choice
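A LaTeX reconstruction of the conditional-independence factorization and the resulting plug-in Naive Bayes classifier implied by the bullets above; the exact formatting is assumed.

```latex
p(\bar{x} \mid y) = \prod_{j=1}^{d} p(x_j \mid y),
\qquad
\hat{y}(\bar{x}) = \arg\max_{y} \; \hat{\pi}_y \prod_{j=1}^{d} \hat{p}(x_j \mid y)
```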
2.6.1 Choosing a Model for the Features
- Assumed that all features come from the same family of distributions for all classes with different parameters
- The distribution of feature j under class y is modeled as a parametric distribution with parameter vector theta_jy
- This replaces p(x_j | y) in the Naive Bayes classifier
- The parameter vector theta_jy is itself estimated from the training data
- For continuous data common choice is the Gaussian distribution
- The parameter vector contains the mean and variance of feature j under class y: theta_jy = (miu_jy, sigma^2_jy)
- These parameters are estimated from the training data
- Take the j-th feature of all the data in class y and average it to find miu_hat_jy
- Then use this estimate to calculate the variance of the j-th feature over all data in class y
- For categorical data, a common choice is the Multinoulli or Categorical distribution, where the feature takes values in {0, ..., l-1}
- Each of the probabilities is essentially based on the histogram of the data
- If an estimated probability for some feature value is equal to 1 (or 0), we call the estimate degenerate
- A probability of 1 means that in our training data every example in that class had the occurrence of "X"
- During validation, a sample that is really about "X" but never mentions "X" itself would then be assigned zero probability for that class, so the model marks it as having nothing to do with "X"
- One way to solve this is through Laplace smoothing or additive smoothing
- This essentially rebalances the estimated probabilities by adding alpha pretend "samples" for each value the feature can take on and renormalizing: p_hat(x_j = v | y) = (count of class-y samples with x_j = v + alpha) / (N_y + alpha*l)
- Typically alpha=1
- Feature selection is an alternative technique, where we determine which features contain the most information about the class
- Calculating the "mutual information" allows us to weight the features based on the amount of information they carry
- Mutual information measures how much information the presence/absence of a term contributes to making the correct classification decision on y
- Revisited in 3.5.4
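A minimal numpy sketch of Naive Bayes with the Gaussian feature model from this subsection; the variance floor and function names are illustrative assumptions.

```python
import numpy as np

def train_gaussian_nb(X, y, M, var_floor=1e-9):
    """Per-class priors plus a per-feature Gaussian model (mean, variance) for each class."""
    priors = np.array([np.mean(y == c) for c in range(M)])
    means = np.array([X[y == c].mean(axis=0) for c in range(M)])                  # miu_hat_jy
    variances = np.array([X[y == c].var(axis=0) + var_floor for c in range(M)])   # sigma^2_hat_jy
    return priors, means, variances

def classify_gaussian_nb(x, priors, means, variances):
    """argmax over classes of  ln(pi_y) + sum_j ln N(x_j; miu_jy, sigma^2_jy)."""
    log_post = (np.log(priors)
                - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
                - 0.5 * np.sum((x - means) ** 2 / variances, axis=1))
    return int(np.argmax(log_post))   # classes assumed to be labeled 0, ..., M-1
```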
2.7 Kernel Tricks
- A trick to adapt linear classification algorithms to give non-linear decision boundaries
- Let the non-linear mapping function be denoted phi(x_bar)
- Often these mappings are to a larger feature space where the classes can then be linearly separated
- For example, classes that cannot be separated by a hyperplane in the original feature space can become linearly separable after applying such a mapping
- Recall that linear classifiers can be defined by a discriminant of the form g(x_bar) = w_bar.T x_bar + beta
- Recall also that w_bar and beta both depend on the terms miu_bar_l, miu_bar_m, and C
- Now to talk about kernel tricks
- The idea is to train a model that depends only on dot products between feature vectors (like the expressions above)
- Then we can replace the dot products with a kernel function and then perform a linear classification after mapping to a bigger feature space
- Here are the steps to obtain the kernel version of the model
- All computations with the model should only be written with dot products between pairs of feature vectors
- Replace each dot product x_bar_i.T x_bar_j with a kernel evaluation K(x_bar_i, x_bar_j) = phi(x_bar_i).T phi(x_bar_j)
- Essentially the kernel needs to act like it is doing a dot product in a different space
- Something that uses this kernel trick is called a kernel method
- There are various families of kernels that validly act like a dot product in some (possibly implicit) feature space
- One such family is the Mercer kernels
- Mercer kernels are symmetric and produce positive semi-definite kernel (Gram) matrices
- Other common kernels include the polynomial and radial basis function (Gaussian) kernels (see the sketch after this list)
- It is possible to overfit when enlarging a feature space if there is insufficient data
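A Python sketch of a few commonly used kernels and the Gram (kernel) matrix that replaces the matrix of dot products in a kernel method; the parameter defaults are illustrative.

```python
import numpy as np

def linear_kernel(x1, x2):
    return x1 @ x2                                    # plain dot product

def polynomial_kernel(x1, x2, degree=2, c=1.0):
    return (x1 @ x2 + c) ** degree                    # polynomial kernel

def rbf_kernel(x1, x2, gamma=1.0):
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))    # radial basis function (Gaussian) kernel

def gram_matrix(X, kernel):
    """Matrix of all pairwise kernel evaluations K(x_i, x_j) over the rows of X."""
    N = X.shape[0]
    return np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])
```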
2.8 Logistic Regression
- One of the most common classification approaches
- Can handle M-ary classes, ordinal (ordered) cases, determining which features are statistically significant, and more
- Logistic regression is a discriminative model
- It tries to directly model the posterior probability P(Y = 1 | X_bar = x_bar) = 1 / (1 + exp(-(beta_0 + beta_bar.T x_bar)))
- which we fit by maximum likelihood
- The maximum likelihood function L is based on the training data: L(beta_0, beta_bar) = prod_i P(Y = y_i | X_bar = x_bar_i)
- Note: the maximizer cannot be written in closed form, so the values of beta_0 and beta_bar must be found iteratively (see the sketch after this list)
- The classifier is a linear classifier: predict 1 when beta_0 + beta_bar.T x_bar >= 0
- The left-hand side equals the log-odds ln( P(Y=1|x_bar) / (1 - P(Y=1|x_bar)) ), known as the logit transformation, and it is non-negative iff P(Y=1|x_bar) >= 1/2
- And this gives the binary logistic regression classifier
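A minimal numpy sketch of binary logistic regression fit iteratively by gradient ascent on the average log-likelihood, assuming labels are encoded as 0/1 (rather than +/-1); the learning rate and iteration count are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_iter=1000):
    """There is no closed-form maximizer, so beta_0 and beta are found iteratively."""
    N, d = X.shape
    Xb = np.hstack([np.ones((N, 1)), X])     # prepend a column of 1s to absorb beta_0
    beta = np.zeros(d + 1)
    for _ in range(n_iter):
        p = sigmoid(Xb @ beta)               # current P(Y = 1 | x_i)
        beta += lr / N * (Xb.T @ (y - p))    # gradient of the average log-likelihood
    return beta[0], beta[1:]                 # beta_0, beta

def predict_logistic(x, beta_0, beta):
    """Linear classifier: predict 1 iff the logit beta_0 + beta . x is non-negative."""
    return 1 if beta_0 + beta @ x >= 0 else 0
```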
2.9 Support Vector Machines
- Linear support vector machines/Support vector classifiers are binary linear classifiers
- SVC + kernel trick gives a support vector machine (SVM)
2.9.1 Connection to Logistic Regression
- Like logistic regression, the support vector classifier tries to directly model the decision rule for y given x_bar, rather than modeling the class-conditional distributions (as, e.g., Naive Bayes does under its conditional independence assumption)
Chapter 1: Introduction
Table of Contents
- 1.1 What is Machine Learning
Notation
- Capital Letters: Random variables, matrices, number of classes (clear by context)
- Bold: Vectors (column vectors, unless transposed)
- Hats: estimates
Vocab
- i.i.d: Independently and identically distributed
- Joint Distribution: The probability distribution of two or more random variables considered together
- Supervised Learning: Learning with a training set
- Unsupervised Learning: Learning from a set of data without labels
- Feature Vectors: These are the vectors that “describe” a defining feature of the data
- Feature Space: Vector space containing the feature vectors
- Finite Set: Labels come from a finite set, e.g. {-1, 1}
- Non-Finite Set: Labels take values on the real number line
- Classification (Pattern Recognition): Supervised learning with a finite set of labels
- Labels are called classes
- Clustering: Grouping data together in a way that reveals some sort of pattern
1.1 What is Machine Learning
- Supervised Learning: Machine Learning with a training set T. Feature vectors x generate labels y
- Data is drawn i.i.d from some joint distribution
- Vectors x_i are feature vectors or inputs
- Values y_i are known as labels (outputs, responses)
- Feature Space: Vector space containing the feature vectors
- The goal of supervised learning is to create a map from the training data
- This allows us to create a model to predict output values
- This process is called learning or training a model
- Classification (Pattern Recognition)
- Labels come from a finite set, e.g. {-1, 1}
- Labels are now known as classes
- Then the supervised problem is known as classification or pattern recognition
- Labels are a non-finite set, any real number
- Then the supervised problem is known as regression
- Unsupervised Learning: Machine Learning without a training set but only a set of data (feature vectors)
- Data is drawn i.i.d and we want to infer something useful from the data
- Clustering is used to identify such patterns
- Take vectors from D and group them into similar groups