We will work through a classification problem using k-Nearest Neighbors (KNN) in Scikit-Learn.
-> Basic idea : Predict the label of a data point by
-> Looking at the 'k' closest labeled data points
-> Taking a majority vote among those neighbors
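The steps above can be sketched from scratch in a few lines. This is only an illustration of the idea, not the Scikit-Learn implementation; the function name `knn_predict` and the toy data are made up for the example.

```python
from collections import Counter
import math

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among the
    k closest labeled points (Euclidean distance)."""
    # Distance from x_new to every training point
    dists = [math.dist(x, x_new) for x in X_train]
    # Indices of the k nearest neighbors
    nearest = sorted(range(len(dists)), key=dists.__getitem__)[:k]
    # Majority vote among their labels
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D data: two clusters with labels 0 and 1
X = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
y = [0, 0, 0, 1, 1, 1]
print(knn_predict(X, y, (0.5, 0.5), k=3))  # point near the first cluster -> 0
```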
Scikit-Learn fit and predict
-> In Scikit-Learn, all machine learning models are implemented as Python classes :
-> They implement the algorithms for learning and predicting
-> They store the information learned from the data
-> Training a model on the data = "fitting" a model on the data
-> .fit() method
-> To predict the labels of new data
-> .predict() method
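Putting fit and predict together, a minimal sketch with `KNeighborsClassifier` might look like this (the tiny dataset is made up for illustration):

```python
from sklearn.neighbors import KNeighborsClassifier

# Tiny labeled dataset: 2-D features and class labels
X = [[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]]
y = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)                                  # "fitting" = training on the data
print(knn.predict([[0.5, 0.5], [5.5, 5.5]]))   # predict labels of new data
```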
Measuring Model Performance
-> In classification problems, accuracy is a commonly used metric.
-> Accuracy = the fraction of correct predictions
Accuracy = No. of correct predictions / No. of data points
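The formula above is straightforward to compute by hand; the labels here are invented just to show the arithmetic:

```python
y_true = [0, 1, 1, 0, 1]   # known labels
y_pred = [0, 1, 0, 0, 1]   # model's predictions

# Count how many predictions match the known labels
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
print(accuracy)  # 4 correct out of 5 -> 0.8
```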
Now, two questions arise :
- Which data should be used to compute the accuracy ?
- How will the model perform on new data ?
To resolve both problems, we split the data into a training set and a test set :
-> Fit/Train the classifier on the training set
-> Make predictions on the test set
-> Compare predictions with the known labels
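Scikit-Learn provides `train_test_split` for exactly this. A small sketch of the mechanics, using made-up 1-D samples:

```python
from sklearn.model_selection import train_test_split

X = [[i] for i in range(10)]          # ten 1-D samples
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# 70/30 split; random_state makes the shuffle reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
print(len(X_train), len(X_test))  # 7 3
```

The classifier is then fit on `X_train`/`y_train` only, and its predictions on `X_test` are compared against the held-out `y_test`.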
Model Complexity (an important consideration) :
-> Larger k = smoother decision boundary = less complex model
-> Smaller k = more complex model = can lead to overfitting
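One way to see this trade-off is to compare training and test accuracy for a few values of k; the specific k values and the `random_state` below are arbitrary choices for the sketch:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=21, stratify=y)

# Small k: very flexible boundary, fits the training set closely (risk of overfitting).
# Large k: smoother boundary, less complex model.
for k in (1, 5, 25):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(k, knn.score(X_train, y_train), knn.score(X_test, y_test))
```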
Let's code :
We will use the Iris dataset in the following code :
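A minimal version of the full workflow on Iris might look like the following; the split ratio, `random_state`, and `n_neighbors=6` are example choices, not prescribed values.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the Iris dataset (150 samples, 4 features, 3 classes)
X, y = load_iris(return_X_y=True)

# Hold out 25% of the data; stratify keeps the class balance in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

# Fit the classifier on the training set
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X_train, y_train)

# Predict on the test set and compare with the known labels
y_pred = knn.predict(X_test)
print("Test accuracy:", knn.score(X_test, y_test))
```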
Happy Machine Learning