Every ML project starts with understanding your data. You should analyze the data carefully and think about which algorithms to apply to it to reach the desired result.
Some of the steps of a well-defined ML project:
1. Understand and define the problem
2. Analyze and prepare the data
3. Apply the algorithms
4. Reduce the errors
5. Predict the result
Let's use the iris dataset, one of the most famous datasets available.
This dataset consists of physical measurements of three species of iris flower: Versicolor, Setosa, and Virginica. The numeric features it contains are sepal width, sepal length, petal width, and petal length. We will predict the class of each flower on the basis of these four features. The data consists of continuous numeric values which describe the dimensions of the respective features, and we will train the model on these features.
So, let's begin our first ML project.
We will use Python to develop this project, along with some of its libraries such as NumPy, Pandas, and scikit-learn, and we will use classifiers to train the model and make predictions.
Note: I will provide all of this code on GitHub at the end.
>>>import numpy as np
>>>import pandas as pd
>>>import matplotlib.pyplot as plt
>>>from sklearn.metrics import accuracy_score
>>>from sklearn import tree
>>>from sklearn.linear_model import LinearRegression
>>>from sklearn.linear_model import LogisticRegression
>>>from sklearn.ensemble import RandomForestClassifier
>>>from sklearn.neighbors import KNeighborsClassifier
>>>from sklearn.svm import SVC
NumPy is the fundamental package for scientific computing with Python.
Pandas is a Python library for data manipulation and analysis.
Matplotlib is a plotting library for Python.
sklearn (scikit-learn) is a free machine learning library for Python.
The iris dataset is available directly in the scikit-learn library, and we can import it with this code:
>>>from sklearn.datasets import load_iris
The parameters of the iris flowers can be expressed as a dataframe, as shown in the image below, and the 'class' column tells us the category each sample belongs to.
>>>iris = load_iris()
>>>iris_data = iris.data
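As a minimal sketch, the loaded features and targets can be combined into a pandas dataframe with a 'class' column (the variable names here are illustrative, not from the original code):

```python
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()
# Build a dataframe from the feature matrix, using the feature names as columns
df = pd.DataFrame(iris.data, columns=iris.feature_names)
# Add the target labels (0, 1, 2) as a 'class' column
df['class'] = iris.target
print(df.head())
```

Each row is one flower sample, and the 'class' column encodes the species as an integer label.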
As mentioned, there are three types of flowers in the dataset:
This is a relatively small dataset with 150 samples. Since the dataframe has four features (sepal length, sepal width, petal length, and petal width) with 150 samples belonging to one of the three target classes, our feature matrix will be 150 × 4.
Now, getting into the mathematics, let us find the standard deviation, mean, minimum value, and the four quartile percentiles of the data.
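These summary statistics can be obtained in one call with pandas' `describe()`, a sketch of which is shown below:

```python
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
# describe() reports count, mean, std, min, the 25%/50%/75% quartiles, and max
print(df.describe())
```

This gives a quick per-feature overview before any modelling is done.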
Let's see the data visually:
Let's use a boxplot of the dataset, which gives a visual representation of how the data is spread over the plane. A boxplot is a percentile-based graph which divides the data into four quartiles of 25% each; this method is widely used in statistical analysis.
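A sketch of drawing such a boxplot with pandas and Matplotlib (saving to a file here instead of showing an interactive window; the filename is illustrative):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so the script runs headless
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
# One box per feature: the box spans the 25th-75th percentile, the line is the median
df.plot(kind='box', figsize=(8, 5))
plt.savefig('iris_boxplot.png')
```

In an interactive session you would call `plt.show()` instead of `plt.savefig()`.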
Applying the Algorithms
1. Dividing the data for training and testing
Once we have understood what the training set is all about, we can start training the model with our chosen algorithms. We will implement some of the most commonly used ML algorithms. Let us start by splitting off some samples for training. We will use the function 'model_selection.train_test_split', which divides our dataset in a 70:30 ratio. We implement it in the following code.
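A minimal sketch of the split described above (`test_size=0.3` gives the 70:30 ratio; the `random_state` value is an arbitrary choice for reproducibility):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
# 70% of the samples go to training, 30% are held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)
print(X_train.shape, X_test.shape)
```

With 150 samples this yields 105 training and 45 test samples.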
2.Training the model
We will use some common algorithms and check which one predicts best on this dataset.
1. K-Nearest Neighbour (KNN)
2. Support Vector Machine (SVM)
Now we can start implementing the algorithms.
K – Nearest Neighbour (KNN)
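A sketch of training a KNN classifier on the split above (`n_neighbors=5` is scikit-learn's default, used here as an assumption rather than a tuned value):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

# Fit the classifier on the training data, then score it on the held-out test set
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
knn_accuracy = accuracy_score(y_test, knn.predict(X_test))
print('KNN accuracy:', knn_accuracy)
```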
Support Vector Machine (SVM)
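A similar sketch for SVM (the linear kernel is an assumption here; the default RBF kernel also works well on this dataset):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

# Train a support vector classifier and measure accuracy on the test set
svm = SVC(kernel='linear')
svm.fit(X_train, y_train)
svm_accuracy = accuracy_score(y_test, svm.predict(X_test))
print('SVM accuracy:', svm_accuracy)
```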
3. Choose the best model from the above, or tune the parameters
The LogisticRegression classifier gave the best result.
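As a sketch, the imported classifiers can all be compared on the same held-out split (the split, `random_state`, and `max_iter` values are assumptions; exact scores depend on the split):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

# Fit each classifier on the same split and record its test accuracy
models = {
    'LogisticRegression': LogisticRegression(max_iter=200),
    'KNN': KNeighborsClassifier(),
    'SVM': SVC(),
    'RandomForest': RandomForestClassifier(random_state=42),
    'DecisionTree': DecisionTreeClassifier(random_state=42),
}
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, model.predict(X_test))
    print(name, results[name])
```

Picking the model with the highest test accuracy, or tuning its parameters further, is the final step of the workflow.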
GitHub link for the code.
No algorithm can give you 100% accuracy in ML, but try to gain as much accuracy as you can. So, here we have completed our first dataset. Please stay tuned and follow us for more interesting datasets.