To develop a machine learning model, there is a roadmap you can follow so that you can build the best possible model with the best accuracy.
Preprocessing - Getting Data into Shape
Raw data rarely comes in the form and shape we need. We almost always need to preprocess raw data so that our machine learning algorithm performs optimally. Preprocessing is therefore a crucial step in building a machine learning application.
Ex: If we take the Iris dataset, we can think of our raw data as a series of flower images from which we have to extract meaningful features. Useful features could be the color, hue, and intensity of the flowers, as well as their height, width, and length.
Some ML algorithms also require that the features we have selected be on the same scale for optimal performance, which we usually achieve by transforming the features to the range [0, 1] or to a standard normal distribution with zero mean and unit variance.
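Both transformations can be done in a few lines. The post doesn't name a library, so scikit-learn is assumed here as a sketch of the two scaling options:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy feature matrix: two features on very different scales.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

# Min-max scaling: rescales each feature column to the range [0, 1].
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization: each feature column gets zero mean and unit variance.
X_std = StandardScaler().fit_transform(X)
```

Min-max scaling keeps values bounded, while standardization centers the data, which many gradient-based algorithms prefer.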
Some of the features may be highly correlated and therefore redundant to a certain degree. In those cases, a dimensionality reduction technique is useful for compressing the features into a lower-dimensional subspace. Reducing the dimensionality of the feature space reduces storage requirements and lets the algorithm run much faster. If our dataset contains a large number of irrelevant features (noise), we can also use dimensionality reduction to remove the noise.
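As a sketch of this idea (assuming scikit-learn, which the post doesn't name), principal component analysis can compress the four Iris features into two components:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data          # 150 samples, 4 features

# Project the 4-dimensional feature space onto its 2 principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# How much of the original variance the 2 components retain.
retained = pca.explained_variance_ratio_.sum()
```

For Iris, the first two components retain well over 90% of the variance, so little information is lost by the compression.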
To determine whether our ML model not only performs well on the training data but also generalizes to new data, we should randomly divide our dataset into a training set and a test set. We use the training set to train the model and the test set to evaluate it.
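A minimal sketch of this random split, again assuming scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 30% of the data for testing; stratify keeps class
# proportions the same in both splits, and random_state makes
# the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)
```

A 70/30 split is a common default; with very small datasets a larger training fraction is often preferred.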
Training and Selecting a Predictive Model
Each ML algorithm is designed to solve a different kind of problem or task. We cannot simply rely on a model without making assumptions about our task and configuring the model accordingly. We have to compare different ML models to select the best-performing one. But before we can compare models, we must decide on a metric to measure their performance. A commonly used metric is classification accuracy, the proportion of correctly classified instances.
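Classification accuracy is simple enough to compute by hand; this sketch shows the definition on a tiny made-up set of labels:

```python
from sklearn.metrics import accuracy_score

# Hypothetical true labels and model predictions.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

# Proportion of correctly classified instances: 4 of 5 match.
acc = accuracy_score(y_true, y_pred)
```

Here the prediction disagrees on one of five instances, so the accuracy is 0.8.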
We also use cross-validation techniques, in which the training dataset is further divided into training and validation subsets, in order to estimate the generalization performance of our model.
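A sketch of k-fold cross-validation (assuming scikit-learn and, purely for illustration, a logistic regression classifier):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: train on 4 folds, validate on the 5th,
# rotating so every fold is used for validation once.
scores = cross_val_score(clf, X, y, cv=5)
mean_score = scores.mean()
```

Averaging the five fold scores gives a more stable estimate of generalization performance than a single train/validation split.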
Finally, we cannot simply use the default parameters that different ML libraries provide for their algorithms, since these are not necessarily optimal for our specific task. Therefore, we use hyperparameter optimization techniques to fine-tune the performance of our model by varying the parameters of the different ML models.
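Grid search is the simplest hyperparameter optimization technique: try every combination from a small grid and keep the best by cross-validated score. A sketch, assuming scikit-learn and an SVM classifier chosen just for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Candidate values for two SVM hyperparameters (an arbitrary example grid).
param_grid = {'C': [0.1, 1.0, 10.0], 'gamma': ['scale', 0.1]}

# Evaluate every combination with 5-fold cross-validation,
# then refit the best one on the full training data.
gs = GridSearchCV(SVC(), param_grid, cv=5)
gs.fit(X, y)
best_params = gs.best_params_
```

For larger grids, randomized search over the same parameter space is usually a cheaper alternative.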
After selecting a model and fitting it on our training dataset, we can use the test dataset to estimate how well the model performs on unseen data, that is, to estimate the generalization error.
Predicting on the Unseen Data
If we are satisfied with the performance of our model, we can now use this model to predict new, future data.
Note: It is important to note that the parameters for the previously mentioned procedures, such as feature scaling and dimensionality reduction, are obtained solely from the training dataset, and the same parameters are later reapplied to transform the test dataset, as well as any new data samples. Otherwise, the performance measured on the test data may be overly optimistic.
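The fit-on-train, transform-on-test pattern in that note looks like this in a minimal sketch (scikit-learn assumed, with feature scaling as the example procedure):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

sc = StandardScaler()
# Learn the mean and standard deviation from the TRAINING data only.
X_train_std = sc.fit_transform(X_train)
# Reapply those same parameters to the test data; never call fit here.
X_test_std = sc.transform(X_test)
```

Calling `fit` on the test data would leak information about it into the preprocessing step, which is exactly what makes test-set performance look overly optimistic.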
Thank you!