AlgoDaily - Into the World of Machine Learning


The Data

Data is the most important component in machine learning, since every machine learning algorithm's task is to learn from that data. A model is first trained on some data and is then used to make predictions on unknown data. In this sense, data is divided into two parts. Many online resources misinform readers about training, testing, and cross-validation data. Some sources say that a program validates the model with test data, but this is wrong: the outputs for the test data are unknown, so there is nothing to validate against. After this lesson, you will have a solid idea of how data is really divided and used in machine learning algorithms.

Let's look at an example of vehicle data where we attempt to estimate the price of a car based on the features of the car.

Training Data: The training data is the data that we know everything about. We know what the output should look like, so we can calculate the loss of our model from this data. We can also calculate the accuracy of the model from the training data. This is the data that is used in the training loop of a machine learning algorithm. In the example above, all the features along with the prices are the training data.
Test Data: This is the data that we want to understand using our machine learning algorithm. We need to predict the result for this unknown data with our model. Suppose a model has been trained on the given training data of vehicles. If you have a list of features for a new set of cars whose prices you do not know, then that data is the test data.
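The difference between the two kinds of data can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows, the `predict_price` rule, and the choice of mean absolute error as the loss are all hypothetical and are not taken from the lesson's dataset. The point is that a loss can be computed on training data because the prices are known, while test data only yields predictions.

```python
# Hypothetical training rows: (mileage in thousands of km, known price).
training_data = [(10, 18500.0), (50, 14000.0), (100, 8500.0)]

# Hypothetical test rows: features only, the prices are unknown.
test_data = [30, 80]


def predict_price(mileage):
    """Stand-in model: a fixed linear rule, not a trained one (assumption)."""
    return 19000.0 - 100.0 * mileage


def mean_absolute_error(rows):
    """Average absolute gap between predicted and known prices."""
    errors = [abs(predict_price(mileage) - price) for mileage, price in rows]
    return sum(errors) / len(errors)


# The loss needs the known prices, so it is only available for training data.
training_loss = mean_absolute_error(training_data)

# For test data we can only produce predictions; there is nothing to compare to.
test_predictions = [predict_price(mileage) for mileage in test_data]
```

Because `test_data` carries no prices, calling `mean_absolute_error` on it is not possible, which is the limitation the lesson describes.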

However, if you train on all of the training data, you will not have any reliable way to measure the accuracy of the model. The model may work on seen data but break on unseen data. You cannot get an accuracy figure from the test data either, since you do not know the prices of the vehicles in it. To solve this, the training data is divided into two parts.

Actual Training Data: This is the part of the training data that is used in the training loop. In most cases, this is 75% or 80% of all the training data.
Cross-Validation Data: This is kept separate from the actual training data and is not used in the training loop. Thus, after training, the cross-validation data is still unseen by the model. We can then evaluate the model on both seen and unseen data drawn from the training data. Most of the time, this is 20% or 25% of all the training data.
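The split described above can be sketched with the Python standard library. The function name `split_training_data`, its parameters, and the toy rows are assumptions made for illustration; the 20% held-out fraction follows the figure quoted in the lesson. Libraries such as scikit-learn offer a ready-made equivalent, but a hand-written version keeps the idea visible.

```python
import random


def split_training_data(rows, validation_fraction=0.2, seed=0):
    """Shuffle a copy of rows and return (actual_train, cross_validation)."""
    shuffled = list(rows)                  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for reproducibility
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


# Hypothetical labeled rows: (feature value, known price).
labeled_rows = [(i, 20000 - 150 * i) for i in range(10)]

actual_train, cross_validation = split_training_data(labeled_rows)
```

Only `actual_train` would be passed to the training loop. `cross_validation` still has known prices, so loss and accuracy can be computed on it as a stand-in for unseen data.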

Dataset attributes are also classified according to how they are used:

Feature: These are the attributes or columns that the machine learning model will analyze and learn. In the vehicle dataset, all the columns except "price" are features.
Label: These are the attributes or columns that the machine learning model will try to predict. This label is used to calculate loss and accuracy. In the vehicle dataset, the column "price" is the label.
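As a rough sketch of the feature and label distinction, the snippet below separates a "price" column from the remaining columns. The rows and the column names other than "price" are invented for illustration and do not come from the lesson's vehicle dataset; `split_features_and_label` is a hypothetical helper.

```python
# Hypothetical vehicle rows; only the "price" column name comes from the lesson.
vehicles = [
    {"brand": "Toyota", "year": 2018, "mileage": 42000, "price": 15500},
    {"brand": "Ford", "year": 2015, "mileage": 88000, "price": 9800},
]

LABEL_COLUMN = "price"


def split_features_and_label(rows, label_column):
    """Return (features, labels): every column except the label, and the label."""
    features = [
        {name: value for name, value in row.items() if name != label_column}
        for row in rows
    ]
    labels = [row[label_column] for row in rows]
    return features, labels


features, labels = split_features_and_label(vehicles, LABEL_COLUMN)
```

Here `features` holds what the model would learn from, and `labels` holds the values it would be asked to predict and be scored against.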

To illustrate the whole scenario with the vehicle dataset example, let's look at the image above. The prices in green are the labels of the data, while the rest of the columns are features. We will learn more about the different types of features, such as numeric and categorical features, in another lesson.
