Introduction

Machine learning (ML) is a hot topic these days, so it's no wonder that many people want to orient their careers towards this promising field. Therefore, if you want to get a machine learning job, then make sure you review these common ML questions and topics to prepare yourself for a successful interview.

AI vs. ML vs. DL

Artificial intelligence (AI) is the broadest term of the three. AI consists of everything related to making machines smart and capable of thinking without human intervention. The overall goal of AI is to allow computers to perform cognitive tasks in a wide range of areas like a human would.

Machine learning (ML) is a subset of AI. The main idea here is to create algorithms that can learn how to make decisions by themselves. For example, in the table below, we are deciding whether we'd play tennis based on some features (outlook, temperature, humidity, and wind). In ML, we'd give this data to the model and let it learn how the decision is made. Then we would have the model make the decision independently when given only the features, without the label (yes/no).

Deep learning (DL) is a subset of ML. DL algorithms mimic the processing patterns in the human brain by creating artificial neural networks consisting of several different layers of neurons. These algorithms require more data than ML algorithms.

Classifying Machine Learning Algorithms

There are three distinct types of ML:

  • Supervised – In these types of algorithms, data is labeled, i.e. the targets are known. This means that the model learns from a labeled dataset before making decisions about new data.

  • Unsupervised – In these algorithms, data is unlabeled. This means that the model is being trained without knowing the targets. The model has to learn by observing different patterns and structures in the data.

  • Reinforcement – This is a trial and error method. You can think of it as a kind of a game where for every correct assumption, the model gets a reward. For every mistake, however, it gets punished.

Using Classification vs Regression

Classification and regression are both supervised machine learning algorithms. Their difference is derived from the type of their target variable:

  • Classification is used when the target is categorical. This means it can be used when answering a yes/no question, estimating gender, type of vehicle, breed of a dog, etc.

  • Regression is used when the target is continuous. This means it can be used when estimating people’s age, salary, price of real estate, how fast an animal is going to get adopted, etc.
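
The distinction above can be sketched with scikit-learn: a classifier returns a categorical label, while a regressor returns a continuous value. The toy datasets below are invented purely for illustration.

```python
# Minimal sketch contrasting classification and regression in scikit-learn.
# The toy data (temperatures, salaries) is invented for illustration.
from sklearn.linear_model import LogisticRegression, LinearRegression

# Classification: categorical target (0 = "no", 1 = "yes")
X_cls = [[15], [20], [25], [30], [35], [40]]   # e.g. temperature
y_cls = [0, 0, 0, 1, 1, 1]                     # play tennis? no/yes
clf = LogisticRegression().fit(X_cls, y_cls)
print(clf.predict([[18], [38]]))               # outputs class labels

# Regression: continuous target
X_reg = [[1], [2], [3], [4]]                   # e.g. years of experience
y_reg = [30.0, 35.0, 40.0, 45.0]               # salary in $1000s
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict([[5]]))                      # outputs a continuous value
```

Note that the same feature matrix shape works for both; only the nature of the target (and hence the model) changes.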

Best Machine Learning Models

Supervised learning:

  • Logistic regression is used for classification. It analyzes independent variables (categorical or numeric) and outputs the probability of a binary outcome (yes/no, pass/fail, cat/dog), which is then thresholded into a categorical prediction.

  • Linear regression is used for regression. This algorithm assumes a linear relationship between input and output variables.

  • Decision tree can be used for both classification and regression. It is very intuitive and fast. A decision tree is simply a set of cascading questions, i.e., it separates the data into the two most similar categories at a time, thus, creating branches and leaves.

  • Random forest algorithms can also be used for both classification and regression like decision trees. Random forests consist of several randomly created decision trees that operate as an ensemble. Random forests frequently outperform a single decision tree since a large number of loosely correlated trees protect each other from individual errors.
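
A quick sketch of the tree-based models above, using scikit-learn on an invented XOR-style toy dataset (chosen because it is not linearly separable, so a tree's cascading questions can handle it where a linear model could not):

```python
# Hedged sketch: a single decision tree vs. a random forest ensemble.
# The XOR-style toy dataset is invented for illustration.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]  # XOR labels: not linearly separable

# A single tree of cascading yes/no questions
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(tree.predict([[0, 1]]))

# An ensemble of randomized trees voting together
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.predict([[0, 1]]))
```

On realistic data the forest's averaging over many loosely correlated trees is what typically reduces the variance of any single tree.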

Unsupervised learning:

  • K-means clustering is an iterative technique that attempts to split a dataset into K separate and non-overlapping clusters (subgroups) such that each data point is part of only one of these clusters. The idea is to make the data points belonging to the same subgroup as similar as possible while keeping the clusters as separate as possible from each other.

  • Apriori algorithm is used for frequent itemset mining and the generation of association rules. Association rules specify how closely or loosely two items are related. The algorithm finds these rules using a breadth-first (level-wise) search.
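
K-means clustering can be sketched in a few lines with scikit-learn. The two well-separated blobs of points below are invented for illustration; the algorithm is asked to recover them without ever seeing labels.

```python
# Hedged sketch: k-means assigns each point to one of K clusters
# without any labels. The toy points are invented for illustration.
from sklearn.cluster import KMeans

X = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],   # one tight group
     [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]]   # another tight group

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)          # cluster assignment for each point
print(km.cluster_centers_) # the two learned centroids
```

The first three points end up in one cluster and the last three in the other, which is exactly the "similar within, separate between" goal described above.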

Try this exercise. Click the correct answer from the options.

Which of the following algorithms is used for supervised learning?

Click the option that best answers the question.

  • Apriori Algorithm
  • Linear Regression
  • K-means Clustering

Steps in Machine Learning

1) Data collection – the quantity and quality of the data is one of the most important factors that determine how accurate the model is.

2) Data preparation – this includes data wrangling, cleaning (normalization, handling missing values, removing duplicates, etc.), visualization, and splitting the dataset into training, validation, and testing subsets.

3) Model selection – choose the right model for the task.

4) Model training – train the model with the defined training dataset.

5) Model evaluation – see how well the model performed with regard to a chosen metric by testing it against previously unseen data. For this step, we use the validation dataset.

6) Parameter tuning – also known as hyperparameter tuning. Changing some parameters that are important for the chosen model can improve its performance significantly.

7) Making predictions - use the testing dataset to test the model and see how it would perform in the real world.
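
The steps above can be condensed into a short scikit-learn sketch. The dataset, split ratios, and model choice here are all invented for illustration; a real project would involve far more work at the data-preparation stage.

```python
# Hedged end-to-end sketch of the ML steps on an invented toy dataset.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1-2) Data collection and preparation: toy data, split train/val/test
X = [[i] for i in range(20)]
y = [0] * 10 + [1] * 10
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=0, stratify=y_tmp)

# 3-4) Model selection and training
model = LogisticRegression().fit(X_train, y_train)

# 5-6) Evaluate on the validation set (and tune hyperparameters against it)
print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))

# 7) Final predictions on the held-out test set
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```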

Build your intuition. Fill in the missing part by typing it in.

In the ____ step, data is cleaned, visualized, split, and prepared for further processing.

Write the missing line below.

Confusion Matrix

A confusion matrix is a performance measurement for a classification problem in which there are two or more classes. The confusion matrix gives the number of predicted and actual labels for each of the possible classes. In the picture below, we can see what the confusion matrix looks like for a binary problem.

Confusion Matrix

  • True positive – predicted positive, and it's actually positive.
  • True negative – predicted negative, and it's actually negative.
  • False positive – predicted positive, but it's actually negative.
  • False negative – predicted negative, but it's actually positive.
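
The four cells can be counted by hand in plain Python. The binary predictions below are invented for illustration (1 = positive, 0 = negative):

```python
# Counting the four confusion-matrix cells for invented binary labels.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # hit
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # correct rejection
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false alarm
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # miss

print(tp, tn, fp, fn)  # → 3 3 1 1
```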

Are you sure you're getting this? Is this statement true or false?

A false negative is when you predict a negative and it is actually negative.

Press true if you believe the statement is correct, or false otherwise.

The ROC Curve

The idea of the Receiver Operating Characteristic (ROC) curve is to illustrate the performance of a model at all possible thresholds. By doing this, we can find the threshold that would separate the classes the best.

What do we mean by setting the threshold? Well, every time a new sample comes in, the model calculates the probability of that sample belonging to each of the possible classes. Based on that probability and the specified threshold, it assigns the label. For example, if we try to classify whether a person is obese, and the threshold is set at 0.5, then every time the probability of that person being obese is over 0.5, the model classifies them as obese.

Even though the intuitive approach is to set the threshold at 0.5, it is sometimes better to set it lower or higher, for example, when classifying patients as sick with some disease. In such a case, in order to correctly catch all patients who are sick, we might lower the threshold and accept a higher number of false-positive predictions.

So, by using the ROC curve, we can determine the optimal threshold for our approach. We do this by plotting the true positive rate (TPR):

TPR = TP / (TP + FN)

against the false positive rate (FPR):

FPR = FP / (FP + TN)

on a graph and drawing a line where each point represents the (FPR, TPR) pair for a specific threshold.
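
Sweeping the threshold and computing one (FPR, TPR) point per threshold can be done in plain Python. The predicted probabilities below are invented for illustration:

```python
# Hedged sketch: computing ROC-curve points by sweeping the threshold
# over invented predicted probabilities for class 1.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # model's predicted probability of class 1

def tpr_fpr(threshold):
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and t == 1 for p, t in zip(preds, y_true))
    fn = sum(p == 0 and t == 1 for p, t in zip(preds, y_true))
    fp = sum(p == 1 and t == 0 for p, t in zip(preds, y_true))
    tn = sum(p == 0 and t == 0 for p, t in zip(preds, y_true))
    return tp / (tp + fn), fp / (fp + tn)

for thr in [0.2, 0.5, 0.9]:
    print(thr, tpr_fpr(thr))  # one (TPR, FPR) point per threshold
```

Plotting these points with FPR on the x-axis and TPR on the y-axis traces out the ROC curve.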

Evaluating a Classification Model

There are several metrics we can use to evaluate our model. Some of the most popular ones include:

  • Precision – the ratio of true positives to the total number of predicted positives

    • P = TP / (TP + FP)
  • Recall – the ratio of true positives to the total number of actual positives. Also known as the True Positive Rate (TPR).

    • R = TP / (TP + FN)
  • Accuracy – the ratio of correctly predicted samples to the total number of samples.

    • A = (TP + TN) / (TP + TN + FP + FN)
  • AUC – also known as the Area Under the Curve; more specifically, the area under the ROC curve. The higher the curve, the better the model's predictions. AUC is a composite measure of performance that takes all potential thresholds into account: it equals the probability that the model ranks a random positive example higher than a random negative example.
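
The first three formulas are simple arithmetic on the confusion-matrix cells. The counts below are invented for illustration:

```python
# Precision, recall, and accuracy computed from assumed (invented) counts.
tp, tn, fp, fn = 40, 30, 10, 20

precision = tp / (tp + fp)                  # 40 / 50  = 0.8
recall = tp / (tp + fn)                     # 40 / 60  ≈ 0.667
accuracy = (tp + tn) / (tp + tn + fp + fn)  # 70 / 100 = 0.7

print(precision, recall, accuracy)
```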

Overfitting and Underfitting

If our model is performing poorly, then the first thing we should check is whether the model is overfitting or underfitting.

  • Overfitting occurs when the model is modeling the training data too well. In other words, it learns all the details about the specific data we have provided, including noise. The problem appears when new data is introduced. If this data lacks those particular details/noise, then the model is not able to model the new data correctly.

  • Underfitting, on the other hand, occurs when the model is too simple to capture the patterns even in the training data. Obviously, such a model can't generalize to new data either.

The best way to detect either of these unwanted behaviors is to evaluate the model on a validation set: a large gap between training and validation performance signals overfitting, while poor performance on both signals underfitting.
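
Overfitting can be demonstrated with a classic sketch: fitting polynomials of increasing degree to a small noisy dataset. The data and degrees below are invented for illustration; note how the high-degree fit achieves a lower training error than the low-degree fit but a much higher error on held-out points.

```python
# Hedged sketch: a high-degree polynomial overfits a small invented dataset.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 12)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.shape)  # noisy signal

x_train, y_train = x[:8], y[:8]   # training points
x_val, y_val = x[8:], y[8:]       # held-out points

def mse(degree):
    """Mean squared error on train and validation for a polynomial fit."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    val_err = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    return train_err, val_err

for degree in [1, 3, 7]:
    print(degree, mse(degree))
```

The degree-7 polynomial passes almost exactly through every training point (near-zero training error, including the noise) yet blows up on the validation points, which is overfitting in miniature.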

Are you sure you're getting this? Fill in the missing part by typing it in.

___ is the behavior where the model learns all the details from the data to such an extent that it cannot generalize well with unseen data.

Write the missing line below.

One Pager Cheat Sheet

  • By reviewing common ML questions and topics, you can prepare yourself for a successful Machine Learning job interview.
  • AI is the broadest term of the three, covering everything related to making machines smart, while ML and DL are subsets of AI which can learn to make decisions and mimic the processing patterns of the human brain using artificial neural networks respectively.
  • Machine learning algorithms can be classified into three distinct categories: Supervised, Unsupervised, and Reinforcement.
  • Classification is used when the target variable is categorical, and Regression is used when the target variable is continuous.
  • Supervised Learning models such as Logistic Regression, Linear Regression and Decision Tree as well as Random Forest are used for classification and regression, while the K-means Clustering and Apriori Algorithm are used for Unsupervised Learning tasks.
  • Among the supervised algorithms, Linear Regression assumes a linear relationship between input and output variables, while Decision Trees and Random Forests can also capture non-linear patterns.
  • The important steps involved in Machine Learning are Data collection, Data preparation, Model selection, Model training, Model evaluation, Parameter tuning and Making predictions.
  • Data is cleaned, visualized, split, and prepared in the data preparation step.
  • A confusion matrix is a performance measurement for a classification problem that depicts the number of true positives, true negatives, false positives, and false negatives.
  • A false negative is when a negative prediction is actually positive, resulting in a misclassification of the positive instance.
  • The ROC Curve is used to plot the True Positive Rate (TPR) against the False Positive Rate (FPR) to determine the optimal threshold to separate the classes.
  • We can evaluate our classification model's performance using popular metrics such as Precision, Recall, Accuracy, and AUC.
  • We should use a validation set to check if our model is overfitting or underfitting.
  • Overfitting occurs when a model learns too many details from the training data, resulting in poor generalization with unseen data.