Machine Learning Theory
- Explain overfitting and techniques to overcome it like regularization
Overfitting occurs when a model fits the training data too closely, losing the ability to generalize to new data. Regularization helps prevent overfitting by penalizing model complexity. Common regularization techniques include L1/L2 regularization, dropout layers, and early stopping.
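For illustration, a minimal sketch (assuming scikit-learn and NumPy are available; the synthetic data and alpha value are made up) comparing an unregularized high-degree polynomial fit with an L2-regularized (ridge) fit:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 30)   # noisy targets
X_test = np.linspace(0, 1, 100).reshape(-1, 1)
y_test = np.sin(2 * np.pi * X_test).ravel()

# Degree-15 polynomial: prone to overfitting without a penalty on the weights.
unregularized = make_pipeline(PolynomialFeatures(15), LinearRegression())
regularized = make_pipeline(PolynomialFeatures(15), Ridge(alpha=1e-3))  # L2 penalty

for name, model in [("no penalty", unregularized), ("L2 (ridge)", regularized)]:
    model.fit(X, y)
    print(name, "test MSE:", mean_squared_error(y_test, model.predict(X_test)))
```

The regularized pipeline typically shows a noticeably lower test error on this kind of noisy, over-parameterized fit.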
- How does gradient descent work?
Gradient descent is an optimization algorithm that minimizes a loss function by iteratively adjusting model parameters in the direction of the negative gradient, i.e. the direction that most reduces the loss. The learning rate determines the size of each adjustment step, and repeated updates move the parameters toward a (local) minimum of the loss surface.
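A toy sketch of gradient descent on least-squares linear regression, using only NumPy; the data, learning rate, and step count are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_w = np.array([2.0, -3.0])
y = X @ true_w + rng.normal(scale=0.1, size=100)

w = np.zeros(2)          # initial parameters
lr = 0.1                 # learning rate: step size for each update

for step in range(200):
    grad = 2 / len(y) * X.T @ (X @ w - y)   # gradient of the mean squared error
    w -= lr * grad                           # move against the gradient
print("learned weights:", w)                 # close to [2.0, -3.0]
```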
- What is the difference between supervised, unsupervised, and reinforcement learning?
Supervised learning uses labeled data, unsupervised learning works with unlabeled data, and reinforcement learning learns from interactions with an environment. Supervised models predict outcomes, unsupervised models find hidden patterns or structure, and reinforcement learning agents optimize actions to maximize cumulative reward.
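A small sketch (scikit-learn assumed) contrasting the first two paradigms on the same data; reinforcement learning requires an environment interaction loop and is omitted here:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: the labels y are used to learn a predictor.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("supervised training accuracy:", clf.score(X, y))

# Unsupervised: only X is used; the model looks for structure (clusters).
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", [int((clusters == k).sum()) for k in range(3)])
```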
- Explain bias-variance tradeoff
The bias-variance tradeoff describes the balance between errors from overly simple assumptions (bias) and errors from excessive sensitivity to the training data (variance). High bias leads to underfitting, while high variance causes overfitting. Regularization techniques like L1/L2 regularization add a penalty term to the loss function that shrinks parameters, accepting a small increase in bias in exchange for a larger reduction in variance. This helps improve generalizability.
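An illustrative sketch (scikit-learn assumed; the degrees and synthetic data are made up) showing how validation error reflects the tradeoff: a low-degree model underfits (high bias), a very high-degree model overfits (high variance), and an intermediate degree balances the two:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, (60, 1))
y = np.cos(3 * np.pi * X).ravel() + rng.normal(0, 0.1, 60)

for degree in (1, 4, 20):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(f"degree {degree:2d}  cross-validated MSE: {mse:.3f}")
```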
- What is regularization and why is it useful?
Regularization is a technique used to prevent overfitting in machine learning models. It works by adding a penalty term to the loss function that discourages model complexity. This helps improve the model's generalizability to new, unseen data.
Some reasons why regularization is useful (a short code sketch follows this list):
- Reduces overfitting: Penalizes models that are too complex or have high variance, improving their ability to generalize.
- Prevents coefficients from becoming too large: Regularization shrinks the magnitude of model parameters that would otherwise grow large to fit noise in the training data.
- Makes model interpretation easier: Simpler models are easier to understand and explain.
- Reduces the chance of numerical issues: It can prevent parameters from growing so large that they cause numerical instability.
- Provides feature selection: Techniques like L1 regularization zero out less important features. 
- Improves accuracy: In some cases, regularized models can achieve higher accuracy by reducing overfitting. 
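A brief sketch of the points above (scikit-learn assumed; the synthetic dataset and alpha values are illustrative): L2 (ridge) shrinks coefficients, while L1 (lasso) drives some of them exactly to zero, giving a form of feature selection:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso

X, y = make_regression(n_samples=80, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

for name, model in [("OLS", LinearRegression()),
                    ("Ridge (L2)", Ridge(alpha=1.0)),
                    ("Lasso (L1)", Lasso(alpha=1.0))]:
    coefs = model.fit(X, y).coef_
    print(f"{name:10s} max |coef| = {np.abs(coefs).max():7.2f}, "
          f"zero coefs = {int((coefs == 0).sum())}")
```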
- What are ensemble methods and why are they useful?
Ensemble methods combine multiple models to create a single, more robust model. Techniques like bagging, boosting, and stacking are commonly used ensemble methods. They are useful for improving model performance, reducing overfitting, and increasing stability.
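A compact sketch (scikit-learn assumed; dataset and hyperparameters are illustrative) comparing a single decision tree with a bagging ensemble (random forest) and a boosting ensemble:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    "bagging (random forest)": RandomForestClassifier(n_estimators=200, random_state=0),
    "boosting (gradient boosting)": GradientBoostingClassifier(random_state=0),
}
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name:30s} CV accuracy: {acc:.3f}")
```

The ensembles generally score higher than the single tree because averaging (bagging) reduces variance and sequential correction (boosting) reduces bias.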
- Explain the concept of feature selection and its importance.
Feature selection involves choosing the most relevant features (variables) for training a model. This is crucial for improving model performance, reducing overfitting, and speeding up training. Methods for feature selection include filter methods, wrapper methods, and embedded methods.
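A short sketch (scikit-learn assumed; the dataset and k are illustrative) of two of these styles: a filter method (univariate F-test) and an embedded method (L1-penalized model):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=25, n_informative=5,
                           random_state=0)

# Filter: score each feature independently, keep the top k.
filter_sel = SelectKBest(f_classif, k=5).fit(X, y)
print("filter keeps features:", list(filter_sel.get_support(indices=True)))

# Embedded: keep features whose L1-penalized coefficients are non-zero.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
embedded_sel = SelectFromModel(l1_model).fit(X, y)
print("embedded keeps features:", list(embedded_sel.get_support(indices=True)))
```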
- What are hyperparameters and how are they different from parameters?
Hyperparameters are settings that define the structure and behavior of a machine learning model, such as learning rate, regularization strength, and the number of hidden layers in a neural network. Parameters, on the other hand, are the internal variables that the model learns during training. Hyperparameters are set before training, while parameters are learned during training.
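A small sketch (scikit-learn assumed; C and max_iter values are illustrative) separating the two: C and max_iter are hyperparameters chosen before training, while coef_ and intercept_ are parameters learned from the data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Hyperparameters: set by the practitioner before fitting.
model = LogisticRegression(C=0.5, max_iter=5000)

# Parameters: learned from the training data during fit().
model.fit(X, y)
print("learned coefficients shape:", model.coef_.shape)
print("learned intercept:", model.intercept_)
```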
- Describe the k-Nearest Neighbors (k-NN) algorithm.
The k-NN algorithm classifies a data point based on how its neighbors are classified. Given a new data point, the algorithm looks for the 'k' nearest data points in the training set and assigns the most frequent class among those neighbors to the new point. It's a lazy learner, meaning it doesn't build an explicit model during training but rather makes decisions based on the entire dataset during inference.
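A from-scratch sketch of k-NN classification using only NumPy; the tiny dataset and k=3 are illustrative:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x_new, axis=1)       # Euclidean distances
    nearest = np.argsort(dists)[:k]                        # indices of k closest
    return Counter(y_train[nearest]).most_common(1)[0][0]  # most frequent label

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [3.0, 3.2], [3.1, 2.9], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1, 0])
print(knn_predict(X_train, y_train, np.array([1.1, 1.0])))  # -> 0
print(knn_predict(X_train, y_train, np.array([3.0, 3.0])))  # -> 1
```

Note that all the work happens at prediction time, which is why k-NN is called a lazy learner.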
- How do support vector machines (SVMs) work?
Support Vector Machines work by finding the hyperplane that best separates data points of different classes. The optimal hyperplane is the one that maximizes the margin, which is the distance between the nearest points (support vectors) of different classes. Kernel methods can be used to implicitly map the data into a higher-dimensional space in which it becomes linearly separable.
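A brief sketch (scikit-learn assumed; the dataset and parameters are illustrative): a linear SVM struggles on data that is not linearly separable (concentric circles), while an RBF kernel handles it well:

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=0)

for kernel in ("linear", "rbf"):
    acc = cross_val_score(SVC(kernel=kernel, C=1.0), X, y, cv=5).mean()
    print(f"{kernel:6s} kernel CV accuracy: {acc:.3f}")
```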


