One Pager Cheat Sheet

  • Deep learning learns complex input→output mappings by stacking layers of simple units into a neural network that is a composable function (e.g., output = layer_L(...layer_2(layer_1(input)))), where many layers provide depth and the model’s numeric knobs—the weights and biases—are tuned to minimize a loss.
  • Machine learning (ML) learns patterns from data, Representation learning learns useful features automatically, and Deep learning (DL) is representation learning with many layers of differentiable transformations that excels on large datasets, high-dimensional inputs (images, audio, text), and when end-to-end learning is required.
  • A perceptron is a mathematical model of a biological neuron that originally produced a binary output via y = step(w·x + b), while modern neurons compute z = w·x + b then a = φ(z) with an activation function (e.g., ReLU, sigmoid, tanh), and stacking layers of such neurons produces neural networks (Sketch 1 after this list shows both variants).
  • Because the composition of linear maps is itself a linear map, stacking layers that compute z = W x + b (with identity activations) simply collapses to a single equivalent layer with W_eq = W^(L) ... W^(1) and a combined bias, so without a non-linear activation (e.g., ReLU, sigmoid, tanh) depth does not increase a network's representational power and cannot produce non-linear decision boundaries (Sketch 2 after this list demonstrates the collapse numerically).
  • Neural networks learn Weights (W) and biases (b), the parameters, apply an activation function φ (e.g., ReLU(x) = max(0,x)) to add non-linearity, measure error with a Loss (e.g., MSE or cross-entropy), compute the Gradient (partial derivatives) as the direction in which to improve the parameters, and update with Gradient descent using the rule θ ← θ − η ∇θ L, where the learning rate η sets the step size.
  • A tiny implementation of a single neuron with ReLU activation, trained with gradient descent to learn y ≈ 2*x + 1 on synthetic data, using the standard library only (Sketch 3 after this list gives a comparable version).
  • ReLU f(x) = max(0,x) is differentiable for x ≠ 0 but not at 0, where the left-hand derivative is 0 and the right-hand derivative is 1; it is still continuous at 0, has subgradients in [0,1] there, and is therefore differentiable almost everywhere, so gradient-based training remains practical (Sketch 4 after this list checks the one-sided derivatives numerically).
  • The core training loop—Forward (compute predictions), Loss (measure error), Backward (compute gradients via backpropagation), and Update (adjust weights using gradient descent or other optimizers)—repeats many times to reduce error and improve the model.
  • The steps must occur in order: forward pass to compute y_hat and cache activations, then compute loss to get a scalar L(y_hat, y), then backpropagate gradients to obtain ∂L/∂θ, and finally update parameters with an optimizer (e.g., SGD), because each step depends on the previous step's outputs.
  • This is a minimal implementation of a Two-Layer Network: a 2-layer MLP performing binary classification on a toy dataset using the standard library only; Sketch 8 after this list shows the same pattern applied to XOR.
  • The missing word is softmax, a mapping from raw logits via p_i = exp(z_i)/sum_j exp(z_j) that produces non-negative outputs which sum to 1 (forming a proper probability distribution), preserves ordering (so the argmax is unchanged), is invariant to additive constants (enabling numerical stability by subtracting max), supports temperature scaling to control peakiness (→ one-hot as temp→0, uniform as temp→∞), reduces to the sigmoid for two classes, and has Jacobian ∂p_i/∂z_j = p_i(δ_ij - p_j) which with cross-entropy and a one-hot target yields the simple gradient p - y.
  • Multiclass heads compute a vector of logits z ∈ ℝ^K for K classes, convert them to probabilities with softmax softmax(z)_k = e^{z_k} / Σ_j e^{z_j}, and optimize using cross-entropy loss L = − Σ_k y_k log(softmax(z)_k) where y is a one-hot label.
  • The pipeline Linear → softmax → cross-entropy is standard because the final Linear produces unconstrained real-valued logits that softmax turns into a probability distribution, cross-entropy (the negative log-likelihood) trains those probabilities with simple, stable gradients (∂L/∂z = p − y) and a clear probabilistic interpretation with numerically stable fused implementations, while for multi-label problems one should instead use sigmoid + binary cross-entropy (Sketch 5 after this list verifies the p − y gradient numerically).
  • Overfitting (low training loss, high validation loss) versus Underfitting (high training and validation loss): Regularization aims to improve generalization using techniques like L2 (weight decay), Early stopping, Dropout, and Data augmentation.
  • Add L2 Weight Decay: illustrates adding an L2 penalty to the loss inside the training loop (see Sketch 6 after this list).
  • The statement is true: unlike RNN/LSTM models that use recurrence, the transformer uses self-attention, computing queries, keys, and values and combining them as softmax(Q K^T / sqrt(d_k)) V, so each layer yields direct, learnable, parallel connections between all positions (thereby eliminating recurrence, providing a short path length for dependencies, and enabling parallel processing across sequence positions), while practical additions like positional encoding, multi-head attention, and masked attention supply order information, richer relations, and autoregressive causality, at the cost of O(n^2) memory and compute in sequence length (Sketch 7 after this list computes one attention step).
  • For problems with a tiny dataset and easily engineered features, try simpler ML (e.g., linear or tree-based models); when you need perfect interpretability or strict guarantees, DL is hard to justify; and with low compute or tight latency constraints, a smaller model is preferable. Start simple and scale up when the problem or data demands it.
  • This provides a minimal 2-layer MLP that implements the XOR function using the standard library only (Sketch 8 after this list reconstructs a trained version).
  • Training cost grows with data size, model size, and sequence/image resolution; Batch size (samples per gradient step) and Epoch (one full pass over data) affect memory and training dynamics, and while typical accelerators are GPUs/TPUs, conceptually you only need the underlying math.
  • Neural nets learn what they see, so to mitigate biased training data you should perform dataset curation and evaluation on diverse slices, use explainability tools such as feature attributions and probes to audit behavior, and adopt safety measures like rate limits, human review, and domain constraints to avoid harmful outputs.
  • Run a sanity check by confirming the model can overfit a tiny subset (e.g., 10 samples); if the loss is not decreasing, lower the learning rate and inspect gradient signs and shapes; if the loss explodes, clip gradients, reduce the learning rate, and check for NaNs; and if validation performance is much worse than training, add regularization or gather more data (Sketch 9 after this list shows global-norm gradient clipping).
  • Because Overfitting is primarily a high-variance problem, adding an L2 penalty (a weight decay term like lambda * ||w||^2 that shrinks weights) and using early stopping (monitoring val_loss and halting after patience) both primarily reduce variance—the former by constraining parameter magnitudes and the latter by limiting optimization time—and together act complementarily to improve generalization.
  • The correct fill-in is epoch: a single pass through the entire training dataset (aka a pass), which differs from a batch/mini-batch and an iteration (one iteration updates parameters using one batch); because the number of epochs controls how often the model sees the full data, training for too many epochs can cause overfitting (mitigate with a validation set, early stopping, fewer epochs, or regularization). Sketch 10 after this list works through the arithmetic.
  • The composition of linear layers of the form f(x) = W x + b is itself a single linear transformation, e.g. f2(f1(x)) = (W2 W1) x + (W2 b1 + b2), so stacking layers without non-linear activations adds no expressive power, though hidden dimensions can impose a rank constraint on the resulting matrix (again, see Sketch 2).
  • You’ve learned what deep learning is and why it works, implemented tiny nets from scratch, and are ready to port them to a proper framework—now knowing exactly what the framework is doing under the hood.
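
Code Sketches

Several bullets above point at code from the lesson. The sketches below are minimal stand-ins, not the lesson's exact code: they use the Python standard library only, and all hyperparameters, seeds, and sample values are illustrative assumptions.

Sketch 1. A classic perceptron beside a modern neuron. Both compute z = w·x + b; the perceptron applies a hard step, while a modern neuron applies an activation such as ReLU or sigmoid. The input, weights, and bias are made-up values.

    import math

    def step(z):
        return 1 if z > 0 else 0

    def relu(z):
        return max(0.0, z)

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    def neuron(x, w, b, phi):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b  # z = w.x + b
        return phi(z)                                 # a = phi(z)

    x, w, b = [0.5, -1.0], [2.0, 1.0], 0.25
    print(neuron(x, w, b, step))     # classic perceptron: binary output
    print(neuron(x, w, b, relu))     # modern neuron with ReLU
    print(neuron(x, w, b, sigmoid))  # modern neuron with sigmoid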
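Sketch 2. Composing two linear layers and applying the single collapsed layer W_eq = W2 W1, b_eq = W2 b1 + b2 produce identical outputs, which is the collapse claimed in the linearity bullets. The 2x2 matrices and the input are arbitrary.

    def matvec(W, v):
        return [sum(Wij * vj for Wij, vj in zip(row, v)) for row in W]

    def matmul(A, B):
        return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
                 for j in range(len(B[0]))] for i in range(len(A))]

    def vadd(a, b):
        return [ai + bi for ai, bi in zip(a, b)]

    W1, b1 = [[1.0, 2.0], [0.0, 1.0]], [0.5, -0.5]   # arbitrary layer 1
    W2, b2 = [[2.0, 0.0], [1.0, 1.0]], [0.0, 1.0]    # arbitrary layer 2
    x = [3.0, -1.0]

    stacked = vadd(matvec(W2, vadd(matvec(W1, x), b1)), b2)  # f2(f1(x))
    W_eq = matmul(W2, W1)                    # W_eq = W2 W1
    b_eq = vadd(matvec(W2, b1), b2)          # b_eq = W2 b1 + b2
    collapsed = vadd(matvec(W_eq, x), b_eq)  # one equivalent linear layer
    print(stacked, collapsed)                # identical: [3.0, 1.0] twice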
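Sketch 3. A single ReLU neuron fit to y ≈ 2*x + 1 with per-sample gradient descent, mirroring the Forward → Loss → Backward → Update loop described above. The learning rate, epoch count, and data range are assumed values.

    import random
    random.seed(0)

    # synthetic data: y = 2x + 1 with x drawn from [0, 2]
    data = [(x, 2 * x + 1) for x in [random.uniform(0, 2) for _ in range(50)]]
    w, b, lr = random.uniform(0, 1), 0.0, 0.05

    for epoch in range(200):
        total = 0.0
        for x, y in data:
            z = w * x + b                           # forward: pre-activation
            a = max(0.0, z)                         # forward: ReLU
            err = a - y
            total += err * err                      # loss: squared error
            dz = 2 * err * (1.0 if z > 0 else 0.0)  # backward: through ReLU
            w -= lr * dz * x                        # update: dL/dw = dz * x
            b -= lr * dz                            # update: dL/db = dz
        if epoch % 50 == 0:
            print(f"epoch {epoch}: mse {total / len(data):.4f}")

    print(f"learned w={w:.2f}, b={b:.2f} (target w=2, b=1)")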
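Sketch 4. One-sided difference quotients of ReLU at 0 disagree (0 from the left, 1 from the right), which is why no single derivative exists there even though the function is continuous.

    def relu(x):
        return max(0.0, x)

    h = 1e-6
    right = (relu(0 + h) - relu(0)) / h   # -> 1.0
    left = (relu(0) - relu(0 - h)) / h    # -> 0.0
    print(right, left)  # any value in [0, 1] is a subgradient at 0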
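Sketch 5. Numerically stable softmax (subtracting the max), cross-entropy against a one-hot target, and a finite-difference check that ∂L/∂z = p − y. The logits are arbitrary.

    import math

    def softmax(z):
        m = max(z)                     # subtract max for numerical stability
        exps = [math.exp(zi - m) for zi in z]
        s = sum(exps)
        return [e / s for e in exps]

    def cross_entropy(z, target):
        return -math.log(softmax(z)[target])  # one-hot target: -log p_target

    z, target = [2.0, 1.0, -1.0], 0           # arbitrary logits, class 0
    p = softmax(z)
    y = [1.0 if i == target else 0.0 for i in range(len(z))]
    analytic = [pi - yi for pi, yi in zip(p, y)]   # the simple p - y gradient

    h = 1e-6
    numeric = []
    for i in range(len(z)):
        z_plus = z[:]
        z_plus[i] += h
        numeric.append((cross_entropy(z_plus, target) - cross_entropy(z, target)) / h)

    print([round(g, 4) for g in analytic])
    print([round(g, 4) for g in numeric])     # agrees with p - y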
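Sketch 6. Sketch 3's loop with an L2 penalty lam * w**2 added to the loss; its gradient contributes 2 * lam * w to the weight update, which shrinks the weight every step. lam = 0.01 is an assumed strength, and the bias is left undecayed, a common convention.

    import random
    random.seed(0)

    data = [(x, 2 * x + 1) for x in [random.uniform(0, 2) for _ in range(50)]]
    w, b, lr, lam = random.uniform(0, 1), 0.0, 0.05, 0.01

    for epoch in range(200):
        for x, y in data:
            z = w * x + b
            a = max(0.0, z)
            dz = 2 * (a - y) * (1.0 if z > 0 else 0.0)
            # loss is err**2 + lam * w**2, so the weight gradient gains 2*lam*w
            w -= lr * (dz * x + 2 * lam * w)
            b -= lr * dz                # bias left undecayed (common practice)

    print(f"w={w:.2f}, b={b:.2f} (slightly shrunk relative to Sketch 3)")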
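Sketch 7. Scaled dot-product attention softmax(Q K^T / sqrt(d_k)) V on tiny hand-written matrices: every query position attends to every key position in one step, with no recurrence. Q, K, and V are arbitrary.

    import math

    def softmax(row):
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        s = sum(exps)
        return [e / s for e in exps]

    def attention(Q, K, V):
        d_k = len(K[0])
        scores = [[sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                   for k in K] for q in Q]          # Q K^T / sqrt(d_k)
        weights = [softmax(row) for row in scores]  # each row sums to 1
        return [[sum(w * v[j] for w, v in zip(row, V))
                 for j in range(len(V[0]))] for row in weights]

    Q = [[1.0, 0.0], [0.0, 1.0]]   # 2 positions, d_k = 2 (arbitrary values)
    K = [[1.0, 0.0], [0.0, 1.0]]
    V = [[1.0, 2.0], [3.0, 4.0]]
    print(attention(Q, K, V))      # each output mixes information from all positions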
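Sketch 8. A 2-layer MLP (2 → 4 → 1, tanh hidden layer, sigmoid output) trained by backpropagation to fit XOR, standard library only. The hidden width, learning rate, epoch count, and seed are assumptions; other settings can work too.

    import math, random
    random.seed(1)

    X = [[0, 0], [0, 1], [1, 0], [1, 1]]
    Y = [0, 1, 1, 0]
    H = 4                                            # hidden units (assumed)
    W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(H)]
    b1 = [0.0] * H
    W2 = [random.uniform(-1, 1) for _ in range(H)]
    b2 = 0.0
    lr = 0.5

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    def forward(x):
        h = [math.tanh(sum(w * xi for w, xi in zip(W1[j], x)) + b1[j])
             for j in range(H)]
        return h, sigmoid(sum(w * hj for w, hj in zip(W2, h)) + b2)

    for _ in range(5000):
        for x, y in zip(X, Y):
            h, p = forward(x)
            dz2 = p - y                              # BCE + sigmoid: dL/dlogit
            dh = [W2[j] * dz2 * (1 - h[j] ** 2) for j in range(H)]  # via tanh'
            for j in range(H):
                W2[j] -= lr * dz2 * h[j]
                for i in range(2):
                    W1[j][i] -= lr * dh[j] * x[i]
                b1[j] -= lr * dh[j]
            b2 -= lr * dz2

    for x, y in zip(X, Y):
        print(x, y, round(forward(x)[1], 3))         # probabilities near targets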
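Sketch 9. Global-norm gradient clipping, one of the fixes suggested above for an exploding loss. The helper name clip_by_global_norm and the max_norm threshold are this sketch's own choices, not a library API.

    import math

    def clip_by_global_norm(grads, max_norm=1.0):   # hypothetical helper
        norm = math.sqrt(sum(g * g for g in grads))
        if norm > max_norm:
            scale = max_norm / norm
            return [g * scale for g in grads]
        return grads

    print(clip_by_global_norm([3.0, 4.0]))  # norm 5 -> rescaled to norm 1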
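Sketch 10. How epochs, batches, and iterations relate: one iteration consumes one batch, and one epoch takes ceil(n_samples / batch_size) iterations. The dataset and batch sizes are made-up numbers.

    import math

    n_samples, batch_size, n_epochs = 10_000, 32, 5      # made-up sizes
    iters_per_epoch = math.ceil(n_samples / batch_size)  # 313 iterations/epoch
    total_updates = iters_per_epoch * n_epochs           # 1565 parameter updates
    print(iters_per_epoch, total_updates)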