Residual Connections + LayerNorm
Every sub-layer (attention or feed-forward) is wrapped the same way:
- Take the input x.
- Run the sub-layer to get Sublayer(x).
- Add them: x + Sublayer(x) (this is a residual connection).
- Apply layer normalization.
Why:
- Residuals help gradients flow in deep networks.
- LayerNorm stabilizes training by normalizing across the hidden dimension.
This structure repeats in every encoder and decoder layer.
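Below is a minimal PyTorch sketch of this post-LN wrapping, i.e. LayerNorm(x + Sublayer(x)). The class name, the dimensions, and the dropout on the sub-layer output are illustrative assumptions, not a reference implementation.

```python
import torch
import torch.nn as nn

class ResidualPostLN(nn.Module):
    """Residual connection followed by LayerNorm (post-LN):
        output = LayerNorm(x + Sublayer(x))
    `sublayer` is any module mapping (batch, seq, d_model) -> same shape,
    e.g. multi-head attention or a position-wise feed-forward network."""
    def __init__(self, d_model: int, sublayer: nn.Module, dropout: float = 0.1):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)   # normalizes across the hidden dimension
        self.dropout = nn.Dropout(dropout)  # dropout on the sub-layer output (assumption)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # 1. run the sub-layer, 2. add the input back (residual), 3. normalize
        return self.norm(x + self.dropout(self.sublayer(x)))

# Usage: wrap a feed-forward sub-layer (d_model and d_ff are illustrative sizes).
d_model, d_ff = 512, 2048
ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
block = ResidualPostLN(d_model, ffn)
out = block(torch.randn(2, 10, d_model))  # (batch=2, seq=10, d_model)
```

The same wrapper can hold an attention sub-layer, so each encoder or decoder layer is just a stack of these blocks.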

