Masked Self-Attention in the Decoder
The decoder can’t look into the future when predicting the next word. To enforce this, the Transformer applies a mask inside the decoder’s self-attention computation.
Mechanically:
- Any attention score from position i to a future position j > i is set to −∞ before the softmax.
- After the softmax, those future positions get probability ~0.
- So the prediction of token y_i depends only on tokens at positions < i.
This keeps generation left-to-right, like a language model.
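
To make the mask concrete, here is a minimal NumPy sketch of masked self-attention for a single head. The function name, shapes, and the random example inputs are illustrative assumptions, not part of the original text.

```python
import numpy as np

def masked_self_attention(Q, K, V):
    """Scaled dot-product attention with a causal (look-ahead) mask.

    Q, K: arrays of shape (seq_len, d_k); V: (seq_len, d_v).
    """
    seq_len, d_k = Q.shape

    # Raw attention scores: scores[i, j] = how strongly position i
    # attends to position j.
    scores = Q @ K.T / np.sqrt(d_k)

    # Causal mask: True where j > i, i.e. future positions.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

    # Set scores at future positions to -inf before the softmax.
    scores = np.where(future, -np.inf, scores)

    # Softmax over each row; masked positions get probability ~0.
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)

    # Each output row i mixes value vectors at positions <= i only.
    return weights @ V

# Tiny usage example with random inputs (self-attention: Q = K = V).
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))           # 5 tokens, model dimension 8
out = masked_self_attention(x, x, x)
print(out.shape)                      # (5, 8)
```

Because the mask zeroes out attention to future positions, the whole target sequence can still be processed in parallel during training while behaving exactly like left-to-right generation.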

