Model Variations: What Matters Most
They tried lots of variants:
- Changing the number of attention heads (1, 4, 8, 16, 32), sketched below:
  - 8 heads worked best.
  - Too few heads lose expressiveness.
  - Too many heads also hurt a bit.
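
With `d_model` held fixed, varying the head count `h` changes the per-head dimension `d_k = d_model / h`: one head gets the full width but only a single attention pattern, while 32 heads each get just 16 dimensions. A minimal PyTorch sketch of this trade-off (not the paper's code):

```python
import torch
import torch.nn as nn

d_model = 512  # fixed total width, as in the base model

# Vary the head count while keeping total computation roughly constant:
# the per-head dimension d_k = d_model / h shrinks as h grows.
for h in (1, 4, 8, 16, 32):
    d_k = d_model // h
    mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=h, batch_first=True)
    x = torch.randn(2, 10, d_model)  # (batch, seq_len, d_model)
    out, _ = mha(x, x, x)            # self-attention
    print(f"h={h:2d}  d_k={d_k:3d}  output={tuple(out.shape)}")
```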
 
- Scaling up model width (`d_model`), depth (`N` layers), and feed-forward size (`d_ff`); the base and big configs are listed below:
  - Bigger models → better BLEU (unsurprising).
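
For concreteness, here are the base and big hyperparameters reported in the paper, collected in a small config sketch (the `TransformerConfig` name and field names are my own):

```python
from dataclasses import dataclass

@dataclass
class TransformerConfig:
    n_layers: int   # N: depth of the encoder and decoder stacks
    d_model: int    # model width
    d_ff: int       # feed-forward inner size
    n_heads: int    # attention heads
    p_drop: float   # dropout rate

# Hyperparameters reported for the base and big models in the paper.
base = TransformerConfig(n_layers=6, d_model=512,  d_ff=2048, n_heads=8,  p_drop=0.1)
big  = TransformerConfig(n_layers=6, d_model=1024, d_ff=4096, n_heads=16, p_drop=0.3)
```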
 
- Changing `dropout` (placement sketched below):
  - Removing dropout caused overfitting and hurt BLEU.
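
A sketch of where that dropout sits, following the paper's description (dropout on each sub-layer's output, before the residual add and layer norm); the wrapper class is my own naming, and setting `p_drop=0.0` reproduces the no-dropout ablation:

```python
import torch.nn as nn

class SublayerConnection(nn.Module):
    """Post-norm residual wrapper: LayerNorm(x + Dropout(sublayer(x)))."""
    def __init__(self, d_model: int, p_drop: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(p_drop)

    def forward(self, x, sublayer):
        # Dropout is applied to the sub-layer output before the residual add;
        # p_drop=0.0 gives the "no dropout" variant from the ablation.
        return self.norm(x + self.dropout(sublayer(x)))
```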
 
- Replacing sinusoidal positional encodings with learned positional embeddings (both sketched below):
  - Performance was basically the same.
  - They kept sinusoids because they might extrapolate to sequence lengths longer than those seen during training.
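
Both variants in a minimal PyTorch sketch (naming is mine): the sinusoidal table can be computed on the fly for any position, which is the extrapolation argument, while a learned table is just an `nn.Embedding` capped at the maximum length chosen at training time:

```python
import math
import torch
import torch.nn as nn

def sinusoidal_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Fixed sinusoids from the paper:
    PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...)."""
    pos = torch.arange(max_len).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe  # can be evaluated for positions never seen in training

# The learned alternative: a plain embedding table over positions,
# fixed at 512 rows here, so it cannot extrapolate past position 512.
learned_pe = nn.Embedding(512, 512)  # (max positions, d_model)
```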
 

