Inference: How It Generates Translations
At inference:
- The decoder generates tokens one at a time.
- They use beam search (beam size ~4 for translation), which keeps multiple candidate sequences in parallel and chooses the best-scoring one.
- They apply a length penalty so the model doesn’t unfairly prefer too-short outputs.
- They also cap the maximum output length at input_length + 50, but stop early if the model predicts an end-of-sentence token; see the sketch after this list.
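
To make these steps concrete, here is a minimal sketch of beam search with a length penalty and the length cap described above. It assumes a hypothetical `decode_step(tokens)` callable that returns next-token log-probabilities for a given prefix, and uses the GNMT-style length-penalty formula with α = 0.6 as one common choice; the real implementation batches beams and runs on the GPU.

```python
import heapq

def beam_search(decode_step, bos_id, eos_id, input_length,
                beam_size=4, alpha=0.6):
    """Sketch of beam search with a length penalty.

    `decode_step(tokens)` is a hypothetical callable returning a list of
    next-token log-probabilities (one per vocabulary entry) for the prefix.
    """
    max_len = input_length + 50                     # cap on output length
    beams = [([bos_id], 0.0)]                       # (tokens, summed log-prob)
    finished = []

    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            log_probs = decode_step(tokens)
            # Expand each beam with its top `beam_size` next tokens
            for tok, lp in heapq.nlargest(beam_size, enumerate(log_probs),
                                          key=lambda x: x[1]):
                candidates.append((tokens + [tok], score + lp))
        # Keep only the best `beam_size` candidates overall
        beams = []
        for tokens, score in sorted(candidates, key=lambda c: c[1],
                                    reverse=True)[:beam_size]:
            if tokens[-1] == eos_id:                # this beam stops early on EOS
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:                               # every beam has finished
            break

    finished.extend(beams)                          # include any unfinished beams

    # GNMT-style length penalty so longer outputs aren't unfairly punished
    def penalized(tokens, score):
        return score / (((5 + len(tokens)) / 6) ** alpha)

    best_tokens, _ = max(finished, key=lambda c: penalized(*c))
    return best_tokens
```

The key design point is the final normalization: without dividing by the length penalty, summed log-probabilities always favor shorter hypotheses, since every added token makes the score more negative.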

