Hardware and Complexity
- Training cost grows with data size, model size, and sequence/image resolution.

Batch size
: how many samples per gradient step. Larger batches use more memory.

Epoch
: one full pass over training data.

- Typical accelerators: GPUs/TPUs; but conceptually all you need is the math we wrote.
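
To make the batch size and epoch definitions concrete, here is a minimal sketch in plain Python of how the two relate: the number of gradient steps in one epoch is the dataset size divided by the batch size, rounded up. The function names and the numbers (1000 samples, batch size 64, 3 epochs) are assumptions chosen only for illustration, not values from these notes.

```python
import math


def steps_per_epoch(num_samples: int, batch_size: int) -> int:
    """Gradient steps needed to see every sample once (last batch may be smaller)."""
    return math.ceil(num_samples / batch_size)


def iterate_batches(num_samples: int, batch_size: int):
    """Yield (start, end) index ranges that together cover one epoch."""
    for start in range(0, num_samples, batch_size):
        yield start, min(start + batch_size, num_samples)


# Hypothetical sizes, for illustration only.
num_samples, batch_size, num_epochs = 1000, 64, 3

total_steps = num_epochs * steps_per_epoch(num_samples, batch_size)
print(steps_per_epoch(num_samples, batch_size), total_steps)
```

With these assumed numbers, one epoch takes 16 steps (15 full batches of 64 and a final batch of 40), so 3 epochs take 48 steps. Doubling the batch size roughly halves the steps per epoch, but each step then holds about twice as many samples in memory.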