Hands-On Code: Scaled Dot-Product Attention
Below is a tiny, runnable demo of scaled dot-product attention for a single attention head, using only the Python standard library. The script:
- Computes raw attention scores from Q and K.
- Scales the scores and applies softmax to turn them into attention weights.
- Uses those weights to mix the values V, as in the formula below.
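
For reference, this is the standard scaled dot-product attention formula, where $d_k$ is the dimensionality of the keys:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
$$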
```python
# file: scaled_dot_attention.py
# Minimal scaled dot-product attention for one head.
# Only standard library. Run with `python scaled_dot_attention.py`.

import math
import random

def softmax(xs):
    # subtract max for numerical stability
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def matmul(a, b):
    # a: [n x d], b: [d x m] => [n x m]
    n = len(a)
    d = len(a[0])
    m = len(b[0])
    out = [[0.0]*m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0.0
            for k in range(d):
                s += a[i][k] * b[k][j]
            out[i][j] = s
    return out

def transpose(m):
    # [n x d] => [d x n]
    return [list(row) for row in zip(*m)]

def attention(Q, K, V):
    # scores = Q K^T / sqrt(d_k); weights = softmax(scores); output = weights V
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))
    scale = math.sqrt(d_k)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, V), weights

def main():
    random.seed(0)
    n, d_k, d_v = 4, 8, 8  # sequence length, key dim, value dim
    # Random queries, keys, and values for a single head.
    Q = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(n)]
    K = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(n)]
    V = [[random.gauss(0, 1) for _ in range(d_v)] for _ in range(n)]
    output, weights = attention(Q, K, V)
    print("attention weights (each row sums to 1):")
    for row in weights:
        print("  " + " ".join(f"{w:.3f}" for w in row))
    print("output (weighted mixes of the rows of V):")
    for row in output:
        print("  " + " ".join(f"{x:.3f}" for x in row))

main()
```
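
The division by √d_k keeps the dot products from growing with the key dimension; without it, large scores push the softmax toward near one-hot weights. Running the script prints the attention-weight matrix (each row sums to 1, up to rounding) followed by the output rows, each a weighted average of the rows of V.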

