# ESM-2-style protein language model (9.6M params) trained on Swiss-Prot on a Cambricon MLU370
The architecture mirrors [ESM-2](https://facebookresearch.github.io/esm/). Training setup:
| Parameter | Value |
| --- | --- |
| Data | Swiss-Prot (456,404 train / 22,821 val sequences) |
| Device | 1× Cambricon MLU370 |
| Batch | 32 sequences × 512 tokens |
| Speed | ~30K tokens/s |
| Epochs | 5 (~2 h/epoch, ~10 h total) |
| Optimizer | AdamW (lr=1e-4, warmup=1000 steps, cosine decay) |
| Final val loss | 0.4170 |
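
The throughput and epoch-time rows are mutually consistent: assuming full 512-token contexts, one epoch is 456,404 × 512 ≈ 234M tokens, and 234M / 30K tokens/s ≈ 2.2 h. The optimizer row also fully specifies the learning-rate schedule, so here is a minimal PyTorch sketch of it; the lr, warmup length, and cosine decay come from the table, while the linear warmup shape, the function name, and the `total_steps` derivation are assumptions (stock `torch` is shown, assuming the Cambricon backend exposes the standard optimizer API):

```python
import math

import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR


def build_optimizer(model: torch.nn.Module,
                    total_steps: int,
                    lr: float = 1e-4,
                    warmup_steps: int = 1000):
    """AdamW with warmup then cosine decay, matching the table above.

    total_steps should be steps_per_epoch * num_epochs; with batch 32
    over 456,404 sequences for 5 epochs that is roughly 71,000 updates.
    """
    optimizer = AdamW(model.parameters(), lr=lr)

    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            # Linear ramp from 0 up to the peak lr (warmup shape assumed).
            return step / max(1, warmup_steps)
        # Cosine decay from the peak lr toward 0 over the remaining steps.
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

    scheduler = LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```

Call `scheduler.step()` once after every `optimizer.step()` so the warmup counter advances per update, not per epoch.
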
Per-epoch validation loss:

| Epoch | Val loss | Notes |
| --- | --- | --- |
| 1 | 0.4195 | checkpoint ~38 MB (EMA) |
| 2 | 0.4235 | — |
| 3 | 0.4182 | best so far |
| 4 | 0.4185 | — |
| 5 | 0.4179 | final epoch |
| Final | 0.4170 | `checkpoint_final_best.pt` |
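
The epoch-1 note mentions EMA weights, and the ~38 MB checkpoint size is consistent with 9.6M float32 parameters (9.6e6 × 4 bytes ≈ 38 MB). A minimal sketch of the usual shadow-weight scheme follows; the class name and the 0.999 decay are assumptions, not values from this run:

```python
import copy

import torch


class EMAWeights:
    """Shadow copy of model weights kept as an exponential moving average.

    Generic sketch of the EMA technique referenced in the epoch-1 note;
    the decay value is assumed, and whether the final checkpoint holds
    EMA or live weights is not stated in the table.
    """

    def __init__(self, model: torch.nn.Module, decay: float = 0.999):
        self.decay = decay
        # Frozen deep copy holds the averaged weights.
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module) -> None:
        # shadow <- decay * shadow + (1 - decay) * live weights
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.mul_(self.decay).add_(p, alpha=1.0 - self.decay)

    def save(self, path: str) -> None:
        # ~38 MB for 9.6M float32 params matches the checkpoint size above.
        torch.save(self.shadow.state_dict(), path)
```

`update()` is called once per optimizer step; saving the shadow `state_dict` after each epoch would produce per-epoch EMA checkpoints like the ones listed.
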