Executable Agent Skill

Train a Protein Language Model from Scratch

An end-to-end skill that trains a 9.6M-parameter ESM-2 "mini" architecture on Swiss-Prot — from raw sequences to deployed model weights, zero-shot fitness evaluation, and GitHub release.

9.6M
Parameters
0.417
Val Loss
0.200
GFP Spearman rho
~11h
Training Time
01 — The Model

ESM-2 "mini" Architecture

A compact Transformer encoder trained with masked language modeling (MLM) on protein sequences. The skill mirrors the ESM-2 "mini" configuration — 12 layers, hidden dimension 256, 8 attention heads — scaled down for single-card training while retaining meaningful zero-shot capabilities.

Layers
12
Transformer encoder blocks
Hidden Dim
256
Per-token embedding dimension
Attention Heads
8
Multi-head self-attention
FFN Dim
1024
Feed-forward hidden size
Vocab Size
31
20 AA + special tokens
Max Length
512
Sequence token limit
Tokenizer — 31-token Amino Acid Vocabulary
A R N D C Q E G H I L K M F P S T W Y V [MASK] [PAD] [CLS] [SEP] [UNK]
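The mapping above can be sketched in a few lines. This covers only the 25 tokens listed (the remaining 6 of the 31 are not shown on this page), and the ID layout below is hypothetical — the real assignments live in the repository's tokenizer code:

```python
# Sketch of the amino-acid tokenizer. The specials-first ID layout is an
# assumption, not the repository's actual assignment.
AMINO_ACIDS = list("ARNDCQEGHILKMFPSTWYV")            # 20 standard residues
SPECIALS = ["[MASK]", "[PAD]", "[CLS]", "[SEP]", "[UNK]"]

token_to_id = {tok: i for i, tok in enumerate(SPECIALS + AMINO_ACIDS)}

def encode(seq: str, max_len: int = 512) -> list[int]:
    """Map a protein sequence to IDs: [CLS] residues... [SEP],
    truncated at the C-terminus to fit max_len."""
    ids = [token_to_id.get(aa, token_to_id["[UNK]"]) for aa in seq[:max_len - 2]]
    return [token_to_id["[CLS]"]] + ids + [token_to_id["[SEP]"]]

print(encode("MKT"))   # [2, 17, 16, 21, 3] under this hypothetical layout
```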
02 — Training Pipeline

End-to-End from Raw Sequences

The skill automates every step: data download from UniProt, tokenizer construction, MLM training with three-tier checkpointing, and model upload to GitHub Release.

1
Step 01
Data Download & Dedup
Downloads Swiss-Prot curated protein sequences from UniProt REST API. Removes duplicates, splits 95/5 train/val. Sequences > 512 tokens truncated at C-terminus.
python scripts/download_data.py
Output: data/swissprot_train.fasta (433,583 seqs) · data/swissprot_val.fasta (22,821 seqs)
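The dedup-and-split logic of Step 01 can be sketched as follows (function and variable names are illustrative, not the script's actual internals):

```python
import random

def dedup_and_split(records, val_frac=0.05, seed=0):
    """Deduplicate (id, sequence) records by sequence, then shuffle and
    split ~95/5 into train/val, mirroring Step 01's description."""
    seen, unique = set(), []
    for rec_id, seq in records:
        if seq not in seen:          # keep first occurrence of each sequence
            seen.add(seq)
            unique.append((rec_id, seq))
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_val = int(len(unique) * val_frac)
    return unique[n_val:], unique[:n_val]   # train, val

records = [("a", "MKT"), ("b", "MKV"), ("c", "MKT"), ("d", "ACD")]
train, val = dedup_and_split(records)      # "c" is dropped as a duplicate
```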
2
Step 02
Masked Language Model Training
Trains ESM2-small on GPU (CUDA) or Cambricon MLU370. 15% uniform random masking, AdamW optimizer, linear warmup + cosine decay, three-tier checkpointing (periodic / epoch / best).
python train.py \
  --data data/swissprot_train.fasta \
  --val_data data/swissprot_val.fasta \
  --out_dir output \
  --epochs 5 \
  --batch_size 32 \
  --device cuda
67,500 steps · ~11 hours on single MLU370 · ~30K tokens/sec throughput
3
Step 03
Zero-Shot Fitness Evaluation
Evaluates the trained model on GFP (Green Fluorescent Protein) mutation prediction using MLM logit difference. Positive Delta = potentially beneficial mutation.
python scripts/evaluate_fitness.py \
  --checkpoint output/checkpoint_final_best.pt
Score(mutant) - Score(WT) via MLM logits
4
Step 04
Upload to GitHub Release
Upload the best checkpoint to a GitHub Release as a binary asset using the provided upload script.
python scripts/upload_to_github.py \
  --token ghp_xxxx \
  --repo junior1p/ESM2-small \
  --tag v1.0.0
Or use: gh release create v1.0.0 && gh release upload v1.0.0 output/checkpoint_final_best.pt

Optimizer

  • AdamW (lr=1e-4, betas=(0.9, 0.999), eps=1e-8)
  • Weight decay: 0.01
  • No bias correction warmup

LR Schedule

  • Linear warmup: 1,000 steps
  • Cosine decay to ~1e-9
  • Total: 67,500 steps (~5 epochs)
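The schedule above — linear warmup over 1,000 steps, then cosine decay toward ~1e-9 by step 67,500 — can be written as a single function. This is a sketch of the stated schedule, not the repository's scheduler code:

```python
import math

def lr_at(step, base_lr=1e-4, warmup=1_000, total=67_500, floor=1e-9):
    """Linear warmup to base_lr, then cosine decay toward floor."""
    if step < warmup:
        return base_lr * step / warmup               # linear ramp
    progress = (step - warmup) / (total - warmup)    # 0 -> 1 after warmup
    return floor + 0.5 * (base_lr - floor) * (1 + math.cos(math.pi * progress))

print(lr_at(1_000))   # peak LR: 1e-4
```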

Precision

  • Weights: FP32
  • Forward/backward: BF16 (MLU370)
  • CUDA: TF32 on Ampere+

Checkpointing

  • Periodic: every 2,000 steps
  • Epoch: end of each epoch
  • Best: when val loss improves
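The three-tier policy above can be sketched as a small decision function. This is an illustration of the described behavior (assuming "best" is checked when validation runs at epoch end); the real train.py may differ in detail:

```python
def checkpoint_actions(step, steps_per_epoch=13_500, period=2_000,
                       val_loss=None, best_val=float("inf")):
    """Return which checkpoint tiers fire at this step:
    periodic (every `period` steps), epoch (epoch boundary),
    best (val loss improved at an epoch boundary)."""
    actions = []
    if step % period == 0:
        actions.append("periodic")
    end_of_epoch = step % steps_per_epoch == 0
    if end_of_epoch:
        actions.append("epoch")
    if end_of_epoch and val_loss is not None and val_loss < best_val:
        actions.append("best")
    return actions
```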

Checkpoint Contents

  • model_state_dict + optimizer_state_dict
  • lr_scheduler_state_dict
  • rng_state (torch/cuda/numpy)
  • train_loss, val_loss, config

Masking Strategy

  • 15% uniform random masking
  • Replaced with [MASK] token
  • No neighboring-token co-prediction
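The masking step above amounts to an independent coin flip per non-special token. A minimal sketch, using the common convention of -100 labels for positions the loss ignores (the repository's implementation may differ in detail):

```python
import random

def mask_tokens(ids, mask_id, special_ids, p=0.15, seed=0):
    """Replace each non-special token with [MASK] with probability p.
    Returns (masked_ids, labels); labels is -100 except at masked
    positions, where it holds the original token to predict."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in ids:
        if tok not in special_ids and rng.random() < p:
            masked.append(mask_id)
            labels.append(tok)        # predict the original token here
        else:
            masked.append(tok)
            labels.append(-100)       # ignored by the MLM loss
    return masked, labels
```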
03 — Results

Training Outcomes

Trained on a single MLU370 accelerator for 67,500 steps (~11 hours). The model achieves competitive zero-shot fitness prediction on GFP mutations despite its compact size.

0.417
Validation Loss
~1.1B tokens seen
0.200
GFP Spearman rho
Zero-shot MLM logits
67,500
Total Steps
~13,500 steps/epoch
~30K
Tokens/sec
Single card throughput
Checkpoint Output
output/
├── config.json                    # training hyperparameters
├── checkpoint_epoch1.pt           # epoch snapshots
├── checkpoint_epoch1_best.pt
├── checkpoint_epoch2.pt
├── checkpoint_epoch2_best.pt
├── checkpoint_epoch3.pt
├── checkpoint_epoch3_best.pt
├── checkpoint_epoch4.pt
├── checkpoint_epoch4_best.pt
├── checkpoint_epoch5.pt
├── checkpoint_final.pt            # final epoch
├── checkpoint_final_best.pt       # best val loss ← USE THIS
├── checkpoint_step2000.pt         # periodic (every 2000 steps)
├── checkpoint_step4000.pt
└── ...
Resume from any checkpoint: python train.py --resume output/checkpoint_step20000.pt --data data/swissprot_train.fasta --val_data data/swissprot_val.fasta
GFP Mutation Prediction Examples
| Mutation | Type | Expected Delta | Interpretation |
|---|---|---|---|
| K7V | Neutral | ~0 | Conservative substitution, minimal structural impact |
| K7I | Neutral | Negative | Hydrophobic substitution at surface position |
| G66Y | Brighter | Large negative | Unexpected direction — aromatic at small residue site |
| G66H | Dimer | Large negative | Histidine at dimer interface, disrupts packing |
Evaluation method: Encode WT and mutant sequences separately with MLM head. Compute per-position logit difference. Sum over mutated positions. Positive Delta = beneficial mutation.
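The scoring rule above reduces to a log-softmax difference summed over mutated positions. A simplified sketch with plain lists in place of model tensors (the actual evaluation script operates on real MLM outputs):

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    z = math.log(sum(math.exp(x - m) for x in logits)) + m
    return [x - z for x in logits]

def mutation_score(wt_logits, mut_logits, positions, wt_ids, mut_ids):
    """Delta = sum over mutated positions of
    log p(mutant residue | mutant pass) - log p(WT residue | WT pass)."""
    delta = 0.0
    for pos, wt_id, mut_id in zip(positions, wt_ids, mut_ids):
        delta += log_softmax(mut_logits[pos])[mut_id] \
               - log_softmax(wt_logits[pos])[wt_id]
    return delta
```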
04 — Important Notes

Known Pitfalls

Resume requires --data re-passed
The data loader is not saved inside checkpoints. Always re-pass --data and --val_data when resuming: python train.py --resume output/checkpoint_*.pt --data data/swissprot_train.fasta --val_data data/swissprot_val.fasta
Sequences > 512 tokens are truncated
Truncation happens at the C-terminus. Functional domains near the C-end may be lost for long proteins.
Val loss may exceed train loss
This is normal for protein MLM: validation sequences are entirely unseen proteins, so the gap reflects genuine generalization difficulty rather than noise.
MLU370 BF16 fallback
If CNNL operators fail on BF16 forward pass, the skill falls back to FP32 automatically.
Disk space: ~5GB for full run
Each checkpoint is ~115MB (FP32 model + optimizer state). Budget accordingly for 2000-step periodic saves.

05 — Full SKILL.md Content

The complete executable skill file used by AI agents. Reproduces the full training pipeline from raw data to deployed model.

---
name: train-esm2-small
description: End-to-end training of ESM2-small (9.6M-parameter protein language model) on Swiss-Prot — data download, tokenization, training, checkpointing, evaluation, and model upload to GitHub. Works on GPU (CUDA) and Cambricon MLU370.
version: 1.0.0
author: Max
license: MIT
dependencies: [torch>=2.0, tqdm, requests]
metadata:
  hermes:
    tags: [protein language model, ESM-2, masked language modeling, MLU370, protein engineering, PyTorch, MLM]
    repo: https://github.com/junior1p/ESM2-small
---

# Train ESM2-small: Protein Language Model from Scratch

Train a compact 9.6M-parameter ESM-2 architecture on Swiss-Prot protein sequences end-to-end.

## When to Use This Skill

- Training a protein language model from scratch
- Evaluating zero-shot mutation prediction (fitness)
- Adapting the ESM-2 architecture to new protein datasets
- Setting up checkpointing and resume for long training runs
- Uploading trained models to GitHub Release

## Quick Start

# 1. Download data
python scripts/download_data.py

# 2. Train (GPU)
python train.py --data data/swissprot_train.fasta --val_data data/swissprot_val.fasta \
  --out_dir output --epochs 5 --batch_size 32 --device cuda

# 3. Evaluate zero-shot fitness
python scripts/evaluate_fitness.py --checkpoint output/checkpoint_final_best.pt

## Training Pipeline

### Data Download

python scripts/download_data.py

Downloads Swiss-Prot curated protein sequences from UniProt, splits into train/val, and saves FASTA files.

- Output: data/swissprot_train.fasta, data/swissprot_val.fasta
- Train: 433,583 sequences | Val: 22,821 sequences
- Truncation: sequences > 512 tokens are truncated at C-terminus

### Tokenizer

31-token amino acid vocabulary:
- 20 standard amino acids (A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V)
- Special: [MASK], [PAD], [CLS], [SEP], [UNK]

### Architecture

ESM2-small mirrors ESM-2 "mini" (12-layer Transformer):

| Parameter | Value |
|---|---|
| Layers | 12 |
| Hidden dim | 256 |
| Attention heads | 8 |
| FFN dim | 1024 |
| Vocab size | 31 |
| Max length | 512 |
| Total params | 9,624,607 |
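One breakdown that reproduces the stated 9,624,607 total, assuming learned positional embeddings and an untied LM head — an inference from the numbers in the table above, not a statement about the actual model code:

```python
def esm2_small_params(layers=12, d=256, ffn=1024, vocab=31, max_len=512):
    """Count parameters for the tabulated configuration."""
    attn = 4 * (d * d + d)                  # Q, K, V, O projections + biases
    ffn_params = (d * ffn + ffn) + (ffn * d + d)
    norms = 2 * 2 * d                       # two LayerNorms per block
    per_layer = attn + ffn_params + norms   # 789,760 per block
    embeddings = vocab * d + max_len * d    # token + learned positional
    head = d * vocab + vocab                # untied LM head
    final_norm = 2 * d
    return layers * per_layer + embeddings + head + final_norm

print(esm2_small_params())   # 9624607
```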

### Training Configuration

| Parameter | Default |
|---|---|
| Optimizer | AdamW (lr=1e-4, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01) |
| LR schedule | Linear warmup 1000 steps -> cosine decay to ~1e-9 |
| Batch size | 32 sequences |
| Masking | 15% uniform random |
| Mixed precision | FP32 weights, BF16 forward/backward (MLU370) |
| Epochs | 5 (~2h/epoch on single MLU370, ~1h/epoch on A100) |
| Steps per epoch | ~13,500 |
| Throughput | ~30K tokens/sec (single card) |
| Total tokens | ~1.1 billion |

### Checkpointing

Three checkpoint types are saved automatically:

1. Periodic (every 2000 steps): full snapshot for resume
2. Epoch (end of each epoch): history checkpoint
3. Best (when val loss improves): *_best.pt copy

Checkpoint contents:
- model_state_dict: model weights
- optimizer_state_dict: AdamW state
- lr_scheduler_state_dict: cosine schedule position
- rng_state: torch/cuda/numpy RNG for reproducibility
- train_loss, val_loss, config

### Resume from Checkpoint

python train.py --resume output/checkpoint_step20000.pt --data data/swissprot_train.fasta --val_data data/swissprot_val.fasta

Resumes exact training state (model weights + optimizer + LR schedule + RNG seed).

## Evaluation: Zero-Shot Fitness Prediction

python scripts/evaluate_fitness.py --checkpoint output/checkpoint_final_best.pt

Uses masked language modeling logit difference for zero-shot variant effect prediction:

1. Encode wild-type (WT) sequence -> get per-position MLM logits
2. Encode mutant sequence -> get per-position MLM logits
3. Delta = Score(mutant) - Score(WT)

Positive Delta -> potentially beneficial mutation
Negative Delta -> potentially deleterious mutation

Tested on GFP (Green Fluorescent Protein) mutations:

| Mutation | Type | Expected Delta |
|---|---|---|
| K7V | neutral | ~0 |
| K7I | neutral | negative |
| G66Y | brighter | large negative |
| G66H | dimer | large negative |

## Model Upload to GitHub

After training, upload model weights to GitHub Release:

# Using GitHub CLI
gh release create v1.0.0 \
    --title "ESM2-small v1.0.0" \
    --notes "Trained on Swiss-Prot, val_loss=0.417"

gh release upload v1.0.0 output/checkpoint_final_best.pt

# Or using the upload script
python scripts/upload_to_github.py \
    --token ghp_xxxx \
    --repo junior1p/ESM2-small \
    --tag v1.0.0

## Known Limitations

- GFP Spearman rho ~ 0.200 (small model, short training — larger models achieve 0.3-0.5)
- No MSA or structure features — pure MLM only
- Single card training — multi-card scaling not included
- No dropout — model is meant for transfer/fine-tuning

## Pitfalls

- Resume requires --data flag re-passed: data loader not saved in checkpoints
- Truncated sequences: sequences > 512 tokens are cut at C-terminus
- Val loss > Train loss: normal for protein MLM
- MLU370 BF16: fallback to FP32 if CNNL BF16 ops fail
- Checkpoint disk space: ~5GB for full training run
06 — Reproduce

Clone and Run

Full reproducibility in three commands. The skill handles everything from data download to model upload.

# Clone the repository
git clone https://github.com/junior1p/ESM2-small.git
cd ESM2-small

# Download Swiss-Prot data (~433K train + 22K val sequences)
python scripts/download_data.py

# Train on GPU — ~11 hours on single MLU370, ~5h on A100
python train.py \
  --data data/swissprot_train.fasta \
  --val_data data/swissprot_val.fasta \
  --out_dir output \
  --epochs 5 \
  --batch_size 32 \
  --device cuda

# Evaluate zero-shot fitness on GFP mutations
python scripts/evaluate_fitness.py \
  --checkpoint output/checkpoint_final_best.pt

Max

Protein language model training, evaluation, and skill development