
ESM-2 Zero-Shot Mutation Fitness Prediction with ProteinGym Benchmark Validation

TL;DR: Zero-shot prediction of mutation effects on protein fitness using ESM-2 masked marginal scoring — no training data required. Automatically validates against ProteinGym's 217+ DMS assays.

[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)

[![ESM-2](https://img.shields.io/badge/ESM--2-650M-green.svg)](https://github.com/facebookresearch/esm)

[![ProteinGym](https://img.shields.io/badge/ProteinGym-217%20DMS%20assays-orange.svg)](https://github.com/OATML-Markslab/ProteinGym)

Overview

This repository implements a fully automated zero-shot mutation fitness prediction pipeline using ESM-2 protein language models. It generates all single-point mutants for a given protein, scores each using masked marginal log-likelihood ratio (LLR), and optionally validates predictions against the ProteinGym DMS benchmark.
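The mutant-generation step can be sketched as follows. This is a minimal illustration, not the code in `run.py`: the function name `single_point_mutants` and the `M1A`-style label (wild-type residue, 1-based position, mutant residue) are assumptions made for the example.

```python
# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def single_point_mutants(sequence):
    """Yield (label, mutant_sequence) for every single-residue substitution.

    Hypothetical helper: the label format is wild type + 1-based position + mutant.
    """
    for pos, wt in enumerate(sequence):
        for aa in AMINO_ACIDS:
            if aa == wt:
                continue  # skip the identity "mutation"
            label = f"{wt}{pos + 1}{aa}"
            yield label, sequence[:pos] + aa + sequence[pos + 1:]

mutants = list(single_point_mutants("MKT"))
print(len(mutants))   # 3 positions x 19 substitutions = 57
print(mutants[0])     # ('M1A', 'AKT')
```

A protein of length L yields 19 × L mutants, so the number of scores grows linearly with sequence length.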

Core Algorithm

Masked Marginal Scoring (Meier et al., 2021, NeurIPS):

score(X_i → Y_i) = log p(Y_i | x_{-i}) − log p(X_i | x_{-i})

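This is reported as the best-performing zero-shot strategy for ESM models, outperforming the wild-type marginal and PPPL approaches. The score measures how much more (or less) likely the mutant amino acid Y_i is than the wild-type X_i at position i, conditioned on all other positions x_{-i} (the sequence with position i masked). A positive score means the model favors the mutant; a negative score means it favors the wild type.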
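To make the formula concrete, here is a small sketch of the scoring arithmetic alone. It assumes the model has already been run with position i masked and has returned one logit per amino acid; the function names and the toy logits are hypothetical, and no ESM-2 model is loaded.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def masked_marginal_score(masked_logits, wt, mut):
    """log p(mut | x_{-i}) - log p(wt | x_{-i}) from logits at the masked position.

    masked_logits: 20 floats, ordered as AMINO_ACIDS (an assumption of this sketch).
    """
    log_p = dict(zip(AMINO_ACIDS, log_softmax(masked_logits)))
    return log_p[mut] - log_p[wt]

# Toy logits: the model strongly prefers alanine (A) at this position.
toy_logits = [2.0 if aa == "A" else 0.0 for aa in AMINO_ACIDS]
print(masked_marginal_score(toy_logits, wt="A", mut="G"))  # negative: G is disfavored
```

Because the normalizing constant cancels in the difference, the score equals the difference of the two raw logits; the log-softmax is kept here only to mirror the formula.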

Quick Start

```bash
pip install -r requirements.txt
python run.py                                    # Demo: GFP, 35M model (~5 min)
python run.py --uniprot P42212 --model 650M      # GFP with 650M + ProteinGym validation
python run.py --sequence MKTIIALSYIFCLVFA...      # Custom protein
```

Key Design Decisions

1. Masked Marginal Scoring (not wild-type marginal or PPPL)

ESM-Scan (Totaro et al., 2024, Protein Science) validated that masked marginal achieves the highest Spearman correlation (~0.48–0.56) among zero-shot ESM strategies, comparable to or better than Rosetta ΔΔG.

Reference: [Wiley Online Library — ESM-Scan](https://onlinelibrary.wiley.com/doi/full/10.1002/pro.5221)
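For readers unfamiliar with the metric, the Spearman correlation between predicted scores and measured DMS fitness can be computed as in the sketch below. It is a standard-library illustration with hypothetical function names and made-up numbers, not the validation code used by this repository.

```python
import math

def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rho: the Pearson correlation of the rank-transformed inputs."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one input is constant
    return cov / math.sqrt(vx * vy)

predicted = [-3.2, 0.4, -0.1, -1.5]   # illustrative LLR scores
measured = [0.1, 0.9, 0.8, 0.3]       # illustrative DMS fitness values
print(spearman(predicted, measured))
```

Only the ordering of the scores matters for this metric, which is why raw log-likelihood ratios can be compared with experimental fitness values without any calibration.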

2. Demo Protein: GFP

GFP (UniProt P42212), paired with the Sarkisyan et al. (2016) DMS dataset, is the most complete single-protein assay in ProteinGym. Using it as the demo maximizes the chance that automated validation succeeds and keeps the results reproducible.

Reference: [AWS Open Data Registry — ProteinGym](https://registry.opendata.aws/proteingym/)

3. Output Design: 4 Files for Agent Evaluation

| File | Purpose |
|---|---|
| `mutation_scores.csv` | Full ranked mutant list for agent logging |
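As an illustration of what a ranked mutant list might look like, the sketch below sorts scores in descending order and writes them as CSV. The column names (`rank`, `mutant`, `score`), the function name, and the four-decimal formatting are assumptions for the example; the actual layout of `mutation_scores.csv` may differ.

```python
import csv
import io

def write_ranked_scores(scores, handle):
    """Write mutants sorted from highest to lowest score (hypothetical layout)."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["rank", "mutant", "score"])
    for rank, (mutant, score) in enumerate(ranked, start=1):
        writer.writerow([rank, mutant, f"{score:.4f}"])

# Write to an in-memory buffer here; pass an open file handle to write to disk.
buffer = io.StringIO()
write_ranked_scores({"M1A": -3.2, "K2R": 0.4, "T3S": -0.1}, buffer)
print(buffer.getvalue())
```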
