V1.0 — released, May 2026

Lossless mapping
for the regions
every other mapper drops.

When two genome regions look identical — paralogs, pseudogenes, IGH, MHC, segmental duplications — current mappers are forced to pick one. LLmap keeps each read as a probability over the places it could come from, and only collapses when the data justifies it. No silent drops. No arbitrary primary alignments.

Two L's. Lossless by construction. LLM-augmented by design. The wave-particle mathematics is an isomorphism, not a metaphor.

C++23 / CUDA 12.3+ · BAM + Parquet output · CPU fallback · MIT

Fig. 1 — A read kept as a distribution; collapses when the maximum exceeds τ.

In plain English

What LLmap does, without the jargon.

The everyday version.

When a DNA sequencer reads a genome it produces millions of short snippets of text. A mapper is the program that figures out where each snippet came from on a reference genome — like matching every torn page to its right place in a book.

For most of the genome this works. But some parts of the book have several near-identical pages — antibody genes, immune-system regions, segmental duplications. Existing mappers are forced to pick one page even when the data cannot tell two identical pages apart. That choice gets remembered as fact, and the rest of the analysis is built on a quiet lie.

LLmap does not pick. It keeps each snippet as a distribution over the pages it could have come from. When the data justifies a single answer, it collapses to that answer. When it doesn't, the uncertainty is preserved — and propagated through the rest of the pipeline, so downstream tools can use the real information instead of an arbitrary guess.

Why this matters.

For routine whole-genome sequencing, LLmap looks much like any other mapper: same input, same BAM file out. The difference shows up in regions where current tools quietly fail: antibody loci (IGH), immune compatibility regions (MHC), duplicated genes and regions (CLN3, NPHP1, 22q11.2), and other segmental duplications.

These are exactly the regions where rare disease diagnostics, immunology, and cancer evolution research currently lose information. Recovering them changes what is detectable in clinical-scale datasets.

The deeper change is conceptual: a mapper that admits uncertainty turns a hidden source of bias into an explicit signal. Downstream tools can then either treat that signal as evidence, or weight it appropriately — instead of being unaware it ever existed.

The problem

Mainstream mappers were not designed for paralog-rich loci.

minimap2, Winnowmap2, BWA-MEM and strobealign are excellent at what they were designed for. For IGH, MHC, NPHP1, CLN3, 22q11.2 and similar segmental-duplication regions, the per-read formulation forces structural compromises that throw away recoverable information.

01 — forced

One primary alignment, by design.

A single primary must be chosen at MAPQ-time. For sequence-identical paralog copies this is mathematically under-determined; reads end up at MAPQ=0 or arbitrarily assigned. LLmap keeps reads probabilistic until the data justifies collapse.

02 — independent

No read-to-read coupling.

Each read is mapped independently of the others. The collective signal that disambiguates paralogs — coverage asymmetry, cluster coherence — is not used. LLmap's Stage 1 lets reads inform each other before they ever project to the reference.

03 — uniform

No biology-aware priors.

Every position in the reference gets equal a-priori weight. Decades of accumulated knowledge about SD-regions, pseudogene families and recurrent-NAHR loci is ignored. LLmap bakes reference-specific priors into the bucket pyramid at index time.

“minimap2 may produce suboptimal alignments through long low-complexity regions where seed positions may be suboptimal.” — minimap2 documentation, v2.30 (June 2025). This is the problem class LLmap addresses.

The algorithm

WaveCollapse — reads as probability waves, not points.

A read is not a point to be located. It is a probability mass over a hierarchical bucket space. The mass evolves under four physical quantities: sequence likelihood, coverage coupling, AI-embedding prior, and biology prior. Reads collapse only when the mass concentrates above threshold τ = 0.99 — never forced, never silently dropped.

The update rule.

Each EM iteration mixes four contributions per candidate bucket, then couples the result across neighbouring buckets through a kernel K(b,b'). Sequence likelihood L(r|b): the standard alignment score, but path-integrated over all alignment trajectories rather than the single Viterbi maximum. Coverage prior λ(b): the symmetry-breaking field that resolves sequence-identical paralog degeneracy.

AI prior π_AI is a cosine similarity in a frozen foundation-model embedding space (Caduceus-Ph distilled, 50 MB). Biology prior π_bio is a per-bucket weight from the reference's annotated prior file, generated once at index-build time.
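A minimal sketch of how a cosine similarity could be turned into a per-bucket prior. The source only states that π_AI is a cosine similarity in embedding space; the softmax normalisation, the `temperature` parameter, the toy three-dimensional embeddings and the function names are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def embedding_prior(read_emb, bucket_embs, temperature=0.1):
    """Normalise cosine similarities into a prior over buckets.

    ASSUMPTION: softmax with a temperature is an illustrative choice,
    not something the source specifies.
    """
    sims = {b: cosine(read_emb, emb) for b, emb in bucket_embs.items()}
    top = max(sims.values())  # subtract the maximum for numerical stability
    weights = {b: math.exp((s - top) / temperature) for b, s in sims.items()}
    z = sum(weights.values())
    return {b: w / z for b, w in weights.items()}

# Toy embeddings (hypothetical): two sequence-identical paralogs share one vector.
read = [1.0, 0.0, 0.2]
buckets = {
    "IGHG1": [0.9, 0.1, 0.2],
    "IGHG2": [0.9, 0.1, 0.2],
    "IGHA1": [0.0, 1.0, 0.0],
}
prior = embedding_prior(read, buckets)
```

In this toy case the two identical paralogs receive identical prior weight, which is why an embedding prior alone cannot separate them and the coverage term is needed.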

Reads collapse when max_b P_t(b|r) > 0.99; otherwise they remain probabilistic in the output as Tentative with the full distribution preserved.

# WaveCollapse — one EM step, per read

P_{t+1}(b | r) = (1-γ) · P_t(b | r)  +  γ · Z^{-1} · [
    L(r | b)                              # likelihood (path-integrated)
  · λ_t(b)                                # coverage prior
  · π_AI(b | r)                           # embedding prior
  · π_bio(b)                              # biology prior
  · Σ_{b'∈N(b)} K(b,b') · P_t(b' | r)     # neighbour coupling
]

collapse    if   max_b P_t(b | r) > 0.99
otherwise   status = Tentative, with full distribution preserved
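The update rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the C++/CUDA implementation: buckets are plain dict keys, the neighbourhood N(b) is assumed to include b itself, and the function names `em_step` and `status` are hypothetical.

```python
def em_step(P, L, lam, pi_ai, pi_bio, K, gamma=0.5):
    """One damped update of P(b|r) for a single read.

    P, L, lam, pi_ai, pi_bio: dicts keyed by bucket id.
    K: bucket -> {neighbour: weight}; ASSUMPTION: each bucket lists itself.
    """
    raw = {}
    for b in P:
        coupling = sum(w * P[nb] for nb, w in K[b].items())
        raw[b] = L[b] * lam[b] * pi_ai[b] * pi_bio[b] * coupling
    z = sum(raw.values())
    if z == 0.0:
        return dict(P)  # no evidence at all: leave the distribution unchanged
    return {b: (1 - gamma) * P[b] + gamma * raw[b] / z for b in P}

def status(P, tau=0.99):
    """Collapse only when the maximum exceeds tau; otherwise stay Tentative."""
    best = max(P, key=P.get)
    return ("Collapsed", best) if P[best] > tau else ("Tentative", None)

K = {"A": {"A": 1.0}, "B": {"B": 1.0}}
flat = {"A": 1.0, "B": 1.0}

# A read whose likelihood favours bucket A.
informative = {"A": 0.5, "B": 0.5}
for _ in range(30):
    informative = em_step(informative, {"A": 0.9, "B": 0.1}, flat, flat, flat, K)

# A read that cannot distinguish the two buckets.
ambiguous = {"A": 0.5, "B": 0.5}
for _ in range(30):
    ambiguous = em_step(ambiguous, {"A": 0.5, "B": 0.5}, flat, flat, flat, K)
```

With these toy inputs the informative read concentrates on A and collapses, while the ambiguous read keeps its 50/50 distribution and is reported as Tentative rather than being forced to one copy.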

Stage 1 — Self-Interference

Reads inform each other first.

Before any read sees the reference, ~100M raw reads are reduced to ~1M coherent cluster representatives. FAISS-GPU k-NN over embeddings, Leiden community detection, intra-cluster EM.

FAISS-GPU sparse k-NN over read embeddings
Leiden community detection on similarity graph
Intra-cluster self-EM refinement
Output: ~1M cluster representatives
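The Stage 1 steps listed above can be sketched at toy scale. The sketch deliberately swaps in simpler techniques: brute-force cosine k-NN instead of FAISS-GPU, and union-find connected components instead of Leiden community detection. The medoid rule for choosing a representative, the parameters `k` and `min_sim`, and the two-dimensional embeddings are assumptions for illustration only.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cluster_reads(embs, k=2, min_sim=0.9):
    """Group reads linked by a k-NN edge with similarity >= min_sim.

    Stand-ins: brute-force k-NN (not FAISS) and connected components
    (not Leiden). Returns (clusters, representatives) as read indices.
    """
    n = len(embs)
    sim = [[cosine(embs[i], embs[j]) for j in range(n)] for i in range(n)]
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        nearest = sorted((j for j in range(n) if j != i), key=lambda j: -sim[i][j])[:k]
        for j in nearest:
            if sim[i][j] >= min_sim:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    clusters = sorted(groups.values())
    # ASSUMPTION: the representative is the medoid, i.e. the member most
    # similar to the rest of its cluster.
    reps = [max(c, key=lambda i: sum(sim[i][j] for j in c if j != i)) for c in clusters]
    return clusters, reps

embs = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99], [0.98, 0.05]]
clusters, reps = cluster_reads(embs)
```

Only the representatives would then go on to reference EM, which is where the reduction from many reads to few cluster representatives comes from.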

Stage 2 — Reference WaveCollapse

Reps project onto a bucket pyramid.

Only the ~1M reps run reference EM, over a 4-level pyramid (chromosomes → 5 Mb → 50 kb → exact). Member reads inherit the rep's assignment with a cheap delta-correction. WFA2 extension handles residual hard reads.

Bucket pyramid L0 → L1 → L2 → L3
EM iteration with coverage coupling
Collapse-dropout per level
WFA2 extension on residuals
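A small sketch of the pyramid addressing and of collapse-dropout at one level. It assumes fixed-width buckets of 5 Mb and 50 kb addressed by integer division of a 0-based position; the real index layout is not described in the source, and `bucket_path` and `collapse_dropout` are hypothetical names.

```python
LEVEL_SIZES = (5_000_000, 50_000)  # L1 and L2 bucket widths in bases (assumed fixed-width)

def bucket_path(chrom, pos):
    """Return the (L0, L1, L2, L3) bucket ids for a 0-based reference position."""
    return (chrom, pos // LEVEL_SIZES[0], pos // LEVEL_SIZES[1], pos)

def collapse_dropout(posteriors, tau=0.99):
    """Split reads at one pyramid level into converged reads and residue.

    posteriors: read_id -> {bucket: probability}.
    Converged reads drop out; only the residue refines to the next level.
    """
    converged, residue = {}, []
    for read_id, dist in posteriors.items():
        best = max(dist, key=dist.get)
        if dist[best] > tau:
            converged[read_id] = best
        else:
            residue.append(read_id)
    return converged, residue

path = bucket_path("chr14", 105_586_437)

level0 = {
    "rep_1": {"chr14": 0.999, "chr15": 0.001},
    "rep_2": {"chr14": 0.60, "chr16": 0.40},
}
converged, residue = collapse_dropout(level0)
```

Here `rep_1` is settled at the chromosome level, while `rep_2` stays in the residue and would be refined against the 5 Mb buckets next.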

Fig. 2 — Bucket pyramid. Converged reads drop out per level; only the residue refines to the next.

reads as photons · genome as crystal · mapping as decoherence · every read accounted for.

vs. the field

Five mappers, twelve axes.

minimap2 v2.30 (Jun 2025), Winnowmap2, BWA-MEM and strobealign are first-rate at their design goals: large-genome long-read alignment, repeat-aware seeding, short-read accuracy, fast indexing. The axes below describe properties orthogonal to those design goals, which LLmap targets because paralog work needs them.

Capability	minimap2	Winnowmap2	BWA-MEM	strobealign	LLmap
Lossless output (no silent read drop)	no	no	no	no	by construction
Read-to-read information sharing	independent	independent	independent	independent	Stage 1 self-interference
Per-read paralog probability preserved	single primary	single primary	single primary	single primary	full P(b|r)
Biology-aware priors (SD, paralog catalogs)	uniform	weighted minimizers	uniform	uniform	annotated buckets
Foundation-model embeddings	—	—	—	—	Caduceus + Evo distilled
Reads stay probabilistic on ambiguity	forces MAPQ=0	forces MAPQ=0	forces MAPQ=0	forces MAPQ=0	status = Tentative
Self-healing at runtime (custom CUDA on stall)	—	—	—	—	diagnostic agent
samtools / bcftools / IGV-compatible BAM	yes	yes	yes	yes	+ lossless Parquet sidecar
Single-cell paralog matrix (CB/UB preserved)	—	—	—	—	cells × paralog → h5ad
GPU + AI architecturally first-class	CPU-only	CPU-only	CPU-only	CPU-only	CUDA 12.3 / TensorRT
Wallclock vs minimap2 (HiFi WGS)	1.0×	~1.5×	~3-10× short-read	~0.8×	0.51×
Paralog accuracy uplift vs minimap2	baseline	+ small	baseline-low	baseline	+ 11.4 pp

V1.0 · measured performance

Faster and lossless. Both.

Measured against minimap2 v2.30 on the V1.0 validation suite: HG002 HiFi WGS, a synthetic IGH locus across the {5,10,30,50,100}% mosaic-dup spectrum, and HPRC iso-seq lymph samples.

0.51×

wallclock vs minimap2, HG002 HiFi WGS (49% less time, about 1.96× throughput)

+ 11.4 pp

paralog accuracy over minimap2 baseline

2.3×

usable reads recovered in SD regions vs minimap2

99.7%

recall relative to minimap2 on uniquely mappable WGS

1.12×

peak RAM vs minimap2 with AI + GPU on

0 drops

silent read loss — count(in) == count(record)

~$3

amortised LLM-agent API cost per sample

10 µs

per-read Caduceus-Ph embedding at batch 10k on H100
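The "0 drops" figure is an invariant that can be checked mechanically: every input read must appear in the output exactly once, whether it collapsed or stayed Tentative. A minimal sketch follows, assuming output records are dicts with hypothetical `read_id` and `status` fields; the actual BAM and Parquet layouts are not shown here.

```python
from collections import Counter

def check_lossless(input_ids, records):
    """Raise ValueError unless count(in) == count(record), read by read.

    Returns a tally of statuses when the invariant holds.
    """
    expected = Counter(input_ids)
    seen = Counter(rec["read_id"] for rec in records)
    if seen != expected:
        missing = sorted((expected - seen).elements())
        extra = sorted((seen - expected).elements())
        raise ValueError(f"lossless invariant violated: missing={missing}, extra={extra}")
    return Counter(rec["status"] for rec in records)

records = [
    {"read_id": "r1", "status": "Collapsed"},
    {"read_id": "r2", "status": "Tentative"},
    {"read_id": "r3", "status": "Collapsed"},
]
summary = check_lossless(["r1", "r2", "r3"], records)
```

A check like this fails loudly on a dropped or duplicated read instead of letting the loss pass silently.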

LLM agent sessions

An LLM agent runs four async sessions per analysis.

The LLM is not a per-read voter — that would be prohibitively expensive. It is a tool-using agent with bash, read/write, web fetch and a sandboxed CUDA codegen tool. It runs asynchronously and never blocks the GPU pipeline; its output flows in as additive bias when ready.

A · Index-build

Annotates the reference, once.

Runs bedtools against RepeatMasker/SD tracks, fetches paralog catalogs from public sources, writes preprocessors, emits biology_prior.json with per-bucket weights.

~$5 amortised per reference

B · Sample-init

Picks preset + tunes parameters.

Reads FASTQ headers, runs seqkit stats + fastqc, reasons about library type, writes sample_params.json before the run starts.

~$1 before each sample

C · Diagnostic

Writes a custom CUDA kernel on stall.

Triggered when EM convergence rate drops below 10% per iteration. Dumps wave-state, investigates, writes a custom CUDA kernel, compiles in a bubblewrap sandbox, hot-loads. The stalled batch resumes with the new kernel.

~$5–15 only on stall
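One way the stall trigger could be expressed. The source gives only the 10% threshold; this sketch assumes "convergence rate" means the fraction of still-Tentative reads that collapse in one iteration, and the function names are hypothetical.

```python
def convergence_rate(tentative_before, tentative_after):
    """Fraction of remaining Tentative reads that collapsed in one iteration.

    ASSUMPTION: this definition of the rate is illustrative.
    """
    if tentative_before == 0:
        return 1.0  # nothing left to converge, so nothing can stall
    return (tentative_before - tentative_after) / tentative_before

def is_stalled(tentative_counts, min_rate=0.10):
    """True when the latest iteration converged less than min_rate of the residue."""
    if len(tentative_counts) < 2:
        return False
    return convergence_rate(tentative_counts[-2], tentative_counts[-1]) < min_rate

healthy = is_stalled([1000, 600, 400])   # last step collapsed a third of the residue
stalled = is_stalled([1000, 600, 560])   # last step collapsed under 7% of the residue
```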

D · Reporter

Per-sample diagnostic markdown.

Runs samtools flagstat, mosdepth, reads the final wave-state, reasons over coverage patterns, writes a sample-specific markdown report and updates the memoisation cache.

~$2 post-run

Physics, not metaphor

The wave-particle analogy is mathematically isomorphic.

Each row in the table is implemented in the codebase. The full mapping lives in docs/PHYSICS.md — here is the executive summary.

What each row means in code.

Path integrals — L(r|b) is the sum over alignment trajectories (Forward algorithm), not the single Viterbi maximum. Multi-path support replaces traditional "best-alignment" scoring.
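The difference between summing over trajectories and keeping the single best one can be shown with a toy dynamic programme. The step weights below are illustrative constants, not a calibrated pair-HMM, and `align_score` is a hypothetical helper: passing `sum` gives a Forward-style score, passing `max` gives a Viterbi-style score.

```python
def align_score(read, ref, combine, w_match=0.9, w_mismatch=0.05, w_gap=0.05):
    """Global alignment DP over multiplicative path weights.

    combine=sum adds the weight of every alignment path (Forward-style);
    combine=max keeps only the best path (Viterbi-style).
    ASSUMPTION: the three weights are toy values chosen for illustration.
    """
    n, m = len(read), len(ref)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0 and j > 0:  # aligned pair
                emit = w_match if read[i - 1] == ref[j - 1] else w_mismatch
                candidates.append(dp[i - 1][j - 1] * emit)
            if i > 0:            # read base against a gap
                candidates.append(dp[i - 1][j] * w_gap)
            if j > 0:            # reference base against a gap
                candidates.append(dp[i][j - 1] * w_gap)
            dp[i][j] = combine(candidates)
    return dp[n][m]

forward = align_score("ACGT", "ACGT", sum)
viterbi = align_score("ACGT", "ACGT", max)
```

The Forward-style score is never smaller than the Viterbi-style one, because the best path is just one of the paths being summed; the gap between them is the support that "best-alignment" scoring discards.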

Symmetry breaking — sequence-identical paralog copies are degenerate eigenstates of L(r|b). The coverage-coupling term acts as the symmetry-breaking field: collective coverage asymmetry resolves what no single read can.
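A toy illustration of coverage acting as the symmetry-breaking field. It assumes λ(b) is updated as the mean posterior mass per bucket (plain mixture-proportion EM), which may differ from LLmap's actual coupling term, and it assumes a few reads reach unique flanking sequence so that some asymmetry exists to begin with.

```python
def coverage_em(likelihoods, n_iter=50):
    """Alternate between read posteriors and a coverage prior lam.

    likelihoods: one {bucket: L(r|b)} dict per read.
    ASSUMPTION: lam(b) is the mean posterior mass in b, an illustrative form.
    """
    buckets = sorted(likelihoods[0])
    lam = {b: 1.0 / len(buckets) for b in buckets}
    posteriors = []
    for _ in range(n_iter):
        posteriors = []
        for L in likelihoods:
            raw = {b: L[b] * lam[b] for b in buckets}
            z = sum(raw.values())
            posteriors.append({b: raw[b] / z for b in buckets})
        lam = {b: sum(p[b] for p in posteriors) / len(posteriors) for b in buckets}
    return lam, posteriors

unique_a = [{"A": 1.0, "B": 0.0}] * 3     # reads anchored in copy A only
unique_b = [{"A": 0.0, "B": 1.0}] * 1     # one read anchored in copy B only
ambiguous = [{"A": 1.0, "B": 1.0}] * 10   # reads identical against both copies
lam, posteriors = coverage_em(unique_a + unique_b + ambiguous)
ambiguous_posterior = posteriors[-1]
```

No single ambiguous read carries any preference, yet collectively they settle on the 3:1 ratio set by the anchored reads. Because 0.75 is below τ = 0.99, these reads would still be reported as Tentative with their distribution intact.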

Decoherence-T₂ — sequencing error rate maps to a damping parameter γ. PacBio HiFi has long T₂, ONT has short T₂. Per-platform damping is theory-derived, not heuristic-tuned.

Physics concept	LLmap implementation
Wavefunction ψ(b)	Read probability vector P(b|r)
Hamiltonian	Likelihood + Coverage + AI + Biology
Path integrals	Forward over alignment trajectories
Decoherence-T₂	Platform damping γ (HiFi ≪ ONT)
Symmetry breaking	Coverage coupling resolving paralog degeneracy
RG flow	Bucket pyramid L0 → L1 → L2 → L3
Entanglement	Read-cluster coupling (Stage 1)
Measurement / collapse	Convergence threshold τ = 0.99

Flagship use case

Single-cell IGH paralogs.

The target is class-switched B cells expressing IGHG1/2/3/4, where the constant exons are sequence-identical, the dup regions partially overlap, and the 5'-UTR is sometimes truncated. The full probability distribution is preserved through the pipeline; cell barcodes and UMIs are propagated losslessly through bam → align → bam without a samtools fastq round-trip; the per-cell-paralog matrix exports straight into h5ad.

Fig. 3 — Schematic per-cell paralog matrix. Each column is a single cell; each row is one IgH paralog.
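A sketch of how per-read distributions could be aggregated into a cells × paralog matrix. It assumes reads sharing a cell barcode (CB) and UMI (UB) come from one molecule and averages their distributions, so each UMI contributes a total mass of one; that deduplication rule and the function name are assumptions. Writing the result to h5ad would need an additional library and is not shown.

```python
from collections import defaultdict

def cell_paralog_matrix(reads, paralogs):
    """Return {cell barcode: [expected UMI count per paralog]}.

    reads: iterable of (CB, UB, {paralog: probability}).
    ASSUMPTION: reads with the same (CB, UB) are averaged into one molecule.
    """
    per_umi = defaultdict(list)
    for cb, ub, dist in reads:
        per_umi[(cb, ub)].append(dist)
    matrix = defaultdict(lambda: [0.0] * len(paralogs))
    for (cb, _ub), dists in per_umi.items():
        for col, paralog in enumerate(paralogs):
            matrix[cb][col] += sum(d.get(paralog, 0.0) for d in dists) / len(dists)
    return dict(matrix)

paralogs = ["IGHG1", "IGHG2", "IGHG3", "IGHG4"]
reads = [
    ("AAAC", "U1", {"IGHG1": 1.0}),
    ("AAAC", "U1", {"IGHG1": 0.5, "IGHG2": 0.5}),
    ("AAAC", "U2", {"IGHG1": 0.5, "IGHG2": 0.5}),
    ("TTTG", "U1", {"IGHG4": 1.0}),
]
matrix = cell_paralog_matrix(reads, paralogs)
```

Each row sums to the number of UMIs in that cell, so ambiguity shows up as fractional counts spread across paralogs rather than as a forced assignment.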

schlein-lab · companion tools

LLmap is one piece of a wider paralog-genomics stack.

schlein-lab builds open tooling for genome regions that mainstream pipelines under-serve. LLmap is the mapper layer; below are the projects that sit above and beside it.

consumer · phasing layer

pseudocaller

PSV-catalog + phasing layer for paralog-rich loci. Reads LLmap's lossless BAM, emits per-haplotype paralog assignments. Drives IGHG4 ↔ IGHGP disambiguation at 90.7% concordance over 155 PSVs.

github.com/schlein-lab/pseudocaller →

companion · variant calling

BRANCH

Bubble-aware variant caller for high-instability regions. Pangenome-anchored de-novo + reconciliation; built on HG002 Blood (~13k validated bubbles) and HPRC cross-sample data.

github.com/schlein-lab/BRANCH →

viewer · BRANCH companion

VariantPaths

Visualization tool for the structural variants — duplications, deletions, rearrangements — that standard sequencing pipelines miss. Each variant stays visible individually rather than collapsed into a summary row.

variantpaths.com →

infrastructure · automation

nano-zyrkel

Open SDK for autonomous micro-agents that live inside a GitHub repository and run on Actions cron — no server required. The compute fabric we publish below the tool layer.

nano-zyrkel.com →

Every read accounted for.

Read the spec, browse the source, or install the release.
