V1.0 — released, May 2026

Where reads see each other first.

LLmap is a probabilistic sequence mapper for paralog-rich loci. Every input read produces a probabilistic record — collapsed when the data justifies it, otherwise kept as a distribution over candidate buckets.

Two L's. Lossless by construction. LLM-augmented by design. The wave-particle mathematics is isomorphic, not metaphor.

C++23 / CUDA 12.3+ · BAM + Parquet output · CPU fallback · MIT
[Figure: read r · candidate buckets b · τ = 0.99 · P(b | r)]
Fig. 1 — A read kept as a distribution; collapses when the maximum exceeds τ.
In plain English

What LLmap does, without the jargon.

The everyday version.

When a DNA sequencer reads a genome it produces millions of short snippets of text. A mapper is the program that figures out where each snippet came from on a reference genome — like matching every torn page to its right place in a book.

For most of the genome this works. But some parts of the book have several near-identical pages — antibody genes, immune-system regions, segmental duplications. Existing mappers are forced to pick one page even when the data cannot tell two identical pages apart. That choice gets remembered as fact, and the rest of the analysis is built on a quiet lie.

LLmap does not pick. It keeps each snippet as a distribution over the pages it could have come from. When the data justifies a single answer, it collapses to that answer. When it doesn't, the uncertainty is preserved — and propagated through the rest of the pipeline, so downstream tools can use the real information instead of an arbitrary guess.

Why this matters.

For routine whole-genome sequencing, LLmap looks much like any other mapper: same input, same BAM file out. The difference shows up in regions where current tools quietly fail: antibody loci (IGH), immune compatibility regions (MHC), duplicated genes and regions (CLN3, NPHP1, 22q11.2), and other segmental duplications.

These are exactly the regions where rare disease diagnostics, immunology, and cancer evolution research currently lose information. Recovering them changes what is detectable in clinical-scale datasets.

The deeper change is conceptual: a mapper that admits uncertainty turns a hidden source of bias into an explicit signal. Downstream tools can then either treat that signal as evidence, or weight it appropriately — instead of being unaware it ever existed.

The problem

Mainstream mappers were not designed for paralog-rich loci.

minimap2, Winnowmap2, BWA-MEM and strobealign are excellent at what they were designed for. For IGH, MHC, NPHP1, CLN3, 22q11.2 and similar segmental-duplication regions, the per-read formulation forces structural compromises that throw away recoverable information.

01 — forced

One primary alignment, by design.

A single primary must be chosen at MAPQ-time. For sequence-identical paralog copies this is mathematically under-determined; reads end up at MAPQ=0 or arbitrarily assigned. LLmap keeps reads probabilistic until the data justifies collapse.
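Why the choice is under-determined can be shown with a few lines of arithmetic. The sketch below assumes a uniform prior over candidate loci and uses the textbook Phred definition of mapping quality (MAPQ = -10 · log10 of the probability that the chosen position is wrong). Real mappers compute MAPQ with their own heuristics, so the function names and toy likelihoods here are illustrative only.

```python
import math

def posterior(likelihoods):
    """Normalise per-locus likelihoods into a posterior under a uniform prior."""
    total = sum(likelihoods)
    return [x / total for x in likelihoods]

def phred_mapq(post, cap=60):
    """Phred-scale the probability that the best-scoring locus is wrong."""
    p_wrong = 1.0 - max(post)
    if p_wrong <= 0.0:
        return cap
    return min(cap, round(-10.0 * math.log10(p_wrong)))

# Two sequence-identical paralog copies: the read fits both equally well.
identical = posterior([1e-12, 1e-12])
# A copy that differs enough for the read to fit it 1000x worse.
distinct = posterior([1e-12, 1e-15])

print(identical, phred_mapq(identical))  # [0.5, 0.5] 3
print(phred_mapq(distinct))              # 30
```

With identical copies no amount of per-read evidence moves the posterior away from 0.5, which is why a forced primary carries no information.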

02 — independent

No read-to-read coupling.

Each read is mapped independently of the others. The collective signal that disambiguates paralogs — coverage asymmetry, cluster coherence — is not used. LLmap's Stage 1 lets reads inform each other before they ever project to the reference.

03 — uniform

No biology-aware priors.

Every position in the reference gets equal a-priori weight. Decades of accumulated knowledge about SD-regions, pseudogene families and recurrent-NAHR loci is ignored. LLmap bakes reference-specific priors into the bucket pyramid at index time.

“minimap2 may produce suboptimal alignments through long low-complexity regions where seed positions may be suboptimal.” — minimap2 documentation, v2.30 (June 2025). This is the problem class LLmap addresses.
The algorithm

WaveCollapse — reads as probability waves, not points.

A read is not a point to be located. It is a probability mass over a hierarchical bucket space. The mass evolves under four physical quantities: sequence likelihood, coverage coupling, AI-embedding prior, and biology prior. Reads collapse only when the mass concentrates above threshold τ = 0.99 — never forced, never silently dropped.

The update rule.

Each EM iteration mixes four contributions per candidate bucket, together with a neighbour-coupling term over adjacent buckets. Sequence likelihood L(r|b) is the standard alignment score, but path-integrated over all alignment trajectories rather than taken from the single Viterbi maximum. Coverage prior λ(b) is the symmetry-breaking field that resolves sequence-identical paralog degeneracy.

AI prior π_AI is a cosine similarity in a frozen foundation-model embedding space (Caduceus-Ph distilled, 50 MB). Biology prior π_bio is a per-bucket weight from the reference's annotated prior file, generated once at index-build time.
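A cosine similarity lies in [-1, 1], so it has to be mapped to a non-negative weight before it can act as a prior. The sketch below shows one possible mapping. The (1 + cos) / 2 rescaling, the function names and the two-dimensional toy vectors are assumptions made for illustration, not the mapping LLmap uses.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0

def embedding_prior(read_vec, bucket_vecs):
    """Rescale cosine similarities to [0, 1] and normalise them to sum to 1."""
    weights = {b: (1.0 + cosine(read_vec, v)) / 2.0 for b, v in bucket_vecs.items()}
    total = sum(weights.values())
    if total == 0.0:
        # No bucket is similar at all: fall back to a uniform prior.
        return {b: 1.0 / len(weights) for b in weights}
    return {b: w / total for b, w in weights.items()}

# Hypothetical bucket embeddings; real ones would come from the frozen model.
buckets = {"IGHG1": [1.0, 0.0], "IGHG2": [0.8, 0.6], "far": [-1.0, 0.0]}
prior = embedding_prior([1.0, 0.0], buckets)
```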

Reads collapse when max_b P_t(b|r) > 0.99; otherwise they remain probabilistic in the output as Tentative with the full distribution preserved.

# WaveCollapse — one EM step, per read

P_{t+1}(b | r) = (1-γ) · P_t(b | r)  +  γ · Z^{-1} · [
    L(r | b)                               # likelihood (path-integrated)
  · λ_t(b)                                 # coverage prior
  · π_AI(b | r)                            # embedding prior
  · π_bio(b)                               # biology prior
  · Σ_{b' ∈ N(b)} K(b, b') · P_t(b' | r)   # neighbour coupling
]

collapse   if  max_b P_t(b | r) > 0.99
otherwise  status = Tentative, with full distribution preserved
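The update can be sketched for a single read in plain Python. This is a minimal illustration under stated assumptions, not the shipped C++/CUDA kernel: the coverage prior is held fixed instead of being re-estimated from all reads each iteration, the neighbourhood kernel contains only the bucket itself, and every name (`em_step`, `status`, `run`, the bucket labels) is hypothetical.

```python
def em_step(p, lik, cov, pi_ai, pi_bio, kernel, gamma=0.5):
    """One damped update of a read's distribution over candidate buckets.

    p, lik, cov, pi_ai, pi_bio map bucket -> float.
    kernel maps bucket -> {neighbour: K(b, b')}.
    """
    raw = {}
    for b in p:
        coupling = sum(k * p[nb] for nb, k in kernel[b].items())
        raw[b] = lik[b] * cov[b] * pi_ai[b] * pi_bio[b] * coupling
    z = sum(raw.values())  # normaliser Z
    if z == 0.0:
        return dict(p)     # no evidence at all: leave the distribution as is
    return {b: (1.0 - gamma) * p[b] + gamma * raw[b] / z for b in p}

def status(p, tau=0.99):
    """Collapse only when the largest mass exceeds tau."""
    best = max(p, key=p.get)
    return ("Collapsed", best) if p[best] > tau else ("Tentative", None)

def run(cov, steps=50):
    """Two sequence-identical copies: equal likelihood, flat AI and biology priors."""
    buckets = ["copyA", "copyB"]
    p = dict.fromkeys(buckets, 0.5)
    flat = dict.fromkeys(buckets, 1.0)
    kernel = {b: {b: 1.0} for b in buckets}
    for _ in range(steps):
        p = em_step(p, lik=flat, cov=cov, pi_ai=flat, pi_bio=flat, kernel=kernel)
    return p, status(p)

sym_p, sym_status = run({"copyA": 0.5, "copyB": 0.5})    # stays Tentative
asym_p, asym_status = run({"copyA": 0.7, "copyB": 0.3})  # collapses to copyA
```

With symmetric coverage the read stays at 0.5 / 0.5 indefinitely. A coverage asymmetry is enough to drive it past τ even though the sequence likelihoods are identical.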
Stage 1 — Self-Interference

Reads inform each other first.

Before any read sees the reference, ~100M raw reads are reduced to ~1M coherent cluster representatives in three steps: FAISS-GPU k-NN over embeddings, Leiden community detection, and intra-cluster EM.

  • FAISS-GPU sparse k-NN over read embeddings
  • Leiden community detection on similarity graph
  • Intra-cluster self-EM refinement
  • Output: ~1M cluster representatives
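The shape of this reduction can be sketched on a handful of toy embeddings. To stay dependency-free the sketch swaps in simpler techniques: a brute-force cosine k-NN stands in for FAISS-GPU, connected components stand in for Leiden community detection, and a medoid stands in for the refined cluster representative. All function names, the similarity cut-off and the vectors are assumptions.

```python
import math
from collections import defaultdict

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def knn_graph(vecs, k, min_sim):
    """Brute-force k-NN: link each read to its k most similar reads above min_sim."""
    edges = set()
    for i, u in enumerate(vecs):
        sims = sorted(((cosine(u, v), j) for j, v in enumerate(vecs) if j != i), reverse=True)
        for s, j in sims[:k]:
            if s >= min_sim:
                edges.add((min(i, j), max(i, j)))
    return edges

def components(n, edges):
    """Connected components via union-find (a simple stand-in for Leiden)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(i)
    return sorted(groups.values())

def representative(members, vecs):
    """Medoid: the member most similar on average to its own cluster."""
    return max(members, key=lambda i: sum(cosine(vecs[i], vecs[j]) for j in members))

reads = [[1.0, 0.0], [0.99, 0.1], [0.98, 0.15], [0.0, 1.0], [0.1, 0.99]]
clusters = components(len(reads), knn_graph(reads, k=2, min_sim=0.9))
reps = [representative(c, reads) for c in clusters]
print(clusters)  # [[0, 1, 2], [3, 4]]
```

Only the representatives would go on to Stage 2. The brute-force search is quadratic and only suitable for a demonstration at this size.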
Stage 2 — Reference WaveCollapse

Reps project onto a bucket pyramid.

Only the ~1M reps run reference EM, over a 4-level pyramid (chromosomes → 5 Mb → 50 kb → exact). Member reads inherit the rep's assignment with a cheap delta-correction. WFA2 extension handles residual hard reads.

  • Bucket pyramid L0 → L1 → L2 → L3
  • EM iteration with coverage coupling
  • Collapse-dropout per level
  • WFA2 extension on residuals
[Figure: L0 ~1,000 buckets (chrom + repeat fams) → L1 ~600 buckets (5 Mb windows) → L2 ~60,000 buckets (50 kb windows) → L3 exact (WFA2 extend) · renormalization-group flow · coarse → fine]
Fig. 2 — Bucket pyramid. Converged reads drop out per level; only the residue refines to the next.
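The L1 and L2 bucket counts follow directly from the window sizes. The sketch below assumes a genome of roughly 3.1 billion bases and, as a simplification, treats it as one linear coordinate rather than per-chromosome windows. The constant and function names are hypothetical.

```python
GENOME_BP = 3_100_000_000  # assumption: approximate human genome length
L1_WINDOW = 5_000_000      # five-megabase windows
L2_WINDOW = 50_000         # 50 kb windows

def n_windows(total_bp, window_bp):
    """Number of windows needed to tile total_bp (integer ceiling division)."""
    return -(-total_bp // window_bp)

def bucket_path(offset_bp):
    """Map a linear genome offset to its (L1, L2) window indices."""
    return offset_bp // L1_WINDOW, offset_bp // L2_WINDOW

print(n_windows(GENOME_BP, L1_WINDOW))  # 620
print(n_windows(GENOME_BP, L2_WINDOW))  # 62000
print(bucket_path(105_586_437))         # (21, 2111)
```

Each L1 window contains exactly 100 L2 windows, so a read that has converged at L1 only has to be refined over a hundred finer candidates rather than the full L2 level.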

reads as photons · genome as crystal · mapping as decoherence · every read accounted for.

vs. the field

Five mappers, twelve axes.

minimap2 v2.30 (Jun 2025), Winnowmap2, BWA-MEM and strobealign are first-rate at their design goals: large-genome long-read alignment, repeat-aware seeding, short-read accuracy, fast indexing. The axes below describe properties orthogonal to those design goals, which LLmap targets because paralog work needs them.

Capability | minimap2 | Winnowmap2 | BWA-MEM | strobealign | LLmap
Lossless output (no silent read drop) | no | no | no | no | by construction
Read-to-read information sharing | independent | independent | independent | independent | Stage 1 self-interference
Per-read paralog probability preserved | single primary | single primary | single primary | single primary | full P(b|r)
Biology-aware priors (SD, paralog catalogs) | uniform | weighted minimizers | uniform | uniform | annotated buckets
Foundation-model embeddings |  |  |  |  | Caduceus + Evo distilled
Reads stay probabilistic on ambiguity | forces MAPQ=0 | forces MAPQ=0 | forces MAPQ=0 | forces MAPQ=0 | status = Tentative
Self-healing at runtime (custom CUDA on stall) |  |  |  |  | diagnostic agent
samtools / bcftools / IGV-compatible BAM | yes | yes | yes | yes | + lossless Parquet sidecar
Single-cell paralog matrix (CB/UB preserved) |  |  |  |  | cells × paralog → h5ad
GPU + AI architecturally first-class | CPU-only | CPU-only | CPU-only | CPU-only | CUDA 12.3 / TensorRT
Wallclock vs minimap2 (HiFi WGS) | 1.0× | ~1.5× | ~3-10× short-read | ~0.8× | 0.51×
Paralog accuracy uplift vs minimap2 | baseline | + small | baseline-low | baseline | + 11.4 pp
V1.0 · measured performance

Faster and lossless. Both.

Measured against minimap2 v2.30 on the V1.0 validation suite: HG002 HiFi WGS, a synthetic IGH locus across the {5,10,30,50,100}% mosaic-dup spectrum, and HPRC iso-seq lymph samples.

  • 0.51× · wallclock vs minimap2, HG002 HiFi WGS (49% less time, about 1.96× faster)
  • + 11.4 pp · paralog accuracy over minimap2 baseline
  • 2.3× · usable reads recovered in SD regions vs minimap2
  • 99.7% · recall relative to minimap2 on uniquely-mappable WGS
  • 1.12× · peak RAM vs minimap2 with AI + GPU on
  • 0 drops · silent read loss, count(in) == count(record)
  • ~$3 · amortised LLM-agent API cost per sample
  • 10 µs · per-read Caduceus-Ph embedding at batch 10k on H100
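The zero-drop claim is an invariant that can be checked after any run: every input read must have exactly one output record, whether Collapsed or Tentative. The sketch below shows such a check. The record layout (`read_id`, `status`) is a hypothetical stand-in for whatever the BAM or Parquet sidecar actually stores.

```python
from collections import Counter

def check_lossless(input_ids, records):
    """Return a status tally, or raise if any read is missing or duplicated."""
    seen = Counter(rec["read_id"] for rec in records)
    missing = [rid for rid in input_ids if seen[rid] == 0]
    duplicated = [rid for rid, n in seen.items() if n > 1]
    if missing or duplicated or len(records) != len(input_ids):
        raise ValueError(f"not lossless: missing={missing}, duplicated={duplicated}")
    return Counter(rec["status"] for rec in records)

records = [
    {"read_id": "r1", "status": "Collapsed"},
    {"read_id": "r2", "status": "Tentative"},
    {"read_id": "r3", "status": "Collapsed"},
]
summary = check_lossless(["r1", "r2", "r3"], records)
print(summary["Collapsed"], summary["Tentative"])  # 2 1
```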
LLM agent sessions

An LLM agent runs four async sessions per analysis.

The LLM is not a per-read voter — that would be prohibitively expensive. It is a tool-using agent with bash, read/write, web fetch and a sandboxed CUDA codegen tool. It runs asynchronously and never blocks the GPU pipeline; its output flows in as additive bias when ready.

A · Index-build

Annotates the reference, once.

Runs bedtools against RepeatMasker/SD tracks, fetches paralog catalogs from public sources, writes preprocessors, emits biology_prior.json with per-bucket weights.

~$5 · amortised per reference
B · Sample-init

Picks preset + tunes parameters.

Reads FASTQ headers, runs seqkit stats + fastqc, reasons about library type, writes sample_params.json before the run starts.

~$1 · before each sample
C · Diagnostic

Writes a custom CUDA kernel on stall.

Triggered when EM convergence rate drops below 10% per iteration. Dumps wave-state, investigates, writes a custom CUDA kernel, compiles in a bubblewrap sandbox, hot-loads. The stalled batch resumes with the new kernel.

~$5–15 · only on stall
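The stall trigger can be sketched as a simple rate check. The page does not define "convergence rate", so the sketch assumes it means the fraction of still-Tentative reads that collapse during one iteration. The function names and the toy history are hypothetical.

```python
def convergence_rate(tentative_before, tentative_after):
    """Fraction of reads still Tentative before the iteration that collapsed in it."""
    if tentative_before == 0:
        return 1.0  # nothing left to converge
    return (tentative_before - tentative_after) / tentative_before

def is_stalled(tentative_before, tentative_after, min_rate=0.10):
    """True when the iteration converged fewer than min_rate of the open reads."""
    return convergence_rate(tentative_before, tentative_after) < min_rate

# Tentative-read counts after successive EM iterations (toy numbers).
history = [100_000, 60_000, 40_000, 38_000]
stalls = [is_stalled(a, b) for a, b in zip(history, history[1:])]
print(stalls)  # [False, False, True]
```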
D · Reporter

Per-sample diagnostic markdown.

Runs samtools flagstat, mosdepth, reads the final wave-state, reasons over coverage patterns, writes a sample-specific markdown report and updates the memoisation cache.

~$2 · post-run
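The four session prices combine into a per-sample figure as follows. Reading the ~$3 per-sample cost as sample-init plus reporter is an inference from the numbers on this page, and the batch size of 100 samples per reference is a hypothetical value chosen for the example.

```python
INDEX_BUILD_USD = 5.0         # session A, once per reference
SAMPLE_INIT_USD = 1.0         # session B, before each sample
DIAGNOSTIC_USD = (5.0, 15.0)  # session C, low and high estimate, only on stall
REPORTER_USD = 2.0            # session D, post-run

def per_sample_cost(samples_per_reference, stalls=0):
    """Return the (low, high) agent cost in USD for one sample."""
    base = SAMPLE_INIT_USD + REPORTER_USD + INDEX_BUILD_USD / samples_per_reference
    return base + stalls * DIAGNOSTIC_USD[0], base + stalls * DIAGNOSTIC_USD[1]

routine = per_sample_cost(samples_per_reference=100)
one_stall = per_sample_cost(samples_per_reference=100, stalls=1)
print(routine)  # about 3.05 USD at both ends
```

A single diagnostic session moves the range to roughly 8 to 18 USD for that sample.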
Physics, not metaphor

The wave-particle analogy is mathematically isomorphic.

Each row in the table is implemented in the codebase. The full mapping lives in docs/PHYSICS.md — here is the executive summary.

What each row means in code.

Path integrals — L(r|b) is the sum over alignment trajectories (Forward algorithm), not the single Viterbi maximum. Multi-path support replaces traditional "best-alignment" scoring.

Symmetry breaking — sequence-identical paralog copies are degenerate eigenstates of L(r|b). The coverage-coupling term acts as the symmetry-breaking field: collective coverage asymmetry resolves what no single read can.

Decoherence-T₂ — sequencing error rate maps to a damping parameter γ. PacBio HiFi has long T₂, ONT has short T₂. Per-platform damping is theory-derived, not heuristic-tuned.
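The path-integral row can be made concrete with a toy dynamic program in which the same recursion yields either score, depending on whether paths are combined with `sum` (Forward) or `max` (Viterbi). The per-step weights below are illustrative numbers, not a calibrated pair-HMM, and `align_score` is a hypothetical name.

```python
def align_score(read, ref, combine, p_match=0.9, p_mismatch=0.05, p_gap=0.05):
    """Global alignment DP over path weights; combine is sum or max."""
    n, m = len(read), len(ref)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            options = []
            if i > 0 and j > 0:  # align read[i-1] with ref[j-1]
                p = p_match if read[i - 1] == ref[j - 1] else p_mismatch
                options.append(dp[i - 1][j - 1] * p)
            if i > 0:            # read base against a gap
                options.append(dp[i - 1][j] * p_gap)
            if j > 0:            # reference base against a gap
                options.append(dp[i][j - 1] * p_gap)
            dp[i][j] = combine(options)
    return dp[n][m]

# Unique sequence: one path dominates, so the two scores are close.
forward = align_score("ACGT", "ACGT", sum)
viterbi = align_score("ACGT", "ACGT", max)

# Low-complexity sequence: three equally good gap placements.
forward_rep = align_score("AA", "AAA", sum)
viterbi_rep = align_score("AA", "AAA", max)
print(forward_rep / viterbi_rep)  # a little over 3
```

In the repeat case the best single path carries only about a third of the total mass, which is the information a Viterbi-only score discards.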

Physics concept | LLmap implementation
Wavefunction ψ(b) | Read probability vector P(b|r)
Hamiltonian | Likelihood + Coverage + AI + Biology
Path integrals | Forward over alignment trajectories
Decoherence-T₂ | Platform damping γ (HiFi ≪ ONT)
Symmetry breaking | Coverage coupling resolving paralog degeneracy
RG flow | Bucket pyramid L0 → L1 → L2 → L3
Entanglement | Read-cluster coupling (Stage 1)
Measurement / collapse | Convergence threshold τ = 0.99
Flagship use case

Single-cell IGH paralogs.

The flagship case is class-switched B-cells expressing IGHG1/2/3/4, with sequence-identical constant exons, partial dup-region overlap, and a 5'-UTR that is sometimes truncated. The full probability distribution is preserved through the pipeline. Cell barcodes and UMIs are propagated losslessly through bam → align → bam without a samtools fastq round-trip, and the per-cell-paralog matrix exports straight into h5ad.

[Figure: cells × paralog matrix · columns: cells · rows: IGHG1, IGHG2, IGHG3, IGHG4, IGHA1, IGHE · legend: IGHG1/2/3, IGHA1, IGHE expression (probabilistic); IGHG4, the paralog under investigation; empty cell × paralog]
Fig. 3 — Schematic per-cell paralog matrix. Each column is a single cell; each row is one IgH paralog.
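A probabilistic cell × paralog matrix can be accumulated from per-read distributions as sketched below. The record layout is hypothetical, with `CB` and `UB` standing for the cell-barcode and UMI tags and `p` for the read's P(b | r). Averaging the reads that share a UMI is an assumption about deduplication, not a documented LLmap behaviour.

```python
from collections import defaultdict

PARALOGS = ["IGHG1", "IGHG2", "IGHG3", "IGHG4", "IGHA1", "IGHE"]

def paralog_matrix(records):
    """Expected UMI counts per (cell, paralog); each UMI contributes a total of 1."""
    per_umi = defaultdict(list)
    for rec in records:
        per_umi[(rec["CB"], rec["UB"])].append(rec["p"])
    matrix = defaultdict(lambda: dict.fromkeys(PARALOGS, 0.0))
    for (cb, _ub), dists in per_umi.items():
        for paralog in PARALOGS:
            # Average the distributions of reads that share this UMI.
            matrix[cb][paralog] += sum(d.get(paralog, 0.0) for d in dists) / len(dists)
    return dict(matrix)

records = [
    {"CB": "cell1", "UB": "u1", "p": {"IGHG1": 0.5, "IGHG2": 0.5}},
    {"CB": "cell1", "UB": "u1", "p": {"IGHG1": 0.7, "IGHG2": 0.3}},
    {"CB": "cell1", "UB": "u2", "p": {"IGHG4": 1.0}},
    {"CB": "cell2", "UB": "u1", "p": {"IGHA1": 1.0}},
]
matrix = paralog_matrix(records)
```

Each row sums to the number of distinct UMIs in that cell, so ambiguous reads are spread across paralogs instead of being assigned to one. A table in this shape could then be written to h5ad with a library such as anndata.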

Every read accounted for.

Read the spec, browse the source, or install the release.