PolyPanner

Dynamic Variant Detection in Metagenomes via Temporal Sampling

What is PolyPanner?

PolyPanner is a Python tool for detecting dynamic polymorphic variants in complex microbial communities by leveraging dense longitudinal (temporal or spatial) metagenome sampling. It co-assembles genomes across all time points to maximise assembly completeness, then identifies variant sites whose allele frequencies change significantly over time, filtering out noise from sequencing errors, mapping artefacts, paralogs, and homologous genes.

PolyPanner answers the question: "Which specific nucleotide variants within microbial genomes change in frequency across a time-series metagenome, indicating evolutionary selection or strain replacement?"


When to Use PolyPanner

Use PolyPanner when you have:

  • Longitudinal metagenomes from the same subject or environment (≥ 3 time points recommended)
  • A question about within-species evolution: selective sweeps, de novo resistance mutations, allele frequency shifts
  • Data from an intervention study (e.g. antibiotics, diet, probiotics) where microbial populations may evolve rapidly

Note

PolyPanner is distinct from strain-tracking tools like StrainPhlAn, which follow discrete strains across hosts. PolyPanner works within a single host's time-series to identify genetic changes that happened during the study period, including de novo mutations that arose and swept to high frequency.

Tip

Compared with inStrain: inStrain profiles diversity at variant sites in each sample independently. PolyPanner instead leverages temporal co-assembly to improve call accuracy and explicitly tests for frequency change across time points, which makes it better suited to detecting dynamic evolutionary events in longitudinal data.
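To make "testing for frequency change across time points" concrete, the sketch below computes a chi-square statistic of homogeneity on reference and alternate read counts per time point. It is an illustration of the general idea only and is not PolyPanner's actual statistical model; the function name `chi_square_homogeneity` and the example read counts are hypothetical.

```python
def chi_square_homogeneity(ref_counts, alt_counts):
    """Chi-square statistic for 'the alt-allele fraction is the same at every
    time point'. Illustrative only; not PolyPanner's own test.

    Returns (statistic, degrees_of_freedom).
    """
    if len(ref_counts) != len(alt_counts):
        raise ValueError("ref_counts and alt_counts must have equal length")

    total_ref = sum(ref_counts)
    total_alt = sum(alt_counts)
    total = total_ref + total_alt

    # Time points with zero depth carry no information and are skipped.
    informative = sum(1 for r, a in zip(ref_counts, alt_counts) if r + a > 0)
    df = max(informative - 1, 0)

    # If only one allele was ever observed there is nothing to test.
    if total_ref == 0 or total_alt == 0:
        return 0.0, df

    stat = 0.0
    for ref, alt in zip(ref_counts, alt_counts):
        depth = ref + alt
        if depth == 0:
            continue
        # Expected counts if the pooled allele fraction held at this time point.
        exp_ref = depth * total_ref / total
        exp_alt = depth * total_alt / total
        stat += (ref - exp_ref) ** 2 / exp_ref + (alt - exp_alt) ** 2 / exp_alt
    return stat, df


# Hypothetical site sampled at four time points, 40 reads each:
# the alt allele rises from 5% to 95%.
ref = [38, 30, 12, 2]
alt = [2, 10, 28, 38]
stat, df = chi_square_homogeneity(ref, alt)
print(f"chi-square = {stat:.1f} on {df} degrees of freedom")
```

For this example the statistic is far above 7.815, the 5% critical value of the chi-square distribution with 3 degrees of freedom, so the site would count as changing over time under this simple test. A site with a constant allele fraction gives a statistic near zero.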


Installation

Conda environment

conda create -n polypanner python=3.10
conda activate polypanner
git clone https://github.com/eitanyaffe/PolyPanner.git
cd PolyPanner
pip install -r requirements.txt

Input Files

Input                    Description
Co-assembled FASTA       Contigs assembled across all time-point samples together
Per-sample FASTQ pairs   Paired-end shotgun metagenomic reads for each time point
Sample manifest          Tab-separated file listing sample IDs, FASTQ paths, and time points

Warning

Co-assembly is required: PolyPanner is designed to work with a single co-assembly built from all time points (e.g. using MEGAHIT or metaSPAdes with all reads). Do not provide separate per-sample assemblies; this will produce inaccurate variant calls.


Basic Usage

# Step 1: Map reads from each time point to the co-assembly
polypanner map \
  --assembly co_assembly.fasta \
  --manifest sample_manifest.tsv \
  --outdir mapping_output/

# Step 2: Call and filter dynamic variants
polypanner call \
  --mapping mapping_output/ \
  --assembly co_assembly.fasta \
  --outdir variant_calls/

# Step 3: Summarise evolutionary events
polypanner summarise \
  --calls variant_calls/ \
  --outdir summary/

Sample manifest format (sample_manifest.tsv):

sample_id   timepoint   r1                      r2
T0_S1       0           T0_S1_R1.fastq.gz       T0_S1_R2.fastq.gz
T7_S1       7           T7_S1_R1.fastq.gz       T7_S1_R2.fastq.gz
T14_S1      14          T14_S1_R1.fastq.gz      T14_S1_R2.fastq.gz
T28_S1      28          T28_S1_R1.fastq.gz      T28_S1_R2.fastq.gz
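Before running the mapping step it can help to sanity-check the manifest. The sketch below is a hypothetical pre-flight check and not part of PolyPanner: it assumes the four column names shown above, real tab characters between fields, and numeric time points. The function name `read_manifest` and the default minimum of three time points are assumptions made for illustration.

```python
import csv
import io

REQUIRED_COLUMNS = ("sample_id", "timepoint", "r1", "r2")


def read_manifest(handle, min_timepoints=3):
    """Parse a tab-separated manifest and run basic sanity checks.

    Returns the rows as dicts, sorted by time point.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"manifest is missing columns: {missing}")

    rows, seen = [], set()
    for row in reader:
        # A short row leaves trailing fields as None.
        if any(not row[c] for c in REQUIRED_COLUMNS):
            raise ValueError(f"incomplete manifest row: {row}")
        if row["sample_id"] in seen:
            raise ValueError(f"duplicate sample_id: {row['sample_id']}")
        seen.add(row["sample_id"])
        row["timepoint"] = float(row["timepoint"])  # ValueError if not numeric
        rows.append(row)

    n_timepoints = len({r["timepoint"] for r in rows})
    if n_timepoints < min_timepoints:
        raise ValueError(
            f"only {n_timepoints} distinct time points; need at least {min_timepoints}"
        )
    return sorted(rows, key=lambda r: r["timepoint"])


# In-memory example; in practice pass open("sample_manifest.tsv", newline="").
example = (
    "sample_id\ttimepoint\tr1\tr2\n"
    "T7_S1\t7\tT7_S1_R1.fastq.gz\tT7_S1_R2.fastq.gz\n"
    "T0_S1\t0\tT0_S1_R1.fastq.gz\tT0_S1_R2.fastq.gz\n"
    "T14_S1\t14\tT14_S1_R1.fastq.gz\tT14_S1_R2.fastq.gz\n"
)
for row in read_manifest(io.StringIO(example)):
    print(row["sample_id"], row["timepoint"], row["r1"], row["r2"])
```

The check could be extended to confirm that each FASTQ path exists, for example with `os.path.exists`, before any mapping starts.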

Output

Output file                      Description
dynamic_variants.tsv             Variant sites with significant frequency change (position, alleles, p-value, effect size per time point)
allele_frequencies.tsv           Full allele frequency table across all time points for all called variant sites
sweep_events.tsv                 Summary of detected selective sweep events per genomic region
assembly_contigs_annotated.gff   Annotation of contigs with variant-dense regions flagged

Examining sweep events

import pandas as pd
import matplotlib.pyplot as plt

# Load the sweep summary and the full allele frequency table
sweeps = pd.read_csv("summary/sweep_events.tsv", sep="\t")
freqs  = pd.read_csv("variant_calls/allele_frequencies.tsv", sep="\t")

# Pick the sweep with the largest effect size
top = sweeps.sort_values("effect_size", ascending=False).iloc[0]

# Extract that site's frequencies, sorted so the line is drawn in chronological order
site = freqs[freqs["variant_id"] == top["variant_id"]].sort_values("timepoint")

# Plot the allele frequency trajectory for the top sweep
plt.plot(site["timepoint"], site["alt_freq"], marker="o")
plt.axhline(0.5, color="grey", linestyle="--", alpha=0.5)  # 50% reference line
plt.ylim(-0.05, 1.05)  # frequencies are bounded between 0 and 1
plt.xlabel("Day")
plt.ylabel("Alt allele frequency")
plt.title(f"Sweep at {top['contig']}:{top['position']}")
plt.tight_layout()
plt.savefig("top_sweep.png", dpi=150)

Tips & Gotchas

Warning

Minimum sampling density: PolyPanner requires at least 3–4 time points to reliably model frequency trajectories. With only 2 time points, the statistical test for frequency change is underpowered and false discovery rates increase substantially.

Warning

Sequencing depth: Accurate allele frequency estimation requires ≥ 20× mean coverage per contig per time point. Contigs with lower coverage are automatically flagged and excluded from variant calling.
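As a rough planning aid, mean coverage of a contig can be approximated as mapped reads × read length ÷ contig length. The sketch below applies that approximation to flag time points that fall under the 20× threshold quoted above. How PolyPanner itself computes coverage is not described here; the function names, read counts, and contig length are hypothetical, and the formula ignores clipped or partially aligned reads.

```python
MIN_COVERAGE = 20.0  # threshold quoted in the warning above


def mean_coverage(mapped_reads, read_length, contig_length):
    """Approximate mean depth: total mapped bases divided by contig length."""
    if contig_length <= 0:
        raise ValueError("contig_length must be positive")
    return mapped_reads * read_length / contig_length


def low_coverage_samples(read_counts, read_length, contig_length,
                         min_cov=MIN_COVERAGE):
    """Return {sample_id: coverage} for samples below the threshold.

    read_counts maps sample_id -> number of reads mapped to the contig.
    """
    flagged = {}
    for sample_id, n_reads in read_counts.items():
        cov = mean_coverage(n_reads, read_length, contig_length)
        if cov < min_cov:
            flagged[sample_id] = cov
    return flagged


# Hypothetical 50 kb contig with 150 bp reads.
counts = {"T0_S1": 10_000, "T7_S1": 5_000}
print(mean_coverage(counts["T0_S1"], 150, 50_000))   # 30.0
print(low_coverage_samples(counts, 150, 50_000))     # {'T7_S1': 15.0}
```

Here 10,000 reads of 150 bp on a 50,000 bp contig give 30×, which passes, while 5,000 reads give 15×, which would fall below the threshold.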

Tip

Co-assembly strategy: Use all time-point reads together in a single MEGAHIT or metaSPAdes run. Pooling reads improves contig length and completeness, which directly increases the number of callable variant sites.
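One way to pool all time points is to pass every read file to a single MEGAHIT invocation, which accepts comma-separated lists for its `-1` and `-2` paired-end options. The helper below is a minimal sketch that only builds the command; the function name and the output directory `coassembly_out` are hypothetical, and the file names are taken from the example manifest.

```python
import shlex


def megahit_coassembly_command(r1_paths, r2_paths, outdir="coassembly_out"):
    """Build a single MEGAHIT command that pools reads from all time points."""
    if not r1_paths or len(r1_paths) != len(r2_paths):
        raise ValueError("need matching, non-empty lists of R1 and R2 files")
    return [
        "megahit",
        "-1", ",".join(r1_paths),   # all forward reads, comma-separated
        "-2", ",".join(r2_paths),   # all reverse reads, in the same order
        "-o", outdir,
    ]


r1 = ["T0_S1_R1.fastq.gz", "T7_S1_R1.fastq.gz", "T14_S1_R1.fastq.gz"]
r2 = ["T0_S1_R2.fastq.gz", "T7_S1_R2.fastq.gz", "T14_S1_R2.fastq.gz"]
cmd = megahit_coassembly_command(r1, r2)
print(shlex.join(cmd))  # copy-pasteable shell command
# To actually run it: subprocess.run(cmd, check=True)
```

Building the command as a list keeps file names with unusual characters safe when it is later passed to `subprocess.run`, and keeping the R1 and R2 lists in the same sample order preserves read pairing.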

Tip

Interpreting sweeps: A detected sweep does not necessarily indicate antibiotic resistance. Cross-reference sweep_events.tsv with functional annotation (e.g. Prokka, eggNOG) to assess whether swept variants are in genes of known biological relevance.
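The cross-referencing step can be sketched as a simple interval lookup: for each swept position, find the annotated features on the same contig whose start and end bracket it. The example assumes a standard nine-column GFF3 file with 1-based inclusive coordinates and assumes that sweep positions use the same 1-based convention. The helper names, contig name, and gene records are hypothetical.

```python
def parse_gff(lines):
    """Yield one dict per feature from GFF3 lines, skipping comments."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete feature line
        yield {
            "contig": fields[0],
            "type": fields[2],
            "start": int(fields[3]),   # 1-based, inclusive
            "end": int(fields[4]),     # 1-based, inclusive
            "attributes": fields[8],
        }


def features_at(features, contig, position):
    """Return all features on `contig` that contain `position`."""
    return [
        f for f in features
        if f["contig"] == contig and f["start"] <= position <= f["end"]
    ]


# Hypothetical annotation records standing in for a real GFF file.
gff_lines = [
    "##gff-version 3",
    "\t".join(["contig_12", "Prokka", "CDS", "1000", "2200", ".", "+", "0",
               "ID=gene_0001;product=hypothetical protein"]),
    "\t".join(["contig_12", "Prokka", "CDS", "2500", "3400", ".", "-", "0",
               "ID=gene_0002;product=example efflux pump"]),
]
features = list(parse_gff(gff_lines))

# Hypothetical sweep at contig_12:2750
for hit in features_at(features, "contig_12", 2750):
    print(hit["type"], hit["start"], hit["end"], hit["attributes"])
```

A linear scan like this is adequate for a handful of sweeps; for many thousands of sites, sorting features per contig or using an interval tree would scale better.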


Further Reading