anpan
Statistical Models for Microbial Pan-genome Analysis
What is anpan?
anpan is an R package providing statistical models for studying the association between microbial pan-genome features (gene presence/absence, structural variants, SNPs) and host phenotypes. It extends microbiome association analysis beyond community composition to the within-species genetic level.
anpan answers the question: "Are specific genes or genetic variants within a microbial species associated with a host phenotype?"
- GitHub
- Documentation
- Paper: Emelie Ahles et al.
When to Use anpan
Use anpan when you have species-level genomic features derived from metagenomes (for example from StrainPhlAn, PanPhlAn, or MIDAS) and want to:
- Test whether specific microbial genes are associated with a phenotype
- Account for phylogenetic structure when testing gene-phenotype associations
- Analyze pan-genome variation within a single species across samples
Installation
R (from GitHub)
devtools::install_github("biobakery/anpan")
Dependencies
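install.packages(
  c(
    "tidyverse",
    "cmdstanr", # for Bayesian models; not on CRAN, so the Stan repository is added below
    "lme4", # for mixed models
    "broom"
  ),
  repos = c("https://stan-dev.r-universe.dev", getOption("repos"))
)
Basic Usage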
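The call in this section assumes that the gene table and the metadata describe the same samples. Whether anpan aligns the two internally is not stated here, so a defensive alignment step may be useful once both tables are loaded. The following base-R sketch uses made-up sample IDs, gene names, and values purely for illustration:

```r
# Hypothetical gene presence/absence table: rows = samples, columns = genes
gene_table <- data.frame(
  geneA = c(1, 0, 1, 1),
  geneB = c(0, 0, 1, 0),
  row.names = c("s1", "s2", "s3", "s4")
)

# Hypothetical metadata in a different sample order, with one unmatched sample
metadata <- data.frame(
  disease_status = c(1, 0, 1, 0),
  age = c(34, 51, 47, 29),
  row.names = c("s3", "s1", "s5", "s2")
)

# Keep only samples present in both tables, in one shared order
shared <- intersect(rownames(gene_table), rownames(metadata))
gene_table <- gene_table[shared, , drop = FALSE]
metadata <- metadata[shared, , drop = FALSE]

# Fail early if the two tables are still misaligned
stopifnot(identical(rownames(gene_table), rownames(metadata)))
```

Samples that appear in only one table (here `s4` and `s5`) are dropped rather than silently mismatched.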
library(anpan)
# Load gene presence/absence data
gene_table <- read.csv("gene_table.csv", row.names = 1)
# Rows = samples, columns = genes
# Load metadata
metadata <- read.csv("metadata.csv", row.names = 1)
# Run pan-genome association
results <- anpan_batch(
  gene_table = gene_table,
  metadata = metadata,
  outcome = "disease_status",
  covariates = c("age", "sex", "BMI")
)
Phylogenetic correction
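Before fitting a phylogenetic model, it usually helps to confirm that the tree tips and the sample IDs agree. The sketch below uses ape with a toy Newick string and made-up sample IDs in place of real StrainPhlAn output. It assumes tip labels are meant to match the row names of the gene table, which may require renaming in practice:

```r
library(ape)

# Toy tree standing in for a StrainPhlAn tree (hypothetical sample IDs)
tree <- read.tree(text = "((s1:1,s2:1):1,(s3:1,s4:1):1);")

# Hypothetical sample IDs found in the gene table and metadata
sample_ids <- c("s1", "s2", "s3", "s5")

# Samples that appear both as tree tips and in the tables
shared <- intersect(tree$tip.label, sample_ids)

# Mismatches worth reporting before modelling
missing_from_tree <- setdiff(sample_ids, tree$tip.label)
extra_tips <- setdiff(tree$tip.label, sample_ids)

# Prune the tree down to the shared samples
pruned <- keep.tip(tree, shared)
```

The gene table and metadata can then be subset to `pruned$tip.label` so that all three inputs cover the same samples.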
# Load phylogenetic tree (from StrainPhlAn)
library(ape)
tree <- read.tree("strainphlan_output.tre")
# Run with phylogenetic mixed model
results_phylo <- anpan_batch(
  gene_table = gene_table,
  metadata = metadata,
  outcome = "disease_status",
  tree = tree,
  model = "phylo_lm"
)
Models Available
| Model | Code | Description |
|---|---|---|
| Linear mixed model | "lm" | Standard association, no phylogeny |
| Phylogenetic LM | "phylo_lm" | Accounts for phylogenetic relatedness |
| Logistic regression | "glm" | Binary outcomes |
| Bayesian model | "bayes" | Fully Bayesian estimation via Stan |
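To make the logistic-regression row concrete, here is a rough base-R sketch of what a per-gene test for a binary outcome amounts to. It is not anpan's implementation. The helper `fit_gene`, the gene names, and the counts are all hypothetical:

```r
# Hypothetical binary outcome: 8 diseased hosts followed by 8 healthy hosts
disease <- rep(c(1, 0), each = 8)

# Hypothetical presence/absence calls for two genes
gene_table <- data.frame(
  geneA = c(rep(1, 6), rep(0, 2), rep(1, 2), rep(0, 6)), # enriched in disease
  geneB = rep(rep(c(1, 0), each = 4), times = 2)          # unrelated to disease
)

# Fit one logistic regression per gene and return a one-row summary
fit_gene <- function(gene_name) {
  dat <- data.frame(disease = disease, gene = gene_table[[gene_name]])
  fit <- glm(disease ~ gene, family = binomial(), data = dat)
  coefs <- coef(summary(fit))
  data.frame(
    gene = gene_name,
    estimate = coefs["gene", "Estimate"],   # log-odds ratio
    std.error = coefs["gene", "Std. Error"],
    p.value = coefs["gene", "Pr(>|z|)"]
  )
}

results <- do.call(rbind, lapply(names(gene_table), fit_gene))
```

Each row of `results` holds the log-odds ratio for one gene. With these counts, `geneA` has an odds ratio of 9 and `geneB` an odds ratio of 1.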
Output
anpan returns a tidy data frame with one row per gene-phenotype association:
| Column | Description |
|---|---|
| gene | Gene or feature name |
| estimate | Effect size (log-odds or coefficient) |
| std.error | Standard error |
| p.value | Raw p-value |
| q.value | FDR-adjusted p-value |
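The relationship between the raw and adjusted columns can be sketched with base R's `p.adjust`. The text above says only that `q.value` is FDR-adjusted, so the Benjamini-Hochberg method used here is an assumption, and the p-values are invented:

```r
# Hypothetical raw p-values for four genes
p_values <- c(geneA = 0.001, geneB = 0.01, geneC = 0.04, geneD = 0.2)

# Benjamini-Hochberg adjustment (assumed; the FDR method is not specified above)
q_values <- p.adjust(p_values, method = "BH")

# Genes passing a 5% FDR threshold
significant <- names(q_values)[q_values < 0.05]
```

In this example `geneC` has a raw p-value below 0.05 but a q-value of about 0.053, so it no longer passes once the four tests are accounted for.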
Tips & Gotchas
Phylogenetic correction is important: closely related strains share many genes. Without phylogenetic correction, you may detect associations that are purely due to population structure, not true phenotypic effects.
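As a toy illustration of this point (not anpan's model), the base-R sketch below builds a dataset in which a gene tracks clade membership and the phenotype depends only on clade. Adjusting for clade as a fixed covariate is a crude stand-in for a phylogenetic random effect. The helper `make_group` and all counts are invented so that the result is exact:

```r
# Build one group of samples with fixed counts of diseased and healthy hosts
make_group <- function(clade, gene, n_disease, n_healthy) {
  data.frame(
    clade = clade,
    gene = gene,
    disease = rep(c(1, 0), times = c(n_disease, n_healthy))
  )
}

# Clade A: gene common, disease rate 70%. Clade B: gene rare, disease rate 30%.
# Within each clade the gene carries no information about disease.
samples <- rbind(
  make_group("A", 1, 63, 27),
  make_group("A", 0, 7, 3),
  make_group("B", 1, 3, 7),
  make_group("B", 0, 27, 63)
)

# Naive model ignores population structure; adjusted model includes clade
naive <- glm(disease ~ gene, family = binomial(), data = samples)
adjusted <- glm(disease ~ gene + clade, family = binomial(), data = samples)

naive_log_odds <- unname(coef(naive)["gene"])
adjusted_log_odds <- unname(coef(adjusted)["gene"])
```

The naive log-odds come out at about 1.33, while the clade-adjusted estimate is essentially zero: the apparent association is entirely explained by which clade a strain belongs to.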
Minimum prevalence filtering: rare genes (present in <10% of samples) have low power. Filter out low-prevalence genes before running to reduce the multiple-testing burden.
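A prevalence filter of this kind can be sketched in base R as follows. The gene table is hypothetical, and the 10% cutoff is taken from the tip above rather than from any anpan default:

```r
# Hypothetical presence/absence table: 20 samples, 3 genes
gene_table <- data.frame(
  geneA = rep(c(1, 0), times = c(10, 10)), # present in 10 of 20 samples
  geneB = rep(c(1, 0), times = c(1, 19)),  # present in 1 of 20 samples
  geneC = rep(c(1, 0), times = c(3, 17)),  # present in 3 of 20 samples
  row.names = paste0("s", 1:20)
)

# Fraction of samples in which each gene is present
prevalence <- colMeans(gene_table > 0)

# Drop genes below the minimum prevalence
min_prevalence <- 0.10
filtered <- gene_table[, prevalence >= min_prevalence, drop = FALSE]
```

Here `geneB` is removed, so only two tests remain to be corrected for instead of three.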
Integrate with StrainPhlAn: use StrainPhlAn to build within-species phylogenetic trees, which anpan can use for phylogenetic correction.