Microbiome for Dummies
A beginner-friendly guide to microbiome bioinformatics tools
Welcome
This website is a beginner-friendly reference guide for the most commonly used microbiome bioinformatics tools, with a focus on the bioBakery suite developed by the Huttenhower and Segata labs.
Whether you’re new to microbiome research or just need a quick reference, this site walks you through what each tool does, how to install it, and how to run it.
What is the Microbiome?
The microbiome is the collection of all microorganisms (bacteria, fungi, viruses, archaea) living in a particular environment. In human health research, the gut microbiome has attracted enormous attention due to its role in immunity, metabolism, and disease.
Studying the microbiome relies on sequencing DNA (or RNA) from environmental samples and then using computational tools to answer questions like:
- Who is there? (taxonomic profiling)
- What are they doing? (functional profiling)
- Are community shifts associated with disease? (statistical analysis)
The bioBakery Suite
The bioBakery suite is a collection of tools for end-to-end microbiome analysis maintained by the Huttenhower Lab (Harvard) and the Segata Lab (University of Trento). These tools cover the major analysis steps from raw sequence data to biological interpretation.
Functional & Taxonomic Profiling
- HUMAnN — Functional profiling of metagenomes
- MetaPhlAn — Taxonomic profiling from shotgun metagenomics
- PICRUSt2 — Predict metagenome function from 16S data
- PhyloPhlAn — Phylogenetic placement and genome characterization
- StrainPhlAn — Strain-level metagenomic profiling
Advanced Analysis
- ShortBRED — Identify and quantify protein families in metagenomes
- WAAFLE — Detect horizontal gene transfer events
- MACARRoN — Metabolome prioritization
- metawibele — Microbial protein function characterization
- baqlava — Viral profiling from metagenomes
- MaAsLin2 — Multivariable association discovery
- anpan — Microbial pan-genome statistical models
- CCREPE — Compositional data correlation
A Typical Microbiome Workflow
Which path you take depends heavily on whether you start with 16S/amplicon reads or shotgun metagenomes. Each workflow type has its own diagram below.
16S / amplicon workflow
flowchart TD
A([Raw 16S / ITS reads]) --> B[QIIME 2\nimport, denoise, taxonomy]
B --> C[ASV table + taxonomy]
B --> D[Diversity metrics + ordination]
C --> E[PICRUSt2\npredict gene families + pathways]
C --> F[taxUMAP / CCREPE\ncommunity structure and co-occurrence]
C --> G[MaAsLin2\ntaxon-metadata associations]
E --> H[Predicted KO / EC / pathway tables]
H --> I[MaAsLin2\npredicted function associations]
H --> J[Paired metabolomics table]
J --> K[Microbe-function-metabolite hypotheses]
Best for: community composition, diversity, and low-cost functional hypotheses when you only have amplicon data.
Shotgun metagenome workflow
flowchart TD
A([Raw shotgun reads]) --> B[KneadData\nquality control + host removal]
B --> C[MetaPhlAn\nspecies profile]
B --> D[HUMAnN\ngene families + pathways]
B --> E[ShortBRED\ntarget protein families]
B --> F[baqlava\nviral profile]
B --> G[Assembly / MAGs\ne.g., MEGAHIT · MetaBAT2]
C --> H[StrainPhlAn\nstrain tracking]
C --> I[taxUMAP\nordination]
C --> J[MaAsLin2\ntaxon-metadata associations]
D --> K[MaAsLin2\npathway + functional associations]
D --> L[MACARRoN + metabolomics\nmicrobe-metabolite prioritization]
E --> M[AMR / enzyme family abundance]
G --> N[metawibele / WAAFLE / PhyloPhlAn\nnovel proteins, HGT, phylogeny]
Best for: direct measurement of taxonomy, pathways, protein families, viruses, strain variation, and metabolomics-linked functional readouts.
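The hand-offs in the shotgun flowchart above can also be read as a small directed graph. The sketch below is only an illustration of that reading, not part of any tool: the dictionary is transcribed from the diagram, and the names `SHOTGUN_WORKFLOW`, `downstream`, and `upstream` are hypothetical helpers introduced here. The two MaAsLin2 boxes are kept as separate nodes, as in the diagram.

```python
from collections import deque

# Edges transcribed from the shotgun flowchart (step -> steps it hands off to).
# Node labels are shortened; this mapping is illustrative, not an official API.
SHOTGUN_WORKFLOW = {
    "Raw shotgun reads": ["KneadData"],
    "KneadData": ["MetaPhlAn", "HUMAnN", "ShortBRED", "baqlava", "Assembly / MAGs"],
    "MetaPhlAn": ["StrainPhlAn", "taxUMAP", "MaAsLin2 (taxa)"],
    "HUMAnN": ["MaAsLin2 (pathways)", "MACARRoN + metabolomics"],
    "ShortBRED": ["AMR / enzyme family abundance"],
    "Assembly / MAGs": ["metawibele / WAAFLE / PhyloPhlAn"],
}


def downstream(step, graph=SHOTGUN_WORKFLOW):
    """Return every step reachable from `step`, in breadth-first order."""
    seen, order, queue = {step}, [], deque([step])
    while queue:
        current = queue.popleft()
        for nxt in graph.get(current, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


def upstream(step, graph=SHOTGUN_WORKFLOW):
    """Return the prerequisite steps leading to `step`, earliest first.

    Assumes each step has a single parent, which holds for this diagram.
    """
    parents = {child: parent for parent, children in graph.items() for child in children}
    chain = []
    while step in parents:
        step = parents[step]
        chain.append(step)
    return chain[::-1]


if __name__ == "__main__":
    print(upstream("StrainPhlAn"))   # ['Raw shotgun reads', 'KneadData', 'MetaPhlAn']
    print(downstream("HUMAnN"))      # ['MaAsLin2 (pathways)', 'MACARRoN + metabolomics']
```

Reading the graph this way makes the ordering constraint explicit: anything listed by `upstream` has to be finished before the step in question can start.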
Tool Summary Table
The table below compares every tool covered on this site at a glance: what it does, why you would reach for it, what data it accepts, where it typically sits in the workflow, and whether it is most relevant to a 16S/amplicon or shotgun sequencing (SGS) workflow. Use the buttons to toggle between the two sequencing modes; tools tagged Both / downstream remain visible in either view because they are commonly used after either type of profiling.
For shotgun studies, most analyses listed here assume that reads have already been cleaned with a preprocessing tool such as KneadData, leaving quality-controlled, host-depleted reads ready for profiling, assembly, or strain analysis.
| Tool | Best fit workflow | Category | Input data | Workflow role & hand-off | Purpose & Why use it | Prominent Use Case |
|---|---|---|---|---|---|---|
| HUMAnN | Shotgun | Functional profiling | Shotgun metagenomics / metatranscriptomics (.fastq / .fastq.gz) | Starts once cleaned reads are ready, often after or alongside MetaPhlAn; hands off gene family and pathway tables to association testing, metabolomics integration, or biological interpretation. | Profiles microbial gene family and metabolic pathway abundances. Widely used to measure the community’s functional potential (or, with metatranscriptomes, its expressed functions); stratified output links functions back to specific species. | Franzosa EA et al. Nature Methods 2018 |
| MetaPhlAn | Shotgun | Taxonomic profiling | Shotgun metagenomics (.fastq / .fastq.gz) | Usually the first analytical step after KneadData-style read cleaning; hands off species profiles to HUMAnN, StrainPhlAn, taxUMAP, MaAsLin2, or other downstream follow-up. | Determines which species/strains are present and at what relative abundance. Fast, reference-based, and highly accurate; produces profiles usable by HUMAnN, StrainPhlAn, and many downstream tools. | Blanco-Míguez A et al. Nature Biotechnology 2023 |
| PICRUSt2 | 16S / amplicon | Functional prediction (16S) | 16S rRNA amplicon ASVs (.biom / .tsv) | Starts after denoising and taxonomy assignment in a QIIME 2-style workflow; hands off predicted EC, KO, and pathway tables to MaAsLin2 or pathway-focused interpretation. | Predicts functional gene and pathway content from 16S data. Extracts functional information from 16S surveys without shotgun sequencing; best option when only amplicon data are available. | Douglas GM et al. Nature Biotechnology 2020 |
| PhyloPhlAn | Shotgun | Phylogenetics | Whole genomes / MAGs (.fasta / .fna) | Starts once genomes or MAGs have been assembled; hands off placements and reference trees to taxonomic interpretation, comparative genomics, or genome-centered follow-up. | Places new genomes on a reference tree and assigns taxonomy. Resolves the evolutionary position of novel or draft genomes using universal marker genes; backbone of MetaPhlAn’s SGB taxonomy. | Asnicar F et al. Nature Communications 2020 |
| StrainPhlAn | Shotgun | Strain-level tracking | Shotgun metagenomics (.fastq / .fastq.gz) | Starts after MetaPhlAn and marker extraction identify a species with sufficient coverage; hands off strain trees to transmission, persistence, or within-host evolution analyses. | Tracks specific microbial strains across samples via phylogenetic trees. Answers whether the same strain is shared between individuals (e.g. mother-infant transmission) or persists across time points. | Truong DT et al. Genome Research 2017 |
| ShortBRED | Shotgun | Targeted gene profiling | Shotgun metagenomics (.fastq / .fastq.gz) | Starts with cleaned reads plus a predefined marker set for the protein family of interest; hands off targeted abundance tables to focused statistical or mechanistic follow-up. | Builds compact marker sequences for target proteins then quantifies them in metagenomes. Efficiently screens large cohorts for any protein of interest (e.g. antimicrobial resistance genes) without full-database alignment. | Lloyd-Price J et al. Nature 2019 |
| WAAFLE | Shotgun | Horizontal gene transfer | Metagenomic assemblies — contigs (.fasta / .fna) | Starts after contigs are assembled from shotgun data; hands off candidate HGT events for genome-context inspection and mobile-element follow-up. | Detects lateral gene transfer events in assembled metagenomes. Identifies genes that appear to have moved between phylogenetically distant lineages, revealing mobile genetic elements in communities. | Hsu TY et al. Nature Microbiology 2025 |
| MACARRoN | Both / downstream | Metabolomics prioritization | Untargeted LC-MS metabolomics (.csv / .tsv) | Starts once paired microbiome and metabolomics tables have been assembled; hands off ranked metabolite candidates to manual review, validation, and targeted experiments. | Ranks metabolites by biological relevance and microbiome association. Cuts through thousands of unannotated metabolite features to surface the ones most worth experimental follow-up once you have paired microbiome and metabolomics data. | Bhosle A et al. Molecular Systems Biology 2024 |
| metawibele | Shotgun | Protein characterization | Metagenomic assemblies (.fasta / .fna) | Starts after assembly or MAG recovery; hands off prioritized novel protein families to annotation curation, experimental follow-up, or comparative genomics. | Annotates and prioritizes novel microbial protein families. Multi-database annotation pipeline that highlights unannotated or poorly characterized proteins likely to have biological significance. | Zhang Y et al. Nature 2022 |
| baqlava | Shotgun | Viral profiling | Shotgun metagenomics (.fastq / .fastq.gz) | Starts with cleaned shotgun reads; hands off viral abundance profiles to comparison with bacterial taxa, host phenotypes, and multi-omics readouts. | Identifies and quantifies viruses (especially bacteriophages) alongside bacteria. Adds the viral dimension to standard metagenomics; designed to complement MetaPhlAn for a complete community profile. | Jensen JSL et al. bioRxiv 2026 |
| MaAsLin2 | Both / downstream | Statistical association | Any multi-omics table + metadata (.tsv / .csv) | Starts once feature tables and metadata are ready; hands off effect sizes, q-values, and model summaries to figures, interpretation, and reporting. | Finds microbial features significantly associated with host/environmental variables. Handles compositionality, sparsity, and confounders in a single multivariable model; widely used for differential abundance analysis. | Mallick H et al. PLOS Computational Biology 2021 |
| anpan | Shotgun | Pan-genome statistics | Gene presence/absence or SNP tables (per species) + metadata (.tsv / .csv) | Starts after species-resolved gene or SNP tables have been derived from shotgun data; hands off within-species association hits to mechanistic follow-up and validation. | Tests associations between within-species genetic variation and phenotypes. Goes beyond community-level analyses to ask whether specific microbial genes within a species drive a phenotype, with built-in phylogenetic correction. | Ghazi AR et al. bioRxiv 2025 |
| CCREPE | Both / downstream | Co-occurrence / correlation | Any compositional abundance table (.tsv / .csv) | Starts with a normalized compositional abundance table; hands off corrected correlation results to co-occurrence networks and ecological interpretation. | Computes compositionality-corrected correlations between microbial features. Reduces the spurious correlations that compositional data induce; useful for building more reliable co-occurrence networks. | HMP Consortium Nature 2012 |
| taxUMAP | Both / downstream | Visualization | Any species/ASV relative abundance table + taxonomy (.tsv / .csv) | Starts with taxonomically annotated ASV or species tables from either workflow; hands off publication-ready embeddings and exploratory figures for interpretation and reporting. | Produces taxonomy-aware UMAP embeddings of microbiome community composition. Captures biologically meaningful community structure that standard UMAP misses by aggregating abundances up the taxonomic tree before computing distances. | Schluter J et al. Cell Host & Microbe 2023 |
| PolyPanner | Shotgun | Intra-species evolution | Longitudinal metagenomic reads + co-assembly (.fastq / .fasta) | Starts after a focal species has been defined in a longitudinal shotgun cohort and co-assembly is available; hands off variant-frequency trajectories to within-host evolution interpretation. | Detects dynamic polymorphic variants whose allele frequencies change across a time-series of metagenomes. Specifically designed for longitudinal cohorts: leverages co-assembly to improve accuracy and tests for frequency change, enabling detection of de novo selective sweeps that cross-sectional tools miss. | Yaffe E et al. Nature 2025 |
| QIIME 2 | 16S / amplicon | Amplicon analysis platform | 16S / ITS amplicon reads (.fastq / .fastq.gz) | Usually the first major analytical step for raw amplicon reads; hands off ASV tables, taxonomy, diversity metrics, and ordinations to PICRUSt2, taxUMAP, MaAsLin2, or direct reporting. | End-to-end amplicon microbiome analysis: denoising, taxonomy, diversity, ordination, and differential abundance. Widely adopted, reproducible platform with full data provenance, an extensive plugin ecosystem, and a large user community in amplicon-based microbiome research. | Bolyen E et al. Nature Biotechnology 2019 |
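The rule that tools tagged Both / downstream stay visible in either view can be written as a small filter. This is a minimal sketch of that rule, not the site's actual implementation: the tool names and "Best fit workflow" values are copied from the table, while `TOOLS`, `SHARED`, and `tools_for` are hypothetical names introduced for illustration.

```python
# (tool, best-fit workflow) pairs copied from the summary table.
TOOLS = [
    ("HUMAnN", "Shotgun"),
    ("MetaPhlAn", "Shotgun"),
    ("PICRUSt2", "16S / amplicon"),
    ("PhyloPhlAn", "Shotgun"),
    ("StrainPhlAn", "Shotgun"),
    ("ShortBRED", "Shotgun"),
    ("WAAFLE", "Shotgun"),
    ("MACARRoN", "Both / downstream"),
    ("metawibele", "Shotgun"),
    ("baqlava", "Shotgun"),
    ("MaAsLin2", "Both / downstream"),
    ("anpan", "Shotgun"),
    ("CCREPE", "Both / downstream"),
    ("taxUMAP", "Both / downstream"),
    ("PolyPanner", "Shotgun"),
    ("QIIME 2", "16S / amplicon"),
]

SHARED = "Both / downstream"
MODES = {"Shotgun", "16S / amplicon"}


def tools_for(mode, tools=TOOLS):
    """List tools shown for `mode`: exact matches plus the shared downstream tools."""
    if mode not in MODES:
        raise ValueError(f"unknown workflow mode: {mode!r}")
    return [name for name, workflow in tools if workflow in (mode, SHARED)]


if __name__ == "__main__":
    print(tools_for("16S / amplicon"))
    # ['PICRUSt2', 'MACARRoN', 'MaAsLin2', 'CCREPE', 'taxUMAP', 'QIIME 2']
```

Keeping the shared tag as its own value, rather than listing those tools twice, means a tool only needs one row in the table.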
How to Use This Site
Each tool page provides:
- What it does — Plain-language description of the tool’s purpose
- When to use it — Where it fits in a typical microbiome workflow
- Installation — How to install the tool (conda, pip, Docker)
- Basic usage — Example command-line invocation
- Output — Description of key output files
- Tips & Gotchas — Common pitfalls for beginners
- Further reading — Links to documentation and key papers