taxUMAP

Taxonomy-Aware UMAP Visualization for Microbiome Data

What is taxUMAP?

taxUMAP is a Python tool that extends the popular UMAP dimensionality reduction algorithm to incorporate microbial taxonomy, enabling intuitive and biologically meaningful visualizations of large microbiome datasets. By weighting distances between samples according to the taxonomic hierarchy of their community members, taxUMAP produces embeddings that reflect the biological structure of the microbiome — not just statistical distances.

taxUMAP answers the question: “How do microbiome community compositions cluster across samples, displayed in a way that respects microbial evolutionary relationships?”


When to Use taxUMAP

Use taxUMAP when you want to:

  • Visualise and explore microbiome composition across many samples
  • Identify community clusters or gradients in a biologically interpretable way
  • Analyse longitudinal or clinical microbiome cohorts
  • Compare community-level structure across treatment groups, time points, or body sites

Note

Standard UMAP applied to relative abundances treats all taxa as independent dimensions. taxUMAP aggregates abundances up the taxonomic tree, so communities with similar higher-level structure (e.g. same dominant phyla) are placed closer together even if their species composition differs. This is especially useful for large, clinically complex datasets.
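
The aggregation idea in this note can be illustrated with a minimal pandas sketch. The toy tables and sample names below are hypothetical, and the code only demonstrates the concept of summing abundances within a higher rank; it is not taken from taxUMAP's internals.

```python
import pandas as pd

# Hypothetical toy taxonomy: two Firmicutes ASVs and one Bacteroidota ASV.
taxonomy = pd.DataFrame(
    {"Phylum": ["Firmicutes", "Firmicutes", "Bacteroidota"]},
    index=["ASV1", "ASV2", "ASV3"],
)

# Relative abundances: sampleA and sampleB do not share a Firmicutes ASV.
microbiota = pd.DataFrame(
    {"ASV1": [0.8, 0.0], "ASV2": [0.0, 0.8], "ASV3": [0.2, 0.2]},
    index=["sampleA", "sampleB"],
)

# Sum the ASV columns that belong to the same phylum.
# Transposing puts ASVs on the index so they align with the taxonomy table.
phylum_table = microbiota.T.groupby(taxonomy["Phylum"]).sum().T

print(phylum_table)
```

At the ASV level the two samples look quite different, but after aggregation they have the same phylum profile, which is the situation in which a taxonomy-aware embedding would be expected to place them closer together.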


Installation

pip (from GitHub)

git clone https://github.com/jsevo/taxumap.git
cd taxumap
pip install -e .

Input Files

taxUMAP requires two CSV files:

File              Description
Microbiota table  Rows = samples, columns = ASVs/OTUs; must include an index_column column for sample IDs
Taxonomy table    Rows = ASVs/OTUs, columns = taxonomic ranks (Kingdom, Phylum, Class, Order, Family, Genus, Species)

Microbiota table example (microbiota_table.csv):

index_column,ASV1,ASV2,ASV3
sample1,0.50,0.30,0.20
sample2,0.10,0.70,0.20
sample3,0.60,0.05,0.35

Taxonomy table example (taxonomy.csv):

ASV,Kingdom,Phylum,Class,Order,Family,Genus,Species
ASV1,Bacteria,Firmicutes,Bacilli,Lactobacillales,Lactobacillaceae,Lactobacillus,acidophilus
ASV2,Bacteria,Bacteroidota,Bacteroidia,Bacteroidales,Bacteroidaceae,Bacteroides,fragilis
ASV3,Bacteria,Proteobacteria,Gammaproteobacteria,Enterobacterales,Enterobacteriaceae,Escherichia,coli

Tip

Unknown taxonomic levels should be represented as nan. Do not leave them blank, as blank values can break the aggregation logic.
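
One way to apply this tip is to clean the taxonomy table with pandas before handing it to taxUMAP. The sketch below uses a hypothetical inline file with blank Genus and Species cells; in practice you would read and write your own CSV paths instead of the in-memory strings used here.

```python
import io

import pandas as pd

# Hypothetical taxonomy file in which ASV2 has blank Genus and Species cells.
raw_csv = io.StringIO(
    "ASV,Kingdom,Phylum,Class,Order,Family,Genus,Species\n"
    "ASV1,Bacteria,Firmicutes,Bacilli,Lactobacillales,"
    "Lactobacillaceae,Lactobacillus,acidophilus\n"
    "ASV2,Bacteria,Bacteroidota,Bacteroidia,Bacteroidales,"
    "Bacteroidaceae,,\n"
)

# Read every cell as text and keep blanks as empty strings
# instead of letting pandas convert them to missing values.
taxonomy = pd.read_csv(raw_csv, index_col="ASV", dtype=str, keep_default_na=False)

# Strip stray whitespace, then replace empty cells with the literal string "nan".
taxonomy = taxonomy.apply(lambda col: col.str.strip()).replace("", "nan")

# Serialise the cleaned table; pass a file path to to_csv() to write it to disk.
cleaned_csv = taxonomy.to_csv()
print(cleaned_csv)
```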


Basic Usage

Command-line

python run_taxumap.py \
  -t taxonomy.csv \
  -m microbiota_table.csv

This produces a taxumap_embedding.csv file with the 2D coordinates for each sample.

Python API

from taxumap.taxumap_base import TaxUMAP

# Initialise
tu = TaxUMAP(
    microbiota="microbiota_table.csv",
    taxonomy="taxonomy.csv"
)

# Fit and transform
tu.fit_transform()

# Access embedding
embedding_df = tu.embedding_  # pandas DataFrame with columns X and Y

# Plot
tu.scatter()

Adjusting aggregation weights

taxUMAP weights aggregated profiles at each taxonomic level. You can adjust those weights to emphasise or de-emphasise higher-level taxonomy:

tu = TaxUMAP(
    microbiota="microbiota_table.csv",
    taxonomy="taxonomy.csv",
    agg_levels=["Genus", "Family", "Order"],   # levels to aggregate at
    weights=[1.0, 0.5, 0.25]                   # relative weight per level
)
tu.fit_transform()
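
How per-level weights might enter a distance computation can be sketched with NumPy and pandas. The sketch assumes a Bray-Curtis dissimilarity at each level and a weighted sum across levels; both are assumptions made for illustration, not a confirmed description of taxUMAP's internals. The toy tables and the helper names bray_curtis and aggregate are hypothetical.

```python
import numpy as np
import pandas as pd


def bray_curtis(table):
    """Pairwise Bray-Curtis dissimilarity between the rows of a table."""
    x = table.to_numpy(dtype=float)
    diff = np.abs(x[:, None, :] - x[None, :, :]).sum(axis=2)
    total = (x[:, None, :] + x[None, :, :]).sum(axis=2)
    return diff / total


def aggregate(table, taxonomy, level):
    """Sum ASV columns that share the same label at the given rank."""
    return table.T.groupby(taxonomy[level]).sum().T


# Hypothetical toy data: ASV1 and ASV2 share a genus and family, ASV3 does not.
taxonomy = pd.DataFrame(
    {
        "Family": ["Lactobacillaceae", "Lactobacillaceae", "Bacteroidaceae"],
        "Genus": ["Lactobacillus", "Lactobacillus", "Bacteroides"],
    },
    index=["ASV1", "ASV2", "ASV3"],
)
# Each sample consists of exactly one ASV.
microbiota = pd.DataFrame(
    np.eye(3),
    index=["sampleA", "sampleB", "sampleC"],
    columns=["ASV1", "ASV2", "ASV3"],
)

agg_levels = ["Genus", "Family"]
weights = [1.0, 0.5]

# Start from the ASV-level distance, then add each weighted aggregated level.
combined = bray_curtis(microbiota)
for level, weight in zip(agg_levels, weights):
    combined = combined + weight * bray_curtis(aggregate(microbiota, taxonomy, level))

print(pd.DataFrame(combined, index=microbiota.index, columns=microbiota.index))
```

In this toy case sampleA and sampleB differ only at the ASV level, so their combined distance stays small relative to sampleC, which differs at every level. Raising a level's weight increases the penalty for differing at that rank.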

Output

Output                 Description
taxumap_embedding.csv  2D UMAP coordinates for each sample
Scatter plot           Interactive or static visualisation (via built-in .scatter())

Downstream analysis

import pandas as pd
import matplotlib.pyplot as plt

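# Load the embedding and the sample metadata, both indexed by sample ID
embedding = pd.read_csv("taxumap_embedding.csv", index_col=0)
metadata  = pd.read_csv("metadata.csv", index_col=0)

# Attach the grouping variable to each sample's coordinates
# (samples without a metadata entry get a missing group and are skipped by groupby below)
merged = embedding.join(metadata[["group"]])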

fig, ax = plt.subplots()
for grp, sub in merged.groupby("group"):
    ax.scatter(sub["X"], sub["Y"], label=grp, alpha=0.7)
ax.legend()
ax.set_title("taxUMAP — coloured by group")
plt.tight_layout()
plt.savefig("taxumap_coloured.png", dpi=150)

Tips & Gotchas

Warning

Compositionality — Normalise your microbiota table to relative abundances (rows sum to 1) before running taxUMAP. Raw read counts will produce misleading results because UMAP distances are not sequencing-depth-aware.
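
A minimal pandas sketch of this normalisation step follows. The raw counts are hypothetical and were chosen so that the result matches the microbiota table example shown earlier.

```python
import pandas as pd

# Hypothetical raw read counts with very different sequencing depths.
counts = pd.DataFrame(
    {"ASV1": [500, 20, 1200], "ASV2": [300, 140, 100], "ASV3": [200, 40, 700]},
    index=pd.Index(["sample1", "sample2", "sample3"], name="index_column"),
)

# Drop samples without any reads to avoid dividing by zero.
depth = counts.sum(axis=1)
counts = counts.loc[depth > 0]

# Divide each row by its total so that every sample sums to 1.
relative = counts.div(counts.sum(axis=1), axis=0)

print(relative)
```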

Warning

Taxonomy completeness — Incomplete taxonomy tables (many nan values at lower ranks) reduce the benefit of the taxonomy-aware weighting. Use a classifier such as QIIME2/DADA2 + SILVA to maximise taxonomic resolution before running taxUMAP.
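
Before running taxUMAP it can help to check how complete the taxonomy table actually is. The sketch below, using a hypothetical four-ASV table, reports the fraction of unassigned entries per rank; it treats both the string "nan" and true missing values as unassigned.

```python
import pandas as pd

ranks = ["Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"]

# Hypothetical taxonomy table with unassigned lower ranks marked as "nan".
taxonomy = pd.DataFrame(
    [
        ["Bacteria", "Firmicutes", "Bacilli", "Lactobacillales",
         "Lactobacillaceae", "Lactobacillus", "acidophilus"],
        ["Bacteria", "Bacteroidota", "Bacteroidia", "Bacteroidales",
         "Bacteroidaceae", "Bacteroides", "nan"],
        ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Enterobacterales",
         "Enterobacteriaceae", "nan", "nan"],
        ["Bacteria", "Firmicutes", "Clostridia", "nan", "nan", "nan", "nan"],
    ],
    index=["ASV1", "ASV2", "ASV3", "ASV4"],
    columns=ranks,
)

# Flag unassigned cells, then average per column to get a fraction per rank.
unassigned = taxonomy.isna() | taxonomy.eq("nan")
fraction_unassigned = unassigned.mean()

print(fraction_unassigned)
```

Ranks with a high unassigned fraction contribute little to the aggregation, so they are natural candidates to leave out of agg_levels.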

Tip

UMAP parameters — taxUMAP passes n_neighbors and min_dist directly to the underlying UMAP call. Start with n_neighbors=15 and min_dist=0.1, then tune based on the cohort size and desired cluster tightness.

Tip

Large cohorts — taxUMAP was specifically designed for large clinical datasets (thousands of samples). The aggregation step is the main computational cost; for very large tables, reduce the number of aggregation levels.


Further Reading