Relative Abundance

Interactive longitudinal taxonomic bar charts with taxa-preserving significance overlays

This page shows how to build a family-level relative-abundance bar chart from MetaPhlAn output, while keeping the visual logic consistent with the intestinal phylogenetic tree navigator elsewhere on this site.

The main idea is:

  1. keep families ordered by phylogenetic relatedness
  2. keep colors anchored to the same phylum-level palette used in the tree navigator
  3. show statistical significance in a second visual layer instead of overloading the bar colors
Note

For most beginner-friendly microbiome plots, family level is a good compromise: less sparse than species-level data, but still biologically interpretable.


The palette to reuse

The intestinal tree page already uses these bacterial phylum colors:

Phylum Tree color
Actinomycetota #7e57c2
Bacteroidota #1976d2
Verrucomicrobiota #00897b
Bacillota #2e7d32
Pseudomonadota #c62828
Fusobacteriota #ef6c00

For family-level bars, it usually looks cleaner to use lighter shades within each phylum, which matches the tree’s logic that deeper ranks become lighter.

Interactive taxonomic bar-chart explorer

The interactive explorer below now uses a longitudinal control-versus-experimental design with triplicate subjects at time -3, Pre- (time 0), and time +4. You can change taxonomic level, choose a statistical contrast, inspect the sample-code key, and compare the stacked bars against a LEfSe-like significance strip while keeping the full framework table below.

My recommended visualization idea is the taxa-preserving significance strip:

  • the stacked bars keep taxa-specific color as the primary encoding
  • a second aligned panel shows LEfSe-like signed effect size for the selected longitudinal contrast
  • significant taxa keep their q labels, but taxa keep the same taxon color instead of switching to a significance palette

That way you get the best part of LEfSe-style interpretation without losing the phylogenetic color logic of the bar chart itself.


R packages

install.packages(c("dplyr", "tidyr", "tibble", "forcats", "ggplot2", "vegan", "scales"))
library(dplyr)
library(tidyr)
library(tibble)
library(forcats)
library(ggplot2)
library(vegan)
library(scales)

Step 1: Read a family-level MetaPhlAn table

metaphlan_raw <- read.delim("metaphlan_merged.tsv", check.names = FALSE, comment.char = "#")
metadata <- read.delim("metadata.tsv", check.names = FALSE)

family_long <- metaphlan_raw |>
  rename(feature = clade_name) |>
  filter(grepl("\\|f__[^|]+$", feature)) |>
  mutate(
    family = sub(".*\\|f__", "", feature),
    phylum = sub(".*\\|p__([^|]+).*", "\\1", feature),
    class = sub(".*\\|c__([^|]+).*", "\\1", feature),
    order = sub(".*\\|o__([^|]+).*", "\\1", feature)
  ) |>
  select(feature, phylum, class, order, family, everything()) |>
  pivot_longer(
    cols = -c(feature, phylum, class, order, family),
    names_to = "sample_id",
    values_to = "abundance"
  ) |>
  left_join(metadata, by = "sample_id")

If you have repeated rows at the same family level, aggregate them:

family_long <- family_long |>
  group_by(sample_id, group, timepoint, phylum, class, order, family) |>
  summarise(abundance = sum(abundance), .groups = "drop")

Step 2: Choose a phylogenetic order for families

To make the legend and stacked bars feel more biological, do not sort families only by abundance. Instead, keep closely related families adjacent.

A simple gut-oriented order that matches the intestinal tree page is:

phylo_order_tbl <- tibble(
  family = c(
    "Bifidobacteriaceae",
    "Bacteroidaceae", "Prevotellaceae", "Rikenellaceae", "Akkermansiaceae",
    "Lachnospiraceae", "Ruminococcaceae", "Peptostreptococcaceae",
    "Bacillaceae", "Staphylococcaceae", "Enterococcaceae",
    "Enterobacteriaceae", "Pseudomonadaceae", "Sutterellaceae",
    "Fusobacteriaceae"
  ),
  phylo_rank = seq_len(15)
)

If your study contains a different set of families, extend this table using the same logic:

  • sort by phylum first
  • then by class/order
  • then by family

That gives you a legend that reads like a simplified phylogenetic tree rather than a random color list.


Step 3: Keep the top families and lump the rest into “Other”

top_families <- family_long |>
  group_by(family) |>
  summarise(mean_abundance = mean(abundance), .groups = "drop") |>
  slice_max(mean_abundance, n = 15) |>
  pull(family)

family_plot <- family_long |>
  mutate(family = if_else(family %in% top_families, family, "Other")) |>
  group_by(sample_id, group, timepoint, phylum, class, order, family) |>
  summarise(abundance = sum(abundance), .groups = "drop")

Step 4: Build a tree-matched family palette

The phylum anchor colors below come directly from the tree navigator. Within each phylum, we create a short gradient so that related families share the same color family.

phylum_anchor_colors <- c(
  Actinomycetota = "#7e57c2",
  Bacteroidota = "#1976d2",
  Verrucomicrobiota = "#00897b",
  Bacillota = "#2e7d32",
  Pseudomonadota = "#c62828",
  Fusobacteriota = "#ef6c00"
)

phylum_light_colors <- c(
  Actinomycetota = "#f5f0fb",
  Bacteroidota = "#f2f8fe",
  Verrucomicrobiota = "#f0fbfa",
  Bacillota = "#f0f8e8",
  Pseudomonadota = "#fff5f6",
  Fusobacteriota = "#fff8ee"
)

family_key <- family_plot |>
  filter(family != "Other") |>
  distinct(phylum, class, order, family) |>
  left_join(phylo_order_tbl, by = "family") |>
  arrange(phylum, class, order, phylo_rank, family)

family_palette <- family_key |>
  group_by(phylum) |>
  group_modify(~ {
    shades <- colorRampPalette(c(
      phylum_anchor_colors[.y$phylum],
      phylum_light_colors[.y$phylum]
    ))(nrow(.x))

    mutate(.x, fill_color = shades)
  }) |>
  ungroup() |>
  select(family, fill_color) |>
  deframe()

family_palette <- c(family_palette, Other = "#d9d9d9")

This produces a palette where:

  • Bacteroidota families stay blue
  • Bacillota families stay green
  • Pseudomonadota families stay red
  • and so on

That way your stacked bar chart still communicates phylogeny at a glance.


Step 5: Apply the phylogenetic order to the legend and stacks

family_levels <- c(family_key$family, "Other")

family_plot <- family_plot |>
  mutate(
    family = factor(family, levels = family_levels),
    sample_id = factor(sample_id, levels = metadata$sample_id)
  )

Step 6: Draw the stacked bar chart

ggplot(family_plot, aes(x = sample_id, y = abundance, fill = family)) +
  geom_col(width = 0.9) +
  facet_grid(. ~ group, scales = "free_x", space = "free_x") +
  scale_fill_manual(values = family_palette, drop = FALSE) +
  scale_y_continuous(labels = label_percent(scale = 1)) +
  labs(
    title = "Family-level relative abundance",
    subtitle = "Families are ordered by phylogenetic relatedness and colored by the tree navigator palette",
    x = NULL,
    y = "Relative abundance (%)",
    fill = "Family"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
    panel.grid.major.x = element_blank(),
    legend.position = "right"
  )
Tip

If your study is longitudinal, put time on the x-axis and facet by subject or treatment group. The same color/order logic still works.


How to show statistical significance without breaking the color logic

If fill color already encodes phylogeny, avoid reusing the same fill for significance. That usually makes the plot harder to read.

A better approach is:

  • keep bar colors for taxonomy / phylogenetic relatedness
  • add a second panel for significance or effect size
  • optionally add *, **, or arrows in the legend labels or strip labels

Optional shortcut: add significance to legend labels

If you want something lighter-weight, append stars to the labels of significant families.

sig_families <- stats_tbl |>
  filter(qval < 0.05) |>
  pull(family)

legend_labels <- setNames(
  nm = levels(family_plot$family),
  object = ifelse(levels(family_plot$family) %in% sig_families,
                  paste0(levels(family_plot$family), " *"),
                  levels(family_plot$family))
)

Then pass labels = legend_labels into scale_fill_manual(). This is simple, but the side-panel heatmap is usually easier to interpret.


Tutorial: generate the standard volcano plot

The interactive volcano plot above now uses a standard volcano layout:

  1. compute a log2 fold change from the experimental-arm Pre- mean abundance to the time +4 mean abundance,
  2. compute a Kruskal-Wallis p-value for the same Pre- vs time +4 comparison,
  3. plot x = log2FC and y = -log10(p).

Because this page is a small synthetic teaching example, the browser-side code below focuses on the standard volcano geometry. A LEfSe-style follow-up comparison can be layered in later if we decide to add that analysis.

const SIGNIFICANCE_THRESHOLD = 0.05;
const MIN_DISPLAY_P_VALUE = 0.0005;
const VOLCANO_LOG2FC_PSEUDOCOUNT = 0.01;

function averageRanks(values) {
  const ranked = values
    .map((value, index) => ({ value, index }))
    .sort((left, right) => left.value - right.value);

  const ranks = new Array(values.length);
  let start = 0;
  while (start < ranked.length) {
    let end = start + 1;
    while (end < ranked.length && ranked[end].value === ranked[start].value) {
      end += 1;
    }
    const averageRank = (start + 1 + end) / 2;
    for (let rankIndex = start; rankIndex < end; rankIndex += 1) {
      ranks[ranked[rankIndex].index] = averageRank;
    }
    start = end;
  }
  return ranks;
}

function computeKruskalWallisPValue(groups) {
  const flattened = groups.flatMap((groupValues, groupIndex) =>
    groupValues.map((value) => ({ value, groupIndex }))
  );
  const ranks = averageRanks(flattened.map((entry) => entry.value));
  const totalCount = flattened.length;
  const tieCounts = flattened.reduce((counts, entry) => {
    counts[entry.value] = (counts[entry.value] || 0) + 1;
    return counts;
  }, {});
  const rankSums = groups.map(() => 0);

  flattened.forEach((entry, entryIndex) => {
    rankSums[entry.groupIndex] += ranks[entryIndex];
  });

  const statisticBase = groups.reduce((total, groupValues, groupIndex) =>
    total + ((rankSums[groupIndex] ** 2) / groupValues.length), 0
  );
  const statistic = ((12 / (totalCount * (totalCount + 1))) * statisticBase) - (3 * (totalCount + 1));
  const tieCorrectionDenominator = totalCount ** 3 - totalCount;
  const tieCorrection = tieCorrectionDenominator > 0
    ? 1 - (Object.values(tieCounts).reduce((total, count) => total + (count ** 3 - count), 0) / tieCorrectionDenominator)
    : 1;
  return Math.max(
    MIN_DISPLAY_P_VALUE,
    chiSquareSurvivalDf1(tieCorrection > Number.EPSILON ? statistic / tieCorrection : statistic)
  );
}

function computeLog2FoldChange(preValues, postValues) {
  if (preValues.length === 0 || postValues.length === 0) {
    return 0;
  }
  const preMean = mean(preValues);
  const postMean = mean(postValues);
  return Math.log2(
    (postMean + VOLCANO_LOG2FC_PSEUDOCOUNT) /
    (preMean + VOLCANO_LOG2FC_PSEUDOCOUNT)
  );
}

const volcanoPoints = level.taxa.map((taxon) => {
  const experimentalPreValues = valuesForTaxonSamples(taxon, "Experimental", "0");
  const experimentalPostValues = valuesForTaxonSamples(taxon, "Experimental", "+4");
  const pValue = computeKruskalWallisPValue([experimentalPreValues, experimentalPostValues]);
  const log2FoldChange = computeLog2FoldChange(experimentalPreValues, experimentalPostValues);

  return {
    taxon,
    x: log2FoldChange,
    y: -Math.log10(pValue)
  };
});

const circles = volcanoPoints.map((point) => `
  <circle
    cx="${xForLog2FoldChange(point.x)}"
    cy="${yForLogP(point.y)}"
    r="4.8"
    fill="${point.taxon.color}"
    stroke="${point.y >= -Math.log10(SIGNIFICANCE_THRESHOLD) ? "#0f172a" : "#94a3b8"}"
  ></circle>
`);

In the live figure:

  • horizontal position is the experimental-arm log2 fold change,
  • vertical position is the Kruskal-Wallis -log10(p) value,
  • the dashed horizontal guide marks the conventional p = 0.05 threshold.