ShortBRED
Short Better Representative Extract Dataset
What is ShortBRED?
ShortBRED (Short Better Representative Extract Dataset) is a pipeline for identifying and quantifying protein families of interest in metagenomic data. It first builds a set of short, highly specific marker sequences (ShortBRED-markers) for your proteins of interest, then uses those markers to rapidly quantify those protein families in large metagenomic datasets.
ShortBRED answers the question: “How abundant are these specific proteins or protein families in my metagenome?”
When to Use ShortBRED
Use ShortBRED when you want to:
- Profile the abundance of specific protein families in metagenomes (e.g., antibiotic resistance genes, virulence factors)
- Quickly screen large datasets for proteins of interest
- Analyze proteins not well covered by general metabolic databases
ShortBRED is particularly useful for antimicrobial resistance (AMR) profiling and quantifying specific gene categories across large cohorts.
Installation
Via conda
conda create -n shortbred -c biobakery shortbred
conda activate shortbred
From source
git clone https://github.com/biobakery/shortbred.git
cd shortbred
pip install .
Dependencies
- usearch: for searching reads against the markers
- CD-HIT: for clustering the input proteins into families
- MUSCLE: for multiple sequence alignment
- BLAST+: for marker generation
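Before running either step, it can help to confirm that the external tools are discoverable. The sketch below is a minimal check, assuming the executables are named `usearch`, `muscle`, `blastp`, and `makeblastdb` on your `PATH`. Those names are assumptions and may differ on your system (for example, a versioned usearch binary), so extend or rename the list to match your installation.

```python
import shutil

# Assumed executable names; adjust to match your installation.
REQUIRED_TOOLS = ["usearch", "muscle", "blastp", "makeblastdb"]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```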
Workflow
ShortBRED has two main steps:
Step 1: Build markers (ShortBRED-Identify)
Build compact marker sequences from your target protein sequences:
shortbred_identify.py \
--goi target_proteins.faa \
--ref uniref90.fasta \
--markers output_markers.faa \
--tmp tmp/

| Option | Description |
|---|---|
| --goi | Genes of interest (target proteins, FASTA) |
| --ref | Reference protein database (e.g., UniRef90) |
| --markers | Output marker file |
| --tmp | Temporary directory |
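If you drive ShortBRED from a script, the identify command can be wrapped in Python. This is a minimal sketch rather than part of ShortBRED: the function names `build_identify_cmd` and `run_identify` are hypothetical, and it assumes `shortbred_identify.py` is on your `PATH`. Only the four options listed in the table are used.

```python
import subprocess
from pathlib import Path


def build_identify_cmd(goi, ref, markers, tmp):
    """Return the shortbred_identify.py argument list for the four options above."""
    return [
        "shortbred_identify.py",
        "--goi", str(goi),
        "--ref", str(ref),
        "--markers", str(markers),
        "--tmp", str(tmp),
    ]


def run_identify(goi, ref, markers, tmp):
    """Check that both input FASTA files exist, then run ShortBRED-Identify."""
    for path in (goi, ref):
        if not Path(path).is_file():
            raise FileNotFoundError(f"Input FASTA not found: {path}")
    # check=True raises CalledProcessError on a non-zero exit status.
    subprocess.run(build_identify_cmd(goi, ref, markers, tmp), check=True)
```

Checking the inputs first means a mistyped path fails immediately instead of partway through marker generation.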
Step 2: Quantify in metagenomes (ShortBRED-Quantify)
Use the markers to quantify target proteins in metagenomic samples:
shortbred_quantify.py \
--markers output_markers.faa \
--wgs sample.fastq.gz \
--results sample_shortbred_results.txt \
--tmp tmp/
Output Files

| File | Contents |
|---|---|
| output_markers.faa | Compact marker sequences |
| *_results.txt | RPKM-normalized abundance of each protein family per sample |
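For readers unfamiliar with RPKM, the sketch below shows the generic definition: hits per kilobase of marker length per million reads. It illustrates the normalization in general terms only. ShortBRED's own calculation may differ in details such as how marker and read lengths are handled, and the function name and arguments here are hypothetical.

```python
def rpkm(hits, marker_length, total_reads):
    """Generic RPKM: hits per kilobase of marker length per million reads."""
    if marker_length <= 0 or total_reads <= 0:
        raise ValueError("marker_length and total_reads must be positive")
    kilobases = marker_length / 1_000
    millions_of_reads = total_reads / 1_000_000
    return hits / (kilobases * millions_of_reads)


# 50 hits to 500 units of marker length in a sample of 2,000,000 reads:
# 50 / (0.5 * 2.0) = 50.0
print(rpkm(50, 500, 2_000_000))
```

Because the value is scaled by both marker length and sequencing depth, abundances can be compared across families and across samples of different sizes.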
Tips & Gotchas
- Pre-built markers are available for common gene sets (e.g., ARGs, virulence factors) on the Biobakery website, saving you the identify step.
- usearch licensing: usearch requires a free license for 32-bit use and a commercial license for 64-bit. Alternatively, use vsearch as a drop-in open-source replacement.
- Batch processing: Run shortbred_quantify.py separately on each sample, then join results with standard table-joining tools.
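The batch workflow can be sketched in Python as a per-sample loop followed by a join. This is an illustration under stated assumptions, not ShortBRED functionality: the helper names are hypothetical, `shortbred_quantify.py` is assumed to be on your `PATH`, and the results files are assumed to be tab-delimited with `Family` and `Count` columns (check your own output and adjust the `key` and `value` arguments if the header differs).

```python
import csv
import subprocess
from pathlib import Path


def quantify_cmd(markers, wgs, results, tmp):
    """Return the shortbred_quantify.py argument list for one sample."""
    return ["shortbred_quantify.py", "--markers", str(markers), "--wgs", str(wgs),
            "--results", str(results), "--tmp", str(tmp)]


def quantify_all(markers, fastqs, out_dir):
    """Run ShortBRED-Quantify once per sample and return the results paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    results = []
    for fastq in fastqs:
        # "sample.fastq.gz" -> "sample" (everything before the first dot).
        sample = Path(fastq).name.split(".")[0]
        result = out_dir / f"{sample}_shortbred_results.txt"
        # A separate tmp directory per sample keeps runs from overwriting each other.
        tmp = out_dir / f"tmp_{sample}"
        subprocess.run(quantify_cmd(markers, fastq, result, tmp), check=True)
        results.append(result)
    return results


def read_results(path, key="Family", value="Count"):
    """Read one results file into {family: abundance}; column names are assumed."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row[key]: float(row[value]) for row in reader}


def join_results(paths, out_path):
    """Write a families-by-samples table; families absent from a file get 0.0."""
    tables = {Path(p).name: read_results(p) for p in paths}
    families = sorted(set().union(*tables.values()))
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["Family", *tables])
        for family in families:
            writer.writerow([family, *(t.get(family, 0.0) for t in tables.values())])
```

A typical call would be `join_results(quantify_all("output_markers.faa", fastq_paths, "results"), "joined.tsv")`, where `fastq_paths` is your own list of sample files. The joined table has one row per family and one column per results file.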