metawibele
Workflow to Identify novel Bioactive Elements in the microbiome
What is metawibele?
metawibele (MetaWIBELE: Workflow to Identify novel Bioactive Elements in the microbiome) is a pipeline for characterizing and prioritizing novel protein families from metagenomes. It processes metagenomic assemblies, predicts open reading frames (ORFs), clusters them into protein families, and then annotates the families against multiple functional databases to identify candidate bioactive proteins.
metawibele answers the question: "What novel or understudied microbial proteins are present in this metagenome, and which ones are worth investigating further?"
- GitHub: https://github.com/biobakery/metawibele
- Documentation
- Paper: Zhang et al. 2022, Nature
When to Use metawibele
Use metawibele when you want to:
- Characterize microbial proteins from metagenomic assemblies
- Identify novel proteins with unknown functions
- Prioritize proteins for experimental follow-up
- Annotate protein families against multiple databases simultaneously
Installation
Via conda (recommended)
conda create -n metawibele -c biobakery metawibele
conda activate metawibele
From source
git clone https://github.com/biobakery/metawibele.git
cd metawibele
pip install .
Workflow Overview
metawibele has three main stages:
Metagenomic assemblies (contigs)
│
▼
1. Preprocessing
(ORF prediction, protein clustering)
│
▼
2. Characterization
(multi-database annotation)
│
▼
3. Prioritization
(scoring and ranking)
Running the full pipeline
metawibele \
--input-sequence contigs.fasta \
--input-count counts.tsv \
--output-folder metawibele_output/ \
--threads 8
Annotation Databases
metawibele integrates annotation from multiple sources:
| Source | Information |
|---|---|
| UniRef90 | Protein family membership |
| Pfam | Protein domain content |
| KEGG | Metabolic pathway annotations |
| eggNOG | Orthologous group annotations |
| PSORTb | Protein subcellular localization |
| SignalP | Signal peptide prediction |
| TMHMM | Transmembrane domain prediction |
| MaAsLin2 | Association with metadata |
Output Files
| File | Contents |
|---|---|
| `*_proteinfamilies.tsv` | Protein family abundance table |
| `*_characterization.tsv` | Multi-database annotations |
| `*_prioritization.tsv` | Ranked prioritization scores |
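To work with the ranked output downstream, the prioritization table can be read with Python's standard `csv` module. The following is a minimal sketch, not part of metawibele itself: the column names `family_id` and `priority_score` are hypothetical placeholders, so check the header of your own `*_prioritization.tsv` before adapting it.

```python
import csv
import io


def top_families(tsv_handle, score_column, id_column, n=3):
    """Return the n highest-scoring (id, score) pairs from a TSV file handle."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    rows = [(row[id_column], float(row[score_column])) for row in reader]
    # Sort by score, highest first; ties keep their original file order.
    rows.sort(key=lambda pair: pair[1], reverse=True)
    return rows[:n]


# Tiny in-memory stand-in for a prioritization table (made-up values and
# hypothetical column names).
example = io.StringIO(
    "family_id\tpriority_score\n"
    "Cluster_1\t0.42\n"
    "Cluster_2\t0.91\n"
    "Cluster_3\t0.67\n"
)
print(top_families(example, "priority_score", "family_id", n=2))
```

For a real run, replace the `io.StringIO` object with `open(path, newline="")` and pass the actual column names from the file header.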
Prioritization Score
metawibele computes a composite prioritization score based on:
- Prevalence: How common is this protein family across samples?
- Abundance: How abundant is it?
- Annotation novelty: Is it unannotated or poorly characterized?
- Differential abundance: Is it associated with a phenotype of interest?
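To make the idea of a composite score concrete, here is one possible way to combine the four criteria above. This is an illustrative sketch only, not metawibele's actual formula: it assumes each criterion is converted to a percentile rank and the ranks are combined with a harmonic mean, so a family must do well on every criterion to score highly. The input fields (`counts`, `novelty`, `effect_size`) and all values are hypothetical.

```python
def prevalence(counts, threshold=0.0):
    """Fraction of samples in which the family's count exceeds the threshold."""
    return sum(1 for c in counts if c > threshold) / len(counts)


def mean_abundance(counts):
    """Average count of the family across all samples."""
    return sum(counts) / len(counts)


def percentile_ranks(values):
    """Rank each value in (0, 1]; larger values rank higher, ties share a rank."""
    n = len(values)
    return [sum(1 for other in values if other <= v) / n for v in values]


def harmonic_mean(xs):
    """Harmonic mean of strictly positive numbers."""
    return len(xs) / sum(1.0 / x for x in xs)


def composite_scores(families):
    """Combine the four criteria into one score per family (assumed scheme)."""
    ids = list(families)
    criteria = [
        [prevalence(families[f]["counts"]) for f in ids],
        [mean_abundance(families[f]["counts"]) for f in ids],
        [families[f]["novelty"] for f in ids],
        [abs(families[f]["effect_size"]) for f in ids],
    ]
    ranked = [percentile_ranks(column) for column in criteria]
    return {
        f: harmonic_mean([column[i] for column in ranked])
        for i, f in enumerate(ids)
    }


# Made-up example: per-sample counts, a 0/1 novelty flag, and a
# differential-abundance effect size for three protein families.
families = {
    "fam_A": {"counts": [5, 0, 3, 8], "novelty": 1.0, "effect_size": 0.9},
    "fam_B": {"counts": [1, 1, 0, 0], "novelty": 0.0, "effect_size": 0.1},
    "fam_C": {"counts": [0, 0, 0, 2], "novelty": 1.0, "effect_size": -0.4},
}
scores = composite_scores(families)
for family in sorted(scores, key=scores.get, reverse=True):
    print(family, round(scores[family], 3))
```

The harmonic mean is used here because it penalizes a low rank on any single criterion more strongly than an arithmetic mean would; other weighting schemes are equally plausible.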
Tips & Gotchas
Compute requirements: The characterization step runs BLAST searches against several large databases, which can require more than 100 GB of disk space and substantial compute time.
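A quick pre-flight check can catch a full disk before a long run starts. The sketch below uses only the standard library; the 100 GB default simply mirrors the rough figure mentioned above and is an assumption, not a requirement enforced by metawibele.

```python
import shutil


def has_free_space(path, required_gb=100):
    """Return True if the filesystem holding `path` has at least required_gb free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024**3


# Check the current directory; point this at the database or output location.
if not has_free_space(".", required_gb=100):
    print("Warning: less than 100 GB free; databases and outputs may not fit.")
```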
Start with the demo dataset to understand the expected input/output formats before running on your own data.
metawibele integrates with MaAsLin2 for differential abundance testing as part of the prioritization step. Make sure your metadata file is formatted correctly.
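One common formatting problem is a mismatch between the sample IDs in the count table and those in the metadata file. The sketch below checks for that under an assumed layout, which may differ from what metawibele expects: the count table is a TSV whose header lists sample IDs after the first column, and the metadata is a TSV with one row per sample and the sample ID in the first column. Verify the layout against the demo dataset before relying on it.

```python
import csv
import io


def sample_ids_from_counts(handle):
    """Sample IDs are assumed to be the header columns after the first."""
    header = next(csv.reader(handle, delimiter="\t"))
    return set(header[1:])


def sample_ids_from_metadata(handle):
    """Sample IDs are assumed to be the first column of each data row."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header row
    return {row[0] for row in reader if row}


def compare_samples(counts_handle, metadata_handle):
    """Return (samples missing from metadata, samples missing from counts)."""
    in_counts = sample_ids_from_counts(counts_handle)
    in_metadata = sample_ids_from_metadata(metadata_handle)
    return sorted(in_counts - in_metadata), sorted(in_metadata - in_counts)


# Made-up in-memory files standing in for counts.tsv and a metadata table.
counts = io.StringIO("family\tS1\tS2\tS3\nCluster_1\t5\t0\t3\n")
metadata = io.StringIO("sample\tdiagnosis\nS1\tcase\nS2\tcontrol\nS4\tcase\n")
missing_in_metadata, missing_in_counts = compare_samples(counts, metadata)
print("In counts but not metadata:", missing_in_metadata)
print("In metadata but not counts:", missing_in_counts)
```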