Tutorial¶
A step-by-step guide for running BASALT with demo data.
Prerequisites¶
git clone https://github.com/PKU-EMBL/BASALT-Air.git
cd BASALT-Air
# Edit pixi.toml to set BASALT_WEIGHT and CHECKM2DB paths
pixi install
pixi shell
conda activate basalt_env
- Demo data downloaded from the Google Drive
- At least 32 cores and 128 GB RAM recommended
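Checking the hardware recommendation before starting a long run can save time. The sketch below is one way to do that on a Linux host; the function names are hypothetical, and it assumes `nproc` and `/proc/meminfo` are available.

```shell
# Return 0 if the given core count and RAM (in GB) meet the minimums.
# Defaults mirror the recommendation above: 32 cores, 128 GB.
meets_requirements() {
    local cores=$1 mem_gb=$2 min_cores=${3:-32} min_mem_gb=${4:-128}
    [ "$cores" -ge "$min_cores" ] && [ "$mem_gb" -ge "$min_mem_gb" ]
}

# Detect resources on a Linux host (assumes nproc and /proc/meminfo exist).
detect_cores() { nproc; }
detect_mem_gb() {
    # MemTotal is reported in kB; convert to whole GB.
    awk '/^MemTotal:/ {printf "%d\n", $2 / 1024 / 1024}' /proc/meminfo
}

# Usage:
#   meets_requirements "$(detect_cores)" "$(detect_mem_gb)" \
#       || echo "Below the recommended 32 cores / 128 GB" >&2
```

The kernel usually reports slightly less memory than the nominal installed amount, so a machine sold as 128 GB may need a lower fourth argument (for example 120) to pass.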
Demo Files¶
The BASALT demo files/ directory contains:
| File | Description |
|---|---|
| Data.tar.gz | Short-read and long-read FASTQ files and an OPERA-MS assembled contig file |
| Final_bestbinset.tar.gz | Expected final output bins for validation |
| basalt.sh | Shell script to run the demo |
Extract the data:
tar -zxf BASALT\ demo\ files/Data.tar.gz
A 32-core workstation with 128 GB RAM is expected to complete the demo in ~6 hours.
Example 1: Short-Read Only, Single Assembly¶
Step 1: Prepare Working Directory¶
mkdir basalt_test && cd basalt_test
ln -s /path/to/assembly.fasta .
ln -s /path/to/sample_R1.fq .
ln -s /path/to/sample_R2.fq .
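Because the inputs are symlinks, a mistyped source path leaves a dangling link that is easy to miss until the run fails. A minimal pre-flight sketch, with a hypothetical helper name:

```shell
# Print each missing or empty input file; return non-zero if any were found.
# The -s test follows symlinks, so a dangling link is reported as missing.
check_inputs() {
    local rc=0 f
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Usage:
#   check_inputs assembly.fasta sample_R1.fq sample_R2.fq || exit 1
```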
Step 2: Activate Environment¶
conda activate basalt_env
Step 3: Run BASALT¶
BASALT -a assembly.fasta \
-s sample_R1.fq,sample_R2.fq \
-t 60 -m 250
Step 4: Check Results¶
ls Final_binset_final_binset/
# List of MAG FASTA files
Example 2: Multi-Assembly with Short + Long Reads¶
Step 1: Run¶
BASALT -a as1.fa,as2.fa,as3.fa \
-s s1_R1.fq,s1_R2.fq/s2_R1.fq,s2_R2.fq/s3_R1.fq,s3_R2.fq \
-l lr1.fastq,lr2.fastq,lr3.fastq \
-t 64 -m 256
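In the command above, the `-s` value separates the two mates of a sample with a comma and separates samples with a slash. With many samples this string is tedious to type by hand, so it can be generated. The helper below is a sketch with a hypothetical name, and it assumes files follow the `<prefix>_R1.fq` / `<prefix>_R2.fq` naming used in these examples.

```shell
# Join per-sample read pairs into the -s format shown above:
# mates separated by ",", samples separated by "/".
# Assumes files are named <prefix>_R1.fq and <prefix>_R2.fq.
build_short_read_arg() {
    local out="" prefix
    for prefix in "$@"; do
        # ${out:+$out/} adds the "/" separator only after the first sample.
        out="${out:+$out/}${prefix}_R1.fq,${prefix}_R2.fq"
    done
    printf '%s\n' "$out"
}

# Usage:
#   BASALT -a as1.fa,as2.fa,as3.fa -s "$(build_short_read_arg s1 s2 s3)" -t 64 -m 256
```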
Step 2: What happens¶
- Autobinning: MetaBAT2, CONCOCT, and SemiBin2 (sensitive preset) run on all 3 assemblies.
- Bin Selection: Within each assembly, the best bins are selected. Then cross-assembly dereplication removes redundancies.
- Refinement: A deep-learning model removes contamination. Paired-end tracking and long-read connections retrieve missed contigs. Long reads also trigger polishing (Racon).
- Reassembly: SPAdes reassembles each bin to improve contiguity.
Example 3: Quick Mode¶
For large or complex datasets where runtime is a concern:
BASALT -a as1.fa,as2.fa \
-s s1_R1.fq,s1_R2.fq/s2_R1.fq,s2_R2.fq \
-t 64 -m 256 \
--sensitive quick \
--refinepara quick
This uses fewer binners and parameter combinations, and skips deep contig retrieval.
Example 4: Only Run Refinement on Existing Bins¶
If you already have bins from another tool and want to use BASALT's deep-learning refinement:
BASALT -a assembly.fa \
-s sample_R1.fq,sample_R2.fq \
-r My_Bins_Folder \
-c Coverage_matrix_for_binning_assembly.fa.txt \
-t 32 -m 128
Example 5: Data Feeding (Multiple External Binsets)¶
Import bins from VAMB, manual curation, or other tools for refinement:
BASALT -s sample_R1.fq,sample_R2.fq \
-d vamb_output,my_curated_bins \
--binset-index 1 \
-t 32 -m 128
This will:
1. Reindex bins from both folders
2. Recompute coverage matrices
3. Generate paired-end connections
4. Run CheckM2 quality assessment
5. Proceed to Refinement
Running on HPC Clusters¶
SLURM Example¶
#!/bin/bash
#SBATCH --job-name=basalt
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=64
#SBATCH --mem=256G
#SBATCH --time=48:00:00
source ~/.bashrc
conda activate basalt_env
BASALT -a as1.fa,as2.fa \
-s s1_R1.fq,s1_R2.fq/s2_R1.fq,s2_R2.fq \
-t 64 -m 250 \
--mode new
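The script above hard-codes `-t` and `-m` separately from the `#SBATCH` requests, so the two can drift apart when one is edited. One option is to derive them from the environment SLURM sets for the job. This is a sketch only: the function name, the fallback values, and the headroom argument are hypothetical, and it assumes `-m` is given in GB, as the examples here suggest.

```shell
# Print "-t <threads> -m <GB>" derived from SLURM's job environment.
# SLURM_CPUS_PER_TASK is set when --cpus-per-task is requested;
# SLURM_MEM_PER_NODE is set (in MB) when --mem is requested.
# The fallbacks (1 thread, 4096 MB) are arbitrary placeholders.
basalt_resource_args() {
    local headroom_gb=${1:-0}
    local threads=${SLURM_CPUS_PER_TASK:-1}
    local mem_mb=${SLURM_MEM_PER_NODE:-4096}
    printf -- '-t %d -m %d\n' "$threads" $(( mem_mb / 1024 - headroom_gb ))
}

# Usage inside the job script (left unquoted on purpose so it splits into words):
#   BASALT -a as1.fa,as2.fa -s s1_R1.fq,s1_R2.fq/s2_R1.fq,s2_R2.fq \
#       $(basalt_resource_args 6) --mode new
```

With `--cpus-per-task=64` and `--mem=256G`, a headroom of 6 reproduces the `-t 64 -m 250` used in the script above.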
Singularity with SLURM¶
#!/bin/bash
#SBATCH --job-name=basalt_sif
#SBATCH --cpus-per-task=32
#SBATCH --mem=128G
singularity run -B $PWD basalt.sif BASALT \
-a as1.fa -s s1_R1.fq,s1_R2.fq \
-t 32 -m 128
Verifying Results¶
After a successful run, check the quality of your MAGs:
# Check completeness and contamination
head Final_binset_final_binset/quality_report.tsv
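# Count MAGs meeting quality thresholds
# (assumes completeness is column 3 and contamination is column 4; verify against the header first)
awk -F'\t' 'NR>1 && $3>=50 && $4<10 {count++} END {print (count+0) " MAGs ≥50% complete, <10% contamination"}' \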
Final_binset_final_binset/quality_report.tsv
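Filtering by column position breaks silently if the report layout differs from what the command expects. A more defensive sketch looks the columns up by header name. The function name is hypothetical, and the header names `Completeness` and `Contamination` are an assumption based on CheckM2's standard report, so confirm them with `head -1` first.

```shell
# Count rows with Completeness >= min and Contamination < max,
# locating both columns by header name instead of by position.
count_quality_mags() {
    local report=$1 min_comp=${2:-50} max_cont=${3:-10}
    awk -F'\t' -v min="$min_comp" -v max="$max_cont" '
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                if ($i == "Completeness")  comp = i
                if ($i == "Contamination") cont = i
            }
            if (!comp || !cont) {
                print "required columns not found" > "/dev/stderr"
                bad = 1
                exit
            }
            next
        }
        $comp + 0 >= min && $cont + 0 < max { n++ }
        END { if (bad) exit 1; print n + 0 }
    ' "$report"
}

# Usage:
#   count_quality_mags Final_binset_final_binset/quality_report.tsv 50 10
```

The second and third arguments are optional and default to the 50% completeness and 10% contamination thresholds used above.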
Troubleshooting Common Issues¶
Run hangs or crashes¶
- Check Basalt_log.txt for the last completed step.
- Ensure sufficient RAM: BASALT is memory-intensive, especially during binning and reassembly.
- Resume with BASALT --mode continue.
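In a job script that may be resubmitted after a crash, the choice between a fresh start and a resume can be made automatically. The sketch below uses a hypothetical helper and assumes that Basalt_log.txt is written to the working directory and that the original arguments are repeated when resuming.

```shell
# Echo "continue" when a non-empty log from an earlier run exists in the
# given directory (default: current directory), otherwise echo "new".
pick_mode() {
    local workdir=${1:-.}
    if [ -s "$workdir/Basalt_log.txt" ]; then
        echo continue
    else
        echo new
    fi
}

# Usage:
#   tail -n 20 Basalt_log.txt     # inspect the last completed step
#   BASALT -a assembly.fasta -s sample_R1.fq,sample_R2.fq -t 60 -m 250 \
#       --mode "$(pick_mode)"
```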
Low MAG count¶
- Try --sensitive more-sensitive for more thorough binning.
- Lower the --min-cpn threshold (e.g., --min-cpn 20).
- Ensure sufficient sequencing depth: low-coverage samples produce fewer recoverable MAGs.
- Consider using --refinepara deep for more aggressive contig retrieval.
Error: CheckM2 database not found¶
Download the CheckM2 database:
checkm2 database --download
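This error can also appear when the database exists but the CHECKM2DB path set during installation points somewhere else. A quick check, sketched with a hypothetical function name, is to confirm that the variable is set and that the path it names exists:

```shell
# Return 0 if CHECKM2DB is set and points to an existing path.
checkm2_db_ok() {
    [ -n "${CHECKM2DB:-}" ] && [ -e "$CHECKM2DB" ]
}

# Usage:
#   checkm2_db_ok || echo "CHECKM2DB is unset or points to a missing path: ${CHECKM2DB:-<unset>}" >&2
```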
Refer to the FAQ for more troubleshooting guidance.