Deconvolve bulk sequencing samples with DeepDETAILS

Prepare datasets

DeepDETAILS requires the following input files:

  • Strand-specific signals for the bulk library (bigWig format)
  • Region of interests (e.g. peaks) in the bulk library (bed format)
  • Aligned fragments from the reference sc/snATAC-seq (bed-like tabular format, required columns: chrom, chromStart, chromEnd, barcode, and readSupport).
    chr1	9992	10068	TCACAAGTCTAGCAAC	1
    chr1	9992	10068	TCTAGTTTCGGTGATT	1
    chr1	9993	10061	GGAGTAGTCCACTAGA	1
    chr1	9994	10061	ACCAAACTCCTGTGGG	1
    chr1	9994	10074	CTTGAAGGTAATGCAA	1
    chr1	9994	10085	CCGAAGCAGGGACGTT	1
    chr1	9994	10086	CTTAATCCATTTGTTC	1
    chr1	9995	10061	GAGTGAGTCGAGCGCT	1
    chr1	9995	10078	CCATACCCAGCAATGG	1
  • Cell type annotation for each cell barcode (tabular format, required columns: barcode and cell type annotation).
    TCACAAGTCTAGCAAC	cellTypeA
    TCTAGTTTCGGTGATT	cellTypeB
    GGAGTAGTCCACTAGA	cellTypeC
    ACCAAACTCCTGTGGG	cellTypeD
    CTTGAAGGTAATGCAA	cellTypeC
    CCGAAGCAGGGACGTT	cellTypeA
  • Reference genome sequence (fasta format). Example file: hg38, mm10
  • Chromosome size. Example file: hg38, mm10

Process bulk samples

You can use the pipeline of your own choice to process bulk samples, the requirements from DeepDETAILS are:

  1. BigWig files for the signals on the forward and reverse strands respectively.
  2. Peak regions in bed format.

For example, you can use the pipeline here to process sequencing library from TSS-assays like PRO/GRO-cap.

If your pipeline doesn't produce bigWig files, but it generates BAM files, then your can use third-party tools, such as bamCoverage or bedtools to convert them into bigWig files.

bamCoverage -b input.bam -o pl.bw --binSize 1 --filterRNAstrand forward
bamCoverage -b input.bam -o mn.bw --binSize 1 --filterRNAstrand reverse

Process reference samples

You can use the pipeline of your own choice to process reference single-cell/single-nucleus ATAC-seq samples. After clustering cells and annotating the cell types of each cluster, you can follow examples below to export the fragments and annotation files required by DeepDETAILS.

# assume your Signac/Seurat object is `scexp`
# use cluster labels from non-linear dimension reduction and clustering
write.table(
  as.data.frame(scexp$seurat_clusters),
  'ann.tsv',
  sep = '\t',
  quote = F,
  col.names = F,
  row.names = T
)

# or if you have integrated labels from other scRNA-seq experiments
write.table(
  as.data.frame(scexp$predicted.id),
  'ann.tsv',
  sep = '\t',
  quote = F,
  col.names = F,
  row.names = T
)

# assume your ArchR project object is `proj`
meta_df <- getCellColData(proj, select='Clusters')

# Note: the cell barcode in ArchR maybe concatenated with the sample name
# In this case, the row names of meta_df looks like
# DeepDETAILS requires a strict match in the fragments file and the annotation
# so you may need to run the followling line to remove sample name
# rownames(meta_df) <- sub("^[^#]+#", "", rownames(meta_df))

write.table(
  meta_df, 'ann.tsv',
  sep = '\t',
  quote = F,
  col.names = F,
  row.names = T)

The mapping between cell barcodes and cluster labels are available in the output folder outs/analysis/clustering. The exported file is in csv format. You can use the following shell command in your terminal to get the required input for DeepDETAILS:
tail -n +2 Graph-Based.csv | sed 's/,/\t/g' > ann.tsv

If you want DeepDETAILS to perform a preflight check (removing cell types in the reference but not in the bulk), you need to tell DeepDETAILS where are the accessible regions. Depend on your pipeline, you can use

  • peaks.bed, for 10x Cell Ranger ATAC
  • atac_peaks.bed, for 10x Cell Ranger Arc
  • other peak calls from conventional peak callers like MACS2.

Build DeepDETAILS datasets

DeepDETAILS stores datasets in hdf5 files. After you have the required input files ready, you can build a deepDETAILS dataset with the following command:

deepdetails prep-data \
    --bulk-pl bulk.pl.bw \
    --bulk-mn bulk.mn.bw \
    --regions bulk.peaks.bed \
    --fragments fragments.tsv.gz \
    --barcodes barcodes.tsv \
    --accessible-regions atac_peaks.bed \
    --save-to ./dataset \
    --genome-fa /refs/sequences/human/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta \
    --chrom-size /refs/refdata-cellranger-arc-GRCh38-2020-A-2.0.0/star/chrNameLength.txt

A successful execution of the above command will create the following files:

  • data.h5: The data file for deconvolution
  • regions.csv: Extended bulk peak regions and background regions
  • scatac.norm.csv: Normalization factors for each cell type
  • *.fragments.bw: Pseudo-bulk ATAC-seq tracks for each cell type
  • background_sampling.png: Diagnostic plot for the background sampling process
  • signatures.csv/png: When preflight is enabled, these files give you the signal distribution across the signature regions

Deconvolve signals

After building the dataset folder, you can run the deconvolution process (requires GPU):

deepdetails deconv \
    --dataset ./dataset \
    --save-to . \
    --study-name sample-a \
    --save-preds

The outputs from a successful deconvolution process look like the following:

.
├── sample-a
│   └── 250212144109: The folder containing deconvolution results (name changes according to the time).
│       ├── epoch=0-step=2538.ckpt: Trained model
│       ├── hparams.yaml: Hyperparamters
│       ├── metrics.csv: Training log
...
│       └── preview972.0.131072.0000.s21250212144109.png: Preview genome browser views
├── sample-a.counts.csv.gz: Deconvolved read counts (1-kb resolution) for each cell type
└── sample-a.predictions.h5: Deconvolved signal (1-bp resolution) for each cell type

Visualize tracks

This step exports deconvolved signal tracks (bigWig) to visualize the signals in each cell type, and it's optional. You need to locate the exported predictions from the previous step by looking for files like sample-a.predictions.h5. After you get the file, you can run the following command:

deepdetails build-bw \
    -p sample-name.predictions.h5 \
    --save-to . \
    --chrom-size /refs/refdata-cellranger-arc-GRCh38-2020-A-2.0.0/star/chrNameLength.txt

You should be able to see deconvolved signal tracks for each cell type (named like cell_type.pl.bw / cell_type.mn.bw) in the output directory after the command finishes.