DeepDETAILS requires the following input files:
chr1 9992 10068 TCACAAGTCTAGCAAC 1 chr1 9992 10068 TCTAGTTTCGGTGATT 1 chr1 9993 10061 GGAGTAGTCCACTAGA 1 chr1 9994 10061 ACCAAACTCCTGTGGG 1 chr1 9994 10074 CTTGAAGGTAATGCAA 1 chr1 9994 10085 CCGAAGCAGGGACGTT 1 chr1 9994 10086 CTTAATCCATTTGTTC 1 chr1 9995 10061 GAGTGAGTCGAGCGCT 1 chr1 9995 10078 CCATACCCAGCAATGG 1
TCACAAGTCTAGCAAC cellTypeA TCTAGTTTCGGTGATT cellTypeB GGAGTAGTCCACTAGA cellTypeC ACCAAACTCCTGTGGG cellTypeD CTTGAAGGTAATGCAA cellTypeC CCGAAGCAGGGACGTT cellTypeA
You can use the pipeline of your own choice to process bulk samples, the requirements from DeepDETAILS are:
For example, you can use the pipeline here to process sequencing library from TSS-assays like PRO/GRO-cap.
If your pipeline doesn't produce bigWig files, but it generates BAM files, then your can use third-party tools, such as bamCoverage or bedtools to convert them into bigWig files. Example
bamCoverage -b input.bam -o pl.bw --binSize 1 --filterRNAstrand forward bamCoverage -b input.bam -o mn.bw --binSize 1 --filterRNAstrand reverse
You can use the pipeline of your own choice to process reference single-cell/single-nucleus ATAC-seq samples. After clustering cells and annotating the cell types of each cluster, you can follow examples below to export the fragments and annotation files required by DeepDETAILS.
# assume your Signac/Seurat object is `scexp` # use cluster labels from non-linear dimension reduction and clustering write.table( as.data.frame(scexp$seurat_clusters), 'ann.tsv', sep = '\t', quote = F, col.names = F, row.names = T ) # or if you have integrated labels from other scRNA-seq experiments write.table( as.data.frame(scexp$predicted.id), 'ann.tsv', sep = '\t', quote = F, col.names = F, row.names = T )
# assume your ArchR project object is `proj` meta_df <- getCellColData(proj, select='Clusters') # Note: the cell barcode in ArchR maybe concatenated with the sample name # In this case, the row names of meta_df looks like # DeepDETAILS requires a strict match in the fragments file and the annotation # so you may need to run the followling line to remove sample name # rownames(meta_df) <- sub("^[^#]+#", "", rownames(meta_df)) write.table( meta_df, 'ann.tsv', sep = '\t', quote = F, col.names = F, row.names = T)
outs/analysis/clustering
.
The exported file is in csv format. You can use the following shell command in your terminal to get the required input for DeepDETAILS:
tail -n +2 Graph-Based.csv | sed 's/,/\t/g' > ann.tsv
If you want DeepDETAILS to perform a preflight check (removing cell types in the reference but not in the bulk), you need to tell DeepDETAILS where are the accessible regions. Depend on your pipeline, you can use
peaks.bed
, for 10x Cell Ranger ATACatac_peaks.bed
, for 10x Cell Ranger ArcDeepDETAILS stores datasets in hdf5 files. After you have the required input files ready, you can build a deepDETAILS dataset with the following command:
deepdetails prep-data \ --bulk-pl bulk.pl.bw \ --bulk-mn bulk.mn.bw \ --regions bulk.peaks.bed \ --fragments fragments.tsv.gz \ --barcodes barcodes.tsv \ --accessible-regions atac_peaks.bed \ --save-to ./dataset \ --genome-fa /refs/sequences/human/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta \ --chrom-size /refs/refdata-cellranger-arc-GRCh38-2020-A-2.0.0/star/chrNameLength.txt
A successful execution of the above command will create the following files:
data.h5
: The data file for deconvolutionregions.csv
: Extended bulk peak regions and background regionsscatac.norm.csv
: Normalization factors for each cell type*.fragments.bw
: Pseudo-bulk ATAC-seq tracks for each cell typebackground_sampling.png
: Diagnostic plot for the background sampling processsignatures.csv/png
: When preflight is enabled, these files give you the signal distribution across the signature regionsAfter building the dataset folder, you can run the deconvolution process (requires GPU):
deepdetails deconv \ --dataset ./dataset \ --save-to . \ --study-name sample-a \ --save-preds
The outputs from a successful deconvolution process look like the following:
.
├── sample-a
│ └── 250212144109: The folder containing deconvolution results (name changes according to the time).
│ ├── epoch=0-step=2538.ckpt: Trained model
│ ├── hparams.yaml: Hyperparamters
│ ├── metrics.csv: Training log
...
│ └── preview972.0.131072.0000.s21250212144109.png: Preview genome browser views
├── sample-a.counts.csv.gz: Deconvolved read counts (1-kb resolution) for each cell type
└── sample-a.predictions.h5: Deconvolved signal (1-bp resolution) for each cell type
This step exports deconvolved signal tracks (bigWig) to visualize the signals in each cell type, and it's optional.
You need to locate the exported predictions from the previous step by looking for files like sample-a.predictions.h5
.
After you get the file, you can run the following command:
deepdetails build-bw \ -p sample-name.predictions.h5 \ --save-to . \ --chrom-size /refs/refdata-cellranger-arc-GRCh38-2020-A-2.0.0/star/chrNameLength.txt
You should be able to see deconvolved signal tracks for each cell type (named like cell_type.pl.bw / cell_type.mn.bw) in the output directory after the command finishes.