10k PBMCs with multiple SCENIC iterations¶
This tutorial follows closely the following 10x PBMC case study:
However, here we focus on setting up the workflow for running SCENIC multiple times and automatically integrating the results. Please refer to the above case study for the explanation of the remainder of the setup steps.
Setup the VSN-pipelines project¶
Update the repository¶
Pull/update the vsn-pipelines repository cached by nextflow.
Here, we use the -r
flag to specify the pipeline version to use:
nextflow pull vib-singlecell-nf/vsn-pipelines -r v0.25.0
Build the config file¶
We use a combination of profiles to build the config file:
tenx
: defines the input data typesingle_sample_scenic
: loads the basic parameters to run the single_sample and scenic workflowsscenic_multiruns
: loads the multiruns parametersscenic_use_cistarget_motifs
andscenic_use_cistarget_tracks
: includes parameters to specify the location of the cistarget database fileshg38
: specifies the genome. Other options are:hg19
,dm6
,mm10
.singularity
(ordocker
): specifies container system to use to run the processes
nextflow config vib-singlecell-nf/vsn-pipelines \
-profile tenx,single_sample_scenic,scenic_multiruns,scenic_use_cistarget_motifs,scenic_use_cistarget_tracks,hg38,singularity \
> pbmc10k.vsn-pipelines.complete.config
Important SCENIC multi-runs parameters to check in the config:
params.sc.scenic.numRuns
: the number of SCENIC iterations to run. Using10
will give a good balance with computation time, while100
will provide an exhaustive evaluation of the regulon and target gene characteristics.params.sc.scenic.aucell.min_genes_regulon
: The threshold used for filtering the regulons based on the number of target genes. Regulons are keps only if they contain this number of genes or greater. This parameter should be considered in proportion to the number of runs (default: 5).params.sc.scenic.aucell.min_regulon_gene_occurrence
: The threshold used for filtering the genes bases on their occurrence. Target genes that occur this many times or less across all of the runs are discarded. This parameter should be considered in proportion to the number of runs (default: 5).
Specifying compute resource usage in the config:
- The global executor (
process.executor
) is set tolocal
by default. It can be changed toqsub
, etc. to run specific processes as jobs. The executor parameter can be added to specific labels to run only these processes as jobs. Typically the GRN step should be submitted as a job (compute_resources__scenic_grn
), especially for the multi-runs approach. - The number of cpus and memory usage can be adjusted for each label.
The complete config file used here is available at: pbmc10k/pbmc10k.vsn-pipelines.scenic-multiruns.complete.config.
Run the VSN-pipelines project¶
Testing the settings (optional)¶
With the SCENIC multi-runs approach, it is highly recommended to test the approach with params.sc.scenic.numRuns
set to a small number (e.g. 2
).
It may also be useful to use a small test dataset for this purpose:
wget https://raw.githubusercontent.com/aertslab/SCENICprotocol/master/example/sample_data_small.tar.gz
tar xzvf sample_data_small.tar.gz
This can be loaded in the config by temporarily replacing the input data with:
data {
tenx {
cellranger_mex = 'sample_data/outs'
}
}
As this is a small dataset and the numRuns
parameter is set to a small number, also change these settings:
scanpy {
filter {
cellFilterMinNGenes = 1
}
neighborhood_graph {
nPcs = 2
}
dim_reduction {
pca {
nComps = 2
}
}
}
scenic {
aucell {
min_genes_regulon = 0
min_regulon_gene_occurrence = 0
}
}
First pass¶
As in the original PBMC10k tutorial, we can first run without the SCENIC steps to get the filtering and other parameters set correctly.
Re-run the pipeline as many times as needed (with resume
to skip alread-completed steps) to select the proper filters:
nextflow -C pbmc10k.vsn-pipelines.complete.config \
run vib-singlecell-nf/vsn-pipelines \
-entry single_sample \
-r v0.25.0 -resume
Second pass¶
With the parameters set, the full multi-runs workflow can be run:
nextflow -C pbmc10k.vsn-pipelines.complete.config \
run vib-singlecell-nf/vsn-pipelines \
-entry single_sample_scenic \
-r v0.25.0 -resume
This can take a long time to run, depending on the number of iterations used.
Results¶
Once the pipeline is complete, the output will be the following files (display truncated):
$ tree out
out/
├── data
│ ├── intermediate
│ │ └── [...]
│ └── pbmc10k.PBMC10k_DSL2.single_sample.output.h5ad
├── loom
│ ├── pbmc10k.SCENIC_SCope_output.loom
│ └── pbmc10k.SCope_output.loom
├── nextflow_reports
│ ├── execution_report.html
│ ├── execution_timeline.html
│ ├── execution_trace.txt
│ └── pipeline_dag.dot
├── notebooks
│ ├── intermediate
│ ├── pbmc10k.merged_report.html
│ ├── pbmc10k.merged_report.ipynb
│ ├── pbmc10k.merged_report.louvain_0.4.html
│ ├── pbmc10k.merged_report.louvain_0.4.ipynb
│ ├── pbmc10k.merged_report.louvain_0.8.html
│ └── pbmc10k.merged_report.louvain_0.8.ipynb
└── scenic
└── pbmc10k
├── arboreto_with_multiprocessing
│ ├── run_1
│ │ └── sample_data__run_1__adj.tsv
│ └── run_2
│ └── sample_data__run_2__adj.tsv
├── aucell
│ ├── run_1
│ │ └── sample_data__run_1__auc_mtf.loom
│ └── run_2
│ └── sample_data__run_2__auc_mtf.loom
├── cistarget
│ ├── run_1
│ │ └── sample_data__run_1__reg_mtf.csv.gz
│ └── run_2
│ └── sample_data__run_2__reg_mtf.csv.gz
├── multi_runs_aucell
│ └── multi_runs_regulons_auc_mtf.tsv
├── multi_runs_cistarget
│ ├── multi_runs_features_mtf.csv.gz
│ └── multi_runs_regulons_mtf.pkl.gz
├── multi_runs_looms
│ └── multi_runs_regulons_auc_mtf.loom
├── multi_runs_regulons_mtf
│ ├── BRF1(+).tsv
│ ├── FOS(+).tsv
│ └── regulons.tsv
├── notebooks
│ ├── SCENIC_report.html
│ └── SCENIC_report.ipynb
└── SCENIC_SCope_output.loom
The final output is similar to that of a SCENIC pipeline with a single iteration, with the exception of some additional files being stored in out/scenic/<sampleId>
:
- GRN, cisTarget, and AUCell outputs for each run/iteration in their respective directories.
- In the
out/scenic/<sampleId>/multi_runs_regulons_mtf/
directory (and optionallymulti_runs_regulons_trk
if track databases were used):regulons.tsv
: Contains a list of all regulons found, along with their occurrence (count) across all SCENIC iterations.- A file for each regulon (e.g.
BRF1(+).tsv
). Each file contains two columns: 1) the target gene name, and 2) the number of times that gene occurred across all SCENIC iterations.
The final SCENIC output is packaged into a loom file, which includes the results of the multi-runs analysis.
This can be found at out/scenic/<sampleId>/SCENIC_SCope_output.loom
, and is ready to be uploaded to a SCope session.