Introduction

Performing comprehensive quality control (QC) is necessary to remove poor quality cells for downstream analysis of single-cell RNA sequencing (scRNA-seq) data. Within droplet-based scRNA-seq data, droplets containing cells must be differentiated from empty droplets. Therefore, assessment of the data is required, for which various QC algorithms have been developed. In singleCellTK (SCTK), we have written convenience functions for several of these tools. In this guide, we will demonstrate how to use these functions to perform quality control on unfiltered, droplet data. (For definition of droplet data, please refer this documentation.)

The package can be loaded using the library() command.


Running QC on droplet raw data

Load PBMC4k data from 10X

SCTK takes in a SingleCellExperiment object from the SingleCellExperiment package. We will utilize the 10X PBMC 4K dataset as an example. For the QC of droplet-based counts data, we will install the dataset from the 10X Genomics website using the BiocFileCache package.

# Install BiocFileCache if is it not already
if (!requireNamespace("BiocFileCache", quietly = TRUE)) {
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  BiocManager::install("BiocFileCache")
}

library("BiocFileCache")
bfc <- BiocFileCache::BiocFileCache("raw_data", ask = FALSE)
raw.path <- bfcrpath(bfc, file.path(
  "http://cf.10xgenomics.com/samples",
  "cell-exp/2.1.0/pbmc4k/pbmc4k_raw_gene_bc_matrices.tar.gz"
))
untar(raw.path, exdir = file.path(tempdir(), "pbmc4k"))

fname <- file.path(tempdir(), "pbmc4k/raw_gene_bc_matrices/GRCh38")
pbmc4k.droplet <- DropletUtils::read10xCounts(fname, col.names = TRUE)

### change the sample column
names(colData(pbmc4k.droplet)) <- c("sample", "Barcode")
colData(pbmc4k.droplet)$sample <- rep("pbmc4k", ncol(colData(pbmc4k.droplet)))

SCTK also supports the importing of single-cell data from the following platforms: 10X CellRanger, STARSolo, BUSTools, SEQC, DropEST, Alevin, as well as dataset already stored in SingleCellExperiment object and AnnData object. To load your own input data, please refer Function Reference for pre-processing tools under Console Analysis section in Import data into SCTK for detailed instruction.

runDropletQC

All droplet-based QC functions are able to be run under the wrapper function runDropletQC(). By default all possible QC algorithms will be run.

pbmc4k.droplet <- runDropletQC(pbmc4k.droplet)
## Sat Mar 18 10:54:26 2023 ... Running 'perCellQCMetrics'
## Sat Mar 18 10:54:26 2023 ...... Attempting to find mitochondrial genes by identifying features in 'rownames' that match mitochondrial genes from reference 'human' and ID type 'ensembl'.
## Sat Mar 18 10:54:28 2023 ... Running 'emptyDrops'
## Sat Mar 18 10:54:44 2023 ... Running 'barcodeRanks'

If users choose to only run a specific set of algorithms, they can specify which to run with the algorithms parameter.

After running QC functions with SCTK, the output will be stored in the colData slot of the SingleCellExperiment object.

head(colData(pbmc4k.droplet), 5)
## DataFrame with 5 rows and 19 columns
##                         sample            Barcode       sum  detected
##                    <character>        <character> <numeric> <integer>
## AGAATAGGTCACACGC-1      pbmc4k AGAATAGGTCACACGC-1        32        30
## GCATGATGTGTGCCTG-1      pbmc4k GCATGATGTGTGCCTG-1         0         0
## GCATGTACATTGCGGC-1      pbmc4k GCATGTACATTGCGGC-1         0         0
## CCCTCCTAGGACACCA-1      pbmc4k CCCTCCTAGGACACCA-1         0         0
## CTCGTACGTTGTGGAG-1      pbmc4k CTCGTACGTTGTGGAG-1         0         0
##                    percent.top_50 percent.top_100 percent.top_200
##                         <numeric>       <numeric>       <numeric>
## AGAATAGGTCACACGC-1            100             100             100
## GCATGATGTGTGCCTG-1            NaN             NaN             NaN
## GCATGTACATTGCGGC-1            NaN             NaN             NaN
## CCCTCCTAGGACACCA-1            NaN             NaN             NaN
## CTCGTACGTTGTGGAG-1            NaN             NaN             NaN
##                    percent.top_500  mito_sum mito_detected mito_percent
##                          <numeric> <numeric>     <integer>    <numeric>
## AGAATAGGTCACACGC-1             100         1             1        3.125
## GCATGATGTGTGCCTG-1             NaN         0             0          NaN
## GCATGTACATTGCGGC-1             NaN         0             0          NaN
## CCCTCCTAGGACACCA-1             NaN         0             0          NaN
## CTCGTACGTTGTGGAG-1             NaN         0             0          NaN
##                        total dropletUtils_emptyDrops_total
##                    <numeric>                     <integer>
## AGAATAGGTCACACGC-1        32                            32
## GCATGATGTGTGCCTG-1         0                             0
## GCATGTACATTGCGGC-1         0                             0
## CCCTCCTAGGACACCA-1         0                             0
## CTCGTACGTTGTGGAG-1         0                             0
##                    dropletUtils_emptyDrops_logprob
##                                          <numeric>
## AGAATAGGTCACACGC-1                              NA
## GCATGATGTGTGCCTG-1                              NA
## GCATGTACATTGCGGC-1                              NA
## CCCTCCTAGGACACCA-1                              NA
## CTCGTACGTTGTGGAG-1                              NA
##                    dropletUtils_emptyDrops_pvalue
##                                         <numeric>
## AGAATAGGTCACACGC-1                             NA
## GCATGATGTGTGCCTG-1                             NA
## GCATGTACATTGCGGC-1                             NA
## CCCTCCTAGGACACCA-1                             NA
## CTCGTACGTTGTGGAG-1                             NA
##                    dropletUtils_emptyDrops_limited dropletUtils_emptyDrops_fdr
##                                          <logical>                   <numeric>
## AGAATAGGTCACACGC-1                              NA                          NA
## GCATGATGTGTGCCTG-1                              NA                          NA
## GCATGTACATTGCGGC-1                              NA                          NA
## CCCTCCTAGGACACCA-1                              NA                          NA
## CTCGTACGTTGTGGAG-1                              NA                          NA
##                    dropletUtils_BarcodeRank_Knee
##                                        <integer>
## AGAATAGGTCACACGC-1                             0
## GCATGATGTGTGCCTG-1                             0
## GCATGTACATTGCGGC-1                             0
## CCCTCCTAGGACACCA-1                             0
## CTCGTACGTTGTGGAG-1                             0
##                    dropletUtils_BarcodeRank_Inflection
##                                              <integer>
## AGAATAGGTCACACGC-1                                   0
## GCATGATGTGTGCCTG-1                                   0
## GCATGTACATTGCGGC-1                                   0
## CCCTCCTAGGACACCA-1                                   0
## CTCGTACGTTGTGGAG-1                                   0
sample Barcode sum detected percent.top_50 percent.top_100 percent.top_200 percent.top_500 mito_sum mito_detected mito_percent total dropletUtils_emptyDrops_total dropletUtils_emptyDrops_logprob dropletUtils_emptyDrops_pvalue dropletUtils_emptyDrops_limited dropletUtils_emptyDrops_fdr dropletUtils_BarcodeRank_Knee dropletUtils_BarcodeRank_Inflection
AGAATAGGTCACACGC-1 pbmc4k AGAATAGGTCACACGC-1 32 30 100 100 100 100 1 1 3.125 32 32 NA NA NA NA 0 0
GCATGATGTGTGCCTG-1 pbmc4k GCATGATGTGTGCCTG-1 0 0 NaN NaN NaN NaN 0 0 NaN 0 0 NA NA NA NA 0 0
GCATGTACATTGCGGC-1 pbmc4k GCATGTACATTGCGGC-1 0 0 NaN NaN NaN NaN 0 0 NaN 0 0 NA NA NA NA 0 0
CCCTCCTAGGACACCA-1 pbmc4k CCCTCCTAGGACACCA-1 0 0 NaN NaN NaN NaN 0 0 NaN 0 0 NA NA NA NA 0 0
CTCGTACGTTGTGGAG-1 pbmc4k CTCGTACGTTGTGGAG-1 0 0 NaN NaN NaN NaN 0 0 NaN 0 0 NA NA NA NA 0 0


A summary of all outputs
QC output Description Methods Package/Tool
sum Total counts runPerCellQC() scater
detected Total features runPerCellQC() scater
percent_top % Expression coming from top features runPerCellQC() scater
subsets_* sum, detected, percent_top calculated on specified gene list runPerCellQC() scater
dropletUtils_emptyDrops_total Total counts runEmptyDrops() DropletUtils
dropletUtils_emptyDrops_logprob The log-probability of droplet being empty runEmptyDrops() DropletUtils
dropletUtils_emptyDrops_pvalue Monte Carlo p-value of droplet being empty runEmptyDrops() DropletUtils
dropletUtils_emptyDrops_limited Whether a lower p-value could be obtained by increasing niters runEmptyDrops() DropletUtils
dropletUtils_emptyDrops_fdr p-value of droplet being empty, corrected for false detection rate runEmptyDrops() DropletUtils
dropletUtils_BarcodeRank_Knee Whether total UMI count value is higher than knee point runBarcodeRankDrops() DropletUtils
dropletUtils_BarcodeRank_Inflection Whether total UMI count value is higher than inflection point runBarcodeRankDrops() DropletUtils

Running individual QC methods

Instead of running all quality control methods on the dataset at once, users may elect to execute QC methods individually. The parameters as well as the outputs to individual QC functions are described in detail as follows:

runEmptyDrops

It is crucial to distinguish the data occurring from real cells and empty droplets containing ambient RNA. SCTK employs the EmptyDrops algorithm from the DropletUtils package to test for empty droplets. The wrapper function runEmptyDrops() can be used to separately run the EmptyDrops algorithm on its own.

  • lower is the lower bound of the total UMI count, in which all barcodes below the lower bound are assumed to be empty droplets.
  • niters is the number of iterations the function will run for the calculation.
  • testAmbient indicates whether results should be returned for barcodes that have a total UMI count below what is specified in lower.
pbmc4k.droplet <- runEmptyDrops(
  inSCE = pbmc4k.droplet,
  useAssay = "counts",
  lower = 100,
  niters = 10000
)
runBarcodeRankDrops

BarcodeRanks from the DropletUtils package computes barcode rank statistics and identifies the knee and inflection points on the total count curve. The knee and inflection points on the curve represent the difference between empty droplets and cell-containing droplets with much more RNA. The wrapper function runBarcodeRankDrops() can be used to separately run the BarcodeRanks algorithm on its own.

  • lower is the lower bound of the total UMI count, in which all barcodes below the lower bound are assumed to be empty droplets.
pbmc4k.droplet <- runBarcodeRankDrops(
  inSCE = pbmc4k.droplet,
  useAssay = "counts",
  fitBounds = NULL, df = 20
)

Plotting QC metrics

Upon running runDropletQC() or any of the individual droplet QC functions, the QC outputs will need to be plotted. For each QC method, SCTK provides a specialized plotting function.

EmptyDrops

The wrapper function plotEmptyDropsResults() can be used to plot the results from the EmptyDrops algorithm. This will visualize the empty droplets, by plotting the total UMI counts against the log probability for each barcode.

emptyDropsResults <- plotEmptyDropsResults(
  inSCE = pbmc4k.droplet,
  axisLabelSize = 20,
  sample = NULL,
  fdrCutoff = 0.01,
  dotSize = 0.5,
  defaultTheme = TRUE
)
emptyDropsResults$scatterEmptyDrops

Data points are colored by FDR values, where we see a small portion of the dataset contains barcodes that do not meet the threshold.

BarcodeRanks

The wrapper function plotBarcodeRankScatter() can be used to plot the results from the BarcodeRanks algorithm.

plotBarcodeRankScatter(
  inSCE = pbmc4k.droplet,
  title = "BarcodeRanks Rank Plot",
  legendSize = 14
)

The total UMI count of each barcode is plotted against its rank. We can see a steep dropoff of UMI counts around the inflection point, where we see a presumed separation between cell-containing and empty droplets.

Filtering the dataset

SingleCellExperiment objects can be subset by its colData using subsetSCECols(). The colData parameter takes in a character vector of expression(s) which will be used to identify a subset of cells using variables found in the colData of the SingleCellExperiment object. For example, if x is a numeric vector in colData, then setting colData = "x < 5" will return a SingleCellExperiment object where all columns (cells) meet the condition that x is less than 5. The index parameter takes in a numeric vector of indices which should be kept, while bool takes in a logical vector of TRUE or FALSE which should be of the same length as the number of columns (cells) in the SingleCellExperiment object. Please refer to our Filtering documentation for detail.

#Before filtering:
dim(pbmc4k.droplet)
## [1] 33694 60000
pbmc4k.droplet <- subsetSCECols(pbmc4k.droplet, colData = 'dropletUtils_BarcodeRank_Inflection == 1')
pbmc4k.droplet <- subsetSCECols(pbmc4k.droplet, colData = '!is.na(pbmc4k.droplet$dropletUtils_emptyDrops_fdr)')
pbmc4k.droplet <- subsetSCECols(pbmc4k.droplet, colData = 'pbmc4k.droplet$dropletUtils_emptyDrops_fdr < 0.01')
#After filtering:
dim(pbmc4k.droplet)
## [1] 33694   344

We can compare the average total UMI counts per cell before and after cell filtration:

p1 <- plotSCEViolinColData(pbmc4k.droplet.prefilt, coldata = "sum", summary = "mean", title = "Pre-filter", ylab = "Total counts")
p2 <- plotSCEViolinColData(pbmc4k.droplet, coldata = "sum", summary = "mean", title = "Post-filter", ylab = "Total counts")
plot(cowplot::plot_grid(p1, p2, ncol = 2))


For performing QC on cell-filtered count matrix with SCTK, please refer to our Cell QC documentation.


Session Information
## R version 4.2.2 Patched (2022-11-10 r83330)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] BiocFileCache_2.6.1         dbplyr_2.3.1               
##  [3] dplyr_1.1.0                 singleCellTK_2.8.1         
##  [5] DelayedArray_0.24.0         Matrix_1.5-3               
##  [7] SingleCellExperiment_1.20.0 SummarizedExperiment_1.28.0
##  [9] Biobase_2.58.0              GenomicRanges_1.50.2       
## [11] GenomeInfoDb_1.34.9         IRanges_2.32.0             
## [13] S4Vectors_0.36.2            BiocGenerics_0.44.0        
## [15] MatrixGenerics_1.10.0       matrixStats_0.63.0         
## 
## loaded via a namespace (and not attached):
##   [1] ggbeeswarm_0.7.1          colorspace_2.1-0         
##   [3] rprojroot_2.0.3           GSVAdata_1.34.0          
##   [5] scuttle_1.8.4             XVector_0.38.0           
##   [7] BiocNeighbors_1.16.0      fs_1.6.1                 
##   [9] rstudioapi_0.14           farver_2.1.1             
##  [11] ggrepel_0.9.3             bit64_4.0.5              
##  [13] AnnotationDbi_1.60.2      fansi_1.0.4              
##  [15] xml2_1.3.3                codetools_0.2-19         
##  [17] R.methodsS3_1.8.2         sparseMatrixStats_1.10.0 
##  [19] cachem_1.0.7              knitr_1.42               
##  [21] scater_1.26.1             jsonlite_1.8.4           
##  [23] annotate_1.76.0           png_0.1-8                
##  [25] R.oo_1.25.0               graph_1.76.0             
##  [27] HDF5Array_1.26.0          compiler_4.2.2           
##  [29] httr_1.4.5                dqrng_0.3.0              
##  [31] fastmap_1.1.1             limma_3.54.2             
##  [33] cli_3.6.0                 BiocSingular_1.14.0      
##  [35] htmltools_0.5.4           tools_4.2.2              
##  [37] rsvd_1.0.5                gtable_0.3.2             
##  [39] glue_1.6.2                GenomeInfoDbData_1.2.9   
##  [41] rappdirs_0.3.3            Rcpp_1.0.10              
##  [43] jquerylib_0.1.4           pkgdown_2.0.7            
##  [45] vctrs_0.6.0               Biostrings_2.66.0        
##  [47] rhdf5filters_1.10.0       svglite_2.1.1            
##  [49] DelayedMatrixStats_1.20.0 xfun_0.37                
##  [51] stringr_1.5.0             rvest_1.0.3              
##  [53] beachmat_2.14.0           lifecycle_1.0.3          
##  [55] irlba_2.3.5.1             XML_3.99-0.13            
##  [57] edgeR_3.40.2              zlibbioc_1.44.0          
##  [59] scales_1.2.1              ragg_1.2.5               
##  [61] parallel_4.2.2            rhdf5_2.42.0             
##  [63] yaml_2.3.7                curl_5.0.0               
##  [65] gridExtra_2.3             memoise_2.0.1            
##  [67] reticulate_1.28           ggplot2_3.4.1            
##  [69] sass_0.4.5                stringi_1.7.12           
##  [71] RSQLite_2.3.0             highr_0.10               
##  [73] desc_1.4.2                ScaledMatrix_1.6.0       
##  [75] eds_1.0.0                 filelock_1.0.2           
##  [77] BiocParallel_1.32.5       rlang_1.1.0              
##  [79] pkgconfig_2.0.3           systemfonts_1.0.4        
##  [81] bitops_1.0-7              evaluate_0.20            
##  [83] lattice_0.20-45           purrr_1.0.1              
##  [85] Rhdf5lib_1.20.0           labeling_0.4.2           
##  [87] cowplot_1.1.1             bit_4.0.5                
##  [89] tidyselect_1.2.0          GSEABase_1.60.0          
##  [91] magrittr_2.0.3            R6_2.5.1                 
##  [93] generics_0.1.3            DBI_1.1.3                
##  [95] pillar_1.8.1              withr_2.5.0              
##  [97] KEGGREST_1.38.0           RCurl_1.98-1.10          
##  [99] tibble_3.2.0              crayon_1.5.2             
## [101] DropletUtils_1.18.1       utf8_1.2.3               
## [103] rmarkdown_2.20            viridis_0.6.2            
## [105] locfit_1.5-9.7            grid_4.2.2               
## [107] blob_1.2.4                webshot_0.5.4            
## [109] digest_0.6.31             xtable_1.8-4             
## [111] R.utils_2.12.2            textshaping_0.3.6        
## [113] munsell_0.5.0             kableExtra_1.3.4         
## [115] viridisLite_0.4.1         beeswarm_0.4.0           
## [117] vipor_0.4.5               bslib_0.4.2