vignettes/articles/console_analysis_tutorial.Rmd
console_analysis_tutorial.RmdSingle Cell Toolkit (singleCellTK, SCTK) is a package that works on single-cell RNA-seq (scRNAseq) dataset. SCTK allows users to import multiple datasets, perform quality control and a series of preprocessing, get clustering on cells and markers of clusters, and run various downstream analysis. Meanwhile, SCTK also wraps curated workflows for celda and Seurat.
This tutorial will take the real-world scRNAseq dataset as an example, which consists of 2,700 Peripheral Blood Mononuclear Cells (PBMCs) collected from a healthy donor. This dataset (PBMC3K) is available from 10X Genomics and can be found on the 10X website.
SCTK uses the SingleCellExperiment (SCE) object for management of expression matrices, feature/cell annotation data, and metadata. All of the functions have an SCE object as the first input parameter. The functions operate on a matrix stored in the assay slot of the SCE object. The parameter useAssay can be used to specify which matrix to use (the default is "counts"). Matrices can be of class matrix or dgCMatrix from the Matrix package.
SCTK has wrapper functions for importing count matrix from the output of many types of processing tools. See the documentation for importing data for more detail. Here we use importExampleData() to load PBMC3k data from the Bioconductor package TENxPBMCData.
library(singleCellTK)
sce <- importExampleData("pbmc3k")Here, we briefly introduce the approach to importing the output of the widely used preprocessing tool, cellranger. SCTK has a generic function importCellRanger() for this purpose, and, explicitly, importCellRangerV2() and importCellRangerV3() for different versions of cellranger. For the detail of these functions, please click on the function names to be redirected to the reference page.
The input arguments basically asks users what the exact paths of the input data files are (i.e. "matrix.mtx", "features.tsv", and "barcodes.tsv"). They are cellRangerDirs, sampleDirs, cellRangerOuts, matrixFileNames, featuresFileNames and barcodesFileNames. And the function will identify the specified path, for example, of the barcode file, as a combination of: {cellRangerDirs}/{sampleDirs}/{cellRangerOuts}/{barcodesFileNames}. Theses functions automatically try to recognize a preset substucture of cellRangerDirs, so that in most of the cases, users only need to specify cellRangerDirs to tell where the top level directory of the output is. However, sometimes the three essential files may be placed or named in a different way and the default detection method won’t find them. In this case, users will need to check the exact paths and manually specify the correct input according to the combination rule above and the error messages.
An example folder structure:
./datasets/
sample1/
outs/filtered_feature_bc_matrix/
barcodes.tsv.gz
features.tsv.gz
matrix.mtx.gz
sample2/
outs/filtered_feature_bc_matrix/
barcodes.tsv.gz
features.tsv.gz
matrix.mtx.gz
./otherCellRangerData/
barcodes.tsv
genes.tsv
matrix.mtx
# Default use case
sce <- importCellRanger(cellRangerDirs = "dataset")
# In case the three files are placed in a different way
sce <- importCellRanger(sampleDirs = "otherCellRangerData",
cellRangerOuts = "",
barcodesFileNames = "barcodes.tsv",
featuresFileNames = "genes.tsv",
matrixFileNames = "matrix.mtx")Quality control and filtering of cells is often needed before down-stream analyses such as dimensionality reduction and clustering. Typical filtering procedures include exclusion of poor quality cells with low numbers of counts/UMIs, estimation and removal of ambient RNA, and identification of potential doublet/multiplets. Many tools and packages are available to perform these operations and users are free to apply their tool(s) of choice.
To perform QC, we suggest using the runCellQC() function. This is a wrapper for several methods for calculation of QC metrics, doublet detection, and estimation of ambient RNA. Below is a quick example of how to perform standard QC before heading to downstream analyses. If you have another preferred approach or your data has already been QC’ed, you can move to Feature Selection section. For this tutorial, we will only run one doublet detection algorithm (scDblFinder) and one decontamination algorithm (decontX). For a full list of algorithms that this function runs by default, see ?runCellQC. We will also quantify the percentage of mitochondrial genes in each cell as this is often used as a measure of cell viability.
# Get list of mitochondrial genes
sce <- importMitoGeneSet(sce, reference = "human", id = "symbol",
by = "rownames", collectionName = "mito")
# Run QC
sce <- runCellQC(sce, sample = NULL,
algorithms = c("QCMetrics", "scDblFinder", "decontX"),
collectionName = "mito",
geneSetListLocation = "rownames")
sce <- getUMAP(inSCE = sce, reducedDimName = "QC_UMAP", seed = 12345)Note: If you have cells from multiple samples stored in the SCE object, make sure to supply the sample parameter as the QC tools need to be applied to cells from each sample individually.
Individual sets of QC metrics can be plotted with specific functions. For example to plot distributions of total numbers of UMIs derived from runPerCellQC(), doublet scores from runScDblFinder(), and contamination scores from runDecontX() (all of which were run by the runCellQC() function), the following plotting functions can be used:

plotScDblFinderResults(sce, reducedDimName = "QC_UMAP")
plotDecontXResults(sce, reducedDimName = "QC_UMAP")
A comprehensive HTML report can be generated to visualize and explore the QC metrics in greater detail:
reportCellQC(sce)After examining the distributions of various QC metrics, poor quality cells will need to be removed. Typically, thresholds for QC metrics should exclude cells that are outliers of the distribution (i.e. long tails in the violin or density plots). Cells can be removed using the subsetSCECols() function. Metrics stored in the colData slot of the SCE object can be filtered using the colData parameter. Here we will limit to cells with at least 600 counts and 300 genes detected, doublet score no more than 0.9, and at most 5% of mitochondrial genes detected:
sce <- subsetSCECols(sce, colData = c("total > 600", "detected > 300",
"scDblFinder_doublet_score < 0.8",
"subsets_mito_percent < 5"))
# See number of cells after filtering
ncol(sce)## [1] 2545
After removing cells of low quality, we next need to normalize the feature expression matrix. SCTK has a generic wrapper function for various types of normalization, called runNormalization(). In this tutorial, we apply a global-scaling normalization method from scater, "logNormCounts". It normalizes the feature expression for each cell by the total expression, multiplies this by a scale factor, and log-transforms the result. Afterwards, we recommend running a z-score scaling on the log-normalized matrix prior to dimensionality reduction.
sce <- runNormalization(sce, useAssay = "counts", outAssayName = "logcounts",
normalizationMethod = "logNormCounts")
sce <- runNormalization(sce, useAssay = "logcounts",
outAssayName = "logcounts_scaled", scale = TRUE)SCTK also supports a number of batch correction methods. Here we don’t present the workflow because PBMC3k dataset only has one batch. For examples, please refer to the document of batch correction.
Selecting highly variable genes (HVG) for downstream analyses is recommended, since a subset of variable features can speed up the computation and reduce the noise being introduced. SCTK wraps the methods used in Seurat and Scran. We recommend keeping at least 2,000-5,000 HVGs. In the example code, we use the “VST” method from Seurat, and select the top 2,000 genes.
sce <- seuratFindHVG(sce, useAssay = "counts", hvgMethod = "vst")
sce <- getTopHVG(sce, method = "vst", n = 2000, altExp = "hvg")
plotTopHVG(sce, method = "vst")
Note: The row-subsetted matrix is stored in the “alternative experiment” slot (
altExp) within the SCE object. This allows for a matrix with a different number of rows to be stored within the same SCE object (rather than creating two SCE objects). Some of the SCTK functions described in the next several sections will operate on a matrix stored in thealtExpslot. The list of alternative experiments in an SCE can be view withaltExpNames(sce). In the future, there will be update to utilize the ExperimentSubset package for simpler operation on the subsets.
Although we have selected variable features, we would like to further reduce the noise. The next step is performing a principal component analysis (PCA), which creates new uncorrelated variables that successively maximize variance and finally increases the data interpretability but at the same time minimizes the information loss.
In this tutorial, we use the wrapper function scaterPCA(), which comes from Scater. This function allows users to directly use a full-sized feature expression assay, as well as a subsetted expression matrix. Here, we will use the HVG selected in the last section and feed it to useAltExp argument.
SCTK also supports Seurat’s PCA and ICA method, which do a similar job. For Seurat methods, please refer to our Seurat Curated Workflow.
sce <- runDimReduce(inSCE = sce, method = "scaterPCA",
useAssay = "logcounts_scaled", useAltExp = "hvg",
reducedDimName = "PCA", seed = 12345)
plotDimRed(sce, useReduction = "PCA")
Note: When using
useAltExpargument in SCTK functions, users should also be aware of theuseAssayargument. Usually when both are available, using onlyuseAssaymeans using a full-sized feature expression matrix; usinguseAssayanduseAltExpin the same call means using a subsetted expression matrix. In the latter scenario,useAltExppoints to the alternative experiment stored inaltExpslot of the SCE object, anduseAssaypoints to theassayslot in the alternative experiment object.
SCTK supports clustering cells with Shared Nearest Neighbors (SNN) graph based methods and K-Means algorithm. For SNN graph based methods, we wrap the graph constructor from scran library, and have options of algorithms from igraph that can perform clustering on the graph. The result of clustering will be stored in the colData slot with variable name set by clusterName.
Similarly as the steps above, runScranSNN() allows various types of input matrix, including log-normalized a full sized feature expression matrix or a subset of it, as well as a dimension reduction matrix. In this tutorial, we will use the PCA embedding generated in the previous section.
sce <- runScranSNN(sce, useReducedDim = "PCA", clusterName = "cluster", algorithm = "louvain", k = 4)Meanwhile, in the Seurat Curated Workflow, the clustering is also done with an SNN graph, but the wrapper function alone, seuratFindClusters(), would be hard to use out of the curated workflow.
In the other curated workflow that SCTK wraps, Celda Curated Workflow, we use a Bayesian hierarchical model that can bi-cluster features into modules and observations into subpopulations. For detail of this method, please refer to the Celda Documentation.
Uniform Manifold Approximation and Projection (UMAP) is one of the most commonly used methods to visualize the relationships between the cells in a 2-D embedding. This can be done using the function getUMAP(), which is a wrapper of scater::calculateUMAP() method. The result will be saved in reducedDim slot.
Following the steps above, we will now use useReducedDim argument to pass the PCA embedding just generated to the function. This wrapper function also allows useAssay or useAltExp + useAssay method, while providing an option of running PCA dimension reduction on the expression matrix prior to the UMAP computation. In this way, however, the PCA will be done by scater, thus cannot be customized.
SCTK also has the function for t-distributed Stochastic Neighbor Embedding (tSNE) method to create 2-D embedding for visualization, getTSNE(), where the general arguments for inputting a matrix work in the similar way.
sce <- runDimReduce(inSCE = sce, method = "scaterUMAP", useReducedDim = "PCA", reducedDimName = "UMAP", seed = 12345)plotSCEDimReduceColData() is a generic plotting function that use the values stored in reducedDim slot as the axis to make 2-D scatter plots, and use the variables stored in colData slot to label each cell, by colors or shapes.
Now, with the clustering result labeled for each cell and stored in the colData slot. We can plot the UMAP generated previously, with the clustering result labeled.
plotSCEDimReduceColData(sce, colorBy = "cluster", reducedDimName = "UMAP")
Users can next perform marker detection process with clusters labeled, or any other types of reasonable grouping of similar biological state of cells annotated in colData slot. SCTK finds markers for each cluster by running DE analysis on one cluster against all other clusters and iterating the same process for each cluster.
sce <- findMarkerDiffExp(sce, useAssay = "logcounts", method = "wilcox",
cluster = "cluster",
log2fcThreshold = 0, fdrThreshold = 0.05,
minClustExprPerc = 0, maxCtrlExprPerc = 1,
minMeanExpr = 0)By running findMarkerDiffExp(), SCTK stores a data.table of marker table to the matadata slot of the SCE object. The marker table will include all detected markers for each cluster and statistical values associated to each marker. To fetch the table, users can use the command below, and even add filtering parameters.
topMarkers <- findMarkerTopTable(sce, topN = 1, log2fcThreshold = 0, fdrThreshold = 0.05,
minClustExprPerc = 0.5, maxCtrlExprPerc = 0.4,
minMeanExpr = 0)
head(topMarkers, 10)## Gene Log2_FC Pvalue FDR cluster clusterExprPerc
## 28818 CST3 3.1226757 1.715335e-89 2.807832e-85 1 1.0000000
## 19561 S100A9 4.4606995 1.163272e-138 1.269440e-134 2 0.9931741
## 31076 NKG7 4.3346353 3.033295e-79 9.930400e-75 3 1.0000000
## 30657 CD79A 2.2750095 8.153307e-145 6.673074e-141 4 0.9335347
## 7697 PPBP 5.6252950 2.750979e-07 1.585936e-03 5 1.0000000
## 8764 IL7R 0.6596116 7.433999e-37 1.872110e-33 6 0.6972281
## 106092 LST1 2.9220617 2.282925e-67 1.868460e-63 7 1.0000000
## 268252 CCL5 2.8645460 1.758310e-105 5.756355e-101 8 0.9492754
## 198022 CD3D 0.8091627 2.801816e-73 1.994040e-70 9 0.8556231
## ControlExprPerc clusterAveExpr
## 28818 0.33576642 3.952828
## 19561 0.27131439 5.009483
## 31076 0.24947851 4.861231
## 30657 0.04245709 2.319663
## 7697 0.02366864 5.651527
## 8764 0.33236994 1.153092
## 106092 0.31524441 3.556905
## 268252 0.23182019 3.324227
## 198022 0.39692634 1.535737
For visualization, SCTK has a specific heatmap plotting function for detected markers. The heatmap will take the log-normalized feature expression matrix for the values, and subset the features to only the top markers of each cluster. For the organization of the heatmap, the function groups the markers by the belonging to the clusters, and groups the cells, obviously, by clusters. Inside the grouping, the function finally performs the default hierarchical clustering.
markerHm <- plotMarkerDiffExp(sce, topN = 5, log2fcThreshold = 0,
fdrThreshold = 0.05, minClustExprPerc = 0.5,
maxCtrlExprPerc = 0.4, minMeanExpr = 0,
rowLabel = TRUE)
As we can easily tell from the heatmap above, “cluster 6” and “cluster 7” are having a very similar expression profile across the top markers detected. In order to further explore the difference between these two clusters, or any other two reasonable groups of cells, user can use the function runDEAnalysis().
runDEAnalysis() is a generic function for performing differential expression (DE) analysis between two selected groups of cells. This is also used by findMarkerDiffExp(). By running DE analysis, the function will take the normalized read count data and perform statistical analysis to discover quantitative changes in expression levels between selected groups. See here for the detail of running DE analysis with SCTK.
There are multiple ways to specify the groups of cells to be compared: by either directly specifying indices of cells, or selecting groups of cells with a categorical variable in colData. Additionally, runDEAnalysis() requires a relatively sophisticated inputs for naming the analyses, including the names of the two groups, and the name of one analysis, though these are not necessary for calling the algorithms alone. By requiring these, the results of the analyses can be stored with human readable identifiers and within one single SCE object, and can be more manageable. See the DE documentation for the detail.
sce <- runDEAnalysis(inSCE = sce, method = "wilcox",
class = "cluster", classGroup1 = c(6), classGroup2 = c(7),
groupName1 = "cluster6", groupName2 = "cluster7",
analysisName = "cluster6_VS_7")Similarly to the marker detection section, SCTK also has a DE analysis specific heatmap plotting function, plotDEGHeatmap(). Here, users can just easily enter the analysisName specified for one run and the function will automatically arrange the output. The heatmap is also splitted into a “checker board”, by the “conditions” (i.e. the groups selected when running runDEAnalysis()) and the “regulation” (i.e. “up” if the log2 fold change is positive in “group1” against “group2”).
deHm <- plotDEGHeatmap(sce, useResult = "cluster6_VS_7", rowLabel = TRUE)
SCTK adopts SingleR method for labeling the cell types, and wraps the function in runSingleR(). Briefly, it labels each cell (or cluster) in users’ datasets basing on similarity to a reference datasets. SingleR has recommended some datasets, so we enable choosing abbreviation name of them in argument useBltinRef. Please click on the function name for usage detail.
Note: Choosing a reference from the recommended builtin references would also require some prior knowledge of user data. Users should choose a reference that covers the cell types possibly exist in user dataset. SingleR works by scoring the similarity to each cell type and assigning a cell type by the highest score, regardless of how low the score is. Thus a beta cell from pancrea data cound even be wrongly labeled as a platelet if using a blood cell reference.
sce <- runSingleR(sce, useAssay = "logcounts", level = "fine")The result of labeling will again be saved to colData slot of the SCE object. There will be multiple types of information being stored, including a scoring matrix named by "SingleR_{ref}_{level}_scores", where {ref} refers to the reference selected, {level} refers to the level argument being passed to the function. And three "SingleR_{ref}_{level}_{typeOfLabels}" columns, where {typeOfLabels} includes first.labels, labels, pruned.labels. For the difference, please refer to ?SingleR::classifySingleR.
We can also plot the labeling on to the UMAP we got previously.
plotSCEDimReduceColData(sce, colorBy = "SingleR_hpca_fine_pruned.labels",
reducedDimName = "UMAP", dotSize = 1, legendSize = 8)
For the purpose of gene set enrichment (GSE) analysis, SCTK imports VAM and GSVA as options. Both of them work by loading some user-defined gene sets (more about loading gene sets), and calculating the enrichment score of each gene set in each cell. Here we will go through calling VAM to inspect the enrichment of the hallmark gene sets available in MSigDB database.
sce <- importGeneSetsFromMSigDB(sce, categoryIDs = "H", species = "Homo sapiens")
sce <- runVAM(sce, geneSetCollectionName = "H", useAssay = "logcounts")To visualize the result, we simply adopt a violin plot to show the distribution of the CDF scores of the cells, which can be, optionally, grouped by the cluster labels.
hallmark <- "HALLMARK_INFLAMMATORY_RESPONSE"
plotPathway(sce, resultName = "VAM_H_CDF", geneset = hallmark, groupby = "cluster")
Note that available options for argument resultName and geneset can be printed with getter functions below.
getPathwayResultNames(inSCE = sce)
getGenesetNamesFromCollection(inSCE = sce, geneSetCollectionName = "H")Note: In the current version of SCTK, the returned score matrices from either VAM or GSVA are stored in the
reducedDimsslot, since the feature dimensions are for the gene sets instead of the genes or subsets of genes. In the future, SCTK will adopt other types of data container for simpler operation on different types of feature sets.