vignettes/articles/scanpy_curated_workflow.Rmd
scanpy_curated_workflow.Rmd
The singleCellTK package integrates functions from Scanpy which is a scalable Python-based toolkit for analyzing single-cell gene expression data built on top of the AnnData framework to store data and results. It includes functions for finding variable genes, dimensionality reduction, clustering, and differential expression. The Python-based implementation efficiently deals with datasets of more than one million cells.
Users can assess the Scanpy workflow via the singleCellTK package using either the R console or interactive graphical user interface (GUI) built with R/Shiny. The singleCellTK package uses the SingleCellExperiment object to store data and results. The Scanpy wrapper functions in singleCellTK automatically convert the data and results back and forth between SCE and AnnData objects so users do not have to do this manually.
The Shiny app allows users to access Scanpy functions and workflows without having to know the underlying code. Users can access all functions in the a la carte workflow in combination with functions from other packages. The Shiny app also contains a curated workflow that lets the users run the steps of the complete Scanpy workflow in a sequential manner and display the results with interactive plots.
In this tutorial, you can select the ‘Interactive Analysis’ tabs below for detailed instructions on how to run the Scanpy workflow with the Shiny or the ‘Console Analysis’ tabs for instructions on running the workflow using the R console.
Note: This tutorial assumes that the data has already been imported, QC’ed, and filtered. Please see the Import and QC tutorial for more information.
In this tutorial example, we illustrate all the steps of the
Scanpy curated workflow and focus on the options
available to customize the individual steps per the Scanpy instructions. To start
the Scanpy
curated workflow in the Shiny app, click on the
‘Curated Workflows’ tab in the top nagivation menu and select
Scanpy
:
1. Normalize Data
The first major step in the
analysis is to normalize the raw counts for differences in overall
library size (i.e. total number of counts) and apply a log
transformation. For this purpose, any assay containing raw counts in the
uploaded data can be used. 1. Click on the ‘Normalize’ vertical tab. 2.
Select the assay
to normalize from the dropdown menu. 3.
Modify optional parameters if necessary 4. Press the ‘Normalize’ button
to start the normalization process.
Note: The input raw counts matrix should only contained filtered cells and empty droplets should already have been removed. If you applied ambient RNA removal algorithms and want to use the decontominated counts, make sure to specify the appropriate matrix as input.
2. Highly Variable Genes
Identification and
selection of highly variable features/genes is important in the
Scanpy
workflow to produce high quality clustering results.
Currently, Scanpy
provides three methods for variable genes
identification (seurat
, cell_ranger
and
seurat_v3
).
3. Dimensionality Reduction Scanpy
workflow offers PCA
for dimensionality reduction. The
principal components (PCs) from PCA will be used in the downstream
clustering and 2-D embedding analyses. In the downstream analysis, you
will have to select the number of PCs to use. Several plots are
available for the user to inspect the output of the dimensionality
reduction and identify the appropriate number of PCs to use in the
downstream analyses.
4. 2D-Embedding
The relationships between cells
can be viewed on 2-D scatter plots using embedding algorithms such as
‘UMAP’ and ‘tSNE’. In these plots, each point is a cell and cells with
more similar expression profiles will be closer together in the
plot.
5. Clustering
Cells can be clustered using a
previously computed reduced dimensional object. Cluster labels will be
displaye on previously computed 2-D embedding or reduction plots.
6. Find Markers
‘Find Markers’ tab can be used
to identify the marker genes that are enriched in each cluster or group.
Markers genes can be identified that are differentially expressed
between each cluster and all other cluster or between two selected
clusters/groups. Expression of individual marker genes can be visualized
in Dot plots, Violin plots, Matrix plots and via a Heatmap.
All methods Scanpy wrapper functions in singleCellTK take a
SingleCellExperiment
(SCE) object as an input and will
return an SCE with additional information. For this tutorial, we will
use a small PBMC dataset that has been previously run through quality
control and has been filtered:
library(singleCellTK)
sce <- readRDS("tutorial_pbmc3k_qc.rds")
1. Normalize Data
Once raw data is uploaded and
stored in a SingleCellExperiment
object,
runScanpyNormalizeData()
function can be used to normalize
the data. The method returns a SingleCellExperiment
object
with normalized data stored as a new assay in the input object.
The main parameters to this function include useAssay
which is used to specify the assay with raw counts that should be
normalized and normAssayName
which will be the name of the
new normalized assay to be created.
Note: The input raw counts matrix should only contained filtered cells and empty droplets should already have been removed. If you applied ambient RNA removal algorithms and want to use the decontominated counts, make sure to specify the appropriate matrix as input.
sce <- runScanpyNormalizeData(inSCE = sce, useAssay = "counts", normAssayName = "scanpyNormData")
2. Scale Data
Normalized data can be scaled by
z-scoring with the runScanpyScaleData()
function that takes
input a SingleCellExperiment
object that has been
normalized previously by the runScanpyNormalizeData()
function. The scaled assay is stored back in the input object.
The main parameters to this function include useAssay
which is used to specify the normalized assay from previous steps and
scaledAssayName
which will be the name of the new scaled
assay to be created.
sce <- runScanpyScaleData(inSCE = sce, useAssay = "scanpyNormData", scaledAssayName = "scanpyScaledData")
3. Highly Variable Genes
Highly variable genes
can be identified by first using the runScanpyFindHVG()
function that computes that statistics against a selected HVG method in
the rowData of input object. The variable genes can be visualized using
the plotScanpyHVG()
method.
The main parameters to this function include useAssay
which is used to specify the name of the input log normalized assay and
method
which is used to specify the method to use for
computing the variability of each gene.
sce <- runScanpyFindHVG(inSCE = sce, useAssay = "scanpyNormData", method = "seurat", minMean = 0.0125, maxMean = 3, maxDisp = 0.5)
The getTopHVG
function can be used to select the top
most variable genes and store it back in rowData
of the SCE
object as a Boolean vector.
## [1] "PPBP" "NFE2L2" "DOK3" "YPEL2" "ARVCF"
## [6] "UBE2D4" "FAM210B" "CTB-113I20.2" "GBGT1" "GMPPA"
plotScanpyHVG(sce)
4. Dimensionality Reduction
PCA can be computed
using the runScanpyPCA()
function.
The main parameters to this function include useAssay
which is used to specify the name of the input log normalized assay and
method
which is used to specify the method to use for
computing the variability of each gene.
sce <- runScanpyPCA(inSCE = sce, useAssay = "scanpyScaledData", reducedDimName = "scanpyPCA", nPCs = 50, method = "auto", use_highly_variable = TRUE)
The results of the PCA can be visualized in several ways. The
plotScanpyPCAVariance()
function shows the percentage of
variability explained by each PC. This plot can be useful for selecting
the number of PCs. In general, the “elbow” of the curve is a good
selection criteria. The elbow is where there is a sharp drop in the
amount the percentage of variance decreases after including additional
PCs.
plotScanpyPCAVariance(inSCE = sce, nPCs = 50)
The plotScanpyPCA()
function can be used to view cells
in the reduced dimensional space of PC1 versus PC2.
plotScanpyPCA(inSCE = sce, reducedDimName = "scanpyPCA")
The plotScanpyPCAGeneRanking()
function shows the
highest ranked genes for each PC.
plotScanpyPCAGeneRanking(inSCE = sce)
5. 2-D Embedding
The relationships between cells can be viewed on 2-D scatter plots
using embedding algorithms such as ‘UMAP’ and ‘tSNE’. In these plots,
each point is a cell and cells with more similar expression profiles
will be closer together in the plot. runScanpyTSNE()
and
runScanpyUMAP()
can be used to compute 2-D embeddings which
are stored in the reducedDim
slot of the SCE object.
Parameters to both functions include useReducedDim
which
specifies the reduced dimensional object to use as input,
reducedDimName
which will be the name of this new
reduction, and dims
which specifies the number of
dimensions to use from the reduced dimensional object. The
plotScanpyEmbedding()
function can be used to visualize the
results. Below is an example of calculating and displaying a UMAP:
sce <- runScanpyUMAP(inSCE = sce, useReducedDim = "scanpyPCA", reducedDimName = "scanpyUMAP")
plotScanpyEmbedding(sce, reducedDimName = "scanpyUMAP")
6. Clustering
The runScanpyFindClusters()
function can be used to
cluster cells. The main parameters to this function include
useAssay
which should be the name of the scaled assay,
useReducedDim
useReducedDim
which specifies
the reduced dimensional object to use as input, dims
which
specifies the number of dimensions to use from the reduced dimensional
object, and method
which specifies the clustering
method.
sce <- runScanpyFindClusters(inSCE = sce, useAssay = "scanpyScaledData", useReducedDim = "scanpyPCA", method = "leiden", resolution = 1, nNeighbors = 10, dims = 10, cor_method = "pearson")
plotScanpyEmbedding()
can then be used to plot the
cluster labels on the previously computed 2-D embeddings or reduced
dimensional objects. For example, we can plot the cluster labels on the
UMAP:
plotScanpyEmbedding(sce, reducedDimName = "scanpyUMAP", color = 'Scanpy_leiden_1')
7. Find Markers
Marker genes can be identified
using the runScanpyFindMarkers()
function. The main
parameter for this function is a vector of cluster/group labels that are
stored in the colData
of the input SCE object. By default.
each cluster of cells will be compared to the rest of the cells in all
other clusters. Users can can find also find differentially expressed
genes between two clusters using the group1
and
group2
parameters. Generally, group labels defined by
clustering functions will be used as input. Here we will find markers
for the clusters we identified previously:
sce <- runScanpyFindMarkers(inSCE = sce, colDataName = "Scanpy_leiden_1", nGenes = 10, test = "t-test", corr_method = "benjamini-hochberg")
The full table of differentially expressed marker genes can be
accessed in the metadata with the command
metadata(sce)[["scanpyMarkersTable"]]
. Here are the first
few genes in the table:
## Gene findMarker_cluster Log2_FC Pvalue zscore
## 01 CD74 0 4.057573 0.000000e+00 78.69228
## 02 HLA-DRA 0 4.877226 0.000000e+00 74.43016
## 03 CD79A 0 7.735484 2.272841e-165 52.24474
## 04 HLA-DPB1 0 4.021307 4.142218e-253 51.94960
## 05 HLA-DRB1 0 3.864657 3.225850e-229 47.71487
## 06 CD79B 0 5.529685 1.860679e-150 44.38144
Expression of individual marker genes can be visualized in Feature plots, Dot plots, Violin plots, Matrix plots and via a Heatmap:
plotScanpyEmbedding(sce, color = c("CD79A", "CST3"), reducedDimName = "scanpyUMAP")
plotScanpyDotPlot(sce, features = c("CD79A", "CST3"), groupBy = "Scanpy_leiden_1")
plotScanpyViolin(sce, features = c("CD79A", "CST3"), groupBy = "Scanpy_leiden_1")
plotScanpyMatrixPlot(sce, features = c("CD79A", "CST3"), groupBy = "Scanpy_leiden_1")
plotScanpyHeatmap(sce, features = c("CD79A", "CST3"), groupBy = "Scanpy_leiden_1")