vignettes/articles/seurat_curated_workflow.Rmd
seurat_curated_workflow.Rmd
singleCellTK integrates functions from Seurat [[1]][2][[3]][4] package in an easy to use streamlined workflow using both the shiny user interface as well as the R console. The shiny application contains a separate tab that lets the users run the steps of the workflow in a sequential manner with ability to visualize through interactive plots from within the application. On the R console, the toolkit offers wrapper functions that use the SingleCellExperiment [5] object as the input and the output. All computations from the wrapper functions are stored within this object for further manipulation.
To view detailed instructions on how to use the workflow, please select ‘Interactive Analysis’ for using the workflow in shiny application or ‘Console Analysis’ for using these methods on R console from the tabs below:
In this tutorial example, we illustrate all the steps of the curated
workflow and focus on the options available to manipulate and customize
the steps of the workflow as per user requirements. To initiate the
Seurat
workflow, click on the ‘Curated Workflows’ from the
top menu and select Seurat
:
NOTE: This tutorial assumes that the data has already been uploaded via the upload tab of the toolkit and filtered before using the workflow.
1. Normalize Data
Assuming that the data has
been uploaded via the Upload tab of the toolkit, the first step for the
analysis of the data is the Normalization of data. For this purpose, any
assay available in the uploaded data can be used against one of the
three methods of normalization available through Seurat
i.e. LogNormalize
, CLR
(Centered Log Ratio) or
RC
(Relative Counts).
assay
to normalize from the dropdown
menu.LogNormalize
, CLR
or
RC
.10000
.2. Scale Data
Once normalization is complete,
data needs to be scaled and centered accordingly. Seurat
uses linear
(linear model), poisson
(generalized linear model) or negbinom
(generalized linear
model) as a regression model.
linear
,
poisson
or negbinom
.3. Highly Variable Genes
Identification of the
highly variable genes is core to the Seurat
workflow and
these highly variable genes are used throughout the remaining workflow.
Seurat
provides three methods for variable genes
identification i.e. vst
(uses local polynomial regression
to fit a relationship between log of variance and log of mean),
mean.var.plot
(uses mean and dispersion to divide features
into bins) and dispersion
(uses highest dispersion values
only).
vst
, mean.var.plot
and
dispersion
.2000
.4. Dimensionality Reduction Seurat
workflow offers PCA
or ICA
for dimensionality
reduction and the components from these methods can be used in the
downstream analysis. Moreover, several plots are available for the user
to inspect the output of the dimensionality reduction such as the
standard ‘PCA Plot’, ‘Elbow Plot’, ‘Jackstraw Plot’ and ‘Heatmap
Plot’.
50
.TRUE
.5. tSNE/UMAP
‘tSNE’ and ‘UMAP’ can be computed
and plotted once components are available from ‘Dimensionality
Reduction’ tab.
6. Clustering
Cluster labels can be generated
for all cells/samples using one of the computed reduction method. Plots
are automatically re-computed with cluster labels. The available
algorithms for clustering as provided by Seurat
include
original Louvain algorithm
,
Louvain algorithm with multilevel refinement
and
SLM algorithm
.
original Louvain algorithm
,
Louvain algorithm with multilevel refinement
and
SLM algorithm
0.8
.TRUE
.7. Find Markers
‘Find Markers’ tab can be used
to identify and visualize the marker genes using on of the provided
visualization methods. The tab offers identification of markers between
two selected phenotype groups or between all groups and can be decided
at the time of the computation. Furthermore, markers that are conserved
between two phenotype groups can also be identified. Visualizations such
as Ridge Plot, Violin Plot, Feature Plot and Heatmap Plot can be used to
visualize the individual marker genes.
1. Select if you want to identify marker genes against all groups in a
biological variable or between two pre-defined groups. Additionally,
users can select the last option to identify the marker genes that are
conserved between two groups. 2. Select phenotype variable that contains
the grouping information. 3. Select test used for marker genes
identification. 4. Select if only positive markers should be returned.
5. Press “Find Markers” button to run marker identification.
6. Identified marker genes are populated in the table. 7. Filters can be
applied on the table.
8. Filters allow different comparisons based on the type of the column
of the table.
9. Table re-populated after applying filters. 10. Heatmap plot can be
visualized for all genes populated in the table (9) against all
biological groups in the selected phenotype variable.
11. To visualize each individual marker gene through gene plots, they
can be selected by clicking on the relevant rows of the table.
12. Selected marker genes from the table are plotted with gene
plots.
8. Downstream Analysis
Once all steps of the
Seurat workflow are completed, users can further analyze the data by
directly going to the various downstream analysis options (Differential
Expression, Marker Selection & Pathway Analysis) from within the
Seurat workflow.
All methods provided by SCTK for Seurat workflow use a
SingleCellExperiment
object both as an input and
output.
Using a sample dataset:
library(singleCellTK)
sce <- importExampleData('pbmc3k')
print(sce)
## class: SingleCellExperiment
## dim: 32738 2700
## metadata(0):
## assays(1): counts
## rownames(32738): MIR1302-10 FAM138A ... AC002321.2 AC002321.1
## rowData names(3): ENSEMBL_ID Symbol_TENx Symbol
## colnames(2700): pbmc3k_AAACATACAACCAC-1 pbmc3k_AAACATTGAGCTAC-1 ...
## pbmc3k_TTTGCATGAGAGGC-1 pbmc3k_TTTGCATGCCTCAC-1
## colData names(12): Sample Barcode ... Date_published sample
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
1. Normalize Data
Once raw data is uploaded and
stored in a SingleCellExperiment
object,
runSeuratNormalizeData()
function can be used to normalize
the data. The method returns a SingleCellExperiment
object
with normalized data stored as a new assay in the input object.
Parameters to this function include useAssay
(specify
the assay that should be normalized), normAssayName
(specify the new name of the normalized assay, defaults to
"seuratNormData"
), normalizationMethod
(specify the normalization method to use, defaults to
"LogNormalize"
) and scaleFactor
(defaults to
10000
).
sce <- runSeuratNormalizeData(inSCE = sce, useAssay = "counts", normAssayName = "seuratNormData")
2. Scale Data
Normalized data can be scaled by
using the runSeuratScaleData()
function that takes input a
SingleCellExperiment
object that has been normalized
previously by the runSeuratNormalizeData()
function. Scaled
assay is stored back in the input object.
Parameters include useAssay
(specify the name of
normalized assay), scaledAssayName
(specify the new name
for scaled assay, defaults to "seuratScaledData"
),
model
(specify the method to use, defaults to
"linear"
), scale
(specify if the data should
be scaled, defaults to TRUE
), center
(specify
if the data should be centered, defaults to TRUE
) and
scaleMax
(specify the maximum clipping value, defaults to
10
).
sce <- runSeuratScaleData(inSCE = sce, useAssay = "seuratNormData", scaledAssayName = "seuratScaledData")
3. Highly Variable Genes
Highly variable genes
can be identified by first using the runSeuratFindHVG()
function that computes that statistics against a selected HVG method in
the rowData of input object. The genes can be identified by using the
getTopHVG()
function. The variable genes can be visualized
using the plotSeuratHVG()
method.
Parameters for
runSeuratFindHVG()
include useAssay
(specify
the name of the scaled assay, defaults to
"seuratScaledData"
but "counts"
should be used
with "vst"
method) and hvgMethod()
(specify
the method to use for variable genes computation, defaults to
"vst"
).
sce <- runSeuratFindHVG(inSCE = sce, useAssay = "counts", method = "vst")
print(getTopHVG(inSCE = sce, method = "vst", hvgNumber = 10))
## [1] "PPBP" "LYZ" "S100A9" "IGLL5" "GNLY" "FTL" "PF4" "FTH1"
## [9] "GNG11" "FCER1A"
plotSeuratHVG(sce, labelPoints = 10)
4. Dimensionality Reduction
PCA or ICA can be
computed using the runSeuratPCA()
and
runSeuratICA()
functions respectively. Plots can be
visualized using plotSeuratReduction()
,
plotSeuratElbow()
, plotSeuratJackStraw()
(must
previously be computed by runSeuratJackStraw()
) and
runSeuratHeatmap()
.
sce <- runSeuratPCA(inSCE = sce, useAssay = "seuratScaledData", reducedDimName = "seuratPCA", nPCs = 20)
plotSeuratReduction(inSCE = sce, useReduction = "pca")
plotSeuratElbow(inSCE = sce)
sce <- runSeuratJackStraw(inSCE = sce, useAssay = "seuratScaledData", dims = 20)
plotSeuratJackStraw(inSCE = sce, dims = 20)
runSeuratHeatmap(inSCE = sce, useAssay = "seuratScaledData", useReduction = "pca", nfeatures = 20, dims = 2)
## NULL
5. tSNE/UMAP runSeuratTSNE()
and
runSeuratUMAP()
can be used to compute tSNE/UMAP statistics
and store into the input object. Parameters to both functions include
inSCE
(input SCE object), useReduction
(specify the reduction to use i.e. "pca"
or
"ica"
), reducedDimName
(name of this new
reduction) and dims
(number of dims to use).
plotSeuratReduction()
can be used to visualize the
results.
sce <- runSeuratTSNE(inSCE = sce, useReduction = "pca", reducedDimName = "seuratTSNE", dims = 10)
plotSeuratReduction(sce, "tsne")
sce <- runSeuratUMAP(inSCE = sce, useReduction = "pca", reducedDimName = "seuratUMAP", dims = 10)
plotSeuratReduction(sce, "umap")
6. Clustering runSeuratFindClusters()
function can be used to compute the
clusters, which can later be plotted through the
plotSeuratReduction()
method with cluster labels. The
parameters to the function include inSCE
(input SCE
object), useAssay
(name of the scaled assay),
useReduction
(specify which reduction to use
i.e. "pca"
or "ica"
), dims
(number of dims to use) and the algorithm (either
"louvain"
, "multilevel"
or
"SLM"
).
sce <- runSeuratFindClusters(inSCE = sce, useAssay = "seuratScaledData", useReduction = "pca", dims = 10, algorithm = "louvain")
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 2700
## Number of edges: 92305
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8487
## Number of communities: 13
## Elapsed time: 0 seconds
plotSeuratReduction()
can then be used to plot all
reductions previously computed with cluster labels:
plotSeuratReduction(sce, "pca", showLegend = TRUE)
plotSeuratReduction(sce, "tsne", showLegend = TRUE)
plotSeuratReduction(sce, "umap", showLegend = TRUE)
7. Find Markers
Marker genes can be identified
using the runSeuratFindMarkers()
function. This function
can either use one specified column from colData
of the
input object as a group variable if all groups from that variable are to
be used (allGroup
parameter) or users can manually specify
the cells included in one group vs cells included in the second group
(cells1
and cells2
parameter).
sce <- runSeuratFindMarkers(inSCE = sce, allGroup = "Seurat_louvain_Resolution0.8")
print(head(metadata(sce)[["seuratMarkers"]]))
## gene.id p_val avg_log2FC pct.1 pct.2 p_val_adj cluster1 cluster2
## 1 LDHB 1.216804e-88 -12.281337 0.970 0.603 3.983572e-84 0 all
## 2 RPS27 3.922537e-82 -13.666973 1.000 0.992 1.284160e-77 0 all
## 3 RPS25 5.623154e-70 -3.707828 1.000 0.975 1.840908e-65 0 all
## 4 RPS12 4.002685e-69 39.667606 1.000 0.991 1.310399e-64 0 all
## 5 RPL31 1.143651e-67 3.198830 0.996 0.966 3.744086e-63 0 all
## 6 CD3D 6.321440e-66 20.940084 0.895 0.437 2.069513e-61 0 all
The marker genes identified can be visualized through one of the
available plots from ridge plot
, violin plot
,
feature plot
, dot plot
and
heatmap plot
. All marker genes visualizations can be
plotted through the wrapper function plotSeuratGenes()
,
which must be supplied the SCE object (markers previously computed),
name of the scaled assay, type of the plot (available options are
"ridge"
, "feature"
, "violin"
,
"dot"
and "heatmap"
), features that should be
plotted (character
vector) and the grouping variable that
is available in the colData
slot of the input object. An
additional parameter ncol
decides in how many columns
should the visualizations be plotted.
plotSeuratGenes(inSCE = sce, useAssay = "seuratScaledData", plotType = "ridge", features = metadata(sce)[["seuratMarkers"]]$gene.id[1:4], groupVariable = "Seurat_louvain_Resolution0.8", ncol = 2, combine = TRUE)