vignettes/articles/cell_type_labeling.Rmd
cell_type_labeling.Rmd
The methods for labeling the cell type for single-cell RNA-sequencing
(scRNAseq) data varies a lot. Including using a labeled reference
dataset as the training set and classifying the test data by cluster or
individual cells, or using a prior knowledge of the projection between
marker gene sets and specific cell types and labeling the cells basing
on detected markers. Currently, SCTK only supports using
SingleR
[1] for cell type
labeling, which works with the first approach as previously mentioned.
We are planning to add more methods that annotate cells in various
ways.
SingleR
is an automatic annotation method for scRNAseq
data. Given a reference dataset of samples (single-cell or bulk) with
known labels, it labels new cells from a test dataset based on
similarity to the reference. The only prerequisite for using this method
is a labeled reference dataset, and we have already enabled automated
caching of online datasets, while users can always use their own data in
the console analysis.
To view detailed instructions on how to use this method, please
select ‘Interactive Analysis’ for using SingleR
in shiny
application or ‘Console Analysis’ for using this method on R console
from the tabs below:
Entry
Users should enter the page as instructed in the screenshot above, by clicking on “Differential Expression & Cell Type Labeling” first and choosing “Cell Type Labeling” sub-option next.
Usage
There are 6 basic widgets that users need to work with before reaching
to the result:
The result will not be pop out directly, but three levels of cell
annotation will be inserted to the background object. They will all have
the same prefix as
"SingleR_{reference abbr}_{annotation level}_"
, then
"score"
, "first.labels"
, "labels"
and "pruned.labels"
, respectively.
"first.labels"
refers to the labeling initially indicated
by the scores, "labels"
is fine-tuned, and
"pruned.labels"
is the pruned result.
Visualization
SCTK does not have a quick visualization at this point. But users can go to the CellViewer tab to create a scatter plot using the pre-generated embedding and color it with the cell type labeling result.
Preprocessing
As mentioned previously in this article, performing cell type labeling with SingleR does not require sophisticated preprocessing, but only a reference dataset that covers the expected cell types. Now that we have already incorporated many curated online dataset that SingleR recommends, users only need to log-normalize the feature expression matrix to match what is in the reference.
Here, we use “PBMC3k” dataset provided by CellRanger as an example.
library(singleCellTK)
sce <- importExampleData("pbmc3k")
sce <- runNormalization(inSCE = sce, normalizationMethod = "logNormCounts", outAssayName = "logcounts", useAssay = "counts")
Running SingleR
Then users can use runSingleR()
to invoke the algorithm.
Here we choose to use useBltinRef = "hpca"
to set the
reference, which refers to Human Primary Cell Atlas Data [2] and this will be automatically cached to
users’ local. We set level = "main"
to label the PBMC3k
dataset with general cell types. Users can also set it to
"fine"
for more subtypes, and "ont"
for Gene
Ontology information. For more information, please click on the function
name and see the reference page of this function.
sce <- runSingleR(inSCE = sce, useAssay = "logcounts", useBltinRef = "hpca", level = "main", featureType = "symbol")
Four results from SingleR will be stored in colData()
slot with the same prefix as
"SingleR_{reference abbr}_{annotation level}_"
, then
"score"
, "first.labels"
, "labels"
and "pruned.labels"
, respectively.
"first.labels"
refers to the labeling initially indicated
by the scores, "labels"
is fine-tuned, and
"pruned.labels"
is the pruned result.
Besides, users can also use their own labeled reference datasets
wrapped in a SingleCellExperiment
object. Refer to argument useSCERef
and
labelColName
. Additionally, the labeling can also be done
on cluster label if users have already performed clustering on their
dataset and have the result stored in colData()
slot. Refer
to argument labelByCluster
.
Visualization
Users can choose to visualize the labeling result on a scatter plot. Here, we provide an example to create a UMAP embedding that can clearly reflect the relationship between cells in 2D, and then color the cells by the labels.
sce <- runSeuratFindHVG(inSCE = sce, useAssay = "counts", method = "vst")
sce <- setTopHVG(inSCE = sce, method = "vst", hvgNumber = 2000)
sce <- runDimReduce(inSCE = sce, method = "scaterPCA", useAssay = "logcounts", useFeatureSubset = "HVG_vst2000", reducedDimName = "PCA")
sce <- runDimReduce(inSCE = sce, method = "scaterUMAP", useReducedDim = "PCA", reducedDimName = "UMAP")
plotSCEDimReduceColData(inSCE = sce, colorBy = "SingleR_hpca_main_pruned.labels", reducedDimName = "UMAP")