vignettes/articles/differential_expression.Rmd
differential_expression.Rmd
singleCellTK (SCTK) performs differential expression (DE) analysis in a group-VS-group way. The groups to be compared can be set by selecting the individual cells or by using existing categorical cell annotation. SCTK A La Carte Workflow supports five methods, including wilcoxon test [1], MAST [2], Limma [3], DESeq2 [4], and ANOVA[5]. Users can perform multiple sets of DE analysis with unique analysis name set each time. SCTK also provides the Seurat Curated Workflow, where Seurat marker testing methods are supported, but will not be introduced in this documentation.
To view detailed instructions on how to use these methods, please select ‘Interactive Analysis’ for using differential expression in shiny application or ‘Console Analysis’ for using these methods on R console from the tabs below:
Entry of the Panel
From anywhere of the UI, the panel for DE can be accessed from the top navigation panel shown below (highlighted in red):
The UI is constructed with mainly five parts:
1. Method Selection, 2. Assay Input, 3. Condition Selection, 4. Parameter Settings 5. Results & Visualization
1. Method Selection
As mentioned in the Introduction section above, totally five methods are available from SCTK UI, including: Wilcoxon, MAST, Limma, DESeq2 and ANOVA. Users can choose the preferred one from the selection input “Choose analysis method”, as shown in the screenshot below.
2. Assay Input
Users need to choose a feature expression matrix for the differential expression analysis, and the selection should be made at the selection input “Select input matrix”. For most of the methods, a log-normalized assay would be recommended for the algorithm, though a “raw” counts would be recommended for “DESeq2.” (How to normalize an assay?)
Alternatively, user can choose to use the Pathway Analysis scores to perform the statistic tests. The matrix names for these scores will lie within the “redDims” section in the drop-down list of “Select input matrix”. When using the pathway analysis scores, DESeq2 method is not available since it expects non-negative integer value in the input matrix.
3. Condition Selection
Usually in terms of a DE analysis, we have two necessary conditions that have to be defined. One is the condition of interests and the other one is the condition of control. The condition of interests will then be compared to the condition of control via computational methods and statistical result will then be returned. The selection for a condition here means deciding which cells should be grouped together as one condition.
SCTK provides reasonably flexible approaches to setting the conditions:
The first approach is designed for the fastest use, where conditions of interests and of control can be simply defined by categorical variables of the same class in the cell annotation (i.e. within one single column in colData
of the background SCE object).
First, users need to choose one single option from the selection list “Choose Annotation Class”. Then, in the UI, there will be two columns for each condition. For each of them, users can make selections for which categories should belong to the condition, at the selection input “Select Condition(s)”. One or more selections are acceptable. When selections are made for one condition while no selection for the other, all the categories but those already used for the former will go to the latter. The text boxes, “Cells selected”, are only for showing users what are selected. While the text span below the boxes summaries the total number of cells selected.
In the example in the figure shown above, we intend to define the conditions based on the clustering result generated by louvain clustering on an SNN graph prepared by Scran, "scran_snn_cluster"
. (How to cluster cells?) Then we define that the cells assigned to “cluster 1” are of our interests, while the cells assigned to “cluster 7” should be a control against “cluster 1.”
The second approach allows users to make the most use of all the cell annotations available. Meanwhile, single selection on each cell is also enabled to ensure the highest flexibility, though can be relatively ineffective.
In this approach, we utilize a data table, where filters can be applied, for the flexible definition on each condition. By default, all classes of annotations are displayed. In the selection input “Columns to display”, users can select one or more variables of annotation, which should be displayed and used for filtering. The blank box under each column title is where the filters are applied. For categorical columns, one or more selections on available variables can be made; for numeric columns (with continuous values), value range can be set. All filters applied are not technically making any selection on cells, but adjusting which cells to display in the table. This can be inferred from the first text span summary below the table area. After users applied all necessary filters, click on “Add all filtered” button below the table area to technically make the selection. Meanwhile, each row in the table is clickable to make a single selection/cancellation on cells.
Similarly as the first approach, in the figure, we are making an equivalent condition definition.
The final way allows users without any useful annotation but still know which cells are of interests. It is done by just simply pasting a list of cell identifiers into the text box for each condition. Note that the cell identifiers used must be the default cell IDs in the background SCE object (i.e. colnames
). The input text should be formatted by one ID per line, split with no symbol. A summary text span below each text box will be dynamically updated after pasting, for the number of valid IDs found in the input.
NOTE that the “Name of Condition” fields are always required. It is highly recommended that users put easily understandable names there, in order to avoid the confusion when multiple sets of analyses are performed, and to keep the automatic legends on DE specific plots clean as well.
4. Parameter Settings
5. Results & Visualization
Heatmap
This tab enables a heatmap visualization for the DE analysis selected. Here SCTK will automatically group the involved cells to the two conditions where they are assigned to, and group the genes by the log2FC values, which indicate whether a gene is up-regulated or down-regulated. Additional settings can be applied by clicking on the blue settings (cog) button.
Furthermore, if more sophisticated settings on the DE heatmap are needed, users can move to the generic heatmap viewer and make use of the “import from analysis” functionality.
Gene Table
In this tab, there will be a table with all the genes that are thought to be differentially expressed in the condition of interest against the condition of control, with high significance and passed all filter parameters. The gene names (default identifier in the background object, not necessarily a gene symbol), p-values, log2FC values, and FDR values will be present. The table will be saved in background by the name of the analysis, entered before running. Users can also download the table in comma separated value (CSV) format, by clicking “Download Result Table” button.
Violin Plot
In the violin plot tab, the UI will present a standard violin plot showing how expression differs in the top N selected genes between cells in both conditions. Users can set the number of top DEG to plot and the feature identifier type to display, by clicking on the blue settings (cog) button.
Linear Modeling Plot
In the linear modeling plot tab, an analysis of covariance (ANCOVA) is done, showing how expression differs in the top N selected genes between cells in both conditions. Users can set the number of top DEG to plot and the feature identifier type to display, by clicking on the blue settings (cog) button.
The differential expression analysis can be easily performed on any preprocessed SCE dataset. Here we introduce the workflow of using the generic wrapper runDEAnalysis()
.
Basic Parameters:
The most basic parameters include:
inSCE
.method
.
"wilcox"
, "MAST"
, "DESeq2"
, "Limma"
, "ANOVA"
useAssay
useAssay
to NULL
and use useReducedDim
for specifying a matrix in reducedDims
, such as the Pathway Analysis scores. When using the pathway analysis scores, "DESeq2"
method is not available since it expects non-negative integer value in the input matrix.Additional Required Parameters:
1. Conditions
The R script functions we have allow users to perform differential expression analysis with relatively flexible condition setting. In terms of the condition of interests and the condition for control, comparison groups can be set by giving either one or more categories under a column in colData
, or a prepared index vector as long as the indices are able to subset the input SCE object.
Additionally, only specifying the condition of interests is allowed and the control will then be set to all the other cells, which turns to a biomarker finding analysis.
Since the conditions can be set by using indices or annotations, we have two groups of parameters listed below. Note that only one way of setting can be used at one time.
index1
and index2
for index style settingclass
to specify the annotation vector by either directly giving a vector of proper length or giving a column name of colData
. Then use classGroup1
, classGroup2
to specify which categories in class
are of interests.2. Namings
Something that might be special in our workflow is that, users will be required to specify name strings for:
groupName1
.groupName2
.analysisName
.The reason is we assume that users are likely to perform multiple batches of analysis for a single dataset (e.g. group1 Vs. group2 and then group1 Vs. group3), and we hope to have everything stored in one SCE without leaving users any confusion when they look back on it. Also, the namings are also used by DE analysis related plotting functions, so that legends can be well annotated.
Example
To demonstrate a simple and clear example, here we use the “PBMC-3k” dataset from “10X” which can be easily imported with SCTK functions.
The preprocessing for obtaining cluster assignment was done using the exactly same workflow presented in the A La Carte Workflow Tutorial.
library(singleCellTK)
sce <- importExampleData("pbmc3k")
sce <- runPerCellQC(sce, mitoRef = "human", mitoIDType = "symbol", mitoGeneLocation = "rownames")
sce <- subsetSCECols(sce, colData = c('total > 600', 'detected > 300', 'mito_percent < 5'))
sce <- scaterlogNormCounts(sce, useAssay = "counts", assayName = "logcounts")
sce <- runModelGeneVar(sce, useAssay = "logcounts")
sce <- setTopHVG(sce, method = "modelGeneVar")
sce <- scaterPCA(inSCE = sce, useAssay = "logcounts", useFeatureSubset = "HVG_modelGeneVar2000",
reducedDimName = "PCA", seed = 12345)
sce <- runScranSNN(sce, useReducedDim = "PCA", nComp = 10, k = 14,
weightType = "jaccard", clusterName = "cluster", seed = 12345)
sce <- runUMAP(sce, useReducedDim = "PCA", initialDims = 10, reducedDimName = "UMAP", seed = 12345)
And we can visualize the cell populations on the UMAP with cluster labeled.
plotSCEDimReduceColData(sce, colorBy = "cluster", reducedDimName = "UMAP")
# Condition chosen basing on the UMAP
sce <- runDEAnalysis(inSCE = sce, method = "wilcox", useAssay = "logcounts",
class = "cluster", classGroup1 = c(1, 2), classGroup2 = c(5),
groupName1 = "c1_2", groupName2 = "c5", analysisName = "c1_2_VS_c5")
Results
The results are saved in the metadata
slot of the returned SCE object
matadata(sce)
|- $info1
|- $info2
|- ...
|- $diffExp
|-$AnalysisName1
|-$AnalysisName2
|-...
|-$c1_2_VS_c5
|-$useAssay = "logcounts"
|-$groupNames = c("c1_2", "c5")
|-$select
| |-$ix1 = c(FALSE, TRUE, FALSE, ...)
| |-$ix2 = c(FALSE, FALSE, FALSE, ...)
|
| (Two logical vectors both having `ncol(inSCE)` values,
| specifies which cells are selected for "c1_2" or "c5")
|
|-$annotation = "cluster"
|-$result = (the `data.frame` of top DEG table)
|-$method = "wilcox"
Users may use getDEGTopTable()
function to fetch the result as a table of the top differential expressed genes and the statistics. Additional filters on the results can be applied and the feature ID type to display in the table can also be changed.
DEG <- getDEGTopTable(sce, "c1_2_VS_c5")
head(DEG)
## Gene Log2_FC Pvalue FDR group1MeanExp group2MeanExp
## 10608 LTB 1.1264965 6.754884e-43 1.163902e-39 2.474387 1.347890
## 29915 JUNB 0.8900972 9.865923e-31 7.340695e-28 3.116122 2.226025
## 20398 LDHB 0.7723192 7.738507e-44 1.490254e-40 1.939164 1.166845
## 5522 RPL32 0.6061814 2.730154e-64 8.937979e-61 4.613614 4.007433
## 23081 FOS 0.5978641 2.253412e-16 6.706565e-14 2.641626 2.043762
## 18502 RPS13 0.5829358 1.813796e-36 2.120716e-33 3.223682 2.640746
## group1ExprPerc group2ExprPerc
## 10608 0.9417122 0.6206897
## 29915 0.9526412 0.8275862
## 20398 0.9417122 0.6867816
## 5522 1.0000000 0.9971264
## 23081 0.9562842 0.8390805
## 18502 0.9899818 0.9683908
To visualize the result in a heatmap:
plotDEGHeatmap(sce, useResult = "c1_2_VS_c5", log2fcThreshold = .5)
To visualize the result in a volcano plot:
plotDEGVolcano(sce, useResult = "c1_2_VS_c5")
For other approaches of visualizing top differential expressed genes, refer to plotDEGRegression()
and plotDEGViolin()
, which are borrowed from MAST’s tutorial.