This guide helps users with performing batch correction when necessary. SingleCellTK provides 11 methods that are already published including BBKNN
, ComBatSeq
, FastMNN
, MNN
, Harmony
, LIGER
, Limma
, Scanorama
, scMerge
, Seurat
integration and ZINBWaVE
. All of these methods are available to use through our shiny ui application as well as through the R console environment through our wrapper functions.
These methods accept various type of input expression matrices (e.g. raw counts or log-normalized counts), and generate either a new corrected expression matrix or a low-dimensional representation. Users should choose a function according to their needs. More detailed instructions on how to use these methods either through the shiny ui application (select “Interactive Analysis”) or through the R console (select “Console Analysis”) are described below:
Entry of the Panel
From anywhere of the UI, the panel for normalization and batch correction can be accessed from the top navigation panel at the boxed tab. After this, click on the second boxed tab to enter the sub-page for batch correction.
Perform the Correction
The UI is constructed in a sidebar style, where the left-sided sidebar works for selecting method, setting the parameters and running the batch correction, and the right part main panel is for visualization checking.
For running any types of batch correction, there are always three essential inputs that users should be sure with:
"Sample"
or "sample"
, as the preprocessing steps generate this annotation by default.After the batch correction method is confirmed, the lower part will dynamically switch to the method specific settings. Different types of method have various parameter requirements, and various forms of output. For the detailed help information of each method, users can go to reference page and click on the function that stands for the method chosen, or click on the “(help for {method})” link in the UI.
Visualization
SCTK adopts two approaches to demonstrate the effect of batch correction:
There will be three columns in the plots for the variance explained. The third one is for the variance explained by batch division, the second is for that of an additional condition, which is optional, and the first one is the variance explained by combining both annotations. The additional condition is usually an annotation that indicates the differences between cells that should not be eliminated, such as cell types and treatments. In the first plot, the combined variance should ideally be close to the sum of the two types of variance. (i.e. the height of the first column should be closed to the sum of the heights of the other two) As a result that makes sense, the changes between the two plots in the second column should not be too much, while the third column should be largely eliminated after correction.
In the PCA plots, cells (i.e. dots) will be labeled with different colors to mark the batch belonging, and different shapes to mark for the additional condition, which was mentioned in the previous paragraph. Usually in the result, in the plot for the output, users can observe that different colors of dots are mixed up together, while different shapes are still grouped into separate clusters.
Before Correction
To run the pipeline, the most basic requirements are:
As we adopt SingleCellExperiment
(sce
) object through out the whole SCTK for a collection of all matrices and metadatas of the dataset of interests, the assay to be corrected, called "assayToCorr"
, has to be saved at assay(sce, "assayToCorr")
. Meanwhile, the batch annotation information has to be saved in a column of colData(sce)
.
Note that the batch annotation should better be saved as a
factor
in thecolData
, especially when the batches are represented by integer numbers, because some downstream analysis are likely to parse the non-character and non-logical information as continuous values instead of categorical values.
Choosing a Method
As mentioned in the introduction, our 11 methods vary in type of input and output. Input type for the expression matrix could be raw counts, log-transformed counts or scaled counts. Users need to make sure the assay being used is of the correct type. The output matrix can be an expression matrix (assay
), a low-dimensional representation (reducedDim
), or a subset features of expression matrix (altExp
). Users need to choose a method that produce a matrix that fits in their analysis.
1. Prepare an SCE object with multiple batches
Here we present an example dataset that is combined from “pbmc3k” and “pbmc4k”, originally available from package TENxPBMCData, which you can import by a function called importExampleData()
.
library(singleCellTK)
pbmc6k <- importExampleData('pbmc6k')
pbmc8k <- importExampleData('pbmc8k')
print(paste(dim(pbmc6k), c('genes', 'cells')))
print(paste(dim(pbmc8k), c('genes', 'cells')))
## [1] "32738 genes" "5419 cells"
## [1] "33694 genes" "8381 cells"
SCTK has a function called combineSCE()
, which accepts a list of SCE objects as input and returns a combined SCE object. This function requires that the number of genes in each SCE object has to be the same, and the gene metadata (i.e. rowData
) has to match with each other if the same fields exist. Therefore, we need some pre-process for the combination. Users do not necessarily have to follow the same way, depending on how the raw datasets are provided.
## Combine the two SCE objects
sce.combine <- combineSCE(sceList = list(pbmc6k = pbmc6k, pbmc8k = pbmc8k), by.r = names(rowData(pbmc6k)), by.c = names(colData(pbmc6k)), combined = TRUE)
table(sce.combine$sample)
##
## pbmc6k pbmc8k
## 5419 8381
In this manual, we only present a toy example instead of a best practice for real data. Therefore, QC and filtering are skipped.
Additionally, most of the batch correction methods provided require a log-normalized assay as input, users can check the required assay type of each method by looking at its default setting. (i.e. if by default useAssay = "logcounts"
, then it requires log-normalized assay; else if by default useAssay = "counts"
, then the raw count assay.)
## Simply filter out the genes that are expressed in less than 1% of all cells.
rowData(sce.combine)$expPerc <- rowSums(assay(sce.combine) > 0) / ncol(sce.combine)
sce.filter <- subsetSCERows(sce.combine, rowData = "expPerc >= 0.01", returnAsAltExp = FALSE)
print(sce.filter)
sce.small <- sce.filter[sample(nrow(sce.filter), 800), sample(ncol(sce.filter), 800)]
sce.small <- scaterlogNormCounts(inSCE = sce.small, useAssay = 'counts', assayName = 'logcounts')
print(sce.small)
## class: SingleCellExperiment
## dim: 10440 13800
## metadata(0):
## assays(1): counts
## rownames(10440): AP006222.2 RP11-206L10.9 ... SLC5A3 AC007325.4
## rowData names(5): rownames ENSEMBL_ID Symbol_TENx Symbol expPerc
## colnames(13800): pbmc6k_AAACATACAACCAC-1 pbmc6k_AAACATACACCAGT-1 ...
## pbmc8k_TTTGTCATCTCGAGTA-1 pbmc8k_TTTGTCATCTGCTTGC-1
## colData names(13): rownames Sample ... Date_published sample
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## class: SingleCellExperiment
## dim: 800 800
## metadata(1): assayType
## assays(2): counts logcounts
## rownames(800): NBPF15 TLR1 ... IQCG DMTF1
## rowData names(5): rownames ENSEMBL_ID Symbol_TENx Symbol expPerc
## colnames(800): pbmc8k_TGACGGCTCTCCCTGA-1 pbmc6k_TGATTAGAACCAAC-1 ...
## pbmc6k_GTAGTGTGTTCGTT-1 pbmc8k_CGCTATCAGCAGATCG-1
## colData names(14): rownames Sample ... sample sizeFactor
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
2. Run a batch correction method on the prepared SCE object
The basic way to run a batch correction method from SingleCellTK is to select a function for the corresponding method, input the SCE object, specify the assay to correct, and the batch annotation.
For example, here we will try the batch correction method provided by Limma, which fits a linear model to the data.
sce.small <- runLimmaBC(inSCE = sce.small, useAssay = 'logcounts', batch = 'sample', assayName = 'LIMMA')
print(sce.small)
## class: SingleCellExperiment
## dim: 800 800
## metadata(2): assayType batchCorr
## assays(3): counts logcounts LIMMA
## rownames(800): NBPF15 TLR1 ... IQCG DMTF1
## rowData names(5): rownames ENSEMBL_ID Symbol_TENx Symbol expPerc
## colnames(800): pbmc8k_TGACGGCTCTCCCTGA-1 pbmc6k_TGATTAGAACCAAC-1 ...
## pbmc6k_GTAGTGTGTTCGTT-1 pbmc8k_CGCTATCAGCAGATCG-1
## colData names(14): rownames Sample ... sample sizeFactor
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
3. Visualization
In this documentation, we provide three ways to examine the removal of batch effect, in terms of visualization.
This functionality is implemented in plotBatchVariance()
. It plots a violin plot of the variation explained by the given batch annotation, an additional condition, and the variation explained by combining these two conditions.
This plot would be useful for examining whether the existing batch effect is different than another condition (e.g. subtype) or is confounded by that. However, the additional condition labels (e.g. cell types) do not necessarily exist when batch effect removal is wanted, so only plotting the variation explained by batches is also supported.
plotBatchVariance(inSCE = sce.small, useAssay = 'logcounts', batch = 'sample', title = 'Variation Before Correction')
plotBatchVariance(inSCE = sce.small, useAssay = 'LIMMA', batch = 'sample', title = 'Variation After Correction')
This functionality is implemented in plotSCEBatchFeatureMean()
. The methodology is straight forward, which plots violin plots for all the batches, and within each batch, the plot illustrates the distribution of the mean expression level of each gene. Thus the batch effect can be observed from the mean and standard deviation of each batch.
There is no function special for batch correction, but this can be achieved simply by using the dimension reduction/2D embedding calculation functions (e.g. scaterPCA()
, runUMAP()
and run
TSNE()) and
plotSCEDimReduceColData()`.
sce.small <- runUMAP(inSCE = sce.small, useAssay = "logcounts", useReducedDim = NULL, logNorm = FALSE)
plotSCEDimReduceColData(inSCE = sce.small, colorBy = 'sample', reducedDimName = 'UMAP', title = 'UMAP Before Correction')
sce.small <- runUMAP(inSCE = sce.small, useAssay = 'LIMMA', useReducedDim = NULL, logNorm = FALSE, reducedDimName = 'UMAP_LIMMA')
plotSCEDimReduceColData(inSCE = sce.small, colorBy = 'sample', reducedDimName = 'UMAP_LIMMA', title = 'UMAP After Correction')
If the cell typing is already given, it is strongly recommended to specify shape = {colData colname for cell typing}
in plotSCEDimReduceColData()
to visualize the grouping simultaneously.
Method | Function | Input Type | Output Type |
---|---|---|---|
BBKNN [1] | runBBKNN() |
Scaled | reducedDim |
ComBatSeq [2] | runComBatSeq() |
raw counts | assay |
FastMNN [3] | runFastMNN() |
log-normalized | reducedDim |
MNN [3] | runMNNCorrect() |
log-normalized | assay |
Harmony [4] | runHarmony() |
log-normalized | reducedDim |
LIGER [5] | runLIGER() |
raw counts | reducedDim |
Limma [6] | runLimmaBC() |
log-normalized | assay |
Scanorama [7] | runSCANORAMA() |
log-normalized | assay |
scMerge [8] | runSCMerge() |
log-normalized | assay |
Seurat Integration [9] | runSeuratIntegration() |
raw counts | altExp |
ZINBWaVE [10] | runZINBWaVE() |
raw counts | reducedDim |
[1] K. Polański, M. D. Young, Z. Miao, K. B. Meyer, S. A. Teichmann, and J. E. Park, “BBKNN: fast batch alignment of single cell transcriptomes,” Bioinformatics, vol. 36, pp. 964–965, Aug. 2020, doi: 10.1093/bioinformatics/btz625.
[2] Y. Zhang, G. Parmigiani, and W. E. Johnson, “ComBat-seq: batch effect adjustment for RNA-seq count data,” NAR Genomics and Bioinformatics, vol. 2, p. lqaa078, Sep. 2020, doi: 10.1093/nargab/lqaa078.
[3] L. Haghverdi, A. T. L. Lun, M. D. Morgan, and J. C. Marioni, “Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors,” Nature Biotechnology, vol. 36, pp. 421–427, Apr. 2018, doi: 10.1038/nbt.4091.
[4] I. Korsunsky et al., “ Fast, sensitive and accurate integration of single-cell data with Harmony,” Nature Methods, vol. 16, pp. 1289–1296, Nov. 2019, doi: 10.1038/s41592-019-0619-0.
[5] J. Liu, C. Gao, J. Sodicoff, V. Kozareva, E. Z. Macosko, and J. D. Welch, “Jointly defining cell types from multiple single-cell datasets using LIGER,” Nature Protocols, vol. 15, pp. 3632–3662, Oct. 2020, doi: 10.1038/s41596-020-0391-8.
[6] M. E. Ritchie et al., “limma powers differential expression analyses for RNA-sequencing and microarray studies,” Nucleic Acids Research, vol. 43, no. 7, p. e47, Apr. 2015, doi: 10.1093/nar/gkv007.
[7] B. Hie, B. Bryson, and B. Berger, “Efficient integration of heterogeneous single-cell transcriptomes using Scanorama,” Nature Biotechnology, vol. 37, pp. 685–691, May 2019, doi: 10.1038/s41587-019-0113-3.
[8] Y. Lin et al., “scMerge leverages factor analysis, stable expression, and pseudoreplication to merge multiple single-cell RNA-seq datasets,” PNAS, vol. 116, pp. 9775–9784, May 2019, doi: 10.1073/pnas.1820006116.
[9] A. Butler, P. Hoffman, P. Smibert, E. Papalexi, and R. Satija, “Integrating single-cell transcriptomic data across different conditions, technologies, and species,” Nature Biotechnology, vol. 36, no. 5, pp. 411–420, Jun. 2018, doi: 10.1038/nbt.4096.
[10] D. Risso, F. Perraudeau, S. Gribkova, S. Dudoit, and J.-P. Vert, “A general and flexible method for signal extraction from single-cell RNA-seq data,” Nature Communications, vol. 9, no. 284, Jan. 2018, doi: 10.1038/s41467-017-02554-5.