1 Set up package environment
2 Load in scripts and R packages
3 Read in data files
- 3.1 Read in tumor annotation
- 3.2 Read in and process gene mutations
4 Plotting distribution of mutation rates
5 Testing associations between mutation burden and WGD status
6 Testing associations between mutation burden and WGD status while controlling for ploidy
7 Associations between gene mutational frequencies and WGD status
8 Session Information

1 Set up package environment

We first want to install all of the packages used in this analysis. We use the package renv as a way to manage package versions. The following code was initially run to set up the package library.

install.packages("renv")
library(renv)

renv::init(bare = TRUE)
install.packages("BiocManager")
library(BiocManager)
install.packages(c("ggplot2", "magrittr", "knitr", "kableExtra", "ggrepel",
                   "rmarkdown", "formatR", "bitops", "caTools","cowplot",
                   "Rcpp","digest","evaluate","htmltools","jsonlite",
                   "base64enc","rprojroot","highr","xfun","stringi"),
                 repo = BiocManager::repositories())
renv::snapshot()

If you are attempting to reproduce this analysis, you can 1) install/load renv, 2) copy/paste the renv.lock file into your current working directory and 3) run the renv::restore() command to automatically install all of the packages with the same version.

# Make sure the 'renv.lock' file is in your current working directory.
install.packages("renv")
library(renv)
renv::restore()

2 Load in scripts and R packages

Next, R packages will be loaded and scripts will be sourced.

source("logistic_regression_util.R")
source("maf_utils.R")
library("ggplot2")
library("cowplot")
library("magrittr")
library("knitr")
library("kableExtra")
library("ggrepel")
knitr::opts_chunk$set(fig.width=12, fig.height=8)

3 Read in data files

The file containing Mutation burden (i.e. mutation per Mb) and MSI or POLE status was obtained from the DNA damage TCGA working group from the PanCanAtlas consortium. This work was published in Cell and can be found at:

https://www.cell.com/cell-reports/pdf/S2211-1247(18)30437-6.pdf.

Data from the paper can be found at:

https://gdc.cancer.gov/about-data/publications/PanCan-DDR-2018.

The ABSOLUTE call file can be found at:

http://api.gdc.cancer.gov/data/4f277128-f793-4354-a13d-30cc7fe9f6b5.

3.1 Read in tumor annotation

# Read in mutation rate data
mut.rates = read.table("../../Data/Mutation_Rate_and-MSI.txt", header=TRUE, stringsAsFactors = FALSE, sep="\t", row.names = 1)

# Read in ABSOLUTE calls and combine with mut.rates
absolute = read.table("../../Data/TCGA_mastercalls.abs_tables_JSedit.fixed.txt", header = TRUE, stringsAsFactors = FALSE, sep = "\t", row.names = 1)
mut.rates$Ploidy = absolute[rownames(mut.rates), "ploidy"]
mut.rates$PloidyCorrected_MutationPerMb = log10(mut.rates$Non.silentMutationsperMb) / mut.rates$Ploidy

mut.rates$WGD = factor(ifelse(mut.rates$Genome_doublings > 0, "WGD", "Non-WGD"), levels = c("Non-WGD", "WGD"))
mut.rates$MSI_POLE_Status = factor(ifelse(mut.rates$MSIPOLE == "MSIPOLE", "Yes", "No"), levels = c("No", "Yes"))
mut.rates$Log10_MutsPerMB = log10(mut.rates$Non.silentMutationsperMb)

# Make tumor type a factor ordered by median mutation rate
agg = aggregate(mut.rates$Non.silentMutationsperMb,
                by=list(mut.rates$Type),
                "median")
tumor.type = agg[order(agg[,2]),1]
mut.rates$Type = factor(mut.rates$Type, levels = tumor.type)

# Create subsets
mut.rates.hyper = subset(mut.rates, MSI_POLE_Status == "Yes")
mut.rates.nonhyper = subset(mut.rates, MSI_POLE_Status == "No")

3.2 Read in and process gene mutations

All tumor ids in mut.rates match an entry in the PanCanAtlas MC3 maf, however some of the tumors are no longer present after filtering out poor quality variants. Therefore, the Tumor_Sample_Barcode column within the maf is converted to a factor and the levels are set to the tumors found in mut.rates so each tumor will get a column in the muts table generated by xtabs. The maf file was obtained from http://api.gdc.cancer.gov/data/1c8cfe5f-e52d-41ba-94da-f15ea1337efc. PanCan MutSig2CV results were obtained from https://api.gdc.cancer.gov/data/2e8d4650-5498-4288-a461-23667ff553e2.

# Read in file and subset to filtered variants
maf = read.table(gzfile("../../Data/mc3.v0.2.8.PUBLIC.maf.gz"), sep="\t", header=TRUE, stringsAsFactors = FALSE, quote="")
maf = subset(maf, FILTER %in% c("PASS", "wga", "native_wga_mix"))

# Substring ID and convert to factor with appropriate levels matching all tumors in mut.rates
maf$Tumor_Sample_Barcode = factor(substring(maf$Tumor_Sample_Barcode, 1, 15), levels = rownames(mut.rates))

# Subset to nonsynonymous variants
maf.c = maf.coding(maf)
muts = xtabs(~ Hugo_Symbol + Tumor_Sample_Barcode, data = maf.c)

# Subset further to significantly mutated genes from MutSig2CV
fs = list.files(path="../../Data/MutSig2CV/", full.names = TRUE)
fs = setdiff(fs, "../../Data/MutSig2CV//README.txt")
genes = c()
for(i in fs) {
  temp = read.table(i, header=TRUE, stringsAsFactors=FALSE, sep="\t", quote="")
  temp = subset(temp, qvalue < 0.1)
  genes = c(genes, temp[,1])
}
genes = sort(intersect(unique(genes), rownames(muts)))

muts = muts[genes,]

# Clear up some memory
rm(maf)
rm(maf.c)

4 Plotting distribution of mutation rates

We need to examine the distribution of mutation rates to determine appropriate statistical tests. In parts A and B, the density of non-silent mutations is plotted across 9414 tumors from TCGA. Densities are plotted separately for tumors with MSI or POLE tumors, which are expected to have higher mutation rates in general. By applying a log10 transformation in part B, we can more clearly see the distubtions.

In parts B and C, the density of non-silent mutations is plotted across 9240 tumors from TCGA that do NOT fall into MSI or POLE categories. In part C, we can see that although hypermutated samples are excluded, there is still a long tailed distribution. The application of log10 in part D makes the distrubution appear more “bell-shapped”. Therefore, we should either use non-parametric tests (e.g. Wilcoxon) or use logged transformed values with parametric tests (e.g. t-test).

p1 = ggplot(mut.rates, aes(x=Non.silentMutationsperMb, fill=MSI_POLE_Status)) + geom_density(alpha = 0.5) + theme_bw() + ggtitle("All tumors, no log transformation")
p2 = ggplot(mut.rates, aes(x=Non.silentMutationsperMb, fill=MSI_POLE_Status)) + geom_density(alpha = 0.5) + theme_bw() + ggtitle("All tumors, with Log10 transformation") + scale_x_log10()

p3 = ggplot(mut.rates.nonhyper, aes(x=Non.silentMutationsperMb, fill=MSI_POLE_Status)) + geom_density(alpha = 0.5) + theme_bw() + ggtitle("Non-hypermutated tumors, no log transformation")
p4 = ggplot(mut.rates.nonhyper, aes(x=Non.silentMutationsperMb, fill=MSI_POLE_Status)) + geom_density(alpha = 0.5) + theme_bw() + ggtitle("Non-hypermutated tumors, with Log10 transformation") + scale_x_log10()

plot_grid(p1, p2, p3, p4, labels = "AUTO")

5 Testing associations between mutation burden and WGD status

Next, we will use linear modeling to examine associations between mutation burden per Mb and WGD status while controlling for tumor type and MSI/POLE status. Note that in the “Estimate” column in the “Coefficients” section within the summary tables, the (Intercept) is the baseline group and refers to the mean mutation rate for , non-WGD, non-MSI/POLE tumors. WGDWGD refers to the difference between the baseline group and WGD tumors. MSI_POLE_StatusYes refers to the difference between the baseline group and tumors with MSI/POLE status. Rows for each tumor type (e.g. TypeBLCA) refer to the mean difference between each tumor type and the baseline group.

There are no significant associations between WGD status and mutation rate when using the non-transformed values. However, when using log-transformed values, we see strong associations with WGD status. This is true when including MSI/POLE tumors and controlling for this in the model AND when subsetting to non-MSI/POLE tumors and estimating a new model.

Overall, these results show that there is an overall strong association between WGD status and mutation rate when averaging across all tumor types . This analysis does not show whether this trend is true for all tumor types or if a subset of tumor types are driving the association.

# Controlling for tumor type and Hypermutation
model.full = lm(Non.silentMutationsperMb ~ WGD + MSI_POLE_Status + Type, data=mut.rates)
summary(model.full)

## 
## Call:
## lm(formula = Non.silentMutationsperMb ~ WGD + MSI_POLE_Status + 
##     Type, data = mut.rates)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -81.93  -1.90  -0.61   0.07 650.40 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         0.27857    1.36847   0.204 0.838698    
## WGDWGD             -0.07853    0.40587  -0.193 0.846574    
## MSI_POLE_StatusYes 76.16014    1.43716  52.994  < 2e-16 ***
## TypeTHCA            0.01462    1.58542   0.009 0.992641    
## TypeTHYM            0.16719    2.16576   0.077 0.938469    
## TypeTGCT            0.12371    2.00686   0.062 0.950848    
## TypeUVM             0.20736    2.35327   0.088 0.929786    
## TypeLAML            1.07605    2.13611   0.504 0.614455    
## TypeKICH           -0.68550    2.52654  -0.271 0.786151    
## TypeACC             0.17775    2.27859   0.078 0.937822    
## TypeMESO            0.48090    2.36288   0.204 0.838730    
## TypeCHOL            0.84852    3.16544   0.268 0.788660    
## TypePRAD            0.80560    1.58180   0.509 0.610558    
## TypeLGG             1.05494    1.56799   0.673 0.501094    
## TypeSARC            0.67159    1.78782   0.376 0.707185    
## TypePAAD            2.73553    1.94006   1.410 0.158565    
## TypeBRCA            1.52186    1.47839   1.029 0.303317    
## TypeUCS             2.05381    2.68098   0.766 0.443655    
## TypeGBM             2.53184    1.63092   1.552 0.120600    
## TypeKIRC            1.20545    1.64979   0.731 0.465000    
## TypeKIRP            1.18523    1.71607   0.691 0.489792    
## TypeUCEC            6.39443    1.60428   3.986 6.77e-05 ***
## TypeCESC            3.36050    1.98163   1.696 0.089952 .  
## TypeLIHC            2.22482    1.68165   1.323 0.185869    
## TypeOV              2.80459    1.63025   1.720 0.085404 .  
## TypeESCA            2.28297    1.93751   1.178 0.238706    
## TypeREAD            6.41732    2.07244   3.097 0.001964 ** 
## TypeDLBC            2.81262    3.13054   0.898 0.368972    
## TypeHNSC            3.62029    1.57362   2.301 0.021436 *  
## TypeCOAD            4.13320    1.64851   2.507 0.012185 *  
## TypeSTAD            8.42660    1.60426   5.253 1.53e-07 ***
## TypeBLCA            6.15916    1.62559   3.789 0.000152 ***
## TypeLUAD            7.61471    1.57895   4.823 1.44e-06 ***
## TypeLUSC            7.11767    1.58949   4.478 7.63e-06 ***
## TypeSKCM           14.45046    1.58825   9.098  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.13 on 9379 degrees of freedom
## Multiple R-squared:  0.2972, Adjusted R-squared:  0.2946 
## F-statistic: 116.7 on 34 and 9379 DF,  p-value: < 2.2e-16

# Controlling for tumor type and Hypermutation with log transformation
model.full.log10 = lm(log10(Non.silentMutationsperMb) ~ WGD + MSI_POLE_Status + Type, data=mut.rates)
summary(model.full.log10)

## 
## Call:
## lm(formula = log10(Non.silentMutationsperMb) ~ WGD + MSI_POLE_Status + 
##     Type, data = mut.rates)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.34038 -0.19150 -0.01097  0.16847  2.16187 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -0.659856   0.029526 -22.348  < 2e-16 ***
## WGDWGD              0.087711   0.008757  10.016  < 2e-16 ***
## MSI_POLE_StatusYes  1.181884   0.031008  38.116  < 2e-16 ***
## TypeTHCA            0.024806   0.034207   0.725  0.46836    
## TypeTHYM            0.032610   0.046728   0.698  0.48528    
## TypeTGCT            0.014279   0.043300   0.330  0.74158    
## TypeUVM             0.175998   0.050774   3.466  0.00053 ***
## TypeLAML            0.150160   0.046088   3.258  0.00113 ** 
## TypeKICH            0.309860   0.054512   5.684 1.35e-08 ***
## TypeACC             0.495454   0.049163  10.078  < 2e-16 ***
## TypeMESO            0.445372   0.050981   8.736  < 2e-16 ***
## TypeCHOL            0.493489   0.068297   7.226 5.38e-13 ***
## TypePRAD            0.485802   0.034129  14.234  < 2e-16 ***
## TypeLGG             0.481022   0.033831  14.218  < 2e-16 ***
## TypeSARC            0.498500   0.038574  12.923  < 2e-16 ***
## TypePAAD            0.556086   0.041858  13.285  < 2e-16 ***
## TypeBRCA            0.661379   0.031898  20.734  < 2e-16 ***
## TypeUCS             0.751159   0.057844  12.986  < 2e-16 ***
## TypeGBM             0.830007   0.035188  23.588  < 2e-16 ***
## TypeKIRC            0.755898   0.035596  21.236  < 2e-16 ***
## TypeKIRP            0.761471   0.037026  20.566  < 2e-16 ***
## TypeUCEC            0.968828   0.034614  27.990  < 2e-16 ***
## TypeCESC            0.971555   0.042755  22.724  < 2e-16 ***
## TypeLIHC            0.914194   0.036283  25.196  < 2e-16 ***
## TypeOV              0.921154   0.035174  26.188  < 2e-16 ***
## TypeESCA            0.961971   0.041803  23.012  < 2e-16 ***
## TypeREAD            1.029012   0.044715  23.013  < 2e-16 ***
## TypeDLBC            1.057293   0.067544  15.653  < 2e-16 ***
## TypeHNSC            1.065720   0.033952  31.389  < 2e-16 ***
## TypeCOAD            1.162253   0.035568  32.677  < 2e-16 ***
## TypeSTAD            1.216415   0.034613  35.143  < 2e-16 ***
## TypeBLCA            1.274766   0.035073  36.346  < 2e-16 ***
## TypeLUAD            1.284862   0.034067  37.716  < 2e-16 ***
## TypeLUSC            1.389252   0.034295  40.509  < 2e-16 ***
## TypeSKCM            1.499240   0.034268  43.751  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3696 on 9379 degrees of freedom
## Multiple R-squared:  0.6017, Adjusted R-squared:  0.6002 
## F-statistic: 416.7 on 34 and 9379 DF,  p-value: < 2.2e-16

# Testing hypermutated samples only
model.hyper = lm(Non.silentMutationsperMb ~ WGD + Type, data=mut.rates.hyper)
summary(model.hyper)

## 
## Call:
## lm(formula = Non.silentMutationsperMb ~ WGD + Type, data = mut.rates.hyper)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -161.55  -69.97  -37.57   33.57  651.01 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   -2.967    116.849  -0.025   0.9798  
## WGDWGD        19.460     50.694   0.384   0.7016  
## TypeACC      -11.331    128.941  -0.088   0.9301  
## TypePRAD     191.176    157.281   1.216   0.2260  
## TypeLGG      324.692    157.281   2.064   0.0406 *
## TypeSARC       4.085    157.281   0.026   0.9793  
## TypePAAD     389.267    157.281   2.475   0.0144 *
## TypeBRCA      22.773    117.097   0.194   0.8461  
## TypeUCS      106.988    157.281   0.680   0.4974  
## TypeGBM      170.342    122.735   1.388   0.1672  
## TypeUCEC      85.189    117.058   0.728   0.4679  
## TypeCESC      47.300    157.281   0.301   0.7640  
## TypeESCA     -11.979    148.888  -0.080   0.9360  
## TypeREAD     194.948    128.159   1.521   0.1302  
## TypeCOAD      47.550    118.022   0.403   0.6876  
## TypeSTAD     114.071    125.978   0.905   0.3666  
## TypeBLCA      47.809    131.408   0.364   0.7165  
## TypeSKCM      36.950    157.281   0.235   0.8146  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 105.3 on 156 degrees of freedom
## Multiple R-squared:  0.1566, Adjusted R-squared:  0.06472 
## F-statistic: 1.704 on 17 and 156 DF,  p-value: 0.04708

# Testing hypermutated samples only with log transformation
model.hyper.log10 = lm(log10(Non.silentMutationsperMb) ~ WGD + Type, data=mut.rates.hyper)
summary(model.hyper.log10)

## 
## Call:
## lm(formula = log10(Non.silentMutationsperMb) ~ WGD + Type, data = mut.rates.hyper)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5723 -0.4571 -0.0354  0.4282  1.7118 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  1.23593    0.67551   1.830   0.0692 .
## WGDWGD      -0.01863    0.29306  -0.064   0.9494  
## TypeACC     -0.69754    0.74541  -0.936   0.3508  
## TypePRAD     1.03871    0.90925   1.142   0.2550  
## TypeLGG      1.27156    0.90925   1.398   0.1640  
## TypeSARC    -1.18738    0.90925  -1.306   0.1935  
## TypePAAD     1.35100    0.90925   1.486   0.1393  
## TypeBRCA    -0.74880    0.67695  -1.106   0.2704  
## TypeUCS      0.78120    0.90925   0.859   0.3916  
## TypeGBM      0.77676    0.70954   1.095   0.2753  
## TypeUCEC     0.27193    0.67672   0.402   0.6884  
## TypeCESC     0.41080    0.90925   0.452   0.6520  
## TypeESCA    -0.56276    0.86073  -0.654   0.5142  
## TypeREAD     0.90232    0.74089   1.218   0.2251  
## TypeCOAD     0.25087    0.68229   0.368   0.7136  
## TypeSTAD     0.71108    0.72829   0.976   0.3304  
## TypeBLCA     0.21315    0.75968   0.281   0.7794  
## TypeSKCM     0.29534    0.90925   0.325   0.7458  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6086 on 156 degrees of freedom
## Multiple R-squared:  0.2226, Adjusted R-squared:  0.1379 
## F-statistic: 2.628 on 17 and 156 DF,  p-value: 0.0008978

# Testing non-hypermutated samples only
model.nonhyper = lm(Non.silentMutationsperMb ~ WGD + Type, data=mut.rates.nonhyper)
summary(model.nonhyper)

## 
## Call:
## lm(formula = Non.silentMutationsperMb ~ WGD + Type, data = mut.rates.nonhyper)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.819  -1.690  -0.458   0.102 271.100 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.27718    0.69589   0.398 0.690412    
## WGDWGD      -0.06941    0.20703  -0.335 0.737425    
## TypeTHCA     0.01578    0.80620   0.020 0.984386    
## TypeTHYM     0.16762    1.10131   0.152 0.879033    
## TypeTGCT     0.11636    1.02059   0.114 0.909227    
## TypeUVM      0.20819    1.19666   0.174 0.861892    
## TypeLAML     1.07711    1.08623   0.992 0.321420    
## TypeKICH     0.23932    1.29180   0.185 0.853026    
## TypeACC      1.81514    1.16679   1.556 0.119820    
## TypeMESO     0.48056    1.20155   0.400 0.689199    
## TypeCHOL     0.84789    1.60966   0.527 0.598378    
## TypePRAD     0.56660    0.80458   0.704 0.481313    
## TypeLGG      0.56582    0.79753   0.709 0.478054    
## TypeSARC     1.00938    0.90998   1.109 0.267359    
## TypePAAD     0.74023    0.98814   0.749 0.453807    
## TypeBRCA     1.76034    0.75202   2.341 0.019263 *  
## TypeUCS      1.58325    1.37253   1.154 0.248725    
## TypeGBM      1.70511    0.83033   2.054 0.040048 *  
## TypeKIRC     1.20541    0.83894   1.437 0.150799    
## TypeKIRP     1.18602    0.87264   1.359 0.174140    
## TypeUCEC     6.53743    0.82614   7.913 2.80e-15 ***
## TypeCESC     3.60851    1.00953   3.574 0.000353 ***
## TypeLIHC     2.22301    0.85514   2.600 0.009349 ** 
## TypeOV       2.80091    0.82903   3.379 0.000732 ***
## TypeESCA     2.75142    0.98673   2.788 0.005307 ** 
## TypeREAD     2.71502    1.06367   2.552 0.010711 *  
## TypeDLBC     2.81179    1.59191   1.766 0.077379 .  
## TypeHNSC     3.61806    0.80021   4.521 6.22e-06 ***
## TypeCOAD     6.45664    0.84563   7.635 2.48e-14 ***
## TypeSTAD     8.11038    0.81709   9.926  < 2e-16 ***
## TypeBLCA     6.29669    0.82722   7.612 2.97e-14 ***
## TypeLUAD     7.61072    0.80294   9.479  < 2e-16 ***
## TypeLUSC     7.11400    0.80830   8.801  < 2e-16 ***
## TypeSKCM    14.57326    0.80788  18.039  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.71 on 9206 degrees of freedom
## Multiple R-squared:  0.1477, Adjusted R-squared:  0.1447 
## F-statistic: 48.36 on 33 and 9206 DF,  p-value: < 2.2e-16

# Testing non-hypermutated samples only with log transformation
model.nonhyper.log10 = lm(log10(Non.silentMutationsperMb) ~ WGD + Type, data=mut.rates.nonhyper)
summary(model.nonhyper.log10)

## 
## Call:
## lm(formula = log10(Non.silentMutationsperMb) ~ WGD + Type, data = mut.rates.nonhyper)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.34097 -0.18836 -0.00993  0.16700  2.16255 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.660044   0.028925 -22.819  < 2e-16 ***
## WGDWGD       0.088940   0.008605  10.336  < 2e-16 ***
## TypeTHCA     0.024961   0.033510   0.745 0.456353    
## TypeTHYM     0.032668   0.045776   0.714 0.475467    
## TypeTGCT     0.013290   0.042421   0.313 0.754076    
## TypeUVM      0.176109   0.049739   3.541 0.000401 ***
## TypeLAML     0.150303   0.045149   3.329 0.000875 ***
## TypeKICH     0.305224   0.053694   5.685 1.35e-08 ***
## TypeACC      0.508450   0.048498  10.484  < 2e-16 ***
## TypeMESO     0.445326   0.049942   8.917  < 2e-16 ***
## TypeCHOL     0.493403   0.066905   7.375 1.79e-13 ***
## TypePRAD     0.483156   0.033443  14.447  < 2e-16 ***
## TypeLGG      0.477800   0.033150  14.413  < 2e-16 ***
## TypeSARC     0.502457   0.037823  13.284  < 2e-16 ***
## TypePAAD     0.546158   0.041072  13.298  < 2e-16 ***
## TypeBRCA     0.664974   0.031258  21.274  < 2e-16 ***
## TypeUCS      0.736770   0.057049  12.915  < 2e-16 ***
## TypeGBM      0.825220   0.034513  23.911  < 2e-16 ***
## TypeKIRC     0.755893   0.034870  21.677  < 2e-16 ***
## TypeKIRP     0.761577   0.036271  20.997  < 2e-16 ***
## TypeUCEC     0.963528   0.034339  28.060  < 2e-16 ***
## TypeCESC     0.970240   0.041961  23.122  < 2e-16 ***
## TypeLIHC     0.913951   0.035544  25.713  < 2e-16 ***
## TypeOV       0.920658   0.034459  26.718  < 2e-16 ***
## TypeESCA     0.967312   0.041013  23.585  < 2e-16 ***
## TypeREAD     1.008680   0.044212  22.815  < 2e-16 ***
## TypeDLBC     1.057182   0.066168  15.977  < 2e-16 ***
## TypeHNSC     1.065420   0.033261  32.032  < 2e-16 ***
## TypeCOAD     1.175318   0.035149  33.438  < 2e-16 ***
## TypeSTAD     1.213602   0.033962  35.734  < 2e-16 ***
## TypeBLCA     1.276233   0.034384  37.118  < 2e-16 ***
## TypeLUAD     1.284325   0.033374  38.482  < 2e-16 ***
## TypeLUSC     1.388758   0.033597  41.336  < 2e-16 ***
## TypeSKCM     1.500010   0.033580  44.670  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.362 on 9206 degrees of freedom
## Multiple R-squared:  0.5766, Adjusted R-squared:  0.575 
## F-statistic: 379.9 on 33 and 9206 DF,  p-value: < 2.2e-16

Next, we calculate the differences between WGD and non-WGD samples within each tumor type using the non-parametric Wilcoxon rank-sum test.

res = c()
tumor.type = unique(unique(mut.rates.nonhyper$Type))
for(i in tumor.type) {
  mut.rates.type = subset(mut.rates.nonhyper, Type == i)
  model = wilcox.test(Non.silentMutationsperMb ~ WGD, data = mut.rates.type)
  agg.median = aggregate(mut.rates.type$Non.silentMutationsperMb,
                  by=list(mut.rates.type$WGD),
                  median)
  agg.mean = aggregate(mut.rates.type$Non.silentMutationsperMb,
                  by=list(mut.rates.type$WGD),
                  mean)
  res = rbind(res, c(agg.median[,2], agg.mean[,2], model$p.value))
}
colnames(res) = c("Non-WGD_Median", "WGD_Median",
                  "Non-WGD_Mean", "WGD_Mean",
                  "Pvalue")
res = data.frame(Type = tumor.type,
                 res,
                 FDR=p.adjust(res[,"Pvalue"], "fdr"),
                 Bonferroni = p.adjust(res[,"Pvalue"], "bon"),
                 check.names = FALSE)

res = res[order(res$Pvalue),]

# Show results in table
kable(res, style = 'html', row.names = FALSE) %>%
  kable_styling(bootstrap_options = "striped") %>%
  scroll_box(width = "100%", height = "800px")

Type	Non-WGD_Median	WGD_Median	Non-WGD_Mean	WGD_Mean	Pvalue	FDR	Bonferroni
BRCA	0.7887883	1.2911623	1.6425024	2.4745046	0.0000000	0.0000000	0.0000000
LUAD	3.3057242	6.9627762	5.3857255	9.5580078	0.0000000	0.0000000	0.0000000
SARC	0.5722591	0.9102237	0.8857385	1.6516875	0.0000000	0.0000000	0.0000000
BLCA	3.1230637	5.7202472	4.9002245	7.5739151	0.0000000	0.0000000	0.0000000
PRAD	0.6903266	0.9494249	0.8251362	0.9890484	0.0000003	0.0000023	0.0000114
LUSC	5.6169853	6.7154968	6.2240072	8.2582152	0.0000309	0.0001700	0.0010199
LIHC	1.8065517	2.3281210	2.3667956	2.6778088	0.0003893	0.0018351	0.0128459
THYM	0.2360255	0.4123456	0.4194798	0.5894340	0.0014128	0.0058279	0.0466235
COAD	2.9237316	2.6972247	9.3617234	3.1911594	0.0053887	0.0176937	0.1778274
ACC	0.5187245	0.7906970	1.7644122	2.3151797	0.0055410	0.0176937	0.1828544
DLBC	2.3135876	4.0776043	2.6005717	4.5390239	0.0058979	0.0176937	0.1946312
LGG	0.7095434	0.7982604	0.7667416	0.9274871	0.0167829	0.0440212	0.5538356
OV	1.9435882	2.1925929	3.6942358	2.5163338	0.0173417	0.0440212	0.5722760
KIRC	1.3492120	1.5275596	1.4253739	1.7204570	0.0380972	0.0898006	1.0000000
ESCA	1.9362814	2.3858724	3.2732779	2.7673311	0.0428929	0.0943644	1.0000000
CHOL	0.6594951	0.9153742	1.1797529	0.8642641	0.0486301	0.1002995	1.0000000
PAAD	0.8976690	1.0303383	0.9533860	1.1304550	0.0556280	0.1079837	1.0000000
SKCM	8.2243260	9.0626483	13.1679982	17.2816952	0.1007054	0.1846265	1.0000000
HNSC	2.6010606	2.8967389	3.9823461	3.6935965	0.1178907	0.2047576	1.0000000
THCA	0.2364548	0.3012428	0.2901206	0.3287041	0.1261393	0.2081298	1.0000000
PCPG	0.2258893	0.2489308	0.2569747	0.3197262	0.1480959	0.2294626	1.0000000
KICH	0.4203093	0.5863136	0.4770349	0.6882803	0.1529751	0.2294626	1.0000000
KIRP	1.4209476	1.7404218	1.4333290	1.8170041	0.1726414	0.2477028	1.0000000
UCEC	1.3706469	1.3505228	8.5867437	2.0701521	0.2183405	0.3002182	1.0000000
UVM	0.3215546	0.3881548	0.4841662	0.4339046	0.2489854	0.3286607	1.0000000
MESO	0.6233613	0.7900537	0.6622938	1.0955767	0.2686183	0.3409386	1.0000000
GBM	1.3522352	1.4722469	2.0098155	1.7378375	0.3199001	0.3909891	1.0000000
STAD	2.7489453	3.4362930	10.9640890	4.6330983	0.4673039	0.5480047	1.0000000
CESC	1.7448954	2.3039068	4.3870496	2.8647043	0.4815799	0.5480047	1.0000000
READ	2.3601104	2.5257189	3.2471526	2.6678267	0.5904206	0.6494627	1.0000000
TGCT	0.2769972	0.2783775	0.3482366	0.3261160	0.6908412	0.7354116	1.0000000
UCS	1.3342429	1.3009094	3.5672642	1.5004883	0.9160979	0.9447260	1.0000000
LAML	0.3387778	0.3521584	1.3907336	0.3280533	0.9550171	0.9550171	1.0000000

We can plot the mutation counts per Mb for each tumor type divided into WGD and non-WGD.

ggplot(mut.rates.nonhyper, aes(x=Type, y=Non.silentMutationsperMb, fill=WGD)) +
  geom_violin() + 
  theme_bw() +
  scale_y_log10() +
  ylab("Non-silent mutations per Mb (log10)") + 
  xlab("Tumor Type") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank())

We can also plot each tumor type individually to better see the differences between WGD and non-WGD tumors.

ggplot(mut.rates.nonhyper, aes(x=WGD, y=Non.silentMutationsperMb, fill=WGD)) +
  geom_violin() + 
  geom_jitter(size = 0.11, color="grey") +
  facet_wrap(~ Type) + 
  theme_bw() +
  scale_y_log10() +
  ylab("Non-silent mutations per Mb (log10)") + 
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank())

6 Testing associations between mutation burden and WGD status while controlling for ploidy

We perform the same analysis as in the previous section, but add Ploidy as a covariate to correct for overall DNA content regardless of whether the increase in ploidy was due to WGD or not.

# Controlling for tumor type and Hypermutation
model.full.ploidy = lm(Non.silentMutationsperMb ~ WGD + Ploidy + MSI_POLE_Status + Type, data=mut.rates)
summary(model.full.ploidy)

## 
## Call:
## lm(formula = Non.silentMutationsperMb ~ WGD + Ploidy + MSI_POLE_Status + 
##     Type, data = mut.rates)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -81.80  -1.89  -0.61   0.09 650.47 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -0.469920   1.631387  -0.288 0.773315    
## WGDWGD             -0.642181   0.782274  -0.821 0.411715    
## Ploidy              0.383958   0.455550   0.843 0.399338    
## MSI_POLE_StatusYes 76.151159   1.437217  52.985  < 2e-16 ***
## TypeTHCA           -0.002643   1.585575  -0.002 0.998670    
## TypeTHYM            0.152891   2.165859   0.071 0.943724    
## TypeTGCT            0.265510   2.013934   0.132 0.895116    
## TypeUVM             0.183187   2.353480   0.078 0.937960    
## TypeLAML            1.044090   2.136476   0.489 0.625068    
## TypeKICH           -0.564078   2.530682  -0.223 0.823622    
## TypeACC             0.295910   2.282937   0.130 0.896871    
## TypeMESO            0.535937   2.363814   0.227 0.820642    
## TypeCHOL            0.866579   3.165562   0.274 0.784281    
## TypePRAD            0.788142   1.581964   0.498 0.618352    
## TypeLGG             0.982236   1.570388   0.625 0.531676    
## TypeSARC            0.713449   1.788533   0.399 0.689975    
## TypePAAD            2.759837   1.940301   1.422 0.154950    
## TypeBRCA            1.503869   1.478571   1.017 0.309127    
## TypeUCS             2.018409   2.681349   0.753 0.451614    
## TypeGBM             2.508383   1.631180   1.538 0.124138    
## TypeKIRC            1.193586   1.649878   0.723 0.469428    
## TypeKIRP            1.134958   1.717130   0.661 0.508653    
## TypeUCEC            6.374479   1.604483   3.973 7.15e-05 ***
## TypeCESC            3.359699   1.981659   1.695 0.090033 .  
## TypeLIHC            2.144415   1.684381   1.273 0.203008    
## TypeOV              2.826552   1.630488   1.734 0.083029 .  
## TypeESCA            2.344851   1.938932   1.209 0.226558    
## TypeREAD            6.405565   2.072519   3.091 0.002003 ** 
## TypeDLBC            2.711858   3.132871   0.866 0.386724    
## TypeHNSC            3.614407   1.573660   2.297 0.021651 *  
## TypeCOAD            4.120510   1.648605   2.499 0.012458 *  
## TypeSTAD            8.427906   1.604287   5.253 1.53e-07 ***
## TypeBLCA            6.122761   1.626188   3.765 0.000167 ***
## TypeLUAD            7.620255   1.578989   4.826 1.41e-06 ***
## TypeLUSC            7.155740   1.590157   4.500 6.88e-06 ***
## TypeSKCM           14.441269   1.588313   9.092  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.13 on 9378 degrees of freedom
## Multiple R-squared:  0.2972, Adjusted R-squared:  0.2946 
## F-statistic: 113.3 on 35 and 9378 DF,  p-value: < 2.2e-16

# Controlling for tumor type and Hypermutation with log transformation
model.full.log10.ploidy = lm(log10(Non.silentMutationsperMb) ~ WGD + Ploidy + MSI_POLE_Status + Type, data=mut.rates)
summary(model.full.log10.ploidy)

## 
## Call:
## lm(formula = log10(Non.silentMutationsperMb) ~ WGD + Ploidy + 
##     MSI_POLE_Status + Type, data = mut.rates)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.34072 -0.19100 -0.01054  0.16822  2.16241 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -0.684707   0.035197 -19.454  < 2e-16 ***
## WGDWGD              0.068997   0.016877   4.088 4.38e-05 ***
## Ploidy              0.012748   0.009828   1.297 0.194645    
## MSI_POLE_StatusYes  1.181586   0.031008  38.106  < 2e-16 ***
## TypeTHCA            0.024233   0.034208   0.708 0.478726    
## TypeTHYM            0.032135   0.046728   0.688 0.491654    
## TypeTGCT            0.018987   0.043450   0.437 0.662134    
## TypeUVM             0.175195   0.050776   3.450 0.000562 ***
## TypeLAML            0.149099   0.046094   3.235 0.001222 ** 
## TypeKICH            0.313892   0.054599   5.749 9.25e-09 ***
## TypeACC             0.499377   0.049254  10.139  < 2e-16 ***
## TypeMESO            0.447199   0.050999   8.769  < 2e-16 ***
## TypeCHOL            0.494088   0.068296   7.235 5.04e-13 ***
## TypePRAD            0.485222   0.034130  14.217  < 2e-16 ***
## TypeLGG             0.478608   0.033881  14.126  < 2e-16 ***
## TypeSARC            0.499890   0.038587  12.955  < 2e-16 ***
## TypePAAD            0.556893   0.041861  13.303  < 2e-16 ***
## TypeBRCA            0.660781   0.031900  20.714  < 2e-16 ***
## TypeUCS             0.749983   0.057849  12.964  < 2e-16 ***
## TypeGBM             0.829229   0.035192  23.563  < 2e-16 ***
## TypeKIRC            0.755505   0.035596  21.225  < 2e-16 ***
## TypeKIRP            0.759801   0.037047  20.509  < 2e-16 ***
## TypeUCEC            0.968166   0.034616  27.969  < 2e-16 ***
## TypeCESC            0.971528   0.042754  22.724  < 2e-16 ***
## TypeLIHC            0.911525   0.036340  25.083  < 2e-16 ***
## TypeOV              0.921883   0.035177  26.207  < 2e-16 ***
## TypeESCA            0.964025   0.041832  23.045  < 2e-16 ***
## TypeREAD            1.028622   0.044714  23.004  < 2e-16 ***
## TypeDLBC            1.053947   0.067591  15.593  < 2e-16 ***
## TypeHNSC            1.065525   0.033951  31.384  < 2e-16 ***
## TypeCOAD            1.161831   0.035568  32.665  < 2e-16 ***
## TypeSTAD            1.216458   0.034612  35.146  < 2e-16 ***
## TypeBLCA            1.273557   0.035085  36.300  < 2e-16 ***
## TypeLUAD            1.285047   0.034066  37.722  < 2e-16 ***
## TypeLUSC            1.390516   0.034307  40.531  < 2e-16 ***
## TypeSKCM            1.498934   0.034267  43.742  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3696 on 9378 degrees of freedom
## Multiple R-squared:  0.6018, Adjusted R-squared:  0.6003 
## F-statistic: 404.9 on 35 and 9378 DF,  p-value: < 2.2e-16

# Testing hypermutated samples only
model.hyper.ploidy = lm(Non.silentMutationsperMb ~ WGD + Ploidy + Type, data=mut.rates.hyper)
summary(model.hyper.ploidy)

## 
## Call:
## lm(formula = Non.silentMutationsperMb ~ WGD + Ploidy + Type, 
##     data = mut.rates.hyper)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -162.45  -70.10  -37.90   33.96  652.84 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  -16.228    138.642  -0.117   0.9070  
## WGDWGD         2.556    107.211   0.024   0.9810  
## Ploidy        10.585     59.098   0.179   0.8581  
## TypeACC      -11.490    129.346  -0.089   0.9293  
## TypePRAD     183.586    163.364   1.124   0.2628  
## TypeLGG      316.043    164.996   1.915   0.0573 .
## TypeSARC      -2.446    161.932  -0.015   0.9880  
## TypePAAD     380.407    165.346   2.301   0.0227 *
## TypeBRCA      14.317    126.594   0.113   0.9101  
## TypeUCS       99.081    163.833   0.605   0.5462  
## TypeGBM      163.931    128.216   1.279   0.2030  
## TypeUCEC      77.043    125.923   0.612   0.5415  
## TypeCESC      39.393    163.833   0.240   0.8103  
## TypeESCA     -11.556    149.371  -0.077   0.9384  
## TypeREAD     186.775    136.415   1.369   0.1729  
## TypeCOAD      39.198    127.242   0.308   0.7584  
## TypeSTAD     105.380    135.366   0.778   0.4375  
## TypeBLCA      42.056    135.675   0.310   0.7570  
## TypeSKCM      28.831    164.155   0.176   0.8608  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 105.6 on 155 degrees of freedom
## Multiple R-squared:  0.1568, Adjusted R-squared:  0.05888 
## F-statistic: 1.601 on 18 and 155 DF,  p-value: 0.0657

# Testing hypermutated samples only with log transformation
model.hyper.log10.ploidy = lm(log10(Non.silentMutationsperMb) ~ WGD + Ploidy + Type, data=mut.rates.hyper)
summary(model.hyper.log10.ploidy)

## 
## Call:
## lm(formula = log10(Non.silentMutationsperMb) ~ WGD + Ploidy + 
##     Type, data = mut.rates.hyper)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.43820 -0.45347 -0.03138  0.42339  1.66186 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.87026    0.79969   1.088    0.278
## WGDWGD      -0.48474    0.61840  -0.784    0.434
## Ploidy       0.29185    0.34088   0.856    0.393
## TypeACC     -0.70191    0.74607  -0.941    0.348
## TypePRAD     0.82943    0.94229   0.880    0.380
## TypeLGG      1.03309    0.95170   1.086    0.279
## TypeSARC    -1.36748    0.93403  -1.464    0.145
## TypePAAD     1.10669    0.95372   1.160    0.248
## TypeBRCA    -0.98195    0.73020  -1.345    0.181
## TypeUCS      0.56316    0.94499   0.596    0.552
## TypeGBM      0.59998    0.73955   0.811    0.418
## TypeUCEC     0.04733    0.72633   0.065    0.948
## TypeCESC     0.19277    0.94499   0.204    0.839
## TypeESCA    -0.55109    0.86158  -0.640    0.523
## TypeREAD     0.67699    0.78685   0.860    0.391
## TypeCOAD     0.02059    0.73394   0.028    0.978
## TypeSTAD     0.47144    0.78080   0.604    0.547
## TypeBLCA     0.05452    0.78258   0.070    0.945
## TypeSKCM     0.07147    0.94685   0.075    0.940
## 
## Residual standard error: 0.6091 on 155 degrees of freedom
## Multiple R-squared:  0.2263, Adjusted R-squared:  0.1364 
## F-statistic: 2.519 on 18 and 155 DF,  p-value: 0.001197

# Testing non-hypermutated samples only
model.nonhyper.ploidy = lm(Non.silentMutationsperMb ~ WGD + Ploidy + Type, data=mut.rates.nonhyper)
summary(model.nonhyper.ploidy)

## 
## Call:
## lm(formula = Non.silentMutationsperMb ~ WGD + Ploidy + Type, 
##     data = mut.rates.nonhyper)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.827  -1.704  -0.453   0.110 271.101 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.325885   0.829876  -0.393 0.694557    
## WGDWGD      -0.523679   0.398596  -1.314 0.188943    
## Ploidy       0.309365   0.231969   1.334 0.182352    
## TypeTHCA     0.001851   0.806237   0.002 0.998168    
## TypeTHYM     0.156093   1.101298   0.142 0.887292    
## TypeTGCT     0.230715   1.024146   0.225 0.821770    
## TypeUVM      0.188696   1.196700   0.158 0.874712    
## TypeLAML     1.051344   1.086359   0.968 0.333186    
## TypeKICH     0.335828   1.293773   0.260 0.795199    
## TypeACC      1.908486   1.168837   1.633 0.102544    
## TypeMESO     0.524912   1.201955   0.437 0.662328    
## TypeCHOL     0.862447   1.609627   0.536 0.592106    
## TypePRAD     0.552493   0.804619   0.687 0.492319    
## TypeLGG      0.507210   0.798711   0.635 0.525421    
## TypeSARC     1.043155   0.910297   1.146 0.251845    
## TypePAAD     0.760189   0.988210   0.769 0.441759    
## TypeBRCA     1.746050   0.752064   2.322 0.020272 *  
## TypeUCS      1.554441   1.372642   1.132 0.257477    
## TypeGBM      1.685991   0.830415   2.030 0.042355 *  
## TypeKIRC     1.195854   0.838931   1.425 0.154061    
## TypeKIRP     1.145507   0.873130   1.312 0.189568    
## TypeUCEC     6.521257   0.826196   7.893 3.28e-15 ***
## TypeCESC     3.607939   1.009485   3.574 0.000353 ***
## TypeLIHC     2.158254   0.856484   2.520 0.011756 *  
## TypeOV       2.818657   0.829099   3.400 0.000678 ***
## TypeESCA     2.800397   0.987372   2.836 0.004575 ** 
## TypeREAD     2.705811   1.063650   2.544 0.010979 *  
## TypeDLBC     2.730619   1.593008   1.714 0.086539 .  
## TypeHNSC     3.613351   0.800186   4.516 6.39e-06 ***
## TypeCOAD     6.447307   0.845626   7.624 2.70e-14 ***
## TypeSTAD     8.111851   0.817055   9.928  < 2e-16 ***
## TypeBLCA     6.267093   0.827485   7.574 3.98e-14 ***
## TypeLUAD     7.615243   0.802917   9.484  < 2e-16 ***
## TypeLUSC     7.144725   0.808593   8.836  < 2e-16 ***
## TypeSKCM    14.565894   0.807866  18.030  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.71 on 9205 degrees of freedom
## Multiple R-squared:  0.1479, Adjusted R-squared:  0.1448 
## F-statistic: 46.99 on 34 and 9205 DF,  p-value: < 2.2e-16

# Testing non-hypermutated samples only with log transformation
model.nonhyper.log10.ploidy = lm(log10(Non.silentMutationsperMb) ~ WGD + Ploidy + Type, data=mut.rates.nonhyper)
summary(model.nonhyper.log10.ploidy)

## 
## Call:
## lm(formula = log10(Non.silentMutationsperMb) ~ WGD + Ploidy + 
##     Type, data = mut.rates.nonhyper)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.34129 -0.18715 -0.00981  0.16656  2.16306 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.683436   0.034494 -19.813  < 2e-16 ***
## WGDWGD       0.071319   0.016568   4.305 1.69e-05 ***
## Ploidy       0.012000   0.009642   1.245 0.213319    
## TypeTHCA     0.024421   0.033512   0.729 0.466182    
## TypeTHYM     0.032221   0.045776   0.704 0.481530    
## TypeTGCT     0.017725   0.042569   0.416 0.677137    
## TypeUVM      0.175353   0.049741   3.525 0.000425 ***
## TypeLAML     0.149303   0.045155   3.306 0.000948 ***
## TypeKICH     0.308967   0.053776   5.745 9.46e-09 ***
## TypeACC      0.512071   0.048583  10.540  < 2e-16 ***
## TypeMESO     0.447047   0.049960   8.948  < 2e-16 ***
## TypeCHOL     0.493968   0.066905   7.383 1.68e-13 ***
## TypePRAD     0.482608   0.033444  14.430  < 2e-16 ***
## TypeLGG      0.475526   0.033199  14.324  < 2e-16 ***
## TypeSARC     0.503767   0.037837  13.314  < 2e-16 ***
## TypePAAD     0.546932   0.041075  13.315  < 2e-16 ***
## TypeBRCA     0.664420   0.031260  21.255  < 2e-16 ***
## TypeUCS      0.735653   0.057055  12.894  < 2e-16 ***
## TypeGBM      0.824479   0.034517  23.886  < 2e-16 ***
## TypeKIRC     0.755523   0.034871  21.666  < 2e-16 ***
## TypeKIRP     0.760006   0.036292  20.941  < 2e-16 ***
## TypeUCEC     0.962901   0.034341  28.039  < 2e-16 ***
## TypeCESC     0.970218   0.041960  23.123  < 2e-16 ***
## TypeLIHC     0.911439   0.035600  25.602  < 2e-16 ***
## TypeOV       0.921347   0.034462  26.735  < 2e-16 ***
## TypeESCA     0.969212   0.041041  23.616  < 2e-16 ***
## TypeREAD     1.008323   0.044211  22.807  < 2e-16 ***
## TypeDLBC     1.054033   0.066214  15.919  < 2e-16 ***
## TypeHNSC     1.065237   0.033260  32.027  < 2e-16 ***
## TypeCOAD     1.174956   0.035149  33.428  < 2e-16 ***
## TypeSTAD     1.213659   0.033961  35.737  < 2e-16 ***
## TypeBLCA     1.275086   0.034395  37.072  < 2e-16 ***
## TypeLUAD     1.284501   0.033374  38.488  < 2e-16 ***
## TypeLUSC     1.389950   0.033610  41.356  < 2e-16 ***
## TypeSKCM     1.499725   0.033579  44.662  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.362 on 9205 degrees of freedom
## Multiple R-squared:  0.5766, Adjusted R-squared:  0.5751 
## F-statistic: 368.8 on 34 and 9205 DF,  p-value: < 2.2e-16

In the last model examining non-hypermutated samples, we can see that mutation burden is still associated with WGD even after correcting for overall ploidy. Note that although the p-value is signficant the overall effect size is still relatively low, 0.07 in log10 scale is ~1.17, so WGD seems to increase mutation burden by about 1 per MB independent of DNA-content.

Next, we calculate the differences between WGD and non-WGD samples within each tumor type using the non-parametric Wilcoxon rank-sum test. In contrast to previous section, we divided the log10 mutation burden by the tumor’s ploidy to correction for overall DNA content.

res.ploidy = c()
tumor.type = unique(unique(mut.rates.nonhyper$Type))
for(i in tumor.type) {
  mut.rates.type = subset(mut.rates.nonhyper, Type == i)
  model = wilcox.test(PloidyCorrected_MutationPerMb ~ WGD, data = mut.rates.type)
  agg.median = aggregate(mut.rates.type$PloidyCorrected_MutationPerMb,
                  by=list(mut.rates.type$WGD),
                  median)
  agg.mean = aggregate(mut.rates.type$PloidyCorrected_MutationPerMb,
                  by=list(mut.rates.type$WGD),
                  mean)
  res.ploidy = rbind(res.ploidy, c(agg.median[,2], agg.mean[,2], model$p.value))
}
colnames(res.ploidy) = c("Non-WGD_Median", "WGD_Median",
                  "Non-WGD_Mean", "WGD_Mean",
                  "Pvalue")
res.ploidy = data.frame(Type = tumor.type,
                 res.ploidy,
                 FDR=p.adjust(res.ploidy[,"Pvalue"], "fdr"),
                 Bonferroni = p.adjust(res.ploidy[,"Pvalue"], "bon"),
                 check.names = FALSE)

res.ploidy = res.ploidy[order(res.ploidy$Pvalue),]

# Show results in table
kable(res.ploidy, style = 'html', row.names = FALSE) %>%
  kable_styling(bootstrap_options = "striped") %>%
  scroll_box(width = "100%", height = "800px")

Type	Non-WGD_Median	WGD_Median	Non-WGD_Mean	WGD_Mean	Pvalue	FDR	Bonferroni
LUSC	0.4033557	0.2512009	0.3823527	0.2590251	0.0000000	0.0000000	0.0000000
COAD	0.2278502	0.1253376	0.3119621	0.1344684	0.0000000	0.0000000	0.0000000
BRCA	-0.0491793	0.0328883	-0.0151825	0.0442232	0.0000000	0.0000000	0.0000000
SARC	-0.1253635	-0.0125760	-0.1143936	-0.0008105	0.0000000	0.0000000	0.0000000
SKCM	0.4664244	0.2824708	0.4241903	0.2808873	0.0000000	0.0000000	0.0000000
PRAD	-0.0814821	-0.0063491	-0.0923605	-0.0063781	0.0000000	0.0000000	0.0000000
PCPG	-0.3242914	-0.1716240	-0.3429651	-0.1619262	0.0000000	0.0000000	0.0000001
LGG	-0.0759703	-0.0267069	-0.0901267	-0.0266994	0.0000000	0.0000000	0.0000001
HNSC	0.2088784	0.1415845	0.2154824	0.1406598	0.0000000	0.0000001	0.0000012
READ	0.1748569	0.1182970	0.1880058	0.1220811	0.0000004	0.0000013	0.0000128
OV	0.1516856	0.1001366	0.1488547	0.1027001	0.0000037	0.0000112	0.0001227
STAD	0.2216100	0.1548190	0.3027677	0.1716448	0.0000095	0.0000261	0.0003134
THYM	-0.3102500	-0.0839461	-0.3282617	-0.1006516	0.0000205	0.0000520	0.0006760
THCA	-0.3121091	-0.1804234	-0.3174176	-0.1702747	0.0000802	0.0001889	0.0026452
ACC	-0.1519286	-0.0295340	-0.1383139	-0.0155641	0.0001409	0.0003099	0.0046488
KICH	-0.2214557	-0.0743173	-0.2181215	-0.0730882	0.0008347	0.0017217	0.0275467
UVM	-0.2320786	-0.1223200	-0.2385757	-0.1223627	0.0042076	0.0081676	0.1388492
CHOL	-0.0930938	-0.0123117	-0.0897872	-0.0220278	0.0055013	0.0100856	0.1815416
LIHC	0.1268829	0.0932007	0.1188285	0.0970984	0.0062439	0.0108447	0.2060484
UCEC	0.0695230	0.0371875	0.1857847	0.0582634	0.0088016	0.0145226	0.2904519
BLCA	0.2480573	0.2168641	0.2676753	0.2218998	0.0244806	0.0384695	0.8078601
PAAD	-0.0245814	0.0034962	-0.0606734	-0.0039827	0.0276393	0.0404548	0.9120957
GBM	0.0654117	0.0423195	0.0893722	0.0481101	0.0281958	0.0404548	0.9304610
ESCA	0.1561272	0.1264551	0.1614070	0.1288075	0.0358150	0.0492456	1.0000000
MESO	-0.1055617	-0.0286493	-0.1153821	-0.0398067	0.0461794	0.0609568	1.0000000
LAML	-0.2365280	-0.1153739	-0.2523608	-0.1261346	0.1446919	0.1836474	1.0000000
LUAD	0.2689551	0.2465316	0.2703211	0.2438727	0.1840052	0.2248952	1.0000000
TGCT	-0.2812626	-0.1882902	-0.3355509	-0.1894330	0.3067936	0.3615781	1.0000000
CESC	0.1258709	0.1200428	0.1665097	0.1069306	0.3278749	0.3730990	1.0000000
KIRC	0.0663805	0.0557844	0.0491416	0.0508983	0.3652635	0.4017898	1.0000000
UCS	0.0557397	0.0317711	0.1196058	0.0390536	0.4450818	0.4737968	1.0000000
DLBC	0.1791250	0.1690399	0.1731282	0.1549902	0.5191938	0.5354186	1.0000000
KIRP	0.0761933	0.0790934	0.0441408	0.0534016	0.8012356	0.8012356	1.0000000

We can plot the mutation counts per Mb divided by ploidy for each tumor type divided into WGD and non-WGD.

ggplot(mut.rates.nonhyper, aes(x=Type, y=PloidyCorrected_MutationPerMb, fill=WGD)) +
  geom_violin() + 
  theme_bw() +
  ylab("Non-silent mutations per Mb (log10) / Ploidy") + 
  xlab("Tumor Type") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank())

We can also plot each tumor type individually to better see the differences between Ploidy-corrected mutations burden and WGD/non-WGD tumors.

ggplot(mut.rates.nonhyper, aes(x=WGD, y=PloidyCorrected_MutationPerMb, fill=WGD)) +
  geom_violin() + 
  geom_jitter(size = 0.11, color="grey") +
  facet_wrap(~ Type) + 
  theme_bw() +
  ylab("Non-silent mutations per Mb (log10) / Ploidy") + 
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank())

7 Associations between gene mutational frequencies and WGD status

7.1 All tumors

muts.nonhyper = muts[,rownames(mut.rates.nonhyper)]
muts.nonhyper.min10 = muts.nonhyper[rowSums(muts.nonhyper) > 10,]

res.gene.lr = logistic.regression(muts.nonhyper.min10, ~ WGD + Log10_MutsPerMB + Type, covar = mut.rates.nonhyper, verbose = FALSE)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

res.gene.lr.summary = data.frame(Gene = rownames(res.gene.lr$Estimate),                                  WGD_Estimate=res.gene.lr$Estimate[,"WGDWGD"],
                          WGD_Zvalue = res.gene.lr$Zvalue[,"WGDWGD"],
                          WGD_Pvalue = res.gene.lr$Pvalue[,"WGDWGD"],
                          WGD_FDR = res.gene.lr$FDR[,"WGDWGD"])
res.gene.lr.summary = res.gene.lr.summary[order(res.gene.lr.summary$WGD_Pvalue),]
write.table(res.gene.lr.summary, "LR_WGD_Results_Nonhyper.txt", sep="\t", quote=FALSE, row.names=FALSE)

label = rownames(res.gene.lr.summary)
label[res.gene.lr.summary$WGD_FDR > 0.05] = ""

# Plot with all genes
g1 = ggplot(res.gene.lr.summary, aes(x=WGD_Estimate,
                                y=-log10(WGD_Pvalue),
                                color = WGD_FDR < 0.05)) +
  geom_point() +
  theme_bw() + xlab("Log odds ratio") + ylab("P-value (-log10)") +
  theme(panel.grid = element_blank()) + 
  geom_vline(xintercept = 0, lty=2, color="grey") +
  geom_text_repel(label = label) + 
  ggtitle("Plotting all genes")

# Plot excluding TP53
g2 = ggplot(res.gene.lr.summary[-1,], aes(x=WGD_Estimate,
                                y=-log10(WGD_Pvalue),
                                color = WGD_FDR < 0.05)) +
  geom_point() +
  theme_bw() + xlab("Log odds ratio") + ylab("P-value (-log10)") +
  theme(panel.grid = element_blank()) + 
  geom_vline(xintercept = 0, lty=2, color="grey") +
  geom_text_repel(label = label[-1]) + 
  ggtitle("Excluding TP53 from plot")

plot_grid(g1, g2, labels = "AUTO")

# Show table of results
res.gene.lr.summary.sig = subset(res.gene.lr.summary, WGD_FDR < 0.25)
kable(res.gene.lr.summary.sig, style = 'html', row.names = FALSE) %>%
  kable_styling(bootstrap_options = "striped") %>%
  scroll_box(width = "100%", height = "800px")

Gene	WGD_Estimate	WGD_Zvalue	WGD_Pvalue	WGD_FDR
TP53	1.0794584	18.785698	0.0000000	0.0000000
PTEN	-0.7672387	-7.011003	0.0000000	0.0000000
PIK3CA	-0.4786690	-6.384232	0.0000000	0.0000000
ARID1A	-0.5768852	-5.660534	0.0000000	0.0000024
FGFR3	-1.0398106	-4.826179	0.0000014	0.0001754
CTNNB1	-0.6158286	-4.438519	0.0000091	0.0009511
PPP2R1A	0.8456584	4.361786	0.0000129	0.0011610
KMT2B	-0.6214181	-4.241948	0.0000222	0.0017450
CHD1	-0.8968124	-4.063260	0.0000484	0.0033874
HRAS	-0.9992193	-3.883845	0.0001028	0.0060679
AKT1	-1.1463076	-3.859282	0.0001137	0.0060679
SMAD4	-0.6244750	-3.855318	0.0001156	0.0060679
KMT2C	-0.3408770	-3.763359	0.0001676	0.0081244
CTCF	-0.7029485	-3.649443	0.0002628	0.0118264
RNF43	-0.6779476	-3.335752	0.0008507	0.0357290
JAK1	-0.7668422	-3.297457	0.0009756	0.0384161
CDH1	-0.4449398	-3.058587	0.0022238	0.0824127
FAS	-1.3759040	-2.933289	0.0033539	0.1079772
BCOR	-0.4626777	-2.929062	0.0033999	0.1079772
HLA-A	-0.8536875	-2.900500	0.0037257	0.1079772
EIF1AX	-1.7996105	-2.881124	0.0039626	0.1079772
BRAF	-0.3709148	-2.877323	0.0040107	0.1079772
C1orf168	-0.6412355	-2.874089	0.0040519	0.1079772
TGFBR2	-0.6098609	-2.869330	0.0041134	0.1079772
MAP2K4	-0.5989836	-2.811855	0.0049257	0.1241270
CHD8	-0.4413577	-2.784141	0.0053670	0.1300458
CREBBP	-0.3499356	-2.754055	0.0058862	0.1373444
ZEB2	0.4240420	2.734583	0.0062459	0.1384352
RBM10	-0.5650726	-2.727977	0.0063724	0.1384352
KANSL1	-0.4999477	-2.676508	0.0074394	0.1496449
BTBD11	-0.5402268	-2.673664	0.0075028	0.1496449
MPP7	-0.6284041	-2.662284	0.0077612	0.1496449
MAP3K1	-0.4268961	-2.658946	0.0078385	0.1496449
RHOA	-0.6398340	-2.646768	0.0081265	0.1505796
FOXA2	0.6912303	2.569411	0.0101872	0.1833690
BCORL1	-0.4154184	-2.559519	0.0104817	0.1834301
MAP7D3	-0.6591130	-2.523561	0.0116173	0.1959742
PIK3R1	-0.3827401	-2.517453	0.0118207	0.1959742
CCAR1	-0.5882731	-2.508280	0.0121321	0.1959794
NF1	-0.2652156	-2.487090	0.0128793	0.2011616
ADNP2	-0.5774797	-2.469646	0.0135247	0.2011616
ZMYM2	-0.5655137	-2.464682	0.0137135	0.2011616
RPL22	-0.8136788	-2.464249	0.0137301	0.2011616
SPTA1	0.2170192	2.440063	0.0146847	0.2084574
LZTR1	-0.5395786	-2.432198	0.0150075	0.2084574
SYNE1	0.1922635	2.423399	0.0153760	0.2084574
KMT2D	-0.2205041	-2.419273	0.0155516	0.2084574
CIC	-0.3549505	-2.406418	0.0161098	0.2114413
SPTAN1	-0.3737495	-2.381084	0.0172618	0.2219374

7.2 Tumors without TP53 mutations

tumor.non.tp53 = colnames(muts.nonhyper)[muts.nonhyper["TP53",] == 0]
i = intersect(intersect(rownames(mut.rates.nonhyper), colnames(muts.nonhyper)), tumor.non.tp53)
mut.rates.nonTP53 = mut.rates.nonhyper[i,]
muts.nonTP53 = muts.nonhyper[,i]

muts.nonTP53.min10 = muts.nonTP53[rowSums(muts.nonTP53) > 10,]

res.gene.lr.nonTP53 = logistic.regression(muts.nonTP53.min10, ~ WGD + Log10_MutsPerMB + Type, covar = mut.rates.nonTP53, verbose = FALSE)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

res.gene.lr.nonTP53.summary = data.frame(Gene = rownames(res.gene.lr.nonTP53$Estimate),                                  WGD_Estimate=res.gene.lr.nonTP53$Estimate[,"WGDWGD"],
                          WGD_Zvalue = res.gene.lr.nonTP53$Zvalue[,"WGDWGD"],
                          WGD_Pvalue = res.gene.lr.nonTP53$Pvalue[,"WGDWGD"],
                          WGD_FDR = res.gene.lr.nonTP53$FDR[,"WGDWGD"])
res.gene.lr.nonTP53.summary = res.gene.lr.nonTP53.summary[order(res.gene.lr.nonTP53.summary$WGD_Pvalue),]
write.table(res.gene.lr.nonTP53.summary, "LR_WGD_Results_Nonhyper_nonTP53mutated.txt", sep="\t", quote=FALSE, row.names=FALSE)

label = rep("", nrow(res.gene.lr.nonTP53.summary))
top = head(order(res.gene.lr.summary$WGD_Pvalue), 10)
label[top] = rownames(res.gene.lr.nonTP53.summary)[top]

# Plot with all significant genes
g3 = ggplot(res.gene.lr.nonTP53.summary, aes(x=WGD_Estimate,
                                y=-log10(WGD_Pvalue),
                                color = WGD_FDR < 0.05)) +
  geom_point() +
  theme_bw() + xlab("Log odds ratio") + ylab("P-value (-log10)") +
  theme(panel.grid = element_blank()) + 
  geom_vline(xintercept = 0, lty=2, color="grey") +
  geom_text_repel(label = label) +
  xlim(-3, 3) +
  ggtitle("Including only non-TP53 mutated samples")
g3

## Warning: Removed 2 rows containing missing values (geom_point).

## Warning: Removed 2 rows containing missing values (geom_text_repel).

# Show table of results
res.gene.lr.nonTP53.summary.sig = subset(res.gene.lr.nonTP53.summary, WGD_Pvalue < 0.05)
kable(res.gene.lr.nonTP53.summary.sig, style = 'html', row.names = FALSE) %>%
  kable_styling(bootstrap_options = "striped") %>%
  scroll_box(width = "100%", height = "800px")

Gene	WGD_Estimate	WGD_Zvalue	WGD_Pvalue	WGD_FDR
CREBBP	-0.7573872	-3.330352	0.0008674	0.2014334
RNF43	-1.3991071	-3.320763	0.0008977	0.2014334
SMAD4	-1.0149265	-3.233038	0.0012248	0.2014334
PTEN	-0.5280733	-3.167605	0.0015370	0.2014334
PIK3CA	-0.3511188	-3.150062	0.0016324	0.2014334
ADNP2	-1.2013345	-3.021193	0.0025178	0.2589149
HRAS	-1.0698854	-2.837505	0.0045468	0.4007647
RHOA	-1.1729272	-2.747925	0.0059974	0.4381836
PPP2R2D	1.9900174	2.645439	0.0081585	0.4381836
DNAH7	-0.4551239	-2.615006	0.0089226	0.4381836
BRSK1	-1.4725203	-2.612695	0.0089831	0.4381836
FGFR3	-0.7298253	-2.561275	0.0104289	0.4381836
JAK1	-0.8860764	-2.518212	0.0117952	0.4381836
MTOR	-0.5911766	-2.496233	0.0125520	0.4381836
MYO9A	-0.6088333	-2.493545	0.0126475	0.4381836
BTBD11	-0.7323884	-2.449820	0.0142928	0.4381836
C1orf168	-0.7711517	-2.431809	0.0150236	0.4381836
EPAS1	0.8110039	2.389321	0.0168795	0.4381836
SMARCA4	-0.5568790	-2.385988	0.0170333	0.4381836
RBM10	-0.7449828	-2.363704	0.0180932	0.4381836
IGSF9	0.6854331	2.349618	0.0187927	0.4381836
PIK3R1	-0.6434924	-2.343274	0.0191154	0.4381836
BAP1	0.4920152	2.322928	0.0201830	0.4381836
GPS2	-1.2829478	-2.312851	0.0207308	0.4381836
CUBN	0.3787204	2.305774	0.0211233	0.4381836
ZMYM2	-0.8824764	-2.293850	0.0217991	0.4381836
USP9X	-0.6262422	-2.290533	0.0219904	0.4381836
KANSL1	-0.6683929	-2.283124	0.0224231	0.4381836
XYLT2	-1.4678371	-2.281967	0.0224913	0.4381836
ME3	-1.0261674	-2.281916	0.0224943	0.4381836
GTF2I	-0.9810036	-2.280922	0.0225531	0.4381836
KMT2C	-0.3141781	-2.276460	0.0228185	0.4381836
STK11	0.6451324	2.266250	0.0234361	0.4381836
KMT2B	-0.5109378	-2.245148	0.0247586	0.4400067
HIST1H3B	-1.0435388	-2.242023	0.0249599	0.4400067
PPP2R1A	0.7656459	2.204708	0.0274746	0.4708838
RASA1	-0.7878640	-2.183765	0.0289795	0.4832525
EEF1A1	-0.7900353	-2.136364	0.0326497	0.5291934
CHD1	-0.7010803	-2.120317	0.0339793	0.5291934
COL7A1	-0.4356894	-2.116439	0.0343075	0.5291934
LMAN1	-1.3288244	-2.097471	0.0359519	0.5410323
CDH11	0.5438248	2.075624	0.0379288	0.5475859
ARID1A	-0.3035796	-2.071136	0.0383461	0.5475859
MAX	-1.5897197	-2.063661	0.0390499	0.5475859
ALYREF	1.2996694	2.050780	0.0402883	0.5523979
CTCF	-0.6075972	-2.039023	0.0414478	0.5559408
ELMSAN1	-0.8550667	-2.006401	0.0448135	0.5882962
HLA-A	-1.1269787	-1.988120	0.0467985	0.5968917
RPL22	-1.0262717	-1.982681	0.0474031	0.5968917

7.3 Tumors with TP53 mutations

tumor.tp53 = colnames(muts.nonhyper)[muts.nonhyper["TP53",] > 0]
i = intersect(intersect(rownames(mut.rates.nonhyper), colnames(muts.nonhyper)), tumor.tp53)

# Determine the number of tumor types that will have at least 5 tumors
# after subsetting to TP53 mutants
type.table = table(mut.rates.nonhyper[i,"Type"])
type.n = names(type.table)[type.table > 4]
tumor.type.n = rownames(subset(mut.rates.nonhyper, Type %in% type.n))

i2 = intersect(intersect(intersect(rownames(mut.rates.nonhyper), colnames(muts.nonhyper)), tumor.tp53), tumor.type.n)

mut.rates.TP53 = mut.rates.nonhyper[i2,]
mut.rates.TP53$Type = droplevels(mut.rates.TP53$Type)
muts.TP53 = muts.nonhyper[,i2]
muts.TP53.min10 = muts.TP53[rowSums(muts.TP53) > 10,]


res.gene.lr.TP53 = logistic.regression(muts.TP53.min10, ~ WGD + Log10_MutsPerMB + Type, covar = mut.rates.TP53, verbose = FALSE)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

res.gene.lr.TP53.summary = data.frame(Gene = rownames(res.gene.lr.TP53$Estimate),                                  WGD_Estimate=res.gene.lr.TP53$Estimate[,"WGDWGD"],
                          WGD_Zvalue = res.gene.lr.TP53$Zvalue[,"WGDWGD"],
                          WGD_Pvalue = res.gene.lr.TP53$Pvalue[,"WGDWGD"],
                          WGD_FDR = res.gene.lr.TP53$FDR[,"WGDWGD"])
res.gene.lr.TP53.summary = res.gene.lr.TP53.summary[order(res.gene.lr.TP53.summary$WGD_Pvalue),]
write.table(res.gene.lr.TP53.summary, "LR_WGD_Results_Nonhyper_TP53mutated.txt", sep="\t", quote=FALSE, row.names=FALSE)

label = rep("", nrow(res.gene.lr.TP53.summary))
top = head(order(res.gene.lr.summary$WGD_Pvalue), 10)
label[top] = rownames(res.gene.lr.TP53.summary)[top]

# Plot with all significant genes
g4 = ggplot(res.gene.lr.TP53.summary, aes(x=WGD_Estimate,
                                y=-log10(WGD_Pvalue),
                                color = WGD_FDR < 0.05)) +
  geom_point() +
  theme_bw() + xlab("Log odds ratio") + ylab("P-value (-log10)") +
  theme(panel.grid = element_blank()) + 
  geom_vline(xintercept = 0, lty=2, color="grey") +
  geom_text_repel(label = label) +
  xlim(-5, 5) +
  ggtitle("Including only TP53 mutated samples")
g4

# Show table of results
res.gene.lr.nonTP53.summary.sig = subset(res.gene.lr.nonTP53.summary, WGD_Pvalue < 0.05)
kable(res.gene.lr.nonTP53.summary.sig, style = 'html', row.names = FALSE) %>%
  kable_styling(bootstrap_options = "striped") %>%
  scroll_box(width = "100%", height = "800px")

Gene	WGD_Estimate	WGD_Zvalue	WGD_Pvalue	WGD_FDR
CREBBP	-0.7573872	-3.330352	0.0008674	0.2014334
RNF43	-1.3991071	-3.320763	0.0008977	0.2014334
SMAD4	-1.0149265	-3.233038	0.0012248	0.2014334
PTEN	-0.5280733	-3.167605	0.0015370	0.2014334
PIK3CA	-0.3511188	-3.150062	0.0016324	0.2014334
ADNP2	-1.2013345	-3.021193	0.0025178	0.2589149
HRAS	-1.0698854	-2.837505	0.0045468	0.4007647
RHOA	-1.1729272	-2.747925	0.0059974	0.4381836
PPP2R2D	1.9900174	2.645439	0.0081585	0.4381836
DNAH7	-0.4551239	-2.615006	0.0089226	0.4381836
BRSK1	-1.4725203	-2.612695	0.0089831	0.4381836
FGFR3	-0.7298253	-2.561275	0.0104289	0.4381836
JAK1	-0.8860764	-2.518212	0.0117952	0.4381836
MTOR	-0.5911766	-2.496233	0.0125520	0.4381836
MYO9A	-0.6088333	-2.493545	0.0126475	0.4381836
BTBD11	-0.7323884	-2.449820	0.0142928	0.4381836
C1orf168	-0.7711517	-2.431809	0.0150236	0.4381836
EPAS1	0.8110039	2.389321	0.0168795	0.4381836
SMARCA4	-0.5568790	-2.385988	0.0170333	0.4381836
RBM10	-0.7449828	-2.363704	0.0180932	0.4381836
IGSF9	0.6854331	2.349618	0.0187927	0.4381836
PIK3R1	-0.6434924	-2.343274	0.0191154	0.4381836
BAP1	0.4920152	2.322928	0.0201830	0.4381836
GPS2	-1.2829478	-2.312851	0.0207308	0.4381836
CUBN	0.3787204	2.305774	0.0211233	0.4381836
ZMYM2	-0.8824764	-2.293850	0.0217991	0.4381836
USP9X	-0.6262422	-2.290533	0.0219904	0.4381836
KANSL1	-0.6683929	-2.283124	0.0224231	0.4381836
XYLT2	-1.4678371	-2.281967	0.0224913	0.4381836
ME3	-1.0261674	-2.281916	0.0224943	0.4381836
GTF2I	-0.9810036	-2.280922	0.0225531	0.4381836
KMT2C	-0.3141781	-2.276460	0.0228185	0.4381836
STK11	0.6451324	2.266250	0.0234361	0.4381836
KMT2B	-0.5109378	-2.245148	0.0247586	0.4400067
HIST1H3B	-1.0435388	-2.242023	0.0249599	0.4400067
PPP2R1A	0.7656459	2.204708	0.0274746	0.4708838
RASA1	-0.7878640	-2.183765	0.0289795	0.4832525
EEF1A1	-0.7900353	-2.136364	0.0326497	0.5291934
CHD1	-0.7010803	-2.120317	0.0339793	0.5291934
COL7A1	-0.4356894	-2.116439	0.0343075	0.5291934
LMAN1	-1.3288244	-2.097471	0.0359519	0.5410323
CDH11	0.5438248	2.075624	0.0379288	0.5475859
ARID1A	-0.3035796	-2.071136	0.0383461	0.5475859
MAX	-1.5897197	-2.063661	0.0390499	0.5475859
ALYREF	1.2996694	2.050780	0.0402883	0.5523979
CTCF	-0.6075972	-2.039023	0.0414478	0.5559408
ELMSAN1	-0.8550667	-2.006401	0.0448135	0.5882962
HLA-A	-1.1269787	-1.988120	0.0467985	0.5968917
RPL22	-1.0262717	-1.982681	0.0474031	0.5968917

8 Session Information

sessionInfo()

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices datasets  utils     methods   base     
## 
## other attached packages:
## [1] ggrepel_0.8.2    kableExtra_1.1.0 knitr_1.28       magrittr_1.5    
## [5] cowplot_1.0.0    ggplot2_3.3.0   
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.4.6      highr_0.8         pillar_1.4.4      compiler_3.6.1   
##  [5] tools_3.6.1       digest_0.6.25     viridisLite_0.3.0 evaluate_0.14    
##  [9] lifecycle_0.2.0   tibble_3.0.1      gtable_0.3.0      pkgconfig_2.0.3  
## [13] rlang_0.4.6       rstudioapi_0.11   yaml_2.2.1        xfun_0.13        
## [17] withr_2.2.0       stringr_1.4.0     httr_1.4.1        xml2_1.3.2       
## [21] vctrs_0.2.4       hms_0.5.3         webshot_0.5.2     grid_3.6.1       
## [25] glue_1.4.0        R6_2.4.1          rmarkdown_2.1     farver_2.0.3     
## [29] readr_1.3.1       scales_1.1.0      ellipsis_0.3.0    htmltools_0.4.0  
## [33] rvest_0.3.5       colorspace_1.4-1  renv_0.10.0       labeling_0.3     
## [37] stringi_1.4.6     munsell_0.5.0     crayon_1.3.4

Association of mutation rates with WGD status

Joshua D. Campbell

May 09, 2020