Practical session about Normalization and differential analysis of RNAseq data - Effect of normalization and differential tests

This file illustrates the normalization and differential analysis of RNAseq data on a real life dataset. The data are provided by courtesy of the transcriptomic platform of IPS2 and the name of the genes were shuffled. Hence the results are not biologically interpretable. They are composed of three files, all included in the directory data:

D1-counts.txt contains the raw counts of the experiment (13 columns: the first one contains the gene names, the others correspond to 12 different samples);
D1-genesLength.txt contains information about gene lengths;
D1-targets.txt contains information about the sample and the experimental design.

In the rest of this file, we:

1/ first import and describe the data;

2/ then perform and compare different normalizations;

3/ finally perform a differential analysis using different methods and models.

We advice that you use the file by reading the R code while trying to make sense of it. Then, you run it and analyze the results. The packages devtools, ggplot2, gridExtra, reshape2, mixOmics, RColorBrewer, edgeR and VennDiagram are required to compile this file. Information about my system (including package and R versions) are provided at the end of this file (HTML format), in Section “Session information”.

The packages are loaded with:

library(ggplot2)
library(gridExtra)
library(reshape2)
library(mixOmics)
library(RColorBrewer)
library(FactoMineR)
library(factoextra)
library(edgeR)
library(VennDiagram)
library(plotly)

Settings

This command line is used to create a subdirectory to store results of the analyses:

dir.create("results")

## Warning in dir.create("results"): 'results' already exists

Data description and importation

The data used in this practical session correspond to 12 samples of RNAseq obtained from microdissections on plants. Plants have four different genotypes: the wild type (“wt”) plant and three types of mutants and are observed three times in the same conditions. for information Mutants 1 and 2 have a similar phenotype (more complicated leaves than the wild type) whereas mutant 3 has the opposite phenotype (simpler leaves than the wild type).

The datasets are included in the repository data located at the root of the material for this practical session. They can be loaded using:

raw_counts <- read.table("data/D1-counts.txt", header = TRUE, row.names = 1)
raw_counts <- as.matrix(raw_counts)
design <- read.table("data/D1-targets.txt", header = TRUE)
gene_lengths <- scan("data/D1-genesLength.txt")

Raw counts

The dimensions of the raw count matrix are equal to:

dim(raw_counts)

## [1] 50897    12

which shows that the data contains 12 columns (samples) and 50897 rows (genes). And a quick look is obtained by:

head(raw_counts)

##                  wt_1 wt_2 wt_3 mut1_1 mut1_2 mut1_3 mut2_1 mut2_2 mut2_3
## Medtr0001s0010.1    0    0    0      1      0      1      0      0      0
## Medtr0001s0070.1    0    0    0      0      0      0      0      0      0
## Medtr0001s0100.1    0    0    0      0      0      0      0      0      0
## Medtr0001s0120.1    0    0    0      0      0      0      0      0      0
## Medtr0001s0160.1    0    0    0      0      0      0      0      0      0
## Medtr0001s0190.1    0    0    0      0      0      0      0      0      0
##                  mut3_1 mut3_2 mut3_3
## Medtr0001s0010.1      0      1      0
## Medtr0001s0070.1      0      0      0
## Medtr0001s0100.1      0      0      0
## Medtr0001s0120.1      0      0      0
## Medtr0001s0160.1      0      0      0
## Medtr0001s0190.1      0      0      0

which shows that the row names are gene names and that the data are made of counts. Many of these counts are equal to 0. A basic exploratory analysis of the count data is provided in the next section.

Experimental design

The experimental design is described in the object design:

design

##    labels group replicat
## 1    wt_1    wt  repbio1
## 2    wt_2    wt  repbio2
## 3    wt_3    wt  repbio3
## 4  mut1_1  mut1  repbio1
## 5  mut1_2  mut1  repbio2
## 6  mut1_3  mut1  repbio3
## 7  mut2_1  mut2  repbio1
## 8  mut2_2  mut2  repbio2
## 9  mut2_3  mut2  repbio3
## 10 mut3_1  mut3  repbio1
## 11 mut3_2  mut3  repbio2
## 12 mut3_3  mut3  repbio3

that contains 3 columns: the first one labels is the sample name, the second one group is the group of the sample (wild type or one of three types of mutant) and the last one replicat is the biological replicate number (that we will suppose to be corresponding from one group to another).

Basic exploratory analysis of raw counts

We start the analysis by filtering out the genes for which no count has been found:

raw_counts_wn <- raw_counts[rowSums(raw_counts) > 0, ]
dim(raw_counts_wn)

## [1] 27916    12

The number of genes which have been filtered out is thus equal to 22981.

It is often useful, to visualize the count distribution, to compute “pseudo counts”, which are log-transformed counts:

pseudo_counts <- log2(raw_counts_wn + 1)
head(pseudo_counts)

##                      wt_1     wt_2     wt_3   mut1_1   mut1_2   mut1_3   mut2_1
## Medtr0001s0010.1 0.000000 0.000000 0.000000 1.000000 0.000000 1.000000 0.000000
## Medtr0001s0200.1 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## Medtr0001s0260.1 6.357552 6.768184 6.643856 6.169925 6.507795 6.189825 6.209453
## Medtr0001s0360.1 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000
## Medtr0001s0490.1 1.000000 2.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## Medtr0002s0040.1 7.707359 6.918863 7.826548 2.584963 2.000000 4.643856 6.643856
##                    mut2_2   mut2_3   mut3_1   mut3_2   mut3_3
## Medtr0001s0010.1 0.000000 0.000000 0.000000 1.000000 0.000000
## Medtr0001s0200.1 0.000000 0.000000 0.000000 0.000000 0.000000
## Medtr0001s0260.1 6.392317 6.672425 6.820179 6.554589 6.539159
## Medtr0001s0360.1 1.584963 0.000000 1.000000 0.000000 0.000000
## Medtr0001s0490.1 0.000000 1.584963 1.000000 0.000000 1.000000
## Medtr0002s0040.1 7.651052 8.044394 3.169925 3.459432 1.584963

Count distribution

First, we provide an overview of the distribution of the first sample (without the genes that have been filtered out) by plotting the histograms of raw counts and pseudo counts:

df <- data.frame(rcounts = raw_counts_wn[, 1], prcounts = pseudo_counts[, 1])

p <- ggplot(data=df, aes(x = rcounts, y = ..density..)) +
  geom_histogram(fill = "lightblue") + theme_bw() +
  ggtitle(paste0("count distribution '", design$labels[1], "'")) +
  xlab("counts")

p2 <- ggplot(data=df, aes(x = prcounts, y = ..density..)) +
  geom_histogram(fill = "lightblue") + theme_bw() +
  ggtitle(paste0("count distribution - '", design$labels[1], "'")) +
  xlab(expression(log[2](counts + 1)))

grid.arrange(p, p2, ncol = 2)

This figure shows that the count distribution is highly skewed with a large proportion of genes with a count equal to 0 and a few number of genes with an extreme count (more that \(2^{15}\)).

An alternative representation can be obtained with boxplots (all samples, colored by group)

df <- cbind(design, t(pseudo_counts))
df <- melt(df, id.vars = c("labels", "group", "replicat"))

p <- ggplot(data = df, aes(x = labels, y = value, fill = group)) + 
  geom_boxplot() + theme_bw() + ggtitle("count distributions") + 
  xlab("sample") + ylab("counts") + 
  theme(axis.text.x = element_text(angle = 90))
p

Reproducibility between samples (MA plots)

df <- data.frame("A" = (pseudo_counts[, 3] + pseudo_counts[, 4])/2,
                 "M" = pseudo_counts[, 3] - pseudo_counts[, 4])
p <- ggplot(data = df, aes(x = A, y = M)) + geom_point(alpha = 0.2) + 
  theme_bw() + geom_smooth(colour = "red")
p

df <- vector(mode = "list", length = 12*11/2)
ind <- 1
for (ind1 in 1:11) {
  for (ind2 in (ind1+1):12) {
    A <- (pseudo_counts[, ind1] + pseudo_counts[, ind2])/2
    M <- pseudo_counts[, ind1] - pseudo_counts[, ind2]
    df[[ind]] <- data.frame("A" = A, "M" = M)
    ind <- ind+1
  }
}
lengths <- sapply(df, nrow)
df <- Reduce(rbind, df)
df$number <- rep(1:(12*11/2), each = lengths)
ggplot(data = df, aes(x = A, y = M)) + geom_point(alpha = 0.2) + theme_bw() +
  facet_wrap(~ number)

Heatmap

df <- as.matrix(dist(t(pseudo_counts)))
df

##            wt_1     wt_2     wt_3   mut1_1   mut1_2   mut1_3   mut2_1    mut2_2
## wt_1     0.0000 126.2933 130.8818 135.1886 125.2456 146.5416 120.6482 129.58103
## wt_2   126.2933   0.0000 120.4416 141.6461 121.6069 146.8200 114.4024 132.22552
## wt_3   130.8818 120.4416   0.0000 124.3985 124.1820 127.5267 137.5582 116.34194
## mut1_1 135.1886 141.6461 124.3985   0.0000 107.3116 118.6190 135.4935 112.08930
## mut1_2 125.2456 121.6069 124.1820 107.3116   0.0000 123.9054 116.4351 120.51413
## mut1_3 146.5416 146.8200 127.5267 118.6190 123.9054   0.0000 153.2298 120.09793
## mut2_1 120.6482 114.4024 137.5582 135.4935 116.4351 153.2298   0.0000 131.61151
## mut2_2 129.5810 132.2255 116.3419 112.0893 120.5141 120.0979 131.6115   0.00000
## mut2_3 132.7267 128.8963 123.6715 117.8781 119.8149 124.7136 127.9787  99.19577
## mut3_1 127.5728 129.2516 133.7004 138.1283 122.7834 146.2628 131.1581 138.35110
## mut3_2 124.5903 126.7231 120.9488 127.7823 123.4257 138.5271 127.2142 115.29926
## mut3_3 130.4482 127.8773 144.7460 146.4226 127.1282 167.7606 121.2774 145.41084
##           mut2_3   mut3_1   mut3_2   mut3_3
## wt_1   132.72674 127.5728 124.5903 130.4482
## wt_2   128.89625 129.2516 126.7231 127.8773
## wt_3   123.67150 133.7004 120.9488 144.7460
## mut1_1 117.87812 138.1283 127.7823 146.4226
## mut1_2 119.81491 122.7834 123.4257 127.1282
## mut1_3 124.71359 146.2628 138.5271 167.7606
## mut2_1 127.97870 131.1581 127.2142 121.2774
## mut2_2  99.19577 138.3511 115.2993 145.4108
## mut2_3   0.00000 138.6242 120.2381 144.2214
## mut3_1 138.62420   0.0000 120.1136 118.8938
## mut3_2 120.23807 120.1136   0.0000 118.7948
## mut3_3 144.22141 118.8938 118.7948   0.0000

cim_color <- colorRampPalette(rev(brewer.pal(9, "Reds")))(16)
cim(df, color = cim_color, symkey = FALSE)

PCA

Finally, a PCA is performed to identify the main sources of variability in the dataset. Pseudo-counts are used to perform this PCA.

resPCA <- PCA(t(pseudo_counts), ncp = 12, graph = FALSE)
fviz_eig(resPCA)

The first component explains more variance than the other components which almost all equally reproduce variance. The projection of the samples on the first two PCs is obtained with:

p <- fviz_pca_ind(resPCA, col.ind = design$group)
p

plotly::ggplotly(p)

This figure shows that mutants 1 and 3 are well separated from the other wild type but that wild type is not well separated from mutant 2.

Normalization

This section performs different normalization approaches for RNAseq data. These normalization are performed using edgeR (TC, RPKM, UQ, TMM). The final count distribution is compared to the raw count distribution in the last part of this section.

edgeR

In this section, the package edgeR is perform to compare the other normalization approaches. To do so, a DGEList object has to be created from the count data:

dge2 <- DGEList(raw_counts_wn)
dge2

## An object of class "DGEList"
## $counts
##                  wt_1 wt_2 wt_3 mut1_1 mut1_2 mut1_3 mut2_1 mut2_2 mut2_3
## Medtr0001s0010.1    0    0    0      1      0      1      0      0      0
## Medtr0001s0200.1    0    1    0      0      0      0      0      0      0
## Medtr0001s0260.1   81  108   99     71     90     72     73     83    101
## Medtr0001s0360.1    0    0    1      0      0      0      0      2      0
## Medtr0001s0490.1    1    3    0      0      0      0      0      0      2
##                  mut3_1 mut3_2 mut3_3
## Medtr0001s0010.1      0      1      0
## Medtr0001s0200.1      0      0      0
## Medtr0001s0260.1    112     93     92
## Medtr0001s0360.1      1      0      0
## Medtr0001s0490.1      1      0      1
## 27911 more rows ...
## 
## $samples
##        group lib.size norm.factors
## wt_1       1  5932414            1
## wt_2       1  7129204            1
## wt_3       1  6223759            1
## mut1_1     1  5577595            1
## mut1_2     1  6272344            1
## 7 more rows ...

The library sizes and the normalization factors for all samples are obtained by:

dge2$samples

##        group lib.size norm.factors
## wt_1       1  5932414            1
## wt_2       1  7129204            1
## wt_3       1  6223759            1
## mut1_1     1  5577595            1
## mut1_2     1  6272344            1
## mut1_3     1  6634542            1
## mut2_1     1  5897356            1
## mut2_2     1  5785184            1
## mut2_3     1  5767730            1
## mut3_1     1  7106881            1
## mut3_2     1  5872052            1
## mut3_3     1  6035692            1

Total count

Normalization by Total Count (TC) is obtained directly by the function cpm. Pseudo-counts (\(\log_2\) transformed counts) are computed and stored in a data frame for later use.

norm_lib <- cpm(dge2)
pseudo_TC <- log2(cpm(dge2) + 1)

df_TC <- melt(pseudo_TC, id = rownames(raw_counts_wn))
names(df_TC)[1:2] <- c ("id", "sample")
df_TC$method <- rep("TC", nrow(df_TC))

RPKM

RPKM needs information about gene lengths which have to be passed to the function rpkm. Again, pseudo-counts are computed and stored in a data frame for further use.

gene_lengths_wn <- gene_lengths[rowSums(raw_counts) > 0]
pseudo_RPKM <- log2(rpkm(dge2, gene.length = gene_lengths_wn) + 1)

df_RPKM <- melt(pseudo_RPKM, id = rownames(raw_counts_wn))
names(df_RPKM)[1:2] <- c ("id", "sample")
df_RPKM$method <- rep("RPKM", nrow(df_RPKM))

Upper quartile

Upper quartile normalization is obtained with the function calcNormFactors. This function changes the variable norm.factors in the DGEList object and thus also the way the functions cpm and rpkm are computing counts: now, a normalized library size, equal to the original library size multiplied by the normalization factor, is used to compute normalized counts: \(\tilde{K}_{gj} = \frac{K_{gj}}{\mbox{modified LS}}*10^6\). The normalization factors are computed using:

dge2 <- calcNormFactors(dge2, method = "upperquartile")
dge2$samples

##        group lib.size norm.factors
## wt_1       1  5932414    1.0047308
## wt_2       1  7129204    0.9810476
## wt_3       1  6223759    1.0130558
## mut1_1     1  5577595    1.0068753
## mut1_2     1  6272344    0.9942229
## mut1_3     1  6634542    1.0593861
## mut2_1     1  5897356    0.9873347
## mut2_2     1  5785184    1.0124342
## mut2_3     1  5767730    1.0453656
## mut3_1     1  7106881    0.9792811
## mut3_2     1  5872052    0.9857220
## mut3_3     1  6035692    0.9361638

Finally, pseudo counts are obtained and stored in a data frame:

pseudo_UQ <- log2(cpm(dge2) + 1)

df_UQ <- melt(pseudo_UQ, id = rownames(raw_counts_wn))
names(df_UQ)[1:2] <- c ("id", "sample")
df_UQ$method <- rep("UQ", nrow(df_UQ))

RLE

dge2 <- calcNormFactors(dge2, method = "RLE")
dge2$samples

##        group lib.size norm.factors
## wt_1       1  5932414    1.0068551
## wt_2       1  7129204    0.9870920
## wt_3       1  6223759    1.0023952
## mut1_1     1  5577595    1.0065636
## mut1_2     1  6272344    1.0081697
## mut1_3     1  6634542    1.0556197
## mut2_1     1  5897356    1.0035566
## mut2_2     1  5785184    1.0130438
## mut2_3     1  5767730    1.0433540
## mut3_1     1  7106881    0.9784525
## mut3_2     1  5872052    0.9669348
## mut3_3     1  6035692    0.9337173

pseudo_RLE <- log2(cpm(dge2) + 1)

df_RLE <- melt(pseudo_RLE, id = rownames(raw_counts_wn))
names(df_RLE)[1:2] <- c ("id", "sample")
df_RLE$method <- rep("RLE", nrow(df_RLE))

TMM

TMM normalization works similarly as UQ and updates the value of the variable norm.factors:

dge2 <- calcNormFactors(dge2, method = "TMM")
dge2$samples

##        group lib.size norm.factors
## wt_1       1  5932414    1.0082044
## wt_2       1  7129204    0.9850930
## wt_3       1  6223759    1.0047944
## mut1_1     1  5577595    1.0064979
## mut1_2     1  6272344    1.0094894
## mut1_3     1  6634542    1.0593265
## mut2_1     1  5897356    0.9977013
## mut2_2     1  5785184    1.0096719
## mut2_3     1  5767730    1.0387295
## mut3_1     1  7106881    0.9791254
## mut3_2     1  5872052    0.9637238
## mut3_3     1  6035692    0.9429276

Normalized pseudo counts are obtained with the function cpm and stored in a data frame:

pseudo_TMM <- log2(cpm(dge2) + 1)

df_TMM <- melt(pseudo_TMM, id = rownames(raw_counts_wn))
names(df_TMM)[1:2] <- c ("id", "sample")
df_TMM$method <- rep("TMM", nrow(df_TMM))

Comparison

This last section shows the comparison between all normalization methods (and original raw data) for this data set. First, the distributions of all samples and all normalization methods are compared with boxplots and density plots:

df_raw <- melt(pseudo_counts, id = rownames(raw_counts_wn))
names(df_raw)[1:2]<- c("id", "sample")
df_raw$method <- rep("Raw counts", nrow(df_raw))
head(df_raw)

##                 id sample    value     method
## 1 Medtr0001s0010.1   wt_1 0.000000 Raw counts
## 2 Medtr0001s0200.1   wt_1 0.000000 Raw counts
## 3 Medtr0001s0260.1   wt_1 6.357552 Raw counts
## 4 Medtr0001s0360.1   wt_1 0.000000 Raw counts
## 5 Medtr0001s0490.1   wt_1 1.000000 Raw counts
## 6 Medtr0002s0040.1   wt_1 7.707359 Raw counts

df_allnorm <- rbind(df_raw, df_TC, df_RPKM, df_UQ, df_TMM, df_RLE)
df_allnorm$method <- factor(df_allnorm$method,
                            levels = c("Raw counts", "TC", "RPKM", "TMM", "UQ", "RLE"))

p <- ggplot(data=df_allnorm, aes(x=sample, y=value, fill=method)) +
  geom_boxplot() + theme_bw() +
  ggtitle("Boxplots of normalized pseudo counts\n for all samples by normalization methods") + facet_grid(. ~ method) +
  ylab(expression(log[2] ~ (normalized ~ count + 1))) + xlab("") +
  theme(title = element_text(size=10), axis.text.x = element_blank(), 
        axis.ticks.x = element_blank())
print(p)

p <- ggplot(data = df_allnorm, aes(x = value, colour = sample)) + 
  geom_density() + theme_bw() +
  ggtitle("Density of normalized pseudo counts\n for all samples by normalization methods") +
  facet_grid(. ~ method) + ylab("density") + xlab("") + 
  theme(title = element_text(size=10), axis.text.x = element_blank(), 
        axis.ticks.x = element_blank())
print(p)

Visually, DESeq, UQ and TMM seem to behave similarly and provide comparable distributions between samples. Finally, a PCA is performed on the pseudo counts obtained after TMM normalization:

resPCA <- PCA(t(pseudo_TMM), ncp  = 12, graph = FALSE)
fviz_eig(resPCA)

The first component explains more variance than the other components which almost all equally reproduce variance. The projection of the samples on the first two PCs is obtained with:

fviz_pca_ind(resPCA, col.ind = design$group)

This figure shows that a better separation between wild type samples and mutant 2 samples. On the contrary mutants 2 and 1 are less well separated than in raw count data. This is a biologically plausible result since the organization of the different mutants are concordant with their phenotypes. The normalization has thus improved the data quality.

df <- as.matrix(dist(t(pseudo_TMM)))
df

##            wt_1     wt_2     wt_3   mut1_1   mut1_2    mut1_3   mut2_1   mut2_2
## wt_1    0.00000 71.28860 78.90104 84.55244 70.64068  88.26819 69.53304 81.23892
## wt_2   71.28860  0.00000 70.97388 88.57248 68.29427  93.19227 60.76116 83.92985
## wt_3   78.90104 70.97388  0.00000 74.50747 74.80749  74.06644 86.66028 68.40858
## mut1_1 84.55244 88.57248 74.50747  0.00000 60.14370  63.17140 87.75813 64.73119
## mut1_2 70.64068 68.29427 74.80749 60.14370  0.00000  73.07235 64.22984 73.08702
## mut1_3 88.26819 93.19227 74.06644 63.17140 73.07235   0.00000 95.41909 62.47330
## mut2_1 69.53304 60.76116 86.66028 87.75813 64.22984  95.41909  0.00000 86.65133
## mut2_2 81.23892 83.92985 68.40858 64.73119 73.08702  62.47330 86.65133  0.00000
## mut2_3 83.23634 81.67530 74.63152 68.53364 72.88852  67.17804 82.96370 50.16135
## mut3_1 71.38379 76.36299 81.36898 84.32391 69.76380  93.05559 76.37973 86.49078
## mut3_2 72.65982 72.80861 67.96187 78.93174 72.95065  77.86109 79.56811 67.63906
## mut3_3 74.59898 69.04147 89.09842 96.24201 72.17156 103.20172 70.93649 96.58503
##          mut2_3   mut3_1   mut3_2    mut3_3
## wt_1   83.23634 71.38379 72.65982  74.59898
## wt_2   81.67530 76.36299 72.80861  69.04147
## wt_3   74.63152 81.36898 67.96187  89.09842
## mut1_1 68.53364 84.32391 78.93174  96.24201
## mut1_2 72.88852 69.76380 72.95065  72.17156
## mut1_3 67.17804 93.05559 77.86109 103.20172
## mut2_1 82.96370 76.37973 79.56811  70.93649
## mut2_2 50.16135 86.49078 67.63906  96.58503
## mut2_3  0.00000 87.55700 71.88621  95.17747
## mut3_1 87.55700  0.00000 65.23875  58.12088
## mut3_2 71.88621 65.23875  0.00000  73.23508
## mut3_3 95.17747 58.12088 73.23508   0.00000

cim_color <- colorRampPalette(rev(brewer.pal(9, "Blues")))(16)
cim(df, color = cim_color, symkey = FALSE)

Differential analysis

This section will compare the results of different types of approach to obtain genes which are differentially expressed between the wild type plants and each of the mutants:

a standard NB exact test between two conditions;
a GLM with only the genotype effect;
a GLM with the plant (individual) and genotype effects.

All analyses are performed with the R package edgeR. A last section compares the results obtained using the different methods.

First approach: pairwise exact test

The following command lines loop over the different mutant types (stored in the vector all_conditions) and perform the following operations:

first, a DGEList object is created with the wild type samples and the current mutant type samples. The “wt” condition is chosen as the reference level. Normalization is performed using TMM;
then, the functions estimateCommonDisp and estimateTagwiseDisp respectively estimate a common dispersion parameter and uses it as a prior to estimate gene specific dispersions \(\phi_g\) for all \(g\). The “Biological Coefficient of Variation” plot is also displayed which displays \(\phi_g\) versus the average (normalized) count for all genes;
finally, an exact test is performed with the function exactTest to find genes which are differentially expressed between the two conditions (wild type and mutant). The output of the function is printed and p-values and adjusted p-values (Benjamini and Hochberg approach) are stored in pvals_pairwiseExact. The function topTags is used to print the genes with the smallest p-values and the function decideTests is used to extract genes whose adjusted p-value is smaller than 5%. The names of those genes as well as the fact that they are up/down regulated in the mutant type are saved in DGE_pairwiseExact.

cur_counts <- raw_counts_wn[, 1:6]
cur_design <- design[1:6, ]
cur_design$group <- relevel(factor(cur_design$group), ref = "wt")
cur_dge <- DGEList(cur_counts, group = cur_design$group, remove.zeros = TRUE)

## Removing 994 rows with all zero counts

cur_dge <- calcNormFactors(cur_dge, method = "TMM")
pseudo_counts <- log2(cpm(cur_dge) + 1)

# estimate dispersion
cur_dge <- estimateDisp(cur_dge)

## Using classic mode.

plotBCV(cur_dge, main = paste0("BCV plot"))

# perform test
res_et <- exactTest(cur_dge)
p_adjust <- p.adjust(res_et$table$PValue, method = "BH")
df <- data.frame("pvalue" = c(res_et$table$PValue, p_adjust),
                 "type" = rep(c("raw", "adjusted"), each = nrow(res_et$table)))
df$type <- factor(df$type, levels = c("raw", "adjusted"))
p <- ggplot(df, aes(x = pvalue)) + geom_histogram() + theme_bw() +
  facet_wrap(. ~ type) + xlab("p-value")
p

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# extracting results
to_export <- topTags(res_et, n = nrow(res_et$table))
write.table(to_export, file = "results/mesjolisresultats.csv", sep = ";",
            row.names = TRUE, col.names = TRUE, dec = ",")
topTags(res_et, adjust.method = "bonferroni")

## Comparison of groups:  mut1-wt 
##                      logFC   logCPM        PValue          FWER
## Medtr7g079200.1  10.429051 5.646475 2.065283e-139 5.560154e-135
## Medtr6g034805.1  -5.621765 3.157061  3.979955e-40  1.071483e-35
## Medtr2g028530.1  -9.381839 2.800090  3.026933e-31  8.149108e-27
## Medtr3g092475.1  -2.761050 4.927626  3.387396e-30  9.119547e-26
## Medtr6g051975.1   8.734639 2.175164  1.099386e-29  2.959767e-25
## Medtr7g116150.1   6.314048 2.897514  3.166440e-29  8.524691e-25
## Medtr6g445300.1  -8.202011 1.703605  2.581322e-21  6.949436e-17
## Medtr2g436330.1  -2.304851 3.858789  3.616280e-21  9.735749e-17
## Medtr2g054640.1   1.999558 6.178289  1.949183e-18  5.247590e-14
## Medtr0024s0070.1  6.839281 2.166794  6.581894e-18  1.771978e-13

cur_res <- decideTests(res_et, adjust.method = "BH", p.value = 0.05)
cur_res

## TestResults matrix
##                  mut1-wt
## Medtr0001s0010.1       0
## Medtr0001s0200.1       0
## Medtr0001s0260.1       0
## Medtr0001s0360.1       0
## Medtr0001s0490.1       0
## 26917 more rows ...

sel_deg <- which(cur_res[, 1] != 0)
length(sel_deg)

## [1] 269

DEG_pairwiseExact <- rownames(cur_res)[sel_deg]
head(DEG_pairwiseExact)

## [1] "Medtr0002s0040.1" "Medtr0002s0530.1" "Medtr0003s0010.1" "Medtr0007s0160.1"
## [5] "Medtr0024s0070.1" "Medtr0026s0200.1"

# explore results
df <- data.frame("logCPM" = res_et$table$logCPM,
                 "logFC" = res_et$table$logFC,
                 "selected" = unname(ifelse(cur_res[, 1] != 0, "DEG", "not DE")))
p <- ggplot(df, aes(x = logCPM, y = logFC, colour = selected)) + 
  geom_point(alpha = 0.5) + theme_bw() + 
  scale_colour_manual(name = "test result", values = c("red", "black"))
p

df$adjpval <- p.adjust(res_et$table$PValue, "BH")
p <- ggplot(df, aes(x = logFC, y = -log10(adjpval), colour = selected)) +
  geom_point(alpha = 0.5) + theme_bw() + ylim(0,50) +
  scale_colour_manual(name = "test result", values = c("red", "black"))
p

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

df <- pseudo_counts[sel_deg, ]
df <- scale(t(as.matrix(df)))
cim_color <- colorRampPalette(rev(brewer.pal(11, "RdYlGn")))(256)
res_cim <- cim(df, color = cim_color, cluster = "col", col.names = FALSE, 
               dist.method = c("euclidean", "correlation"))

Extracting dendrogram associated with these genes are groups is easy with:

res_hac <- as.hclust(res_cim$ddc)
plot(res_hac, labels = FALSE, xlab = "", sub = "", main = "Groups of genes")
rect.hclust(res_hac, k = 4, border = "red")

genes_group <- cutree(res_hac, k = 4)
head(genes_group)

## Medtr0002s0040.1 Medtr0002s0530.1 Medtr0003s0010.1 Medtr0007s0160.1 
##                1                2                3                3 
## Medtr0024s0070.1 Medtr0026s0200.1 
##                2                2

table(genes_group)

## genes_group
##   1   2   3   4 
##  48  87 123  11

Second approach: overall GLM tests with group effect

In this section, we fit a generalized linear model to account for the group effect. The model writes \(K_{gj} \sim \mbox{NB}(\mu_{gj}, \phi_g)\) with \(\mathbb{E}(\log K_{gj}) = \log(s_j) + \log(\lambda_{gj})\). Here, \(j\) denotes one of the four genotypes (wild type or mutant 1, 2 and 3) and \(\lambda_{gj}\) is explained by: \(\log(\lambda_{g,\textrm{mutant}}) = \lambda_{g0} + \beta_{g1} \mathbb{I}_{\textrm{mutant}_1} + \beta_{g2} \mathbb{I}_{\textrm{mutant}_2} + \beta_{g3} \mathbb{I}_{\textrm{mutant}_3}\). In this model, the wild type plant is the reference genotype and thus, testing if the second (resp., the third, the fourth) coefficient in the model is equal to zero is equivalent to find DEGs between the wild type and mutant 1 (resp. 2, 3).

First, a DGEList object is created with all samples and is normalized by TMM:

# create dataset with 'wt' and current mutant and normalize (TMM)
cur_dge <- DGEList(raw_counts_wn, group = design$group)
cur_dge <- calcNormFactors(cur_dge, method = "TMM")

Then, a design matrix is created: the group condition is stored in a variable group in which the reference level is set to “wt”. Then, this variable is used in the function model.matrix to obtain the design matrix:

# design matrix
group <- relevel(as.factor(design$group), ref = "wt")
design_matrix <- model.matrix(~ group)
design_matrix

##    (Intercept) groupmut1 groupmut2 groupmut3
## 1            1         0         0         0
## 2            1         0         0         0
## 3            1         0         0         0
## 4            1         1         0         0
## 5            1         1         0         0
## 6            1         1         0         0
## 7            1         0         1         0
## 8            1         0         1         0
## 9            1         0         1         0
## 10           1         0         0         1
## 11           1         0         0         1
## 12           1         0         0         1
## attr(,"assign")
## [1] 0 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"

# estimate dispersion
cur_dge <- estimateDisp(cur_dge, design_matrix)
plotBCV(cur_dge, main = paste0("BCV plot"))

# fit model and perform test
fit <- glmFit(cur_dge, design_matrix)
res_GLM1 <- glmLRT(fit, coef = 2)
table_res <- res_GLM1$table
p_adjust <- p.adjust(table_res$PValue, method = "BH")
table_res$padj <- p_adjust

df <- data.frame("pvalue" = c(res_GLM1$table$PValue, p_adjust),
                 "type" = rep(c("raw", "adjusted"), each = nrow(res_GLM1$table)))
df$type <- factor(df$type, levels = c("raw", "adjusted"))
p <- ggplot(df, aes(x = pvalue)) + geom_histogram() + theme_bw() +
  facet_wrap(. ~ type) + xlab("p-value")
p

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# extracting results
cur_res <- decideTests(res_GLM1, adjust.method = "BH", p.value = 0.05)
cur_res

## TestResults matrix
##                  groupmut1
## Medtr0001s0010.1         0
## Medtr0001s0200.1         0
## Medtr0001s0260.1         0
## Medtr0001s0360.1         0
## Medtr0001s0490.1         0
## 27911 more rows ...

sel_deg <- which(cur_res[, 1] != 0)
length(sel_deg)

## [1] 308

DEG_GLM1 <- rownames(cur_res)[sel_deg]
head(DEG_GLM1)

## [1] "Medtr0002s0040.1" "Medtr0002s0530.1" "Medtr0003s0010.1" "Medtr0007s0160.1"
## [5] "Medtr0024s0070.1" "Medtr0026s0200.1"

# explore results
df <- data.frame("logCPM" = res_GLM1$table$logCPM,
                 "logFC" = res_GLM1$table$logFC,
                 "selected" = unname(ifelse(cur_res[, 1] != 0, "DEG", "not DE")))
p <- ggplot(df, aes(x = logCPM, y = logFC, colour = selected)) + 
  geom_point(alpha = 0.5) + theme_bw() + 
  scale_colour_manual(name = "test result", values = c("red", "black"))
p

df$adjpval <- p.adjust(res_GLM1$table$PValue, "BH")
p <- ggplot(df, aes(x = logFC, y = -log10(adjpval), colour = selected)) +
  geom_point(alpha = 0.5) + theme_bw() +
  scale_colour_manual(name = "test result", values = c("red", "black"))
p

Compared to what was found with the Exact test, the number of obtained DEGs is of the same order (only a bit larger). A more precise comparison is made in the last part of this section.

The difference between mutant 1 and mutant 3 can be assessed using contrasts on the same fitted model:

# fit model and perform test
contrasts <- c(0, 1, 0, -1)
res <- glmLRT(fit, contrast = contrasts)

# extracting results
cur_res <- decideTests(res, adjust.method = "BH", p.value = 0.05)
sel_deg <- which(cur_res[, 1] != 0)
length(sel_deg)

## [1] 1381

DEG_GLM1c <- rownames(cur_res)[sel_deg]
head(DEG_GLM1c)

## [1] "Medtr0002s0330.1" "Medtr0003s0010.1" "Medtr0003s0090.1" "Medtr0003s0280.1"
## [5] "Medtr0003s0350.1" "Medtr0007s0160.1"

Third approach: overall GLM tests with group and replicate effects

The previous model does not take into account all information about the experimental design. Only the group effect has been tested whereas an additional effect might influence the expression level: the replicat effect. In this section we fit a model with an additive contribution of both effects. This problem is equivalent to introduce a (fixed) individual effect in the model. In this situation, the average level of \(\log\) normalized counts is expressed as: \(\log(\lambda_{g,\textrm{rep},\textrm{mutant}}) = \lambda_{g0} + \beta_{g1} \mathbb{I}_{\textrm{rep}_1} + \beta_{g2} \mathbb{I}_{\textrm{rep}_2} + \beta_{g3} \mathbb{I}_{\textrm{mutant}_1} + \beta_{g4} \mathbb{I}_{\textrm{mutant}_2} + \beta_{g5} \mathbb{I}_{\textrm{mutant}_3}\) and DEGs for each of the three mutant types are obtained by testing coefficients \(\beta_{g3}\), \(\beta_{g4}\) and \(\beta_{g5}\).

The method is performed similarly as before: a DGEList is created and normalization factors are obtained with the TMM approach. Then a design matrix that incorporates the two factors in an additive model is created:

# create dataset with 'wt' and current mutant and normalize (TMM)
cur_dge <- DGEList(raw_counts_wn, group = group)
cur_dge <- calcNormFactors(cur_dge, method = "TMM")

# create design matrix
design_matrix <- model.matrix(~ design$replicat + group)
design_matrix

##    (Intercept) design$replicatrepbio2 design$replicatrepbio3 groupmut1
## 1            1                      0                      0         0
## 2            1                      1                      0         0
## 3            1                      0                      1         0
## 4            1                      0                      0         1
## 5            1                      1                      0         1
## 6            1                      0                      1         1
## 7            1                      0                      0         0
## 8            1                      1                      0         0
## 9            1                      0                      1         0
## 10           1                      0                      0         0
## 11           1                      1                      0         0
## 12           1                      0                      1         0
##    groupmut2 groupmut3
## 1          0         0
## 2          0         0
## 3          0         0
## 4          0         0
## 5          0         0
## 6          0         0
## 7          1         0
## 8          1         0
## 9          1         0
## 10         0         1
## 11         0         1
## 12         0         1
## attr(,"assign")
## [1] 0 1 1 2 2 2
## attr(,"contrasts")
## attr(,"contrasts")$`design$replicat`
## [1] "contr.treatment"
## 
## attr(,"contrasts")$group
## [1] "contr.treatment"

The dispersion is then estimated using the same approach as in the previous section:

cur_dge <- estimateDisp(cur_dge, design_matrix)
plotBCV(cur_dge, main = paste0("BCV plot"))

And finally, the model is fitted with glmFit and tests are performed on the coefficients of interest with glmLRT. Results extracted with topTags and decideTests are saved in pvals_GLM2 and listDEG_GLM2:

# GLM fit
fit <- glmFit(cur_dge, design_matrix)
res_GLM2 <- glmLRT(fit, coef = 4)
df <- data.frame("pvalue" = c(res_GLM2$table$PValue, 
                               p.adjust(res_GLM2$table$PValue, method = "BH")),
                 "type" = rep(c("raw", "adjusted"), each = nrow(res_GLM2$table)))
df$type <- factor(df$type, levels = c("raw", "adjusted"))
p <- ggplot(df, aes(x = pvalue)) + geom_histogram() + theme_bw() +
  facet_wrap(. ~ type) + xlab("p-value")
p

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# extracting results
cur_res <- decideTests(res_GLM2, adjust.method = "BH", p.value = 0.05)
cur_res

## TestResults matrix
##                  groupmut1
## Medtr0001s0010.1         0
## Medtr0001s0200.1         0
## Medtr0001s0260.1         0
## Medtr0001s0360.1         0
## Medtr0001s0490.1         0
## 27911 more rows ...

sel_deg <- which(cur_res[, 1] != 0)
length(sel_deg)

## [1] 306

DEG_GLM2 <- rownames(cur_res)[sel_deg]
head(DEG_GLM2)

## [1] "Medtr0002s0040.1" "Medtr0003s0010.1" "Medtr0007s0160.1" "Medtr0024s0070.1"
## [5] "Medtr0027s0090.1" "Medtr0030s0040.1"

# explore results
df <- data.frame("logCPM" = res_GLM2$table$logCPM,
                 "logFC" = res_GLM2$table$logFC,
                 "selected" = unname(ifelse(cur_res[, 1] != 0, "DEG", "not DE")))
p <- ggplot(df, aes(x = logCPM, y = logFC, colour = selected)) + 
  geom_point(alpha = 0.5) + theme_bw() + 
  scale_colour_manual(name = "test result", values = c("red", "black"))
p

df$adjpval <- p.adjust(res_GLM2$table$PValue, "BH")
p <- ggplot(df, aes(x = logFC, y = -log10(adjpval), colour = selected)) +
  geom_point(alpha = 0.5) + theme_bw() +
  scale_colour_manual(name = "test result", values = c("red", "black"))
p

Comparison

In this last part, a Venn diagram is displayed for the results (unique genes for mutant type 1 versus reference) of the three approaches:

vd <- venn.diagram(x=list("Exact test" = DEG_pairwiseExact,
                          "GLM\n group" = DEG_GLM1, 
                          "GLM\n group + replicate" = DEG_GLM2), 
                   fill = brewer.pal(3, "Set3"), 
                   cat.col = c("darkgreen", "black", "darkblue"),
                   cat.cex = 1.5, fontface = "bold", filename = NULL)
grid.draw(vd)

This chart yields to several conclusions:

first, a large number of genes (\(194\) out of approximately 400) are in common between all three approaches. As expected, the two more consistent methods are the two GLM approaches (with an additional number of \(61\) DEGs in common and \(44+31+7+22 = 104\) DEGs that are specific of a given model). The Exact Test has 46 DEGs that are not found by one of the other two approaches and only \(22+7=31\) DEGs in common with one of the two GLM models only;
the 31 DEGs that were found exclusively by GLM with two additive factors might be interesting because they are DEGs which can not be found if the genotype effect is not included in the model.

Session information

To obtain results similar to mines, make sure that your session has identical features than the one used to compile this document:

sessionInfo()

## R version 4.5.2 (2025-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=fr_FR.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Europe/Paris
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] plotly_4.11.0       VennDiagram_1.7.3   futile.logger_1.4.3
##  [4] edgeR_4.8.0         limma_3.66.0        factoextra_1.0.7   
##  [7] FactoMineR_2.12     RColorBrewer_1.1-3  mixOmics_6.34.0    
## [10] lattice_0.22-6      MASS_7.3-65         reshape2_1.4.5     
## [13] gridExtra_2.3       ggplot2_3.5.2      
## 
## loaded via a namespace (and not attached):
##  [1] formatR_1.14         sandwich_3.1-1       rlang_1.1.6         
##  [4] magrittr_2.0.4       multcomp_1.4-29      matrixStats_1.5.0   
##  [7] compiler_4.5.2       mgcv_1.9-1           vctrs_0.6.5         
## [10] stringr_1.6.0        pkgconfig_2.0.3      fastmap_1.2.0       
## [13] backports_1.5.0      labeling_0.4.3       rmarkdown_2.30      
## [16] purrr_1.2.0          xfun_0.50            cachem_1.1.0        
## [19] jsonlite_1.8.9       flashClust_1.01-2    BiocParallel_1.44.0 
## [22] broom_1.0.10         parallel_4.5.2       cluster_2.1.8.1     
## [25] R6_2.5.1             bslib_0.8.0          stringi_1.8.4       
## [28] car_3.1-3            jquerylib_0.1.4      estimability_1.5.1  
## [31] Rcpp_1.1.0           knitr_1.49           zoo_1.8-14          
## [34] Matrix_1.7-2         splines_4.5.2        igraph_2.2.1        
## [37] tidyselect_1.2.1     rstudioapi_0.17.1    abind_1.4-8         
## [40] yaml_2.3.10          codetools_0.2-20     tibble_3.2.1        
## [43] plyr_1.8.9           S7_0.2.1             withr_3.0.2         
## [46] rARPACK_0.11-0       coda_0.19-4.1        evaluate_1.0.3      
## [49] lambda.r_1.2.4       survival_3.8-3       pillar_1.10.1       
## [52] ggpubr_0.6.2         carData_3.0-5        DT_0.34.0           
## [55] ellipse_0.5.0        generics_0.1.3       munsell_0.5.1       
## [58] scales_1.3.0         xtable_1.8-4         leaps_3.2           
## [61] glue_1.8.0           emmeans_2.0.0        scatterplot3d_0.3-44
## [64] lazyeval_0.2.2       tools_4.5.2          data.table_1.17.8   
## [67] RSpectra_0.16-2      locfit_1.5-9.12      ggsignif_0.6.4      
## [70] mvtnorm_1.3-3        tidyr_1.3.1          crosstalk_1.2.2     
## [73] colorspace_2.1-2     nlme_3.1-167         Formula_1.2-5       
## [76] cli_3.6.5            futile.options_1.0.1 viridisLite_0.4.2   
## [79] dplyr_1.1.4          corpcor_1.6.10       gtable_0.3.6        
## [82] rstatix_0.7.3        sass_0.4.9           digest_0.6.39       
## [85] ggrepel_0.9.6        TH.data_1.1-5        htmlwidgets_1.6.4   
## [88] farver_2.1.2         htmltools_0.5.8.1    lifecycle_1.0.4     
## [91] httr_1.4.7           multcompView_0.1-10  statmod_1.5.1

Practical session about Normalization and differential analysis of RNAseq data - Effect of normalization and differential tests

Nathalie Vialaneix and Nicolas Enjalbert-Courrech

Biostatistics PF training - November 26-27

Settings

Data description and importation

Raw counts

Experimental design

Basic exploratory analysis of raw counts

Count distribution

Reproducibility between samples (MA plots)

Heatmap

PCA

Normalization

edgeR

Total count

RPKM

Upper quartile

RLE

TMM

Comparison

Differential analysis

First approach: pairwise exact test

Second approach: overall GLM tests with group effect

Third approach: overall GLM tests with group and replicate effects

Comparison

Session information