This file analyses the data of the count table included in countData.txt. These data are RNA sequencing of tomatoes, with two conditions (wild type and mutant) and 3 replicates by condition. One observation in a given condition has a clone in the second condition: this will be refered as the “genotype effect” in the rest of the document. The data were provided as a courtesy of Mohamed Zouine (ENSAT, Toulouse).

Where to start? (installation and alike)

Download the following files: * http://www.nathalievilla.org/doc/R/solution_edgeR-tomato.R (R script) * http://www.nathalievilla.org/doc/txt/countData.txt (dataset, text format) * … (design, text format) and save them all in the same directory.

Then start RStudio and using the menu “File/Open File…”, open the file solution_edgeR-tomato.R (that contains the R script of this document). Also set your working directory to the directory where you have downloaded all files with the menu “Session / Set working directory / To source file location”.

The command lines in the R script can be executed using the button “Run” or the shortcut Ctrl+Enter. The analysis has been performed using R and the R packages edgeR, limma, RColorBrewer, mixOmics and HTSFilter. Precise versions of the R software used in this document is given in the last section of this document.

The required packages must be loaded prior the analysis. If one of them is not already installed, you can refer to the webpage http://www.nathalievilla.org/teaching/rnaseq.html to install it by yourself (HTSFilter is probably not installed; you can not use the standard command install.packages to install it because it is a Bioconductor package: check the webpage!).

## Do Not Run
# how to install required packages?
# install.packages(c("RColorBrewer","mixOmics"))
# source("http://bioconductor.org/biocLite.R")
# biocLite("edgeR")
## End DNR
library(edgeR)
library(limma)
library(RColorBrewer)
library(mixOmics)
library(VennDiagram)
library(HTSFilter)

Preparing datasets

Description of the files (count table and design)

If you open the file countData.txt with a simple text reader (wordpad for instance), you see that:

  • the first column contains the gene names;

  • the next three columns contain the WT samples;

  • the last three columns contain the Mutant samples;

  • columns are separated by tabs.

If you open the file design.csv with a simple text reader (wordpad for instance), you see that:

  • the first column contains the sample names;

  • the second column contains the condition (M for Mutant or WT for Wild Type);

  • columns are separated by commas;

  • the first row contains column names.

Importing the files

The two files are imported in, respectively, rawCountTable and sampleInfo with the function read.table:

rawCountTable <- read.table("countData.txt", header=TRUE, sep="\t", row.names=1)
sampleInfo <- read.table("design.csv", header=TRUE, sep=",", row.names=1)

Their content can be checked with head that print the first 6 rows:

head(rawCountTable)
##                    Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005000.2.1             0             0             0             0
## Solyc00g005020.1.1             0             0             0             0
## Solyc00g005040.2.1             0             0             0             0
## Solyc00g005050.2.1           306           502           468           369
## Solyc00g005060.1.1             0             0             0             0
## Solyc00g005070.1.1             0             0             0             0
##                    Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005000.2.1             0             0
## Solyc00g005020.1.1             0             0
## Solyc00g005040.2.1             0             0
## Solyc00g005050.2.1           366           294
## Solyc00g005060.1.1             0             0
## Solyc00g005070.1.1             0             0
nrow(rawCountTable)
## [1] 34675

34675 genes are included in this file.

Similarly, informations about samples are contained into sampleInfo. The entry genotype, that corresponds to the clone number, is converted into a factor for subsequent analyses.

head(sampleInfo)
##               condition genotype
## Cond.WT.Rep.1        WT        1
## Cond.WT.Rep.2        WT        2
## Cond.WT.Rep.3        WT        3
## Cond.Mt.Rep.1         M        1
## Cond.Mt.Rep.2         M        2
## Cond.Mt.Rep.3         M        3
sampleInfo$genotype <- as.factor(sampleInfo$genotype)

Creating a DGEList object

A DGEList object is needed to process RNAseq datasets. This object is created using the count table and the design file:

dgeFull <- DGEList(rawCountTable, remove.zeros = TRUE)
## Removing 7492 rows with all zero counts
dgeFull
## An object of class "DGEList"
## $counts
##                    Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1           306           502           468           369
## Solyc00g005080.1.1             0             0             0             0
## Solyc00g005150.1.1             0             0             0             2
## Solyc00g005840.2.1           422           394           209           440
## Solyc00g005860.1.1             0             1             0             3
##                    Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1           366           294
## Solyc00g005080.1.1             2             0
## Solyc00g005150.1.1             0             1
## Solyc00g005840.2.1           453           371
## Solyc00g005860.1.1             0             0
## 27178 more rows ...
## 
## $samples
##               group lib.size norm.factors
## Cond.WT.Rep.1     1 16602772            1
## Cond.WT.Rep.2     1 14374675            1
## Cond.WT.Rep.3     1 17427146            1
## Cond.Mt.Rep.1     1 18060148            1
## Cond.Mt.Rep.2     1 16010158            1
## Cond.Mt.Rep.3     1 13750228            1

The information about sample genotypes and conditions can be added to the $samples entry with:

dgeFull$samples$condition <- relevel(sampleInfo$condition, ref = "WT")
dgeFull$samples$genotype <- sampleInfo$genotype
dgeFull
## An object of class "DGEList"
## $counts
##                    Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1           306           502           468           369
## Solyc00g005080.1.1             0             0             0             0
## Solyc00g005150.1.1             0             0             0             2
## Solyc00g005840.2.1           422           394           209           440
## Solyc00g005860.1.1             0             1             0             3
##                    Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1           366           294
## Solyc00g005080.1.1             2             0
## Solyc00g005150.1.1             0             1
## Solyc00g005840.2.1           453           371
## Solyc00g005860.1.1             0             0
## 27178 more rows ...
## 
## $samples
##               group lib.size norm.factors condition genotype
## Cond.WT.Rep.1     1 16602772            1        WT        1
## Cond.WT.Rep.2     1 14374675            1        WT        2
## Cond.WT.Rep.3     1 17427146            1        WT        3
## Cond.Mt.Rep.1     1 18060148            1         M        1
## Cond.Mt.Rep.2     1 16010158            1         M        2
## Cond.Mt.Rep.3     1 13750228            1         M        3

Data exploration and quality assessment

Exploratory analysis is generally perormed on \(\log_2\)-transformed counts to avoid problems due to skewness of the count distribution:

pseudoCounts <- log2(dgeFull$counts + 1)
head(pseudoCounts)
##                    Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1      8.262095      8.974415      8.873444      8.531381
## Solyc00g005080.1.1      0.000000      0.000000      0.000000      0.000000
## Solyc00g005150.1.1      0.000000      0.000000      0.000000      1.584963
## Solyc00g005840.2.1      8.724514      8.625709      7.714246      8.784635
## Solyc00g005860.1.1      0.000000      1.000000      0.000000      2.000000
## Solyc00g006470.1.1      4.247928      5.523562      6.087463      4.000000
##                    Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1      8.519636      8.204571
## Solyc00g005080.1.1      1.584963      0.000000
## Solyc00g005150.1.1      0.000000      1.000000
## Solyc00g005840.2.1      8.826548      8.539159
## Solyc00g005860.1.1      0.000000      0.000000
## Solyc00g006470.1.1      4.754888      4.087463

Standard exploratory analyses include:

Histogram for pseudo-counts (sample Cond.WT.Rep.1)

hist(pseudoCounts[ ,"Cond.WT.Rep.1"], main = "", xlab = "counts")

Boxplot for pseudo-counts

par(mar = c(8,4,1,2))
boxplot(pseudoCounts, col = "gray", las = 3, cex.names = 1)

MA-plots between first two WT samples (using limma package)

limma::plotMA(pseudoCounts[ ,1:2], xlab = "M", ylab = "A", main = "")
abline(h = 0, col = "red")

MDS for pseudo-counts (using limma package)

MDS is similar to PCA when Euclidean distances are used to assess the distances between samples. In limma, only the 500 top genes (the most variable genes accross samples).

colConditions <- brewer.pal(3, "Set2")
colConditions <- colConditions[match(sampleInfo$condition,
                                     levels(sampleInfo$condition))]
pchGenotypes <- c(8, 15, 16)[match(sampleInfo$genotype,
                                   levels(sampleInfo$genotype))]
plotMDS(pseudoCounts, pch = pchGenotypes, col = colConditions)
legend("topright", lwd = 2, col = brewer.pal(3, "Set2")[1:2], 
       legend = levels(sampleInfo$condition))
legend("bottomright", pch = c(8, 15, 16), 
       legend = levels(sampleInfo$genotype))

Heatmap for pseudo-counts (using mixOmics package)

sampleDists <- as.matrix(dist(t(pseudoCounts)))
sampleDists
##               Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Cond.WT.Rep.1        0.0000      139.0220      198.5141     227.78237
## Cond.WT.Rep.2      139.0220        0.0000      145.2869     262.96305
## Cond.WT.Rep.3      198.5141      145.2869        0.0000     310.60653
## Cond.Mt.Rep.1      227.7824      262.9630      310.6065       0.00000
## Cond.Mt.Rep.2      235.9857      267.5978      315.1272      94.13883
## Cond.Mt.Rep.3      211.7681      238.1150      283.7170     111.54555
##               Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Cond.WT.Rep.1     235.98574      211.7681
## Cond.WT.Rep.2     267.59783      238.1150
## Cond.WT.Rep.3     315.12718      283.7170
## Cond.Mt.Rep.1      94.13883      111.5456
## Cond.Mt.Rep.2       0.00000      107.4814
## Cond.Mt.Rep.3     107.48136        0.0000
cimColor <- colorRampPalette(rev(brewer.pal(9, "Reds")))(16)
cim(sampleDists, color = cimColor, symkey = FALSE, row.cex = 0.7, col.cex = 0.7)

Normalization

Compute normalization factors

dgeFull <- calcNormFactors(dgeFull, method="TMM")
dgeFull
## An object of class "DGEList"
## $counts
##                    Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1           306           502           468           369
## Solyc00g005080.1.1             0             0             0             0
## Solyc00g005150.1.1             0             0             0             2
## Solyc00g005840.2.1           422           394           209           440
## Solyc00g005860.1.1             0             1             0             3
##                    Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1           366           294
## Solyc00g005080.1.1             2             0
## Solyc00g005150.1.1             0             1
## Solyc00g005840.2.1           453           371
## Solyc00g005860.1.1             0             0
## 27178 more rows ...
## 
## $samples
##               group lib.size norm.factors condition genotype
## Cond.WT.Rep.1     1 16602772    0.9435083        WT        1
## Cond.WT.Rep.2     1 14374675    0.8879054        WT        2
## Cond.WT.Rep.3     1 17427146    0.6929932        WT        3
## Cond.Mt.Rep.1     1 18060148    1.1762817         M        1
## Cond.Mt.Rep.2     1 16010158    1.2551958         M        2
## Cond.Mt.Rep.3     1 13750228    1.1666371         M        3

Important: using calcNormFactors does not change the counts: it just updates the column norm.factors in $samples. It is therefore recommanded that you use the same name (dgeFull) to save the result of this function:

head(dgeFull$counts)
##                    Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1           306           502           468           369
## Solyc00g005080.1.1             0             0             0             0
## Solyc00g005150.1.1             0             0             0             2
## Solyc00g005840.2.1           422           394           209           440
## Solyc00g005860.1.1             0             1             0             3
## Solyc00g006470.1.1            18            45            67            15
##                    Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1           366           294
## Solyc00g005080.1.1             2             0
## Solyc00g005150.1.1             0             1
## Solyc00g005840.2.1           453           371
## Solyc00g005860.1.1             0             0
## Solyc00g006470.1.1            26            16
dgeFull$samples
##               group lib.size norm.factors condition genotype
## Cond.WT.Rep.1     1 16602772    0.9435083        WT        1
## Cond.WT.Rep.2     1 14374675    0.8879054        WT        2
## Cond.WT.Rep.3     1 17427146    0.6929932        WT        3
## Cond.Mt.Rep.1     1 18060148    1.1762817         M        1
## Cond.Mt.Rep.2     1 16010158    1.2551958         M        2
## Cond.Mt.Rep.3     1 13750228    1.1666371         M        3

Normalized counts exploratory analysis

Normalized counts and pseudo-counts can be extracted from dgeFull using the function cpm:

normCounts <- cpm(dgeFull)
pseudoNormCounts <- cpm(dgeFull, log = TRUE, prior.count = 1)
par(mar = c(8,4,1,2))
boxplot(pseudoNormCounts, col = "gray", las = 3, cex.names = 1)

plotMDS(pseudoNormCounts, pch = pchGenotypes, col = colConditions)
legend("topright", lwd = 2, col = brewer.pal(3, "Set2")[1:2], 
       legend = levels(sampleInfo$condition))
legend("bottomright", pch = c(8, 15, 16), 
       legend = levels(sampleInfo$genotype))

A further analysis and comparison of the different normalizations provided in the R packages edgeR and DESeq2 is provided in this document (the design of the dataset used in this practical application is similar to the design of the tomato dataset).

Differential analysis

This section will compare the results of different types of approach to obtain genes which are differentially expressed between the wild type tomatoes and the mutants:

First approach: exact test between the two groups

In this first approach, the differences between the two groups (WT and M) is tested using an exact NB test between the two groups. This method is performed by:

  • creating a DGEList object using the argument group and using the same normalization factors than in dgeFull ;

  • estimating the dispersion for this object with the functions estimateCommonDisp and estimateTagwiseDisp ;

  • performing the test with the function exactTest.

Using the argument group in DGEList

A new DGEList object is created with the argument group. Normalization factors are updated from that in dgeFull.

dgeFull.group <- DGEList(rawCountTable, remove.zeros = TRUE, 
                         group = dgeFull$samples$condition)
## Removing 7492 rows with all zero counts
dgeFull.group$samples$norm.factors <- dgeFull$samples$norm.factors
dgeFull.group
## An object of class "DGEList"
## $counts
##                    Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1           306           502           468           369
## Solyc00g005080.1.1             0             0             0             0
## Solyc00g005150.1.1             0             0             0             2
## Solyc00g005840.2.1           422           394           209           440
## Solyc00g005860.1.1             0             1             0             3
##                    Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1           366           294
## Solyc00g005080.1.1             2             0
## Solyc00g005150.1.1             0             1
## Solyc00g005840.2.1           453           371
## Solyc00g005860.1.1             0             0
## 27178 more rows ...
## 
## $samples
##               group lib.size norm.factors
## Cond.WT.Rep.1    WT 16602772    0.9435083
## Cond.WT.Rep.2    WT 14374675    0.8879054
## Cond.WT.Rep.3    WT 17427146    0.6929932
## Cond.Mt.Rep.1     M 18060148    1.1762817
## Cond.Mt.Rep.2     M 16010158    1.2551958
## Cond.Mt.Rep.3     M 13750228    1.1666371

Estimate dispersion

Common and then tagwise dispersions can be estimated with:

dgeFull.group <- estimateCommonDisp(dgeFull.group)
dgeFull.group <- estimateTagwiseDisp(dgeFull.group)
dgeFull.group
## An object of class "DGEList"
## $counts
##                    Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1           306           502           468           369
## Solyc00g005080.1.1             0             0             0             0
## Solyc00g005150.1.1             0             0             0             2
## Solyc00g005840.2.1           422           394           209           440
## Solyc00g005860.1.1             0             1             0             3
##                    Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1           366           294
## Solyc00g005080.1.1             2             0
## Solyc00g005150.1.1             0             1
## Solyc00g005840.2.1           453           371
## Solyc00g005860.1.1             0             0
## 27178 more rows ...
## 
## $samples
##               group lib.size norm.factors
## Cond.WT.Rep.1    WT 16602772    0.9435083
## Cond.WT.Rep.2    WT 14374675    0.8879054
## Cond.WT.Rep.3    WT 17427146    0.6929932
## Cond.Mt.Rep.1     M 18060148    1.1762817
## Cond.Mt.Rep.2     M 16010158    1.2551958
## Cond.Mt.Rep.3     M 13750228    1.1666371
## 
## $common.dispersion
## [1] 0.0771487
## 
## $pseudo.counts
##                    Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1  3.118385e+02  6.274668e+02  6.181625e+02    277.174174
## Solyc00g005080.1.1  1.387779e-17  1.387779e-17  1.387779e-17      0.000000
## Solyc00g005150.1.1  1.387779e-17  1.387779e-17  1.387779e-17      1.576323
## Solyc00g005840.2.1  4.299823e+02  4.923710e+02  2.768288e+02    330.472737
## Solyc00g005860.1.1  1.788530e-03  1.182016e+00  2.506739e-02      2.455437
##                    Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1    290.700664   292.5323378
## Solyc00g005080.1.1      1.698498     0.0000000
## Solyc00g005150.1.1      0.000000     0.9949206
## Solyc00g005840.2.1    359.808607   369.1486991
## Solyc00g005860.1.1      0.000000     0.0000000
## 27178 more rows ...
## 
## $pseudo.lib.size
## [1] 15961436
## 
## $AveLogCPM
## [1]  4.662164 -2.811589 -2.713178  4.566861 -2.620642
## 27178 more elements ...
## 
## $prior.df
## [1] 10
## 
## $prior.n
## [1] 2.5
## 
## $tagwise.dispersion
## [1] 0.06868585 0.40945758 0.23730247 0.06042544 0.37974655
## 27178 more elements ...
## 
## $span
## [1] 0.3

edgeR also contains a function to estimate dispersions in a more robust way, that can be used if the previous approach seems to fail. This function is estimateGLMRobustDisp. The quality of the variability estimatino can be assessed with the BCV versus average log CPM plot (that plots \(\phi_g\) versus the average normalized count for all genes):

plotBCV(dgeFull.group)