This file analyses the data of the count table included in countData.txt
. These data are RNA sequencing of tomatoes, with two conditions (wild type and mutant) and 3 replicates by condition. One observation in a given condition has a clone in the second condition: this will be refered as the “genotype effect” in the rest of the document. The data were provided as a courtesy of Mohamed Zouine (ENSAT, Toulouse).
Download the following files: * http://www.nathalievilla.org/doc/R/solution_edgeR-tomato.R (R script) * http://www.nathalievilla.org/doc/txt/countData.txt (dataset, text format) * … (design, text format) and save them all in the same directory.
Then start RStudio and using the menu “File/Open File…”, open the file solution_edgeR-tomato.R (that contains the R script of this document). Also set your working directory to the directory where you have downloaded all files with the menu “Session / Set working directory / To source file location”.
The command lines in the R script can be executed using the button “Run” or the shortcut Ctrl+Enter
. The analysis has been performed using R and the R packages edgeR, limma, RColorBrewer, mixOmics and HTSFilter. Precise versions of the R software used in this document is given in the last section of this document.
The required packages must be loaded prior the analysis. If one of them is not already installed, you can refer to the webpage http://www.nathalievilla.org/teaching/rnaseq.html to install it by yourself (HTSFilter
is probably not installed; you can not use the standard command install.packages
to install it because it is a Bioconductor package: check the webpage!).
## Do Not Run
# how to install required packages?
# install.packages(c("RColorBrewer","mixOmics"))
# source("http://bioconductor.org/biocLite.R")
# biocLite("edgeR")
## End DNR
library(edgeR)
library(limma)
library(RColorBrewer)
library(mixOmics)
library(VennDiagram)
library(HTSFilter)
If you open the file countData.txt
with a simple text reader (wordpad for instance), you see that:
the first column contains the gene names;
the next three columns contain the WT samples;
the last three columns contain the Mutant samples;
columns are separated by tabs.
If you open the file design.csv
with a simple text reader (wordpad for instance), you see that:
the first column contains the sample names;
the second column contains the condition (M for Mutant or WT for Wild Type);
columns are separated by commas;
the first row contains column names.
The two files are imported in, respectively, rawCountTable
and sampleInfo
with the function read.table
:
rawCountTable <- read.table("countData.txt", header=TRUE, sep="\t", row.names=1)
sampleInfo <- read.table("design.csv", header=TRUE, sep=",", row.names=1)
Their content can be checked with head
that print the first 6 rows:
head(rawCountTable)
## Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005000.2.1 0 0 0 0
## Solyc00g005020.1.1 0 0 0 0
## Solyc00g005040.2.1 0 0 0 0
## Solyc00g005050.2.1 306 502 468 369
## Solyc00g005060.1.1 0 0 0 0
## Solyc00g005070.1.1 0 0 0 0
## Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005000.2.1 0 0
## Solyc00g005020.1.1 0 0
## Solyc00g005040.2.1 0 0
## Solyc00g005050.2.1 366 294
## Solyc00g005060.1.1 0 0
## Solyc00g005070.1.1 0 0
nrow(rawCountTable)
## [1] 34675
34675 genes are included in this file.
Similarly, informations about samples are contained into sampleInfo
. The entry genotype
, that corresponds to the clone number, is converted into a factor for subsequent analyses.
head(sampleInfo)
## condition genotype
## Cond.WT.Rep.1 WT 1
## Cond.WT.Rep.2 WT 2
## Cond.WT.Rep.3 WT 3
## Cond.Mt.Rep.1 M 1
## Cond.Mt.Rep.2 M 2
## Cond.Mt.Rep.3 M 3
sampleInfo$genotype <- as.factor(sampleInfo$genotype)
DGEList
objectA DGEList object is needed to process RNAseq datasets. This object is created using the count table and the design file:
dgeFull <- DGEList(rawCountTable, remove.zeros = TRUE)
## Removing 7492 rows with all zero counts
dgeFull
## An object of class "DGEList"
## $counts
## Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1 306 502 468 369
## Solyc00g005080.1.1 0 0 0 0
## Solyc00g005150.1.1 0 0 0 2
## Solyc00g005840.2.1 422 394 209 440
## Solyc00g005860.1.1 0 1 0 3
## Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1 366 294
## Solyc00g005080.1.1 2 0
## Solyc00g005150.1.1 0 1
## Solyc00g005840.2.1 453 371
## Solyc00g005860.1.1 0 0
## 27178 more rows ...
##
## $samples
## group lib.size norm.factors
## Cond.WT.Rep.1 1 16602772 1
## Cond.WT.Rep.2 1 14374675 1
## Cond.WT.Rep.3 1 17427146 1
## Cond.Mt.Rep.1 1 18060148 1
## Cond.Mt.Rep.2 1 16010158 1
## Cond.Mt.Rep.3 1 13750228 1
The information about sample genotypes and conditions can be added to the $samples
entry with:
dgeFull$samples$condition <- relevel(sampleInfo$condition, ref = "WT")
dgeFull$samples$genotype <- sampleInfo$genotype
dgeFull
## An object of class "DGEList"
## $counts
## Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1 306 502 468 369
## Solyc00g005080.1.1 0 0 0 0
## Solyc00g005150.1.1 0 0 0 2
## Solyc00g005840.2.1 422 394 209 440
## Solyc00g005860.1.1 0 1 0 3
## Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1 366 294
## Solyc00g005080.1.1 2 0
## Solyc00g005150.1.1 0 1
## Solyc00g005840.2.1 453 371
## Solyc00g005860.1.1 0 0
## 27178 more rows ...
##
## $samples
## group lib.size norm.factors condition genotype
## Cond.WT.Rep.1 1 16602772 1 WT 1
## Cond.WT.Rep.2 1 14374675 1 WT 2
## Cond.WT.Rep.3 1 17427146 1 WT 3
## Cond.Mt.Rep.1 1 18060148 1 M 1
## Cond.Mt.Rep.2 1 16010158 1 M 2
## Cond.Mt.Rep.3 1 13750228 1 M 3
Exploratory analysis is generally perormed on \(\log_2\)-transformed counts to avoid problems due to skewness of the count distribution:
pseudoCounts <- log2(dgeFull$counts + 1)
head(pseudoCounts)
## Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1 8.262095 8.974415 8.873444 8.531381
## Solyc00g005080.1.1 0.000000 0.000000 0.000000 0.000000
## Solyc00g005150.1.1 0.000000 0.000000 0.000000 1.584963
## Solyc00g005840.2.1 8.724514 8.625709 7.714246 8.784635
## Solyc00g005860.1.1 0.000000 1.000000 0.000000 2.000000
## Solyc00g006470.1.1 4.247928 5.523562 6.087463 4.000000
## Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1 8.519636 8.204571
## Solyc00g005080.1.1 1.584963 0.000000
## Solyc00g005150.1.1 0.000000 1.000000
## Solyc00g005840.2.1 8.826548 8.539159
## Solyc00g005860.1.1 0.000000 0.000000
## Solyc00g006470.1.1 4.754888 4.087463
Standard exploratory analyses include:
Cond.WT.Rep.1
)hist(pseudoCounts[ ,"Cond.WT.Rep.1"], main = "", xlab = "counts")
par(mar = c(8,4,1,2))
boxplot(pseudoCounts, col = "gray", las = 3, cex.names = 1)
limma
package)limma::plotMA(pseudoCounts[ ,1:2], xlab = "M", ylab = "A", main = "")
abline(h = 0, col = "red")
limma
package)MDS is similar to PCA when Euclidean distances are used to assess the distances between samples. In limma
, only the 500 top genes (the most variable genes accross samples).
colConditions <- brewer.pal(3, "Set2")
colConditions <- colConditions[match(sampleInfo$condition,
levels(sampleInfo$condition))]
pchGenotypes <- c(8, 15, 16)[match(sampleInfo$genotype,
levels(sampleInfo$genotype))]
plotMDS(pseudoCounts, pch = pchGenotypes, col = colConditions)
legend("topright", lwd = 2, col = brewer.pal(3, "Set2")[1:2],
legend = levels(sampleInfo$condition))
legend("bottomright", pch = c(8, 15, 16),
legend = levels(sampleInfo$genotype))
mixOmics
package)sampleDists <- as.matrix(dist(t(pseudoCounts)))
sampleDists
## Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Cond.WT.Rep.1 0.0000 139.0220 198.5141 227.78237
## Cond.WT.Rep.2 139.0220 0.0000 145.2869 262.96305
## Cond.WT.Rep.3 198.5141 145.2869 0.0000 310.60653
## Cond.Mt.Rep.1 227.7824 262.9630 310.6065 0.00000
## Cond.Mt.Rep.2 235.9857 267.5978 315.1272 94.13883
## Cond.Mt.Rep.3 211.7681 238.1150 283.7170 111.54555
## Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Cond.WT.Rep.1 235.98574 211.7681
## Cond.WT.Rep.2 267.59783 238.1150
## Cond.WT.Rep.3 315.12718 283.7170
## Cond.Mt.Rep.1 94.13883 111.5456
## Cond.Mt.Rep.2 0.00000 107.4814
## Cond.Mt.Rep.3 107.48136 0.0000
cimColor <- colorRampPalette(rev(brewer.pal(9, "Reds")))(16)
cim(sampleDists, color = cimColor, symkey = FALSE, row.cex = 0.7, col.cex = 0.7)
dgeFull <- calcNormFactors(dgeFull, method="TMM")
dgeFull
## An object of class "DGEList"
## $counts
## Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1 306 502 468 369
## Solyc00g005080.1.1 0 0 0 0
## Solyc00g005150.1.1 0 0 0 2
## Solyc00g005840.2.1 422 394 209 440
## Solyc00g005860.1.1 0 1 0 3
## Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1 366 294
## Solyc00g005080.1.1 2 0
## Solyc00g005150.1.1 0 1
## Solyc00g005840.2.1 453 371
## Solyc00g005860.1.1 0 0
## 27178 more rows ...
##
## $samples
## group lib.size norm.factors condition genotype
## Cond.WT.Rep.1 1 16602772 0.9435083 WT 1
## Cond.WT.Rep.2 1 14374675 0.8879054 WT 2
## Cond.WT.Rep.3 1 17427146 0.6929932 WT 3
## Cond.Mt.Rep.1 1 18060148 1.1762817 M 1
## Cond.Mt.Rep.2 1 16010158 1.2551958 M 2
## Cond.Mt.Rep.3 1 13750228 1.1666371 M 3
Important: using calcNormFactors
does not change the counts: it just updates the column norm.factors
in $samples
. It is therefore recommanded that you use the same name (dgeFull
) to save the result of this function:
head(dgeFull$counts)
## Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1 306 502 468 369
## Solyc00g005080.1.1 0 0 0 0
## Solyc00g005150.1.1 0 0 0 2
## Solyc00g005840.2.1 422 394 209 440
## Solyc00g005860.1.1 0 1 0 3
## Solyc00g006470.1.1 18 45 67 15
## Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1 366 294
## Solyc00g005080.1.1 2 0
## Solyc00g005150.1.1 0 1
## Solyc00g005840.2.1 453 371
## Solyc00g005860.1.1 0 0
## Solyc00g006470.1.1 26 16
dgeFull$samples
## group lib.size norm.factors condition genotype
## Cond.WT.Rep.1 1 16602772 0.9435083 WT 1
## Cond.WT.Rep.2 1 14374675 0.8879054 WT 2
## Cond.WT.Rep.3 1 17427146 0.6929932 WT 3
## Cond.Mt.Rep.1 1 18060148 1.1762817 M 1
## Cond.Mt.Rep.2 1 16010158 1.2551958 M 2
## Cond.Mt.Rep.3 1 13750228 1.1666371 M 3
Normalized counts and pseudo-counts can be extracted from dgeFull
using the function cpm
:
normCounts <- cpm(dgeFull)
pseudoNormCounts <- cpm(dgeFull, log = TRUE, prior.count = 1)
par(mar = c(8,4,1,2))
boxplot(pseudoNormCounts, col = "gray", las = 3, cex.names = 1)
plotMDS(pseudoNormCounts, pch = pchGenotypes, col = colConditions)
legend("topright", lwd = 2, col = brewer.pal(3, "Set2")[1:2],
legend = levels(sampleInfo$condition))
legend("bottomright", pch = c(8, 15, 16),
legend = levels(sampleInfo$genotype))
A further analysis and comparison of the different normalizations provided in the R packages edgeR
and DESeq2
is provided in this document (the design of the dataset used in this practical application is similar to the design of the tomato dataset).
This section will compare the results of different types of approach to obtain genes which are differentially expressed between the wild type tomatoes and the mutants:
a standard NB exact test between two conditions;
a GLM with the plant and genotype effects.
In this first approach, the differences between the two groups (WT and M) is tested using an exact NB test between the two groups. This method is performed by:
creating a DGEList
object using the argument group
and using the same normalization factors than in dgeFull
;
estimating the dispersion for this object with the functions estimateCommonDisp
and estimateTagwiseDisp
;
performing the test with the function exactTest
.
group
in DGEList
A new DGEList
object is created with the argument group
. Normalization factors are updated from that in dgeFull
.
dgeFull.group <- DGEList(rawCountTable, remove.zeros = TRUE,
group = dgeFull$samples$condition)
## Removing 7492 rows with all zero counts
dgeFull.group$samples$norm.factors <- dgeFull$samples$norm.factors
dgeFull.group
## An object of class "DGEList"
## $counts
## Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1 306 502 468 369
## Solyc00g005080.1.1 0 0 0 0
## Solyc00g005150.1.1 0 0 0 2
## Solyc00g005840.2.1 422 394 209 440
## Solyc00g005860.1.1 0 1 0 3
## Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1 366 294
## Solyc00g005080.1.1 2 0
## Solyc00g005150.1.1 0 1
## Solyc00g005840.2.1 453 371
## Solyc00g005860.1.1 0 0
## 27178 more rows ...
##
## $samples
## group lib.size norm.factors
## Cond.WT.Rep.1 WT 16602772 0.9435083
## Cond.WT.Rep.2 WT 14374675 0.8879054
## Cond.WT.Rep.3 WT 17427146 0.6929932
## Cond.Mt.Rep.1 M 18060148 1.1762817
## Cond.Mt.Rep.2 M 16010158 1.2551958
## Cond.Mt.Rep.3 M 13750228 1.1666371
Common and then tagwise dispersions can be estimated with:
dgeFull.group <- estimateCommonDisp(dgeFull.group)
dgeFull.group <- estimateTagwiseDisp(dgeFull.group)
dgeFull.group
## An object of class "DGEList"
## $counts
## Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1 306 502 468 369
## Solyc00g005080.1.1 0 0 0 0
## Solyc00g005150.1.1 0 0 0 2
## Solyc00g005840.2.1 422 394 209 440
## Solyc00g005860.1.1 0 1 0 3
## Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1 366 294
## Solyc00g005080.1.1 2 0
## Solyc00g005150.1.1 0 1
## Solyc00g005840.2.1 453 371
## Solyc00g005860.1.1 0 0
## 27178 more rows ...
##
## $samples
## group lib.size norm.factors
## Cond.WT.Rep.1 WT 16602772 0.9435083
## Cond.WT.Rep.2 WT 14374675 0.8879054
## Cond.WT.Rep.3 WT 17427146 0.6929932
## Cond.Mt.Rep.1 M 18060148 1.1762817
## Cond.Mt.Rep.2 M 16010158 1.2551958
## Cond.Mt.Rep.3 M 13750228 1.1666371
##
## $common.dispersion
## [1] 0.0771487
##
## $pseudo.counts
## Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1 3.118385e+02 6.274668e+02 6.181625e+02 277.174174
## Solyc00g005080.1.1 1.387779e-17 1.387779e-17 1.387779e-17 0.000000
## Solyc00g005150.1.1 1.387779e-17 1.387779e-17 1.387779e-17 1.576323
## Solyc00g005840.2.1 4.299823e+02 4.923710e+02 2.768288e+02 330.472737
## Solyc00g005860.1.1 1.788530e-03 1.182016e+00 2.506739e-02 2.455437
## Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1 290.700664 292.5323378
## Solyc00g005080.1.1 1.698498 0.0000000
## Solyc00g005150.1.1 0.000000 0.9949206
## Solyc00g005840.2.1 359.808607 369.1486991
## Solyc00g005860.1.1 0.000000 0.0000000
## 27178 more rows ...
##
## $pseudo.lib.size
## [1] 15961436
##
## $AveLogCPM
## [1] 4.662164 -2.811589 -2.713178 4.566861 -2.620642
## 27178 more elements ...
##
## $prior.df
## [1] 10
##
## $prior.n
## [1] 2.5
##
## $tagwise.dispersion
## [1] 0.06868585 0.40945758 0.23730247 0.06042544 0.37974655
## 27178 more elements ...
##
## $span
## [1] 0.3
edgeR
also contains a function to estimate dispersions in a more robust way, that can be used if the previous approach seems to fail. This function is estimateGLMRobustDisp
. The quality of the variability estimatino can be assessed with the BCV versus average log CPM plot (that plots \(\phi_g\) versus the average normalized count for all genes):
plotBCV(dgeFull.group)