Where to start? (installation and alike)
Preparing datasets
Data exploration and quality assessment
Normalization
- Compute normalization factors
- Normalized counts exploratory analysis
Differential analysis
Filtering
- Differential analysis after independent filtering
- Comparison
Exploratory analysis of DEGs
Session information

This file analyses the data of the count table included in countData.txt. These data are RNA sequencing of tomatoes, with two conditions (wild type and mutant) and 3 replicates by condition. One observation in a given condition has a clone in the second condition: this will be refered as the “genotype effect” in the rest of the document. The data were provided as a courtesy of Mohamed Zouine (ENSAT, Toulouse).

Where to start? (installation and alike)

Download the following files: * http://www.nathalievialaneix.eu/doc/R/solution_edgeR-tomato.R (R script) * http://www.nathalievialaneix.eu/doc/txt/countData.txt (dataset, text format) * … (design, text format) and save them all in the same directory.

Then start RStudio and using the menu “File/Open File…”, open the file solution_edgeR-tomato.R (that contains the R script of this document). Also set your working directory to the directory where you have downloaded all files with the menu “Session / Set working directory / To source file location”.

The command lines in the R script can be executed using the button “Run” or the shortcut Ctrl+Enter. The analysis has been performed using R and the R packages edgeR, limma, RColorBrewer, mixOmics and HTSFilter. Precise versions of the R software used in this document is given in the last section of this document.

The required packages must be loaded prior the analysis. If one of them is not already installed, you can refer to the webpage http://www.nathalievialaneix.eu/teaching/rnaseq.html to install it by yourself (HTSFilter is probably not installed; you can not use the standard command install.packages to install it because it is a Bioconductor package: check the webpage!).

## Do Not Run
# how to install required packages?
# install.packages(c("RColorBrewer","mixOmics"))
# source("http://bioconductor.org/biocLite.R")
# biocLite("edgeR")
## End DNR
library(edgeR)
library(limma)
library(RColorBrewer)
library(mixOmics)
library(VennDiagram)
library(HTSFilter)

Preparing datasets

Description of the files (count table and design)

If you open the file countData.txt with a simple text reader (wordpad for instance), you see that:

the first column contains the gene names;
the next three columns contain the WT samples;
the last three columns contain the Mutant samples;
columns are separated by tabs.

If you open the file design.csv with a simple text reader (wordpad for instance), you see that:

the first column contains the sample names;
the second column contains the condition (M for Mutant or WT for Wild Type);
columns are separated by commas;
the first row contains column names.

Importing the files

The two files are imported in, respectively, rawCountTable and sampleInfo with the function read.table:

rawCountTable <- read.table("countData.txt", header=TRUE, sep="\t", row.names=1)
sampleInfo <- read.table("design.csv", header=TRUE, sep=",", row.names=1)

Their content can be checked with head that print the first 6 rows:

head(rawCountTable)

##                    Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005000.2.1             0             0             0             0
## Solyc00g005020.1.1             0             0             0             0
## Solyc00g005040.2.1             0             0             0             0
## Solyc00g005050.2.1           306           502           468           369
## Solyc00g005060.1.1             0             0             0             0
## Solyc00g005070.1.1             0             0             0             0
##                    Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005000.2.1             0             0
## Solyc00g005020.1.1             0             0
## Solyc00g005040.2.1             0             0
## Solyc00g005050.2.1           366           294
## Solyc00g005060.1.1             0             0
## Solyc00g005070.1.1             0             0

nrow(rawCountTable)

## [1] 34675

34675 genes are included in this file.

Similarly, informations about samples are contained into sampleInfo. The entry genotype, that corresponds to the clone number, is converted into a factor for subsequent analyses.

head(sampleInfo)

##               condition genotype
## Cond.WT.Rep.1        WT        1
## Cond.WT.Rep.2        WT        2
## Cond.WT.Rep.3        WT        3
## Cond.Mt.Rep.1         M        1
## Cond.Mt.Rep.2         M        2
## Cond.Mt.Rep.3         M        3

sampleInfo$genotype <- as.factor(sampleInfo$genotype)

Creating a `DGEList` object

A DGEList object is needed to process RNAseq datasets. This object is created using the count table and the design file:

dgeFull <- DGEList(rawCountTable, remove.zeros = TRUE)

## Removing 7492 rows with all zero counts

dgeFull

## An object of class "DGEList"
## $counts
##                    Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1           306           502           468           369
## Solyc00g005080.1.1             0             0             0             0
## Solyc00g005150.1.1             0             0             0             2
## Solyc00g005840.2.1           422           394           209           440
## Solyc00g005860.1.1             0             1             0             3
##                    Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1           366           294
## Solyc00g005080.1.1             2             0
## Solyc00g005150.1.1             0             1
## Solyc00g005840.2.1           453           371
## Solyc00g005860.1.1             0             0
## 27178 more rows ...
## 
## $samples
##               group lib.size norm.factors
## Cond.WT.Rep.1     1 16602772            1
## Cond.WT.Rep.2     1 14374675            1
## Cond.WT.Rep.3     1 17427146            1
## Cond.Mt.Rep.1     1 18060148            1
## Cond.Mt.Rep.2     1 16010158            1
## Cond.Mt.Rep.3     1 13750228            1

The information about sample genotypes and conditions can be added to the $samples entry with:

dgeFull$samples$condition <- relevel(sampleInfo$condition, ref = "WT")
dgeFull$samples$genotype <- sampleInfo$genotype
dgeFull

## An object of class "DGEList"
## $counts
##                    Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1           306           502           468           369
## Solyc00g005080.1.1             0             0             0             0
## Solyc00g005150.1.1             0             0             0             2
## Solyc00g005840.2.1           422           394           209           440
## Solyc00g005860.1.1             0             1             0             3
##                    Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1           366           294
## Solyc00g005080.1.1             2             0
## Solyc00g005150.1.1             0             1
## Solyc00g005840.2.1           453           371
## Solyc00g005860.1.1             0             0
## 27178 more rows ...
## 
## $samples
##               group lib.size norm.factors condition genotype
## Cond.WT.Rep.1     1 16602772            1        WT        1
## Cond.WT.Rep.2     1 14374675            1        WT        2
## Cond.WT.Rep.3     1 17427146            1        WT        3
## Cond.Mt.Rep.1     1 18060148            1         M        1
## Cond.Mt.Rep.2     1 16010158            1         M        2
## Cond.Mt.Rep.3     1 13750228            1         M        3

Data exploration and quality assessment

Exploratory analysis is generally perormed on $\log_2$-transformed counts to avoid problems due to skewness of the count distribution:

pseudoCounts <- log2(dgeFull$counts + 1)
head(pseudoCounts)

##                    Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1      8.262095      8.974415      8.873444      8.531381
## Solyc00g005080.1.1      0.000000      0.000000      0.000000      0.000000
## Solyc00g005150.1.1      0.000000      0.000000      0.000000      1.584963
## Solyc00g005840.2.1      8.724514      8.625709      7.714246      8.784635
## Solyc00g005860.1.1      0.000000      1.000000      0.000000      2.000000
## Solyc00g006470.1.1      4.247928      5.523562      6.087463      4.000000
##                    Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1      8.519636      8.204571
## Solyc00g005080.1.1      1.584963      0.000000
## Solyc00g005150.1.1      0.000000      1.000000
## Solyc00g005840.2.1      8.826548      8.539159
## Solyc00g005860.1.1      0.000000      0.000000
## Solyc00g006470.1.1      4.754888      4.087463

Standard exploratory analyses include:

Histogram for pseudo-counts (sample `Cond.WT.Rep.1`)

hist(pseudoCounts[ ,"Cond.WT.Rep.1"], main = "", xlab = "counts")

Boxplot for pseudo-counts

par(mar = c(8,4,1,2))
boxplot(pseudoCounts, col = "gray", las = 3, cex.names = 1)

MA-plots between first two WT samples (using `limma` package)

limma::plotMA(pseudoCounts[ ,1:2], xlab = "M", ylab = "A", main = "")
abline(h = 0, col = "red")

MDS for pseudo-counts (using `limma` package)

MDS is similar to PCA when Euclidean distances are used to assess the distances between samples. In limma, only the 500 top genes (the most variable genes accross samples).

colConditions <- brewer.pal(3, "Set2")
colConditions <- colConditions[match(sampleInfo$condition,
                                     levels(sampleInfo$condition))]
pchGenotypes <- c(8, 15, 16)[match(sampleInfo$genotype,
                                   levels(sampleInfo$genotype))]
plotMDS(pseudoCounts, pch = pchGenotypes, col = colConditions)
legend("topright", lwd = 2, col = brewer.pal(3, "Set2")[1:2], 
       legend = levels(sampleInfo$condition))
legend("bottomright", pch = c(8, 15, 16), 
       legend = levels(sampleInfo$genotype))

Heatmap for pseudo-counts (using `mixOmics` package)

sampleDists <- as.matrix(dist(t(pseudoCounts)))
sampleDists

##               Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Cond.WT.Rep.1        0.0000      139.0220      198.5141     227.78237
## Cond.WT.Rep.2      139.0220        0.0000      145.2869     262.96305
## Cond.WT.Rep.3      198.5141      145.2869        0.0000     310.60653
## Cond.Mt.Rep.1      227.7824      262.9630      310.6065       0.00000
## Cond.Mt.Rep.2      235.9857      267.5978      315.1272      94.13883
## Cond.Mt.Rep.3      211.7681      238.1150      283.7170     111.54555
##               Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Cond.WT.Rep.1     235.98574      211.7681
## Cond.WT.Rep.2     267.59783      238.1150
## Cond.WT.Rep.3     315.12718      283.7170
## Cond.Mt.Rep.1      94.13883      111.5456
## Cond.Mt.Rep.2       0.00000      107.4814
## Cond.Mt.Rep.3     107.48136        0.0000

cimColor <- colorRampPalette(rev(brewer.pal(9, "Reds")))(16)
cim(sampleDists, color = cimColor, symkey = FALSE, row.cex = 0.7, col.cex = 0.7)

Normalization

Compute normalization factors

dgeFull <- calcNormFactors(dgeFull, method="TMM")
dgeFull

## An object of class "DGEList"
## $counts
##                    Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1           306           502           468           369
## Solyc00g005080.1.1             0             0             0             0
## Solyc00g005150.1.1             0             0             0             2
## Solyc00g005840.2.1           422           394           209           440
## Solyc00g005860.1.1             0             1             0             3
##                    Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1           366           294
## Solyc00g005080.1.1             2             0
## Solyc00g005150.1.1             0             1
## Solyc00g005840.2.1           453           371
## Solyc00g005860.1.1             0             0
## 27178 more rows ...
## 
## $samples
##               group lib.size norm.factors condition genotype
## Cond.WT.Rep.1     1 16602772    0.9435083        WT        1
## Cond.WT.Rep.2     1 14374675    0.8879054        WT        2
## Cond.WT.Rep.3     1 17427146    0.6929932        WT        3
## Cond.Mt.Rep.1     1 18060148    1.1762817         M        1
## Cond.Mt.Rep.2     1 16010158    1.2551958         M        2
## Cond.Mt.Rep.3     1 13750228    1.1666371         M        3

Important: using calcNormFactors does not change the counts: it just updates the column norm.factors in $samples. It is therefore recommanded that you use the same name (dgeFull) to save the result of this function:

head(dgeFull$counts)

##                    Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1           306           502           468           369
## Solyc00g005080.1.1             0             0             0             0
## Solyc00g005150.1.1             0             0             0             2
## Solyc00g005840.2.1           422           394           209           440
## Solyc00g005860.1.1             0             1             0             3
## Solyc00g006470.1.1            18            45            67            15
##                    Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1           366           294
## Solyc00g005080.1.1             2             0
## Solyc00g005150.1.1             0             1
## Solyc00g005840.2.1           453           371
## Solyc00g005860.1.1             0             0
## Solyc00g006470.1.1            26            16

dgeFull$samples

##               group lib.size norm.factors condition genotype
## Cond.WT.Rep.1     1 16602772    0.9435083        WT        1
## Cond.WT.Rep.2     1 14374675    0.8879054        WT        2
## Cond.WT.Rep.3     1 17427146    0.6929932        WT        3
## Cond.Mt.Rep.1     1 18060148    1.1762817         M        1
## Cond.Mt.Rep.2     1 16010158    1.2551958         M        2
## Cond.Mt.Rep.3     1 13750228    1.1666371         M        3

Normalized counts exploratory analysis

Normalized counts and pseudo-counts can be extracted from dgeFull using the function cpm:

normCounts <- cpm(dgeFull)
pseudoNormCounts <- cpm(dgeFull, log = TRUE, prior.count = 1)
par(mar = c(8,4,1,2))
boxplot(pseudoNormCounts, col = "gray", las = 3, cex.names = 1)

plotMDS(pseudoNormCounts, pch = pchGenotypes, col = colConditions)
legend("topright", lwd = 2, col = brewer.pal(3, "Set2")[1:2], 
       legend = levels(sampleInfo$condition))
legend("bottomright", pch = c(8, 15, 16), 
       legend = levels(sampleInfo$genotype))

A further analysis and comparison of the different normalizations provided in the R packages edgeR and DESeq2 is provided in this document (the design of the dataset used in this practical application is similar to the design of the tomato dataset).

Differential analysis

This section will compare the results of different types of approach to obtain genes which are differentially expressed between the wild type tomatoes and the mutants:

a standard NB exact test between two conditions;
a GLM with the plant and genotype effects.

First approach: exact test between the two groups

In this first approach, the differences between the two groups (WT and M) is tested using an exact NB test between the two groups. This method is performed by:

creating a DGEList object using the argument group and using the same normalization factors than in dgeFull ;
estimating the dispersion for this object with the functions estimateCommonDisp and estimateTagwiseDisp ;
performing the test with the function exactTest.

Using the argument `group` in `DGEList`

A new DGEList object is created with the argument group. Normalization factors are updated from that in dgeFull.

dgeFull.group <- DGEList(rawCountTable, remove.zeros = TRUE, 
                         group = dgeFull$samples$condition)

## Removing 7492 rows with all zero counts

dgeFull.group$samples$norm.factors <- dgeFull$samples$norm.factors
dgeFull.group

## An object of class "DGEList"
## $counts
##                    Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1           306           502           468           369
## Solyc00g005080.1.1             0             0             0             0
## Solyc00g005150.1.1             0             0             0             2
## Solyc00g005840.2.1           422           394           209           440
## Solyc00g005860.1.1             0             1             0             3
##                    Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1           366           294
## Solyc00g005080.1.1             2             0
## Solyc00g005150.1.1             0             1
## Solyc00g005840.2.1           453           371
## Solyc00g005860.1.1             0             0
## 27178 more rows ...
## 
## $samples
##               group lib.size norm.factors
## Cond.WT.Rep.1    WT 16602772    0.9435083
## Cond.WT.Rep.2    WT 14374675    0.8879054
## Cond.WT.Rep.3    WT 17427146    0.6929932
## Cond.Mt.Rep.1     M 18060148    1.1762817
## Cond.Mt.Rep.2     M 16010158    1.2551958
## Cond.Mt.Rep.3     M 13750228    1.1666371

Estimate dispersion

Common and then tagwise dispersions can be estimated with:

dgeFull.group <- estimateCommonDisp(dgeFull.group)
dgeFull.group <- estimateTagwiseDisp(dgeFull.group)
dgeFull.group

## An object of class "DGEList"
## $counts
##                    Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1           306           502           468           369
## Solyc00g005080.1.1             0             0             0             0
## Solyc00g005150.1.1             0             0             0             2
## Solyc00g005840.2.1           422           394           209           440
## Solyc00g005860.1.1             0             1             0             3
##                    Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1           366           294
## Solyc00g005080.1.1             2             0
## Solyc00g005150.1.1             0             1
## Solyc00g005840.2.1           453           371
## Solyc00g005860.1.1             0             0
## 27178 more rows ...
## 
## $samples
##               group lib.size norm.factors
## Cond.WT.Rep.1    WT 16602772    0.9435083
## Cond.WT.Rep.2    WT 14374675    0.8879054
## Cond.WT.Rep.3    WT 17427146    0.6929932
## Cond.Mt.Rep.1     M 18060148    1.1762817
## Cond.Mt.Rep.2     M 16010158    1.2551958
## Cond.Mt.Rep.3     M 13750228    1.1666371
## 
## $common.dispersion
## [1] 0.0771487
## 
## $pseudo.counts
##                    Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1  3.118385e+02  6.274668e+02  6.181625e+02    277.174174
## Solyc00g005080.1.1  1.387779e-17  1.387779e-17  1.387779e-17      0.000000
## Solyc00g005150.1.1  1.387779e-17  1.387779e-17  1.387779e-17      1.576323
## Solyc00g005840.2.1  4.299823e+02  4.923710e+02  2.768288e+02    330.472737
## Solyc00g005860.1.1  1.788530e-03  1.182016e+00  2.506739e-02      2.455437
##                    Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1    290.700664   292.5323378
## Solyc00g005080.1.1      1.698498     0.0000000
## Solyc00g005150.1.1      0.000000     0.9949206
## Solyc00g005840.2.1    359.808607   369.1486991
## Solyc00g005860.1.1      0.000000     0.0000000
## 27178 more rows ...
## 
## $pseudo.lib.size
## [1] 15961436
## 
## $AveLogCPM
## [1]  4.662164 -2.811589 -2.713178  4.566861 -2.620642
## 27178 more elements ...
## 
## $prior.df
## [1] 10
## 
## $prior.n
## [1] 2.5
## 
## $tagwise.dispersion
## [1] 0.06868585 0.40945758 0.23730247 0.06042544 0.37974655
## 27178 more elements ...
## 
## $span
## [1] 0.3

edgeR also contains a function to estimate dispersions in a more robust way, that can be used if the previous approach seems to fail. This function is estimateGLMRobustDisp. The quality of the variability estimatino can be assessed with the BCV versus average log CPM plot (that plots $\phi_g$ versus the average normalized count for all genes):

plotBCV(dgeFull.group)

Perform the test

An exact perform an exact test for the difference in expression between the two conditions “WT” and “M”:

dgeExactTest <- exactTest(dgeFull.group)
dgeExactTest

## An object of class "DGEExact"
## $table
##                         logFC    logCPM     PValue
## Solyc00g005050.2.1 -0.8550485  4.662164 0.00717063
## Solyc00g005080.1.1  2.4605333 -2.811589 0.53194167
## Solyc00g005150.1.1  2.9663412 -2.713178 0.27853427
## Solyc00g005840.2.1 -0.1792636  4.566861 0.54962937
## Solyc00g005860.1.1  0.8423155 -2.620642 1.00000000
## 27178 more rows ...
## 
## $comparison
## [1] "WT" "M" 
## 
## $genes
## NULL

$p$-values are corrected with the function topTags :

resExactTest <- topTags(dgeExactTest, n = nrow(dgeExactTest$table))
head(resExactTest$table)

##                       logFC   logCPM        PValue           FDR
## Solyc07g053460.2.1 10.21752 8.535498 4.732511e-110 1.286438e-105
## Solyc03g098680.2.1 10.01232 7.704347 2.139525e-100  2.907936e-96
## Solyc01g009590.2.1 10.16384 8.694514  3.141090e-99  2.846141e-95
## Solyc07g015960.1.1 10.05634 7.328319  1.695548e-91  1.152252e-87
## Solyc06g064480.2.1 11.41320 6.803637  2.305164e-91  1.253226e-87
## Solyc01g086830.2.1 10.22500 7.374713  5.409676e-88  2.450854e-84

$p$-value and (BH) adjusted $p$-value distribution can be assessed with:

par(mfrow = c(1,2))
hist(resExactTest$table$PValue, xlab = "p-value", main = "raw p-values")
hist(resExactTest$table$FDR, xlab = "p-value", main = "adjusted p-values")

And finally, genes with a FDR smaller than 5% and a log Fold Change larger than 1 or smaller than -1 are extracted:

selectedET <- resExactTest$table$FDR < 0.05 & abs(resExactTest$table$logFC) > 1
selectedET <- resExactTest$table[selectedET, ]
nrow(selectedET)

## [1] 7010

head(selectedET)

##                       logFC   logCPM        PValue           FDR
## Solyc07g053460.2.1 10.21752 8.535498 4.732511e-110 1.286438e-105
## Solyc03g098680.2.1 10.01232 7.704347 2.139525e-100  2.907936e-96
## Solyc01g009590.2.1 10.16384 8.694514  3.141090e-99  2.846141e-95
## Solyc07g015960.1.1 10.05634 7.328319  1.695548e-91  1.152252e-87
## Solyc06g064480.2.1 11.41320 6.803637  2.305164e-91  1.253226e-87
## Solyc01g086830.2.1 10.22500 7.374713  5.409676e-88  2.450854e-84

which shows that 7010 genes are found differential with this method. The column logFC can be used to found up/down regulated genes in the M:

selectedET$updown <- factor(ifelse(selectedET$logFC > 0, "up", "down"))
head(selectedET)

##                       logFC   logCPM        PValue           FDR updown
## Solyc07g053460.2.1 10.21752 8.535498 4.732511e-110 1.286438e-105     up
## Solyc03g098680.2.1 10.01232 7.704347 2.139525e-100  2.907936e-96     up
## Solyc01g009590.2.1 10.16384 8.694514  3.141090e-99  2.846141e-95     up
## Solyc07g015960.1.1 10.05634 7.328319  1.695548e-91  1.152252e-87     up
## Solyc06g064480.2.1 11.41320 6.803637  2.305164e-91  1.253226e-87     up
## Solyc01g086830.2.1 10.22500 7.374713  5.409676e-88  2.450854e-84     up

This list can be exported with:

write.table(selectedET, file = "tomatoDEG.csv", sep = ",")

Second approach: GLM with condition and genotype effects

Here, we fit a GLM to account for the genotype effect. The model writes $K_{gj} \sim \mbox{NB}(\mu_{gj}, \phi_g)$ with $\mathbb{E}(\log K_{gj}) = \log(s_j) + \log(\lambda_{gj})$ in which $j$ is the sample number, $s_j$ is the normalization factor and $\log(\lambda_{gj})$ is explained by $\log(\lambda_{gj}) = \lambda_{g0} + \beta_{g,j \textrm{ is M}} + \gamma_{g,j\textrm{ is clone 2}} + \gamma_{g,j\textrm{ is clone +}}$ (WT condition and clone number 1 are reference levels).

Estimate dispersion

The model is first encoded in a design matrix:

design.matrix <- model.matrix(~ dgeFull$samples$condition + 
                                dgeFull$samples$genotype)
design.matrix

##   (Intercept) dgeFull$samples$conditionM dgeFull$samples$genotype2
## 1           1                          0                         0
## 2           1                          0                         1
## 3           1                          0                         0
## 4           1                          1                         0
## 5           1                          1                         1
## 6           1                          1                         0
##   dgeFull$samples$genotype3
## 1                         0
## 2                         0
## 3                         1
## 4                         0
## 5                         0
## 6                         1
## attr(,"assign")
## [1] 0 1 2 2
## attr(,"contrasts")
## attr(,"contrasts")$`dgeFull$samples$condition`
## [1] "contr.treatment"
## 
## attr(,"contrasts")$`dgeFull$samples$genotype`
## [1] "contr.treatment"

Common, trended and then tagwise dispersions can be estimated with:

dgeFull <- estimateDisp(dgeFull, design.matrix)
dgeFull

## An object of class "DGEList"
## $counts
##                    Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1           306           502           468           369
## Solyc00g005080.1.1             0             0             0             0
## Solyc00g005150.1.1             0             0             0             2
## Solyc00g005840.2.1           422           394           209           440
## Solyc00g005860.1.1             0             1             0             3
##                    Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1           366           294
## Solyc00g005080.1.1             2             0
## Solyc00g005150.1.1             0             1
## Solyc00g005840.2.1           453           371
## Solyc00g005860.1.1             0             0
## 27178 more rows ...
## 
## $samples
##               group lib.size norm.factors condition genotype
## Cond.WT.Rep.1     1 16602772    0.9435083        WT        1
## Cond.WT.Rep.2     1 14374675    0.8879054        WT        2
## Cond.WT.Rep.3     1 17427146    0.6929932        WT        2
## Cond.Mt.Rep.1     1 18060148    1.1762817         M        1
## Cond.Mt.Rep.2     1 16010158    1.2551958         M        2
## Cond.Mt.Rep.3     1 13750228    1.1666371         M        3
## 
## $design
##   (Intercept) dgeFull$samples$conditionM dgeFull$samples$genotype2
## 1           1                          0                         0
## 2           1                          0                         1
## 3           1                          0                         1
## 4           1                          1                         0
## 5           1                          1                         1
## 6           1                          1                         0
##   dgeFull$samples$genotype3
## 1                         0
## 2                         0
## 3                         0
## 4                         0
## 5                         0
## 6                         1
## attr(,"assign")
## [1] 0 1 2 2
## attr(,"contrasts")
## attr(,"contrasts")$`dgeFull$samples$condition`
## [1] "contr.treatment"
## 
## attr(,"contrasts")$`dgeFull$samples$genotype`
## [1] "contr.treatment"
## 
## 
## $common.dispersion
## [1] 0.09915901
## 
## $trended.dispersion
## [1] 0.08445656 0.24312199 0.24312199 0.08432161 0.24312199
## 27178 more elements ...
## 
## $tagwise.dispersion
## [1] 0.07708625 0.24312199 0.24312199 0.08344663 0.24312199
## 27178 more elements ...
## 
## $AveLogCPM
## [1]  4.662877 -2.812956 -2.714771  4.566921 -2.622279
## 27178 more elements ...
## 
## $trend.method
## [1] "locfit"
## 
## $prior.df
## [1] 6.40573
## 
## $prior.n
## [1] 3.202865
## 
## $span
## [1] 0.2833739

The quality of the variability estimation can be assessed with the BCV versus average log CPM plot (that plots $\phi_g$ versus the average normalized count for all genes):

plotBCV(dgeFull)

Fit GLM and perform the test

The GLM is fitted with the function glmFit:

fit <- glmFit(dgeFull, design.matrix)
fit

## An object of class "DGEGLM"
## $coefficients
##                    (Intercept) dgeFull$samples$conditionM
## Solyc00g005050.2.1   -10.60593                 -0.5474377
## Solyc00g005080.1.1   -19.84956                  1.5895858
## Solyc00g005150.1.1   -18.22668                  2.0788213
## Solyc00g005840.2.1   -10.60049                 -0.0988238
## Solyc00g005860.1.1   -16.60440                  0.4872586
##                    dgeFull$samples$genotype2 dgeFull$samples$genotype3
## Solyc00g005050.2.1                 0.3578367                0.35273273
## Solyc00g005080.1.1                 2.1070540                0.00265474
## Solyc00g005150.1.1                -2.1074340               -0.34003529
## Solyc00g005840.2.1                 0.1125328               -0.14970142
## Solyc00g005860.1.1                -0.7308550               -2.38661673
## 27178 more rows ...
## 
## $fitted.values
##                    Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1  3.878720e+02  4.520405e+02  4.255504e+02  3.042097e+02
## Solyc00g005080.1.1  3.407412e-15  1.135343e-07  1.931898e-15  5.170012e-08
## Solyc00g005150.1.1  1.124398e-07  7.280992e-15  5.739921e-08  2.000000e+00
## Solyc00g005840.2.1  3.899910e+02  3.556151e+02  2.588498e+02  4.791059e+02
## Solyc00g005860.1.1  8.061300e-01  2.800752e-01  9.852187e-08  1.998478e+00
##                    Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1  4.116214e+02  3.269036e+02
## Solyc00g005080.1.1  2.000000e+00  2.871013e-08
## Solyc00g005150.1.1  1.503615e-07  1.000000e+00
## Solyc00g005840.2.1  5.072162e+02  3.114648e+02
## Solyc00g005860.1.1  8.061300e-01  2.392275e-07
## 27178 more rows ...
## 
## $deviance
## [1] 1.711039e+00 2.741461e-07 4.966424e-07 1.289240e+00 4.216594e+00
## 27178 more elements ...
## 
## $iter
## [1]  5 17 16  5 15
## 27178 more elements ...
## 
## $failed
## [1] FALSE FALSE FALSE FALSE FALSE
## 27178 more elements ...
## 
## $method
## [1] "levenberg"
## 
## $counts
##                    Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1           306           502           468           369
## Solyc00g005080.1.1             0             0             0             0
## Solyc00g005150.1.1             0             0             0             2
## Solyc00g005840.2.1           422           394           209           440
## Solyc00g005860.1.1             0             1             0             3
##                    Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1           366           294
## Solyc00g005080.1.1             2             0
## Solyc00g005150.1.1             0             1
## Solyc00g005840.2.1           453           371
## Solyc00g005860.1.1             0             0
## 27178 more rows ...
## 
## $unshrunk.coefficients
##                    (Intercept) dgeFull$samples$conditionM
## Solyc00g005050.2.1   -10.60625                -0.54760473
## Solyc00g005080.1.1   -49.87975                16.23037094
## Solyc00g005150.1.1   -32.56778                16.38934864
## Solyc00g005840.2.1   -10.60081                -0.09884859
## Solyc00g005860.1.1   -16.78244                 0.60324982
##                    dgeFull$samples$genotype2 dgeFull$samples$genotype3
## Solyc00g005050.2.1                 0.3579379                 0.3528334
## Solyc00g005080.1.1                17.5265040                -0.3073248
## Solyc00g005150.1.1               -16.3478196                -0.4122619
## Solyc00g005840.2.1                 0.1125669                -0.1497501
## Solyc00g005860.1.1                -0.8523453               -15.6573515
## 27178 more rows ...
## 
## $df.residual
## [1] 2 2 2 2 2
## 27178 more elements ...
## 
## $design
##   (Intercept) dgeFull$samples$conditionM dgeFull$samples$genotype2
## 1           1                          0                         0
## 2           1                          0                         1
## 3           1                          0                         0
## 4           1                          1                         0
## 5           1                          1                         1
## 6           1                          1                         0
##   dgeFull$samples$genotype3
## 1                         0
## 2                         0
## 3                         1
## 4                         0
## 5                         0
## 6                         1
## attr(,"assign")
## [1] 0 1 2 2
## attr(,"contrasts")
## attr(,"contrasts")$`dgeFull$samples$condition`
## [1] "contr.treatment"
## 
## attr(,"contrasts")$`dgeFull$samples$genotype`
## [1] "contr.treatment"
## 
## 
## $offset
##       [,1]     [,2]    [,3]     [,4]     [,5]     [,6]
## x 16.56693 16.36209 16.3068 16.87158 16.81603 16.59069
## attr(,"class")
## [1] "compressedMatrix"
## attr(,"repeat.row")
## [1] TRUE
## attr(,"repeat.col")
## [1] FALSE
## 
## $dispersion
## [1] 0.07708625 0.24312199 0.24312199 0.08344663 0.24312199
## 27178 more elements ...
## 
## $prior.count
## [1] 0.125
## 
## $samples
##               group lib.size norm.factors condition genotype
## Cond.WT.Rep.1     1 16602772    0.9435083        WT        1
## Cond.WT.Rep.2     1 14374675    0.8879054        WT        2
## Cond.WT.Rep.3     1 17427146    0.6929932        WT        2
## Cond.Mt.Rep.1     1 18060148    1.1762817         M        1
## Cond.Mt.Rep.2     1 16010158    1.2551958         M        2
## Cond.Mt.Rep.3     1 13750228    1.1666371         M        3
## 
## $prior.df
## [1] 6.40573
## 
## $AveLogCPM
## [1]  4.662877 -2.812956 -2.714771  4.566921 -2.622279
## 27178 more elements ...

Then tests can be performed with a log-ratio test (function glmRT). For instance, to testing differential genes between WT and M, is equivalent to testing the nullity of the second coefficient (see the design matrix) :

dgeLRTtest <- glmLRT(fit, coef = 2)
dgeLRTtest

## An object of class "DGELRT"
## $coefficients
##                    (Intercept) dgeFull$samples$conditionM
## Solyc00g005050.2.1   -10.60593                 -0.5474377
## Solyc00g005080.1.1   -19.84956                  1.5895858
## Solyc00g005150.1.1   -18.22668                  2.0788213
## Solyc00g005840.2.1   -10.60049                 -0.0988238
## Solyc00g005860.1.1   -16.60440                  0.4872586
##                    dgeFull$samples$genotype2 dgeFull$samples$genotype3
## Solyc00g005050.2.1                 0.3578367                0.35273273
## Solyc00g005080.1.1                 2.1070540                0.00265474
## Solyc00g005150.1.1                -2.1074340               -0.34003529
## Solyc00g005840.2.1                 0.1125328               -0.14970142
## Solyc00g005860.1.1                -0.7308550               -2.38661673
## 27178 more rows ...
## 
## $fitted.values
##                    Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1  3.878720e+02  4.520405e+02  4.255504e+02  3.042097e+02
## Solyc00g005080.1.1  3.407412e-15  1.135343e-07  1.931898e-15  5.170012e-08
## Solyc00g005150.1.1  1.124398e-07  7.280992e-15  5.739921e-08  2.000000e+00
## Solyc00g005840.2.1  3.899910e+02  3.556151e+02  2.588498e+02  4.791059e+02
## Solyc00g005860.1.1  8.061300e-01  2.800752e-01  9.852187e-08  1.998478e+00
##                    Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1  4.116214e+02  3.269036e+02
## Solyc00g005080.1.1  2.000000e+00  2.871013e-08
## Solyc00g005150.1.1  1.503615e-07  1.000000e+00
## Solyc00g005840.2.1  5.072162e+02  3.114648e+02
## Solyc00g005860.1.1  8.061300e-01  2.392275e-07
## 27178 more rows ...
## 
## $deviance
## [1] 1.711039e+00 2.741461e-07 4.966424e-07 1.289240e+00 4.216594e+00
## 27178 more elements ...
## 
## $iter
## [1]  5 17 16  5 15
## 27178 more elements ...
## 
## $failed
## [1] FALSE FALSE FALSE FALSE FALSE
## 27178 more elements ...
## 
## $method
## [1] "levenberg"
## 
## $unshrunk.coefficients
##                    (Intercept) dgeFull$samples$conditionM
## Solyc00g005050.2.1   -10.60625                -0.54760473
## Solyc00g005080.1.1   -49.87975                16.23037094
## Solyc00g005150.1.1   -32.56778                16.38934864
## Solyc00g005840.2.1   -10.60081                -0.09884859
## Solyc00g005860.1.1   -16.78244                 0.60324982
##                    dgeFull$samples$genotype2 dgeFull$samples$genotype3
## Solyc00g005050.2.1                 0.3579379                 0.3528334
## Solyc00g005080.1.1                17.5265040                -0.3073248
## Solyc00g005150.1.1               -16.3478196                -0.4122619
## Solyc00g005840.2.1                 0.1125669                -0.1497501
## Solyc00g005860.1.1                -0.8523453               -15.6573515
## 27178 more rows ...
## 
## $df.residual
## [1] 2 2 2 2 2
## 27178 more elements ...
## 
## $design
##   (Intercept) dgeFull$samples$conditionM dgeFull$samples$genotype2
## 1           1                          0                         0
## 2           1                          0                         1
## 3           1                          0                         0
## 4           1                          1                         0
## 5           1                          1                         1
## 6           1                          1                         0
##   dgeFull$samples$genotype3
## 1                         0
## 2                         0
## 3                         1
## 4                         0
## 5                         0
## 6                         1
## attr(,"assign")
## [1] 0 1 2 2
## attr(,"contrasts")
## attr(,"contrasts")$`dgeFull$samples$condition`
## [1] "contr.treatment"
## 
## attr(,"contrasts")$`dgeFull$samples$genotype`
## [1] "contr.treatment"
## 
## 
## $offset
##       [,1]     [,2]    [,3]     [,4]     [,5]     [,6]
## x 16.56693 16.36209 16.3068 16.87158 16.81603 16.59069
## attr(,"class")
## [1] "compressedMatrix"
## attr(,"repeat.row")
## [1] TRUE
## attr(,"repeat.col")
## [1] FALSE
## 
## $dispersion
## [1] 0.07708625 0.24312199 0.24312199 0.08344663 0.24312199
## 27178 more elements ...
## 
## $prior.count
## [1] 0.125
## 
## $samples
##               group lib.size norm.factors condition genotype
## Cond.WT.Rep.1     1 16602772    0.9435083        WT        1
## Cond.WT.Rep.2     1 14374675    0.8879054        WT        2
## Cond.WT.Rep.3     1 17427146    0.6929932        WT        2
## Cond.Mt.Rep.1     1 18060148    1.1762817         M        1
## Cond.Mt.Rep.2     1 16010158    1.2551958         M        2
## Cond.Mt.Rep.3     1 13750228    1.1666371         M        3
## 
## $prior.df
## [1] 6.40573
## 
## $AveLogCPM
## [1]  4.662877 -2.812956 -2.714771  4.566921 -2.622279
## 27178 more elements ...
## 
## $table
##                         logFC    logCPM        LR     PValue
## Solyc00g005050.2.1 -0.7897857  4.662877 5.4634125 0.01941869
## Solyc00g005080.1.1  2.2932875 -2.812956 1.7265297 0.18885469
## Solyc00g005150.1.1  2.9991052 -2.714771 2.9663231 0.08501488
## Solyc00g005840.2.1 -0.1425726  4.566921 0.1670542 0.68274322
## Solyc00g005860.1.1  0.7029656 -2.622279 0.2358925 0.62718864
## 27178 more rows ...
## 
## $comparison
## [1] "dgeFull$samples$conditionM"
## 
## $df.test
## [1] 1 1 1 1 1
## 27178 more elements ...

Testing differential genes between clone number 2 and 3 is equivalent to testing the equality of coefficients 3 and 4:

contrasts <- rep(0, ncol(design.matrix))
contrasts[3] <- 1
contrasts[4] <- -1
dgeLRTtest2 <- glmLRT(fit, contrast = contrasts)
dgeLRTtest2

## An object of class "DGELRT"
## $coefficients
##                    (Intercept) dgeFull$samples$conditionM
## Solyc00g005050.2.1   -10.60593                 -0.5474377
## Solyc00g005080.1.1   -19.84956                  1.5895858
## Solyc00g005150.1.1   -18.22668                  2.0788213
## Solyc00g005840.2.1   -10.60049                 -0.0988238
## Solyc00g005860.1.1   -16.60440                  0.4872586
##                    dgeFull$samples$genotype2 dgeFull$samples$genotype3
## Solyc00g005050.2.1                 0.3578367                0.35273273
## Solyc00g005080.1.1                 2.1070540                0.00265474
## Solyc00g005150.1.1                -2.1074340               -0.34003529
## Solyc00g005840.2.1                 0.1125328               -0.14970142
## Solyc00g005860.1.1                -0.7308550               -2.38661673
## 27178 more rows ...
## 
## $fitted.values
##                    Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1  3.878720e+02  4.520405e+02  4.255504e+02  3.042097e+02
## Solyc00g005080.1.1  3.407412e-15  1.135343e-07  1.931898e-15  5.170012e-08
## Solyc00g005150.1.1  1.124398e-07  7.280992e-15  5.739921e-08  2.000000e+00
## Solyc00g005840.2.1  3.899910e+02  3.556151e+02  2.588498e+02  4.791059e+02
## Solyc00g005860.1.1  8.061300e-01  2.800752e-01  9.852187e-08  1.998478e+00
##                    Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1  4.116214e+02  3.269036e+02
## Solyc00g005080.1.1  2.000000e+00  2.871013e-08
## Solyc00g005150.1.1  1.503615e-07  1.000000e+00
## Solyc00g005840.2.1  5.072162e+02  3.114648e+02
## Solyc00g005860.1.1  8.061300e-01  2.392275e-07
## 27178 more rows ...
## 
## $deviance
## [1] 1.711039e+00 2.741461e-07 4.966424e-07 1.289240e+00 4.216594e+00
## 27178 more elements ...
## 
## $iter
## [1]  5 17 16  5 15
## 27178 more elements ...
## 
## $failed
## [1] FALSE FALSE FALSE FALSE FALSE
## 27178 more elements ...
## 
## $method
## [1] "levenberg"
## 
## $unshrunk.coefficients
##                    (Intercept) dgeFull$samples$conditionM
## Solyc00g005050.2.1   -10.60625                -0.54760473
## Solyc00g005080.1.1   -49.87975                16.23037094
## Solyc00g005150.1.1   -32.56778                16.38934864
## Solyc00g005840.2.1   -10.60081                -0.09884859
## Solyc00g005860.1.1   -16.78244                 0.60324982
##                    dgeFull$samples$genotype2 dgeFull$samples$genotype3
## Solyc00g005050.2.1                 0.3579379                 0.3528334
## Solyc00g005080.1.1                17.5265040                -0.3073248
## Solyc00g005150.1.1               -16.3478196                -0.4122619
## Solyc00g005840.2.1                 0.1125669                -0.1497501
## Solyc00g005860.1.1                -0.8523453               -15.6573515
## 27178 more rows ...
## 
## $df.residual
## [1] 2 2 2 2 2
## 27178 more elements ...
## 
## $design
##   (Intercept) dgeFull$samples$conditionM dgeFull$samples$genotype2
## 1           1                          0                         0
## 2           1                          0                         1
## 3           1                          0                         0
## 4           1                          1                         0
## 5           1                          1                         1
## 6           1                          1                         0
##   dgeFull$samples$genotype3
## 1                         0
## 2                         0
## 3                         1
## 4                         0
## 5                         0
## 6                         1
## attr(,"assign")
## [1] 0 1 2 2
## attr(,"contrasts")
## attr(,"contrasts")$`dgeFull$samples$condition`
## [1] "contr.treatment"
## 
## attr(,"contrasts")$`dgeFull$samples$genotype`
## [1] "contr.treatment"
## 
## 
## $offset
##       [,1]     [,2]    [,3]     [,4]     [,5]     [,6]
## x 16.56693 16.36209 16.3068 16.87158 16.81603 16.59069
## attr(,"class")
## [1] "compressedMatrix"
## attr(,"repeat.row")
## [1] TRUE
## attr(,"repeat.col")
## [1] FALSE
## 
## $dispersion
## [1] 0.07708625 0.24312199 0.24312199 0.08344663 0.24312199
## 27178 more elements ...
## 
## $prior.count
## [1] 0.125
## 
## $samples
##               group lib.size norm.factors condition genotype
## Cond.WT.Rep.1     1 16602772    0.9435083        WT        1
## Cond.WT.Rep.2     1 14374675    0.8879054        WT        2
## Cond.WT.Rep.3     1 17427146    0.6929932        WT        2
## Cond.Mt.Rep.1     1 18060148    1.1762817         M        1
## Cond.Mt.Rep.2     1 16010158    1.2551958         M        2
## Cond.Mt.Rep.3     1 13750228    1.1666371         M        3
## 
## $prior.df
## [1] 6.40573
## 
## $AveLogCPM
## [1]  4.662877 -2.812956 -2.714771  4.566921 -2.622279
## 27178 more elements ...
## 
## $table
##                           logFC    logCPM           LR    PValue
## Solyc00g005050.2.1  0.007363416  4.662877 0.0003273673 0.9855644
## Solyc00g005080.1.1  3.036006369 -2.812956 2.0358241034 0.1536309
## Solyc00g005150.1.1 -2.549817295 -2.714771 1.4898460619 0.2222403
## Solyc00g005840.2.1  0.378324000  4.566921 0.7836039489 0.3760412
## Solyc00g005860.1.1  2.388759289 -2.622279 1.2301155435 0.2673846
## 27178 more rows ...
## 
## $comparison
## [1] "1*dgeFull$samples$genotype2 -1*dgeFull$samples$genotype3"
## 
## $df.test
## [1] 1 1 1 1 1
## 27178 more elements ...

Finally, DEGs can be extracted as previously, using the function topTags:

resLRT <- topTags(dgeLRTtest, n = nrow(dgeFull$counts))
head(resLRT$table)

##                        logFC   logCPM       LR       PValue          FDR
## Solyc07g053460.2.1 10.216353 8.535374 345.4052 4.243561e-77 1.153527e-72
## Solyc03g098680.2.1 10.001557 7.704126 335.2817 6.799065e-75 9.240949e-71
## Solyc06g064480.2.1 11.427146 6.803229 324.1399 1.816171e-72 1.645632e-68
## Solyc01g009590.2.1 10.158952 8.694402 305.7543 1.837166e-68 1.248492e-64
## Solyc11g028040.1.1 11.810303 5.864389 303.3936 6.003952e-68 3.264109e-64
## Solyc11g018570.1.1  9.869822 6.583870 296.6953 1.728899e-66 7.832776e-63

selectedLRT <- resLRT$table$FDR < 0.05 & abs(resLRT$table$logFC) > 1
selectedLRT <- resLRT$table[selectedLRT, ]
nrow(selectedLRT)

## [1] 6890

head(selectedLRT)

##                        logFC   logCPM       LR       PValue          FDR
## Solyc07g053460.2.1 10.216353 8.535374 345.4052 4.243561e-77 1.153527e-72
## Solyc03g098680.2.1 10.001557 7.704126 335.2817 6.799065e-75 9.240949e-71
## Solyc06g064480.2.1 11.427146 6.803229 324.1399 1.816171e-72 1.645632e-68
## Solyc01g009590.2.1 10.158952 8.694402 305.7543 1.837166e-68 1.248492e-64
## Solyc11g028040.1.1 11.810303 5.864389 303.3936 6.003952e-68 3.264109e-64
## Solyc11g018570.1.1  9.869822 6.583870 296.6953 1.728899e-66 7.832776e-63

Comparison

A Venn diagram comparing the two approaches is provided below:

vd <- venn.diagram(x = list("Exact test" = rownames(selectedET),
                            "GLM" = rownames(selectedLRT)),
                   fill = brewer.pal(3, "Set2")[1:2], filename = NULL)
grid.draw(vd)

More complex models and a more detailed comparison between the different approches for differential analysis is provided in this document (the same that was previously pointed for normalization comparison) and this document (with the analysis of an interaction effect).

Filtering

Differential analysis after independent filtering

Independant filtering can be performed with the package HTSFilter after the dispersion has been estimated:

dgeFilt <- HTSFilter(dgeFull)$filteredData

dgeFilt

## An object of class "DGEList"
## $counts
##                    Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1           306           502           468           369
## Solyc00g005840.2.1           422           394           209           440
## Solyc00g006470.1.1            18            45            67            15
## Solyc00g006490.2.1           533           345           435           618
## Solyc00g006690.2.1            13            35           101            13
##                    Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1           366           294
## Solyc00g005840.2.1           453           371
## Solyc00g006470.1.1            26            16
## Solyc00g006490.2.1           616           527
## Solyc00g006690.2.1            24            24
## 23699 more rows ...
## 
## $samples
##               group lib.size norm.factors condition genotype
## Cond.WT.Rep.1     1 16602772    0.9435083        WT        1
## Cond.WT.Rep.2     1 14374675    0.8879054        WT        2
## Cond.WT.Rep.3     1 17427146    0.6929932        WT        2
## Cond.Mt.Rep.1     1 18060148    1.1762817         M        1
## Cond.Mt.Rep.2     1 16010158    1.2551958         M        2
## Cond.Mt.Rep.3     1 13750228    1.1666371         M        3
## 
## $design
##   (Intercept) dgeFull$samples$conditionM dgeFull$samples$genotype2
## 1           1                          0                         0
## 2           1                          0                         1
## 3           1                          0                         1
## 4           1                          1                         0
## 5           1                          1                         1
## 6           1                          1                         0
##   dgeFull$samples$genotype3
## 1                         0
## 2                         0
## 3                         0
## 4                         0
## 5                         0
## 6                         1
## attr(,"assign")
## [1] 0 1 2 2
## attr(,"contrasts")
## attr(,"contrasts")$`dgeFull$samples$condition`
## [1] "contr.treatment"
## 
## attr(,"contrasts")$`dgeFull$samples$genotype`
## [1] "contr.treatment"
## 
## 
## $common.dispersion
## [1] 0.09915901
## 
## $trended.dispersion
## [1] 0.08445656 0.08432161 0.13000266 0.08502943 0.12649632
## 23699 more elements ...
## 
## $tagwise.dispersion
## [1] 0.07708625 0.08344663 0.11902698 0.06969235 0.20709806
## 23699 more elements ...
## 
## $AveLogCPM
## [1] 4.662877 4.566921 1.174171 4.987809 1.365946
## 23699 more elements ...
## 
## $trend.method
## [1] "locfit"
## 
## $prior.df
## [1] 6.40573
## 
## $prior.n
## [1] 3.202865
## 
## $span
## [1] 0.2833739

Then, the differential analysis (GLM approach) is performed:

fit <- glmFit(dgeFilt, design.matrix)
dgeLRTfilt <- glmLRT(fit, coef = 2)
resLRTfilt <- topTags(dgeLRTfilt, n = nrow(dgeFilt$counts))
selectedFilt <- resLRTfilt$table$FDR < 0.05 & abs(resLRTfilt$table$logFC) > 1
selectedFilt <- resLRTfilt$table[selectedFilt, ]
nrow(selectedFilt)

## [1] 6738

head(selectedFilt)

##                        logFC   logCPM       LR       PValue          FDR
## Solyc07g053460.2.1 10.216353 8.535374 345.4052 4.243561e-77 1.005894e-72
## Solyc03g098680.2.1 10.001557 7.704126 335.2817 6.799065e-75 8.058252e-71
## Solyc06g064480.2.1 11.427146 6.803229 324.1399 1.816171e-72 1.435017e-68
## Solyc01g009590.2.1 10.158952 8.694402 305.7543 1.837166e-68 1.088705e-64
## Solyc11g028040.1.1 11.810303 5.864389 303.3936 6.003952e-68 2.846354e-64
## Solyc11g018570.1.1  9.869822 6.583870 296.6953 1.728899e-66 6.830303e-63

Comparison

A Venn diagram comparing the two approaches is provided below:

vd <- venn.diagram(x = list("No filtering" = rownames(selectedLRT),
                            "Filtering" = rownames(selectedFilt)),
                   fill = brewer.pal(3, "Set2")[1:2], filename = NULL)
grid.draw(vd)

Exploratory analysis of DEGs

Create a MA plot with differentially expressed genes

To create a MA plot between M and WT, the entry $samples$group of the DGEList object must be filled with the indication of what the two groups are.

dgeFilt$samples$group <- dgeFilt$samples$condition
plotSmear(dgeFilt, de.tags = rownames(selectedFilt))

Volcano plot

volcanoData <- cbind(resLRTfilt$table$logFC, -log10(resLRTfilt$table$FDR))
colnames(volcanoData) <- c("logFC", "negLogPval")
DEGs <- resLRTfilt$table$FDR < 0.05 & abs(resLRTfilt$table$logFC) > 1
point.col <- ifelse(DEGs, "red", "black")
plot(volcanoData, pch = 16, col = point.col, cex = 0.5)

Heatmap

selY <- cpm(dgeFilt, log = TRUE, prior.count = 1)
selY <- selY[match(rownames(selectedFilt), rownames(dgeFilt$counts)), ]
finalHM <- cim(t(selY), color = cimColor, symkey = FALSE, row.cex = 0.7,
               col.cex = 0.7)

If you are interested in the result of the gene clustering, the result of HAC is saved into $ddc:

plot(finalHM$ddc, leaflab="none")
abline(h=10, lwd=2, col="pink")

Using this dendrogram, we might want to cut the tree at level $h=10$ (for instance). This can be performed using the function cutree, which will provide a cluster membership for each gene.

geneClust <- cutree(as.hclust(finalHM$ddc), h=10)
head(geneClust)

## Solyc07g053460.2.1 Solyc03g098680.2.1 Solyc06g064480.2.1 
##                  1                  2                  2 
## Solyc01g009590.2.1 Solyc11g028040.1.1 Solyc11g018570.1.1 
##                  1                  2                  2

For instance, the number of clusters is equal to

length(unique(geneClust))

## [1] 16

and the genes in cluster 1 are:

names(which(geneClust == 1))

## [1] "Solyc07g053460.2.1" "Solyc01g009590.2.1" "Solyc02g032910.1.1"
## [4] "Solyc02g078370.1.1" "Solyc05g050700.1.1"

Session information

sessionInfo()

## R version 3.4.0 (2017-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.2 LTS
## 
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=fr_FR.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] HTSFilter_1.16.0    VennDiagram_1.6.17  futile.logger_1.4.3
##  [4] mixOmics_6.1.2      ggplot2_2.2.1       lattice_0.20-35    
##  [7] MASS_7.3-47         RColorBrewer_1.1-2  edgeR_3.18.0       
## [10] limma_3.32.2       
## 
## loaded via a namespace (and not attached):
##  [1] Biobase_2.36.2             tidyr_0.6.1               
##  [3] jsonlite_1.4               splines_3.4.0             
##  [5] ellipse_0.3-8              Formula_1.2-1             
##  [7] shiny_1.0.3                assertthat_0.2.0          
##  [9] stats4_3.4.0               latticeExtra_0.6-28       
## [11] GenomeInfoDbData_0.99.0    yaml_2.1.14               
## [13] RSQLite_1.1-2              backports_1.0.5           
## [15] digest_0.6.12              checkmate_1.8.2           
## [17] GenomicRanges_1.28.2       XVector_0.16.0            
## [19] colorspace_1.3-2           htmltools_0.3.6           
## [21] httpuv_1.3.3               Matrix_1.2-10             
## [23] plyr_1.8.4                 DESeq2_1.16.1             
## [25] XML_3.98-1.7               genefilter_1.58.1         
## [27] zlibbioc_1.22.0            xtable_1.8-2              
## [29] corpcor_1.6.9              scales_0.4.1              
## [31] BiocParallel_1.10.1        htmlTable_1.9             
## [33] tibble_1.3.0               annotate_1.54.0           
## [35] IRanges_2.10.1             SummarizedExperiment_1.6.1
## [37] nnet_7.3-12                BiocGenerics_0.22.0       
## [39] lazyeval_0.2.0             survival_2.41-3           
## [41] magrittr_1.5               mime_0.5                  
## [43] memoise_1.1.0              evaluate_0.10             
## [45] foreign_0.8-67             data.table_1.10.4         
## [47] tools_3.4.0                matrixStats_0.52.2        
## [49] stringr_1.2.0              S4Vectors_0.14.1          
## [51] munsell_0.4.3              locfit_1.5-9.1            
## [53] cluster_2.0.6              DelayedArray_0.2.2        
## [55] AnnotationDbi_1.38.0       lambda.r_1.1.9            
## [57] compiler_3.4.0             DESeq_1.28.0              
## [59] GenomeInfoDb_1.12.0        RCurl_1.95-4.8            
## [61] htmlwidgets_0.8            igraph_1.0.1              
## [63] base64enc_0.1-3            bitops_1.0-6              
## [65] rmarkdown_1.5              gtable_0.2.0              
## [67] DBI_0.6-1                  reshape2_1.4.2            
## [69] R6_2.2.0                   gridExtra_2.2.1           
## [71] knitr_1.15.1               dplyr_0.5.0               
## [73] Hmisc_4.0-3                rprojroot_1.2             
## [75] futile.options_1.0.0       stringi_1.1.5             
## [77] parallel_3.4.0             Rcpp_0.12.10              
## [79] rpart_4.1-11               acepack_1.4.1             
## [81] geneplotter_1.54.0         rgl_0.98.1

Practical statistical analysis of RNA-Seq data - edgeR - tomato data

Annick Moisan, Ignacio Gonzales, Nathalie Villa-Vialaneix

March, 10th 2015; updated September 6th 2015 and May 4th 2017

Where to start? (installation and alike)

Preparing datasets

Description of the files (count table and design)

Importing the files

Creating a `DGEList` object

Data exploration and quality assessment

Histogram for pseudo-counts (sample `Cond.WT.Rep.1`)

Boxplot for pseudo-counts

MA-plots between first two WT samples (using `limma` package)

MDS for pseudo-counts (using `limma` package)

Heatmap for pseudo-counts (using `mixOmics` package)

Normalization

Compute normalization factors

Normalized counts exploratory analysis

Differential analysis

First approach: exact test between the two groups

Using the argument `group` in `DGEList`

Estimate dispersion

Perform the test

Second approach: GLM with condition and genotype effects

Estimate dispersion

Fit GLM and perform the test

Comparison

Filtering

Differential analysis after independent filtering

Comparison

Exploratory analysis of DEGs

Create a MA plot with differentially expressed genes

Volcano plot

Heatmap

Session information

Practical statistical analysis of RNA-Seq data - edgeR - tomato data

Annick Moisan, Ignacio Gonzales, Nathalie Villa-Vialaneix

March, 10th 2015; updated September 6th 2015 and May 4th 2017

Where to start? (installation and alike)

Preparing datasets

Description of the files (count table and design)

Importing the files

Creating a DGEList object

Data exploration and quality assessment

Histogram for pseudo-counts (sample Cond.WT.Rep.1)

Boxplot for pseudo-counts

MA-plots between first two WT samples (using limma package)

MDS for pseudo-counts (using limma package)

Heatmap for pseudo-counts (using mixOmics package)

Normalization

Compute normalization factors

Normalized counts exploratory analysis

Differential analysis

First approach: exact test between the two groups

Using the argument group in DGEList

Estimate dispersion

Perform the test

Second approach: GLM with condition and genotype effects

Estimate dispersion

Fit GLM and perform the test

Comparison

Filtering

Differential analysis after independent filtering

Comparison

Exploratory analysis of DEGs

Create a MA plot with differentially expressed genes

Volcano plot

Heatmap

Session information

Creating a `DGEList` object

Histogram for pseudo-counts (sample `Cond.WT.Rep.1`)

MA-plots between first two WT samples (using `limma` package)

MDS for pseudo-counts (using `limma` package)

Heatmap for pseudo-counts (using `mixOmics` package)

Using the argument `group` in `DGEList`