```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```


```{r loadLib, message=FALSE, warning=FALSE}
library("lubridate")
library("RColorBrewer")

library("car")
library("scales")

library("FactoMineR")
library("factoextra")
```


## Introduction

The current dataset comes from kaggle 
https://www.kaggle.com/ralle360/historic-tour-de-france-dataset and provides 
information about the Tour de France race, including 8 different variables 
corresponding to 2,236 stages of the race. The data have been extracted from
Wikipedia and prepared by RasmusFiskerBang and is made available through a
CSV file that you can download on my website (the dataset is licensed under the
CC0 license - Public Domain).


## Data importation

Even though CSV files can be opened with Excel, we strongly discourage this
practice and we will use **R** directly for this task:

```{r importation}
tdf_data <- read.table("stages_TDF.csv", sep = ",", header = TRUE,
                       stringsAsFactors = FALSE, encoding = "UTF-8",
                       quote = "\"")
head(tdf_data)
```
where:

* `sep` is used to provide the character separating columns;

* `header = TRUE` indicates that column names are included in the file (in the
first row);

* `stringsAsFactor = FALSE` indicates that strings must not be converted to
type `factor` (this is the default behavior since **R** 4.0.0);

* `quote = "\""` indicates which character(s) has(have) to be considered as 
quotation character and not part of the data.

Other information and options are available in the help page `?read.table` or
in `?read.csv`, `?read.csv2`, `?read.delim`, `?read.delim2`.

We can take a first look at the data with:
```{r head}
head(tdf_data)
summary(tdf_data)
```
When missing data are present, this last command shows that many rows contain 
missing values (identified with `NA`) in each column. The dataset dimension is 
obtained with:
```{r dim}
dim(tdf_data)
```

Information on column types can be obtained with:
```{r classes}
sapply(tdf_data, class)
```
which indicates that all columns are character except for the third one, which
is numeric. Sometimes, numeric variables (and more specifically integers) are 
used to code for categorical variables but this is not the case in this dataset. 

To be allowed to perform properly the subsequent analysis, it is best to 
precisely define the different types of the columns:

* Stage: is a character variable but that correspond to the number (or the 
symbol) of the stage each year. It is better recoded as a `factor`, which is the
proper type in R for categorical variables with a finite number of possible
values

```{r convertStageToFactor}
tdf_data$Stage <- factor(tdf_data$Stage)
```

* Date: is a date and there is a special type in R for dates, which is better
used here (for instance, to easily extract the year or the month of the date)
as a numeric variable

```{r convertDateYear}
tdf_data$Date <- as.Date(tdf_data$Date, format = c("%Y-%m-%d"))
tdf_data$Year <- year(tdf_data$Date)
```

* Distance: is indeed a numeric variable;

* Origin and Destination: are categorical variables but the number of possible 
values (town of origin and destination of the stage) is so large that there is 
only a minor benefit in converting it to a factor

* Type: is a categorical variable that is better converted to a factor

```{r convertTypeToFactor}
tdf_data$Type <- factor(tdf_data$Type)
```

* Winner: is the winner name so the number of possible values is so large that
there is only a minor benefit in converting it to a factor

* Winner_Country: is the country code of the winner so is better converted to
a factor:

```{r convertCountryToFactor}
tdf_data$Winner_Country <- factor(tdf_data$Winner_Country)
```

After these stages, the summary of the dataset is already more informative:
```{r summaryPostCleaning}
summary(tdf_data)
```


## Univariate statistics

### Numerical characteristics

The variable `Distance` is the distance of the stage:
```{r numChar}
mean(tdf_data$Distance) # mean
median(tdf_data$Distance) # median
min(tdf_data$Distance) # minimum
max(tdf_data$Distance) # maximum
# quartiles and min/max
quantile(tdf_data$Distance, probs = c(0, 0.25, 0.5, 0.75, 1))
```
The option `na.rm = TRUE` must be used when you have missing values that you 
don't want to be taken into account for the computation (otherwise, most of 
these functions would return the value `NA`).

*Exercise*: How to interpret these values? More precisely, what do they say 
about the variable distribution?

Some of these values are also available with
```{r numSum}
summary(tdf_data$Distance)
```

Dispersion characteristics are obtained with:
```{r numVaribility}
var(tdf_data$Distance) # variance
sd(tdf_data$Distance) # standard deviation
range(tdf_data$Distance) # range
diff(range(tdf_data$Distance))
```

*Exercise*: How would you compute the inter-quartile range (in just one line of
code)?

*Exercise*: What is the coefficient of variation (CV) for this variable? 

Standard modifications of data include:

* *binarization*: 
```{r cuts}
tdf_data$cut1 <- cut(tdf_data$Distance, breaks = 5)
table(tdf_data$cut1)
tdf_data$cut2 <- cut(tdf_data$Distance, breaks = 5, labels = FALSE)
table(tdf_data$cut2)
tdf_data$cut3 <- cut(tdf_data$Distance, breaks = seq(0, 500, by = 100))
table(tdf_data$cut3)
```

*Exercise*: What is the mode of `cut3`?

* *centering and scaling*
```{r scale}
tdf_data$DScaled <- as.vector(scale(tdf_data$Distance))
summary(tdf_data$DScaled)
var(tdf_data$DScaled)
```


### Charts

Before we start, a short note on color palettes:
```{r palettes}
display.brewer.all()
```

* For a **factor** (`cut2`):
```{r factorCharts}
pie(table(tdf_data$Type), col = c(brewer.pal(9, "Set1"), brewer.pal(9, "Set2"))) # 18 niveaux > 18 couleurs
```

```{r factorCharts2}
barplot(table(tdf_data$Winner_Country), col = "darkgreen", las = 3,
        cex.names = 0.6, cex.axis=0.6)
```


* For *numeric variables*
```{r numCharts}
hist(tdf_data$Distance)

hist(tdf_data$Distance, main = "Distribution of stage distances", 
     xlab = "distance (km)", breaks = 20)

plot(density(tdf_data$Distance))

hist(tdf_data$Distance, main = "Distribution of stage distances", 
     xlab = "distance (km)", breaks = 20, freq = FALSE)
lines(density(tdf_data$Distance), col = "red", lwd = 2)

boxplot(tdf_data$Distance)

boxplot(tdf_data$Distance, horizontal = TRUE, notch = TRUE)
```


## Bivariate statistics

### Factor versus factor

Contingency tables are obtained with the function `table` (here on a subset of
the winner countries and of the stage type).

```{r contingencyTable}
selected_countries <- names(which(table(tdf_data$Winner_Country) > 10)) 
selected_stages <- tdf_data$Winner_Country %in% selected_countries
cont_table <- table(tdf_data$Type[selected_stages], 
                    tdf_data$Winner_Country[selected_stages, drop = TRUE])
cont_table

t(cont_table) / rowSums(cont_table)
```

*Exercise*: How to interpret the numbers in the last table (we call them "row 
profiles")? For instance, what does 0.4545454545, in the column of France,
mean? Compute the column profiles.

```{r cramerV}
chisq <- chisq.test(cont_table)$statistic
ddl <- min(nrow(cont_table) - 1, ncol(cont_table) - 1)
cramerV <- sqrt(chisq / (sum(cont_table) * ddl))
cramerV
```

Barplots are obtained using the contingency table as well:
```{r barplot2Vars}
selected_stages <- selected_stages & 
  (tdf_data$Type %in% c("Individual time trial", "Plain stage", "Stage with mountain(s)"))
cont_table <- table(tdf_data$Type[selected_stages, drop = TRUE], 
                    tdf_data$Winner_Country[selected_stages, drop = TRUE])
barplot(cont_table, legend.text = TRUE, xlab = "country", ylab = "frequency",
        cex.names = 0.6)
barplot(cont_table, legend.text = TRUE, xlab = "country", ylab = "frequency",
        beside = TRUE, col = brewer.pal(3, "Set3"), cex.names = 0.6)
```

### Numeric versus numeric

Covariances can be obtained using the function `cov`:
```{r covariances}
cov(tdf_data$Distance, tdf_data$Year)
cov(tdf_data$Distance, tdf_data$Year, method = "spearman")
cov(tdf_data[ ,c("Distance", "Year", "DScaled")])

cor(tdf_data$Distance, tdf_data$Year)
cor(tdf_data$Distance, tdf_data$Year, method = "spearman")
cor(tdf_data[ ,c("Distance", "Year", "DScaled")])
```

and correlations are obtained with the function `cor` that takes the same 
arguments than `cov`.

Partial correlation between `Distance` and `Year` given `DScaled` is obtained 
with:
```{r partialCor}
cor(lm(tdf_data$Distance ~ tdf_data$DScaled)$residuals,
    lm(tdf_data$Year ~ tdf_data$DScaled)$residuals)

```

*Exercise*: Is it an expected result? Check at the residuals of the first model:
what can you say?

```{r summaryResiduals}
summary(lm(tdf_data$Distance ~ tdf_data$DScaled)$residuals)
```

Dot plots (scatterplots) between two variables are obtained with:
```{r dotplots}
plot(tdf_data$Year, tdf_data$Distance, pch = 19)
```
or with scatterplot matrices:
```{r scatterplotMatrix}
scatterplotMatrix(tdf_data[, c("Distance", "Year", "DScaled")])
scatterplotMatrix(tdf_data[, c("Distance", "Year", "DScaled")],
                  col = "black", regLine = FALSE, smooth = FALSE, pch = "+")
```

*Exercise*: Can you comment on the specific plot between `Distance` and 
`DScaled`?


### Numeric versus factor

Within and between variance of `Distance` with respect to `Winner_Country` are 
obtained from the function `anova` (in the column `Mean Sq`, within group 
variance corresponds to the row `residuals`):
```{r BetWithVars}
selected_countries <- names(which(table(tdf_data$Winner_Country) > 10))
selected_stages <- tdf_data$Winner_Country %in% selected_countries
anova(lm(tdf_data$Distance[selected_stages] ~ 
           factor(tdf_data$Winner_Country[selected_stages, drop = TRUE])))
```
and the correlation ratio is the square root of the column `F value`:
```{r corrRatio}
cor_ratio <- anova(lm(tdf_data$Distance[selected_stages] ~ 
           factor(tdf_data$Winner_Country[selected_stages, drop = TRUE])))
sqrt(cor_ratio[1, "Mean Sq"]/(cor_ratio[1, "Mean Sq"] + cor_ratio[2, "Mean Sq"]))
```

Parallel boxplots are obtained using the same syntax with the `~`:
```{r parallelPlots}
boxplot(tdf_data$Distance[selected_stages] ~ 
          tdf_data$Winner_Country[selected_stages, drop = TRUE], las = 3,
        xlab = "winner country", ylab = "distance")
```

To obtain multiple histograms on the same plot, you can use:
```{r multipleHistogramsB, echo=FALSE, fig.width=15, fig.height=15}
par(mfrow = c(5, 4))
for (wc in selected_countries) {
  tmp <- tdf_data[tdf_data$Winner_Country == wc, c("Distance")]
  hist(tmp, main = paste("Histogram of Distance for ", wc), xlab = "distance")
}
```


## Tests

### With one variable

To test if `Distance` average is equal to 200.

* with a **parametric test**, we need to first test if the variable is 
distributed as a Gaussian variable. Several tests exist to do that but we will
use the Shapiro-Wilk's test:
```{r shapiroTest}
shapiro.test(tdf_data$Distance)
```

Despite significant deviation to normality, we will perform a Student test (for
the sake of the example):
```{r oneSTTest}
res <- t.test(tdf_data$Distance, mu = 200, conf.level = 0.99)
res
res$statistic
res$p.value
res$conf.int
```

* with a *non-parametric* Wilcoxon test (that tests the median instead of the
mean):
```{r oneWilcoxTest}
res <- wilcox.test(tdf_data$Distance, mu = 200, conf.int = TRUE)
res
res$statistic
res$p.value
res$conf.int
```


### With two factors

For small contingency tables, the independence between rows and columns can be
tested with a Fisher's exact test (to be preferred) or a $\chi^2$ test (only 
when Fisher's exact test is computationally too heavy to run or when the sample
size is sufficiently large). 

```{r contTable}
selected_countries <- names(which(table(tdf_data$Winner_Country) > 10))
selected_stages <- tdf_data$Winner_Country %in% selected_countries
selected_stages <- selected_stages & 
  (tdf_data$Type %in% c("Individual time trial", "Plain stage", "Stage with mountain(s)"))
tdf_small <- data.frame(tdf_data$Winner_Country[selected_stages, drop = TRUE],
                        tdf_data$Type[selected_stages, drop = TRUE],
                        tdf_data$Distance[selected_stages])
names(tdf_small) <- c("Winner_Country", "Type", "Distance")
cont_table <- table(tdf_small$Winner_Country, tdf_small$Type)
cont_table
```

```{fisherTest, eval=FALSE}
res <- fisher.test(cont_table)

# Error in fisher.test(cont_table) : 
#   FEXACT error 7(location). LDSTP=18330 is too small for this problem,
#   (pastp=11.752, ipn_0:=ipoin[itp=563]=3932, stp[ipn_0]=10.2479).
# Increase workspace or consider using 'simulate.p.value=TRUE'
```

In addition to being more suited to large contingency tables, $\chi^2$ test also
provide interesting statistics for interpretation of the results:

```{r chiSQ}
res <- chisq.test(cont_table)
res
res$observed
res$expected
res$residuals^2
```

```{r barplot0}
barplot(t(cont_table), legend.text = TRUE, xlab = "country", ylab = "frequency",
        cex.names = 0.6, col = brewer.pal(3, "Set3"))
```


*Exercise*: `Titanic[ , Sex = "Male", Age = "Adult", ]` is the contingency
table of Male Adults in Titanics for the variables `Class` and `Survival`. Are
these two variables independant? Which classes deviate the most from the 
independence? Same questions for Women.


### With two numeric variables

Correlation tests are performed with:

```{r corrTests}
cor.test(tdf_data$Distance, tdf_data$Year)
cor.test(tdf_data$Distance, tdf_data$Year, method = "spearman")
plot(tdf_data$Distance, tdf_data$Year)
```

*Exercise*: The object `Soils` contain characteristics on soil samples. 
Variables 6-13 are numerical characteristics. Make a scatterplot matrix of these
variables and pick two variables that you think are linearly correlated. Test 
your hypothesis.


### With a numeric variable explained by a factor ($K=2$, independent)

Comparing the means of the given numeric variable `Distance` for the different
countries of origin of the winner `Winner_Country`, for the countries DEN and 
FRA:
```{r mean2Sample}
selected_countries <- c("DEN", "FRA")
selected_stages <- tdf_data$Winner_Country %in% selected_countries
tdf_small <- data.frame(tdf_data$Winner_Country[selected_stages, drop = TRUE],
                        tdf_data$Distance[selected_stages])
names(tdf_small) <- c("Winner_Country", "Distance")

t.test(tdf_small$Distance ~ tdf_small$Winner_Country)
t.test(tdf_small$Distance ~ tdf_small$Winner_Country, var.equal = TRUE)
```

How to know if two variances are equal? Bartlett test or Fisher's test...
```{r bartlett}
bartlett.test(tdf_small$Distance ~ tdf_small$Winner_Country)
var.test(tdf_small$Distance ~ tdf_small$Winner_Country)
```

Comparing the medians of the given numeric variable `Distance` between the two
winner countries, FRA and DEN:
```{r median2Sample}
wilcox.test(tdf_small$Distance ~ tdf_small$Winner_Country)
```

*Exercise*: In the dataset `Soils`, does the pH significantly differ between 
samples on depressions and on slopes (variable `Contour`)? Visually confirm with
the appropriate plot.


### With a numeric variable explained by a factor ($K > 2$, independent)

ANOVA is based on the previous fitting of a linear model. For instance, if we
want to check if the means `Distance` are different between `Type` (with a 
restricted dataset), we have to run (under normality and equal variance
assumptions):
```{r ANOVA}
selected_stages <- (tdf_data$Type %in% 
                      c("Individual time trial", "Plain stage",
                        "Stage with mountain(s)"))
tdf_small <- data.frame(tdf_data$Type[selected_stages, drop = TRUE],
                        tdf_data$Distance[selected_stages])
names(tdf_small) <- c("Type", "Distance")
anova(lm(tdf_small$Distance ~ tdf_small$Type))
```

Kruskal-Wallis nonparametric test is performed with:
```{r KW}
kruskal.test(tdf_small$Distance ~ tdf_small$Type)
```

*Exercise*: In the dataset `Soils`, does the pH significantly differ between 
the different contours? Visually confirm with the appropriate plot.


### With a numeric variables in two paired samples (two groups with the same individuals)

The `sleep` data show the effect (`extra`) of two soporific drugs (`group`) on 
10 patients (`ID`):
```{r sleep}
head(sleep)
tail(sleep)
dim(sleep)
sleep2 <- reshape(sleep, direction = "wide", idvar = "ID", timevar = "group")
dim(sleep2)
head(sleep2)
```
Paired comparison tests are performed with (under normality assumption):
```{r pairedTTest}
with(sleep2, t.test(extra.1, extra.2, paired = TRUE))
```

and with a nonparametric test:
```{r pairedWTest}
with(sleep2, wilcox.test(extra.1, extra.2, paired = TRUE))
```

If there is a large number of ties, this $p$-value should not be used but it can
be trusted if the number of ties is small.
```{r sleepBoxplot}
boxplot(extra ~ group, data = sleep)
```

We can check that these tests are equivalent to tests on the difference:
```{r sleepWilcox}
sleep2$diff <- sleep2$extra.2 - sleep2$extra.1
head(sleep2)
t.test(x = sleep2$diff)
wilcox.test(x = sleep2$diff)
```


## Linear regression and models

### A numeric variable explained by one or several numeric variables

```{r linearReg}
res <- lm(tdf_data$Distance ~ tdf_data$Year)
res
summary(res)
coefficients(res)
confint(res)
plot(res)
# http://www.sthda.com/english/articles/39-regression-model-diagnostics/161-linear-regression-assumptions-and-diagnostics-in-r-essentials/
# For details on validations plots
```

```{r linearModels}
tdf_small <- tdf_data[, c("Distance", "Year")]
head(tdf_small)
res <- lm(Distance ~ ., data = tdf_small)
summary(res)
plot(res)
```


### A numeric variable explained by one (or several) factor(s)

```{r ANOVALM}
selected_stages <- (tdf_data$Type %in% 
                      c("Individual time trial", "Plain stage",
                        "Stage with mountain(s)"))
tdf_small <- data.frame(tdf_data$Type[selected_stages, drop = TRUE],
                        tdf_data$Distance[selected_stages])
names(tdf_small) <- c("Type", "Distance")
head(tdf_small)
res <- lm(Distance ~ Type, data = tdf_small)
summary(res)
anova(res)
```

*Exercise*: Explain the pH of `Soils` by an additive effect between contour and
`Ca` and with a additive effect with interaction between contour and Depth.


### A binary variable is explained by the linear relation between predictor(s)

```{r logit}
res <- glm(Type ~ Distance, data = tdf_small, family = binomial(link = "logit"))
summary(res)
```

## Multiple testing correction

If we test the differences between the levels of Contour of the mean for all 
numeric variables in `Soils`, we obtain 9 p-values on which multiple testing 
can be performed:
```{r multTest}
all_pvals <- apply(Soils[ ,6:14], 2, function(avar) 
  anova(lm(avar ~ Soils$Group))[1, "Pr(>F)"])
all_pvals
p.adjust(all_pvals, method = "BH")
p.adjust(all_pvals, method = "bonferroni")
```


## PCA

```{r loadUSA}
data("USArrests")
summary(USArrests)
```

```{r PCA}
pca_usa <- PCA(USArrests, graph = FALSE)
fviz_eig(pca_usa)
```

```{r PCAplotVar}
fviz_pca_var(pca_usa, choice = "var")
```


```{r PCAPlotInd}
fviz_pca_ind(pca_usa, choice = "ind")
```

```{r PCAPlotInd2}
fviz_pca_ind(pca_usa, choice = "ind", col.ind = USArrests$UrbanPop)
```

*Exercise*: Perform PCA of the dataset `athle_records.csv` after 
log-transformation of the data.


## Clustering

```{r hc}
scaled_usa <- scale(USArrests)
dist_usa <- dist(scaled_usa)^2
hc_usa <- hclust(dist_usa, method = "ward.D")
plot(hc_usa)
```

```{r withinss}
within_ss <- cumsum(hc_usa$height)
plot(49:1, within_ss, type = "b")
1 - within_ss[46] / within_ss[49] # reproduced inertia
```

```{r cutdendro}
plot(hc_usa)
rect.hclust(hc_usa, k = 4)
```

```{r clusteringhc}
clust_hc <- cutree(hc_usa, k = 4)
clust_hc
fviz_pca_ind(pca_usa, col.ind = factor(clust_hc))
```

```{r kmeans}
init_center <- by(scaled_usa, clust_hc, colMeans)
init_center <- Reduce("rbind", init_center)
clust_kmeans <- kmeans(scaled_usa, centers = init_center)
clust_kmeans
```

```{r clusteringkmeans}
fviz_pca_ind(pca_usa, col.ind = factor(clust_kmeans$cluster))
```

*Exercise*: Perform clustering of the dataset `athle_records.csv` after 
log-transformation and scaling of the data.


## Session information

This file has been compiled with the current system:
```{r sessionInfo}
sessionInfo()
```