```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` ```{r loadLib, message=FALSE, warning=FALSE} library("lubridate") library("RColorBrewer") library("car") library("scales") library("FactoMineR") library("factoextra") ``` ## Introduction The current dataset comes from kaggle https://www.kaggle.com/ralle360/historic-tour-de-france-dataset and provides information about the Tour de France race, including 8 different variables corresponding to 2,236 stages of the race. The data have been extracted from Wikipedia and prepared by RasmusFiskerBang and is made available through a CSV file that you can download on my website (the dataset is licensed under the CC0 license - Public Domain). ## Data importation Even though CSV files can be opened with Excel, we strongly discourage this practice and we will use **R** directly for this task: ```{r importation} tdf_data <- read.table("stages_TDF.csv", sep = ",", header = TRUE, stringsAsFactors = FALSE, encoding = "UTF-8", quote = "\"") head(tdf_data) ``` where: * `sep` is used to provide the character separating columns; * `header = TRUE` indicates that column names are included in the file (in the first row); * `stringsAsFactor = FALSE` indicates that strings must not be converted to type `factor` (this is the default behavior since **R** 4.0.0); * `quote = "\""` indicates which character(s) has(have) to be considered as quotation character and not part of the data. Other information and options are available in the help page `?read.table` or in `?read.csv`, `?read.csv2`, `?read.delim`, `?read.delim2`. We can take a first look at the data with: ```{r head} head(tdf_data) summary(tdf_data) ``` When missing data are present, this last command shows that many rows contain missing values (identified with `NA`) in each column. The dataset dimension is obtained with: ```{r dim} dim(tdf_data) ``` Information on column types can be obtained with: ```{r classes} sapply(tdf_data, class) ``` which indicates that all columns are character except for the third one, which is numeric. Sometimes, numeric variables (and more specifically integers) are used to code for categorical variables but this is not the case in this dataset. To be allowed to perform properly the subsequent analysis, it is best to precisely define the different types of the columns: * Stage: is a character variable but that correspond to the number (or the symbol) of the stage each year. It is better recoded as a `factor`, which is the proper type in R for categorical variables with a finite number of possible values ```{r convertStageToFactor} tdf_data$Stage <- factor(tdf_data$Stage) ``` * Date: is a date and there is a special type in R for dates, which is better used here (for instance, to easily extract the year or the month of the date) as a numeric variable ```{r convertDateYear} tdf_data$Date <- as.Date(tdf_data$Date, format = c("%Y-%m-%d")) tdf_data$Year <- year(tdf_data$Date) ``` * Distance: is indeed a numeric variable; * Origin and Destination: are categorical variables but the number of possible values (town of origin and destination of the stage) is so large that there is only a minor benefit in converting it to a factor * Type: is a categorical variable that is better converted to a factor ```{r convertTypeToFactor} tdf_data$Type <- factor(tdf_data$Type) ``` * Winner: is the winner name so the number of possible values is so large that there is only a minor benefit in converting it to a factor * Winner_Country: is the country code of the winner so is better converted to a factor: ```{r convertCountryToFactor} tdf_data$Winner_Country <- factor(tdf_data$Winner_Country) ``` After these stages, the summary of the dataset is already more informative: ```{r summaryPostCleaning} summary(tdf_data) ``` ## Univariate statistics ### Numerical characteristics The variable `Distance` is the distance of the stage: ```{r numChar} mean(tdf_data$Distance) # mean median(tdf_data$Distance) # median min(tdf_data$Distance) # minimum max(tdf_data$Distance) # maximum # quartiles and min/max quantile(tdf_data$Distance, probs = c(0, 0.25, 0.5, 0.75, 1)) ``` The option `na.rm = TRUE` must be used when you have missing values that you don't want to be taken into account for the computation (otherwise, most of these functions would return the value `NA`). *Exercise*: How to interpret these values? More precisely, what do they say about the variable distribution? Some of these values are also available with ```{r numSum} summary(tdf_data$Distance) ``` Dispersion characteristics are obtained with: ```{r numVaribility} var(tdf_data$Distance) # variance sd(tdf_data$Distance) # standard deviation range(tdf_data$Distance) # range diff(range(tdf_data$Distance)) ``` *Exercise*: How would you compute the inter-quartile range (in just one line of code)? *Exercise*: What is the coefficient of variation (CV) for this variable? Standard modifications of data include: * *binarization*: ```{r cuts} tdf_data$cut1 <- cut(tdf_data$Distance, breaks = 5) table(tdf_data$cut1) tdf_data$cut2 <- cut(tdf_data$Distance, breaks = 5, labels = FALSE) table(tdf_data$cut2) tdf_data$cut3 <- cut(tdf_data$Distance, breaks = seq(0, 500, by = 100)) table(tdf_data$cut3) ``` *Exercise*: What is the mode of `cut3`? * *centering and scaling* ```{r scale} tdf_data$DScaled <- as.vector(scale(tdf_data$Distance)) summary(tdf_data$DScaled) var(tdf_data$DScaled) ``` ### Charts Before we start, a short note on color palettes: ```{r palettes} display.brewer.all() ``` * For a **factor** (`cut2`): ```{r factorCharts} pie(table(tdf_data$Type), col = c(brewer.pal(9, "Set1"), brewer.pal(9, "Set2"))) # 18 niveaux > 18 couleurs ``` ```{r factorCharts2} barplot(table(tdf_data$Winner_Country), col = "darkgreen", las = 3, cex.names = 0.6, cex.axis=0.6) ``` * For *numeric variables* ```{r numCharts} hist(tdf_data$Distance) hist(tdf_data$Distance, main = "Distribution of stage distances", xlab = "distance (km)", breaks = 20) plot(density(tdf_data$Distance)) hist(tdf_data$Distance, main = "Distribution of stage distances", xlab = "distance (km)", breaks = 20, freq = FALSE) lines(density(tdf_data$Distance), col = "red", lwd = 2) boxplot(tdf_data$Distance) boxplot(tdf_data$Distance, horizontal = TRUE, notch = TRUE) ``` ## Bivariate statistics ### Factor versus factor Contingency tables are obtained with the function `table` (here on a subset of the winner countries and of the stage type). ```{r contingencyTable} selected_countries <- names(which(table(tdf_data$Winner_Country) > 10)) selected_stages <- tdf_data$Winner_Country %in% selected_countries cont_table <- table(tdf_data$Type[selected_stages], tdf_data$Winner_Country[selected_stages, drop = TRUE]) cont_table t(cont_table) / rowSums(cont_table) ``` *Exercise*: How to interpret the numbers in the last table (we call them "row profiles")? For instance, what does 0.4545454545, in the column of France, mean? Compute the column profiles. ```{r cramerV} chisq <- chisq.test(cont_table)$statistic ddl <- min(nrow(cont_table) - 1, ncol(cont_table) - 1) cramerV <- sqrt(chisq / (sum(cont_table) * ddl)) cramerV ``` Barplots are obtained using the contingency table as well: ```{r barplot2Vars} selected_stages <- selected_stages & (tdf_data$Type %in% c("Individual time trial", "Plain stage", "Stage with mountain(s)")) cont_table <- table(tdf_data$Type[selected_stages, drop = TRUE], tdf_data$Winner_Country[selected_stages, drop = TRUE]) barplot(cont_table, legend.text = TRUE, xlab = "country", ylab = "frequency", cex.names = 0.6) barplot(cont_table, legend.text = TRUE, xlab = "country", ylab = "frequency", beside = TRUE, col = brewer.pal(3, "Set3"), cex.names = 0.6) ``` ### Numeric versus numeric Covariances can be obtained using the function `cov`: ```{r covariances} cov(tdf_data$Distance, tdf_data$Year) cov(tdf_data$Distance, tdf_data$Year, method = "spearman") cov(tdf_data[ ,c("Distance", "Year", "DScaled")]) cor(tdf_data$Distance, tdf_data$Year) cor(tdf_data$Distance, tdf_data$Year, method = "spearman") cor(tdf_data[ ,c("Distance", "Year", "DScaled")]) ``` and correlations are obtained with the function `cor` that takes the same arguments than `cov`. Partial correlation between `Distance` and `Year` given `DScaled` is obtained with: ```{r partialCor} cor(lm(tdf_data$Distance ~ tdf_data$DScaled)$residuals, lm(tdf_data$Year ~ tdf_data$DScaled)$residuals) ``` *Exercise*: Is it an expected result? Check at the residuals of the first model: what can you say? ```{r summaryResiduals} summary(lm(tdf_data$Distance ~ tdf_data$DScaled)$residuals) ``` Dot plots (scatterplots) between two variables are obtained with: ```{r dotplots} plot(tdf_data$Year, tdf_data$Distance, pch = 19) ``` or with scatterplot matrices: ```{r scatterplotMatrix} scatterplotMatrix(tdf_data[, c("Distance", "Year", "DScaled")]) scatterplotMatrix(tdf_data[, c("Distance", "Year", "DScaled")], col = "black", regLine = FALSE, smooth = FALSE, pch = "+") ``` *Exercise*: Can you comment on the specific plot between `Distance` and `DScaled`? ### Numeric versus factor Within and between variance of `Distance` with respect to `Winner_Country` are obtained from the function `anova` (in the column `Mean Sq`, within group variance corresponds to the row `residuals`): ```{r BetWithVars} selected_countries <- names(which(table(tdf_data$Winner_Country) > 10)) selected_stages <- tdf_data$Winner_Country %in% selected_countries anova(lm(tdf_data$Distance[selected_stages] ~ factor(tdf_data$Winner_Country[selected_stages, drop = TRUE]))) ``` and the correlation ratio is the square root of the column `F value`: ```{r corrRatio} cor_ratio <- anova(lm(tdf_data$Distance[selected_stages] ~ factor(tdf_data$Winner_Country[selected_stages, drop = TRUE]))) sqrt(cor_ratio[1, "Mean Sq"]/(cor_ratio[1, "Mean Sq"] + cor_ratio[2, "Mean Sq"])) ``` Parallel boxplots are obtained using the same syntax with the `~`: ```{r parallelPlots} boxplot(tdf_data$Distance[selected_stages] ~ tdf_data$Winner_Country[selected_stages, drop = TRUE], las = 3, xlab = "winner country", ylab = "distance") ``` To obtain multiple histograms on the same plot, you can use: ```{r multipleHistogramsB, echo=FALSE, fig.width=15, fig.height=15} par(mfrow = c(5, 4)) for (wc in selected_countries) { tmp <- tdf_data[tdf_data$Winner_Country == wc, c("Distance")] hist(tmp, main = paste("Histogram of Distance for ", wc), xlab = "distance") } ``` ## Tests ### With one variable To test if `Distance` average is equal to 200. * with a **parametric test**, we need to first test if the variable is distributed as a Gaussian variable. Several tests exist to do that but we will use the Shapiro-Wilk's test: ```{r shapiroTest} shapiro.test(tdf_data$Distance) ``` Despite significant deviation to normality, we will perform a Student test (for the sake of the example): ```{r oneSTTest} res <- t.test(tdf_data$Distance, mu = 200, conf.level = 0.99) res res$statistic res$p.value res$conf.int ``` * with a *non-parametric* Wilcoxon test (that tests the median instead of the mean): ```{r oneWilcoxTest} res <- wilcox.test(tdf_data$Distance, mu = 200, conf.int = TRUE) res res$statistic res$p.value res$conf.int ``` ### With two factors For small contingency tables, the independence between rows and columns can be tested with a Fisher's exact test (to be preferred) or a $\chi^2$ test (only when Fisher's exact test is computationally too heavy to run or when the sample size is sufficiently large). ```{r contTable} selected_countries <- names(which(table(tdf_data$Winner_Country) > 10)) selected_stages <- tdf_data$Winner_Country %in% selected_countries selected_stages <- selected_stages & (tdf_data$Type %in% c("Individual time trial", "Plain stage", "Stage with mountain(s)")) tdf_small <- data.frame(tdf_data$Winner_Country[selected_stages, drop = TRUE], tdf_data$Type[selected_stages, drop = TRUE], tdf_data$Distance[selected_stages]) names(tdf_small) <- c("Winner_Country", "Type", "Distance") cont_table <- table(tdf_small$Winner_Country, tdf_small$Type) cont_table ``` ```{fisherTest, eval=FALSE} res <- fisher.test(cont_table) # Error in fisher.test(cont_table) : # FEXACT error 7(location). LDSTP=18330 is too small for this problem, # (pastp=11.752, ipn_0:=ipoin[itp=563]=3932, stp[ipn_0]=10.2479). # Increase workspace or consider using 'simulate.p.value=TRUE' ``` In addition to being more suited to large contingency tables, $\chi^2$ test also provide interesting statistics for interpretation of the results: ```{r chiSQ} res <- chisq.test(cont_table) res res$observed res$expected res$residuals^2 ``` ```{r barplot0} barplot(t(cont_table), legend.text = TRUE, xlab = "country", ylab = "frequency", cex.names = 0.6, col = brewer.pal(3, "Set3")) ``` *Exercise*: `Titanic[ , Sex = "Male", Age = "Adult", ]` is the contingency table of Male Adults in Titanics for the variables `Class` and `Survival`. Are these two variables independant? Which classes deviate the most from the independence? Same questions for Women. ### With two numeric variables Correlation tests are performed with: ```{r corrTests} cor.test(tdf_data$Distance, tdf_data$Year) cor.test(tdf_data$Distance, tdf_data$Year, method = "spearman") plot(tdf_data$Distance, tdf_data$Year) ``` *Exercise*: The object `Soils` contain characteristics on soil samples. Variables 6-13 are numerical characteristics. Make a scatterplot matrix of these variables and pick two variables that you think are linearly correlated. Test your hypothesis. ### With a numeric variable explained by a factor ($K=2$, independent) Comparing the means of the given numeric variable `Distance` for the different countries of origin of the winner `Winner_Country`, for the countries DEN and FRA: ```{r mean2Sample} selected_countries <- c("DEN", "FRA") selected_stages <- tdf_data$Winner_Country %in% selected_countries tdf_small <- data.frame(tdf_data$Winner_Country[selected_stages, drop = TRUE], tdf_data$Distance[selected_stages]) names(tdf_small) <- c("Winner_Country", "Distance") t.test(tdf_small$Distance ~ tdf_small$Winner_Country) t.test(tdf_small$Distance ~ tdf_small$Winner_Country, var.equal = TRUE) ``` How to know if two variances are equal? Bartlett test or Fisher's test... ```{r bartlett} bartlett.test(tdf_small$Distance ~ tdf_small$Winner_Country) var.test(tdf_small$Distance ~ tdf_small$Winner_Country) ``` Comparing the medians of the given numeric variable `Distance` between the two winner countries, FRA and DEN: ```{r median2Sample} wilcox.test(tdf_small$Distance ~ tdf_small$Winner_Country) ``` *Exercise*: In the dataset `Soils`, does the pH significantly differ between samples on depressions and on slopes (variable `Contour`)? Visually confirm with the appropriate plot. ### With a numeric variable explained by a factor ($K > 2$, independent) ANOVA is based on the previous fitting of a linear model. For instance, if we want to check if the means `Distance` are different between `Type` (with a restricted dataset), we have to run (under normality and equal variance assumptions): ```{r ANOVA} selected_stages <- (tdf_data$Type %in% c("Individual time trial", "Plain stage", "Stage with mountain(s)")) tdf_small <- data.frame(tdf_data$Type[selected_stages, drop = TRUE], tdf_data$Distance[selected_stages]) names(tdf_small) <- c("Type", "Distance") anova(lm(tdf_small$Distance ~ tdf_small$Type)) ``` Kruskal-Wallis nonparametric test is performed with: ```{r KW} kruskal.test(tdf_small$Distance ~ tdf_small$Type) ``` *Exercise*: In the dataset `Soils`, does the pH significantly differ between the different contours? Visually confirm with the appropriate plot. ### With a numeric variables in two paired samples (two groups with the same individuals) The `sleep` data show the effect (`extra`) of two soporific drugs (`group`) on 10 patients (`ID`): ```{r sleep} head(sleep) tail(sleep) dim(sleep) sleep2 <- reshape(sleep, direction = "wide", idvar = "ID", timevar = "group") dim(sleep2) head(sleep2) ``` Paired comparison tests are performed with (under normality assumption): ```{r pairedTTest} with(sleep2, t.test(extra.1, extra.2, paired = TRUE)) ``` and with a nonparametric test: ```{r pairedWTest} with(sleep2, wilcox.test(extra.1, extra.2, paired = TRUE)) ``` If there is a large number of ties, this $p$-value should not be used but it can be trusted if the number of ties is small. ```{r sleepBoxplot} boxplot(extra ~ group, data = sleep) ``` We can check that these tests are equivalent to tests on the difference: ```{r sleepWilcox} sleep2$diff <- sleep2$extra.2 - sleep2$extra.1 head(sleep2) t.test(x = sleep2$diff) wilcox.test(x = sleep2$diff) ``` ## Linear regression and models ### A numeric variable explained by one or several numeric variables ```{r linearReg} res <- lm(tdf_data$Distance ~ tdf_data$Year) res summary(res) coefficients(res) confint(res) plot(res) # http://www.sthda.com/english/articles/39-regression-model-diagnostics/161-linear-regression-assumptions-and-diagnostics-in-r-essentials/ # For details on validations plots ``` ```{r linearModels} tdf_small <- tdf_data[, c("Distance", "Year")] head(tdf_small) res <- lm(Distance ~ ., data = tdf_small) summary(res) plot(res) ``` ### A numeric variable explained by one (or several) factor(s) ```{r ANOVALM} selected_stages <- (tdf_data$Type %in% c("Individual time trial", "Plain stage", "Stage with mountain(s)")) tdf_small <- data.frame(tdf_data$Type[selected_stages, drop = TRUE], tdf_data$Distance[selected_stages]) names(tdf_small) <- c("Type", "Distance") head(tdf_small) res <- lm(Distance ~ Type, data = tdf_small) summary(res) anova(res) ``` *Exercise*: Explain the pH of `Soils` by an additive effect between contour and `Ca` and with a additive effect with interaction between contour and Depth. ### A binary variable is explained by the linear relation between predictor(s) ```{r logit} res <- glm(Type ~ Distance, data = tdf_small, family = binomial(link = "logit")) summary(res) ``` ## Multiple testing correction If we test the differences between the levels of Contour of the mean for all numeric variables in `Soils`, we obtain 9 p-values on which multiple testing can be performed: ```{r multTest} all_pvals <- apply(Soils[ ,6:14], 2, function(avar) anova(lm(avar ~ Soils$Group))[1, "Pr(>F)"]) all_pvals p.adjust(all_pvals, method = "BH") p.adjust(all_pvals, method = "bonferroni") ``` ## PCA ```{r loadUSA} data("USArrests") summary(USArrests) ``` ```{r PCA} pca_usa <- PCA(USArrests, graph = FALSE) fviz_eig(pca_usa) ``` ```{r PCAplotVar} fviz_pca_var(pca_usa, choice = "var") ``` ```{r PCAPlotInd} fviz_pca_ind(pca_usa, choice = "ind") ``` ```{r PCAPlotInd2} fviz_pca_ind(pca_usa, choice = "ind", col.ind = USArrests$UrbanPop) ``` *Exercise*: Perform PCA of the dataset `athle_records.csv` after log-transformation of the data. ## Clustering ```{r hc} scaled_usa <- scale(USArrests) dist_usa <- dist(scaled_usa)^2 hc_usa <- hclust(dist_usa, method = "ward.D") plot(hc_usa) ``` ```{r withinss} within_ss <- cumsum(hc_usa$height) plot(49:1, within_ss, type = "b") 1 - within_ss[46] / within_ss[49] # reproduced inertia ``` ```{r cutdendro} plot(hc_usa) rect.hclust(hc_usa, k = 4) ``` ```{r clusteringhc} clust_hc <- cutree(hc_usa, k = 4) clust_hc fviz_pca_ind(pca_usa, col.ind = factor(clust_hc)) ``` ```{r kmeans} init_center <- by(scaled_usa, clust_hc, colMeans) init_center <- Reduce("rbind", init_center) clust_kmeans <- kmeans(scaled_usa, centers = init_center) clust_kmeans ``` ```{r clusteringkmeans} fviz_pca_ind(pca_usa, col.ind = factor(clust_kmeans$cluster)) ``` *Exercise*: Perform clustering of the dataset `athle_records.csv` after log-transformation and scaling of the data. ## Session information This file has been compiled with the current system: ```{r sessionInfo} sessionInfo() ```