--- title: "Small-plot analysis" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Small-plot analysis} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} %\VignetteDepends{asreml} --- ``` r library(CBADASReml) #> Loading required package: asremlPlus #> ASReml-R needs to be loaded if the mixed-model functions are to be used. #> #> ASReml-R is available from VSNi. Please visit http://www.vsni.co.uk/ for more information. #> Loading required package: ggplot2 #> Loading required package: gstat #> Loading required package: sf #> Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.3.1; sf_use_s2() is TRUE library(asreml) #> Loading required package: Matrix #> Online License checked out Fri Aug 22 14:18:11 2025 #> Loading ASReml-R version 4.2 ``` # CBADASReml ## Introduction A typical type of experimental analysis that this package will assist with is the small plot trial. Here we will show the general workflow for performing a small-plot analysis and how each function from the package can be used. Analysis of small plot trials is done using linear mixed-effects models (LMMs), using {ASReml} or [{glmmTMB}](https://glmmtmb.github.io/glmmTMB/), and this package contains helper functions for this type of modelling. Note that the trial design is not handled at all by this package, and we recommend the use of [{speed}](https://biometryhub.github.io/speed/) or [{FielDHub}](https://didiermurillof.github.io/FielDHub/index.html) for creating small plot trial designs. ## Mathematical Background Let: \begin{equation*} y = X b + Z u + e \end{equation*} Where - $y \in \mathbb{R}^{n}$ is the vector of observations. - $b \in \mathbb{R}^{p}$ is the vector of fixed effects. - $X \in \mathbb{R}^{n \times p}$ is the design matrix for the fixed effects. - $u \in \mathbb{R}^{q}$ is the vector of random effects. - $Z \in \mathbb{R}^{n \times q}$ is the design matrix for the random effects. - $e \in \mathbb{R}^{n}$ is the vector of residuals. It is assumed that the vectors $u$ and $e$ are independent, and their joint distribution is a multivariate Gaussian described by: \begin{equation*} \begin{bmatrix} u \\ e \end{bmatrix} \sim N \bigg( \begin{bmatrix} 0 \\ 0 \end{bmatrix} , \begin{bmatrix} G & 0 \\ 0 & R \end{bmatrix} \bigg) \end{equation*} Where $G$ and $R$ are covariance matrices for $u$ and $e$ and are functions of the *variance parameters* $\sigma_g$ and $\sigma_r$, and finally the dependent variable $y$ is described by: \begin{equation*} y \sim N \big( X b , Z G Z^\intercal + R \big) \end{equation*} These $G$ and $R$ covariance matrices, or *covariance structures*, are very important. These represent the way the random effects and the residuals behave, and $\sigma_g$ and $\sigma_r$, the *covariance structure parameters* are the things we try and estimate when we use ASReml. In the case of independent and identically distributed (IID) residual structures, the residual covariance parameter $\sigma_r$ will be some *variance component* $\sigma_e^2$. ### Exploratory Data Analysis After gathering data from an experiment, it is useful to perform an exploratory data analysis on the raw data, and for this you will likely use a combination of `tidyverse` and `gplot2` for this, to visualise outliers, the data distribution, etc. in order to get a preliminary grasp of what you are working with, and to observe the data for any issues that will hinder the actual analysis. As an assistance with this process, the function `exploratory_table` is provided. ``` r CBADASReml::exploratory_table( data = oats, response = "yield", groupby = c("Variety", "Nitrogen") ) #> Variety Nitrogen mean median sd cv reps min max range #> 1 Golden_rain 0_cwt 80 75 21.0047613649858 26.2559517062322 6 60 117 60 - 117 #> 2 Golden_rain 0.2_cwt 98.5 102.5 13.4721935853075 13.6773538937132 6 82 114 82 - 114 #> 3 Golden_rain 0.4_cwt 114.666666666667 110 29.9443929086343 26.1142961412508 6 86 161 86 - 161 #> 4 Golden_rain 0.6_cwt 124.833333333333 129.5 20.8750249500849 16.7223163819105 6 96 149 96 - 149 #> 5 Marvellous 0_cwt 86.6666666666667 92.5 16.5730705262081 19.1227736840862 6 63 105 63 - 105 #> 6 Marvellous 0.2_cwt 108.5 111.5 26.8533051969399 24.7495900432626 6 70 140 70 - 140 #> 7 Marvellous 0.4_cwt 117.166666666667 118.5 9.78604448521805 8.35224280388454 6 104 132 104 - 132 #> 8 Marvellous 0.6_cwt 126.833333333333 122.5 20.2920345620311 15.9989760016014 6 99 156 99 - 156 #> 9 Victory 0_cwt 71.5 65 20.59854363784 28.809151941035 6 53 111 53 - 111 #> 10 Victory 0.2_cwt 89.6666666666667 89.5 22.5092573548455 25.103260990534 6 64 130 64 - 130 #> 11 Victory 0.4_cwt 110.833333333333 106 26.0108951531213 23.4684768298838 6 81 157 81 - 157 #> 12 Victory 0.6_cwt 118.5 114.5 30.0915270466622 25.3936937102634 6 86 174 86 - 174 ``` ### Model Fitting and Validation We may use the R package `asreml` as follows to implement an LMM. For the purpose of this vignette we will fit a few different models, so we can demonstrate different techniques from the package. ``` r # The needs to be ordered so that the model can run oats <- oats[order(oats$Column), ] mod_oats <- asreml( fixed = yield ~ Variety + Nitrogen + Variety:Nitrogen, random = ~ idv(Blocks) + idv(Blocks):id(Wplots), residual = ~ ar1(Column):ar1(Row), data = oats, trace = FALSE ) #> Warning in asreml(fixed = yield ~ Variety + Nitrogen + Variety:Nitrogen, : Some components changed by more than 1% on #> the last iteration mod_oats <- update(mod_oats, aom = TRUE) mod_rats <- asreml( fixed = weight ~ littersize + Sex + Dose, random = ~idv(Dam), residual = ~units, data = rats, trace = FALSE ) mod_rats <- update(mod_rats, aom = TRUE) ``` In the first model, `mod_oats`, we are using the `oats` data: ##### Data - The `oats` data has 72 rows, 2 treatments (Nitrogen and Variety), and plots (3 levels) within blocks (6 levels). Therefore: + $n = 72$ (number of observations) + $p = 3$ (number of fixed effects, including an interaction effect) + $q = 2$ (number of random effects, blocks, and plots within blocks) ##### $y$ - `yield` is our $y \in \mathbb{R}^{72}$ ($72 \times 1$) vector. ##### $X$ - `Variety`, `Nitrogen` and their interaction `Variety:Nitrogen` are in the $X \in \mathbb{R}^{72 \times 3}$ ($72 \times 3$) matrix. ##### $b$ - There are 3 fixed effects which are continuous, therefore our fixed effects vector is given by $b \in \mathbb{R}^{3 \times 1}$ ##### $Z$ - `Blocks` and `Blocks:Wplots` make up our $Z \in \mathbb{R}^{n \times q}$ matrix. Since these are factors, each level is treated as it's own dummy variable. We end up with two $Z_i$ matrices which make up $Z = [ \; Z_1 \; Z_2 \; ]$ which are given as follows (if ordered by their respective variables): \begin{equation*} Z_{1} = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ \vdots \times \text{10} & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & \vdots \times \text{11} & 0 & 0 & 0 & 0 \\ & & \vdots & & & \end{bmatrix} \in \mathbb{R}^{72 \times 6} \end{equation*} \begin{equation*} Z_{2} = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ \vdots \times \text{2} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & \vdots \times \text{2} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ & & & & & & & & \vdots & & & & & & & & & \end{bmatrix} \in \mathbb{R}^{72 \times 18} \end{equation*} Where each column in $Z_1$ represents the $6$ `Blocks`, and the columns in $Z_2$ represent the $6 \times 3 = 18$ `Blocks` $\times$ `Wplots`. ##### $u$ - Since we have $2$ model terms (`idv(Blocks)` and `idv(Blocks):id(Wplots)`), we have two $Z_i$ matrices, and also two $u_i$ vectors. Our random effects vector is given by: + $u = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix}$ + $u_1 \sim N(0, G_1) \in \mathbb{R}^{6}$ + $u_2 \sim N(0, G_2) \in \mathbb{R}^{18}$. ##### $G$ - `idv(Blocks) + idv(Blocks):id(Wplots)` is the random effects variance structure. Since we have two terms, we again have two $G_i$'s. We specify the `idv` variance model function, which is the homogeneous scaled identity variance structure, i.e. $\sigma^2 I$. The second term has a /direct product/ implied by the `:` operator, which in the case of matrices is a /Kronecker product/, of `idv(Blocks)` and `id(Wplots)`. `id` variance model means it is simply a correlation structure, i.e. $I$, and there is no estimation of a variance parameter. Noting we have 2 random effects terms: + $G_1 = \sigma_1^2 I_6$ ($I_6$ is the $6 \times 6$ identity matrix) + $G_2 = (\sigma_2^2 I_6) \otimes I_3$ + $G = \bigoplus_{i = 1}^2 G_i = \begin{bmatrix} \sigma_1^2 I_6 & 0 \\ 0 & (\sigma_2^2 I_6 \otimes I_3) \end{bmatrix}$ ##### $e$ - The error vector is specified by the single model term `ar1(Column):ar1(Row)`, therefore the error is given by: + $e \sim N(0, R)$ where $R$ is the $R$ structure ##### $R$ - `ar1(Column):ar1(Row)` specifies the residual variance structure. In this case we refer to an autoregressive model of order 1 (AR(1)) on both `Row` and `Column`, where the `:` operator implies a separable Kronecker product structure. The variance model `ar1` is of the form: $\Sigma_{ii} = 1, \Sigma_{ij} = \phi \Sigma_{i-1, j}, i > j+1, \lvert \phi \rvert < 1$. + $R = \Sigma_c(\phi_c) \otimes \Sigma_r(\phi_r)$ + Where $\phi$ refers to the correlation within the columns ($\phi_c$) or the rows ($\phi_r$). In the second model, `mod_rats`, we are using the `rats` data: - `weight` is our $y$ vector. - `littersize`, `Sex` and `Dose` are in the $X$ matrix. - `Dam` is in the $Z$ matrix. We specify the `idv` variance model function, which is the homogeneous scaled identity variance structure, i.e. $\sigma^2 I$. In other words, the term is IID: + $u \in \mathbb{R}^d \sim N(0, G)$ + $Z$ is the design matrix for `Dam`. + $G = \sigma_d^2 I_d$. - `idv(units)` is the residual variance structure, in this case the residual term has the homogeneous covariance matrix: $\sigma^2 I$. This is basically saying all observations have the same error distribution (IID): + $e \in \mathbb{R}^n \sim N(0, R)$ + $R = \sigma_e^2 I_n$ Upon executing the above in R, the `ASReml-R` code will attempt to *estimate* the variance components $\sigma_d^2$ and $\sigma_e^2$, using the REML algorithm. ## Function examples {.tabset} ### Traditional Diagnostics (`ASReml-R`) #### `plot.asreml` ``` r plot(mod_oats) ``` ![plot of chunk plot_mod_oats](plot_mod_oats-1.png) #### `plot(varioGram(mod))` ``` r plot(varioGram(mod_oats)) ``` ![plot of chunk variogram_mod_oats](variogram_mod_oats-1.png) ### CBADASReml Diagnostics #### `outlier_summary` The purpose of the `outlier_summary` function is to observe potential outliers that can then be discussed with the data provider. Below is an example for the rats dataset that is included within `ASReml-R`. ``` r outlier_summary(mod_rats) #> #> ── Outliers detected: 2 ─────────────────────────────────────────────────────────────────────────────────────────── #> Dose Sex littersize Dam weight units combined_trt residuals #> 56 C F 13 5 5.02 56 13_F_C -4.090506 #> 66 C F 9 6 3.68 66 9_F_C -7.941340 #> #> ── Data table ── #> #> Dose Sex littersize Dam weight units combined_trt residuals #> 56 C F 13 5 5.02 56 13_F_C -4.0905065 #> 57 C F 13 5 6.04 57 13_F_C -1.4769901 #> 53 C F 13 5 7.16 53 13_F_C 1.3927533 #> 55 C F 13 5 7.14 55 13_F_C 1.3415079 #> 54 C F 13 5 7.09 54 13_F_C 1.2133944 #> 129 C F 13 10 5.92 129 13_F_C -1.1145276 #> 126 C F 13 10 6.67 126 13_F_C 0.8050510 #> 128 C F 13 10 6.53 128 13_F_C 0.4467297 #> 130 C F 13 10 6.52 130 13_F_C 0.4211353 #> 125 C F 13 10 6.44 125 13_F_C 0.2163802 #> 131 C F 13 10 6.44 131 13_F_C 0.2163802 #> 127 C F 13 10 6.43 127 13_F_C 0.1907859 #> 66 C F 9 6 3.68 66 9_F_C -7.9413397 #> 64 C F 9 6 7.26 64 9_F_C 1.3727706 #> 65 C F 9 6 6.58 65 9_F_C -0.3963900 ``` #### `directional_variograms` The purpose of this function is to view the autocorrelation structure in both directions, rather than the traditional approach with `ASReml-R`. ``` r plot(directional_variograms(mod_oats)) ``` ![plot of chunk directional_variograms_mod_oats](directional_variograms_mod_oats-1.png) ## Model Outputs {.tabset} ### Traditional Outputs (`ASReml-R`) #### `wald` For getting the significance of each variable in the model ``` r wald(mod_rats) #> Wald tests for fixed effects. #> Response: weight #> Terms added sequentially; adjusted for those above. #> #> Df Sum of Sq Wald statistic Pr(Chisq) #> (Intercept) 1 1477.67 8981.5 < 2.2e-16 *** #> littersize 1 4.58 27.8 1.312e-07 *** #> Sex 1 9.80 59.5 1.199e-14 *** #> Dose 2 3.76 22.8 1.100e-05 *** #> residual (MS) 0.16 #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` #### `predict` ``` r predict.asreml(mod_oats, classify = "Nitrogen") #> $pvals #> #> Notes: #> - The predictions are obtained by averaging across the hypertable #> calculated from model terms constructed solely from factors in #> the averaging and classify sets. #> - Use 'average' to move ignored factors into the averaging set. #> - The simple averaging set: Variety #> - The ignored set: Blocks,Wplots #> #> Nitrogen predicted.value std.error status #> 1 0_cwt 77.76388 6.929776 Estimable #> 2 0.2_cwt 100.15097 6.896569 Estimable #> 3 0.4_cwt 114.41044 6.887489 Estimable #> 4 0.6_cwt 123.22901 6.912904 Estimable #> #> $avsed #> overall #> 3.748427 ``` ### CBADASReml Outputs ### `anova_table` Show an ANOVA table for multiple models, with p-values for each effect against each response. ``` r rats[["newresp"]] <- rats[["weight"]] * runif(nrow(rats)) mod_rats2 <- asreml( fixed = newresp ~ littersize + Sex + Dose, random = ~idv(Dam), residual = ~units, data = rats, trace = FALSE ) mod_rats2 <- update(mod_rats2, aom = TRUE) anova_table(mod_rats, mod_rats2) #> Effect weight newresp #> 1 littersize 0 0.626 #> 2 Sex 0 0.011 #> 3 Dose 0 0.058 ``` ### `pred_table` Get the mean for each level of a fixed effect, as well as standard error, and confidence intervals. ``` r pred_table(mod_rats, classify = "Dose") #> Error: #> ! Classify must be type character ``` ## Least significant difference #### `lsd_table` Show a least significant difference table: ``` r lsd_table(mod_rats, classify = "Dose") #> treatment group lsd means #> 1 C a 0.354997 6.410370 #> 2 Low b 0.354997 5.989315 #> 3 High c 0.354997 5.554064 ``` #### `lsd_graph` Show an LSD graph of the same data: ``` r lsd_graph(mod_rats, classify = "Dose") ``` ![plot of chunk lsd_graph_mod_rats](lsd_graph_mod_rats-1.png)