--- title: "Using nert to augment agricultural analytics with SMIPS data" author: Russell Edson and Max Moldovan output: rmarkdown::html_document vignette: > %\VignetteIndexEntry{Using nert to augment agricultural analytics with SMIPS data} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} %\VignetteDepends{terra} %\VignetteDepends{nlme} %\VignetteDepends{ggplot2} --- The **nert** package provides straightforward and seamless access to the TERN Soil Moisture Integration and Prediction System (SMIPS) datasets, which include soil moisture estimates that can be used in your data analytics pipelines. This document showcases this process with the example of a synthesised multi-site grain production experiment. The **nert** package is loaded and used to download soil moisture time series data at various locations across South Australia. Recognising soil moisture as a classical confounder---affecting crop yield (via root-zone water availability) and nitrogen treatment (via increased volatilisation in moist soils)---we will augment our modelling of the grain production to use soil moisture estimates over the treatment period for estimating the nitrogen treatment effect. ## A synthetic dataset for a grain production experiment A simulated dataset containing yield data for a fictitious grain production experiment has been included with the **nert** package. The dataset is called `grain`, and you can load the package and this dataset with: ``` r library(nert) data(grain) str(grain) #> 'data.frame': 2880 obs. of 10 variables: #> $ Site : Factor w/ 10 levels "Adelaide","Barossa Valley",..: 1 1 1 1 1 1 1 1 1 1 ... #> $ Latitude : num -34.9 -34.9 -34.9 -34.9 -34.9 ... #> $ Longitude : num 139 139 139 139 139 ... #> $ SowDate : Date, format: "2022-05-20" "2022-05-20" ... #> $ NitrogenDate : Date, format: "2022-06-04" "2022-06-04" ... #> $ Rep : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ... #> $ Variety : Factor w/ 8 levels "Variety_A","Variety_B",..: 1 1 1 1 1 1 1 1 1 1 ... #> $ Nitrogen_kgNha : Factor w/ 4 levels "0","30","60",..: 1 1 1 2 2 2 3 3 3 4 ... #> $ SeedRate_plantsm2: Factor w/ 3 levels "75","150","300": 1 2 3 1 2 3 1 2 3 1 ... #> $ Yield_Tha : num 3 3.69 3.89 2.76 3.27 ... ``` The `grain` dataset supposes that a hypothetical grain production experiment has been run at a selection of ten sites across South Australia, investigating the effects of four different treatment levels of nitrogen application and three different seeding rates on the grain yield for some crop with eight varieties. Note that we have locational coordinates (in `Latitude` and `Longitude`) for each of the sites, as well as dates for the crop sowing and nitrogen applications (in `SowDate` and `NitrogenDate` respectively.) This spatio-temporal information is useful, as it allows us to match up the experimental sites with the data points that we need to download from SMIPS. ## Download SMIPS data for the sites across times Here we use the **nert** package to download SMIPS soil moisture datasets from the TERN Data Portal, which we can use to augment our analysis of the `grain` experiment data. The two key SMIPS datasets of interest are: 1. `totalbucket`: an estimate of the _volumetric soil moisture_ (in units of mm) from the SMIPS bucket moisture store, 1. `SMindex`: the SMIPS _soil moisture index_ (i.e., a percentage between 0 and 100 that indicates how full the SMIPS bucket moisture store is relative to its 90cm capacity). For simplicity, suppose we are interested in the SMIPS soil moisture index (`SMindex`) data, for all days falling between the earliest nitrogen application date up to 30 days after the last application date, and we want the soil moisture index at each site (by latitude/longitude coordinates). We can generate this date range in a straightforward way using the `NitrogenDate` column of the `grain` dataset as follows: ``` r start_date <- min(grain$NitrogenDate) end_date <- max(grain$NitrogenDate) + 30 date_range <- seq(start_date, end_date, by = "1 day") c(date_range[1], date_range[length(date_range)]) #> [1] "2022-05-07" "2022-07-15" ``` We also need the latitude and longitude for the sites, which we can readily retrieve from the `grain` dataset columns `Latitude` and `Longitude`: ``` r sites <- unique(grain[, c("Site", "Latitude", "Longitude")]) sites #> Site Latitude Longitude #> 1 Adelaide -34.94773 138.6244 #> 289 Barossa Valley -34.54843 138.9580 #> 577 Clare Valley -33.83754 138.6012 #> 865 Eyre Peninsula -34.37378 135.7967 #> 1153 Fleurieu Peninsula -35.52307 138.4608 #> 1441 Kangaroo Island -35.71041 137.0736 #> 1729 Limestone Coast -37.88957 140.6388 #> 2017 Sturt Vale -33.34666 140.0723 #> 2305 Murraylands -35.12425 139.3014 #> 2593 Yorke Peninsula -35.05124 137.2090 ``` Now TERN supplies the SMIPS daily data rasters as cloud-optimised GeoTIFFs (or COGs), which contain the soil moisture point predictions across the entirety of Australia. However, because they are cloud-optimised, we can be clever in our downloading to make sure that we only download data for the locations of interest, rather than the entire raster. The **nert** package works in tandem with the **terra** package to achieve this efficiency: - First, we download the _information_ for a daily SMIPS raster using `nert::read_smips`, - Then we _extract only the point values we need_, using `terra::extract`. This leads to a quicker and tighter data download. The below code shows this process. (Note that `terra::extract` is particular about the order of the latitude/longitude coordinates: longitude should be specified first, followed by latitude.) ``` r smips_data <- data.frame() for (i in seq_along(date_range)) { r <- read_smips(date = date_range[i], collection = "SMindex") smips_points <- terra::extract( x = r, y = data.frame(lon = sites$Longitude, lat = sites$Latitude), xy = TRUE ) names(smips_points)[2] <- "smips_smi_perc" smips_data <- rbind( smips_data, data.frame( Date = date_range[i], Latitude = smips_points$y, Longitude = smips_points$x, smips_smi_perc = smips_points$smips_smi_perc ) ) } head(smips_data) #> Date Latitude Longitude smips_smi_perc #> 1 2022-05-07 -34.95253 138.6237 0.25906488 #> 2 2022-05-07 -34.55264 138.9537 0.17677850 #> 3 2022-05-07 -33.83285 138.6037 0.07674548 #> 4 2022-05-07 -34.37270 135.7944 0.53408158 #> 5 2022-05-07 -35.52237 138.4638 0.37496096 #> 6 2022-05-07 -35.71231 137.0741 0.30223876 ``` We can add the `Site` column to the `smips_data` to make it easier to use it in conjunction with the `grain` dataset during the analysis: ``` r smips_data$Site <- sites$Site head(smips_data) #> Date Latitude Longitude smips_smi_perc Site #> 1 2022-05-07 -34.95253 138.6237 0.25906488 Adelaide #> 2 2022-05-07 -34.55264 138.9537 0.17677850 Barossa Valley #> 3 2022-05-07 -33.83285 138.6037 0.07674548 Clare Valley #> 4 2022-05-07 -34.37270 135.7944 0.53408158 Eyre Peninsula #> 5 2022-05-07 -35.52237 138.4638 0.37496096 Fleurieu Peninsula #> 6 2022-05-07 -35.71231 137.0741 0.30223876 Kangaroo Island ``` We are now ready to proceed with the analytics. ## A simple model for grain yield First, we model the grain yield with a simple model without taking into account any soil moisture confounding---that is, without reference to the SMIPS data that we have just downloaded. We will later incorporate the SMIPS data to see how including the soil moisture as a covariate improves the nitrogen treatment effect size estimates. Here we will use the **nlme** package to fit a linear mixed-effect model. The grain yield `Yield_Tha` will be modelled taking the `Variety`, nitrogen application rate `Nitrogen_kgNha` and seeding rate `SeedRate_plantsm2` as fixed terms, and incorporating the `Site` and replicate (`Rep`) structure of the experiment in a random effect term. ``` r library(nlme) simple_model <- lme( fixed = Yield_Tha ~ Variety * Nitrogen_kgNha * SeedRate_plantsm2, random = ~ 1 | Site / Rep, data = grain ) ``` We can check the confidence intervals for the effect sizes of the nitrogen treatments for this simple model: ``` r simple_model.ints <- intervals(simple_model, which = "fixed")$fixed simple_model.Neffects <- simple_model.ints[paste0("Nitrogen_kgNha", c(30, 60, 90)), ] simple_model.Neffects #> lower est. upper #> Nitrogen_kgNha30 -0.17355028 0.0392950 0.2521403 #> Nitrogen_kgNha60 0.08045472 0.2933000 0.5061453 #> Nitrogen_kgNha90 0.09779306 0.3106383 0.5234836 ``` ## Augmenting the yield model with soil moisture data Next, we augment our modelling by including the SMIPS soil moisture data as a covariate (perhaps anticipating that the soil moisture improves the yield, but might reduce the effect of nitrogen applied to the soil due to volatilisation). To keep things simple, for each site (and its associated nitrogen application date) we take an average of the SMIPS-reported soil moisture index from the date of the nitrogen application until 30 days after the application, which we will store in a new column called `SoilMoisture_avg`. ``` r for (site in sites$Site) { start_date <- unique(grain[which(grain$Site == site), "NitrogenDate"])[1] dates <- seq(start_date, start_date + 30, by = "1 day") smips <- smips_data[which(smips_data$Site == site & smips_data$Date %in% dates), ] smips_avg <- mean(smips[["smips_smi_perc"]]) grain[which(grain$Site == site), "SoilMoisture_avg"] <- smips_avg } ``` We can then add this average soil moisture as a fixed effect to the linear mixed-effect model: ``` r augmented_model <- lme( fixed = Yield_Tha ~ Variety * Nitrogen_kgNha * SeedRate_plantsm2 + SoilMoisture_avg + SoilMoisture_avg:Nitrogen_kgNha, random = ~ 1 | Site / Rep, data = grain ) ``` The confidence intervals for the effect sizes of the nitrogen application treatments for this augmented model are then as follows: ``` r augmented_model.ints <- intervals(augmented_model, which = "fixed")$fixed augmented_model.Neffects <- augmented_model.ints[paste0("Nitrogen_kgNha", c(30, 60, 90)), ] augmented_model.Neffects #> lower est. upper #> Nitrogen_kgNha30 -0.03984861 0.2000454 0.4399395 #> Nitrogen_kgNha60 0.31409573 0.5539898 0.7938838 #> Nitrogen_kgNha90 0.33961765 0.5795117 0.8194058 ``` We can use **ggplot2** plotting to graph the confidence intervals for the simple model versus the augmented model, and illuminate the difference attained when we include the soil moisture as a confounding term in our modelling: ``` r library(ggplot2) ci <- rbind( data.frame(simple_model.Neffects, model = "Simple Model"), data.frame(augmented_model.Neffects, model = "SMIPS-Augmented Model") ) ci$term <- paste0("Nitrogen_kgNha", c(30, 60, 90)) pd <- position_dodge(-0.4) ggplot(ci, aes(x = est., y = term, colour = model)) + scale_y_discrete(limits = rev) + geom_point(position = pd) + geom_errorbarh(aes(xmin = lower, xmax = upper), position = pd, height = 0.2) + labs(y = "predictor", x = "estimate") ```
Nitrogen effect estimates compared between the simple model and
  the SMIPS-augmented model.

Nitrogen effect estimates compared between the simple model and the SMIPS-augmented model.

From the comparison of the effect sizes for the nitrogen treatment, we can see that ignoring the effect of soil moisture in the simple model underestimates the direct nitrogen treatment effect. The augmented model, in contrast, accounts for the soil moisture and the potential nitrogen volatilisation that may occur in higher soil moisture conditions. ## Simplified collection with `collect_tern_data()` The hand-written `for` loop above is shown for didactic purposes — it makes the per-date COG read and per-site point extraction explicit. For routine use, `collect_tern_data()` performs the same operation vectorised across locations, with one COG open per (dataset, date, variant) regardless of the number of points. The same `SMindex` time series can be produced with a single call: ``` r smips_dt <- collect_tern_data( xy = data.frame(lon = sites$Longitude, lat = sites$Latitude), date_range = c(start_date, end_date), datasets = "SMIPS", smips_collection = "SMindex", verbose = FALSE ) head(smips_dt) ``` `smips_dt` is a `data.table` with one row per (date, location); a single extraction call replaces the inner `terra::extract()` of the loop. The example above is shown with `eval=FALSE` to avoid duplicating the network traffic already performed by the loop; the output schema is identical to `smips_data` apart from the column name (`SMIPS_SMindex` rather than `smips_smi_perc`).