An Exploration Across States and Months
Slides: slides.html
Statistical methods for comparing differences between groups have long served as a cornerstone of empirical research across the natural and social sciences. Foundational tools such as t-tests, analysis of variance (ANOVA), and linear regression provided early mechanisms for evaluating mean differences and variance relationships under specific conditions. However, as research designs grew more complex, incorporating repeated measurements, nested data structures, and unbalanced datasets, the limitations of traditional techniques became increasingly apparent. These methodological pressures contributed to the development of linear mixed-effects models (LMMs), which extend classical linear modeling by incorporating both fixed and random sources of variation within hierarchical data structures. The purpose of this review is to examine how LMMs came to prominent usage, the associated benefits and limitations, and how this method may be applied to this capstone project.
Formal statistical tests for group comparisons saw widespread growth in the early twentieth century. Sir Ronald Fisher’s Statistical Methods for Research Workers (1925) was one of the earliest compilations of comparison analyses such as t-tests, analysis of variance (ANOVA), maximum likelihood estimation, and experimental design for correlation studies (Huitema (2025)). Building on this groundwork, Henderson (1953) advanced methods for estimating variance and covariance components in unbalanced agricultural datasets. His work demonstrated how linear models could use variance components from both fixed and random factors, providing unbiased estimates even with unbalanced group structures. These contributions laid critical groundwork for the development of modern mixed-effects modeling.
Traditional approaches such as ANOVA and simple regression perform optimally under assumptions of independence, normality, and balanced groupings (Henderson (1953)). Violations of these assumptions, including repeated measures, nested clustering, and missing data, can compromise inference (Brown (2021)). Linear mixed-effects models address these limitations by modeling fixed population-level effects alongside random group-level deviations within a single framework. This hierarchical variance structure allows LMMs to account for correlated observations while improving estimate stability through partial pooling, where group estimates are shrunk toward the overall mean when data are limited or missing (Meteyard and Davies (2020)).
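Partial pooling can be made concrete with a small calculation. The sketch below assumes a balanced random-intercept model with known variance components; all numbers are hypothetical and chosen only for illustration. It shows how the weight placed on a group's own mean grows with the number of observations in that group.

```r
# Illustrative values only (hypothetical, not estimates from this project)
sigma2_between <- 100   # variance of group means around the overall mean
sigma2_within  <- 400   # residual variance within a group
overall_mean   <- 50
group_mean     <- 70    # raw mean observed in one group

# Shrinkage weight: the share of the group's own mean that is retained
shrink_weight <- function(n) {
  sigma2_between / (sigma2_between + sigma2_within / n)
}

# Partially pooled estimate for a group with n observations
pooled_estimate <- function(n) {
  w <- shrink_weight(n)
  overall_mean + w * (group_mean - overall_mean)
}

n_obs <- c(1, 4, 16, 400)
data.frame(
  n        = n_obs,
  weight   = round(shrink_weight(n_obs), 3),
  estimate = round(pooled_estimate(n_obs), 2)
)
```

With a single observation the group estimate is pulled most of the way toward the overall mean (weight 0.2), while with 400 observations it stays close to the group's own mean.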
This structure gives LMMs considerable flexibility. Schielzeth et al. (2020) demonstrated that fixed-effect variance estimates remain largely unbiased under assumption violations such as skewness, heteroskedasticity, and bimodality, though precision may decline. Despite these advantages, LMMs introduce analytical complexities. Model specification, particularly the selection of random-effects terms, can be challenging, and over- or under-parameterization may affect model stability and interpretability (Ryoo (2011)). Additionally, while LMMs accommodate unbalanced data, reduced precision under distributional violations remains an important consideration (Schielzeth et al. (2020)).
Owing to their robust performance in the face of imperfect data, LMMs are commonly utilized in practical fields such as agriculture, psychology, ecology, and clinical research. Meteyard and Davies (2020) documented their extensive adoption in psychological research, while agricultural applications include modeling spatial yield variation in plant breeding trials (Adhikari, Wu, and Caffe-Treml (2016)) and assessing environmental drivers of winter wheat performance (Zhou et al. (2022)).
Within the present capstone project, LMMs provide an appropriate framework for modeling U.S. egg production across states and years while investigating the effects of average feed price and temperature. The repeated-measures and nested structure of the dataset necessitates an approach capable of separating population-level effects from state-level variability. By capturing both fixed effects and hierarchical random variation, LMMs offer more comprehensive insights into production dynamics than traditional regression alone.
The general linear mixed model may be applied to the egg production dataset to examine factors associated with laying productivity. Linear mixed models are commonly used to analyze hierarchical or longitudinal datasets in which repeated observations are nested within higher-level groups (Meteyard and Davies (2020)). The response variable, \(\mathbf{y}\), represents eggs produced per 100 laying hens, a standard measure of laying productivity reported in USDA egg production statistics. Feed price, yearly variation, monthly variation, and average monthly temperature by state enter the fixed-effects portion of the model, \(X\boldsymbol{\beta}\). Fixed effects estimate the systematic contribution of these variables to egg productivity across the population of commercial egg-producing states. In contrast, inter-state variation is captured through the random-effects component, \(Z\boldsymbol{\gamma}\), while the residual error term \(\mathbf{e}\) represents fluctuations in productivity not explained by either the fixed or random components of the model (Mustafa (2023)).
In this capstone project, the fixed-effects portion of the model represents systematic relationships between egg productivity and explanatory variables that are assumed to influence all states in a similar manner (Mustafa (2023)). Average monthly temperature and feed price are included as fixed effects because they represent measurable environmental and economic conditions expected to affect egg production across the industry as a whole. The associated fixed-effect coefficients therefore estimate the average change in eggs produced per 100 layers associated with variation in these predictors, while also accounting for seasonal and long-term patterns through month and year effects.
The random-effects component accounts for unobserved heterogeneity between states. Commercial egg production systems vary across states due to differences in housing systems, farm size, management practices, breed composition, and regional climate conditions (American Egg Board (2013)). These factors may lead to systematic differences in baseline productivity that are not fully captured by the measured predictors. To accommodate this variability, the model includes state-level random effects, allowing each state to deviate from the overall intercept. This structure reflects the hierarchical nature of the dataset, in which repeated monthly observations are nested within states.
In this study, state is modeled as a random effect rather than a fixed effect because the goal is not to estimate a separate parameter for each individual state, but rather to account for variability across the broader population of egg-producing states. Treating state as a random effect allows each state to have its own baseline productivity level while still estimating overall relationships between temperature, feed price, month, year, and egg production. This approach is appropriate when observations are grouped and repeatedly measured within the same units over time (Sarafian (2020)). Additionally, modeling state as a random effect reduces the number of parameters required compared with estimating separate fixed coefficients for each state and improves the model’s ability to generalize the estimated relationships beyond the specific states included in the dataset (Mustafa (2023)).
The general linear mixed model (LMM) is defined below:
\[ \mathbf{y} = X\boldsymbol{\beta} + Z\boldsymbol{\gamma} + \mathbf{e} \]
\(\mathbf{y}\): outcome vector
\(X\): fixed-effects matrix
\(\boldsymbol{\beta}\): fixed-effects coefficient vector
\(Z\): random-effects matrix
\(\boldsymbol{\gamma}\): random-effects vector
\(\mathbf{e}\): residual error vector
The residual error term satisfies:
\[ \mathbf{e} \sim \mathcal{N}(\mathbf{0}, R) \]
\(\mathbf{0}\): mean vector
\(R\): residual variance-covariance matrix
The random effects vector satisfies:
\[ \boldsymbol{\gamma} \sim \mathcal{N}(\mathbf{0}, G) \]
\(\mathbf{0}\): mean vector
\(G\): random-effects variance-covariance matrix
Additionally, the residual errors and random effects are assumed uncorrelated (Ao (2007)):
\[ \mathrm{Cov}(\mathbf{e}, \boldsymbol{\gamma}) = 0 \]
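The pieces of this definition can be assembled by hand for a toy random-intercept case. The sketch below uses base R only; the group count, coefficients, and variances are hypothetical values chosen for illustration, not quantities from this project. It builds \(X\), \(Z\), \(G\), and \(R\), simulates \(\mathbf{y}\), and forms the marginal covariance \(ZGZ^\top + R\) that follows from the assumptions above.

```r
set.seed(1)

n_groups <- 5    # hypothetical number of groups (e.g., states)
n_per    <- 12   # hypothetical observations per group
group    <- factor(rep(seq_len(n_groups), each = n_per))
x        <- rnorm(n_groups * n_per)

X <- model.matrix(~ x)           # fixed-effects design matrix (intercept + slope)
Z <- model.matrix(~ 0 + group)   # random-intercept design: one indicator column per group

beta <- c(10, 2)                 # fixed-effects coefficients (assumed)
G    <- diag(4, n_groups)        # Var(gamma): independent intercepts with variance 4
R    <- diag(1, nrow(X))         # Var(e): independent residuals with variance 1

gamma <- rnorm(n_groups, sd = sqrt(4))
e     <- rnorm(nrow(X), sd = 1)

# y = X beta + Z gamma + e
y <- as.vector(X %*% beta + Z %*% gamma + e)

# Implied marginal covariance of y: V = Z G Z' + R
V <- Z %*% G %*% t(Z) + R
V[1:2, 1:2]       # same group: variance 5 on the diagonal, covariance 4 off it
V[1, n_per + 1]   # different groups: covariance 0
```

Observations that share a group are correlated through the shared random intercept, while observations from different groups are not, which is the dependence structure a random-intercept model encodes.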
The dataset was compiled from multiple government agency resources. Egg production rates and grain prices were sourced from the U.S. Department of Agriculture’s National Agricultural Statistics Service monthly report archive (United States Department of Agriculture, National Agricultural Statistics Service (2026b), United States Department of Agriculture, National Agricultural Statistics Service (2026a)). Temperature data was sourced from the National Oceanic and Atmospheric Administration’s Climate at a Glance National Time Series archive (NOAA National Centers for Environmental Information (2026)).
| Variable | Type | Description |
|---|---|---|
| date | Time Index | Monthly observation period from January 2000 through December 2024. |
| state | Categorical | 25 U.S. states with consistently high commercial egg production. |
| eggsPer100Layers | Continuous | Monthly productivity measure defined as the number of eggs produced per 100 laying hens. |
| cornPrice | Continuous | Monthly average U.S. price per bushel of corn (USD). |
| soybeanPrice | Continuous | Monthly average U.S. price per bushel of soybeans (USD). |
| feedPrice | Continuous | Weighted feed cost calculated as 2/3 × cornPrice + 1/3 × soybeanPrice (USD per bushel). |
| avgTemp | Continuous | State-level monthly average temperature (°F). |
# loading packages
library(tidyverse)
library(ggthemes)
library(lubridate)  # year() and month(); attached by tidyverse >= 2.0.0, loaded explicitly for older versions

# load data
eggData <- read.csv("USDA_Egg_Data.csv")

# clean data for visuals
eggData_clean <- eggData %>%
  select(date, state, eggsPer100Layers, avgTemp, feedPrice) %>%
  mutate(
    state = str_trim(state),
    eggsPer100Layers = as.numeric(eggsPer100Layers),
    avgTemp = as.numeric(avgTemp),
    # assumes date is stored as "YYYY-MM"; append a day so it parses as a Date
    Date = as.Date(paste0(date, "-01")),
    Year = year(Date),
    Month = month(Date)
  )
# time series plot: one panel per state with a loess trend
timeSeries <- eggData_clean %>% ggplot(mapping = aes(x = Date, y = eggsPer100Layers))
timeSeries + geom_line(alpha = 0.6) +
  geom_smooth(method = "loess", se = FALSE, span = 0.25) +
  facet_wrap(~ state, ncol = 5, scales = "free_y") +
  labs(title = "Eggs per 100 Layers Over Time (2000–2024)",
       x = "Date",
       y = "Eggs per 100 Layers") +
  theme_bw()

Seasonality appears in every state, as shown by the regularly repeating peaks and troughs, which suggests that laying productivity follows an annual pattern. Nearly all states show some increase in egg productivity over the period. Alabama is the exception: its trend line stays roughly flat over the 25 years, with only minor variability.
# month box plot
boxPlot <- eggData_clean %>% ggplot(mapping = aes(x = factor(Month), y = eggsPer100Layers))
boxPlot + geom_boxplot(outlier.alpha = 0.3) +
  labs(title = "Seasonality: Eggs per 100 Layers by Month",
       x = "Month",
       y = "Eggs per 100 Layers") +
  theme_bw()

This plot shows differences in egg production across months. Productivity is higher in January, March, May, July, August, October, and December, and lowest in February. The higher months are exactly the 31-day months and February is the shortest month, so part of this pattern likely reflects the number of days in each month rather than a change in the daily laying rate.
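Because a monthly total depends partly on how many days the month contains, one possible check is to convert eggs per 100 layers to a per-day rate before comparing months. The sketch below is not part of the original analysis; the data values, the helper `n_days_in_month()`, and the column `eggsPer100LayersPerDay` are hypothetical names introduced for illustration.

```r
# Hypothetical monthly totals (eggs per 100 layers), not project data
egg_example <- data.frame(
  Date = as.Date(c("2023-01-01", "2023-02-01", "2023-04-01")),
  eggsPer100Layers = c(2480, 2240, 2400)
)

# Days in a month: first day of the next month minus first day of this month
n_days_in_month <- function(d) {
  next_month <- seq(d, by = "month", length.out = 2)[2]
  as.integer(next_month - d)
}

# Index into the Date vector so each element keeps its Date class
egg_example$days <- vapply(
  seq_along(egg_example$Date),
  function(i) n_days_in_month(egg_example$Date[i]),
  integer(1)
)

egg_example$eggsPer100LayersPerDay <- egg_example$eggsPer100Layers / egg_example$days
egg_example
```

In this toy example the three months have different totals but the same daily rate of 80 eggs per 100 layers, which illustrates how month length alone can produce apparent monthly differences.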
# temperature scatterplot
scatter <- eggData_clean %>% ggplot(mapping = aes(x = avgTemp, y = eggsPer100Layers))
scatter + geom_point(alpha = 0.25) +
  geom_smooth(method = "loess", se = TRUE, span = 0.75) +
  labs(title = "Eggs per 100 Layers vs Average Temperature",
       x = "Average Temperature (°F)",
       y = "Eggs per 100 Layers") +
  theme_bw()

A notable feature of this scatterplot is the sharp decline in the productivity trend once temperature exceeds 75 °F. Kim et al. (2024) explain that laying hens are susceptible to heat stress under high-temperature conditions: a temperature-humidity index of 76 to 81 is considered dangerous, and a value above 81 is categorized as an emergency.
# feed scatterplot
scatter2 <- eggData_clean %>% ggplot(mapping = aes(x = feedPrice, y = eggsPer100Layers))
scatter2 + geom_point(alpha = 0.25) +
  geom_smooth(method = "loess", se = TRUE, span = 0.75) +
  labs(title = "Eggs per 100 Layers vs Feed Price",
       x = "Feed Price (USD/bushel)",
       y = "Eggs per 100 Layers") +
  theme_bw()

This scatterplot suggests a slightly positive relationship between feed price and eggs per 100 layers. This is counterintuitive, since a negative relationship might have been expected.
Data for this analysis were compiled from multiple sources, including USDA egg production reports, agricultural price data, and NOAA climate records. Twenty-five of the most prolific egg-producing states were selected for study, and monthly observations from 2000 through 2024 were merged to create a unified dataset. Feed price was constructed as a weighted combination of corn and soybean prices, defined as \(\frac{2}{3}\) corn price and \(\frac{1}{3}\) soybean price, to reflect approximate poultry feed composition. Missing values were coded as NA.
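The feed price construction can be sketched in a few lines of base R. The prices below are hypothetical and serve only to show the arithmetic; the column names follow the variable table, and a missing input price propagates to a missing feed price.

```r
# Hypothetical prices in USD per bushel (not values from the project dataset)
prices <- data.frame(
  date         = c("2020-01", "2020-02", "2020-03"),
  cornPrice    = c(3.00, 3.60, NA),
  soybeanPrice = c(9.00, 8.40, 8.70)
)

# Weighted feed cost: two-thirds corn, one-third soybeans
prices$feedPrice <- (2 / 3) * prices$cornPrice + (1 / 3) * prices$soybeanPrice

prices
```

For the first row this gives 2/3 × 3.00 + 1/3 × 9.00 = 5.00 USD per bushel, and the third row stays NA because its corn price is missing.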
Model specification and implementation were derived from recommendations for linear mixed-effects modeling as outlined by An Introduction to Linear Mixed-Effects Modeling in R (Brown (2021)). These guidelines highlight the importance of matching model structure to the hierarchical nature of the data, selecting an appropriate random-effects structure, and prioritizing readability over unnecessary model complexity. Accordingly, a random-intercept model was utilized to capture between-state variability while maintaining model stability and facilitating clear interpretation of fixed effects.
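\[ y_{st} = \beta_0 + \beta_1 \text{avgTemp}_{st} + \beta_2 \text{feedPrice}_{t} + \beta_3 \text{Year}_{t} + \sum_{m=2}^{12} \beta_{4,m} \, \mathbb{1}(\text{Month}_{t} = m) + b_{0s} + \varepsilon_{st} \]
Here \(s\) indexes states and \(t\) indexes monthly observations. January (\(m = 1\)) is the reference month, \(b_{0s} \sim \mathcal{N}(0, \sigma_b^2)\) is the state-level random intercept capturing between-state effects, and \(\varepsilon_{st} \sim \mathcal{N}(0, \sigma^2)\) is the residual capturing within-state effects.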
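The month term in this specification corresponds to R's default treatment coding for a factor. A minimal sketch, using hypothetical month codes rather than the project data, shows the eleven indicator columns that `factor(Month)` contributes to the design matrix, with January as the baseline.

```r
# Hypothetical month codes 1-12, repeated for two years
Month <- rep(1:12, times = 2)

# Treatment (dummy) coding as R builds it for factor(Month)
month_design <- model.matrix(~ factor(Month))

dim(month_design)        # 24 rows, 12 columns (intercept + 11 month indicators)
colnames(month_design)   # "(Intercept)", "factor(Month)2", ..., "factor(Month)12"

# A January row has every month indicator equal to zero, so January is the baseline
month_design[1, ]
```

Each month coefficient is therefore read as a difference from January, holding the other predictors fixed.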
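library(lme4)
library(tidyverse)

# random-intercept model: fixed effects for temperature, feed price, year, and month; random intercept per state
egg_mod1 <- lmer(
  eggsPer100Layers ~ avgTemp + feedPrice + Year + factor(Month) + (1 | state),
  data = eggData_clean
)
summary(egg_mod1)

Linear mixed model fit by REML ['lmerMod']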
Formula: eggsPer100Layers ~ avgTemp + feedPrice + Year + factor(Month) +
(1 | state)
Data: eggData_clean
REML criterion at convergence: 86062.3
Scaled residuals:
Min 1Q Median 3Q Max
-7.0778 -0.5744 0.0198 0.6146 4.2742
Random effects:
Groups Name Variance Std.Dev.
state (Intercept) 25050 158.27
Residual 5630 75.03
Number of obs: 7491, groups: state, 25
Fixed effects:
Estimate Std. Error t value
(Intercept) -2.228e+04 3.234e+02 -68.896
avgTemp 3.997e-01 2.034e-01 1.965
feedPrice 1.189e+00 5.570e-01 2.135
Year 1.221e+01 1.613e-01 75.682
factor(Month)2 -2.079e+02 4.290e+00 -48.466
factor(Month)3 7.569e+00 4.874e+00 1.553
factor(Month)4 -7.006e+01 6.005e+00 -11.667
factor(Month)5 -9.531e+00 7.490e+00 -1.272
factor(Month)6 -7.853e+01 8.965e+00 -8.760
factor(Month)7 3.730e-01 9.715e+00 0.038
factor(Month)8 1.410e+00 9.465e+00 0.149
factor(Month)9 -7.008e+01 8.301e+00 -8.443
factor(Month)10 1.701e+01 6.398e+00 2.658
factor(Month)11 -3.842e+01 4.915e+00 -7.817
factor(Month)12 3.223e+01 4.305e+00 7.486
Model output revealed several patterns in egg productivity. A positive time trend was observed, with productivity increasing by approximately 12 eggs per 100 layers per year (estimate 12.21, t = 75.68), which suggests substantial long-term gains in production efficiency. Feed price exhibited a positive association with productivity (1.19 eggs per 100 layers per USD per bushel, t = 2.14), suggesting that higher feed costs may be linked to increased efficiency or improved management practices. Average monthly temperature showed a slightly positive relationship with egg production (0.40 eggs per 100 layers per °F, t = 1.97), though the magnitude of this effect was comparatively small. Relative to January, February showed the largest monthly difference, at about 208 fewer eggs per 100 layers.
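Since `lmer()` reports t values without p-values, a rough way to gauge the uncertainty of these coefficients is an approximate 95% Wald interval. The sketch below copies the estimates and standard errors from the model summary and relies on a normal approximation, which is an assumption rather than the method used in the original analysis.

```r
# Estimates and standard errors copied from the model summary
fixed <- data.frame(
  term      = c("avgTemp", "feedPrice", "Year"),
  estimate  = c(0.3997, 1.189, 12.21),
  std_error = c(0.2034, 0.5570, 0.1613)
)

# Approximate 95% Wald interval: estimate +/- z * SE (normal approximation)
z <- qnorm(0.975)
fixed$lower <- fixed$estimate - z * fixed$std_error
fixed$upper <- fixed$estimate + z * fixed$std_error

transform(fixed, lower = round(lower, 3), upper = round(upper, 3))
```

Under this approximation the intervals for avgTemp and feedPrice only narrowly exclude zero, whereas the interval for Year lies far from zero, which is consistent with reading the temperature and feed price effects cautiously.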
The random-effects results highlighted substantial variability between states: the between-state variance (25,050) was more than four times the within-state residual variance (5,630). State-level differences therefore account for roughly 82% of the variation that remains after the fixed effects are taken into account, underscoring the importance of accounting for state-level heterogeneity in the modeling framework.
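The share of variance attributable to states can be summarized with the intraclass correlation coefficient (ICC), computed here from the two variance components reported in the model summary. This is a simple sketch of the standard random-intercept ICC, conditional on the fixed effects in the model.

```r
var_state    <- 25050  # between-state (random intercept) variance from the summary
var_residual <- 5630   # within-state residual variance from the summary

# ICC: proportion of the remaining variance that lies between states
icc <- var_state / (var_state + var_residual)
round(icc, 3)  # 0.816
```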
library(dplyr)

# average productivity across all states for each month
mean_data <- eggData_clean %>%
  group_by(Date) %>%
  summarize(mean_eggs = mean(eggsPer100Layers, na.rm = TRUE))

ggplot(mean_data, aes(x = Date, y = mean_eggs)) +
  geom_line(linewidth = 1.2) +  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  labs(
    title = "Average Egg Production Over Time",
    x = "Date",
    y = "Eggs per 100 Layers"
  )

The graph above illustrates the average monthly egg production across all states over time. An upward trend is apparent: productivity increased steadily throughout the study period, consistent with the strong positive effect of year indicated by the mixed-effects model. A pronounced seasonal pattern is also evident, with recurring dips in production at regular intervals each year. These cyclical fluctuations support the inclusion of month as a fixed effect in the model. Overall, the figure highlights both long-term growth in egg production and persistent seasonal variation.
# observed vs fitted values from the mixed-effects model
model_data <- model.frame(egg_mod1)
model_data$fitted <- fitted(egg_mod1)

ggplot(model_data, aes(x = fitted, y = eggsPer100Layers)) +
  geom_point(alpha = 0.3) +
  geom_abline(slope = 1, intercept = 0) +
  labs(
    x = "Fitted values",
    y = "Observed eggs per 100 layers",
    title = "Observed vs Fitted Values") +
  theme_bw()

This plot presents the relationship between observed and fitted values from the linear mixed-effects model. The points cluster closely around the 45-degree reference line, indicating strong agreement between predicted and observed egg production values and a good overall fit. Because fitted() includes the estimated state-level random intercepts, part of this agreement reflects state-specific baselines rather than the fixed effects alone.
# state-level random intercepts (deviations from the overall intercept)
re_state <- ranef(egg_mod1)$state
re_state$state <- rownames(re_state)

ggplot(re_state, aes(x = reorder(state, `(Intercept)`), y = `(Intercept)`)) +
  geom_point() +
  coord_flip() +
  labs(
    title = "State-Level Random Intercepts",
    x = "State",
    y = "Deviation from Overall Mean")

This graphic represents each state’s deviation from the overall mean egg production. States with positive values exhibit higher baseline productivity, while those with negative values fall below the overall average. The wide spread of values indicates substantial variability in egg production across states and highlights the importance of including state-level random effects in the model.
This analysis examined the relationship between environmental, economic, and temporal factors and egg production using a linear mixed-effects model. The results demonstrated a strong upward trend in egg productivity over time, indicating substantial long-term improvements in production efficiency across the U.S. egg industry. Feed price was found to have a positive association with productivity, suggesting that economic conditions may influence management decisions and efficiency. By comparison, average temperature exhibited a milder effect, indicating that environmental conditions play a smaller but still measurable role in production outcomes.
In addition to these fixed effects, the model identified substantial variability between states, with state-level differences accounting for a large proportion of the total variation in egg production. This finding highlights the importance of regional factors such as management practices, production systems, and local conditions, which are not fully captured by the observed predictors.
Overall, the results demonstrate that egg production is driven by a combination of long-term industry improvements, seasonal patterns, and substantial differences across states. By incorporating both fixed and random effects, the mixed-effects modeling approach provides a comprehensive framework for understanding these dynamics. These findings suggest that while broad economic and environmental factors influence egg productivity, state-level characteristics play a critical role in determining baseline production levels. This has important implications for both producers and policymakers, as efforts to improve productivity may benefit from strategies tailored to regional conditions rather than relying solely on nationwide trends.
ChatGPT-5.2 was used to fine-tune writing, extract data from original sources, and develop effective code.
Meteyard and Davies (2020)
Brown (2021)
Zhou et al. (2022)
Adhikari, Wu, and Caffe-Treml (2016)
Ryoo (2011)
Schielzeth et al. (2020)
Huitema (2025)
Henderson (1953)
NOAA National Centers for Environmental Information (2026)
United States Department of Agriculture, National Agricultural Statistics Service (2026b)
United States Department of Agriculture, National Agricultural Statistics Service (2026a)
Kim et al. (2024)
Ao (2007)
Mustafa (2023)
Sarafian (2020)
American Egg Board (2013)