Dataset quality strongly affects the predictive performance of data-driven models. This issue can be exacerbated when predicting agricultural production because of the complex interactions between the climate, the environment and the plant's response to these conditions during the season. This study aims to create an empirical model to predict Head Rice Yield (HRY), the primary quality metric for rice growers and millers globally. Model development focused on an industry-level dataset made available by SunRice, Australia's most prominent rice trading company. Using the SunRice data, two dataset construction methods were implemented to evaluate the effect of dataset construction and data pre-processing on model accuracy. The first method aggregated meteorological conditions using estimates of phenology, while the second aggregated them over defined lengths of time. Variants of each construction method were generated to explore the impact of varying the aggregation stages and stage lengths. Each constructed dataset underwent feature selection prior to model training using the XGBoost algorithm with Leave-One-Year-Out Cross-Validation. The time-based construction method proved the most accurate, producing the highest mean model accuracy scores across all pre-processing and model training configurations. The single most accurate model came from the two-week aggregation dataset, which yielded a 125% increase in Lin's Concordance Correlation Coefficient compared to the worst-performing model produced in this study. Developing a highly accurate model that allows for crop stage knowledge discovery is critical for uncovering actionable insights, and the knowledge discovered in this study provides such insights to improve the management of future rice crops for HRY.
The developed model demonstrates the potential for SunRice to predict HRY at the receival point to optimise post-harvest handling and milling. When matched to region-specific data, the dataset construction methods explored can be replicated in other rice-growing regions globally.
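
The three building blocks named in the abstract (time-based aggregation, Leave-One-Year-Out Cross-Validation and Lin's Concordance Correlation Coefficient) can be illustrated with a minimal Python sketch. This is not the study's pipeline: the function names are hypothetical, the choice to drop a trailing partial window is an assumption, and model training (XGBoost) and feature selection are omitted. The coefficient uses the standard definition, 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²), with population (1/n) moments.

```python
from statistics import fmean


def aggregate_fixed_windows(daily_values, window_days=14):
    """Average a daily series over consecutive fixed-length windows.

    Assumption for this sketch: a trailing partial window is dropped.
    """
    full = len(daily_values) - len(daily_values) % window_days
    return [
        fmean(daily_values[start:start + window_days])
        for start in range(0, full, window_days)
    ]


def leave_one_year_out(years):
    """Yield (held_out_year, train_indices, test_indices), one fold per year."""
    for held_out in sorted(set(years)):
        train = [i for i, year in enumerate(years) if year != held_out]
        test = [i for i, year in enumerate(years) if year == held_out]
        yield held_out, train, test


def lins_ccc(observed, predicted):
    """Lin's Concordance Correlation Coefficient using population moments.

    Undefined (division by zero) when both inputs are constant and equal.
    """
    n = len(observed)
    if n == 0 or n != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_o, mean_p = fmean(observed), fmean(predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed) / n
    var_p = sum((p - mean_p) ** 2 for p in predicted) / n
    cov = sum(
        (o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted)
    ) / n
    return 2 * cov / (var_o + var_p + (mean_o - mean_p) ** 2)
```

In a full workflow, each fold from `leave_one_year_out` would train a model on the training indices and score its predictions for the held-out year with `lins_ccc`. Unlike a plain correlation, the coefficient penalises systematic bias: predictions that track the observations perfectly but are offset by a constant score below 1.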

Original language: English
Article number: 108716
Journal: Computers and Electronics in Agriculture
Publication status: Published - Apr 2024


Title: The effect of dataset construction and data pre-processing on the eXtreme Gradient Boosting algorithm applied to head rice yield prediction in Australia
