
The importance of data quantity in machine learning: how small is too small?

Beverly Yang, Andrew Tsai, Amichai Mitelman, Rita Tsai, Davide Elmo

In the proceedings of: GeoSaskatoon 2023: 76th Canadian Geotechnical Conference

Session: Innovative Geotechnical

ABSTRACT: Over the past decade, rock engineering has become more data-driven, leading to increased use of machine learning (ML). ML is a branch of artificial intelligence in which mathematical models are developed so that a computer system can make predictions with minimal human involvement. Such a powerful tool can help rock engineers efficiently uncover complex relationships within data, and it has been used to predict rock mass properties, mining and tunnelling hazards, and slope stability. However, the success and reliability of ML models are directly linked to the quality and quantity of the data available. ML models for rock engineering applications are generally trained on either poor-quality or limited data, which inherently leads to poor and unreliable results with potential adverse real-life impacts. Both data quality and quantity pose significant challenges in rock engineering because many commonly used rock engineering parameters are subjective (resulting in poor-quality data) and because limited data are available in the early stages of the design process. While awareness of the importance of data quality for ML has increased among rock engineers, awareness of the importance of data quantity has not increased comparably. Many rock engineering research articles focused on ML train their models on only a few hundred data points, with some using as few as 80. The results reported for such models can be misleading because of the stochastic nature of ML models and the way the data are shuffled before splitting, which makes the models unreliable. Using synthetic data and surrogate models, this paper aims to demonstrate the importance of data quantity in ML and recommends caution when using small datasets for ML.
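The effect of shuffling before splitting can be illustrated with a toy experiment. The sketch below is not the authors' method or data: it assumes a hypothetical one-dimensional linear relationship with Gaussian noise, fits an ordinary least-squares line, and repeats a shuffled 80/20 train/test split many times to see how much the test R² moves from one shuffle to the next. All names (`make_data`, `split_score_spread`) and all parameter values are illustrative assumptions.

```python
import random
import statistics


def make_data(n, rng, noise=0.5):
    """Synthetic data from an assumed toy relationship: y = 2x + 1 + noise."""
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0.0, noise) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least-squares fit of a straight line; returns (slope, intercept)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination of the fitted line on the given points."""
    my = statistics.fmean(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def split_score_spread(n, n_shuffles=100, test_frac=0.2, seed=0):
    """Return (min, max, std) of test R^2 over repeated shuffled splits of one dataset."""
    rng = random.Random(seed)
    xs, ys = make_data(n, rng)
    idx = list(range(n))
    n_test = max(2, int(n * test_frac))
    scores = []
    for _ in range(n_shuffles):
        rng.shuffle(idx)  # a different shuffle gives a different train/test split
        test, train = idx[:n_test], idx[n_test:]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        scores.append(
            r_squared([xs[i] for i in test], [ys[i] for i in test], slope, intercept)
        )
    return min(scores), max(scores), statistics.stdev(scores)


if __name__ == "__main__":
    for n in (80, 2000):
        lo, hi, sd = split_score_spread(n)
        print(f"n={n}: test R^2 from {lo:.2f} to {hi:.2f} (std {sd:.3f})")
```

Under these assumptions, the spread of test scores across shuffles is noticeably wider for the 80-point dataset than for the 2000-point one, so a single reported split on a small dataset may reflect a lucky or unlucky shuffle rather than the model's actual reliability.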



Access this article:
Canadian Geotechnical Society members can access this article, along with all other Canadian Geotechnical Conference proceedings, in the Member Area. Conference proceedings are also available in many libraries.

Cite this article:
Yang, Beverly, Tsai, Andrew, Mitelman, Amichai, Tsai, Rita, Elmo, Davide (2023) The importance of data quantity in machine learning: how small is too small? in GEO2023. Ottawa, Ontario: Canadian Geotechnical Society.

@inproceedings{Yang_GEO2023_149,
  author = {Yang, Beverly and Tsai, Andrew and Mitelman, Amichai and Tsai, Rita and Elmo, Davide},
  title = {The importance of data quantity in machine learning: how small is too small?},
  booktitle = {Proceedings of the 76th Canadian Geotechnical Conference},
  year = {2023},
  organization = {The Canadian Geotechnical Society},
  address = {Ottawa, Canada}
}
Abstracts are Copyright © the Authors and used with permission. Online database Copyright © 2026 The Canadian Geotechnical Society. All rights reserved.