Exploring the potential of machine learning to understand the occurrence and health risks of haloacetic acids in a drinking water distribution system

  • Ying Yu
  • Md Mahjib Hossain
  • Rabbi Sikder
  • Zhenguo Qi
  • Lixin Huo
  • Ruya Chen
  • Wenyue Dou
  • Baoyou Shi
  • Tao Ye

  • Xiamen University of Technology
  • CAS - Research Center for Eco-Environmental Sciences
  • South Dakota School of Mines & Technology
  • Zhejiang Gongshang University
  • Xuzhou Institute of Technology

Research output: Contribution to journal › Article › peer-review

21 Scopus citations

Abstract

Determining the occurrence of disinfection byproducts (DBPs) in drinking water distribution systems (DWDSs) remains challenging. Predicting DBPs from readily available water quality parameters can help to understand DBP-associated risks and capture the complex interrelationships between water quality and DBP occurrence. In this study, we collected drinking water samples from a distribution network over one year and measured haloacetic acids (HAAs) and the related water quality parameters (WQPs). Twelve machine learning (ML) algorithms were evaluated. Random Forest (RF) achieved the best performance for predicting HAA concentration (R² of 0.78 and RMSE of 7.74). Rather than using cytotoxicity or genotoxicity separately as a surrogate for HAA-associated toxicity, we created a health risk index (HRI), calculated as the sum of the cytotoxicity and genotoxicity of HAAs following the widely used Tic-Tox approach. ML models were likewise developed to predict the HRI, and the RF model again performed best, with R² of 0.69 and RMSE of 0.38. To explore more advanced ML approaches, we developed three models using uncertainty-based active learning. The Categorical Boosting Regression (CAT) model developed through active learning substantially outperformed the other models, achieving R² of 0.87 and 0.82 for predicting HAA concentration and the HRI, respectively. Feature importance analysis with the CAT model revealed that temperature, ions (e.g., chloride and nitrate), and dissolved organic carbon (DOC) concentration in the distribution network had a significant impact on the occurrence of HAAs, whereas chloride ion, pH, oxidation-reduction potential (ORP), and free chlorine were the most important features for HRI prediction. This study demonstrates that ML has potential for predicting HAA occurrence and toxicity. By identifying the key WQPs that affect HAA occurrence and toxicity, this research offers valuable insights for targeted DBP mitigation strategies.
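
The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the HAA weight tables, and the example numbers are hypothetical placeholders, and the actual Tic-Tox potency values and concentration units are not given here. The sketch shows the two reported metrics (R² and RMSE), an index formed as the sum of a cytotoxicity term and a genotoxicity term, and one common way to pick samples in uncertainty-based active learning, namely choosing the candidates on which an ensemble of models disagrees most.

```python
import math
from statistics import pstdev


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def health_risk_index(conc, cyto_weight, geno_weight):
    """Sum of concentration-weighted cytotoxicity and genotoxicity terms.

    `conc` maps each HAA species to its concentration; the two weight
    tables are ASSUMED placeholders, not published potency values.
    """
    cyto = sum(c * cyto_weight[name] for name, c in conc.items())
    geno = sum(c * geno_weight[name] for name, c in conc.items())
    return cyto + geno


def most_uncertain(ensemble_preds, k):
    """Indices of the k candidates with the largest ensemble spread.

    ensemble_preds[i] holds the predictions of every ensemble member for
    candidate i; the spread (population std. dev.) is used as a simple
    uncertainty score, which is one assumed choice among several.
    """
    spread = [pstdev(preds) for preds in ensemble_preds]
    order = sorted(range(len(spread)), key=lambda i: spread[i], reverse=True)
    return order[:k]


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    conc = {"DCAA": 2.0, "TCAA": 1.0}
    cyto_w = {"DCAA": 0.5, "TCAA": 1.0}
    geno_w = {"DCAA": 0.25, "TCAA": 0.0}
    print("HRI:", health_risk_index(conc, cyto_w, geno_w))
    print("RMSE:", rmse([1, 2, 3], [1, 2, 5]))
    print("query next:", most_uncertain([[1, 1, 1], [0, 2, 4], [1, 2, 3]], 2))
```

In an active learning loop, the samples returned by `most_uncertain` would be labeled (measured) next and added to the training set before the model is refit; how the study itself scored uncertainty is not stated in the abstract.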

Original language: English
Article number: 175573
Journal: Science of the Total Environment
Volume: 951
State: Published - 15 Nov 2024

UN SDGs

This output contributes to the following UN Sustainable Development Goals (SDGs)

  1. SDG 3 - Good Health and Well-being

Keywords

  • Disinfection byproducts
  • Drinking water distribution system
  • Haloacetic acids
  • Machine learning
  • Toxicity
  • Uncertainty based active learning
