Measuring Robustness for NLP

Yu Yu, Abdul Rafae Khan, Jia Xu

Research output: Contribution to journal › Conference article › peer-review


Abstract

The quality of Natural Language Processing (NLP) models is typically measured by the accuracy or error rate on a predefined test set. Because the evaluation and optimization of these measures are narrowed to a specific domain, such as news, and do not generalize to other domains, such as Twitter, we often observe that a system reported to achieve human parity produces surprising errors in real-life use scenarios. We address this weakness with a new approach that uses an NLP quality measure based on robustness. Unlike previous work that defines robustness via Minimax to bound worst cases, we measure robustness based on the consistency of cross-domain accuracy and introduce the coefficient of variation and (ϵ, γ)-Robustness. Our measures show higher agreement with human evaluation than accuracy scores such as BLEU when ranking Machine Translation (MT) systems.
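As a rough illustration of the consistency-based view of robustness described in the abstract, the sketch below computes the coefficient of variation (standard deviation divided by mean) over per-domain scores of a system. The domain names, scores, and helper name are hypothetical, and the paper's exact (ϵ, γ)-Robustness definition is not reproduced here.

```python
import statistics

def coefficient_of_variation(domain_scores):
    """Coefficient of variation (std / mean) over per-domain scores.

    A lower value indicates more consistent cross-domain quality,
    which the abstract uses as the basis of its robustness measure.
    """
    scores = list(domain_scores.values())
    mean = statistics.mean(scores)
    std = statistics.pstdev(scores)  # spread over the observed domains
    return std / mean

# Hypothetical per-domain BLEU scores for two MT systems.
system_a = {"news": 34.1, "twitter": 21.5, "medical": 18.2}
system_b = {"news": 31.0, "twitter": 27.8, "medical": 26.9}

for name, scores in [("A", system_a), ("B", system_b)]:
    print(f"System {name}: mean={statistics.mean(scores.values()):.1f}, "
          f"CV={coefficient_of_variation(scores):.3f}")
```

Under this hypothetical setup, system B would rank as more robust than system A despite a lower in-domain (news) score, which is the kind of distinction the proposed measures are intended to capture.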

Original language: English
Pages (from-to): 3908-3916
Number of pages: 9
Journal: Proceedings - International Conference on Computational Linguistics, COLING
Volume: 29
Issue number: 1
State: Published - 2022
Event: 29th International Conference on Computational Linguistics, COLING 2022 - Gyeongju, Korea, Republic of
Duration: 12 Oct 2022 - 17 Oct 2022
