TY - CPAPER
T1 - HateModerate: Testing Hate Speech Detectors against Content Moderation Policies
T2 - Findings of the Association for Computational Linguistics: NAACL 2024
AU - Zheng, Jiangrui
AU - Liu, Xueqing
AU - Yang, Guanqun
AU - Haque, Mirazul
AU - Qian, Xing
AU - Rathnasuriya, Ravishka
AU - Yang, Wei
AU - Budhrani, Girish
N1 - Publisher Copyright:
© 2024 Association for Computational Linguistics.
PY - 2024
Y1 - 2024
AB - To protect users from massive amounts of hateful content, existing work has studied automated hate speech detection. Despite these efforts, one question remains: do automated hate speech detectors conform to social media content policies? A platform's content policies are a checklist of the content it moderates. Because content moderation rules are often uniquely defined, existing hate speech datasets cannot directly answer this question. This work seeks to answer it by creating HateModerate, a dataset for testing the behaviors of automated content moderators against content policies. First, we engage 28 annotators and GPT in a six-step annotation process, resulting in hateful and non-hateful test suites matching each of Facebook's 41 hate speech policies. Second, we test the performance of state-of-the-art hate speech detectors against HateModerate, revealing substantial failures of these models to conform to the policies. Third, using HateModerate, we augment the training data of a top-downloaded hate speech detector on HuggingFace. We observe significant improvement in the model's conformity to content policies while maintaining comparable scores on the original test data. Our dataset and code are available at https://github.com/stevens-textmining/HateModerate.
UR - http://www.scopus.com/inward/record.url?scp=85197900343&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85197900343&partnerID=8YFLogxK
DO - 10.18653/v1/2024.findings-naacl.172
M3 - Conference contribution
AN - SCOPUS:85197900343
T3 - Findings of the Association for Computational Linguistics: NAACL 2024 - Findings
SP - 2691
EP - 2710
BT - Findings of the Association for Computational Linguistics: NAACL 2024
A2 - Duh, Kevin
A2 - Gomez, Helena
A2 - Bethard, Steven
Y2 - 16 June 2024 through 21 June 2024
ER -