TY - GEN
T1 - Grouping Words with Semantic Diversity
AU - Chubarian, Karine
AU - Khan, Abdul Rafae
AU - Sidiropoulos, Anastasios
AU - Xu, Jia
N1 - Publisher Copyright: © 2021 Association for Computational Linguistics.
PY - 2021
Y1 - 2021
N2 - Deep Learning-based NLP systems can be sensitive to unseen tokens and hard to learn with high-dimensional inputs, which critically hinder learning generalization. We introduce an approach by grouping input words based on their semantic diversity to simplify input language representation with low ambiguity. Since the semantically diverse words reside in different contexts, we are able to substitute words with their groups and still distinguish word meanings relying on their contexts. We design several algorithms that compute diverse groupings based on random sampling, geometric distances, and entropy maximization, and we prove formal guarantees for the entropy-based algorithms. Experimental results show that our methods generalize NLP models and demonstrate enhanced accuracy on POS tagging and LM tasks and significant improvements on medium-scale machine translation tasks, up to +6.5 BLEU points. Our source code is available at https://github.com/abdulrafae/dg.
UR - http://www.scopus.com/inward/record.url?scp=85137677206&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85137677206&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85137677206
T3 - NAACL-HLT 2021 - 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference
SP - 3217
EP - 3228
BT - NAACL-HLT 2021 - 2021 Conference of the North American Chapter of the Association for Computational Linguistics
T2 - 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021
Y2 - 6 June 2021 through 11 June 2021
ER -