TY - JOUR
T1 - MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation
T2 - IEEE Transactions on Software Engineering
AU - Cassano, Federico
AU - Gouwar, John
AU - Nguyen, Daniel
AU - Nguyen, Sydney
AU - Phipps-Costin, Luna
AU - Pinckney, Donald
AU - Yee, Ming Ho
AU - Zi, Yangtian
AU - Anderson, Carolyn Jane
AU - Feldman, Molly Q.
AU - Guha, Arjun
AU - Greenberg, Michael
AU - Jangda, Abhinav
N1 - Publisher Copyright:
© 1976-2012 IEEE.
PY - 2023/7/1
Y1 - 2023/7/1
AB - Large language models have demonstrated the ability to generate both natural language and programming language text. Although contemporary code generation models are trained on corpora with several programming languages, they are tested using benchmarks that are typically monolingual. The most widely used code generation benchmarks only target Python, so there is little quantitative evidence of how code generation models perform on other programming languages. We propose MultiPL-E, a system for translating unit test-driven code generation benchmarks to new languages. We create the first massively multilingual code generation benchmark by using MultiPL-E to translate two popular Python code generation benchmarks to 18 additional programming languages. We use MultiPL-E to extend the HumanEval benchmark (Chen et al., 2021) and MBPP benchmark (Austin et al., 2021) to 18 languages that encompass a range of programming paradigms and popularity. Using these new parallel benchmarks, we evaluate the multi-language performance of three state-of-the-art code generation models: Codex (Chen et al., 2021), CodeGen (Nijkamp et al., 2022), and InCoder (Fried et al., 2022). We find that Codex matches or even exceeds its performance on Python for several other languages. The range of programming languages represented in MultiPL-E allows us to explore the impact of language frequency and language features on model performance. Finally, the MultiPL-E approach of compiling code generation benchmarks to new programming languages is both scalable and extensible, making it straightforward to evaluate new models, benchmarks, and languages.
KW - B.2.3 reliability, testing, and fault-tolerance
KW - I.5.1.D neural nets
UR - http://www.scopus.com/inward/record.url?scp=85153525587&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85153525587&partnerID=8YFLogxK
U2 - 10.1109/TSE.2023.3267446
DO - 10.1109/TSE.2023.3267446
M3 - Article
AN - SCOPUS:85153525587
SN - 0098-5589
VL - 49
SP - 3675
EP - 3691
JO - IEEE Transactions on Software Engineering
JF - IEEE Transactions on Software Engineering
IS - 7
ER -