TY - JOUR
T1 - On the structure and semantics of identifier names containing closed syntactic category words
AU - Newman, Christian D.
AU - Peruma, Anthony
AU - AlOmar, Eman Abdullah
AU - Crabbe, Mahie
AU - Banabilah, Syreen
AU - Alsuhaibani, Reem S.
AU - Decker, Michael J.
AU - Akhbardeh, Farhad
AU - Zampieri, Marcos
AU - Mkaouer, Mohamed Wiem
AU - Maletic, Jonathan I.
N1 - Publisher Copyright: © The Author(s) 2025.
PY - 2025/9
Y1 - 2025/9
N2 - Identifier names are crucial components of code, serving as primary clues for developers to understand program behavior. This paper investigates the linguistic structure of identifier names by extending the concept of grammar patterns, which represent the part-of-speech (PoS) sequences underlying identifier phrases. The specific focus is on closed syntactic categories (e.g., prepositions, conjunctions, determiners), which are rarely studied in software engineering despite their central role in general natural language. To study these categories, the Closed Category Identifier Dataset (CCID), a new manually annotated dataset of 1,275 identifiers drawn from 30 open-source systems, is constructed and presented. The relationship between closed-category grammar patterns and program behavior is then analyzed using grounded-theory-inspired coding, statistical, and pattern analysis. The results reveal recurring structures that developers use to express concepts such as control flow, data transformation, temporal reasoning, and other behavioral roles through naming. This work contributes an empirical foundation for understanding how linguistic resources encode behavior in identifier names and supports new directions for research in naming, program comprehension, and education.
KW - Closed category terms
KW - Identifier naming
KW - Naming conventions
KW - Part of speech tagging
KW - Program comprehension
KW - Software linguistics
KW - Software maintenance and evolution
UR - https://www.scopus.com/pages/publications/105011415383
U2 - 10.1007/s10664-025-10699-x
DO - 10.1007/s10664-025-10699-x
M3 - Article
AN - SCOPUS:105011415383
SN - 1382-3256
VL - 30
JO - Empirical Software Engineering
JF - Empirical Software Engineering
IS - 5
M1 - 148
ER -