TY - JOUR
T1 - Author gender identification from text
AU - Cheng, Na
AU - Chandramouli, R.
AU - Subbalakshmi, K. P.
PY - 2011/7
Y1 - 2011/7
N2 - Text is still the most prevalent Internet media type. Examples of this include popular social networking applications such as Twitter, Craigslist, Facebook, etc. Other web applications such as e-mail, blog, chat rooms, etc. are also mostly text based. A question we address in this paper that deals with text based Internet forensics is the following: given a short text document, can we identify if the author is a man or a woman? This question is motivated by recent events where people faked their gender on the Internet. Note that this is different from the authorship attribution problem. In this paper we investigate author gender identification for short length, multi-genre, content-free text, such as the ones found in many Internet applications. Fundamental questions we ask are: do men and women inherently use different classes of language styles? If this is true, what are good linguistic features that indicate gender? Based on research in human psychology, we propose 545 psycho-linguistic and gender-preferential cues along with stylometric features to build the feature space for this identification problem. Note that identifying the correct set of features that indicate gender is an open research problem. Three machine learning algorithms (support vector machine, Bayesian logistic regression and AdaBoost decision tree) are then designed for gender identification based on the proposed features. Extensive experiments on large text corpora (Reuters Corpus Volume 1 newsgroup data and Enron e-mail data) indicate an accuracy up to 85.1% in identifying the gender. Experiments also indicate that function words, word-based features and structural features are significant gender discriminators.
AB - Text is still the most prevalent Internet media type. Examples of this include popular social networking applications such as Twitter, Craigslist, Facebook, etc. Other web applications such as e-mail, blog, chat rooms, etc. are also mostly text based. A question we address in this paper that deals with text based Internet forensics is the following: given a short text document, can we identify if the author is a man or a woman? This question is motivated by recent events where people faked their gender on the Internet. Note that this is different from the authorship attribution problem. In this paper we investigate author gender identification for short length, multi-genre, content-free text, such as the ones found in many Internet applications. Fundamental questions we ask are: do men and women inherently use different classes of language styles? If this is true, what are good linguistic features that indicate gender? Based on research in human psychology, we propose 545 psycho-linguistic and gender-preferential cues along with stylometric features to build the feature space for this identification problem. Note that identifying the correct set of features that indicate gender is an open research problem. Three machine learning algorithms (support vector machine, Bayesian logistic regression and AdaBoost decision tree) are then designed for gender identification based on the proposed features. Extensive experiments on large text corpora (Reuters Corpus Volume 1 newsgroup data and Enron e-mail data) indicate an accuracy up to 85.1% in identifying the gender. Experiments also indicate that function words, word-based features and structural features are significant gender discriminators.
KW - Decision tree
KW - Gender identification
KW - Logistic regression
KW - Psycho-linguistic analysis
KW - Support vector machine
KW - Text mining
UR - http://www.scopus.com/inward/record.url?scp=80051667271&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=80051667271&partnerID=8YFLogxK
U2 - 10.1016/j.diin.2011.04.002
DO - 10.1016/j.diin.2011.04.002
M3 - Article
AN - SCOPUS:80051667271
SN - 1742-2876
VL - 8
SP - 78
EP - 88
JO - Digital Investigation
JF - Digital Investigation
IS - 1
ER -