Author gender identification from text

Na Cheng, R. Chandramouli, K. P. Subbalakshmi

Research output: Contribution to journal › Article › peer-review

194 Scopus citations

Abstract

Text is still the most prevalent Internet media type. Examples include popular social networking applications such as Twitter, Craigslist, and Facebook. Other web applications such as e-mail, blogs, and chat rooms are also mostly text-based. A question we address in this paper, which deals with text-based Internet forensics, is the following: given a short text document, can we identify whether the author is a man or a woman? This question is motivated by recent events in which people faked their gender on the Internet. Note that this is different from the authorship attribution problem. In this paper we investigate author gender identification for short, multi-genre, content-free text, such as that found in many Internet applications. The fundamental questions we ask are: do men and women inherently use different classes of language styles? If so, what are good linguistic features that indicate gender? Based on research in human psychology, we propose 545 psycho-linguistic and gender-preferential cues along with stylometric features to build the feature space for this identification problem. Note that identifying the correct set of features that indicate gender is an open research problem. Three machine learning algorithms (support vector machine, Bayesian logistic regression, and AdaBoost decision tree) are then designed for gender identification based on the proposed features. Extensive experiments on large text corpora (Reuters Corpus Volume 1 newsgroup data and Enron e-mail data) indicate an accuracy of up to 85.1% in identifying gender. Experiments also indicate that function words, word-based features, and structural features are significant gender discriminators.
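
As a rough illustration of the three feature families the abstract singles out (function words, word-based features, and structural features), the following minimal Python sketch turns a short text into a numeric feature vector. It is not the authors' implementation: the name `extract_features`, the ten-word `FUNCTION_WORDS` list, and the specific measures are assumptions chosen for brevity, and the paper's 545 cues are not reproduced here. A vector like this would then be passed to a classifier such as a support vector machine.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny function-word list (an assumption for this
# sketch; it is not the feature set proposed in the paper).
FUNCTION_WORDS = ("a", "an", "and", "but", "i", "in", "of", "the", "to", "you")


def extract_features(text: str) -> dict:
    """Map a short text to a small dictionary of stylometric features."""
    # Word tokens: runs of letters and apostrophes, lower-cased.
    words = re.findall(r"[A-Za-z']+", text.lower())
    # Sentences: non-empty pieces between runs of ., ! or ?.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(words) or 1          # guard against empty input
    n_sentences = len(sentences) or 1  # guard against empty input
    counts = Counter(words)

    # Function-word features: relative frequency of each listed word.
    features = {f"fw_{w}": counts[w] / n_words for w in FUNCTION_WORDS}

    # Word-based features: average word length and vocabulary richness.
    features["avg_word_length"] = sum(len(w) for w in words) / n_words
    features["type_token_ratio"] = len(counts) / n_words

    # Structural feature: average number of words per sentence.
    features["avg_sentence_length"] = len(words) / n_sentences
    return features


if __name__ == "__main__":
    sample = "I went to the store. You stayed in the car!"
    for name, value in extract_features(sample).items():
        print(f"{name}: {value:.3f}")
```

Relative frequencies are used instead of raw counts so that texts of different lengths stay comparable, which matters for the short documents the abstract targets.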

Original language: English
Pages (from-to): 78-88
Number of pages: 11
Journal: Digital Investigation
Volume: 8
Issue number: 1
State: Published - Jul 2011

Keywords

  • Decision tree
  • Gender identification
  • Logistic regression
  • Psycho-linguistic analysis
  • Support vector machine
  • Text mining
