Dissertation Defense

Demographic-Aware Natural Language Processing

Aparna Garimella
3316 EECS Building

The underlying traits of our demographic group affect and shape our thoughts, and therefore surface in the way we express ourselves and use language in our day-to-day life. Understanding and analyzing language use across different groups of people helps uncover the demographic particularities of the speakers. Conversely, leveraging these differences could lead to the development of better language representations, thus enabling further demographic-focused refinements in natural language processing (NLP) tasks. In this thesis, I employ methods rooted in computational linguistics to better understand various demographic groups through their language use. The thesis makes two main contributions.

First, it provides empirical evidence that words are indeed used differently by different demographic groups in naturally occurring text. Through experiments conducted on very large data sets covering usage scenarios for hundreds of frequent words, I show that automatic classification methods can effectively distinguish how words are used by different demographic groups. I also conduct feature analyses to compare how well the features used encode these differences, shedding light on how the various attributes perform in highlighting them.

Second, the thesis explores whether demographic differences in word usage can inform the development of more refined approaches to NLP tasks. Specifically, I start by investigating the task of word association prediction. The thesis shows that, by going beyond the traditional “one-size-fits-all” approach, demographic-aware models perform better than generic ones at predicting word associations for different demographic groups. Next, I investigate the impact of demographic information on part-of-speech tagging and syntactic parsing, where experiments reveal numerous part-of-speech tags and syntactic relations whose predictions benefit from the prevalence of a specific group in the training data. Finally, I explore demographic-specific humor generation and develop a humor generation framework that fills in the blanks to create funny stories, taking into account people’s demographic backgrounds.

Sponsored by

Rada Mihalcea


Sonya Siddique