Beyond Topic: Multi-dimensional Topic Models of Text
Over the past decade, research on and applications of topic models have exploded, paving the way for new types of analyses of text data. Topic models discover latent concepts and themes in text corpora through unsupervised learning, which allows them to be applied to a diverse set of corpora. However, topic is only one of many aspects that can influence text, and data analysts are often interested in others: sentiment, perspective, aspect, etc.
In this work, we present multi-dimensional topic models of text, which jointly capture topic and other aspects of text. We begin by presenting Factorial Latent Dirichlet Allocation (f-LDA), a multi-dimensional model in which a document is influenced by K different factors and each word token depends on a K-dimensional vector of latent variables. Our model incorporates structured word priors and learns a sparse product of factors. We demonstrate the flexibility of f-LDA by surveying several applications, each with a different notion of factors, including sentiment analysis of reviews and extractive summarization of key concepts from discussion forums.
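To make the idea of a K-dimensional vector of latent variables per token concrete, the following is a minimal sketch of a factorial generative process, not the f-LDA model as presented in the talk. It assumes two hypothetical factors (say, topic and sentiment), a uniform document prior over factor tuples, and a simple log-linear word prior that sums one weight vector per factor value; the function names, parameter values, and the form of the prior are all illustrative assumptions, and the learned sparsity over the product of factors is omitted.

```python
import itertools
import math
import random


def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(factor_sizes, vocab_size, num_docs, doc_len, seed=0):
    """Generate documents whose tokens each carry a tuple of latent factor values.

    factor_sizes: number of values per factor, e.g. (3, 2) for a
    hypothetical setup with 3 topics and 2 sentiment polarities.
    """
    rng = random.Random(seed)

    # Every combination of factor values, e.g. (topic, sentiment).
    tuples = list(itertools.product(*[range(n) for n in factor_sizes]))

    # Assumed structured prior: one weight vector over the vocabulary
    # for each value of each factor (illustrative, not the talk's exact form).
    omega = [
        [[rng.gauss(0.0, 0.5) for _ in range(vocab_size)] for _ in range(n)]
        for n in factor_sizes
    ]

    # Word distribution for each tuple, with a Dirichlet prior whose
    # parameters combine the per-factor weights log-linearly.
    phi = {}
    for t in tuples:
        alphas = [
            math.exp(sum(omega[k][t[k]][w] for k in range(len(t))))
            for w in range(vocab_size)
        ]
        phi[t] = sample_dirichlet(alphas, rng)

    docs = []
    for _ in range(num_docs):
        # Document-level distribution over factor tuples (uniform prior assumed).
        theta = sample_dirichlet([1.0] * len(tuples), rng)
        tokens = []
        for _ in range(doc_len):
            z = rng.choices(tuples, weights=theta, k=1)[0]  # latent vector
            w = rng.choices(range(vocab_size), weights=phi[z], k=1)[0]  # word id
            tokens.append((z, w))
        docs.append(tokens)
    return docs
```

Calling `generate_corpus((3, 2), vocab_size=20, num_docs=4, doc_len=10)` returns four documents of ten `(latent_vector, word_id)` pairs each. Because words that score highly under a given factor value receive a larger prior in every tuple containing that value, tuples sharing a value tend to share vocabulary, which is the intuition behind tying the word priors across factors.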
Mark Dredze is an Assistant Research Professor in Computer Science at Johns Hopkins University and a research scientist at the Human Language Technology Center of Excellence. He is also affiliated with the Center for Language and Speech Processing and the Center for Population Health Information Technology. His research in natural language processing and machine learning has focused on graphical models, semi-supervised learning, information extraction, large-scale learning, and speech processing. His recent work focuses on health information applications, including information extraction from social media and from biomedical and clinical texts. He obtained his PhD from the University of Pennsylvania in 2009.