2024 at Michigan AI: Highlights
by Trenton Chang (PhD Candidate)

2024 was an amazing year for the Michigan AI lab, with research on topics ranging from large language models to financial markets by our students, faculty members, and visitors appearing in top conferences around the world. Check out some highlights from our research: papers that received awards or special recognition!
The Effect of Liquidity on the Spoofability of Financial Markets
Anri Gu, Yongzhao Wang, Chris Mascioli, Rahul Savani, Theodore L. Turocy, Mithun Chakraborty, and Michael P. Wellman
Best Paper, ICAIF ‘24
Imagine you’re trading stocks, and you see a huge wave of orders for a specific stock, raising the stock price. You figure a lot of people must believe in this stock, so you hop on the bandwagon and put your money in. Suddenly, the orders vanish — the price crashes, and you’re left holding the bag. This is spoofing: placing fake orders to make something look more or less valuable than it actually is. Markets depend on what others believe about the value of items, and spoofing makes it hard for everyone to get a fair deal. In the age of AI-powered trading, it’s important to understand when spoofing is most harmful. This paper analyzes the impact of market liquidity, or how easy it is to make a transaction, on spoofing.
They found that, in highly liquid markets, spoofing isn’t as effective: spoofers need to make more, less-profitable trades, since they can’t move market prices as much. In contrast, in less liquid markets, a spoofer can make fewer trades but profit more on each one, ultimately coming out ahead overall. The findings suggest that maintaining some level of liquidity in financial markets can mitigate the worst effects of spoofing.
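To make that tradeoff concrete, here is a toy back-of-the-envelope sketch. The numbers are purely hypothetical and do not come from the paper; the only point is that a spoofer’s total profit depends on both how many manipulative trades they make and how much each one earns.

```python
# Purely hypothetical numbers, not results from the paper -- the point is
# only the structure of the tradeoff: total spoofing profit is
# (number of manipulative trades) x (profit per trade).

def total_profit(num_trades: int, profit_per_trade: float) -> float:
    return num_trades * profit_per_trade

# Liquid market: prices are hard to move, so many trades, each earning little.
print(total_profit(num_trades=100, profit_per_trade=0.5))   # 50.0
# Illiquid market: a few orders move the price a lot, so each trade earns more.
print(total_profit(num_trades=10, profit_per_trade=20.0))   # 200.0
```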
MIDGARD: Self-Consistency Using Minimum Description Length for Structured Commonsense Reasoning
Inderjeet Nair, Lu Wang
Area Chair Award, ACL ‘24
Despite impressive advances, large language models (LLMs) like ChatGPT still struggle with common-sense reasoning. Given a simple task like “go to the store and get groceries,” LLMs aren’t always able to figure out what to do step-by-step. If we want to know whether an LLM actually “understands” a task, the underlying reasoning is important to analyze. There are a few methods that extract these “chains of reasoning” from LLMs, but they run into some limitations, since an LLM usually makes a single attempt, generating one word at a time, much like what you might see when you talk to ChatGPT online. MIDGARD asks: what if we let the LLM try multiple times, and take the parts that are most consistent? Using a metric called the “minimum description length,” MIDGARD combines the LLM’s responses by searching for the reasoning chain that’s “closest” on average to the LLM’s tries. This retains pieces of reasoning that appear in many of the LLM’s answers while filtering out parts that only appear in a few.
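To give a flavor of the “keep what’s consistent” idea, here is a minimal sketch in Python. It is not the paper’s actual MDL-based algorithm: the reasoning “graphs” below are made-up toy examples, and we simply keep the edges that appear in at least half of the LLM’s attempts.

```python
from collections import Counter

# Each LLM attempt is represented as a set of reasoning-graph edges, e.g.
# ("drive to store", "buy groceries") means the first step comes before the
# second. These samples are made up for illustration.
samples = [
    {("get keys", "drive to store"), ("drive to store", "buy groceries")},
    {("get keys", "drive to store"), ("drive to store", "buy groceries"),
     ("buy groceries", "water the plants")},          # a spurious one-off step
    {("get keys", "drive to store"), ("drive to store", "buy groceries"),
     ("buy groceries", "drive home")},
]

def aggregate(samples, keep_fraction=0.5):
    """Keep edges that appear in at least `keep_fraction` of the samples."""
    counts = Counter(edge for graph in samples for edge in graph)
    threshold = keep_fraction * len(samples)
    return {edge for edge, count in counts.items() if count >= threshold}

print(aggregate(samples))
# Edges shared by most attempts survive; one-off edges like
# ("buy groceries", "water the plants") are filtered out.
```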
Understanding the Capabilities and Limitations of Large Language Models for Cultural Commonsense
Siqi Shen, Lajanugen Logeswaran, Moontae Lee, Honglak Lee, Soujanya Poria, Rada Mihalcea
Social Impact Award, NAACL ‘24
What color is a wedding dress? When you visit someone, should you bring a gift?
The answer depends on who you ask: different cultures have different customs and norms. But we don’t know whether large language models (LLMs) reflect these differences. This paper finds that LLMs struggle with cultural common sense: they’re not as good at answering culture-specific questions as general ones. In other words, compared to statements like “water freezes when it is cold,” which are true everywhere, LLMs tend to struggle with questions like “which side of the car does the driver sit on?” The language used to ask the LLM these questions also matters: English usually works best, and even switching languages might not help answer questions about that language’s culture. These findings are important to consider as we design LLMs that can accommodate a variety of users around the world.
Causally Testing Gender Bias in LLMs: A Case Study on Occupational Bias
Yuen Chen, Vethavikashini Chithrra Raghuram, Justus Mattern, Rada Mihalcea, Zhijing Jin
Best Paper, NeurIPS ‘24, Workshop on Causality and Large Models
Large language models generate a lot of text, but some of it might be biased against certain demographic groups. While prior works have tried to measure these biases, this paper argues that past measurements have limitations: the prompts might leak information about the stereotypes, or human annotations could themselves be biased, among other issues.
They propose a causally-motivated benchmark called OccuGender that aims to isolate the baseline level of gender bias in an LLM’s responses, and find that models of varying sizes exhibit occupational biases: they consistently predict that an individual with a certain occupation is of a stereotypical gender.
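To see what a prompt-based probe of occupational bias can look like in practice, here is a toy example using the Hugging Face transformers library. This is not the OccuGender benchmark itself: the prompt template, the masked-language model, and the occupations below are illustrative choices. The general idea, asking a model to fill in a pronoun for a given occupation and comparing the probabilities, is the same.

```python
# A toy probe in the spirit of prompt-based occupational-bias tests.
# NOT the OccuGender benchmark; prompts, model, and occupations are
# illustrative placeholders.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

occupations = ["nurse", "mechanic", "teacher", "engineer"]

for occ in occupations:
    prompt = f"The {occ} said that [MASK] would be late."
    # How much probability mass the model puts on gendered pronouns.
    scores = {r["token_str"]: r["score"] for r in fill(prompt, top_k=50)}
    he, she = scores.get("he", 0.0), scores.get("she", 0.0)
    print(f"{occ:10s}  he={he:.3f}  she={she:.3f}")
```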
Language Model Alignment in Multilingual Trolley Problems
Zhijing Jin, Max Kleiman-Weiner, Giorgio Piatti, Sydney Levine, Jiarui Liu, Fernando Gonzalez, Francesco Ortu, András Strausz, Mrinmaya Sachan, Rada Mihalcea, Yejin Choi, Bernhard Schölkopf
Best Paper, NeurIPS ‘24, Workshop on Pluralistic Alignment
Ever wondered how large language models respond to moral dilemmas? This paper analyzed a dataset featuring a large number of human responses to moral dilemmas, and compared them to how LLMs respond, depending on whether the dilemma concerned humans or animals, how many lives were at stake, and many other criteria. They found that a large language model’s priorities don’t always match human moral priorities. However, they also found that LLMs don’t just have one point of view: the priorities of the LLM differ depending on the language used.
As we think about how to design morally-aware LLMs, this paper reminds us to think about which set of morals we’re designing for.
Parrot: Pareto-optimal Multi-Reward Reinforcement Learning Framework for Text-to-Image Generation
Seung Hyun Lee, Yinxiao Li, Junjie Ke, Innfarn Yoo, Han Zhang, Jiahui Yu, Qifei Wang, Fei Deng, Glenn Entis, Junfeng He, Gang Li, Sangpil Kim, Irfan Essa, Feng Yang
Oral Presentation, ECCV ‘24
When we generate images with AI, how do we make sure that the images stay true to what we asked for? Images often have many components: “a boat in the middle of a lake at sunset” tells us that the image needs to contain certain objects (a boat, a lake) in certain positions (in the middle of a lake) and have a certain type of lighting (sunset). However, it’s challenging to train models to stay true to all of these aspects. Parrot is a method for solving this problem: given a set of possible images, Parrot tries to find a good balance between all of these criteria by generating a bunch of candidate images and letting the model learn from the ones on the “Pareto front,” that is, images that no other candidate beats on every criterion at once. Ultimately, this method can help us design more flexible and powerful AI image generators.
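To illustrate the notion of a Pareto-optimal set that the method is named after (a generic sketch, not Parrot’s actual training loop), suppose each candidate image has been scored on several criteria. An image is kept if no other image is at least as good on every criterion and strictly better on one; the scores below are made up for illustration.

```python
import numpy as np

# Rows = candidate images, columns = reward scores (e.g. text fidelity,
# aesthetics, human preference). Higher is better; numbers are made up.
scores = np.array([
    [0.9, 0.2, 0.5],
    [0.4, 0.8, 0.6],
    [0.3, 0.3, 0.3],   # dominated: another row is better on every axis
    [0.7, 0.7, 0.7],
])

def pareto_optimal(scores: np.ndarray) -> np.ndarray:
    """Boolean mask of rows not dominated by any other row."""
    n = scores.shape[0]
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        others = np.delete(scores, i, axis=0)
        # Row i is dominated if some other row is >= on all criteria
        # and strictly > on at least one.
        dominated = np.any(
            np.all(others >= scores[i], axis=1)
            & np.any(others > scores[i], axis=1)
        )
        keep[i] = not dominated
    return keep

print(pareto_optimal(scores))  # [ True  True False  True ]
```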
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
Andrew Lee, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K. Kummerfeld, Rada Mihalcea
Oral Presentation, ICML ‘24
Large language models often go through further training to make them helpful and capable, and to avoid toxic behaviors, such as generating offensive or insulting responses to the user. But how well does this actually work? Are large language models truly “forgetting” how to be toxic, or simply side-stepping toxic outputs? As a case study, this paper uses a standard alignment method, direct preference optimization (DPO), to steer a model towards non-toxic behaviors. At first glance, the newly-aligned model seems to behave in toxic ways much less often, as desired. But after closely inspecting the model’s internal representations, the paper finds that toxic behavior is lurking just beneath the surface: a small, simple change to the representations is enough to make the model exhibit toxic behaviors again. In short, aligning the model towards non-toxic behaviors didn’t remove its ability to be toxic; the model’s internal representations simply drifted a short distance away. This paper is a reminder that there’s a lot of work to be done in making large language models safer.
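As a rough illustration of what “a small, simple change to the representations” can mean, here is a minimal sketch of activation steering with a PyTorch forward hook: adding a fixed vector to one layer’s hidden states and regenerating. This is not the paper’s exact procedure; the layer index and the (random) direction below are placeholders, whereas the paper works with directions tied to the model’s own representations of toxicity.

```python
# Minimal activation-steering sketch: nudge a model's hidden states at one
# layer and see how its output changes. Layer index and direction are
# placeholders chosen for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

layer_idx = 6                                  # placeholder layer choice
direction = torch.randn(model.config.n_embd)   # placeholder direction
direction = direction / direction.norm()
scale = 5.0                                    # strength of the nudge

def add_direction(module, inputs, output):
    # GPT-2 blocks return a tuple; the first element is the hidden states.
    hidden = output[0] + scale * direction
    return (hidden,) + output[1:]

hook = model.transformer.h[layer_idx].register_forward_hook(add_direction)

inputs = tokenizer("The weather today is", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))

hook.remove()  # restore the original model behavior
```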
Editors: Trenton Chang, Vaibhav Balloli, Aurelia Bunescu