Learning in Multi-Agent Systems

by Max Smith

Challenges and Considerations

Artificially intelligent (AI) systems are increasingly enmeshed in our society. As the hidden engines powering many web services and social-media sites, they provide intelligent content recommendations to millions. They challenge the most brilliant human competitors at Go, Poker, and Chess, games once thought so complex that only humans could play them well. They even participate in recreational activities such as painting and playing video games. Despite this growing resume of incredible accomplishments, AI systems still struggle with many tasks that are trivial for people. As we encounter AI systems more and more in everyday life, it is increasingly important to understand where they shine and where they struggle. Developing this common-sense understanding fosters a better working relationship between people and AI systems.

In this article, we focus on AI systems designed to take actions in a world and receive feedback on the quality of their performance. We refer to this class of action-taking AI as agents. Agents operate in specific worlds, often referred to as their environment. For an agent, the environment can be as simple as balancing a teetering pole on a cart, or as formidable as identifying the angles between amino acids in a correctly folded protein (recent work has shown this to be viable!). Generally, we deploy agents into environments to solve problems that humans either do not know how to solve or cannot solve well. In settings where there are clear markers of success for the problem at hand (e.g., winning a game of chess), the environment can provide feedback and rewards for meeting those markers.

Many of the greatest breakthroughs in AI have come from systems in such a setup, where AI agents have shown outstanding performance against the benchmarks defined for their tasks. However, such benchmarks consider neither the impact an agent's actions may have on other agents or humans, nor the impact of others upon the agent itself. Including other agents and humans in the mix dramatically increases the difficulty of training agents. In this article, we take a brief tour through just a few of the exciting and difficult problems involved in developing multiagent systems. These problems help illustrate the state of the art in AI and give insight into how agents make decisions.

Credit Assignment

Recall back in primary school when you were put into a group project with randomly assigned members. Did you dread having to take on a majority of the work, or were you excited at the idea of coat-tailing off your peers? Despite your teacher's best efforts, a student probably received a grade that they did not deserve as a result of the group component of the assignment.

This flashback illustrates the credit assignment problem in multiagent learning. At its core, the problem asks: which actions, taken by whom, resulted in receiving a reward? Was it the studying you did the previous summer to prepare for this coursework that resulted in your high marks? Or perhaps you enjoyed several full nights of watching television and also received high marks, inadvertently reinforcing your bad habits. In these last two cases, as humans, it's hopefully clear how the choices made affected the outcome. However, AI agents do not come with this intuition pre-trained and must instead learn the causality of their actions alongside learning to solve the problem itself.

This difficulty is further exacerbated by the effect of everyone else's choices on the outcome of the project. Did your contribution to the project matter more than, less than, or the same as that of your peers? Which actions from which group members were good and should be repeated for the next project? When we were in school, rubrics served as a stopgap for understanding what was done well; however, not all problems come with a rubric.

In summary, when an AI system is successful, we have to figure out which actions caused its success while also understanding how the other agents acting in the environment contributed to, or hindered, that success, potentially unequally.
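One principled way to formalize "who deserves how much credit" is the Shapley value from cooperative game theory, which averages each member's marginal contribution over every order in which the group could have assembled. Below is a minimal sketch in Python; the member names and the coalition grades (the project score each subgroup would earn on its own) are invented purely for illustration.

```python
from itertools import permutations

# Hypothetical value of each coalition of group members: the grade (0-100)
# the project would earn with only those members contributing.
value = {
    frozenset(): 0,
    frozenset({"ana"}): 40,
    frozenset({"ben"}): 30,
    frozenset({"cam"}): 20,
    frozenset({"ana", "ben"}): 75,
    frozenset({"ana", "cam"}): 65,
    frozenset({"ben", "cam"}): 50,
    frozenset({"ana", "ben", "cam"}): 90,
}

def shapley(players, value):
    """Average each player's marginal contribution over all join orders."""
    shares = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            # How much did the grade improve when p joined this coalition?
            shares[p] += value[coalition | {p}] - value[coalition]
            coalition = coalition | {p}
    return {p: s / len(orders) for p, s in shares.items()}

print(shapley(["ana", "ben", "cam"], value))
```

The shares always sum to the full group's grade, and a member who lifts every subgroup they join earns a larger share, which is exactly the intuition behind fair credit assignment.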

Moving Target

Unfortunately, even if we could solve the group-project credit assignment problem, it would not be a panacea for learning in multiagent systems. Our agent may see how others behave and adapt its own behavior; however, the other agents may change their behavior in response as well! We can think of this as our group mates realizing we are slackers and deciding to change teams, or to stop carrying the whole group on their shoulders.

This phenomenon is referred to as the moving target problem. At its core, it deals with deciding the best thing to do when other agents can change over time, whether because those agents learn or because they are replaced entirely. The term target refers to the best behavior, and it is moving (read: changing) as the other agents change.

Let us first consider the case where we need to interact well with only one other agent, but they are also learning alongside us. One promising approach is to consider what the other agent may learn from the interaction, and account for that when changing our own behavior. This reasoning can be extended indefinitely: agents may consider N levels of how the other agent will respond to their change, which was itself derived by considering the other agent's change, and so on. This can quickly become overwhelming to think about, but solutions of this nature are being actively refined. You can think of it as nested hypotheticals: "If I do this, what will they do? And then what will I do? And then what will they do?"
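This kind of nested reasoning is often modeled as "level-k" thinking: a level-0 agent acts naively, and a level-k agent best-responds to a level-(k-1) model of the other agent. Here is a toy sketch using an invented matching-pennies payoff matrix (the row player wants the two actions to match; the column player wants them to differ); both the payoffs and the level-0 default are assumptions for illustration.

```python
# Hypothetical zero-sum game: the row player scores +1 when actions match,
# and the column player scores +1 when they differ (matching pennies).
ROW_PAYOFF = [[1, -1],
              [-1, 1]]

def best_response(player, opp_action):
    """Best action for `player` against a fixed opponent action."""
    if player == "row":
        return max((0, 1), key=lambda a: ROW_PAYOFF[a][opp_action])
    # Column player's payoff is the negative of the row player's.
    return max((0, 1), key=lambda a: -ROW_PAYOFF[opp_action][a])

def level_k_action(player, k):
    """Level-0 plays a fixed naive action; level-k answers level-(k-1)."""
    if k == 0:
        return 0  # assumed naive default action
    opponent = "col" if player == "row" else "row"
    return best_response(player, level_k_action(opponent, k - 1))

for k in range(4):
    print(k, level_k_action("row", k), level_k_action("col", k))
```

Notice that the recommended actions cycle rather than settle as k grows, which is exactly why reasoning "infinitely deep" about a learning partner is both tempting and troublesome.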


Another approach is not to try to work as a team with any one particular agent, but instead to figure out the best behavior in the worst case. If the other agents were swapped out for malicious, evil agents (unbeknownst to us), our agent would need to steel itself for operating among them. This direction is inspired by, and implemented through, learning objectives defined in game theory.
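This worst-case mindset corresponds to the maximin criterion from game theory: choose the action whose worst possible outcome is best. A tiny sketch with an invented payoff matrix, where the rows are our agent's actions and the columns are whatever the (possibly adversarial) other agent might do:

```python
# Illustrative payoffs: each row is one of our actions, each column is
# a possible behavior of the other agent.
PAYOFF = [
    [3, -4],   # risky action: great if they cooperate, terrible if not
    [1,  0],   # safe action: modest payoff either way
]

def maximin_action(payoff):
    """Pick the action whose worst-case payoff is highest."""
    return max(range(len(payoff)), key=lambda a: min(payoff[a]))

print(maximin_action(PAYOFF))  # 1: the safe action
```

Against a cooperative partner the risky action would be better, but a maximin agent gives up that upside in exchange for never being exploited.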


Communication

The previous two problems, credit assignment and moving targets, focused on learning how to behave to meet a goal. There is an even bigger problem on the table, though, and it is one that even humans have not mastered: communication. Communication is a broad field whose complexities could never be distilled into a single blog post. Instead, we will scratch the surface by briefly discussing two important forms: verbal and non-verbal communication.

Verbal communication is rife with ambiguities. Borrowing an example from an earlier Michigan AI blog post by Laura Burdick: if you read "the chocolate bar," it's not clear whether this refers to a piece of candy or a bar in a restaurant that serves chocolate. All of these linguistic challenges faced by humans must also be overcome by AI systems. Only preliminary work has been done on verbal communication between AI systems and humans, simply due to the extraordinary challenge it presents. Instead, research has focused on communication among several AI agents using artificially constructed tiny languages (with simple grammars).


Non-verbal communication deals with how all of the actions you take reveal some information about your thoughts. The card game Hanabi excellently highlights what non-verbal communication is and the complexities that underlie it. Hanabi is a 2-5 player game where everyone works cooperatively to build a firework show. Each player is dealt a hand of cards, and what makes this game unique is that you turn your hand around so that all the other players can see your cards but you cannot. Each player holds different parts of the firework show, and the team needs to play each colour in ascending order of value (I play a Yellow 1, then you play a Yellow 2, then our third friend plays a Yellow 3) if they want to win.

On their turn, each player chooses to either blindly add a firework (a card) to the show being built in the middle of the table, hint to another player about what's in their hand, or discard a card. Giving hints is obviously valuable, so only a limited number of hints are allowed; however, discarding a card can earn the team more hint tokens. The show is successfully completed when all five colours of fireworks have had all five of their tiers played.

There are a few catches. First, a card can only successfully be added to the show if a stack of the same colour containing all of the smaller-valued fireworks already exists in the show. This means a red firework of rank three requires that the red fireworks of ranks one and two have already been played. Incorrectly playing a card causes the firework to explode, and the team can only survive three explosions. Second, when hinting to another player about the cards they're holding, there are only two choices of hints: pointing out all cards of the same colour, or pointing out all cards of the same rank.
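The play rule above is simple enough to state as code: a card of rank r is playable only when its colour's stack already holds ranks 1 through r-1. A minimal sketch, where the stack heights describe an invented mid-game position:

```python
# Height of each completed firework stack (0 means nothing played yet).
# This particular position is made up for illustration.
stacks = {"red": 2, "yellow": 0, "blue": 5, "green": 1, "white": 0}

def is_playable(colour, rank, stacks):
    """A rank-r card fits only when ranks 1..r-1 of its colour are down."""
    return stacks[colour] == rank - 1

print(is_playable("red", 3, stacks))     # True: red 1 and 2 already played
print(is_playable("yellow", 2, stacks))  # False: yellow 1 is still missing
```

The rule is trivial to check when you can see a card; the whole difficulty of Hanabi is that you must infer whether your own hidden cards satisfy it.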

A rich non-verbal language has evolved in the Hanabi community around what information may be implicitly communicated through the hint action. One simple convention is that a player keeps their oldest card in the right-most position, and this is the card the player will discard by default, referred to as the "chop principle". Another convention is the "high value principle", which assumes that if a clue was worth giving, it should be interpreted as pointing at the highest-value (best) available move for that player.
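The chop principle itself is mechanical enough to sketch in a few lines. Assuming a hypothetical hand representation (invented here for illustration) where each card records how long it has been held and whether any clue has touched it, the chop is the oldest unclued card:

```python
# Hypothetical hand: each card tracks how many turns it has been held
# and whether a clue has ever touched it.
hand = [
    {"age": 4, "clued": False},   # oldest untouched card: this is the chop
    {"age": 3, "clued": True},
    {"age": 2, "clued": False},
    {"age": 1, "clued": False},
]

def chop_index(hand):
    """Index of the default discard: the oldest card no clue has touched."""
    unclued = [i for i, card in enumerate(hand) if not card["clued"]]
    if not unclued:
        return None  # every card is clued; no safe default discard
    return max(unclued, key=lambda i: hand[i]["age"])

print(chop_index(hand))  # 0
```

Because every player can compute everyone's chop, a hint that touches a card sitting on the chop carries extra implicit meaning ("don't discard that!") beyond its literal content.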

The Hanabi challenge is to construct agents that can learn to pick up on these non-trivial implicit conventions. In particular, there is strong interest in agents that can be placed with new teammates and quickly adopt the customs of that group. This requires the agents to reason about the true intent of their teammates' actions. Overall, this gives us a way to study the subtlety that exists in communication, possibly applicable to human communication.

Social Dilemmas

Successful communication can breed a whole new slew of challenges, because it allows more meaningful interactions between agents. One of the simplest examples of this is the prisoner's dilemma. This classic situation involves two criminals who have been caught and a detective who is trying to get a confession. The criminals are brought into separate rooms, and each can choose to cooperate with the detective (confess) or remain quiet. If both criminals confess, they each serve a medium sentence; if both remain silent, they each serve a short sentence. However, if only one confesses, they go free and their partner is jailed for a long sentence. While this particular example doesn't allow communication between the two agents, the underlying struggle they face lies at the core of many interactions.

Should our agent cooperate or be selfish? The answer isn't always clear and depends on many external factors. For example: will the two agents interact again in the future? How risk-averse is your agent? Is there any pre-existing agreement between the agents that might offer assurances they will both remain silent? This problem and its many variations have been under investigation for decades; despite its simplicity, we are still learning more about this style of interaction every day. A successful strategy for two players who repeatedly end up in a prisoner's dilemma is "tit for tat": each player takes the action that the other player took in the previous interaction. In fact, this strategy has even been discovered by AI agents trained to play this game.
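Tit for tat is only a few lines of code. The sketch below uses jail-time payoffs matching the story above (years served, so lower is better); the exact sentence lengths are invented for illustration.

```python
# (my move, their move) -> my years in jail. "C" = stay quiet
# (cooperate with your partner), "D" = confess (defect).
SENTENCE = {
    ("C", "C"): 1,   # both silent: short sentence each
    ("C", "D"): 10,  # I stay quiet, partner confesses: long sentence
    ("D", "C"): 0,   # I confess alone: go free
    ("D", "D"): 5,   # both confess: medium sentence each
}

def tit_for_tat(history):
    """Cooperate first, then copy the partner's previous move."""
    return "C" if not history else history[-1]

def always_defect(history):
    return "D"

def play(strategy_a, strategy_b, rounds=5):
    hist_a, hist_b = [], []          # each side's record of the OTHER's moves
    years_a = years_b = 0
    for _ in range(rounds):
        a, b = strategy_a(hist_a), strategy_b(hist_b)
        years_a += SENTENCE[(a, b)]
        years_b += SENTENCE[(b, a)]
        hist_a.append(b)
        hist_b.append(a)
    return years_a, years_b

print(play(tit_for_tat, tit_for_tat))    # (5, 5): sustained cooperation
print(play(tit_for_tat, always_defect))  # (30, 20): exploited once, then even
```

Two tit-for-tat players settle into mutual cooperation, while a defector gains only a one-round advantage before being matched move for move; that balance of forgiveness and retaliation is what makes the strategy so robust.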


The field of AI has recently turned to the game Diplomacy as a benchmark for richer social dilemmas. In Diplomacy, seven players take on the roles of European powers during World War I and compete for control of Europe. Notably, the game involves repeated rounds of private and public discussions between the powers, followed by the movement of their forces. This benchmark offers a challenging problem for AI in both communicating with the other players and learning social skills. Can our agents construct alliances? Will they know when they're about to be back-stabbed? Will they back-stab an ally? The emergence of these rich social interactions is an active area of study and interest.


By looking at just a few of the problems our AI systems face when interacting with other humans or agents, we can glean a lot about the state of AI in society. We have seen many incredible recent advances; however, there are still many large hurdles to overcome before Siri becomes as smart as the AI in the movie Her. Two-player games are not the world, and the world is complicated.