Speech 1: DLRLSS 2019, James Wright: Multi-Agent Systems
Slides
In this talk:
- Multi-agent systems are important and hard
- Game theory: Multi-agent mathematics
- Thinking realistically about multi-agent systems
- Predicting human behaviour in normal-form games (not covered due to time)
More agents, more problems
Nobody ever says “maybe I should solve my problem with multi-agent systems!”
Multi-agent systems aren’t solutions. They are problems.
But they are problems that are important to solve.
Single Agent Systems
1. An agent interacts with an environment by taking an action
2. The agent's action and the environment's previous state determine:
   - The agent's reward
   - The environment's next state
Multi-agent Systems
To change a single-agent system into a multi-agent system, we need:
- Multiple autonomous agents
- With distinct goals / priorities / rewards
Multi-agent Systems are Widespread
Multi-agent systems are everywhere:
- Games (Poker, Go, Shogi, StarCraft)
- Auctions (ad auctions, spectrum auctions)
- Online platforms (sharing economy, crowdsourcing)
- Cooperative interaction (language, advice, decision support)
- Kidney exchange
- Urban planning
Multi-agent Systems are Nonstationary
Multi-agent systems are nonstationary in a special way, because every agent is part of every other agent's environment. Each agent may try to game the others, and the environment changes in anticipation of the agents' actions. So an agent can't simply run straightforward learning against a fixed environment.
Multi-agent Objectives are Often Unclear
In a single-agent system, the objective is clear: maximize the agent's expected reward. In a multi-agent system, should we maximize the total reward over all the agents?
Multi-agent Objectives
Agent Design: Maximize individual agent’s reward
Mechanism Design: Maximize total reward by optimizing the environment
Both of these depend on the (other) agents’ policies!
(Noncooperative) Game Theory
It turns out that game theory lets us put our finger on many of these difficulties very precisely.
Game theory is powerful because it describes multiple agents mathematically. It is the mathematical study of interaction between multiple rational, self-interested agents (these are two important assumptions):
- Rational: each agent maximizes their expected utility
- Self-interested: agents pursue only their own preferences
  - Not the same as "agents are psychopaths"! Their preferences may include the well-being of other agents: I can decide, on my own, to make others happy.
  - Rather, the agents are autonomous: they decide on their own priorities independently.
Fun Game: Prisoner's Dilemma
Two suspects are being questioned separately by the police.
- If both remain silent (cooperate, i.e., with each other), they are each sentenced to 1 year on a lesser charge
- If both implicate each other (defect), they each receive a reduced sentence of 3 years
- If one defects and the other cooperates, the defector is given immunity (0 years) and the cooperator serves a full sentence of 5 years
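Spelled out as a payoff matrix, directly from the rules above (entries are row player's years, column player's years; fewer is better):

```
              Cooperate   Defect
  Cooperate    (1, 1)     (5, 0)
  Defect       (0, 5)     (3, 3)
```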
Normal-Form Games
The Prisoner's Dilemma is an example of a normal-form game, in which agents make a single decision simultaneously and then receive a payoff depending on the profile of actions:
- There are n players: N = {1, 2, …, n}, indexed by i
- Each player i has a set Ai of available actions
- An action profile is a tuple a ∈ A = A1 × … × An giving an action ai for each player i
- Each player i has a utility function ui: A → ℝ (their preferences)
Maximize Individual Agent’s Rewards: Best Response
In game theory, a rational agent is one that maximizes their expected utility given their beliefs π−i about the other agents' policies:

ai* ∈ argmax_ai E_{a−i ∼ π−i}[ui(ai, a−i)]

A shorter way to say this: ai* is a best response, written ai* ∈ BRi(π−i).
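As a minimal sketch (my own illustration, not from the talk), here is best response computed for the Prisoner's Dilemma, with utilities taken as negative years in prison so that larger is better:

```python
# Prisoner's Dilemma utilities: U[(row_action, col_action)] = (u_row, u_col),
# where utility = -(years in prison), so larger is better.
ACTIONS = ["cooperate", "defect"]
U = {
    ("cooperate", "cooperate"): (-1, -1),
    ("cooperate", "defect"):    (-5,  0),
    ("defect",    "cooperate"): ( 0, -5),
    ("defect",    "defect"):    (-3, -3),
}

def best_response(belief):
    """Row player's best response given a belief over the column
    player's actions: belief[action] = probability."""
    def expected_utility(a):
        return sum(p * U[(a, b)][0] for b, p in belief.items())
    return max(ACTIONS, key=expected_utility)

# In the Prisoner's Dilemma, defect is a best response to any belief:
print(best_response({"cooperate": 1.0, "defect": 0.0}))  # defect
print(best_response({"cooperate": 0.5, "defect": 0.5}))  # defect
```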
Nash Equilibrium
Where do the beliefs π−i come from?
Nash equilibrium is the standard game-theoretic prediction of what will happen in a multi-agent system. Here is how it works:
1. Every agent is perfectly rational
2. Therefore every agent is best responding to the actual policies of the other agents
3. Therefore πi(ai) > 0 ⟹ ai ∈ BRi(π−i) for all i
Any profile π* of policies that satisfies condition (3) is a Nash equilibrium.
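As a small illustration (again my own sketch, not from the talk), pure-strategy Nash equilibria of a two-player game can be found by brute force; this reuses ACTIONS and U from the snippet above:

```python
from itertools import product

def pure_nash_equilibria(actions, U):
    """Return the action profiles in which each player's action is a
    best response to the other's, i.e. pure-strategy Nash equilibria."""
    equilibria = []
    for a_row, a_col in product(actions, repeat=2):
        row_ok = all(U[(a_row, a_col)][0] >= U[(alt, a_col)][0] for alt in actions)
        col_ok = all(U[(a_row, a_col)][1] >= U[(a_row, alt)][1] for alt in actions)
        if row_ok and col_ok:
            equilibria.append((a_row, a_col))
    return equilibria

print(pure_nash_equilibria(ACTIONS, U))  # [('defect', 'defect')]
```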
Nash Equilibrium Predictions
Every agent is perfectly rational, therefore every agent best responds to every other agent simultaneously (otherwise they’re behaving irrationally).
In the game on this slide (a coordination game in the style of Battle of the Sexes), it turns out that there is more than one Nash equilibrium. If the row player goes to the ballet with probability 2/3 and the column player goes to the ballet with probability 1/3, that is also an equilibrium, because it makes each player's expected utility equal for either of their actions.
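To verify the indifference claim numerically, here is a sketch with assumed Battle-of-the-Sexes payoffs of (2, 1) for coordinating on ballet and (1, 2) for coordinating on football; the slide's actual matrix isn't reproduced here, but these standard values are consistent with the 2/3 and 1/3 probabilities quoted above:

```python
# Assumed payoffs (my assumption): U_BOS[(row, col)] = (u_row, u_col).
U_BOS = {
    ("ballet",   "ballet"):   (2, 1),
    ("ballet",   "football"): (0, 0),
    ("football", "ballet"):   (0, 0),
    ("football", "football"): (1, 2),
}

p_col_ballet = 1 / 3  # column player's equilibrium probability of ballet

def row_expected_utility(a):
    return (p_col_ballet * U_BOS[(a, "ballet")][0]
            + (1 - p_col_ballet) * U_BOS[(a, "football")][0])

# In a mixed equilibrium, the row player must be indifferent between actions:
print(row_expected_utility("ballet"))    # 0.666...
print(row_expected_utility("football"))  # 0.666...
```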
However, these predictions are not all that helpful, and this is the first problem game theory points out to us. Nash equilibrium seems reasonable in some ways: everybody is, in a sense, doing the right thing. But if we actually want to know what to do, it tells us only that we should play one of our two actions, or the other, or perhaps randomize between them.
Maximize Total Reward?
Let's come back to another issue. Maybe we want the sum of everybody's rewards to be as large as it possibly can be, and we will do that by optimizing the environment (maximizing over the environment p).
Mechanism Design: Maximize total reward by optimizing the environment
It turns out that utility values can't actually be added up across agents. It's not necessarily meaningful to sum utilities between agents, because utilities are just representations of preferences: they only tell us how an agent is going to behave.
Agent i with utility ui and agent j with utility uj(a) = 10·ui(a) (for every possible outcome a) will behave in exactly the same way.
We can even add an arbitrary constant to everybody's utility, and it still won't change their behaviour. So it's not always straightforward to know how to optimize total utility.
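A quick numerical illustration (mine, with made-up utility values): scaling everyone's utility and adding a constant never changes which action gets chosen.

```python
import numpy as np

u_i = np.array([3.0, 1.0, 4.0, 1.5])  # made-up utilities over four outcomes
u_j = 10 * u_i + 7                    # a positive affine transformation

# The ranking of outcomes is identical, so behaviour is identical:
print(np.argmax(u_i))  # 2
print(np.argmax(u_j))  # 2
```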
If we can't just add up utilities, then how should we reason about mechanism design? There are two answers. One answer, often used in the real world, is to punt: just measure everything in money. That's a reasonable approximation in some situations. Game theory's answer is called Pareto optimality.
Pareto Optimality
This gives us a way to compare outcomes even though we can't compare individuals' utilities. It consists of two definitions:
- Pareto domination: a profile of utilities (x1, …, xn) Pareto dominates another profile (y1, …, yn) if
  - xi ≥ yi for every i ∈ N, and
  - there exists an i ∈ N such that xi > yi (for some agent, the inequality is strict)
  - That is, every agent is at least as happy with x as with y, and at least one agent strictly prefers x to y.
- Pareto optimality: an outcome (x1, …, xn) is Pareto optimal if it is not Pareto dominated by any other outcome.
  - Any other outcome that is better for one agent must be worse for another.
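As a sketch (my own, not from the talk), here is the Pareto check applied to the four Prisoner's Dilemma outcomes, again using negative years as utilities:

```python
def pareto_dominates(x, y):
    """Profile x Pareto dominates y: nobody is worse off under x,
    and somebody is strictly better off."""
    return (all(xi >= yi for xi, yi in zip(x, y))
            and any(xi > yi for xi, yi in zip(x, y)))

# Utility profiles of the Prisoner's Dilemma outcomes (negative years):
profiles = {
    ("cooperate", "cooperate"): (-1, -1),
    ("cooperate", "defect"):    (-5,  0),
    ("defect",    "cooperate"): ( 0, -5),
    ("defect",    "defect"):    (-3, -3),
}

for outcome, x in profiles.items():
    dominated = any(pareto_dominates(y, x) for y in profiles.values() if y != x)
    print(outcome, "is Pareto dominated" if dominated else "is Pareto optimal")
# Only ('defect', 'defect') is Pareto dominated (by mutual cooperation).
```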
Why talk about all of this? Because individually rational behaviour is not always globally optimal, we still need to examine the outcomes that Nash equilibrium produces.
Individually Rational Behaviour is not Always Globally Optimal
We saw this with the Prisoner's Dilemma.
- Which outcomes are Pareto optimal?
- What is the (unique) Nash equilibrium?
The only equilibrium of the Prisoner's Dilemma, both players defecting, is also the only outcome that is Pareto dominated!
Individual optimization can lead to the worst-possible global outcome.
That should worry us in the setting where we only get to control one agent and we only care about optimizing our own utility.
Fun Game: Traveller's Dilemma
- Two players pick a number (2-100) simultaneously
- If they pick the same number x, then they both get a payoff of $x
- If they pick different numbers:
  - The player who picked the lower number gets the lower number, plus a bonus of $2
  - The player who picked the higher number gets the lower number, minus a penalty of $2
We are talking not only about expected-utility maximization but also about equilibrium. Equilibrium says we are not just maximizing our expected utility; we are maximizing it with respect to the belief that the other player is also maximizing their expected utility. If we believe the other player is picking 100, we don't believe they are maximizing their expected utility, since picking 99 against 100 pays more than picking 100.
Given that unwinding, it's not surprising that the unique Nash equilibrium of the Traveller's Dilemma is (2, 2). Individual optimization can lead to nearly the worst possible individual outcome; the sketch below illustrates the unwinding.
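A minimal simulation (my own sketch) of that unwinding via iterated best responses:

```python
def td_payoff(mine, theirs):
    """Traveller's Dilemma payoff for picking 'mine' against 'theirs'."""
    if mine == theirs:
        return mine
    low = min(mine, theirs)
    return low + 2 if mine < theirs else low - 2

def best_response_to(theirs):
    return max(range(2, 101), key=lambda mine: td_payoff(mine, theirs))

# Starting from 100, iterated best responses unwind all the way down:
claim = 100
while best_response_to(claim) != claim:
    claim = best_response_to(claim)
print(claim)  # 2
```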
This example convinced us that game theory was somehow missing something. If we are perfectly rational, we’ll reliably lose money. In what sense is that perfect rationality?
Nash Equilibrium vs. Humans
- Nash equilibrium often makes counterintuitive predictions.
  - In the Traveller's Dilemma, the vast majority of human players choose 97–100; the Nash equilibrium is 2.
- Modifications to a game that don't change the Nash equilibrium predictions at all can cause large changes in how human subjects play the game.
  - In the Traveller's Dilemma, when the penalty is large, people play much closer to the Nash equilibrium.
  - But the size of the penalty does not change the equilibrium!
- Clearly, Nash equilibrium does not capture the whole story (for humans).
Multiple Nash Equilibria
Multiple equilibria are a problem even in perfectly cooperative games.
Game theory doesn't even tell us which side of the road we should drive on.
Best Response is Expensive
All of the previous problems arise in very simple games, but in many settings (Poker, Go, Chess) we can’t even reliably compute BRi.
What’s Going Wrong Here?
Why is game theory failing us so badly? There are two kinds of problems:
1. We can't do what game theory prescribes, and neither can the other players (best response: how?)
   - Finding a Nash equilibrium is computationally hard
   - Nash equilibrium specifies what satisfies certain conditions (it is very post hoc), not how you get there
   - Hence multiple equilibria, with no way to compare or order them
2. Unrestricted individual optimization can cause even small misalignments in interests to snowball.
Thinking About the Problem
Three ways of viewing a multi-agent problem:
- Normative
- Descriptive
- Positive
We have to think really carefully about what we’re trying to do in a multi-agent system. We should never just pick some loss function off the shelf.
Normative Modeling
- Start from axioms (assumptions) and derive consequences:
  - Every agent maximizes expected utility
  - Every agent has accurate beliefs
- Game theory's problems come from:
  - Violation of these assumptions
  - Focus on consequences rather than process
Descriptive Modeling
Construct a model of how agents actually behave (rational but not perfect):
- We need a descriptive model of the other players to evaluate algorithm performance
- Understanding human behaviour can give insight into bounded rationality
Positive Modeling
Given a descriptive model of behaviour and an environment, figure out how to achieve a goal:
- Agent design: design an agent that will achieve its individual goals
  - e.g., winning a maximum-value bundle of goods, winning a game
- Mechanism design: design an environment for agents such that the agents' collective actions will achieve a goal
  - e.g., markets, online platforms, auctions, Internet protocols
Final Thoughts
- Our lives are increasingly part of novel multi-agent systems
  - It is imperative that we understand their properties so we can design them properly
- This is a wide-open research area
  - We don't even have a coherent notion of optimal behaviour yet
- Game theory helps us reason precisely about multi-agent interactions, but it isn't enough
  - We need real-world descriptive models and a theory of how to use them