Cautiously-Optimistic Knowledge Sharing for Cooperative Multi-Agent Reinforcement Learning
Highlights
- Overall, this is a clear, well put-together, and easily digestible paper. I enjoyed reading it and would have given it an accept.
- Their technique is both novel and interesting, and I may refer to it later for my own MARL-related research.
Summary
Decentralized training in multi-agent reinforcement learning provides advantages in robustness and scalability, yet suffers from coordination issues and partial observability. Prior work on knowledge sharing and student-teacher models allows agents to share knowledge with each other, but can suffer when the teacher has learned a sub-optimal policy, a problem further exacerbated when the student blindly adopts that policy. The authors introduce CONS, a knowledge-sharing framework in which agents cautiously adopt both positive and negative knowledge from other agents and can vary the weight given to that knowledge over time.
Key Contributions
- Focuses on decentralized training and decentralized execution (DTDE).
- Provides a framework that allows agents to request "advice" from other agents about regions of the policy space they know less well, in the form of both positive and negative knowledge. Instead of blindly following advice, an agent integrates it probabilistically into its own policy exploration, making the method robust to sub-optimal actions from teachers (see the sketch after this list).
- This method integrates nicely into an existing framework without additional training overhead. From reading the paper and referencing the open-source repository, it appears easy to integrate into one's own project.
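To check my understanding of the core mechanism, here is a minimal sketch of what cautious, probabilistic advice integration could look like. The names and the softmax-blend formulation are my own illustration, not the paper's exact method.

```python
import numpy as np

def cautious_action_selection(q_values, positive_advice, negative_advice,
                              advice_weight, rng=None):
    """Hypothetical sketch of cautious advice integration (not the paper's exact formulation).

    q_values:        the student's own action-value estimates, shape (n_actions,)
    positive_advice: boolean mask of actions a teacher recommends, shape (n_actions,)
    negative_advice: boolean mask of actions a teacher discourages, shape (n_actions,)
    advice_weight:   how strongly advice biases exploration, in [0, 1]
    """
    if rng is None:
        rng = np.random.default_rng()

    # Start from a softmax over the agent's own estimates, so advice
    # never fully overrides what the student has already learned.
    prefs = np.exp(q_values - q_values.max())
    probs = prefs / prefs.sum()

    # Positive knowledge boosts recommended actions; negative knowledge
    # suppresses discouraged ones. Both are blended in probabilistically
    # rather than being followed blindly.
    probs = probs * (1.0 + advice_weight * positive_advice)
    probs = probs * (1.0 - advice_weight * negative_advice)

    probs = np.clip(probs, 1e-8, None)
    probs = probs / probs.sum()
    return rng.choice(len(q_values), p=probs)

if __name__ == "__main__":
    q = np.array([0.2, 0.5, 0.1, 0.4])
    pos = np.array([0, 1, 0, 0], dtype=bool)   # teacher recommends action 1
    neg = np.array([0, 0, 1, 0], dtype=bool)   # teacher discourages action 2
    print(cautious_action_selection(q, pos, neg, advice_weight=0.5))
```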
Strengths
- I immediately thought, "What's wrong with standard student-teacher models?" It was nice to see the paper firmly establish and appropriately argue where standard advising fails and where their method applies. It was also nice to see the value of negative information demonstrated, rather than positive information alone.
- Although communication overhead wasn't a core focus or contribution of the paper, the authors put a lot of work into making their method communication-efficient. The framework they referenced for communication budgets was properly utilized and credited.
- The workflow and method were easy to follow; I could see exactly how the method works.
- Thorough evaluation on three different MARL benchmarks, along with an appropriate number and type of baselines.
Weaknesses / Questions
- Moderate: The environments are all discrete and low-dimensional. It would be interesting to see how the method performs on a more complex standard benchmark such as StarCraft or Overcooked.
- Minor: The inclusion of negative knowledge is a nice touch, and the ablation shows it has a substantial impact. The authors point out that positive knowledge becomes more important in later stages, which makes sense intuitively, but the ablation suggests negative knowledge remains just as important.
- Minor: Building on the previous point, on page 17302 the authors propose a rapid shift to positive-only information rather than a linear function. Since the ablation shows negative knowledge matters even in later episodes, I wonder how changing this schedule would affect performance, if at all (see the sketch after this list).
- Nit: The "Sample an action" section describes the fundamental exploration-exploitation trade-off well known in RL; the paragraph could be compressed further.
- Nit: Dangling references to parts of the paper are present in some places; for example, page 17303 under Task Settings references "Table ??".
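To make the schedule question above concrete, here is a minimal sketch of the two weighting functions I have in mind: a linear decay of the negative-knowledge weight versus a sigmoid-style rapid shift toward positive-only advice. The function names and the parameters `switch_frac` and `sharpness` are hypothetical and may not match the paper's actual schedule.

```python
import numpy as np

def negative_weight_linear(episode, total_episodes):
    """Linearly decay the weight on negative knowledge over training."""
    return max(0.0, 1.0 - episode / total_episodes)

def negative_weight_rapid_shift(episode, total_episodes, switch_frac=0.5, sharpness=20.0):
    """Sigmoid-style rapid shift toward positive-only advice around a switch point."""
    t = episode / total_episodes
    return 1.0 / (1.0 + np.exp(sharpness * (t - switch_frac)))
```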
Related Work
- Decentralized Training and Decentralized Execution (DTDE)
- Knowledge Sharing
- Agent Communication Methods
- Advising Mechanisms