VLM Agents Generate Their Own Memories: Distilling Experience into Embodied Programs of Thought

Venue
NeurIPS
Year
2024
Authors
Gabriel Sarch, Lawrence Jang, Michael J. Tarr, William W. Cohen, Kenneth Marino, Katerina Fragkiadaki
Topic
ML

๐ŸŒŸ Highlights

  • Overall, it's a decent and interesting paper. Having the LLM form an abstraction of the problem and build upon it is a natural method for improving few-shot performance. The paper implements this idea mostly through prompt engineering. I would have given it a weak accept.

  • This paper seems to borrow a lot from reinforcement learning in spirit, but without actually implementing any of its machinery, since the approach is purely prompt-engineering based.

๐Ÿ“ Summary

LLMs and VLMs excel at few-shot in-context learning but require high-quality example demonstrations in their context window. In response, the authors propose In-Context Abstraction Learning (ICAL), a method that builds a memory of abstracted examples from a series of sub-optimal demos and/or human feedback. The authors show it outperforms the current SOTA on the evaluated benchmarks.

๐Ÿงฉ Key Contributions

  • The authors introduce In-Context Abstraction Learning (ICAL), which transforms raw experiences into 4 types of useful abstractions for in-context learning. These abstractions are then stored and optimized as reusable examples (see the sketch after this list). This contrasts with reinforcement learning, which would optimize through trial and error and myopically focus on improving rewards for the current scene.

  • ICAL works almost exclusively through prompting, and allows learning from both human feedback (i.e., human in the loop) and noisy visual demonstrations. The authors position the natural-language feedback mechanism as an advantage relative to other methods (e.g. DAgger).

  • ICAL also attempts to formulate task and causal abstractions, that is, to show how elements are connected through cause and effect, and to encourage prediction of state changes.

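To make the learn-and-retrieve loop concrete, here is a minimal Python sketch of how I read the ICAL cycle: abstract a noisy demo, refine it with human feedback, store it in memory, and retrieve stored examples as in-context prompts for new tasks. Everything here is illustrative rather than the authors' implementation: `call_vlm` is a hypothetical stand-in for a GPT-4V-style API call, the abstraction-type names are paraphrased rather than the paper's exact taxonomy, and retrieval is simplified (the paper retrieves examples by similarity).

```python
from dataclasses import dataclass, field

def call_vlm(prompt: str) -> str:
    """Hypothetical stand-in for a vision-language model query; replace with a real client."""
    return f"<model response to: {prompt[:40]}...>"

@dataclass
class AbstractedExample:
    trajectory: str                                   # corrected action sequence / plan
    annotations: dict = field(default_factory=dict)   # abstraction type -> text

ABSTRACTION_TYPES = [
    "task and causal relations",   # illustrative names, not the paper's exact list
    "object state changes",
    "temporal abstractions",
    "task construals",
]

def abstract_demo(raw_demo: str, task: str) -> AbstractedExample:
    """Ask the VLM to rewrite a noisy demo and annotate it with abstractions."""
    annotations = {
        kind: call_vlm(f"Task: {task}\nDemo: {raw_demo}\nDescribe the {kind}.")
        for kind in ABSTRACTION_TYPES
    }
    corrected = call_vlm(f"Task: {task}\nDemo: {raw_demo}\nRewrite as a corrected plan.")
    return AbstractedExample(trajectory=corrected, annotations=annotations)

def refine_with_feedback(example: AbstractedExample, feedback: str) -> AbstractedExample:
    """Revise the example using natural-language human feedback after execution."""
    example.trajectory = call_vlm(
        f"Plan: {example.trajectory}\nHuman feedback: {feedback}\nRevise the plan.")
    return example

memory: list[AbstractedExample] = []

def learn(raw_demo: str, task: str, feedback: str) -> None:
    example = refine_with_feedback(abstract_demo(raw_demo, task), feedback)
    memory.append(example)   # only refined examples are kept as in-context memory

def act(new_task: str, k: int = 3) -> str:
    # Naive retrieval of the k most recent examples; the paper uses similarity-based retrieval.
    context = "\n\n".join(e.trajectory for e in memory[-k:])
    return call_vlm(f"Examples:\n{context}\n\nNew task: {new_task}\nProduce a plan.")
```

This sketch is only meant to show why the approach is "purely prompt engineering": every step is a model query, and the only learned artifact is the growing example memory.
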
โœ… Strengths

  • The paper is easily approachable without being overly dense.

  • The three benchmarks (TEACh, VisualWebArena, and Ego4D) target different capabilities of a VLM agent: dialogue, autonomous web tasks, and video action anticipation, respectively.

  • The authors show in the appendix that ICAL is not the best on every task, and in fact HELPER outperforms it in certain settings. I appreciate papers that include results where their model may not always perform the best.

โš ๏ธ Weaknesses / Questions

  • Moderate: Although the method appears to work well on smaller tasks, I am not convinced it will scale to complex tasks and environments. I personally tried to reproduce this work for a complicated task on ChatGPT 5, and it quickly fell apart, missing key details or dropping parts of the abstraction.

  • Question: How often will the system hallucinate parts of the abstraction? When given an atypical task (e.g. creating an atypical expense report for a new type of company classification), does it accurately capture this without injecting hallucinations about standard companies?

๐Ÿ” Related Work

  • Task reward signals
  • Human corrections after failures
  • Using domain experts to hand-write or hand-pick examples without introspection
  • Utilizing language to shape policies
  • VLM agents
  • Instructable Interactive Agents

๐Ÿ“„ Attachments

PDF
๐Ÿ“„ View PDF
Paper Link
๐Ÿ”— External Page