VLM Agents Generate Their Own Memories: Distilling Experience into Embodied Programs of Thought

Venue
NeurIPS
Year
2024
Authors
Gabriel Sarch, Lawrence Jang, Michael J. Tarr, William W. Cohen, Kenneth Marino, Katerina Fragkiadaki
Topic
ML

๐ŸŒŸ Highlights

  • Overall, it's a decent and interesting paper. Having the LLM form an abstraction of the problem and build upon it is a natural method for improving few-shot performance. The paper implements this idea mostly through prompt engineering. I would have given it a weak accept.

  • This paper seems to borrow a lot from reinforcement learning in spirit, but without actually implementing any of its machinery, since the approach is purely prompt-engineering based.

๐Ÿ“ Summary

LLMs and VLMs excel at few-shot in-context learning but require high-quality example demonstrations in their context window. In response, the authors propose In-Context Abstraction Learning (ICAL), a method that builds a memory of abstracted examples from a series of sub-optimal demos and/or human feedback. The authors show it outperforms the current SOTA on the evaluated benchmarks.

๐Ÿงฉ Key Contributions

  • The authors introduce In-Context Abstraction Learning (ICAL), which transforms raw experiences into 4 types of useful abstractions for in-context learning. These abstractions are then stored and optimized as reusable examples (see the sketch after this list). This contrasts with reinforcement learning, which would optimize through trial and error and myopically focus on improving rewards for the current scene.

  • ICAL works almost exclusively through prompting, and allows learning from both human feedback (i.e., human in the loop) and noisy visual demonstrations. The authors position the natural-language feedback mechanism as an advantage relative to other methods (e.g. DAgger).

  • ICAL also attempts to formulate task and causal abstractions, that is, to show how elements are connected through cause and effect, and to encourage prediction of state changes.

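To make the learn-and-retrieve loop concrete, here is a minimal Python sketch of how I read the ICAL cycle: abstract a noisy demo, refine it with human feedback, store it in memory, and retrieve stored examples as in-context prompts for new tasks. Everything here is illustrative rather than the authors' implementation: `call_vlm` is a hypothetical stand-in for a GPT-4V-style API call, the abstraction-type names are paraphrased rather than the paper's exact taxonomy, and retrieval is simplified (the paper retrieves examples by similarity).

```python
from dataclasses import dataclass, field

def call_vlm(prompt: str) -> str:
    """Hypothetical stand-in for a vision-language model query; replace with a real client."""
    return f"<model response to: {prompt[:40]}...>"

@dataclass
class AbstractedExample:
    trajectory: str                                   # corrected action sequence / plan
    annotations: dict = field(default_factory=dict)   # abstraction type -> text

ABSTRACTION_TYPES = [
    "task and causal relations",   # illustrative names, not the paper's exact list
    "object state changes",
    "temporal abstractions",
    "task construals",
]

def abstract_demo(raw_demo: str, task: str) -> AbstractedExample:
    """Ask the VLM to rewrite a noisy demo and annotate it with abstractions."""
    annotations = {
        kind: call_vlm(f"Task: {task}\nDemo: {raw_demo}\nDescribe the {kind}.")
        for kind in ABSTRACTION_TYPES
    }
    corrected = call_vlm(f"Task: {task}\nDemo: {raw_demo}\nRewrite as a corrected plan.")
    return AbstractedExample(trajectory=corrected, annotations=annotations)

def refine_with_feedback(example: AbstractedExample, feedback: str) -> AbstractedExample:
    """Revise the example using natural-language human feedback after execution."""
    example.trajectory = call_vlm(
        f"Plan: {example.trajectory}\nHuman feedback: {feedback}\nRevise the plan.")
    return example

memory: list[AbstractedExample] = []

def learn(raw_demo: str, task: str, feedback: str) -> None:
    example = refine_with_feedback(abstract_demo(raw_demo, task), feedback)
    memory.append(example)   # only refined examples are kept as in-context memory

def act(new_task: str, k: int = 3) -> str:
    # Naive retrieval of the k most recent examples; the paper uses similarity-based retrieval.
    context = "\n\n".join(e.trajectory for e in memory[-k:])
    return call_vlm(f"Examples:\n{context}\n\nNew task: {new_task}\nProduce a plan.")
```

This sketch is only meant to show why the approach is "purely prompt engineering": every step is a model query, and the only learned artifact is the growing example memory.
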
โœ… Strengths

  • The paper is easily approachable without being overly dense.

  • The three benchmarks (TEACh, VisualWebArena, and Ego4D) target different capabilities of a VLM agent: dialogue, autonomous web tasks, and video action anticipation, respectively.

  • The authors show in the appendix that ICAL is not the best on every task, and in fact HELPER outperforms it in certain settings. I appreciate papers that include results where their model may not always perform the best.

โš ๏ธ Weaknesses / Questions

  • Moderate: Although the method appears to work well on smaller tasks, I am not convinced it will scale to complex tasks and environments. I personally tried to reproduce this work for a complicated task on ChatGPT 5, and it quickly fell apart, missing key details or dropping parts of the abstraction.

  • Question: How often will the system hallucinate parts of the abstraction? When given an atypical task (e.g. creating an atypical expense report for a new type of company classification), does it accurately capture this without injecting hallucinations about standard companies?

๐Ÿ” Related Work

  • Task reward signals
  • Human corrections after failures
  • Using domain experts to hand-write or hand-pick examples without introspection
  • Utilizing language to shape policies
  • VLM agents
  • Instructable Interactive Agents

๐Ÿ“„ Attachments

PDF
๐Ÿ“„ View PDF
Paper Link
๐Ÿ”— External Page