Lupe: Integrating the Top-down Approach with DNN Execution on Ultra-Low-Power Devices

Venue
SenSys
Year
2025
Authors
Mingyuan Xiang, Pouya Mahdi Gholami, Henry Hoffmann
Topic
Embedded ML

🌟 Highlights

  • An open-source code-generation framework for running DNNs on ultra-low-power microcontrollers.

  • Overall, I think it's a solid paper. I would have given it an accept.

📝 Summary

The introduction of small yet powerful deep neural network models creates the opportunity to execute these models on ultra-low-power microcontrollers. Although execution is possible through the use of on-board accelerators, such devices are still extremely constrained by limited memory, very low CPU frequency, and intermittent power. To address this, the authors introduce Lupe, a code generation framework that converts a high-level DNN algorithm into code optimized for an ultra-low-power microcontroller using a top-down method and hardware-aware code generation. The authors claim this reduces the average intermittent runtime cost by 96% and 71%, and achieves 12.36x and 2.22x speedups in continuous inference, compared to the current state of the art.

🧩 Key Contributions

  • A code generation framework named Lupe that starts from a top-down DNN structure, and maps it directly onto functions accelerated by the microcontroller's on-board accelerator.

    • By working top-down, it avoids the limitations of the typical bottom-up approach, and the paper makes those limitations explicit.

    • This top-down view allows Lupe to select the most efficient accelerator operation, where one may not be immediately obvious when using a bottom-up approach.

    • This top-down view further allows intermittent-safe deep neural network computation with the addition of the loop continuation technique.

    • It uses a novel atomic logging system that helps it outperform baselines like Hawaii which have a much heavier logging system.

💪 Strengths

  • Lupe is evaluated over a wide range of DNN architectures including ResNet3, DS-CNN, MobileNetV2, LeNet, and MLPClassifier.

  • The paper is well-organized. The sections and areas flow naturally to one another.

  • The authors elected to use an even more constrained device than the ones used by the comparison baselines.

  • Lupe is open-sourced, allowing for reproducibility.

⚠️ Weaknesses / Questions

  • The approach might be very specific to the MSP430 device. Given that much of the research in this area uses this device, that may be less of a weakness.

  • Adding another dataset for each model would have further strengthened the evaluations. Unfortunately, single-dataset evaluation appears to be common practice among related works in this field.

  • It's mentioned that only the smallest model fits into SRAM with Hawaii, so the authors had to resort to chunking for a fair comparison. Chunking increases data movement, which, according to the paper, reduces efficiency. For MLPClassifier, the evaluations between Lupe and Hawaii are quite close; for the other models, Lupe shows a large speedup. I'm interested in how much chunking is affecting Hawaii's numbers. I can't find whether the original Hawaii paper also had to use chunking (both projects use an MSP430), which is a strange discrepancy. I believe the results are still significant regardless.

  • Loop continuation works well here, but it's unclear how it scales to more arbitrary code. For typical CNNs, execution is well defined, predictable, and deterministic, so this may not be a major issue given that the framework exclusively targets well-defined DNNs.

  • There is no real-world deployment or discussion of the impact on programming effort, which makes it difficult to judge how well Lupe will work outside a lab setting. However, since it is open source, it would not be difficult for someone to try it out on their own sensor network.

  • Overall, these weaknesses are minor.

🔍 Related Work

  • Hawaii (Most similar work)
