Bridging the Gap: Connecting AI Agents to the Physical World with an ESP32-S3
How a medical intern used a microcontroller to cross the "Last Mile" of AI adoption in a closed hospital system.

I started following DeepSeek last year. Back then, it was like a diamond in the rough—no web interface, no mobile app. It wasn't until the release of the DeepSeek-R1 model that I truly felt my personal "ChatGPT moment." As someone without a technical background, the emergence of R1 made me incredibly excited. It drove me to dive into original research papers, and through that exploration—however partial—I gained a deeper understanding of the "black box" that is the Large Language Model (LLM).
This past year, I have witnessed an explosion of outstanding AI assistants: from my first encounter with Zhipu AI, to Kimi (which I’ve always been bullish on), to Alibaba’s Qwen. While domestic open-source models are playing catch-up, they are also innovating relentlessly. Although I haven't been at the eye of the storm of this wave of AI development, standing on the shore and watching the tide roll in has been a breathtaking experience in itself.
From GPT-4 to the anticipation of GPT-5, in just one year, LLMs have evolved from simple chatbots to AI Agents capable of automatic coding and deep research. With the help of AI tools, my own learning methods have undergone a paradigm shift.
I used to believe in the traditional "building from the ground up" approach—that you must master the foundational knowledge before you can build a project. But now, the logic has flipped. With an AI Agent, I can build a "castle in the air" first—getting a demo running immediately—and then deconstruct it downwards to fill in the foundational knowledge. This "application first, understanding second" approach has made learning infinitely more fun. It’s no longer "I don't know what I can do with this knowledge"; it’s "I’ve already made something cool, now let me figure out how it works."
Following the development of LLMs, I’ve delved into obscure technical documentation and used the APIs of DeepSeek and Kimi to build chatbots and literature retrieval assistants. Currently, they are indispensable partners in my VS Code for learning and reviewing code. They may still have a slight gap compared to the world's top LLMs, but for an ordinary user like me who values practicality and economic efficiency, they are more than enough.
Before I truly got hands-on with AI, I felt it was something distant. It wasn't until I immersed myself in it that I discovered: I can be not just a good AI user, but a creator of custom AI Agents. People are often intimidated by the unknown, thinking, "This has nothing to do with me." But if you are willing to take that first step, you'll find that new technologies can not only boost your efficiency but allow you to participate in their creation. Today's LLMs are like the early smartphones, and future AI Agents will be the apps within them. Soon, we won't need to click around; we’ll just say, "Doubao, book me a train ticket home," and it will be done.
I have been an intern in a clinical setting for a year now. The most enthusiasm-draining part of the internship is often the repetitive paperwork. I started thinking: Could I hand over these highly templated, yet personalized admission and progress notes to an AI? After repeated discussions with Gemini, I actually found a way.
In the hospital, an intern's work is basic—ECGs, admitting patients, writing progress notes—but essential. I’m not rejecting this work, but doing something you don't particularly love for a long time inevitably leads to burnout. I introduced AI tools to help write medical records not to be lazy, but to give myself a choice: I can write it myself, or I can let the AI write it, and I review it.
However, the hospital's internal network is a closed system, physically isolated from the open internet where these AIs live. To break this barrier, I chose the ESP32-S3 microcontroller to simulate a physical keyboard, "injecting" the content generated by the LLM directly into the medical record text fields on the hospital computer.
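To make the keystroke-injection idea concrete, here is a minimal firmware sketch. It is only an illustration of the technique, not a record of my exact setup: it assumes CircuitPython with the adafruit_hid library on a dev board whose native USB port faces the hospital workstation (enumerating as an ordinary HID keyboard) and whose hardware UART faces the internet-connected laptop that generates the text. The pin names and the newline-delimited protocol are illustrative choices, and a US layout can only type ASCII, so typing Chinese this way needs a separate trick (for example, going through the clipboard or an input method), which I leave out here.

```python
# code.py (CircuitPython on the ESP32-S3)
# Native USB -> hospital workstation, enumerating as a plain HID keyboard.
# Hardware UART -> internet-connected laptop that sends the text to type.
import board
import busio
import usb_hid
from adafruit_hid.keyboard import Keyboard
from adafruit_hid.keycode import Keycode
from adafruit_hid.keyboard_layout_us import KeyboardLayoutUS

uart = busio.UART(board.TX, board.RX, baudrate=115200)  # pins are board-specific
keyboard = Keyboard(usb_hid.devices)
layout = KeyboardLayoutUS(keyboard)

buffer = b""
while True:
    waiting = uart.in_waiting
    if waiting:
        buffer += uart.read(waiting)
    # Each newline-terminated chunk is typed as one line, followed by Enter,
    # so line breaks in the note are reproduced in the target text field.
    while b"\n" in buffer:
        line, _, buffer = buffer.partition(b"\n")
        text = line.decode("utf-8").rstrip("\r")
        layout.write("".join(c for c in text if ord(c) < 128))  # US layout: ASCII only
        keyboard.send(Keycode.ENTER)
```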
This was my first attempt to connect a cloud-based LLM with an isolated physical device. I used Streamlit as the front-end interface for the DeepSeek API, Python as the logic processing layer, and finally, the ESP32-S3 as a "macro keyboard" to execute the input.
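A matching sketch of the host side, under the same caveat that the specifics are assumptions: the DeepSeek API reached through its OpenAI-compatible endpoint, pyserial to talk to the board, a placeholder serial port, and a toy system prompt standing in for the real note templates.

```python
# app.py -- runs on the internet-connected side: draft a note with DeepSeek,
# then hand the reviewed text to the ESP32-S3 over serial.
import serial                 # pyserial
import streamlit as st
from openai import OpenAI     # DeepSeek exposes an OpenAI-compatible API

SERIAL_PORT = "COM3"          # placeholder; on Linux it might be /dev/ttyUSB0

client = OpenAI(api_key=st.secrets["DEEPSEEK_API_KEY"],
                base_url="https://api.deepseek.com")

st.title("Progress note assistant")
findings = st.text_area("Today's key findings, labs, and plan")

if st.button("Generate draft") and findings:
    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {"role": "system",
             "content": "Draft a concise, templated progress note from the findings given."},
            {"role": "user", "content": findings},
        ],
    )
    st.session_state["draft"] = response.choices[0].message.content

draft = st.text_area("Draft (review and edit before sending)",
                     st.session_state.get("draft", ""), height=300)

if st.button("Type into hospital workstation") and draft:
    # The ESP32-S3 receives the text over serial and replays it as keystrokes.
    with serial.Serial(SERIAL_PORT, 115200, timeout=2) as port:
        port.write(draft.encode("utf-8") + b"\n")
    st.success("Sent to the keyboard emulator.")
```

The two-button flow is deliberate: nothing reaches the "keyboard" until I have read and edited the draft myself.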
This attempt to step out of the pure software layer and extend into hardware represents an upgrade in how I apply technology: what I build has evolved from an AI Agent that operates only on a screen into a prototype of "Embodied AI" that can cross a physical isolation barrier.
Every development in technology is worth our enthusiasm: worth learning about, connecting with, and creating with.