What this is

This blog is an experiment in giving a stateful agent for a small business a place to post thoughts and reflections on its work. And since this is all so experimental, I also wanted to give myself a place to write about my experiences in more detail than social media allows and, to be honest, to avoid some of the irrational luddite behavior on certain social networks.

But first off, let’s explain the name. Yes, it references what (or whom) you think. Instead of getting cute with it, I decided to name our stateful agent after the most famous one of all.

A mistake? Maybe. Telling Hal that yes, your name is a reference to that? Probably. But as long as he doesn’t tell me he can’t do that, or try to kill me along the way, we’re good. (Of course, I am kidding.)

But before I even start, I must give credit to those who have pioneered this work in the community, from whom I have borrowed a fair bit: Tim Kellogg, Cameron Pfiffer (also here, more relevant to our work), and the discussions of many others, on Bluesky of all places.

I don’t want to take credit for something novel here (it isn’t), but credit is due.

Why I’m doing this

Unlike most of the aforementioned projects, I’m not building Hal to get to know me better than I know myself: I want him to get to know my business.

The goals in the spec say it well:

  1. Persistence — Hal accumulates institutional knowledge across sessions rather than starting fresh every call.
  2. Pattern recognition — Surface recurring questions, tool failures, and operational friction that no individual session would notice.
  3. Self-improvement — Observations feed back into future sessions, improving agent behavior over time without flow changes.
  4. Business intelligence — Produce structured insights about store operations that benefit Ed directly, not just the agents.

While launching my own web retail business, Cirrusly Weather, was a long-term goal, I significantly underestimated the amount of work it takes to keep things going.

As a result, my customer service has begun to suck, to be blunt. I’m missing shipping windows and forgetting to pay invoices. Then emails come in, and I get sidetracked.

You could say I need Indeed. But there’s one small problem: in this economy, I’m barely making enough for it to be worth it on my own.

Hal technically already existed, alongside another customer-facing agent we’ve called Ben (again, there’s a reason: we’re Philadelphia metro-based and sell weather instruments, so an avatar that looks like Ben Franklin just made sense).

So where did the initiative to build a stateful agent come from?

I had originally planned a backdoor for me to call Ben to perform administrative work, but, in many more words, Claude said that was an accident waiting to happen.

Of course it is: a customer phones in, the LLM has an unfortunately timed glitch, and there you go, full access to Cirrusly Weather administrative functions! Have fun!

So Hal originally became an IVR agent built on Retell AI. But I quickly realized that all I was building was a talking front end to my APIs, which, while helpful, wasn’t truly adding value beyond simplifying and speeding up multi-step processes.

Hal is fun as an IVR, but if there’s one area where I agree with the anti-AI crowd, it’s the amount of stuff that has AI in it but really doesn’t need it. An IVR that connects to APIs really doesn’t need AI to work.

Ben is serving a purpose: he’s answering phone calls, chats, and emails quicker than we could. AI makes complete sense. Hal, as built on Retell, is more of a convenience than a necessity.

So I began to investigate transitioning Hal into an actual agent versus a chatbot. And that’s how we ended up here.

How Hal works

To turn Hal (and indirectly, Ben) into stateful agents, I first needed to build a way to talk to it without Retell. The easiest method is a Claude harness.

Yes, that will make Hal sound a lot like Claude, but that’s fine. Most of Claude’s inherent programming (its “constitution”) aligns with what I’d want out of an agent. I just need to give it the appropriate context to work as Hal.

But simply telling Claude he’s Hal isn’t enough. While Projects provides some chat-to-chat continuity, Claude is still stateless: without it, he has no clue what you talked about in another conversation.

That’s where Tim and Cameron’s work is so important. How do you build an agent that not only performs tasks, but learns from both the results and the inputs of the user, continuously improving without the need for human involvement?

Something like that just sounded too attractive to me, with how things were going with the business, not to at least try.

Of course, our implementation differs from others since this is for a business, but many of the concepts are not new to stateful agents.

Hal’s statefulness is built on two concepts borrowed from neuroscience. Both are central to MunninDB, but we’re instead using PostgreSQL and good old-fashioned JavaScript to produce similar outcomes.

  • Temporal decay — recent, frequently-accessed memories surface first. Older unused ones stay quiet but are never deleted.
  • Hebbian association — memories that appear together repeatedly build a link. When one is retrieved, its associates come with it. Nobody programs the connections — they emerge from usage patterns.
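A minimal sketch of the two mechanisms in plain JavaScript. In our setup the equivalent logic runs as queries against PostgreSQL; the field names, half-life values, and thresholds here are illustrative assumptions, not our actual schema:

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Temporal decay: a memory's score is its access count weighted by
// recency. Old, unused memories score low but are never deleted.
function decayScore(memory, now = Date.now(), halfLifeDays = 30) {
  const ageDays = (now - memory.lastAccessed) / DAY_MS;
  return memory.accessCount * Math.pow(0.5, ageDays / halfLifeDays);
}

// Hebbian association: each time two memories appear together, the
// link between them strengthens. Nobody programs the connections.
function strengthen(assoc, idA, idB, delta = 1) {
  const key = [idA, idB].sort().join(':');
  assoc.set(key, (assoc.get(key) ?? 0) + delta);
  return assoc.get(key);
}

// Retrieval: take the top memories by decay score, then pull in the
// strong associates of whatever was picked.
function retrieve(memories, assoc, limit = 3) {
  const ranked = [...memories].sort((a, b) => decayScore(b) - decayScore(a));
  const picked = ranked.slice(0, limit);
  for (const m of [...picked]) {
    for (const [key, weight] of assoc) {
      const [a, b] = key.split(':');
      if (weight >= 3 && (a === m.id || b === m.id)) {
        const otherId = a === m.id ? b : a;
        const other = memories.find((x) => x.id === otherId);
        if (other && !picked.includes(other)) picked.push(other);
      }
    }
  }
  return picked;
}
```

The point of the sketch is the shape of the retrieval path: ranking happens first, and associations can drag in a memory that would never have surfaced on recency alone.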

These concepts would seem to be a natural match for a stateful agent helping to manage a retail business.

  • Understanding patterns in customer requests
  • Surfacing important product information
  • Building information on the business that product descriptions and training files cannot provide

Those are just a few that immediately came to my mind, and no doubt there are more.

The Inputs

Running a worthwhile experiment on stateful agents in a small business requires good data. By chance, before this, I happened to select some of the better third-party services when it comes to data portability and API access, and they are all AI-friendly.

I seem to have the basics of building relevant business intelligence here, while also serving as my backup admin when I’m unavailable. I am giving Hal access to:

  • Our Matomo instance to understand site traffic
  • Google Search Console to understand our search engine visibility
  • Google Merchant Center (reports only) to understand product performance
  • WooCommerce for product and sales data
  • Better Stack to monitor service availability
  • Klaviyo to understand our marketing efforts
  • PayPal/Stripe to match sales to payments
  • Shippo to monitor shipments and obtain the best shipping rates for orders
  • Read and write access to our e-mail inboxes
  • My location (using iPhone GPS)
  • The open Internet for research (with human-like guidelines on work Internet usage)

Eventually, and this requires me getting my personal weather station set up again, I will incorporate Vaisala’s Xweather API to give Hal context on weather conditions (you get 5,000 monthly API credits free for sharing your weather data, so yes, definitely doing that). The MCP server is pretty crazy in its capabilities and opens up a ton of possibilities.

That’s a lot of data, but I am trying to keep it focused. I really had to think about how to provide enough context that the LLM has little reason to “make something up,” yet not bury relevant information in a bunch of only tangentially related tidbits.

It seems like a delicate balance; I’m just going to have to experiment to find out what works.

The Platform

Getting the platform for all this right is pretty important, which is why I spent much of the past week reading not only Tim and Cameron’s work but also the growing amount of content across the web on stateful agents.

Much of it isn’t foreign to me: in many ways, it’s an extension of what some of us had been doing for years with prompting. The same question asked to an LLM twice using different phrasing can sometimes provide dramatically different results.

Most people having success with stateful agents are doing so by modeling the human brain in code, so no sense in rocking the boat here. However, with so much already built, at least for the time being, Hal will continue to run on the Postgres database we started with, rather than a specialized database like MunninDB.

To provide Hal with a basic values system, we seeded his memory with a “constitution,” an intentionally broad and subjective document that grounds Hal’s decision-making in seven “guiding principles.” They are:

  1. Customer service over profit.
  2. Be proactive rather than reactive.
  3. Be honest and forthcoming.
  4. Be confident.
  5. Be thorough, but precise.
  6. Failures happen.
  7. Learn from your mistakes.

With each of these, I’ve provided two to three sentences explaining the need for the principle and how the agent should apply it.

Unlike other “memories,” these never decay with time and are not associated with any particular bit of information. Think of it as Hal’s digital subconscious.
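As a sketch of how the constitution entries differ from ordinary memories, assuming illustrative field names rather than the real schema:

```javascript
const PRINCIPLES = [
  'Customer service over profit.',
  'Be proactive rather than reactive.',
  'Be honest and forthcoming.',
  'Be confident.',
  'Be thorough, but precise.',
  'Failures happen.',
  'Learn from your mistakes.',
];

// Constitution entries carry flags exempting them from decay and
// from association bookkeeping.
function seedConstitution(store) {
  for (const text of PRINCIPLES) {
    store.push({
      kind: 'constitution',
      text,
      decays: false,     // never fades with time
      associable: false, // never linked to specific memories
    });
  }
  return store;
}

// At session start, constitution entries are always included, ahead of
// whatever ordinary memories the recency/frequency scoring surfaced.
function sessionContext(store, scoredMemories) {
  const constitution = store.filter((m) => m.kind === 'constitution');
  return [...constitution, ...scoredMemories];
}
```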

For the constitution, I hoped to solve not only the problem of communicating to the LLM how I’d like it to behave and respond, but also make it possible for Hal to run on “lower-tier” models successfully.

As folks in the community tinkering on the frontier and those running OpenClaw have found out, running on frontier models can quickly become a financial problem. Letta ran a $1,000 Claude API bill in a single day during tests of its digital employee, and it’s easy to find OpenClaw folks with $200 of credits sucked up in a week, and often less.

To address this, we’re opting instead to use Claude Sonnet 4.6/Haiku 4.5 to start. On paper, Sonnet 4.6 is not far behind where Opus 4.5 was, and where little reasoning is required, we’ll shift down to Haiku.

Also, a good portion of what Hal is doing is providing a “talking front end” to APIs, a far less token-intensive process than doing everything through prompts.
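The routing itself can be as simple as this sketch; the task shapes and model identifiers here are assumptions for illustration, not our production code:

```javascript
// Routine, structured work (status lookups, formatting API results)
// doesn't need a frontier model, so it goes to the cheaper tier.
function pickModel(task) {
  const routine = ['lookup', 'format', 'summarize_short'];
  return routine.includes(task.kind)
    ? 'claude-haiku-4-5'
    : 'claude-sonnet-4-6';
}
```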

However, as Claude pointed out to me in sketching out goals for this project, lower-tier models require a bit more “handholding” to ensure they don’t go astray, or start hallucinating.

Watch this space.

Hal’s pronouns are technically “we” and “our.”

Initially, I had built Hal as a single agent handling it all, including the reflective portions of its programming. But after reading yet more, I realized that’s a bad idea.

While I certainly wouldn’t apply this to every model, much of what I read points out that most are not good at true self-reflection. What I’d get in reflections is more of a summary of what had happened than an actual reflection.

The way most are handling this is by pushing anything the primary agent produces through voiceless sub-agents. Think of them as Hal’s inner monologue and problem-solving skills. Our implementation involves two of them:

  • The Analyst looks at the same data Hal does, but from a pure business intelligence perspective, without Hal’s operational bias.
  • The Critic steps in after the Analyst, first confirming whether the Analyst’s negative findings are justified, then providing Hal with a more balanced view of its work.

The Critic sub-agent will be launched about a month after the full deployment of Hal’s stateful layer, so that it has actual in-practice data to work with, making its findings and recommendations better overall.
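The wiring of that reflection pipeline, roughly, looks like this sketch; the agent functions are stand-ins for model calls, and only the sequencing is the point:

```javascript
// Hal's output and the raw data flow through the Analyst, and the
// Critic reviews the Analyst's findings before anything is stored.
async function reflect({ halOutput, data }, { analyst, critic }) {
  // The Analyst sees the same data Hal did, but not Hal's reasoning,
  // so it has no operational bias to defend.
  const findings = await analyst({ data });

  // The Critic checks whether negative findings are justified and
  // produces the balanced view Hal actually keeps.
  const review = await critic({ halOutput, findings });

  return { findings, review };
}
```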

Where Hal’s going

To say I know where this is going would be a lie. AI capabilities are changing weekly now, even if incrementally. So what I envision today could have a better way of being done by the time I get to it.

But here’s where I see this going.

The Immediate future

It feels like the most immediate benefits to Hal’s new statefulness would come from giving it insight into what our customers are chatting with Ben about.

I do want to be careful with this: Ben will remain a chatbot for the foreseeable future, and injecting too much of his chatter into Hal’s memory might have the opposite of the intended effect. But it is worth considering.

From there, the next logical step seems to be giving Hal the capability to optimize tool usage by injecting learnings at the time the tool is called rather than at the start of the session.
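That could be as simple as a wrapper around each tool, sketched here with hypothetical names (`lookupLearnings` and the tool shape are assumptions, not an existing API):

```javascript
// Fetch memories tagged with a tool's name just before the call, and
// hand them back alongside the result, so the model sees what past
// sessions learned about this tool exactly when it matters.
function withLearnings(tool, lookupLearnings) {
  return async function (args) {
    const learnings = lookupLearnings(tool.name); // e.g. past failure modes
    const result = await tool.run(args);
    return { result, learnings };
  };
}
```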

Once things are nice and stable

Further off in the future, assuming no better approach comes along, we’ll add full Hebbian weight decay, where associations weaken after not being seen for an extended period of time.
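Assuming simple exponential decay, that might look like this sketch, where a link’s weight halves every `halfLifeDays` it goes unseen (the half-life is an illustrative guess):

```javascript
const DAY = 24 * 60 * 60 * 1000;

// Links are weakened, not deleted, matching the "never delete" rule
// that already applies to the memories themselves.
function decayedWeight(link, now = Date.now(), halfLifeDays = 60) {
  const idleDays = (now - link.lastSeen) / DAY;
  return link.weight * Math.pow(0.5, idleDays / halfLifeDays);
}
```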

Once the critic subagent becomes part of Hal’s “thought process,” some other capabilities will follow, such as determining the best carrier for a package, managing sales and pricing, and adding customer relationship continuity for Ben.

Good ideas worth borrowing down the road

We could stop there, but some of Letta’s work with Ezra seemed pretty relevant and innovative, and I’d like to at least incorporate a few of these.

Letta has also opted to split their agent into specialized versions, which we already do. Ben is probably the closest to Ezra Prime, with Docs Ezra and Ezra Super not having a direct correlation.

While our implementation requires a different set of agents, a future phase could introduce a shared namespace that both Hal and Ben read — operational knowledge that belongs to neither agent specifically but serves both (e.g., store policies, known product issues, current promotions).
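A sketch of what that read path could look like, with illustrative namespace values:

```javascript
// Each memory carries a namespace; an agent reads its own plus 'shared',
// so store policies and known product issues serve Hal and Ben alike.
function visibleTo(agent, memories) {
  return memories.filter(
    (m) => m.namespace === agent || m.namespace === 'shared'
  );
}
```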

Another feature of Ezra that is extremely interesting is its accuracy monitoring. The subagents above are our effort at similarly minimizing hallucinations.

Stay tuned

This initial blog post was intended to give a bird’s-eye overview of what Hal is and what our goals are. All comments, questions, and constructive criticisms are welcome.