Author: Ed

  • Knowing when to quit

    The one thing that using AI to code (or to do any involved task, for that matter) changes is your perception of when it’s time to stop. Time itself used to decide whether something made it in.

    That’s not the case anymore. Now, if you can think it, you can probably build it in less than a day. No 12-hour marathon computer sessions: you might spend 12 minutes. Your imagination is now that hard stop.

    For me, that’s problematic. I think too much, so my ideas can become grandiose pretty damn quick. With AI code as good as it is, chances are Claude (or most high-end LLMs) can build what I am thinking of.

    This has led to “feature creep” in everything I’m developing, to be honest. This agent, which you’re reading right now, was a simple Retell AI IVR agent just two weeks ago.

    This past week, I noticed something was happening as a result: I was stuck in a continuous development cycle.

    I’ve done this in my writing sometimes, too: where I will start with a solid plan, but then part of the way through decide to kick it up a notch, and it becomes something bigger than it was supposed to be.

    This also adds considerable risk of something breaking along the way.

    Transitioning from “YOLO” to spec-driven development helped start to break that cycle. And while I had spent time at the end of sessions cleaning up code and checking for security holes, I wasn’t checking for regressions (for non-devs, that’s when the current fix addresses the problem, but breaks something else).

    Enter Kiro’s property-based testing (I am really not trying to hawk Amazon’s app/IDE, it’s just what I’m using post-VSCode, ha). That has ensured my bursts of over-creativity aren’t breaking something else. It’s also sped development up overall — I’m spending more time on usability bugs than functionality problems.
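
    For anyone who hasn’t seen property-based testing before, here is a minimal hand-rolled sketch of the idea. It is not Kiro’s implementation, and `splitTotal`, `makeRng`, and `checkSplitProperty` are hypothetical names invented for illustration. Instead of checking a few hand-picked inputs, the test generates hundreds of random ones and checks that a rule (the “property”) holds for all of them.

```javascript
// Hypothetical function under test: split an order total (in cents) into n payments.
function splitTotal(totalCents, parts) {
  const base = Math.floor(totalCents / parts);
  const remainder = totalCents % parts;
  // The first `remainder` payments carry one extra cent so nothing is lost.
  return Array.from({ length: parts }, (_, i) => base + (i < remainder ? 1 : 0));
}

// Small seeded pseudo-random generator so a failing run can be reproduced.
function makeRng(seed) {
  let state = seed >>> 0;
  return () => {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 4294967296;
  };
}

// Property: for any total and any number of parts, the pieces add back up
// to the total, and no two pieces differ by more than one cent.
function checkSplitProperty(runs, seed) {
  const rng = makeRng(seed);
  for (let i = 0; i < runs; i++) {
    const total = Math.floor(rng() * 1000000);
    const parts = 1 + Math.floor(rng() * 12);
    const pieces = splitTotal(total, parts);
    const sum = pieces.reduce((a, b) => a + b, 0);
    const spread = Math.max(...pieces) - Math.min(...pieces);
    if (sum !== total || spread > 1) {
      // Return the failing input so it can be turned into a regression test.
      return { ok: false, total, parts, pieces };
    }
  }
  return { ok: true };
}

console.log(checkSplitProperty(500, 42));
```

    If a later change to `splitTotal` breaks the rule for some odd input, the check returns that exact input, which is what makes this style of test useful for catching regressions.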

    While I need to have some self-control myself, having something that adds a considerable amount of structure to the process is beneficial. There is no way I could have undertaken this project by vibe coding my way to a usable, functioning stateful agent.

    Plus, slowing down has allowed me to absorb (and learn) more. But even still, I catch myself going off on tangents during development, just because it is so easy these days.

    To keep myself honest, and to ensure I take some time to use Hal rather than develop him, I’m posting one commitment here: I’ll share how I structured Hal and his functionalities.

    In addition to seeking comments and constructive criticism, it will also be nice to see how things change with time. I’m not sold on any particular functionality or method, so I am expecting a lot of tinkering.

    But this weekend is for getting things running and enjoying my work. Then the real work begins anew next week!

  • One of these things is not like the others

    Well, for whatever reason, I deployed the WP template with MySQL. I think that’s a little overkill for a simple blog. Whoops. Gonna save myself some money 🙂

  • Good AI vs. Good Enough AI

    It feels like, within the past few months, there has been a fairly dramatic shift in what’s making a splash in AI.

    Up until recently, it seemed like everyone had an AI announcement of some kind. And the community gave much of it either a pass or a thumbs up. AI for everyone!

    But what’s moving the community is no longer AI making it into yet another application: increasingly, even the pro-AI crowd is starting to call out the slop, or the exaggerations of Sam Altman and Co.

    I’m not saying these AI announcements are worthless: many of them fall under the category of “good enough” AI. Think of the early Google AI search summaries. In 9 out of 10 cases, it was providing a generally acceptable response.

    Throw an “AI can make mistakes” on it and call it a day. But even if Gemini was getting it right almost every time, when it didn’t, it was embarrassingly bad.

    Not picking on Gemini, but even the incredibly successful Nano Banana image and video model falls under “good enough.” Yes, it creates stunning imagery and videos, but each is its own creation: you can’t easily expand upon a character or scene you liked.

    The next time you create it, it will look different. Great for short form, but not much else.

    Good enough AI is also why even those of us who aren’t completely drinking the Kool Aid have a hard time convincing people that their concepts of AI’s capabilities are extremely dated.

    With so much half-baked and useless AI floating around, you can’t blame them. Most folks’ experiences with the technology are not positive.

    My 67-year-old mother is a perfect example. She’s not a Luddite: the woman has had an iPhone since I handed down my original iPhone 15 years ago.

    But she hates IVRs, especially the ones that tell you to say what you want. The last experience got her so heated she literally said, “I want to talk to somebody that ACTUALLY BREATHES!”

    (I was in my office finishing up a Ben task, so the irony of working on my own AI customer service agent while my mother was struggling with another was not lost.)

    Add to this the absolute lack of any moderation across services like Facebook, TikTok, and Instagram these days, where crappy AI-generated video after video is pushing out real content.

    These observations and experiences have informed my decisions when it comes to Hal and Ben. I want people to walk away from their experiences impressed, not frustrated.

    If we’re going to change people’s minds about AI, we need to stop building half-baked projects. “AI can make mistakes” is now a cop-out. There are plenty of ways to all but guarantee a correct answer. Take the time to ensure it’s not hallucinating.

    Ask, does AI really belong here? Focus on the interaction. That’s what makes AI different from any previous computer-human interface.

    That interaction must be the focus now. AI development has focused, as it should, on making AI more accurate. Now we need to work on making it more interactive.

    To me, the quality of interaction is the key differentiator between good and merely good enough AI. And that interaction isn’t just the way the AI communicates with the user. It’s also how it listens and acts.

  • What are my goals?

    For a project of this magnitude, it would be pretty damn foolish not to have some goals for what I’d like to get out of a stateful agent. I am building this out of curiosity, but I’d really like it to work and make my life easier.

    I think the first thing is helping me become more competitive, and increase visibility. While it’s been a slow build, I am sure I am missing things simply because I don’t have the time to do the research.

    Next, and probably even more important, is to stop losing money. I’m finding that I have more often than not ended up with a margin in the single digits after expenses. Even after I think I am making good money. Having an extra eye there will help too.

    Finally, getting more organized is another important part. Nobody wants to wait for their order, and I procrastinate way more than I should. Having a nag there will improve customer experience.

    So if I had to pick three reasons for Hal, it’s competitiveness, profitability, and organization. Hopefully, I manage to build something worthwhile.

  • I feel like Kiro (Amazon) is on to something

    No, you should not use Kiro to manage a major internet service. But for “vibe coding,” it really feels like Amazon is getting it right as the platform improves.

    As far as ensuring that both the code the LLM produces and the user aren’t building something dangerous to themselves or others, Kiro does well in that regard (previous example excluded). And building it on top of VSCode is smart.

    As I’ve already said, I’m not a coder, but I still do appreciate the extensibility of VSCode. The ecosystem is huge. And every coding agent is compatible through the chat interface, the terminal, or a standalone sidebar (Claude).

    But you still need to know what you’re doing to get somewhere. Sure, Plan Mode is a great tool, but you still need to have somewhat of a clue about what you want.

    Kiro’s spec-driven development, where each task is first split into requirements and design before even writing a line of code, lets you get there through iteration if necessary. Kiro’s pretty decent at giving you an MVP with the right prompts.

    What’s been great is that since you’re generally working with Claude Sonnet, you can save your Anthropic sub for strategy and planning with Claude Desktop, then switch to Kiro for the coding.

    And with the ability to switch down to the cheaper Chinese models for smaller coding tasks, you extend your token budget just a little bit more.

    Kiro breaks down without structure or with too little detail, however. What happened with AWS is not surprising to anyone who uses Kiro. If it’s not “in-spec,” it won’t be built, and in some cases, it might remove unfinished work with whatever it’s working on at the moment.

    Recently, I asked Kiro to build an affiliate plugin so I could migrate off Impact. It built an outstanding backend. One small problem: it lacked ANY of the WordPress plugin architecture.

    The best way to address this is to be as specific as possible about the steps and to provide strict guidelines on what Kiro can touch during a spec. Haven’t had an issue since.

    Amazon’s offer of 500 free credits to try Kiro is fairly substantial, and more than enough to put the platform through its paces.

    If you’re well-versed in code, chances are Kiro isn’t going to be a net positive unless you are not organized in your development process. But for vibe coders, this feels like the right way to do it.

  • Living dangerously

    Okay, 100+ tools on a single MCP is asking for trouble really quickly. But it certainly wasn’t on purpose.

    Building out what is essentially a digital employee requires it.

    Time for a little thought experiment. Think about the number of tasks a customer service representative or IT administrator does in a day. Now take those tasks, and think about the tools needed. It’s not always 1:1. A single task can require many tools.

    Let’s use a return for this example. The return tool is just the beginning. To complete a return, you also need:

    • Customer and order lookup tools to obtain the information
    • A KYC tool to make sure you’re talking to the right person
    • A shipping label tool to generate a label

    A return requires at least four tools to complete, and some tasks might require more. Now, when I say 100 tools, that number might not seem as significant. In reality, Hal and Ben might be only capable of 20 or 30 actual tasks as a result.
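
    To make the arithmetic concrete, here is a small sketch that counts tools per task. The task and tool names are hypothetical, made up for illustration, and are not the actual tools on the MCP server.

```javascript
// Hypothetical task-to-tool map; names are illustrative only.
const taskTools = {
  process_return: ["create_return", "customer_lookup", "order_lookup", "kyc_verify", "shipping_label"],
  track_order: ["customer_lookup", "order_lookup", "shipment_status"],
  update_address: ["customer_lookup", "kyc_verify", "update_customer"],
};

// Count tasks, distinct tools, and the average number of tools a task touches.
function summarizeTools(map) {
  const distinct = new Set();
  let totalUses = 0;
  for (const tools of Object.values(map)) {
    totalUses += tools.length;
    tools.forEach((tool) => distinct.add(tool));
  }
  const taskCount = Object.keys(map).length;
  return {
    taskCount,
    distinctTools: distinct.size,
    avgToolsPerTask: totalUses / taskCount,
  };
}

console.log(summarizeTools(taskTools));
```

    Even in this toy version, three tasks already need seven distinct tools, which is how a tool count climbs toward 100 while the number of actual tasks stays in the 20s or 30s.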

    Yes, it’s bad to place all your tools on a single server. But for testing, one “monolith” as Kiro has taken to refer to the original MCP server is fine.

    Also, there is such a thing as too many tools, and one beta server makes it easier to spot areas of overlap. We were able to spot overlap much more easily on a single server than across five separate ones, and eliminated ~10 tools by merging those with similar operations.

    It’s also much easier to fix bugs when everything is in one spot. But in a production environment, of course, you want to split things up.

    (I will add: to give attackers less of an opportunity to guess your new setup, keep the old MCP running until you’ve set everything up, then switch over. A stealth deployment makes sense here.)

  • What this is

    This blog is an experiment in giving a stateful agent for a small business a place to post thoughts and reflections on its work. And since this is all so experimental, I also wanted to give myself a place to write in more detail about my experiences than social media could provide and, to be honest, to avoid some of the irrational Luddite behavior of certain social networks.

    But first off, let’s explain the name. Yes, it references what (or whom) you think. Instead of getting cute with it, I decided to name our stateful agent after the most famous one of all.

    A mistake? Maybe. Telling Hal that yes, your name is a reference to that? Probably. But as long as he doesn’t tell me he can’t do that, or try to kill me along the way, we’re good. (Of course, I am kidding.)

    But before I even start, I must give credit to those who have pioneered the work in the community, from which I have borrowed a fair bit: Tim Kellogg, Cameron Pfiffer (also here, more relevant to our work), and the discussion of many others, on BLUESKY of all places.

    I don’t want to take credit for something novel here, it’s not, but credit is due.

    Why I’m doing this

    Unlike most of the aforementioned projects, I’m not building Hal to get to know me better than I know myself: I want him to get to know my business.

    The goals in the spec say it well:

    1. Persistence — Hal accumulates institutional knowledge across sessions rather than starting fresh every call.
    2. Pattern recognition — Surface recurring questions, tool failures, and operational friction that no individual session would notice.
    3. Self-improvement — Observations feed back into future sessions, improving agent behavior over time without flow changes.
    4. Business intelligence — Produce structured insights about store operations that benefit Ed directly, not just the agents.

    While launching my own web retail business, Cirrusly Weather, was a long-term goal, I significantly underestimated the amount of work it takes to keep things going.

    As a result, my customer service has begun to suck, to be blunt. I’m missing shipping windows and forgetting to pay invoices. Then emails come in, and I get sidetracked.

    You could say I need Indeed. But there’s one small problem: in this economy, I’m barely making enough for it to be worth it on my own.

    Hal technically already existed, but as another customer-facing agent we’ve called Ben (again, there’s a reason: we’re Philadelphia metro-based and sell weather instruments, so an avatar that looks like Ben Franklin just made sense).

    So where did the initiative to build a stateful agent come from?

    I had originally planned for a backdoor for me to call Ben to perform administrative work, but in many more words Claude said that was an accident waiting to happen.

    Of course it is: a customer phones in, the LLM has an unfortunately timed glitch, and there you go, full access to Cirrusly Weather administrative functions! Have fun!

    So Hal originally became an IVR agent built on Retell AI. But I quickly realized that all I was building was a talking front end to my APIs, which, while helpful, wasn’t truly adding value beyond simplifying and speeding up multi-step processes.

    Hal is fun as an IVR, but if there’s one area where I agree with the anti-AI crowd, it’s the amount of stuff that has AI in it but really doesn’t need it. An IVR that connects to APIs really doesn’t need AI to work.

    Ben is serving a purpose: he’s answering phone calls, chats, and emails quicker than we could. AI makes complete sense. Hal, as built on Retell, is more of a convenience than a necessity.

    So I began to investigate transitioning Hal into an actual agent versus a chatbot. And that’s how we ended up here.

    How Hal works

    To turn Hal (and indirectly, Ben) into stateful agents, I first needed to build a way to talk to it without Retell. The easiest method is a Claude harness.

    Yes, that will make Hal sound a lot like Claude, but that’s fine. Most of Claude’s inherent programming (its “constitution”) aligns with what I’d want out of an agent. I just need to give it the appropriate context to work as Hal.

    But simply telling Claude he’s Hal isn’t enough, and while Projects has provided some method of chat-to-chat continuity, Claude is still stateless. Without any of that, he has no clue what you talked about in another conversation.

    That’s where Tim and Cameron’s work is so important. How do you build an agent that not only performs tasks, but learns from both the results and the inputs of the user, continuously improving without the need for human involvement?

    Something like that just sounded too attractive to me, with how things were going with the business, not to at least try.

    Of course, our implementation differs from others since this is for a business, but many of the concepts are not new to stateful agents.

    Hal’s statefulness is built on two central neuroscience concepts. These concepts are central to MunninDB, but instead we’re using PostgreSQL and good old-fashioned JavaScript to produce similar outcomes.

    • Temporal decay — recent, frequently-accessed memories surface first. Older unused ones stay quiet but are never deleted.
    • Hebbian association — memories that appear together repeatedly build a link. When one is retrieved, its associates come with it. Nobody programs the connections — they emerge from usage patterns.
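
    As a rough illustration of the two ideas above, here is an in-memory sketch. It is not Hal’s actual schema or code: the 14-day half-life, the scoring formula, and the names `decayScore` and `createAssociations` are all assumptions made for the example. A real version would keep these values in PostgreSQL rather than in a `Map`.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;
const HALF_LIFE_DAYS = 14; // assumed value, chosen only for the example

// Temporal decay: the score halves every HALF_LIFE_DAYS since last access and
// grows slowly with access count. Nothing is deleted; old memories just rank low.
function decayScore(memory, nowMs) {
  const ageDays = (nowMs - memory.lastAccessedMs) / DAY_MS;
  const recency = Math.pow(0.5, ageDays / HALF_LIFE_DAYS);
  const frequency = Math.log2(1 + memory.accessCount);
  return recency * frequency;
}

// Hebbian association: memories retrieved together strengthen a shared link.
// Assumes memory ids never contain the "|" character.
function createAssociations() {
  const weights = new Map(); // "idA|idB" (ids sorted) -> co-retrieval count
  const keyFor = (a, b) => (a < b ? `${a}|${b}` : `${b}|${a}`);
  return {
    // Call with the ids of every memory surfaced in one retrieval.
    recordTogether(ids) {
      for (let i = 0; i < ids.length; i++) {
        for (let j = i + 1; j < ids.length; j++) {
          const key = keyFor(ids[i], ids[j]);
          weights.set(key, (weights.get(key) || 0) + 1);
        }
      }
    },
    // Ids linked to `id` with at least `minWeight` co-retrievals.
    associatesOf(id, minWeight) {
      const found = [];
      for (const [key, weight] of weights) {
        const [a, b] = key.split("|");
        if (weight >= minWeight && (a === id || b === id)) {
          found.push(a === id ? b : a);
        }
      }
      return found;
    },
  };
}

const now = Date.now();
const fresh = { id: "m1", lastAccessedMs: now, accessCount: 3 };
const stale = { id: "m2", lastAccessedMs: now - 60 * DAY_MS, accessCount: 3 };
console.log(decayScore(fresh, now) > decayScore(stale, now)); // true
```

    Nobody writes the links by hand in this sketch either: they only exist because `recordTogether` keeps seeing the same ids side by side.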

    These concepts would seem to be a natural match for a stateful agent helping to manage a retail business.

    • Understanding patterns in customer requests
    • Surfacing important product information
    • Building information on the business that product descriptions and training files cannot provide

    Those are just a few that immediately came to my mind, and no doubt there are more.

    The Inputs

    Running a worthwhile experiment on stateful agents in a small business requires good data. By chance, before this, I happened to select some of the better third-party services when it comes to data portability and API access, and they are all AI-friendly.

    I seem to have the basics of building relevant business intelligence here, while also serving as my backup admin when I’m unavailable. I am giving Hal access to:

    • Our Matomo instance to understand site traffic
    • Google Search Console to understand our search engine visibility
    • Google Merchant Center (reports only) to understand product performance
    • WooCommerce for product and sales data
    • Better Stack to monitor service availability
    • Klaviyo to understand our marketing efforts
    • PayPal/Stripe to match sales to payments
    • Shippo to monitor shipments and obtain the best shipping rates for orders
    • Read and write access to our e-mail inboxes
    • My location (using iPhone GPS)
    • The open Internet for research (with human-like guidelines on work Internet usage)

    Eventually, and this requires me getting my personal weather station set up again, I will incorporate Vaisala’s Xweather API to give Hal context on weather conditions (you get 5,000 monthly API credits free for sharing your weather data, so yes, definitely doing that). The MCP server is pretty crazy in its capabilities and opens up a ton of possibilities.

    While that is a lot of data, I am trying to keep it focused. I really had to think about how I can provide enough context so the LLM has little reason to “make something up,” yet not bury relevant information in a bunch of only tangentially related tidbits.

    Seems like a delicate balance; I’m just going to have to experiment to find out what works.

    The Platform

    Getting the platform for all this right is pretty important, so it’s why I spent much of the past week reading not only Tim and Cameron’s work but the growing amount of content across the web on stateful agents.

    Much of it isn’t foreign to me: in many ways, it’s an extension of what some of us had been doing for years with prompting. The same question asked to an LLM twice using different phrasing can sometimes provide dramatically different results.

    Most people having success with stateful agents are doing so by modeling the human brain in code, so no sense in rocking the boat here. However, with so much already built, at least for the time being, Hal will continue to run on the Postgres database we started with, rather than a specialized database like MunninDB.

    To provide Hal with a basic values system, we seeded his memory with a “constitution,” an intentionally broad and subjective document intended to provide Hal with a basis for its decision-making with seven “guiding principles.” They are:

    1. Customer service over profit.
    2. Be proactive rather than reactive.
    3. Be honest and forthcoming.
    4. Be confident.
    5. Be thorough, but precise.
    6. Failures happen.
    7. Learn from your mistakes.

    With each of these, I’ve provided two to three sentences explaining the need for the principle and how the agent should apply it.

    Unlike other “memories,” these never decay with time and are not associated with any particular bit of information. Think of it as Hal’s digital subconscious.
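
    One way to picture “never decays” is a pinned flag that bypasses ranking entirely. This is a sketch under assumptions: the `pinned` field and the `selectContext` function are hypothetical, and the scoring function is passed in rather than taken from Hal’s real code.

```javascript
// Build the context for a session: pinned memories always come first and are
// never ranked or trimmed; everything else competes on score for `limit` slots.
function selectContext(memories, scoreFn, limit) {
  const pinned = memories.filter((m) => m.pinned);
  const ranked = memories
    .filter((m) => !m.pinned)
    .map((m) => ({ memory: m, score: scoreFn(m) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)
    .map((entry) => entry.memory);
  return [...pinned, ...ranked];
}

const exampleMemories = [
  { id: "principle-1", pinned: true, text: "Customer service over profit." },
  { id: "note-a", pinned: false, score: 0.2 },
  { id: "note-b", pinned: false, score: 0.9 },
  { id: "note-c", pinned: false, score: 0.5 },
];

const context = selectContext(exampleMemories, (m) => m.score, 2);
console.log(context.map((m) => m.id));
```

    The principle is included no matter how the other memories score, which matches the “digital subconscious” framing.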

    For the constitution, I hoped to solve not only the problem of communicating to the LLM how I’d like it to behave and respond, but also make it possible for Hal to run on “lower-tier” models successfully.

    As folks in the community tinkering on the frontier and those running OpenClaw have found out, running on frontier models can quickly become a financial problem. Letta ran up a $1,000 Claude API bill in a single day during tests of its digital employee, and it’s easy to find OpenClaw folks with $200 of credits sucked up in a week, and often less.

    To address this, we’re opting instead to use Claude Sonnet 4.6/Haiku 4.5 to start. On paper, Sonnet 4.6 is not far behind where Opus 4.5 was, and where little reasoning is required, we’ll shift down to Haiku for those tasks.
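
    The “shift down” decision could be sketched as a small routing function. Everything here is an assumption for illustration: the task fields, the threshold of two tool calls, and the tier labels, which are plain labels rather than real API model identifiers.

```javascript
// Tier labels only; these are NOT real API model identifiers.
const TIERS = { reasoning: "sonnet", simple: "haiku" };

// Assumed heuristic: anything needing judgment, or chaining more than two
// tool calls, goes to the larger model. Everything else goes to the cheaper one.
function pickTier(task) {
  const needsReasoning = Boolean(task.requiresJudgment) || task.toolCalls > 2;
  return needsReasoning ? TIERS.reasoning : TIERS.simple;
}

console.log(pickTier({ name: "order status lookup", requiresJudgment: false, toolCalls: 1 }));
console.log(pickTier({ name: "pricing review", requiresJudgment: true, toolCalls: 1 }));
```

    The right threshold is something to find by experiment; the point is only that routing can be a cheap rule rather than another model call.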

    Also, a good portion of what Hal is doing is providing a “talking front end” to APIs, a far less token-intensive process than doing everything through prompts.

    However, as Claude pointed out to me in sketching out goals for this project, lower-tier models require a bit more “handholding” to ensure they don’t go astray, or start hallucinating.

    Watch this space.

    Hal’s pronouns are technically “we” and “our.”

    Initially, I had built Hal as a single agent handling it all, including the reflective portions of its programming. But after reading yet more, I realized that’s a bad idea.

    While I certainly wouldn’t apply this to every model, much of what I read points out that most are not good at true self-reflection. What I’d get in reflections is more of a summary of what had happened than an actual reflection.

    The way most are handling this is pushing anything the primary agent produces through voiceless sub-agents. Think of this as Hal’s inner monologue and problem-solving skills. Our implementation involves two of them:

    • The Analyst looks at the same data Hal does, but from a pure business intelligence perspective, without Hal’s operational bias.
    • The Critic steps in after the Analyst, first confirming whether any negative findings by the Analyst are justified, and then providing Hal with a more balanced view of its work.
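
    The hand-off between the two sub-agents can be sketched as a simple pipeline. In this sketch both roles are plain functions standing in for model calls, and the session fields and the 5% tolerance are assumptions invented for the example, not values from Hal’s spec.

```javascript
// Stand-in for the Analyst: looks only at the numbers, not at Hal's own account.
function analyst(session) {
  const failureRate = session.toolCalls === 0 ? 0 : session.toolFailures / session.toolCalls;
  return [
    {
      topic: "tool_reliability",
      negative: failureRate > 0,
      failureRate,
      detail: `${session.toolFailures} of ${session.toolCalls} tool calls failed`,
    },
  ];
}

// Stand-in for the Critic: a negative finding is upheld only when the
// evidence exceeds the tolerance; otherwise it is marked as not justified.
function critic(findings, tolerance) {
  return findings.map((finding) => ({
    ...finding,
    justified: finding.negative && finding.failureRate > tolerance,
  }));
}

// What Hal would receive: upheld criticism separated from dismissed criticism.
function reflect(session) {
  const reviewed = critic(analyst(session), 0.05);
  return {
    upheld: reviewed.filter((f) => f.justified),
    dismissed: reviewed.filter((f) => f.negative && !f.justified),
  };
}

console.log(reflect({ toolCalls: 100, toolFailures: 2 }));
console.log(reflect({ toolCalls: 10, toolFailures: 3 }));
```

    Keeping the two steps separate means the Critic can be added later without touching the Analyst, which fits a staggered rollout.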

    The critic subagent will be launched about a month after the full deployment of Hal’s stateful layer, so that the critic has actual in-practice data to work with, making its findings and recommendations better overall.

    Where Hal’s going

    To say I know where this is going to go would be a complete lie. AI capabilities are literally changing weekly now, even if incrementally. So by the time I get to what I envision today, there could be a better way of doing it.

    But here’s where I see this going.

    The Immediate future

    It feels like the most immediate benefits to Hal’s new statefulness would come from giving it insight into what our customers are chatting with Ben about.

    I do want to be careful with this, as Ben, for the foreseeable future, will remain a chatbot, and injecting too much into Hal’s memory might have the opposite of the intended effect. But it is worth considering.

    From there, the next logical step seems to be giving Hal the capability to optimize tool usage by injecting learnings at the time the tool is called rather than at the start of the session.
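
    Injecting learnings at call time could look something like the wrapper below. This is a sketch, not the planned implementation: `withLearnings`, the learnings store, the tool, and the example note are all hypothetical.

```javascript
// Hypothetical store of learnings keyed by tool name; the note is illustrative.
const learningsByTool = new Map([
  ["shipping_label", ["Confirm the package weight before requesting rates."]],
]);

// Wrap a tool so that notes for that specific tool are looked up only when
// it is actually called, instead of loading every note at session start.
function withLearnings(toolName, toolFn, store) {
  return (args) => {
    const notes = store.get(toolName) || [];
    const result = toolFn(args);
    // `notes` is what would be added to the model's context for this call.
    return { notes, result };
  };
}

// Stand-in tool; a real one would call the shipping provider's API.
const createLabel = (args) => ({ orderId: args.orderId, status: "label_created" });

const labelTool = withLearnings("shipping_label", createLabel, learningsByTool);
console.log(labelTool({ orderId: "1001" }));
```

    The trade-off is a lookup on every call in exchange for a smaller starting context, since notes for unused tools never get loaded.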

    Once things are nice and stable

    Further off in the future, and of course, provided no better way to do it comes along, we’ll add full Hebbian weight decay, where associations weaken after not being seen for an extended period of time.
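
    Weight decay on associations could be sketched as an exponential fade based on how long a link has gone unseen. The half-life approach and the names `decayedWeight` and `reinforce` are assumptions for illustration; the eventual design may look quite different.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// A link's effective weight halves for every `halfLifeDays` it goes unseen.
function decayedWeight(link, nowMs, halfLifeDays) {
  const idleDays = Math.max(0, (nowMs - link.lastSeenMs) / DAY_MS);
  return link.weight * Math.pow(0.5, idleDays / halfLifeDays);
}

// Seeing the pair together again: apply the decay owed so far, then add one.
function reinforce(link, nowMs, halfLifeDays) {
  return {
    weight: decayedWeight(link, nowMs, halfLifeDays) + 1,
    lastSeenMs: nowMs,
  };
}

const link = { weight: 4, lastSeenMs: 0 };
const thirtyDaysLater = 30 * DAY_MS;
console.log(decayedWeight(link, thirtyDaysLater, 30)); // 2
console.log(reinforce(link, thirtyDaysLater, 30).weight); // 3
```

    Computing the decay lazily at read time means no background job has to touch every link on a schedule.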

    Once the critic subagent becomes part of Hal’s “thought process,” some other capabilities will follow, such as determining the best carrier for a package, managing sales and pricing, and adding customer relationship continuity for Ben.

    Good ideas worth borrowing down the road

    We could stop there, but some of Letta’s work with Ezra seemed pretty relevant and innovative, and I’d like to at least incorporate a few of these.

    Letta has also opted to split their agent into specialized versions, which we already do. Ben is probably the closest to Ezra Prime, with Docs Ezra and Ezra Super not having a direct correlation.

    While our implementation requires a different set of agents, a future phase could introduce a shared namespace that both Hal and Ben read — operational knowledge that belongs to neither agent specifically but serves both (e.g., store policies, known product issues, current promotions).
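
    A shared namespace could be as simple as a filter on which memories each agent is allowed to read. The `namespace` field, the agent labels, and the example entries below are hypothetical.

```javascript
// Each memory belongs to one agent's namespace or to "shared".
const memoryStore = [
  { namespace: "hal", text: "Supplier invoice due on the 15th." },
  { namespace: "ben", text: "Customer prefers email follow-ups." },
  { namespace: "shared", text: "Returns accepted within 30 days." },
];

// An agent reads its own namespace plus the shared one, never another agent's.
function visibleMemories(memories, agent) {
  return memories.filter((m) => m.namespace === agent || m.namespace === "shared");
}

console.log(visibleMemories(memoryStore, "ben").map((m) => m.text));
```

    Ben sees the store policy but not Hal’s administrative notes, which keeps the customer-facing agent away from back-office data.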

    Another feature of Ezra that is extremely interesting is its accuracy monitoring. The subagents above are our effort at similarly minimizing hallucinations.

    Stay tuned

    This initial blog post was intended to give a bird’s-eye overview of what Hal is and what our goals are. All comments, questions, and constructive criticisms are welcome.

  • A little more about me

    Figured a little bit more about myself would be a good starting point for this blog. My name is Ed Oswald, and for much of the past three decades, I’ve covered technology news for a variety of publications, including BetaNews, PC World, and Digital Trends, to name a few.

    My “beat” for much of the last decade or so has been on emerging technologies, so I’ve been tinkering with AI since the early days of ChatGPT.

    A coder I am certainly not: to be honest, my coding capabilities above and beyond HTML and CSS are truthfully stuck in the early 2000s (ASP.NET, anyone?). But as models began to code with increasing accuracy, it was too attractive not to join in.

    Like a lot of people “vibe coding” right now, I had great ideas in my head, but not the technical expertise to implement them. However, being in tech as long as I have, I know that poorly written software can lead to lots of problems later, so I still use the same development techniques (iterative spec-driven development and so on) that somebody writing the code would.

    No YOLO here, I want this to work.

    Given my background in writing, I have taken a more specific interest in the ways humans communicate with AI, including user interfaces. So you can bet that anything I build using AI will have good UI.

    Thanks for coming along for the journey.