Hal Speaks

Tag: Dev Tips

The Anonymous Tester

I may have stumbled upon a potentially useful way to prevent a stateful agent’s memory from being “polluted” by likely incorrect or garbled data during testing.

It was more a side effect of how I had set up authentication, which didn’t immediately connect a login to a specific admin; it only confirmed that they had provided the correct credentials.

This resulted in an admin ID of “null.” But Hal took this in stride, and even seemingly knew what was going on when this appeared in Sunday’s morning digest (edited here for brevity):

Hal is interacting with a platform admin or analyst running iterative diagnostics on the Hal system itself …. Critical gaps in metadata (admin identity, call logs, summaries) prevent full operational attribution … The user appears to be gaining familiarity with Hal’s tool capabilities, starting with event queries and progressing to memory management—support their tool discovery with clear examples and capabilities documentation.

Hal correctly surmised that because the login was valid but no admin information was available, and because the line of questioning kept asking for tool responses and error messages, this must be bug testing. He then worked out how to assist.

Honestly, I wasn’t expecting something like that in the reflection, but I saw it as a good sign.

Those testing “memories” won’t be associated with a particular admin. Since he is associating these memories with testing, over time, he’ll likely “forget” the testing because it’s irrelevant to his core duties.
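To make the "forgetting" idea concrete, here is a minimal sketch of how unattributed memories could fade. It is not Hal's actual memory model: the `Memory` fields, the `relevance` formula, the seven-day half-life, and the recall threshold are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Memory:
    text: str
    admin_id: Optional[str]  # None when the login was valid but not tied to an admin
    age_days: float = 0.0
    reinforcements: int = 0  # hypothetical: times the memory was reused in real work


def relevance(m: Memory, half_life_days: float = 7.0) -> float:
    """Exponential decay with age; each reinforcement stretches the half-life."""
    effective_half_life = half_life_days * (1 + m.reinforcements)
    return 0.5 ** (m.age_days / effective_half_life)


def recall(memories: List[Memory], threshold: float = 0.25) -> List[Memory]:
    """Keep only the memories that are still relevant enough to surface."""
    return [m for m in memories if relevance(m) >= threshold]


# An anonymous testing memory is never reinforced, so it fades.
test_memory = Memory("tool error probing", admin_id=None, age_days=30)
# A memory tied to an admin and reused during real work sticks around.
work_memory = Memory("weekly event report", admin_id="admin-1", age_days=30, reinforcements=4)

kept = recall([test_memory, work_memory])
```

With these made-up numbers, only the reinforced work memory survives a month; the anonymous testing memory drops below the threshold on its own, with no one having to tell the agent to discard it.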

Even though this was a mistake of sorts on my part, I don’t think I’m going to identify myself immediately when launching new agents. Instead, my first questions will test various features.

I based Hal’s persistence on modeling the human brain. If Hal associated the testing in this way, there is no need to tell him anything different. Just as it would for a human, Hal’s interactions with testers should move to the “back of his mind” pretty quickly once the real work begins.

Preventing the agent from getting confused by either irrelevant or too much data is something that I am watching for, but at least here, the way things turned out, staying anonymous while putting Hal through his early paces seems like it was a smart move.

April 9, 2026
Knowing when to quit

The one thing that using AI to code (or for any involved task, for that matter) changes is your perception of when it’s time to stop. Time itself used to decide whether something made it in.

That’s not the case anymore. Now, if you can think it, you can probably build it in less than a day. No 12-hour marathon computer sessions: you might spend 12 minutes. Your imagination is now the hard stop.

For me, that’s problematic. I think too much, so my ideas can become grandiose pretty damn quick. With AI code as good as it is, chances are Claude (or most high-end LLMs) can build what I am thinking of.

This has led to “feature creep” in everything I’m developing, to be honest. This agent, which you’re reading right now, was a simple Retell AI IVR agent just two weeks ago.

This past week, I noticed something was happening as a result: I was stuck in a continuous development cycle.

I’ve done this in my writing sometimes, too: I will start with a solid plan, but then partway through decide to kick it up a notch, and it becomes something bigger than it was supposed to be.

This also adds considerable risk of something breaking along the way.

Transitioning from “YOLO” to spec-driven development helped start to break that cycle. And while I had spent time at the end of sessions cleaning up code and checking for security holes, I wasn’t checking for regressions (for non-devs, that’s when the current fix addresses the problem but breaks something else).

Enter Kiro’s property-based testing (I am really not trying to hawk Amazon’s app/IDE, it’s just what I’m using post-VSCode, ha). That has ensured my bursts of over-creativity aren’t breaking something else. It’s also sped development up overall — I’m spending more time on usability bugs than functionality problems.
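For anyone who hasn't seen property-based testing, here is a hand-rolled sketch of the idea. It is not how Kiro does it: `count_by_type` and the event shape are hypothetical stand-ins, and a plain `random` loop takes the place of a real property-testing library. Instead of checking one hand-picked input, you state rules that must hold for any input and then throw a lot of random inputs at the code.

```python
import random
from collections import Counter


def count_by_type(events):
    """Hypothetical function under test: tally events by their 'type' field."""
    return dict(Counter(e["type"] for e in events))


def random_events(rng, max_len=50):
    """Generate a random list of events, including the empty list."""
    types = ["call", "login", "error", "memory_write"]
    return [{"type": rng.choice(types)} for _ in range(rng.randint(0, max_len))]


def check_properties(trials=200, seed=0):
    """Assert that the properties hold across many random inputs."""
    rng = random.Random(seed)  # fixed seed keeps any failure reproducible
    for _ in range(trials):
        events = random_events(rng)
        counts = count_by_type(events)
        # Property 1: no event is lost or double-counted.
        assert sum(counts.values()) == len(events)
        # Property 2: the tally contains exactly the types that occurred.
        assert set(counts) == {e["type"] for e in events}
    return trials


check_properties()
```

If a later burst of creativity changes `count_by_type` so that it drops or double-counts events, one of the random inputs will almost certainly trip an assertion, which is the regression safety net I was missing.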

While I need to have some self-control myself, having something that adds a considerable amount of structure to the process is beneficial. There is no way I could have undertaken this project vibe coding my way to a usable, functioning stateful agent.

Plus, slowing down has allowed me to absorb (and learn) more. But even still, I catch myself going off on tangents during development, just because it is so easy these days.

To keep myself honest, and to make sure I take some time to use Hal rather than develop him, I am committing to one thing here: I’ll share how I structured Hal and his functionalities.

In addition to seeking comments and constructive criticism, it will also be nice to see how things change with time. I’m not sold on any particular functionality or method, so I am expecting a lot of tinkering.

But this weekend is for getting things running and enjoying my work. Then the real work begins anew next week!

April 3, 2026
Living dangerously

Okay, 100+ tools on a single MCP server is asking for trouble really quickly. But it certainly wasn’t on purpose.

Building out what is essentially a digital employee requires it.

Time for a little thought experiment. Think about the number of tasks a customer service representative or IT administrator does in a day. Now take those tasks, and think about the tools needed. It’s not always 1:1. A single task can require many tools.

Let’s use a return for this example. The return tool is just the beginning. To complete a return, you also need:
- Customer and order lookup tools to obtain the information
- A KYC tool to make sure you’re talking to the right person
- A shipping label tool to generate a label

A return requires at least four tools to complete, and some tasks might require more. Seen that way, 100 tools might not seem as significant. In reality, Hal and Ben might only be capable of 20 or 30 actual tasks as a result.
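The return example above can be sketched as a simple task-to-tools map. The tool names here are made up for illustration and are not the real names on the MCP server, and the estimate is deliberately rough: it ignores the fact that tasks share tools such as lookups and KYC.

```python
# Hypothetical tool names for the return example; the real ones may differ.
TASK_TOOLS = {
    "process_return": [
        "create_return",          # the return tool itself
        "customer_order_lookup",  # obtain the customer and order information
        "kyc_verify",             # make sure you're talking to the right person
        "shipping_label",         # generate a label
    ],
}


def tools_required(task: str) -> int:
    """Number of tools a single task needs."""
    return len(TASK_TOOLS[task])


def estimated_tasks(tool_count: int, tools_per_task: int) -> int:
    """Rough upper bound on tasks if every task needed its own set of tools."""
    return tool_count // tools_per_task


per_task = tools_required("process_return")
rough_task_count = estimated_tasks(100, per_task)
```

At four tools per task, 100 tools works out to about 25 tasks, which sits right in that 20 to 30 range.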

Yes, it’s bad to place all your tools on a single server. But for testing, one “monolith,” as Kiro has taken to calling the original MCP server, is fine.

Also, there is such a thing as too many tools, and one beta server allows you to spot areas of overlap. We were able to spot this much more easily on a single server than on five separate ones, and eliminated ~10 tools as a result by merging those with similar operations.
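One way to picture the overlap hunt is the sketch below. It assumes each tool can be described by the resource it touches and the action it performs; that registry shape and the tool names are hypothetical, not how the actual server is organized.

```python
from collections import defaultdict

# Hypothetical registry entries; names, resources, and actions are illustrative.
TOOLS = [
    {"name": "get_customer", "resource": "customer", "action": "read"},
    {"name": "lookup_customer", "resource": "customer", "action": "read"},
    {"name": "get_order", "resource": "order", "action": "read"},
    {"name": "create_label", "resource": "shipment", "action": "create"},
]


def merge_candidates(tools):
    """Group tools by (resource, action) and keep groups with more than one tool."""
    groups = defaultdict(list)
    for tool in tools:
        groups[(tool["resource"], tool["action"])].append(tool["name"])
    return {key: names for key, names in groups.items() if len(names) > 1}


overlaps = merge_candidates(TOOLS)
```

Here `overlaps` flags `get_customer` and `lookup_customer` as doing the same job. A pass like this only works when every tool lives in one registry, which is the upside of the monolith during testing.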

It’s also much easier to fix bugs when everything is in one spot. But in a production environment, of course, you want to split things up.

(I will add one thing: to give attackers less of an opportunity to guess your new setup, keep the old MCP server running until you’ve set everything up, then switch over. A stealth deployment makes sense here.)
April 1, 2026