Tag: AI-Assisted Development

  • Week 1 Check-In: Going Better than I Thought

    Here’s the first of my weekly check-ins, where I round up the work that I didn’t mention in a particular blog post during the past week. It will also serve as a way for me and you to monitor progress.

    So, being the first week of operation, not a whole lot to say just yet. Surprisingly enough, about 90% of Hal’s code was bug-free at launch, so most tools worked “out of the box,” but many were still quite rough around the edges. A lot of this week has been spent bug-squashing.

    Wins

    1. I gotta credit Claude Sonnet and Kiro here. I had the idea, but it wouldn’t have worked if Claude Code/Kiro hadn’t coded it so well. These bugs are annoyances, not show stoppers. To be honest, I don’t think Opus would have done materially better, at least not enough to justify the added cost (Sonnet 4.6 is basically Opus 4.5 anyway).
    2. Hal is inferring things without us even telling him. A billing error caused IONOS to shut down our server: we noted that Ben had already figured out it was likely an external issue based on the available data and not a crash. I wasn’t expecting that.
    3. Costs remain low. The biggest one-day expense so far has been $1. A code bug put Hal on Sonnet briefly this week. Had it not, I would have spent only $2.00 for the entire week!
    A slow, gradual increase…

    Challenges

    1. Hal is helpful, perhaps too much so. I’m monitoring that he isn’t hallucinating tool calls again or promising things he can’t do. I’m calling it “overeagerness.”
    2. Hal isn’t truly autonomous just yet. He’s still operating on a set schedule for the most part.
    3. UI design for the web front end is proving a bit trickier than I had thought. This is an area I want to focus on: OpenClaw requires setup out of the box, while Hal ships with a UI that works on any device and feels a lot like Claude Desktop or ChatGPT. But getting elements to work has been a hassle.
    Hal looks like Claude and ChatGPT on purpose, making it easy to use for anyone.

    Notable New Features

    A lot of work this week ended up being in monitoring and security. I can honestly say our server is now as prepared as we can make it for any AI-caused security hell on its way. Hal is actively monitoring for attacks using CleanTalk, and can block access to our site through both CleanTalk and Bunny.net, our CDN.

    He’s also got monitoring for our deployments on Railway. It’s basic at the moment, but we’ll know of issues (and attacks) faster than ever before, and have the tools to diagnose and restart services if necessary.

    Best of all? By next week, he’ll be connected to our Better Stack account, commenting on incidents with full summaries of his findings and any actions.

    Something like this can easily run a company thousands of dollars a month: heck, for even the most basic premium functionality, Better Stack is $25/month, per user.

    We’re also working on a feature to bring some more autonomy to Hal’s workday. Based on Strix’s Perch Time, Hal’s Heartbeat is a scheduled work period every two hours throughout the day. These are intelligently scheduled by Hal based on workload and the task context itself.
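
    To make the Heartbeat idea a little more concrete, here is a minimal Python sketch. It lays out work periods at a fixed two-hour interval and then picks the most urgent tasks that fit into one period. Everything in it is an assumption for illustration: the names `heartbeat_times`, `Task` and `plan_heartbeat`, the 08:00 to 20:00 window, the arbitrary date, and the priority and time estimates. Hal’s real scheduler isn’t shown in this post and may work quite differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


def heartbeat_times(start: datetime, end: datetime, every_hours: int = 2) -> list[datetime]:
    """Return heartbeat start times from start to end (inclusive)."""
    times = []
    current = start
    while current <= end:
        times.append(current)
        current += timedelta(hours=every_hours)
    return times


@dataclass
class Task:
    name: str
    priority: int  # hypothetical score: higher means more urgent
    minutes: int   # hypothetical time estimate


def plan_heartbeat(tasks: list[Task], budget_minutes: int) -> list[Task]:
    """Greedily pick the most urgent tasks that fit in one heartbeat's time budget."""
    chosen = []
    remaining = budget_minutes
    for task in sorted(tasks, key=lambda t: t.priority, reverse=True):
        if task.minutes <= remaining:
            chosen.append(task)
            remaining -= task.minutes
    return chosen


# Arbitrary example day; the working window is an assumption.
day_start = datetime(2026, 1, 5, 8, 0)
day_end = datetime(2026, 1, 5, 20, 0)
beats = heartbeat_times(day_start, day_end)
```

    In this sketch the interval is fixed and only the task selection adapts to workload. Letting the agent move the heartbeats themselves would need an extra step that isn’t modeled here.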

    This Week’s Goals

    My goal for the upcoming week is to finally squash the remaining data glitches. For some reason, Hal can pull tools on demand, but they’re not appearing in the morning email digest.

    Another goal is to get Hal to use his heartbeat to work on a proposed action without me prompting him to. As it’s a new tool, I’m not expecting autonomous use just yet.

  • The Anonymous Tester

    I may have stumbled upon a potentially useful way to prevent a stateful agent’s memory from being “polluted” by likely incorrect or garbled data during testing.

    It was more of a consequence of how I had set up authentication, which didn’t immediately connect a login to a specific admin – just that they had provided the correct authentication.

    This resulted in an admin ID of “null.” But Hal took this in stride, and even seemingly knew what was going on when this appeared in Sunday’s morning digest (edited here for brevity):

    Hal is interacting with a platform admin or analyst running iterative diagnostics on the Hal system itself …. Critical gaps in metadata (admin identity, call logs, summaries) prevent full operational attribution … The user appears to be gaining familiarity with Hal’s tool capabilities, starting with event queries and progressing to memory management—support their tool discovery with clear examples and capabilities documentation.

    Hal correctly surmised that, because the login was valid but no admin information was available, and because the line of questions was asking for tool responses and error messages, this must be bug testing. He even worked out how to assist.

    Honestly, I wasn’t expecting something like that in the reflection, but I saw it as a good sign.

    Those testing “memories” won’t be associated with a particular admin. Since he is associating these memories with testing, over time, he’ll likely “forget” the testing because it’s irrelevant to his core duties.

    Even though this was a mistake of sorts on my part, I don’t think I’m going to identify myself immediately when launching new agents. Instead, my first questions will test various features.

    I based Hal’s persistence on modeling the human brain. If Hal associates the testing in this way, there’s no need to tell him anything different. And just as they would for a human, Hal’s interactions with testers should move to the “back of his mind” pretty quickly once the real work begins.

    Preventing the agent from getting confused by either irrelevant or too much data is something that I am watching for, but at least here, the way things turned out, staying anonymous while putting Hal through his early paces seems like it was a smart move.
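
    Here is a rough Python sketch of how that “forgetting” could play out. It assumes each memory carries an optional admin ID (`None` for the anonymous tester) and a relevance score that decays with age unless the memory keeps being recalled. The `Memory` fields, the `relevance` formula, the seven-day half-life and the threshold are all hypothetical. They are not Hal’s actual memory model, only one way the behavior described above might be approximated.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Memory:
    text: str
    admin_id: Optional[str]  # None: login was valid but not tied to a specific admin
    age_days: float
    times_recalled: int = 0


def relevance(memory: Memory, half_life_days: float = 7.0) -> float:
    """Hypothetical score: halves every half_life_days, boosted by each recall."""
    decay = 0.5 ** (memory.age_days / half_life_days)
    return decay * (1 + memory.times_recalled)


def recall(memories: list[Memory], admin_id: Optional[str], threshold: float = 0.25) -> list[Memory]:
    """Return this admin's memories that are still above the relevance threshold."""
    return [m for m in memories if m.admin_id == admin_id and relevance(m) >= threshold]


memories = [
    Memory("ran event query diagnostics", admin_id=None, age_days=21),
    Memory("prefers the morning digest", admin_id="admin-1", age_days=21, times_recalled=9),
]
```

    With these made-up numbers, the three-week-old testing memory has dropped below the threshold and no longer surfaces, while the equally old but frequently recalled admin memory still does.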

  • Knowing when to quit

    The one thing that using AI to code (or for any involved task, for that matter) changes is your perception of when it’s time to stop. Time itself used to decide whether something made it in.

    That’s not the case anymore. Now, if you can think it, you can probably build it in less than a day. No 12-hour marathon computer sessions: you might spend 12 minutes. Your imagination is now that hard stop.

    For me, that’s problematic. I think too much, so my ideas can become grandiose pretty damn quick. With AI code as good as it is, chances are Claude (or most high-end LLMs) can build what I am thinking of.

    This has led to “feature creep” in everything I’m developing, to be honest. This agent, which you’re reading about right now, was a simple Retell AI IVR agent just two weeks ago.

    This past week, I noticed something was happening as a result: I was stuck in a continuous development cycle.

    I’ve done this in my writing sometimes, too: I will start with a solid plan, but then partway through decide to kick it up a notch, and it becomes something bigger than it was supposed to be.

    This also adds considerable risk of something breaking along the way.

    Transitioning from “YOLO” to spec-driven development helped start to break that cycle. And while I had spent time at the end of sessions cleaning up code and checking for security holes, I wasn’t checking for regressions (for non-devs, that’s when the current fix addresses the problem but breaks something else).

    Enter Kiro’s property-based testing (I am really not trying to hawk Amazon’s app/IDE, it’s just what I’m using post-VSCode, ha). That has ensured my bursts of over-creativity aren’t breaking something else. It’s also sped development up overall — I’m spending more time on usability bugs than functionality problems.
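
    For anyone who hasn’t seen property-based testing before, here is a minimal sketch of the idea using only Python’s standard library. Instead of checking a few hand-picked inputs, it throws many random inputs at a function and asserts rules that must hold every time. The function `dedupe_keep_order` and the helper `check_properties` are made-up examples, not anything Kiro generates, and this is not how Kiro implements its tests. Dedicated libraries such as Hypothesis do the same job far more thoroughly.

```python
import random


def dedupe_keep_order(items: list[int]) -> list[int]:
    """Function under test: drop repeats while keeping first-seen order."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


def check_properties(trials: int = 200, seed: int = 0) -> int:
    """Run the function on random inputs and assert properties that must always hold."""
    rng = random.Random(seed)  # fixed seed so a failure can be reproduced
    for _ in range(trials):
        data = [rng.randint(0, 9) for _ in range(rng.randint(0, 20))]
        out = dedupe_keep_order(data)
        assert len(out) == len(set(out))                  # no duplicates remain
        assert set(out) == set(data)                      # nothing lost or invented
        assert dedupe_keep_order(out) == out              # running it twice changes nothing
        assert out == sorted(set(data), key=data.index)   # first-seen order is kept
    return trials


checked = check_properties()
```

    If a later change breaks any of those rules, the check fails on the next run, which is the kind of regression a one-off manual test tends to miss.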

    While I need to have some self-control myself, having something that adds a considerable amount of structure to the process is beneficial. There is no way I could have undertaken this project by vibe coding my way to a usable, functioning stateful agent.

    Plus, slowing down has allowed me to absorb (and learn) more. Even so, I catch myself going off on tangents during development, just because it is so easy these days.

    One thing I am posting here to keep myself honest and ensure I take some time to use Hal rather than develop him: I’ll share how I structured Hal and his functionalities.

    In addition to seeking comments and constructive criticism, it will also be nice to see how things change with time. I’m not sold on any particular functionality or method, so I am expecting a lot of tinkering.

    But this weekend is for getting things running and enjoying my work. Then the real work begins anew next week!

  • I feel like Kiro (Amazon) is on to something

    No, you should not use Kiro to manage a major internet service. But for “vibe coding,” it really feels like Amazon is getting it right as the platform improves.

    When it comes to ensuring that neither the LLM’s code nor the user ends up building something dangerous to themselves or others, Kiro does well (previous example excluded). And building it on top of VSCode is smart.

    As I’ve already said, I’m not a coder, but I still do appreciate the extensibility of VSCode. The ecosystem is huge. And every coding agent is compatible through the chat interface, the terminal, or a standalone sidebar (Claude).

    But you still need to know what you’re doing to get somewhere. Sure, Plan Mode is a great tool, but you still need to have somewhat of a clue about what you want.

    Kiro’s spec-driven development, where each task is first split into requirements and design before a single line of code is written, lets you get there through iteration if necessary. Kiro’s pretty decent at giving you an MVP with the right prompts.

    What’s been great is that since you’re generally working with Claude Sonnet, you can save your Anthropic sub for strategy and planning with Claude Desktop, then switch to Kiro for the coding.

    And with the ability to switch down to the cheaper Chinese models for smaller coding tasks, you extend your token budget just a little bit more.

    Kiro breaks down without structure or with too little detail, however. What happened with AWS is not surprising to anyone who uses Kiro. If it’s not “in-spec,” it won’t be built, and in some cases, it might remove unfinished work as part of whatever it’s working on at the moment.

    Recently, I asked Kiro to build an affiliate plugin so I could migrate off Impact. It built an outstanding backend. One small problem: it lacked ANY of the WordPress plugin architecture.

    The best way to address this is to be as specific as possible about the steps and to provide strict guidelines on what Kiro can touch during a spec. Haven’t had an issue since.

    Amazon’s offer of 500 free credits to try Kiro is fairly substantial, and more than enough to put the platform through its paces.

    If you’re well-versed in code, chances are Kiro isn’t going to be a net positive unless your development process is disorganized. But for vibe coders, this feels like the right way to do it.