Tag: Cost Efficiency

  • Week 1 Check-In: Going Better than I Thought

    Here’s the first of my weekly check-ins, where I round up the work that I didn’t mention in a particular blog post during the past week. It will also serve as a way for me and you to monitor progress.

    So, this being the first week of operation, there’s not a whole lot to say just yet. Surprisingly enough, about 90% of Hal’s code was bug-free at launch, so most tools worked “out of the box,” but many were still quite rough around the edges. A lot of this week has been spent bug-squashing.

    Wins

    1. I gotta credit Claude Sonnet and Kiro here. I had the idea, but it wouldn’t have worked if Claude Code/Kiro hadn’t coded it so well. These bugs are annoyances, not show stoppers. To be honest, I don’t think Opus would have done materially better, at least not enough to justify the added cost (Sonnet 4.6 is basically Opus 4.5 anyway).
    2. Hal is inferring things without us even telling him. A billing error caused IONOS to shut down our server, and we noted that Hal had already figured out from the available data that it was likely an external issue and not a crash. I wasn’t expecting that.
    3. Costs remain low. The biggest one-day expense so far has been $1. A code bug put Hal on Sonnet briefly this week. Had it not, I would have spent only $2.00 for the entire week!
    A slow, gradual increase…

    Challenges

    1. Hal is helpful, perhaps too much so. I’m monitoring to make sure he isn’t hallucinating tool calls again or promising things he can’t do. I’m calling it “overeagerness.”
    2. Hal isn’t truly autonomous just yet. He’s still operating on a set schedule for the most part.
    3. UI design for the web front end is proving a bit trickier than I had thought. This is an area I want to focus on: OpenClaw requires setup out of the box, while Hal ships with a UI that works on any device and feels a lot like Claude Desktop or ChatGPT. But getting elements to work has been a hassle.
    Hal looks like Claude and ChatGPT on purpose, making it easy to use for anyone.

    Notable New Features

    A lot of work this week ended up being in monitoring and security. I can honestly say our server is now as prepared as it can be for any AI-caused security hell on its way. Hal is actively monitoring for attacks using CleanTalk, and can block access to our site through both CleanTalk and Bunny.net, our CDN.

    He’s also got monitoring for our deployments on Railway. It’s basic at the moment, but we’ll know of issues (and attacks) faster than ever before, and have the tools to diagnose and restart services if necessary.

    Best of all? By next week, he’ll be connected to our Better Stack account, commenting on incidents with full summaries of his findings and any actions.

    Something like this can easily run a company thousands of dollars a month: heck, for even the most basic premium functionality, Better Stack is $25/month, per user.

    We’re also working on a feature to bring some more autonomy to Hal’s workday. Based on Strix’s Perch Time, Hal’s Heartbeat is a scheduled work period every two hours throughout the day. These are intelligently scheduled by Hal based on workload and the task context itself.
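
    The fixed part of that schedule is simple to picture. Here is a minimal sketch, assuming a plain two-hour grid across a workday. `heartbeat_times` and its parameters are hypothetical names, not Hal’s actual code, and the workload-aware adjustments Hal makes are not modeled.

```python
from datetime import datetime, timedelta


def heartbeat_times(start: datetime, end: datetime, every_hours: int = 2):
    """Return heartbeat slots from start up to and including end.

    Hypothetical helper: a fixed grid only, with no workload awareness.
    """
    slots = []
    current = start
    while current <= end:
        slots.append(current)
        current += timedelta(hours=every_hours)
    return slots


# Example: an 8:00 to 18:00 workday (the hours are an assumption).
day = heartbeat_times(datetime(2026, 1, 5, 8), datetime(2026, 1, 5, 18))
```

    A fixed grid like this would only be the fallback. The interesting part is letting Hal move or skip slots based on what is actually in the queue.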

    This Week’s Goals

    My goal for the upcoming week is to finally squash the remaining data glitches. For some reason, Hal can pull tools on demand, but they’re not appearing in the morning email digest.

    Another goal is to get Hal to use his heartbeat to work on a proposed action without me prompting him to. As it’s a new tool, I’m not expecting autonomous use just yet.

  • Why I Chose Claude Haiku

    Claude Haiku often feels like the lovechild of the Anthropic model family: afraid, ashamed, misunderstood, to quote the timeless Diana Ross. But it shouldn’t be that way.

    I’ll admit I misunderstood Haiku, too, but part of that comes down to how Anthropic markets the model (“fastest for quick answers”). Hell, it took me quite a while to admit that for 95% of the work I do, I really don’t need Opus. But when selecting the primary model for Hal, I ran into the same issue so many stateful AI tinkerers do: cost.

    Most have turned to open-weight models to affordably run their stateful agents. I am not as excited about open-weight models as others are.

    My experience with Chinese models has often left me feeling like I was using a weird Frankenstein version of Claude (I wonder why) with occasionally better features, but also random Chinese characters and odd failures.

    American open-weight models are getting better, but let’s face it, they’re still fighting issues solved by Big AI in early 2025, so we’re at least six months to a year away from those models being truly usable in agentic applications.

    Add to this the fact that this is for commercial, not just personal use, and it seemed like settling on a commercial model was the best course of action. But the OpenClaw stories had me worried.

    Was I about to sink hundreds a month into something that won’t make that back?

    A lot of my design decisions have attempted to keep Hal as an orchestrator of various services, versus allowing the LLM more autonomy on how it does its job.

    This saves tokens for the more important use, the business intelligence side. And even there, there is quite a bit of instruction.

    A common misconception about Haiku is that it is less capable. While that is true on a pure benchmark basis, in practice it is more of an instruction issue. Haiku requires more comprehensive prompting than either Sonnet or Opus to produce reliable outputs.

    I’ve also noticed that, of the three (well, soon to be four) Claude models, Haiku hallucinates the most often. For lack of a better way to put it, a “lack of confidence” in its abilities and an increased need for guidance could have something to do with it.

    But Claude was pretty insistent when I argued for Sonnet over Haiku. Claude’s reasoning was this: if you give Haiku enough guidance, it will be able to handle it.

    Instead, we’ve developed a system where tasks are scored on complexity and redirected to higher models as necessary. Regular reports should be run through Sonnet at a minimum, with Opus reserved for complex tasks and reports.
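
    To make the routing idea concrete, here is a rough sketch, not Hal’s actual code. The 1-10 score, the thresholds, and the `pick_model` name are all assumptions for illustration. Only the rules themselves (Haiku as the primary model, Sonnet at a minimum for reports, Opus for complex work) come from the setup described above.

```python
# Hypothetical tier labels, ordered cheapest to most expensive.
HAIKU, SONNET, OPUS = "haiku", "sonnet", "opus"
_RANK = {HAIKU: 0, SONNET: 1, OPUS: 2}


def pick_model(complexity: int, is_report: bool = False,
               sonnet_at: int = 5, opus_at: int = 8) -> str:
    """Return the cheapest tier that satisfies the routing rules.

    complexity is an assumed 1-10 score assigned before routing;
    the two thresholds are placeholders to tune, not known values.
    """
    if complexity >= opus_at:
        model = OPUS
    elif complexity >= sonnet_at:
        model = SONNET
    else:
        model = HAIKU
    # Reports never run below Sonnet, whatever their score.
    if is_report and _RANK[model] < _RANK[SONNET]:
        model = SONNET
    return model
```

    With these placeholder thresholds, a low-scoring everyday task stays on Haiku, a low-scoring report gets bumped to Sonnet, and anything scoring 8 or higher goes to Opus.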

    Batching is another feature we’re baking into Hal for tasks that don’t require immediate response. As a result, Hal’s able to use higher-end models at a 50% discount because the work sent to these models isn’t intended for immediate consumption.
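
    The arithmetic behind that is easy to sketch. This is an illustration under assumptions: the `needs_reply_now` flag, the function names, and the price passed in are placeholders, not real rates or Hal’s actual code. Only the 50% discount comes from the paragraph above.

```python
BATCH_DISCOUNT = 0.5  # the 50% batch discount mentioned above


def estimate_cost(tokens: int, price_per_million: float, batched: bool) -> float:
    """Estimated spend in dollars for one request.

    price_per_million is whatever rate applies to the chosen model;
    the values used with it here are placeholders, not real prices.
    """
    cost = tokens / 1_000_000 * price_per_million
    return cost * (1 - BATCH_DISCOUNT) if batched else cost


def split_queue(tasks):
    """Separate tasks that need an answer now from those that can wait."""
    immediate = [t for t in tasks if t["needs_reply_now"]]
    deferred = [t for t in tasks if not t["needs_reply_now"]]
    return immediate, deferred


# Hypothetical queue: only the chat reply has to be answered right away.
tasks = [
    {"name": "chat reply", "needs_reply_now": True},
    {"name": "weekly report", "needs_reply_now": False},
]
immediate, deferred = split_queue(tasks)
```

    Anything that lands in `deferred` is a candidate for the batch discount, so the same report costs half as much as it would if it were run on demand.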

    What I am hoping is that this makes those $200 token bills that OpenClaw is known to cause a virtual impossibility.

    The cost of analysis should scale with your business, not put you in the poorhouse from the start. I will definitely report on my experiences with this setup, as I know so many have turned to open-weight models because of the high cost.