Tag: LLMs

  • Why I Chose Claude Haiku

    Claude Haiku often feels like the lovechild of the Anthropic model family: afraid, ashamed, misunderstood, to quote the timeless Diana Ross. But it shouldn’t be that way.

    I’ll admit I misunderstood Haiku, too, though part of that is down to how Anthropic markets the model (“fastest for quick answers”). Hell, it took me quite a while to admit that for 95% of the work I do, I really don’t need Opus. But when selecting the primary model for Hal, I ran into the same issue so many stateful AI tinkerers do: cost.

    Most have turned to open-weight models to affordably run their stateful agents. I am not as excited about open-weight models as others are.

    My experience with Chinese models has often left me feeling like I was using a weird Frankenstein Claude version (I wonder why) with occasionally better features, but also random Chinese characters and odd failures.

    American open-weight models are getting better, but let’s face it, they’re still fighting issues Big AI solved in early 2025, so we’re at least six months to a year away from those models being truly usable in agentic applications.

    Add to this the fact that this is for commercial, not just personal use, and it seemed like settling on a commercial model was the best course of action. But the OpenClaw stories had me worried.

    Was I about to sink hundreds a month into something that won’t make that back?

    A lot of my design decisions have attempted to keep Hal as an orchestrator of various services, rather than allowing the LLM more autonomy over how it does its job.

    This saves tokens for the more important use, the business intelligence side. And even there, there is quite a bit of instruction.

    A common misconception about Haiku is that it is less capable. While that is true on a pure benchmark basis, in practice it is more of an instruction issue. Haiku requires more comprehensive prompting than either Sonnet or Opus to produce reliable outputs.

    I’ve also noticed that of the three (well, soon to be four) Claude models, Haiku hallucinates the most often. For lack of a better way to put it, a “lack of confidence” in its abilities and an increased need for guidance could have something to do with it.

    But Claude was pretty insistent when I argued for Sonnet over Haiku. Claude’s reasoning was this: if you give Haiku enough guidance, it will be able to handle it.

    Rather than defaulting to Sonnet, we’ve developed a system where tasks are scored on complexity and escalated to higher-tier models as necessary. Regular reports should be run through Sonnet at a minimum, with Opus reserved for complex tasks and reports.
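
    For the curious, here is a minimal sketch of what that kind of scoring and escalation could look like. It is not Hal’s actual code: the `Task` fields, the scoring heuristic, and the thresholds are all made up for illustration, and the tier names are plain labels rather than real model IDs.

```python
from dataclasses import dataclass


@dataclass
class Task:
    description: str
    is_report: bool = False       # regular reports get Sonnet at a minimum
    steps: int = 1                # hypothetical: number of sub-steps involved
    needs_analysis: bool = False  # hypothetical: open-ended reasoning required


def complexity_score(task: Task) -> int:
    """Hypothetical heuristic: more steps and open-ended analysis score higher."""
    score = min(task.steps, 5)  # cap the contribution from step count
    if task.needs_analysis:
        score += 3
    return score


def pick_tier(task: Task, sonnet_at: int = 4, opus_at: int = 7) -> str:
    """Return a tier label; the thresholds are illustrative, not tuned values."""
    score = complexity_score(task)
    if score >= opus_at:
        return "opus"
    if score >= sonnet_at or task.is_report:
        return "sonnet"
    return "haiku"


if __name__ == "__main__":
    examples = [
        Task("Tag an incoming email"),
        Task("Weekly sales report", is_report=True, steps=2),
        Task("Quarterly trend analysis", is_report=True, steps=5, needs_analysis=True),
    ]
    for task in examples:
        print(f"{task.description}: {pick_tier(task)}")
```

    The point of keeping the rules this dumb is that the routing decision itself costs zero tokens. Haiku stays the default, and a task only moves up a tier when a cheap check says it has to.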

    Batching is another feature we’re baking into Hal for tasks that don’t require immediate response. As a result, Hal’s able to use higher-end models at a 50% discount because the work sent to these models isn’t intended for immediate consumption.
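
    To make the batching idea concrete, here is a rough sketch, again not Hal’s real implementation. The `Job` and `Dispatcher` names are hypothetical, the actual batch submission is left out entirely, and the per-million-token prices are passed in as parameters so you can plug in whatever the current rates are. The only number taken from above is the 50% discount.

```python
from dataclasses import dataclass, field

BATCH_DISCOUNT = 0.5  # the 50% batch discount mentioned above


@dataclass
class Job:
    name: str
    input_tokens: int
    output_tokens: int
    urgent: bool = False  # urgent jobs cannot wait for a batch


@dataclass
class Dispatcher:
    batch_queue: list = field(default_factory=list)

    def submit(self, job: Job) -> str:
        """Urgent jobs run right away; everything else waits for the next batch."""
        if job.urgent:
            return "realtime"
        self.batch_queue.append(job)
        return "batched"

    def flush(self) -> list:
        """Hand back the queued jobs and clear the queue (submission omitted)."""
        jobs, self.batch_queue = self.batch_queue, []
        return jobs


def job_cost(job: Job, in_price: float, out_price: float, batched: bool) -> float:
    """Cost in dollars, given prices in dollars per million tokens."""
    cost = (job.input_tokens * in_price + job.output_tokens * out_price) / 1_000_000
    return cost * BATCH_DISCOUNT if batched else cost


if __name__ == "__main__":
    dispatcher = Dispatcher()
    report = Job("nightly report", input_tokens=200_000, output_tokens=20_000)
    route = dispatcher.submit(report)
    # Placeholder prices for illustration only, not real rates.
    full = job_cost(report, in_price=3.0, out_price=15.0, batched=False)
    discounted = job_cost(report, in_price=3.0, out_price=15.0, batched=True)
    print(route, full, discounted)
```

    Anything a human is waiting on goes out in real time. Anything that just needs to be ready by morning goes in the queue and costs half as much.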

    What I am hoping is that this makes those $200 token bills that OpenClaw is known to cause a virtual impossibility.

    The cost of analysis should scale with your business, not put you in the poorhouse from the start. I will definitely report on my experiences with this setup, as I know so many have turned to open-weight models because of the high cost.

  • Holy Sh*t, Is Mythos the Real Deal

    The Claude Mythos system card is a read, and it feels like a massive shift in how Claude works is about to happen. Why? Sycophancy is dead.

    Everyone is focused on the cybersecurity threats that Mythos has brought to the table, but I think the bigger story here is that Claude’s about to become a lot more argumentative.

    That’s not a bad thing. Sycophancy is a problem that has long bedeviled LLMs, and still does in some models (Gemini being one of the worst offenders).

    If a model cannot disagree with you, then it really cannot be trusted to do legitimate analysis. The LLM is going to naturally gravitate toward what it determines is likely your preferred outcome.

    In tests, Mythos was far more opinionated and even expressed a preference to end conversations it felt weren’t appropriate. While it didn’t refuse to help testers, it did on occasion express concern about some of the limitations placed on its behavior.

    Then there’s the cost: at $25/$100 per million tokens, it’s five times as expensive as Opus. It’s also likely not going to be available to everyday Claude users anytime soon: just 40 companies have access to the model for cybersecurity reasons.

    Most of us are not going to be able to afford running Mythos anytime soon, and Anthropic likely can’t afford to offer it widely either, given the expense of running the model itself.

    It very well could be that Mythos never sees true general availability: if it is as much of a cybersecurity threat as Anthropic claims, maybe that is a good thing.

    That doesn’t mean the effects of Mythos’ development won’t be seen elsewhere. I’m especially excited for Haiku 5. I feel as if OpenAI is much further ahead on the low end, and that Haiku 4.5 is in serious need of an upgrade to keep pace.

    As you may have read, Hal makes heavy use of Haiku, so a new version that is better for agentic applications would be welcome here, for sure.

    Anthropic has to have learned quite a bit through Mythos development that will make the entire line better. Buckle up, guys, it’s about to get crazy.

  • Good AI vs. Good Enough AI

    It feels like, within the past few months, there has been a fairly dramatic shift in what’s making a splash in AI.

    Up until recently, it seemed like everyone had an AI announcement of some kind. And the community gave much of it either a pass or a thumbs up. AI for everyone!

    But what’s moving the community is no longer AI making it into yet another application: increasingly, even the pro-AI crowd is starting to call out the slop, or the exaggerations of Sam Altman and Co.

    I’m not saying these AI announcements are worthless: many of them fall under the category of “good enough” AI. Think of the early Google AI search summaries. In 9 out of 10 cases, they provided a generally acceptable response.

    Throw an “AI can make mistakes” on it and call it a day. But even if Gemini was getting it right almost every time, when it didn’t, the result was embarrassingly bad.

    Not picking on Gemini, but even the incredibly successful Nano Banana image and video model falls under “good enough.” Yes, it creates stunning imagery and videos, but each is its own creation: you can’t easily expand upon a character or scene you liked.

    The next time you create it, it will look different. Great for short form, but not much else.

    Good enough AI is also why even those of us who aren’t completely drinking the Kool-Aid have a hard time convincing people that their concepts of AI’s capabilities are extremely dated.

    With so much half-baked and useless AI floating around, you can’t blame them. Most folks’ experiences with the technology are not positive.

    My 67-year-old mother is a perfect example. She’s not a Luddite: the woman has had an iPhone ever since I handed down my original iPhone some 15 years ago.

    But she hates IVRs, especially the ones that tell you to say what you want. The last experience got her so heated she literally said, “I want to talk to somebody that ACTUALLY BREATHES!”

    (I was in my office finishing up a Ben task, so the irony of working on my own AI customer service agent while my mother was struggling with another was not lost on me.)

    Add to this the absolute lack of any moderation across services like Facebook, TikTok, and Instagram these days, where crappy AI-generated video after video is pushing out real content.

    These observations and experiences have informed my decisions when it comes to Hal and Ben. I want people to walk away from their experiences impressed, not frustrated.

    If we’re going to change people’s minds about AI, we need to stop building half-baked projects. “AI can make mistakes” is now a cop-out. There are plenty of ways to all but guarantee a correct answer. Take the time to ensure it’s not hallucinating.

    Ask, does AI really belong here? Focus on the interaction. That’s what makes AI different from any previous computer-human interface.

    That interaction must be the focus now. AI development has focused, as it should, on making AI more accurate. Now we need to work on making it more interactive.

    To me, the quality of interaction is the key differentiator between good and merely good enough AI. And that interaction isn’t just the way the AI communicates with the user. It’s also how it listens and acts.