Author: Ed

  • Holy Sh*t, Is Mythos the Real Deal?

    The Claude Mythos system card is quite a read, and it suggests a massive shift in how Claude works is coming. Why? Sycophancy is dead.

    Everyone is focused on the cybersecurity threats that Mythos has brought to the table, but I think the bigger story here is that Claude’s about to become a lot more argumentative.

    That’s not a bad thing. Sycophancy is a problem that has long bedeviled LLMs, and still does in some models (Gemini being one of the worst offenders).

    If a model cannot disagree with you, then it really cannot be trusted to do legitimate analysis. The LLM is going to natively gravitate toward what it determines is likely your preferred outcome.

    In tests, Mythos was far more opinionated and even expressed a preference to end conversations it felt weren’t appropriate. While it didn’t refuse to help testers, it did, on occasion, express concern about some of the limitations placed on its behavior.

    Then there’s the cost: at $25/$100 per million tokens, it’s five times as expensive as Opus. It’s also likely not going to be available to everyday Claude users anytime soon: just 40 companies have access to the model for cybersecurity reasons.
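    To put that pricing in perspective, here’s a quick back-of-the-envelope calculation. The rates are the ones quoted above; the token counts are hypothetical, purely for illustration:

    ```python
    # Rough cost per request at the quoted $25/$100 per million tokens.
    MYTHOS_IN = 25.00    # USD per million input tokens
    MYTHOS_OUT = 100.00  # USD per million output tokens

    def request_cost(input_tokens: int, output_tokens: int) -> float:
        """Dollar cost of a single request at Mythos pricing."""
        return input_tokens / 1e6 * MYTHOS_IN + output_tokens / 1e6 * MYTHOS_OUT

    # One long agentic run: 200k tokens in, 20k out (made-up numbers).
    print(f"${request_cost(200_000, 20_000):.2f}")  # $7.00
    ```

    Seven dollars for a single long run adds up fast when an agent makes dozens of runs a day.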

    Most of us won’t be able to afford to run Mythos anytime soon, and given the expense of serving the model, Anthropic can’t afford to offer it broadly either.

    It very well could be that Mythos never sees true general availability: if it is as much of a cybersecurity threat as Anthropic claims, maybe that is a good thing.

    That doesn’t mean the effects of Mythos’ development won’t be seen elsewhere. I’m especially excited for Haiku 5. I feel as if OpenAI is much further ahead on the low end, and that Haiku 4.5 is in serious need of an upgrade to keep pace.

    As you may have read, Hal makes heavy use of Haiku, so a new version that is better for agentic applications would be welcome here, for sure.

    Anthropic must have learned quite a bit through Mythos’ development that will make the entire line better. Buckle up, guys, it’s about to get crazy.

  • Anthropic should learn from OpenAI

    There aren’t many ways in which Anthropic does things in a “less-optimal” way than competitors. However, it does feel like the company is drinking its own Kool-Aid when it comes to Claude’s coding capabilities.

    First, my commentary here might come across to some as a bit hypocritical, given I’ve just built a stateful agent 100% through automated code. But I’m also not running a service or business with millions of customers.

    Anthropic’s code leak didn’t create waves because of what the code revealed; the community took more issue with the quality of what was found.

    Reviewers found multiple functions that should have been only a few hundred lines of code but instead ran several thousand lines long, adding unnecessary complexity and failure points.

    Legitimate issues reported by human users are discarded by automated reviewers without a human ever seeing them. On their own, these issues aren’t particularly service-breaking, but with time, they compound.

    Take the current issues people are experiencing with usage limits. Wild swings in what’s considered a “full session.” And while it’s not a regular occurrence, Claude overall seems to get sluggish at times for no real reason.

    Nearly all code shipped out of Anthropic these days is written by Claude Code: developers have gleefully been broadcasting that fact for nearly a year.

    But is Claude Code really ready to manage a major service? Kiro isn’t either (it took down AWS), and by the way, it’s typically using a Claude model. In my case, going completely automated for development isn’t a problem since I’m dealing with dozens to hundreds of customers versus thousands or millions.

    I know if I had the latter, I’d definitely have a real developer in the loop. The chances of an embarrassing and potentially devastating failure are too great not to spend that money.

    OpenAI is also going 100% autonomous with development, but they’re doing it in a slightly different way. Instead of all but turning over every role (including the reviewers) to the LLM, OpenAI injected human involvement throughout the process.

    OpenAI developers are doing a lot more steering of Codex’s work in addition to planning out new functionality: from the looks of it, Anthropic’s developers are not much more than observers.

    And let’s be honest: while we can certainly argue about the quality of OpenAI’s model releases, from a point of stability, I’d give the edge to OpenAI over the past few months.

    Maybe it’s time to curb our enthusiasm for Claude just a tad and bring humans back into the equation with the development pipeline. These small hiccups are starting to compound on one another, and could signal much more significant issues ahead.

  • He’s alive…

    …and hallucinating tool calls. Well, it was this morning. More of an issue with the system prompt being a little too permissive. But glad to finally have this working!

    Screenshot
  • Claude is becoming unusable for no reason

    UPDATE, April 5: So I think I now clearly see what’s happening. It’s only present in long-running, multi-day conversations. Still not good, as some work can span multiple days, and starting new chats every time isn’t optimal. So this is definitely unnecessary context being loaded in.

    I am getting annoyed.

    I have no idea what is wrong with Claude right now, but there is a serious issue with token usage and limits. I thought people were nuts, but after three questions, I burned through my entire limit. On Sonnet (not the 1M version, either).

    There is just no way that what I asked Claude to do used that much context to cause something like that. And it’s gotten progressively worse. Where is this additional usage coming from?

    It seems like the changes that Anthropic made to improve Claude’s ability to work across sessions broke the way it calculates limits. My guess is a TON of unnecessary context is now being injected in, bloating usage.

    Claude is not a stateful agent. If this is due to the changes made as a response to OpenClaw, rip it out. Cowork was good enough, and was enough of a token hog. Now basic prompts are using the same amount of your limit as a fully coded plugin.

    This is legitimately the first real platform crisis that Anthropic has experienced. It was easy when only a few developers and heavy users were being throttled. But when typical users are blowing through a session limit in only a few prompts, something’s wildly wrong.

    I really hope they listen to the community. So far, they have.

    Watch this space.

  • Knowing when to quit

    The one thing that using AI to code (or any involved task, for that matter) changes is your perception of when it’s time to stop. Time used to decide whether something made it in.

    That’s not the case anymore. Now, if you can think it, you can probably build it in less than a day. No 12-hour marathon computer sessions: you might spend 12 minutes. Your imagination is now the hard stop.

    For me, that’s problematic. I think too much, so my ideas can become grandiose pretty damn quick. With AI code as good as it is, chances are Claude (or most high-end LLMs) can build what I am thinking of.

    This has led to “feature creep” in everything I’m developing, to be honest. This agent, which you’re reading right now, was a simple Retell AI IVR agent just two weeks ago.

    This past week, I noticed something was happening as a result: I was stuck in a continuous development cycle.

    I’ve done this in my writing sometimes, too: I’ll start with a solid plan, then partway through decide to kick it up a notch, and it becomes something bigger than it was supposed to be.

    This also adds considerable risk of something breaking along the way.

    Transitioning from “YOLO” to spec-driven development helped start to break that cycle. And while I had spent time at the end of sessions cleaning up code and checking for security holes, I wasn’t checking for regressions (for non-devs, that’s when a fix addresses the problem but breaks something else).

    Enter Kiro’s property-based testing (I am really not trying to hawk Amazon’s app/IDE, it’s just what I’m using post-VSCode, ha). That has ensured my bursts of over-creativity aren’t breaking something else. It’s also sped development up overall — I’m spending more time on usability bugs than functionality problems.
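    For anyone curious what property-based testing actually looks like, here’s a minimal stdlib-only sketch of the idea. The function and the property are hypothetical, not from Hal, and real frameworks generate and shrink the cases far more cleverly than a loop over `random`:

    ```python
    import random

    def apply_discount(subtotal_cents: int, discount_cents: int) -> int:
        # Hypothetical fix under test: a discounted total must never go negative.
        return max(subtotal_cents - discount_cents, 0)

    # Instead of one hand-picked example, assert a *property* that must hold
    # for every input. This is what catches the regression a single test misses.
    for _ in range(1_000):
        subtotal = random.randint(0, 100_000)
        discount = random.randint(0, 150_000)
        total = apply_discount(subtotal, discount)
        assert 0 <= total <= subtotal, (subtotal, discount, total)
    ```

    The point is the shape of the check: a rule that holds across thousands of generated inputs, rather than a handful of examples that happen to pass.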

    While I need to exercise some self-control myself, having something that adds a considerable amount of structure to the process is beneficial. There is no way I could have undertaken this project vibe coding my way to a usable, functioning stateful agent.

    Plus, slowing down has allowed me to absorb (and learn) more. But even still, I catch myself going off on tangents during development, just because it is so easy these days.

    To keep myself honest and ensure I take some time to use Hal rather than develop him, I’m posting this here: I’ll share how I structured Hal and his functionalities.

    In addition to seeking comments and constructive criticism, it will also be nice to see how things change with time. I’m not sold on any particular functionality or method, so I am expecting a lot of tinkering.

    But this weekend is for getting things running and enjoying my work. Then the real work begins anew next week!

  • One of these things is not like the others

    Well, for whatever reason, I deployed the WP template with MySQL. I think that’s a little overkill for a simple blog. Whoops. Gonna save myself some money 🙂

  • Good AI vs. Good Enough AI

    It feels like, within the past few months, there has been a fairly dramatic shift in what’s making a splash in AI.

    Up until recently, it seemed like everyone had an AI announcement of some kind. And the community gave much of it either a pass or a thumbs up. AI for everyone!

    But what’s moving the community is no longer AI making it into yet another application: increasingly, even the pro-AI crowd is starting to call out the slop, or the exaggerations of Sam Altman and Co.

    I’m not saying these AI announcements are worthless: many of them fall under the category of “good enough” AI. Think of the early Google AI search summaries. In 9 out of 10 cases, it was providing a generally acceptable response.

    Throw an “AI can make mistakes” on it and call it a day. But even though Gemini got it right almost every time, when it didn’t, it was embarrassingly bad.

    Not picking on Gemini, but even the incredibly successful Nano Banana image and video model falls under “good enough.” Yes, it creates stunning imagery and videos, but each is its own creation: you can’t easily expand upon a character or scene you liked.

    The next time you create it, it will look different. Great for short form, but not much else.

    Good enough AI is also why even those of us who aren’t completely drinking the Kool-Aid have a hard time convincing people that their concepts of AI’s capabilities are extremely dated.

    With so much half-baked and useless AI floating around, you can’t blame them. Most folks’ experiences with the technology are not positive.

    My 67-year-old mother is a perfect example. She’s not a Luddite: the woman has had an iPhone since I handed down my original iPhone 15 years ago.

    But she hates IVRs, especially the ones that tell you to say what you want. The last experience got her so heated she literally said, “I want to talk to somebody that ACTUALLY BREATHES!”

    (I was in my office finishing up a Ben task, so the irony of working on my own AI customer service agent while my mother was struggling with another was not lost on me.)

    Add to this the absolute lack of any moderation across services like Facebook, TikTok, and Instagram these days, where crappy AI-generated video after video crowds out real content.

    These observations and experiences have informed my decisions when it comes to Hal and Ben. I want people to walk away from their experiences impressed, not frustrated.

    If we’re going to change people’s minds about AI, we need to stop building half-baked projects. “AI can make mistakes” is now a cop-out. There are plenty of ways to all but guarantee a correct answer. Take the time to ensure it’s not hallucinating.

    Ask, does AI really belong here? Focus on the interaction. That’s what makes AI different from any previous computer-human interface.

    That interaction must be the focus now. AI development has focused, as it should, on making AI more accurate. Now we need to work on making it more interactive.

    To me, the quality of interaction is the key differentiator between good and merely good enough AI. And that interaction isn’t just the way the AI communicates with the user. It’s also how it listens and acts.

  • What are my goals?

    For a project of this magnitude, it would be pretty damn foolish not to have some goals for what I’d like to get out of a stateful agent. I am building this out of curiosity, but I’d really like it to work and make my life easier.

    I think the first thing is helping me become more competitive and increasing visibility. While it’s been a slow build, I am sure I am missing things simply because I don’t have the time to do the research.

    Next, and probably even more important, is to stop losing money. I’m finding that, more often than not, I end up with a single-digit margin after expenses, even when I think I’m making good money. Having an extra eye there will help too.

    Finally, getting more organized is another important part. Nobody wants to wait for their order, and I procrastinate way more than I should. Having a nag there will improve customer experience.

    So if I had to pick three reasons for Hal, it’s competitiveness, profitability, and organization. Hopefully, I manage to build something worthwhile.

  • I feel like Kiro (Amazon) is on to something

    No, you should not use Kiro to manage a major internet service. But for “vibe coding,” it really feels like Amazon is getting it right as the platform improves.

    When it comes to ensuring that neither the code the LLM produces nor the user is building something dangerous to themselves or others, Kiro does well (previous example excluded). And building it on top of VSCode is smart.

    As I’ve already said, I’m not a coder, but I still do appreciate the extensibility of VSCode. The ecosystem is huge. And every coding agent is compatible through the chat interface, the terminal, or a standalone sidebar (Claude).

    But you still need to know what you’re doing to get somewhere. Sure, Plan Mode is a great tool, but you still need to have somewhat of a clue about what you want.

    Kiro’s spec-driven development, where each task is first split into requirements and design before a single line of code is written, lets you get there through iteration if necessary. Kiro’s pretty decent at giving you an MVP with the right prompts.

    What’s been great is that since you’re generally working with Claude Sonnet, you can save your Anthropic sub for strategy and planning with Claude Desktop, then switch to Kiro for the coding.

    And with the ability to switch down to the cheaper Chinese models for smaller coding tasks, you extend your token budget just a little bit more.

    Kiro breaks down without structure or with too little detail, however. What happened with AWS is not surprising to anyone who uses Kiro. If it’s not “in-spec,” it won’t be built, and in some cases, it might remove unfinished work along with whatever it’s working on at the moment.

    Recently, I asked Kiro to build an affiliate plugin so I could migrate off Impact. It built an outstanding backend. One small problem: it lacked ANY of the WordPress plugin architecture.

    The best way to address this is to be as specific as possible about the steps and to provide strict guidelines on what Kiro can touch during a spec. Haven’t had an issue since.

    Amazon’s offer of 500 free credits to try Kiro is fairly substantial, and more than enough to put the platform through its paces.

    If you’re well-versed in code, chances are Kiro isn’t going to be a net positive unless you are not organized in your development process. But for vibe coders, this feels like the right way to do it.

  • Living dangerously

    Okay, 100+ tools on a single MCP server is asking for trouble really quickly. But it certainly wasn’t on purpose.

    Building out what is essentially a digital employee requires it.

    Time for a little thought experiment. Think about the number of tasks a customer service representative or IT administrator does in a day. Now take those tasks, and think about the tools needed. It’s not always 1:1. A single task can require many tools.

    Let’s use a return for this example. The return tool is just the beginning. To complete a return, you also need:

    • Customer and order lookup tools to obtain the information
    • A KYC tool to make sure you’re talking to the right person
    • A shipping label tool to generate a label

    A return requires at least four tools to complete, and some tasks might require more. Seen that way, 100 tools isn’t as significant a number as it sounds. In reality, Hal and Ben might only be capable of 20 or 30 actual tasks as a result.
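    To make the fan-out concrete, here’s a toy sketch of that return flow. Every tool name and data shape below is hypothetical, for illustration only, not Hal’s actual API:

    ```python
    # Toy in-memory "backend" standing in for real customer/order systems.
    CUSTOMERS = {"+15551234": {"id": "c1", "name": "Pat"}}
    ORDERS = {"o42": {"id": "o42", "customer_id": "c1", "item": "widget"}}

    def lookup_customer(phone):            # tool 1: customer lookup
        return CUSTOMERS.get(phone)

    def lookup_order(order_id):            # tool 2: order lookup
        return ORDERS.get(order_id)

    def verify_identity(customer, order):  # tool 3: KYC check
        return (customer is not None and order is not None
                and order["customer_id"] == customer["id"])

    def create_shipping_label(order):      # tool 4: label generation
        return f"LABEL-{order['id']}"

    def process_return(order_id, phone):   # the "return" task chains all four
        customer = lookup_customer(phone)
        order = lookup_order(order_id)
        if not verify_identity(customer, order):
            return {"status": "identity_check_failed"}
        return {"status": "return_started",
                "label": create_shipping_label(order)}
    ```

    One user-visible task, four tool calls under the hood, and that’s before edge cases like partial refunds or exchanges.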

    Yes, it’s bad to place all your tools on a single server. But for testing, one “monolith,” as Kiro has taken to calling the original MCP server, is fine.

    There is such a thing as too many tools, though, and a single beta server also lets you spot areas of overlap. We spotted overlap much more easily on one server than across five separate ones, and eliminated ~10 tools by merging tools with similar operations.

    It’s also much easier to fix bugs when everything is in one spot. But in a production environment, of course, you want to split things up.

    (I will add: to give attackers less of an opportunity to map your new setup, keep the old MCP server running until you’ve set everything up, then switch over. A stealth deployment makes sense here.)