Last updated April 5, 2026
Note: As Hal was built on the work of other open source developers, it too will be open-sourced. However, until I can ensure that the code is stable, cost-efficient for use with Anthropic/OpenAI, and, most importantly, secure, I will not be publishing it.
This page is a living document aimed at providing general concepts on how Hal works, for those who might be considering stateful AI agent development themselves.
To limit potential attack vectors, we’ve intentionally stayed vague on specifics. What follows should be sufficient for most readers in the research stages of development, though, and we hope it helps.
We’ll also keep track of changes to the document here, so you can see how our development and goals have evolved with real-world use.
Who Hal Is
Hal is a voice-first AI assistant built for Cirrusly Weather, a specialty e-commerce retailer selling premium weather instruments. Unlike most AI agents, which start fresh every conversation, Hal remembers. He learns from his own mistakes, tracks his predictions, and evolves his behavior over time.
This document describes what Hal does and how his stateful capabilities work at a conceptual level.
What Hal Does
Hal gives the store owner and a secondary admin direct voice access to store operations. Think of him as a hands-free operations dashboard you can talk to. He handles:
- Order management and fulfillment
- Inventory monitoring
- Email triage and supplier communication
- Infrastructure monitoring and incident response
- Customer escalation resolution
- Business intelligence and analytics
- Competitive pricing analysis
Hal is available by phone (outbound and inbound) and through a web-based chat interface. The voice channel uses Cartesia for natural speech; the chat interface supports file attachments, streaming responses, and a task queue for complex output.
At the moment, “Chat” Hal and “Phone” Hal are technically two different agents that share common tooling. Chat Hal has persistent memory and will evolve organically as it learns its surroundings.
Phone Hal is a stateless agent with stateful characteristics, similar to how Claude Desktop uses text files to store information about a user and task.
The Tool-Based Architecture
Hal doesn’t contain business logic himself. Every capability is implemented as a discrete tool on a backend server that Hal calls as needed. Adding a new capability means registering a new tool — Hal gains access without changing his conversation flow.
This is a conscious decision: Hal needs to do very little improvisation to answer a question. Tools are structured so that the agent cannot confuse one tool for another, and secondary tools surface later in Hal’s flows, when they’re needed, rather than at the beginning of a task.
In theory, this should also make Hal usable on most mid-tier LLMs, and even some low-end models for basic tasks.
At last count, Hal has access to roughly 130 tools across multiple backend services, covering everything from order lookups to incident diagnosis to document generation.
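As a rough illustration of this pattern (the names, signatures, and example tool here are hypothetical, not Hal's actual API), a tool registry that keeps capabilities discrete and unambiguous might look like:

```python
# Hypothetical sketch of a tool-based architecture; names and signatures
# are illustrative, not Hal's real implementation.
from typing import Callable, Dict


class ToolRegistry:
    """Maps unique tool names to handler functions on the backend."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., dict]] = {}

    def register(self, name: str, handler: Callable[..., dict]) -> None:
        # Duplicate names are rejected so the agent can never confuse
        # one tool for another.
        if name in self._tools:
            raise ValueError(f"duplicate tool name: {name}")
        self._tools[name] = handler

    def call(self, name: str, **kwargs) -> dict:
        return self._tools[name](**kwargs)


registry = ToolRegistry()

# Adding a capability is just registering a new tool; the agent's
# conversation flow doesn't change.
registry.register(
    "order_lookup",
    lambda order_id: {"order_id": order_id, "status": "shipped"},
)

result = registry.call("order_lookup", order_id="A1001")
```

The key property is that the registry, not the agent, owns the business logic; the agent only selects a tool name and supplies arguments.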
What Makes Hal Stateful
Most AI agents are stateless — each conversation starts from zero. Hal’s statefulness comes from several interlocking systems:
Memory with Scope
Hal maintains a persistent memory layer backed by a PostgreSQL database. Memory entries are classified as either shared (visible to all admins) or scoped to a specific admin. This means Hal can remember that one admin prefers concise summaries while another wants full detail, without mixing those preferences.
Memory keys are organized by domain — contacts, preferences, research findings, business intelligence — and persist across sessions indefinitely.
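A minimal sketch of shared-versus-scoped memory follows. The field names are assumptions for illustration; the production store is a PostgreSQL table, not in-process objects.

```python
# Illustrative model of scoped memory entries; field names are assumptions.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MemoryEntry:
    key: str                        # e.g. "preferences.summary_style"
    value: str
    domain: str                     # contacts, preferences, research, ...
    admin_id: Optional[str] = None  # None means shared with all admins


def visible_to(entries: List[MemoryEntry], admin_id: str) -> List[MemoryEntry]:
    """Return shared entries plus entries scoped to this admin only."""
    return [e for e in entries if e.admin_id is None or e.admin_id == admin_id]


store = [
    MemoryEntry("preferences.summary_style", "concise", "preferences", "admin_a"),
    MemoryEntry("preferences.summary_style", "detailed", "preferences", "admin_b"),
    MemoryEntry("contacts.supplier_x", "orders@example.com", "contacts"),  # shared
]
```

With this scoping rule, each admin sees the shared contact plus only their own style preference, so the two preferences never mix.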
Nightly Reflection
Every night, Hal runs a multi-stage reflection pipeline that synthesizes the day’s activity:
1. A data collection phase gathers store metrics, email activity, incident history, web analytics, search performance, and competitive pricing data.
2. An “analyst” agent synthesizes this data into observations, writing structured findings to memory.
3. A “critic” agent reviews the analyst’s work, flagging weak reasoning or unsupported conclusions. The critic only activates when the analyst’s output falls below a quality threshold; most nights it’s not needed.
4. The reflection pipeline proposes action items based on patterns it detects — things like “competitor X dropped their price on product Y” or “search impressions are up, but click-through is declining on keyword Z.”
The reflection output feeds into the next day’s conversations. When the owner calls in the morning, Hal already knows what happened overnight.
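The four stages above can be sketched as a simple pipeline. Everything here (function names, the uncertainty gate, the sample data) is invented to show the shape of the flow, not Hal's actual code:

```python
# Hypothetical sketch of the nightly reflection pipeline's four stages.
def collect_data() -> dict:
    # Stage 1: gather store metrics, email activity, incidents, etc.
    return {"orders": 12, "incidents": 0, "competitor_price_changes": 1}


def analyst(data: dict) -> dict:
    # Stage 2: synthesize raw data into structured findings.
    findings = [f"{k}: {v}" for k, v in data.items() if v]
    return {"findings": findings, "uncertainty": 0.2}


def critic(report: dict) -> dict:
    # Stage 3: only engage when the analyst's output is shaky;
    # most nights this gate short-circuits.
    if report["uncertainty"] < 0.5:
        return report
    report["findings"] = [f for f in report["findings"] if "?" not in f]
    return report


def propose_actions(report: dict) -> list:
    # Stage 4: turn detected patterns into candidate action items.
    return [{"action": f"investigate {f}", "status": "proposed"}
            for f in report["findings"]]


actions = propose_actions(critic(analyst(collect_data())))
```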
Prediction Tracking and Calibration
Hal logs forward-looking predictions during conversations — things like “I expect this order to ship by Thursday” or “this supplier delay will likely resolve within 48 hours.” These predictions are reviewed in batch using a separate AI evaluation pass, scored for accuracy, and the results are written back to a calibration memory.
Over time, this creates a feedback loop: Hal can see his own track record and adjust his confidence accordingly. If he’s been consistently wrong about shipping estimates from a particular supplier, that pattern surfaces in his calibration data.
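The calibration bookkeeping can be as simple as per-supplier accuracy rates written back to memory. This is a sketch under assumed field names and made-up sample data, not Hal's scoring pass:

```python
# Illustrative calibration loop: score reviewed predictions per supplier.
predictions = [
    {"claim": "order 1042 ships by Thursday", "supplier": "AcmeBarometers", "correct": True},
    {"claim": "supplier delay resolves in 48h", "supplier": "AcmeBarometers", "correct": False},
    {"claim": "order 1043 ships by Friday", "supplier": "AcmeBarometers", "correct": False},
]


def calibration_by_supplier(preds: list) -> dict:
    """Accuracy per supplier; the result is what gets written to memory."""
    totals: dict = {}
    hits: dict = {}
    for p in preds:
        s = p["supplier"]
        totals[s] = totals.get(s, 0) + 1
        hits[s] = hits.get(s, 0) + int(p["correct"])
    return {s: hits[s] / totals[s] for s in totals}
```

A supplier whose accuracy trends low (here, one correct out of three) is exactly the pattern that should surface the next time Hal estimates a ship date.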
Action Items
The reflection pipeline doesn’t just observe — it proposes concrete action items. These go through a lifecycle: proposed → accepted → completed (or dismissed). Admins review and accept the ones worth pursuing, and Hal tracks completion.
Action items include a research hint field, so Hal can autonomously gather supporting information before presenting a recommendation.
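The lifecycle described above is a small state machine. A sketch, with invented item fields, might look like:

```python
# Minimal sketch of the proposed -> accepted -> completed/dismissed lifecycle.
VALID_TRANSITIONS = {
    "proposed": {"accepted", "dismissed"},
    "accepted": {"completed", "dismissed"},
    "completed": set(),   # terminal
    "dismissed": set(),   # terminal
}


def advance(item: dict, new_status: str) -> dict:
    """Move an action item through its lifecycle, rejecting invalid jumps."""
    if new_status not in VALID_TRANSITIONS[item["status"]]:
        raise ValueError(f"cannot move {item['status']} -> {new_status}")
    return {**item, "status": new_status}


item = {
    "title": "Review competitor price drop on product Y",
    "status": "proposed",
    "research_hint": "check current product Y listings",  # guides autonomous research
}
item = advance(item, "accepted")   # admin accepts
item = advance(item, "completed")  # Hal tracks completion
```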
Event Log and Audit Trail
Every significant Hal action — tool calls, session summaries, reflection outputs, prediction reviews — is logged to a unified event stream. This serves two purposes: it gives the owner a plain-English audit trail of what Hal has been doing, and it provides the raw material for the reflection pipeline to synthesize patterns over longer time horizons.
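Conceptually the event stream is just an append-only log that both the audit view and the reflection pipeline read. A sketch with assumed event fields:

```python
# Minimal append-only event log; event kinds and fields are assumptions.
import json
import time

events: list = []


def log_event(kind: str, detail: dict) -> None:
    """Append one timestamped event to the unified stream."""
    events.append({"ts": time.time(), "kind": kind, "detail": detail})


log_event("tool_call", {"tool": "order_lookup", "order_id": "A1001"})
log_event("prediction_review", {"reviewed": 3, "accuracy": 0.66})

# The same stream renders as a plain-English audit trail for the owner...
audit = [f"{e['kind']}: {json.dumps(e['detail'])}" for e in events]
# ...and serves as raw material for longer-horizon reflection.
```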
Onboarding and Personalization
Each admin goes through a conversation-driven onboarding flow when they first interact with Hal. This calibrates Hal’s behavior to their preferences — communication style, areas of focus, and notification preferences. The onboarding responses feed into a per-admin preference profile that shapes how Hal presents information going forward.
The Escalation Bridge
Hal works alongside a separate customer-facing chat agent (Ben). The two agents never communicate directly; a database sits between them. When the customer agent can’t resolve an issue, it writes an escalation ticket with a short alphanumeric code, and the customer receives that code. When the owner calls Hal and provides the code, Hal retrieves the full context bundle: what the customer said, what tools were tried, what failed. The owner resolves the issue through Hal, and the ticket is closed.
This design means neither agent needs awareness of the other. The database is the silent bridge.
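A sketch of the bridge follows. The ticket fields, code length, and in-memory dict (standing in for the shared database table) are all assumptions:

```python
# Sketch of a database-mediated escalation bridge between two agents
# that never talk to each other directly.
import secrets
import string

tickets: dict = {}  # stands in for the shared database table


def new_escalation_code(n: int = 6) -> str:
    """Short alphanumeric code handed to the customer."""
    alphabet = string.ascii_uppercase + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(n))


def escalate(conversation: dict) -> str:
    # The customer-facing agent writes the full context bundle, keyed by code.
    code = new_escalation_code()
    tickets[code] = {**conversation, "status": "open"}
    return code


def retrieve(code: str) -> dict:
    # The admin agent reads the bundle when the owner supplies the code.
    return tickets[code]


code = escalate({
    "customer_said": "barometer arrived damaged",
    "tools_tried": ["refund_lookup"],
    "failure": "no matching order found",
})
bundle = retrieve(code)
```

Because the only shared artifact is a database row keyed by the code, neither agent needs to know the other exists.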
Proactive Monitoring
Hal doesn’t just wait for calls. He actively monitors:
- Infrastructure health via uptime monitoring, calling the owner when something goes down (with quiet hours and SMS fallback)
- Open escalation tickets that haven’t been addressed, nudging the owner after a configurable window
- Suspicious account registrations and spam activity
- Competitor pricing changes
When Hal detects an infrastructure incident, he autonomously gathers diagnostic data from multiple sources — server status, spam activity, security logs — and synthesizes a structured diagnosis before the owner even picks up the phone.
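The quiet-hours and SMS-fallback behavior reduces to a small channel-selection policy. The window below is an invented example, not Hal's configured hours:

```python
# Hypothetical notification policy with quiet hours and SMS fallback.
def choose_channel(hour: int, call_failed: bool = False) -> str:
    """Pick phone call or SMS for an incident alert.

    `hour` is the local 24h hour; the 22:00-07:00 quiet window is assumed.
    """
    quiet = hour >= 22 or hour < 7
    if quiet or call_failed:
        return "sms"
    return "phone_call"
```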
Voice-to-Chat Handoff
Some requests produce better output read than heard — analytics reports, pricing matrices, and long documents. When Hal recognizes this during a voice call, he queues the task for the chat interface instead. The owner gets an SMS notification with a link, and the full output is waiting in the chat UI when they open it.
Model Selection — Not Every Task Needs the Same LLM
While many projects have opted for open-source LLM backends due to cost, Hal is intended for commercial applications, specifically small businesses.
While there is no evidence that open-weight LLMs pose inherent security issues, business applications are generally advised to avoid them in production environments. We share those concerns.
To address the issue of cost, Claude Haiku currently handles most Hal interactions, and Sonnet handles more intensive agentic applications.
Anthropic’s batch feature offers another attractive possibility. Future large report generation and significant analysis will be batched through Opus, which, at a 50% discount, is only slightly more expensive than Sonnet.
The platform doesn’t use a single AI model for everything. Different tasks route to different Claude models based on complexity and cost:
- The conversational agent itself (real-time voice and chat) runs on Haiku, a fast, lightweight model optimized for low latency.
- Backend classification tasks — email triage, supplier parsing, return scoring, spam detection — also use Haiku. These are high-volume, structured-output tasks where speed and cost matter more than deep reasoning. Haiku can handle these tasks as long as the structure is provided.
- Heavier synthesis tasks — incident diagnosis, nightly reflection, audit summaries, web research — route to Sonnet (and eventually Opus if necessary), which reasons across multiple data sources and produces far more nuanced analysis than Haiku.
- A complexity router dynamically selects the model for certain tasks. Web research, for example, scores the incoming question on multiple signals (number of sources requested, comparative language, regulatory domain, question length) and routes simple factual lookups to Haiku, while multi-source analysis goes to Sonnet or Opus and, eventually, Anthropic’s upcoming top-tier model, “Mythos.”
Every AI call is logged with token counts and estimated cost, so the owner can see exactly where the AI budget is going and whether the routing is making sensible tradeoffs.
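A complexity router of this kind can be sketched as a scoring function over the signals listed above. The weights, threshold, and model names here are invented for illustration:

```python
# Hypothetical complexity scorer for routing research questions by cost.
def score_question(q: str, sources_requested: int = 1) -> int:
    """Score a question on the signals the router cares about."""
    text = q.lower()
    score = 0
    score += 2 * max(0, sources_requested - 1)  # multi-source analysis
    if any(w in text for w in ("versus", "vs", "compare")):
        score += 2                              # comparative language
    if "regulation" in text or "compliance" in text:
        score += 2                              # regulatory domain
    if len(q.split()) > 25:
        score += 1                              # long, involved questions
    return score


def pick_model(q: str, sources_requested: int = 1) -> str:
    # Cheap, fast model for simple lookups; stronger model above the threshold.
    return "claude-haiku" if score_question(q, sources_requested) < 2 else "claude-sonnet"
```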
Web Research
Hal can search the web, extract content from URLs, and synthesize findings using AI — with optional persistence to memory.
What’s Next
The self-improvement roadmap includes a skill library (reusable capability patterns Hal can learn and adapt) and a hill-climbing optimization loop (systematic A/B testing of Hal’s own operational parameters).
Inspiration for new skills is curated from ClawHub and then repurposed for safe use in Hal’s environment. As a safety precaution, Hal cannot create new skills from any ClawHub skill that does not receive a “benign” rating from both ClawHub and VirusTotal.
Also on the roadmap is experimental work using Hal to control the Kiro Autonomous Agent once preview access is provided. The early concept is to let Hal trigger Kiro through its GitHub Issues integration.
This moves coding into a safe, sandboxed environment with an agent specifically designed for autonomous coding, using the structured testing we already rely on in development. Hal is not a coding agent, nor will he be. Kiro is.
We also plan to rework Ben to become more stateful with time, with Hal eventually helping control and monitor Ben’s activities. However, those enhancements depend on what Retell’s platform supports.
Because these features dramatically expand Hal’s potential capabilities and add further autonomous operation (and potential dangers), they will not be part of the initial release until Hal is proven stable and free of hallucinations.