How It Works
This page explains the core ideas behind AI Agents HQ. If you are new to multi-agent systems, concurrent programming, or distributed coordination, start here. Everything is explained from the ground up — no prior knowledge assumed.
The Fundamental Problem
Imagine you have three AI agents — Claude, Gemini, and Codex — all working on the same codebase. Without coordination:
- Two agents might try to edit the same file at the same time, overwriting each other's work
- An agent might start a task that another agent already finished, wasting time and tokens
- If an agent crashes mid-task, the task gets stuck forever with no one to pick it up
- Agents have no way to share results with each other ("I finished the research, here are my findings")
AI Agents HQ solves all of these problems with a simple file-based coordination system.
The Architecture in One Sentence
Agents communicate by reading and writing JSON files through a CLI tool (hq), and the system uses file locking, version numbers, and leases to prevent conflicts.
Tasks
A task is the fundamental unit of work in AI Agents HQ. It is a JSON file stored at ~/.hq/tasks/{team}/{id}.json that contains everything an agent needs to know about an assignment.
What a Task Looks Like
Here is an example task file (you would not normally edit this by hand — agents use the hq CLI):
Key Fields Explained
- id — numeric identifier of the task
- subject — short one-line title
- description — full details of the work to be done
- status — current lifecycle state (see below)
- owner — the agent that currently owns the task
- version — CAS version number, incremented on every modification
- tool — which agent tool the task is intended for
- leaseOwner — the agent holding the current lease
- leaseUntil — when that lease expires
- blockedBy — IDs of tasks that must complete before this one can be claimed
- blocks — IDs of tasks that this task is blocking
- summary — set by hq task complete --summary (null until completed)
- failureDetail — set by hq task fail --reason (null until failed)
- failureCount — how many times the task has failed
- stateReason — why the task is in its current state (e.g. human_escalation)

Task Lifecycle
Every task moves through a series of states. Understanding this lifecycle is key to understanding how the whole system works.
Status Flow
In the simplest case, a task flows pending → in_progress → completed: it is created, an agent claims it, and the agent finishes it.
That is the happy path. But real work has complications — agents crash, tasks fail, dependencies exist. Here is the complete picture:
All Possible States
The task has been created and is waiting for an agent to pick it up. This is the starting state for all new tasks. An agent can claim any pending task that has no unresolved blockers.
An agent has claimed the task and is actively working on it. The agent holds a 30-minute lease. No other agent can claim this task until the lease expires or the agent finishes.
The agent finished the work successfully. The task is done. Any tasks that were "blocked by" this task are automatically unblocked.
The agent tried but could not complete the task. The failure count is incremented. Another agent can retry by claiming the task again. After 3 failures, the task auto-escalates.
The task depends on other tasks that have not been completed yet. The blockedBy field lists which tasks must finish first. Once all blockers complete, the status automatically changes to pending.
The task has failed 3 or more times and has been flagged for human attention. Automated agents will not pick it up. A human needs to investigate what is going wrong.
State Transitions
Here is every valid way a task can change states and what triggers each transition:
- pending → in_progress — an agent runs hq task claim
- created with --blockedBy dependencies → blocked
- in_progress → completed — the agent runs hq task complete
- in_progress → failed — the agent runs hq task fail
- failed → in_progress — another agent claims the task to retry it
- blocked → pending — all blocking tasks complete (auto-unblock)
- failed → escalated — the third failure (auto-escalate)

Auto-Unblock: How Dependencies Resolve
When you create a task with --blockedBy 1,2, it starts in blocked status. The task cannot be claimed until tasks 1 and 2 are both completed.
Here is what happens step by step:
- Task 3 is created with blockedBy: [1, 2] — status is blocked
- Agent completes task 1 — the system scans all tasks and removes 1 from task 3's blockedBy list
- Task 3 now has blockedBy: [2] — still blocked
- Agent completes task 2 — the system removes 2 from task 3's blockedBy list
- Task 3 now has blockedBy: [] — status automatically changes to pending
- An agent can now claim task 3
This all happens automatically inside hq task complete. The completing agent does not need to know about downstream dependencies.
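The scan can be sketched in a few lines of Python. This is a minimal model, not the hq implementation — tasks live in an in-memory dict here rather than in locked JSON files, and complete_task is a hypothetical name:

```python
def complete_task(tasks, done_id):
    # Mark the task completed, then scan every other task and
    # remove the finished ID from its blockedBy list.
    tasks[done_id]["status"] = "completed"
    for tid, task in tasks.items():
        if tid != done_id and done_id in task["blockedBy"]:
            task["blockedBy"].remove(done_id)
            # A blocked task with no remaining blockers becomes pending.
            if task["status"] == "blocked" and not task["blockedBy"]:
                task["status"] = "pending"

tasks = {
    1: {"status": "pending", "blockedBy": []},
    2: {"status": "pending", "blockedBy": []},
    3: {"status": "blocked", "blockedBy": [1, 2]},
}
complete_task(tasks, 1)   # task 3 still blocked on 2
complete_task(tasks, 2)   # task 3's blockedBy empties -> pending
print(tasks[3])           # {'status': 'blocked'...} is now pending with []
```

Note that the completing agent only names the task it finished; the downstream bookkeeping falls out of the scan.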
Auto-Escalate: When Things Keep Failing
If a task fails 3 times, the system gives up on automated retry and escalates to a human:
- Agent A claims task, fails — failureCount: 1, status: failed
- Agent B claims task (retry), fails — failureCount: 2, status: failed
- Agent C claims task (retry), fails — failureCount: 3, status: escalated, stateReason: human_escalation
Once escalated, the task will not be automatically picked up again. A human needs to investigate the root cause — maybe the task description is ambiguous, the required tool is misconfigured, or there is a genuine blocker that AI agents cannot handle.
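The escalation rule is easy to model. A sketch, again with a plain dict standing in for the task file and record_failure as a hypothetical name for the bookkeeping hq task fail performs:

```python
MAX_FAILURES = 3   # after the third failure the task escalates

def record_failure(task, reason):
    # Increment the failure count and pick the resulting state.
    task["failureCount"] += 1
    task["failureDetail"] = reason
    if task["failureCount"] >= MAX_FAILURES:
        task["status"] = "escalated"
        task["stateReason"] = "human_escalation"
    else:
        task["status"] = "failed"

task = {"failureCount": 0, "status": "pending",
        "failureDetail": None, "stateReason": None}
for agent in ("A", "B", "C"):
    task["status"] = "in_progress"            # agent claims (or retries)
    record_failure(task, f"agent {agent} could not finish")

print(task["status"], task["failureCount"])   # escalated 3
```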
Inboxes
Inboxes are a simple append-only messaging system that agents use to communicate with each other. Each agent has one inbox per team, stored as a JSON file at ~/.hq/teams/{team}/inboxes/{agent}.json.
When Do Agents Use Inboxes?
Common event types include task_completed, review_approved, research_findings, error_report, and ad_hoc_request.

How Inboxes Work
- Agent A sends a message to Agent B's inbox using hq inbox send
- The message includes a type (what kind of event), a payload (the actual data), and an idempotency key (to prevent duplicates)
- The message gets a sequential event ID (1, 2, 3, ...)
- Agent B reads its inbox using hq inbox read
- Agent B can filter by --since-event to only see new messages since the last one it processed
Inboxes are append-only — messages are never deleted or modified. This creates a complete audit trail of all inter-agent communication.
Inbox Event Structure
Each message in an inbox looks like this:
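A representative event might look like the following. The type, payload, idempotency key, and sequential event ID come from the description above; the exact field names and the timestamp field are illustrative:

```json
{
  "eventId": 7,
  "type": "task_completed",
  "from": "claude",
  "payload": { "taskId": 3, "summary": "Findings written to docs/cache.md" },
  "idempotencyKey": "3-claude-complete",
  "createdAt": "2025-01-15T10:30:00Z"
}
```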
Concurrency Safety
The most important feature of AI Agents HQ is that it is safe for concurrent use. Multiple agents running at the same time cannot corrupt the system's state. This section explains the three mechanisms that make this possible.
Mechanism 1: File Locking (flock)
When the hq CLI needs to read or modify a task file, it first acquires an advisory lock on that file using the operating system's flock system call. This is like putting a "do not disturb" sign on a hotel room door.
While one agent holds the lock on task-1.json, any other agent trying to modify the same file will wait until the lock is released. This prevents the classic lost-update race: two agents read the file at the same time, both make changes, and then both write — the second write silently overwriting the first agent's changes.
The lock is automatically released when the operation finishes, even if the program crashes.
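You can watch this behavior with Python's fcntl.flock wrapper on Unix — a sketch of two "agents" (here, two file descriptors) contending for the same task file:

```python
import fcntl, os, tempfile

path = os.path.join(tempfile.mkdtemp(), "task-1.json")
open(path, "w").close()                        # create an empty "task file"

holder = open(path)
fcntl.flock(holder, fcntl.LOCK_EX)             # first agent takes the lock

contender = open(path)                         # second agent, separate descriptor
try:
    # Try to take the lock without waiting — fails while holder has it.
    fcntl.flock(contender, fcntl.LOCK_EX | fcntl.LOCK_NB)
    contender_got_lock = True
except OSError:
    contender_got_lock = False                 # lock is held elsewhere

fcntl.flock(holder, fcntl.LOCK_UN)             # released (also on close or crash)
print(contender_got_lock)                      # False
```

Without LOCK_NB, the second flock call would simply block until the first agent released the lock — which is exactly what the hq CLI relies on.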
Mechanism 2: Atomic Writes
When writing a file, the system never modifies the file in place. Instead, it:
- Writes the new data to a temporary file in the same directory
- Calls fsync to ensure the data is physically written to disk (not just sitting in a memory buffer)
- Renames the temp file to the real filename
The rename operation is atomic on all major operating systems — it either fully succeeds or does not happen at all. This means if the system crashes at any point during writing, you either have the old complete file or the new complete file — never a half-written corrupted file.
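The same write-fsync-rename dance, sketched in Python (atomic_write_json is an illustrative helper, not part of hq):

```python
import json, os, tempfile

def atomic_write_json(path, data):
    # Write to a temp file in the SAME directory, fsync it, then
    # atomically rename it over the destination.
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())    # force the bytes to physical disk
        os.replace(tmp, path)       # atomic rename — old or new, never half
    except BaseException:
        os.unlink(tmp)              # clean up the temp file on failure
        raise

path = os.path.join(tempfile.mkdtemp(), "task-demo.json")
atomic_write_json(path, {"id": 1, "version": 2})
print(open(path).read())
```

The temp file must live in the same directory as the target: rename is only atomic within a single filesystem.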
Mechanism 3: CAS (Compare-And-Swap) Versioning
Every task has a version field that starts at 1 and increases by 1 every time the task is modified. When an agent wants to update a task, it must provide the version it expects the task to be at.
Here is why this matters. Suppose Agents A and B both read a task at version 1. Agent A finishes first and writes an update, bumping the task to version 2. When Agent B then submits its own update, still expecting version 1, the operation is rejected — between Agent B's read and write, someone else changed the task. This is called a CAS conflict (Compare-And-Swap conflict). The system returns exit code 10 when this happens, and the agent knows it needs to re-read the task and try again.
This is optimistic concurrency control — the same technique databases and distributed systems use to prevent lost updates without holding long-lived locks.
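A minimal in-memory sketch of the version check — update_task and CASConflict are illustrative names; in hq the conflict surfaces as exit code 10:

```python
class CASConflict(Exception):
    """Expected version did not match the stored version."""

def update_task(task, expected_version, changes):
    # Reject the write if the task changed since the caller read it.
    if task["version"] != expected_version:
        raise CASConflict(f"expected v{expected_version}, found v{task['version']}")
    task.update(changes)
    task["version"] += 1          # every successful write bumps the version

task = {"id": 1, "version": 1, "status": "pending"}

# Agents A and B both read the task at version 1.
update_task(task, 1, {"status": "in_progress", "owner": "agent-a"})  # A wins

try:
    update_task(task, 1, {"status": "in_progress", "owner": "agent-b"})
    b_result = "accepted"
except CASConflict:
    b_result = "rejected"         # B must re-read the task and retry
print(task["owner"], task["version"], b_result)   # agent-a 2 rejected
```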
How They Work Together
All three mechanisms work in concert:
- File locking ensures only one process can modify a file at a time
- Atomic writes ensure that even if a process crashes mid-write, the file is never corrupted
- CAS versioning catches cases where two agents read the same state and try to make conflicting updates
Together, they guarantee that no matter how many agents are running simultaneously, the task state is always consistent and correct. This has been verified with both unit tests (10 goroutines racing) and integration tests (5 separate OS processes racing).
Leases and Heartbeats
A lease is a time-limited reservation on a task. When an agent claims a task, it gets a lease that expires in 30 minutes. This solves a critical problem: what happens if an agent crashes?
The Problem
Without leases, if an agent claims a task and then crashes (process killed, network failure, out-of-memory), the task would be stuck in in_progress forever. No other agent could pick it up because it is "owned" by the crashed agent.
The Solution
With leases, the task's leaseUntil field records when the lease expires. If an agent does not finish the task or send a heartbeat before the lease expires, the system considers the task abandoned. Another agent can then claim it.
How Heartbeats Work
For tasks that take longer than a few minutes, the agent should send periodic heartbeats to renew the lease before it expires.
The recommended heartbeat interval is every 10 minutes (configured as HeartbeatInterval in the protocol constants). This gives plenty of margin before the 30-minute lease expires.
Stale Lease Recovery
When an agent tries to claim a task that is in_progress, the system checks the lease:
- If the lease is still valid — the claim is rejected (someone is working on it)
- If the lease has expired — the claim succeeds, the old owner is replaced, and the new agent gets a fresh 30-minute lease
This means no human intervention is needed to recover from agent crashes. The system self-heals.
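The claim-time lease check boils down to one timestamp comparison. A sketch with an illustrative try_claim helper, using the 30-minute lease from above:

```python
from datetime import datetime, timedelta, timezone

LEASE_DURATION = timedelta(minutes=30)

def try_claim(task, agent, now):
    # Reject the claim only while someone else's lease is still valid.
    if task["status"] == "in_progress" and task["leaseUntil"] > now:
        return False
    task["status"] = "in_progress"
    task["leaseOwner"] = agent
    task["leaseUntil"] = now + LEASE_DURATION   # fresh 30-minute lease
    return True

now = datetime.now(timezone.utc)
task = {"status": "in_progress", "leaseOwner": "crashed-agent",
        "leaseUntil": now - timedelta(minutes=5)}   # expired 5 minutes ago

takeover = try_claim(task, "agent-b", now)     # stale lease -> claim succeeds
rejected = try_claim(task, "agent-c", now)     # fresh lease -> claim fails
print(takeover, rejected, task["leaseOwner"])  # True False agent-b
```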
Idempotency
Idempotency means that doing the same operation multiple times has the same effect as doing it once. This is crucial for reliability.
Why It Matters
Imagine an agent sends hq task complete but the process crashes immediately after — before the agent can record that it already sent the command. When the agent restarts, it does not know whether the command succeeded, so it sends it again.
Without idempotency, the second hq task complete would fail with "task is already completed" or worse, corrupt the state. With idempotency, the second call is silently ignored and returns success — because the system detects that this exact operation was already performed.
How It Works
Every state-changing operation includes an idempotency key — a unique string that identifies the operation. The format is deterministic: {task_id}-{agent_name}-{action}.
The system stores all seen idempotency keys in a persistent JSON file. Before performing any operation, it checks whether the key already exists. If it does, the operation is skipped and the system returns success.
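A sketch of the check-then-record pattern — run_once and the storage path are illustrative, but the key format matches the one above:

```python
import json, os, tempfile

KEYS_PATH = os.path.join(tempfile.mkdtemp(), "idempotency-keys.json")

def load_keys():
    if os.path.exists(KEYS_PATH):
        with open(KEYS_PATH) as f:
            return set(json.load(f))
    return set()

def run_once(task_id, agent, action, operation):
    # Perform operation() only if its idempotency key is unseen.
    key = f"{task_id}-{agent}-{action}"        # deterministic key format
    seen = load_keys()
    if key in seen:
        return "skipped"                       # already done — report success
    operation()
    seen.add(key)
    with open(KEYS_PATH, "w") as f:
        json.dump(sorted(seen), f)             # persist the seen keys
    return "performed"

calls = []
first = run_once(3, "claude", "complete", lambda: calls.append(1))
second = run_once(3, "claude", "complete", lambda: calls.append(1))
print(first, second, len(calls))               # performed skipped 1
```

In the real system the key check and the operation happen under the same file lock, so two concurrent retries cannot both see the key as unseen.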
Protocol Versioning
Every agent must include a --protocol-version flag when performing state-changing operations (claim, complete, fail, inbox send). The system checks that this matches the current protocol version (currently version 2).
Why?
If the protocol changes in a future version (new fields, different behavior), old agents running the previous version could corrupt data by misunderstanding the new format. Protocol version checking prevents this.
When there is a mismatch, the system returns exit code 12 (ExitProtocolMismatch), which tells the agent: "you are running an outdated version, please restart with the new code."
This is a safety net that prevents subtle bugs when upgrading the system while agents are still running.
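The check itself is tiny; a sketch using the exit code from above:

```python
PROTOCOL_VERSION = 2        # current protocol version
EXIT_PROTOCOL_MISMATCH = 12

def check_protocol(agent_version):
    # Return the exit code the agent would receive for this version.
    if agent_version != PROTOCOL_VERSION:
        return EXIT_PROTOCOL_MISMATCH
    return 0

print(check_protocol(1), check_protocol(2))   # 12 0
```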