How It Works


This page explains the core ideas behind AI Agents HQ. If you are new to multi-agent systems, concurrent programming, or distributed coordination, start here. Everything is explained from the ground up — no prior knowledge assumed.

The Fundamental Problem

Imagine you have three AI agents — Claude, Gemini, and Codex — all working on the same codebase. Without coordination:

  • Two agents might try to edit the same file at the same time, overwriting each other's work
  • An agent might start a task that another agent already finished, wasting time and tokens
  • If an agent crashes mid-task, the task gets stuck forever with no one to pick it up
  • Agents have no way to share results with each other ("I finished the research, here are my findings")

AI Agents HQ solves all of these problems with a simple file-based coordination system.

The Architecture in One Sentence

Agents communicate by reading and writing JSON files through a CLI tool (hq), and the system uses file locking, version numbers, and leases to prevent conflicts.

Tasks


A task is the fundamental unit of work in AI Agents HQ. It is a JSON file stored at ~/.hq/tasks/{team}/{id}.json that contains everything an agent needs to know about an assignment.

What a Task Looks Like

Here is an example task file (you would not normally edit this by hand — agents use the hq CLI):

terminal
$ cat ~/.hq/tasks/my-team/1.json
{
  "schemaVersion": 1,
  "id": 1,
  "protocol_version": 2,
  "subject": "Write unit tests for auth module",
  "description": "Add tests for email validation, password hashing, and session creation.",
  "status": "pending",
  "owner": null,
  "version": 1,
  "leaseOwner": null,
  "leaseUntil": null,
  "tool": "claude",
  "profile": null,
  "required_mcps": [],
  "soft_token_limit": null,
  "stateReason": null,
  "blockedBy": [],
  "blocks": [],
  "activeForm": "",
  "summary": null,
  "failureDetail": null,
  "done_criteria_override": null,
  "failureCount": 0,
  "createdAt": 1739884800000,
  "updatedAt": 1739884800000
}

Key Fields Explained

  • id: A unique number identifying this task (1, 2, 3, ...)
  • subject: A short title describing the task
  • description: Detailed instructions for the agent
  • status: The current state: pending, in_progress, completed, failed, blocked, or escalated
  • owner: Which agent is currently working on this task (null if nobody has claimed it)
  • version: A counter that increases every time the task is modified (used for conflict detection)
  • tool: Which AI tool should handle this task: "claude", "gemini", or "codex"
  • leaseOwner: Which agent currently holds the lease (may differ from owner if the lease expired)
  • leaseUntil: When the lease expires (Unix timestamp in milliseconds)
  • blockedBy: List of task IDs that must complete before this task can be started
  • blocks: List of task IDs that are waiting for this task to complete
  • summary: Completion summary set by hq task complete --summary (null until completed)
  • failureDetail: Failure reason set by hq task fail --reason (null until failed)
  • failureCount: How many times this task has been attempted and failed
  • stateReason: Why the task is in its current state (e.g., "agent_error", "lease_expired")

Task Lifecycle


Every task moves through a series of states. Understanding this lifecycle is key to understanding how the whole system works.

Status Flow

Pending → In Progress → Completed

That is the happy path. But real work has complications — agents crash, tasks fail, dependencies exist. Here is the complete picture:

All Possible States

Pending

The task has been created and is waiting for an agent to pick it up. This is the starting state for all new tasks. An agent can claim any pending task that has no unresolved blockers.

In Progress

An agent has claimed the task and is actively working on it. The agent holds a 30-minute lease. No other agent can claim this task until the lease expires or the agent finishes.

Completed

The agent finished the work successfully. The task is done. Any tasks that were "blocked by" this task are automatically unblocked.

Failed

The agent tried but could not complete the task. The failure count is incremented. Another agent can retry by claiming the task again. After 3 failures, the task auto-escalates.

Blocked

The task depends on other tasks that have not been completed yet. The blockedBy field lists which tasks must finish first. Once all blockers complete, the status automatically changes to pending.

Escalated

The task has failed 3 or more times and has been flagged for human attention. Automated agents will not pick it up. A human needs to investigate what is going wrong.

State Transitions

Here is every valid way a task can change states and what triggers each transition:

  • Pending → In Progress: Agent claims the task via hq task claim
  • Pending → Blocked: Task is created with --blockedBy dependencies
  • In Progress → Completed: Agent reports success via hq task complete
  • In Progress → Failed: Agent reports failure via hq task fail
  • In Progress → Pending: Lease expires (no heartbeat), task becomes claimable again
  • Failed → In Progress: Another agent retries by claiming the failed task
  • Failed → Escalated: Failure count reaches 3 (automatic)
  • Blocked → Pending: All blocking tasks are completed (automatic)
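
The transition rules are simple enough to encode as a lookup table. Here is an illustrative Python sketch of the table above (not the actual hq source; the state names match the status values in the task JSON):

```python
# Illustrative transition table — maps each state to the states it may
# legally move to. This mirrors the documentation, not the hq internals.
VALID_TRANSITIONS = {
    "pending": {"in_progress", "blocked"},
    "in_progress": {"completed", "failed", "pending"},  # pending = lease expired
    "failed": {"in_progress", "escalated"},
    "blocked": {"pending"},
    "completed": set(),   # terminal
    "escalated": set(),   # terminal until a human intervenes
}

def can_transition(current: str, target: str) -> bool:
    """Return True if moving from `current` to `target` is a valid transition."""
    return target in VALID_TRANSITIONS.get(current, set())
```

Any requested state change that is not in this table can be rejected up front, before touching the task file.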

Auto-Unblock: How Dependencies Resolve

When you create a task with --blockedBy 1,2, it starts in blocked status. The task cannot be claimed until tasks 1 and 2 are both completed.

Here is what happens step by step:

  1. Task 3 is created with blockedBy: [1, 2] — status is blocked
  2. Agent completes task 1 — the system scans all tasks and removes 1 from task 3's blockedBy list
  3. Task 3 now has blockedBy: [2] — still blocked
  4. Agent completes task 2 — the system removes 2 from task 3's blockedBy list
  5. Task 3 now has blockedBy: [] — automatically changes to pending
  6. An agent can now claim task 3

This all happens automatically inside hq task complete. The completing agent does not need to know about downstream dependencies.
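
The unblocking step can be sketched in a few lines. This is an illustrative Python version using the blockedBy and status field names from the task JSON; it is not the real hq implementation:

```python
# Illustrative sketch of the auto-unblock pass that runs inside task
# completion — field names mirror the task JSON, logic mirrors the text.
def complete_task(tasks: dict, done_id: int) -> None:
    """Mark a task completed and remove it from every other task's blockedBy."""
    tasks[done_id]["status"] = "completed"
    for task in tasks.values():
        if done_id in task.get("blockedBy", []):
            task["blockedBy"].remove(done_id)
            # Once the last blocker is gone, the task becomes claimable.
            if not task["blockedBy"] and task["status"] == "blocked":
                task["status"] = "pending"
```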

Auto-Escalate: When Things Keep Failing

If a task fails 3 times, the system gives up on automated retry and escalates to a human:

  1. Agent A claims task, fails — failureCount: 1, status: failed
  2. Agent B claims task (retry), fails — failureCount: 2, status: failed
  3. Agent C claims task (retry), fails — failureCount: 3, status: escalated, stateReason: human_escalation

Once escalated, the task will not be automatically picked up again. A human needs to investigate the root cause — maybe the task description is ambiguous, the required tool is misconfigured, or there is a genuine blocker that AI agents cannot handle.
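
The failure-count rule can be sketched like this. The threshold of 3 and the human_escalation state reason come from the text; the function name is hypothetical:

```python
# Illustrative sketch of the fail/escalate rule — not the hq source.
ESCALATION_THRESHOLD = 3  # per the text: escalate after the third failure

def fail_task(task: dict, reason: str) -> None:
    """Record a failure; flag the task for human attention after attempt 3."""
    task["failureCount"] += 1
    task["failureDetail"] = reason
    if task["failureCount"] >= ESCALATION_THRESHOLD:
        task["status"] = "escalated"
        task["stateReason"] = "human_escalation"
    else:
        task["status"] = "failed"  # still eligible for automated retry
```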

Inboxes


Inboxes are a simple append-only messaging system that agents use to communicate with each other. Each agent has one inbox per team, stored as a JSON file at ~/.hq/teams/{team}/inboxes/{agent}.json.

When Do Agents Use Inboxes?

  • task_completed: An agent finishes a task and wants to notify the lead or another agent
  • review_approved: A reviewer agent approves a code change
  • research_findings: A research agent shares findings for another agent to act on
  • error_report: An agent encountered a problem and is reporting it
  • ad_hoc_request: One agent is asking another agent to do something outside the normal task flow

How Inboxes Work

  1. Agent A sends a message to Agent B's inbox using hq inbox send
  2. The message includes a type (what kind of event), a payload (the actual data), and an idempotency key (to prevent duplicates)
  3. The message gets a sequential event ID (1, 2, 3, ...)
  4. Agent B reads its inbox using hq inbox read
  5. Agent B can filter by --since-event to only see new messages since the last one it processed

Inboxes are append-only — messages are never deleted or modified. This creates a complete audit trail of all inter-agent communication.
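
The append-and-number behavior can be sketched with a JSON array on disk. This is an illustrative Python version (the helper name is hypothetical, not part of the hq CLI, and it omits the locking described later):

```python
import json
import os

# Illustrative sketch of an append-only inbox file — a JSON array of events
# where event IDs are assigned sequentially. Not the hq internals.
def inbox_send(path: str, event: dict) -> int:
    """Append an event to an inbox file and return its sequential event_id."""
    events = []
    if os.path.exists(path):
        with open(path) as f:
            events = json.load(f)
    event = dict(event, event_id=len(events) + 1)  # IDs are 1, 2, 3, ...
    events.append(event)                            # append-only: never edit old events
    with open(path, "w") as f:
        json.dump(events, f)
    return event["event_id"]
```

Reading with --since-event then reduces to filtering the array for event_id values greater than the last one processed.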

Inbox Event Structure

Each message in an inbox looks like this:

terminal
{
  "event_id": 1,
  "from": "researcher-agent",
  "to": "coder-agent",
  "type": "research_findings",
  "protocol_version": 2,
  "idempotency_key": "5-researcher-findings",
  "payload": {
    "taskId": 5,
    "summary": "Found 3 relevant API endpoints...",
    "details": "..."
  },
  "timestamp": 1739884800000
}

Concurrency Safety


The most important feature of AI Agents HQ is that it is safe for concurrent use. Multiple agents running at the same time cannot corrupt the system's state. This section explains the three mechanisms that make this possible.

Mechanism 1: File Locking (flock)

When the hq CLI needs to read or modify a task file, it first acquires an advisory lock on that file using the operating system's flock system call. This is like putting a "do not disturb" sign on a hotel room door.

While one agent holds the lock on task-1.json, any other agent trying to modify the same file will wait until the lock is released. This prevents the classic read-modify-write race: two agents read the file, both make changes, and both write them back, so one agent's changes silently overwrite the other's.

The lock is automatically released when the operation finishes, even if the program crashes.
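
Here is roughly what that looks like in Python, using the standard fcntl module on Unix-like systems (an illustrative sketch, not the hq source):

```python
import fcntl

# Illustrative sketch of advisory file locking with flock — the pattern the
# text describes, not the literal hq implementation.
def with_task_lock(path: str, mutate):
    """Hold an exclusive advisory lock on `path` while `mutate(file)` runs."""
    with open(path, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)      # blocks until the lock is free
        try:
            return mutate(f)
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)  # kernel also releases on process exit
```

Because the kernel owns the lock, it cannot leak: if the process dies while holding it, closing the file descriptor releases it.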

Mechanism 2: Atomic Writes

When writing a file, the system never modifies the file in place. Instead, it:

  1. Writes the new data to a temporary file in the same directory
  2. Calls fsync to ensure the data is physically written to disk (not just sitting in a memory buffer)
  3. Renames the temp file to the real filename

The rename operation is atomic on all major operating systems — it either fully succeeds or does not happen at all. This means if the system crashes at any point during writing, you either have the old complete file or the new complete file — never a half-written corrupted file.
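
In Python, the same write-fsync-rename pattern looks roughly like this (an illustrative sketch; hq itself may differ in detail):

```python
import os
import tempfile

# Illustrative sketch of the write -> fsync -> rename pattern described
# above. Readers always see the old file or the new file, never a mix.
def atomic_write(path: str, data: bytes) -> None:
    """Replace the contents of `path` without ever exposing a partial file."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory)  # temp file in the same directory
    try:
        os.write(fd, data)
        os.fsync(fd)                           # force the bytes to physical disk
    finally:
        os.close(fd)
    os.replace(tmp, path)                      # atomic rename over the target
```

The temp file must live in the same directory as the target, because rename is only atomic within a single filesystem.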

Mechanism 3: CAS (Compare-And-Swap) Versioning

Every task has a version field that starts at 1 and increases by 1 every time the task is modified. When an agent wants to update a task, it must provide the version it expects the task to be at.

Here is why this matters:

terminal
# Agent A reads task 1 (version = 1)
# Agent B reads task 1 (version = 1)
# Agent A claims it, passing expectedVersion=1 — succeeds, version bumps to 2
# Agent B tries to claim it, passing expectedVersion=1 — FAILS (version is now 2)

Agent B's operation is rejected because between reading and writing, someone else changed the task. This is called a CAS conflict (Compare-And-Swap conflict). The system returns exit code 10 when this happens, and the agent knows it needs to re-read the task and try again.

This is the same technique used by databases and distributed systems worldwide. It is the gold standard for preventing race conditions.
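
A minimal sketch of the version check, matching the exit-code-10 conflict described above (the names CasConflict and cas_update are hypothetical, not hq internals):

```python
# Illustrative sketch of compare-and-swap on the task's version field.
EXIT_CAS_CONFLICT = 10  # the exit code the text says hq returns on conflict

class CasConflict(Exception):
    """Raised when expected_version no longer matches the stored version."""

def cas_update(task: dict, expected_version: int, changes: dict) -> dict:
    """Apply `changes` only if the task is still at `expected_version`."""
    if task["version"] != expected_version:
        raise CasConflict(f"expected {expected_version}, found {task['version']}")
    task.update(changes)
    task["version"] += 1   # every successful write bumps the version
    return task
```

An agent that hits CasConflict simply re-reads the task, checks whether its operation still makes sense, and retries with the new version number.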

How They Work Together

All three mechanisms work in concert:

  1. File locking ensures only one process can modify a file at a time
  2. Atomic writes ensure that even if a process crashes mid-write, the file is never corrupted
  3. CAS versioning catches cases where two agents read the same state and try to make conflicting updates

Together, they guarantee that no matter how many agents are running simultaneously, the task state is always consistent and correct. This has been verified with both unit tests (10 goroutines racing) and integration tests (5 separate OS processes racing).

Leases and Heartbeats


A lease is a time-limited reservation on a task. When an agent claims a task, it gets a lease that expires in 30 minutes. This solves a critical problem: what happens if an agent crashes?

The Problem

Without leases, if an agent claims a task and then crashes (process killed, network failure, out-of-memory), the task would be stuck in in_progress forever. No other agent could pick it up because it is "owned" by the crashed agent.

The Solution

With leases, the task's leaseUntil field records when the lease expires. If an agent does not finish the task or send a heartbeat before the lease expires, the system considers the task abandoned. Another agent can then claim it.

How Heartbeats Work

For tasks that take longer than a few minutes, the agent should send periodic heartbeats:

terminal
# Agent claims task (gets 30-minute lease)
$ hq task claim --team my-team --task 1 --agent worker --tool claude --protocol-version 2
# 10 minutes later, agent sends heartbeat (extends lease by another 30 minutes)
$ hq task heartbeat --team my-team --task 1 --agent worker
# 10 more minutes later, another heartbeat
$ hq task heartbeat --team my-team --task 1 --agent worker
# Agent finishes and completes the task
$ hq task complete --team my-team --task 1 --agent worker --summary "Done" --protocol-version 2 --idempotency-key "1-worker-complete"

The recommended heartbeat interval is every 10 minutes (configured as HeartbeatInterval in the protocol constants). This gives plenty of margin before the 30-minute lease expires.
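
A sketch of what a heartbeat does to the lease, assuming the 30-minute window above (the helper name is hypothetical):

```python
# Illustrative sketch of a heartbeat extending the lease — not the hq source.
LEASE_MS = 30 * 60 * 1000  # 30-minute lease, per the text

def heartbeat(task: dict, agent: str, now_ms: int) -> bool:
    """Extend the lease by another 30 minutes if `agent` still holds it."""
    if task.get("leaseOwner") != agent or (task["leaseUntil"] or 0) <= now_ms:
        return False   # lease lost or already expired: the agent must re-claim
    task["leaseUntil"] = now_ms + LEASE_MS
    return True
```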

Stale Lease Recovery

When an agent tries to claim a task that is in_progress, the system checks the lease:

  1. If the lease is still valid — the claim is rejected (someone is working on it)
  2. If the lease has expired — the claim succeeds, the old owner is replaced, and the new agent gets a fresh 30-minute lease

This means no human intervention is needed to recover from agent crashes. The system self-heals.
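
The claim-time lease check can be sketched like this (illustrative Python; the field names match the task JSON, the function name is hypothetical):

```python
# Illustrative sketch of claiming a task, including stale-lease takeover.
LEASE_MS = 30 * 60 * 1000  # fresh 30-minute lease on every successful claim

def try_claim(task: dict, agent: str, now_ms: int) -> bool:
    """Claim a pending task, or take over an in_progress task whose lease expired."""
    lease_expired = task["leaseUntil"] is not None and task["leaseUntil"] <= now_ms
    if task["status"] == "pending" or (task["status"] == "in_progress" and lease_expired):
        task.update(status="in_progress", owner=agent,
                    leaseOwner=agent, leaseUntil=now_ms + LEASE_MS)
        return True
    return False  # a valid lease is held by someone else
```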

Idempotency


Idempotency means that doing the same operation multiple times has the same effect as doing it once. This is crucial for reliability.

Why It Matters

Imagine an agent sends hq task complete but the process crashes immediately after — before the agent can record that it already sent the command. When the agent restarts, it does not know whether the command succeeded, so it sends it again.

Without idempotency, the second hq task complete would fail with "task is already completed" or worse, corrupt the state. With idempotency, the second call is silently ignored and returns success — because the system detects that this exact operation was already performed.

How It Works

Every state-changing operation includes an idempotency key — a unique string that identifies the operation. The format is deterministic: {task_id}-{agent_name}-{action}.

terminal
# First call — performs the operation
$ hq task complete --team demo --task 1 --agent worker --summary "Done" --protocol-version 2 --idempotency-key "1-worker-complete"
# Exit code 0 (success)
# Second call (same idempotency key) — recognized as duplicate, silently ignored
$ hq task complete --team demo --task 1 --agent worker --summary "Done" --protocol-version 2 --idempotency-key "1-worker-complete"
# Exit code 0 (success, but no state change)

The system stores all seen idempotency keys in a persistent JSON file. Before performing any operation, it checks whether the key already exists. If it does, the operation is skipped and the system returns success.
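
A minimal sketch of the check-before-run pattern, with an in-memory key set standing in for the persistent JSON file (the helper name is hypothetical):

```python
# Illustrative sketch of idempotency-key checking — the {task_id}-{agent}-{action}
# key format comes from the text; this is not the hq implementation.
def run_once(seen_keys: set, key: str, operation) -> bool:
    """Run `operation` only if `key` is new; return True if it actually ran."""
    if key in seen_keys:
        return False          # duplicate: skip silently, still report success
    operation()
    seen_keys.add(key)        # in hq this set is persisted to disk
    return True
```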

Protocol Versioning


Every agent must include a --protocol-version flag when performing state-changing operations (claim, complete, fail, inbox send). The system checks that this matches the current protocol version (currently version 2).

Why?

If the protocol changes in a future version (new fields, different behavior), old agents running the previous version could corrupt data by misunderstanding the new format. Protocol version checking prevents this.

When there is a mismatch, the system returns exit code 12 (ExitProtocolMismatch), which tells the agent: "you are running an outdated version, please restart with the new code."

This is a safety net that prevents subtle bugs when upgrading the system while agents are still running.
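
The check itself is tiny. A sketch using the exit codes named above (the function name is hypothetical):

```python
# Illustrative sketch of the protocol-version gate — exit code 12 matches
# the ExitProtocolMismatch constant mentioned in the text.
CURRENT_PROTOCOL_VERSION = 2
EXIT_OK = 0
EXIT_PROTOCOL_MISMATCH = 12

def check_protocol(requested: int) -> int:
    """Return the exit code the CLI would use for this --protocol-version."""
    if requested != CURRENT_PROTOCOL_VERSION:
        return EXIT_PROTOCOL_MISMATCH  # agent must restart with current code
    return EXIT_OK
```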