HUD Documentation — Evaluations and RL Environments.

Before deploying, test locally. This page covers local testing patterns, A/B testing with variants and groups, mock mode, and debugging.

Local Testing

Environment	`local_test.py`
No Docker	`from env import env`
Docker	`env.connect_url("http://localhost:8765/mcp")`

Both use the same API after setup:

async with env:
    tools = env.as_tools()                              # List available tools
    result = await env.call_tool("my_tool", arg="val")  # Call a tool

Variants

LLM outputs vary from run to run—ask the same question twice and you might get different quality answers. Variants let you test different configurations side-by-side:

import hud

async with hud.eval(task, variants={"model": ["gpt-4o", "claude-sonnet-4-5"]}) as ctx:
    response = await client.chat.completions.create(
        model=ctx.variants["model"],  # Current variant
        messages=[{"role": "user", "content": ctx.prompt}]
    )
    ctx.reward = 1.0 if "correct" in response.choices[0].message.content else 0.0

for result in ctx.results:
    print(f"{result.variants}: reward={result.reward}")

Lists expand to all combinations:

variants = {
    "model": ["gpt-4o", "claude"],
    "temperature": [0.0, 0.7],
}
# Creates 4 combinations: gpt-4o+0.0, gpt-4o+0.7, claude+0.0, claude+0.7

Groups

Run each variant multiple times to see the distribution, not just one lucky or unlucky result:

async with hud.eval(
    task,
    variants={"model": ["gpt-4o", "claude-sonnet-4-5"]},
    group=5  # 10 runs total: 2 models × 5 each
) as ctx:
    ...

The hud.eval manager parallelizes automatically. Total runs = len(tasks) × len(variant_combinations) × group.

Mock Mode

env.mock() intercepts at the tool layer. Agents only see tools, so this is usually all you need for testing agent logic without hitting real services:

env.mock()  # All tools return schema-based fake responses
env.mock_tool("send_email", {"status": "sent", "id": "mock-123"})
env.mock_tool("charge_card", {"success": True, "transaction_id": "tx-mock"})

# Check mock state
assert env.is_mock == True

For stateful mocking (tracking what happened for assertions):

class MockPaymentService:
    def __init__(self):
        self.charges = []
    
    async def charge(self, amount: int, card_token: str) -> dict:
        self.charges.append({"amount": amount, "token": card_token})
        return {"success": True, "id": f"ch-{len(self.charges)}"}

payments = MockPaymentService()

@env.scenario("checkout")
async def checkout(cart_total: int):
    _ = yield f"Complete checkout for ${cart_total}"
    yield 1.0 if any(c["amount"] == cart_total for c in payments.charges) else 0.0

Your agent code stays the same—toggle env.mock() for testing.

Testing Scenarios Directly

Scenarios are async generators. hud.eval() drives them automatically, but you can test the logic directly:

async def checkout(user_id: str, amount: int = 100):
    # Setup + prompt (first yield)
    answer = yield f"Complete checkout for {user_id}, ${amount}"
    
    # Evaluation (second yield)
    yield 1.0 if "success" in answer.lower() else 0.0

async def test():
    gen = checkout("alice", 50)
    prompt = await anext(gen)            # What hud.eval() does at start
    reward = await gen.asend("Success!") # What hud.eval() does after submit
    assert reward == 1.0

If your scenario tests pass, hud.eval() will behave identically.

Hot-Reload

For Docker environments, hud dev -w path reloads Python on save:

hud dev -w scenarios -w tools --port 8765

System services (postgres, VNC, browsers) persist across reloads.

Debugging Build Failures

hud build runs the exact same pipeline as New → Environment on hud.ai—so if it passes locally, it’ll work in production. If the build fails or the container crashes on startup, use hud debug:

hud debug my-env:latest

Output shows exactly which phase failed:

✓ Phase 1: Docker image exists
✓ Phase 2: MCP server responds to initialize
✗ Phase 3: Tool discovery failed
  → Error: Connection refused on port 8005
  → Hint: Backend service may not be starting

You can also debug a directory (builds first) or stop at a specific phase:

hud debug .                    # Build and debug current directory
hud debug . --max-phase 3      # Stop after phase 3
hud debug --config mcp.json    # Debug from config file

Scenario MCP Protocol Mapping

Understanding how scenarios map to MCP is crucial for debugging. Each scenario registers two MCP endpoints:

Phase	MCP Type	Endpoint	What it does
Setup	Prompt	`get_prompt("{env}:{scenario}", args)`	Runs code before first `yield`, returns the prompt
Evaluate	Resource	`read_resource("{env}:{scenario}")`	Runs code after first `yield`, returns `{"reward": float}`

Debug with raw MCP calls

If a scenario isn’t working, test each phase directly:

async with env:
    # Phase 1: Setup (runs code before first yield)
    prompt_result = await env.get_prompt(
        "myenv:checkout", 
        {"product": "laptop", "user_id": "alice"}
    )
    print(f"Prompt: {prompt_result.messages[0].content}")
    
    # ... agent runs here ...
    
    # Phase 2: Submit answer (stores it for evaluation)
    await env.submit("checkout", answer="Order completed successfully")
    
    # Phase 3: Evaluate (runs code after first yield)
    resource_result = await env.read_resource("myenv:checkout")
    print(f"Reward: {resource_result}")  # {"reward": 1.0}

Common debugging scenarios

Problem: evaluate_tool: NULL but using v5 scenarios

Cause: v5 scenarios don’t use evaluate_tool—they return rewards via read_resource
Fix: Ensure your orchestrator calls read_resource() after agent completion

Problem: TypeError when evaluating with complex args like list[dict]

Cause: MCP passes all arguments as strings; SDK deserializes them
Debug: Add logging to check type(arg) at scenario entry

Problem: Scenario setup works but evaluate returns no reward

Cause: submit() wasn’t called before read_resource()
Fix: Call await env.submit(scenario_name, answer) first

Useful Environment Properties

# Check parallelization (for running multiple evals)
env.is_parallelizable  # True if all connections are remote

# List what's connected
env.connections        # Dict of connection names → connectors
env.is_connected       # True if in async context

# Resources and prompts (beyond tools)
await env.list_resources()  # MCP resources
await env.list_prompts()    # MCP prompts

Sandboxing

Make databases and services safe for testing

hud debug CLI

Full debug command reference

Testing Environments

Local Testing

Variants

Groups

Mock Mode

Testing Scenarios Directly

Hot-Reload

Debugging Build Failures

Scenario MCP Protocol Mapping

Debug with raw MCP calls

Common debugging scenarios

Useful Environment Properties

See Also

Sandboxing

hud debug CLI

Documentation Index

​Local Testing

​Variants

​Groups

​Mock Mode

​Testing Scenarios Directly

​Hot-Reload

​Debugging Build Failures

​Scenario MCP Protocol Mapping

​Debug with raw MCP calls

​Common debugging scenarios

​Useful Environment Properties

​See Also

Sandboxing

hud debug CLI

Local Testing

Variants

Groups

Mock Mode

Testing Scenarios Directly

Hot-Reload

Debugging Build Failures

Scenario MCP Protocol Mapping

Debug with raw MCP calls

Common debugging scenarios

Useful Environment Properties

See Also