Before deploying, test locally. This page covers local testing patterns, A/B testing with variants and groups, mock mode, and debugging.
Local Testing
| Environment | local_test.py |
| --- | --- |
| No Docker | from env import env |
| Docker | env.connect_url("http://localhost:8765/mcp") |
Both use the same API after setup:
async with env:
    tools = env.as_tools()  # List available tools
    result = await env.call_tool("my_tool", arg="val")  # Call a tool
Variants
LLM outputs vary from run to run: ask the same question twice and you may get answers of different quality. Variants let you test different configurations side by side:
import hud

# `client` is assumed to be an OpenAI-compatible async client defined elsewhere
async with hud.eval(task, variants={"model": ["gpt-4o", "claude-sonnet-4-5"]}) as ctx:
    response = await client.chat.completions.create(
        model=ctx.variants["model"],  # Current variant
        messages=[{"role": "user", "content": ctx.prompt}],
    )
    ctx.reward = 1.0 if "correct" in response.choices[0].message.content else 0.0

for result in ctx.results:
    print(f"{result.variants}: reward={result.reward}")
Lists expand to all combinations:
variants = {
    "model": ["gpt-4o", "claude"],
    "temperature": [0.0, 0.7],
}
# Creates 4 combinations: gpt-4o+0.0, gpt-4o+0.7, claude+0.0, claude+0.7
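The expansion described above is a Cartesian product over the option lists. The following is a minimal sketch of that idea using only the standard library; `expand_variants` is a hypothetical helper written for illustration and is not claimed to be how hud implements variants internally.

```python
from itertools import product


def expand_variants(variants: dict) -> list[dict]:
    """Expand a dict of option lists into one dict per combination.

    Hypothetical helper for illustration; not part of the hud SDK.
    """
    keys = list(variants)
    # product() varies the last key fastest, so the order matches the
    # listing above: gpt-4o+0.0, gpt-4o+0.7, claude+0.0, claude+0.7.
    return [dict(zip(keys, combo)) for combo in product(*variants.values())]


combos = expand_variants({
    "model": ["gpt-4o", "claude"],
    "temperature": [0.0, 0.7],
})
for combo in combos:
    print(combo)
```

Adding a third key with two options would double the list to eight entries, which is worth keeping in mind before combining variants with a large group size.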
Groups
Run each variant multiple times to see the distribution, not just one lucky or unlucky result:
async with hud.eval(
    task,
    variants={"model": ["gpt-4o", "claude-sonnet-4-5"]},
    group=5,  # 10 runs total: 2 models × 5 each
) as ctx:
    ...
The hud.eval manager parallelizes automatically. Total runs = len(tasks) × len(variant_combinations) × group.
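To make the run-count formula and the idea of a per-variant distribution concrete, here is a small standard-library sketch. `total_runs` and `summarize` are hypothetical helpers, the input is modeled as plain `(variants, reward)` pairs rather than real hud result objects, and the reward values are placeholders invented for the example, not measurements.

```python
from collections import defaultdict
from statistics import mean, pstdev


def total_runs(n_tasks: int, n_combinations: int, group: int) -> int:
    """Total runs = tasks × variant combinations × group size."""
    return n_tasks * n_combinations * group


def summarize(results):
    """Group (variants, reward) pairs by variant and report count, mean, spread."""
    buckets = defaultdict(list)
    for variants, reward in results:
        # Dicts are unhashable, so key each bucket by a sorted tuple of items.
        buckets[tuple(sorted(variants.items()))].append(reward)
    return {
        key: {"n": len(rewards), "mean": mean(rewards), "stdev": pstdev(rewards)}
        for key, rewards in buckets.items()
    }


# Placeholder rewards for illustration only; not real evaluation results.
sample = [
    ({"model": "model-a"}, 1.0),
    ({"model": "model-a"}, 0.0),
    ({"model": "model-b"}, 1.0),
    ({"model": "model-b"}, 1.0),
]
summary = summarize(sample)
runs = total_runs(n_tasks=1, n_combinations=2, group=5)
```

With real runs, the same grouping could be fed from `result.variants` and `result.reward` as printed in the Variants example.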
Mock Mode
env.mock() intercepts at the tool layer. Agents only see tools, so this is usually all you need for testing agent logic without hitting real services:
env.mock()  # All tools return schema-based fake responses
env.mock_tool("send_email", {"status": "sent", "id": "mock-123"})
env.mock_tool("charge_card", {"success": True, "transaction_id": "tx-mock"})

# Check mock state
assert env.is_mock
For stateful mocking (tracking what happened for assertions):
class MockPaymentService:
    def __init__(self):
        self.charges = []

    async def charge(self, amount: int, card_token: str) -> dict:
        self.charges.append({"amount": amount, "token": card_token})
        return {"success": True, "id": f"ch-{len(self.charges)}"}

payments = MockPaymentService()

@env.scenario("checkout")
async def checkout(cart_total: int):
    _ = yield f"Complete checkout for ${cart_total}"
    yield 1.0 if any(c["amount"] == cart_total for c in payments.charges) else 0.0
Your agent code stays the same—toggle env.mock() for testing.
Testing Scenarios Directly
Scenarios are async generators. hud.eval() drives them automatically, but you can test the logic directly:
async def checkout(user_id: str, amount: int = 100):
    # Setup + prompt (first yield)
    answer = yield f"Complete checkout for {user_id}, ${amount}"
    # Evaluation (second yield)
    yield 1.0 if "success" in answer.lower() else 0.0

async def test():
    gen = checkout("alice", 50)
    prompt = await anext(gen)  # What hud.eval() does at start
    reward = await gen.asend("Success!")  # What hud.eval() does after submit
    assert reward == 1.0
If your scenario tests pass, hud.eval() will behave identically.
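For a version that runs as a plain script without a test runner, the same two-step drive can be wrapped in a small helper and executed with `asyncio.run`. This is a sketch: `run_scenario` is a hypothetical helper, and it uses `gen.__anext__()` so it also works on Python versions older than 3.10, where the `anext()` builtin is unavailable.

```python
import asyncio


async def checkout(user_id: str, amount: int = 100):
    # Setup + prompt (first yield)
    answer = yield f"Complete checkout for {user_id}, ${amount}"
    # Evaluation (second yield)
    yield 1.0 if "success" in answer.lower() else 0.0


async def run_scenario(gen, answer: str):
    """Drive a two-yield scenario: fetch the prompt, send the answer, get the reward."""
    prompt = await gen.__anext__()    # equivalent to anext(gen) on Python 3.10+
    reward = await gen.asend(answer)  # resumes after the first yield
    await gen.aclose()                # finalize the generator explicitly
    return prompt, reward


prompt, reward = asyncio.run(run_scenario(checkout("alice", 50), "Success!"))
_, failed_reward = asyncio.run(run_scenario(checkout("bob"), "Card declined"))
print(prompt, reward, failed_reward)
```

Covering at least one answer that should score 0.0 guards against an evaluation branch that accidentally always passes.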
Hot-Reload
For Docker environments, hud dev -w path reloads Python on save:
hud dev -w scenarios -w tools --port 8765
System services (postgres, VNC, browsers) persist across reloads.
Debugging Build Failures
hud build runs the exact same pipeline as New → Environment on hud.ai, so if it passes locally, it will work in production. If the build fails or the container crashes on startup, run hud debug.
The output shows exactly which phase failed:
✓ Phase 1: Docker image exists
✓ Phase 2: MCP server responds to initialize
✗ Phase 3: Tool discovery failed
→ Error: Connection refused on port 8005
→ Hint: Backend service may not be starting
You can also debug a directory (builds first) or stop at a specific phase:
hud debug . # Build and debug current directory
hud debug . --max-phase 3 # Stop after phase 3
hud debug --config mcp.json # Debug from config file
Scenario MCP Protocol Mapping
Understanding how scenarios map to MCP is crucial for debugging. Each scenario registers two MCP endpoints:

| Phase | MCP Type | Endpoint | What it does |
| --- | --- | --- | --- |
| Setup | Prompt | get_prompt("{env}:{scenario}", args) | Runs code before first yield, returns the prompt |
| Evaluate | Resource | read_resource("{env}:{scenario}") | Runs code after first yield, returns {"reward": float} |
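The mapping above can be pictured as a small object that holds the scenario's async generator between the two calls. The sketch below is a simplified, hypothetical model (`ScenarioSession` is not an SDK class, and the real server may behave differently, for example in how it reports a missing answer); it only shows why setup, submit, and evaluate have to happen in that order.

```python
import asyncio


class ScenarioSession:
    """Hypothetical stand-in for the server side of one scenario run."""

    def __init__(self, scenario_fn):
        self._scenario_fn = scenario_fn
        self._gen = None
        self._answer = None

    async def get_prompt(self, **args):
        # Setup phase: run the code before the first yield, return the prompt.
        self._gen = self._scenario_fn(**args)
        return await self._gen.__anext__()

    def submit(self, answer: str):
        # Store the agent's answer so the evaluate phase can use it.
        self._answer = answer

    async def read_resource(self):
        # Evaluate phase: resume after the first yield and wrap the reward.
        if self._gen is None:
            raise RuntimeError("get_prompt() must run before read_resource()")
        if self._answer is None:
            raise RuntimeError("submit() must be called before read_resource()")
        reward = await self._gen.asend(self._answer)
        return {"reward": float(reward)}


async def checkout(product: str, user_id: str):
    answer = yield f"Buy a {product} for {user_id}"
    yield 1.0 if "completed" in answer.lower() else 0.0


async def demo():
    session = ScenarioSession(checkout)
    prompt = await session.get_prompt(product="laptop", user_id="alice")
    session.submit("Order completed successfully")
    return prompt, await session.read_resource()


prompt, result = asyncio.run(demo())
print(prompt, result)
```

In this model, skipping `submit()` raises immediately; that strictness is a choice made for the sketch so the ordering mistake is obvious.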
Debug with raw MCP calls
If a scenario isn’t working, test each phase directly:
async with env:
    # Phase 1: Setup (runs code before first yield)
    prompt_result = await env.get_prompt(
        "myenv:checkout",
        {"product": "laptop", "user_id": "alice"},
    )
    print(f"Prompt: {prompt_result.messages[0].content}")

    # ... agent runs here ...

    # Phase 2: Submit answer (stores it for evaluation)
    await env.submit("checkout", answer="Order completed successfully")

    # Phase 3: Evaluate (runs code after first yield)
    resource_result = await env.read_resource("myenv:checkout")
    print(f"Reward: {resource_result}")  # {"reward": 1.0}
Common debugging scenarios
Problem: evaluate_tool: NULL but using v5 scenarios
Cause: v5 scenarios don’t use evaluate_tool—they return rewards via read_resource
Fix: Ensure your orchestrator calls read_resource() after agent completion
Problem: TypeError when evaluating with complex args like list[dict]
Cause: MCP passes all arguments as strings; SDK deserializes them
Debug: Add logging to check type(arg) at scenario entry
Problem: Scenario setup works but evaluate returns no reward
Cause: submit() wasn’t called before read_resource()
Fix: Call await env.submit(scenario_name, answer) first
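For the TypeError case above, logging the type of each argument at scenario entry usually shows whether a value arrived as a string or as the structured type you expected. The helper below is a hypothetical debugging aid, not the SDK's own deserialization logic: it logs the incoming type and decodes only strings that parse as a JSON list or object.

```python
import json
import logging

logger = logging.getLogger("scenario_args")


def coerce_arg(name: str, value):
    """Log an argument's type and decode JSON-encoded lists/objects.

    Hypothetical debugging helper; the SDK performs its own deserialization.
    """
    logger.debug("arg %s arrived as %s", name, type(value).__name__)
    if not isinstance(value, str):
        return value
    try:
        decoded = json.loads(value)
    except json.JSONDecodeError:
        return value  # an ordinary string, keep it unchanged
    # Only accept containers, so a string such as "123" stays a string.
    if isinstance(decoded, (list, dict)):
        logger.debug("arg %s decoded to %s", name, type(decoded).__name__)
        return decoded
    return value


items = coerce_arg("items", '[{"sku": "laptop", "qty": 1}]')
user_id = coerce_arg("user_id", "alice")
zip_code = coerce_arg("zip_code", "123")
```

Enable the output with `logging.basicConfig(level=logging.DEBUG)` while debugging, and remove the helper once the argument types are confirmed.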
Useful Environment Properties
# Check parallelization (for running multiple evals)
env.is_parallelizable # True if all connections are remote
# List what's connected
env.connections # Dict of connection names → connectors
env.is_connected # True if in async context
# Resources and prompts (beyond tools)
await env.list_resources() # MCP resources
await env.list_prompts() # MCP prompts
See Also
Sandboxing: Make databases and services safe for testing
hud debug CLI: Full debug command reference