requestId Correlation Runbook

Use this runbook to trace a failing workflow across controller, relay, and extension node by correlating on requestId. A single request ID spans all three components and is the most reliable way to isolate failure layer.

Before you start

Otto relay running and accessible.
Local dev log streaming enabled in extension options (for extension-originated events).

Procedure

Step 1: Start live log capture

otto logs follow --source all

Leave this running in a separate terminal before reproducing the failure.

Step 2: Reproduce the failure

Run the failing command. Use --json for machine-readable output:

otto test reddit.com getChatMessages --json 2>&1 | tee /tmp/otto-test-output.txt

Capture the requestId from the output (look for "requestId": "req_..." in the result or error envelope).

Step 3: Query scoped log timeline

# Full timeline for this request across all sources
otto logs list --request-id <requestId> --source all

# Extension-side events only
otto logs list --source node --latest 300

Step 4: Classify the failure layer

Layer	Signals to look for
Controller	Request never left controller; token error before WebSocket auth
Relay	`auth`, routing, ACL, or lock event in relay logs before node receives command
Node	Command received by node; execution, site mismatch, or auth preflight failure
Stream	Subscribe succeeded but no `listener_update` events; or teardown missing

Stream path checks

For stream failures specifically:

Capture the subscribe requestId (separate from the stream command requestId).
Verify listener_update events correlate to the subscribe requestId.
Verify teardown was explicit: listener.unsubscribe or command_cancel.

Quick reference commands

# Live follow all sources
otto logs follow --source all

# Live follow extension-side only
otto logs follow --source node

# Bounded node evidence
otto logs list --source node --latest 300

# Stream test with follow window
otto test reddit.com getChatMessages --stream-follow-ms 45000 --json

Incident summary template

Fill this in before escalating:

requestId:         req_...
action:            command.run / command.test
site + command:    reddit.com / getChatMessages
targetNodeId:      node_local_1
tabSessionId:      ts_...
failing layer:     relay / node / stream
error code:        tab_busy / acl_missing_node_grant / ...
retryable:         yes / no
immediate fix:     retry with backoff / re-pair / refresh tokens
prevention:        add bounded wait policy / fix token refresh lifecycle

Next steps

Advanced Troubleshooting — error-to-action table.
Controller Troubleshooting Decision Tree — layer-by-layer decision path.
Error Codes — full error code reference.

Before you start​

Procedure​

Step 1: Start live log capture​

Step 2: Reproduce the failure​

Step 3: Query scoped log timeline​

Step 4: Classify the failure layer​

Stream path checks​

Quick reference commands​

Incident summary template​

Next steps​