feat(host-agent): Phase 1a process supervision — instance start/stop/restart/status + push state events

Per-instance ProcessSupervisor: tokio child spawn with proper arg list
(fixes Go's naive space-splitting), graceful SIGTERM with 30s budget
then force kill, monitor task classifying ordered-stop vs crash (exit
code captured), watch-channel state observable everywhere. Instance cmd
channel live on corrosion.{license}.{instance}.cmd (start/stop/restart/
status) with state events pushed on {instance}.status (keep-latest
semantics, documented). Heartbeats now carry live process state +
uptime per instance. Crate restructured lib+bin for integration tests.

Verified: 5 integration tests with real OS processes (lifecycle, crash
exit-code, restart recovery, unmanaged rejection, clean spawn failure)
+ live-NATS contract test (request-reply roundtrips, double-start
rejection, push events, heartbeat state) — all green.

Known limitation (documented): no PID adoption yet — agent restart
orphans a running game process to 'stopped' until panel restart.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This commit is contained in:
Vantz Stockwell
2026-06-11 10:44:24 -04:00
parent f706c3c47e
commit 068a476f39
13 changed files with 669 additions and 44 deletions

View File

@@ -1,8 +1,9 @@
# Corrosion Wire Protocol v2
Status: **Phase 0 implemented** (host heartbeat, host commands, going-offline
beacon). Per-instance command/status subjects are reserved and specified here
for Phase 1.
Status: **Phase 0 + Phase 1 process control implemented** (host heartbeat,
host commands, going-offline beacon, per-instance start/stop/restart/status
with push state events). RCON, SteamCMD, file ops, and game adapters are
specified but not yet implemented.
## Design
@@ -70,9 +71,10 @@ All telemetry is measured, never fabricated. Fields the agent cannot measure
are omitted (`probe` before the first probe completes, `hostname` if
unavailable).
Phase 0 instance `state` values: `configured` (root path exists),
`missing_root`. Phase 1 adds live process states: `running`, `stopped`,
`crashed`, `starting`, `updating`.
Instance `state` values — process-managed (an `executable` is configured):
`running`, `stopped`, `starting`, `stopping`, `crashed`; unmanaged
(telemetry-only): `configured` (root exists), `missing_root`. Each instance
also reports `uptime_seconds` (0 unless running).
### `corrosion.{license_id}.host.cmd` (backend → agent, request-reply)
@@ -92,19 +94,35 @@ Best-effort beacon (500ms budget) on graceful shutdown so the panel can flip
the host to offline immediately instead of waiting out heartbeat staleness.
Payload: `{}`.
## Instance-level subjects (Phase 1 — reserved, not yet implemented)
## Instance-level subjects
### `corrosion.{license_id}.{instance_id}.cmd` (backend → agent, request-reply)
### `corrosion.{license_id}.{instance_id}.cmd` (backend → agent, request-reply) — LIVE
Lifecycle and control for one game instance. Planned funcs: `start`, `stop`,
`restart`, `status`, `rcon` (process-class games), `steam_update`,
`oxide_install` (rust), plus game-adapter-specific commands (Dune: docker
lifecycle, RabbitMQ bus commands, Coriolis reset).
Lifecycle and control for one game instance.
### `corrosion.{license_id}.{instance_id}.status` (agent → backend, publish)
Implemented funcs: `start`, `stop` (graceful with 30s budget, then force
kill), `restart`, `status` (returns `state` + `uptime_seconds`). Errors reply
`{ "status": "error", "message": ... }` — including start on an unmanaged
instance, double start, and unknown funcs.
State-change events (started/stopped/crashed) so the panel does not wait for
the next heartbeat.
Planned funcs: `rcon` (process-class games), `steam_update`, `oxide_install`
(rust), plus game-adapter-specific commands (Dune: docker lifecycle, RabbitMQ
bus commands, Coriolis reset).
### `corrosion.{license_id}.{instance_id}.status` (agent → backend, publish) — LIVE
State-change events so the panel does not wait for the next heartbeat.
Payload: `{ "timestamp", "instance_id", "event": { "state": ..., "exit_code"? } }`.
Semantics: **keep-latest state sync**, not a lossless transition ledger —
near-instant transient states (e.g. `starting` when spawn succeeds
immediately) may coalesce into the following state. Consumers should treat
each event as "current state is now X".
Known Phase 1 limitation: the supervisor does not yet persist/adopt PIDs — if
the agent itself restarts while a game server is running, the game process
survives but reports `stopped` until restarted through the panel. PID
adoption is queued with the service-install work.
### `corrosion.{license_id}.{instance_id}.console` (agent → backend, publish)