feat(host-agent): Phase 1a process supervision — instance start/stop/restart/status + push state events
Per-instance ProcessSupervisor: tokio child spawn with proper arg list
(fixes Go's naive space-splitting), graceful SIGTERM with 30s budget
then force kill, monitor task classifying ordered-stop vs crash (exit
code captured), watch-channel state observable everywhere. Instance cmd
channel live on corrosion.{license}.{instance}.cmd (start/stop/restart/
status) with state events pushed on {instance}.status (keep-latest
semantics, documented). Heartbeats now carry live process state +
uptime per instance. Crate restructured lib+bin for integration tests.
Verified: 5 integration tests with real OS processes (lifecycle, crash
exit-code, restart recovery, unmanaged rejection, clean spawn failure)
+ live-NATS contract test (request-reply roundtrips, double-start
rejection, push events, heartbeat state) — all green.
Known limitation (documented): no PID adoption yet — agent restart
orphans a running game process to 'stopped' until panel restart.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This commit is contained in:
@@ -1,8 +1,9 @@
|
||||
# Corrosion Wire Protocol v2
|
||||
|
||||
Status: **Phase 0 implemented** (host heartbeat, host commands, going-offline
|
||||
beacon). Per-instance command/status subjects are reserved and specified here
|
||||
for Phase 1.
|
||||
Status: **Phase 0 + Phase 1 process control implemented** (host heartbeat,
|
||||
host commands, going-offline beacon, per-instance start/stop/restart/status
|
||||
with push state events). RCON, SteamCMD, file ops, and game adapters are
|
||||
specified but not yet implemented.
|
||||
|
||||
## Design
|
||||
|
||||
@@ -70,9 +71,10 @@ All telemetry is measured, never fabricated. Fields the agent cannot measure
|
||||
are omitted (`probe` before the first probe completes, `hostname` if
|
||||
unavailable).
|
||||
|
||||
Phase 0 instance `state` values: `configured` (root path exists),
|
||||
`missing_root`. Phase 1 adds live process states: `running`, `stopped`,
|
||||
`crashed`, `starting`, `updating`.
|
||||
Instance `state` values — process-managed (an `executable` is configured):
|
||||
`running`, `stopped`, `starting`, `stopping`, `crashed`; unmanaged
|
||||
(telemetry-only): `configured` (root exists), `missing_root`. Each instance
|
||||
also reports `uptime_seconds` (0 unless running).
|
||||
|
||||
### `corrosion.{license_id}.host.cmd` (backend → agent, request-reply)
|
||||
|
||||
@@ -92,19 +94,35 @@ Best-effort beacon (500ms budget) on graceful shutdown so the panel can flip
|
||||
the host to offline immediately instead of waiting out heartbeat staleness.
|
||||
Payload: `{}`.
|
||||
|
||||
## Instance-level subjects (Phase 1 — reserved, not yet implemented)
|
||||
## Instance-level subjects
|
||||
|
||||
### `corrosion.{license_id}.{instance_id}.cmd` (backend → agent, request-reply)
|
||||
### `corrosion.{license_id}.{instance_id}.cmd` (backend → agent, request-reply) — LIVE
|
||||
|
||||
Lifecycle and control for one game instance. Planned funcs: `start`, `stop`,
|
||||
`restart`, `status`, `rcon` (process-class games), `steam_update`,
|
||||
`oxide_install` (rust), plus game-adapter-specific commands (Dune: docker
|
||||
lifecycle, RabbitMQ bus commands, Coriolis reset).
|
||||
Lifecycle and control for one game instance.
|
||||
|
||||
### `corrosion.{license_id}.{instance_id}.status` (agent → backend, publish)
|
||||
Implemented funcs: `start`, `stop` (graceful with 30s budget, then force
|
||||
kill), `restart`, `status` (returns `state` + `uptime_seconds`). Errors reply
|
||||
`{ "status": "error", "message": ... }` — including start on an unmanaged
|
||||
instance, double start, and unknown funcs.
|
||||
|
||||
State-change events (started/stopped/crashed) so the panel does not wait for
|
||||
the next heartbeat.
|
||||
Planned funcs: `rcon` (process-class games), `steam_update`, `oxide_install`
|
||||
(rust), plus game-adapter-specific commands (Dune: docker lifecycle, RabbitMQ
|
||||
bus commands, Coriolis reset).
|
||||
|
||||
### `corrosion.{license_id}.{instance_id}.status` (agent → backend, publish) — LIVE
|
||||
|
||||
State-change events so the panel does not wait for the next heartbeat.
|
||||
Payload: `{ "timestamp", "instance_id", "event": { "state": ..., "exit_code"? } }`.
|
||||
|
||||
Semantics: **keep-latest state sync**, not a lossless transition ledger —
|
||||
near-instant transient states (e.g. `starting` when spawn succeeds
|
||||
immediately) may coalesce into the following state. Consumers should treat
|
||||
each event as "current state is now X".
|
||||
|
||||
Known Phase 1 limitation: the supervisor does not yet persist/adopt PIDs — if
|
||||
the agent itself restarts while a game server is running, the game process
|
||||
survives but reports `stopped` until restarted through the panel. PID
|
||||
adoption is queued with the service-install work.
|
||||
|
||||
### `corrosion.{license_id}.{instance_id}.console` (agent → backend, publish)
|
||||
|
||||
|
||||
Reference in New Issue
Block a user