feat(host-agent): Phase 2 — Dune docker-compose adapter via Supervisor trait
Some checks failed
CI / backend-types (push) Successful in 9s
CI / frontend-build (push) Successful in 15s
CI / agent-tests (push) Failing after 35s
CI / integration (push) Has been skipped
Build Host Agent (Rust) / build (push) Successful in 1m45s

Introduce a Supervisor trait (async-trait) so the agent manages games with
different models behind one wire contract. ProcessSupervisor (spawned process:
rust/conan/soulmask) and the new DockerComposeSupervisor (dune) both impl it;
Agent.supervisors is now HashMap<String, Arc<dyn Supervisor>> and instancecmd
dispatch is game-agnostic — start/stop/restart/status identical across games,
selected by a per-game factory in main. InstanceState moved to the shared
supervisor module.
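
To illustrate the shape of this change, here is a minimal, std-only sketch of a per-game factory feeding a game-agnostic dispatch. It is a simplification under stated assumptions, not the agent's code: the real trait is async (via async-trait) and carries more methods, and the helper names `make_supervisor`, `build_supervisors`, and `dispatch` are hypothetical.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Simplified, synchronous stand-in for the commit's async `Supervisor` trait.
trait Supervisor: Send + Sync {
    fn instance_id(&self) -> &str;
    fn start(&self) -> Result<&'static str, String>;
}

struct ProcessSupervisor {
    id: String,
}

struct DockerComposeSupervisor {
    id: String,
}

impl Supervisor for ProcessSupervisor {
    fn instance_id(&self) -> &str {
        &self.id
    }
    fn start(&self) -> Result<&'static str, String> {
        Ok("spawned process") // placeholder for spawning the game binary
    }
}

impl Supervisor for DockerComposeSupervisor {
    fn instance_id(&self) -> &str {
        &self.id
    }
    fn start(&self) -> Result<&'static str, String> {
        Ok("compose up -d") // placeholder for invoking the compose binary
    }
}

/// Hypothetical per-game factory: the only place that knows which impl backs a game.
fn make_supervisor(game: &str, id: &str) -> Arc<dyn Supervisor> {
    match game {
        "dune" => Arc::new(DockerComposeSupervisor { id: id.to_string() }),
        _ => Arc::new(ProcessSupervisor { id: id.to_string() }),
    }
}

/// Build the id -> supervisor map from (instance id, game) pairs.
fn build_supervisors(instances: &[(&str, &str)]) -> HashMap<String, Arc<dyn Supervisor>> {
    instances
        .iter()
        .map(|&(id, game)| (id.to_string(), make_supervisor(game, id)))
        .collect()
}

/// Game-agnostic dispatch: it only ever sees the trait object.
fn dispatch(
    sups: &HashMap<String, Arc<dyn Supervisor>>,
    id: &str,
    func: &str,
) -> Result<&'static str, String> {
    let sup = sups
        .get(id)
        .ok_or_else(|| format!("unknown instance '{id}'"))?;
    match func {
        "start" => sup.start(),
        other => Err(format!("unsupported func '{other}'")),
    }
}
```

Under this design, another management plane would need one new struct with its own impl plus one arm in the factory; `dispatch` itself stays untouched.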

DockerComposeSupervisor drives docker compose up -d / stop / restart against
the instance's compose project, with -f/-p/single-service support and a
configurable compose binary. New [instance.docker_compose] config block.
First cut = lifecycle + cached state; container crash-detection + restart
adoption deferred to Phase 3b (reconcilable with a compose ps probe).
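
The argument order the supervisor uses (binary and leading args, then the global -f/-p flags, then the subcommand and its args, then the optional single service) can be sketched as a pure function. This is an illustration only: `compose_argv` is a hypothetical helper, and the actual code builds a tokio `Command` directly instead of returning a vector.

```rust
/// Hypothetical helper: assemble the compose command line in the order
/// binary + leading args, `-f <file>` (if set), `-p <project>`,
/// subcommand, subcommand args, optional service.
fn compose_argv(
    command: &[&str],
    file: Option<&str>,
    project: &str,
    action: &str,
    action_args: &[&str],
    service: Option<&str>,
) -> Vec<String> {
    let mut argv: Vec<String> = command.iter().map(|s| s.to_string()).collect();
    // Global flags must come before the subcommand.
    if let Some(f) = file {
        argv.push("-f".to_string());
        argv.push(f.to_string());
    }
    argv.push("-p".to_string());
    argv.push(project.to_string());
    // Subcommand and its own arguments.
    argv.push(action.to_string());
    argv.extend(action_args.iter().map(|s| s.to_string()));
    // The optional single service goes last.
    if let Some(s) = service {
        argv.push(s.to_string());
    }
    argv
}
```

With the defaults (no file, project equal to the instance id, no service), a start on a hypothetical instance `dune-1` would come out as `docker compose -p dune-1 up -d`.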

Trait choice (dyn over enum) per Commander: scales to future planes (kubectl,
AMP/podman, SSH) as new struct+impl, no central match.

56 tests green (6 new docker-compose mock-binary tests + 5 refactored process
tests), zero warnings. Live verification pending a real Dune stack.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Vantz Stockwell
2026-06-11 21:32:25 -04:00
parent 651a35d4be
commit d13f2cb8b1
17 changed files with 679 additions and 166 deletions


@@ -7,16 +7,17 @@ use tokio::sync::RwLock;
use tokio_util::sync::CancellationToken;
use crate::config::Settings;
-use crate::process::ProcessSupervisor;
use crate::prober::ProbeReport;
+use crate::supervisor::Supervisor;
pub struct Agent {
pub cfg: Settings,
pub nats: async_nats::Client,
pub started: Instant,
pub last_probe: RwLock<Option<ProbeReport>>,
-/// One supervisor per instance (unmanaged instances included — they
-/// report `unmanaged` state and reject process commands).
-pub supervisors: HashMap<String, Arc<ProcessSupervisor>>,
+/// One supervisor per instance, keyed by instance id. The concrete impl
+/// (process vs docker-compose) is chosen per game by the factory in main;
+/// every subsystem talks to the `Supervisor` trait only.
+pub supervisors: HashMap<String, Arc<dyn Supervisor>>,
pub shutdown: CancellationToken,
}


@@ -10,6 +10,7 @@ use serde::Deserialize;
use std::collections::HashSet;
use std::path::{Path, PathBuf};
+use crate::docker_compose::DockerComposeConfig;
use crate::rcon::RconConfig;
use crate::steamcmd::SteamcmdConfig;
@@ -76,6 +77,10 @@ pub struct InstanceConfig {
/// validate = false).
#[serde(default)]
pub steamcmd: Option<SteamcmdConfig>,
+/// Docker-compose settings for container-managed games (Dune). Absent =
+/// defaults apply (compose file in the instance root, project = instance id).
+#[serde(default)]
+pub docker_compose: Option<DockerComposeConfig>,
}
impl InstanceConfig {


@@ -0,0 +1,216 @@
//! Docker-compose instance supervision — the Dune: Awakening adapter.
//!
//! Dune does not ship as a SteamCMD-updated process like Rust/Conan/Soulmask;
//! it runs as Docker container(s) (game server + RabbitMQ broker + Postgres),
//! orchestrated as a compose stack (a "battlegroup"). So Dune lifecycle is
//! `docker compose up -d / stop / restart` against the instance's compose
//! project, not a spawned OS process. This supervisor implements the same
//! [`Supervisor`] trait `ProcessSupervisor` does, so the instance command
//! dispatch is identical — only the management model differs.
//!
//! Scope (first cut): lifecycle + cached state. Two parity items are deferred
//! to Phase 3b alongside process PID adoption: (1) crash detection (containers
//! give us no child handle — a `docker compose ps` poll loop would supply it);
//! (2) state adoption on agent restart (a running stack reports `stopped` until
//! the next lifecycle command). Both are reconcilable with a `ps` probe.
//!
//! Reference: docs/reference-repos/icehunter SETUP_DOCKER.md (the docker
//! control plane this mirrors).
use std::path::PathBuf;
use std::process::Stdio;
use std::sync::Arc;
use std::time::Instant;
use anyhow::{bail, Context, Result};
use serde::Deserialize;
use tokio::process::Command;
use tokio::sync::{watch, Mutex};
use crate::config::InstanceConfig;
use crate::supervisor::{InstanceState, Supervisor};
/// Per-instance docker-compose settings (`[instance.docker_compose]`). All
/// fields optional — defaults cover the common "one compose file in the
/// instance root" case.
#[derive(Debug, Clone, Default, Deserialize)]
#[serde(deny_unknown_fields)]
pub struct DockerComposeConfig {
/// Compose file (`-f`). Relative paths resolve against the run dir. Default:
/// compose's own discovery (docker-compose.yml in the run dir).
#[serde(default)]
pub file: Option<PathBuf>,
/// Compose project name (`-p`). Default: the instance id.
#[serde(default)]
pub project: Option<String>,
/// Limit lifecycle ops to one service. Default: every service in the file.
#[serde(default)]
pub service: Option<String>,
/// Override the compose binary invocation. Default: `["docker","compose"]`.
/// Use `["docker-compose"]` for the legacy standalone binary.
#[serde(default)]
pub command: Option<Vec<String>>,
}
struct Inner {
started_at: Option<Instant>,
}
pub struct DockerComposeSupervisor {
instance_id: String,
/// Directory the compose commands run in (relative `-f`/file paths resolve
/// against it).
run_dir: PathBuf,
compose_file: Option<PathBuf>,
project: String,
service: Option<String>,
/// Compose binary + leading args, e.g. `["docker","compose"]`.
command: Vec<String>,
inner: Mutex<Inner>,
state_tx: watch::Sender<InstanceState>,
}
impl DockerComposeSupervisor {
pub fn new(cfg: &InstanceConfig) -> Arc<Self> {
let dc = cfg.docker_compose.clone().unwrap_or_default();
let run_dir = cfg
.working_dir
.clone()
.unwrap_or_else(|| cfg.root.clone());
let command = dc
.command
.filter(|c| !c.is_empty())
.unwrap_or_else(|| vec!["docker".to_string(), "compose".to_string()]);
let (state_tx, _) = watch::channel(InstanceState::Stopped);
Arc::new(Self {
instance_id: cfg.id.clone(),
run_dir,
compose_file: dc.file,
project: dc.project.unwrap_or_else(|| cfg.id.clone()),
service: dc.service,
command,
inner: Mutex::new(Inner { started_at: None }),
state_tx,
})
}
fn set_state(&self, state: InstanceState) {
let _ = self.state_tx.send_replace(state);
}
/// Run one compose subcommand (`up`/`stop`/`restart`/...), bailing with the
/// captured stderr on non-zero exit. Global flags (`-f`, `-p`) precede the
/// subcommand; the optional single service is appended last.
async fn run(&self, action: &str, action_args: &[&str]) -> Result<()> {
let mut cmd = Command::new(&self.command[0]);
cmd.args(&self.command[1..]);
if let Some(file) = &self.compose_file {
cmd.arg("-f").arg(file);
}
cmd.arg("-p").arg(&self.project);
cmd.arg(action);
cmd.args(action_args);
if let Some(service) = &self.service {
cmd.arg(service);
}
cmd.current_dir(&self.run_dir)
.stdin(Stdio::null())
.stdout(Stdio::piped())
.stderr(Stdio::piped());
let output = cmd
.output()
.await
.with_context(|| format!("running `{} {action}` (is docker installed and on PATH?)", self.command.join(" ")))?;
if !output.status.success() {
let stderr = String::from_utf8_lossy(&output.stderr);
let stdout = String::from_utf8_lossy(&output.stdout);
let detail = if !stderr.trim().is_empty() {
stderr.trim()
} else {
stdout.trim()
};
bail!("compose {action} failed ({}): {detail}", output.status);
}
Ok(())
}
}
#[async_trait::async_trait]
impl Supervisor for DockerComposeSupervisor {
fn instance_id(&self) -> &str {
&self.instance_id
}
fn state(&self) -> InstanceState {
self.state_tx.borrow().clone()
}
fn watch_state(&self) -> watch::Receiver<InstanceState> {
self.state_tx.subscribe()
}
async fn uptime_seconds(&self) -> u64 {
let inner = self.inner.lock().await;
match (&*self.state_tx.borrow(), inner.started_at) {
(InstanceState::Running, Some(t)) => t.elapsed().as_secs(),
_ => 0,
}
}
async fn start(self: Arc<Self>) -> Result<()> {
if matches!(
*self.state_tx.borrow(),
InstanceState::Running | InstanceState::Starting
) {
bail!("instance '{}' is already running", self.instance_id);
}
self.set_state(InstanceState::Starting);
match self.run("up", &["-d"]).await {
Ok(()) => {
self.inner.lock().await.started_at = Some(Instant::now());
self.set_state(InstanceState::Running);
tracing::info!("instance '{}' compose up -d", self.instance_id);
Ok(())
}
Err(e) => {
self.set_state(InstanceState::Stopped);
Err(e)
}
}
}
async fn stop(self: Arc<Self>) -> Result<()> {
self.set_state(InstanceState::Stopping);
match self.run("stop", &[]).await {
Ok(()) => {
self.inner.lock().await.started_at = None;
self.set_state(InstanceState::Stopped);
tracing::info!("instance '{}' compose stop", self.instance_id);
Ok(())
}
Err(e) => {
// Stop failed — the stack is most likely still up.
self.set_state(InstanceState::Running);
Err(e)
}
}
}
async fn restart(self: Arc<Self>) -> Result<()> {
self.set_state(InstanceState::Starting);
match self.run("restart", &[]).await {
Ok(()) => {
self.inner.lock().await.started_at = Some(Instant::now());
self.set_state(InstanceState::Running);
tracing::info!("instance '{}' compose restart", self.instance_id);
Ok(())
}
Err(e) => {
self.set_state(InstanceState::Stopped);
Err(e)
}
}
}
}


@@ -13,9 +13,9 @@ use serde_json::json;
use std::sync::Arc;
use crate::agent::Agent;
-use crate::process::ProcessSupervisor;
use crate::subjects;
use crate::steamcmd;
+use crate::supervisor::Supervisor;
#[derive(Debug, Deserialize)]
struct InstanceCommand {
@@ -26,8 +26,8 @@ struct InstanceCommand {
}
/// Forward every supervisor state change as a status event.
-pub async fn publish_state_changes(agent: Arc<Agent>, sup: Arc<ProcessSupervisor>) {
-let subject = subjects::instance_status(&agent.cfg.license_id, &sup.instance_id);
+pub async fn publish_state_changes(agent: Arc<Agent>, sup: Arc<dyn Supervisor>) {
+let subject = subjects::instance_status(&agent.cfg.license_id, sup.instance_id());
let mut rx = sup.watch_state();
let cancel = agent.shutdown.clone();
@@ -40,13 +40,13 @@ pub async fn publish_state_changes(agent: Arc<Agent>, sup: Arc<ProcessSupervisor
let state = rx.borrow().clone();
let event = json!({
"timestamp": Utc::now().to_rfc3339_opts(SecondsFormat::Secs, true),
-"instance_id": sup.instance_id,
+"instance_id": sup.instance_id(),
"event": state,
});
match serde_json::to_vec(&event) {
Ok(bytes) => {
if let Err(e) = agent.nats.publish(subject.clone(), bytes.into()).await {
-tracing::warn!("status publish failed for '{}': {e}", sup.instance_id);
+tracing::warn!("status publish failed for '{}': {e}", sup.instance_id());
}
}
Err(e) => tracing::error!("status serialize failed: {e}"),
@@ -58,8 +58,8 @@ pub async fn publish_state_changes(agent: Arc<Agent>, sup: Arc<ProcessSupervisor
}
/// Request-reply command handler for one instance.
-pub async fn run(agent: Arc<Agent>, sup: Arc<ProcessSupervisor>) -> anyhow::Result<()> {
-let subject = subjects::instance_cmd(&agent.cfg.license_id, &sup.instance_id);
+pub async fn run(agent: Arc<Agent>, sup: Arc<dyn Supervisor>) -> anyhow::Result<()> {
+let subject = subjects::instance_cmd(&agent.cfg.license_id, sup.instance_id());
let mut sub = agent.nats.subscribe(subject.clone()).await?;
tracing::info!("instance command handler listening on {subject}");
@@ -74,13 +74,13 @@ pub async fn run(agent: Arc<Agent>, sup: Arc<ProcessSupervisor>) -> anyhow::Resu
tokio::spawn(async move { handle(agent, sup, msg).await });
}
None => {
-tracing::warn!("instance command subscription ended for '{}'", sup.instance_id);
+tracing::warn!("instance command subscription ended for '{}'", sup.instance_id());
break;
}
}
}
_ = cancel.cancelled() => {
-tracing::info!("instance command handler stopping for '{}'", sup.instance_id);
+tracing::info!("instance command handler stopping for '{}'", sup.instance_id());
break;
}
}
@@ -88,7 +88,7 @@ pub async fn run(agent: Arc<Agent>, sup: Arc<ProcessSupervisor>) -> anyhow::Resu
Ok(())
}
-async fn handle(agent: Arc<Agent>, sup: Arc<ProcessSupervisor>, msg: async_nats::Message) {
+async fn handle(agent: Arc<Agent>, sup: Arc<dyn Supervisor>, msg: async_nats::Message) {
let Some(reply) = msg.reply.clone() else {
tracing::warn!("instance command without reply subject ignored");
return;
@@ -113,20 +113,22 @@ async fn handle(agent: Arc<Agent>, sup: Arc<ProcessSupervisor>, msg: async_nats:
async fn dispatch(
agent: &Arc<Agent>,
-sup: &Arc<ProcessSupervisor>,
+sup: &Arc<dyn Supervisor>,
cmd: &InstanceCommand,
) -> serde_json::Value {
let func = cmd.func.as_str();
+// start/stop/restart take `self: Arc<Self>` (they may hand a clone to a
+// monitor task), so clone the Arc before the consuming call.
let outcome = match func {
-"start" => sup.start().await.map(|_| "starting"),
-"stop" => sup.stop().await.map(|_| "stopped"),
-"restart" => sup.restart().await.map(|_| "restarted"),
+"start" => sup.clone().start().await.map(|_| "starting"),
+"stop" => sup.clone().stop().await.map(|_| "stopped"),
+"restart" => sup.clone().restart().await.map(|_| "restarted"),
"status" => {
return json!({
"status": "success",
"func": "status",
-"instance_id": sup.instance_id,
+"instance_id": sup.instance_id(),
"state": sup.state(),
"uptime_seconds": sup.uptime_seconds().await,
});
@@ -139,15 +141,15 @@ async fn dispatch(
.cfg
.instances
.iter()
-.find(|i| i.id == sup.instance_id);
+.find(|i| i.id == sup.instance_id());
let rcon_cfg = inst_cfg.and_then(|i| i.rcon.as_ref());
let Some(rcon_cfg) = rcon_cfg else {
return json!({
"status": "error",
"func": "rcon",
-"instance_id": sup.instance_id,
-"message": format!("instance '{}' has no rcon configured", sup.instance_id),
+"instance_id": sup.instance_id(),
+"message": format!("instance '{}' has no rcon configured", sup.instance_id()),
});
};
@@ -155,7 +157,7 @@ async fn dispatch(
return json!({
"status": "error",
"func": "rcon",
-"instance_id": sup.instance_id,
+"instance_id": sup.instance_id(),
"message": "rcon func requires a 'command' field",
});
};
@@ -165,13 +167,13 @@ async fn dispatch(
Ok(output) => json!({
"status": "success",
"func": "rcon",
-"instance_id": sup.instance_id,
+"instance_id": sup.instance_id(),
"output": output,
}),
Err(e) => json!({
"status": "error",
"func": "rcon",
-"instance_id": sup.instance_id,
+"instance_id": sup.instance_id(),
"message": format!("{e:#}"),
}),
};
@@ -181,14 +183,14 @@ async fn dispatch(
// settings. The supervisor only carries process-control state, not
// the full config, so we reach into agent.cfg.instances here as the
// rcon dispatch does.
-let inst_cfg = agent.cfg.instances.iter().find(|i| i.id == sup.instance_id);
+let inst_cfg = agent.cfg.instances.iter().find(|i| i.id == sup.instance_id());
let Some(inst_cfg) = inst_cfg else {
return json!({
"status": "error",
"func": "steam_update",
-"instance_id": sup.instance_id,
-"message": format!("no config found for instance '{}'", sup.instance_id),
+"instance_id": sup.instance_id(),
+"message": format!("no config found for instance '{}'", sup.instance_id()),
});
};
@@ -209,7 +211,7 @@ async fn dispatch(
};
let license = agent.cfg.license_id.clone();
-let instance_id = sup.instance_id.clone();
+let instance_id = sup.instance_id().to_string();
let nats = agent.nats.clone();
// Publish each progress line to the steam_status subject.
@@ -240,12 +242,12 @@ async fn dispatch(
Ok(()) => json!({
"status": "success",
"func": "steam_update",
-"instance_id": sup.instance_id,
+"instance_id": sup.instance_id(),
}),
Err(e) => json!({
"status": "error",
"func": "steam_update",
-"instance_id": sup.instance_id,
+"instance_id": sup.instance_id(),
"message": format!("{e:#}"),
}),
};
@@ -262,14 +264,14 @@ async fn dispatch(
Ok(result) => json!({
"status": "success",
"func": func,
-"instance_id": sup.instance_id,
+"instance_id": sup.instance_id(),
"result": result,
"state": sup.state(),
}),
Err(e) => json!({
"status": "error",
"func": func,
-"instance_id": sup.instance_id,
+"instance_id": sup.instance_id(),
"message": format!("{e:#}"),
}),
}


@@ -4,6 +4,7 @@
pub mod agent;
pub mod bus;
pub mod config;
+pub mod docker_compose;
pub mod filemanager;
pub mod hostcmd;
pub mod instancecmd;
@@ -12,6 +13,7 @@ pub mod process;
pub mod rcon;
pub mod steamcmd;
pub mod subjects;
+pub mod supervisor;
pub mod telemetry;
pub mod update;
pub mod version;


@@ -5,8 +5,8 @@
//! game adapters arrive in Phase 1+ (see PROTOCOL.md).
use corrosion_host_agent::{
-agent, bus, config, filemanager, hostcmd, instancecmd, prober, process, subjects, telemetry,
-version,
+agent, bus, config, docker_compose, filemanager, hostcmd, instancecmd, prober, process,
+subjects, supervisor, telemetry, version,
};
use anyhow::{Context, Result};
@@ -92,10 +92,20 @@ async fn run(settings: config::Settings) -> Result<()> {
let nats = bus::connect(&settings).await?;
-let supervisors = settings
+// Per-game supervisor factory: container-managed games (Dune) get a
+// docker-compose supervisor; everything else is a spawned-process
+// supervisor. Both satisfy the `Supervisor` trait, so the rest of the agent
+// is game-agnostic.
+let supervisors: std::collections::HashMap<String, Arc<dyn supervisor::Supervisor>> = settings
.instances
.iter()
-.map(|inst| (inst.id.clone(), process::ProcessSupervisor::new(inst)))
+.map(|inst| {
+let sup: Arc<dyn supervisor::Supervisor> = match inst.game.as_str() {
+"dune" => docker_compose::DockerComposeSupervisor::new(inst),
+_ => process::ProcessSupervisor::new(inst),
+};
+(inst.id.clone(), sup)
+})
.collect();
let agent = Arc::new(Agent {


@@ -1,14 +1,16 @@
//! Per-instance game-server process supervision.
//!
-//! One `ProcessSupervisor` per process-managed instance. Lifecycle mirrors the
-//! proven Go agent behavior — graceful SIGTERM with a 30s budget before force
-//! kill, a monitor task that reaps the child and records crash-vs-stop — with
-//! two fixes the Go version needed: args are a proper list (no naive space
-//! splitting), and every state change is observable through a watch channel
-//! so the panel gets push events instead of waiting for the next heartbeat.
+//! One `ProcessSupervisor` per process-managed instance (Rust/Conan/Soulmask).
+//! Lifecycle mirrors the proven Go agent behavior — graceful SIGTERM with a 30s
+//! budget before force kill, a monitor task that reaps the child and records
+//! crash-vs-stop — with two fixes the Go version needed: args are a proper list
+//! (no naive space splitting), and every state change is observable through a
+//! watch channel so the panel gets push events instead of waiting for the next
+//! heartbeat. Lifecycle control is exposed through the [`Supervisor`] trait so
+//! the command dispatch is identical across process- and container-managed
+//! games.
use anyhow::{bail, Context, Result};
-use serde::Serialize;
use std::path::PathBuf;
use std::process::Stdio;
use std::sync::Arc;
@@ -17,39 +19,11 @@ use tokio::process::{Child, Command};
use tokio::sync::{watch, Mutex};
use crate::config::InstanceConfig;
+use crate::supervisor::{InstanceState, Supervisor};
const GRACEFUL_STOP_BUDGET: Duration = Duration::from_secs(30);
const RESTART_PAUSE: Duration = Duration::from_secs(2);
-#[derive(Debug, Clone, PartialEq, Serialize)]
-#[serde(rename_all = "snake_case", tag = "state")]
-pub enum InstanceState {
-/// Not process-managed (no executable configured).
-Unmanaged,
-Stopped,
-Starting,
-Running,
-Stopping,
-/// Process exited without a stop request.
-Crashed {
-#[serde(skip_serializing_if = "Option::is_none")]
-exit_code: Option<i32>,
-},
-}
-impl InstanceState {
-pub fn as_label(&self) -> &'static str {
-match self {
-InstanceState::Unmanaged => "unmanaged",
-InstanceState::Stopped => "stopped",
-InstanceState::Starting => "starting",
-InstanceState::Running => "running",
-InstanceState::Stopping => "stopping",
-InstanceState::Crashed { .. } => "crashed",
-}
-}
-}
struct Inner {
child: Option<Child>,
started_at: Option<Instant>,
@@ -59,7 +33,7 @@ struct Inner {
}
pub struct ProcessSupervisor {
-pub instance_id: String,
+instance_id: String,
executable: Option<PathBuf>,
args: Vec<String>,
working_dir: Option<PathBuf>,
@@ -90,72 +64,6 @@ impl ProcessSupervisor {
})
}
-pub fn state(&self) -> InstanceState {
-self.state_tx.borrow().clone()
-}
-pub fn watch_state(&self) -> watch::Receiver<InstanceState> {
-self.state_tx.subscribe()
-}
-pub async fn uptime_seconds(&self) -> u64 {
-let inner = self.inner.lock().await;
-match (&*self.state_tx.borrow(), inner.started_at) {
-(InstanceState::Running, Some(t)) => t.elapsed().as_secs(),
-_ => 0,
-}
-}
-pub async fn start(self: &Arc<Self>) -> Result<()> {
-let Some(exe) = self.executable.clone() else {
-bail!("instance '{}' has no executable configured", self.instance_id);
-};
-if !exe.exists() {
-bail!("executable not found: {}", exe.display());
-}
-let mut inner = self.inner.lock().await;
-if matches!(*self.state_tx.borrow(), InstanceState::Running | InstanceState::Starting) {
-bail!("instance '{}' is already running", self.instance_id);
-}
-self.set_state(InstanceState::Starting);
-let workdir = self
-.working_dir
-.clone()
-.or_else(|| exe.parent().map(|p| p.to_path_buf()))
-.unwrap_or_else(|| PathBuf::from("."));
-let child = Command::new(&exe)
-.args(&self.args)
-.current_dir(&workdir)
-.stdin(Stdio::null())
-.stdout(Stdio::inherit())
-.stderr(Stdio::inherit())
-.spawn()
-.with_context(|| format!("spawning {}", exe.display()))?;
-let pid = child.id();
-inner.child = Some(child);
-inner.started_at = Some(Instant::now());
-inner.stop_requested = false;
-drop(inner);
-self.set_state(InstanceState::Running);
-tracing::info!(
-"instance '{}' started: {} (pid {:?})",
-self.instance_id,
-exe.display(),
-pid
-);
-// Monitor: reap the child and classify the exit.
-let sup = Arc::clone(self);
-tokio::spawn(async move { sup.monitor().await });
-Ok(())
-}
async fn monitor(self: Arc<Self>) {
// Take a waiter without holding the lock across the whole child
// lifetime: Child::wait needs &mut, so the child stays in inner and
@@ -201,7 +109,85 @@ impl ProcessSupervisor {
}
}
-pub async fn stop(self: &Arc<Self>) -> Result<()> {
+fn set_state(&self, state: InstanceState) {
+// send_replace never fails even with zero receivers.
+let _ = self.state_tx.send_replace(state);
+}
+}
+#[async_trait::async_trait]
+impl Supervisor for ProcessSupervisor {
+fn instance_id(&self) -> &str {
+&self.instance_id
+}
+fn state(&self) -> InstanceState {
+self.state_tx.borrow().clone()
+}
+fn watch_state(&self) -> watch::Receiver<InstanceState> {
+self.state_tx.subscribe()
+}
+async fn uptime_seconds(&self) -> u64 {
+let inner = self.inner.lock().await;
+match (&*self.state_tx.borrow(), inner.started_at) {
+(InstanceState::Running, Some(t)) => t.elapsed().as_secs(),
+_ => 0,
+}
+}
+async fn start(self: Arc<Self>) -> Result<()> {
+let Some(exe) = self.executable.clone() else {
+bail!("instance '{}' has no executable configured", self.instance_id);
+};
+if !exe.exists() {
+bail!("executable not found: {}", exe.display());
+}
+let mut inner = self.inner.lock().await;
+if matches!(*self.state_tx.borrow(), InstanceState::Running | InstanceState::Starting) {
+bail!("instance '{}' is already running", self.instance_id);
+}
+self.set_state(InstanceState::Starting);
+let workdir = self
+.working_dir
+.clone()
+.or_else(|| exe.parent().map(|p| p.to_path_buf()))
+.unwrap_or_else(|| PathBuf::from("."));
+let child = Command::new(&exe)
+.args(&self.args)
+.current_dir(&workdir)
+.stdin(Stdio::null())
+.stdout(Stdio::inherit())
+.stderr(Stdio::inherit())
+.spawn()
+.with_context(|| format!("spawning {}", exe.display()))?;
+let pid = child.id();
+inner.child = Some(child);
+inner.started_at = Some(Instant::now());
+inner.stop_requested = false;
+drop(inner);
+self.set_state(InstanceState::Running);
+tracing::info!(
+"instance '{}' started: {} (pid {:?})",
+self.instance_id,
+exe.display(),
+pid
+);
+// Monitor: reap the child and classify the exit.
+let sup = Arc::clone(&self);
+tokio::spawn(async move { sup.monitor().await });
+Ok(())
+}
+async fn stop(self: Arc<Self>) -> Result<()> {
let mut inner = self.inner.lock().await;
if inner.child.is_none() {
bail!("instance '{}' is not running", self.instance_id);
@@ -263,16 +249,14 @@ impl ProcessSupervisor {
Ok(())
}
-pub async fn restart(self: &Arc<Self>) -> Result<()> {
-if !matches!(*self.state_tx.borrow(), InstanceState::Stopped | InstanceState::Crashed { .. } | InstanceState::Unmanaged) {
-self.stop().await?;
+async fn restart(self: Arc<Self>) -> Result<()> {
+if !matches!(
+*self.state_tx.borrow(),
+InstanceState::Stopped | InstanceState::Crashed { .. } | InstanceState::Unmanaged
+) {
+self.clone().stop().await?;
}
tokio::time::sleep(RESTART_PAUSE).await;
self.start().await
}
-fn set_state(&self, state: InstanceState) {
-// send_replace never fails even with zero receivers.
-let _ = self.state_tx.send_replace(state);
-}
}


@@ -0,0 +1,80 @@
//! The supervision abstraction.
//!
//! A `Supervisor` owns the lifecycle of one game instance. Different games are
//! managed in fundamentally different ways — Rust/Conan/Soulmask are spawned OS
//! processes ([`crate::process::ProcessSupervisor`]); Dune is a docker-compose
//! stack ([`crate::docker_compose::DockerComposeSupervisor`]); future planes
//! (kubectl, AMP/podman, SSH) will be their own impls. The instance command
//! dispatch (`instancecmd::dispatch`) talks only to this trait, so it never
//! learns which management model is behind a given instance.
//!
//! Trait objects (`Arc<dyn Supervisor>`) need object-safe, dynamically
//! dispatchable async methods; native `async fn` in traits is not yet
//! dyn-compatible, so we use `#[async_trait]` (the battle-tested ecosystem
//! standard) to box the returned futures. The cost — one heap alloc per
//! lifecycle call — is irrelevant for start/stop/restart, which happen seconds
//! to minutes apart.
use std::sync::Arc;
use anyhow::Result;
use serde::Serialize;
use tokio::sync::watch;
/// Observable lifecycle state of one instance. Shared vocabulary across every
/// supervisor impl; serialized verbatim into heartbeats and status events
/// (`{"state":"running", ...}`).
#[derive(Debug, Clone, PartialEq, Serialize)]
#[serde(rename_all = "snake_case", tag = "state")]
pub enum InstanceState {
/// Not lifecycle-managed (a process instance with no executable, etc.).
Unmanaged,
Stopped,
Starting,
Running,
Stopping,
/// Exited/died without a stop request.
Crashed {
#[serde(skip_serializing_if = "Option::is_none")]
exit_code: Option<i32>,
},
}
impl InstanceState {
pub fn as_label(&self) -> &'static str {
match self {
InstanceState::Unmanaged => "unmanaged",
InstanceState::Stopped => "stopped",
InstanceState::Starting => "starting",
InstanceState::Running => "running",
InstanceState::Stopping => "stopping",
InstanceState::Crashed { .. } => "crashed",
}
}
}
/// Lifecycle control + state observation for one instance.
///
/// `start`/`stop`/`restart` take `self: Arc<Self>` so an impl can hand a clone
/// to a spawned monitor task; callers hold an `Arc<dyn Supervisor>` and
/// `clone()` before each call. `watch_state` exposes the same channel the
/// status-event publisher drains, so panel push events stay decoupled from the
/// heartbeat cadence.
#[async_trait::async_trait]
pub trait Supervisor: Send + Sync {
/// The instance slug (a NATS subject segment).
fn instance_id(&self) -> &str;
/// Current cached state (cheap; no I/O).
fn state(&self) -> InstanceState;
/// Subscribe to state transitions.
fn watch_state(&self) -> watch::Receiver<InstanceState>;
/// Seconds since the instance entered `Running` (0 otherwise).
async fn uptime_seconds(&self) -> u64;
async fn start(self: Arc<Self>) -> Result<()>;
async fn stop(self: Arc<Self>) -> Result<()>;
async fn restart(self: Arc<Self>) -> Result<()>;
}
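
The `self: Arc<Self>` receiver described in the trait docs above can be tried out in a small std-only sketch. It assumes a thread as a stand-in for the spawned tokio monitor task and a synchronous `start` in place of the async one; `Demo` and `start_via_trait_object` are hypothetical names used only for illustration.

```rust
use std::sync::Arc;
use std::thread;

/// Simplified stand-in for the trait: `start` consumes an `Arc<Self>`.
/// `Arc<Self>` is a dispatchable receiver, so the trait still works as
/// `dyn Supervisor`.
trait Supervisor: Send + Sync {
    fn start(self: Arc<Self>) -> thread::JoinHandle<String>;
}

struct Demo {
    id: String,
}

impl Supervisor for Demo {
    fn start(self: Arc<Self>) -> thread::JoinHandle<String> {
        // Hand a clone of the Arc to the "monitor" so it can outlive this call.
        let monitor = Arc::clone(&self);
        thread::spawn(move || format!("monitoring {}", monitor.id))
    }
}

/// Callers hold an `Arc<dyn Supervisor>` and clone it before the consuming call,
/// so their own handle stays usable afterwards.
fn start_via_trait_object(sup: &Arc<dyn Supervisor>) -> String {
    sup.clone()
        .start()
        .join()
        .expect("monitor thread panicked")
}
```

Because the call consumes only the clone, the same `Arc<dyn Supervisor>` can be reused for later stop or restart commands.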


@@ -129,7 +129,7 @@ pub async fn collect(agent: &Agent, sys: &mut System) -> HeartbeatPayload {
let mut instances = Vec::with_capacity(agent.cfg.instances.len());
for inst in &agent.cfg.instances {
let (state, uptime_seconds) = match agent.supervisors.get(&inst.id) {
-Some(sup) if !matches!(sup.state(), crate::process::InstanceState::Unmanaged) => {
+Some(sup) if !matches!(sup.state(), crate::supervisor::InstanceState::Unmanaged) => {
(sup.state().as_label().to_string(), sup.uptime_seconds().await)
}
_ => {