Agent Node
Standalone agent process for extending Nimbus clusters across multiple machines with WebSocket connectivity, template sync, and crash recovery.
The Nimbus Agent (nimbus-agent) is a standalone process that runs on remote machines to extend a Nimbus cluster across multiple nodes. It connects to the controller via WebSocket, receives service lifecycle commands, and manages local JVM processes.
Overview
The agent handles:
- WebSocket connection to the controller with automatic reconnection
- Template downloading and caching (ZIP-based, hash-compared)
- Local JVM process management (start, stop, command forwarding, stdout capture)
- Java version resolution with automatic Adoptium downloads
- Service state persistence and crash recovery
- System metrics reporting (CPU, memory) via heartbeats
Architecture
nimbus-agent/src/main/kotlin/dev/nimbuspowered/nimbus/agent/
├── Agent.kt → Entry point (main function), CLI arg parsing
├── AgentConfig.kt → TOML config model + loader
├── AgentRuntime.kt → WebSocket connection loop, message dispatch
├── AgentStateStore.kt → JSON persistence for crash recovery
├── JavaResolver.kt → Java installation detection + Adoptium download
├── LocalProcessHandle.kt → Single JVM process wrapper (stdin/stdout/lifecycle)
├── LocalProcessManager.kt → Service lifecycle, template copy, config patching
├── SetupWizard.kt → First-run interactive setup
├── StateSyncClient.kt → State-sync pull/push for floating services
├── TemplateDownloader.kt → HTTP template download + ZIP extraction
└── TlsHelper.kt → Fingerprint-pinning TLS trust manager
Bootstrap and connection
The main() function in Agent.kt parses CLI arguments (--controller, --token, --name, --max-memory, --max-services), loads agent.toml (or runs the setup wizard), then starts AgentRuntime.
AgentRuntime maintains a persistent WebSocket connection to the controller with automatic 5-second reconnection. On connect, it sends an AuthRequest with the node's token, name, resources, and OS info. After authentication, any recovered services from a previous run are reported, and the agent enters the message dispatch loop.
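The reconnection behaviour can be sketched as a simple retry loop. This is a minimal illustration, not the agent's actual code: the name runWithReconnect and its parameters are hypothetical, and it uses a blocking sleep to stay within the standard library, whereas the agent itself is coroutine-based.

```kotlin
/**
 * Runs [session] repeatedly, waiting [delayMillis] between attempts.
 * [session] is expected to block while the connection is open and to return
 * (or throw) once it drops. [maxAttempts] and [sleep] are hypothetical hooks
 * that make the loop bounded and testable; a long-running agent would keep
 * the defaults and loop indefinitely.
 */
fun runWithReconnect(
    delayMillis: Long = 5_000,
    maxAttempts: Int = Int.MAX_VALUE,
    sleep: (Long) -> Unit = { Thread.sleep(it) },
    session: (attempt: Int) -> Unit,
): Int {
    var attempt = 0
    while (attempt < maxAttempts) {
        attempt++
        try {
            session(attempt) // connect, authenticate, dispatch until disconnect
        } catch (e: Exception) {
            // A failed or dropped connection is not fatal; fall through to the retry delay.
            System.err.println("Connection attempt $attempt failed: ${e.message}")
        }
        if (attempt < maxAttempts) sleep(delayMillis)
    }
    return attempt
}
```

Passing the sleep function as a parameter keeps the 5-second delay out of unit tests while leaving the production default untouched.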
All message handling uses kotlinx-coroutines with a SupervisorJob scope -- individual service failures do not crash the agent. The agent handles StartService, StopService, SendCommand, HeartbeatRequest, and ShutdownAgent messages (see Protocol for details).
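A rough sketch of the dispatch step follows. The message type names come from the paragraph above, but every field, the returned strings, and the dispatchSafely helper are assumptions made for illustration. The runCatching wrapper only stands in for the isolation that the SupervisorJob scope provides; it is not the coroutine-based implementation.

```kotlin
// Hypothetical message model: only the type names are taken from the documentation.
sealed interface ControllerMessage
data class StartService(val serviceName: String, val template: String) : ControllerMessage
data class StopService(val serviceName: String) : ControllerMessage
data class SendCommand(val serviceName: String, val command: String) : ControllerMessage
object HeartbeatRequest : ControllerMessage
object ShutdownAgent : ControllerMessage

/** Returns a description of the action; a real agent would call into its process manager. */
fun dispatch(message: ControllerMessage): String = when (message) {
    is StartService -> "start ${message.serviceName} from template ${message.template}"
    is StopService -> "stop ${message.serviceName}"
    is SendCommand -> "send '${message.command}' to ${message.serviceName}"
    HeartbeatRequest -> "reply with CPU and memory metrics"
    ShutdownAgent -> "stop all services and exit"
}

/** Handles one message in isolation so that a failing handler cannot end the dispatch loop. */
fun dispatchSafely(
    message: ControllerMessage,
    handler: (ControllerMessage) -> String = { dispatch(it) },
): Result<String> = runCatching { handler(message) }
```

Because the hierarchy is sealed, the compiler rejects the when expression if a new message type is added without a matching branch.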
Template sync
TemplateDownloader fetches templates from the controller's REST API (GET /api/templates/{name}/download). Templates are identified by name and content hash.
Agent requests template "Lobby" with hash "abc123"
├── Local cache has matching hash? → Skip download
└── Hash mismatch or missing?
├── Download ZIP from controller (includes global plugins like SDK)
├── Extract to templates/{name}/
└── Store hash for future comparisons
The software parameter is included in the download request so the controller can bundle the correct global plugins (e.g., SDK for Paper but not for Velocity).
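The cache decision in the tree above might look roughly like this. It is a sketch under one loud assumption: the hash of the last extracted ZIP is kept in a sidecar file named .hash inside templates/{name}/. The documentation does not say where the hash is stored, and both function names are hypothetical.

```kotlin
import java.nio.file.Files
import java.nio.file.Path

/**
 * Decides whether a template must be downloaded again.
 * Assumption: the last known hash lives in templates/{name}/.hash.
 */
fun needsDownload(templatesDir: Path, name: String, expectedHash: String): Boolean {
    val hashFile = templatesDir.resolve(name).resolve(".hash")
    if (!Files.isRegularFile(hashFile)) return true             // never downloaded
    val cachedHash = Files.readString(hashFile).trim()
    return !cachedHash.equals(expectedHash, ignoreCase = true)  // stale copy
}

/** Records the hash after a successful download and extraction. */
fun storeHash(templatesDir: Path, name: String, hash: String) {
    val dir = Files.createDirectories(templatesDir.resolve(name))
    Files.writeString(dir.resolve(".hash"), hash)
}
```

Writing the hash only after extraction has finished means an interrupted download is retried on the next request instead of being treated as cached.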
Local service management
LocalProcessManager handles service lifecycle: copying templates to work directories (services/temp/{Name}_{uuid} for dynamic, services/static/{Name} for static), patching server.properties and Velocity forwarding configs, building the JVM command line, and launching the process.
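The work directory layout can be illustrated with a small helper. The function name and the isStatic flag are hypothetical, and the sketch assumes the uuid suffix is a freshly generated random UUID.

```kotlin
import java.nio.file.Path
import java.util.UUID

/**
 * Resolves the work directory for a service:
 * services/static/{Name} for static services, services/temp/{Name}_{uuid} for dynamic ones.
 * Assumption: the suffix is a random UUID created when the service starts.
 */
fun workDirectory(
    baseDir: Path,
    name: String,
    isStatic: Boolean,
    uuid: UUID = UUID.randomUUID(),
): Path {
    val services = baseDir.resolve("services")
    return if (isStatic) {
        services.resolve("static").resolve(name)         // reused across restarts
    } else {
        services.resolve("temp").resolve("${name}_$uuid") // unique per launch
    }
}
```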
LocalProcessHandle wraps a single JVM process and provides stdout capture (SharedFlow<String>, buffer 4096), ready detection (a regex that watches stdout for "Done ("), graceful shutdown (sends stop via stdin), and process adoption for crash recovery (adopt(pid, serviceName)).
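Ready detection can be approximated with a single regular expression. The sketch below assumes that a literal "Done (" anywhere in a stdout line marks the server as ready; the exact pattern the agent uses may be stricter, and the names readyPattern, isReadyLine, and firstReadyIndex are hypothetical.

```kotlin
// Assumption: "Done (" anywhere in a line signals that startup has completed.
val readyPattern = Regex("""Done \(""")

/** True when a captured stdout line signals that the server finished starting. */
fun isReadyLine(line: String): Boolean = readyPattern.containsMatchIn(line)

/** Returns the index of the first line that signals readiness, or -1 if none does. */
fun firstReadyIndex(lines: Sequence<String>): Int = lines.indexOfFirst { isReadyLine(it) }
```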
Crash recovery
AgentStateStore persists running service metadata to state/services.json using atomic file writes. On restart:
- The agent loads persisted services from the state file
- For each service, it attempts to adopt the OS process by PID via LocalProcessHandle.adopt()
- Adoption verifies the process is alive and its command line contains nimbus.service.name={serviceName}
- Successfully recovered services are reported to the controller as READY
- Failed recoveries are removed from the state file
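The adoption check described above can be sketched with the standard java.lang.ProcessHandle API (Java 9+). This is an illustration rather than the body of LocalProcessHandle.adopt(); the helper names canAdopt and commandLineMatches are hypothetical.

```kotlin
/** True when [commandLine] carries the marker that identifies [serviceName]. */
fun commandLineMatches(commandLine: String, serviceName: String): Boolean =
    commandLine.contains("nimbus.service.name=$serviceName")

/**
 * Checks whether the OS process [pid] is alive and still belongs to [serviceName].
 * On platforms where the command line is not readable, info().commandLine()
 * is empty and the check fails closed.
 */
fun canAdopt(pid: Long, serviceName: String): Boolean {
    val handle = ProcessHandle.of(pid).orElse(null) ?: return false // no such process
    if (!handle.isAlive) return false
    val commandLine = handle.info().commandLine().orElse("")
    return commandLineMatches(commandLine, serviceName)
}
```

Checking the command line as well as the PID guards against PID reuse: after a reboot or a long outage the same PID may belong to an unrelated process. A plain substring match would also accept a service whose name merely starts with the expected one (Lobby-1 and Lobby-10), so a stricter comparison may be preferable.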
Java version resolution
JavaResolver finds the correct Java version by checking configured paths ([java] section), cached downloads in jdks/, environment variables (JAVA_16_HOME through JAVA_21_HOME), JAVA_HOME, and common directories (/usr/lib/jvm, ~/.sdkman/candidates/java, etc.). If no compatible version is found, it auto-downloads from the Adoptium API. Java 16 is the minimum supported runtime.
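Part of this search order can be sketched as a first-match lookup. The function below is hypothetical and deliberately incomplete: it models only the configured path, the version-specific environment variable, and JAVA_HOME, it assumes an exact version match, and it does not verify the version behind JAVA_HOME. The jdks/ cache, the directory scan, and the Adoptium download are left out because their details are not given here.

```kotlin
import java.nio.file.Files
import java.nio.file.Path

/**
 * Returns the first candidate directory that contains an executable bin/java
 * (or bin/java.exe). [configured] stands for the [java] section of agent.toml,
 * keyed by major version; [env] stands for the process environment.
 */
fun resolveJavaHome(
    requiredVersion: Int,
    configured: Map<Int, String>,
    env: Map<String, String> = System.getenv(),
): Path? {
    val candidates = sequenceOf(
        configured[requiredVersion],         // e.g. java_17 = "/opt/jdk-17"
        env["JAVA_${requiredVersion}_HOME"], // e.g. JAVA_17_HOME
        env["JAVA_HOME"],                    // generic fallback, version not checked here
    )
    return candidates
        .filterNotNull()
        .filter { it.isNotBlank() }          // empty TOML values mean "not configured"
        .map { Path.of(it) }
        .firstOrNull { home ->
            val bin = home.resolve("bin")
            Files.isExecutable(bin.resolve("java")) || Files.isExecutable(bin.resolve("java.exe"))
        }
}
```

A null result corresponds to the point where the resolver would move on to the remaining sources and, finally, to the automatic download.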
Building and running
# Build the agent fat JAR
./gradlew :nimbus-agent:shadowJar
# Run (first run triggers setup wizard)
java -jar nimbus-agent/build/libs/nimbus-agent-<version>-all.jar
# Run with CLI args (skip wizard)
java -jar nimbus-agent-<version>-all.jar \
--controller wss://10.0.0.1:8443/cluster \
--token your-cluster-token \
--name worker-1 \
--max-memory 16G \
--max-services 20
# Force the setup wizard even when agent.toml exists
java -jar nimbus-agent-<version>-all.jar --setup
Configuration
The agent is configured via agent.toml in the working directory. On first run, the setup wizard calls the controller's /api/cluster/bootstrap endpoint to fetch the TLS fingerprint and fills this file in for you.
[agent]
controller = "wss://127.0.0.1:8443/cluster"
token = "your-cluster-token"
node_name = "worker-1"
max_memory = "8G"
max_services = 10
public_host = "" # Leave blank to auto-detect from network interfaces
# TLS: pinned controller cert fingerprint (set by the wizard, preferred).
trusted_fingerprint = "AA:BB:CC:DD:EE:FF:..."
# tls_verify = false only for local dev (MITM-vulnerable).
tls_verify = true
# Advanced: supply a JKS/PKCS12 truststore for CA-issued certs.
truststore_path = ""
truststore_password = ""
# Optional: specify paths to Java installations.
# Leave empty for auto-detection / auto-download from Adoptium.
[java]
java_16 = ""
java_17 = ""
java_21 = ""
| Key | Default | Description |
|---|---|---|
| controller | wss://127.0.0.1:8443/cluster | Controller WebSocket URL |
| token | (required) | Cluster auth token (matches controller config) |
| node_name | hostname | Display name for this node |
| max_memory | auto-detected | Maximum memory this node can allocate to services |
| max_services | auto-detected | Maximum concurrent services on this node |
| public_host | "" | Publicly reachable IP or hostname for player routing. Leave blank to auto-detect from network interfaces. |
| trusted_fingerprint | "" | SHA-256 fingerprint of the controller's TLS cert. Takes precedence over truststore and system CAs. Set by the setup wizard. |
| tls_verify | true | If false, accept any cert (DEV ONLY). Ignored when trusted_fingerprint is set. |
| truststore_path | "" | Path to a JKS/PKCS12 truststore (advanced, for CA-issued certs). |
| truststore_password | "" | Password for the truststore. |
TLS trust precedence: trusted_fingerprint → truststore_path → system CAs → tls_verify = false (trust all).
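One reading of this precedence, expressed as code. TrustMode, TlsSettings, and selectTrustMode are hypothetical names; the fields mirror the agent.toml keys. The sketch assumes a configured truststore wins over tls_verify = false, which the precedence line suggests but the table does not state explicitly.

```kotlin
enum class TrustMode { PINNED_FINGERPRINT, TRUSTSTORE, SYSTEM_CAS, TRUST_ALL }

/** Hypothetical holder for the TLS-related agent.toml keys. */
data class TlsSettings(
    val trustedFingerprint: String = "",
    val truststorePath: String = "",
    val tlsVerify: Boolean = true,
)

/** Picks the trust mode, checking the most specific setting first. */
fun selectTrustMode(tls: TlsSettings): TrustMode = when {
    tls.trustedFingerprint.isNotBlank() -> TrustMode.PINNED_FINGERPRINT // tls_verify is ignored
    tls.truststorePath.isNotBlank() -> TrustMode.TRUSTSTORE
    tls.tlsVerify -> TrustMode.SYSTEM_CAS
    else -> TrustMode.TRUST_ALL                                         // dev only
}
```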
Memory and max services are auto-detected from the system if not specified (total RAM minus 2 GB for OS, one service per GB).
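As a worked example of that rule, the sketch below derives both defaults from total RAM. It rests on two assumptions that are not confirmed here: "one service per GB" is counted against the memory left after the 2 GB reservation, and a host with 2 GB or less ends up with zero capacity. NodeLimits and defaultLimits are hypothetical names.

```kotlin
data class NodeLimits(val maxMemoryGb: Long, val maxServices: Int)

/** Derives default limits from total RAM, reserving 2 GB for the operating system. */
fun defaultLimits(totalRamGb: Long): NodeLimits {
    val usableGb = (totalRamGb - 2).coerceAtLeast(0)   // never negative on small hosts
    return NodeLimits(maxMemoryGb = usableGb, maxServices = usableGb.toInt())
}
```

Under these assumptions a 16 GB machine would default to 14 GB of allocatable memory and 14 services.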
See Cluster TLS & Security for the full trust model, cert rotation flow, and troubleshooting.
Next steps
- Architecture -- How the agent fits into the cluster
- Protocol -- All cluster message types
- SDK -- Backend server plugin API