Joblet is a micro-container runtime for running Linux jobs with:
- Process and filesystem isolation (PID namespace, chroot)
- Fine-grained CPU, memory, and IO throttling (cgroups v2)
- Secure job execution with mTLS and RBAC
- Built-in scheduler, SSE log streaming, and multi-core pinning
Ideal for: agentic AI workloads (untrusted code)
Comprehensive guide to monitoring remote joblet server resources and performance using RNX’s client-side monitoring capabilities.
RNX provides remote monitoring from your client machine or workstation, tracking joblet server resources:
✅ Remote Monitoring: Monitor joblet server resources from your local workstation
✅ Client-Server Architecture: Secure gRPC communication with mTLS authentication
✅ Volume Tracking: Automatic detection and monitoring of server-side joblet volumes
✅ Cloud Detection: Support for AWS, GCP, Azure, KVM, and bare metal server detection
✅ JSON Output: UI-compatible format for dashboards and monitoring tools
✅ Resource Filtering: Monitor specific server resources (CPU, memory, disk, network)
✅ Process Analysis: Top consumers by CPU and memory usage on the server
┌─────────────────────┐ gRPC/mTLS ┌─────────────────────┐
│ Client Machine │ ◄──────────────► │ Joblet Server │
│ │ │ │
│ ┌───────────────┐ │ │ ┌───────────────┐ │
│ │ rnx monitor │ │ Monitor Req │ │ Monitoring │ │
│ │ (from laptop/ │ │ ──────────────► │ │ Service │ │
│ │ workstation) │ │ │ │ │ │
│ │ │ │ Metrics Data │ │ Collects: │ │
│ │ Displays: │ │ ◄────────────── │ │ - CPU/Memory │ │
│ │ - Server CPU │ │ │ │ - Disk Usage │ │
│ │ - Server Mem │ │ │ │ - Volumes │ │
│ │ - Server Disk │ │ │ │ - Processes │ │
│ │ - Volumes │ │ │ │ - Network │ │
│ └───────────────┘ │ │ └───────────────┘ │
└─────────────────────┘ └─────────────────────┘
Client Side (Your Workstation): runs the rnx CLI, which sends monitor requests over gRPC/mTLS and renders the returned server metrics.
Server Side (Joblet Host): runs the joblet monitoring service, which collects CPU, memory, disk, volume, process, and network data.
Monitor different joblet servers from a single client:
# Monitor production server
rnx --node=production monitor status
# Monitor staging server
rnx --node=staging monitor status
# Monitor development server
rnx --node=dev monitor watch --interval=5
# Get comprehensive overview of joblet server resources
rnx monitor status
# JSON output for dashboards/APIs (server metrics)
rnx monitor status --json
# Monitor specific joblet server node
rnx --node=production monitor status
# Watch all server metrics from your workstation (5s refresh)
rnx monitor watch
# Faster refresh rate for real-time server monitoring
rnx monitor watch --interval=2
# Monitor specific server resources remotely
rnx monitor watch --filter=cpu,memory,disk
# Show current server metrics with top processes
rnx monitor top
# Filter by server resource type
rnx monitor top --filter=disk,network
# JSON output for monitoring tools (server data)
rnx monitor top --json
rnx monitor status
Displays comprehensive remote server status, including all server resources and joblet volumes.
Features:
- Full server resource overview: host, CPU, memory, disk, volumes, processes, and network
Enhanced Display (v4.7.2+):
The status command now includes a Joblet Server block with version, git tag, git commit, build date, Go version, and platform (see Example Output below).
Usage:
rnx monitor status # Server status with version info
rnx monitor status --json # JSON format (server data)
rnx --node=production monitor status # Specific server node
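To script against the build information, the version block can be pulled out of the text output (a minimal sketch; -A 6 matches the six-line block in the example output below):
# Extract the Joblet Server version block from status output
rnx monitor status | grep -A 6 "Joblet Server:"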
Example Output:
System Status - 2025-10-08T18:45:14Z
Available: true
Host Information:
Hostname: joblet-server
OS: Ubuntu 22.04.2 LTS
Kernel: 5.15.0-153-generic
Architecture: amd64
Uptime: 33d 4h 58m
Node ID: 8eb41e22-2940-4f83-9066-7d739d057ad2
Server IPs: 192.168.1.161, 172.20.0.1
MAC Addresses: 5e:9f:b0:c0:61:22, 1e:45:87:fe:bc:53
Joblet Server:
Version: v4.7.2
Git Tag: v4.7.2
Git Commit: 00df3a5ee3d6ae7e25078d610e05b977cd8a1812
Build Date: 2025-10-08T18:44:40Z
Go Version: go1.24.0
Platform: linux/amd64
Network Interfaces:
ens18:
IP: 192.168.1.161
MAC: 5e:9f:b0:c0:61:22
RX: 16.1 GB (10264780 packets, 0 errors)
TX: 986.9 MB (3558641 packets, 0 errors)
Rate: RX 167 B/s TX 699 B/s
joblet0:
IP: 172.20.0.1
MAC: 1e:45:87:fe:bc:53
RX: 5.7 MB (73297 packets, 0 errors)
TX: 6.7 MB (73490 packets, 0 errors)
Rate: RX 0 B/s TX 0 B/s
rnx monitor top
Shows current remote server metrics in a condensed format with top resource consumers.
Features:
- Top consumers by CPU and memory usage on the server
- Supports resource filtering and JSON output
Usage:
rnx monitor top # All server metrics
rnx monitor top --filter=cpu,memory # Specific server metrics only
rnx monitor top --json # JSON output (server data)
rnx monitor watch
Real-time remote server monitoring with configurable refresh intervals.
Features:
- Configurable refresh interval (--interval, default 5s)
- Resource filtering (--filter), compact display (--compact), and JSON streaming (--json)
Usage:
rnx monitor watch # Default 5s server monitoring
rnx monitor watch --interval=1 # 1s server refresh
rnx monitor watch --filter=disk,network # Specific server resources
rnx monitor watch --compact # Compact server format
rnx monitor watch --json --interval=10 # JSON server streaming
NetworkCollector Implementation: The monitoring system uses a dedicated NetworkCollector that:
- Reads /proc/net/dev for bandwidth metrics
- Uses the Go net package
Location: /internal/joblet/monitoring/collectors/network.go
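For reference, a rough shell equivalent of what the collector reads from /proc/net/dev (this is not joblet code; field positions follow the standard /proc/net/dev layout, where receive bytes is the first counter after the interface name and transmit bytes the ninth):
# Print per-interface RX/TX byte counters
awk 'NR > 2 { sub(":", "", $1); print $1, "rx_bytes=" $2, "tx_bytes=" $10 }' /proc/net/dev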
The --json flag produces structured output optimized for dashboard integration:
{
"hostInfo": {
"hostname": "joblet-server",
"platform": "Ubuntu 22.04.2 LTS",
"arch": "amd64",
"uptime": 152070,
"cloudProvider": "AWS",
"instanceType": "t3.medium",
"region": "us-east-1"
},
"cpuInfo": {
"cores": 8,
"usage": 0.15,
"loadAverage": [0.5, 0.3, 0.2],
"perCoreUsage": [0.1, 0.2, 0.05, 0.3, 0.18, 0.07, 0.12, 0.09]
},
"memoryInfo": {
"total": 4100255744,
"used": 378679296,
"available": 3556278272,
"percent": 9.23,
"cached": 1835712512,
"swap": {
"total": 2147479552,
"used": 0,
"percent": 0
}
},
"disksInfo": {
"disks": [
{
"name": "/dev/sda1",
"mountpoint": "/",
"filesystem": "ext4",
"size": 19896352768,
"used": 11143790592,
"available": 8752562176,
"percent": 56.01
},
{
"name": "analytics-data",
"mountpoint": "/opt/joblet/volumes/analytics-data",
"filesystem": "joblet-volume",
"size": 1073741824,
"used": 52428800,
"available": 1021313024,
"percent": 4.88
}
],
"totalSpace": 21936726016,
"usedSpace": 11196219392
},
"networkInfo": {
"interfaces": [
{
"name": "eth0",
"type": "ethernet",
"status": "up",
"rxBytes": 1234567890,
"txBytes": 987654321,
"rxPackets": 123456,
"txPackets": 98765
}
],
"totalRxBytes": 1234567890,
"totalTxBytes": 987654321
},
"processesInfo": {
"processes": [
{
"pid": 1234,
"name": "joblet",
"command": "/opt/joblet/joblet",
"cpu": 2.5,
"memory": 1.2,
"memoryBytes": 49152000,
"status": "sleeping"
}
],
"totalProcesses": 149
}
}
For real-time monitoring integrations:
# Stream JSON objects every 10 seconds
rnx monitor watch --json --interval=10
# Process with monitoring tools
rnx monitor watch --json | jq '.cpuInfo.usage'
# Forward to monitoring systems
rnx monitor watch --json --interval=30 | logger -t joblet-metrics
Create a data source using the JSON output:
#!/bin/bash
# grafana-collector.sh
while true; do
rnx monitor status --json > /var/lib/grafana/joblet-metrics.json
sleep 60
done
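To make that file reachable by a Grafana JSON datasource, one simple option is to serve the directory over HTTP; the port and paths here are illustrative, not a joblet convention:
# Serve the metrics file for a JSON API datasource
cd /var/lib/grafana && python3 -m http.server 8085
# Grafana can then poll http://<host>:8085/joblet-metrics.json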
Export metrics in Prometheus format:
#!/bin/bash
# prometheus-exporter.sh
METRICS=$(rnx monitor status --json)
CPU_USAGE=$(echo "$METRICS" | jq -r '.cpuInfo.usage')
MEMORY_PERCENT=$(echo "$METRICS" | jq -r '.memoryInfo.percent')
echo "joblet_cpu_usage $CPU_USAGE"
echo "joblet_memory_percent $MEMORY_PERCENT"
Use the JSON API to build custom monitoring dashboards:
// JavaScript example
const { exec } = require('child_process');

function getJobletMetrics() {
  return new Promise((resolve, reject) => {
    exec('rnx monitor status --json', (error, stdout) => {
      if (error) reject(error);
      else resolve(JSON.parse(stdout));
    });
  });
}

// Usage (top-level await is not available in CommonJS, so use .then)
getJobletMetrics().then((metrics) => {
  console.log(`CPU Usage: ${metrics.cpuInfo.usage * 100}%`);
  console.log(`Memory Usage: ${metrics.memoryInfo.percent}%`);
});
No Volume Statistics Showing
# Check if volumes exist
rnx volume list
# Create test volume
rnx volume create test-monitoring --size=100MB
# Verify monitoring detects it
rnx monitor status --json | grep "joblet-volume"
High Resource Usage
# Identify resource-heavy processes
rnx monitor top --filter=process
# Monitor specific resources
rnx monitor watch --filter=cpu,memory --interval=1
# Check for resource-intensive jobs
rnx job list --json | jq '.[] | select(.status=="running")'
Network Monitoring Issues
# Check active interfaces
rnx monitor status | grep -A 10 "Network Interfaces"
# Monitor network activity
rnx monitor watch --filter=network --interval=2
Reduce Monitoring Overhead
# Use longer intervals for production
rnx monitor watch --interval=30
# Filter to essential metrics only
rnx monitor watch --filter=cpu,memory
# Use compact format for less output
rnx monitor watch --compact
Efficient JSON Processing
# Extract specific metrics only
rnx monitor status --json | jq '.cpuInfo'
# Monitor specific volumes
rnx monitor status --json | jq '.disksInfo.disks[] | select(.filesystem=="joblet-volume")'
# Set up automated monitoring (crontab entry; % must be escaped in cron)
*/5 * * * * rnx monitor status --json > /var/log/joblet/metrics-$(date +\%Y\%m\%d-\%H\%M).json
# Create alerting scripts
#!/bin/bash
CPU_USAGE=$(rnx monitor status --json | jq -r '.cpuInfo.usage')
if (( $(echo "$CPU_USAGE > 0.8" | bc -l) )); then
echo "ALERT: High CPU usage: $(echo "$CPU_USAGE * 100" | bc)%"
fi
# Monitor volume usage regularly
rnx monitor status --json | jq '.disksInfo.disks[] | select(.filesystem=="joblet-volume") | {name, percent}'
# Clean up unused volumes
rnx volume list | grep -v "in-use"
# Monitor job performance impact
rnx monitor watch --filter=cpu,memory &
rnx job run --max-cpu=50 heavy-computation.py
# Log metrics for trend analysis
rnx monitor status --json | jq '{timestamp: now, cpu: .cpuInfo.usage, memory: .memoryInfo.percent}' >> metrics.jsonl
# Test monitoring integration
rnx monitor status --json | jq . > /dev/null && echo "JSON valid" || echo "JSON invalid"
# Verify all metrics present
REQUIRED_FIELDS="hostInfo cpuInfo memoryInfo disksInfo networkInfo processesInfo"
for field in $REQUIRED_FIELDS; do
rnx monitor status --json | jq ".$field" > /dev/null || echo "Missing: $field"
done
The persist service handles historical log and metric storage. Monitor its health and performance:
# Check persist service status (on server)
ssh server "systemctl status persist"
# View persist service logs
ssh server "journalctl -u persist -n 100 -f"
# Check IPC socket connectivity
ssh server "ls -la /opt/joblet/run/persist.sock"
# Monitor storage usage for logs and metrics
ssh server "du -sh /opt/joblet/logs /opt/joblet/metrics"
# Check for persist service errors
ssh server "journalctl -u persist --since '1 hour ago' | grep -i error"
Persist Service Metrics:
The persist service exposes its own metrics for monitoring:
# Persist service health (if gRPC endpoint is enabled)
curl http://server:9093/health
# Prometheus metrics (if enabled)
curl http://server:9092/metrics | grep persist_
Key Metrics to Monitor:
- Storage usage under /opt/joblet/logs and /opt/joblet/metrics
Storage Management:
# Check current storage usage
ssh server "df -h /opt/joblet"
# Find largest log directories
ssh server "du -sh /opt/joblet/logs/* | sort -hr | head -10"
# Check metric storage
ssh server "du -sh /opt/joblet/metrics/* | sort -hr | head -10"
# Monitor storage growth rate
ssh server "watch -n 60 'du -sh /opt/joblet/logs /opt/joblet/metrics'"
Automated Monitoring:
# Log persist metrics to JSONL for analysis
while true; do
ssh server "du -sk /opt/joblet/logs /opt/joblet/metrics" | \
awk '{print "{\"timestamp\":" systime() ",\"path\":\"" $2 "\",\"size_kb\":" $1 "}"}' >> persist-metrics.jsonl
sleep 300 # Every 5 minutes
done
# Alert on high storage usage
THRESHOLD=80 # Alert at 80% usage
USAGE=$(ssh server "df /opt/joblet | tail -1 | awk '{print \$5}' | sed 's/%//'")
if [ "$USAGE" -gt "$THRESHOLD" ]; then
echo "ALERT: Persist storage at ${USAGE}% (threshold: ${THRESHOLD}%)"
fi
Performance Tuning:
Monitor persist service performance and adjust configuration:
# In /opt/joblet/config/joblet-config.yml
persist:
writer:
flush_interval: "1s" # Increase to reduce I/O, decrease for lower latency
batch_size: 100 # Higher = better throughput, more memory
query:
cache:
ttl: "5m" # Cache query results to reduce disk I/O
stream:
buffer_size: 1024 # Buffer size for streaming queries
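After editing these values, restart the service so they take effect (assuming the systemd unit is named persist, as elsewhere in this guide):
# Apply config changes and confirm the service came back up
ssh server "sudo systemctl restart persist && systemctl is-active persist"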
Troubleshooting Persist Service:
# Service not running
ssh server "sudo systemctl restart persist"
# Check if socket exists
ssh server "sudo ls -la /opt/joblet/run/persist.sock"
# Verify socket permissions (should be 600)
ssh server "sudo stat /opt/joblet/run/persist.sock"
# Test IPC connectivity from joblet service
ssh server "sudo lsof | grep persist.sock"
# Check for disk space issues
ssh server "df -h /opt/joblet && df -i /opt/joblet"
Best Practices:
- Include /opt/joblet/logs and /opt/joblet/metrics in backups

When eBPF telematics is enabled, joblet captures process execution, network connection, and memory events from jobs. These events are shipped to CloudWatch alongside regular logs and metrics.
Viewing eBPF Telematics Events via CLI:
# View eBPF telematics events for a job
rnx job telematics <job-uuid>
# Using short UUID (first 8 characters)
rnx job telematics f47ac10b
# Filter specific event types
rnx job telematics f47ac10b --types exec,connect
# Filter with grep
rnx job telematics f47ac10b | grep EXEC # Process executions
rnx job telematics f47ac10b | grep CONNECT # Outgoing connections
rnx job telematics f47ac10b | grep ACCEPT # Incoming connections
rnx job telematics f47ac10b | grep MMAP # Memory mappings with exec
# View resource metrics separately
rnx job metrics f47ac10b
Available eBPF Event Types:
| Event | Display | Description |
|---|---|---|
| exec | EXEC | Process executions (fork/exec syscalls) |
| connect | CONNECT | Outgoing network connections (connect syscall) |
| accept | ACCEPT | Incoming network connections (accept syscall) |
| socket_data | SEND/RECV | Socket data transfers (sendto/recvfrom syscalls) |
| mmap | MMAP | Memory mappings with executable permissions |
| mprotect | MPROTECT | Memory protection changes adding exec permission |
Data Flow:
eBPF Monitor → Telemetry Collector → IPC Writer → Persist Service → CloudWatch Logs
CloudWatch Log Streams (per job):
Log Group: /joblet/{node_id}
- {job_uuid}-logs # stdout/stderr logs
- {job_uuid}-metrics # Resource metrics
- {job_uuid}-exec-events # Process execution events (eBPF)
- {job_uuid}-connect-events # Network connection events (eBPF)
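With AWS CLI v2 these streams can be tailed directly; the node ID and job UUID below reuse the examples from this guide and are illustrative:
# Follow all streams for one job (logs, metrics, and eBPF events)
aws logs tail /joblet/8eb41e22-2940-4f83-9066-7d739d057ad2 --log-stream-name-prefix f47ac10b --follow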
Querying eBPF Events in CloudWatch Insights:
# Find all processes executed by a job
fields @timestamp, pid, filename, args
| filter @logStream like "abc123-exec-events"
| sort @timestamp desc
| limit 100
# Find network connections made by a job
fields @timestamp, pid, dst_addr, dst_port, protocol
| filter @logStream like "abc123-connect-events"
| sort @timestamp desc
# Find jobs connecting to a specific database
fields @timestamp, job_uuid, pid, comm, dst_addr, dst_port
| filter dst_addr = "10.0.1.50" and dst_port = 5432
| sort @timestamp desc
# Correlate process executions with network activity
fields @timestamp, @logStream
| filter @logStream like "-exec-events" or @logStream like "-connect-events"
| sort @timestamp desc
| limit 200
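The same queries can be run non-interactively; a sketch using the AWS CLI (log group name and time window are illustrative):
# Start an Insights query and fetch its results
QUERY_ID=$(aws logs start-query \
  --log-group-name /joblet/8eb41e22-2940-4f83-9066-7d739d057ad2 \
  --start-time $(date -d '1 hour ago' +%s) --end-time $(date +%s) \
  --query-string 'fields @timestamp, pid, filename | filter @logStream like "-exec-events" | limit 20' \
  --output text --query queryId)
sleep 5   # give the query time to complete
aws logs get-query-results --query-id "$QUERY_ID"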
Local Storage (when CloudWatch is disabled):
eBPF events are stored locally in compressed JSONL format:
/opt/joblet/events/{job-uuid}/
├── exec_events.jsonl.gz # Process execution events
└── connect_events.jsonl.gz # Network connection events
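These compressed JSONL files can be inspected directly with standard tools (the job UUID is illustrative):
# Dump locally stored exec events for a job
zcat /opt/joblet/events/f47ac10b-*/exec_events.jsonl.gz | jq .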
Monitoring eBPF Event Volume:
# Check eBPF event storage usage
ssh server "du -sh /opt/joblet/events/*"
# Monitor event write rate (from persist logs)
ssh server "journalctl -u persist | grep 'Wrote.*events' | tail -20"
Configuration:
eBPF telematics is configured in joblet:
# /opt/joblet/config/config.yml
telemetry:
activity:
enabled: true # Enable eBPF tracking
events:
exec: true # Track process executions
connect: true # Track network connections
file: false # File access (high volume, disabled by default)
CloudWatch storage is configured in persist:
# /opt/joblet/config/persist.yml
storage:
type: cloudwatch # or "local" for standalone VMs
cloudwatch:
region: us-west-2
log_group_prefix: /joblet
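To confirm events are actually arriving after enabling CloudWatch storage, list the most recently written streams in the node's log group (node ID illustrative):
# Show the five most recently active streams
aws logs describe-log-streams \
  --log-group-name /joblet/8eb41e22-2940-4f83-9066-7d739d057ad2 \
  --order-by LastEventTime --descending --max-items 5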
For additional help, run rnx monitor --help or see the troubleshooting guide.