Native Linux Microcontainers

Joblet is a micro-container runtime for running Linux jobs with: Process and filesystem isolation (PID namespace, chroot) Fine-grained CPU, memory, and IO throttling (cgroups v2) Secure job execution with mTLS and RBAC Built-in scheduler, SSE log streaming, and multi-core pinning Ideal for: Agentic AI Workloads (Untrusted code)


Project maintained by ehsaniara Hosted on GitHub Pages — Theme by mattgraham

Remote System Monitoring Guide

Comprehensive guide to monitoring remote joblet server resources and performance using RNX’s client-side monitoring capabilities.

Table of Contents

Overview

RNX provides comprehensive remote monitoring capabilities from your client machine/workstation that track joblet server resources:

Key Features

Remote Monitoring: Monitor joblet server resources from your local workstation
Client-Server Architecture: Secure gRPC communication with mTLS authentication
Volume Tracking: Automatic detection and monitoring of server-side joblet volumes
Cloud Detection: Support for AWS, GCP, Azure, KVM, and bare metal server detection
JSON Output: UI-compatible format for dashboards and monitoring tools
Resource Filtering: Monitor specific server resources (CPU, memory, disk, network)
Process Analysis: Top consumers by CPU and memory usage on the server

Client-Server Architecture

How Remote Monitoring Works

┌─────────────────────┐    gRPC/mTLS     ┌─────────────────────┐
│   Client Machine    │ ◄──────────────► │   Joblet Server     │
│                     │                  │                     │
│  ┌───────────────┐  │                  │  ┌───────────────┐  │
│  │ rnx monitor   │  │   Monitor Req    │  │ Monitoring    │  │
│  │ (from laptop/ │  │ ──────────────►  │  │ Service       │  │
│  │ workstation)  │  │                  │  │               │  │
│  │               │  │   Metrics Data   │  │ Collects:     │  │
│  │ Displays:     │  │ ◄──────────────  │  │ - CPU/Memory  │  │
│  │ - Server CPU  │  │                  │  │ - Disk Usage  │  │
│  │ - Server Mem  │  │                  │  │ - Volumes     │  │
│  │ - Server Disk │  │                  │  │ - Processes   │  │
│  │ - Volumes     │  │                  │  │ - Network     │  │
│  └───────────────┘  │                  │  └───────────────┘  │
└─────────────────────┘                  └─────────────────────┘

Configuration Requirements

Client Side (Your Workstation):

Server Side (Joblet Host):

Multi-Node Monitoring

Monitor different joblet servers from a single client:

# Monitor production server
rnx --node=production monitor status

# Monitor staging server  
rnx --node=staging monitor status

# Monitor development server
rnx --node=dev monitor watch --interval=5

Quick Start

Basic Remote Server Status

# Get comprehensive overview of joblet server resources
rnx monitor status

# JSON output for dashboards/APIs (server metrics)
rnx monitor status --json

# Monitor specific joblet server node
rnx --node=production monitor status

Real-time Server Monitoring

# Watch all server metrics from your workstation (5s refresh)
rnx monitor watch

# Faster refresh rate for real-time server monitoring
rnx monitor watch --interval=2

# Monitor specific server resources remotely
rnx monitor watch --filter=cpu,memory,disk

Current Server Metrics

# Show current server metrics with top processes
rnx monitor top

# Filter by server resource type
rnx monitor top --filter=disk,network

# JSON output for monitoring tools (server data)
rnx monitor top --json

Monitoring Commands

rnx monitor status

Displays comprehensive remote server status including all server resources and joblet volumes.

Features:

Enhanced Display (v4.7.2+):

The status command now includes:

  1. Joblet Server Version Information - Shows version, git tag, commit, build date, Go version
  2. Network Interface Details - Each interface displays:
    • IP address (intelligently mapped to interfaces)
    • MAC address (hardware address)
    • RX/TX statistics with real-time rates
    • Packet counts and error tracking

Usage:

rnx monitor status                    # Server status with version info
rnx monitor status --json            # JSON format (server data)
rnx --node=production monitor status # Specific server node

Example Output:

System Status - 2025-10-08T18:45:14Z
Available: true

Host Information:
  Hostname:     joblet-server
  OS:           Ubuntu 22.04.2 LTS
  Kernel:       5.15.0-153-generic
  Architecture: amd64
  Uptime:       33d 4h 58m
  Node ID:      8eb41e22-2940-4f83-9066-7d739d057ad2
  Server IPs:   192.168.1.161, 172.20.0.1
  MAC Addresses: 5e:9f:b0:c0:61:22, 1e:45:87:fe:bc:53

Joblet Server:
  Version:      v4.7.2
  Git Tag:      v4.7.2
  Git Commit:   00df3a5ee3d6ae7e25078d610e05b977cd8a1812
  Build Date:   2025-10-08T18:44:40Z
  Go Version:   go1.24.0
  Platform:     linux/amd64

Network Interfaces:
  ens18:
    IP:   192.168.1.161
    MAC:  5e:9f:b0:c0:61:22
    RX:   16.1 GB (10264780 packets, 0 errors)
    TX:   986.9 MB (3558641 packets, 0 errors)
    Rate: RX 167 B/s TX 699 B/s
  joblet0:
    IP:   172.20.0.1
    MAC:  1e:45:87:fe:bc:53
    RX:   5.7 MB (73297 packets, 0 errors)
    TX:   6.7 MB (73490 packets, 0 errors)
    Rate: RX 0 B/s TX 0 B/s

rnx monitor top

Shows current remote server metrics in a condensed format with top resource consumers.

Features:

Usage:

rnx monitor top                          # All server metrics
rnx monitor top --filter=cpu,memory      # Specific server metrics only
rnx monitor top --json                   # JSON output (server data)

rnx monitor watch

Real-time remote server monitoring with configurable refresh intervals.

Features:

Usage:

rnx monitor watch                            # Default 5s server monitoring
rnx monitor watch --interval=1               # 1s server refresh
rnx monitor watch --filter=disk,network      # Specific server resources
rnx monitor watch --compact                  # Compact server format
rnx monitor watch --json --interval=10       # JSON server streaming

Metrics Types

CPU Metrics

Memory Metrics

Disk Metrics

Network Metrics

NetworkCollector Implementation: The monitoring system uses a dedicated NetworkCollector that:

Location: /internal/joblet/monitoring/collectors/network.go

Process Metrics

Volume Metrics

JSON Integration

Output Structure

The --json flag produces structured output optimized for dashboard integration:

{
  "hostInfo": {
    "hostname": "joblet-server",
    "platform": "Ubuntu 22.04.2 LTS", 
    "arch": "amd64",
    "uptime": 152070,
    "cloudProvider": "AWS",
    "instanceType": "t3.medium",
    "region": "us-east-1"
  },
  "cpuInfo": {
    "cores": 8,
    "usage": 0.15,
    "loadAverage": [0.5, 0.3, 0.2],
    "perCoreUsage": [0.1, 0.2, 0.05, 0.3, 0.18, 0.07, 0.12, 0.09]
  },
  "memoryInfo": {
    "total": 4100255744,
    "used": 378679296,
    "available": 3556278272,
    "percent": 9.23,
    "cached": 1835712512,
    "swap": {
      "total": 2147479552,
      "used": 0,
      "percent": 0
    }
  },
  "disksInfo": {
    "disks": [
      {
        "name": "/dev/sda1",
        "mountpoint": "/",
        "filesystem": "ext4", 
        "size": 19896352768,
        "used": 11143790592,
        "available": 8752562176,
        "percent": 56.01
      },
      {
        "name": "analytics-data",
        "mountpoint": "/opt/joblet/volumes/analytics-data",
        "filesystem": "joblet-volume",
        "size": 1073741824,
        "used": 52428800,
        "available": 1021313024,
        "percent": 4.88
      }
    ],
    "totalSpace": 21936726016,
    "usedSpace": 11196219392
  },
  "networkInfo": {
    "interfaces": [
      {
        "name": "eth0",
        "type": "ethernet",
        "status": "up",
        "rxBytes": 1234567890,
        "txBytes": 987654321,
        "rxPackets": 123456,
        "txPackets": 98765
      }
    ],
    "totalRxBytes": 1234567890,
    "totalTxBytes": 987654321
  },
  "processesInfo": {
    "processes": [
      {
        "pid": 1234,
        "name": "joblet",
        "command": "/opt/joblet/joblet",
        "cpu": 2.5,
        "memory": 1.2,
        "memoryBytes": 49152000,
        "status": "sleeping"
      }
    ],
    "totalProcesses": 149
  }
}

Streaming JSON

For real-time monitoring integrations:

# Stream JSON objects every 10 seconds
rnx monitor watch --json --interval=10

# Process with monitoring tools
rnx monitor watch --json | jq '.cpuInfo.usage'

# Forward to monitoring systems
rnx monitor watch --json --interval=30 | logger -t joblet-metrics

Dashboard Integration

Grafana Integration

Create a data source using the JSON output:

#!/bin/bash
# grafana-collector.sh
while true; do
  rnx monitor status --json > /var/lib/grafana/joblet-metrics.json
  sleep 60
done

Prometheus Integration

Export metrics in Prometheus format:

#!/bin/bash
# prometheus-exporter.sh
METRICS=$(rnx monitor status --json)
CPU_USAGE=$(echo "$METRICS" | jq -r '.cpuInfo.usage')
MEMORY_PERCENT=$(echo "$METRICS" | jq -r '.memoryInfo.percent')

echo "joblet_cpu_usage $CPU_USAGE"  
echo "joblet_memory_percent $MEMORY_PERCENT"

Custom Dashboards

Use the JSON API to build custom monitoring dashboards:

// JavaScript example
async function getJobletMetrics() {
  const { exec } = require('child_process');
  
  return new Promise((resolve, reject) => {
    exec('rnx monitor status --json', (error, stdout) => {
      if (error) reject(error);
      else resolve(JSON.parse(stdout));
    });
  });
}

// Usage
const metrics = await getJobletMetrics();
console.log(`CPU Usage: ${metrics.cpuInfo.usage * 100}%`);
console.log(`Memory Usage: ${metrics.memoryInfo.percent}%`);

Troubleshooting

Common Issues

No Volume Statistics Showing

# Check if volumes exist
rnx volume list

# Create test volume
rnx volume create test-monitoring --size=100MB

# Verify monitoring detects it
rnx monitor status --json | grep "joblet-volume"

High Resource Usage

# Identify resource-heavy processes
rnx monitor top --filter=process

# Monitor specific resources
rnx monitor watch --filter=cpu,memory --interval=1

# Check for resource-intensive jobs
rnx job list --json | jq '.[] | select(.status=="running")'

Network Monitoring Issues

# Check active interfaces
rnx monitor status | grep -A 10 "Network Interfaces"

# Monitor network activity
rnx monitor watch --filter=network --interval=2

Performance Optimization

Reduce Monitoring Overhead

# Use longer intervals for production
rnx monitor watch --interval=30

# Filter to essential metrics only
rnx monitor watch --filter=cpu,memory

# Use compact format for less output
rnx monitor watch --compact

Efficient JSON Processing

# Extract specific metrics only
rnx monitor status --json | jq '.cpuInfo'

# Monitor specific volumes
rnx monitor status --json | jq '.disksInfo.disks[] | select(.filesystem=="joblet-volume")'

Best Practices

1. Regular Monitoring

# Set up automated monitoring
*/5 * * * * rnx monitor status --json > /var/log/joblet/metrics-$(date +%Y%m%d-%H%M).json

2. Resource Thresholds

# Create alerting scripts
#!/bin/bash
CPU_USAGE=$(rnx monitor status --json | jq -r '.cpuInfo.usage')
if (( $(echo "$CPU_USAGE > 0.8" | bc -l) )); then
  echo "ALERT: High CPU usage: $(echo "$CPU_USAGE * 100" | bc)%"
fi

3. Volume Management

# Monitor volume usage regularly
rnx monitor status --json | jq '.disksInfo.disks[] | select(.filesystem=="joblet-volume") | {name, percent}'

# Clean up unused volumes
rnx volume list | grep -v "in-use"

4. Performance Monitoring

# Monitor job performance impact
rnx monitor watch --filter=cpu,memory &
rnx job run --max-cpu=50 heavy-computation.py

5. Historical Tracking

# Log metrics for trend analysis  
rnx monitor status --json | jq '{timestamp: now, cpu: .cpuInfo.usage, memory: .memoryInfo.percent}' >> metrics.jsonl

6. Integration Testing

# Test monitoring integration
rnx monitor status --json | jq . > /dev/null && echo "JSON valid" || echo "JSON invalid"

# Verify all metrics present
REQUIRED_FIELDS="hostInfo cpuInfo memoryInfo disksInfo networkInfo processesInfo"
for field in $REQUIRED_FIELDS; do
  rnx monitor status --json | jq ".$field" > /dev/null || echo "Missing: $field"
done

7. Persist Service Monitoring

The persist service handles historical log and metric storage. Monitor its health and performance:

# Check persist service status (on server)
ssh server "systemctl status persist"

# View persist service logs
ssh server "journalctl -u persist -n 100 -f"

# Check IPC socket connectivity
ssh server "ls -la /opt/joblet/run/persist.sock"

# Monitor storage usage for logs and metrics
ssh server "du -sh /opt/joblet/logs /opt/joblet/metrics"

# Check for persist service errors
ssh server "journalctl -u persist --since '1 hour ago' | grep -i error"

Persist Service Metrics:

The persist service exposes its own metrics for monitoring:

# Persist service health (if gRPC endpoint is enabled)
curl http://server:9093/health

# Prometheus metrics (if enabled)
curl http://server:9092/metrics | grep persist_

Key Metrics to Monitor:

Storage Management:

# Check current storage usage
ssh server "df -h /opt/joblet"

# Find largest log directories
ssh server "du -sh /opt/joblet/logs/* | sort -hr | head -10"

# Check metric storage
ssh server "du -sh /opt/joblet/metrics/* | sort -hr | head -10"

# Monitor storage growth rate
ssh server "watch -n 60 'du -sh /opt/joblet/logs /opt/joblet/metrics'"

Automated Monitoring:

# Log persist metrics to JSONL for analysis
while true; do
  ssh server "du -sk /opt/joblet/logs /opt/joblet/metrics" | \
    awk '{print "{\"timestamp\":" systime() ",\"path\":\"" $2 "\",\"size_kb\":" $1 "}"}' >> persist-metrics.jsonl
  sleep 300  # Every 5 minutes
done

# Alert on high storage usage
THRESHOLD=80  # Alert at 80% usage
USAGE=$(ssh server "df /opt/joblet | tail -1 | awk '{print \$5}' | sed 's/%//'")
if [ "$USAGE" -gt "$THRESHOLD" ]; then
  echo "ALERT: Persist storage at ${USAGE}% (threshold: ${THRESHOLD}%)"
fi

Performance Tuning:

Monitor persist service performance and adjust configuration:

# In /opt/joblet/config/joblet-config.yml
persist:
  writer:
    flush_interval: "1s"      # Increase to reduce I/O, decrease for lower latency
    batch_size: 100           # Higher = better throughput, more memory

  query:
    cache:
      ttl: "5m"               # Cache query results to reduce disk I/O
    stream:
      buffer_size: 1024       # Buffer size for streaming queries

Troubleshooting Persist Service:

# Service not running
ssh server "sudo systemctl restart persist"

# Check if socket exists
ssh server "sudo ls -la /opt/joblet/run/persist.sock"

# Verify socket permissions (should be 600)
ssh server "sudo stat /opt/joblet/run/persist.sock"

# Test IPC connectivity from joblet service
ssh server "sudo lsof | grep persist.sock"

# Check for disk space issues
ssh server "df -h /opt/joblet && df -i /opt/joblet"

Best Practices:

  1. Monitor Storage Growth: Set up alerts for storage thresholds
  2. Regular Cleanup: Configure retention policies to auto-delete old data
  3. Performance Baseline: Establish normal IPC latency and query times
  4. Backup Strategy: Include /opt/joblet/logs and /opt/joblet/metrics in backups
  5. Log Rotation: Ensure persist service logs don’t fill up disk

For additional help, run rnx monitor --help or see the troubleshooting guide.