Joblet is a micro-container runtime for running Linux jobs with:

- Process and filesystem isolation (PID namespace, chroot)
- Fine-grained CPU, memory, and IO throttling (cgroups v2)
- Secure job execution with mTLS and RBAC
- Built-in scheduler, SSE log streaming, and multi-core pinning

Ideal for: agentic AI workloads (untrusted code).
Note: This document covers log and metrics persistence (the persist service). For job state persistence (DynamoDB/memory), see STATE_PERSISTENCE.md.
The Joblet Persistence Service (persist) is a dedicated microservice that handles historical storage and
querying of job logs and metrics. It runs as a subprocess of the main joblet daemon and provides durable storage with
support for multiple storage backends including local filesystem and AWS CloudWatch.
┌──────────────────────────────────────────────────────────┐
│                   Joblet Main Service                     │
│                                                          │
│  ┌─────────────┐           ┌─────────────────────┐       │
│  │ Job Executor│──────────▶│     IPC Client      │       │
│  │             │   logs    │    (Unix Socket)    │       │
│  └─────────────┘  metrics  └─────────────────────┘       │
│                                       │                  │
└───────────────────────────────────────┼──────────────────┘
                                        │
            Unix Socket: /opt/joblet/run/persist-ipc.sock
                                        │
┌───────────────────────────────────────┼──────────────────┐
│                Joblet Persistence Service                │
│                                       │                  │
│  ┌──────────────┐          ┌──────────▼───────────┐      │
│  │  IPC Server  │─────────▶│   Storage Backend    │      │
│  │   (writes)   │          │       Manager        │      │
│  └──────────────┘          └──────────────────────┘      │
│                                       ▲                  │
│  ┌──────────────┐                     │                  │
│  │ gRPC Server  │─────────────────────┘                  │
│  │  (queries)   │                                        │
│  └──────────────┘                                        │
│          ▲                                               │
└──────────┼───────────────────────────────────────────────┘
           │
           │ Unix Socket: /opt/joblet/run/persist-grpc.sock
           │
┌──────────┼──────────────┐
│  RNX CLI / API Client   │
│  (Historical Queries)   │
└─────────────────────────┘
The service exposes two Unix sockets: /opt/joblet/run/persist-ipc.sock for writes from the joblet daemon and /opt/joblet/run/persist-grpc.sock for historical queries.

Local filesystem backend: file-based storage using gzipped JSON lines (JSONL) format.
Features:
Storage Format:
/opt/joblet/logs/
├── {job-id-1}/
│   ├── stdout.log.gz    # Gzipped JSONL
│   └── stderr.log.gz    # Gzipped JSONL
├── {job-id-2}/
│   ├── stdout.log.gz
│   └── stderr.log.gz
└── ...

/opt/joblet/metrics/
├── {job-id-1}/
│   └── metrics.jsonl.gz
├── {job-id-2}/
│   └── metrics.jsonl.gz
└── ...
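Because both logs and metrics are stored as plain gzipped JSONL, they can be inspected with standard tools. A quick sketch (the exact field names inside each JSON record are not specified here, so treat the output as whatever the service writes):

# Peek at the first few persisted stdout records for a job
zcat /opt/joblet/logs/<job-id>/stdout.log.gz | head -n 5

# Pretty-print metric samples (assumes jq is installed)
zcat /opt/joblet/metrics/<job-id>/metrics.jsonl.gz | jq .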
Configuration:
persist:
  storage:
    type: "local"
    base_dir: "/opt/joblet"
    local:
      logs:
        directory: "/opt/joblet/logs"
        format: "jsonl.gz"
      metrics:
        directory: "/opt/joblet/metrics"
        format: "jsonl.gz"
AWS CloudWatch backend: cloud-native storage using CloudWatch Logs for job logs and CloudWatch Metrics for job metrics.
Features:
Log Organization:
CloudWatch Log Groups (one per node):
└── {log_group_prefix}/{nodeId}/jobs
    ├── Log Stream: {jobId}-stdout
    ├── Log Stream: {jobId}-stderr
    ├── Log Stream: {anotherJobId}-stdout
    └── Log Stream: {anotherJobId}-stderr

CloudWatch Metrics (namespace per deployment):
└── {metric_namespace} (e.g., Joblet/Jobs)
    ├── Metric: CPUUsage
    ├── Metric: MemoryUsage (MB)
    ├── Metric: GPUUsage (%)
    ├── Metric: DiskReadBytes
    ├── Metric: DiskWriteBytes
    ├── Metric: DiskReadOps
    ├── Metric: DiskWriteOps
    ├── Metric: NetworkRxBytes (KB)
    └── Metric: NetworkTxBytes (KB)
        Dimensions: JobID, NodeID, [custom dimensions...]
Multi-Node Architecture:
CloudWatch backend supports distributed deployments with multiple joblet nodes. Each node is identified by a unique
nodeId, ensuring logs from different nodes are properly isolated:
Log Groups (one per node):
├── /joblet/node-1/jobs          # All jobs from node-1
│   ├── job-abc-stdout
│   └── job-abc-stderr
├── /joblet/node-2/jobs          # All jobs from node-2
│   ├── job-abc-stdout           # (same jobID, different node)
│   └── job-abc-stderr
└── /joblet/node-3/jobs          # All jobs from node-3
    ├── job-xyz-stdout
    └── job-xyz-stderr

CloudWatch Metrics (shared namespace across all nodes):
└── Joblet/Jobs
    ├── CPUUsage    [JobID=job-abc, NodeID=node-1]
    ├── MemoryUsage [JobID=job-abc, NodeID=node-1]
    ├── CPUUsage    [JobID=job-abc, NodeID=node-2]
    ├── MemoryUsage [JobID=job-abc, NodeID=node-2]
    ├── CPUUsage    [JobID=job-xyz, NodeID=node-3]
    └── MemoryUsage [JobID=job-xyz, NodeID=node-3]
This keeps each node's logs isolated in its own log group while all metrics remain queryable in a single shared namespace, filtered by the NodeID dimension.
Authentication:
The CloudWatch backend uses the AWS default credential chain (secure, no credentials in config files):

- EC2 IAM role: the instance automatically uses its attached IAM role; no configuration needed.
- Environment variables:
  export AWS_ACCESS_KEY_ID=xxx
  export AWS_SECRET_ACCESS_KEY=yyy
  export AWS_REGION=us-east-1
- Shared credentials file: ~/.aws/credentials
Required IAM Permissions:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents",
        "logs:PutRetentionPolicy",
        "logs:DescribeLogStreams",
        "logs:GetLogEvents",
        "logs:FilterLogEvents",
        "logs:DeleteLogGroup",
        "logs:DeleteLogStream",
        "cloudwatch:PutMetricData",
        "cloudwatch:GetMetricStatistics",
        "cloudwatch:ListMetrics",
        "ec2:DescribeRegions"
      ],
      "Resource": "*"
    }
  ]
}
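To attach this policy as an inline policy on the instance role joblet runs under, something like the following works; the role and policy names are examples (they match the ones referenced in the troubleshooting section below):

# Save the policy above as joblet-logs-policy.json, then attach it inline
aws iam put-role-policy \
  --role-name joblet-ec2-role \
  --policy-name joblet-logs-policy \
  --policy-document file://joblet-logs-policy.json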
Configuration:
server:
  nodeId: "node-1"  # REQUIRED: Unique identifier for this node

persist:
  storage:
    type: "cloudwatch"
    cloudwatch:
      # Region auto-detection (recommended for EC2)
      region: ""  # Empty = auto-detect from EC2 metadata, falls back to us-east-1
      # Or specify explicitly
      # region: "us-east-1"

      # Log group organization (one per node)
      log_group_prefix: "/joblet"  # Creates: /joblet/{nodeId}/jobs
      # Note: log_stream_prefix is deprecated - streams are named: {jobId}-{streamType}

      # Metrics configuration
      metric_namespace: "Joblet/Production"  # CloudWatch Metrics namespace
      metric_dimensions:                     # Additional custom dimensions
        Environment: "production"
        Cluster: "main-cluster"

      # Batch settings (CloudWatch API limits)
      log_batch_size: 100    # Max: 10,000 events per batch
      metric_batch_size: 20  # Max: 1,000 data points per batch

      # Retention settings
      log_retention_days: 7  # Log retention in days (default: 7)
      # Valid values: 1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545, 731, 1827, 3653
      # 0 or not set = default to 7 days
      # -1 = never expire (infinite retention, can be expensive!)
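Once a job has run with this configuration, you can confirm that the per-node log group and per-job streams were created (node-1 below matches the nodeId above):

# Log group created for this node
aws logs describe-log-groups --log-group-name-prefix "/joblet/node-1"

# Per-job streams inside the node's log group
aws logs describe-log-streams --log-group-name "/joblet/node-1/jobs"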
Log Retention:
CloudWatch Logs retention controls how long your logs are stored:
- Valid values (days): 1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545, 731, 1827, 3653
- 0 or unset: defaults to 7 days
- -1: never expire (not recommended due to cost)

Example retention strategies:
# Development - 1 day retention
log_retention_days: 1
# Production - 30 days retention
log_retention_days: 30
# Compliance - 1 year retention
log_retention_days: 365
# Infinite (expensive!)
log_retention_days: -1
Note: CloudWatch Metrics retention is fixed at 15 months and cannot be configured.
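To double-check the retention that was actually applied to each node's log group:

# Show retention (in days) for all joblet log groups
aws logs describe-log-groups \
  --log-group-name-prefix "/joblet/" \
  --query 'logGroups[].[logGroupName,retentionInDays]' \
  --output table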
Auto-Detection Features:
- Region: queried from the EC2 instance metadata endpoint (http://169.254.169.254/latest/meta-data/placement/region); falls back to us-east-1 if not running on EC2

Monitoring:
View logs in AWS Console:
CloudWatch β Log Groups β /joblet/{nodeId}/jobs
View metrics in AWS Console:
CloudWatch β Metrics β Custom Namespaces β Joblet/Jobs
Query logs using AWS CLI:
# Get logs for a specific job on node-1
aws logs get-log-events \
--log-group-name "/joblet/node-1/jobs" \
--log-stream-name "my-job-id-stdout"
# Filter logs across all nodes
aws logs filter-log-events \
--log-group-name-prefix "/joblet/" \
--filter-pattern "ERROR"
Query metrics using AWS CLI:
# Get CPU usage for a specific job
aws cloudwatch get-metric-statistics \
--namespace "Joblet/Jobs" \
--metric-name "CPUUsage" \
--dimensions Name=JobID,Value=my-job-id Name=NodeID,Value=node-1 \
--start-time 2024-01-01T00:00:00Z \
--end-time 2024-01-01T23:59:59Z \
--period 60 \
--statistics Average
# List all metrics for a job
aws cloudwatch list-metrics \
--namespace "Joblet/Jobs" \
--dimensions Name=JobID,Value=my-job-id
# Get memory usage with custom dimensions
aws cloudwatch get-metric-statistics \
--namespace "Joblet/Production" \
--metric-name "MemoryUsage" \
--dimensions Name=JobID,Value=my-job-id Name=NodeID,Value=node-1 Name=Environment,Value=production \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Average,Maximum,Minimum
S3 backend (v2.1+): object storage for long-term archival.
The persistence service shares the same configuration file as the main joblet daemon (/opt/joblet/joblet-config.yml).
Configuration is nested under the persist: section:
# /opt/joblet/joblet-config.yml
version: "3.0"

# Node identification (shared with main service)
server:
  nodeId: "production-node-1"  # Used by CloudWatch backend
  address: "0.0.0.0"
  port: 50051

# Main joblet configuration
joblet:
  # ... main service config ...

# Persistence service configuration (nested)
persist:
  server:
    grpc_socket: "/opt/joblet/run/persist-grpc.sock"  # Unix socket for queries
    max_connections: 500

  ipc:
    socket: "/opt/joblet/run/persist-ipc.sock"  # Unix socket for writes
    max_connections: 10
    max_message_size: 134217728  # 128MB

  storage:
    type: "cloudwatch"  # or "local", "s3"
    base_dir: "/opt/joblet"

    # Backend-specific configuration
    local:
      # ... local config ...
    cloudwatch:
      # ... cloudwatch config ...

# Logging (inherited by persist service)
logging:
  level: "INFO"
  format: "text"
  output: "stdout"

# Security (inherited by persist service)
security:
  serverCert: "..."
  serverKey: "..."
  caCert: "..."
The persistence service inherits several settings from the parent configuration:
- logging.level
- logging.format
- logging.output
- security.serverCert (for TLS)
- security.serverKey
- security.caCert
- server.nodeId (for CloudWatch multi-node support)
- persist.storage.base_dir (defaults to the parent's base directory)

⚠️ IMPORTANT: Persistence configuration affects both storage AND buffering behavior.
The persistence service is controlled by the ipc.enabled setting in the main joblet configuration (not under the persist: section):
# Main joblet configuration
ipc:
  enabled: true  # Enable persistence + in-memory buffering
  socket: "/opt/joblet/run/persist-ipc.sock"
  buffer_size: 10000
  reconnect_delay: "5s"
  max_reconnects: 0
Persistence enabled (ipc.enabled: true)

Behavior:
Use Cases:
Memory Impact:
β οΈ CRITICAL REQUIREMENT: Mandatory Persist Service Health Check
When persistence is enabled (ipc.enabled: true), the persist service MUST be running and healthy before joblet can start. This is a fail-fast design to prevent joblet from running in a degraded state.
Startup Behavior:
"persist service is ready and healthy""FATAL: persist service is not available but ipc.enabled=true"Why This Matters:
Troubleshooting Startup Failures:
# Check if persist subprocess started
ps aux | grep persist
# Check Unix socket exists
ls -la /opt/joblet/run/persist-grpc.sock
# View startup logs
journalctl -u joblet -n 50 --no-pager
# Common issues:
# 1. Persist service crashed on startup (check logs)
# 2. Unix socket permissions issue (check /opt/joblet/run ownership)
# 3. Configuration error in persist section (check syntax)
If You Don't Need Persistence:
Set ipc.enabled: false to disable the requirement entirely. Joblet will skip persist service connection and use live-streaming-only mode (no buffering, no historical data).
Persistence disabled (ipc.enabled: false)

Behavior:
Use Cases:
Advantages:
Limitations:
version: "3.0"

server:
  nodeId: "dev-node-1"
  address: "0.0.0.0"
  port: 50051

# Disable persistence entirely
ipc:
  enabled: false  # NO buffering, NO persistence, live streaming only

# The persist: section can be omitted entirely when disabled
# If present, it will be ignored since ipc.enabled: false
# 1. Update configuration
sudo vi /opt/joblet/joblet-config.yml
# Change:
#   ipc.enabled: false → ipc.enabled: true
# 2. Add persist service configuration
# (see "Unified Configuration File" section above)
# 3. Restart joblet service
sudo systemctl restart joblet
# 4. Verify persist subprocess started
sudo systemctl status joblet
# Look for: "persist service started successfully"
# 5. Check Unix sockets created
ls -la /opt/joblet/run/
# Expected: persist-ipc.sock, persist-grpc.sock
Note: Existing running jobs are not affected. The new buffering/persistence behavior applies only to jobs started after the configuration change.
Best for:
persist:
  storage:
    type: "local"
    base_dir: "/opt/joblet"
Best for:
server:
  nodeId: "prod-node-1"

persist:
  storage:
    type: "cloudwatch"
    cloudwatch:
      region: ""  # Auto-detect
      log_group_prefix: "/joblet"
Best for:
Node 1 Configuration:
server:
  nodeId: "cluster-node-1"
  address: "10.0.1.10"

persist:
  storage:
    type: "cloudwatch"
    cloudwatch:
      region: "us-east-1"
      log_group_prefix: "/joblet-cluster"
      metric_dimensions:
        Cluster: "production"
        Node: "node-1"
Node 2 Configuration:
server:
  nodeId: "cluster-node-2"
  address: "10.0.1.11"

persist:
  storage:
    type: "cloudwatch"
    cloudwatch:
      region: "us-east-1"
      log_group_prefix: "/joblet-cluster"
      metric_dimensions:
        Cluster: "production"
        Node: "node-2"
Result in CloudWatch:

Log Groups:
├── /joblet-cluster/cluster-node-1/jobs
│   └── Streams: job-123-stdout, job-123-stderr
└── /joblet-cluster/cluster-node-2/jobs
    └── Streams: job-456-stdout, job-456-stderr

Metrics (namespace: Joblet/Production):
├── CPUUsage    [JobID=job-123, NodeID=cluster-node-1, Cluster=production, Node=node-1]
├── MemoryUsage [JobID=job-123, NodeID=cluster-node-1, Cluster=production, Node=node-1]
├── CPUUsage    [JobID=job-456, NodeID=cluster-node-2, Cluster=production, Node=node-2]
└── MemoryUsage [JobID=job-456, NodeID=cluster-node-2, Cluster=production, Node=node-2]
Used by joblet daemon to write logs/metrics to persistence service.
// Log write message
message LogLine {
  string job_id = 1;
  StreamType stream = 2;  // STDOUT or STDERR
  bytes content = 3;
  int64 timestamp = 4;
  int64 sequence = 5;
}

// Metric write message
message Metric {
  string job_id = 1;
  int64 timestamp = 2;
  double cpu_percent = 3;
  int64 memory_bytes = 4;
  int64 io_read_bytes = 5;
  int64 io_write_bytes = 6;
  // ... additional fields
}
Used by RNX CLI and external clients to query historical data.
service PersistService {
  // Query logs for a job
  rpc QueryLogs(LogQueryRequest) returns (stream LogLine);

  // Query metrics for a job
  rpc QueryMetrics(MetricQueryRequest) returns (stream Metric);

  // Delete job data
  rpc DeleteJobData(DeleteJobDataRequest) returns (DeleteJobDataResponse);
}

message LogQueryRequest {
  string job_id = 1;
  StreamType stream = 2;  // Optional filter
  int64 start_time = 3;   // Unix timestamp
  int64 end_time = 4;     // Unix timestamp
  int32 limit = 5;
  int32 offset = 6;
  string filter = 7;      // Text search filter
}

message MetricQueryRequest {
  string job_id = 1;
  int64 start_time = 2;
  int64 end_time = 3;
  string aggregation = 4;  // "avg", "min", "max", "sum"
  int32 limit = 5;
  int32 offset = 6;
}
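The RNX CLI commands below are the supported way to query this API. For ad-hoc testing, the gRPC socket can also be called directly, for example with grpcurl. Treat this as a sketch: the fully qualified service name (joblet.PersistService here) is an assumption, and whether reflection and TLS are enabled depends on your build, so adjust the service path and swap -plaintext for -cacert/-cert/-key if the socket requires mTLS.

# List exposed services (assumes grpcurl is installed and server reflection is enabled)
grpcurl -plaintext -unix /opt/joblet/run/persist-grpc.sock list

# Query historical logs for a job (service name is an assumption)
grpcurl -plaintext -unix /opt/joblet/run/persist-grpc.sock \
  -d '{"job_id": "my-job-id", "limit": 100}' \
  joblet.PersistService/QueryLogs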
# Get all logs for a job
rnx job log <job-id>
# Get only stderr
rnx job log <job-id> --stream=stderr
# Filter logs
rnx job log <job-id> --filter="ERROR"
# Time range query
rnx job log <job-id> --since="2024-01-01T00:00:00Z"
# Get metrics for a job
rnx job metrics <job-id>
# Aggregated metrics
rnx job metrics <job-id> --aggregate=avg
# Time range
rnx job metrics <job-id> --since="1h" --until="now"
Write Performance:
Read Performance:
Disk Usage:
Typical job with 10,000 log lines:
- Raw JSON: ~5 MB
- Gzipped: ~1 MB
- Storage: ~1 MB per job
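To see what jobs actually consume on a node using the local backend:

# Largest jobs by persisted log size
du -sh /opt/joblet/logs/*/ 2>/dev/null | sort -h | tail -n 10

# Total space used by logs and metrics
du -sh /opt/joblet/logs /opt/joblet/metrics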
Write Performance:
Read Performance:
Cost Considerations:
CloudWatch Logs Pricing (prices vary by region):
- Ingestion: Per GB ingested
- Storage: Per GB/month stored
- Query (Insights): Per GB scanned
CloudWatch Metrics Pricing:
- Standard Metrics: First 10 metrics free, then charged per metric/month
- Custom Metrics: Charged per metric/month
- API Requests: Charged per 1,000 requests
Example: 1000 jobs/day, 10 MB logs/job
Logs with 7-day retention (default):
- Ingestion: 10 GB/day ingested
- Storage: 70 GB (7 days) stored
- Note: Ingestion typically dominates cost
Logs with 30-day retention:
- Ingestion: 10 GB/day (same)
- Storage: 300 GB (30 days) stored
- Note: Higher storage than 7-day retention
Logs with 1-day retention (dev):
- Ingestion: 10 GB/day (same)
- Storage: 10 GB (1 day) stored - minimal
- Note: Lowest storage cost
Metrics (9 metrics per job):
- 9 unique metric names charged
- Dimensions don't multiply the cost (part of metric identity)
Cost comparison:
- 7-day retention: Balanced (logs + metrics)
- 30-day retention: Higher storage costs
- 1-day retention: Minimal storage costs (dev/test)
💡 Cost Optimization: Shorter retention = lower storage costs!
Rate Limiting:
persist:
  storage:
    cloudwatch:
      log_batch_size: 100    # Tune based on log volume
      metric_batch_size: 20  # Tune based on metric frequency
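If batches are being rejected, CloudWatch throttling errors generally surface in the joblet service logs; a simple way to watch for them while tuning the batch sizes:

# Watch the joblet unit for throttling/rate-limit messages
journalctl -u joblet -f | grep -iE "throttl|rate limit"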
# Check if persist service is running
ps aux | grep persist
# Check IPC socket
ls -la /opt/joblet/run/persist-ipc.sock
# Check gRPC socket
ls -la /opt/joblet/run/persist-grpc.sock
# View persist logs
journalctl -u joblet -f | grep persist
Problem: Logs not appearing in CloudWatch
# Check AWS credentials
aws sts get-caller-identity
# Check CloudWatch permissions
aws logs describe-log-groups --log-group-name-prefix="/joblet/"
# Check region configuration
grep -A 5 "cloudwatch:" /opt/joblet/joblet-config.yml
# Enable debug logging
# In joblet-config.yml:
logging:
  level: "DEBUG"
Problem: "Access Denied" errors
Verify IAM permissions:
aws iam get-role-policy --role-name joblet-ec2-role --policy-name joblet-logs-policy
Problem: Region auto-detection failed
Explicitly set region:
persist:
  storage:
    cloudwatch:
      region: "us-east-1"  # Explicit instead of ""
Problem: Disk full
# Check disk usage
du -sh /opt/joblet/logs
du -sh /opt/joblet/metrics
# Clean up old jobs
find /opt/joblet/logs -mindepth 1 -maxdepth 1 -type d -mtime +7 -exec rm -rf {} +
Problem: Permission errors
# Fix ownership
sudo chown -R joblet:joblet /opt/joblet/logs
sudo chown -R joblet:joblet /opt/joblet/metrics
# Fix permissions
sudo chmod -R 755 /opt/joblet/logs
sudo chmod -R 755 /opt/joblet/metrics
# 1. Stop joblet
sudo systemctl stop joblet
# 2. Update configuration
sudo vi /opt/joblet/joblet-config.yml
# Change: type: "local" → type: "cloudwatch"
# 3. Start joblet
sudo systemctl start joblet
# 4. (Optional) Migrate old logs
# Use custom script to read local logs and push to CloudWatch
# 1. Stop joblet
sudo systemctl stop joblet
# 2. Update configuration
sudo vi /opt/joblet/joblet-config.yml
# Change: type: "cloudwatch" → type: "local"
# 3. Create directories
sudo mkdir -p /opt/joblet/logs /opt/joblet/metrics
sudo chown joblet:joblet /opt/joblet/logs /opt/joblet/metrics
# 4. Start joblet
sudo systemctl start joblet
chmod 700 /opt/joblet/logs
chmod 700 /opt/joblet/metrics
Monitor free space on the /opt/joblet partition and prune old data with logrotate or custom cleanup scripts.

To encrypt CloudWatch log groups with a KMS key:

aws logs associate-kms-key \
--log-group-name "/joblet" \
--kms-key-id "arn:aws:kms:region:account:key/xxx"
For local monitoring, watch the storage directories (/opt/joblet/logs, /opt/joblet/metrics), disk I/O (iostat -x 1), and open file handles (lsof | grep /opt/joblet | wc -l).

Built-in CloudWatch metrics:
- IncomingBytes: log ingestion volume
- IncomingLogEvents: number of log events
- ThrottledRequests: rate limiting

Custom metrics (via CloudWatch Metrics API):
CloudWatch Metrics Features: