Job Scheduling

This document provides comprehensive information about Joblet’s job scheduling system, including how to schedule jobs for future execution, architecture details, multi-node behavior, and persistence across restarts.


Overview

Joblet supports scheduling jobs for future execution using RFC3339 timestamps. Scheduled jobs are persisted in DynamoDB, tied to the node that created them, and executed immediately if their scheduled time has already passed when the node comes back up.

Key Characteristics

| Feature       | Behavior                              |
|---------------|---------------------------------------|
| Time Format   | RFC3339 (e.g., 2025-01-15T10:30:00Z)  |
| Timezone      | UTC recommended, local time supported |
| Persistence   | DynamoDB (survives restarts)          |
| Node Affinity | Jobs execute on creating node only    |
| Overdue Jobs  | Execute immediately on recovery       |

Scheduling Jobs

Basic Scheduling

Use the --schedule flag to schedule a job for future execution:

# Schedule for a specific time (UTC)
rnx job run --schedule="2025-01-15T10:30:00Z" python3 process_data.py

# Schedule for a specific time with timezone offset
rnx job run --schedule="2025-01-15T10:30:00-05:00" ./backup.sh

# Schedule with resource limits
rnx job run --schedule="2025-01-15T14:00:00Z" \
    --max-cpu=50 \
    --max-memory=512 \
    python3 heavy_computation.py

Schedule Format

The schedule parameter must be in RFC3339 format:

YYYY-MM-DDTHH:MM:SSZ          # UTC time
YYYY-MM-DDTHH:MM:SS+HH:MM     # With positive offset
YYYY-MM-DDTHH:MM:SS-HH:MM     # With negative offset

Examples:

# UTC time
--schedule="2025-06-15T09:00:00Z"

# US Eastern Daylight Time (EDT, UTC-4)
--schedule="2025-06-15T09:00:00-04:00"

# Central European Summer Time (CEST, UTC+2)
--schedule="2025-06-15T09:00:00+02:00"

Scheduling with File Uploads

Scheduled jobs support file uploads. Files are staged immediately and stored until execution:

# Schedule job with file upload
rnx job run --schedule="2025-01-15T10:00:00Z" \
    --upload=./data.csv:/workspace/data.csv \
    python3 /workspace/process.py

# Multiple files
rnx job run --schedule="2025-01-15T10:00:00Z" \
    --upload=./config.yaml:/app/config.yaml \
    --upload=./script.py:/app/script.py \
    python3 /app/script.py

Response

When a job is scheduled, you receive confirmation with the job ID:

{
  "uuid": "job-abc123-def456",
  "status": "SCHEDULED",
  "scheduledTime": "2025-01-15T10:30:00Z",
  "nodeId": "node-prod-1"
}
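A script or tool that captures this confirmation can parse it with a small struct. The sketch below mirrors only the fields shown above; the actual output may contain additional fields, so treat the struct as illustrative:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// scheduleResponse mirrors the sample confirmation shown above.
type scheduleResponse struct {
	UUID          string `json:"uuid"`
	Status        string `json:"status"`
	ScheduledTime string `json:"scheduledTime"`
	NodeID        string `json:"nodeId"`
}

func main() {
	raw := []byte(`{"uuid":"job-abc123-def456","status":"SCHEDULED","scheduledTime":"2025-01-15T10:30:00Z","nodeId":"node-prod-1"}`)

	var resp scheduleResponse
	if err := json.Unmarshal(raw, &resp); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("job %s is %s on %s at %s\n", resp.UUID, resp.Status, resp.NodeID, resp.ScheduledTime)
}
```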

Architecture

Components

┌─────────────────────────────────────────────────────────────────┐
│                         Joblet Node                              │
│                                                                  │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────────┐  │
│  │   gRPC API   │───▶│   Scheduler  │───▶│ Execution Engine │  │
│  │   (StartJob) │    │              │    │                  │  │
│  └──────────────┘    └──────┬───────┘    └──────────────────┘  │
│                             │                                    │
│                      ┌──────▼───────┐                           │
│                      │ Priority     │                           │
│                      │ Queue        │                           │
│                      │ (Min-Heap)   │                           │
│                      └──────────────┘                           │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
                    ┌──────────────────┐
                    │    DynamoDB      │
                    │  (Persistence)   │
                    └──────────────────┘

Scheduler Design

The scheduler uses a sleep-until-next strategy for efficient CPU usage (see the Go sketch after this list):

  1. Priority Queue: Jobs ordered by scheduled time (min-heap)
  2. Sleep Strategy: Sleeps until next job is due (no polling)
  3. Wake-on-Insert: Wakes immediately when earlier job is added
  4. Graceful Shutdown: Respects stop signals during sleep
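The following Go sketch illustrates this pattern. It is not Joblet's actual implementation (type and function names are illustrative), but it shows how a min-heap plus a buffered wake channel yields sleep-until-next behavior with no polling, immediate execution of overdue jobs, and a clean shutdown path:

```go
package scheduler

import (
	"container/heap"
	"sync"
	"time"
)

// scheduledJob is the minimal information the queue needs.
type scheduledJob struct {
	uuid string
	at   time.Time
}

// jobHeap is a min-heap ordered by scheduled time (earliest first).
type jobHeap []scheduledJob

func (h jobHeap) Len() int           { return len(h) }
func (h jobHeap) Less(i, j int) bool { return h[i].at.Before(h[j].at) }
func (h jobHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *jobHeap) Push(x any)        { *h = append(*h, x.(scheduledJob)) }
func (h *jobHeap) Pop() any {
	old := *h
	x := old[len(old)-1]
	*h = old[:len(old)-1]
	return x
}

type Scheduler struct {
	mu    sync.Mutex
	queue jobHeap
	wake  chan struct{} // capacity 1: signaled when a job is inserted
	stop  chan struct{} // closed on shutdown
	run   func(uuid string)
}

func New(run func(uuid string)) *Scheduler {
	return &Scheduler{
		wake: make(chan struct{}, 1),
		stop: make(chan struct{}),
		run:  run,
	}
}

// Add inserts a job and wakes the loop so it re-evaluates the next deadline.
func (s *Scheduler) Add(uuid string, at time.Time) {
	s.mu.Lock()
	heap.Push(&s.queue, scheduledJob{uuid: uuid, at: at})
	s.mu.Unlock()
	select {
	case s.wake <- struct{}{}:
	default: // a wake-up is already pending
	}
}

// Loop sleeps until the next job is due; it never polls.
func (s *Scheduler) Loop() {
	for {
		s.mu.Lock()
		if s.queue.Len() == 0 {
			s.mu.Unlock()
			select {
			case <-s.wake: // a job was added
				continue
			case <-s.stop: // graceful shutdown while idle
				return
			}
		}
		next := s.queue[0]
		s.mu.Unlock()

		// A negative duration (overdue job) fires the timer immediately.
		timer := time.NewTimer(time.Until(next.at))
		select {
		case <-timer.C:
			s.mu.Lock()
			due := heap.Pop(&s.queue).(scheduledJob)
			s.mu.Unlock()
			go s.run(due.uuid)
		case <-s.wake:
			timer.Stop() // an earlier job may have arrived: recompute the deadline
		case <-s.stop:
			timer.Stop() // graceful shutdown while sleeping
			return
		}
	}
}
```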

Job States

SCHEDULED ──▶ INITIALIZING ──▶ RUNNING ──▶ COMPLETED
    │                              │            │
    │                              ▼            ▼
    ▼                           STOPPED      FAILED
 CANCELED

| Status       | Description                            |
|--------------|----------------------------------------|
| SCHEDULED    | Job queued for future execution        |
| INITIALIZING | Job starting (setting up resources)    |
| RUNNING      | Job actively executing                 |
| COMPLETED    | Job finished successfully              |
| FAILED       | Job exited with error                  |
| STOPPED      | Job manually stopped                   |
| CANCELED     | Scheduled job removed before execution |
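Tools that consume job status values (for example from JSON output) can treat these states as a small enum. This Go sketch is illustrative only, not Joblet's source:

```go
package joblet

// JobStatus mirrors the status strings listed above.
type JobStatus string

const (
	StatusScheduled    JobStatus = "SCHEDULED"
	StatusInitializing JobStatus = "INITIALIZING"
	StatusRunning      JobStatus = "RUNNING"
	StatusCompleted    JobStatus = "COMPLETED"
	StatusFailed       JobStatus = "FAILED"
	StatusStopped      JobStatus = "STOPPED"
	StatusCanceled     JobStatus = "CANCELED"
)

// IsTerminal reports whether a job in this state will never transition again.
func (s JobStatus) IsTerminal() bool {
	switch s {
	case StatusCompleted, StatusFailed, StatusStopped, StatusCanceled:
		return true
	default:
		return false
	}
}
```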

Multi-Node Deployments

Node Affinity

In multi-node deployments sharing a DynamoDB table, scheduled jobs are node-specific:

┌─────────────────┐         ┌─────────────────┐
│    Node A       │         │    Node B       │
│  nodeId: "a"    │         │  nodeId: "b"    │
│                 │         │                 │
│  Scheduler:     │         │  Scheduler:     │
│  - job-1 (a) ✓  │         │  - job-3 (b) ✓  │
│  - job-2 (a) ✓  │         │  - job-4 (b) ✓  │
└────────┬────────┘         └────────┬────────┘
         │                           │
         └───────────┬───────────────┘
                     ▼
            ┌─────────────────┐
            │    DynamoDB     │
            │                 │
            │  job-1 (node:a) │
            │  job-2 (node:a) │
            │  job-3 (node:b) │
            │  job-4 (node:b) │
            └─────────────────┘

Behavior:

- Each node's scheduler only loads and executes jobs whose nodeId matches its own
- Jobs created on Node A never execute on Node B, even though both nodes share the same DynamoDB table
- There is no cross-node coordination or reassignment of scheduled jobs

Configuration

Each node must have a unique nodeId in its configuration:

# /opt/joblet/config/joblet-config.yml
server:
  address: "0.0.0.0"
  port: 50051
  nodeId: "node-prod-1"  # Must be unique per node

Important Considerations

If a node goes down permanently:

- Its scheduled jobs remain in DynamoDB with status SCHEDULED
- Other nodes will not pick them up; they run only if a node with the same nodeId comes back online

This is a deliberate design choice. Future orchestration layers may add:

- Leader election and cross-node reassignment of jobs on node failure
- Distributed locking for cross-node coordination

Persistence and Recovery

How Persistence Works

  1. Job Creation: Job stored in DynamoDB with status=SCHEDULED
  2. In-Memory Queue: Job added to scheduler’s priority queue
  3. Execution: Job removed from queue, status → RUNNING
  4. Completion: Status updated to COMPLETED/FAILED

Recovery on Restart

When a Joblet node restarts:

1. Start Joblet
2. SyncFromPersistentState() → Load ALL jobs from DynamoDB
3. RecoverScheduledJobs() → Filter for:
   - status == SCHEDULED
   - nodeId == current node's ID
4. Add matching jobs to scheduler queue
5. Execute overdue jobs immediately

Log output during recovery:

[INFO] recovering scheduled jobs from persistent storage totalJobs=15 nodeId=node-prod-1
[INFO] scheduled job is overdue, will execute immediately job_uuid=job-abc123 overdueBy=5m30s
[DEBUG] recovered scheduled job job_uuid=job-def456 scheduledTime=2025-01-15T10:30:00Z
[INFO] scheduled job recovery completed recovered=3 skipped=0 nodeId=node-prod-1
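The Go sketch below illustrates the filtering in steps 2-3 and the overdue handling above. Function and field names are assumptions rather than Joblet's actual code; only the rules (status == SCHEDULED, matching nodeId, overdue jobs run immediately) come from this document:

```go
package scheduler

import (
	"log"
	"time"
)

// JobRecord holds only the fields this sketch needs from a persisted job;
// the real persisted schema is not shown in this document.
type JobRecord struct {
	UUID          string
	Status        string
	NodeID        string
	ScheduledTime time.Time
}

// recoverScheduledJobs rebuilds the in-memory queue after a restart:
// only SCHEDULED jobs belonging to this node are re-queued, and overdue jobs
// keep their original (past) timestamp so the scheduler fires them immediately.
func recoverScheduledJobs(all []JobRecord, nodeID string, now time.Time, add func(uuid string, at time.Time)) {
	recovered, skipped := 0, 0
	for _, j := range all {
		if j.Status != "SCHEDULED" || j.NodeID != nodeID {
			skipped++
			continue
		}
		if overdue := now.Sub(j.ScheduledTime); overdue > 0 {
			log.Printf("scheduled job is overdue, will execute immediately job_uuid=%s overdueBy=%s", j.UUID, overdue)
		}
		add(j.UUID, j.ScheduledTime)
		recovered++
	}
	log.Printf("scheduled job recovery completed recovered=%d skipped=%d nodeId=%s", recovered, skipped, nodeID)
}
```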

Overdue Job Handling

Jobs whose scheduled time passed during downtime execute immediately:

| Scheduled Time | Node Down | Node Up | Behavior                           |
|----------------|-----------|---------|------------------------------------|
| 10:00          | 09:55     | 10:15   | Executes at 10:15 (15 min overdue) |
| 10:00          | 09:55     | 09:58   | Executes at 10:00 (on time)        |

Monitoring Scheduled Jobs

List Scheduled Jobs

# List all jobs (includes scheduled)
rnx job list

# Filter by status
rnx job list --status=SCHEDULED

# JSON output for scripting
rnx job list --status=SCHEDULED --json

View Scheduled Job Details

# Get job details
rnx job status <job-id>

# Example output
Job ID:     job-abc123-def456
Status:     SCHEDULED
Command:    python3 process_data.py
Scheduled:  2025-01-15T10:30:00Z (in 2h 15m)
Node:       node-prod-1

gRPC API

// List scheduled jobs
rpc ListJobs(ListJobsRequest) returns (ListJobsResponse);

message ListJobsRequest {
  string status_filter = 1;  // "SCHEDULED"
}
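A Go client for the RPC above might look like the sketch below. The generated package path and service/client names are assumptions (they depend on Joblet's proto definitions), the host and certificate paths are placeholders, and only the ListJobs RPC and its status_filter field come from the snippet above. Because Joblet requires mTLS, the client presents its own certificate:

```go
package main

import (
	"context"
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"log"
	"os"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"

	pb "example.com/joblet/api" // hypothetical import path for the generated stubs
)

func main() {
	// Client certificate and key for mTLS (placeholder paths).
	cert, err := tls.LoadX509KeyPair("client-cert.pem", "client-key.pem")
	if err != nil {
		log.Fatal(err)
	}
	caPEM, err := os.ReadFile("ca-cert.pem")
	if err != nil {
		log.Fatal(err)
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caPEM)

	creds := credentials.NewTLS(&tls.Config{
		Certificates: []tls.Certificate{cert},
		RootCAs:      pool,
	})

	conn, err := grpc.Dial("node-prod-1:50051", grpc.WithTransportCredentials(creds))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	client := pb.NewJobServiceClient(conn) // hypothetical service/client name

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	resp, err := client.ListJobs(ctx, &pb.ListJobsRequest{StatusFilter: "SCHEDULED"})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(resp) // ListJobsResponse fields are not shown in this document
}
```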

Canceling Scheduled Jobs

Cancel Before Execution

# Cancel a scheduled job
rnx job stop <job-id>

# Force cancel (if stuck)
rnx job stop --force <job-id>

Result:

- The job is removed from the scheduler queue
- Its status changes to CANCELED and it will not execute

Delete Scheduled Job

# Delete job and its data
rnx job delete <job-id>

Limitations

Current Limitations

| Limitation             | Description                | Workaround                            |
|------------------------|----------------------------|---------------------------------------|
| No recurring schedules | Single execution only      | External cron + API calls             |
| Node-specific          | Jobs tied to creating node | Design for node affinity              |
| No distributed locking | No cross-node coordination | Future orchestration layer            |
| In-memory queue        | Queue rebuilt on restart   | Persistent storage ensures durability |

Not Supported

- Recurring (cron-style) schedules
- Cross-node failover or reassignment of scheduled jobs
- Job dependencies and workflows

Future Enhancements

Planned features for future releases:

  1. Cron-style recurring schedules
  2. Job orchestration layer with leader election
  3. Cross-node job reassignment on failure
  4. Job dependencies and workflows

Best Practices

Time Specification

# GOOD: Use UTC for consistency
rnx job run --schedule="2025-01-15T10:00:00Z" ./script.sh

# GOOD: Explicit timezone if needed
rnx job run --schedule="2025-01-15T10:00:00-05:00" ./script.sh

# AVOID: Ambiguous local time (depends on server timezone)

Resource Planning

# Set appropriate resource limits for scheduled jobs
rnx job run --schedule="2025-01-15T02:00:00Z" \
    --max-cpu=80 \
    --max-memory=2048 \
    --max-io=50 \
    ./nightly_backup.sh

Monitoring

- Periodically run rnx job list --status=SCHEDULED to confirm jobs are queued as expected
- Watch recovery logs (journalctl -u joblet) after restarts to verify scheduled jobs were re-queued

Multi-Node Deployments

- Give every node a unique nodeId and keep it stable across restarts
- Submit each scheduled job to the node that should execute it, since jobs never move between nodes

Troubleshooting

Job Not Executing at Scheduled Time

Symptoms: Job remains in SCHEDULED status past its scheduled time.

Checks:

  1. Verify node is running: systemctl status joblet
  2. Check scheduler logs: journalctl -u joblet | grep scheduler
  3. Verify job’s nodeId matches current node

Jobs Not Recovered After Restart

Symptoms: Scheduled jobs missing after node restart.

Checks:

  1. Verify DynamoDB connectivity
  2. Check SyncFromPersistentState logs
  3. Confirm nodeId hasn’t changed

# Check recovery logs
journalctl -u joblet | grep -E "(recovering|recovered)"

Duplicate Job Execution

Symptoms: Same job running on multiple nodes.

Cause: Likely nodeId configuration issue.

Fix:

  1. Ensure each node has unique nodeId
  2. Restart affected nodes
  3. Check job’s nodeId field

Scheduled Time in the Past

Behavior: Jobs scheduled for past times execute immediately.

# This will execute immediately
rnx job run --schedule="2020-01-01T00:00:00Z" echo "runs now"

Common Error Messages

| Error                      | Cause                     | Solution                        |
|----------------------------|---------------------------|---------------------------------|
| invalid schedule format    | Non-RFC3339 timestamp     | Use YYYY-MM-DDTHH:MM:SSZ format |
| job not found              | Job deleted or wrong ID   | Verify job ID with rnx job list |
| state client not available | DynamoDB connection issue | Check state service logs        |