Joblet is a micro-container runtime for running Linux jobs with:
- Process and filesystem isolation (PID namespace, chroot)
- Fine-grained CPU, memory, and I/O throttling (cgroups v2)
- Secure job execution with mTLS and RBAC
- Built-in scheduler, SSE log streaming, and multi-core pinning

Ideal for: agentic AI workloads (untrusted code)
This API reference documents the Joblet gRPC interface: service definitions, message schemas, authentication and authorization, and practical implementation examples for client development.
Joblet provides two gRPC API services:
Both services use gRPC with Protocol Buffers for efficient message serialization. The API is secured with mutual TLS authentication and role-based access control.
- Main Joblet Service: job execution and management
- Persist Service: historical log and metric storage (IPC socket: /opt/joblet/run/persist-ipc.sock)

Main Service:
Server Address: <host>:50051
TLS: Required (mutual authentication)
Client Certificates: Required for all operations
Platform: Linux server required for job execution
Persist Service:
Unix Socket: /opt/joblet/run/persist-grpc.sock (optional, gRPC queries)
IPC Socket: /opt/joblet/run/persist-ipc.sock (internal communication)
TLS: Optional (disabled by default for localhost)
Platform: Linux server required
The Joblet API enforces mutual TLS authentication for all client connections, requiring valid X.509 client certificates issued by the same Certificate Authority (CA) that signed the server certificate.
Client Certificate Subject Format:
CN=<client-name>, OU=<role>, O=<organization>
Supported Roles:
- OU=admin → Full access (all operations)
- OU=viewer → Read-only access (get, list, stream)
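The role is read from the OU field of the certificate subject. A minimal sketch of that extraction in Python (the helper name and validation are illustrative, not Joblet's actual implementation):

```python
def role_from_subject(subject: str) -> str:
    """Extract the OU (role) from a subject like 'CN=alice, OU=admin, O=acme'."""
    fields = dict(
        part.strip().split("=", 1)
        for part in subject.split(",")
        if "=" in part
    )
    role = fields.get("OU", "")
    if role not in ("admin", "viewer"):
        raise ValueError(f"unsupported role: {role!r}")
    return role

print(role_from_subject("CN=alice, OU=admin, O=acme"))  # admin
print(role_from_subject("CN=bob, OU=viewer, O=acme"))   # viewer
```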
certs/
├── ca-cert.pem # Certificate Authority
├── client-cert.pem # Client certificate (admin or viewer)
└── client-key.pem # Client private key
| Role | RunJob | GetJobStatus | StopJob | CancelJob | DeleteJob | DeleteAllJobs | ListJobs | GetJobLogs | GetJobMetrics |
|---|---|---|---|---|---|---|---|---|---|
| admin | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| viewer | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ | ✅ |
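The matrix above can be encoded as a simple lookup. This Python sketch (operation names come from the table; the data structure itself is an assumption, not Joblet's server code) shows how a client or test harness might pre-check permissions:

```python
# Permission matrix mirroring the RBAC table above.
READ_ONLY = {"GetJobStatus", "ListJobs", "GetJobLogs", "GetJobMetrics"}
ALL_OPS = READ_ONLY | {"RunJob", "StopJob", "CancelJob", "DeleteJob", "DeleteAllJobs"}

PERMISSIONS = {"admin": ALL_OPS, "viewer": READ_ONLY}

def is_allowed(role: str, operation: str) -> bool:
    """True if the role may perform the operation; unknown roles get nothing."""
    return operation in PERMISSIONS.get(role, set())

print(is_allowed("admin", "RunJob"))    # True
print(is_allowed("viewer", "RunJob"))   # False
```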
syntax = "proto3";
package joblet;
service JobService {
// Job operations
  rpc RunJob(RunJobReq) returns (RunJobRes);
rpc GetJobStatus(GetJobStatusReq) returns (GetJobStatusRes);
rpc StopJob(StopJobReq) returns (StopJobRes);
rpc CancelJob(CancelJobReq) returns (CancelJobRes);
rpc DeleteJob(DeleteJobReq) returns (DeleteJobRes);
rpc DeleteAllJobs(DeleteAllJobsReq) returns (DeleteAllJobsRes);
rpc GetJobLogs(GetJobLogsReq) returns (stream DataChunk);
rpc ListJobs(EmptyRequest) returns (Jobs);
// Job metrics operations
rpc GetJobMetrics(JobMetricsRequest) returns (stream JobMetricsSample);
// Workflow operations
rpc RunWorkflow(RunWorkflowRequest) returns (RunWorkflowResponse);
rpc GetWorkflowStatus(GetWorkflowStatusRequest) returns (GetWorkflowStatusResponse);
rpc ListWorkflows(ListWorkflowsRequest) returns (ListWorkflowsResponse);
rpc GetWorkflowJobs(GetWorkflowJobsRequest) returns (GetWorkflowJobsResponse);
}
Creates and starts a new job with specified command and resource limits. Jobs execute on the Linux server with complete process isolation.
Authorization: Admin only
rpc RunJob(RunJobReq) returns (RunJobRes);
Request Parameters:
- command (string): Command to execute (required)
- args (repeated string): Command arguments (optional)
- maxCPU (int32): CPU limit percentage (optional, default: 100)
- maxMemory (int32): Memory limit in MB (optional, default: 512)
- maxIOBPS (int32): I/O bandwidth limit in bytes/sec (optional, default: 0 = unlimited)

Job Execution Environment:
Response:
Example:
# CLI
rnx job run --max-cpu=50 --max-memory=512 python3 script.py
# Expected Response
Job started:
ID: f47ac10b-58cc-4372-a567-0e02b2c3d479
Command: python3 script.py
Status: INITIALIZING
StartTime: 2024-01-15T10:30:00Z
MaxCPU: 50
MaxMemory: 512
Network: host (shared with system)
Retrieves detailed information about a specific job, including current status, resource usage, and execution metadata.
Authorization: Admin, Viewer
rpc GetJobStatus(GetJobStatusReq) returns (GetJobStatusRes);
Request Parameters:
- id (string): Job UUID (required)

Response:
Example:
# CLI
rnx job status f47ac10b-58cc-4372-a567-0e02b2c3d479
# Expected Response
Id: f47ac10b-58cc-4372-a567-0e02b2c3d479
Command: python3 script.py
Status: RUNNING
Started At: 2024-01-15T10:30:00Z
Ended At:
MaxCPU: 50
MaxMemory: 512
MaxIOBPS: 0
ExitCode: 0
Terminates a running job using graceful shutdown (SIGTERM) followed by force termination (SIGKILL) if necessary.
Authorization: Admin only
rpc StopJob(StopJobReq) returns (StopJobRes);
Request Parameters:
- id (string): Job UUID (required)

Termination Process:
1. SIGTERM sent to the process group
2. SIGKILL if the process is still alive
3. Job status set to STOPPED

Response:
Example:
# CLI
rnx job stop f47ac10b-58cc-4372-a567-0e02b2c3d479
# Expected Response
Job stopped successfully:
ID: f47ac10b-58cc-4372-a567-0e02b2c3d479
Status: STOPPED
ExitCode: -1
EndTime: 2024-01-15T10:45:00Z
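The SIGTERM-then-SIGKILL sequence can be sketched with Python's subprocess module. Note that Joblet signals the whole process group server-side; this illustrative client-independent sketch signals a single local process:

```python
import signal
import subprocess

def stop_job(proc: subprocess.Popen, grace_seconds: float = 5.0) -> int:
    """Send SIGTERM, then SIGKILL if the process survives the grace period."""
    proc.send_signal(signal.SIGTERM)
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL, no further grace
        return proc.wait()

p = subprocess.Popen(["sleep", "30"])
code = stop_job(p, grace_seconds=2.0)
print(code)  # -15 on Linux: negative signal number means terminated by SIGTERM
```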
Cancels a scheduled job before it starts executing. This is specifically for jobs in SCHEDULED status.
Authorization: Admin only
rpc CancelJob(CancelJobReq) returns (CancelJobRes);
Request Parameters:
- uuid (string): Job UUID (required)

Behavior:
- Job status becomes CANCELED (not STOPPED)

Status Semantics:
- StopJob → for RUNNING jobs (status becomes STOPPED)
- CancelJob → for SCHEDULED jobs (status becomes CANCELED)

Response:
Example:
# CLI
rnx job cancel f47ac10b-58cc-4372-a567-0e02b2c3d479
# Expected Response
Job canceled successfully:
ID: f47ac10b-58cc-4372-a567-0e02b2c3d479
Status: CANCELED
CanceledAt: 2024-01-15T10:45:00Z
Error Conditions:
- NOT_FOUND
- FAILED_PRECONDITION ("Job is not scheduled")
- PERMISSION_DENIED

Permanently deletes a job and all its associated data, including logs, metrics, and metadata.
Authorization: Admin only
rpc DeleteJob(DeleteJobReq) returns (DeleteJobRes);
Request Parameters:
- uuid (string): Job UUID (required)

Deletion Scope:
Restrictions:
Response:
Example:
# CLI
rnx job delete f47ac10b-58cc-4372-a567-0e02b2c3d479
# Expected Response
Job deleted successfully:
ID: f47ac10b-58cc-4372-a567-0e02b2c3d479
Error Conditions:
- NOT_FOUND
- FAILED_PRECONDITION ("Cannot delete running job")
- PERMISSION_DENIED

Deletes all non-running jobs from the system in a single operation. Running and scheduled jobs are preserved.
Authorization: Admin only
rpc DeleteAllJobs(DeleteAllJobsReq) returns (DeleteAllJobsRes);
Request Parameters: None (or empty request)
Behavior:
Response:
- deleted_count (int): Number of jobs deleted
- skipped_count (int): Number of jobs skipped (running/scheduled)
- message (string): Summary message

Example:
# CLI
rnx job delete-all
# Expected Response
Successfully deleted 15 jobs, skipped 3 running/scheduled jobs
Use Cases:
Lists all jobs with their current status and metadata. Useful for monitoring overall system activity.
Authorization: Admin, Viewer
rpc ListJobs(EmptyRequest) returns (Jobs);
Request Parameters: None
Response:
Example:
# CLI
rnx job list
# Expected Response
f47ac10b-58cc-4372-a567-0e02b2c3d479 COMPLETED StartTime: 2024-01-15T10:30:00Z Command: echo hello
6ba7b810-9dad-11d1-80b4-00c04fd430c8 RUNNING StartTime: 2024-01-15T10:35:00Z Command: python3 script.py
6ba7b811-9dad-11d1-80b4-00c04fd430c8 FAILED StartTime: 2024-01-15T10:40:00Z Command: invalid-command
Streams job output in real-time, including historical logs and live updates. Supports multiple concurrent clients streaming the same job.
Authorization: Admin, Viewer
rpc GetJobLogs(GetJobLogsReq) returns (stream DataChunk);
Request Parameters:
- id (string): Job UUID (required)

Streaming Behavior:
Response:
- DataChunk messages containing raw stdout/stderr output

Example:
# CLI
rnx job log f47ac10b-58cc-4372-a567-0e02b2c3d479
# Expected Response (streaming)
Logs for job f47ac10b-58cc-4372-a567-0e02b2c3d479 (Press Ctrl+C to exit if streaming):
Starting script...
Processing item 1
Processing item 2
...
Script completed successfully
Streams resource usage metrics for a job as time-series data. Shows CPU, memory, I/O, network, process, and GPU metrics collected during job execution.
Authorization: Admin, Viewer
rpc GetJobMetrics(JobMetricsRequest) returns (stream JobMetricsSample);
Request Parameters:
- job_uuid (string): Job UUID (required)

Streaming Behavior:
Similar to GetJobLogs, this method streams all metrics from job start:
Metrics Collected:
| Category | Metrics |
|---|---|
| CPU | Usage %, user/system time, throttling |
| Memory | Current/peak usage, anonymous/file cache, page faults |
| I/O | Read/write bandwidth, IOPS, total bytes |
| Network | RX/TX bytes/packets, bandwidth |
| Process | Count, threads, open file descriptors |
| GPU | Utilization, memory, temperature, power (if GPUs allocated) |
Response:
- JobMetricsSample messages, one per collection interval (typically 1-5 seconds)

Storage:
Metrics are persisted server-side as gzipped JSONL files at:
/opt/joblet/metrics/<job-uuid>/<timestamp>.jsonl.gz

Example:
# CLI - Stream metrics for a job
rnx job metrics f47ac10b-58cc-4372-a567-0e02b2c3d479
# CLI - JSON output for analysis
rnx --json job metrics f47ac10b | jq -c '{timestamp, cpu: .cpu.usagePercent, memory: .memory.current}'
# Expected Response (streaming)
Timestamp: 2024-01-15T10:30:01Z
CPU: 45.2%, Memory: 256MB, I/O Read: 10MB/s, I/O Write: 5MB/s
Network RX: 1.2MB/s, Network TX: 0.8MB/s
Timestamp: 2024-01-15T10:30:06Z
CPU: 48.1%, Memory: 312MB, I/O Read: 12MB/s, I/O Write: 6MB/s
Network RX: 1.5MB/s, Network TX: 1.0MB/s
...
Use Cases:
Core job representation used across all API responses.
message Job {
string id = 1; // Unique job UUID identifier
string name = 2; // Readable job name (from workflows, empty for individual jobs)
string command = 3; // Command being executed
repeated string args = 4; // Command arguments
int32 maxCPU = 5; // CPU limit in percent
string cpuCores = 6; // CPU core binding specification
int32 maxMemory = 7; // Memory limit in MB
int32 maxIOBPS = 8; // IO limit in bytes per second
string status = 9; // Current job status
string startTime = 10; // Start time (RFC3339 format)
string endTime = 11; // End time (RFC3339 format, empty if running)
int32 exitCode = 12; // Process exit code
string scheduledTime = 13; // Scheduled execution time (RFC3339 format)
string runtime = 14; // Runtime specification used
map<string, string> environment = 15; // Regular environment variables (visible)
map<string, string> secret_environment = 16; // Secret environment variables (masked)
// Additional fields
string nodeId = 20; // Unique identifier of the Joblet node that executed this job
}
INITIALIZING - Job created, setting up isolation and resources
RUNNING - Process executing in isolated namespace
COMPLETED - Process finished successfully (exit code 0)
FAILED - Process finished with error (exit code != 0)
STOPPED - Process terminated by user request or timeout
SCHEDULED - Job queued for future execution (see CancelJob)
CANCELED - Scheduled job canceled before execution
Default values when not specified in configuration (joblet-config.yml):
DefaultCPULimitPercent = 100 // 100% of one core
DefaultMemoryLimitMB = 512 // 512 MB
DefaultIOBPS = 0 // Unlimited I/O
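Applying these defaults is a one-liner per field. A hypothetical helper (not part of Joblet) that treats zero/unset values as "use the default":

```python
# Defaults mirroring the configuration values above.
DEFAULT_CPU_PERCENT = 100  # 100% of one core
DEFAULT_MEMORY_MB = 512    # 512 MB
DEFAULT_IO_BPS = 0         # unlimited I/O

def effective_limits(max_cpu: int = 0, max_memory: int = 0, max_iobps: int = 0) -> dict:
    """Resolve request fields to effective limits; 0/unset falls back to defaults."""
    return {
        "maxCPU": max_cpu or DEFAULT_CPU_PERCENT,
        "maxMemory": max_memory or DEFAULT_MEMORY_MB,
        "maxIOBPS": max_iobps or DEFAULT_IO_BPS,
    }

print(effective_limits(max_cpu=50))
# {'maxCPU': 50, 'maxMemory': 512, 'maxIOBPS': 0}
```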
message RunJobReq {
string command = 1; // Required: command to execute
repeated string args = 2; // Optional: command arguments
int32 maxCPU = 3; // Optional: CPU limit percentage
int32 maxMemory = 4; // Optional: memory limit in MB
int32 maxIOBPS = 5; // Optional: I/O bandwidth limit
}
Response message for job status requests, including node identification.
message GetJobStatusRes {
string uuid = 1; // Job UUID
string name = 2; // Job name (from workflows, empty for individual jobs)
string command = 3; // Command being executed
repeated string args = 4; // Command arguments
int32 maxCPU = 5; // CPU limit in percent
string cpuCores = 6; // CPU core binding specification
int32 maxMemory = 7; // Memory limit in MB
int64 maxIOBPS = 8; // IO limit in bytes per second
string status = 9; // Current job status
string startTime = 10; // Start time (RFC3339 format)
string endTime = 11; // End time (RFC3339 format, empty if running)
int32 exitCode = 12; // Process exit code
string scheduledTime = 13; // Scheduled execution time (RFC3339 format)
string runtime = 14; // Runtime specification used
map<string, string> environment = 15; // Regular environment variables (visible)
map<string, string> secret_environment = 16; // Secret environment variables (masked)
string network = 17; // Network configuration
repeated string volumes = 18; // Volume names
string workDir = 19; // Working directory
repeated FileUpload uploads = 20; // File uploads
repeated string dependencies = 21; // Job dependencies
string workflowUuid = 22; // Workflow UUID if part of workflow
int32 gpuCount = 23; // Number of GPUs allocated
repeated int32 gpuIndices = 24; // GPU indices allocated
int64 gpuMemoryMB = 25; // GPU memory in MB
string nodeId = 26; // Unique identifier of the Joblet node that executed this job
}
Used for streaming job output with efficient binary transport.
message DataChunk {
bytes payload = 1; // Raw output data (stdout/stderr merged)
}
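Because DataChunk carries raw bytes with no line framing, a single log line can span two chunks. A client-side sketch (illustrative, stdlib-only) that reassembles complete lines from a chunk stream:

```python
from typing import Iterable, Iterator

def chunks_to_lines(chunks: Iterable[bytes]) -> Iterator[str]:
    """Reassemble complete lines from DataChunk payloads; lines may span chunks."""
    buf = b""
    for payload in chunks:
        buf += payload
        while b"\n" in buf:
            line, buf = buf.split(b"\n", 1)
            yield line.decode("utf-8", errors="replace")
    if buf:  # trailing output without a final newline
        yield buf.decode("utf-8", errors="replace")

stream = [b"Starting scr", b"ipt...\nProcessing ", b"item 1\n"]
print(list(chunks_to_lines(stream)))
# ['Starting script...', 'Processing item 1']
```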
| Code | Description | Common Causes |
|---|---|---|
| UNAUTHENTICATED | Invalid or missing client certificate | Certificate expired, wrong CA |
| PERMISSION_DENIED | Insufficient role permissions | Viewer trying admin operation |
| NOT_FOUND | Job not found | Invalid job UUID |
| INTERNAL | Server-side error | Job creation failed, system error |
| CANCELED | Operation canceled | Client disconnected during stream |
| INVALID_ARGUMENT | Invalid request parameters | Empty command, invalid limits |
{
"code": "NOT_FOUND",
"message": "job not found: f47ac10b-58cc-4372-a567-0e02b2c3d479",
"details": []
}
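Clients typically split these codes into retryable and fatal classes. A sketch of one such policy (the classification is a suggestion for client authors, not something the API mandates):

```python
# Transient conditions worth retrying with backoff vs. errors the caller must fix.
RETRYABLE = {"INTERNAL", "CANCELED"}
FATAL = {"UNAUTHENTICATED", "PERMISSION_DENIED", "NOT_FOUND", "INVALID_ARGUMENT"}

def should_retry(code: str) -> bool:
    """Decide whether a failed call is worth retrying, by gRPC status code name."""
    if code in RETRYABLE:
        return True
    if code in FATAL:
        return False
    raise ValueError(f"unknown status code: {code}")

print(should_retry("INTERNAL"))           # True
print(should_retry("PERMISSION_DENIED"))  # False
```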
# Missing certificate
Error: failed to extract client role: no TLS information found
# Wrong role (viewer trying to run job)
Error: role viewer is not allowed to perform operation run_job
# Invalid certificate
Error: certificate verify failed: certificate has expired
# Job not found
Error: job not found: f47ac10b-58cc-4372-a567-0e02b2c3d479
# Job not running (for stop operation)
Error: job is not running: 6ba7b810-9dad-11d1-80b4-00c04fd430c8 (current status: COMPLETED)
# Command validation failed
Error: invalid command: command contains dangerous characters
# Resource limits exceeded
Error: job creation failed: maxMemory exceeds system limits
# Linux platform required
Error: job execution requires Linux server (current: darwin)
# Cgroup setup failed
Error: cgroup setup failed: permission denied
# Namespace creation failed
Error: failed to create isolated environment: operation not permitted
--server string Server address (default "localhost:50051")
--cert string Client certificate path (default "certs/client-cert.pem")
--key string Client private key path (default "certs/client-key.pem")
--ca string CA certificate path (default "certs/ca-cert.pem")
Create and start a new job with optional resource limits.
rnx job run [flags] <command> [args...]
Flags:
--max-cpu int Max CPU percentage (default: from config)
--max-memory int Max memory in MB (default: from config)
--max-iobps int Max I/O bytes per second (default: 0=unlimited)
Examples:
rnx job run echo "hello world"
rnx job run --max-cpu=50 python3 script.py
rnx job run --max-memory=1024 java -jar app.jar
rnx job run bash -c "sleep 10 && echo done"
Get detailed information about a job by UUID.
rnx job status <job-uuid>
Example:
rnx job status f47ac10b-58cc-4372-a567-0e02b2c3d479
List all jobs with their current status.
rnx job list
Example:
rnx job list
Stop a running job gracefully (SIGTERM) or forcefully (SIGKILL).
rnx job stop <job-uuid>
Example:
rnx job stop f47ac10b-58cc-4372-a567-0e02b2c3d479
Stream job output in real-time or view historical logs.
rnx job log <job-uuid>
Streams logs from running or completed jobs. Use Ctrl+C to stop following.
Examples:
rnx job log f47ac10b-58cc-4372-a567-0e02b2c3d479 # Stream logs
rnx job log f47ac10b | grep ERROR # Filter output
# Connect to remote Linux server from any platform
rnx --server=prod.example.com:50051 \
--cert=certs/admin-client-cert.pem \
--key=certs/admin-client-key.pem \
run echo "remote execution on Linux"
export JOBLET_SERVER="prod.example.com:50051"
export JOBLET_CERT_PATH="./certs/admin-client-cert.pem"
export JOBLET_KEY_PATH="./certs/admin-client-key.pem"
export JOBLET_CA_PATH="./certs/ca-cert.pem"
rnx job run python3 script.py
Resource limits and timeouts are configured in /opt/joblet/joblet-config.yml:
joblet:
defaultCpuLimit: 100 # Default CPU percentage
defaultMemoryLimit: 512 # Default memory in MB
defaultIoLimit: 0 # Default I/O limit (0=unlimited)
maxConcurrentJobs: 100 # Maximum concurrent jobs
jobTimeout: "1h" # Maximum job runtime
cleanupTimeout: "5s" # Resource cleanup timeout
grpc:
maxRecvMsgSize: 524288 # 512KB max receive message
maxSendMsgSize: 4194304 # 4MB max send message
keepAliveTime: "30s" # Connection keep-alive
The server provides detailed logging for:
# Structured logging with fields
DEBUG - Detailed execution flow and debugging info
INFO - Job lifecycle events and normal operations
WARN - Resource limit violations, slow clients, recoverable errors
ERROR - Job failures, system errors, authentication failures
# Example log entry
[2024-01-15T10:30:00Z] [INFO] job started successfully | jobId=f47ac10b-58cc-4372-a567-0e02b2c3d479 pid=12345 command="python3 script.py" duration=50ms
# Check server health
rnx job list
# Verify certificate and connection
rnx --server=your-server:50051 list
# Monitor service status (systemd)
sudo systemctl status joblet
sudo journalctl -u joblet -f
rnx job list
/sys/fs/cgroup/joblet.slice/

Joblet provides comprehensive workflow orchestration through YAML-defined job dependencies (requires clauses). Workflows enable complex multi-job execution with dependency management, resource isolation, and comprehensive monitoring.

The API provides multiple services with distinct responsibilities:
Handles regular user jobs with production isolation:
service JobService {
// Job execution with production isolation
rpc RunJob(RunJobReq) returns (RunJobRes);
rpc GetJobStatus(GetJobStatusReq) returns (GetJobStatusRes);
rpc StopJob(StopJobReq) returns (StopJobRes);
rpc ListJobs(EmptyRequest) returns (Jobs);
rpc GetJobLogs(GetJobLogsReq) returns (stream DataChunk);
// Workflow execution
rpc RunWorkflow(RunWorkflowRequest) returns (RunWorkflowResponse);
rpc GetWorkflowStatus(GetWorkflowStatusRequest) returns (GetWorkflowStatusResponse);
rpc ListWorkflows(ListWorkflowsRequest) returns (ListWorkflowsResponse);
rpc GetWorkflowJobs(GetWorkflowJobsRequest) returns (GetWorkflowJobsResponse);
}
Handles runtime building with builder chroot access:
service RuntimeService {
// Runtime installation and management
rpc InstallRuntime(InstallRuntimeRequest) returns (InstallRuntimeResponse);
rpc ListRuntimes(ListRuntimesRequest) returns (ListRuntimesResponse);
rpc GetRuntimeInfo(GetRuntimeInfoRequest) returns (GetRuntimeInfoResponse);
rpc TestRuntime(TestRuntimeRequest) returns (TestRuntimeResponse);
}
Key Differences:
- JobType: "standard" → minimal chroot with production isolation
- JobType: "runtime-build" → builder chroot with host OS access

Represents a job within a workflow with dependency information.
message WorkflowJob {
string jobId = 1; // Actual job UUID for started jobs, "0" for non-started jobs
string jobName = 2; // Job name from workflow YAML
string status = 3; // Current job status
repeated string dependencies = 4; // List of job names this job depends on
Timestamp startTime = 5; // Job start time
Timestamp endTime = 6; // Job completion time
int32 exitCode = 7; // Process exit code
}
Job ID Behavior:
- For started jobs, jobId contains the actual job UUID assigned by Joblet (e.g., "f47ac10b-58cc-4372-a567-0e02b2c3d479", "6ba7b810-9dad-11d1-80b4-00c04fd430c8")
- For non-started jobs, jobId shows "0" to indicate the job hasn't been started yet

Provides comprehensive workflow status with job details.
message GetWorkflowStatusResponse {
WorkflowInfo workflow = 1; // Overall workflow information
repeated WorkflowJob jobs = 2; // Detailed job information with dependencies
}
Workflow jobs have Job names derived from YAML job keys:
# workflow.yaml
jobs:
setup-data: # Job name: "setup-data"
command: "python3"
args: ["setup.py"]
process-data: # Job name: "process-data"
command: "python3"
args: ["process.py"]
requires:
- setup-data: "COMPLETED"
Job ID vs Job Name:
Status Display:
JOB ID JOB NAME STATUS EXIT CODE DEPENDENCIES
---------------------------------------------------------------------------------------------------------------------
f47ac10b-58cc-4372-a567-0e02b2c3d479 setup-data COMPLETED 0 -
6ba7b810-9dad-11d1-80b4-00c04fd430c8 process-data RUNNING - setup-data
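The requires clauses form a dependency graph: a job becomes runnable once everything it requires has completed. A Kahn-style ordering sketch (illustrative, not Joblet's actual scheduler), using the job names from the YAML above:

```python
def execution_order(jobs: dict) -> list:
    """Return a valid execution order for {job_name: [required job names]}."""
    order, done = [], set()
    pending = dict(jobs)
    while pending:
        # A job is ready when every dependency has already run.
        ready = [name for name, deps in pending.items() if set(deps) <= done]
        if not ready:
            raise ValueError("dependency cycle or unsatisfiable requires clause")
        for name in sorted(ready):
            order.append(name)
            done.add(name)
            del pending[name]
    return order

print(execution_order({"setup-data": [], "process-data": ["setup-data"]}))
# ['setup-data', 'process-data']
```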
Workflow status commands automatically display job names for better visibility:
# Get workflow status with job names and dependencies
rnx workflow status a1b2c3d4-e5f6-7890-1234-567890abcdef
# List workflows
rnx workflow list
# Execute workflow
rnx workflow run pipeline.yaml
The Persist Service provides historical data storage and querying capabilities for logs and metrics. It operates alongside the main Joblet service to provide durable storage and efficient historical queries.
Purpose: Store and query historical job logs and metrics
Architecture:
Deployment:
# Service runs alongside main joblet service
systemctl status persist
# Configured via unified config file
/opt/joblet/config/joblet-config.yml
The persist service uses two communication channels:
Purpose: High-throughput log and metric writes from main service
Joblet Service → Unix Socket → Persist Service
(/opt/joblet/run/persist-ipc.sock)
Protocol: Custom IPC protocol (defined in internal/proto/ipc.proto)
Message Types:
Configuration:
persist:
ipc:
socket: "/opt/joblet/run/persist-ipc.sock"
max_message_size: 10485760 # 10MB
Purpose: Historical queries from RNX clients
RNX Client → gRPC (Unix socket) → Persist Service
Protocol: gRPC (defined in internal/proto/persist.proto)
Security: Optional TLS (disabled by default for localhost)
Configuration:
persist:
server:
grpc_socket: "/opt/joblet/run/persist-grpc.sock"
tls:
enabled: false # Optional for external access
Retrieves historical logs for a completed or running job.
Request:
message QueryLogsRequest {
string job_id = 1; // Job UUID (required)
string stream = 2; // "stdout" or "stderr" (optional, default: both)
int64 start_time = 3; // Unix timestamp (optional)
int64 end_time = 4; // Unix timestamp (optional)
int32 limit = 5; // Max lines to return (optional)
}
Response:
message QueryLogsResponse {
repeated LogEntry entries = 1;
}
message LogEntry {
int64 timestamp = 1; // Unix timestamp in nanoseconds
string stream = 2; // "stdout" or "stderr"
string content = 3; // Log line content
}
Storage Location:
/opt/joblet/logs/{job_id}/
├── stdout.log.gz # Compressed stdout
└── stderr.log.gz # Compressed stderr
Example Usage:
# RNX automatically queries persist service for completed jobs
rnx job log <job-id>
# Internally calls persist service QueryLogs API
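Reading such compressed log files is straightforward with the standard library. A sketch (the helper is hypothetical, not RNX code), shown against an in-memory blob so it runs without server files:

```python
import gzip
import io

def read_log(blob: bytes, limit: int = 0) -> list:
    """Decompress a stdout.log.gz-style blob and return up to `limit` lines."""
    with gzip.open(io.BytesIO(blob), "rt", encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f]
    return lines[:limit] if limit else lines

# Round-trip demo standing in for /opt/joblet/logs/<job_id>/stdout.log.gz
raw = gzip.compress(b"line 1\nline 2\nline 3\n")
print(read_log(raw, limit=2))
# ['line 1', 'line 2']
```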
Retrieves historical metrics for a job.
Request:
message QueryMetricsRequest {
string job_id = 1; // Job UUID (required)
string metric_type = 2; // "cpu", "memory", "io", "gpu" (optional)
int64 start_time = 3; // Unix timestamp (optional)
int64 end_time = 4; // Unix timestamp (optional)
int32 limit = 5; // Max samples to return (optional)
}
Response:
message QueryMetricsResponse {
repeated MetricSample samples = 1;
}
message MetricSample {
int64 timestamp = 1; // Unix timestamp in nanoseconds
string metric_type = 2; // Metric category
map<string, double> values = 3; // Metric key-value pairs
}
Storage Location:
/opt/joblet/metrics/{job_id}/
└── metrics.jsonl.gz # Compressed JSON Lines metrics
Metric Types:
- cpu: cpu_user, cpu_system, cpu_total, cpu_percent
- memory: memory_rss, memory_vms, memory_percent
- io: io_read_bytes, io_write_bytes, io_read_ops, io_write_ops
- gpu: gpu_utilization, gpu_memory_used, gpu_temperature (if available)

Example Metric Sample:
{
"timestamp": 1704451200000000000,
"metric_type": "cpu",
"values": {
"cpu_user": 45.2,
"cpu_system": 12.8,
"cpu_total": 58.0,
"cpu_percent": 58.0
}
}
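Samples in this shape are easy to aggregate offline. A sketch (hypothetical helper, not part of RNX) that averages cpu_percent across a gzipped JSONL blob of the format above:

```python
import gzip
import json

def average_cpu(jsonl_gz: bytes) -> float:
    """Average cpu_percent across cpu samples in a metrics.jsonl.gz-style blob."""
    values = []
    for line in gzip.decompress(jsonl_gz).splitlines():
        sample = json.loads(line)
        if sample.get("metric_type") == "cpu":
            values.append(sample["values"]["cpu_percent"])
    return sum(values) / len(values) if values else 0.0

# Build a two-sample blob standing in for metrics.jsonl.gz
samples = b"\n".join(
    json.dumps({"timestamp": t, "metric_type": "cpu",
                "values": {"cpu_percent": p}}).encode()
    for t, p in [(1, 50.0), (2, 60.0)]
)
print(average_cpu(gzip.compress(samples)))
# 55.0
```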
Storage Management:
persist:
storage:
retention:
logs_days: 30 # Keep logs for 30 days
metrics_days: 90 # Keep metrics for 90 days
cleanup_interval: "24h" # Run cleanup daily
Manual Cleanup:
# Clean up old job data
find /opt/joblet/logs -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec rm -rf {} +
find /opt/joblet/metrics -mindepth 1 -maxdepth 1 -type d -mtime +90 -exec rm -rf {} +
Write Performance:
Query Performance:
Planned Features:
Configuration (Future):
persist:
storage:
type: "cloud" # "local" or "cloud"
cloud:
provider: "aws"
s3:
bucket: "joblet-logs"
region: "us-west-2"
cloudwatch:
log_group: "/joblet/jobs"