Scaling - xyOps

Running xyOps in production with lots of servers and high job volumes? This guide provides best practices for scaling your deployment to handle enterprise workloads.

Start with Self-Hosting if you’re new to xyOps deployment. This guide builds on those foundational concepts.

Hardware Sizing

Proper hardware provisioning is critical for production xyOps deployments at scale.

CPU Cores

xyOps is multi-process and highly concurrent. More cores improve performance across:

Job scheduler
Web server request handling
Storage I/O operations
Real-time log compression

Recommendation: Minimum 4 cores for small deployments, 8-16 cores for production fleets with hundreds of servers.

Memory (RAM)

Adequate RAM ensures smooth operation and reduces disk I/O. Memory is consumed by:

Node.js heap space
In-process caches (storage, lists)
Storage engine caches (SQLite, Filesystem)
OS page cache for log files

Recommendation: 16-32 GB RAM for production installs. More RAM leaves room for larger caches, which improves cache hit rates.

Storage

Use fast SSD/NVMe storage for production. HDDs cannot handle the IOPS required for parallel job logs and database operations.

Type: Prefer SSD or NVMe for local Filesystem/SQLite backends
IOPS: Ensure adequate IOPS for parallel job logs, snapshots, and uploads
Capacity: Plan for log archives, job history, and monitor time-series data

Network

Ensure good NIC throughput and low latency between conductors and workers
For external storage (S3, Redis, MinIO), place conductors in the same region/AZ
Use load balancers with proper health checks for multi-conductor setups

OS Limits

# Increase the file descriptor limit (applies to the current shell session only)
ulimit -n 65536

# For systemd services, add to the [Service] section of the unit file:
LimitNOFILE=65536
LimitNPROC=32768

Configure swap conservatively to avoid heap thrashing under memory pressure.
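
To confirm that the raised file descriptor limit actually reaches a process, a short check like the one below can help. This is a minimal sketch using Python's standard `resource` module (Unix only); the helper names `meets_limit` and `nofile_report` are illustrative and not part of xyOps. It reports the limits of the Python process itself, which inherits them from the shell or service that launched it, so run it from the same environment that starts xyOps.

```python
import resource


def meets_limit(soft_limit: int, required: int) -> bool:
    """Return True if a soft limit satisfies the required minimum.

    An unlimited soft limit (RLIM_INFINITY) always satisfies it.
    """
    return soft_limit == resource.RLIM_INFINITY or soft_limit >= required


def nofile_report(required: int = 65536) -> str:
    """Describe this process's open-file limits against a required minimum."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    status = "OK" if meets_limit(soft, required) else "TOO LOW"
    return f"nofile soft={soft} hard={hard} required={required}: {status}"


if __name__ == "__main__":
    print(nofile_report())
```

For a process that is already running on Linux, the effective limits can also be read from `/proc/<pid>/limits`.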

Memory Configuration

Node.js Heap Size

xyOps honors the NODE_MAX_MEMORY environment variable to set Node’s old-space heap size.

Configure Node.js Memory

Set environment variable

export NODE_MAX_MEMORY=8192

Or for Docker:

docker run -e NODE_MAX_MEMORY=8192 ...

Calculate appropriate value

On a 16 GB instance, allocate 8-12 GB to Node.js heap, leaving room for:

OS and system processes
Filesystem cache
External daemons (nginx, database)
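
The sizing rule above can be sketched as a small calculation. The function name `node_max_memory_mb` and the `heap_fraction` parameter are hypothetical; the 0.5 to 0.75 range is simply the 8-12 GB guidance for a 16 GB instance expressed as a fraction, and applying the same fraction to other instance sizes is an assumption to validate against your own workload.

```python
def node_max_memory_mb(total_ram_gb: float, heap_fraction: float = 0.5) -> int:
    """Suggest a NODE_MAX_MEMORY value in MB for a given amount of RAM.

    heap_fraction is the share of RAM given to the Node.js heap; the
    remainder is left for the OS, filesystem cache, and other daemons.
    """
    if not 0 < heap_fraction < 1:
        raise ValueError("heap_fraction must be between 0 and 1")
    return int(total_ram_gb * 1024 * heap_fraction)


if __name__ == "__main__":
    # Lower and upper ends of the 8-12 GB guidance for a 16 GB instance.
    print(node_max_memory_mb(16, 0.5))
    print(node_max_memory_mb(16, 0.75))
```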

Monitor and adjust

Monitor RSS vs heap usage over time. Adjust conservatively to avoid swapping.

Default: 4096 MB (4 GB). The value is specified in megabytes.

Storage Engine Caching

xyOps uses pixl-server-storage with in-memory caches for JSON records.

Storage.SQLite.cache.maxBytes (number, default: 104857600)
Maximum SQLite cache size in bytes (100 MiB by default).

Storage.SQLite.cache.maxItems (number, default: 100000)
Maximum number of cached items.

Storage.Filesystem.cache.maxBytes (number, default: 104857600)
Maximum Filesystem cache size in bytes (100 MiB by default).

Recommendation: For large production installs, increase cache sizes 5-10× if RAM permits:

"Storage": {
  "SQLite": {
    "cache": {
      "enabled": true,
      "maxBytes": 524288000,
      "maxItems": 500000
    }
  },
  "Filesystem": {
    "cache": {
      "enabled": true,
      "maxBytes": 524288000,
      "maxItems": 500000
    }
  }
}

Tune based on hit ratio and latency. Monitor cache effectiveness in storage logs.
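
The 5-10× guidance is plain multiplication of the defaults, which the following sketch makes explicit. The helper names are illustrative. It assumes the Filesystem cache uses the same `maxItems` default as SQLite, which the example config above implies but the parameter list does not state.

```python
import json

DEFAULT_MAX_BYTES = 104857600  # 100 MiB, the documented default
DEFAULT_MAX_ITEMS = 100000     # documented default for SQLite


def scaled_cache(factor: int) -> dict:
    """Return a cache config block with both limits scaled by factor."""
    return {
        "enabled": True,
        "maxBytes": DEFAULT_MAX_BYTES * factor,
        "maxItems": DEFAULT_MAX_ITEMS * factor,
    }


def storage_fragment(factor: int) -> dict:
    """Build the Storage fragment with the same scaling for both engines."""
    return {
        "Storage": {
            "SQLite": {"cache": scaled_cache(factor)},
            "Filesystem": {"cache": scaled_cache(factor)},
        }
    }


if __name__ == "__main__":
    # A factor of 5 reproduces the values in the example config.
    print(json.dumps(storage_fragment(5), indent=2))
```

Remember that both caches count against the same RAM budget as the Node.js heap, so scale them together with NODE_MAX_MEMORY rather than independently.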

Multi-Conductor Architecture

Multi-conductor deployments require external shared storage so all conductors see the same state.

See Multi-Conductor with Nginx for detailed setup instructions.

Storage Backend Options

S3 / MinIO

AWS S3 works but has higher latency. MinIO (a self-hosted, S3-compatible object store) performs better on-prem.

"Storage": {
  "engine": "S3",
  "S3": {
    "params": {
      "Bucket": "xyops-production"
    },
    "cache": {
      "enabled": true,
      "maxBytes": 524288000
    }
  }
}

Redis + S3 Hybrid

Common pattern: fast key/value store for JSON documents, object store for binaries.

"Storage": {
  "engine": "Hybrid",
  "Hybrid": {
    "docEngine": "Redis",
    "binaryEngine": "S3"
  }
}

Ensure Redis persistence (RDB/AOF) is enabled for durability.

NFS Shared Filesystem

If you use NFS for the Filesystem backend, ensure:

Low latency
Adequate throughput
Robust file-locking semantics

Network file systems can introduce latency. Test thoroughly before production use.

SQLite works well for single-conductor deployments but cannot be shared across multiple conductors. Switch to a networked backend for multi-conductor setups.

Best Practice: Keep conductors in the same region/AZ as storage to minimize cross-zone latency.

Performance Tuning

Disable QuickMon at Scale

QuickMon sends per-second metrics from all satellites. At large scale, this adds ingestion load.

"satellite": {
  "config": {
    "quickmon_enabled": false,
    "monitoring_enabled": true
  }
}

Minute-level monitoring remains enabled via monitoring_enabled.

Disable Job Network Monitoring

For servers with tens of thousands of network connections, disable real-time network monitoring during jobs:

// In satellite config.json or global satellite.config
"disable_job_network_io": true

This reduces load on busy servers while jobs are running.

Job Throughput

Increase the global job rate limit prudently:

max_jobs_per_min (number, default: 100)
Global e-brake that prevents runaway workflows from overwhelming the system.

Align with per-category limits and workflow constraints. Monitor worker CPU/RAM when increasing.

Data Retention

Cap database history sizes to prevent unbounded growth:

"db_maint": {
  "jobs": { "max_rows": 1000000 },
  "alerts": { "max_rows": 100000 },
  "snapshots": { "max_rows": 100000 },
  "activity": { "max_rows": 100000 },
  "servers": { "max_rows": 10000 }
}

Adjust to fit your storage budget and compliance requirements.
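
One way to reason about the jobs cap is to convert it into days of history at a given job rate. The sketch below is a rough estimate that assumes each completed job adds exactly one row to the jobs history, which is an assumption rather than documented behavior; `history_days` is a hypothetical helper name.

```python
def history_days(max_rows: int, jobs_per_min: float) -> float:
    """Estimate how many days of job history fit under a row cap.

    Assumes one history row per job and a constant job rate.
    """
    if jobs_per_min <= 0:
        raise ValueError("jobs_per_min must be positive")
    jobs_per_day = jobs_per_min * 60 * 24
    return max_rows / jobs_per_day


if __name__ == "__main__":
    # 1,000,000 rows at a sustained 100 jobs/min (the default rate cap).
    print(round(history_days(1_000_000, 100), 1))
```

At a sustained 100 jobs per minute, a cap of 1,000,000 rows covers a little under a week, so deployments that need longer history at that rate would raise the cap or export older records.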

Search Performance

search_file_threads (number, default: 1)
Number of worker threads for file search operations.

Increase carefully if file searches are frequent. The work is I/O bound, so test before raising the value.

Automated Backups

Configure nightly API export

Use the nightly API export for critical data. Schedule via cron and store off-host. See Daily Backups.

Enable SQLite backups

"Storage": {
  "SQLite": {
    "backups": {
      "enabled": true,
      "dir": "data/backups",
      "compress": true,
      "keep": 7
    }
  }
}

Note: Backups briefly lock the database during copy.

Store backups off-host

Copy backups to S3, network storage, or a backup service for disaster recovery.

Monitoring and Alerting

Critical Error Notifications

Configure system hooks to send alerts for crashes and failed upgrades:

"hooks": {
  "critical": {
    "email": "ops-oncall@yourcompany.com"
  }
}

Or create tickets:

"hooks": {
  "critical": {
    "ticket": {
      "type": "issue",
      "assignees": ["admin"]
    }
  }
}

Universal Alert Actions

Configure global alert actions that fire for all alerts:

"alert_universal_actions": [
  {
    "enabled": true,
    "hidden": true,
    "condition": "alert_new",
    "type": "snapshot"
  },
  {
    "enabled": true,
    "condition": "alert_new",
    "type": "email",
    "email": "oncall-pager@mycompany.com"
  }
]

Security Hardening

Network Access Control

"WebServer": {
  "whitelist": ["10.0.0.0/8", "172.16.0.0/12"],
  "allow_hosts": ["xyops.yourcompany.com"]
}

Restrict inbound IPs using CIDR notation. Limit valid Host headers.
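
Before deploying a whitelist, it can help to check which client addresses the CIDR ranges actually cover. This sketch uses Python's standard `ipaddress` module to illustrate CIDR semantics only; it is not how xyOps performs the match, and `is_allowed` is a hypothetical helper.

```python
import ipaddress

# The ranges from the example config above.
WHITELIST = ["10.0.0.0/8", "172.16.0.0/12"]


def is_allowed(ip: str, cidrs=WHITELIST) -> bool:
    """Return True if ip falls inside any of the given CIDR ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidrs)


if __name__ == "__main__":
    for ip in ["10.1.2.3", "172.31.255.255", "172.32.0.1", "192.168.1.10"]:
        print(ip, is_allowed(ip))
```

Note that 172.16.0.0/12 ends at 172.31.255.255, and that 192.168.0.0/16 is not covered by either range, so clients on that private network would be rejected unless it is added.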

HTTPS/TLS

"WebServer": {
  "https": true,
  "https_port": 5523,
  "https_cert_file": "conf/tls.crt",
  "https_key_file": "conf/tls.key",
  "https_force": true
}

Enable HTTPS and force HTTP→HTTPS redirects. Use https_header_detect if terminating TLS upstream.

Upload and Connection Limits

"WebServer": {
  "max_upload_size": 536870912,
  "max_connections": 2048,
  "max_concurrent_requests": 256
}

Reduce upload limits and tune connection caps to match instance capacity.

Security Headers

"WebServer": {
  "uri_response_headers": {
    "(\\/|\\.html)$": {
      "Content-Security-Policy": "default-src 'self'...",
      "X-Frame-Options": "DENY",
      "Strict-Transport-Security": "max-age=31536000"
    }
  }
}

Enforce CSP, HSTS, and other security headers for HTML routes.
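
The key in uri_response_headers is a regular expression written as a JSON string, so each backslash is doubled. The sketch below decodes the key and tests a few sample paths to show which ones would receive the headers. It assumes the pattern is matched against the request path without the query string, and it uses Python's `re` module, which treats this particular pattern the same way a JavaScript regular expression would.

```python
import json
import re

# The key exactly as it appears in the JSON config (including quotes).
JSON_KEY = r'"(\\/|\\.html)$"'

# JSON decoding collapses each doubled backslash: (\/|\.html)$
PATTERN = json.loads(JSON_KEY)
HTML_ROUTE = re.compile(PATTERN)


def gets_headers(path: str) -> bool:
    """Return True if the path ends in '/' or '.html'."""
    return HTML_ROUTE.search(path) is not None


if __name__ == "__main__":
    for path in ["/", "/index.html", "/js/app.js", "/api/app/search"]:
        print(path, gets_headers(path))
```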

Rotate your secret_key every few months. See Secret Key Rotation for details.

Rate Limiting with Nginx

If using the multi-conductor Nginx setup, add rate limiting:

Create limits.conf

limit_req_zone $binary_remote_addr zone=req_per_ip:20m rate=100r/s;
limit_req_status 429;

Add volume bind to Docker

docker run -v ./limits.conf:/etc/nginx/conf.d/limits.conf:ro ...

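This limits each client IP to 100 requests/sec and rejects excess requests with HTTP 429. State is tracked in a 20 MB shared-memory zone, enough for roughly 300K IPs at 64 bytes per state or about half that at 128 bytes per state. The zone is enforced only where a limit_req zone=req_per_ip directive applies. See ngx_http_limit_req_module for more options.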
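
The capacity figure can be reproduced with simple division. The per-state sizes used here (64 and 128 bytes) are taken from the ngx_http_limit_req_module documentation as the sizes on 32-bit and 64-bit platforms respectively; treat the results as approximate upper bounds, since the zone also carries some internal overhead. `zone_capacity` is an illustrative helper name.

```python
def zone_capacity(zone_mb: int, state_bytes: int) -> int:
    """Approximate number of client states that fit in a limit_req zone."""
    return zone_mb * 1024 * 1024 // state_bytes


if __name__ == "__main__":
    print(zone_capacity(20, 64))   # 64-byte states
    print(zone_capacity(20, 128))  # 128-byte states
```

If you expect more distinct client IPs than the zone can hold, increase the size in the zone=req_per_ip:20m parameter.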

Additional Tuning

Logging Verbosity

Disable verbose logs in production unless actively debugging:

"WebServer": {
  "log_requests": false
},
"Storage": {
  "log_event_types": {}
}

Timeouts

Configure request timeouts to mitigate Slowloris-style attacks:

"WebServer": {
  "timeout": 30,
  "request_timeout": 30,
  "keep_alive_timeout": 30,
  "socket_prelim_timeout": 5
}
