1. Audit Your Config
Drop a YAML or JSON service definition into the Audit Tool. The same 8-rule engine that powers the dashboard will scan it and return scored findings with fix hints.
2. Map Blast Radius
The dependency graph traces every connection in your service chain. See exactly how many systems a single failure takes down. Click any hotspot to generate a targeted incident response plan before you need one.
3. Fix What Matters Most
Every finding is scored by severity and blast radius, then ranked by risk-reduction ROI. Critical issues include fix hints and runbook generation. Start with the fix that eliminates the most exposure per engineering hour.
1. See Your Risk Posture
The dashboard produces a single Operational Reliability Risk score from 0 to 100, severity-weighted so critical issues stand out. Use it to give leadership one clear reliability number instead of scattered technical anecdotes.
2. Quantify Outage Exposure
Incident Intelligence correlates past incidents to show where cost is concentrated — for example, where repeated reliability gaps are driving most of the outage impact across trading or critical-path systems.
3. Prioritize Investment
Blast radius analysis and reliability debt tracking show which remediation work delivers the most risk reduction per engineering dollar spent.
ReliOps is a reliability risk intelligence platform with an open-source core. It scans your service architecture and surfaces risk — things like missing failover, SLO breaches, dangerous dependency chains, and incident patterns before they cause outages. Think of it as a reliability posture dashboard for your entire service fleet, with an enterprise path for regulated environments.
The Operational Reliability Risk score ranges from 0 (no issues) to 100 (critical). It's severity-weighted: CRITICAL findings contribute far more than LOW ones. A score under 30 is healthy, 30–59 is moderate and worth reviewing, and 60+ means you have findings that need prompt attention.
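The exact weights are internal to the rules engine, but the shape of a severity-weighted 0–100 score can be sketched in a few lines. The weight values and the cap below are illustrative assumptions, not ReliOps's actual calibration:

```python
# Illustrative sketch of a severity-weighted risk score on a 0-100 scale.
# Weights are assumptions for demonstration, not ReliOps's real values.
SEVERITY_WEIGHTS = {"CRITICAL": 25, "HIGH": 10, "MEDIUM": 4, "LOW": 1}

def risk_score(findings):
    """findings: list of severity strings, e.g. ["CRITICAL", "LOW"]."""
    raw = sum(SEVERITY_WEIGHTS[s] for s in findings)
    return min(raw, 100)  # clamp so the score stays on the 0-100 scale

print(risk_score(["CRITICAL", "HIGH", "LOW"]))  # 36: one critical dominates
```

The point of the weighting is visible in the example: a single CRITICAL finding moves the score more than a pile of LOW ones.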
The graph maps every service-to-service dependency in your architecture. Node color reflects risk level — green is clean, yellow has medium findings, red has critical issues. You can drag nodes, zoom, and hover for details. The layout reveals which services are tightly coupled and where a single failure could cascade.
Blast radius measures how many downstream services would be affected if a given service fails. ReliOps traces the dependency chain and counts every system that depends — directly or transitively — on that service. A blast radius of 8 means an outage would ripple across 8 other services.
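Conceptually, blast radius is a reverse reachability count over the dependency graph. A minimal sketch, assuming a simple mapping of each service to its declared dependencies (service names here are made up):

```python
# Sketch of blast radius: count every service that depends, directly or
# transitively, on the failed service. The graph shape is an assumption.
def blast_radius(deps, failed):
    """deps maps each service name to the list of services it depends on."""
    affected = set()
    changed = True
    while changed:  # propagate until no new dependents are found
        changed = False
        for svc, ds in deps.items():
            if svc != failed and svc not in affected:
                if failed in ds or affected & set(ds):
                    affected.add(svc)
                    changed = True
    return len(affected)

deps = {
    "order-router": ["redis-cache", "auth-service"],
    "checkout": ["order-router"],
    "auth-service": [],
    "redis-cache": [],
}
print(blast_radius(deps, "redis-cache"))  # 2: order-router, then checkout
```

Note that "checkout" is counted even though it never mentions "redis-cache" directly; the transitive hop through "order-router" is exactly what makes blast radius larger than a naive dependent count.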
ReliOps takes a config-first approach — it doesn't require runtime agents, APM instrumentation, or access to production. Dependencies are extracted from artifacts your teams already maintain:

Today (open-source core): YAML/JSON service configs, Kubernetes manifests, Helm charts, docker-compose files, and Terraform state. These declare dependencies explicitly: database connections, upstream services, queue bindings, circuit breaker settings, SLO targets. Most teams underestimate how much their config files already reveal about blast radius.

Enterprise roadmap: For cloud-native environments, ReliOps ingests Kubernetes API, cloud provider service discovery (AWS CloudMap, etc.), and CI/CD pipeline definitions. Passive reads, not agents. For legacy on-prem (common in financial services), the path is config + architecture document ingestion: CMDB exports, network diagrams, and AI-assisted extraction from runbooks and architecture docs.

Why this matters for financial services: Regulated environments often can't deploy third-party agents into production. The config-first model is a design choice, not a limitation. You don't need 100% coverage to deliver value. Even a partial blast radius map with 3–5 critical findings changes how a team prioritizes reliability work.
The open-source core gives you the full config audit engine, 8-rule scoring, blast radius mapping, and dependency visualization. Everything you need to validate the workflow. Enterprise delivery adds what regulated, business-critical environments require:

Deployment: Private deployment in your VPC or on-prem infrastructure. No data leaves your network.
Identity & access: SSO integration, role-based access control, audit logging for compliance.
AI insights: Local LLM option with dedicated model tuning. Contextual remediation guidance calibrated to your architecture.
Dependency discovery: Cloud API ingestion, IaC parsing, CMDB integration, and AI-assisted extraction from legacy architecture docs, expanding coverage beyond config files.
Continuous monitoring: Scheduled scans, drift detection, and historical trend analysis.
Calibration: Rules, thresholds, and severity weights tuned to your SLO targets and regulatory requirements (SEC 17a-4, DORA, SOC 2).
Support: Dedicated onboarding, pilot program, and ongoing reliability engineering advisory.

Enterprise contracts start at $150K/year for financial services institutions. Start a pilot →
AIOps platforms like BigPanda and Selector AI are reactive correlation engines. They sit downstream of your monitoring stack and get smarter about grouping, deduplicating, and routing alerts after something breaks. ReliOps answers a different question: what is going to break, how far will it spread, and what is the highest-value fix right now?

Pre-incident vs post-incident: BigPanda ingests alerts from Datadog, PagerDuty, Splunk and correlates them. Selector AI does the same with network telemetry. Both require something to go wrong first. ReliOps maps dependency risk and blast radius from config files before the first alert fires.

No agent, no integration tax: BigPanda needs connectors into every alerting and ITSM tool in your stack. Selector AI needs network telemetry pipelines. Both take months to deploy in regulated environments. ReliOps starts from config files you already have: YAML, Helm charts, Terraform, docker-compose. Config to first findings in a week, not a quarter.

Blast radius as the unit of analysis: AIOps tools think in alerts and incidents. ReliOps thinks in dependency chains and concentration risk. For a trading firm, knowing that three critical order-routing services all depend on the same Redis cluster with no failover is worth more than faster alert grouping after that cluster goes down.

Open-source transparency: Neither BigPanda nor Selector AI lets you inspect the logic. ReliOps's 8-rule engine is open, auditable, and runnable locally. That matters for regulated finserv teams who need to explain scoring to a risk committee.
ReliOps runs 8 rules against your services: SLO breach trending, missing cross-AZ failover, missing circuit breakers, retry storm risk, single-owner risk, incident recurrence, missing saturation metrics, and dependency fan-out limits. Each rule produces findings at CRITICAL, HIGH, MEDIUM, or LOW severity.
Go to Run Audit in the navigation. Drop in a YAML or JSON file describing your services. Include fields like name, tier, owner, SLO targets, and dependencies. ReliOps will run the full rules engine against it and return a scored report with actionable findings.
Tiers reflect business criticality. Tier 1 is revenue-critical or customer-facing infrastructure: services where downtime directly impacts users or revenue. Higher tier numbers indicate supporting or internal services. ReliOps weighs findings on Tier-1 services more heavily since their failures have the broadest impact.
The audit tool accepts YAML or JSON files with a list of service definitions. Each service should include: name, tier (e.g. tier-1), owner, slo_target, slo_current, dependencies (a list of service names), and optionally deployment info like az_count. You can find a sample in the project's mock data directory.
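A minimal definition using the fields listed above might look like this (service names and values are illustrative, not from the sample data):

```yaml
# Illustrative service definition; field names follow the schema above.
services:
  - name: order-router
    tier: tier-1
    owner: trading-platform-team
    slo_target: 99.95
    slo_current: 99.87
    dependencies:
      - redis-cache
      - auth-service
    deployment:
      az_count: 1
```

A file like this would trip several rules at once: the current SLO trails its target, and a tier-1 service running in a single AZ is a missing-failover finding.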
ReliOps has an open-source core that you can self-host, inspect, and extend. The core is portable so teams can validate the workflow without lock-in. Companies that need supported private deployment, rule calibration, rollout support, or regulated-environment delivery can also engage UpliftPal on an enterprise basis. See Enterprise for the operating model.
ReliOps can optionally use an LLM to generate richer, contextual reliability recommendations from your risk findings. All service names and internal identifiers are anonymized before anything is sent. AI insights are off by default. When disabled, ReliOps uses its built-in pattern-based analysis. If the AI provider is ever unavailable, it falls back to pattern-based insights automatically.
Add these to your .env file (see .env.example for the full template):

AI_INSIGHTS_ENABLED=true
AI_PROVIDER=openai
AI_API_KEY=your-api-key

Three providers are supported:

anthropic — Anthropic Claude API
openai — OpenAI GPT API
local — Your own self-hosted LLM. Nothing leaves your network. Works with any OpenAI-compatible API: Ollama, vLLM, llama.cpp, LocalAI, LM Studio, and others. Just set AI_BASE_URL to your model endpoint. No API key required unless your endpoint uses one.
Only if you choose an external provider (anthropic or openai). Even then, ReliOps strips all service names, owner info, and internal identifiers before sending. The LLM only sees anonymized patterns like "Service-A has blast radius 7 without AZ redundancy" — never your real service names or architecture details.

For full data isolation, use the local provider. Set AI_PROVIDER=local and point AI_BASE_URL at a model running on your own infrastructure. Zero data leaves your network. No external API calls, no third-party access. Ideal for environments with strict compliance or data residency requirements.
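For example, a .env pointing at a locally served model might look like the following (the URL is an example for an Ollama-style OpenAI-compatible endpoint; substitute whatever your own server exposes):

AI_INSIGHTS_ENABLED=true
AI_PROVIDER=local
AI_BASE_URL=http://localhost:11434/v1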
When you click "Generate Response Plan" on a critical issue or blast-radius hotspot, ReliOps builds a structured incident runbook for that specific service. The runbook includes: immediate response steps, blast-radius containment actions, a copy-paste communication template, and a post-incident checklist.

If AI insights are enabled, the runbook is generated by the LLM using your service's violations, dependency graph, and incident history (anonymized). If AI is off, ReliOps generates a deterministic template based on the same data. Either way, you get actionable steps, not a generic playbook.
The Incident Intelligence section analyzes your full incident history to surface systemic patterns that span multiple incidents. Instead of looking at each incident individually, it detects:

Root cause categories — e.g. "3 of 10 incidents involve timeout cascades, costing $4.2M"
Cost concentration — e.g. "80% of incident costs trace to services missing circuit breakers"
Repeat offenders — services that appear in multiple incidents
Temporal clusters — bursts of incidents in short timeframes

Each pattern includes a recommendation so you can prioritize the highest-impact fixes.
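The "repeat offenders" pattern, for instance, reduces to counting how often each service appears across the incident history. A minimal sketch, assuming a simple list-of-dicts incident shape (field names are illustrative, not ReliOps's actual data model):

```python
# Sketch of "repeat offender" detection: services that appear in more
# than one incident. The incident record shape is an assumption.
from collections import Counter

def repeat_offenders(incidents, min_count=2):
    """incidents: list of dicts, each with a 'services' list."""
    counts = Counter(svc for inc in incidents for svc in inc["services"])
    return {svc: n for svc, n in counts.items() if n >= min_count}

incidents = [
    {"id": "INC-1", "services": ["order-router", "redis-cache"]},
    {"id": "INC-2", "services": ["redis-cache"]},
    {"id": "INC-3", "services": ["auth-service"]},
]
print(repeat_offenders(incidents))  # {'redis-cache': 2}
```

Cost concentration and temporal clustering follow the same idea with different aggregation keys: summing incident cost per rule violation, or bucketing incident timestamps into windows.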
Each critical and high-severity issue now includes a one-line "Fix" hint: a concrete remediation suggestion mapped to the specific rule that triggered the violation. For example, a missing circuit breaker violation shows "Fix: Add circuit breaker for fault isolation." These hints give SREs an immediate starting point without needing to open a separate runbook.