Open to roles · New grad Dec 2025 · Open to relocation

Kriti Behl

Software engineer building systems that catch failures before production.

Backend · Reliability · Developer Tooling · AI Infrastructure. Production backend at Thales Group, 4 merged fixes to the Temporal Go SDK.

What I've proven
Prevented distributed system corruption — 0.0% duplicate commits under fault injection
Blocked unsafe deployments — caught +608% p95 regression on AMD MI300X
Turned CI failures into release decisions — 0.91 confidence classification
Fixed concurrency bugs in production Go SDK — 4 merged Temporal PRs
1,500+ Fault-Injected Scenarios
0 duplicate commits · 0 violations
+608% p95 Regression Caught
AMD MI300X · vLLM · BLOCK
4 PRs Merged, Temporal SDK
+ 2 Azure SDK in review
~100k Records / Run
Production · Thales Group
Backend & Platform
Distributed execution · APIs · PostgreSQL · Kubernetes
Faultline · AutoOps · Thales
QA Automation
Regression detection · test automation · release validation
AutoOps · KubePulse · Faultline
Developer Tooling
CI failure intelligence · dashboards · agentic workflows
AutoOps · DetTrace · AgentGrid · Temporal OSS
Reliability & SRE
SLO validation · rollout gates · observability
KubePulse · AutoOps · Faultline
AI Infrastructure
Model evaluation · serving latency gates · AMD proof
FairEval · KubePulse AMD · AccelSim
Systems & Performance
Deterministic replay · bottleneck analysis · correctness
DetTrace · AccelSim · Faultline

Live Cloud Run system that detects unsafe AI outputs, blocks them with eval gates, and converts failures into AutoOps incident intelligence.

Why this matters: Without systems like this, incorrect or incomplete AI outputs can reach users, causing silent failures.
Live proof
25 validation runs
9 ship decisions
10 hold decisions
6 escalate decisions
258 ms p95 eval latency
0.88 tool-call success
0 unsafe shipments
Cloud Run live
Example output
Decision: HOLD
Reason: missing_context
AutoOps Output
→ PM summary: Missing deployment context
→ Engineering bug: Missing dependency metadata
→ Support action: Request logs and retry deployment
System flow
Query: user / support input
RAG over docs/logs/runbooks: context retrieval
LangGraph workflow: stateful multi-step
MCP-style tool execution: structured tool calls
Eval Gate: safety + quality check
ship / hold / escalate: decision output
AutoOps: incident ingestion
Incident + Action: structured output
★ AutoOps-Insight · Developer Tooling · QA Automation · SRE
Turns noisy CI failures into release-blocking decisions with root cause and confidence scoring.

CI Failure Intelligence Dashboard

Detected recurring CI failures and blocked unreliable releases — grouped failures into incident families and generated hold/ship decisions with 0.91 confidence.

  • Full-stack platform (FastAPI + React/Vite): ingests CI logs → classifies failure families → fingerprints recurrence → generates hold_release / investigate with confidence scores
  • Signature-based recurrence tracking across 3 repos — 60% release-blocking decisions; Parquet exports with 17-field schema for downstream analysis
  • Fleet-level metrics: noisy services ranking, recurrence heatmaps, root-cause distribution via warehouse-style SQL models
Why this matters: On-call engineers get structured triage instead of raw logs, so decisions are faster and depend less on tribal knowledge.
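The signature-based recurrence tracking described above can be sketched roughly as follows. This is a minimal illustration, not the AutoOps-Insight implementation: the normalization rules, the `fingerprint` and `recurrence` names, and the sample log lines are all assumptions made for the example. The idea is to mask volatile tokens (numbers, hex addresses) so that repeated occurrences of the same failure hash to the same signature.

```python
import hashlib
import re
from collections import Counter

# Hypothetical normalization rules: mask volatile tokens so that two
# occurrences of the same failure produce the same signature.
_VOLATILE = [
    (re.compile(r"0x[0-9a-f]+"), "<hex>"),
    (re.compile(r"\d+"), "<num>"),
]


def fingerprint(log_line: str) -> str:
    """Return a short, stable signature for one CI failure line."""
    normalized = log_line.strip().lower()
    for pattern, placeholder in _VOLATILE:
        normalized = pattern.sub(placeholder, normalized)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:12]


def recurrence(log_lines: list[str]) -> Counter:
    """Count how often each failure signature appears."""
    return Counter(fingerprint(line) for line in log_lines)


# Illustrative log lines only; not taken from a real CI run.
logs = [
    "DNS lookup failed for host build-17 after 3000 ms",
    "DNS lookup failed for host build-42 after 2875 ms",
    "p95 latency 912 ms exceeded budget 300 ms",
]
counts = recurrence(logs)
for signature, seen in counts.most_common():
    print(signature, seen)
```

The first two lines differ only in host number and timing, so they collapse into one signature with a count of 2. A real classifier would likely need richer normalization (paths, timestamps, UUIDs) than this sketch shows.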
60% release-blocking decisions · 0.91 confidence · FastAPI · React · PostgreSQL · Kafka
5 incidents analyzed
3 hold-release decisions
0.91 dns_failure confidence
0.91 latency_spike confidence
Raw log → Classify → Fingerprint → HOLD / SHIP
★ Faultline · Backend · Platform · Distributed Systems · SRE
Prevented duplicate writes under distributed failures — 0.0% duplicate commits vs 1.0–2.5% naive baseline.

Crash-Safe Distributed Job Execution

Stale workers commit after losing ownership. Lease expiry stops new claims — it doesn't stop an old worker from writing late. Fencing tokens fix the write boundary at the database, not the application.

Worker Claims (SKIP LOCKED) → Fencing Token (unique constraint) → Fault Injected (crash / partition) → Stale Rejected (at DB boundary) → 0 Duplicates (across all runs)
  • 1,500+ injected scenarios: crashes, lease takeovers, retry storms, partial writes — 0 invariant violations
  • Coordination overhead measured: 46.5% of runtime in worst case, broken down by claim / poll / reconcile / retry
Why this matters: Double-commits show up as billing errors, inventory miscounts, or audit failures. Fencing tokens let the database itself reject a stale write, so correctness does not depend on every worker behaving well.
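A minimal sketch of the fencing-token idea, under stated assumptions: SQLite stands in for PostgreSQL so the example is self-contained, the `jobs` and `commits` tables and the `claim` / `commit_result` helpers are hypothetical, and the `SKIP LOCKED` claim query is omitted. It is not Faultline's schema. Each claim bumps a per-job token, and a result is written only if the writer's token is still the newest one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE jobs (id INTEGER PRIMARY KEY, fence INTEGER NOT NULL DEFAULT 0);
    CREATE TABLE commits (job_id INTEGER PRIMARY KEY, fence INTEGER NOT NULL, result TEXT);
    INSERT INTO jobs (id) VALUES (1);
    """
)


def claim(job_id: int) -> int:
    """Take (or take over) a job; every claim bumps the fencing token."""
    with conn:
        conn.execute("UPDATE jobs SET fence = fence + 1 WHERE id = ?", (job_id,))
        (token,) = conn.execute(
            "SELECT fence FROM jobs WHERE id = ?", (job_id,)
        ).fetchone()
    return token


def commit_result(job_id: int, token: int, result: str) -> bool:
    """Write the result only if the caller still holds the newest token.

    The token check and the one-row-per-job constraint both live in the
    database, so a late writer is rejected at the write boundary.
    """
    with conn:
        cur = conn.execute(
            "INSERT OR IGNORE INTO commits (job_id, fence, result) "
            "SELECT id, fence, ? FROM jobs WHERE id = ? AND fence = ?",
            (result, job_id, token),
        )
    return cur.rowcount == 1


stale_token = claim(1)   # worker A claims the job, then stalls
fresh_token = claim(1)   # lease expires; worker B takes over

stale_ok = commit_result(1, stale_token, "from stale worker A")
fresh_ok = commit_result(1, fresh_token, "from worker B")
retry_ok = commit_result(1, fresh_token, "retry from worker B")
print(stale_ok, fresh_ok, retry_ok)
```

Worker A's late write matches no row because its token is out of date, and worker B's retry is ignored by the primary-key constraint, so exactly one commit row exists for the job.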
0.0% duplicates · 1,500+ scenarios · 0 invariant violations · Python · PostgreSQL · Prometheus
Fault Rate   Naive Queue   Faultline
5%           1.0% dupes    0.0% ✓
10%          2.5% dupes    0.0% ✓
20%          2.5% dupes    0.0% ✓
1,500+ failure scenarios
0 invariant violations
KubePulse · Reliability · AI Infra
Blocked unsafe deployments even when Kubernetes health checks stayed green.

Release Safety Validation

AMD MI300X — Serving Regression
Baseline p95: 200 ms
Burst p95: 1,422 ms
Delta: +608%
Decision: BLOCK
  • +333% p95 latency drift while probes stayed green — error budget 0.0%, safe_to_operate=false
  • Validation data pipeline: structured JSON artifacts per scenario run, CI/CD integrable
Why this matters: Prevents the class of incidents where the system looks healthy but is serving degraded traffic.
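A latency gate of this kind can be sketched in a few lines. The `p95_gate` function, the `GateResult` type, the 25% threshold, and the `PASS` label are assumptions for illustration, not KubePulse's actual interface. Note that the rounded figures shown above (200 ms and 1,422 ms) work out to roughly +611%; the reported +608% presumably comes from unrounded measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    delta_pct: float
    decision: str


def p95_gate(
    baseline_ms: float,
    candidate_ms: float,
    max_regression_pct: float = 25.0,  # hypothetical threshold
) -> GateResult:
    """Block a rollout when p95 latency regresses past the threshold."""
    if baseline_ms <= 0:
        raise ValueError("baseline_ms must be positive")
    delta_pct = (candidate_ms - baseline_ms) / baseline_ms * 100.0
    decision = "BLOCK" if delta_pct > max_regression_pct else "PASS"
    return GateResult(delta_pct, decision)


result = p95_gate(baseline_ms=200.0, candidate_ms=1422.0)
print(f"{result.delta_pct:+.0f}% -> {result.decision}")
```

The gate looks only at measured latency, which is why it can still block when liveness and readiness probes report green.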
+608% AMD blocked · +333% p95 caught · Python · Kubernetes · Terraform
DetTrace · Systems Debugging
Isolated the first incorrect event before any visible failure downstream.

First-Failure Isolation

EXPECTED: gpio_edge → irq_assert → isr_enter → gpio_ack → irq_clear
ACTUAL: gpio_edge → irq_assert → isr_enter → gpio_edge → register_read
⚑ First divergence: index 3 · event ordering mismatch
  • C++17 deterministic replay + Swift actor-isolated analysis — finds root cause before symptoms appear
  • Cross-incident learning at 1.0 confidence; control-loop: 3/4 scenarios diverged
Why this matters: Turns "we couldn't reproduce it" into a named, replayable divergence at an exact event index.
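The first-divergence check itself is simple to illustrate. The sketch below is in Python for brevity (the project itself is C++17 and Swift), and `first_divergence` is a hypothetical helper name. It compares the expected and actual traces shown above, position by position, and reports the first index where they differ.

```python
from itertools import zip_longest
from typing import Optional, Sequence


def first_divergence(
    expected: Sequence[str], actual: Sequence[str]
) -> Optional[int]:
    """Return the 0-based index of the first mismatching event, or None.

    zip_longest pads the shorter trace with None, so a trace that ends
    early is reported as diverging at the first missing event.
    """
    for index, (want, got) in enumerate(zip_longest(expected, actual)):
        if want != got:
            return index
    return None


expected = ["gpio_edge", "irq_assert", "isr_enter", "gpio_ack", "irq_clear"]
actual = ["gpio_edge", "irq_assert", "isr_enter", "gpio_edge", "register_read"]

divergence = first_divergence(expected, actual)
print(divergence)  # 3
```

Index 3 is where `gpio_ack` was expected but a second `gpio_edge` arrived, matching the divergence flagged above.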
Index 3 isolated · 1.0 confidence · C++17 · Swift · CMake
FairEval-Suite · AI Infra · ML Platform
Decides whether to ship a model update by catching regressions that an average score hides.

Regression Gating for GenAI

Baseline: avg 0.794 · 100% pass · SHIP
Candidate: avg 0.000 · 0% pass · BLOCK
Gemini Flash: avg 0.367 · 40% pass · BLOCK
AMD serving: quality ✓ · p95 +47.1% · BLOCK
  • Welch t-test + chi-squared, with p-values reported as 0.0 (below reporting precision): a structural regression, not noise
  • Hardware-aware gate: blocks on serving latency even when output quality holds
p=0.0 significance · AMD hardware gate · Python · FastAPI · PyTorch
AccelSim-Lite · Systems · Performance
Named which pipeline stage is the binding constraint — not just that a workload is slow.

Accelerator Bottleneck Simulator

Workload         Throughput     Bottleneck
compute_heavy    0.33 ops/cy    WaitingDependency
memory_heavy     0.14 ops/cy    NoMemoryPort ← 2.4×
queue_pressure   0.32 ops/cy    Dep + ComputeUnit
  • Memory pressure: ~2.4× throughput degradation — named stall classification per cycle identifies the correct remediation
2.4× degradation quantified · C++17 · CMake
AgentGrid · Agentic Systems · Developer Tooling
Converts unstructured operational documents into structured risk signals, owner routing, and action summaries.

LangGraph Document Triage Agent

  • LangGraph multi-step stateful workflow: classification → issue extraction → severity scoring → owner routing → action generation; deterministic graph nodes with typed shared state contract
  • Machine-readable JSON output; CLI-driven, pytest-backed, 3 scenarios 100% pass
Why this matters: Encodes triage logic that currently lives in engineers' heads, making it consistent, testable, and pipeline-integrable.
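The shape of such a workflow (deterministic nodes passing a typed shared state) can be sketched in plain Python. This deliberately does not use the LangGraph API. The `TriageState` fields, keyword rules, and owner names are hypothetical stand-ins, not AgentGrid's actual logic.

```python
import json
from typing import Callable, List, TypedDict


class TriageState(TypedDict, total=False):
    """Typed shared state passed from node to node."""
    text: str
    doc_type: str
    issues: List[str]
    severity: str
    owner: str
    action: str


# Hypothetical rules, for illustration only.
ISSUE_KEYWORDS = ("missing", "delay", "hazard")
OWNERS = {
    "safety_notice": "safety_lead",
    "change_order": "project_manager",
    "rfi": "design_team",
}


def classify(state: TriageState) -> TriageState:
    text = state["text"].lower()
    if "safety" in text:
        doc_type = "safety_notice"
    elif "change order" in text:
        doc_type = "change_order"
    else:
        doc_type = "rfi"
    return {**state, "doc_type": doc_type}


def extract_issues(state: TriageState) -> TriageState:
    sentences = [s.strip() for s in state["text"].split(".") if s.strip()]
    issues = [s for s in sentences if any(k in s.lower() for k in ISSUE_KEYWORDS)]
    return {**state, "issues": issues}


def score_severity(state: TriageState) -> TriageState:
    if state["doc_type"] == "safety_notice":
        severity = "high"
    elif state["issues"]:
        severity = "medium"
    else:
        severity = "low"
    return {**state, "severity": severity}


def route_owner(state: TriageState) -> TriageState:
    return {**state, "owner": OWNERS[state["doc_type"]]}


def build_action(state: TriageState) -> TriageState:
    action = (
        f"{state['owner']}: review {len(state['issues'])} issue(s), "
        f"severity {state['severity']}"
    )
    return {**state, "action": action}


# Fixed node order mirrors the flow: classification, issue extraction,
# severity scoring, owner routing, action generation.
PIPELINE: List[Callable[[TriageState], TriageState]] = [
    classify, extract_issues, score_severity, route_owner, build_action,
]


def run_triage(text: str) -> TriageState:
    state: TriageState = {"text": text}
    for node in PIPELINE:
        state = node(state)
    return state


result = run_triage(
    "Safety notice: scaffold hazard on level 3. Inspection report missing."
)
print(json.dumps(result, indent=2))
```

Because every node is a pure function over the same state type, each step can be unit-tested in isolation, which is the property that makes scenario-based pytest coverage practical.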
3 scenarios · 100% pass · LangGraph · Python · CLI
RFI / Change Order / Safety Notice → Classification → Issue Extraction → Severity Scoring → Owner Routing → Action Summary JSON

4 merged PRs in the Temporal Go SDK + 2 Azure SDK PRs in review.

Fixed async future chaining where ready futures could still block callers
Resolved a bug where already-resolved futures in the workflow test environment could still cause callers to block, breaking async execution guarantees.
Fixed OnWorkflow mock to observe propagated context headers
Applied workflow context propagation to mock execution so OnWorkflow matchers see the same headers as real workflow execution.
Fixed goroutine leak in child-workflow test environment
Child workflows could block on an unclosed doneChannel. Added idempotent closure with sync.Once and a regression test that fails without the fix.
Restored workflow poller type assignment in scalable task pollers
Wired poller type assignment into scalable task pollers, restoring sticky vs non-sticky distinction used by poller balancing.
Azure #26051 · In Review
Surfaced silently dropped transport errors in azcore retry policy
Composed realClose() transport failures with request errors using errors.Join so callers can inspect retry-path failures instead of losing them silently.
Azure #26106 · In Review
Implemented W3C Trace Context propagation in azcore HTTP tracing
Added traceparent and tracestate propagation via OpenTelemetry propagators and validated header injection with tests.
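For readers unfamiliar with the header involved, here is a small sketch of the W3C `traceparent` format (version `00`: a 32-hex-digit trace ID, a 16-hex-digit parent span ID, and 2 hex digits of flags). It is written in Python for illustration and is not the azcore or OpenTelemetry code from the PR. `make_traceparent` and `parse_traceparent` are hypothetical helpers, and `tracestate` handling is omitted.

```python
import re
import secrets
from typing import Optional, Tuple

# Version 00 layout: 00-<trace-id: 32 hex>-<parent-id: 16 hex>-<flags: 2 hex>
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def make_traceparent(trace_id: Optional[str] = None, sampled: bool = True) -> str:
    """Build a header for an outgoing request.

    Reuses the caller's trace ID when one is given and always mints a
    fresh span ID for the new hop.
    """
    trace_id = trace_id or secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header: str) -> Optional[Tuple[str, str, bool]]:
    """Return (trace_id, parent_span_id, sampled), or None if invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid under the specification.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return trace_id, span_id, bool(int(flags, 16) & 0x01)


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
parsed = parse_traceparent(incoming)
assert parsed is not None
outgoing = make_traceparent(trace_id=parsed[0], sampled=parsed[2])
print(outgoing)
```

Propagation means the trace ID survives across the hop while the span ID changes, which is what lets a tracing backend stitch the client and server spans into one trace.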
Languages
Python · Go · C++17 · Swift · TypeScript · Java · SQL
Systems & Correctness
Idempotency · Fencing tokens · Deterministic replay · State machines · Retries / backoff
Reliability Engineering
Chaos testing · Regression gating · Release safety · Failure mode analysis · SLO tracking
Backend & APIs
FastAPI · REST · Pydantic · Node.js · React · Next.js
Observability
Prometheus · Grafana · OpenTelemetry · Structured logging
Infrastructure
PostgreSQL · SQLite · Docker · Kubernetes · GitHub Actions · Terraform
ML & Evaluation
PyTorch · HuggingFace Transformers · DistilBERT · Eval pipelines · Statistical gating
Education
MS CS · UF · GPA 3.8
Distributed Systems · Networks · Algorithms · Security · NLP
Software Engineer · Current
Cheenti Digital LLC · Remote
Feb 2026 – Present
  • Built 4-phase internal platform unifying analytics, search, campaign, and website-performance workflows into one reporting and monitoring system
  • Developed automated diagnostics covering 5+ technical issue classes: crawlability, broken links, redirect chains, metadata gaps, sitemap issues
  • Built monitoring workflows across 4 performance dimensions to surface regressions earlier than periodic reporting
DevSecOps Intern
Thales Group · Plantation, FL
Jun 2025 – Aug 2025
  • Built Python backend processing ~100k state-transition records per run; computed per-resource utilization, queue depth, and efficiency across HSM resource pools (payShield 10K, Luna HSM)
  • Replaced frontend JavaScript state computation with deterministic backend state engine; REST endpoints exposing real-time HSM state, queue depth, idle/recovery counts, and time-in-state from PostgreSQL event logs
  • Implemented configurable time-window efficiency analysis (24h–N days) via delta-based evaluation; exposed via REST APIs
  • Built internal dashboard for DevOps/engineering teams showing per-resource-type efficiency charts across HSM states: active, idle, queued, recovery, validation, error
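The time-in-state computation behind metrics like these can be sketched as follows. This is an illustrative reconstruction, not the Thales code: the event tuple layout, the `time_in_state` helper, the resource name `hsm-1`, and the sample timestamps are all assumptions. Each transition closes the interval for the previous state, and the final state is closed at the end of the analysis window.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (timestamp, resource_id, new_state)
Event = Tuple[datetime, str, str]


def time_in_state(
    events: Iterable[Event], window_end: datetime
) -> Dict[str, Dict[str, timedelta]]:
    """Sum the time each resource spent in each state up to window_end."""
    totals: Dict[str, Dict[str, timedelta]] = defaultdict(
        lambda: defaultdict(timedelta)
    )
    current: Dict[str, Tuple[datetime, str]] = {}  # resource -> (since, state)

    for ts, resource, state in sorted(events):
        if resource in current:
            since, previous = current[resource]
            totals[resource][previous] += ts - since
        current[resource] = (ts, state)

    # Close the still-open interval for each resource at the window end.
    for resource, (since, state) in current.items():
        totals[resource][state] += window_end - since

    return {resource: dict(states) for resource, states in totals.items()}


t0 = datetime(2025, 7, 1, 0, 0)
events = [
    (t0, "hsm-1", "active"),
    (t0 + timedelta(minutes=10), "hsm-1", "idle"),
    (t0 + timedelta(minutes=25), "hsm-1", "active"),
]
usage = time_in_state(events, window_end=t0 + timedelta(minutes=60))

active = usage["hsm-1"]["active"]
total = sum(usage["hsm-1"].values(), timedelta())
efficiency = active / total
print(usage["hsm-1"], efficiency)
```

In this example the resource is active for 45 of 60 minutes, so efficiency is 0.75. Changing `window_end` (or filtering events to a window) is one way a configurable time-window analysis could be layered on top.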
Graduate Assistant
University of Florida · Gainesville, FL
Dec 2024 – Dec 2025
  • Operated and improved production scheduling system used by ~600–800 weekly users; diagnosed live failures and restored correctness during active usage
Let's work together.

Looking for backend, platform, QA automation, or reliability engineering roles. I build systems that prevent failures before production.

New grad · Dec 2025 · Open to relocation