New grad · Dec 2025 · Open to relocation

Kriti Behl

New-grad software engineer building backend and distributed systems that stay correct under failure. Built production backend systems at Thales Group, contributed merged fixes to the Temporal Go SDK, and built evidence-backed projects: 0 duplicate commits across 1,500 fault-injected race reproductions, and resilience checks that catch unsafe behavior even when probes still report healthy.

1,500 Fault-Injected Race Runs: 0 duplicate commits · 1,500 stale writes blocked
5 Open Source PRs: 2 merged Temporal · 2 Azure in review · 1 Temporal open
86/100 Resilience Score: 8s recovery · KubePulse validation
100k Production Records / Run: Backend analytics at Thales Group
01

Projects

These projects are built to show not just that systems work, but what happens when they fail, drift, recover, or become unsafe in ways surface health checks can miss.

Measures whether services truly recover — not just whether probes say so.
Why it matters: shows that I can evaluate real recovery behavior instead of trusting green probes.
  • YAML disruption scenarios (CPU stress, pod kills, network partition) with baseline-vs-observed comparison and composite resilience scorecards
  • Readiness false-positive detection: surfaces cases where probes report healthy while service metrics still show degradation
  • CPU-stress validated: 8s recovery · ~210ms p95 · ~2% error rate · resilience score 86/100
8s recovery · ~210ms p95 · 86/100 score · Python · Kubernetes · Prometheus
Raw CI logs → structured incidents with config-driven rules, audit log, and operator replay.
Why it matters: shows operator-facing incident triage, structured failure analysis, and debugging workflows.
  • 11 failure families · config/rules.yaml drives detection patterns, ownership hints, and remediation — no backend code changes required
  • Admin audit log (rule ID, actor, timestamp, before/after state) · python cli.py replay <incident_id> for repeatable triage
  • 11 FastAPI endpoints · 5 Prometheus counters · 16 passing tests · runbook in docs/runbook.md
11 failure families · 16 tests · Config-driven YAML · Incident replay · Python · FastAPI · SQLite
C++17 + Swift toolchain that turns flaky concurrent failures into reproducible root-cause artifacts.
Why it matters: shows deterministic debugging and first-divergence isolation for hard-to-reproduce failures.
  • Deterministically isolated first divergence at event index 5 across a 20-event trace · preserves 4 artifacts per run
  • Swift companion (DetTraceAnalyzer): async/await, actor-isolated AnalysisStore, JSON + Markdown reports · 3 passing tests
Event index 5 isolated · 4 artifacts/run · 3 Swift tests · C++17 · Swift
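First-divergence isolation can be illustrated with a minimal Go sketch, assuming each trace is an ordered list of events compared index by index. The `Event` type and `firstDivergence` function are hypothetical names for illustration and are not taken from the project's C++17 or Swift code.

```go
package main

import "fmt"

// Event is a hypothetical trace record: which thread performed which operation.
type Event struct {
	Thread int
	Op     string
}

// firstDivergence returns the index of the first event where the two traces
// differ, or -1 if they are identical. If one trace is a strict prefix of the
// other, the divergence index is the length of the shorter trace.
func firstDivergence(expected, observed []Event) int {
	n := len(expected)
	if len(observed) < n {
		n = len(observed)
	}
	for i := 0; i < n; i++ {
		if expected[i] != observed[i] {
			return i
		}
	}
	if len(expected) != len(observed) {
		return n
	}
	return -1
}

func main() {
	expected := []Event{{1, "lock"}, {1, "write"}, {1, "unlock"}, {2, "lock"}}
	observed := []Event{{1, "lock"}, {1, "write"}, {2, "lock"}, {1, "unlock"}}
	// The traces agree on indexes 0 and 1 and first differ at index 2.
	fmt.Println("first divergence at index", firstDivergence(expected, observed))
}
```

Reporting only the first differing index keeps the artifact small: everything before it is known-good, so debugging can start from a single event.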
Blocks degraded model releases before they ship — dataset-driven, regression-gated, fully automated.
Why it matters: shows release gating and regression prevention for ML systems before bad changes ship.
  • Pipeline: cases.jsonl → DistilBERT inference → RAG-overlap or classification-label scorer → runs/ → reports/ → compare/ → gate/
  • Release gate blocks on avg score drop / pass-rate drop / per-case regressions · validated by test_real_model_regression_gate.py
  • FastAPI: POST /evaluate, /compare, /gate · full CLI · Dockerfile · 11 tests
11 tests · 3 API endpoints · DistilBERT · 4 artifact stages · Python · PyTorch · FastAPI
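The three gate conditions named above (average score drop, pass-rate drop, per-case regressions) can be sketched roughly as follows. This is a Go illustration under stated assumptions, not the project's Python implementation: the `Run` and `Thresholds` types, the `gate` function, and every threshold value are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// Run summarizes one evaluation run. Field names are hypothetical.
type Run struct {
	Scores map[string]float64 // per-case score keyed by case ID
	Passed map[string]bool    // per-case pass/fail keyed by case ID
}

// Thresholds holds illustrative tolerances, not the project's real values.
type Thresholds struct {
	MaxAvgDrop      float64
	MaxPassRateDrop float64
	MaxCaseDrop     float64
}

func meanScore(r Run) float64 {
	if len(r.Scores) == 0 {
		return 0
	}
	sum := 0.0
	for _, s := range r.Scores {
		sum += s
	}
	return sum / float64(len(r.Scores))
}

func passRate(r Run) float64 {
	if len(r.Passed) == 0 {
		return 0
	}
	n := 0
	for _, ok := range r.Passed {
		if ok {
			n++
		}
	}
	return float64(n) / float64(len(r.Passed))
}

// gate returns the reasons to block a release; an empty result means pass.
func gate(base, cand Run, t Thresholds) []string {
	var reasons []string
	if d := meanScore(base) - meanScore(cand); d > t.MaxAvgDrop {
		reasons = append(reasons, fmt.Sprintf("avg score dropped by %.2f", d))
	}
	if d := passRate(base) - passRate(cand); d > t.MaxPassRateDrop {
		reasons = append(reasons, fmt.Sprintf("pass rate dropped by %.2f", d))
	}
	// Sort case IDs so the per-case report is deterministic.
	ids := make([]string, 0, len(base.Scores))
	for id := range base.Scores {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	for _, id := range ids {
		c, ok := cand.Scores[id]
		if !ok {
			reasons = append(reasons, "case "+id+" missing from candidate")
			continue
		}
		if base.Scores[id]-c > t.MaxCaseDrop {
			reasons = append(reasons, "case "+id+" regressed")
		}
	}
	return reasons
}

func main() {
	base := Run{
		Scores: map[string]float64{"a": 0.9, "b": 0.8},
		Passed: map[string]bool{"a": true, "b": true},
	}
	cand := Run{
		Scores: map[string]float64{"a": 0.9, "b": 0.4},
		Passed: map[string]bool{"a": true, "b": false},
	}
	reasons := gate(base, cand, Thresholds{MaxAvgDrop: 0.05, MaxPassRateDrop: 0.1, MaxCaseDrop: 0.2})
	if len(reasons) > 0 {
		fmt.Println("BLOCK:", reasons)
	} else {
		fmt.Println("PASS")
	}
}
```

Checking per-case drops separately from the averages matters because a single badly regressed case can hide behind an aggregate that barely moves.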
02

Open Source Impact

Open-source contributions to production SDKs: 2 merged PRs and 1 open PR in the Temporal Go SDK, plus 2 PRs under review in the Azure Go SDK.

Temporal #2212
Fixed OnWorkflow mock to observe propagated context headers
Applied workflow context propagation to mock execution so OnWorkflow matchers see the same headers as real workflow execution.
Merged
Temporal #2200
Fixed goroutine leak in child-workflow test environment
Child workflows could block on an unclosed doneChannel. Added idempotent closure with sync.Once and a regression test that fails without the fix and passes with it.
Merged
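The idempotent-closure pattern behind this fix can be sketched in a few lines of Go. This is a minimal illustration, assuming a completion channel that several code paths may try to close. The `childEnv` type and `markDone` method are hypothetical stand-ins, not the SDK's actual code; only `sync.Once` and the `doneChannel` name come from the description above.

```go
package main

import (
	"fmt"
	"sync"
)

// childEnv is a hypothetical stand-in for a test environment that signals
// completion by closing a channel.
type childEnv struct {
	doneChannel chan struct{}
	closeOnce   sync.Once
}

func newChildEnv() *childEnv {
	return &childEnv{doneChannel: make(chan struct{})}
}

// markDone closes doneChannel exactly once. Calling close directly from
// several completion paths would panic on the second call, and skipping the
// close entirely would leave waiters blocked.
func (e *childEnv) markDone() {
	e.closeOnce.Do(func() { close(e.doneChannel) })
}

func main() {
	env := newChildEnv()

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		<-env.doneChannel // would block forever if no path closed the channel
	}()

	env.markDone()
	env.markDone() // safe: the second call is a no-op
	wg.Wait()
	fmt.Println("waiter released")
}
```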
Temporal #2248
Restored workflow poller type assignment in scalable task pollers
Wired poller type assignment into scalable task pollers, restoring sticky vs. non-sticky distinction used by poller balancing and adding regression coverage.
In Review
Azure #26051
Surfaced silently dropped transport errors in azcore retry policy
Composed realClose() transport failures with request errors using errors.Join so callers can inspect retry-path failures instead of losing them silently.
In Review
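A minimal sketch of the error-composition idea, assuming Go 1.20 or later for `errors.Join`. The `finish` helper, the `closeFn` parameter, and the two sentinel errors are hypothetical and only illustrate the pattern; they are not the azcore code.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors used only for this illustration.
var (
	errRequest = errors.New("request failed")
	errClose   = errors.New("close failed")
)

// finish combines the request error with any error returned by closing the
// transport. errors.Join drops nil values and returns nil when both are nil,
// so the success path is unchanged.
func finish(reqErr error, closeFn func() error) error {
	return errors.Join(reqErr, closeFn())
}

func main() {
	err := finish(errRequest, func() error { return errClose })
	// Both causes stay inspectable through errors.Is.
	fmt.Println(errors.Is(err, errRequest), errors.Is(err, errClose)) // true true
}
```

Because the joined error unwraps to both causes, callers can still match on the original request error while also seeing the close failure.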
Azure #26106
Implemented W3C Trace Context propagation in azcore HTTP tracing
Added traceparent and tracestate propagation via OpenTelemetry propagators and validated header injection with tests.
In Review
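For reference, the `traceparent` value defined by W3C Trace Context has four dash-separated fields: version, trace ID, parent ID, and trace flags. The sketch below builds the header by hand with the standard library purely to show that wire format. The PR itself is described as using OpenTelemetry propagators, so `spanContext`, `traceparent`, and `inject` here are hypothetical illustration names, not the azcore or OpenTelemetry API.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"net/http"
)

// spanContext is a hypothetical minimal span identity.
type spanContext struct {
	TraceID [16]byte
	SpanID  [8]byte
	Sampled bool
}

// traceparent formats the header value as version-traceid-parentid-flags,
// using version "00" and lowercase hex for the IDs.
func traceparent(sc spanContext) string {
	flags := "00"
	if sc.Sampled {
		flags = "01"
	}
	return "00-" + hex.EncodeToString(sc.TraceID[:]) + "-" +
		hex.EncodeToString(sc.SpanID[:]) + "-" + flags
}

// inject sets traceparent, and tracestate when non-empty, on an outgoing request.
func inject(req *http.Request, sc spanContext, tracestate string) {
	req.Header.Set("traceparent", traceparent(sc))
	if tracestate != "" {
		req.Header.Set("tracestate", tracestate)
	}
}

func main() {
	sc := spanContext{
		TraceID: [16]byte{0x4b, 0xf9},
		SpanID:  [8]byte{0x00, 0xf0},
		Sampled: true,
	}
	// Building the request does not touch the network.
	req, err := http.NewRequest(http.MethodGet, "https://example.com", nil)
	if err != nil {
		fmt.Println("build request:", err)
		return
	}
	inject(req, sc, "vendor=value")
	fmt.Println(req.Header.Get("traceparent"))
	fmt.Println(req.Header.Get("tracestate"))
}
```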
03

Skills & Stack

Languages
Python · Go · C++17 · Swift · TypeScript · Java · SQL
Systems & Correctness
Idempotency · Fencing tokens · Deterministic replay · State machines · Retries / backoff
Reliability Engineering
Chaos testing · Regression gating · Release safety · Failure mode analysis
Backend & APIs
FastAPI · REST · Pydantic · Node.js · React · Next.js
Observability
Prometheus · Grafana · OpenTelemetry · Structured logging
Runtime & Infrastructure
PostgreSQL · SQLite · Docker · Kubernetes · GitHub Actions
ML & Evaluation
PyTorch · HuggingFace Transformers · DistilBERT · Eval pipelines
Education
MS CS · UF · GPA 3.8 · Distributed Systems · Networks · Algorithms · Security · NLP
04

Experience

DevSecOps Intern
Jun – Aug 2025
Thales Group · Plantation, FL
  • Built a PostgreSQL-backed backend analytics service processing ~100k state-transition records per run, giving operations teams real-time visibility into resource utilization across distributed pools.
  • Designed deterministic state-resolution logic and timestamp-delta aggregation over historical event logs to compute configurable utilization metrics across 24-hour to 30-day reporting windows.
  • Built REST APIs and operational dashboards for resource- and group-level efficiency reporting, enabling capacity planners to identify underutilized resources and optimization opportunities without affecting live request-processing paths.
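Timestamp-delta aggregation of this kind can be sketched as follows. This is a generic Go illustration, not the Thales implementation: the `Transition` record, the `utilization` function, the use of Unix seconds, and the two-state in-use/idle model are all assumptions.

```go
package main

import (
	"fmt"
	"sort"
)

// Transition is a hypothetical state-transition record for one resource.
type Transition struct {
	At    int64 // Unix seconds at which the resource enters the state
	InUse bool  // true if the resource becomes in use, false if it becomes idle
}

// utilization returns the fraction of the window [start, end) that the
// resource spent in use. Events before the window only establish the
// initial state; events at or after end are ignored.
func utilization(events []Transition, start, end int64) float64 {
	if end <= start {
		return 0
	}
	// Copy and stable-sort so the result does not depend on input order.
	sorted := append([]Transition(nil), events...)
	sort.SliceStable(sorted, func(i, j int) bool { return sorted[i].At < sorted[j].At })

	inUse := false
	cursor := start
	var busy int64
	for _, ev := range sorted {
		if ev.At >= end {
			break
		}
		if ev.At > cursor {
			if inUse {
				busy += ev.At - cursor // time spent in use since the last event
			}
			cursor = ev.At
		}
		inUse = ev.InUse
	}
	if inUse {
		busy += end - cursor // the resource was still in use at window end
	}
	return float64(busy) / float64(end-start)
}

func main() {
	events := []Transition{
		{At: -3600, InUse: true}, // already in use before the window opens
		{At: 21600, InUse: false},
		{At: 64800, InUse: true},
	}
	// 24-hour window: in use for the first 6 hours and the last 6 hours.
	fmt.Println(utilization(events, 0, 86400)) // 0.5
}
```

Changing `start` and `end` is all that is needed to move between reporting windows, which is one plausible way to make the window length configurable.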
Software Development Intern
May – Aug 2024
Elixir Web Solutions · New Delhi, India
  • Built backend REST services on AWS using Node.js and Express, strengthening API behavior with improved input validation and structured error handling.
  • Optimized database query execution plans and indexing, reducing endpoint latency by ~15–25% in performance tests.
Software Engineering Intern
Jun – Aug 2023
C1 India Pvt Ltd · Gurugram, India
  • Built Java backend modules for procurement workflows with transactional safeguards and log-recovery simulation for safer pre-production behavior.
05

Selected Writing

06

Contact

I’m targeting backend, infrastructure, reliability, and production engineering roles where correctness under failure and system behavior under degradation actually matter.
