Backend · Platform · AI Infrastructure

Complex systems rarely fail for obvious reasons.

I build tools that make their behavior observable, measurable and actionable.

Run a failure scenario ↓ · Open KubePulse lab ↗ · GitHub

Open to backend, platform, reliability and production engineering roles

Temporal OSS: 5 merged PRs · Azure SDK: merged contribution · MLH / Meta PE: Production Engineering Track · Thales: 100k+ telemetry records · UF MS CS: 3.8 GPA

SYSTEM FAILURE OBSERVATORY · LIVE / RUN #287

NOMINAL

Request: HEALTHY

AgentGrid: HEALTHY

Faultline: HEALTHY

KubePulse: HEALTHY

FairEval: HEALTHY

DetTrace: HEALTHY

Release: HEALTHY

RESET · READY

Evidence, not adjectives

Proof, not adjectives.

Maintainer-reviewed code. Production signals. Measured failure behavior.

5 merged Temporal PRs

100k+ telemetry records

1,500+ injected failures

73 AgentGrid tests

0.0% duplicate commits

+608% latency regression detected

One operating model

Break it. Watch it. Prove it.

Inject

Controlled failure

Observe

Telemetry + traces

Enforce

Reject or block

Explain

Reproducible proof

Flagship system · interactive

AgentGrid

Slide confidence. Routing changes.

View GitHub ↗

01 Request: accepted

02 Retriever: confidence 0.42

03 Tool runtime: success

04 Evaluation: review

05 Decision: human

REVIEW: human required

Workflow quality → reviewable.
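
The routing panel above can be read as a threshold rule: when retriever confidence drops, the decision moves from automatic to human review. A minimal sketch of that idea, not AgentGrid's actual code. The function name, the `Decision` type and the 0.70 cutoff are assumptions made for illustration; the page only shows that a confidence of 0.42 ends in human review.

```python
from dataclasses import dataclass

# Hypothetical cutoff: the page does not state the real threshold.
REVIEW_THRESHOLD = 0.70


@dataclass(frozen=True)
class Decision:
    route: str   # "auto" or "human"
    reason: str  # short explanation a reviewer can read


def route_by_confidence(confidence: float,
                        threshold: float = REVIEW_THRESHOLD) -> Decision:
    """Send low-confidence results to a human instead of auto-approving."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    if confidence < threshold:
        return Decision("human", f"confidence {confidence:.2f} below {threshold:.2f}")
    return Decision("auto", f"confidence {confidence:.2f} meets {threshold:.2f}")
```

With the assumed cutoff, `route_by_confidence(0.42).route` is `"human"`, matching the panel.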

Correctness debugger · replayable

Faultline

41 expires. 42 wins.

View GitHub ↗

INCIDENT #2847 · STATE PRESERVED

01 worker_a: expired · 02 worker_b: owner · 03 validator: reject

worker_a · old owner · 41

worker_b · current owner · 42

FENCING VALIDATOR

STALE WRITE REJECTED

CORRECTNESS GUARANTEE: Current owner only. STATE PRESERVED

0 dupes · 1.5k faults · 37 rejected

Stale writes rejected. State preserved.
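
The incident above follows the usual fencing-token pattern: each lease carries a number that only goes up, and the store refuses any write carrying a number older than the highest it has seen. A minimal sketch under that assumption. `FencedStore` and its fields are hypothetical names, not Faultline's API, and reading 41 and 42 as fencing tokens is an interpretation of the panel.

```python
class FencedStore:
    """Toy store that accepts writes only from the newest lease holder.

    Assumes tokens are integers that increase with every new lease.
    """

    def __init__(self) -> None:
        self.highest_token = 0   # newest token seen so far
        self.value = None        # last accepted value
        self.rejected = 0        # count of stale writes refused

    def write(self, token: int, value: str) -> bool:
        # A token older than the newest one seen means the writer's
        # lease expired and another worker has taken over.
        if token < self.highest_token:
            self.rejected += 1
            return False
        self.highest_token = token
        self.value = value
        return True


store = FencedStore()
store.write(42, "from worker_b")   # current owner: accepted
store.write(41, "from worker_a")   # expired owner: rejected, state unchanged
```

After both calls, `store.value` is still `"from worker_b"` and `store.rejected` is 1.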

Network production engineering · interactive diagnostic

KubePulse

Fault → layer → fix.

View GitHub ↗

01 Client ✓

02 DNS -

03 TCP -

04 HTTP -

05 Service -

06 DB -

BROKEN LAYER FOUND
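
The layer list above suggests a top-down walk: probe each layer in order and stop at the first one that fails. A minimal sketch of that walk, assuming each layer exposes a probe that returns `True` when healthy. The probe callables here are stand-ins; how KubePulse actually checks DNS, TCP or HTTP is not described on this page.

```python
from typing import Callable, Iterable, Optional, Tuple

# Layer order as listed on the page.
LAYERS = ("Client", "DNS", "TCP", "HTTP", "Service", "DB")


def first_broken_layer(
    probes: Iterable[Tuple[str, Callable[[], bool]]]
) -> Optional[str]:
    """Return the name of the first layer whose probe fails, or None."""
    for name, probe in probes:
        try:
            healthy = probe()
        except Exception:
            # A probe that raises is treated as a failed layer.
            healthy = False
        if not healthy:
            return name
    return None


# Stand-in probes: every layer is healthy except TCP.
probes = [(name, (lambda n=name: n != "TCP")) for name in LAYERS]
broken = first_broken_layer(probes)
```

Stopping at the first failure keeps the report focused on one layer, since everything below a broken layer would fail too.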

Depth proof · two live instruments

Evidence changes the decision.

GOVERNANCE BOARD · FairEval ↗

RELEASE DECISION: BLOCK

Groundedness score 82% · Serving regression +42%

Quality PASS · Safety PASS · Groundedness REVIEW · Serving BLOCK

Quality gate, not guesswork.
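
The board above is consistent with a worst-status-wins rule: one BLOCK blocks the release, otherwise one REVIEW holds it for review. A minimal sketch of that aggregation. The rule itself is an assumption inferred from the four statuses shown, and `release_decision` is a hypothetical name, not FairEval's interface.

```python
# Higher number means more severe.
SEVERITY = {"PASS": 0, "REVIEW": 1, "BLOCK": 2}


def release_decision(checks: dict) -> str:
    """Return the most severe status among the individual gate results."""
    if not checks:
        raise ValueError("at least one check is required")
    # An unknown status raises KeyError rather than passing silently.
    return max(checks.values(), key=lambda status: SEVERITY[status])


board = {
    "Quality": "PASS",
    "Safety": "PASS",
    "Groundedness": "REVIEW",
    "Serving": "BLOCK",
}
decision = release_decision(board)
```

For the board shown, `decision` is `"BLOCK"`.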

TRACE VIEWER · DetTrace ↗

SCRUB TO INCIDENT · DIVERGENCE @ STEP 438

expected

replayed

FIRST DIVERGENCE ISOLATED

First divergence found.
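
Finding the first divergence comes down to comparing the expected and replayed traces step by step and reporting the earliest index where they differ. A minimal sketch, assuming traces are plain sequences of comparable events and steps are counted from zero. The real DetTrace trace format is not shown on this page.

```python
from itertools import zip_longest
from typing import Optional, Sequence

_MISSING = object()  # marks a step that exists in only one trace


def first_divergence(expected: Sequence, replayed: Sequence) -> Optional[int]:
    """Return the first step index where the traces differ, or None."""
    pairs = zip_longest(expected, replayed, fillvalue=_MISSING)
    for step, (exp, rep) in enumerate(pairs):
        if exp != rep:
            return step
    return None


expected = list(range(500))
replayed = list(expected)
replayed[438] = -1          # simulate a replay that drifts at step 438
step = first_divergence(expected, replayed)
```

A trace that ends early counts as diverging at the first missing step.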

Open source

Maintainers accepted the work.

Reviewed and merged outside my repositories.

Temporal PRs ↗ · Azure SDK PR ↗

#2200 · Workflow runtime reliability · Merged

#2212 · Workflow mock headers · Merged

#2248 · Poller instrumentation · Merged

#2298 · Async completion correctness · Merged

#2367 · Worker rate-limit caveat · Merged

Azure SDK · Retry policy error propagation · Merged

Engineering writing

Writing from real investigations.

Distributed systems. AI reliability. Production engineering.

Reliability: Kubernetes Said Everything Was Healthy. It Wasn't. ↗

Distributed Systems: How I Built a Distributed Job Queue That Stays Correct Under Crashes. ↗

AI Systems: The Most Dangerous AI Failures Don't Crash. They Quietly Look Correct. ↗