Shkumbin Sherifi — AI Systems Engineer

I design and build AI infrastructure, evaluation systems, and agent runtimes, spanning architecture and benchmark methodology through implementation, observability, and production hardening.

AI architecture · Agent infrastructure · Evaluation systems · Benchmark design · PyTorch · MLX

10 years of Google Keep notes turned into a knowledge graph: 1,032 wiki pages, 12,801 nodes, 44,534 edges. Mapped.

Nodes: 12,801
Edges: 44,534
Wiki pages: 1,032

Pipeline

Load all Google Keep Takeout notes (JSON)

Filter relevant notes with Qwen 0.6B classifier

Extract concepts per note with Qwen 27B

Deduplicate and canonicalize concepts with Qwen 27B

Synthesize wiki pages per cluster with Qwen 27B

Build graph.json and render D3 force layout
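
The final pipeline step can be sketched roughly as follows. This is a minimal illustration, assuming each note has already been reduced to a list of canonical concept names. The `build_graph` helper, the co-occurrence edge rule, and the `nodes`/`links` layout (a shape commonly fed to D3 force layouts) are assumptions for illustration, not the project's actual schema.

```python
import json
from collections import Counter
from itertools import combinations


def build_graph(note_concepts):
    """Build a D3-style node/link dict from per-note concept lists.

    Hypothetical rule: two concepts are linked when they appear in the
    same note, and the link weight counts how many notes they share.
    """
    node_counts = Counter()
    edge_counts = Counter()
    for concepts in note_concepts:
        unique = sorted(set(concepts))  # ignore repeats within one note
        node_counts.update(unique)
        edge_counts.update(combinations(unique, 2))

    nodes = [{"id": name, "count": n} for name, n in sorted(node_counts.items())]
    links = [
        {"source": a, "target": b, "weight": w}
        for (a, b), w in sorted(edge_counts.items())
    ]
    return {"nodes": nodes, "links": links}


if __name__ == "__main__":
    demo = [["mlx", "qwen"], ["qwen", "d3", "mlx"], ["d3"]]
    # ensure_ascii=False keeps Albanian and Arabic concept names readable.
    print(json.dumps(build_graph(demo), ensure_ascii=False, indent=2))
```

Writing the returned dict to `graph.json` with `json.dump` would give the renderer a single static file to load.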

Infrastructure

Extracted entirely on-device using local LLMs (Qwen 0.6B + 27B via Apple MLX)

Multilingual: English, Albanian, Arabic with cross-language concept deduplication

Per-concept wiki synthesis across 1,032 generated pages

Interactive D3 timeline spanning 2015 to 2026

Python · MLX · Qwen 0.6B / 27B · D3.js · Local-First

An interactive treemap that turns a GitHub user's public starred repositories into searchable, filterable tiles sized by popularity and colored by language.

Sign-in: GitHub App
Repository data: Public only
License: MIT

Experience

Search and filter starred repositories by language or topic

Open any repository directly from its treemap tile

Explore a bundled preview before signing in

Security and delivery

GitHub OAuth tokens remain on the server and never reach the browser

Signed, short-lived HttpOnly cookie protects the OAuth PKCE flow

Astro and D3 application deployed as a public beta on Render
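
The two security points above can be illustrated with a short sketch. The application itself is built with Astro, so this Python version is only an illustration of the technique, not the deployed code. The PKCE pair follows the S256 method from RFC 7636. The `sign_value` and `verify_value` helpers, the `value.timestamp.mac` token layout, and the 600-second lifetime are hypothetical choices for showing how a signed, short-lived cookie value might carry the verifier.

```python
import base64
import hashlib
import hmac
import secrets
import time


def make_code_verifier(n_bytes=32):
    """Return a random URL-safe PKCE code verifier (43 chars for 32 bytes)."""
    raw = secrets.token_bytes(n_bytes)
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def code_challenge_s256(verifier):
    """Derive the S256 challenge: BASE64URL(SHA256(verifier)), no padding."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def _mac(payload, key):
    return hmac.new(key, payload.encode("utf-8"), hashlib.sha256).hexdigest()


def sign_value(value, key, now=None):
    """Hypothetical cookie format: value.issued_at.hmac (value must not contain '.')."""
    issued = int(time.time() if now is None else now)
    payload = f"{value}.{issued}"
    return f"{payload}.{_mac(payload, key)}"


def verify_value(token, key, max_age=600, now=None):
    """Return the original value, or None if the token is malformed, tampered, or expired."""
    try:
        value, issued, mac = token.rsplit(".", 2)
        issued_at = int(issued)
    except ValueError:
        return None
    expected = _mac(f"{value}.{issued}", key)
    if not hmac.compare_digest(mac.encode("utf-8"), expected.encode("utf-8")):
        return None
    current = time.time() if now is None else now
    if current - issued_at > max_age:
        return None
    return value
```

In a flow like this, the server would set the signed value in an HttpOnly cookie before redirecting to GitHub, then verify it on the callback and send the verifier with the token exchange, so the access token itself never has to reach the browser.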

Terminal-Bench-style C++ benchmark for testing whether coding agents can solve a hidden 2D heat-equation task under private cases and shared runtime constraints.

Reference error: 0.00113
Error gate: 0.005
Replay baselines: 3

Evaluation Harness

Docker verifier with hidden manufactured solutions

Trusted ADI/Crank-Nicolson numerical reference

Replay baselines for Codex, adversarial Codex, and Claude artifacts

Reference-margin audit strengthened the trusted solver without changing the agent-facing task
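
As a rough sketch of how a manufactured-solution check with an error gate can work: the benchmark's real verifier runs in Docker against hidden cases, so everything below is illustrative. The specific manufactured solution (a decaying sine mode on the unit square with zero boundary values), the relative L2 norm, and the function names are assumptions; only the 0.005 gate value comes from the figures above.

```python
import math

ERROR_GATE = 0.005  # gate value quoted above


def manufactured_u(x, y, t, alpha=1.0):
    """Exact solution of u_t = alpha * (u_xx + u_yy) on the unit square
    with zero Dirichlet boundaries (an assumed test case)."""
    decay = math.exp(-2.0 * math.pi ** 2 * alpha * t)
    return decay * math.sin(math.pi * x) * math.sin(math.pi * y)


def relative_l2_error(candidate, t, alpha=1.0):
    """Relative L2 error of a grid against the manufactured solution.

    candidate[j][i] holds the value at x = i / (nx - 1), y = j / (ny - 1).
    """
    ny, nx = len(candidate), len(candidate[0])
    num = 0.0
    den = 0.0
    for j in range(ny):
        for i in range(nx):
            exact = manufactured_u(i / (nx - 1), j / (ny - 1), t, alpha)
            num += (candidate[j][i] - exact) ** 2
            den += exact ** 2
    return math.sqrt(num / den)


def passes_gate(candidate, t, alpha=1.0, gate=ERROR_GATE):
    """True when the candidate grid is within the error gate."""
    return relative_l2_error(candidate, t, alpha) <= gate
```

Because the exact solution is known in closed form, a check of this kind measures the agent's solver directly rather than comparing it against another numerical method.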

Result

Rejects coarse time stepping, explicit instability, and brute-force over-resolution

Previously passing artifacts now replay to reward 0.0, while the trusted reference remains deterministic

Fresh Claude run produced a failing partial artifact before hitting provider rate limits

A local-first execution layer for agent workflows: model routing, durable queues, and verifier-first gates that keep autonomy measurable and controllable.

Default route: Local
Queue state: Durable
External effects: Gated

Architecture

Routing: MLX local-first, policy-gated fallback providers

Orchestration: SQLite-backed queues with state, retry, recovery

Supervision: containerized service lifecycle management

Resource gating: pauses dispatch under memory pressure

Observability: execution traces + queue-state monitoring

Quality gates (in progress): tests, scans, and approvals before any external effects

Positioning: control plane + gates
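
The orchestration line above (SQLite-backed queues with state, retry, and recovery) can be sketched in a few functions. This is a minimal illustration under stated assumptions: the `jobs` table, its column names, the state names, and the retry limit are hypothetical and not the runtime's actual schema.

```python
import sqlite3
import time

# Hypothetical schema; states are queued, running, done, failed.
SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    payload TEXT NOT NULL,
    state TEXT NOT NULL DEFAULT 'queued',
    attempts INTEGER NOT NULL DEFAULT 0,
    updated_at REAL NOT NULL
)
"""


def open_queue(path=":memory:"):
    # isolation_level=None gives autocommit, so transactions are explicit.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(SCHEMA)
    return conn


def enqueue(conn, payload):
    cur = conn.execute(
        "INSERT INTO jobs (payload, updated_at) VALUES (?, ?)",
        (payload, time.time()),
    )
    return cur.lastrowid


def claim(conn):
    """Atomically move the oldest queued job to 'running'.

    Returns (id, payload), or None when nothing is queued.
    """
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT id, payload FROM jobs WHERE state = 'queued' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE jobs SET state = 'running', attempts = attempts + 1, "
                "updated_at = ? WHERE id = ?",
                (time.time(), row[0]),
            )
        conn.execute("COMMIT")
        return row
    except Exception:
        conn.execute("ROLLBACK")
        raise


def finish(conn, job_id, ok, max_attempts=3):
    """Mark done on success; on failure, requeue until max_attempts is reached."""
    if ok:
        state = "done"
    else:
        attempts = conn.execute(
            "SELECT attempts FROM jobs WHERE id = ?", (job_id,)
        ).fetchone()[0]
        state = "queued" if attempts < max_attempts else "failed"
    conn.execute(
        "UPDATE jobs SET state = ?, updated_at = ? WHERE id = ?",
        (state, time.time(), job_id),
    )


def recover_stale(conn, timeout_s=300.0):
    """Requeue jobs stuck in 'running', for example after a crash."""
    cur = conn.execute(
        "UPDATE jobs SET state = 'queued' WHERE state = 'running' AND updated_at < ?",
        (time.time() - timeout_s,),
    )
    return cur.rowcount
```

Keeping queue state in a single SQLite file means a restarted worker can call something like `recover_stale` and resume, which is one plausible reading of "durable" here.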

Python · MLX · SQLite · Docker

High-performance simulation engine built to evaluate vector synchronization, relational aggregation, and orchestration performance across a multi-service architecture.

Seasons: 10
Player vectors: 4,940
Local runtime: ~16s

Pipeline Stages

Stage 1: Draft Generation

7 rounds × 32 picks
Automated attribute scoring matrices
Prospect generation and ranking logic

Stage 2: Embedding System

4,940 player vectors
PyTorch VAE clustering into 12 archetypes
Low-latency similarity retrieval via ChromaDB

Stage 3: Season Simulation

17-week simulation engine
Play-by-play matchup execution
Seeded variance and home-field weighting
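
Seeded variance and home-field weighting might look something like the sketch below. It is a deliberately simplified stand-in for the play-by-play engine: the single team rating, the Gaussian noise model, and the `home_edge` and `noise_sd` values are all assumptions chosen for illustration.

```python
import random


def simulate_game(home_rating, away_rating, rng, home_edge=2.5, noise_sd=10.0):
    """Return the home team's point margin for one game.

    The margin is the rating gap plus a fixed home-field bonus plus
    Gaussian noise drawn from the supplied (seeded) generator.
    """
    return (home_rating - away_rating) + home_edge + rng.gauss(0.0, noise_sd)


def simulate_season(matchups, ratings, seed):
    """Play every (home, away) matchup and return a win count per team."""
    rng = random.Random(seed)  # one generator per season keeps runs reproducible
    wins = {team: 0 for team in ratings}
    for home, away in matchups:
        margin = simulate_game(ratings[home], ratings[away], rng)
        winner = home if margin >= 0 else away
        wins[winner] += 1
    return wins
```

Passing the same seed reproduces the same season, which makes it possible to compare changes to other stages against an identical schedule of random draws.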

Stage 4: Progression Engine

Physical aging curves
Development trait progression
Rookie growth and veteran regression

Stage 5: Salary Cap & Front Office

Rule-of-51 enforcement
Dead-money calculations
Asset valuation and trade logic
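
Rule-of-51 enforcement can be illustrated with a small calculation, assuming the engine follows the NFL offseason convention in which only the 51 largest cap hits, plus dead money, count against the cap. The function names and the flat list-of-numbers input are hypothetical.

```python
def offseason_cap_charge(cap_hits, dead_money=0, top_n=51):
    """Sum the top_n largest cap hits and add dead money."""
    top = sorted(cap_hits, reverse=True)[:top_n]
    return sum(top) + dead_money


def is_cap_compliant(cap_hits, salary_cap, dead_money=0):
    """True when the Rule-of-51 charge fits under the salary cap."""
    return offseason_cap_charge(cap_hits, dead_money) <= salary_cap
```

With a 60-player roster, for example, the nine smallest contracts would not affect the charge at all, while any dead money always counts in full.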

Supporting Systems

Coaching Layer

Play-calling logic, scheme-fit metrics, and in-game adjustments.

Scouting Engine

Regional grading pipelines and combine evaluation models.

Free Agency Marketplace

Multi-agent contract bidding and team-fit valuation scoring.

Draft Intelligence

Need-weighted board ranking and trade-up/down evaluation.

Infrastructure

DuckDB for structured OLAP analytics across ~1,700 entities

ChromaDB vector storage for 45-dimensional embeddings

Full 10-season franchise lifecycle simulation computed in ~16 seconds locally

Python · PyTorch · DuckDB · ChromaDB · MLX · Docker · Next.js

Local-first transcription pipeline for Albanian (Kosovo dialect) with human-correction feedback loops and iterative fine-tuning workflows.

Audio sources: 901
Quality metrics: WER + CER
Model sizes: 5

Pipeline

901 curated YouTube audio sources processed through Faster-Whisper for segment-level transcription.

Configurable inference models from tiny to large-v3 with language auto-detection and re-ranking.

Human correction feedback loop for continual dataset refinement.

Observability

SQLite-backed tracking for inference latency, correction rates, WER, and CER.

Real-time monitoring dashboard across deployment variants and model sizes.

Structured evaluation workflows for reproducible ASR benchmarking.
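
For reference, WER and CER are both edit distances normalized by the length of the reference, counted over words and characters respectively. The sketch below uses the standard definitions with a plain Levenshtein distance. It is not the pipeline's own scoring code, and it assumes no text normalization (casing, punctuation) beyond whitespace splitting.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Tracking both is useful for a language like Albanian, where a single wrong diacritic counts as a full word error in WER but only one character in CER.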

Whisper · Faster-Whisper · Python · ASR · SQLite · Audio Processing · Web UI

Contact

Open to AI systems, ML infrastructure, and applied AI engineering roles.