Shkumbin Sherifi — AI Systems Engineer


Designing AI systems for inference, retrieval, orchestration, and evaluation across local and cloud infrastructure.

Local inference · Retrieval systems · Multi-model routing · Evaluation systems

Coding. Building. AI Buildings.

High-performance simulation engine built to evaluate vector synchronization, relational aggregation, and orchestration performance across a multi-service architecture.

Pipeline Stages

Stage 1: Draft Generation

  • 7 rounds × 32 picks
  • Automated attribute scoring matrices
  • Prospect generation and ranking logic
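
A minimal sketch of how attribute scoring and ranking could feed a 7 × 32 board. The positions, attribute names, weights, and the pool size of 400 are illustrative assumptions, not values from the project.

```python
import random

ROUNDS, PICKS_PER_ROUND = 7, 32

# Hypothetical position-specific weights; each row of the "matrix" sums to 1.
WEIGHTS = {
    "QB": {"arm": 0.5, "speed": 0.1, "awareness": 0.4},
    "WR": {"arm": 0.0, "speed": 0.6, "awareness": 0.4},
}


def generate_prospects(n, seed):
    """Create n prospects with random attributes, reproducible per seed."""
    rng = random.Random(seed)
    prospects = []
    for i in range(n):
        position = rng.choice(sorted(WEIGHTS))
        attrs = {name: rng.randint(40, 99) for name in WEIGHTS[position]}
        prospects.append({"id": i, "position": position, "attrs": attrs})
    return prospects


def score(prospect):
    """Weighted sum of attributes using the prospect's position weights."""
    weights = WEIGHTS[prospect["position"]]
    return sum(weights[k] * prospect["attrs"][k] for k in weights)


def build_board(prospects):
    """Rank by score (ties broken by id) and keep one prospect per pick."""
    ranked = sorted(prospects, key=lambda p: (-score(p), p["id"]))
    return ranked[: ROUNDS * PICKS_PER_ROUND]


board = build_board(generate_prospects(400, seed=7))
print(len(board))  # 7 rounds x 32 picks = 224
```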

Stage 2: Embedding System

  • 4,940 player vectors
  • PyTorch VAE clustering into 12 archetypes
  • Low-latency similarity retrieval via ChromaDB
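
To illustrate the retrieval step, here is a brute-force cosine similarity search over 45-dimensional vectors in plain Python. It is a stand-in for what a vector store does, not ChromaDB's API; the player IDs and random vectors are placeholders.

```python
import math
import random

DIM = 45  # embedding width used by the project's vector store


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=5):
    """Return the k (player_id, similarity) pairs closest to the query."""
    scored = [(pid, cosine(query, vec)) for pid, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Placeholder data: 200 random player vectors.
rng = random.Random(0)
players = {
    f"player_{i}": [rng.gauss(0.0, 1.0) for _ in range(DIM)] for i in range(200)
}
hits = top_k(players["player_3"], players, k=3)
```

Querying with a stored vector returns that player first with similarity close to 1.0, which makes a quick sanity check for any real index.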

Stage 3: Season Simulation

  • 17-week simulation engine
  • Play-by-play matchup execution
  • Seeded variance and home-field weighting
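
A simplified sketch of seeded variance and home-field weighting. It collapses play-by-play execution into one random draw per game; the logistic curve, `HOME_EDGE`, the ratings, and the four-team schedule are all assumptions made for illustration.

```python
import math
import random

HOME_EDGE = 2.5  # hypothetical home-field bonus, in rating points
WEEKS = 17


def home_win_probability(home_rating, away_rating, scale=10.0):
    """Logistic win probability for the home team after the home bonus."""
    diff = (home_rating + HOME_EDGE) - away_rating
    return 1.0 / (1.0 + math.exp(-diff / scale))


def simulate_season(schedule, ratings, seed):
    """Play every game once; the same seed always gives the same season."""
    rng = random.Random(seed)
    wins = {team: 0 for team in ratings}
    for week in schedule:
        for home, away in week:
            p_home = home_win_probability(ratings[home], ratings[away])
            winner = home if rng.random() < p_home else away
            wins[winner] += 1
    return wins


ratings = {"A": 82.0, "B": 78.0, "C": 75.0, "D": 80.0}
# Toy schedule: the same two pairings each week, alternating the host.
schedule = [
    [("A", "B"), ("C", "D")] if week % 2 == 0 else [("B", "A"), ("D", "C")]
    for week in range(WEEKS)
]
season = simulate_season(schedule, ratings, seed=42)
```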

Stage 4: Progression Engine

  • Physical aging curves
  • Development trait progression
  • Rookie growth and veteran regression

Stage 5: Salary Cap & Front Office

  • Rule-of-51 enforcement
  • Dead-money calculations
  • Asset valuation and trade logic
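
The two cap rules above can be sketched as follows. This is a simplified reading: only the 51 largest cap hits count, and dead money is the unamortized signing-bonus proration (spread over at most five years). Real cap accounting has more cases, such as post-June-1 releases, which this sketch ignores. Function names are hypothetical.

```python
def top_51_total(cap_hits, dead_money=0):
    """Cap charge under the Rule of 51: the 51 largest hits plus dead money."""
    counted = sorted(cap_hits, reverse=True)[:51]
    return sum(counted) + dead_money


def is_cap_compliant(cap_hits, cap_limit, dead_money=0):
    """True when the Rule-of-51 total fits under the cap limit."""
    return top_51_total(cap_hits, dead_money) <= cap_limit


def dead_money_on_release(signing_bonus, contract_years, years_played):
    """Remaining signing-bonus proration that accelerates when a player is cut."""
    proration_years = min(contract_years, 5)
    per_year = signing_bonus / proration_years
    remaining_years = max(proration_years - years_played, 0)
    return per_year * remaining_years


roster_hits = [1_000_000] * 60          # 60 players, only 51 count
total = top_51_total(roster_hits)       # 51,000,000
dead = dead_money_on_release(10_000_000, contract_years=4, years_played=1)
```

With a 10,000,000 bonus on a four-year deal, releasing the player after one season leaves three years of proration, or 7,500,000 in dead money.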

Supporting Systems

Coaching Layer

  • Play-calling logic, scheme-fit metrics, and in-game adjustments.

Scouting Engine

  • Regional grading pipelines and combine evaluation models.

Free Agency Marketplace

  • Multi-agent contract bidding and team-fit valuation scoring.

Draft Intelligence

  • Need-weighted board ranking and trade-up/down evaluation.
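
One way need-weighted ranking could work, as a rough sketch: a prospect's grade is boosted in proportion to how badly the team needs the position. The `need_weight` factor, the 0-to-1 need scale, and the sample data are assumptions.

```python
def need_weighted_rank(prospects, team_needs, need_weight=0.3):
    """Sort prospects by grade boosted by positional need (0.0 to 1.0)."""

    def adjusted(prospect):
        need = team_needs.get(prospect["position"], 0.0)
        return prospect["grade"] * (1.0 + need_weight * need)

    return sorted(prospects, key=adjusted, reverse=True)


prospects = [
    {"name": "A", "position": "QB", "grade": 90},
    {"name": "B", "position": "OT", "grade": 85},
]
# A team with a settled quarterback and a hole at tackle prefers B.
ranked = need_weighted_rank(prospects, {"QB": 0.0, "OT": 1.0})
```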

Infrastructure

DuckDB for structured OLAP analytics across ~1,700 entities

ChromaDB vector storage for 45-dimensional embeddings

Full 10-season franchise lifecycle simulation computed in ~16 seconds locally

Python · PyTorch · DuckDB · ChromaDB · MLX · Docker · Next.js

Albanian Speech-to-Text

Local-first transcription pipeline for Albanian (Kosovo dialect) with human-correction feedback loops and iterative fine-tuning workflows.

Pipeline

901 curated YouTube audio sources processed through Faster-Whisper for segment-level transcription.

Configurable inference models from tiny to large-v3 with language auto-detection and re-ranking.

Human correction feedback loop for continual dataset refinement.

Observability

SQLite-backed tracking for inference latency, correction rates, WER, and CER.

Real-time monitoring dashboard across deployment variants and model sizes.

Structured evaluation workflows for reproducible ASR benchmarking.
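
For reference, WER and CER are both edit distance divided by reference length, counted over words and characters respectively. A minimal sketch with no text normalization (casing and punctuation are compared as-is); the sample sentence is illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


ref = "unë jam nga Prishtina"
hyp = "unë jam nga Prishtin"
print(wer(ref, hyp))  # one wrong word out of four: 0.25
```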

Whisper · Faster-Whisper · Python · ASR · SQLite · Audio Processing · Web UI

Hermes Workflow Environment

Workflow environment integrating MLX inference, provider routing, automation pipelines, and retrieval systems.

Inference Layer

oMLX model server (:8001) as the primary inference backend

Automatic routing across local and private cloud providers

Fallback chain: local MLX → OpenRouter → Cerebras → cloud GPU infrastructure
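
The fallback chain can be sketched as an ordered list of providers tried in turn. The provider functions below are stubs that make no network calls, and `ProviderError` and `route` are hypothetical names; only the ordering comes from the chain above.

```python
class ProviderError(Exception):
    """Raised by a provider stub when it cannot serve the request."""


def route(prompt, providers):
    """Try each (name, call) pair in order; return the first success."""
    failures = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            failures.append(f"{name}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(failures))


# Stubs standing in for real clients.
def local_mlx(prompt):
    raise ProviderError("server on :8001 unreachable")


def openrouter(prompt):
    return f"[openrouter] {prompt}"


def cerebras(prompt):
    return f"[cerebras] {prompt}"


def cloud_gpu(prompt):
    return f"[cloud-gpu] {prompt}"


CHAIN = [
    ("local-mlx", local_mlx),
    ("openrouter", openrouter),
    ("cerebras", cerebras),
    ("cloud-gpu", cloud_gpu),
]
used, reply = route("ping", CHAIN)
print(used)  # local stub fails, so the next provider answers: openrouter
```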

Orchestration

s6-overlay supervision for containerized service lifecycle management

SQLite-backed kanban orchestration system with queue state tracking

Cron-driven monitoring with automatic retry and failure recovery

Memory-pressure gating pauses dispatch during constrained local resource states
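
A rough sketch of a SQLite-backed task board with a memory-pressure gate. The table layout, state names, and the 20% free-memory threshold are assumptions; the caller supplies the free-memory reading, so the sketch stays platform-independent.

```python
import sqlite3


def open_board(path=":memory:"):
    """Open (or create) the task board."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS tasks (
               id INTEGER PRIMARY KEY,
               title TEXT NOT NULL,
               state TEXT NOT NULL DEFAULT 'todo',
               attempts INTEGER NOT NULL DEFAULT 0
           )"""
    )
    return conn


def enqueue(conn, title):
    """Add a task in the 'todo' column and return its id."""
    cur = conn.execute("INSERT INTO tasks (title) VALUES (?)", (title,))
    conn.commit()
    return cur.lastrowid


def dispatch_next(conn, free_memory_fraction, min_free=0.2):
    """Move the oldest 'todo' task to 'doing' unless memory is constrained."""
    if free_memory_fraction < min_free:
        return None  # gate closed: leave the queue untouched
    row = conn.execute(
        "SELECT id FROM tasks WHERE state = 'todo' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None  # nothing waiting
    conn.execute(
        "UPDATE tasks SET state = 'doing', attempts = attempts + 1 WHERE id = ?",
        (row[0],),
    )
    conn.commit()
    return row[0]


board = open_board()
task_id = enqueue(board, "transcribe batch 12")
held = dispatch_next(board, free_memory_fraction=0.05)     # gate closed
started = dispatch_next(board, free_memory_fraction=0.60)  # gate open
```

A cron job could call `dispatch_next` on each tick; tracking `attempts` gives a retry policy something to cap.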

Workflow Automation

Integrated agent workflows inspired by procedural task execution patterns

Multi-step task chaining with parameterized execution flows

Retrieval-assisted workflows using SQLite observability and vector search pipelines

Experimentation with local-first agent coordination and automation tooling
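
Multi-step chaining with parameters can be as small as a list of named steps that read from and write to a shared context. A minimal sketch; `run_chain` and the three example steps are hypothetical.

```python
def run_chain(steps, params):
    """Run (name, function) steps in order; each result is stored under its name."""
    context = dict(params)  # copy so the caller's parameters stay untouched
    for name, step in steps:
        context[name] = step(context)
    return context


STEPS = [
    ("normalized", lambda ctx: ctx["text"].strip().lower()),
    ("tokens", lambda ctx: ctx["normalized"].split()),
    ("count", lambda ctx: len(ctx["tokens"])),
]

result = run_chain(STEPS, {"text": "  Route Local First  "})
print(result["count"])  # three tokens: 3
```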

Python · MLX · SQLite · Next.js · Automation · Agents

Keep-Graph: 10 years of personal notes, visualized

Ten years of Google Keep notes mapped into a knowledge graph: 1,032 wiki pages, 12,801 nodes, and 44,534 edges.

Pipeline

Load all Google Keep Takeout notes (JSON)

Filter relevant notes with Qwen 0.6B classifier

Extract concepts per note with Qwen 27B

Deduplicate and canonicalize concepts with Qwen 27B

Synthesize wiki pages per cluster with Qwen 27B

Build graph.json and render D3 force layout
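
The last step can be sketched as follows, assuming edges come from concepts co-occurring in the same note (the project may define edges differently). The `nodes`/`links` shape with `id`, `source`, and `target` keys follows the common D3 force-layout convention; the sample notes are placeholders.

```python
import json
from collections import Counter
from itertools import combinations


def build_graph(note_concepts):
    """Turn {note_id: [concept, ...]} into a nodes/links dictionary."""
    node_counts = Counter()
    edge_counts = Counter()
    for concepts in note_concepts.values():
        unique = sorted(set(concepts))
        node_counts.update(unique)
        # Every pair of concepts in the same note becomes one edge.
        edge_counts.update(combinations(unique, 2))
    nodes = [{"id": c, "count": n} for c, n in sorted(node_counts.items())]
    links = [
        {"source": a, "target": b, "weight": w}
        for (a, b), w in sorted(edge_counts.items())
    ]
    return {"nodes": nodes, "links": links}


notes = {"n1": ["mlx", "qwen"], "n2": ["qwen", "d3", "mlx"]}
graph = build_graph(notes)
graph_json = json.dumps(graph, indent=2)  # contents for graph.json
```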

Infrastructure

Extracted entirely on-device using local LLMs (Qwen 0.6B + 27B via Apple MLX)

Multilingual: English, Albanian, Arabic with cross-language concept deduplication

Per-concept wiki synthesis across 1,032 generated pages

Interactive D3 timeline spanning 2015 to 2026

Python · MLX · Qwen 0.6B / 27B · D3.js · Local-First

Production Systems

Në Dritën Islame

E-commerce and operational administration platform built with Next.js, Supabase, and automated fulfillment workflows. Handles order processing and backend admin automation.

Gloweb

Client-facing web systems and backend integrations across React, TypeScript, and Node.js. Focused on production deployment workflows and API integration layers.

Arbnori Engineering

Multilingual business platform with deployment automation and localization systems for multi-region operations.

Contact

Open to AI systems, ML infrastructure, and applied AI engineering roles.