How 15 Docker services, 12 AI models, and 48 GB of VRAM are orchestrated into a single-command deployment. This page covers the internals: the service dependency DAG, the VRAM budgeting strategy, the init container pattern, the LiteLLM routing configuration, and the 544-line interactive setup wizard.
The system is organized in layers: external clients → API gateway → inference engines → data stores → init containers. Every service lives in Docker with explicit health checks and dependency ordering.
```
┌─────────────────────────────────────────────────────────────────────┐
│                      CLIENTS (any OpenAI SDK)                       │
│     VS Code · SMA Pipeline · Open WebUI · curl · Telegram Bot       │
└────────────────────────────────┬────────────────────────────────────┘
                                 │ :4000 (OpenAI API)
┌────────────────────────────────▼────────────────────────────────────┐
│                          LITELLM GATEWAY                            │
│    Auth (master_key) · Routing · Fallbacks · Retries · Logging      │
│    9 model entries · 4 fallback chains → GPT-4o (GitHub Models)     │
└──┬──────────┬──────────┬──────────┬──────────┬──────────────────────┘
   │          │          │          │          │
   ▼          ▼          ▼          ▼          ▼
┌──────┐ ┌────────┐ ┌────────┐ ┌───────┐ ┌──────────────┐
│OLLAMA│ │SPEACHES│ │COMFYUI │ │MEM0   │ │GPT-4o        │
│:11434│ │ :8000  │ │ :8188  │ │:8080  │ │(cloud only)  │
│ROCm  │ │Whisper │ │FLUX    │ │Memory │ │GitHub Models │
│5 LLMs│ │+Kokoro │ │CogVideo│ │+Qdrant│ │              │
└──┬───┘ └────────┘ └────────┘ └───┬───┘ └──────────────┘
   │                               │
   │  ┌──────────┐ ┌──────────┐ ┌──┴───────┐
   │  │FISH-SPCH │ │XTTS-v2   │ │QDRANT    │
   │  │ :8001    │ │ :8002    │ │ :6333    │
   │  │S1-Mini   │ │Voice Cln │ │Vector DB │
   │  └──────────┘ └──────────┘ └──────────┘
   │
┌──┴────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│MUSICGEN   │ │POSTGRES  │ │REDIS     │ │DASHBOARD │
│:8003/:8004│ │ :5432    │ │ internal │ │ :9000    │
│AudioCraft │ │LiteLLM DB│ │Cache/Rate│ │nginx+GPU │
└───────────┘ └──────────┘ └──────────┘ └──────────┘
```
Docker Compose depends_on with condition: service_healthy enforces strict startup ordering.
Init containers run once and exit. Services only start after their dependencies pass health checks.
Boot sequence (top → bottom = dependency order):

```
TIER 0 — No Dependencies (start immediately)
├── ollama            (health: TCP 11434)
├── db (postgres)     (health: pg_isready)
├── qdrant            (health: TCP 6333)
├── redis             (health: redis-cli ping)
└── model-downloader  (init: pip install + download-models.sh → exit)

TIER 1 — Depends on Tier 0
├── ollama-init   → waits: ollama (healthy) → pulls 5 models + pre-warms 2 → exit
├── speaches      (self-contained, no depends_on, health: /health)
├── fish-speech   → waits: model-downloader (completed)
├── xtts          → waits: model-downloader (completed)
├── musicgen      (self-contained, health: /health)
└── comfyui       → waits: model-downloader (completed)

TIER 2 — Depends on Tier 0 + Tier 1
├── litellm       → waits: db (healthy) + ollama (healthy)
├── mem0          → waits: qdrant (healthy) + ollama (healthy)
├── speaches-init → waits: speaches (healthy) → install whisper model → exit
└── musicgen-ui   → waits: musicgen (healthy)

TIER 3 — Depends on Tier 2
├── open-webui    → waits: litellm (healthy) + speaches (healthy)
└── dashboard     (static nginx, no hard dependencies)
```
Key insight: Init containers use restart: "no" + condition: service_completed_successfully.
They download models, configure state, then exit. This means subsequent docker compose up starts are instant — no re-downloads.
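The pattern above can be sketched as a compose fragment. Service names follow the ones used in this stack; the exact commands and file paths are assumptions for illustration:

```yaml
services:
  model-downloader:
    image: python:3.12-slim
    restart: "no"                       # run once, then exit
    command: ["bash", "/scripts/download-models.sh"]

  fish-speech:
    depends_on:
      model-downloader:
        condition: service_completed_successfully   # wait for exit code 0

  litellm:
    depends_on:
      db:
        condition: service_healthy      # wait for pg_isready to pass
```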
48 GB is a hard ceiling. Every byte is accounted for. The strategy: keep two LLMs permanently resident in VRAM for instant inference, and reserve the remainder for ComfyUI media generation.
Two models are loaded into VRAM at boot and never evicted:
- `deepseek-r1:32b` — ~20 GB — primary coding + reasoning
- `deepseek-v2:16b` — ~10 GB — fast autocomplete + light tasks

Controlled by Ollama env vars:
- `OLLAMA_MAX_LOADED_MODELS=2`
- `OLLAMA_KEEP_ALIVE=-1` (infinite TTL)

FLUX.1-schnell FP8 requires ~17 GB of VRAM per image generation. CogVideoX-2b uses ~14 GB plus the T5-XXL encoder.
Concurrency rule: Image/video gen is sequential — only one workflow runs at a time. The remaining ~1 GB is OS overhead.
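The budget arithmetic works out as follows, using the figures quoted above (a sketch; values are the approximate peaks stated in this page, not measured allocations):

```python
VRAM_TOTAL_GB = 48  # AMD Radeon Pro W7900

resident = {
    "deepseek-r1:32b": 20,  # primary coding + reasoning, never evicted
    "deepseek-v2:16b": 10,  # fast autocomplete, never evicted
}
flux_peak = 17              # FLUX.1-schnell FP8, one workflow at a time

# Sequential media generation means only one ComfyUI peak at a time,
# so the worst case is both resident LLMs plus one FLUX run.
headroom = VRAM_TOTAL_GB - sum(resident.values()) - flux_peak
print(headroom)  # → 1  (the ~1 GB OS/driver overhead)
```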
If a third Ollama model is requested (e.g., deepseek-r1:70b), Ollama evicts the least-recently-used model to make room. The 70B model requires the full 48 GB — ComfyUI and both pre-warmed models get evicted.
Safety net: LiteLLM fallback chains route to GPT-4o if the local model fails to load within 120 seconds.
Pre-warming eliminates cold-start penalties. First-token latency for R1:32b and V2:16b is measured in milliseconds, not seconds. Compare to cloud API round-trip: ~200-800ms network latency alone.
Three ephemeral containers execute one-shot setup tasks on first deploy, then exit cleanly. This Kubernetes-inspired pattern keeps Docker images small, avoids redundant downloads, and ensures idempotent deployments.
Image: python:3.12-slim
Runs: pip install huggingface_hub + bash download-models.sh
Downloads:
Persistence: ./models/ bind-mounted volume. Skip logic: checks if directory has files before downloading.
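The skip logic can be sketched in shell. The function name and directory layout are assumptions; the check is simply "does the target directory already contain files":

```shell
#!/usr/bin/env bash
# Skip a HuggingFace download when its target directory already has files.
download_if_missing() {
  local dir="$1" repo="$2"
  if [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]; then
    echo "SKIP (already downloaded): $repo"
  else
    mkdir -p "$dir"
    huggingface-cli download "$repo" --local-dir "$dir"
  fi
}
```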
Image: python:3.12-slim + curl
Waits for: ollama: service_healthy
Pulls:
- `deepseek-r1:32b` (19.9 GB)
- `deepseek-v2:16b` (8.9 GB)
- `deepseek-r1:70b` (42.5 GB)
- `llama3.3:latest` (42.5 GB)
- `nomic-embed-text:latest` (274 MB)

Pre-warms: Sends a dummy prompt to R1:32b and V2:16b to load them into VRAM.
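A pre-warm is just a tiny generate request with an infinite keep-alive. A sketch of the request bodies ollama-init would send to the Ollama API (the helper names are hypothetical; the exact script contents are not shown in this page):

```python
import json

def pull_request(model: str) -> dict:
    """Body for POST /api/pull — downloads the model's layers."""
    return {"model": model}

def prewarm_request(model: str) -> dict:
    """Body for POST /api/generate — a dummy prompt that loads the model
    into VRAM; keep_alive=-1 pins it there indefinitely."""
    return {"model": model, "prompt": "hi", "keep_alive": -1, "stream": False}

for m in ("deepseek-r1:32b", "deepseek-v2:16b"):
    print(json.dumps(prewarm_request(m)))
```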
Image: python:3.12-slim + curl
Waits for: speaches: service_healthy
Logic: Queries GET /v1/models to check if deepdml/faster-whisper-large-v3-turbo-ct2 is already installed. If not, sends POST /v1/models to trigger download. Idempotent — safe to re-run.
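The idempotence check reduces to one decision over the models listing. A sketch, assuming the response follows the OpenAI models-list shape (`{"data": [{"id": ...}]}`); the function name is hypothetical:

```python
WHISPER_ID = "deepdml/faster-whisper-large-v3-turbo-ct2"

def needs_install(models_response: dict, model_id: str = WHISPER_ID) -> bool:
    """True if the model is absent from GET /v1/models and a POST is needed."""
    installed = {m.get("id") for m in models_response.get("data", [])}
    return model_id not in installed

# Missing → trigger download; already present → no-op. Safe to re-run.
print(needs_install({"data": []}))                    # → True
print(needs_install({"data": [{"id": WHISPER_ID}]}))  # → False
```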
```
First deploy (cold):
  model-downloader ─── 30-60 min ───┬── fish-speech ✓
                                    ├── xtts ✓
                                    └── comfyui ✓
  ollama-init ──────── 10-20 min ───── ollama models pulled + pre-warmed
  speaches-init ────── 30 sec ──────── whisper model installed

Subsequent deploys (warm):
  model-downloader ── SKIP (files exist) ───── exits in <2s
  ollama-init ─────── SKIP (models pulled) ─── exits in <5s
  speaches-init ───── SKIP (model installed) ─ exits in <2s
```
One YAML file maps 9 model entries to local services or cloud overflow. Every client sees a single OpenAI-compatible endpoint at port 4000 with master_key authentication.
```yaml
model_list:
  # ── Local LLMs (AMD W7900, ROCm GPU) ──
  - model_name: deepseek-r1-32b
    litellm_params:
      model: ollama/deepseek-r1:32b
      api_base: "http://ollama:11434"
  - model_name: deepseek-v2-16b     # fast autocomplete
  - model_name: deepseek-r1-70b     # deep reasoning (swaps VRAM)
  - model_name: llama3-8b           # general purpose
  - model_name: nomic-embed-text    # embeddings for Mem0/@codebase

  # ── Speech (Speaches — OpenAI-compatible) ──
  - model_name: whisper-turbo
    litellm_params:
      model: openai/deepdml/faster-whisper-large-v3-turbo-ct2
      api_base: "http://speaches:8000/v1"
  - model_name: kokoro-tts          # fast TTS

  # ── Cloud Overflow ──
  - model_name: github-fallback
    litellm_params:
      model: openai/gpt-4o
      api_key: "os.environ/GITHUB_TOKEN"
      api_base: "https://models.inference.ai.azure.com"

litellm_settings:
  request_timeout: 120
  num_retries: 2
  fallbacks:
    - deepseek-r1-32b: [github-fallback]   # every local → GPT-4o
    - deepseek-v2-16b: [github-fallback]
    - deepseek-r1-70b: [github-fallback]
    - llama3-8b: [github-fallback]
```
master_key from os.environ/LITELLM_MASTER_KEY — set during setup wizard. All requests require Authorization: Bearer sk-... header. PostgreSQL 16 stores usage analytics and spend tracking.
Every local LLM has a fallback to github-fallback (GPT-4o via GitHub Models). If the local model times out (120s) or the GPU is busy with ComfyUI, the request transparently routes to cloud. Clients see identical API responses.
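From the client's perspective, the fallback is invisible: the same model name and endpoint are used either way. A sketch of the call shape using only the standard library (the key shown is a placeholder for whatever `LITELLM_MASTER_KEY` was set to during setup):

```python
import json
import urllib.request

def chat_request(base_url: str, api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-compatible chat completion request against LiteLLM."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = chat_request("http://localhost:4000", "sk-master-example", "deepseek-r1-32b", "hello")
# urllib.request.urlopen(req) returns the same response schema whether
# deepseek-r1-32b answered locally or github-fallback took over.
```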
Every model deployed across the stack, with parameter count, VRAM usage, and inference context.
| Model | Params | VRAM | Service | Modality |
|---|---|---|---|---|
| DeepSeek-R1:32B | 32B | ~20 GB | Ollama (ROCm) | Chat / Code / Reasoning |
| DeepSeek-V2:16B | 16B | ~10 GB | Ollama (ROCm) | Fast Code / Autocomplete |
| DeepSeek-R1:70B | 70B | ~42 GB | Ollama (ROCm) | Deep Reasoning (swap) |
| Llama 3.3 | 8B | ~5 GB | Ollama (ROCm) | General Purpose |
| nomic-embed-text | 137M | ~0.3 GB | Ollama (ROCm) | 768D Embeddings |
| Whisper Large v3 Turbo | 809M | CPU | Speaches | Speech-to-Text |
| Kokoro-82M | 82M | CPU | Speaches | Text-to-Speech (fast) |
| OpenAudio S1-Mini | — | CPU | Fish Speech | Text-to-Speech (expressive) |
| XTTS-v2 | ~400M | CPU | XTTS-v2 | Voice Cloning (17 languages) |
| FLUX.1-schnell FP8 | 12B | ~17 GB | ComfyUI (ROCm) | Image Generation |
| CogVideoX-2b + T5-XXL | 2B+11B | ~14 GB | ComfyUI (ROCm) | Video Generation |
| MusicGen-small | 300M | CPU | MusicGen API | Text-to-Music |
GPU models share the W7900's 48 GB of VRAM. CPU models run on system RAM (64 GB DDR5): Whisper, Kokoro, Fish Speech, XTTS-v2, and MusicGen all run on CPU to avoid VRAM contention with the LLMs.
453 lines of orchestration. Key patterns used throughout the compose file.
AMD GPUs require /dev/kfd (KFD kernel driver) and /dev/dri (DRI render nodes). Two services use GPU: ollama and comfyui.
```yaml
devices:
  - /dev/kfd:/dev/kfd
  - /dev/dri:/dev/dri
```
Every long-running service has a health check. Docker won't start dependent services until the health check passes. Patterns used:
- `echo > /dev/tcp/localhost/PORT` (Ollama, Qdrant)
- `pg_isready`, `redis-cli ping` (Postgres, Redis)
- `python3 urllib.request.urlopen()` (LiteLLM, Mem0, Speaches)
- `curl -sf URL` (Fish Speech, XTTS, Open WebUI)

Named volumes hold database state; bind mounts hold models (so they can be shared across rebuilds).
- `postgres_data`, `redis_data`, `qdrant_data` — named volumes
- `./models/` — bind-mounted for all AI model weights
- `comfyui_data` — named volume for generated outputs
- `open_webui_data` — named volume for chat history

Secrets are stored in `.env` (chmod 600, gitignored) and injected via `${VAR:-default}` syntax. Internal services use Docker network DNS (no exposed ports for Redis). `LITELLM_MASTER_KEY` gates all external API access.
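A representative healthcheck entry, as it might appear in the compose file for a Python-probed service; the interval, timeout, and retry values here are illustrative assumptions, not the stack's actual settings:

```yaml
litellm:
  healthcheck:
    test: ["CMD", "python3", "-c",
           "import urllib.request; urllib.request.urlopen('http://localhost:4000/health')"]
    interval: 10s
    timeout: 5s
    retries: 12
```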
One command (./setup.sh) takes a fresh clone to a fully running AI stack. Six phases, all idempotent.
```
$ ./setup.sh

PHASE 0 — Prerequisites
│  Check: Docker, Docker Compose, docker group, /dev/kfd + /dev/dri, HF CLI
│  Aborts with actionable error if anything missing
│
PHASE 1 — Interactive API Key Configuration
│  Prompts for: HF_TOKEN, LITELLM_MASTER_KEY, POSTGRES_PASSWORD, GITHUB_TOKEN
│  Loads existing .env if present → "Keep this value? [Y/n]"
│  Auto-generates secure defaults: sk-master-$(openssl rand -hex 8)
│  Writes .env (chmod 600)
│
PHASE 2 — Install HuggingFace CLI
│  Creates .venv if needed → pip install huggingface_hub[cli]
│  Login with HF_TOKEN for gated models (Fish Speech)
│
PHASE 3 — Download HuggingFace Models
│  Interactive: "Which groups? [1] Speech ~6GB [2] Creative ~31GB"
│  Per-model skip: checks if directory has files → "SKIP (already downloaded)"
│  Downloads: whisper-turbo, kokoro, fish-speech, xtts-v2, flux-schnell, cogvideox-2b
│
PHASE 4 — Build & Start Docker Stack
│  docker compose pull → build (mem0-api) → start core (ollama, db, qdrant, redis)
│  Wait for Ollama healthy → pull 5 Ollama models → start all remaining services
│  Wait for LiteLLM healthy
│
PHASE 5 — Pull Ollama Models
│  Shows existing vs. missing models with ✓/○ markers
│  Interactive: "Pull missing models? [Y/n]"
│  Streaming progress: pulling manifest 45%
│
PHASE 6 — Health Check & Summary
│  Container status: ✓ Up / ✗ Down for each container
│  Endpoint reachability: ✓ localhost:PORT for each service
│  Final banner with all URLs and management commands
```

Flags: `--skip-downloads`, `--skip-keys`, `--help`
A 674-line single-file HTML dashboard served by nginx:alpine on port 9000. Features live GPU monitoring and service health checks.
Polls GET /api/ps on Ollama every 5 seconds to show loaded models, VRAM usage, and active inference requests. Color-coded status: green (loaded), yellow (loading), red (error).
11 service cards with live health status. Each card pings its service endpoint and displays ✓ (healthy) or ✗ (unreachable). Includes direct links to each service UI.
Auto-detects current hostname via window.location.hostname and rewrites all service URLs to match. Works seamlessly when accessed over Tailscale from any device on the mesh network.
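The rewrite itself is a pure string operation; a minimal sketch of the idea (the function name is hypothetical, not the dashboard's actual code):

```javascript
// Rewrite a service URL so its host matches wherever the dashboard was
// loaded from (localhost, a LAN IP, or a Tailscale MagicDNS hostname).
function rewriteServiceUrl(url, currentHost) {
  const u = new URL(url);
  u.hostname = currentHost;  // keep scheme, port, and path intact
  return u.toString();
}

// In the dashboard, currentHost comes from window.location.hostname:
console.log(rewriteServiceUrl("http://localhost:8188/", "my-node.tailnet.example"));
// → http://my-node.tailnet.example:8188/
```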
Dark theme (#0f0f13), purple/teal gradient accents, fully responsive grid layout. Zero dependencies — pure HTML/CSS/JS. GPU badge displays hardware info prominently.
A YAML configuration file that connects VS Code to all local Ollama models via Continue.dev, with 4 specialized chat agents, tab autocomplete, and semantic codebase search.
```yaml
models:
  # 4 chat agents, each with a specialized system prompt
  - name: DeepSeek R1 32B (Coding Agent)    # primary — Claude Code patterns
  - name: DeepSeek V2 16B (Fast Code)       # light tasks + autocomplete
  - name: DeepSeek R1 70B (Deep Reasoning)  # complex architecture
  - name: Llama 3.3 8B (General)            # general purpose

tabAutocomplete:
  provider: ollama
  model: deepseek-v2:16b       # fast inline completions

embeddings:
  provider: ollama
  model: nomic-embed-text      # @codebase semantic search

context:
  - code, docs, diff, terminal, problems, folder, codebase
```
@codebase searches the entire workspace using nomic-embed-text embeddings indexed locally.
When you type @codebase how does the auth middleware work?, Continue embeds the query, searches the local vector index,
and injects the top-k relevant code snippets into the LLM context. All local — nothing leaves the machine.
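The retrieval step boils down to cosine similarity over locally stored vectors. A toy sketch of top-k selection (the real index uses nomic-embed-text's 768-dimensional embeddings, not these 2-dimensional stand-ins):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, index, k=3):
    """index: list of (snippet, embedding). Returns the k best-matching snippets."""
    scored = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [snippet for snippet, _ in scored[:k]]

index = [
    ("def auth_middleware(req): ...", [0.9, 0.1]),
    ("def render_chart(data): ...",   [0.1, 0.9]),
]
print(top_k([1.0, 0.0], index, k=1))  # → ['def auth_middleware(req): ...']
```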
Four specialized system prompts in prompts/ — distilled from analyzing the behavioral patterns of
commercial AI coding agents. Designed for local models that need stronger steering than cloud models.
| Agent | File | Distilled From | Domain |
|---|---|---|---|
| Coding Agent | coding-agent.md | Claude Code + Cursor Agent 2.0 | Software engineering, code generation, file editing |
| UI Agent | ui-agent.md | v0 (Vercel) + Bolt.new | Frontend development, CSS, component design |
| Debugging Agent | debugging-agent.md | Claude Code + Devin AI | Root cause analysis, hypothesis ranking, fix verification |
| Hardware Agent | hardware-agent.md | Domain expertise | Drones, ESC firmware, PX4, sensor fusion |
Cloud models (Claude, GPT-4o) produce good results with minimal instruction. Local models (DeepSeek 32B) need explicit behavioral rules to match that quality. The system prompts encode hundreds of behavioral rules distilled from observing commercial agents: no preamble/postamble, concise answers, convention-following, security practices, debugging methodology (observe → hypothesize → test → fix → prevent), and proactiveness constraints. The result is near-commercial quality coding assistance running entirely on local hardware.
Two JSON workflow files for visual media generation, both using CogVideoX-2b for video synthesis. ComfyUI runs a custom ROCm 6.0 Docker build with GPU passthrough.
File: cogvideox-2b-text-to-video.json
Pipeline: 8 ComfyUI nodes → CogVideoX-2b generates 17 frames at 480×320 → assembled into animated WEBP at 6 FPS.
Generation time: ~200 seconds per clip on W7900.
VRAM: ~14 GB (model + T5-XXL encoder).
File: cogvideox-2b-frames.json
Pipeline: Same CogVideoX-2b generation, but outputs 17 individual PNG frames instead of assembled video. Useful for post-processing or custom frame manipulation.
Image generation uses FLUX.1-schnell FP8 (~17 GB VRAM, ~10s per image) directly through ComfyUI's built-in checkpoints system. Open WebUI can trigger image generation via the ComfyUI API integration.
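Queuing a workflow programmatically means POSTing the workflow JSON to ComfyUI's `/prompt` endpoint; a minimal stdlib sketch (error handling, `client_id`, and WebSocket progress tracking are omitted):

```python
import json
import urllib.request

def queue_workflow(workflow: dict, base_url: str = "http://localhost:8188") -> urllib.request.Request:
    """Build the POST /prompt request that queues a ComfyUI workflow."""
    return urllib.request.Request(
        f"{base_url}/prompt",
        data=json.dumps({"prompt": workflow}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# In practice the workflow dict is json.load()-ed from one of the files in
# workflows/; urllib.request.urlopen(queue_workflow(wf)) then queues the job.
req = queue_workflow({})
```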
```
├── docker-compose.yml       # 453 lines — 15 services, health checks, GPU passthrough
├── litellm_config.yaml      # 71 lines — 9 model entries, fallback chains
├── continue-config.yaml     # 103 lines — 4 agents, autocomplete, embeddings
├── setup.sh                 # 544 lines — 6-phase interactive installer
├── dashboard.html           # 674 lines — service portal + live GPU monitor
├── README.md                # project documentation
├── comfyui/
│   └── Dockerfile           # ROCm 6.0 custom ComfyUI build
├── mem0-api/
│   ├── Dockerfile           # Python 3.12 + Mem0 + Qdrant client
│   └── main.py              # Memory API (FastAPI)
├── musicgen-api/
│   ├── Dockerfile           # Python + AudioCraft
│   └── main.py              # Text-to-music REST API
├── musicgen-ui/
│   ├── Dockerfile           # Gradio web UI
│   └── app.py               # MusicGen playground
├── prompts/
│   ├── coding-agent.md      # Claude Code + Cursor patterns
│   ├── ui-agent.md          # v0/Vercel + Bolt.new patterns
│   ├── debugging-agent.md   # Claude Code + Devin patterns
│   └── hardware-agent.md    # Drone/ESC/PX4 domain
├── scripts/
│   ├── download-models.sh   # HuggingFace model downloader
│   ├── ollama-init.sh       # Pull + pre-warm Ollama models
│   └── configure-openwebui-images.py
└── workflows/
    ├── cogvideox-2b-text-to-video.json
    └── cogvideox-2b-frames.json
```
See the overview for capabilities and cost savings, or explore the SMA pipeline that runs on this infrastructure.