Architecture Deep-Dive

How 15 Docker services, 12 AI models, and 48 GB of VRAM are orchestrated into a single-command deployment. This page covers the internals: the service dependency DAG, the VRAM budgeting strategy, the init container pattern, the LiteLLM routing configuration, and the 544-line interactive setup wizard.

Macro Architecture

The system is organized in layers: external clients → API gateway → inference engines → data stores → init containers. Every service lives in Docker with explicit health checks and dependency ordering.

  ┌─────────────────────────────────────────────────────────────────────┐
  │  CLIENTS (any OpenAI SDK)                                          │
  │  VS Code · SMA Pipeline · Open WebUI · curl · Telegram Bot         │
  └────────────────────────────────┬────────────────────────────────────┘
                                   │  :4000 (OpenAI API)
  ┌────────────────────────────────▼────────────────────────────────────┐
  │  LITELLM GATEWAY                                                    │
  │  Auth (master_key) · Routing · Fallbacks · Retries · Logging       │
  │  9 model entries · 4 fallback chains → GPT-4o (GitHub Models)      │
  └──┬──────────┬──────────┬──────────┬──────────┬─────────────────────┘
     │          │          │          │          │
     ▼          ▼          ▼          ▼          ▼
  ┌──────┐ ┌────────┐ ┌────────┐ ┌───────┐ ┌──────────────┐
  │OLLAMA│ │SPEACHES│ │COMFYUI │ │MEM0   │ │GPT-4o        │
  │:11434│ │ :8000  │ │ :8188  │ │:8080  │ │(cloud only)  │
  │ROCm  │ │Whisper │ │FLUX    │ │Memory │ │GitHub Models │
  │5 LLMs│ │+Kokoro │ │CogVideo│ │+Qdrant│ │              │
  └──┬───┘ └────────┘ └────────┘ └───┬───┘ └──────────────┘
     │                                │
     │  ┌──────────┐ ┌──────────┐ ┌──┴───────┐
     │  │FISH-SPCH │ │XTTS-v2   │ │QDRANT    │
     │  │ :8001    │ │ :8002    │ │ :6333    │
     │  │S1-Mini   │ │Voice Cln │ │Vector DB │
     │  └──────────┘ └──────────┘ └──────────┘
     │
  ┌──┴────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
  │MUSICGEN   │ │POSTGRES  │ │REDIS     │ │DASHBOARD │
  │:8003/:8004│ │ :5432    │ │ internal │ │ :9000    │
  │AudioCraft │ │LiteLLM DB│ │Cache/Rate│ │nginx+GPU │
  └───────────┘ └──────────┘ └──────────┘ └──────────┘
Service Dependency DAG

Docker Compose depends_on with condition: service_healthy enforces strict startup ordering. Init containers run once and exit. Services only start after their dependencies pass health checks.

  Boot sequence (left → right = dependency order):

  TIER 0 — No Dependencies (start immediately)
  ├── ollama          (health: TCP 11434)
  ├── db (postgres)   (health: pg_isready)
  ├── qdrant          (health: TCP 6333)
  ├── redis           (health: redis-cli ping)
  └── model-downloader (init: pip install + download-models.sh → exit)

  TIER 1 — Depends on Tier 0
  ├── ollama-init     → waits: ollama (healthy) → pulls 5 models + pre-warms 2 → exit
  ├── speaches        (self-contained, no depends_on, health: /health)
  ├── fish-speech     → waits: model-downloader (completed)
  ├── xtts            → waits: model-downloader (completed)
  ├── musicgen        (self-contained, health: /health)
  └── comfyui         → waits: model-downloader (completed)

  TIER 2 — Depends on Tier 0 + Tier 1
  ├── litellm         → waits: db (healthy) + ollama (healthy)
  ├── mem0            → waits: qdrant (healthy) + ollama (healthy)
  ├── speaches-init   → waits: speaches (healthy) → install whisper model → exit
  └── musicgen-ui     → waits: musicgen (healthy)

  TIER 3 — Depends on Tier 2
  ├── open-webui      → waits: litellm (healthy) + speaches (healthy)
  └── dashboard       (static nginx, no hard dependencies)

Key insight: Init containers use restart: "no" + condition: service_completed_successfully. They download models, configure state, then exit. This means subsequent docker compose up starts are instant — no re-downloads.
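In compose terms, the pattern looks like this — an abridged sketch with unrelated keys omitted and service bodies trimmed to the dependency wiring:

```yaml
services:
  model-downloader:
    image: python:3.12-slim
    restart: "no"                    # run once, exit, never restart
    volumes:
      - ./models:/models
    command: bash -c "pip install huggingface_hub && bash /scripts/download-models.sh"

  fish-speech:
    depends_on:
      model-downloader:
        condition: service_completed_successfully   # gate on init exit code 0

  litellm:
    depends_on:
      db:
        condition: service_healthy                  # gate on passing health check
      ollama:
        condition: service_healthy
```

service_completed_successfully requires the init container to exit with code 0, so a failed download blocks dependent services instead of letting them start against missing model weights.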

VRAM Budget Management

48 GB is a hard ceiling. Every byte is accounted for. The strategy: keep two LLMs permanently resident in VRAM for instant inference, and reserve the remainder for ComfyUI media generation.

AMD Radeon PRO W7900 — 48 GB VRAM budget:

  deepseek-r1:32b    ~20 GB   pre-warmed, always resident
  deepseek-v2:16b    ~10 GB   pre-warmed, always resident
  ComfyUI reserve    ~17 GB   FLUX / CogVideoX generation
  OS overhead         ~1 GB
  ──────────────────────────
  Total              ~48 GB   of 48 GB
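As a sanity check, the allocation arithmetic (figures as listed on this page):

```python
# VRAM budget sanity check — approximate figures from the budget above (GB).
BUDGET_GB = 48

allocations = {
    "deepseek-r1:32b (pre-warmed)": 20,
    "deepseek-v2:16b (pre-warmed)": 10,
    "ComfyUI reserve (FLUX/CogVideoX)": 17,
    "OS overhead": 1,
}

used = sum(allocations.values())
headroom = BUDGET_GB - used
print(f"allocated: {used} GB, headroom: {headroom} GB")  # allocated: 48 GB, headroom: 0 GB
```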

🔥 Pre-Warmed Models (30 GB)

Two models are loaded into VRAM at boot and never evicted:

  • deepseek-r1:32b — ~20 GB — primary coding + reasoning
  • deepseek-v2:16b — ~10 GB — fast autocomplete + light tasks

Controlled by Ollama env vars:

  • OLLAMA_MAX_LOADED_MODELS=2
  • OLLAMA_KEEP_ALIVE=-1 (infinite TTL)
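In the compose file this is a plain environment block on the Ollama service (sketch; unrelated keys omitted):

```yaml
ollama:
  image: ollama/ollama:rocm
  environment:
    - OLLAMA_MAX_LOADED_MODELS=2   # at most two models resident in VRAM
    - OLLAMA_KEEP_ALIVE=-1         # negative TTL: never unload a loaded model
```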

🎨 ComfyUI Reserve (~17 GB)

FLUX.1-schnell FP8 requires ~17 GB VRAM per image generation. CogVideoX-2b uses ~14 GB plus T5-XXL encoder.

Concurrency rule: Image/video gen is sequential — only one workflow runs at a time. The remaining ~1 GB is OS overhead.

📊 LRU Overflow Strategy

If a third Ollama model is requested (e.g., deepseek-r1:70b), Ollama evicts the least-recently-used model to make room. At ~42 GB, the 70B model consumes nearly the entire card: both pre-warmed models are evicted and no ComfyUI headroom remains.

Safety net: LiteLLM fallback chains route to GPT-4o if the local model fails to load within 120 seconds.

⚡ Result: Sub-Second First Token

Pre-warming eliminates cold-start penalties. First-token latency for R1:32b and V2:16b is measured in milliseconds, not seconds. Compare to cloud API round-trip: ~200-800ms network latency alone.

Init Container Pattern

Three ephemeral containers execute one-shot setup tasks on first deploy, then exit cleanly. This Kubernetes-inspired pattern keeps Docker images small, avoids redundant downloads, and ensures idempotent deployments.

📦 model-downloader

Image: python:3.12-slim

Runs: pip install huggingface_hub + bash download-models.sh

Downloads:

  • Fish Speech (OpenAudio S1-Mini)
  • XTTS-v2 (Coqui voice cloning)
  • FLUX.1-schnell FP8 (~17 GB)
  • CogVideoX-2b (~14 GB)

Persistence: ./models/ bind-mounted volume. Skip logic: checks if directory has files before downloading.
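The skip logic amounts to "download only if the target directory is missing or empty" — a Python sketch of that check (the real script implements it in bash around a huggingface-cli call; function name here is hypothetical):

```python
from pathlib import Path

def should_download(model_dir: str) -> bool:
    """Init-container skip logic: fetch weights only when the bind-mounted
    target directory is missing or contains no files."""
    p = Path(model_dir)
    return not p.is_dir() or not any(p.iterdir())

# In the real download-models.sh this gates one huggingface-cli download
# (or huggingface_hub snapshot_download) call per model.
```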

🤖 ollama-init

Image: python:3.12-slim + curl

Waits for: ollama: service_healthy

Pulls:

  • deepseek-r1:32b (19.9 GB)
  • deepseek-v2:16b (8.9 GB)
  • deepseek-r1:70b (42.5 GB)
  • llama3.3:latest (42.5 GB)
  • nomic-embed-text:latest (274 MB)

Pre-warms: Sends a dummy prompt to R1:32b and V2:16b to load them into VRAM.
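The pre-warm step is one dummy /api/generate call per model with an infinite keep_alive — sketched here in Python for clarity (the real init container does the equivalent with curl; function names are illustrative):

```python
import json
import urllib.request

def prewarm_payload(model: str) -> dict:
    """Dummy generation request; keep_alive=-1 pins the model in VRAM."""
    return {"model": model, "prompt": "hi", "stream": False, "keep_alive": -1}

def prewarm(model: str, base: str = "http://ollama:11434") -> None:
    # Blocks until Ollama has loaded the model and answered the dummy prompt.
    req = urllib.request.Request(
        f"{base}/api/generate",
        data=json.dumps(prewarm_payload(model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=600).read()
```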

🎙️ speaches-init

Image: python:3.12-slim + curl

Waits for: speaches: service_healthy

Logic: Queries GET /v1/models to check if deepdml/faster-whisper-large-v3-turbo-ct2 is already installed. If not, sends POST /v1/models to trigger download. Idempotent — safe to re-run.
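The idempotency check reduces to set membership over the installed-model list — a sketch, assuming the OpenAI-style {"data": [{"id": ...}]} response shape:

```python
def needs_install(installed: list[dict], wanted: str) -> bool:
    """Given the entries from GET /v1/models (each with an "id" field),
    decide whether a POST /v1/models install is required."""
    return wanted not in {m.get("id") for m in installed}
```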

  First deploy (cold):
  model-downloader ─── 30-60 min ───┬── fish-speech ✓
                                    ├── xtts ✓
                                    └── comfyui ✓
  ollama-init ─────── 10-20 min ────── ollama models pulled + pre-warmed
  speaches-init ───── 30 sec ──────── whisper model installed

  Subsequent deploys (warm):
  model-downloader ── SKIP (files exist) ── exits in <2s
  ollama-init ─────── SKIP (models pulled) ─ exits in <5s
  speaches-init ───── SKIP (model installed) exits in <2s
LiteLLM Gateway Configuration

One YAML file maps 9 model entries to local services or cloud overflow. Every client sees a single OpenAI-compatible endpoint at port 4000 with master_key authentication.

litellm_config.yaml
model_list:
  # ── Local LLMs (AMD W7900, ROCm GPU) ──
  - model_name: deepseek-r1-32b
    litellm_params:
      model: ollama/deepseek-r1:32b
      api_base: "http://ollama:11434"

  - model_name: deepseek-v2-16b         # fast autocomplete
  - model_name: deepseek-r1-70b         # deep reasoning (swaps VRAM)
  - model_name: llama3-8b               # general purpose
  - model_name: nomic-embed-text        # embeddings for Mem0/@codebase

  # ── Speech (Speaches — OpenAI-compatible) ──
  - model_name: whisper-turbo
    litellm_params:
      model: openai/deepdml/faster-whisper-large-v3-turbo-ct2
      api_base: "http://speaches:8000/v1"

  - model_name: kokoro-tts              # fast TTS

  # ── Cloud Overflow ──
  - model_name: github-fallback
    litellm_params:
      model: openai/gpt-4o
      api_key: "os.environ/GITHUB_TOKEN"
      api_base: "https://models.inference.ai.azure.com"

litellm_settings:
  request_timeout: 120
  num_retries: 2
  fallbacks:
    - deepseek-r1-32b: [github-fallback]  # every local → GPT-4o
    - deepseek-v2-16b: [github-fallback]
    - deepseek-r1-70b: [github-fallback]
    - llama3-8b:       [github-fallback]

🔑 Authentication

master_key from os.environ/LITELLM_MASTER_KEY — set during setup wizard. All requests require Authorization: Bearer sk-... header. PostgreSQL 16 stores usage analytics and spend tracking.

🔄 Fallback Chains

Every local LLM has a fallback to github-fallback (GPT-4o via GitHub Models). If the local model times out (120s) or the GPU is busy with ComfyUI, the request transparently routes to cloud. Clients see identical API responses.
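From the client's point of view none of this routing is visible: any OpenAI-compatible call to port 4000 works, and LiteLLM decides local vs. cloud. A minimal stdlib-only sketch (key shown as a placeholder; the real value comes from the setup wizard):

```python
import json
import urllib.request

def chat_request(prompt: str, model: str = "deepseek-r1-32b") -> dict:
    """OpenAI-style chat body — identical whether LiteLLM serves it from
    Ollama or falls back to GPT-4o."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def chat(prompt: str, base: str = "http://localhost:4000",
         key: str = "sk-master-...") -> str:  # placeholder key
    req = urllib.request.Request(
        f"{base}/v1/chat/completions",
        data=json.dumps(chat_request(prompt)).encode(),
        headers={"Authorization": f"Bearer {key}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as r:
        return json.load(r)["choices"][0]["message"]["content"]
```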

12-Model Inventory

Every model deployed across the stack, with parameter count, VRAM usage, and inference context.

  Model                    Params   VRAM      Service          Modality
  DeepSeek-R1:32B          32B      ~20 GB    Ollama (ROCm)    Chat / Code / Reasoning
  DeepSeek-V2:16B          16B      ~10 GB    Ollama (ROCm)    Fast Code / Autocomplete
  DeepSeek-R1:70B          70B      ~42 GB    Ollama (ROCm)    Deep Reasoning (swap)
  Llama 3.3                8B       ~5 GB     Ollama (ROCm)    General Purpose
  nomic-embed-text         137M     ~0.3 GB   Ollama (ROCm)    768D Embeddings
  Whisper Large v3 Turbo   809M     CPU       Speaches         Speech-to-Text
  Kokoro-82M               82M      CPU       Speaches         Text-to-Speech (fast)
  OpenAudio S1-Mini        —        CPU       Fish Speech      Text-to-Speech (expressive)
  XTTS-v2                  ~400M    CPU       XTTS-v2          Voice Cloning (17 languages)
  FLUX.1-schnell FP8       12B      ~17 GB    ComfyUI (ROCm)   Image Generation
  CogVideoX-2b + T5-XXL    2B+11B   ~14 GB    ComfyUI (ROCm)   Video Generation
  MusicGen-small           300M     CPU       MusicGen API     Text-to-Music

GPU models share the W7900; CPU models run from system RAM (64 GB DDR5). Whisper, Kokoro, Fish Speech, XTTS-v2, and MusicGen run on CPU to avoid VRAM contention with the LLMs.

Docker Compose Internals

453 lines of orchestration. Key patterns used throughout the compose file.

🔌 GPU Passthrough (ROCm)

AMD GPUs require /dev/kfd (KFD kernel driver) and /dev/dri (DRI render nodes). Two services use GPU: ollama and comfyui.

docker-compose.yml
devices:
  - /dev/kfd:/dev/kfd
  - /dev/dri:/dev/dri

🩺 Health Checks

Every long-running service has a health check. Docker won't start dependent services until the health check passes. Patterns used:

  • TCP probe: echo > /dev/tcp/localhost/PORT (Ollama, Qdrant)
  • CLI probe: pg_isready, redis-cli ping
  • HTTP probe: python3 urllib.request.urlopen() (LiteLLM, Mem0, Speaches)
  • curl probe: curl -sf URL (Fish Speech, XTTS, Open WebUI)
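Two of these patterns in compose form (interval/retry values and the Postgres user are illustrative, not copied from the real file):

```yaml
ollama:
  healthcheck:
    test: ["CMD-SHELL", "bash -c 'echo > /dev/tcp/localhost/11434'"]  # TCP probe
    interval: 10s
    timeout: 5s
    retries: 5

db:
  healthcheck:
    test: ["CMD-SHELL", "pg_isready -U litellm"]                      # CLI probe
    interval: 10s
    timeout: 5s
    retries: 5
```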

💾 Volume Strategy

Named volumes for database state. Bind mounts for models (allows sharing across rebuilds).

  • postgres_data, redis_data, qdrant_data — named volumes
  • ./models/ — bind-mounted for all AI model weights
  • comfyui_data — named volume for generated outputs
  • open_webui_data — named volume for chat history

🔒 Security Model

Secrets stored in .env (chmod 600, gitignored). Injected via ${VAR:-default} syntax. Internal services use Docker network DNS (no exposed ports for Redis). LITELLM_MASTER_KEY gates all external API access.

Setup Wizard — 544-Line Interactive Installer

One command (./setup.sh) takes a fresh clone to a fully running AI stack. Six phases, all idempotent.

  $ ./setup.sh

  PHASE 0 — Prerequisites
  │ Check: Docker, Docker Compose, docker group, /dev/kfd + /dev/dri, HF CLI
  │ Aborts with actionable error if anything missing
  │
  PHASE 1 — Interactive API Key Configuration
  │ Prompts for: HF_TOKEN, LITELLM_MASTER_KEY, POSTGRES_PASSWORD, GITHUB_TOKEN
  │ Loads existing .env if present → "Keep this value? [Y/n]"
  │ Auto-generates secure defaults: sk-master-$(openssl rand -hex 8)
  │ Writes .env (chmod 600)
  │
  PHASE 2 — Install HuggingFace CLI
  │ Creates .venv if needed → pip install huggingface_hub[cli]
  │ Login with HF_TOKEN for gated models (Fish Speech)
  │
  PHASE 3 — Download HuggingFace Models
  │ Interactive: "Which groups? [1] Speech ~6GB [2] Creative ~31GB"
  │ Per-model skip: checks if directory has files → "SKIP (already downloaded)"
  │ Downloads: whisper-turbo, kokoro, fish-speech, xtts-v2, flux-schnell, cogvideox-2b
  │
  PHASE 4 — Build & Start Docker Stack
  │ docker compose pull → build (mem0-api) → start core (ollama, db, qdrant, redis)
  │ Wait for Ollama healthy → pull 5 Ollama models → start all remaining services
  │ Wait for LiteLLM healthy
  │
  PHASE 5 — Pull Ollama Models
  │ Shows existing vs. missing models with ✓/○ markers
  │ Interactive: "Pull missing models? [Y/n]"
  │ Streaming progress: pulling manifest 45%
  │
  PHASE 6 — Health Check & Summary
  │ Container status: ✓ Up / ✗ Down for each container
  │ Endpoint reachability: ✓ localhost:PORT for each service
  │ Final banner with all URLs and management commands

  Flags: --skip-downloads, --skip-keys, --help
Dashboard & Live GPU Monitor

A 674-line single-file HTML dashboard served by nginx:alpine on port 9000. Features live GPU monitoring and service health checks.

📊 Live GPU Monitor

Polls GET /api/ps on Ollama every 5 seconds to show loaded models, VRAM usage, and active inference requests. Color-coded status: green (loaded), yellow (loading), red (error).
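The dashboard itself is dependency-free JS, but the reduction it performs over /api/ps is small enough to sketch here in Python (field names follow Ollama's API; the sample payload is illustrative):

```python
def loaded_summary(ps_response: dict) -> list[tuple[str, float]]:
    """Reduce Ollama's GET /api/ps payload to (model, VRAM-in-GiB) pairs."""
    return [(m["name"], round(m["size_vram"] / 1024**3, 1))
            for m in ps_response.get("models", [])]

# Illustrative payload shaped like Ollama's response:
sample = {"models": [
    {"name": "deepseek-r1:32b", "size_vram": 21_474_836_480},
    {"name": "deepseek-v2:16b", "size_vram": 10_737_418_240},
]}
print(loaded_summary(sample))  # [('deepseek-r1:32b', 20.0), ('deepseek-v2:16b', 10.0)]
```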

🩺 Service Health Grid

11 service cards with live health status. Each card pings its service endpoint and displays ✓ (healthy) or ✗ (unreachable). Includes direct links to each service UI.

🌐 Tailscale-Aware URLs

Auto-detects current hostname via window.location.hostname and rewrites all service URLs to match. Works seamlessly when accessed over Tailscale from any device on the mesh network.

🎨 Design

Dark theme (#0f0f13), purple/teal gradient accents, fully responsive grid layout. Zero dependencies — pure HTML/CSS/JS. GPU badge displays hardware info prominently.

Continue.dev IDE Integration

A YAML configuration file that connects VS Code to all local Ollama models via Continue.dev, with 4 specialized chat agents, tab autocomplete, and semantic codebase search.

continue-config.yaml
models:
  # 4 chat agents, each with a specialized system prompt
  - name: DeepSeek R1 32B (Coding Agent)     # primary — Claude Code patterns
  - name: DeepSeek V2 16B (Fast Code)         # light tasks + autocomplete
  - name: DeepSeek R1 70B (Deep Reasoning)    # complex architecture
  - name: Llama 3.3 8B (General)              # general purpose

tabAutocomplete:
  provider: ollama
  model: deepseek-v2:16b                      # fast inline completions

embeddings:
  provider: ollama
  model: nomic-embed-text                     # @codebase semantic search

context:
  - code
  - docs
  - diff
  - terminal
  - problems
  - folder
  - codebase

@codebase searches the entire workspace using nomic-embed-text embeddings indexed locally. When you type @codebase how does the auth middleware work?, Continue embeds the query, searches the local vector index, and injects the top-k relevant code snippets into the LLM context. All local — nothing leaves the machine.
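Under the hood this is cosine-similarity top-k over the local index. A minimal pure-Python sketch (real vectors are 768-d nomic-embed-text embeddings; snippet IDs and function names here are hypothetical):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec: list[float], index: dict[str, list[float]], k: int = 3) -> list[str]:
    """index maps snippet_id -> embedding. Returns the k closest snippet IDs."""
    ranked = sorted(index, key=lambda sid: cosine(query_vec, index[sid]),
                    reverse=True)
    return ranked[:k]
```

The winning snippets are then injected verbatim into the LLM prompt as context.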

Agent Prompt Distillation

Four specialized system prompts in prompts/ — distilled from analyzing the behavioral patterns of commercial AI coding agents. Designed for local models that need stronger steering than cloud models.

  Agent             File                 Distilled From                   Domain
  Coding Agent      coding-agent.md      Claude Code + Cursor Agent 2.0   Software engineering, code generation, file editing
  UI Agent          ui-agent.md          v0 (Vercel) + Bolt.new           Frontend development, CSS, component design
  Debugging Agent   debugging-agent.md   Claude Code + Devin AI           Root cause analysis, hypothesis ranking, fix verification
  Hardware Agent    hardware-agent.md    Domain expertise                 Drones, ESC firmware, PX4, sensor fusion

📝 Prompt Engineering for Local Models

Cloud models (Claude, GPT-4o) produce good results with minimal instruction. Local models (DeepSeek 32B) need explicit behavioral rules to match that quality. The system prompts encode hundreds of behavioral rules distilled from observing commercial agents: no preamble/postamble, concise answers, convention-following, security practices, debugging methodology (observe → hypothesize → test → fix → prevent), and proactiveness constraints. The result is near-commercial quality coding assistance running entirely on local hardware.

ComfyUI Workflows

Two JSON workflow files for visual media generation, both using CogVideoX-2b for video synthesis. ComfyUI runs a custom ROCm 6.0 Docker build with GPU passthrough.

🎬 Text-to-Video (Animated WEBP)

File: cogvideox-2b-text-to-video.json

Pipeline: 8 ComfyUI nodes → CogVideoX-2b generates 17 frames at 480×320 → assembled into animated WEBP at 6 FPS.

Generation time: ~200 seconds per clip on W7900.

VRAM: ~14 GB (model + T5-XXL encoder).

🖼️ Text-to-Video (Frame Sequence)

File: cogvideox-2b-frames.json

Pipeline: Same CogVideoX-2b generation, but outputs 17 individual PNG frames instead of assembled video. Useful for post-processing or custom frame manipulation.

Image generation uses FLUX.1-schnell FP8 (~17 GB VRAM, ~10s per image) directly through ComfyUI's built-in checkpoints system. Open WebUI can trigger image generation via the ComfyUI API integration.

Project Structure
cognitive-silo/ — 24 files · 4,076 LOC
├── docker-compose.yml         # 453 lines — 15 services, health checks, GPU passthrough
├── litellm_config.yaml         # 71 lines — 9 model entries, fallback chains
├── continue-config.yaml        # 103 lines — 4 agents, autocomplete, embeddings
├── setup.sh                    # 544 lines — 6-phase interactive installer
├── dashboard.html              # 674 lines — service portal + live GPU monitor
├── README.md                   # project documentation
├── comfyui/
│   └── Dockerfile              # ROCm 6.0 custom ComfyUI build
├── mem0-api/
│   ├── Dockerfile              # Python 3.12 + Mem0 + Qdrant client
│   └── main.py                 # Memory API (FastAPI)
├── musicgen-api/
│   ├── Dockerfile              # Python + AudioCraft
│   └── main.py                 # Text-to-music REST API
├── musicgen-ui/
│   ├── Dockerfile              # Gradio web UI
│   └── app.py                  # MusicGen playground
├── prompts/
│   ├── coding-agent.md         # Claude Code + Cursor patterns
│   ├── ui-agent.md             # v0/Vercel + Bolt.new patterns
│   ├── debugging-agent.md      # Claude Code + Devin patterns
│   └── hardware-agent.md       # Drone/ESC/PX4 domain
├── scripts/
│   ├── download-models.sh      # HuggingFace model downloader
│   ├── ollama-init.sh          # Pull + pre-warm Ollama models
│   └── configure-openwebui-images.py
└── workflows/
    ├── cogvideox-2b-text-to-video.json
    └── cogvideox-2b-frames.json
Explore More

See the overview for capabilities and cost savings, or explore the SMA pipeline that runs on this infrastructure.
