A self-hosted AI stack that deploys LLMs, voice cloning, image generation, video generation, music generation,
and persistent memory behind a single OpenAI-compatible gateway. One `docker compose up -d` launches 15 containers
on an AMD Radeon PRO W7900 (48 GB VRAM). This is the infrastructure layer that powers the
SMA pipeline and all local AI workloads.
Cloud AI APIs are expensive, rate-limited, and route your data through third-party servers. Cognitive Silo eliminates all three problems: a single command deploys GPT-4-class reasoning, voice cloning, image synthesis, video generation, music composition, and semantic memory, all running locally on a single GPU. The entire stack speaks the OpenAI API, so any tool that supports OpenAI (VS Code, the Python SDK, curl) works without changes.
```
Any client that speaks the OpenAI API:

VS Code (Continue.dev) ─┐
SMA Pipeline (Python)  ─┤
curl / Postman         ─┼── LiteLLM Gateway (:4000) ──┬── Ollama (5 LLMs, ROCm GPU)
Open WebUI (Browser)   ─┤    One key. One endpoint.   ├── Speaches (Whisper + Kokoro)
Telegram Bot           ─┘                             ├── GPT-4o (cloud overflow)
                                                      └── + 9 more services below
```
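Because the gateway is OpenAI-compatible, any HTTP client can talk to it. A minimal sketch using only the Python standard library, assuming the gateway's default port (:4000); the API key and the model alias (`llama3`) are placeholders for whatever is configured in LiteLLM:

```python
import json
import urllib.request

GATEWAY = "http://localhost:4000/v1"  # LiteLLM gateway (see service table)
API_KEY = "sk-local-example"          # placeholder key configured in LiteLLM

def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat-completions request against the local gateway."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{GATEWAY}/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_chat_request("llama3", "Explain VRAM pre-warming in one sentence.")
# To actually send it (requires the stack to be running):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Pointing an existing OpenAI client (the official SDK, Continue.dev, Open WebUI) at `http://localhost:4000/v1` works the same way; only the base URL and key change.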
Five modalities of AI generation — all accessible from a single API endpoint with unified authentication.
Deep reasoning, code synthesis, refactoring. Pre-warmed in VRAM for sub-second first-token latency.
Whisper Large v3 Turbo (CTranslate2). Real-time transcription with word-level timestamps.
Kokoro for fast TTS, Fish Speech for expressive output, XTTS-v2 for voice cloning across 17 languages.
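Since the three TTS engines sit on adjacent ports, a small helper can route requests to the right one. The port numbers follow the service table in this README; everything else here is illustrative:

```python
# Port map for the three TTS engines, per the service table in this README.
TTS_PORTS = {
    "kokoro": 8000,  # speaches: fast TTS, OpenAI-compatible endpoints
    "fish":   8001,  # fish-speech: expressive output
    "xtts":   8002,  # xtts: voice cloning across 17 languages
}

def tts_base_url(engine: str, host: str = "localhost") -> str:
    """Return the base URL for a TTS engine, or raise on an unknown name."""
    try:
        return f"http://{host}:{TTS_PORTS[engine]}"
    except KeyError:
        raise ValueError(f"unknown TTS engine: {engine!r}") from None

print(tts_base_url("xtts"))  # http://localhost:8002
```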
FLUX.1-schnell FP8 via ComfyUI. High-quality images in ~10 seconds on W7900.
CogVideoX-2b generates 17-frame videos at 480×320. Two workflows: animated WEBP or individual frames.
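ComfyUI drives both the image and video workflows through its HTTP API. A sketch of queueing a workflow via the `/prompt` endpoint; the node graph below is a placeholder, not a real FLUX or CogVideoX workflow (export one from the ComfyUI editor with "Save (API Format)"):

```python
import json
import urllib.request
import uuid

COMFY = "http://localhost:8188"  # ComfyUI port from the service table

def queue_workflow(workflow: dict) -> urllib.request.Request:
    """Build the request that queues a node graph on ComfyUI's /prompt endpoint."""
    payload = {"prompt": workflow, "client_id": str(uuid.uuid4())}
    return urllib.request.Request(
        f"{COMFY}/prompt",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Placeholder graph: real workflows map node ids to class_type + inputs.
req = queue_workflow({"1": {"class_type": "KSampler", "inputs": {}}})
# with urllib.request.urlopen(req) as resp:   # requires ComfyUI running
#     print(json.load(resp)["prompt_id"])
```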
Meta AudioCraft MusicGen 300M. Text-to-music with REST API and Gradio playground UI.
Mem0 + Qdrant vector DB. Per-user memory isolation — each project gets its own context silo.
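Per-user isolation comes down to scoping every write and every search by a user id. A sketch of the request bodies, assuming Mem0's REST conventions; the exact endpoint paths depend on the Mem0 version deployed, so check the server on :8080:

```python
import json

def add_memory_payload(user_id: str, text: str) -> str:
    """JSON body that stores a memory scoped to one user (one context silo)."""
    return json.dumps({
        "messages": [{"role": "user", "content": text}],
        "user_id": user_id,  # isolation key: each project uses its own id
    })

def search_payload(user_id: str, query: str) -> str:
    """JSON body for a semantic search restricted to that same user's silo."""
    return json.dumps({"query": query, "user_id": user_id})

# Two projects never see each other's memories:
print(add_memory_payload("project-sma", "Prefers ROCm builds over CUDA."))
print(search_payload("project-sma", "GPU preferences"))
```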
Continue.dev config with 4 specialized agents, tab autocomplete, and semantic codebase search.
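A Continue.dev `config.json` that points both chat and tab autocomplete at the LiteLLM gateway might look like the fragment below; the model alias and key are placeholders, and the repo's actual four-agent config will differ:

```json
{
  "models": [
    {
      "title": "Local LLM (via LiteLLM)",
      "provider": "openai",
      "model": "local-llm",
      "apiBase": "http://localhost:4000/v1",
      "apiKey": "sk-local-example"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Local autocomplete",
    "provider": "openai",
    "model": "local-llm",
    "apiBase": "http://localhost:4000/v1",
    "apiKey": "sk-local-example"
  }
}
```

Because LiteLLM speaks the OpenAI protocol, Continue's generic `openai` provider is all that is needed; no custom plugin is involved.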
Each decision explains a deliberate architectural choice, the alternatives considered, and the measurable outcome.
All services orchestrated via a single 453-line Docker Compose file with health checks, dependency ordering, and GPU passthrough (ROCm device mapping).
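The three compose patterns named above look roughly like this in YAML. The images are real, but the fragment is a sketch of the pattern, not an excerpt of the 453-line file:

```yaml
services:
  ollama:
    image: ollama/ollama:rocm
    devices:
      - /dev/kfd          # ROCm compute interface (GPU passthrough)
      - /dev/dri          # GPU render nodes
    healthcheck:
      test: ["CMD", "ollama", "list"]
      interval: 10s
      retries: 5
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    depends_on:
      ollama:
        condition: service_healthy   # gateway waits for the LLM engine
```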
| Service | Port | Purpose | Notes |
|---|---|---|---|
| ollama | 11434 | LLM inference engine (ROCm GPU) | 5 models, VRAM pre-warming |
| litellm | 4000 | OpenAI-compatible API gateway | Auth, routing, fallbacks, retries |
| open-webui | 3000 | ChatGPT-style playground | Web UI for all models |
| speaches | 8000 | Whisper STT + Kokoro TTS | OpenAI-compatible endpoints |
| fish-speech | 8001 | Expressive TTS (OpenAudio S1-Mini) | Gradio UI |
| xtts | 8002 | Voice cloning TTS (XTTS-v2) | 58 speakers, 17 languages |
| musicgen | 8003 | Text-to-music (Meta AudioCraft) | REST API + Gradio UI (:8004) |
| comfyui | 8188 | Image gen (FLUX) + Video gen (CogVideoX) | Custom ROCm 6.0 build |
| mem0 | 8080 | Persistent AI memory API | Per-user isolation via Qdrant |
| qdrant | 6333 | Vector database | Embeddings for Mem0 |
| db | 5432 | PostgreSQL 16 | LiteLLM persistence/analytics |
| redis | — | Redis 7 (cache + rate limiting) | Internal only |
| dashboard | 9000 | Service portal + live GPU monitor | nginx:alpine |
| model-downloader | — | Init: download HuggingFace models | Runs once, exits |
| ollama-init | — | Init: pull Ollama models + pre-warm VRAM | Runs once, exits |
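On the routing side, a LiteLLM `config.yaml` wiring the Ollama backend and the GPT-4o overflow might look like the fragment below; the model aliases, env var name, and fallback mapping are illustrative, not the repo's actual config:

```yaml
model_list:
  - model_name: local-llm
    litellm_params:
      model: ollama/llama3            # served by the ollama container
      api_base: http://ollama:11434
  - model_name: gpt-4o
    litellm_params:
      model: github/gpt-4o            # GitHub Models cloud overflow
      api_key: os.environ/GITHUB_TOKEN
router_settings:
  num_retries: 2
  fallbacks:
    - local-llm: ["gpt-4o"]           # overflow when the GPU is busy
```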
Every model, every service, every capability — zero recurring cost. Cloud equivalents priced at standard API rates.
Estimated savings: ~$445/month of equivalent cloud spend, achieved by running open-source models on owned hardware.
| Layer | Technology | Role |
|---|---|---|
| GPU | AMD Radeon PRO W7900 (48 GB, RDNA 3) | All local AI inference |
| Runtime | ROCm 6.0 | AMD GPU compute (Ollama, ComfyUI) |
| Gateway | LiteLLM | OpenAI-compatible proxy + routing + auth |
| LLM Engine | Ollama (ROCm) | 5 models, VRAM management, pre-warming |
| STT | Speaches (Whisper) | Speech-to-text, word timestamps |
| TTS | Kokoro / Fish Speech / XTTS-v2 | 3 TTS engines: fast, expressive, voice cloning |
| Image | ComfyUI + FLUX.1-schnell FP8 | Image generation (~10s/image) |
| Video | ComfyUI + CogVideoX-2b | Video generation (~200s/clip) |
| Music | HuggingFace Transformers + MusicGen | Text-to-music (300M) |
| Memory | Mem0 + Qdrant + nomic-embed-text | Persistent semantic memory |
| Database | PostgreSQL 16 | LiteLLM analytics + persistence |
| Cache | Redis 7 | Rate limiting + response caching |
| Network | Tailscale | Mesh VPN for remote access |
| UI | Open WebUI / Gradio / nginx dashboard | Chat playground, service portal |
| IDE | Continue.dev | VS Code AI coding agent |
| Orchestration | Docker Compose (15 services) | Health checks, dependencies, GPU passthrough |
| Cloud Fallback | GitHub Models (GPT-4o) | Automatic overflow when GPU is busy |
Explore the VRAM management strategy, service dependency DAG, LiteLLM routing config, init container pattern, and Docker Compose internals.