Cognitive Silo: Private AI Infrastructure

A self-hosted AI stack that deploys LLMs, voice cloning, image generation, video generation, music generation, and persistent memory behind a single OpenAI-compatible gateway. A single docker compose up -d launches 15 containers on an AMD Radeon PRO W7900 (48 GB VRAM). This is the infrastructure layer that powers the SMA pipeline and all local AI workloads.

15 Docker services · 12 AI models · 48 GB VRAM managed · $0/mo recurring cost · 4,076 lines of code
Your Own AI Data Center

Cloud AI APIs are expensive, rate-limited, and route your data through third-party servers. Cognitive Silo eliminates all three: a single command deploys GPT-4 class reasoning, voice cloning, image synthesis, video generation, music composition, and semantic memory — all running locally on a single GPU. The entire stack speaks the OpenAI API, so any tool that supports OpenAI (VS Code, Python SDK, curl) works without changes.

  Any client that speaks OpenAI API:

  VS Code (Continue.dev)  ─┐
  SMA Pipeline (Python)   ─┤
  curl / Postman          ─┤─── LiteLLM Gateway (:4000) ──┬── Ollama (5 LLMs, ROCm GPU)
  Open WebUI (Browser)    ─┤    One key. One endpoint.    ├── Speaches (Whisper + Kokoro)
  Telegram Bot            ─┘                              ├── GPT-4o (cloud overflow)
                                                          └── + 9 more services below
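Because everything behind the gateway speaks the OpenAI wire format, a plain HTTP client is all it takes. A minimal sketch (the hostname, API key, and model alias are assumptions, substitute whatever your LiteLLM config defines):

```python
import json
import urllib.request

GATEWAY = "http://localhost:4000/v1"  # LiteLLM gateway; host is an assumption
API_KEY = "sk-local-example"          # hypothetical key from the LiteLLM config

def build_chat_request(model: str, prompt: str) -> dict:
    """Assemble a standard OpenAI-style chat-completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(model: str, prompt: str) -> str:
    """POST to the gateway exactly as any OpenAI client would."""
    req = urllib.request.Request(
        f"{GATEWAY}/chat/completions",
        data=json.dumps(build_chat_request(model, prompt)).encode(),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("deepseek-r1:32b", "Explain ROCm in one sentence."))
```

Swapping the base URL is the only change needed to point an existing OpenAI integration at the local stack.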
8 AI Capabilities, 12 Models

Eight capabilities spanning five generative modalities (chat/code, speech, image, video, music), plus persistent memory and IDE integration — all accessible from a single API endpoint with unified authentication.

💬 Chat & Code Generation

Deep reasoning, code synthesis, refactoring. Pre-warmed in VRAM for sub-second first-token latency.

deepseek-r1:32b · deepseek-v2:16b · deepseek-r1:70b · llama3.3 · gpt-4o (fallback)

🎙️ Speech-to-Text

Whisper Large v3 Turbo (CTranslate2). Real-time transcription with word-level timestamps.

faster-whisper-large-v3-turbo-ct2

🔊 Text-to-Speech (3 engines)

Kokoro for fast TTS, Fish Speech for expressive output, XTTS-v2 for voice cloning across 17 languages.

Kokoro-82M · OpenAudio S1-Mini · XTTS-v2 (58 speakers)
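Speech synthesis goes through the same OpenAI-style surface via the speaches container's /v1/audio/speech route. A sketch, assuming the host and the Kokoro voice ID (query the server for the real voice list):

```python
import json
import urllib.request

SPEACHES = "http://localhost:8000/v1"  # speaches container; host is an assumption

def build_speech_request(text: str, voice: str = "af_sky") -> dict:
    """OpenAI-style text-to-speech payload. The voice name is an
    assumption; check the speaches docs for the actual Kokoro voice IDs."""
    return {
        "model": "kokoro",
        "input": text,
        "voice": voice,
        "response_format": "wav",
    }

def synthesize(text: str, out_path: str = "out.wav") -> None:
    """POST the payload and write the returned audio bytes to disk."""
    req = urllib.request.Request(
        f"{SPEACHES}/audio/speech",
        data=json.dumps(build_speech_request(text)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as f:
        f.write(resp.read())
```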

🎨 Image Generation

FLUX.1-schnell FP8 via ComfyUI. High-quality images in ~10 seconds on W7900.

FLUX.1-schnell FP8 (~17 GB VRAM)
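ComfyUI exposes an HTTP API: a node graph exported from the UI ("Save (API Format)") is submitted to POST /prompt and queued for the GPU. A sketch, with the host as an assumption:

```python
import json
import urllib.request
import uuid

COMFYUI = "http://localhost:8188"  # comfyui container; host is an assumption

def build_submission(workflow: dict) -> dict:
    """ComfyUI's HTTP API wraps the node graph under the "prompt" key."""
    return {"prompt": workflow, "client_id": uuid.uuid4().hex}

def queue_workflow(path: str) -> str:
    """Submit a FLUX (or CogVideoX) workflow exported in API format."""
    with open(path) as f:
        workflow = json.load(f)
    req = urllib.request.Request(
        f"{COMFYUI}/prompt",
        data=json.dumps(build_submission(workflow)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # Poll /history/<prompt_id> afterwards to retrieve the output images.
        return json.load(resp)["prompt_id"]
```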

🎬 Video Generation

CogVideoX-2b generates 17-frame videos at 480×320. Two workflows: animated WEBP or individual frames.

CogVideoX-2b · T5-XXL-FP8 encoder (~200s/video)

🎵 Music Generation

Meta AudioCraft MusicGen 300M. Text-to-music with REST API and Gradio playground UI.

facebook/musicgen-small (300M)

🧠 Persistent AI Memory

Mem0 + Qdrant vector DB. Per-user memory isolation — each project gets its own context silo.

nomic-embed-text (768D) → Qdrant
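Per-user isolation means every write to the memory service carries a user_id, which becomes the silo key in Qdrant. A sketch of a client, where the host and the POST /memories route are assumptions (check the Mem0 server docs for your version's actual paths):

```python
import json
import urllib.request

MEM0 = "http://localhost:8080"  # mem0 container; host and route are assumptions

def build_memory_payload(text: str, user_id: str) -> dict:
    """Each project passes its own user_id, so memories never leak
    across context silos."""
    return {
        "messages": [{"role": "user", "content": text}],
        "user_id": user_id,
    }

def remember(text: str, user_id: str) -> dict:
    """Store a memory; the endpoint path is an assumption."""
    req = urllib.request.Request(
        f"{MEM0}/memories",
        data=json.dumps(build_memory_payload(text, user_id)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```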

💻 IDE Integration

Continue.dev config with 4 specialized agents, tab autocomplete, and semantic codebase search.

4 agent prompts · @codebase · autocomplete
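Wiring the IDE to the stack is a matter of pointing Continue.dev at the gateway. A sketch in Continue's JSON config format (field names vary by Continue version, and the model title, key, and Tailscale hostname are illustrative, not the project's actual file):

```json
{
  "models": [
    {
      "title": "DeepSeek R1 32B (local)",
      "provider": "openai",
      "model": "deepseek-r1:32b",
      "apiBase": "http://silo-host:4000/v1",
      "apiKey": "sk-local-example"
    }
  ]
}
```

Because the gateway is OpenAI-compatible, the "openai" provider works unchanged; the same pattern covers the tab-autocomplete model.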
6 Key Engineering Decisions

Each entry records a deliberate architectural choice, the alternatives considered, and the measurable outcome.

🔀 LiteLLM as Unified Gateway

One endpoint (port 4000), one auth key, one SDK (OpenAI Python) for ALL services — LLMs, STT, TTS, embeddings, cloud fallback. Any tool that speaks OpenAI API works without configuration changes.
Alt: Direct service calls with per-service auth, or custom API aggregator
→ Zero integration code. VS Code, SMA pipeline, curl, and Open WebUI all connect to the same URL.

📦 Init Container Pattern

3 ephemeral containers (model-downloader, ollama-init, speaches-init) run once on first deploy to download models and pre-warm VRAM, then exit cleanly. Subsequent starts are instant from cached volumes.
Alt: Bake models into Docker images, or download on every start
→ First run: 30-60 min setup. Every run after: instant. Images stay small.

🧮 VRAM Budget Management

48 GB is finite. OLLAMA_MAX_LOADED_MODELS=2 plus OLLAMA_KEEP_ALIVE=-1 keeps DeepSeek R1 32B (~20 GB) and V2 16B (~10 GB) pre-warmed in VRAM permanently; the remaining ~18 GB is reserved for ComfyUI image/video generation, with LRU eviction handling overflow.
Alt: Load models on demand (cold start), or use multiple GPUs
→ Sub-second first-token for primary models. No cold-start penalty for daily use.
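The budget above maps to a handful of compose settings. A sketch of the relevant fragment (service name, image tag, and volume name are illustrative, not the project's actual file; /dev/kfd and /dev/dri are the standard ROCm device mappings):

```yaml
services:
  ollama:
    image: ollama/ollama:rocm
    devices:
      - /dev/kfd            # ROCm compute interface
      - /dev/dri            # GPU render nodes
    environment:
      OLLAMA_MAX_LOADED_MODELS: "2"   # R1 32B (~20 GB) + V2 16B (~10 GB)
      OLLAMA_KEEP_ALIVE: "-1"         # never evict; models stay pre-warmed
    volumes:
      - ollama-models:/root/.ollama   # cached weights survive restarts
```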

☁️ Cloud Overflow Fallback

When the local GPU is saturated (running CogVideoX or third LLM), LiteLLM automatically routes to GPT-4o via GitHub Models. Transparent to clients — they see the same API.
Alt: Queue requests until GPU is free, or hard-fail
→ Zero downtime. Local-first, cloud-backup. Overflow is invisible to callers.
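In LiteLLM's proxy config, overflow routing is declared as a fallback from the local alias to the cloud one. A sketch with illustrative aliases (exact keys and the placement of the fallbacks setting depend on your LiteLLM version, consult the proxy docs):

```yaml
model_list:
  - model_name: deepseek-r1-32b          # local primary
    litellm_params:
      model: ollama/deepseek-r1:32b
      api_base: http://ollama:11434
  - model_name: gpt-4o                   # cloud overflow
    litellm_params:
      model: github/gpt-4o

router_settings:
  # If the local model errors (e.g. GPU saturated), retry on gpt-4o.
  fallbacks:
    - deepseek-r1-32b: ["gpt-4o"]
```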

🌐 Tailscale Mesh Networking

Stable IPs without exposing ports to the internet. Dashboard auto-rewrites URLs to match the current hostname. Full AI stack accessible from any device on the mesh.
Alt: Port forwarding + dynamic DNS, or VPN server
→ Access from coffee shop laptop. No firewall rules. Zero configuration per-device.

🤖 Agent Prompt Distillation

Local models (DeepSeek 32B) need stronger system prompts than cloud models to reach comparable quality. Four specialized prompts were distilled from the behavioral patterns of Claude Code, Cursor Agent 2.0, v0/Vercel, and Devin AI.
Alt: Use default system prompts, or fine-tune models
→ Near-commercial quality from local models. Coding, UI, debugging, and hardware agents.
15 Services, One Compose File

All services orchestrated via a single 453-line Docker Compose file with health checks, dependency ordering, and GPU passthrough (ROCm device mapping).

| Service | Port | Purpose | Notes |
|---|---|---|---|
| ollama | 11434 | LLM inference engine (ROCm GPU) | 5 models, VRAM pre-warming |
| litellm | 4000 | OpenAI-compatible API gateway | Auth, routing, fallbacks, retries |
| open-webui | 3000 | ChatGPT-style playground | Web UI for all models |
| speaches | 8000 | Whisper STT + Kokoro TTS | OpenAI-compatible endpoints |
| fish-speech | 8001 | Expressive TTS (OpenAudio S1-Mini) | Gradio UI |
| xtts | 8002 | Voice cloning TTS (XTTS-v2) | 58 speakers, 17 languages |
| musicgen | 8003 | Text-to-music (Meta AudioCraft) | REST API + Gradio UI (:8004) |
| comfyui | 8188 | Image gen (FLUX) + video gen (CogVideoX) | Custom ROCm 6.0 build |
| mem0 | 8080 | Persistent AI memory API | Per-user isolation via Qdrant |
| qdrant | 6333 | Vector database | Embeddings for Mem0 |
| db | 5432 | PostgreSQL 16 | LiteLLM persistence/analytics |
| redis | — | Redis 7 (cache + rate limiting) | Internal only |
| dashboard | 9000 | Service portal + live GPU monitor | nginx:alpine |
| model-downloader | — | Init: download HuggingFace models | Runs once, exits |
| ollama-init | — | Init: pull Ollama models + pre-warm VRAM | Runs once, exits |
$0/Month Production AI Stack

Every model, every service, every capability — zero recurring cost. Cloud equivalents priced at standard API rates.

| Capability | Self-hosted | Cloud equivalent |
|---|---|---|
| LLM inference | $0 | ~$200/mo (GPT-4o API) |
| Speech (STT + TTS) | $0 | ~$50/mo (Whisper + ElevenLabs) |
| Image generation | $0 | ~$40/mo (DALL-E 3 API) |
| Video generation | $0 | ~$100/mo (Runway/Pika) |
| Music generation | $0 | ~$30/mo (Suno/Udio) |
| Vector DB + memory | $0 | ~$25/mo (Pinecone) |

Equivalent cloud spend: ~$445/month saved — using open-source models on owned hardware.

Technology Map
| Layer | Technology | Role |
|---|---|---|
| GPU | AMD Radeon PRO W7900 (48 GB, RDNA 3) | All local AI inference |
| Runtime | ROCm 6.0 | AMD GPU compute (Ollama, ComfyUI) |
| Gateway | LiteLLM | OpenAI-compatible proxy + routing + auth |
| LLM Engine | Ollama (ROCm) | 5 models, VRAM management, pre-warming |
| STT | Speaches (Whisper) | Speech-to-text, word timestamps |
| TTS | Kokoro / Fish Speech / XTTS-v2 | 3 TTS engines: fast, expressive, voice cloning |
| Image | ComfyUI + FLUX.1-schnell FP8 | Image generation (~10s/image) |
| Video | ComfyUI + CogVideoX-2b | Video generation (~200s/clip) |
| Music | HuggingFace Transformers + MusicGen | Text-to-music (300M) |
| Memory | Mem0 + Qdrant + nomic-embed-text | Persistent semantic memory |
| Database | PostgreSQL 16 | LiteLLM analytics + persistence |
| Cache | Redis 7 | Rate limiting + response caching |
| Network | Tailscale | Mesh VPN for remote access |
| UI | Open WebUI / Gradio / nginx dashboard | Chat playground, service portal |
| IDE | Continue.dev | VS Code AI coding agent |
| Orchestration | Docker Compose (15 services) | Health checks, dependencies, GPU passthrough |
| Cloud Fallback | GitHub Models (GPT-4o) | Automatic overflow when GPU is busy |
Go Deeper

Explore the VRAM management strategy, service dependency DAG, LiteLLM routing config, init container pattern, and Docker Compose internals.

🏗️ Architecture Deep-Dive · View Source on GitHub · SMA Engine →