Cognitive Silo: Private AI Infrastructure

A self-hosted AI stack that deploys LLMs, voice cloning, image generation, video generation, music generation, and persistent memory behind a single OpenAI-compatible gateway. A single docker compose up -d launches 15 containers on an AMD Radeon PRO W7900 (48 GB VRAM). This is the infrastructure layer that powers the SMA pipeline and all local AI workloads.

15 Docker services · 12 AI models · 48 GB VRAM managed · $0/mo recurring cost · 4,076 lines of code
Your Own AI Data Center

Cloud AI APIs are expensive, rate-limited, and route your data through third-party servers. Cognitive Silo eliminates all three: a single command deploys GPT-4 class reasoning, voice cloning, image synthesis, video generation, music composition, and semantic memory — all running locally on a single GPU. The entire stack speaks the OpenAI API, so any tool that supports OpenAI (VS Code, Python SDK, curl) works without changes.

  Any client that speaks OpenAI API:

  VS Code (Continue.dev)  ─┐
  SMA Pipeline (Python)   ─┤
  curl / Postman          ─┤─── LiteLLM Gateway (:4000) ──┬── Ollama (5 LLMs, ROCm GPU)
  Open WebUI (Browser)    ─┤    One key. One endpoint.    ├── Speaches (Whisper + Kokoro)
  Telegram Bot            ─┘                              ├── GPT-4o (cloud overflow)
                                                          └── + 9 more services below
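Because everything behind the gateway speaks the OpenAI wire format, a plain HTTP client is all it takes. A minimal sketch (the hostname, API key, and model alias are assumptions, substitute whatever your LiteLLM config defines):

```python
import json
import urllib.request

GATEWAY = "http://localhost:4000/v1"  # LiteLLM gateway; host is an assumption
API_KEY = "sk-local-example"          # hypothetical key from the LiteLLM config

def build_chat_request(model: str, prompt: str) -> dict:
    """Assemble a standard OpenAI-style chat-completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(model: str, prompt: str) -> str:
    """POST to the gateway exactly as any OpenAI client would."""
    req = urllib.request.Request(
        f"{GATEWAY}/chat/completions",
        data=json.dumps(build_chat_request(model, prompt)).encode(),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("deepseek-r1:32b", "Explain ROCm in one sentence."))
```

Swapping the base URL is the only change needed to point an existing OpenAI integration at the local stack.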
8 AI Capabilities, 12 Models

Eight capabilities spanning five generative modalities (chat/code, speech, image, video, music), plus persistent memory and IDE integration — all accessible from a single API endpoint with unified authentication.

💬 Chat & Code Generation

Deep reasoning, code synthesis, refactoring. Pre-warmed in VRAM for sub-second first-token latency.

deepseek-r1:32b · deepseek-v2:16b · deepseek-r1:70b · llama3.3 · gpt-4o (fallback)

🎙️ Speech-to-Text

Whisper Large v3 Turbo (CTranslate2). Real-time transcription with word-level timestamps.

faster-whisper-large-v3-turbo-ct2

🔊 Text-to-Speech (3 engines)

Kokoro for fast TTS, Fish Speech for expressive output, XTTS-v2 for voice cloning across 17 languages.

Kokoro-82M · OpenAudio S1-Mini · XTTS-v2 (58 speakers)
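Speech synthesis goes through the same OpenAI-style surface via the speaches container's /v1/audio/speech route. A sketch, assuming the host and the Kokoro voice ID (query the server for the real voice list):

```python
import json
import urllib.request

SPEACHES = "http://localhost:8000/v1"  # speaches container; host is an assumption

def build_speech_request(text: str, voice: str = "af_sky") -> dict:
    """OpenAI-style text-to-speech payload. The voice name is an
    assumption; check the speaches docs for the actual Kokoro voice IDs."""
    return {
        "model": "kokoro",
        "input": text,
        "voice": voice,
        "response_format": "wav",
    }

def synthesize(text: str, out_path: str = "out.wav") -> None:
    """POST the payload and write the returned audio bytes to disk."""
    req = urllib.request.Request(
        f"{SPEACHES}/audio/speech",
        data=json.dumps(build_speech_request(text)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as f:
        f.write(resp.read())
```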

🎨 Image Generation

FLUX.1-schnell FP8 via ComfyUI. High-quality images in ~10 seconds on W7900.

FLUX.1-schnell FP8 (~17 GB VRAM)
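ComfyUI exposes an HTTP API: a node graph exported from the UI ("Save (API Format)") is submitted to POST /prompt and queued for the GPU. A sketch, with the host as an assumption:

```python
import json
import urllib.request
import uuid

COMFYUI = "http://localhost:8188"  # comfyui container; host is an assumption

def build_submission(workflow: dict) -> dict:
    """ComfyUI's HTTP API wraps the node graph under the "prompt" key."""
    return {"prompt": workflow, "client_id": uuid.uuid4().hex}

def queue_workflow(path: str) -> str:
    """Submit a FLUX (or CogVideoX) workflow exported in API format."""
    with open(path) as f:
        workflow = json.load(f)
    req = urllib.request.Request(
        f"{COMFYUI}/prompt",
        data=json.dumps(build_submission(workflow)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # Poll /history/<prompt_id> afterwards to retrieve the output images.
        return json.load(resp)["prompt_id"]
```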

🎬 Video Generation

CogVideoX-2b generates 17-frame videos at 480×320. Two workflows: animated WEBP or individual frames.

CogVideoX-2b · T5-XXL-FP8 encoder (~200s/video)

🎵 Music Generation

Meta AudioCraft MusicGen 300M. Text-to-music with REST API and Gradio playground UI.

facebook/musicgen-small (300M)

🧠 Persistent AI Memory

Mem0 + Qdrant vector DB. Per-user memory isolation — each project gets its own context silo.

nomic-embed-text (768D) → Qdrant
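Per-user isolation means every write to the memory service carries a user_id, which becomes the silo key in Qdrant. A sketch of a client, where the host and the POST /memories route are assumptions (check the Mem0 server docs for your version's actual paths):

```python
import json
import urllib.request

MEM0 = "http://localhost:8080"  # mem0 container; host and route are assumptions

def build_memory_payload(text: str, user_id: str) -> dict:
    """Each project passes its own user_id, so memories never leak
    across context silos."""
    return {
        "messages": [{"role": "user", "content": text}],
        "user_id": user_id,
    }

def remember(text: str, user_id: str) -> dict:
    """Store a memory; the endpoint path is an assumption."""
    req = urllib.request.Request(
        f"{MEM0}/memories",
        data=json.dumps(build_memory_payload(text, user_id)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```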

💻 IDE Integration

Continue.dev config with 4 specialized agents, tab autocomplete, and semantic codebase search.

4 agent prompts · @codebase · autocomplete
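Wiring the IDE to the stack is a matter of pointing Continue.dev at the gateway. A sketch in Continue's JSON config format (field names vary by Continue version, and the model title, key, and Tailscale hostname are illustrative, not the project's actual file):

```json
{
  "models": [
    {
      "title": "DeepSeek R1 32B (local)",
      "provider": "openai",
      "model": "deepseek-r1:32b",
      "apiBase": "http://silo-host:4000/v1",
      "apiKey": "sk-local-example"
    }
  ]
}
```

Because the gateway is OpenAI-compatible, the "openai" provider works unchanged; the same pattern covers the tab-autocomplete model.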
6 Key Engineering Decisions

Each entry records a deliberate architectural choice, the alternatives considered, and the measurable outcome.

🔀 LiteLLM as Unified Gateway

One endpoint (port 4000), one auth key, one SDK (OpenAI Python) for ALL services — LLMs, STT, TTS, embeddings, cloud fallback. Any tool that speaks OpenAI API works without configuration changes.
Alt: Direct service calls with per-service auth, or custom API aggregator
→ Zero integration code. VS Code, SMA pipeline, curl, and Open WebUI all connect to the same URL.

📦 Init Container Pattern

3 ephemeral containers (model-downloader, ollama-init, speaches-init) run once on first deploy to download models and pre-warm VRAM, then exit cleanly. Subsequent starts are instant from cached volumes.
Alt: Bake models into Docker images, or download on every start
→ First run: 30-60 min setup. Every run after: instant. Images stay small.

🧮 VRAM Budget Management

48 GB is finite. OLLAMA_MAX_LOADED_MODELS=2 plus OLLAMA_KEEP_ALIVE=-1 keeps DeepSeek R1 32B (~20 GB) and V2 16B (~10 GB) pre-warmed in VRAM permanently; the remaining ~18 GB is reserved for ComfyUI image/video generation, with LRU eviction handling overflow.
Alt: Load models on demand (cold start), or use multiple GPUs
→ Sub-second first-token for primary models. No cold-start penalty for daily use.
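The budget above maps to a handful of compose settings. A sketch of the relevant fragment (service name, image tag, and volume name are illustrative, not the project's actual file; /dev/kfd and /dev/dri are the standard ROCm device mappings):

```yaml
services:
  ollama:
    image: ollama/ollama:rocm
    devices:
      - /dev/kfd            # ROCm compute interface
      - /dev/dri            # GPU render nodes
    environment:
      OLLAMA_MAX_LOADED_MODELS: "2"   # R1 32B (~20 GB) + V2 16B (~10 GB)
      OLLAMA_KEEP_ALIVE: "-1"         # never evict; models stay pre-warmed
    volumes:
      - ollama-models:/root/.ollama   # cached weights survive restarts
```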

☁️ Cloud Overflow Fallback

When the local GPU is saturated (running CogVideoX or third LLM), LiteLLM automatically routes to GPT-4o via GitHub Models. Transparent to clients — they see the same API.
Alt: Queue requests until GPU is free, or hard-fail
→ Zero downtime. Local-first, cloud-backup. Overflow is invisible to callers.
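In LiteLLM's proxy config, overflow routing is declared as a fallback from the local alias to the cloud one. A sketch with illustrative aliases (exact keys and the placement of the fallbacks setting depend on your LiteLLM version, consult the proxy docs):

```yaml
model_list:
  - model_name: deepseek-r1-32b          # local primary
    litellm_params:
      model: ollama/deepseek-r1:32b
      api_base: http://ollama:11434
  - model_name: gpt-4o                   # cloud overflow
    litellm_params:
      model: github/gpt-4o

router_settings:
  # If the local model errors (e.g. GPU saturated), retry on gpt-4o.
  fallbacks:
    - deepseek-r1-32b: ["gpt-4o"]
```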

🌐 Tailscale Mesh Networking

Stable IPs without exposing ports to the internet. Dashboard auto-rewrites URLs to match the current hostname. Full AI stack accessible from any device on the mesh.
Alt: Port forwarding + dynamic DNS, or VPN server
→ Access from coffee shop laptop. No firewall rules. Zero configuration per-device.

🤖 Agent Prompt Distillation

Local models (DeepSeek 32B) need stronger system prompts than cloud models to reach comparable quality. Four specialized prompts were distilled from the behavioral patterns of Claude Code, Cursor Agent 2.0, v0/Vercel, and Devin AI.
Alt: Use default system prompts, or fine-tune models
→ Near-commercial quality from local models. Coding, UI, debugging, and hardware agents.
15 Services, One Compose File

All services orchestrated via a single 453-line Docker Compose file with health checks, dependency ordering, and GPU passthrough (ROCm device mapping).

| Service | Port | Purpose | Notes |
|---|---|---|---|
| ollama | 11434 | LLM inference engine (ROCm GPU) | 5 models, VRAM pre-warming |
| litellm | 4000 | OpenAI-compatible API gateway | Auth, routing, fallbacks, retries |
| open-webui | 3000 | ChatGPT-style playground | Web UI for all models |
| speaches | 8000 | Whisper STT + Kokoro TTS | OpenAI-compatible endpoints |
| fish-speech | 8001 | Expressive TTS (OpenAudio S1-Mini) | Gradio UI |
| xtts | 8002 | Voice cloning TTS (XTTS-v2) | 58 speakers, 17 languages |
| musicgen | 8003 | Text-to-music (Meta AudioCraft) | REST API + Gradio UI (:8004) |
| comfyui | 8188 | Image gen (FLUX) + video gen (CogVideoX) | Custom ROCm 6.0 build |
| mem0 | 8080 | Persistent AI memory API | Per-user isolation via Qdrant |
| qdrant | 6333 | Vector database | Embeddings for Mem0 |
| db | 5432 | PostgreSQL 16 | LiteLLM persistence/analytics |
| redis | — | Redis 7 (cache + rate limiting) | Internal only |
| dashboard | 9000 | Service portal + live GPU monitor | nginx:alpine |
| model-downloader | — | Init: download HuggingFace models | Runs once, exits |
| ollama-init | — | Init: pull Ollama models + pre-warm VRAM | Runs once, exits |
$0/Month Production AI Stack

Every model, every service, every capability — zero recurring cost. Cloud equivalents priced at standard API rates.

| Capability | Self-hosted | Cloud equivalent |
|---|---|---|
| LLM inference | $0 | ~$200/mo (GPT-4o API) |
| Speech (STT + TTS) | $0 | ~$50/mo (Whisper + ElevenLabs) |
| Image generation | $0 | ~$40/mo (DALL-E 3 API) |
| Video generation | $0 | ~$100/mo (Runway/Pika) |
| Music generation | $0 | ~$30/mo (Suno/Udio) |
| Vector DB + memory | $0 | ~$25/mo (Pinecone) |

Equivalent cloud spend: ~$445/month saved — using open-source models on owned hardware.

Technology Map
| Layer | Technology | Role |
|---|---|---|
| GPU | AMD Radeon PRO W7900 (48 GB, RDNA 3) | All local AI inference |
| Runtime | ROCm 6.0 | AMD GPU compute (Ollama, ComfyUI) |
| Gateway | LiteLLM | OpenAI-compatible proxy + routing + auth |
| LLM Engine | Ollama (ROCm) | 5 models, VRAM management, pre-warming |
| STT | Speaches (Whisper) | Speech-to-text, word timestamps |
| TTS | Kokoro / Fish Speech / XTTS-v2 | 3 TTS engines: fast, expressive, voice cloning |
| Image | ComfyUI + FLUX.1-schnell FP8 | Image generation (~10s/image) |
| Video | ComfyUI + CogVideoX-2b | Video generation (~200s/clip) |
| Music | HuggingFace Transformers + MusicGen | Text-to-music (300M) |
| Memory | Mem0 + Qdrant + nomic-embed-text | Persistent semantic memory |
| Database | PostgreSQL 16 | LiteLLM analytics + persistence |
| Cache | Redis 7 | Rate limiting + response caching |
| Network | Tailscale | Mesh VPN for remote access |
| UI | Open WebUI / Gradio / nginx dashboard | Chat playground, service portal |
| IDE | Continue.dev | VS Code AI coding agent |
| Orchestration | Docker Compose (15 services) | Health checks, dependencies, GPU passthrough |
| Cloud Fallback | GitHub Models (GPT-4o) | Automatic overflow when GPU is busy |
Go Deeper

Explore the VRAM management strategy, service dependency DAG, LiteLLM routing config, init container pattern, and Docker Compose internals.

🏗️ Architecture Deep-Dive · View Source on GitHub · SMA Engine →