Files
Munich-news/docs/OLLAMA_SETUP.md
2025-11-11 17:58:12 +01:00

7.8 KiB

Ollama Setup Guide

This project includes an integrated Ollama service for AI-powered summarization and translation.

🚀 Want 5-10x faster performance? See GPU_SETUP.md for GPU acceleration setup.

The docker-compose.yml includes an Ollama service that automatically:

  • Runs Ollama server (internal only, not exposed to host)
  • Pulls the phi3:latest model on first startup
  • Persists model data in a Docker volume
  • Supports GPU acceleration (NVIDIA GPUs)
  • Only accessible by other Docker Compose services for security

GPU Support

Ollama can use NVIDIA GPUs for significantly faster inference (5-10x speedup).

Prerequisites:

  • NVIDIA GPU with CUDA support
  • NVIDIA drivers installed
  • NVIDIA Container Toolkit installed

Installation (Ubuntu/Debian):

# Install NVIDIA Container Toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

Start with GPU support:

# Automatic detection and startup
./start-with-gpu.sh

# Or manually specify GPU support
docker-compose -f docker-compose.yml -f docker-compose.gpu.yml up -d

Verify GPU is being used:

# Check if GPU is detected
docker exec munich-news-ollama nvidia-smi

# Monitor GPU usage during inference
watch -n 1 'docker exec munich-news-ollama nvidia-smi'

Configuration

Update your backend/.env file with one of these configurations:

For Docker Compose (services communicate via internal network):

OLLAMA_ENABLED=true
OLLAMA_BASE_URL=http://ollama:11434
OLLAMA_MODEL=phi3:latest
OLLAMA_TIMEOUT=120

For external Ollama server (running on host machine):

OLLAMA_ENABLED=true
OLLAMA_BASE_URL=http://host.docker.internal:11434
OLLAMA_MODEL=phi3:latest
OLLAMA_TIMEOUT=120

Starting the Services

# Option 1: Auto-detect GPU and start (recommended)
./start-with-gpu.sh

# Option 2: Start with GPU support (if you have NVIDIA GPU)
docker-compose -f docker-compose.yml -f docker-compose.gpu.yml up -d

# Option 3: Start without GPU (CPU only)
docker-compose up -d

# Check Ollama logs
docker-compose logs -f ollama

# Check model setup logs
docker-compose logs ollama-setup

# Verify Ollama is running (from inside a container)
docker-compose exec crawler curl http://ollama:11434/api/tags

First Time Setup

On first startup, the ollama-setup service will automatically pull the phi3:latest model. This may take several minutes depending on your internet connection (model is ~2.3GB).

You can monitor the progress:

docker-compose logs -f ollama-setup

Available Models

The default model is phi3:latest (2.3GB), which provides a good balance of speed and quality.

To use a different model:

  1. Update OLLAMA_MODEL in your .env file
  2. Pull the model manually:
    docker-compose exec ollama ollama pull <model-name>
    

Popular alternatives:

  • llama3.2:latest - Larger, more capable model
  • mistral:latest - Fast and efficient
  • gemma2:2b - Smallest, fastest option

Troubleshooting

Ollama service not starting:

# Check if port 11434 is already in use
lsof -i :11434

# Restart the service
docker-compose restart ollama

# Check logs
docker-compose logs ollama

Model not downloading:

# Manually pull the model
docker-compose exec ollama ollama pull phi3:latest

# Check available models
docker-compose exec ollama ollama list

GPU not being detected:

# Check if NVIDIA drivers are installed
nvidia-smi

# Check if Docker can access GPU
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi

# Verify GPU is available in Ollama container
docker exec munich-news-ollama nvidia-smi

# Check Ollama logs for GPU initialization
docker-compose logs ollama | grep -i gpu

GPU out of memory:

  • Phi3 requires ~2-4GB VRAM
  • Close other GPU applications
  • Use a smaller model: gemma2:2b (requires ~1.5GB VRAM)
  • Or fall back to CPU mode

CPU out of memory errors:

  • Phi3 requires ~4GB RAM
  • Consider using a smaller model like gemma2:2b
  • Or increase Docker's memory limit in Docker Desktop settings

Slow performance even with GPU:

  • Ensure GPU drivers are up to date
  • Check GPU utilization: watch -n 1 'docker exec munich-news-ollama nvidia-smi'
  • Verify you're using the GPU compose file: docker-compose -f docker-compose.yml -f docker-compose.gpu.yml up -d
  • Some models may not fully utilize GPU - try different models

Local Ollama Installation

If you prefer to run Ollama directly on your host machine:

  1. Install Ollama: https://ollama.ai/download
  2. Pull the model: ollama pull phi3:latest
  3. Start Ollama: ollama serve
  4. Update .env to use http://host.docker.internal:11434

Testing the Setup

Basic API Test

# Test Ollama API from inside a container
docker-compose exec crawler curl -s http://ollama:11434/api/generate -d '{
  "model": "phi3:latest",
  "prompt": "Translate to English: Guten Morgen",
  "stream": false
}'

GPU Verification

# Check if GPU is detected
docker exec munich-news-ollama nvidia-smi

# Monitor GPU usage during a test
# Terminal 1: Monitor GPU
watch -n 1 'docker exec munich-news-ollama nvidia-smi'

# Terminal 2: Run test crawl
docker-compose exec crawler python crawler_service.py 1

# You should see GPU memory usage increase during inference

Full Integration Test

# Run a test crawl to verify translation works
docker-compose exec crawler python crawler_service.py 1

# Check the logs for translation timing
# GPU: ~0.3-0.5s per translation
# CPU: ~1-2s per translation
docker-compose logs crawler | grep "Title translated"

Performance Notes

CPU Performance

  • First request may be slow as the model loads into memory (~10-30 seconds)
  • Subsequent requests are faster (cached in memory)
  • Translation: 0.5-2 seconds per title
  • Summarization: 5-10 seconds per article
  • Recommended: 4+ CPU cores, 8GB+ RAM

GPU Performance (NVIDIA)

  • Model loads faster (~5-10 seconds)
  • Translation: 0.1-0.5 seconds per title (5-10x faster)
  • Summarization: 1-3 seconds per article (3-5x faster)
  • Recommended: 4GB+ VRAM for phi3:latest
  • Larger models (llama3.2) require 8GB+ VRAM

Performance Comparison

Operation CPU (4 cores) GPU (RTX 3060) Speedup
Model Load 20s 8s 2.5x
Translation 1.5s 0.3s 5x
Summarization 8s 2s 4x
10 Articles 90s 25s 3.6x

Tip: GPU acceleration is most beneficial when processing many articles in batch.


Integration Complete

What's Included

Ollama service integrated into Docker Compose Automatic model download (phi3:latest, 2.2GB) GPU support with automatic detection CPU fallback when GPU unavailable Internal-only access (secure) Persistent model storage

Quick Verification

# Check Ollama is running
docker ps | grep ollama

# Check model is downloaded
docker-compose exec ollama ollama list

# Test from inside network
docker-compose exec crawler python -c "
from ollama_client import OllamaClient
from config import Config
client = OllamaClient(Config.OLLAMA_BASE_URL, Config.OLLAMA_MODEL, Config.OLLAMA_ENABLED)
print(client.translate_title('Guten Morgen'))
"

Performance

CPU Mode:

  • Translation: ~1.5s per title
  • Summarization: ~8s per article
  • Suitable for <20 articles/day

GPU Mode:

  • Translation: ~0.3s per title (5x faster)
  • Summarization: ~2s per article (4x faster)
  • Suitable for high-volume processing

See GPU_SETUP.md for GPU acceleration setup.