Ollama Setup Guide
This project includes an integrated Ollama service for AI-powered summarization and translation.
🚀 Want 5-10x faster performance? See GPU_SETUP.md for GPU acceleration setup.
Docker Compose Setup (Recommended)
The docker-compose.yml includes an Ollama service that automatically:
- Runs Ollama server (internal only, not exposed to host)
- Pulls the phi3:latest model on first startup
- Persists model data in a Docker volume
- Supports GPU acceleration (NVIDIA GPUs)
- Only accessible by other Docker Compose services for security
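For reference, a minimal sketch of what this service definition might look like. Names and details here are illustrative (the container name is assumed from the commands later in this guide); the actual docker-compose.yml in the repo is authoritative:
services:
  ollama:
    image: ollama/ollama:latest
    container_name: munich-news-ollama    # assumed name, matches commands below
    volumes:
      - ollama_data:/root/.ollama         # persists downloaded models across restarts
    restart: unless-stopped
    # note: no "ports:" mapping - the API stays on the internal Compose network

  ollama-setup:
    image: ollama/ollama:latest
    depends_on:
      - ollama
    environment:
      - OLLAMA_HOST=http://ollama:11434   # point the CLI at the ollama service
    entrypoint: ["ollama", "pull", "phi3:latest"]   # one-shot model download
    restart: "no"

volumes:
  ollama_data: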
GPU Support
Ollama can use NVIDIA GPUs for significantly faster inference (5-10x speedup).
Prerequisites:
- NVIDIA GPU with CUDA support
- NVIDIA drivers installed
- NVIDIA Container Toolkit installed
Installation (Ubuntu/Debian):
# Install NVIDIA Container Toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
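On newer toolkit versions you may also need to register the NVIDIA runtime with Docker; if GPU containers still fail to start after the restart, try:
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker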
Start with GPU support:
# Automatic detection and startup
./start-with-gpu.sh
# Or manually specify GPU support
docker-compose -f docker-compose.yml -f docker-compose.gpu.yml up -d
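The docker-compose.gpu.yml override typically just adds a GPU reservation to the Ollama service. A minimal sketch, assuming the standard Compose device-reservation syntax (check the actual file in the repo):
services:
  ollama:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all            # or a specific number of GPUs
              capabilities: [gpu]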
Verify GPU is being used:
# Check if GPU is detected
docker exec munich-news-ollama nvidia-smi
# Monitor GPU usage during inference
watch -n 1 'docker exec munich-news-ollama nvidia-smi'
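You can also ask Ollama itself where a loaded model is running. After it has handled at least one request, ollama ps reports whether the model sits in GPU or CPU memory:
docker exec munich-news-ollama ollama ps
# The PROCESSOR column should read something like "100% GPU" when acceleration is active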
Configuration
Update your backend/.env file with one of these configurations:
For Docker Compose (services communicate via internal network):
OLLAMA_ENABLED=true
OLLAMA_BASE_URL=http://ollama:11434
OLLAMA_MODEL=phi3:latest
OLLAMA_TIMEOUT=120
For external Ollama server (running on host machine):
OLLAMA_ENABLED=true
OLLAMA_BASE_URL=http://host.docker.internal:11434
OLLAMA_MODEL=phi3:latest
OLLAMA_TIMEOUT=120
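Note that host.docker.internal resolves automatically on Docker Desktop (macOS/Windows) but not on native Linux. There you typically need to map it to the host gateway in the backend/crawler service definition, for example:
extra_hosts:
  - "host.docker.internal:host-gateway"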
Starting the Services
# Option 1: Auto-detect GPU and start (recommended)
./start-with-gpu.sh
# Option 2: Start with GPU support (if you have NVIDIA GPU)
docker-compose -f docker-compose.yml -f docker-compose.gpu.yml up -d
# Option 3: Start without GPU (CPU only)
docker-compose up -d
# Check Ollama logs
docker-compose logs -f ollama
# Check model setup logs
docker-compose logs ollama-setup
# Verify Ollama is running (from inside a container)
docker-compose exec crawler curl http://ollama:11434/api/tags
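A healthy response is a JSON document listing the installed models, roughly of this form (values abridged):
{"models": [{"name": "phi3:latest", "size": ..., "modified_at": ...}]}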
First Time Setup
On first startup, the ollama-setup service will automatically pull the phi3:latest model. This may take several minutes depending on your internet connection (model is ~2.3GB).
You can monitor the progress:
docker-compose logs -f ollama-setup
Available Models
The default model is phi3:latest (2.3GB), which provides a good balance of speed and quality.
To use a different model:
- Update OLLAMA_MODEL in your .env file
- Pull the model manually:
docker-compose exec ollama ollama pull <model-name>
Popular alternatives:
- llama3.2:latest - Larger, more capable model
- mistral:latest - Fast and efficient
- gemma2:2b - Smallest, fastest option
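For example, switching to the smallest option might look like this (adjust the service name to whichever backend service reads the .env file):
# Pull the new model into the shared Ollama volume
docker-compose exec ollama ollama pull gemma2:2b
# Set OLLAMA_MODEL=gemma2:2b in backend/.env, then restart the consuming service
docker-compose restart crawler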
Troubleshooting
Ollama service not starting:
# Check if port 11434 is already in use
lsof -i :11434
# Restart the service
docker-compose restart ollama
# Check logs
docker-compose logs ollama
Model not downloading:
# Manually pull the model
docker-compose exec ollama ollama pull phi3:latest
# Check available models
docker-compose exec ollama ollama list
GPU not being detected:
# Check if NVIDIA drivers are installed
nvidia-smi
# Check if Docker can access GPU
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi
# Verify GPU is available in Ollama container
docker exec munich-news-ollama nvidia-smi
# Check Ollama logs for GPU initialization
docker-compose logs ollama | grep -i gpu
GPU out of memory:
- Phi3 requires ~2-4GB VRAM
- Close other GPU applications
- Use a smaller model: gemma2:2b (requires ~1.5GB VRAM)
- Or fall back to CPU mode
CPU out of memory errors:
- Phi3 requires ~4GB RAM
- Consider using a smaller model like gemma2:2b
- Or increase Docker's memory limit in Docker Desktop settings
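To see how much memory the Ollama container actually uses while a model is loaded, check live container stats (container name taken from the examples above):
docker stats munich-news-ollama --no-stream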
Slow performance even with GPU:
- Ensure GPU drivers are up to date
- Check GPU utilization: watch -n 1 'docker exec munich-news-ollama nvidia-smi'
- Verify you're using the GPU compose file: docker-compose -f docker-compose.yml -f docker-compose.gpu.yml up -d
- Some models may not fully utilize the GPU - try a different model
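It is also worth confirming that model layers were actually offloaded to the GPU. Ollama's logs usually mention offloading when a model loads (exact wording varies by version):
docker-compose logs ollama | grep -i offload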
Local Ollama Installation
If you prefer to run Ollama directly on your host machine:
- Install Ollama: https://ollama.ai/download
- Pull the model: ollama pull phi3:latest
- Start Ollama: ollama serve
- Update .env to use http://host.docker.internal:11434
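Before pointing the containers at the host installation, you can confirm it is reachable from the host:
curl http://localhost:11434/api/tags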
Testing the Setup
Basic API Test
# Test Ollama API from inside a container
docker-compose exec crawler curl -s http://ollama:11434/api/generate -d '{
"model": "phi3:latest",
"prompt": "Translate to English: Guten Morgen",
"stream": false
}'
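With "stream": false the API returns a single JSON object; the generated text appears in the response field, roughly (other fields abridged):
{"model": "phi3:latest", "response": "Good morning", "done": true, ...}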
GPU Verification
# Check if GPU is detected
docker exec munich-news-ollama nvidia-smi
# Monitor GPU usage during a test
# Terminal 1: Monitor GPU
watch -n 1 'docker exec munich-news-ollama nvidia-smi'
# Terminal 2: Run test crawl
docker-compose exec crawler python crawler_service.py 1
# You should see GPU memory usage increase during inference
Full Integration Test
# Run a test crawl to verify translation works
docker-compose exec crawler python crawler_service.py 1
# Check the logs for translation timing
# GPU: ~0.3-0.5s per translation
# CPU: ~1-2s per translation
docker-compose logs crawler | grep "Title translated"
Performance Notes
CPU Performance
- First request may be slow as the model loads into memory (~10-30 seconds)
- Subsequent requests are faster (cached in memory)
- Translation: 0.5-2 seconds per title
- Summarization: 5-10 seconds per article
- Recommended: 4+ CPU cores, 8GB+ RAM
GPU Performance (NVIDIA)
- Model loads faster (~5-10 seconds)
- Translation: 0.1-0.5 seconds per title (5-10x faster)
- Summarization: 1-3 seconds per article (3-5x faster)
- Recommended: 4GB+ VRAM for phi3:latest
- Larger models (llama3.2) require 8GB+ VRAM
Performance Comparison
| Operation | CPU (4 cores) | GPU (RTX 3060) | Speedup |
|---|---|---|---|
| Model Load | 20s | 8s | 2.5x |
| Translation | 1.5s | 0.3s | 5x |
| Summarization | 8s | 2s | 4x |
| 10 Articles | 90s | 25s | 3.6x |
Tip: GPU acceleration is most beneficial when processing many articles in batch.