# GPU Setup Guide for Ollama

This guide explains how to enable GPU acceleration for Ollama to achieve 5-10x faster AI inference.
## Quick Start

```bash
# 1. Check if you have a compatible GPU
./check-gpu.sh

# 2. If a GPU is available, start with GPU support
./start-with-gpu.sh

# 3. Verify the GPU is being used
docker exec munich-news-ollama nvidia-smi
```
## Benefits of GPU Acceleration
| Operation | CPU (4 cores) | GPU (RTX 3060) | Speedup |
|---|---|---|---|
| Model Load | 20s | 8s | 2.5x |
| Translation | 1.5s | 0.3s | 5x |
| Summarization | 8s | 2s | 4x |
| 10 Articles | 90s | 25s | 3.6x |
**Bottom line:** processing 10 articles takes ~90 seconds on CPU vs ~25 seconds on GPU.
## Requirements

### Hardware
- NVIDIA GPU with CUDA support (GTX 1060 or newer recommended)
- Minimum 4GB VRAM for phi3:latest
- 8GB+ VRAM for larger models (llama3.2, etc.)
### Software
- NVIDIA drivers (version 525.60.13 or newer)
- Docker 20.10+
- Docker Compose v2.3+
- NVIDIA Container Toolkit
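You can quickly confirm these prerequisites from a terminal (the expected minimums are the versions listed above):

```bash
# Verify driver, Docker, and Compose versions against the requirements
nvidia-smi --query-gpu=driver_version --format=csv,noheader   # expect >= 525.60.13
docker --version                                              # expect >= 20.10
docker-compose --version                                      # expect >= v2.3
```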
## Installation

### Step 1: Install NVIDIA Drivers

**Ubuntu/Debian:**
```bash
# Check the current driver
nvidia-smi

# If not installed, install the recommended driver
sudo ubuntu-drivers autoinstall
sudo reboot
```
**Other Linux:** download drivers from https://www.nvidia.com/Download/index.aspx
### Step 2: Install NVIDIA Container Toolkit

**Ubuntu/Debian:**
```bash
# Add the repository
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# Configure Docker
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```
**RHEL/CentOS:**
```bash
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.repo | \
  sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
sudo yum install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```
### Step 3: Verify Installation

```bash
# Test GPU access from Docker
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi
# You should see your GPU information
```
## Usage

### Starting Services with GPU

**Option 1: Automatic (Recommended)**

```bash
./start-with-gpu.sh
```

This script automatically detects GPU availability and starts the services accordingly.
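For reference, the detection logic is roughly equivalent to the following sketch (the actual `start-with-gpu.sh` in this repo may differ):

```bash
#!/usr/bin/env bash
# Sketch: if Docker can reach an NVIDIA GPU, start with the GPU
# override file; otherwise fall back to the CPU-only configuration.
if docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi >/dev/null 2>&1; then
  echo "GPU detected, starting with GPU support"
  docker-compose -f docker-compose.yml -f docker-compose.gpu.yml up -d
else
  echo "No GPU detected, starting in CPU-only mode"
  docker-compose up -d
fi
```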
**Option 2: Manual**

```bash
# With GPU
docker-compose -f docker-compose.yml -f docker-compose.gpu.yml up -d

# Without GPU (CPU only)
docker-compose up -d
```
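The override file grants the Ollama service GPU access using Docker Compose's standard device-reservation syntax. A minimal version might look like this (a sketch; the actual `docker-compose.gpu.yml` in this repo may differ):

```bash
cat docker-compose.gpu.yml
# services:
#   ollama:
#     deploy:
#       resources:
#         reservations:
#           devices:
#             - driver: nvidia
#               count: all
#               capabilities: [gpu]
```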
### Verifying GPU Usage

```bash
# Check if the GPU is detected in the container
docker exec munich-news-ollama nvidia-smi

# Monitor GPU usage in real time
watch -n 1 'docker exec munich-news-ollama nvidia-smi'

# Run a test and watch GPU usage
# Terminal 1:
watch -n 1 'docker exec munich-news-ollama nvidia-smi'
# Terminal 2:
docker-compose exec crawler python crawler_service.py 2
```
You should see:
- GPU memory usage increase during inference
- GPU utilization spike to 80-100%
- Faster processing times in logs
## Troubleshooting

### GPU Not Detected

**Check NVIDIA drivers:**
```bash
nvidia-smi
# Should show GPU information
```

**Check Docker GPU access:**
```bash
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi
# Should show GPU information from inside the container
```

**Check the Ollama container:**
```bash
docker exec munich-news-ollama nvidia-smi
# Should show GPU information
```
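If `nvidia-smi` works on the host but the container checks fail, confirm that Docker actually registered the NVIDIA runtime:

```bash
# The list of runtimes should include "nvidia"
docker info | grep -i runtime
```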
### Out of Memory Errors

**Symptoms:**
- "CUDA out of memory" errors
- Container crashes during inference

**Solutions:**

1. **Use a smaller model:**
   ```bash
   # Edit backend/.env
   OLLAMA_MODEL=gemma2:2b  # Requires ~1.5GB VRAM
   ```
2. **Close other GPU applications:**
   ```bash
   # Check what's using the GPU
   nvidia-smi
   ```
3. **Increase GPU memory (if using Docker Desktop):**
   - Docker Desktop → Settings → Resources → Advanced
   - Increase memory allocation
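Before switching models, it can help to see how much VRAM is actually in use (the same `nvidia-smi` query style as in the Monitoring section below):

```bash
# Current VRAM usage vs. total, e.g. "3200 MiB, 6144 MiB"
docker exec munich-news-ollama \
  nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader
```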
### Slow Performance Despite GPU

**Check GPU utilization:**
```bash
watch -n 1 'docker exec munich-news-ollama nvidia-smi'
```

If GPU utilization is low (<50%):
- Ensure you're using the GPU compose file
- Check the Ollama logs for errors:
  ```bash
  docker-compose logs ollama
  ```
- Try a different model that better utilizes the GPU
- Update the NVIDIA drivers
### Docker Compose GPU Not Working

**Error:** `could not select device driver "" with capabilities: [[gpu]]`

**Solution:**
```bash
# Reconfigure the Docker runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Verify configuration
cat /etc/docker/daemon.json
# Should contain the nvidia runtime configuration
```
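For reference, the runtime entry written by `nvidia-ctk` typically looks like the following (shown as comments; your exact file may contain additional settings):

```bash
# Typical /etc/docker/daemon.json contents after configuration:
# {
#   "runtimes": {
#     "nvidia": {
#       "args": [],
#       "path": "nvidia-container-runtime"
#     }
#   }
# }
```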
## Performance Tuning

### Model Selection

Different models have different GPU requirements and performance characteristics:
| Model | VRAM | Speed | Quality | Best For |
|---|---|---|---|---|
| gemma2:2b | 1.5GB | Fastest | Good | High volume, speed critical |
| phi3:latest | 2-4GB | Fast | Very Good | Balanced (default) |
| llama3.2:3b | 4-6GB | Medium | Excellent | Quality critical |
| mistral:latest | 6-8GB | Medium | Excellent | Long-form content |
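To switch models, point `OLLAMA_MODEL` in `backend/.env` at the new model and pull it. This is a sketch; the `sed` edit and the assumption that the crawler service reads this variable are specific to this repo's layout:

```bash
# Select a smaller model and download it into the Ollama container
sed -i 's/^OLLAMA_MODEL=.*/OLLAMA_MODEL=gemma2:2b/' backend/.env
docker exec munich-news-ollama ollama pull gemma2:2b

# Restart the service that reads backend/.env (assumed: crawler)
docker-compose restart crawler
```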
### Batch Processing

GPU acceleration is most effective when processing multiple articles:
- 1 article: ~2x speedup
- 10 articles: ~4x speedup
- 50+ articles: ~5-10x speedup
This is because the model stays loaded in GPU memory between requests.
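A toy illustration of the effect, using the real `ollama run` CLI with made-up prompts: only the first call pays the model-load cost, and later calls reuse the weights already resident in VRAM:

```bash
# First iteration loads phi3 into GPU memory; the rest reuse it
for title in "Erster Titel" "Zweiter Titel" "Dritter Titel"; do
  docker exec munich-news-ollama \
    ollama run phi3:latest "Translate to English: $title"
done
```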
### Concurrent Requests

Ollama can handle multiple concurrent requests on the GPU:

```bash
# Edit backend/.env to enable concurrent processing
OLLAMA_CONCURRENT_REQUESTS=3
```

**Note:** Each concurrent request uses additional VRAM.
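To see concurrency in action, you can fire a few requests in parallel with shell background jobs (a sketch; each in-flight request holds its own slice of VRAM):

```bash
# Launch three requests at once, then wait for all of them
for i in 1 2 3; do
  docker exec munich-news-ollama \
    ollama run phi3:latest "Summarize test article $i" &
done
wait
```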
## Monitoring

### Real-time GPU Monitoring

```bash
# Basic monitoring
watch -n 1 'docker exec munich-news-ollama nvidia-smi'

# Detailed monitoring
watch -n 1 'docker exec munich-news-ollama nvidia-smi --query-gpu=timestamp,name,temperature.gpu,utilization.gpu,utilization.memory,memory.used,memory.total --format=csv'
```
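To keep a record instead of watching live, `nvidia-smi` can sample on a loop (`-l 1` means once per second) and append to a CSV file:

```bash
# Log utilization over time; stop with Ctrl-C and inspect gpu-usage.csv
docker exec munich-news-ollama \
  nvidia-smi -l 1 \
  --query-gpu=timestamp,utilization.gpu,memory.used,memory.total \
  --format=csv >> gpu-usage.csv
```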
### Performance Logging

Check the crawler logs for timing information:

```bash
docker-compose logs crawler | grep "Title translated"
# GPU: ✓ Title translated (0.3s)
# CPU: ✓ Title translated (1.5s)
```
## Cost-Benefit Analysis

### When to Use GPU

**Use GPU if:**
- Processing 10+ articles daily
- Need faster newsletter generation
- Have available GPU hardware
- Running multiple AI operations
**Use CPU if:**
- Processing <5 articles daily
- No GPU available
- GPU needed for other tasks
- Cost-sensitive deployment
### Cloud Deployment

GPU instances cost more per hour but process articles much faster:
| Provider | Instance | GPU | Cost/hour | Articles/hour |
|---|---|---|---|---|
| AWS | g4dn.xlarge | T4 | $0.526 | ~1000 |
| GCP | n1-standard-4 + T4 | T4 | $0.35 | ~1000 |
| Azure | NC6 | K80 | $0.90 | ~500 |
For comparison, CPU instances process ~100-200 articles/hour at $0.05-0.10/hour. Per article, the AWS and GCP GPU options therefore cost roughly the same as CPU (on the order of $0.0005 each); the GPU's advantage is throughput and latency rather than raw cost.
## Additional Resources

### Support

If you encounter issues:

1. Run `./check-gpu.sh` to diagnose
2. Check logs: `docker-compose logs ollama`
3. See OLLAMA_SETUP.md for general Ollama troubleshooting
4. Open an issue with:
   - Output of `nvidia-smi`
   - Output of `docker info | grep -i runtime`
   - Relevant logs
## Quick Start Guide

### 30-Second Setup

```bash
# 1. Check GPU
./check-gpu.sh

# 2. Start services
./start-with-gpu.sh

# 3. Test
docker-compose exec crawler python crawler_service.py 2
```
### Command Reference

**Setup:**
```bash
./check-gpu.sh           # Check GPU availability
./configure-ollama.sh    # Configure Ollama
./start-with-gpu.sh      # Start with GPU auto-detection
```

**With GPU (manual):**
```bash
docker-compose -f docker-compose.yml -f docker-compose.gpu.yml up -d
```

**Without GPU:**
```bash
docker-compose up -d
```

**Monitoring:**
```bash
docker exec munich-news-ollama nvidia-smi                 # Check GPU
watch -n 1 'docker exec munich-news-ollama nvidia-smi'    # Monitor GPU
docker-compose logs -f ollama                             # Check logs
```

**Testing:**
```bash
docker-compose exec crawler python crawler_service.py 2   # Test crawl
docker-compose logs crawler | grep "Title translated"     # Check timing
```
### Performance Expectations
| Operation | CPU | GPU | Speedup |
|---|---|---|---|
| Translation | 1.5s | 0.3s | 5x |
| Summary | 8s | 2s | 4x |
| 10 Articles | 115s | 31s | 3.7x |
## Integration Summary

### What Was Implemented

1. **Ollama Service in Docker Compose**
   - Runs on internal network (port 11434)
   - Automatic model download (phi3:latest)
   - Persistent storage in a Docker volume
   - GPU support with automatic detection
2. **GPU Acceleration**
   - NVIDIA GPU support via `docker-compose.gpu.yml`
   - Automatic GPU detection script
   - 5-10x performance improvement
   - Graceful CPU fallback
3. **Helper Scripts**
   - `start-with-gpu.sh` - Auto-detect and start
   - `check-gpu.sh` - Diagnose GPU availability
   - `configure-ollama.sh` - Interactive configuration
   - `test-ollama-setup.sh` - Comprehensive tests
4. **Security**
   - Ollama is internal-only (not exposed to the host)
   - Only accessible via the Docker network
   - Prevents unauthorized access
### Files Created

- `docker-compose.gpu.yml` - GPU configuration override
- `start-with-gpu.sh` - Auto-start script
- `check-gpu.sh` - GPU detection script
- `test-ollama-setup.sh` - Test suite
- `docs/GPU_SETUP.md` - This documentation
- `docs/OLLAMA_SETUP.md` - Ollama setup guide
- `docs/PERFORMANCE_COMPARISON.md` - Benchmarks
### Quick Commands

```bash
# Start with GPU
docker-compose -f docker-compose.yml -f docker-compose.gpu.yml up -d

# Or use the helper script
./start-with-gpu.sh

# Verify GPU usage
docker exec munich-news-ollama nvidia-smi
```