# GPU Setup Guide for Ollama

This guide explains how to enable GPU acceleration for Ollama to achieve 5-10x faster AI inference.

## Quick Start

```bash
# 1. Check if you have a compatible GPU
./check-gpu.sh

# 2. If GPU is available, start with GPU support
./start-with-gpu.sh

# 3. Verify GPU is being used
docker exec munich-news-ollama nvidia-smi
```

## Benefits of GPU Acceleration

| Operation | CPU (4 cores) | GPU (RTX 3060) | Speedup |
|-----------|---------------|----------------|---------|
| Model Load | 20s | 8s | 2.5x |
| Translation | 1.5s | 0.3s | 5x |
| Summarization | 8s | 2s | 4x |
| 10 Articles | 90s | 25s | 3.6x |

**Bottom line:** Processing 10 articles takes ~90 seconds on CPU vs ~25 seconds on GPU.

## Requirements

### Hardware

- NVIDIA GPU with CUDA support (GTX 1060 or newer recommended)
- Minimum 4GB VRAM for phi3:latest
- 8GB+ VRAM for larger models (llama3.2, etc.)

### Software

- NVIDIA drivers (version 525.60.13 or newer)
- Docker 20.10+
- Docker Compose v2.3+
- NVIDIA Container Toolkit

## Installation

### Step 1: Install NVIDIA Drivers

**Ubuntu/Debian:**

```bash
# Check current driver
nvidia-smi

# If not installed, install recommended driver
sudo ubuntu-drivers autoinstall
sudo reboot
```

**Other Linux:**

Visit: https://www.nvidia.com/Download/index.aspx

### Step 2: Install NVIDIA Container Toolkit

**Ubuntu/Debian:**

```bash
# Add repository
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# Configure Docker
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```

**RHEL/CentOS:**

```bash
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.repo | \
  sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
sudo yum install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```

### Step 3: Verify Installation

```bash
# Test GPU access from Docker
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi

# You should see your GPU information
```

## Usage

### Starting Services with GPU

**Option 1: Automatic (Recommended)**

```bash
./start-with-gpu.sh
```

This script automatically detects GPU availability and starts services accordingly.
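If you are curious what the automatic path does, the detection logic looks roughly like the sketch below (an assumption based on the compose files and checks used elsewhere in this guide; the actual contents of `start-with-gpu.sh` in this repository may differ):

```bash
#!/usr/bin/env bash
# Sketch: use the GPU compose overlay only when both the host driver
# and Docker's GPU runtime respond (same checks as in "Verify Installation").
if nvidia-smi > /dev/null 2>&1 && \
   docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi > /dev/null 2>&1; then
    echo "GPU detected - starting with GPU support"
    docker-compose -f docker-compose.yml -f docker-compose.gpu.yml up -d
else
    echo "No usable GPU found - starting in CPU-only mode"
    docker-compose up -d
fi
```

The manual commands it falls back to are shown under Option 2 below.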
**Option 2: Manual**

```bash
# With GPU
docker-compose -f docker-compose.yml -f docker-compose.gpu.yml up -d

# Without GPU (CPU only)
docker-compose up -d
```

### Verifying GPU Usage

```bash
# Check if GPU is detected in container
docker exec munich-news-ollama nvidia-smi

# Monitor GPU usage in real-time
watch -n 1 'docker exec munich-news-ollama nvidia-smi'

# Run a test and watch GPU usage
# Terminal 1:
watch -n 1 'docker exec munich-news-ollama nvidia-smi'
# Terminal 2:
docker-compose exec crawler python crawler_service.py 2
```

You should see:

- GPU memory usage increase during inference
- GPU utilization spike to 80-100%
- Faster processing times in logs

## Troubleshooting

### GPU Not Detected

**Check NVIDIA drivers:**

```bash
nvidia-smi
# Should show GPU information
```

**Check Docker GPU access:**

```bash
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi
# Should show GPU information from inside container
```

**Check Ollama container:**

```bash
docker exec munich-news-ollama nvidia-smi
# Should show GPU information
```

### Out of Memory Errors

**Symptoms:**

- "CUDA out of memory" errors
- Container crashes during inference

**Solutions:**

1. Use a smaller model:

   ```bash
   # Edit backend/.env
   OLLAMA_MODEL=gemma2:2b  # Requires ~1.5GB VRAM
   ```

2. Close other GPU applications:

   ```bash
   # Check what's using GPU
   nvidia-smi
   ```

3. Increase GPU memory (if using Docker Desktop):
   - Docker Desktop → Settings → Resources → Advanced
   - Increase memory allocation

### Slow Performance Despite GPU

**Check GPU utilization:**

```bash
watch -n 1 'docker exec munich-news-ollama nvidia-smi'
```

If GPU utilization is low (<50%):

1. Ensure you're using the GPU compose file
2. Check Ollama logs for errors: `docker-compose logs ollama`
3. Try a different model that better utilizes the GPU
4. Update NVIDIA drivers

### Docker Compose GPU Not Working

**Error:** `could not select device driver "" with capabilities: [[gpu]]`

**Solution:**

```bash
# Reconfigure Docker runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Verify configuration
cat /etc/docker/daemon.json
# Should contain nvidia runtime configuration
```

## Performance Tuning

### Model Selection

Different models have different GPU requirements and performance:

| Model | VRAM | Speed | Quality | Best For |
|-------|------|-------|---------|----------|
| gemma2:2b | 1.5GB | Fastest | Good | High volume, speed critical |
| phi3:latest | 2-4GB | Fast | Very Good | Balanced (default) |
| llama3.2:3b | 4-6GB | Medium | Excellent | Quality critical |
| mistral:latest | 6-8GB | Medium | Excellent | Long-form content |

### Batch Processing

GPU acceleration is most effective when processing multiple articles:

- 1 article: ~2x speedup
- 10 articles: ~4x speedup
- 50+ articles: ~5-10x speedup

This is because the model stays loaded in GPU memory between requests.

### Concurrent Requests

Ollama can handle multiple concurrent requests on GPU:

```bash
# Edit backend/.env to enable concurrent processing
OLLAMA_CONCURRENT_REQUESTS=3
```

Note: Each concurrent request uses additional VRAM.
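To see concurrency in effect, you can fire a few requests at Ollama's HTTP API in parallel and watch memory usage grow in `nvidia-smi`. A minimal sketch, assuming Ollama's default port 11434 is published on the host and the phi3 model is already pulled (adjust the model name to match your `OLLAMA_MODEL`):

```bash
# Send three generate requests to Ollama in parallel and wait for all of them.
for i in 1 2 3; do
  curl -s http://localhost:11434/api/generate \
    -d '{"model": "phi3", "prompt": "Summarize: Munich opens a new S-Bahn station.", "stream": false}' \
    > /tmp/ollama-response-$i.json &
done
wait
echo "Responses written to /tmp/ollama-response-*.json"
```

While this runs, the detailed monitoring command in the next section should show utilization and memory climbing for each in-flight request.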
## Monitoring

### Real-time GPU Monitoring

```bash
# Basic monitoring
watch -n 1 'docker exec munich-news-ollama nvidia-smi'

# Detailed monitoring
watch -n 1 'docker exec munich-news-ollama nvidia-smi --query-gpu=timestamp,name,temperature.gpu,utilization.gpu,utilization.memory,memory.used,memory.total --format=csv'
```

### Performance Logging

Check crawler logs for timing information:

```bash
docker-compose logs crawler | grep "Title translated"
# GPU: ✓ Title translated (0.3s)
# CPU: ✓ Title translated (1.5s)
```

## Cost-Benefit Analysis

### When to Use GPU

**Use GPU if:**

- Processing 10+ articles daily
- Need faster newsletter generation
- Have available GPU hardware
- Running multiple AI operations

**Use CPU if:**

- Processing <5 articles daily
- No GPU available
- GPU needed for other tasks
- Cost-sensitive deployment

### Cloud Deployment

GPU instances cost more but process faster:

| Provider | Instance | GPU | Cost/hour | Articles/hour |
|----------|----------|-----|-----------|---------------|
| AWS | g4dn.xlarge | T4 | $0.526 | ~1000 |
| GCP | n1-standard-4 + T4 | T4 | $0.35 | ~1000 |
| Azure | NC6 | K80 | $0.90 | ~500 |

For comparison, CPU instances process ~100-200 articles/hour at $0.05-0.10/hour.

## Additional Resources

- [NVIDIA Container Toolkit Documentation](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)
- [Ollama GPU Support](https://github.com/ollama/ollama/blob/main/docs/gpu.md)
- [Docker GPU Support](https://docs.docker.com/config/containers/resource_constraints/#gpu)
- [CUDA Compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/)

## Support

If you encounter issues:

1. Run `./check-gpu.sh` to diagnose
2. Check logs: `docker-compose logs ollama`
3. See [OLLAMA_SETUP.md](OLLAMA_SETUP.md) for general Ollama troubleshooting
4. Open an issue with:
   - Output of `nvidia-smi`
   - Output of `docker info | grep -i runtime`
   - Relevant logs
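A quick way to collect the items above in one file before opening an issue (a convenience sketch; the filename and ordering are arbitrary):

```bash
# Bundle the requested diagnostic output into a single file to attach to the issue.
{
  echo "=== nvidia-smi ==="
  nvidia-smi
  echo "=== docker runtime ==="
  docker info | grep -i runtime
  echo "=== ollama logs (last 100 lines) ==="
  docker-compose logs --tail=100 ollama
} > gpu-diagnostics.txt 2>&1
```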