# GPU Setup Guide for Ollama

This guide explains how to enable GPU acceleration for Ollama to achieve 5-10x faster AI inference.

## Quick Start

```bash
# 1. Check if you have a compatible GPU
./check-gpu.sh

# 2. If GPU is available, start with GPU support
./start-with-gpu.sh

# 3. Verify GPU is being used
docker exec munich-news-ollama nvidia-smi
```

## Benefits of GPU Acceleration

| Operation | CPU (4 cores) | GPU (RTX 3060) | Speedup |
|-----------|---------------|----------------|---------|
| Model Load | 20s | 8s | 2.5x |
| Translation | 1.5s | 0.3s | 5x |
| Summarization | 8s | 2s | 4x |
| 10 Articles | 90s | 25s | 3.6x |

**Bottom line:** Processing 10 articles takes ~90 seconds on CPU vs ~25 seconds on GPU.

## Requirements

### Hardware

- NVIDIA GPU with CUDA support (GTX 1060 or newer recommended)
- Minimum 4GB VRAM for phi3:latest
- 8GB+ VRAM for larger models (llama3.2, etc.)

### Software

- NVIDIA drivers (version 525.60.13 or newer)
- Docker 20.10+
- Docker Compose v2.3+
- NVIDIA Container Toolkit

## Installation

### Step 1: Install NVIDIA Drivers

**Ubuntu/Debian:**

```bash
# Check current driver
nvidia-smi

# If not installed, install recommended driver
sudo ubuntu-drivers autoinstall
sudo reboot
```

**Other Linux:**

Visit: https://www.nvidia.com/Download/index.aspx

### Step 2: Install NVIDIA Container Toolkit

**Ubuntu/Debian:**

```bash
# Add repository
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# Configure Docker
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```

**RHEL/CentOS:**

```bash
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.repo | \
  sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
sudo yum install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```

### Step 3: Verify Installation

```bash
# Test GPU access from Docker
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi

# You should see your GPU information
```

## Usage

### Starting Services with GPU

**Option 1: Automatic (Recommended)**

```bash
./start-with-gpu.sh
```

This script automatically detects GPU availability and starts services accordingly.
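If you are curious what the automatic path does, the detection logic looks roughly like the sketch below (an assumption based on the compose files and checks used elsewhere in this guide; the actual contents of `start-with-gpu.sh` in this repository may differ):

```bash
#!/usr/bin/env bash
# Sketch: use the GPU compose overlay only when both the host driver
# and Docker's GPU runtime respond (same checks as in "Verify Installation").
if nvidia-smi > /dev/null 2>&1 && \
   docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi > /dev/null 2>&1; then
    echo "GPU detected - starting with GPU support"
    docker-compose -f docker-compose.yml -f docker-compose.gpu.yml up -d
else
    echo "No usable GPU found - starting in CPU-only mode"
    docker-compose up -d
fi
```

The manual commands it falls back to are shown under Option 2 below.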
**Option 2: Manual**

```bash
# With GPU
docker-compose -f docker-compose.yml -f docker-compose.gpu.yml up -d

# Without GPU (CPU only)
docker-compose up -d
```

### Verifying GPU Usage

```bash
# Check if GPU is detected in container
docker exec munich-news-ollama nvidia-smi

# Monitor GPU usage in real-time
watch -n 1 'docker exec munich-news-ollama nvidia-smi'

# Run a test and watch GPU usage
# Terminal 1:
watch -n 1 'docker exec munich-news-ollama nvidia-smi'
# Terminal 2:
docker-compose exec crawler python crawler_service.py 2
```

You should see:

- GPU memory usage increase during inference
- GPU utilization spike to 80-100%
- Faster processing times in logs

## Troubleshooting

### GPU Not Detected

**Check NVIDIA drivers:**

```bash
nvidia-smi
# Should show GPU information
```

**Check Docker GPU access:**

```bash
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi
# Should show GPU information from inside container
```

**Check Ollama container:**

```bash
docker exec munich-news-ollama nvidia-smi
# Should show GPU information
```

### Out of Memory Errors

**Symptoms:**

- "CUDA out of memory" errors
- Container crashes during inference

**Solutions:**

1. Use a smaller model:

   ```bash
   # Edit backend/.env
   OLLAMA_MODEL=gemma2:2b  # Requires ~1.5GB VRAM
   ```

2. Close other GPU applications:

   ```bash
   # Check what's using GPU
   nvidia-smi
   ```

3. Increase GPU memory (if using Docker Desktop):
   - Docker Desktop → Settings → Resources → Advanced
   - Increase memory allocation

### Slow Performance Despite GPU

**Check GPU utilization:**

```bash
watch -n 1 'docker exec munich-news-ollama nvidia-smi'
```

If GPU utilization is low (<50%):

1. Ensure you're using the GPU compose file
2. Check Ollama logs for errors: `docker-compose logs ollama`
3. Try a different model that better utilizes the GPU
4. Update NVIDIA drivers

### Docker Compose GPU Not Working

**Error:** `could not select device driver "" with capabilities: [[gpu]]`

**Solution:**

```bash
# Reconfigure Docker runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Verify configuration
cat /etc/docker/daemon.json
# Should contain nvidia runtime configuration
```

## Performance Tuning

### Model Selection

Different models have different GPU requirements and performance:

| Model | VRAM | Speed | Quality | Best For |
|-------|------|-------|---------|----------|
| gemma2:2b | 1.5GB | Fastest | Good | High volume, speed critical |
| phi3:latest | 2-4GB | Fast | Very Good | Balanced (default) |
| llama3.2:3b | 4-6GB | Medium | Excellent | Quality critical |
| mistral:latest | 6-8GB | Medium | Excellent | Long-form content |

### Batch Processing

GPU acceleration is most effective when processing multiple articles:

- 1 article: ~2x speedup
- 10 articles: ~4x speedup
- 50+ articles: ~5-10x speedup

This is because the model stays loaded in GPU memory between requests.

### Concurrent Requests

Ollama can handle multiple concurrent requests on GPU:

```bash
# Edit backend/.env to enable concurrent processing
OLLAMA_CONCURRENT_REQUESTS=3
```

Note: Each concurrent request uses additional VRAM.
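To see concurrency in effect, you can fire a few requests at Ollama's HTTP API in parallel and watch memory usage grow in `nvidia-smi`. A minimal sketch, assuming Ollama's default port 11434 is published on the host and the phi3 model is already pulled (adjust the model name to match your `OLLAMA_MODEL`):

```bash
# Send three generate requests to Ollama in parallel and wait for all of them.
for i in 1 2 3; do
  curl -s http://localhost:11434/api/generate \
    -d '{"model": "phi3", "prompt": "Summarize: Munich opens a new S-Bahn station.", "stream": false}' \
    > /tmp/ollama-response-$i.json &
done
wait
echo "Responses written to /tmp/ollama-response-*.json"
```

While this runs, the detailed monitoring command in the next section should show utilization and memory climbing for each in-flight request.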
## Monitoring

### Real-time GPU Monitoring

```bash
# Basic monitoring
watch -n 1 'docker exec munich-news-ollama nvidia-smi'

# Detailed monitoring
watch -n 1 'docker exec munich-news-ollama nvidia-smi --query-gpu=timestamp,name,temperature.gpu,utilization.gpu,utilization.memory,memory.used,memory.total --format=csv'
```

### Performance Logging

Check crawler logs for timing information:

```bash
docker-compose logs crawler | grep "Title translated"
# GPU: ✓ Title translated (0.3s)
# CPU: ✓ Title translated (1.5s)
```

## Cost-Benefit Analysis

### When to Use GPU

**Use GPU if:**

- Processing 10+ articles daily
- Need faster newsletter generation
- Have available GPU hardware
- Running multiple AI operations

**Use CPU if:**

- Processing <5 articles daily
- No GPU available
- GPU needed for other tasks
- Cost-sensitive deployment

### Cloud Deployment

GPU instances cost more but process faster:

| Provider | Instance | GPU | Cost/hour | Articles/hour |
|----------|----------|-----|-----------|---------------|
| AWS | g4dn.xlarge | T4 | $0.526 | ~1000 |
| GCP | n1-standard-4 + T4 | T4 | $0.35 | ~1000 |
| Azure | NC6 | K80 | $0.90 | ~500 |

For comparison, CPU instances process ~100-200 articles/hour at $0.05-0.10/hour.

## Additional Resources

- [NVIDIA Container Toolkit Documentation](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)
- [Ollama GPU Support](https://github.com/ollama/ollama/blob/main/docs/gpu.md)
- [Docker GPU Support](https://docs.docker.com/config/containers/resource_constraints/#gpu)
- [CUDA Compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/)

## Support

If you encounter issues:

1. Run `./check-gpu.sh` to diagnose
2. Check logs: `docker-compose logs ollama`
3. See [OLLAMA_SETUP.md](OLLAMA_SETUP.md) for general Ollama troubleshooting
4. Open an issue with:
   - Output of `nvidia-smi`
   - Output of `docker info | grep -i runtime`
   - Relevant logs
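A quick way to collect the items above in one file before opening an issue (a convenience sketch; the filename and ordering are arbitrary):

```bash
# Bundle the requested diagnostic output into a single file to attach to the issue.
{
  echo "=== nvidia-smi ==="
  nvidia-smi
  echo "=== docker runtime ==="
  docker info | grep -i runtime
  echo "=== ollama logs (last 100 lines) ==="
  docker-compose logs --tail=100 ollama
} > gpu-diagnostics.txt 2>&1
```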