# Ollama Setup Guide

This project includes an integrated Ollama service for AI-powered summarization and translation.

**🚀 Want 5-10x faster performance?** See [GPU_SETUP.md](GPU_SETUP.md) for GPU acceleration setup.

## Docker Compose Setup (Recommended)

The `docker-compose.yml` file includes an Ollama service that automatically does the following (a quick verification is shown after the list):
- Runs the Ollama server on port 11434
- Pulls the phi3:latest model on first startup
- Persists model data in a Docker volume
- Supports GPU acceleration (NVIDIA GPUs)
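
To confirm the service and its model volume are wired up in your checkout, you can run a couple of quick checks (the volume name depends on what `docker-compose.yml` defines, so the `grep` below is only a heuristic):

```bash
# List the services defined in the compose file (expect to see "ollama" and "ollama-setup")
docker-compose config --services

# After the first start, the model data volume should exist
docker volume ls | grep -i ollama
```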

### GPU Support

Ollama can use NVIDIA GPUs for significantly faster inference (5-10x speedup).

**Prerequisites** (quick checks follow the list):
- NVIDIA GPU with CUDA support
- NVIDIA drivers installed
- NVIDIA Container Toolkit installed
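
You can verify each prerequisite from the host before touching Docker Compose:

```bash
# Driver check: should print your GPU and driver version
nvidia-smi

# Container Toolkit check: the nvidia-ctk CLI ships with the toolkit
nvidia-ctk --version

# Docker check: the "nvidia" runtime should be listed
docker info | grep -i runtime
```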

**Installation (Ubuntu/Debian):**
```bash
# Install NVIDIA Container Toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
```

**Start with GPU support:**
```bash
# Automatic detection and startup
./start-with-gpu.sh

# Or manually specify GPU support
docker-compose -f docker-compose.yml -f docker-compose.gpu.yml up -d
```

**Verify GPU is being used:**
```bash
# Check if GPU is detected
docker exec munich-news-ollama nvidia-smi

# Monitor GPU usage during inference
watch -n 1 'docker exec munich-news-ollama nvidia-smi'
```

### Configuration

Update your `backend/.env` file with one of these configurations:

**For Docker Compose (services communicate via the internal network):**
```env
OLLAMA_ENABLED=true
OLLAMA_BASE_URL=http://ollama:11434
OLLAMA_MODEL=phi3:latest
OLLAMA_TIMEOUT=120
```

**For an external Ollama server (running on the host machine):**
```env
OLLAMA_ENABLED=true
OLLAMA_BASE_URL=http://host.docker.internal:11434
OLLAMA_MODEL=phi3:latest
OLLAMA_TIMEOUT=120
```
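
With the Docker Compose configuration, you can confirm that another container in the stack can reach Ollama over the internal network (the `backend` service name here is an assumption; substitute whichever service talks to Ollama, and note that `curl` must be available in that image):

```bash
# From inside another compose service, the Ollama API should answer on the service hostname
# ("backend" is a placeholder for whichever service uses Ollama)
docker-compose exec backend curl -s http://ollama:11434/api/tags
```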

### Starting the Services

```bash
# Option 1: Auto-detect GPU and start (recommended)
./start-with-gpu.sh

# Option 2: Start with GPU support (if you have an NVIDIA GPU)
docker-compose -f docker-compose.yml -f docker-compose.gpu.yml up -d

# Option 3: Start without GPU (CPU only)
docker-compose up -d

# Check Ollama logs
docker-compose logs -f ollama

# Check model setup logs
docker-compose logs ollama-setup

# Verify Ollama is running
curl http://localhost:11434/api/tags
```

### First Time Setup

On first startup, the `ollama-setup` service will automatically pull the phi3:latest model. This may take several minutes depending on your internet connection (the model is ~2.3GB).

You can monitor the progress:
```bash
docker-compose logs -f ollama-setup
```
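
Once the setup container has finished, you can confirm the model is actually present:

```bash
# The pulled model should show up in the list
docker-compose exec ollama ollama list
```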

### Available Models

The default model is `phi3:latest` (2.3GB), which provides a good balance of speed and quality.

To use a different model:
1. Update `OLLAMA_MODEL` in your `.env` file
2. Pull the model manually:
```bash
docker-compose exec ollama ollama pull <model-name>
```

Popular alternatives (a complete switch example follows the list):
- `llama3.2:latest` - Larger, more capable model
- `mistral:latest` - Fast and efficient
- `gemma2:2b` - Smallest, fastest option
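
For example, switching to the smaller `gemma2:2b` might look like this (restart whichever services consume the Ollama API so they pick up the new value; `docker-compose restart` simply restarts everything):

```bash
# Pull the new model into the Ollama container
docker-compose exec ollama ollama pull gemma2:2b

# Then set OLLAMA_MODEL=gemma2:2b in backend/.env and restart the stack
docker-compose restart
```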

### Troubleshooting

**Ollama service not starting:**
```bash
# Check if port 11434 is already in use
lsof -i :11434

# Restart the service
docker-compose restart ollama

# Check logs
docker-compose logs ollama
```

**Model not downloading:**
```bash
# Manually pull the model
docker-compose exec ollama ollama pull phi3:latest

# Check available models
docker-compose exec ollama ollama list
```

**GPU not being detected:**
```bash
# Check if NVIDIA drivers are installed
nvidia-smi

# Check if Docker can access GPU
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi

# Verify GPU is available in Ollama container
docker exec munich-news-ollama nvidia-smi

# Check Ollama logs for GPU initialization
docker-compose logs ollama | grep -i gpu
```

**GPU out of memory:**
- Phi3 requires ~2-4GB VRAM
- Close other GPU applications
- Use a smaller model: `gemma2:2b` (requires ~1.5GB VRAM)
- Or fall back to CPU mode
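
To see how much VRAM is actually free before choosing a model, you can query the GPU inside the container directly:

```bash
# Show used vs. total GPU memory
docker exec munich-news-ollama nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```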

**CPU out of memory errors:**
- Phi3 requires ~4GB RAM
- Consider using a smaller model like `gemma2:2b`
- Or increase Docker's memory limit in Docker Desktop settings
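
A one-shot snapshot of how much memory the Ollama container is currently using can help decide which of these applies:

```bash
# Show current CPU and memory usage for the Ollama container
docker stats --no-stream munich-news-ollama
```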

**Slow performance even with GPU:**
- Ensure GPU drivers are up to date
- Check GPU utilization: `watch -n 1 'docker exec munich-news-ollama nvidia-smi'`
- Verify you're using the GPU compose file: `docker-compose -f docker-compose.yml -f docker-compose.gpu.yml up -d`
- Some models may not fully utilize the GPU; try a different model

## Local Ollama Installation

If you prefer to run Ollama directly on your host machine (the steps are condensed into commands below):

1. Install Ollama: https://ollama.ai/download
2. Pull the model: `ollama pull phi3:latest`
3. Start Ollama: `ollama serve`
4. Update `.env` to use `http://host.docker.internal:11434`
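
On the command line, steps 2-4 boil down to the following (the installer itself depends on your OS; see the download page above):

```bash
# After installing Ollama from https://ollama.ai/download
ollama pull phi3:latest
ollama serve

# In backend/.env, point the app at the host-based server:
# OLLAMA_BASE_URL=http://host.docker.internal:11434
```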

## Testing the Setup

### Basic API Test
```bash
# Test Ollama API directly
curl http://localhost:11434/api/generate -d '{
  "model": "phi3:latest",
  "prompt": "Translate to English: Guten Morgen",
  "stream": false
}'
```
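
With `stream` set to `false`, Ollama returns a single JSON object whose `response` field holds the generated text; if `jq` is installed you can pull out just that part:

```bash
# Print only the generated text from the JSON reply (requires jq)
curl -s http://localhost:11434/api/generate -d '{
  "model": "phi3:latest",
  "prompt": "Translate to English: Guten Morgen",
  "stream": false
}' | jq -r '.response'
```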

### GPU Verification
```bash
# Check if GPU is detected
docker exec munich-news-ollama nvidia-smi

# Monitor GPU usage during a test
# Terminal 1: Monitor GPU
watch -n 1 'docker exec munich-news-ollama nvidia-smi'

# Terminal 2: Run test crawl
docker-compose exec crawler python crawler_service.py 1

# You should see GPU memory usage increase during inference
```

### Full Integration Test
```bash
# Run a test crawl to verify translation works
docker-compose exec crawler python crawler_service.py 1

# Check the logs for translation timing
# GPU: ~0.3-0.5s per translation
# CPU: ~1-2s per translation
docker-compose logs crawler | grep "Title translated"
```

## Performance Notes

### CPU Performance
- First request may be slow as the model loads into memory (~10-30 seconds)
- Subsequent requests are faster (cached in memory)
- Translation: 0.5-2 seconds per title
- Summarization: 5-10 seconds per article
- Recommended: 4+ CPU cores, 8GB+ RAM

### GPU Performance (NVIDIA)
- Model loads faster (~5-10 seconds)
- Translation: 0.1-0.5 seconds per title (5-10x faster)
- Summarization: 1-3 seconds per article (3-5x faster)
- Recommended: 4GB+ VRAM for phi3:latest
- Larger models (llama3.2) require 8GB+ VRAM

### Performance Comparison

| Operation | CPU (4 cores) | GPU (RTX 3060) | Speedup |
|-----------|---------------|----------------|---------|
| Model Load | 20s | 8s | 2.5x |
| Translation | 1.5s | 0.3s | 5x |
| Summarization | 8s | 2s | 4x |
| 10 Articles | 90s | 25s | 3.6x |

**Tip:** GPU acceleration is most beneficial when processing many articles in batch.
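
Your exact numbers will depend on hardware and model; an easy way to measure a single request on your own machine is to time one non-streaming generation:

```bash
# Time one generation request end to end
time curl -s http://localhost:11434/api/generate -d '{
  "model": "phi3:latest",
  "prompt": "Translate to English: Guten Morgen",
  "stream": false
}' > /dev/null
```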