Design Document - AI Article Summarization
Overview
This design integrates Ollama AI into the news crawler workflow to automatically generate concise summaries of articles. The system will extract full article content, send it to Ollama for summarization, and store both the original content and the AI-generated summary in MongoDB.
Architecture
High-Level Flow
RSS Feed → Extract Content (full article text) → Summarize with Ollama (AI summary, ≤150 words) → Store in MongoDB (both stored)
Component Diagram
┌─────────────────────────────────────────────────────────────┐
│ News Crawler Service │
│ │
│ ┌────────────────┐ ┌──────────────────┐ │
│ │ RSS Parser │──────→│ Content Extractor│ │
│ └────────────────┘ └──────────────────┘ │
│ │ │
│ ↓ │
│ ┌──────────────────┐ │
│ │ Ollama Client │ │
│ │ (New Component) │ │
│ └──────────────────┘ │
│ │ │
│ ↓ │
│ ┌──────────────────┐ │
│ │ Database Writer │ │
│ └──────────────────┘ │
└─────────────────────────────────────────────────────────────┘
│
↓
┌──────────────────┐
│ Ollama Server │
│ (External) │
└──────────────────┘
│
↓
┌──────────────────┐
│ MongoDB │
└──────────────────┘
Components and Interfaces
1. Ollama Client Module
File: news_crawler/ollama_client.py
Purpose: Handle communication with Ollama server for summarization
Interface:
class OllamaClient:
    def __init__(self, base_url, model, api_key=None, enabled=True):
        """Initialize Ollama client with configuration"""

    def summarize_article(self, content: str, max_words: int = 150) -> dict:
        """
        Summarize article content using Ollama

        Args:
            content: Full article text
            max_words: Maximum words in summary (default 150)

        Returns:
            {
                'summary': str,        # AI-generated summary
                'word_count': int,     # Summary word count
                'success': bool,       # Whether summarization succeeded
                'error': str or None,  # Error message if failed
                'duration': float      # Time taken in seconds
            }
        """

    def is_available(self) -> bool:
        """Check if Ollama server is reachable"""

    def test_connection(self) -> dict:
        """Test connection and return server info"""
Key Methods:
- summarize_article()
  - Constructs prompt for Ollama
  - Sends HTTP POST request
  - Handles timeouts and errors
  - Validates response
  - Returns structured result
- is_available()
  - Quick health check
  - Returns True/False
  - Used before attempting summarization
- test_connection()
  - Detailed connection test
  - Returns server info and model list
  - Used for diagnostics
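A minimal sketch of the client's core methods, assuming the standard Ollama `/api/generate` and `/api/tags` endpoints and the `requests` library. Error handling is abbreviated here; the real module would add the retry and logging described under Error Handling, and the Bearer-auth header assumes an auth proxy sits in front of Ollama:

```python
import time
import requests


class OllamaClient:
    def __init__(self, base_url, model, api_key=None, enabled=True, timeout=30):
        self.base_url = (base_url or 'http://localhost:11434').rstrip('/')
        self.model = model or 'phi3:latest'
        self.api_key = api_key
        self.enabled = enabled
        self.timeout = timeout

    def _headers(self):
        # Bearer auth only when an API key is set (assumption: auth proxy in front of Ollama)
        return {'Authorization': f'Bearer {self.api_key}'} if self.api_key else {}

    def summarize_article(self, content, max_words=150):
        """Call /api/generate and normalize the result into the dict described above."""
        start = time.time()
        prompt = (
            f"Summarize the following article in {max_words} words or less. "
            f"Focus on the key points and main message:\n\n{content}"
        )
        try:
            resp = requests.post(
                f"{self.base_url}/api/generate",
                json={'model': self.model, 'prompt': prompt, 'stream': False},
                headers=self._headers(),
                timeout=self.timeout,
            )
            resp.raise_for_status()
            summary = resp.json().get('response', '').strip()
            if not summary:
                raise ValueError('empty summary returned')
            return {'summary': summary, 'word_count': len(summary.split()),
                    'success': True, 'error': None, 'duration': time.time() - start}
        except Exception as exc:  # timeout, connection error, bad JSON, empty summary
            return {'summary': None, 'word_count': 0, 'success': False,
                    'error': str(exc), 'duration': time.time() - start}

    def is_available(self):
        # Quick health check against the lightweight /api/tags endpoint
        try:
            return requests.get(f"{self.base_url}/api/tags", timeout=5).ok
        except requests.RequestException:
            return False
```

Failures never raise out of `summarize_article()`; they come back as `success=False` so the crawler can fall back to storing the article without a summary.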
2. Enhanced Crawler Service
File: news_crawler/crawler_service.py
Changes:
# Add Ollama client initialization
from ollama_client import OllamaClient

# Initialize at module level
ollama_client = OllamaClient(
    base_url=os.getenv('OLLAMA_BASE_URL'),
    model=os.getenv('OLLAMA_MODEL'),
    api_key=os.getenv('OLLAMA_API_KEY'),
    enabled=os.getenv('OLLAMA_ENABLED', 'false').lower() == 'true'
)
# Modify crawl_rss_feed() to include summarization
def crawl_rss_feed(feed_url, feed_name, max_articles=10):
    # ... existing code ...

    # After extracting content
    article_data = extract_article_content(article_url)

    # NEW: Summarize with Ollama
    summary_result = None
    if ollama_client.enabled and article_data.get('content'):
        print("  🤖 Summarizing with AI...")
        summary_result = ollama_client.summarize_article(
            article_data['content'],
            max_words=150
        )
        if summary_result['success']:
            print(f"  ✓ Summary generated ({summary_result['word_count']} words)")
        else:
            print(f"  ⚠ Summarization failed: {summary_result['error']}")

    # Build article document with summary
    article_doc = {
        'title': article_data.get('title'),
        'author': article_data.get('author'),
        'link': article_url,
        'content': article_data.get('content'),
        'summary': summary_result['summary'] if summary_result and summary_result['success'] else None,
        'word_count': article_data.get('word_count'),
        'summary_word_count': summary_result['word_count'] if summary_result and summary_result['success'] else None,
        'source': feed_name,
        'published_at': extract_published_date(entry),
        'crawled_at': article_data.get('crawled_at'),
        'summarized_at': datetime.utcnow() if summary_result and summary_result['success'] else None,
        'created_at': datetime.utcnow()
    }
3. Configuration Module
File: news_crawler/config.py (new file)
Purpose: Centralize configuration management
import os
from dotenv import load_dotenv

load_dotenv(dotenv_path='../.env')


class Config:
    # MongoDB
    MONGODB_URI = os.getenv('MONGODB_URI', 'mongodb://localhost:27017/')
    DB_NAME = 'munich_news'

    # Ollama
    OLLAMA_BASE_URL = os.getenv('OLLAMA_BASE_URL', 'http://localhost:11434')
    OLLAMA_MODEL = os.getenv('OLLAMA_MODEL', 'phi3:latest')
    OLLAMA_API_KEY = os.getenv('OLLAMA_API_KEY', '')
    OLLAMA_ENABLED = os.getenv('OLLAMA_ENABLED', 'false').lower() == 'true'
    OLLAMA_TIMEOUT = int(os.getenv('OLLAMA_TIMEOUT', '30'))

    # Crawler
    RATE_LIMIT_DELAY = 1        # seconds between requests
    MAX_CONTENT_LENGTH = 50000  # characters
Data Models
Updated Article Schema
{
  _id: ObjectId,
  title: String,
  author: String,
  link: String,                // Unique index
  content: String,             // Full article content
  summary: String,             // AI-generated summary (≤150 words)
  word_count: Number,          // Original content word count
  summary_word_count: Number,  // Summary word count
  source: String,
  published_at: String,
  crawled_at: DateTime,
  summarized_at: DateTime,     // When the AI summary was generated
  created_at: DateTime
}
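Because `link` carries a unique index, writes should be upserts keyed on it so re-crawled articles refresh rather than duplicate. The helper below is a sketch that only assumes a pymongo-style collection object with `update_one()`; the name `upsert_article` is illustrative:

```python
def upsert_article(collection, article_doc):
    """Insert or refresh an article keyed by its unique link.

    `collection` is any object exposing a pymongo-style update_one().
    """
    return collection.update_one(
        {'link': article_doc['link']},  # match on the unique link index
        {'$set': article_doc},          # overwrite fields with the fresh crawl
        upsert=True,                    # insert if the link has not been seen
    )
```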
Ollama Request Format
{
  "model": "phi3:latest",
  "prompt": "Summarize the following article in 150 words or less. Focus on the key points and main message:\n\n[ARTICLE CONTENT]",
  "stream": false,
  "options": {
    "temperature": 0.7,
    "num_predict": 200
  }
}
Ollama Response Format
{
  "model": "phi3:latest",
  "created_at": "2024-11-10T16:30:00Z",
  "response": "The AI-generated summary text here...",
  "done": true,
  "total_duration": 5000000000
}
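One detail worth calling out: `total_duration` is reported in nanoseconds, so the sample value above is 5 seconds. A small parser sketch (field names taken from the response format above; the helper name is illustrative):

```python
def parse_ollama_response(payload):
    """Extract summary text and timing from a non-streaming Ollama response dict."""
    summary = (payload.get('response') or '').strip()
    return {
        'summary': summary,
        'word_count': len(summary.split()),
        # total_duration is in nanoseconds; convert to seconds
        'duration_s': payload.get('total_duration', 0) / 1e9,
        'done': payload.get('done', False),
    }
```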
Error Handling
Error Scenarios and Responses
| Scenario | Handling | User Impact |
|---|---|---|
| Ollama server down | Log warning, store original content | Article saved without summary |
| Ollama timeout (>30s) | Cancel request, store original | Article saved without summary |
| Empty summary returned | Log error, store original | Article saved without summary |
| Invalid response format | Log error, store original | Article saved without summary |
| Network error | Retry once, then store original | Article saved without summary |
| Model not found | Log error, disable Ollama | All articles saved without summaries |
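The retry-once-then-fallback behaviour in the table can be expressed as a thin wrapper; a sketch, assuming the `summarize_article()` result dict defined earlier (the wrapper name is illustrative):

```python
def summarize_with_fallback(client, content, retries=1):
    """Retry a failed summarization once, then fall back to None.

    Returning None signals the crawler to store the article without a
    summary, so no article is ever lost to a summarization failure.
    """
    for _attempt in range(retries + 1):
        result = client.summarize_article(content)
        if result['success']:
            return result
    return None
```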
Error Logging Format
{
    'timestamp': datetime.utcnow(),
    'article_url': article_url,
    'error_type': 'timeout|connection|invalid_response|empty_summary',
    'error_message': str(error),
    'ollama_config': {
        'base_url': OLLAMA_BASE_URL,
        'model': OLLAMA_MODEL,
        'enabled': OLLAMA_ENABLED
    }
}
Testing Strategy
Unit Tests
- test_ollama_client.py
  - Test summarization with mock responses
  - Test timeout handling
  - Test error scenarios
  - Test connection checking
- test_crawler_with_ollama.py
  - Test crawler with Ollama enabled
  - Test crawler with Ollama disabled
  - Test fallback when Ollama fails
  - Test rate limiting
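The mock-response tests might look like the sketch below, which injects the HTTP call so no Ollama server is needed. The `summarize` function is a simplified stand-in for `OllamaClient.summarize_article`, and the built-in `TimeoutError` stands in for the HTTP library's timeout exception:

```python
from unittest import mock


def summarize(post, content):
    """Tiny stand-in for OllamaClient.summarize_article; `post` is injected."""
    try:
        resp = post('http://localhost:11434/api/generate',
                    json={'prompt': content}, timeout=30)
        return {'success': True, 'summary': resp.json()['response']}
    except TimeoutError:
        return {'success': False, 'error': 'timeout'}


def test_timeout_returns_failure():
    # A timeout should surface as success=False, never as an exception
    post = mock.Mock(side_effect=TimeoutError())
    assert summarize(post, 'article text') == {'success': False, 'error': 'timeout'}


def test_success_parses_response():
    # A well-formed response yields the summary text from the 'response' field
    resp = mock.Mock()
    resp.json.return_value = {'response': 'Short summary.'}
    post = mock.Mock(return_value=resp)
    assert summarize(post, 'article text')['summary'] == 'Short summary.'
```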
Integration Tests
- test_end_to_end.py
  - Crawl real RSS feed
  - Summarize with real Ollama
  - Verify database storage
  - Check all fields populated
Manual Testing
- Test with Ollama enabled and working
- Test with Ollama disabled
- Test with Ollama unreachable
- Test with slow Ollama responses
- Test with various article lengths
Performance Considerations
Timing Estimates
- Article extraction: 2-5 seconds
- Ollama summarization: 5-15 seconds (depends on article length and model)
- Database write: <1 second
- Total per article: 8-21 seconds
Optimization Strategies
- Sequential Processing
  - Process one article at a time
  - Prevents overwhelming Ollama
  - Easier to debug
- Timeout Management
  - 30-second timeout per request
  - Prevents hanging on slow responses
- Rate Limiting
  - 1-second delay between articles
  - Respects server resources
- Future: Batch Processing
  - Queue articles for summarization
  - Process in batches
  - Use Celery for async processing
Resource Usage
- Memory: ~100MB per crawler instance
- Network: ~1-5KB per article (to Ollama)
- Storage: +150 words per article (~1KB)
- CPU: Minimal (Ollama does the heavy lifting)
Security Considerations
- API Key Storage
  - Store in environment variables
  - Never commit to git
  - Use secrets management in production
- Content Sanitization
  - Don't log full article content
  - Sanitize URLs in logs
  - Limit error message detail
- Network Security
  - Support HTTPS for Ollama
  - Validate SSL certificates
  - Use secure connections
- Rate Limiting
  - Prevent abuse of the Ollama server
  - Implement backoff on errors
  - Monitor usage patterns
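The "backoff on errors" point can be sketched as an exponential delay schedule; the base, factor, and cap values here are illustrative, not fixed by this design:

```python
def backoff_delays(base=1.0, factor=2.0, max_delay=30.0, attempts=5):
    """Yield an exponential backoff schedule, capped at max_delay seconds.

    A caller would time.sleep() on each yielded value between retries.
    """
    delay = base
    for _ in range(attempts):
        yield min(delay, max_delay)
        delay *= factor
```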
Deployment Considerations
Environment Variables
# Required
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=phi3:latest
OLLAMA_ENABLED=true
# Optional
OLLAMA_API_KEY=your-api-key
OLLAMA_TIMEOUT=30
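A fail-fast check for these variables at startup might look like this sketch (the function name is illustrative; it only assumes the variable names listed above):

```python
import os


def validate_ollama_env():
    """Fail fast if Ollama is enabled but required settings are missing.

    Returns the enabled flag so callers can branch on it.
    """
    enabled = os.getenv('OLLAMA_ENABLED', 'false').lower() == 'true'
    required = ('OLLAMA_BASE_URL', 'OLLAMA_MODEL')
    missing = [key for key in required if enabled and not os.getenv(key)]
    if missing:
        raise RuntimeError(f"Ollama enabled but missing: {', '.join(missing)}")
    return enabled
```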
Docker Deployment
# docker-compose.yml
services:
  crawler:
    build: ./news_crawler
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - OLLAMA_ENABLED=true
    depends_on:
      - ollama
      - mongodb

  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama

volumes:
  ollama_data:
Monitoring
- Metrics to Track
  - Summarization success rate
  - Average summarization time
  - Ollama server uptime
  - Error frequency by type
- Logging
  - Log all summarization attempts
  - Log errors with context
  - Log performance metrics
- Alerts
  - Alert if Ollama is down >5 minutes
  - Alert if success rate <80%
  - Alert if average time >20 seconds
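The success-rate and timing metrics above can be tracked with a small in-process counter; a sketch using the alert thresholds listed (the class name and the minimum-sample guard are illustrative):

```python
class SummarizationMetrics:
    """Track summarization success rate, average duration, and errors by type."""

    def __init__(self):
        self.attempts = 0
        self.successes = 0
        self.total_seconds = 0.0
        self.errors = {}  # error_type -> count

    def record(self, success, duration, error_type=None):
        self.attempts += 1
        self.total_seconds += duration
        if success:
            self.successes += 1
        elif error_type:
            self.errors[error_type] = self.errors.get(error_type, 0) + 1

    @property
    def success_rate(self):
        return self.successes / self.attempts if self.attempts else 0.0

    @property
    def avg_duration(self):
        return self.total_seconds / self.attempts if self.attempts else 0.0

    def should_alert(self):
        # Thresholds from the Alerts list: success rate <80% or average time >20 s.
        # Require a minimum sample so one early failure does not page anyone.
        return self.attempts >= 10 and (
            self.success_rate < 0.8 or self.avg_duration > 20.0
        )
```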
Migration Plan
Phase 1: Add Ollama Client (Week 1)
- Create ollama_client.py
- Add configuration
- Write unit tests
- Test with sample articles
Phase 2: Integrate with Crawler (Week 1)
- Modify crawler_service.py
- Add summarization step
- Update database schema
- Test end-to-end
Phase 3: Update Backend API (Week 2)
- Update news routes
- Add summary fields to responses
- Update frontend to display summaries
- Deploy to production
Phase 4: Monitor and Optimize (Ongoing)
- Monitor performance
- Tune prompts for better summaries
- Optimize rate limiting
- Add batch processing if needed
Rollback Plan
If issues arise:
- Immediate: Set OLLAMA_ENABLED=false
- Short-term: Revert crawler code changes
- Long-term: Remove Ollama integration
System will continue to work with original content if Ollama is disabled.
Success Metrics
- ✅ 95%+ of articles successfully summarized
- ✅ Average summarization time <15 seconds
- ✅ Zero data loss (all articles stored even if summarization fails)
- ✅ Ollama uptime >99%
- ✅ Summary quality: readable and accurate (manual review)
Future Enhancements
- Multi-language Support
  - Detect article language
  - Use appropriate model
  - Translate summaries
- Custom Summary Lengths
  - Allow configuration per feed
  - Support different lengths for different use cases
- Sentiment Analysis
  - Add sentiment score
  - Categorize as positive/negative/neutral
- Keyword Extraction
  - Extract key topics
  - Enable better search
- Batch Processing
  - Queue articles
  - Process in parallel
  - Use Celery for async
- Caching
  - Cache summaries
  - Avoid re-processing
  - Use Redis for cache
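The caching idea can be sketched with an in-memory stand-in keyed by a hash of the article content; swapping the dict for a Redis client (`get`/`set` on the same keys) would make the cache shared across crawler instances. The class and key prefix are illustrative:

```python
import hashlib


class SummaryCache:
    """In-memory stand-in for the proposed Redis summary cache.

    Summaries are keyed by a SHA-256 of the article content, so an
    unchanged article is never sent to Ollama twice.
    """

    def __init__(self):
        self._store = {}

    @staticmethod
    def key(content):
        return 'summary:' + hashlib.sha256(content.encode('utf-8')).hexdigest()

    def get(self, content):
        return self._store.get(self.key(content))

    def set(self, content, summary):
        self._store[self.key(content)] = summary
```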