# Design Document - AI Article Summarization

## Overview

This design integrates Ollama AI into the news crawler workflow to automatically generate concise summaries of articles. The system extracts full article content, sends it to Ollama for summarization, and stores both the original content and the AI-generated summary in MongoDB.

## Architecture

### High-Level Flow

```
RSS Feed → Extract Content → Summarize with Ollama → Store in MongoDB
                 ↓                      ↓                    ↓
         Full Article Text    AI Summary (≤150 words)   Both Stored
```

### Component Diagram

```
┌─────────────────────────────────────────────────────────────┐
│                     News Crawler Service                    │
│                                                             │
│   ┌────────────────┐         ┌──────────────────┐           │
│   │   RSS Parser   │────────→│ Content Extractor│           │
│   └────────────────┘         └──────────────────┘           │
│                                       │                     │
│                                       ↓                     │
│                              ┌──────────────────┐           │
│                              │  Ollama Client   │           │
│                              │ (New Component)  │           │
│                              └──────────────────┘           │
│                                       │                     │
│                                       ↓                     │
│                              ┌──────────────────┐           │
│                              │ Database Writer  │           │
│                              └──────────────────┘           │
└─────────────────────────────────────────────────────────────┘
                                        │
                                        ↓
                              ┌──────────────────┐
                              │  Ollama Server   │
                              │    (External)    │
                              └──────────────────┘
                                        │
                                        ↓
                              ┌──────────────────┐
                              │     MongoDB      │
                              └──────────────────┘
```

## Components and Interfaces
### 1. Ollama Client Module

**File:** `news_crawler/ollama_client.py`

**Purpose:** Handle communication with the Ollama server for summarization

**Interface:**

```python
class OllamaClient:
    def __init__(self, base_url, model, api_key=None, enabled=True):
        """Initialize Ollama client with configuration"""

    def summarize_article(self, content: str, max_words: int = 150) -> dict:
        """
        Summarize article content using Ollama

        Args:
            content: Full article text
            max_words: Maximum words in summary (default 150)

        Returns:
            {
                'summary': str,        # AI-generated summary
                'word_count': int,     # Summary word count
                'success': bool,       # Whether summarization succeeded
                'error': str or None,  # Error message if failed
                'duration': float      # Time taken in seconds
            }
        """

    def is_available(self) -> bool:
        """Check if Ollama server is reachable"""

    def test_connection(self) -> dict:
        """Test connection and return server info"""
```

**Key Methods:**

1. **summarize_article()**
   - Constructs prompt for Ollama
   - Sends HTTP POST request
   - Handles timeouts and errors
   - Validates response
   - Returns structured result
2. **is_available()**
   - Quick health check
   - Returns True/False
   - Used before attempting summarization
3. **test_connection()**
   - Detailed connection test
   - Returns server info and model list
   - Used for diagnostics

### 2. Enhanced Crawler Service

**File:** `news_crawler/crawler_service.py`

**Changes:**
```python
# Add Ollama client initialization
from ollama_client import OllamaClient

# Initialize at module level
ollama_client = OllamaClient(
    base_url=os.getenv('OLLAMA_BASE_URL'),
    model=os.getenv('OLLAMA_MODEL'),
    api_key=os.getenv('OLLAMA_API_KEY'),
    enabled=os.getenv('OLLAMA_ENABLED', 'false').lower() == 'true'
)

# Modify crawl_rss_feed() to include summarization
def crawl_rss_feed(feed_url, feed_name, max_articles=10):
    # ... existing code ...

    # After extracting content
    article_data = extract_article_content(article_url)

    # NEW: Summarize with Ollama
    summary_result = None
    if ollama_client.enabled and article_data.get('content'):
        print(f" 🤖 Summarizing with AI...")
        summary_result = ollama_client.summarize_article(
            article_data['content'],
            max_words=150
        )
        if summary_result['success']:
            print(f" ✓ Summary generated ({summary_result['word_count']} words)")
        else:
            print(f" ⚠ Summarization failed: {summary_result['error']}")

    # Build article document with summary
    article_doc = {
        'title': article_data.get('title'),
        'author': article_data.get('author'),
        'link': article_url,
        'content': article_data.get('content'),
        'summary': summary_result['summary'] if summary_result and summary_result['success'] else None,
        'word_count': article_data.get('word_count'),
        'summary_word_count': summary_result['word_count'] if summary_result and summary_result['success'] else None,
        'source': feed_name,
        'published_at': extract_published_date(entry),
        'crawled_at': article_data.get('crawled_at'),
        'summarized_at': datetime.utcnow() if summary_result and summary_result['success'] else None,
        'created_at': datetime.utcnow()
    }
```
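The interface in section 1 might be implemented roughly as follows. This is a minimal sketch, assuming Ollama's `/api/generate` and `/api/tags` endpoints; the generation options are illustrative, and `test_connection()` is omitted for brevity:

```python
import time

import requests


class OllamaClient:
    def __init__(self, base_url, model, api_key=None, enabled=True, timeout=30):
        self.base_url = base_url.rstrip('/')
        self.model = model
        self.api_key = api_key
        self.enabled = enabled
        self.timeout = timeout

    def _headers(self):
        # The API key is optional; only send it when configured
        headers = {'Content-Type': 'application/json'}
        if self.api_key:
            headers['Authorization'] = f'Bearer {self.api_key}'
        return headers

    def summarize_article(self, content: str, max_words: int = 150) -> dict:
        prompt = (
            f"Summarize the following article in {max_words} words or less. "
            f"Focus on the key points and main message:\n\n{content}"
        )
        payload = {
            'model': self.model,
            'prompt': prompt,
            'stream': False,
            'options': {'temperature': 0.7, 'num_predict': 200},
        }
        start = time.monotonic()
        try:
            resp = requests.post(
                f'{self.base_url}/api/generate',
                json=payload, headers=self._headers(), timeout=self.timeout,
            )
            resp.raise_for_status()
            summary = resp.json().get('response', '').strip()
            if not summary:
                raise ValueError('empty summary returned')
            return {
                'summary': summary,
                'word_count': len(summary.split()),
                'success': True,
                'error': None,
                'duration': time.monotonic() - start,
            }
        except (requests.RequestException, ValueError) as exc:
            # Any failure falls back to "no summary"; the article is still stored
            return {
                'summary': None, 'word_count': 0, 'success': False,
                'error': str(exc), 'duration': time.monotonic() - start,
            }

    def is_available(self) -> bool:
        # Quick health check against the model-list endpoint
        try:
            return requests.get(f'{self.base_url}/api/tags',
                                headers=self._headers(), timeout=5).ok
        except requests.RequestException:
            return False
```

Note that every error path returns the same structured dict, so the crawler can treat "summarization failed" uniformly and still store the article.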
### 3. Configuration Module

**File:** `news_crawler/config.py` (new file)

**Purpose:** Centralize configuration management

```python
import os
from dotenv import load_dotenv

load_dotenv(dotenv_path='../.env')

class Config:
    # MongoDB
    MONGODB_URI = os.getenv('MONGODB_URI', 'mongodb://localhost:27017/')
    DB_NAME = 'munich_news'

    # Ollama
    OLLAMA_BASE_URL = os.getenv('OLLAMA_BASE_URL', 'http://localhost:11434')
    OLLAMA_MODEL = os.getenv('OLLAMA_MODEL', 'phi3:latest')
    OLLAMA_API_KEY = os.getenv('OLLAMA_API_KEY', '')
    OLLAMA_ENABLED = os.getenv('OLLAMA_ENABLED', 'false').lower() == 'true'
    OLLAMA_TIMEOUT = int(os.getenv('OLLAMA_TIMEOUT', '30'))

    # Crawler
    RATE_LIMIT_DELAY = 1        # seconds between requests
    MAX_CONTENT_LENGTH = 50000  # characters
```

## Data Models

### Updated Article Schema

```javascript
{
  _id: ObjectId,
  title: String,
  author: String,
  link: String,               // Unique index
  content: String,            // Full article content
  summary: String,            // AI-generated summary (≤150 words)
  word_count: Number,         // Original content word count
  summary_word_count: Number, // Summary word count
  source: String,
  published_at: String,
  crawled_at: DateTime,
  summarized_at: DateTime,    // When AI summary was generated
  created_at: DateTime
}
```
### Ollama Request Format

```json
{
  "model": "phi3:latest",
  "prompt": "Summarize the following article in 150 words or less. Focus on the key points and main message:\n\n[ARTICLE CONTENT]",
  "stream": false,
  "options": {
    "temperature": 0.7,
    "num_predict": 200
  }
}
```

### Ollama Response Format

```json
{
  "model": "phi3:latest",
  "created_at": "2024-11-10T16:30:00Z",
  "response": "The AI-generated summary text here...",
  "done": true,
  "total_duration": 5000000000
}
```

Note that Ollama reports `total_duration` in nanoseconds.

## Error Handling

### Error Scenarios and Responses

| Scenario | Handling | User Impact |
|----------|----------|-------------|
| Ollama server down | Log warning, store original content | Article saved without summary |
| Ollama timeout (>30s) | Cancel request, store original | Article saved without summary |
| Empty summary returned | Log error, store original | Article saved without summary |
| Invalid response format | Log error, store original | Article saved without summary |
| Network error | Retry once, then store original | Article saved without summary |
| Model not found | Log error, disable Ollama | All articles saved without summaries |

### Error Logging Format

```python
{
    'timestamp': datetime.utcnow(),
    'article_url': article_url,
    'error_type': 'timeout|connection|invalid_response|empty_summary',
    'error_message': str(error),
    'ollama_config': {
        'base_url': OLLAMA_BASE_URL,
        'model': OLLAMA_MODEL,
        'enabled': OLLAMA_ENABLED
    }
}
```

## Testing Strategy

### Unit Tests

1. **test_ollama_client.py**
   - Test summarization with mock responses
   - Test timeout handling
   - Test error scenarios
   - Test connection checking
2. **test_crawler_with_ollama.py**
   - Test crawler with Ollama enabled
   - Test crawler with Ollama disabled
   - Test fallback when Ollama fails
   - Test rate limiting

### Integration Tests

1. **test_end_to_end.py**
   - Crawl real RSS feed
   - Summarize with real Ollama
   - Verify database storage
   - Check all fields populated

### Manual Testing

1. Test with Ollama enabled and working
2. Test with Ollama disabled
3. Test with Ollama unreachable
4. Test with slow Ollama responses
5. Test with various article lengths
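The mock-based unit tests could start along these lines. This is a self-contained sketch: `summarize_via_ollama` is a hypothetical stand-in for `OllamaClient.summarize_article()`, inlined here so the mocking pattern is runnable on its own:

```python
import unittest
from unittest import mock

import requests


def summarize_via_ollama(content, base_url='http://localhost:11434',
                         model='phi3:latest', timeout=30):
    """Hypothetical stand-in for OllamaClient.summarize_article()."""
    try:
        resp = requests.post(
            f'{base_url}/api/generate',
            json={'model': model, 'prompt': content, 'stream': False},
            timeout=timeout,
        )
        resp.raise_for_status()
        summary = resp.json().get('response', '').strip()
        if not summary:
            return {'success': False, 'summary': None, 'error': 'empty_summary'}
        return {'success': True, 'summary': summary, 'error': None}
    except requests.RequestException as exc:
        return {'success': False, 'summary': None, 'error': str(exc)}


class TestSummarize(unittest.TestCase):
    @mock.patch('requests.post')
    def test_successful_summary(self, post):
        # Simulate a healthy Ollama response
        post.return_value.json.return_value = {'response': 'A short summary.'}
        result = summarize_via_ollama('full article text')
        self.assertTrue(result['success'])
        self.assertEqual(result['summary'], 'A short summary.')

    @mock.patch('requests.post')
    def test_timeout_is_reported_as_failure(self, post):
        # Simulate a slow server exceeding the request timeout
        post.side_effect = requests.Timeout('read timed out')
        result = summarize_via_ollama('full article text')
        self.assertFalse(result['success'])
        self.assertIsNone(result['summary'])
```

Patching `requests.post` keeps the unit tests fast and independent of a running Ollama server; the integration tests above cover the real thing.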
## Performance Considerations

### Timing Estimates

- Article extraction: 2-5 seconds
- Ollama summarization: 5-15 seconds (depends on article length and model)
- Database write: <1 second
- **Total per article: 8-21 seconds**

### Optimization Strategies

1. **Sequential Processing**
   - Process one article at a time
   - Prevents overwhelming Ollama
   - Easier to debug
2. **Timeout Management**
   - 30-second timeout per request
   - Prevents hanging on slow responses
3. **Rate Limiting**
   - 1-second delay between articles
   - Respects server resources
4. **Future: Batch Processing**
   - Queue articles for summarization
   - Process in batches
   - Use Celery for async processing

### Resource Usage

- **Memory**: ~100MB per crawler instance
- **Network**: ~1-5KB per article (to Ollama)
- **Storage**: +150 words per article (~1KB)
- **CPU**: Minimal (Ollama does the heavy lifting)

## Security Considerations

1. **API Key Storage**
   - Store in environment variables
   - Never commit to git
   - Use secrets management in production
2. **Content Sanitization**
   - Don't log full article content
   - Sanitize URLs in logs
   - Limit error message detail
3. **Network Security**
   - Support HTTPS for Ollama
   - Validate SSL certificates
   - Use secure connections
4. **Rate Limiting**
   - Prevent abuse of the Ollama server
   - Implement backoff on errors
   - Monitor usage patterns

## Deployment Considerations

### Environment Variables

```bash
# Required
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=phi3:latest
OLLAMA_ENABLED=true

# Optional
OLLAMA_API_KEY=your-api-key
OLLAMA_TIMEOUT=30
```

### Docker Deployment

```yaml
# docker-compose.yml
services:
  crawler:
    build: ./news_crawler
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - OLLAMA_ENABLED=true
    depends_on:
      - ollama
      - mongodb

  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama

volumes:
  ollama_data:
```

### Monitoring
1. **Metrics to Track**
   - Summarization success rate
   - Average summarization time
   - Ollama server uptime
   - Error frequency by type
2. **Logging**
   - Log all summarization attempts
   - Log errors with context
   - Log performance metrics
3. **Alerts**
   - Alert if Ollama is down >5 minutes
   - Alert if success rate <80%
   - Alert if average time >20 seconds

## Migration Plan

### Phase 1: Add Ollama Client (Week 1)

- Create `ollama_client.py`
- Add configuration
- Write unit tests
- Test with sample articles

### Phase 2: Integrate with Crawler (Week 1)

- Modify `crawler_service.py`
- Add summarization step
- Update database schema
- Test end-to-end

### Phase 3: Update Backend API (Week 2)

- Update news routes
- Add summary fields to responses
- Update frontend to display summaries
- Deploy to production

### Phase 4: Monitor and Optimize (Ongoing)

- Monitor performance
- Tune prompts for better summaries
- Optimize rate limiting
- Add batch processing if needed

## Rollback Plan

If issues arise:

1. **Immediate**: Set `OLLAMA_ENABLED=false`
2. **Short-term**: Revert crawler code changes
3. **Long-term**: Remove Ollama integration

The system will continue to work with original content if Ollama is disabled.

## Success Metrics

- ✅ 95%+ of articles successfully summarized
- ✅ Average summarization time <15 seconds
- ✅ Zero data loss (all articles stored even if summarization fails)
- ✅ Ollama uptime >99%
- ✅ Summary quality: readable and accurate (manual review)

## Future Enhancements

1. **Multi-language Support**
   - Detect article language
   - Use appropriate model
   - Translate summaries
2. **Custom Summary Lengths**
   - Allow configuration per feed
   - Support different lengths for different use cases
3. **Sentiment Analysis**
   - Add sentiment score
   - Categorize as positive/negative/neutral
4. **Keyword Extraction**
   - Extract key topics
   - Enable better search
5. **Batch Processing**
   - Queue articles
   - Process in parallel
   - Use Celery for async
6. **Caching**
   - Cache summaries
   - Avoid re-processing
   - Use Redis for cache
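The caching enhancement could be sketched as follows. The helper names here are hypothetical, and the store is any mapping-like object: a plain dict in this self-contained sketch, a Redis client in production:

```python
import hashlib
import json


def cache_key(content: str) -> str:
    # Key on a hash of the article text so re-crawled articles reuse summaries
    return 'summary:' + hashlib.sha256(content.encode('utf-8')).hexdigest()


def summarize_with_cache(content, summarize_fn, cache):
    """Return a cached summary result if one exists, else compute and cache it.

    `summarize_fn` is the real summarizer (e.g. OllamaClient.summarize_article);
    `cache` is any object with dict-style get/set (a dict here; with a real
    Redis client, the assignment below would become cache.setex(key, ttl, ...)
    so entries expire instead of living forever).
    """
    key = cache_key(content)
    cached = cache.get(key)
    if cached is not None:
        result = json.loads(cached)
        result['from_cache'] = True
        return result
    result = summarize_fn(content)
    if result.get('success'):
        # Only cache successful summaries; failures should be retried
        cache[key] = json.dumps(result)
    result['from_cache'] = False
    return result
```

Hashing the content (rather than the URL) means an edited article gets a fresh summary, while an unchanged re-crawl skips the 5-15 second Ollama round trip entirely.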