Design Document - AI Article Summarization
Overview
This design integrates Ollama AI into the news crawler workflow to automatically generate concise summaries of articles. The system will extract full article content, send it to Ollama for summarization, and store both the original content and the AI-generated summary in MongoDB.
Architecture
High-Level Flow
RSS Feed → Extract Content (full article text) → Summarize with Ollama (AI summary, ≤150 words) → Store in MongoDB (both stored)
Component Diagram
┌─────────────────────────────────────────────────────────────┐
│ News Crawler Service │
│ │
│ ┌────────────────┐ ┌──────────────────┐ │
│ │ RSS Parser │──────→│ Content Extractor│ │
│ └────────────────┘ └──────────────────┘ │
│ │ │
│ ↓ │
│ ┌──────────────────┐ │
│ │ Ollama Client │ │
│ │ (New Component) │ │
│ └──────────────────┘ │
│ │ │
│ ↓ │
│ ┌──────────────────┐ │
│ │ Database Writer │ │
│ └──────────────────┘ │
└─────────────────────────────────────────────────────────────┘
│
↓
┌──────────────────┐
│ Ollama Server │
│ (External) │
└──────────────────┘
│
↓
┌──────────────────┐
│ MongoDB │
└──────────────────┘
Components and Interfaces
1. Ollama Client Module
File: news_crawler/ollama_client.py
Purpose: Handle communication with Ollama server for summarization
Interface:
class OllamaClient:
    def __init__(self, base_url, model, api_key=None, enabled=True):
        """Initialize Ollama client with configuration"""

    def summarize_article(self, content: str, max_words: int = 150) -> dict:
        """
        Summarize article content using Ollama

        Args:
            content: Full article text
            max_words: Maximum words in summary (default 150)

        Returns:
            {
                'summary': str,        # AI-generated summary
                'word_count': int,     # Summary word count
                'success': bool,       # Whether summarization succeeded
                'error': str or None,  # Error message if failed
                'duration': float      # Time taken in seconds
            }
        """

    def is_available(self) -> bool:
        """Check if Ollama server is reachable"""

    def test_connection(self) -> dict:
        """Test connection and return server info"""
Key Methods:
- summarize_article()
  - Constructs prompt for Ollama
  - Sends HTTP POST request
  - Handles timeouts and errors
  - Validates response
  - Returns structured result
- is_available()
  - Quick health check
  - Returns True/False
  - Used before attempting summarization
- test_connection()
  - Detailed connection test
  - Returns server info and model list
  - Used for diagnostics
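A minimal sketch of the client's core methods, assuming the standard Ollama `/api/generate` and `/api/tags` endpoints and the `requests` library. Error handling is abbreviated here; the real module would add the retry and logging described under Error Handling, and the Bearer-auth header assumes an auth proxy sits in front of Ollama:

```python
import time
import requests


class OllamaClient:
    def __init__(self, base_url, model, api_key=None, enabled=True, timeout=30):
        self.base_url = (base_url or 'http://localhost:11434').rstrip('/')
        self.model = model or 'phi3:latest'
        self.api_key = api_key
        self.enabled = enabled
        self.timeout = timeout

    def _headers(self):
        # Bearer auth only when an API key is set (assumption: auth proxy in front of Ollama)
        return {'Authorization': f'Bearer {self.api_key}'} if self.api_key else {}

    def summarize_article(self, content, max_words=150):
        """Call /api/generate and normalize the result into the dict described above."""
        start = time.time()
        prompt = (
            f"Summarize the following article in {max_words} words or less. "
            f"Focus on the key points and main message:\n\n{content}"
        )
        try:
            resp = requests.post(
                f"{self.base_url}/api/generate",
                json={'model': self.model, 'prompt': prompt, 'stream': False},
                headers=self._headers(),
                timeout=self.timeout,
            )
            resp.raise_for_status()
            summary = resp.json().get('response', '').strip()
            if not summary:
                raise ValueError('empty summary returned')
            return {'summary': summary, 'word_count': len(summary.split()),
                    'success': True, 'error': None, 'duration': time.time() - start}
        except Exception as exc:  # timeout, connection error, bad JSON, empty summary
            return {'summary': None, 'word_count': 0, 'success': False,
                    'error': str(exc), 'duration': time.time() - start}

    def is_available(self):
        # Quick health check against the lightweight /api/tags endpoint
        try:
            return requests.get(f"{self.base_url}/api/tags", timeout=5).ok
        except requests.RequestException:
            return False
```

Failures never raise out of `summarize_article()`; they come back as `success=False` so the crawler can fall back to storing the article without a summary.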
2. Enhanced Crawler Service
File: news_crawler/crawler_service.py
Changes:
# Add Ollama client initialization
from ollama_client import OllamaClient

# Initialize at module level
ollama_client = OllamaClient(
    base_url=os.getenv('OLLAMA_BASE_URL'),
    model=os.getenv('OLLAMA_MODEL'),
    api_key=os.getenv('OLLAMA_API_KEY'),
    enabled=os.getenv('OLLAMA_ENABLED', 'false').lower() == 'true'
)
# Modify crawl_rss_feed() to include summarization
def crawl_rss_feed(feed_url, feed_name, max_articles=10):
    # ... existing code ...

    # After extracting content
    article_data = extract_article_content(article_url)

    # NEW: Summarize with Ollama
    summary_result = None
    if ollama_client.enabled and article_data.get('content'):
        print("  🤖 Summarizing with AI...")
        summary_result = ollama_client.summarize_article(
            article_data['content'],
            max_words=150
        )
        if summary_result['success']:
            print(f"  ✓ Summary generated ({summary_result['word_count']} words)")
        else:
            print(f"  ⚠ Summarization failed: {summary_result['error']}")

    # Build article document with summary
    article_doc = {
        'title': article_data.get('title'),
        'author': article_data.get('author'),
        'link': article_url,
        'content': article_data.get('content'),
        'summary': summary_result['summary'] if summary_result and summary_result['success'] else None,
        'word_count': article_data.get('word_count'),
        'summary_word_count': summary_result['word_count'] if summary_result and summary_result['success'] else None,
        'source': feed_name,
        'published_at': extract_published_date(entry),
        'crawled_at': article_data.get('crawled_at'),
        'summarized_at': datetime.utcnow() if summary_result and summary_result['success'] else None,
        'created_at': datetime.utcnow()
    }
3. Configuration Module
File: news_crawler/config.py (new file)
Purpose: Centralize configuration management
import os
from dotenv import load_dotenv

load_dotenv(dotenv_path='../.env')


class Config:
    # MongoDB
    MONGODB_URI = os.getenv('MONGODB_URI', 'mongodb://localhost:27017/')
    DB_NAME = 'munich_news'

    # Ollama
    OLLAMA_BASE_URL = os.getenv('OLLAMA_BASE_URL', 'http://localhost:11434')
    OLLAMA_MODEL = os.getenv('OLLAMA_MODEL', 'phi3:latest')
    OLLAMA_API_KEY = os.getenv('OLLAMA_API_KEY', '')
    OLLAMA_ENABLED = os.getenv('OLLAMA_ENABLED', 'false').lower() == 'true'
    OLLAMA_TIMEOUT = int(os.getenv('OLLAMA_TIMEOUT', '30'))

    # Crawler
    RATE_LIMIT_DELAY = 1        # seconds between requests
    MAX_CONTENT_LENGTH = 50000  # characters
Data Models
Updated Article Schema
{
  _id: ObjectId,
  title: String,
  author: String,
  link: String,                // Unique index
  content: String,             // Full article content
  summary: String,             // AI-generated summary (≤150 words)
  word_count: Number,          // Original content word count
  summary_word_count: Number,  // Summary word count
  source: String,
  published_at: String,
  crawled_at: DateTime,
  summarized_at: DateTime,     // When the AI summary was generated
  created_at: DateTime
}
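Because `link` carries a unique index, writes should be upserts keyed on it so re-crawled articles refresh rather than duplicate. The helper below is a sketch that only assumes a pymongo-style collection object with `update_one()`; the name `upsert_article` is illustrative:

```python
def upsert_article(collection, article_doc):
    """Insert or refresh an article keyed by its unique link.

    `collection` is any object exposing a pymongo-style update_one().
    """
    return collection.update_one(
        {'link': article_doc['link']},  # match on the unique link index
        {'$set': article_doc},          # overwrite fields with the fresh crawl
        upsert=True,                    # insert if the link has not been seen
    )
```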
Ollama Request Format
{
  "model": "phi3:latest",
  "prompt": "Summarize the following article in 150 words or less. Focus on the key points and main message:\n\n[ARTICLE CONTENT]",
  "stream": false,
  "options": {
    "temperature": 0.7,
    "num_predict": 200
  }
}
Ollama Response Format
{
  "model": "phi3:latest",
  "created_at": "2024-11-10T16:30:00Z",
  "response": "The AI-generated summary text here...",
  "done": true,
  "total_duration": 5000000000
}
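One detail worth calling out: `total_duration` is reported in nanoseconds, so the sample value above is 5 seconds. A small parser sketch (field names taken from the response format above; the helper name is illustrative):

```python
def parse_ollama_response(payload):
    """Extract summary text and timing from a non-streaming Ollama response dict."""
    summary = (payload.get('response') or '').strip()
    return {
        'summary': summary,
        'word_count': len(summary.split()),
        # total_duration is in nanoseconds; convert to seconds
        'duration_s': payload.get('total_duration', 0) / 1e9,
        'done': payload.get('done', False),
    }
```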
Error Handling
Error Scenarios and Responses
| Scenario | Handling | User Impact |
|---|---|---|
| Ollama server down | Log warning, store original content | Article saved without summary |
| Ollama timeout (>30s) | Cancel request, store original | Article saved without summary |
| Empty summary returned | Log error, store original | Article saved without summary |
| Invalid response format | Log error, store original | Article saved without summary |
| Network error | Retry once, then store original | Article saved without summary |
| Model not found | Log error, disable Ollama | All articles saved without summaries |
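The retry-once-then-fallback behaviour in the table can be expressed as a thin wrapper; a sketch, assuming the `summarize_article()` result dict defined earlier (the wrapper name is illustrative):

```python
def summarize_with_fallback(client, content, retries=1):
    """Retry a failed summarization once, then fall back to None.

    Returning None signals the crawler to store the article without a
    summary, so no article is ever lost to a summarization failure.
    """
    for _attempt in range(retries + 1):
        result = client.summarize_article(content)
        if result['success']:
            return result
    return None
```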
Error Logging Format
{
    'timestamp': datetime.utcnow(),
    'article_url': article_url,
    'error_type': 'timeout|connection|invalid_response|empty_summary',
    'error_message': str(error),
    'ollama_config': {
        'base_url': OLLAMA_BASE_URL,
        'model': OLLAMA_MODEL,
        'enabled': OLLAMA_ENABLED
    }
}
Testing Strategy
Unit Tests
- test_ollama_client.py
  - Test summarization with mock responses
  - Test timeout handling
  - Test error scenarios
  - Test connection checking
- test_crawler_with_ollama.py
  - Test crawler with Ollama enabled
  - Test crawler with Ollama disabled
  - Test fallback when Ollama fails
  - Test rate limiting
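The mock-response tests might look like the sketch below, which injects the HTTP call so no Ollama server is needed. The `summarize` function is a simplified stand-in for `OllamaClient.summarize_article`, and the built-in `TimeoutError` stands in for the HTTP library's timeout exception:

```python
from unittest import mock


def summarize(post, content):
    """Tiny stand-in for OllamaClient.summarize_article; `post` is injected."""
    try:
        resp = post('http://localhost:11434/api/generate',
                    json={'prompt': content}, timeout=30)
        return {'success': True, 'summary': resp.json()['response']}
    except TimeoutError:
        return {'success': False, 'error': 'timeout'}


def test_timeout_returns_failure():
    # A timeout should surface as success=False, never as an exception
    post = mock.Mock(side_effect=TimeoutError())
    assert summarize(post, 'article text') == {'success': False, 'error': 'timeout'}


def test_success_parses_response():
    # A well-formed response yields the summary text from the 'response' field
    resp = mock.Mock()
    resp.json.return_value = {'response': 'Short summary.'}
    post = mock.Mock(return_value=resp)
    assert summarize(post, 'article text')['summary'] == 'Short summary.'
```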
Integration Tests
- test_end_to_end.py
  - Crawl real RSS feed
  - Summarize with real Ollama
  - Verify database storage
  - Check all fields populated
Manual Testing
- Test with Ollama enabled and working
- Test with Ollama disabled
- Test with Ollama unreachable
- Test with slow Ollama responses
- Test with various article lengths
Performance Considerations
Timing Estimates
- Article extraction: 2-5 seconds
- Ollama summarization: 5-15 seconds (depends on article length and model)
- Database write: <1 second
- Total per article: 8-21 seconds
Optimization Strategies
- Sequential Processing
  - Process one article at a time
  - Prevents overwhelming Ollama
  - Easier to debug
- Timeout Management
  - 30-second timeout per request
  - Prevents hanging on slow responses
- Rate Limiting
  - 1-second delay between articles
  - Respects server resources
- Future: Batch Processing
  - Queue articles for summarization
  - Process in batches
  - Use Celery for async processing
Resource Usage
- Memory: ~100MB per crawler instance
- Network: ~1-5KB per article (to Ollama)
- Storage: +150 words per article (~1KB)
- CPU: Minimal (Ollama does the heavy lifting)
Security Considerations
- API Key Storage
  - Store in environment variables
  - Never commit to git
  - Use secrets management in production
- Content Sanitization
  - Don't log full article content
  - Sanitize URLs in logs
  - Limit error message detail
- Network Security
  - Support HTTPS for Ollama
  - Validate SSL certificates
  - Use secure connections
- Rate Limiting
  - Prevent abuse of the Ollama server
  - Implement backoff on errors
  - Monitor usage patterns
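The "backoff on errors" point can be sketched as an exponential delay schedule; the base, factor, and cap values here are illustrative, not fixed by this design:

```python
def backoff_delays(base=1.0, factor=2.0, max_delay=30.0, attempts=5):
    """Yield an exponential backoff schedule, capped at max_delay seconds.

    A caller would time.sleep() on each yielded value between retries.
    """
    delay = base
    for _ in range(attempts):
        yield min(delay, max_delay)
        delay *= factor
```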
Deployment Considerations
Environment Variables
# Required
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=phi3:latest
OLLAMA_ENABLED=true
# Optional
OLLAMA_API_KEY=your-api-key
OLLAMA_TIMEOUT=30
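A fail-fast check for these variables at startup might look like this sketch (the function name is illustrative; it only assumes the variable names listed above):

```python
import os


def validate_ollama_env():
    """Fail fast if Ollama is enabled but required settings are missing.

    Returns the enabled flag so callers can branch on it.
    """
    enabled = os.getenv('OLLAMA_ENABLED', 'false').lower() == 'true'
    required = ('OLLAMA_BASE_URL', 'OLLAMA_MODEL')
    missing = [key for key in required if enabled and not os.getenv(key)]
    if missing:
        raise RuntimeError(f"Ollama enabled but missing: {', '.join(missing)}")
    return enabled
```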
Docker Deployment
# docker-compose.yml
services:
  crawler:
    build: ./news_crawler
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - OLLAMA_ENABLED=true
    depends_on:
      - ollama
      - mongodb

  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama

volumes:
  ollama_data:
Monitoring
- Metrics to Track
  - Summarization success rate
  - Average summarization time
  - Ollama server uptime
  - Error frequency by type
- Logging
  - Log all summarization attempts
  - Log errors with context
  - Log performance metrics
- Alerts
  - Alert if Ollama is down >5 minutes
  - Alert if success rate <80%
  - Alert if average time >20 seconds
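The success-rate and timing metrics above can be tracked with a small in-process counter; a sketch using the alert thresholds listed (the class name and the minimum-sample guard are illustrative):

```python
class SummarizationMetrics:
    """Track summarization success rate, average duration, and errors by type."""

    def __init__(self):
        self.attempts = 0
        self.successes = 0
        self.total_seconds = 0.0
        self.errors = {}  # error_type -> count

    def record(self, success, duration, error_type=None):
        self.attempts += 1
        self.total_seconds += duration
        if success:
            self.successes += 1
        elif error_type:
            self.errors[error_type] = self.errors.get(error_type, 0) + 1

    @property
    def success_rate(self):
        return self.successes / self.attempts if self.attempts else 0.0

    @property
    def avg_duration(self):
        return self.total_seconds / self.attempts if self.attempts else 0.0

    def should_alert(self):
        # Thresholds from the Alerts list: success rate <80% or average time >20 s.
        # Require a minimum sample so one early failure does not page anyone.
        return self.attempts >= 10 and (
            self.success_rate < 0.8 or self.avg_duration > 20.0
        )
```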
Migration Plan
Phase 1: Add Ollama Client (Week 1)
- Create ollama_client.py
- Add configuration
- Write unit tests
- Test with sample articles
Phase 2: Integrate with Crawler (Week 1)
- Modify crawler_service.py
- Add summarization step
- Update database schema
- Test end-to-end
Phase 3: Update Backend API (Week 2)
- Update news routes
- Add summary fields to responses
- Update frontend to display summaries
- Deploy to production
Phase 4: Monitor and Optimize (Ongoing)
- Monitor performance
- Tune prompts for better summaries
- Optimize rate limiting
- Add batch processing if needed
Rollback Plan
If issues arise:
- Immediate: Set OLLAMA_ENABLED=false
- Short-term: Revert crawler code changes
- Long-term: Remove Ollama integration
System will continue to work with original content if Ollama is disabled.
Success Metrics
- ✅ 95%+ of articles successfully summarized
- ✅ Average summarization time <15 seconds
- ✅ Zero data loss (all articles stored even if summarization fails)
- ✅ Ollama uptime >99%
- ✅ Summary quality: readable and accurate (manual review)
Future Enhancements
- Multi-language Support
  - Detect article language
  - Use appropriate model
  - Translate summaries
- Custom Summary Lengths
  - Allow configuration per feed
  - Support different lengths for different use cases
- Sentiment Analysis
  - Add sentiment score
  - Categorize as positive/negative/neutral
- Keyword Extraction
  - Extract key topics
  - Enable better search
- Batch Processing
  - Queue articles
  - Process in parallel
  - Use Celery for async
- Caching
  - Cache summaries
  - Avoid re-processing
  - Use Redis for cache
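The caching idea can be sketched with an in-memory stand-in keyed by a hash of the article content; swapping the dict for a Redis client (`get`/`set` on the same keys) would make the cache shared across crawler instances. The class and key prefix are illustrative:

```python
import hashlib


class SummaryCache:
    """In-memory stand-in for the proposed Redis summary cache.

    Summaries are keyed by a SHA-256 of the article content, so an
    unchanged article is never sent to Ollama twice.
    """

    def __init__(self):
        self._store = {}

    @staticmethod
    def key(content):
        return 'summary:' + hashlib.sha256(content.encode('utf-8')).hexdigest()

    def get(self, content):
        return self._store.get(self.key(content))

    def set(self, content, summary):
        self._store[self.key(content)] = summary
```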