Design Document - AI Article Summarization

Overview

This design integrates Ollama AI into the news crawler workflow to automatically generate concise summaries of articles. The system will extract full article content, send it to Ollama for summarization, and store both the original content and the AI-generated summary in MongoDB.

Architecture

High-Level Flow

RSS Feed → Extract Content → Summarize with Ollama → Store in MongoDB
                ↓                      ↓                    ↓
         Full Article Text    AI Summary (≤150 words)   Both Stored

Component Diagram

┌─────────────────────────────────────────────────────────────┐
│                    News Crawler Service                      │
│                                                              │
│  ┌────────────────┐      ┌──────────────────┐             │
│  │ RSS Parser     │──────→│ Content Extractor│             │
│  └────────────────┘      └──────────────────┘             │
│                                   │                         │
│                                   ↓                         │
│                          ┌──────────────────┐              │
│                          │ Ollama Client    │              │
│                          │ (New Component)  │              │
│                          └──────────────────┘              │
│                                   │                         │
│                                   ↓                         │
│                          ┌──────────────────┐              │
│                          │ Database Writer  │              │
│                          └──────────────────┘              │
└─────────────────────────────────────────────────────────────┘
                                   │
                                   ↓
                          ┌──────────────────┐
                          │  Ollama Server   │
                          │  (External)      │
                          └──────────────────┘
                                   │
                                   ↓
                          ┌──────────────────┐
                          │    MongoDB       │
                          └──────────────────┘

Components and Interfaces

1. Ollama Client Module

File: news_crawler/ollama_client.py

Purpose: Handle communication with the Ollama server for summarization

Interface:

class OllamaClient:
    def __init__(self, base_url, model, api_key=None, enabled=True):
        """Initialize Ollama client with configuration"""
        
    def summarize_article(self, content: str, max_words: int = 150) -> dict:
        """
        Summarize article content using Ollama
        
        Args:
            content: Full article text
            max_words: Maximum words in summary (default 150)
            
        Returns:
            {
                'summary': str,           # AI-generated summary
                'word_count': int,        # Summary word count
                'success': bool,          # Whether summarization succeeded
                'error': str or None,     # Error message if failed
                'duration': float         # Time taken in seconds
            }
        """
        
    def is_available(self) -> bool:
        """Check if Ollama server is reachable"""
        
    def test_connection(self) -> dict:
        """Test connection and return server info"""

Key Methods:

  1. summarize_article()

    • Constructs prompt for Ollama
    • Sends HTTP POST request
    • Handles timeouts and errors
    • Validates response
    • Returns structured result
  2. is_available()

    • Quick health check
    • Returns True/False
    • Used before attempting summarization
  3. test_connection()

    • Detailed connection test
    • Returns server info and model list
    • Used for diagnostics
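
The three methods above could be implemented roughly as follows. This is a minimal sketch using the requests library against Ollama's /api/generate and /api/tags endpoints; the prompt wording, the timeout argument, and the error handling are illustrative rather than final.

import time

import requests


class OllamaClient:
    def __init__(self, base_url, model, api_key=None, enabled=True, timeout=30):
        self.base_url = (base_url or 'http://localhost:11434').rstrip('/')
        self.model = model or 'phi3:latest'
        self.api_key = api_key
        self.enabled = enabled
        self.timeout = timeout

    def _headers(self):
        # Only send an Authorization header when an API key is configured
        return {'Authorization': f'Bearer {self.api_key}'} if self.api_key else {}

    def is_available(self) -> bool:
        # Quick health check: the tags endpoint answers without loading a model
        try:
            r = requests.get(f'{self.base_url}/api/tags', headers=self._headers(), timeout=5)
            return r.status_code == 200
        except requests.RequestException:
            return False

    def test_connection(self) -> dict:
        # Detailed check: also report which models the server has pulled
        try:
            r = requests.get(f'{self.base_url}/api/tags', headers=self._headers(), timeout=5)
            r.raise_for_status()
            models = [m.get('name') for m in r.json().get('models', [])]
            return {'reachable': True, 'models': models}
        except requests.RequestException as e:
            return {'reachable': False, 'error': str(e)}

    def summarize_article(self, content: str, max_words: int = 150) -> dict:
        start = time.time()
        if not self.enabled:
            return {'summary': None, 'word_count': 0, 'success': False,
                    'error': 'Ollama integration disabled', 'duration': 0.0}
        payload = {
            'model': self.model,
            'prompt': (
                f'Summarize the following article in {max_words} words or less. '
                f'Focus on the key points and main message:\n\n{content}'
            ),
            'stream': False,
            'options': {'temperature': 0.7, 'num_predict': 200},
        }
        try:
            r = requests.post(f'{self.base_url}/api/generate', json=payload,
                              headers=self._headers(), timeout=self.timeout)
            r.raise_for_status()
            summary = r.json().get('response', '').strip()
            if not summary:
                raise ValueError('empty summary returned')
            return {'summary': summary, 'word_count': len(summary.split()),
                    'success': True, 'error': None, 'duration': time.time() - start}
        except (requests.RequestException, ValueError) as e:
            return {'summary': None, 'word_count': 0, 'success': False,
                    'error': str(e), 'duration': time.time() - start}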

2. Enhanced Crawler Service

File: news_crawler/crawler_service.py

Changes:

# Add Ollama client initialization
from ollama_client import OllamaClient

# Initialize at module level
ollama_client = OllamaClient(
    base_url=os.getenv('OLLAMA_BASE_URL'),
    model=os.getenv('OLLAMA_MODEL'),
    api_key=os.getenv('OLLAMA_API_KEY'),
    enabled=os.getenv('OLLAMA_ENABLED', 'false').lower() == 'true'
)

# Modify crawl_rss_feed() to include summarization
def crawl_rss_feed(feed_url, feed_name, max_articles=10):
    # ... existing code ...
    
    # After extracting content
    article_data = extract_article_content(article_url)
    
    # NEW: Summarize with Ollama
    summary_result = None
    if ollama_client.enabled and article_data.get('content'):
        print(f"   🤖 Summarizing with AI...")
        summary_result = ollama_client.summarize_article(
            article_data['content'],
            max_words=150
        )
        
        if summary_result['success']:
            print(f"   ✓ Summary generated ({summary_result['word_count']} words)")
        else:
            print(f"   ⚠ Summarization failed: {summary_result['error']}")
    
    # Build article document with summary
    article_doc = {
        'title': article_data.get('title'),
        'author': article_data.get('author'),
        'link': article_url,
        'content': article_data.get('content'),
        'summary': summary_result['summary'] if summary_result and summary_result['success'] else None,
        'word_count': article_data.get('word_count'),
        'summary_word_count': summary_result['word_count'] if summary_result and summary_result['success'] else None,
        'source': feed_name,
        'published_at': extract_published_date(entry),
        'crawled_at': article_data.get('crawled_at'),
        'summarized_at': datetime.utcnow() if summary_result and summary_result['success'] else None,
        'created_at': datetime.utcnow()
    }

3. Configuration Module

File: news_crawler/config.py (new file)

Purpose: Centralize configuration management

import os
from dotenv import load_dotenv

load_dotenv(dotenv_path='../.env')

class Config:
    # MongoDB
    MONGODB_URI = os.getenv('MONGODB_URI', 'mongodb://localhost:27017/')
    DB_NAME = 'munich_news'
    
    # Ollama
    OLLAMA_BASE_URL = os.getenv('OLLAMA_BASE_URL', 'http://localhost:11434')
    OLLAMA_MODEL = os.getenv('OLLAMA_MODEL', 'phi3:latest')
    OLLAMA_API_KEY = os.getenv('OLLAMA_API_KEY', '')
    OLLAMA_ENABLED = os.getenv('OLLAMA_ENABLED', 'false').lower() == 'true'
    OLLAMA_TIMEOUT = int(os.getenv('OLLAMA_TIMEOUT', '30'))
    
    # Crawler
    RATE_LIMIT_DELAY = 1  # seconds between requests
    MAX_CONTENT_LENGTH = 50000  # characters
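
With this in place, the module-level initialization in crawler_service.py shown earlier could read from Config instead of calling os.getenv directly; the timeout keyword assumes OllamaClient accepts one, as in the sketch above.

from config import Config
from ollama_client import OllamaClient

ollama_client = OllamaClient(
    base_url=Config.OLLAMA_BASE_URL,
    model=Config.OLLAMA_MODEL,
    api_key=Config.OLLAMA_API_KEY,
    enabled=Config.OLLAMA_ENABLED,
    timeout=Config.OLLAMA_TIMEOUT
)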

Data Models

Updated Article Schema

{
  _id: ObjectId,
  title: String,
  author: String,
  link: String,                    // Unique index
  content: String,                 // Full article content
  summary: String,                 // AI-generated summary (≤150 words)
  word_count: Number,              // Original content word count
  summary_word_count: Number,      // Summary word count
  source: String,
  published_at: String,
  crawled_at: DateTime,
  summarized_at: DateTime,         // When AI summary was generated
  created_at: DateTime
}

Ollama Request Format

{
  "model": "phi3:latest",
  "prompt": "Summarize the following article in 150 words or less. Focus on the key points and main message:\n\n[ARTICLE CONTENT]",
  "stream": false,
  "options": {
    "temperature": 0.7,
    "max_tokens": 200
  }
}

Ollama Response Format

{
  "model": "phi3:latest",
  "created_at": "2024-11-10T16:30:00Z",
  "response": "The AI-generated summary text here...",
  "done": true,
  "total_duration": 5000000000
}

Error Handling

Error Scenarios and Responses

Scenario                 Handling                             User Impact
-----------------------  -----------------------------------  ------------------------------------
Ollama server down       Log warning, store original content  Article saved without summary
Ollama timeout (>30s)    Cancel request, store original       Article saved without summary
Empty summary returned   Log error, store original            Article saved without summary
Invalid response format  Log error, store original            Article saved without summary
Network error            Retry once, then store original      Article saved without summary
Model not found          Log error, disable Ollama            All articles saved without summaries
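
The "retry once" behavior for network errors could be a thin wrapper around summarize_article, as in this sketch. For simplicity it retries any failed attempt once; restricting the retry to network errors would require inspecting result['error'].

import time


def summarize_with_retry(client, content, max_words=150):
    """Call summarize_article and retry once after a short pause if the first attempt fails."""
    result = client.summarize_article(content, max_words=max_words)
    if not result['success']:
        time.sleep(2)  # brief backoff before the single retry
        result = client.summarize_article(content, max_words=max_words)
    return result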

Error Logging Format

{
    'timestamp': datetime.utcnow(),
    'article_url': article_url,
    'error_type': 'timeout|connection|invalid_response|empty_summary',
    'error_message': str(error),
    'ollama_config': {
        'base_url': OLLAMA_BASE_URL,
        'model': OLLAMA_MODEL,
        'enabled': OLLAMA_ENABLED
    }
}
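
For illustration, a small helper that emits this structure through the standard logging module. The logger name and helper are assumptions, and the full article content is deliberately not included, in line with the sanitization rules below.

import json
import logging
from datetime import datetime

logger = logging.getLogger('news_crawler.ollama')


def log_summarization_error(article_url, error_type, error, ollama_config):
    """Emit one structured warning per failed summarization attempt."""
    logger.warning(json.dumps({
        'timestamp': datetime.utcnow().isoformat(),
        'article_url': article_url,
        'error_type': error_type,
        'error_message': str(error),
        'ollama_config': ollama_config,
    }, default=str))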

Testing Strategy

Unit Tests

  1. test_ollama_client.py

    • Test summarization with mock responses
    • Test timeout handling
    • Test error scenarios
    • Test connection checking
  2. test_crawler_with_ollama.py

    • Test crawler with Ollama enabled
    • Test crawler with Ollama disabled
    • Test fallback when Ollama fails
    • Test rate limiting
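
A sketch of one such test, mocking the HTTP call so that no Ollama server is needed. The module layout and the client implementation it patches follow the sketch in the Ollama Client Module section and may differ from the final code.

import unittest
from unittest.mock import MagicMock, patch

from ollama_client import OllamaClient


class TestOllamaClient(unittest.TestCase):
    @patch('ollama_client.requests.post')
    def test_summarize_article_success(self, mock_post):
        # Simulate a successful /api/generate response
        mock_resp = MagicMock(status_code=200)
        mock_resp.json.return_value = {'response': 'A short summary.', 'done': True}
        mock_post.return_value = mock_resp

        client = OllamaClient('http://localhost:11434', 'phi3:latest')
        result = client.summarize_article('Some long article text...')

        self.assertTrue(result['success'])
        self.assertEqual(result['word_count'], 3)


if __name__ == '__main__':
    unittest.main()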

Integration Tests

  1. test_end_to_end.py
    • Crawl real RSS feed
    • Summarize with real Ollama
    • Verify database storage
    • Check all fields populated

Manual Testing

  1. Test with Ollama enabled and working
  2. Test with Ollama disabled
  3. Test with Ollama unreachable
  4. Test with slow Ollama responses
  5. Test with various article lengths

Performance Considerations

Timing Estimates

  • Article extraction: 2-5 seconds
  • Ollama summarization: 5-15 seconds (depends on article length and model)
  • Database write: <1 second
  • Total per article: 8-21 seconds

Optimization Strategies

  1. Sequential Processing

    • Process one article at a time
    • Prevents overwhelming Ollama
    • Easier to debug
  2. Timeout Management

    • 30-second timeout per request
    • Prevents hanging on slow responses
  3. Rate Limiting

    • 1-second delay between articles
    • Respects server resources
  4. Future: Batch Processing

    • Queue articles for summarization
    • Process in batches
    • Use Celery for async processing

Resource Usage

  • Memory: ~100MB per crawler instance
  • Network: ~1-5KB per article (to Ollama)
  • Storage: +150 words per article (~1KB)
  • CPU: Minimal (Ollama does the heavy lifting)

Security Considerations

  1. API Key Storage

    • Store in environment variables
    • Never commit to git
    • Use secrets management in production
  2. Content Sanitization

    • Don't log full article content
    • Sanitize URLs in logs
    • Limit error message detail
  3. Network Security

    • Support HTTPS for Ollama
    • Validate SSL certificates
    • Use secure connections
  4. Rate Limiting

    • Prevent abuse of Ollama server
    • Implement backoff on errors
    • Monitor usage patterns

Deployment Considerations

Environment Variables

# Required
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=phi3:latest
OLLAMA_ENABLED=true

# Optional
OLLAMA_API_KEY=your-api-key
OLLAMA_TIMEOUT=30

Docker Deployment

# docker-compose.yml
services:
  crawler:
    build: ./news_crawler
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - OLLAMA_ENABLED=true
    depends_on:
      - ollama
      - mongodb
  
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama

Monitoring

  1. Metrics to Track

    • Summarization success rate
    • Average summarization time
    • Ollama server uptime
    • Error frequency by type
  2. Logging

    • Log all summarization attempts
    • Log errors with context
    • Log performance metrics
  3. Alerts

    • Alert if Ollama is down >5 minutes
    • Alert if success rate <80%
    • Alert if average time >20 seconds
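
As a starting point, the crawler itself could keep simple in-memory counters per run and log them at the end; exporting to a real metrics backend is left open. A sketch (names are illustrative):

class SummarizationMetrics:
    """In-memory counters for one crawl run."""

    def __init__(self):
        self.attempts = 0
        self.successes = 0
        self.total_duration = 0.0

    def record(self, result):
        # result is the dict returned by OllamaClient.summarize_article()
        self.attempts += 1
        if result['success']:
            self.successes += 1
            self.total_duration += result['duration']

    def summary(self):
        avg = self.total_duration / self.successes if self.successes else 0.0
        rate = self.successes / self.attempts if self.attempts else 0.0
        return {'attempts': self.attempts, 'success_rate': rate, 'avg_duration_seconds': avg}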

Migration Plan

Phase 1: Add Ollama Client (Week 1)

  • Create ollama_client.py
  • Add configuration
  • Write unit tests
  • Test with sample articles

Phase 2: Integrate with Crawler (Week 1)

  • Modify crawler_service.py
  • Add summarization step
  • Update database schema
  • Test end-to-end

Phase 3: Update Backend API (Week 2)

  • Update news routes
  • Add summary fields to responses
  • Update frontend to display summaries
  • Deploy to production

Phase 4: Monitor and Optimize (Ongoing)

  • Monitor performance
  • Tune prompts for better summaries
  • Optimize rate limiting
  • Add batch processing if needed

Rollback Plan

If issues arise:

  1. Immediate: Set OLLAMA_ENABLED=false
  2. Short-term: Revert crawler code changes
  3. Long-term: Remove Ollama integration

The system will continue to work with original content if Ollama is disabled.

Success Metrics

  • 95%+ of articles successfully summarized
  • Average summarization time <15 seconds
  • Zero data loss (all articles stored even if summarization fails)
  • Ollama uptime >99%
  • Summary quality: readable and accurate (manual review)

Future Enhancements

  1. Multi-language Support

    • Detect article language
    • Use appropriate model
    • Translate summaries
  2. Custom Summary Lengths

    • Allow configuration per feed
    • Support different lengths for different use cases
  3. Sentiment Analysis

    • Add sentiment score
    • Categorize as positive/negative/neutral
  4. Keyword Extraction

    • Extract key topics
    • Enable better search
  5. Batch Processing

    • Queue articles
    • Process in parallel
    • Use Celery for async
  6. Caching

    • Cache summaries
    • Avoid re-processing
    • Use Redis for cache