2025-11-11 16:58:03 +01:00

Design Document: Article Title Translation

Overview

This feature extends the existing Ollama AI integration to translate German article titles to English during the crawling process. The translation will be performed immediately after article content extraction and before AI summarization. Both the original German title and English translation will be stored in the MongoDB article document, and the newsletter template will be updated to display the English title prominently with the original as a subtitle.

The design leverages the existing Ollama infrastructure (same server, configuration, and error handling patterns) to minimize complexity and maintain consistency with the current summarization feature.

Architecture

Component Interaction Flow

RSS Feed Entry
    ↓
Crawler Service (extract_article_content)
    ↓
Article Data (with German title)
    ↓
Ollama Client (translate_title) ← New Method
    ↓
Translation Result
    ↓
Crawler Service (prepare article_doc)
    ↓
MongoDB (articles collection with title + title_en)
    ↓
Newsletter Service (fetch articles)
    ↓
Newsletter Template (display English title + German subtitle)
    ↓
Email to Subscribers

Integration Points

  1. Ollama Client - Add new translate_title() method alongside existing summarize_article() method
  2. Crawler Service - Call translation after content extraction, before summarization
  3. Article Document Schema - Add title_en and translated_at fields
  4. Newsletter Template - Update title display logic to show English/German titles

Components and Interfaces

1. Ollama Client Extension

New Method: translate_title(title, target_language='English')

def translate_title(self, title, target_language='English'):
    """
    Translate article title to target language
    
    Args:
        title (str): Original German title
        target_language (str): Target language (default: 'English')
    
    Returns:
        dict: {
            'success': bool,
            'translated_title': str or None,
            'error': str or None,
            'duration': float
        }
    """

Implementation Details:

  • Prompt Engineering: Clear, concise prompt instructing the model to translate only the headline without explanations
  • Temperature: 0.3 (lower than summarization's 0.7) so that translations vary less between runs; output is more consistent but not strictly deterministic
  • Token Limit: 100 tokens (sufficient for title-length outputs)
  • Response Cleaning:
    • Remove surrounding quotes (single and double)
    • Extract first line only (ignore any extra text)
    • Trim whitespace
  • Error Handling: Same pattern as summarize_article() - catch timeouts, connection errors, HTTP errors
  • Validation: Check for empty title input before making API call
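
The request options and response-cleaning steps above could be sketched as follows. This is a minimal sketch, not the project's implementation: the helper names build_translation_payload and clean_translation and the prompt wording are assumptions made for illustration. The payload shape follows Ollama's /api/generate endpoint, where num_predict caps the number of generated tokens.

```python
QUOTE_CHARS = "\"'"


def build_translation_payload(title, model="phi3:latest", target_language="English"):
    """Build a non-streaming request body for Ollama's /api/generate endpoint."""
    # Hypothetical prompt wording; the design only requires "headline only, no explanations".
    prompt = (
        f"Translate this German news headline to {target_language}. "
        "Reply with the translated headline only, no explanations.\n\n"
        f"Headline: {title}"
    )
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        # Values taken from the design: temperature 0.3, at most 100 tokens.
        "options": {"temperature": 0.3, "num_predict": 100},
    }


def clean_translation(raw):
    """Keep the first non-empty line, then drop matching surrounding quotes."""
    lines = [line.strip() for line in raw.splitlines() if line.strip()]
    if not lines:
        return None
    text = lines[0]
    # Only strip quotes that wrap the whole line, so inner quotes survive.
    while len(text) >= 2 and text[0] == text[-1] and text[0] in QUOTE_CHARS:
        text = text[1:-1].strip()
    return text or None
```

Stripping only a matching outer pair keeps a headline such as "Historic Day" for Munich intact, at the cost of mishandling the rare headline that both starts and ends with its own quoted phrase.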

2. Crawler Service Integration

Location: In crawl_rss_feed() function, after content extraction

Execution Order:

  1. Extract article content (existing)
  2. Translate title (new)
  3. Summarize article (existing)
  4. Save to database (modified)

Implementation Pattern:

# After article_data extraction
translation_result = None
original_title = article_data.get('title') or entry.get('title', '')

if Config.OLLAMA_ENABLED:
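    # Translate the title first; a failure here must not block summarization
    print("   🌐 Translating title...")
    translation_result = ollama_client.translate_title(original_title)

    if translation_result and translation_result['success']:
        print(f"   ✓ Title translated ({translation_result['duration']:.1f}s)")
    else:
        # Guard against a None result before reading the 'error' key
        error = translation_result['error'] if translation_result else 'no result returned'
        print(f"   ⚠ Translation failed: {error}")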
    
    # Then summarize (existing code)
    ...

Console Output Format:

  • Success: ✓ Title translated (0.8s)
  • Failure: ⚠ Translation failed: Request timed out

3. Data Models

MongoDB Article Document Schema Extension:

{
  // Existing fields
  title: String,              // Original German title
  author: String,
  link: String,
  content: String,
  summary: String,
  word_count: Number,
  summary_word_count: Number,
  source: String,
  category: String,
  published_at: Date,
  crawled_at: Date,
  summarized_at: Date,
  created_at: Date,
  
  // New fields
  title_en: String,           // English translation of title (nullable)
  translated_at: Date         // Timestamp when translation completed (nullable)
}

Field Behavior:

  • title_en: NULL if translation fails or Ollama is disabled
  • translated_at: NULL if translation fails or Ollama is disabled; set to datetime.utcnow() on success
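
One way to derive the two new fields from the translate_title() result is sketched below. The helper name translation_fields is hypothetical, and the sketch uses datetime.now(timezone.utc), a timezone-aware stand-in for the datetime.utcnow() call named above; swap it back if the existing timestamps are stored as naive datetimes.

```python
from datetime import datetime, timezone


def translation_fields(translation_result, now=None):
    """Return the title_en / translated_at values to merge into article_doc."""
    translated = None
    if translation_result and translation_result.get("success"):
        translated = translation_result.get("translated_title")

    if not translated:
        # Failed, disabled (result is None), or empty translation: both fields stay NULL.
        return {"title_en": None, "translated_at": None}

    return {
        "title_en": translated,
        "translated_at": now or datetime.now(timezone.utc),
    }
```

The result can be merged into the document before saving, for example article_doc.update(translation_fields(translation_result)).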

4. Newsletter Template Updates

Current Title Display:

<h2 style="...">
    {{ article.title }}
</h2>

New Title Display Logic:

<!-- Primary title: English if available, otherwise German -->
<h2 style="margin: 12px 0 8px 0; font-size: 19px; font-weight: 700; line-height: 1.3; color: #1a1a1a;">
    {{ article.title_en if article.title_en else article.title }}
</h2>

<!-- Subtitle: Original German title (only if English translation exists and differs) -->
{% if article.title_en and article.title_en != article.title %}
<p style="margin: 0 0 12px 0; font-size: 13px; color: #999999; font-style: italic;">
    Original: {{ article.title }}
</p>
{% endif %}

Display Rules:

  1. If title_en exists and differs from title: Show English as primary, German as subtitle
  2. If title_en is NULL or same as title: Show only the original title
  3. Subtitle styling: Smaller font (13px), gray color (#999999), italic
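
The display rules can also be expressed as a small pure function, which is convenient for unit-testing the logic outside the template. This is an illustrative sketch; the function name title_display is an assumption and is not part of the template.

```python
def title_display(article):
    """Return (primary_title, subtitle) following the newsletter display rules."""
    title = article.get("title", "")
    title_en = article.get("title_en")

    # Rule 1: a distinct English title becomes primary, German becomes the subtitle.
    if title_en and title_en != title:
        return title_en, title

    # Rule 2: no translation, or translation identical to the original.
    return title, None
```

For example, an article with title "München plant neue U-Bahn-Linie" and title_en "Munich plans new subway line" yields the English title as primary and the German title as subtitle; an article without title_en yields the German title and no subtitle.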

Error Handling

Translation Failure Scenarios

Scenario                  | Behavior                                      | User Impact
--------------------------|-----------------------------------------------|-----------------------------------
Ollama server unavailable | Skip translation, continue with summarization | Newsletter shows German title only
Translation timeout       | Log error, store NULL in title_en             | Newsletter shows German title only
Empty title input         | Return error immediately, skip API call       | Newsletter shows German title only
Ollama disabled in config | Skip translation entirely                     | Newsletter shows German title only
Network error             | Catch exception, log error, continue          | Newsletter shows German title only

Error Handling Principles

  1. Non-blocking: Translation failures never prevent article processing
  2. Graceful degradation: Fall back to original German title
  3. Consistent logging: All errors logged with descriptive messages
  4. No retry logic: Single attempt per article (same as summarization)
  5. Invisible to readers: Failures are logged on the crawler side, but the newsletter renders normally regardless of translation status
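
The non-blocking principle could be enforced at the call site with a thin guard like the one below. This is a sketch under assumptions: safe_translate is a hypothetical helper, and it is an extra safety net on top of the error handling that translate_title() is already expected to perform internally. It makes a single attempt, in line with the no-retry rule.

```python
import time


def safe_translate(client, title):
    """Call client.translate_title() once and never raise to the crawler."""
    start = time.monotonic()
    error = "no result returned"
    result = None
    try:
        result = client.translate_title(title)
    except Exception as exc:  # deliberately broad: translation must not block crawling
        error = str(exc)

    if result is not None:
        return result

    # Same result shape as translate_title(), so callers need no special case.
    return {
        "success": False,
        "translated_title": None,
        "error": error,
        "duration": time.monotonic() - start,
    }
```

With this guard, the crawler can always read the success and error keys from the result and proceed to summarization.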

Console Output Examples

Success Case:

🔍 Crawling: Neuer U-Bahn-Ausbau in München geplant...
   🌐 Translating title...
   ✓ Title translated (0.8s)
   🤖 Summarizing with AI...
   ✓ Summary: 45 words (from 320 words, 2.3s)
   ✓ Saved (320 words)

Translation Failure Case:

🔍 Crawling: Neuer U-Bahn-Ausbau in München geplant...
   🌐 Translating title...
   ⚠ Translation failed: Request timed out after 30 seconds
   🤖 Summarizing with AI...
   ✓ Summary: 45 words (from 320 words, 2.3s)
   ✓ Saved (320 words)

Testing Strategy

Unit Testing

Ollama Client Tests (test_ollama_client.py):

  1. Test successful translation with valid German title
  2. Test empty title input handling
  3. Test timeout handling
  4. Test connection error handling
  5. Test response cleaning (quotes, newlines, whitespace)
  6. Test translation with special characters
  7. Test translation with very long titles

Test Data Examples:

  • Simple: "München plant neue U-Bahn-Linie"
  • With quotes: '"Historischer Tag" für München'
  • With special chars: "Oktoberfest 2024: 7,5 Millionen Besucher"
  • Long: "Stadtrat beschließt umfassende Maßnahmen zur Verbesserung der Verkehrsinfrastruktur..."

Integration Testing

Crawler Service Tests:

  1. Test article processing with translation enabled
  2. Test article processing with translation disabled
  3. Test article processing when translation fails
  4. Test database document structure includes new fields
  5. Test console output formatting

Manual Testing

End-to-End Workflow:

  1. Enable Ollama in configuration
  2. Trigger crawl with max_articles=2
  3. Verify console shows translation status
  4. Check MongoDB for title_en and translated_at fields
  5. Send test newsletter
  6. Verify email displays English title with German subtitle

Test Scenarios:

  • Fresh crawl with Ollama enabled
  • Re-crawl existing articles (should skip translation)
  • Crawl with Ollama disabled
  • Crawl with Ollama server stopped (simulate failure)

Performance Testing

Metrics to Monitor:

  • Translation duration per article (target: < 2 seconds)
  • Impact on total crawl time (translation + summarization)
  • Ollama server resource usage

Expected Performance:

  • Translation: ~0.5-1.5 seconds per title
  • Total per article: ~3-5 seconds (translation + summarization)
  • Acceptable for batch processing during scheduled crawls
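
As a rough worked example of what these figures mean for a whole crawl, the sketch below multiplies the expected per-article range by an article count. The function name and the count of 20 articles are illustrative assumptions; only the 3-5 second range comes from the estimates above.

```python
def estimate_crawl_seconds(n_articles, per_article_range=(3.0, 5.0)):
    """Return (low, high) estimated seconds of Ollama time for a crawl."""
    low, high = per_article_range
    return n_articles * low, n_articles * high


# A hypothetical crawl of 20 articles at 3-5 s each
print(estimate_crawl_seconds(20))  # → (60.0, 100.0)
```

Passing per_article_range=(0.5, 1.5) isolates the share added by title translation alone, which for the same 20 articles is 10 to 30 seconds.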

Configuration

No New Configuration Required

The translation feature uses existing Ollama configuration:

# From config.py (existing)
OLLAMA_ENABLED = True  # or False; also switches title translation on/off
OLLAMA_BASE_URL = "http://ollama:11434"
OLLAMA_MODEL = "phi3:latest"
OLLAMA_TIMEOUT = 30

Rationale: Simplifies deployment and maintains consistency. Translation is automatically enabled/disabled with the existing OLLAMA_ENABLED flag.

Deployment Considerations

Docker Container Updates

Affected Services:

  • crawler service: Needs rebuild to include new translation code
  • sender service: Needs rebuild to include updated newsletter template

Deployment Steps:

  1. Update code in news_crawler/ollama_client.py
  2. Update code in news_crawler/crawler_service.py
  3. Update template in news_sender/newsletter_template.html
  4. Rebuild containers: docker-compose up -d --build crawler sender
  5. No database migration needed (new fields are nullable)

Backward Compatibility

Existing Articles: Articles without title_en will display German title only (graceful fallback)

No Breaking Changes: Newsletter template handles NULL title_en values

Rollback Plan

If issues arise:

  1. Revert code changes
  2. Rebuild containers
  3. Existing articles with title_en will continue to work
  4. New articles will only have German titles

Future Enhancements

Potential Improvements (Out of Scope)

  1. Batch Translation: Translate multiple titles in single API call for efficiency
  2. Translation Caching: Cache common phrases/words to reduce API calls
  3. Multi-language Support: Add configuration for target language selection
  4. Translation Quality Metrics: Track and log translation quality scores
  5. Retry Logic: Implement retry with exponential backoff for failed translations
  6. Admin API: Add endpoint to re-translate existing articles

These enhancements are not included in the current implementation to maintain simplicity and focus on core functionality.