Munich-news/.kiro/specs/article-title-translation/design.md
2025-11-11 16:58:03 +01:00


# Design Document: Article Title Translation
## Overview
This feature extends the existing Ollama AI integration to translate German article titles to English during the crawling process. The translation will be performed immediately after article content extraction and before AI summarization. Both the original German title and English translation will be stored in the MongoDB article document, and the newsletter template will be updated to display the English title prominently with the original as a subtitle.
The design leverages the existing Ollama infrastructure (same server, configuration, and error handling patterns) to minimize complexity and maintain consistency with the current summarization feature.
## Architecture
### Component Interaction Flow
```
RSS Feed Entry
    ↓
Crawler Service (extract_article_content)
    ↓
Article Data (with German title)
    ↓
Ollama Client (translate_title) ← New Method
    ↓
Translation Result
    ↓
Crawler Service (prepare article_doc)
    ↓
MongoDB (articles collection with title + title_en)
    ↓
Newsletter Service (fetch articles)
    ↓
Newsletter Template (display English title + German subtitle)
    ↓
Email to Subscribers
```
### Integration Points
1. **Ollama Client** - Add new `translate_title()` method alongside existing `summarize_article()` method
2. **Crawler Service** - Call translation after content extraction, before summarization
3. **Article Document Schema** - Add `title_en` and `translated_at` fields
4. **Newsletter Template** - Update title display logic to show English/German titles
## Components and Interfaces
### 1. Ollama Client Extension
**New Method: `translate_title(title, target_language='English')`**
```python
def translate_title(self, title, target_language='English'):
    """
    Translate article title to target language

    Args:
        title (str): Original German title
        target_language (str): Target language (default: 'English')

    Returns:
        dict: {
            'success': bool,
            'translated_title': str or None,
            'error': str or None,
            'duration': float
        }
    """
```
**Implementation Details:**
- **Prompt Engineering**: Clear, concise prompt instructing the model to translate only the headline without explanations
- **Temperature**: 0.3 (lower than summarization's 0.7) for more consistent translations; output varies less between runs but is not strictly deterministic
- **Token Limit**: 100 tokens (sufficient for title-length outputs)
- **Response Cleaning**:
  - Remove surrounding quotes (single and double)
  - Extract first line only (ignore any extra text)
  - Trim whitespace
- **Error Handling**: Same pattern as `summarize_article()` - catch timeouts, connection errors, HTTP errors
- **Validation**: Check for empty title input before making API call
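The request parameters and cleaning rules above could be sketched as two small helpers. This is a minimal illustration, not the project's actual code: the names `build_translation_payload` and `clean_translated_title` and the prompt wording are assumptions, while `stream`, `temperature`, and `num_predict` are request fields of Ollama's `/api/generate` endpoint. The cleaner strips quotes only when the same quote character wraps the whole line, so a headline that merely starts with a quoted phrase stays intact.

```python
def build_translation_payload(title, model="phi3:latest", target_language="English"):
    """Build a request body for Ollama's /api/generate endpoint (hypothetical helper)."""
    # Assumed prompt wording: ask for the headline only, with no explanations.
    prompt = (
        f"Translate this German news headline to {target_language}. "
        "Reply with the translated headline only, no explanations.\n\n"
        f"Headline: {title}"
    )
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # request a single complete JSON response
        "options": {
            "temperature": 0.3,  # lower than summarization's 0.7
            "num_predict": 100,  # token limit for title-length output
        },
    }


def clean_translated_title(raw):
    """Reduce raw model output to a bare title; return None if nothing is left."""
    stripped = (raw or "").strip()
    if not stripped:
        return None
    # Keep the first line only and ignore any extra text the model appended.
    line = stripped.splitlines()[0].strip()
    # Remove quotes only when one matching pair wraps the entire line.
    if len(line) >= 2 and line[0] == line[-1] and line[0] in "\"'":
        line = line[1:-1].strip()
    return line or None


if __name__ == "__main__":
    print(build_translation_payload("München plant neue U-Bahn-Linie")["options"])
    print(clean_translated_title('"Munich plans new subway line"\nNote: literal'))
```

Keeping both helpers free of network calls makes the cleaning rules easy to cover in the unit tests listed under Testing Strategy.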
### 2. Crawler Service Integration
**Location**: In `crawl_rss_feed()` function, after content extraction
**Execution Order**:
1. Extract article content (existing)
2. **Translate title** (new)
3. Summarize article (existing)
4. Save to database (modified)
**Implementation Pattern**:
```python
# After article_data extraction
translation_result = None
original_title = article_data.get('title') or entry.get('title', '')

if Config.OLLAMA_ENABLED:
    # Translate title (a failure here must not stop article processing)
    print(" 🌐 Translating title...")
    translation_result = ollama_client.translate_title(original_title)
    if translation_result and translation_result['success']:
        print(f" ✓ Title translated ({translation_result['duration']:.1f}s)")
    else:
        # Guard against a None result before reading the error message
        error = (translation_result or {}).get('error') or 'unknown error'
        print(f" ⚠ Translation failed: {error}")

    # Then summarize (existing code)
    ...
```
**Console Output Format**:
- Success: `✓ Title translated (0.8s)`
- Failure: `⚠ Translation failed: Request timed out`
### 3. Data Models
**MongoDB Article Document Schema Extension**:
```javascript
{
  // Existing fields
  title: String,              // Original German title
  author: String,
  link: String,
  content: String,
  summary: String,
  word_count: Number,
  summary_word_count: Number,
  source: String,
  category: String,
  published_at: Date,
  crawled_at: Date,
  summarized_at: Date,
  created_at: Date,

  // New fields
  title_en: String,           // English translation of title (nullable)
  translated_at: Date         // Timestamp when translation completed (nullable)
}
```
**Field Behavior**:
- `title_en`: NULL if translation fails or Ollama is disabled
- `translated_at`: NULL if translation fails, set to `datetime.utcnow()` on success
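A minimal sketch of how the field behavior above might map a `translate_title()` result onto the two new document fields. The helper name `translation_fields` and its `now` parameter are hypothetical. The sketch uses `datetime.now(timezone.utc)` rather than `datetime.utcnow()`, which is deprecated as of Python 3.12; treat that substitution as an assumption, not as the project's choice.

```python
from datetime import datetime, timezone


def translation_fields(translation_result, now=None):
    """Return the title_en / translated_at values for an article document.

    Both fields stay None when translation failed, returned nothing,
    or never ran because Ollama is disabled (translation_result is None).
    """
    succeeded = bool(
        translation_result
        and translation_result.get("success")
        and translation_result.get("translated_title")
    )
    if not succeeded:
        return {"title_en": None, "translated_at": None}
    return {
        "title_en": translation_result["translated_title"],
        # 'now' is injectable so tests can pin the timestamp.
        "translated_at": now or datetime.now(timezone.utc),
    }


if __name__ == "__main__":
    ok = {"success": True, "translated_title": "Munich plans new subway line",
          "error": None, "duration": 0.8}
    article_doc = {"title": "München plant neue U-Bahn-Linie"}
    article_doc.update(translation_fields(ok))
    print(article_doc["title_en"])
```

Merging the returned dict into the article document keeps the nullable-field rules in one place instead of spreading them across the crawler.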
### 4. Newsletter Template Updates
**Current Title Display**:
```html
<h2 style="...">
{{ article.title }}
</h2>
```
**New Title Display Logic**:
```html
<!-- Primary title: English if available, otherwise German -->
<h2 style="margin: 12px 0 8px 0; font-size: 19px; font-weight: 700; line-height: 1.3; color: #1a1a1a;">
    {{ article.title_en if article.title_en else article.title }}
</h2>

<!-- Subtitle: Original German title (only if English translation exists and differs) -->
{% if article.title_en and article.title_en != article.title %}
<p style="margin: 0 0 12px 0; font-size: 13px; color: #999999; font-style: italic;">
    Original: {{ article.title }}
</p>
{% endif %}
```
**Display Rules**:
1. If `title_en` exists and differs from `title`: Show English as primary, German as subtitle
2. If `title_en` is NULL or same as `title`: Show only the original title
3. Subtitle styling: Smaller font (13px), gray color (#999999), italic
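The display rules can be mirrored in plain Python, which may be useful for unit-testing the logic without rendering the template. This is a sketch only: `select_titles` is a hypothetical helper, and it assumes articles arrive as dictionaries in which `title_en` may be missing or `None`.

```python
def select_titles(article):
    """Return (primary_title, subtitle) following the newsletter display rules.

    subtitle is None when only the original title should be shown.
    """
    title = article.get("title") or ""
    title_en = article.get("title_en")
    # Rule 1: English as primary, German as subtitle, only if they differ.
    if title_en and title_en != title:
        return title_en, title
    # Rule 2: no translation, or an identical one, shows the original alone.
    return title, None


if __name__ == "__main__":
    translated = {"title": "München plant neue U-Bahn-Linie",
                  "title_en": "Munich plans new subway line"}
    legacy = {"title": "München plant neue U-Bahn-Linie"}  # stored before this feature
    print(select_titles(translated))
    print(select_titles(legacy))
```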
## Error Handling
### Translation Failure Scenarios
| Scenario | Behavior | User Impact |
|----------|----------|-------------|
| Ollama server unavailable | Skip translation, continue with summarization | Newsletter shows German title only |
| Translation timeout | Log error, store NULL in title_en | Newsletter shows German title only |
| Empty title input | Return error immediately, skip API call | Newsletter shows German title only |
| Ollama disabled in config | Skip translation entirely | Newsletter shows German title only |
| Network error | Catch exception, log error, continue | Newsletter shows German title only |
### Error Handling Principles
1. **Non-blocking**: Translation failures never prevent article processing
2. **Graceful degradation**: Fall back to original German title
3. **Consistent logging**: All errors logged with descriptive messages
4. **No retry logic**: Single attempt per article (same as summarization)
5. **Silent failures**: Failures are logged but never surfaced to subscribers; the newsletter renders normally regardless of translation status
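One way to express the non-blocking, single-attempt principles is a small wrapper that always returns a result dictionary and never raises. The sketch below is illustrative: `attempt_translation` and the injected `translate` callable are hypothetical names, and the broad `except Exception` is a deliberate simplification of the timeout, connection, and HTTP error handling described for `summarize_article()`.

```python
import time


def attempt_translation(translate, title):
    """Make one translation attempt; always return a result dict, never raise.

    'translate' is any callable that takes a title and returns the
    translated string (a stand-in for the real Ollama call).
    """
    result = {"success": False, "translated_title": None,
              "error": None, "duration": 0.0}
    # Validate before doing any work so empty input costs no API call.
    if not title or not title.strip():
        result["error"] = "Empty title"
        return result

    started = time.monotonic()
    try:
        translated = translate(title)  # single attempt, no retry
    except Exception as exc:  # broad on purpose: failures must not block crawling
        result["error"] = str(exc) or type(exc).__name__
    else:
        if translated:
            result["success"] = True
            result["translated_title"] = translated
        else:
            result["error"] = "Empty translation"
    result["duration"] = time.monotonic() - started
    return result


if __name__ == "__main__":
    def timing_out(_title):
        raise TimeoutError("Request timed out")

    outcome = attempt_translation(timing_out, "Neuer U-Bahn-Ausbau in München geplant")
    print(f"⚠ Translation failed: {outcome['error']}")
```

Because the caller only ever sees a dictionary, summarization and saving can proceed along the same code path whether or not translation succeeded.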
### Console Output Examples
**Success Case**:
```
🔍 Crawling: Neuer U-Bahn-Ausbau in München geplant...
🌐 Translating title...
✓ Title translated (0.8s)
🤖 Summarizing with AI...
✓ Summary: 45 words (from 320 words, 2.3s)
✓ Saved (320 words)
```
**Translation Failure Case**:
```
🔍 Crawling: Neuer U-Bahn-Ausbau in München geplant...
🌐 Translating title...
⚠ Translation failed: Request timed out after 30 seconds
🤖 Summarizing with AI...
✓ Summary: 45 words (from 320 words, 2.3s)
✓ Saved (320 words)
```
## Testing Strategy
### Unit Testing
**Ollama Client Tests** (`test_ollama_client.py`):
1. Test successful translation with valid German title
2. Test empty title input handling
3. Test timeout handling
4. Test connection error handling
5. Test response cleaning (quotes, newlines, whitespace)
6. Test translation with special characters
7. Test translation with very long titles
**Test Data Examples**:
- Simple: "München plant neue U-Bahn-Linie"
- With quotes: "\"Historischer Tag\" für München"
- With special chars: "Oktoberfest 2024: 7,5 Millionen Besucher"
- Long: "Stadtrat beschließt umfassende Maßnahmen zur Verbesserung der Verkehrsinfrastruktur..."
### Integration Testing
**Crawler Service Tests**:
1. Test article processing with translation enabled
2. Test article processing with translation disabled
3. Test article processing when translation fails
4. Test database document structure includes new fields
5. Test console output formatting
### Manual Testing
**End-to-End Workflow**:
1. Enable Ollama in configuration
2. Trigger crawl with `max_articles=2`
3. Verify console shows translation status
4. Check MongoDB for `title_en` and `translated_at` fields
5. Send test newsletter
6. Verify email displays English title with German subtitle
**Test Scenarios**:
- Fresh crawl with Ollama enabled
- Re-crawl existing articles (should skip translation)
- Crawl with Ollama disabled
- Crawl with Ollama server stopped (simulate failure)
### Performance Testing
**Metrics to Monitor**:
- Translation duration per article (target: < 2 seconds)
- Impact on total crawl time (translation + summarization)
- Ollama server resource usage
**Expected Performance**:
- Translation: ~0.5-1.5 seconds per title
- Total per article: ~3-5 seconds (translation + summarization)
- Acceptable for batch processing during scheduled crawls
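To check a crawl against the 2-second target, the `duration` values from translation results could be aggregated after each run. The helper below is a hypothetical sketch; the name `summarize_durations` and the shape of its return value are assumptions, not part of the existing codebase.

```python
def summarize_durations(durations, target_seconds=2.0):
    """Aggregate per-title translation durations (in seconds) for one crawl."""
    if not durations:
        return {"count": 0, "mean": None, "max": None, "over_target": 0}
    return {
        "count": len(durations),
        "mean": sum(durations) / len(durations),
        "max": max(durations),
        # The target is "< 2 seconds", so exactly 2.0 counts as a miss.
        "over_target": sum(1 for d in durations if d >= target_seconds),
    }


if __name__ == "__main__":
    stats = summarize_durations([0.8, 1.2, 2.5])
    print(f"{stats['count']} titles, mean {stats['mean']:.1f}s, "
          f"max {stats['max']:.1f}s, {stats['over_target']} over target")
```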
## Configuration
### No New Configuration Required
The translation feature uses existing Ollama configuration:
```python
# From config.py (existing)
OLLAMA_ENABLED = True  # or False; also enables/disables title translation
OLLAMA_BASE_URL = "http://ollama:11434"
OLLAMA_MODEL = "phi3:latest"
OLLAMA_TIMEOUT = 30
```
**Rationale**: Simplifies deployment and maintains consistency. Translation is automatically enabled/disabled with the existing `OLLAMA_ENABLED` flag.
## Deployment Considerations
### Docker Container Updates
**Affected Services**:
- `crawler` service: Needs rebuild to include new translation code
- `sender` service: Needs rebuild to include updated newsletter template
**Deployment Steps**:
1. Update code in `news_crawler/ollama_client.py`
2. Update code in `news_crawler/crawler_service.py`
3. Update template in `news_sender/newsletter_template.html`
4. Rebuild containers: `docker-compose up -d --build crawler sender`
5. No database migration needed (new fields are nullable)
### Backward Compatibility
**Existing Articles**: Articles without `title_en` will display German title only (graceful fallback)
**No Breaking Changes**: Newsletter template handles NULL `title_en` values
### Rollback Plan
If issues arise:
1. Revert code changes
2. Rebuild containers
3. Existing articles with `title_en` will continue to work
4. New articles will only have German titles
## Future Enhancements
### Potential Improvements (Out of Scope)
1. **Batch Translation**: Translate multiple titles in single API call for efficiency
2. **Translation Caching**: Cache common phrases/words to reduce API calls
3. **Multi-language Support**: Add configuration for target language selection
4. **Translation Quality Metrics**: Track and log translation quality scores
5. **Retry Logic**: Implement retry with exponential backoff for failed translations
6. **Admin API**: Add endpoint to re-translate existing articles
These enhancements are not included in the current implementation to maintain simplicity and focus on core functionality.