Munich-news/docs/FEATURES.md
2025-11-12 11:34:33 +01:00

Features Guide

Complete guide to Munich News Daily features.


Core Features

1. Automated News Crawling

  • Fetches articles from RSS feeds
  • Scheduled daily at 6:00 AM Berlin time
  • Extracts full article content
  • Handles multiple news sources

2. AI-Powered Summarization

  • Generates concise summaries (150 words by default)
  • Uses Ollama AI (phi3:latest model)
  • GPU acceleration available (roughly 4-5x faster, per the timings under AI Features)
  • Configurable summary length

3. Title Translation

  • Translates German titles to English
  • Uses Ollama AI
  • Displays both languages in newsletter
  • Stores both versions in database

4. Newsletter Generation

  • Beautiful HTML email template
  • Responsive design
  • Numbered articles
  • Summary statistics
  • Scheduled daily at 7:00 AM Berlin time

5. Engagement Tracking

  • Email open tracking (pixel)
  • Link click tracking
  • Analytics dashboard ready
  • Subscriber engagement metrics

News Crawler

How It Works

1. Fetch RSS feeds from database
2. Parse RSS XML
3. Extract article URLs
4. Fetch full article content
5. Extract text from HTML
6. Translate title (German → English)
7. Generate AI summary
8. Store in MongoDB

Content Extraction

Strategies (in order):

  1. Article Tag - Look for <article> tags
  2. Main Tag - Look for <main> content
  3. Content Divs - Common class names (content, article-body, etc.)
  4. Paragraph Aggregation - Collect all <p> tags
  5. Fallback - Use RSS description

Cleaning:

  • Remove scripts and styles
  • Remove navigation elements
  • Remove ads and sidebars
  • Extract clean text
  • Preserve paragraphs
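
The strategy order and cleaning rules above can be sketched with Python's standard-library HTML parser. This is a minimal illustration, not the crawler's actual code: the names ParagraphCollector and extract_text are hypothetical, the "Content Divs" strategy is omitted for brevity, and ads are not detected (only script, style, nav and aside elements are dropped).

```python
from html.parser import HTMLParser

# Assumption: these tags stand in for "scripts, styles, navigation, sidebars".
SKIP_TAGS = {"script", "style", "nav", "aside"}


class ParagraphCollector(HTMLParser):
    """Collects <p> text and records whether it sat inside <article> or <main>."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0     # > 0 while inside a tag we want to drop
        self.article_depth = 0
        self.main_depth = 0
        self.in_p = False
        self.buffer = []
        self.paragraphs = []    # (text, in_article, in_main)

    def flush(self):
        # Collapse whitespace and store the paragraph gathered so far.
        text = " ".join("".join(self.buffer).split())
        if self.in_p and text:
            self.paragraphs.append(
                (text, self.article_depth > 0, self.main_depth > 0)
            )
        self.in_p = False
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "article":
            self.article_depth += 1
        elif tag == "main":
            self.main_depth += 1
        elif tag == "p":
            self.flush()  # an unclosed <p> ends where the next one starts
            self.in_p = self.skip_depth == 0

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == "article":
            self.flush()
            self.article_depth = max(0, self.article_depth - 1)
        elif tag == "main":
            self.flush()
            self.main_depth = max(0, self.main_depth - 1)
        elif tag == "p":
            self.flush()

    def handle_data(self, data):
        if self.in_p and self.skip_depth == 0:
            self.buffer.append(data)


def extract_text(html, rss_description=""):
    """Try <article>, then <main>, then all paragraphs, then the RSS text."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    parser.flush()
    in_article = [t for t, a, _ in parser.paragraphs if a]
    in_main = [t for t, _, m in parser.paragraphs if m]
    everything = [t for t, _, _ in parser.paragraphs]
    for candidate in (in_article, in_main, everything):
        if candidate:
            return "\n\n".join(candidate)  # blank line keeps paragraph breaks
    return rss_description
```

Joining paragraphs with a blank line is one way to preserve paragraph boundaries for the summarizer; a production extractor would more likely use a dedicated HTML library.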

RSS Feed Handling

Supported Formats:

  • RSS 2.0
  • Atom
  • Custom formats

Extracted Data:

  • Title
  • Link
  • Description/Summary
  • Published date
  • Author (if available)
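
As a rough illustration of how the fields above map onto RSS 2.0 and Atom, the sketch below uses xml.etree.ElementTree from the standard library. The function name parse_feed and the returned dictionary keys are assumptions, and "custom formats" (as well as author elements such as dc:creator) are not handled here.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace in Clark notation


def _text(element, path):
    """Return stripped text at path, or an empty string if missing."""
    return (element.findtext(path) or "").strip()


def parse_feed(xml_text):
    """Extract title, link, description, date and author from a feed."""
    root = ET.fromstring(xml_text)
    entries = []
    if root.tag == "rss":
        for item in root.iter("item"):
            entries.append({
                "title": _text(item, "title"),
                "link": _text(item, "link"),
                "description": _text(item, "description"),
                "published": _text(item, "pubDate"),
                "author": _text(item, "author") or None,
            })
    elif root.tag == ATOM + "feed":
        for entry in root.iter(ATOM + "entry"):
            link = entry.find(ATOM + "link")
            entries.append({
                "title": _text(entry, ATOM + "title"),
                "link": link.get("href", "") if link is not None else "",
                "description": _text(entry, ATOM + "summary"),
                # Atom entries always carry <updated>; <published> is optional.
                "published": _text(entry, ATOM + "published")
                or _text(entry, ATOM + "updated"),
                "author": _text(entry, ATOM + "author/" + ATOM + "name") or None,
            })
    return entries
```

The date is kept as the raw string from the feed; converting it to a datetime is left out because RSS and Atom use different date formats.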

Error Handling:

  • Retry failed requests
  • Skip invalid URLs
  • Log errors
  • Continue with next article
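
The four behaviours above could be combined roughly as follows. This is a sketch under assumptions: fetch_with_retry and crawl are hypothetical names, the fetch callable is injected so the example runs without network access, and the attempt count and linear back-off are illustrative values rather than the crawler's real settings.

```python
import logging
import time

log = logging.getLogger("crawler")


def fetch_with_retry(fetch, url, attempts=3, delay=1.0):
    """Call fetch(url), retrying on network-style errors; None if all fail."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except OSError as exc:  # urllib's URLError and timeouts derive from OSError
            log.warning("attempt %d/%d failed for %s: %s", attempt, attempts, url, exc)
            if attempt < attempts:
                time.sleep(delay * attempt)  # wait a little longer each time
    return None


def crawl(urls, fetch, attempts=3, delay=1.0):
    """Fetch every valid URL, logging failures and moving on to the next one."""
    results = {}
    for url in urls:
        if not url.startswith(("http://", "https://")):
            log.error("skipping invalid URL: %r", url)
            continue
        body = fetch_with_retry(fetch, url, attempts, delay)
        if body is None:
            log.error("giving up on %s", url)
            continue  # one failed article must not stop the run
        results[url] = body
    return results
```

Passing the fetch function in as a parameter keeps the retry logic testable; in a real run it would wrap an HTTP client call.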

AI Features

Summarization

Process:

  1. Send article text to Ollama
  2. Request 150-word summary
  3. Receive AI-generated summary
  4. Store with article

Configuration:

OLLAMA_ENABLED=true
OLLAMA_MODEL=phi3:latest
SUMMARY_MAX_WORDS=150
OLLAMA_TIMEOUT=120

Performance:

  • CPU: ~8s per article
  • GPU: ~2s per article (4x faster)
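
For orientation, a summarization request against Ollama's HTTP API might look like the sketch below. It relies on the POST /api/generate endpoint, which returns the generated text in a "response" field when "stream" is false. The prompt wording and the function names build_summary_request and summarize are assumptions, and the project may call Ollama differently.

```python
import json
import urllib.request


def build_summary_request(article_text, model="phi3:latest", max_words=150,
                          base_url="http://ollama:11434"):
    """Build the HTTP request for a non-streaming Ollama generation call."""
    # Assumption: this prompt is illustrative, not the project's real prompt.
    prompt = (
        f"Summarize the following article in at most {max_words} words:\n\n"
        f"{article_text}"
    )
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize(article_text, timeout=120, **options):
    """Send the request and return the summary text (needs a running Ollama)."""
    request = build_summary_request(article_text, **options)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"].strip()
```

The defaults mirror OLLAMA_MODEL, SUMMARY_MAX_WORDS and OLLAMA_TIMEOUT from the configuration above. The word limit is only a request to the model, so the returned summary can run longer.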

Translation

Process:

  1. Send German title to Ollama
  2. Request English translation
  3. Receive translated title
  4. Store both versions

Configuration:

OLLAMA_ENABLED=true
OLLAMA_MODEL=phi3:latest

Performance:

  • CPU: ~1.5s per title
  • GPU: ~0.3s per title (5x faster)

Newsletter Display:

English Title (Primary)
Original: German Title (Subtitle)

Newsletter System

Template Features

  • Responsive Design - Works on all devices
  • Clean Layout - Easy to read
  • Numbered Articles - Clear organization
  • Summary Box - Quick stats
  • Tracking Links - Click tracking
  • Unsubscribe Link - Easy opt-out

Personalization

  • Greeting message
  • Date formatting
  • Article count
  • Source attribution
  • Author names

Tracking

Open Tracking:

  • Invisible 1x1 pixel image
  • Loaded when the email is opened
  • Records timestamp
  • Tracks unique opens

Click Tracking:

  • All article links tracked
  • Redirect through backend
  • Records click events
  • Tracks which articles were clicked
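
To make the two mechanisms concrete, the sketch below builds a tracking pixel tag and a redirect link. The URL paths (/track/pixel/... and /track/click/...) and the tracking_id parameter are hypothetical; see API.md for the real endpoints.

```python
from urllib.parse import urlencode


def pixel_tag(api_url, tracking_id):
    """Return a 1x1 image tag; the backend logs an open when it is loaded."""
    # Assumption: hypothetical endpoint path, not taken from the real API.
    src = f"{api_url.rstrip('/')}/track/pixel/{tracking_id}"
    return f'<img src="{src}" width="1" height="1" alt="">'


def tracked_link(api_url, tracking_id, article_url):
    """Return a backend URL that records the click, then redirects."""
    # The article URL is percent-encoded so its own query string survives.
    query = urlencode({"url": article_url})
    return f"{api_url.rstrip('/')}/track/click/{tracking_id}?{query}"
```

With TRACKING_API_URL=http://localhost:5001, tracked_link would replace each article link in the newsletter, and the backend would redirect to the decoded url parameter after recording the event.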

Subscriber Management

Status System

Status     Description     Receives Newsletters
active     Subscribed      Yes
inactive   Unsubscribed    No

Operations

Subscribe:

curl -X POST http://localhost:5001/api/subscribe \
  -H "Content-Type: application/json" \
  -d '{"email": "user@example.com"}'

Unsubscribe:

curl -X POST http://localhost:5001/api/unsubscribe \
  -H "Content-Type: application/json" \
  -d '{"email": "user@example.com"}'

Check Stats:

curl http://localhost:5001/api/admin/stats | jq '.subscribers'

Admin Features

Manual Crawl

Trigger crawl anytime:

curl -X POST http://localhost:5001/api/admin/trigger-crawl \
  -H "Content-Type: application/json" \
  -d '{"max_articles": 10}'

Test Email

Send test newsletter:

curl -X POST http://localhost:5001/api/admin/send-test-email \
  -H "Content-Type: application/json" \
  -d '{"email": "test@example.com"}'

Send Newsletter

Send to all subscribers:

curl -X POST http://localhost:5001/api/admin/send-newsletter \
  -H "Content-Type: application/json" \
  -d '{"max_articles": 10}'

System Stats

View system statistics:

curl http://localhost:5001/api/admin/stats

Automation

Scheduled Tasks

Crawler (6:00 AM Berlin time):

  • Fetches new articles
  • Processes with AI
  • Stores in database

Sender (7:00 AM Berlin time):

  • Waits for crawler to finish
  • Fetches today's articles
  • Generates newsletter
  • Sends to all active subscribers
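
Because both jobs are pinned to Berlin time, the next run has to be computed in that time zone rather than in UTC. The sketch below shows one way to do this with the standard-library zoneinfo module (Python 3.9+, with time zone data installed). The function name next_run is an assumption, and the real services may use a scheduling library instead.

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

BERLIN = ZoneInfo("Europe/Berlin")


def next_run(now, hour, minute=0):
    """Return the next occurrence of hour:minute in Berlin local time."""
    local = now.astimezone(BERLIN)
    candidate = local.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= local:
        # Already passed today; adding a day keeps the same wall-clock time,
        # so the schedule stays at 6:00 or 7:00 across daylight-saving changes.
        candidate += timedelta(days=1)
    return candidate


now = datetime.now(tz=BERLIN)
print("next crawl:", next_run(now, 6).isoformat())
print("next send: ", next_run(now, 7).isoformat())
```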

Manual Execution

# Run crawler manually
docker-compose exec crawler python crawler_service.py 10

# Run sender manually
docker-compose exec sender python sender_service.py send 10

# Send test email
docker-compose exec sender python sender_service.py test your@email.com

Configuration

Environment Variables

# Newsletter Settings
NEWSLETTER_MAX_ARTICLES=10
NEWSLETTER_HOURS_LOOKBACK=24
WEBSITE_URL=http://localhost:3000

# Ollama AI
OLLAMA_ENABLED=true
OLLAMA_BASE_URL=http://ollama:11434
OLLAMA_MODEL=phi3:latest
OLLAMA_TIMEOUT=120
SUMMARY_MAX_WORDS=150

# Tracking
TRACKING_ENABLED=true
TRACKING_API_URL=http://localhost:5001
TRACKING_DATA_RETENTION_DAYS=90
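
A service could read these variables into typed settings along the following lines. This is only a sketch: the Settings class, the load_settings function and the accepted spellings for boolean values are assumptions, and the defaults simply repeat the values shown above.

```python
import os
from dataclasses import dataclass


def _as_bool(value):
    # Assumption: these spellings count as true; the real parser may differ.
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    newsletter_max_articles: int
    newsletter_hours_lookback: int
    ollama_enabled: bool
    ollama_base_url: str
    ollama_model: str
    ollama_timeout: int
    summary_max_words: int
    tracking_enabled: bool
    tracking_data_retention_days: int


def load_settings(env=None):
    """Read settings from env (defaults to os.environ) with typed values."""
    env = os.environ if env is None else env
    return Settings(
        newsletter_max_articles=int(env.get("NEWSLETTER_MAX_ARTICLES", "10")),
        newsletter_hours_lookback=int(env.get("NEWSLETTER_HOURS_LOOKBACK", "24")),
        ollama_enabled=_as_bool(env.get("OLLAMA_ENABLED", "true")),
        ollama_base_url=env.get("OLLAMA_BASE_URL", "http://ollama:11434"),
        ollama_model=env.get("OLLAMA_MODEL", "phi3:latest"),
        ollama_timeout=int(env.get("OLLAMA_TIMEOUT", "120")),
        summary_max_words=int(env.get("SUMMARY_MAX_WORDS", "150")),
        tracking_enabled=_as_bool(env.get("TRACKING_ENABLED", "true")),
        tracking_data_retention_days=int(
            env.get("TRACKING_DATA_RETENTION_DAYS", "90")
        ),
    )
```

Converting values once at startup means a malformed number fails immediately with a ValueError instead of surfacing later during a crawl.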

RSS Feeds

Add feeds in MongoDB:

db.rss_feeds.insertOne({
  name: "Süddeutsche Zeitung München",
  url: "https://www.sueddeutsche.de/muenchen/rss",
  active: true
})

Performance Optimization

GPU Acceleration

Enable for roughly 4-5x faster processing, based on the timings below:

./start-with-gpu.sh

Benefits:

  • Faster summarization (8s → 2s)
  • Faster translation (1.5s → 0.3s)
  • Process more articles
  • Lower CPU usage

Batch Processing

Process multiple articles efficiently:

  • Model stays loaded in memory
  • Reduced overhead
  • Better throughput

Caching

  • Model caching (Ollama)
  • Database connection pooling
  • Persistent storage

Monitoring

Logs

# Crawler logs
docker-compose logs -f crawler

# Sender logs
docker-compose logs -f sender

# Backend logs
docker-compose logs -f backend

Metrics

  • Articles crawled
  • Summaries generated
  • Newsletters sent
  • Open rate
  • Click-through rate
  • Processing time
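
The two engagement rates can be derived from raw counts as sketched below. The definitions are assumptions, since the source does not state them: open rate is taken as unique opens divided by newsletters sent, and click-through rate as unique clicks divided by newsletters sent. The function name engagement_rates is hypothetical.

```python
def engagement_rates(sent, unique_opens, unique_clicks):
    """Return open rate and click-through rate as fractions of emails sent."""
    if sent <= 0:
        # Nothing was sent, so both rates are reported as zero.
        return {"open_rate": 0.0, "click_through_rate": 0.0}
    return {
        "open_rate": unique_opens / sent,
        "click_through_rate": unique_clicks / sent,
    }


rates = engagement_rates(sent=200, unique_opens=50, unique_clicks=10)
print(f"open rate: {rates['open_rate']:.1%}")
print(f"click-through rate: {rates['click_through_rate']:.1%}")
```

Some tools define click-through rate relative to opens rather than sends, so it is worth checking which definition the stats endpoint uses before comparing numbers.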

Health Checks

# Backend health
curl http://localhost:5001/health

# System stats
curl http://localhost:5001/api/admin/stats

Troubleshooting

Crawler Issues

No articles found:

  • Check RSS feed URLs
  • Verify feeds are active
  • Check network connectivity

Extraction failed:

  • Article structure changed
  • Paywall detected
  • Network timeout

AI processing failed:

  • Ollama not running
  • Model not downloaded
  • Timeout too short

Newsletter Issues

Not sending:

  • Check email configuration
  • Verify SMTP credentials
  • Check subscriber count

Tracking not working:

  • Verify tracking enabled
  • Check that the backend API is accessible
  • Verify tracking URLs

See SETUP.md for configuration, API.md for API reference, and ARCHITECTURE.md for system design.