# Features Guide
Complete guide to Munich News Daily features.

---
## Core Features
### 1. Automated News Crawling
- Fetches articles from RSS feeds
- Scheduled daily at 6:00 AM Berlin time
- Extracts full article content
- Handles multiple news sources
### 2. AI-Powered Summarization
- Generates concise summaries (default: 150 words)
- Uses Ollama AI (phi3:latest model)
- GPU acceleration available (4-5x faster)
- Configurable summary length
### 3. Title Translation
- Translates German titles to English
- Uses Ollama AI
- Displays both languages in newsletter
- Stores both versions in database
### 4. Newsletter Generation
- Beautiful HTML email template
- Responsive design
- Numbered articles
- Summary statistics
- Scheduled daily at 7:00 AM Berlin time
### 5. Engagement Tracking
- Email open tracking (pixel)
- Link click tracking
- Analytics dashboard ready
- Subscriber engagement metrics
---
## News Crawler
### How It Works
```
1. Fetch RSS feeds from database
2. Parse RSS XML
3. Extract article URLs
4. Fetch full article content
5. Extract text from HTML
6. Translate title (German → English)
7. Generate AI summary
8. Store in MongoDB
```
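Steps 2-3 above can be sketched in Python. This is a minimal, hypothetical illustration of RSS 2.0 parsing and URL extraction using only the standard library; the actual crawler may use a dedicated feed parser instead, and the sample feed below is made up.

```python
# Hypothetical sketch of pipeline steps 2-3: parse RSS 2.0 XML and
# extract article title/link pairs with the standard library.
import xml.etree.ElementTree as ET

RSS_SAMPLE = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Munich Feed</title>
  <item><title>Titel 1</title><link>https://example.com/a1</link></item>
  <item><title>Titel 2</title><link>https://example.com/a2</link></item>
</channel></rss>"""

def extract_articles(rss_xml: str) -> list[dict]:
    """Return title/link pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    return [
        {"title": item.findtext("title"), "link": item.findtext("link")}
        for item in root.iter("item")
    ]

articles = extract_articles(RSS_SAMPLE)
```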
### Content Extraction
**Strategies (in order):**
1. **Article Tag** - Look for `<article>` tags
2. **Main Tag** - Look for `<main>` content
3. **Content Divs** - Common class names (content, article-body, etc.)
4. **Paragraph Aggregation** - Collect all `<p>` tags
5. **Fallback** - Use RSS description
**Cleaning:**
- Remove scripts and styles
- Remove navigation elements
- Remove ads and sidebars
- Extract clean text
- Preserve paragraphs
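As a rough sketch of strategy 4 (paragraph aggregation) combined with the script/style cleaning above, using only Python's standard library — the service's real extractor may use a different HTML library and class names:

```python
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Collect visible text inside <p> tags, skipping <script>/<style>."""
    def __init__(self):
        super().__init__()
        self._in_p = False      # currently inside a <p> tag
        self._skip = 0          # depth of script/style nesting
        self._buf = []
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "p" and not self._skip:
            self._in_p = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
        elif tag == "p" and self._in_p:
            self._in_p = False
            text = "".join(self._buf).strip()
            if text:
                self.paragraphs.append(text)

    def handle_data(self, data):
        if self._in_p and not self._skip:
            self._buf.append(data)

def paragraph_fallback(html_doc: str) -> str:
    """Aggregate all <p> text, preserving paragraph breaks."""
    parser = ParagraphExtractor()
    parser.feed(html_doc)
    return "\n\n".join(parser.paragraphs)
```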
### RSS Feed Handling
**Supported Formats:**
- RSS 2.0
- Atom
- Custom formats
**Extracted Data:**
- Title
- Link
- Description/Summary
- Published date
- Author (if available)
**Error Handling:**
- Retry failed requests
- Skip invalid URLs
- Log errors
- Continue with next article
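The retry-and-continue behavior can be sketched as a small wrapper; the function name, retry count, and logging style here are illustrative, not the service's actual implementation:

```python
import time

def fetch_with_retry(fetch, url, retries=3, delay=0.0):
    """Try fetch(url) up to `retries` times; return None on repeated
    failure so the caller can log it and continue with the next article."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception as exc:
            print(f"attempt {attempt + 1} failed for {url}: {exc}")
            time.sleep(delay)
    return None
```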
---
## AI Features
### Summarization
**Process:**
1. Send article text to Ollama
2. Request 150-word summary
3. Receive AI-generated summary
4. Store with article
**Configuration:**
```env
OLLAMA_ENABLED=true
OLLAMA_MODEL=phi3:latest
SUMMARY_MAX_WORDS=150
OLLAMA_TIMEOUT=120
```
**Performance:**
- CPU: ~8s per article
- GPU: ~2s per article (4x faster)
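A minimal sketch of the summarization call against Ollama's REST API (`POST /api/generate` with `stream: false`), wired to the environment variables above. The function names and prompt wording are assumptions; only the endpoint and payload shape follow Ollama's documented API.

```python
import json
import os
import urllib.request

OLLAMA_BASE_URL = os.getenv("OLLAMA_BASE_URL", "http://ollama:11434")
OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "phi3:latest")
SUMMARY_MAX_WORDS = int(os.getenv("SUMMARY_MAX_WORDS", "150"))

def build_summary_prompt(text: str, max_words: int = SUMMARY_MAX_WORDS) -> str:
    """Illustrative prompt; the real service's wording may differ."""
    return f"Summarize the following article in at most {max_words} words:\n\n{text}"

def summarize(text: str, timeout: int = 120) -> str:
    payload = json.dumps({
        "model": OLLAMA_MODEL,
        "prompt": build_summary_prompt(text),
        "stream": False,  # return one JSON object instead of a token stream
    }).encode()
    req = urllib.request.Request(
        f"{OLLAMA_BASE_URL}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"].strip()
```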
### Translation
**Process:**
1. Send German title to Ollama
2. Request English translation
3. Receive translated title
4. Store both versions
**Configuration:**
```env
OLLAMA_ENABLED=true
OLLAMA_MODEL=phi3:latest
```
**Performance:**
- CPU: ~1.5s per title
- GPU: ~0.3s per title (5x faster)
**Newsletter Display:**
```
English Title (Primary)
Original: German Title (Subtitle)
```
---
## Newsletter System
### Template Features
- **Responsive Design** - Works on all devices
- **Clean Layout** - Easy to read
- **Numbered Articles** - Clear organization
- **Summary Box** - Quick stats
- **Tracking Links** - Click tracking
- **Unsubscribe Link** - Easy opt-out
### Personalization
- Greeting message
- Date formatting
- Article count
- Source attribution
- Author names
### Tracking
**Open Tracking:**
- Invisible 1x1 pixel image
- Loaded when email opened
- Records timestamp
- Tracks unique opens
**Click Tracking:**
- All article links tracked
- Redirect through backend
- Records click events
- Tracks which articles clicked
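One way to implement both mechanisms is to embed a pixel URL in the email and rewrite article links through the backend. The endpoint paths (`/api/track/open`, `/api/track/click`) and query parameter names below are hypothetical; only `TRACKING_API_URL` comes from the configuration.

```python
import os
from urllib.parse import urlencode

TRACKING_API_URL = os.getenv("TRACKING_API_URL", "http://localhost:5001")

# Smallest valid transparent 1x1 GIF, served as the open-tracking pixel.
PIXEL_GIF = (b"GIF89a\x01\x00\x01\x00\x80\x00\x00\x00\x00\x00\x00\x00\x00"
             b"!\xf9\x04\x01\x00\x00\x00\x00,\x00\x00\x00\x00\x01\x00\x01"
             b"\x00\x00\x02\x02D\x01\x00;")

def pixel_url(newsletter_id: str, subscriber_id: str) -> str:
    """URL embedded as <img src=...>; loading it records an open event."""
    query = urlencode({"n": newsletter_id, "s": subscriber_id})
    return f"{TRACKING_API_URL}/api/track/open?{query}"

def tracked_link(article_url: str, newsletter_id: str, subscriber_id: str) -> str:
    """Rewrite an article link so clicks redirect through the backend."""
    query = urlencode({"url": article_url, "n": newsletter_id, "s": subscriber_id})
    return f"{TRACKING_API_URL}/api/track/click?{query}"
```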
---
## Subscriber Management
### Status System
| Status | Description | Receives Newsletters |
|--------|-------------|---------------------|
| `active` | Subscribed | ✅ Yes |
| `inactive` | Unsubscribed | ❌ No |
### Operations
**Subscribe:**
```bash
curl -X POST http://localhost:5001/api/subscribe \
  -H "Content-Type: application/json" \
  -d '{"email": "user@example.com"}'
```
**Unsubscribe:**
```bash
curl -X POST http://localhost:5001/api/unsubscribe \
  -H "Content-Type: application/json" \
  -d '{"email": "user@example.com"}'
```
**Check Stats:**
```bash
curl http://localhost:5001/api/admin/stats | jq '.subscribers'
```
---
## Admin Features
### Manual Crawl
Trigger crawl anytime:
```bash
curl -X POST http://localhost:5001/api/admin/trigger-crawl \
  -H "Content-Type: application/json" \
  -d '{"max_articles": 10}'
```
### Test Email
Send test newsletter:
```bash
curl -X POST http://localhost:5001/api/admin/send-test-email \
  -H "Content-Type: application/json" \
  -d '{"email": "test@example.com"}'
```
### Send Newsletter
Send to all subscribers:
```bash
curl -X POST http://localhost:5001/api/admin/send-newsletter \
  -H "Content-Type: application/json" \
  -d '{"max_articles": 10}'
```
### System Stats
View system statistics:
```bash
curl http://localhost:5001/api/admin/stats
```
---
## Automation
### Scheduled Tasks
**Crawler (6:00 AM Berlin time):**
- Fetches new articles
- Processes with AI
- Stores in database
**Sender (7:00 AM Berlin time):**
- Waits for crawler to finish
- Fetches today's articles
- Generates newsletter
- Sends to all active subscribers
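Computing the next 6:00 or 7:00 AM run in Berlin time can be sketched with Python's `zoneinfo`; this illustrates the scheduling logic only and is not the services' actual scheduler:

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

BERLIN = ZoneInfo("Europe/Berlin")

def next_run(now: datetime, hour: int) -> datetime:
    """Next occurrence of `hour`:00 Berlin time strictly after `now`."""
    local = now.astimezone(BERLIN)
    candidate = local.replace(hour=hour, minute=0, second=0, microsecond=0)
    if candidate <= local:
        candidate += timedelta(days=1)  # today's slot already passed
    return candidate
```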
### Manual Execution
```bash
# Run crawler manually
docker-compose exec crawler python crawler_service.py 10
# Run sender manually
docker-compose exec sender python sender_service.py send 10
# Send test email
docker-compose exec sender python sender_service.py test your@email.com
```
---
## Configuration
### Environment Variables
```env
# Newsletter Settings
NEWSLETTER_MAX_ARTICLES=10
NEWSLETTER_HOURS_LOOKBACK=24
WEBSITE_URL=http://localhost:3000
# Ollama AI
OLLAMA_ENABLED=true
OLLAMA_BASE_URL=http://ollama:11434
OLLAMA_MODEL=phi3:latest
OLLAMA_TIMEOUT=120
SUMMARY_MAX_WORDS=150
# Tracking
TRACKING_ENABLED=true
TRACKING_API_URL=http://localhost:5001
TRACKING_DATA_RETENTION_DAYS=90
```
### RSS Feeds
Add feeds in MongoDB:
```javascript
db.rss_feeds.insertOne({
  name: "Süddeutsche Zeitung München",
  url: "https://www.sueddeutsche.de/muenchen/rss",
  active: true
})
```
---
## Performance Optimization
### GPU Acceleration
Enable for 4-5x faster processing:
```bash
./start-with-gpu.sh
```
**Benefits:**
- Faster summarization (8s → 2s)
- Faster translation (1.5s → 0.3s)
- Process more articles
- Lower CPU usage
### Batch Processing
Process multiple articles efficiently:
- Model stays loaded in memory
- Reduced overhead
- Better throughput
### Caching
- Model caching (Ollama)
- Database connection pooling
- Persistent storage
---
## Monitoring
### Logs
```bash
# Crawler logs
docker-compose logs -f crawler
# Sender logs
docker-compose logs -f sender
# Backend logs
docker-compose logs -f backend
```
### Metrics
- Articles crawled
- Summaries generated
- Newsletters sent
- Open rate
- Click-through rate
- Processing time
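Open rate and click-through rate can be derived from the tracked events. A small sketch; the function name and the choice of unique opens/clicks per emails sent are assumptions:

```python
def engagement_rates(sent: int, unique_opens: int, unique_clicks: int) -> dict:
    """Open rate and click-through rate as percentages of emails sent."""
    if sent == 0:
        return {"open_rate": 0.0, "click_rate": 0.0}
    return {
        "open_rate": round(100 * unique_opens / sent, 1),
        "click_rate": round(100 * unique_clicks / sent, 1),
    }
```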
### Health Checks
```bash
# Backend health
curl http://localhost:5001/health
# System stats
curl http://localhost:5001/api/admin/stats
```
---
## Troubleshooting
### Crawler Issues
**No articles found:**
- Check RSS feed URLs
- Verify feeds are active
- Check network connectivity
**Extraction failed:**
- Article structure changed
- Paywall detected
- Network timeout
**AI processing failed:**
- Ollama not running
- Model not downloaded
- Timeout too short
### Newsletter Issues
**Not sending:**
- Check email configuration
- Verify SMTP credentials
- Check subscriber count
**Tracking not working:**
- Verify tracking enabled
- Check backend API accessible
- Verify tracking URLs
---
See [SETUP.md](SETUP.md) for configuration, [API.md](API.md) for API reference, and [ARCHITECTURE.md](ARCHITECTURE.md) for system design.