415 lines
7.3 KiB
Markdown
# Features Guide

Complete guide to Munich News Daily features.

---

## Core Features

### 1. Automated News Crawling
- Fetches articles from RSS feeds
- Scheduled daily at 6:00 AM Berlin time
- Extracts full article content
- Handles multiple news sources

### 2. AI-Powered Summarization
- Generates concise summaries (up to 150 words by default)
- Uses Ollama AI (phi3:latest model)
- GPU acceleration available (roughly 4-5x faster)
- Configurable summary length

### 3. Title Translation
- Translates German titles to English
- Uses Ollama AI
- Displays both languages in newsletter
- Stores both versions in database

### 4. Newsletter Generation
- Beautiful HTML email template
- Responsive design
- Numbered articles
- Summary statistics
- Scheduled daily at 7:00 AM Berlin time

### 5. Engagement Tracking
- Email open tracking (pixel)
- Link click tracking
- Analytics dashboard ready
- Subscriber engagement metrics

---

## News Crawler

### How It Works

```
1. Fetch RSS feeds from database
2. Parse RSS XML
3. Extract article URLs
4. Fetch full article content
5. Extract text from HTML
6. Translate title (German → English)
7. Generate AI summary
8. Store in MongoDB
```

### Content Extraction

**Strategies (in order):**

1. **Article Tag** - Look for `<article>` tags
2. **Main Tag** - Look for `<main>` content
3. **Content Divs** - Common class names (content, article-body, etc.)
4. **Paragraph Aggregation** - Collect all `<p>` tags
5. **Fallback** - Use RSS description

**Cleaning:**
- Remove scripts and styles
- Remove navigation elements
- Remove ads and sidebars
- Extract clean text
- Preserve paragraphs

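
The fallback order above can be sketched with the Python standard library alone. This is a minimal illustration, not the crawler's actual code: `TagTextCollector` and `extract_text` are hypothetical names, the class-name strategy (step 3) is left out for brevity, and the set of skipped tags is an assumption.

```python
from html.parser import HTMLParser

# Assumption: these tags are treated as noise and dropped entirely.
SKIP_TAGS = {"script", "style", "nav", "aside"}
# Tags that end the current paragraph when they open or close.
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3"}


class TagTextCollector(HTMLParser):
    """Collects paragraphs of text found inside one container tag."""

    def __init__(self, container):
        super().__init__()
        self.container = container
        self.container_depth = 0  # > 0 while inside the container tag
        self.skip_depth = 0       # > 0 while inside a skipped tag
        self.paragraphs = []
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == self.container:
            self.container_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush_paragraph()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == self.container:
            self.flush_paragraph()
            self.container_depth = max(0, self.container_depth - 1)
        elif tag in BLOCK_TAGS:
            self.flush_paragraph()

    def handle_data(self, data):
        if self.container_depth > 0 and self.skip_depth == 0:
            self.chunks.append(data)

    def flush_paragraph(self):
        # Collapse whitespace and store the paragraph if it is non-empty.
        text = " ".join("".join(self.chunks).split())
        if text:
            self.paragraphs.append(text)
        self.chunks = []


def extract_text(html, rss_description=""):
    """Try <article>, then <main>, then all <p> tags, then the RSS description."""
    for container in ("article", "main", "p"):
        collector = TagTextCollector(container)
        collector.feed(html)
        collector.close()
        collector.flush_paragraph()
        if collector.paragraphs:
            return "\n\n".join(collector.paragraphs)
    return rss_description
```

Each strategy runs only if the previous one found no text, so a page with a proper `<article>` element never reaches the broader paragraph sweep.
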
### RSS Feed Handling

**Supported Formats:**
- RSS 2.0
- Atom
- Custom formats

**Extracted Data:**
- Title
- Link
- Description/Summary
- Published date
- Author (if available)

**Error Handling:**
- Retry failed requests
- Skip invalid URLs
- Log errors
- Continue with next article

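
As a rough illustration of the extracted fields and the "skip invalid URLs, continue with next article" behaviour, the sketch below parses an RSS 2.0 document with the standard library. The function name `parse_rss_items` and the returned dictionary keys are assumptions, and Atom feeds (which use `<entry>` elements) are not covered here.

```python
import xml.etree.ElementTree as ET


def parse_rss_items(xml_text):
    """Return one dict per valid <item> in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        if not link.startswith(("http://", "https://")):
            # Skip invalid URLs and continue with the next article.
            continue
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "link": link,
            "description": (item.findtext("description") or "").strip(),
            "published": (item.findtext("pubDate") or "").strip(),
            # Author is optional in RSS 2.0, so fall back to None.
            "author": (item.findtext("author") or "").strip() or None,
        })
    return items
```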
---

## AI Features

### Summarization

**Process:**
1. Send article text to Ollama
2. Request 150-word summary
3. Receive AI-generated summary
4. Store with article

**Configuration:**
```env
OLLAMA_ENABLED=true
OLLAMA_MODEL=phi3:latest
SUMMARY_MAX_WORDS=150
OLLAMA_TIMEOUT=120
```

**Performance:**
- CPU: ~8s per article
- GPU: ~2s per article (4x faster)

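
The summarization request can be sketched against Ollama's `/api/generate` endpoint, which accepts a JSON body with `model`, `prompt`, and `stream` and returns the generated text in a `response` field. The prompt wording and the function names `build_summary_payload` and `summarize` are assumptions for illustration; the project's real prompt may differ.

```python
import json
import urllib.request


def build_summary_payload(article_text, model="phi3:latest", max_words=150):
    """Build the JSON body for a non-streaming Ollama generate call."""
    # Assumption: the prompt text is illustrative only.
    prompt = (
        f"Summarize the following article in at most {max_words} words:\n\n"
        f"{article_text}"
    )
    return {"model": model, "prompt": prompt, "stream": False}


def summarize(article_text, base_url="http://ollama:11434", timeout=120):
    """Send the article to Ollama and return the generated summary."""
    payload = build_summary_payload(article_text)
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),  # a body makes this a POST
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"].strip()
```

The `timeout` argument plays the role of `OLLAMA_TIMEOUT`: a slow CPU-only run that exceeds it raises an error instead of blocking the crawl.
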
### Translation

**Process:**
1. Send German title to Ollama
2. Request English translation
3. Receive translated title
4. Store both versions

**Configuration:**
```env
OLLAMA_ENABLED=true
OLLAMA_MODEL=phi3:latest
```

**Performance:**
- CPU: ~1.5s per title
- GPU: ~0.3s per title (5x faster)

**Newsletter Display:**
```
English Title (Primary)
Original: German Title (Subtitle)
```

---

## Newsletter System

### Template Features

- **Responsive Design** - Works on all devices
- **Clean Layout** - Easy to read
- **Numbered Articles** - Clear organization
- **Summary Box** - Quick stats
- **Tracking Links** - Click tracking
- **Unsubscribe Link** - Easy opt-out

### Personalization

- Greeting message
- Date formatting
- Article count
- Source attribution
- Author names

### Tracking

**Open Tracking:**
- Invisible 1x1 pixel image
- Loaded when the email is opened
- Records timestamp
- Tracks unique opens

**Click Tracking:**
- All article links are tracked
- Links redirect through the backend
- Records click events
- Tracks which articles were clicked

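
A minimal sketch of how the pixel and the redirect links might be generated. The endpoint paths `/api/track/open` and `/api/track/click`, the `id` and `url` query parameters, and both function names are hypothetical; see the API reference for the real routes.

```python
from urllib.parse import urlencode


def tracking_pixel_html(api_url, tracking_id):
    """Return an <img> tag for a 1x1 pixel that reports an open."""
    # Assumption: /api/track/open is a placeholder path.
    query = urlencode({"id": tracking_id})
    return f'<img src="{api_url}/api/track/open?{query}" width="1" height="1" alt="">'


def tracked_link(api_url, tracking_id, article_url):
    """Wrap an article URL so the click passes through the backend first."""
    # Assumption: /api/track/click is a placeholder path. The target URL is
    # percent-encoded so its own query string survives the redirect.
    query = urlencode({"id": tracking_id, "url": article_url})
    return f"{api_url}/api/track/click?{query}"
```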
---

## Subscriber Management

### Status System

| Status | Description | Receives Newsletters |
|--------|-------------|---------------------|
| `active` | Subscribed | ✅ Yes |
| `inactive` | Unsubscribed | ❌ No |

### Operations

**Subscribe:**
```bash
curl -X POST http://localhost:5001/api/subscribe \
  -H "Content-Type: application/json" \
  -d '{"email": "user@example.com"}'
```

**Unsubscribe:**
```bash
curl -X POST http://localhost:5001/api/unsubscribe \
  -H "Content-Type: application/json" \
  -d '{"email": "user@example.com"}'
```

**Check Stats:**
```bash
curl http://localhost:5001/api/admin/stats | jq '.subscribers'
```

---

## Admin Features

### Manual Crawl

Trigger a crawl at any time:
```bash
curl -X POST http://localhost:5001/api/admin/trigger-crawl \
  -H "Content-Type: application/json" \
  -d '{"max_articles": 10}'
```

### Test Email

Send a test newsletter:
```bash
curl -X POST http://localhost:5001/api/admin/send-test-email \
  -H "Content-Type: application/json" \
  -d '{"email": "test@example.com"}'
```

### Send Newsletter

Send the newsletter to all active subscribers:
```bash
curl -X POST http://localhost:5001/api/admin/send-newsletter \
  -H "Content-Type: application/json" \
  -d '{"max_articles": 10}'
```

### System Stats

View system statistics:
```bash
curl http://localhost:5001/api/admin/stats
```

---

## Automation

### Scheduled Tasks

**Crawler (6:00 AM Berlin time):**
- Fetches new articles
- Processes with AI
- Stores in database

**Sender (7:00 AM Berlin time):**
- Waits for crawler to finish
- Fetches today's articles
- Generates newsletter
- Sends to all active subscribers

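
Because both jobs are pinned to Berlin wall-clock time, the UTC offset of each run shifts with daylight saving. The sketch below shows one way to compute the next run using the standard `zoneinfo` module; `next_run` is a hypothetical helper, not the scheduler the services actually use.

```python
from datetime import datetime, time, timedelta, timezone
from zoneinfo import ZoneInfo

BERLIN = ZoneInfo("Europe/Berlin")
CRAWL_AT = time(6, 0)  # crawler schedule from the list above
SEND_AT = time(7, 0)   # sender schedule from the list above


def next_run(now, run_at):
    """Return the next occurrence of run_at (Berlin time) strictly after now."""
    local_now = now.astimezone(BERLIN)
    candidate = datetime.combine(local_now.date(), run_at, tzinfo=BERLIN)
    if candidate <= local_now:
        # Today's slot has already passed, so schedule for tomorrow.
        tomorrow = local_now.date() + timedelta(days=1)
        candidate = datetime.combine(tomorrow, run_at, tzinfo=BERLIN)
    return candidate


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print("Next crawl:", next_run(now, CRAWL_AT).isoformat())
    print("Next send: ", next_run(now, SEND_AT).isoformat())
```

The one-hour gap between the two slots is what gives the crawler time to finish before the sender starts.
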
### Manual Execution

```bash
# Run crawler manually
docker-compose exec crawler python crawler_service.py 10

# Run sender manually
docker-compose exec sender python sender_service.py send 10

# Send test email
docker-compose exec sender python sender_service.py test your@email.com
```

---

## Configuration

### Environment Variables

```env
# Newsletter Settings
NEWSLETTER_MAX_ARTICLES=10
NEWSLETTER_HOURS_LOOKBACK=24
WEBSITE_URL=http://localhost:3000

# Ollama AI
OLLAMA_ENABLED=true
OLLAMA_BASE_URL=http://ollama:11434
OLLAMA_MODEL=phi3:latest
OLLAMA_TIMEOUT=120
SUMMARY_MAX_WORDS=150

# Tracking
TRACKING_ENABLED=true
TRACKING_API_URL=http://localhost:5001
TRACKING_DATA_RETENTION_DAYS=90
```

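
For orientation, here is one way a service could read these variables into typed settings. The `Settings` class and `load_settings` function are hypothetical, only a subset of the variables is shown, and the fallback defaults simply mirror the example values above; whether the real services use the same defaults is an assumption.

```python
import os
from dataclasses import dataclass


def _env_bool(env, name, default):
    """Interpret common truthy strings such as 'true' or '1'."""
    return env.get(name, str(default)).strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    max_articles: int
    hours_lookback: int
    ollama_enabled: bool
    ollama_base_url: str
    ollama_model: str
    ollama_timeout: int
    summary_max_words: int
    tracking_enabled: bool


def load_settings(env=os.environ):
    """Build Settings from a mapping of environment variables."""
    return Settings(
        max_articles=int(env.get("NEWSLETTER_MAX_ARTICLES", "10")),
        hours_lookback=int(env.get("NEWSLETTER_HOURS_LOOKBACK", "24")),
        ollama_enabled=_env_bool(env, "OLLAMA_ENABLED", True),
        ollama_base_url=env.get("OLLAMA_BASE_URL", "http://ollama:11434"),
        ollama_model=env.get("OLLAMA_MODEL", "phi3:latest"),
        ollama_timeout=int(env.get("OLLAMA_TIMEOUT", "120")),
        summary_max_words=int(env.get("SUMMARY_MAX_WORDS", "150")),
        tracking_enabled=_env_bool(env, "TRACKING_ENABLED", True),
    )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests without touching the process environment.
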
### RSS Feeds

Add feeds in MongoDB:
```javascript
db.rss_feeds.insertOne({
  name: "Süddeutsche Zeitung München",
  url: "https://www.sueddeutsche.de/muenchen/rss",
  active: true
})
```

---

## Performance Optimization

### GPU Acceleration

Enable GPU acceleration for roughly 4-5x faster processing:
```bash
./start-with-gpu.sh
```

**Benefits:**
- Faster summarization (8s → 2s)
- Faster translation (1.5s → 0.3s)
- Process more articles
- Lower CPU usage

### Batch Processing

Process multiple articles efficiently:
- Model stays loaded in memory
- Reduced overhead
- Better throughput

### Caching

- Model caching (Ollama)
- Database connection pooling
- Persistent storage

---

## Monitoring

### Logs

```bash
# Crawler logs
docker-compose logs -f crawler

# Sender logs
docker-compose logs -f sender

# Backend logs
docker-compose logs -f backend
```

### Metrics

- Articles crawled
- Summaries generated
- Newsletters sent
- Open rate
- Click-through rate
- Processing time

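
Open rate and click-through rate can be derived from the raw tracking counts. The sketch below assumes both rates are measured against the number of newsletters sent; some tools divide clicks by opens instead, and this guide does not say which definition the backend uses. `engagement_rates` is a hypothetical helper.

```python
def engagement_rates(sent, unique_opens, unique_clicks):
    """Return open rate and click-through rate as percentages of sent emails."""
    if sent <= 0:
        # Avoid division by zero when nothing has been sent yet.
        return {"open_rate": 0.0, "click_through_rate": 0.0}
    return {
        "open_rate": round(100 * unique_opens / sent, 1),
        "click_through_rate": round(100 * unique_clicks / sent, 1),
    }
```

For example, 90 unique opens and 30 unique clicks out of 200 sent newsletters give an open rate of 45% and a click-through rate of 15%.
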
### Health Checks

```bash
# Backend health
curl http://localhost:5001/health

# System stats
curl http://localhost:5001/api/admin/stats
```

---

## Troubleshooting

### Crawler Issues

**No articles found:**
- Check RSS feed URLs
- Verify feeds are active
- Check network connectivity

**Extraction failed:**
- Article structure changed
- Paywall detected
- Network timeout

**AI processing failed:**
- Ollama not running
- Model not downloaded
- Timeout too short

### Newsletter Issues

**Not sending:**
- Check email configuration
- Verify SMTP credentials
- Check subscriber count

**Tracking not working:**
- Verify tracking is enabled
- Check that the backend API is accessible
- Verify tracking URLs

---

See [SETUP.md](SETUP.md) for configuration, [API.md](API.md) for API reference, and [ARCHITECTURE.md](ARCHITECTURE.md) for system design.