update
414
docs/FEATURES.md
Normal file
@@ -0,0 +1,414 @@
# Features Guide

Complete guide to Munich News Daily features.

---

## Core Features

### 1. Automated News Crawling
- Fetches articles from RSS feeds
- Scheduled daily at 6:00 AM Berlin time
- Extracts full article content
- Handles multiple news sources

### 2. AI-Powered Summarization
- Generates concise summaries (150 words by default)
- Uses Ollama AI (phi3:latest model)
- GPU acceleration available (roughly 4-5x faster, per the timings under AI Features)
- Configurable summary length

### 3. Title Translation
- Translates German titles to English
- Uses Ollama AI
- Displays both languages in newsletter
- Stores both versions in database

### 4. Newsletter Generation
- Beautiful HTML email template
- Responsive design
- Numbered articles
- Summary statistics
- Scheduled daily at 7:00 AM Berlin time

### 5. Engagement Tracking
- Email open tracking (pixel)
- Link click tracking
- Analytics dashboard ready
- Subscriber engagement metrics

---
## News Crawler

### How It Works

```
1. Fetch RSS feeds from database
2. Parse RSS XML
3. Extract article URLs
4. Fetch full article content
5. Extract text from HTML
6. Translate title (German → English)
7. Generate AI summary
8. Store in MongoDB
```

### Content Extraction

**Strategies (in order):**

1. **Article Tag** - Look for `<article>` tags
2. **Main Tag** - Look for `<main>` content
3. **Content Divs** - Common class names (content, article-body, etc.)
4. **Paragraph Aggregation** - Collect all `<p>` tags
5. **Fallback** - Use RSS description

**Cleaning:**
- Remove scripts and styles
- Remove navigation elements
- Remove ads and sidebars
- Extract clean text
- Preserve paragraphs
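
The strategy order and cleaning rules above can be sketched with the standard library alone. This is a minimal illustration, not the crawler's actual code: `TextCollector`, `extract_content`, the `min_chars` threshold, and the exact set of skipped tags are assumptions made for the example.

```python
from html.parser import HTMLParser

# Assumed cleaning rules: text inside these tags is dropped.
SKIP_TAGS = {"script", "style", "nav", "aside", "header", "footer"}
# Tags that end one paragraph of text and start the next.
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "li", "blockquote"}
# Strategies in the documented order: (tag, optional CSS class).
STRATEGIES = [("article", None), ("main", None), ("div", "content"),
              ("div", "article-body"), ("p", None)]


class TextCollector(HTMLParser):
    """Collect paragraphs of text found inside matching elements."""

    def __init__(self, tag, css_class=None):
        super().__init__()
        self.tag, self.css_class = tag, css_class
        self.depth = 0        # > 0 while inside a matching element
        self.skip_depth = 0   # > 0 while inside a skipped element
        self.buffer, self.paragraphs = [], []

    def _matches(self, attrs):
        if self.css_class is None:
            return True
        return self.css_class in (dict(attrs).get("class") or "").split()

    def _flush(self):
        text = " ".join("".join(self.buffer).split())  # normalize whitespace
        if text:
            self.paragraphs.append(text)
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == self.tag and (self.depth or self._matches(attrs)):
            self.depth += 1
        if tag in BLOCK_TAGS or tag == self.tag:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == self.tag and self.depth:
            self.depth -= 1
        if tag in BLOCK_TAGS or tag == self.tag:
            self._flush()

    def handle_data(self, data):
        if self.depth and not self.skip_depth:
            self.buffer.append(data)

    def text(self):
        self._flush()
        return "\n\n".join(self.paragraphs)  # paragraphs are preserved


def extract_content(html, rss_description="", min_chars=50):
    """Try each strategy in order; fall back to the RSS description."""
    for tag, css_class in STRATEGIES:
        collector = TextCollector(tag, css_class)
        collector.feed(html)
        collector.close()
        text = collector.text()
        if len(text) >= min_chars:
            return text
    return rss_description
```

A strategy counts as successful here only if it yields at least `min_chars` characters, which is one plausible way to decide when to move on to the next strategy.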

### RSS Feed Handling

**Supported Formats:**
- RSS 2.0
- Atom
- Custom formats

**Extracted Data:**
- Title
- Link
- Description/Summary
- Published date
- Author (if available)
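
As an illustration of the extracted fields, the sketch below parses an RSS 2.0 document with the standard library. It is an assumption-laden example rather than the crawler's parser: `parse_rss_items` is a hypothetical name, and Atom feeds or namespaced fields such as `dc:creator` would need extra handling that is not shown.

```python
import xml.etree.ElementTree as ET


def parse_rss_items(xml_text):
    """Return one dict per <item> in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        def field(name):
            # Missing or empty elements become None.
            value = item.findtext(name)
            return value.strip() if value else None

        items.append({
            "title": field("title"),
            "link": field("link"),
            "description": field("description"),
            "published": field("pubDate"),
            "author": field("author"),  # optional in RSS 2.0
        })
    return items
```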

**Error Handling:**
- Retry failed requests
- Skip invalid URLs
- Log errors
- Continue with next article
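
The four error-handling rules can be combined in a small loop. The sketch below is only one way to do it: the function names, the retry count, and the fixed delay are assumptions, and `fetch` stands in for whatever function actually downloads an article.

```python
import logging
import time
from urllib.parse import urlparse

logger = logging.getLogger("crawler")


def is_valid_url(url):
    """Accept only absolute http(s) URLs."""
    parts = urlparse(url or "")
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def fetch_with_retry(fetch, url, attempts=3, delay=1.0):
    """Call fetch(url); retry on any exception, re-raise after the last try."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as exc:
            logger.warning("Attempt %d/%d failed for %s: %s",
                           attempt, attempts, url, exc)
            if attempt == attempts:
                raise
            time.sleep(delay)


def crawl_all(urls, fetch, attempts=3, delay=1.0):
    """Fetch every valid URL; log failures and continue with the next one."""
    results = {}
    for url in urls:
        if not is_valid_url(url):
            logger.warning("Skipping invalid URL: %r", url)
            continue
        try:
            results[url] = fetch_with_retry(fetch, url, attempts, delay)
        except Exception:
            logger.exception("Giving up on %s", url)
    return results
```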

---

## AI Features

### Summarization

**Process:**
1. Send article text to Ollama
2. Request 150-word summary
3. Receive AI-generated summary
4. Store with article

**Configuration:**
```env
OLLAMA_ENABLED=true
OLLAMA_MODEL=phi3:latest
SUMMARY_MAX_WORDS=150
OLLAMA_TIMEOUT=120
```

**Performance:**
- CPU: ~8s per article
- GPU: ~2s per article (4x faster)
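
For orientation, a summarization request could look roughly like the sketch below, which calls Ollama's `/api/generate` endpoint with streaming disabled. The prompt wording and the helper names `build_summary_payload` and `summarize` are assumptions for illustration; the environment variables and their defaults are the ones listed in this guide.

```python
import json
import os
import urllib.request

OLLAMA_BASE_URL = os.environ.get("OLLAMA_BASE_URL", "http://ollama:11434")
OLLAMA_MODEL = os.environ.get("OLLAMA_MODEL", "phi3:latest")
OLLAMA_TIMEOUT = int(os.environ.get("OLLAMA_TIMEOUT", "120"))
SUMMARY_MAX_WORDS = int(os.environ.get("SUMMARY_MAX_WORDS", "150"))


def build_summary_payload(article_text, model=OLLAMA_MODEL,
                          max_words=SUMMARY_MAX_WORDS):
    """Build the JSON body for a non-streaming generate request."""
    prompt = (
        f"Summarize the following news article in at most {max_words} words."
        f"\n\n{article_text}"
    )  # hypothetical prompt wording
    return {"model": model, "prompt": prompt, "stream": False}


def summarize(article_text):
    """Send the article to Ollama and return the generated summary."""
    body = json.dumps(build_summary_payload(article_text)).encode("utf-8")
    request = urllib.request.Request(
        f"{OLLAMA_BASE_URL}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=OLLAMA_TIMEOUT) as response:
        return json.loads(response.read())["response"].strip()
```

Keeping the payload builder separate from the HTTP call makes the prompt easy to inspect without a running Ollama instance.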

### Translation

**Process:**
1. Send German title to Ollama
2. Request English translation
3. Receive translated title
4. Store both versions

**Configuration:**
```env
OLLAMA_ENABLED=true
OLLAMA_MODEL=phi3:latest
```

**Performance:**
- CPU: ~1.5s per title
- GPU: ~0.3s per title (5x faster)

**Newsletter Display:**
```
English Title (Primary)
Original: German Title (Subtitle)
```
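
The display rule above might be rendered along these lines. The HTML tags, the `original-title` class, the prompt wording, and the fallback to the German title when no translation exists are all assumptions made for this sketch.

```python
import html


def build_translation_payload(german_title, model="phi3:latest"):
    """Build a non-streaming Ollama generate request body (prompt is hypothetical)."""
    prompt = (
        "Translate this German news headline to English. "
        "Reply with the translation only.\n\n" + german_title
    )
    return {"model": model, "prompt": prompt, "stream": False}


def render_title_html(title_en, title_de):
    """English title as the primary heading, German original as a subtitle."""
    primary = title_en or title_de  # assumed fallback if translation failed
    parts = [f"<h2>{html.escape(primary)}</h2>"]
    if title_en and title_de and title_en != title_de:
        parts.append(
            f'<p class="original-title">Original: {html.escape(title_de)}</p>'
        )
    return "\n".join(parts)
```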

---

## Newsletter System

### Template Features

- **Responsive Design** - Works on all devices
- **Clean Layout** - Easy to read
- **Numbered Articles** - Clear organization
- **Summary Box** - Quick stats
- **Tracking Links** - Click tracking
- **Unsubscribe Link** - Easy opt-out

### Personalization

- Greeting message
- Date formatting
- Article count
- Source attribution
- Author names

### Tracking

**Open Tracking:**
- Invisible 1x1 pixel image
- Loaded when the email is opened
- Records timestamp
- Tracks unique opens

**Click Tracking:**
- All article links tracked
- Redirect through backend
- Records click events
- Tracks which articles were clicked
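
To make the two mechanisms concrete, the sketch below builds a tracking pixel tag and a redirect link. The URL paths `/api/track/pixel/...` and `/api/track/click/...` are hypothetical placeholders, not the backend's documented routes (see API.md for those), and `tracking_id` stands for whatever identifier ties an event to a subscriber and newsletter.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

TRACKING_API_URL = "http://localhost:5001"  # matches TRACKING_API_URL in the config


def tracking_pixel_tag(tracking_id, base_url=TRACKING_API_URL):
    """Return an <img> tag for an invisible 1x1 pixel (path is hypothetical)."""
    src = f"{base_url}/api/track/pixel/{quote(tracking_id, safe='')}"
    return f'<img src="{src}" width="1" height="1" alt="">'


def tracked_link(article_url, tracking_id, base_url=TRACKING_API_URL):
    """Wrap an article URL so the click goes through the backend first."""
    query = urlencode({"url": article_url})
    return f"{base_url}/api/track/click/{quote(tracking_id, safe='')}?{query}"


def redirect_target(tracked_url):
    """What a redirect handler would send the reader to after logging the click."""
    return parse_qs(urlsplit(tracked_url).query)["url"][0]
```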

---

## Subscriber Management

### Status System

| Status | Description | Receives Newsletters |
|--------|-------------|---------------------|
| `active` | Subscribed | ✅ Yes |
| `inactive` | Unsubscribed | ❌ No |

### Operations

**Subscribe:**
```bash
curl -X POST http://localhost:5001/api/subscribe \
  -H "Content-Type: application/json" \
  -d '{"email": "user@example.com"}'
```

**Unsubscribe:**
```bash
curl -X POST http://localhost:5001/api/unsubscribe \
  -H "Content-Type: application/json" \
  -d '{"email": "user@example.com"}'
```

**Check Stats:**
```bash
curl http://localhost:5001/api/admin/stats | jq '.subscribers'
```
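
The same subscribe and unsubscribe calls can be made from Python. This is a small client sketch using only the standard library; the helper names are made up for the example, and the response body is returned as raw text because its format is not described in this guide.

```python
import json
import urllib.request

API_BASE = "http://localhost:5001"


def build_subscription_request(action, email, base_url=API_BASE):
    """Build the POST request for /api/subscribe or /api/unsubscribe."""
    if action not in ("subscribe", "unsubscribe"):
        raise ValueError(f"unknown action: {action!r}")
    return urllib.request.Request(
        f"{base_url}/api/{action}",
        data=json.dumps({"email": email}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=10):
    """Send the request; return (HTTP status, response body as text)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8")
```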

---

## Admin Features

### Manual Crawl

Trigger a crawl at any time:
```bash
curl -X POST http://localhost:5001/api/admin/trigger-crawl \
  -H "Content-Type: application/json" \
  -d '{"max_articles": 10}'
```

### Test Email

Send a test newsletter:
```bash
curl -X POST http://localhost:5001/api/admin/send-test-email \
  -H "Content-Type: application/json" \
  -d '{"email": "test@example.com"}'
```

### Send Newsletter

Send to all subscribers:
```bash
curl -X POST http://localhost:5001/api/admin/send-newsletter \
  -H "Content-Type: application/json" \
  -d '{"max_articles": 10}'
```

### System Stats

View system statistics:
```bash
curl http://localhost:5001/api/admin/stats
```

---

## Automation

### Scheduled Tasks

**Crawler (6:00 AM Berlin time):**
- Fetches new articles
- Processes with AI
- Stores in database

**Sender (7:00 AM Berlin time):**
- Waits for crawler to finish
- Fetches today's articles
- Generates newsletter
- Sends to all active subscribers
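
Because both schedules are given in Berlin wall-clock time, the next run has to be computed in the `Europe/Berlin` zone rather than in UTC, so the times stay fixed across daylight-saving changes. The sketch below shows only that calculation; `next_run` is a hypothetical helper, and how the sender actually waits for the crawler is not covered here.

```python
from datetime import datetime, time, timedelta, timezone
from zoneinfo import ZoneInfo

BERLIN = ZoneInfo("Europe/Berlin")
CRAWL_TIME = time(6, 0)   # crawler schedule from this guide
SEND_TIME = time(7, 0)    # sender schedule from this guide


def next_run(now, run_at):
    """Return the next Berlin wall-clock `run_at` strictly after `now`."""
    local_now = now.astimezone(BERLIN)
    candidate = datetime.combine(local_now.date(), run_at, tzinfo=BERLIN)
    if candidate <= local_now:
        # Today's slot has passed, so schedule for tomorrow.
        candidate = datetime.combine(
            local_now.date() + timedelta(days=1), run_at, tzinfo=BERLIN
        )
    return candidate


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print("Next crawl:", next_run(now, CRAWL_TIME).isoformat())
    print("Next send: ", next_run(now, SEND_TIME).isoformat())
```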

### Manual Execution

```bash
# Run crawler manually
docker-compose exec crawler python crawler_service.py 10

# Run sender manually
docker-compose exec sender python sender_service.py send 10

# Send test email
docker-compose exec sender python sender_service.py test your@email.com
```

---

## Configuration

### Environment Variables

```env
# Newsletter Settings
NEWSLETTER_MAX_ARTICLES=10
NEWSLETTER_HOURS_LOOKBACK=24
WEBSITE_URL=http://localhost:3000

# Ollama AI
OLLAMA_ENABLED=true
OLLAMA_BASE_URL=http://ollama:11434
OLLAMA_MODEL=phi3:latest
OLLAMA_TIMEOUT=120
SUMMARY_MAX_WORDS=150

# Tracking
TRACKING_ENABLED=true
TRACKING_API_URL=http://localhost:5001
TRACKING_DATA_RETENTION_DAYS=90
```
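
One way to read these variables into typed settings is sketched below. The `Settings` class and `load_settings` function are illustrative names, and the defaults simply mirror the values shown above; whether the services fall back to the same values when a variable is unset is an assumption.

```python
import os
from dataclasses import dataclass


def _env_bool(env, name, default):
    """Treat 1/true/yes/on (any case) as True."""
    return str(env.get(name, default)).strip().lower() in ("1", "true", "yes", "on")


@dataclass(frozen=True)
class Settings:
    newsletter_max_articles: int
    newsletter_hours_lookback: int
    website_url: str
    ollama_enabled: bool
    ollama_base_url: str
    ollama_model: str
    ollama_timeout: int
    summary_max_words: int
    tracking_enabled: bool
    tracking_api_url: str
    tracking_data_retention_days: int


def load_settings(env=os.environ):
    """Build Settings from a mapping of environment variables."""
    return Settings(
        newsletter_max_articles=int(env.get("NEWSLETTER_MAX_ARTICLES", 10)),
        newsletter_hours_lookback=int(env.get("NEWSLETTER_HOURS_LOOKBACK", 24)),
        website_url=env.get("WEBSITE_URL", "http://localhost:3000"),
        ollama_enabled=_env_bool(env, "OLLAMA_ENABLED", True),
        ollama_base_url=env.get("OLLAMA_BASE_URL", "http://ollama:11434"),
        ollama_model=env.get("OLLAMA_MODEL", "phi3:latest"),
        ollama_timeout=int(env.get("OLLAMA_TIMEOUT", 120)),
        summary_max_words=int(env.get("SUMMARY_MAX_WORDS", 150)),
        tracking_enabled=_env_bool(env, "TRACKING_ENABLED", True),
        tracking_api_url=env.get("TRACKING_API_URL", "http://localhost:5001"),
        tracking_data_retention_days=int(env.get("TRACKING_DATA_RETENTION_DAYS", 90)),
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise, for example `load_settings({"SUMMARY_MAX_WORDS": "200"})`.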

### RSS Feeds

Add feeds in MongoDB:
```javascript
db.rss_feeds.insertOne({
  name: "Süddeutsche Zeitung München",
  url: "https://www.sueddeutsche.de/muenchen/rss",
  active: true
})
```

---

## Performance Optimization

### GPU Acceleration

Enable for roughly 4-5x faster processing:
```bash
./start-with-gpu.sh
```

**Benefits:**
- Faster summarization (8s → 2s)
- Faster translation (1.5s → 0.3s)
- Process more articles
- Lower CPU usage

### Batch Processing

Process multiple articles efficiently:
- Model stays loaded in memory
- Reduced overhead
- Better throughput

### Caching

- Model caching (Ollama)
- Database connection pooling
- Persistent storage

---
## Monitoring

### Logs

```bash
# Crawler logs
docker-compose logs -f crawler

# Sender logs
docker-compose logs -f sender

# Backend logs
docker-compose logs -f backend
```

### Metrics

- Articles crawled
- Summaries generated
- Newsletters sent
- Open rate
- Click-through rate
- Processing time
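
This guide does not define how the rates are calculated. The sketch below uses common email-marketing definitions as an assumption: open rate as unique opens divided by newsletters sent, and click-through rate as unique clicks divided by newsletters sent. The function names are hypothetical.

```python
def rate(numerator, denominator):
    """Percentage rounded to one decimal; 0.0 when there is nothing to divide by."""
    return round(100.0 * numerator / denominator, 1) if denominator else 0.0


def engagement_summary(sent, unique_opens, unique_clicks):
    """Compute engagement rates from raw counts (definitions are assumed)."""
    return {
        "open_rate": rate(unique_opens, sent),
        "click_through_rate": rate(unique_clicks, sent),
        "click_to_open_rate": rate(unique_clicks, unique_opens),
    }
```

For 200 newsletters sent with 80 unique opens and 20 unique clicks, these definitions give a 40.0% open rate and a 10.0% click-through rate.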

### Health Checks

```bash
# Backend health
curl http://localhost:5001/health

# System stats
curl http://localhost:5001/api/admin/stats
```

---

## Troubleshooting

### Crawler Issues

**No articles found:**
- Check RSS feed URLs
- Verify feeds are active
- Check network connectivity

**Extraction failed:**
- Article structure changed
- Paywall detected
- Network timeout

**AI processing failed:**
- Ollama not running
- Model not downloaded
- Timeout too short

### Newsletter Issues

**Not sending:**
- Check email configuration
- Verify SMTP credentials
- Check subscriber count

**Tracking not working:**
- Verify that tracking is enabled
- Check that the backend API is accessible
- Verify tracking URLs

---

See [SETUP.md](SETUP.md) for configuration, [API.md](API.md) for API reference, and [ARCHITECTURE.md](ARCHITECTURE.md) for system design.