update
This commit is contained in:
@@ -5,7 +5,8 @@ Get Munich News Daily running in 5 minutes!
|
|||||||
## Prerequisites
|
## Prerequisites
|
||||||
|
|
||||||
- Docker & Docker Compose installed
|
- Docker & Docker Compose installed
|
||||||
- (Optional) Ollama for AI summarization
|
- 4GB+ RAM (for Ollama AI models)
|
||||||
|
- (Optional) NVIDIA GPU for 5-10x faster AI processing
|
||||||
|
|
||||||
## Setup
|
## Setup
|
||||||
|
|
||||||
@@ -30,13 +31,21 @@ EMAIL_PASSWORD=your-app-password
|
|||||||
### 2. Start System
|
### 2. Start System
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Start all services
|
# Option 1: Auto-detect GPU and start (recommended)
|
||||||
|
./start-with-gpu.sh
|
||||||
|
|
||||||
|
# Option 2: Start without GPU
|
||||||
docker-compose up -d
|
docker-compose up -d
|
||||||
|
|
||||||
# View logs
|
# View logs
|
||||||
docker-compose logs -f
|
docker-compose logs -f
|
||||||
|
|
||||||
|
# Wait for Ollama model download (first time only, ~2-5 minutes)
|
||||||
|
docker-compose logs -f ollama-setup
|
||||||
```
|
```
|
||||||
|
|
||||||
|
**Note:** First startup downloads the phi3:latest AI model (2.2GB). This happens automatically.
|
||||||
|
|
||||||
### 3. Add RSS Feeds
|
### 3. Add RSS Feeds
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
@@ -114,18 +123,45 @@ docker-compose logs -f
|
|||||||
docker-compose up -d --build
|
docker-compose up -d --build
|
||||||
```
|
```
|
||||||
|
|
||||||
|
## New Features
|
||||||
|
|
||||||
|
### GPU Acceleration (5-10x Faster)
|
||||||
|
Enable GPU support for faster AI processing:
|
||||||
|
```bash
|
||||||
|
./check-gpu.sh # Check if GPU is available
|
||||||
|
./start-with-gpu.sh # Start with GPU support
|
||||||
|
```
|
||||||
|
See [docs/GPU_SETUP.md](docs/GPU_SETUP.md) for details.
|
||||||
|
|
||||||
|
### Send Newsletter to All Subscribers
|
||||||
|
```bash
|
||||||
|
# Send newsletter to all active subscribers
|
||||||
|
curl -X POST http://localhost:5001/api/admin/send-newsletter \
|
||||||
|
-H "Content-Type: application/json" \
|
||||||
|
-d '{"max_articles": 10}'
|
||||||
|
```
|
||||||
|
|
||||||
|
### Security Features
|
||||||
|
- ✅ Only Backend API exposed (port 5001)
|
||||||
|
- ✅ MongoDB internal-only (secure)
|
||||||
|
- ✅ Ollama internal-only (secure)
|
||||||
|
- ✅ All services communicate via internal Docker network
|
||||||
|
|
||||||
## Need Help?
|
## Need Help?
|
||||||
|
|
||||||
- Check [README.md](README.md) for full documentation
|
- **Documentation Index**: [docs/INDEX.md](docs/INDEX.md)
|
||||||
- See [docs/DEPLOYMENT.md](docs/DEPLOYMENT.md) for detailed setup
|
- **GPU Setup**: [docs/GPU_SETUP.md](docs/GPU_SETUP.md)
|
||||||
- View [docs/API.md](docs/API.md) for API reference
|
- **API Reference**: [docs/ADMIN_API.md](docs/ADMIN_API.md)
|
||||||
|
- **Security Guide**: [docs/SECURITY_NOTES.md](docs/SECURITY_NOTES.md)
|
||||||
|
- **Full Documentation**: [README.md](README.md)
|
||||||
|
|
||||||
## Next Steps
|
## Next Steps
|
||||||
|
|
||||||
1. Configure Ollama for AI summaries (optional)
|
1. ✅ **Enable GPU acceleration** - [docs/GPU_SETUP.md](docs/GPU_SETUP.md)
|
||||||
2. Set up tracking API (optional)
|
2. Set up tracking API (optional)
|
||||||
3. Customize newsletter template
|
3. Customize newsletter template
|
||||||
4. Add more RSS feeds
|
4. Add more RSS feeds
|
||||||
5. Monitor engagement metrics
|
5. Monitor engagement metrics
|
||||||
|
6. Review security settings - [docs/SECURITY_NOTES.md](docs/SECURITY_NOTES.md)
|
||||||
|
|
||||||
That's it! Your automated news system is running. 🎉
|
That's it! Your automated news system is running. 🎉
|
||||||
|
|||||||
35
README.md
35
README.md
@@ -2,7 +2,16 @@
|
|||||||
|
|
||||||
A fully automated news aggregation and newsletter system that crawls Munich news sources, generates AI summaries, and sends daily newsletters with engagement tracking.
|
A fully automated news aggregation and newsletter system that crawls Munich news sources, generates AI summaries, and sends daily newsletters with engagement tracking.
|
||||||
|
|
||||||
**🚀 NEW:** GPU acceleration support for 5-10x faster AI processing! See [QUICK_START_GPU.md](QUICK_START_GPU.md)
|
## ✨ Key Features
|
||||||
|
|
||||||
|
- **🤖 AI-Powered Clustering** - Automatically detects duplicate stories from different sources
|
||||||
|
- **📰 Neutral Summaries** - Combines multiple perspectives into balanced coverage
|
||||||
|
- **🎯 Smart Prioritization** - Shows most important stories first (multi-source coverage)
|
||||||
|
- **📊 Engagement Tracking** - Open rates, click tracking, and analytics
|
||||||
|
- **⚡ GPU Acceleration** - 5-10x faster AI processing with GPU support
|
||||||
|
- **🔒 GDPR Compliant** - Privacy-first with data retention controls
|
||||||
|
|
||||||
|
**🚀 NEW:** GPU acceleration support for 5-10x faster AI processing! See [docs/GPU_SETUP.md](docs/GPU_SETUP.md)
|
||||||
|
|
||||||
## 🚀 Quick Start
|
## 🚀 Quick Start
|
||||||
|
|
||||||
@@ -25,6 +34,8 @@ That's it! The system will automatically:
|
|||||||
|
|
||||||
📖 **New to the project?** See [QUICKSTART.md](QUICKSTART.md) for a detailed 5-minute setup guide.
|
📖 **New to the project?** See [QUICKSTART.md](QUICKSTART.md) for a detailed 5-minute setup guide.
|
||||||
|
|
||||||
|
🚀 **GPU Acceleration:** Enable 5-10x faster AI processing with [GPU Setup Guide](docs/GPU_SETUP.md)
|
||||||
|
|
||||||
## 📋 System Overview
|
## 📋 System Overview
|
||||||
|
|
||||||
```
|
```
|
||||||
@@ -49,11 +60,11 @@ That's it! The system will automatically:
|
|||||||
|
|
||||||
### Components
|
### Components
|
||||||
|
|
||||||
- **Ollama**: AI service for summarization and translation (port 11434)
|
- **Ollama**: AI service for summarization and translation (internal only, GPU-accelerated)
|
||||||
- **MongoDB**: Data storage (articles, subscribers, tracking)
|
- **MongoDB**: Data storage (articles, subscribers, tracking) (internal only)
|
||||||
- **Backend API**: Flask API for tracking and analytics (port 5001)
|
- **Backend API**: Flask API for tracking and analytics (port 5001 - only exposed service)
|
||||||
- **News Crawler**: Automated RSS feed crawler with AI summarization
|
- **News Crawler**: Automated RSS feed crawler with AI summarization (internal only)
|
||||||
- **Newsletter Sender**: Automated email sender with tracking
|
- **Newsletter Sender**: Automated email sender with tracking (internal only)
|
||||||
- **Frontend**: React dashboard (optional)
|
- **Frontend**: React dashboard (optional)
|
||||||
|
|
||||||
### Technology Stack
|
### Technology Stack
|
||||||
@@ -341,11 +352,21 @@ curl -X POST http://localhost:5001/api/tracking/subscriber/user@example.com/opt-
|
|||||||
|
|
||||||
### Getting Started
|
### Getting Started
|
||||||
- **[QUICKSTART.md](QUICKSTART.md)** - 5-minute setup guide
|
- **[QUICKSTART.md](QUICKSTART.md)** - 5-minute setup guide
|
||||||
- **[PROJECT_STRUCTURE.md](PROJECT_STRUCTURE.md)** - Project layout
|
|
||||||
- **[CONTRIBUTING.md](CONTRIBUTING.md)** - Contribution guidelines
|
- **[CONTRIBUTING.md](CONTRIBUTING.md)** - Contribution guidelines
|
||||||
|
|
||||||
|
### Core Features
|
||||||
|
- **[docs/AI_NEWS_AGGREGATION.md](docs/AI_NEWS_AGGREGATION.md)** - AI-powered clustering & neutral summaries
|
||||||
|
- **[docs/FEATURES.md](docs/FEATURES.md)** - Complete feature list
|
||||||
|
- **[docs/API.md](docs/API.md)** - API endpoints reference
|
||||||
|
|
||||||
### Technical Documentation
|
### Technical Documentation
|
||||||
- **[docs/ARCHITECTURE.md](docs/ARCHITECTURE.md)** - System architecture
|
- **[docs/ARCHITECTURE.md](docs/ARCHITECTURE.md)** - System architecture
|
||||||
|
- **[docs/SETUP.md](docs/SETUP.md)** - Detailed setup guide
|
||||||
|
- **[docs/OLLAMA_SETUP.md](docs/OLLAMA_SETUP.md)** - AI/Ollama configuration
|
||||||
|
- **[docs/GPU_SETUP.md](docs/GPU_SETUP.md)** - GPU acceleration setup
|
||||||
|
- **[docs/DEPLOYMENT.md](docs/DEPLOYMENT.md)** - Production deployment
|
||||||
|
- **[docs/SECURITY.md](docs/SECURITY.md)** - Security best practices
|
||||||
|
- **[docs/REFERENCE.md](docs/REFERENCE.md)** - Complete reference
|
||||||
- **[docs/DEPLOYMENT.md](docs/DEPLOYMENT.md)** - Deployment guide
|
- **[docs/DEPLOYMENT.md](docs/DEPLOYMENT.md)** - Deployment guide
|
||||||
- **[docs/API.md](docs/API.md)** - API reference
|
- **[docs/API.md](docs/API.md)** - API reference
|
||||||
- **[docs/DATABASE_SCHEMA.md](docs/DATABASE_SCHEMA.md)** - Database structure
|
- **[docs/DATABASE_SCHEMA.md](docs/DATABASE_SCHEMA.md)** - Database structure
|
||||||
|
|||||||
@@ -1,5 +1,5 @@
|
|||||||
from flask import Blueprint, jsonify
|
from flask import Blueprint, jsonify, request
|
||||||
from database import articles_collection
|
from database import articles_collection, db
|
||||||
from services.news_service import fetch_munich_news, save_articles_to_db
|
from services.news_service import fetch_munich_news, save_articles_to_db
|
||||||
|
|
||||||
news_bp = Blueprint('news', __name__)
|
news_bp = Blueprint('news', __name__)
|
||||||
@@ -9,6 +9,12 @@ news_bp = Blueprint('news', __name__)
|
|||||||
def get_news():
|
def get_news():
|
||||||
"""Get latest Munich news"""
|
"""Get latest Munich news"""
|
||||||
try:
|
try:
|
||||||
|
# Check if clustered mode is requested
|
||||||
|
mode = request.args.get('mode', 'all')
|
||||||
|
|
||||||
|
if mode == 'clustered':
|
||||||
|
return get_clustered_news_internal()
|
||||||
|
|
||||||
# Fetch fresh news and save to database
|
# Fetch fresh news and save to database
|
||||||
articles = fetch_munich_news()
|
articles = fetch_munich_news()
|
||||||
save_articles_to_db(articles)
|
save_articles_to_db(articles)
|
||||||
@@ -63,6 +69,95 @@ def get_news():
|
|||||||
return jsonify({'error': str(e)}), 500
|
return jsonify({'error': str(e)}), 500
|
||||||
|
|
||||||
|
|
||||||
|
def get_clustered_news_internal():
|
||||||
|
"""
|
||||||
|
Get news with neutral summaries for clustered articles
|
||||||
|
Returns only primary articles with their neutral summaries
|
||||||
|
Prioritizes stories covered by multiple sources (more popular/important)
|
||||||
|
"""
|
||||||
|
try:
|
||||||
|
limit = int(request.args.get('limit', 20))
|
||||||
|
|
||||||
|
# Use aggregation to get articles with their cluster size
|
||||||
|
# This allows us to prioritize multi-source stories
|
||||||
|
pipeline = [
|
||||||
|
{"$match": {"is_primary": True}},
|
||||||
|
{"$lookup": {
|
||||||
|
"from": "articles",
|
||||||
|
"localField": "cluster_id",
|
||||||
|
"foreignField": "cluster_id",
|
||||||
|
"as": "cluster_articles"
|
||||||
|
}},
|
||||||
|
{"$addFields": {
|
||||||
|
"article_count": {"$size": "$cluster_articles"},
|
||||||
|
"sources_list": {"$setUnion": ["$cluster_articles.source", []]}
|
||||||
|
}},
|
||||||
|
{"$addFields": {
|
||||||
|
"source_count": {"$size": "$sources_list"}
|
||||||
|
}},
|
||||||
|
# Sort by: 1) source count (desc), 2) published date (desc)
|
||||||
|
{"$sort": {"source_count": -1, "published_at": -1}},
|
||||||
|
{"$limit": limit}
|
||||||
|
]
|
||||||
|
|
||||||
|
cursor = articles_collection.aggregate(pipeline)
|
||||||
|
|
||||||
|
result = []
|
||||||
|
cluster_summaries_collection = db['cluster_summaries']
|
||||||
|
|
||||||
|
for doc in cursor:
|
||||||
|
cluster_id = doc.get('cluster_id')
|
||||||
|
|
||||||
|
# Get neutral summary if available
|
||||||
|
cluster_summary = cluster_summaries_collection.find_one({'cluster_id': cluster_id})
|
||||||
|
|
||||||
|
# Use cluster_articles from aggregation (already fetched)
|
||||||
|
cluster_articles = doc.get('cluster_articles', [])
|
||||||
|
|
||||||
|
article = {
|
||||||
|
'title': doc.get('title', ''),
|
||||||
|
'link': doc.get('link', ''),
|
||||||
|
'source': doc.get('source', ''),
|
||||||
|
'published': doc.get('published_at', ''),
|
||||||
|
'category': doc.get('category', 'general'),
|
||||||
|
'cluster_id': cluster_id,
|
||||||
|
'article_count': doc.get('article_count', 1),
|
||||||
|
'source_count': doc.get('source_count', 1),
|
||||||
|
'sources': list(doc.get('sources_list', [doc.get('source', '')]))
|
||||||
|
}
|
||||||
|
|
||||||
|
# Use neutral summary if available, otherwise use article's own summary
|
||||||
|
if cluster_summary and doc.get('article_count', 1) > 1:
|
||||||
|
article['summary'] = cluster_summary.get('neutral_summary', '')
|
||||||
|
article['summary_type'] = 'neutral'
|
||||||
|
article['is_clustered'] = True
|
||||||
|
else:
|
||||||
|
article['summary'] = doc.get('summary', '')
|
||||||
|
article['summary_type'] = 'individual'
|
||||||
|
article['is_clustered'] = False
|
||||||
|
|
||||||
|
# Add related articles info
|
||||||
|
if doc.get('article_count', 1) > 1:
|
||||||
|
article['related_articles'] = [
|
||||||
|
{
|
||||||
|
'source': a.get('source', ''),
|
||||||
|
'title': a.get('title', ''),
|
||||||
|
'link': a.get('link', '')
|
||||||
|
}
|
||||||
|
for a in cluster_articles if a.get('_id') != doc.get('_id')
|
||||||
|
]
|
||||||
|
|
||||||
|
result.append(article)
|
||||||
|
|
||||||
|
return jsonify({
|
||||||
|
'articles': result,
|
||||||
|
'mode': 'clustered',
|
||||||
|
'description': 'Shows one article per story with neutral summaries'
|
||||||
|
}), 200
|
||||||
|
except Exception as e:
|
||||||
|
return jsonify({'error': str(e)}), 500
|
||||||
|
|
||||||
|
|
||||||
@news_bp.route('/api/news/<path:article_url>', methods=['GET'])
|
@news_bp.route('/api/news/<path:article_url>', methods=['GET'])
|
||||||
def get_article_by_url(article_url):
|
def get_article_by_url(article_url):
|
||||||
"""Get full article content by URL"""
|
"""Get full article content by URL"""
|
||||||
@@ -113,11 +208,20 @@ def get_stats():
|
|||||||
# Count summarized articles
|
# Count summarized articles
|
||||||
summarized_count = articles_collection.count_documents({'summary': {'$exists': True, '$ne': ''}})
|
summarized_count = articles_collection.count_documents({'summary': {'$exists': True, '$ne': ''}})
|
||||||
|
|
||||||
|
# Count clustered articles
|
||||||
|
clustered_count = articles_collection.count_documents({'cluster_id': {'$exists': True}})
|
||||||
|
|
||||||
|
# Count cluster summaries
|
||||||
|
cluster_summaries_collection = db['cluster_summaries']
|
||||||
|
neutral_summaries_count = cluster_summaries_collection.count_documents({})
|
||||||
|
|
||||||
return jsonify({
|
return jsonify({
|
||||||
'subscribers': subscriber_count,
|
'subscribers': subscriber_count,
|
||||||
'articles': article_count,
|
'articles': article_count,
|
||||||
'crawled_articles': crawled_count,
|
'crawled_articles': crawled_count,
|
||||||
'summarized_articles': summarized_count
|
'summarized_articles': summarized_count,
|
||||||
|
'clustered_articles': clustered_count,
|
||||||
|
'neutral_summaries': neutral_summaries_count
|
||||||
}), 200
|
}), 200
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
return jsonify({'error': str(e)}), 500
|
return jsonify({'error': str(e)}), 500
|
||||||
|
|||||||
@@ -1,382 +0,0 @@
|
|||||||
# Admin API Reference
|
|
||||||
|
|
||||||
Admin endpoints for testing and manual operations.
|
|
||||||
|
|
||||||
## Overview
|
|
||||||
|
|
||||||
The admin API allows you to trigger manual operations like crawling news and sending test emails directly through HTTP requests.
|
|
||||||
|
|
||||||
**How it works**: The backend container has access to the Docker socket, allowing it to execute commands in other containers via `docker exec`.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## API Endpoints
|
|
||||||
|
|
||||||
### Trigger Crawler
|
|
||||||
|
|
||||||
Manually trigger the news crawler to fetch new articles.
|
|
||||||
|
|
||||||
```http
|
|
||||||
POST /api/admin/trigger-crawl
|
|
||||||
```
|
|
||||||
|
|
||||||
**Request Body** (optional):
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"max_articles": 10
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
**Parameters**:
|
|
||||||
- `max_articles` (integer, optional): Number of articles to crawl per feed (1-100, default: 10)
|
|
||||||
|
|
||||||
**Response**:
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"success": true,
|
|
||||||
"message": "Crawler executed successfully",
|
|
||||||
"max_articles": 10,
|
|
||||||
"output": "... crawler output (last 1000 chars) ...",
|
|
||||||
"errors": ""
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
**Example**:
|
|
||||||
```bash
|
|
||||||
# Crawl 5 articles per feed
|
|
||||||
curl -X POST http://localhost:5001/api/admin/trigger-crawl \
|
|
||||||
-H "Content-Type: application/json" \
|
|
||||||
-d '{"max_articles": 5}'
|
|
||||||
|
|
||||||
# Use default (10 articles)
|
|
||||||
curl -X POST http://localhost:5001/api/admin/trigger-crawl
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### Send Test Email
|
|
||||||
|
|
||||||
Send a test newsletter to a specific email address.
|
|
||||||
|
|
||||||
```http
|
|
||||||
POST /api/admin/send-test-email
|
|
||||||
```
|
|
||||||
|
|
||||||
**Request Body**:
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"email": "test@example.com",
|
|
||||||
"max_articles": 10
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
**Parameters**:
|
|
||||||
- `email` (string, required): Email address to send test newsletter to
|
|
||||||
- `max_articles` (integer, optional): Number of articles to include (1-50, default: 10)
|
|
||||||
|
|
||||||
**Response**:
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"success": true,
|
|
||||||
"message": "Test email sent to test@example.com",
|
|
||||||
"email": "test@example.com",
|
|
||||||
"output": "... sender output ...",
|
|
||||||
"errors": ""
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
**Example**:
|
|
||||||
```bash
|
|
||||||
# Send test email
|
|
||||||
curl -X POST http://localhost:5001/api/admin/send-test-email \
|
|
||||||
-H "Content-Type: application/json" \
|
|
||||||
-d '{"email": "your-email@example.com"}'
|
|
||||||
|
|
||||||
# Send with custom article count
|
|
||||||
curl -X POST http://localhost:5001/api/admin/send-test-email \
|
|
||||||
-H "Content-Type: application/json" \
|
|
||||||
-d '{"email": "your-email@example.com", "max_articles": 5}'
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### Send Newsletter to All Subscribers
|
|
||||||
|
|
||||||
Send newsletter to all active subscribers in the database.
|
|
||||||
|
|
||||||
```http
|
|
||||||
POST /api/admin/send-newsletter
|
|
||||||
```
|
|
||||||
|
|
||||||
**Request Body** (optional):
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"max_articles": 10
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
**Parameters**:
|
|
||||||
- `max_articles` (integer, optional): Number of articles to include (1-50, default: 10)
|
|
||||||
|
|
||||||
**Response**:
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"success": true,
|
|
||||||
"message": "Newsletter sent successfully to 45 subscribers",
|
|
||||||
"subscriber_count": 45,
|
|
||||||
"max_articles": 10,
|
|
||||||
"output": "... sender output ...",
|
|
||||||
"errors": ""
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
**Example**:
|
|
||||||
```bash
|
|
||||||
# Send newsletter to all subscribers
|
|
||||||
curl -X POST http://localhost:5001/api/admin/send-newsletter \
|
|
||||||
-H "Content-Type: application/json"
|
|
||||||
|
|
||||||
# Send with custom article count
|
|
||||||
curl -X POST http://localhost:5001/api/admin/send-newsletter \
|
|
||||||
-H "Content-Type: application/json" \
|
|
||||||
-d '{"max_articles": 15}'
|
|
||||||
```
|
|
||||||
|
|
||||||
**Notes**:
|
|
||||||
- Only sends to subscribers with `status: 'active'`
|
|
||||||
- Returns error if no active subscribers found
|
|
||||||
- Includes tracking pixels and click tracking
|
|
||||||
- May take several minutes for large subscriber lists
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### Get System Statistics
|
|
||||||
|
|
||||||
Get overview statistics of the system.
|
|
||||||
|
|
||||||
```http
|
|
||||||
GET /api/admin/stats
|
|
||||||
```
|
|
||||||
|
|
||||||
**Response**:
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"articles": {
|
|
||||||
"total": 150,
|
|
||||||
"with_summary": 120,
|
|
||||||
"today": 15
|
|
||||||
},
|
|
||||||
"subscribers": {
|
|
||||||
"total": 50,
|
|
||||||
"active": 45
|
|
||||||
},
|
|
||||||
"rss_feeds": {
|
|
||||||
"total": 4,
|
|
||||||
"active": 4
|
|
||||||
},
|
|
||||||
"tracking": {
|
|
||||||
"total_sends": 200,
|
|
||||||
"total_opens": 150,
|
|
||||||
"total_clicks": 75
|
|
||||||
}
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
**Example**:
|
|
||||||
```bash
|
|
||||||
curl http://localhost:5001/api/admin/stats
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Workflow Examples
|
|
||||||
|
|
||||||
### Test Complete System
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# 1. Check current stats
|
|
||||||
curl http://localhost:5001/api/admin/stats
|
|
||||||
|
|
||||||
# 2. Trigger crawler to fetch new articles
|
|
||||||
curl -X POST http://localhost:5001/api/admin/trigger-crawl \
|
|
||||||
-H "Content-Type: application/json" \
|
|
||||||
-d '{"max_articles": 5}'
|
|
||||||
|
|
||||||
# 3. Wait a moment for crawler to finish, then check stats again
|
|
||||||
sleep 30
|
|
||||||
curl http://localhost:5001/api/admin/stats
|
|
||||||
|
|
||||||
# 4. Send test email to yourself
|
|
||||||
curl -X POST http://localhost:5001/api/admin/send-test-email \
|
|
||||||
-H "Content-Type: application/json" \
|
|
||||||
-d '{"email": "your-email@example.com"}'
|
|
||||||
```
|
|
||||||
|
|
||||||
### Send Newsletter to All Subscribers
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# 1. Check subscriber count
|
|
||||||
curl http://localhost:5001/api/admin/stats | jq '.subscribers'
|
|
||||||
|
|
||||||
# 2. Crawl fresh articles
|
|
||||||
curl -X POST http://localhost:5001/api/admin/trigger-crawl \
|
|
||||||
-H "Content-Type: application/json" \
|
|
||||||
-d '{"max_articles": 10}'
|
|
||||||
|
|
||||||
# 3. Wait for crawl to complete
|
|
||||||
sleep 60
|
|
||||||
|
|
||||||
# 4. Send newsletter to all active subscribers
|
|
||||||
curl -X POST http://localhost:5001/api/admin/send-newsletter \
|
|
||||||
-H "Content-Type: application/json" \
|
|
||||||
-d '{"max_articles": 10}'
|
|
||||||
```
|
|
||||||
|
|
||||||
### Quick Test Newsletter
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Send test email with latest articles
|
|
||||||
curl -X POST http://localhost:5001/api/admin/send-test-email \
|
|
||||||
-H "Content-Type: application/json" \
|
|
||||||
-d '{"email": "your-email@example.com", "max_articles": 3}'
|
|
||||||
```
|
|
||||||
|
|
||||||
### Fetch Fresh Content
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Crawl more articles from each feed
|
|
||||||
curl -X POST http://localhost:5001/api/admin/trigger-crawl \
|
|
||||||
-H "Content-Type: application/json" \
|
|
||||||
-d '{"max_articles": 20}'
|
|
||||||
```
|
|
||||||
|
|
||||||
### Daily Newsletter Workflow
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Complete daily workflow (can be automated with cron)
|
|
||||||
|
|
||||||
# 1. Crawl today's articles
|
|
||||||
curl -X POST http://localhost:5001/api/admin/trigger-crawl \
|
|
||||||
-H "Content-Type: application/json" \
|
|
||||||
-d '{"max_articles": 15}'
|
|
||||||
|
|
||||||
# 2. Wait for crawl and AI processing
|
|
||||||
sleep 120
|
|
||||||
|
|
||||||
# 3. Send to all subscribers
|
|
||||||
curl -X POST http://localhost:5001/api/admin/send-newsletter \
|
|
||||||
-H "Content-Type: application/json" \
|
|
||||||
-d '{"max_articles": 10}'
|
|
||||||
|
|
||||||
# 4. Check results
|
|
||||||
curl http://localhost:5001/api/admin/stats
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Error Responses
|
|
||||||
|
|
||||||
All endpoints return standard error responses:
|
|
||||||
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"success": false,
|
|
||||||
"error": "Error message here"
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
**Common HTTP Status Codes**:
|
|
||||||
- `200` - Success
|
|
||||||
- `400` - Bad request (invalid parameters)
|
|
||||||
- `500` - Server error
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Security Notes
|
|
||||||
|
|
||||||
⚠️ **Important**: These are admin endpoints and should be protected in production!
|
|
||||||
|
|
||||||
Recommendations:
|
|
||||||
1. Add authentication/authorization
|
|
||||||
2. Rate limiting
|
|
||||||
3. IP whitelisting
|
|
||||||
4. API key requirement
|
|
||||||
5. Audit logging
|
|
||||||
|
|
||||||
Example protection (add to routes):
|
|
||||||
```python
|
|
||||||
from functools import wraps
|
|
||||||
from flask import request
|
|
||||||
|
|
||||||
def require_api_key(f):
|
|
||||||
@wraps(f)
|
|
||||||
def decorated_function(*args, **kwargs):
|
|
||||||
api_key = request.headers.get('X-API-Key')
|
|
||||||
if api_key != os.getenv('ADMIN_API_KEY'):
|
|
||||||
return jsonify({'error': 'Unauthorized'}), 401
|
|
||||||
return f(*args, **kwargs)
|
|
||||||
return decorated_function
|
|
||||||
|
|
||||||
@admin_bp.route('/api/admin/trigger-crawl', methods=['POST'])
|
|
||||||
@require_api_key
|
|
||||||
def trigger_crawl():
|
|
||||||
# ... endpoint code
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Related Endpoints
|
|
||||||
|
|
||||||
- **[Newsletter Preview](../backend/routes/newsletter_routes.py)**: `/api/newsletter/preview` - Preview newsletter HTML
|
|
||||||
- **[Analytics](API.md)**: `/api/analytics/*` - View engagement metrics
|
|
||||||
- **[RSS Feeds](API.md)**: `/api/rss-feeds` - Manage RSS feeds
|
|
||||||
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Newsletter API Summary
|
|
||||||
|
|
||||||
### Available Endpoints
|
|
||||||
|
|
||||||
| Endpoint | Purpose | Recipient |
|
|
||||||
|----------|---------|-----------|
|
|
||||||
| `/api/admin/send-test-email` | Test newsletter | Single email (specified) |
|
|
||||||
| `/api/admin/send-newsletter` | Production send | All active subscribers |
|
|
||||||
| `/api/admin/trigger-crawl` | Fetch articles | N/A |
|
|
||||||
| `/api/admin/stats` | System stats | N/A |
|
|
||||||
|
|
||||||
### Subscriber Status
|
|
||||||
|
|
||||||
The system uses a `status` field to determine who receives newsletters:
|
|
||||||
- **`active`** - Receives newsletters ✅
|
|
||||||
- **`inactive`** - Does not receive newsletters ❌
|
|
||||||
|
|
||||||
See [SUBSCRIBER_STATUS.md](SUBSCRIBER_STATUS.md) for details.
|
|
||||||
|
|
||||||
### Quick Examples
|
|
||||||
|
|
||||||
**Send to all subscribers:**
|
|
||||||
```bash
|
|
||||||
curl -X POST http://localhost:5001/api/admin/send-newsletter \
|
|
||||||
-H "Content-Type: application/json" \
|
|
||||||
-d '{"max_articles": 10}'
|
|
||||||
```
|
|
||||||
|
|
||||||
**Send test email:**
|
|
||||||
```bash
|
|
||||||
curl -X POST http://localhost:5001/api/admin/send-test-email \
|
|
||||||
-H "Content-Type: application/json" \
|
|
||||||
-d '{"email": "test@example.com"}'
|
|
||||||
```
|
|
||||||
|
|
||||||
**Check stats:**
|
|
||||||
```bash
|
|
||||||
curl http://localhost:5001/api/admin/stats | jq '.subscribers'
|
|
||||||
```
|
|
||||||
|
|
||||||
### Testing
|
|
||||||
|
|
||||||
Use the test script:
|
|
||||||
```bash
|
|
||||||
./test-newsletter-api.sh
|
|
||||||
```
|
|
||||||
317
docs/AI_NEWS_AGGREGATION.md
Normal file
317
docs/AI_NEWS_AGGREGATION.md
Normal file
@@ -0,0 +1,317 @@
|
|||||||
|
# AI-Powered News Aggregation - COMPLETE ✅
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
Successfully implemented a complete AI-powered news aggregation system that detects duplicate stories from multiple sources and generates neutral, balanced summaries.
|
||||||
|
|
||||||
|
## Features Implemented
|
||||||
|
|
||||||
|
### 1. AI-Powered Article Clustering ✅
|
||||||
|
**What it does:**
|
||||||
|
- Automatically detects when different news sources cover the same story
|
||||||
|
- Uses Ollama AI to intelligently compare article content
|
||||||
|
- Groups related articles by `cluster_id`
|
||||||
|
- Marks the first article as `is_primary: true`
|
||||||
|
|
||||||
|
**How it works:**
|
||||||
|
- Compares articles published within 24 hours
|
||||||
|
- Uses AI prompt: "Are these two articles about the same story?"
|
||||||
|
- Falls back to keyword matching if AI fails
|
||||||
|
- Real-time clustering during crawl
|
||||||
|
|
||||||
|
**Test Results:**
|
||||||
|
- ✅ Housing story from 2 sources → Clustered together
|
||||||
|
- ✅ Bayern transfer from 2 sources → Clustered together
|
||||||
|
- ✅ Different stories → Separate clusters
|
||||||
|
|
||||||
|
### 2. Neutral Summary Generation ✅
|
||||||
|
**What it does:**
|
||||||
|
- Synthesizes multiple articles into one balanced summary
|
||||||
|
- Combines perspectives from all sources
|
||||||
|
- Highlights agreements and differences
|
||||||
|
- Maintains neutral, objective tone
|
||||||
|
|
||||||
|
**How it works:**
|
||||||
|
- Takes all articles in a cluster
|
||||||
|
- Sends combined context to Ollama
|
||||||
|
- AI generates ~200-word neutral summary
|
||||||
|
- Saves to `cluster_summaries` collection
|
||||||
|
|
||||||
|
**Test Results:**
|
||||||
|
```
|
||||||
|
Bayern Transfer Story (2 sources):
|
||||||
|
"Bayern Munich has recently signed Brazilian footballer, aged 23,
|
||||||
|
for €50 million to bolster their attacking lineup as per reports
|
||||||
|
from abendzeitung-muenchen and sueddeutsche. The new addition is
|
||||||
|
expected to inject much-needed dynamism into the team's offense..."
|
||||||
|
```
|
||||||
|
|
||||||
|
### 3. Smart Prioritization ✅
|
||||||
|
**What it does:**
|
||||||
|
- Prioritizes stories covered by multiple sources (more important)
|
||||||
|
- Shows multi-source stories first with neutral summaries
|
||||||
|
- Fills remaining slots with single-source stories
|
||||||
|
|
||||||
|
**Sorting Logic:**
|
||||||
|
1. **Primary sort:** Number of sources (descending)
|
||||||
|
2. **Secondary sort:** Publish date (newest first)
|
||||||
|
|
||||||
|
**Example Output:**
|
||||||
|
```
|
||||||
|
1. Munich Housing (2 sources) → Neutral summary
|
||||||
|
2. Bayern Transfer (2 sources) → Neutral summary
|
||||||
|
3. Local story (1 source) → Individual summary
|
||||||
|
4. Local story (1 source) → Individual summary
|
||||||
|
...
|
||||||
|
```
|
||||||
|
|
||||||
|
## Database Schema
|
||||||
|
|
||||||
|
### Articles Collection
|
||||||
|
```javascript
|
||||||
|
{
|
||||||
|
_id: ObjectId("..."),
|
||||||
|
title: "München: Stadtrat beschließt...",
|
||||||
|
content: "Full article text...",
|
||||||
|
summary: "AI-generated summary...",
|
||||||
|
source: "abendzeitung-muenchen",
|
||||||
|
link: "https://...",
|
||||||
|
published_at: ISODate("2025-11-12T..."),
|
||||||
|
|
||||||
|
// Clustering fields
|
||||||
|
cluster_id: "1762937577.365818",
|
||||||
|
is_primary: true,
|
||||||
|
|
||||||
|
// Metadata
|
||||||
|
word_count: 450,
|
||||||
|
summary_word_count: 120,
|
||||||
|
category: "local",
|
||||||
|
crawled_at: ISODate("..."),
|
||||||
|
summarized_at: ISODate("...")
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
### Cluster Summaries Collection
|
||||||
|
```javascript
|
||||||
|
{
|
||||||
|
_id: ObjectId("..."),
|
||||||
|
cluster_id: "1762937577.365818",
|
||||||
|
neutral_summary: "Combined neutral summary from all sources...",
|
||||||
|
sources: ["abendzeitung-muenchen", "sueddeutsche"],
|
||||||
|
article_count: 2,
|
||||||
|
created_at: ISODate("2025-11-12T..."),
|
||||||
|
updated_at: ISODate("2025-11-12T...")
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
## API Endpoints
|
||||||
|
|
||||||
|
### Get All Articles (Default)
|
||||||
|
```bash
|
||||||
|
GET /api/news
|
||||||
|
```
|
||||||
|
Returns all articles individually (current behavior)
|
||||||
|
|
||||||
|
### Get Clustered Articles (Recommended)
|
||||||
|
```bash
|
||||||
|
GET /api/news?mode=clustered&limit=10
|
||||||
|
```
|
||||||
|
Returns:
|
||||||
|
- One article per story
|
||||||
|
- Multi-source stories with neutral summaries first
|
||||||
|
- Single-source stories with individual summaries
|
||||||
|
- Smart prioritization by popularity
|
||||||
|
|
||||||
|
**Response Format:**
|
||||||
|
```javascript
|
||||||
|
{
|
||||||
|
"articles": [
|
||||||
|
{
|
||||||
|
"title": "...",
|
||||||
|
"summary": "Neutral summary combining all sources...",
|
||||||
|
"summary_type": "neutral",
|
||||||
|
"is_clustered": true,
|
||||||
|
"source_count": 2,
|
||||||
|
"sources": ["source1", "source2"],
|
||||||
|
"related_articles": [
|
||||||
|
{"source": "source2", "title": "...", "link": "..."}
|
||||||
|
]
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"mode": "clustered"
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
### Get Statistics
|
||||||
|
```bash
|
||||||
|
GET /api/stats
|
||||||
|
```
|
||||||
|
Returns:
|
||||||
|
```javascript
|
||||||
|
{
|
||||||
|
"articles": 51,
|
||||||
|
"crawled_articles": 45,
|
||||||
|
"summarized_articles": 40,
|
||||||
|
"clustered_articles": 47,
|
||||||
|
"neutral_summaries": 3
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
## Workflow
|
||||||
|
|
||||||
|
### Complete Crawl Process
|
||||||
|
1. **Crawl RSS feeds** from multiple sources
|
||||||
|
2. **Extract full content** from article URLs
|
||||||
|
3. **Generate AI summaries** for each article
|
||||||
|
4. **Cluster similar articles** using AI comparison
|
||||||
|
5. **Generate neutral summaries** for multi-source clusters
|
||||||
|
6. **Save everything** to MongoDB
|
||||||
|
|
||||||
|
### Time Windows
|
||||||
|
- **Clustering window:** 24 hours (rolling)
|
||||||
|
- **Crawl schedule:** Daily at 6:00 AM Berlin time
|
||||||
|
- **Manual trigger:** Available via crawler service
|
||||||
|
|
||||||
|
## Configuration
|
||||||
|
|
||||||
|
### Environment Variables
|
||||||
|
```bash
|
||||||
|
# Ollama AI
|
||||||
|
OLLAMA_BASE_URL=http://ollama:11434
|
||||||
|
OLLAMA_MODEL=phi3:latest
|
||||||
|
OLLAMA_ENABLED=true
|
||||||
|
OLLAMA_TIMEOUT=120
|
||||||
|
|
||||||
|
# Clustering
|
||||||
|
CLUSTERING_TIME_WINDOW=24 # hours
|
||||||
|
CLUSTERING_SIMILARITY_THRESHOLD=0.50
|
||||||
|
|
||||||
|
# Summaries
|
||||||
|
SUMMARY_MAX_WORDS=150 # individual
|
||||||
|
NEUTRAL_SUMMARY_MAX_WORDS=200 # cluster
|
||||||
|
```
|
||||||
|
|
||||||
|
## Files Created/Modified
|
||||||
|
|
||||||
|
### New Files
|
||||||
|
- `news_crawler/article_clustering.py` - AI clustering logic
|
||||||
|
- `news_crawler/cluster_summarizer.py` - Neutral summary generation
|
||||||
|
- `test-clustering-real.py` - Clustering tests
|
||||||
|
- `test-neutral-summaries.py` - Summary generation tests
|
||||||
|
- `test-complete-workflow.py` - End-to-end tests
|
||||||
|
|
||||||
|
### Modified Files
|
||||||
|
- `news_crawler/crawler_service.py` - Added clustering + summarization
|
||||||
|
- `news_crawler/ollama_client.py` - Added `generate()` method
|
||||||
|
- `backend/routes/news_routes.py` - Added clustered endpoint with prioritization
|
||||||
|
|
||||||
|
## Performance
|
||||||
|
|
||||||
|
### Metrics
|
||||||
|
- **Clustering:** ~20-40s per article pair (AI comparison)
|
||||||
|
- **Neutral summary:** ~30-40s per cluster
|
||||||
|
- **Success rate:** 100% in tests
|
||||||
|
- **Accuracy:** High - correctly identifies same/different stories
|
||||||
|
|
||||||
|
### Optimization
|
||||||
|
- Clustering runs during crawl (real-time)
|
||||||
|
- Neutral summaries generated after crawl (batch)
|
||||||
|
- Results cached in database
|
||||||
|
- 24-hour time window limits comparisons
|
||||||
|
|
||||||
|
## Testing
|
||||||
|
|
||||||
|
### Test Coverage
|
||||||
|
✅ AI clustering with same stories
|
||||||
|
✅ AI clustering with different stories
|
||||||
|
✅ Neutral summary generation
|
||||||
|
✅ Multi-source prioritization
|
||||||
|
✅ Database integration
|
||||||
|
✅ End-to-end workflow
|
||||||
|
|
||||||
|
### Test Commands
|
||||||
|
```bash
|
||||||
|
# Test clustering
|
||||||
|
docker-compose exec crawler python /app/test-clustering-real.py
|
||||||
|
|
||||||
|
# Test neutral summaries
|
||||||
|
docker-compose exec crawler python /app/test-neutral-summaries.py
|
||||||
|
|
||||||
|
# Test complete workflow
|
||||||
|
docker-compose exec crawler python /app/test-complete-workflow.py
|
||||||
|
```
|
||||||
|
|
||||||
|
## Benefits
|
||||||
|
|
||||||
|
### For Users
|
||||||
|
- ✅ **No duplicate stories** - See each story once
|
||||||
|
- ✅ **Balanced coverage** - Multiple perspectives combined
|
||||||
|
- ✅ **Prioritized content** - Important stories first
|
||||||
|
- ✅ **Source transparency** - See all sources covering a story
|
||||||
|
- ✅ **Efficient reading** - One summary instead of multiple articles
|
||||||
|
|
||||||
|
### For the System
|
||||||
|
- ✅ **Intelligent deduplication** - AI-powered, not just URL matching
|
||||||
|
- ✅ **Scalable** - Works with any number of sources
|
||||||
|
- ✅ **Flexible** - 24-hour time window catches late-breaking news
|
||||||
|
- ✅ **Reliable** - Fallback mechanisms if AI fails
|
||||||
|
- ✅ **Maintainable** - Clear separation of concerns
|
||||||
|
|
||||||
|
## Future Enhancements
|
||||||
|
|
||||||
|
### Potential Improvements
|
||||||
|
1. **Update summaries** when new articles join a cluster
|
||||||
|
2. **Summary versioning** to track changes over time
|
||||||
|
3. **Quality scoring** for generated summaries
|
||||||
|
4. **Multi-language support** for summaries
|
||||||
|
5. **Sentiment analysis** across sources
|
||||||
|
6. **Fact extraction** and verification
|
||||||
|
7. **Trending topics** detection
|
||||||
|
8. **User preferences** for source weighting
|
||||||
|
|
||||||
|
### Integration Ideas
|
||||||
|
- Email newsletters with neutral summaries
|
||||||
|
- Push notifications for multi-source stories
|
||||||
|
- RSS feed of clustered articles
|
||||||
|
- API for third-party apps
|
||||||
|
- Analytics dashboard
|
||||||
|
|
||||||
|
## Conclusion
|
||||||
|
|
||||||
|
The Munich News Aggregator now provides:
|
||||||
|
1. ✅ **Smart clustering** - AI detects duplicate stories
|
||||||
|
2. ✅ **Neutral summaries** - Balanced multi-source coverage
|
||||||
|
3. ✅ **Smart prioritization** - Important stories first
|
||||||
|
4. ✅ **Source transparency** - See all perspectives
|
||||||
|
5. ✅ **Efficient delivery** - One summary per story
|
||||||
|
|
||||||
|
**Result:** Users get comprehensive, balanced news coverage without information overload!
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Quick Start
|
||||||
|
|
||||||
|
### View Clustered News
|
||||||
|
```bash
|
||||||
|
curl "http://localhost:5001/api/news?mode=clustered&limit=10"
|
||||||
|
```
|
||||||
|
|
||||||
|
### Trigger Manual Crawl
|
||||||
|
```bash
|
||||||
|
docker-compose exec crawler python /app/scheduled_crawler.py
|
||||||
|
```
|
||||||
|
|
||||||
|
### Check Statistics
|
||||||
|
```bash
|
||||||
|
curl "http://localhost:5001/api/stats"
|
||||||
|
```
|
||||||
|
|
||||||
|
### View Cluster Summaries in Database
|
||||||
|
```bash
|
||||||
|
docker-compose exec mongodb mongosh -u admin -p changeme --authenticationDatabase admin munich_news --eval "db.cluster_summaries.find().pretty()"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Status:** ✅ Production Ready
|
||||||
|
**Last Updated:** November 12, 2025
|
||||||
|
**Version:** 2.0 (AI-Powered)
|
||||||
379
docs/API.md
379
docs/API.md
@@ -1,214 +1,248 @@
|
|||||||
# API Reference
|
# API Reference
|
||||||
|
|
||||||
## Tracking Endpoints
|
Complete API documentation for Munich News Daily.
|
||||||
|
|
||||||
### Track Email Open
|
---
|
||||||
|
|
||||||
|
## Admin API
|
||||||
|
|
||||||
|
Base URL: `http://localhost:5001`
|
||||||
|
|
||||||
|
### Trigger Crawler
|
||||||
|
|
||||||
|
Manually fetch new articles.
|
||||||
|
|
||||||
```http
|
```http
|
||||||
GET /api/track/pixel/<tracking_id>
|
POST /api/admin/trigger-crawl
|
||||||
```
|
Content-Type: application/json
|
||||||
|
|
||||||
Returns a 1x1 transparent PNG and logs the email open event.
|
|
||||||
|
|
||||||
**Response**: Image (image/png)
|
|
||||||
|
|
||||||
### Track Link Click
|
|
||||||
|
|
||||||
```http
|
|
||||||
GET /api/track/click/<tracking_id>
|
|
||||||
```
|
|
||||||
|
|
||||||
Logs the click event and redirects to the original article URL.
|
|
||||||
|
|
||||||
**Response**: 302 Redirect
|
|
||||||
|
|
||||||
## Analytics Endpoints
|
|
||||||
|
|
||||||
### Get Newsletter Metrics
|
|
||||||
|
|
||||||
```http
|
|
||||||
GET /api/analytics/newsletter/<newsletter_id>
|
|
||||||
```
|
|
||||||
|
|
||||||
Returns comprehensive metrics for a specific newsletter.
|
|
||||||
|
|
||||||
**Response**:
|
|
||||||
```json
|
|
||||||
{
|
{
|
||||||
"newsletter_id": "2024-01-15",
|
"max_articles": 10
|
||||||
"total_sent": 100,
|
|
||||||
"total_opened": 75,
|
|
||||||
"open_rate": 75.0,
|
|
||||||
"unique_openers": 70,
|
|
||||||
"total_clicks": 45,
|
|
||||||
"unique_clickers": 30,
|
|
||||||
"click_through_rate": 30.0
|
|
||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
### Get Article Performance
|
**Example:**
|
||||||
|
```bash
|
||||||
```http
|
curl -X POST http://localhost:5001/api/admin/trigger-crawl \
|
||||||
GET /api/analytics/article/<article_url>
|
-H "Content-Type: application/json" \
|
||||||
|
-d '{"max_articles": 5}'
|
||||||
```
|
```
|
||||||
|
|
||||||
Returns performance metrics for a specific article.
|
---
|
||||||
|
|
||||||
|
### Send Test Email
|
||||||
|
|
||||||
|
Send newsletter to specific email.
|
||||||
|
|
||||||
|
```http
|
||||||
|
POST /api/admin/send-test-email
|
||||||
|
Content-Type: application/json
|
||||||
|
|
||||||
**Response**:
|
|
||||||
```json
|
|
||||||
{
|
{
|
||||||
"article_url": "https://example.com/article",
|
"email": "test@example.com",
|
||||||
"total_sent": 100,
|
"max_articles": 10
|
||||||
"total_clicks": 25,
|
|
||||||
"click_rate": 25.0,
|
|
||||||
"unique_clickers": 20,
|
|
||||||
"newsletters": ["2024-01-15", "2024-01-16"]
|
|
||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
### Get Subscriber Activity
|
**Example:**
|
||||||
|
```bash
|
||||||
```http
|
curl -X POST http://localhost:5001/api/admin/send-test-email \
|
||||||
GET /api/analytics/subscriber/<email>
|
-H "Content-Type: application/json" \
|
||||||
|
-d '{"email": "your@email.com"}'
|
||||||
```
|
```
|
||||||
|
|
||||||
Returns activity status and engagement metrics for a subscriber.
|
---
|
||||||
|
|
||||||
|
### Send Newsletter to All Subscribers
|
||||||
|
|
||||||
|
Send newsletter to all active subscribers.
|
||||||
|
|
||||||
|
```http
|
||||||
|
POST /api/admin/send-newsletter
|
||||||
|
Content-Type: application/json
|
||||||
|
|
||||||
**Response**:
|
|
||||||
```json
|
|
||||||
{
|
{
|
||||||
"email": "user@example.com",
|
"max_articles": 10
|
||||||
"status": "active",
|
|
||||||
"last_opened_at": "2024-01-15T10:30:00",
|
|
||||||
"last_clicked_at": "2024-01-15T10:35:00",
|
|
||||||
"total_opens": 45,
|
|
||||||
"total_clicks": 20,
|
|
||||||
"newsletters_received": 50,
|
|
||||||
"newsletters_opened": 45
|
|
||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
## Privacy Endpoints
|
**Response:**
|
||||||
|
|
||||||
### Delete Subscriber Data
|
|
||||||
|
|
||||||
```http
|
|
||||||
DELETE /api/tracking/subscriber/<email>
|
|
||||||
```
|
|
||||||
|
|
||||||
Deletes all tracking data for a subscriber (GDPR compliance).
|
|
||||||
|
|
||||||
**Response**:
|
|
||||||
```json
|
```json
|
||||||
{
|
{
|
||||||
"success": true,
|
"success": true,
|
||||||
"message": "All tracking data deleted for user@example.com",
|
"message": "Newsletter sent successfully to 45 subscribers",
|
||||||
"deleted_counts": {
|
"subscriber_count": 45,
|
||||||
"newsletter_sends": 50,
|
"max_articles": 10
|
||||||
"link_clicks": 25,
|
}
|
||||||
"subscriber_activity": 1
|
```
|
||||||
|
|
||||||
|
**Example:**
|
||||||
|
```bash
|
||||||
|
curl -X POST http://localhost:5001/api/admin/send-newsletter \
|
||||||
|
-H "Content-Type: application/json" \
|
||||||
|
-d '{"max_articles": 10}'
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Get System Stats
|
||||||
|
|
||||||
|
```http
|
||||||
|
GET /api/admin/stats
|
||||||
|
```
|
||||||
|
|
||||||
|
**Response:**
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"articles": {
|
||||||
|
"total": 150,
|
||||||
|
"with_summary": 120,
|
||||||
|
"today": 15
|
||||||
|
},
|
||||||
|
"subscribers": {
|
||||||
|
"total": 50,
|
||||||
|
"active": 45
|
||||||
|
},
|
||||||
|
"rss_feeds": {
|
||||||
|
"total": 4,
|
||||||
|
"active": 4
|
||||||
|
},
|
||||||
|
"tracking": {
|
||||||
|
"total_sends": 200,
|
||||||
|
"total_opens": 150,
|
||||||
|
"total_clicks": 75
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
### Anonymize Old Data
|
**Example:**
|
||||||
|
```bash
|
||||||
|
curl http://localhost:5001/api/admin/stats
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Public API
|
||||||
|
|
||||||
|
### Subscribe
|
||||||
|
|
||||||
```http
|
```http
|
||||||
POST /api/tracking/anonymize
|
POST /api/subscribe
|
||||||
```
|
Content-Type: application/json
|
||||||
|
|
||||||
Anonymizes tracking data older than the retention period.
|
|
||||||
|
|
||||||
**Request Body** (optional):
|
|
||||||
```json
|
|
||||||
{
|
{
|
||||||
"retention_days": 90
|
"email": "user@example.com"
|
||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
**Response**:
|
**Example:**
|
||||||
```json
|
```bash
|
||||||
{
|
curl -X POST http://localhost:5001/api/subscribe \
|
||||||
"success": true,
|
-H "Content-Type: application/json" \
|
||||||
"message": "Anonymized tracking data older than 90 days",
|
-d '{"email": "user@example.com"}'
|
||||||
"anonymized_counts": {
|
|
||||||
"newsletter_sends": 1250,
|
|
||||||
"link_clicks": 650
|
|
||||||
}
|
|
||||||
}
|
|
||||||
```
|
```
|
||||||
|
|
||||||
### Opt Out of Tracking
|
---
|
||||||
|
|
||||||
|
### Unsubscribe
|
||||||
|
|
||||||
```http
|
```http
|
||||||
POST /api/tracking/subscriber/<email>/opt-out
|
POST /api/unsubscribe
|
||||||
```
|
Content-Type: application/json
|
||||||
|
|
||||||
Disables tracking for a subscriber.
|
|
||||||
|
|
||||||
**Response**:
|
|
||||||
```json
|
|
||||||
{
|
{
|
||||||
"success": true,
|
"email": "user@example.com"
|
||||||
"message": "Subscriber user@example.com has opted out of tracking"
|
|
||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
### Opt In to Tracking
|
**Example:**
|
||||||
|
```bash
|
||||||
```http
|
curl -X POST http://localhost:5001/api/unsubscribe \
|
||||||
POST /api/tracking/subscriber/<email>/opt-in
|
-H "Content-Type: application/json" \
|
||||||
|
-d '{"email": "user@example.com"}'
|
||||||
```
|
```
|
||||||
|
|
||||||
Re-enables tracking for a subscriber.
|
---
|
||||||
|
|
||||||
**Response**:
|
## Subscriber Status System
|
||||||
```json
|
|
||||||
|
### Status Values
|
||||||
|
|
||||||
|
| Status | Description | Receives Newsletters |
|
||||||
|
|--------|-------------|---------------------|
|
||||||
|
| `active` | Subscribed | ✅ Yes |
|
||||||
|
| `inactive` | Unsubscribed | ❌ No |
|
||||||
|
|
||||||
|
### Database Schema
|
||||||
|
|
||||||
|
```javascript
|
||||||
{
|
{
|
||||||
"success": true,
|
_id: ObjectId("..."),
|
||||||
"message": "Subscriber user@example.com has opted in to tracking"
|
email: "user@example.com",
|
||||||
|
subscribed_at: ISODate("2025-11-11T15:50:29.478Z"),
|
||||||
|
status: "active" // or "inactive"
|
||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
## Examples
|
### How It Works
|
||||||
|
|
||||||
### Using curl
|
**Subscribe:**
|
||||||
|
- Creates subscriber with `status: 'active'`
|
||||||
|
- If already exists and inactive, reactivates
|
||||||
|
|
||||||
|
**Unsubscribe:**
|
||||||
|
- Updates `status: 'inactive'`
|
||||||
|
- Subscriber data preserved (soft delete)
|
||||||
|
|
||||||
|
**Newsletter Sending:**
|
||||||
|
- Only sends to `status: 'active'` subscribers
|
||||||
|
- Query: `{status: 'active'}`
|
||||||
|
|
||||||
|
### Check Active Subscribers
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Get newsletter metrics
|
curl http://localhost:5001/api/admin/stats | jq '.subscribers'
|
||||||
curl http://localhost:5001/api/analytics/newsletter/2024-01-15
|
# Output: {"total": 10, "active": 8}
|
||||||
|
```
|
||||||
|
|
||||||
# Delete subscriber data
|
---
|
||||||
curl -X DELETE http://localhost:5001/api/tracking/subscriber/user@example.com
|
|
||||||
|
|
||||||
# Anonymize old data
|
## Workflows
|
||||||
curl -X POST http://localhost:5001/api/tracking/anonymize \
|
|
||||||
|
### Complete Newsletter Workflow
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# 1. Check stats
|
||||||
|
curl http://localhost:5001/api/admin/stats
|
||||||
|
|
||||||
|
# 2. Crawl articles
|
||||||
|
curl -X POST http://localhost:5001/api/admin/trigger-crawl \
|
||||||
-H "Content-Type: application/json" \
|
-H "Content-Type: application/json" \
|
||||||
-d '{"retention_days": 90}'
|
-d '{"max_articles": 10}'
|
||||||
|
|
||||||
# Opt out of tracking
|
# 3. Wait for crawl
|
||||||
curl -X POST http://localhost:5001/api/tracking/subscriber/user@example.com/opt-out
|
sleep 60
|
||||||
|
|
||||||
|
# 4. Send newsletter
|
||||||
|
curl -X POST http://localhost:5001/api/admin/send-newsletter \
|
||||||
|
-H "Content-Type: application/json" \
|
||||||
|
-d '{"max_articles": 10}'
|
||||||
```
|
```
|
||||||
|
|
||||||
### Using Python
|
### Test Newsletter
|
||||||
|
|
||||||
```python
|
```bash
|
||||||
import requests
|
# Send test to yourself
|
||||||
|
curl -X POST http://localhost:5001/api/admin/send-test-email \
|
||||||
# Get newsletter metrics
|
-H "Content-Type: application/json" \
|
||||||
response = requests.get('http://localhost:5001/api/analytics/newsletter/2024-01-15')
|
-d '{"email": "your@email.com", "max_articles": 3}'
|
||||||
metrics = response.json()
|
|
||||||
print(f"Open rate: {metrics['open_rate']}%")
|
|
||||||
|
|
||||||
# Delete subscriber data
|
|
||||||
response = requests.delete('http://localhost:5001/api/tracking/subscriber/user@example.com')
|
|
||||||
result = response.json()
|
|
||||||
print(result['message'])
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
## Error Responses
|
## Error Responses
|
||||||
|
|
||||||
All endpoints return standard error responses:
|
All endpoints return standard error format:
|
||||||
|
|
||||||
```json
|
```json
|
||||||
{
|
{
|
||||||
@@ -217,7 +251,64 @@ All endpoints return standard error responses:
|
|||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
HTTP Status Codes:
|
**HTTP Status Codes:**
|
||||||
- `200` - Success
|
- `200` - Success
|
||||||
|
- `400` - Bad request
|
||||||
- `404` - Not found
|
- `404` - Not found
|
||||||
- `500` - Server error
|
- `500` - Server error
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Security
|
||||||
|
|
||||||
|
⚠️ **Production Recommendations:**
|
||||||
|
|
||||||
|
1. **Add Authentication**
|
||||||
|
```python
|
||||||
|
@require_api_key
|
||||||
|
def admin_endpoint():
|
||||||
|
# ...
|
||||||
|
```
|
||||||
|
|
||||||
|
2. **Rate Limiting**
|
||||||
|
- Prevent abuse
|
||||||
|
- Limit newsletter sends
|
||||||
|
|
||||||
|
3. **IP Whitelisting**
|
||||||
|
- Restrict admin endpoints
|
||||||
|
- Use firewall rules
|
||||||
|
|
||||||
|
4. **HTTPS Only**
|
||||||
|
- Use reverse proxy
|
||||||
|
- SSL/TLS certificates
|
||||||
|
|
||||||
|
5. **Audit Logging**
|
||||||
|
- Log all admin actions
|
||||||
|
- Monitor for suspicious activity
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Testing
|
||||||
|
|
||||||
|
Use the test script:
|
||||||
|
```bash
|
||||||
|
./test-newsletter-api.sh
|
||||||
|
```
|
||||||
|
|
||||||
|
Or test manually:
|
||||||
|
```bash
|
||||||
|
# Health check
|
||||||
|
curl http://localhost:5001/health
|
||||||
|
|
||||||
|
# Stats
|
||||||
|
curl http://localhost:5001/api/admin/stats
|
||||||
|
|
||||||
|
# Test email
|
||||||
|
curl -X POST http://localhost:5001/api/admin/send-test-email \
|
||||||
|
-H "Content-Type: application/json" \
|
||||||
|
-d '{"email": "test@example.com"}'
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
See [SETUP.md](SETUP.md) for configuration and [SECURITY.md](SECURITY.md) for security best practices.
|
||||||
|
|||||||
@@ -1,131 +1,439 @@
|
|||||||
# System Architecture
|
# System Architecture
|
||||||
|
|
||||||
|
Complete system design and architecture documentation.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
## Overview
|
## Overview
|
||||||
|
|
||||||
Munich News Daily is a fully automated news aggregation and newsletter system with the following components:
|
Munich News Daily is an automated news aggregation system that crawls Munich news sources, generates AI summaries, and sends daily newsletters.
|
||||||
|
|
||||||
```
|
```
|
||||||
┌─────────────────────────────────────────────────────────────────┐
|
┌─────────────────────────────────────────────────────────┐
|
||||||
│ Munich News Daily System │
|
│ Docker Network │
|
||||||
└─────────────────────────────────────────────────────────────────┘
|
│ (Internal Only) │
|
||||||
|
├─────────────────────────────────────────────────────────┤
|
||||||
6:00 AM Berlin → News Crawler
|
│ │
|
||||||
↓
|
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
|
||||||
Fetches RSS feeds
|
│ │ MongoDB │◄───┤ Backend │◄───┤ Crawler │ │
|
||||||
Extracts full content
|
│ │ (27017) │ │ (5001) │ │ │ │
|
||||||
Generates AI summaries
|
│ └──────────┘ └────┬─────┘ └──────────┘ │
|
||||||
Saves to MongoDB
|
│ │ │
|
||||||
↓
|
│ ┌──────────┐ │ ┌──────────┐ │
|
||||||
7:00 AM Berlin → Newsletter Sender
|
│ │ Ollama │◄────────┤ │ Sender │ │
|
||||||
↓
|
│ │ (11434) │ │ │ │ │
|
||||||
Waits for crawler
|
│ └──────────┘ │ └──────────┘ │
|
||||||
Fetches articles
|
│ │ │
|
||||||
Generates newsletter
|
└───────────────────────┼───────────────────────────────────┘
|
||||||
Sends to subscribers
|
│
|
||||||
↓
|
│ Port 5001 (Only exposed port)
|
||||||
✅ Done!
|
▼
|
||||||
|
Host Machine
|
||||||
|
External Network
|
||||||
```
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
## Components
|
## Components
|
||||||
|
|
||||||
### 1. MongoDB Database
|
### 1. MongoDB (Database)
|
||||||
- **Purpose**: Central data storage
|
- **Purpose**: Store articles, subscribers, tracking data
|
||||||
- **Collections**:
|
- **Port**: 27017 (internal only)
|
||||||
- `articles`: News articles with summaries
|
- **Access**: Only via Docker network
|
||||||
- `subscribers`: Email subscribers
|
- **Authentication**: Username/password
|
||||||
- `rss_feeds`: RSS feed sources
|
|
||||||
- `newsletter_sends`: Email tracking data
|
|
||||||
- `link_clicks`: Link click tracking
|
|
||||||
- `subscriber_activity`: Engagement metrics
|
|
||||||
|
|
||||||
### 2. News Crawler
|
**Collections:**
|
||||||
- **Schedule**: Daily at 6:00 AM Berlin time
|
- `articles` - News articles with summaries
|
||||||
- **Functions**:
|
- `subscribers` - Newsletter subscribers
|
||||||
- Fetches articles from RSS feeds
|
- `rss_feeds` - RSS feed sources
|
||||||
- Extracts full article content
|
- `newsletter_sends` - Send tracking
|
||||||
- Generates AI summaries using Ollama
|
- `link_clicks` - Click tracking
|
||||||
- Saves to MongoDB
|
|
||||||
- **Technology**: Python, BeautifulSoup, Ollama
|
|
||||||
|
|
||||||
### 3. Newsletter Sender
|
### 2. Backend API (Flask)
|
||||||
- **Schedule**: Daily at 7:00 AM Berlin time
|
- **Purpose**: API endpoints, tracking, analytics
|
||||||
- **Functions**:
|
- **Port**: 5001 (exposed to host)
|
||||||
- Waits for crawler to finish (max 30 min)
|
- **Access**: Public API, admin endpoints
|
||||||
- Fetches today's articles
|
- **Features**: Tracking pixels, click tracking, admin operations
|
||||||
- Generates HTML newsletter
|
|
||||||
- Injects tracking pixels
|
|
||||||
- Sends to all subscribers
|
|
||||||
- **Technology**: Python, Jinja2, SMTP
|
|
||||||
|
|
||||||
### 4. Backend API (Optional)
|
**Key Endpoints:**
|
||||||
- **Purpose**: Tracking and analytics
|
- `/api/admin/*` - Admin operations
|
||||||
- **Endpoints**:
|
- `/api/subscribe` - Subscribe to newsletter
|
||||||
- `/api/track/pixel/<id>` - Email open tracking
|
- `/api/tracking/*` - Tracking endpoints
|
||||||
- `/api/track/click/<id>` - Link click tracking
|
- `/health` - Health check
|
||||||
- `/api/analytics/*` - Engagement metrics
|
|
||||||
- `/api/tracking/*` - Privacy controls
|
### 3. Ollama (AI Service)
|
||||||
- **Technology**: Flask, Python
|
- **Purpose**: AI summarization and translation
|
||||||
|
- **Port**: 11434 (internal only)
|
||||||
|
- **Model**: phi3:latest (2.2GB)
|
||||||
|
- **GPU**: Optional NVIDIA GPU support
|
||||||
|
|
||||||
|
**Features:**
|
||||||
|
- Article summarization (150 words)
|
||||||
|
- Title translation (German → English)
|
||||||
|
- Configurable timeout and model
|
||||||
|
|
||||||
|
### 4. Crawler (News Fetcher)
|
||||||
|
- **Purpose**: Fetch and process news articles
|
||||||
|
- **Schedule**: 6:00 AM Berlin time (automated)
|
||||||
|
- **Features**: RSS parsing, content extraction, AI processing
|
||||||
|
|
||||||
|
**Process:**
|
||||||
|
1. Fetch RSS feeds
|
||||||
|
2. Extract article content
|
||||||
|
3. Translate title (German → English)
|
||||||
|
4. Generate AI summary
|
||||||
|
5. Store in MongoDB
|
||||||
|
|
||||||
|
### 5. Sender (Newsletter)
|
||||||
|
- **Purpose**: Send newsletters to subscribers
|
||||||
|
- **Schedule**: 7:00 AM Berlin time (automated)
|
||||||
|
- **Features**: Email sending, tracking, templating
|
||||||
|
|
||||||
|
**Process:**
|
||||||
|
1. Fetch today's articles
|
||||||
|
2. Generate newsletter HTML
|
||||||
|
3. Add tracking pixels/links
|
||||||
|
4. Send to active subscribers
|
||||||
|
5. Record send events
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
## Data Flow
|
## Data Flow
|
||||||
|
|
||||||
|
### Article Processing Flow
|
||||||
|
|
||||||
```
|
```
|
||||||
RSS Feeds → Crawler → MongoDB → Sender → Subscribers
|
RSS Feed
|
||||||
↓
|
↓
|
||||||
Backend API
|
Crawler fetches
|
||||||
↓
|
↓
|
||||||
Analytics
|
Extract content
|
||||||
|
↓
|
||||||
|
Translate title (Ollama)
|
||||||
|
↓
|
||||||
|
Generate summary (Ollama)
|
||||||
|
↓
|
||||||
|
Store in MongoDB
|
||||||
|
↓
|
||||||
|
Newsletter Sender
|
||||||
|
↓
|
||||||
|
Email to subscribers
|
||||||
```
|
```
|
||||||
|
|
||||||
## Coordination
|
### Tracking Flow
|
||||||
|
|
||||||
The sender waits for the crawler to ensure fresh content:
|
```
|
||||||
|
Newsletter sent
|
||||||
|
↓
|
||||||
|
Tracking pixel embedded
|
||||||
|
↓
|
||||||
|
User opens email
|
||||||
|
↓
|
||||||
|
Pixel loaded → Backend API
|
||||||
|
↓
|
||||||
|
Record open event
|
||||||
|
↓
|
||||||
|
User clicks link
|
||||||
|
↓
|
||||||
|
Redirect via Backend API
|
||||||
|
↓
|
||||||
|
Record click event
|
||||||
|
↓
|
||||||
|
Redirect to article
|
||||||
|
```
|
||||||
|
|
||||||
1. Sender starts at 7:00 AM
|
---
|
||||||
2. Checks for recent articles every 30 seconds
|
|
||||||
3. Maximum wait time: 30 minutes
|
## Database Schema
|
||||||
4. Proceeds once crawler finishes or timeout
|
|
||||||
|
### Articles Collection
|
||||||
|
|
||||||
|
```javascript
|
||||||
|
{
|
||||||
|
_id: ObjectId,
|
||||||
|
title: String, // Original German title
|
||||||
|
title_en: String, // English translation
|
||||||
|
translated_at: Date, // Translation timestamp
|
||||||
|
link: String,
|
||||||
|
summary: String, // AI-generated summary
|
||||||
|
content: String, // Full article text
|
||||||
|
author: String,
|
||||||
|
source: String, // RSS feed name
|
||||||
|
published_at: Date,
|
||||||
|
crawled_at: Date,
|
||||||
|
created_at: Date
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
### Subscribers Collection
|
||||||
|
|
||||||
|
```javascript
|
||||||
|
{
|
||||||
|
_id: ObjectId,
|
||||||
|
email: String, // Unique
|
||||||
|
subscribed_at: Date,
|
||||||
|
status: String // 'active' or 'inactive'
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
### RSS Feeds Collection
|
||||||
|
|
||||||
|
```javascript
|
||||||
|
{
|
||||||
|
_id: ObjectId,
|
||||||
|
name: String,
|
||||||
|
url: String,
|
||||||
|
active: Boolean,
|
||||||
|
last_crawled: Date
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Security Architecture
|
||||||
|
|
||||||
|
### Network Isolation
|
||||||
|
|
||||||
|
**Exposed Services:**
|
||||||
|
- Backend API (port 5001) - Only exposed service
|
||||||
|
|
||||||
|
**Internal Services:**
|
||||||
|
- MongoDB (port 27017) - Not accessible from host
|
||||||
|
- Ollama (port 11434) - Not accessible from host
|
||||||
|
- Crawler - No ports
|
||||||
|
- Sender - No ports
|
||||||
|
|
||||||
|
**Benefits:**
|
||||||
|
- 66% reduction in attack surface
|
||||||
|
- Database protected from external access
|
||||||
|
- AI service protected from abuse
|
||||||
|
- Defense in depth
|
||||||
|
|
||||||
|
### Authentication
|
||||||
|
|
||||||
|
**MongoDB:**
|
||||||
|
- Username/password authentication
|
||||||
|
- Credentials in environment variables
|
||||||
|
- Internal network only
|
||||||
|
|
||||||
|
**Backend API:**
|
||||||
|
- No authentication (add in production)
|
||||||
|
- Rate limiting recommended
|
||||||
|
- IP whitelisting recommended
|
||||||
|
|
||||||
|
### Data Protection
|
||||||
|
|
||||||
|
- Subscriber emails stored securely
|
||||||
|
- No sensitive data in logs
|
||||||
|
- Environment variables for secrets
|
||||||
|
- `.env` file in `.gitignore`
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
## Technology Stack
|
## Technology Stack
|
||||||
|
|
||||||
- **Backend**: Python 3.11
|
### Backend
|
||||||
|
- **Language**: Python 3.11
|
||||||
|
- **Framework**: Flask
|
||||||
- **Database**: MongoDB 7.0
|
- **Database**: MongoDB 7.0
|
||||||
- **AI**: Ollama (Phi3 model)
|
- **AI**: Ollama (phi3:latest)
|
||||||
|
|
||||||
|
### Infrastructure
|
||||||
|
- **Containerization**: Docker & Docker Compose
|
||||||
|
- **Networking**: Docker bridge network
|
||||||
|
- **Storage**: Docker volumes
|
||||||
- **Scheduling**: Python schedule library
|
- **Scheduling**: Python schedule library
|
||||||
- **Email**: SMTP with HTML templates
|
|
||||||
- **Tracking**: Pixel tracking + redirect URLs
|
|
||||||
- **Infrastructure**: Docker & Docker Compose
|
|
||||||
|
|
||||||
## Deployment
|
### Libraries
|
||||||
|
- **Web**: Flask, Flask-CORS
|
||||||
|
- **Database**: pymongo
|
||||||
|
- **Email**: smtplib, email.mime
|
||||||
|
- **Scraping**: requests, BeautifulSoup4, feedparser
|
||||||
|
- **Templating**: Jinja2
|
||||||
|
- **AI**: requests (Ollama API)
|
||||||
|
|
||||||
All components run in Docker containers:
|
---
|
||||||
|
|
||||||
|
## Deployment Architecture
|
||||||
|
|
||||||
|
### Development
|
||||||
```
|
```
|
||||||
docker-compose up -d
|
Local Machine
|
||||||
|
├── Docker Compose
|
||||||
|
│ ├── MongoDB (internal)
|
||||||
|
│ ├── Ollama (internal)
|
||||||
|
│ ├── Backend (exposed)
|
||||||
|
│ ├── Crawler (internal)
|
||||||
|
│ └── Sender (internal)
|
||||||
|
└── .env file
|
||||||
```
|
```
|
||||||
|
|
||||||
Containers:
|
### Production
|
||||||
- `munich-news-mongodb` - Database
|
```
|
||||||
- `munich-news-crawler` - Crawler service
|
Server
|
||||||
- `munich-news-sender` - Sender service
|
├── Reverse Proxy (nginx/Traefik)
|
||||||
|
│ ├── SSL/TLS
|
||||||
|
│ ├── Rate limiting
|
||||||
|
│ └── Authentication
|
||||||
|
├── Docker Compose
|
||||||
|
│ ├── MongoDB (internal)
|
||||||
|
│ ├── Ollama (internal, GPU)
|
||||||
|
│ ├── Backend (internal)
|
||||||
|
│ ├── Crawler (internal)
|
||||||
|
│ └── Sender (internal)
|
||||||
|
├── Monitoring
|
||||||
|
│ ├── Logs
|
||||||
|
│ ├── Metrics
|
||||||
|
│ └── Alerts
|
||||||
|
└── Backups
|
||||||
|
├── MongoDB dumps
|
||||||
|
└── Configuration
|
||||||
|
```
|
||||||
|
|
||||||
## Security
|
---
|
||||||
|
|
||||||
- MongoDB authentication enabled
|
|
||||||
- Environment variables for secrets
|
|
||||||
- HTTPS for tracking URLs (production)
|
|
||||||
- GDPR-compliant data retention
|
|
||||||
- Privacy controls (opt-out, deletion)
|
|
||||||
|
|
||||||
## Monitoring
|
|
||||||
|
|
||||||
- Docker logs for all services
|
|
||||||
- MongoDB for data verification
|
|
||||||
- Health checks on containers
|
|
||||||
- Engagement metrics via API
|
|
||||||
|
|
||||||
## Scalability
|
## Scalability
|
||||||
|
|
||||||
- Horizontal: Add more crawler instances
|
### Current Limits
|
||||||
- Vertical: Increase container resources
|
- Single server deployment
|
||||||
- Database: MongoDB sharding if needed
|
- Sequential article processing
|
||||||
- Caching: Redis for API responses (future)
|
- Single MongoDB instance
|
||||||
|
- No load balancing
|
||||||
|
|
||||||
|
### Scaling Options
|
||||||
|
|
||||||
|
**Horizontal Scaling:**
|
||||||
|
- Multiple crawler instances
|
||||||
|
- Load-balanced backend
|
||||||
|
- MongoDB replica set
|
||||||
|
- Distributed Ollama
|
||||||
|
|
||||||
|
**Vertical Scaling:**
|
||||||
|
- More CPU cores
|
||||||
|
- More RAM
|
||||||
|
- GPU acceleration (5-10x faster)
|
||||||
|
- Faster storage
|
||||||
|
|
||||||
|
**Optimization:**
|
||||||
|
- Batch processing
|
||||||
|
- Caching
|
||||||
|
- Database indexing
|
||||||
|
- Connection pooling
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Monitoring
|
||||||
|
|
||||||
|
### Health Checks
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Backend health
|
||||||
|
curl http://localhost:5001/health
|
||||||
|
|
||||||
|
# MongoDB health
|
||||||
|
docker-compose exec mongodb mongosh --eval "db.adminCommand('ping')"
|
||||||
|
|
||||||
|
# Ollama health
|
||||||
|
docker-compose exec ollama ollama list
|
||||||
|
```
|
||||||
|
|
||||||
|
### Metrics
|
||||||
|
|
||||||
|
- Article count
|
||||||
|
- Subscriber count
|
||||||
|
- Newsletter open rate
|
||||||
|
- Click-through rate
|
||||||
|
- Processing time
|
||||||
|
- Error rate
|
||||||
|
|
||||||
|
### Logs
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# All services
|
||||||
|
docker-compose logs -f
|
||||||
|
|
||||||
|
# Specific service
|
||||||
|
docker-compose logs -f crawler
|
||||||
|
docker-compose logs -f backend
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Backup & Recovery
|
||||||
|
|
||||||
|
### MongoDB Backup
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Backup
|
||||||
|
docker-compose exec mongodb mongodump --out /backup
|
||||||
|
|
||||||
|
# Restore
|
||||||
|
docker-compose exec mongodb mongorestore /backup
|
||||||
|
```
|
||||||
|
|
||||||
|
### Configuration Backup
|
||||||
|
|
||||||
|
- `backend/.env` - Environment variables
|
||||||
|
- `docker-compose.yml` - Service configuration
|
||||||
|
- RSS feeds in MongoDB
|
||||||
|
|
||||||
|
### Recovery Plan
|
||||||
|
|
||||||
|
1. Restore MongoDB from backup
|
||||||
|
2. Restore configuration files
|
||||||
|
3. Restart services
|
||||||
|
4. Verify functionality
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Performance
|
||||||
|
|
||||||
|
### CPU Mode
|
||||||
|
- Translation: ~1.5s per title
|
||||||
|
- Summarization: ~8s per article
|
||||||
|
- 10 articles: ~115s total
|
||||||
|
- Suitable for <20 articles/day
|
||||||
|
|
||||||
|
### GPU Mode (5-10x faster)
|
||||||
|
- Translation: ~0.3s per title
|
||||||
|
- Summarization: ~2s per article
|
||||||
|
- 10 articles: ~31s total
|
||||||
|
- Suitable for high-volume processing
|
||||||
|
|
||||||
|
### Resource Usage
|
||||||
|
|
||||||
|
**CPU Mode:**
|
||||||
|
- CPU: 60-80%
|
||||||
|
- RAM: 4-6GB
|
||||||
|
- Disk: ~1GB (with model)
|
||||||
|
|
||||||
|
**GPU Mode:**
|
||||||
|
- CPU: 10-20%
|
||||||
|
- RAM: 2-3GB
|
||||||
|
- GPU: 80-100%
|
||||||
|
- VRAM: 3-4GB
|
||||||
|
- Disk: ~1GB (with model)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Future Enhancements
|
||||||
|
|
||||||
|
### Planned Features
|
||||||
|
- Frontend dashboard
|
||||||
|
- Real-time analytics
|
||||||
|
- Multiple languages
|
||||||
|
- Custom RSS feeds per subscriber
|
||||||
|
- A/B testing for newsletters
|
||||||
|
- Advanced tracking
|
||||||
|
|
||||||
|
### Technical Improvements
|
||||||
|
- Kubernetes deployment
|
||||||
|
- Microservices architecture
|
||||||
|
- Message queue (RabbitMQ/Redis)
|
||||||
|
- Caching layer (Redis)
|
||||||
|
- CDN for assets
|
||||||
|
- Advanced monitoring (Prometheus/Grafana)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
See [SETUP.md](SETUP.md) for deployment guide and [SECURITY.md](SECURITY.md) for security best practices.
|
||||||
|
|||||||
@@ -1,106 +0,0 @@
|
|||||||
# Backend Structure
|
|
||||||
|
|
||||||
The backend has been modularized for better maintainability and scalability.
|
|
||||||
|
|
||||||
## Directory Structure
|
|
||||||
|
|
||||||
```
|
|
||||||
backend/
|
|
||||||
├── app.py # Main Flask application entry point
|
|
||||||
├── config.py # Configuration management
|
|
||||||
├── database.py # Database connection and initialization
|
|
||||||
├── requirements.txt # Python dependencies
|
|
||||||
├── .env # Environment variables
|
|
||||||
│
|
|
||||||
├── routes/ # API route handlers (blueprints)
|
|
||||||
│ ├── __init__.py
|
|
||||||
│ ├── subscription_routes.py # /api/subscribe, /api/unsubscribe
|
|
||||||
│ ├── news_routes.py # /api/news, /api/stats
|
|
||||||
│ ├── rss_routes.py # /api/rss-feeds (CRUD operations)
|
|
||||||
│ ├── ollama_routes.py # /api/ollama/* (AI features)
|
|
||||||
│ ├── tracking_routes.py # /api/track/* (email tracking)
|
|
||||||
│ └── analytics_routes.py # /api/analytics/* (engagement metrics)
|
|
||||||
│
|
|
||||||
└── services/ # Business logic layer
|
|
||||||
├── __init__.py
|
|
||||||
├── news_service.py # News fetching and storage logic
|
|
||||||
├── email_service.py # Newsletter email sending
|
|
||||||
├── ollama_service.py # Ollama AI integration
|
|
||||||
├── tracking_service.py # Email tracking (opens/clicks)
|
|
||||||
└── analytics_service.py # Engagement analytics
|
|
||||||
```
|
|
||||||
|
|
||||||
## Key Components
|
|
||||||
|
|
||||||
### app.py
|
|
||||||
- Main Flask application
|
|
||||||
- Registers all blueprints
|
|
||||||
- Minimal code, just wiring things together
|
|
||||||
|
|
||||||
### config.py
|
|
||||||
- Centralized configuration
|
|
||||||
- Loads environment variables
|
|
||||||
- Single source of truth for all settings
|
|
||||||
|
|
||||||
### database.py
|
|
||||||
- MongoDB connection setup
|
|
||||||
- Collection definitions
|
|
||||||
- Database initialization with indexes
|
|
||||||
|
|
||||||
### routes/
|
|
||||||
Each route file is a Flask Blueprint handling specific API endpoints:
|
|
||||||
- **subscription_routes.py**: User subscription management
|
|
||||||
- **news_routes.py**: News fetching and statistics
|
|
||||||
- **rss_routes.py**: RSS feed management (add/remove/list/toggle)
|
|
||||||
- **ollama_routes.py**: AI/Ollama integration endpoints
|
|
||||||
- **tracking_routes.py**: Email tracking (pixel, click redirects, data deletion)
|
|
||||||
- **analytics_routes.py**: Engagement analytics (open rates, click rates, subscriber activity)
|
|
||||||
|
|
||||||
### services/
|
|
||||||
Business logic separated from route handlers:
|
|
||||||
- **news_service.py**: Fetches news from RSS feeds, saves to database
|
|
||||||
- **email_service.py**: Sends newsletter emails to subscribers
|
|
||||||
- **ollama_service.py**: Communicates with Ollama AI server
|
|
||||||
- **tracking_service.py**: Email tracking logic (tracking IDs, pixel generation, click logging)
|
|
||||||
- **analytics_service.py**: Analytics calculations (open rates, click rates, activity classification)
|
|
||||||
|
|
||||||
## Benefits of This Structure
|
|
||||||
|
|
||||||
1. **Separation of Concerns**: Routes handle HTTP, services handle business logic
|
|
||||||
2. **Testability**: Each module can be tested independently
|
|
||||||
3. **Maintainability**: Easy to find and modify specific functionality
|
|
||||||
4. **Scalability**: Easy to add new routes or services
|
|
||||||
5. **Reusability**: Services can be used by multiple routes
|
|
||||||
|
|
||||||
## Adding New Features
|
|
||||||
|
|
||||||
### To add a new API endpoint:
|
|
||||||
1. Create a new route file in `routes/` or add to existing one
|
|
||||||
2. Create a Blueprint and define routes
|
|
||||||
3. Register the blueprint in `app.py`
|
|
||||||
|
|
||||||
### To add new business logic:
|
|
||||||
1. Create a new service file in `services/`
|
|
||||||
2. Import and use in your route handlers
|
|
||||||
|
|
||||||
### Example:
|
|
||||||
```python
|
|
||||||
# services/my_service.py
|
|
||||||
def my_business_logic():
|
|
||||||
return "Hello"
|
|
||||||
|
|
||||||
# routes/my_routes.py
|
|
||||||
from flask import Blueprint
|
|
||||||
from services.my_service import my_business_logic
|
|
||||||
|
|
||||||
my_bp = Blueprint('my', __name__)
|
|
||||||
|
|
||||||
@my_bp.route('/api/my-endpoint')
|
|
||||||
def my_endpoint():
|
|
||||||
result = my_business_logic()
|
|
||||||
return {'message': result}
|
|
||||||
|
|
||||||
# app.py
|
|
||||||
from routes.my_routes import my_bp
|
|
||||||
app.register_blueprint(my_bp)
|
|
||||||
```
|
|
||||||
@@ -1,176 +0,0 @@
|
|||||||
# Changelog
|
|
||||||
|
|
||||||
## [Unreleased] - 2024-11-10
|
|
||||||
|
|
||||||
### Added - Major Refactoring
|
|
||||||
|
|
||||||
#### Backend Modularization
|
|
||||||
- ✅ Restructured backend into modular architecture
|
|
||||||
- ✅ Created separate route blueprints:
|
|
||||||
- `subscription_routes.py` - User subscriptions
|
|
||||||
- `news_routes.py` - News fetching and stats
|
|
||||||
- `rss_routes.py` - RSS feed management (CRUD)
|
|
||||||
- `ollama_routes.py` - AI integration
|
|
||||||
- ✅ Created service layer:
|
|
||||||
- `news_service.py` - News fetching logic
|
|
||||||
- `email_service.py` - Newsletter sending
|
|
||||||
- `ollama_service.py` - AI communication
|
|
||||||
- ✅ Centralized configuration in `config.py`
|
|
||||||
- ✅ Separated database logic in `database.py`
|
|
||||||
- ✅ Reduced main `app.py` from 700+ lines to 27 lines
|
|
||||||
|
|
||||||
#### RSS Feed Management
|
|
||||||
- ✅ Dynamic RSS feed management via API
|
|
||||||
- ✅ Add/remove/list/toggle RSS feeds without code changes
|
|
||||||
- ✅ Unique index on RSS feed URLs (prevents duplicates)
|
|
||||||
- ✅ Default feeds auto-initialized on first run
|
|
||||||
- ✅ Created `fix_duplicates.py` utility script
|
|
||||||
|
|
||||||
#### News Crawler Microservice
|
|
||||||
- ✅ Created standalone `news_crawler/` microservice
|
|
||||||
- ✅ Web scraping with BeautifulSoup
|
|
||||||
- ✅ Smart content extraction using multiple selectors
|
|
||||||
- ✅ Full article content storage in MongoDB
|
|
||||||
- ✅ Word count calculation
|
|
||||||
- ✅ Duplicate prevention (skips already-crawled articles)
|
|
||||||
- ✅ Rate limiting (1 second between requests)
|
|
||||||
- ✅ Can run independently or scheduled
|
|
||||||
- ✅ Docker support for crawler
|
|
||||||
- ✅ Comprehensive documentation
|
|
||||||
|
|
||||||
#### API Endpoints
|
|
||||||
New endpoints added:
|
|
||||||
- `GET /api/rss-feeds` - List all RSS feeds
|
|
||||||
- `POST /api/rss-feeds` - Add new RSS feed
|
|
||||||
- `DELETE /api/rss-feeds/<id>` - Remove RSS feed
|
|
||||||
- `PATCH /api/rss-feeds/<id>/toggle` - Toggle feed active status
|
|
||||||
|
|
||||||
#### Documentation
|
|
||||||
- ✅ Created `ARCHITECTURE.md` - System architecture overview
|
|
||||||
- ✅ Created `backend/STRUCTURE.md` - Backend structure guide
|
|
||||||
- ✅ Created `news_crawler/README.md` - Crawler documentation
|
|
||||||
- ✅ Created `news_crawler/QUICKSTART.md` - Quick start guide
|
|
||||||
- ✅ Created `news_crawler/test_crawler.py` - Test suite
|
|
||||||
- ✅ Updated main `README.md` with new features
|
|
||||||
- ✅ Updated `DATABASE_SCHEMA.md` with new fields
|
|
||||||
|
|
||||||
#### Configuration
|
|
||||||
- ✅ Added `FLASK_PORT` environment variable
|
|
||||||
- ✅ Fixed `OLLAMA_MODEL` typo in `.env`
|
|
||||||
- ✅ Port 5001 default to avoid macOS AirPlay conflict
|
|
||||||
|
|
||||||
### Changed
|
|
||||||
- Backend structure: Monolithic → Modular
|
|
||||||
- RSS feeds: Hardcoded → Database-driven
|
|
||||||
- Article storage: Summary only → Full content support
|
|
||||||
- Configuration: Scattered → Centralized
|
|
||||||
|
|
||||||
### Technical Improvements
|
|
||||||
- Separation of concerns (routes vs services)
|
|
||||||
- Better testability
|
|
||||||
- Easier maintenance
|
|
||||||
- Scalable architecture
|
|
||||||
- Independent microservices
|
|
||||||
- Proper error handling
|
|
||||||
- Comprehensive logging
|
|
||||||
|
|
||||||
### Database Schema Updates
|
|
||||||
Articles collection now includes:
|
|
||||||
- `full_content` - Full article text
|
|
||||||
- `word_count` - Number of words
|
|
||||||
- `crawled_at` - When content was crawled
|
|
||||||
|
|
||||||
RSS Feeds collection added:
|
|
||||||
- `name` - Feed name
|
|
||||||
- `url` - Feed URL (unique)
|
|
||||||
- `active` - Active status
|
|
||||||
- `created_at` - Creation timestamp
|
|
||||||
|
|
||||||
### Files Added
|
|
||||||
```
|
|
||||||
backend/
|
|
||||||
├── config.py
|
|
||||||
├── database.py
|
|
||||||
├── fix_duplicates.py
|
|
||||||
├── STRUCTURE.md
|
|
||||||
├── routes/
|
|
||||||
│ ├── __init__.py
|
|
||||||
│ ├── subscription_routes.py
|
|
||||||
│ ├── news_routes.py
|
|
||||||
│ ├── rss_routes.py
|
|
||||||
│ └── ollama_routes.py
|
|
||||||
└── services/
|
|
||||||
├── __init__.py
|
|
||||||
├── news_service.py
|
|
||||||
├── email_service.py
|
|
||||||
└── ollama_service.py
|
|
||||||
|
|
||||||
news_crawler/
|
|
||||||
├── crawler_service.py
|
|
||||||
├── test_crawler.py
|
|
||||||
├── requirements.txt
|
|
||||||
├── .gitignore
|
|
||||||
├── Dockerfile
|
|
||||||
├── docker-compose.yml
|
|
||||||
├── README.md
|
|
||||||
└── QUICKSTART.md
|
|
||||||
|
|
||||||
Root:
|
|
||||||
├── ARCHITECTURE.md
|
|
||||||
└── CHANGELOG.md
|
|
||||||
```
|
|
||||||
|
|
||||||
### Files Removed
|
|
||||||
- Old monolithic `backend/app.py` (replaced with modular version)
|
|
||||||
|
|
||||||
### Next Steps (Future Enhancements)
|
|
||||||
- [ ] Frontend UI for RSS feed management
|
|
||||||
- [ ] Automatic article summarization with Ollama
|
|
||||||
- [ ] Scheduled newsletter sending
|
|
||||||
- [ ] Article categorization and tagging
|
|
||||||
- [ ] Search functionality
|
|
||||||
- [ ] User preferences (categories, frequency)
|
|
||||||
- [ ] Analytics dashboard
|
|
||||||
- [ ] API rate limiting
|
|
||||||
- [ ] Caching layer (Redis)
|
|
||||||
- [ ] Message queue for crawler (Celery)
|
|
||||||
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Recent Updates (November 2025)
|
|
||||||
|
|
||||||
### Security Improvements
|
|
||||||
- **MongoDB Internal-Only**: Removed port exposure, only accessible via Docker network
|
|
||||||
- **Ollama Internal-Only**: Removed port exposure, only accessible via Docker network
|
|
||||||
- **Reduced Attack Surface**: Only Backend API (port 5001) exposed to host
|
|
||||||
- **Network Isolation**: All services communicate via internal Docker network
|
|
||||||
|
|
||||||
### Ollama Integration
|
|
||||||
- **Docker Compose Integration**: Ollama service runs alongside other services
|
|
||||||
- **Automatic Model Download**: phi3:latest model downloaded on first startup
|
|
||||||
- **GPU Support**: NVIDIA GPU acceleration with automatic detection
|
|
||||||
- **Helper Scripts**: `start-with-gpu.sh`, `check-gpu.sh`, `configure-ollama.sh`
|
|
||||||
- **Performance**: 5-10x faster with GPU acceleration
|
|
||||||
|
|
||||||
### API Enhancements
|
|
||||||
- **Send Newsletter Endpoint**: `/api/admin/send-newsletter` to send to all active subscribers
|
|
||||||
- **Subscriber Status Fix**: Fixed stats endpoint to correctly count active subscribers
|
|
||||||
- **Better Error Handling**: Improved error messages and validation
|
|
||||||
|
|
||||||
### Documentation
|
|
||||||
- **Consolidated Documentation**: Moved all docs to `docs/` directory
|
|
||||||
- **Security Guide**: Comprehensive security documentation
|
|
||||||
- **GPU Setup Guide**: Detailed GPU acceleration setup
|
|
||||||
- **MongoDB Connection Guide**: Connection configuration explained
|
|
||||||
- **Subscriber Status Guide**: How subscriber status system works
|
|
||||||
|
|
||||||
### Configuration
|
|
||||||
- **MongoDB URI**: Updated to use Docker service name (`mongodb` instead of `localhost`)
|
|
||||||
- **Ollama URL**: Configured for internal Docker network (`http://ollama:11434`)
|
|
||||||
- **Single .env File**: All configuration in `backend/.env`
|
|
||||||
|
|
||||||
### Testing
|
|
||||||
- **Connectivity Tests**: `test-mongodb-connectivity.sh`
|
|
||||||
- **Ollama Tests**: `test-ollama-setup.sh`
|
|
||||||
- **Newsletter API Tests**: `test-newsletter-api.sh`
|
|
||||||
@@ -1,306 +0,0 @@
|
|||||||
# How the News Crawler Works
|
|
||||||
|
|
||||||
## 🎯 Overview
|
|
||||||
|
|
||||||
The crawler dynamically extracts article metadata from any website using multiple fallback strategies.
|
|
||||||
|
|
||||||
## 📊 Flow Diagram
|
|
||||||
|
|
||||||
```
|
|
||||||
RSS Feed URL
|
|
||||||
↓
|
|
||||||
Parse RSS Feed
|
|
||||||
↓
|
|
||||||
For each article link:
|
|
||||||
↓
|
|
||||||
┌─────────────────────────────────────┐
|
|
||||||
│ 1. Fetch HTML Page │
|
|
||||||
│ GET https://example.com/article │
|
|
||||||
└─────────────────────────────────────┘
|
|
||||||
↓
|
|
||||||
┌─────────────────────────────────────┐
|
|
||||||
│ 2. Parse with BeautifulSoup │
|
|
||||||
│ soup = BeautifulSoup(html) │
|
|
||||||
└─────────────────────────────────────┘
|
|
||||||
↓
|
|
||||||
┌─────────────────────────────────────┐
|
|
||||||
│ 3. Clean HTML │
|
|
||||||
│ Remove: scripts, styles, nav, │
|
|
||||||
│ footer, header, ads │
|
|
||||||
└─────────────────────────────────────┘
|
|
||||||
↓
|
|
||||||
┌─────────────────────────────────────┐
|
|
||||||
│ 4. Extract Title │
|
|
||||||
│ Try: H1 → OG meta → Twitter → │
|
|
||||||
│ Title tag │
|
|
||||||
└─────────────────────────────────────┘
|
|
||||||
↓
|
|
||||||
┌─────────────────────────────────────┐
|
|
||||||
│ 5. Extract Author │
|
|
||||||
│ Try: Meta author → rel=author → │
|
|
||||||
│ Class names → JSON-LD │
|
|
||||||
└─────────────────────────────────────┘
|
|
||||||
↓
|
|
||||||
┌─────────────────────────────────────┐
|
|
||||||
│ 6. Extract Date │
|
|
||||||
│ Try: <time> → Meta tags → │
|
|
||||||
│ Class names → JSON-LD │
|
|
||||||
└─────────────────────────────────────┘
|
|
||||||
↓
|
|
||||||
┌─────────────────────────────────────┐
|
|
||||||
│ 7. Extract Content │
|
|
||||||
│ Try: <article> → Class names → │
|
|
||||||
│ <main> → <body> │
|
|
||||||
│ Filter short paragraphs │
|
|
||||||
└─────────────────────────────────────┘
|
|
||||||
↓
|
|
||||||
┌─────────────────────────────────────┐
|
|
||||||
│ 8. Save to MongoDB │
|
|
||||||
│ { │
|
|
||||||
│ title, author, date, │
|
|
||||||
│ content, word_count │
|
|
||||||
│ } │
|
|
||||||
└─────────────────────────────────────┘
|
|
||||||
↓
|
|
||||||
Wait 1 second (rate limiting)
|
|
||||||
↓
|
|
||||||
Next article
|
|
||||||
```
|
|
||||||
|
|
||||||
## 🔍 Detailed Example
|
|
||||||
|
|
||||||
### Input: RSS Feed Entry
|
|
||||||
```xml
|
|
||||||
<item>
|
|
||||||
<title>New U-Bahn Line Opens</title>
|
|
||||||
<link>https://www.sueddeutsche.de/muenchen/article-123</link>
|
|
||||||
<pubDate>Mon, 10 Nov 2024 10:00:00 +0100</pubDate>
|
|
||||||
</item>
|
|
||||||
```
|
|
||||||
|
|
||||||
### Step 1: Fetch HTML
|
|
||||||
```python
|
|
||||||
url = "https://www.sueddeutsche.de/muenchen/article-123"
|
|
||||||
response = requests.get(url)
|
|
||||||
html = response.content
|
|
||||||
```
|
|
||||||
|
|
||||||
### Step 2: Parse HTML
|
|
||||||
```python
|
|
||||||
soup = BeautifulSoup(html, 'html.parser')
|
|
||||||
```
|
|
||||||
|
|
||||||
### Step 3: Extract Title
|
|
||||||
```python
|
|
||||||
# Try H1
|
|
||||||
h1 = soup.find('h1')
|
|
||||||
# Result: "New U-Bahn Line Opens in Munich"
|
|
||||||
|
|
||||||
# If no H1, try OG meta
|
|
||||||
og_title = soup.find('meta', property='og:title')
|
|
||||||
# Fallback chain continues...
|
|
||||||
```
|
|
||||||
|
|
||||||
### Step 4: Extract Author
|
|
||||||
```python
|
|
||||||
# Try meta author
|
|
||||||
meta_author = soup.find('meta', name='author')
|
|
||||||
# Result: None
|
|
||||||
|
|
||||||
# Try class names
|
|
||||||
author_elem = soup.select_one('[class*="author"]')
|
|
||||||
# Result: "Max Mustermann"
|
|
||||||
```
|
|
||||||
|
|
||||||
### Step 5: Extract Date
|
|
||||||
```python
|
|
||||||
# Try time tag
|
|
||||||
time_tag = soup.find('time')
|
|
||||||
# Result: "2024-11-10T10:00:00Z"
|
|
||||||
```
|
|
||||||
|
|
||||||
### Step 6: Extract Content
|
|
||||||
```python
|
|
||||||
# Try article tag
|
|
||||||
article = soup.find('article')
|
|
||||||
paragraphs = article.find_all('p')
|
|
||||||
|
|
||||||
# Filter paragraphs
|
|
||||||
content = []
|
|
||||||
for p in paragraphs:
|
|
||||||
text = p.get_text().strip()
|
|
||||||
if len(text) >= 50: # Keep substantial paragraphs
|
|
||||||
content.append(text)
|
|
||||||
|
|
||||||
full_content = '\n\n'.join(content)
|
|
||||||
# Result: "The new U-Bahn line connecting the city center..."
|
|
||||||
```
|
|
||||||
|
|
||||||
### Step 7: Save to Database
|
|
||||||
```python
|
|
||||||
article_doc = {
|
|
||||||
'title': 'New U-Bahn Line Opens in Munich',
|
|
||||||
'author': 'Max Mustermann',
|
|
||||||
'link': 'https://www.sueddeutsche.de/muenchen/article-123',
|
|
||||||
'summary': 'Short summary from RSS...',
|
|
||||||
'full_content': 'The new U-Bahn line connecting...',
|
|
||||||
'word_count': 1250,
|
|
||||||
'source': 'Süddeutsche Zeitung München',
|
|
||||||
'published_at': '2024-11-10T10:00:00Z',
|
|
||||||
'crawled_at': datetime.utcnow(),
|
|
||||||
'created_at': datetime.utcnow()
|
|
||||||
}
|
|
||||||
|
|
||||||
db.articles.update_one(
|
|
||||||
{'link': article_url},
|
|
||||||
{'$set': article_doc},
|
|
||||||
upsert=True
|
|
||||||
)
|
|
||||||
```
|
|
||||||
|
|
||||||
## 🎨 What Makes It "Dynamic"?
|
|
||||||
|
|
||||||
### Traditional Approach (Hardcoded)
|
|
||||||
```python
|
|
||||||
# Only works for one specific site
|
|
||||||
title = soup.find('h1', class_='article-title').text
|
|
||||||
author = soup.find('span', class_='author-name').text
|
|
||||||
```
|
|
||||||
❌ Breaks when site changes
|
|
||||||
❌ Doesn't work on other sites
|
|
||||||
|
|
||||||
### Our Approach (Dynamic)
|
|
||||||
```python
|
|
||||||
# Works on ANY site
|
|
||||||
title = extract_title(soup) # Tries 4 different methods
|
|
||||||
author = extract_author(soup) # Tries 5 different methods
|
|
||||||
```
|
|
||||||
✅ Adapts to different HTML structures
|
|
||||||
✅ Falls back to alternatives
|
|
||||||
✅ Works across multiple sites
|
|
||||||
|
|
||||||
## 🛡️ Robustness Features
|
|
||||||
|
|
||||||
### 1. Multiple Strategies
|
|
||||||
Each field has 4-6 extraction strategies
|
|
||||||
```python
|
|
||||||
def extract_title(soup):
|
|
||||||
# Try strategy 1
|
|
||||||
if h1 := soup.find('h1'):
|
|
||||||
return h1.text
|
|
||||||
|
|
||||||
# Try strategy 2
|
|
||||||
if og_title := soup.find('meta', property='og:title'):
|
|
||||||
return og_title['content']
|
|
||||||
|
|
||||||
# Try strategy 3...
|
|
||||||
# Try strategy 4...
|
|
||||||
```
|
|
||||||
|
|
||||||
### 2. Validation
|
|
||||||
```python
|
|
||||||
# Title must be reasonable length
|
|
||||||
if title and len(title) > 10:
|
|
||||||
return title
|
|
||||||
|
|
||||||
# Author must be < 100 chars
|
|
||||||
if author and len(author) < 100:
|
|
||||||
return author
|
|
||||||
```
|
|
||||||
|
|
||||||
### 3. Cleaning
|
|
||||||
```python
|
|
||||||
# Remove site name from title
|
|
||||||
if ' | ' in title:
|
|
||||||
title = title.split(' | ')[0]
|
|
||||||
|
|
||||||
# Remove "By" from author
|
|
||||||
author = author.replace('By ', '').strip()
|
|
||||||
```
|
|
||||||
|
|
||||||
### 4. Error Handling
|
|
||||||
```python
|
|
||||||
try:
|
|
||||||
data = extract_article_content(url)
|
|
||||||
except Timeout:
|
|
||||||
print("Timeout - skip")
|
|
||||||
except RequestException:
|
|
||||||
print("Network error - skip")
|
|
||||||
except Exception:
|
|
||||||
print("Unknown error - skip")
|
|
||||||
```
|
|
||||||
|
|
||||||
## 📈 Success Metrics
|
|
||||||
|
|
||||||
After crawling, you'll see:
|
|
||||||
|
|
||||||
```
|
|
||||||
📰 Crawling feed: Süddeutsche Zeitung München
|
|
||||||
🔍 Crawling: New U-Bahn Line Opens...
|
|
||||||
✓ Saved (1250 words)
|
|
||||||
|
|
||||||
Title: ✓ Found
|
|
||||||
Author: ✓ Found (Max Mustermann)
|
|
||||||
Date: ✓ Found (2024-11-10T10:00:00Z)
|
|
||||||
Content: ✓ Found (1250 words)
|
|
||||||
```
|
|
||||||
|
|
||||||
## 🗄️ Database Result
|
|
||||||
|
|
||||||
**Before Crawling:**
|
|
||||||
```javascript
|
|
||||||
{
|
|
||||||
title: "New U-Bahn Line Opens",
|
|
||||||
link: "https://example.com/article",
|
|
||||||
summary: "Short RSS summary...",
|
|
||||||
source: "Süddeutsche Zeitung"
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
**After Crawling:**
|
|
||||||
```javascript
|
|
||||||
{
|
|
||||||
title: "New U-Bahn Line Opens in Munich", // ← Enhanced
|
|
||||||
author: "Max Mustermann", // ← NEW!
|
|
||||||
link: "https://example.com/article",
|
|
||||||
summary: "Short RSS summary...",
|
|
||||||
full_content: "The new U-Bahn line...", // ← NEW! (1250 words)
|
|
||||||
word_count: 1250, // ← NEW!
|
|
||||||
source: "Süddeutsche Zeitung",
|
|
||||||
published_at: "2024-11-10T10:00:00Z", // ← Enhanced
|
|
||||||
crawled_at: ISODate("2024-11-10T16:30:00Z"), // ← NEW!
|
|
||||||
created_at: ISODate("2024-11-10T16:00:00Z")
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
## 🚀 Running the Crawler
|
|
||||||
|
|
||||||
```bash
|
|
||||||
cd news_crawler
|
|
||||||
pip install -r requirements.txt
|
|
||||||
python crawler_service.py 10
|
|
||||||
```
|
|
||||||
|
|
||||||
Output:
|
|
||||||
```
|
|
||||||
============================================================
|
|
||||||
🚀 Starting RSS Feed Crawler
|
|
||||||
============================================================
|
|
||||||
Found 3 active feed(s)
|
|
||||||
|
|
||||||
📰 Crawling feed: Süddeutsche Zeitung München
|
|
||||||
🔍 Crawling: New U-Bahn Line Opens...
|
|
||||||
✓ Saved (1250 words)
|
|
||||||
🔍 Crawling: Munich Weather Update...
|
|
||||||
✓ Saved (450 words)
|
|
||||||
✓ Crawled 2 articles
|
|
||||||
|
|
||||||
============================================================
|
|
||||||
✓ Crawling Complete!
|
|
||||||
Total feeds processed: 3
|
|
||||||
Total articles crawled: 15
|
|
||||||
Duration: 45.23 seconds
|
|
||||||
============================================================
|
|
||||||
```
|
|
||||||
|
|
||||||
Now you have rich, structured article data ready for AI processing! 🎉
|
|
||||||
@@ -1,336 +0,0 @@
|
|||||||
# MongoDB Database Schema
|
|
||||||
|
|
||||||
This document describes the MongoDB collections and their structure for Munich News Daily.
|
|
||||||
|
|
||||||
## Collections
|
|
||||||
|
|
||||||
### 1. Articles Collection (`articles`)
|
|
||||||
|
|
||||||
Stores all news articles aggregated from Munich news sources.
|
|
||||||
|
|
||||||
**Document Structure:**
|
|
||||||
```javascript
|
|
||||||
{
|
|
||||||
_id: ObjectId, // Auto-generated MongoDB ID
|
|
||||||
title: String, // Article title (required)
|
|
||||||
author: String, // Article author (optional, extracted during crawl)
|
|
||||||
link: String, // Article URL (required, unique)
|
|
||||||
content: String, // Full article content (no length limit)
|
|
||||||
summary: String, // AI-generated English summary (≤150 words)
|
|
||||||
word_count: Number, // Word count of full content
|
|
||||||
summary_word_count: Number, // Word count of AI summary
|
|
||||||
source: String, // News source name (e.g., "Süddeutsche Zeitung München")
|
|
||||||
published_at: String, // Original publication date from RSS feed or crawled
|
|
||||||
crawled_at: DateTime, // When article content was crawled (UTC)
|
|
||||||
summarized_at: DateTime, // When AI summary was generated (UTC)
|
|
||||||
created_at: DateTime // When article was added to database (UTC)
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
**Indexes:**
|
|
||||||
- `link` - Unique index to prevent duplicate articles
|
|
||||||
- `created_at` - Index for efficient sorting by date
|
|
||||||
|
|
||||||
**Example Document:**
|
|
||||||
```javascript
|
|
||||||
{
|
|
||||||
_id: ObjectId("507f1f77bcf86cd799439011"),
|
|
||||||
title: "New U-Bahn Line Opens in Munich",
|
|
||||||
author: "Max Mustermann",
|
|
||||||
link: "https://www.sueddeutsche.de/muenchen/ubahn-1.123456",
|
|
||||||
content: "The new U-Bahn line connecting the city center with the airport opened today. Mayor Dieter Reiter attended the opening ceremony... [full article text continues]",
|
|
||||||
summary: "Munich's new U-Bahn line connecting the city center to the airport opened today with Mayor Dieter Reiter in attendance. The line features 10 stations and runs every 10 minutes during peak hours, significantly reducing travel time. Construction took five years and cost approximately 2 billion euros.",
|
|
||||||
word_count: 1250,
|
|
||||||
summary_word_count: 48,
|
|
||||||
source: "Süddeutsche Zeitung München",
|
|
||||||
published_at: "Mon, 15 Jan 2024 10:00:00 +0100",
|
|
||||||
crawled_at: ISODate("2024-01-15T09:30:00.000Z"),
|
|
||||||
summarized_at: ISODate("2024-01-15T09:30:15.000Z"),
|
|
||||||
created_at: ISODate("2024-01-15T09:00:00.000Z")
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
### 2. Subscribers Collection (`subscribers`)
|
|
||||||
|
|
||||||
Stores all newsletter subscribers.
|
|
||||||
|
|
||||||
**Document Structure:**
|
|
||||||
```javascript
|
|
||||||
{
|
|
||||||
_id: ObjectId, // Auto-generated MongoDB ID
|
|
||||||
email: String, // Subscriber email (required, unique, lowercase)
|
|
||||||
subscribed_at: DateTime, // When user subscribed (UTC)
|
|
||||||
status: String // Subscription status: 'active' or 'inactive'
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
**Indexes:**
|
|
||||||
- `email` - Unique index for email lookups and preventing duplicates
|
|
||||||
- `subscribed_at` - Index for analytics and sorting
|
|
||||||
|
|
||||||
**Example Document:**
|
|
||||||
```javascript
|
|
||||||
{
|
|
||||||
_id: ObjectId("507f1f77bcf86cd799439012"),
|
|
||||||
email: "user@example.com",
|
|
||||||
subscribed_at: ISODate("2024-01-15T08:30:00.000Z"),
|
|
||||||
status: "active"
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
### 3. Newsletter Sends Collection (`newsletter_sends`)
|
|
||||||
|
|
||||||
Tracks each newsletter sent to each subscriber for email open tracking.
|
|
||||||
|
|
||||||
**Document Structure:**
|
|
||||||
```javascript
|
|
||||||
{
|
|
||||||
_id: ObjectId, // Auto-generated MongoDB ID
|
|
||||||
newsletter_id: String, // Unique ID for this newsletter batch (date-based)
|
|
||||||
subscriber_email: String, // Recipient email
|
|
||||||
tracking_id: String, // Unique tracking ID for this send (UUID)
|
|
||||||
sent_at: DateTime, // When email was sent (UTC)
|
|
||||||
opened: Boolean, // Whether email was opened
|
|
||||||
first_opened_at: DateTime, // First open timestamp (null if not opened)
|
|
||||||
last_opened_at: DateTime, // Most recent open timestamp
|
|
||||||
open_count: Number, // Number of times opened
|
|
||||||
created_at: DateTime // Record creation time (UTC)
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
**Indexes:**
|
|
||||||
- `tracking_id` - Unique index for fast pixel request lookups
|
|
||||||
- `newsletter_id` - Index for analytics queries
|
|
||||||
- `subscriber_email` - Index for user activity queries
|
|
||||||
- `sent_at` - Index for time-based queries
|
|
||||||
|
|
||||||
**Example Document:**
|
|
||||||
```javascript
|
|
||||||
{
|
|
||||||
_id: ObjectId("507f1f77bcf86cd799439013"),
|
|
||||||
newsletter_id: "2024-01-15",
|
|
||||||
subscriber_email: "user@example.com",
|
|
||||||
tracking_id: "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
|
|
||||||
sent_at: ISODate("2024-01-15T08:00:00.000Z"),
|
|
||||||
opened: true,
|
|
||||||
first_opened_at: ISODate("2024-01-15T09:30:00.000Z"),
|
|
||||||
last_opened_at: ISODate("2024-01-15T14:20:00.000Z"),
|
|
||||||
open_count: 3,
|
|
||||||
created_at: ISODate("2024-01-15T08:00:00.000Z")
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
### 4. Link Clicks Collection (`link_clicks`)
|
|
||||||
|
|
||||||
Tracks individual link clicks from newsletters.
|
|
||||||
|
|
||||||
**Document Structure:**
|
|
||||||
```javascript
|
|
||||||
{
|
|
||||||
_id: ObjectId, // Auto-generated MongoDB ID
|
|
||||||
tracking_id: String, // Unique tracking ID for this link (UUID)
|
|
||||||
newsletter_id: String, // Which newsletter this link was in
|
|
||||||
subscriber_email: String, // Who clicked
|
|
||||||
article_url: String, // Original article URL
|
|
||||||
article_title: String, // Article title for reporting
|
|
||||||
clicked_at: DateTime, // When link was clicked (UTC)
|
|
||||||
user_agent: String, // Browser/client info
|
|
||||||
created_at: DateTime // Record creation time (UTC)
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
**Indexes:**
|
|
||||||
- `tracking_id` - Unique index for fast redirect request lookups
|
|
||||||
- `newsletter_id` - Index for analytics queries
|
|
||||||
- `article_url` - Index for article performance queries
|
|
||||||
- `subscriber_email` - Index for user activity queries
|
|
||||||
|
|
||||||
**Example Document:**
|
|
||||||
```javascript
|
|
||||||
{
|
|
||||||
_id: ObjectId("507f1f77bcf86cd799439014"),
|
|
||||||
tracking_id: "b2c3d4e5-f6a7-8901-bcde-f12345678901",
|
|
||||||
newsletter_id: "2024-01-15",
|
|
||||||
subscriber_email: "user@example.com",
|
|
||||||
article_url: "https://www.sueddeutsche.de/muenchen/ubahn-1.123456",
|
|
||||||
article_title: "New U-Bahn Line Opens in Munich",
|
|
||||||
clicked_at: ISODate("2024-01-15T09:35:00.000Z"),
|
|
||||||
user_agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
|
|
||||||
created_at: ISODate("2024-01-15T09:35:00.000Z")
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
### 5. Subscriber Activity Collection (`subscriber_activity`)
|
|
||||||
|
|
||||||
Aggregated activity status for each subscriber.
|
|
||||||
|
|
||||||
**Document Structure:**
|
|
||||||
```javascript
|
|
||||||
{
|
|
||||||
_id: ObjectId, // Auto-generated MongoDB ID
|
|
||||||
email: String, // Subscriber email (unique)
|
|
||||||
status: String, // 'active', 'inactive', or 'dormant'
|
|
||||||
last_opened_at: DateTime, // Most recent email open (UTC)
|
|
||||||
last_clicked_at: DateTime, // Most recent link click (UTC)
|
|
||||||
total_opens: Number, // Lifetime open count
|
|
||||||
total_clicks: Number, // Lifetime click count
|
|
||||||
newsletters_received: Number, // Total newsletters sent
|
|
||||||
newsletters_opened: Number, // Total newsletters opened
|
|
||||||
updated_at: DateTime // Last status update (UTC)
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
**Indexes:**
|
|
||||||
- `email` - Unique index for fast lookups
|
|
||||||
- `status` - Index for filtering by activity level
|
|
||||||
- `last_opened_at` - Index for time-based queries
|
|
||||||
|
|
||||||
**Activity Status Classification:**
|
|
||||||
- **active**: Opened an email in the last 30 days
|
|
||||||
- **inactive**: No opens in 30-60 days
|
|
||||||
- **dormant**: No opens in 60+ days
|
|
||||||
|
|
||||||
**Example Document:**
|
|
||||||
```javascript
|
|
||||||
{
|
|
||||||
_id: ObjectId("507f1f77bcf86cd799439015"),
|
|
||||||
email: "user@example.com",
|
|
||||||
status: "active",
|
|
||||||
last_opened_at: ISODate("2024-01-15T09:30:00.000Z"),
|
|
||||||
last_clicked_at: ISODate("2024-01-15T09:35:00.000Z"),
|
|
||||||
total_opens: 45,
|
|
||||||
total_clicks: 23,
|
|
||||||
newsletters_received: 60,
|
|
||||||
newsletters_opened: 45,
|
|
||||||
updated_at: ISODate("2024-01-15T10:00:00.000Z")
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
## Design Decisions
|
|
||||||
|
|
||||||
### Why MongoDB?
|
|
||||||
|
|
||||||
1. **Flexibility**: Easy to add new fields without schema migrations
|
|
||||||
2. **Scalability**: Handles large volumes of articles and subscribers efficiently
|
|
||||||
3. **Performance**: Indexes on frequently queried fields (link, email, created_at)
|
|
||||||
4. **Document Model**: Natural fit for news articles and subscriber data
|
|
||||||
|
|
||||||
### Schema Choices
|
|
||||||
|
|
||||||
1. **Unique Link Index**: Prevents duplicate articles from being stored, even if fetched multiple times
|
|
||||||
2. **Status Field**: Soft delete for subscribers (set to 'inactive' instead of deleting) - allows for analytics and easy re-subscription
|
|
||||||
3. **UTC Timestamps**: All dates stored in UTC for consistency across timezones
|
|
||||||
4. **Lowercase Emails**: Emails stored in lowercase to prevent case-sensitivity issues
|
|
||||||
|
|
||||||
### Future Enhancements
|
|
||||||
|
|
||||||
Potential fields to add in the future:
|
|
||||||
|
|
||||||
**Articles:**
|
|
||||||
- `category`: String (e.g., "politics", "sports", "culture")
|
|
||||||
- `tags`: Array of Strings
|
|
||||||
- `image_url`: String
|
|
||||||
- `sent_in_newsletter`: Boolean (track if article was sent)
|
|
||||||
- `sent_at`: DateTime (when article was included in newsletter)
|
|
||||||
|
|
||||||
**Subscribers:**
|
|
||||||
- `preferences`: Object (newsletter frequency, categories, etc.)
|
|
||||||
- `last_sent_at`: DateTime (last newsletter sent date)
|
|
||||||
- `unsubscribed_at`: DateTime (when user unsubscribed)
|
|
||||||
- `verification_token`: String (for email verification)
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
## AI Summarization Workflow
|
|
||||||
|
|
||||||
When the crawler processes an article:
|
|
||||||
|
|
||||||
1. **Extract Content**: Full article text is extracted from the webpage
|
|
||||||
2. **Summarize with Ollama**: If `OLLAMA_ENABLED=true`, the content is sent to Ollama for summarization
|
|
||||||
3. **Store Both**: Both the original `content` and AI-generated `summary` are stored
|
|
||||||
4. **Fallback**: If Ollama is unavailable or fails, only the original content is stored
|
|
||||||
|
|
||||||
### Summary Field Details
|
|
||||||
|
|
||||||
- **Language**: Always in English, regardless of source article language
|
|
||||||
- **Length**: Maximum 150 words
|
|
||||||
- **Format**: Plain text, concise and clear
|
|
||||||
- **Purpose**: Quick preview for newsletters and frontend display
|
|
||||||
|
|
||||||
### Querying Articles
|
|
||||||
|
|
||||||
```javascript
|
|
||||||
// Get articles with AI summaries
|
|
||||||
db.articles.find({ summary: { $exists: true, $ne: null } })
|
|
||||||
|
|
||||||
// Get articles without summaries
|
|
||||||
db.articles.find({ summary: { $exists: false } })
|
|
||||||
|
|
||||||
// Count summarized articles
|
|
||||||
db.articles.countDocuments({ summary: { $exists: true, $ne: null } })
|
|
||||||
```
|
|
||||||
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## MongoDB Connection Configuration
|
|
||||||
|
|
||||||
### Docker Compose Setup
|
|
||||||
|
|
||||||
**Connection URI:**
|
|
||||||
```env
|
|
||||||
MONGODB_URI=mongodb://admin:changeme@mongodb:27017/
|
|
||||||
```
|
|
||||||
|
|
||||||
**Key Points:**
|
|
||||||
- Uses `mongodb` (Docker service name), not `localhost`
|
|
||||||
- Includes authentication credentials
|
|
||||||
- Only works inside Docker network
|
|
||||||
- Port 27017 is NOT exposed to host (internal only)
|
|
||||||
|
|
||||||
### Why 'mongodb' Instead of 'localhost'?
|
|
||||||
|
|
||||||
**Inside Docker containers:**
|
|
||||||
```
|
|
||||||
Container → mongodb:27017 ✅ Works (Docker DNS)
|
|
||||||
Container → localhost:27017 ❌ Fails (localhost = container itself)
|
|
||||||
```
|
|
||||||
|
|
||||||
**From host machine:**
|
|
||||||
```
|
|
||||||
Host → localhost:27017 ❌ Blocked (port not exposed)
|
|
||||||
Host → mongodb:27017 ❌ Fails (DNS only works in Docker)
|
|
||||||
```
|
|
||||||
|
|
||||||
### Connection Priority
|
|
||||||
|
|
||||||
1. **Docker Compose environment variables** (highest)
|
|
||||||
2. **.env file** (fallback)
|
|
||||||
3. **Code defaults** (lowest)
|
|
||||||
|
|
||||||
### Testing Connection
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# From backend
|
|
||||||
docker-compose exec backend python -c "
|
|
||||||
from database import articles_collection
|
|
||||||
print(f'Articles: {articles_collection.count_documents({})}')
|
|
||||||
"
|
|
||||||
|
|
||||||
# From crawler
|
|
||||||
docker-compose exec crawler python -c "
|
|
||||||
from pymongo import MongoClient
|
|
||||||
from config import Config
|
|
||||||
client = MongoClient(Config.MONGODB_URI)
|
|
||||||
print(f'MongoDB version: {client.server_info()[\"version\"]}')
|
|
||||||
"
|
|
||||||
```
|
|
||||||
|
|
||||||
### Security
|
|
||||||
|
|
||||||
- ✅ MongoDB is internal-only (not exposed to host)
|
|
||||||
- ✅ Uses authentication (username/password)
|
|
||||||
- ✅ Only accessible via Docker network
|
|
||||||
- ✅ Cannot be accessed from external network
|
|
||||||
|
|
||||||
See [SECURITY_NOTES.md](SECURITY_NOTES.md) for more security details.
|
|
||||||
@@ -1,204 +0,0 @@
|
|||||||
# Documentation Cleanup Summary
|
|
||||||
|
|
||||||
## What Was Done
|
|
||||||
|
|
||||||
Consolidated and organized all markdown documentation files.
|
|
||||||
|
|
||||||
## Before
|
|
||||||
|
|
||||||
**Root Level:** 14 markdown files (cluttered)
|
|
||||||
```
|
|
||||||
README.md
|
|
||||||
QUICKSTART.md
|
|
||||||
CONTRIBUTING.md
|
|
||||||
IMPLEMENTATION_SUMMARY.md
|
|
||||||
MONGODB_CONNECTION_EXPLAINED.md
|
|
||||||
NETWORK_SECURITY_SUMMARY.md
|
|
||||||
NEWSLETTER_API_UPDATE.md
|
|
||||||
OLLAMA_GPU_SUMMARY.md
|
|
||||||
OLLAMA_INTEGRATION.md
|
|
||||||
QUICK_START_GPU.md
|
|
||||||
SECURITY_IMPROVEMENTS.md
|
|
||||||
SECURITY_UPDATE.md
|
|
||||||
FINAL_STRUCTURE.md (outdated)
|
|
||||||
PROJECT_STRUCTURE.md (redundant)
|
|
||||||
```
|
|
||||||
|
|
||||||
**docs/:** 18 files (organized but some content duplicated)
|
|
||||||
|
|
||||||
## After
|
|
||||||
|
|
||||||
**Root Level:** 3 essential files (clean)
|
|
||||||
```
|
|
||||||
README.md - Main entry point
|
|
||||||
QUICKSTART.md - Quick setup guide
|
|
||||||
CONTRIBUTING.md - Contribution guidelines
|
|
||||||
```
|
|
||||||
|
|
||||||
**docs/:** 19 files (organized, consolidated, no duplication)
|
|
||||||
```
|
|
||||||
INDEX.md - Documentation index (NEW)
|
|
||||||
ADMIN_API.md - Admin API (consolidated)
|
|
||||||
API.md
|
|
||||||
ARCHITECTURE.md
|
|
||||||
BACKEND_STRUCTURE.md
|
|
||||||
CHANGELOG.md - Updated with recent changes
|
|
||||||
CRAWLER_HOW_IT_WORKS.md
|
|
||||||
DATABASE_SCHEMA.md - Added MongoDB connection info
|
|
||||||
DEPLOYMENT.md
|
|
||||||
EXTRACTION_STRATEGIES.md
|
|
||||||
GPU_SETUP.md - Consolidated GPU docs
|
|
||||||
OLLAMA_SETUP.md - Consolidated Ollama docs
|
|
||||||
OLD_ARCHITECTURE.md
|
|
||||||
PERFORMANCE_COMPARISON.md
|
|
||||||
QUICK_REFERENCE.md
|
|
||||||
RSS_URL_EXTRACTION.md
|
|
||||||
SECURITY_NOTES.md - Consolidated all security docs
|
|
||||||
SUBSCRIBER_STATUS.md
|
|
||||||
SYSTEM_ARCHITECTURE.md
|
|
||||||
```
|
|
||||||
|
|
||||||
## Changes Made
|
|
||||||
|
|
||||||
### 1. Deleted Redundant Files
|
|
||||||
- ❌ `FINAL_STRUCTURE.md` (outdated)
|
|
||||||
- ❌ `PROJECT_STRUCTURE.md` (redundant with README)
|
|
||||||
|
|
||||||
### 2. Merged into docs/SECURITY_NOTES.md
|
|
||||||
- ✅ `SECURITY_UPDATE.md` (Ollama security)
|
|
||||||
- ✅ `SECURITY_IMPROVEMENTS.md` (Network isolation)
|
|
||||||
- ✅ `NETWORK_SECURITY_SUMMARY.md` (Port exposure summary)
|
|
||||||
|
|
||||||
### 3. Merged into docs/GPU_SETUP.md
|
|
||||||
- ✅ `OLLAMA_GPU_SUMMARY.md` (GPU implementation summary)
|
|
||||||
- ✅ `QUICK_START_GPU.md` (Quick start commands)
|
|
||||||
|
|
||||||
### 4. Merged into docs/OLLAMA_SETUP.md
|
|
||||||
- ✅ `OLLAMA_INTEGRATION.md` (Integration details)
|
|
||||||
|
|
||||||
### 5. Merged into docs/ADMIN_API.md
|
|
||||||
- ✅ `NEWSLETTER_API_UPDATE.md` (Newsletter endpoint)
|
|
||||||
|
|
||||||
### 6. Merged into docs/DATABASE_SCHEMA.md
|
|
||||||
- ✅ `MONGODB_CONNECTION_EXPLAINED.md` (Connection config)
|
|
||||||
|
|
||||||
### 7. Merged into docs/CHANGELOG.md
|
|
||||||
- ✅ `IMPLEMENTATION_SUMMARY.md` (Recent updates)
|
|
||||||
|
|
||||||
### 8. Created New Files
|
|
||||||
- ✨ `docs/INDEX.md` - Complete documentation index
|
|
||||||
|
|
||||||
### 9. Updated Existing Files
|
|
||||||
- 📝 `README.md` - Added documentation section
|
|
||||||
- 📝 `docs/CHANGELOG.md` - Added recent updates
|
|
||||||
- 📝 `docs/SECURITY_NOTES.md` - Comprehensive security guide
|
|
||||||
- 📝 `docs/GPU_SETUP.md` - Complete GPU guide
|
|
||||||
- 📝 `docs/OLLAMA_SETUP.md` - Complete Ollama guide
|
|
||||||
- 📝 `docs/ADMIN_API.md` - Complete API reference
|
|
||||||
- 📝 `docs/DATABASE_SCHEMA.md` - Added connection info
|
|
||||||
|
|
||||||
## Benefits
|
|
||||||
|
|
||||||
### 1. Cleaner Root Directory
|
|
||||||
- Only 3 essential files visible
|
|
||||||
- Easier to navigate
|
|
||||||
- Professional appearance
|
|
||||||
|
|
||||||
### 2. Better Organization
|
|
||||||
- All technical docs in `docs/`
|
|
||||||
- Logical grouping by topic
|
|
||||||
- Easy to find information
|
|
||||||
|
|
||||||
### 3. No Duplication
|
|
||||||
- Consolidated related content
|
|
||||||
- Single source of truth
|
|
||||||
- Easier to maintain
|
|
||||||
|
|
||||||
### 4. Improved Discoverability
|
|
||||||
- Documentation index (`docs/INDEX.md`)
|
|
||||||
- Clear navigation
|
|
||||||
- Quick links by task
|
|
||||||
|
|
||||||
### 5. Better Maintenance
|
|
||||||
- Fewer files to update
|
|
||||||
- Related content together
|
|
||||||
- Clear structure
|
|
||||||
|
|
||||||
## Documentation Structure
|
|
||||||
|
|
||||||
```
|
|
||||||
project/
|
|
||||||
├── README.md # Main entry point
|
|
||||||
├── QUICKSTART.md # Quick setup
|
|
||||||
├── CONTRIBUTING.md # How to contribute
|
|
||||||
│
|
|
||||||
└── docs/ # All technical documentation
|
|
||||||
├── INDEX.md # Documentation index
|
|
||||||
│
|
|
||||||
├── Setup & Configuration
|
|
||||||
│ ├── OLLAMA_SETUP.md
|
|
||||||
│ ├── GPU_SETUP.md
|
|
||||||
│ └── DEPLOYMENT.md
|
|
||||||
│
|
|
||||||
├── API Documentation
|
|
||||||
│ ├── ADMIN_API.md
|
|
||||||
│ ├── API.md
|
|
||||||
│ └── SUBSCRIBER_STATUS.md
|
|
||||||
│
|
|
||||||
├── Architecture
|
|
||||||
│ ├── SYSTEM_ARCHITECTURE.md
|
|
||||||
│ ├── ARCHITECTURE.md
|
|
||||||
│ ├── DATABASE_SCHEMA.md
|
|
||||||
│ └── BACKEND_STRUCTURE.md
|
|
||||||
│
|
|
||||||
├── Features
|
|
||||||
│ ├── CRAWLER_HOW_IT_WORKS.md
|
|
||||||
│ ├── EXTRACTION_STRATEGIES.md
|
|
||||||
│ ├── RSS_URL_EXTRACTION.md
|
|
||||||
│ └── PERFORMANCE_COMPARISON.md
|
|
||||||
│
|
|
||||||
├── Security
|
|
||||||
│ └── SECURITY_NOTES.md
|
|
||||||
│
|
|
||||||
└── Reference
|
|
||||||
├── CHANGELOG.md
|
|
||||||
└── QUICK_REFERENCE.md
|
|
||||||
```
|
|
||||||
|
|
||||||
## Quick Access
|
|
||||||
|
|
||||||
### For Users
|
|
||||||
- Start here: [README.md](README.md)
|
|
||||||
- Quick setup: [QUICKSTART.md](QUICKSTART.md)
|
|
||||||
- All docs: [docs/INDEX.md](docs/INDEX.md)
|
|
||||||
|
|
||||||
### For Developers
|
|
||||||
- Architecture: [docs/SYSTEM_ARCHITECTURE.md](docs/SYSTEM_ARCHITECTURE.md)
|
|
||||||
- API Reference: [docs/ADMIN_API.md](docs/ADMIN_API.md)
|
|
||||||
- Contributing: [CONTRIBUTING.md](CONTRIBUTING.md)
|
|
||||||
|
|
||||||
### For DevOps
|
|
||||||
- Deployment: [docs/DEPLOYMENT.md](docs/DEPLOYMENT.md)
|
|
||||||
- Security: [docs/SECURITY_NOTES.md](docs/SECURITY_NOTES.md)
|
|
||||||
- GPU Setup: [docs/GPU_SETUP.md](docs/GPU_SETUP.md)
|
|
||||||
|
|
||||||
## Statistics
|
|
||||||
|
|
||||||
- **Files Deleted:** 11 redundant markdown files
|
|
||||||
- **Files Merged:** 9 files consolidated into existing docs
|
|
||||||
- **Files Created:** 1 new index file
|
|
||||||
- **Files Updated:** 7 existing files enhanced
|
|
||||||
- **Root Level:** Reduced from 14 to 3 files (79% reduction)
|
|
||||||
- **Total Docs:** 19 well-organized files in docs/
|
|
||||||
|
|
||||||
## Result
|
|
||||||
|
|
||||||
✅ Clean, professional documentation structure
|
|
||||||
✅ Easy to navigate and find information
|
|
||||||
✅ No duplication or redundancy
|
|
||||||
✅ Better maintainability
|
|
||||||
✅ Improved user experience
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
This cleanup makes the project more professional and easier to use!
|
|
||||||
@@ -1,353 +0,0 @@
|
|||||||
# Content Extraction Strategies
|
|
||||||
|
|
||||||
The crawler uses multiple strategies to dynamically extract article metadata from any website.
|
|
||||||
|
|
||||||
## 🎯 What Gets Extracted
|
|
||||||
|
|
||||||
1. **Title** - Article headline
|
|
||||||
2. **Author** - Article writer/journalist
|
|
||||||
3. **Published Date** - When article was published
|
|
||||||
4. **Content** - Main article text
|
|
||||||
5. **Description** - Meta description/summary
|
|
||||||
|
|
||||||
## 📋 Extraction Strategies
|
|
||||||
|
|
||||||
### 1. Title Extraction
|
|
||||||
|
|
||||||
Tries multiple methods in order of reliability:
|
|
||||||
|
|
||||||
#### Strategy 1: H1 Tag
|
|
||||||
```html
|
|
||||||
<h1>Article Title Here</h1>
|
|
||||||
```
|
|
||||||
✅ Most reliable - usually the main headline
|
|
||||||
|
|
||||||
#### Strategy 2: Open Graph Meta Tag
|
|
||||||
```html
|
|
||||||
<meta property="og:title" content="Article Title Here" />
|
|
||||||
```
|
|
||||||
✅ Used by Facebook, very reliable
|
|
||||||
|
|
||||||
#### Strategy 3: Twitter Card Meta Tag
|
|
||||||
```html
|
|
||||||
<meta name="twitter:title" content="Article Title Here" />
|
|
||||||
```
|
|
||||||
✅ Used by Twitter, reliable
|
|
||||||
|
|
||||||
#### Strategy 4: Title Tag (Fallback)
|
|
||||||
```html
|
|
||||||
<title>Article Title | Site Name</title>
|
|
||||||
```
|
|
||||||
⚠️ Often includes site name, needs cleaning
|
|
||||||
|
|
||||||
**Cleaning:**
|
|
||||||
- Removes " | Site Name"
|
|
||||||
- Removes " - Site Name"
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### 2. Author Extraction
|
|
||||||
|
|
||||||
Tries multiple methods:
|
|
||||||
|
|
||||||
#### Strategy 1: Meta Author Tag
|
|
||||||
```html
|
|
||||||
<meta name="author" content="John Doe" />
|
|
||||||
```
|
|
||||||
✅ Standard HTML meta tag
|
|
||||||
|
|
||||||
#### Strategy 2: Rel="author" Link
|
|
||||||
```html
|
|
||||||
<a rel="author" href="/author/john-doe">John Doe</a>
|
|
||||||
```
|
|
||||||
✅ Semantic HTML
|
|
||||||
|
|
||||||
#### Strategy 3: Common Class Names
|
|
||||||
```html
|
|
||||||
<div class="author-name">John Doe</div>
|
|
||||||
<span class="byline">By John Doe</span>
|
|
||||||
<p class="writer">John Doe</p>
|
|
||||||
```
|
|
||||||
✅ Searches for: author-name, author, byline, writer
|
|
||||||
|
|
||||||
#### Strategy 4: Schema.org Markup
|
|
||||||
```html
|
|
||||||
<span itemprop="author">John Doe</span>
|
|
||||||
```
|
|
||||||
✅ Structured data
|
|
||||||
|
|
||||||
#### Strategy 5: JSON-LD Structured Data
|
|
||||||
```html
|
|
||||||
<script type="application/ld+json">
|
|
||||||
{
|
|
||||||
"@type": "NewsArticle",
|
|
||||||
"author": {
|
|
||||||
"@type": "Person",
|
|
||||||
"name": "John Doe"
|
|
||||||
}
|
|
||||||
}
|
|
||||||
</script>
|
|
||||||
```
|
|
||||||
✅ Most structured, very reliable
|
|
||||||
|
|
||||||
**Cleaning:**
|
|
||||||
- Removes "By " prefix
|
|
||||||
- Validates length (< 100 chars)
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### 3. Date Extraction
|
|
||||||
|
|
||||||
Tries multiple methods:
|
|
||||||
|
|
||||||
#### Strategy 1: Time Tag with Datetime
|
|
||||||
```html
|
|
||||||
<time datetime="2024-11-10T10:00:00Z">November 10, 2024</time>
|
|
||||||
```
|
|
||||||
✅ Most reliable - ISO format
|
|
||||||
|
|
||||||
#### Strategy 2: Article Published Time Meta
|
|
||||||
```html
|
|
||||||
<meta property="article:published_time" content="2024-11-10T10:00:00Z" />
|
|
||||||
```
|
|
||||||
✅ Open Graph standard
|
|
||||||
|
|
||||||
#### Strategy 3: OG Published Time
|
|
||||||
```html
|
|
||||||
<meta property="og:published_time" content="2024-11-10T10:00:00Z" />
|
|
||||||
```
|
|
||||||
✅ Facebook standard
|
|
||||||
|
|
||||||
#### Strategy 4: Common Class Names
|
|
||||||
```html
|
|
||||||
<span class="publish-date">November 10, 2024</span>
|
|
||||||
<time class="published">2024-11-10</time>
|
|
||||||
<div class="timestamp">10:00 AM, Nov 10</div>
|
|
||||||
```
|
|
||||||
✅ Searches for: publish-date, published, date, timestamp
|
|
||||||
|
|
||||||
#### Strategy 5: Schema.org Markup
|
|
||||||
```html
|
|
||||||
<meta itemprop="datePublished" content="2024-11-10T10:00:00Z" />
|
|
||||||
```
|
|
||||||
✅ Structured data
|
|
||||||
|
|
||||||
#### Strategy 6: JSON-LD Structured Data
|
|
||||||
```html
|
|
||||||
<script type="application/ld+json">
|
|
||||||
{
|
|
||||||
"@type": "NewsArticle",
|
|
||||||
"datePublished": "2024-11-10T10:00:00Z"
|
|
||||||
}
|
|
||||||
</script>
|
|
||||||
```
|
|
||||||
✅ Most structured
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### 4. Content Extraction
|
|
||||||
|
|
||||||
Tries multiple methods:
|
|
||||||
|
|
||||||
#### Strategy 1: Semantic HTML Tags
|
|
||||||
```html
|
|
||||||
<article>
|
|
||||||
<p>Article content here...</p>
|
|
||||||
</article>
|
|
||||||
```
|
|
||||||
✅ Best practice HTML5
|
|
||||||
|
|
||||||
#### Strategy 2: Common Class Names
|
|
||||||
```html
|
|
||||||
<div class="article-content">...</div>
|
|
||||||
<div class="article-body">...</div>
|
|
||||||
<div class="post-content">...</div>
|
|
||||||
<div class="entry-content">...</div>
|
|
||||||
<div class="story-body">...</div>
|
|
||||||
```
|
|
||||||
✅ Searches for common patterns
|
|
||||||
|
|
||||||
#### Strategy 3: Schema.org Markup
|
|
||||||
```html
|
|
||||||
<div itemprop="articleBody">
|
|
||||||
<p>Content here...</p>
|
|
||||||
</div>
|
|
||||||
```
|
|
||||||
✅ Structured data
|
|
||||||
|
|
||||||
#### Strategy 4: Main Tag
|
|
||||||
```html
|
|
||||||
<main>
|
|
||||||
<p>Content here...</p>
|
|
||||||
</main>
|
|
||||||
```
|
|
||||||
✅ Semantic HTML5
|
|
||||||
|
|
||||||
#### Strategy 5: Body Tag (Fallback)
|
|
||||||
```html
|
|
||||||
<body>
|
|
||||||
<p>Content here...</p>
|
|
||||||
</body>
|
|
||||||
```
|
|
||||||
⚠️ Last resort, may include navigation
|
|
||||||
|
|
||||||
**Content Filtering:**
|
|
||||||
- Removes `<script>`, `<style>`, `<nav>`, `<footer>`, `<header>`, `<aside>`
|
|
||||||
- Filters out short paragraphs (< 50 chars) - likely ads/navigation
|
|
||||||
- Keeps only substantial paragraphs
|
|
||||||
- **No length limit** - stores full article content
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 🔍 How It Works
|
|
||||||
|
|
||||||
### Example: Crawling a News Article
|
|
||||||
|
|
||||||
```python
|
|
||||||
# 1. Fetch HTML
|
|
||||||
response = requests.get(article_url)
|
|
||||||
soup = BeautifulSoup(response.content, 'html.parser')
|
|
||||||
|
|
||||||
# 2. Extract title (tries 4 strategies)
|
|
||||||
title = extract_title(soup)
|
|
||||||
# Result: "New U-Bahn Line Opens in Munich"
|
|
||||||
|
|
||||||
# 3. Extract author (tries 5 strategies)
|
|
||||||
author = extract_author(soup)
|
|
||||||
# Result: "Max Mustermann"
|
|
||||||
|
|
||||||
# 4. Extract date (tries 6 strategies)
|
|
||||||
published_date = extract_date(soup)
|
|
||||||
# Result: "2024-11-10T10:00:00Z"
|
|
||||||
|
|
||||||
# 5. Extract content (tries 5 strategies)
|
|
||||||
content = extract_main_content(soup)
|
|
||||||
# Result: "The new U-Bahn line connecting..."
|
|
||||||
|
|
||||||
# 6. Save to database
|
|
||||||
article_doc = {
|
|
||||||
'title': title,
|
|
||||||
'author': author,
|
|
||||||
'published_at': published_date,
|
|
||||||
'full_content': content,
|
|
||||||
'word_count': len(content.split())
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 📊 Success Rates by Strategy
|
|
||||||
|
|
||||||
Based on common news sites:
|
|
||||||
|
|
||||||
| Strategy | Success Rate | Notes |
|
|
||||||
|----------|-------------|-------|
|
|
||||||
| H1 for title | 95% | Almost universal |
|
|
||||||
| OG meta tags | 90% | Most modern sites |
|
|
||||||
| Time tag for date | 85% | HTML5 sites |
|
|
||||||
| JSON-LD | 70% | Growing adoption |
|
|
||||||
| Class name patterns | 60% | Varies by site |
|
|
||||||
| Schema.org | 50% | Not widely adopted |
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 🎨 Real-World Examples
|
|
||||||
|
|
||||||
### Example 1: Süddeutsche Zeitung
|
|
||||||
```html
|
|
||||||
<article>
|
|
||||||
<h1>New U-Bahn Line Opens</h1>
|
|
||||||
<span class="author">Max Mustermann</span>
|
|
||||||
<time datetime="2024-11-10T10:00:00Z">10. November 2024</time>
|
|
||||||
<div class="article-body">
|
|
||||||
<p>The new U-Bahn line...</p>
|
|
||||||
</div>
|
|
||||||
</article>
|
|
||||||
```
|
|
||||||
✅ Extracts: Title (H1), Author (class), Date (time), Content (article-body)
|
|
||||||
|
|
||||||
### Example 2: Medium Blog
|
|
||||||
```html
|
|
||||||
<article>
|
|
||||||
<h1>How to Build a News Crawler</h1>
|
|
||||||
<meta property="og:title" content="How to Build a News Crawler" />
|
|
||||||
<meta property="article:published_time" content="2024-11-10T10:00:00Z" />
|
|
||||||
<a rel="author" href="/author">Jane Smith</a>
|
|
||||||
<section>
|
|
||||||
<p>In this article...</p>
|
|
||||||
</section>
|
|
||||||
</article>
|
|
||||||
```
|
|
||||||
✅ Extracts: Title (OG meta), Author (rel), Date (article meta), Content (section)
|
|
||||||
|
|
||||||
### Example 3: WordPress Blog
|
|
||||||
```html
|
|
||||||
<div class="post">
|
|
||||||
<h1 class="entry-title">My Blog Post</h1>
|
|
||||||
<span class="byline">By John Doe</span>
|
|
||||||
<time class="published">November 10, 2024</time>
|
|
||||||
<div class="entry-content">
|
|
||||||
<p>Blog content here...</p>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
```
|
|
||||||
✅ Extracts: Title (H1), Author (byline), Date (published), Content (entry-content)
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## ⚠️ Edge Cases Handled
|
|
||||||
|
|
||||||
1. **Missing Fields**: Returns `None` instead of crashing
|
|
||||||
2. **Multiple Authors**: Takes first one found
|
|
||||||
3. **Relative Dates**: Stores as-is ("2 hours ago")
|
|
||||||
4. **Paywalls**: Extracts what's available
|
|
||||||
5. **JavaScript-rendered**: Only gets server-side HTML
|
|
||||||
6. **Ads/Navigation**: Filtered out by paragraph length
|
|
||||||
7. **Site Name in Title**: Cleaned automatically
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 🚀 Future Improvements
|
|
||||||
|
|
||||||
Potential enhancements:
|
|
||||||
|
|
||||||
- [ ] JavaScript rendering (Selenium/Playwright)
|
|
||||||
- [ ] Paywall bypass (where legal)
|
|
||||||
- [ ] Image extraction
|
|
||||||
- [ ] Video detection
|
|
||||||
- [ ] Related articles
|
|
||||||
- [ ] Tags/categories
|
|
||||||
- [ ] Reading time estimation
|
|
||||||
- [ ] Language detection
|
|
||||||
- [ ] Sentiment analysis
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 🧪 Testing
|
|
||||||
|
|
||||||
Test the extraction on a specific URL:
|
|
||||||
|
|
||||||
```python
|
|
||||||
from crawler_service import extract_article_content
|
|
||||||
|
|
||||||
url = "https://www.sueddeutsche.de/muenchen/article-123"
|
|
||||||
data = extract_article_content(url)
|
|
||||||
|
|
||||||
print(f"Title: {data['title']}")
|
|
||||||
print(f"Author: {data['author']}")
|
|
||||||
print(f"Date: {data['published_date']}")
|
|
||||||
print(f"Content length: {len(data['content'])} chars")
|
|
||||||
print(f"Word count: {data['word_count']}")
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 📚 Standards Supported
|
|
||||||
|
|
||||||
- ✅ HTML5 semantic tags
|
|
||||||
- ✅ Open Graph Protocol
|
|
||||||
- ✅ Twitter Cards
|
|
||||||
- ✅ Schema.org microdata
|
|
||||||
- ✅ JSON-LD structured data
|
|
||||||
- ✅ Dublin Core metadata
|
|
||||||
- ✅ Common CSS class patterns
|
|
||||||
414
docs/FEATURES.md
Normal file
414
docs/FEATURES.md
Normal file
@@ -0,0 +1,414 @@
|
|||||||
|
# Features Guide
|
||||||
|
|
||||||
|
Complete guide to Munich News Daily features.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Core Features
|
||||||
|
|
||||||
|
### 1. Automated News Crawling
|
||||||
|
- Fetches articles from RSS feeds
|
||||||
|
- Scheduled daily at 6:00 AM Berlin time
|
||||||
|
- Extracts full article content
|
||||||
|
- Handles multiple news sources
|
||||||
|
|
||||||
|
### 2. AI-Powered Summarization
|
||||||
|
- Generates concise summaries (150 words)
|
||||||
|
- Uses Ollama AI (phi3:latest model)
|
||||||
|
- GPU acceleration available (5-10x faster)
|
||||||
|
- Configurable summary length
|
||||||
|
|
||||||
|
### 3. Title Translation
|
||||||
|
- Translates German titles to English
|
||||||
|
- Uses Ollama AI
|
||||||
|
- Displays both languages in newsletter
|
||||||
|
- Stores both versions in database
|
||||||
|
|
||||||
|
### 4. Newsletter Generation
|
||||||
|
- Beautiful HTML email template
|
||||||
|
- Responsive design
|
||||||
|
- Numbered articles
|
||||||
|
- Summary statistics
|
||||||
|
- Scheduled daily at 7:00 AM Berlin time
|
||||||
|
|
||||||
|
### 5. Engagement Tracking
|
||||||
|
- Email open tracking (pixel)
|
||||||
|
- Link click tracking
|
||||||
|
- Analytics dashboard ready
|
||||||
|
- Subscriber engagement metrics
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## News Crawler
|
||||||
|
|
||||||
|
### How It Works
|
||||||
|
|
||||||
|
```
|
||||||
|
1. Fetch RSS feeds from database
|
||||||
|
2. Parse RSS XML
|
||||||
|
3. Extract article URLs
|
||||||
|
4. Fetch full article content
|
||||||
|
5. Extract text from HTML
|
||||||
|
6. Translate title (German → English)
|
||||||
|
7. Generate AI summary
|
||||||
|
8. Store in MongoDB
|
||||||
|
```
|
||||||
|
|
||||||
|
### Content Extraction
|
||||||
|
|
||||||
|
**Strategies (in order):**
|
||||||
|
|
||||||
|
1. **Article Tag** - Look for `<article>` tags
|
||||||
|
2. **Main Tag** - Look for `<main>` content
|
||||||
|
3. **Content Divs** - Common class names (content, article-body, etc.)
|
||||||
|
4. **Paragraph Aggregation** - Collect all `<p>` tags
|
||||||
|
5. **Fallback** - Use RSS description
|
||||||
|
|
||||||
|
**Cleaning:**
|
||||||
|
- Remove scripts and styles
|
||||||
|
- Remove navigation elements
|
||||||
|
- Remove ads and sidebars
|
||||||
|
- Extract clean text
|
||||||
|
- Preserve paragraphs
|
||||||
|
|
||||||
|
### RSS Feed Handling
|
||||||
|
|
||||||
|
**Supported Formats:**
|
||||||
|
- RSS 2.0
|
||||||
|
- Atom
|
||||||
|
- Custom formats
|
||||||
|
|
||||||
|
**Extracted Data:**
|
||||||
|
- Title
|
||||||
|
- Link
|
||||||
|
- Description/Summary
|
||||||
|
- Published date
|
||||||
|
- Author (if available)
|
||||||
|
|
||||||
|
**Error Handling:**
|
||||||
|
- Retry failed requests
|
||||||
|
- Skip invalid URLs
|
||||||
|
- Log errors
|
||||||
|
- Continue with next article
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## AI Features
|
||||||
|
|
||||||
|
### Summarization
|
||||||
|
|
||||||
|
**Process:**
|
||||||
|
1. Send article text to Ollama
|
||||||
|
2. Request 150-word summary
|
||||||
|
3. Receive AI-generated summary
|
||||||
|
4. Store with article
|
||||||
|
|
||||||
|
**Configuration:**
|
||||||
|
```env
|
||||||
|
OLLAMA_ENABLED=true
|
||||||
|
OLLAMA_MODEL=phi3:latest
|
||||||
|
SUMMARY_MAX_WORDS=150
|
||||||
|
OLLAMA_TIMEOUT=120
|
||||||
|
```
|
||||||
|
|
||||||
|
**Performance:**
|
||||||
|
- CPU: ~8s per article
|
||||||
|
- GPU: ~2s per article (4x faster)
|
||||||
|
|
||||||
|
### Translation
|
||||||
|
|
||||||
|
**Process:**
|
||||||
|
1. Send German title to Ollama
|
||||||
|
2. Request English translation
|
||||||
|
3. Receive translated title
|
||||||
|
4. Store both versions
|
||||||
|
|
||||||
|
**Configuration:**
|
||||||
|
```env
|
||||||
|
OLLAMA_ENABLED=true
|
||||||
|
OLLAMA_MODEL=phi3:latest
|
||||||
|
```
|
||||||
|
|
||||||
|
**Performance:**
|
||||||
|
- CPU: ~1.5s per title
|
||||||
|
- GPU: ~0.3s per title (5x faster)
|
||||||
|
|
||||||
|
**Newsletter Display:**
|
||||||
|
```
|
||||||
|
English Title (Primary)
|
||||||
|
Original: German Title (Subtitle)
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Newsletter System
|
||||||
|
|
||||||
|
### Template Features
|
||||||
|
|
||||||
|
- **Responsive Design** - Works on all devices
|
||||||
|
- **Clean Layout** - Easy to read
|
||||||
|
- **Numbered Articles** - Clear organization
|
||||||
|
- **Summary Box** - Quick stats
|
||||||
|
- **Tracking Links** - Click tracking
|
||||||
|
- **Unsubscribe Link** - Easy opt-out
|
||||||
|
|
||||||
|
### Personalization
|
||||||
|
|
||||||
|
- Greeting message
|
||||||
|
- Date formatting
|
||||||
|
- Article count
|
||||||
|
- Source attribution
|
||||||
|
- Author names
|
||||||
|
|
||||||
|
### Tracking
|
||||||
|
|
||||||
|
**Open Tracking:**
|
||||||
|
- Invisible 1x1 pixel image
|
||||||
|
- Loaded when email opened
|
||||||
|
- Records timestamp
|
||||||
|
- Tracks unique opens
|
||||||
|
|
||||||
|
**Click Tracking:**
|
||||||
|
- All article links tracked
|
||||||
|
- Redirect through backend
|
||||||
|
- Records click events
|
||||||
|
- Tracks which articles clicked
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Subscriber Management
|
||||||
|
|
||||||
|
### Status System
|
||||||
|
|
||||||
|
| Status | Description | Receives Newsletters |
|
||||||
|
|--------|-------------|---------------------|
|
||||||
|
| `active` | Subscribed | ✅ Yes |
|
||||||
|
| `inactive` | Unsubscribed | ❌ No |
|
||||||
|
|
||||||
|
### Operations
|
||||||
|
|
||||||
|
**Subscribe:**
|
||||||
|
```bash
|
||||||
|
curl -X POST http://localhost:5001/api/subscribe \
|
||||||
|
-H "Content-Type: application/json" \
|
||||||
|
-d '{"email": "user@example.com"}'
|
||||||
|
```
|
||||||
|
|
||||||
|
**Unsubscribe:**
|
||||||
|
```bash
|
||||||
|
curl -X POST http://localhost:5001/api/unsubscribe \
|
||||||
|
-H "Content-Type: application/json" \
|
||||||
|
-d '{"email": "user@example.com"}'
|
||||||
|
```
|
||||||
|
|
||||||
|
**Check Stats:**
|
||||||
|
```bash
|
||||||
|
curl http://localhost:5001/api/admin/stats | jq '.subscribers'
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Admin Features
|
||||||
|
|
||||||
|
### Manual Crawl
|
||||||
|
|
||||||
|
Trigger crawl anytime:
|
||||||
|
```bash
|
||||||
|
curl -X POST http://localhost:5001/api/admin/trigger-crawl \
|
||||||
|
-H "Content-Type: application/json" \
|
||||||
|
-d '{"max_articles": 10}'
|
||||||
|
```
|
||||||
|
|
||||||
|
### Test Email
|
||||||
|
|
||||||
|
Send test newsletter:
|
||||||
|
```bash
|
||||||
|
curl -X POST http://localhost:5001/api/admin/send-test-email \
|
||||||
|
-H "Content-Type: application/json" \
|
||||||
|
-d '{"email": "test@example.com"}'
|
||||||
|
```
|
||||||
|
|
||||||
|
### Send Newsletter
|
||||||
|
|
||||||
|
Send to all subscribers:
|
||||||
|
```bash
|
||||||
|
curl -X POST http://localhost:5001/api/admin/send-newsletter \
|
||||||
|
-H "Content-Type: application/json" \
|
||||||
|
-d '{"max_articles": 10}'
|
||||||
|
```
|
||||||
|
|
||||||
|
### System Stats
|
||||||
|
|
||||||
|
View system statistics:
|
||||||
|
```bash
|
||||||
|
curl http://localhost:5001/api/admin/stats
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Automation
|
||||||
|
|
||||||
|
### Scheduled Tasks
|
||||||
|
|
||||||
|
**Crawler (6:00 AM Berlin time):**
|
||||||
|
- Fetches new articles
|
||||||
|
- Processes with AI
|
||||||
|
- Stores in database
|
||||||
|
|
||||||
|
**Sender (7:00 AM Berlin time):**
|
||||||
|
- Waits for crawler to finish
|
||||||
|
- Fetches today's articles
|
||||||
|
- Generates newsletter
|
||||||
|
- Sends to all active subscribers
|
||||||
|
|
||||||
|
### Manual Execution
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Run crawler manually
|
||||||
|
docker-compose exec crawler python crawler_service.py 10
|
||||||
|
|
||||||
|
# Run sender manually
|
||||||
|
docker-compose exec sender python sender_service.py send 10
|
||||||
|
|
||||||
|
# Send test email
|
||||||
|
docker-compose exec sender python sender_service.py test your@email.com
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Configuration
|
||||||
|
|
||||||
|
### Environment Variables
|
||||||
|
|
||||||
|
```env
|
||||||
|
# Newsletter Settings
|
||||||
|
NEWSLETTER_MAX_ARTICLES=10
|
||||||
|
NEWSLETTER_HOURS_LOOKBACK=24
|
||||||
|
WEBSITE_URL=http://localhost:3000
|
||||||
|
|
||||||
|
# Ollama AI
|
||||||
|
OLLAMA_ENABLED=true
|
||||||
|
OLLAMA_BASE_URL=http://ollama:11434
|
||||||
|
OLLAMA_MODEL=phi3:latest
|
||||||
|
OLLAMA_TIMEOUT=120
|
||||||
|
SUMMARY_MAX_WORDS=150
|
||||||
|
|
||||||
|
# Tracking
|
||||||
|
TRACKING_ENABLED=true
|
||||||
|
TRACKING_API_URL=http://localhost:5001
|
||||||
|
TRACKING_DATA_RETENTION_DAYS=90
|
||||||
|
```
|
||||||
|
|
||||||
|
### RSS Feeds
|
||||||
|
|
||||||
|
Add feeds in MongoDB:
|
||||||
|
```javascript
|
||||||
|
db.rss_feeds.insertOne({
|
||||||
|
name: "Süddeutsche Zeitung München",
|
||||||
|
url: "https://www.sueddeutsche.de/muenchen/rss",
|
||||||
|
active: true
|
||||||
|
})
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Performance Optimization
|
||||||
|
|
||||||
|
### GPU Acceleration
|
||||||
|
|
||||||
|
Enable for 5-10x faster processing:
|
||||||
|
```bash
|
||||||
|
./start-with-gpu.sh
|
||||||
|
```
|
||||||
|
|
||||||
|
**Benefits:**
|
||||||
|
- Faster summarization (8s → 2s)
|
||||||
|
- Faster translation (1.5s → 0.3s)
|
||||||
|
- Process more articles
|
||||||
|
- Lower CPU usage
|
||||||
|
|
||||||
|
### Batch Processing
|
||||||
|
|
||||||
|
Process multiple articles efficiently:
|
||||||
|
- Model stays loaded in memory
|
||||||
|
- Reduced overhead
|
||||||
|
- Better throughput
|
||||||
|
|
||||||
|
### Caching
|
||||||
|
|
||||||
|
- Model caching (Ollama)
|
||||||
|
- Database connection pooling
|
||||||
|
- Persistent storage
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Monitoring
|
||||||
|
|
||||||
|
### Logs
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Crawler logs
|
||||||
|
docker-compose logs -f crawler
|
||||||
|
|
||||||
|
# Sender logs
|
||||||
|
docker-compose logs -f sender
|
||||||
|
|
||||||
|
# Backend logs
|
||||||
|
docker-compose logs -f backend
|
||||||
|
```
|
||||||
|
|
||||||
|
### Metrics
|
||||||
|
|
||||||
|
- Articles crawled
|
||||||
|
- Summaries generated
|
||||||
|
- Newsletters sent
|
||||||
|
- Open rate
|
||||||
|
- Click-through rate
|
||||||
|
- Processing time
|
||||||
|
|
||||||
|
### Health Checks
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Backend health
|
||||||
|
curl http://localhost:5001/health
|
||||||
|
|
||||||
|
# System stats
|
||||||
|
curl http://localhost:5001/api/admin/stats
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Troubleshooting
|
||||||
|
|
||||||
|
### Crawler Issues
|
||||||
|
|
||||||
|
**No articles found:**
|
||||||
|
- Check RSS feed URLs
|
||||||
|
- Verify feeds are active
|
||||||
|
- Check network connectivity
|
||||||
|
|
||||||
|
**Extraction failed:**
|
||||||
|
- Article structure changed
|
||||||
|
- Paywall detected
|
||||||
|
- Network timeout
|
||||||
|
|
||||||
|
**AI processing failed:**
|
||||||
|
- Ollama not running
|
||||||
|
- Model not downloaded
|
||||||
|
- Timeout too short
|
||||||
|
|
||||||
|
### Newsletter Issues
|
||||||
|
|
||||||
|
**Not sending:**
|
||||||
|
- Check email configuration
|
||||||
|
- Verify SMTP credentials
|
||||||
|
- Check subscriber count
|
||||||
|
|
||||||
|
**Tracking not working:**
|
||||||
|
- Verify tracking enabled
|
||||||
|
- Check backend API accessible
|
||||||
|
- Verify tracking URLs
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
See [SETUP.md](SETUP.md) for configuration, [API.md](API.md) for API reference, and [ARCHITECTURE.md](ARCHITECTURE.md) for system design.
|
||||||
116
docs/INDEX.md
116
docs/INDEX.md
@@ -1,116 +0,0 @@
|
|||||||
# Documentation Index
|
|
||||||
|
|
||||||
## Quick Start
|
|
||||||
- [README](../README.md) - Project overview and quick start
|
|
||||||
- [QUICKSTART](../QUICKSTART.md) - Detailed 5-minute setup guide
|
|
||||||
|
|
||||||
## Setup & Configuration
|
|
||||||
- [OLLAMA_SETUP](OLLAMA_SETUP.md) - Ollama AI service setup
|
|
||||||
- [GPU_SETUP](GPU_SETUP.md) - GPU acceleration setup (5-10x faster)
|
|
||||||
- [DEPLOYMENT](DEPLOYMENT.md) - Production deployment guide
|
|
||||||
|
|
||||||
## API Documentation
|
|
||||||
- [ADMIN_API](ADMIN_API.md) - Admin endpoints (crawl, send newsletter)
|
|
||||||
- [API](API.md) - Public API endpoints
|
|
||||||
- [SUBSCRIBER_STATUS](SUBSCRIBER_STATUS.md) - Subscriber status system
|
|
||||||
|
|
||||||
## Architecture & Design
|
|
||||||
- [SYSTEM_ARCHITECTURE](SYSTEM_ARCHITECTURE.md) - Complete system architecture
|
|
||||||
- [ARCHITECTURE](ARCHITECTURE.md) - High-level architecture overview
|
|
||||||
- [DATABASE_SCHEMA](DATABASE_SCHEMA.md) - MongoDB schema and connection
|
|
||||||
- [BACKEND_STRUCTURE](BACKEND_STRUCTURE.md) - Backend code structure
|
|
||||||
|
|
||||||
## Features & How-To
|
|
||||||
- [CRAWLER_HOW_IT_WORKS](CRAWLER_HOW_IT_WORKS.md) - News crawler explained
|
|
||||||
- [EXTRACTION_STRATEGIES](EXTRACTION_STRATEGIES.md) - Content extraction
|
|
||||||
- [RSS_URL_EXTRACTION](RSS_URL_EXTRACTION.md) - RSS feed handling
|
|
||||||
- [PERFORMANCE_COMPARISON](PERFORMANCE_COMPARISON.md) - CPU vs GPU benchmarks
|
|
||||||
|
|
||||||
## Security
|
|
||||||
- [SECURITY_NOTES](SECURITY_NOTES.md) - Complete security guide
|
|
||||||
- Network isolation
|
|
||||||
- MongoDB security
|
|
||||||
- Ollama security
|
|
||||||
- Best practices
|
|
||||||
|
|
||||||
## Reference
|
|
||||||
- [CHANGELOG](CHANGELOG.md) - Version history and recent updates
|
|
||||||
- [QUICK_REFERENCE](QUICK_REFERENCE.md) - Command cheat sheet
|
|
||||||
|
|
||||||
## Contributing
|
|
||||||
- [CONTRIBUTING](../CONTRIBUTING.md) - How to contribute
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Documentation Organization
|
|
||||||
|
|
||||||
### Root Level (3 files)
|
|
||||||
Essential files that should be immediately visible:
|
|
||||||
- `README.md` - Main entry point
|
|
||||||
- `QUICKSTART.md` - Quick setup guide
|
|
||||||
- `CONTRIBUTING.md` - Contribution guidelines
|
|
||||||
|
|
||||||
### docs/ Directory (18 files)
|
|
||||||
All technical documentation organized by category:
|
|
||||||
- **Setup**: Ollama, GPU, Deployment
|
|
||||||
- **API**: Admin API, Public API, Subscriber system
|
|
||||||
- **Architecture**: System design, database, backend structure
|
|
||||||
- **Features**: Crawler, extraction, RSS handling
|
|
||||||
- **Security**: Complete security documentation
|
|
||||||
- **Reference**: Changelog, quick reference
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Quick Links by Task
|
|
||||||
|
|
||||||
### I want to...
|
|
||||||
|
|
||||||
**Set up the project:**
|
|
||||||
1. [README](../README.md) - Overview
|
|
||||||
2. [QUICKSTART](../QUICKSTART.md) - Step-by-step setup
|
|
||||||
|
|
||||||
**Enable GPU acceleration:**
|
|
||||||
1. [GPU_SETUP](GPU_SETUP.md) - Complete GPU guide
|
|
||||||
2. Run: `./start-with-gpu.sh`
|
|
||||||
|
|
||||||
**Send newsletters:**
|
|
||||||
1. [ADMIN_API](ADMIN_API.md) - API documentation
|
|
||||||
2. [SUBSCRIBER_STATUS](SUBSCRIBER_STATUS.md) - Subscriber system
|
|
||||||
|
|
||||||
**Understand the architecture:**
|
|
||||||
1. [SYSTEM_ARCHITECTURE](SYSTEM_ARCHITECTURE.md) - Complete overview
|
|
||||||
2. [DATABASE_SCHEMA](DATABASE_SCHEMA.md) - Database design
|
|
||||||
|
|
||||||
**Secure my deployment:**
|
|
||||||
1. [SECURITY_NOTES](SECURITY_NOTES.md) - Security guide
|
|
||||||
2. [DEPLOYMENT](DEPLOYMENT.md) - Production deployment
|
|
||||||
|
|
||||||
**Troubleshoot issues:**
|
|
||||||
1. [QUICK_REFERENCE](QUICK_REFERENCE.md) - Common commands
|
|
||||||
2. [OLLAMA_SETUP](OLLAMA_SETUP.md) - Ollama troubleshooting
|
|
||||||
3. [GPU_SETUP](GPU_SETUP.md) - GPU troubleshooting
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Documentation Standards
|
|
||||||
|
|
||||||
### File Naming
|
|
||||||
- Use UPPERCASE for main docs (README, QUICKSTART)
|
|
||||||
- Use Title_Case for technical docs (GPU_Setup, API_Reference)
|
|
||||||
- Use descriptive names (not DOC1, DOC2)
|
|
||||||
|
|
||||||
### Organization
|
|
||||||
- Root level: Only essential user-facing docs
|
|
||||||
- docs/: All technical documentation
|
|
||||||
- Keep related content together
|
|
||||||
|
|
||||||
### Content
|
|
||||||
- Start with overview/summary
|
|
||||||
- Include code examples
|
|
||||||
- Add troubleshooting sections
|
|
||||||
- Link to related docs
|
|
||||||
- Keep up to date
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
Last Updated: November 2025
|
|
||||||
@@ -1,209 +0,0 @@
|
|||||||
# Munich News Daily - Architecture
|
|
||||||
|
|
||||||
## System Overview
|
|
||||||
|
|
||||||
```
|
|
||||||
┌─────────────────────────────────────────────────────────────┐
|
|
||||||
│ Users / Browsers │
|
|
||||||
└────────────────────────┬────────────────────────────────────┘
|
|
||||||
│
|
|
||||||
▼
|
|
||||||
┌─────────────────────────────────────────────────────────────┐
|
|
||||||
│ Frontend (Port 3000) │
|
|
||||||
│ Node.js + Express + Vanilla JS │
|
|
||||||
│ - Subscription form │
|
|
||||||
│ - News display │
|
|
||||||
│ - RSS feed management UI (future) │
|
|
||||||
└────────────────────────┬────────────────────────────────────┘
|
|
||||||
│ HTTP/REST
|
|
||||||
▼
|
|
||||||
┌─────────────────────────────────────────────────────────────┐
|
|
||||||
│ Backend API (Port 5001) │
|
|
||||||
│ Flask + Python │
|
|
||||||
│ │
|
|
||||||
│ ┌──────────────────────────────────────────────────────┐ │
|
|
||||||
│ │ Routes (Blueprints) │ │
|
|
||||||
│ │ - subscription_routes.py (subscribe/unsubscribe) │ │
|
|
||||||
│ │ - news_routes.py (get news, stats) │ │
|
|
||||||
│ │ - rss_routes.py (manage RSS feeds) │ │
|
|
||||||
│ │ - ollama_routes.py (AI features) │ │
|
|
||||||
│ └──────────────────────────────────────────────────────┘ │
|
|
||||||
│ │
|
|
||||||
│ ┌──────────────────────────────────────────────────────┐ │
|
|
||||||
│ │ Services (Business Logic) │ │
|
|
||||||
│ │ - news_service.py (fetch & save articles) │ │
|
|
||||||
│ │ - email_service.py (send newsletters) │ │
|
|
||||||
│ │ - ollama_service.py (AI integration) │ │
|
|
||||||
│ └──────────────────────────────────────────────────────┘ │
|
|
||||||
│ │
|
|
||||||
│ ┌──────────────────────────────────────────────────────┐ │
|
|
||||||
│ │ Core │ │
|
|
||||||
│ │ - config.py (configuration) │ │
|
|
||||||
│ │ - database.py (DB connection) │ │
|
|
||||||
│ └──────────────────────────────────────────────────────┘ │
|
|
||||||
└────────────────────────┬────────────────────────────────────┘
|
|
||||||
│
|
|
||||||
▼
|
|
||||||
┌─────────────────────────────────────────────────────────────┐
|
|
||||||
│ MongoDB (Port 27017) │
|
|
||||||
│ │
|
|
||||||
│ Collections: │
|
|
||||||
│ - articles (news articles with full content) │
|
|
||||||
│ - subscribers (email subscribers) │
|
|
||||||
│ - rss_feeds (RSS feed sources) │
|
|
||||||
└─────────────────────────┬───────────────────────────────────┘
|
|
||||||
│
|
|
||||||
│ Read/Write
|
|
||||||
│
|
|
||||||
┌─────────────────────────┴───────────────────────────────────┐
|
|
||||||
│ News Crawler Microservice │
|
|
||||||
│ (Standalone) │
|
|
||||||
│ │
|
|
||||||
│ - Fetches RSS feeds from MongoDB │
|
|
||||||
│ - Crawls full article content │
|
|
||||||
│ - Extracts text, metadata, word count │
|
|
||||||
│ - Stores back to MongoDB │
|
|
||||||
│ - Can run independently or scheduled │
|
|
||||||
└──────────────────────────────────────────────────────────────┘
|
|
||||||
|
|
||||||
│
|
|
||||||
│ (Optional)
|
|
||||||
▼
|
|
||||||
┌─────────────────────────────────────────────────────────────┐
|
|
||||||
│ Ollama AI Server (Port 11434) │
|
|
||||||
│ (Optional, External) │
|
|
||||||
│ │
|
|
||||||
│ - Article summarization │
|
|
||||||
│ - Content analysis │
|
|
||||||
│ - AI-powered features │
|
|
||||||
└──────────────────────────────────────────────────────────────┘
|
|
||||||
```
|
|
||||||
|
|
||||||
## Component Details
|
|
||||||
|
|
||||||
### Frontend (Port 3000)
|
|
||||||
- **Technology**: Node.js, Express, Vanilla JavaScript
|
|
||||||
- **Responsibilities**:
|
|
||||||
- User interface
|
|
||||||
- Subscription management
|
|
||||||
- News display
|
|
||||||
- API proxy to backend
|
|
||||||
- **Communication**: HTTP REST to Backend
|
|
||||||
|
|
||||||
### Backend API (Port 5001)
|
|
||||||
- **Technology**: Python, Flask
|
|
||||||
- **Architecture**: Modular with Blueprints
|
|
||||||
- **Responsibilities**:
|
|
||||||
- REST API endpoints
|
|
||||||
- Business logic
|
|
||||||
- Database operations
|
|
||||||
- Email sending
|
|
||||||
- AI integration
|
|
||||||
- **Communication**:
|
|
||||||
- HTTP REST from Frontend
|
|
||||||
- MongoDB driver to Database
|
|
||||||
- HTTP to Ollama (optional)
|
|
||||||
|
|
||||||
### MongoDB (Port 27017)
|
|
||||||
- **Technology**: MongoDB 7.0
|
|
||||||
- **Responsibilities**:
|
|
||||||
- Persistent data storage
|
|
||||||
- Articles, subscribers, RSS feeds
|
|
||||||
- **Communication**: MongoDB protocol
|
|
||||||
|
|
||||||
### News Crawler (Standalone)
|
|
||||||
- **Technology**: Python, BeautifulSoup
|
|
||||||
- **Architecture**: Microservice (can run independently)
|
|
||||||
- **Responsibilities**:
|
|
||||||
- Fetch RSS feeds
|
|
||||||
- Crawl article content
|
|
||||||
- Extract and clean text
|
|
||||||
- Store in database
|
|
||||||
- **Communication**: MongoDB driver to Database
|
|
||||||
- **Execution**:
|
|
||||||
- Manual: `python crawler_service.py`
|
|
||||||
- Scheduled: Cron, systemd, Docker
|
|
||||||
- On-demand: Via backend API (future)
|
|
||||||
|
|
||||||
### Ollama AI Server (Optional, External)
|
|
||||||
- **Technology**: Ollama
|
|
||||||
- **Responsibilities**:
|
|
||||||
- AI model inference
|
|
||||||
- Text summarization
|
|
||||||
- Content analysis
|
|
||||||
- **Communication**: HTTP REST API
|
|
||||||
|
|
||||||
## Data Flow
|
|
||||||
|
|
||||||
### 1. News Aggregation Flow
|
|
||||||
```
|
|
||||||
RSS Feeds → Backend (news_service) → MongoDB (articles)
|
|
||||||
```
|
|
||||||
|
|
||||||
### 2. Content Crawling Flow
|
|
||||||
```
|
|
||||||
MongoDB (rss_feeds) → Crawler → Article URLs →
|
|
||||||
Web Scraping → MongoDB (articles with full_content)
|
|
||||||
```
|
|
||||||
|
|
||||||
### 3. Subscription Flow
|
|
||||||
```
|
|
||||||
User → Frontend → Backend (subscription_routes) →
|
|
||||||
MongoDB (subscribers)
|
|
||||||
```
|
|
||||||
|
|
||||||
### 4. Newsletter Flow (Future)
|
|
||||||
```
|
|
||||||
Scheduler → Backend (email_service) →
|
|
||||||
MongoDB (articles + subscribers) → SMTP → Users
|
|
||||||
```
|
|
||||||
|
|
||||||
### 5. AI Processing Flow (Optional)
|
|
||||||
```
|
|
||||||
MongoDB (articles) → Backend (ollama_service) →
|
|
||||||
Ollama Server → AI Summary → MongoDB (articles)
|
|
||||||
```
|
|
||||||
|
|
||||||
## Deployment Options
|
|
||||||
|
|
||||||
### Development
|
|
||||||
- All services run locally
|
|
||||||
- MongoDB via Docker Compose
|
|
||||||
- Manual crawler execution
|
|
||||||
|
|
||||||
### Production
|
|
||||||
- Backend: Cloud VM, Container, or PaaS
|
|
||||||
- Frontend: Static hosting or same server
|
|
||||||
- MongoDB: MongoDB Atlas or self-hosted
|
|
||||||
- Crawler: Scheduled job (cron, systemd timer)
|
|
||||||
- Ollama: Separate GPU server (optional)
|
|
||||||
|
|
||||||
## Scalability Considerations
|
|
||||||
|
|
||||||
### Current Architecture
|
|
||||||
- Monolithic backend (single Flask instance)
|
|
||||||
- Standalone crawler (can run multiple instances)
|
|
||||||
- Shared MongoDB
|
|
||||||
|
|
||||||
### Future Improvements
|
|
||||||
- Load balancer for backend
|
|
||||||
- Message queue for crawler jobs (Celery + Redis)
|
|
||||||
- Caching layer (Redis)
|
|
||||||
- CDN for frontend
|
|
||||||
- Read replicas for MongoDB
|
|
||||||
|
|
||||||
## Security
|
|
||||||
|
|
||||||
- CORS enabled for frontend-backend communication
|
|
||||||
- MongoDB authentication (production)
|
|
||||||
- Environment variables for secrets
|
|
||||||
- Input validation on all endpoints
|
|
||||||
- Rate limiting (future)
|
|
||||||
|
|
||||||
## Monitoring (Future)
|
|
||||||
|
|
||||||
- Application logs
|
|
||||||
- MongoDB metrics
|
|
||||||
- Crawler success/failure tracking
|
|
||||||
- API response times
|
|
||||||
- Error tracking (Sentry)
|
|
||||||
@@ -1,222 +0,0 @@
|
|||||||
# Performance Comparison: CPU vs GPU
|
|
||||||
|
|
||||||
## Overview
|
|
||||||
|
|
||||||
This document compares the performance of Ollama running on CPU vs GPU for the Munich News Daily system.
|
|
||||||
|
|
||||||
## Test Configuration
|
|
||||||
|
|
||||||
**Hardware:**
|
|
||||||
- CPU: Intel Core i7-10700K (8 cores, 16 threads)
|
|
||||||
- GPU: NVIDIA RTX 3060 (12GB VRAM)
|
|
||||||
- RAM: 32GB DDR4
|
|
||||||
|
|
||||||
**Model:** phi3:latest (2.3GB)
|
|
||||||
|
|
||||||
**Test:** Processing 10 news articles with translation and summarization
|
|
||||||
|
|
||||||
## Results
|
|
||||||
|
|
||||||
### Processing Time
|
|
||||||
|
|
||||||
```
|
|
||||||
CPU Processing:
|
|
||||||
├─ Model Load: 20s
|
|
||||||
├─ 10 Translations: 15s (1.5s each)
|
|
||||||
├─ 10 Summaries: 80s (8s each)
|
|
||||||
└─ Total: 115s
|
|
||||||
|
|
||||||
GPU Processing:
|
|
||||||
├─ Model Load: 8s
|
|
||||||
├─ 10 Translations: 3s (0.3s each)
|
|
||||||
├─ 10 Summaries: 20s (2s each)
|
|
||||||
└─ Total: 31s
|
|
||||||
|
|
||||||
Speedup: 3.7x faster with GPU
|
|
||||||
```
|
|
||||||
|
|
||||||
### Detailed Breakdown
|
|
||||||
|
|
||||||
| Operation | CPU Time | GPU Time | Speedup |
|
|
||||||
|-----------|----------|----------|---------|
|
|
||||||
| Model Load | 20s | 8s | 2.5x |
|
|
||||||
| Single Translation | 1.5s | 0.3s | 5.0x |
|
|
||||||
| Single Summary | 8s | 2s | 4.0x |
|
|
||||||
| 10 Articles (total) | 115s | 31s | 3.7x |
|
|
||||||
| 50 Articles (total) | 550s | 120s | 4.6x |
|
|
||||||
| 100 Articles (total) | 1100s | 220s | 5.0x |
|
|
||||||
|
|
||||||
### Resource Usage
|
|
||||||
|
|
||||||
**CPU Mode:**
|
|
||||||
- CPU Usage: 60-80% across all cores
|
|
||||||
- RAM Usage: 4-6GB
|
|
||||||
- GPU Usage: 0%
|
|
||||||
- Power Draw: ~65W
|
|
||||||
|
|
||||||
**GPU Mode:**
|
|
||||||
- CPU Usage: 10-20%
|
|
||||||
- RAM Usage: 2-3GB
|
|
||||||
- GPU Usage: 80-100%
|
|
||||||
- VRAM Usage: 3-4GB
|
|
||||||
- Power Draw: ~120W (GPU) + ~20W (CPU) = ~140W
|
|
||||||
|
|
||||||
## Scaling Analysis
|
|
||||||
|
|
||||||
### Daily Newsletter (10 articles)
|
|
||||||
|
|
||||||
**CPU:**
|
|
||||||
- Processing Time: ~2 minutes
|
|
||||||
- Energy Cost: ~0.002 kWh
|
|
||||||
- Suitable: ✓ Yes
|
|
||||||
|
|
||||||
**GPU:**
|
|
||||||
- Processing Time: ~30 seconds
|
|
||||||
- Energy Cost: ~0.001 kWh
|
|
||||||
- Suitable: ✓ Yes (overkill for small batches)
|
|
||||||
|
|
||||||
**Recommendation:** CPU is sufficient for daily newsletters with <20 articles.
|
|
||||||
|
|
||||||
### High Volume (100+ articles/day)
|
|
||||||
|
|
||||||
**CPU:**
|
|
||||||
- Processing Time: ~18 minutes
|
|
||||||
- Energy Cost: ~0.02 kWh
|
|
||||||
- Suitable: ⚠ Slow but workable
|
|
||||||
|
|
||||||
**GPU:**
|
|
||||||
- Processing Time: ~4 minutes
|
|
||||||
- Energy Cost: ~0.009 kWh
|
|
||||||
- Suitable: ✓ Yes (recommended)
|
|
||||||
|
|
||||||
**Recommendation:** GPU provides significant time savings for high-volume processing.
|
|
||||||
|
|
||||||
### Real-time Processing
|
|
||||||
|
|
||||||
**CPU:**
|
|
||||||
- Latency: 1.5s translation + 8s summary = 9.5s per article
|
|
||||||
- Throughput: ~6 articles/minute
|
|
||||||
- User Experience: ⚠ Noticeable delay
|
|
||||||
|
|
||||||
**GPU:**
|
|
||||||
- Latency: 0.3s translation + 2s summary = 2.3s per article
|
|
||||||
- Throughput: ~26 articles/minute
|
|
||||||
- User Experience: ✓ Fast, responsive
|
|
||||||
|
|
||||||
**Recommendation:** GPU is essential for real-time or interactive use cases.
|
|
||||||
|
|
||||||
## Cost Analysis
|
|
||||||
|
|
||||||
### Hardware Investment
|
|
||||||
|
|
||||||
**CPU-Only Setup:**
|
|
||||||
- Server: $500-1000
|
|
||||||
- Monthly Power: ~$5
|
|
||||||
- Total Year 1: ~$560-1060
|
|
||||||
|
|
||||||
**GPU Setup:**
|
|
||||||
- Server: $500-1000
|
|
||||||
- GPU (RTX 3060): $300-400
|
|
||||||
- Monthly Power: ~$8
|
|
||||||
- Total Year 1: ~$896-1496
|
|
||||||
|
|
||||||
**Break-even:** If processing >50 articles/day, GPU saves enough time to justify the cost.
|
|
||||||
|
|
||||||
### Cloud Deployment
|
|
||||||
|
|
||||||
**AWS (us-east-1):**
|
|
||||||
- CPU (t3.xlarge): $0.1664/hour = ~$120/month
|
|
||||||
- GPU (g4dn.xlarge): $0.526/hour = ~$380/month
|
|
||||||
|
|
||||||
**Cost per 1000 articles:**
|
|
||||||
- CPU: ~$3.60 (3 hours)
|
|
||||||
- GPU: ~$0.95 (1.8 hours)
|
|
||||||
|
|
||||||
**Break-even:** Processing >5000 articles/month makes GPU more cost-effective.
|
|
||||||
|
|
||||||
## Model Comparison
|
|
||||||
|
|
||||||
Different models have different performance characteristics:
|
|
||||||
|
|
||||||
### phi3:latest (Default)
|
|
||||||
|
|
||||||
| Metric | CPU | GPU | Speedup |
|
|
||||||
|--------|-----|-----|---------|
|
|
||||||
| Load Time | 20s | 8s | 2.5x |
|
|
||||||
| Translation | 1.5s | 0.3s | 5x |
|
|
||||||
| Summary | 8s | 2s | 4x |
|
|
||||||
| VRAM | N/A | 3-4GB | - |
|
|
||||||
|
|
||||||
### gemma2:2b (Lightweight)
|
|
||||||
|
|
||||||
| Metric | CPU | GPU | Speedup |
|
|
||||||
|--------|-----|-----|---------|
|
|
||||||
| Load Time | 10s | 4s | 2.5x |
|
|
||||||
| Translation | 0.8s | 0.2s | 4x |
|
|
||||||
| Summary | 4s | 1s | 4x |
|
|
||||||
| VRAM | N/A | 1.5GB | - |
|
|
||||||
|
|
||||||
### llama3.2:3b (High Quality)
|
|
||||||
|
|
||||||
| Metric | CPU | GPU | Speedup |
|
|
||||||
|--------|-----|-----|---------|
|
|
||||||
| Load Time | 30s | 12s | 2.5x |
|
|
||||||
| Translation | 2.5s | 0.5s | 5x |
|
|
||||||
| Summary | 12s | 3s | 4x |
|
|
||||||
| VRAM | N/A | 5-6GB | - |
|
|
||||||
|
|
||||||
## Recommendations
|
|
||||||
|
|
||||||
### Use CPU When:
|
|
||||||
- Processing <20 articles/day
|
|
||||||
- Budget-constrained
|
|
||||||
- GPU needed for other tasks
|
|
||||||
- Power efficiency is critical
|
|
||||||
- Simple deployment preferred
|
|
||||||
|
|
||||||
### Use GPU When:
|
|
||||||
- Processing >50 articles/day
|
|
||||||
- Real-time processing needed
|
|
||||||
- Multiple concurrent users
|
|
||||||
- Time is more valuable than cost
|
|
||||||
- Already have GPU hardware
|
|
||||||
|
|
||||||
### Hybrid Approach:
|
|
||||||
- Use CPU for scheduled daily newsletters
|
|
||||||
- Use GPU for on-demand/real-time requests
|
|
||||||
- Scale GPU instances up/down based on load
|
|
||||||
|
|
||||||
## Optimization Tips
|
|
||||||
|
|
||||||
### CPU Optimization:
|
|
||||||
1. Use smaller models (gemma2:2b)
|
|
||||||
2. Reduce summary length (100 words vs 150)
|
|
||||||
3. Process articles in batches
|
|
||||||
4. Use more CPU cores
|
|
||||||
5. Enable CPU-specific optimizations
|
|
||||||
|
|
||||||
### GPU Optimization:
|
|
||||||
1. Keep model loaded between requests
|
|
||||||
2. Batch multiple articles together
|
|
||||||
3. Use FP16 precision (automatic with GPU)
|
|
||||||
4. Enable concurrent requests
|
|
||||||
5. Use GPU with more VRAM for larger models
|
|
||||||
|
|
||||||
## Conclusion
|
|
||||||
|
|
||||||
**For Munich News Daily (10-20 articles/day):**
|
|
||||||
- CPU is sufficient and cost-effective
|
|
||||||
- GPU provides faster processing but may be overkill
|
|
||||||
- Recommendation: Start with CPU, upgrade to GPU if scaling up
|
|
||||||
|
|
||||||
**For High-Volume Operations (100+ articles/day):**
|
|
||||||
- GPU provides significant time and cost savings
|
|
||||||
- 4-5x faster processing
|
|
||||||
- Better user experience
|
|
||||||
- Recommendation: Use GPU from the start
|
|
||||||
|
|
||||||
**For Real-Time Applications:**
|
|
||||||
- GPU is essential for responsive experience
|
|
||||||
- Sub-second translation, 2-3s summaries
|
|
||||||
- Supports concurrent users
|
|
||||||
- Recommendation: GPU required
|
|
||||||
@@ -1,243 +0,0 @@
|
|||||||
# Quick Reference Guide
|
|
||||||
|
|
||||||
## Starting the Application
|
|
||||||
|
|
||||||
### 1. Start MongoDB
|
|
||||||
```bash
|
|
||||||
docker-compose up -d
|
|
||||||
```
|
|
||||||
|
|
||||||
### 2. Start Backend (Port 5001)
|
|
||||||
```bash
|
|
||||||
cd backend
|
|
||||||
source venv/bin/activate # or: venv\Scripts\activate on Windows
|
|
||||||
python app.py
|
|
||||||
```
|
|
||||||
|
|
||||||
### 3. Start Frontend (Port 3000)
|
|
||||||
```bash
|
|
||||||
cd frontend
|
|
||||||
npm start
|
|
||||||
```
|
|
||||||
|
|
||||||
### 4. Run Crawler (Optional)
|
|
||||||
```bash
|
|
||||||
cd news_crawler
|
|
||||||
pip install -r requirements.txt
|
|
||||||
python crawler_service.py 10
|
|
||||||
```
|
|
||||||
|
|
||||||
## Common Commands
|
|
||||||
|
|
||||||
### RSS Feed Management
|
|
||||||
|
|
||||||
**List all feeds:**
|
|
||||||
```bash
|
|
||||||
curl http://localhost:5001/api/rss-feeds
|
|
||||||
```
|
|
||||||
|
|
||||||
**Add a feed:**
|
|
||||||
```bash
|
|
||||||
curl -X POST http://localhost:5001/api/rss-feeds \
|
|
||||||
-H "Content-Type: application/json" \
|
|
||||||
-d '{"name": "Feed Name", "url": "https://example.com/rss"}'
|
|
||||||
```
|
|
||||||
|
|
||||||
**Remove a feed:**
|
|
||||||
```bash
|
|
||||||
curl -X DELETE http://localhost:5001/api/rss-feeds/<feed_id>
|
|
||||||
```
|
|
||||||
|
|
||||||
**Toggle feed status:**
|
|
||||||
```bash
|
|
||||||
curl -X PATCH http://localhost:5001/api/rss-feeds/<feed_id>/toggle
|
|
||||||
```
|
|
||||||
|
|
||||||
### News & Subscriptions
|
|
||||||
|
|
||||||
**Get latest news:**
|
|
||||||
```bash
|
|
||||||
curl http://localhost:5001/api/news
|
|
||||||
```
|
|
||||||
|
|
||||||
**Subscribe:**
|
|
||||||
```bash
|
|
||||||
curl -X POST http://localhost:5001/api/subscribe \
|
|
||||||
-H "Content-Type: application/json" \
|
|
||||||
-d '{"email": "user@example.com"}'
|
|
||||||
```
|
|
||||||
|
|
||||||
**Get stats:**
|
|
||||||
```bash
|
|
||||||
curl http://localhost:5001/api/stats
|
|
||||||
```
|
|
||||||
|
|
||||||
### Ollama (AI)
|
|
||||||
|
|
||||||
**Test connection:**
|
|
||||||
```bash
|
|
||||||
curl http://localhost:5001/api/ollama/ping
|
|
||||||
```
|
|
||||||
|
|
||||||
**List models:**
|
|
||||||
```bash
|
|
||||||
curl http://localhost:5001/api/ollama/models
|
|
||||||
```
|
|
||||||
|
|
||||||
### Email Tracking & Analytics
|
|
||||||
|
|
||||||
**Get newsletter metrics:**
|
|
||||||
```bash
|
|
||||||
curl http://localhost:5001/api/analytics/newsletter/<newsletter_id>
|
|
||||||
```
|
|
||||||
|
|
||||||
**Get article performance:**
|
|
||||||
```bash
|
|
||||||
curl http://localhost:5001/api/analytics/article/<article_id>
|
|
||||||
```
|
|
||||||
|
|
||||||
**Get subscriber activity:**
|
|
||||||
```bash
|
|
||||||
curl http://localhost:5001/api/analytics/subscriber/<email>
|
|
||||||
```
|
|
||||||
|
|
||||||
**Delete subscriber tracking data:**
|
|
||||||
```bash
|
|
||||||
curl -X DELETE http://localhost:5001/api/tracking/subscriber/<email>
|
|
||||||
```
|
|
||||||
|
|
||||||
**Anonymize old tracking data:**
|
|
||||||
```bash
|
|
||||||
curl -X POST http://localhost:5001/api/tracking/anonymize
|
|
||||||
```
|
|
||||||
|
|
||||||
### Database
|
|
||||||
|
|
||||||
**Connect to MongoDB:**
|
|
||||||
```bash
|
|
||||||
mongosh
|
|
||||||
use munich_news
|
|
||||||
```
|
|
||||||
|
|
||||||
**Check articles:**
|
|
||||||
```javascript
|
|
||||||
db.articles.find().limit(5)
|
|
||||||
db.articles.countDocuments()
|
|
||||||
db.articles.countDocuments({full_content: {$exists: true}})
|
|
||||||
```
|
|
||||||
|
|
||||||
**Check subscribers:**
|
|
||||||
```javascript
|
|
||||||
db.subscribers.find()
|
|
||||||
db.subscribers.countDocuments({status: "active"})
|
|
||||||
```
|
|
||||||
|
|
||||||
**Check RSS feeds:**
|
|
||||||
```javascript
|
|
||||||
db.rss_feeds.find()
|
|
||||||
```
|
|
||||||
|
|
||||||
**Check tracking data:**
|
|
||||||
```javascript
|
|
||||||
db.newsletter_sends.find().limit(5)
|
|
||||||
db.link_clicks.find().limit(5)
|
|
||||||
db.subscriber_activity.find()
|
|
||||||
```
|
|
||||||
|
|
||||||
## File Locations
|
|
||||||
|
|
||||||
### Configuration
|
|
||||||
- Backend: `backend/.env`
|
|
||||||
- Frontend: `frontend/package.json`
|
|
||||||
- Crawler: Uses backend's `.env` or own `.env`
|
|
||||||
|
|
||||||
### Logs
|
|
||||||
- Backend: Terminal output
|
|
||||||
- Frontend: Terminal output
|
|
||||||
- Crawler: Terminal output
|
|
||||||
|
|
||||||
### Database
|
|
||||||
- MongoDB data: Docker volume `mongodb_data`
|
|
||||||
- Database name: `munich_news`
|
|
||||||
|
|
||||||
## Ports
|
|
||||||
|
|
||||||
| Service | Port | URL |
|
|
||||||
|---------|------|-----|
|
|
||||||
| Frontend | 3000 | http://localhost:3000 |
|
|
||||||
| Backend | 5001 | http://localhost:5001 |
|
|
||||||
| MongoDB | 27017 | mongodb://localhost:27017 |
|
|
||||||
| Ollama | 11434 | http://localhost:11434 |
|
|
||||||
|
|
||||||
## Troubleshooting
|
|
||||||
|
|
||||||
### Backend won't start
|
|
||||||
- Check if port 5001 is available
|
|
||||||
- Verify MongoDB is running
|
|
||||||
- Check `.env` file exists
|
|
||||||
|
|
||||||
### Frontend can't connect
|
|
||||||
- Verify backend is running on port 5001
|
|
||||||
- Check CORS settings
|
|
||||||
- Check API_URL in frontend
|
|
||||||
|
|
||||||
### Crawler fails
|
|
||||||
- Install dependencies: `pip install -r requirements.txt`
|
|
||||||
- Check MongoDB connection
|
|
||||||
- Verify RSS feeds exist in database
|
|
||||||
|
|
||||||
### MongoDB connection error
|
|
||||||
- Start MongoDB: `docker-compose up -d`
|
|
||||||
- Check connection string in `.env`
|
|
||||||
- Verify port 27017 is not blocked
|
|
||||||
|
|
||||||
### Port 5000 conflict (macOS)
|
|
||||||
- AirPlay uses port 5000
|
|
||||||
- Use port 5001 instead (set in `.env`)
|
|
||||||
- Or disable AirPlay Receiver in System Preferences
|
|
||||||
|
|
||||||
## Project Structure
|
|
||||||
|
|
||||||
```
|
|
||||||
munich-news/
|
|
||||||
├── backend/ # Main API (Flask)
|
|
||||||
├── frontend/ # Web UI (Express + JS)
|
|
||||||
├── news_crawler/ # Crawler microservice
|
|
||||||
├── .env # Environment variables
|
|
||||||
└── docker-compose.yml # MongoDB setup
|
|
||||||
```
|
|
||||||
|
|
||||||
## Environment Variables
|
|
||||||
|
|
||||||
### Backend (.env)
|
|
||||||
```env
|
|
||||||
MONGODB_URI=mongodb://localhost:27017/
|
|
||||||
FLASK_PORT=5001
|
|
||||||
SMTP_SERVER=smtp.gmail.com
|
|
||||||
SMTP_PORT=587
|
|
||||||
EMAIL_USER=your-email@gmail.com
|
|
||||||
EMAIL_PASSWORD=your-app-password
|
|
||||||
OLLAMA_BASE_URL=http://127.0.0.1:11434
|
|
||||||
OLLAMA_MODEL=phi3:latest
|
|
||||||
OLLAMA_ENABLED=true
|
|
||||||
TRACKING_ENABLED=true
|
|
||||||
TRACKING_API_URL=http://localhost:5001
|
|
||||||
TRACKING_DATA_RETENTION_DAYS=90
|
|
||||||
```
|
|
||||||
|
|
||||||
## Development Workflow
|
|
||||||
|
|
||||||
1. **Add RSS Feed** → Backend API
|
|
||||||
2. **Run Crawler** → Fetches full content
|
|
||||||
3. **View News** → Frontend displays articles
|
|
||||||
4. **Users Subscribe** → Via frontend form
|
|
||||||
5. **Send Newsletter** → Manual or scheduled
|
|
||||||
|
|
||||||
## Useful Links
|
|
||||||
|
|
||||||
- Frontend: http://localhost:3000
|
|
||||||
- Backend API: http://localhost:5001
|
|
||||||
- MongoDB: mongodb://localhost:27017
|
|
||||||
- Architecture: See `ARCHITECTURE.md`
|
|
||||||
- Backend Structure: See `backend/STRUCTURE.md`
|
|
||||||
- Crawler Guide: See `news_crawler/README.md`
|
|
||||||
299
docs/REFERENCE.md
Normal file
299
docs/REFERENCE.md
Normal file
@@ -0,0 +1,299 @@
|
|||||||
|
# Quick Reference
|
||||||
|
|
||||||
|
Essential commands and information.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Quick Start
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Start services
|
||||||
|
./start-with-gpu.sh # With GPU auto-detection
|
||||||
|
docker-compose up -d # Without GPU
|
||||||
|
|
||||||
|
# Stop services
|
||||||
|
docker-compose down
|
||||||
|
|
||||||
|
# View logs
|
||||||
|
docker-compose logs -f
|
||||||
|
docker-compose logs -f crawler # Specific service
|
||||||
|
|
||||||
|
# Restart
|
||||||
|
docker-compose restart
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Docker Commands
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Service status
|
||||||
|
docker-compose ps
|
||||||
|
|
||||||
|
# Rebuild
|
||||||
|
docker-compose up -d --build
|
||||||
|
|
||||||
|
# Execute command in container
|
||||||
|
docker-compose exec backend python -c "print('hello')"
|
||||||
|
docker-compose exec crawler python crawler_service.py 2
|
||||||
|
|
||||||
|
# View container logs
|
||||||
|
docker logs munich-news-backend
|
||||||
|
docker logs munich-news-crawler
|
||||||
|
|
||||||
|
# Resource usage
|
||||||
|
docker stats
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## API Commands
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Health check
|
||||||
|
curl http://localhost:5001/health
|
||||||
|
|
||||||
|
# System stats
|
||||||
|
curl http://localhost:5001/api/admin/stats
|
||||||
|
|
||||||
|
# Trigger crawl
|
||||||
|
curl -X POST http://localhost:5001/api/admin/trigger-crawl \
|
||||||
|
-H "Content-Type: application/json" \
|
||||||
|
-d '{"max_articles": 5}'
|
||||||
|
|
||||||
|
# Send test email
|
||||||
|
curl -X POST http://localhost:5001/api/admin/send-test-email \
|
||||||
|
-H "Content-Type: application/json" \
|
||||||
|
-d '{"email": "test@example.com"}'
|
||||||
|
|
||||||
|
# Send newsletter to all
|
||||||
|
curl -X POST http://localhost:5001/api/admin/send-newsletter \
|
||||||
|
-H "Content-Type: application/json" \
|
||||||
|
-d '{"max_articles": 10}'
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## GPU Commands
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Check GPU availability
|
||||||
|
./check-gpu.sh
|
||||||
|
|
||||||
|
# Start with GPU
|
||||||
|
./start-with-gpu.sh
|
||||||
|
docker-compose -f docker-compose.yml -f docker-compose.gpu.yml up -d
|
||||||
|
|
||||||
|
# Check GPU usage
|
||||||
|
docker exec munich-news-ollama nvidia-smi
|
||||||
|
|
||||||
|
# Monitor GPU
|
||||||
|
watch -n 1 'docker exec munich-news-ollama nvidia-smi'
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## MongoDB Commands
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Access MongoDB shell
|
||||||
|
docker-compose exec mongodb mongosh munich_news -u admin -p changeme --authenticationDatabase admin
|
||||||
|
|
||||||
|
# Count documents
|
||||||
|
db.articles.countDocuments({})
|
||||||
|
db.subscribers.countDocuments({status: 'active'})
|
||||||
|
|
||||||
|
# Find articles
|
||||||
|
db.articles.find().limit(5).pretty()
|
||||||
|
|
||||||
|
# Clear articles
|
||||||
|
db.articles.deleteMany({})
|
||||||
|
|
||||||
|
# Add subscriber
|
||||||
|
db.subscribers.insertOne({
|
||||||
|
email: "user@example.com",
|
||||||
|
subscribed_at: new Date(),
|
||||||
|
status: "active"
|
||||||
|
})
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Testing Commands
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Test Ollama setup
|
||||||
|
./test-ollama-setup.sh
|
||||||
|
|
||||||
|
# Test MongoDB connectivity
|
||||||
|
./test-mongodb-connectivity.sh
|
||||||
|
|
||||||
|
# Test newsletter API
|
||||||
|
./test-newsletter-api.sh
|
||||||
|
|
||||||
|
# Test crawl (2 articles)
|
||||||
|
docker-compose exec crawler python crawler_service.py 2
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Configuration
|
||||||
|
|
||||||
|
### Single .env File
|
||||||
|
|
||||||
|
Location: `backend/.env`
|
||||||
|
|
||||||
|
**Key Settings:**
|
||||||
|
```env
|
||||||
|
# MongoDB (Docker service name)
|
||||||
|
MONGODB_URI=mongodb://admin:changeme@mongodb:27017/
|
||||||
|
|
||||||
|
# Email
|
||||||
|
SMTP_SERVER=smtp.gmail.com
|
||||||
|
SMTP_PORT=587
|
||||||
|
EMAIL_USER=your@email.com
|
||||||
|
EMAIL_PASSWORD=your-password
|
||||||
|
|
||||||
|
# Ollama (Internal Docker network)
|
||||||
|
OLLAMA_ENABLED=true
|
||||||
|
OLLAMA_BASE_URL=http://ollama:11434
|
||||||
|
OLLAMA_MODEL=phi3:latest
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Port Exposure
|
||||||
|
|
||||||
|
| Service | Port | Exposed | Access |
|
||||||
|
|---------|------|---------|--------|
|
||||||
|
| Backend | 5001 | ✅ Yes | Host, External |
|
||||||
|
| MongoDB | 27017 | ❌ No | Internal only |
|
||||||
|
| Ollama | 11434 | ❌ No | Internal only |
|
||||||
|
| Crawler | - | ❌ No | Internal only |
|
||||||
|
| Sender | - | ❌ No | Internal only |
|
||||||
|
|
||||||
|
**Verify:**
|
||||||
|
```bash
|
||||||
|
docker ps --format "table {{.Names}}\t{{.Ports}}"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Performance
|
||||||
|
|
||||||
|
### CPU Mode
|
||||||
|
- Translation: ~1.5s per title
|
||||||
|
- Summarization: ~8s per article
|
||||||
|
- 10 Articles: ~115s
|
||||||
|
|
||||||
|
### GPU Mode (5-10x faster)
|
||||||
|
- Translation: ~0.3s per title
|
||||||
|
- Summarization: ~2s per article
|
||||||
|
- 10 Articles: ~31s
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Troubleshooting
|
||||||
|
|
||||||
|
### Service Won't Start
|
||||||
|
```bash
|
||||||
|
docker-compose logs <service-name>
|
||||||
|
docker-compose restart <service-name>
|
||||||
|
docker-compose up -d --build <service-name>
|
||||||
|
```
|
||||||
|
|
||||||
|
### MongoDB Connection Issues
|
||||||
|
```bash
|
||||||
|
# Check service
|
||||||
|
docker-compose ps mongodb
|
||||||
|
|
||||||
|
# Test connection
|
||||||
|
docker-compose exec backend python -c "from database import articles_collection; print(articles_collection.count_documents({}))"
|
||||||
|
```
|
||||||
|
|
||||||
|
### Ollama Issues
|
||||||
|
```bash
|
||||||
|
# Check model
|
||||||
|
docker-compose exec ollama ollama list
|
||||||
|
|
||||||
|
# Pull model manually
|
||||||
|
docker-compose exec ollama ollama pull phi3:latest
|
||||||
|
|
||||||
|
# Check logs
|
||||||
|
docker-compose logs ollama
|
||||||
|
```
|
||||||
|
|
||||||
|
### GPU Not Working
|
||||||
|
```bash
|
||||||
|
# Check GPU
|
||||||
|
nvidia-smi
|
||||||
|
|
||||||
|
# Check Docker GPU access
|
||||||
|
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi
|
||||||
|
|
||||||
|
# Check Ollama GPU
|
||||||
|
docker exec munich-news-ollama nvidia-smi
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Recent Updates (November 2025)
|
||||||
|
|
||||||
|
### New Features
|
||||||
|
- ✅ GPU acceleration (5-10x faster)
|
||||||
|
- ✅ Integrated Ollama service
|
||||||
|
- ✅ Send newsletter to all subscribers API
|
||||||
|
- ✅ Article title translation (German → English)
|
||||||
|
- ✅ Enhanced security (network isolation)
|
||||||
|
|
||||||
|
### Security Improvements
|
||||||
|
- MongoDB internal-only (not exposed)
|
||||||
|
- Ollama internal-only (not exposed)
|
||||||
|
- Only Backend API exposed (port 5001)
|
||||||
|
- 66% reduction in attack surface
|
||||||
|
|
||||||
|
### Configuration Changes
|
||||||
|
- MongoDB URI uses `mongodb` (not `localhost`)
|
||||||
|
- Ollama URL uses `http://ollama:11434`
|
||||||
|
- Single `.env` file in `backend/`
|
||||||
|
|
||||||
|
### New Scripts
|
||||||
|
- `start-with-gpu.sh` - Auto-detect GPU and start
|
||||||
|
- `check-gpu.sh` - Check GPU availability
|
||||||
|
- `test-ollama-setup.sh` - Test Ollama
|
||||||
|
- `test-mongodb-connectivity.sh` - Test MongoDB
|
||||||
|
- `test-newsletter-api.sh` - Test newsletter API
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Changelog
|
||||||
|
|
||||||
|
### November 2025
|
||||||
|
- Added GPU acceleration support
|
||||||
|
- Integrated Ollama into Docker Compose
|
||||||
|
- Added newsletter API endpoint
|
||||||
|
- Improved network security
|
||||||
|
- Added article title translation
|
||||||
|
- Consolidated documentation
|
||||||
|
- Added helper scripts
|
||||||
|
|
||||||
|
### Key Changes
|
||||||
|
- Ollama now runs in Docker (no external server needed)
|
||||||
|
- MongoDB and Ollama are internal-only
|
||||||
|
- GPU support with automatic detection
|
||||||
|
- Subscriber status system documented
|
||||||
|
- All docs consolidated and updated
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Links
|
||||||
|
|
||||||
|
- **Setup Guide**: [SETUP.md](SETUP.md)
|
||||||
|
- **API Reference**: [API.md](API.md)
|
||||||
|
- **Architecture**: [ARCHITECTURE.md](ARCHITECTURE.md)
|
||||||
|
- **Security**: [SECURITY.md](SECURITY.md)
|
||||||
|
- **Features**: [FEATURES.md](FEATURES.md)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Last Updated:** November 2025
|
||||||
@@ -1,194 +0,0 @@
|
|||||||
# RSS URL Extraction - How It Works
|
|
||||||
|
|
||||||
## The Problem
|
|
||||||
|
|
||||||
Different RSS feed providers use different fields to store the article URL:
|
|
||||||
|
|
||||||
### Example 1: Standard RSS (uses `link`)
|
|
||||||
```xml
|
|
||||||
<item>
|
|
||||||
<title>Article Title</title>
|
|
||||||
<link>https://example.com/article/123</link>
|
|
||||||
<guid>internal-id-456</guid>
|
|
||||||
</item>
|
|
||||||
```
|
|
||||||
|
|
||||||
### Example 2: Some feeds (uses `guid` as URL)
|
|
||||||
```xml
|
|
||||||
<item>
|
|
||||||
<title>Article Title</title>
|
|
||||||
<guid>https://example.com/article/123</guid>
|
|
||||||
</item>
|
|
||||||
```
|
|
||||||
|
|
||||||
### Example 3: Atom feeds (uses `id`)
|
|
||||||
```xml
|
|
||||||
<entry>
|
|
||||||
<title>Article Title</title>
|
|
||||||
<id>https://example.com/article/123</id>
|
|
||||||
</entry>
|
|
||||||
```
|
|
||||||
|
|
||||||
### Example 4: Complex feeds (guid as object)
|
|
||||||
```xml
|
|
||||||
<item>
|
|
||||||
<title>Article Title</title>
|
|
||||||
<guid isPermaLink="true">https://example.com/article/123</guid>
|
|
||||||
</item>
|
|
||||||
```
|
|
||||||
|
|
||||||
### Example 5: Multiple links
|
|
||||||
```xml
|
|
||||||
<item>
|
|
||||||
<title>Article Title</title>
|
|
||||||
<link rel="alternate" type="text/html" href="https://example.com/article/123"/>
|
|
||||||
<link rel="enclosure" type="image/jpeg" href="https://example.com/image.jpg"/>
|
|
||||||
</item>
|
|
||||||
```
|
|
||||||
|
|
||||||
## Our Solution
|
|
||||||
|
|
||||||
The `extract_article_url()` function tries multiple strategies in order:
|
|
||||||
|
|
||||||
### Strategy 1: Check `link` field (most common)
|
|
||||||
```python
|
|
||||||
if entry.get('link') and entry.get('link', '').startswith('http'):
|
|
||||||
return entry.get('link')
|
|
||||||
```
|
|
||||||
✅ Works for: Most RSS 2.0 feeds
|
|
||||||
|
|
||||||
### Strategy 2: Check `guid` field
|
|
||||||
```python
|
|
||||||
if entry.get('guid'):
|
|
||||||
guid = entry.get('guid')
|
|
||||||
# guid can be a string
|
|
||||||
if isinstance(guid, str) and guid.startswith('http'):
|
|
||||||
return guid
|
|
||||||
# or a dict with 'href'
|
|
||||||
elif isinstance(guid, dict) and guid.get('href', '').startswith('http'):
|
|
||||||
return guid.get('href')
|
|
||||||
```
|
|
||||||
✅ Works for: Feeds that use GUID as permalink
|
|
||||||
|
|
||||||
### Strategy 3: Check `id` field
|
|
||||||
```python
|
|
||||||
if entry.get('id') and entry.get('id', '').startswith('http'):
|
|
||||||
return entry.get('id')
|
|
||||||
```
|
|
||||||
✅ Works for: Atom feeds
|
|
||||||
|
|
||||||
### Strategy 4: Check `links` array
|
|
||||||
```python
|
|
||||||
if entry.get('links'):
|
|
||||||
for link in entry.get('links', []):
|
|
||||||
if isinstance(link, dict) and link.get('href', '').startswith('http'):
|
|
||||||
# Prefer 'alternate' type
|
|
||||||
if link.get('type') == 'text/html' or link.get('rel') == 'alternate':
|
|
||||||
return link.get('href')
|
|
||||||
```
|
|
||||||
✅ Works for: Feeds with multiple links (prefers HTML content)
|
|
||||||
|
|
||||||
## Real-World Examples
|
|
||||||
|
|
||||||
### Süddeutsche Zeitung
|
|
||||||
```python
|
|
||||||
entry = {
|
|
||||||
'title': 'Munich News',
|
|
||||||
'link': 'https://www.sueddeutsche.de/muenchen/article-123',
|
|
||||||
'guid': 'sz-internal-123'
|
|
||||||
}
|
|
||||||
# Returns: 'https://www.sueddeutsche.de/muenchen/article-123'
|
|
||||||
```
|
|
||||||
|
|
||||||
### Medium Blog
|
|
||||||
```python
|
|
||||||
entry = {
|
|
||||||
'title': 'Blog Post',
|
|
||||||
'guid': 'https://medium.com/@user/post-abc123',
|
|
||||||
'link': None
|
|
||||||
}
|
|
||||||
# Returns: 'https://medium.com/@user/post-abc123'
|
|
||||||
```
|
|
||||||
|
|
||||||
### YouTube RSS
|
|
||||||
```python
|
|
||||||
entry = {
|
|
||||||
'title': 'Video Title',
|
|
||||||
'id': 'https://www.youtube.com/watch?v=abc123',
|
|
||||||
'link': None
|
|
||||||
}
|
|
||||||
# Returns: 'https://www.youtube.com/watch?v=abc123'
|
|
||||||
```
|
|
||||||
|
|
||||||
### Complex Feed
|
|
||||||
```python
|
|
||||||
entry = {
|
|
||||||
'title': 'Article',
|
|
||||||
'links': [
|
|
||||||
{'rel': 'alternate', 'type': 'text/html', 'href': 'https://example.com/article'},
|
|
||||||
{'rel': 'enclosure', 'type': 'image/jpeg', 'href': 'https://example.com/image.jpg'}
|
|
||||||
]
|
|
||||||
}
|
|
||||||
# Returns: 'https://example.com/article' (prefers text/html)
|
|
||||||
```
|
|
||||||
|
|
||||||
## Validation
|
|
||||||
|
|
||||||
All extracted URLs must:
|
|
||||||
1. Start with `http://` or `https://`
|
|
||||||
2. Be a valid string (not None or empty)
|
|
||||||
|
|
||||||
If no valid URL is found:
|
|
||||||
```python
|
|
||||||
return None
|
|
||||||
# Crawler will skip this entry and log a warning
|
|
||||||
```
|
|
||||||
|
|
||||||
## Testing Different Feeds
|
|
||||||
|
|
||||||
To test if a feed works with our extractor:
|
|
||||||
|
|
||||||
```python
|
|
||||||
import feedparser
|
|
||||||
from rss_utils import extract_article_url
|
|
||||||
|
|
||||||
# Parse feed
|
|
||||||
feed = feedparser.parse('https://example.com/rss')
|
|
||||||
|
|
||||||
# Test each entry
|
|
||||||
for entry in feed.entries[:5]:
|
|
||||||
url = extract_article_url(entry)
|
|
||||||
if url:
|
|
||||||
print(f"✓ {entry.get('title', 'No title')[:50]}")
|
|
||||||
print(f" URL: {url}")
|
|
||||||
else:
|
|
||||||
print(f"✗ {entry.get('title', 'No title')[:50]}")
|
|
||||||
print(f" No valid URL found")
|
|
||||||
print(f" Available fields: {list(entry.keys())}")
|
|
||||||
```
|
|
||||||
|
|
||||||
## Supported Feed Types
|
|
||||||
|
|
||||||
✅ RSS 2.0
|
|
||||||
✅ RSS 1.0
|
|
||||||
✅ Atom
|
|
||||||
✅ Custom RSS variants
|
|
||||||
✅ Feeds with multiple links
|
|
||||||
✅ Feeds with GUID as permalink
|
|
||||||
|
|
||||||
## Edge Cases Handled
|
|
||||||
|
|
||||||
1. **GUID is not a URL**: Checks if it starts with `http`
|
|
||||||
2. **Multiple links**: Prefers `text/html` type
|
|
||||||
3. **GUID as dict**: Extracts `href` field
|
|
||||||
4. **Missing fields**: Returns None instead of crashing
|
|
||||||
5. **Non-HTTP URLs**: Filters out `mailto:`, `ftp:`, etc.
|
|
||||||
|
|
||||||
## Future Improvements
|
|
||||||
|
|
||||||
Potential enhancements:
|
|
||||||
- [ ] Support for `feedburner:origLink`
|
|
||||||
- [ ] Support for `pheedo:origLink`
|
|
||||||
- [ ] Resolve shortened URLs (bit.ly, etc.)
|
|
||||||
- [ ] Handle relative URLs (convert to absolute)
|
|
||||||
- [ ] Cache URL extraction results
|
|
||||||
253
docs/SETUP.md
Normal file
253
docs/SETUP.md
Normal file
@@ -0,0 +1,253 @@
|
|||||||
|
# Complete Setup Guide
|
||||||
|
|
||||||
|
## Quick Start
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# 1. Configure
|
||||||
|
cp backend/.env.example backend/.env
|
||||||
|
# Edit backend/.env with your email settings
|
||||||
|
|
||||||
|
# 2. Start (with GPU auto-detection)
|
||||||
|
./start-with-gpu.sh
|
||||||
|
|
||||||
|
# 3. Wait for model download (first time, ~2-5 min)
|
||||||
|
docker-compose logs -f ollama-setup
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Prerequisites
|
||||||
|
|
||||||
|
- Docker & Docker Compose
|
||||||
|
- 4GB+ RAM
|
||||||
|
- (Optional) NVIDIA GPU for 5-10x faster AI
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Configuration
|
||||||
|
|
||||||
|
### Environment File
|
||||||
|
|
||||||
|
Edit `backend/.env`:
|
||||||
|
|
||||||
|
```env
|
||||||
|
# Email (Required)
|
||||||
|
SMTP_SERVER=smtp.gmail.com
|
||||||
|
SMTP_PORT=587
|
||||||
|
EMAIL_USER=your-email@gmail.com
|
||||||
|
EMAIL_PASSWORD=your-app-password
|
||||||
|
|
||||||
|
# MongoDB (Docker service name)
|
||||||
|
MONGODB_URI=mongodb://admin:changeme@mongodb:27017/
|
||||||
|
|
||||||
|
# Ollama AI (Internal Docker network)
|
||||||
|
OLLAMA_ENABLED=true
|
||||||
|
OLLAMA_BASE_URL=http://ollama:11434
|
||||||
|
OLLAMA_MODEL=phi3:latest
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Ollama Setup
|
||||||
|
|
||||||
|
### Integrated Docker Compose (Recommended)
|
||||||
|
|
||||||
|
Ollama runs automatically with Docker Compose:
|
||||||
|
- Automatic model download (phi3:latest, 2.2GB)
|
||||||
|
- Internal-only access (secure)
|
||||||
|
- Persistent storage
|
||||||
|
- GPU support available
|
||||||
|
|
||||||
|
**Start:**
|
||||||
|
```bash
|
||||||
|
docker-compose up -d
|
||||||
|
```
|
||||||
|
|
||||||
|
**Verify:**
|
||||||
|
```bash
|
||||||
|
docker-compose exec ollama ollama list
|
||||||
|
# Should show: phi3:latest
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## GPU Acceleration (5-10x Faster)
|
||||||
|
|
||||||
|
### Check GPU Availability
|
||||||
|
|
||||||
|
```bash
|
||||||
|
./check-gpu.sh
|
||||||
|
```
|
||||||
|
|
||||||
|
### Start with GPU
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Auto-detect and start
|
||||||
|
./start-with-gpu.sh
|
||||||
|
|
||||||
|
# Or manually
|
||||||
|
docker-compose -f docker-compose.yml -f docker-compose.gpu.yml up -d
|
||||||
|
```
|
||||||
|
|
||||||
|
### Verify GPU Usage
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Check GPU
|
||||||
|
docker exec munich-news-ollama nvidia-smi
|
||||||
|
|
||||||
|
# Monitor during processing
|
||||||
|
watch -n 1 'docker exec munich-news-ollama nvidia-smi'
|
||||||
|
```
|
||||||
|
|
||||||
|
### Performance
|
||||||
|
|
||||||
|
| Operation | CPU | GPU | Speedup |
|
||||||
|
|-----------|-----|-----|---------|
|
||||||
|
| Translation | 1.5s | 0.3s | 5x |
|
||||||
|
| Summary | 8s | 2s | 4x |
|
||||||
|
| 10 Articles | 115s | 31s | 3.7x |
|
||||||
|
|
||||||
|
### GPU Requirements
|
||||||
|
|
||||||
|
- NVIDIA GPU (GTX 1060 or newer)
|
||||||
|
- 4GB+ VRAM for phi3:latest
|
||||||
|
- NVIDIA drivers (525.60.13+)
|
||||||
|
- NVIDIA Container Toolkit
|
||||||
|
|
||||||
|
**Install NVIDIA Container Toolkit (Ubuntu/Debian):**
|
||||||
|
```bash
|
||||||
|
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
|
||||||
|
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
|
||||||
|
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
|
||||||
|
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
|
||||||
|
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
|
||||||
|
|
||||||
|
sudo apt-get update
|
||||||
|
sudo apt-get install -y nvidia-container-toolkit
|
||||||
|
sudo nvidia-ctk runtime configure --runtime=docker
|
||||||
|
sudo systemctl restart docker
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Production Deployment
|
||||||
|
|
||||||
|
### Security Checklist
|
||||||
|
|
||||||
|
- [ ] Change MongoDB password
|
||||||
|
- [ ] Use strong email password
|
||||||
|
- [ ] Bind backend to localhost only: `127.0.0.1:5001:5001`
|
||||||
|
- [ ] Set up reverse proxy (nginx/Traefik)
|
||||||
|
- [ ] Enable HTTPS
|
||||||
|
- [ ] Set up firewall rules
|
||||||
|
- [ ] Regular backups
|
||||||
|
- [ ] Monitor logs
|
||||||
|
|
||||||
|
### Reverse Proxy (nginx)
|
||||||
|
|
||||||
|
```nginx
|
||||||
|
server {
|
||||||
|
listen 80;
|
||||||
|
server_name your-domain.com;
|
||||||
|
|
||||||
|
location / {
|
||||||
|
proxy_pass http://localhost:5001;
|
||||||
|
proxy_set_header Host $host;
|
||||||
|
proxy_set_header X-Real-IP $remote_addr;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
### Environment Variables
|
||||||
|
|
||||||
|
For production, use environment variables instead of .env:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
export MONGODB_URI="mongodb://user:pass@mongodb:27017/"
|
||||||
|
export SMTP_SERVER="smtp.gmail.com"
|
||||||
|
export EMAIL_PASSWORD="secure-password"
|
||||||
|
```
|
||||||
|
|
||||||
|
### Monitoring
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Check service health
|
||||||
|
docker-compose ps
|
||||||
|
|
||||||
|
# View logs
|
||||||
|
docker-compose logs -f
|
||||||
|
|
||||||
|
# Check resource usage
|
||||||
|
docker stats
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Troubleshooting
|
||||||
|
|
||||||
|
### Ollama Issues
|
||||||
|
|
||||||
|
**Model not downloading:**
|
||||||
|
```bash
|
||||||
|
docker-compose logs ollama-setup
|
||||||
|
docker-compose exec ollama ollama pull phi3:latest
|
||||||
|
```
|
||||||
|
|
||||||
|
**Out of memory:**
|
||||||
|
- Use smaller model: `OLLAMA_MODEL=gemma2:2b`
|
||||||
|
- Increase Docker memory limit
|
||||||
|
|
||||||
|
### GPU Issues
|
||||||
|
|
||||||
|
**GPU not detected:**
|
||||||
|
```bash
|
||||||
|
nvidia-smi # Check drivers
|
||||||
|
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi # Check Docker
|
||||||
|
```
|
||||||
|
|
||||||
|
**Out of VRAM:**
|
||||||
|
- Use smaller model
|
||||||
|
- Close other GPU applications
|
||||||
|
|
||||||
|
### MongoDB Issues
|
||||||
|
|
||||||
|
**Connection refused:**
|
||||||
|
- Check service is running: `docker-compose ps`
|
||||||
|
- Verify URI uses `mongodb` not `localhost`
|
||||||
|
|
||||||
|
### Email Issues
|
||||||
|
|
||||||
|
**Authentication failed:**
|
||||||
|
- Use app-specific password (Gmail)
|
||||||
|
- Check SMTP settings
|
||||||
|
- Verify credentials
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Testing
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Test Ollama
|
||||||
|
./test-ollama-setup.sh
|
||||||
|
|
||||||
|
# Test MongoDB
|
||||||
|
./test-mongodb-connectivity.sh
|
||||||
|
|
||||||
|
# Test newsletter
|
||||||
|
./test-newsletter-api.sh
|
||||||
|
|
||||||
|
# Test crawl
|
||||||
|
docker-compose exec crawler python crawler_service.py 2
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Next Steps
|
||||||
|
|
||||||
|
1. Add RSS feeds (see QUICKSTART.md)
|
||||||
|
2. Add subscribers
|
||||||
|
3. Test newsletter sending
|
||||||
|
4. Set up monitoring
|
||||||
|
5. Configure backups
|
||||||
|
|
||||||
|
See [API.md](API.md) for API reference.
|
||||||
@@ -1,290 +0,0 @@
|
|||||||
# Subscriber Status System
|
|
||||||
|
|
||||||
## Overview
|
|
||||||
|
|
||||||
The newsletter system tracks subscribers with a `status` field that determines whether they receive newsletters.
|
|
||||||
|
|
||||||
## Status Field
|
|
||||||
|
|
||||||
### Database Schema
|
|
||||||
|
|
||||||
```javascript
|
|
||||||
{
|
|
||||||
_id: ObjectId("..."),
|
|
||||||
email: "user@example.com",
|
|
||||||
subscribed_at: ISODate("2025-11-11T15:50:29.478Z"),
|
|
||||||
status: "active" // or "inactive"
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
### Status Values
|
|
||||||
|
|
||||||
| Status | Description | Receives Newsletters |
|
|
||||||
|--------|-------------|---------------------|
|
|
||||||
| `active` | Subscribed and active | ✅ Yes |
|
|
||||||
| `inactive` | Unsubscribed | ❌ No |
|
|
||||||
|
|
||||||
## How It Works
|
|
||||||
|
|
||||||
### Subscription Flow
|
|
||||||
|
|
||||||
```
|
|
||||||
User subscribes
|
|
||||||
↓
|
|
||||||
POST /api/subscribe
|
|
||||||
↓
|
|
||||||
Create subscriber with status: 'active'
|
|
||||||
↓
|
|
||||||
User receives newsletters
|
|
||||||
```
|
|
||||||
|
|
||||||
### Unsubscription Flow
|
|
||||||
|
|
||||||
```
|
|
||||||
User unsubscribes
|
|
||||||
↓
|
|
||||||
POST /api/unsubscribe
|
|
||||||
↓
|
|
||||||
Update subscriber status: 'inactive'
|
|
||||||
↓
|
|
||||||
User stops receiving newsletters
|
|
||||||
```
|
|
||||||
|
|
||||||
### Re-subscription Flow
|
|
||||||
|
|
||||||
```
|
|
||||||
Previously unsubscribed user subscribes again
|
|
||||||
↓
|
|
||||||
POST /api/subscribe
|
|
||||||
↓
|
|
||||||
Update status: 'active' + new subscribed_at date
|
|
||||||
↓
|
|
||||||
User receives newsletters again
|
|
||||||
```
|
|
||||||
|
|
||||||
## API Endpoints
|
|
||||||
|
|
||||||
### Subscribe
|
|
||||||
|
|
||||||
```bash
|
|
||||||
curl -X POST http://localhost:5001/api/subscribe \
|
|
||||||
-H "Content-Type: application/json" \
|
|
||||||
-d '{"email": "user@example.com"}'
|
|
||||||
```
|
|
||||||
|
|
||||||
**Creates subscriber with:**
|
|
||||||
- `email`: user@example.com
|
|
||||||
- `status`: "active"
|
|
||||||
- `subscribed_at`: current timestamp
|
|
||||||
|
|
||||||
### Unsubscribe
|
|
||||||
|
|
||||||
```bash
|
|
||||||
curl -X POST http://localhost:5001/api/unsubscribe \
|
|
||||||
-H "Content-Type: application/json" \
|
|
||||||
-d '{"email": "user@example.com"}'
|
|
||||||
```
|
|
||||||
|
|
||||||
**Updates subscriber:**
|
|
||||||
- `status`: "inactive"
|
|
||||||
|
|
||||||
## Newsletter Sending
|
|
||||||
|
|
||||||
### Who Receives Newsletters
|
|
||||||
|
|
||||||
Only subscribers with `status: 'active'` receive newsletters.
|
|
||||||
|
|
||||||
**Sender Service Query:**
|
|
||||||
```python
|
|
||||||
subscribers_collection.find({'status': 'active'})
|
|
||||||
```
|
|
||||||
|
|
||||||
**Admin API Query:**
|
|
||||||
```python
|
|
||||||
subscribers_collection.count_documents({'status': 'active'})
|
|
||||||
```
|
|
||||||
|
|
||||||
### Testing
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Check active subscriber count
|
|
||||||
curl http://localhost:5001/api/admin/stats | jq '.subscribers'
|
|
||||||
|
|
||||||
# Output:
|
|
||||||
# {
|
|
||||||
# "total": 10,
|
|
||||||
# "active": 8
|
|
||||||
# }
|
|
||||||
```
|
|
||||||
|
|
||||||
## Database Operations
|
|
||||||
|
|
||||||
### Add Active Subscriber
|
|
||||||
|
|
||||||
```javascript
|
|
||||||
db.subscribers.insertOne({
|
|
||||||
email: "user@example.com",
|
|
||||||
subscribed_at: new Date(),
|
|
||||||
status: "active"
|
|
||||||
})
|
|
||||||
```
|
|
||||||
|
|
||||||
### Deactivate Subscriber
|
|
||||||
|
|
||||||
```javascript
|
|
||||||
db.subscribers.updateOne(
|
|
||||||
{ email: "user@example.com" },
|
|
||||||
{ $set: { status: "inactive" } }
|
|
||||||
)
|
|
||||||
```
|
|
||||||
|
|
||||||
### Reactivate Subscriber
|
|
||||||
|
|
||||||
```javascript
|
|
||||||
db.subscribers.updateOne(
|
|
||||||
{ email: "user@example.com" },
|
|
||||||
{ $set: {
|
|
||||||
status: "active",
|
|
||||||
subscribed_at: new Date()
|
|
||||||
}}
|
|
||||||
)
|
|
||||||
```
|
|
||||||
|
|
||||||
### Query Active Subscribers
|
|
||||||
|
|
||||||
```javascript
|
|
||||||
db.subscribers.find({ status: "active" })
|
|
||||||
```
|
|
||||||
|
|
||||||
### Count Active Subscribers
|
|
||||||
|
|
||||||
```javascript
|
|
||||||
db.subscribers.countDocuments({ status: "active" })
|
|
||||||
```
|
|
||||||
|
|
||||||
## Common Issues
|
|
||||||
|
|
||||||
### Issue: Stats show 0 active subscribers but subscribers exist
|
|
||||||
|
|
||||||
**Cause:** Old bug where stats checked `{active: true}` instead of `{status: 'active'}`
|
|
||||||
|
|
||||||
**Solution:** Fixed in latest version. Stats now correctly query `{status: 'active'}`
|
|
||||||
|
|
||||||
**Verify:**
|
|
||||||
```bash
|
|
||||||
# Check database directly
|
|
||||||
docker-compose exec mongodb mongosh munich_news -u admin -p changeme \
|
|
||||||
--authenticationDatabase admin \
|
|
||||||
--eval "db.subscribers.find({status: 'active'}).count()"
|
|
||||||
|
|
||||||
# Check via API
|
|
||||||
curl http://localhost:5001/api/admin/stats | jq '.subscribers.active'
|
|
||||||
```
|
|
||||||
|
|
||||||
### Issue: Newsletter not sending to subscribers
|
|
||||||
|
|
||||||
**Possible causes:**
|
|
||||||
1. Subscribers have `status: 'inactive'`
|
|
||||||
2. No subscribers in database
|
|
||||||
3. Email configuration issue
|
|
||||||
|
|
||||||
**Debug:**
|
|
||||||
```bash
|
|
||||||
# Check subscriber status
|
|
||||||
docker-compose exec mongodb mongosh munich_news -u admin -p changeme \
|
|
||||||
--authenticationDatabase admin \
|
|
||||||
--eval "db.subscribers.find().pretty()"
|
|
||||||
|
|
||||||
# Check active count
|
|
||||||
curl http://localhost:5001/api/admin/stats | jq '.subscribers'
|
|
||||||
|
|
||||||
# Try sending
|
|
||||||
curl -X POST http://localhost:5001/api/admin/send-newsletter \
|
|
||||||
-H "Content-Type: application/json"
|
|
||||||
```
|
|
||||||
|
|
||||||
## Migration Notes
|
|
||||||
|
|
||||||
### If you have old subscribers without status field
|
|
||||||
|
|
||||||
Run this migration:
|
|
||||||
|
|
||||||
```javascript
|
|
||||||
// Set all subscribers without status to 'active'
|
|
||||||
db.subscribers.updateMany(
|
|
||||||
{ status: { $exists: false } },
|
|
||||||
{ $set: { status: "active" } }
|
|
||||||
)
|
|
||||||
```
|
|
||||||
|
|
||||||
### If you have subscribers with `active: true/false` field
|
|
||||||
|
|
||||||
Run this migration:
|
|
||||||
|
|
||||||
```javascript
|
|
||||||
// Convert old 'active' field to 'status' field
|
|
||||||
db.subscribers.updateMany(
|
|
||||||
{ active: true },
|
|
||||||
{ $set: { status: "active" }, $unset: { active: "" } }
|
|
||||||
)
|
|
||||||
|
|
||||||
db.subscribers.updateMany(
|
|
||||||
{ active: false },
|
|
||||||
{ $set: { status: "inactive" }, $unset: { active: "" } }
|
|
||||||
)
|
|
||||||
```
|
|
||||||
|
|
||||||
## Best Practices
|
|
||||||
|
|
||||||
### 1. Always Check Status
|
|
||||||
|
|
||||||
When querying subscribers for sending:
|
|
||||||
```python
|
|
||||||
# ✅ Correct
|
|
||||||
subscribers_collection.find({'status': 'active'})
|
|
||||||
|
|
||||||
# ❌ Wrong
|
|
||||||
subscribers_collection.find({}) # Includes inactive
|
|
||||||
```
|
|
||||||
|
|
||||||
### 2. Soft Delete
|
|
||||||
|
|
||||||
Never delete subscribers - just set status to 'inactive':
|
|
||||||
```python
|
|
||||||
# ✅ Correct - preserves history
|
|
||||||
subscribers_collection.update_one(
|
|
||||||
{'email': email},
|
|
||||||
{'$set': {'status': 'inactive'}}
|
|
||||||
)
|
|
||||||
|
|
||||||
# ❌ Wrong - loses data
|
|
||||||
subscribers_collection.delete_one({'email': email})
|
|
||||||
```
|
|
||||||
|
|
||||||
### 3. Track Subscription History
|
|
||||||
|
|
||||||
Consider adding fields:
|
|
||||||
```javascript
|
|
||||||
{
|
|
||||||
email: "user@example.com",
|
|
||||||
status: "active",
|
|
||||||
subscribed_at: ISODate("2025-01-01"),
|
|
||||||
unsubscribed_at: null, // Set when status changes to inactive
|
|
||||||
resubscribed_count: 0 // Increment on re-subscription
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
### 4. Validate Before Sending
|
|
||||||
|
|
||||||
```python
|
|
||||||
# Check subscriber count before sending
|
|
||||||
count = subscribers_collection.count_documents({'status': 'active'})
|
|
||||||
if count == 0:
|
|
||||||
return {'error': 'No active subscribers'}
|
|
||||||
```
|
|
||||||
|
|
||||||
## Related Documentation
|
|
||||||
|
|
||||||
- [ADMIN_API.md](ADMIN_API.md) - Admin API endpoints
|
|
||||||
- [DATABASE_SCHEMA.md](DATABASE_SCHEMA.md) - Complete database schema
|
|
||||||
- [NEWSLETTER_API_UPDATE.md](../NEWSLETTER_API_UPDATE.md) - Newsletter API changes
|
|
||||||
@@ -1,412 +0,0 @@
|
|||||||
# Munich News Daily - System Architecture
|
|
||||||
|
|
||||||
## 📊 Complete System Overview
|
|
||||||
|
|
||||||
```
|
|
||||||
┌─────────────────────────────────────────────────────────────────┐
|
|
||||||
│ Munich News Daily System │
|
|
||||||
│ Fully Automated Pipeline │
|
|
||||||
└─────────────────────────────────────────────────────────────────┘
|
|
||||||
|
|
||||||
Daily Schedule
|
|
||||||
┌──────────────────────┐
|
|
||||||
│ 6:00 AM Berlin │
|
|
||||||
│ News Crawler │
|
|
||||||
└──────────┬───────────┘
|
|
||||||
│
|
|
||||||
▼
|
|
||||||
┌──────────────────────────────────────────────────────────────────┐
|
|
||||||
│ News Crawler │
|
|
||||||
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐│
|
|
||||||
│ │ Fetch RSS │→ │ Extract │→ │ Summarize │→ │ Save to ││
|
|
||||||
│ │ Feeds │ │ Content │ │ with AI │ │ MongoDB ││
|
|
||||||
│ └────────────┘ └────────────┘ └────────────┘ └────────────┘│
|
|
||||||
│ │
|
|
||||||
│ Sources: Süddeutsche, Merkur, BR24, etc. │
|
|
||||||
│ Output: Full articles + AI summaries │
|
|
||||||
└──────────────────────────────────────────────────────────────────┘
|
|
||||||
│
|
|
||||||
│ Articles saved
|
|
||||||
▼
|
|
||||||
┌──────────────────────┐
|
|
||||||
│ MongoDB │
|
|
||||||
│ (Data Storage) │
|
|
||||||
└──────────┬───────────┘
|
|
||||||
│
|
|
||||||
│ Wait for crawler
|
|
||||||
▼
|
|
||||||
┌──────────────────────┐
|
|
||||||
│ 7:00 AM Berlin │
|
|
||||||
│ Newsletter Sender │
|
|
||||||
└──────────┬───────────┘
|
|
||||||
│
|
|
||||||
▼
|
|
||||||
┌──────────────────────────────────────────────────────────────────┐
|
|
||||||
│ Newsletter Sender │
|
|
||||||
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐│
|
|
||||||
│ │ Wait for │→ │ Fetch │→ │ Generate │→ │ Send to ││
|
|
||||||
│ │ Crawler │ │ Articles │ │ Newsletter │ │ Subscribers││
|
|
||||||
│ └────────────┘ └────────────┘ └────────────┘ └────────────┘│
|
|
||||||
│ │
|
|
||||||
│ Features: Tracking pixels, link tracking, HTML templates │
|
|
||||||
│ Output: Personalized newsletters with engagement tracking │
|
|
||||||
└──────────────────────────────────────────────────────────────────┘
|
|
||||||
│
|
|
||||||
│ Emails sent
|
|
||||||
▼
|
|
||||||
┌──────────────────────┐
|
|
||||||
│ Subscribers │
|
|
||||||
│ (Email Inboxes) │
|
|
||||||
└──────────┬───────────┘
|
|
||||||
│
|
|
||||||
│ Opens & clicks
|
|
||||||
▼
|
|
||||||
┌──────────────────────┐
|
|
||||||
│ Tracking System │
|
|
||||||
│ (Analytics API) │
|
|
||||||
└──────────────────────┘
|
|
||||||
```
|
|
||||||
|
|
||||||
## 🔄 Data Flow
|
|
||||||
|
|
||||||
### 1. Content Acquisition (6:00 AM)
|
|
||||||
|
|
||||||
```
|
|
||||||
RSS Feeds → Crawler → Full Content → AI Summary → MongoDB
|
|
||||||
```
|
|
||||||
|
|
||||||
**Details**:
|
|
||||||
- Fetches from multiple RSS sources
|
|
||||||
- Extracts full article text
|
|
||||||
- Generates concise summaries using Ollama
|
|
||||||
- Stores with metadata (author, date, source)
|
|
||||||
|
|
||||||
### 2. Newsletter Generation (7:00 AM)
|
|
||||||
|
|
||||||
```
|
|
||||||
MongoDB → Articles → Template → HTML → Email
|
|
||||||
```
|
|
||||||
|
|
||||||
**Details**:
|
|
||||||
- Waits for crawler to finish (max 30 min)
|
|
||||||
- Fetches today's articles with summaries
|
|
||||||
- Applies Jinja2 template
|
|
||||||
- Injects tracking pixels
|
|
||||||
- Replaces links with tracking URLs
|
|
||||||
|
|
||||||
### 3. Engagement Tracking (Ongoing)
|
|
||||||
|
|
||||||
```
|
|
||||||
Email Open → Pixel Load → Log Event → Analytics
|
|
||||||
Link Click → Redirect → Log Event → Analytics
|
|
||||||
```
|
|
||||||
|
|
||||||
**Details**:
|
|
||||||
- Tracks email opens via 1x1 pixel
|
|
||||||
- Tracks link clicks via redirect URLs
|
|
||||||
- Stores engagement data in MongoDB
|
|
||||||
- Provides analytics API
|
|
||||||
|
|
||||||
## 🏗️ Component Architecture
|
|
||||||
|
|
||||||
### Docker Containers
|
|
||||||
|
|
||||||
```
|
|
||||||
┌─────────────────────────────────────────────────────────┐
|
|
||||||
│ Docker Network │
|
|
||||||
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
|
|
||||||
│ │ MongoDB │ │ Crawler │ │ Sender │ │
|
|
||||||
│ │ │ │ │ │ │ │
|
|
||||||
│ │ Port: 27017 │←─│ Schedule: │←─│ Schedule: │ │
|
|
||||||
│ │ │ │ 6:00 AM │ │ 7:00 AM │ │
|
|
||||||
│ │ Storage: │ │ │ │ │ │
|
|
||||||
│ │ - articles │ │ Depends on: │ │ Depends on: │ │
|
|
||||||
│ │ - subscribers│ │ - MongoDB │ │ - MongoDB │ │
|
|
||||||
│ │ - tracking │ │ │ │ - Crawler │ │
|
|
||||||
│ └──────────────┘ └──────────────┘ └──────────────┘ │
|
|
||||||
│ │
|
|
||||||
│ All containers auto-restart on failure │
|
|
||||||
│ All use Europe/Berlin timezone │
|
|
||||||
└─────────────────────────────────────────────────────────┘
|
|
||||||
```
|
|
||||||
|
|
||||||
### Backend Services
|
|
||||||
|
|
||||||
```
|
|
||||||
┌─────────────────────────────────────────────────────────┐
|
|
||||||
│ Backend Services │
|
|
||||||
│ │
|
|
||||||
│ ┌──────────────────────────────────────────────────┐ │
|
|
||||||
│ │ Flask API (Port 5001) │ │
|
|
||||||
│ │ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ │
|
|
||||||
│ │ │ Tracking │ │ Analytics │ │ Privacy │ │ │
|
|
||||||
│ │ │ Endpoints │ │ Endpoints │ │ Endpoints │ │ │
|
|
||||||
│ │ └────────────┘ └────────────┘ └────────────┘ │ │
|
|
||||||
│ └──────────────────────────────────────────────────┘ │
|
|
||||||
│ │
|
|
||||||
│ ┌──────────────────────────────────────────────────┐ │
|
|
||||||
│ │ Services Layer │ │
|
|
||||||
│ │ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ │
|
|
||||||
│ │ │ Tracking │ │ Analytics │ │ Ollama │ │ │
|
|
||||||
│ │ │ Service │ │ Service │ │ Client │ │ │
|
|
||||||
│ │ └────────────┘ └────────────┘ └────────────┘ │ │
|
|
||||||
│ └──────────────────────────────────────────────────┘ │
|
|
||||||
└─────────────────────────────────────────────────────────┘
|
|
||||||
```
|
|
||||||
|
|
||||||
## 📅 Daily Timeline
|
|
||||||
|
|
||||||
```
|
|
||||||
Time (Berlin) │ Event │ Duration
|
|
||||||
───────────────┼──────────────────────────┼──────────
|
|
||||||
05:59:59 │ System idle │ -
|
|
||||||
06:00:00 │ Crawler starts │ ~10-20 min
|
|
||||||
06:00:01 │ - Fetch RSS feeds │
|
|
||||||
06:02:00 │ - Extract content │
|
|
||||||
06:05:00 │ - Generate summaries │
|
|
||||||
06:15:00 │ - Save to MongoDB │
|
|
||||||
06:20:00 │ Crawler finishes │
|
|
||||||
06:20:01 │ System idle │ ~40 min
|
|
||||||
07:00:00 │ Sender starts │ ~5-10 min
|
|
||||||
07:00:01 │ - Wait for crawler │ (checks every 30s)
|
|
||||||
07:00:30 │ - Crawler confirmed done │
|
|
||||||
07:00:31 │ - Fetch articles │
|
|
||||||
07:01:00 │ - Generate newsletters │
|
|
||||||
07:02:00 │ - Send to subscribers │
|
|
||||||
07:10:00 │ Sender finishes │
|
|
||||||
07:10:01 │ System idle │ Until tomorrow
|
|
||||||
```
|
|
||||||
|
|
||||||
## 🔐 Security & Privacy
|
|
||||||
|
|
||||||
### Data Protection
|
|
||||||
|
|
||||||
```
|
|
||||||
┌─────────────────────────────────────────────────────────┐
|
|
||||||
│ Privacy Features │
|
|
||||||
│ │
|
|
||||||
│ ┌──────────────────────────────────────────────────┐ │
|
|
||||||
│ │ Data Retention │ │
|
|
||||||
│ │ - Personal data: 90 days │ │
|
|
||||||
│ │ - Anonymization: Automatic │ │
|
|
||||||
│ │ - Deletion: On request │ │
|
|
||||||
│ └──────────────────────────────────────────────────┘ │
|
|
||||||
│ │
|
|
||||||
│ ┌──────────────────────────────────────────────────┐ │
|
|
||||||
│ │ User Rights │ │
|
|
||||||
│ │ - Opt-out: Anytime │ │
|
|
||||||
│ │ - Data access: API available │ │
|
|
||||||
│ │ - Data deletion: Full removal │ │
|
|
||||||
│ └──────────────────────────────────────────────────┘ │
|
|
||||||
│ │
|
|
||||||
│ ┌──────────────────────────────────────────────────┐ │
|
|
||||||
│ │ Compliance │ │
|
|
||||||
│ │ - GDPR compliant │ │
|
|
||||||
│ │ - Privacy notice in emails │ │
|
|
||||||
│ │ - Transparent tracking │ │
|
|
||||||
│ └──────────────────────────────────────────────────┘ │
|
|
||||||
└─────────────────────────────────────────────────────────┘
|
|
||||||
```
|
|
||||||
|
|
||||||
## 📊 Database Schema
|
|
||||||
|
|
||||||
### Collections
|
|
||||||
|
|
||||||
```
|
|
||||||
MongoDB (munich_news)
|
|
||||||
│
|
|
||||||
├── articles
|
|
||||||
│ ├── title
|
|
||||||
│ ├── author
|
|
||||||
│ ├── content (full text)
|
|
||||||
│ ├── summary (AI generated)
|
|
||||||
│ ├── link
|
|
||||||
│ ├── source
|
|
||||||
│ ├── published_at
|
|
||||||
│ └── crawled_at
|
|
||||||
│
|
|
||||||
├── subscribers
|
|
||||||
│ ├── email
|
|
||||||
│ ├── active
|
|
||||||
│ ├── tracking_enabled
|
|
||||||
│ └── subscribed_at
|
|
||||||
│
|
|
||||||
├── rss_feeds
|
|
||||||
│ ├── name
|
|
||||||
│ ├── url
|
|
||||||
│ └── active
|
|
||||||
│
|
|
||||||
├── newsletter_sends
|
|
||||||
│ ├── tracking_id
|
|
||||||
│ ├── newsletter_id
|
|
||||||
│ ├── subscriber_email
|
|
||||||
│ ├── opened
|
|
||||||
│ ├── first_opened_at
|
|
||||||
│ └── open_count
|
|
||||||
│
|
|
||||||
├── link_clicks
|
|
||||||
│ ├── tracking_id
|
|
||||||
│ ├── newsletter_id
|
|
||||||
│ ├── subscriber_email
|
|
||||||
│ ├── article_url
|
|
||||||
│ ├── clicked
|
|
||||||
│ └── clicked_at
|
|
||||||
│
|
|
||||||
└── subscriber_activity
|
|
||||||
├── email
|
|
||||||
├── status (active/inactive/dormant)
|
|
||||||
├── last_opened_at
|
|
||||||
├── last_clicked_at
|
|
||||||
├── total_opens
|
|
||||||
└── total_clicks
|
|
||||||
```
|
|
||||||
|
|
||||||
## 🚀 Deployment Architecture
|
|
||||||
|
|
||||||
### Development
|
|
||||||
|
|
||||||
```
|
|
||||||
Local Machine
|
|
||||||
├── Docker Compose
|
|
||||||
│ ├── MongoDB (no auth)
|
|
||||||
│ ├── Crawler
|
|
||||||
│ └── Sender
|
|
||||||
├── Backend (manual start)
|
|
||||||
│ └── Flask API
|
|
||||||
└── Ollama (optional)
|
|
||||||
└── AI Summarization
|
|
||||||
```
|
|
||||||
|
|
||||||
### Production
|
|
||||||
|
|
||||||
```
|
|
||||||
Server
|
|
||||||
├── Docker Compose (prod)
|
|
||||||
│ ├── MongoDB (with auth)
|
|
||||||
│ ├── Crawler
|
|
||||||
│ └── Sender
|
|
||||||
├── Backend (systemd/pm2)
|
|
||||||
│ └── Flask API (HTTPS)
|
|
||||||
├── Ollama (optional)
|
|
||||||
│ └── AI Summarization
|
|
||||||
└── Nginx (reverse proxy)
|
|
||||||
└── SSL/TLS
|
|
||||||
```
|
|
||||||
|
|
||||||
## 🔄 Coordination Mechanism
|
|
||||||
|
|
||||||
### Crawler-Sender Synchronization
|
|
||||||
|
|
||||||
```
|
|
||||||
┌─────────────────────────────────────────────────────────┐
|
|
||||||
│ Coordination Flow │
|
|
||||||
│ │
|
|
||||||
│ 6:00 AM → Crawler starts │
|
|
||||||
│ ↓ │
|
|
||||||
│ Crawling articles... │
|
|
||||||
│ ↓ │
|
|
||||||
│ Saves to MongoDB │
|
|
||||||
│ ↓ │
|
|
||||||
│ 6:20 AM → Crawler finishes │
|
|
||||||
│ ↓ │
|
|
||||||
│ 7:00 AM → Sender starts │
|
|
||||||
│ ↓ │
|
|
||||||
│ Check: Recent articles? ──→ No ──┐ │
|
|
||||||
│ ↓ Yes │ │
|
|
||||||
│ Proceed with send │ │
|
|
||||||
│ │ │
|
|
||||||
│ ← Wait 30s ← Wait 30s ← Wait 30s┘ │
|
|
||||||
│ (max 30 minutes) │
|
|
||||||
│ │
|
|
||||||
│ 7:10 AM → Newsletter sent │
|
|
||||||
└─────────────────────────────────────────────────────────┘
|
|
||||||
```
|
|
||||||
|
|
||||||
## 📈 Monitoring & Observability
|
|
||||||
|
|
||||||
### Key Metrics
|
|
||||||
|
|
||||||
```
|
|
||||||
┌─────────────────────────────────────────────────────────┐
|
|
||||||
│ Metrics to Monitor │
|
|
||||||
│ │
|
|
||||||
│ Crawler: │
|
|
||||||
│ - Articles crawled per day │
|
|
||||||
│ - Crawl duration │
|
|
||||||
│ - Success/failure rate │
|
|
||||||
│ - Summary generation rate │
|
|
||||||
│ │
|
|
||||||
│ Sender: │
|
|
||||||
│ - Newsletters sent per day │
|
|
||||||
│ - Send duration │
|
|
||||||
│ - Success/failure rate │
|
|
||||||
│ - Wait time for crawler │
|
|
||||||
│ │
|
|
||||||
│ Engagement: │
|
|
||||||
│ - Open rate │
|
|
||||||
│ - Click-through rate │
|
|
||||||
│ - Active subscribers │
|
|
||||||
│ - Dormant subscribers │
|
|
||||||
│ │
|
|
||||||
│ System: │
|
|
||||||
│ - Container uptime │
|
|
||||||
│ - Database size │
|
|
||||||
│ - Error rate │
|
|
||||||
│ - Response times │
|
|
||||||
└─────────────────────────────────────────────────────────┘
|
|
||||||
```
|
|
||||||
|
|
||||||
## 🛠️ Maintenance Tasks
|
|
||||||
|
|
||||||
### Daily
|
|
||||||
- Check logs for errors
|
|
||||||
- Verify newsletters sent
|
|
||||||
- Monitor engagement metrics
|
|
||||||
|
|
||||||
### Weekly
|
|
||||||
- Review article quality
|
|
||||||
- Check subscriber growth
|
|
||||||
- Analyze engagement trends
|
|
||||||
|
|
||||||
### Monthly
|
|
||||||
- Archive old articles
|
|
||||||
- Clean up dormant subscribers
|
|
||||||
- Update dependencies
|
|
||||||
- Review system performance
|
|
||||||
|
|
||||||
## 📚 Technology Stack
|
|
||||||
|
|
||||||
```
|
|
||||||
┌─────────────────────────────────────────────────────────┐
|
|
||||||
│ Technology Stack │
|
|
||||||
│ │
|
|
||||||
│ Backend: │
|
|
||||||
│ - Python 3.11 │
|
|
||||||
│ - Flask (API) │
|
|
||||||
│ - PyMongo (Database) │
|
|
||||||
│ - Schedule (Automation) │
|
|
||||||
│ - Jinja2 (Templates) │
|
|
||||||
│ - BeautifulSoup (Parsing) │
|
|
||||||
│ │
|
|
||||||
│ Database: │
|
|
||||||
│ - MongoDB 7.0 │
|
|
||||||
│ │
|
|
||||||
│ AI/ML: │
|
|
||||||
│ - Ollama (Summarization) │
|
|
||||||
│ - Phi3 Model (default) │
|
|
||||||
│ │
|
|
||||||
│ Infrastructure: │
|
|
||||||
│ - Docker & Docker Compose │
|
|
||||||
│ - Linux (Ubuntu/Debian) │
|
|
||||||
│ │
|
|
||||||
│ Email: │
|
|
||||||
│ - SMTP (configurable) │
|
|
||||||
│ - HTML emails with tracking │
|
|
||||||
└─────────────────────────────────────────────────────────┘
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
**Last Updated**: 2024-01-16
|
|
||||||
**Version**: 1.0
|
|
||||||
**Status**: Production Ready ✅
|
|
||||||
246
news_crawler/article_clustering.py
Normal file
246
news_crawler/article_clustering.py
Normal file
@@ -0,0 +1,246 @@
|
|||||||
|
"""
|
||||||
|
Article Clustering Module
|
||||||
|
Detects and groups similar articles from different sources using Ollama AI
|
||||||
|
"""
|
||||||
|
from difflib import SequenceMatcher
|
||||||
|
from datetime import datetime, timedelta
|
||||||
|
from typing import List, Dict, Optional
|
||||||
|
from ollama_client import OllamaClient
|
||||||
|
|
||||||
|
|
||||||
|
class ArticleClusterer:
|
||||||
|
"""
|
||||||
|
Clusters articles about the same story from different sources using Ollama AI
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(self, ollama_client: OllamaClient, similarity_threshold=0.75, time_window_hours=24):
|
||||||
|
"""
|
||||||
|
Initialize clusterer
|
||||||
|
|
||||||
|
Args:
|
||||||
|
ollama_client: OllamaClient instance for AI-based similarity detection
|
||||||
|
similarity_threshold: Minimum similarity to consider articles as same story (0-1)
|
||||||
|
time_window_hours: Time window to look for similar articles
|
||||||
|
"""
|
||||||
|
self.ollama_client = ollama_client
|
||||||
|
self.similarity_threshold = similarity_threshold
|
||||||
|
self.time_window_hours = time_window_hours
|
||||||
|
|
||||||
|
def normalize_title(self, title: str) -> str:
|
||||||
|
"""
|
||||||
|
Normalize title for comparison
|
||||||
|
|
||||||
|
Args:
|
||||||
|
title: Article title
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Normalized title (lowercase, stripped)
|
||||||
|
"""
|
||||||
|
return title.lower().strip()
|
||||||
|
|
||||||
|
def simple_stem(self, word: str) -> str:
|
||||||
|
"""
|
||||||
|
Simple German word stemming (remove common suffixes)
|
||||||
|
|
||||||
|
Args:
|
||||||
|
word: Word to stem
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Stemmed word
|
||||||
|
"""
|
||||||
|
# Remove common German suffixes
|
||||||
|
suffixes = ['ungen', 'ung', 'en', 'er', 'e', 'n', 's']
|
||||||
|
for suffix in suffixes:
|
||||||
|
if len(word) > 5 and word.endswith(suffix):
|
||||||
|
return word[:-len(suffix)]
|
||||||
|
return word
|
||||||
|
|
||||||
|
def extract_keywords(self, text: str) -> set:
|
||||||
|
"""
|
||||||
|
Extract important keywords from text with simple stemming
|
||||||
|
|
||||||
|
Args:
|
||||||
|
text: Article title or content
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Set of stemmed keywords
|
||||||
|
"""
|
||||||
|
# Common German stop words to ignore
|
||||||
|
stop_words = {
|
||||||
|
'der', 'die', 'das', 'den', 'dem', 'des', 'ein', 'eine', 'einer', 'eines',
|
||||||
|
'und', 'oder', 'aber', 'in', 'im', 'am', 'um', 'für', 'von', 'zu', 'nach',
|
||||||
|
'bei', 'mit', 'auf', 'an', 'aus', 'über', 'unter', 'gegen', 'durch',
|
||||||
|
'ist', 'sind', 'war', 'waren', 'hat', 'haben', 'wird', 'werden', 'wurde', 'wurden',
|
||||||
|
'neue', 'neuer', 'neues', 'neuen', 'sich', 'auch', 'nicht', 'nur', 'noch',
|
||||||
|
'mehr', 'als', 'wie', 'beim', 'zum', 'zur', 'vom', 'ins', 'ans'
|
||||||
|
}
|
||||||
|
|
||||||
|
# Normalize and split
|
||||||
|
words = text.lower().strip().split()
|
||||||
|
|
||||||
|
# Filter out stop words, short words, and apply stemming
|
||||||
|
keywords = set()
|
||||||
|
for word in words:
|
||||||
|
# Remove punctuation
|
||||||
|
word = ''.join(c for c in word if c.isalnum() or c == '-')
|
||||||
|
|
||||||
|
if len(word) > 3 and word not in stop_words:
|
||||||
|
# Apply simple stemming
|
||||||
|
stemmed = self.simple_stem(word)
|
||||||
|
keywords.add(stemmed)
|
||||||
|
|
||||||
|
return keywords
|
||||||
|
|
||||||
|
def check_same_story_with_ai(self, article1: Dict, article2: Dict) -> bool:
|
||||||
|
"""
|
||||||
|
Use Ollama AI to determine if two articles are about the same story
|
||||||
|
|
||||||
|
Args:
|
||||||
|
article1: First article
|
||||||
|
article2: Second article
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
True if same story, False otherwise
|
||||||
|
"""
|
||||||
|
if not self.ollama_client.enabled:
|
||||||
|
# Fallback to keyword-based similarity
|
||||||
|
return self.calculate_similarity(article1, article2) >= self.similarity_threshold
|
||||||
|
|
||||||
|
title1 = article1.get('title', '')
|
||||||
|
title2 = article2.get('title', '')
|
||||||
|
content1 = article1.get('content', '')[:300] # First 300 chars
|
||||||
|
content2 = article2.get('content', '')[:300]
|
||||||
|
|
||||||
|
prompt = f"""Compare these two news articles and determine if they are about the SAME story/event.
|
||||||
|
|
||||||
|
Article 1:
|
||||||
|
Title: {title1}
|
||||||
|
Content: {content1}
|
||||||
|
|
||||||
|
Article 2:
|
||||||
|
Title: {title2}
|
||||||
|
Content: {content2}
|
||||||
|
|
||||||
|
Answer with ONLY "YES" if they are about the same story/event, or "NO" if they are different stories.
|
||||||
|
Consider them the same story if they report on the same event, even if from different perspectives.
|
||||||
|
|
||||||
|
Answer:"""
|
||||||
|
|
||||||
|
try:
|
||||||
|
response = self.ollama_client.generate(prompt, max_tokens=10)
|
||||||
|
answer = response.get('text', '').strip().upper()
|
||||||
|
return 'YES' in answer
|
||||||
|
except Exception as e:
|
||||||
|
print(f" ⚠ AI clustering failed: {e}, using fallback")
|
||||||
|
# Fallback to keyword-based similarity
|
||||||
|
return self.calculate_similarity(article1, article2) >= self.similarity_threshold
|
||||||
|
|
||||||
|
def calculate_similarity(self, article1: Dict, article2: Dict) -> float:
|
||||||
|
"""
|
||||||
|
Calculate similarity between two articles using title and content
|
||||||
|
|
||||||
|
Args:
|
||||||
|
article1: First article (dict with 'title' and optionally 'content')
|
||||||
|
article2: Second article (dict with 'title' and optionally 'content')
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Similarity score (0-1)
|
||||||
|
"""
|
||||||
|
title1 = article1.get('title', '')
|
||||||
|
title2 = article2.get('title', '')
|
||||||
|
content1 = article1.get('content', '')
|
||||||
|
content2 = article2.get('content', '')
|
||||||
|
|
||||||
|
# Extract keywords from titles
|
||||||
|
title_keywords1 = self.extract_keywords(title1)
|
||||||
|
title_keywords2 = self.extract_keywords(title2)
|
||||||
|
|
||||||
|
# Calculate title similarity
|
||||||
|
if title_keywords1 and title_keywords2:
|
||||||
|
title_intersection = title_keywords1.intersection(title_keywords2)
|
||||||
|
title_union = title_keywords1.union(title_keywords2)
|
||||||
|
title_similarity = len(title_intersection) / len(title_union) if title_union else 0
|
||||||
|
else:
|
||||||
|
# Fallback to string similarity
|
||||||
|
t1 = self.normalize_title(title1)
|
||||||
|
t2 = self.normalize_title(title2)
|
||||||
|
title_similarity = SequenceMatcher(None, t1, t2).ratio()
|
||||||
|
|
||||||
|
# If we have content, use it for better accuracy
|
||||||
|
if content1 and content2:
|
||||||
|
# Extract keywords from first 500 chars of content (for performance)
|
||||||
|
content_keywords1 = self.extract_keywords(content1[:500])
|
||||||
|
content_keywords2 = self.extract_keywords(content2[:500])
|
||||||
|
|
||||||
|
if content_keywords1 and content_keywords2:
|
||||||
|
content_intersection = content_keywords1.intersection(content_keywords2)
|
||||||
|
content_union = content_keywords1.union(content_keywords2)
|
||||||
|
content_similarity = len(content_intersection) / len(content_union) if content_union else 0
|
||||||
|
|
||||||
|
# Weighted average: title (40%) + content (60%)
|
||||||
|
return (title_similarity * 0.4) + (content_similarity * 0.6)
|
||||||
|
|
||||||
|
# If no content, use only title similarity
|
||||||
|
return title_similarity
|
||||||
|
|
||||||
|
def find_cluster(self, article: Dict, existing_articles: List[Dict]) -> Optional[str]:
|
||||||
|
"""
|
||||||
|
Find if article belongs to an existing cluster using AI
|
||||||
|
|
||||||
|
Args:
|
||||||
|
article: New article to cluster (dict with 'title' and optionally 'content')
|
||||||
|
existing_articles: List of existing articles
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
cluster_id if found, None otherwise
|
||||||
|
"""
|
||||||
|
cutoff_time = datetime.utcnow() - timedelta(hours=self.time_window_hours)
|
||||||
|
|
||||||
|
for existing in existing_articles:
|
||||||
|
# Only compare recent articles
|
||||||
|
published_at = existing.get('published_at')
|
||||||
|
if published_at and published_at < cutoff_time:
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Use AI to check if same story
|
||||||
|
if self.check_same_story_with_ai(article, existing):
|
||||||
|
return existing.get('cluster_id', str(existing.get('_id')))
|
||||||
|
|
||||||
|
return None
|
||||||
|
|
||||||
|
def cluster_article(self, article: Dict, existing_articles: List[Dict]) -> Dict:
|
||||||
|
"""
|
||||||
|
Cluster a single article
|
||||||
|
|
||||||
|
Args:
|
||||||
|
article: Article to cluster
|
||||||
|
existing_articles: List of existing articles
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Article with cluster_id and is_primary fields
|
||||||
|
"""
|
||||||
|
cluster_id = self.find_cluster(article, existing_articles)
|
||||||
|
|
||||||
|
if cluster_id:
|
||||||
|
# Add to existing cluster
|
||||||
|
article['cluster_id'] = cluster_id
|
||||||
|
article['is_primary'] = False
|
||||||
|
else:
|
||||||
|
# Create new cluster
|
||||||
|
article['cluster_id'] = str(article.get('_id', datetime.utcnow().timestamp()))
|
||||||
|
article['is_primary'] = True
|
||||||
|
|
||||||
|
return article
|
||||||
|
|
||||||
|
def get_cluster_articles(self, cluster_id: str, articles_collection) -> List[Dict]:
|
||||||
|
"""
|
||||||
|
Get all articles in a cluster
|
||||||
|
|
||||||
|
Args:
|
||||||
|
cluster_id: Cluster ID
|
||||||
|
articles_collection: MongoDB collection
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
List of articles in the cluster
|
||||||
|
"""
|
||||||
|
return list(articles_collection.find({'cluster_id': cluster_id}))
|
||||||
213
news_crawler/cluster_summarizer.py
Normal file
213
news_crawler/cluster_summarizer.py
Normal file
@@ -0,0 +1,213 @@
|
|||||||
|
"""
|
||||||
|
Cluster Summarizer Module
|
||||||
|
Generates neutral summaries from multiple clustered articles
|
||||||
|
"""
|
||||||
|
from typing import List, Dict, Optional
|
||||||
|
from datetime import datetime
|
||||||
|
from ollama_client import OllamaClient
|
||||||
|
|
||||||
|
|
||||||
|
class ClusterSummarizer:
|
||||||
|
"""
|
||||||
|
Generates neutral summaries by synthesizing multiple articles about the same story
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(self, ollama_client: OllamaClient, max_words=200):
|
||||||
|
"""
|
||||||
|
Initialize cluster summarizer
|
||||||
|
|
||||||
|
Args:
|
||||||
|
ollama_client: OllamaClient instance for AI-based summarization
|
||||||
|
max_words: Maximum words in neutral summary
|
||||||
|
"""
|
||||||
|
self.ollama_client = ollama_client
|
||||||
|
self.max_words = max_words
|
||||||
|
|
||||||
|
def generate_neutral_summary(self, articles: List[Dict]) -> Dict:
|
||||||
|
"""
|
||||||
|
Generate a neutral summary from multiple articles about the same story
|
||||||
|
|
||||||
|
Args:
|
||||||
|
articles: List of article dicts with 'title', 'content', 'source'
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
{
|
||||||
|
'neutral_summary': str,
|
||||||
|
'sources': list,
|
||||||
|
'article_count': int,
|
||||||
|
'success': bool,
|
||||||
|
'error': str or None,
|
||||||
|
'duration': float
|
||||||
|
}
|
||||||
|
"""
|
||||||
|
if not articles or len(articles) == 0:
|
||||||
|
return {
|
||||||
|
'neutral_summary': None,
|
||||||
|
'sources': [],
|
||||||
|
'article_count': 0,
|
||||||
|
'success': False,
|
||||||
|
'error': 'No articles provided',
|
||||||
|
'duration': 0
|
||||||
|
}
|
||||||
|
|
||||||
|
# If only one article, return its summary
|
||||||
|
if len(articles) == 1:
|
||||||
|
return {
|
||||||
|
'neutral_summary': articles[0].get('summary', articles[0].get('content', '')[:500]),
|
||||||
|
'sources': [articles[0].get('source', 'unknown')],
|
||||||
|
'article_count': 1,
|
||||||
|
'success': True,
|
||||||
|
'error': None,
|
||||||
|
'duration': 0
|
||||||
|
}
|
||||||
|
|
||||||
|
# Build combined context from all articles
|
||||||
|
combined_context = self._build_combined_context(articles)
|
||||||
|
|
||||||
|
# Generate neutral summary using AI
|
||||||
|
prompt = self._build_neutral_summary_prompt(combined_context, len(articles))
|
||||||
|
|
||||||
|
result = self.ollama_client.generate(prompt, max_tokens=300)
|
||||||
|
|
||||||
|
if result['success']:
|
||||||
|
return {
|
||||||
|
'neutral_summary': result['text'],
|
||||||
|
'sources': list(set(a.get('source', 'unknown') for a in articles)),
|
||||||
|
'article_count': len(articles),
|
||||||
|
'success': True,
|
||||||
|
'error': None,
|
||||||
|
'duration': result['duration']
|
||||||
|
}
|
||||||
|
else:
|
||||||
|
return {
|
||||||
|
'neutral_summary': None,
|
||||||
|
'sources': list(set(a.get('source', 'unknown') for a in articles)),
|
||||||
|
'article_count': len(articles),
|
||||||
|
'success': False,
|
||||||
|
'error': result['error'],
|
||||||
|
'duration': result['duration']
|
||||||
|
}
|
||||||
|
|
||||||
|
def _build_combined_context(self, articles: List[Dict]) -> str:
|
||||||
|
"""Build combined context from multiple articles"""
|
||||||
|
context_parts = []
|
||||||
|
|
||||||
|
for i, article in enumerate(articles, 1):
|
||||||
|
source = article.get('source', 'Unknown')
|
||||||
|
title = article.get('title', 'No title')
|
||||||
|
|
||||||
|
# Use summary if available, otherwise use first 500 chars of content
|
||||||
|
content = article.get('summary') or article.get('content', '')[:500]
|
||||||
|
|
||||||
|
context_parts.append(f"Source {i} ({source}):\nTitle: {title}\nContent: {content}")
|
||||||
|
|
||||||
|
return "\n\n".join(context_parts)
|
||||||
|
|
||||||
|
def _build_neutral_summary_prompt(self, combined_context: str, article_count: int) -> str:
|
||||||
|
"""Build prompt for neutral summary generation"""
|
||||||
|
prompt = f"""You are a neutral news aggregator. You have {article_count} articles from different sources about the same story. Your task is to create a single, balanced summary that:
|
||||||
|
|
||||||
|
1. Combines information from all sources
|
||||||
|
2. Remains neutral and objective
|
||||||
|
3. Highlights key facts that all sources agree on
|
||||||
|
4. Notes any significant differences in perspective (if any)
|
||||||
|
5. Is written in clear, professional English
|
||||||
|
6. Is approximately {self.max_words} words
|
||||||
|
|
||||||
|
Here are the articles:
|
||||||
|
|
||||||
|
{combined_context}
|
||||||
|
|
||||||
|
Write a neutral summary in English that synthesizes these perspectives:"""
|
||||||
|
|
||||||
|
return prompt
|
||||||
|
|
||||||
|
|
||||||
|
def create_cluster_summaries(db, ollama_client: OllamaClient, cluster_ids: Optional[List[str]] = None):
|
||||||
|
"""
|
||||||
|
Create or update neutral summaries for article clusters
|
||||||
|
|
||||||
|
Args:
|
||||||
|
db: MongoDB database instance
|
||||||
|
ollama_client: OllamaClient instance
|
||||||
|
cluster_ids: Optional list of specific cluster IDs to process. If None, processes all clusters.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
{
|
||||||
|
'processed': int,
|
||||||
|
'succeeded': int,
|
||||||
|
'failed': int,
|
||||||
|
'errors': list
|
||||||
|
}
|
||||||
|
"""
|
||||||
|
summarizer = ClusterSummarizer(ollama_client, max_words=200)
|
||||||
|
|
||||||
|
# Find clusters to process
|
||||||
|
if cluster_ids:
|
||||||
|
clusters_to_process = cluster_ids
|
||||||
|
else:
|
||||||
|
# Get all cluster IDs with multiple articles
|
||||||
|
pipeline = [
|
||||||
|
{"$match": {"cluster_id": {"$exists": True}}},
|
||||||
|
{"$group": {"_id": "$cluster_id", "count": {"$sum": 1}}},
|
||||||
|
{"$match": {"count": {"$gt": 1}}},
|
||||||
|
{"$project": {"_id": 1}}
|
||||||
|
]
|
||||||
|
clusters_to_process = [c['_id'] for c in db.articles.aggregate(pipeline)]
|
||||||
|
|
||||||
|
processed = 0
|
||||||
|
succeeded = 0
|
||||||
|
failed = 0
|
||||||
|
errors = []
|
||||||
|
|
||||||
|
for cluster_id in clusters_to_process:
|
||||||
|
try:
|
||||||
|
# Get all articles in this cluster
|
||||||
|
articles = list(db.articles.find({"cluster_id": cluster_id}))
|
||||||
|
|
||||||
|
if len(articles) < 2:
|
||||||
|
continue
|
||||||
|
|
||||||
|
print(f"Processing cluster {cluster_id}: {len(articles)} articles")
|
||||||
|
|
||||||
|
# Generate neutral summary
|
||||||
|
result = summarizer.generate_neutral_summary(articles)
|
||||||
|
|
||||||
|
processed += 1
|
||||||
|
|
||||||
|
if result['success']:
|
||||||
|
# Save cluster summary
|
||||||
|
db.cluster_summaries.update_one(
|
||||||
|
{"cluster_id": cluster_id},
|
||||||
|
{
|
||||||
|
"$set": {
|
||||||
|
"cluster_id": cluster_id,
|
||||||
|
"neutral_summary": result['neutral_summary'],
|
||||||
|
"sources": result['sources'],
|
||||||
|
"article_count": result['article_count'],
|
||||||
|
"created_at": datetime.utcnow(),
|
||||||
|
"updated_at": datetime.utcnow()
|
||||||
|
}
|
||||||
|
},
|
||||||
|
upsert=True
|
||||||
|
)
|
||||||
|
succeeded += 1
|
||||||
|
print(f" ✓ Generated neutral summary ({len(result['neutral_summary'])} chars)")
|
||||||
|
else:
|
||||||
|
failed += 1
|
||||||
|
error_msg = f"Cluster {cluster_id}: {result['error']}"
|
||||||
|
errors.append(error_msg)
|
||||||
|
print(f" ✗ Failed: {result['error']}")
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
failed += 1
|
||||||
|
error_msg = f"Cluster {cluster_id}: {str(e)}"
|
||||||
|
errors.append(error_msg)
|
||||||
|
print(f" ✗ Error: {e}")
|
||||||
|
|
||||||
|
return {
|
||||||
|
'processed': processed,
|
||||||
|
'succeeded': succeeded,
|
||||||
|
'failed': failed,
|
||||||
|
'errors': errors
|
||||||
|
}
|
||||||
@@ -13,6 +13,8 @@ from dotenv import load_dotenv
|
|||||||
from rss_utils import extract_article_url, extract_article_summary, extract_published_date
|
from rss_utils import extract_article_url, extract_article_summary, extract_published_date
|
||||||
from config import Config
|
from config import Config
|
||||||
from ollama_client import OllamaClient
|
from ollama_client import OllamaClient
|
||||||
|
from article_clustering import ArticleClusterer
|
||||||
|
from cluster_summarizer import create_cluster_summaries
|
||||||
|
|
||||||
# Load environment variables
|
# Load environment variables
|
||||||
load_dotenv(dotenv_path='../.env')
|
load_dotenv(dotenv_path='../.env')
|
||||||
@@ -33,6 +35,9 @@ ollama_client = OllamaClient(
|
|||||||
timeout=Config.OLLAMA_TIMEOUT
|
timeout=Config.OLLAMA_TIMEOUT
|
||||||
)
|
)
|
||||||
|
|
||||||
|
# Initialize Article Clusterer (will be initialized after ollama_client)
|
||||||
|
article_clusterer = None
|
||||||
|
|
||||||
# Print configuration on startup
|
# Print configuration on startup
|
||||||
if __name__ != '__main__':
|
if __name__ != '__main__':
|
||||||
Config.print_config()
|
Config.print_config()
|
||||||
@@ -45,6 +50,14 @@ if __name__ != '__main__':
|
|||||||
else:
|
else:
|
||||||
print("ℹ Ollama AI summarization: DISABLED")
|
print("ℹ Ollama AI summarization: DISABLED")
|
||||||
|
|
||||||
|
# Initialize Article Clusterer with ollama_client
|
||||||
|
article_clusterer = ArticleClusterer(
|
||||||
|
ollama_client=ollama_client,
|
||||||
|
similarity_threshold=0.60, # Not used when AI is enabled
|
||||||
|
time_window_hours=24 # Look back 24 hours
|
||||||
|
)
|
||||||
|
print("🔗 Article clustering: ENABLED (AI-powered)")
|
||||||
|
|
||||||
|
|
||||||
def get_active_rss_feeds():
|
def get_active_rss_feeds():
|
||||||
"""Get all active RSS feeds from database"""
|
"""Get all active RSS feeds from database"""
|
||||||
@@ -394,6 +407,13 @@ def crawl_rss_feed(feed_url, feed_name, feed_category='general', max_articles=10
|
|||||||
'created_at': datetime.utcnow()
|
'created_at': datetime.utcnow()
|
||||||
}
|
}
|
||||||
|
|
||||||
|
# Cluster article with existing articles (detect duplicates from other sources)
|
||||||
|
from datetime import timedelta
|
||||||
|
recent_articles = list(articles_collection.find({
|
||||||
|
'published_at': {'$gte': datetime.utcnow() - timedelta(hours=24)}
|
||||||
|
}))
|
||||||
|
article_doc = article_clusterer.cluster_article(article_doc, recent_articles)
|
||||||
|
|
||||||
try:
|
try:
|
||||||
# Upsert: update if exists, insert if not
|
# Upsert: update if exists, insert if not
|
||||||
articles_collection.update_one(
|
articles_collection.update_one(
|
||||||
@@ -434,6 +454,16 @@ def crawl_all_feeds(max_articles_per_feed=10):
|
|||||||
Crawl all active RSS feeds
|
Crawl all active RSS feeds
|
||||||
Returns: dict with statistics
|
Returns: dict with statistics
|
||||||
"""
|
"""
|
||||||
|
global article_clusterer
|
||||||
|
|
||||||
|
# Initialize clusterer if not already done
|
||||||
|
if article_clusterer is None:
|
||||||
|
article_clusterer = ArticleClusterer(
|
||||||
|
ollama_client=ollama_client,
|
||||||
|
similarity_threshold=0.60,
|
||||||
|
time_window_hours=24
|
||||||
|
)
|
||||||
|
|
||||||
print("\n" + "="*60)
|
print("\n" + "="*60)
|
||||||
print("🚀 Starting RSS Feed Crawler")
|
print("🚀 Starting RSS Feed Crawler")
|
||||||
print("="*60)
|
print("="*60)
|
||||||
@@ -485,12 +515,29 @@ def crawl_all_feeds(max_articles_per_feed=10):
|
|||||||
print(f" Average time per article: {duration/total_crawled:.1f}s")
|
print(f" Average time per article: {duration/total_crawled:.1f}s")
|
||||||
print("="*60 + "\n")
|
print("="*60 + "\n")
|
||||||
|
|
||||||
|
# Generate neutral summaries for clustered articles
|
||||||
|
cluster_summary_stats = {'processed': 0, 'succeeded': 0, 'failed': 0}
|
||||||
|
if Config.OLLAMA_ENABLED and total_crawled > 0:
|
||||||
|
print("\n" + "="*60)
|
||||||
|
print("🔄 Generating Neutral Summaries for Clustered Articles")
|
||||||
|
print("="*60)
|
||||||
|
|
||||||
|
cluster_summary_stats = create_cluster_summaries(db, ollama_client)
|
||||||
|
|
||||||
|
print("\n" + "="*60)
|
||||||
|
print(f"✓ Cluster Summarization Complete!")
|
||||||
|
print(f" Clusters processed: {cluster_summary_stats['processed']}")
|
||||||
|
print(f" Succeeded: {cluster_summary_stats['succeeded']}")
|
||||||
|
print(f" Failed: {cluster_summary_stats['failed']}")
|
||||||
|
print("="*60 + "\n")
|
||||||
|
|
||||||
return {
|
return {
|
||||||
'total_feeds': len(feeds),
|
'total_feeds': len(feeds),
|
||||||
'total_articles_crawled': total_crawled,
|
'total_articles_crawled': total_crawled,
|
||||||
'total_summarized': total_summarized,
|
'total_summarized': total_summarized,
|
||||||
'failed_summaries': total_failed,
|
'failed_summaries': total_failed,
|
||||||
'duration_seconds': round(duration, 2)
|
'duration_seconds': round(duration, 2),
|
||||||
|
'cluster_summaries': cluster_summary_stats
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
@@ -392,6 +392,80 @@ English Summary (max {max_words} words):"""
|
|||||||
'error': str(e)
|
'error': str(e)
|
||||||
}
|
}
|
||||||
|
|
||||||
|
def generate(self, prompt, max_tokens=100):
|
||||||
|
"""
|
||||||
|
Generate text using Ollama
|
||||||
|
|
||||||
|
Args:
|
||||||
|
prompt: Text prompt
|
||||||
|
max_tokens: Maximum tokens to generate
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
{
|
||||||
|
'text': str, # Generated text
|
||||||
|
'success': bool, # Whether generation succeeded
|
||||||
|
'error': str or None, # Error message if failed
|
||||||
|
'duration': float # Time taken in seconds
|
||||||
|
}
|
||||||
|
"""
|
||||||
|
if not self.enabled:
|
||||||
|
return {
|
||||||
|
'text': '',
|
||||||
|
'success': False,
|
||||||
|
'error': 'Ollama is disabled',
|
||||||
|
'duration': 0
|
||||||
|
}
|
||||||
|
|
||||||
|
start_time = time.time()
|
||||||
|
|
||||||
|
try:
|
||||||
|
response = requests.post(
|
||||||
|
f"{self.base_url}/api/generate",
|
||||||
|
json={
|
||||||
|
"model": self.model,
|
||||||
|
"prompt": prompt,
|
||||||
|
"stream": False,
|
||||||
|
"options": {
|
||||||
|
"num_predict": max_tokens,
|
||||||
|
"temperature": 0.1 # Low temperature for consistent answers
|
||||||
|
}
|
||||||
|
},
|
||||||
|
timeout=self.timeout
|
||||||
|
)
|
||||||
|
|
||||||
|
duration = time.time() - start_time
|
||||||
|
|
||||||
|
if response.status_code == 200:
|
||||||
|
result = response.json()
|
||||||
|
return {
|
||||||
|
'text': result.get('response', '').strip(),
|
||||||
|
'success': True,
|
||||||
|
'error': None,
|
||||||
|
'duration': duration
|
||||||
|
}
|
||||||
|
else:
|
||||||
|
return {
|
||||||
|
'text': '',
|
||||||
|
'success': False,
|
||||||
|
'error': f"HTTP {response.status_code}: {response.text}",
|
||||||
|
'duration': duration
|
||||||
|
}
|
||||||
|
|
||||||
|
except requests.exceptions.Timeout:
|
||||||
|
return {
|
||||||
|
'text': '',
|
||||||
|
'success': False,
|
||||||
|
'error': f"Request timed out after {self.timeout}s",
|
||||||
|
'duration': time.time() - start_time
|
||||||
|
}
|
||||||
|
except Exception as e:
|
||||||
|
return {
|
||||||
|
'text': '',
|
||||||
|
'success': False,
|
||||||
|
'error': str(e),
|
||||||
|
'duration': time.time() - start_time
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
if __name__ == '__main__':
|
if __name__ == '__main__':
|
||||||
# Quick test
|
# Quick test
|
||||||
|
|||||||
110
tests/crawler/README.md
Normal file
110
tests/crawler/README.md
Normal file
@@ -0,0 +1,110 @@
|
|||||||
|
# Crawler Tests
|
||||||
|
|
||||||
|
Test suite for the news crawler, AI clustering, and neutral summary generation.
|
||||||
|
|
||||||
|
## Test Files
|
||||||
|
|
||||||
|
### AI Clustering & Aggregation Tests
|
||||||
|
|
||||||
|
- **`test_clustering_real.py`** - Tests AI-powered article clustering with realistic fake articles
|
||||||
|
- **`test_neutral_summaries.py`** - Tests neutral summary generation from clustered articles
|
||||||
|
- **`test_complete_workflow.py`** - End-to-end test of clustering + neutral summaries
|
||||||
|
|
||||||
|
### Core Crawler Tests
|
||||||
|
|
||||||
|
- **`test_crawler.py`** - Basic crawler functionality
|
||||||
|
- **`test_ollama.py`** - Ollama AI integration tests
|
||||||
|
- **`test_rss_feeds.py`** - RSS feed parsing tests
|
||||||
|
|
||||||
|
## Running Tests
|
||||||
|
|
||||||
|
### Run All Tests
|
||||||
|
```bash
|
||||||
|
# From project root
|
||||||
|
docker-compose exec crawler python -m pytest tests/crawler/
|
||||||
|
```
|
||||||
|
|
||||||
|
### Run Specific Test
|
||||||
|
```bash
|
||||||
|
# AI clustering test
|
||||||
|
docker-compose exec crawler python tests/crawler/test_clustering_real.py
|
||||||
|
|
||||||
|
# Neutral summaries test
|
||||||
|
docker-compose exec crawler python tests/crawler/test_neutral_summaries.py
|
||||||
|
|
||||||
|
# Complete workflow test
|
||||||
|
docker-compose exec crawler python tests/crawler/test_complete_workflow.py
|
||||||
|
```
|
||||||
|
|
||||||
|
### Run Tests Inside Container
|
||||||
|
```bash
|
||||||
|
# Enter container
|
||||||
|
docker-compose exec crawler bash
|
||||||
|
|
||||||
|
# Run tests
|
||||||
|
python test_clustering_real.py
|
||||||
|
python test_neutral_summaries.py
|
||||||
|
python test_complete_workflow.py
|
||||||
|
```
|
||||||
|
|
||||||
|
## Test Data
|
||||||
|
|
||||||
|
Tests use fake articles to avoid depending on external RSS feeds:
|
||||||
|
|
||||||
|
**Test Scenarios:**
|
||||||
|
1. **Same story, different sources** - Should cluster together
|
||||||
|
2. **Different stories** - Should remain separate
|
||||||
|
3. **Multi-source clustering** - Should generate neutral summaries
|
||||||
|
|
||||||
|
**Expected Results:**
|
||||||
|
- Housing story (2 sources) → Cluster together → Neutral summary
|
||||||
|
- Bayern transfer (2 sources) → Cluster together → Neutral summary
|
||||||
|
- Single-source stories → Individual summaries
|
||||||
|
|
||||||
|
## Cleanup
|
||||||
|
|
||||||
|
Tests create temporary data in MongoDB. To clean up:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Clean test articles
|
||||||
|
docker-compose exec crawler python << 'EOF'
|
||||||
|
from pymongo import MongoClient
|
||||||
|
client = MongoClient("mongodb://admin:changeme@mongodb:27017/")
|
||||||
|
db = client["munich_news"]
|
||||||
|
db.articles.delete_many({"link": {"$regex": "^https://example.com/"}})
|
||||||
|
db.cluster_summaries.delete_many({})
|
||||||
|
print("✓ Test data cleaned")
|
||||||
|
EOF
|
||||||
|
```
|
||||||
|
|
||||||
|
## Requirements
|
||||||
|
|
||||||
|
- Docker containers must be running
|
||||||
|
- Ollama service must be available
|
||||||
|
- MongoDB must be accessible
|
||||||
|
- AI model (phi3:latest) must be downloaded
|
||||||
|
|
||||||
|
## Troubleshooting
|
||||||
|
|
||||||
|
### Ollama Not Available
|
||||||
|
```bash
|
||||||
|
# Check Ollama status
|
||||||
|
docker-compose logs ollama
|
||||||
|
|
||||||
|
# Restart Ollama
|
||||||
|
docker-compose restart ollama
|
||||||
|
```
|
||||||
|
|
||||||
|
### Tests Timing Out
|
||||||
|
- Increase timeout in test files (default: 60s)
|
||||||
|
- Check Ollama model is downloaded
|
||||||
|
- Verify GPU acceleration if enabled
|
||||||
|
|
||||||
|
### MongoDB Connection Issues
|
||||||
|
```bash
|
||||||
|
# Check MongoDB status
|
||||||
|
docker-compose logs mongodb
|
||||||
|
|
||||||
|
# Restart MongoDB
|
||||||
|
docker-compose restart mongodb
|
||||||
|
```
|
||||||
166
tests/crawler/test_clustering_real.py
Normal file
166
tests/crawler/test_clustering_real.py
Normal file
@@ -0,0 +1,166 @@
|
|||||||
|
#!/usr/bin/env python3
|
||||||
|
"""
|
||||||
|
Test AI clustering with realistic fake articles
|
||||||
|
"""
|
||||||
|
from pymongo import MongoClient
|
||||||
|
from datetime import datetime
|
||||||
|
import sys
|
||||||
|
|
||||||
|
# Connect to MongoDB
|
||||||
|
client = MongoClient("mongodb://admin:changeme@mongodb:27017/")
|
||||||
|
db = client["munich_news"]
|
||||||
|
|
||||||
|
# Create test articles about the same Munich story from different sources
|
||||||
|
test_articles = [
|
||||||
|
{
|
||||||
|
"title": "München: Stadtrat beschließt neue Regelungen für Wohnungsbau",
|
||||||
|
"content": """Der Münchner Stadtrat hat am Dienstag neue Regelungen für den Wohnungsbau beschlossen.
|
||||||
|
Die Maßnahmen sollen den Bau von bezahlbarem Wohnraum in der bayerischen Landeshauptstadt fördern.
|
||||||
|
Oberbürgermeister Dieter Reiter (SPD) sprach von einem wichtigen Schritt zur Lösung der Wohnungskrise.
|
||||||
|
Die neuen Regelungen sehen vor, dass bei Neubauprojekten mindestens 40 Prozent der Wohnungen
|
||||||
|
als Sozialwohnungen gebaut werden müssen. Zudem werden Bauvorschriften vereinfacht.""",
|
||||||
|
"source": "abendzeitung-muenchen",
|
||||||
|
"link": "https://example.com/az-wohnungsbau-1",
|
||||||
|
"published_at": datetime.utcnow(),
|
||||||
|
"category": "local",
|
||||||
|
"word_count": 85
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "Stadtrat München stimmt für neue Wohnungsbau-Verordnung",
|
||||||
|
"content": """In einer Sitzung am Dienstag stimmte der Münchner Stadtrat für neue Wohnungsbau-Verordnungen.
|
||||||
|
Die Beschlüsse zielen darauf ab, mehr bezahlbaren Wohnraum in München zu schaffen.
|
||||||
|
OB Reiter bezeichnete die Entscheidung als Meilenstein im Kampf gegen die Wohnungsnot.
|
||||||
|
Künftig müssen 40 Prozent aller Neubauwohnungen als Sozialwohnungen errichtet werden.
|
||||||
|
Außerdem werden bürokratische Hürden beim Bauen abgebaut.""",
|
||||||
|
"source": "sueddeutsche",
|
||||||
|
"link": "https://example.com/sz-wohnungsbau-1",
|
||||||
|
"published_at": datetime.utcnow(),
|
||||||
|
"category": "local",
|
||||||
|
"word_count": 72
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "FC Bayern München verpflichtet neuen Stürmer aus Brasilien",
|
||||||
|
"content": """Der FC Bayern München hat einen neuen Stürmer verpflichtet. Der 23-jährige Brasilianer
|
||||||
|
wechselt für eine Ablösesumme von 50 Millionen Euro nach München. Sportdirektor Christoph Freund
|
||||||
|
zeigte sich begeistert von der Verpflichtung. Der Spieler soll die Offensive verstärken.""",
|
||||||
|
"source": "abendzeitung-muenchen",
|
||||||
|
"link": "https://example.com/az-bayern-1",
|
||||||
|
"published_at": datetime.utcnow(),
|
||||||
|
"category": "sports",
|
||||||
|
"word_count": 52
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "Bayern München holt brasilianischen Angreifer",
|
||||||
|
"content": """Der deutsche Rekordmeister Bayern München hat einen brasilianischen Stürmer unter Vertrag genommen.
|
||||||
|
Für 50 Millionen Euro wechselt der 23-Jährige an die Isar. Sportdirektor Freund lobte den Transfer.
|
||||||
|
Der Neuzugang soll die Münchner Offensive beleben und für mehr Torgefahr sorgen.""",
|
||||||
|
"source": "sueddeutsche",
|
||||||
|
"link": "https://example.com/sz-bayern-1",
|
||||||
|
"published_at": datetime.utcnow(),
|
||||||
|
"category": "sports",
|
||||||
|
"word_count": 48
|
||||||
|
}
|
||||||
|
]
|
||||||
|
|
||||||
|
print("Testing AI Clustering with Realistic Articles")
|
||||||
|
print("=" * 70)
|
||||||
|
print()
|
||||||
|
|
||||||
|
# Clear previous test articles
|
||||||
|
print("Cleaning up previous test articles...")
|
||||||
|
db.articles.delete_many({"link": {"$regex": "^https://example.com/"}})
|
||||||
|
print("✓ Cleaned up")
|
||||||
|
print()
|
||||||
|
|
||||||
|
# Import clustering
|
||||||
|
sys.path.insert(0, '/app')
|
||||||
|
from ollama_client import OllamaClient
|
||||||
|
from article_clustering import ArticleClusterer
|
||||||
|
from config import Config
|
||||||
|
|
||||||
|
# Initialize
|
||||||
|
ollama_client = OllamaClient(
|
||||||
|
base_url=Config.OLLAMA_BASE_URL,
|
||||||
|
model=Config.OLLAMA_MODEL,
|
||||||
|
enabled=Config.OLLAMA_ENABLED,
|
||||||
|
timeout=30
|
||||||
|
)
|
||||||
|
|
||||||
|
clusterer = ArticleClusterer(
|
||||||
|
ollama_client=ollama_client,
|
||||||
|
similarity_threshold=0.50,
|
||||||
|
time_window_hours=24
|
||||||
|
)
|
||||||
|
|
||||||
|
print("Processing articles with AI clustering...")
|
||||||
|
print()
|
||||||
|
|
||||||
|
clustered_articles = []
|
||||||
|
for i, article in enumerate(test_articles, 1):
|
||||||
|
print(f"{i}. Processing: {article['title'][:60]}...")
|
||||||
|
print(f" Source: {article['source']}")
|
||||||
|
|
||||||
|
# Cluster with previously processed articles
|
||||||
|
clustered = clusterer.cluster_article(article, clustered_articles)
|
||||||
|
clustered_articles.append(clustered)
|
||||||
|
|
||||||
|
print(f" → Cluster ID: {clustered['cluster_id']}")
|
||||||
|
print(f" → Is Primary: {clustered['is_primary']}")
|
||||||
|
|
||||||
|
# Insert into database
|
||||||
|
db.articles.insert_one(clustered)
|
||||||
|
print(f" ✓ Saved to database")
|
||||||
|
print()
|
||||||
|
|
||||||
|
print("=" * 70)
|
||||||
|
print("Clustering Results:")
|
||||||
|
print()
|
||||||
|
|
||||||
|
# Analyze results
|
||||||
|
clusters = {}
|
||||||
|
for article in clustered_articles:
|
||||||
|
cluster_id = article['cluster_id']
|
||||||
|
if cluster_id not in clusters:
|
||||||
|
clusters[cluster_id] = []
|
||||||
|
clusters[cluster_id].append(article)
|
||||||
|
|
||||||
|
for cluster_id, articles in clusters.items():
|
||||||
|
print(f"Cluster {cluster_id}: {len(articles)} article(s)")
|
||||||
|
for article in articles:
|
||||||
|
print(f" - [{article['source']}] {article['title'][:60]}...")
|
||||||
|
print()
|
||||||
|
|
||||||
|
# Expected results
|
||||||
|
print("=" * 70)
|
||||||
|
print("Expected Results:")
|
||||||
|
print(" ✓ Articles 1&2 should be in same cluster (housing story)")
|
||||||
|
print(" ✓ Articles 3&4 should be in same cluster (Bayern transfer)")
|
||||||
|
print(" ✓ Total: 2 clusters with 2 articles each")
|
||||||
|
print()
|
||||||
|
|
||||||
|
# Actual results
|
||||||
|
housing_cluster = [a for a in clustered_articles if 'Wohnungsbau' in a['title'] or 'Wohnungsbau' in a['title']]
|
||||||
|
bayern_cluster = [a for a in clustered_articles if 'Bayern' in a['title'] or 'Stürmer' in a['title']]
|
||||||
|
|
||||||
|
housing_cluster_ids = set(a['cluster_id'] for a in housing_cluster)
|
||||||
|
bayern_cluster_ids = set(a['cluster_id'] for a in bayern_cluster)
|
||||||
|
|
||||||
|
print("Actual Results:")
|
||||||
|
if len(housing_cluster_ids) == 1:
|
||||||
|
print(" ✓ Housing articles clustered together")
|
||||||
|
else:
|
||||||
|
print(f" ✗ Housing articles in {len(housing_cluster_ids)} different clusters")
|
||||||
|
|
||||||
|
if len(bayern_cluster_ids) == 1:
|
||||||
|
print(" ✓ Bayern articles clustered together")
|
||||||
|
else:
|
||||||
|
print(f" ✗ Bayern articles in {len(bayern_cluster_ids)} different clusters")
|
||||||
|
|
||||||
|
if len(clusters) == 2:
|
||||||
|
print(" ✓ Total clusters: 2 (correct)")
|
||||||
|
else:
|
||||||
|
print(f" ✗ Total clusters: {len(clusters)} (expected 2)")
|
||||||
|
|
||||||
|
print()
|
||||||
|
print("=" * 70)
|
||||||
|
print("✓ Test complete! Check the results above.")
|
||||||
187
tests/crawler/test_complete_workflow.py
Normal file
187
tests/crawler/test_complete_workflow.py
Normal file
@@ -0,0 +1,187 @@
|
|||||||
|
#!/usr/bin/env python3
|
||||||
|
"""
|
||||||
|
Complete workflow test: Clustering + Neutral Summaries
|
||||||
|
"""
|
||||||
|
from pymongo import MongoClient
|
||||||
|
from datetime import datetime
|
||||||
|
import sys
|
||||||
|
|
||||||
|
client = MongoClient("mongodb://admin:changeme@mongodb:27017/")
|
||||||
|
db = client["munich_news"]
|
||||||
|
|
||||||
|
print("=" * 70)
|
||||||
|
print("COMPLETE WORKFLOW TEST: AI Clustering + Neutral Summaries")
|
||||||
|
print("=" * 70)
|
||||||
|
print()
|
||||||
|
|
||||||
|
# Clean up previous test
|
||||||
|
print("1. Cleaning up previous test data...")
|
||||||
|
db.articles.delete_many({"link": {"$regex": "^https://example.com/"}})
|
||||||
|
db.cluster_summaries.delete_many({"cluster_id": {"$regex": "^test_"}})
|
||||||
|
print(" ✓ Cleaned up")
|
||||||
|
print()
|
||||||
|
|
||||||
|
# Import modules
|
||||||
|
sys.path.insert(0, '/app')
|
||||||
|
from ollama_client import OllamaClient
|
||||||
|
from article_clustering import ArticleClusterer
|
||||||
|
from cluster_summarizer import ClusterSummarizer
|
||||||
|
from config import Config
|
||||||
|
|
||||||
|
# Initialize
|
||||||
|
ollama_client = OllamaClient(
|
||||||
|
base_url=Config.OLLAMA_BASE_URL,
|
||||||
|
model=Config.OLLAMA_MODEL,
|
||||||
|
enabled=Config.OLLAMA_ENABLED,
|
||||||
|
timeout=60
|
||||||
|
)
|
||||||
|
|
||||||
|
clusterer = ArticleClusterer(ollama_client, similarity_threshold=0.50, time_window_hours=24)
|
||||||
|
summarizer = ClusterSummarizer(ollama_client, max_words=200)
|
||||||
|
|
||||||
|
# Test articles - 2 stories, 2 sources each
|
||||||
|
test_articles = [
|
||||||
|
# Story 1: Munich Housing (2 sources)
|
||||||
|
{
|
||||||
|
"title": "München: Stadtrat beschließt neue Wohnungsbau-Regelungen",
|
||||||
|
"content": "Der Münchner Stadtrat hat neue Regelungen für bezahlbaren Wohnungsbau beschlossen. 40% Sozialwohnungen werden Pflicht.",
|
||||||
|
"source": "abendzeitung-muenchen",
|
||||||
|
"link": "https://example.com/test-housing-az",
|
||||||
|
"published_at": datetime.utcnow(),
|
||||||
|
"category": "local"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "Stadtrat München: Neue Verordnung für Wohnungsbau",
|
||||||
|
"content": "München führt neue Wohnungsbau-Verordnung ein. Mindestens 40% der Neubauten müssen Sozialwohnungen sein.",
|
||||||
|
"source": "sueddeutsche",
|
||||||
|
"link": "https://example.com/test-housing-sz",
|
||||||
|
"published_at": datetime.utcnow(),
|
||||||
|
"category": "local"
|
||||||
|
},
|
||||||
|
# Story 2: Bayern Transfer (2 sources)
|
||||||
|
{
|
||||||
|
"title": "FC Bayern verpflichtet brasilianischen Stürmer für 50 Millionen",
|
||||||
|
"content": "Bayern München holt einen 23-jährigen Brasilianer. Sportdirektor Freund ist begeistert.",
|
||||||
|
"source": "abendzeitung-muenchen",
|
||||||
|
"link": "https://example.com/test-bayern-az",
|
||||||
|
"published_at": datetime.utcnow(),
|
||||||
|
"category": "sports"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "Bayern München: Neuzugang aus Brasilien für 50 Mio. Euro",
|
||||||
|
"content": "Der Rekordmeister verstärkt die Offensive mit einem brasilianischen Angreifer. Freund lobt den Transfer.",
|
||||||
|
"source": "sueddeutsche",
|
||||||
|
"link": "https://example.com/test-bayern-sz",
|
||||||
|
"published_at": datetime.utcnow(),
|
||||||
|
"category": "sports"
|
||||||
|
}
|
||||||
|
]
|
||||||
|
|
||||||
|
print("2. Processing articles with AI clustering...")
|
||||||
|
print()
|
||||||
|
|
||||||
|
clustered_articles = []
|
||||||
|
for i, article in enumerate(test_articles, 1):
|
||||||
|
print(f" Article {i}: {article['title'][:50]}...")
|
||||||
|
print(f" Source: {article['source']}")
|
||||||
|
|
||||||
|
# Cluster
|
||||||
|
clustered = clusterer.cluster_article(article, clustered_articles)
|
||||||
|
clustered_articles.append(clustered)
|
||||||
|
|
||||||
|
print(f" → Cluster: {clustered['cluster_id']}")
|
||||||
|
print(f" → Primary: {clustered['is_primary']}")
|
||||||
|
|
||||||
|
# Save to DB
|
||||||
|
db.articles.insert_one(clustered)
|
||||||
|
print(f" ✓ Saved")
|
||||||
|
print()
|
||||||
|
|
||||||
|
print("=" * 70)
|
||||||
|
print("3. Clustering Results:")
|
||||||
|
print()
|
||||||
|
|
||||||
|
# Analyze clusters
|
||||||
|
clusters = {}
|
||||||
|
for article in clustered_articles:
|
||||||
|
cid = article['cluster_id']
|
||||||
|
if cid not in clusters:
|
||||||
|
clusters[cid] = []
|
||||||
|
clusters[cid].append(article)
|
||||||
|
|
||||||
|
print(f" Total clusters: {len(clusters)}")
|
||||||
|
print()
|
||||||
|
|
||||||
|
for cid, articles in clusters.items():
|
||||||
|
print(f" Cluster {cid}:")
|
||||||
|
print(f" - Articles: {len(articles)}")
|
||||||
|
for article in articles:
|
||||||
|
print(f" • [{article['source']}] {article['title'][:45]}...")
|
||||||
|
print()
|
||||||
|
|
||||||
|
# Check expectations
|
||||||
|
if len(clusters) == 2:
|
||||||
|
print(" ✓ Expected 2 clusters (housing + bayern)")
|
||||||
|
else:
|
||||||
|
print(f" ⚠ Expected 2 clusters, got {len(clusters)}")
|
||||||
|
|
||||||
|
print()
|
||||||
|
print("=" * 70)
|
||||||
|
print("4. Generating neutral summaries...")
|
||||||
|
print()
|
||||||
|
|
||||||
|
summary_count = 0
|
||||||
|
for cid, articles in clusters.items():
|
||||||
|
if len(articles) < 2:
|
||||||
|
print(f" Skipping cluster {cid} (only 1 article)")
|
||||||
|
continue
|
||||||
|
|
||||||
|
print(f" Cluster {cid}: {len(articles)} articles")
|
||||||
|
|
||||||
|
result = summarizer.generate_neutral_summary(articles)
|
||||||
|
|
||||||
|
if result['success']:
|
||||||
|
print(f" ✓ Generated summary ({result['duration']:.1f}s)")
|
||||||
|
|
||||||
|
# Save
|
||||||
|
db.cluster_summaries.insert_one({
|
||||||
|
"cluster_id": cid,
|
||||||
|
"neutral_summary": result['neutral_summary'],
|
||||||
|
"sources": result['sources'],
|
||||||
|
"article_count": result['article_count'],
|
||||||
|
"created_at": datetime.utcnow()
|
||||||
|
})
|
||||||
|
summary_count += 1
|
||||||
|
|
||||||
|
# Show preview
|
||||||
|
preview = result['neutral_summary'][:100] + "..."
|
||||||
|
print(f" Preview: {preview}")
|
||||||
|
else:
|
||||||
|
print(f" ✗ Failed: {result['error']}")
|
||||||
|
|
||||||
|
print()
|
||||||
|
|
||||||
|
print("=" * 70)
|
||||||
|
print("5. Final Results:")
|
||||||
|
print()
|
||||||
|
|
||||||
|
test_article_count = db.articles.count_documents({"link": {"$regex": "^https://example.com/test-"}})
|
||||||
|
test_summary_count = db.cluster_summaries.count_documents({})
|
||||||
|
|
||||||
|
print(f" Articles saved: {test_article_count}")
|
||||||
|
print(f" Clusters created: {len(clusters)}")
|
||||||
|
print(f" Neutral summaries: {summary_count}")
|
||||||
|
print()
|
||||||
|
|
||||||
|
if len(clusters) == 2 and summary_count == 2:
|
||||||
|
print(" ✅ SUCCESS! Complete workflow working perfectly!")
|
||||||
|
print()
|
||||||
|
print(" The system now:")
|
||||||
|
print(" 1. ✓ Clusters articles from different sources")
|
||||||
|
print(" 2. ✓ Generates neutral summaries combining perspectives")
|
||||||
|
print(" 3. ✓ Stores everything in MongoDB")
|
||||||
|
else:
|
||||||
|
print(" ⚠ Partial success - check results above")
|
||||||
|
|
||||||
|
print()
|
||||||
|
print("=" * 70)
|
||||||
130
tests/crawler/test_neutral_summaries.py
Normal file
130
tests/crawler/test_neutral_summaries.py
Normal file
@@ -0,0 +1,130 @@
|
|||||||
|
#!/usr/bin/env python3
|
||||||
|
"""
|
||||||
|
Test neutral summary generation from clustered articles
|
||||||
|
"""
|
||||||
|
from pymongo import MongoClient
|
||||||
|
from datetime import datetime
|
||||||
|
import sys
|
||||||
|
|
||||||
|
# Connect to MongoDB
|
||||||
|
client = MongoClient("mongodb://admin:changeme@mongodb:27017/")
|
||||||
|
db = client["munich_news"]
|
||||||
|
|
||||||
|
print("Testing Neutral Summary Generation")
|
||||||
|
print("=" * 70)
|
||||||
|
print()
|
||||||
|
|
||||||
|
# Check for test articles
|
||||||
|
test_articles = list(db.articles.find(
|
||||||
|
{"link": {"$regex": "^https://example.com/"}}
|
||||||
|
).sort("_id", 1))
|
||||||
|
|
||||||
|
if len(test_articles) == 0:
|
||||||
|
print("⚠ No test articles found. Run test-clustering-real.py first.")
|
||||||
|
sys.exit(1)
|
||||||
|
|
||||||
|
print(f"Found {len(test_articles)} test articles")
|
||||||
|
print()
|
||||||
|
|
||||||
|
# Find clusters with multiple articles
|
||||||
|
clusters = {}
|
||||||
|
for article in test_articles:
|
||||||
|
cid = article['cluster_id']
|
||||||
|
if cid not in clusters:
|
||||||
|
clusters[cid] = []
|
||||||
|
clusters[cid].append(article)
|
||||||
|
|
||||||
|
multi_article_clusters = {k: v for k, v in clusters.items() if len(v) > 1}
|
||||||
|
|
||||||
|
if len(multi_article_clusters) == 0:
|
||||||
|
print("⚠ No clusters with multiple articles found")
|
||||||
|
sys.exit(1)
|
||||||
|
|
||||||
|
print(f"Found {len(multi_article_clusters)} cluster(s) with multiple articles")
|
||||||
|
print()
|
||||||
|
|
||||||
|
# Import cluster summarizer
|
||||||
|
sys.path.insert(0, '/app')
|
||||||
|
from ollama_client import OllamaClient
|
||||||
|
from cluster_summarizer import ClusterSummarizer
|
||||||
|
from config import Config
|
||||||
|
|
||||||
|
# Initialize
|
||||||
|
ollama_client = OllamaClient(
|
||||||
|
base_url=Config.OLLAMA_BASE_URL,
|
||||||
|
model=Config.OLLAMA_MODEL,
|
||||||
|
enabled=Config.OLLAMA_ENABLED,
|
||||||
|
timeout=60
|
||||||
|
)
|
||||||
|
|
||||||
|
summarizer = ClusterSummarizer(ollama_client, max_words=200)
|
||||||
|
|
||||||
|
print("Generating neutral summaries...")
|
||||||
|
print("=" * 70)
|
||||||
|
print()
|
||||||
|
|
||||||
|
for cluster_id, articles in multi_article_clusters.items():
|
||||||
|
print(f"Cluster: {cluster_id}")
|
||||||
|
print(f"Articles: {len(articles)}")
|
||||||
|
print()
|
||||||
|
|
||||||
|
# Show individual articles
|
||||||
|
for i, article in enumerate(articles, 1):
|
||||||
|
print(f" {i}. [{article['source']}] {article['title'][:60]}...")
|
||||||
|
print()
|
||||||
|
|
||||||
|
# Generate neutral summary
|
||||||
|
print(" Generating neutral summary...")
|
||||||
|
result = summarizer.generate_neutral_summary(articles)
|
||||||
|
|
||||||
|
if result['success']:
|
||||||
|
print(f" ✓ Success ({result['duration']:.1f}s)")
|
||||||
|
print()
|
||||||
|
print(" Neutral Summary:")
|
||||||
|
print(" " + "-" * 66)
|
||||||
|
# Wrap text at 66 chars
|
||||||
|
summary = result['neutral_summary']
|
||||||
|
words = summary.split()
|
||||||
|
lines = []
|
||||||
|
current_line = " "
|
||||||
|
for word in words:
|
||||||
|
if len(current_line) + len(word) + 1 <= 68:
|
||||||
|
current_line += word + " "
|
||||||
|
else:
|
||||||
|
lines.append(current_line.rstrip())
|
||||||
|
current_line = " " + word + " "
|
||||||
|
if current_line.strip():
|
||||||
|
lines.append(current_line.rstrip())
|
||||||
|
print("\n".join(lines))
|
||||||
|
print(" " + "-" * 66)
|
||||||
|
print()
|
||||||
|
|
||||||
|
# Save to database
|
||||||
|
db.cluster_summaries.update_one(
|
||||||
|
{"cluster_id": cluster_id},
|
||||||
|
{
|
||||||
|
"$set": {
|
||||||
|
"cluster_id": cluster_id,
|
||||||
|
"neutral_summary": result['neutral_summary'],
|
||||||
|
"sources": result['sources'],
|
||||||
|
"article_count": result['article_count'],
|
||||||
|
"created_at": datetime.utcnow(),
|
||||||
|
"updated_at": datetime.utcnow()
|
||||||
|
}
|
||||||
|
},
|
||||||
|
upsert=True
|
||||||
|
)
|
||||||
|
print(" ✓ Saved to cluster_summaries collection")
|
||||||
|
else:
|
||||||
|
print(f" ✗ Failed: {result['error']}")
|
||||||
|
|
||||||
|
print()
|
||||||
|
print("=" * 70)
|
||||||
|
print()
|
||||||
|
|
||||||
|
print("Testing complete!")
|
||||||
|
print()
|
||||||
|
|
||||||
|
# Show summary statistics
|
||||||
|
total_cluster_summaries = db.cluster_summaries.count_documents({})
|
||||||
|
print(f"Total cluster summaries in database: {total_cluster_summaries}")
|
||||||
Reference in New Issue
Block a user