This commit is contained in:
2025-11-12 11:34:33 +01:00
parent f35f8eef8a
commit 94c89589af
32 changed files with 3272 additions and 3805 deletions

View File

@@ -5,7 +5,8 @@ Get Munich News Daily running in 5 minutes!
## Prerequisites
- Docker & Docker Compose installed
- (Optional) Ollama for AI summarization
- 4GB+ RAM (for Ollama AI models)
- (Optional) NVIDIA GPU for 5-10x faster AI processing
## Setup
@@ -30,13 +31,21 @@ EMAIL_PASSWORD=your-app-password
### 2. Start System
```bash
# Start all services
# Option 1: Auto-detect GPU and start (recommended)
./start-with-gpu.sh
# Option 2: Start without GPU
docker-compose up -d
# View logs
docker-compose logs -f
# Wait for Ollama model download (first time only, ~2-5 minutes)
docker-compose logs -f ollama-setup
```
**Note:** First startup downloads the phi3:latest AI model (2.2GB). This happens automatically.
### 3. Add RSS Feeds
```bash
@@ -114,18 +123,45 @@ docker-compose logs -f
docker-compose up -d --build
```
## New Features
### GPU Acceleration (5-10x Faster)
Enable GPU support for faster AI processing:
```bash
./check-gpu.sh # Check if GPU is available
./start-with-gpu.sh # Start with GPU support
```
See [docs/GPU_SETUP.md](docs/GPU_SETUP.md) for details.
### Send Newsletter to All Subscribers
```bash
# Send newsletter to all active subscribers
curl -X POST http://localhost:5001/api/admin/send-newsletter \
-H "Content-Type: application/json" \
-d '{"max_articles": 10}'
```
### Security Features
- ✅ Only Backend API exposed (port 5001)
- ✅ MongoDB internal-only (secure)
- ✅ Ollama internal-only (secure)
- ✅ All services communicate via internal Docker network
## Need Help?
- Check [README.md](README.md) for full documentation
- See [docs/DEPLOYMENT.md](docs/DEPLOYMENT.md) for detailed setup
- View [docs/API.md](docs/API.md) for API reference
- **Documentation Index**: [docs/INDEX.md](docs/INDEX.md)
- **GPU Setup**: [docs/GPU_SETUP.md](docs/GPU_SETUP.md)
- **API Reference**: [docs/ADMIN_API.md](docs/ADMIN_API.md)
- **Security Guide**: [docs/SECURITY_NOTES.md](docs/SECURITY_NOTES.md)
- **Full Documentation**: [README.md](README.md)
## Next Steps
1. Configure Ollama for AI summaries (optional)
1. **Enable GPU acceleration** - [docs/GPU_SETUP.md](docs/GPU_SETUP.md)
2. Set up tracking API (optional)
3. Customize newsletter template
4. Add more RSS feeds
5. Monitor engagement metrics
6. Review security settings - [docs/SECURITY_NOTES.md](docs/SECURITY_NOTES.md)
That's it! Your automated news system is running. 🎉

View File

@@ -2,7 +2,16 @@
A fully automated news aggregation and newsletter system that crawls Munich news sources, generates AI summaries, and sends daily newsletters with engagement tracking.
**🚀 NEW:** GPU acceleration support for 5-10x faster AI processing! See [QUICK_START_GPU.md](QUICK_START_GPU.md)
## ✨ Key Features
- **🤖 AI-Powered Clustering** - Automatically detects duplicate stories from different sources
- **📰 Neutral Summaries** - Combines multiple perspectives into balanced coverage
- **🎯 Smart Prioritization** - Shows most important stories first (multi-source coverage)
- **📊 Engagement Tracking** - Open rates, click tracking, and analytics
- **⚡ GPU Acceleration** - 5-10x faster AI processing with GPU support
- **🔒 GDPR Compliant** - Privacy-first with data retention controls
**🚀 NEW:** GPU acceleration support for 5-10x faster AI processing! See [docs/GPU_SETUP.md](docs/GPU_SETUP.md)
## 🚀 Quick Start
@@ -25,6 +34,8 @@ That's it! The system will automatically:
📖 **New to the project?** See [QUICKSTART.md](QUICKSTART.md) for a detailed 5-minute setup guide.
🚀 **GPU Acceleration:** Enable 5-10x faster AI processing with [GPU Setup Guide](docs/GPU_SETUP.md)
## 📋 System Overview
```
@@ -49,11 +60,11 @@ That's it! The system will automatically:
### Components
- **Ollama**: AI service for summarization and translation (port 11434)
- **MongoDB**: Data storage (articles, subscribers, tracking)
- **Backend API**: Flask API for tracking and analytics (port 5001)
- **News Crawler**: Automated RSS feed crawler with AI summarization
- **Newsletter Sender**: Automated email sender with tracking
- **Ollama**: AI service for summarization and translation (internal only, GPU-accelerated)
- **MongoDB**: Data storage (articles, subscribers, tracking) (internal only)
- **Backend API**: Flask API for tracking and analytics (port 5001 - only exposed service)
- **News Crawler**: Automated RSS feed crawler with AI summarization (internal only)
- **Newsletter Sender**: Automated email sender with tracking (internal only)
- **Frontend**: React dashboard (optional)
### Technology Stack
@@ -341,11 +352,21 @@ curl -X POST http://localhost:5001/api/tracking/subscriber/user@example.com/opt-
### Getting Started
- **[QUICKSTART.md](QUICKSTART.md)** - 5-minute setup guide
- **[PROJECT_STRUCTURE.md](PROJECT_STRUCTURE.md)** - Project layout
- **[CONTRIBUTING.md](CONTRIBUTING.md)** - Contribution guidelines
### Core Features
- **[docs/AI_NEWS_AGGREGATION.md](docs/AI_NEWS_AGGREGATION.md)** - AI-powered clustering & neutral summaries
- **[docs/FEATURES.md](docs/FEATURES.md)** - Complete feature list
- **[docs/API.md](docs/API.md)** - API endpoints reference
### Technical Documentation
- **[docs/ARCHITECTURE.md](docs/ARCHITECTURE.md)** - System architecture
- **[docs/SETUP.md](docs/SETUP.md)** - Detailed setup guide
- **[docs/OLLAMA_SETUP.md](docs/OLLAMA_SETUP.md)** - AI/Ollama configuration
- **[docs/GPU_SETUP.md](docs/GPU_SETUP.md)** - GPU acceleration setup
- **[docs/DEPLOYMENT.md](docs/DEPLOYMENT.md)** - Production deployment
- **[docs/SECURITY.md](docs/SECURITY.md)** - Security best practices
- **[docs/REFERENCE.md](docs/REFERENCE.md)** - Complete reference
- **[docs/DEPLOYMENT.md](docs/DEPLOYMENT.md)** - Deployment guide
- **[docs/API.md](docs/API.md)** - API reference
- **[docs/DATABASE_SCHEMA.md](docs/DATABASE_SCHEMA.md)** - Database structure

View File

@@ -1,5 +1,5 @@
from flask import Blueprint, jsonify
from database import articles_collection
from flask import Blueprint, jsonify, request
from database import articles_collection, db
from services.news_service import fetch_munich_news, save_articles_to_db
news_bp = Blueprint('news', __name__)
@@ -9,6 +9,12 @@ news_bp = Blueprint('news', __name__)
def get_news():
"""Get latest Munich news"""
try:
# Check if clustered mode is requested
mode = request.args.get('mode', 'all')
if mode == 'clustered':
return get_clustered_news_internal()
# Fetch fresh news and save to database
articles = fetch_munich_news()
save_articles_to_db(articles)
@@ -63,6 +69,95 @@ def get_news():
return jsonify({'error': str(e)}), 500
def get_clustered_news_internal():
"""
Get news with neutral summaries for clustered articles
Returns only primary articles with their neutral summaries
Prioritizes stories covered by multiple sources (more popular/important)
"""
try:
limit = int(request.args.get('limit', 20))
# Use aggregation to get articles with their cluster size
# This allows us to prioritize multi-source stories
pipeline = [
{"$match": {"is_primary": True}},
{"$lookup": {
"from": "articles",
"localField": "cluster_id",
"foreignField": "cluster_id",
"as": "cluster_articles"
}},
{"$addFields": {
"article_count": {"$size": "$cluster_articles"},
"sources_list": {"$setUnion": ["$cluster_articles.source", []]}
}},
{"$addFields": {
"source_count": {"$size": "$sources_list"}
}},
# Sort by: 1) source count (desc), 2) published date (desc)
{"$sort": {"source_count": -1, "published_at": -1}},
{"$limit": limit}
]
cursor = articles_collection.aggregate(pipeline)
result = []
cluster_summaries_collection = db['cluster_summaries']
for doc in cursor:
cluster_id = doc.get('cluster_id')
# Get neutral summary if available
cluster_summary = cluster_summaries_collection.find_one({'cluster_id': cluster_id})
# Use cluster_articles from aggregation (already fetched)
cluster_articles = doc.get('cluster_articles', [])
article = {
'title': doc.get('title', ''),
'link': doc.get('link', ''),
'source': doc.get('source', ''),
'published': doc.get('published_at', ''),
'category': doc.get('category', 'general'),
'cluster_id': cluster_id,
'article_count': doc.get('article_count', 1),
'source_count': doc.get('source_count', 1),
'sources': list(doc.get('sources_list', [doc.get('source', '')]))
}
# Use neutral summary if available, otherwise use article's own summary
if cluster_summary and doc.get('article_count', 1) > 1:
article['summary'] = cluster_summary.get('neutral_summary', '')
article['summary_type'] = 'neutral'
article['is_clustered'] = True
else:
article['summary'] = doc.get('summary', '')
article['summary_type'] = 'individual'
article['is_clustered'] = False
# Add related articles info
if doc.get('article_count', 1) > 1:
article['related_articles'] = [
{
'source': a.get('source', ''),
'title': a.get('title', ''),
'link': a.get('link', '')
}
for a in cluster_articles if a.get('_id') != doc.get('_id')
]
result.append(article)
return jsonify({
'articles': result,
'mode': 'clustered',
'description': 'Shows one article per story with neutral summaries'
}), 200
except Exception as e:
return jsonify({'error': str(e)}), 500
@news_bp.route('/api/news/<path:article_url>', methods=['GET'])
def get_article_by_url(article_url):
"""Get full article content by URL"""
@@ -113,11 +208,20 @@ def get_stats():
# Count summarized articles
summarized_count = articles_collection.count_documents({'summary': {'$exists': True, '$ne': ''}})
# Count clustered articles
clustered_count = articles_collection.count_documents({'cluster_id': {'$exists': True}})
# Count cluster summaries
cluster_summaries_collection = db['cluster_summaries']
neutral_summaries_count = cluster_summaries_collection.count_documents({})
return jsonify({
'subscribers': subscriber_count,
'articles': article_count,
'crawled_articles': crawled_count,
'summarized_articles': summarized_count
'summarized_articles': summarized_count,
'clustered_articles': clustered_count,
'neutral_summaries': neutral_summaries_count
}), 200
except Exception as e:
return jsonify({'error': str(e)}), 500

View File

@@ -1,382 +0,0 @@
# Admin API Reference
Admin endpoints for testing and manual operations.
## Overview
The admin API allows you to trigger manual operations like crawling news and sending test emails directly through HTTP requests.
**How it works**: The backend container has access to the Docker socket, allowing it to execute commands in other containers via `docker exec`.
---
## API Endpoints
### Trigger Crawler
Manually trigger the news crawler to fetch new articles.
```http
POST /api/admin/trigger-crawl
```
**Request Body** (optional):
```json
{
"max_articles": 10
}
```
**Parameters**:
- `max_articles` (integer, optional): Number of articles to crawl per feed (1-100, default: 10)
**Response**:
```json
{
"success": true,
"message": "Crawler executed successfully",
"max_articles": 10,
"output": "... crawler output (last 1000 chars) ...",
"errors": ""
}
```
**Example**:
```bash
# Crawl 5 articles per feed
curl -X POST http://localhost:5001/api/admin/trigger-crawl \
-H "Content-Type: application/json" \
-d '{"max_articles": 5}'
# Use default (10 articles)
curl -X POST http://localhost:5001/api/admin/trigger-crawl
```
---
### Send Test Email
Send a test newsletter to a specific email address.
```http
POST /api/admin/send-test-email
```
**Request Body**:
```json
{
"email": "test@example.com",
"max_articles": 10
}
```
**Parameters**:
- `email` (string, required): Email address to send test newsletter to
- `max_articles` (integer, optional): Number of articles to include (1-50, default: 10)
**Response**:
```json
{
"success": true,
"message": "Test email sent to test@example.com",
"email": "test@example.com",
"output": "... sender output ...",
"errors": ""
}
```
**Example**:
```bash
# Send test email
curl -X POST http://localhost:5001/api/admin/send-test-email \
-H "Content-Type: application/json" \
-d '{"email": "your-email@example.com"}'
# Send with custom article count
curl -X POST http://localhost:5001/api/admin/send-test-email \
-H "Content-Type: application/json" \
-d '{"email": "your-email@example.com", "max_articles": 5}'
```
---
### Send Newsletter to All Subscribers
Send newsletter to all active subscribers in the database.
```http
POST /api/admin/send-newsletter
```
**Request Body** (optional):
```json
{
"max_articles": 10
}
```
**Parameters**:
- `max_articles` (integer, optional): Number of articles to include (1-50, default: 10)
**Response**:
```json
{
"success": true,
"message": "Newsletter sent successfully to 45 subscribers",
"subscriber_count": 45,
"max_articles": 10,
"output": "... sender output ...",
"errors": ""
}
```
**Example**:
```bash
# Send newsletter to all subscribers
curl -X POST http://localhost:5001/api/admin/send-newsletter \
-H "Content-Type: application/json"
# Send with custom article count
curl -X POST http://localhost:5001/api/admin/send-newsletter \
-H "Content-Type: application/json" \
-d '{"max_articles": 15}'
```
**Notes**:
- Only sends to subscribers with `status: 'active'`
- Returns error if no active subscribers found
- Includes tracking pixels and click tracking
- May take several minutes for large subscriber lists
---
### Get System Statistics
Get overview statistics of the system.
```http
GET /api/admin/stats
```
**Response**:
```json
{
"articles": {
"total": 150,
"with_summary": 120,
"today": 15
},
"subscribers": {
"total": 50,
"active": 45
},
"rss_feeds": {
"total": 4,
"active": 4
},
"tracking": {
"total_sends": 200,
"total_opens": 150,
"total_clicks": 75
}
}
```
**Example**:
```bash
curl http://localhost:5001/api/admin/stats
```
---
## Workflow Examples
### Test Complete System
```bash
# 1. Check current stats
curl http://localhost:5001/api/admin/stats
# 2. Trigger crawler to fetch new articles
curl -X POST http://localhost:5001/api/admin/trigger-crawl \
-H "Content-Type: application/json" \
-d '{"max_articles": 5}'
# 3. Wait a moment for crawler to finish, then check stats again
sleep 30
curl http://localhost:5001/api/admin/stats
# 4. Send test email to yourself
curl -X POST http://localhost:5001/api/admin/send-test-email \
-H "Content-Type: application/json" \
-d '{"email": "your-email@example.com"}'
```
### Send Newsletter to All Subscribers
```bash
# 1. Check subscriber count
curl http://localhost:5001/api/admin/stats | jq '.subscribers'
# 2. Crawl fresh articles
curl -X POST http://localhost:5001/api/admin/trigger-crawl \
-H "Content-Type: application/json" \
-d '{"max_articles": 10}'
# 3. Wait for crawl to complete
sleep 60
# 4. Send newsletter to all active subscribers
curl -X POST http://localhost:5001/api/admin/send-newsletter \
-H "Content-Type: application/json" \
-d '{"max_articles": 10}'
```
### Quick Test Newsletter
```bash
# Send test email with latest articles
curl -X POST http://localhost:5001/api/admin/send-test-email \
-H "Content-Type: application/json" \
-d '{"email": "your-email@example.com", "max_articles": 3}'
```
### Fetch Fresh Content
```bash
# Crawl more articles from each feed
curl -X POST http://localhost:5001/api/admin/trigger-crawl \
-H "Content-Type: application/json" \
-d '{"max_articles": 20}'
```
### Daily Newsletter Workflow
```bash
# Complete daily workflow (can be automated with cron)
# 1. Crawl today's articles
curl -X POST http://localhost:5001/api/admin/trigger-crawl \
-H "Content-Type: application/json" \
-d '{"max_articles": 15}'
# 2. Wait for crawl and AI processing
sleep 120
# 3. Send to all subscribers
curl -X POST http://localhost:5001/api/admin/send-newsletter \
-H "Content-Type: application/json" \
-d '{"max_articles": 10}'
# 4. Check results
curl http://localhost:5001/api/admin/stats
```
---
## Error Responses
All endpoints return standard error responses:
```json
{
"success": false,
"error": "Error message here"
}
```
**Common HTTP Status Codes**:
- `200` - Success
- `400` - Bad request (invalid parameters)
- `500` - Server error
---
## Security Notes
⚠️ **Important**: These are admin endpoints and should be protected in production!
Recommendations:
1. Add authentication/authorization
2. Rate limiting
3. IP whitelisting
4. API key requirement
5. Audit logging
Example protection (add to routes):
```python
from functools import wraps
from flask import request
def require_api_key(f):
@wraps(f)
def decorated_function(*args, **kwargs):
api_key = request.headers.get('X-API-Key')
if api_key != os.getenv('ADMIN_API_KEY'):
return jsonify({'error': 'Unauthorized'}), 401
return f(*args, **kwargs)
return decorated_function
@admin_bp.route('/api/admin/trigger-crawl', methods=['POST'])
@require_api_key
def trigger_crawl():
# ... endpoint code
```
---
## Related Endpoints
- **[Newsletter Preview](../backend/routes/newsletter_routes.py)**: `/api/newsletter/preview` - Preview newsletter HTML
- **[Analytics](API.md)**: `/api/analytics/*` - View engagement metrics
- **[RSS Feeds](API.md)**: `/api/rss-feeds` - Manage RSS feeds
---
## Newsletter API Summary
### Available Endpoints
| Endpoint | Purpose | Recipient |
|----------|---------|-----------|
| `/api/admin/send-test-email` | Test newsletter | Single email (specified) |
| `/api/admin/send-newsletter` | Production send | All active subscribers |
| `/api/admin/trigger-crawl` | Fetch articles | N/A |
| `/api/admin/stats` | System stats | N/A |
### Subscriber Status
The system uses a `status` field to determine who receives newsletters:
- **`active`** - Receives newsletters ✅
- **`inactive`** - Does not receive newsletters ❌
See [SUBSCRIBER_STATUS.md](SUBSCRIBER_STATUS.md) for details.
### Quick Examples
**Send to all subscribers:**
```bash
curl -X POST http://localhost:5001/api/admin/send-newsletter \
-H "Content-Type: application/json" \
-d '{"max_articles": 10}'
```
**Send test email:**
```bash
curl -X POST http://localhost:5001/api/admin/send-test-email \
-H "Content-Type: application/json" \
-d '{"email": "test@example.com"}'
```
**Check stats:**
```bash
curl http://localhost:5001/api/admin/stats | jq '.subscribers'
```
### Testing
Use the test script:
```bash
./test-newsletter-api.sh
```

317
docs/AI_NEWS_AGGREGATION.md Normal file
View File

@@ -0,0 +1,317 @@
# AI-Powered News Aggregation - COMPLETE ✅
## Overview
Successfully implemented a complete AI-powered news aggregation system that detects duplicate stories from multiple sources and generates neutral, balanced summaries.
## Features Implemented
### 1. AI-Powered Article Clustering ✅
**What it does:**
- Automatically detects when different news sources cover the same story
- Uses Ollama AI to intelligently compare article content
- Groups related articles by `cluster_id`
- Marks the first article as `is_primary: true`
**How it works:**
- Compares articles published within 24 hours
- Uses AI prompt: "Are these two articles about the same story?"
- Falls back to keyword matching if AI fails
- Real-time clustering during crawl
**Test Results:**
- ✅ Housing story from 2 sources → Clustered together
- ✅ Bayern transfer from 2 sources → Clustered together
- ✅ Different stories → Separate clusters
### 2. Neutral Summary Generation ✅
**What it does:**
- Synthesizes multiple articles into one balanced summary
- Combines perspectives from all sources
- Highlights agreements and differences
- Maintains neutral, objective tone
**How it works:**
- Takes all articles in a cluster
- Sends combined context to Ollama
- AI generates ~200-word neutral summary
- Saves to `cluster_summaries` collection
**Test Results:**
```
Bayern Transfer Story (2 sources):
"Bayern Munich has recently signed Brazilian footballer, aged 23,
for €50 million to bolster their attacking lineup as per reports
from abendzeitung-muenchen and sueddeutsche. The new addition is
expected to inject much-needed dynamism into the team's offense..."
```
### 3. Smart Prioritization ✅
**What it does:**
- Prioritizes stories covered by multiple sources (more important)
- Shows multi-source stories first with neutral summaries
- Fills remaining slots with single-source stories
**Sorting Logic:**
1. **Primary sort:** Number of sources (descending)
2. **Secondary sort:** Publish date (newest first)
**Example Output:**
```
1. Munich Housing (2 sources) → Neutral summary
2. Bayern Transfer (2 sources) → Neutral summary
3. Local story (1 source) → Individual summary
4. Local story (1 source) → Individual summary
...
```
## Database Schema
### Articles Collection
```javascript
{
_id: ObjectId("..."),
title: "München: Stadtrat beschließt...",
content: "Full article text...",
summary: "AI-generated summary...",
source: "abendzeitung-muenchen",
link: "https://...",
published_at: ISODate("2025-11-12T..."),
// Clustering fields
cluster_id: "1762937577.365818",
is_primary: true,
// Metadata
word_count: 450,
summary_word_count: 120,
category: "local",
crawled_at: ISODate("..."),
summarized_at: ISODate("...")
}
```
### Cluster Summaries Collection
```javascript
{
_id: ObjectId("..."),
cluster_id: "1762937577.365818",
neutral_summary: "Combined neutral summary from all sources...",
sources: ["abendzeitung-muenchen", "sueddeutsche"],
article_count: 2,
created_at: ISODate("2025-11-12T..."),
updated_at: ISODate("2025-11-12T...")
}
```
## API Endpoints
### Get All Articles (Default)
```bash
GET /api/news
```
Returns all articles individually (current behavior)
### Get Clustered Articles (Recommended)
```bash
GET /api/news?mode=clustered&limit=10
```
Returns:
- One article per story
- Multi-source stories with neutral summaries first
- Single-source stories with individual summaries
- Smart prioritization by popularity
**Response Format:**
```javascript
{
"articles": [
{
"title": "...",
"summary": "Neutral summary combining all sources...",
"summary_type": "neutral",
"is_clustered": true,
"source_count": 2,
"sources": ["source1", "source2"],
"related_articles": [
{"source": "source2", "title": "...", "link": "..."}
]
}
],
"mode": "clustered"
}
```
### Get Statistics
```bash
GET /api/stats
```
Returns:
```javascript
{
"articles": 51,
"crawled_articles": 45,
"summarized_articles": 40,
"clustered_articles": 47,
"neutral_summaries": 3
}
```
## Workflow
### Complete Crawl Process
1. **Crawl RSS feeds** from multiple sources
2. **Extract full content** from article URLs
3. **Generate AI summaries** for each article
4. **Cluster similar articles** using AI comparison
5. **Generate neutral summaries** for multi-source clusters
6. **Save everything** to MongoDB
### Time Windows
- **Clustering window:** 24 hours (rolling)
- **Crawl schedule:** Daily at 6:00 AM Berlin time
- **Manual trigger:** Available via crawler service
## Configuration
### Environment Variables
```bash
# Ollama AI
OLLAMA_BASE_URL=http://ollama:11434
OLLAMA_MODEL=phi3:latest
OLLAMA_ENABLED=true
OLLAMA_TIMEOUT=120
# Clustering
CLUSTERING_TIME_WINDOW=24 # hours
CLUSTERING_SIMILARITY_THRESHOLD=0.50
# Summaries
SUMMARY_MAX_WORDS=150 # individual
NEUTRAL_SUMMARY_MAX_WORDS=200 # cluster
```
## Files Created/Modified
### New Files
- `news_crawler/article_clustering.py` - AI clustering logic
- `news_crawler/cluster_summarizer.py` - Neutral summary generation
- `test-clustering-real.py` - Clustering tests
- `test-neutral-summaries.py` - Summary generation tests
- `test-complete-workflow.py` - End-to-end tests
### Modified Files
- `news_crawler/crawler_service.py` - Added clustering + summarization
- `news_crawler/ollama_client.py` - Added `generate()` method
- `backend/routes/news_routes.py` - Added clustered endpoint with prioritization
## Performance
### Metrics
- **Clustering:** ~20-40s per article pair (AI comparison)
- **Neutral summary:** ~30-40s per cluster
- **Success rate:** 100% in tests
- **Accuracy:** High - correctly identifies same/different stories
### Optimization
- Clustering runs during crawl (real-time)
- Neutral summaries generated after crawl (batch)
- Results cached in database
- 24-hour time window limits comparisons
## Testing
### Test Coverage
✅ AI clustering with same stories
✅ AI clustering with different stories
✅ Neutral summary generation
✅ Multi-source prioritization
✅ Database integration
✅ End-to-end workflow
### Test Commands
```bash
# Test clustering
docker-compose exec crawler python /app/test-clustering-real.py
# Test neutral summaries
docker-compose exec crawler python /app/test-neutral-summaries.py
# Test complete workflow
docker-compose exec crawler python /app/test-complete-workflow.py
```
## Benefits
### For Users
-**No duplicate stories** - See each story once
-**Balanced coverage** - Multiple perspectives combined
-**Prioritized content** - Important stories first
-**Source transparency** - See all sources covering a story
-**Efficient reading** - One summary instead of multiple articles
### For the System
-**Intelligent deduplication** - AI-powered, not just URL matching
-**Scalable** - Works with any number of sources
-**Flexible** - 24-hour time window catches late-breaking news
-**Reliable** - Fallback mechanisms if AI fails
-**Maintainable** - Clear separation of concerns
## Future Enhancements
### Potential Improvements
1. **Update summaries** when new articles join a cluster
2. **Summary versioning** to track changes over time
3. **Quality scoring** for generated summaries
4. **Multi-language support** for summaries
5. **Sentiment analysis** across sources
6. **Fact extraction** and verification
7. **Trending topics** detection
8. **User preferences** for source weighting
### Integration Ideas
- Email newsletters with neutral summaries
- Push notifications for multi-source stories
- RSS feed of clustered articles
- API for third-party apps
- Analytics dashboard
## Conclusion
The Munich News Aggregator now provides:
1.**Smart clustering** - AI detects duplicate stories
2.**Neutral summaries** - Balanced multi-source coverage
3.**Smart prioritization** - Important stories first
4.**Source transparency** - See all perspectives
5.**Efficient delivery** - One summary per story
**Result:** Users get comprehensive, balanced news coverage without information overload!
---
## Quick Start
### View Clustered News
```bash
curl "http://localhost:5001/api/news?mode=clustered&limit=10"
```
### Trigger Manual Crawl
```bash
docker-compose exec crawler python /app/scheduled_crawler.py
```
### Check Statistics
```bash
curl "http://localhost:5001/api/stats"
```
### View Cluster Summaries in Database
```bash
docker-compose exec mongodb mongosh -u admin -p changeme --authenticationDatabase admin munich_news --eval "db.cluster_summaries.find().pretty()"
```
---
**Status:** ✅ Production Ready
**Last Updated:** November 12, 2025
**Version:** 2.0 (AI-Powered)

View File

@@ -1,214 +1,248 @@
# API Reference
## Tracking Endpoints
Complete API documentation for Munich News Daily.
### Track Email Open
---
## Admin API
Base URL: `http://localhost:5001`
### Trigger Crawler
Manually fetch new articles.
```http
GET /api/track/pixel/<tracking_id>
```
POST /api/admin/trigger-crawl
Content-Type: application/json
Returns a 1x1 transparent PNG and logs the email open event.
**Response**: Image (image/png)
### Track Link Click
```http
GET /api/track/click/<tracking_id>
```
Logs the click event and redirects to the original article URL.
**Response**: 302 Redirect
## Analytics Endpoints
### Get Newsletter Metrics
```http
GET /api/analytics/newsletter/<newsletter_id>
```
Returns comprehensive metrics for a specific newsletter.
**Response**:
```json
{
"newsletter_id": "2024-01-15",
"total_sent": 100,
"total_opened": 75,
"open_rate": 75.0,
"unique_openers": 70,
"total_clicks": 45,
"unique_clickers": 30,
"click_through_rate": 30.0
"max_articles": 10
}
```
### Get Article Performance
```http
GET /api/analytics/article/<article_url>
**Example:**
```bash
curl -X POST http://localhost:5001/api/admin/trigger-crawl \
-H "Content-Type: application/json" \
-d '{"max_articles": 5}'
```
Returns performance metrics for a specific article.
---
### Send Test Email
Send newsletter to specific email.
```http
POST /api/admin/send-test-email
Content-Type: application/json
**Response**:
```json
{
"article_url": "https://example.com/article",
"total_sent": 100,
"total_clicks": 25,
"click_rate": 25.0,
"unique_clickers": 20,
"newsletters": ["2024-01-15", "2024-01-16"]
"email": "test@example.com",
"max_articles": 10
}
```
### Get Subscriber Activity
```http
GET /api/analytics/subscriber/<email>
**Example:**
```bash
curl -X POST http://localhost:5001/api/admin/send-test-email \
-H "Content-Type: application/json" \
-d '{"email": "your@email.com"}'
```
Returns activity status and engagement metrics for a subscriber.
---
### Send Newsletter to All Subscribers
Send newsletter to all active subscribers.
```http
POST /api/admin/send-newsletter
Content-Type: application/json
**Response**:
```json
{
"email": "user@example.com",
"status": "active",
"last_opened_at": "2024-01-15T10:30:00",
"last_clicked_at": "2024-01-15T10:35:00",
"total_opens": 45,
"total_clicks": 20,
"newsletters_received": 50,
"newsletters_opened": 45
"max_articles": 10
}
```
## Privacy Endpoints
### Delete Subscriber Data
```http
DELETE /api/tracking/subscriber/<email>
```
Deletes all tracking data for a subscriber (GDPR compliance).
**Response**:
**Response:**
```json
{
"success": true,
"message": "All tracking data deleted for user@example.com",
"deleted_counts": {
"newsletter_sends": 50,
"link_clicks": 25,
"subscriber_activity": 1
"message": "Newsletter sent successfully to 45 subscribers",
"subscriber_count": 45,
"max_articles": 10
}
```
**Example:**
```bash
curl -X POST http://localhost:5001/api/admin/send-newsletter \
-H "Content-Type: application/json" \
-d '{"max_articles": 10}'
```
---
### Get System Stats
```http
GET /api/admin/stats
```
**Response:**
```json
{
"articles": {
"total": 150,
"with_summary": 120,
"today": 15
},
"subscribers": {
"total": 50,
"active": 45
},
"rss_feeds": {
"total": 4,
"active": 4
},
"tracking": {
"total_sends": 200,
"total_opens": 150,
"total_clicks": 75
}
}
```
### Anonymize Old Data
**Example:**
```bash
curl http://localhost:5001/api/admin/stats
```
---
## Public API
### Subscribe
```http
POST /api/tracking/anonymize
```
POST /api/subscribe
Content-Type: application/json
Anonymizes tracking data older than the retention period.
**Request Body** (optional):
```json
{
"retention_days": 90
"email": "user@example.com"
}
```
**Response**:
```json
{
"success": true,
"message": "Anonymized tracking data older than 90 days",
"anonymized_counts": {
"newsletter_sends": 1250,
"link_clicks": 650
}
}
**Example:**
```bash
curl -X POST http://localhost:5001/api/subscribe \
-H "Content-Type: application/json" \
-d '{"email": "user@example.com"}'
```
### Opt Out of Tracking
---
### Unsubscribe
```http
POST /api/tracking/subscriber/<email>/opt-out
```
POST /api/unsubscribe
Content-Type: application/json
Disables tracking for a subscriber.
**Response**:
```json
{
"success": true,
"message": "Subscriber user@example.com has opted out of tracking"
"email": "user@example.com"
}
```
### Opt In to Tracking
```http
POST /api/tracking/subscriber/<email>/opt-in
**Example:**
```bash
curl -X POST http://localhost:5001/api/unsubscribe \
-H "Content-Type: application/json" \
-d '{"email": "user@example.com"}'
```
Re-enables tracking for a subscriber.
---
**Response**:
```json
## Subscriber Status System
### Status Values
| Status | Description | Receives Newsletters |
|--------|-------------|---------------------|
| `active` | Subscribed | ✅ Yes |
| `inactive` | Unsubscribed | ❌ No |
### Database Schema
```javascript
{
"success": true,
"message": "Subscriber user@example.com has opted in to tracking"
_id: ObjectId("..."),
email: "user@example.com",
subscribed_at: ISODate("2025-11-11T15:50:29.478Z"),
status: "active" // or "inactive"
}
```
## Examples
### How It Works
### Using curl
**Subscribe:**
- Creates subscriber with `status: 'active'`
- If already exists and inactive, reactivates
**Unsubscribe:**
- Updates `status: 'inactive'`
- Subscriber data preserved (soft delete)
**Newsletter Sending:**
- Only sends to `status: 'active'` subscribers
- Query: `{status: 'active'}`
### Check Active Subscribers
```bash
# Get newsletter metrics
curl http://localhost:5001/api/analytics/newsletter/2024-01-15
curl http://localhost:5001/api/admin/stats | jq '.subscribers'
# Output: {"total": 10, "active": 8}
```
# Delete subscriber data
curl -X DELETE http://localhost:5001/api/tracking/subscriber/user@example.com
---
# Anonymize old data
curl -X POST http://localhost:5001/api/tracking/anonymize \
## Workflows
### Complete Newsletter Workflow
```bash
# 1. Check stats
curl http://localhost:5001/api/admin/stats
# 2. Crawl articles
curl -X POST http://localhost:5001/api/admin/trigger-crawl \
-H "Content-Type: application/json" \
-d '{"retention_days": 90}'
-d '{"max_articles": 10}'
# Opt out of tracking
curl -X POST http://localhost:5001/api/tracking/subscriber/user@example.com/opt-out
# 3. Wait for crawl
sleep 60
# 4. Send newsletter
curl -X POST http://localhost:5001/api/admin/send-newsletter \
-H "Content-Type: application/json" \
-d '{"max_articles": 10}'
```
### Using Python
### Test Newsletter
```python
import requests
# Get newsletter metrics
response = requests.get('http://localhost:5001/api/analytics/newsletter/2024-01-15')
metrics = response.json()
print(f"Open rate: {metrics['open_rate']}%")
# Delete subscriber data
response = requests.delete('http://localhost:5001/api/tracking/subscriber/user@example.com')
result = response.json()
print(result['message'])
```bash
# Send test to yourself
curl -X POST http://localhost:5001/api/admin/send-test-email \
-H "Content-Type: application/json" \
-d '{"email": "your@email.com", "max_articles": 3}'
```
---
## Error Responses
All endpoints return standard error responses:
All endpoints return standard error format:
```json
{
@@ -217,7 +251,64 @@ All endpoints return standard error responses:
}
```
HTTP Status Codes:
**HTTP Status Codes:**
- `200` - Success
- `400` - Bad request
- `404` - Not found
- `500` - Server error
---
## Security
⚠️ **Production Recommendations:**
1. **Add Authentication**
```python
@require_api_key
def admin_endpoint():
# ...
```
2. **Rate Limiting**
- Prevent abuse
- Limit newsletter sends
3. **IP Whitelisting**
- Restrict admin endpoints
- Use firewall rules
4. **HTTPS Only**
- Use reverse proxy
- SSL/TLS certificates
5. **Audit Logging**
- Log all admin actions
- Monitor for suspicious activity
---
## Testing
Use the test script:
```bash
./test-newsletter-api.sh
```
Or test manually:
```bash
# Health check
curl http://localhost:5001/health
# Stats
curl http://localhost:5001/api/admin/stats
# Test email
curl -X POST http://localhost:5001/api/admin/send-test-email \
-H "Content-Type: application/json" \
-d '{"email": "test@example.com"}'
```
---
See [SETUP.md](SETUP.md) for configuration and [SECURITY.md](SECURITY.md) for security best practices.

View File

@@ -1,131 +1,439 @@
# System Architecture
Complete system design and architecture documentation.
---
## Overview
Munich News Daily is a fully automated news aggregation and newsletter system with the following components:
Munich News Daily is an automated news aggregation system that crawls Munich news sources, generates AI summaries, and sends daily newsletters.
```
┌─────────────────────────────────────────────────────────────────
Munich News Daily System
└─────────────────────────────────────────────────────────────────┘
6:00 AM Berlin → News Crawler
Fetches RSS feeds
Extracts full content
Generates AI summaries
Saves to MongoDB
7:00 AM Berlin → Newsletter Sender
Waits for crawler
Fetches articles
Generates newsletter
Sends to subscribers
✅ Done!
┌─────────────────────────────────────────────────────────┐
Docker Network
│ (Internal Only) │
├─────────────────────────────────────────────────────────┤
│ │
┌──────────┐ ┌──────────┐ ┌──────────┐
│ MongoDB │◄───┤ Backend │◄───┤ Crawler │ │
│ (27017) │ (5001) │ │ │
└──────────┘ └────┬─────┘ └──────────┘ │
│ │
┌──────────┐ ┌──────────┐ │
│ │ Ollama │◄────────┤ │ Sender │ │
│ (11434) │ │ │ │
└──────────┘ └──────────┘ │
│ │
└───────────────────────┼───────────────────────────────────┘
│ Port 5001 (Only exposed port)
Host Machine
External Network
```
---
## Components
### 1. MongoDB Database
- **Purpose**: Central data storage
- **Collections**:
- `articles`: News articles with summaries
- `subscribers`: Email subscribers
- `rss_feeds`: RSS feed sources
- `newsletter_sends`: Email tracking data
- `link_clicks`: Link click tracking
- `subscriber_activity`: Engagement metrics
### 1. MongoDB (Database)
- **Purpose**: Store articles, subscribers, tracking data
- **Port**: 27017 (internal only)
- **Access**: Only via Docker network
- **Authentication**: Username/password
### 2. News Crawler
- **Schedule**: Daily at 6:00 AM Berlin time
- **Functions**:
- Fetches articles from RSS feeds
- Extracts full article content
- Generates AI summaries using Ollama
- Saves to MongoDB
- **Technology**: Python, BeautifulSoup, Ollama
**Collections:**
- `articles` - News articles with summaries
- `subscribers` - Newsletter subscribers
- `rss_feeds` - RSS feed sources
- `newsletter_sends` - Send tracking
- `link_clicks` - Click tracking
### 3. Newsletter Sender
- **Schedule**: Daily at 7:00 AM Berlin time
- **Functions**:
- Waits for crawler to finish (max 30 min)
- Fetches today's articles
- Generates HTML newsletter
- Injects tracking pixels
- Sends to all subscribers
- **Technology**: Python, Jinja2, SMTP
### 2. Backend API (Flask)
- **Purpose**: API endpoints, tracking, analytics
- **Port**: 5001 (exposed to host)
- **Access**: Public API, admin endpoints
- **Features**: Tracking pixels, click tracking, admin operations
### 4. Backend API (Optional)
- **Purpose**: Tracking and analytics
- **Endpoints**:
- `/api/track/pixel/<id>` - Email open tracking
- `/api/track/click/<id>` - Link click tracking
- `/api/analytics/*` - Engagement metrics
- `/api/tracking/*` - Privacy controls
- **Technology**: Flask, Python
**Key Endpoints:**
- `/api/admin/*` - Admin operations
- `/api/subscribe` - Subscribe to newsletter
- `/api/tracking/*` - Tracking endpoints
- `/health` - Health check
### 3. Ollama (AI Service)
- **Purpose**: AI summarization and translation
- **Port**: 11434 (internal only)
- **Model**: phi3:latest (2.2GB)
- **GPU**: Optional NVIDIA GPU support
**Features:**
- Article summarization (150 words)
- Title translation (German → English)
- Configurable timeout and model
### 4. Crawler (News Fetcher)
- **Purpose**: Fetch and process news articles
- **Schedule**: 6:00 AM Berlin time (automated)
- **Features**: RSS parsing, content extraction, AI processing
**Process:**
1. Fetch RSS feeds
2. Extract article content
3. Translate title (German → English)
4. Generate AI summary
5. Store in MongoDB
### 5. Sender (Newsletter)
- **Purpose**: Send newsletters to subscribers
- **Schedule**: 7:00 AM Berlin time (automated)
- **Features**: Email sending, tracking, templating
**Process:**
1. Fetch today's articles
2. Generate newsletter HTML
3. Add tracking pixels/links
4. Send to active subscribers
5. Record send events
---
## Data Flow
### Article Processing Flow
```
RSS Feeds → Crawler → MongoDB → Sender → Subscribers
RSS Feed
Backend API
Crawler fetches
Analytics
Extract content
Translate title (Ollama)
Generate summary (Ollama)
Store in MongoDB
Newsletter Sender
Email to subscribers
```
## Coordination
### Tracking Flow
The sender waits for the crawler to ensure fresh content:
```
Newsletter sent
Tracking pixel embedded
User opens email
Pixel loaded → Backend API
Record open event
User clicks link
Redirect via Backend API
Record click event
Redirect to article
```
1. Sender starts at 7:00 AM
2. Checks for recent articles every 30 seconds
3. Maximum wait time: 30 minutes
4. Proceeds once crawler finishes or timeout
---
## Database Schema
### Articles Collection
```javascript
{
_id: ObjectId,
title: String, // Original German title
title_en: String, // English translation
translated_at: Date, // Translation timestamp
link: String,
summary: String, // AI-generated summary
content: String, // Full article text
author: String,
source: String, // RSS feed name
published_at: Date,
crawled_at: Date,
created_at: Date
}
```
### Subscribers Collection
```javascript
{
_id: ObjectId,
email: String, // Unique
subscribed_at: Date,
status: String // 'active' or 'inactive'
}
```
### RSS Feeds Collection
```javascript
{
_id: ObjectId,
name: String,
url: String,
active: Boolean,
last_crawled: Date
}
```
---
## Security Architecture
### Network Isolation
**Exposed Services:**
- Backend API (port 5001) - Only exposed service
**Internal Services:**
- MongoDB (port 27017) - Not accessible from host
- Ollama (port 11434) - Not accessible from host
- Crawler - No ports
- Sender - No ports
**Benefits:**
- 66% reduction in attack surface
- Database protected from external access
- AI service protected from abuse
- Defense in depth
### Authentication
**MongoDB:**
- Username/password authentication
- Credentials in environment variables
- Internal network only
**Backend API:**
- No authentication (add in production)
- Rate limiting recommended
- IP whitelisting recommended
### Data Protection
- Subscriber emails stored securely
- No sensitive data in logs
- Environment variables for secrets
- `.env` file in `.gitignore`
---
## Technology Stack
- **Backend**: Python 3.11
### Backend
- **Language**: Python 3.11
- **Framework**: Flask
- **Database**: MongoDB 7.0
- **AI**: Ollama (Phi3 model)
- **AI**: Ollama (phi3:latest)
### Infrastructure
- **Containerization**: Docker & Docker Compose
- **Networking**: Docker bridge network
- **Storage**: Docker volumes
- **Scheduling**: Python schedule library
- **Email**: SMTP with HTML templates
- **Tracking**: Pixel tracking + redirect URLs
- **Infrastructure**: Docker & Docker Compose
## Deployment
### Libraries
- **Web**: Flask, Flask-CORS
- **Database**: pymongo
- **Email**: smtplib, email.mime
- **Scraping**: requests, BeautifulSoup4, feedparser
- **Templating**: Jinja2
- **AI**: requests (Ollama API)
All components run in Docker containers:
---
## Deployment Architecture
### Development
```
docker-compose up -d
Local Machine
├── Docker Compose
│ ├── MongoDB (internal)
│ ├── Ollama (internal)
│ ├── Backend (exposed)
│ ├── Crawler (internal)
│ └── Sender (internal)
└── .env file
```
Containers:
- `munich-news-mongodb` - Database
- `munich-news-crawler` - Crawler service
- `munich-news-sender` - Sender service
### Production
```
Server
├── Reverse Proxy (nginx/Traefik)
│ ├── SSL/TLS
│ ├── Rate limiting
│ └── Authentication
├── Docker Compose
│ ├── MongoDB (internal)
│ ├── Ollama (internal, GPU)
│ ├── Backend (internal)
│ ├── Crawler (internal)
│ └── Sender (internal)
├── Monitoring
│ ├── Logs
│ ├── Metrics
│ └── Alerts
└── Backups
├── MongoDB dumps
└── Configuration
```
## Security
- MongoDB authentication enabled
- Environment variables for secrets
- HTTPS for tracking URLs (production)
- GDPR-compliant data retention
- Privacy controls (opt-out, deletion)
## Monitoring
- Docker logs for all services
- MongoDB for data verification
- Health checks on containers
- Engagement metrics via API
---
## Scalability
- Horizontal: Add more crawler instances
- Vertical: Increase container resources
- Database: MongoDB sharding if needed
- Caching: Redis for API responses (future)
### Current Limits
- Single server deployment
- Sequential article processing
- Single MongoDB instance
- No load balancing
### Scaling Options
**Horizontal Scaling:**
- Multiple crawler instances
- Load-balanced backend
- MongoDB replica set
- Distributed Ollama
**Vertical Scaling:**
- More CPU cores
- More RAM
- GPU acceleration (5-10x faster)
- Faster storage
**Optimization:**
- Batch processing
- Caching
- Database indexing
- Connection pooling
---
## Monitoring
### Health Checks
```bash
# Backend health
curl http://localhost:5001/health
# MongoDB health
docker-compose exec mongodb mongosh --eval "db.adminCommand('ping')"
# Ollama health
docker-compose exec ollama ollama list
```
### Metrics
- Article count
- Subscriber count
- Newsletter open rate
- Click-through rate
- Processing time
- Error rate
### Logs
```bash
# All services
docker-compose logs -f
# Specific service
docker-compose logs -f crawler
docker-compose logs -f backend
```
---
## Backup & Recovery
### MongoDB Backup
```bash
# Backup
docker-compose exec mongodb mongodump --out /backup
# Restore
docker-compose exec mongodb mongorestore /backup
```
### Configuration Backup
- `backend/.env` - Environment variables
- `docker-compose.yml` - Service configuration
- RSS feeds in MongoDB
### Recovery Plan
1. Restore MongoDB from backup
2. Restore configuration files
3. Restart services
4. Verify functionality
---
## Performance
### CPU Mode
- Translation: ~1.5s per title
- Summarization: ~8s per article
- 10 articles: ~115s total
- Suitable for <20 articles/day
### GPU Mode (5-10x faster)
- Translation: ~0.3s per title
- Summarization: ~2s per article
- 10 articles: ~31s total
- Suitable for high-volume processing
### Resource Usage
**CPU Mode:**
- CPU: 60-80%
- RAM: 4-6GB
- Disk: ~1GB (with model)
**GPU Mode:**
- CPU: 10-20%
- RAM: 2-3GB
- GPU: 80-100%
- VRAM: 3-4GB
- Disk: ~1GB (with model)
---
## Future Enhancements
### Planned Features
- Frontend dashboard
- Real-time analytics
- Multiple languages
- Custom RSS feeds per subscriber
- A/B testing for newsletters
- Advanced tracking
### Technical Improvements
- Kubernetes deployment
- Microservices architecture
- Message queue (RabbitMQ/Redis)
- Caching layer (Redis)
- CDN for assets
- Advanced monitoring (Prometheus/Grafana)
---
See [SETUP.md](SETUP.md) for deployment guide and [SECURITY.md](SECURITY.md) for security best practices.

View File

@@ -1,106 +0,0 @@
# Backend Structure
The backend has been modularized for better maintainability and scalability.
## Directory Structure
```
backend/
├── app.py # Main Flask application entry point
├── config.py # Configuration management
├── database.py # Database connection and initialization
├── requirements.txt # Python dependencies
├── .env # Environment variables
├── routes/ # API route handlers (blueprints)
│ ├── __init__.py
│ ├── subscription_routes.py # /api/subscribe, /api/unsubscribe
│ ├── news_routes.py # /api/news, /api/stats
│ ├── rss_routes.py # /api/rss-feeds (CRUD operations)
│ ├── ollama_routes.py # /api/ollama/* (AI features)
│ ├── tracking_routes.py # /api/track/* (email tracking)
│ └── analytics_routes.py # /api/analytics/* (engagement metrics)
└── services/ # Business logic layer
├── __init__.py
├── news_service.py # News fetching and storage logic
├── email_service.py # Newsletter email sending
├── ollama_service.py # Ollama AI integration
├── tracking_service.py # Email tracking (opens/clicks)
└── analytics_service.py # Engagement analytics
```
## Key Components
### app.py
- Main Flask application
- Registers all blueprints
- Minimal code, just wiring things together
### config.py
- Centralized configuration
- Loads environment variables
- Single source of truth for all settings
### database.py
- MongoDB connection setup
- Collection definitions
- Database initialization with indexes
### routes/
Each route file is a Flask Blueprint handling specific API endpoints:
- **subscription_routes.py**: User subscription management
- **news_routes.py**: News fetching and statistics
- **rss_routes.py**: RSS feed management (add/remove/list/toggle)
- **ollama_routes.py**: AI/Ollama integration endpoints
- **tracking_routes.py**: Email tracking (pixel, click redirects, data deletion)
- **analytics_routes.py**: Engagement analytics (open rates, click rates, subscriber activity)
### services/
Business logic separated from route handlers:
- **news_service.py**: Fetches news from RSS feeds, saves to database
- **email_service.py**: Sends newsletter emails to subscribers
- **ollama_service.py**: Communicates with Ollama AI server
- **tracking_service.py**: Email tracking logic (tracking IDs, pixel generation, click logging)
- **analytics_service.py**: Analytics calculations (open rates, click rates, activity classification)
## Benefits of This Structure
1. **Separation of Concerns**: Routes handle HTTP, services handle business logic
2. **Testability**: Each module can be tested independently
3. **Maintainability**: Easy to find and modify specific functionality
4. **Scalability**: Easy to add new routes or services
5. **Reusability**: Services can be used by multiple routes
## Adding New Features
### To add a new API endpoint:
1. Create a new route file in `routes/` or add to existing one
2. Create a Blueprint and define routes
3. Register the blueprint in `app.py`
### To add new business logic:
1. Create a new service file in `services/`
2. Import and use in your route handlers
### Example:
```python
# services/my_service.py
def my_business_logic():
return "Hello"
# routes/my_routes.py
from flask import Blueprint
from services.my_service import my_business_logic
my_bp = Blueprint('my', __name__)
@my_bp.route('/api/my-endpoint')
def my_endpoint():
result = my_business_logic()
return {'message': result}
# app.py
from routes.my_routes import my_bp
app.register_blueprint(my_bp)
```

View File

@@ -1,176 +0,0 @@
# Changelog
## [Unreleased] - 2024-11-10
### Added - Major Refactoring
#### Backend Modularization
- ✅ Restructured backend into modular architecture
- ✅ Created separate route blueprints:
- `subscription_routes.py` - User subscriptions
- `news_routes.py` - News fetching and stats
- `rss_routes.py` - RSS feed management (CRUD)
- `ollama_routes.py` - AI integration
- ✅ Created service layer:
- `news_service.py` - News fetching logic
- `email_service.py` - Newsletter sending
- `ollama_service.py` - AI communication
- ✅ Centralized configuration in `config.py`
- ✅ Separated database logic in `database.py`
- ✅ Reduced main `app.py` from 700+ lines to 27 lines
#### RSS Feed Management
- ✅ Dynamic RSS feed management via API
- ✅ Add/remove/list/toggle RSS feeds without code changes
- ✅ Unique index on RSS feed URLs (prevents duplicates)
- ✅ Default feeds auto-initialized on first run
- ✅ Created `fix_duplicates.py` utility script
#### News Crawler Microservice
- ✅ Created standalone `news_crawler/` microservice
- ✅ Web scraping with BeautifulSoup
- ✅ Smart content extraction using multiple selectors
- ✅ Full article content storage in MongoDB
- ✅ Word count calculation
- ✅ Duplicate prevention (skips already-crawled articles)
- ✅ Rate limiting (1 second between requests)
- ✅ Can run independently or scheduled
- ✅ Docker support for crawler
- ✅ Comprehensive documentation
#### API Endpoints
New endpoints added:
- `GET /api/rss-feeds` - List all RSS feeds
- `POST /api/rss-feeds` - Add new RSS feed
- `DELETE /api/rss-feeds/<id>` - Remove RSS feed
- `PATCH /api/rss-feeds/<id>/toggle` - Toggle feed active status
#### Documentation
- ✅ Created `ARCHITECTURE.md` - System architecture overview
- ✅ Created `backend/STRUCTURE.md` - Backend structure guide
- ✅ Created `news_crawler/README.md` - Crawler documentation
- ✅ Created `news_crawler/QUICKSTART.md` - Quick start guide
- ✅ Created `news_crawler/test_crawler.py` - Test suite
- ✅ Updated main `README.md` with new features
- ✅ Updated `DATABASE_SCHEMA.md` with new fields
#### Configuration
- ✅ Added `FLASK_PORT` environment variable
- ✅ Fixed `OLLAMA_MODEL` typo in `.env`
- ✅ Port 5001 default to avoid macOS AirPlay conflict
### Changed
- Backend structure: Monolithic → Modular
- RSS feeds: Hardcoded → Database-driven
- Article storage: Summary only → Full content support
- Configuration: Scattered → Centralized
### Technical Improvements
- Separation of concerns (routes vs services)
- Better testability
- Easier maintenance
- Scalable architecture
- Independent microservices
- Proper error handling
- Comprehensive logging
### Database Schema Updates
Articles collection now includes:
- `full_content` - Full article text
- `word_count` - Number of words
- `crawled_at` - When content was crawled
RSS Feeds collection added:
- `name` - Feed name
- `url` - Feed URL (unique)
- `active` - Active status
- `created_at` - Creation timestamp
### Files Added
```
backend/
├── config.py
├── database.py
├── fix_duplicates.py
├── STRUCTURE.md
├── routes/
│ ├── __init__.py
│ ├── subscription_routes.py
│ ├── news_routes.py
│ ├── rss_routes.py
│ └── ollama_routes.py
└── services/
├── __init__.py
├── news_service.py
├── email_service.py
└── ollama_service.py
news_crawler/
├── crawler_service.py
├── test_crawler.py
├── requirements.txt
├── .gitignore
├── Dockerfile
├── docker-compose.yml
├── README.md
└── QUICKSTART.md
Root:
├── ARCHITECTURE.md
└── CHANGELOG.md
```
### Files Removed
- Old monolithic `backend/app.py` (replaced with modular version)
### Next Steps (Future Enhancements)
- [ ] Frontend UI for RSS feed management
- [ ] Automatic article summarization with Ollama
- [ ] Scheduled newsletter sending
- [ ] Article categorization and tagging
- [ ] Search functionality
- [ ] User preferences (categories, frequency)
- [ ] Analytics dashboard
- [ ] API rate limiting
- [ ] Caching layer (Redis)
- [ ] Message queue for crawler (Celery)
---
## Recent Updates (November 2025)
### Security Improvements
- **MongoDB Internal-Only**: Removed port exposure, only accessible via Docker network
- **Ollama Internal-Only**: Removed port exposure, only accessible via Docker network
- **Reduced Attack Surface**: Only Backend API (port 5001) exposed to host
- **Network Isolation**: All services communicate via internal Docker network
### Ollama Integration
- **Docker Compose Integration**: Ollama service runs alongside other services
- **Automatic Model Download**: phi3:latest model downloaded on first startup
- **GPU Support**: NVIDIA GPU acceleration with automatic detection
- **Helper Scripts**: `start-with-gpu.sh`, `check-gpu.sh`, `configure-ollama.sh`
- **Performance**: 5-10x faster with GPU acceleration
### API Enhancements
- **Send Newsletter Endpoint**: `/api/admin/send-newsletter` to send to all active subscribers
- **Subscriber Status Fix**: Fixed stats endpoint to correctly count active subscribers
- **Better Error Handling**: Improved error messages and validation
### Documentation
- **Consolidated Documentation**: Moved all docs to `docs/` directory
- **Security Guide**: Comprehensive security documentation
- **GPU Setup Guide**: Detailed GPU acceleration setup
- **MongoDB Connection Guide**: Connection configuration explained
- **Subscriber Status Guide**: How subscriber status system works
### Configuration
- **MongoDB URI**: Updated to use Docker service name (`mongodb` instead of `localhost`)
- **Ollama URL**: Configured for internal Docker network (`http://ollama:11434`)
- **Single .env File**: All configuration in `backend/.env`
### Testing
- **Connectivity Tests**: `test-mongodb-connectivity.sh`
- **Ollama Tests**: `test-ollama-setup.sh`
- **Newsletter API Tests**: `test-newsletter-api.sh`

View File

@@ -1,306 +0,0 @@
# How the News Crawler Works
## 🎯 Overview
The crawler dynamically extracts article metadata from any website using multiple fallback strategies.
## 📊 Flow Diagram
```
RSS Feed URL
Parse RSS Feed
For each article link:
┌─────────────────────────────────────┐
│ 1. Fetch HTML Page │
│ GET https://example.com/article │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ 2. Parse with BeautifulSoup │
│ soup = BeautifulSoup(html) │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ 3. Clean HTML │
│ Remove: scripts, styles, nav, │
│ footer, header, ads │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ 4. Extract Title │
│ Try: H1 → OG meta → Twitter → │
│ Title tag │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ 5. Extract Author │
│ Try: Meta author → rel=author → │
│ Class names → JSON-LD │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ 6. Extract Date │
│ Try: <time> → Meta tags → │
│ Class names → JSON-LD │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ 7. Extract Content │
│ Try: <article> → Class names → │
│ <main> → <body> │
│ Filter short paragraphs │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ 8. Save to MongoDB │
│ { │
│ title, author, date, │
│ content, word_count │
│ } │
└─────────────────────────────────────┘
Wait 1 second (rate limiting)
Next article
```
## 🔍 Detailed Example
### Input: RSS Feed Entry
```xml
<item>
<title>New U-Bahn Line Opens</title>
<link>https://www.sueddeutsche.de/muenchen/article-123</link>
<pubDate>Mon, 10 Nov 2024 10:00:00 +0100</pubDate>
</item>
```
### Step 1: Fetch HTML
```python
url = "https://www.sueddeutsche.de/muenchen/article-123"
response = requests.get(url)
html = response.content
```
### Step 2: Parse HTML
```python
soup = BeautifulSoup(html, 'html.parser')
```
### Step 3: Extract Title
```python
# Try H1
h1 = soup.find('h1')
# Result: "New U-Bahn Line Opens in Munich"
# If no H1, try OG meta
og_title = soup.find('meta', property='og:title')
# Fallback chain continues...
```
### Step 4: Extract Author
```python
# Try meta author
meta_author = soup.find('meta', name='author')
# Result: None
# Try class names
author_elem = soup.select_one('[class*="author"]')
# Result: "Max Mustermann"
```
### Step 5: Extract Date
```python
# Try time tag
time_tag = soup.find('time')
# Result: "2024-11-10T10:00:00Z"
```
### Step 6: Extract Content
```python
# Try article tag
article = soup.find('article')
paragraphs = article.find_all('p')
# Filter paragraphs
content = []
for p in paragraphs:
text = p.get_text().strip()
if len(text) >= 50: # Keep substantial paragraphs
content.append(text)
full_content = '\n\n'.join(content)
# Result: "The new U-Bahn line connecting the city center..."
```
### Step 7: Save to Database
```python
article_doc = {
'title': 'New U-Bahn Line Opens in Munich',
'author': 'Max Mustermann',
'link': 'https://www.sueddeutsche.de/muenchen/article-123',
'summary': 'Short summary from RSS...',
'full_content': 'The new U-Bahn line connecting...',
'word_count': 1250,
'source': 'Süddeutsche Zeitung München',
'published_at': '2024-11-10T10:00:00Z',
'crawled_at': datetime.utcnow(),
'created_at': datetime.utcnow()
}
db.articles.update_one(
{'link': article_url},
{'$set': article_doc},
upsert=True
)
```
## 🎨 What Makes It "Dynamic"?
### Traditional Approach (Hardcoded)
```python
# Only works for one specific site
title = soup.find('h1', class_='article-title').text
author = soup.find('span', class_='author-name').text
```
❌ Breaks when site changes
❌ Doesn't work on other sites
### Our Approach (Dynamic)
```python
# Works on ANY site
title = extract_title(soup) # Tries 4 different methods
author = extract_author(soup) # Tries 5 different methods
```
✅ Adapts to different HTML structures
✅ Falls back to alternatives
✅ Works across multiple sites
## 🛡️ Robustness Features
### 1. Multiple Strategies
Each field has 4-6 extraction strategies
```python
def extract_title(soup):
# Try strategy 1
if h1 := soup.find('h1'):
return h1.text
# Try strategy 2
if og_title := soup.find('meta', property='og:title'):
return og_title['content']
# Try strategy 3...
# Try strategy 4...
```
### 2. Validation
```python
# Title must be reasonable length
if title and len(title) > 10:
return title
# Author must be < 100 chars
if author and len(author) < 100:
return author
```
### 3. Cleaning
```python
# Remove site name from title
if ' | ' in title:
title = title.split(' | ')[0]
# Remove "By" from author
author = author.replace('By ', '').strip()
```
### 4. Error Handling
```python
try:
data = extract_article_content(url)
except Timeout:
print("Timeout - skip")
except RequestException:
print("Network error - skip")
except Exception:
print("Unknown error - skip")
```
## 📈 Success Metrics
After crawling, you'll see:
```
📰 Crawling feed: Süddeutsche Zeitung München
🔍 Crawling: New U-Bahn Line Opens...
✓ Saved (1250 words)
Title: ✓ Found
Author: ✓ Found (Max Mustermann)
Date: ✓ Found (2024-11-10T10:00:00Z)
Content: ✓ Found (1250 words)
```
## 🗄️ Database Result
**Before Crawling:**
```javascript
{
title: "New U-Bahn Line Opens",
link: "https://example.com/article",
summary: "Short RSS summary...",
source: "Süddeutsche Zeitung"
}
```
**After Crawling:**
```javascript
{
title: "New U-Bahn Line Opens in Munich", // ← Enhanced
author: "Max Mustermann", // ← NEW!
link: "https://example.com/article",
summary: "Short RSS summary...",
full_content: "The new U-Bahn line...", // ← NEW! (1250 words)
word_count: 1250, // ← NEW!
source: "Süddeutsche Zeitung",
published_at: "2024-11-10T10:00:00Z", // ← Enhanced
crawled_at: ISODate("2024-11-10T16:30:00Z"), // ← NEW!
created_at: ISODate("2024-11-10T16:00:00Z")
}
```
## 🚀 Running the Crawler
```bash
cd news_crawler
pip install -r requirements.txt
python crawler_service.py 10
```
Output:
```
============================================================
🚀 Starting RSS Feed Crawler
============================================================
Found 3 active feed(s)
📰 Crawling feed: Süddeutsche Zeitung München
🔍 Crawling: New U-Bahn Line Opens...
✓ Saved (1250 words)
🔍 Crawling: Munich Weather Update...
✓ Saved (450 words)
✓ Crawled 2 articles
============================================================
✓ Crawling Complete!
Total feeds processed: 3
Total articles crawled: 15
Duration: 45.23 seconds
============================================================
```
Now you have rich, structured article data ready for AI processing! 🎉

View File

@@ -1,336 +0,0 @@
# MongoDB Database Schema
This document describes the MongoDB collections and their structure for Munich News Daily.
## Collections
### 1. Articles Collection (`articles`)
Stores all news articles aggregated from Munich news sources.
**Document Structure:**
```javascript
{
_id: ObjectId, // Auto-generated MongoDB ID
title: String, // Article title (required)
author: String, // Article author (optional, extracted during crawl)
link: String, // Article URL (required, unique)
content: String, // Full article content (no length limit)
summary: String, // AI-generated English summary (≤150 words)
word_count: Number, // Word count of full content
summary_word_count: Number, // Word count of AI summary
source: String, // News source name (e.g., "Süddeutsche Zeitung München")
published_at: String, // Original publication date from RSS feed or crawled
crawled_at: DateTime, // When article content was crawled (UTC)
summarized_at: DateTime, // When AI summary was generated (UTC)
created_at: DateTime // When article was added to database (UTC)
}
```
**Indexes:**
- `link` - Unique index to prevent duplicate articles
- `created_at` - Index for efficient sorting by date
**Example Document:**
```javascript
{
_id: ObjectId("507f1f77bcf86cd799439011"),
title: "New U-Bahn Line Opens in Munich",
author: "Max Mustermann",
link: "https://www.sueddeutsche.de/muenchen/ubahn-1.123456",
content: "The new U-Bahn line connecting the city center with the airport opened today. Mayor Dieter Reiter attended the opening ceremony... [full article text continues]",
summary: "Munich's new U-Bahn line connecting the city center to the airport opened today with Mayor Dieter Reiter in attendance. The line features 10 stations and runs every 10 minutes during peak hours, significantly reducing travel time. Construction took five years and cost approximately 2 billion euros.",
word_count: 1250,
summary_word_count: 48,
source: "Süddeutsche Zeitung München",
published_at: "Mon, 15 Jan 2024 10:00:00 +0100",
crawled_at: ISODate("2024-01-15T09:30:00.000Z"),
summarized_at: ISODate("2024-01-15T09:30:15.000Z"),
created_at: ISODate("2024-01-15T09:00:00.000Z")
}
```
### 2. Subscribers Collection (`subscribers`)
Stores all newsletter subscribers.
**Document Structure:**
```javascript
{
_id: ObjectId, // Auto-generated MongoDB ID
email: String, // Subscriber email (required, unique, lowercase)
subscribed_at: DateTime, // When user subscribed (UTC)
status: String // Subscription status: 'active' or 'inactive'
}
```
**Indexes:**
- `email` - Unique index for email lookups and preventing duplicates
- `subscribed_at` - Index for analytics and sorting
**Example Document:**
```javascript
{
_id: ObjectId("507f1f77bcf86cd799439012"),
email: "user@example.com",
subscribed_at: ISODate("2024-01-15T08:30:00.000Z"),
status: "active"
}
```
### 3. Newsletter Sends Collection (`newsletter_sends`)
Tracks each newsletter sent to each subscriber for email open tracking.
**Document Structure:**
```javascript
{
_id: ObjectId, // Auto-generated MongoDB ID
newsletter_id: String, // Unique ID for this newsletter batch (date-based)
subscriber_email: String, // Recipient email
tracking_id: String, // Unique tracking ID for this send (UUID)
sent_at: DateTime, // When email was sent (UTC)
opened: Boolean, // Whether email was opened
first_opened_at: DateTime, // First open timestamp (null if not opened)
last_opened_at: DateTime, // Most recent open timestamp
open_count: Number, // Number of times opened
created_at: DateTime // Record creation time (UTC)
}
```
**Indexes:**
- `tracking_id` - Unique index for fast pixel request lookups
- `newsletter_id` - Index for analytics queries
- `subscriber_email` - Index for user activity queries
- `sent_at` - Index for time-based queries
**Example Document:**
```javascript
{
_id: ObjectId("507f1f77bcf86cd799439013"),
newsletter_id: "2024-01-15",
subscriber_email: "user@example.com",
tracking_id: "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
sent_at: ISODate("2024-01-15T08:00:00.000Z"),
opened: true,
first_opened_at: ISODate("2024-01-15T09:30:00.000Z"),
last_opened_at: ISODate("2024-01-15T14:20:00.000Z"),
open_count: 3,
created_at: ISODate("2024-01-15T08:00:00.000Z")
}
```
### 4. Link Clicks Collection (`link_clicks`)
Tracks individual link clicks from newsletters.
**Document Structure:**
```javascript
{
_id: ObjectId, // Auto-generated MongoDB ID
tracking_id: String, // Unique tracking ID for this link (UUID)
newsletter_id: String, // Which newsletter this link was in
subscriber_email: String, // Who clicked
article_url: String, // Original article URL
article_title: String, // Article title for reporting
clicked_at: DateTime, // When link was clicked (UTC)
user_agent: String, // Browser/client info
created_at: DateTime // Record creation time (UTC)
}
```
**Indexes:**
- `tracking_id` - Unique index for fast redirect request lookups
- `newsletter_id` - Index for analytics queries
- `article_url` - Index for article performance queries
- `subscriber_email` - Index for user activity queries
**Example Document:**
```javascript
{
_id: ObjectId("507f1f77bcf86cd799439014"),
tracking_id: "b2c3d4e5-f6a7-8901-bcde-f12345678901",
newsletter_id: "2024-01-15",
subscriber_email: "user@example.com",
article_url: "https://www.sueddeutsche.de/muenchen/ubahn-1.123456",
article_title: "New U-Bahn Line Opens in Munich",
clicked_at: ISODate("2024-01-15T09:35:00.000Z"),
user_agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
created_at: ISODate("2024-01-15T09:35:00.000Z")
}
```
### 5. Subscriber Activity Collection (`subscriber_activity`)
Aggregated activity status for each subscriber.
**Document Structure:**
```javascript
{
_id: ObjectId, // Auto-generated MongoDB ID
email: String, // Subscriber email (unique)
status: String, // 'active', 'inactive', or 'dormant'
last_opened_at: DateTime, // Most recent email open (UTC)
last_clicked_at: DateTime, // Most recent link click (UTC)
total_opens: Number, // Lifetime open count
total_clicks: Number, // Lifetime click count
newsletters_received: Number, // Total newsletters sent
newsletters_opened: Number, // Total newsletters opened
updated_at: DateTime // Last status update (UTC)
}
```
**Indexes:**
- `email` - Unique index for fast lookups
- `status` - Index for filtering by activity level
- `last_opened_at` - Index for time-based queries
**Activity Status Classification:**
- **active**: Opened an email in the last 30 days
- **inactive**: No opens in 30-60 days
- **dormant**: No opens in 60+ days
**Example Document:**
```javascript
{
_id: ObjectId("507f1f77bcf86cd799439015"),
email: "user@example.com",
status: "active",
last_opened_at: ISODate("2024-01-15T09:30:00.000Z"),
last_clicked_at: ISODate("2024-01-15T09:35:00.000Z"),
total_opens: 45,
total_clicks: 23,
newsletters_received: 60,
newsletters_opened: 45,
updated_at: ISODate("2024-01-15T10:00:00.000Z")
}
```
## Design Decisions
### Why MongoDB?
1. **Flexibility**: Easy to add new fields without schema migrations
2. **Scalability**: Handles large volumes of articles and subscribers efficiently
3. **Performance**: Indexes on frequently queried fields (link, email, created_at)
4. **Document Model**: Natural fit for news articles and subscriber data
### Schema Choices
1. **Unique Link Index**: Prevents duplicate articles from being stored, even if fetched multiple times
2. **Status Field**: Soft delete for subscribers (set to 'inactive' instead of deleting) - allows for analytics and easy re-subscription
3. **UTC Timestamps**: All dates stored in UTC for consistency across timezones
4. **Lowercase Emails**: Emails stored in lowercase to prevent case-sensitivity issues
### Future Enhancements
Potential fields to add in the future:
**Articles:**
- `category`: String (e.g., "politics", "sports", "culture")
- `tags`: Array of Strings
- `image_url`: String
- `sent_in_newsletter`: Boolean (track if article was sent)
- `sent_at`: DateTime (when article was included in newsletter)
**Subscribers:**
- `preferences`: Object (newsletter frequency, categories, etc.)
- `last_sent_at`: DateTime (last newsletter sent date)
- `unsubscribed_at`: DateTime (when user unsubscribed)
- `verification_token`: String (for email verification)
## AI Summarization Workflow
When the crawler processes an article:
1. **Extract Content**: Full article text is extracted from the webpage
2. **Summarize with Ollama**: If `OLLAMA_ENABLED=true`, the content is sent to Ollama for summarization
3. **Store Both**: Both the original `content` and AI-generated `summary` are stored
4. **Fallback**: If Ollama is unavailable or fails, only the original content is stored
### Summary Field Details
- **Language**: Always in English, regardless of source article language
- **Length**: Maximum 150 words
- **Format**: Plain text, concise and clear
- **Purpose**: Quick preview for newsletters and frontend display
### Querying Articles
```javascript
// Get articles with AI summaries
db.articles.find({ summary: { $exists: true, $ne: null } })
// Get articles without summaries
db.articles.find({ summary: { $exists: false } })
// Count summarized articles
db.articles.countDocuments({ summary: { $exists: true, $ne: null } })
```
---
## MongoDB Connection Configuration
### Docker Compose Setup
**Connection URI:**
```env
MONGODB_URI=mongodb://admin:changeme@mongodb:27017/
```
**Key Points:**
- Uses `mongodb` (Docker service name), not `localhost`
- Includes authentication credentials
- Only works inside Docker network
- Port 27017 is NOT exposed to host (internal only)
### Why 'mongodb' Instead of 'localhost'?
**Inside Docker containers:**
```
Container → mongodb:27017 ✅ Works (Docker DNS)
Container → localhost:27017 ❌ Fails (localhost = container itself)
```
**From host machine:**
```
Host → localhost:27017 ❌ Blocked (port not exposed)
Host → mongodb:27017 ❌ Fails (DNS only works in Docker)
```
### Connection Priority
1. **Docker Compose environment variables** (highest)
2. **.env file** (fallback)
3. **Code defaults** (lowest)
### Testing Connection
```bash
# From backend
docker-compose exec backend python -c "
from database import articles_collection
print(f'Articles: {articles_collection.count_documents({})}')
"
# From crawler
docker-compose exec crawler python -c "
from pymongo import MongoClient
from config import Config
client = MongoClient(Config.MONGODB_URI)
print(f'MongoDB version: {client.server_info()[\"version\"]}')
"
```
### Security
- ✅ MongoDB is internal-only (not exposed to host)
- ✅ Uses authentication (username/password)
- ✅ Only accessible via Docker network
- ✅ Cannot be accessed from external network
See [SECURITY_NOTES.md](SECURITY_NOTES.md) for more security details.

View File

@@ -1,204 +0,0 @@
# Documentation Cleanup Summary
## What Was Done
Consolidated and organized all markdown documentation files.
## Before
**Root Level:** 14 markdown files (cluttered)
```
README.md
QUICKSTART.md
CONTRIBUTING.md
IMPLEMENTATION_SUMMARY.md
MONGODB_CONNECTION_EXPLAINED.md
NETWORK_SECURITY_SUMMARY.md
NEWSLETTER_API_UPDATE.md
OLLAMA_GPU_SUMMARY.md
OLLAMA_INTEGRATION.md
QUICK_START_GPU.md
SECURITY_IMPROVEMENTS.md
SECURITY_UPDATE.md
FINAL_STRUCTURE.md (outdated)
PROJECT_STRUCTURE.md (redundant)
```
**docs/:** 18 files (organized but some content duplicated)
## After
**Root Level:** 3 essential files (clean)
```
README.md - Main entry point
QUICKSTART.md - Quick setup guide
CONTRIBUTING.md - Contribution guidelines
```
**docs/:** 19 files (organized, consolidated, no duplication)
```
INDEX.md - Documentation index (NEW)
ADMIN_API.md - Admin API (consolidated)
API.md
ARCHITECTURE.md
BACKEND_STRUCTURE.md
CHANGELOG.md - Updated with recent changes
CRAWLER_HOW_IT_WORKS.md
DATABASE_SCHEMA.md - Added MongoDB connection info
DEPLOYMENT.md
EXTRACTION_STRATEGIES.md
GPU_SETUP.md - Consolidated GPU docs
OLLAMA_SETUP.md - Consolidated Ollama docs
OLD_ARCHITECTURE.md
PERFORMANCE_COMPARISON.md
QUICK_REFERENCE.md
RSS_URL_EXTRACTION.md
SECURITY_NOTES.md - Consolidated all security docs
SUBSCRIBER_STATUS.md
SYSTEM_ARCHITECTURE.md
```
## Changes Made
### 1. Deleted Redundant Files
-`FINAL_STRUCTURE.md` (outdated)
-`PROJECT_STRUCTURE.md` (redundant with README)
### 2. Merged into docs/SECURITY_NOTES.md
-`SECURITY_UPDATE.md` (Ollama security)
-`SECURITY_IMPROVEMENTS.md` (Network isolation)
-`NETWORK_SECURITY_SUMMARY.md` (Port exposure summary)
### 3. Merged into docs/GPU_SETUP.md
-`OLLAMA_GPU_SUMMARY.md` (GPU implementation summary)
-`QUICK_START_GPU.md` (Quick start commands)
### 4. Merged into docs/OLLAMA_SETUP.md
-`OLLAMA_INTEGRATION.md` (Integration details)
### 5. Merged into docs/ADMIN_API.md
-`NEWSLETTER_API_UPDATE.md` (Newsletter endpoint)
### 6. Merged into docs/DATABASE_SCHEMA.md
-`MONGODB_CONNECTION_EXPLAINED.md` (Connection config)
### 7. Merged into docs/CHANGELOG.md
-`IMPLEMENTATION_SUMMARY.md` (Recent updates)
### 8. Created New Files
-`docs/INDEX.md` - Complete documentation index
### 9. Updated Existing Files
- 📝 `README.md` - Added documentation section
- 📝 `docs/CHANGELOG.md` - Added recent updates
- 📝 `docs/SECURITY_NOTES.md` - Comprehensive security guide
- 📝 `docs/GPU_SETUP.md` - Complete GPU guide
- 📝 `docs/OLLAMA_SETUP.md` - Complete Ollama guide
- 📝 `docs/ADMIN_API.md` - Complete API reference
- 📝 `docs/DATABASE_SCHEMA.md` - Added connection info
## Benefits
### 1. Cleaner Root Directory
- Only 3 essential files visible
- Easier to navigate
- Professional appearance
### 2. Better Organization
- All technical docs in `docs/`
- Logical grouping by topic
- Easy to find information
### 3. No Duplication
- Consolidated related content
- Single source of truth
- Easier to maintain
### 4. Improved Discoverability
- Documentation index (`docs/INDEX.md`)
- Clear navigation
- Quick links by task
### 5. Better Maintenance
- Fewer files to update
- Related content together
- Clear structure
## Documentation Structure
```
project/
├── README.md # Main entry point
├── QUICKSTART.md # Quick setup
├── CONTRIBUTING.md # How to contribute
└── docs/ # All technical documentation
├── INDEX.md # Documentation index
├── Setup & Configuration
│ ├── OLLAMA_SETUP.md
│ ├── GPU_SETUP.md
│ └── DEPLOYMENT.md
├── API Documentation
│ ├── ADMIN_API.md
│ ├── API.md
│ └── SUBSCRIBER_STATUS.md
├── Architecture
│ ├── SYSTEM_ARCHITECTURE.md
│ ├── ARCHITECTURE.md
│ ├── DATABASE_SCHEMA.md
│ └── BACKEND_STRUCTURE.md
├── Features
│ ├── CRAWLER_HOW_IT_WORKS.md
│ ├── EXTRACTION_STRATEGIES.md
│ ├── RSS_URL_EXTRACTION.md
│ └── PERFORMANCE_COMPARISON.md
├── Security
│ └── SECURITY_NOTES.md
└── Reference
├── CHANGELOG.md
└── QUICK_REFERENCE.md
```
## Quick Access
### For Users
- Start here: [README.md](README.md)
- Quick setup: [QUICKSTART.md](QUICKSTART.md)
- All docs: [docs/INDEX.md](docs/INDEX.md)
### For Developers
- Architecture: [docs/SYSTEM_ARCHITECTURE.md](docs/SYSTEM_ARCHITECTURE.md)
- API Reference: [docs/ADMIN_API.md](docs/ADMIN_API.md)
- Contributing: [CONTRIBUTING.md](CONTRIBUTING.md)
### For DevOps
- Deployment: [docs/DEPLOYMENT.md](docs/DEPLOYMENT.md)
- Security: [docs/SECURITY_NOTES.md](docs/SECURITY_NOTES.md)
- GPU Setup: [docs/GPU_SETUP.md](docs/GPU_SETUP.md)
## Statistics
- **Files Deleted:** 11 redundant markdown files
- **Files Merged:** 9 files consolidated into existing docs
- **Files Created:** 1 new index file
- **Files Updated:** 7 existing files enhanced
- **Root Level:** Reduced from 14 to 3 files (79% reduction)
- **Total Docs:** 19 well-organized files in docs/
## Result
✅ Clean, professional documentation structure
✅ Easy to navigate and find information
✅ No duplication or redundancy
✅ Better maintainability
✅ Improved user experience
---
This cleanup makes the project more professional and easier to use!

View File

@@ -1,353 +0,0 @@
# Content Extraction Strategies
The crawler uses multiple strategies to dynamically extract article metadata from any website.
## 🎯 What Gets Extracted
1. **Title** - Article headline
2. **Author** - Article writer/journalist
3. **Published Date** - When article was published
4. **Content** - Main article text
5. **Description** - Meta description/summary
## 📋 Extraction Strategies
### 1. Title Extraction
Tries multiple methods in order of reliability:
#### Strategy 1: H1 Tag
```html
<h1>Article Title Here</h1>
```
✅ Most reliable - usually the main headline
#### Strategy 2: Open Graph Meta Tag
```html
<meta property="og:title" content="Article Title Here" />
```
✅ Used by Facebook, very reliable
#### Strategy 3: Twitter Card Meta Tag
```html
<meta name="twitter:title" content="Article Title Here" />
```
✅ Used by Twitter, reliable
#### Strategy 4: Title Tag (Fallback)
```html
<title>Article Title | Site Name</title>
```
⚠️ Often includes site name, needs cleaning
**Cleaning:**
- Removes " | Site Name"
- Removes " - Site Name"
---
### 2. Author Extraction
Tries multiple methods:
#### Strategy 1: Meta Author Tag
```html
<meta name="author" content="John Doe" />
```
✅ Standard HTML meta tag
#### Strategy 2: Rel="author" Link
```html
<a rel="author" href="/author/john-doe">John Doe</a>
```
✅ Semantic HTML
#### Strategy 3: Common Class Names
```html
<div class="author-name">John Doe</div>
<span class="byline">By John Doe</span>
<p class="writer">John Doe</p>
```
✅ Searches for: author-name, author, byline, writer
#### Strategy 4: Schema.org Markup
```html
<span itemprop="author">John Doe</span>
```
✅ Structured data
#### Strategy 5: JSON-LD Structured Data
```html
<script type="application/ld+json">
{
"@type": "NewsArticle",
"author": {
"@type": "Person",
"name": "John Doe"
}
}
</script>
```
✅ Most structured, very reliable
**Cleaning:**
- Removes "By " prefix
- Validates length (< 100 chars)
---
### 3. Date Extraction
Tries multiple methods:
#### Strategy 1: Time Tag with Datetime
```html
<time datetime="2024-11-10T10:00:00Z">November 10, 2024</time>
```
✅ Most reliable - ISO format
#### Strategy 2: Article Published Time Meta
```html
<meta property="article:published_time" content="2024-11-10T10:00:00Z" />
```
✅ Open Graph standard
#### Strategy 3: OG Published Time
```html
<meta property="og:published_time" content="2024-11-10T10:00:00Z" />
```
✅ Facebook standard
#### Strategy 4: Common Class Names
```html
<span class="publish-date">November 10, 2024</span>
<time class="published">2024-11-10</time>
<div class="timestamp">10:00 AM, Nov 10</div>
```
✅ Searches for: publish-date, published, date, timestamp
#### Strategy 5: Schema.org Markup
```html
<meta itemprop="datePublished" content="2024-11-10T10:00:00Z" />
```
✅ Structured data
#### Strategy 6: JSON-LD Structured Data
```html
<script type="application/ld+json">
{
"@type": "NewsArticle",
"datePublished": "2024-11-10T10:00:00Z"
}
</script>
```
✅ Most structured
---
### 4. Content Extraction
Tries multiple methods:
#### Strategy 1: Semantic HTML Tags
```html
<article>
<p>Article content here...</p>
</article>
```
✅ Best practice HTML5
#### Strategy 2: Common Class Names
```html
<div class="article-content">...</div>
<div class="article-body">...</div>
<div class="post-content">...</div>
<div class="entry-content">...</div>
<div class="story-body">...</div>
```
✅ Searches for common patterns
#### Strategy 3: Schema.org Markup
```html
<div itemprop="articleBody">
<p>Content here...</p>
</div>
```
✅ Structured data
#### Strategy 4: Main Tag
```html
<main>
<p>Content here...</p>
</main>
```
✅ Semantic HTML5
#### Strategy 5: Body Tag (Fallback)
```html
<body>
<p>Content here...</p>
</body>
```
⚠️ Last resort, may include navigation
**Content Filtering:**
- Removes `<script>`, `<style>`, `<nav>`, `<footer>`, `<header>`, `<aside>`
- Filters out short paragraphs (< 50 chars) - likely ads/navigation
- Keeps only substantial paragraphs
- **No length limit** - stores full article content
---
## 🔍 How It Works
### Example: Crawling a News Article
```python
# 1. Fetch HTML
response = requests.get(article_url)
soup = BeautifulSoup(response.content, 'html.parser')
# 2. Extract title (tries 4 strategies)
title = extract_title(soup)
# Result: "New U-Bahn Line Opens in Munich"
# 3. Extract author (tries 5 strategies)
author = extract_author(soup)
# Result: "Max Mustermann"
# 4. Extract date (tries 6 strategies)
published_date = extract_date(soup)
# Result: "2024-11-10T10:00:00Z"
# 5. Extract content (tries 5 strategies)
content = extract_main_content(soup)
# Result: "The new U-Bahn line connecting..."
# 6. Save to database
article_doc = {
'title': title,
'author': author,
'published_at': published_date,
'full_content': content,
'word_count': len(content.split())
}
```
---
## 📊 Success Rates by Strategy
Based on common news sites:
| Strategy | Success Rate | Notes |
|----------|-------------|-------|
| H1 for title | 95% | Almost universal |
| OG meta tags | 90% | Most modern sites |
| Time tag for date | 85% | HTML5 sites |
| JSON-LD | 70% | Growing adoption |
| Class name patterns | 60% | Varies by site |
| Schema.org | 50% | Not widely adopted |
---
## 🎨 Real-World Examples
### Example 1: Süddeutsche Zeitung
```html
<article>
<h1>New U-Bahn Line Opens</h1>
<span class="author">Max Mustermann</span>
<time datetime="2024-11-10T10:00:00Z">10. November 2024</time>
<div class="article-body">
<p>The new U-Bahn line...</p>
</div>
</article>
```
✅ Extracts: Title (H1), Author (class), Date (time), Content (article-body)
### Example 2: Medium Blog
```html
<article>
<h1>How to Build a News Crawler</h1>
<meta property="og:title" content="How to Build a News Crawler" />
<meta property="article:published_time" content="2024-11-10T10:00:00Z" />
<a rel="author" href="/author">Jane Smith</a>
<section>
<p>In this article...</p>
</section>
</article>
```
✅ Extracts: Title (OG meta), Author (rel), Date (article meta), Content (section)
### Example 3: WordPress Blog
```html
<div class="post">
<h1 class="entry-title">My Blog Post</h1>
<span class="byline">By John Doe</span>
<time class="published">November 10, 2024</time>
<div class="entry-content">
<p>Blog content here...</p>
</div>
</div>
```
✅ Extracts: Title (H1), Author (byline), Date (published), Content (entry-content)
---
## ⚠️ Edge Cases Handled
1. **Missing Fields**: Returns `None` instead of crashing
2. **Multiple Authors**: Takes first one found
3. **Relative Dates**: Stores as-is ("2 hours ago")
4. **Paywalls**: Extracts what's available
5. **JavaScript-rendered**: Only gets server-side HTML
6. **Ads/Navigation**: Filtered out by paragraph length
7. **Site Name in Title**: Cleaned automatically
---
## 🚀 Future Improvements
Potential enhancements:
- [ ] JavaScript rendering (Selenium/Playwright)
- [ ] Paywall bypass (where legal)
- [ ] Image extraction
- [ ] Video detection
- [ ] Related articles
- [ ] Tags/categories
- [ ] Reading time estimation
- [ ] Language detection
- [ ] Sentiment analysis
---
## 🧪 Testing
Test the extraction on a specific URL:
```python
from crawler_service import extract_article_content
url = "https://www.sueddeutsche.de/muenchen/article-123"
data = extract_article_content(url)
print(f"Title: {data['title']}")
print(f"Author: {data['author']}")
print(f"Date: {data['published_date']}")
print(f"Content length: {len(data['content'])} chars")
print(f"Word count: {data['word_count']}")
```
---
## 📚 Standards Supported
- ✅ HTML5 semantic tags
- ✅ Open Graph Protocol
- ✅ Twitter Cards
- ✅ Schema.org microdata
- ✅ JSON-LD structured data
- ✅ Dublin Core metadata
- ✅ Common CSS class patterns

414
docs/FEATURES.md Normal file
View File

@@ -0,0 +1,414 @@
# Features Guide
Complete guide to Munich News Daily features.
---
## Core Features
### 1. Automated News Crawling
- Fetches articles from RSS feeds
- Scheduled daily at 6:00 AM Berlin time
- Extracts full article content
- Handles multiple news sources
### 2. AI-Powered Summarization
- Generates concise summaries (150 words)
- Uses Ollama AI (phi3:latest model)
- GPU acceleration available (5-10x faster)
- Configurable summary length
### 3. Title Translation
- Translates German titles to English
- Uses Ollama AI
- Displays both languages in newsletter
- Stores both versions in database
### 4. Newsletter Generation
- Beautiful HTML email template
- Responsive design
- Numbered articles
- Summary statistics
- Scheduled daily at 7:00 AM Berlin time
### 5. Engagement Tracking
- Email open tracking (pixel)
- Link click tracking
- Analytics dashboard ready
- Subscriber engagement metrics
---
## News Crawler
### How It Works
```
1. Fetch RSS feeds from database
2. Parse RSS XML
3. Extract article URLs
4. Fetch full article content
5. Extract text from HTML
6. Translate title (German → English)
7. Generate AI summary
8. Store in MongoDB
```
### Content Extraction
**Strategies (in order):**
1. **Article Tag** - Look for `<article>` tags
2. **Main Tag** - Look for `<main>` content
3. **Content Divs** - Common class names (content, article-body, etc.)
4. **Paragraph Aggregation** - Collect all `<p>` tags
5. **Fallback** - Use RSS description
**Cleaning:**
- Remove scripts and styles
- Remove navigation elements
- Remove ads and sidebars
- Extract clean text
- Preserve paragraphs
### RSS Feed Handling
**Supported Formats:**
- RSS 2.0
- Atom
- Custom formats
**Extracted Data:**
- Title
- Link
- Description/Summary
- Published date
- Author (if available)
**Error Handling:**
- Retry failed requests
- Skip invalid URLs
- Log errors
- Continue with next article
---
## AI Features
### Summarization
**Process:**
1. Send article text to Ollama
2. Request 150-word summary
3. Receive AI-generated summary
4. Store with article
**Configuration:**
```env
OLLAMA_ENABLED=true
OLLAMA_MODEL=phi3:latest
SUMMARY_MAX_WORDS=150
OLLAMA_TIMEOUT=120
```
**Performance:**
- CPU: ~8s per article
- GPU: ~2s per article (4x faster)
### Translation
**Process:**
1. Send German title to Ollama
2. Request English translation
3. Receive translated title
4. Store both versions
**Configuration:**
```env
OLLAMA_ENABLED=true
OLLAMA_MODEL=phi3:latest
```
**Performance:**
- CPU: ~1.5s per title
- GPU: ~0.3s per title (5x faster)
**Newsletter Display:**
```
English Title (Primary)
Original: German Title (Subtitle)
```
---
## Newsletter System
### Template Features
- **Responsive Design** - Works on all devices
- **Clean Layout** - Easy to read
- **Numbered Articles** - Clear organization
- **Summary Box** - Quick stats
- **Tracking Links** - Click tracking
- **Unsubscribe Link** - Easy opt-out
### Personalization
- Greeting message
- Date formatting
- Article count
- Source attribution
- Author names
### Tracking
**Open Tracking:**
- Invisible 1x1 pixel image
- Loaded when email opened
- Records timestamp
- Tracks unique opens
**Click Tracking:**
- All article links tracked
- Redirect through backend
- Records click events
- Tracks which articles clicked
---
## Subscriber Management
### Status System
| Status | Description | Receives Newsletters |
|--------|-------------|---------------------|
| `active` | Subscribed | ✅ Yes |
| `inactive` | Unsubscribed | ❌ No |
### Operations
**Subscribe:**
```bash
curl -X POST http://localhost:5001/api/subscribe \
-H "Content-Type: application/json" \
-d '{"email": "user@example.com"}'
```
**Unsubscribe:**
```bash
curl -X POST http://localhost:5001/api/unsubscribe \
-H "Content-Type: application/json" \
-d '{"email": "user@example.com"}'
```
**Check Stats:**
```bash
curl http://localhost:5001/api/admin/stats | jq '.subscribers'
```
---
## Admin Features
### Manual Crawl
Trigger crawl anytime:
```bash
curl -X POST http://localhost:5001/api/admin/trigger-crawl \
-H "Content-Type: application/json" \
-d '{"max_articles": 10}'
```
### Test Email
Send test newsletter:
```bash
curl -X POST http://localhost:5001/api/admin/send-test-email \
-H "Content-Type: application/json" \
-d '{"email": "test@example.com"}'
```
### Send Newsletter
Send to all subscribers:
```bash
curl -X POST http://localhost:5001/api/admin/send-newsletter \
-H "Content-Type: application/json" \
-d '{"max_articles": 10}'
```
### System Stats
View system statistics:
```bash
curl http://localhost:5001/api/admin/stats
```
---
## Automation
### Scheduled Tasks
**Crawler (6:00 AM Berlin time):**
- Fetches new articles
- Processes with AI
- Stores in database
**Sender (7:00 AM Berlin time):**
- Waits for crawler to finish
- Fetches today's articles
- Generates newsletter
- Sends to all active subscribers
### Manual Execution
```bash
# Run crawler manually
docker-compose exec crawler python crawler_service.py 10
# Run sender manually
docker-compose exec sender python sender_service.py send 10
# Send test email
docker-compose exec sender python sender_service.py test your@email.com
```
---
## Configuration
### Environment Variables
```env
# Newsletter Settings
NEWSLETTER_MAX_ARTICLES=10
NEWSLETTER_HOURS_LOOKBACK=24
WEBSITE_URL=http://localhost:3000
# Ollama AI
OLLAMA_ENABLED=true
OLLAMA_BASE_URL=http://ollama:11434
OLLAMA_MODEL=phi3:latest
OLLAMA_TIMEOUT=120
SUMMARY_MAX_WORDS=150
# Tracking
TRACKING_ENABLED=true
TRACKING_API_URL=http://localhost:5001
TRACKING_DATA_RETENTION_DAYS=90
```
### RSS Feeds
Add feeds in MongoDB:
```javascript
db.rss_feeds.insertOne({
name: "Süddeutsche Zeitung München",
url: "https://www.sueddeutsche.de/muenchen/rss",
active: true
})
```
---
## Performance Optimization
### GPU Acceleration
Enable for 5-10x faster processing:
```bash
./start-with-gpu.sh
```
**Benefits:**
- Faster summarization (8s → 2s)
- Faster translation (1.5s → 0.3s)
- Process more articles
- Lower CPU usage
### Batch Processing
Process multiple articles efficiently:
- Model stays loaded in memory
- Reduced overhead
- Better throughput
### Caching
- Model caching (Ollama)
- Database connection pooling
- Persistent storage
---
## Monitoring
### Logs
```bash
# Crawler logs
docker-compose logs -f crawler
# Sender logs
docker-compose logs -f sender
# Backend logs
docker-compose logs -f backend
```
### Metrics
- Articles crawled
- Summaries generated
- Newsletters sent
- Open rate
- Click-through rate
- Processing time
### Health Checks
```bash
# Backend health
curl http://localhost:5001/health
# System stats
curl http://localhost:5001/api/admin/stats
```
---
## Troubleshooting
### Crawler Issues
**No articles found:**
- Check RSS feed URLs
- Verify feeds are active
- Check network connectivity
**Extraction failed:**
- Article structure changed
- Paywall detected
- Network timeout
**AI processing failed:**
- Ollama not running
- Model not downloaded
- Timeout too short
### Newsletter Issues
**Not sending:**
- Check email configuration
- Verify SMTP credentials
- Check subscriber count
**Tracking not working:**
- Verify tracking enabled
- Check backend API accessible
- Verify tracking URLs
---
See [SETUP.md](SETUP.md) for configuration, [API.md](API.md) for API reference, and [ARCHITECTURE.md](ARCHITECTURE.md) for system design.

View File

@@ -1,116 +0,0 @@
# Documentation Index
## Quick Start
- [README](../README.md) - Project overview and quick start
- [QUICKSTART](../QUICKSTART.md) - Detailed 5-minute setup guide
## Setup & Configuration
- [OLLAMA_SETUP](OLLAMA_SETUP.md) - Ollama AI service setup
- [GPU_SETUP](GPU_SETUP.md) - GPU acceleration setup (5-10x faster)
- [DEPLOYMENT](DEPLOYMENT.md) - Production deployment guide
## API Documentation
- [ADMIN_API](ADMIN_API.md) - Admin endpoints (crawl, send newsletter)
- [API](API.md) - Public API endpoints
- [SUBSCRIBER_STATUS](SUBSCRIBER_STATUS.md) - Subscriber status system
## Architecture & Design
- [SYSTEM_ARCHITECTURE](SYSTEM_ARCHITECTURE.md) - Complete system architecture
- [ARCHITECTURE](ARCHITECTURE.md) - High-level architecture overview
- [DATABASE_SCHEMA](DATABASE_SCHEMA.md) - MongoDB schema and connection
- [BACKEND_STRUCTURE](BACKEND_STRUCTURE.md) - Backend code structure
## Features & How-To
- [CRAWLER_HOW_IT_WORKS](CRAWLER_HOW_IT_WORKS.md) - News crawler explained
- [EXTRACTION_STRATEGIES](EXTRACTION_STRATEGIES.md) - Content extraction
- [RSS_URL_EXTRACTION](RSS_URL_EXTRACTION.md) - RSS feed handling
- [PERFORMANCE_COMPARISON](PERFORMANCE_COMPARISON.md) - CPU vs GPU benchmarks
## Security
- [SECURITY_NOTES](SECURITY_NOTES.md) - Complete security guide
- Network isolation
- MongoDB security
- Ollama security
- Best practices
## Reference
- [CHANGELOG](CHANGELOG.md) - Version history and recent updates
- [QUICK_REFERENCE](QUICK_REFERENCE.md) - Command cheat sheet
## Contributing
- [CONTRIBUTING](../CONTRIBUTING.md) - How to contribute
---
## Documentation Organization
### Root Level (3 files)
Essential files that should be immediately visible:
- `README.md` - Main entry point
- `QUICKSTART.md` - Quick setup guide
- `CONTRIBUTING.md` - Contribution guidelines
### docs/ Directory (18 files)
All technical documentation organized by category:
- **Setup**: Ollama, GPU, Deployment
- **API**: Admin API, Public API, Subscriber system
- **Architecture**: System design, database, backend structure
- **Features**: Crawler, extraction, RSS handling
- **Security**: Complete security documentation
- **Reference**: Changelog, quick reference
---
## Quick Links by Task
### I want to...
**Set up the project:**
1. [README](../README.md) - Overview
2. [QUICKSTART](../QUICKSTART.md) - Step-by-step setup
**Enable GPU acceleration:**
1. [GPU_SETUP](GPU_SETUP.md) - Complete GPU guide
2. Run: `./start-with-gpu.sh`
**Send newsletters:**
1. [ADMIN_API](ADMIN_API.md) - API documentation
2. [SUBSCRIBER_STATUS](SUBSCRIBER_STATUS.md) - Subscriber system
**Understand the architecture:**
1. [SYSTEM_ARCHITECTURE](SYSTEM_ARCHITECTURE.md) - Complete overview
2. [DATABASE_SCHEMA](DATABASE_SCHEMA.md) - Database design
**Secure my deployment:**
1. [SECURITY_NOTES](SECURITY_NOTES.md) - Security guide
2. [DEPLOYMENT](DEPLOYMENT.md) - Production deployment
**Troubleshoot issues:**
1. [QUICK_REFERENCE](QUICK_REFERENCE.md) - Common commands
2. [OLLAMA_SETUP](OLLAMA_SETUP.md) - Ollama troubleshooting
3. [GPU_SETUP](GPU_SETUP.md) - GPU troubleshooting
---
## Documentation Standards
### File Naming
- Use UPPERCASE for main docs (README, QUICKSTART)
- Use Title_Case for technical docs (GPU_Setup, API_Reference)
- Use descriptive names (not DOC1, DOC2)
### Organization
- Root level: Only essential user-facing docs
- docs/: All technical documentation
- Keep related content together
### Content
- Start with overview/summary
- Include code examples
- Add troubleshooting sections
- Link to related docs
- Keep up to date
---
Last Updated: November 2025

View File

@@ -1,209 +0,0 @@
# Munich News Daily - Architecture
## System Overview
```
┌─────────────────────────────────────────────────────────────┐
│ Users / Browsers │
└────────────────────────┬────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Frontend (Port 3000) │
│ Node.js + Express + Vanilla JS │
│ - Subscription form │
│ - News display │
│ - RSS feed management UI (future) │
└────────────────────────┬────────────────────────────────────┘
│ HTTP/REST
┌─────────────────────────────────────────────────────────────┐
│ Backend API (Port 5001) │
│ Flask + Python │
│ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Routes (Blueprints) │ │
│ │ - subscription_routes.py (subscribe/unsubscribe) │ │
│ │ - news_routes.py (get news, stats) │ │
│ │ - rss_routes.py (manage RSS feeds) │ │
│ │ - ollama_routes.py (AI features) │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Services (Business Logic) │ │
│ │ - news_service.py (fetch & save articles) │ │
│ │ - email_service.py (send newsletters) │ │
│ │ - ollama_service.py (AI integration) │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Core │ │
│ │ - config.py (configuration) │ │
│ │ - database.py (DB connection) │ │
│ └──────────────────────────────────────────────────────┘ │
└────────────────────────┬────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ MongoDB (Port 27017) │
│ │
│ Collections: │
│ - articles (news articles with full content) │
│ - subscribers (email subscribers) │
│ - rss_feeds (RSS feed sources) │
└─────────────────────────┬───────────────────────────────────┘
│ Read/Write
┌─────────────────────────┴───────────────────────────────────┐
│ News Crawler Microservice │
│ (Standalone) │
│ │
│ - Fetches RSS feeds from MongoDB │
│ - Crawls full article content │
│ - Extracts text, metadata, word count │
│ - Stores back to MongoDB │
│ - Can run independently or scheduled │
└──────────────────────────────────────────────────────────────┘
│ (Optional)
┌─────────────────────────────────────────────────────────────┐
│ Ollama AI Server (Port 11434) │
│ (Optional, External) │
│ │
│ - Article summarization │
│ - Content analysis │
│ - AI-powered features │
└──────────────────────────────────────────────────────────────┘
```
## Component Details
### Frontend (Port 3000)
- **Technology**: Node.js, Express, Vanilla JavaScript
- **Responsibilities**:
- User interface
- Subscription management
- News display
- API proxy to backend
- **Communication**: HTTP REST to Backend
### Backend API (Port 5001)
- **Technology**: Python, Flask
- **Architecture**: Modular with Blueprints
- **Responsibilities**:
- REST API endpoints
- Business logic
- Database operations
- Email sending
- AI integration
- **Communication**:
- HTTP REST from Frontend
- MongoDB driver to Database
- HTTP to Ollama (optional)
### MongoDB (Port 27017)
- **Technology**: MongoDB 7.0
- **Responsibilities**:
- Persistent data storage
- Articles, subscribers, RSS feeds
- **Communication**: MongoDB protocol
### News Crawler (Standalone)
- **Technology**: Python, BeautifulSoup
- **Architecture**: Microservice (can run independently)
- **Responsibilities**:
- Fetch RSS feeds
- Crawl article content
- Extract and clean text
- Store in database
- **Communication**: MongoDB driver to Database
- **Execution**:
- Manual: `python crawler_service.py`
- Scheduled: Cron, systemd, Docker
- On-demand: Via backend API (future)
### Ollama AI Server (Optional, External)
- **Technology**: Ollama
- **Responsibilities**:
- AI model inference
- Text summarization
- Content analysis
- **Communication**: HTTP REST API
## Data Flow
### 1. News Aggregation Flow
```
RSS Feeds → Backend (news_service) → MongoDB (articles)
```
### 2. Content Crawling Flow
```
MongoDB (rss_feeds) → Crawler → Article URLs →
Web Scraping → MongoDB (articles with full_content)
```
### 3. Subscription Flow
```
User → Frontend → Backend (subscription_routes) →
MongoDB (subscribers)
```
### 4. Newsletter Flow (Future)
```
Scheduler → Backend (email_service) →
MongoDB (articles + subscribers) → SMTP → Users
```
### 5. AI Processing Flow (Optional)
```
MongoDB (articles) → Backend (ollama_service) →
Ollama Server → AI Summary → MongoDB (articles)
```
## Deployment Options
### Development
- All services run locally
- MongoDB via Docker Compose
- Manual crawler execution
### Production
- Backend: Cloud VM, Container, or PaaS
- Frontend: Static hosting or same server
- MongoDB: MongoDB Atlas or self-hosted
- Crawler: Scheduled job (cron, systemd timer)
- Ollama: Separate GPU server (optional)
## Scalability Considerations
### Current Architecture
- Monolithic backend (single Flask instance)
- Standalone crawler (can run multiple instances)
- Shared MongoDB
### Future Improvements
- Load balancer for backend
- Message queue for crawler jobs (Celery + Redis)
- Caching layer (Redis)
- CDN for frontend
- Read replicas for MongoDB
## Security
- CORS enabled for frontend-backend communication
- MongoDB authentication (production)
- Environment variables for secrets
- Input validation on all endpoints
- Rate limiting (future)
## Monitoring (Future)
- Application logs
- MongoDB metrics
- Crawler success/failure tracking
- API response times
- Error tracking (Sentry)

View File

@@ -1,222 +0,0 @@
# Performance Comparison: CPU vs GPU
## Overview
This document compares the performance of Ollama running on CPU vs GPU for the Munich News Daily system.
## Test Configuration
**Hardware:**
- CPU: Intel Core i7-10700K (8 cores, 16 threads)
- GPU: NVIDIA RTX 3060 (12GB VRAM)
- RAM: 32GB DDR4
**Model:** phi3:latest (2.3GB)
**Test:** Processing 10 news articles with translation and summarization
## Results
### Processing Time
```
CPU Processing:
├─ Model Load: 20s
├─ 10 Translations: 15s (1.5s each)
├─ 10 Summaries: 80s (8s each)
└─ Total: 115s
GPU Processing:
├─ Model Load: 8s
├─ 10 Translations: 3s (0.3s each)
├─ 10 Summaries: 20s (2s each)
└─ Total: 31s
Speedup: 3.7x faster with GPU
```
### Detailed Breakdown
| Operation | CPU Time | GPU Time | Speedup |
|-----------|----------|----------|---------|
| Model Load | 20s | 8s | 2.5x |
| Single Translation | 1.5s | 0.3s | 5.0x |
| Single Summary | 8s | 2s | 4.0x |
| 10 Articles (total) | 115s | 31s | 3.7x |
| 50 Articles (total) | 550s | 120s | 4.6x |
| 100 Articles (total) | 1100s | 220s | 5.0x |
### Resource Usage
**CPU Mode:**
- CPU Usage: 60-80% across all cores
- RAM Usage: 4-6GB
- GPU Usage: 0%
- Power Draw: ~65W
**GPU Mode:**
- CPU Usage: 10-20%
- RAM Usage: 2-3GB
- GPU Usage: 80-100%
- VRAM Usage: 3-4GB
- Power Draw: ~120W (GPU) + ~20W (CPU) = ~140W
## Scaling Analysis
### Daily Newsletter (10 articles)
**CPU:**
- Processing Time: ~2 minutes
- Energy Cost: ~0.002 kWh
- Suitable: ✓ Yes
**GPU:**
- Processing Time: ~30 seconds
- Energy Cost: ~0.001 kWh
- Suitable: ✓ Yes (overkill for small batches)
**Recommendation:** CPU is sufficient for daily newsletters with <20 articles.
### High Volume (100+ articles/day)
**CPU:**
- Processing Time: ~18 minutes
- Energy Cost: ~0.02 kWh
- Suitable: ⚠ Slow but workable
**GPU:**
- Processing Time: ~4 minutes
- Energy Cost: ~0.009 kWh
- Suitable: ✓ Yes (recommended)
**Recommendation:** GPU provides significant time savings for high-volume processing.
### Real-time Processing
**CPU:**
- Latency: 1.5s translation + 8s summary = 9.5s per article
- Throughput: ~6 articles/minute
- User Experience: ⚠ Noticeable delay
**GPU:**
- Latency: 0.3s translation + 2s summary = 2.3s per article
- Throughput: ~26 articles/minute
- User Experience: ✓ Fast, responsive
**Recommendation:** GPU is essential for real-time or interactive use cases.
## Cost Analysis
### Hardware Investment
**CPU-Only Setup:**
- Server: $500-1000
- Monthly Power: ~$5
- Total Year 1: ~$560-1060
**GPU Setup:**
- Server: $500-1000
- GPU (RTX 3060): $300-400
- Monthly Power: ~$8
- Total Year 1: ~$896-1496
**Break-even:** If processing >50 articles/day, GPU saves enough time to justify the cost.
### Cloud Deployment
**AWS (us-east-1):**
- CPU (t3.xlarge): $0.1664/hour = ~$120/month
- GPU (g4dn.xlarge): $0.526/hour = ~$380/month
**Cost per 1000 articles:**
- CPU: ~$3.60 (3 hours)
- GPU: ~$0.95 (1.8 hours)
**Break-even:** Processing >5000 articles/month makes GPU more cost-effective.
## Model Comparison
Different models have different performance characteristics:
### phi3:latest (Default)
| Metric | CPU | GPU | Speedup |
|--------|-----|-----|---------|
| Load Time | 20s | 8s | 2.5x |
| Translation | 1.5s | 0.3s | 5x |
| Summary | 8s | 2s | 4x |
| VRAM | N/A | 3-4GB | - |
### gemma2:2b (Lightweight)
| Metric | CPU | GPU | Speedup |
|--------|-----|-----|---------|
| Load Time | 10s | 4s | 2.5x |
| Translation | 0.8s | 0.2s | 4x |
| Summary | 4s | 1s | 4x |
| VRAM | N/A | 1.5GB | - |
### llama3.2:3b (High Quality)
| Metric | CPU | GPU | Speedup |
|--------|-----|-----|---------|
| Load Time | 30s | 12s | 2.5x |
| Translation | 2.5s | 0.5s | 5x |
| Summary | 12s | 3s | 4x |
| VRAM | N/A | 5-6GB | - |
## Recommendations
### Use CPU When:
- Processing <20 articles/day
- Budget-constrained
- GPU needed for other tasks
- Power efficiency is critical
- Simple deployment preferred
### Use GPU When:
- Processing >50 articles/day
- Real-time processing needed
- Multiple concurrent users
- Time is more valuable than cost
- Already have GPU hardware
### Hybrid Approach:
- Use CPU for scheduled daily newsletters
- Use GPU for on-demand/real-time requests
- Scale GPU instances up/down based on load
## Optimization Tips
### CPU Optimization:
1. Use smaller models (gemma2:2b)
2. Reduce summary length (100 words vs 150)
3. Process articles in batches
4. Use more CPU cores
5. Enable CPU-specific optimizations
### GPU Optimization:
1. Keep model loaded between requests
2. Batch multiple articles together
3. Use FP16 precision (automatic with GPU)
4. Enable concurrent requests
5. Use GPU with more VRAM for larger models
## Conclusion
**For Munich News Daily (10-20 articles/day):**
- CPU is sufficient and cost-effective
- GPU provides faster processing but may be overkill
- Recommendation: Start with CPU, upgrade to GPU if scaling up
**For High-Volume Operations (100+ articles/day):**
- GPU provides significant time and cost savings
- 4-5x faster processing
- Better user experience
- Recommendation: Use GPU from the start
**For Real-Time Applications:**
- GPU is essential for responsive experience
- Sub-second translation, 2-3s summaries
- Supports concurrent users
- Recommendation: GPU required

View File

@@ -1,243 +0,0 @@
# Quick Reference Guide
## Starting the Application
### 1. Start MongoDB
```bash
docker-compose up -d
```
### 2. Start Backend (Port 5001)
```bash
cd backend
source venv/bin/activate # or: venv\Scripts\activate on Windows
python app.py
```
### 3. Start Frontend (Port 3000)
```bash
cd frontend
npm start
```
### 4. Run Crawler (Optional)
```bash
cd news_crawler
pip install -r requirements.txt
python crawler_service.py 10
```
## Common Commands
### RSS Feed Management
**List all feeds:**
```bash
curl http://localhost:5001/api/rss-feeds
```
**Add a feed:**
```bash
curl -X POST http://localhost:5001/api/rss-feeds \
-H "Content-Type: application/json" \
-d '{"name": "Feed Name", "url": "https://example.com/rss"}'
```
**Remove a feed:**
```bash
curl -X DELETE http://localhost:5001/api/rss-feeds/<feed_id>
```
**Toggle feed status:**
```bash
curl -X PATCH http://localhost:5001/api/rss-feeds/<feed_id>/toggle
```
### News & Subscriptions
**Get latest news:**
```bash
curl http://localhost:5001/api/news
```
**Subscribe:**
```bash
curl -X POST http://localhost:5001/api/subscribe \
-H "Content-Type: application/json" \
-d '{"email": "user@example.com"}'
```
**Get stats:**
```bash
curl http://localhost:5001/api/stats
```
### Ollama (AI)
**Test connection:**
```bash
curl http://localhost:5001/api/ollama/ping
```
**List models:**
```bash
curl http://localhost:5001/api/ollama/models
```
### Email Tracking & Analytics
**Get newsletter metrics:**
```bash
curl http://localhost:5001/api/analytics/newsletter/<newsletter_id>
```
**Get article performance:**
```bash
curl http://localhost:5001/api/analytics/article/<article_id>
```
**Get subscriber activity:**
```bash
curl http://localhost:5001/api/analytics/subscriber/<email>
```
**Delete subscriber tracking data:**
```bash
curl -X DELETE http://localhost:5001/api/tracking/subscriber/<email>
```
**Anonymize old tracking data:**
```bash
curl -X POST http://localhost:5001/api/tracking/anonymize
```
### Database
**Connect to MongoDB:**
```bash
mongosh
use munich_news
```
**Check articles:**
```javascript
db.articles.find().limit(5)
db.articles.countDocuments()
db.articles.countDocuments({full_content: {$exists: true}})
```
**Check subscribers:**
```javascript
db.subscribers.find()
db.subscribers.countDocuments({status: "active"})
```
**Check RSS feeds:**
```javascript
db.rss_feeds.find()
```
**Check tracking data:**
```javascript
db.newsletter_sends.find().limit(5)
db.link_clicks.find().limit(5)
db.subscriber_activity.find()
```
## File Locations
### Configuration
- Backend: `backend/.env`
- Frontend: `frontend/package.json`
- Crawler: Uses backend's `.env` or own `.env`
### Logs
- Backend: Terminal output
- Frontend: Terminal output
- Crawler: Terminal output
### Database
- MongoDB data: Docker volume `mongodb_data`
- Database name: `munich_news`
## Ports
| Service | Port | URL |
|---------|------|-----|
| Frontend | 3000 | http://localhost:3000 |
| Backend | 5001 | http://localhost:5001 |
| MongoDB | 27017 | mongodb://localhost:27017 |
| Ollama | 11434 | http://localhost:11434 |
## Troubleshooting
### Backend won't start
- Check if port 5001 is available
- Verify MongoDB is running
- Check `.env` file exists
### Frontend can't connect
- Verify backend is running on port 5001
- Check CORS settings
- Check API_URL in frontend
### Crawler fails
- Install dependencies: `pip install -r requirements.txt`
- Check MongoDB connection
- Verify RSS feeds exist in database
### MongoDB connection error
- Start MongoDB: `docker-compose up -d`
- Check connection string in `.env`
- Verify port 27017 is not blocked
### Port 5000 conflict (macOS)
- AirPlay uses port 5000
- Use port 5001 instead (set in `.env`)
- Or disable AirPlay Receiver in System Preferences
## Project Structure
```
munich-news/
├── backend/ # Main API (Flask)
├── frontend/ # Web UI (Express + JS)
├── news_crawler/ # Crawler microservice
├── .env # Environment variables
└── docker-compose.yml # MongoDB setup
```
## Environment Variables
### Backend (.env)
```env
MONGODB_URI=mongodb://localhost:27017/
FLASK_PORT=5001
SMTP_SERVER=smtp.gmail.com
SMTP_PORT=587
EMAIL_USER=your-email@gmail.com
EMAIL_PASSWORD=your-app-password
OLLAMA_BASE_URL=http://127.0.0.1:11434
OLLAMA_MODEL=phi3:latest
OLLAMA_ENABLED=true
TRACKING_ENABLED=true
TRACKING_API_URL=http://localhost:5001
TRACKING_DATA_RETENTION_DAYS=90
```
## Development Workflow
1. **Add RSS Feed** → Backend API
2. **Run Crawler** → Fetches full content
3. **View News** → Frontend displays articles
4. **Users Subscribe** → Via frontend form
5. **Send Newsletter** → Manual or scheduled
## Useful Links
- Frontend: http://localhost:3000
- Backend API: http://localhost:5001
- MongoDB: mongodb://localhost:27017
- Architecture: See `ARCHITECTURE.md`
- Backend Structure: See `backend/STRUCTURE.md`
- Crawler Guide: See `news_crawler/README.md`

299
docs/REFERENCE.md Normal file
View File

@@ -0,0 +1,299 @@
# Quick Reference
Essential commands and information.
---
## Quick Start
```bash
# Start services
./start-with-gpu.sh # With GPU auto-detection
docker-compose up -d # Without GPU
# Stop services
docker-compose down
# View logs
docker-compose logs -f
docker-compose logs -f crawler # Specific service
# Restart
docker-compose restart
```
---
## Docker Commands
```bash
# Service status
docker-compose ps
# Rebuild
docker-compose up -d --build
# Execute command in container
docker-compose exec backend python -c "print('hello')"
docker-compose exec crawler python crawler_service.py 2
# View container logs
docker logs munich-news-backend
docker logs munich-news-crawler
# Resource usage
docker stats
```
---
## API Commands
```bash
# Health check
curl http://localhost:5001/health
# System stats
curl http://localhost:5001/api/admin/stats
# Trigger crawl
curl -X POST http://localhost:5001/api/admin/trigger-crawl \
-H "Content-Type: application/json" \
-d '{"max_articles": 5}'
# Send test email
curl -X POST http://localhost:5001/api/admin/send-test-email \
-H "Content-Type: application/json" \
-d '{"email": "test@example.com"}'
# Send newsletter to all
curl -X POST http://localhost:5001/api/admin/send-newsletter \
-H "Content-Type: application/json" \
-d '{"max_articles": 10}'
```
---
## GPU Commands
```bash
# Check GPU availability
./check-gpu.sh
# Start with GPU
./start-with-gpu.sh
docker-compose -f docker-compose.yml -f docker-compose.gpu.yml up -d
# Check GPU usage
docker exec munich-news-ollama nvidia-smi
# Monitor GPU
watch -n 1 'docker exec munich-news-ollama nvidia-smi'
```
---
## MongoDB Commands
```bash
# Access MongoDB shell
docker-compose exec mongodb mongosh munich_news -u admin -p changeme --authenticationDatabase admin
# Count documents
db.articles.countDocuments({})
db.subscribers.countDocuments({status: 'active'})
# Find articles
db.articles.find().limit(5).pretty()
# Clear articles
db.articles.deleteMany({})
# Add subscriber
db.subscribers.insertOne({
email: "user@example.com",
subscribed_at: new Date(),
status: "active"
})
```
---
## Testing Commands
```bash
# Test Ollama setup
./test-ollama-setup.sh
# Test MongoDB connectivity
./test-mongodb-connectivity.sh
# Test newsletter API
./test-newsletter-api.sh
# Test crawl (2 articles)
docker-compose exec crawler python crawler_service.py 2
```
---
## Configuration
### Single .env File
Location: `backend/.env`
**Key Settings:**
```env
# MongoDB (Docker service name)
MONGODB_URI=mongodb://admin:changeme@mongodb:27017/
# Email
SMTP_SERVER=smtp.gmail.com
SMTP_PORT=587
EMAIL_USER=your@email.com
EMAIL_PASSWORD=your-password
# Ollama (Internal Docker network)
OLLAMA_ENABLED=true
OLLAMA_BASE_URL=http://ollama:11434
OLLAMA_MODEL=phi3:latest
```
---
## Port Exposure
| Service | Port | Exposed | Access |
|---------|------|---------|--------|
| Backend | 5001 | ✅ Yes | Host, External |
| MongoDB | 27017 | ❌ No | Internal only |
| Ollama | 11434 | ❌ No | Internal only |
| Crawler | - | ❌ No | Internal only |
| Sender | - | ❌ No | Internal only |
**Verify:**
```bash
docker ps --format "table {{.Names}}\t{{.Ports}}"
```
---
## Performance
### CPU Mode
- Translation: ~1.5s per title
- Summarization: ~8s per article
- 10 Articles: ~115s
### GPU Mode (5-10x faster)
- Translation: ~0.3s per title
- Summarization: ~2s per article
- 10 Articles: ~31s
---
## Troubleshooting
### Service Won't Start
```bash
docker-compose logs <service-name>
docker-compose restart <service-name>
docker-compose up -d --build <service-name>
```
### MongoDB Connection Issues
```bash
# Check service
docker-compose ps mongodb
# Test connection
docker-compose exec backend python -c "from database import articles_collection; print(articles_collection.count_documents({}))"
```
### Ollama Issues
```bash
# Check model
docker-compose exec ollama ollama list
# Pull model manually
docker-compose exec ollama ollama pull phi3:latest
# Check logs
docker-compose logs ollama
```
### GPU Not Working
```bash
# Check GPU
nvidia-smi
# Check Docker GPU access
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi
# Check Ollama GPU
docker exec munich-news-ollama nvidia-smi
```
---
## Recent Updates (November 2025)
### New Features
- ✅ GPU acceleration (5-10x faster)
- ✅ Integrated Ollama service
- ✅ Send newsletter to all subscribers API
- ✅ Article title translation (German → English)
- ✅ Enhanced security (network isolation)
### Security Improvements
- MongoDB internal-only (not exposed)
- Ollama internal-only (not exposed)
- Only Backend API exposed (port 5001)
- 66% reduction in attack surface
### Configuration Changes
- MongoDB URI uses `mongodb` (not `localhost`)
- Ollama URL uses `http://ollama:11434`
- Single `.env` file in `backend/`
### New Scripts
- `start-with-gpu.sh` - Auto-detect GPU and start
- `check-gpu.sh` - Check GPU availability
- `test-ollama-setup.sh` - Test Ollama
- `test-mongodb-connectivity.sh` - Test MongoDB
- `test-newsletter-api.sh` - Test newsletter API
---
## Changelog
### November 2025
- Added GPU acceleration support
- Integrated Ollama into Docker Compose
- Added newsletter API endpoint
- Improved network security
- Added article title translation
- Consolidated documentation
- Added helper scripts
### Key Changes
- Ollama now runs in Docker (no external server needed)
- MongoDB and Ollama are internal-only
- GPU support with automatic detection
- Subscriber status system documented
- All docs consolidated and updated
---
## Links
- **Setup Guide**: [SETUP.md](SETUP.md)
- **API Reference**: [API.md](API.md)
- **Architecture**: [ARCHITECTURE.md](ARCHITECTURE.md)
- **Security**: [SECURITY.md](SECURITY.md)
- **Features**: [FEATURES.md](FEATURES.md)
---
**Last Updated:** November 2025

View File

@@ -1,194 +0,0 @@
# RSS URL Extraction - How It Works
## The Problem
Different RSS feed providers use different fields to store the article URL:
### Example 1: Standard RSS (uses `link`)
```xml
<item>
<title>Article Title</title>
<link>https://example.com/article/123</link>
<guid>internal-id-456</guid>
</item>
```
### Example 2: Some feeds (uses `guid` as URL)
```xml
<item>
<title>Article Title</title>
<guid>https://example.com/article/123</guid>
</item>
```
### Example 3: Atom feeds (uses `id`)
```xml
<entry>
<title>Article Title</title>
<id>https://example.com/article/123</id>
</entry>
```
### Example 4: Complex feeds (guid as object)
```xml
<item>
<title>Article Title</title>
<guid isPermaLink="true">https://example.com/article/123</guid>
</item>
```
### Example 5: Multiple links
```xml
<item>
<title>Article Title</title>
<link rel="alternate" type="text/html" href="https://example.com/article/123"/>
<link rel="enclosure" type="image/jpeg" href="https://example.com/image.jpg"/>
</item>
```
## Our Solution
The `extract_article_url()` function tries multiple strategies in order:
### Strategy 1: Check `link` field (most common)
```python
if entry.get('link') and entry.get('link', '').startswith('http'):
return entry.get('link')
```
✅ Works for: Most RSS 2.0 feeds
### Strategy 2: Check `guid` field
```python
if entry.get('guid'):
guid = entry.get('guid')
# guid can be a string
if isinstance(guid, str) and guid.startswith('http'):
return guid
# or a dict with 'href'
elif isinstance(guid, dict) and guid.get('href', '').startswith('http'):
return guid.get('href')
```
✅ Works for: Feeds that use GUID as permalink
### Strategy 3: Check `id` field
```python
if entry.get('id') and entry.get('id', '').startswith('http'):
return entry.get('id')
```
✅ Works for: Atom feeds
### Strategy 4: Check `links` array
```python
if entry.get('links'):
for link in entry.get('links', []):
if isinstance(link, dict) and link.get('href', '').startswith('http'):
# Prefer 'alternate' type
if link.get('type') == 'text/html' or link.get('rel') == 'alternate':
return link.get('href')
```
✅ Works for: Feeds with multiple links (prefers HTML content)
## Real-World Examples
### Süddeutsche Zeitung
```python
entry = {
'title': 'Munich News',
'link': 'https://www.sueddeutsche.de/muenchen/article-123',
'guid': 'sz-internal-123'
}
# Returns: 'https://www.sueddeutsche.de/muenchen/article-123'
```
### Medium Blog
```python
entry = {
'title': 'Blog Post',
'guid': 'https://medium.com/@user/post-abc123',
'link': None
}
# Returns: 'https://medium.com/@user/post-abc123'
```
### YouTube RSS
```python
entry = {
'title': 'Video Title',
'id': 'https://www.youtube.com/watch?v=abc123',
'link': None
}
# Returns: 'https://www.youtube.com/watch?v=abc123'
```
### Complex Feed
```python
entry = {
'title': 'Article',
'links': [
{'rel': 'alternate', 'type': 'text/html', 'href': 'https://example.com/article'},
{'rel': 'enclosure', 'type': 'image/jpeg', 'href': 'https://example.com/image.jpg'}
]
}
# Returns: 'https://example.com/article' (prefers text/html)
```
## Validation
All extracted URLs must:
1. Start with `http://` or `https://`
2. Be a valid string (not None or empty)
If no valid URL is found:
```python
return None
# Crawler will skip this entry and log a warning
```
## Testing Different Feeds
To test if a feed works with our extractor:
```python
import feedparser
from rss_utils import extract_article_url
# Parse feed
feed = feedparser.parse('https://example.com/rss')
# Test each entry
for entry in feed.entries[:5]:
url = extract_article_url(entry)
if url:
print(f"{entry.get('title', 'No title')[:50]}")
print(f" URL: {url}")
else:
print(f"{entry.get('title', 'No title')[:50]}")
print(f" No valid URL found")
print(f" Available fields: {list(entry.keys())}")
```
## Supported Feed Types
✅ RSS 2.0
✅ RSS 1.0
✅ Atom
✅ Custom RSS variants
✅ Feeds with multiple links
✅ Feeds with GUID as permalink
## Edge Cases Handled
1. **GUID is not a URL**: Checks if it starts with `http`
2. **Multiple links**: Prefers `text/html` type
3. **GUID as dict**: Extracts `href` field
4. **Missing fields**: Returns None instead of crashing
5. **Non-HTTP URLs**: Filters out `mailto:`, `ftp:`, etc.
## Future Improvements
Potential enhancements:
- [ ] Support for `feedburner:origLink`
- [ ] Support for `pheedo:origLink`
- [ ] Resolve shortened URLs (bit.ly, etc.)
- [ ] Handle relative URLs (convert to absolute)
- [ ] Cache URL extraction results

253
docs/SETUP.md Normal file
View File

@@ -0,0 +1,253 @@
# Complete Setup Guide
## Quick Start
```bash
# 1. Configure
cp backend/.env.example backend/.env
# Edit backend/.env with your email settings
# 2. Start (with GPU auto-detection)
./start-with-gpu.sh
# 3. Wait for model download (first time, ~2-5 min)
docker-compose logs -f ollama-setup
```
---
## Prerequisites
- Docker & Docker Compose
- 4GB+ RAM
- (Optional) NVIDIA GPU for 5-10x faster AI
---
## Configuration
### Environment File
Edit `backend/.env`:
```env
# Email (Required)
SMTP_SERVER=smtp.gmail.com
SMTP_PORT=587
EMAIL_USER=your-email@gmail.com
EMAIL_PASSWORD=your-app-password
# MongoDB (Docker service name)
MONGODB_URI=mongodb://admin:changeme@mongodb:27017/
# Ollama AI (Internal Docker network)
OLLAMA_ENABLED=true
OLLAMA_BASE_URL=http://ollama:11434
OLLAMA_MODEL=phi3:latest
```
---
## Ollama Setup
### Integrated Docker Compose (Recommended)
Ollama runs automatically with Docker Compose:
- Automatic model download (phi3:latest, 2.2GB)
- Internal-only access (secure)
- Persistent storage
- GPU support available
**Start:**
```bash
docker-compose up -d
```
**Verify:**
```bash
docker-compose exec ollama ollama list
# Should show: phi3:latest
```
---
## GPU Acceleration (5-10x Faster)
### Check GPU Availability
```bash
./check-gpu.sh
```
### Start with GPU
```bash
# Auto-detect and start
./start-with-gpu.sh
# Or manually
docker-compose -f docker-compose.yml -f docker-compose.gpu.yml up -d
```
### Verify GPU Usage
```bash
# Check GPU
docker exec munich-news-ollama nvidia-smi
# Monitor during processing
watch -n 1 'docker exec munich-news-ollama nvidia-smi'
```
### Performance
| Operation | CPU | GPU | Speedup |
|-----------|-----|-----|---------|
| Translation | 1.5s | 0.3s | 5x |
| Summary | 8s | 2s | 4x |
| 10 Articles | 115s | 31s | 3.7x |
### GPU Requirements
- NVIDIA GPU (GTX 1060 or newer)
- 4GB+ VRAM for phi3:latest
- NVIDIA drivers (525.60.13+)
- NVIDIA Container Toolkit
**Install NVIDIA Container Toolkit (Ubuntu/Debian):**
```bash
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```
---
## Production Deployment
### Security Checklist
- [ ] Change MongoDB password
- [ ] Use strong email password
- [ ] Bind backend to localhost only: `127.0.0.1:5001:5001`
- [ ] Set up reverse proxy (nginx/Traefik)
- [ ] Enable HTTPS
- [ ] Set up firewall rules
- [ ] Regular backups
- [ ] Monitor logs
### Reverse Proxy (nginx)
```nginx
server {
listen 80;
server_name your-domain.com;
location / {
proxy_pass http://localhost:5001;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
}
}
```
### Environment Variables
For production, use environment variables instead of .env:
```bash
export MONGODB_URI="mongodb://user:pass@mongodb:27017/"
export SMTP_SERVER="smtp.gmail.com"
export EMAIL_PASSWORD="secure-password"
```
### Monitoring
```bash
# Check service health
docker-compose ps
# View logs
docker-compose logs -f
# Check resource usage
docker stats
```
---
## Troubleshooting
### Ollama Issues
**Model not downloading:**
```bash
docker-compose logs ollama-setup
docker-compose exec ollama ollama pull phi3:latest
```
**Out of memory:**
- Use smaller model: `OLLAMA_MODEL=gemma2:2b`
- Increase Docker memory limit
### GPU Issues
**GPU not detected:**
```bash
nvidia-smi # Check drivers
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi # Check Docker
```
**Out of VRAM:**
- Use smaller model
- Close other GPU applications
### MongoDB Issues
**Connection refused:**
- Check service is running: `docker-compose ps`
- Verify URI uses `mongodb` not `localhost`
### Email Issues
**Authentication failed:**
- Use app-specific password (Gmail)
- Check SMTP settings
- Verify credentials
---
## Testing
```bash
# Test Ollama
./test-ollama-setup.sh
# Test MongoDB
./test-mongodb-connectivity.sh
# Test newsletter
./test-newsletter-api.sh
# Test crawl
docker-compose exec crawler python crawler_service.py 2
```
---
## Next Steps
1. Add RSS feeds (see QUICKSTART.md)
2. Add subscribers
3. Test newsletter sending
4. Set up monitoring
5. Configure backups
See [API.md](API.md) for API reference.

View File

@@ -1,290 +0,0 @@
# Subscriber Status System
## Overview
The newsletter system tracks subscribers with a `status` field that determines whether they receive newsletters.
## Status Field
### Database Schema
```javascript
{
_id: ObjectId("..."),
email: "user@example.com",
subscribed_at: ISODate("2025-11-11T15:50:29.478Z"),
status: "active" // or "inactive"
}
```
### Status Values
| Status | Description | Receives Newsletters |
|--------|-------------|---------------------|
| `active` | Subscribed and active | ✅ Yes |
| `inactive` | Unsubscribed | ❌ No |
## How It Works
### Subscription Flow
```
User subscribes
POST /api/subscribe
Create subscriber with status: 'active'
User receives newsletters
```
### Unsubscription Flow
```
User unsubscribes
POST /api/unsubscribe
Update subscriber status: 'inactive'
User stops receiving newsletters
```
### Re-subscription Flow
```
Previously unsubscribed user subscribes again
POST /api/subscribe
Update status: 'active' + new subscribed_at date
User receives newsletters again
```
## API Endpoints
### Subscribe
```bash
curl -X POST http://localhost:5001/api/subscribe \
-H "Content-Type: application/json" \
-d '{"email": "user@example.com"}'
```
**Creates subscriber with:**
- `email`: user@example.com
- `status`: "active"
- `subscribed_at`: current timestamp
### Unsubscribe
```bash
curl -X POST http://localhost:5001/api/unsubscribe \
-H "Content-Type: application/json" \
-d '{"email": "user@example.com"}'
```
**Updates subscriber:**
- `status`: "inactive"
## Newsletter Sending
### Who Receives Newsletters
Only subscribers with `status: 'active'` receive newsletters.
**Sender Service Query:**
```python
subscribers_collection.find({'status': 'active'})
```
**Admin API Query:**
```python
subscribers_collection.count_documents({'status': 'active'})
```
### Testing
```bash
# Check active subscriber count
curl http://localhost:5001/api/admin/stats | jq '.subscribers'
# Output:
# {
# "total": 10,
# "active": 8
# }
```
## Database Operations
### Add Active Subscriber
```javascript
db.subscribers.insertOne({
email: "user@example.com",
subscribed_at: new Date(),
status: "active"
})
```
### Deactivate Subscriber
```javascript
db.subscribers.updateOne(
{ email: "user@example.com" },
{ $set: { status: "inactive" } }
)
```
### Reactivate Subscriber
```javascript
db.subscribers.updateOne(
{ email: "user@example.com" },
{ $set: {
status: "active",
subscribed_at: new Date()
}}
)
```
### Query Active Subscribers
```javascript
db.subscribers.find({ status: "active" })
```
### Count Active Subscribers
```javascript
db.subscribers.countDocuments({ status: "active" })
```
## Common Issues
### Issue: Stats show 0 active subscribers but subscribers exist
**Cause:** Old bug where stats checked `{active: true}` instead of `{status: 'active'}`
**Solution:** Fixed in latest version. Stats now correctly query `{status: 'active'}`
**Verify:**
```bash
# Check database directly
docker-compose exec mongodb mongosh munich_news -u admin -p changeme \
--authenticationDatabase admin \
--eval "db.subscribers.find({status: 'active'}).count()"
# Check via API
curl http://localhost:5001/api/admin/stats | jq '.subscribers.active'
```
### Issue: Newsletter not sending to subscribers
**Possible causes:**
1. Subscribers have `status: 'inactive'`
2. No subscribers in database
3. Email configuration issue
**Debug:**
```bash
# Check subscriber status
docker-compose exec mongodb mongosh munich_news -u admin -p changeme \
--authenticationDatabase admin \
--eval "db.subscribers.find().pretty()"
# Check active count
curl http://localhost:5001/api/admin/stats | jq '.subscribers'
# Try sending
curl -X POST http://localhost:5001/api/admin/send-newsletter \
-H "Content-Type: application/json"
```
## Migration Notes
### If you have old subscribers without status field
Run this migration:
```javascript
// Set all subscribers without status to 'active'
db.subscribers.updateMany(
{ status: { $exists: false } },
{ $set: { status: "active" } }
)
```
### If you have subscribers with `active: true/false` field
Run this migration:
```javascript
// Convert old 'active' field to 'status' field
db.subscribers.updateMany(
{ active: true },
{ $set: { status: "active" }, $unset: { active: "" } }
)
db.subscribers.updateMany(
{ active: false },
{ $set: { status: "inactive" }, $unset: { active: "" } }
)
```
## Best Practices
### 1. Always Check Status
When querying subscribers for sending:
```python
# ✅ Correct
subscribers_collection.find({'status': 'active'})
# ❌ Wrong
subscribers_collection.find({}) # Includes inactive
```
### 2. Soft Delete
Never delete subscribers - just set status to 'inactive':
```python
# ✅ Correct - preserves history
subscribers_collection.update_one(
{'email': email},
{'$set': {'status': 'inactive'}}
)
# ❌ Wrong - loses data
subscribers_collection.delete_one({'email': email})
```
### 3. Track Subscription History
Consider adding fields:
```javascript
{
email: "user@example.com",
status: "active",
subscribed_at: ISODate("2025-01-01"),
unsubscribed_at: null, // Set when status changes to inactive
resubscribed_count: 0 // Increment on re-subscription
}
```
### 4. Validate Before Sending
```python
# Check subscriber count before sending
count = subscribers_collection.count_documents({'status': 'active'})
if count == 0:
return {'error': 'No active subscribers'}
```
## Related Documentation
- [ADMIN_API.md](ADMIN_API.md) - Admin API endpoints
- [DATABASE_SCHEMA.md](DATABASE_SCHEMA.md) - Complete database schema
- [NEWSLETTER_API_UPDATE.md](../NEWSLETTER_API_UPDATE.md) - Newsletter API changes

View File

@@ -1,412 +0,0 @@
# Munich News Daily - System Architecture
## 📊 Complete System Overview
```
┌─────────────────────────────────────────────────────────────────┐
│ Munich News Daily System │
│ Fully Automated Pipeline │
└─────────────────────────────────────────────────────────────────┘
Daily Schedule
┌──────────────────────┐
│ 6:00 AM Berlin │
│ News Crawler │
└──────────┬───────────┘
┌──────────────────────────────────────────────────────────────────┐
│ News Crawler │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐│
│ │ Fetch RSS │→ │ Extract │→ │ Summarize │→ │ Save to ││
│ │ Feeds │ │ Content │ │ with AI │ │ MongoDB ││
│ └────────────┘ └────────────┘ └────────────┘ └────────────┘│
│ │
│ Sources: Süddeutsche, Merkur, BR24, etc. │
│ Output: Full articles + AI summaries │
└──────────────────────────────────────────────────────────────────┘
│ Articles saved
┌──────────────────────┐
│ MongoDB │
│ (Data Storage) │
└──────────┬───────────┘
│ Wait for crawler
┌──────────────────────┐
│ 7:00 AM Berlin │
│ Newsletter Sender │
└──────────┬───────────┘
┌──────────────────────────────────────────────────────────────────┐
│ Newsletter Sender │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐│
│ │ Wait for │→ │ Fetch │→ │ Generate │→ │ Send to ││
│ │ Crawler │ │ Articles │ │ Newsletter │ │ Subscribers││
│ └────────────┘ └────────────┘ └────────────┘ └────────────┘│
│ │
│ Features: Tracking pixels, link tracking, HTML templates │
│ Output: Personalized newsletters with engagement tracking │
└──────────────────────────────────────────────────────────────────┘
│ Emails sent
┌──────────────────────┐
│ Subscribers │
│ (Email Inboxes) │
└──────────┬───────────┘
│ Opens & clicks
┌──────────────────────┐
│ Tracking System │
│ (Analytics API) │
└──────────────────────┘
```
## 🔄 Data Flow
### 1. Content Acquisition (6:00 AM)
```
RSS Feeds → Crawler → Full Content → AI Summary → MongoDB
```
**Details**:
- Fetches from multiple RSS sources
- Extracts full article text
- Generates concise summaries using Ollama
- Stores with metadata (author, date, source)
### 2. Newsletter Generation (7:00 AM)
```
MongoDB → Articles → Template → HTML → Email
```
**Details**:
- Waits for crawler to finish (max 30 min)
- Fetches today's articles with summaries
- Applies Jinja2 template
- Injects tracking pixels
- Replaces links with tracking URLs
### 3. Engagement Tracking (Ongoing)
```
Email Open → Pixel Load → Log Event → Analytics
Link Click → Redirect → Log Event → Analytics
```
**Details**:
- Tracks email opens via 1x1 pixel
- Tracks link clicks via redirect URLs
- Stores engagement data in MongoDB
- Provides analytics API
## 🏗️ Component Architecture
### Docker Containers
```
┌─────────────────────────────────────────────────────────┐
│ Docker Network │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ MongoDB │ │ Crawler │ │ Sender │ │
│ │ │ │ │ │ │ │
│ │ Port: 27017 │←─│ Schedule: │←─│ Schedule: │ │
│ │ │ │ 6:00 AM │ │ 7:00 AM │ │
│ │ Storage: │ │ │ │ │ │
│ │ - articles │ │ Depends on: │ │ Depends on: │ │
│ │ - subscribers│ │ - MongoDB │ │ - MongoDB │ │
│ │ - tracking │ │ │ │ - Crawler │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ All containers auto-restart on failure │
│ All use Europe/Berlin timezone │
└─────────────────────────────────────────────────────────┘
```
### Backend Services
```
┌─────────────────────────────────────────────────────────┐
│ Backend Services │
│ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Flask API (Port 5001) │ │
│ │ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ │
│ │ │ Tracking │ │ Analytics │ │ Privacy │ │ │
│ │ │ Endpoints │ │ Endpoints │ │ Endpoints │ │ │
│ │ └────────────┘ └────────────┘ └────────────┘ │ │
│ └──────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Services Layer │ │
│ │ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ │
│ │ │ Tracking │ │ Analytics │ │ Ollama │ │ │
│ │ │ Service │ │ Service │ │ Client │ │ │
│ │ └────────────┘ └────────────┘ └────────────┘ │ │
│ └──────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
```
## 📅 Daily Timeline
```
Time (Berlin) │ Event │ Duration
───────────────┼──────────────────────────┼──────────
05:59:59 │ System idle │ -
06:00:00 │ Crawler starts │ ~10-20 min
06:00:01 │ - Fetch RSS feeds │
06:02:00 │ - Extract content │
06:05:00 │ - Generate summaries │
06:15:00 │ - Save to MongoDB │
06:20:00 │ Crawler finishes │
06:20:01 │ System idle │ ~40 min
07:00:00 │ Sender starts │ ~5-10 min
07:00:01 │ - Wait for crawler │ (checks every 30s)
07:00:30 │ - Crawler confirmed done │
07:00:31 │ - Fetch articles │
07:01:00 │ - Generate newsletters │
07:02:00 │ - Send to subscribers │
07:10:00 │ Sender finishes │
07:10:01 │ System idle │ Until tomorrow
```
## 🔐 Security & Privacy
### Data Protection
```
┌─────────────────────────────────────────────────────────┐
│ Privacy Features │
│ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Data Retention │ │
│ │ - Personal data: 90 days │ │
│ │ - Anonymization: Automatic │ │
│ │ - Deletion: On request │ │
│ └──────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ User Rights │ │
│ │ - Opt-out: Anytime │ │
│ │ - Data access: API available │ │
│ │ - Data deletion: Full removal │ │
│ └──────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Compliance │ │
│ │ - GDPR compliant │ │
│ │ - Privacy notice in emails │ │
│ │ - Transparent tracking │ │
│ └──────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
```
## 📊 Database Schema
### Collections
```
MongoDB (munich_news)
├── articles
│ ├── title
│ ├── author
│ ├── content (full text)
│ ├── summary (AI generated)
│ ├── link
│ ├── source
│ ├── published_at
│ └── crawled_at
├── subscribers
│ ├── email
│ ├── active
│ ├── tracking_enabled
│ └── subscribed_at
├── rss_feeds
│ ├── name
│ ├── url
│ └── active
├── newsletter_sends
│ ├── tracking_id
│ ├── newsletter_id
│ ├── subscriber_email
│ ├── opened
│ ├── first_opened_at
│ └── open_count
├── link_clicks
│ ├── tracking_id
│ ├── newsletter_id
│ ├── subscriber_email
│ ├── article_url
│ ├── clicked
│ └── clicked_at
└── subscriber_activity
├── email
├── status (active/inactive/dormant)
├── last_opened_at
├── last_clicked_at
├── total_opens
└── total_clicks
```
## 🚀 Deployment Architecture
### Development
```
Local Machine
├── Docker Compose
│ ├── MongoDB (no auth)
│ ├── Crawler
│ └── Sender
├── Backend (manual start)
│ └── Flask API
└── Ollama (optional)
└── AI Summarization
```
### Production
```
Server
├── Docker Compose (prod)
│ ├── MongoDB (with auth)
│ ├── Crawler
│ └── Sender
├── Backend (systemd/pm2)
│ └── Flask API (HTTPS)
├── Ollama (optional)
│ └── AI Summarization
└── Nginx (reverse proxy)
└── SSL/TLS
```
## 🔄 Coordination Mechanism
### Crawler-Sender Synchronization
```
┌─────────────────────────────────────────────────────────┐
│ Coordination Flow │
│ │
│ 6:00 AM → Crawler starts │
│ ↓ │
│ Crawling articles... │
│ ↓ │
│ Saves to MongoDB │
│ ↓ │
│ 6:20 AM → Crawler finishes │
│ ↓ │
│ 7:00 AM → Sender starts │
│ ↓ │
│ Check: Recent articles? ──→ No ──┐ │
│ ↓ Yes │ │
│ Proceed with send │ │
│ │ │
│ ← Wait 30s ← Wait 30s ← Wait 30s┘ │
│ (max 30 minutes) │
│ │
│ 7:10 AM → Newsletter sent │
└─────────────────────────────────────────────────────────┘
```
## 📈 Monitoring & Observability
### Key Metrics
```
┌─────────────────────────────────────────────────────────┐
│ Metrics to Monitor │
│ │
│ Crawler: │
│ - Articles crawled per day │
│ - Crawl duration │
│ - Success/failure rate │
│ - Summary generation rate │
│ │
│ Sender: │
│ - Newsletters sent per day │
│ - Send duration │
│ - Success/failure rate │
│ - Wait time for crawler │
│ │
│ Engagement: │
│ - Open rate │
│ - Click-through rate │
│ - Active subscribers │
│ - Dormant subscribers │
│ │
│ System: │
│ - Container uptime │
│ - Database size │
│ - Error rate │
│ - Response times │
└─────────────────────────────────────────────────────────┘
```
## 🛠️ Maintenance Tasks
### Daily
- Check logs for errors
- Verify newsletters sent
- Monitor engagement metrics
### Weekly
- Review article quality
- Check subscriber growth
- Analyze engagement trends
### Monthly
- Archive old articles
- Clean up dormant subscribers
- Update dependencies
- Review system performance
## 📚 Technology Stack
```
┌─────────────────────────────────────────────────────────┐
│ Technology Stack │
│ │
│ Backend: │
│ - Python 3.11 │
│ - Flask (API) │
│ - PyMongo (Database) │
│ - Schedule (Automation) │
│ - Jinja2 (Templates) │
│ - BeautifulSoup (Parsing) │
│ │
│ Database: │
│ - MongoDB 7.0 │
│ │
│ AI/ML: │
│ - Ollama (Summarization) │
│ - Phi3 Model (default) │
│ │
│ Infrastructure: │
│ - Docker & Docker Compose │
│ - Linux (Ubuntu/Debian) │
│ │
│ Email: │
│ - SMTP (configurable) │
│ - HTML emails with tracking │
└─────────────────────────────────────────────────────────┘
```
---
**Last Updated**: 2024-01-16
**Version**: 1.0
**Status**: Production Ready ✅

View File

@@ -0,0 +1,246 @@
"""
Article Clustering Module
Detects and groups similar articles from different sources using Ollama AI
"""
from difflib import SequenceMatcher
from datetime import datetime, timedelta
from typing import List, Dict, Optional
from ollama_client import OllamaClient
class ArticleClusterer:
"""
Clusters articles about the same story from different sources using Ollama AI
"""
def __init__(self, ollama_client: OllamaClient, similarity_threshold=0.75, time_window_hours=24):
"""
Initialize clusterer
Args:
ollama_client: OllamaClient instance for AI-based similarity detection
similarity_threshold: Minimum similarity to consider articles as same story (0-1)
time_window_hours: Time window to look for similar articles
"""
self.ollama_client = ollama_client
self.similarity_threshold = similarity_threshold
self.time_window_hours = time_window_hours
def normalize_title(self, title: str) -> str:
"""
Normalize title for comparison
Args:
title: Article title
Returns:
Normalized title (lowercase, stripped)
"""
return title.lower().strip()
def simple_stem(self, word: str) -> str:
"""
Simple German word stemming (remove common suffixes)
Args:
word: Word to stem
Returns:
Stemmed word
"""
# Remove common German suffixes
suffixes = ['ungen', 'ung', 'en', 'er', 'e', 'n', 's']
for suffix in suffixes:
if len(word) > 5 and word.endswith(suffix):
return word[:-len(suffix)]
return word
def extract_keywords(self, text: str) -> set:
"""
Extract important keywords from text with simple stemming
Args:
text: Article title or content
Returns:
Set of stemmed keywords
"""
# Common German stop words to ignore
stop_words = {
'der', 'die', 'das', 'den', 'dem', 'des', 'ein', 'eine', 'einer', 'eines',
'und', 'oder', 'aber', 'in', 'im', 'am', 'um', 'für', 'von', 'zu', 'nach',
'bei', 'mit', 'auf', 'an', 'aus', 'über', 'unter', 'gegen', 'durch',
'ist', 'sind', 'war', 'waren', 'hat', 'haben', 'wird', 'werden', 'wurde', 'wurden',
'neue', 'neuer', 'neues', 'neuen', 'sich', 'auch', 'nicht', 'nur', 'noch',
'mehr', 'als', 'wie', 'beim', 'zum', 'zur', 'vom', 'ins', 'ans'
}
# Normalize and split
words = text.lower().strip().split()
# Filter out stop words, short words, and apply stemming
keywords = set()
for word in words:
# Remove punctuation
word = ''.join(c for c in word if c.isalnum() or c == '-')
if len(word) > 3 and word not in stop_words:
# Apply simple stemming
stemmed = self.simple_stem(word)
keywords.add(stemmed)
return keywords
def check_same_story_with_ai(self, article1: Dict, article2: Dict) -> bool:
"""
Use Ollama AI to determine if two articles are about the same story
Args:
article1: First article
article2: Second article
Returns:
True if same story, False otherwise
"""
if not self.ollama_client.enabled:
# Fallback to keyword-based similarity
return self.calculate_similarity(article1, article2) >= self.similarity_threshold
title1 = article1.get('title', '')
title2 = article2.get('title', '')
content1 = article1.get('content', '')[:300] # First 300 chars
content2 = article2.get('content', '')[:300]
prompt = f"""Compare these two news articles and determine if they are about the SAME story/event.
Article 1:
Title: {title1}
Content: {content1}
Article 2:
Title: {title2}
Content: {content2}
Answer with ONLY "YES" if they are about the same story/event, or "NO" if they are different stories.
Consider them the same story if they report on the same event, even if from different perspectives.
Answer:"""
try:
response = self.ollama_client.generate(prompt, max_tokens=10)
answer = response.get('text', '').strip().upper()
return 'YES' in answer
except Exception as e:
print(f" ⚠ AI clustering failed: {e}, using fallback")
# Fallback to keyword-based similarity
return self.calculate_similarity(article1, article2) >= self.similarity_threshold
def calculate_similarity(self, article1: Dict, article2: Dict) -> float:
"""
Calculate similarity between two articles using title and content
Args:
article1: First article (dict with 'title' and optionally 'content')
article2: Second article (dict with 'title' and optionally 'content')
Returns:
Similarity score (0-1)
"""
title1 = article1.get('title', '')
title2 = article2.get('title', '')
content1 = article1.get('content', '')
content2 = article2.get('content', '')
# Extract keywords from titles
title_keywords1 = self.extract_keywords(title1)
title_keywords2 = self.extract_keywords(title2)
# Calculate title similarity
if title_keywords1 and title_keywords2:
title_intersection = title_keywords1.intersection(title_keywords2)
title_union = title_keywords1.union(title_keywords2)
title_similarity = len(title_intersection) / len(title_union) if title_union else 0
else:
# Fallback to string similarity
t1 = self.normalize_title(title1)
t2 = self.normalize_title(title2)
title_similarity = SequenceMatcher(None, t1, t2).ratio()
# If we have content, use it for better accuracy
if content1 and content2:
# Extract keywords from first 500 chars of content (for performance)
content_keywords1 = self.extract_keywords(content1[:500])
content_keywords2 = self.extract_keywords(content2[:500])
if content_keywords1 and content_keywords2:
content_intersection = content_keywords1.intersection(content_keywords2)
content_union = content_keywords1.union(content_keywords2)
content_similarity = len(content_intersection) / len(content_union) if content_union else 0
# Weighted average: title (40%) + content (60%)
return (title_similarity * 0.4) + (content_similarity * 0.6)
# If no content, use only title similarity
return title_similarity
def find_cluster(self, article: Dict, existing_articles: List[Dict]) -> Optional[str]:
"""
Find if article belongs to an existing cluster using AI
Args:
article: New article to cluster (dict with 'title' and optionally 'content')
existing_articles: List of existing articles
Returns:
cluster_id if found, None otherwise
"""
cutoff_time = datetime.utcnow() - timedelta(hours=self.time_window_hours)
for existing in existing_articles:
# Only compare recent articles
published_at = existing.get('published_at')
if published_at and published_at < cutoff_time:
continue
# Use AI to check if same story
if self.check_same_story_with_ai(article, existing):
return existing.get('cluster_id', str(existing.get('_id')))
return None
def cluster_article(self, article: Dict, existing_articles: List[Dict]) -> Dict:
"""
Cluster a single article
Args:
article: Article to cluster
existing_articles: List of existing articles
Returns:
Article with cluster_id and is_primary fields
"""
cluster_id = self.find_cluster(article, existing_articles)
if cluster_id:
# Add to existing cluster
article['cluster_id'] = cluster_id
article['is_primary'] = False
else:
# Create new cluster
article['cluster_id'] = str(article.get('_id', datetime.utcnow().timestamp()))
article['is_primary'] = True
return article
def get_cluster_articles(self, cluster_id: str, articles_collection) -> List[Dict]:
"""
Get all articles in a cluster
Args:
cluster_id: Cluster ID
articles_collection: MongoDB collection
Returns:
List of articles in the cluster
"""
return list(articles_collection.find({'cluster_id': cluster_id}))

View File

@@ -0,0 +1,213 @@
"""
Cluster Summarizer Module
Generates neutral summaries from multiple clustered articles
"""
from typing import List, Dict, Optional
from datetime import datetime
from ollama_client import OllamaClient
class ClusterSummarizer:
"""
Generates neutral summaries by synthesizing multiple articles about the same story
"""
def __init__(self, ollama_client: OllamaClient, max_words=200):
"""
Initialize cluster summarizer
Args:
ollama_client: OllamaClient instance for AI-based summarization
max_words: Maximum words in neutral summary
"""
self.ollama_client = ollama_client
self.max_words = max_words
def generate_neutral_summary(self, articles: List[Dict]) -> Dict:
"""
Generate a neutral summary from multiple articles about the same story
Args:
articles: List of article dicts with 'title', 'content', 'source'
Returns:
{
'neutral_summary': str,
'sources': list,
'article_count': int,
'success': bool,
'error': str or None,
'duration': float
}
"""
if not articles or len(articles) == 0:
return {
'neutral_summary': None,
'sources': [],
'article_count': 0,
'success': False,
'error': 'No articles provided',
'duration': 0
}
# If only one article, return its summary
if len(articles) == 1:
return {
'neutral_summary': articles[0].get('summary', articles[0].get('content', '')[:500]),
'sources': [articles[0].get('source', 'unknown')],
'article_count': 1,
'success': True,
'error': None,
'duration': 0
}
# Build combined context from all articles
combined_context = self._build_combined_context(articles)
# Generate neutral summary using AI
prompt = self._build_neutral_summary_prompt(combined_context, len(articles))
result = self.ollama_client.generate(prompt, max_tokens=300)
if result['success']:
return {
'neutral_summary': result['text'],
'sources': list(set(a.get('source', 'unknown') for a in articles)),
'article_count': len(articles),
'success': True,
'error': None,
'duration': result['duration']
}
else:
return {
'neutral_summary': None,
'sources': list(set(a.get('source', 'unknown') for a in articles)),
'article_count': len(articles),
'success': False,
'error': result['error'],
'duration': result['duration']
}
def _build_combined_context(self, articles: List[Dict]) -> str:
"""Build combined context from multiple articles"""
context_parts = []
for i, article in enumerate(articles, 1):
source = article.get('source', 'Unknown')
title = article.get('title', 'No title')
# Use summary if available, otherwise use first 500 chars of content
content = article.get('summary') or article.get('content', '')[:500]
context_parts.append(f"Source {i} ({source}):\nTitle: {title}\nContent: {content}")
return "\n\n".join(context_parts)
def _build_neutral_summary_prompt(self, combined_context: str, article_count: int) -> str:
"""Build prompt for neutral summary generation"""
prompt = f"""You are a neutral news aggregator. You have {article_count} articles from different sources about the same story. Your task is to create a single, balanced summary that:
1. Combines information from all sources
2. Remains neutral and objective
3. Highlights key facts that all sources agree on
4. Notes any significant differences in perspective (if any)
5. Is written in clear, professional English
6. Is approximately {self.max_words} words
Here are the articles:
{combined_context}
Write a neutral summary in English that synthesizes these perspectives:"""
return prompt
def create_cluster_summaries(db, ollama_client: OllamaClient, cluster_ids: Optional[List[str]] = None):
"""
Create or update neutral summaries for article clusters
Args:
db: MongoDB database instance
ollama_client: OllamaClient instance
cluster_ids: Optional list of specific cluster IDs to process. If None, processes all clusters.
Returns:
{
'processed': int,
'succeeded': int,
'failed': int,
'errors': list
}
"""
summarizer = ClusterSummarizer(ollama_client, max_words=200)
# Find clusters to process
if cluster_ids:
clusters_to_process = cluster_ids
else:
# Get all cluster IDs with multiple articles
pipeline = [
{"$match": {"cluster_id": {"$exists": True}}},
{"$group": {"_id": "$cluster_id", "count": {"$sum": 1}}},
{"$match": {"count": {"$gt": 1}}},
{"$project": {"_id": 1}}
]
clusters_to_process = [c['_id'] for c in db.articles.aggregate(pipeline)]
processed = 0
succeeded = 0
failed = 0
errors = []
for cluster_id in clusters_to_process:
try:
# Get all articles in this cluster
articles = list(db.articles.find({"cluster_id": cluster_id}))
if len(articles) < 2:
continue
print(f"Processing cluster {cluster_id}: {len(articles)} articles")
# Generate neutral summary
result = summarizer.generate_neutral_summary(articles)
processed += 1
if result['success']:
# Save cluster summary
db.cluster_summaries.update_one(
{"cluster_id": cluster_id},
{
"$set": {
"cluster_id": cluster_id,
"neutral_summary": result['neutral_summary'],
"sources": result['sources'],
"article_count": result['article_count'],
"created_at": datetime.utcnow(),
"updated_at": datetime.utcnow()
}
},
upsert=True
)
succeeded += 1
print(f" ✓ Generated neutral summary ({len(result['neutral_summary'])} chars)")
else:
failed += 1
error_msg = f"Cluster {cluster_id}: {result['error']}"
errors.append(error_msg)
print(f" ✗ Failed: {result['error']}")
except Exception as e:
failed += 1
error_msg = f"Cluster {cluster_id}: {str(e)}"
errors.append(error_msg)
print(f" ✗ Error: {e}")
return {
'processed': processed,
'succeeded': succeeded,
'failed': failed,
'errors': errors
}

View File

@@ -13,6 +13,8 @@ from dotenv import load_dotenv
from rss_utils import extract_article_url, extract_article_summary, extract_published_date
from config import Config
from ollama_client import OllamaClient
from article_clustering import ArticleClusterer
from cluster_summarizer import create_cluster_summaries
# Load environment variables
load_dotenv(dotenv_path='../.env')
@@ -33,6 +35,9 @@ ollama_client = OllamaClient(
timeout=Config.OLLAMA_TIMEOUT
)
# Initialize Article Clusterer (will be initialized after ollama_client)
article_clusterer = None
# Print configuration on startup
if __name__ != '__main__':
Config.print_config()
@@ -45,6 +50,14 @@ if __name__ != '__main__':
else:
print(" Ollama AI summarization: DISABLED")
# Initialize Article Clusterer with ollama_client
article_clusterer = ArticleClusterer(
ollama_client=ollama_client,
similarity_threshold=0.60, # Not used when AI is enabled
time_window_hours=24 # Look back 24 hours
)
print("🔗 Article clustering: ENABLED (AI-powered)")
def get_active_rss_feeds():
"""Get all active RSS feeds from database"""
@@ -394,6 +407,13 @@ def crawl_rss_feed(feed_url, feed_name, feed_category='general', max_articles=10
'created_at': datetime.utcnow()
}
# Cluster article with existing articles (detect duplicates from other sources)
from datetime import timedelta
recent_articles = list(articles_collection.find({
'published_at': {'$gte': datetime.utcnow() - timedelta(hours=24)}
}))
article_doc = article_clusterer.cluster_article(article_doc, recent_articles)
try:
# Upsert: update if exists, insert if not
articles_collection.update_one(
@@ -434,6 +454,16 @@ def crawl_all_feeds(max_articles_per_feed=10):
Crawl all active RSS feeds
Returns: dict with statistics
"""
global article_clusterer
# Initialize clusterer if not already done
if article_clusterer is None:
article_clusterer = ArticleClusterer(
ollama_client=ollama_client,
similarity_threshold=0.60,
time_window_hours=24
)
print("\n" + "="*60)
print("🚀 Starting RSS Feed Crawler")
print("="*60)
@@ -485,12 +515,29 @@ def crawl_all_feeds(max_articles_per_feed=10):
print(f" Average time per article: {duration/total_crawled:.1f}s")
print("="*60 + "\n")
# Generate neutral summaries for clustered articles
cluster_summary_stats = {'processed': 0, 'succeeded': 0, 'failed': 0}
if Config.OLLAMA_ENABLED and total_crawled > 0:
print("\n" + "="*60)
print("🔄 Generating Neutral Summaries for Clustered Articles")
print("="*60)
cluster_summary_stats = create_cluster_summaries(db, ollama_client)
print("\n" + "="*60)
print(f"✓ Cluster Summarization Complete!")
print(f" Clusters processed: {cluster_summary_stats['processed']}")
print(f" Succeeded: {cluster_summary_stats['succeeded']}")
print(f" Failed: {cluster_summary_stats['failed']}")
print("="*60 + "\n")
return {
'total_feeds': len(feeds),
'total_articles_crawled': total_crawled,
'total_summarized': total_summarized,
'failed_summaries': total_failed,
'duration_seconds': round(duration, 2)
'duration_seconds': round(duration, 2),
'cluster_summaries': cluster_summary_stats
}

View File

@@ -392,6 +392,80 @@ English Summary (max {max_words} words):"""
'error': str(e)
}
def generate(self, prompt, max_tokens=100):
"""
Generate text using Ollama
Args:
prompt: Text prompt
max_tokens: Maximum tokens to generate
Returns:
{
'text': str, # Generated text
'success': bool, # Whether generation succeeded
'error': str or None, # Error message if failed
'duration': float # Time taken in seconds
}
"""
if not self.enabled:
return {
'text': '',
'success': False,
'error': 'Ollama is disabled',
'duration': 0
}
start_time = time.time()
try:
response = requests.post(
f"{self.base_url}/api/generate",
json={
"model": self.model,
"prompt": prompt,
"stream": False,
"options": {
"num_predict": max_tokens,
"temperature": 0.1 # Low temperature for consistent answers
}
},
timeout=self.timeout
)
duration = time.time() - start_time
if response.status_code == 200:
result = response.json()
return {
'text': result.get('response', '').strip(),
'success': True,
'error': None,
'duration': duration
}
else:
return {
'text': '',
'success': False,
'error': f"HTTP {response.status_code}: {response.text}",
'duration': duration
}
except requests.exceptions.Timeout:
return {
'text': '',
'success': False,
'error': f"Request timed out after {self.timeout}s",
'duration': time.time() - start_time
}
except Exception as e:
return {
'text': '',
'success': False,
'error': str(e),
'duration': time.time() - start_time
}
if __name__ == '__main__':
# Quick test

110
tests/crawler/README.md Normal file
View File

@@ -0,0 +1,110 @@
# Crawler Tests
Test suite for the news crawler, AI clustering, and neutral summary generation.
## Test Files
### AI Clustering & Aggregation Tests
- **`test_clustering_real.py`** - Tests AI-powered article clustering with realistic fake articles
- **`test_neutral_summaries.py`** - Tests neutral summary generation from clustered articles
- **`test_complete_workflow.py`** - End-to-end test of clustering + neutral summaries
### Core Crawler Tests
- **`test_crawler.py`** - Basic crawler functionality
- **`test_ollama.py`** - Ollama AI integration tests
- **`test_rss_feeds.py`** - RSS feed parsing tests
## Running Tests
### Run All Tests
```bash
# From project root
docker-compose exec crawler python -m pytest tests/crawler/
```
### Run Specific Test
```bash
# AI clustering test
docker-compose exec crawler python tests/crawler/test_clustering_real.py
# Neutral summaries test
docker-compose exec crawler python tests/crawler/test_neutral_summaries.py
# Complete workflow test
docker-compose exec crawler python tests/crawler/test_complete_workflow.py
```
### Run Tests Inside Container
```bash
# Enter container
docker-compose exec crawler bash
# Run tests
python test_clustering_real.py
python test_neutral_summaries.py
python test_complete_workflow.py
```
## Test Data
Tests use fake articles to avoid depending on external RSS feeds:
**Test Scenarios:**
1. **Same story, different sources** - Should cluster together
2. **Different stories** - Should remain separate
3. **Multi-source clustering** - Should generate neutral summaries
**Expected Results:**
- Housing story (2 sources) → Cluster together → Neutral summary
- Bayern transfer (2 sources) → Cluster together → Neutral summary
- Single-source stories → Individual summaries
## Cleanup
Tests create temporary data in MongoDB. To clean up:
```bash
# Clean test articles
docker-compose exec crawler python << 'EOF'
from pymongo import MongoClient
client = MongoClient("mongodb://admin:changeme@mongodb:27017/")
db = client["munich_news"]
db.articles.delete_many({"link": {"$regex": "^https://example.com/"}})
db.cluster_summaries.delete_many({})
print("✓ Test data cleaned")
EOF
```
## Requirements
- Docker containers must be running
- Ollama service must be available
- MongoDB must be accessible
- AI model (phi3:latest) must be downloaded
## Troubleshooting
### Ollama Not Available
```bash
# Check Ollama status
docker-compose logs ollama
# Restart Ollama
docker-compose restart ollama
```
### Tests Timing Out
- Increase timeout in test files (default: 60s)
- Check Ollama model is downloaded
- Verify GPU acceleration if enabled
### MongoDB Connection Issues
```bash
# Check MongoDB status
docker-compose logs mongodb
# Restart MongoDB
docker-compose restart mongodb
```

View File

@@ -0,0 +1,166 @@
#!/usr/bin/env python3
"""
Test AI clustering with realistic fake articles
"""
from pymongo import MongoClient
from datetime import datetime
import sys
# Connect to MongoDB
client = MongoClient("mongodb://admin:changeme@mongodb:27017/")
db = client["munich_news"]
# Create test articles about the same Munich story from different sources
test_articles = [
{
"title": "München: Stadtrat beschließt neue Regelungen für Wohnungsbau",
"content": """Der Münchner Stadtrat hat am Dienstag neue Regelungen für den Wohnungsbau beschlossen.
Die Maßnahmen sollen den Bau von bezahlbarem Wohnraum in der bayerischen Landeshauptstadt fördern.
Oberbürgermeister Dieter Reiter (SPD) sprach von einem wichtigen Schritt zur Lösung der Wohnungskrise.
Die neuen Regelungen sehen vor, dass bei Neubauprojekten mindestens 40 Prozent der Wohnungen
als Sozialwohnungen gebaut werden müssen. Zudem werden Bauvorschriften vereinfacht.""",
"source": "abendzeitung-muenchen",
"link": "https://example.com/az-wohnungsbau-1",
"published_at": datetime.utcnow(),
"category": "local",
"word_count": 85
},
{
"title": "Stadtrat München stimmt für neue Wohnungsbau-Verordnung",
"content": """In einer Sitzung am Dienstag stimmte der Münchner Stadtrat für neue Wohnungsbau-Verordnungen.
Die Beschlüsse zielen darauf ab, mehr bezahlbaren Wohnraum in München zu schaffen.
OB Reiter bezeichnete die Entscheidung als Meilenstein im Kampf gegen die Wohnungsnot.
Künftig müssen 40 Prozent aller Neubauwohnungen als Sozialwohnungen errichtet werden.
Außerdem werden bürokratische Hürden beim Bauen abgebaut.""",
"source": "sueddeutsche",
"link": "https://example.com/sz-wohnungsbau-1",
"published_at": datetime.utcnow(),
"category": "local",
"word_count": 72
},
{
"title": "FC Bayern München verpflichtet neuen Stürmer aus Brasilien",
"content": """Der FC Bayern München hat einen neuen Stürmer verpflichtet. Der 23-jährige Brasilianer
wechselt für eine Ablösesumme von 50 Millionen Euro nach München. Sportdirektor Christoph Freund
zeigte sich begeistert von der Verpflichtung. Der Spieler soll die Offensive verstärken.""",
"source": "abendzeitung-muenchen",
"link": "https://example.com/az-bayern-1",
"published_at": datetime.utcnow(),
"category": "sports",
"word_count": 52
},
{
"title": "Bayern München holt brasilianischen Angreifer",
"content": """Der deutsche Rekordmeister Bayern München hat einen brasilianischen Stürmer unter Vertrag genommen.
Für 50 Millionen Euro wechselt der 23-Jährige an die Isar. Sportdirektor Freund lobte den Transfer.
Der Neuzugang soll die Münchner Offensive beleben und für mehr Torgefahr sorgen.""",
"source": "sueddeutsche",
"link": "https://example.com/sz-bayern-1",
"published_at": datetime.utcnow(),
"category": "sports",
"word_count": 48
}
]
print("Testing AI Clustering with Realistic Articles")
print("=" * 70)
print()
# Clear previous test articles
print("Cleaning up previous test articles...")
db.articles.delete_many({"link": {"$regex": "^https://example.com/"}})
print("✓ Cleaned up")
print()
# Import clustering
sys.path.insert(0, '/app')
from ollama_client import OllamaClient
from article_clustering import ArticleClusterer
from config import Config
# Initialize
ollama_client = OllamaClient(
base_url=Config.OLLAMA_BASE_URL,
model=Config.OLLAMA_MODEL,
enabled=Config.OLLAMA_ENABLED,
timeout=30
)
clusterer = ArticleClusterer(
ollama_client=ollama_client,
similarity_threshold=0.50,
time_window_hours=24
)
print("Processing articles with AI clustering...")
print()
clustered_articles = []
for i, article in enumerate(test_articles, 1):
print(f"{i}. Processing: {article['title'][:60]}...")
print(f" Source: {article['source']}")
# Cluster with previously processed articles
clustered = clusterer.cluster_article(article, clustered_articles)
clustered_articles.append(clustered)
print(f" → Cluster ID: {clustered['cluster_id']}")
print(f" → Is Primary: {clustered['is_primary']}")
# Insert into database
db.articles.insert_one(clustered)
print(f" ✓ Saved to database")
print()
print("=" * 70)
print("Clustering Results:")
print()
# Analyze results
clusters = {}
for article in clustered_articles:
cluster_id = article['cluster_id']
if cluster_id not in clusters:
clusters[cluster_id] = []
clusters[cluster_id].append(article)
for cluster_id, articles in clusters.items():
print(f"Cluster {cluster_id}: {len(articles)} article(s)")
for article in articles:
print(f" - [{article['source']}] {article['title'][:60]}...")
print()
# Expected results
print("=" * 70)
print("Expected Results:")
print(" ✓ Articles 1&2 should be in same cluster (housing story)")
print(" ✓ Articles 3&4 should be in same cluster (Bayern transfer)")
print(" ✓ Total: 2 clusters with 2 articles each")
print()
# Actual results
housing_cluster = [a for a in clustered_articles if 'Wohnungsbau' in a['title'] or 'Wohnungsbau' in a['title']]
bayern_cluster = [a for a in clustered_articles if 'Bayern' in a['title'] or 'Stürmer' in a['title']]
housing_cluster_ids = set(a['cluster_id'] for a in housing_cluster)
bayern_cluster_ids = set(a['cluster_id'] for a in bayern_cluster)
print("Actual Results:")
if len(housing_cluster_ids) == 1:
print(" ✓ Housing articles clustered together")
else:
print(f" ✗ Housing articles in {len(housing_cluster_ids)} different clusters")
if len(bayern_cluster_ids) == 1:
print(" ✓ Bayern articles clustered together")
else:
print(f" ✗ Bayern articles in {len(bayern_cluster_ids)} different clusters")
if len(clusters) == 2:
print(" ✓ Total clusters: 2 (correct)")
else:
print(f" ✗ Total clusters: {len(clusters)} (expected 2)")
print()
print("=" * 70)
print("✓ Test complete! Check the results above.")

View File

@@ -0,0 +1,187 @@
#!/usr/bin/env python3
"""
Complete workflow test: Clustering + Neutral Summaries
"""
from pymongo import MongoClient
from datetime import datetime
import sys
client = MongoClient("mongodb://admin:changeme@mongodb:27017/")
db = client["munich_news"]
print("=" * 70)
print("COMPLETE WORKFLOW TEST: AI Clustering + Neutral Summaries")
print("=" * 70)
print()
# Clean up previous test
print("1. Cleaning up previous test data...")
db.articles.delete_many({"link": {"$regex": "^https://example.com/"}})
db.cluster_summaries.delete_many({"cluster_id": {"$regex": "^test_"}})
print(" ✓ Cleaned up")
print()
# Import modules
sys.path.insert(0, '/app')
from ollama_client import OllamaClient
from article_clustering import ArticleClusterer
from cluster_summarizer import ClusterSummarizer
from config import Config
# Initialize
ollama_client = OllamaClient(
base_url=Config.OLLAMA_BASE_URL,
model=Config.OLLAMA_MODEL,
enabled=Config.OLLAMA_ENABLED,
timeout=60
)
clusterer = ArticleClusterer(ollama_client, similarity_threshold=0.50, time_window_hours=24)
summarizer = ClusterSummarizer(ollama_client, max_words=200)
# Test articles - 2 stories, 2 sources each
test_articles = [
# Story 1: Munich Housing (2 sources)
{
"title": "München: Stadtrat beschließt neue Wohnungsbau-Regelungen",
"content": "Der Münchner Stadtrat hat neue Regelungen für bezahlbaren Wohnungsbau beschlossen. 40% Sozialwohnungen werden Pflicht.",
"source": "abendzeitung-muenchen",
"link": "https://example.com/test-housing-az",
"published_at": datetime.utcnow(),
"category": "local"
},
{
"title": "Stadtrat München: Neue Verordnung für Wohnungsbau",
"content": "München führt neue Wohnungsbau-Verordnung ein. Mindestens 40% der Neubauten müssen Sozialwohnungen sein.",
"source": "sueddeutsche",
"link": "https://example.com/test-housing-sz",
"published_at": datetime.utcnow(),
"category": "local"
},
# Story 2: Bayern Transfer (2 sources)
{
"title": "FC Bayern verpflichtet brasilianischen Stürmer für 50 Millionen",
"content": "Bayern München holt einen 23-jährigen Brasilianer. Sportdirektor Freund ist begeistert.",
"source": "abendzeitung-muenchen",
"link": "https://example.com/test-bayern-az",
"published_at": datetime.utcnow(),
"category": "sports"
},
{
"title": "Bayern München: Neuzugang aus Brasilien für 50 Mio. Euro",
"content": "Der Rekordmeister verstärkt die Offensive mit einem brasilianischen Angreifer. Freund lobt den Transfer.",
"source": "sueddeutsche",
"link": "https://example.com/test-bayern-sz",
"published_at": datetime.utcnow(),
"category": "sports"
}
]
print("2. Processing articles with AI clustering...")
print()
clustered_articles = []
for i, article in enumerate(test_articles, 1):
print(f" Article {i}: {article['title'][:50]}...")
print(f" Source: {article['source']}")
# Cluster
clustered = clusterer.cluster_article(article, clustered_articles)
clustered_articles.append(clustered)
print(f" → Cluster: {clustered['cluster_id']}")
print(f" → Primary: {clustered['is_primary']}")
# Save to DB
db.articles.insert_one(clustered)
print(f" ✓ Saved")
print()
print("=" * 70)
print("3. Clustering Results:")
print()
# Analyze clusters
clusters = {}
for article in clustered_articles:
cid = article['cluster_id']
if cid not in clusters:
clusters[cid] = []
clusters[cid].append(article)
print(f" Total clusters: {len(clusters)}")
print()
for cid, articles in clusters.items():
print(f" Cluster {cid}:")
print(f" - Articles: {len(articles)}")
for article in articles:
print(f" • [{article['source']}] {article['title'][:45]}...")
print()
# Check expectations
if len(clusters) == 2:
print(" ✓ Expected 2 clusters (housing + bayern)")
else:
print(f" ⚠ Expected 2 clusters, got {len(clusters)}")
print()
print("=" * 70)
print("4. Generating neutral summaries...")
print()
summary_count = 0
for cid, articles in clusters.items():
if len(articles) < 2:
print(f" Skipping cluster {cid} (only 1 article)")
continue
print(f" Cluster {cid}: {len(articles)} articles")
result = summarizer.generate_neutral_summary(articles)
if result['success']:
print(f" ✓ Generated summary ({result['duration']:.1f}s)")
# Save
db.cluster_summaries.insert_one({
"cluster_id": cid,
"neutral_summary": result['neutral_summary'],
"sources": result['sources'],
"article_count": result['article_count'],
"created_at": datetime.utcnow()
})
summary_count += 1
# Show preview
preview = result['neutral_summary'][:100] + "..."
print(f" Preview: {preview}")
else:
print(f" ✗ Failed: {result['error']}")
print()
print("=" * 70)
print("5. Final Results:")
print()
test_article_count = db.articles.count_documents({"link": {"$regex": "^https://example.com/test-"}})
test_summary_count = db.cluster_summaries.count_documents({})
print(f" Articles saved: {test_article_count}")
print(f" Clusters created: {len(clusters)}")
print(f" Neutral summaries: {summary_count}")
print()
if len(clusters) == 2 and summary_count == 2:
print(" ✅ SUCCESS! Complete workflow working perfectly!")
print()
print(" The system now:")
print(" 1. ✓ Clusters articles from different sources")
print(" 2. ✓ Generates neutral summaries combining perspectives")
print(" 3. ✓ Stores everything in MongoDB")
else:
print(" ⚠ Partial success - check results above")
print()
print("=" * 70)

View File

@@ -0,0 +1,130 @@
#!/usr/bin/env python3
"""
Test neutral summary generation from clustered articles
"""
from pymongo import MongoClient
from datetime import datetime
import sys
# Connect to MongoDB
client = MongoClient("mongodb://admin:changeme@mongodb:27017/")
db = client["munich_news"]
print("Testing Neutral Summary Generation")
print("=" * 70)
print()
# Check for test articles
test_articles = list(db.articles.find(
{"link": {"$regex": "^https://example.com/"}}
).sort("_id", 1))
if len(test_articles) == 0:
print("⚠ No test articles found. Run test-clustering-real.py first.")
sys.exit(1)
print(f"Found {len(test_articles)} test articles")
print()
# Find clusters with multiple articles
clusters = {}
for article in test_articles:
cid = article['cluster_id']
if cid not in clusters:
clusters[cid] = []
clusters[cid].append(article)
multi_article_clusters = {k: v for k, v in clusters.items() if len(v) > 1}
if len(multi_article_clusters) == 0:
print("⚠ No clusters with multiple articles found")
sys.exit(1)
print(f"Found {len(multi_article_clusters)} cluster(s) with multiple articles")
print()
# Import cluster summarizer
sys.path.insert(0, '/app')
from ollama_client import OllamaClient
from cluster_summarizer import ClusterSummarizer
from config import Config
# Initialize
ollama_client = OllamaClient(
base_url=Config.OLLAMA_BASE_URL,
model=Config.OLLAMA_MODEL,
enabled=Config.OLLAMA_ENABLED,
timeout=60
)
summarizer = ClusterSummarizer(ollama_client, max_words=200)
print("Generating neutral summaries...")
print("=" * 70)
print()
for cluster_id, articles in multi_article_clusters.items():
print(f"Cluster: {cluster_id}")
print(f"Articles: {len(articles)}")
print()
# Show individual articles
for i, article in enumerate(articles, 1):
print(f" {i}. [{article['source']}] {article['title'][:60]}...")
print()
# Generate neutral summary
print(" Generating neutral summary...")
result = summarizer.generate_neutral_summary(articles)
if result['success']:
print(f" ✓ Success ({result['duration']:.1f}s)")
print()
print(" Neutral Summary:")
print(" " + "-" * 66)
# Wrap text at 66 chars
summary = result['neutral_summary']
words = summary.split()
lines = []
current_line = " "
for word in words:
if len(current_line) + len(word) + 1 <= 68:
current_line += word + " "
else:
lines.append(current_line.rstrip())
current_line = " " + word + " "
if current_line.strip():
lines.append(current_line.rstrip())
print("\n".join(lines))
print(" " + "-" * 66)
print()
# Save to database
db.cluster_summaries.update_one(
{"cluster_id": cluster_id},
{
"$set": {
"cluster_id": cluster_id,
"neutral_summary": result['neutral_summary'],
"sources": result['sources'],
"article_count": result['article_count'],
"created_at": datetime.utcnow(),
"updated_at": datetime.utcnow()
}
},
upsert=True
)
print(" ✓ Saved to cluster_summaries collection")
else:
print(f" ✗ Failed: {result['error']}")
print()
print("=" * 70)
print()
print("Testing complete!")
print()
# Show summary statistics
total_cluster_summaries = db.cluster_summaries.count_documents({})
print(f"Total cluster summaries in database: {total_cluster_summaries}")