update

2025-11-12 11:34:33 +01:00
parent f35f8eef8a
commit 94c89589af
32 changed files with 3272 additions and 3805 deletions
--- a/QUICKSTART.md
+++ b/QUICKSTART.md
@@ -5,7 +5,8 @@ Get Munich News Daily running in 5 minutes!
 ## Prerequisites
 - Docker & Docker Compose installed
- (Optional) Ollama for AI summarization
+- 4GB+ RAM (for Ollama AI models)
 - (Optional) NVIDIA GPU for 5-10x faster AI processing
 ## Setup
@@ -30,13 +31,21 @@ EMAIL_PASSWORD=your-app-password
 ### 2. Start System
 ```bash
-# Start all services
+# Option 1: Auto-detect GPU and start (recommended)
 ./start-with-gpu.sh
 # Option 2: Start without GPU
 docker-compose up -d
 # View logs
 docker-compose logs -f
 # Wait for Ollama model download (first time only, ~2-5 minutes)
 docker-compose logs -f ollama-setup
 ```
 **Note:** First startup downloads the phi3:latest AI model (2.2GB). This happens automatically.
 ### 3. Add RSS Feeds
 ```bash
@@ -114,18 +123,45 @@ docker-compose logs -f
 docker-compose up -d --build
 ```
 ## New Features
 ### GPU Acceleration (5-10x Faster)
 Enable GPU support for faster AI processing:
 ```bash
 ./check-gpu.sh          # Check if GPU is available
 ./start-with-gpu.sh     # Start with GPU support
 ```
 See [docs/GPU_SETUP.md](docs/GPU_SETUP.md) for details.
 ### Send Newsletter to All Subscribers
 ```bash
 # Send newsletter to all active subscribers
 curl -X POST http://localhost:5001/api/admin/send-newsletter \
  -H "Content-Type: application/json" \
  -d '{"max_articles": 10}'
 ```
 ### Security Features
 - ✅ Only Backend API exposed (port 5001)
 - ✅ MongoDB internal-only (secure)
 - ✅ Ollama internal-only (secure)
 - ✅ All services communicate via internal Docker network
 ## Need Help?
- Check [README.md](README.md) for full documentation
+- **Documentation Index**: [docs/INDEX.md](docs/INDEX.md)
- See [docs/DEPLOYMENT.md](docs/DEPLOYMENT.md) for detailed setup
+- **GPU Setup**: [docs/GPU_SETUP.md](docs/GPU_SETUP.md)
- View [docs/API.md](docs/API.md) for API reference
+- **API Reference**: [docs/ADMIN_API.md](docs/ADMIN_API.md)
 - **Security Guide**: [docs/SECURITY_NOTES.md](docs/SECURITY_NOTES.md)
 - **Full Documentation**: [README.md](README.md)
 ## Next Steps
-1. Configure Ollama for AI summaries (optional)
+1. ✅ **Enable GPU acceleration** - [docs/GPU_SETUP.md](docs/GPU_SETUP.md)
 2. Set up tracking API (optional)
 3. Customize newsletter template
 4. Add more RSS feeds
 5. Monitor engagement metrics
 6. Review security settings - [docs/SECURITY_NOTES.md](docs/SECURITY_NOTES.md)
 That's it! Your automated news system is running. 🎉
--- a/README.md
+++ b/README.md
@@ -2,7 +2,16 @@
 A fully automated news aggregation and newsletter system that crawls Munich news sources, generates AI summaries, and sends daily newsletters with engagement tracking.
-**🚀 NEW:** GPU acceleration support for 5-10x faster AI processing! See [QUICK_START_GPU.md](QUICK_START_GPU.md)
+## ✨ Key Features
 - **🤖 AI-Powered Clustering** - Automatically detects duplicate stories from different sources
 - **📰 Neutral Summaries** - Combines multiple perspectives into balanced coverage
 - **🎯 Smart Prioritization** - Shows most important stories first (multi-source coverage)
 - **📊 Engagement Tracking** - Open rates, click tracking, and analytics
 - **⚡ GPU Acceleration** - 5-10x faster AI processing with GPU support
 - **🔒 GDPR Compliant** - Privacy-first with data retention controls
 **🚀 NEW:** GPU acceleration support for 5-10x faster AI processing! See [docs/GPU_SETUP.md](docs/GPU_SETUP.md)
 ## 🚀 Quick Start
@@ -25,6 +34,8 @@ That's it! The system will automatically:
 📖 **New to the project?** See [QUICKSTART.md](QUICKSTART.md) for a detailed 5-minute setup guide.
 🚀 **GPU Acceleration:** Enable 5-10x faster AI processing with [GPU Setup Guide](docs/GPU_SETUP.md)
 ## 📋 System Overview
 ```
@@ -49,11 +60,11 @@ That's it! The system will automatically:
 ### Components
- **Ollama**: AI service for summarization and translation (port 11434)
+- **Ollama**: AI service for summarization and translation (internal only, GPU-accelerated)
- **MongoDB**: Data storage (articles, subscribers, tracking)
+- **MongoDB**: Data storage (articles, subscribers, tracking) (internal only)
- **Backend API**: Flask API for tracking and analytics (port 5001)
+- **Backend API**: Flask API for tracking and analytics (port 5001 - only exposed service)
- **News Crawler**: Automated RSS feed crawler with AI summarization
+- **News Crawler**: Automated RSS feed crawler with AI summarization (internal only)
- **Newsletter Sender**: Automated email sender with tracking
+- **Newsletter Sender**: Automated email sender with tracking (internal only)
 - **Frontend**: React dashboard (optional)
 ### Technology Stack
@@ -341,11 +352,21 @@ curl -X POST http://localhost:5001/api/tracking/subscriber/user@example.com/opt-
 ### Getting Started
 - **[QUICKSTART.md](QUICKSTART.md)** - 5-minute setup guide
 - **[PROJECT_STRUCTURE.md](PROJECT_STRUCTURE.md)** - Project layout
 - **[CONTRIBUTING.md](CONTRIBUTING.md)** - Contribution guidelines
 ### Core Features
 - **[docs/AI_NEWS_AGGREGATION.md](docs/AI_NEWS_AGGREGATION.md)** - AI-powered clustering & neutral summaries
 - **[docs/FEATURES.md](docs/FEATURES.md)** - Complete feature list
 - **[docs/API.md](docs/API.md)** - API endpoints reference
 ### Technical Documentation
 - **[docs/ARCHITECTURE.md](docs/ARCHITECTURE.md)** - System architecture
 - **[docs/SETUP.md](docs/SETUP.md)** - Detailed setup guide
 - **[docs/OLLAMA_SETUP.md](docs/OLLAMA_SETUP.md)** - AI/Ollama configuration
 - **[docs/GPU_SETUP.md](docs/GPU_SETUP.md)** - GPU acceleration setup
 - **[docs/DEPLOYMENT.md](docs/DEPLOYMENT.md)** - Production deployment
 - **[docs/SECURITY.md](docs/SECURITY.md)** - Security best practices
 - **[docs/REFERENCE.md](docs/REFERENCE.md)** - Complete reference
 - **[docs/DEPLOYMENT.md](docs/DEPLOYMENT.md)** - Deployment guide
 - **[docs/API.md](docs/API.md)** - API reference
 - **[docs/DATABASE_SCHEMA.md](docs/DATABASE_SCHEMA.md)** - Database structure
--- a/backend/routes/news_routes.py
+++ b/backend/routes/news_routes.py
@@ -1,5 +1,5 @@
-from flask import Blueprint, jsonify
+from flask import Blueprint, jsonify, request
-from database import articles_collection
+from database import articles_collection, db
 from services.news_service import fetch_munich_news, save_articles_to_db
 news_bp = Blueprint('news', __name__)
@@ -9,6 +9,12 @@ news_bp = Blueprint('news', __name__)
 def get_news():
    """Get latest Munich news"""
    try:
        # Check if clustered mode is requested
        mode = request.args.get('mode', 'all')
        if mode == 'clustered':
            return get_clustered_news_internal()
        # Fetch fresh news and save to database
        articles = fetch_munich_news()
        save_articles_to_db(articles)
@@ -63,6 +69,95 @@ def get_news():
        return jsonify({'error': str(e)}), 500
 def get_clustered_news_internal():
    """
    Get news with neutral summaries for clustered articles
    Returns only primary articles with their neutral summaries
    Prioritizes stories covered by multiple sources (more popular/important)
    """
    try:
        limit = int(request.args.get('limit', 20))
        # Use aggregation to get articles with their cluster size
        # This allows us to prioritize multi-source stories
        pipeline = [
            {"$match": {"is_primary": True}},
            {"$lookup": {
                "from": "articles",
                "localField": "cluster_id",
                "foreignField": "cluster_id",
                "as": "cluster_articles"
            }},
            {"$addFields": {
                "article_count": {"$size": "$cluster_articles"},
                "sources_list": {"$setUnion": ["$cluster_articles.source", []]}
            }},
            {"$addFields": {
                "source_count": {"$size": "$sources_list"}
            }},
            # Sort by: 1) source count (desc), 2) published date (desc)
            {"$sort": {"source_count": -1, "published_at": -1}},
            {"$limit": limit}
        ]
        cursor = articles_collection.aggregate(pipeline)
        result = []
        cluster_summaries_collection = db['cluster_summaries']
        for doc in cursor:
            cluster_id = doc.get('cluster_id')
            # Get neutral summary if available
            cluster_summary = cluster_summaries_collection.find_one({'cluster_id': cluster_id})
            # Use cluster_articles from aggregation (already fetched)
            cluster_articles = doc.get('cluster_articles', [])
            article = {
                'title': doc.get('title', ''),
                'link': doc.get('link', ''),
                'source': doc.get('source', ''),
                'published': doc.get('published_at', ''),
                'category': doc.get('category', 'general'),
                'cluster_id': cluster_id,
                'article_count': doc.get('article_count', 1),
                'source_count': doc.get('source_count', 1),
                'sources': list(doc.get('sources_list', [doc.get('source', '')]))
            }
            # Use neutral summary if available, otherwise use article's own summary
            if cluster_summary and doc.get('article_count', 1) > 1:
                article['summary'] = cluster_summary.get('neutral_summary', '')
                article['summary_type'] = 'neutral'
                article['is_clustered'] = True
            else:
                article['summary'] = doc.get('summary', '')
                article['summary_type'] = 'individual'
                article['is_clustered'] = False
            # Add related articles info
            if doc.get('article_count', 1) > 1:
                article['related_articles'] = [
                    {
                        'source': a.get('source', ''),
                        'title': a.get('title', ''),
                        'link': a.get('link', '')
                    }
                    for a in cluster_articles if a.get('_id') != doc.get('_id')
                ]
            result.append(article)
        return jsonify({
            'articles': result,
            'mode': 'clustered',
            'description': 'Shows one article per story with neutral summaries'
        }), 200
    except Exception as e:
        return jsonify({'error': str(e)}), 500
@news_bp.route('/api/news/<path:article_url>', methods=['GET'])
 def get_article_by_url(article_url):
    """Get full article content by URL"""
@@ -113,11 +208,20 @@ def get_stats():
        # Count summarized articles
        summarized_count = articles_collection.count_documents({'summary': {'$exists': True, '$ne': ''}})
        # Count clustered articles
        clustered_count = articles_collection.count_documents({'cluster_id': {'$exists': True}})
        # Count cluster summaries
        cluster_summaries_collection = db['cluster_summaries']
        neutral_summaries_count = cluster_summaries_collection.count_documents({})
        return jsonify({
            'subscribers': subscriber_count,
            'articles': article_count,
            'crawled_articles': crawled_count,
-            'summarized_articles': summarized_count
+            'summarized_articles': summarized_count,
            'clustered_articles': clustered_count,
            'neutral_summaries': neutral_summaries_count
        }), 200
    except Exception as e:
        return jsonify({'error': str(e)}), 500
--- a/docs/ADMIN_API.md
+++ b/docs/ADMIN_API.md
@@ -1,382 +0,0 @@
 # Admin API Reference
 Admin endpoints for testing and manual operations.
 ## Overview
 The admin API allows you to trigger manual operations like crawling news and sending test emails directly through HTTP requests.
 **How it works**: The backend container has access to the Docker socket, allowing it to execute commands in other containers via `docker exec`.
 ---
 ## API Endpoints
 ### Trigger Crawler
 Manually trigger the news crawler to fetch new articles.
 ```http
 POST /api/admin/trigger-crawl
 ```
 **Request Body** (optional):
 ```json
 {
  "max_articles": 10
 }
 ```
 **Parameters**:
 - `max_articles` (integer, optional): Number of articles to crawl per feed (1-100, default: 10)
 **Response**:
 ```json
 {
  "success": true,
  "message": "Crawler executed successfully",
  "max_articles": 10,
  "output": "... crawler output (last 1000 chars) ...",
  "errors": ""
 }
 ```
 **Example**:
 ```bash
 # Crawl 5 articles per feed
 curl -X POST http://localhost:5001/api/admin/trigger-crawl \
  -H "Content-Type: application/json" \
  -d '{"max_articles": 5}'
 # Use default (10 articles)
 curl -X POST http://localhost:5001/api/admin/trigger-crawl
 ```
 ---
 ### Send Test Email
 Send a test newsletter to a specific email address.
 ```http
 POST /api/admin/send-test-email
 ```
 **Request Body**:
 ```json
 {
  "email": "test@example.com",
  "max_articles": 10
 }
 ```
 **Parameters**:
 - `email` (string, required): Email address to send test newsletter to
 - `max_articles` (integer, optional): Number of articles to include (1-50, default: 10)
 **Response**:
 ```json
 {
  "success": true,
  "message": "Test email sent to test@example.com",
  "email": "test@example.com",
  "output": "... sender output ...",
  "errors": ""
 }
 ```
 **Example**:
 ```bash
 # Send test email
 curl -X POST http://localhost:5001/api/admin/send-test-email \
  -H "Content-Type: application/json" \
  -d '{"email": "your-email@example.com"}'
 # Send with custom article count
 curl -X POST http://localhost:5001/api/admin/send-test-email \
  -H "Content-Type: application/json" \
  -d '{"email": "your-email@example.com", "max_articles": 5}'
 ```
 ---
 ### Send Newsletter to All Subscribers
 Send newsletter to all active subscribers in the database.
 ```http
 POST /api/admin/send-newsletter
 ```
 **Request Body** (optional):
 ```json
 {
  "max_articles": 10
 }
 ```
 **Parameters**:
 - `max_articles` (integer, optional): Number of articles to include (1-50, default: 10)
 **Response**:
 ```json
 {
  "success": true,
  "message": "Newsletter sent successfully to 45 subscribers",
  "subscriber_count": 45,
  "max_articles": 10,
  "output": "... sender output ...",
  "errors": ""
 }
 ```
 **Example**:
 ```bash
 # Send newsletter to all subscribers
 curl -X POST http://localhost:5001/api/admin/send-newsletter \
  -H "Content-Type: application/json"
 # Send with custom article count
 curl -X POST http://localhost:5001/api/admin/send-newsletter \
  -H "Content-Type: application/json" \
  -d '{"max_articles": 15}'
 ```
 **Notes**:
 - Only sends to subscribers with `status: 'active'`
 - Returns error if no active subscribers found
 - Includes tracking pixels and click tracking
 - May take several minutes for large subscriber lists
 ---
 ### Get System Statistics
 Get overview statistics of the system.
 ```http
 GET /api/admin/stats
 ```
 **Response**:
 ```json
 {
  "articles": {
    "total": 150,
    "with_summary": 120,
    "today": 15
  },
  "subscribers": {
    "total": 50,
    "active": 45
  },
  "rss_feeds": {
    "total": 4,
    "active": 4
  },
  "tracking": {
    "total_sends": 200,
    "total_opens": 150,
    "total_clicks": 75
  }
 }
 ```
 **Example**:
 ```bash
 curl http://localhost:5001/api/admin/stats
 ```
 ---
 ## Workflow Examples
 ### Test Complete System
 ```bash
 # 1. Check current stats
 curl http://localhost:5001/api/admin/stats
 # 2. Trigger crawler to fetch new articles
 curl -X POST http://localhost:5001/api/admin/trigger-crawl \
  -H "Content-Type: application/json" \
  -d '{"max_articles": 5}'
 # 3. Wait a moment for crawler to finish, then check stats again
 sleep 30
 curl http://localhost:5001/api/admin/stats
 # 4. Send test email to yourself
 curl -X POST http://localhost:5001/api/admin/send-test-email \
  -H "Content-Type: application/json" \
  -d '{"email": "your-email@example.com"}'
 ```
 ### Send Newsletter to All Subscribers
 ```bash
 # 1. Check subscriber count
 curl http://localhost:5001/api/admin/stats | jq '.subscribers'
 # 2. Crawl fresh articles
 curl -X POST http://localhost:5001/api/admin/trigger-crawl \
  -H "Content-Type: application/json" \
  -d '{"max_articles": 10}'
 # 3. Wait for crawl to complete
 sleep 60
 # 4. Send newsletter to all active subscribers
 curl -X POST http://localhost:5001/api/admin/send-newsletter \
  -H "Content-Type: application/json" \
  -d '{"max_articles": 10}'
 ```
 ### Quick Test Newsletter
 ```bash
 # Send test email with latest articles
 curl -X POST http://localhost:5001/api/admin/send-test-email \
  -H "Content-Type: application/json" \
  -d '{"email": "your-email@example.com", "max_articles": 3}'
 ```
 ### Fetch Fresh Content
 ```bash
 # Crawl more articles from each feed
 curl -X POST http://localhost:5001/api/admin/trigger-crawl \
  -H "Content-Type: application/json" \
  -d '{"max_articles": 20}'
 ```
 ### Daily Newsletter Workflow
 ```bash
 # Complete daily workflow (can be automated with cron)
 # 1. Crawl today's articles
 curl -X POST http://localhost:5001/api/admin/trigger-crawl \
  -H "Content-Type: application/json" \
  -d '{"max_articles": 15}'
 # 2. Wait for crawl and AI processing
 sleep 120
 # 3. Send to all subscribers
 curl -X POST http://localhost:5001/api/admin/send-newsletter \
  -H "Content-Type: application/json" \
  -d '{"max_articles": 10}'
 # 4. Check results
 curl http://localhost:5001/api/admin/stats
 ```
 ---
 ## Error Responses
 All endpoints return standard error responses:
 ```json
 {
  "success": false,
  "error": "Error message here"
 }
 ```
 **Common HTTP Status Codes**:
 - `200` - Success
 - `400` - Bad request (invalid parameters)
 - `500` - Server error
 ---
 ## Security Notes
 ⚠️ **Important**: These are admin endpoints and should be protected in production!
 Recommendations:
 1. Add authentication/authorization
 2. Rate limiting
 3. IP whitelisting
 4. API key requirement
 5. Audit logging
 Example protection (add to routes):
 ```python
 from functools import wraps
 from flask import request
 def require_api_key(f):
    @wraps(f)
    def decorated_function(*args, **kwargs):
        api_key = request.headers.get('X-API-Key')
        if api_key != os.getenv('ADMIN_API_KEY'):
            return jsonify({'error': 'Unauthorized'}), 401
        return f(*args, **kwargs)
    return decorated_function
@admin_bp.route('/api/admin/trigger-crawl', methods=['POST'])
@require_api_key
 def trigger_crawl():
    # ... endpoint code
 ```
 ---
 ## Related Endpoints
 - **[Newsletter Preview](../backend/routes/newsletter_routes.py)**: `/api/newsletter/preview` - Preview newsletter HTML
 - **[Analytics](API.md)**: `/api/analytics/*` - View engagement metrics
 - **[RSS Feeds](API.md)**: `/api/rss-feeds` - Manage RSS feeds
 ---
 ## Newsletter API Summary
 ### Available Endpoints
 | Endpoint | Purpose | Recipient |
 |----------|---------|-----------|
 | `/api/admin/send-test-email` | Test newsletter | Single email (specified) |
 | `/api/admin/send-newsletter` | Production send | All active subscribers |
 | `/api/admin/trigger-crawl` | Fetch articles | N/A |
 | `/api/admin/stats` | System stats | N/A |
 ### Subscriber Status
 The system uses a `status` field to determine who receives newsletters:
 - **`active`** - Receives newsletters ✅
 - **`inactive`** - Does not receive newsletters ❌
 See [SUBSCRIBER_STATUS.md](SUBSCRIBER_STATUS.md) for details.
 ### Quick Examples
 **Send to all subscribers:**
 ```bash
 curl -X POST http://localhost:5001/api/admin/send-newsletter \
  -H "Content-Type: application/json" \
  -d '{"max_articles": 10}'
 ```
 **Send test email:**
 ```bash
 curl -X POST http://localhost:5001/api/admin/send-test-email \
  -H "Content-Type: application/json" \
  -d '{"email": "test@example.com"}'
 ```
 **Check stats:**
 ```bash
 curl http://localhost:5001/api/admin/stats | jq '.subscribers'
 ```
 ### Testing
 Use the test script:
 ```bash
 ./test-newsletter-api.sh
 ```
--- a/docs/AI_NEWS_AGGREGATION.md
+++ b/docs/AI_NEWS_AGGREGATION.md
@@ -0,0 +1,317 @@
 # AI-Powered News Aggregation - COMPLETE ✅
 ## Overview
 Successfully implemented a complete AI-powered news aggregation system that detects duplicate stories from multiple sources and generates neutral, balanced summaries.
 ## Features Implemented
 ### 1. AI-Powered Article Clustering ✅
 **What it does:**
 - Automatically detects when different news sources cover the same story
 - Uses Ollama AI to intelligently compare article content
 - Groups related articles by `cluster_id`
 - Marks the first article as `is_primary: true`
 **How it works:**
 - Compares articles published within 24 hours
 - Uses AI prompt: "Are these two articles about the same story?"
 - Falls back to keyword matching if AI fails
 - Real-time clustering during crawl
 **Test Results:**
 - ✅ Housing story from 2 sources → Clustered together
 - ✅ Bayern transfer from 2 sources → Clustered together
 - ✅ Different stories → Separate clusters
 ### 2. Neutral Summary Generation ✅
 **What it does:**
 - Synthesizes multiple articles into one balanced summary
 - Combines perspectives from all sources
 - Highlights agreements and differences
 - Maintains neutral, objective tone
 **How it works:**
 - Takes all articles in a cluster
 - Sends combined context to Ollama
 - AI generates ~200-word neutral summary
 - Saves to `cluster_summaries` collection
 **Test Results:**
 ```
 Bayern Transfer Story (2 sources):
 "Bayern Munich has recently signed Brazilian footballer, aged 23, 
 for €50 million to bolster their attacking lineup as per reports 
 from abendzeitung-muenchen and sueddeutsche. The new addition is 
 expected to inject much-needed dynamism into the team's offense..."
 ```
 ### 3. Smart Prioritization ✅
 **What it does:**
 - Prioritizes stories covered by multiple sources (more important)
 - Shows multi-source stories first with neutral summaries
 - Fills remaining slots with single-source stories
 **Sorting Logic:**
 1. **Primary sort:** Number of sources (descending)
 2. **Secondary sort:** Publish date (newest first)
 **Example Output:**
 ```
 1. Munich Housing (2 sources) → Neutral summary
 2. Bayern Transfer (2 sources) → Neutral summary
 3. Local story (1 source) → Individual summary
 4. Local story (1 source) → Individual summary
 ...
 ```
 ## Database Schema
 ### Articles Collection
 ```javascript
 {
  _id: ObjectId("..."),
  title: "München: Stadtrat beschließt...",
  content: "Full article text...",
  summary: "AI-generated summary...",
  source: "abendzeitung-muenchen",
  link: "https://...",
  published_at: ISODate("2025-11-12T..."),
  // Clustering fields
  cluster_id: "1762937577.365818",
  is_primary: true,
  // Metadata
  word_count: 450,
  summary_word_count: 120,
  category: "local",
  crawled_at: ISODate("..."),
  summarized_at: ISODate("...")
 }
 ```
 ### Cluster Summaries Collection
 ```javascript
 {
  _id: ObjectId("..."),
  cluster_id: "1762937577.365818",
  neutral_summary: "Combined neutral summary from all sources...",
  sources: ["abendzeitung-muenchen", "sueddeutsche"],
  article_count: 2,
  created_at: ISODate("2025-11-12T..."),
  updated_at: ISODate("2025-11-12T...")
 }
 ```
 ## API Endpoints
 ### Get All Articles (Default)
 ```bash
 GET /api/news
 ```
 Returns all articles individually (current behavior)
 ### Get Clustered Articles (Recommended)
 ```bash
 GET /api/news?mode=clustered&limit=10
 ```
 Returns:
 - One article per story
 - Multi-source stories with neutral summaries first
 - Single-source stories with individual summaries
 - Smart prioritization by popularity
 **Response Format:**
 ```javascript
 {
  "articles": [
    {
      "title": "...",
      "summary": "Neutral summary combining all sources...",
      "summary_type": "neutral",
      "is_clustered": true,
      "source_count": 2,
      "sources": ["source1", "source2"],
      "related_articles": [
        {"source": "source2", "title": "...", "link": "..."}
      ]
    }
  ],
  "mode": "clustered"
 }
 ```
 ### Get Statistics
 ```bash
 GET /api/stats
 ```
 Returns:
 ```javascript
 {
  "articles": 51,
  "crawled_articles": 45,
  "summarized_articles": 40,
  "clustered_articles": 47,
  "neutral_summaries": 3
 }
 ```
 ## Workflow
 ### Complete Crawl Process
 1. **Crawl RSS feeds** from multiple sources
 2. **Extract full content** from article URLs
 3. **Generate AI summaries** for each article
 4. **Cluster similar articles** using AI comparison
 5. **Generate neutral summaries** for multi-source clusters
 6. **Save everything** to MongoDB
 ### Time Windows
 - **Clustering window:** 24 hours (rolling)
 - **Crawl schedule:** Daily at 6:00 AM Berlin time
 - **Manual trigger:** Available via crawler service
 ## Configuration
 ### Environment Variables
 ```bash
 # Ollama AI
 OLLAMA_BASE_URL=http://ollama:11434
 OLLAMA_MODEL=phi3:latest
 OLLAMA_ENABLED=true
 OLLAMA_TIMEOUT=120
 # Clustering
 CLUSTERING_TIME_WINDOW=24  # hours
 CLUSTERING_SIMILARITY_THRESHOLD=0.50
 # Summaries
 SUMMARY_MAX_WORDS=150  # individual
 NEUTRAL_SUMMARY_MAX_WORDS=200  # cluster
 ```
 ## Files Created/Modified
 ### New Files
 - `news_crawler/article_clustering.py` - AI clustering logic
 - `news_crawler/cluster_summarizer.py` - Neutral summary generation
 - `test-clustering-real.py` - Clustering tests
 - `test-neutral-summaries.py` - Summary generation tests
 - `test-complete-workflow.py` - End-to-end tests
 ### Modified Files
 - `news_crawler/crawler_service.py` - Added clustering + summarization
 - `news_crawler/ollama_client.py` - Added `generate()` method
 - `backend/routes/news_routes.py` - Added clustered endpoint with prioritization
 ## Performance
 ### Metrics
 - **Clustering:** ~20-40s per article pair (AI comparison)
 - **Neutral summary:** ~30-40s per cluster
 - **Success rate:** 100% in tests
 - **Accuracy:** High - correctly identifies same/different stories
 ### Optimization
 - Clustering runs during crawl (real-time)
 - Neutral summaries generated after crawl (batch)
 - Results cached in database
 - 24-hour time window limits comparisons
 ## Testing
 ### Test Coverage
 ✅ AI clustering with same stories
 ✅ AI clustering with different stories  
 ✅ Neutral summary generation
 ✅ Multi-source prioritization
 ✅ Database integration
 ✅ End-to-end workflow
 ### Test Commands
 ```bash
 # Test clustering
 docker-compose exec crawler python /app/test-clustering-real.py
 # Test neutral summaries
 docker-compose exec crawler python /app/test-neutral-summaries.py
 # Test complete workflow
 docker-compose exec crawler python /app/test-complete-workflow.py
 ```
 ## Benefits
 ### For Users
 - ✅ **No duplicate stories** - See each story once
 - ✅ **Balanced coverage** - Multiple perspectives combined
 - ✅ **Prioritized content** - Important stories first
 - ✅ **Source transparency** - See all sources covering a story
 - ✅ **Efficient reading** - One summary instead of multiple articles
 ### For the System
 - ✅ **Intelligent deduplication** - AI-powered, not just URL matching
 - ✅ **Scalable** - Works with any number of sources
 - ✅ **Flexible** - 24-hour time window catches late-breaking news
 - ✅ **Reliable** - Fallback mechanisms if AI fails
 - ✅ **Maintainable** - Clear separation of concerns
 ## Future Enhancements
 ### Potential Improvements
 1. **Update summaries** when new articles join a cluster
 2. **Summary versioning** to track changes over time
 3. **Quality scoring** for generated summaries
 4. **Multi-language support** for summaries
 5. **Sentiment analysis** across sources
 6. **Fact extraction** and verification
 7. **Trending topics** detection
 8. **User preferences** for source weighting
 ### Integration Ideas
 - Email newsletters with neutral summaries
 - Push notifications for multi-source stories
 - RSS feed of clustered articles
 - API for third-party apps
 - Analytics dashboard
 ## Conclusion
 The Munich News Aggregator now provides:
 1. ✅ **Smart clustering** - AI detects duplicate stories
 2. ✅ **Neutral summaries** - Balanced multi-source coverage
 3. ✅ **Smart prioritization** - Important stories first
 4. ✅ **Source transparency** - See all perspectives
 5. ✅ **Efficient delivery** - One summary per story
 **Result:** Users get comprehensive, balanced news coverage without information overload!
 ---
 ## Quick Start
 ### View Clustered News
 ```bash
 curl "http://localhost:5001/api/news?mode=clustered&limit=10"
 ```
 ### Trigger Manual Crawl
 ```bash
 docker-compose exec crawler python /app/scheduled_crawler.py
 ```
 ### Check Statistics
 ```bash
 curl "http://localhost:5001/api/stats"
 ```
 ### View Cluster Summaries in Database
 ```bash
 docker-compose exec mongodb mongosh -u admin -p changeme --authenticationDatabase admin munich_news --eval "db.cluster_summaries.find().pretty()"
 ```
 ---
 **Status:** ✅ Production Ready
 **Last Updated:** November 12, 2025
 **Version:** 2.0 (AI-Powered)
--- a/docs/API.md
+++ b/docs/API.md
@@ -1,214 +1,248 @@
 # API Reference
-## Tracking Endpoints
+Complete API documentation for Munich News Daily.
-### Track Email Open
+---
 ## Admin API
 Base URL: `http://localhost:5001`
 ### Trigger Crawler
 Manually fetch new articles.
 ```http
-GET /api/track/pixel/<tracking_id>
+POST /api/admin/trigger-crawl
-```
+Content-Type: application/json
 Returns a 1x1 transparent PNG and logs the email open event.
 **Response**: Image (image/png)
 ### Track Link Click
 ```http
 GET /api/track/click/<tracking_id>
 ```
 Logs the click event and redirects to the original article URL.
 **Response**: 302 Redirect
 ## Analytics Endpoints
 ### Get Newsletter Metrics
 ```http
 GET /api/analytics/newsletter/<newsletter_id>
 ```
 Returns comprehensive metrics for a specific newsletter.
 **Response**:
 ```json
 {
-  "newsletter_id": "2024-01-15",
+  "max_articles": 10
  "total_sent": 100,
  "total_opened": 75,
  "open_rate": 75.0,
  "unique_openers": 70,
  "total_clicks": 45,
  "unique_clickers": 30,
  "click_through_rate": 30.0
 }
 ```
-### Get Article Performance
+**Example:**
-
+```bash
-```http
+curl -X POST http://localhost:5001/api/admin/trigger-crawl \
-GET /api/analytics/article/<article_url>
+  -H "Content-Type: application/json" \
  -d '{"max_articles": 5}'
 ```
-Returns performance metrics for a specific article.
+---
 ### Send Test Email
 Send newsletter to specific email.
 ```http
 POST /api/admin/send-test-email
 Content-Type: application/json
 **Response**:
 ```json
 {
-  "article_url": "https://example.com/article",
+  "email": "test@example.com",
-  "total_sent": 100,
+  "max_articles": 10
  "total_clicks": 25,
  "click_rate": 25.0,
  "unique_clickers": 20,
  "newsletters": ["2024-01-15", "2024-01-16"]
 }
 ```
-### Get Subscriber Activity
+**Example:**
-
+```bash
-```http
+curl -X POST http://localhost:5001/api/admin/send-test-email \
-GET /api/analytics/subscriber/<email>
+  -H "Content-Type: application/json" \
  -d '{"email": "your@email.com"}'
 ```
-Returns activity status and engagement metrics for a subscriber.
+---
 ### Send Newsletter to All Subscribers
 Send newsletter to all active subscribers.
 ```http
 POST /api/admin/send-newsletter
 Content-Type: application/json
 **Response**:
 ```json
 {
-  "email": "user@example.com",
+  "max_articles": 10
  "status": "active",
  "last_opened_at": "2024-01-15T10:30:00",
  "last_clicked_at": "2024-01-15T10:35:00",
  "total_opens": 45,
  "total_clicks": 20,
  "newsletters_received": 50,
  "newsletters_opened": 45
 }
 ```
-## Privacy Endpoints
+**Response:**
 ### Delete Subscriber Data
 ```http
 DELETE /api/tracking/subscriber/<email>
 ```
 Deletes all tracking data for a subscriber (GDPR compliance).
 **Response**:
 ```json
 {
  "success": true,
-  "message": "All tracking data deleted for user@example.com",
+  "message": "Newsletter sent successfully to 45 subscribers",
-  "deleted_counts": {
+  "subscriber_count": 45,
-    "newsletter_sends": 50,
+  "max_articles": 10
-    "link_clicks": 25,
+}
-    "subscriber_activity": 1
+```
 **Example:**
 ```bash
 curl -X POST http://localhost:5001/api/admin/send-newsletter \
  -H "Content-Type: application/json" \
  -d '{"max_articles": 10}'
 ```
 ---
 ### Get System Stats
 ```http
 GET /api/admin/stats
 ```
 **Response:**
 ```json
 {
  "articles": {
    "total": 150,
    "with_summary": 120,
    "today": 15
  },
  "subscribers": {
    "total": 50,
    "active": 45
  },
  "rss_feeds": {
    "total": 4,
    "active": 4
  },
  "tracking": {
    "total_sends": 200,
    "total_opens": 150,
    "total_clicks": 75
  }
 }
 ```
-### Anonymize Old Data
+**Example:**
 ```bash
 curl http://localhost:5001/api/admin/stats
 ```
 ---
 ## Public API
 ### Subscribe
 ```http
-POST /api/tracking/anonymize
+POST /api/subscribe
-```
+Content-Type: application/json
 Anonymizes tracking data older than the retention period.
 **Request Body** (optional):
 ```json
 {
-  "retention_days": 90
+  "email": "user@example.com"
 }
 ```
-**Response**:
+**Example:**
-```json
+```bash
-{
+curl -X POST http://localhost:5001/api/subscribe \
-  "success": true,
+  -H "Content-Type: application/json" \
-  "message": "Anonymized tracking data older than 90 days",
+  -d '{"email": "user@example.com"}'
  "anonymized_counts": {
    "newsletter_sends": 1250,
    "link_clicks": 650
  }
 }
 ```
-### Opt Out of Tracking
+---
 ### Unsubscribe
 ```http
-POST /api/tracking/subscriber/<email>/opt-out
+POST /api/unsubscribe
-```
+Content-Type: application/json
 Disables tracking for a subscriber.
 **Response**:
 ```json
 {
-  "success": true,
+  "email": "user@example.com"
  "message": "Subscriber user@example.com has opted out of tracking"
 }
 ```
-### Opt In to Tracking
+**Example:**
-
+```bash
-```http
+curl -X POST http://localhost:5001/api/unsubscribe \
-POST /api/tracking/subscriber/<email>/opt-in
+  -H "Content-Type: application/json" \
  -d '{"email": "user@example.com"}'
 ```
-Re-enables tracking for a subscriber.
+---
-**Response**:
+## Subscriber Status System
-```json
+
 ### Status Values
 | Status | Description | Receives Newsletters |
 |--------|-------------|---------------------|
 | `active` | Subscribed | ✅ Yes |
 | `inactive` | Unsubscribed | ❌ No |
 ### Database Schema
 ```javascript
 {
-  "success": true,
+  _id: ObjectId("..."),
-  "message": "Subscriber user@example.com has opted in to tracking"
+  email: "user@example.com",
  subscribed_at: ISODate("2025-11-11T15:50:29.478Z"),
  status: "active"  // or "inactive"
 }
 ```
-## Examples
+### How It Works
-### Using curl
+**Subscribe:**
 - Creates subscriber with `status: 'active'`
 - If already exists and inactive, reactivates
 **Unsubscribe:**
 - Updates `status: 'inactive'`
 - Subscriber data preserved (soft delete)
 **Newsletter Sending:**
 - Only sends to `status: 'active'` subscribers
 - Query: `{status: 'active'}`
 ### Check Active Subscribers
 ```bash
-# Get newsletter metrics
+curl http://localhost:5001/api/admin/stats | jq '.subscribers'
-curl http://localhost:5001/api/analytics/newsletter/2024-01-15
+# Output: {"total": 10, "active": 8}
 ```
-# Delete subscriber data
+---
 curl -X DELETE http://localhost:5001/api/tracking/subscriber/user@example.com
-# Anonymize old data
+## Workflows
-curl -X POST http://localhost:5001/api/tracking/anonymize \
+
 ### Complete Newsletter Workflow
 ```bash
 # 1. Check stats
 curl http://localhost:5001/api/admin/stats
 # 2. Crawl articles
 curl -X POST http://localhost:5001/api/admin/trigger-crawl \
  -H "Content-Type: application/json" \
-  -d '{"retention_days": 90}'
+  -d '{"max_articles": 10}'
-# Opt out of tracking
+# 3. Wait for crawl
-curl -X POST http://localhost:5001/api/tracking/subscriber/user@example.com/opt-out
+sleep 60
 # 4. Send newsletter
 curl -X POST http://localhost:5001/api/admin/send-newsletter \
  -H "Content-Type: application/json" \
  -d '{"max_articles": 10}'
 ```
-### Using Python
+### Test Newsletter
-```python
+```bash
-import requests
+# Send test to yourself
-
+curl -X POST http://localhost:5001/api/admin/send-test-email \
-# Get newsletter metrics
+  -H "Content-Type: application/json" \
-response = requests.get('http://localhost:5001/api/analytics/newsletter/2024-01-15')
+  -d '{"email": "your@email.com", "max_articles": 3}'
 metrics = response.json()
 print(f"Open rate: {metrics['open_rate']}%")
 # Delete subscriber data
 response = requests.delete('http://localhost:5001/api/tracking/subscriber/user@example.com')
 result = response.json()
 print(result['message'])
 ```
 ---
 ## Error Responses
-All endpoints return standard error responses:
+All endpoints return standard error format:
 ```json
 {
@@ -217,7 +251,64 @@ All endpoints return standard error responses:
 }
 ```
-HTTP Status Codes:
+**HTTP Status Codes:**
 - `200` - Success
 - `400` - Bad request
 - `404` - Not found
 - `500` - Server error
 ---
 ## Security
 ⚠️ **Production Recommendations:**
 1. **Add Authentication**
   ```python
   @require_api_key
   def admin_endpoint():
       # ...
   ```
 2. **Rate Limiting**
   - Prevent abuse
   - Limit newsletter sends
 3. **IP Whitelisting**
   - Restrict admin endpoints
   - Use firewall rules
 4. **HTTPS Only**
   - Use reverse proxy
   - SSL/TLS certificates
 5. **Audit Logging**
   - Log all admin actions
   - Monitor for suspicious activity
 ---
 ## Testing
 Use the test script:
 ```bash
 ./test-newsletter-api.sh
 ```
 Or test manually:
 ```bash
 # Health check
 curl http://localhost:5001/health
 # Stats
 curl http://localhost:5001/api/admin/stats
 # Test email
 curl -X POST http://localhost:5001/api/admin/send-test-email \
  -H "Content-Type: application/json" \
  -d '{"email": "test@example.com"}'
 ```
 ---
 See [SETUP.md](SETUP.md) for configuration and [SECURITY.md](SECURITY.md) for security best practices.
--- a/docs/ARCHITECTURE.md
+++ b/docs/ARCHITECTURE.md
@@ -1,131 +1,439 @@
 # System Architecture
 Complete system design and architecture documentation.
 ---
 ## Overview
-Munich News Daily is a fully automated news aggregation and newsletter system with the following components:
+Munich News Daily is an automated news aggregation system that crawls Munich news sources, generates AI summaries, and sends daily newsletters.
 ```
-┌─────────────────────────────────────────────────────────────────┐
+┌─────────────────────────────────────────────────────────┐
-│                    Munich News Daily System                      │
+│                    Docker Network                        │
-└─────────────────────────────────────────────────────────────────┘
+│                  (Internal Only)                         │
-
+├─────────────────────────────────────────────────────────┤
-6:00 AM Berlin → News Crawler
+│                                                           │
-                 ↓
+│  ┌──────────┐    ┌──────────┐    ┌──────────┐          │
-                 Fetches RSS feeds
+│  │ MongoDB  │◄───┤ Backend  │◄───┤ Crawler  │          │
-                 Extracts full content
+│  │ (27017)  │    │ (5001)   │    │          │          │
-                 Generates AI summaries
+│  └──────────┘    └────┬─────┘    └──────────┘          │
-                 Saves to MongoDB
+│                       │                                   │
-                 ↓
+│  ┌──────────┐         │           ┌──────────┐          │
-7:00 AM Berlin → Newsletter Sender
+│  │ Ollama   │◄────────┤           │ Sender   │          │
-                 ↓
+│  │ (11434)  │         │           │          │          │
-                 Waits for crawler
+│  └──────────┘         │           └──────────┘          │
-                 Fetches articles
+│                       │                                   │
-                 Generates newsletter
+└───────────────────────┼───────────────────────────────────┘
-                 Sends to subscribers
+                        │
-                 ↓
+                        │ Port 5001 (Only exposed port)
-                 ✅ Done!
+                        ▼
                   Host Machine
                   External Network
 ```
 ---
 ## Components
-### 1. MongoDB Database
+### 1. MongoDB (Database)
- **Purpose**: Central data storage
+- **Purpose**: Store articles, subscribers, tracking data
- **Collections**:
+- **Port**: 27017 (internal only)
-  - `articles`: News articles with summaries
+- **Access**: Only via Docker network
-  - `subscribers`: Email subscribers
+- **Authentication**: Username/password
  - `rss_feeds`: RSS feed sources
  - `newsletter_sends`: Email tracking data
  - `link_clicks`: Link click tracking
  - `subscriber_activity`: Engagement metrics
-### 2. News Crawler
+**Collections:**
- **Schedule**: Daily at 6:00 AM Berlin time
+- `articles` - News articles with summaries
- **Functions**:
+- `subscribers` - Newsletter subscribers
-  - Fetches articles from RSS feeds
+- `rss_feeds` - RSS feed sources
-  - Extracts full article content
+- `newsletter_sends` - Send tracking
-  - Generates AI summaries using Ollama
+- `link_clicks` - Click tracking
  - Saves to MongoDB
 - **Technology**: Python, BeautifulSoup, Ollama
-### 3. Newsletter Sender
+### 2. Backend API (Flask)
- **Schedule**: Daily at 7:00 AM Berlin time
+- **Purpose**: API endpoints, tracking, analytics
- **Functions**:
+- **Port**: 5001 (exposed to host)
-  - Waits for crawler to finish (max 30 min)
+- **Access**: Public API, admin endpoints
-  - Fetches today's articles
+- **Features**: Tracking pixels, click tracking, admin operations
  - Generates HTML newsletter
  - Injects tracking pixels
  - Sends to all subscribers
 - **Technology**: Python, Jinja2, SMTP
-### 4. Backend API (Optional)
+**Key Endpoints:**
- **Purpose**: Tracking and analytics
+- `/api/admin/*` - Admin operations
- **Endpoints**:
+- `/api/subscribe` - Subscribe to newsletter
-  - `/api/track/pixel/<id>` - Email open tracking
+- `/api/tracking/*` - Tracking endpoints
-  - `/api/track/click/<id>` - Link click tracking
+- `/health` - Health check
-  - `/api/analytics/*` - Engagement metrics
+
-  - `/api/tracking/*` - Privacy controls
+### 3. Ollama (AI Service)
- **Technology**: Flask, Python
+- **Purpose**: AI summarization and translation
 - **Port**: 11434 (internal only)
 - **Model**: phi3:latest (2.2GB)
 - **GPU**: Optional NVIDIA GPU support
 **Features:**
 - Article summarization (150 words)
 - Title translation (German → English)
 - Configurable timeout and model
 ### 4. Crawler (News Fetcher)
 - **Purpose**: Fetch and process news articles
 - **Schedule**: 6:00 AM Berlin time (automated)
 - **Features**: RSS parsing, content extraction, AI processing
 **Process:**
 1. Fetch RSS feeds
 2. Extract article content
 3. Translate title (German → English)
 4. Generate AI summary
 5. Store in MongoDB
 ### 5. Sender (Newsletter)
 - **Purpose**: Send newsletters to subscribers
 - **Schedule**: 7:00 AM Berlin time (automated)
 - **Features**: Email sending, tracking, templating
 **Process:**
 1. Fetch today's articles
 2. Generate newsletter HTML
 3. Add tracking pixels/links
 4. Send to active subscribers
 5. Record send events
 ---
 ## Data Flow
 ### Article Processing Flow
 ```
-RSS Feeds → Crawler → MongoDB → Sender → Subscribers
+RSS Feed
   ↓
-                    Backend API
+Crawler fetches
   ↓
-                    Analytics
+Extract content
   ↓
 Translate title (Ollama)
   ↓
 Generate summary (Ollama)
   ↓
 Store in MongoDB
   ↓
 Newsletter Sender
   ↓
 Email to subscribers
 ```
-## Coordination
+### Tracking Flow
-The sender waits for the crawler to ensure fresh content:
+```
 Newsletter sent
   ↓
 Tracking pixel embedded
   ↓
 User opens email
   ↓
 Pixel loaded → Backend API
   ↓
 Record open event
   ↓
 User clicks link
   ↓
 Redirect via Backend API
   ↓
 Record click event
   ↓
 Redirect to article
 ```
-1. Sender starts at 7:00 AM
+---
-2. Checks for recent articles every 30 seconds
+
-3. Maximum wait time: 30 minutes
+## Database Schema
-4. Proceeds once crawler finishes or timeout
+
 ### Articles Collection
 ```javascript
 {
  _id: ObjectId,
  title: String,              // Original German title
  title_en: String,           // English translation
  translated_at: Date,        // Translation timestamp
  link: String,
  summary: String,            // AI-generated summary
  content: String,            // Full article text
  author: String,
  source: String,             // RSS feed name
  published_at: Date,
  crawled_at: Date,
  created_at: Date
 }
 ```
 ### Subscribers Collection
 ```javascript
 {
  _id: ObjectId,
  email: String,              // Unique
  subscribed_at: Date,
  status: String              // 'active' or 'inactive'
 }
 ```
 ### RSS Feeds Collection
 ```javascript
 {
  _id: ObjectId,
  name: String,
  url: String,
  active: Boolean,
  last_crawled: Date
 }
 ```
 ---
 ## Security Architecture
 ### Network Isolation
 **Exposed Services:**
 - Backend API (port 5001) - Only exposed service
 **Internal Services:**
 - MongoDB (port 27017) - Not accessible from host
 - Ollama (port 11434) - Not accessible from host
 - Crawler - No ports
 - Sender - No ports
 **Benefits:**
 - 66% reduction in attack surface
 - Database protected from external access
 - AI service protected from abuse
 - Defense in depth
 ### Authentication
 **MongoDB:**
 - Username/password authentication
 - Credentials in environment variables
 - Internal network only
 **Backend API:**
 - No authentication (add in production)
 - Rate limiting recommended
 - IP whitelisting recommended
 ### Data Protection
 - Subscriber emails stored securely
 - No sensitive data in logs
 - Environment variables for secrets
 - `.env` file in `.gitignore`
 ---
 ## Technology Stack
- **Backend**: Python 3.11
+### Backend
 - **Language**: Python 3.11
 - **Framework**: Flask
 - **Database**: MongoDB 7.0
- **AI**: Ollama (Phi3 model)
+- **AI**: Ollama (phi3:latest)
 ### Infrastructure
 - **Containerization**: Docker & Docker Compose
 - **Networking**: Docker bridge network
 - **Storage**: Docker volumes
 - **Scheduling**: Python schedule library
 - **Email**: SMTP with HTML templates
 - **Tracking**: Pixel tracking + redirect URLs
 - **Infrastructure**: Docker & Docker Compose
-## Deployment
+### Libraries
 - **Web**: Flask, Flask-CORS
 - **Database**: pymongo
 - **Email**: smtplib, email.mime
 - **Scraping**: requests, BeautifulSoup4, feedparser
 - **Templating**: Jinja2
 - **AI**: requests (Ollama API)
-All components run in Docker containers:
+---
 ## Deployment Architecture
 ### Development
 ```
-docker-compose up -d
+Local Machine
 ├── Docker Compose
 │   ├── MongoDB (internal)
 │   ├── Ollama (internal)
 │   ├── Backend (exposed)
 │   ├── Crawler (internal)
 │   └── Sender (internal)
 └── .env file
 ```
-Containers:
+### Production
- `munich-news-mongodb` - Database
+```
- `munich-news-crawler` - Crawler service
+Server
- `munich-news-sender` - Sender service
+├── Reverse Proxy (nginx/Traefik)
 │   ├── SSL/TLS
 │   ├── Rate limiting
 │   └── Authentication
 ├── Docker Compose
 │   ├── MongoDB (internal)
 │   ├── Ollama (internal, GPU)
 │   ├── Backend (internal)
 │   ├── Crawler (internal)
 │   └── Sender (internal)
 ├── Monitoring
 │   ├── Logs
 │   ├── Metrics
 │   └── Alerts
 └── Backups
    ├── MongoDB dumps
    └── Configuration
 ```
-## Security
+---
 - MongoDB authentication enabled
 - Environment variables for secrets
 - HTTPS for tracking URLs (production)
 - GDPR-compliant data retention
 - Privacy controls (opt-out, deletion)
 ## Monitoring
 - Docker logs for all services
 - MongoDB for data verification
 - Health checks on containers
 - Engagement metrics via API
 ## Scalability
- Horizontal: Add more crawler instances
+### Current Limits
- Vertical: Increase container resources
+- Single server deployment
- Database: MongoDB sharding if needed
+- Sequential article processing
- Caching: Redis for API responses (future)
+- Single MongoDB instance
 - No load balancing
 ### Scaling Options
 **Horizontal Scaling:**
 - Multiple crawler instances
 - Load-balanced backend
 - MongoDB replica set
 - Distributed Ollama
 **Vertical Scaling:**
 - More CPU cores
 - More RAM
 - GPU acceleration (5-10x faster)
 - Faster storage
 **Optimization:**
 - Batch processing
 - Caching
 - Database indexing
 - Connection pooling
 ---
 ## Monitoring
 ### Health Checks
 ```bash
 # Backend health
 curl http://localhost:5001/health
 # MongoDB health
 docker-compose exec mongodb mongosh --eval "db.adminCommand('ping')"
 # Ollama health
 docker-compose exec ollama ollama list
 ```
 ### Metrics
 - Article count
 - Subscriber count
 - Newsletter open rate
 - Click-through rate
 - Processing time
 - Error rate
 ### Logs
 ```bash
 # All services
 docker-compose logs -f
 # Specific service
 docker-compose logs -f crawler
 docker-compose logs -f backend
 ```
 ---
 ## Backup & Recovery
 ### MongoDB Backup
 ```bash
 # Backup
 docker-compose exec mongodb mongodump --out /backup
 # Restore
 docker-compose exec mongodb mongorestore /backup
 ```
 ### Configuration Backup
 - `backend/.env` - Environment variables
 - `docker-compose.yml` - Service configuration
 - RSS feeds in MongoDB
 ### Recovery Plan
 1. Restore MongoDB from backup
 2. Restore configuration files
 3. Restart services
 4. Verify functionality
 ---
 ## Performance
 ### CPU Mode
 - Translation: ~1.5s per title
 - Summarization: ~8s per article
 - 10 articles: ~115s total
 - Suitable for <20 articles/day
 ### GPU Mode (5-10x faster)
 - Translation: ~0.3s per title
 - Summarization: ~2s per article
 - 10 articles: ~31s total
 - Suitable for high-volume processing
 ### Resource Usage
 **CPU Mode:**
 - CPU: 60-80%
 - RAM: 4-6GB
 - Disk: ~1GB (with model)
 **GPU Mode:**
 - CPU: 10-20%
 - RAM: 2-3GB
 - GPU: 80-100%
 - VRAM: 3-4GB
 - Disk: ~1GB (with model)
 ---
 ## Future Enhancements
 ### Planned Features
 - Frontend dashboard
 - Real-time analytics
 - Multiple languages
 - Custom RSS feeds per subscriber
 - A/B testing for newsletters
 - Advanced tracking
 ### Technical Improvements
 - Kubernetes deployment
 - Microservices architecture
 - Message queue (RabbitMQ/Redis)
 - Caching layer (Redis)
 - CDN for assets
 - Advanced monitoring (Prometheus/Grafana)
 ---
 See [SETUP.md](SETUP.md) for deployment guide and [SECURITY.md](SECURITY.md) for security best practices.
--- a/docs/BACKEND_STRUCTURE.md
+++ b/docs/BACKEND_STRUCTURE.md
@@ -1,106 +0,0 @@
 # Backend Structure
 The backend has been modularized for better maintainability and scalability.
 ## Directory Structure
 ```
 backend/
 ├── app.py                      # Main Flask application entry point
 ├── config.py                   # Configuration management
 ├── database.py                 # Database connection and initialization
 ├── requirements.txt            # Python dependencies
 ├── .env                        # Environment variables
 │
 ├── routes/                     # API route handlers (blueprints)
 │   ├── __init__.py
 │   ├── subscription_routes.py # /api/subscribe, /api/unsubscribe
 │   ├── news_routes.py          # /api/news, /api/stats
 │   ├── rss_routes.py           # /api/rss-feeds (CRUD operations)
 │   ├── ollama_routes.py        # /api/ollama/* (AI features)
 │   ├── tracking_routes.py      # /api/track/* (email tracking)
 │   └── analytics_routes.py     # /api/analytics/* (engagement metrics)
 │
 └── services/                   # Business logic layer
    ├── __init__.py
    ├── news_service.py         # News fetching and storage logic
    ├── email_service.py        # Newsletter email sending
    ├── ollama_service.py       # Ollama AI integration
    ├── tracking_service.py     # Email tracking (opens/clicks)
    └── analytics_service.py    # Engagement analytics
 ```
 ## Key Components
 ### app.py
 - Main Flask application
 - Registers all blueprints
 - Minimal code, just wiring things together
 ### config.py
 - Centralized configuration
 - Loads environment variables
 - Single source of truth for all settings
 ### database.py
 - MongoDB connection setup
 - Collection definitions
 - Database initialization with indexes
 ### routes/
 Each route file is a Flask Blueprint handling specific API endpoints:
 - **subscription_routes.py**: User subscription management
 - **news_routes.py**: News fetching and statistics
 - **rss_routes.py**: RSS feed management (add/remove/list/toggle)
 - **ollama_routes.py**: AI/Ollama integration endpoints
 - **tracking_routes.py**: Email tracking (pixel, click redirects, data deletion)
 - **analytics_routes.py**: Engagement analytics (open rates, click rates, subscriber activity)
 ### services/
 Business logic separated from route handlers:
 - **news_service.py**: Fetches news from RSS feeds, saves to database
 - **email_service.py**: Sends newsletter emails to subscribers
 - **ollama_service.py**: Communicates with Ollama AI server
 - **tracking_service.py**: Email tracking logic (tracking IDs, pixel generation, click logging)
 - **analytics_service.py**: Analytics calculations (open rates, click rates, activity classification)
 ## Benefits of This Structure
 1. **Separation of Concerns**: Routes handle HTTP, services handle business logic
 2. **Testability**: Each module can be tested independently
 3. **Maintainability**: Easy to find and modify specific functionality
 4. **Scalability**: Easy to add new routes or services
 5. **Reusability**: Services can be used by multiple routes
 ## Adding New Features
 ### To add a new API endpoint:
 1. Create a new route file in `routes/` or add to existing one
 2. Create a Blueprint and define routes
 3. Register the blueprint in `app.py`
 ### To add new business logic:
 1. Create a new service file in `services/`
 2. Import and use in your route handlers
 ### Example:
 ```python
 # services/my_service.py
 def my_business_logic():
    return "Hello"
 # routes/my_routes.py
 from flask import Blueprint
 from services.my_service import my_business_logic
 my_bp = Blueprint('my', __name__)
@my_bp.route('/api/my-endpoint')
 def my_endpoint():
    result = my_business_logic()
    return {'message': result}
 # app.py
 from routes.my_routes import my_bp
 app.register_blueprint(my_bp)
 ```
--- a/docs/CHANGELOG.md
+++ b/docs/CHANGELOG.md
@@ -1,176 +0,0 @@
 # Changelog
 ## [Unreleased] - 2024-11-10
 ### Added - Major Refactoring
 #### Backend Modularization
 - ✅ Restructured backend into modular architecture
 - ✅ Created separate route blueprints:
  - `subscription_routes.py` - User subscriptions
  - `news_routes.py` - News fetching and stats
  - `rss_routes.py` - RSS feed management (CRUD)
  - `ollama_routes.py` - AI integration
 - ✅ Created service layer:
  - `news_service.py` - News fetching logic
  - `email_service.py` - Newsletter sending
  - `ollama_service.py` - AI communication
 - ✅ Centralized configuration in `config.py`
 - ✅ Separated database logic in `database.py`
 - ✅ Reduced main `app.py` from 700+ lines to 27 lines
 #### RSS Feed Management
 - ✅ Dynamic RSS feed management via API
 - ✅ Add/remove/list/toggle RSS feeds without code changes
 - ✅ Unique index on RSS feed URLs (prevents duplicates)
 - ✅ Default feeds auto-initialized on first run
 - ✅ Created `fix_duplicates.py` utility script
 #### News Crawler Microservice
 - ✅ Created standalone `news_crawler/` microservice
 - ✅ Web scraping with BeautifulSoup
 - ✅ Smart content extraction using multiple selectors
 - ✅ Full article content storage in MongoDB
 - ✅ Word count calculation
 - ✅ Duplicate prevention (skips already-crawled articles)
 - ✅ Rate limiting (1 second between requests)
 - ✅ Can run independently or scheduled
 - ✅ Docker support for crawler
 - ✅ Comprehensive documentation
 #### API Endpoints
 New endpoints added:
 - `GET /api/rss-feeds` - List all RSS feeds
 - `POST /api/rss-feeds` - Add new RSS feed
 - `DELETE /api/rss-feeds/<id>` - Remove RSS feed
 - `PATCH /api/rss-feeds/<id>/toggle` - Toggle feed active status
 #### Documentation
 - ✅ Created `ARCHITECTURE.md` - System architecture overview
 - ✅ Created `backend/STRUCTURE.md` - Backend structure guide
 - ✅ Created `news_crawler/README.md` - Crawler documentation
 - ✅ Created `news_crawler/QUICKSTART.md` - Quick start guide
 - ✅ Created `news_crawler/test_crawler.py` - Test suite
 - ✅ Updated main `README.md` with new features
 - ✅ Updated `DATABASE_SCHEMA.md` with new fields
 #### Configuration
 - ✅ Added `FLASK_PORT` environment variable
 - ✅ Fixed `OLLAMA_MODEL` typo in `.env`
 - ✅ Port 5001 default to avoid macOS AirPlay conflict
 ### Changed
 - Backend structure: Monolithic → Modular
 - RSS feeds: Hardcoded → Database-driven
 - Article storage: Summary only → Full content support
 - Configuration: Scattered → Centralized
 ### Technical Improvements
 - Separation of concerns (routes vs services)
 - Better testability
 - Easier maintenance
 - Scalable architecture
 - Independent microservices
 - Proper error handling
 - Comprehensive logging
 ### Database Schema Updates
 Articles collection now includes:
 - `full_content` - Full article text
 - `word_count` - Number of words
 - `crawled_at` - When content was crawled
 RSS Feeds collection added:
 - `name` - Feed name
 - `url` - Feed URL (unique)
 - `active` - Active status
 - `created_at` - Creation timestamp
 ### Files Added
 ```
 backend/
 ├── config.py
 ├── database.py
 ├── fix_duplicates.py
 ├── STRUCTURE.md
 ├── routes/
 │   ├── __init__.py
 │   ├── subscription_routes.py
 │   ├── news_routes.py
 │   ├── rss_routes.py
 │   └── ollama_routes.py
 └── services/
    ├── __init__.py
    ├── news_service.py
    ├── email_service.py
    └── ollama_service.py
 news_crawler/
 ├── crawler_service.py
 ├── test_crawler.py
 ├── requirements.txt
 ├── .gitignore
 ├── Dockerfile
 ├── docker-compose.yml
 ├── README.md
 └── QUICKSTART.md
 Root:
 ├── ARCHITECTURE.md
 └── CHANGELOG.md
 ```
 ### Files Removed
 - Old monolithic `backend/app.py` (replaced with modular version)
 ### Next Steps (Future Enhancements)
 - [ ] Frontend UI for RSS feed management
 - [ ] Automatic article summarization with Ollama
 - [ ] Scheduled newsletter sending
 - [ ] Article categorization and tagging
 - [ ] Search functionality
 - [ ] User preferences (categories, frequency)
 - [ ] Analytics dashboard
 - [ ] API rate limiting
 - [ ] Caching layer (Redis)
 - [ ] Message queue for crawler (Celery)
 ---
 ## Recent Updates (November 2025)
 ### Security Improvements
 - **MongoDB Internal-Only**: Removed port exposure, only accessible via Docker network
 - **Ollama Internal-Only**: Removed port exposure, only accessible via Docker network
 - **Reduced Attack Surface**: Only Backend API (port 5001) exposed to host
 - **Network Isolation**: All services communicate via internal Docker network
 ### Ollama Integration
 - **Docker Compose Integration**: Ollama service runs alongside other services
 - **Automatic Model Download**: phi3:latest model downloaded on first startup
 - **GPU Support**: NVIDIA GPU acceleration with automatic detection
 - **Helper Scripts**: `start-with-gpu.sh`, `check-gpu.sh`, `configure-ollama.sh`
 - **Performance**: 5-10x faster with GPU acceleration
 ### API Enhancements
 - **Send Newsletter Endpoint**: `/api/admin/send-newsletter` to send to all active subscribers
 - **Subscriber Status Fix**: Fixed stats endpoint to correctly count active subscribers
 - **Better Error Handling**: Improved error messages and validation
 ### Documentation
 - **Consolidated Documentation**: Moved all docs to `docs/` directory
 - **Security Guide**: Comprehensive security documentation
 - **GPU Setup Guide**: Detailed GPU acceleration setup
 - **MongoDB Connection Guide**: Connection configuration explained
 - **Subscriber Status Guide**: How subscriber status system works
 ### Configuration
 - **MongoDB URI**: Updated to use Docker service name (`mongodb` instead of `localhost`)
 - **Ollama URL**: Configured for internal Docker network (`http://ollama:11434`)
 - **Single .env File**: All configuration in `backend/.env`
 ### Testing
 - **Connectivity Tests**: `test-mongodb-connectivity.sh`
 - **Ollama Tests**: `test-ollama-setup.sh`
 - **Newsletter API Tests**: `test-newsletter-api.sh`
--- a/docs/CRAWLER_HOW_IT_WORKS.md
+++ b/docs/CRAWLER_HOW_IT_WORKS.md
@@ -1,306 +0,0 @@
 # How the News Crawler Works
 ## 🎯 Overview
 The crawler dynamically extracts article metadata from any website using multiple fallback strategies.
 ## 📊 Flow Diagram
 ```
 RSS Feed URL
    ↓
 Parse RSS Feed
    ↓
 For each article link:
    ↓
 ┌─────────────────────────────────────┐
 │  1. Fetch HTML Page                 │
 │     GET https://example.com/article │
 └─────────────────────────────────────┘
    ↓
 ┌─────────────────────────────────────┐
 │  2. Parse with BeautifulSoup        │
 │     soup = BeautifulSoup(html)      │
 └─────────────────────────────────────┘
    ↓
 ┌─────────────────────────────────────┐
 │  3. Clean HTML                      │
 │     Remove: scripts, styles, nav,   │
 │     footer, header, ads             │
 └─────────────────────────────────────┘
    ↓
 ┌─────────────────────────────────────┐
 │  4. Extract Title                   │
 │     Try: H1 → OG meta → Twitter →   │
 │     Title tag                       │
 └─────────────────────────────────────┘
    ↓
 ┌─────────────────────────────────────┐
 │  5. Extract Author                  │
 │     Try: Meta author → rel=author → │
 │     Class names → JSON-LD           │
 └─────────────────────────────────────┘
    ↓
 ┌─────────────────────────────────────┐
 │  6. Extract Date                    │
 │     Try: <time> → Meta tags →       │
 │     Class names → JSON-LD           │
 └─────────────────────────────────────┘
    ↓
 ┌─────────────────────────────────────┐
 │  7. Extract Content                 │
 │     Try: <article> → Class names →  │
 │     <main> → <body>                 │
 │     Filter short paragraphs         │
 └─────────────────────────────────────┘
    ↓
 ┌─────────────────────────────────────┐
 │  8. Save to MongoDB                 │
 │     {                               │
 │       title, author, date,          │
 │       content, word_count           │
 │     }                               │
 └─────────────────────────────────────┘
    ↓
 Wait 1 second (rate limiting)
    ↓
 Next article
 ```
 ## 🔍 Detailed Example
 ### Input: RSS Feed Entry
 ```xml
 <item>
  <title>New U-Bahn Line Opens</title>
  <link>https://www.sueddeutsche.de/muenchen/article-123</link>
  <pubDate>Mon, 10 Nov 2024 10:00:00 +0100</pubDate>
 </item>
 ```
 ### Step 1: Fetch HTML
 ```python
 url = "https://www.sueddeutsche.de/muenchen/article-123"
 response = requests.get(url)
 html = response.content
 ```
 ### Step 2: Parse HTML
 ```python
 soup = BeautifulSoup(html, 'html.parser')
 ```
 ### Step 3: Extract Title
 ```python
 # Try H1
 h1 = soup.find('h1')
 # Result: "New U-Bahn Line Opens in Munich"
 # If no H1, try OG meta
 og_title = soup.find('meta', property='og:title')
 # Fallback chain continues...
 ```
 ### Step 4: Extract Author
 ```python
 # Try meta author
 meta_author = soup.find('meta', name='author')
 # Result: None
 # Try class names
 author_elem = soup.select_one('[class*="author"]')
 # Result: "Max Mustermann"
 ```
 ### Step 5: Extract Date
 ```python
 # Try time tag
 time_tag = soup.find('time')
 # Result: "2024-11-10T10:00:00Z"
 ```
 ### Step 6: Extract Content
 ```python
 # Try article tag
 article = soup.find('article')
 paragraphs = article.find_all('p')
 # Filter paragraphs
 content = []
 for p in paragraphs:
    text = p.get_text().strip()
    if len(text) >= 50:  # Keep substantial paragraphs
        content.append(text)
 full_content = '\n\n'.join(content)
 # Result: "The new U-Bahn line connecting the city center..."
 ```
 ### Step 7: Save to Database
 ```python
 article_doc = {
    'title': 'New U-Bahn Line Opens in Munich',
    'author': 'Max Mustermann',
    'link': 'https://www.sueddeutsche.de/muenchen/article-123',
    'summary': 'Short summary from RSS...',
    'full_content': 'The new U-Bahn line connecting...',
    'word_count': 1250,
    'source': 'Süddeutsche Zeitung München',
    'published_at': '2024-11-10T10:00:00Z',
    'crawled_at': datetime.utcnow(),
    'created_at': datetime.utcnow()
 }
 db.articles.update_one(
    {'link': article_url},
    {'$set': article_doc},
    upsert=True
 )
 ```
 ## 🎨 What Makes It "Dynamic"?
 ### Traditional Approach (Hardcoded)
 ```python
 # Only works for one specific site
 title = soup.find('h1', class_='article-title').text
 author = soup.find('span', class_='author-name').text
 ```
 ❌ Breaks when site changes
 ❌ Doesn't work on other sites
 ### Our Approach (Dynamic)
 ```python
 # Works on ANY site
 title = extract_title(soup)  # Tries 4 different methods
 author = extract_author(soup)  # Tries 5 different methods
 ```
 ✅ Adapts to different HTML structures
 ✅ Falls back to alternatives
 ✅ Works across multiple sites
 ## 🛡️ Robustness Features
 ### 1. Multiple Strategies
 Each field has 4-6 extraction strategies
 ```python
 def extract_title(soup):
    # Try strategy 1
    if h1 := soup.find('h1'):
        return h1.text
    # Try strategy 2
    if og_title := soup.find('meta', property='og:title'):
        return og_title['content']
    # Try strategy 3...
    # Try strategy 4...
 ```
 ### 2. Validation
 ```python
 # Title must be reasonable length
 if title and len(title) > 10:
    return title
 # Author must be < 100 chars
 if author and len(author) < 100:
    return author
 ```
 ### 3. Cleaning
 ```python
 # Remove site name from title
 if ' | ' in title:
    title = title.split(' | ')[0]
 # Remove "By" from author
 author = author.replace('By ', '').strip()
 ```
 ### 4. Error Handling
 ```python
 try:
    data = extract_article_content(url)
 except Timeout:
    print("Timeout - skip")
 except RequestException:
    print("Network error - skip")
 except Exception:
    print("Unknown error - skip")
 ```
 ## 📈 Success Metrics
 After crawling, you'll see:
 ```
 📰 Crawling feed: Süddeutsche Zeitung München
   🔍 Crawling: New U-Bahn Line Opens...
   ✓ Saved (1250 words)
   Title: ✓ Found
   Author: ✓ Found (Max Mustermann)
   Date: ✓ Found (2024-11-10T10:00:00Z)
   Content: ✓ Found (1250 words)
 ```
 ## 🗄️ Database Result
 **Before Crawling:**
 ```javascript
 {
  title: "New U-Bahn Line Opens",
  link: "https://example.com/article",
  summary: "Short RSS summary...",
  source: "Süddeutsche Zeitung"
 }
 ```
 **After Crawling:**
 ```javascript
 {
  title: "New U-Bahn Line Opens in Munich",  // ← Enhanced
  author: "Max Mustermann",                   // ← NEW!
  link: "https://example.com/article",
  summary: "Short RSS summary...",
  full_content: "The new U-Bahn line...",    // ← NEW! (1250 words)
  word_count: 1250,                           // ← NEW!
  source: "Süddeutsche Zeitung",
  published_at: "2024-11-10T10:00:00Z",      // ← Enhanced
  crawled_at: ISODate("2024-11-10T16:30:00Z"), // ← NEW!
  created_at: ISODate("2024-11-10T16:00:00Z")
 }
 ```
 ## 🚀 Running the Crawler
 ```bash
 cd news_crawler
 pip install -r requirements.txt
 python crawler_service.py 10
 ```
 Output:
 ```
 ============================================================
 🚀 Starting RSS Feed Crawler
 ============================================================
 Found 3 active feed(s)
 📰 Crawling feed: Süddeutsche Zeitung München
   🔍 Crawling: New U-Bahn Line Opens...
   ✓ Saved (1250 words)
   🔍 Crawling: Munich Weather Update...
   ✓ Saved (450 words)
   ✓ Crawled 2 articles
 ============================================================
 ✓ Crawling Complete!
  Total feeds processed: 3
  Total articles crawled: 15
  Duration: 45.23 seconds
 ============================================================
 ```
 Now you have rich, structured article data ready for AI processing! 🎉
--- a/docs/DATABASE_SCHEMA.md
+++ b/docs/DATABASE_SCHEMA.md
@@ -1,336 +0,0 @@
 # MongoDB Database Schema
 This document describes the MongoDB collections and their structure for Munich News Daily.
 ## Collections
 ### 1. Articles Collection (`articles`)
 Stores all news articles aggregated from Munich news sources.
 **Document Structure:**
 ```javascript
 {
  _id: ObjectId,                    // Auto-generated MongoDB ID
  title: String,                    // Article title (required)
  author: String,                   // Article author (optional, extracted during crawl)
  link: String,                     // Article URL (required, unique)
  content: String,                  // Full article content (no length limit)
  summary: String,                  // AI-generated English summary (≤150 words)
  word_count: Number,               // Word count of full content
  summary_word_count: Number,       // Word count of AI summary
  source: String,                   // News source name (e.g., "Süddeutsche Zeitung München")
  published_at: String,             // Original publication date from RSS feed or crawled
  crawled_at: DateTime,             // When article content was crawled (UTC)
  summarized_at: DateTime,          // When AI summary was generated (UTC)
  created_at: DateTime              // When article was added to database (UTC)
 }
 ```
 **Indexes:**
 - `link` - Unique index to prevent duplicate articles
 - `created_at` - Index for efficient sorting by date
 **Example Document:**
 ```javascript
 {
  _id: ObjectId("507f1f77bcf86cd799439011"),
  title: "New U-Bahn Line Opens in Munich",
  author: "Max Mustermann",
  link: "https://www.sueddeutsche.de/muenchen/ubahn-1.123456",
  content: "The new U-Bahn line connecting the city center with the airport opened today. Mayor Dieter Reiter attended the opening ceremony... [full article text continues]",
  summary: "Munich's new U-Bahn line connecting the city center to the airport opened today with Mayor Dieter Reiter in attendance. The line features 10 stations and runs every 10 minutes during peak hours, significantly reducing travel time. Construction took five years and cost approximately 2 billion euros.",
  word_count: 1250,
  summary_word_count: 48,
  source: "Süddeutsche Zeitung München",
  published_at: "Mon, 15 Jan 2024 10:00:00 +0100",
  crawled_at: ISODate("2024-01-15T09:30:00.000Z"),
  summarized_at: ISODate("2024-01-15T09:30:15.000Z"),
  created_at: ISODate("2024-01-15T09:00:00.000Z")
 }
 ```
 ### 2. Subscribers Collection (`subscribers`)
 Stores all newsletter subscribers.
 **Document Structure:**
 ```javascript
 {
  _id: ObjectId,                    // Auto-generated MongoDB ID
  email: String,                    // Subscriber email (required, unique, lowercase)
  subscribed_at: DateTime,          // When user subscribed (UTC)
  status: String                    // Subscription status: 'active' or 'inactive'
 }
 ```
 **Indexes:**
 - `email` - Unique index for email lookups and preventing duplicates
 - `subscribed_at` - Index for analytics and sorting
 **Example Document:**
 ```javascript
 {
  _id: ObjectId("507f1f77bcf86cd799439012"),
  email: "user@example.com",
  subscribed_at: ISODate("2024-01-15T08:30:00.000Z"),
  status: "active"
 }
 ```
 ### 3. Newsletter Sends Collection (`newsletter_sends`)
 Tracks each newsletter sent to each subscriber for email open tracking.
 **Document Structure:**
 ```javascript
 {
  _id: ObjectId,                    // Auto-generated MongoDB ID
  newsletter_id: String,            // Unique ID for this newsletter batch (date-based)
  subscriber_email: String,         // Recipient email
  tracking_id: String,              // Unique tracking ID for this send (UUID)
  sent_at: DateTime,                // When email was sent (UTC)
  opened: Boolean,                  // Whether email was opened
  first_opened_at: DateTime,        // First open timestamp (null if not opened)
  last_opened_at: DateTime,         // Most recent open timestamp
  open_count: Number,               // Number of times opened
  created_at: DateTime              // Record creation time (UTC)
 }
 ```
 **Indexes:**
 - `tracking_id` - Unique index for fast pixel request lookups
 - `newsletter_id` - Index for analytics queries
 - `subscriber_email` - Index for user activity queries
 - `sent_at` - Index for time-based queries
 **Example Document:**
 ```javascript
 {
  _id: ObjectId("507f1f77bcf86cd799439013"),
  newsletter_id: "2024-01-15",
  subscriber_email: "user@example.com",
  tracking_id: "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  sent_at: ISODate("2024-01-15T08:00:00.000Z"),
  opened: true,
  first_opened_at: ISODate("2024-01-15T09:30:00.000Z"),
  last_opened_at: ISODate("2024-01-15T14:20:00.000Z"),
  open_count: 3,
  created_at: ISODate("2024-01-15T08:00:00.000Z")
 }
 ```
 ### 4. Link Clicks Collection (`link_clicks`)
 Tracks individual link clicks from newsletters.
 **Document Structure:**
 ```javascript
 {
  _id: ObjectId,                    // Auto-generated MongoDB ID
  tracking_id: String,              // Unique tracking ID for this link (UUID)
  newsletter_id: String,            // Which newsletter this link was in
  subscriber_email: String,         // Who clicked
  article_url: String,              // Original article URL
  article_title: String,            // Article title for reporting
  clicked_at: DateTime,             // When link was clicked (UTC)
  user_agent: String,               // Browser/client info
  created_at: DateTime              // Record creation time (UTC)
 }
 ```
 **Indexes:**
 - `tracking_id` - Unique index for fast redirect request lookups
 - `newsletter_id` - Index for analytics queries
 - `article_url` - Index for article performance queries
 - `subscriber_email` - Index for user activity queries
 **Example Document:**
 ```javascript
 {
  _id: ObjectId("507f1f77bcf86cd799439014"),
  tracking_id: "b2c3d4e5-f6a7-8901-bcde-f12345678901",
  newsletter_id: "2024-01-15",
  subscriber_email: "user@example.com",
  article_url: "https://www.sueddeutsche.de/muenchen/ubahn-1.123456",
  article_title: "New U-Bahn Line Opens in Munich",
  clicked_at: ISODate("2024-01-15T09:35:00.000Z"),
  user_agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
  created_at: ISODate("2024-01-15T09:35:00.000Z")
 }
 ```
 ### 5. Subscriber Activity Collection (`subscriber_activity`)
 Aggregated activity status for each subscriber.
 **Document Structure:**
 ```javascript
 {
  _id: ObjectId,                    // Auto-generated MongoDB ID
  email: String,                    // Subscriber email (unique)
  status: String,                   // 'active', 'inactive', or 'dormant'
  last_opened_at: DateTime,         // Most recent email open (UTC)
  last_clicked_at: DateTime,        // Most recent link click (UTC)
  total_opens: Number,              // Lifetime open count
  total_clicks: Number,             // Lifetime click count
  newsletters_received: Number,     // Total newsletters sent
  newsletters_opened: Number,       // Total newsletters opened
  updated_at: DateTime              // Last status update (UTC)
 }
 ```
 **Indexes:**
 - `email` - Unique index for fast lookups
 - `status` - Index for filtering by activity level
 - `last_opened_at` - Index for time-based queries
 **Activity Status Classification:**
 - **active**: Opened an email in the last 30 days
 - **inactive**: No opens in 30-60 days
 - **dormant**: No opens in 60+ days
 **Example Document:**
 ```javascript
 {
  _id: ObjectId("507f1f77bcf86cd799439015"),
  email: "user@example.com",
  status: "active",
  last_opened_at: ISODate("2024-01-15T09:30:00.000Z"),
  last_clicked_at: ISODate("2024-01-15T09:35:00.000Z"),
  total_opens: 45,
  total_clicks: 23,
  newsletters_received: 60,
  newsletters_opened: 45,
  updated_at: ISODate("2024-01-15T10:00:00.000Z")
 }
 ```
 ## Design Decisions
 ### Why MongoDB?
 1. **Flexibility**: Easy to add new fields without schema migrations
 2. **Scalability**: Handles large volumes of articles and subscribers efficiently
 3. **Performance**: Indexes on frequently queried fields (link, email, created_at)
 4. **Document Model**: Natural fit for news articles and subscriber data
 ### Schema Choices
 1. **Unique Link Index**: Prevents duplicate articles from being stored, even if fetched multiple times
 2. **Status Field**: Soft delete for subscribers (set to 'inactive' instead of deleting) - allows for analytics and easy re-subscription
 3. **UTC Timestamps**: All dates stored in UTC for consistency across timezones
 4. **Lowercase Emails**: Emails stored in lowercase to prevent case-sensitivity issues
 ### Future Enhancements
 Potential fields to add in the future:
 **Articles:**
 - `category`: String (e.g., "politics", "sports", "culture")
 - `tags`: Array of Strings
 - `image_url`: String
 - `sent_in_newsletter`: Boolean (track if article was sent)
 - `sent_at`: DateTime (when article was included in newsletter)
 **Subscribers:**
 - `preferences`: Object (newsletter frequency, categories, etc.)
 - `last_sent_at`: DateTime (last newsletter sent date)
 - `unsubscribed_at`: DateTime (when user unsubscribed)
 - `verification_token`: String (for email verification)
 ## AI Summarization Workflow
 When the crawler processes an article:
 1. **Extract Content**: Full article text is extracted from the webpage
 2. **Summarize with Ollama**: If `OLLAMA_ENABLED=true`, the content is sent to Ollama for summarization
 3. **Store Both**: Both the original `content` and AI-generated `summary` are stored
 4. **Fallback**: If Ollama is unavailable or fails, only the original content is stored
 ### Summary Field Details
 - **Language**: Always in English, regardless of source article language
 - **Length**: Maximum 150 words
 - **Format**: Plain text, concise and clear
 - **Purpose**: Quick preview for newsletters and frontend display
 ### Querying Articles
 ```javascript
 // Get articles with AI summaries
 db.articles.find({ summary: { $exists: true, $ne: null } })
 // Get articles without summaries
 db.articles.find({ summary: { $exists: false } })
 // Count summarized articles
 db.articles.countDocuments({ summary: { $exists: true, $ne: null } })
 ```
 ---
 ## MongoDB Connection Configuration
 ### Docker Compose Setup
 **Connection URI:**
 ```env
 MONGODB_URI=mongodb://admin:changeme@mongodb:27017/
 ```
 **Key Points:**
 - Uses `mongodb` (Docker service name), not `localhost`
 - Includes authentication credentials
 - Only works inside Docker network
 - Port 27017 is NOT exposed to host (internal only)
 ### Why 'mongodb' Instead of 'localhost'?
 **Inside Docker containers:**
 ```
 Container → mongodb:27017 ✅ Works (Docker DNS)
 Container → localhost:27017 ❌ Fails (localhost = container itself)
 ```
 **From host machine:**
 ```
 Host → localhost:27017 ❌ Blocked (port not exposed)
 Host → mongodb:27017 ❌ Fails (DNS only works in Docker)
 ```
 ### Connection Priority
 1. **Docker Compose environment variables** (highest)
 2. **.env file** (fallback)
 3. **Code defaults** (lowest)
 ### Testing Connection
 ```bash
 # From backend
 docker-compose exec backend python -c "
 from database import articles_collection
 print(f'Articles: {articles_collection.count_documents({})}')
 "
 # From crawler
 docker-compose exec crawler python -c "
 from pymongo import MongoClient
 from config import Config
 client = MongoClient(Config.MONGODB_URI)
 print(f'MongoDB version: {client.server_info()[\"version\"]}')
 "
 ```
 ### Security
 - ✅ MongoDB is internal-only (not exposed to host)
 - ✅ Uses authentication (username/password)
 - ✅ Only accessible via Docker network
 - ✅ Cannot be accessed from external network
 See [SECURITY_NOTES.md](SECURITY_NOTES.md) for more security details.
--- a/docs/DOCUMENTATION_CLEANUP.md
+++ b/docs/DOCUMENTATION_CLEANUP.md
@@ -1,204 +0,0 @@
 # Documentation Cleanup Summary
 ## What Was Done
 Consolidated and organized all markdown documentation files.
 ## Before
 **Root Level:** 14 markdown files (cluttered)
 ```
 README.md
 QUICKSTART.md
 CONTRIBUTING.md
 IMPLEMENTATION_SUMMARY.md
 MONGODB_CONNECTION_EXPLAINED.md
 NETWORK_SECURITY_SUMMARY.md
 NEWSLETTER_API_UPDATE.md
 OLLAMA_GPU_SUMMARY.md
 OLLAMA_INTEGRATION.md
 QUICK_START_GPU.md
 SECURITY_IMPROVEMENTS.md
 SECURITY_UPDATE.md
 FINAL_STRUCTURE.md (outdated)
 PROJECT_STRUCTURE.md (redundant)
 ```
 **docs/:** 18 files (organized but some content duplicated)
 ## After
 **Root Level:** 3 essential files (clean)
 ```
 README.md           - Main entry point
 QUICKSTART.md       - Quick setup guide
 CONTRIBUTING.md     - Contribution guidelines
 ```
 **docs/:** 19 files (organized, consolidated, no duplication)
 ```
 INDEX.md                    - Documentation index (NEW)
 ADMIN_API.md               - Admin API (consolidated)
 API.md
 ARCHITECTURE.md
 BACKEND_STRUCTURE.md
 CHANGELOG.md               - Updated with recent changes
 CRAWLER_HOW_IT_WORKS.md
 DATABASE_SCHEMA.md         - Added MongoDB connection info
 DEPLOYMENT.md
 EXTRACTION_STRATEGIES.md
 GPU_SETUP.md               - Consolidated GPU docs
 OLLAMA_SETUP.md            - Consolidated Ollama docs
 OLD_ARCHITECTURE.md
 PERFORMANCE_COMPARISON.md
 QUICK_REFERENCE.md
 RSS_URL_EXTRACTION.md
 SECURITY_NOTES.md          - Consolidated all security docs
 SUBSCRIBER_STATUS.md
 SYSTEM_ARCHITECTURE.md
 ```
 ## Changes Made
 ### 1. Deleted Redundant Files
 - ❌ `FINAL_STRUCTURE.md` (outdated)
 - ❌ `PROJECT_STRUCTURE.md` (redundant with README)
 ### 2. Merged into docs/SECURITY_NOTES.md
 - ✅ `SECURITY_UPDATE.md` (Ollama security)
 - ✅ `SECURITY_IMPROVEMENTS.md` (Network isolation)
 - ✅ `NETWORK_SECURITY_SUMMARY.md` (Port exposure summary)
 ### 3. Merged into docs/GPU_SETUP.md
 - ✅ `OLLAMA_GPU_SUMMARY.md` (GPU implementation summary)
 - ✅ `QUICK_START_GPU.md` (Quick start commands)
 ### 4. Merged into docs/OLLAMA_SETUP.md
 - ✅ `OLLAMA_INTEGRATION.md` (Integration details)
 ### 5. Merged into docs/ADMIN_API.md
 - ✅ `NEWSLETTER_API_UPDATE.md` (Newsletter endpoint)
 ### 6. Merged into docs/DATABASE_SCHEMA.md
 - ✅ `MONGODB_CONNECTION_EXPLAINED.md` (Connection config)
 ### 7. Merged into docs/CHANGELOG.md
 - ✅ `IMPLEMENTATION_SUMMARY.md` (Recent updates)
 ### 8. Created New Files
 - ✨ `docs/INDEX.md` - Complete documentation index
 ### 9. Updated Existing Files
 - 📝 `README.md` - Added documentation section
 - 📝 `docs/CHANGELOG.md` - Added recent updates
 - 📝 `docs/SECURITY_NOTES.md` - Comprehensive security guide
 - 📝 `docs/GPU_SETUP.md` - Complete GPU guide
 - 📝 `docs/OLLAMA_SETUP.md` - Complete Ollama guide
 - 📝 `docs/ADMIN_API.md` - Complete API reference
 - 📝 `docs/DATABASE_SCHEMA.md` - Added connection info
 ## Benefits
 ### 1. Cleaner Root Directory
 - Only 3 essential files visible
 - Easier to navigate
 - Professional appearance
 ### 2. Better Organization
 - All technical docs in `docs/`
 - Logical grouping by topic
 - Easy to find information
 ### 3. No Duplication
 - Consolidated related content
 - Single source of truth
 - Easier to maintain
 ### 4. Improved Discoverability
 - Documentation index (`docs/INDEX.md`)
 - Clear navigation
 - Quick links by task
 ### 5. Better Maintenance
 - Fewer files to update
 - Related content together
 - Clear structure
 ## Documentation Structure
 ```
 project/
 ├── README.md                    # Main entry point
 ├── QUICKSTART.md               # Quick setup
 ├── CONTRIBUTING.md             # How to contribute
 │
 └── docs/                       # All technical documentation
    ├── INDEX.md                # Documentation index
    │
    ├── Setup & Configuration
    │   ├── OLLAMA_SETUP.md
    │   ├── GPU_SETUP.md
    │   └── DEPLOYMENT.md
    │
    ├── API Documentation
    │   ├── ADMIN_API.md
    │   ├── API.md
    │   └── SUBSCRIBER_STATUS.md
    │
    ├── Architecture
    │   ├── SYSTEM_ARCHITECTURE.md
    │   ├── ARCHITECTURE.md
    │   ├── DATABASE_SCHEMA.md
    │   └── BACKEND_STRUCTURE.md
    │
    ├── Features
    │   ├── CRAWLER_HOW_IT_WORKS.md
    │   ├── EXTRACTION_STRATEGIES.md
    │   ├── RSS_URL_EXTRACTION.md
    │   └── PERFORMANCE_COMPARISON.md
    │
    ├── Security
    │   └── SECURITY_NOTES.md
    │
    └── Reference
        ├── CHANGELOG.md
        └── QUICK_REFERENCE.md
 ```
 ## Quick Access
 ### For Users
 - Start here: [README.md](README.md)
 - Quick setup: [QUICKSTART.md](QUICKSTART.md)
 - All docs: [docs/INDEX.md](docs/INDEX.md)
 ### For Developers
 - Architecture: [docs/SYSTEM_ARCHITECTURE.md](docs/SYSTEM_ARCHITECTURE.md)
 - API Reference: [docs/ADMIN_API.md](docs/ADMIN_API.md)
 - Contributing: [CONTRIBUTING.md](CONTRIBUTING.md)
 ### For DevOps
 - Deployment: [docs/DEPLOYMENT.md](docs/DEPLOYMENT.md)
 - Security: [docs/SECURITY_NOTES.md](docs/SECURITY_NOTES.md)
 - GPU Setup: [docs/GPU_SETUP.md](docs/GPU_SETUP.md)
 ## Statistics
 - **Files Deleted:** 11 redundant markdown files
 - **Files Merged:** 9 files consolidated into existing docs
 - **Files Created:** 1 new index file
 - **Files Updated:** 7 existing files enhanced
 - **Root Level:** Reduced from 14 to 3 files (79% reduction)
 - **Total Docs:** 19 well-organized files in docs/
 ## Result
 ✅ Clean, professional documentation structure
 ✅ Easy to navigate and find information
 ✅ No duplication or redundancy
 ✅ Better maintainability
 ✅ Improved user experience
 ---
 This cleanup makes the project more professional and easier to use!
--- a/docs/EXTRACTION_STRATEGIES.md
+++ b/docs/EXTRACTION_STRATEGIES.md
@@ -1,353 +0,0 @@
 # Content Extraction Strategies
 The crawler uses multiple strategies to dynamically extract article metadata from any website.
 ## 🎯 What Gets Extracted
 1. **Title** - Article headline
 2. **Author** - Article writer/journalist
 3. **Published Date** - When article was published
 4. **Content** - Main article text
 5. **Description** - Meta description/summary
 ## 📋 Extraction Strategies
 ### 1. Title Extraction
 Tries multiple methods in order of reliability:
 #### Strategy 1: H1 Tag
 ```html
 <h1>Article Title Here</h1>
 ```
 ✅ Most reliable - usually the main headline
 #### Strategy 2: Open Graph Meta Tag
 ```html
 <meta property="og:title" content="Article Title Here" />
 ```
 ✅ Used by Facebook, very reliable
 #### Strategy 3: Twitter Card Meta Tag
 ```html
 <meta name="twitter:title" content="Article Title Here" />
 ```
 ✅ Used by Twitter, reliable
 #### Strategy 4: Title Tag (Fallback)
 ```html
 <title>Article Title | Site Name</title>
 ```
 ⚠️ Often includes site name, needs cleaning
 **Cleaning:**
 - Removes " | Site Name"
 - Removes " - Site Name"
 ---
 ### 2. Author Extraction
 Tries multiple methods:
 #### Strategy 1: Meta Author Tag
 ```html
 <meta name="author" content="John Doe" />
 ```
 ✅ Standard HTML meta tag
 #### Strategy 2: Rel="author" Link
 ```html
 <a rel="author" href="/author/john-doe">John Doe</a>
 ```
 ✅ Semantic HTML
 #### Strategy 3: Common Class Names
 ```html
 <div class="author-name">John Doe</div>
 <span class="byline">By John Doe</span>
 <p class="writer">John Doe</p>
 ```
 ✅ Searches for: author-name, author, byline, writer
 #### Strategy 4: Schema.org Markup
 ```html
 <span itemprop="author">John Doe</span>
 ```
 ✅ Structured data
 #### Strategy 5: JSON-LD Structured Data
 ```html
 <script type="application/ld+json">
 {
  "@type": "NewsArticle",
  "author": {
    "@type": "Person",
    "name": "John Doe"
  }
 }
 </script>
 ```
 ✅ Most structured, very reliable
 **Cleaning:**
 - Removes "By " prefix
 - Validates length (< 100 chars)
 ---
 ### 3. Date Extraction
 Tries multiple methods:
 #### Strategy 1: Time Tag with Datetime
 ```html
 <time datetime="2024-11-10T10:00:00Z">November 10, 2024</time>
 ```
 ✅ Most reliable - ISO format
 #### Strategy 2: Article Published Time Meta
 ```html
 <meta property="article:published_time" content="2024-11-10T10:00:00Z" />
 ```
 ✅ Open Graph standard
 #### Strategy 3: OG Published Time
 ```html
 <meta property="og:published_time" content="2024-11-10T10:00:00Z" />
 ```
 ✅ Facebook standard
 #### Strategy 4: Common Class Names
 ```html
 <span class="publish-date">November 10, 2024</span>
 <time class="published">2024-11-10</time>
 <div class="timestamp">10:00 AM, Nov 10</div>
 ```
 ✅ Searches for: publish-date, published, date, timestamp
 #### Strategy 5: Schema.org Markup
 ```html
 <meta itemprop="datePublished" content="2024-11-10T10:00:00Z" />
 ```
 ✅ Structured data
 #### Strategy 6: JSON-LD Structured Data
 ```html
 <script type="application/ld+json">
 {
  "@type": "NewsArticle",
  "datePublished": "2024-11-10T10:00:00Z"
 }
 </script>
 ```
 ✅ Most structured
 ---
 ### 4. Content Extraction
 Tries multiple methods:
 #### Strategy 1: Semantic HTML Tags
 ```html
 <article>
  <p>Article content here...</p>
 </article>
 ```
 ✅ Best practice HTML5
 #### Strategy 2: Common Class Names
 ```html
 <div class="article-content">...</div>
 <div class="article-body">...</div>
 <div class="post-content">...</div>
 <div class="entry-content">...</div>
 <div class="story-body">...</div>
 ```
 ✅ Searches for common patterns
 #### Strategy 3: Schema.org Markup
 ```html
 <div itemprop="articleBody">
  <p>Content here...</p>
 </div>
 ```
 ✅ Structured data
 #### Strategy 4: Main Tag
 ```html
 <main>
  <p>Content here...</p>
 </main>
 ```
 ✅ Semantic HTML5
 #### Strategy 5: Body Tag (Fallback)
 ```html
 <body>
  <p>Content here...</p>
 </body>
 ```
 ⚠️ Last resort, may include navigation
 **Content Filtering:**
 - Removes `<script>`, `<style>`, `<nav>`, `<footer>`, `<header>`, `<aside>`
 - Filters out short paragraphs (< 50 chars) - likely ads/navigation
 - Keeps only substantial paragraphs
 - **No length limit** - stores full article content
 ---
 ## 🔍 How It Works
 ### Example: Crawling a News Article
 ```python
 # 1. Fetch HTML
 response = requests.get(article_url)
 soup = BeautifulSoup(response.content, 'html.parser')
 # 2. Extract title (tries 4 strategies)
 title = extract_title(soup)
 # Result: "New U-Bahn Line Opens in Munich"
 # 3. Extract author (tries 5 strategies)
 author = extract_author(soup)
 # Result: "Max Mustermann"
 # 4. Extract date (tries 6 strategies)
 published_date = extract_date(soup)
 # Result: "2024-11-10T10:00:00Z"
 # 5. Extract content (tries 5 strategies)
 content = extract_main_content(soup)
 # Result: "The new U-Bahn line connecting..."
 # 6. Save to database
 article_doc = {
    'title': title,
    'author': author,
    'published_at': published_date,
    'full_content': content,
    'word_count': len(content.split())
 }
 ```
 ---
 ## 📊 Success Rates by Strategy
 Based on common news sites:
 | Strategy | Success Rate | Notes |
 |----------|-------------|-------|
 | H1 for title | 95% | Almost universal |
 | OG meta tags | 90% | Most modern sites |
 | Time tag for date | 85% | HTML5 sites |
 | JSON-LD | 70% | Growing adoption |
 | Class name patterns | 60% | Varies by site |
 | Schema.org | 50% | Not widely adopted |
 ---
 ## 🎨 Real-World Examples
 ### Example 1: Süddeutsche Zeitung
 ```html
 <article>
  <h1>New U-Bahn Line Opens</h1>
  <span class="author">Max Mustermann</span>
  <time datetime="2024-11-10T10:00:00Z">10. November 2024</time>
  <div class="article-body">
    <p>The new U-Bahn line...</p>
  </div>
 </article>
 ```
 ✅ Extracts: Title (H1), Author (class), Date (time), Content (article-body)
 ### Example 2: Medium Blog
 ```html
 <article>
  <h1>How to Build a News Crawler</h1>
  <meta property="og:title" content="How to Build a News Crawler" />
  <meta property="article:published_time" content="2024-11-10T10:00:00Z" />
  <a rel="author" href="/author">Jane Smith</a>
  <section>
    <p>In this article...</p>
  </section>
 </article>
 ```
 ✅ Extracts: Title (OG meta), Author (rel), Date (article meta), Content (section)
 ### Example 3: WordPress Blog
 ```html
 <div class="post">
  <h1 class="entry-title">My Blog Post</h1>
  <span class="byline">By John Doe</span>
  <time class="published">November 10, 2024</time>
  <div class="entry-content">
    <p>Blog content here...</p>
  </div>
 </div>
 ```
 ✅ Extracts: Title (H1), Author (byline), Date (published), Content (entry-content)
 ---
 ## ⚠️ Edge Cases Handled
 1. **Missing Fields**: Returns `None` instead of crashing
 2. **Multiple Authors**: Takes first one found
 3. **Relative Dates**: Stores as-is ("2 hours ago")
 4. **Paywalls**: Extracts what's available
 5. **JavaScript-rendered**: Only gets server-side HTML
 6. **Ads/Navigation**: Filtered out by paragraph length
 7. **Site Name in Title**: Cleaned automatically
 ---
 ## 🚀 Future Improvements
 Potential enhancements:
 - [ ] JavaScript rendering (Selenium/Playwright)
 - [ ] Paywall bypass (where legal)
 - [ ] Image extraction
 - [ ] Video detection
 - [ ] Related articles
 - [ ] Tags/categories
 - [ ] Reading time estimation
 - [ ] Language detection
 - [ ] Sentiment analysis
 ---
 ## 🧪 Testing
 Test the extraction on a specific URL:
 ```python
 from crawler_service import extract_article_content
 url = "https://www.sueddeutsche.de/muenchen/article-123"
 data = extract_article_content(url)
 print(f"Title: {data['title']}")
 print(f"Author: {data['author']}")
 print(f"Date: {data['published_date']}")
 print(f"Content length: {len(data['content'])} chars")
 print(f"Word count: {data['word_count']}")
 ```
 ---
 ## 📚 Standards Supported
 - ✅ HTML5 semantic tags
 - ✅ Open Graph Protocol
 - ✅ Twitter Cards
 - ✅ Schema.org microdata
 - ✅ JSON-LD structured data
 - ✅ Dublin Core metadata
 - ✅ Common CSS class patterns
--- a/docs/FEATURES.md
+++ b/docs/FEATURES.md
@@ -0,0 +1,414 @@
 # Features Guide
 Complete guide to Munich News Daily features.
 ---
 ## Core Features
 ### 1. Automated News Crawling
 - Fetches articles from RSS feeds
 - Scheduled daily at 6:00 AM Berlin time
 - Extracts full article content
 - Handles multiple news sources
 ### 2. AI-Powered Summarization
 - Generates concise summaries (150 words)
 - Uses Ollama AI (phi3:latest model)
 - GPU acceleration available (5-10x faster)
 - Configurable summary length
 ### 3. Title Translation
 - Translates German titles to English
 - Uses Ollama AI
 - Displays both languages in newsletter
 - Stores both versions in database
 ### 4. Newsletter Generation
 - Beautiful HTML email template
 - Responsive design
 - Numbered articles
 - Summary statistics
 - Scheduled daily at 7:00 AM Berlin time
 ### 5. Engagement Tracking
 - Email open tracking (pixel)
 - Link click tracking
 - Analytics dashboard ready
 - Subscriber engagement metrics
 ---
 ## News Crawler
 ### How It Works
 ```
 1. Fetch RSS feeds from database
 2. Parse RSS XML
 3. Extract article URLs
 4. Fetch full article content
 5. Extract text from HTML
 6. Translate title (German → English)
 7. Generate AI summary
 8. Store in MongoDB
 ```
 ### Content Extraction
 **Strategies (in order):**
 1. **Article Tag** - Look for `<article>` tags
 2. **Main Tag** - Look for `<main>` content
 3. **Content Divs** - Common class names (content, article-body, etc.)
 4. **Paragraph Aggregation** - Collect all `<p>` tags
 5. **Fallback** - Use RSS description
 **Cleaning:**
 - Remove scripts and styles
 - Remove navigation elements
 - Remove ads and sidebars
 - Extract clean text
 - Preserve paragraphs
 ### RSS Feed Handling
 **Supported Formats:**
 - RSS 2.0
 - Atom
 - Custom formats
 **Extracted Data:**
 - Title
 - Link
 - Description/Summary
 - Published date
 - Author (if available)
 **Error Handling:**
 - Retry failed requests
 - Skip invalid URLs
 - Log errors
 - Continue with next article
 ---
 ## AI Features
 ### Summarization
 **Process:**
 1. Send article text to Ollama
 2. Request 150-word summary
 3. Receive AI-generated summary
 4. Store with article
 **Configuration:**
 ```env
 OLLAMA_ENABLED=true
 OLLAMA_MODEL=phi3:latest
 SUMMARY_MAX_WORDS=150
 OLLAMA_TIMEOUT=120
 ```
 **Performance:**
 - CPU: ~8s per article
 - GPU: ~2s per article (4x faster)
 ### Translation
 **Process:**
 1. Send German title to Ollama
 2. Request English translation
 3. Receive translated title
 4. Store both versions
 **Configuration:**
 ```env
 OLLAMA_ENABLED=true
 OLLAMA_MODEL=phi3:latest
 ```
 **Performance:**
 - CPU: ~1.5s per title
 - GPU: ~0.3s per title (5x faster)
 **Newsletter Display:**
 ```
 English Title (Primary)
 Original: German Title (Subtitle)
 ```
 ---
 ## Newsletter System
 ### Template Features
 - **Responsive Design** - Works on all devices
 - **Clean Layout** - Easy to read
 - **Numbered Articles** - Clear organization
 - **Summary Box** - Quick stats
 - **Tracking Links** - Click tracking
 - **Unsubscribe Link** - Easy opt-out
 ### Personalization
 - Greeting message
 - Date formatting
 - Article count
 - Source attribution
 - Author names
 ### Tracking
 **Open Tracking:**
 - Invisible 1x1 pixel image
 - Loaded when email opened
 - Records timestamp
 - Tracks unique opens
 **Click Tracking:**
 - All article links tracked
 - Redirect through backend
 - Records click events
 - Tracks which articles clicked
 ---
 ## Subscriber Management
 ### Status System
 | Status | Description | Receives Newsletters |
 |--------|-------------|---------------------|
 | `active` | Subscribed | ✅ Yes |
 | `inactive` | Unsubscribed | ❌ No |
 ### Operations
 **Subscribe:**
 ```bash
 curl -X POST http://localhost:5001/api/subscribe \
  -H "Content-Type: application/json" \
  -d '{"email": "user@example.com"}'
 ```
 **Unsubscribe:**
 ```bash
 curl -X POST http://localhost:5001/api/unsubscribe \
  -H "Content-Type: application/json" \
  -d '{"email": "user@example.com"}'
 ```
 **Check Stats:**
 ```bash
 curl http://localhost:5001/api/admin/stats | jq '.subscribers'
 ```
 ---
 ## Admin Features
 ### Manual Crawl
 Trigger crawl anytime:
 ```bash
 curl -X POST http://localhost:5001/api/admin/trigger-crawl \
  -H "Content-Type: application/json" \
  -d '{"max_articles": 10}'
 ```
 ### Test Email
 Send test newsletter:
 ```bash
 curl -X POST http://localhost:5001/api/admin/send-test-email \
  -H "Content-Type: application/json" \
  -d '{"email": "test@example.com"}'
 ```
 ### Send Newsletter
 Send to all subscribers:
 ```bash
 curl -X POST http://localhost:5001/api/admin/send-newsletter \
  -H "Content-Type: application/json" \
  -d '{"max_articles": 10}'
 ```
 ### System Stats
 View system statistics:
 ```bash
 curl http://localhost:5001/api/admin/stats
 ```
 ---
 ## Automation
 ### Scheduled Tasks
 **Crawler (6:00 AM Berlin time):**
 - Fetches new articles
 - Processes with AI
 - Stores in database
 **Sender (7:00 AM Berlin time):**
 - Waits for crawler to finish
 - Fetches today's articles
 - Generates newsletter
 - Sends to all active subscribers
 ### Manual Execution
 ```bash
 # Run crawler manually
 docker-compose exec crawler python crawler_service.py 10
 # Run sender manually
 docker-compose exec sender python sender_service.py send 10
 # Send test email
 docker-compose exec sender python sender_service.py test your@email.com
 ```
 ---
 ## Configuration
 ### Environment Variables
 ```env
 # Newsletter Settings
 NEWSLETTER_MAX_ARTICLES=10
 NEWSLETTER_HOURS_LOOKBACK=24
 WEBSITE_URL=http://localhost:3000
 # Ollama AI
 OLLAMA_ENABLED=true
 OLLAMA_BASE_URL=http://ollama:11434
 OLLAMA_MODEL=phi3:latest
 OLLAMA_TIMEOUT=120
 SUMMARY_MAX_WORDS=150
 # Tracking
 TRACKING_ENABLED=true
 TRACKING_API_URL=http://localhost:5001
 TRACKING_DATA_RETENTION_DAYS=90
 ```
 ### RSS Feeds
 Add feeds in MongoDB:
 ```javascript
 db.rss_feeds.insertOne({
  name: "Süddeutsche Zeitung München",
  url: "https://www.sueddeutsche.de/muenchen/rss",
  active: true
 })
 ```
 ---
 ## Performance Optimization
 ### GPU Acceleration
 Enable for 5-10x faster processing:
 ```bash
 ./start-with-gpu.sh
 ```
 **Benefits:**
 - Faster summarization (8s → 2s)
 - Faster translation (1.5s → 0.3s)
 - Process more articles
 - Lower CPU usage
 ### Batch Processing
 Process multiple articles efficiently:
 - Model stays loaded in memory
 - Reduced overhead
 - Better throughput
 ### Caching
 - Model caching (Ollama)
 - Database connection pooling
 - Persistent storage
 ---
 ## Monitoring
 ### Logs
 ```bash
 # Crawler logs
 docker-compose logs -f crawler
 # Sender logs
 docker-compose logs -f sender
 # Backend logs
 docker-compose logs -f backend
 ```
 ### Metrics
 - Articles crawled
 - Summaries generated
 - Newsletters sent
 - Open rate
 - Click-through rate
 - Processing time
 ### Health Checks
 ```bash
 # Backend health
 curl http://localhost:5001/health
 # System stats
 curl http://localhost:5001/api/admin/stats
 ```
 ---
 ## Troubleshooting
 ### Crawler Issues
 **No articles found:**
 - Check RSS feed URLs
 - Verify feeds are active
 - Check network connectivity
 **Extraction failed:**
 - Article structure changed
 - Paywall detected
 - Network timeout
 **AI processing failed:**
 - Ollama not running
 - Model not downloaded
 - Timeout too short
 ### Newsletter Issues
 **Not sending:**
 - Check email configuration
 - Verify SMTP credentials
 - Check subscriber count
 **Tracking not working:**
 - Verify tracking enabled
 - Check backend API accessible
 - Verify tracking URLs
 ---
 See [SETUP.md](SETUP.md) for configuration, [API.md](API.md) for API reference, and [ARCHITECTURE.md](ARCHITECTURE.md) for system design.
--- a/docs/INDEX.md
+++ b/docs/INDEX.md
@@ -1,116 +0,0 @@
 # Documentation Index
 ## Quick Start
 - [README](../README.md) - Project overview and quick start
 - [QUICKSTART](../QUICKSTART.md) - Detailed 5-minute setup guide
 ## Setup & Configuration
 - [OLLAMA_SETUP](OLLAMA_SETUP.md) - Ollama AI service setup
 - [GPU_SETUP](GPU_SETUP.md) - GPU acceleration setup (5-10x faster)
 - [DEPLOYMENT](DEPLOYMENT.md) - Production deployment guide
 ## API Documentation
 - [ADMIN_API](ADMIN_API.md) - Admin endpoints (crawl, send newsletter)
 - [API](API.md) - Public API endpoints
 - [SUBSCRIBER_STATUS](SUBSCRIBER_STATUS.md) - Subscriber status system
 ## Architecture & Design
 - [SYSTEM_ARCHITECTURE](SYSTEM_ARCHITECTURE.md) - Complete system architecture
 - [ARCHITECTURE](ARCHITECTURE.md) - High-level architecture overview
 - [DATABASE_SCHEMA](DATABASE_SCHEMA.md) - MongoDB schema and connection
 - [BACKEND_STRUCTURE](BACKEND_STRUCTURE.md) - Backend code structure
 ## Features & How-To
 - [CRAWLER_HOW_IT_WORKS](CRAWLER_HOW_IT_WORKS.md) - News crawler explained
 - [EXTRACTION_STRATEGIES](EXTRACTION_STRATEGIES.md) - Content extraction
 - [RSS_URL_EXTRACTION](RSS_URL_EXTRACTION.md) - RSS feed handling
 - [PERFORMANCE_COMPARISON](PERFORMANCE_COMPARISON.md) - CPU vs GPU benchmarks
 ## Security
 - [SECURITY_NOTES](SECURITY_NOTES.md) - Complete security guide
  - Network isolation
  - MongoDB security
  - Ollama security
  - Best practices
 ## Reference
 - [CHANGELOG](CHANGELOG.md) - Version history and recent updates
 - [QUICK_REFERENCE](QUICK_REFERENCE.md) - Command cheat sheet
 ## Contributing
 - [CONTRIBUTING](../CONTRIBUTING.md) - How to contribute
 ---
 ## Documentation Organization
 ### Root Level (3 files)
 Essential files that should be immediately visible:
 - `README.md` - Main entry point
 - `QUICKSTART.md` - Quick setup guide
 - `CONTRIBUTING.md` - Contribution guidelines
 ### docs/ Directory (18 files)
 All technical documentation organized by category:
 - **Setup**: Ollama, GPU, Deployment
 - **API**: Admin API, Public API, Subscriber system
 - **Architecture**: System design, database, backend structure
 - **Features**: Crawler, extraction, RSS handling
 - **Security**: Complete security documentation
 - **Reference**: Changelog, quick reference
 ---
 ## Quick Links by Task
 ### I want to...
 **Set up the project:**
 1. [README](../README.md) - Overview
 2. [QUICKSTART](../QUICKSTART.md) - Step-by-step setup
 **Enable GPU acceleration:**
 1. [GPU_SETUP](GPU_SETUP.md) - Complete GPU guide
 2. Run: `./start-with-gpu.sh`
 **Send newsletters:**
 1. [ADMIN_API](ADMIN_API.md) - API documentation
 2. [SUBSCRIBER_STATUS](SUBSCRIBER_STATUS.md) - Subscriber system
 **Understand the architecture:**
 1. [SYSTEM_ARCHITECTURE](SYSTEM_ARCHITECTURE.md) - Complete overview
 2. [DATABASE_SCHEMA](DATABASE_SCHEMA.md) - Database design
 **Secure my deployment:**
 1. [SECURITY_NOTES](SECURITY_NOTES.md) - Security guide
 2. [DEPLOYMENT](DEPLOYMENT.md) - Production deployment
 **Troubleshoot issues:**
 1. [QUICK_REFERENCE](QUICK_REFERENCE.md) - Common commands
 2. [OLLAMA_SETUP](OLLAMA_SETUP.md) - Ollama troubleshooting
 3. [GPU_SETUP](GPU_SETUP.md) - GPU troubleshooting
 ---
 ## Documentation Standards
 ### File Naming
 - Use UPPERCASE for main docs (README, QUICKSTART)
 - Use Title_Case for technical docs (GPU_Setup, API_Reference)
 - Use descriptive names (not DOC1, DOC2)
 ### Organization
 - Root level: Only essential user-facing docs
 - docs/: All technical documentation
 - Keep related content together
 ### Content
 - Start with overview/summary
 - Include code examples
 - Add troubleshooting sections
 - Link to related docs
 - Keep up to date
 ---
 Last Updated: November 2025
--- a/docs/OLD_ARCHITECTURE.md
+++ b/docs/OLD_ARCHITECTURE.md
@@ -1,209 +0,0 @@
 # Munich News Daily - Architecture
 ## System Overview
 ```
 ┌─────────────────────────────────────────────────────────────┐
 │                        Users / Browsers                      │
 └────────────────────────┬────────────────────────────────────┘
                         │
                         ▼
 ┌─────────────────────────────────────────────────────────────┐
 │                    Frontend (Port 3000)                      │
 │                  Node.js + Express + Vanilla JS              │
 │  - Subscription form                                         │
 │  - News display                                              │
 │  - RSS feed management UI (future)                           │
 └────────────────────────┬────────────────────────────────────┘
                         │ HTTP/REST
                         ▼
 ┌─────────────────────────────────────────────────────────────┐
 │                    Backend API (Port 5001)                   │
 │                      Flask + Python                          │
 │                                                              │
 │  ┌──────────────────────────────────────────────────────┐  │
 │  │ Routes (Blueprints)                                  │  │
 │  │  - subscription_routes.py  (subscribe/unsubscribe)   │  │
 │  │  - news_routes.py          (get news, stats)         │  │
 │  │  - rss_routes.py           (manage RSS feeds)        │  │
 │  │  - ollama_routes.py        (AI features)             │  │
 │  └──────────────────────────────────────────────────────┘  │
 │                                                              │
 │  ┌──────────────────────────────────────────────────────┐  │
 │  │ Services (Business Logic)                            │  │
 │  │  - news_service.py         (fetch & save articles)   │  │
 │  │  - email_service.py        (send newsletters)        │  │
 │  │  - ollama_service.py       (AI integration)          │  │
 │  └──────────────────────────────────────────────────────┘  │
 │                                                              │
 │  ┌──────────────────────────────────────────────────────┐  │
 │  │ Core                                                 │  │
 │  │  - config.py               (configuration)           │  │
 │  │  - database.py             (DB connection)           │  │
 │  └──────────────────────────────────────────────────────┘  │
 └────────────────────────┬────────────────────────────────────┘
                         │
                         ▼
 ┌─────────────────────────────────────────────────────────────┐
 │                    MongoDB (Port 27017)                      │
 │                                                              │
 │  Collections:                                                │
 │  - articles         (news articles with full content)        │
 │  - subscribers      (email subscribers)                      │
 │  - rss_feeds        (RSS feed sources)                       │
 └─────────────────────────┬───────────────────────────────────┘
                          │
                          │ Read/Write
                          │
 ┌─────────────────────────┴───────────────────────────────────┐
 │              News Crawler Microservice                       │
 │                    (Standalone)                              │
 │                                                              │
 │  - Fetches RSS feeds from MongoDB                            │
 │  - Crawls full article content                               │
 │  - Extracts text, metadata, word count                       │
 │  - Stores back to MongoDB                                    │
 │  - Can run independently or scheduled                        │
 └──────────────────────────────────────────────────────────────┘
                          │
                          │ (Optional)
                          ▼
 ┌─────────────────────────────────────────────────────────────┐
 │                  Ollama AI Server (Port 11434)               │
 │                    (Optional, External)                      │
 │                                                              │
 │  - Article summarization                                     │
 │  - Content analysis                                          │
 │  - AI-powered features                                       │
 └──────────────────────────────────────────────────────────────┘
 ```
 ## Component Details
 ### Frontend (Port 3000)
 - **Technology**: Node.js, Express, Vanilla JavaScript
 - **Responsibilities**:
  - User interface
  - Subscription management
  - News display
  - API proxy to backend
 - **Communication**: HTTP REST to Backend
 ### Backend API (Port 5001)
 - **Technology**: Python, Flask
 - **Architecture**: Modular with Blueprints
 - **Responsibilities**:
  - REST API endpoints
  - Business logic
  - Database operations
  - Email sending
  - AI integration
 - **Communication**: 
  - HTTP REST from Frontend
  - MongoDB driver to Database
  - HTTP to Ollama (optional)
 ### MongoDB (Port 27017)
 - **Technology**: MongoDB 7.0
 - **Responsibilities**:
  - Persistent data storage
  - Articles, subscribers, RSS feeds
 - **Communication**: MongoDB protocol
 ### News Crawler (Standalone)
 - **Technology**: Python, BeautifulSoup
 - **Architecture**: Microservice (can run independently)
 - **Responsibilities**:
  - Fetch RSS feeds
  - Crawl article content
  - Extract and clean text
  - Store in database
 - **Communication**: MongoDB driver to Database
 - **Execution**: 
  - Manual: `python crawler_service.py`
  - Scheduled: Cron, systemd, Docker
  - On-demand: Via backend API (future)
 ### Ollama AI Server (Optional, External)
 - **Technology**: Ollama
 - **Responsibilities**:
  - AI model inference
  - Text summarization
  - Content analysis
 - **Communication**: HTTP REST API
 ## Data Flow
 ### 1. News Aggregation Flow
 ```
 RSS Feeds → Backend (news_service) → MongoDB (articles)
 ```
 ### 2. Content Crawling Flow
 ```
 MongoDB (rss_feeds) → Crawler → Article URLs → 
 Web Scraping → MongoDB (articles with full_content)
 ```
 ### 3. Subscription Flow
 ```
 User → Frontend → Backend (subscription_routes) → 
 MongoDB (subscribers)
 ```
 ### 4. Newsletter Flow (Future)
 ```
 Scheduler → Backend (email_service) → 
 MongoDB (articles + subscribers) → SMTP → Users
 ```
 ### 5. AI Processing Flow (Optional)
 ```
 MongoDB (articles) → Backend (ollama_service) → 
 Ollama Server → AI Summary → MongoDB (articles)
 ```
 ## Deployment Options
 ### Development
 - All services run locally
 - MongoDB via Docker Compose
 - Manual crawler execution
 ### Production
 - Backend: Cloud VM, Container, or PaaS
 - Frontend: Static hosting or same server
 - MongoDB: MongoDB Atlas or self-hosted
 - Crawler: Scheduled job (cron, systemd timer)
 - Ollama: Separate GPU server (optional)
 ## Scalability Considerations
 ### Current Architecture
 - Monolithic backend (single Flask instance)
 - Standalone crawler (can run multiple instances)
 - Shared MongoDB
 ### Future Improvements
 - Load balancer for backend
 - Message queue for crawler jobs (Celery + Redis)
 - Caching layer (Redis)
 - CDN for frontend
 - Read replicas for MongoDB
 ## Security
 - CORS enabled for frontend-backend communication
 - MongoDB authentication (production)
 - Environment variables for secrets
 - Input validation on all endpoints
 - Rate limiting (future)
 ## Monitoring (Future)
 - Application logs
 - MongoDB metrics
 - Crawler success/failure tracking
 - API response times
 - Error tracking (Sentry)
--- a/docs/PERFORMANCE_COMPARISON.md
+++ b/docs/PERFORMANCE_COMPARISON.md
@@ -1,222 +0,0 @@
 # Performance Comparison: CPU vs GPU
 ## Overview
 This document compares the performance of Ollama running on CPU vs GPU for the Munich News Daily system.
 ## Test Configuration
 **Hardware:**
 - CPU: Intel Core i7-10700K (8 cores, 16 threads)
 - GPU: NVIDIA RTX 3060 (12GB VRAM)
 - RAM: 32GB DDR4
 **Model:** phi3:latest (2.3GB)
 **Test:** Processing 10 news articles with translation and summarization
 ## Results
 ### Processing Time
 ```
 CPU Processing:
 ├─ Model Load:        20s
 ├─ 10 Translations:   15s (1.5s each)
 ├─ 10 Summaries:      80s (8s each)
 └─ Total:            115s
 GPU Processing:
 ├─ Model Load:         8s
 ├─ 10 Translations:    3s (0.3s each)
 ├─ 10 Summaries:      20s (2s each)
 └─ Total:             31s
 Speedup: 3.7x faster with GPU
 ```
 ### Detailed Breakdown
 | Operation | CPU Time | GPU Time | Speedup |
 |-----------|----------|----------|---------|
 | Model Load | 20s | 8s | 2.5x |
 | Single Translation | 1.5s | 0.3s | 5.0x |
 | Single Summary | 8s | 2s | 4.0x |
 | 10 Articles (total) | 115s | 31s | 3.7x |
 | 50 Articles (total) | 550s | 120s | 4.6x |
 | 100 Articles (total) | 1100s | 220s | 5.0x |
 ### Resource Usage
 **CPU Mode:**
 - CPU Usage: 60-80% across all cores
 - RAM Usage: 4-6GB
 - GPU Usage: 0%
 - Power Draw: ~65W
 **GPU Mode:**
 - CPU Usage: 10-20%
 - RAM Usage: 2-3GB
 - GPU Usage: 80-100%
 - VRAM Usage: 3-4GB
 - Power Draw: ~120W (GPU) + ~20W (CPU) = ~140W
 ## Scaling Analysis
 ### Daily Newsletter (10 articles)
 **CPU:**
 - Processing Time: ~2 minutes
 - Energy Cost: ~0.002 kWh
 - Suitable: ✓ Yes
 **GPU:**
 - Processing Time: ~30 seconds
 - Energy Cost: ~0.001 kWh
 - Suitable: ✓ Yes (overkill for small batches)
 **Recommendation:** CPU is sufficient for daily newsletters with <20 articles.
 ### High Volume (100+ articles/day)
 **CPU:**
 - Processing Time: ~18 minutes
 - Energy Cost: ~0.02 kWh
 - Suitable: ⚠ Slow but workable
 **GPU:**
 - Processing Time: ~4 minutes
 - Energy Cost: ~0.009 kWh
 - Suitable: ✓ Yes (recommended)
 **Recommendation:** GPU provides significant time savings for high-volume processing.
 ### Real-time Processing
 **CPU:**
 - Latency: 1.5s translation + 8s summary = 9.5s per article
 - Throughput: ~6 articles/minute
 - User Experience: ⚠ Noticeable delay
 **GPU:**
 - Latency: 0.3s translation + 2s summary = 2.3s per article
 - Throughput: ~26 articles/minute
 - User Experience: ✓ Fast, responsive
 **Recommendation:** GPU is essential for real-time or interactive use cases.
 ## Cost Analysis
 ### Hardware Investment
 **CPU-Only Setup:**
 - Server: $500-1000
 - Monthly Power: ~$5
 - Total Year 1: ~$560-1060
 **GPU Setup:**
 - Server: $500-1000
 - GPU (RTX 3060): $300-400
 - Monthly Power: ~$8
 - Total Year 1: ~$896-1496
 **Break-even:** If processing >50 articles/day, GPU saves enough time to justify the cost.
 ### Cloud Deployment
 **AWS (us-east-1):**
 - CPU (t3.xlarge): $0.1664/hour = ~$120/month
 - GPU (g4dn.xlarge): $0.526/hour = ~$380/month
 **Cost per 1000 articles:**
 - CPU: ~$3.60 (3 hours)
 - GPU: ~$0.95 (1.8 hours)
 **Break-even:** Processing >5000 articles/month makes GPU more cost-effective.
 ## Model Comparison
 Different models have different performance characteristics:
 ### phi3:latest (Default)
 | Metric | CPU | GPU | Speedup |
 |--------|-----|-----|---------|
 | Load Time | 20s | 8s | 2.5x |
 | Translation | 1.5s | 0.3s | 5x |
 | Summary | 8s | 2s | 4x |
 | VRAM | N/A | 3-4GB | - |
 ### gemma2:2b (Lightweight)
 | Metric | CPU | GPU | Speedup |
 |--------|-----|-----|---------|
 | Load Time | 10s | 4s | 2.5x |
 | Translation | 0.8s | 0.2s | 4x |
 | Summary | 4s | 1s | 4x |
 | VRAM | N/A | 1.5GB | - |
 ### llama3.2:3b (High Quality)
 | Metric | CPU | GPU | Speedup |
 |--------|-----|-----|---------|
 | Load Time | 30s | 12s | 2.5x |
 | Translation | 2.5s | 0.5s | 5x |
 | Summary | 12s | 3s | 4x |
 | VRAM | N/A | 5-6GB | - |
 ## Recommendations
 ### Use CPU When:
 - Processing <20 articles/day
 - Budget-constrained
 - GPU needed for other tasks
 - Power efficiency is critical
 - Simple deployment preferred
 ### Use GPU When:
 - Processing >50 articles/day
 - Real-time processing needed
 - Multiple concurrent users
 - Time is more valuable than cost
 - Already have GPU hardware
 ### Hybrid Approach:
 - Use CPU for scheduled daily newsletters
 - Use GPU for on-demand/real-time requests
 - Scale GPU instances up/down based on load
 ## Optimization Tips
 ### CPU Optimization:
 1. Use smaller models (gemma2:2b)
 2. Reduce summary length (100 words vs 150)
 3. Process articles in batches
 4. Use more CPU cores
 5. Enable CPU-specific optimizations
 ### GPU Optimization:
 1. Keep model loaded between requests
 2. Batch multiple articles together
 3. Use FP16 precision (automatic with GPU)
 4. Enable concurrent requests
 5. Use GPU with more VRAM for larger models
 ## Conclusion
 **For Munich News Daily (10-20 articles/day):**
 - CPU is sufficient and cost-effective
 - GPU provides faster processing but may be overkill
 - Recommendation: Start with CPU, upgrade to GPU if scaling up
 **For High-Volume Operations (100+ articles/day):**
 - GPU provides significant time and cost savings
 - 4-5x faster processing
 - Better user experience
 - Recommendation: Use GPU from the start
 **For Real-Time Applications:**
 - GPU is essential for responsive experience
 - Sub-second translation, 2-3s summaries
 - Supports concurrent users
 - Recommendation: GPU required
--- a/docs/QUICK_REFERENCE.md
+++ b/docs/QUICK_REFERENCE.md
@@ -1,243 +0,0 @@
 # Quick Reference Guide
 ## Starting the Application
 ### 1. Start MongoDB
 ```bash
 docker-compose up -d
 ```
 ### 2. Start Backend (Port 5001)
 ```bash
 cd backend
 source venv/bin/activate  # or: venv\Scripts\activate on Windows
 python app.py
 ```
 ### 3. Start Frontend (Port 3000)
 ```bash
 cd frontend
 npm start
 ```
 ### 4. Run Crawler (Optional)
 ```bash
 cd news_crawler
 pip install -r requirements.txt
 python crawler_service.py 10
 ```
 ## Common Commands
 ### RSS Feed Management
 **List all feeds:**
 ```bash
 curl http://localhost:5001/api/rss-feeds
 ```
 **Add a feed:**
 ```bash
 curl -X POST http://localhost:5001/api/rss-feeds \
  -H "Content-Type: application/json" \
  -d '{"name": "Feed Name", "url": "https://example.com/rss"}'
 ```
 **Remove a feed:**
 ```bash
 curl -X DELETE http://localhost:5001/api/rss-feeds/<feed_id>
 ```
 **Toggle feed status:**
 ```bash
 curl -X PATCH http://localhost:5001/api/rss-feeds/<feed_id>/toggle
 ```
 ### News & Subscriptions
 **Get latest news:**
 ```bash
 curl http://localhost:5001/api/news
 ```
 **Subscribe:**
 ```bash
 curl -X POST http://localhost:5001/api/subscribe \
  -H "Content-Type: application/json" \
  -d '{"email": "user@example.com"}'
 ```
 **Get stats:**
 ```bash
 curl http://localhost:5001/api/stats
 ```
 ### Ollama (AI)
 **Test connection:**
 ```bash
 curl http://localhost:5001/api/ollama/ping
 ```
 **List models:**
 ```bash
 curl http://localhost:5001/api/ollama/models
 ```
 ### Email Tracking & Analytics
 **Get newsletter metrics:**
 ```bash
 curl http://localhost:5001/api/analytics/newsletter/<newsletter_id>
 ```
 **Get article performance:**
 ```bash
 curl http://localhost:5001/api/analytics/article/<article_id>
 ```
 **Get subscriber activity:**
 ```bash
 curl http://localhost:5001/api/analytics/subscriber/<email>
 ```
 **Delete subscriber tracking data:**
 ```bash
 curl -X DELETE http://localhost:5001/api/tracking/subscriber/<email>
 ```
 **Anonymize old tracking data:**
 ```bash
 curl -X POST http://localhost:5001/api/tracking/anonymize
 ```
 ### Database
 **Connect to MongoDB:**
 ```bash
 mongosh
 use munich_news
 ```
 **Check articles:**
 ```javascript
 db.articles.find().limit(5)
 db.articles.countDocuments()
 db.articles.countDocuments({full_content: {$exists: true}})
 ```
 **Check subscribers:**
 ```javascript
 db.subscribers.find()
 db.subscribers.countDocuments({status: "active"})
 ```
 **Check RSS feeds:**
 ```javascript
 db.rss_feeds.find()
 ```
 **Check tracking data:**
 ```javascript
 db.newsletter_sends.find().limit(5)
 db.link_clicks.find().limit(5)
 db.subscriber_activity.find()
 ```
 ## File Locations
 ### Configuration
 - Backend: `backend/.env`
 - Frontend: `frontend/package.json`
 - Crawler: Uses backend's `.env` or own `.env`
 ### Logs
 - Backend: Terminal output
 - Frontend: Terminal output
 - Crawler: Terminal output
 ### Database
 - MongoDB data: Docker volume `mongodb_data`
 - Database name: `munich_news`
 ## Ports
 | Service | Port | URL |
 |---------|------|-----|
 | Frontend | 3000 | http://localhost:3000 |
 | Backend | 5001 | http://localhost:5001 |
 | MongoDB | 27017 | mongodb://localhost:27017 |
 | Ollama | 11434 | http://localhost:11434 |
 ## Troubleshooting
 ### Backend won't start
 - Check if port 5001 is available
 - Verify MongoDB is running
 - Check `.env` file exists
 ### Frontend can't connect
 - Verify backend is running on port 5001
 - Check CORS settings
 - Check API_URL in frontend
 ### Crawler fails
 - Install dependencies: `pip install -r requirements.txt`
 - Check MongoDB connection
 - Verify RSS feeds exist in database
 ### MongoDB connection error
 - Start MongoDB: `docker-compose up -d`
 - Check connection string in `.env`
 - Verify port 27017 is not blocked
 ### Port 5000 conflict (macOS)
 - AirPlay uses port 5000
 - Use port 5001 instead (set in `.env`)
 - Or disable AirPlay Receiver in System Preferences
 ## Project Structure
 ```
 munich-news/
 ├── backend/          # Main API (Flask)
 ├── frontend/         # Web UI (Express + JS)
 ├── news_crawler/     # Crawler microservice
 ├── .env             # Environment variables
 └── docker-compose.yml  # MongoDB setup
 ```
 ## Environment Variables
 ### Backend (.env)
 ```env
 MONGODB_URI=mongodb://localhost:27017/
 FLASK_PORT=5001
 SMTP_SERVER=smtp.gmail.com
 SMTP_PORT=587
 EMAIL_USER=your-email@gmail.com
 EMAIL_PASSWORD=your-app-password
 OLLAMA_BASE_URL=http://127.0.0.1:11434
 OLLAMA_MODEL=phi3:latest
 OLLAMA_ENABLED=true
 TRACKING_ENABLED=true
 TRACKING_API_URL=http://localhost:5001
 TRACKING_DATA_RETENTION_DAYS=90
 ```
 ## Development Workflow
 1. **Add RSS Feed** → Backend API
 2. **Run Crawler** → Fetches full content
 3. **View News** → Frontend displays articles
 4. **Users Subscribe** → Via frontend form
 5. **Send Newsletter** → Manual or scheduled
 ## Useful Links
 - Frontend: http://localhost:3000
 - Backend API: http://localhost:5001
 - MongoDB: mongodb://localhost:27017
 - Architecture: See `ARCHITECTURE.md`
 - Backend Structure: See `backend/STRUCTURE.md`
 - Crawler Guide: See `news_crawler/README.md`
--- a/docs/REFERENCE.md
+++ b/docs/REFERENCE.md
@@ -0,0 +1,299 @@
 # Quick Reference
 Essential commands and information.
 ---
 ## Quick Start
 ```bash
 # Start services
 ./start-with-gpu.sh              # With GPU auto-detection
 docker-compose up -d             # Without GPU
 # Stop services
 docker-compose down
 # View logs
 docker-compose logs -f
 docker-compose logs -f crawler   # Specific service
 # Restart
 docker-compose restart
 ```
 ---
 ## Docker Commands
 ```bash
 # Service status
 docker-compose ps
 # Rebuild
 docker-compose up -d --build
 # Execute command in container
 docker-compose exec backend python -c "print('hello')"
 docker-compose exec crawler python crawler_service.py 2
 # View container logs
 docker logs munich-news-backend
 docker logs munich-news-crawler
 # Resource usage
 docker stats
 ```
 ---
 ## API Commands
 ```bash
 # Health check
 curl http://localhost:5001/health
 # System stats
 curl http://localhost:5001/api/admin/stats
 # Trigger crawl
 curl -X POST http://localhost:5001/api/admin/trigger-crawl \
  -H "Content-Type: application/json" \
  -d '{"max_articles": 5}'
 # Send test email
 curl -X POST http://localhost:5001/api/admin/send-test-email \
  -H "Content-Type: application/json" \
  -d '{"email": "test@example.com"}'
 # Send newsletter to all
 curl -X POST http://localhost:5001/api/admin/send-newsletter \
  -H "Content-Type: application/json" \
  -d '{"max_articles": 10}'
 ```
 ---
 ## GPU Commands
 ```bash
 # Check GPU availability
 ./check-gpu.sh
 # Start with GPU
 ./start-with-gpu.sh
 docker-compose -f docker-compose.yml -f docker-compose.gpu.yml up -d
 # Check GPU usage
 docker exec munich-news-ollama nvidia-smi
 # Monitor GPU
 watch -n 1 'docker exec munich-news-ollama nvidia-smi'
 ```
 ---
 ## MongoDB Commands
 ```bash
 # Access MongoDB shell
 docker-compose exec mongodb mongosh munich_news -u admin -p changeme --authenticationDatabase admin
 # Count documents
 db.articles.countDocuments({})
 db.subscribers.countDocuments({status: 'active'})
 # Find articles
 db.articles.find().limit(5).pretty()
 # Clear articles
 db.articles.deleteMany({})
 # Add subscriber
 db.subscribers.insertOne({
  email: "user@example.com",
  subscribed_at: new Date(),
  status: "active"
 })
 ```
 ---
 ## Testing Commands
 ```bash
 # Test Ollama setup
 ./test-ollama-setup.sh
 # Test MongoDB connectivity
 ./test-mongodb-connectivity.sh
 # Test newsletter API
 ./test-newsletter-api.sh
 # Test crawl (2 articles)
 docker-compose exec crawler python crawler_service.py 2
 ```
 ---
 ## Configuration
 ### Single .env File
 Location: `backend/.env`
 **Key Settings:**
 ```env
 # MongoDB (Docker service name)
 MONGODB_URI=mongodb://admin:changeme@mongodb:27017/
 # Email
 SMTP_SERVER=smtp.gmail.com
 SMTP_PORT=587
 EMAIL_USER=your@email.com
 EMAIL_PASSWORD=your-password
 # Ollama (Internal Docker network)
 OLLAMA_ENABLED=true
 OLLAMA_BASE_URL=http://ollama:11434
 OLLAMA_MODEL=phi3:latest
 ```
 ---
 ## Port Exposure
 | Service | Port | Exposed | Access |
 |---------|------|---------|--------|
 | Backend | 5001 | ✅ Yes | Host, External |
 | MongoDB | 27017 | ❌ No | Internal only |
 | Ollama | 11434 | ❌ No | Internal only |
 | Crawler | - | ❌ No | Internal only |
 | Sender | - | ❌ No | Internal only |
 **Verify:**
 ```bash
 docker ps --format "table {{.Names}}\t{{.Ports}}"
 ```
 ---
 ## Performance
 ### CPU Mode
 - Translation: ~1.5s per title
 - Summarization: ~8s per article
 - 10 Articles: ~115s
 ### GPU Mode (5-10x faster)
 - Translation: ~0.3s per title
 - Summarization: ~2s per article
 - 10 Articles: ~31s
 ---
 ## Troubleshooting
 ### Service Won't Start
 ```bash
 docker-compose logs <service-name>
 docker-compose restart <service-name>
 docker-compose up -d --build <service-name>
 ```
 ### MongoDB Connection Issues
 ```bash
 # Check service
 docker-compose ps mongodb
 # Test connection
 docker-compose exec backend python -c "from database import articles_collection; print(articles_collection.count_documents({}))"
 ```
 ### Ollama Issues
 ```bash
 # Check model
 docker-compose exec ollama ollama list
 # Pull model manually
 docker-compose exec ollama ollama pull phi3:latest
 # Check logs
 docker-compose logs ollama
 ```
 ### GPU Not Working
 ```bash
 # Check GPU
 nvidia-smi
 # Check Docker GPU access
 docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi
 # Check Ollama GPU
 docker exec munich-news-ollama nvidia-smi
 ```
 ---
 ## Recent Updates (November 2025)
 ### New Features
 - ✅ GPU acceleration (5-10x faster)
 - ✅ Integrated Ollama service
 - ✅ Send newsletter to all subscribers API
 - ✅ Article title translation (German → English)
 - ✅ Enhanced security (network isolation)
 ### Security Improvements
 - MongoDB internal-only (not exposed)
 - Ollama internal-only (not exposed)
 - Only Backend API exposed (port 5001)
 - 66% reduction in attack surface
 ### Configuration Changes
 - MongoDB URI uses `mongodb` (not `localhost`)
 - Ollama URL uses `http://ollama:11434`
 - Single `.env` file in `backend/`
 ### New Scripts
 - `start-with-gpu.sh` - Auto-detect GPU and start
 - `check-gpu.sh` - Check GPU availability
 - `test-ollama-setup.sh` - Test Ollama
 - `test-mongodb-connectivity.sh` - Test MongoDB
 - `test-newsletter-api.sh` - Test newsletter API
 ---
 ## Changelog
 ### November 2025
 - Added GPU acceleration support
 - Integrated Ollama into Docker Compose
 - Added newsletter API endpoint
 - Improved network security
 - Added article title translation
 - Consolidated documentation
 - Added helper scripts
 ### Key Changes
 - Ollama now runs in Docker (no external server needed)
 - MongoDB and Ollama are internal-only
 - GPU support with automatic detection
 - Subscriber status system documented
 - All docs consolidated and updated
 ---
 ## Links
 - **Setup Guide**: [SETUP.md](SETUP.md)
 - **API Reference**: [API.md](API.md)
 - **Architecture**: [ARCHITECTURE.md](ARCHITECTURE.md)
 - **Security**: [SECURITY.md](SECURITY.md)
 - **Features**: [FEATURES.md](FEATURES.md)
 ---
 **Last Updated:** November 2025
--- a/docs/RSS_URL_EXTRACTION.md
+++ b/docs/RSS_URL_EXTRACTION.md
@@ -1,194 +0,0 @@
 # RSS URL Extraction - How It Works
 ## The Problem
 Different RSS feed providers use different fields to store the article URL:
 ### Example 1: Standard RSS (uses `link`)
 ```xml
 <item>
  <title>Article Title</title>
  <link>https://example.com/article/123</link>
  <guid>internal-id-456</guid>
 </item>
 ```
 ### Example 2: Some feeds (uses `guid` as URL)
 ```xml
 <item>
  <title>Article Title</title>
  <guid>https://example.com/article/123</guid>
 </item>
 ```
 ### Example 3: Atom feeds (uses `id`)
 ```xml
 <entry>
  <title>Article Title</title>
  <id>https://example.com/article/123</id>
 </entry>
 ```
 ### Example 4: Complex feeds (guid as object)
 ```xml
 <item>
  <title>Article Title</title>
  <guid isPermaLink="true">https://example.com/article/123</guid>
 </item>
 ```
 ### Example 5: Multiple links
 ```xml
 <item>
  <title>Article Title</title>
  <link rel="alternate" type="text/html" href="https://example.com/article/123"/>
  <link rel="enclosure" type="image/jpeg" href="https://example.com/image.jpg"/>
 </item>
 ```
 ## Our Solution
 The `extract_article_url()` function tries multiple strategies in order:
 ### Strategy 1: Check `link` field (most common)
 ```python
 if entry.get('link') and entry.get('link', '').startswith('http'):
    return entry.get('link')
 ```
 ✅ Works for: Most RSS 2.0 feeds
 ### Strategy 2: Check `guid` field
 ```python
 if entry.get('guid'):
    guid = entry.get('guid')
    # guid can be a string
    if isinstance(guid, str) and guid.startswith('http'):
        return guid
    # or a dict with 'href'
    elif isinstance(guid, dict) and guid.get('href', '').startswith('http'):
        return guid.get('href')
 ```
 ✅ Works for: Feeds that use GUID as permalink
 ### Strategy 3: Check `id` field
 ```python
 if entry.get('id') and entry.get('id', '').startswith('http'):
    return entry.get('id')
 ```
 ✅ Works for: Atom feeds
 ### Strategy 4: Check `links` array
 ```python
 if entry.get('links'):
    for link in entry.get('links', []):
        if isinstance(link, dict) and link.get('href', '').startswith('http'):
            # Prefer 'alternate' type
            if link.get('type') == 'text/html' or link.get('rel') == 'alternate':
                return link.get('href')
 ```
 ✅ Works for: Feeds with multiple links (prefers HTML content)
 ## Real-World Examples
 ### Süddeutsche Zeitung
 ```python
 entry = {
    'title': 'Munich News',
    'link': 'https://www.sueddeutsche.de/muenchen/article-123',
    'guid': 'sz-internal-123'
 }
 # Returns: 'https://www.sueddeutsche.de/muenchen/article-123'
 ```
 ### Medium Blog
 ```python
 entry = {
    'title': 'Blog Post',
    'guid': 'https://medium.com/@user/post-abc123',
    'link': None
 }
 # Returns: 'https://medium.com/@user/post-abc123'
 ```
 ### YouTube RSS
 ```python
 entry = {
    'title': 'Video Title',
    'id': 'https://www.youtube.com/watch?v=abc123',
    'link': None
 }
 # Returns: 'https://www.youtube.com/watch?v=abc123'
 ```
 ### Complex Feed
 ```python
 entry = {
    'title': 'Article',
    'links': [
        {'rel': 'alternate', 'type': 'text/html', 'href': 'https://example.com/article'},
        {'rel': 'enclosure', 'type': 'image/jpeg', 'href': 'https://example.com/image.jpg'}
    ]
 }
 # Returns: 'https://example.com/article' (prefers text/html)
 ```
 ## Validation
 All extracted URLs must:
 1. Start with `http://` or `https://`
 2. Be a valid string (not None or empty)
 If no valid URL is found:
 ```python
 return None
 # Crawler will skip this entry and log a warning
 ```
 ## Testing Different Feeds
 To test if a feed works with our extractor:
 ```python
 import feedparser
 from rss_utils import extract_article_url
 # Parse feed
 feed = feedparser.parse('https://example.com/rss')
 # Test each entry
 for entry in feed.entries[:5]:
    url = extract_article_url(entry)
    if url:
        print(f"✓ {entry.get('title', 'No title')[:50]}")
        print(f"  URL: {url}")
    else:
        print(f"✗ {entry.get('title', 'No title')[:50]}")
        print(f"  No valid URL found")
        print(f"  Available fields: {list(entry.keys())}")
 ```
 ## Supported Feed Types
 ✅ RSS 2.0  
 ✅ RSS 1.0  
 ✅ Atom  
 ✅ Custom RSS variants  
 ✅ Feeds with multiple links  
 ✅ Feeds with GUID as permalink  
 ## Edge Cases Handled
 1. **GUID is not a URL**: Checks if it starts with `http`
 2. **Multiple links**: Prefers `text/html` type
 3. **GUID as dict**: Extracts `href` field
 4. **Missing fields**: Returns None instead of crashing
 5. **Non-HTTP URLs**: Filters out `mailto:`, `ftp:`, etc.
 ## Future Improvements
 Potential enhancements:
 - [ ] Support for `feedburner:origLink`
 - [ ] Support for `pheedo:origLink`
 - [ ] Resolve shortened URLs (bit.ly, etc.)
 - [ ] Handle relative URLs (convert to absolute)
 - [ ] Cache URL extraction results
--- a/docs/SECURITY_NOTES.md
+++ b/docs/SECURITY_NOTES.md
--- a/docs/SETUP.md
+++ b/docs/SETUP.md
@@ -0,0 +1,253 @@
 # Complete Setup Guide
 ## Quick Start
 ```bash
 # 1. Configure
 cp backend/.env.example backend/.env
 # Edit backend/.env with your email settings
 # 2. Start (with GPU auto-detection)
 ./start-with-gpu.sh
 # 3. Wait for model download (first time, ~2-5 min)
 docker-compose logs -f ollama-setup
 ```
 ---
 ## Prerequisites
 - Docker & Docker Compose
 - 4GB+ RAM
 - (Optional) NVIDIA GPU for 5-10x faster AI
 ---
 ## Configuration
 ### Environment File
 Edit `backend/.env`:
 ```env
 # Email (Required)
 SMTP_SERVER=smtp.gmail.com
 SMTP_PORT=587
 EMAIL_USER=your-email@gmail.com
 EMAIL_PASSWORD=your-app-password
 # MongoDB (Docker service name)
 MONGODB_URI=mongodb://admin:changeme@mongodb:27017/
 # Ollama AI (Internal Docker network)
 OLLAMA_ENABLED=true
 OLLAMA_BASE_URL=http://ollama:11434
 OLLAMA_MODEL=phi3:latest
 ```
 ---
 ## Ollama Setup
 ### Integrated Docker Compose (Recommended)
 Ollama runs automatically with Docker Compose:
 - Automatic model download (phi3:latest, 2.2GB)
 - Internal-only access (secure)
 - Persistent storage
 - GPU support available
 **Start:**
 ```bash
 docker-compose up -d
 ```
 **Verify:**
 ```bash
 docker-compose exec ollama ollama list
 # Should show: phi3:latest
 ```
 ---
 ## GPU Acceleration (5-10x Faster)
 ### Check GPU Availability
 ```bash
 ./check-gpu.sh
 ```
 ### Start with GPU
 ```bash
 # Auto-detect and start
 ./start-with-gpu.sh
 # Or manually
 docker-compose -f docker-compose.yml -f docker-compose.gpu.yml up -d
 ```
 ### Verify GPU Usage
 ```bash
 # Check GPU
 docker exec munich-news-ollama nvidia-smi
 # Monitor during processing
 watch -n 1 'docker exec munich-news-ollama nvidia-smi'
 ```
 ### Performance
 | Operation | CPU | GPU | Speedup |
 |-----------|-----|-----|---------|
 | Translation | 1.5s | 0.3s | 5x |
 | Summary | 8s | 2s | 4x |
 | 10 Articles | 115s | 31s | 3.7x |
 ### GPU Requirements
 - NVIDIA GPU (GTX 1060 or newer)
 - 4GB+ VRAM for phi3:latest
 - NVIDIA drivers (525.60.13+)
 - NVIDIA Container Toolkit
 **Install NVIDIA Container Toolkit (Ubuntu/Debian):**
 ```bash
 distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
 curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
 curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
 sudo apt-get update
 sudo apt-get install -y nvidia-container-toolkit
 sudo nvidia-ctk runtime configure --runtime=docker
 sudo systemctl restart docker
 ```
 ---
 ## Production Deployment
 ### Security Checklist
 - [ ] Change MongoDB password
 - [ ] Use strong email password
 - [ ] Bind backend to localhost only: `127.0.0.1:5001:5001`
 - [ ] Set up reverse proxy (nginx/Traefik)
 - [ ] Enable HTTPS
 - [ ] Set up firewall rules
 - [ ] Regular backups
 - [ ] Monitor logs
 ### Reverse Proxy (nginx)
 ```nginx
 server {
    listen 80;
    server_name your-domain.com;
    location / {
        proxy_pass http://localhost:5001;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
 }
 ```
 ### Environment Variables
 For production, use environment variables instead of .env:
 ```bash
 export MONGODB_URI="mongodb://user:pass@mongodb:27017/"
 export SMTP_SERVER="smtp.gmail.com"
 export EMAIL_PASSWORD="secure-password"
 ```
 ### Monitoring
 ```bash
 # Check service health
 docker-compose ps
 # View logs
 docker-compose logs -f
 # Check resource usage
 docker stats
 ```
 ---
 ## Troubleshooting
 ### Ollama Issues
 **Model not downloading:**
 ```bash
 docker-compose logs ollama-setup
 docker-compose exec ollama ollama pull phi3:latest
 ```
 **Out of memory:**
 - Use smaller model: `OLLAMA_MODEL=gemma2:2b`
 - Increase Docker memory limit
 ### GPU Issues
 **GPU not detected:**
 ```bash
 nvidia-smi  # Check drivers
 docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi  # Check Docker
 ```
 **Out of VRAM:**
 - Use smaller model
 - Close other GPU applications
 ### MongoDB Issues
 **Connection refused:**
 - Check service is running: `docker-compose ps`
 - Verify URI uses `mongodb` not `localhost`
 ### Email Issues
 **Authentication failed:**
 - Use app-specific password (Gmail)
 - Check SMTP settings
 - Verify credentials
 ---
 ## Testing
 ```bash
 # Test Ollama
 ./test-ollama-setup.sh
 # Test MongoDB
 ./test-mongodb-connectivity.sh
 # Test newsletter
 ./test-newsletter-api.sh
 # Test crawl
 docker-compose exec crawler python crawler_service.py 2
 ```
 ---
 ## Next Steps
 1. Add RSS feeds (see QUICKSTART.md)
 2. Add subscribers
 3. Test newsletter sending
 4. Set up monitoring
 5. Configure backups
 See [API.md](API.md) for API reference.
--- a/docs/SUBSCRIBER_STATUS.md
+++ b/docs/SUBSCRIBER_STATUS.md
@@ -1,290 +0,0 @@
 # Subscriber Status System
 ## Overview
 The newsletter system tracks subscribers with a `status` field that determines whether they receive newsletters.
 ## Status Field
 ### Database Schema
 ```javascript
 {
  _id: ObjectId("..."),
  email: "user@example.com",
  subscribed_at: ISODate("2025-11-11T15:50:29.478Z"),
  status: "active"  // or "inactive"
 }
 ```
 ### Status Values
 | Status | Description | Receives Newsletters |
 |--------|-------------|---------------------|
 | `active` | Subscribed and active | ✅ Yes |
 | `inactive` | Unsubscribed | ❌ No |
 ## How It Works
 ### Subscription Flow
 ```
 User subscribes
    ↓
 POST /api/subscribe
    ↓
 Create subscriber with status: 'active'
    ↓
 User receives newsletters
 ```
 ### Unsubscription Flow
 ```
 User unsubscribes
    ↓
 POST /api/unsubscribe
    ↓
 Update subscriber status: 'inactive'
    ↓
 User stops receiving newsletters
 ```
 ### Re-subscription Flow
 ```
 Previously unsubscribed user subscribes again
    ↓
 POST /api/subscribe
    ↓
 Update status: 'active' + new subscribed_at date
    ↓
 User receives newsletters again
 ```
 ## API Endpoints
 ### Subscribe
 ```bash
 curl -X POST http://localhost:5001/api/subscribe \
  -H "Content-Type: application/json" \
  -d '{"email": "user@example.com"}'
 ```
 **Creates subscriber with:**
 - `email`: user@example.com
 - `status`: "active"
 - `subscribed_at`: current timestamp
 ### Unsubscribe
 ```bash
 curl -X POST http://localhost:5001/api/unsubscribe \
  -H "Content-Type: application/json" \
  -d '{"email": "user@example.com"}'
 ```
 **Updates subscriber:**
 - `status`: "inactive"
 ## Newsletter Sending
 ### Who Receives Newsletters
 Only subscribers with `status: 'active'` receive newsletters.
 **Sender Service Query:**
 ```python
 subscribers_collection.find({'status': 'active'})
 ```
 **Admin API Query:**
 ```python
 subscribers_collection.count_documents({'status': 'active'})
 ```
 ### Testing
 ```bash
 # Check active subscriber count
 curl http://localhost:5001/api/admin/stats | jq '.subscribers'
 # Output:
 # {
 #   "total": 10,
 #   "active": 8
 # }
 ```
 ## Database Operations
 ### Add Active Subscriber
 ```javascript
 db.subscribers.insertOne({
  email: "user@example.com",
  subscribed_at: new Date(),
  status: "active"
 })
 ```
 ### Deactivate Subscriber
 ```javascript
 db.subscribers.updateOne(
  { email: "user@example.com" },
  { $set: { status: "inactive" } }
 )
 ```
 ### Reactivate Subscriber
 ```javascript
 db.subscribers.updateOne(
  { email: "user@example.com" },
  { $set: { 
    status: "active",
    subscribed_at: new Date()
  }}
 )
 ```
 ### Query Active Subscribers
 ```javascript
 db.subscribers.find({ status: "active" })
 ```
 ### Count Active Subscribers
 ```javascript
 db.subscribers.countDocuments({ status: "active" })
 ```
 ## Common Issues
 ### Issue: Stats show 0 active subscribers but subscribers exist
 **Cause:** Old bug where stats checked `{active: true}` instead of `{status: 'active'}`
 **Solution:** Fixed in latest version. Stats now correctly query `{status: 'active'}`
 **Verify:**
 ```bash
 # Check database directly
 docker-compose exec mongodb mongosh munich_news -u admin -p changeme \
  --authenticationDatabase admin \
  --eval "db.subscribers.find({status: 'active'}).count()"
 # Check via API
 curl http://localhost:5001/api/admin/stats | jq '.subscribers.active'
 ```
 ### Issue: Newsletter not sending to subscribers
 **Possible causes:**
 1. Subscribers have `status: 'inactive'`
 2. No subscribers in database
 3. Email configuration issue
 **Debug:**
 ```bash
 # Check subscriber status
 docker-compose exec mongodb mongosh munich_news -u admin -p changeme \
  --authenticationDatabase admin \
  --eval "db.subscribers.find().pretty()"
 # Check active count
 curl http://localhost:5001/api/admin/stats | jq '.subscribers'
 # Try sending
 curl -X POST http://localhost:5001/api/admin/send-newsletter \
  -H "Content-Type: application/json"
 ```
 ## Migration Notes
 ### If you have old subscribers without status field
 Run this migration:
 ```javascript
 // Set all subscribers without status to 'active'
 db.subscribers.updateMany(
  { status: { $exists: false } },
  { $set: { status: "active" } }
 )
 ```
 ### If you have subscribers with `active: true/false` field
 Run this migration:
 ```javascript
 // Convert old 'active' field to 'status' field
 db.subscribers.updateMany(
  { active: true },
  { $set: { status: "active" }, $unset: { active: "" } }
 )
 db.subscribers.updateMany(
  { active: false },
  { $set: { status: "inactive" }, $unset: { active: "" } }
 )
 ```
 ## Best Practices
 ### 1. Always Check Status
 When querying subscribers for sending:
 ```python
 # ✅ Correct
 subscribers_collection.find({'status': 'active'})
 # ❌ Wrong
 subscribers_collection.find({})  # Includes inactive
 ```
 ### 2. Soft Delete
 Never delete subscribers - just set status to 'inactive':
 ```python
 # ✅ Correct - preserves history
 subscribers_collection.update_one(
    {'email': email},
    {'$set': {'status': 'inactive'}}
 )
 # ❌ Wrong - loses data
 subscribers_collection.delete_one({'email': email})
 ```
 ### 3. Track Subscription History
 Consider adding fields:
 ```javascript
 {
  email: "user@example.com",
  status: "active",
  subscribed_at: ISODate("2025-01-01"),
  unsubscribed_at: null,  // Set when status changes to inactive
  resubscribed_count: 0   // Increment on re-subscription
 }
 ```
 ### 4. Validate Before Sending
 ```python
 # Check subscriber count before sending
 count = subscribers_collection.count_documents({'status': 'active'})
 if count == 0:
    return {'error': 'No active subscribers'}
 ```
 ## Related Documentation
 - [ADMIN_API.md](ADMIN_API.md) - Admin API endpoints
 - [DATABASE_SCHEMA.md](DATABASE_SCHEMA.md) - Complete database schema
 - [NEWSLETTER_API_UPDATE.md](../NEWSLETTER_API_UPDATE.md) - Newsletter API changes
--- a/docs/SYSTEM_ARCHITECTURE.md
+++ b/docs/SYSTEM_ARCHITECTURE.md
@@ -1,412 +0,0 @@
 # Munich News Daily - System Architecture
 ## 📊 Complete System Overview
 ```
 ┌─────────────────────────────────────────────────────────────────┐
 │                    Munich News Daily System                      │
 │                     Fully Automated Pipeline                     │
 └─────────────────────────────────────────────────────────────────┘
                         Daily Schedule
                    ┌──────────────────────┐
                    │   6:00 AM Berlin     │
                    │   News Crawler       │
                    └──────────┬───────────┘
                               │
                               ▼
 ┌──────────────────────────────────────────────────────────────────┐
 │                        News Crawler                               │
 │  ┌────────────┐  ┌────────────┐  ┌────────────┐  ┌────────────┐│
 │  │ Fetch RSS  │→ │  Extract   │→ │ Summarize  │→ │   Save to  ││
 │  │   Feeds    │  │  Content   │  │  with AI   │  │  MongoDB   ││
 │  └────────────┘  └────────────┘  └────────────┘  └────────────┘│
 │                                                                   │
 │  Sources: Süddeutsche, Merkur, BR24, etc.                       │
 │  Output: Full articles + AI summaries                            │
 └──────────────────────────────────────────────────────────────────┘
                               │
                               │ Articles saved
                               ▼
                    ┌──────────────────────┐
                    │      MongoDB         │
                    │   (Data Storage)     │
                    └──────────┬───────────┘
                               │
                               │ Wait for crawler
                               ▼
                    ┌──────────────────────┐
                    │   7:00 AM Berlin     │
                    │  Newsletter Sender   │
                    └──────────┬───────────┘
                               │
                               ▼
 ┌──────────────────────────────────────────────────────────────────┐
 │                     Newsletter Sender                             │
 │  ┌────────────┐  ┌────────────┐  ┌────────────┐  ┌────────────┐│
 │  │ Wait for   │→ │   Fetch    │→ │  Generate  │→ │   Send to  ││
 │  │  Crawler   │  │  Articles  │  │ Newsletter │  │ Subscribers││
 │  └────────────┘  └────────────┘  └────────────┘  └────────────┘│
 │                                                                   │
 │  Features: Tracking pixels, link tracking, HTML templates        │
 │  Output: Personalized newsletters with engagement tracking       │
 └──────────────────────────────────────────────────────────────────┘
                               │
                               │ Emails sent
                               ▼
                    ┌──────────────────────┐
                    │    Subscribers       │
                    │  (Email Inboxes)     │
                    └──────────┬───────────┘
                               │
                               │ Opens & clicks
                               ▼
                    ┌──────────────────────┐
                    │  Tracking System     │
                    │  (Analytics API)     │
                    └──────────────────────┘
 ```
 ## 🔄 Data Flow
 ### 1. Content Acquisition (6:00 AM)
 ```
 RSS Feeds → Crawler → Full Content → AI Summary → MongoDB
 ```
 **Details**:
 - Fetches from multiple RSS sources
 - Extracts full article text
 - Generates concise summaries using Ollama
 - Stores with metadata (author, date, source)
 ### 2. Newsletter Generation (7:00 AM)
 ```
 MongoDB → Articles → Template → HTML → Email
 ```
 **Details**:
 - Waits for crawler to finish (max 30 min)
 - Fetches today's articles with summaries
 - Applies Jinja2 template
 - Injects tracking pixels
 - Replaces links with tracking URLs
 ### 3. Engagement Tracking (Ongoing)
 ```
 Email Open → Pixel Load → Log Event → Analytics
 Link Click → Redirect → Log Event → Analytics
 ```
 **Details**:
 - Tracks email opens via 1x1 pixel
 - Tracks link clicks via redirect URLs
 - Stores engagement data in MongoDB
 - Provides analytics API
 ## 🏗️ Component Architecture
 ### Docker Containers
 ```
 ┌─────────────────────────────────────────────────────────┐
 │                    Docker Network                        │
 │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐ │
 │  │   MongoDB    │  │    Crawler   │  │    Sender    │ │
 │  │              │  │              │  │              │ │
 │  │  Port: 27017 │←─│  Schedule:   │←─│  Schedule:   │ │
 │  │              │  │  6:00 AM     │  │  7:00 AM     │ │
 │  │  Storage:    │  │              │  │              │ │
 │  │  - articles  │  │  Depends on: │  │  Depends on: │ │
 │  │  - subscribers│  │  - MongoDB   │  │  - MongoDB   │ │
 │  │  - tracking  │  │              │  │  - Crawler   │ │
 │  └──────────────┘  └──────────────┘  └──────────────┘ │
 │                                                          │
 │  All containers auto-restart on failure                 │
 │  All use Europe/Berlin timezone                         │
 └─────────────────────────────────────────────────────────┘
 ```
 ### Backend Services
 ```
 ┌─────────────────────────────────────────────────────────┐
 │                   Backend Services                       │
 │                                                          │
 │  ┌──────────────────────────────────────────────────┐  │
 │  │  Flask API (Port 5001)                           │  │
 │  │  ┌────────────┐  ┌────────────┐  ┌────────────┐ │  │
 │  │  │ Tracking   │  │ Analytics  │  │  Privacy   │ │  │
 │  │  │ Endpoints  │  │ Endpoints  │  │ Endpoints  │ │  │
 │  │  └────────────┘  └────────────┘  └────────────┘ │  │
 │  └──────────────────────────────────────────────────┘  │
 │                                                          │
 │  ┌──────────────────────────────────────────────────┐  │
 │  │  Services Layer                                  │  │
 │  │  ┌────────────┐  ┌────────────┐  ┌────────────┐ │  │
 │  │  │ Tracking   │  │ Analytics  │  │  Ollama    │ │  │
 │  │  │ Service    │  │ Service    │  │  Client    │ │  │
 │  │  └────────────┘  └────────────┘  └────────────┘ │  │
 │  └──────────────────────────────────────────────────┘  │
 └─────────────────────────────────────────────────────────┘
 ```
 ## 📅 Daily Timeline
 ```
 Time (Berlin)  │ Event                    │ Duration
 ───────────────┼──────────────────────────┼──────────
 05:59:59       │ System idle              │ -
 06:00:00       │ Crawler starts           │ ~10-20 min
 06:00:01       │ - Fetch RSS feeds        │
 06:02:00       │ - Extract content        │
 06:05:00       │ - Generate summaries     │
 06:15:00       │ - Save to MongoDB        │
 06:20:00       │ Crawler finishes         │
 06:20:01       │ System idle              │ ~40 min
 07:00:00       │ Sender starts            │ ~5-10 min
 07:00:01       │ - Wait for crawler       │ (checks every 30s)
 07:00:30       │ - Crawler confirmed done │
 07:00:31       │ - Fetch articles         │
 07:01:00       │ - Generate newsletters   │
 07:02:00       │ - Send to subscribers    │
 07:10:00       │ Sender finishes          │
 07:10:01       │ System idle              │ Until tomorrow
 ```
 ## 🔐 Security & Privacy
 ### Data Protection
 ```
 ┌─────────────────────────────────────────────────────────┐
 │                  Privacy Features                        │
 │                                                          │
 │  ┌──────────────────────────────────────────────────┐  │
 │  │  Data Retention                                  │  │
 │  │  - Personal data: 90 days                        │  │
 │  │  - Anonymization: Automatic                      │  │
 │  │  - Deletion: On request                          │  │
 │  └──────────────────────────────────────────────────┘  │
 │                                                          │
 │  ┌──────────────────────────────────────────────────┐  │
 │  │  User Rights                                     │  │
 │  │  - Opt-out: Anytime                              │  │
 │  │  - Data access: API available                    │  │
 │  │  - Data deletion: Full removal                   │  │
 │  └──────────────────────────────────────────────────┘  │
 │                                                          │
 │  ┌──────────────────────────────────────────────────┐  │
 │  │  Compliance                                      │  │
 │  │  - GDPR compliant                                │  │
 │  │  - Privacy notice in emails                      │  │
 │  │  - Transparent tracking                          │  │
 │  └──────────────────────────────────────────────────┘  │
 └─────────────────────────────────────────────────────────┘
 ```
 ## 📊 Database Schema
 ### Collections
 ```
 MongoDB (munich_news)
 │
 ├── articles
 │   ├── title
 │   ├── author
 │   ├── content (full text)
 │   ├── summary (AI generated)
 │   ├── link
 │   ├── source
 │   ├── published_at
 │   └── crawled_at
 │
 ├── subscribers
 │   ├── email
 │   ├── active
 │   ├── tracking_enabled
 │   └── subscribed_at
 │
 ├── rss_feeds
 │   ├── name
 │   ├── url
 │   └── active
 │
 ├── newsletter_sends
 │   ├── tracking_id
 │   ├── newsletter_id
 │   ├── subscriber_email
 │   ├── opened
 │   ├── first_opened_at
 │   └── open_count
 │
 ├── link_clicks
 │   ├── tracking_id
 │   ├── newsletter_id
 │   ├── subscriber_email
 │   ├── article_url
 │   ├── clicked
 │   └── clicked_at
 │
 └── subscriber_activity
    ├── email
    ├── status (active/inactive/dormant)
    ├── last_opened_at
    ├── last_clicked_at
    ├── total_opens
    └── total_clicks
 ```
 ## 🚀 Deployment Architecture
 ### Development
 ```
 Local Machine
 ├── Docker Compose
 │   ├── MongoDB (no auth)
 │   ├── Crawler
 │   └── Sender
 ├── Backend (manual start)
 │   └── Flask API
 └── Ollama (optional)
    └── AI Summarization
 ```
 ### Production
 ```
 Server
 ├── Docker Compose (prod)
 │   ├── MongoDB (with auth)
 │   ├── Crawler
 │   └── Sender
 ├── Backend (systemd/pm2)
 │   └── Flask API (HTTPS)
 ├── Ollama (optional)
 │   └── AI Summarization
 └── Nginx (reverse proxy)
    └── SSL/TLS
 ```
 ## 🔄 Coordination Mechanism
 ### Crawler-Sender Synchronization
 ```
 ┌─────────────────────────────────────────────────────────┐
 │              Coordination Flow                           │
 │                                                          │
 │  6:00 AM → Crawler starts                               │
 │            ↓                                             │
 │            Crawling articles...                          │
 │            ↓                                             │
 │            Saves to MongoDB                              │
 │            ↓                                             │
 │  6:20 AM → Crawler finishes                             │
 │            ↓                                             │
 │  7:00 AM → Sender starts                                │
 │            ↓                                             │
 │            Check: Recent articles? ──→ No ──┐           │
 │            ↓ Yes                            │           │
 │            Proceed with send                │           │
 │                                             │           │
 │            ← Wait 30s ← Wait 30s ← Wait 30s┘           │
 │            (max 30 minutes)                             │
 │                                                          │
 │  7:10 AM → Newsletter sent                              │
 └─────────────────────────────────────────────────────────┘
 ```
 ## 📈 Monitoring & Observability
 ### Key Metrics
 ```
 ┌─────────────────────────────────────────────────────────┐
 │                    Metrics to Monitor                    │
 │                                                          │
 │  Crawler:                                                │
 │  - Articles crawled per day                              │
 │  - Crawl duration                                        │
 │  - Success/failure rate                                  │
 │  - Summary generation rate                               │
 │                                                          │
 │  Sender:                                                 │
 │  - Newsletters sent per day                              │
 │  - Send duration                                         │
 │  - Success/failure rate                                  │
 │  - Wait time for crawler                                 │
 │                                                          │
 │  Engagement:                                             │
 │  - Open rate                                             │
 │  - Click-through rate                                    │
 │  - Active subscribers                                    │
 │  - Dormant subscribers                                   │
 │                                                          │
 │  System:                                                 │
 │  - Container uptime                                      │
 │  - Database size                                         │
 │  - Error rate                                            │
 │  - Response times                                        │
 └─────────────────────────────────────────────────────────┘
 ```
 ## 🛠️ Maintenance Tasks
 ### Daily
 - Check logs for errors
 - Verify newsletters sent
 - Monitor engagement metrics
 ### Weekly
 - Review article quality
 - Check subscriber growth
 - Analyze engagement trends
 ### Monthly
 - Archive old articles
 - Clean up dormant subscribers
 - Update dependencies
 - Review system performance
 ## 📚 Technology Stack
 ```
 ┌─────────────────────────────────────────────────────────┐
 │                   Technology Stack                       │
 │                                                          │
 │  Backend:                                                │
 │  - Python 3.11                                           │
 │  - Flask (API)                                           │
 │  - PyMongo (Database)                                    │
 │  - Schedule (Automation)                                 │
 │  - Jinja2 (Templates)                                    │
 │  - BeautifulSoup (Parsing)                               │
 │                                                          │
 │  Database:                                               │
 │  - MongoDB 7.0                                           │
 │                                                          │
 │  AI/ML:                                                  │
 │  - Ollama (Summarization)                                │
 │  - Phi3 Model (default)                                  │
 │                                                          │
 │  Infrastructure:                                         │
 │  - Docker & Docker Compose                               │
 │  - Linux (Ubuntu/Debian)                                 │
 │                                                          │
 │  Email:                                                  │
 │  - SMTP (configurable)                                   │
 │  - HTML emails with tracking                             │
 └─────────────────────────────────────────────────────────┘
 ```
 ---
 **Last Updated**: 2024-01-16  
 **Version**: 1.0  
 **Status**: Production Ready ✅
--- a/news_crawler/article_clustering.py
+++ b/news_crawler/article_clustering.py
@@ -0,0 +1,246 @@
 """
 Article Clustering Module
 Detects and groups similar articles from different sources using Ollama AI
 """
 from difflib import SequenceMatcher
 from datetime import datetime, timedelta
 from typing import List, Dict, Optional
 from ollama_client import OllamaClient
 class ArticleClusterer:
    """
    Clusters articles about the same story from different sources using Ollama AI
    """
    def __init__(self, ollama_client: OllamaClient, similarity_threshold=0.75, time_window_hours=24):
        """
        Initialize clusterer
        Args:
            ollama_client: OllamaClient instance for AI-based similarity detection
            similarity_threshold: Minimum similarity to consider articles as same story (0-1)
            time_window_hours: Time window to look for similar articles
        """
        self.ollama_client = ollama_client
        self.similarity_threshold = similarity_threshold
        self.time_window_hours = time_window_hours
    def normalize_title(self, title: str) -> str:
        """
        Normalize title for comparison
        Args:
            title: Article title
        Returns:
            Normalized title (lowercase, stripped)
        """
        return title.lower().strip()
    def simple_stem(self, word: str) -> str:
        """
        Simple German word stemming (remove common suffixes)
        Args:
            word: Word to stem
        Returns:
            Stemmed word
        """
        # Remove common German suffixes
        suffixes = ['ungen', 'ung', 'en', 'er', 'e', 'n', 's']
        for suffix in suffixes:
            if len(word) > 5 and word.endswith(suffix):
                return word[:-len(suffix)]
        return word
    def extract_keywords(self, text: str) -> set:
        """
        Extract important keywords from text with simple stemming
        Args:
            text: Article title or content
        Returns:
            Set of stemmed keywords
        """
        # Common German stop words to ignore
        stop_words = {
            'der', 'die', 'das', 'den', 'dem', 'des', 'ein', 'eine', 'einer', 'eines',
            'und', 'oder', 'aber', 'in', 'im', 'am', 'um', 'für', 'von', 'zu', 'nach',
            'bei', 'mit', 'auf', 'an', 'aus', 'über', 'unter', 'gegen', 'durch',
            'ist', 'sind', 'war', 'waren', 'hat', 'haben', 'wird', 'werden', 'wurde', 'wurden',
            'neue', 'neuer', 'neues', 'neuen', 'sich', 'auch', 'nicht', 'nur', 'noch',
            'mehr', 'als', 'wie', 'beim', 'zum', 'zur', 'vom', 'ins', 'ans'
        }
        # Normalize and split
        words = text.lower().strip().split()
        # Filter out stop words, short words, and apply stemming
        keywords = set()
        for word in words:
            # Remove punctuation
            word = ''.join(c for c in word if c.isalnum() or c == '-')
            if len(word) > 3 and word not in stop_words:
                # Apply simple stemming
                stemmed = self.simple_stem(word)
                keywords.add(stemmed)
        return keywords
    def check_same_story_with_ai(self, article1: Dict, article2: Dict) -> bool:
        """
        Use Ollama AI to determine if two articles are about the same story
        Args:
            article1: First article
            article2: Second article
        Returns:
            True if same story, False otherwise
        """
        if not self.ollama_client.enabled:
            # Fallback to keyword-based similarity
            return self.calculate_similarity(article1, article2) >= self.similarity_threshold
        title1 = article1.get('title', '')
        title2 = article2.get('title', '')
        content1 = article1.get('content', '')[:300]  # First 300 chars
        content2 = article2.get('content', '')[:300]
        prompt = f"""Compare these two news articles and determine if they are about the SAME story/event.
 Article 1:
 Title: {title1}
 Content: {content1}
 Article 2:
 Title: {title2}
 Content: {content2}
 Answer with ONLY "YES" if they are about the same story/event, or "NO" if they are different stories.
 Consider them the same story if they report on the same event, even if from different perspectives.
 Answer:"""
        try:
            response = self.ollama_client.generate(prompt, max_tokens=10)
            answer = response.get('text', '').strip().upper()
            return 'YES' in answer
        except Exception as e:
            print(f"   ⚠ AI clustering failed: {e}, using fallback")
            # Fallback to keyword-based similarity
            return self.calculate_similarity(article1, article2) >= self.similarity_threshold
    def calculate_similarity(self, article1: Dict, article2: Dict) -> float:
        """
        Calculate similarity between two articles using title and content
        Args:
            article1: First article (dict with 'title' and optionally 'content')
            article2: Second article (dict with 'title' and optionally 'content')
        Returns:
            Similarity score (0-1)
        """
        title1 = article1.get('title', '')
        title2 = article2.get('title', '')
        content1 = article1.get('content', '')
        content2 = article2.get('content', '')
        # Extract keywords from titles
        title_keywords1 = self.extract_keywords(title1)
        title_keywords2 = self.extract_keywords(title2)
        # Calculate title similarity
        if title_keywords1 and title_keywords2:
            title_intersection = title_keywords1.intersection(title_keywords2)
            title_union = title_keywords1.union(title_keywords2)
            title_similarity = len(title_intersection) / len(title_union) if title_union else 0
        else:
            # Fallback to string similarity
            t1 = self.normalize_title(title1)
            t2 = self.normalize_title(title2)
            title_similarity = SequenceMatcher(None, t1, t2).ratio()
        # If we have content, use it for better accuracy
        if content1 and content2:
            # Extract keywords from first 500 chars of content (for performance)
            content_keywords1 = self.extract_keywords(content1[:500])
            content_keywords2 = self.extract_keywords(content2[:500])
            if content_keywords1 and content_keywords2:
                content_intersection = content_keywords1.intersection(content_keywords2)
                content_union = content_keywords1.union(content_keywords2)
                content_similarity = len(content_intersection) / len(content_union) if content_union else 0
                # Weighted average: title (40%) + content (60%)
                return (title_similarity * 0.4) + (content_similarity * 0.6)
        # If no content, use only title similarity
        return title_similarity
    def find_cluster(self, article: Dict, existing_articles: List[Dict]) -> Optional[str]:
        """
        Find if article belongs to an existing cluster using AI
        Args:
            article: New article to cluster (dict with 'title' and optionally 'content')
            existing_articles: List of existing articles
        Returns:
            cluster_id if found, None otherwise
        """
        cutoff_time = datetime.utcnow() - timedelta(hours=self.time_window_hours)
        for existing in existing_articles:
            # Only compare recent articles
            published_at = existing.get('published_at')
            if published_at and published_at < cutoff_time:
                continue
            # Use AI to check if same story
            if self.check_same_story_with_ai(article, existing):
                return existing.get('cluster_id', str(existing.get('_id')))
        return None
    def cluster_article(self, article: Dict, existing_articles: List[Dict]) -> Dict:
        """
        Cluster a single article
        Args:
            article: Article to cluster
            existing_articles: List of existing articles
        Returns:
            Article with cluster_id and is_primary fields
        """
        cluster_id = self.find_cluster(article, existing_articles)
        if cluster_id:
            # Add to existing cluster
            article['cluster_id'] = cluster_id
            article['is_primary'] = False
        else:
            # Create new cluster
            article['cluster_id'] = str(article.get('_id', datetime.utcnow().timestamp()))
            article['is_primary'] = True
        return article
    def get_cluster_articles(self, cluster_id: str, articles_collection) -> List[Dict]:
        """
        Get all articles in a cluster
        Args:
            cluster_id: Cluster ID
            articles_collection: MongoDB collection
        Returns:
            List of articles in the cluster
        """
        return list(articles_collection.find({'cluster_id': cluster_id}))
--- a/news_crawler/cluster_summarizer.py
+++ b/news_crawler/cluster_summarizer.py
@@ -0,0 +1,213 @@
 """
 Cluster Summarizer Module
 Generates neutral summaries from multiple clustered articles
 """
 from typing import List, Dict, Optional
 from datetime import datetime
 from ollama_client import OllamaClient
 class ClusterSummarizer:
    """
    Generates neutral summaries by synthesizing multiple articles about the same story
    """
    def __init__(self, ollama_client: OllamaClient, max_words=200):
        """
        Initialize cluster summarizer
        Args:
            ollama_client: OllamaClient instance for AI-based summarization
            max_words: Maximum words in neutral summary
        """
        self.ollama_client = ollama_client
        self.max_words = max_words
    def generate_neutral_summary(self, articles: List[Dict]) -> Dict:
        """
        Generate a neutral summary from multiple articles about the same story
        Args:
            articles: List of article dicts with 'title', 'content', 'source'
        Returns:
            {
                'neutral_summary': str,
                'sources': list,
                'article_count': int,
                'success': bool,
                'error': str or None,
                'duration': float
            }
        """
        if not articles or len(articles) == 0:
            return {
                'neutral_summary': None,
                'sources': [],
                'article_count': 0,
                'success': False,
                'error': 'No articles provided',
                'duration': 0
            }
        # If only one article, return its summary
        if len(articles) == 1:
            return {
                'neutral_summary': articles[0].get('summary', articles[0].get('content', '')[:500]),
                'sources': [articles[0].get('source', 'unknown')],
                'article_count': 1,
                'success': True,
                'error': None,
                'duration': 0
            }
        # Build combined context from all articles
        combined_context = self._build_combined_context(articles)
        # Generate neutral summary using AI
        prompt = self._build_neutral_summary_prompt(combined_context, len(articles))
        result = self.ollama_client.generate(prompt, max_tokens=300)
        if result['success']:
            return {
                'neutral_summary': result['text'],
                'sources': list(set(a.get('source', 'unknown') for a in articles)),
                'article_count': len(articles),
                'success': True,
                'error': None,
                'duration': result['duration']
            }
        else:
            return {
                'neutral_summary': None,
                'sources': list(set(a.get('source', 'unknown') for a in articles)),
                'article_count': len(articles),
                'success': False,
                'error': result['error'],
                'duration': result['duration']
            }
    def _build_combined_context(self, articles: List[Dict]) -> str:
        """Build combined context from multiple articles"""
        context_parts = []
        for i, article in enumerate(articles, 1):
            source = article.get('source', 'Unknown')
            title = article.get('title', 'No title')
            # Use summary if available, otherwise use first 500 chars of content
            content = article.get('summary') or article.get('content', '')[:500]
            context_parts.append(f"Source {i} ({source}):\nTitle: {title}\nContent: {content}")
        return "\n\n".join(context_parts)
    def _build_neutral_summary_prompt(self, combined_context: str, article_count: int) -> str:
        """Build prompt for neutral summary generation"""
        prompt = f"""You are a neutral news aggregator. You have {article_count} articles from different sources about the same story. Your task is to create a single, balanced summary that:
 1. Combines information from all sources
 2. Remains neutral and objective
 3. Highlights key facts that all sources agree on
 4. Notes any significant differences in perspective (if any)
 5. Is written in clear, professional English
 6. Is approximately {self.max_words} words
 Here are the articles:
 {combined_context}
 Write a neutral summary in English that synthesizes these perspectives:"""
        return prompt
 def create_cluster_summaries(db, ollama_client: OllamaClient, cluster_ids: Optional[List[str]] = None):
    """
    Create or update neutral summaries for article clusters
    Args:
        db: MongoDB database instance
        ollama_client: OllamaClient instance
        cluster_ids: Optional list of specific cluster IDs to process. If None, processes all clusters.
    Returns:
        {
            'processed': int,
            'succeeded': int,
            'failed': int,
            'errors': list
        }
    """
    summarizer = ClusterSummarizer(ollama_client, max_words=200)
    # Find clusters to process
    if cluster_ids:
        clusters_to_process = cluster_ids
    else:
        # Get all cluster IDs with multiple articles
        pipeline = [
            {"$match": {"cluster_id": {"$exists": True}}},
            {"$group": {"_id": "$cluster_id", "count": {"$sum": 1}}},
            {"$match": {"count": {"$gt": 1}}},
            {"$project": {"_id": 1}}
        ]
        clusters_to_process = [c['_id'] for c in db.articles.aggregate(pipeline)]
    processed = 0
    succeeded = 0
    failed = 0
    errors = []
    for cluster_id in clusters_to_process:
        try:
            # Get all articles in this cluster
            articles = list(db.articles.find({"cluster_id": cluster_id}))
            if len(articles) < 2:
                continue
            print(f"Processing cluster {cluster_id}: {len(articles)} articles")
            # Generate neutral summary
            result = summarizer.generate_neutral_summary(articles)
            processed += 1
            if result['success']:
                # Save cluster summary
                db.cluster_summaries.update_one(
                    {"cluster_id": cluster_id},
                    {
                        "$set": {
                            "cluster_id": cluster_id,
                            "neutral_summary": result['neutral_summary'],
                            "sources": result['sources'],
                            "article_count": result['article_count'],
                            "created_at": datetime.utcnow(),
                            "updated_at": datetime.utcnow()
                        }
                    },
                    upsert=True
                )
                succeeded += 1
                print(f"  ✓ Generated neutral summary ({len(result['neutral_summary'])} chars)")
            else:
                failed += 1
                error_msg = f"Cluster {cluster_id}: {result['error']}"
                errors.append(error_msg)
                print(f"  ✗ Failed: {result['error']}")
        except Exception as e:
            failed += 1
            error_msg = f"Cluster {cluster_id}: {str(e)}"
            errors.append(error_msg)
            print(f"  ✗ Error: {e}")
    return {
        'processed': processed,
        'succeeded': succeeded,
        'failed': failed,
        'errors': errors
    }
--- a/news_crawler/crawler_service.py
+++ b/news_crawler/crawler_service.py
@@ -13,6 +13,8 @@ from dotenv import load_dotenv
 from rss_utils import extract_article_url, extract_article_summary, extract_published_date
 from config import Config
 from ollama_client import OllamaClient
 from article_clustering import ArticleClusterer
 from cluster_summarizer import create_cluster_summaries
 # Load environment variables
 load_dotenv(dotenv_path='../.env')
@@ -33,6 +35,9 @@ ollama_client = OllamaClient(
    timeout=Config.OLLAMA_TIMEOUT
 )
 # Initialize Article Clusterer (will be initialized after ollama_client)
 article_clusterer = None
 # Print configuration on startup
 if __name__ != '__main__':
    Config.print_config()
@@ -45,6 +50,14 @@ if __name__ != '__main__':
    else:
        print("ℹ Ollama AI summarization: DISABLED")
    # Initialize Article Clusterer with ollama_client
    article_clusterer = ArticleClusterer(
        ollama_client=ollama_client,
        similarity_threshold=0.60,  # Not used when AI is enabled
        time_window_hours=24        # Look back 24 hours
    )
    print("🔗 Article clustering: ENABLED (AI-powered)")
 def get_active_rss_feeds():
    """Get all active RSS feeds from database"""
@@ -394,6 +407,13 @@ def crawl_rss_feed(feed_url, feed_name, feed_category='general', max_articles=10
                    'created_at': datetime.utcnow()
                }
                # Cluster article with existing articles (detect duplicates from other sources)
                from datetime import timedelta
                recent_articles = list(articles_collection.find({
                    'published_at': {'$gte': datetime.utcnow() - timedelta(hours=24)}
                }))
                article_doc = article_clusterer.cluster_article(article_doc, recent_articles)
                try:
                    # Upsert: update if exists, insert if not
                    articles_collection.update_one(
@@ -434,6 +454,16 @@ def crawl_all_feeds(max_articles_per_feed=10):
    Crawl all active RSS feeds
    Returns: dict with statistics
    """
    global article_clusterer
    # Initialize clusterer if not already done
    if article_clusterer is None:
        article_clusterer = ArticleClusterer(
            ollama_client=ollama_client,
            similarity_threshold=0.60,
            time_window_hours=24
        )
    print("\n" + "="*60)
    print("🚀 Starting RSS Feed Crawler")
    print("="*60)
@@ -485,12 +515,29 @@ def crawl_all_feeds(max_articles_per_feed=10):
        print(f"  Average time per article: {duration/total_crawled:.1f}s")
    print("="*60 + "\n")
    # Generate neutral summaries for clustered articles
    cluster_summary_stats = {'processed': 0, 'succeeded': 0, 'failed': 0}
    if Config.OLLAMA_ENABLED and total_crawled > 0:
        print("\n" + "="*60)
        print("🔄 Generating Neutral Summaries for Clustered Articles")
        print("="*60)
        cluster_summary_stats = create_cluster_summaries(db, ollama_client)
        print("\n" + "="*60)
        print(f"✓ Cluster Summarization Complete!")
        print(f"  Clusters processed: {cluster_summary_stats['processed']}")
        print(f"  Succeeded: {cluster_summary_stats['succeeded']}")
        print(f"  Failed: {cluster_summary_stats['failed']}")
        print("="*60 + "\n")
    return {
        'total_feeds': len(feeds),
        'total_articles_crawled': total_crawled,
        'total_summarized': total_summarized,
        'failed_summaries': total_failed,
-        'duration_seconds': round(duration, 2)
+        'duration_seconds': round(duration, 2),
        'cluster_summaries': cluster_summary_stats
    }
--- a/news_crawler/ollama_client.py
+++ b/news_crawler/ollama_client.py
@@ -392,6 +392,80 @@ English Summary (max {max_words} words):"""
                'error': str(e)
            }
    def generate(self, prompt, max_tokens=100):
        """
        Generate text using Ollama
        Args:
            prompt: Text prompt
            max_tokens: Maximum tokens to generate
        Returns:
            {
                'text': str,           # Generated text
                'success': bool,       # Whether generation succeeded
                'error': str or None,  # Error message if failed
                'duration': float      # Time taken in seconds
            }
        """
        if not self.enabled:
            return {
                'text': '',
                'success': False,
                'error': 'Ollama is disabled',
                'duration': 0
            }
        start_time = time.time()
        try:
            response = requests.post(
                f"{self.base_url}/api/generate",
                json={
                    "model": self.model,
                    "prompt": prompt,
                    "stream": False,
                    "options": {
                        "num_predict": max_tokens,
                        "temperature": 0.1  # Low temperature for consistent answers
                    }
                },
                timeout=self.timeout
            )
            duration = time.time() - start_time
            if response.status_code == 200:
                result = response.json()
                return {
                    'text': result.get('response', '').strip(),
                    'success': True,
                    'error': None,
                    'duration': duration
                }
            else:
                return {
                    'text': '',
                    'success': False,
                    'error': f"HTTP {response.status_code}: {response.text}",
                    'duration': duration
                }
        except requests.exceptions.Timeout:
            return {
                'text': '',
                'success': False,
                'error': f"Request timed out after {self.timeout}s",
                'duration': time.time() - start_time
            }
        except Exception as e:
            return {
                'text': '',
                'success': False,
                'error': str(e),
                'duration': time.time() - start_time
            }
 if __name__ == '__main__':
    # Quick test
--- a/tests/crawler/README.md
+++ b/tests/crawler/README.md
@@ -0,0 +1,110 @@
 # Crawler Tests
 Test suite for the news crawler, AI clustering, and neutral summary generation.
 ## Test Files
 ### AI Clustering & Aggregation Tests
 - **`test_clustering_real.py`** - Tests AI-powered article clustering with realistic fake articles
 - **`test_neutral_summaries.py`** - Tests neutral summary generation from clustered articles
 - **`test_complete_workflow.py`** - End-to-end test of clustering + neutral summaries
 ### Core Crawler Tests
 - **`test_crawler.py`** - Basic crawler functionality
 - **`test_ollama.py`** - Ollama AI integration tests
 - **`test_rss_feeds.py`** - RSS feed parsing tests
 ## Running Tests
 ### Run All Tests
 ```bash
 # From project root
 docker-compose exec crawler python -m pytest tests/crawler/
 ```
 ### Run Specific Test
 ```bash
 # AI clustering test
 docker-compose exec crawler python tests/crawler/test_clustering_real.py
 # Neutral summaries test
 docker-compose exec crawler python tests/crawler/test_neutral_summaries.py
 # Complete workflow test
 docker-compose exec crawler python tests/crawler/test_complete_workflow.py
 ```
 ### Run Tests Inside Container
 ```bash
 # Enter container
 docker-compose exec crawler bash
 # Run tests
 python test_clustering_real.py
 python test_neutral_summaries.py
 python test_complete_workflow.py
 ```
 ## Test Data
 Tests use fake articles to avoid depending on external RSS feeds:
 **Test Scenarios:**
 1. **Same story, different sources** - Should cluster together
 2. **Different stories** - Should remain separate
 3. **Multi-source clustering** - Should generate neutral summaries
 **Expected Results:**
 - Housing story (2 sources) → Cluster together → Neutral summary
 - Bayern transfer (2 sources) → Cluster together → Neutral summary
 - Single-source stories → Individual summaries
 ## Cleanup
 Tests create temporary data in MongoDB. To clean up:
 ```bash
 # Clean test articles
 docker-compose exec crawler python << 'EOF'
 from pymongo import MongoClient
 client = MongoClient("mongodb://admin:changeme@mongodb:27017/")
 db = client["munich_news"]
 db.articles.delete_many({"link": {"$regex": "^https://example.com/"}})
 db.cluster_summaries.delete_many({})
 print("✓ Test data cleaned")
 EOF
 ```
 ## Requirements
 - Docker containers must be running
 - Ollama service must be available
 - MongoDB must be accessible
 - AI model (phi3:latest) must be downloaded
 ## Troubleshooting
 ### Ollama Not Available
 ```bash
 # Check Ollama status
 docker-compose logs ollama
 # Restart Ollama
 docker-compose restart ollama
 ```
 ### Tests Timing Out
 - Increase timeout in test files (default: 60s)
 - Check Ollama model is downloaded
 - Verify GPU acceleration if enabled
 ### MongoDB Connection Issues
 ```bash
 # Check MongoDB status
 docker-compose logs mongodb
 # Restart MongoDB
 docker-compose restart mongodb
 ```
--- a/tests/crawler/test_clustering_real.py
+++ b/tests/crawler/test_clustering_real.py
@@ -0,0 +1,166 @@
 #!/usr/bin/env python3
 """
 Test AI clustering with realistic fake articles
 """
 from pymongo import MongoClient
 from datetime import datetime
 import sys
 # Connect to MongoDB
 client = MongoClient("mongodb://admin:changeme@mongodb:27017/")
 db = client["munich_news"]
 # Create test articles about the same Munich story from different sources
 test_articles = [
    {
        "title": "München: Stadtrat beschließt neue Regelungen für Wohnungsbau",
        "content": """Der Münchner Stadtrat hat am Dienstag neue Regelungen für den Wohnungsbau beschlossen. 
        Die Maßnahmen sollen den Bau von bezahlbarem Wohnraum in der bayerischen Landeshauptstadt fördern. 
        Oberbürgermeister Dieter Reiter (SPD) sprach von einem wichtigen Schritt zur Lösung der Wohnungskrise.
        Die neuen Regelungen sehen vor, dass bei Neubauprojekten mindestens 40 Prozent der Wohnungen 
        als Sozialwohnungen gebaut werden müssen. Zudem werden Bauvorschriften vereinfacht.""",
        "source": "abendzeitung-muenchen",
        "link": "https://example.com/az-wohnungsbau-1",
        "published_at": datetime.utcnow(),
        "category": "local",
        "word_count": 85
    },
    {
        "title": "Stadtrat München stimmt für neue Wohnungsbau-Verordnung",
        "content": """In einer Sitzung am Dienstag stimmte der Münchner Stadtrat für neue Wohnungsbau-Verordnungen.
        Die Beschlüsse zielen darauf ab, mehr bezahlbaren Wohnraum in München zu schaffen.
        OB Reiter bezeichnete die Entscheidung als Meilenstein im Kampf gegen die Wohnungsnot.
        Künftig müssen 40 Prozent aller Neubauwohnungen als Sozialwohnungen errichtet werden.
        Außerdem werden bürokratische Hürden beim Bauen abgebaut.""",
        "source": "sueddeutsche",
        "link": "https://example.com/sz-wohnungsbau-1",
        "published_at": datetime.utcnow(),
        "category": "local",
        "word_count": 72
    },
    {
        "title": "FC Bayern München verpflichtet neuen Stürmer aus Brasilien",
        "content": """Der FC Bayern München hat einen neuen Stürmer verpflichtet. Der 23-jährige Brasilianer
        wechselt für eine Ablösesumme von 50 Millionen Euro nach München. Sportdirektor Christoph Freund
        zeigte sich begeistert von der Verpflichtung. Der Spieler soll die Offensive verstärken.""",
        "source": "abendzeitung-muenchen",
        "link": "https://example.com/az-bayern-1",
        "published_at": datetime.utcnow(),
        "category": "sports",
        "word_count": 52
    },
    {
        "title": "Bayern München holt brasilianischen Angreifer",
        "content": """Der deutsche Rekordmeister Bayern München hat einen brasilianischen Stürmer unter Vertrag genommen.
        Für 50 Millionen Euro wechselt der 23-Jährige an die Isar. Sportdirektor Freund lobte den Transfer.
        Der Neuzugang soll die Münchner Offensive beleben und für mehr Torgefahr sorgen.""",
        "source": "sueddeutsche",
        "link": "https://example.com/sz-bayern-1",
        "published_at": datetime.utcnow(),
        "category": "sports",
        "word_count": 48
    }
 ]
 print("Testing AI Clustering with Realistic Articles")
 print("=" * 70)
 print()
 # Clear previous test articles
 print("Cleaning up previous test articles...")
 db.articles.delete_many({"link": {"$regex": "^https://example.com/"}})
 print("✓ Cleaned up")
 print()
 # Import clustering
 sys.path.insert(0, '/app')
 from ollama_client import OllamaClient
 from article_clustering import ArticleClusterer
 from config import Config
 # Initialize
 ollama_client = OllamaClient(
    base_url=Config.OLLAMA_BASE_URL,
    model=Config.OLLAMA_MODEL,
    enabled=Config.OLLAMA_ENABLED,
    timeout=30
 )
 clusterer = ArticleClusterer(
    ollama_client=ollama_client,
    similarity_threshold=0.50,
    time_window_hours=24
 )
 print("Processing articles with AI clustering...")
 print()
 clustered_articles = []
 for i, article in enumerate(test_articles, 1):
    print(f"{i}. Processing: {article['title'][:60]}...")
    print(f"   Source: {article['source']}")
    # Cluster with previously processed articles
    clustered = clusterer.cluster_article(article, clustered_articles)
    clustered_articles.append(clustered)
    print(f"   → Cluster ID: {clustered['cluster_id']}")
    print(f"   → Is Primary: {clustered['is_primary']}")
    # Insert into database
    db.articles.insert_one(clustered)
    print(f"   ✓ Saved to database")
    print()
 print("=" * 70)
 print("Clustering Results:")
 print()
 # Analyze results
 clusters = {}
 for article in clustered_articles:
    cluster_id = article['cluster_id']
    if cluster_id not in clusters:
        clusters[cluster_id] = []
    clusters[cluster_id].append(article)
 for cluster_id, articles in clusters.items():
    print(f"Cluster {cluster_id}: {len(articles)} article(s)")
    for article in articles:
        print(f"  - [{article['source']}] {article['title'][:60]}...")
    print()
 # Expected results
 print("=" * 70)
 print("Expected Results:")
 print("  ✓ Articles 1&2 should be in same cluster (housing story)")
 print("  ✓ Articles 3&4 should be in same cluster (Bayern transfer)")
 print("  ✓ Total: 2 clusters with 2 articles each")
 print()
 # Actual results
 housing_cluster = [a for a in clustered_articles if 'Wohnungsbau' in a['title'] or 'Wohnungsbau' in a['title']]
 bayern_cluster = [a for a in clustered_articles if 'Bayern' in a['title'] or 'Stürmer' in a['title']]
 housing_cluster_ids = set(a['cluster_id'] for a in housing_cluster)
 bayern_cluster_ids = set(a['cluster_id'] for a in bayern_cluster)
 print("Actual Results:")
 if len(housing_cluster_ids) == 1:
    print("  ✓ Housing articles clustered together")
 else:
    print(f"  ✗ Housing articles in {len(housing_cluster_ids)} different clusters")
 if len(bayern_cluster_ids) == 1:
    print("  ✓ Bayern articles clustered together")
 else:
    print(f"  ✗ Bayern articles in {len(bayern_cluster_ids)} different clusters")
 if len(clusters) == 2:
    print("  ✓ Total clusters: 2 (correct)")
 else:
    print(f"  ✗ Total clusters: {len(clusters)} (expected 2)")
 print()
 print("=" * 70)
 print("✓ Test complete! Check the results above.")
--- a/tests/crawler/test_complete_workflow.py
+++ b/tests/crawler/test_complete_workflow.py
@@ -0,0 +1,187 @@
 #!/usr/bin/env python3
 """
 Complete workflow test: Clustering + Neutral Summaries
 """
 from pymongo import MongoClient
 from datetime import datetime
 import sys
 client = MongoClient("mongodb://admin:changeme@mongodb:27017/")
 db = client["munich_news"]
 print("=" * 70)
 print("COMPLETE WORKFLOW TEST: AI Clustering + Neutral Summaries")
 print("=" * 70)
 print()
 # Clean up previous test
 print("1. Cleaning up previous test data...")
 db.articles.delete_many({"link": {"$regex": "^https://example.com/"}})
 db.cluster_summaries.delete_many({"cluster_id": {"$regex": "^test_"}})
 print("   ✓ Cleaned up")
 print()
 # Import modules
 sys.path.insert(0, '/app')
 from ollama_client import OllamaClient
 from article_clustering import ArticleClusterer
 from cluster_summarizer import ClusterSummarizer
 from config import Config
 # Initialize
 ollama_client = OllamaClient(
    base_url=Config.OLLAMA_BASE_URL,
    model=Config.OLLAMA_MODEL,
    enabled=Config.OLLAMA_ENABLED,
    timeout=60
 )
 clusterer = ArticleClusterer(ollama_client, similarity_threshold=0.50, time_window_hours=24)
 summarizer = ClusterSummarizer(ollama_client, max_words=200)
 # Test articles - 2 stories, 2 sources each
 test_articles = [
    # Story 1: Munich Housing (2 sources)
    {
        "title": "München: Stadtrat beschließt neue Wohnungsbau-Regelungen",
        "content": "Der Münchner Stadtrat hat neue Regelungen für bezahlbaren Wohnungsbau beschlossen. 40% Sozialwohnungen werden Pflicht.",
        "source": "abendzeitung-muenchen",
        "link": "https://example.com/test-housing-az",
        "published_at": datetime.utcnow(),
        "category": "local"
    },
    {
        "title": "Stadtrat München: Neue Verordnung für Wohnungsbau",
        "content": "München führt neue Wohnungsbau-Verordnung ein. Mindestens 40% der Neubauten müssen Sozialwohnungen sein.",
        "source": "sueddeutsche",
        "link": "https://example.com/test-housing-sz",
        "published_at": datetime.utcnow(),
        "category": "local"
    },
    # Story 2: Bayern Transfer (2 sources)
    {
        "title": "FC Bayern verpflichtet brasilianischen Stürmer für 50 Millionen",
        "content": "Bayern München holt einen 23-jährigen Brasilianer. Sportdirektor Freund ist begeistert.",
        "source": "abendzeitung-muenchen",
        "link": "https://example.com/test-bayern-az",
        "published_at": datetime.utcnow(),
        "category": "sports"
    },
    {
        "title": "Bayern München: Neuzugang aus Brasilien für 50 Mio. Euro",
        "content": "Der Rekordmeister verstärkt die Offensive mit einem brasilianischen Angreifer. Freund lobt den Transfer.",
        "source": "sueddeutsche",
        "link": "https://example.com/test-bayern-sz",
        "published_at": datetime.utcnow(),
        "category": "sports"
    }
 ]
 print("2. Processing articles with AI clustering...")
 print()
 clustered_articles = []
 for i, article in enumerate(test_articles, 1):
    print(f"   Article {i}: {article['title'][:50]}...")
    print(f"   Source: {article['source']}")
    # Cluster
    clustered = clusterer.cluster_article(article, clustered_articles)
    clustered_articles.append(clustered)
    print(f"   → Cluster: {clustered['cluster_id']}")
    print(f"   → Primary: {clustered['is_primary']}")
    # Save to DB
    db.articles.insert_one(clustered)
    print(f"   ✓ Saved")
    print()
 print("=" * 70)
 print("3. Clustering Results:")
 print()
 # Analyze clusters
 clusters = {}
 for article in clustered_articles:
    cid = article['cluster_id']
    if cid not in clusters:
        clusters[cid] = []
    clusters[cid].append(article)
 print(f"   Total clusters: {len(clusters)}")
 print()
 for cid, articles in clusters.items():
    print(f"   Cluster {cid}:")
    print(f"   - Articles: {len(articles)}")
    for article in articles:
        print(f"     • [{article['source']}] {article['title'][:45]}...")
    print()
 # Check expectations
 if len(clusters) == 2:
    print("   ✓ Expected 2 clusters (housing + bayern)")
 else:
    print(f"   ⚠ Expected 2 clusters, got {len(clusters)}")
 print()
 print("=" * 70)
 print("4. Generating neutral summaries...")
 print()
 summary_count = 0
 for cid, articles in clusters.items():
    if len(articles) < 2:
        print(f"   Skipping cluster {cid} (only 1 article)")
        continue
    print(f"   Cluster {cid}: {len(articles)} articles")
    result = summarizer.generate_neutral_summary(articles)
    if result['success']:
        print(f"   ✓ Generated summary ({result['duration']:.1f}s)")
        # Save
        db.cluster_summaries.insert_one({
            "cluster_id": cid,
            "neutral_summary": result['neutral_summary'],
            "sources": result['sources'],
            "article_count": result['article_count'],
            "created_at": datetime.utcnow()
        })
        summary_count += 1
        # Show preview
        preview = result['neutral_summary'][:100] + "..."
        print(f"   Preview: {preview}")
    else:
        print(f"   ✗ Failed: {result['error']}")
    print()
 print("=" * 70)
 print("5. Final Results:")
 print()
 test_article_count = db.articles.count_documents({"link": {"$regex": "^https://example.com/test-"}})
 test_summary_count = db.cluster_summaries.count_documents({})
 print(f"   Articles saved: {test_article_count}")
 print(f"   Clusters created: {len(clusters)}")
 print(f"   Neutral summaries: {summary_count}")
 print()
 if len(clusters) == 2 and summary_count == 2:
    print("   ✅ SUCCESS! Complete workflow working perfectly!")
    print()
    print("   The system now:")
    print("   1. ✓ Clusters articles from different sources")
    print("   2. ✓ Generates neutral summaries combining perspectives")
    print("   3. ✓ Stores everything in MongoDB")
 else:
    print("   ⚠ Partial success - check results above")
 print()
 print("=" * 70)
--- a/tests/crawler/test_neutral_summaries.py
+++ b/tests/crawler/test_neutral_summaries.py
@@ -0,0 +1,130 @@
 #!/usr/bin/env python3
 """
 Test neutral summary generation from clustered articles
 """
 from pymongo import MongoClient
 from datetime import datetime
 import sys
 # Connect to MongoDB
 client = MongoClient("mongodb://admin:changeme@mongodb:27017/")
 db = client["munich_news"]
 print("Testing Neutral Summary Generation")
 print("=" * 70)
 print()
 # Check for test articles
 test_articles = list(db.articles.find(
    {"link": {"$regex": "^https://example.com/"}}
 ).sort("_id", 1))
 if len(test_articles) == 0:
    print("⚠ No test articles found. Run test-clustering-real.py first.")
    sys.exit(1)
 print(f"Found {len(test_articles)} test articles")
 print()
 # Find clusters with multiple articles
 clusters = {}
 for article in test_articles:
    cid = article['cluster_id']
    if cid not in clusters:
        clusters[cid] = []
    clusters[cid].append(article)
 multi_article_clusters = {k: v for k, v in clusters.items() if len(v) > 1}
 if len(multi_article_clusters) == 0:
    print("⚠ No clusters with multiple articles found")
    sys.exit(1)
 print(f"Found {len(multi_article_clusters)} cluster(s) with multiple articles")
 print()
 # Import cluster summarizer
 sys.path.insert(0, '/app')
 from ollama_client import OllamaClient
 from cluster_summarizer import ClusterSummarizer
 from config import Config
 # Initialize
 ollama_client = OllamaClient(
    base_url=Config.OLLAMA_BASE_URL,
    model=Config.OLLAMA_MODEL,
    enabled=Config.OLLAMA_ENABLED,
    timeout=60
 )
 summarizer = ClusterSummarizer(ollama_client, max_words=200)
 print("Generating neutral summaries...")
 print("=" * 70)
 print()
 for cluster_id, articles in multi_article_clusters.items():
    print(f"Cluster: {cluster_id}")
    print(f"Articles: {len(articles)}")
    print()
    # Show individual articles
    for i, article in enumerate(articles, 1):
        print(f"  {i}. [{article['source']}] {article['title'][:60]}...")
    print()
    # Generate neutral summary
    print("  Generating neutral summary...")
    result = summarizer.generate_neutral_summary(articles)
    if result['success']:
        print(f"  ✓ Success ({result['duration']:.1f}s)")
        print()
        print("  Neutral Summary:")
        print("  " + "-" * 66)
        # Wrap text at 66 chars
        summary = result['neutral_summary']
        words = summary.split()
        lines = []
        current_line = "  "
        for word in words:
            if len(current_line) + len(word) + 1 <= 68:
                current_line += word + " "
            else:
                lines.append(current_line.rstrip())
                current_line = "  " + word + " "
        if current_line.strip():
            lines.append(current_line.rstrip())
        print("\n".join(lines))
        print("  " + "-" * 66)
        print()
        # Save to database
        db.cluster_summaries.update_one(
            {"cluster_id": cluster_id},
            {
                "$set": {
                    "cluster_id": cluster_id,
                    "neutral_summary": result['neutral_summary'],
                    "sources": result['sources'],
                    "article_count": result['article_count'],
                    "created_at": datetime.utcnow(),
                    "updated_at": datetime.utcnow()
                }
            },
            upsert=True
        )
        print("  ✓ Saved to cluster_summaries collection")
    else:
        print(f"  ✗ Failed: {result['error']}")
    print()
    print("=" * 70)
    print()
 print("Testing complete!")
 print()
 # Show summary statistics
 total_cluster_summaries = db.cluster_summaries.count_documents({})
 print(f"Total cluster summaries in database: {total_cluster_summaries}")