# System Architecture

Complete system design and architecture documentation.

---

## Overview

Munich News Daily is an automated news aggregation system that crawls Munich news sources, generates AI summaries, and sends daily newsletters.

```
┌─────────────────────────────────────────────────┐
│                 Docker Network                  │
│                 (Internal Only)                 │
├─────────────────────────────────────────────────┤
│                                                 │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐   │
│  │ MongoDB  │◄───┤ Backend  │◄───┤ Crawler  │   │
│  │ (27017)  │    │  (5001)  │    │          │   │
│  └──────────┘    └────┬─────┘    └──────────┘   │
│                       │                         │
│  ┌──────────┐         │          ┌──────────┐   │
│  │  Ollama  │◄────────┤          │  Sender  │   │
│  │ (11434)  │         │          │          │   │
│  └──────────┘         │          └──────────┘   │
│                       │                         │
└───────────────────────┼─────────────────────────┘
                        │
                        │ Port 5001 (Only exposed port)
                        ▼
                  Host Machine
                External Network
```

---

## Components

### 1. MongoDB (Database)

- **Purpose**: Store articles, subscribers, tracking data
- **Port**: 27017 (internal only)
- **Access**: Only via Docker network
- **Authentication**: Username/password

**Collections:**

- `articles` - News articles with summaries
- `subscribers` - Newsletter subscribers
- `rss_feeds` - RSS feed sources
- `newsletter_sends` - Send tracking
- `link_clicks` - Click tracking

### 2. Backend API (Flask)

- **Purpose**: API endpoints, tracking, analytics
- **Port**: 5001 (exposed to host)
- **Access**: Public API, admin endpoints
- **Features**: Tracking pixels, click tracking, admin operations

**Key Endpoints:**

- `/api/admin/*` - Admin operations
- `/api/subscribe` - Subscribe to newsletter
- `/api/tracking/*` - Tracking endpoints
- `/health` - Health check

### 3. Ollama (AI Service)

- **Purpose**: AI summarization and translation
- **Port**: 11434 (internal only)
- **Model**: phi3:latest (2.2GB)
- **GPU**: Optional NVIDIA GPU support

**Features:**

- Article summarization (150 words)
- Title translation (German → English)
- Configurable timeout and model

### 4. Crawler (News Fetcher)

- **Purpose**: Fetch and process news articles
- **Schedule**: 6:00 AM Berlin time (automated)
- **Features**: RSS parsing, content extraction, AI processing

**Process:**

1. Fetch RSS feeds
2. Extract article content
3. Translate title (German → English)
4. Generate AI summary
5. Store in MongoDB

### 5. Sender (Newsletter)

- **Purpose**: Send newsletters to subscribers
- **Schedule**: 7:00 AM Berlin time (automated)
- **Features**: Email sending, tracking, templating

**Process:**

1. Fetch today's articles
2. Generate newsletter HTML
3. Add tracking pixels/links
4. Send to active subscribers
5. Record send events

---

## Data Flow

### Article Processing Flow

```
RSS Feed
    ↓
Crawler fetches
    ↓
Extract content
    ↓
Translate title (Ollama)
    ↓
Generate summary (Ollama)
    ↓
Store in MongoDB
    ↓
Newsletter Sender
    ↓
Email to subscribers
```

### Tracking Flow

```
Newsletter sent
    ↓
Tracking pixel embedded
    ↓
User opens email
    ↓
Pixel loaded → Backend API
    ↓
Record open event
    ↓
User clicks link
    ↓
Redirect via Backend API
    ↓
Record click event
    ↓
Redirect to article
```

---

## Database Schema

### Articles Collection

```javascript
{
  _id: ObjectId,
  title: String,          // Original German title
  title_en: String,       // English translation
  translated_at: Date,    // Translation timestamp
  link: String,
  summary: String,        // AI-generated summary
  content: String,        // Full article text
  author: String,
  source: String,         // RSS feed name
  published_at: Date,
  crawled_at: Date,
  created_at: Date
}
```

### Subscribers Collection

```javascript
{
  _id: ObjectId,
  email: String,          // Unique
  subscribed_at: Date,
  status: String          // 'active' or 'inactive'
}
```

### RSS Feeds Collection

```javascript
{
  _id: ObjectId,
  name: String,
  url: String,
  active: Boolean,
  last_crawled: Date
}
```

---

## Security Architecture

### Network Isolation

**Exposed Services:**

- Backend API (port 5001) - Only exposed service

**Internal Services:**

- MongoDB (port 27017) - Not accessible from host
- Ollama (port 11434) - Not accessible from host
- Crawler - No ports
- Sender - No ports

**Benefits:**

- 66% reduction in attack surface
- Database protected from external access
- AI service protected from abuse
- Defense in depth

### Authentication

**MongoDB:**

- Username/password authentication
- Credentials in environment variables
- Internal network only

**Backend API:**

- No authentication (add in production)
- Rate limiting recommended
- IP whitelisting recommended

### Data Protection

- Subscriber emails stored securely
- No sensitive data in logs
- Environment variables for secrets
- `.env` file in `.gitignore`

---

## Technology Stack

### Backend

- **Language**: Python 3.11
- **Framework**: Flask
- **Database**: MongoDB 7.0
- **AI**: Ollama (phi3:latest)

### Infrastructure

- **Containerization**: Docker & Docker Compose
- **Networking**: Docker bridge network
- **Storage**: Docker volumes
- **Scheduling**: Python schedule library

### Libraries

- **Web**: Flask, Flask-CORS
- **Database**: pymongo
- **Email**: smtplib, email.mime
- **Scraping**: requests, BeautifulSoup4, feedparser
- **Templating**: Jinja2
- **AI**: requests (Ollama API)

---

## Deployment Architecture

### Development

```
Local Machine
├── Docker Compose
│   ├── MongoDB (internal)
│   ├── Ollama (internal)
│   ├── Backend (exposed)
│   ├── Crawler (internal)
│   └── Sender (internal)
└── .env file
```

### Production

```
Server
├── Reverse Proxy (nginx/Traefik)
│   ├── SSL/TLS
│   ├── Rate limiting
│   └── Authentication
├── Docker Compose
│   ├── MongoDB (internal)
│   ├── Ollama (internal, GPU)
│   ├── Backend (internal)
│   ├── Crawler (internal)
│   └── Sender (internal)
├── Monitoring
│   ├── Logs
│   ├── Metrics
│   └── Alerts
└── Backups
    ├── MongoDB dumps
    └── Configuration
```

---

## Scalability

### Current Limits

- Single server deployment
- Sequential article processing
- Single MongoDB instance
- No load balancing

### Scaling Options

**Horizontal Scaling:**

- Multiple crawler instances
- Load-balanced backend
- MongoDB replica set
- Distributed Ollama

**Vertical Scaling:**

- More CPU cores
- More RAM
- GPU acceleration (5-10x faster)
- Faster storage

**Optimization:**

- Batch processing
- Caching
- Database indexing
- Connection pooling

---

## Monitoring

### Health Checks

```bash
# Backend health
curl http://localhost:5001/health

# MongoDB health
docker-compose exec mongodb mongosh --eval "db.adminCommand('ping')"

# Ollama health
docker-compose exec ollama ollama list
```

### Metrics

- Article count
- Subscriber count
- Newsletter open rate
- Click-through rate
- Processing time
- Error rate

### Logs

```bash
# All services
docker-compose logs -f

# Specific service
docker-compose logs -f crawler
docker-compose logs -f backend
```

---

## Backup & Recovery

### MongoDB Backup

```bash
# Backup
docker-compose exec mongodb mongodump --out /backup

# Restore
docker-compose exec mongodb mongorestore /backup
```

### Configuration Backup

- `backend/.env` - Environment variables
- `docker-compose.yml` - Service configuration
- RSS feeds in MongoDB

### Recovery Plan

1. Restore MongoDB from backup
2. Restore configuration files
3. Restart services
4. Verify functionality
---

## Performance

### CPU Mode

- Translation: ~1.5s per title
- Summarization: ~8s per article
- 10 articles: ~115s total
- Suitable for <20 articles/day

### GPU Mode (5-10x faster)

- Translation: ~0.3s per title
- Summarization: ~2s per article
- 10 articles: ~31s total
- Suitable for high-volume processing

### Resource Usage

**CPU Mode:**

- CPU: 60-80%
- RAM: 4-6GB
- Disk: ~1GB (with model)

**GPU Mode:**

- CPU: 10-20%
- RAM: 2-3GB
- GPU: 80-100%
- VRAM: 3-4GB
- Disk: ~1GB (with model)

---

## Future Enhancements

### Planned Features

- Frontend dashboard
- Real-time analytics
- Multiple languages
- Custom RSS feeds per subscriber
- A/B testing for newsletters
- Advanced tracking

### Technical Improvements

- Kubernetes deployment
- Microservices architecture
- Message queue (RabbitMQ/Redis)
- Caching layer (Redis)
- CDN for assets
- Advanced monitoring (Prometheus/Grafana)

---

See [SETUP.md](SETUP.md) for the deployment guide and [SECURITY.md](SECURITY.md) for security best practices.
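---

## Appendix: Implementation Sketches

The tracking flow described earlier hinges on two rewrites the Sender applies to the newsletter HTML before sending: embedding an open-tracking pixel and routing article links through the backend's redirect endpoint. A minimal Python sketch follows; the exact paths under `/api/tracking/*` and the `send_id` parameter are illustrative assumptions, not confirmed routes.

```python
import re
from urllib.parse import quote


def add_tracking(html: str, base_url: str, send_id: str) -> str:
    """Embed an open-tracking pixel and rewrite links to click-tracking
    redirects. Endpoint paths are assumptions for illustration."""

    # Rewrite every href to go through the backend's redirect endpoint,
    # passing the original article URL as an encoded query parameter.
    def rewrite(match: re.Match) -> str:
        original = match.group(1)
        return (
            f'href="{base_url}/api/tracking/click/{send_id}'
            f'?url={quote(original, safe="")}"'
        )

    tracked = re.sub(r'href="([^"]+)"', rewrite, html)

    # Append a 1x1 pixel; loading it records an open event in the backend.
    pixel = f'<img src="{base_url}/api/tracking/open/{send_id}" width="1" height="1" alt="">'
    return tracked + pixel
```

Because the original URL travels as a query parameter, the backend can record the click event and then issue a redirect to the article, matching the "Record click event → Redirect to article" steps in the flow diagram.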
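The crawler's AI steps (title translation, summarization) talk to Ollama over its HTTP API. A sketch of the summarization call, using Ollama's standard `/api/generate` endpoint with `stream: false`; the prompt wording, timeout default, and the `http://ollama:11434` service URL are assumptions based on the Docker network layout above.

```python
import json
import urllib.request

# Internal Docker network hostname for the Ollama container (assumed).
OLLAMA_URL = "http://ollama:11434"


def build_summary_payload(article_text: str, model: str = "phi3:latest") -> dict:
    """Build the request body for Ollama's /api/generate endpoint.
    The prompt enforces the ~150-word summary mentioned above."""
    return {
        "model": model,
        "prompt": (
            "Summarize the following news article in about 150 words:\n\n"
            + article_text
        ),
        "stream": False,  # one JSON object instead of a token stream
    }


def summarize(article_text: str, timeout: int = 120) -> str:
    """POST the payload and return the generated summary text."""
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(build_summary_payload(article_text)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With `stream: false`, Ollama returns a single JSON object whose `response` field holds the full generated text, which keeps the crawler's sequential processing loop simple; the configurable timeout matters because CPU-mode generation takes several seconds per article (see Performance).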
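The open-rate and click-through metrics listed under Monitoring reduce to simple ratios over event counts from the `newsletter_sends` and `link_clicks` collections. A sketch of the derivation (counting logic only; the MongoDB queries that produce the raw counts are left out):

```python
def compute_metrics(sends: int, opens: int, clicks: int) -> dict:
    """Derive newsletter engagement rates from raw event counts.
    Rates are 0.0 when nothing was sent, to avoid division by zero."""
    if sends <= 0:
        return {"open_rate": 0.0, "click_through_rate": 0.0}
    return {
        "open_rate": opens / sends,            # opens per delivered email
        "click_through_rate": clicks / sends,  # clicks per delivered email
    }
```

Note that pixel-based open counts undercount in practice (many mail clients block remote images), so these rates are best read as lower bounds.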