System Architecture
Complete system design and architecture documentation.
Overview
Munich News Daily is an automated news aggregation system that crawls Munich news sources, generates AI summaries, and sends daily newsletters.
┌─────────────────────────────────────────────────────────┐
│ Docker Network │
│ (Internal Only) │
├─────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ MongoDB │◄───┤ Backend │◄───┤ Crawler │ │
│ │ (27017) │ │ (5001) │ │ │ │
│ └──────────┘ └────┬─────┘ └──────────┘ │
│ │ │
│ ┌──────────┐ │ ┌──────────┐ │
│ │ Ollama │◄────────┤ │ Sender │ │
│ │ (11434) │ │ │ │ │
│ └──────────┘ │ └──────────┘ │
│ │ │
└───────────────────────┼───────────────────────────────────┘
│
│ Port 5001 (Only exposed port)
▼
Host Machine
External Network
Components
1. MongoDB (Database)
- Purpose: Store articles, subscribers, tracking data
- Port: 27017 (internal only)
- Access: Only via Docker network
- Authentication: Username/password
Collections:
- articles - News articles with summaries
- subscribers - Newsletter subscribers
- rss_feeds - RSS feed sources
- newsletter_sends - Send tracking
- link_clicks - Click tracking
2. Backend API (Flask)
- Purpose: API endpoints, tracking, analytics
- Port: 5001 (exposed to host)
- Access: Public API, admin endpoints
- Features: Tracking pixels, click tracking, admin operations
Key Endpoints:
- /api/admin/* - Admin operations
- /api/subscribe - Subscribe to newsletter
- /api/tracking/* - Tracking endpoints
- /health - Health check
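A minimal Flask sketch of the subscribe and health endpoints. The handler bodies and the in-memory `subscribers` dict are illustrative stand-ins; the real backend writes to the MongoDB `subscribers` collection:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)
subscribers = {}  # illustrative stand-in for the MongoDB `subscribers` collection


@app.route("/health")
def health():
    return jsonify({"status": "ok"})


@app.route("/api/subscribe", methods=["POST"])
def subscribe():
    email = (request.get_json(silent=True) or {}).get("email", "").strip().lower()
    if "@" not in email:
        return jsonify({"error": "invalid email"}), 400
    subscribers[email] = {"status": "active"}
    return jsonify({"subscribed": email}), 201
```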
3. Ollama (AI Service)
- Purpose: AI summarization and translation
- Port: 11434 (internal only)
- Model: phi3:latest (2.2GB)
- GPU: Optional NVIDIA GPU support
Features:
- Article summarization (150 words)
- Title translation (German → English)
- Configurable timeout and model
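The services talk to Ollama over its HTTP API. A sketch of a summarization call, assuming Ollama's standard `/api/generate` endpoint; the internal hostname, prompt wording, and function names are illustrative:

```python
import requests

OLLAMA_URL = "http://ollama:11434/api/generate"  # assumed internal Docker hostname


def build_summary_request(text: str, model: str = "phi3:latest", words: int = 150) -> dict:
    """Payload for Ollama's /api/generate endpoint (non-streaming)."""
    return {
        "model": model,
        "prompt": f"Summarize the following article in about {words} words:\n\n{text}",
        "stream": False,
    }


def summarize(text: str, timeout: float = 120.0) -> str:
    """Send the article text to Ollama and return the generated summary."""
    resp = requests.post(OLLAMA_URL, json=build_summary_request(text), timeout=timeout)
    resp.raise_for_status()
    return resp.json()["response"]
```

The timeout and model name map onto the "configurable timeout and model" feature above.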
4. Crawler (News Fetcher)
- Purpose: Fetch and process news articles
- Schedule: 6:00 AM Berlin time (automated)
- Features: RSS parsing, content extraction, AI processing
Process:
- Fetch RSS feeds
- Extract article content
- Translate title (German → English)
- Generate AI summary
- Store in MongoDB
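The five steps above can be sketched as a small pipeline. The `translate` and `summarize` callables stand in for the Ollama calls, and `entry` mimics a parsed feed item; the real crawler uses feedparser and BeautifulSoup4 for steps 1-2:

```python
from datetime import datetime, timezone


def process_entry(entry: dict, translate, summarize) -> dict:
    """Turn one RSS entry into an article document ready for MongoDB.

    `translate` and `summarize` are injected so the Ollama calls can be
    stubbed out; field names follow the Articles collection schema.
    """
    content = entry.get("content", "")
    return {
        "title": entry["title"],                # original German title
        "title_en": translate(entry["title"]),  # German -> English
        "translated_at": datetime.now(timezone.utc),
        "link": entry["link"],
        "summary": summarize(content),          # AI-generated summary
        "content": content,
        "source": entry.get("source", ""),
        "crawled_at": datetime.now(timezone.utc),
    }
```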
5. Sender (Newsletter)
- Purpose: Send newsletters to subscribers
- Schedule: 7:00 AM Berlin time (automated)
- Features: Email sending, tracking, templating
Process:
- Fetch today's articles
- Generate newsletter HTML
- Add tracking pixels/links
- Send to active subscribers
- Record send events
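Steps 2-3 (HTML generation plus tracking) can be sketched as below. The URL paths and query parameters are illustrative; the real sender renders a Jinja2 template:

```python
def build_newsletter(articles, send_id, base_url="http://localhost:5001"):
    """Render a minimal newsletter body with open/click tracking baked in.

    Every article link is wrapped in a tracked redirect, and a 1x1 tracking
    pixel is appended so email opens hit the backend.
    """
    items = []
    for i, a in enumerate(articles):
        tracked = f"{base_url}/api/tracking/click?send={send_id}&article={i}"
        items.append(f'<p><a href="{tracked}">{a["title_en"]}</a><br>{a["summary"]}</p>')
    pixel = f'<img src="{base_url}/api/tracking/open?send={send_id}" width="1" height="1">'
    return "<html><body>" + "".join(items) + pixel + "</body></html>"
```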
Data Flow
Article Processing Flow
RSS Feed
↓
Crawler fetches
↓
Extract content
↓
Translate title (Ollama)
↓
Generate summary (Ollama)
↓
Store in MongoDB
↓
Newsletter Sender
↓
Email to subscribers
Tracking Flow
Newsletter sent
↓
Tracking pixel embedded
↓
User opens email
↓
Pixel loaded → Backend API
↓
Record open event
↓
User clicks link
↓
Redirect via Backend API
↓
Record click event
↓
Redirect to article
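On the backend side, the two tracking endpoints reduce to "record an event, then respond". A sketch with an in-memory event list standing in for the `newsletter_sends` and `link_clicks` collections (function names are illustrative):

```python
events = []  # illustrative stand-in for the tracking collections


def record_open(send_id: str) -> dict:
    """Record an open event; the real endpoint also serves the 1x1 pixel image."""
    event = {"type": "open", "send": send_id}
    events.append(event)
    return event


def record_click(send_id: str, target_url: str) -> str:
    """Record a click event and return the article URL for the 302 redirect."""
    events.append({"type": "click", "send": send_id, "url": target_url})
    return target_url
```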
Database Schema
Articles Collection
{
_id: ObjectId,
title: String, // Original German title
title_en: String, // English translation
translated_at: Date, // Translation timestamp
link: String,
summary: String, // AI-generated summary
content: String, // Full article text
author: String,
source: String, // RSS feed name
published_at: Date,
crawled_at: Date,
created_at: Date
}
Subscribers Collection
{
_id: ObjectId,
email: String, // Unique
subscribed_at: Date,
status: String // 'active' or 'inactive'
}
RSS Feeds Collection
{
_id: ObjectId,
name: String,
url: String,
active: Boolean,
last_crawled: Date
}
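The schemas above imply a few indexes, although the source does not list them; the specs below are suggestions consistent with the documented access patterns (unique email, article dedup by link, "today's articles" queries). With pymongo they would be applied via `db[coll].create_index(keys, **opts)`:

```python
# Suggested index specs, keyed by collection (sketch only, not applied here).
INDEXES = {
    "articles": [
        ([("link", 1)], {"unique": True}),   # dedupe re-crawled articles
        ([("published_at", -1)], {}),        # "today's articles" query
    ],
    "subscribers": [
        ([("email", 1)], {"unique": True}),  # email is unique per schema
    ],
    "rss_feeds": [
        ([("active", 1)], {}),               # crawler selects active feeds
    ],
}
```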
Security Architecture
Network Isolation
Exposed Services:
- Backend API (port 5001) - Only exposed service
Internal Services:
- MongoDB (port 27017) - Not accessible from host
- Ollama (port 11434) - Not accessible from host
- Crawler - No ports
- Sender - No ports
Benefits:
- Only 1 of 3 service ports (5001 of 5001/27017/11434) exposed, a roughly 66% reduction in attack surface
- Database protected from external access
- AI service protected from abuse
- Defense in depth
Authentication
MongoDB:
- Username/password authentication
- Credentials in environment variables
- Internal network only
Backend API:
- No authentication (add in production)
- Rate limiting recommended
- IP whitelisting recommended
Data Protection
- Subscriber emails stored securely
- No sensitive data in logs
- Environment variables for secrets
- .env file listed in .gitignore
Technology Stack
Backend
- Language: Python 3.11
- Framework: Flask
- Database: MongoDB 7.0
- AI: Ollama (phi3:latest)
Infrastructure
- Containerization: Docker & Docker Compose
- Networking: Docker bridge network
- Storage: Docker volumes
- Scheduling: Python schedule library
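The crawler and sender run at fixed Berlin wall-clock times via the schedule library. A stdlib-only sketch of the underlying "next run" calculation (function name is illustrative; `zoneinfo` handles the DST transitions that make Berlin wall-clock scheduling non-trivial):

```python
from datetime import datetime, timedelta, time
from zoneinfo import ZoneInfo

BERLIN = ZoneInfo("Europe/Berlin")


def next_run(now: datetime, at: time = time(6, 0)) -> datetime:
    """Next occurrence of `at` (Berlin wall-clock time) strictly after `now`."""
    local = now.astimezone(BERLIN)
    candidate = local.replace(hour=at.hour, minute=at.minute, second=0, microsecond=0)
    if candidate <= local:
        candidate += timedelta(days=1)  # wall-clock add: stays at 06:00 across DST
    return candidate
```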
Libraries
- Web: Flask, Flask-CORS
- Database: pymongo
- Email: smtplib, email.mime
- Scraping: requests, BeautifulSoup4, feedparser
- Templating: Jinja2
- AI: requests (Ollama API)
Deployment Architecture
Development
Local Machine
├── Docker Compose
│ ├── MongoDB (internal)
│ ├── Ollama (internal)
│ ├── Backend (exposed)
│ ├── Crawler (internal)
│ └── Sender (internal)
└── .env file
Production
Server
├── Reverse Proxy (nginx/Traefik)
│ ├── SSL/TLS
│ ├── Rate limiting
│ └── Authentication
├── Docker Compose
│ ├── MongoDB (internal)
│ ├── Ollama (internal, GPU)
│ ├── Backend (internal)
│ ├── Crawler (internal)
│ └── Sender (internal)
├── Monitoring
│ ├── Logs
│ ├── Metrics
│ └── Alerts
└── Backups
├── MongoDB dumps
└── Configuration
Scalability
Current Limits
- Single server deployment
- Sequential article processing
- Single MongoDB instance
- No load balancing
Scaling Options
Horizontal Scaling:
- Multiple crawler instances
- Load-balanced backend
- MongoDB replica set
- Distributed Ollama
Vertical Scaling:
- More CPU cores
- More RAM
- GPU acceleration (5-10x faster)
- Faster storage
Optimization:
- Batch processing
- Caching
- Database indexing
- Connection pooling
Monitoring
Health Checks
# Backend health
curl http://localhost:5001/health
# MongoDB health
docker-compose exec mongodb mongosh --eval "db.adminCommand('ping')"
# Ollama health
docker-compose exec ollama ollama list
Metrics
- Article count
- Subscriber count
- Newsletter open rate
- Click-through rate
- Processing time
- Error rate
Logs
# All services
docker-compose logs -f
# Specific service
docker-compose logs -f crawler
docker-compose logs -f backend
Backup & Recovery
MongoDB Backup
# Backup
docker-compose exec mongodb mongodump --out /backup
# Restore
docker-compose exec mongodb mongorestore /backup
Configuration Backup
- backend/.env - Environment variables
- docker-compose.yml - Service configuration
- RSS feeds in MongoDB
Recovery Plan
- Restore MongoDB from backup
- Restore configuration files
- Restart services
- Verify functionality
Performance
CPU Mode
- Translation: ~1.5s per title
- Summarization: ~8s per article
- 10 articles: ~115s total
- Suitable for <20 articles/day
GPU Mode (5-10x faster)
- Translation: ~0.3s per title
- Summarization: ~2s per article
- 10 articles: ~31s total
- Suitable for high-volume processing
Resource Usage
CPU Mode:
- CPU: 60-80%
- RAM: 4-6GB
- Disk: ~3GB (including the 2.2GB model)
GPU Mode:
- CPU: 10-20%
- RAM: 2-3GB
- GPU: 80-100%
- VRAM: 3-4GB
- Disk: ~3GB (including the 2.2GB model)
Future Enhancements
Planned Features
- Frontend dashboard
- Real-time analytics
- Multiple languages
- Custom RSS feeds per subscriber
- A/B testing for newsletters
- Advanced tracking
Technical Improvements
- Kubernetes deployment
- Microservices architecture
- Message queue (RabbitMQ/Redis)
- Caching layer (Redis)
- CDN for assets
- Advanced monitoring (Prometheus/Grafana)
See SETUP.md for deployment guide and SECURITY.md for security best practices.