2025-11-12 11:34:33 +01:00
parent f35f8eef8a
commit 94c89589af
32 changed files with 3272 additions and 3805 deletions


# System Architecture
Complete system design and architecture documentation.
---
## Overview
Munich News Daily is a fully automated news aggregation and newsletter system: it crawls Munich news sources, generates AI summaries, and sends a daily newsletter to subscribers.
```
Munich News Daily System
────────────────────────

6:00 AM Berlin  →  News Crawler
                     • Fetches RSS feeds
                     • Extracts full content
                     • Generates AI summaries
                     • Saves to MongoDB

7:00 AM Berlin  →  Newsletter Sender
                     • Waits for crawler
                     • Fetches articles
                     • Generates newsletter
                     • Sends to subscribers
                     ✅ Done!

Docker Network (internal only)
──────────────────────────────

  ┌──────────┐      ┌──────────┐      ┌──────────┐
  │ MongoDB  │◄─────┤ Backend  │◄─────┤ Crawler  │
  │ (27017)  │      │  (5001)  │      └──────────┘
  └──────────┘      └────┬─────┘
                         │
  ┌──────────┐           │            ┌──────────┐
  │ Ollama   │◄──────────┤            │ Sender   │
  │ (11434)  │           │            └──────────┘
  └──────────┘           │
                         │ Port 5001 (only exposed port)
                    Host Machine
                         │
                  External Network
```
---
## Components
### 1. MongoDB (Database)
- **Purpose**: Central data storage for articles, subscribers, and tracking data
- **Port**: 27017 (internal only)
- **Access**: Only via the Docker network
- **Authentication**: Username/password

**Collections** (indexing sketched below):
- `articles` - News articles with summaries
- `subscribers` - Newsletter subscribers
- `rss_feeds` - RSS feed sources
- `newsletter_sends` - Send tracking
- `link_clicks` - Click tracking
- `subscriber_activity` - Engagement metrics
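
A minimal sketch of how these collections might be indexed with pymongo. The connection string, database name, and index choices are illustrative assumptions based on the schema described later in this document, not the project's actual setup code.

```python
import os

from pymongo import ASCENDING, DESCENDING, MongoClient

# Hypothetical connection built from environment variables
# (MONGO_USER / MONGO_PASSWORD are assumed names).
client = MongoClient(
    f"mongodb://{os.environ['MONGO_USER']}:{os.environ['MONGO_PASSWORD']}@mongodb:27017/"
)
db = client["munich_news"]  # assumed database name

# Deduplicate articles by link and speed up "today's articles" queries.
db.articles.create_index([("link", ASCENDING)], unique=True)
db.articles.create_index([("published_at", DESCENDING)])

# One subscription per email address (also backs the /api/subscribe duplicate check).
db.subscribers.create_index([("email", ASCENDING)], unique=True)
```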
### 2. Backend API (Flask)
- **Purpose**: API endpoints, tracking, analytics
- **Port**: 5001 (exposed to host)
- **Access**: Public API, admin endpoints
- **Features**: Tracking pixels, click tracking, admin operations
**Key Endpoints** (tracking endpoints sketched below):
- `/api/subscribe` - Subscribe to newsletter
- `/api/track/pixel/<id>` - Email open tracking
- `/api/track/click/<id>` - Link click tracking
- `/api/analytics/*` - Engagement metrics
- `/api/tracking/*` - Privacy controls
- `/api/admin/*` - Admin operations
- `/health` - Health check
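
A hedged sketch of the two tracking endpoints in Flask. The routes match the list above; the collection fields, the MongoDB URI, and the embedded 1x1 GIF constant are illustrative assumptions rather than the project's actual implementation.

```python
from datetime import datetime, timezone

from flask import Flask, Response, redirect
from pymongo import MongoClient

app = Flask(__name__)
db = MongoClient("mongodb://mongodb:27017/")["munich_news"]  # assumed URI / db name

# 1x1 transparent GIF served as the open-tracking pixel.
PIXEL = (
    b"GIF89a\x01\x00\x01\x00\x80\x00\x00\x00\x00\x00\xff\xff\xff"
    b"!\xf9\x04\x01\x00\x00\x00\x00,\x00\x00\x00\x00\x01\x00\x01\x00"
    b"\x00\x02\x02D\x01\x00;"
)


@app.route("/api/track/pixel/<send_id>")
def track_open(send_id):
    # Record an open event for this send, then serve the pixel.
    db.subscriber_activity.insert_one(
        {"send_id": send_id, "event": "open", "at": datetime.now(timezone.utc)}
    )
    return Response(PIXEL, mimetype="image/gif")


@app.route("/api/track/click/<click_id>")
def track_click(click_id):
    # Look up the original URL recorded for this click id, count the click, redirect.
    doc = db.link_clicks.find_one({"click_id": click_id})
    if doc is None:
        return "Not found", 404
    db.link_clicks.update_one({"click_id": click_id}, {"$inc": {"clicks": 1}})
    return redirect(doc["url"])
```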
### 3. Ollama (AI Service)
- **Purpose**: AI summarization and translation
- **Port**: 11434 (internal only)
- **Model**: phi3:latest (2.2GB)
- **GPU**: Optional NVIDIA GPU support
**Features:**
- Article summarization (150 words)
- Title translation (German → English)
- Configurable timeout and model
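
A minimal sketch of how a component might call Ollama's HTTP generate API for the ~150-word summaries; the internal hostname and model come from this document, while the prompt wording and timeout value are assumptions.

```python
import requests

OLLAMA_URL = "http://ollama:11434/api/generate"  # internal Docker hostname


def summarize(text: str, model: str = "phi3:latest", timeout: int = 120) -> str:
    """Ask Ollama for a roughly 150-word English summary of an article."""
    prompt = (
        "Summarize the following Munich news article in about 150 words "
        "of plain English:\n\n" + text
    )
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=timeout,
    )
    resp.raise_for_status()
    return resp.json()["response"].strip()
```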
### 4. Crawler (News Fetcher)
- **Purpose**: Fetch and process news articles
- **Schedule**: Daily at 6:00 AM Berlin time (automated)
- **Technology**: Python, feedparser, BeautifulSoup, Ollama
- **Features**: RSS parsing, content extraction, AI processing
**Process:**
1. Fetch RSS feeds
2. Extract article content
3. Translate title (German → English)
4. Generate AI summary
5. Store in MongoDB
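
A compressed sketch of these five steps using the libraries listed under Technology Stack (feedparser, requests, BeautifulSoup, pymongo). The `translate_title` and `summarize` stubs stand in for the Ollama calls sketched above and are illustrative, not the project's actual code.

```python
from datetime import datetime, timezone

import feedparser
import requests
from bs4 import BeautifulSoup
from pymongo import MongoClient

db = MongoClient("mongodb://mongodb:27017/")["munich_news"]  # assumed URI / db name


def translate_title(title: str) -> str:   # placeholder: real version calls Ollama
    return title


def summarize(text: str) -> str:          # placeholder: real version calls Ollama
    return text[:500]


def crawl_feed(feed: dict) -> None:
    """Fetch one RSS feed and store newly seen articles in MongoDB."""
    for entry in feedparser.parse(feed["url"]).entries:
        if db.articles.find_one({"link": entry.link}):
            continue  # already crawled

        # Step 2: extract the full article text from the page.
        html = requests.get(entry.link, timeout=30).text
        paragraphs = BeautifulSoup(html, "html.parser").find_all("p")
        content = "\n".join(p.get_text(strip=True) for p in paragraphs)

        # Steps 3-5: translate, summarize, store.
        db.articles.insert_one({
            "title": entry.title,
            "title_en": translate_title(entry.title),
            "link": entry.link,
            "summary": summarize(content),
            "content": content,
            "source": feed["name"],
            "crawled_at": datetime.now(timezone.utc),
        })
```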
### 5. Sender (Newsletter)
- **Purpose**: Send newsletters to subscribers
- **Schedule**: Daily at 7:00 AM Berlin time (automated); waits up to 30 minutes for the crawler to finish
- **Technology**: Python, Jinja2, SMTP
- **Features**: Email sending, tracking, templating
**Process:**
1. Fetch today's articles
2. Generate newsletter HTML
3. Add tracking pixels/links
4. Send to active subscribers
5. Record send events
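
A rough sketch of the send loop with Jinja2 and smtplib. The template name, SMTP environment variables, and tracking-pixel URL pattern are assumptions layered on the components described above.

```python
import os
import smtplib
from datetime import datetime, timezone
from email.mime.text import MIMEText

from jinja2 import Environment, FileSystemLoader
from pymongo import MongoClient

db = MongoClient("mongodb://mongodb:27017/")["munich_news"]   # assumed URI / db name
env = Environment(loader=FileSystemLoader("templates"))       # assumed template dir


def send_newsletter(articles: list) -> None:
    template = env.get_template("newsletter.html")            # assumed template name
    with smtplib.SMTP(os.environ["SMTP_HOST"], 587) as smtp:  # assumed env var names
        smtp.starttls()
        smtp.login(os.environ["SMTP_USER"], os.environ["SMTP_PASSWORD"])
        for sub in db.subscribers.find({"status": "active"}):
            # Record the send event first so the tracking pixel can reference it.
            send = db.newsletter_sends.insert_one(
                {"email": sub["email"], "sent_at": datetime.now(timezone.utc)}
            )
            html = template.render(
                articles=articles,
                # Hypothetical public base URL for the tracking pixel.
                pixel_url=f"https://news.example.org/api/track/pixel/{send.inserted_id}",
            )
            msg = MIMEText(html, "html")
            msg["Subject"] = "Munich News Daily"
            msg["From"] = os.environ["SMTP_USER"]
            msg["To"] = sub["email"]
            smtp.send_message(msg)
```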
---
## Data Flow
### Article Processing Flow
```
RSS Feeds → Crawler → MongoDB → Sender → Subscribers
                                              │ (opens / clicks)
                                              ▼
                                         Backend API
                                              │
                                              ▼
                                          Analytics
```

Step by step:

```
RSS feed
  → Crawler fetches entries
  → Extract full content
  → Translate title (Ollama)
  → Generate summary (Ollama)
  → Store in MongoDB
  → Newsletter Sender
  → Email to subscribers
```
### Tracking Flow

```
Newsletter sent (tracking pixel embedded)
        │
        ▼
User opens email → pixel loaded → Backend API → record open event
        │
        ▼
User clicks link → redirect via Backend API → record click event
        │
        ▼
Redirect to the original article
```

---

## Coordination

The sender waits for the crawler to ensure fresh content (a minimal sketch follows the list):

1. Sender starts at 7:00 AM
2. Checks for recent articles every 30 seconds
3. Maximum wait time: 30 minutes
4. Proceeds once the crawler finishes or the timeout is reached
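
The sketch below assumes the "recent articles" check is a query on `crawled_at`; the poll interval and cap come from the list above, the connection details are assumptions.

```python
import time
from datetime import datetime, timedelta, timezone

from pymongo import MongoClient

db = MongoClient("mongodb://mongodb:27017/")["munich_news"]  # assumed URI / db name


def wait_for_crawler(max_wait: int = 30 * 60, poll_interval: int = 30) -> bool:
    """Block until fresh articles appear, or give up after max_wait seconds."""
    deadline = time.monotonic() + max_wait
    cutoff = datetime.now(timezone.utc) - timedelta(hours=6)  # "crawled this morning"
    while time.monotonic() < deadline:
        if db.articles.count_documents({"crawled_at": {"$gte": cutoff}}) > 0:
            return True   # crawler has produced today's articles
        time.sleep(poll_interval)
    return False          # timed out; proceed with whatever is available
```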
---
## Database Schema
### Articles Collection
```javascript
{
_id: ObjectId,
title: String, // Original German title
title_en: String, // English translation
translated_at: Date, // Translation timestamp
link: String,
summary: String, // AI-generated summary
content: String, // Full article text
author: String,
source: String, // RSS feed name
published_at: Date,
crawled_at: Date,
created_at: Date
}
```
### Subscribers Collection
```javascript
{
_id: ObjectId,
email: String, // Unique
subscribed_at: Date,
status: String // 'active' or 'inactive'
}
```
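
A hedged sketch of how the `/api/subscribe` endpoint could create a document with exactly these fields; validation is reduced to a trivial check, and the duplicate handling relies on a unique index on `email` (an assumption noted in the indexing sketch earlier).

```python
from datetime import datetime, timezone

from flask import Flask, jsonify, request
from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError

app = Flask(__name__)
db = MongoClient("mongodb://mongodb:27017/")["munich_news"]  # assumed URI / db name


@app.route("/api/subscribe", methods=["POST"])
def subscribe():
    email = (request.get_json(silent=True) or {}).get("email", "").strip().lower()
    if "@" not in email:
        return jsonify({"error": "invalid email"}), 400
    try:
        db.subscribers.insert_one({
            "email": email,
            "subscribed_at": datetime.now(timezone.utc),
            "status": "active",
        })
    except DuplicateKeyError:
        return jsonify({"error": "already subscribed"}), 409
    return jsonify({"status": "subscribed"}), 201
```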
### RSS Feeds Collection
```javascript
{
_id: ObjectId,
name: String,
url: String,
active: Boolean,
last_crawled: Date
}
```
---
## Security Architecture
### Network Isolation
**Exposed Services:**
- Backend API (port 5001) - Only exposed service
**Internal Services:**
- MongoDB (port 27017) - Not accessible from host
- Ollama (port 11434) - Not accessible from host
- Crawler - No ports
- Sender - No ports
**Benefits:**
- Only one of the three service ports is reachable from outside, sharply reducing the attack surface
- Database protected from external access
- AI service protected from abuse
- Defense in depth
### Authentication
**MongoDB:**
- Username/password authentication
- Credentials in environment variables
- Internal network only
**Backend API:**
- No authentication (add in production)
- Rate limiting recommended
- IP whitelisting recommended
### Data Protection
- Subscriber emails stored securely
- No sensitive data in logs
- Environment variables for secrets (see the sketch below)
- `.env` file in `.gitignore`
- HTTPS for tracking URLs (production)
- GDPR-compliant data retention
- Privacy controls (opt-out, deletion)
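
A small sketch of keeping secrets out of the code by reading them from the environment (populated from the `.env` file by Docker Compose); the variable names are assumptions, not the project's actual configuration keys.

```python
import os


def mongo_uri() -> str:
    """Build the MongoDB connection string from environment variables only."""
    user = os.environ["MONGO_USER"]          # assumed variable names; the real
    password = os.environ["MONGO_PASSWORD"]  # ones live in backend/.env
    host = os.environ.get("MONGO_HOST", "mongodb")
    return f"mongodb://{user}:{password}@{host}:27017/"
```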
---
## Technology Stack
### Backend
- **Language**: Python 3.11
- **Framework**: Flask
- **Database**: MongoDB 7.0
- **AI**: Ollama (phi3:latest)
- **Scheduling**: Python `schedule` library
- **Email**: SMTP with HTML templates
- **Tracking**: Pixel tracking + redirect URLs

### Infrastructure
- **Containerization**: Docker & Docker Compose
- **Networking**: Docker bridge network
- **Storage**: Docker volumes

### Libraries
- **Web**: Flask, Flask-CORS
- **Database**: pymongo
- **Email**: smtplib, email.mime
- **Scraping**: requests, BeautifulSoup4, feedparser
- **Templating**: Jinja2
- **AI**: requests (Ollama API)
---
## Deployment Architecture
All components run in Docker containers.

### Development

Start the stack with `docker-compose up -d`:

```
Local Machine
├── Docker Compose
│   ├── MongoDB (internal)
│   ├── Ollama (internal)
│   ├── Backend (exposed)
│   ├── Crawler (internal)
│   └── Sender (internal)
└── .env file
```

Containers include:
- `munich-news-mongodb` - Database
- `munich-news-crawler` - Crawler service
- `munich-news-sender` - Sender service
### Production
```
Server
├── Reverse Proxy (nginx/Traefik)
│ ├── SSL/TLS
│ ├── Rate limiting
│ └── Authentication
├── Docker Compose
│ ├── MongoDB (internal)
│ ├── Ollama (internal, GPU)
│ ├── Backend (internal)
│ ├── Crawler (internal)
│ └── Sender (internal)
├── Monitoring
│ ├── Logs
│ ├── Metrics
│ └── Alerts
└── Backups
├── MongoDB dumps
└── Configuration
```
---
## Scalability
### Current Limits
- Single server deployment
- Sequential article processing
- Single MongoDB instance
- No load balancing
### Scaling Options
**Horizontal Scaling:**
- Multiple crawler instances
- Load-balanced backend
- MongoDB replica set or sharding
- Distributed Ollama
**Vertical Scaling:**
- More CPU cores
- More RAM
- GPU acceleration (5-10x faster)
- Faster storage
**Optimization:**
- Batch processing
- Caching (e.g. Redis for API responses)
- Database indexing
- Connection pooling
---
## Monitoring
### Health Checks
```bash
# Backend health
curl http://localhost:5001/health
# MongoDB health
docker-compose exec mongodb mongosh --eval "db.adminCommand('ping')"
# Ollama health
docker-compose exec ollama ollama list
```
### Metrics
- Article count
- Subscriber count
- Newsletter open rate
- Click-through rate
- Processing time
- Error rate
### Logs
```bash
# All services
docker-compose logs -f
# Specific service
docker-compose logs -f crawler
docker-compose logs -f backend
```
---
## Backup & Recovery
### MongoDB Backup
```bash
# Backup
docker-compose exec mongodb mongodump --out /backup
# Restore
docker-compose exec mongodb mongorestore /backup
```
### Configuration Backup
- `backend/.env` - Environment variables
- `docker-compose.yml` - Service configuration
- RSS feeds in MongoDB
### Recovery Plan
1. Restore MongoDB from backup
2. Restore configuration files
3. Restart services
4. Verify functionality
---
## Performance
### CPU Mode
- Translation: ~1.5s per title
- Summarization: ~8s per article
- 10 articles: ~115s total
- Suitable for <20 articles/day
### GPU Mode (5-10x faster)
- Translation: ~0.3s per title
- Summarization: ~2s per article
- 10 articles: ~31s total
- Suitable for high-volume processing
### Resource Usage
**CPU Mode:**
- CPU: 60-80%
- RAM: 4-6GB
- Disk: ~1GB (with model)
**GPU Mode:**
- CPU: 10-20%
- RAM: 2-3GB
- GPU: 80-100%
- VRAM: 3-4GB
- Disk: ~1GB (with model)
---
## Future Enhancements
### Planned Features
- Frontend dashboard
- Real-time analytics
- Multiple languages
- Custom RSS feeds per subscriber
- A/B testing for newsletters
- Advanced tracking
### Technical Improvements
- Kubernetes deployment
- Microservices architecture
- Message queue (RabbitMQ/Redis)
- Caching layer (Redis)
- CDN for assets
- Advanced monitoring (Prometheus/Grafana)
---
See [SETUP.md](SETUP.md) for the deployment guide and [SECURITY.md](SECURITY.md) for security best practices.