update
@@ -1,131 +1,439 @@
# System Architecture

Complete system design and architecture documentation.

---

## Overview

Munich News Daily is a fully automated news aggregation and newsletter system with the following components:
Munich News Daily is an automated news aggregation system that crawls Munich news sources, generates AI summaries, and sends daily newsletters.

```
┌─────────────────────────────────────────────────────────────────┐
│                    Munich News Daily System                     │
└─────────────────────────────────────────────────────────────────┘

6:00 AM Berlin → News Crawler
                 ↓
                 Fetches RSS feeds
                 Extracts full content
                 Generates AI summaries
                 Saves to MongoDB
                 ↓
7:00 AM Berlin → Newsletter Sender
                 ↓
                 Waits for crawler
                 Fetches articles
                 Generates newsletter
                 Sends to subscribers
                 ↓
                 ✅ Done!
┌─────────────────────────────────────────────────────────┐
│                     Docker Network                      │
│                     (Internal Only)                     │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐           │
│  │ MongoDB  │◄───┤ Backend  │◄───┤ Crawler  │           │
│  │ (27017)  │    │ (5001)   │    │          │           │
│  └──────────┘    └────┬─────┘    └──────────┘           │
│                       │                                 │
│  ┌──────────┐         │          ┌──────────┐           │
│  │ Ollama   │◄────────┤          │ Sender   │           │
│  │ (11434)  │         │          │          │           │
│  └──────────┘         │          └──────────┘           │
│                       │                                 │
└───────────────────────┼─────────────────────────────────┘
                        │
                        │ Port 5001 (Only exposed port)
                        ▼
                   Host Machine
                 External Network
```

---

## Components

### 1. MongoDB Database
- **Purpose**: Central data storage
- **Collections**:
  - `articles`: News articles with summaries
  - `subscribers`: Email subscribers
  - `rss_feeds`: RSS feed sources
  - `newsletter_sends`: Email tracking data
  - `link_clicks`: Link click tracking
  - `subscriber_activity`: Engagement metrics
### 1. MongoDB (Database)
- **Purpose**: Store articles, subscribers, tracking data
- **Port**: 27017 (internal only)
- **Access**: Only via Docker network
- **Authentication**: Username/password

### 2. News Crawler
- **Schedule**: Daily at 6:00 AM Berlin time
- **Functions**:
  - Fetches articles from RSS feeds
  - Extracts full article content
  - Generates AI summaries using Ollama
  - Saves to MongoDB
- **Technology**: Python, BeautifulSoup, Ollama

**Collections:**
- `articles` - News articles with summaries
- `subscribers` - Newsletter subscribers
- `rss_feeds` - RSS feed sources
- `newsletter_sends` - Send tracking
- `link_clicks` - Click tracking

### 3. Newsletter Sender
- **Schedule**: Daily at 7:00 AM Berlin time
- **Functions**:
  - Waits for crawler to finish (max 30 min)
  - Fetches today's articles
  - Generates HTML newsletter
  - Injects tracking pixels
  - Sends to all subscribers
- **Technology**: Python, Jinja2, SMTP
### 2. Backend API (Flask)
- **Purpose**: API endpoints, tracking, analytics
- **Port**: 5001 (exposed to host)
- **Access**: Public API, admin endpoints
- **Features**: Tracking pixels, click tracking, admin operations

### 4. Backend API (Optional)
- **Purpose**: Tracking and analytics
- **Endpoints**:
  - `/api/track/pixel/<id>` - Email open tracking
  - `/api/track/click/<id>` - Link click tracking
  - `/api/analytics/*` - Engagement metrics
  - `/api/tracking/*` - Privacy controls
- **Technology**: Flask, Python

**Key Endpoints:**
- `/api/admin/*` - Admin operations
- `/api/subscribe` - Subscribe to newsletter
- `/api/tracking/*` - Tracking endpoints
- `/health` - Health check

### 3. Ollama (AI Service)
- **Purpose**: AI summarization and translation
- **Port**: 11434 (internal only)
- **Model**: phi3:latest (2.2GB)
- **GPU**: Optional NVIDIA GPU support

**Features:**
- Article summarization (150 words)
- Title translation (German → English)
- Configurable timeout and model

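As a rough illustration of the summarization feature, the sketch below builds a non-streaming request for Ollama's `/api/generate` endpoint. The hostname `ollama`, the prompt wording, and the function names are assumptions made for this example, not taken from the project; it uses only the standard library so it runs without extra packages, whereas the project itself lists `requests` for Ollama calls.

```python
import json
import urllib.request

# Assumed service hostname on the internal Docker network.
OLLAMA_URL = "http://ollama:11434/api/generate"


def build_summary_payload(article_text, model="phi3:latest", max_words=150):
    """Build the JSON body for a non-streaming Ollama generate request."""
    prompt = (
        f"Summarize the following news article in at most {max_words} words:\n\n"
        f"{article_text}"
    )
    return {"model": model, "prompt": prompt, "stream": False}


def summarize(article_text, timeout=120):
    """Send the request and return the generated summary text."""
    body = json.dumps(build_summary_payload(article_text)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"].strip()
```

Keeping the payload builder separate from the network call makes the model name and word limit easy to change and test without a running Ollama container.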
### 4. Crawler (News Fetcher)
- **Purpose**: Fetch and process news articles
- **Schedule**: 6:00 AM Berlin time (automated)
- **Features**: RSS parsing, content extraction, AI processing

**Process:**
1. Fetch RSS feeds
2. Extract article content
3. Translate title (German → English)
4. Generate AI summary
5. Store in MongoDB

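Steps 2 to 5 of the process above can be sketched for a single RSS entry as follows. This is a minimal illustration, not the project's crawler: `process_entry` and its stage arguments (`extract`, `translate`, `summarize`, `store`) are hypothetical names, and each stage is passed in as a function so the sketch stays independent of any particular scraping or database library. The field names follow the Articles collection schema.

```python
from datetime import datetime, timezone


def process_entry(entry, source, extract, translate, summarize, store):
    """Run one RSS entry through extraction, translation, summary, and storage."""
    now = datetime.now(timezone.utc)
    content = extract(entry["link"])              # step 2: full article text
    document = {
        "title": entry["title"],                  # original German title
        "title_en": translate(entry["title"]),    # step 3: English translation
        "translated_at": now,
        "link": entry["link"],
        "content": content,
        "summary": summarize(content),            # step 4: AI summary
        "source": source,                         # RSS feed name
        "crawled_at": now,
    }
    store(document)                               # step 5: e.g. an insert into MongoDB
    return document
```

Because articles are processed sequentially, a failure in one stage would stop the batch unless the caller wraps each `process_entry` call in its own error handling.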
### 5. Sender (Newsletter)
- **Purpose**: Send newsletters to subscribers
- **Schedule**: 7:00 AM Berlin time (automated)
- **Features**: Email sending, tracking, templating

**Process:**
1. Fetch today's articles
2. Generate newsletter HTML
3. Add tracking pixels/links
4. Send to active subscribers
5. Record send events

---

## Data Flow

### Article Processing Flow

```
RSS Feeds → Crawler → MongoDB → Sender → Subscribers
                                   ↓
                              Backend API
                                   ↓
                               Analytics
RSS Feed
    ↓
Crawler fetches
    ↓
Extract content
    ↓
Translate title (Ollama)
    ↓
Generate summary (Ollama)
    ↓
Store in MongoDB
    ↓
Newsletter Sender
    ↓
Email to subscribers
```

## Coordination
### Tracking Flow

The sender waits for the crawler to ensure fresh content:
```
Newsletter sent
    ↓
Tracking pixel embedded
    ↓
User opens email
    ↓
Pixel loaded → Backend API
    ↓
Record open event
    ↓
User clicks link
    ↓
Redirect via Backend API
    ↓
Record click event
    ↓
Redirect to article
```

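The tracking flow above depends on the newsletter HTML being rewritten before sending. The sketch below shows one way that could look: links are routed through the click endpoint and a 1x1 pixel is appended. The `/api/track/click/<id>` and `/api/track/pixel/<id>` paths appear in this document; the `url` query parameter, the function name, and the single shared `send_id` are assumptions for illustration only (the real backend may instead look up the target URL by ID).

```python
import re
from urllib.parse import quote


def add_tracking(html, base_url, send_id):
    """Route links through the click endpoint and append an open-tracking pixel."""

    def rewrite(match):
        # Percent-encode the original target so it survives as a query value.
        target = quote(match.group(1), safe="")
        return f'href="{base_url}/api/track/click/{send_id}?url={target}"'

    tracked = re.sub(r'href="(https?://[^"]+)"', rewrite, html)
    pixel = (
        f'<img src="{base_url}/api/track/pixel/{send_id}" '
        'width="1" height="1" alt="">'
    )
    return tracked + pixel
```

If the backend accepts a target URL from the query string, it should validate it against the stored articles before redirecting, otherwise the endpoint becomes an open redirect.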
1. Sender starts at 7:00 AM
2. Checks for recent articles every 30 seconds
3. Maximum wait time: 30 minutes
4. Proceeds once the crawler finishes or the timeout expires

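The waiting steps listed here amount to a polling loop with a deadline. A minimal sketch, assuming a hypothetical `has_fresh_articles` callable that reports whether today's articles are already in MongoDB; the `sleep` and `clock` parameters exist only so the loop can be exercised without real delays.

```python
import time


def wait_for_crawler(has_fresh_articles, interval=30, max_wait=30 * 60,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll until has_fresh_articles() is true or max_wait seconds elapse.

    Returns True if articles appeared and False on timeout. In both cases
    the caller proceeds with sending.
    """
    deadline = clock() + max_wait
    while True:
        if has_fresh_articles():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

Using a monotonic clock for the deadline keeps the 30-minute limit unaffected by system clock adjustments such as daylight saving changes.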
---

## Database Schema

### Articles Collection

```javascript
{
  _id: ObjectId,
  title: String,           // Original German title
  title_en: String,        // English translation
  translated_at: Date,     // Translation timestamp
  link: String,
  summary: String,         // AI-generated summary
  content: String,         // Full article text
  author: String,
  source: String,          // RSS feed name
  published_at: Date,
  crawled_at: Date,
  created_at: Date
}
```

### Subscribers Collection

```javascript
{
  _id: ObjectId,
  email: String,           // Unique
  subscribed_at: Date,
  status: String           // 'active' or 'inactive'
}
```

### RSS Feeds Collection

```javascript
{
  _id: ObjectId,
  name: String,
  url: String,
  active: Boolean,
  last_crawled: Date
}
```

---

## Security Architecture

### Network Isolation

**Exposed Services:**
- Backend API (port 5001) - Only exposed service

**Internal Services:**
- MongoDB (port 27017) - Not accessible from host
- Ollama (port 11434) - Not accessible from host
- Crawler - No ports
- Sender - No ports

**Benefits:**
- 66% reduction in attack surface (one exposed port instead of three)
- Database protected from external access
- AI service protected from abuse
- Defense in depth

### Authentication

**MongoDB:**
- Username/password authentication
- Credentials in environment variables
- Internal network only

**Backend API:**
- No authentication (add in production)
- Rate limiting recommended
- IP whitelisting recommended

### Data Protection

- Subscriber emails stored securely
- No sensitive data in logs
- Environment variables for secrets
- `.env` file in `.gitignore`

---

## Technology Stack

- **Backend**: Python 3.11
### Backend
- **Language**: Python 3.11
- **Framework**: Flask
- **Database**: MongoDB 7.0
- **AI**: Ollama (Phi3 model)
- **AI**: Ollama (phi3:latest)

### Infrastructure
- **Containerization**: Docker & Docker Compose
- **Networking**: Docker bridge network
- **Storage**: Docker volumes
- **Scheduling**: Python schedule library
- **Email**: SMTP with HTML templates
- **Tracking**: Pixel tracking + redirect URLs
- **Infrastructure**: Docker & Docker Compose

## Deployment
### Libraries
- **Web**: Flask, Flask-CORS
- **Database**: pymongo
- **Email**: smtplib, email.mime
- **Scraping**: requests, BeautifulSoup4, feedparser
- **Templating**: Jinja2
- **AI**: requests (Ollama API)

All components run in Docker containers:

---

## Deployment Architecture

### Development
```
docker-compose up -d
Local Machine
├── Docker Compose
│   ├── MongoDB (internal)
│   ├── Ollama (internal)
│   ├── Backend (exposed)
│   ├── Crawler (internal)
│   └── Sender (internal)
└── .env file
```

Containers:
- `munich-news-mongodb` - Database
- `munich-news-crawler` - Crawler service
- `munich-news-sender` - Sender service
### Production
```
Server
├── Reverse Proxy (nginx/Traefik)
│   ├── SSL/TLS
│   ├── Rate limiting
│   └── Authentication
├── Docker Compose
│   ├── MongoDB (internal)
│   ├── Ollama (internal, GPU)
│   ├── Backend (internal)
│   ├── Crawler (internal)
│   └── Sender (internal)
├── Monitoring
│   ├── Logs
│   ├── Metrics
│   └── Alerts
└── Backups
    ├── MongoDB dumps
    └── Configuration
```

## Security

- MongoDB authentication enabled
- Environment variables for secrets
- HTTPS for tracking URLs (production)
- GDPR-compliant data retention
- Privacy controls (opt-out, deletion)

## Monitoring

- Docker logs for all services
- MongoDB for data verification
- Health checks on containers
- Engagement metrics via API

---

## Scalability

- Horizontal: Add more crawler instances
- Vertical: Increase container resources
- Database: MongoDB sharding if needed
- Caching: Redis for API responses (future)
### Current Limits
- Single server deployment
- Sequential article processing
- Single MongoDB instance
- No load balancing

### Scaling Options

**Horizontal Scaling:**
- Multiple crawler instances
- Load-balanced backend
- MongoDB replica set
- Distributed Ollama

**Vertical Scaling:**
- More CPU cores
- More RAM
- GPU acceleration (5-10x faster)
- Faster storage

**Optimization:**
- Batch processing
- Caching
- Database indexing
- Connection pooling

---

## Monitoring

### Health Checks

```bash
# Backend health
curl http://localhost:5001/health

# MongoDB health
docker-compose exec mongodb mongosh --eval "db.adminCommand('ping')"

# Ollama health
docker-compose exec ollama ollama list
```

### Metrics

- Article count
- Subscriber count
- Newsletter open rate
- Click-through rate
- Processing time
- Error rate

### Logs

```bash
# All services
docker-compose logs -f

# Specific service
docker-compose logs -f crawler
docker-compose logs -f backend
```

---

## Backup & Recovery

### MongoDB Backup

```bash
# Backup
docker-compose exec mongodb mongodump --out /backup

# Restore
docker-compose exec mongodb mongorestore /backup
```

### Configuration Backup

- `backend/.env` - Environment variables
- `docker-compose.yml` - Service configuration
- RSS feeds in MongoDB

### Recovery Plan

1. Restore MongoDB from backup
2. Restore configuration files
3. Restart services
4. Verify functionality

---

## Performance

### CPU Mode
- Translation: ~1.5s per title
- Summarization: ~8s per article
- 10 articles: ~115s total
- Suitable for <20 articles/day

### GPU Mode (5-10x faster)
- Translation: ~0.3s per title
- Summarization: ~2s per article
- 10 articles: ~31s total
- Suitable for high-volume processing

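The per-article figures above do not add up to the quoted batch totals on their own, so a short worked calculation helps when estimating other batch sizes. The sketch below computes the time spent in the two Ollama calls and the remainder; attributing that remainder to fetching, content extraction, and storage is an assumption, since the document does not break it down.

```python
def ai_seconds(articles, translate_s, summarize_s):
    """Time spent in the translation and summarization calls for a batch."""
    return articles * (translate_s + summarize_s)


# Figures from the CPU and GPU mode lists, for a batch of 10 articles.
cpu_ai = ai_seconds(10, 1.5, 8.0)   # about 95 s of the ~115 s total
gpu_ai = ai_seconds(10, 0.3, 2.0)   # about 23 s of the ~31 s total

# Remainder not explained by the AI calls (assumed: fetch, extract, store).
cpu_other = 115 - cpu_ai            # about 20 s, roughly 2 s per article
gpu_other = 31 - gpu_ai             # about 8 s, roughly 0.8 s per article

print(f"CPU: {cpu_ai:.0f}s AI + {cpu_other:.0f}s other")
print(f"GPU: {gpu_ai:.0f}s AI + {gpu_other:.0f}s other")
```

On these numbers the AI calls dominate in both modes, which is consistent with GPU acceleration being listed as the main vertical scaling lever.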
### Resource Usage

**CPU Mode:**
- CPU: 60-80%
- RAM: 4-6GB
- Disk: ~1GB (with model)

**GPU Mode:**
- CPU: 10-20%
- RAM: 2-3GB
- GPU: 80-100%
- VRAM: 3-4GB
- Disk: ~1GB (with model)

---

## Future Enhancements

### Planned Features
- Frontend dashboard
- Real-time analytics
- Multiple languages
- Custom RSS feeds per subscriber
- A/B testing for newsletters
- Advanced tracking

### Technical Improvements
- Kubernetes deployment
- Microservices architecture
- Message queue (RabbitMQ/Redis)
- Caching layer (Redis)
- CDN for assets
- Advanced monitoring (Prometheus/Grafana)

---

See [SETUP.md](SETUP.md) for deployment guide and [SECURITY.md](SECURITY.md) for security best practices.