# System Architecture

Complete system design and architecture documentation.

---

## Overview

Munich News Daily is an automated news aggregation system that crawls Munich news sources, generates AI summaries, and sends daily newsletters.

```
┌─────────────────────────────────────────────────────────┐
│                     Docker Network                      │
│                     (Internal Only)                     │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐           │
│  │ MongoDB  │◄───┤ Backend  │◄───┤ Crawler  │           │
│  │ (27017)  │    │ (5001)   │    │          │           │
│  └──────────┘    └────┬─────┘    └──────────┘           │
│                       │                                 │
│  ┌──────────┐         │          ┌──────────┐           │
│  │ Ollama   │◄────────┤          │ Sender   │           │
│  │ (11434)  │         │          │          │           │
│  └──────────┘         │          └──────────┘           │
│                       │                                 │
└───────────────────────┼─────────────────────────────────┘
                        │
                        │ Port 5001 (Only exposed port)
                        ▼
                   Host Machine
                 External Network
```

---

## Components

### 1. MongoDB (Database)
- **Purpose**: Store articles, subscribers, tracking data
- **Port**: 27017 (internal only)
- **Access**: Only via Docker network
- **Authentication**: Username/password

**Collections:**
- `articles` - News articles with summaries
- `subscribers` - Newsletter subscribers
- `rss_feeds` - RSS feed sources
- `newsletter_sends` - Send tracking
- `link_clicks` - Click tracking

### 2. Backend API (Flask)
- **Purpose**: API endpoints, tracking, analytics
- **Port**: 5001 (exposed to host)
- **Access**: Public API, admin endpoints
- **Features**: Tracking pixels, click tracking, admin operations

**Key Endpoints:**
- `/api/admin/*` - Admin operations
- `/api/subscribe` - Subscribe to newsletter
- `/api/tracking/*` - Tracking endpoints
- `/health` - Health check

### 3. Ollama (AI Service)
- **Purpose**: AI summarization and translation
- **Port**: 11434 (internal only)
- **Model**: phi3:latest (2.2GB)
- **GPU**: Optional NVIDIA GPU support

**Features:**
- Article summarization (150 words)
- Title translation (German → English)
- Configurable timeout and model

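As a rough illustration of the two features above, the sketch below builds non-streaming request bodies for Ollama's `/api/generate` endpoint and reads the `response` field from the reply. The `ollama` hostname, the prompt wording, the 30-second default timeout, and all helper names are assumptions made for this example, not the project's actual code.

```python
import json
import urllib.request

OLLAMA_URL = "http://ollama:11434/api/generate"  # assumed service hostname on the Docker network
MODEL = "phi3:latest"


def build_payload(prompt: str, model: str = MODEL) -> dict:
    """Build a non-streaming request body for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": False}


def summary_prompt(content: str, max_words: int = 150) -> str:
    # Hypothetical prompt wording; the 150-word limit comes from the feature list.
    return f"Summarize the following article in at most {max_words} words:\n\n{content}"


def translation_prompt(title: str) -> str:
    # Hypothetical prompt wording for German -> English title translation.
    return f"Translate this German headline into English. Reply with the translation only:\n\n{title}"


def generate(prompt: str, timeout: float = 30.0) -> str:
    """Send the prompt to Ollama and return the generated text."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

Keeping the model name and timeout as parameters mirrors the "configurable timeout and model" feature; only `generate` needs a running Ollama instance.
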
### 4. Crawler (News Fetcher)
- **Purpose**: Fetch and process news articles
- **Schedule**: 6:00 AM Berlin time (automated)
- **Features**: RSS parsing, content extraction, AI processing

**Process:**
1. Fetch RSS feeds
2. Extract article content
3. Translate title (German → English)
4. Generate AI summary
5. Store in MongoDB

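The process above can be sketched as a small sequential pipeline. This is a minimal sketch, not the crawler's real code: `FeedEntry`, `process_entries`, and the injected `extract`, `translate`, `summarize`, and `store` callables are hypothetical names, and the rule of skipping entries with no extractable text is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Iterable


@dataclass
class FeedEntry:
    """One item taken from an RSS feed (step 1 happens before this function)."""
    title: str
    link: str
    source: str


def process_entries(
    entries: Iterable[FeedEntry],
    extract: Callable[[str], str],
    translate: Callable[[str], str],
    summarize: Callable[[str], str],
    store: Callable[[dict], None],
) -> int:
    """Run steps 2-5 for each entry, one at a time, and return how many were stored."""
    stored = 0
    for entry in entries:
        content = extract(entry.link)  # step 2: extract article content
        if not content:
            continue  # assumption: skip entries with no extractable text
        now = datetime.now(timezone.utc)
        document = {
            "title": entry.title,
            "title_en": translate(entry.title),  # step 3: translate title
            "translated_at": now,
            "link": entry.link,
            "summary": summarize(content),  # step 4: generate AI summary
            "content": content,
            "source": entry.source,
            "crawled_at": now,
        }
        store(document)  # step 5: store (a MongoDB insert in the real system)
        stored += 1
    return stored


# Usage with stand-in callables instead of real network, Ollama, and MongoDB calls.
saved: list = []
count = process_entries(
    [FeedEntry("Neue U-Bahn-Linie", "https://example.com/a", "Example Feed")],
    extract=lambda url: "Artikeltext",
    translate=lambda title: "New subway line",
    summarize=lambda text: text[:50],
    store=saved.append,
)
```

Passing the steps in as callables keeps the ordering explicit and lets each stage be swapped or tested in isolation.
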
### 5. Sender (Newsletter)
- **Purpose**: Send newsletters to subscribers
- **Schedule**: 7:00 AM Berlin time (automated)
- **Features**: Email sending, tracking, templating

**Process:**
1. Fetch today's articles
2. Generate newsletter HTML
3. Add tracking pixels/links
4. Send to active subscribers
5. Record send events

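The crawler (6:00 AM) and the sender (7:00 AM) both run on Berlin wall-clock time. The stack uses the Python schedule library for this; the standard-library sketch below is only an illustration of how the next run time can be computed so that it stays correct across daylight-saving changes. The function name `next_run` is hypothetical.

```python
from datetime import datetime, time, timedelta, timezone
from zoneinfo import ZoneInfo

BERLIN = ZoneInfo("Europe/Berlin")


def next_run(now: datetime, at: time) -> datetime:
    """Return the next occurrence of wall-clock time `at` in Berlin, strictly after `now`.

    `now` must be timezone-aware.
    """
    local_now = now.astimezone(BERLIN)
    candidate = datetime.combine(local_now.date(), at, tzinfo=BERLIN)
    if candidate <= local_now:
        # Today's slot has passed: use the same wall-clock time tomorrow.
        candidate = datetime.combine(local_now.date() + timedelta(days=1), at, tzinfo=BERLIN)
    return candidate


now = datetime(2024, 1, 15, 5, 30, tzinfo=timezone.utc)  # 06:30 in Berlin (CET)
crawler_run = next_run(now, time(6, 0))  # 6:00 has passed, so the next run is tomorrow
sender_run = next_run(now, time(7, 0))   # 7:00 is still ahead today
```

Working with dates and wall-clock times, rather than adding 24 hours to the previous run, avoids drifting by an hour when the clocks change.
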
---

## Data Flow

### Article Processing Flow

```
RSS Feed
    ↓
Crawler fetches
    ↓
Extract content
    ↓
Translate title (Ollama)
    ↓
Generate summary (Ollama)
    ↓
Store in MongoDB
    ↓
Newsletter Sender
    ↓
Email to subscribers
```

### Tracking Flow

```
Newsletter sent
    ↓
Tracking pixel embedded
    ↓
User opens email
    ↓
Pixel loaded → Backend API
    ↓
Record open event
    ↓
User clicks link
    ↓
Redirect via Backend API
    ↓
Record click event
    ↓
Redirect to article
```

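A minimal in-memory sketch of the flow above, assuming hypothetical URL shapes under `/api/tracking/*` and a hypothetical `TrackingStore` class standing in for the `newsletter_sends` and `link_clicks` collections. The real endpoint paths and stored fields are not specified in this document.

```python
import uuid
from datetime import datetime, timezone
from urllib.parse import urlencode

BASE_URL = "http://localhost:5001"  # assumed public address of the backend


class TrackingStore:
    """In-memory stand-in for the tracking collections."""

    def __init__(self) -> None:
        self.opens: list[dict] = []
        self.clicks: list[dict] = []
        self.links: dict[str, str] = {}  # tracking link id -> original article URL

    def pixel_url(self, send_id: str) -> str:
        # Hypothetical path; embedded in the email as a 1x1 image.
        return f"{BASE_URL}/api/tracking/open?{urlencode({'send': send_id})}"

    def tracked_link(self, send_id: str, article_url: str) -> str:
        # Replace the article URL with a backend URL that records the click first.
        link_id = uuid.uuid4().hex
        self.links[link_id] = article_url
        return f"{BASE_URL}/api/tracking/click?{urlencode({'send': send_id, 'link': link_id})}"

    def record_open(self, send_id: str) -> None:
        self.opens.append({"send_id": send_id, "opened_at": datetime.now(timezone.utc)})

    def record_click(self, send_id: str, link_id: str) -> str:
        """Record the click and return the article URL to redirect to."""
        target = self.links[link_id]
        self.clicks.append(
            {"send_id": send_id, "link_id": link_id, "url": target,
             "clicked_at": datetime.now(timezone.utc)}
        )
        return target
```

Storing the target URL server-side and putting only an opaque id in the link means the redirect endpoint never forwards to an arbitrary URL supplied by the client.
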
---

## Database Schema

### Articles Collection

```javascript
{
  _id: ObjectId,
  title: String,            // Original German title
  title_en: String,         // English translation
  translated_at: Date,      // Translation timestamp
  link: String,
  summary: String,          // AI-generated summary
  content: String,          // Full article text
  author: String,
  source: String,           // RSS feed name
  published_at: Date,
  crawled_at: Date,
  created_at: Date
}
```

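To show how the schema above might be checked before a write, here is a small validation sketch. The field names and types mirror the schema; which fields count as required (`REQUIRED`) is an assumption, and `validate_article` is a hypothetical helper rather than part of the project.

```python
from datetime import datetime, timezone

# Field names and types mirror the articles schema.
ARTICLE_FIELDS = {
    "title": str, "title_en": str, "translated_at": datetime, "link": str,
    "summary": str, "content": str, "author": str, "source": str,
    "published_at": datetime, "crawled_at": datetime, "created_at": datetime,
}
# Assumption: the minimum needed to identify and attribute an article.
REQUIRED = {"title", "link", "source", "crawled_at"}


def validate_article(doc: dict) -> list[str]:
    """Return a list of problems; an empty list means the document fits the schema."""
    problems = [f"missing required field: {name}" for name in sorted(REQUIRED - set(doc))]
    for name, value in doc.items():
        if name == "_id":
            continue  # assigned by MongoDB
        expected = ARTICLE_FIELDS.get(name)
        if expected is None:
            problems.append(f"unknown field: {name}")
        elif not isinstance(value, expected):
            problems.append(f"{name}: expected {expected.__name__}, got {type(value).__name__}")
    return problems


example = {
    "title": "Neue U-Bahn-Linie",
    "link": "https://example.com/a",
    "source": "Example Feed",
    "crawled_at": datetime.now(timezone.utc),
}
issues = validate_article(example)  # no problems for this document
```
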
### Subscribers Collection

```javascript
{
  _id: ObjectId,
  email: String,            // Unique
  subscribed_at: Date,
  status: String            // 'active' or 'inactive'
}
```

### RSS Feeds Collection

```javascript
{
  _id: ObjectId,
  name: String,
  url: String,
  active: Boolean,
  last_crawled: Date
}
```

---

## Security Architecture

### Network Isolation

**Exposed Services:**
- Backend API (port 5001) - Only exposed service

**Internal Services:**
- MongoDB (port 27017) - Not accessible from host
- Ollama (port 11434) - Not accessible from host
- Crawler - No ports
- Sender - No ports

**Benefits:**
- Only one of the three service ports (5001) is reachable from the host; closing the other two cuts the exposed ports by about 66%
- Database protected from external access
- AI service protected from abuse
- Defense in depth

### Authentication

**MongoDB:**
- Username/password authentication
- Credentials in environment variables
- Internal network only

**Backend API:**
- No authentication is built in; add it before exposing the API in production
- Rate limiting recommended
- IP whitelisting recommended

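One way to add the missing authentication is a shared API key on the admin endpoints. The sketch below is an assumption-laden example, not the project's implementation: the `X-API-Key` header, the `ADMIN_API_KEY` environment variable, and the `is_authorized` helper are all hypothetical names.

```python
import hmac
import os
from typing import Mapping, Optional

ADMIN_PREFIX = "/api/admin/"


def is_authorized(path: str, headers: Mapping[str, str], api_key: Optional[str] = None) -> bool:
    """Allow public paths; require a matching X-API-Key header for admin paths."""
    if not path.startswith(ADMIN_PREFIX):
        return True
    # Hypothetical variable name; the key is read from the environment like other secrets.
    expected = api_key if api_key is not None else os.environ.get("ADMIN_API_KEY", "")
    supplied = headers.get("X-API-Key", "")
    # An empty configured key never authorizes; compare_digest avoids timing leaks.
    return bool(expected) and hmac.compare_digest(supplied.encode(), expected.encode())
```

In a Flask app a check like this could run in a `before_request` hook and return a 401 response when it fails. Rate limiting and IP allow-listing are usually easier to enforce at the reverse proxy.
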
### Data Protection

- Subscriber emails stored securely
- No sensitive data in logs
- Environment variables for secrets
- `.env` file in `.gitignore`

---

## Technology Stack

### Backend
- **Language**: Python 3.11
- **Framework**: Flask
- **Database**: MongoDB 7.0
- **AI**: Ollama (phi3:latest)

### Infrastructure
- **Containerization**: Docker & Docker Compose
- **Networking**: Docker bridge network
- **Storage**: Docker volumes
- **Scheduling**: Python schedule library

### Libraries
- **Web**: Flask, Flask-CORS
- **Database**: pymongo
- **Email**: smtplib, email.mime
- **Scraping**: requests, BeautifulSoup4, feedparser
- **Templating**: Jinja2
- **AI**: requests (Ollama API)

---

## Deployment Architecture

### Development
```
Local Machine
├── Docker Compose
│   ├── MongoDB (internal)
│   ├── Ollama (internal)
│   ├── Backend (exposed)
│   ├── Crawler (internal)
│   └── Sender (internal)
└── .env file
```

### Production
```
Server
├── Reverse Proxy (nginx/Traefik)
│   ├── SSL/TLS
│   ├── Rate limiting
│   └── Authentication
├── Docker Compose
│   ├── MongoDB (internal)
│   ├── Ollama (internal, GPU)
│   ├── Backend (internal)
│   ├── Crawler (internal)
│   └── Sender (internal)
├── Monitoring
│   ├── Logs
│   ├── Metrics
│   └── Alerts
└── Backups
    ├── MongoDB dumps
    └── Configuration
```

---

## Scalability

### Current Limits
- Single server deployment
- Sequential article processing
- Single MongoDB instance
- No load balancing

### Scaling Options

**Horizontal Scaling:**
- Multiple crawler instances
- Load-balanced backend
- MongoDB replica set
- Distributed Ollama

**Vertical Scaling:**
- More CPU cores
- More RAM
- GPU acceleration (5-10x faster)
- Faster storage

**Optimization:**
- Batch processing
- Caching
- Database indexing
- Connection pooling

---

## Monitoring

### Health Checks

```bash
# Backend health
curl http://localhost:5001/health

# MongoDB health
docker-compose exec mongodb mongosh --eval "db.adminCommand('ping')"

# Ollama health
docker-compose exec ollama ollama list
```

### Metrics

- Article count
- Subscriber count
- Newsletter open rate
- Click-through rate
- Processing time
- Error rate

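The open rate and click-through rate can be derived from the recorded send, open, and click events. This document does not define the formulas, so the sketch below assumes the common definitions (unique opens or clicks divided by newsletters sent); the function names are hypothetical.

```python
def rate(numerator: int, denominator: int) -> float:
    """Return numerator/denominator as a percentage, or 0.0 when the denominator is zero."""
    return round(100.0 * numerator / denominator, 1) if denominator else 0.0


def newsletter_metrics(sent: int, unique_opens: int, unique_clicks: int) -> dict:
    """Compute engagement rates for one newsletter send."""
    return {
        "open_rate": rate(unique_opens, sent),                    # opens per newsletter sent
        "click_through_rate": rate(unique_clicks, sent),          # clicks per newsletter sent
        "click_to_open_rate": rate(unique_clicks, unique_opens),  # clicks per opened newsletter
    }


# Example: 200 newsletters sent, 80 opened, 20 with at least one click.
metrics = newsletter_metrics(sent=200, unique_opens=80, unique_clicks=20)
```

Open rates measured with tracking pixels tend to be approximate, because some mail clients block or prefetch images.
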
### Logs

```bash
# All services
docker-compose logs -f

# Specific service
docker-compose logs -f crawler
docker-compose logs -f backend
```

---

## Backup & Recovery

### MongoDB Backup

```bash
# Backup (the dump lands inside the container; mount /backup as a volume to persist it on the host)
docker-compose exec mongodb mongodump --out /backup

# Restore
docker-compose exec mongodb mongorestore /backup

# With authentication enabled, both commands also need
# --username, --password, and --authenticationDatabase
```

### Configuration Backup

- `backend/.env` - Environment variables
- `docker-compose.yml` - Service configuration
- RSS feeds in MongoDB

### Recovery Plan

1. Restore MongoDB from backup
2. Restore configuration files
3. Restart services
4. Verify functionality

---

## Performance

### CPU Mode
- Translation: ~1.5s per title
- Summarization: ~8s per article
- 10 articles: ~115s total
- Suitable for <20 articles/day

### GPU Mode (5-10x faster)
- Translation: ~0.3s per title
- Summarization: ~2s per article
- 10 articles: ~31s total
- Suitable for high-volume processing

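The per-article figures above can be checked against the quoted totals with a little arithmetic. The sketch assumes strictly sequential processing and attributes whatever is left over after the AI steps to fetching, content extraction, and database writes; that attribution is an assumption, not a measurement.

```python
def batch_seconds(articles: int, translate_s: float, summarize_s: float, overhead_s: float = 0.0) -> float:
    """Estimate sequential processing time: every article pays each cost in turn."""
    return articles * (translate_s + summarize_s + overhead_s)


# AI time alone for 10 articles, from the per-article figures.
cpu_ai = batch_seconds(10, 1.5, 8.0)  # about 95 s
gpu_ai = batch_seconds(10, 0.3, 2.0)  # about 23 s

# Remainder of the quoted totals (~115 s and ~31 s), assumed to be non-AI overhead.
cpu_overhead = 115 - cpu_ai  # about 20 s
gpu_overhead = 31 - gpu_ai   # about 8 s

# End-to-end speedup implied by the quoted totals.
speedup = 115 / 31  # about 3.7x
```

On these numbers the GPU is about 5x faster for translation and 4x faster for summarization, while the end-to-end gain is closer to 3.7x because the non-AI overhead does not shrink by the same factor.
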
### Resource Usage

**CPU Mode:**
- CPU: 60-80%
- RAM: 4-6GB
- Disk: ~1GB (with model)

**GPU Mode:**
- CPU: 10-20%
- RAM: 2-3GB
- GPU: 80-100%
- VRAM: 3-4GB
- Disk: ~1GB (with model)

---

## Future Enhancements

### Planned Features
- Frontend dashboard
- Real-time analytics
- Multiple languages
- Custom RSS feeds per subscriber
- A/B testing for newsletters
- Advanced tracking

### Technical Improvements
- Kubernetes deployment
- Microservices architecture
- Message queue (RabbitMQ/Redis)
- Caching layer (Redis)
- CDN for assets
- Advanced monitoring (Prometheus/Grafana)

---

See [SETUP.md](SETUP.md) for deployment guide and [SECURITY.md](SECURITY.md) for security best practices.