# System Architecture
Complete system design and architecture documentation.
---
## Overview
Munich News Daily is an automated news aggregation system that crawls Munich news sources, generates AI summaries, and sends daily newsletters.
```
┌─────────────────────────────────────────────────────────┐
│                     Docker Network                      │
│                     (Internal Only)                     │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  ┌──────────┐     ┌──────────┐     ┌──────────┐         │
│  │ MongoDB  │◄────┤ Backend  │◄────┤ Crawler  │         │
│  │ (27017)  │     │  (5001)  │     │          │         │
│  └──────────┘     └────┬─────┘     └──────────┘         │
│                        │                                │
│  ┌──────────┐          │           ┌──────────┐         │
│  │  Ollama  │◄─────────┤           │  Sender  │         │
│  │ (11434)  │          │           │          │         │
│  └──────────┘          │           └──────────┘         │
│                        │                                │
└────────────────────────┼────────────────────────────────┘
                         │  Port 5001 (only exposed port)
                         ▼
                    Host Machine
                  External Network
```
---
## Components
### 1. MongoDB (Database)
- **Purpose**: Store articles, subscribers, tracking data
- **Port**: 27017 (internal only)
- **Access**: Only via Docker network
- **Authentication**: Username/password
**Collections:**
- `articles` - News articles with summaries
- `subscribers` - Newsletter subscribers
- `rss_feeds` - RSS feed sources
- `newsletter_sends` - Send tracking
- `link_clicks` - Click tracking
### 2. Backend API (Flask)
- **Purpose**: API endpoints, tracking, analytics
- **Port**: 5001 (exposed to host)
- **Access**: Public API, admin endpoints
- **Features**: Tracking pixels, click tracking, admin operations
**Key Endpoints:**
- `/api/admin/*` - Admin operations
- `/api/subscribe` - Subscribe to newsletter
- `/api/tracking/*` - Tracking endpoints
- `/health` - Health check
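A minimal sketch of the subscribe and health endpoints, assuming Flask. The in-memory `subscribers` list stands in for the real pymongo collection, and the validation is reduced to the essentials:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Stand-in for the real pymongo collection (db.subscribers).
subscribers = []

@app.route("/api/subscribe", methods=["POST"])
def subscribe():
    email = (request.get_json(silent=True) or {}).get("email", "").strip().lower()
    if "@" not in email:
        return jsonify({"error": "invalid email"}), 400
    if any(s["email"] == email for s in subscribers):
        return jsonify({"status": "already subscribed"}), 200
    subscribers.append({"email": email, "status": "active"})
    return jsonify({"status": "subscribed"}), 201

@app.route("/health")
def health():
    return jsonify({"status": "ok"})
```

The duplicate check here mirrors the unique constraint on `email` in the subscribers collection.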
### 3. Ollama (AI Service)
- **Purpose**: AI summarization and translation
- **Port**: 11434 (internal only)
- **Model**: phi3:latest (2.2GB)
- **GPU**: Optional NVIDIA GPU support
**Features:**
- Article summarization (150 words)
- Title translation (German → English)
- Configurable timeout and model
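Calling Ollama amounts to a POST against its `/api/generate` endpoint. A stdlib-only sketch (the internal hostname and the prompt wording are assumptions, not the project's actual values):

```python
import json
import urllib.request

OLLAMA_URL = "http://ollama:11434"   # internal Docker hostname (assumed)
MODEL = "phi3:latest"

def build_summary_prompt(title: str, content: str, max_words: int = 150) -> str:
    """Illustrative prompt construction for the summarization step."""
    return (
        f"Summarize the following news article in at most {max_words} words.\n"
        f"Title: {title}\n\n{content}"
    )

def generate(prompt: str, timeout: int = 120) -> str:
    """Call Ollama's /api/generate endpoint (non-streaming)."""
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps({"model": MODEL, "prompt": prompt, "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

The `timeout` and `MODEL` values correspond to the configurable timeout and model mentioned above.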
### 4. Crawler (News Fetcher)
- **Purpose**: Fetch and process news articles
- **Schedule**: 6:00 AM Berlin time (automated)
- **Features**: RSS parsing, content extraction, AI processing
**Process:**
1. Fetch RSS feeds
2. Extract article content
3. Translate title (German → English)
4. Generate AI summary
5. Store in MongoDB
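Steps 1 and 5 can be sketched as follows. The real crawler uses feedparser; this version parses RSS with the standard library for a self-contained illustration, and the `seen_links` set stands in for the duplicate check against MongoDB:

```python
import xml.etree.ElementTree as ET

def parse_rss_items(rss_xml: str):
    """Extract (title, link) pairs from an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    for item in root.iter("item"):
        yield {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        }

def crawl(feed_xml: str, seen_links: set):
    """Fetch items and skip ones already stored. Translation and
    summarization (steps 3-4) would call Ollama between these lines."""
    new_articles = []
    for item in parse_rss_items(feed_xml):
        if item["link"] in seen_links:
            continue           # already stored in MongoDB
        seen_links.add(item["link"])
        new_articles.append(item)
    return new_articles
```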
### 5. Sender (Newsletter)
- **Purpose**: Send newsletters to subscribers
- **Schedule**: 7:00 AM Berlin time (automated)
- **Features**: Email sending, tracking, templating
**Process:**
1. Fetch today's articles
2. Generate newsletter HTML
3. Add tracking pixels/links
4. Send to active subscribers
5. Record send events
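Steps 2 and 3 come down to assembling an HTML message with a tracking pixel and rewritten links. A stdlib sketch (the tracking URL paths and `BASE_URL` are assumptions; the real templates use Jinja2):

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

BASE_URL = "http://localhost:5001"   # backend URL (assumed)

def build_newsletter(to_addr: str, articles: list, send_id: str) -> MIMEMultipart:
    """Assemble the newsletter with a tracking pixel and tracked links."""
    items = "".join(
        f'<li><a href="{BASE_URL}/api/tracking/click/{send_id}?url={a["link"]}">'
        f'{a["title_en"]}</a><p>{a["summary"]}</p></li>'
        for a in articles
    )
    html = (
        f"<html><body><h1>Munich News Daily</h1><ul>{items}</ul>"
        f'<img src="{BASE_URL}/api/tracking/open/{send_id}.gif" width="1" height="1">'
        f"</body></html>"
    )
    msg = MIMEMultipart("alternative")
    msg["Subject"] = "Munich News Daily"
    msg["To"] = to_addr
    msg.attach(MIMEText(html, "html"))
    return msg
```

Sending (step 4) would hand the message to `smtplib.SMTP.send_message` per active subscriber.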
---
## Data Flow
### Article Processing Flow
```
RSS Feed
   ↓ (Crawler fetches)
Extract content
   ↓
Translate title (Ollama)
   ↓
Generate summary (Ollama)
   ↓
Store in MongoDB
   ↓
Newsletter Sender
   ↓
Email to subscribers
```
### Tracking Flow
```
Newsletter sent
   ↓
Tracking pixel embedded
   ↓
User opens email
   ↓
Pixel loaded → Backend API
   ↓
Record open event
   ↓
User clicks link
   ↓
Redirect via Backend API
   ↓
Record click event
   ↓
Redirect to article
```
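The two tracking endpoints can be sketched in Flask as follows. The route paths and the in-memory `events` list are illustrative stand-ins for the real endpoints and the `newsletter_sends` / `link_clicks` collections:

```python
from flask import Flask, Response, redirect, request

app = Flask(__name__)
events = []   # stand-in for the newsletter_sends / link_clicks collections

# Smallest transparent 1x1 GIF, served as the tracking pixel.
PIXEL = (b"GIF89a\x01\x00\x01\x00\x80\x00\x00\x00\x00\x00\xff\xff\xff!"
         b"\xf9\x04\x01\x00\x00\x00\x00,\x00\x00\x00\x00\x01\x00\x01\x00"
         b"\x00\x02\x02D\x01\x00;")

@app.route("/api/tracking/open/<send_id>.gif")
def track_open(send_id):
    events.append({"type": "open", "send_id": send_id})
    return Response(PIXEL, mimetype="image/gif")

@app.route("/api/tracking/click/<send_id>")
def track_click(send_id):
    url = request.args.get("url", "/")
    events.append({"type": "click", "send_id": send_id, "url": url})
    return redirect(url, code=302)
```

Serving a real image (rather than an empty 200) keeps mail clients from flagging the pixel as broken.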
---
## Database Schema
### Articles Collection
```javascript
{
  _id: ObjectId,
  title: String,          // Original German title
  title_en: String,       // English translation
  translated_at: Date,    // Translation timestamp
  link: String,
  summary: String,        // AI-generated summary
  content: String,        // Full article text
  author: String,
  source: String,         // RSS feed name
  published_at: Date,
  crawled_at: Date,
  created_at: Date
}
```
### Subscribers Collection
```javascript
{
  _id: ObjectId,
  email: String,          // Unique
  subscribed_at: Date,
  status: String          // 'active' or 'inactive'
}
```
### RSS Feeds Collection
```javascript
{
  _id: ObjectId,
  name: String,
  url: String,
  active: Boolean,
  last_crawled: Date
}
```
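The collections above imply a handful of indexes. A sketch of the specs (field choices inferred from the schema, not the project's actual index set); `ensure_indexes` takes a pymongo `Database` and is safe to call at every startup:

```python
# Each entry: (collection, keys, options) — passed to create_index().
INDEXES = [
    ("subscribers", [("email", 1)], {"unique": True}),
    ("articles", [("published_at", -1)], {}),      # "today's articles" query
    ("articles", [("link", 1)], {"unique": True}),  # dedupe on crawl
    ("rss_feeds", [("active", 1)], {}),
]

def ensure_indexes(db):
    """db is a pymongo Database; create_index is idempotent."""
    for coll, keys, opts in INDEXES:
        db[coll].create_index(keys, **opts)
```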
---
## Security Architecture
### Network Isolation
**Exposed Services:**
- Backend API (port 5001) - Only exposed service
**Internal Services:**
- MongoDB (port 27017) - Not accessible from host
- Ollama (port 11434) - Not accessible from host
- Crawler - No ports
- Sender - No ports
**Benefits:**
- Only one of three service ports is published to the host (~66% smaller attack surface)
- Database unreachable from outside the Docker network
- AI service protected from abuse
- Defense in depth
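In docker-compose terms, this isolation comes down to publishing only the backend port with `ports:` and keeping everything else on `expose:` (a fragment; service and network names assumed):

```yaml
services:
  backend:
    ports:
      - "5001:5001"      # the only host-published port
    networks: [internal]
  mongodb:
    expose:
      - "27017"          # reachable from other services, not the host
    networks: [internal]
  ollama:
    expose:
      - "11434"
    networks: [internal]

networks:
  internal:
    driver: bridge
```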
### Authentication
**MongoDB:**
- Username/password authentication
- Credentials in environment variables
- Internal network only
**Backend API:**
- No authentication (add in production)
- Rate limiting recommended
- IP whitelisting recommended
### Data Protection
- Subscriber emails stored securely
- No sensitive data in logs
- Environment variables for secrets
- `.env` file in `.gitignore`
---
## Technology Stack
### Backend
- **Language**: Python 3.11
- **Framework**: Flask
- **Database**: MongoDB 7.0
- **AI**: Ollama (phi3:latest)
### Infrastructure
- **Containerization**: Docker & Docker Compose
- **Networking**: Docker bridge network
- **Storage**: Docker volumes
- **Scheduling**: Python schedule library
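The crawler and sender run on fixed Berlin wall-clock times, so the scheduling loop (the `schedule` library) has to get the timezone arithmetic right. A stdlib sketch of the "next 6:00 AM Berlin" computation:

```python
from datetime import datetime, time, timedelta
from zoneinfo import ZoneInfo

BERLIN = ZoneInfo("Europe/Berlin")

def next_run(now: datetime, at: time = time(6, 0)) -> datetime:
    """Next occurrence of `at` (Berlin wall clock) after `now`."""
    local = now.astimezone(BERLIN)
    candidate = local.replace(hour=at.hour, minute=at.minute,
                              second=0, microsecond=0)
    if candidate <= local:
        candidate += timedelta(days=1)
    return candidate
```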
### Libraries
- **Web**: Flask, Flask-CORS
- **Database**: pymongo
- **Email**: smtplib, email.mime
- **Scraping**: requests, BeautifulSoup4, feedparser
- **Templating**: Jinja2
- **AI**: requests (Ollama API)
---
## Deployment Architecture
### Development
```
Local Machine
├── Docker Compose
│   ├── MongoDB (internal)
│   ├── Ollama (internal)
│   ├── Backend (exposed)
│   ├── Crawler (internal)
│   └── Sender (internal)
└── .env file
```
### Production
```
Server
├── Reverse Proxy (nginx/Traefik)
│   ├── SSL/TLS
│   ├── Rate limiting
│   └── Authentication
├── Docker Compose
│   ├── MongoDB (internal)
│   ├── Ollama (internal, GPU)
│   ├── Backend (internal)
│   ├── Crawler (internal)
│   └── Sender (internal)
├── Monitoring
│   ├── Logs
│   ├── Metrics
│   └── Alerts
└── Backups
    ├── MongoDB dumps
    └── Configuration
```
---
## Scalability
### Current Limits
- Single server deployment
- Sequential article processing
- Single MongoDB instance
- No load balancing
### Scaling Options
**Horizontal Scaling:**
- Multiple crawler instances
- Load-balanced backend
- MongoDB replica set
- Distributed Ollama
**Vertical Scaling:**
- More CPU cores
- More RAM
- GPU acceleration (5-10x faster)
- Faster storage
**Optimization:**
- Batch processing
- Caching
- Database indexing
- Connection pooling
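The caching item above can be as simple as memoizing the AI call per article, so a re-crawled feed never hits Ollama twice for the same content. A sketch in which the truncation stands in for the real model call:

```python
from functools import lru_cache

calls = {"count": 0}   # instrumentation, just for illustration

@lru_cache(maxsize=1024)
def cached_summary(link: str, content: str) -> str:
    """Memoize summaries by (link, content); a changed article re-runs."""
    calls["count"] += 1
    return content[:150]   # stand-in for the Ollama summary
```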
---
## Monitoring
### Health Checks
```bash
# Backend health
curl http://localhost:5001/health
# MongoDB health
docker-compose exec mongodb mongosh --eval "db.adminCommand('ping')"
# Ollama health
docker-compose exec ollama ollama list
```
### Metrics
- Article count
- Subscriber count
- Newsletter open rate
- Click-through rate
- Processing time
- Error rate
### Logs
```bash
# All services
docker-compose logs -f
# Specific service
docker-compose logs -f crawler
docker-compose logs -f backend
```
---
## Backup & Recovery
### MongoDB Backup
```bash
# Backup
docker-compose exec mongodb mongodump --out /backup
# Restore
docker-compose exec mongodb mongorestore /backup
```
### Configuration Backup
- `backend/.env` - Environment variables
- `docker-compose.yml` - Service configuration
- RSS feeds in MongoDB
### Recovery Plan
1. Restore MongoDB from backup
2. Restore configuration files
3. Restart services
4. Verify functionality
---
## Performance
### CPU Mode
- Translation: ~1.5s per title
- Summarization: ~8s per article
- 10 articles: ~115s total
- Suitable for <20 articles/day
### GPU Mode (5-10x faster)
- Translation: ~0.3s per title
- Summarization: ~2s per article
- 10 articles: ~31s total
- Suitable for high-volume processing
### Resource Usage
**CPU Mode:**
- CPU: 60-80%
- RAM: 4-6GB
- Disk: ~1GB (with model)
**GPU Mode:**
- CPU: 10-20%
- RAM: 2-3GB
- GPU: 80-100%
- VRAM: 3-4GB
- Disk: ~1GB (with model)
---
## Future Enhancements
### Planned Features
- Frontend dashboard
- Real-time analytics
- Multiple languages
- Custom RSS feeds per subscriber
- A/B testing for newsletters
- Advanced tracking
### Technical Improvements
- Kubernetes deployment
- Microservices architecture
- Message queue (RabbitMQ/Redis)
- Caching layer (Redis)
- CDN for assets
- Advanced monitoring (Prometheus/Grafana)
---
See [SETUP.md](SETUP.md) for deployment guide and [SECURITY.md](SECURITY.md) for security best practices.