System Architecture

Complete system design and architecture documentation.


Overview

Munich News Daily is an automated news aggregation system that crawls Munich news sources, generates AI summaries, and sends daily newsletters.

┌────────────────────────────────────────────────────────┐
│                     Docker Network                     │
│                    (Internal Only)                     │
├────────────────────────────────────────────────────────┤
│                                                        │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐          │
│  │ MongoDB  │◄───┤ Backend  │◄───┤ Crawler  │          │
│  │ (27017)  │    │ (5001)   │    │          │          │
│  └──────────┘    └────┬─────┘    └──────────┘          │
│                       │                                │
│  ┌──────────┐         │          ┌──────────┐          │
│  │ Ollama   │◄────────┤          │ Sender   │          │
│  │ (11434)  │         │          │          │          │
│  └──────────┘         │          └──────────┘          │
│                       │                                │
└───────────────────────┼────────────────────────────────┘
                        │
                        │ Port 5001 (only exposed port)
                        ▼
                   Host Machine
                 External Network

Components

1. MongoDB (Database)

  • Purpose: Store articles, subscribers, tracking data
  • Port: 27017 (internal only)
  • Access: Only via Docker network
  • Authentication: Username/password

Collections:

  • articles - News articles with summaries
  • subscribers - Newsletter subscribers
  • rss_feeds - RSS feed sources
  • newsletter_sends - Send tracking
  • link_clicks - Click tracking

2. Backend API (Flask)

  • Purpose: API endpoints, tracking, analytics
  • Port: 5001 (exposed to host)
  • Access: Public API, admin endpoints
  • Features: Tracking pixels, click tracking, admin operations

Key Endpoints:

  • /api/admin/* - Admin operations
  • /api/subscribe - Subscribe to newsletter
  • /api/tracking/* - Tracking endpoints
  • /health - Health check
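A minimal sketch of what the /health and /api/subscribe handlers might look like in Flask. The handler bodies and validation are assumptions for illustration; the real subscribe endpoint also writes the address to the subscribers collection in MongoDB.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/health")
def health():
    # Liveness probe used by the health checks in the Monitoring section
    return jsonify({"status": "ok"})

@app.route("/api/subscribe", methods=["POST"])
def subscribe():
    # Hypothetical handler shape; the real endpoint persists to MongoDB
    email = (request.get_json(silent=True) or {}).get("email", "").strip().lower()
    if "@" not in email:
        return jsonify({"error": "invalid email"}), 400
    return jsonify({"subscribed": email}), 201
```

Keeping validation and normalization (lowercasing) at the API boundary is what makes the unique index on subscriber emails reliable.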

3. Ollama (AI Service)

  • Purpose: AI summarization and translation
  • Port: 11434 (internal only)
  • Model: phi3:latest (2.2GB)
  • GPU: Optional NVIDIA GPU support

Features:

  • Article summarization (150 words)
  • Title translation (German → English)
  • Configurable timeout and model
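Ollama is driven over plain HTTP with requests. A hedged sketch of the summarization call, assuming the internal Docker hostname `ollama` and Ollama's standard /api/generate endpoint; the prompt wording and default timeout are illustrative, not the service's actual values.

```python
import requests

OLLAMA_URL = "http://ollama:11434"  # internal Docker hostname (assumption)

def build_summary_payload(text, model="phi3:latest", word_limit=150):
    # Request body for Ollama's /api/generate endpoint
    prompt = f"Summarize the following article in at most {word_limit} words:\n\n{text}"
    return {"model": model, "prompt": prompt, "stream": False}

def summarize(text, timeout=120):
    # Model and timeout are configurable via environment in the real service
    resp = requests.post(f"{OLLAMA_URL}/api/generate",
                         json=build_summary_payload(text), timeout=timeout)
    resp.raise_for_status()
    return resp.json()["response"].strip()
```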

4. Crawler (News Fetcher)

  • Purpose: Fetch and process news articles
  • Schedule: 6:00 AM Berlin time (automated)
  • Features: RSS parsing, content extraction, AI processing

Process:

  1. Fetch RSS feeds
  2. Extract article content
  3. Translate title (German → English)
  4. Generate AI summary
  5. Store in MongoDB

5. Sender (Newsletter)

  • Purpose: Send newsletters to subscribers
  • Schedule: 7:00 AM Berlin time (automated)
  • Features: Email sending, tracking, templating

Process:

  1. Fetch today's articles
  2. Generate newsletter HTML
  3. Add tracking pixels/links
  4. Send to active subscribers
  5. Record send events
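Steps 2–4 amount to assembling a MIME message. A simplified sketch using the stdlib email package; the real sender renders a Jinja2 template and rewrites links through the tracking endpoints before handing the message to smtplib.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def build_newsletter(articles, recipient, sender="news@example.com"):
    # Illustrative only: template rendering and tracking links omitted
    html = "<h1>Munich News Daily</h1>" + "".join(
        f'<p><a href="{a["link"]}">{a["title_en"]}</a><br>{a["summary"]}</p>'
        for a in articles
    )
    msg = MIMEMultipart("alternative")
    msg["Subject"] = "Munich News Daily"
    msg["From"] = sender
    msg["To"] = recipient
    msg.attach(MIMEText(html, "html"))
    return msg
```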

Data Flow

Article Processing Flow

RSS Feed
   ↓
Crawler fetches
   ↓
Extract content
   ↓
Translate title (Ollama)
   ↓
Generate summary (Ollama)
   ↓
Store in MongoDB
   ↓
Newsletter Sender
   ↓
Email to subscribers

Tracking Flow

Newsletter sent
   ↓
Tracking pixel embedded
   ↓
User opens email
   ↓
Pixel loaded → Backend API
   ↓
Record open event
   ↓
User clicks link
   ↓
Redirect via Backend API
   ↓
Record click event
   ↓
Redirect to article
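Both tracking paths reduce to two small Flask handlers: one serving a 1x1 transparent GIF (the open pixel), one recording the click and redirecting to the article. The route shapes and in-memory lists are assumptions; the real backend persists events to the newsletter_sends and link_clicks collections.

```python
import base64
from flask import Flask, redirect, request

app = Flask(__name__)
# Standard 1x1 transparent GIF used as the open-tracking pixel
PIXEL = base64.b64decode("R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7")
opens, clicks = [], []  # stand-ins for the MongoDB collections

@app.route("/api/tracking/open/<send_id>")   # route shape is an assumption
def track_open(send_id):
    opens.append(send_id)
    return PIXEL, 200, {"Content-Type": "image/gif"}

@app.route("/api/tracking/click/<send_id>")
def track_click(send_id):
    clicks.append(send_id)
    return redirect(request.args.get("url", "/"))
```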

Database Schema

Articles Collection

{
  _id: ObjectId,
  title: String,              // Original German title
  title_en: String,           // English translation
  translated_at: Date,        // Translation timestamp
  link: String,
  summary: String,            // AI-generated summary
  content: String,            // Full article text
  author: String,
  source: String,             // RSS feed name
  published_at: Date,
  crawled_at: Date,
  created_at: Date
}

Subscribers Collection

{
  _id: ObjectId,
  email: String,              // Unique
  subscribed_at: Date,
  status: String              // 'active' or 'inactive'
}

RSS Feeds Collection

{
  _id: ObjectId,
  name: String,
  url: String,
  active: Boolean,
  last_crawled: Date
}
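The schema implies a few indexes: unique subscriber emails, deduplication of re-crawled articles by link, and fast lookup of today's articles. A sketch using pymongo's create_index (1 means ascending); the exact index set is an assumption.

```python
def ensure_indexes(db):
    """Create the indexes implied by the schema above (sketch only)."""
    db.subscribers.create_index([("email", 1)], unique=True)   # enforce unique emails
    db.articles.create_index([("link", 1)], unique=True)       # dedupe re-crawled articles
    db.articles.create_index([("published_at", 1)])            # "today's articles" query
    db.rss_feeds.create_index([("url", 1)], unique=True)       # one entry per feed
```

create_index is idempotent, so this can safely run on every service start.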

Security Architecture

Network Isolation

Exposed Services:

  • Backend API (port 5001) - Only exposed service

Internal Services:

  • MongoDB (port 27017) - Not accessible from host
  • Ollama (port 11434) - Not accessible from host
  • Crawler - No ports
  • Sender - No ports

Benefits:

  • Attack surface cut by two thirds: one exposed port (5001) instead of three
  • Database protected from external access
  • AI service protected from abuse
  • Defense in depth

Authentication

MongoDB:

  • Username/password authentication
  • Credentials in environment variables
  • Internal network only

Backend API:

  • No authentication (add in production)
  • Rate limiting recommended
  • IP whitelisting recommended

Data Protection

  • Subscriber emails stored securely
  • No sensitive data in logs
  • Environment variables for secrets
  • .env file in .gitignore

Technology Stack

Backend

  • Language: Python 3.11
  • Framework: Flask
  • Database: MongoDB 7.0
  • AI: Ollama (phi3:latest)

Infrastructure

  • Containerization: Docker & Docker Compose
  • Networking: Docker bridge network
  • Storage: Docker volumes
  • Scheduling: Python schedule library

Libraries

  • Web: Flask, Flask-CORS
  • Database: pymongo
  • Email: smtplib, email.mime
  • Scraping: requests, BeautifulSoup4, feedparser
  • Templating: Jinja2
  • AI: requests (Ollama API)

Deployment Architecture

Development

Local Machine
├── Docker Compose
│   ├── MongoDB (internal)
│   ├── Ollama (internal)
│   ├── Backend (exposed)
│   ├── Crawler (internal)
│   └── Sender (internal)
└── .env file

Production

Server
├── Reverse Proxy (nginx/Traefik)
│   ├── SSL/TLS
│   ├── Rate limiting
│   └── Authentication
├── Docker Compose
│   ├── MongoDB (internal)
│   ├── Ollama (internal, GPU)
│   ├── Backend (internal)
│   ├── Crawler (internal)
│   └── Sender (internal)
├── Monitoring
│   ├── Logs
│   ├── Metrics
│   └── Alerts
└── Backups
    ├── MongoDB dumps
    └── Configuration

Scalability

Current Limits

  • Single server deployment
  • Sequential article processing
  • Single MongoDB instance
  • No load balancing

Scaling Options

Horizontal Scaling:

  • Multiple crawler instances
  • Load-balanced backend
  • MongoDB replica set
  • Distributed Ollama

Vertical Scaling:

  • More CPU cores
  • More RAM
  • GPU acceleration (5-10x faster)
  • Faster storage

Optimization:

  • Batch processing
  • Caching
  • Database indexing
  • Connection pooling
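As one concrete example of the caching option, repeated title translations can be memoized: the same headline often appears in several feeds. The translation body here is a placeholder for the Ollama call.

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def translate_title(title):
    # Placeholder for the Ollama translation request; with the cache,
    # an identical title is only ever translated once per process
    return f"[EN] {title}"
```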

Monitoring

Health Checks

# Backend health
curl http://localhost:5001/health

# MongoDB health
docker-compose exec mongodb mongosh --eval "db.adminCommand('ping')"

# Ollama health
docker-compose exec ollama ollama list

Metrics

  • Article count
  • Subscriber count
  • Newsletter open rate
  • Click-through rate
  • Processing time
  • Error rate

Logs

# All services
docker-compose logs -f

# Specific service
docker-compose logs -f crawler
docker-compose logs -f backend

Backup & Recovery

MongoDB Backup

# Backup
docker-compose exec mongodb mongodump --out /backup

# Restore
docker-compose exec mongodb mongorestore /backup

Configuration Backup

  • backend/.env - Environment variables
  • docker-compose.yml - Service configuration
  • RSS feeds in MongoDB

Recovery Plan

  1. Restore MongoDB from backup
  2. Restore configuration files
  3. Restart services
  4. Verify functionality

Performance

CPU Mode

  • Translation: ~1.5s per title
  • Summarization: ~8s per article
  • 10 articles: ~115s total
  • Suitable for <20 articles/day

GPU Mode (5-10x faster)

  • Translation: ~0.3s per title
  • Summarization: ~2s per article
  • 10 articles: ~31s total
  • Suitable for high-volume processing

Resource Usage

CPU Mode:

  • CPU: 60-80%
  • RAM: 4-6GB
  • Disk: ~1GB (with model)

GPU Mode:

  • CPU: 10-20%
  • RAM: 2-3GB
  • GPU: 80-100%
  • VRAM: 3-4GB
  • Disk: ~1GB (with model)

Future Enhancements

Planned Features

  • Frontend dashboard
  • Real-time analytics
  • Multiple languages
  • Custom RSS feeds per subscriber
  • A/B testing for newsletters
  • Advanced tracking

Technical Improvements

  • Kubernetes deployment
  • Microservices architecture
  • Message queue (RabbitMQ/Redis)
  • Caching layer (Redis)
  • CDN for assets
  • Advanced monitoring (Prometheus/Grafana)

See SETUP.md for deployment guide and SECURITY.md for security best practices.