This commit is contained in:
2025-11-11 14:09:21 +01:00
parent bcd0a10576
commit 1075a91eac
57 changed files with 5598 additions and 1366 deletions

223
docs/API.md Normal file
View File

@@ -0,0 +1,223 @@
# API Reference
## Tracking Endpoints
### Track Email Open
```http
GET /api/track/pixel/<tracking_id>
```
Returns a 1x1 transparent PNG and logs the email open event.
**Response**: Image (image/png)
### Track Link Click
```http
GET /api/track/click/<tracking_id>
```
Logs the click event and redirects to the original article URL.
**Response**: 302 Redirect
## Analytics Endpoints
### Get Newsletter Metrics
```http
GET /api/analytics/newsletter/<newsletter_id>
```
Returns comprehensive metrics for a specific newsletter.
**Response**:
```json
{
"newsletter_id": "2024-01-15",
"total_sent": 100,
"total_opened": 75,
"open_rate": 75.0,
"unique_openers": 70,
"total_clicks": 45,
"unique_clickers": 30,
"click_through_rate": 30.0
}
```
### Get Article Performance
```http
GET /api/analytics/article/<article_url>
```
Returns performance metrics for a specific article.
**Response**:
```json
{
"article_url": "https://example.com/article",
"total_sent": 100,
"total_clicks": 25,
"click_rate": 25.0,
"unique_clickers": 20,
"newsletters": ["2024-01-15", "2024-01-16"]
}
```
### Get Subscriber Activity
```http
GET /api/analytics/subscriber/<email>
```
Returns activity status and engagement metrics for a subscriber.
**Response**:
```json
{
"email": "user@example.com",
"status": "active",
"last_opened_at": "2024-01-15T10:30:00",
"last_clicked_at": "2024-01-15T10:35:00",
"total_opens": 45,
"total_clicks": 20,
"newsletters_received": 50,
"newsletters_opened": 45
}
```
## Privacy Endpoints
### Delete Subscriber Data
```http
DELETE /api/tracking/subscriber/<email>
```
Deletes all tracking data for a subscriber (GDPR compliance).
**Response**:
```json
{
"success": true,
"message": "All tracking data deleted for user@example.com",
"deleted_counts": {
"newsletter_sends": 50,
"link_clicks": 25,
"subscriber_activity": 1
}
}
```
### Anonymize Old Data
```http
POST /api/tracking/anonymize
```
Anonymizes tracking data older than the retention period.
**Request Body** (optional):
```json
{
"retention_days": 90
}
```
**Response**:
```json
{
"success": true,
"message": "Anonymized tracking data older than 90 days",
"anonymized_counts": {
"newsletter_sends": 1250,
"link_clicks": 650
}
}
```
### Opt Out of Tracking
```http
POST /api/tracking/subscriber/<email>/opt-out
```
Disables tracking for a subscriber.
**Response**:
```json
{
"success": true,
"message": "Subscriber user@example.com has opted out of tracking"
}
```
### Opt In to Tracking
```http
POST /api/tracking/subscriber/<email>/opt-in
```
Re-enables tracking for a subscriber.
**Response**:
```json
{
"success": true,
"message": "Subscriber user@example.com has opted in to tracking"
}
```
## Examples
### Using curl
```bash
# Get newsletter metrics
curl http://localhost:5001/api/analytics/newsletter/2024-01-15
# Delete subscriber data
curl -X DELETE http://localhost:5001/api/tracking/subscriber/user@example.com
# Anonymize old data
curl -X POST http://localhost:5001/api/tracking/anonymize \
-H "Content-Type: application/json" \
-d '{"retention_days": 90}'
# Opt out of tracking
curl -X POST http://localhost:5001/api/tracking/subscriber/user@example.com/opt-out
```
### Using Python
```python
import requests
# Get newsletter metrics
response = requests.get('http://localhost:5001/api/analytics/newsletter/2024-01-15')
metrics = response.json()
print(f"Open rate: {metrics['open_rate']}%")
# Delete subscriber data
response = requests.delete('http://localhost:5001/api/tracking/subscriber/user@example.com')
result = response.json()
print(result['message'])
```
## Error Responses
All endpoints return standard error responses:
```json
{
"success": false,
"error": "Error message here"
}
```
HTTP Status Codes:
- `200` - Success
- `404` - Not found
- `500` - Server error

131
docs/ARCHITECTURE.md Normal file
View File

@@ -0,0 +1,131 @@
# System Architecture
## Overview
Munich News Daily is a fully automated news aggregation and newsletter system with the following components:
```
┌─────────────────────────────────────────────────────────────────┐
│ Munich News Daily System │
└─────────────────────────────────────────────────────────────────┘
6:00 AM Berlin → News Crawler
Fetches RSS feeds
Extracts full content
Generates AI summaries
Saves to MongoDB
7:00 AM Berlin → Newsletter Sender
Waits for crawler
Fetches articles
Generates newsletter
Sends to subscribers
✅ Done!
```
## Components
### 1. MongoDB Database
- **Purpose**: Central data storage
- **Collections**:
- `articles`: News articles with summaries
- `subscribers`: Email subscribers
- `rss_feeds`: RSS feed sources
- `newsletter_sends`: Email tracking data
- `link_clicks`: Link click tracking
- `subscriber_activity`: Engagement metrics
### 2. News Crawler
- **Schedule**: Daily at 6:00 AM Berlin time
- **Functions**:
- Fetches articles from RSS feeds
- Extracts full article content
- Generates AI summaries using Ollama
- Saves to MongoDB
- **Technology**: Python, BeautifulSoup, Ollama
### 3. Newsletter Sender
- **Schedule**: Daily at 7:00 AM Berlin time
- **Functions**:
- Waits for crawler to finish (max 30 min)
- Fetches today's articles
- Generates HTML newsletter
- Injects tracking pixels
- Sends to all subscribers
- **Technology**: Python, Jinja2, SMTP
### 4. Backend API (Optional)
- **Purpose**: Tracking and analytics
- **Endpoints**:
- `/api/track/pixel/<id>` - Email open tracking
- `/api/track/click/<id>` - Link click tracking
- `/api/analytics/*` - Engagement metrics
- `/api/tracking/*` - Privacy controls
- **Technology**: Flask, Python
## Data Flow
```
RSS Feeds → Crawler → MongoDB → Sender → Subscribers
Backend API
Analytics
```
## Coordination
The sender waits for the crawler to ensure fresh content:
1. Sender starts at 7:00 AM
2. Checks for recent articles every 30 seconds
3. Maximum wait time: 30 minutes
4. Proceeds once crawler finishes or timeout
## Technology Stack
- **Backend**: Python 3.11
- **Database**: MongoDB 7.0
- **AI**: Ollama (Phi3 model)
- **Scheduling**: Python schedule library
- **Email**: SMTP with HTML templates
- **Tracking**: Pixel tracking + redirect URLs
- **Infrastructure**: Docker & Docker Compose
## Deployment
All components run in Docker containers:
```
docker-compose up -d
```
Containers:
- `munich-news-mongodb` - Database
- `munich-news-crawler` - Crawler service
- `munich-news-sender` - Sender service
## Security
- MongoDB authentication enabled
- Environment variables for secrets
- HTTPS for tracking URLs (production)
- GDPR-compliant data retention
- Privacy controls (opt-out, deletion)
## Monitoring
- Docker logs for all services
- MongoDB for data verification
- Health checks on containers
- Engagement metrics via API
## Scalability
- Horizontal: Add more crawler instances
- Vertical: Increase container resources
- Database: MongoDB sharding if needed
- Caching: Redis for API responses (future)

106
docs/BACKEND_STRUCTURE.md Normal file
View File

@@ -0,0 +1,106 @@
# Backend Structure
The backend has been modularized for better maintainability and scalability.
## Directory Structure
```
backend/
├── app.py # Main Flask application entry point
├── config.py # Configuration management
├── database.py # Database connection and initialization
├── requirements.txt # Python dependencies
├── .env # Environment variables
├── routes/ # API route handlers (blueprints)
│ ├── __init__.py
│ ├── subscription_routes.py # /api/subscribe, /api/unsubscribe
│ ├── news_routes.py # /api/news, /api/stats
│ ├── rss_routes.py # /api/rss-feeds (CRUD operations)
│ ├── ollama_routes.py # /api/ollama/* (AI features)
│ ├── tracking_routes.py # /api/track/* (email tracking)
│ └── analytics_routes.py # /api/analytics/* (engagement metrics)
└── services/ # Business logic layer
├── __init__.py
├── news_service.py # News fetching and storage logic
├── email_service.py # Newsletter email sending
├── ollama_service.py # Ollama AI integration
├── tracking_service.py # Email tracking (opens/clicks)
└── analytics_service.py # Engagement analytics
```
## Key Components
### app.py
- Main Flask application
- Registers all blueprints
- Minimal code, just wiring things together
### config.py
- Centralized configuration
- Loads environment variables
- Single source of truth for all settings
### database.py
- MongoDB connection setup
- Collection definitions
- Database initialization with indexes
### routes/
Each route file is a Flask Blueprint handling specific API endpoints:
- **subscription_routes.py**: User subscription management
- **news_routes.py**: News fetching and statistics
- **rss_routes.py**: RSS feed management (add/remove/list/toggle)
- **ollama_routes.py**: AI/Ollama integration endpoints
- **tracking_routes.py**: Email tracking (pixel, click redirects, data deletion)
- **analytics_routes.py**: Engagement analytics (open rates, click rates, subscriber activity)
### services/
Business logic separated from route handlers:
- **news_service.py**: Fetches news from RSS feeds, saves to database
- **email_service.py**: Sends newsletter emails to subscribers
- **ollama_service.py**: Communicates with Ollama AI server
- **tracking_service.py**: Email tracking logic (tracking IDs, pixel generation, click logging)
- **analytics_service.py**: Analytics calculations (open rates, click rates, activity classification)
## Benefits of This Structure
1. **Separation of Concerns**: Routes handle HTTP, services handle business logic
2. **Testability**: Each module can be tested independently
3. **Maintainability**: Easy to find and modify specific functionality
4. **Scalability**: Easy to add new routes or services
5. **Reusability**: Services can be used by multiple routes
## Adding New Features
### To add a new API endpoint:
1. Create a new route file in `routes/` or add to existing one
2. Create a Blueprint and define routes
3. Register the blueprint in `app.py`
### To add new business logic:
1. Create a new service file in `services/`
2. Import and use in your route handlers
### Example:
```python
# services/my_service.py
def my_business_logic():
return "Hello"
# routes/my_routes.py
from flask import Blueprint
from services.my_service import my_business_logic
my_bp = Blueprint('my', __name__)
@my_bp.route('/api/my-endpoint')
def my_endpoint():
result = my_business_logic()
return {'message': result}
# app.py
from routes.my_routes import my_bp
app.register_blueprint(my_bp)
```

136
docs/CHANGELOG.md Normal file
View File

@@ -0,0 +1,136 @@
# Changelog
## [Unreleased] - 2024-11-10
### Added - Major Refactoring
#### Backend Modularization
- ✅ Restructured backend into modular architecture
- ✅ Created separate route blueprints:
- `subscription_routes.py` - User subscriptions
- `news_routes.py` - News fetching and stats
- `rss_routes.py` - RSS feed management (CRUD)
- `ollama_routes.py` - AI integration
- ✅ Created service layer:
- `news_service.py` - News fetching logic
- `email_service.py` - Newsletter sending
- `ollama_service.py` - AI communication
- ✅ Centralized configuration in `config.py`
- ✅ Separated database logic in `database.py`
- ✅ Reduced main `app.py` from 700+ lines to 27 lines
#### RSS Feed Management
- ✅ Dynamic RSS feed management via API
- ✅ Add/remove/list/toggle RSS feeds without code changes
- ✅ Unique index on RSS feed URLs (prevents duplicates)
- ✅ Default feeds auto-initialized on first run
- ✅ Created `fix_duplicates.py` utility script
#### News Crawler Microservice
- ✅ Created standalone `news_crawler/` microservice
- ✅ Web scraping with BeautifulSoup
- ✅ Smart content extraction using multiple selectors
- ✅ Full article content storage in MongoDB
- ✅ Word count calculation
- ✅ Duplicate prevention (skips already-crawled articles)
- ✅ Rate limiting (1 second between requests)
- ✅ Can run independently or scheduled
- ✅ Docker support for crawler
- ✅ Comprehensive documentation
#### API Endpoints
New endpoints added:
- `GET /api/rss-feeds` - List all RSS feeds
- `POST /api/rss-feeds` - Add new RSS feed
- `DELETE /api/rss-feeds/<id>` - Remove RSS feed
- `PATCH /api/rss-feeds/<id>/toggle` - Toggle feed active status
#### Documentation
- ✅ Created `ARCHITECTURE.md` - System architecture overview
- ✅ Created `backend/STRUCTURE.md` - Backend structure guide
- ✅ Created `news_crawler/README.md` - Crawler documentation
- ✅ Created `news_crawler/QUICKSTART.md` - Quick start guide
- ✅ Created `news_crawler/test_crawler.py` - Test suite
- ✅ Updated main `README.md` with new features
- ✅ Updated `DATABASE_SCHEMA.md` with new fields
#### Configuration
- ✅ Added `FLASK_PORT` environment variable
- ✅ Fixed `OLLAMA_MODEL` typo in `.env`
- ✅ Port 5001 default to avoid macOS AirPlay conflict
### Changed
- Backend structure: Monolithic → Modular
- RSS feeds: Hardcoded → Database-driven
- Article storage: Summary only → Full content support
- Configuration: Scattered → Centralized
### Technical Improvements
- Separation of concerns (routes vs services)
- Better testability
- Easier maintenance
- Scalable architecture
- Independent microservices
- Proper error handling
- Comprehensive logging
### Database Schema Updates
Articles collection now includes:
- `full_content` - Full article text
- `word_count` - Number of words
- `crawled_at` - When content was crawled
RSS Feeds collection added:
- `name` - Feed name
- `url` - Feed URL (unique)
- `active` - Active status
- `created_at` - Creation timestamp
### Files Added
```
backend/
├── config.py
├── database.py
├── fix_duplicates.py
├── STRUCTURE.md
├── routes/
│ ├── __init__.py
│ ├── subscription_routes.py
│ ├── news_routes.py
│ ├── rss_routes.py
│ └── ollama_routes.py
└── services/
├── __init__.py
├── news_service.py
├── email_service.py
└── ollama_service.py
news_crawler/
├── crawler_service.py
├── test_crawler.py
├── requirements.txt
├── .gitignore
├── Dockerfile
├── docker-compose.yml
├── README.md
└── QUICKSTART.md
Root:
├── ARCHITECTURE.md
└── CHANGELOG.md
```
### Files Removed
- Old monolithic `backend/app.py` (replaced with modular version)
### Next Steps (Future Enhancements)
- [ ] Frontend UI for RSS feed management
- [ ] Automatic article summarization with Ollama
- [ ] Scheduled newsletter sending
- [ ] Article categorization and tagging
- [ ] Search functionality
- [ ] User preferences (categories, frequency)
- [ ] Analytics dashboard
- [ ] API rate limiting
- [ ] Caching layer (Redis)
- [ ] Message queue for crawler (Celery)

View File

@@ -0,0 +1,306 @@
# How the News Crawler Works
## 🎯 Overview
The crawler dynamically extracts article metadata from any website using multiple fallback strategies.
## 📊 Flow Diagram
```
RSS Feed URL
Parse RSS Feed
For each article link:
┌─────────────────────────────────────┐
│ 1. Fetch HTML Page │
│ GET https://example.com/article │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ 2. Parse with BeautifulSoup │
│ soup = BeautifulSoup(html) │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ 3. Clean HTML │
│ Remove: scripts, styles, nav, │
│ footer, header, ads │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ 4. Extract Title │
│ Try: H1 → OG meta → Twitter → │
│ Title tag │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ 5. Extract Author │
│ Try: Meta author → rel=author → │
│ Class names → JSON-LD │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ 6. Extract Date │
│ Try: <time> → Meta tags → │
│ Class names → JSON-LD │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ 7. Extract Content │
│ Try: <article> → Class names → │
│ <main> → <body> │
│ Filter short paragraphs │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ 8. Save to MongoDB │
│ { │
│ title, author, date, │
│ content, word_count │
│ } │
└─────────────────────────────────────┘
Wait 1 second (rate limiting)
Next article
```
## 🔍 Detailed Example
### Input: RSS Feed Entry
```xml
<item>
<title>New U-Bahn Line Opens</title>
<link>https://www.sueddeutsche.de/muenchen/article-123</link>
<pubDate>Mon, 10 Nov 2024 10:00:00 +0100</pubDate>
</item>
```
### Step 1: Fetch HTML
```python
url = "https://www.sueddeutsche.de/muenchen/article-123"
response = requests.get(url)
html = response.content
```
### Step 2: Parse HTML
```python
soup = BeautifulSoup(html, 'html.parser')
```
### Step 3: Extract Title
```python
# Try H1
h1 = soup.find('h1')
# Result: "New U-Bahn Line Opens in Munich"
# If no H1, try OG meta
og_title = soup.find('meta', property='og:title')
# Fallback chain continues...
```
### Step 4: Extract Author
```python
# Try meta author
meta_author = soup.find('meta', name='author')
# Result: None
# Try class names
author_elem = soup.select_one('[class*="author"]')
# Result: "Max Mustermann"
```
### Step 5: Extract Date
```python
# Try time tag
time_tag = soup.find('time')
# Result: "2024-11-10T10:00:00Z"
```
### Step 6: Extract Content
```python
# Try article tag
article = soup.find('article')
paragraphs = article.find_all('p')
# Filter paragraphs
content = []
for p in paragraphs:
text = p.get_text().strip()
if len(text) >= 50: # Keep substantial paragraphs
content.append(text)
full_content = '\n\n'.join(content)
# Result: "The new U-Bahn line connecting the city center..."
```
### Step 7: Save to Database
```python
article_doc = {
'title': 'New U-Bahn Line Opens in Munich',
'author': 'Max Mustermann',
'link': 'https://www.sueddeutsche.de/muenchen/article-123',
'summary': 'Short summary from RSS...',
'full_content': 'The new U-Bahn line connecting...',
'word_count': 1250,
'source': 'Süddeutsche Zeitung München',
'published_at': '2024-11-10T10:00:00Z',
'crawled_at': datetime.utcnow(),
'created_at': datetime.utcnow()
}
db.articles.update_one(
{'link': article_url},
{'$set': article_doc},
upsert=True
)
```
## 🎨 What Makes It "Dynamic"?
### Traditional Approach (Hardcoded)
```python
# Only works for one specific site
title = soup.find('h1', class_='article-title').text
author = soup.find('span', class_='author-name').text
```
❌ Breaks when site changes
❌ Doesn't work on other sites
### Our Approach (Dynamic)
```python
# Works on ANY site
title = extract_title(soup) # Tries 4 different methods
author = extract_author(soup) # Tries 5 different methods
```
✅ Adapts to different HTML structures
✅ Falls back to alternatives
✅ Works across multiple sites
## 🛡️ Robustness Features
### 1. Multiple Strategies
Each field has 4-6 extraction strategies
```python
def extract_title(soup):
# Try strategy 1
if h1 := soup.find('h1'):
return h1.text
# Try strategy 2
if og_title := soup.find('meta', property='og:title'):
return og_title['content']
# Try strategy 3...
# Try strategy 4...
```
### 2. Validation
```python
# Title must be reasonable length
if title and len(title) > 10:
return title
# Author must be < 100 chars
if author and len(author) < 100:
return author
```
### 3. Cleaning
```python
# Remove site name from title
if ' | ' in title:
title = title.split(' | ')[0]
# Remove "By" from author
author = author.replace('By ', '').strip()
```
### 4. Error Handling
```python
try:
data = extract_article_content(url)
except Timeout:
print("Timeout - skip")
except RequestException:
print("Network error - skip")
except Exception:
print("Unknown error - skip")
```
## 📈 Success Metrics
After crawling, you'll see:
```
📰 Crawling feed: Süddeutsche Zeitung München
🔍 Crawling: New U-Bahn Line Opens...
✓ Saved (1250 words)
Title: ✓ Found
Author: ✓ Found (Max Mustermann)
Date: ✓ Found (2024-11-10T10:00:00Z)
Content: ✓ Found (1250 words)
```
## 🗄️ Database Result
**Before Crawling:**
```javascript
{
title: "New U-Bahn Line Opens",
link: "https://example.com/article",
summary: "Short RSS summary...",
source: "Süddeutsche Zeitung"
}
```
**After Crawling:**
```javascript
{
title: "New U-Bahn Line Opens in Munich", // ← Enhanced
author: "Max Mustermann", // ← NEW!
link: "https://example.com/article",
summary: "Short RSS summary...",
full_content: "The new U-Bahn line...", // ← NEW! (1250 words)
word_count: 1250, // ← NEW!
source: "Süddeutsche Zeitung",
published_at: "2024-11-10T10:00:00Z", // ← Enhanced
crawled_at: ISODate("2024-11-10T16:30:00Z"), // ← NEW!
created_at: ISODate("2024-11-10T16:00:00Z")
}
```
## 🚀 Running the Crawler
```bash
cd news_crawler
pip install -r requirements.txt
python crawler_service.py 10
```
Output:
```
============================================================
🚀 Starting RSS Feed Crawler
============================================================
Found 3 active feed(s)
📰 Crawling feed: Süddeutsche Zeitung München
🔍 Crawling: New U-Bahn Line Opens...
✓ Saved (1250 words)
🔍 Crawling: Munich Weather Update...
✓ Saved (450 words)
✓ Crawled 2 articles
============================================================
✓ Crawling Complete!
Total feeds processed: 3
Total articles crawled: 15
Duration: 45.23 seconds
============================================================
```
Now you have rich, structured article data ready for AI processing! 🎉

271
docs/DATABASE_SCHEMA.md Normal file
View File

@@ -0,0 +1,271 @@
# MongoDB Database Schema
This document describes the MongoDB collections and their structure for Munich News Daily.
## Collections
### 1. Articles Collection (`articles`)
Stores all news articles aggregated from Munich news sources.
**Document Structure:**
```javascript
{
_id: ObjectId, // Auto-generated MongoDB ID
title: String, // Article title (required)
author: String, // Article author (optional, extracted during crawl)
link: String, // Article URL (required, unique)
content: String, // Full article content (no length limit)
summary: String, // AI-generated English summary (≤150 words)
word_count: Number, // Word count of full content
summary_word_count: Number, // Word count of AI summary
source: String, // News source name (e.g., "Süddeutsche Zeitung München")
published_at: String, // Original publication date from RSS feed or crawled
crawled_at: DateTime, // When article content was crawled (UTC)
summarized_at: DateTime, // When AI summary was generated (UTC)
created_at: DateTime // When article was added to database (UTC)
}
```
**Indexes:**
- `link` - Unique index to prevent duplicate articles
- `created_at` - Index for efficient sorting by date
**Example Document:**
```javascript
{
_id: ObjectId("507f1f77bcf86cd799439011"),
title: "New U-Bahn Line Opens in Munich",
author: "Max Mustermann",
link: "https://www.sueddeutsche.de/muenchen/ubahn-1.123456",
content: "The new U-Bahn line connecting the city center with the airport opened today. Mayor Dieter Reiter attended the opening ceremony... [full article text continues]",
summary: "Munich's new U-Bahn line connecting the city center to the airport opened today with Mayor Dieter Reiter in attendance. The line features 10 stations and runs every 10 minutes during peak hours, significantly reducing travel time. Construction took five years and cost approximately 2 billion euros.",
word_count: 1250,
summary_word_count: 48,
source: "Süddeutsche Zeitung München",
published_at: "Mon, 15 Jan 2024 10:00:00 +0100",
crawled_at: ISODate("2024-01-15T09:30:00.000Z"),
summarized_at: ISODate("2024-01-15T09:30:15.000Z"),
created_at: ISODate("2024-01-15T09:00:00.000Z")
}
```
### 2. Subscribers Collection (`subscribers`)
Stores all newsletter subscribers.
**Document Structure:**
```javascript
{
_id: ObjectId, // Auto-generated MongoDB ID
email: String, // Subscriber email (required, unique, lowercase)
subscribed_at: DateTime, // When user subscribed (UTC)
status: String // Subscription status: 'active' or 'inactive'
}
```
**Indexes:**
- `email` - Unique index for email lookups and preventing duplicates
- `subscribed_at` - Index for analytics and sorting
**Example Document:**
```javascript
{
_id: ObjectId("507f1f77bcf86cd799439012"),
email: "user@example.com",
subscribed_at: ISODate("2024-01-15T08:30:00.000Z"),
status: "active"
}
```
### 3. Newsletter Sends Collection (`newsletter_sends`)
Tracks each newsletter sent to each subscriber for email open tracking.
**Document Structure:**
```javascript
{
_id: ObjectId, // Auto-generated MongoDB ID
newsletter_id: String, // Unique ID for this newsletter batch (date-based)
subscriber_email: String, // Recipient email
tracking_id: String, // Unique tracking ID for this send (UUID)
sent_at: DateTime, // When email was sent (UTC)
opened: Boolean, // Whether email was opened
first_opened_at: DateTime, // First open timestamp (null if not opened)
last_opened_at: DateTime, // Most recent open timestamp
open_count: Number, // Number of times opened
created_at: DateTime // Record creation time (UTC)
}
```
**Indexes:**
- `tracking_id` - Unique index for fast pixel request lookups
- `newsletter_id` - Index for analytics queries
- `subscriber_email` - Index for user activity queries
- `sent_at` - Index for time-based queries
**Example Document:**
```javascript
{
_id: ObjectId("507f1f77bcf86cd799439013"),
newsletter_id: "2024-01-15",
subscriber_email: "user@example.com",
tracking_id: "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
sent_at: ISODate("2024-01-15T08:00:00.000Z"),
opened: true,
first_opened_at: ISODate("2024-01-15T09:30:00.000Z"),
last_opened_at: ISODate("2024-01-15T14:20:00.000Z"),
open_count: 3,
created_at: ISODate("2024-01-15T08:00:00.000Z")
}
```
### 4. Link Clicks Collection (`link_clicks`)
Tracks individual link clicks from newsletters.
**Document Structure:**
```javascript
{
_id: ObjectId, // Auto-generated MongoDB ID
tracking_id: String, // Unique tracking ID for this link (UUID)
newsletter_id: String, // Which newsletter this link was in
subscriber_email: String, // Who clicked
article_url: String, // Original article URL
article_title: String, // Article title for reporting
clicked_at: DateTime, // When link was clicked (UTC)
user_agent: String, // Browser/client info
created_at: DateTime // Record creation time (UTC)
}
```
**Indexes:**
- `tracking_id` - Unique index for fast redirect request lookups
- `newsletter_id` - Index for analytics queries
- `article_url` - Index for article performance queries
- `subscriber_email` - Index for user activity queries
**Example Document:**
```javascript
{
_id: ObjectId("507f1f77bcf86cd799439014"),
tracking_id: "b2c3d4e5-f6a7-8901-bcde-f12345678901",
newsletter_id: "2024-01-15",
subscriber_email: "user@example.com",
article_url: "https://www.sueddeutsche.de/muenchen/ubahn-1.123456",
article_title: "New U-Bahn Line Opens in Munich",
clicked_at: ISODate("2024-01-15T09:35:00.000Z"),
user_agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
created_at: ISODate("2024-01-15T09:35:00.000Z")
}
```
### 5. Subscriber Activity Collection (`subscriber_activity`)
Aggregated activity status for each subscriber.
**Document Structure:**
```javascript
{
_id: ObjectId, // Auto-generated MongoDB ID
email: String, // Subscriber email (unique)
status: String, // 'active', 'inactive', or 'dormant'
last_opened_at: DateTime, // Most recent email open (UTC)
last_clicked_at: DateTime, // Most recent link click (UTC)
total_opens: Number, // Lifetime open count
total_clicks: Number, // Lifetime click count
newsletters_received: Number, // Total newsletters sent
newsletters_opened: Number, // Total newsletters opened
updated_at: DateTime // Last status update (UTC)
}
```
**Indexes:**
- `email` - Unique index for fast lookups
- `status` - Index for filtering by activity level
- `last_opened_at` - Index for time-based queries
**Activity Status Classification:**
- **active**: Opened an email in the last 30 days
- **inactive**: No opens in 30-60 days
- **dormant**: No opens in 60+ days
**Example Document:**
```javascript
{
_id: ObjectId("507f1f77bcf86cd799439015"),
email: "user@example.com",
status: "active",
last_opened_at: ISODate("2024-01-15T09:30:00.000Z"),
last_clicked_at: ISODate("2024-01-15T09:35:00.000Z"),
total_opens: 45,
total_clicks: 23,
newsletters_received: 60,
newsletters_opened: 45,
updated_at: ISODate("2024-01-15T10:00:00.000Z")
}
```
## Design Decisions
### Why MongoDB?
1. **Flexibility**: Easy to add new fields without schema migrations
2. **Scalability**: Handles large volumes of articles and subscribers efficiently
3. **Performance**: Indexes on frequently queried fields (link, email, created_at)
4. **Document Model**: Natural fit for news articles and subscriber data
### Schema Choices
1. **Unique Link Index**: Prevents duplicate articles from being stored, even if fetched multiple times
2. **Status Field**: Soft delete for subscribers (set to 'inactive' instead of deleting) - allows for analytics and easy re-subscription
3. **UTC Timestamps**: All dates stored in UTC for consistency across timezones
4. **Lowercase Emails**: Emails stored in lowercase to prevent case-sensitivity issues
### Future Enhancements
Potential fields to add in the future:
**Articles:**
- `category`: String (e.g., "politics", "sports", "culture")
- `tags`: Array of Strings
- `image_url`: String
- `sent_in_newsletter`: Boolean (track if article was sent)
- `sent_at`: DateTime (when article was included in newsletter)
**Subscribers:**
- `preferences`: Object (newsletter frequency, categories, etc.)
- `last_sent_at`: DateTime (last newsletter sent date)
- `unsubscribed_at`: DateTime (when user unsubscribed)
- `verification_token`: String (for email verification)
## AI Summarization Workflow
When the crawler processes an article:
1. **Extract Content**: Full article text is extracted from the webpage
2. **Summarize with Ollama**: If `OLLAMA_ENABLED=true`, the content is sent to Ollama for summarization
3. **Store Both**: Both the original `content` and AI-generated `summary` are stored
4. **Fallback**: If Ollama is unavailable or fails, only the original content is stored
### Summary Field Details
- **Language**: Always in English, regardless of source article language
- **Length**: Maximum 150 words
- **Format**: Plain text, concise and clear
- **Purpose**: Quick preview for newsletters and frontend display
### Querying Articles
```javascript
// Get articles with AI summaries
db.articles.find({ summary: { $exists: true, $ne: null } })
// Get articles without summaries
db.articles.find({ summary: { $exists: false } })
// Count summarized articles
db.articles.countDocuments({ summary: { $exists: true, $ne: null } })
```

274
docs/DEPLOYMENT.md Normal file
View File

@@ -0,0 +1,274 @@
# Deployment Guide
## Quick Start
```bash
# 1. Clone repository
git clone <repository-url>
cd munich-news
# 2. Configure environment
cp backend/.env.example backend/.env
# Edit backend/.env with your settings
# 3. Start system
docker-compose up -d
# 4. View logs
docker-compose logs -f
```
## Environment Configuration
### Required Settings
Edit `backend/.env`:
```env
# Email (Required)
SMTP_SERVER=smtp.gmail.com
SMTP_PORT=587
EMAIL_USER=your-email@gmail.com
EMAIL_PASSWORD=your-app-password
# MongoDB (Optional - defaults provided)
MONGODB_URI=mongodb://localhost:27017/
# Tracking (Optional)
TRACKING_ENABLED=true
TRACKING_API_URL=http://localhost:5001
```
### Optional Settings
```env
# Newsletter
NEWSLETTER_MAX_ARTICLES=10
NEWSLETTER_HOURS_LOOKBACK=24
# Ollama AI
OLLAMA_ENABLED=true
OLLAMA_BASE_URL=http://127.0.0.1:11434
OLLAMA_MODEL=phi3:latest
# Tracking
TRACKING_DATA_RETENTION_DAYS=90
```
## Production Deployment
### 1. Set MongoDB Password
```bash
export MONGO_PASSWORD=your-secure-password
docker-compose up -d
```
### 2. Use HTTPS for Tracking
Update `backend/.env`:
```env
TRACKING_API_URL=https://yourdomain.com
```
### 3. Configure Log Rotation
Add to `docker-compose.yml`:
```yaml
services:
crawler:
logging:
driver: "json-file"
options:
max-size: "10m"
max-file: "3"
```
### 4. Set Up Backups
```bash
# Daily MongoDB backup
0 3 * * * docker exec munich-news-mongodb mongodump --out=/data/backup/$(date +\%Y\%m\%d)
```
### 5. Enable Backend API
Uncomment backend service in `docker-compose.yml`:
```yaml
backend:
build:
context: ./backend
ports:
- "5001:5001"
# ... rest of config
```
## Schedule Configuration
### Change Crawler Time
Edit `news_crawler/scheduled_crawler.py`:
```python
schedule.every().day.at("06:00").do(run_crawler) # Change time
```
### Change Sender Time
Edit `news_sender/scheduled_sender.py`:
```python
schedule.every().day.at("07:00").do(run_sender) # Change time
```
Rebuild after changes:
```bash
docker-compose up -d --build
```
## Database Setup
### Add RSS Feeds
```bash
mongosh munich_news
db.rss_feeds.insertMany([
{
name: "Süddeutsche Zeitung München",
url: "https://www.sueddeutsche.de/muenchen/rss",
active: true
},
{
name: "Merkur München",
url: "https://www.merkur.de/lokales/muenchen/rss/feed.rss",
active: true
}
])
```
### Add Subscribers
```bash
mongosh munich_news
db.subscribers.insertMany([
{
email: "user1@example.com",
active: true,
tracking_enabled: true,
subscribed_at: new Date()
},
{
email: "user2@example.com",
active: true,
tracking_enabled: true,
subscribed_at: new Date()
}
])
```
## Monitoring
### Check Container Status
```bash
docker-compose ps
```
### View Logs
```bash
# All services
docker-compose logs -f
# Specific service
docker-compose logs -f crawler
docker-compose logs -f sender
```
### Check Database
```bash
mongosh munich_news
// Count articles
db.articles.countDocuments()
// Count subscribers
db.subscribers.countDocuments({ active: true })
// View recent articles
db.articles.find().sort({ crawled_at: -1 }).limit(5)
```
## Troubleshooting
### Containers Won't Start
```bash
# Check logs
docker-compose logs
# Rebuild
docker-compose up -d --build
# Reset everything
docker-compose down -v
docker-compose up -d
```
### Crawler Not Finding Articles
```bash
# Check RSS feeds
mongosh munich_news --eval "db.rss_feeds.find({ active: true })"
# Test manually
docker-compose exec crawler python crawler_service.py 5
```
### Newsletter Not Sending
```bash
# Test email
docker-compose exec sender python sender_service.py test your-email@example.com
# Check SMTP config
docker-compose exec sender python -c "from sender_service import Config; print(Config.SMTP_SERVER)"
```
## Maintenance
### Update System
```bash
git pull
docker-compose up -d --build
```
### Backup Database
```bash
docker exec munich-news-mongodb mongodump --out=/data/backup
```
### Clean Old Data
```bash
mongosh munich_news
// Delete articles older than 90 days
db.articles.deleteMany({
crawled_at: { $lt: new Date(Date.now() - 90*24*60*60*1000) }
})
```
## Security Checklist
- [ ] Set strong MongoDB password
- [ ] Use HTTPS for tracking URLs
- [ ] Secure SMTP credentials
- [ ] Enable firewall rules
- [ ] Set up log rotation
- [ ] Configure backups
- [ ] Monitor for failures
- [ ] Keep dependencies updated

View File

@@ -0,0 +1,353 @@
# Content Extraction Strategies
The crawler uses multiple strategies to dynamically extract article metadata from any website.
## 🎯 What Gets Extracted
1. **Title** - Article headline
2. **Author** - Article writer/journalist
3. **Published Date** - When article was published
4. **Content** - Main article text
5. **Description** - Meta description/summary
## 📋 Extraction Strategies
### 1. Title Extraction
Tries multiple methods in order of reliability:
#### Strategy 1: H1 Tag
```html
<h1>Article Title Here</h1>
```
✅ Most reliable - usually the main headline
#### Strategy 2: Open Graph Meta Tag
```html
<meta property="og:title" content="Article Title Here" />
```
✅ Used by Facebook, very reliable
#### Strategy 3: Twitter Card Meta Tag
```html
<meta name="twitter:title" content="Article Title Here" />
```
✅ Used by Twitter, reliable
#### Strategy 4: Title Tag (Fallback)
```html
<title>Article Title | Site Name</title>
```
⚠️ Often includes site name, needs cleaning
**Cleaning:**
- Removes " | Site Name"
- Removes " - Site Name"
---
### 2. Author Extraction
Tries multiple methods:
#### Strategy 1: Meta Author Tag
```html
<meta name="author" content="John Doe" />
```
✅ Standard HTML meta tag
#### Strategy 2: Rel="author" Link
```html
<a rel="author" href="/author/john-doe">John Doe</a>
```
✅ Semantic HTML
#### Strategy 3: Common Class Names
```html
<div class="author-name">John Doe</div>
<span class="byline">By John Doe</span>
<p class="writer">John Doe</p>
```
✅ Searches for: author-name, author, byline, writer
#### Strategy 4: Schema.org Markup
```html
<span itemprop="author">John Doe</span>
```
✅ Structured data
#### Strategy 5: JSON-LD Structured Data
```html
<script type="application/ld+json">
{
"@type": "NewsArticle",
"author": {
"@type": "Person",
"name": "John Doe"
}
}
</script>
```
✅ Most structured, very reliable
**Cleaning:**
- Removes "By " prefix
- Validates length (< 100 chars)
---
### 3. Date Extraction
Tries multiple methods:
#### Strategy 1: Time Tag with Datetime
```html
<time datetime="2024-11-10T10:00:00Z">November 10, 2024</time>
```
✅ Most reliable - ISO format
#### Strategy 2: Article Published Time Meta
```html
<meta property="article:published_time" content="2024-11-10T10:00:00Z" />
```
✅ Open Graph standard
#### Strategy 3: OG Published Time
```html
<meta property="og:published_time" content="2024-11-10T10:00:00Z" />
```
✅ Facebook standard
#### Strategy 4: Common Class Names
```html
<span class="publish-date">November 10, 2024</span>
<time class="published">2024-11-10</time>
<div class="timestamp">10:00 AM, Nov 10</div>
```
✅ Searches for: publish-date, published, date, timestamp
#### Strategy 5: Schema.org Markup
```html
<meta itemprop="datePublished" content="2024-11-10T10:00:00Z" />
```
✅ Structured data
#### Strategy 6: JSON-LD Structured Data
```html
<script type="application/ld+json">
{
"@type": "NewsArticle",
"datePublished": "2024-11-10T10:00:00Z"
}
</script>
```
✅ Most structured
---
### 4. Content Extraction
Tries multiple methods:
#### Strategy 1: Semantic HTML Tags
```html
<article>
<p>Article content here...</p>
</article>
```
✅ Best practice HTML5
#### Strategy 2: Common Class Names
```html
<div class="article-content">...</div>
<div class="article-body">...</div>
<div class="post-content">...</div>
<div class="entry-content">...</div>
<div class="story-body">...</div>
```
✅ Searches for common patterns
#### Strategy 3: Schema.org Markup
```html
<div itemprop="articleBody">
<p>Content here...</p>
</div>
```
✅ Structured data
#### Strategy 4: Main Tag
```html
<main>
<p>Content here...</p>
</main>
```
✅ Semantic HTML5
#### Strategy 5: Body Tag (Fallback)
```html
<body>
<p>Content here...</p>
</body>
```
⚠️ Last resort, may include navigation
**Content Filtering:**
- Removes `<script>`, `<style>`, `<nav>`, `<footer>`, `<header>`, `<aside>`
- Filters out short paragraphs (< 50 chars) - likely ads/navigation
- Keeps only substantial paragraphs
- **No length limit** - stores full article content
---
## 🔍 How It Works
### Example: Crawling a News Article
```python
# 1. Fetch HTML
response = requests.get(article_url)
soup = BeautifulSoup(response.content, 'html.parser')
# 2. Extract title (tries 4 strategies)
title = extract_title(soup)
# Result: "New U-Bahn Line Opens in Munich"
# 3. Extract author (tries 5 strategies)
author = extract_author(soup)
# Result: "Max Mustermann"
# 4. Extract date (tries 6 strategies)
published_date = extract_date(soup)
# Result: "2024-11-10T10:00:00Z"
# 5. Extract content (tries 5 strategies)
content = extract_main_content(soup)
# Result: "The new U-Bahn line connecting..."
# 6. Save to database
article_doc = {
'title': title,
'author': author,
'published_at': published_date,
'full_content': content,
'word_count': len(content.split())
}
```
---
## 📊 Success Rates by Strategy
Based on common news sites:
| Strategy | Success Rate | Notes |
|----------|-------------|-------|
| H1 for title | 95% | Almost universal |
| OG meta tags | 90% | Most modern sites |
| Time tag for date | 85% | HTML5 sites |
| JSON-LD | 70% | Growing adoption |
| Class name patterns | 60% | Varies by site |
| Schema.org | 50% | Not widely adopted |
---
## 🎨 Real-World Examples
### Example 1: Süddeutsche Zeitung
```html
<article>
<h1>New U-Bahn Line Opens</h1>
<span class="author">Max Mustermann</span>
<time datetime="2024-11-10T10:00:00Z">10. November 2024</time>
<div class="article-body">
<p>The new U-Bahn line...</p>
</div>
</article>
```
✅ Extracts: Title (H1), Author (class), Date (time), Content (article-body)
### Example 2: Medium Blog
```html
<article>
<h1>How to Build a News Crawler</h1>
<meta property="og:title" content="How to Build a News Crawler" />
<meta property="article:published_time" content="2024-11-10T10:00:00Z" />
<a rel="author" href="/author">Jane Smith</a>
<section>
<p>In this article...</p>
</section>
</article>
```
✅ Extracts: Title (OG meta), Author (rel), Date (article meta), Content (section)
### Example 3: WordPress Blog
```html
<div class="post">
<h1 class="entry-title">My Blog Post</h1>
<span class="byline">By John Doe</span>
<time class="published">November 10, 2024</time>
<div class="entry-content">
<p>Blog content here...</p>
</div>
</div>
```
✅ Extracts: Title (H1), Author (byline), Date (published), Content (entry-content)
---
## ⚠️ Edge Cases Handled
1. **Missing Fields**: Returns `None` instead of crashing
2. **Multiple Authors**: Takes first one found
3. **Relative Dates**: Stores as-is ("2 hours ago")
4. **Paywalls**: Extracts what's available
5. **JavaScript-rendered**: Only gets server-side HTML
6. **Ads/Navigation**: Filtered out by paragraph length
7. **Site Name in Title**: Cleaned automatically
---
## 🚀 Future Improvements
Potential enhancements:
- [ ] JavaScript rendering (Selenium/Playwright)
- [ ] Paywall bypass (where legal)
- [ ] Image extraction
- [ ] Video detection
- [ ] Related articles
- [ ] Tags/categories
- [ ] Reading time estimation
- [ ] Language detection
- [ ] Sentiment analysis
---
## 🧪 Testing
Test the extraction on a specific URL:
```python
from crawler_service import extract_article_content
url = "https://www.sueddeutsche.de/muenchen/article-123"
data = extract_article_content(url)
print(f"Title: {data['title']}")
print(f"Author: {data['author']}")
print(f"Date: {data['published_date']}")
print(f"Content length: {len(data['content'])} chars")
print(f"Word count: {data['word_count']}")
```
---
## 📚 Standards Supported
- ✅ HTML5 semantic tags
- ✅ Open Graph Protocol
- ✅ Twitter Cards
- ✅ Schema.org microdata
- ✅ JSON-LD structured data
- ✅ Dublin Core metadata
- ✅ Common CSS class patterns

209
docs/OLD_ARCHITECTURE.md Normal file
View File

@@ -0,0 +1,209 @@
# Munich News Daily - Architecture
## System Overview
```
┌─────────────────────────────────────────────────────────────┐
│ Users / Browsers │
└────────────────────────┬────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Frontend (Port 3000) │
│ Node.js + Express + Vanilla JS │
│ - Subscription form │
│ - News display │
│ - RSS feed management UI (future) │
└────────────────────────┬────────────────────────────────────┘
│ HTTP/REST
┌─────────────────────────────────────────────────────────────┐
│ Backend API (Port 5001) │
│ Flask + Python │
│ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Routes (Blueprints) │ │
│ │ - subscription_routes.py (subscribe/unsubscribe) │ │
│ │ - news_routes.py (get news, stats) │ │
│ │ - rss_routes.py (manage RSS feeds) │ │
│ │ - ollama_routes.py (AI features) │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Services (Business Logic) │ │
│ │ - news_service.py (fetch & save articles) │ │
│ │ - email_service.py (send newsletters) │ │
│ │ - ollama_service.py (AI integration) │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Core │ │
│ │ - config.py (configuration) │ │
│ │ - database.py (DB connection) │ │
│ └──────────────────────────────────────────────────────┘ │
└────────────────────────┬────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ MongoDB (Port 27017) │
│ │
│ Collections: │
│ - articles (news articles with full content) │
│ - subscribers (email subscribers) │
│ - rss_feeds (RSS feed sources) │
└─────────────────────────┬───────────────────────────────────┘
│ Read/Write
┌─────────────────────────┴───────────────────────────────────┐
│ News Crawler Microservice │
│ (Standalone) │
│ │
│ - Fetches RSS feeds from MongoDB │
│ - Crawls full article content │
│ - Extracts text, metadata, word count │
│ - Stores back to MongoDB │
│ - Can run independently or scheduled │
└──────────────────────────────────────────────────────────────┘
│ (Optional)
┌─────────────────────────────────────────────────────────────┐
│ Ollama AI Server (Port 11434) │
│ (Optional, External) │
│ │
│ - Article summarization │
│ - Content analysis │
│ - AI-powered features │
└──────────────────────────────────────────────────────────────┘
```
## Component Details
### Frontend (Port 3000)
- **Technology**: Node.js, Express, Vanilla JavaScript
- **Responsibilities**:
- User interface
- Subscription management
- News display
- API proxy to backend
- **Communication**: HTTP REST to Backend
### Backend API (Port 5001)
- **Technology**: Python, Flask
- **Architecture**: Modular with Blueprints
- **Responsibilities**:
- REST API endpoints
- Business logic
- Database operations
- Email sending
- AI integration
- **Communication**:
- HTTP REST from Frontend
- MongoDB driver to Database
- HTTP to Ollama (optional)
### MongoDB (Port 27017)
- **Technology**: MongoDB 7.0
- **Responsibilities**:
- Persistent data storage
- Articles, subscribers, RSS feeds
- **Communication**: MongoDB protocol
### News Crawler (Standalone)
- **Technology**: Python, BeautifulSoup
- **Architecture**: Microservice (can run independently)
- **Responsibilities**:
- Fetch RSS feeds
- Crawl article content
- Extract and clean text
- Store in database
- **Communication**: MongoDB driver to Database
- **Execution**:
- Manual: `python crawler_service.py`
- Scheduled: Cron, systemd, Docker
- On-demand: Via backend API (future)
### Ollama AI Server (Optional, External)
- **Technology**: Ollama
- **Responsibilities**:
- AI model inference
- Text summarization
- Content analysis
- **Communication**: HTTP REST API
## Data Flow
### 1. News Aggregation Flow
```
RSS Feeds → Backend (news_service) → MongoDB (articles)
```
### 2. Content Crawling Flow
```
MongoDB (rss_feeds) → Crawler → Article URLs →
Web Scraping → MongoDB (articles with full_content)
```
### 3. Subscription Flow
```
User → Frontend → Backend (subscription_routes) →
MongoDB (subscribers)
```
### 4. Newsletter Flow (Future)
```
Scheduler → Backend (email_service) →
MongoDB (articles + subscribers) → SMTP → Users
```
### 5. AI Processing Flow (Optional)
```
MongoDB (articles) → Backend (ollama_service) →
Ollama Server → AI Summary → MongoDB (articles)
```
## Deployment Options
### Development
- All services run locally
- MongoDB via Docker Compose
- Manual crawler execution
### Production
- Backend: Cloud VM, Container, or PaaS
- Frontend: Static hosting or same server
- MongoDB: MongoDB Atlas or self-hosted
- Crawler: Scheduled job (cron, systemd timer)
- Ollama: Separate GPU server (optional)
## Scalability Considerations
### Current Architecture
- Monolithic backend (single Flask instance)
- Standalone crawler (can run multiple instances)
- Shared MongoDB
### Future Improvements
- Load balancer for backend
- Message queue for crawler jobs (Celery + Redis)
- Caching layer (Redis)
- CDN for frontend
- Read replicas for MongoDB
## Security
- CORS enabled for frontend-backend communication
- MongoDB authentication (production)
- Environment variables for secrets
- Input validation on all endpoints
- Rate limiting (future)
## Monitoring (Future)
- Application logs
- MongoDB metrics
- Crawler success/failure tracking
- API response times
- Error tracking (Sentry)

243
docs/QUICK_REFERENCE.md Normal file
View File

@@ -0,0 +1,243 @@
# Quick Reference Guide
## Starting the Application
### 1. Start MongoDB
```bash
docker-compose up -d
```
### 2. Start Backend (Port 5001)
```bash
cd backend
source venv/bin/activate # or: venv\Scripts\activate on Windows
python app.py
```
### 3. Start Frontend (Port 3000)
```bash
cd frontend
npm start
```
### 4. Run Crawler (Optional)
```bash
cd news_crawler
pip install -r requirements.txt
python crawler_service.py 10
```
## Common Commands
### RSS Feed Management
**List all feeds:**
```bash
curl http://localhost:5001/api/rss-feeds
```
**Add a feed:**
```bash
curl -X POST http://localhost:5001/api/rss-feeds \
-H "Content-Type: application/json" \
-d '{"name": "Feed Name", "url": "https://example.com/rss"}'
```
**Remove a feed:**
```bash
curl -X DELETE http://localhost:5001/api/rss-feeds/<feed_id>
```
**Toggle feed status:**
```bash
curl -X PATCH http://localhost:5001/api/rss-feeds/<feed_id>/toggle
```
### News & Subscriptions
**Get latest news:**
```bash
curl http://localhost:5001/api/news
```
**Subscribe:**
```bash
curl -X POST http://localhost:5001/api/subscribe \
-H "Content-Type: application/json" \
-d '{"email": "user@example.com"}'
```
**Get stats:**
```bash
curl http://localhost:5001/api/stats
```
### Ollama (AI)
**Test connection:**
```bash
curl http://localhost:5001/api/ollama/ping
```
**List models:**
```bash
curl http://localhost:5001/api/ollama/models
```
### Email Tracking & Analytics
**Get newsletter metrics:**
```bash
curl http://localhost:5001/api/analytics/newsletter/<newsletter_id>
```
**Get article performance:**
```bash
curl http://localhost:5001/api/analytics/article/<article_id>
```
**Get subscriber activity:**
```bash
curl http://localhost:5001/api/analytics/subscriber/<email>
```
**Delete subscriber tracking data:**
```bash
curl -X DELETE http://localhost:5001/api/tracking/subscriber/<email>
```
**Anonymize old tracking data:**
```bash
curl -X POST http://localhost:5001/api/tracking/anonymize
```
### Database
**Connect to MongoDB:**
```bash
mongosh
use munich_news
```
**Check articles:**
```javascript
db.articles.find().limit(5)
db.articles.countDocuments()
db.articles.countDocuments({full_content: {$exists: true}})
```
**Check subscribers:**
```javascript
db.subscribers.find()
db.subscribers.countDocuments({status: "active"})
```
**Check RSS feeds:**
```javascript
db.rss_feeds.find()
```
**Check tracking data:**
```javascript
db.newsletter_sends.find().limit(5)
db.link_clicks.find().limit(5)
db.subscriber_activity.find()
```
## File Locations
### Configuration
- Backend: `backend/.env`
- Frontend: `frontend/package.json`
- Crawler: Uses backend's `.env` or own `.env`
### Logs
- Backend: Terminal output
- Frontend: Terminal output
- Crawler: Terminal output
### Database
- MongoDB data: Docker volume `mongodb_data`
- Database name: `munich_news`
## Ports
| Service | Port | URL |
|---------|------|-----|
| Frontend | 3000 | http://localhost:3000 |
| Backend | 5001 | http://localhost:5001 |
| MongoDB | 27017 | mongodb://localhost:27017 |
| Ollama | 11434 | http://localhost:11434 |
## Troubleshooting
### Backend won't start
- Check if port 5001 is available
- Verify MongoDB is running
- Check `.env` file exists
### Frontend can't connect
- Verify backend is running on port 5001
- Check CORS settings
- Check API_URL in frontend
### Crawler fails
- Install dependencies: `pip install -r requirements.txt`
- Check MongoDB connection
- Verify RSS feeds exist in database
### MongoDB connection error
- Start MongoDB: `docker-compose up -d`
- Check connection string in `.env`
- Verify port 27017 is not blocked
### Port 5000 conflict (macOS)
- AirPlay uses port 5000
- Use port 5001 instead (set in `.env`)
- Or disable AirPlay Receiver in System Preferences
## Project Structure
```
munich-news/
├── backend/ # Main API (Flask)
├── frontend/ # Web UI (Express + JS)
├── news_crawler/ # Crawler microservice
├── .env # Environment variables
└── docker-compose.yml # MongoDB setup
```
## Environment Variables
### Backend (.env)
```env
MONGODB_URI=mongodb://localhost:27017/
FLASK_PORT=5001
SMTP_SERVER=smtp.gmail.com
SMTP_PORT=587
EMAIL_USER=your-email@gmail.com
EMAIL_PASSWORD=your-app-password
OLLAMA_BASE_URL=http://127.0.0.1:11434
OLLAMA_MODEL=phi3:latest
OLLAMA_ENABLED=true
TRACKING_ENABLED=true
TRACKING_API_URL=http://localhost:5001
TRACKING_DATA_RETENTION_DAYS=90
```
## Development Workflow
1. **Add RSS Feed** → Backend API
2. **Run Crawler** → Fetches full content
3. **View News** → Frontend displays articles
4. **Users Subscribe** → Via frontend form
5. **Send Newsletter** → Manual or scheduled
## Useful Links
- Frontend: http://localhost:3000
- Backend API: http://localhost:5001
- MongoDB: mongodb://localhost:27017
- Architecture: See `ARCHITECTURE.md`
- Backend Structure: See `backend/STRUCTURE.md`
- Crawler Guide: See `news_crawler/README.md`

194
docs/RSS_URL_EXTRACTION.md Normal file
View File

@@ -0,0 +1,194 @@
# RSS URL Extraction - How It Works
## The Problem
Different RSS feed providers use different fields to store the article URL:
### Example 1: Standard RSS (uses `link`)
```xml
<item>
<title>Article Title</title>
<link>https://example.com/article/123</link>
<guid>internal-id-456</guid>
</item>
```
### Example 2: Some feeds (uses `guid` as URL)
```xml
<item>
<title>Article Title</title>
<guid>https://example.com/article/123</guid>
</item>
```
### Example 3: Atom feeds (uses `id`)
```xml
<entry>
<title>Article Title</title>
<id>https://example.com/article/123</id>
</entry>
```
### Example 4: Complex feeds (guid as object)
```xml
<item>
<title>Article Title</title>
<guid isPermaLink="true">https://example.com/article/123</guid>
</item>
```
### Example 5: Multiple links
```xml
<item>
<title>Article Title</title>
<link rel="alternate" type="text/html" href="https://example.com/article/123"/>
<link rel="enclosure" type="image/jpeg" href="https://example.com/image.jpg"/>
</item>
```
## Our Solution
The `extract_article_url()` function tries multiple strategies in order:
### Strategy 1: Check `link` field (most common)
```python
if entry.get('link') and entry.get('link', '').startswith('http'):
return entry.get('link')
```
✅ Works for: Most RSS 2.0 feeds
### Strategy 2: Check `guid` field
```python
if entry.get('guid'):
guid = entry.get('guid')
# guid can be a string
if isinstance(guid, str) and guid.startswith('http'):
return guid
# or a dict with 'href'
elif isinstance(guid, dict) and guid.get('href', '').startswith('http'):
return guid.get('href')
```
✅ Works for: Feeds that use GUID as permalink
### Strategy 3: Check `id` field
```python
if entry.get('id') and entry.get('id', '').startswith('http'):
return entry.get('id')
```
✅ Works for: Atom feeds
### Strategy 4: Check `links` array
```python
if entry.get('links'):
for link in entry.get('links', []):
if isinstance(link, dict) and link.get('href', '').startswith('http'):
# Prefer 'alternate' type
if link.get('type') == 'text/html' or link.get('rel') == 'alternate':
return link.get('href')
```
✅ Works for: Feeds with multiple links (prefers HTML content)
## Real-World Examples
### Süddeutsche Zeitung
```python
entry = {
'title': 'Munich News',
'link': 'https://www.sueddeutsche.de/muenchen/article-123',
'guid': 'sz-internal-123'
}
# Returns: 'https://www.sueddeutsche.de/muenchen/article-123'
```
### Medium Blog
```python
entry = {
'title': 'Blog Post',
'guid': 'https://medium.com/@user/post-abc123',
'link': None
}
# Returns: 'https://medium.com/@user/post-abc123'
```
### YouTube RSS
```python
entry = {
'title': 'Video Title',
'id': 'https://www.youtube.com/watch?v=abc123',
'link': None
}
# Returns: 'https://www.youtube.com/watch?v=abc123'
```
### Complex Feed
```python
entry = {
'title': 'Article',
'links': [
{'rel': 'alternate', 'type': 'text/html', 'href': 'https://example.com/article'},
{'rel': 'enclosure', 'type': 'image/jpeg', 'href': 'https://example.com/image.jpg'}
]
}
# Returns: 'https://example.com/article' (prefers text/html)
```
## Validation
All extracted URLs must:
1. Start with `http://` or `https://`
2. Be a valid string (not None or empty)
If no valid URL is found:
```python
return None
# Crawler will skip this entry and log a warning
```
## Testing Different Feeds
To test if a feed works with our extractor:
```python
import feedparser
from rss_utils import extract_article_url
# Parse feed
feed = feedparser.parse('https://example.com/rss')
# Test each entry
for entry in feed.entries[:5]:
url = extract_article_url(entry)
if url:
print(f"{entry.get('title', 'No title')[:50]}")
print(f" URL: {url}")
else:
print(f"{entry.get('title', 'No title')[:50]}")
print(f" No valid URL found")
print(f" Available fields: {list(entry.keys())}")
```
## Supported Feed Types
✅ RSS 2.0
✅ RSS 1.0
✅ Atom
✅ Custom RSS variants
✅ Feeds with multiple links
✅ Feeds with GUID as permalink
## Edge Cases Handled
1. **GUID is not a URL**: Checks if it starts with `http`
2. **Multiple links**: Prefers `text/html` type
3. **GUID as dict**: Extracts `href` field
4. **Missing fields**: Returns None instead of crashing
5. **Non-HTTP URLs**: Filters out `mailto:`, `ftp:`, etc.
## Future Improvements
Potential enhancements:
- [ ] Support for `feedburner:origLink`
- [ ] Support for `pheedo:origLink`
- [ ] Resolve shortened URLs (bit.ly, etc.)
- [ ] Handle relative URLs (convert to absolute)
- [ ] Cache URL extraction results

412
docs/SYSTEM_ARCHITECTURE.md Normal file
View File

@@ -0,0 +1,412 @@
# Munich News Daily - System Architecture
## 📊 Complete System Overview
```
┌─────────────────────────────────────────────────────────────────┐
│ Munich News Daily System │
│ Fully Automated Pipeline │
└─────────────────────────────────────────────────────────────────┘
Daily Schedule
┌──────────────────────┐
│ 6:00 AM Berlin │
│ News Crawler │
└──────────┬───────────┘
┌──────────────────────────────────────────────────────────────────┐
│ News Crawler │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐│
│ │ Fetch RSS │→ │ Extract │→ │ Summarize │→ │ Save to ││
│ │ Feeds │ │ Content │ │ with AI │ │ MongoDB ││
│ └────────────┘ └────────────┘ └────────────┘ └────────────┘│
│ │
│ Sources: Süddeutsche, Merkur, BR24, etc. │
│ Output: Full articles + AI summaries │
└──────────────────────────────────────────────────────────────────┘
│ Articles saved
┌──────────────────────┐
│ MongoDB │
│ (Data Storage) │
└──────────┬───────────┘
│ Wait for crawler
┌──────────────────────┐
│ 7:00 AM Berlin │
│ Newsletter Sender │
└──────────┬───────────┘
┌──────────────────────────────────────────────────────────────────┐
│ Newsletter Sender │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐│
│ │ Wait for │→ │ Fetch │→ │ Generate │→ │ Send to ││
│ │ Crawler │ │ Articles │ │ Newsletter │ │ Subscribers││
│ └────────────┘ └────────────┘ └────────────┘ └────────────┘│
│ │
│ Features: Tracking pixels, link tracking, HTML templates │
│ Output: Personalized newsletters with engagement tracking │
└──────────────────────────────────────────────────────────────────┘
│ Emails sent
┌──────────────────────┐
│ Subscribers │
│ (Email Inboxes) │
└──────────┬───────────┘
│ Opens & clicks
┌──────────────────────┐
│ Tracking System │
│ (Analytics API) │
└──────────────────────┘
```
## 🔄 Data Flow
### 1. Content Acquisition (6:00 AM)
```
RSS Feeds → Crawler → Full Content → AI Summary → MongoDB
```
**Details**:
- Fetches from multiple RSS sources
- Extracts full article text
- Generates concise summaries using Ollama
- Stores with metadata (author, date, source)
### 2. Newsletter Generation (7:00 AM)
```
MongoDB → Articles → Template → HTML → Email
```
**Details**:
- Waits for crawler to finish (max 30 min)
- Fetches today's articles with summaries
- Applies Jinja2 template
- Injects tracking pixels
- Replaces links with tracking URLs
### 3. Engagement Tracking (Ongoing)
```
Email Open → Pixel Load → Log Event → Analytics
Link Click → Redirect → Log Event → Analytics
```
**Details**:
- Tracks email opens via 1x1 pixel
- Tracks link clicks via redirect URLs
- Stores engagement data in MongoDB
- Provides analytics API
## 🏗️ Component Architecture
### Docker Containers
```
┌─────────────────────────────────────────────────────────┐
│ Docker Network │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ MongoDB │ │ Crawler │ │ Sender │ │
│ │ │ │ │ │ │ │
│ │ Port: 27017 │←─│ Schedule: │←─│ Schedule: │ │
│ │ │ │ 6:00 AM │ │ 7:00 AM │ │
│ │ Storage: │ │ │ │ │ │
│ │ - articles │ │ Depends on: │ │ Depends on: │ │
│ │ - subscribers│ │ - MongoDB │ │ - MongoDB │ │
│ │ - tracking │ │ │ │ - Crawler │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ All containers auto-restart on failure │
│ All use Europe/Berlin timezone │
└─────────────────────────────────────────────────────────┘
```
### Backend Services
```
┌─────────────────────────────────────────────────────────┐
│ Backend Services │
│ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Flask API (Port 5001) │ │
│ │ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ │
│ │ │ Tracking │ │ Analytics │ │ Privacy │ │ │
│ │ │ Endpoints │ │ Endpoints │ │ Endpoints │ │ │
│ │ └────────────┘ └────────────┘ └────────────┘ │ │
│ └──────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Services Layer │ │
│ │ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ │
│ │ │ Tracking │ │ Analytics │ │ Ollama │ │ │
│ │ │ Service │ │ Service │ │ Client │ │ │
│ │ └────────────┘ └────────────┘ └────────────┘ │ │
│ └──────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
```
## 📅 Daily Timeline
```
Time (Berlin) │ Event │ Duration
───────────────┼──────────────────────────┼──────────
05:59:59 │ System idle │ -
06:00:00 │ Crawler starts │ ~10-20 min
06:00:01 │ - Fetch RSS feeds │
06:02:00 │ - Extract content │
06:05:00 │ - Generate summaries │
06:15:00 │ - Save to MongoDB │
06:20:00 │ Crawler finishes │
06:20:01 │ System idle │ ~40 min
07:00:00 │ Sender starts │ ~5-10 min
07:00:01 │ - Wait for crawler │ (checks every 30s)
07:00:30 │ - Crawler confirmed done │
07:00:31 │ - Fetch articles │
07:01:00 │ - Generate newsletters │
07:02:00 │ - Send to subscribers │
07:10:00 │ Sender finishes │
07:10:01 │ System idle │ Until tomorrow
```
## 🔐 Security & Privacy
### Data Protection
```
┌─────────────────────────────────────────────────────────┐
│ Privacy Features │
│ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Data Retention │ │
│ │ - Personal data: 90 days │ │
│ │ - Anonymization: Automatic │ │
│ │ - Deletion: On request │ │
│ └──────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ User Rights │ │
│ │ - Opt-out: Anytime │ │
│ │ - Data access: API available │ │
│ │ - Data deletion: Full removal │ │
│ └──────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Compliance │ │
│ │ - GDPR compliant │ │
│ │ - Privacy notice in emails │ │
│ │ - Transparent tracking │ │
│ └──────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
```
## 📊 Database Schema
### Collections
```
MongoDB (munich_news)
├── articles
│ ├── title
│ ├── author
│ ├── content (full text)
│ ├── summary (AI generated)
│ ├── link
│ ├── source
│ ├── published_at
│ └── crawled_at
├── subscribers
│ ├── email
│ ├── active
│ ├── tracking_enabled
│ └── subscribed_at
├── rss_feeds
│ ├── name
│ ├── url
│ └── active
├── newsletter_sends
│ ├── tracking_id
│ ├── newsletter_id
│ ├── subscriber_email
│ ├── opened
│ ├── first_opened_at
│ └── open_count
├── link_clicks
│ ├── tracking_id
│ ├── newsletter_id
│ ├── subscriber_email
│ ├── article_url
│ ├── clicked
│ └── clicked_at
└── subscriber_activity
├── email
├── status (active/inactive/dormant)
├── last_opened_at
├── last_clicked_at
├── total_opens
└── total_clicks
```
## 🚀 Deployment Architecture
### Development
```
Local Machine
├── Docker Compose
│ ├── MongoDB (no auth)
│ ├── Crawler
│ └── Sender
├── Backend (manual start)
│ └── Flask API
└── Ollama (optional)
└── AI Summarization
```
### Production
```
Server
├── Docker Compose (prod)
│ ├── MongoDB (with auth)
│ ├── Crawler
│ └── Sender
├── Backend (systemd/pm2)
│ └── Flask API (HTTPS)
├── Ollama (optional)
│ └── AI Summarization
└── Nginx (reverse proxy)
└── SSL/TLS
```
## 🔄 Coordination Mechanism
### Crawler-Sender Synchronization
```
┌─────────────────────────────────────────────────────────┐
│ Coordination Flow │
│ │
│ 6:00 AM → Crawler starts │
│ ↓ │
│ Crawling articles... │
│ ↓ │
│ Saves to MongoDB │
│ ↓ │
│ 6:20 AM → Crawler finishes │
│ ↓ │
│ 7:00 AM → Sender starts │
│ ↓ │
│ Check: Recent articles? ──→ No ──┐ │
│ ↓ Yes │ │
│ Proceed with send │ │
│ │ │
│ ← Wait 30s ← Wait 30s ← Wait 30s┘ │
│ (max 30 minutes) │
│ │
│ 7:10 AM → Newsletter sent │
└─────────────────────────────────────────────────────────┘
```
## 📈 Monitoring & Observability
### Key Metrics
```
┌─────────────────────────────────────────────────────────┐
│ Metrics to Monitor │
│ │
│ Crawler: │
│ - Articles crawled per day │
│ - Crawl duration │
│ - Success/failure rate │
│ - Summary generation rate │
│ │
│ Sender: │
│ - Newsletters sent per day │
│ - Send duration │
│ - Success/failure rate │
│ - Wait time for crawler │
│ │
│ Engagement: │
│ - Open rate │
│ - Click-through rate │
│ - Active subscribers │
│ - Dormant subscribers │
│ │
│ System: │
│ - Container uptime │
│ - Database size │
│ - Error rate │
│ - Response times │
└─────────────────────────────────────────────────────────┘
```
## 🛠️ Maintenance Tasks
### Daily
- Check logs for errors
- Verify newsletters sent
- Monitor engagement metrics
### Weekly
- Review article quality
- Check subscriber growth
- Analyze engagement trends
### Monthly
- Archive old articles
- Clean up dormant subscribers
- Update dependencies
- Review system performance
## 📚 Technology Stack
```
┌─────────────────────────────────────────────────────────┐
│ Technology Stack │
│ │
│ Backend: │
│ - Python 3.11 │
│ - Flask (API) │
│ - PyMongo (Database) │
│ - Schedule (Automation) │
│ - Jinja2 (Templates) │
│ - BeautifulSoup (Parsing) │
│ │
│ Database: │
│ - MongoDB 7.0 │
│ │
│ AI/ML: │
│ - Ollama (Summarization) │
│ - Phi3 Model (default) │
│ │
│ Infrastructure: │
│ - Docker & Docker Compose │
│ - Linux (Ubuntu/Debian) │
│ │
│ Email: │
│ - SMTP (configurable) │
│ - HTML emails with tracking │
└─────────────────────────────────────────────────────────┘
```
---
**Last Updated**: 2024-01-16
**Version**: 1.0
**Status**: Production Ready ✅