update
This commit is contained in:
223
docs/API.md
Normal file
223
docs/API.md
Normal file
@@ -0,0 +1,223 @@
|
||||
# API Reference
|
||||
|
||||
## Tracking Endpoints
|
||||
|
||||
### Track Email Open
|
||||
|
||||
```http
|
||||
GET /api/track/pixel/<tracking_id>
|
||||
```
|
||||
|
||||
Returns a 1x1 transparent PNG and logs the email open event.
|
||||
|
||||
**Response**: Image (image/png)
|
||||
|
||||
### Track Link Click
|
||||
|
||||
```http
|
||||
GET /api/track/click/<tracking_id>
|
||||
```
|
||||
|
||||
Logs the click event and redirects to the original article URL.
|
||||
|
||||
**Response**: 302 Redirect
|
||||
|
||||
## Analytics Endpoints
|
||||
|
||||
### Get Newsletter Metrics
|
||||
|
||||
```http
|
||||
GET /api/analytics/newsletter/<newsletter_id>
|
||||
```
|
||||
|
||||
Returns comprehensive metrics for a specific newsletter.
|
||||
|
||||
**Response**:
|
||||
```json
|
||||
{
|
||||
"newsletter_id": "2024-01-15",
|
||||
"total_sent": 100,
|
||||
"total_opened": 75,
|
||||
"open_rate": 75.0,
|
||||
"unique_openers": 70,
|
||||
"total_clicks": 45,
|
||||
"unique_clickers": 30,
|
||||
"click_through_rate": 30.0
|
||||
}
|
||||
```
|
||||
|
||||
### Get Article Performance
|
||||
|
||||
```http
|
||||
GET /api/analytics/article/<article_url>
|
||||
```
|
||||
|
||||
Returns performance metrics for a specific article.
|
||||
|
||||
**Response**:
|
||||
```json
|
||||
{
|
||||
"article_url": "https://example.com/article",
|
||||
"total_sent": 100,
|
||||
"total_clicks": 25,
|
||||
"click_rate": 25.0,
|
||||
"unique_clickers": 20,
|
||||
"newsletters": ["2024-01-15", "2024-01-16"]
|
||||
}
|
||||
```
|
||||
|
||||
### Get Subscriber Activity
|
||||
|
||||
```http
|
||||
GET /api/analytics/subscriber/<email>
|
||||
```
|
||||
|
||||
Returns activity status and engagement metrics for a subscriber.
|
||||
|
||||
**Response**:
|
||||
```json
|
||||
{
|
||||
"email": "user@example.com",
|
||||
"status": "active",
|
||||
"last_opened_at": "2024-01-15T10:30:00",
|
||||
"last_clicked_at": "2024-01-15T10:35:00",
|
||||
"total_opens": 45,
|
||||
"total_clicks": 20,
|
||||
"newsletters_received": 50,
|
||||
"newsletters_opened": 45
|
||||
}
|
||||
```
|
||||
|
||||
## Privacy Endpoints
|
||||
|
||||
### Delete Subscriber Data
|
||||
|
||||
```http
|
||||
DELETE /api/tracking/subscriber/<email>
|
||||
```
|
||||
|
||||
Deletes all tracking data for a subscriber (GDPR compliance).
|
||||
|
||||
**Response**:
|
||||
```json
|
||||
{
|
||||
"success": true,
|
||||
"message": "All tracking data deleted for user@example.com",
|
||||
"deleted_counts": {
|
||||
"newsletter_sends": 50,
|
||||
"link_clicks": 25,
|
||||
"subscriber_activity": 1
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Anonymize Old Data
|
||||
|
||||
```http
|
||||
POST /api/tracking/anonymize
|
||||
```
|
||||
|
||||
Anonymizes tracking data older than the retention period.
|
||||
|
||||
**Request Body** (optional):
|
||||
```json
|
||||
{
|
||||
"retention_days": 90
|
||||
}
|
||||
```
|
||||
|
||||
**Response**:
|
||||
```json
|
||||
{
|
||||
"success": true,
|
||||
"message": "Anonymized tracking data older than 90 days",
|
||||
"anonymized_counts": {
|
||||
"newsletter_sends": 1250,
|
||||
"link_clicks": 650
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Opt Out of Tracking
|
||||
|
||||
```http
|
||||
POST /api/tracking/subscriber/<email>/opt-out
|
||||
```
|
||||
|
||||
Disables tracking for a subscriber.
|
||||
|
||||
**Response**:
|
||||
```json
|
||||
{
|
||||
"success": true,
|
||||
"message": "Subscriber user@example.com has opted out of tracking"
|
||||
}
|
||||
```
|
||||
|
||||
### Opt In to Tracking
|
||||
|
||||
```http
|
||||
POST /api/tracking/subscriber/<email>/opt-in
|
||||
```
|
||||
|
||||
Re-enables tracking for a subscriber.
|
||||
|
||||
**Response**:
|
||||
```json
|
||||
{
|
||||
"success": true,
|
||||
"message": "Subscriber user@example.com has opted in to tracking"
|
||||
}
|
||||
```
|
||||
|
||||
## Examples
|
||||
|
||||
### Using curl
|
||||
|
||||
```bash
|
||||
# Get newsletter metrics
|
||||
curl http://localhost:5001/api/analytics/newsletter/2024-01-15
|
||||
|
||||
# Delete subscriber data
|
||||
curl -X DELETE http://localhost:5001/api/tracking/subscriber/user@example.com
|
||||
|
||||
# Anonymize old data
|
||||
curl -X POST http://localhost:5001/api/tracking/anonymize \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{"retention_days": 90}'
|
||||
|
||||
# Opt out of tracking
|
||||
curl -X POST http://localhost:5001/api/tracking/subscriber/user@example.com/opt-out
|
||||
```
|
||||
|
||||
### Using Python
|
||||
|
||||
```python
|
||||
import requests
|
||||
|
||||
# Get newsletter metrics
|
||||
response = requests.get('http://localhost:5001/api/analytics/newsletter/2024-01-15')
|
||||
metrics = response.json()
|
||||
print(f"Open rate: {metrics['open_rate']}%")
|
||||
|
||||
# Delete subscriber data
|
||||
response = requests.delete('http://localhost:5001/api/tracking/subscriber/user@example.com')
|
||||
result = response.json()
|
||||
print(result['message'])
|
||||
```
|
||||
|
||||
## Error Responses
|
||||
|
||||
All endpoints return standard error responses:
|
||||
|
||||
```json
|
||||
{
|
||||
"success": false,
|
||||
"error": "Error message here"
|
||||
}
|
||||
```
|
||||
|
||||
HTTP Status Codes:
|
||||
- `200` - Success
|
||||
- `404` - Not found
|
||||
- `500` - Server error
|
||||
131
docs/ARCHITECTURE.md
Normal file
131
docs/ARCHITECTURE.md
Normal file
@@ -0,0 +1,131 @@
|
||||
# System Architecture
|
||||
|
||||
## Overview
|
||||
|
||||
Munich News Daily is a fully automated news aggregation and newsletter system with the following components:
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────────────────────┐
|
||||
│ Munich News Daily System │
|
||||
└─────────────────────────────────────────────────────────────────┘
|
||||
|
||||
6:00 AM Berlin → News Crawler
|
||||
↓
|
||||
Fetches RSS feeds
|
||||
Extracts full content
|
||||
Generates AI summaries
|
||||
Saves to MongoDB
|
||||
↓
|
||||
7:00 AM Berlin → Newsletter Sender
|
||||
↓
|
||||
Waits for crawler
|
||||
Fetches articles
|
||||
Generates newsletter
|
||||
Sends to subscribers
|
||||
↓
|
||||
✅ Done!
|
||||
```
|
||||
|
||||
## Components
|
||||
|
||||
### 1. MongoDB Database
|
||||
- **Purpose**: Central data storage
|
||||
- **Collections**:
|
||||
- `articles`: News articles with summaries
|
||||
- `subscribers`: Email subscribers
|
||||
- `rss_feeds`: RSS feed sources
|
||||
- `newsletter_sends`: Email tracking data
|
||||
- `link_clicks`: Link click tracking
|
||||
- `subscriber_activity`: Engagement metrics
|
||||
|
||||
### 2. News Crawler
|
||||
- **Schedule**: Daily at 6:00 AM Berlin time
|
||||
- **Functions**:
|
||||
- Fetches articles from RSS feeds
|
||||
- Extracts full article content
|
||||
- Generates AI summaries using Ollama
|
||||
- Saves to MongoDB
|
||||
- **Technology**: Python, BeautifulSoup, Ollama
|
||||
|
||||
### 3. Newsletter Sender
|
||||
- **Schedule**: Daily at 7:00 AM Berlin time
|
||||
- **Functions**:
|
||||
- Waits for crawler to finish (max 30 min)
|
||||
- Fetches today's articles
|
||||
- Generates HTML newsletter
|
||||
- Injects tracking pixels
|
||||
- Sends to all subscribers
|
||||
- **Technology**: Python, Jinja2, SMTP
|
||||
|
||||
### 4. Backend API (Optional)
|
||||
- **Purpose**: Tracking and analytics
|
||||
- **Endpoints**:
|
||||
- `/api/track/pixel/<id>` - Email open tracking
|
||||
- `/api/track/click/<id>` - Link click tracking
|
||||
- `/api/analytics/*` - Engagement metrics
|
||||
- `/api/tracking/*` - Privacy controls
|
||||
- **Technology**: Flask, Python
|
||||
|
||||
## Data Flow
|
||||
|
||||
```
|
||||
RSS Feeds → Crawler → MongoDB → Sender → Subscribers
|
||||
↓
|
||||
Backend API
|
||||
↓
|
||||
Analytics
|
||||
```
|
||||
|
||||
## Coordination
|
||||
|
||||
The sender waits for the crawler to ensure fresh content:
|
||||
|
||||
1. Sender starts at 7:00 AM
|
||||
2. Checks for recent articles every 30 seconds
|
||||
3. Maximum wait time: 30 minutes
|
||||
4. Proceeds once crawler finishes or timeout
|
||||
|
||||
## Technology Stack
|
||||
|
||||
- **Backend**: Python 3.11
|
||||
- **Database**: MongoDB 7.0
|
||||
- **AI**: Ollama (Phi3 model)
|
||||
- **Scheduling**: Python schedule library
|
||||
- **Email**: SMTP with HTML templates
|
||||
- **Tracking**: Pixel tracking + redirect URLs
|
||||
- **Infrastructure**: Docker & Docker Compose
|
||||
|
||||
## Deployment
|
||||
|
||||
All components run in Docker containers:
|
||||
|
||||
```
|
||||
docker-compose up -d
|
||||
```
|
||||
|
||||
Containers:
|
||||
- `munich-news-mongodb` - Database
|
||||
- `munich-news-crawler` - Crawler service
|
||||
- `munich-news-sender` - Sender service
|
||||
|
||||
## Security
|
||||
|
||||
- MongoDB authentication enabled
|
||||
- Environment variables for secrets
|
||||
- HTTPS for tracking URLs (production)
|
||||
- GDPR-compliant data retention
|
||||
- Privacy controls (opt-out, deletion)
|
||||
|
||||
## Monitoring
|
||||
|
||||
- Docker logs for all services
|
||||
- MongoDB for data verification
|
||||
- Health checks on containers
|
||||
- Engagement metrics via API
|
||||
|
||||
## Scalability
|
||||
|
||||
- Horizontal: Add more crawler instances
|
||||
- Vertical: Increase container resources
|
||||
- Database: MongoDB sharding if needed
|
||||
- Caching: Redis for API responses (future)
|
||||
106
docs/BACKEND_STRUCTURE.md
Normal file
106
docs/BACKEND_STRUCTURE.md
Normal file
@@ -0,0 +1,106 @@
|
||||
# Backend Structure
|
||||
|
||||
The backend has been modularized for better maintainability and scalability.
|
||||
|
||||
## Directory Structure
|
||||
|
||||
```
|
||||
backend/
|
||||
├── app.py # Main Flask application entry point
|
||||
├── config.py # Configuration management
|
||||
├── database.py # Database connection and initialization
|
||||
├── requirements.txt # Python dependencies
|
||||
├── .env # Environment variables
|
||||
│
|
||||
├── routes/ # API route handlers (blueprints)
|
||||
│ ├── __init__.py
|
||||
│ ├── subscription_routes.py # /api/subscribe, /api/unsubscribe
|
||||
│ ├── news_routes.py # /api/news, /api/stats
|
||||
│ ├── rss_routes.py # /api/rss-feeds (CRUD operations)
|
||||
│ ├── ollama_routes.py # /api/ollama/* (AI features)
|
||||
│ ├── tracking_routes.py # /api/track/* (email tracking)
|
||||
│ └── analytics_routes.py # /api/analytics/* (engagement metrics)
|
||||
│
|
||||
└── services/ # Business logic layer
|
||||
├── __init__.py
|
||||
├── news_service.py # News fetching and storage logic
|
||||
├── email_service.py # Newsletter email sending
|
||||
├── ollama_service.py # Ollama AI integration
|
||||
├── tracking_service.py # Email tracking (opens/clicks)
|
||||
└── analytics_service.py # Engagement analytics
|
||||
```
|
||||
|
||||
## Key Components
|
||||
|
||||
### app.py
|
||||
- Main Flask application
|
||||
- Registers all blueprints
|
||||
- Minimal code, just wiring things together
|
||||
|
||||
### config.py
|
||||
- Centralized configuration
|
||||
- Loads environment variables
|
||||
- Single source of truth for all settings
|
||||
|
||||
### database.py
|
||||
- MongoDB connection setup
|
||||
- Collection definitions
|
||||
- Database initialization with indexes
|
||||
|
||||
### routes/
|
||||
Each route file is a Flask Blueprint handling specific API endpoints:
|
||||
- **subscription_routes.py**: User subscription management
|
||||
- **news_routes.py**: News fetching and statistics
|
||||
- **rss_routes.py**: RSS feed management (add/remove/list/toggle)
|
||||
- **ollama_routes.py**: AI/Ollama integration endpoints
|
||||
- **tracking_routes.py**: Email tracking (pixel, click redirects, data deletion)
|
||||
- **analytics_routes.py**: Engagement analytics (open rates, click rates, subscriber activity)
|
||||
|
||||
### services/
|
||||
Business logic separated from route handlers:
|
||||
- **news_service.py**: Fetches news from RSS feeds, saves to database
|
||||
- **email_service.py**: Sends newsletter emails to subscribers
|
||||
- **ollama_service.py**: Communicates with Ollama AI server
|
||||
- **tracking_service.py**: Email tracking logic (tracking IDs, pixel generation, click logging)
|
||||
- **analytics_service.py**: Analytics calculations (open rates, click rates, activity classification)
|
||||
|
||||
## Benefits of This Structure
|
||||
|
||||
1. **Separation of Concerns**: Routes handle HTTP, services handle business logic
|
||||
2. **Testability**: Each module can be tested independently
|
||||
3. **Maintainability**: Easy to find and modify specific functionality
|
||||
4. **Scalability**: Easy to add new routes or services
|
||||
5. **Reusability**: Services can be used by multiple routes
|
||||
|
||||
## Adding New Features
|
||||
|
||||
### To add a new API endpoint:
|
||||
1. Create a new route file in `routes/` or add to existing one
|
||||
2. Create a Blueprint and define routes
|
||||
3. Register the blueprint in `app.py`
|
||||
|
||||
### To add new business logic:
|
||||
1. Create a new service file in `services/`
|
||||
2. Import and use in your route handlers
|
||||
|
||||
### Example:
|
||||
```python
|
||||
# services/my_service.py
|
||||
def my_business_logic():
|
||||
return "Hello"
|
||||
|
||||
# routes/my_routes.py
|
||||
from flask import Blueprint
|
||||
from services.my_service import my_business_logic
|
||||
|
||||
my_bp = Blueprint('my', __name__)
|
||||
|
||||
@my_bp.route('/api/my-endpoint')
|
||||
def my_endpoint():
|
||||
result = my_business_logic()
|
||||
return {'message': result}
|
||||
|
||||
# app.py
|
||||
from routes.my_routes import my_bp
|
||||
app.register_blueprint(my_bp)
|
||||
```
|
||||
136
docs/CHANGELOG.md
Normal file
136
docs/CHANGELOG.md
Normal file
@@ -0,0 +1,136 @@
|
||||
# Changelog
|
||||
|
||||
## [Unreleased] - 2024-11-10
|
||||
|
||||
### Added - Major Refactoring
|
||||
|
||||
#### Backend Modularization
|
||||
- ✅ Restructured backend into modular architecture
|
||||
- ✅ Created separate route blueprints:
|
||||
- `subscription_routes.py` - User subscriptions
|
||||
- `news_routes.py` - News fetching and stats
|
||||
- `rss_routes.py` - RSS feed management (CRUD)
|
||||
- `ollama_routes.py` - AI integration
|
||||
- ✅ Created service layer:
|
||||
- `news_service.py` - News fetching logic
|
||||
- `email_service.py` - Newsletter sending
|
||||
- `ollama_service.py` - AI communication
|
||||
- ✅ Centralized configuration in `config.py`
|
||||
- ✅ Separated database logic in `database.py`
|
||||
- ✅ Reduced main `app.py` from 700+ lines to 27 lines
|
||||
|
||||
#### RSS Feed Management
|
||||
- ✅ Dynamic RSS feed management via API
|
||||
- ✅ Add/remove/list/toggle RSS feeds without code changes
|
||||
- ✅ Unique index on RSS feed URLs (prevents duplicates)
|
||||
- ✅ Default feeds auto-initialized on first run
|
||||
- ✅ Created `fix_duplicates.py` utility script
|
||||
|
||||
#### News Crawler Microservice
|
||||
- ✅ Created standalone `news_crawler/` microservice
|
||||
- ✅ Web scraping with BeautifulSoup
|
||||
- ✅ Smart content extraction using multiple selectors
|
||||
- ✅ Full article content storage in MongoDB
|
||||
- ✅ Word count calculation
|
||||
- ✅ Duplicate prevention (skips already-crawled articles)
|
||||
- ✅ Rate limiting (1 second between requests)
|
||||
- ✅ Can run independently or scheduled
|
||||
- ✅ Docker support for crawler
|
||||
- ✅ Comprehensive documentation
|
||||
|
||||
#### API Endpoints
|
||||
New endpoints added:
|
||||
- `GET /api/rss-feeds` - List all RSS feeds
|
||||
- `POST /api/rss-feeds` - Add new RSS feed
|
||||
- `DELETE /api/rss-feeds/<id>` - Remove RSS feed
|
||||
- `PATCH /api/rss-feeds/<id>/toggle` - Toggle feed active status
|
||||
|
||||
#### Documentation
|
||||
- ✅ Created `ARCHITECTURE.md` - System architecture overview
|
||||
- ✅ Created `backend/STRUCTURE.md` - Backend structure guide
|
||||
- ✅ Created `news_crawler/README.md` - Crawler documentation
|
||||
- ✅ Created `news_crawler/QUICKSTART.md` - Quick start guide
|
||||
- ✅ Created `news_crawler/test_crawler.py` - Test suite
|
||||
- ✅ Updated main `README.md` with new features
|
||||
- ✅ Updated `DATABASE_SCHEMA.md` with new fields
|
||||
|
||||
#### Configuration
|
||||
- ✅ Added `FLASK_PORT` environment variable
|
||||
- ✅ Fixed `OLLAMA_MODEL` typo in `.env`
|
||||
- ✅ Port 5001 default to avoid macOS AirPlay conflict
|
||||
|
||||
### Changed
|
||||
- Backend structure: Monolithic → Modular
|
||||
- RSS feeds: Hardcoded → Database-driven
|
||||
- Article storage: Summary only → Full content support
|
||||
- Configuration: Scattered → Centralized
|
||||
|
||||
### Technical Improvements
|
||||
- Separation of concerns (routes vs services)
|
||||
- Better testability
|
||||
- Easier maintenance
|
||||
- Scalable architecture
|
||||
- Independent microservices
|
||||
- Proper error handling
|
||||
- Comprehensive logging
|
||||
|
||||
### Database Schema Updates
|
||||
Articles collection now includes:
|
||||
- `full_content` - Full article text
|
||||
- `word_count` - Number of words
|
||||
- `crawled_at` - When content was crawled
|
||||
|
||||
RSS Feeds collection added:
|
||||
- `name` - Feed name
|
||||
- `url` - Feed URL (unique)
|
||||
- `active` - Active status
|
||||
- `created_at` - Creation timestamp
|
||||
|
||||
### Files Added
|
||||
```
|
||||
backend/
|
||||
├── config.py
|
||||
├── database.py
|
||||
├── fix_duplicates.py
|
||||
├── STRUCTURE.md
|
||||
├── routes/
|
||||
│ ├── __init__.py
|
||||
│ ├── subscription_routes.py
|
||||
│ ├── news_routes.py
|
||||
│ ├── rss_routes.py
|
||||
│ └── ollama_routes.py
|
||||
└── services/
|
||||
├── __init__.py
|
||||
├── news_service.py
|
||||
├── email_service.py
|
||||
└── ollama_service.py
|
||||
|
||||
news_crawler/
|
||||
├── crawler_service.py
|
||||
├── test_crawler.py
|
||||
├── requirements.txt
|
||||
├── .gitignore
|
||||
├── Dockerfile
|
||||
├── docker-compose.yml
|
||||
├── README.md
|
||||
└── QUICKSTART.md
|
||||
|
||||
Root:
|
||||
├── ARCHITECTURE.md
|
||||
└── CHANGELOG.md
|
||||
```
|
||||
|
||||
### Files Removed
|
||||
- Old monolithic `backend/app.py` (replaced with modular version)
|
||||
|
||||
### Next Steps (Future Enhancements)
|
||||
- [ ] Frontend UI for RSS feed management
|
||||
- [ ] Automatic article summarization with Ollama
|
||||
- [ ] Scheduled newsletter sending
|
||||
- [ ] Article categorization and tagging
|
||||
- [ ] Search functionality
|
||||
- [ ] User preferences (categories, frequency)
|
||||
- [ ] Analytics dashboard
|
||||
- [ ] API rate limiting
|
||||
- [ ] Caching layer (Redis)
|
||||
- [ ] Message queue for crawler (Celery)
|
||||
306
docs/CRAWLER_HOW_IT_WORKS.md
Normal file
306
docs/CRAWLER_HOW_IT_WORKS.md
Normal file
@@ -0,0 +1,306 @@
|
||||
# How the News Crawler Works
|
||||
|
||||
## 🎯 Overview
|
||||
|
||||
The crawler dynamically extracts article metadata from any website using multiple fallback strategies.
|
||||
|
||||
## 📊 Flow Diagram
|
||||
|
||||
```
|
||||
RSS Feed URL
|
||||
↓
|
||||
Parse RSS Feed
|
||||
↓
|
||||
For each article link:
|
||||
↓
|
||||
┌─────────────────────────────────────┐
|
||||
│ 1. Fetch HTML Page │
|
||||
│ GET https://example.com/article │
|
||||
└─────────────────────────────────────┘
|
||||
↓
|
||||
┌─────────────────────────────────────┐
|
||||
│ 2. Parse with BeautifulSoup │
|
||||
│ soup = BeautifulSoup(html) │
|
||||
└─────────────────────────────────────┘
|
||||
↓
|
||||
┌─────────────────────────────────────┐
|
||||
│ 3. Clean HTML │
|
||||
│ Remove: scripts, styles, nav, │
|
||||
│ footer, header, ads │
|
||||
└─────────────────────────────────────┘
|
||||
↓
|
||||
┌─────────────────────────────────────┐
|
||||
│ 4. Extract Title │
|
||||
│ Try: H1 → OG meta → Twitter → │
|
||||
│ Title tag │
|
||||
└─────────────────────────────────────┘
|
||||
↓
|
||||
┌─────────────────────────────────────┐
|
||||
│ 5. Extract Author │
|
||||
│ Try: Meta author → rel=author → │
|
||||
│ Class names → JSON-LD │
|
||||
└─────────────────────────────────────┘
|
||||
↓
|
||||
┌─────────────────────────────────────┐
|
||||
│ 6. Extract Date │
|
||||
│ Try: <time> → Meta tags → │
|
||||
│ Class names → JSON-LD │
|
||||
└─────────────────────────────────────┘
|
||||
↓
|
||||
┌─────────────────────────────────────┐
|
||||
│ 7. Extract Content │
|
||||
│ Try: <article> → Class names → │
|
||||
│ <main> → <body> │
|
||||
│ Filter short paragraphs │
|
||||
└─────────────────────────────────────┘
|
||||
↓
|
||||
┌─────────────────────────────────────┐
|
||||
│ 8. Save to MongoDB │
|
||||
│ { │
|
||||
│ title, author, date, │
|
||||
│ content, word_count │
|
||||
│ } │
|
||||
└─────────────────────────────────────┘
|
||||
↓
|
||||
Wait 1 second (rate limiting)
|
||||
↓
|
||||
Next article
|
||||
```
|
||||
|
||||
## 🔍 Detailed Example
|
||||
|
||||
### Input: RSS Feed Entry
|
||||
```xml
|
||||
<item>
|
||||
<title>New U-Bahn Line Opens</title>
|
||||
<link>https://www.sueddeutsche.de/muenchen/article-123</link>
|
||||
<pubDate>Mon, 10 Nov 2024 10:00:00 +0100</pubDate>
|
||||
</item>
|
||||
```
|
||||
|
||||
### Step 1: Fetch HTML
|
||||
```python
|
||||
url = "https://www.sueddeutsche.de/muenchen/article-123"
|
||||
response = requests.get(url)
|
||||
html = response.content
|
||||
```
|
||||
|
||||
### Step 2: Parse HTML
|
||||
```python
|
||||
soup = BeautifulSoup(html, 'html.parser')
|
||||
```
|
||||
|
||||
### Step 3: Extract Title
|
||||
```python
|
||||
# Try H1
|
||||
h1 = soup.find('h1')
|
||||
# Result: "New U-Bahn Line Opens in Munich"
|
||||
|
||||
# If no H1, try OG meta
|
||||
og_title = soup.find('meta', property='og:title')
|
||||
# Fallback chain continues...
|
||||
```
|
||||
|
||||
### Step 4: Extract Author
|
||||
```python
|
||||
# Try meta author
|
||||
meta_author = soup.find('meta', name='author')
|
||||
# Result: None
|
||||
|
||||
# Try class names
|
||||
author_elem = soup.select_one('[class*="author"]')
|
||||
# Result: "Max Mustermann"
|
||||
```
|
||||
|
||||
### Step 5: Extract Date
|
||||
```python
|
||||
# Try time tag
|
||||
time_tag = soup.find('time')
|
||||
# Result: "2024-11-10T10:00:00Z"
|
||||
```
|
||||
|
||||
### Step 6: Extract Content
|
||||
```python
|
||||
# Try article tag
|
||||
article = soup.find('article')
|
||||
paragraphs = article.find_all('p')
|
||||
|
||||
# Filter paragraphs
|
||||
content = []
|
||||
for p in paragraphs:
|
||||
text = p.get_text().strip()
|
||||
if len(text) >= 50: # Keep substantial paragraphs
|
||||
content.append(text)
|
||||
|
||||
full_content = '\n\n'.join(content)
|
||||
# Result: "The new U-Bahn line connecting the city center..."
|
||||
```
|
||||
|
||||
### Step 7: Save to Database
|
||||
```python
|
||||
article_doc = {
|
||||
'title': 'New U-Bahn Line Opens in Munich',
|
||||
'author': 'Max Mustermann',
|
||||
'link': 'https://www.sueddeutsche.de/muenchen/article-123',
|
||||
'summary': 'Short summary from RSS...',
|
||||
'full_content': 'The new U-Bahn line connecting...',
|
||||
'word_count': 1250,
|
||||
'source': 'Süddeutsche Zeitung München',
|
||||
'published_at': '2024-11-10T10:00:00Z',
|
||||
'crawled_at': datetime.utcnow(),
|
||||
'created_at': datetime.utcnow()
|
||||
}
|
||||
|
||||
db.articles.update_one(
|
||||
{'link': article_url},
|
||||
{'$set': article_doc},
|
||||
upsert=True
|
||||
)
|
||||
```
|
||||
|
||||
## 🎨 What Makes It "Dynamic"?
|
||||
|
||||
### Traditional Approach (Hardcoded)
|
||||
```python
|
||||
# Only works for one specific site
|
||||
title = soup.find('h1', class_='article-title').text
|
||||
author = soup.find('span', class_='author-name').text
|
||||
```
|
||||
❌ Breaks when site changes
|
||||
❌ Doesn't work on other sites
|
||||
|
||||
### Our Approach (Dynamic)
|
||||
```python
|
||||
# Works on ANY site
|
||||
title = extract_title(soup) # Tries 4 different methods
|
||||
author = extract_author(soup) # Tries 5 different methods
|
||||
```
|
||||
✅ Adapts to different HTML structures
|
||||
✅ Falls back to alternatives
|
||||
✅ Works across multiple sites
|
||||
|
||||
## 🛡️ Robustness Features
|
||||
|
||||
### 1. Multiple Strategies
|
||||
Each field has 4-6 extraction strategies
|
||||
```python
|
||||
def extract_title(soup):
|
||||
# Try strategy 1
|
||||
if h1 := soup.find('h1'):
|
||||
return h1.text
|
||||
|
||||
# Try strategy 2
|
||||
if og_title := soup.find('meta', property='og:title'):
|
||||
return og_title['content']
|
||||
|
||||
# Try strategy 3...
|
||||
# Try strategy 4...
|
||||
```
|
||||
|
||||
### 2. Validation
|
||||
```python
|
||||
# Title must be reasonable length
|
||||
if title and len(title) > 10:
|
||||
return title
|
||||
|
||||
# Author must be < 100 chars
|
||||
if author and len(author) < 100:
|
||||
return author
|
||||
```
|
||||
|
||||
### 3. Cleaning
|
||||
```python
|
||||
# Remove site name from title
|
||||
if ' | ' in title:
|
||||
title = title.split(' | ')[0]
|
||||
|
||||
# Remove "By" from author
|
||||
author = author.replace('By ', '').strip()
|
||||
```
|
||||
|
||||
### 4. Error Handling
|
||||
```python
|
||||
try:
|
||||
data = extract_article_content(url)
|
||||
except Timeout:
|
||||
print("Timeout - skip")
|
||||
except RequestException:
|
||||
print("Network error - skip")
|
||||
except Exception:
|
||||
print("Unknown error - skip")
|
||||
```
|
||||
|
||||
## 📈 Success Metrics
|
||||
|
||||
After crawling, you'll see:
|
||||
|
||||
```
|
||||
📰 Crawling feed: Süddeutsche Zeitung München
|
||||
🔍 Crawling: New U-Bahn Line Opens...
|
||||
✓ Saved (1250 words)
|
||||
|
||||
Title: ✓ Found
|
||||
Author: ✓ Found (Max Mustermann)
|
||||
Date: ✓ Found (2024-11-10T10:00:00Z)
|
||||
Content: ✓ Found (1250 words)
|
||||
```
|
||||
|
||||
## 🗄️ Database Result
|
||||
|
||||
**Before Crawling:**
|
||||
```javascript
|
||||
{
|
||||
title: "New U-Bahn Line Opens",
|
||||
link: "https://example.com/article",
|
||||
summary: "Short RSS summary...",
|
||||
source: "Süddeutsche Zeitung"
|
||||
}
|
||||
```
|
||||
|
||||
**After Crawling:**
|
||||
```javascript
|
||||
{
|
||||
title: "New U-Bahn Line Opens in Munich", // ← Enhanced
|
||||
author: "Max Mustermann", // ← NEW!
|
||||
link: "https://example.com/article",
|
||||
summary: "Short RSS summary...",
|
||||
full_content: "The new U-Bahn line...", // ← NEW! (1250 words)
|
||||
word_count: 1250, // ← NEW!
|
||||
source: "Süddeutsche Zeitung",
|
||||
published_at: "2024-11-10T10:00:00Z", // ← Enhanced
|
||||
crawled_at: ISODate("2024-11-10T16:30:00Z"), // ← NEW!
|
||||
created_at: ISODate("2024-11-10T16:00:00Z")
|
||||
}
|
||||
```
|
||||
|
||||
## 🚀 Running the Crawler
|
||||
|
||||
```bash
|
||||
cd news_crawler
|
||||
pip install -r requirements.txt
|
||||
python crawler_service.py 10
|
||||
```
|
||||
|
||||
Output:
|
||||
```
|
||||
============================================================
|
||||
🚀 Starting RSS Feed Crawler
|
||||
============================================================
|
||||
Found 3 active feed(s)
|
||||
|
||||
📰 Crawling feed: Süddeutsche Zeitung München
|
||||
🔍 Crawling: New U-Bahn Line Opens...
|
||||
✓ Saved (1250 words)
|
||||
🔍 Crawling: Munich Weather Update...
|
||||
✓ Saved (450 words)
|
||||
✓ Crawled 2 articles
|
||||
|
||||
============================================================
|
||||
✓ Crawling Complete!
|
||||
Total feeds processed: 3
|
||||
Total articles crawled: 15
|
||||
Duration: 45.23 seconds
|
||||
============================================================
|
||||
```
|
||||
|
||||
Now you have rich, structured article data ready for AI processing! 🎉
|
||||
271
docs/DATABASE_SCHEMA.md
Normal file
271
docs/DATABASE_SCHEMA.md
Normal file
@@ -0,0 +1,271 @@
|
||||
# MongoDB Database Schema
|
||||
|
||||
This document describes the MongoDB collections and their structure for Munich News Daily.
|
||||
|
||||
## Collections
|
||||
|
||||
### 1. Articles Collection (`articles`)
|
||||
|
||||
Stores all news articles aggregated from Munich news sources.
|
||||
|
||||
**Document Structure:**
|
||||
```javascript
|
||||
{
|
||||
_id: ObjectId, // Auto-generated MongoDB ID
|
||||
title: String, // Article title (required)
|
||||
author: String, // Article author (optional, extracted during crawl)
|
||||
link: String, // Article URL (required, unique)
|
||||
content: String, // Full article content (no length limit)
|
||||
summary: String, // AI-generated English summary (≤150 words)
|
||||
word_count: Number, // Word count of full content
|
||||
summary_word_count: Number, // Word count of AI summary
|
||||
source: String, // News source name (e.g., "Süddeutsche Zeitung München")
|
||||
published_at: String, // Original publication date from RSS feed or crawled
|
||||
crawled_at: DateTime, // When article content was crawled (UTC)
|
||||
summarized_at: DateTime, // When AI summary was generated (UTC)
|
||||
created_at: DateTime // When article was added to database (UTC)
|
||||
}
|
||||
```
|
||||
|
||||
**Indexes:**
|
||||
- `link` - Unique index to prevent duplicate articles
|
||||
- `created_at` - Index for efficient sorting by date
|
||||
|
||||
**Example Document:**
|
||||
```javascript
|
||||
{
|
||||
_id: ObjectId("507f1f77bcf86cd799439011"),
|
||||
title: "New U-Bahn Line Opens in Munich",
|
||||
author: "Max Mustermann",
|
||||
link: "https://www.sueddeutsche.de/muenchen/ubahn-1.123456",
|
||||
content: "The new U-Bahn line connecting the city center with the airport opened today. Mayor Dieter Reiter attended the opening ceremony... [full article text continues]",
|
||||
summary: "Munich's new U-Bahn line connecting the city center to the airport opened today with Mayor Dieter Reiter in attendance. The line features 10 stations and runs every 10 minutes during peak hours, significantly reducing travel time. Construction took five years and cost approximately 2 billion euros.",
|
||||
word_count: 1250,
|
||||
summary_word_count: 48,
|
||||
source: "Süddeutsche Zeitung München",
|
||||
published_at: "Mon, 15 Jan 2024 10:00:00 +0100",
|
||||
crawled_at: ISODate("2024-01-15T09:30:00.000Z"),
|
||||
summarized_at: ISODate("2024-01-15T09:30:15.000Z"),
|
||||
created_at: ISODate("2024-01-15T09:00:00.000Z")
|
||||
}
|
||||
```
|
||||
|
||||
### 2. Subscribers Collection (`subscribers`)
|
||||
|
||||
Stores all newsletter subscribers.
|
||||
|
||||
**Document Structure:**
|
||||
```javascript
|
||||
{
|
||||
_id: ObjectId, // Auto-generated MongoDB ID
|
||||
email: String, // Subscriber email (required, unique, lowercase)
|
||||
subscribed_at: DateTime, // When user subscribed (UTC)
|
||||
status: String // Subscription status: 'active' or 'inactive'
|
||||
}
|
||||
```
|
||||
|
||||
**Indexes:**
|
||||
- `email` - Unique index for email lookups and preventing duplicates
|
||||
- `subscribed_at` - Index for analytics and sorting
|
||||
|
||||
**Example Document:**
|
||||
```javascript
|
||||
{
|
||||
_id: ObjectId("507f1f77bcf86cd799439012"),
|
||||
email: "user@example.com",
|
||||
subscribed_at: ISODate("2024-01-15T08:30:00.000Z"),
|
||||
status: "active"
|
||||
}
|
||||
```
|
||||
|
||||
### 3. Newsletter Sends Collection (`newsletter_sends`)
|
||||
|
||||
Tracks each newsletter sent to each subscriber for email open tracking.
|
||||
|
||||
**Document Structure:**
|
||||
```javascript
|
||||
{
|
||||
_id: ObjectId, // Auto-generated MongoDB ID
|
||||
newsletter_id: String, // Unique ID for this newsletter batch (date-based)
|
||||
subscriber_email: String, // Recipient email
|
||||
tracking_id: String, // Unique tracking ID for this send (UUID)
|
||||
sent_at: DateTime, // When email was sent (UTC)
|
||||
opened: Boolean, // Whether email was opened
|
||||
first_opened_at: DateTime, // First open timestamp (null if not opened)
|
||||
last_opened_at: DateTime, // Most recent open timestamp
|
||||
open_count: Number, // Number of times opened
|
||||
created_at: DateTime // Record creation time (UTC)
|
||||
}
|
||||
```
|
||||
|
||||
**Indexes:**
|
||||
- `tracking_id` - Unique index for fast pixel request lookups
|
||||
- `newsletter_id` - Index for analytics queries
|
||||
- `subscriber_email` - Index for user activity queries
|
||||
- `sent_at` - Index for time-based queries
|
||||
|
||||
**Example Document:**
|
||||
```javascript
|
||||
{
|
||||
_id: ObjectId("507f1f77bcf86cd799439013"),
|
||||
newsletter_id: "2024-01-15",
|
||||
subscriber_email: "user@example.com",
|
||||
tracking_id: "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
|
||||
sent_at: ISODate("2024-01-15T08:00:00.000Z"),
|
||||
opened: true,
|
||||
first_opened_at: ISODate("2024-01-15T09:30:00.000Z"),
|
||||
last_opened_at: ISODate("2024-01-15T14:20:00.000Z"),
|
||||
open_count: 3,
|
||||
created_at: ISODate("2024-01-15T08:00:00.000Z")
|
||||
}
|
||||
```
|
||||
|
||||
### 4. Link Clicks Collection (`link_clicks`)
|
||||
|
||||
Tracks individual link clicks from newsletters.
|
||||
|
||||
**Document Structure:**
|
||||
```javascript
|
||||
{
|
||||
_id: ObjectId, // Auto-generated MongoDB ID
|
||||
tracking_id: String, // Unique tracking ID for this link (UUID)
|
||||
newsletter_id: String, // Which newsletter this link was in
|
||||
subscriber_email: String, // Who clicked
|
||||
article_url: String, // Original article URL
|
||||
article_title: String, // Article title for reporting
|
||||
clicked_at: DateTime, // When link was clicked (UTC)
|
||||
user_agent: String, // Browser/client info
|
||||
created_at: DateTime // Record creation time (UTC)
|
||||
}
|
||||
```
|
||||
|
||||
**Indexes:**
|
||||
- `tracking_id` - Unique index for fast redirect request lookups
|
||||
- `newsletter_id` - Index for analytics queries
|
||||
- `article_url` - Index for article performance queries
|
||||
- `subscriber_email` - Index for user activity queries
|
||||
|
||||
**Example Document:**
|
||||
```javascript
|
||||
{
|
||||
_id: ObjectId("507f1f77bcf86cd799439014"),
|
||||
tracking_id: "b2c3d4e5-f6a7-8901-bcde-f12345678901",
|
||||
newsletter_id: "2024-01-15",
|
||||
subscriber_email: "user@example.com",
|
||||
article_url: "https://www.sueddeutsche.de/muenchen/ubahn-1.123456",
|
||||
article_title: "New U-Bahn Line Opens in Munich",
|
||||
clicked_at: ISODate("2024-01-15T09:35:00.000Z"),
|
||||
user_agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
|
||||
created_at: ISODate("2024-01-15T09:35:00.000Z")
|
||||
}
|
||||
```
|
||||
|
||||
### 5. Subscriber Activity Collection (`subscriber_activity`)
|
||||
|
||||
Aggregated activity status for each subscriber.
|
||||
|
||||
**Document Structure:**
|
||||
```javascript
|
||||
{
|
||||
_id: ObjectId, // Auto-generated MongoDB ID
|
||||
email: String, // Subscriber email (unique)
|
||||
status: String, // 'active', 'inactive', or 'dormant'
|
||||
last_opened_at: DateTime, // Most recent email open (UTC)
|
||||
last_clicked_at: DateTime, // Most recent link click (UTC)
|
||||
total_opens: Number, // Lifetime open count
|
||||
total_clicks: Number, // Lifetime click count
|
||||
newsletters_received: Number, // Total newsletters sent
|
||||
newsletters_opened: Number, // Total newsletters opened
|
||||
updated_at: DateTime // Last status update (UTC)
|
||||
}
|
||||
```
|
||||
|
||||
**Indexes:**
|
||||
- `email` - Unique index for fast lookups
|
||||
- `status` - Index for filtering by activity level
|
||||
- `last_opened_at` - Index for time-based queries
|
||||
|
||||
**Activity Status Classification:**
|
||||
- **active**: Opened an email in the last 30 days
|
||||
- **inactive**: No opens in 30-60 days
|
||||
- **dormant**: No opens in 60+ days
|
||||
|
||||
**Example Document:**
|
||||
```javascript
|
||||
{
|
||||
_id: ObjectId("507f1f77bcf86cd799439015"),
|
||||
email: "user@example.com",
|
||||
status: "active",
|
||||
last_opened_at: ISODate("2024-01-15T09:30:00.000Z"),
|
||||
last_clicked_at: ISODate("2024-01-15T09:35:00.000Z"),
|
||||
total_opens: 45,
|
||||
total_clicks: 23,
|
||||
newsletters_received: 60,
|
||||
newsletters_opened: 45,
|
||||
updated_at: ISODate("2024-01-15T10:00:00.000Z")
|
||||
}
|
||||
```
|
||||
|
||||
## Design Decisions
|
||||
|
||||
### Why MongoDB?
|
||||
|
||||
1. **Flexibility**: Easy to add new fields without schema migrations
|
||||
2. **Scalability**: Handles large volumes of articles and subscribers efficiently
|
||||
3. **Performance**: Indexes on frequently queried fields (link, email, created_at)
|
||||
4. **Document Model**: Natural fit for news articles and subscriber data
|
||||
|
||||
### Schema Choices
|
||||
|
||||
1. **Unique Link Index**: Prevents duplicate articles from being stored, even if fetched multiple times
|
||||
2. **Status Field**: Soft delete for subscribers (set to 'inactive' instead of deleting) - allows for analytics and easy re-subscription
|
||||
3. **UTC Timestamps**: All dates stored in UTC for consistency across timezones
|
||||
4. **Lowercase Emails**: Emails stored in lowercase to prevent case-sensitivity issues
|
||||
|
||||
### Future Enhancements
|
||||
|
||||
Potential fields to add in the future:
|
||||
|
||||
**Articles:**
|
||||
- `category`: String (e.g., "politics", "sports", "culture")
|
||||
- `tags`: Array of Strings
|
||||
- `image_url`: String
|
||||
- `sent_in_newsletter`: Boolean (track if article was sent)
|
||||
- `sent_at`: DateTime (when article was included in newsletter)
|
||||
|
||||
**Subscribers:**
|
||||
- `preferences`: Object (newsletter frequency, categories, etc.)
|
||||
- `last_sent_at`: DateTime (last newsletter sent date)
|
||||
- `unsubscribed_at`: DateTime (when user unsubscribed)
|
||||
- `verification_token`: String (for email verification)
|
||||
|
||||
|
||||
|
||||
## AI Summarization Workflow
|
||||
|
||||
When the crawler processes an article:
|
||||
|
||||
1. **Extract Content**: Full article text is extracted from the webpage
|
||||
2. **Summarize with Ollama**: If `OLLAMA_ENABLED=true`, the content is sent to Ollama for summarization
|
||||
3. **Store Both**: Both the original `content` and AI-generated `summary` are stored
|
||||
4. **Fallback**: If Ollama is unavailable or fails, only the original content is stored
|
||||
|
||||
### Summary Field Details
|
||||
|
||||
- **Language**: Always in English, regardless of source article language
|
||||
- **Length**: Maximum 150 words
|
||||
- **Format**: Plain text, concise and clear
|
||||
- **Purpose**: Quick preview for newsletters and frontend display
|
||||
|
||||
### Querying Articles
|
||||
|
||||
```javascript
|
||||
// Get articles with AI summaries
|
||||
db.articles.find({ summary: { $exists: true, $ne: null } })
|
||||
|
||||
// Get articles without summaries
|
||||
db.articles.find({ summary: { $exists: false } })
|
||||
|
||||
// Count summarized articles
|
||||
db.articles.countDocuments({ summary: { $exists: true, $ne: null } })
|
||||
```
|
||||
274
docs/DEPLOYMENT.md
Normal file
274
docs/DEPLOYMENT.md
Normal file
@@ -0,0 +1,274 @@
|
||||
# Deployment Guide
|
||||
|
||||
## Quick Start
|
||||
|
||||
```bash
|
||||
# 1. Clone repository
|
||||
git clone <repository-url>
|
||||
cd munich-news
|
||||
|
||||
# 2. Configure environment
|
||||
cp backend/.env.example backend/.env
|
||||
# Edit backend/.env with your settings
|
||||
|
||||
# 3. Start system
|
||||
docker-compose up -d
|
||||
|
||||
# 4. View logs
|
||||
docker-compose logs -f
|
||||
```
|
||||
|
||||
## Environment Configuration
|
||||
|
||||
### Required Settings
|
||||
|
||||
Edit `backend/.env`:
|
||||
|
||||
```env
|
||||
# Email (Required)
|
||||
SMTP_SERVER=smtp.gmail.com
|
||||
SMTP_PORT=587
|
||||
EMAIL_USER=your-email@gmail.com
|
||||
EMAIL_PASSWORD=your-app-password
|
||||
|
||||
# MongoDB (Optional - defaults provided)
|
||||
MONGODB_URI=mongodb://localhost:27017/
|
||||
|
||||
# Tracking (Optional)
|
||||
TRACKING_ENABLED=true
|
||||
TRACKING_API_URL=http://localhost:5001
|
||||
```
|
||||
|
||||
### Optional Settings
|
||||
|
||||
```env
|
||||
# Newsletter
|
||||
NEWSLETTER_MAX_ARTICLES=10
|
||||
NEWSLETTER_HOURS_LOOKBACK=24
|
||||
|
||||
# Ollama AI
|
||||
OLLAMA_ENABLED=true
|
||||
OLLAMA_BASE_URL=http://127.0.0.1:11434
|
||||
OLLAMA_MODEL=phi3:latest
|
||||
|
||||
# Tracking
|
||||
TRACKING_DATA_RETENTION_DAYS=90
|
||||
```
|
||||
|
||||
## Production Deployment
|
||||
|
||||
### 1. Set MongoDB Password
|
||||
|
||||
```bash
|
||||
export MONGO_PASSWORD=your-secure-password
|
||||
docker-compose up -d
|
||||
```
|
||||
|
||||
### 2. Use HTTPS for Tracking
|
||||
|
||||
Update `backend/.env`:
|
||||
```env
|
||||
TRACKING_API_URL=https://yourdomain.com
|
||||
```
|
||||
|
||||
### 3. Configure Log Rotation
|
||||
|
||||
Add to `docker-compose.yml`:
|
||||
```yaml
|
||||
services:
|
||||
crawler:
|
||||
logging:
|
||||
driver: "json-file"
|
||||
options:
|
||||
max-size: "10m"
|
||||
max-file: "3"
|
||||
```
|
||||
|
||||
### 4. Set Up Backups
|
||||
|
||||
```bash
|
||||
# Daily MongoDB backup
|
||||
0 3 * * * docker exec munich-news-mongodb mongodump --out=/data/backup/$(date +\%Y\%m\%d)
|
||||
```
|
||||
|
||||
### 5. Enable Backend API
|
||||
|
||||
Uncomment backend service in `docker-compose.yml`:
|
||||
```yaml
|
||||
backend:
|
||||
build:
|
||||
context: ./backend
|
||||
ports:
|
||||
- "5001:5001"
|
||||
# ... rest of config
|
||||
```
|
||||
|
||||
## Schedule Configuration
|
||||
|
||||
### Change Crawler Time
|
||||
|
||||
Edit `news_crawler/scheduled_crawler.py`:
|
||||
```python
|
||||
schedule.every().day.at("06:00").do(run_crawler) # Change time
|
||||
```
|
||||
|
||||
### Change Sender Time
|
||||
|
||||
Edit `news_sender/scheduled_sender.py`:
|
||||
```python
|
||||
schedule.every().day.at("07:00").do(run_sender) # Change time
|
||||
```
|
||||
|
||||
Rebuild after changes:
|
||||
```bash
|
||||
docker-compose up -d --build
|
||||
```
|
||||
|
||||
## Database Setup
|
||||
|
||||
### Add RSS Feeds
|
||||
|
||||
```bash
|
||||
mongosh munich_news
|
||||
|
||||
db.rss_feeds.insertMany([
|
||||
{
|
||||
name: "Süddeutsche Zeitung München",
|
||||
url: "https://www.sueddeutsche.de/muenchen/rss",
|
||||
active: true
|
||||
},
|
||||
{
|
||||
name: "Merkur München",
|
||||
url: "https://www.merkur.de/lokales/muenchen/rss/feed.rss",
|
||||
active: true
|
||||
}
|
||||
])
|
||||
```
|
||||
|
||||
### Add Subscribers
|
||||
|
||||
```bash
|
||||
mongosh munich_news
|
||||
|
||||
db.subscribers.insertMany([
|
||||
{
|
||||
email: "user1@example.com",
|
||||
active: true,
|
||||
tracking_enabled: true,
|
||||
subscribed_at: new Date()
|
||||
},
|
||||
{
|
||||
email: "user2@example.com",
|
||||
active: true,
|
||||
tracking_enabled: true,
|
||||
subscribed_at: new Date()
|
||||
}
|
||||
])
|
||||
```
|
||||
|
||||
## Monitoring
|
||||
|
||||
### Check Container Status
|
||||
|
||||
```bash
|
||||
docker-compose ps
|
||||
```
|
||||
|
||||
### View Logs
|
||||
|
||||
```bash
|
||||
# All services
|
||||
docker-compose logs -f
|
||||
|
||||
# Specific service
|
||||
docker-compose logs -f crawler
|
||||
docker-compose logs -f sender
|
||||
```
|
||||
|
||||
### Check Database
|
||||
|
||||
```bash
|
||||
mongosh munich_news
|
||||
|
||||
// Count articles
|
||||
db.articles.countDocuments()
|
||||
|
||||
// Count subscribers
|
||||
db.subscribers.countDocuments({ active: true })
|
||||
|
||||
// View recent articles
|
||||
db.articles.find().sort({ crawled_at: -1 }).limit(5)
|
||||
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Containers Won't Start
|
||||
|
||||
```bash
|
||||
# Check logs
|
||||
docker-compose logs
|
||||
|
||||
# Rebuild
|
||||
docker-compose up -d --build
|
||||
|
||||
# Reset everything
|
||||
docker-compose down -v
|
||||
docker-compose up -d
|
||||
```
|
||||
|
||||
### Crawler Not Finding Articles
|
||||
|
||||
```bash
|
||||
# Check RSS feeds
|
||||
mongosh munich_news --eval "db.rss_feeds.find({ active: true })"
|
||||
|
||||
# Test manually
|
||||
docker-compose exec crawler python crawler_service.py 5
|
||||
```
|
||||
|
||||
### Newsletter Not Sending
|
||||
|
||||
```bash
|
||||
# Test email
|
||||
docker-compose exec sender python sender_service.py test your-email@example.com
|
||||
|
||||
# Check SMTP config
|
||||
docker-compose exec sender python -c "from sender_service import Config; print(Config.SMTP_SERVER)"
|
||||
```
|
||||
|
||||
## Maintenance
|
||||
|
||||
### Update System
|
||||
|
||||
```bash
|
||||
git pull
|
||||
docker-compose up -d --build
|
||||
```
|
||||
|
||||
### Backup Database
|
||||
|
||||
```bash
|
||||
docker exec munich-news-mongodb mongodump --out=/data/backup
|
||||
```
|
||||
|
||||
### Clean Old Data
|
||||
|
||||
```bash
|
||||
mongosh munich_news
|
||||
|
||||
// Delete articles older than 90 days
|
||||
db.articles.deleteMany({
|
||||
crawled_at: { $lt: new Date(Date.now() - 90*24*60*60*1000) }
|
||||
})
|
||||
```
|
||||
|
||||
## Security Checklist
|
||||
|
||||
- [ ] Set strong MongoDB password
|
||||
- [ ] Use HTTPS for tracking URLs
|
||||
- [ ] Secure SMTP credentials
|
||||
- [ ] Enable firewall rules
|
||||
- [ ] Set up log rotation
|
||||
- [ ] Configure backups
|
||||
- [ ] Monitor for failures
|
||||
- [ ] Keep dependencies updated
|
||||
353
docs/EXTRACTION_STRATEGIES.md
Normal file
353
docs/EXTRACTION_STRATEGIES.md
Normal file
@@ -0,0 +1,353 @@
|
||||
# Content Extraction Strategies
|
||||
|
||||
The crawler uses multiple strategies to dynamically extract article metadata from any website.
|
||||
|
||||
## 🎯 What Gets Extracted
|
||||
|
||||
1. **Title** - Article headline
|
||||
2. **Author** - Article writer/journalist
|
||||
3. **Published Date** - When article was published
|
||||
4. **Content** - Main article text
|
||||
5. **Description** - Meta description/summary
|
||||
|
||||
## 📋 Extraction Strategies
|
||||
|
||||
### 1. Title Extraction
|
||||
|
||||
Tries multiple methods in order of reliability:
|
||||
|
||||
#### Strategy 1: H1 Tag
|
||||
```html
|
||||
<h1>Article Title Here</h1>
|
||||
```
|
||||
✅ Most reliable - usually the main headline
|
||||
|
||||
#### Strategy 2: Open Graph Meta Tag
|
||||
```html
|
||||
<meta property="og:title" content="Article Title Here" />
|
||||
```
|
||||
✅ Used by Facebook, very reliable
|
||||
|
||||
#### Strategy 3: Twitter Card Meta Tag
|
||||
```html
|
||||
<meta name="twitter:title" content="Article Title Here" />
|
||||
```
|
||||
✅ Used by Twitter, reliable
|
||||
|
||||
#### Strategy 4: Title Tag (Fallback)
|
||||
```html
|
||||
<title>Article Title | Site Name</title>
|
||||
```
|
||||
⚠️ Often includes site name, needs cleaning
|
||||
|
||||
**Cleaning:**
|
||||
- Removes " | Site Name"
|
||||
- Removes " - Site Name"
|
||||
|
||||
---
|
||||
|
||||
### 2. Author Extraction
|
||||
|
||||
Tries multiple methods:
|
||||
|
||||
#### Strategy 1: Meta Author Tag
|
||||
```html
|
||||
<meta name="author" content="John Doe" />
|
||||
```
|
||||
✅ Standard HTML meta tag
|
||||
|
||||
#### Strategy 2: Rel="author" Link
|
||||
```html
|
||||
<a rel="author" href="/author/john-doe">John Doe</a>
|
||||
```
|
||||
✅ Semantic HTML
|
||||
|
||||
#### Strategy 3: Common Class Names
|
||||
```html
|
||||
<div class="author-name">John Doe</div>
|
||||
<span class="byline">By John Doe</span>
|
||||
<p class="writer">John Doe</p>
|
||||
```
|
||||
✅ Searches for: author-name, author, byline, writer
|
||||
|
||||
#### Strategy 4: Schema.org Markup
|
||||
```html
|
||||
<span itemprop="author">John Doe</span>
|
||||
```
|
||||
✅ Structured data
|
||||
|
||||
#### Strategy 5: JSON-LD Structured Data
|
||||
```html
|
||||
<script type="application/ld+json">
|
||||
{
|
||||
"@type": "NewsArticle",
|
||||
"author": {
|
||||
"@type": "Person",
|
||||
"name": "John Doe"
|
||||
}
|
||||
}
|
||||
</script>
|
||||
```
|
||||
✅ Most structured, very reliable
|
||||
|
||||
**Cleaning:**
|
||||
- Removes "By " prefix
|
||||
- Validates length (< 100 chars)
|
||||
|
||||
---
|
||||
|
||||
### 3. Date Extraction
|
||||
|
||||
Tries multiple methods:
|
||||
|
||||
#### Strategy 1: Time Tag with Datetime
|
||||
```html
|
||||
<time datetime="2024-11-10T10:00:00Z">November 10, 2024</time>
|
||||
```
|
||||
✅ Most reliable - ISO format
|
||||
|
||||
#### Strategy 2: Article Published Time Meta
|
||||
```html
|
||||
<meta property="article:published_time" content="2024-11-10T10:00:00Z" />
|
||||
```
|
||||
✅ Open Graph standard
|
||||
|
||||
#### Strategy 3: OG Published Time
|
||||
```html
|
||||
<meta property="og:published_time" content="2024-11-10T10:00:00Z" />
|
||||
```
|
||||
✅ Facebook standard
|
||||
|
||||
#### Strategy 4: Common Class Names
|
||||
```html
|
||||
<span class="publish-date">November 10, 2024</span>
|
||||
<time class="published">2024-11-10</time>
|
||||
<div class="timestamp">10:00 AM, Nov 10</div>
|
||||
```
|
||||
✅ Searches for: publish-date, published, date, timestamp
|
||||
|
||||
#### Strategy 5: Schema.org Markup
|
||||
```html
|
||||
<meta itemprop="datePublished" content="2024-11-10T10:00:00Z" />
|
||||
```
|
||||
✅ Structured data
|
||||
|
||||
#### Strategy 6: JSON-LD Structured Data
|
||||
```html
|
||||
<script type="application/ld+json">
|
||||
{
|
||||
"@type": "NewsArticle",
|
||||
"datePublished": "2024-11-10T10:00:00Z"
|
||||
}
|
||||
</script>
|
||||
```
|
||||
✅ Most structured
|
||||
|
||||
---
|
||||
|
||||
### 4. Content Extraction
|
||||
|
||||
Tries multiple methods:
|
||||
|
||||
#### Strategy 1: Semantic HTML Tags
|
||||
```html
|
||||
<article>
|
||||
<p>Article content here...</p>
|
||||
</article>
|
||||
```
|
||||
✅ Best practice HTML5
|
||||
|
||||
#### Strategy 2: Common Class Names
|
||||
```html
|
||||
<div class="article-content">...</div>
|
||||
<div class="article-body">...</div>
|
||||
<div class="post-content">...</div>
|
||||
<div class="entry-content">...</div>
|
||||
<div class="story-body">...</div>
|
||||
```
|
||||
✅ Searches for common patterns
|
||||
|
||||
#### Strategy 3: Schema.org Markup
|
||||
```html
|
||||
<div itemprop="articleBody">
|
||||
<p>Content here...</p>
|
||||
</div>
|
||||
```
|
||||
✅ Structured data
|
||||
|
||||
#### Strategy 4: Main Tag
|
||||
```html
|
||||
<main>
|
||||
<p>Content here...</p>
|
||||
</main>
|
||||
```
|
||||
✅ Semantic HTML5
|
||||
|
||||
#### Strategy 5: Body Tag (Fallback)
|
||||
```html
|
||||
<body>
|
||||
<p>Content here...</p>
|
||||
</body>
|
||||
```
|
||||
⚠️ Last resort, may include navigation
|
||||
|
||||
**Content Filtering:**
|
||||
- Removes `<script>`, `<style>`, `<nav>`, `<footer>`, `<header>`, `<aside>`
|
||||
- Filters out short paragraphs (< 50 chars) - likely ads/navigation
|
||||
- Keeps only substantial paragraphs
|
||||
- **No length limit** - stores full article content
|
||||
|
||||
---
|
||||
|
||||
## 🔍 How It Works
|
||||
|
||||
### Example: Crawling a News Article
|
||||
|
||||
```python
|
||||
# 1. Fetch HTML
|
||||
response = requests.get(article_url)
|
||||
soup = BeautifulSoup(response.content, 'html.parser')
|
||||
|
||||
# 2. Extract title (tries 4 strategies)
|
||||
title = extract_title(soup)
|
||||
# Result: "New U-Bahn Line Opens in Munich"
|
||||
|
||||
# 3. Extract author (tries 5 strategies)
|
||||
author = extract_author(soup)
|
||||
# Result: "Max Mustermann"
|
||||
|
||||
# 4. Extract date (tries 6 strategies)
|
||||
published_date = extract_date(soup)
|
||||
# Result: "2024-11-10T10:00:00Z"
|
||||
|
||||
# 5. Extract content (tries 5 strategies)
|
||||
content = extract_main_content(soup)
|
||||
# Result: "The new U-Bahn line connecting..."
|
||||
|
||||
# 6. Save to database
|
||||
article_doc = {
|
||||
'title': title,
|
||||
'author': author,
|
||||
'published_at': published_date,
|
||||
'full_content': content,
|
||||
'word_count': len(content.split())
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 📊 Success Rates by Strategy
|
||||
|
||||
Based on common news sites:
|
||||
|
||||
| Strategy | Success Rate | Notes |
|
||||
|----------|-------------|-------|
|
||||
| H1 for title | 95% | Almost universal |
|
||||
| OG meta tags | 90% | Most modern sites |
|
||||
| Time tag for date | 85% | HTML5 sites |
|
||||
| JSON-LD | 70% | Growing adoption |
|
||||
| Class name patterns | 60% | Varies by site |
|
||||
| Schema.org | 50% | Not widely adopted |
|
||||
|
||||
---
|
||||
|
||||
## 🎨 Real-World Examples
|
||||
|
||||
### Example 1: Süddeutsche Zeitung
|
||||
```html
|
||||
<article>
|
||||
<h1>New U-Bahn Line Opens</h1>
|
||||
<span class="author">Max Mustermann</span>
|
||||
<time datetime="2024-11-10T10:00:00Z">10. November 2024</time>
|
||||
<div class="article-body">
|
||||
<p>The new U-Bahn line...</p>
|
||||
</div>
|
||||
</article>
|
||||
```
|
||||
✅ Extracts: Title (H1), Author (class), Date (time), Content (article-body)
|
||||
|
||||
### Example 2: Medium Blog
|
||||
```html
|
||||
<article>
|
||||
<h1>How to Build a News Crawler</h1>
|
||||
<meta property="og:title" content="How to Build a News Crawler" />
|
||||
<meta property="article:published_time" content="2024-11-10T10:00:00Z" />
|
||||
<a rel="author" href="/author">Jane Smith</a>
|
||||
<section>
|
||||
<p>In this article...</p>
|
||||
</section>
|
||||
</article>
|
||||
```
|
||||
✅ Extracts: Title (OG meta), Author (rel), Date (article meta), Content (section)
|
||||
|
||||
### Example 3: WordPress Blog
|
||||
```html
|
||||
<div class="post">
|
||||
<h1 class="entry-title">My Blog Post</h1>
|
||||
<span class="byline">By John Doe</span>
|
||||
<time class="published">November 10, 2024</time>
|
||||
<div class="entry-content">
|
||||
<p>Blog content here...</p>
|
||||
</div>
|
||||
</div>
|
||||
```
|
||||
✅ Extracts: Title (H1), Author (byline), Date (published), Content (entry-content)
|
||||
|
||||
---
|
||||
|
||||
## ⚠️ Edge Cases Handled
|
||||
|
||||
1. **Missing Fields**: Returns `None` instead of crashing
|
||||
2. **Multiple Authors**: Takes first one found
|
||||
3. **Relative Dates**: Stores as-is ("2 hours ago")
|
||||
4. **Paywalls**: Extracts what's available
|
||||
5. **JavaScript-rendered**: Only gets server-side HTML
|
||||
6. **Ads/Navigation**: Filtered out by paragraph length
|
||||
7. **Site Name in Title**: Cleaned automatically
|
||||
|
||||
---
|
||||
|
||||
## 🚀 Future Improvements
|
||||
|
||||
Potential enhancements:
|
||||
|
||||
- [ ] JavaScript rendering (Selenium/Playwright)
|
||||
- [ ] Paywall bypass (where legal)
|
||||
- [ ] Image extraction
|
||||
- [ ] Video detection
|
||||
- [ ] Related articles
|
||||
- [ ] Tags/categories
|
||||
- [ ] Reading time estimation
|
||||
- [ ] Language detection
|
||||
- [ ] Sentiment analysis
|
||||
|
||||
---
|
||||
|
||||
## 🧪 Testing
|
||||
|
||||
Test the extraction on a specific URL:
|
||||
|
||||
```python
|
||||
from crawler_service import extract_article_content
|
||||
|
||||
url = "https://www.sueddeutsche.de/muenchen/article-123"
|
||||
data = extract_article_content(url)
|
||||
|
||||
print(f"Title: {data['title']}")
|
||||
print(f"Author: {data['author']}")
|
||||
print(f"Date: {data['published_date']}")
|
||||
print(f"Content length: {len(data['content'])} chars")
|
||||
print(f"Word count: {data['word_count']}")
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 📚 Standards Supported
|
||||
|
||||
- ✅ HTML5 semantic tags
|
||||
- ✅ Open Graph Protocol
|
||||
- ✅ Twitter Cards
|
||||
- ✅ Schema.org microdata
|
||||
- ✅ JSON-LD structured data
|
||||
- ✅ Dublin Core metadata
|
||||
- ✅ Common CSS class patterns
|
||||
209
docs/OLD_ARCHITECTURE.md
Normal file
209
docs/OLD_ARCHITECTURE.md
Normal file
@@ -0,0 +1,209 @@
|
||||
# Munich News Daily - Architecture

## System Overview

```
Users / Browsers
        │
        ▼
Frontend (Port 3000)
  Node.js + Express + Vanilla JS
    - Subscription form
    - News display
    - RSS feed management UI (future)
        │  HTTP/REST
        ▼
Backend API (Port 5001)
  Flask + Python
    Routes (Blueprints)
      - subscription_routes.py (subscribe/unsubscribe)
      - news_routes.py (get news, stats)
      - rss_routes.py (manage RSS feeds)
      - ollama_routes.py (AI features)
    Services (Business Logic)
      - news_service.py (fetch & save articles)
      - email_service.py (send newsletters)
      - ollama_service.py (AI integration)
    Core
      - config.py (configuration)
      - database.py (DB connection)
        │
        ▼
MongoDB (Port 27017)
  Collections:
    - articles (news articles with full content)
    - subscribers (email subscribers)
    - rss_feeds (RSS feed sources)
        │
        │  Read/Write
        │
News Crawler Microservice (Standalone)
  - Fetches RSS feeds from MongoDB
  - Crawls full article content
  - Extracts text, metadata, word count
  - Stores back to MongoDB
  - Can run independently or scheduled
        │
        │  (Optional)
        ▼
Ollama AI Server (Port 11434) (Optional, External)
  - Article summarization
  - Content analysis
  - AI-powered features
```

## Component Details

### Frontend (Port 3000)
- **Technology**: Node.js, Express, Vanilla JavaScript
- **Responsibilities**:
  - User interface
  - Subscription management
  - News display
  - API proxy to backend
- **Communication**: HTTP REST to Backend

### Backend API (Port 5001)
- **Technology**: Python, Flask
- **Architecture**: Modular with Blueprints
- **Responsibilities**:
  - REST API endpoints
  - Business logic
  - Database operations
  - Email sending
  - AI integration
- **Communication**:
  - HTTP REST from Frontend
  - MongoDB driver to Database
  - HTTP to Ollama (optional)

### MongoDB (Port 27017)
- **Technology**: MongoDB 7.0
- **Responsibilities**:
  - Persistent data storage
  - Articles, subscribers, RSS feeds
- **Communication**: MongoDB protocol

### News Crawler (Standalone)
- **Technology**: Python, BeautifulSoup
- **Architecture**: Microservice (can run independently)
- **Responsibilities**:
  - Fetch RSS feeds
  - Crawl article content
  - Extract and clean text
  - Store in database
- **Communication**: MongoDB driver to Database
- **Execution**:
  - Manual: `python crawler_service.py`
  - Scheduled: Cron, systemd, Docker (see the sketch below)
  - On-demand: Via backend API (future)
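
One lightweight scheduling option (the `schedule` package appears in the project's technology stack) is a small Python loop like the one below. The crawler entry point `crawl_all_feeds` is a hypothetical name used for illustration; the real module may expose a different function, and cron or systemd timers work just as well.

```python
# Sketch of a daily scheduled crawl using the `schedule` package.
import time

import schedule

from crawler_service import crawl_all_feeds  # hypothetical entry point


def daily_crawl():
    print("Starting daily crawl...")
    crawl_all_feeds()


# Assumes the host/container clock is set to Europe/Berlin, so "06:00" is local time.
schedule.every().day.at("06:00").do(daily_crawl)

while True:
    schedule.run_pending()
    time.sleep(30)
```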

### Ollama AI Server (Optional, External)
- **Technology**: Ollama
- **Responsibilities**:
  - AI model inference
  - Text summarization
  - Content analysis
- **Communication**: HTTP REST API

## Data Flow

### 1. News Aggregation Flow
```
RSS Feeds → Backend (news_service) → MongoDB (articles)
```

### 2. Content Crawling Flow
```
MongoDB (rss_feeds) → Crawler → Article URLs →
Web Scraping → MongoDB (articles with full_content)
```
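
A condensed sketch of this flow is shown below. Collection and field names follow this document, but the update logic and the placeholder extraction step are assumptions.

```python
# Sketch: read active RSS feeds, walk their entries, store article content.
import feedparser
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017/")["munich_news"]

for feed in db.rss_feeds.find({"active": True}):
    parsed = feedparser.parse(feed["url"])
    for entry in parsed.entries:
        url = entry.get("link")
        if not url:
            continue  # no usable URL in this entry
        # In the real crawler, the page would be fetched and parsed here.
        full_content = "..."  # placeholder for the scraped article text
        db.articles.update_one(
            {"link": url},
            {"$set": {"title": entry.get("title"), "full_content": full_content}},
            upsert=True,
        )
```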

### 3. Subscription Flow
```
User → Frontend → Backend (subscription_routes) →
MongoDB (subscribers)
```

### 4. Newsletter Flow (Future)
```
Scheduler → Backend (email_service) →
MongoDB (articles + subscribers) → SMTP → Users
```

### 5. AI Processing Flow (Optional)
```
MongoDB (articles) → Backend (ollama_service) →
Ollama Server → AI Summary → MongoDB (articles)
```

## Deployment Options

### Development
- All services run locally
- MongoDB via Docker Compose
- Manual crawler execution

### Production
- Backend: Cloud VM, Container, or PaaS
- Frontend: Static hosting or same server
- MongoDB: MongoDB Atlas or self-hosted
- Crawler: Scheduled job (cron, systemd timer)
- Ollama: Separate GPU server (optional)

## Scalability Considerations

### Current Architecture
- Monolithic backend (single Flask instance)
- Standalone crawler (can run multiple instances)
- Shared MongoDB

### Future Improvements
- Load balancer for backend
- Message queue for crawler jobs (Celery + Redis)
- Caching layer (Redis)
- CDN for frontend
- Read replicas for MongoDB

## Security

- CORS enabled for frontend-backend communication
- MongoDB authentication (production)
- Environment variables for secrets
- Input validation on all endpoints
- Rate limiting (future)

## Monitoring (Future)

- Application logs
- MongoDB metrics
- Crawler success/failure tracking
- API response times
- Error tracking (Sentry)

243
docs/QUICK_REFERENCE.md
Normal file
@@ -0,0 +1,243 @@

# Quick Reference Guide

## Starting the Application

### 1. Start MongoDB
```bash
docker-compose up -d
```

### 2. Start Backend (Port 5001)
```bash
cd backend
source venv/bin/activate  # or: venv\Scripts\activate on Windows
python app.py
```

### 3. Start Frontend (Port 3000)
```bash
cd frontend
npm start
```

### 4. Run Crawler (Optional)
```bash
cd news_crawler
pip install -r requirements.txt
python crawler_service.py 10
```

## Common Commands

### RSS Feed Management

**List all feeds:**
```bash
curl http://localhost:5001/api/rss-feeds
```

**Add a feed:**
```bash
curl -X POST http://localhost:5001/api/rss-feeds \
  -H "Content-Type: application/json" \
  -d '{"name": "Feed Name", "url": "https://example.com/rss"}'
```

**Remove a feed:**
```bash
curl -X DELETE http://localhost:5001/api/rss-feeds/<feed_id>
```

**Toggle feed status:**
```bash
curl -X PATCH http://localhost:5001/api/rss-feeds/<feed_id>/toggle
```

### News & Subscriptions

**Get latest news:**
```bash
curl http://localhost:5001/api/news
```

**Subscribe:**
```bash
curl -X POST http://localhost:5001/api/subscribe \
  -H "Content-Type: application/json" \
  -d '{"email": "user@example.com"}'
```

**Get stats:**
```bash
curl http://localhost:5001/api/stats
```

### Ollama (AI)

**Test connection:**
```bash
curl http://localhost:5001/api/ollama/ping
```

**List models:**
```bash
curl http://localhost:5001/api/ollama/models
```

### Email Tracking & Analytics

**Get newsletter metrics:**
```bash
curl http://localhost:5001/api/analytics/newsletter/<newsletter_id>
```

**Get article performance:**
```bash
curl http://localhost:5001/api/analytics/article/<article_id>
```

**Get subscriber activity:**
```bash
curl http://localhost:5001/api/analytics/subscriber/<email>
```

**Delete subscriber tracking data:**
```bash
curl -X DELETE http://localhost:5001/api/tracking/subscriber/<email>
```

**Anonymize old tracking data:**
```bash
curl -X POST http://localhost:5001/api/tracking/anonymize
```

### Database

**Connect to MongoDB:**
```bash
mongosh
use munich_news
```

**Check articles:**
```javascript
db.articles.find().limit(5)
db.articles.countDocuments()
db.articles.countDocuments({full_content: {$exists: true}})
```

**Check subscribers:**
```javascript
db.subscribers.find()
db.subscribers.countDocuments({status: "active"})
```

**Check RSS feeds:**
```javascript
db.rss_feeds.find()
```

**Check tracking data:**
```javascript
db.newsletter_sends.find().limit(5)
db.link_clicks.find().limit(5)
db.subscriber_activity.find()
```

## File Locations

### Configuration
- Backend: `backend/.env`
- Frontend: `frontend/package.json`
- Crawler: Uses the backend's `.env` or its own `.env`

### Logs
- Backend: Terminal output
- Frontend: Terminal output
- Crawler: Terminal output

### Database
- MongoDB data: Docker volume `mongodb_data`
- Database name: `munich_news`

## Ports

| Service  | Port  | URL                        |
|----------|-------|----------------------------|
| Frontend | 3000  | http://localhost:3000      |
| Backend  | 5001  | http://localhost:5001      |
| MongoDB  | 27017 | mongodb://localhost:27017  |
| Ollama   | 11434 | http://localhost:11434     |

## Troubleshooting

### Backend won't start
- Check if port 5001 is available
- Verify MongoDB is running
- Check that the `.env` file exists

### Frontend can't connect
- Verify the backend is running on port 5001
- Check CORS settings
- Check `API_URL` in the frontend

### Crawler fails
- Install dependencies: `pip install -r requirements.txt`
- Check the MongoDB connection
- Verify RSS feeds exist in the database

### MongoDB connection error
- Start MongoDB: `docker-compose up -d`
- Check the connection string in `.env`
- Verify port 27017 is not blocked

### Port 5000 conflict (macOS)
- AirPlay uses port 5000
- Use port 5001 instead (set in `.env`)
- Or disable AirPlay Receiver in System Preferences

## Project Structure

```
munich-news/
├── backend/            # Main API (Flask)
├── frontend/           # Web UI (Express + JS)
├── news_crawler/       # Crawler microservice
├── .env                # Environment variables
└── docker-compose.yml  # MongoDB setup
```

## Environment Variables

### Backend (.env)
```env
MONGODB_URI=mongodb://localhost:27017/
FLASK_PORT=5001
SMTP_SERVER=smtp.gmail.com
SMTP_PORT=587
EMAIL_USER=your-email@gmail.com
EMAIL_PASSWORD=your-app-password
OLLAMA_BASE_URL=http://127.0.0.1:11434
OLLAMA_MODEL=phi3:latest
OLLAMA_ENABLED=true
TRACKING_ENABLED=true
TRACKING_API_URL=http://localhost:5001
TRACKING_DATA_RETENTION_DAYS=90
```

## Development Workflow

1. **Add RSS Feed** → Backend API
2. **Run Crawler** → Fetches full content
3. **View News** → Frontend displays articles
4. **Users Subscribe** → Via frontend form
5. **Send Newsletter** → Manual or scheduled

## Useful Links

- Frontend: http://localhost:3000
- Backend API: http://localhost:5001
- MongoDB: mongodb://localhost:27017
- Architecture: See `ARCHITECTURE.md`
- Backend Structure: See `backend/STRUCTURE.md`
- Crawler Guide: See `news_crawler/README.md`

194
docs/RSS_URL_EXTRACTION.md
Normal file
@@ -0,0 +1,194 @@

# RSS URL Extraction - How It Works

## The Problem

Different RSS feed providers use different fields to store the article URL:

### Example 1: Standard RSS (uses `link`)
```xml
<item>
  <title>Article Title</title>
  <link>https://example.com/article/123</link>
  <guid>internal-id-456</guid>
</item>
```

### Example 2: Some feeds (use `guid` as the URL)
```xml
<item>
  <title>Article Title</title>
  <guid>https://example.com/article/123</guid>
</item>
```

### Example 3: Atom feeds (use `id`)
```xml
<entry>
  <title>Article Title</title>
  <id>https://example.com/article/123</id>
</entry>
```

### Example 4: Complex feeds (`guid` as an object)
```xml
<item>
  <title>Article Title</title>
  <guid isPermaLink="true">https://example.com/article/123</guid>
</item>
```

### Example 5: Multiple links
```xml
<item>
  <title>Article Title</title>
  <link rel="alternate" type="text/html" href="https://example.com/article/123"/>
  <link rel="enclosure" type="image/jpeg" href="https://example.com/image.jpg"/>
</item>
```

## Our Solution

The `extract_article_url()` function tries multiple strategies in order:

### Strategy 1: Check the `link` field (most common)
```python
if entry.get('link') and entry.get('link', '').startswith('http'):
    return entry.get('link')
```
✅ Works for: Most RSS 2.0 feeds

### Strategy 2: Check the `guid` field
```python
if entry.get('guid'):
    guid = entry.get('guid')
    # guid can be a string
    if isinstance(guid, str) and guid.startswith('http'):
        return guid
    # or a dict with 'href'
    elif isinstance(guid, dict) and guid.get('href', '').startswith('http'):
        return guid.get('href')
```
✅ Works for: Feeds that use the GUID as a permalink

### Strategy 3: Check the `id` field
```python
if entry.get('id') and entry.get('id', '').startswith('http'):
    return entry.get('id')
```
✅ Works for: Atom feeds

### Strategy 4: Check the `links` array
```python
if entry.get('links'):
    for link in entry.get('links', []):
        if isinstance(link, dict) and link.get('href', '').startswith('http'):
            # Prefer the 'alternate' type
            if link.get('type') == 'text/html' or link.get('rel') == 'alternate':
                return link.get('href')
```
✅ Works for: Feeds with multiple links (prefers HTML content)
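
Putting the four strategies together gives roughly the following function. The real `rss_utils` implementation may differ in details, but the order of checks follows the description above.

```python
def extract_article_url(entry):
    """Return the best article URL from a feedparser entry, or None."""
    # Strategy 1: plain `link` field
    link = entry.get('link')
    if isinstance(link, str) and link.startswith('http'):
        return link

    # Strategy 2: `guid` as a string or as a dict with 'href'
    guid = entry.get('guid')
    if isinstance(guid, str) and guid.startswith('http'):
        return guid
    if isinstance(guid, dict) and guid.get('href', '').startswith('http'):
        return guid.get('href')

    # Strategy 3: Atom `id`
    entry_id = entry.get('id')
    if isinstance(entry_id, str) and entry_id.startswith('http'):
        return entry_id

    # Strategy 4: `links` array, preferring the HTML alternate link
    for candidate in entry.get('links') or []:
        if isinstance(candidate, dict) and candidate.get('href', '').startswith('http'):
            if candidate.get('type') == 'text/html' or candidate.get('rel') == 'alternate':
                return candidate.get('href')

    return None  # caller skips this entry and logs a warning
```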

## Real-World Examples

### Süddeutsche Zeitung
```python
entry = {
    'title': 'Munich News',
    'link': 'https://www.sueddeutsche.de/muenchen/article-123',
    'guid': 'sz-internal-123'
}
# Returns: 'https://www.sueddeutsche.de/muenchen/article-123'
```

### Medium Blog
```python
entry = {
    'title': 'Blog Post',
    'guid': 'https://medium.com/@user/post-abc123',
    'link': None
}
# Returns: 'https://medium.com/@user/post-abc123'
```

### YouTube RSS
```python
entry = {
    'title': 'Video Title',
    'id': 'https://www.youtube.com/watch?v=abc123',
    'link': None
}
# Returns: 'https://www.youtube.com/watch?v=abc123'
```

### Complex Feed
```python
entry = {
    'title': 'Article',
    'links': [
        {'rel': 'alternate', 'type': 'text/html', 'href': 'https://example.com/article'},
        {'rel': 'enclosure', 'type': 'image/jpeg', 'href': 'https://example.com/image.jpg'}
    ]
}
# Returns: 'https://example.com/article' (prefers text/html)
```

## Validation

All extracted URLs must:
1. Start with `http://` or `https://`
2. Be a valid string (not None or empty)

If no valid URL is found:
```python
return None
# Crawler will skip this entry and log a warning
```

## Testing Different Feeds

To test if a feed works with our extractor:

```python
import feedparser
from rss_utils import extract_article_url

# Parse feed
feed = feedparser.parse('https://example.com/rss')

# Test each entry
for entry in feed.entries[:5]:
    url = extract_article_url(entry)
    if url:
        print(f"✓ {entry.get('title', 'No title')[:50]}")
        print(f"  URL: {url}")
    else:
        print(f"✗ {entry.get('title', 'No title')[:50]}")
        print(f"  No valid URL found")
        print(f"  Available fields: {list(entry.keys())}")
```

## Supported Feed Types

- ✅ RSS 2.0
- ✅ RSS 1.0
- ✅ Atom
- ✅ Custom RSS variants
- ✅ Feeds with multiple links
- ✅ Feeds with GUID as permalink

## Edge Cases Handled

1. **GUID is not a URL**: Checks if it starts with `http`
2. **Multiple links**: Prefers the `text/html` type
3. **GUID as dict**: Extracts the `href` field
4. **Missing fields**: Returns None instead of crashing
5. **Non-HTTP URLs**: Filters out `mailto:`, `ftp:`, etc.

## Future Improvements

Potential enhancements:
- [ ] Support for `feedburner:origLink`
- [ ] Support for `pheedo:origLink`
- [ ] Resolve shortened URLs (bit.ly, etc.)
- [ ] Handle relative URLs (convert to absolute)
- [ ] Cache URL extraction results

412
docs/SYSTEM_ARCHITECTURE.md
Normal file
@@ -0,0 +1,412 @@

# Munich News Daily - System Architecture

## 📊 Complete System Overview

```
Munich News Daily System (Fully Automated Pipeline)

Daily Schedule

6:00 AM Berlin: News Crawler
        │
        ▼
News Crawler
  Fetch RSS Feeds → Extract Content → Summarize with AI → Save to MongoDB
  Sources: Süddeutsche, Merkur, BR24, etc.
  Output:  Full articles + AI summaries
        │  Articles saved
        ▼
MongoDB (Data Storage)
        │  Wait for crawler
        ▼
7:00 AM Berlin: Newsletter Sender
        │
        ▼
Newsletter Sender
  Wait for Crawler → Fetch Articles → Generate Newsletter → Send to Subscribers
  Features: Tracking pixels, link tracking, HTML templates
  Output:   Personalized newsletters with engagement tracking
        │  Emails sent
        ▼
Subscribers (Email Inboxes)
        │  Opens & clicks
        ▼
Tracking System (Analytics API)
```

## 🔄 Data Flow

### 1. Content Acquisition (6:00 AM)

```
RSS Feeds → Crawler → Full Content → AI Summary → MongoDB
```

**Details**:
- Fetches from multiple RSS sources
- Extracts the full article text
- Generates concise summaries using Ollama (see the sketch below)
- Stores articles with metadata (author, date, source)
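
The summarization step can be sketched as a direct call to Ollama's `/api/generate` endpoint. The prompt wording, truncation limit, and timeout below are assumptions, not the project's exact service code; the base URL and model match the `.env` defaults shown in the quick reference.

```python
# Sketch of AI summarization via Ollama's REST API.
import requests

OLLAMA_BASE_URL = "http://127.0.0.1:11434"  # OLLAMA_BASE_URL default
OLLAMA_MODEL = "phi3:latest"                # OLLAMA_MODEL default


def summarize(article_text: str) -> str:
    response = requests.post(
        f"{OLLAMA_BASE_URL}/api/generate",
        json={
            "model": OLLAMA_MODEL,
            "prompt": "Summarize this news article in 2-3 sentences:\n\n"
                      + article_text[:4000],  # assumed truncation to keep prompts small
            "stream": False,
        },
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["response"].strip()
```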

### 2. Newsletter Generation (7:00 AM)

```
MongoDB → Articles → Template → HTML → Email
```

**Details**:
- Waits for crawler to finish (max 30 min)
- Fetches today's articles with summaries
- Applies Jinja2 template
- Injects tracking pixels
- Replaces links with tracking URLs

### 3. Engagement Tracking (Ongoing)

```
Email Open → Pixel Load → Log Event → Analytics
Link Click → Redirect   → Log Event → Analytics
```

**Details**:
- Tracks email opens via a 1x1 pixel
- Tracks link clicks via redirect URLs (see the sketch below)
- Stores engagement data in MongoDB
- Provides an analytics API
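
A minimal Flask sketch of the two tracking mechanisms is shown below. The route paths, pixel bytes, and update logic are illustrative assumptions and are not the project's actual endpoints; they only show the pattern of logging an open on pixel load and logging a click before redirecting.

```python
# Illustrative tracking routes: open pixel and click redirect.
from datetime import datetime, timezone

from flask import Blueprint, Response, redirect
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017/")["munich_news"]
tracking = Blueprint("tracking", __name__)

# 1x1 transparent GIF (a PNG works the same way); bytes included for self-containment.
PIXEL = (b"GIF89a\x01\x00\x01\x00\x80\x00\x00\x00\x00\x00\xff\xff\xff!\xf9\x04"
         b"\x01\x00\x00\x00\x00,\x00\x00\x00\x00\x01\x00\x01\x00\x00\x02\x02D\x01\x00;")


@tracking.route("/track/open/<tracking_id>")
def track_open(tracking_id):
    now = datetime.now(timezone.utc)
    # Record the first open only once, then count every open.
    db.newsletter_sends.update_one(
        {"tracking_id": tracking_id, "opened": {"$ne": True}},
        {"$set": {"opened": True, "first_opened_at": now}},
    )
    db.newsletter_sends.update_one({"tracking_id": tracking_id}, {"$inc": {"open_count": 1}})
    return Response(PIXEL, mimetype="image/gif")


@tracking.route("/track/click/<tracking_id>")
def track_click(tracking_id):
    click = db.link_clicks.find_one_and_update(
        {"tracking_id": tracking_id},
        {"$set": {"clicked": True, "clicked_at": datetime.now(timezone.utc)}},
    )
    target = click["article_url"] if click else "https://example.com"
    return redirect(target, code=302)
```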

## 🏗️ Component Architecture

### Docker Containers

```
Docker Network
  ├── MongoDB
  │     Port: 27017
  │     Storage: articles, subscribers, tracking
  ├── Crawler
  │     Schedule: 6:00 AM
  │     Depends on: MongoDB
  └── Sender
        Schedule: 7:00 AM
        Depends on: MongoDB, Crawler

All containers auto-restart on failure
All use Europe/Berlin timezone
```

### Backend Services

```
Backend Services
  ├── Flask API (Port 5001)
  │     - Tracking Endpoints
  │     - Analytics Endpoints
  │     - Privacy Endpoints
  └── Services Layer
        - Tracking Service
        - Analytics Service
        - Ollama Client
```

## 📅 Daily Timeline

| Time (Berlin) | Event                    | Duration           |
|---------------|--------------------------|--------------------|
| 05:59:59      | System idle              | -                  |
| 06:00:00      | Crawler starts           | ~10-20 min         |
| 06:00:01      | - Fetch RSS feeds        |                    |
| 06:02:00      | - Extract content        |                    |
| 06:05:00      | - Generate summaries     |                    |
| 06:15:00      | - Save to MongoDB        |                    |
| 06:20:00      | Crawler finishes         |                    |
| 06:20:01      | System idle              | ~40 min            |
| 07:00:00      | Sender starts            | ~5-10 min          |
| 07:00:01      | - Wait for crawler       | (checks every 30s) |
| 07:00:30      | - Crawler confirmed done |                    |
| 07:00:31      | - Fetch articles         |                    |
| 07:01:00      | - Generate newsletters   |                    |
| 07:02:00      | - Send to subscribers    |                    |
| 07:10:00      | Sender finishes          |                    |
| 07:10:01      | System idle              | Until tomorrow     |

## 🔐 Security & Privacy

### Data Protection

```
Privacy Features
  ├── Data Retention
  │     - Personal data: 90 days
  │     - Anonymization: Automatic
  │     - Deletion: On request
  ├── User Rights
  │     - Opt-out: Anytime
  │     - Data access: API available
  │     - Data deletion: Full removal
  └── Compliance
        - GDPR compliant
        - Privacy notice in emails
        - Transparent tracking
```

## 📊 Database Schema

### Collections

```
MongoDB (munich_news)
│
├── articles
│   ├── title
│   ├── author
│   ├── content (full text)
│   ├── summary (AI generated)
│   ├── link
│   ├── source
│   ├── published_at
│   └── crawled_at
│
├── subscribers
│   ├── email
│   ├── active
│   ├── tracking_enabled
│   └── subscribed_at
│
├── rss_feeds
│   ├── name
│   ├── url
│   └── active
│
├── newsletter_sends
│   ├── tracking_id
│   ├── newsletter_id
│   ├── subscriber_email
│   ├── opened
│   ├── first_opened_at
│   └── open_count
│
├── link_clicks
│   ├── tracking_id
│   ├── newsletter_id
│   ├── subscriber_email
│   ├── article_url
│   ├── clicked
│   └── clicked_at
│
└── subscriber_activity
    ├── email
    ├── status (active/inactive/dormant)
    ├── last_opened_at
    ├── last_clicked_at
    ├── total_opens
    └── total_clicks
```
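
For illustration, a document in the `articles` collection with the fields listed above could be inserted like this (the values are invented for the example):

```python
# Example article document matching the schema above.
from datetime import datetime, timezone

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017/")["munich_news"]

db.articles.insert_one({
    "title": "Example headline",
    "author": "Jane Doe",
    "content": "Full article text ...",
    "summary": "Short AI-generated summary ...",
    "link": "https://example.com/article",
    "source": "Example Source",
    "published_at": datetime(2024, 1, 16, 8, 0, tzinfo=timezone.utc),
    "crawled_at": datetime.now(timezone.utc),
})
```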

## 🚀 Deployment Architecture

### Development

```
Local Machine
├── Docker Compose
│   ├── MongoDB (no auth)
│   ├── Crawler
│   └── Sender
├── Backend (manual start)
│   └── Flask API
└── Ollama (optional)
    └── AI Summarization
```

### Production

```
Server
├── Docker Compose (prod)
│   ├── MongoDB (with auth)
│   ├── Crawler
│   └── Sender
├── Backend (systemd/pm2)
│   └── Flask API (HTTPS)
├── Ollama (optional)
│   └── AI Summarization
└── Nginx (reverse proxy)
    └── SSL/TLS
```

## 🔄 Coordination Mechanism

### Crawler-Sender Synchronization

```
6:00 AM → Crawler starts
              ↓
          Crawling articles...
              ↓
          Saves to MongoDB
              ↓
6:20 AM → Crawler finishes

7:00 AM → Sender starts
              ↓
          Check: recent articles in MongoDB?
              ├─ Yes → proceed with send
              └─ No  → wait 30s and check again (max 30 minutes)
              ↓
7:10 AM → Newsletter sent
```
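
The waiting step can be sketched as a polling loop: before sending, the sender checks MongoDB every 30 seconds, for at most 30 minutes, until articles crawled today appear. Collection and field names follow this document; the exact query and cutoff are assumptions.

```python
# Sketch of the sender's wait-for-crawler loop.
import time
from datetime import datetime, time as dtime, timezone

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017/")["munich_news"]


def wait_for_crawler(max_wait_seconds: int = 30 * 60, poll_seconds: int = 30) -> bool:
    today_start = datetime.combine(datetime.now(timezone.utc).date(), dtime.min, tzinfo=timezone.utc)
    waited = 0
    while waited <= max_wait_seconds:
        if db.articles.count_documents({"crawled_at": {"$gte": today_start}}) > 0:
            return True  # today's articles are in, safe to send
        time.sleep(poll_seconds)
        waited += poll_seconds
    return False  # timed out; caller decides whether to send anyway or skip
```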

## 📈 Monitoring & Observability

### Key Metrics

```
Crawler:
  - Articles crawled per day
  - Crawl duration
  - Success/failure rate
  - Summary generation rate

Sender:
  - Newsletters sent per day
  - Send duration
  - Success/failure rate
  - Wait time for crawler

Engagement:
  - Open rate
  - Click-through rate
  - Active subscribers
  - Dormant subscribers

System:
  - Container uptime
  - Database size
  - Error rate
  - Response times
```

## 🛠️ Maintenance Tasks

### Daily
- Check logs for errors
- Verify newsletters sent
- Monitor engagement metrics

### Weekly
- Review article quality
- Check subscriber growth
- Analyze engagement trends

### Monthly
- Archive old articles
- Clean up dormant subscribers
- Update dependencies
- Review system performance

## 📚 Technology Stack

```
Backend:
  - Python 3.11
  - Flask (API)
  - PyMongo (Database)
  - Schedule (Automation)
  - Jinja2 (Templates)
  - BeautifulSoup (Parsing)

Database:
  - MongoDB 7.0

AI/ML:
  - Ollama (Summarization)
  - Phi3 Model (default)

Infrastructure:
  - Docker & Docker Compose
  - Linux (Ubuntu/Debian)

Email:
  - SMTP (configurable)
  - HTML emails with tracking
```

---

**Last Updated**: 2024-01-16
**Version**: 1.0
**Status**: Production Ready ✅