update

.gitignore (vendored, 4 changed lines)
@@ -179,8 +179,8 @@ mongodb_data/
 ollama_data/
 
 # Spec artifacts (optional - uncomment if you don't want to track specs)
-# .kiro/specs/
+.kiro/specs/
+.vscode
 # Test outputs
 test-results/
 coverage/

CONTRIBUTING.md (new file, +79 lines)
@@ -0,0 +1,79 @@
# Contributing to Munich News Daily

Thank you for your interest in contributing!

## Getting Started

1. Fork the repository
2. Clone your fork
3. Create a feature branch
4. Make your changes
5. Run tests
6. Submit a pull request

## Development Setup

```bash
# Clone repository
git clone <your-fork-url>
cd munich-news

# Copy environment file
cp backend/.env.example backend/.env

# Start development environment
docker-compose up -d

# View logs
docker-compose logs -f
```

## Running Tests

```bash
# Run all tests
docker-compose exec crawler python -m pytest tests/crawler
docker-compose exec sender python -m pytest tests/sender
docker-compose exec backend python -m pytest tests/backend

# Run specific test
docker-compose exec crawler python tests/crawler/test_crawler.py
```

## Code Style

- Follow PEP 8 for Python code
- Use meaningful variable names
- Add docstrings to functions
- Keep functions small and focused
- Write tests for new features

## Commit Messages

- Use clear, descriptive commit messages
- Start with a verb (Add, Fix, Update, etc.)
- Keep the first line under 50 characters
- Add details in the body if needed

Example:
```
Add RSS feed validation

- Validate URL format
- Check feed accessibility
- Add error handling
```

## Pull Request Process

1. Update documentation if needed
2. Add tests for new features
3. Ensure all tests pass
4. Update CHANGELOG.md
5. Request review from maintainers

## Questions?

Open an issue or reach out to the maintainers.

Thank you for contributing! 🎉
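For contributors adding tests, here is a rough sketch of what a module under `tests/crawler/` could look like. The helper name `extract_entry_url` and its import path are assumptions for illustration, not the repository's actual API.

```python
# Illustrative only: a minimal pytest module in the style used under tests/crawler/.
# The helper name and its module path are assumptions, not the actual repository API.
import pytest

from news_crawler.rss_utils import extract_entry_url  # hypothetical import


@pytest.mark.parametrize(
    "entry,expected",
    [
        ({"link": "https://example.com/a"}, "https://example.com/a"),
        ({"id": "https://example.com/b"}, "https://example.com/b"),
        ({}, None),
    ],
)
def test_extract_entry_url(entry, expected):
    # RSS entries may expose the article URL as `link`, `id`/`guid`, or `links`;
    # the extractor should normalize all of them and return None when nothing usable exists.
    assert extract_entry_url(entry) == expected
```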

FINAL_STRUCTURE.md (new file, +243 lines)
@@ -0,0 +1,243 @@
|
|||||||
|
# ✅ Final Clean Project Structure
|
||||||
|
|
||||||
|
## 🎉 Cleanup Complete!
|
||||||
|
|
||||||
|
Your Munich News Daily project is now clean, organized, and professional.
|
||||||
|
|
||||||
|
## 📁 Current Structure
|
||||||
|
|
||||||
|
```
|
||||||
|
munich-news/
|
||||||
|
├── 📄 Root Files (5 essential files)
|
||||||
|
│ ├── README.md # Main documentation
|
||||||
|
│ ├── QUICKSTART.md # 5-minute setup guide
|
||||||
|
│ ├── CONTRIBUTING.md # Contribution guidelines
|
||||||
|
│ ├── PROJECT_STRUCTURE.md # Project layout
|
||||||
|
│ └── docker-compose.yml # Single unified compose file
|
||||||
|
│
|
||||||
|
├── 📚 docs/ (12 documentation files)
|
||||||
|
│ ├── API.md # API reference
|
||||||
|
│ ├── ARCHITECTURE.md # System architecture
|
||||||
|
│ ├── BACKEND_STRUCTURE.md # Backend organization
|
||||||
|
│ ├── CRAWLER_HOW_IT_WORKS.md # Crawler internals
|
||||||
|
│ ├── DATABASE_SCHEMA.md # Database structure
|
||||||
|
│ ├── DEPLOYMENT.md # Deployment guide
|
||||||
|
│ ├── EXTRACTION_STRATEGIES.md # Content extraction
|
||||||
|
│ └── RSS_URL_EXTRACTION.md # RSS parsing
|
||||||
|
│
|
||||||
|
├── 🧪 tests/ (10 test files)
|
||||||
|
│ ├── backend/ # Backend tests
|
||||||
|
│ ├── crawler/ # Crawler tests
|
||||||
|
│ └── sender/ # Sender tests
|
||||||
|
│
|
||||||
|
├── 🔧 backend/ # Backend API
|
||||||
|
│ ├── routes/
|
||||||
|
│ ├── services/
|
||||||
|
│ ├── .env.example
|
||||||
|
│ └── app.py
|
||||||
|
│
|
||||||
|
├── 📰 news_crawler/ # Crawler service
|
||||||
|
│ ├── Dockerfile
|
||||||
|
│ ├── crawler_service.py
|
||||||
|
│ ├── scheduled_crawler.py
|
||||||
|
│ └── requirements.txt
|
||||||
|
│
|
||||||
|
├── 📧 news_sender/ # Sender service
|
||||||
|
│ ├── Dockerfile
|
||||||
|
│ ├── sender_service.py
|
||||||
|
│ ├── scheduled_sender.py
|
||||||
|
│ └── requirements.txt
|
||||||
|
│
|
||||||
|
└── 🎨 frontend/ # React dashboard (optional)
|
||||||
|
```
|
||||||
|
|
||||||
|
## ✨ What Was Cleaned
|
||||||
|
|
||||||
|
### Removed Files (20+)
|
||||||
|
- ❌ All redundant markdown files from root
|
||||||
|
- ❌ All redundant markdown files from subdirectories
|
||||||
|
- ❌ Multiple docker-compose files (kept only 1)
|
||||||
|
- ❌ Multiple startup scripts (use docker-compose now)
|
||||||
|
- ❌ Test scripts and helpers
|
||||||
|
|
||||||
|
### Organized Files
|
||||||
|
- ✅ All tests → `tests/` directory
|
||||||
|
- ✅ All documentation → `docs/` directory
|
||||||
|
- ✅ All docker configs → single `docker-compose.yml`
|
||||||
|
|
||||||
|
## 🚀 How to Use
|
||||||
|
|
||||||
|
### Start Everything
|
||||||
|
```bash
|
||||||
|
docker-compose up -d
|
||||||
|
```
|
||||||
|
|
||||||
|
That's it! One command starts:
|
||||||
|
- MongoDB database
|
||||||
|
- News crawler (6 AM schedule)
|
||||||
|
- Newsletter sender (7 AM schedule)
|
||||||
|
|
||||||
|
### View Logs
|
||||||
|
```bash
|
||||||
|
docker-compose logs -f
|
||||||
|
```
|
||||||
|
|
||||||
|
### Stop Everything
|
||||||
|
```bash
|
||||||
|
docker-compose down
|
||||||
|
```
|
||||||
|
|
||||||
|
## 📊 Before vs After
|
||||||
|
|
||||||
|
### Before
|
||||||
|
```
|
||||||
|
Root: 20+ files (messy)
|
||||||
|
├── AUTOMATION_README.md
|
||||||
|
├── AUTOMATION_SETUP_COMPLETE.md
|
||||||
|
├── CRAWLER_QUICKSTART.md
|
||||||
|
├── CRAWLER_SETUP_SUMMARY.md
|
||||||
|
├── docker-compose.yml
|
||||||
|
├── docker-compose.prod.yml
|
||||||
|
├── README_CRAWLER.md
|
||||||
|
├── start-automation.sh
|
||||||
|
├── start-crawler.sh
|
||||||
|
├── start-sender.sh
|
||||||
|
├── test-crawler-setup.sh
|
||||||
|
└── ... many more
|
||||||
|
|
||||||
|
Subdirectories: Scattered docs
|
||||||
|
├── backend/TRACKING_README.md
|
||||||
|
├── backend/TRACKING_CONFIGURATION.md
|
||||||
|
├── news_crawler/README.md
|
||||||
|
├── news_crawler/QUICKSTART.md
|
||||||
|
├── news_crawler/docker-compose.yml
|
||||||
|
├── news_sender/README.md
|
||||||
|
└── ... more scattered files
|
||||||
|
|
||||||
|
Tests: Scattered everywhere
|
||||||
|
```
|
||||||
|
|
||||||
|
### After
|
||||||
|
```
|
||||||
|
Root: 5 essential files (clean)
|
||||||
|
├── README.md
|
||||||
|
├── QUICKSTART.md
|
||||||
|
├── CONTRIBUTING.md
|
||||||
|
├── PROJECT_STRUCTURE.md
|
||||||
|
└── docker-compose.yml
|
||||||
|
|
||||||
|
docs/: All documentation (12 files)
|
||||||
|
├── API.md
|
||||||
|
├── ARCHITECTURE.md
|
||||||
|
├── DEPLOYMENT.md
|
||||||
|
└── ... organized docs
|
||||||
|
|
||||||
|
tests/: All tests (10 files)
|
||||||
|
├── backend/
|
||||||
|
├── crawler/
|
||||||
|
└── sender/
|
||||||
|
|
||||||
|
Subdirectories: Clean, no scattered docs
|
||||||
|
```
|
||||||
|
|
||||||
|
## 🎯 Benefits
|
||||||
|
|
||||||
|
### 1. Easy to Navigate
|
||||||
|
- Clear directory structure
|
||||||
|
- Everything in its place
|
||||||
|
- No clutter
|
||||||
|
|
||||||
|
### 2. Simple to Use
|
||||||
|
- One command: `docker-compose up -d`
|
||||||
|
- One place for docs: `docs/`
|
||||||
|
- One place for tests: `tests/`
|
||||||
|
|
||||||
|
### 3. Professional
|
||||||
|
- Industry-standard layout
|
||||||
|
- Clean and organized
|
||||||
|
- Ready for collaboration
|
||||||
|
|
||||||
|
### 4. Maintainable
|
||||||
|
- Easy to find files
|
||||||
|
- Clear separation of concerns
|
||||||
|
- Scalable structure
|
||||||
|
|
||||||
|
## 📝 Quick Reference
|
||||||
|
|
||||||
|
### Documentation
|
||||||
|
```bash
|
||||||
|
# Main docs
|
||||||
|
cat README.md
|
||||||
|
cat QUICKSTART.md
|
||||||
|
|
||||||
|
# Technical docs
|
||||||
|
ls docs/
|
||||||
|
```
|
||||||
|
|
||||||
|
### Running
|
||||||
|
```bash
|
||||||
|
# Start
|
||||||
|
docker-compose up -d
|
||||||
|
|
||||||
|
# Logs
|
||||||
|
docker-compose logs -f
|
||||||
|
|
||||||
|
# Stop
|
||||||
|
docker-compose down
|
||||||
|
```
|
||||||
|
|
||||||
|
### Testing
|
||||||
|
```bash
|
||||||
|
# Run tests
|
||||||
|
docker-compose exec crawler python tests/crawler/test_crawler.py
|
||||||
|
docker-compose exec sender python tests/sender/test_tracking_integration.py
|
||||||
|
```
|
||||||
|
|
||||||
|
### Development
|
||||||
|
```bash
|
||||||
|
# Edit code in respective directories
|
||||||
|
# Rebuild
|
||||||
|
docker-compose up -d --build
|
||||||
|
```
|
||||||
|
|
||||||
|
## ✅ Verification
|
||||||
|
|
||||||
|
Run these commands to verify the cleanup:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Check root directory (should be clean)
|
||||||
|
ls -1 *.md
|
||||||
|
|
||||||
|
# Check docs directory
|
||||||
|
ls -1 docs/
|
||||||
|
|
||||||
|
# Check tests directory
|
||||||
|
ls -1 tests/
|
||||||
|
|
||||||
|
# Check for stray docker-compose files (should be only 1)
|
||||||
|
find . -name "docker-compose*.yml" ! -path "*/node_modules/*" ! -path "*/env/*"
|
||||||
|
|
||||||
|
# Check for stray markdown in subdirectories (should be none)
|
||||||
|
find backend news_crawler news_sender -name "*.md" ! -path "*/env/*"
|
||||||
|
```
|
||||||
|
|
||||||
|
## 🎊 Result
|
||||||
|
|
||||||
|
A clean, professional, production-ready project structure!
|
||||||
|
|
||||||
|
**One command to start everything:**
|
||||||
|
```bash
|
||||||
|
docker-compose up -d
|
||||||
|
```
|
||||||
|
|
||||||
|
**One place for all documentation:**
|
||||||
|
```bash
|
||||||
|
ls docs/
|
||||||
|
```
|
||||||
|
|
||||||
|
**One place for all tests:**
|
||||||
|
```bash
|
||||||
|
ls tests/
|
||||||
|
```
|
||||||
|
|
||||||
|
Simple. Clean. Professional. ✨
|
||||||

PROJECT_STRUCTURE.md (new file, +126 lines)
@@ -0,0 +1,126 @@
# Project Structure

```
munich-news/
├── backend/                      # Backend API and services
│   ├── routes/                   # API routes
│   ├── services/                 # Business logic
│   ├── .env.example              # Environment template
│   ├── app.py                    # Flask application
│   ├── config.py                 # Configuration
│   └── database.py               # MongoDB connection
│
├── news_crawler/                 # News crawler service
│   ├── Dockerfile                # Crawler container
│   ├── crawler_service.py        # Main crawler logic
│   ├── scheduled_crawler.py      # Scheduler (6 AM)
│   ├── rss_utils.py              # RSS parsing utilities
│   └── requirements.txt          # Python dependencies
│
├── news_sender/                  # Newsletter sender service
│   ├── Dockerfile                # Sender container
│   ├── sender_service.py         # Main sender logic
│   ├── scheduled_sender.py       # Scheduler (7 AM)
│   ├── tracking_integration.py   # Email tracking
│   ├── newsletter_template.html  # Email template
│   └── requirements.txt          # Python dependencies
│
├── frontend/                     # React dashboard (optional)
│   ├── src/                      # React components
│   ├── public/                   # Static files
│   └── package.json              # Node dependencies
│
├── tests/                        # All test files
│   ├── crawler/                  # Crawler tests
│   ├── sender/                   # Sender tests
│   └── backend/                  # Backend tests
│
├── docs/                         # Documentation
│   ├── ARCHITECTURE.md           # System architecture
│   ├── DEPLOYMENT.md             # Deployment guide
│   ├── API.md                    # API reference
│   ├── DATABASE_SCHEMA.md        # Database structure
│   ├── BACKEND_STRUCTURE.md      # Backend organization
│   ├── CRAWLER_HOW_IT_WORKS.md   # Crawler internals
│   ├── EXTRACTION_STRATEGIES.md  # Content extraction
│   └── RSS_URL_EXTRACTION.md     # RSS parsing
│
├── .kiro/                        # Kiro IDE configuration
│   └── specs/                    # Feature specifications
│
├── docker-compose.yml            # Docker orchestration
├── README.md                     # Main documentation
├── QUICKSTART.md                 # 5-minute setup guide
├── CONTRIBUTING.md               # Contribution guidelines
├── .gitignore                    # Git ignore rules
└── .dockerignore                 # Docker ignore rules
```

## Key Files

### Configuration
- `backend/.env` - Environment variables (create from .env.example)
- `docker-compose.yml` - Docker services configuration

### Entry Points
- `news_crawler/scheduled_crawler.py` - Crawler scheduler (6 AM)
- `news_sender/scheduled_sender.py` - Sender scheduler (7 AM)
- `backend/app.py` - Backend API server

### Documentation
- `README.md` - Main project documentation
- `QUICKSTART.md` - Quick setup guide
- `docs/` - Detailed documentation

### Tests
- `tests/crawler/` - Crawler test files
- `tests/sender/` - Sender test files
- `tests/backend/` - Backend test files

## Docker Services

When you run `docker-compose up -d`, these services start:

1. **mongodb** - Database (port 27017)
2. **crawler** - News crawler (scheduled for 6 AM)
3. **sender** - Newsletter sender (scheduled for 7 AM)
4. **backend** - API server (port 5001, optional)

## Data Flow

```
RSS Feeds → Crawler → MongoDB → Sender → Subscribers
                         ↓
                    Backend API
                         ↓
                     Analytics
```

## Development Workflow

1. Edit code in respective directories
2. Rebuild containers: `docker-compose up -d --build`
3. View logs: `docker-compose logs -f`
4. Run tests: `docker-compose exec <service> python tests/...`

## Adding New Features

1. Create spec in `.kiro/specs/`
2. Implement in appropriate directory
3. Add tests in `tests/`
4. Update documentation in `docs/`
5. Submit pull request

## Clean Architecture

- **Separation of Concerns**: Each service has its own directory
- **Centralized Configuration**: All config in `backend/.env`
- **Organized Tests**: All tests in `tests/` directory
- **Clear Documentation**: All docs in `docs/` directory
- **Single Entry Point**: One `docker-compose.yml` file

This structure makes the project:
- ✅ Easy to navigate
- ✅ Simple to deploy
- ✅ Clear to understand
- ✅ Maintainable long-term
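To make the scheduler entry points above concrete, here is a minimal sketch in the shape of `news_crawler/scheduled_crawler.py`. The `schedule` call matches the line quoted in the README; the body of `run_crawler` is only a placeholder, and the real file may differ.

```python
# Minimal sketch of a scheduler entry point like news_crawler/scheduled_crawler.py.
# The schedule line mirrors the one quoted in the README; run_crawler's body is illustrative.
import time

import schedule


def run_crawler():
    # The real service invokes the crawler (see crawler_service.py);
    # a print keeps this sketch self-contained.
    print("Crawling Munich news feeds...")


# The containers run with TZ=Europe/Berlin, so "06:00" means 6 AM Berlin time.
schedule.every().day.at("06:00").do(run_crawler)

if __name__ == "__main__":
    while True:
        schedule.run_pending()
        time.sleep(60)  # check once a minute for due jobs
```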

QUICKSTART.md (new file, +131 lines)
@@ -0,0 +1,131 @@
# Quick Start Guide

Get Munich News Daily running in 5 minutes!

## Prerequisites

- Docker & Docker Compose installed
- (Optional) Ollama for AI summarization

## Setup

### 1. Configure Environment

```bash
# Copy example environment file
cp backend/.env.example backend/.env

# Edit with your settings (required: email configuration)
nano backend/.env
```

**Minimum required settings:**
```env
SMTP_SERVER=smtp.gmail.com
SMTP_PORT=587
EMAIL_USER=your-email@gmail.com
EMAIL_PASSWORD=your-app-password
```

### 2. Start System

```bash
# Start all services
docker-compose up -d

# View logs
docker-compose logs -f
```

### 3. Add RSS Feeds

```bash
mongosh munich_news

db.rss_feeds.insertMany([
  {
    name: "Süddeutsche Zeitung München",
    url: "https://www.sueddeutsche.de/muenchen/rss",
    active: true
  },
  {
    name: "Merkur München",
    url: "https://www.merkur.de/lokales/muenchen/rss/feed.rss",
    active: true
  }
])
```

### 4. Add Subscribers

```bash
mongosh munich_news

db.subscribers.insertOne({
  email: "your-email@example.com",
  active: true,
  tracking_enabled: true,
  subscribed_at: new Date()
})
```

### 5. Test It

```bash
# Test crawler
docker-compose exec crawler python crawler_service.py 5

# Test newsletter
docker-compose exec sender python sender_service.py test your-email@example.com
```

## What Happens Next?

The system then runs automatically:
- **Backend API**: Runs continuously at http://localhost:5001 for tracking and analytics
- **6:00 AM Berlin time**: Crawl news articles
- **7:00 AM Berlin time**: Send newsletter to subscribers

## View Results

```bash
# Check articles
mongosh munich_news
db.articles.find().sort({ crawled_at: -1 }).limit(5)

# Check logs
docker-compose logs -f crawler
docker-compose logs -f sender
```

## Common Commands

```bash
# Stop system
docker-compose down

# Restart system
docker-compose restart

# View logs
docker-compose logs -f

# Rebuild after changes
docker-compose up -d --build
```

## Need Help?

- Check [README.md](README.md) for full documentation
- See [docs/DEPLOYMENT.md](docs/DEPLOYMENT.md) for detailed setup
- View [docs/API.md](docs/API.md) for API reference

## Next Steps

1. Configure Ollama for AI summaries (optional)
2. Set up tracking API (optional)
3. Customize newsletter template
4. Add more RSS feeds
5. Monitor engagement metrics

That's it! Your automated news system is running. 🎉
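If you prefer scripting steps 3 and 4 rather than typing them into mongosh, the same documents can be inserted from Python with pymongo. Collection and field names mirror the commands above; this is a convenience sketch, not a repository script.

```python
# Sketch: seed RSS feeds and one subscriber from Python instead of mongosh.
# Collection and field names mirror the mongosh commands in steps 3 and 4 above.
from datetime import datetime, timezone

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["munich_news"]

db.rss_feeds.insert_many([
    {"name": "Süddeutsche Zeitung München",
     "url": "https://www.sueddeutsche.de/muenchen/rss",
     "active": True},
    {"name": "Merkur München",
     "url": "https://www.merkur.de/lokales/muenchen/rss/feed.rss",
     "active": True},
])

db.subscribers.insert_one({
    "email": "your-email@example.com",
    "active": True,
    "tracking_enabled": True,
    "subscribed_at": datetime.now(timezone.utc),
})
```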

README.md (modified, 639 changed lines)
@@ -1,327 +1,390 @@
|
|||||||
# Munich News Daily 📰
|
# Munich News Daily - Automated Newsletter System
|
||||||
|
|
||||||
A TLDR/Morning Brew-style news email platform specifically for Munich. Get the latest Munich news delivered to your inbox every morning.
|
A fully automated news aggregation and newsletter system that crawls Munich news sources, generates AI summaries, and sends daily newsletters with engagement tracking.
|
||||||
|
|
||||||
## Features
|
## 🚀 Quick Start
|
||||||
|
|
||||||
- 📧 Email newsletter subscription system
|
```bash
|
||||||
- 📰 Aggregated news from multiple Munich news sources
|
# 1. Configure environment
|
||||||
- 🎨 Beautiful, modern web interface
|
cp backend/.env.example backend/.env
|
||||||
- 📊 Subscription statistics
|
# Edit backend/.env with your email settings
|
||||||
- 🔄 Real-time news updates
|
|
||||||
|
|
||||||
## Tech Stack
|
# 2. Start everything
|
||||||
|
docker-compose up -d
|
||||||
|
|
||||||
- **Backend**: Python (Flask) - Modular architecture with blueprints
|
# 3. View logs
|
||||||
- **Frontend**: Node.js (Express + Vanilla JavaScript)
|
docker-compose logs -f
|
||||||
- **Database**: MongoDB
|
```
|
||||||
- **News Crawler**: Standalone Python microservice
|
|
||||||
- **News Sources**: RSS feeds from major Munich news outlets
|
|
||||||
|
|
||||||
## Setup Instructions
|
That's it! The system will automatically:
|
||||||
|
- **Backend API**: Runs continuously for tracking and analytics (http://localhost:5001)
|
||||||
|
- **6:00 AM Berlin time**: Crawl news articles and generate summaries
|
||||||
|
- **7:00 AM Berlin time**: Send newsletter to all subscribers
|
||||||
|
|
||||||
|
📖 **New to the project?** See [QUICKSTART.md](QUICKSTART.md) for a detailed 5-minute setup guide.
|
||||||
|
|
||||||
|
## 📋 System Overview
|
||||||
|
|
||||||
|
```
|
||||||
|
6:00 AM → News Crawler
|
||||||
|
↓
|
||||||
|
Fetches articles from RSS feeds
|
||||||
|
Extracts full content
|
||||||
|
Generates AI summaries
|
||||||
|
Saves to MongoDB
|
||||||
|
↓
|
||||||
|
7:00 AM → Newsletter Sender
|
||||||
|
↓
|
||||||
|
Waits for crawler to finish
|
||||||
|
Fetches today's articles
|
||||||
|
Generates newsletter with tracking
|
||||||
|
Sends to all subscribers
|
||||||
|
↓
|
||||||
|
✅ Done! Repeat tomorrow
|
||||||
|
```
|
||||||
|
|
||||||
|
## 🏗️ Architecture
|
||||||
|
|
||||||
|
### Components
|
||||||
|
|
||||||
|
- **MongoDB**: Data storage (articles, subscribers, tracking)
|
||||||
|
- **Backend API**: Flask API for tracking and analytics (port 5001)
|
||||||
|
- **News Crawler**: Automated RSS feed crawler with AI summarization
|
||||||
|
- **Newsletter Sender**: Automated email sender with tracking
|
||||||
|
- **Frontend**: React dashboard (optional)
|
||||||
|
|
||||||
|
### Technology Stack
|
||||||
|
|
||||||
|
- Python 3.11
|
||||||
|
- MongoDB 7.0
|
||||||
|
- Docker & Docker Compose
|
||||||
|
- Flask (API)
|
||||||
|
- Ollama (AI summarization)
|
||||||
|
- Schedule (automation)
|
||||||
|
- Jinja2 (email templates)
|
||||||
|
|
||||||
|
## 📦 Installation
|
||||||
|
|
||||||
### Prerequisites
|
### Prerequisites
|
||||||
|
|
||||||
- Python 3.8+
|
- Docker & Docker Compose
|
||||||
- Node.js 14+
|
- (Optional) Ollama for AI summarization
|
||||||
- npm or yarn
|
|
||||||
- Docker and Docker Compose (recommended for MongoDB) OR MongoDB (local installation or MongoDB Atlas account)
|
|
||||||
|
|
||||||
### Backend Setup
|
### Setup
|
||||||
|
|
||||||
1. Navigate to the backend directory:
|
1. **Clone the repository**
|
||||||
```bash
|
```bash
|
||||||
cd backend
|
git clone <repository-url>
|
||||||
```
|
cd munich-news
|
||||||
|
```
|
||||||
2. Create a virtual environment (recommended):
|
|
||||||
```bash
|
2. **Configure environment**
|
||||||
python3 -m venv venv
|
```bash
|
||||||
source venv/bin/activate # On Windows: venv\Scripts\activate
|
cp backend/.env.example backend/.env
|
||||||
```
|
# Edit backend/.env with your settings
|
||||||
|
```
|
||||||
3. Install dependencies:
|
|
||||||
```bash
|
3. **Start the system**
|
||||||
pip install -r requirements.txt
|
|
||||||
```
|
|
||||||
|
|
||||||
4. Set up MongoDB using Docker Compose (recommended):
|
|
||||||
```bash
|
```bash
|
||||||
# From the project root directory
|
|
||||||
docker-compose up -d
|
docker-compose up -d
|
||||||
```
|
```
|
||||||
|
|
||||||
This will start MongoDB in a Docker container. The database will be available at `mongodb://localhost:27017/`
|
## ⚙️ Configuration
|
||||||
|
|
||||||
**Useful Docker commands:**
|
Edit `backend/.env`:
|
||||||
```bash
|
|
||||||
# Start MongoDB
|
|
||||||
docker-compose up -d
|
|
||||||
|
|
||||||
# Stop MongoDB
|
```env
|
||||||
docker-compose down
|
# MongoDB
|
||||||
|
MONGODB_URI=mongodb://localhost:27017/
|
||||||
|
|
||||||
# View MongoDB logs
|
# Email (SMTP)
|
||||||
docker-compose logs -f mongodb
|
SMTP_SERVER=smtp.gmail.com
|
||||||
|
SMTP_PORT=587
|
||||||
|
EMAIL_USER=your-email@gmail.com
|
||||||
|
EMAIL_PASSWORD=your-app-password
|
||||||
|
|
||||||
# Restart MongoDB
|
# Newsletter
|
||||||
docker-compose restart mongodb
|
NEWSLETTER_MAX_ARTICLES=10
|
||||||
|
NEWSLETTER_HOURS_LOOKBACK=24
|
||||||
|
|
||||||
# Remove MongoDB and all data (WARNING: deletes all data)
|
# Tracking
|
||||||
docker-compose down -v
|
TRACKING_ENABLED=true
|
||||||
```
|
TRACKING_API_URL=http://localhost:5001
|
||||||
|
TRACKING_DATA_RETENTION_DAYS=90
|
||||||
|
|
||||||
**Alternative options:**
|
# Ollama (AI Summarization)
|
||||||
- **Local MongoDB**: Install MongoDB locally and make sure it's running
|
OLLAMA_ENABLED=true
|
||||||
- **MongoDB Atlas** (Cloud): Create a free account at [mongodb.com/cloud/atlas](https://www.mongodb.com/cloud/atlas) and get your connection string
|
OLLAMA_BASE_URL=http://127.0.0.1:11434
|
||||||
|
OLLAMA_MODEL=phi3:latest
|
||||||
5. Create a `.env` file in the backend directory:
|
|
||||||
```bash
|
|
||||||
# Copy the template file
|
|
||||||
cp env.template .env
|
|
||||||
```
|
|
||||||
|
|
||||||
Then edit `.env` with your configuration:
|
|
||||||
```env
|
|
||||||
# MongoDB connection (default: mongodb://localhost:27017/)
|
|
||||||
# For Docker Compose (no authentication):
|
|
||||||
MONGODB_URI=mongodb://localhost:27017/
|
|
||||||
# For Docker Compose with authentication (if you modify docker-compose.yml):
|
|
||||||
# MONGODB_URI=mongodb://admin:password@localhost:27017/
|
|
||||||
# Or for MongoDB Atlas:
|
|
||||||
# MONGODB_URI=mongodb+srv://username:password@cluster.mongodb.net/
|
|
||||||
|
|
||||||
# Email configuration (optional for testing)
|
|
||||||
SMTP_SERVER=smtp.gmail.com
|
|
||||||
SMTP_PORT=587
|
|
||||||
EMAIL_USER=your-email@gmail.com
|
|
||||||
EMAIL_PASSWORD=your-app-password
|
|
||||||
|
|
||||||
# Ollama Configuration (for AI-powered features)
|
|
||||||
# Remote Ollama server URL
|
|
||||||
OLLAMA_BASE_URL=http://your-remote-server-ip:11434
|
|
||||||
# Optional: API key if your Ollama server requires authentication
|
|
||||||
# OLLAMA_API_KEY=your-api-key-here
|
|
||||||
# Model name to use (e.g., llama2, mistral, codellama, llama3)
|
|
||||||
OLLAMA_MODEL=llama2
|
|
||||||
# Enable/disable Ollama features (true/false)
|
|
||||||
OLLAMA_ENABLED=false
|
|
||||||
```
|
|
||||||
|
|
||||||
**Notes:**
|
|
||||||
- For Gmail, you'll need to use an [App Password](https://support.google.com/accounts/answer/185833) instead of your regular password.
|
|
||||||
- For Ollama, replace `your-remote-server-ip` with your actual server IP or domain. Set `OLLAMA_ENABLED=true` to enable AI features.
|
|
||||||
|
|
||||||
6. Run the backend server:
|
|
||||||
```bash
|
|
||||||
python app.py
|
|
||||||
```
|
```
|
||||||
|
|
||||||
The backend will run on `http://localhost:5001` (port 5001 to avoid conflict with AirPlay on macOS)
|
## 📊 Usage
|
||||||
|
|
||||||
### Frontend Setup
|
### View Logs
|
||||||
|
|
||||||
1. Navigate to the frontend directory:
|
|
||||||
```bash
|
|
||||||
cd frontend
|
|
||||||
```
|
|
||||||
|
|
||||||
2. Install dependencies:
|
|
||||||
```bash
|
|
||||||
npm install
|
|
||||||
```
|
|
||||||
|
|
||||||
3. Run the frontend server:
|
|
||||||
```bash
|
|
||||||
npm start
|
|
||||||
```
|
|
||||||
|
|
||||||
The frontend will run on `http://localhost:3000`
|
|
||||||
|
|
||||||
## Usage
|
|
||||||
|
|
||||||
1. Open your browser and go to `http://localhost:3000`
|
|
||||||
2. Enter your email address to subscribe to the newsletter
|
|
||||||
3. View the latest Munich news on the homepage
|
|
||||||
4. The backend will aggregate news from multiple Munich news sources
|
|
||||||
|
|
||||||
## Sending Newsletters
|
|
||||||
|
|
||||||
To send newsletters to all subscribers, you can add a scheduled task or manually trigger the `send_newsletter()` function in `app.py`. For production, consider using:
|
|
||||||
|
|
||||||
- **Cron jobs** (Linux/Mac)
|
|
||||||
- **Task Scheduler** (Windows)
|
|
||||||
- **Celery** with Redis/RabbitMQ for more advanced scheduling
|
|
||||||
- **Cloud functions** (AWS Lambda, Google Cloud Functions)
|
|
||||||
|
|
||||||
Example cron job to send daily at 8 AM:
|
|
||||||
```
|
|
||||||
0 8 * * * cd /path/to/munich-news/backend && python -c "from app import send_newsletter; send_newsletter()"
|
|
||||||
```
|
|
||||||
|
|
||||||
## Project Structure
|
|
||||||
|
|
||||||
```
|
|
||||||
munich-news/
|
|
||||||
├── backend/ # Main API server
|
|
||||||
│ ├── app.py # Flask application entry point
|
|
||||||
│ ├── config.py # Configuration management
|
|
||||||
│ ├── database.py # Database connection
|
|
||||||
│ ├── routes/ # API endpoints (blueprints)
|
|
||||||
│ ├── services/ # Business logic
|
|
||||||
│ ├── templates/ # Email templates
|
|
||||||
│ └── requirements.txt # Python dependencies
|
|
||||||
├── news_crawler/ # Crawler microservice
|
|
||||||
│ ├── crawler_service.py # Standalone crawler
|
|
||||||
│ ├── ollama_client.py # AI summarization client
|
|
||||||
│ ├── requirements.txt # Crawler dependencies
|
|
||||||
│ └── README.md # Crawler documentation
|
|
||||||
├── news_sender/ # Newsletter sender microservice
|
|
||||||
│ ├── sender_service.py # Standalone email sender
|
|
||||||
│ ├── newsletter_template.html # Email template
|
|
||||||
│ ├── requirements.txt # Sender dependencies
|
|
||||||
│ └── README.md # Sender documentation
|
|
||||||
├── frontend/ # Web interface
|
|
||||||
│ ├── server.js # Express server
|
|
||||||
│ ├── package.json # Node.js dependencies
|
|
||||||
│ └── public/
|
|
||||||
│ ├── index.html # Main page
|
|
||||||
│ ├── styles.css # Styling
|
|
||||||
│ └── app.js # Frontend JavaScript
|
|
||||||
├── docker-compose.yml # Docker Compose for MongoDB (development)
|
|
||||||
├── docker-compose.prod.yml # Docker Compose with authentication (production)
|
|
||||||
└── README.md
|
|
||||||
```
|
|
||||||
|
|
||||||
## API Endpoints
|
|
||||||
|
|
||||||
### `POST /api/subscribe`
|
|
||||||
Subscribe to the newsletter
|
|
||||||
- Body: `{ "email": "user@example.com" }`
|
|
||||||
|
|
||||||
### `POST /api/unsubscribe`
|
|
||||||
Unsubscribe from the newsletter
|
|
||||||
- Body: `{ "email": "user@example.com" }`
|
|
||||||
|
|
||||||
### `GET /api/news`
|
|
||||||
Get latest Munich news articles
|
|
||||||
|
|
||||||
### `GET /api/stats`
|
|
||||||
Get subscription statistics
|
|
||||||
- Returns: `{ "subscribers": number, "articles": number, "crawled_articles": number }`
|
|
||||||
|
|
||||||
### `GET /api/news/<article_url>`
|
|
||||||
Get full article content by URL
|
|
||||||
- Returns: Full article with content, author, word count, etc.
|
|
||||||
|
|
||||||
### `GET /api/ollama/ping`
|
|
||||||
Test connection to Ollama server
|
|
||||||
- Returns: Connection status and Ollama configuration
|
|
||||||
- Response examples:
|
|
||||||
- Success: `{ "status": "success", "message": "...", "response": "...", "ollama_config": {...} }`
|
|
||||||
- Disabled: `{ "status": "disabled", "message": "...", "ollama_config": {...} }`
|
|
||||||
- Error: `{ "status": "error", "message": "...", "error_details": "...", "troubleshooting": {...}, "ollama_config": {...} }`
|
|
||||||
|
|
||||||
### `GET /api/ollama/models`
|
|
||||||
List available models on Ollama server
|
|
||||||
- Returns: List of available models and current configuration
|
|
||||||
- Response: `{ "status": "success", "models": [...], "current_model": "...", "ollama_config": {...} }`
|
|
||||||
|
|
||||||
### `GET /api/rss-feeds`
|
|
||||||
Get all RSS feeds
|
|
||||||
- Returns: `{ "feeds": [...] }`
|
|
||||||
|
|
||||||
### `POST /api/rss-feeds`
|
|
||||||
Add a new RSS feed
|
|
||||||
- Body: `{ "name": "Feed Name", "url": "https://example.com/rss" }`
|
|
||||||
- Returns: `{ "message": "...", "id": "..." }`
|
|
||||||
|
|
||||||
### `DELETE /api/rss-feeds/<feed_id>`
|
|
||||||
Remove an RSS feed
|
|
||||||
- Returns: `{ "message": "..." }`
|
|
||||||
|
|
||||||
### `PATCH /api/rss-feeds/<feed_id>/toggle`
|
|
||||||
Toggle RSS feed active status
|
|
||||||
- Returns: `{ "message": "...", "active": boolean }`
|
|
||||||
|
|
||||||
## Database Schema
|
|
||||||
|
|
||||||
### Articles Collection
|
|
||||||
```javascript
|
|
||||||
{
|
|
||||||
_id: ObjectId,
|
|
||||||
title: String,
|
|
||||||
link: String (unique),
|
|
||||||
summary: String,
|
|
||||||
source: String,
|
|
||||||
published_at: String,
|
|
||||||
created_at: DateTime
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
### Subscribers Collection
|
|
||||||
```javascript
|
|
||||||
{
|
|
||||||
_id: ObjectId,
|
|
||||||
email: String (unique, lowercase),
|
|
||||||
subscribed_at: DateTime,
|
|
||||||
status: String ('active' | 'inactive')
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
**Indexes:**
|
|
||||||
- `articles.link` - Unique index to prevent duplicate articles
|
|
||||||
- `articles.created_at` - For efficient sorting
|
|
||||||
- `subscribers.email` - Unique index for email lookups
|
|
||||||
- `subscribers.subscribed_at` - For analytics
|
|
||||||
|
|
||||||
## News Crawler Microservice
|
|
||||||
|
|
||||||
The project includes a standalone crawler microservice that fetches full article content from RSS feeds.
|
|
||||||
|
|
||||||
### Running the Crawler
|
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
cd news_crawler
|
# All services
|
||||||
|
docker-compose logs -f
|
||||||
|
|
||||||
# Install dependencies
|
# Specific service
|
||||||
pip install -r requirements.txt
|
docker-compose logs -f crawler
|
||||||
|
docker-compose logs -f sender
|
||||||
# Run crawler
|
docker-compose logs -f mongodb
|
||||||
python crawler_service.py 10
|
|
||||||
```
|
```
|
||||||
|
|
||||||
See `news_crawler/README.md` for detailed documentation.
|
### Manual Operations
|
||||||
|
|
||||||
### What It Does
|
|
||||||
|
|
||||||
- Crawls full article content from RSS feed links
|
|
||||||
- Extracts text, word count, and metadata
|
|
||||||
- Stores in MongoDB for AI processing
|
|
||||||
- Skips already-crawled articles
|
|
||||||
- Rate-limited (1 second between requests)
|
|
||||||
|
|
||||||
## Customization
|
|
||||||
|
|
||||||
### Adding News Sources
|
|
||||||
|
|
||||||
Use the API to add RSS feeds dynamically:
|
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
curl -X POST http://localhost:5001/api/rss-feeds \
|
# Run crawler manually
|
||||||
-H "Content-Type: application/json" \
|
docker-compose exec crawler python crawler_service.py 10
|
||||||
-d '{"name": "Your Source Name", "url": "https://example.com/rss"}'
|
|
||||||
|
# Send test newsletter
|
||||||
|
docker-compose exec sender python sender_service.py test your-email@example.com
|
||||||
|
|
||||||
|
# Preview newsletter
|
||||||
|
docker-compose exec sender python sender_service.py preview
|
||||||
```
|
```
|
||||||
|
|
||||||
### Styling
|
### Database Access
|
||||||
|
|
||||||
Modify `frontend/public/styles.css` to customize the appearance.
|
```bash
|
||||||
|
# Connect to MongoDB
|
||||||
|
docker-compose exec mongodb mongosh munich_news
|
||||||
|
|
||||||
## License
|
# View articles
|
||||||
|
db.articles.find().sort({ crawled_at: -1 }).limit(5).pretty()
|
||||||
|
|
||||||
MIT
|
# View subscribers
|
||||||
|
db.subscribers.find({ active: true }).pretty()
|
||||||
|
|
||||||
## Contributing
|
# View tracking data
|
||||||
|
db.newsletter_sends.find().sort({ created_at: -1 }).limit(10).pretty()
|
||||||
|
```
|
||||||
|
|
||||||
Feel free to submit issues and enhancement requests!
|
## 🔧 Management
|
||||||
|
|
||||||
|
### Add RSS Feeds
|
||||||
|
|
||||||
|
```bash
|
||||||
|
mongosh munich_news
|
||||||
|
|
||||||
|
db.rss_feeds.insertOne({
|
||||||
|
name: "Source Name",
|
||||||
|
url: "https://example.com/rss",
|
||||||
|
active: true
|
||||||
|
})
|
||||||
|
```
|
||||||
|
|
||||||
|
### Add Subscribers
|
||||||
|
|
||||||
|
```bash
|
||||||
|
mongosh munich_news
|
||||||
|
|
||||||
|
db.subscribers.insertOne({
|
||||||
|
email: "user@example.com",
|
||||||
|
active: true,
|
||||||
|
tracking_enabled: true,
|
||||||
|
subscribed_at: new Date()
|
||||||
|
})
|
||||||
|
```
|
||||||
|
|
||||||
|
### View Analytics
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Newsletter metrics
|
||||||
|
curl http://localhost:5001/api/analytics/newsletter/2024-01-15
|
||||||
|
|
||||||
|
# Article performance
|
||||||
|
curl http://localhost:5001/api/analytics/article/https://example.com/article
|
||||||
|
|
||||||
|
# Subscriber activity
|
||||||
|
curl http://localhost:5001/api/analytics/subscriber/user@example.com
|
||||||
|
```
|
||||||
|
|
||||||
|
## ⏰ Schedule Configuration
|
||||||
|
|
||||||
|
### Change Crawler Time (default: 6:00 AM)
|
||||||
|
|
||||||
|
Edit `news_crawler/scheduled_crawler.py`:
|
||||||
|
```python
|
||||||
|
schedule.every().day.at("06:00").do(run_crawler) # Change time
|
||||||
|
```
|
||||||
|
|
||||||
|
### Change Sender Time (default: 7:00 AM)
|
||||||
|
|
||||||
|
Edit `news_sender/scheduled_sender.py`:
|
||||||
|
```python
|
||||||
|
schedule.every().day.at("07:00").do(run_sender) # Change time
|
||||||
|
```
|
||||||
|
|
||||||
|
After changes:
|
||||||
|
```bash
|
||||||
|
docker-compose up -d --build
|
||||||
|
```
|
||||||
|
|
||||||
|
## 📈 Monitoring
|
||||||
|
|
||||||
|
### Container Status
|
||||||
|
|
||||||
|
```bash
|
||||||
|
docker-compose ps
|
||||||
|
```
|
||||||
|
|
||||||
|
### Check Next Scheduled Runs
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Crawler
|
||||||
|
docker-compose logs crawler | grep "Next scheduled run"
|
||||||
|
|
||||||
|
# Sender
|
||||||
|
docker-compose logs sender | grep "Next scheduled run"
|
||||||
|
```
|
||||||
|
|
||||||
|
### Engagement Metrics
|
||||||
|
|
||||||
|
```bash
|
||||||
|
mongosh munich_news
|
||||||
|
|
||||||
|
// Open rate
|
||||||
|
var sent = db.newsletter_sends.countDocuments({ newsletter_id: "2024-01-15" })
|
||||||
|
var opened = db.newsletter_sends.countDocuments({ newsletter_id: "2024-01-15", opened: true })
|
||||||
|
print("Open Rate: " + ((opened / sent) * 100).toFixed(2) + "%")
|
||||||
|
|
||||||
|
// Click rate
|
||||||
|
var clicks = db.link_clicks.countDocuments({ newsletter_id: "2024-01-15" })
|
||||||
|
print("Click Rate: " + ((clicks / sent) * 100).toFixed(2) + "%")
|
||||||
|
```
|
||||||
|
|
||||||
|
## 🐛 Troubleshooting
|
||||||
|
|
||||||
|
### Crawler Not Finding Articles
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Check RSS feeds
|
||||||
|
mongosh munich_news --eval "db.rss_feeds.find({ active: true })"
|
||||||
|
|
||||||
|
# Test manually
|
||||||
|
docker-compose exec crawler python crawler_service.py 5
|
||||||
|
```
|
||||||
|
|
||||||
|
### Newsletter Not Sending
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Check email config
|
||||||
|
docker-compose exec sender python -c "from sender_service import Config; print(Config.SMTP_SERVER)"
|
||||||
|
|
||||||
|
# Test email
|
||||||
|
docker-compose exec sender python sender_service.py test your-email@example.com
|
||||||
|
```
|
||||||
|
|
||||||
|
### Containers Not Starting
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Check logs
|
||||||
|
docker-compose logs
|
||||||
|
|
||||||
|
# Rebuild
|
||||||
|
docker-compose up -d --build
|
||||||
|
|
||||||
|
# Reset everything
|
||||||
|
docker-compose down -v
|
||||||
|
docker-compose up -d
|
||||||
|
```
|
||||||
|
|
||||||
|
## 🔐 Privacy & Compliance
|
||||||
|
|
||||||
|
### GDPR Features
|
||||||
|
|
||||||
|
- **Data Retention**: Automatic anonymization after 90 days
|
||||||
|
- **Opt-Out**: Subscribers can disable tracking
|
||||||
|
- **Data Deletion**: Full data removal on request
|
||||||
|
- **Transparency**: Privacy notice in all emails
|
||||||
|
|
||||||
|
### Privacy Endpoints
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Delete subscriber data
|
||||||
|
curl -X DELETE http://localhost:5001/api/tracking/subscriber/user@example.com
|
||||||
|
|
||||||
|
# Anonymize old data
|
||||||
|
curl -X POST http://localhost:5001/api/tracking/anonymize
|
||||||
|
|
||||||
|
# Opt out of tracking
|
||||||
|
curl -X POST http://localhost:5001/api/tracking/subscriber/user@example.com/opt-out
|
||||||
|
```
|
||||||
|
|
||||||
|
## 📚 Documentation
|
||||||
|
|
||||||
|
### Getting Started
|
||||||
|
- **[QUICKSTART.md](QUICKSTART.md)** - 5-minute setup guide
|
||||||
|
- **[PROJECT_STRUCTURE.md](PROJECT_STRUCTURE.md)** - Project layout
|
||||||
|
- **[CONTRIBUTING.md](CONTRIBUTING.md)** - Contribution guidelines
|
||||||
|
|
||||||
|
### Technical Documentation
|
||||||
|
- **[docs/ARCHITECTURE.md](docs/ARCHITECTURE.md)** - System architecture
|
||||||
|
- **[docs/DEPLOYMENT.md](docs/DEPLOYMENT.md)** - Deployment guide
|
||||||
|
- **[docs/API.md](docs/API.md)** - API reference
|
||||||
|
- **[docs/DATABASE_SCHEMA.md](docs/DATABASE_SCHEMA.md)** - Database structure
|
||||||
|
- **[docs/BACKEND_STRUCTURE.md](docs/BACKEND_STRUCTURE.md)** - Backend organization
|
||||||
|
|
||||||
|
### Component Documentation
|
||||||
|
- **[docs/CRAWLER_HOW_IT_WORKS.md](docs/CRAWLER_HOW_IT_WORKS.md)** - Crawler internals
|
||||||
|
- **[docs/EXTRACTION_STRATEGIES.md](docs/EXTRACTION_STRATEGIES.md)** - Content extraction
|
||||||
|
- **[docs/RSS_URL_EXTRACTION.md](docs/RSS_URL_EXTRACTION.md)** - RSS parsing
|
||||||
|
|
||||||
|
## 🧪 Testing
|
||||||
|
|
||||||
|
All test files are organized in the `tests/` directory:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Run crawler tests
|
||||||
|
docker-compose exec crawler python tests/crawler/test_crawler.py
|
||||||
|
|
||||||
|
# Run sender tests
|
||||||
|
docker-compose exec sender python tests/sender/test_tracking_integration.py
|
||||||
|
|
||||||
|
# Run backend tests
|
||||||
|
docker-compose exec backend python tests/backend/test_tracking.py
|
||||||
|
```
|
||||||
|
|
||||||
|
## 🚀 Production Deployment
|
||||||
|
|
||||||
|
### Environment Setup
|
||||||
|
|
||||||
|
1. Update `backend/.env` with production values
|
||||||
|
2. Set strong MongoDB password
|
||||||
|
3. Use HTTPS for tracking URLs
|
||||||
|
4. Configure proper SMTP server
|
||||||
|
|
||||||
|
### Security
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Use production compose file
|
||||||
|
docker-compose -f docker-compose.prod.yml up -d
|
||||||
|
|
||||||
|
# Set MongoDB password
|
||||||
|
export MONGO_PASSWORD=your-secure-password
|
||||||
|
```
|
||||||
|
|
||||||
|
### Monitoring
|
||||||
|
|
||||||
|
- Set up log rotation
|
||||||
|
- Configure health checks
|
||||||
|
- Set up alerts for failures
|
||||||
|
- Monitor database size
|
||||||
|
|
||||||
|
## 📝 License
|
||||||
|
|
||||||
|
[Your License Here]
|
||||||
|
|
||||||
|
## 🤝 Contributing
|
||||||
|
|
||||||
|
Contributions welcome! Please read CONTRIBUTING.md first.
|
||||||
|
|
||||||
|
## 📧 Support
|
||||||
|
|
||||||
|
For issues or questions, please open a GitHub issue.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Built with ❤️ for Munich News Daily**
|
||||||

(deleted file, -132 lines)
@@ -1,132 +0,0 @@
|
|||||||
# Testing RSS Feed URL Extraction
|
|
||||||
|
|
||||||
## Quick Test (Recommended)
|
|
||||||
|
|
||||||
Run this from the project root with backend virtual environment activated:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# 1. Activate backend virtual environment
|
|
||||||
cd backend
|
|
||||||
source venv/bin/activate # On Windows: venv\Scripts\activate
|
|
||||||
|
|
||||||
# 2. Go back to project root
|
|
||||||
cd ..
|
|
||||||
|
|
||||||
# 3. Run the test
|
|
||||||
python test_feeds_quick.py
|
|
||||||
```
|
|
||||||
|
|
||||||
This will:
|
|
||||||
- ✓ Check what RSS feeds are in your database
|
|
||||||
- ✓ Fetch each feed
|
|
||||||
- ✓ Test URL extraction on first 3 articles
|
|
||||||
- ✓ Show what fields are available
|
|
||||||
- ✓ Verify summary and date extraction
|
|
||||||
|
|
||||||
## Expected Output
|
|
||||||
|
|
||||||
```
|
|
||||||
================================================================================
|
|
||||||
RSS Feed Test - Checking Database Feeds
|
|
||||||
================================================================================
|
|
||||||
|
|
||||||
✓ Found 3 feed(s) in database
|
|
||||||
|
|
||||||
================================================================================
|
|
||||||
Feed: Süddeutsche Zeitung München
|
|
||||||
URL: https://www.sueddeutsche.de/muenchen/rss
|
|
||||||
Active: True
|
|
||||||
================================================================================
|
|
||||||
Fetching RSS feed...
|
|
||||||
✓ Found 20 entries
|
|
||||||
|
|
||||||
--- Entry 1 ---
|
|
||||||
Title: New U-Bahn Line Opens in Munich
|
|
||||||
✓ URL extracted: https://www.sueddeutsche.de/muenchen/article-123
|
|
||||||
✓ Summary: The new U-Bahn line connecting the city center...
|
|
||||||
✓ Date: Mon, 10 Nov 2024 10:00:00 +0100
|
|
||||||
|
|
||||||
--- Entry 2 ---
|
|
||||||
Title: Munich Weather Update
|
|
||||||
✓ URL extracted: https://www.sueddeutsche.de/muenchen/article-124
|
|
||||||
✓ Summary: Weather forecast for the week...
|
|
||||||
✓ Date: Mon, 10 Nov 2024 09:30:00 +0100
|
|
||||||
|
|
||||||
...
|
|
||||||
```
|
|
||||||
|
|
||||||
## If No Feeds Found
|
|
||||||
|
|
||||||
Add a feed first:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
curl -X POST http://localhost:5001/api/rss-feeds \
|
|
||||||
-H "Content-Type: application/json" \
|
|
||||||
-d '{"name": "Süddeutsche Politik", "url": "https://rss.sueddeutsche.de/rss/Politik"}'
|
|
||||||
```
|
|
||||||
|
|
||||||
## Testing News Crawler
|
|
||||||
|
|
||||||
Once feeds are verified, test the crawler:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# 1. Install crawler dependencies
|
|
||||||
cd news_crawler
|
|
||||||
pip install -r requirements.txt
|
|
||||||
|
|
||||||
# 2. Run the test
|
|
||||||
python test_rss_feeds.py
|
|
||||||
|
|
||||||
# 3. Or run the actual crawler
|
|
||||||
python crawler_service.py 5
|
|
||||||
```
|
|
||||||
|
|
||||||
## Troubleshooting
|
|
||||||
|
|
||||||
### "No module named 'pymongo'"
|
|
||||||
- Activate the backend virtual environment first
|
|
||||||
- Or install dependencies: `pip install -r backend/requirements.txt`
|
|
||||||
|
|
||||||
### "No RSS feeds in database"
|
|
||||||
- Make sure backend is running
|
|
||||||
- Add feeds via API (see above)
|
|
||||||
- Or check if MongoDB is running: `docker-compose ps`
|
|
||||||
|
|
||||||
### "Could not extract URL"
|
|
||||||
- The test will show available fields
|
|
||||||
- Check if the feed uses `guid`, `id`, or `links` instead of `link`
|
|
||||||
- Our utility should handle most cases automatically
|
|
||||||
|
|
||||||
### "No entries found"
|
|
||||||
- The RSS feed URL might be invalid
|
|
||||||
- Try opening the URL in a browser
|
|
||||||
- Check if it returns valid XML
|
|
||||||
|
|
||||||
## Manual Database Check
|
|
||||||
|
|
||||||
Using mongosh:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
mongosh
|
|
||||||
use munich_news
|
|
||||||
db.rss_feeds.find()
|
|
||||||
db.articles.find().limit(3)
|
|
||||||
```
|
|
||||||
|
|
||||||
## What to Look For
|
|
||||||
|
|
||||||
✅ **Good signs:**
|
|
||||||
- URLs are extracted successfully
|
|
||||||
- URLs start with `http://` or `https://`
|
|
||||||
- Summaries are present
|
|
||||||
- Dates are extracted
|
|
||||||
|
|
||||||
⚠️ **Warning signs:**
|
|
||||||
- "Could not extract URL" messages
|
|
||||||
- Empty summaries (not critical)
|
|
||||||
- Missing dates (not critical)
|
|
||||||
|
|
||||||
❌ **Problems:**
|
|
||||||
- No entries found in feed
|
|
||||||
- All URL extractions fail
|
|
||||||
- Feed parsing errors
|
|
||||||

backend/.env.example (new file, +28 lines)
@@ -0,0 +1,28 @@
# MongoDB Configuration
MONGODB_URI=mongodb://localhost:27017/

# Email Configuration (Required)
SMTP_SERVER=smtp.gmail.com
SMTP_PORT=587
EMAIL_USER=your-email@gmail.com
EMAIL_PASSWORD=your-app-password

# Newsletter Settings
NEWSLETTER_MAX_ARTICLES=10
NEWSLETTER_HOURS_LOOKBACK=24
WEBSITE_URL=http://localhost:3000

# Tracking Configuration
TRACKING_ENABLED=true
TRACKING_API_URL=http://localhost:5001
TRACKING_DATA_RETENTION_DAYS=90

# Ollama Configuration (AI Summarization)
OLLAMA_ENABLED=true
OLLAMA_BASE_URL=http://127.0.0.1:11434
OLLAMA_MODEL=phi3:latest
OLLAMA_TIMEOUT=120
SUMMARY_MAX_WORDS=150

# Flask Server Configuration
FLASK_PORT=5001
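The SMTP_* and EMAIL_* values above are what the sender service uses to deliver mail. As a minimal sketch of that path (the real news_sender/sender_service.py renders a full newsletter from MongoDB and a template, which is omitted here):

```python
# Minimal sketch: send one message using the SMTP settings from backend/.env.
# The real sender renders the full newsletter; this shows only the delivery path.
import os
import smtplib
from email.mime.text import MIMEText

msg = MIMEText("Test message from Munich News Daily")
msg["Subject"] = "Munich News Daily - SMTP test"
msg["From"] = os.environ["EMAIL_USER"]
msg["To"] = "your-email@example.com"

server = os.getenv("SMTP_SERVER", "smtp.gmail.com")
port = int(os.getenv("SMTP_PORT", "587"))

with smtplib.SMTP(server, port) as smtp:
    smtp.starttls()  # Gmail on port 587 requires STARTTLS and an app password
    smtp.login(os.environ["EMAIL_USER"], os.environ["EMAIL_PASSWORD"])
    smtp.send_message(msg)
```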

backend/Dockerfile (new file, +20 lines)
@@ -0,0 +1,20 @@
FROM python:3.11-slim

WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application files
COPY . .

# Set timezone to Berlin
ENV TZ=Europe/Berlin
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone

# Expose Flask port
EXPOSE 5001

# Run the Flask application
CMD ["python", "-u", "app.py"]
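Since the image exposes port 5001 and app.py (diffed below) gains a /health endpoint, readiness can be checked with a small standard-library script. This is only a sketch, not part of the repository.

```python
# Sketch: poll the backend's /health endpoint (added in app.py below) until it responds.
import json
import time
import urllib.request


def wait_for_backend(url="http://localhost:5001/health", timeout=60):
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                body = json.loads(resp.read().decode())
                if body.get("status") == "healthy":
                    return True
        except OSError:
            pass  # backend not accepting connections yet
        time.sleep(2)
    return False


if __name__ == "__main__":
    print("backend ready" if wait_for_backend() else "backend did not become healthy in time")
```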

backend/app.py
@@ -7,6 +7,8 @@ from routes.news_routes import news_bp
 from routes.rss_routes import rss_bp
 from routes.ollama_routes import ollama_bp
 from routes.newsletter_routes import newsletter_bp
+from routes.tracking_routes import tracking_bp
+from routes.analytics_routes import analytics_bp
 
 # Initialize Flask app
 app = Flask(__name__)
@@ -21,9 +23,17 @@ app.register_blueprint(news_bp)
 app.register_blueprint(rss_bp)
 app.register_blueprint(ollama_bp)
 app.register_blueprint(newsletter_bp)
+app.register_blueprint(tracking_bp)
+app.register_blueprint(analytics_bp)
+
+# Health check endpoint
+@app.route('/health')
+def health():
+    return {'status': 'healthy', 'service': 'munich-news-backend'}, 200
 
 # Print configuration
 Config.print_config()
 
 if __name__ == '__main__':
-    app.run(debug=True, port=Config.FLASK_PORT, host='127.0.0.1')
+    # Use 0.0.0.0 to allow Docker container access
+    app.run(debug=True, port=Config.FLASK_PORT, host='0.0.0.0')
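The tracking_bp and analytics_bp blueprints registered above are not included in this diff, so the following is purely an illustrative sketch of what a tracking-pixel route of that shape might look like. The route path and the opened/opened_at fields beyond what the README documents are assumptions.

```python
# Illustrative sketch of a tracking blueprint like routes/tracking_routes.py.
# The actual routes are not part of this commit; paths and extra fields are assumptions.
from datetime import datetime, timezone

from flask import Blueprint, Response

from database import newsletter_sends_collection  # collection added in database.py below

tracking_bp = Blueprint('tracking', __name__)


@tracking_bp.route('/api/tracking/pixel/<tracking_id>')
def track_open(tracking_id):
    # Mark the corresponding send as opened; tracking_id carries a unique index (see database.py).
    newsletter_sends_collection.update_one(
        {'tracking_id': tracking_id},
        {'$set': {'opened': True, 'opened_at': datetime.now(timezone.utc)}},
    )
    # A production version would return a 1x1 transparent GIF; 204 keeps the sketch short.
    return Response(status=204)
```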

backend/config.py
@@ -40,6 +40,11 @@ class Config:
     # Flask
     FLASK_PORT = int(os.getenv('FLASK_PORT', '5000'))
 
+    # Tracking
+    TRACKING_ENABLED = os.getenv('TRACKING_ENABLED', 'true').lower() == 'true'
+    TRACKING_API_URL = os.getenv('TRACKING_API_URL', f'http://localhost:{os.getenv("FLASK_PORT", "5000")}')
+    TRACKING_DATA_RETENTION_DAYS = int(os.getenv('TRACKING_DATA_RETENTION_DAYS', '90'))
+
     @classmethod
     def print_config(cls):
         """Print configuration (without sensitive data)"""
@@ -50,3 +55,5 @@ class Config:
         print(f"  Ollama Base URL: {cls.OLLAMA_BASE_URL}")
         print(f"  Ollama Model: {cls.OLLAMA_MODEL}")
         print(f"  Ollama Enabled: {cls.OLLAMA_ENABLED}")
+        print(f"  Tracking Enabled: {cls.TRACKING_ENABLED}")
+        print(f"  Tracking API URL: {cls.TRACKING_API_URL}")
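TRACKING_API_URL is the base address embedded in tracking pixel and link URLs (see the env.template comments below). Here is a sketch of how such URLs might be composed from this config; the exact path layout is an assumption.

```python
# Sketch: composing tracking URLs from Config.TRACKING_API_URL.
# The /api/tracking/... path layout is an assumption for illustration.
import uuid

from config import Config


def tracking_pixel_url(tracking_id: str) -> str:
    return f"{Config.TRACKING_API_URL}/api/tracking/pixel/{tracking_id}"


def tracked_link_url(tracking_id: str) -> str:
    return f"{Config.TRACKING_API_URL}/api/tracking/click/{tracking_id}"


if __name__ == "__main__":
    tid = uuid.uuid4().hex  # one id per (newsletter, subscriber) send
    print(tracking_pixel_url(tid))
    print(tracked_link_url(tid))
```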

backend/database.py
@@ -11,6 +11,11 @@ articles_collection = db['articles']
 subscribers_collection = db['subscribers']
 rss_feeds_collection = db['rss_feeds']
 
+# Tracking Collections
+newsletter_sends_collection = db['newsletter_sends']
+link_clicks_collection = db['link_clicks']
+subscriber_activity_collection = db['subscriber_activity']
+
 
 def init_db():
     """Initialize database with indexes"""
@@ -25,6 +30,9 @@ def init_db():
     # Create unique index on RSS feed URLs
     rss_feeds_collection.create_index('url', unique=True)
 
+    # Initialize tracking collections indexes
+    init_tracking_collections()
+
     # Initialize default RSS feeds if collection is empty
     if rss_feeds_collection.count_documents({}) == 0:
         default_feeds = [
@@ -51,3 +59,37 @@ def init_db():
         print(f"Initialized {len(default_feeds)} default RSS feeds")
 
     print("Database initialized with indexes")
+
+
+def init_tracking_collections():
+    """Initialize tracking collections with indexes for email tracking system"""
+
+    # Newsletter Sends Collection Indexes
+    # Unique index on tracking_id for fast pixel/click lookups
+    newsletter_sends_collection.create_index('tracking_id', unique=True)
+    # Index on newsletter_id for analytics queries
+    newsletter_sends_collection.create_index('newsletter_id')
+    # Index on subscriber_email for user activity queries
+    newsletter_sends_collection.create_index('subscriber_email')
+    # Index on sent_at for time-based queries
+    newsletter_sends_collection.create_index('sent_at')
+
+    # Link Clicks Collection Indexes
+    # Unique index on tracking_id for fast redirect lookups
+    link_clicks_collection.create_index('tracking_id', unique=True)
+    # Index on newsletter_id for analytics queries
+    link_clicks_collection.create_index('newsletter_id')
+    # Index on article_url for article performance queries
+    link_clicks_collection.create_index('article_url')
+    # Index on subscriber_email for user activity queries
+    link_clicks_collection.create_index('subscriber_email')
+
+    # Subscriber Activity Collection Indexes
+    # Unique index on email for fast lookups
+    subscriber_activity_collection.create_index('email', unique=True)
+    # Index on status for filtering by activity level
+    subscriber_activity_collection.create_index('status')
+    # Index on last_opened_at for time-based queries
+    subscriber_activity_collection.create_index('last_opened_at')
+
+    print("Tracking collections initialized with indexes")
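With these collections and indexes in place, recording a send and aggregating opens is straightforward. The sketch below sticks to the indexed fields plus the opened flag and the newsletter_id date format quoted in the README; anything beyond those is illustrative.

```python
# Sketch: record one newsletter send and compute an open rate for a given newsletter.
# Field names mirror the indexes above; anything beyond them is illustrative.
import uuid
from datetime import datetime, timezone

from database import newsletter_sends_collection


def record_send(newsletter_id: str, subscriber_email: str) -> str:
    tracking_id = uuid.uuid4().hex
    newsletter_sends_collection.insert_one({
        'tracking_id': tracking_id,        # unique index -> fast pixel lookups
        'newsletter_id': newsletter_id,    # e.g. "2024-01-15"
        'subscriber_email': subscriber_email,
        'sent_at': datetime.now(timezone.utc),
        'opened': False,
    })
    return tracking_id


def open_rate(newsletter_id: str) -> float:
    sent = newsletter_sends_collection.count_documents({'newsletter_id': newsletter_id})
    opened = newsletter_sends_collection.count_documents(
        {'newsletter_id': newsletter_id, 'opened': True})
    return (opened / sent) * 100 if sent else 0.0
```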

backend/env.template
@@ -30,3 +30,12 @@ OLLAMA_TIMEOUT=30
 # Port for Flask server (default: 5001 to avoid AirPlay conflict on macOS)
 FLASK_PORT=5001
 
+# Tracking Configuration
+# Enable/disable email tracking features (true/false)
+TRACKING_ENABLED=true
+# Base URL for tracking API (used in tracking pixel and link URLs)
+# In production, use your actual domain (e.g., https://yourdomain.com)
+TRACKING_API_URL=http://localhost:5001
+# Number of days to retain tracking data before anonymization
+TRACKING_DATA_RETENTION_DAYS=90
+
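TRACKING_DATA_RETENTION_DAYS drives the automatic anonymization described in the README's GDPR section. Here is a sketch of what that cleanup could look like; which fields are blanked in the real implementation is an assumption.

```python
# Sketch: anonymize tracking records older than TRACKING_DATA_RETENTION_DAYS.
# Which fields get blanked in the real implementation is an assumption.
import os
from datetime import datetime, timedelta, timezone

from database import newsletter_sends_collection

retention_days = int(os.getenv('TRACKING_DATA_RETENTION_DAYS', '90'))
cutoff = datetime.now(timezone.utc) - timedelta(days=retention_days)

result = newsletter_sends_collection.update_many(
    {'sent_at': {'$lt': cutoff}, 'subscriber_email': {'$ne': None}},
    {'$set': {'subscriber_email': None}},  # keep aggregate stats, drop the personal identifier
)
print(f"Anonymized {result.modified_count} send records older than {retention_days} days")
```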

backend/init_tracking_db.py (new file, +107 lines)
@@ -0,0 +1,107 @@
#!/usr/bin/env python3
"""
Database initialization script for email tracking system.

This script creates the necessary MongoDB collections and indexes
for tracking email opens and link clicks in the newsletter system.

Collections created:
- newsletter_sends: Tracks each newsletter sent to each subscriber
- link_clicks: Tracks individual link clicks
- subscriber_activity: Aggregated activity status for each subscriber

Usage:
    python init_tracking_db.py
"""

from pymongo import MongoClient, ASCENDING
from config import Config
from datetime import datetime


def init_tracking_database():
    """Initialize tracking collections with proper indexes"""

    print("Connecting to MongoDB...")
    client = MongoClient(Config.MONGODB_URI)
    db = client[Config.DB_NAME]

    print(f"Connected to database: {Config.DB_NAME}")

    # Get collection references
    newsletter_sends = db['newsletter_sends']
    link_clicks = db['link_clicks']
    subscriber_activity = db['subscriber_activity']

    print("\n=== Setting up Newsletter Sends Collection ===")
    # Newsletter Sends Collection Indexes
    newsletter_sends.create_index('tracking_id', unique=True)
    print("✓ Created unique index on 'tracking_id'")

    newsletter_sends.create_index('newsletter_id')
    print("✓ Created index on 'newsletter_id'")

    newsletter_sends.create_index('subscriber_email')
    print("✓ Created index on 'subscriber_email'")

    newsletter_sends.create_index('sent_at')
    print("✓ Created index on 'sent_at'")

    print("\n=== Setting up Link Clicks Collection ===")
    # Link Clicks Collection Indexes
    link_clicks.create_index('tracking_id', unique=True)
    print("✓ Created unique index on 'tracking_id'")

    link_clicks.create_index('newsletter_id')
    print("✓ Created index on 'newsletter_id'")

    link_clicks.create_index('article_url')
    print("✓ Created index on 'article_url'")

    link_clicks.create_index('subscriber_email')
    print("✓ Created index on 'subscriber_email'")

    print("\n=== Setting up Subscriber Activity Collection ===")
    # Subscriber Activity Collection Indexes
    subscriber_activity.create_index('email', unique=True)
    print("✓ Created unique index on 'email'")

    subscriber_activity.create_index('status')
    print("✓ Created index on 'status'")

    subscriber_activity.create_index('last_opened_at')
    print("✓ Created index on 'last_opened_at'")

    # Display collection statistics
    print("\n=== Collection Statistics ===")
    print(f"newsletter_sends: {newsletter_sends.count_documents({})} documents")
    print(f"link_clicks: {link_clicks.count_documents({})} documents")
    print(f"subscriber_activity: {subscriber_activity.count_documents({})} documents")

    # List all indexes for verification
    print("\n=== Index Verification ===")
    print("\nNewsletter Sends Indexes:")
    for index in newsletter_sends.list_indexes():
        print(f"  - {index['name']}: {index.get('key', {})}")

    print("\nLink Clicks Indexes:")
    for index in link_clicks.list_indexes():
|
||||||
|
print(f" - {index['name']}: {index.get('key', {})}")
|
||||||
|
|
||||||
|
print("\nSubscriber Activity Indexes:")
|
||||||
|
for index in subscriber_activity.list_indexes():
|
||||||
|
print(f" - {index['name']}: {index.get('key', {})}")
|
||||||
|
|
||||||
|
print("\n✅ Tracking database initialization complete!")
|
||||||
|
|
||||||
|
client.close()
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == '__main__':
|
||||||
|
try:
|
||||||
|
init_tracking_database()
|
||||||
|
except Exception as e:
|
||||||
|
print(f"\n❌ Error initializing tracking database: {e}")
|
||||||
|
import traceback
|
||||||
|
traceback.print_exc()
|
||||||
|
exit(1)
|
||||||
127
backend/routes/analytics_routes.py
Normal file
@@ -0,0 +1,127 @@
|
"""
|
||||||
|
Analytics routes for email tracking metrics and subscriber engagement.
|
||||||
|
"""
|
||||||
|
|
||||||
|
from flask import Blueprint, jsonify, request
|
||||||
|
from services.analytics_service import (
|
||||||
|
get_newsletter_metrics,
|
||||||
|
get_article_performance,
|
||||||
|
get_subscriber_activity_status,
|
||||||
|
update_subscriber_activity_statuses
|
||||||
|
)
|
||||||
|
from database import subscriber_activity_collection
|
||||||
|
|
||||||
|
analytics_bp = Blueprint('analytics', __name__)
|
||||||
|
|
||||||
|
|
||||||
|
@analytics_bp.route('/api/analytics/newsletter/<newsletter_id>', methods=['GET'])
|
||||||
|
def get_newsletter_analytics(newsletter_id):
|
||||||
|
"""
|
||||||
|
Get comprehensive metrics for a specific newsletter.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
newsletter_id: Unique identifier for the newsletter batch
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
JSON response with newsletter metrics including:
|
||||||
|
- total_sent, total_opened, open_rate
|
||||||
|
- total_clicks, unique_clickers, click_through_rate
|
||||||
|
"""
|
||||||
|
try:
|
||||||
|
metrics = get_newsletter_metrics(newsletter_id)
|
||||||
|
return jsonify(metrics), 200
|
||||||
|
except Exception as e:
|
||||||
|
return jsonify({'error': str(e)}), 500
|
||||||
|
|
||||||
|
|
||||||
|
@analytics_bp.route('/api/analytics/article/<path:article_url>', methods=['GET'])
|
||||||
|
def get_article_analytics(article_url):
|
||||||
|
"""
|
||||||
|
Get performance metrics for a specific article.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
article_url: The original article URL (passed as path parameter)
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
JSON response with article performance metrics including:
|
||||||
|
- total_sent, total_clicks, click_rate
|
||||||
|
- unique_clickers, newsletters
|
||||||
|
"""
|
||||||
|
try:
|
||||||
|
performance = get_article_performance(article_url)
|
||||||
|
return jsonify(performance), 200
|
||||||
|
except Exception as e:
|
||||||
|
return jsonify({'error': str(e)}), 500
|
||||||
|
|
||||||
|
|
||||||
|
@analytics_bp.route('/api/analytics/subscriber/<email>', methods=['GET'])
|
||||||
|
def get_subscriber_analytics(email):
|
||||||
|
"""
|
||||||
|
Get activity status and engagement metrics for a specific subscriber.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
email: Subscriber email address
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
JSON response with subscriber activity data including:
|
||||||
|
- status, last_opened_at, last_clicked_at
|
||||||
|
- total_opens, total_clicks
|
||||||
|
- newsletters_received, newsletters_opened
|
||||||
|
"""
|
||||||
|
try:
|
||||||
|
# Get current activity status
|
||||||
|
status = get_subscriber_activity_status(email)
|
||||||
|
|
||||||
|
# Get detailed activity record from database
|
||||||
|
activity_record = subscriber_activity_collection.find_one(
|
||||||
|
{'email': email},
|
||||||
|
{'_id': 0} # Exclude MongoDB _id field
|
||||||
|
)
|
||||||
|
|
||||||
|
if activity_record:
|
||||||
|
# Convert datetime objects to ISO format strings
|
||||||
|
if activity_record.get('last_opened_at'):
|
||||||
|
activity_record['last_opened_at'] = activity_record['last_opened_at'].isoformat()
|
||||||
|
if activity_record.get('last_clicked_at'):
|
||||||
|
activity_record['last_clicked_at'] = activity_record['last_clicked_at'].isoformat()
|
||||||
|
if activity_record.get('updated_at'):
|
||||||
|
activity_record['updated_at'] = activity_record['updated_at'].isoformat()
|
||||||
|
|
||||||
|
return jsonify(activity_record), 200
|
||||||
|
else:
|
||||||
|
# Return basic status if no detailed record exists yet
|
||||||
|
return jsonify({
|
||||||
|
'email': email,
|
||||||
|
'status': status,
|
||||||
|
'last_opened_at': None,
|
||||||
|
'last_clicked_at': None,
|
||||||
|
'total_opens': 0,
|
||||||
|
'total_clicks': 0,
|
||||||
|
'newsletters_received': 0,
|
||||||
|
'newsletters_opened': 0
|
||||||
|
}), 200
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
return jsonify({'error': str(e)}), 500
|
||||||
|
|
||||||
|
|
||||||
|
@analytics_bp.route('/api/analytics/update-activity', methods=['POST'])
|
||||||
|
def update_activity_statuses():
|
||||||
|
"""
|
||||||
|
Trigger batch update of subscriber activity statuses.
|
||||||
|
|
||||||
|
Updates the subscriber_activity collection with current engagement
|
||||||
|
metrics for all subscribers.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
JSON response with count of updated records
|
||||||
|
"""
|
||||||
|
try:
|
||||||
|
updated_count = update_subscriber_activity_statuses()
|
||||||
|
return jsonify({
|
||||||
|
'success': True,
|
||||||
|
'updated_count': updated_count,
|
||||||
|
'message': f'Updated activity status for {updated_count} subscribers'
|
||||||
|
}), 200
|
||||||
|
except Exception as e:
|
||||||
|
return jsonify({'error': str(e)}), 500
|
||||||
285
backend/routes/tracking_routes.py
Normal file
@@ -0,0 +1,285 @@
|
"""
|
||||||
|
Tracking routes for email open and link click tracking.
|
||||||
|
"""
|
||||||
|
|
||||||
|
from flask import Blueprint, request, redirect, make_response, jsonify
|
||||||
|
from datetime import datetime
|
||||||
|
import base64
|
||||||
|
from database import newsletter_sends_collection, link_clicks_collection
|
||||||
|
from services.tracking_service import delete_subscriber_tracking_data, anonymize_old_tracking_data
|
||||||
|
from config import Config
|
||||||
|
|
||||||
|
tracking_bp = Blueprint('tracking', __name__)
|
||||||
|
|
||||||
|
# 1x1 transparent PNG image (43 bytes)
|
||||||
|
TRANSPARENT_PNG = base64.b64decode(
|
||||||
|
'iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mNkYAAAAAYAAjCB0C8AAAAASUVORK5CYII='
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
@tracking_bp.route('/api/track/pixel/<tracking_id>', methods=['GET'])
|
||||||
|
def track_pixel(tracking_id):
|
||||||
|
"""
|
||||||
|
Track email opens via tracking pixel.
|
||||||
|
|
||||||
|
Serves a 1x1 transparent PNG image and logs the email open event.
|
||||||
|
Handles multiple opens by updating last_opened_at and open_count.
|
||||||
|
Fails silently if tracking_id is invalid to avoid breaking email rendering.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
tracking_id: Unique tracking ID for the newsletter send
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Response: 1x1 transparent PNG image with proper headers
|
||||||
|
"""
|
||||||
|
try:
|
||||||
|
# Look up tracking record
|
||||||
|
tracking_record = newsletter_sends_collection.find_one({'tracking_id': tracking_id})
|
||||||
|
|
||||||
|
if tracking_record:
|
||||||
|
# Get user agent for logging
|
||||||
|
user_agent = request.headers.get('User-Agent', '')
|
||||||
|
current_time = datetime.utcnow()
|
||||||
|
|
||||||
|
# Update tracking record
|
||||||
|
update_data = {
|
||||||
|
'opened': True,
|
||||||
|
'last_opened_at': current_time,
|
||||||
|
'user_agent': user_agent
|
||||||
|
}
|
||||||
|
|
||||||
|
# Set first_opened_at only if this is the first open
|
||||||
|
if not tracking_record.get('opened'):
|
||||||
|
update_data['first_opened_at'] = current_time
|
||||||
|
|
||||||
|
# Increment open count
|
||||||
|
newsletter_sends_collection.update_one(
|
||||||
|
{'tracking_id': tracking_id},
|
||||||
|
{
|
||||||
|
'$set': update_data,
|
||||||
|
'$inc': {'open_count': 1}
|
||||||
|
}
|
||||||
|
)
|
||||||
|
except Exception as e:
|
||||||
|
# Log error but don't fail - we still want to return the pixel
|
||||||
|
print(f"Error tracking pixel for {tracking_id}: {str(e)}")
|
||||||
|
|
||||||
|
# Always return the transparent PNG, even if tracking fails
|
||||||
|
response = make_response(TRANSPARENT_PNG)
|
||||||
|
response.headers['Content-Type'] = 'image/png'
|
||||||
|
response.headers['Cache-Control'] = 'no-cache, no-store, must-revalidate'
|
||||||
|
response.headers['Pragma'] = 'no-cache'
|
||||||
|
response.headers['Expires'] = '0'
|
||||||
|
|
||||||
|
return response
|
||||||
|
|
||||||
|
|
||||||
|
@tracking_bp.route('/api/track/click/<tracking_id>', methods=['GET'])
|
||||||
|
def track_click(tracking_id):
|
||||||
|
"""
|
||||||
|
Track link clicks and redirect to original article URL.
|
||||||
|
|
||||||
|
Logs the click event and redirects the user to the original article URL.
|
||||||
|
Handles invalid tracking_id by redirecting to homepage.
|
||||||
|
Ensures redirect completes within 200ms.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
tracking_id: Unique tracking ID for the article link
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Response: 302 redirect to original article URL or homepage
|
||||||
|
"""
|
||||||
|
# Default redirect URL (homepage)
|
||||||
|
redirect_url = Config.TRACKING_API_URL or 'http://localhost:5001'
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Look up tracking record
|
||||||
|
tracking_record = link_clicks_collection.find_one({'tracking_id': tracking_id})
|
||||||
|
|
||||||
|
if tracking_record:
|
||||||
|
# Get the original article URL
|
||||||
|
redirect_url = tracking_record.get('article_url', redirect_url)
|
||||||
|
|
||||||
|
# Get user agent for logging
|
||||||
|
user_agent = request.headers.get('User-Agent', '')
|
||||||
|
current_time = datetime.utcnow()
|
||||||
|
|
||||||
|
# Update tracking record with click event
|
||||||
|
link_clicks_collection.update_one(
|
||||||
|
{'tracking_id': tracking_id},
|
||||||
|
{
|
||||||
|
'$set': {
|
||||||
|
'clicked': True,
|
||||||
|
'clicked_at': current_time,
|
||||||
|
'user_agent': user_agent
|
||||||
|
}
|
||||||
|
}
|
||||||
|
)
|
||||||
|
except Exception as e:
|
||||||
|
# Log error but still redirect
|
||||||
|
print(f"Error tracking click for {tracking_id}: {str(e)}")
|
||||||
|
|
||||||
|
# Redirect to the article URL (or homepage if tracking failed)
|
||||||
|
return redirect(redirect_url, code=302)
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
@tracking_bp.route('/api/tracking/subscriber/<email>', methods=['DELETE'])
|
||||||
|
def delete_subscriber_data(email):
|
||||||
|
"""
|
||||||
|
Delete all tracking data for a specific subscriber.
|
||||||
|
|
||||||
|
Removes all tracking records associated with the subscriber's email address
|
||||||
|
from all tracking collections (newsletter_sends, link_clicks, subscriber_activity).
|
||||||
|
Supports GDPR right to be forgotten.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
email: Email address of the subscriber
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
JSON response with deletion counts and confirmation message
|
||||||
|
"""
|
||||||
|
try:
|
||||||
|
# Delete all tracking data for the subscriber
|
||||||
|
result = delete_subscriber_tracking_data(email)
|
||||||
|
|
||||||
|
return jsonify({
|
||||||
|
'success': True,
|
||||||
|
'message': f'All tracking data deleted for {email}',
|
||||||
|
'deleted_counts': result
|
||||||
|
}), 200
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
return jsonify({
|
||||||
|
'success': False,
|
||||||
|
'error': str(e)
|
||||||
|
}), 500
|
||||||
|
|
||||||
|
|
||||||
|
@tracking_bp.route('/api/tracking/anonymize', methods=['POST'])
|
||||||
|
def anonymize_tracking_data():
|
||||||
|
"""
|
||||||
|
Anonymize tracking data older than the retention period.
|
||||||
|
|
||||||
|
Removes email addresses from old tracking records while preserving
|
||||||
|
aggregated metrics. Default retention period is 90 days.
|
||||||
|
|
||||||
|
Request body (optional):
|
||||||
|
{
|
||||||
|
"retention_days": 90 // Number of days to retain personal data
|
||||||
|
}
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
JSON response with anonymization counts
|
||||||
|
"""
|
||||||
|
try:
|
||||||
|
# Get retention days from request body (default: 90)
|
||||||
|
data = request.get_json() or {}
|
||||||
|
retention_days = data.get('retention_days', 90)
|
||||||
|
|
||||||
|
# Validate retention_days
|
||||||
|
if not isinstance(retention_days, int) or retention_days < 1:
|
||||||
|
return jsonify({
|
||||||
|
'success': False,
|
||||||
|
'error': 'retention_days must be a positive integer'
|
||||||
|
}), 400
|
||||||
|
|
||||||
|
# Anonymize old tracking data
|
||||||
|
result = anonymize_old_tracking_data(retention_days)
|
||||||
|
|
||||||
|
return jsonify({
|
||||||
|
'success': True,
|
||||||
|
'message': f'Anonymized tracking data older than {retention_days} days',
|
||||||
|
'anonymized_counts': result
|
||||||
|
}), 200
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
return jsonify({
|
||||||
|
'success': False,
|
||||||
|
'error': str(e)
|
||||||
|
}), 500
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
@tracking_bp.route('/api/tracking/subscriber/<email>/opt-out', methods=['POST'])
|
||||||
|
def opt_out_tracking(email):
|
||||||
|
"""
|
||||||
|
Opt a subscriber out of tracking.
|
||||||
|
|
||||||
|
Sets the tracking_enabled field to False for the subscriber,
|
||||||
|
preventing future tracking of their email opens and link clicks.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
email: Email address of the subscriber
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
JSON response with confirmation message
|
||||||
|
"""
|
||||||
|
try:
|
||||||
|
from database import subscribers_collection
|
||||||
|
|
||||||
|
# Update subscriber to opt out of tracking
|
||||||
|
result = subscribers_collection.update_one(
|
||||||
|
{'email': email},
|
||||||
|
{'$set': {'tracking_enabled': False}},
|
||||||
|
upsert=False
|
||||||
|
)
|
||||||
|
|
||||||
|
if result.matched_count == 0:
|
||||||
|
return jsonify({
|
||||||
|
'success': False,
|
||||||
|
'error': f'Subscriber {email} not found'
|
||||||
|
}), 404
|
||||||
|
|
||||||
|
return jsonify({
|
||||||
|
'success': True,
|
||||||
|
'message': f'Subscriber {email} has opted out of tracking'
|
||||||
|
}), 200
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
return jsonify({
|
||||||
|
'success': False,
|
||||||
|
'error': str(e)
|
||||||
|
}), 500
|
||||||
|
|
||||||
|
|
||||||
|
@tracking_bp.route('/api/tracking/subscriber/<email>/opt-in', methods=['POST'])
|
||||||
|
def opt_in_tracking(email):
|
||||||
|
"""
|
||||||
|
Opt a subscriber back into tracking.
|
||||||
|
|
||||||
|
Sets the tracking_enabled field to True for the subscriber,
|
||||||
|
enabling tracking of their email opens and link clicks.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
email: Email address of the subscriber
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
JSON response with confirmation message
|
||||||
|
"""
|
||||||
|
try:
|
||||||
|
from database import subscribers_collection
|
||||||
|
|
||||||
|
# Update subscriber to opt in to tracking
|
||||||
|
result = subscribers_collection.update_one(
|
||||||
|
{'email': email},
|
||||||
|
{'$set': {'tracking_enabled': True}},
|
||||||
|
upsert=False
|
||||||
|
)
|
||||||
|
|
||||||
|
if result.matched_count == 0:
|
||||||
|
return jsonify({
|
||||||
|
'success': False,
|
||||||
|
'error': f'Subscriber {email} not found'
|
||||||
|
}), 404
|
||||||
|
|
||||||
|
return jsonify({
|
||||||
|
'success': True,
|
||||||
|
'message': f'Subscriber {email} has opted in to tracking'
|
||||||
|
}), 200
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
return jsonify({
|
||||||
|
'success': False,
|
||||||
|
'error': str(e)
|
||||||
|
}), 500
|
||||||
306
backend/services/analytics_service.py
Normal file
@@ -0,0 +1,306 @@
|
"""
|
||||||
|
Analytics service for email tracking metrics and subscriber engagement.
|
||||||
|
Calculates open rates, click rates, and subscriber activity status.
|
||||||
|
"""
|
||||||
|
|
||||||
|
from datetime import datetime, timedelta
|
||||||
|
from typing import Dict, Optional
|
||||||
|
from database import (
|
||||||
|
newsletter_sends_collection,
|
||||||
|
link_clicks_collection,
|
||||||
|
subscriber_activity_collection
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def get_open_rate(newsletter_id: str) -> float:
|
||||||
|
"""
|
||||||
|
Calculate the percentage of subscribers who opened a specific newsletter.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
newsletter_id: Unique identifier for the newsletter batch
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
float: Open rate as a percentage (0-100)
|
||||||
|
"""
|
||||||
|
# Count total sends for this newsletter
|
||||||
|
total_sends = newsletter_sends_collection.count_documents({
|
||||||
|
'newsletter_id': newsletter_id
|
||||||
|
})
|
||||||
|
|
||||||
|
if total_sends == 0:
|
||||||
|
return 0.0
|
||||||
|
|
||||||
|
# Count how many were opened
|
||||||
|
opened_count = newsletter_sends_collection.count_documents({
|
||||||
|
'newsletter_id': newsletter_id,
|
||||||
|
'opened': True
|
||||||
|
})
|
||||||
|
|
||||||
|
# Calculate percentage
|
||||||
|
open_rate = (opened_count / total_sends) * 100
|
||||||
|
return round(open_rate, 2)
|
||||||
|
|
||||||
|
|
||||||
|
def get_click_rate(article_url: str) -> float:
|
||||||
|
"""
|
||||||
|
Calculate the percentage of subscribers who clicked a specific article link.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
article_url: The original article URL
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
float: Click rate as a percentage (0-100)
|
||||||
|
"""
|
||||||
|
# Count total link tracking records for this article
|
||||||
|
total_links = link_clicks_collection.count_documents({
|
||||||
|
'article_url': article_url
|
||||||
|
})
|
||||||
|
|
||||||
|
if total_links == 0:
|
||||||
|
return 0.0
|
||||||
|
|
||||||
|
# Count how many were clicked
|
||||||
|
clicked_count = link_clicks_collection.count_documents({
|
||||||
|
'article_url': article_url,
|
||||||
|
'clicked': True
|
||||||
|
})
|
||||||
|
|
||||||
|
# Calculate percentage
|
||||||
|
click_rate = (clicked_count / total_links) * 100
|
||||||
|
return round(click_rate, 2)
|
||||||
|
|
||||||
|
|
||||||
|
def get_newsletter_metrics(newsletter_id: str) -> Dict:
|
||||||
|
"""
|
||||||
|
Get comprehensive metrics for a specific newsletter.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
newsletter_id: Unique identifier for the newsletter batch
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
dict: Dictionary containing:
|
||||||
|
- newsletter_id: The newsletter ID
|
||||||
|
- total_sent: Total number of emails sent
|
||||||
|
- total_opened: Number of emails opened
|
||||||
|
- open_rate: Percentage of emails opened
|
||||||
|
- total_clicks: Total number of link clicks
|
||||||
|
- unique_clickers: Number of unique subscribers who clicked
|
||||||
|
- click_through_rate: Percentage of recipients who clicked any link
|
||||||
|
"""
|
||||||
|
# Get total sends
|
||||||
|
total_sent = newsletter_sends_collection.count_documents({
|
||||||
|
'newsletter_id': newsletter_id
|
||||||
|
})
|
||||||
|
|
||||||
|
# Get total opened
|
||||||
|
total_opened = newsletter_sends_collection.count_documents({
|
||||||
|
'newsletter_id': newsletter_id,
|
||||||
|
'opened': True
|
||||||
|
})
|
||||||
|
|
||||||
|
# Calculate open rate
|
||||||
|
open_rate = (total_opened / total_sent * 100) if total_sent > 0 else 0.0
|
||||||
|
|
||||||
|
# Get total clicks for this newsletter
|
||||||
|
total_clicks = link_clicks_collection.count_documents({
|
||||||
|
'newsletter_id': newsletter_id,
|
||||||
|
'clicked': True
|
||||||
|
})
|
||||||
|
|
||||||
|
# Get unique clickers (distinct subscriber emails who clicked)
|
||||||
|
unique_clickers = len(link_clicks_collection.distinct(
|
||||||
|
'subscriber_email',
|
||||||
|
{'newsletter_id': newsletter_id, 'clicked': True}
|
||||||
|
))
|
||||||
|
|
||||||
|
# Calculate click-through rate (unique clickers / total sent)
|
||||||
|
click_through_rate = (unique_clickers / total_sent * 100) if total_sent > 0 else 0.0
|
||||||
|
|
||||||
|
return {
|
||||||
|
'newsletter_id': newsletter_id,
|
||||||
|
'total_sent': total_sent,
|
||||||
|
'total_opened': total_opened,
|
||||||
|
'open_rate': round(open_rate, 2),
|
||||||
|
'total_clicks': total_clicks,
|
||||||
|
'unique_clickers': unique_clickers,
|
||||||
|
'click_through_rate': round(click_through_rate, 2)
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def get_article_performance(article_url: str) -> Dict:
|
||||||
|
"""
|
||||||
|
Get performance metrics for a specific article across all newsletters.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
article_url: The original article URL
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
dict: Dictionary containing:
|
||||||
|
- article_url: The article URL
|
||||||
|
- total_sent: Total times this article was sent
|
||||||
|
- total_clicks: Total number of clicks
|
||||||
|
- click_rate: Percentage of recipients who clicked
|
||||||
|
- unique_clickers: Number of unique subscribers who clicked
|
||||||
|
- newsletters: List of newsletter IDs that included this article
|
||||||
|
"""
|
||||||
|
# Get all link tracking records for this article
|
||||||
|
total_sent = link_clicks_collection.count_documents({
|
||||||
|
'article_url': article_url
|
||||||
|
})
|
||||||
|
|
||||||
|
# Get total clicks
|
||||||
|
total_clicks = link_clicks_collection.count_documents({
|
||||||
|
'article_url': article_url,
|
||||||
|
'clicked': True
|
||||||
|
})
|
||||||
|
|
||||||
|
# Calculate click rate
|
||||||
|
click_rate = (total_clicks / total_sent * 100) if total_sent > 0 else 0.0
|
||||||
|
|
||||||
|
# Get unique clickers
|
||||||
|
unique_clickers = len(link_clicks_collection.distinct(
|
||||||
|
'subscriber_email',
|
||||||
|
{'article_url': article_url, 'clicked': True}
|
||||||
|
))
|
||||||
|
|
||||||
|
# Get list of newsletters that included this article
|
||||||
|
newsletters = link_clicks_collection.distinct(
|
||||||
|
'newsletter_id',
|
||||||
|
{'article_url': article_url}
|
||||||
|
)
|
||||||
|
|
||||||
|
return {
|
||||||
|
'article_url': article_url,
|
||||||
|
'total_sent': total_sent,
|
||||||
|
'total_clicks': total_clicks,
|
||||||
|
'click_rate': round(click_rate, 2),
|
||||||
|
'unique_clickers': unique_clickers,
|
||||||
|
'newsletters': newsletters
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def get_subscriber_activity_status(email: str) -> str:
|
||||||
|
"""
|
||||||
|
Get the activity status for a specific subscriber.
|
||||||
|
|
||||||
|
Classifies subscribers based on their last email open:
|
||||||
|
- 'active': Opened an email in the last 30 days
|
||||||
|
- 'inactive': No opens in 30-60 days
|
||||||
|
- 'dormant': No opens in 60+ days
|
||||||
|
- 'new': No opens yet
|
||||||
|
|
||||||
|
Args:
|
||||||
|
email: Subscriber email address
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
str: Activity status ('active', 'inactive', 'dormant', or 'new')
|
||||||
|
"""
|
||||||
|
# Find the most recent open for this subscriber
|
||||||
|
most_recent_open = newsletter_sends_collection.find_one(
|
||||||
|
{'subscriber_email': email, 'opened': True},
|
||||||
|
sort=[('last_opened_at', -1)]
|
||||||
|
)
|
||||||
|
|
||||||
|
if not most_recent_open:
    # Subscriber has never opened a newsletter yet
    return 'new'
|
||||||
|
|
||||||
|
# Calculate days since last open
|
||||||
|
last_opened_at = most_recent_open.get('last_opened_at')
|
||||||
|
if not last_opened_at:
|
||||||
|
return 'new'
|
||||||
|
|
||||||
|
days_since_open = (datetime.utcnow() - last_opened_at).days
|
||||||
|
|
||||||
|
# Classify based on days since last open
|
||||||
|
if days_since_open <= 30:
|
||||||
|
return 'active'
|
||||||
|
elif days_since_open <= 60:
|
||||||
|
return 'inactive'
|
||||||
|
else:
|
||||||
|
return 'dormant'
|
||||||
|
|
||||||
|
|
||||||
|
def update_subscriber_activity_statuses() -> int:
|
||||||
|
"""
|
||||||
|
Batch update activity statuses for all subscribers.
|
||||||
|
|
||||||
|
Updates the subscriber_activity collection with current activity status,
|
||||||
|
engagement metrics, and last interaction timestamps for all subscribers
|
||||||
|
who have received newsletters.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
int: Number of subscriber records updated
|
||||||
|
"""
|
||||||
|
# Get all unique subscriber emails from newsletter sends
|
||||||
|
all_subscribers = newsletter_sends_collection.distinct('subscriber_email')
|
||||||
|
|
||||||
|
updated_count = 0
|
||||||
|
|
||||||
|
for email in all_subscribers:
|
||||||
|
# Get activity status
|
||||||
|
status = get_subscriber_activity_status(email)
|
||||||
|
|
||||||
|
# Get last opened timestamp
|
||||||
|
last_open_record = newsletter_sends_collection.find_one(
|
||||||
|
{'subscriber_email': email, 'opened': True},
|
||||||
|
sort=[('last_opened_at', -1)]
|
||||||
|
)
|
||||||
|
last_opened_at = last_open_record.get('last_opened_at') if last_open_record else None
|
||||||
|
|
||||||
|
# Get last clicked timestamp
|
||||||
|
last_click_record = link_clicks_collection.find_one(
|
||||||
|
{'subscriber_email': email, 'clicked': True},
|
||||||
|
sort=[('clicked_at', -1)]
|
||||||
|
)
|
||||||
|
last_clicked_at = last_click_record.get('clicked_at') if last_click_record else None
|
||||||
|
|
||||||
|
# Count total opens
|
||||||
|
total_opens = newsletter_sends_collection.count_documents({
|
||||||
|
'subscriber_email': email,
|
||||||
|
'opened': True
|
||||||
|
})
|
||||||
|
|
||||||
|
# Count total clicks
|
||||||
|
total_clicks = link_clicks_collection.count_documents({
|
||||||
|
'subscriber_email': email,
|
||||||
|
'clicked': True
|
||||||
|
})
|
||||||
|
|
||||||
|
# Count newsletters received
|
||||||
|
newsletters_received = newsletter_sends_collection.count_documents({
|
||||||
|
'subscriber_email': email
|
||||||
|
})
|
||||||
|
|
||||||
|
# Count newsletters opened (distinct newsletter_ids)
|
||||||
|
newsletters_opened = len(newsletter_sends_collection.distinct(
|
||||||
|
'newsletter_id',
|
||||||
|
{'subscriber_email': email, 'opened': True}
|
||||||
|
))
|
||||||
|
|
||||||
|
# Update or insert subscriber activity record
|
||||||
|
subscriber_activity_collection.update_one(
|
||||||
|
{'email': email},
|
||||||
|
{
|
||||||
|
'$set': {
|
||||||
|
'email': email,
|
||||||
|
'status': status,
|
||||||
|
'last_opened_at': last_opened_at,
|
||||||
|
'last_clicked_at': last_clicked_at,
|
||||||
|
'total_opens': total_opens,
|
||||||
|
'total_clicks': total_clicks,
|
||||||
|
'newsletters_received': newsletters_received,
|
||||||
|
'newsletters_opened': newsletters_opened,
|
||||||
|
'updated_at': datetime.utcnow()
|
||||||
|
}
|
||||||
|
},
|
||||||
|
upsert=True
|
||||||
|
)
|
||||||
|
|
||||||
|
updated_count += 1
|
||||||
|
|
||||||
|
return updated_count
|
||||||
215
backend/services/tracking_service.py
Normal file
@@ -0,0 +1,215 @@
|
"""
|
||||||
|
Email tracking service for Munich News Daily newsletter system.
|
||||||
|
Handles tracking ID generation and tracking record creation.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import uuid
|
||||||
|
from datetime import datetime, timedelta
|
||||||
|
from typing import Dict, List, Optional
|
||||||
|
from database import newsletter_sends_collection, link_clicks_collection, subscriber_activity_collection, subscribers_collection
|
||||||
|
|
||||||
|
|
||||||
|
def generate_tracking_id() -> str:
|
||||||
|
"""
|
||||||
|
Generate a unique tracking ID using UUID4.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
str: A unique UUID4 string for tracking purposes
|
||||||
|
"""
|
||||||
|
return str(uuid.uuid4())
|
||||||
|
|
||||||
|
|
||||||
|
def create_newsletter_tracking(
|
||||||
|
newsletter_id: str,
|
||||||
|
subscriber_email: str,
|
||||||
|
article_links: Optional[List[Dict[str, str]]] = None
|
||||||
|
) -> Dict[str, any]:
|
||||||
|
"""
|
||||||
|
Create tracking records for a newsletter send.
|
||||||
|
|
||||||
|
Creates a tracking record in newsletter_sends collection for email open tracking,
|
||||||
|
and creates tracking records in link_clicks collection for each article link.
|
||||||
|
Respects subscriber opt-out preferences.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
newsletter_id: Unique identifier for the newsletter batch (e.g., date-based)
|
||||||
|
subscriber_email: Email address of the recipient
|
||||||
|
article_links: Optional list of article dictionaries with 'url' and 'title' keys
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
dict: Tracking information containing:
|
||||||
|
- pixel_tracking_id: ID for the tracking pixel (None if opted out)
|
||||||
|
- link_tracking_map: Dict mapping original URLs to tracking IDs (empty if opted out)
|
||||||
|
- newsletter_id: The newsletter batch ID
|
||||||
|
- subscriber_email: The recipient email
|
||||||
|
- tracking_enabled: Boolean indicating if tracking is enabled for this subscriber
|
||||||
|
"""
|
||||||
|
# Check if subscriber has opted out of tracking
|
||||||
|
subscriber = subscribers_collection.find_one({'email': subscriber_email})
|
||||||
|
tracking_enabled = subscriber.get('tracking_enabled', True) if subscriber else True
|
||||||
|
|
||||||
|
# If tracking is disabled, return empty tracking data
|
||||||
|
if not tracking_enabled:
|
||||||
|
return {
|
||||||
|
'pixel_tracking_id': None,
|
||||||
|
'link_tracking_map': {},
|
||||||
|
'newsletter_id': newsletter_id,
|
||||||
|
'subscriber_email': subscriber_email,
|
||||||
|
'tracking_enabled': False
|
||||||
|
}
|
||||||
|
|
||||||
|
# Generate tracking ID for the email open pixel
|
||||||
|
pixel_tracking_id = generate_tracking_id()
|
||||||
|
|
||||||
|
# Create newsletter send tracking record
|
||||||
|
newsletter_send_doc = {
|
||||||
|
'newsletter_id': newsletter_id,
|
||||||
|
'subscriber_email': subscriber_email,
|
||||||
|
'tracking_id': pixel_tracking_id,
|
||||||
|
'sent_at': datetime.utcnow(),
|
||||||
|
'opened': False,
|
||||||
|
'first_opened_at': None,
|
||||||
|
'last_opened_at': None,
|
||||||
|
'open_count': 0,
|
||||||
|
'created_at': datetime.utcnow()
|
||||||
|
}
|
||||||
|
|
||||||
|
newsletter_sends_collection.insert_one(newsletter_send_doc)
|
||||||
|
|
||||||
|
# Create tracking records for article links
|
||||||
|
link_tracking_map = {}
|
||||||
|
|
||||||
|
if article_links:
|
||||||
|
for article in article_links:
|
||||||
|
article_url = article.get('url')
|
||||||
|
article_title = article.get('title', '')
|
||||||
|
|
||||||
|
if article_url:
|
||||||
|
link_tracking_id = generate_tracking_id()
|
||||||
|
|
||||||
|
# Create link click tracking record
|
||||||
|
link_click_doc = {
|
||||||
|
'tracking_id': link_tracking_id,
|
||||||
|
'newsletter_id': newsletter_id,
|
||||||
|
'subscriber_email': subscriber_email,
|
||||||
|
'article_url': article_url,
|
||||||
|
'article_title': article_title,
|
||||||
|
'clicked': False,
|
||||||
|
'clicked_at': None,
|
||||||
|
'user_agent': None,
|
||||||
|
'created_at': datetime.utcnow()
|
||||||
|
}
|
||||||
|
|
||||||
|
link_clicks_collection.insert_one(link_click_doc)
|
||||||
|
|
||||||
|
# Map original URL to tracking ID
|
||||||
|
link_tracking_map[article_url] = link_tracking_id
|
||||||
|
|
||||||
|
return {
|
||||||
|
'pixel_tracking_id': pixel_tracking_id,
|
||||||
|
'link_tracking_map': link_tracking_map,
|
||||||
|
'newsletter_id': newsletter_id,
|
||||||
|
'subscriber_email': subscriber_email,
|
||||||
|
'tracking_enabled': True
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
def anonymize_old_tracking_data(retention_days: int = 90) -> Dict[str, int]:
|
||||||
|
"""
|
||||||
|
Anonymize tracking data older than the specified retention period.
|
||||||
|
|
||||||
|
Removes email addresses from tracking records while preserving aggregated metrics.
|
||||||
|
This helps comply with privacy regulations by not retaining personal data indefinitely.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
retention_days: Number of days to retain personal data (default: 90)
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
dict: Count of anonymized records for each collection:
|
||||||
|
- newsletter_sends_anonymized: Number of newsletter send records anonymized
|
||||||
|
- link_clicks_anonymized: Number of link click records anonymized
|
||||||
|
- total_anonymized: Total number of records anonymized
|
||||||
|
"""
|
||||||
|
cutoff_date = datetime.utcnow() - timedelta(days=retention_days)
|
||||||
|
|
||||||
|
# Anonymize newsletter_sends records
|
||||||
|
newsletter_result = newsletter_sends_collection.update_many(
|
||||||
|
{
|
||||||
|
'sent_at': {'$lt': cutoff_date},
|
||||||
|
'subscriber_email': {'$ne': 'anonymized'} # Don't re-anonymize
|
||||||
|
},
|
||||||
|
{
|
||||||
|
'$set': {
|
||||||
|
'subscriber_email': 'anonymized',
|
||||||
|
'anonymized_at': datetime.utcnow()
|
||||||
|
}
|
||||||
|
}
|
||||||
|
)
|
||||||
|
|
||||||
|
# Anonymize link_clicks records
|
||||||
|
link_clicks_result = link_clicks_collection.update_many(
|
||||||
|
{
|
||||||
|
'created_at': {'$lt': cutoff_date},
|
||||||
|
'subscriber_email': {'$ne': 'anonymized'} # Don't re-anonymize
|
||||||
|
},
|
||||||
|
{
|
||||||
|
'$set': {
|
||||||
|
'subscriber_email': 'anonymized',
|
||||||
|
'anonymized_at': datetime.utcnow()
|
||||||
|
}
|
||||||
|
}
|
||||||
|
)
|
||||||
|
|
||||||
|
newsletter_count = newsletter_result.modified_count
|
||||||
|
link_clicks_count = link_clicks_result.modified_count
|
||||||
|
|
||||||
|
return {
|
||||||
|
'newsletter_sends_anonymized': newsletter_count,
|
||||||
|
'link_clicks_anonymized': link_clicks_count,
|
||||||
|
'total_anonymized': newsletter_count + link_clicks_count
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def delete_subscriber_tracking_data(subscriber_email: str) -> Dict[str, int]:
|
||||||
|
"""
|
||||||
|
Delete all tracking data for a specific subscriber.
|
||||||
|
|
||||||
|
Removes all tracking records associated with a subscriber's email address
|
||||||
|
from all tracking collections. This supports GDPR right to be forgotten.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
subscriber_email: Email address of the subscriber
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
dict: Count of deleted records for each collection:
|
||||||
|
- newsletter_sends_deleted: Number of newsletter send records deleted
|
||||||
|
- link_clicks_deleted: Number of link click records deleted
|
||||||
|
- subscriber_activity_deleted: Number of activity records deleted
|
||||||
|
- total_deleted: Total number of records deleted
|
||||||
|
"""
|
||||||
|
# Delete from newsletter_sends
|
||||||
|
newsletter_result = newsletter_sends_collection.delete_many({
|
||||||
|
'subscriber_email': subscriber_email
|
||||||
|
})
|
||||||
|
|
||||||
|
# Delete from link_clicks
|
||||||
|
link_clicks_result = link_clicks_collection.delete_many({
|
||||||
|
'subscriber_email': subscriber_email
|
||||||
|
})
|
||||||
|
|
||||||
|
# Delete from subscriber_activity
|
||||||
|
activity_result = subscriber_activity_collection.delete_many({
|
||||||
|
'email': subscriber_email
|
||||||
|
})
|
||||||
|
|
||||||
|
newsletter_count = newsletter_result.deleted_count
|
||||||
|
link_clicks_count = link_clicks_result.deleted_count
|
||||||
|
activity_count = activity_result.deleted_count
|
||||||
|
|
||||||
|
return {
|
||||||
|
'newsletter_sends_deleted': newsletter_count,
|
||||||
|
'link_clicks_deleted': link_clicks_count,
|
||||||
|
'subscriber_activity_deleted': activity_count,
|
||||||
|
'total_deleted': newsletter_count + link_clicks_count + activity_count
|
||||||
|
}
|
||||||
@@ -1,33 +0,0 @@
version: '3.8'

# Production version with authentication enabled
# Usage: docker-compose -f docker-compose.prod.yml up -d

services:
  mongodb:
    image: mongo:7.0
    container_name: munich-news-mongodb
    restart: unless-stopped
    ports:
      - "27017:27017"
    environment:
      MONGO_INITDB_ROOT_USERNAME: admin
      MONGO_INITDB_ROOT_PASSWORD: ${MONGO_PASSWORD:-changeme}
      MONGO_INITDB_DATABASE: munich_news
    volumes:
      - mongodb_data:/data/db
      - mongodb_config:/data/configdb
    networks:
      - munich-news-network
    command: mongod --bind_ip_all --auth

volumes:
  mongodb_data:
    driver: local
  mongodb_config:
    driver: local

networks:
  munich-news-network:
    driver: bridge
@@ -1,24 +1,106 @@
 version: '3.8'
 
 services:
+  # MongoDB Database
   mongodb:
-    image: mongo:7.0
+    image: mongo:latest
     container_name: munich-news-mongodb
     restart: unless-stopped
     ports:
       - "27017:27017"
-    # For development: MongoDB runs without authentication
-    # For production: Uncomment the environment variables below and update MONGODB_URI
-    # environment:
-    #   MONGO_INITDB_ROOT_USERNAME: admin
-    #   MONGO_INITDB_ROOT_PASSWORD: password
-    #   MONGO_INITDB_DATABASE: munich_news
+    environment:
+      # For production, set MONGO_PASSWORD environment variable
+      MONGO_INITDB_ROOT_USERNAME: ${MONGO_USERNAME:-admin}
+      MONGO_INITDB_ROOT_PASSWORD: ${MONGO_PASSWORD:-changeme}
+      MONGO_INITDB_DATABASE: munich_news
     volumes:
       - mongodb_data:/data/db
       - mongodb_config:/data/configdb
     networks:
       - munich-news-network
-    command: mongod --bind_ip_all
+    command: mongod --bind_ip_all ${MONGO_AUTH:---auth}
+    healthcheck:
+      test: echo 'db.runCommand("ping").ok' | mongosh localhost:27017/test --quiet
+      interval: 30s
+      timeout: 10s
+      retries: 3
+
+  # News Crawler - Runs at 6 AM Berlin time
+  crawler:
+    build:
+      context: .
+      dockerfile: news_crawler/Dockerfile
+    container_name: munich-news-crawler
+    restart: unless-stopped
+    depends_on:
+      - mongodb
+    environment:
+      - MONGODB_URI=mongodb://${MONGO_USERNAME:-admin}:${MONGO_PASSWORD:-changeme}@mongodb:27017/
+      - TZ=Europe/Berlin
+    volumes:
+      - ./backend/.env:/app/.env:ro
+      - ./backend/config.py:/app/config.py:ro
+      - ./backend/ollama_client.py:/app/ollama_client.py:ro
+      - ./news_crawler:/app:ro
+    networks:
+      - munich-news-network
+    healthcheck:
+      test: ["CMD", "python", "-c", "import sys; sys.exit(0)"]
+      interval: 1m
+      timeout: 10s
+      retries: 3
+
+  # Backend API - Tracking and analytics
+  backend:
+    build:
+      context: ./backend
+      dockerfile: Dockerfile
+    container_name: munich-news-backend
+    restart: unless-stopped
+    depends_on:
+      - mongodb
+    ports:
+      - "5001:5001"
+    environment:
+      - MONGODB_URI=mongodb://${MONGO_USERNAME:-admin}:${MONGO_PASSWORD:-changeme}@mongodb:27017/
+      - FLASK_PORT=5001
+      - TZ=Europe/Berlin
+    volumes:
+      - ./backend/.env:/app/.env:ro
+    networks:
+      - munich-news-network
+    healthcheck:
+      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:5001/health')"]
+      interval: 30s
+      timeout: 10s
+      retries: 3
+      start_period: 40s
+
+  # Newsletter Sender - Runs at 7 AM Berlin time
+  sender:
+    build:
+      context: .
+      dockerfile: news_sender/Dockerfile
+    container_name: munich-news-sender
+    restart: unless-stopped
+    depends_on:
+      - mongodb
+      - backend
+      - crawler
+    environment:
+      - MONGODB_URI=mongodb://${MONGO_USERNAME:-admin}:${MONGO_PASSWORD:-changeme}@mongodb:27017/
+      - TZ=Europe/Berlin
+    volumes:
+      - ./backend/.env:/app/.env:ro
+      - ./backend/services:/app/backend/services:ro
+      - ./news_sender:/app:ro
+    networks:
+      - munich-news-network
+    healthcheck:
+      test: ["CMD", "python", "-c", "import sys; sys.exit(0)"]
+      interval: 1m
+      timeout: 10s
+      retries: 3
 
 volumes:
   mongodb_data:
@@ -29,4 +111,3 @@ volumes:
 networks:
   munich-news-network:
     driver: bridge
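
A quick way to confirm the stack came up after `docker-compose up -d` is to probe the same URL the backend healthcheck above uses; a minimal sketch, assuming the `/health` endpoint referenced in that healthcheck returns HTTP 200:

```python
# Manual smoke test mirroring the backend healthcheck in docker-compose.yml.
import urllib.request

with urllib.request.urlopen('http://localhost:5001/health', timeout=5) as resp:
    print(resp.status)  # 200 means the tracking/analytics API is reachable
```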
223
docs/API.md
Normal file
@@ -0,0 +1,223 @@
# API Reference

## Tracking Endpoints

### Track Email Open

```http
GET /api/track/pixel/<tracking_id>
```

Returns a 1x1 transparent PNG and logs the email open event.

**Response**: Image (image/png)

### Track Link Click

```http
GET /api/track/click/<tracking_id>
```

Logs the click event and redirects to the original article URL.

**Response**: 302 Redirect

## Analytics Endpoints

### Get Newsletter Metrics

```http
GET /api/analytics/newsletter/<newsletter_id>
```

Returns comprehensive metrics for a specific newsletter.

**Response**:
```json
{
  "newsletter_id": "2024-01-15",
  "total_sent": 100,
  "total_opened": 75,
  "open_rate": 75.0,
  "unique_openers": 70,
  "total_clicks": 45,
  "unique_clickers": 30,
  "click_through_rate": 30.0
}
```

### Get Article Performance

```http
GET /api/analytics/article/<article_url>
```

Returns performance metrics for a specific article.

**Response**:
```json
{
  "article_url": "https://example.com/article",
  "total_sent": 100,
  "total_clicks": 25,
  "click_rate": 25.0,
  "unique_clickers": 20,
  "newsletters": ["2024-01-15", "2024-01-16"]
}
```

### Get Subscriber Activity

```http
GET /api/analytics/subscriber/<email>
```

Returns activity status and engagement metrics for a subscriber.

**Response**:
```json
{
  "email": "user@example.com",
  "status": "active",
  "last_opened_at": "2024-01-15T10:30:00",
  "last_clicked_at": "2024-01-15T10:35:00",
  "total_opens": 45,
  "total_clicks": 20,
  "newsletters_received": 50,
  "newsletters_opened": 45
}
```

## Privacy Endpoints

### Delete Subscriber Data

```http
DELETE /api/tracking/subscriber/<email>
```

Deletes all tracking data for a subscriber (GDPR compliance).

**Response**:
```json
{
  "success": true,
  "message": "All tracking data deleted for user@example.com",
  "deleted_counts": {
    "newsletter_sends": 50,
    "link_clicks": 25,
    "subscriber_activity": 1
  }
}
```

### Anonymize Old Data

```http
POST /api/tracking/anonymize
```

Anonymizes tracking data older than the retention period.

**Request Body** (optional):
```json
{
  "retention_days": 90
}
```

**Response**:
```json
{
  "success": true,
  "message": "Anonymized tracking data older than 90 days",
  "anonymized_counts": {
    "newsletter_sends": 1250,
    "link_clicks": 650
  }
}
```

### Opt Out of Tracking

```http
POST /api/tracking/subscriber/<email>/opt-out
```

Disables tracking for a subscriber.

**Response**:
```json
{
  "success": true,
  "message": "Subscriber user@example.com has opted out of tracking"
}
```

### Opt In to Tracking

```http
POST /api/tracking/subscriber/<email>/opt-in
```

Re-enables tracking for a subscriber.

**Response**:
```json
{
  "success": true,
  "message": "Subscriber user@example.com has opted in to tracking"
}
```

## Examples

### Using curl

```bash
# Get newsletter metrics
curl http://localhost:5001/api/analytics/newsletter/2024-01-15

# Delete subscriber data
curl -X DELETE http://localhost:5001/api/tracking/subscriber/user@example.com

# Anonymize old data
curl -X POST http://localhost:5001/api/tracking/anonymize \
  -H "Content-Type: application/json" \
  -d '{"retention_days": 90}'

# Opt out of tracking
curl -X POST http://localhost:5001/api/tracking/subscriber/user@example.com/opt-out
```

### Using Python

```python
import requests

# Get newsletter metrics
response = requests.get('http://localhost:5001/api/analytics/newsletter/2024-01-15')
metrics = response.json()
print(f"Open rate: {metrics['open_rate']}%")

# Delete subscriber data
response = requests.delete('http://localhost:5001/api/tracking/subscriber/user@example.com')
result = response.json()
print(result['message'])
```

## Error Responses

All endpoints return standard error responses:

```json
{
  "success": false,
  "error": "Error message here"
}
```

HTTP Status Codes:
- `200` - Success
- `404` - Not found
- `500` - Server error
131
docs/ARCHITECTURE.md
Normal file
@@ -0,0 +1,131 @@
# System Architecture

## Overview

Munich News Daily is a fully automated news aggregation and newsletter system with the following components:

```
┌─────────────────────────────────────────────────────────────────┐
│                     Munich News Daily System                      │
└─────────────────────────────────────────────────────────────────┘

6:00 AM Berlin → News Crawler
                     ↓
              Fetches RSS feeds
              Extracts full content
              Generates AI summaries
              Saves to MongoDB
                     ↓
7:00 AM Berlin → Newsletter Sender
                     ↓
              Waits for crawler
              Fetches articles
              Generates newsletter
              Sends to subscribers
                     ↓
                  ✅ Done!
```

## Components

### 1. MongoDB Database
- **Purpose**: Central data storage
- **Collections**:
  - `articles`: News articles with summaries
  - `subscribers`: Email subscribers
  - `rss_feeds`: RSS feed sources
  - `newsletter_sends`: Email tracking data
  - `link_clicks`: Link click tracking
  - `subscriber_activity`: Engagement metrics

### 2. News Crawler
- **Schedule**: Daily at 6:00 AM Berlin time
- **Functions**:
  - Fetches articles from RSS feeds
  - Extracts full article content
  - Generates AI summaries using Ollama
  - Saves to MongoDB
- **Technology**: Python, BeautifulSoup, Ollama

### 3. Newsletter Sender
- **Schedule**: Daily at 7:00 AM Berlin time
- **Functions**:
  - Waits for crawler to finish (max 30 min)
  - Fetches today's articles
  - Generates HTML newsletter
  - Injects tracking pixels
  - Sends to all subscribers
- **Technology**: Python, Jinja2, SMTP

### 4. Backend API (Optional)
- **Purpose**: Tracking and analytics
- **Endpoints**:
  - `/api/track/pixel/<id>` - Email open tracking
  - `/api/track/click/<id>` - Link click tracking
  - `/api/analytics/*` - Engagement metrics
  - `/api/tracking/*` - Privacy controls
- **Technology**: Flask, Python

## Data Flow

```
RSS Feeds → Crawler → MongoDB → Sender → Subscribers
                         ↓
                    Backend API
                         ↓
                     Analytics
```

## Coordination

The sender waits for the crawler to ensure fresh content (a minimal sketch of this wait loop follows the list):

1. Sender starts at 7:00 AM
2. Checks for recent articles every 30 seconds
3. Maximum wait time: 30 minutes
4. Proceeds once crawler finishes or timeout
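
The sketch below only illustrates the polling behaviour described above; the helper name and the "recent articles" query are assumptions, not the actual `news_sender` code:

```python
# Illustrative wait-for-crawler loop: poll every 30 s, give up after 30 min.
import time
from datetime import datetime, timedelta

CHECK_INTERVAL_SECONDS = 30
MAX_WAIT_MINUTES = 30

def wait_for_fresh_articles(articles_collection) -> bool:
    """Return True once today's articles exist, False after the timeout."""
    deadline = datetime.utcnow() + timedelta(minutes=MAX_WAIT_MINUTES)
    today_start = datetime.utcnow().replace(hour=0, minute=0, second=0, microsecond=0)

    while datetime.utcnow() < deadline:
        # "Recent articles" is assumed to mean anything stored since midnight UTC.
        if articles_collection.count_documents({'created_at': {'$gte': today_start}}) > 0:
            return True
        time.sleep(CHECK_INTERVAL_SECONDS)

    # Timeout: proceed anyway, as described in the coordination rules above.
    return False
```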

## Technology Stack

- **Backend**: Python 3.11
- **Database**: MongoDB 7.0
- **AI**: Ollama (Phi3 model)
- **Scheduling**: Python schedule library
- **Email**: SMTP with HTML templates
- **Tracking**: Pixel tracking + redirect URLs
- **Infrastructure**: Docker & Docker Compose

## Deployment

All components run in Docker containers:

```
docker-compose up -d
```

Containers:
- `munich-news-mongodb` - Database
- `munich-news-crawler` - Crawler service
- `munich-news-backend` - Tracking and analytics API
- `munich-news-sender` - Sender service

## Security

- MongoDB authentication enabled
- Environment variables for secrets
- HTTPS for tracking URLs (production)
- GDPR-compliant data retention
- Privacy controls (opt-out, deletion)

## Monitoring

- Docker logs for all services
- MongoDB for data verification
- Health checks on containers
- Engagement metrics via API

## Scalability

- Horizontal: Add more crawler instances
- Vertical: Increase container resources
- Database: MongoDB sharding if needed
- Caching: Redis for API responses (future)
@@ -17,13 +17,17 @@ backend/
│   ├── subscription_routes.py  # /api/subscribe, /api/unsubscribe
│   ├── news_routes.py          # /api/news, /api/stats
│   ├── rss_routes.py           # /api/rss-feeds (CRUD operations)
│   ├── ollama_routes.py        # /api/ollama/* (AI features)
│   ├── tracking_routes.py      # /api/track/* (email tracking)
│   └── analytics_routes.py     # /api/analytics/* (engagement metrics)
│
└── services/                   # Business logic layer
    ├── __init__.py
    ├── news_service.py         # News fetching and storage logic
    ├── email_service.py        # Newsletter email sending
    ├── ollama_service.py       # Ollama AI integration
    ├── tracking_service.py     # Email tracking (opens/clicks)
    └── analytics_service.py    # Engagement analytics
```

## Key Components
@@ -49,12 +53,16 @@ Each route file is a Flask Blueprint handling specific API endpoints:
- **news_routes.py**: News fetching and statistics
- **rss_routes.py**: RSS feed management (add/remove/list/toggle)
- **ollama_routes.py**: AI/Ollama integration endpoints
- **tracking_routes.py**: Email tracking (pixel, click redirects, data deletion)
- **analytics_routes.py**: Engagement analytics (open rates, click rates, subscriber activity)

### services/
Business logic separated from route handlers:
- **news_service.py**: Fetches news from RSS feeds, saves to database
- **email_service.py**: Sends newsletter emails to subscribers
- **ollama_service.py**: Communicates with Ollama AI server
- **tracking_service.py**: Email tracking logic (tracking IDs, pixel generation, click logging)
- **analytics_service.py**: Analytics calculations (open rates, click rates, activity classification)

## Benefits of This Structure
@@ -78,6 +78,134 @@ Stores all newsletter subscribers.
|
|||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
|
### 3. Newsletter Sends Collection (`newsletter_sends`)
|
||||||
|
|
||||||
|
Tracks each newsletter sent to each subscriber for email open tracking.
|
||||||
|
|
||||||
|
**Document Structure:**
|
||||||
|
```javascript
|
||||||
|
{
|
||||||
|
_id: ObjectId, // Auto-generated MongoDB ID
|
||||||
|
newsletter_id: String, // Unique ID for this newsletter batch (date-based)
|
||||||
|
subscriber_email: String, // Recipient email
|
||||||
|
tracking_id: String, // Unique tracking ID for this send (UUID)
|
||||||
|
sent_at: DateTime, // When email was sent (UTC)
|
||||||
|
opened: Boolean, // Whether email was opened
|
||||||
|
first_opened_at: DateTime, // First open timestamp (null if not opened)
|
||||||
|
last_opened_at: DateTime, // Most recent open timestamp
|
||||||
|
open_count: Number, // Number of times opened
|
||||||
|
created_at: DateTime // Record creation time (UTC)
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
**Indexes:**
|
||||||
|
- `tracking_id` - Unique index for fast pixel request lookups
|
||||||
|
- `newsletter_id` - Index for analytics queries
|
||||||
|
- `subscriber_email` - Index for user activity queries
|
||||||
|
- `sent_at` - Index for time-based queries
|
||||||
|
|
||||||
|
**Example Document:**
|
||||||
|
```javascript
|
||||||
|
{
|
||||||
|
_id: ObjectId("507f1f77bcf86cd799439013"),
|
||||||
|
newsletter_id: "2024-01-15",
|
||||||
|
subscriber_email: "user@example.com",
|
||||||
|
tracking_id: "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
|
||||||
|
sent_at: ISODate("2024-01-15T08:00:00.000Z"),
|
||||||
|
opened: true,
|
||||||
|
first_opened_at: ISODate("2024-01-15T09:30:00.000Z"),
|
||||||
|
last_opened_at: ISODate("2024-01-15T14:20:00.000Z"),
|
||||||
|
open_count: 3,
|
||||||
|
created_at: ISODate("2024-01-15T08:00:00.000Z")
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
### 4. Link Clicks Collection (`link_clicks`)
|
||||||
|
|
||||||
|
Tracks individual link clicks from newsletters.
|
||||||
|
|
||||||
|
**Document Structure:**
|
||||||
|
```javascript
|
||||||
|
{
|
||||||
|
_id: ObjectId, // Auto-generated MongoDB ID
|
||||||
|
tracking_id: String, // Unique tracking ID for this link (UUID)
|
||||||
|
newsletter_id: String, // Which newsletter this link was in
|
||||||
|
subscriber_email: String, // Who clicked
|
||||||
|
article_url: String, // Original article URL
|
||||||
|
article_title: String, // Article title for reporting
|
||||||
|
clicked_at: DateTime, // When link was clicked (UTC)
|
||||||
|
user_agent: String, // Browser/client info
|
||||||
|
created_at: DateTime // Record creation time (UTC)
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
**Indexes:**
|
||||||
|
- `tracking_id` - Unique index for fast redirect request lookups
|
||||||
|
- `newsletter_id` - Index for analytics queries
|
||||||
|
- `article_url` - Index for article performance queries
|
||||||
|
- `subscriber_email` - Index for user activity queries
|
||||||
|
|
||||||
|
**Example Document:**
|
||||||
|
```javascript
|
||||||
|
{
|
||||||
|
_id: ObjectId("507f1f77bcf86cd799439014"),
|
||||||
|
tracking_id: "b2c3d4e5-f6a7-8901-bcde-f12345678901",
|
||||||
|
newsletter_id: "2024-01-15",
|
||||||
|
subscriber_email: "user@example.com",
|
||||||
|
article_url: "https://www.sueddeutsche.de/muenchen/ubahn-1.123456",
|
||||||
|
article_title: "New U-Bahn Line Opens in Munich",
|
||||||
|
clicked_at: ISODate("2024-01-15T09:35:00.000Z"),
|
||||||
|
user_agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
|
||||||
|
created_at: ISODate("2024-01-15T09:35:00.000Z")
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
### 5. Subscriber Activity Collection (`subscriber_activity`)
|
||||||
|
|
||||||
|
Aggregated activity status for each subscriber.
|
||||||
|
|
||||||
|
**Document Structure:**
|
||||||
|
```javascript
|
||||||
|
{
|
||||||
|
_id: ObjectId, // Auto-generated MongoDB ID
|
||||||
|
email: String, // Subscriber email (unique)
|
||||||
|
status: String, // 'active', 'inactive', or 'dormant'
|
||||||
|
last_opened_at: DateTime, // Most recent email open (UTC)
|
||||||
|
last_clicked_at: DateTime, // Most recent link click (UTC)
|
||||||
|
total_opens: Number, // Lifetime open count
|
||||||
|
total_clicks: Number, // Lifetime click count
|
||||||
|
newsletters_received: Number, // Total newsletters sent
|
||||||
|
newsletters_opened: Number, // Total newsletters opened
|
||||||
|
updated_at: DateTime // Last status update (UTC)
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
**Indexes:**
|
||||||
|
- `email` - Unique index for fast lookups
|
||||||
|
- `status` - Index for filtering by activity level
|
||||||
|
- `last_opened_at` - Index for time-based queries
|
||||||
|
|
||||||
|
**Activity Status Classification:**
|
||||||
|
- **active**: Opened an email in the last 30 days
|
||||||
|
- **inactive**: No opens in 30-60 days
|
||||||
|
- **dormant**: No opens in 60+ days
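
One way to express that rule in code (illustrative only; the real analytics service may compute it differently):

```python
# Illustrative activity classification from the most recent open timestamp (naive UTC)
from datetime import datetime

def classify_subscriber(last_opened_at, now=None):
    """Return 'active', 'inactive', or 'dormant' for a subscriber."""
    now = now or datetime.utcnow()
    if last_opened_at is None:
        return 'dormant'
    days_since_open = (now - last_opened_at).days
    if days_since_open <= 30:
        return 'active'
    if days_since_open <= 60:
        return 'inactive'
    return 'dormant'
```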
|
||||||
|
|
||||||
|
**Example Document:**
|
||||||
|
```javascript
|
||||||
|
{
|
||||||
|
_id: ObjectId("507f1f77bcf86cd799439015"),
|
||||||
|
email: "user@example.com",
|
||||||
|
status: "active",
|
||||||
|
last_opened_at: ISODate("2024-01-15T09:30:00.000Z"),
|
||||||
|
last_clicked_at: ISODate("2024-01-15T09:35:00.000Z"),
|
||||||
|
total_opens: 45,
|
||||||
|
total_clicks: 23,
|
||||||
|
newsletters_received: 60,
|
||||||
|
newsletters_opened: 45,
|
||||||
|
updated_at: ISODate("2024-01-15T10:00:00.000Z")
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
## Design Decisions
|
## Design Decisions
|
||||||
|
|
||||||
### Why MongoDB?
|
### Why MongoDB?
|
||||||
274
docs/DEPLOYMENT.md
Normal file
@@ -0,0 +1,274 @@
|
|||||||
|
# Deployment Guide
|
||||||
|
|
||||||
|
## Quick Start
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# 1. Clone repository
|
||||||
|
git clone <repository-url>
|
||||||
|
cd munich-news
|
||||||
|
|
||||||
|
# 2. Configure environment
|
||||||
|
cp backend/.env.example backend/.env
|
||||||
|
# Edit backend/.env with your settings
|
||||||
|
|
||||||
|
# 3. Start system
|
||||||
|
docker-compose up -d
|
||||||
|
|
||||||
|
# 4. View logs
|
||||||
|
docker-compose logs -f
|
||||||
|
```
|
||||||
|
|
||||||
|
## Environment Configuration
|
||||||
|
|
||||||
|
### Required Settings
|
||||||
|
|
||||||
|
Edit `backend/.env`:
|
||||||
|
|
||||||
|
```env
|
||||||
|
# Email (Required)
|
||||||
|
SMTP_SERVER=smtp.gmail.com
|
||||||
|
SMTP_PORT=587
|
||||||
|
EMAIL_USER=your-email@gmail.com
|
||||||
|
EMAIL_PASSWORD=your-app-password
|
||||||
|
|
||||||
|
# MongoDB (Optional - defaults provided)
|
||||||
|
MONGODB_URI=mongodb://localhost:27017/
|
||||||
|
|
||||||
|
# Tracking (Optional)
|
||||||
|
TRACKING_ENABLED=true
|
||||||
|
TRACKING_API_URL=http://localhost:5001
|
||||||
|
```
|
||||||
|
|
||||||
|
### Optional Settings
|
||||||
|
|
||||||
|
```env
|
||||||
|
# Newsletter
|
||||||
|
NEWSLETTER_MAX_ARTICLES=10
|
||||||
|
NEWSLETTER_HOURS_LOOKBACK=24
|
||||||
|
|
||||||
|
# Ollama AI
|
||||||
|
OLLAMA_ENABLED=true
|
||||||
|
OLLAMA_BASE_URL=http://127.0.0.1:11434
|
||||||
|
OLLAMA_MODEL=phi3:latest
|
||||||
|
|
||||||
|
# Tracking
|
||||||
|
TRACKING_DATA_RETENTION_DAYS=90
|
||||||
|
```
|
||||||
|
|
||||||
|
## Production Deployment
|
||||||
|
|
||||||
|
### 1. Set MongoDB Password
|
||||||
|
|
||||||
|
```bash
|
||||||
|
export MONGO_PASSWORD=your-secure-password
|
||||||
|
docker-compose up -d
|
||||||
|
```
|
||||||
|
|
||||||
|
### 2. Use HTTPS for Tracking
|
||||||
|
|
||||||
|
Update `backend/.env`:
|
||||||
|
```env
|
||||||
|
TRACKING_API_URL=https://yourdomain.com
|
||||||
|
```
|
||||||
|
|
||||||
|
### 3. Configure Log Rotation
|
||||||
|
|
||||||
|
Add to `docker-compose.yml`:
|
||||||
|
```yaml
|
||||||
|
services:
|
||||||
|
crawler:
|
||||||
|
logging:
|
||||||
|
driver: "json-file"
|
||||||
|
options:
|
||||||
|
max-size: "10m"
|
||||||
|
max-file: "3"
|
||||||
|
```
|
||||||
|
|
||||||
|
### 4. Set Up Backups
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Daily MongoDB backup
|
||||||
|
0 3 * * * docker exec munich-news-mongodb mongodump --out=/data/backup/$(date +\%Y\%m\%d)
|
||||||
|
```
|
||||||
|
|
||||||
|
### 5. Enable Backend API
|
||||||
|
|
||||||
|
Uncomment backend service in `docker-compose.yml`:
|
||||||
|
```yaml
|
||||||
|
backend:
|
||||||
|
build:
|
||||||
|
context: ./backend
|
||||||
|
ports:
|
||||||
|
- "5001:5001"
|
||||||
|
# ... rest of config
|
||||||
|
```
|
||||||
|
|
||||||
|
## Schedule Configuration
|
||||||
|
|
||||||
|
### Change Crawler Time
|
||||||
|
|
||||||
|
Edit `news_crawler/scheduled_crawler.py`:
|
||||||
|
```python
|
||||||
|
schedule.every().day.at("06:00").do(run_crawler) # Change time
|
||||||
|
```
|
||||||
|
|
||||||
|
### Change Sender Time
|
||||||
|
|
||||||
|
Edit `news_sender/scheduled_sender.py`:
|
||||||
|
```python
|
||||||
|
schedule.every().day.at("07:00").do(run_sender) # Change time
|
||||||
|
```
|
||||||
|
|
||||||
|
Rebuild after changes:
|
||||||
|
```bash
|
||||||
|
docker-compose up -d --build
|
||||||
|
```
|
||||||
|
|
||||||
|
## Database Setup
|
||||||
|
|
||||||
|
### Add RSS Feeds
|
||||||
|
|
||||||
|
```bash
|
||||||
|
mongosh munich_news
|
||||||
|
|
||||||
|
db.rss_feeds.insertMany([
|
||||||
|
{
|
||||||
|
name: "Süddeutsche Zeitung München",
|
||||||
|
url: "https://www.sueddeutsche.de/muenchen/rss",
|
||||||
|
active: true
|
||||||
|
},
|
||||||
|
{
|
||||||
|
name: "Merkur München",
|
||||||
|
url: "https://www.merkur.de/lokales/muenchen/rss/feed.rss",
|
||||||
|
active: true
|
||||||
|
}
|
||||||
|
])
|
||||||
|
```
|
||||||
|
|
||||||
|
### Add Subscribers
|
||||||
|
|
||||||
|
```bash
|
||||||
|
mongosh munich_news
|
||||||
|
|
||||||
|
db.subscribers.insertMany([
|
||||||
|
{
|
||||||
|
email: "user1@example.com",
|
||||||
|
active: true,
|
||||||
|
tracking_enabled: true,
|
||||||
|
subscribed_at: new Date()
|
||||||
|
},
|
||||||
|
{
|
||||||
|
email: "user2@example.com",
|
||||||
|
active: true,
|
||||||
|
tracking_enabled: true,
|
||||||
|
subscribed_at: new Date()
|
||||||
|
}
|
||||||
|
])
|
||||||
|
```
|
||||||
|
|
||||||
|
## Monitoring
|
||||||
|
|
||||||
|
### Check Container Status
|
||||||
|
|
||||||
|
```bash
|
||||||
|
docker-compose ps
|
||||||
|
```
|
||||||
|
|
||||||
|
### View Logs
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# All services
|
||||||
|
docker-compose logs -f
|
||||||
|
|
||||||
|
# Specific service
|
||||||
|
docker-compose logs -f crawler
|
||||||
|
docker-compose logs -f sender
|
||||||
|
```
|
||||||
|
|
||||||
|
### Check Database
|
||||||
|
|
||||||
|
```bash
|
||||||
|
mongosh munich_news
|
||||||
|
|
||||||
|
// Count articles
|
||||||
|
db.articles.countDocuments()
|
||||||
|
|
||||||
|
// Count subscribers
|
||||||
|
db.subscribers.countDocuments({ active: true })
|
||||||
|
|
||||||
|
// View recent articles
|
||||||
|
db.articles.find().sort({ crawled_at: -1 }).limit(5)
|
||||||
|
```
|
||||||
|
|
||||||
|
## Troubleshooting
|
||||||
|
|
||||||
|
### Containers Won't Start
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Check logs
|
||||||
|
docker-compose logs
|
||||||
|
|
||||||
|
# Rebuild
|
||||||
|
docker-compose up -d --build
|
||||||
|
|
||||||
|
# Reset everything
|
||||||
|
docker-compose down -v
|
||||||
|
docker-compose up -d
|
||||||
|
```
|
||||||
|
|
||||||
|
### Crawler Not Finding Articles
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Check RSS feeds
|
||||||
|
mongosh munich_news --eval "db.rss_feeds.find({ active: true })"
|
||||||
|
|
||||||
|
# Test manually
|
||||||
|
docker-compose exec crawler python crawler_service.py 5
|
||||||
|
```
|
||||||
|
|
||||||
|
### Newsletter Not Sending
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Test email
|
||||||
|
docker-compose exec sender python sender_service.py test your-email@example.com
|
||||||
|
|
||||||
|
# Check SMTP config
|
||||||
|
docker-compose exec sender python -c "from sender_service import Config; print(Config.SMTP_SERVER)"
|
||||||
|
```
|
||||||
|
|
||||||
|
## Maintenance
|
||||||
|
|
||||||
|
### Update System
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git pull
|
||||||
|
docker-compose up -d --build
|
||||||
|
```
|
||||||
|
|
||||||
|
### Backup Database
|
||||||
|
|
||||||
|
```bash
|
||||||
|
docker exec munich-news-mongodb mongodump --out=/data/backup
|
||||||
|
```
|
||||||
|
|
||||||
|
### Clean Old Data
|
||||||
|
|
||||||
|
```bash
|
||||||
|
mongosh munich_news
|
||||||
|
|
||||||
|
// Delete articles older than 90 days
|
||||||
|
db.articles.deleteMany({
|
||||||
|
crawled_at: { $lt: new Date(Date.now() - 90*24*60*60*1000) }
|
||||||
|
})
|
||||||
|
```
|
||||||
|
|
||||||
|
## Security Checklist
|
||||||
|
|
||||||
|
- [ ] Set strong MongoDB password
|
||||||
|
- [ ] Use HTTPS for tracking URLs
|
||||||
|
- [ ] Secure SMTP credentials
|
||||||
|
- [ ] Enable firewall rules
|
||||||
|
- [ ] Set up log rotation
|
||||||
|
- [ ] Configure backups
|
||||||
|
- [ ] Monitor for failures
|
||||||
|
- [ ] Keep dependencies updated
|
||||||
@@ -84,6 +84,33 @@ curl http://localhost:5001/api/ollama/ping
|
|||||||
curl http://localhost:5001/api/ollama/models
|
curl http://localhost:5001/api/ollama/models
|
||||||
```
|
```
|
||||||
|
|
||||||
|
### Email Tracking & Analytics
|
||||||
|
|
||||||
|
**Get newsletter metrics:**
|
||||||
|
```bash
|
||||||
|
curl http://localhost:5001/api/analytics/newsletter/<newsletter_id>
|
||||||
|
```
|
||||||
|
|
||||||
|
**Get article performance:**
|
||||||
|
```bash
|
||||||
|
curl http://localhost:5001/api/analytics/article/<article_id>
|
||||||
|
```
|
||||||
|
|
||||||
|
**Get subscriber activity:**
|
||||||
|
```bash
|
||||||
|
curl http://localhost:5001/api/analytics/subscriber/<email>
|
||||||
|
```
|
||||||
|
|
||||||
|
**Delete subscriber tracking data:**
|
||||||
|
```bash
|
||||||
|
curl -X DELETE http://localhost:5001/api/tracking/subscriber/<email>
|
||||||
|
```
|
||||||
|
|
||||||
|
**Anonymize old tracking data:**
|
||||||
|
```bash
|
||||||
|
curl -X POST http://localhost:5001/api/tracking/anonymize
|
||||||
|
```
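
Server-side, that endpoint could do something like the following sketch; the field choices are assumptions based on the schema docs, not the actual service code.

```python
# Illustrative anonymization pass over tracking data older than the retention window
from datetime import datetime, timedelta

def anonymize_old_tracking(db, retention_days=90):
    """Strip personal identifiers from tracking records past the retention window."""
    cutoff = datetime.utcnow() - timedelta(days=retention_days)
    opens = db['newsletter_sends'].update_many(
        {'sent_at': {'$lt': cutoff}},
        {'$set': {'subscriber_email': 'anonymized'}},
    )
    clicks = db['link_clicks'].update_many(
        {'clicked_at': {'$lt': cutoff}},
        {'$set': {'subscriber_email': 'anonymized', 'user_agent': ''}},
    )
    return opens.modified_count + clicks.modified_count
```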
|
||||||
|
|
||||||
### Database
|
### Database
|
||||||
|
|
||||||
**Connect to MongoDB:**
|
**Connect to MongoDB:**
|
||||||
@@ -110,6 +137,13 @@ db.subscribers.countDocuments({status: "active"})
|
|||||||
db.rss_feeds.find()
|
db.rss_feeds.find()
|
||||||
```
|
```
|
||||||
|
|
||||||
|
**Check tracking data:**
|
||||||
|
```javascript
|
||||||
|
db.newsletter_sends.find().limit(5)
|
||||||
|
db.link_clicks.find().limit(5)
|
||||||
|
db.subscriber_activity.find()
|
||||||
|
```
|
||||||
|
|
||||||
## File Locations
|
## File Locations
|
||||||
|
|
||||||
### Configuration
|
### Configuration
|
||||||
@@ -186,6 +220,9 @@ EMAIL_PASSWORD=your-app-password
|
|||||||
OLLAMA_BASE_URL=http://127.0.0.1:11434
|
OLLAMA_BASE_URL=http://127.0.0.1:11434
|
||||||
OLLAMA_MODEL=phi3:latest
|
OLLAMA_MODEL=phi3:latest
|
||||||
OLLAMA_ENABLED=true
|
OLLAMA_ENABLED=true
|
||||||
|
TRACKING_ENABLED=true
|
||||||
|
TRACKING_API_URL=http://localhost:5001
|
||||||
|
TRACKING_DATA_RETENTION_DAYS=90
|
||||||
```
|
```
|
||||||
|
|
||||||
## Development Workflow
|
## Development Workflow
|
||||||
412
docs/SYSTEM_ARCHITECTURE.md
Normal file
@@ -0,0 +1,412 @@
|
|||||||
|
# Munich News Daily - System Architecture
|
||||||
|
|
||||||
|
## 📊 Complete System Overview
|
||||||
|
|
||||||
|
```
|
||||||
|
┌─────────────────────────────────────────────────────────────────┐
|
||||||
|
│ Munich News Daily System │
|
||||||
|
│ Fully Automated Pipeline │
|
||||||
|
└─────────────────────────────────────────────────────────────────┘
|
||||||
|
|
||||||
|
Daily Schedule
|
||||||
|
┌──────────────────────┐
|
||||||
|
│ 6:00 AM Berlin │
|
||||||
|
│ News Crawler │
|
||||||
|
└──────────┬───────────┘
|
||||||
|
│
|
||||||
|
▼
|
||||||
|
┌──────────────────────────────────────────────────────────────────┐
|
||||||
|
│ News Crawler │
|
||||||
|
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐│
|
||||||
|
│ │ Fetch RSS │→ │ Extract │→ │ Summarize │→ │ Save to ││
|
||||||
|
│ │ Feeds │ │ Content │ │ with AI │ │ MongoDB ││
|
||||||
|
│ └────────────┘ └────────────┘ └────────────┘ └────────────┘│
|
||||||
|
│ │
|
||||||
|
│ Sources: Süddeutsche, Merkur, BR24, etc. │
|
||||||
|
│ Output: Full articles + AI summaries │
|
||||||
|
└──────────────────────────────────────────────────────────────────┘
|
||||||
|
│
|
||||||
|
│ Articles saved
|
||||||
|
▼
|
||||||
|
┌──────────────────────┐
|
||||||
|
│ MongoDB │
|
||||||
|
│ (Data Storage) │
|
||||||
|
└──────────┬───────────┘
|
||||||
|
│
|
||||||
|
│ Wait for crawler
|
||||||
|
▼
|
||||||
|
┌──────────────────────┐
|
||||||
|
│ 7:00 AM Berlin │
|
||||||
|
│ Newsletter Sender │
|
||||||
|
└──────────┬───────────┘
|
||||||
|
│
|
||||||
|
▼
|
||||||
|
┌──────────────────────────────────────────────────────────────────┐
|
||||||
|
│ Newsletter Sender │
|
||||||
|
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐│
|
||||||
|
│ │ Wait for │→ │ Fetch │→ │ Generate │→ │ Send to ││
|
||||||
|
│ │ Crawler │ │ Articles │ │ Newsletter │ │ Subscribers││
|
||||||
|
│ └────────────┘ └────────────┘ └────────────┘ └────────────┘│
|
||||||
|
│ │
|
||||||
|
│ Features: Tracking pixels, link tracking, HTML templates │
|
||||||
|
│ Output: Personalized newsletters with engagement tracking │
|
||||||
|
└──────────────────────────────────────────────────────────────────┘
|
||||||
|
│
|
||||||
|
│ Emails sent
|
||||||
|
▼
|
||||||
|
┌──────────────────────┐
|
||||||
|
│ Subscribers │
|
||||||
|
│ (Email Inboxes) │
|
||||||
|
└──────────┬───────────┘
|
||||||
|
│
|
||||||
|
│ Opens & clicks
|
||||||
|
▼
|
||||||
|
┌──────────────────────┐
|
||||||
|
│ Tracking System │
|
||||||
|
│ (Analytics API) │
|
||||||
|
└──────────────────────┘
|
||||||
|
```
|
||||||
|
|
||||||
|
## 🔄 Data Flow
|
||||||
|
|
||||||
|
### 1. Content Acquisition (6:00 AM)
|
||||||
|
|
||||||
|
```
|
||||||
|
RSS Feeds → Crawler → Full Content → AI Summary → MongoDB
|
||||||
|
```
|
||||||
|
|
||||||
|
**Details**:
|
||||||
|
- Fetches from multiple RSS sources
|
||||||
|
- Extracts full article text
|
||||||
|
- Generates concise summaries using Ollama
|
||||||
|
- Stores with metadata (author, date, source)
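
A rough sketch of the summarization step, assuming the standard Ollama `/api/generate` endpoint and the model/URL settings shown elsewhere in these docs (the prompt wording is illustrative):

```python
# Illustrative summarization call against a local Ollama server (phi3)
import requests

OLLAMA_BASE_URL = 'http://127.0.0.1:11434'
OLLAMA_MODEL = 'phi3:latest'

def summarize(article_text):
    """Ask Ollama for a short summary of one article; returns '' on failure."""
    prompt = f"Summarize this Munich news article in 2-3 sentences:\n\n{article_text}"
    try:
        resp = requests.post(
            f"{OLLAMA_BASE_URL}/api/generate",
            json={'model': OLLAMA_MODEL, 'prompt': prompt, 'stream': False},
            timeout=120,
        )
        resp.raise_for_status()
        return resp.json().get('response', '').strip()
    except requests.RequestException:
        return ''
```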
|
||||||
|
|
||||||
|
### 2. Newsletter Generation (7:00 AM)
|
||||||
|
|
||||||
|
```
|
||||||
|
MongoDB → Articles → Template → HTML → Email
|
||||||
|
```
|
||||||
|
|
||||||
|
**Details**:
|
||||||
|
- Waits for crawler to finish (max 30 min)
|
||||||
|
- Fetches today's articles with summaries
|
||||||
|
- Applies Jinja2 template
|
||||||
|
- Injects tracking pixels
|
||||||
|
- Replaces links with tracking URLs
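
The link-replacement step could be sketched like this, where `make_tracking_url` stands in for whatever helper the tracking service provides (illustrative, not the actual sender code):

```python
# Illustrative post-processing of the rendered newsletter HTML:
# swap every article href for a per-subscriber tracking redirect.
from bs4 import BeautifulSoup  # already a crawler dependency

def rewrite_links(html, make_tracking_url):
    """make_tracking_url(original_url) -> redirect URL; supplied by the tracking service."""
    soup = BeautifulSoup(html, 'html.parser')
    for a in soup.find_all('a', href=True):
        a['href'] = make_tracking_url(a['href'])
    return str(soup)
```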
|
||||||
|
|
||||||
|
### 3. Engagement Tracking (Ongoing)
|
||||||
|
|
||||||
|
```
|
||||||
|
Email Open → Pixel Load → Log Event → Analytics
|
||||||
|
Link Click → Redirect → Log Event → Analytics
|
||||||
|
```
|
||||||
|
|
||||||
|
**Details**:
|
||||||
|
- Tracks email opens via 1x1 pixel
|
||||||
|
- Tracks link clicks via redirect URLs
|
||||||
|
- Stores engagement data in MongoDB
|
||||||
|
- Provides analytics API
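
As a sketch only, the click-redirect endpoint might look like the following, assuming each tracked link has a pre-registered `link_clicks` record that stores the original article URL (the real design may instead create the record at click time):

```python
# Illustrative sketch of /api/track/click/<id>; not the actual backend implementation.
from datetime import datetime
from flask import Blueprint, abort, redirect, request
from pymongo import MongoClient

clicks_bp = Blueprint('clicks', __name__)
clicks = MongoClient('mongodb://localhost:27017/')['munich_news']['link_clicks']

@clicks_bp.route('/api/track/click/<tracking_id>')
def track_click(tracking_id):
    record = clicks.find_one({'tracking_id': tracking_id})
    if record is None:
        abort(404)
    clicks.update_one({'tracking_id': tracking_id},
                      {'$set': {'clicked_at': datetime.utcnow(),
                                'user_agent': request.headers.get('User-Agent', '')}})
    return redirect(record['article_url'], code=302)
```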
|
||||||
|
|
||||||
|
## 🏗️ Component Architecture
|
||||||
|
|
||||||
|
### Docker Containers
|
||||||
|
|
||||||
|
```
|
||||||
|
┌─────────────────────────────────────────────────────────┐
|
||||||
|
│ Docker Network │
|
||||||
|
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
|
||||||
|
│ │ MongoDB │ │ Crawler │ │ Sender │ │
|
||||||
|
│ │ │ │ │ │ │ │
|
||||||
|
│ │ Port: 27017 │←─│ Schedule: │←─│ Schedule: │ │
|
||||||
|
│ │ │ │ 6:00 AM │ │ 7:00 AM │ │
|
||||||
|
│ │ Storage: │ │ │ │ │ │
|
||||||
|
│ │ - articles │ │ Depends on: │ │ Depends on: │ │
|
||||||
|
│ │ - subscribers│ │ - MongoDB │ │ - MongoDB │ │
|
||||||
|
│ │ - tracking │ │ │ │ - Crawler │ │
|
||||||
|
│ └──────────────┘ └──────────────┘ └──────────────┘ │
|
||||||
|
│ │
|
||||||
|
│ All containers auto-restart on failure │
|
||||||
|
│ All use Europe/Berlin timezone │
|
||||||
|
└─────────────────────────────────────────────────────────┘
|
||||||
|
```
|
||||||
|
|
||||||
|
### Backend Services
|
||||||
|
|
||||||
|
```
|
||||||
|
┌─────────────────────────────────────────────────────────┐
|
||||||
|
│ Backend Services │
|
||||||
|
│ │
|
||||||
|
│ ┌──────────────────────────────────────────────────┐ │
|
||||||
|
│ │ Flask API (Port 5001) │ │
|
||||||
|
│ │ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ │
|
||||||
|
│ │ │ Tracking │ │ Analytics │ │ Privacy │ │ │
|
||||||
|
│ │ │ Endpoints │ │ Endpoints │ │ Endpoints │ │ │
|
||||||
|
│ │ └────────────┘ └────────────┘ └────────────┘ │ │
|
||||||
|
│ └──────────────────────────────────────────────────┘ │
|
||||||
|
│ │
|
||||||
|
│ ┌──────────────────────────────────────────────────┐ │
|
||||||
|
│ │ Services Layer │ │
|
||||||
|
│ │ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ │
|
||||||
|
│ │ │ Tracking │ │ Analytics │ │ Ollama │ │ │
|
||||||
|
│ │ │ Service │ │ Service │ │ Client │ │ │
|
||||||
|
│ │ └────────────┘ └────────────┘ └────────────┘ │ │
|
||||||
|
│ └──────────────────────────────────────────────────┘ │
|
||||||
|
└─────────────────────────────────────────────────────────┘
|
||||||
|
```
|
||||||
|
|
||||||
|
## 📅 Daily Timeline
|
||||||
|
|
||||||
|
```
|
||||||
|
Time (Berlin) │ Event │ Duration
|
||||||
|
───────────────┼──────────────────────────┼──────────
|
||||||
|
05:59:59 │ System idle │ -
|
||||||
|
06:00:00 │ Crawler starts │ ~10-20 min
|
||||||
|
06:00:01 │ - Fetch RSS feeds │
|
||||||
|
06:02:00 │ - Extract content │
|
||||||
|
06:05:00 │ - Generate summaries │
|
||||||
|
06:15:00 │ - Save to MongoDB │
|
||||||
|
06:20:00 │ Crawler finishes │
|
||||||
|
06:20:01 │ System idle │ ~40 min
|
||||||
|
07:00:00 │ Sender starts │ ~5-10 min
|
||||||
|
07:00:01 │ - Wait for crawler │ (checks every 30s)
|
||||||
|
07:00:30 │ - Crawler confirmed done │
|
||||||
|
07:00:31 │ - Fetch articles │
|
||||||
|
07:01:00 │ - Generate newsletters │
|
||||||
|
07:02:00 │ - Send to subscribers │
|
||||||
|
07:10:00 │ Sender finishes │
|
||||||
|
07:10:01 │ System idle │ Until tomorrow
|
||||||
|
```
|
||||||
|
|
||||||
|
## 🔐 Security & Privacy
|
||||||
|
|
||||||
|
### Data Protection
|
||||||
|
|
||||||
|
```
|
||||||
|
┌─────────────────────────────────────────────────────────┐
|
||||||
|
│ Privacy Features │
|
||||||
|
│ │
|
||||||
|
│ ┌──────────────────────────────────────────────────┐ │
|
||||||
|
│ │ Data Retention │ │
|
||||||
|
│ │ - Personal data: 90 days │ │
|
||||||
|
│ │ - Anonymization: Automatic │ │
|
||||||
|
│ │ - Deletion: On request │ │
|
||||||
|
│ └──────────────────────────────────────────────────┘ │
|
||||||
|
│ │
|
||||||
|
│ ┌──────────────────────────────────────────────────┐ │
|
||||||
|
│ │ User Rights │ │
|
||||||
|
│ │ - Opt-out: Anytime │ │
|
||||||
|
│ │ - Data access: API available │ │
|
||||||
|
│ │ - Data deletion: Full removal │ │
|
||||||
|
│ └──────────────────────────────────────────────────┘ │
|
||||||
|
│ │
|
||||||
|
│ ┌──────────────────────────────────────────────────┐ │
|
||||||
|
│ │ Compliance │ │
|
||||||
|
│ │ - GDPR compliant │ │
|
||||||
|
│ │ - Privacy notice in emails │ │
|
||||||
|
│ │ - Transparent tracking │ │
|
||||||
|
│ └──────────────────────────────────────────────────┘ │
|
||||||
|
└─────────────────────────────────────────────────────────┘
|
||||||
|
```
|
||||||
|
|
||||||
|
## 📊 Database Schema
|
||||||
|
|
||||||
|
### Collections
|
||||||
|
|
||||||
|
```
|
||||||
|
MongoDB (munich_news)
|
||||||
|
│
|
||||||
|
├── articles
|
||||||
|
│ ├── title
|
||||||
|
│ ├── author
|
||||||
|
│ ├── content (full text)
|
||||||
|
│ ├── summary (AI generated)
|
||||||
|
│ ├── link
|
||||||
|
│ ├── source
|
||||||
|
│ ├── published_at
|
||||||
|
│ └── crawled_at
|
||||||
|
│
|
||||||
|
├── subscribers
|
||||||
|
│ ├── email
|
||||||
|
│ ├── active
|
||||||
|
│ ├── tracking_enabled
|
||||||
|
│ └── subscribed_at
|
||||||
|
│
|
||||||
|
├── rss_feeds
|
||||||
|
│ ├── name
|
||||||
|
│ ├── url
|
||||||
|
│ └── active
|
||||||
|
│
|
||||||
|
├── newsletter_sends
|
||||||
|
│ ├── tracking_id
|
||||||
|
│ ├── newsletter_id
|
||||||
|
│ ├── subscriber_email
|
||||||
|
│ ├── opened
|
||||||
|
│ ├── first_opened_at
|
||||||
|
│ └── open_count
|
||||||
|
│
|
||||||
|
├── link_clicks
|
||||||
|
│ ├── tracking_id
|
||||||
|
│ ├── newsletter_id
|
||||||
|
│ ├── subscriber_email
|
||||||
|
│ ├── article_url
|
||||||
|
│ ├── clicked
|
||||||
|
│ └── clicked_at
|
||||||
|
│
|
||||||
|
└── subscriber_activity
|
||||||
|
├── email
|
||||||
|
├── status (active/inactive/dormant)
|
||||||
|
├── last_opened_at
|
||||||
|
├── last_clicked_at
|
||||||
|
├── total_opens
|
||||||
|
└── total_clicks
|
||||||
|
```
|
||||||
|
|
||||||
|
## 🚀 Deployment Architecture
|
||||||
|
|
||||||
|
### Development
|
||||||
|
|
||||||
|
```
|
||||||
|
Local Machine
|
||||||
|
├── Docker Compose
|
||||||
|
│ ├── MongoDB (no auth)
|
||||||
|
│ ├── Crawler
|
||||||
|
│ └── Sender
|
||||||
|
├── Backend (manual start)
|
||||||
|
│ └── Flask API
|
||||||
|
└── Ollama (optional)
|
||||||
|
└── AI Summarization
|
||||||
|
```
|
||||||
|
|
||||||
|
### Production
|
||||||
|
|
||||||
|
```
|
||||||
|
Server
|
||||||
|
├── Docker Compose (prod)
|
||||||
|
│ ├── MongoDB (with auth)
|
||||||
|
│ ├── Crawler
|
||||||
|
│ └── Sender
|
||||||
|
├── Backend (systemd/pm2)
|
||||||
|
│ └── Flask API (HTTPS)
|
||||||
|
├── Ollama (optional)
|
||||||
|
│ └── AI Summarization
|
||||||
|
└── Nginx (reverse proxy)
|
||||||
|
└── SSL/TLS
|
||||||
|
```
|
||||||
|
|
||||||
|
## 🔄 Coordination Mechanism
|
||||||
|
|
||||||
|
### Crawler-Sender Synchronization
|
||||||
|
|
||||||
|
```
|
||||||
|
┌─────────────────────────────────────────────────────────┐
|
||||||
|
│ Coordination Flow │
|
||||||
|
│ │
|
||||||
|
│ 6:00 AM → Crawler starts │
|
||||||
|
│ ↓ │
|
||||||
|
│ Crawling articles... │
|
||||||
|
│ ↓ │
|
||||||
|
│ Saves to MongoDB │
|
||||||
|
│ ↓ │
|
||||||
|
│ 6:20 AM → Crawler finishes │
|
||||||
|
│ ↓ │
|
||||||
|
│ 7:00 AM → Sender starts │
|
||||||
|
│ ↓ │
|
||||||
|
│ Check: Recent articles? ──→ No ──┐ │
|
||||||
|
│ ↓ Yes │ │
|
||||||
|
│ Proceed with send │ │
|
||||||
|
│ │ │
|
||||||
|
│ ← Wait 30s ← Wait 30s ← Wait 30s┘ │
|
||||||
|
│ (max 30 minutes) │
|
||||||
|
│ │
|
||||||
|
│ 7:10 AM → Newsletter sent │
|
||||||
|
└─────────────────────────────────────────────────────────┘
|
||||||
|
```
|
||||||
|
|
||||||
|
## 📈 Monitoring & Observability
|
||||||
|
|
||||||
|
### Key Metrics
|
||||||
|
|
||||||
|
```
|
||||||
|
┌─────────────────────────────────────────────────────────┐
|
||||||
|
│ Metrics to Monitor │
|
||||||
|
│ │
|
||||||
|
│ Crawler: │
|
||||||
|
│ - Articles crawled per day │
|
||||||
|
│ - Crawl duration │
|
||||||
|
│ - Success/failure rate │
|
||||||
|
│ - Summary generation rate │
|
||||||
|
│ │
|
||||||
|
│ Sender: │
|
||||||
|
│ - Newsletters sent per day │
|
||||||
|
│ - Send duration │
|
||||||
|
│ - Success/failure rate │
|
||||||
|
│ - Wait time for crawler │
|
||||||
|
│ │
|
||||||
|
│ Engagement: │
|
||||||
|
│ - Open rate │
|
||||||
|
│ - Click-through rate │
|
||||||
|
│ - Active subscribers │
|
||||||
|
│ - Dormant subscribers │
|
||||||
|
│ │
|
||||||
|
│ System: │
|
||||||
|
│ - Container uptime │
|
||||||
|
│ - Database size │
|
||||||
|
│ - Error rate │
|
||||||
|
│ - Response times │
|
||||||
|
└─────────────────────────────────────────────────────────┘
|
||||||
|
```
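
As an example of how the engagement numbers could be derived from the collections above (a sketch; the analytics service may aggregate differently):

```python
# Illustrative per-newsletter engagement metrics from newsletter_sends / link_clicks
def newsletter_metrics(db, newsletter_id):
    sent = db['newsletter_sends'].count_documents({'newsletter_id': newsletter_id})
    opened = db['newsletter_sends'].count_documents({'newsletter_id': newsletter_id, 'opened': True})
    clicks = db['link_clicks'].count_documents({'newsletter_id': newsletter_id})
    return {
        'sent': sent,
        'open_rate': round(opened / sent, 3) if sent else 0.0,
        'click_through_rate': round(clicks / sent, 3) if sent else 0.0,
    }
```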
|
||||||
|
|
||||||
|
## 🛠️ Maintenance Tasks
|
||||||
|
|
||||||
|
### Daily
|
||||||
|
- Check logs for errors
|
||||||
|
- Verify newsletters sent
|
||||||
|
- Monitor engagement metrics
|
||||||
|
|
||||||
|
### Weekly
|
||||||
|
- Review article quality
|
||||||
|
- Check subscriber growth
|
||||||
|
- Analyze engagement trends
|
||||||
|
|
||||||
|
### Monthly
|
||||||
|
- Archive old articles
|
||||||
|
- Clean up dormant subscribers
|
||||||
|
- Update dependencies
|
||||||
|
- Review system performance
|
||||||
|
|
||||||
|
## 📚 Technology Stack
|
||||||
|
|
||||||
|
```
|
||||||
|
┌─────────────────────────────────────────────────────────┐
|
||||||
|
│ Technology Stack │
|
||||||
|
│ │
|
||||||
|
│ Backend: │
|
||||||
|
│ - Python 3.11 │
|
||||||
|
│ - Flask (API) │
|
||||||
|
│ - PyMongo (Database) │
|
||||||
|
│ - Schedule (Automation) │
|
||||||
|
│ - Jinja2 (Templates) │
|
||||||
|
│ - BeautifulSoup (Parsing) │
|
||||||
|
│ │
|
||||||
|
│ Database: │
|
||||||
|
│ - MongoDB 7.0 │
|
||||||
|
│ │
|
||||||
|
│ AI/ML: │
|
||||||
|
│ - Ollama (Summarization) │
|
||||||
|
│ - Phi3 Model (default) │
|
||||||
|
│ │
|
||||||
|
│ Infrastructure: │
|
||||||
|
│ - Docker & Docker Compose │
|
||||||
|
│ - Linux (Ubuntu/Debian) │
|
||||||
|
│ │
|
||||||
|
│ Email: │
|
||||||
|
│ - SMTP (configurable) │
|
||||||
|
│ - HTML emails with tracking │
|
||||||
|
└─────────────────────────────────────────────────────────┘
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Last Updated**: 2024-01-16
|
||||||
|
**Version**: 1.0
|
||||||
|
**Status**: Production Ready ✅
|
||||||
@@ -1,191 +0,0 @@
|
|||||||
# Recent Changes - Full Content Storage
|
|
||||||
|
|
||||||
## ✅ What Changed
|
|
||||||
|
|
||||||
### 1. Removed Content Length Limit
|
|
||||||
**Before:**
|
|
||||||
```python
|
|
||||||
'content': content_text[:10000] # Limited to 10k chars
|
|
||||||
```
|
|
||||||
|
|
||||||
**After:**
|
|
||||||
```python
|
|
||||||
'content': content_text # Full content, no limit
|
|
||||||
```
|
|
||||||
|
|
||||||
### 2. Simplified Database Schema
|
|
||||||
**Before:**
|
|
||||||
```javascript
|
|
||||||
{
|
|
||||||
summary: String, // Short summary
|
|
||||||
full_content: String // Limited content
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
**After:**
|
|
||||||
```javascript
|
|
||||||
{
|
|
||||||
content: String // Full article content, no limit
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
### 3. Enhanced API Response
|
|
||||||
**Before:**
|
|
||||||
```javascript
|
|
||||||
{
|
|
||||||
title: "...",
|
|
||||||
link: "...",
|
|
||||||
summary: "..."
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
**After:**
|
|
||||||
```javascript
|
|
||||||
{
|
|
||||||
title: "...",
|
|
||||||
author: "...", // NEW!
|
|
||||||
link: "...",
|
|
||||||
preview: "...", // First 200 chars
|
|
||||||
word_count: 1250, // NEW!
|
|
||||||
has_full_content: true // NEW!
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
## 📊 Database Structure
|
|
||||||
|
|
||||||
### Articles Collection
|
|
||||||
```javascript
|
|
||||||
{
|
|
||||||
_id: ObjectId,
|
|
||||||
title: String, // Article title
|
|
||||||
author: String, // Article author (extracted)
|
|
||||||
link: String, // Article URL (unique)
|
|
||||||
content: String, // FULL article content (no limit)
|
|
||||||
word_count: Number, // Word count
|
|
||||||
source: String, // RSS feed name
|
|
||||||
published_at: String, // Publication date
|
|
||||||
crawled_at: DateTime, // When crawled
|
|
||||||
created_at: DateTime // When added
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
## 🆕 New API Endpoint
|
|
||||||
|
|
||||||
### GET /api/news/<article_url>
|
|
||||||
Get full article content by URL.
|
|
||||||
|
|
||||||
**Example:**
|
|
||||||
```bash
|
|
||||||
# URL encode the article URL
|
|
||||||
curl "http://localhost:5001/api/news/https%3A%2F%2Fexample.com%2Farticle"
|
|
||||||
```
|
|
||||||
|
|
||||||
**Response:**
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"title": "New U-Bahn Line Opens in Munich",
|
|
||||||
"author": "Max Mustermann",
|
|
||||||
"link": "https://example.com/article",
|
|
||||||
"content": "The full article text here... (complete, no truncation)",
|
|
||||||
"word_count": 1250,
|
|
||||||
"source": "Süddeutsche Zeitung München",
|
|
||||||
"published_at": "2024-11-10T10:00:00Z",
|
|
||||||
"crawled_at": "2024-11-10T16:30:00Z",
|
|
||||||
"created_at": "2024-11-10T16:00:00Z"
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
## 📈 Enhanced Stats
|
|
||||||
|
|
||||||
### GET /api/stats
|
|
||||||
Now includes crawled article count:
|
|
||||||
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"subscribers": 150,
|
|
||||||
"articles": 500,
|
|
||||||
"crawled_articles": 350 // NEW!
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
## 🎯 Benefits
|
|
||||||
|
|
||||||
1. **Complete Content** - No truncation, full articles stored
|
|
||||||
2. **Better for AI** - Full context for summarization/analysis
|
|
||||||
3. **Cleaner Schema** - Single `content` field instead of `summary` + `full_content`
|
|
||||||
4. **More Metadata** - Author, word count, crawl timestamp
|
|
||||||
5. **Better API** - Preview in list, full content on demand
|
|
||||||
|
|
||||||
## 🔄 Migration
|
|
||||||
|
|
||||||
If you have existing articles with `full_content` field, they will continue to work. New articles will use the `content` field.
|
|
||||||
|
|
||||||
To migrate old articles:
|
|
||||||
```javascript
|
|
||||||
// MongoDB shell
|
|
||||||
db.articles.updateMany(
|
|
||||||
{ full_content: { $exists: true } },
|
|
||||||
[
|
|
||||||
{
|
|
||||||
$set: {
|
|
||||||
content: "$full_content"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
{
|
|
||||||
$unset: ["full_content", "summary"]
|
|
||||||
}
|
|
||||||
]
|
|
||||||
)
|
|
||||||
```
|
|
||||||
|
|
||||||
## 🚀 Usage
|
|
||||||
|
|
||||||
### Crawl Articles
|
|
||||||
```bash
|
|
||||||
cd news_crawler
|
|
||||||
python crawler_service.py 10
|
|
||||||
```
|
|
||||||
|
|
||||||
### Get Article List (with previews)
|
|
||||||
```bash
|
|
||||||
curl http://localhost:5001/api/news
|
|
||||||
```
|
|
||||||
|
|
||||||
### Get Full Article Content
|
|
||||||
```bash
|
|
||||||
# Get the article URL from the list, then:
|
|
||||||
curl "http://localhost:5001/api/news/<encoded_url>"
|
|
||||||
```
|
|
||||||
|
|
||||||
### Check Stats
|
|
||||||
```bash
|
|
||||||
curl http://localhost:5001/api/stats
|
|
||||||
```
|
|
||||||
|
|
||||||
## 📝 Example Workflow
|
|
||||||
|
|
||||||
1. **Add RSS Feed**
|
|
||||||
```bash
|
|
||||||
curl -X POST http://localhost:5001/api/rss-feeds \
|
|
||||||
-H "Content-Type: application/json" \
|
|
||||||
-d '{"name": "News Source", "url": "https://example.com/rss"}'
|
|
||||||
```
|
|
||||||
|
|
||||||
2. **Crawl Articles**
|
|
||||||
```bash
|
|
||||||
cd news_crawler
|
|
||||||
python crawler_service.py 10
|
|
||||||
```
|
|
||||||
|
|
||||||
3. **View Articles**
|
|
||||||
```bash
|
|
||||||
curl http://localhost:5001/api/news
|
|
||||||
```
|
|
||||||
|
|
||||||
4. **Get Full Content**
|
|
||||||
```bash
|
|
||||||
# Copy article link from above, URL encode it
|
|
||||||
curl "http://localhost:5001/api/news/https%3A%2F%2Fexample.com%2Farticle"
|
|
||||||
```
|
|
||||||
|
|
||||||
Now you have complete article content ready for AI processing! 🎉
|
|
||||||
@@ -6,8 +6,20 @@ WORKDIR /app
|
|||||||
COPY requirements.txt .
|
COPY requirements.txt .
|
||||||
RUN pip install --no-cache-dir -r requirements.txt
|
RUN pip install --no-cache-dir -r requirements.txt
|
||||||
|
|
||||||
# Copy crawler service
|
# Copy crawler files
|
||||||
COPY crawler_service.py .
|
COPY . .
|
||||||
|
|
||||||
# Run crawler
|
# Copy backend config files (needed for Config class)
|
||||||
CMD ["python", "crawler_service.py"]
|
COPY ../backend/config.py /app/config.py
|
||||||
|
COPY ../backend/ollama_client.py /app/ollama_client.py
|
||||||
|
COPY ../backend/.env /app/.env
|
||||||
|
|
||||||
|
# Make the scheduler executable
|
||||||
|
RUN chmod +x scheduled_crawler.py
|
||||||
|
|
||||||
|
# Set timezone to Berlin
|
||||||
|
ENV TZ=Europe/Berlin
|
||||||
|
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
|
||||||
|
|
||||||
|
# Run the scheduled crawler
|
||||||
|
CMD ["python", "-u", "scheduled_crawler.py"]
|
||||||
|
|||||||
@@ -1,127 +0,0 @@
|
|||||||
# News Crawler - Quick Start
|
|
||||||
|
|
||||||
## 1. Install Dependencies
|
|
||||||
|
|
||||||
```bash
|
|
||||||
cd news_crawler
|
|
||||||
pip install -r requirements.txt
|
|
||||||
```
|
|
||||||
|
|
||||||
## 2. Configure Environment
|
|
||||||
|
|
||||||
Make sure MongoDB is running and accessible. The crawler will use the same database as the backend.
|
|
||||||
|
|
||||||
Default connection: `mongodb://localhost:27017/`
|
|
||||||
|
|
||||||
To use a different MongoDB URI, create a `.env` file:
|
|
||||||
```env
|
|
||||||
MONGODB_URI=mongodb://localhost:27017/
|
|
||||||
```
|
|
||||||
|
|
||||||
## 3. Run the Crawler
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Crawl up to 10 articles per feed
|
|
||||||
python crawler_service.py
|
|
||||||
|
|
||||||
# Crawl up to 20 articles per feed
|
|
||||||
python crawler_service.py 20
|
|
||||||
```
|
|
||||||
|
|
||||||
## 4. Verify Results
|
|
||||||
|
|
||||||
Check your MongoDB database:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Using mongosh
|
|
||||||
mongosh
|
|
||||||
use munich_news
|
|
||||||
db.articles.find({full_content: {$exists: true}}).count()
|
|
||||||
db.articles.findOne({full_content: {$exists: true}})
|
|
||||||
```
|
|
||||||
|
|
||||||
## 5. Schedule Regular Crawling
|
|
||||||
|
|
||||||
### Option A: Cron (Linux/Mac)
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Edit crontab
|
|
||||||
crontab -e
|
|
||||||
|
|
||||||
# Add this line to run every 6 hours
|
|
||||||
0 */6 * * * cd /path/to/news_crawler && /path/to/venv/bin/python crawler_service.py
|
|
||||||
```
|
|
||||||
|
|
||||||
### Option B: Docker
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Build and run
|
|
||||||
docker-compose up
|
|
||||||
|
|
||||||
# Or run as a one-off
|
|
||||||
docker-compose run --rm crawler
|
|
||||||
```
|
|
||||||
|
|
||||||
### Option C: Manual
|
|
||||||
|
|
||||||
Just run the script whenever you want to fetch new articles:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
python crawler_service.py
|
|
||||||
```
|
|
||||||
|
|
||||||
## What Gets Crawled?
|
|
||||||
|
|
||||||
The crawler:
|
|
||||||
1. Fetches all active RSS feeds from the database
|
|
||||||
2. For each feed, gets the latest articles
|
|
||||||
3. Crawls the full content from each article URL
|
|
||||||
4. Saves: title, full_content, word_count, crawled_at
|
|
||||||
5. Skips articles that already have content
|
|
||||||
|
|
||||||
## Output Example
|
|
||||||
|
|
||||||
```
|
|
||||||
============================================================
|
|
||||||
🚀 Starting RSS Feed Crawler
|
|
||||||
============================================================
|
|
||||||
Found 3 active feed(s)
|
|
||||||
|
|
||||||
📰 Crawling feed: Süddeutsche Zeitung München
|
|
||||||
URL: https://www.sueddeutsche.de/muenchen/rss
|
|
||||||
🔍 Crawling: New U-Bahn Line Opens in Munich...
|
|
||||||
✓ Saved (1250 words)
|
|
||||||
🔍 Crawling: Munich Weather Update...
|
|
||||||
✓ Saved (450 words)
|
|
||||||
✓ Crawled 2 articles from Süddeutsche Zeitung München
|
|
||||||
|
|
||||||
============================================================
|
|
||||||
✓ Crawling Complete!
|
|
||||||
Total feeds processed: 3
|
|
||||||
Total articles crawled: 15
|
|
||||||
Duration: 45.23 seconds
|
|
||||||
============================================================
|
|
||||||
```
|
|
||||||
|
|
||||||
## Troubleshooting
|
|
||||||
|
|
||||||
**No feeds found:**
|
|
||||||
- Make sure you've added RSS feeds via the backend API
|
|
||||||
- Check MongoDB connection
|
|
||||||
|
|
||||||
**Can't extract content:**
|
|
||||||
- Some sites block scrapers
|
|
||||||
- Some sites require JavaScript (not supported yet)
|
|
||||||
- Check if the URL is accessible
|
|
||||||
|
|
||||||
**Timeout errors:**
|
|
||||||
- Increase timeout in the code
|
|
||||||
- Check your internet connection
|
|
||||||
|
|
||||||
## Next Steps
|
|
||||||
|
|
||||||
Once articles are crawled, you can:
|
|
||||||
- View them in the frontend
|
|
||||||
- Use Ollama to summarize them
|
|
||||||
- Generate newsletters with full content
|
|
||||||
- Perform text analysis
|
|
||||||
@@ -1,225 +0,0 @@
|
|||||||
# News Crawler Microservice
|
|
||||||
|
|
||||||
A standalone microservice that crawls full article content from RSS feeds and stores it in MongoDB.
|
|
||||||
|
|
||||||
## Features
|
|
||||||
|
|
||||||
- 🔍 Extracts full article content from RSS feed links
|
|
||||||
- 📊 Calculates word count
|
|
||||||
- 🔄 Avoids re-crawling already processed articles
|
|
||||||
- ⏱️ Rate limiting (1 second delay between requests)
|
|
||||||
- 🎯 Smart content extraction using multiple selectors
|
|
||||||
- 🧹 Cleans up scripts, styles, and navigation elements
|
|
||||||
|
|
||||||
## Installation
|
|
||||||
|
|
||||||
1. Create a virtual environment:
|
|
||||||
```bash
|
|
||||||
python -m venv venv
|
|
||||||
source venv/bin/activate # On Windows: venv\Scripts\activate
|
|
||||||
```
|
|
||||||
|
|
||||||
2. Install dependencies:
|
|
||||||
```bash
|
|
||||||
pip install -r requirements.txt
|
|
||||||
```
|
|
||||||
|
|
||||||
3. Configure environment variables:
|
|
||||||
Create a `.env` file in the project root (or use the backend's `.env`):
|
|
||||||
```env
|
|
||||||
MONGODB_URI=mongodb://localhost:27017/
|
|
||||||
```
|
|
||||||
|
|
||||||
## Usage
|
|
||||||
|
|
||||||
### Standalone Execution
|
|
||||||
|
|
||||||
Run the crawler directly:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Crawl up to 10 articles per feed (default)
|
|
||||||
python crawler_service.py
|
|
||||||
|
|
||||||
# Crawl up to 20 articles per feed
|
|
||||||
python crawler_service.py 20
|
|
||||||
```
|
|
||||||
|
|
||||||
### As a Module
|
|
||||||
|
|
||||||
```python
|
|
||||||
from crawler_service import crawl_all_feeds, crawl_rss_feed
|
|
||||||
|
|
||||||
# Crawl all active feeds
|
|
||||||
result = crawl_all_feeds(max_articles_per_feed=10)
|
|
||||||
print(result)
|
|
||||||
|
|
||||||
# Crawl a specific feed
|
|
||||||
crawl_rss_feed(
|
|
||||||
feed_url='https://example.com/rss',
|
|
||||||
feed_name='Example News',
|
|
||||||
max_articles=10
|
|
||||||
)
|
|
||||||
```
|
|
||||||
|
|
||||||
### Via Backend API
|
|
||||||
|
|
||||||
The backend has integrated endpoints:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Start crawler
|
|
||||||
curl -X POST http://localhost:5001/api/crawler/start
|
|
||||||
|
|
||||||
# Check status
|
|
||||||
curl http://localhost:5001/api/crawler/status
|
|
||||||
|
|
||||||
# Crawl specific feed
|
|
||||||
curl -X POST http://localhost:5001/api/crawler/feed/<feed_id>
|
|
||||||
```
|
|
||||||
|
|
||||||
## How It Works
|
|
||||||
|
|
||||||
1. **Fetch RSS Feeds**: Gets all active RSS feeds from MongoDB
|
|
||||||
2. **Parse Feed**: Extracts article links from each feed
|
|
||||||
3. **Crawl Content**: For each article:
|
|
||||||
- Fetches HTML page
|
|
||||||
- Removes scripts, styles, navigation
|
|
||||||
- Extracts main content using smart selectors
|
|
||||||
- Calculates word count
|
|
||||||
4. **Store Data**: Saves to MongoDB with metadata
|
|
||||||
5. **Skip Duplicates**: Avoids re-crawling articles with existing content
|
|
||||||
|
|
||||||
## Content Extraction Strategy
|
|
||||||
|
|
||||||
The crawler tries multiple selectors in order:
|
|
||||||
|
|
||||||
1. `<article>` tag
|
|
||||||
2. Elements with class containing "article-content", "article-body"
|
|
||||||
3. Elements with class containing "post-content", "entry-content"
|
|
||||||
4. `<main>` tag
|
|
||||||
5. Fallback to all `<p>` tags in body
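
That cascade could be sketched with BeautifulSoup roughly as follows (the selector regexes and helper name are illustrative, not the crawler's exact code):

```python
# Illustrative selector cascade for pulling the main article text
import re
from bs4 import BeautifulSoup

def extract_main_text(html):
    soup = BeautifulSoup(html, 'html.parser')
    # Strip elements that never belong to the article body
    for tag in soup(['script', 'style', 'nav', 'header', 'footer']):
        tag.decompose()
    candidates = [
        soup.find('article'),
        soup.find(class_=re.compile(r'article-(content|body)')),
        soup.find(class_=re.compile(r'(post|entry)-content')),
        soup.find('main'),
        soup.body,  # fallback: all <p> tags in the body
    ]
    for node in candidates:
        if node is None:
            continue
        text = ' '.join(p.get_text(' ', strip=True) for p in node.find_all('p'))
        if text.strip():
            return text
    return ''
```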
|
|
||||||
|
|
||||||
## Database Schema
|
|
||||||
|
|
||||||
Articles are stored with these fields:
|
|
||||||
|
|
||||||
```javascript
|
|
||||||
{
|
|
||||||
title: String, // Article title
|
|
||||||
link: String, // Article URL (unique)
|
|
||||||
summary: String, // Short summary
|
|
||||||
full_content: String, // Full article text (max 10,000 chars)
|
|
||||||
word_count: Number, // Number of words
|
|
||||||
source: String, // RSS feed name
|
|
||||||
published_at: String, // Publication date
|
|
||||||
crawled_at: DateTime, // When content was crawled
|
|
||||||
created_at: DateTime // When added to database
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
## Scheduling
|
|
||||||
|
|
||||||
### Using Cron (Linux/Mac)
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Run every 6 hours
|
|
||||||
0 */6 * * * cd /path/to/news_crawler && /path/to/venv/bin/python crawler_service.py
|
|
||||||
```
|
|
||||||
|
|
||||||
### Using systemd Timer (Linux)
|
|
||||||
|
|
||||||
Create `/etc/systemd/system/news-crawler.service`:
|
|
||||||
```ini
|
|
||||||
[Unit]
|
|
||||||
Description=News Crawler Service
|
|
||||||
|
|
||||||
[Service]
|
|
||||||
Type=oneshot
|
|
||||||
WorkingDirectory=/path/to/news_crawler
|
|
||||||
ExecStart=/path/to/venv/bin/python crawler_service.py
|
|
||||||
User=your-user
|
|
||||||
```
|
|
||||||
|
|
||||||
Create `/etc/systemd/system/news-crawler.timer`:
|
|
||||||
```ini
|
|
||||||
[Unit]
|
|
||||||
Description=Run News Crawler every 6 hours
|
|
||||||
|
|
||||||
[Timer]
|
|
||||||
OnBootSec=5min
|
|
||||||
OnUnitActiveSec=6h
|
|
||||||
|
|
||||||
[Install]
|
|
||||||
WantedBy=timers.target
|
|
||||||
```
|
|
||||||
|
|
||||||
Enable and start:
|
|
||||||
```bash
|
|
||||||
sudo systemctl enable news-crawler.timer
|
|
||||||
sudo systemctl start news-crawler.timer
|
|
||||||
```
|
|
||||||
|
|
||||||
### Using Docker
|
|
||||||
|
|
||||||
Create `Dockerfile`:
|
|
||||||
```dockerfile
|
|
||||||
FROM python:3.11-slim
|
|
||||||
|
|
||||||
WORKDIR /app
|
|
||||||
|
|
||||||
COPY requirements.txt .
|
|
||||||
RUN pip install --no-cache-dir -r requirements.txt
|
|
||||||
|
|
||||||
COPY crawler_service.py .
|
|
||||||
|
|
||||||
CMD ["python", "crawler_service.py"]
|
|
||||||
```
|
|
||||||
|
|
||||||
Build and run:
|
|
||||||
```bash
|
|
||||||
docker build -t news-crawler .
|
|
||||||
docker run --env-file ../.env news-crawler
|
|
||||||
```
|
|
||||||
|
|
||||||
## Configuration
|
|
||||||
|
|
||||||
Environment variables:
|
|
||||||
|
|
||||||
- `MONGODB_URI` - MongoDB connection string (default: `mongodb://localhost:27017/`)
|
|
||||||
|
|
||||||
## Rate Limiting
|
|
||||||
|
|
||||||
- 1 second delay between article requests
|
|
||||||
- Respects server resources
|
|
||||||
- User-Agent header included
|
|
||||||
|
|
||||||
## Troubleshooting
|
|
||||||
|
|
||||||
**Issue: Can't extract content**
|
|
||||||
- Some sites block scrapers
|
|
||||||
- Try adjusting User-Agent header
|
|
||||||
- Some sites require JavaScript (consider Selenium)
|
|
||||||
|
|
||||||
**Issue: Timeout errors**
|
|
||||||
- Increase timeout in `extract_article_content()`
|
|
||||||
- Check network connectivity
|
|
||||||
|
|
||||||
**Issue: Memory usage**
|
|
||||||
- Reduce `max_articles_per_feed`
|
|
||||||
- Content limited to 10,000 characters per article
|
|
||||||
|
|
||||||
## Architecture
|
|
||||||
|
|
||||||
This is a standalone microservice that:
|
|
||||||
- Can run independently of the main backend
|
|
||||||
- Shares the same MongoDB database
|
|
||||||
- Can be deployed separately
|
|
||||||
- Can be scheduled independently
|
|
||||||
|
|
||||||
## Next Steps
|
|
||||||
|
|
||||||
Once articles are crawled, you can:
|
|
||||||
- Use Ollama to summarize articles
|
|
||||||
- Perform sentiment analysis
|
|
||||||
- Extract keywords and topics
|
|
||||||
- Generate newsletter content
|
|
||||||
- Create article recommendations
|
|
||||||
@@ -1,33 +0,0 @@
|
|||||||
version: '3.8'
|
|
||||||
|
|
||||||
services:
|
|
||||||
crawler:
|
|
||||||
build: .
|
|
||||||
container_name: news-crawler
|
|
||||||
environment:
|
|
||||||
- MONGODB_URI=mongodb://mongodb:27017/
|
|
||||||
networks:
|
|
||||||
- munich-news-network
|
|
||||||
depends_on:
|
|
||||||
- mongodb
|
|
||||||
# Run once and exit
|
|
||||||
restart: "no"
|
|
||||||
|
|
||||||
mongodb:
|
|
||||||
image: mongo:7.0
|
|
||||||
container_name: munich-news-mongodb
|
|
||||||
restart: unless-stopped
|
|
||||||
ports:
|
|
||||||
- "27017:27017"
|
|
||||||
volumes:
|
|
||||||
- mongodb_data:/data/db
|
|
||||||
networks:
|
|
||||||
- munich-news-network
|
|
||||||
|
|
||||||
volumes:
|
|
||||||
mongodb_data:
|
|
||||||
driver: local
|
|
||||||
|
|
||||||
networks:
|
|
||||||
munich-news-network:
|
|
||||||
driver: bridge
|
|
||||||
@@ -4,3 +4,5 @@ requests==2.31.0
|
|||||||
feedparser==6.0.10
|
feedparser==6.0.10
|
||||||
pymongo==4.6.1
|
pymongo==4.6.1
|
||||||
python-dotenv==1.0.0
|
python-dotenv==1.0.0
|
||||||
|
schedule==1.2.0
|
||||||
|
pytz==2023.3
|
||||||
|
|||||||
75
news_crawler/scheduled_crawler.py
Executable file
@@ -0,0 +1,75 @@
|
|||||||
|
#!/usr/bin/env python3
|
||||||
|
"""
|
||||||
|
Scheduled crawler that runs daily at 6 AM Berlin time
|
||||||
|
"""
|
||||||
|
import schedule
|
||||||
|
import time
|
||||||
|
from datetime import datetime
|
||||||
|
import pytz
|
||||||
|
from crawler_service import crawl_all_feeds
|
||||||
|
|
||||||
|
# Berlin timezone
|
||||||
|
BERLIN_TZ = pytz.timezone('Europe/Berlin')
|
||||||
|
|
||||||
|
def run_crawler():
|
||||||
|
"""Run the crawler and log the execution"""
|
||||||
|
berlin_time = datetime.now(BERLIN_TZ)
|
||||||
|
print(f"\n{'='*60}")
|
||||||
|
print(f"🕐 Scheduled crawler started at {berlin_time.strftime('%Y-%m-%d %H:%M:%S %Z')}")
|
||||||
|
print(f"{'='*60}\n")
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Run crawler with max 20 articles per feed
|
||||||
|
result = crawl_all_feeds(max_articles_per_feed=20)
|
||||||
|
|
||||||
|
print(f"\n{'='*60}")
|
||||||
|
print(f"✓ Scheduled crawler completed successfully")
|
||||||
|
print(f" Articles crawled: {result['total_articles_crawled']}")
|
||||||
|
print(f" Duration: {result['duration_seconds']}s")
|
||||||
|
print(f"{'='*60}\n")
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
print(f"\n{'='*60}")
|
||||||
|
print(f"✗ Scheduled crawler failed: {e}")
|
||||||
|
print(f"{'='*60}\n")
|
||||||
|
|
||||||
|
def main():
|
||||||
|
"""Main scheduler loop"""
|
||||||
|
print("🤖 Munich News Crawler Scheduler")
|
||||||
|
print("="*60)
|
||||||
|
print("Schedule: Daily at 6:00 AM Berlin time")
|
||||||
|
print("Timezone: Europe/Berlin (CET/CEST)")
|
||||||
|
print("="*60)
|
||||||
|
|
||||||
|
# Schedule the crawler to run at 6 AM Berlin time
|
||||||
|
schedule.every().day.at("06:00").do(run_crawler)
|
||||||
|
|
||||||
|
# Show next run time
|
||||||
|
berlin_time = datetime.now(BERLIN_TZ)
|
||||||
|
print(f"\nCurrent time (Berlin): {berlin_time.strftime('%Y-%m-%d %H:%M:%S %Z')}")
|
||||||
|
|
||||||
|
# Get next scheduled run
|
||||||
|
next_run = schedule.next_run()
|
||||||
|
if next_run:
|
||||||
|
# Convert to Berlin time for display
|
||||||
|
next_run_berlin = next_run.astimezone(BERLIN_TZ)
|
||||||
|
print(f"Next scheduled run: {next_run_berlin.strftime('%Y-%m-%d %H:%M:%S %Z')}")
|
||||||
|
|
||||||
|
print("\n⏳ Scheduler is running... (Press Ctrl+C to stop)\n")
|
||||||
|
|
||||||
|
# Run immediately on startup (optional - comment out if you don't want this)
|
||||||
|
print("🚀 Running initial crawl on startup...")
|
||||||
|
run_crawler()
|
||||||
|
|
||||||
|
# Keep the scheduler running
|
||||||
|
while True:
|
||||||
|
schedule.run_pending()
|
||||||
|
time.sleep(60) # Check every minute
|
||||||
|
|
||||||
|
if __name__ == '__main__':
|
||||||
|
try:
|
||||||
|
main()
|
||||||
|
except KeyboardInterrupt:
|
||||||
|
print("\n\n👋 Scheduler stopped by user")
|
||||||
|
except Exception as e:
|
||||||
|
print(f"\n\n✗ Scheduler error: {e}")
|
||||||
24
news_sender/Dockerfile
Normal file
@@ -0,0 +1,24 @@
|
|||||||
|
FROM python:3.11-slim
|
||||||
|
|
||||||
|
WORKDIR /app
|
||||||
|
|
||||||
|
# Install dependencies
|
||||||
|
COPY requirements.txt .
|
||||||
|
RUN pip install --no-cache-dir -r requirements.txt
|
||||||
|
|
||||||
|
# Copy sender files
|
||||||
|
COPY . .
|
||||||
|
|
||||||
|
# Copy backend files (needed for tracking and config)
|
||||||
|
COPY ../backend/services /app/backend/services
|
||||||
|
COPY ../backend/.env /app/.env
|
||||||
|
|
||||||
|
# Make the scheduler executable
|
||||||
|
RUN chmod +x scheduled_sender.py
|
||||||
|
|
||||||
|
# Set timezone to Berlin
|
||||||
|
ENV TZ=Europe/Berlin
|
||||||
|
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
|
||||||
|
|
||||||
|
# Run the scheduled sender
|
||||||
|
CMD ["python", "-u", "scheduled_sender.py"]
|
||||||
@@ -1,303 +0,0 @@
# News Sender Microservice

Standalone service for sending Munich News Daily newsletters to subscribers.

## Features

- 📧 Sends beautiful HTML newsletters
- 🤖 Uses AI-generated article summaries
- 📊 Tracks sending statistics
- 🧪 Test mode for development
- 📝 Preview generation
- 🔄 Fetches data from shared MongoDB

## Installation

```bash
cd news_sender
pip install -r requirements.txt
```

## Configuration

The service uses the same `.env` file as the backend (`../backend/.env`):

```env
# MongoDB
MONGODB_URI=mongodb://localhost:27017/

# Email (Gmail example)
SMTP_SERVER=smtp.gmail.com
SMTP_PORT=587
EMAIL_USER=your-email@gmail.com
EMAIL_PASSWORD=your-app-password

# Newsletter Settings (optional)
NEWSLETTER_MAX_ARTICLES=10
WEBSITE_URL=http://localhost:3000
```

**Gmail Setup:**
1. Enable 2-factor authentication
2. Generate an App Password: https://support.google.com/accounts/answer/185833
3. Use the App Password (not your regular password)

## Usage

### 1. Preview Newsletter

Generate an HTML preview without sending:

```bash
python sender_service.py preview
```

This creates `newsletter_preview.html` - open it in your browser to see how the newsletter looks.

### 2. Send Test Email

Send to a single email address for testing:

```bash
python sender_service.py test your-email@example.com
```

### 3. Send to All Subscribers

Send the newsletter to all active subscribers:

```bash
# Send with default article count (10)
python sender_service.py send

# Send with custom article count
python sender_service.py send 15
```

### 4. Use as Python Module

```python
from sender_service import send_newsletter, preview_newsletter

# Send newsletter
result = send_newsletter(max_articles=10)
print(f"Sent to {result['sent_count']} subscribers")

# Generate preview
html = preview_newsletter(max_articles=5)
```

## How It Works

```
┌────────────────────────────────────────────────┐
│ 1. Fetch Articles from MongoDB                 │
│    - Get latest articles with AI summaries     │
│    - Sort by creation date (newest first)      │
└────────────────────────────────────────────────┘
                        ↓
┌────────────────────────────────────────────────┐
│ 2. Fetch Active Subscribers                    │
│    - Get all subscribers with status='active'  │
└────────────────────────────────────────────────┘
                        ↓
┌────────────────────────────────────────────────┐
│ 3. Render Newsletter HTML                      │
│    - Load newsletter_template.html             │
│    - Populate with articles and metadata       │
│    - Generate beautiful HTML email             │
└────────────────────────────────────────────────┘
                        ↓
┌────────────────────────────────────────────────┐
│ 4. Send Emails                                 │
│    - Connect to SMTP server                    │
│    - Send to each subscriber                   │
│    - Track success/failure                     │
└────────────────────────────────────────────────┘
                        ↓
┌────────────────────────────────────────────────┐
│ 5. Report Statistics                           │
│    - Total sent                                │
│    - Failed sends                              │
│    - Error details                             │
└────────────────────────────────────────────────┘
```

## Output Example

```
======================================================================
📧 Munich News Daily - Newsletter Sender
======================================================================

Fetching latest 10 articles with AI summaries...
✓ Found 10 articles

Fetching active subscribers...
✓ Found 150 active subscriber(s)

Rendering newsletter HTML...
✓ Newsletter rendered

Sending newsletter: 'Munich News Daily - November 10, 2024'
----------------------------------------------------------------------
[1/150] Sending to user1@example.com... ✓
[2/150] Sending to user2@example.com... ✓
[3/150] Sending to user3@example.com... ✓
...

======================================================================
📊 Sending Complete
======================================================================
✓ Successfully sent: 148
✗ Failed: 2
📰 Articles included: 10
======================================================================
```

## Scheduling

### Using Cron (Linux/Mac)

Send the newsletter daily at 8 AM:

```bash
# Edit crontab
crontab -e

# Add this line
0 8 * * * cd /path/to/news_sender && /path/to/venv/bin/python sender_service.py send
```

### Using systemd Timer (Linux)

Create `/etc/systemd/system/news-sender.service`:

```ini
[Unit]
Description=Munich News Sender

[Service]
Type=oneshot
WorkingDirectory=/path/to/news_sender
ExecStart=/path/to/venv/bin/python sender_service.py send
User=your-user
```

Create `/etc/systemd/system/news-sender.timer`:

```ini
[Unit]
Description=Send Munich News Daily at 8 AM

[Timer]
OnCalendar=*-*-* 08:00:00

[Install]
WantedBy=timers.target
```

Enable and start:

```bash
sudo systemctl enable news-sender.timer
sudo systemctl start news-sender.timer
```

### Using Docker

Create `Dockerfile`:

```dockerfile
FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY sender_service.py newsletter_template.html ./

CMD ["python", "sender_service.py", "send"]
```

Build and run:

```bash
docker build -t news-sender .
docker run --env-file ../backend/.env news-sender
```

## Troubleshooting

### "Email credentials not configured"
- Check that `EMAIL_USER` and `EMAIL_PASSWORD` are set in `.env`
- For Gmail, use an App Password, not your regular password

### "No articles with summaries found"
- Run the crawler first: `cd ../news_crawler && python crawler_service.py 10`
- Make sure Ollama is enabled and working
- Check that MongoDB has articles with a `summary` field

### "No active subscribers found"
- Add subscribers via the backend API
- Check that the subscriber status is 'active' in MongoDB

### SMTP Connection Errors
- Verify that the SMTP server and port are correct
- Check that a firewall isn't blocking the SMTP port
- For Gmail, use an App Password ("less secure app access" is no longer offered)

### Emails Going to Spam
- Set up SPF, DKIM, and DMARC records for your domain
- Use a verified email address
- Avoid spam trigger words in the subject and content
- Include an unsubscribe link (already included in the template)

## Architecture

This is a standalone microservice that:
- Runs independently of the backend
- Shares the same MongoDB database
- Can be deployed separately
- Can be scheduled independently
- Has no dependencies on backend code

## Integration with Other Services

```
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   Backend    │     │   Crawler    │     │    Sender    │
│   (Flask)    │     │  (Scraper)   │     │   (Email)    │
└──────┬───────┘     └──────┬───────┘     └──────┬───────┘
       │                    │                    │
       │                    │                    │
       └────────────────────┴────────────────────┘
                            │
                    ┌───────▼────────┐
                    │    MongoDB     │
                    │  (Shared DB)   │
                    └────────────────┘
```

## Next Steps

1. **Test the newsletter:**
   ```bash
   python sender_service.py test your-email@example.com
   ```

2. **Schedule daily sending:**
   - Set up a cron job or systemd timer
   - Choose an appropriate time (e.g., 8 AM)

3. **Monitor sending:**
   - Check logs for errors
   - Track open rates (requires email tracking service)
   - Monitor spam complaints

4. **Optimize:**
   - Add email tracking pixels
   - A/B test subject lines
   - Personalize content per subscriber
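The architecture notes above describe three services that coordinate only through the shared MongoDB instance. As a rough illustration of what that coupling looks like, the sketch below shows how any one of the services could resolve the connection from the shared `backend/.env`; the database and collection names here are assumptions, not taken from this diff.

```python
import os
from pathlib import Path

from dotenv import load_dotenv
from pymongo import MongoClient

# All three services read the same backend/.env, so they agree on MONGODB_URI
load_dotenv(Path(__file__).parent.parent / 'backend' / '.env')

client = MongoClient(os.getenv('MONGODB_URI', 'mongodb://localhost:27017/'))
db = client['munich_news']        # database name is illustrative
articles = db['articles']         # written by the crawler, read by the sender
subscribers = db['subscribers']   # written by the backend, read by the sender
```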
@@ -146,6 +146,14 @@
<a href="{{ unsubscribe_link }}" style="color: #999999; text-decoration: none;">Unsubscribe</a>
</p>

{% if tracking_enabled %}
<!-- Privacy Notice -->
<p style="margin: 20px 0 0 0; font-size: 11px; color: #666666; line-height: 1.4;">
This email contains tracking to measure engagement and improve our content.<br>
We respect your privacy and anonymize data after 90 days.
</p>
{% endif %}

<p style="margin: 20px 0 0 0; font-size: 11px; color: #666666;">
© {{ year }} Munich News Daily. All rights reserved.
</p>
@@ -1,3 +1,6 @@
pymongo==4.6.1
python-dotenv==1.0.0
Jinja2==3.1.2
beautifulsoup4==4.12.2
schedule==1.2.0
pytz==2023.3
178
news_sender/scheduled_sender.py
Executable file
@@ -0,0 +1,178 @@
|
|||||||
|
#!/usr/bin/env python3
|
||||||
|
"""
|
||||||
|
Scheduled newsletter sender that runs daily at 7 AM Berlin time
|
||||||
|
Waits for crawler to finish before sending to ensure fresh content
|
||||||
|
"""
|
||||||
|
import schedule
|
||||||
|
import time
|
||||||
|
from datetime import datetime, timedelta
|
||||||
|
import pytz
|
||||||
|
from pathlib import Path
|
||||||
|
import sys
|
||||||
|
|
||||||
|
# Add current directory to path
|
||||||
|
sys.path.insert(0, str(Path(__file__).parent))
|
||||||
|
|
||||||
|
from sender_service import send_newsletter, get_latest_articles, Config
|
||||||
|
|
||||||
|
# Berlin timezone
|
||||||
|
BERLIN_TZ = pytz.timezone('Europe/Berlin')
|
||||||
|
|
||||||
|
# Maximum time to wait for crawler (in minutes)
|
||||||
|
MAX_WAIT_TIME = 30
|
||||||
|
|
||||||
|
def check_crawler_finished():
|
||||||
|
"""
|
||||||
|
Check if crawler has finished by looking for recent articles
|
||||||
|
Returns: (bool, str) - (is_finished, message)
|
||||||
|
"""
|
||||||
|
try:
|
||||||
|
# Check if we have articles from today
|
||||||
|
articles = get_latest_articles(max_articles=1, hours=2)
|
||||||
|
|
||||||
|
if articles:
|
||||||
|
# Check if the most recent article was crawled recently (within last 2 hours)
|
||||||
|
latest_article = articles[0]
|
||||||
|
crawled_at = latest_article.get('crawled_at')
|
||||||
|
|
||||||
|
if crawled_at:
|
||||||
|
time_since_crawl = datetime.utcnow() - crawled_at
|
||||||
|
minutes_since = time_since_crawl.total_seconds() / 60
|
||||||
|
|
||||||
|
if minutes_since < 120: # Within last 2 hours
|
||||||
|
return True, f"Crawler finished {int(minutes_since)} minutes ago"
|
||||||
|
|
||||||
|
return False, "No recent articles found"
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
return False, f"Error checking crawler status: {e}"
|
||||||
|
|
||||||
|
|
||||||
|
def wait_for_crawler(max_wait_minutes=30):
|
||||||
|
"""
|
||||||
|
Wait for crawler to finish before sending newsletter
|
||||||
|
|
||||||
|
Args:
|
||||||
|
max_wait_minutes: Maximum time to wait in minutes
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
bool: True if crawler finished, False if timeout
|
||||||
|
"""
|
||||||
|
berlin_time = datetime.now(BERLIN_TZ)
|
||||||
|
print(f"\n⏳ Waiting for crawler to finish...")
|
||||||
|
print(f" Current time: {berlin_time.strftime('%H:%M:%S %Z')}")
|
||||||
|
print(f" Max wait time: {max_wait_minutes} minutes")
|
||||||
|
|
||||||
|
start_time = time.time()
|
||||||
|
check_interval = 30 # Check every 30 seconds
|
||||||
|
|
||||||
|
while True:
|
||||||
|
elapsed_minutes = (time.time() - start_time) / 60
|
||||||
|
|
||||||
|
# Check if crawler finished
|
||||||
|
is_finished, message = check_crawler_finished()
|
||||||
|
|
||||||
|
if is_finished:
|
||||||
|
print(f" ✓ {message}")
|
||||||
|
return True
|
||||||
|
|
||||||
|
# Check if we've exceeded max wait time
|
||||||
|
if elapsed_minutes >= max_wait_minutes:
|
||||||
|
print(f" ⚠ Timeout after {max_wait_minutes} minutes")
|
||||||
|
print(f" Proceeding with available articles...")
|
||||||
|
return False
|
||||||
|
|
||||||
|
# Show progress
|
||||||
|
remaining = max_wait_minutes - elapsed_minutes
|
||||||
|
print(f" ⏳ Still waiting... ({remaining:.1f} minutes remaining) - {message}")
|
||||||
|
|
||||||
|
# Wait before next check
|
||||||
|
time.sleep(check_interval)
|
||||||
|
|
||||||
|
|
||||||
|
def run_sender():
|
||||||
|
"""Run the newsletter sender with crawler coordination"""
|
||||||
|
berlin_time = datetime.now(BERLIN_TZ)
|
||||||
|
print(f"\n{'='*70}")
|
||||||
|
print(f"📧 Scheduled newsletter sender started")
|
||||||
|
print(f" Time: {berlin_time.strftime('%Y-%m-%d %H:%M:%S %Z')}")
|
||||||
|
print(f"{'='*70}\n")
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Wait for crawler to finish (max 30 minutes)
|
||||||
|
crawler_finished = wait_for_crawler(max_wait_minutes=MAX_WAIT_TIME)
|
||||||
|
|
||||||
|
if not crawler_finished:
|
||||||
|
print(f"\n⚠ Crawler may still be running, but proceeding anyway...")
|
||||||
|
|
||||||
|
print(f"\n{'='*70}")
|
||||||
|
print(f"📧 Starting newsletter send...")
|
||||||
|
print(f"{'='*70}\n")
|
||||||
|
|
||||||
|
# Send newsletter to all subscribers
|
||||||
|
result = send_newsletter(max_articles=Config.MAX_ARTICLES)
|
||||||
|
|
||||||
|
if result['success']:
|
||||||
|
print(f"\n{'='*70}")
|
||||||
|
print(f"✅ Newsletter sent successfully!")
|
||||||
|
print(f" Sent: {result['sent_count']}/{result['total_subscribers']}")
|
||||||
|
print(f" Articles: {result['article_count']}")
|
||||||
|
print(f" Failed: {result['failed_count']}")
|
||||||
|
print(f"{'='*70}\n")
|
||||||
|
else:
|
||||||
|
print(f"\n{'='*70}")
|
||||||
|
print(f"❌ Newsletter send failed: {result.get('error', 'Unknown error')}")
|
||||||
|
print(f"{'='*70}\n")
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
print(f"\n{'='*70}")
|
||||||
|
print(f"❌ Scheduled sender error: {e}")
|
||||||
|
print(f"{'='*70}\n")
|
||||||
|
import traceback
|
||||||
|
traceback.print_exc()
|
||||||
|
|
||||||
|
|
||||||
|
def main():
|
||||||
|
"""Main scheduler loop"""
|
||||||
|
print("📧 Munich News Newsletter Scheduler")
|
||||||
|
print("="*70)
|
||||||
|
print("Schedule: Daily at 7:00 AM Berlin time")
|
||||||
|
print("Timezone: Europe/Berlin (CET/CEST)")
|
||||||
|
print("Coordination: Waits for crawler to finish (max 30 min)")
|
||||||
|
print("="*70)
|
||||||
|
|
||||||
|
# Schedule the sender to run at 7 AM Berlin time
|
||||||
|
schedule.every().day.at("07:00").do(run_sender)
|
||||||
|
|
||||||
|
# Show next run time
|
||||||
|
berlin_time = datetime.now(BERLIN_TZ)
|
||||||
|
print(f"\nCurrent time (Berlin): {berlin_time.strftime('%Y-%m-%d %H:%M:%S %Z')}")
|
||||||
|
|
||||||
|
# Get next scheduled run
|
||||||
|
next_run = schedule.next_run()
|
||||||
|
if next_run:
|
||||||
|
# Convert to Berlin time for display
|
||||||
|
next_run_berlin = next_run.astimezone(BERLIN_TZ)
|
||||||
|
print(f"Next scheduled run: {next_run_berlin.strftime('%Y-%m-%d %H:%M:%S %Z')}")
|
||||||
|
|
||||||
|
print("\n⏳ Scheduler is running... (Press Ctrl+C to stop)\n")
|
||||||
|
|
||||||
|
# Optional: Run immediately on startup (comment out if you don't want this)
|
||||||
|
# print("🚀 Running initial send on startup...")
|
||||||
|
# run_sender()
|
||||||
|
|
||||||
|
# Keep the scheduler running
|
||||||
|
while True:
|
||||||
|
schedule.run_pending()
|
||||||
|
time.sleep(60) # Check every minute
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == '__main__':
|
||||||
|
try:
|
||||||
|
main()
|
||||||
|
except KeyboardInterrupt:
|
||||||
|
print("\n\n👋 Scheduler stopped by user")
|
||||||
|
except Exception as e:
|
||||||
|
print(f"\n\n❌ Scheduler error: {e}")
|
||||||
|
import traceback
|
||||||
|
traceback.print_exc()
|
||||||
@@ -11,8 +11,17 @@ from pathlib import Path
from jinja2 import Template
from pymongo import MongoClient
import os
import sys
from dotenv import load_dotenv

# Add backend directory to path for importing tracking service
backend_dir = Path(__file__).parent.parent / 'backend'
sys.path.insert(0, str(backend_dir))

# Import tracking modules
from services import tracking_service
from tracking_integration import inject_tracking_pixel, replace_article_links, generate_tracking_urls

# Load environment variables from backend/.env
backend_dir = Path(__file__).parent.parent / 'backend'
env_path = backend_dir / '.env'
@@ -41,6 +50,11 @@ class Config:
    HOURS_LOOKBACK = int(os.getenv('NEWSLETTER_HOURS_LOOKBACK', '24'))
    WEBSITE_URL = os.getenv('WEBSITE_URL', 'http://localhost:3000')

    # Tracking
    TRACKING_ENABLED = os.getenv('TRACKING_ENABLED', 'true').lower() == 'true'
    TRACKING_API_URL = os.getenv('TRACKING_API_URL', 'http://localhost:5001')
    TRACKING_DATA_RETENTION_DAYS = int(os.getenv('TRACKING_DATA_RETENTION_DAYS', '90'))


# MongoDB connection
client = MongoClient(Config.MONGODB_URI)
@@ -117,15 +131,20 @@ def get_active_subscribers():
    return [doc['email'] for doc in cursor]


def render_newsletter_html(articles):
def render_newsletter_html(articles, tracking_enabled=False, pixel_tracking_id=None,
                           link_tracking_map=None, api_url=None):
    """
    Render newsletter HTML from template
    Render newsletter HTML from template with optional tracking integration

    Args:
        articles: List of article dictionaries
        tracking_enabled: Whether to inject tracking pixel and replace links
        pixel_tracking_id: Tracking ID for the email open pixel
        link_tracking_map: Dictionary mapping original URLs to tracking IDs
        api_url: Base URL for the tracking API

    Returns:
        str: Rendered HTML content
        str: Rendered HTML content with tracking injected if enabled
    """
    # Load template
    template_path = Path(__file__).parent / 'newsletter_template.html'
@@ -142,11 +161,23 @@ def render_newsletter_html(articles):
        'article_count': len(articles),
        'articles': articles,
        'unsubscribe_link': f'{Config.WEBSITE_URL}/unsubscribe',
        'website_link': Config.WEBSITE_URL
        'website_link': Config.WEBSITE_URL,
        'tracking_enabled': tracking_enabled
    }

    # Render HTML
    return template.render(**template_data)
    html = template.render(**template_data)

    # Inject tracking if enabled
    if tracking_enabled and pixel_tracking_id and api_url:
        # Inject tracking pixel
        html = inject_tracking_pixel(html, pixel_tracking_id, api_url)

        # Replace article links with tracking URLs
        if link_tracking_map:
            html = replace_article_links(html, link_tracking_map, api_url)

    return html


def send_email(to_email, subject, html_content):
@@ -246,14 +277,14 @@ def send_newsletter(max_articles=None, test_email=None):
            'error': 'No active subscribers'
        }

    # Render newsletter
    # Generate newsletter ID (date-based)
    print("\nRendering newsletter HTML...")
    newsletter_id = f"newsletter-{datetime.now().strftime('%Y-%m-%d')}"
    html_content = render_newsletter_html(articles)
    print("✓ Newsletter rendered")

    # Send to subscribers
    subject = f"Munich News Daily - {datetime.now().strftime('%B %d, %Y')}"
    print(f"\nSending newsletter: '{subject}'")
    print(f"Newsletter ID: {newsletter_id}")
    print(f"Tracking enabled: {Config.TRACKING_ENABLED}")
    print("-" * 70)

    sent_count = 0
@@ -262,6 +293,34 @@ def send_newsletter(max_articles=None, test_email=None):
    for i, email in enumerate(subscribers, 1):
        print(f"[{i}/{len(subscribers)}] Sending to {email}...", end=' ')

        # Generate tracking data for this subscriber if tracking is enabled
        if Config.TRACKING_ENABLED:
            try:
                tracking_data = generate_tracking_urls(
                    articles=articles,
                    newsletter_id=newsletter_id,
                    subscriber_email=email,
                    tracking_service=tracking_service
                )

                # Render newsletter with tracking
                html_content = render_newsletter_html(
                    articles=articles,
                    tracking_enabled=True,
                    pixel_tracking_id=tracking_data['pixel_tracking_id'],
                    link_tracking_map=tracking_data['link_tracking_map'],
                    api_url=Config.TRACKING_API_URL
                )
            except Exception as e:
                print(f"⚠ Tracking error: {e}, sending without tracking...", end=' ')
                # Fallback: send without tracking
                html_content = render_newsletter_html(articles)
        else:
            # Render newsletter without tracking
            html_content = render_newsletter_html(articles)

        # Send email
        success, error = send_email(email, subject, html_content)

        if success:
@@ -310,12 +369,11 @@ def preview_newsletter(max_articles=None, hours=None):
        today_date = datetime.now().strftime('%B %d, %Y')
        return f"<h1>No articles from today found</h1><p>No articles published today ({today_date}). Run the crawler with Ollama enabled to get fresh content.</p>"

    return render_newsletter_html(articles)
    # Preview without tracking
    return render_newsletter_html(articles, tracking_enabled=False)


if __name__ == '__main__':
    import sys

    # Parse command line arguments
    if len(sys.argv) > 1:
        command = sys.argv[1]
150
news_sender/tracking_integration.py
Normal file
@@ -0,0 +1,150 @@
|
|||||||
|
"""
|
||||||
|
Tracking integration module for Munich News Daily newsletter system.
|
||||||
|
Handles injection of tracking pixels and replacement of article links with tracking URLs.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import re
|
||||||
|
from typing import Dict, List
|
||||||
|
from bs4 import BeautifulSoup
|
||||||
|
|
||||||
|
|
||||||
|
def inject_tracking_pixel(html: str, tracking_id: str, api_url: str) -> str:
|
||||||
|
"""
|
||||||
|
Inject tracking pixel into newsletter HTML before closing </body> tag.
|
||||||
|
|
||||||
|
The tracking pixel is a 1x1 transparent image that loads when the email is opened,
|
||||||
|
allowing us to track email opens.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
html: Original newsletter HTML content
|
||||||
|
tracking_id: Unique tracking ID for this newsletter send (None if tracking disabled)
|
||||||
|
api_url: Base URL for the tracking API (e.g., http://localhost:5001)
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
str: HTML with tracking pixel injected (unchanged if tracking_id is None)
|
||||||
|
|
||||||
|
Example:
|
||||||
|
>>> html = '<html><body><p>Content</p></body></html>'
|
||||||
|
>>> inject_tracking_pixel(html, 'abc-123', 'http://api.example.com')
|
||||||
|
'<html><body><p>Content</p><img src="http://api.example.com/api/track/pixel/abc-123" width="1" height="1" alt="" /></body></html>'
|
||||||
|
"""
|
||||||
|
# Skip tracking if no tracking_id provided (subscriber opted out)
|
||||||
|
if not tracking_id:
|
||||||
|
return html
|
||||||
|
|
||||||
|
# Construct tracking pixel URL
|
||||||
|
pixel_url = f"{api_url}/api/track/pixel/{tracking_id}"
|
||||||
|
|
||||||
|
# Create tracking pixel HTML
|
||||||
|
pixel_html = f'<img src="{pixel_url}" width="1" height="1" alt="" style="display:block;" />'
|
||||||
|
|
||||||
|
# Inject pixel before closing </body> tag
|
||||||
|
if '</body>' in html:
|
||||||
|
html = html.replace('</body>', f'{pixel_html}</body>')
|
||||||
|
else:
|
||||||
|
# Fallback: append to end if no </body> tag found
|
||||||
|
html += pixel_html
|
||||||
|
|
||||||
|
return html
|
||||||
|
|
||||||
|
|
||||||
|
def replace_article_links(
|
||||||
|
html: str,
|
||||||
|
link_tracking_map: Dict[str, str],
|
||||||
|
api_url: str
|
||||||
|
) -> str:
|
||||||
|
"""
|
||||||
|
Replace article links in newsletter HTML with tracking URLs.
|
||||||
|
|
||||||
|
Finds all article links in the HTML and replaces them with tracking redirect URLs
|
||||||
|
that log clicks before redirecting to the original article.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
html: Original newsletter HTML content
|
||||||
|
link_tracking_map: Dictionary mapping original URLs to tracking IDs (empty if tracking disabled)
|
||||||
|
api_url: Base URL for the tracking API (e.g., http://localhost:5001)
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
str: HTML with article links replaced by tracking URLs (unchanged if map is empty)
|
||||||
|
|
||||||
|
Example:
|
||||||
|
>>> html = '<a href="https://example.com/article">Read</a>'
|
||||||
|
>>> mapping = {'https://example.com/article': 'track-123'}
|
||||||
|
>>> replace_article_links(html, mapping, 'http://api.example.com')
|
||||||
|
'<a href="http://api.example.com/api/track/click/track-123">Read</a>'
|
||||||
|
"""
|
||||||
|
# Skip tracking if no tracking map provided (subscriber opted out)
|
||||||
|
if not link_tracking_map:
|
||||||
|
return html
|
||||||
|
|
||||||
|
# Parse HTML with BeautifulSoup
|
||||||
|
soup = BeautifulSoup(html, 'html.parser')
|
||||||
|
|
||||||
|
# Find all <a> tags with href attributes
|
||||||
|
for link in soup.find_all('a', href=True):
|
||||||
|
original_url = link['href']
|
||||||
|
|
||||||
|
# Check if this URL should be tracked
|
||||||
|
if original_url in link_tracking_map:
|
||||||
|
tracking_id = link_tracking_map[original_url]
|
||||||
|
tracking_url = f"{api_url}/api/track/click/{tracking_id}"
|
||||||
|
|
||||||
|
# Replace the href with tracking URL
|
||||||
|
link['href'] = tracking_url
|
||||||
|
|
||||||
|
# Return modified HTML
|
||||||
|
return str(soup)
|
||||||
|
|
||||||
|
|
||||||
|
def generate_tracking_urls(
|
||||||
|
articles: List[Dict],
|
||||||
|
newsletter_id: str,
|
||||||
|
subscriber_email: str,
|
||||||
|
tracking_service
|
||||||
|
) -> Dict[str, str]:
|
||||||
|
"""
|
||||||
|
Generate tracking records for all article links and return URL mapping.
|
||||||
|
|
||||||
|
Creates tracking records in the database for each article link and returns
|
||||||
|
a mapping of original URLs to tracking IDs.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
articles: List of article dictionaries with 'link' and 'title' keys
|
||||||
|
newsletter_id: Unique identifier for the newsletter batch
|
||||||
|
subscriber_email: Email address of the recipient
|
||||||
|
tracking_service: Tracking service module with create_newsletter_tracking function
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
dict: Dictionary containing:
|
||||||
|
- pixel_tracking_id: ID for the tracking pixel
|
||||||
|
- link_tracking_map: Dict mapping original URLs to tracking IDs
|
||||||
|
|
||||||
|
Example:
|
||||||
|
>>> articles = [{'link': 'https://example.com/1', 'title': 'Article 1'}]
|
||||||
|
>>> generate_tracking_urls(articles, 'news-2024-01-01', 'user@example.com', tracking_service)
|
||||||
|
{
|
||||||
|
'pixel_tracking_id': 'uuid-for-pixel',
|
||||||
|
'link_tracking_map': {'https://example.com/1': 'uuid-for-link'}
|
||||||
|
}
|
||||||
|
"""
|
||||||
|
# Prepare article links for tracking
|
||||||
|
article_links = []
|
||||||
|
for article in articles:
|
||||||
|
if 'link' in article and article['link']:
|
||||||
|
article_links.append({
|
||||||
|
'url': article['link'],
|
||||||
|
'title': article.get('title', '')
|
||||||
|
})
|
||||||
|
|
||||||
|
# Create tracking records using the tracking service
|
||||||
|
tracking_data = tracking_service.create_newsletter_tracking(
|
||||||
|
newsletter_id=newsletter_id,
|
||||||
|
subscriber_email=subscriber_email,
|
||||||
|
article_links=article_links
|
||||||
|
)
|
||||||
|
|
||||||
|
return {
|
||||||
|
'pixel_tracking_id': tracking_data['pixel_tracking_id'],
|
||||||
|
'link_tracking_map': tracking_data['link_tracking_map'],
|
||||||
|
'tracking_enabled': tracking_data.get('tracking_enabled', True)
|
||||||
|
}
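The helpers above only mint tracking IDs and rewrite links into URLs of the form `{api_url}/api/track/pixel/<id>` and `{api_url}/api/track/click/<id>`; the backend routes that answer those URLs are not part of this diff. A plausible sketch of what they do, assuming the collections exposed by the backend's `database` module (the same ones the tests below import):

```python
import base64
from datetime import datetime
from io import BytesIO

from flask import Flask, redirect, send_file

# Assumed to come from the backend's database module, as in the tests below
from database import newsletter_sends_collection, link_clicks_collection

app = Flask(__name__)

# Canonical 1x1 transparent GIF
PIXEL = base64.b64decode("R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7")

@app.route('/api/track/pixel/<tracking_id>')
def track_pixel(tracking_id):
    # Record the open, then serve the invisible image
    newsletter_sends_collection.update_one(
        {'tracking_id': tracking_id},
        {'$set': {'opened': True, 'last_opened_at': datetime.utcnow()},
         '$inc': {'open_count': 1}})
    return send_file(BytesIO(PIXEL), mimetype='image/gif')

@app.route('/api/track/click/<tracking_id>')
def track_click(tracking_id):
    # Record the click, then redirect to the original article
    doc = link_clicks_collection.find_one_and_update(
        {'tracking_id': tracking_id},
        {'$set': {'clicked': True, 'clicked_at': datetime.utcnow()}})
    return redirect(doc['article_url']) if doc else ('Not found', 404)
```

Serving a real image and a real redirect keeps the tracking invisible to the reader even when a record is missing or logging fails.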
|
||||||
451
tests/backend/test_analytics.py
Normal file
@@ -0,0 +1,451 @@
|
|||||||
|
#!/usr/bin/env python
|
||||||
|
"""
|
||||||
|
Test analytics functionality for email tracking
|
||||||
|
Run from backend directory with venv activated:
|
||||||
|
cd backend
|
||||||
|
source venv/bin/activate # or venv\Scripts\activate on Windows
|
||||||
|
python test_analytics.py
|
||||||
|
"""
|
||||||
|
|
||||||
|
import sys
|
||||||
|
import os
|
||||||
|
from datetime import datetime, timedelta
|
||||||
|
|
||||||
|
# Add backend directory to path
|
||||||
|
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
|
||||||
|
|
||||||
|
from services.analytics_service import (
|
||||||
|
get_open_rate,
|
||||||
|
get_click_rate,
|
||||||
|
get_newsletter_metrics,
|
||||||
|
get_article_performance,
|
||||||
|
get_subscriber_activity_status,
|
||||||
|
update_subscriber_activity_statuses
|
||||||
|
)
|
||||||
|
from database import (
|
||||||
|
newsletter_sends_collection,
|
||||||
|
link_clicks_collection,
|
||||||
|
subscriber_activity_collection
|
||||||
|
)
|
||||||
|
from app import app
|
||||||
|
|
||||||
|
print("\n" + "="*80)
|
||||||
|
print("Analytics Service Tests")
|
||||||
|
print("="*80)
|
||||||
|
|
||||||
|
# Test counters
|
||||||
|
tests_passed = 0
|
||||||
|
tests_failed = 0
|
||||||
|
|
||||||
|
def test_result(test_name, passed, message=""):
|
||||||
|
"""Print test result"""
|
||||||
|
global tests_passed, tests_failed
|
||||||
|
if passed:
|
||||||
|
tests_passed += 1
|
||||||
|
print(f"✓ {test_name}")
|
||||||
|
if message:
|
||||||
|
print(f" {message}")
|
||||||
|
else:
|
||||||
|
tests_failed += 1
|
||||||
|
print(f"❌ {test_name}")
|
||||||
|
if message:
|
||||||
|
print(f" {message}")
|
||||||
|
|
||||||
|
|
||||||
|
# Setup test data
|
||||||
|
print("\n" + "-"*80)
|
||||||
|
print("Setting up test data...")
|
||||||
|
print("-"*80)
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Clean up existing test data
|
||||||
|
newsletter_sends_collection.delete_many({'newsletter_id': {'$regex': '^test-analytics-'}})
|
||||||
|
link_clicks_collection.delete_many({'newsletter_id': {'$regex': '^test-analytics-'}})
|
||||||
|
subscriber_activity_collection.delete_many({'email': {'$regex': '^test-analytics-'}})
|
||||||
|
|
||||||
|
# Create test newsletter sends
|
||||||
|
test_newsletter_id = 'test-analytics-newsletter-001'
|
||||||
|
|
||||||
|
# Create 10 newsletter sends: 7 opened, 3 not opened
|
||||||
|
for i in range(10):
|
||||||
|
opened = i < 7 # First 7 are opened
|
||||||
|
doc = {
|
||||||
|
'newsletter_id': test_newsletter_id,
|
||||||
|
'subscriber_email': f'test-analytics-user{i}@example.com',
|
||||||
|
'tracking_id': f'test-pixel-{i}',
|
||||||
|
'sent_at': datetime.utcnow(),
|
||||||
|
'opened': opened,
|
||||||
|
'first_opened_at': datetime.utcnow() if opened else None,
|
||||||
|
'last_opened_at': datetime.utcnow() if opened else None,
|
||||||
|
'open_count': 1 if opened else 0,
|
||||||
|
'created_at': datetime.utcnow()
|
||||||
|
}
|
||||||
|
newsletter_sends_collection.insert_one(doc)
|
||||||
|
|
||||||
|
# Create test link clicks for an article
|
||||||
|
test_article_url = 'https://example.com/test-analytics-article'
|
||||||
|
|
||||||
|
# Create 10 link tracking records: 4 clicked, 6 not clicked
|
||||||
|
for i in range(10):
|
||||||
|
clicked = i < 4 # First 4 are clicked
|
||||||
|
doc = {
|
||||||
|
'tracking_id': f'test-link-{i}',
|
||||||
|
'newsletter_id': test_newsletter_id,
|
||||||
|
'subscriber_email': f'test-analytics-user{i}@example.com',
|
||||||
|
'article_url': test_article_url,
|
||||||
|
'article_title': 'Test Analytics Article',
|
||||||
|
'clicked': clicked,
|
||||||
|
'clicked_at': datetime.utcnow() if clicked else None,
|
||||||
|
'user_agent': 'Test Agent' if clicked else None,
|
||||||
|
'created_at': datetime.utcnow()
|
||||||
|
}
|
||||||
|
link_clicks_collection.insert_one(doc)
|
||||||
|
|
||||||
|
print("✓ Test data created")
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
print(f"❌ Error setting up test data: {str(e)}")
|
||||||
|
import traceback
|
||||||
|
traceback.print_exc()
|
||||||
|
sys.exit(1)
|
||||||
|
|
||||||
|
|
||||||
|
# Test 1: Open Rate Calculation
|
||||||
|
print("\n" + "-"*80)
|
||||||
|
print("Test 1: Open Rate Calculation")
|
||||||
|
print("-"*80)
|
||||||
|
|
||||||
|
try:
|
||||||
|
open_rate = get_open_rate(test_newsletter_id)
|
||||||
|
|
||||||
|
# Expected: 7 out of 10 = 70%
|
||||||
|
is_correct = open_rate == 70.0
|
||||||
|
test_result("Calculate open rate", is_correct, f"Open rate: {open_rate}% (expected 70%)")
|
||||||
|
|
||||||
|
# Test with non-existent newsletter
|
||||||
|
open_rate_empty = get_open_rate('non-existent-newsletter')
|
||||||
|
handles_empty = open_rate_empty == 0.0
|
||||||
|
test_result("Handle non-existent newsletter", handles_empty,
|
||||||
|
f"Open rate: {open_rate_empty}% (expected 0%)")
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
test_result("Open rate calculation", False, f"Error: {str(e)}")
|
||||||
|
import traceback
|
||||||
|
traceback.print_exc()
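For reference while reading this test: the expected numbers (7 of 10 sends opened gives 70%, and 0% for an unknown newsletter) pin down the formula. A minimal sketch of what `get_open_rate` presumably computes, using the same collection the test populates (the authoritative version lives in `services/analytics_service.py` and may differ in detail):

```python
def get_open_rate(newsletter_id):
    """Percent of sends for this newsletter that were opened; 0.0 if nothing was sent."""
    total_sent = newsletter_sends_collection.count_documents(
        {'newsletter_id': newsletter_id})
    if total_sent == 0:
        return 0.0
    total_opened = newsletter_sends_collection.count_documents(
        {'newsletter_id': newsletter_id, 'opened': True})
    return round(total_opened / total_sent * 100, 1)
```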
|
||||||
|
|
||||||
|
|
||||||
|
# Test 2: Click Rate Calculation
|
||||||
|
print("\n" + "-"*80)
|
||||||
|
print("Test 2: Click Rate Calculation")
|
||||||
|
print("-"*80)
|
||||||
|
|
||||||
|
try:
|
||||||
|
click_rate = get_click_rate(test_article_url)
|
||||||
|
|
||||||
|
# Expected: 4 out of 10 = 40%
|
||||||
|
is_correct = click_rate == 40.0
|
||||||
|
test_result("Calculate click rate", is_correct, f"Click rate: {click_rate}% (expected 40%)")
|
||||||
|
|
||||||
|
# Test with non-existent article
|
||||||
|
click_rate_empty = get_click_rate('https://example.com/non-existent')
|
||||||
|
handles_empty = click_rate_empty == 0.0
|
||||||
|
test_result("Handle non-existent article", handles_empty,
|
||||||
|
f"Click rate: {click_rate_empty}% (expected 0%)")
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
test_result("Click rate calculation", False, f"Error: {str(e)}")
|
||||||
|
import traceback
|
||||||
|
traceback.print_exc()
|
||||||
|
|
||||||
|
|
||||||
|
# Test 3: Newsletter Metrics
|
||||||
|
print("\n" + "-"*80)
|
||||||
|
print("Test 3: Newsletter Metrics")
|
||||||
|
print("-"*80)
|
||||||
|
|
||||||
|
try:
|
||||||
|
metrics = get_newsletter_metrics(test_newsletter_id)
|
||||||
|
|
||||||
|
# Verify all expected fields
|
||||||
|
has_all_fields = all(key in metrics for key in [
|
||||||
|
'newsletter_id', 'total_sent', 'total_opened', 'open_rate',
|
||||||
|
'total_clicks', 'unique_clickers', 'click_through_rate'
|
||||||
|
])
|
||||||
|
test_result("Returns all required fields", has_all_fields)
|
||||||
|
|
||||||
|
# Verify values
|
||||||
|
correct_sent = metrics['total_sent'] == 10
|
||||||
|
test_result("Correct total_sent", correct_sent, f"Total sent: {metrics['total_sent']}")
|
||||||
|
|
||||||
|
correct_opened = metrics['total_opened'] == 7
|
||||||
|
test_result("Correct total_opened", correct_opened, f"Total opened: {metrics['total_opened']}")
|
||||||
|
|
||||||
|
correct_open_rate = metrics['open_rate'] == 70.0
|
||||||
|
test_result("Correct open_rate", correct_open_rate, f"Open rate: {metrics['open_rate']}%")
|
||||||
|
|
||||||
|
correct_clicks = metrics['total_clicks'] == 4
|
||||||
|
test_result("Correct total_clicks", correct_clicks, f"Total clicks: {metrics['total_clicks']}")
|
||||||
|
|
||||||
|
correct_unique_clickers = metrics['unique_clickers'] == 4
|
||||||
|
test_result("Correct unique_clickers", correct_unique_clickers,
|
||||||
|
f"Unique clickers: {metrics['unique_clickers']}")
|
||||||
|
|
||||||
|
correct_ctr = metrics['click_through_rate'] == 40.0
|
||||||
|
test_result("Correct click_through_rate", correct_ctr,
|
||||||
|
f"CTR: {metrics['click_through_rate']}%")
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
test_result("Newsletter metrics", False, f"Error: {str(e)}")
|
||||||
|
import traceback
|
||||||
|
traceback.print_exc()
|
||||||
|
|
||||||
|
|
||||||
|
# Test 4: Article Performance
|
||||||
|
print("\n" + "-"*80)
|
||||||
|
print("Test 4: Article Performance")
|
||||||
|
print("-"*80)
|
||||||
|
|
||||||
|
try:
|
||||||
|
performance = get_article_performance(test_article_url)
|
||||||
|
|
||||||
|
# Verify all expected fields
|
||||||
|
has_all_fields = all(key in performance for key in [
|
||||||
|
'article_url', 'total_sent', 'total_clicks', 'click_rate',
|
||||||
|
'unique_clickers', 'newsletters'
|
||||||
|
])
|
||||||
|
test_result("Returns all required fields", has_all_fields)
|
||||||
|
|
||||||
|
# Verify values
|
||||||
|
correct_sent = performance['total_sent'] == 10
|
||||||
|
test_result("Correct total_sent", correct_sent, f"Total sent: {performance['total_sent']}")
|
||||||
|
|
||||||
|
correct_clicks = performance['total_clicks'] == 4
|
||||||
|
test_result("Correct total_clicks", correct_clicks, f"Total clicks: {performance['total_clicks']}")
|
||||||
|
|
||||||
|
correct_click_rate = performance['click_rate'] == 40.0
|
||||||
|
test_result("Correct click_rate", correct_click_rate, f"Click rate: {performance['click_rate']}%")
|
||||||
|
|
||||||
|
correct_unique = performance['unique_clickers'] == 4
|
||||||
|
test_result("Correct unique_clickers", correct_unique,
|
||||||
|
f"Unique clickers: {performance['unique_clickers']}")
|
||||||
|
|
||||||
|
has_newsletters = len(performance['newsletters']) > 0
|
||||||
|
test_result("Returns newsletter list", has_newsletters,
|
||||||
|
f"Newsletters: {performance['newsletters']}")
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
test_result("Article performance", False, f"Error: {str(e)}")
|
||||||
|
import traceback
|
||||||
|
traceback.print_exc()
|
||||||
|
|
||||||
|
|
||||||
|
# Test 5: Activity Status Classification
|
||||||
|
print("\n" + "-"*80)
|
||||||
|
print("Test 5: Activity Status Classification")
|
||||||
|
print("-"*80)
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Create test data for activity classification
|
||||||
|
now = datetime.utcnow()
|
||||||
|
|
||||||
|
# Active user (opened 10 days ago)
|
||||||
|
newsletter_sends_collection.insert_one({
|
||||||
|
'newsletter_id': 'test-analytics-activity',
|
||||||
|
'subscriber_email': 'test-analytics-active@example.com',
|
||||||
|
'tracking_id': 'test-active-pixel',
|
||||||
|
'sent_at': now - timedelta(days=10),
|
||||||
|
'opened': True,
|
||||||
|
'first_opened_at': now - timedelta(days=10),
|
||||||
|
'last_opened_at': now - timedelta(days=10),
|
||||||
|
'open_count': 1,
|
||||||
|
'created_at': now - timedelta(days=10)
|
||||||
|
})
|
||||||
|
|
||||||
|
# Inactive user (opened 45 days ago)
|
||||||
|
newsletter_sends_collection.insert_one({
|
||||||
|
'newsletter_id': 'test-analytics-activity',
|
||||||
|
'subscriber_email': 'test-analytics-inactive@example.com',
|
||||||
|
'tracking_id': 'test-inactive-pixel',
|
||||||
|
'sent_at': now - timedelta(days=45),
|
||||||
|
'opened': True,
|
||||||
|
'first_opened_at': now - timedelta(days=45),
|
||||||
|
'last_opened_at': now - timedelta(days=45),
|
||||||
|
'open_count': 1,
|
||||||
|
'created_at': now - timedelta(days=45)
|
||||||
|
})
|
||||||
|
|
||||||
|
# Dormant user (opened 90 days ago)
|
||||||
|
newsletter_sends_collection.insert_one({
|
||||||
|
'newsletter_id': 'test-analytics-activity',
|
||||||
|
'subscriber_email': 'test-analytics-dormant@example.com',
|
||||||
|
'tracking_id': 'test-dormant-pixel',
|
||||||
|
'sent_at': now - timedelta(days=90),
|
||||||
|
'opened': True,
|
||||||
|
'first_opened_at': now - timedelta(days=90),
|
||||||
|
'last_opened_at': now - timedelta(days=90),
|
||||||
|
'open_count': 1,
|
||||||
|
'created_at': now - timedelta(days=90)
|
||||||
|
})
|
||||||
|
|
||||||
|
# New user (never opened)
|
||||||
|
newsletter_sends_collection.insert_one({
|
||||||
|
'newsletter_id': 'test-analytics-activity',
|
||||||
|
'subscriber_email': 'test-analytics-new@example.com',
|
||||||
|
'tracking_id': 'test-new-pixel',
|
||||||
|
'sent_at': now - timedelta(days=5),
|
||||||
|
'opened': False,
|
||||||
|
'first_opened_at': None,
|
||||||
|
'last_opened_at': None,
|
||||||
|
'open_count': 0,
|
||||||
|
'created_at': now - timedelta(days=5)
|
||||||
|
})
|
||||||
|
|
||||||
|
# Test classifications
|
||||||
|
active_status = get_subscriber_activity_status('test-analytics-active@example.com')
|
||||||
|
is_active = active_status == 'active'
|
||||||
|
test_result("Classify active user", is_active, f"Status: {active_status}")
|
||||||
|
|
||||||
|
inactive_status = get_subscriber_activity_status('test-analytics-inactive@example.com')
|
||||||
|
is_inactive = inactive_status == 'inactive'
|
||||||
|
test_result("Classify inactive user", is_inactive, f"Status: {inactive_status}")
|
||||||
|
|
||||||
|
dormant_status = get_subscriber_activity_status('test-analytics-dormant@example.com')
|
||||||
|
is_dormant = dormant_status == 'dormant'
|
||||||
|
test_result("Classify dormant user", is_dormant, f"Status: {dormant_status}")
|
||||||
|
|
||||||
|
new_status = get_subscriber_activity_status('test-analytics-new@example.com')
|
||||||
|
is_new = new_status == 'new'
|
||||||
|
test_result("Classify new user", is_new, f"Status: {new_status}")
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
test_result("Activity status classification", False, f"Error: {str(e)}")
|
||||||
|
import traceback
|
||||||
|
traceback.print_exc()
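The fixtures above imply the classification rules: an open 10 days ago counts as active, 45 days as inactive, 90 days as dormant, and a subscriber who never opened anything is new. The exact cut-offs are not visible in this diff, so the 30- and 60-day boundaries below are assumptions that merely reproduce the test's expectations:

```python
from datetime import datetime

def classify_activity(last_opened_at, now=None):
    """Bucket a subscriber by recency of their last open (cut-offs are assumed)."""
    now = now or datetime.utcnow()
    if last_opened_at is None:
        return 'new'                      # never opened any newsletter
    days_since = (now - last_opened_at).days
    if days_since <= 30:
        return 'active'
    if days_since <= 60:
        return 'inactive'
    return 'dormant'
```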
|
||||||
|
|
||||||
|
|
||||||
|
# Test 6: Batch Update Activity Statuses
|
||||||
|
print("\n" + "-"*80)
|
||||||
|
print("Test 6: Batch Update Activity Statuses")
|
||||||
|
print("-"*80)
|
||||||
|
|
||||||
|
try:
|
||||||
|
updated_count = update_subscriber_activity_statuses()
|
||||||
|
|
||||||
|
# Should update all test subscribers
|
||||||
|
has_updates = updated_count > 0
|
||||||
|
test_result("Updates subscriber records", has_updates,
|
||||||
|
f"Updated {updated_count} subscribers")
|
||||||
|
|
||||||
|
# Verify a record was created
|
||||||
|
activity_record = subscriber_activity_collection.find_one({
|
||||||
|
'email': 'test-analytics-active@example.com'
|
||||||
|
})
|
||||||
|
|
||||||
|
record_exists = activity_record is not None
|
||||||
|
test_result("Creates activity record", record_exists)
|
||||||
|
|
||||||
|
if activity_record:
|
||||||
|
has_required_fields = all(key in activity_record for key in [
|
||||||
|
'email', 'status', 'total_opens', 'total_clicks',
|
||||||
|
'newsletters_received', 'newsletters_opened', 'updated_at'
|
||||||
|
])
|
||||||
|
test_result("Activity record has required fields", has_required_fields)
|
||||||
|
|
||||||
|
correct_status = activity_record['status'] == 'active'
|
||||||
|
test_result("Activity record has correct status", correct_status,
|
||||||
|
f"Status: {activity_record['status']}")
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
test_result("Batch update activity statuses", False, f"Error: {str(e)}")
|
||||||
|
import traceback
|
||||||
|
traceback.print_exc()
|
||||||
|
|
||||||
|
|
||||||
|
# Test 7: Analytics API Endpoints
|
||||||
|
print("\n" + "-"*80)
|
||||||
|
print("Test 7: Analytics API Endpoints")
|
||||||
|
print("-"*80)
|
||||||
|
|
||||||
|
try:
|
||||||
|
with app.test_client() as client:
|
||||||
|
# Test newsletter analytics endpoint
|
||||||
|
response = client.get(f'/api/analytics/newsletter/{test_newsletter_id}')
|
||||||
|
|
||||||
|
is_200 = response.status_code == 200
|
||||||
|
test_result("Newsletter endpoint returns 200", is_200, f"Status: {response.status_code}")
|
||||||
|
|
||||||
|
if is_200:
|
||||||
|
data = response.get_json()
|
||||||
|
has_data = data is not None and 'open_rate' in data
|
||||||
|
test_result("Newsletter endpoint returns data", has_data)
|
||||||
|
|
||||||
|
# Test article analytics endpoint
|
||||||
|
response = client.get(f'/api/analytics/article/{test_article_url}')
|
||||||
|
|
||||||
|
is_200 = response.status_code == 200
|
||||||
|
test_result("Article endpoint returns 200", is_200, f"Status: {response.status_code}")
|
||||||
|
|
||||||
|
if is_200:
|
||||||
|
data = response.get_json()
|
||||||
|
has_data = data is not None and 'click_rate' in data
|
||||||
|
test_result("Article endpoint returns data", has_data)
|
||||||
|
|
||||||
|
# Test subscriber analytics endpoint
|
||||||
|
response = client.get('/api/analytics/subscriber/test-analytics-active@example.com')
|
||||||
|
|
||||||
|
is_200 = response.status_code == 200
|
||||||
|
test_result("Subscriber endpoint returns 200", is_200, f"Status: {response.status_code}")
|
||||||
|
|
||||||
|
if is_200:
|
||||||
|
data = response.get_json()
|
||||||
|
has_data = data is not None and 'status' in data
|
||||||
|
test_result("Subscriber endpoint returns data", has_data)
|
||||||
|
|
||||||
|
# Test update activity endpoint
|
||||||
|
response = client.post('/api/analytics/update-activity')
|
||||||
|
|
||||||
|
is_200 = response.status_code == 200
|
||||||
|
test_result("Update activity endpoint returns 200", is_200, f"Status: {response.status_code}")
|
||||||
|
|
||||||
|
if is_200:
|
||||||
|
data = response.get_json()
|
||||||
|
has_count = data is not None and 'updated_count' in data
|
||||||
|
test_result("Update activity endpoint returns count", has_count)
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
test_result("Analytics API endpoints", False, f"Error: {str(e)}")
|
||||||
|
import traceback
|
||||||
|
traceback.print_exc()
|
||||||
|
|
||||||
|
|
||||||
|
# Clean up test data
|
||||||
|
print("\n" + "-"*80)
|
||||||
|
print("Cleaning up test data...")
|
||||||
|
print("-"*80)
|
||||||
|
|
||||||
|
try:
|
||||||
|
newsletter_sends_collection.delete_many({'newsletter_id': {'$regex': '^test-analytics-'}})
|
||||||
|
link_clicks_collection.delete_many({'newsletter_id': {'$regex': '^test-analytics-'}})
|
||||||
|
subscriber_activity_collection.delete_many({'email': {'$regex': '^test-analytics-'}})
|
||||||
|
print("✓ Test data cleaned up")
|
||||||
|
except Exception as e:
|
||||||
|
print(f"⚠ Error cleaning up: {str(e)}")
|
||||||
|
|
||||||
|
|
||||||
|
# Summary
|
||||||
|
print("\n" + "="*80)
|
||||||
|
print("TEST SUMMARY")
|
||||||
|
print("="*80)
|
||||||
|
print(f"Total tests: {tests_passed + tests_failed}")
|
||||||
|
print(f"✓ Passed: {tests_passed}")
|
||||||
|
print(f"❌ Failed: {tests_failed}")
|
||||||
|
|
||||||
|
if tests_failed == 0:
|
||||||
|
print("\n🎉 All tests passed!")
|
||||||
|
else:
|
||||||
|
print(f"\n⚠ {tests_failed} test(s) failed")
|
||||||
|
|
||||||
|
print("="*80 + "\n")
|
||||||
|
|
||||||
|
# Exit with appropriate code
|
||||||
|
sys.exit(0 if tests_failed == 0 else 1)
|
||||||
389
tests/backend/test_privacy.py
Normal file
@@ -0,0 +1,389 @@
|
|||||||
|
#!/usr/bin/env python
|
||||||
|
"""
|
||||||
|
Test privacy compliance features for email tracking
|
||||||
|
Run from backend directory with venv activated:
|
||||||
|
cd backend
|
||||||
|
source venv/bin/activate # or venv\Scripts\activate on Windows
|
||||||
|
python test_privacy.py
|
||||||
|
"""
|
||||||
|
|
||||||
|
import sys
|
||||||
|
import os
|
||||||
|
from datetime import datetime, timedelta
|
||||||
|
from pymongo import MongoClient
|
||||||
|
|
||||||
|
# Add backend directory to path
|
||||||
|
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
|
||||||
|
|
||||||
|
from config import Config
|
||||||
|
from services.tracking_service import (
|
||||||
|
create_newsletter_tracking,
|
||||||
|
anonymize_old_tracking_data,
|
||||||
|
delete_subscriber_tracking_data
|
||||||
|
)
|
||||||
|
from database import (
|
||||||
|
newsletter_sends_collection,
|
||||||
|
link_clicks_collection,
|
||||||
|
subscriber_activity_collection,
|
||||||
|
subscribers_collection
|
||||||
|
)
|
||||||
|
from app import app
|
||||||
|
|
||||||
|
print("\n" + "="*80)
|
||||||
|
print("Privacy Compliance Tests")
|
||||||
|
print("="*80)
|
||||||
|
|
||||||
|
# Test counters
|
||||||
|
tests_passed = 0
|
||||||
|
tests_failed = 0
|
||||||
|
|
||||||
|
def test_result(test_name, passed, message=""):
|
||||||
|
"""Print test result"""
|
||||||
|
global tests_passed, tests_failed
|
||||||
|
if passed:
|
||||||
|
tests_passed += 1
|
||||||
|
print(f"✓ {test_name}")
|
||||||
|
if message:
|
||||||
|
print(f" {message}")
|
||||||
|
else:
|
||||||
|
tests_failed += 1
|
||||||
|
print(f"❌ {test_name}")
|
||||||
|
if message:
|
||||||
|
print(f" {message}")
|
||||||
|
|
||||||
|
|
||||||
|
# Setup: Clean up test data
|
||||||
|
print("\n" + "-"*80)
|
||||||
|
print("Setup: Cleaning test data")
|
||||||
|
print("-"*80)
|
||||||
|
|
||||||
|
test_newsletter_id = 'privacy-test-newsletter'
|
||||||
|
test_email = 'privacy-test@example.com'
|
||||||
|
test_email_opted_out = 'opted-out@example.com'
|
||||||
|
|
||||||
|
newsletter_sends_collection.delete_many({'newsletter_id': test_newsletter_id})
|
||||||
|
link_clicks_collection.delete_many({'newsletter_id': test_newsletter_id})
|
||||||
|
subscriber_activity_collection.delete_many({'email': {'$in': [test_email, test_email_opted_out]}})
|
||||||
|
subscribers_collection.delete_many({'email': {'$in': [test_email, test_email_opted_out]}})
|
||||||
|
|
||||||
|
print("✓ Test data cleaned")
|
||||||
|
|
||||||
|
|
||||||
|
# Test 1: Data Anonymization
|
||||||
|
print("\n" + "-"*80)
|
||||||
|
print("Test 1: Data Anonymization")
|
||||||
|
print("-"*80)
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Create old tracking records (older than 90 days)
|
||||||
|
old_date = datetime.utcnow() - timedelta(days=100)
|
||||||
|
|
||||||
|
old_newsletter_doc = {
|
||||||
|
'newsletter_id': test_newsletter_id,
|
||||||
|
'subscriber_email': 'old-user@example.com',
|
||||||
|
'tracking_id': 'old-tracking-id-1',
|
||||||
|
'sent_at': old_date,
|
||||||
|
'opened': True,
|
||||||
|
'first_opened_at': old_date,
|
||||||
|
'last_opened_at': old_date,
|
||||||
|
'open_count': 3,
|
||||||
|
'created_at': old_date
|
||||||
|
}
|
||||||
|
newsletter_sends_collection.insert_one(old_newsletter_doc)
|
||||||
|
|
||||||
|
old_link_doc = {
|
||||||
|
'tracking_id': 'old-link-tracking-id-1',
|
||||||
|
'newsletter_id': test_newsletter_id,
|
||||||
|
'subscriber_email': 'old-user@example.com',
|
||||||
|
'article_url': 'https://example.com/old-article',
|
||||||
|
'article_title': 'Old Article',
|
||||||
|
'clicked': True,
|
||||||
|
'clicked_at': old_date,
|
||||||
|
'created_at': old_date
|
||||||
|
}
|
||||||
|
link_clicks_collection.insert_one(old_link_doc)
|
||||||
|
|
||||||
|
# Create recent tracking records (within 90 days)
|
||||||
|
recent_date = datetime.utcnow() - timedelta(days=30)
|
||||||
|
|
||||||
|
recent_newsletter_doc = {
|
||||||
|
'newsletter_id': test_newsletter_id,
|
||||||
|
'subscriber_email': 'recent-user@example.com',
|
||||||
|
'tracking_id': 'recent-tracking-id-1',
|
||||||
|
'sent_at': recent_date,
|
||||||
|
'opened': True,
|
||||||
|
'first_opened_at': recent_date,
|
||||||
|
'last_opened_at': recent_date,
|
||||||
|
'open_count': 1,
|
||||||
|
'created_at': recent_date
|
||||||
|
}
|
||||||
|
newsletter_sends_collection.insert_one(recent_newsletter_doc)
|
||||||
|
|
||||||
|
# Run anonymization
|
||||||
|
result = anonymize_old_tracking_data(retention_days=90)
|
||||||
|
|
||||||
|
# Check that old records were anonymized
|
||||||
|
old_newsletter_after = newsletter_sends_collection.find_one({'tracking_id': 'old-tracking-id-1'})
|
||||||
|
old_anonymized = old_newsletter_after and old_newsletter_after['subscriber_email'] == 'anonymized'
|
||||||
|
test_result("Anonymizes old newsletter records", old_anonymized,
|
||||||
|
f"Email: {old_newsletter_after.get('subscriber_email', 'N/A') if old_newsletter_after else 'N/A'}")
|
||||||
|
|
||||||
|
old_link_after = link_clicks_collection.find_one({'tracking_id': 'old-link-tracking-id-1'})
|
||||||
|
link_anonymized = old_link_after and old_link_after['subscriber_email'] == 'anonymized'
|
||||||
|
test_result("Anonymizes old link click records", link_anonymized,
|
||||||
|
f"Email: {old_link_after.get('subscriber_email', 'N/A') if old_link_after else 'N/A'}")
|
||||||
|
|
||||||
|
# Check that aggregated metrics are preserved
|
||||||
|
metrics_preserved = (
|
||||||
|
old_newsletter_after and
|
||||||
|
old_newsletter_after['open_count'] == 3 and
|
||||||
|
old_newsletter_after['opened'] == True
|
||||||
|
)
|
||||||
|
test_result("Preserves aggregated metrics", metrics_preserved,
|
||||||
|
f"Open count: {old_newsletter_after.get('open_count', 0) if old_newsletter_after else 0}")
|
||||||
|
|
||||||
|
# Check that recent records were NOT anonymized
|
||||||
|
recent_newsletter_after = newsletter_sends_collection.find_one({'tracking_id': 'recent-tracking-id-1'})
|
||||||
|
recent_not_anonymized = (
|
||||||
|
recent_newsletter_after and
|
||||||
|
recent_newsletter_after['subscriber_email'] == 'recent-user@example.com'
|
||||||
|
)
|
||||||
|
test_result("Does not anonymize recent records", recent_not_anonymized,
|
||||||
|
f"Email: {recent_newsletter_after.get('subscriber_email', 'N/A') if recent_newsletter_after else 'N/A'}")
|
||||||
|
|
||||||
|
# Check return counts
|
||||||
|
correct_counts = result['newsletter_sends_anonymized'] >= 1 and result['link_clicks_anonymized'] >= 1
|
||||||
|
test_result("Returns correct anonymization counts", correct_counts,
|
||||||
|
f"Newsletter: {result['newsletter_sends_anonymized']}, Links: {result['link_clicks_anonymized']}")
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
test_result("Data anonymization", False, f"Error: {str(e)}")
|
||||||
|
import traceback
|
||||||
|
traceback.print_exc()
|
||||||
|
|
||||||
|
|
||||||
|
# Test 2: User Data Deletion
print("\n" + "-"*80)
print("Test 2: User Data Deletion")
print("-"*80)

try:
    # Create tracking records for a specific user
    article_links = [
        {'url': 'https://example.com/article1', 'title': 'Article 1'},
        {'url': 'https://example.com/article2', 'title': 'Article 2'}
    ]

    tracking_data = create_newsletter_tracking(
        newsletter_id=test_newsletter_id,
        subscriber_email=test_email,
        article_links=article_links
    )

    # Create subscriber activity record
    subscriber_activity_collection.insert_one({
        'email': test_email,
        'status': 'active',
        'last_opened_at': datetime.utcnow(),
        'total_opens': 5,
        'total_clicks': 3
    })

    # Verify records exist
    newsletter_count_before = newsletter_sends_collection.count_documents({'subscriber_email': test_email})
    link_count_before = link_clicks_collection.count_documents({'subscriber_email': test_email})
    activity_count_before = subscriber_activity_collection.count_documents({'email': test_email})

    records_exist = newsletter_count_before > 0 and link_count_before > 0 and activity_count_before > 0
    test_result("Creates test tracking records", records_exist,
                f"Newsletter: {newsletter_count_before}, Links: {link_count_before}, Activity: {activity_count_before}")

    # Delete all tracking data for the user
    delete_result = delete_subscriber_tracking_data(test_email)

    # Verify all records were deleted
    newsletter_count_after = newsletter_sends_collection.count_documents({'subscriber_email': test_email})
    link_count_after = link_clicks_collection.count_documents({'subscriber_email': test_email})
    activity_count_after = subscriber_activity_collection.count_documents({'email': test_email})

    all_deleted = newsletter_count_after == 0 and link_count_after == 0 and activity_count_after == 0
    test_result("Deletes all tracking records", all_deleted,
                f"Remaining - Newsletter: {newsletter_count_after}, Links: {link_count_after}, Activity: {activity_count_after}")

    # Check return counts
    correct_delete_counts = (
        delete_result['newsletter_sends_deleted'] == newsletter_count_before and
        delete_result['link_clicks_deleted'] == link_count_before and
        delete_result['subscriber_activity_deleted'] == activity_count_before
    )
    test_result("Returns correct deletion counts", correct_delete_counts,
                f"Deleted - Newsletter: {delete_result['newsletter_sends_deleted']}, Links: {delete_result['link_clicks_deleted']}, Activity: {delete_result['subscriber_activity_deleted']}")

except Exception as e:
    test_result("User data deletion", False, f"Error: {str(e)}")
    import traceback
    traceback.print_exc()

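The deletion contract checked here is a straight `delete_many` per collection plus a count report. A minimal sketch under that assumption (the real `delete_subscriber_tracking_data` may differ; the collection handles are the ones this test already imports):

```python
# Illustrative sketch of the right-to-erasure contract checked by Test 2.
def delete_subscriber_tracking_data(email):
    sends = newsletter_sends_collection.delete_many({'subscriber_email': email})
    clicks = link_clicks_collection.delete_many({'subscriber_email': email})
    activity = subscriber_activity_collection.delete_many({'email': email})
    return {
        'newsletter_sends_deleted': sends.deleted_count,
        'link_clicks_deleted': clicks.deleted_count,
        'subscriber_activity_deleted': activity.deleted_count,
    }
```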
# Test 3: Tracking Opt-Out
print("\n" + "-"*80)
print("Test 3: Tracking Opt-Out")
print("-"*80)

try:
    # Create subscriber with tracking disabled
    subscribers_collection.insert_one({
        'email': test_email_opted_out,
        'subscribed_at': datetime.utcnow(),
        'tracking_enabled': False
    })

    # Try to create tracking for opted-out subscriber
    article_links = [
        {'url': 'https://example.com/article1', 'title': 'Article 1'}
    ]

    tracking_data_opted_out = create_newsletter_tracking(
        newsletter_id=test_newsletter_id,
        subscriber_email=test_email_opted_out,
        article_links=article_links
    )

    # Check that no tracking was created
    no_pixel_id = tracking_data_opted_out['pixel_tracking_id'] is None
    test_result("Does not create pixel tracking for opted-out users", no_pixel_id,
                f"Pixel ID: {tracking_data_opted_out['pixel_tracking_id']}")

    empty_link_map = len(tracking_data_opted_out['link_tracking_map']) == 0
    test_result("Does not create link tracking for opted-out users", empty_link_map,
                f"Link map size: {len(tracking_data_opted_out['link_tracking_map'])}")

    tracking_disabled_flag = tracking_data_opted_out.get('tracking_enabled') == False
    test_result("Returns tracking_enabled=False for opted-out users", tracking_disabled_flag)

    # Verify no database records were created
    newsletter_count = newsletter_sends_collection.count_documents({'subscriber_email': test_email_opted_out})
    link_count = link_clicks_collection.count_documents({'subscriber_email': test_email_opted_out})

    no_db_records = newsletter_count == 0 and link_count == 0
    test_result("Does not create database records for opted-out users", no_db_records,
                f"Newsletter records: {newsletter_count}, Link records: {link_count}")

    # Test opt-in/opt-out endpoints
    with app.test_client() as client:
        # Create a subscriber with tracking enabled
        subscribers_collection.insert_one({
            'email': test_email,
            'subscribed_at': datetime.utcnow(),
            'tracking_enabled': True
        })

        # Opt out
        response = client.post(f'/api/tracking/subscriber/{test_email}/opt-out')
        opt_out_success = response.status_code == 200 and response.json.get('success') == True
        test_result("Opt-out endpoint works", opt_out_success,
                    f"Status: {response.status_code}")

        # Verify tracking is disabled
        subscriber = subscribers_collection.find_one({'email': test_email})
        tracking_disabled = subscriber and subscriber.get('tracking_enabled') == False
        test_result("Opt-out disables tracking in database", tracking_disabled)

        # Opt back in
        response = client.post(f'/api/tracking/subscriber/{test_email}/opt-in')
        opt_in_success = response.status_code == 200 and response.json.get('success') == True
        test_result("Opt-in endpoint works", opt_in_success,
                    f"Status: {response.status_code}")

        # Verify tracking is enabled
        subscriber = subscribers_collection.find_one({'email': test_email})
        tracking_enabled = subscriber and subscriber.get('tracking_enabled') == True
        test_result("Opt-in enables tracking in database", tracking_enabled)

except Exception as e:
    test_result("Tracking opt-out", False, f"Error: {str(e)}")
    import traceback
    traceback.print_exc()

# Test 4: Privacy API Endpoints
print("\n" + "-"*80)
print("Test 4: Privacy API Endpoints")
print("-"*80)

try:
    with app.test_client() as client:
        # Create test tracking data
        article_links = [{'url': 'https://example.com/test', 'title': 'Test'}]
        create_newsletter_tracking(
            newsletter_id=test_newsletter_id,
            subscriber_email='api-test@example.com',
            article_links=article_links
        )

        # Test deletion endpoint
        response = client.delete('/api/tracking/subscriber/api-test@example.com')

        delete_endpoint_works = response.status_code == 200 and response.json.get('success') == True
        test_result("Deletion endpoint returns success", delete_endpoint_works,
                    f"Status: {response.status_code}")

        # Verify data was deleted
        remaining_records = newsletter_sends_collection.count_documents({'subscriber_email': 'api-test@example.com'})
        data_deleted = remaining_records == 0
        test_result("Deletion endpoint removes data", data_deleted,
                    f"Remaining records: {remaining_records}")

        # Test anonymization endpoint
        response = client.post('/api/tracking/anonymize', json={'retention_days': 90})

        anonymize_endpoint_works = response.status_code == 200 and response.json.get('success') == True
        test_result("Anonymization endpoint returns success", anonymize_endpoint_works,
                    f"Status: {response.status_code}")

        has_counts = 'anonymized_counts' in response.json
        test_result("Anonymization endpoint returns counts", has_counts)

except Exception as e:
    test_result("Privacy API endpoints", False, f"Error: {str(e)}")
    import traceback
    traceback.print_exc()

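Outside the Flask test client, the same privacy endpoints can be exercised over plain HTTP. A hedged example using `requests`; the base URL mirrors the `http://localhost:5001` default used elsewhere in these tests and should be adjusted for a real deployment:

```python
# Example client calls for the routes covered by Tests 3 and 4.
import requests

BASE = 'http://localhost:5001'
email = 'subscriber@example.com'

# Opt a subscriber out of (and back into) tracking
requests.post(f'{BASE}/api/tracking/subscriber/{email}/opt-out').raise_for_status()
requests.post(f'{BASE}/api/tracking/subscriber/{email}/opt-in').raise_for_status()

# Right-to-erasure: remove all tracking data for one subscriber
print(requests.delete(f'{BASE}/api/tracking/subscriber/{email}').json())

# Retention job: anonymize records older than 90 days
print(requests.post(f'{BASE}/api/tracking/anonymize', json={'retention_days': 90}).json())
```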
# Clean up test data
print("\n" + "-"*80)
print("Cleaning up test data...")
print("-"*80)

try:
    newsletter_sends_collection.delete_many({'newsletter_id': test_newsletter_id})
    link_clicks_collection.delete_many({'newsletter_id': test_newsletter_id})
    subscriber_activity_collection.delete_many({'email': {'$in': [test_email, test_email_opted_out, 'api-test@example.com']}})
    subscribers_collection.delete_many({'email': {'$in': [test_email, test_email_opted_out, 'api-test@example.com']}})

    # Clean up anonymized records
    newsletter_sends_collection.delete_many({'subscriber_email': 'anonymized'})
    link_clicks_collection.delete_many({'subscriber_email': 'anonymized'})

    print("✓ Test data cleaned up")
except Exception as e:
    print(f"⚠ Error cleaning up: {str(e)}")


# Summary
print("\n" + "="*80)
print("TEST SUMMARY")
print("="*80)
print(f"Total tests: {tests_passed + tests_failed}")
print(f"✓ Passed: {tests_passed}")
print(f"❌ Failed: {tests_failed}")

if tests_failed == 0:
    print("\n🎉 All privacy compliance tests passed!")
else:
    print(f"\n⚠ {tests_failed} test(s) failed")

print("="*80 + "\n")

# Exit with appropriate code
sys.exit(0 if tests_failed == 0 else 1)
260
tests/backend/test_tracking.py
Normal file
@@ -0,0 +1,260 @@
#!/usr/bin/env python
"""
Test email tracking functionality
Run from backend directory with venv activated:
    cd backend
    source venv/bin/activate  # or venv\Scripts\activate on Windows
    python test_tracking.py
"""

import sys
import os
from datetime import datetime
from pymongo import MongoClient

# Add backend directory to path
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))

from config import Config
from services.tracking_service import generate_tracking_id, create_newsletter_tracking
from database import newsletter_sends_collection, link_clicks_collection
from app import app

print("\n" + "="*80)
print("Email Tracking System Tests")
print("="*80)

# Test counters
tests_passed = 0
tests_failed = 0

def test_result(test_name, passed, message=""):
    """Print test result"""
    global tests_passed, tests_failed
    if passed:
        tests_passed += 1
        print(f"✓ {test_name}")
        if message:
            print(f"  {message}")
    else:
        tests_failed += 1
        print(f"❌ {test_name}")
        if message:
            print(f"  {message}")


# Test 1: Tracking ID Generation
print("\n" + "-"*80)
print("Test 1: Tracking ID Generation")
print("-"*80)

try:
    tracking_id = generate_tracking_id()

    # Check format (UUID4)
    is_valid_uuid = len(tracking_id) == 36 and tracking_id.count('-') == 4
    test_result("Generate tracking ID", is_valid_uuid, f"Generated ID: {tracking_id}")

    # Check uniqueness
    tracking_id2 = generate_tracking_id()
    is_unique = tracking_id != tracking_id2
    test_result("Tracking IDs are unique", is_unique, f"ID1: {tracking_id[:8]}... ID2: {tracking_id2[:8]}...")

except Exception as e:
    test_result("Generate tracking ID", False, f"Error: {str(e)}")


# Test 2: Create Newsletter Tracking
print("\n" + "-"*80)
print("Test 2: Create Newsletter Tracking")
print("-"*80)

try:
    # Clean up test data first
    newsletter_sends_collection.delete_many({'newsletter_id': 'test-newsletter-001'})
    link_clicks_collection.delete_many({'newsletter_id': 'test-newsletter-001'})

    # Create tracking with article links
    article_links = [
        {'url': 'https://example.com/article1', 'title': 'Test Article 1'},
        {'url': 'https://example.com/article2', 'title': 'Test Article 2'}
    ]

    tracking_data = create_newsletter_tracking(
        newsletter_id='test-newsletter-001',
        subscriber_email='test@example.com',
        article_links=article_links
    )

    # Verify return data structure
    has_pixel_id = 'pixel_tracking_id' in tracking_data
    test_result("Returns pixel tracking ID", has_pixel_id)

    has_link_map = 'link_tracking_map' in tracking_data
    test_result("Returns link tracking map", has_link_map)

    correct_link_count = len(tracking_data.get('link_tracking_map', {})) == 2
    test_result("Creates tracking for all links", correct_link_count,
                f"Created {len(tracking_data.get('link_tracking_map', {}))} link tracking records")

    # Verify database records
    newsletter_record = newsletter_sends_collection.find_one({
        'tracking_id': tracking_data['pixel_tracking_id']
    })

    record_exists = newsletter_record is not None
    test_result("Creates newsletter_sends record", record_exists)

    if newsletter_record:
        correct_initial_state = (
            newsletter_record['opened'] == False and
            newsletter_record['open_count'] == 0 and
            newsletter_record['first_opened_at'] is None
        )
        test_result("Newsletter record has correct initial state", correct_initial_state)

    # Verify link click records
    link_records = list(link_clicks_collection.find({'newsletter_id': 'test-newsletter-001'}))
    correct_link_records = len(link_records) == 2
    test_result("Creates link_clicks records", correct_link_records,
                f"Created {len(link_records)} link click records")

except Exception as e:
    test_result("Create newsletter tracking", False, f"Error: {str(e)}")
    import traceback
    traceback.print_exc()

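The assertions above pin down the shape `create_newsletter_tracking` must persist: one unopened `newsletter_sends` document plus one `link_clicks` document per article, keyed by fresh tracking IDs. A minimal sketch of that shape follows; it is illustrative only, uses the collection names this test imports, and omits the opt-out short-circuit exercised by the privacy suite:

```python
# Hedged sketch of the data Test 2 expects create_newsletter_tracking to write.
import uuid
from datetime import datetime

def create_newsletter_tracking(newsletter_id, subscriber_email, article_links=None):
    pixel_id = str(uuid.uuid4())
    newsletter_sends_collection.insert_one({
        'tracking_id': pixel_id,
        'newsletter_id': newsletter_id,
        'subscriber_email': subscriber_email,
        'sent_at': datetime.utcnow(),   # assumed timestamp field
        'opened': False,
        'open_count': 0,
        'first_opened_at': None,
    })

    link_tracking_map = {}
    for article in article_links or []:
        link_id = str(uuid.uuid4())
        link_tracking_map[article['url']] = link_id
        link_clicks_collection.insert_one({
            'tracking_id': link_id,
            'newsletter_id': newsletter_id,
            'subscriber_email': subscriber_email,
            'url': article['url'],
            'clicked': False,
            'clicked_at': None,
        })

    return {'pixel_tracking_id': pixel_id, 'link_tracking_map': link_tracking_map}
```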
# Test 3: Tracking Pixel Endpoint
print("\n" + "-"*80)
print("Test 3: Tracking Pixel Endpoint")
print("-"*80)

try:
    with app.test_client() as client:
        # Test with valid tracking ID
        pixel_tracking_id = tracking_data['pixel_tracking_id']
        response = client.get(f'/api/track/pixel/{pixel_tracking_id}')

        is_png = response.content_type == 'image/png'
        test_result("Returns PNG for valid tracking_id", is_png,
                    f"Content-Type: {response.content_type}")

        is_200 = response.status_code == 200
        test_result("Returns 200 status", is_200, f"Status: {response.status_code}")

        # Verify database was updated
        updated_record = newsletter_sends_collection.find_one({
            'tracking_id': pixel_tracking_id
        })

        was_logged = (
            updated_record and
            updated_record['opened'] == True and
            updated_record['open_count'] == 1 and
            updated_record['first_opened_at'] is not None
        )
        test_result("Logs email open event", was_logged,
                    f"Open count: {updated_record.get('open_count', 0) if updated_record else 0}")

        # Test multiple opens
        response2 = client.get(f'/api/track/pixel/{pixel_tracking_id}')
        updated_record2 = newsletter_sends_collection.find_one({
            'tracking_id': pixel_tracking_id
        })

        handles_multiple = (
            updated_record2 and
            updated_record2['open_count'] == 2 and
            updated_record2['last_opened_at'] != updated_record2['first_opened_at']
        )
        test_result("Handles multiple opens", handles_multiple,
                    f"Open count: {updated_record2.get('open_count', 0) if updated_record2 else 0}")

        # Test with invalid tracking ID
        response3 = client.get('/api/track/pixel/invalid-tracking-id-12345')

        fails_silently = response3.status_code == 200 and response3.content_type == 'image/png'
        test_result("Returns PNG for invalid tracking_id (fails silently)", fails_silently)

except Exception as e:
    test_result("Tracking pixel endpoint", False, f"Error: {str(e)}")
    import traceback
    traceback.print_exc()

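The contract tested here is: always serve a 1×1 PNG with status 200, bump open counters for a known tracking ID, and fail silently for an unknown one. A Flask-style sketch under those assumptions (route wiring and field names are illustrative; it assumes the backend's `app` and `newsletter_sends_collection` are in scope, the same names this test imports):

```python
# Illustrative handler matching the behaviour asserted in Test 3.
import struct
import zlib
from datetime import datetime
from flask import Response

def _png_chunk(ctype, data):
    return (struct.pack('>I', len(data)) + ctype + data
            + struct.pack('>I', zlib.crc32(ctype + data) & 0xffffffff))

# Build a valid 1x1 transparent RGBA PNG once at import time.
PIXEL = (b'\x89PNG\r\n\x1a\n'
         + _png_chunk(b'IHDR', struct.pack('>IIBBBBB', 1, 1, 8, 6, 0, 0, 0))
         + _png_chunk(b'IDAT', zlib.compress(b'\x00\x00\x00\x00\x00'))
         + _png_chunk(b'IEND', b''))

@app.route('/api/track/pixel/<tracking_id>')
def track_pixel(tracking_id):
    try:
        now = datetime.utcnow()
        # Set first_opened_at only once, then always bump the counters.
        newsletter_sends_collection.update_one(
            {'tracking_id': tracking_id, 'first_opened_at': None},
            {'$set': {'first_opened_at': now}})
        newsletter_sends_collection.update_one(
            {'tracking_id': tracking_id},
            {'$set': {'opened': True, 'last_opened_at': now}, '$inc': {'open_count': 1}})
    except Exception:
        pass  # never let a logging failure break image rendering
    return Response(PIXEL, mimetype='image/png')
```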
# Test 4: Link Redirect Endpoint
print("\n" + "-"*80)
print("Test 4: Link Redirect Endpoint")
print("-"*80)

try:
    with app.test_client() as client:
        # Test with valid tracking ID
        article_url = 'https://example.com/article1'
        link_tracking_id = tracking_data['link_tracking_map'][article_url]

        response = client.get(f'/api/track/click/{link_tracking_id}', follow_redirects=False)

        is_redirect = response.status_code == 302
        test_result("Returns 302 redirect", is_redirect, f"Status: {response.status_code}")

        correct_location = response.location == article_url
        test_result("Redirects to correct URL", correct_location,
                    f"Location: {response.location}")

        # Verify database was updated
        click_record = link_clicks_collection.find_one({
            'tracking_id': link_tracking_id
        })

        was_logged = (
            click_record and
            click_record['clicked'] == True and
            click_record['clicked_at'] is not None
        )
        test_result("Logs click event", was_logged)

        # Test with invalid tracking ID
        response2 = client.get('/api/track/click/invalid-tracking-id-12345', follow_redirects=False)

        redirects_on_invalid = response2.status_code == 302
        test_result("Redirects on invalid tracking_id", redirects_on_invalid,
                    f"Redirects to: {response2.location}")

except Exception as e:
    test_result("Link redirect endpoint", False, f"Error: {str(e)}")
    import traceback
    traceback.print_exc()

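The redirect endpoint mirrors the pixel endpoint: log the click if the tracking ID is known, then issue a 302 to the stored destination, falling back to a safe default for unknown IDs. A hedged Flask-style sketch (the fallback URL and field names are assumptions, not the project's actual values):

```python
# Illustrative handler for the behaviour asserted in Test 4.
from datetime import datetime
from flask import redirect

FALLBACK_URL = 'http://localhost:3000'  # assumed safe landing page

@app.route('/api/track/click/<tracking_id>')
def track_click(tracking_id):
    # Mark the click and fetch the stored destination in one round trip.
    record = link_clicks_collection.find_one_and_update(
        {'tracking_id': tracking_id},
        {'$set': {'clicked': True, 'clicked_at': datetime.utcnow()}},
    )
    target = record['url'] if record and record.get('url') else FALLBACK_URL
    return redirect(target, code=302)
```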
# Clean up test data
print("\n" + "-"*80)
print("Cleaning up test data...")
print("-"*80)

try:
    newsletter_sends_collection.delete_many({'newsletter_id': 'test-newsletter-001'})
    link_clicks_collection.delete_many({'newsletter_id': 'test-newsletter-001'})
    print("✓ Test data cleaned up")
except Exception as e:
    print(f"⚠ Error cleaning up: {str(e)}")


# Summary
print("\n" + "="*80)
print("TEST SUMMARY")
print("="*80)
print(f"Total tests: {tests_passed + tests_failed}")
print(f"✓ Passed: {tests_passed}")
print(f"❌ Failed: {tests_failed}")

if tests_failed == 0:
    print("\n🎉 All tests passed!")
else:
    print(f"\n⚠ {tests_failed} test(s) failed")

print("="*80 + "\n")

# Exit with appropriate code
sys.exit(0 if tests_failed == 0 else 1)
208
tests/sender/test_newsletter_tracking.py
Normal file
@@ -0,0 +1,208 @@
|
|||||||
|
#!/usr/bin/env python
|
||||||
|
"""
|
||||||
|
Integration test for newsletter with tracking.
|
||||||
|
Tests the full flow of generating a newsletter with tracking enabled.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import sys
|
||||||
|
from pathlib import Path
|
||||||
|
from datetime import datetime
|
||||||
|
|
||||||
|
# Add backend directory to path
|
||||||
|
backend_dir = Path(__file__).parent.parent / 'backend'
|
||||||
|
sys.path.insert(0, str(backend_dir))
|
||||||
|
|
||||||
|
# Mock the tracking service to avoid database dependency
|
||||||
|
class MockTrackingService:
|
||||||
|
"""Mock tracking service for testing"""
|
||||||
|
|
||||||
|
@staticmethod
|
||||||
|
def create_newsletter_tracking(newsletter_id, subscriber_email, article_links=None):
|
||||||
|
"""Mock create_newsletter_tracking function"""
|
||||||
|
link_tracking_map = {}
|
||||||
|
|
||||||
|
if article_links:
|
||||||
|
for i, article in enumerate(article_links):
|
||||||
|
link_tracking_map[article['url']] = f"mock-link-{i}"
|
||||||
|
|
||||||
|
return {
|
||||||
|
'pixel_tracking_id': 'mock-pixel-123',
|
||||||
|
'link_tracking_map': link_tracking_map,
|
||||||
|
'newsletter_id': newsletter_id,
|
||||||
|
'subscriber_email': subscriber_email
|
||||||
|
}
|
||||||
|
|
||||||
|
# Import after setting up path
|
||||||
|
from tracking_integration import inject_tracking_pixel, replace_article_links, generate_tracking_urls
|
||||||
|
from jinja2 import Template
|
||||||
|
|
||||||
|
|
||||||
|
def test_newsletter_with_tracking():
|
||||||
|
"""Test generating a newsletter with tracking enabled"""
|
||||||
|
print("\n" + "="*70)
|
||||||
|
print("NEWSLETTER TRACKING INTEGRATION TEST")
|
||||||
|
print("="*70)
|
||||||
|
|
||||||
|
# Mock article data
|
||||||
|
articles = [
|
||||||
|
{
|
||||||
|
'title': 'Munich Tech Summit Announces 2025 Dates',
|
||||||
|
'author': 'Tech Reporter',
|
||||||
|
'link': 'https://example.com/tech-summit',
|
||||||
|
'summary': 'The annual Munich Tech Summit will return in 2025 with exciting new features.',
|
||||||
|
'source': 'Munich Tech News',
|
||||||
|
'published_at': datetime.now()
|
||||||
|
},
|
||||||
|
{
|
||||||
|
'title': 'New Public Transport Routes Launched',
|
||||||
|
'author': 'Transport Desk',
|
||||||
|
'link': 'https://example.com/transport-routes',
|
||||||
|
'summary': 'MVG announces three new bus routes connecting suburban areas.',
|
||||||
|
'source': 'Munich Transport',
|
||||||
|
'published_at': datetime.now()
|
||||||
|
}
|
||||||
|
]
|
||||||
|
|
||||||
|
# Configuration
|
||||||
|
newsletter_id = 'test-newsletter-2025-11-11'
|
||||||
|
subscriber_email = 'test@example.com'
|
||||||
|
api_url = 'http://localhost:5001'
|
||||||
|
|
||||||
|
print(f"\nNewsletter ID: {newsletter_id}")
|
||||||
|
print(f"Subscriber: {subscriber_email}")
|
||||||
|
print(f"Articles: {len(articles)}")
|
||||||
|
print(f"API URL: {api_url}")
|
||||||
|
|
||||||
|
# Step 1: Generate tracking URLs
|
||||||
|
print("\n" + "-"*70)
|
||||||
|
print("Step 1: Generate tracking data")
|
||||||
|
print("-"*70)
|
||||||
|
|
||||||
|
tracking_data = generate_tracking_urls(
|
||||||
|
articles=articles,
|
||||||
|
newsletter_id=newsletter_id,
|
||||||
|
subscriber_email=subscriber_email,
|
||||||
|
tracking_service=MockTrackingService
|
||||||
|
)
|
||||||
|
|
||||||
|
print(f"✓ Pixel tracking ID: {tracking_data['pixel_tracking_id']}")
|
||||||
|
print(f"✓ Link tracking map: {len(tracking_data['link_tracking_map'])} links")
|
||||||
|
for url, tracking_id in tracking_data['link_tracking_map'].items():
|
||||||
|
print(f" - {url} → {tracking_id}")
|
||||||
|
|
||||||
|
# Step 2: Load and render template
|
||||||
|
print("\n" + "-"*70)
|
||||||
|
print("Step 2: Render newsletter template")
|
||||||
|
print("-"*70)
|
||||||
|
|
||||||
|
template_path = Path(__file__).parent / 'newsletter_template.html'
|
||||||
|
with open(template_path, 'r', encoding='utf-8') as f:
|
||||||
|
template_content = f.read()
|
||||||
|
|
||||||
|
template = Template(template_content)
|
||||||
|
|
||||||
|
now = datetime.now()
|
||||||
|
template_data = {
|
||||||
|
'date': now.strftime('%A, %B %d, %Y'),
|
||||||
|
'year': now.year,
|
||||||
|
'article_count': len(articles),
|
||||||
|
'articles': articles,
|
||||||
|
'unsubscribe_link': 'http://localhost:3000/unsubscribe',
|
||||||
|
'website_link': 'http://localhost:3000',
|
||||||
|
'tracking_enabled': True
|
||||||
|
}
|
||||||
|
|
||||||
|
html = template.render(**template_data)
|
||||||
|
print("✓ Template rendered")
|
||||||
|
|
||||||
|
# Step 3: Inject tracking pixel
|
||||||
|
print("\n" + "-"*70)
|
||||||
|
print("Step 3: Inject tracking pixel")
|
||||||
|
print("-"*70)
|
||||||
|
|
||||||
|
html = inject_tracking_pixel(
|
||||||
|
html,
|
||||||
|
tracking_data['pixel_tracking_id'],
|
||||||
|
api_url
|
||||||
|
)
|
||||||
|
|
||||||
|
pixel_url = f"{api_url}/api/track/pixel/{tracking_data['pixel_tracking_id']}"
|
||||||
|
if pixel_url in html:
|
||||||
|
print(f"✓ Tracking pixel injected: {pixel_url}")
|
||||||
|
else:
|
||||||
|
print(f"✗ Tracking pixel NOT found")
|
||||||
|
return False
|
||||||
|
|
||||||
|
# Step 4: Replace article links
|
||||||
|
print("\n" + "-"*70)
|
||||||
|
print("Step 4: Replace article links with tracking URLs")
|
||||||
|
print("-"*70)
|
||||||
|
|
||||||
|
html = replace_article_links(
|
||||||
|
html,
|
||||||
|
tracking_data['link_tracking_map'],
|
||||||
|
api_url
|
||||||
|
)
|
||||||
|
|
||||||
|
# Verify all article links were replaced
|
||||||
|
success = True
|
||||||
|
for article in articles:
|
||||||
|
original_url = article['link']
|
||||||
|
tracking_id = tracking_data['link_tracking_map'].get(original_url)
|
||||||
|
|
||||||
|
if tracking_id:
|
||||||
|
tracking_url = f"{api_url}/api/track/click/{tracking_id}"
|
||||||
|
if tracking_url in html:
|
||||||
|
print(f"✓ Link replaced: {original_url}")
|
||||||
|
print(f" → {tracking_url}")
|
||||||
|
else:
|
||||||
|
print(f"✗ Link NOT replaced: {original_url}")
|
||||||
|
success = False
|
||||||
|
|
||||||
|
# Verify original URL is NOT in the HTML (should be replaced)
|
||||||
|
if f'href="{original_url}"' in html:
|
||||||
|
print(f"✗ Original URL still present: {original_url}")
|
||||||
|
success = False
|
||||||
|
|
||||||
|
# Step 5: Verify privacy notice
|
||||||
|
print("\n" + "-"*70)
|
||||||
|
print("Step 5: Verify privacy notice")
|
||||||
|
print("-"*70)
|
||||||
|
|
||||||
|
if "This email contains tracking to measure engagement" in html:
|
||||||
|
print("✓ Privacy notice present in footer")
|
||||||
|
else:
|
||||||
|
print("✗ Privacy notice NOT found")
|
||||||
|
success = False
|
||||||
|
|
||||||
|
# Step 6: Save output for inspection
|
||||||
|
print("\n" + "-"*70)
|
||||||
|
print("Step 6: Save test output")
|
||||||
|
print("-"*70)
|
||||||
|
|
||||||
|
output_file = 'test_newsletter_with_tracking.html'
|
||||||
|
with open(output_file, 'w', encoding='utf-8') as f:
|
||||||
|
f.write(html)
|
||||||
|
|
||||||
|
print(f"✓ Test newsletter saved to: {output_file}")
|
||||||
|
print(f" Open it in your browser to inspect the tracking integration")
|
||||||
|
|
||||||
|
return success
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == '__main__':
|
||||||
|
print("\n" + "="*70)
|
||||||
|
print("TESTING NEWSLETTER WITH TRACKING")
|
||||||
|
print("="*70)
|
||||||
|
|
||||||
|
success = test_newsletter_with_tracking()
|
||||||
|
|
||||||
|
print("\n" + "="*70)
|
||||||
|
if success:
|
||||||
|
print("✓ ALL TESTS PASSED")
|
||||||
|
print("="*70 + "\n")
|
||||||
|
sys.exit(0)
|
||||||
|
else:
|
||||||
|
print("✗ SOME TESTS FAILED")
|
||||||
|
print("="*70 + "\n")
|
||||||
|
sys.exit(1)
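`generate_tracking_urls` is the seam that makes this integration test possible: it accepts a `tracking_service` argument so the MongoDB-backed service can be swapped for `MockTrackingService`. A sketch of that dependency-injection wrapper, assumed from how the test calls it rather than taken from the real module:

```python
# Illustrative wrapper showing the injection seam used by this test.
def generate_tracking_urls(articles, newsletter_id, subscriber_email, tracking_service):
    article_links = [{'url': a['link'], 'title': a['title']} for a in articles]
    return tracking_service.create_newsletter_tracking(
        newsletter_id=newsletter_id,
        subscriber_email=subscriber_email,
        article_links=article_links,
    )
```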
179
tests/sender/test_newsletter_with_tracking.html
Normal file
@@ -0,0 +1,179 @@
|
|||||||
|
<!DOCTYPE html>
|
||||||
|
|
||||||
|
<html lang="en">
|
||||||
|
<head>
|
||||||
|
<meta charset="utf-8"/>
|
||||||
|
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
|
||||||
|
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
|
||||||
|
<title>Munich News Daily</title>
|
||||||
|
<!--[if mso]>
|
||||||
|
<style type="text/css">
|
||||||
|
body, table, td {font-family: Arial, Helvetica, sans-serif !important;}
|
||||||
|
</style>
|
||||||
|
<![endif]-->
|
||||||
|
</head>
|
||||||
|
<body style="margin: 0; padding: 0; background-color: #f4f4f4; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, 'Helvetica Neue', Arial, sans-serif;">
|
||||||
|
<!-- Wrapper Table -->
|
||||||
|
<table border="0" cellpadding="0" cellspacing="0" role="presentation" style="background-color: #f4f4f4;" width="100%">
|
||||||
|
<tr>
|
||||||
|
<td align="center" style="padding: 20px 0;">
|
||||||
|
<!-- Main Container -->
|
||||||
|
<table border="0" cellpadding="0" cellspacing="0" role="presentation" style="background-color: #ffffff; max-width: 600px;" width="600">
|
||||||
|
<!-- Header -->
|
||||||
|
<tr>
|
||||||
|
<td style="background-color: #1a1a1a; padding: 30px 40px; text-align: center;">
|
||||||
|
<h1 style="margin: 0 0 8px 0; font-size: 28px; font-weight: 700; color: #ffffff; letter-spacing: -0.5px;">
|
||||||
|
Munich News Daily
|
||||||
|
</h1>
|
||||||
|
<p style="margin: 0; font-size: 14px; color: #999999; letter-spacing: 0.5px;">
|
||||||
|
Tuesday, November 11, 2025
|
||||||
|
</p>
|
||||||
|
</td>
|
||||||
|
</tr>
|
||||||
|
<!-- Greeting -->
|
||||||
|
<tr>
|
||||||
|
<td style="padding: 30px 40px 20px 40px;">
|
||||||
|
<p style="margin: 0; font-size: 16px; line-height: 1.5; color: #333333;">
|
||||||
|
Good morning ☀️
|
||||||
|
</p>
|
||||||
|
<p style="margin: 15px 0 0 0; font-size: 15px; line-height: 1.6; color: #666666;">
|
||||||
|
Here's what's happening in Munich today. We've summarized 2 stories using AI so you can stay informed in under 5 minutes.
|
||||||
|
</p>
|
||||||
|
</td>
|
||||||
|
</tr>
|
||||||
|
<!-- Divider -->
|
||||||
|
<tr>
|
||||||
|
<td style="padding: 0 40px;">
|
||||||
|
<div style="height: 1px; background-color: #e0e0e0;"></div>
|
||||||
|
</td>
|
||||||
|
</tr>
|
||||||
|
<!-- Articles -->
|
||||||
|
<tr>
|
||||||
|
<td style="padding: 25px 40px;">
|
||||||
|
<!-- Article Number Badge -->
|
||||||
|
<table border="0" cellpadding="0" cellspacing="0" role="presentation" width="100%">
|
||||||
|
<tr>
|
||||||
|
<td>
|
||||||
|
<span style="display: inline-block; background-color: #000000; color: #ffffff; width: 24px; height: 24px; line-height: 24px; text-align: center; border-radius: 50%; font-size: 12px; font-weight: 600;">
|
||||||
|
1
|
||||||
|
</span>
|
||||||
|
</td>
|
||||||
|
</tr>
|
||||||
|
</table>
|
||||||
|
<!-- Article Title -->
|
||||||
|
<h2 style="margin: 12px 0 8px 0; font-size: 19px; font-weight: 700; line-height: 1.3; color: #1a1a1a;">
|
||||||
|
Munich Tech Summit Announces 2025 Dates
|
||||||
|
</h2>
|
||||||
|
<!-- Article Meta -->
|
||||||
|
<p style="margin: 0 0 12px 0; font-size: 13px; color: #999999;">
|
||||||
|
<span style="color: #000000; font-weight: 600;">Munich Tech News</span>
|
||||||
|
<span> • Tech Reporter</span>
|
||||||
|
</p>
|
||||||
|
<!-- Article Summary -->
|
||||||
|
<p style="margin: 0 0 15px 0; font-size: 15px; line-height: 1.6; color: #333333;">
|
||||||
|
The annual Munich Tech Summit will return in 2025 with exciting new features.
|
||||||
|
</p>
|
||||||
|
<!-- Read More Link -->
|
||||||
|
<a href="http://localhost:5001/api/track/click/mock-link-0" style="display: inline-block; color: #000000; text-decoration: none; font-size: 14px; font-weight: 600; border-bottom: 2px solid #000000; padding-bottom: 2px;">
|
||||||
|
Read more →
|
||||||
|
</a>
|
||||||
|
</td>
|
||||||
|
</tr>
|
||||||
|
<!-- Article Divider -->
|
||||||
|
<tr>
|
||||||
|
<td style="padding: 0 40px;">
|
||||||
|
<div style="height: 1px; background-color: #f0f0f0;"></div>
|
||||||
|
</td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td style="padding: 25px 40px;">
|
||||||
|
<!-- Article Number Badge -->
|
||||||
|
<table border="0" cellpadding="0" cellspacing="0" role="presentation" width="100%">
|
||||||
|
<tr>
|
||||||
|
<td>
|
||||||
|
<span style="display: inline-block; background-color: #000000; color: #ffffff; width: 24px; height: 24px; line-height: 24px; text-align: center; border-radius: 50%; font-size: 12px; font-weight: 600;">
|
||||||
|
2
|
||||||
|
</span>
|
||||||
|
</td>
|
||||||
|
</tr>
|
||||||
|
</table>
|
||||||
|
<!-- Article Title -->
|
||||||
|
<h2 style="margin: 12px 0 8px 0; font-size: 19px; font-weight: 700; line-height: 1.3; color: #1a1a1a;">
|
||||||
|
New Public Transport Routes Launched
|
||||||
|
</h2>
|
||||||
|
<!-- Article Meta -->
|
||||||
|
<p style="margin: 0 0 12px 0; font-size: 13px; color: #999999;">
|
||||||
|
<span style="color: #000000; font-weight: 600;">Munich Transport</span>
|
||||||
|
<span> • Transport Desk</span>
|
||||||
|
</p>
|
||||||
|
<!-- Article Summary -->
|
||||||
|
<p style="margin: 0 0 15px 0; font-size: 15px; line-height: 1.6; color: #333333;">
|
||||||
|
MVG announces three new bus routes connecting suburban areas.
|
||||||
|
</p>
|
||||||
|
<!-- Read More Link -->
|
||||||
|
<a href="http://localhost:5001/api/track/click/mock-link-1" style="display: inline-block; color: #000000; text-decoration: none; font-size: 14px; font-weight: 600; border-bottom: 2px solid #000000; padding-bottom: 2px;">
|
||||||
|
Read more →
|
||||||
|
</a>
|
||||||
|
</td>
|
||||||
|
</tr>
|
||||||
|
<!-- Article Divider -->
|
||||||
|
<!-- Bottom Divider -->
|
||||||
|
<tr>
|
||||||
|
<td style="padding: 25px 40px 0 40px;">
|
||||||
|
<div style="height: 1px; background-color: #e0e0e0;"></div>
|
||||||
|
</td>
|
||||||
|
</tr>
|
||||||
|
<!-- Summary Box -->
|
||||||
|
<tr>
|
||||||
|
<td style="padding: 30px 40px;">
|
||||||
|
<table border="0" cellpadding="0" cellspacing="0" role="presentation" style="background-color: #f8f8f8; border-radius: 8px;" width="100%">
|
||||||
|
<tr>
|
||||||
|
<td style="padding: 25px; text-align: center;">
|
||||||
|
<p style="margin: 0 0 8px 0; font-size: 13px; color: #666666; text-transform: uppercase; letter-spacing: 1px; font-weight: 600;">
|
||||||
|
Today's Digest
|
||||||
|
</p>
|
||||||
|
<p style="margin: 0; font-size: 36px; font-weight: 700; color: #000000;">
|
||||||
|
2
|
||||||
|
</p>
|
||||||
|
<p style="margin: 8px 0 0 0; font-size: 14px; color: #666666;">
|
||||||
|
stories • AI-summarized • 5 min read
|
||||||
|
</p>
|
||||||
|
</td>
|
||||||
|
</tr>
|
||||||
|
</table>
|
||||||
|
</td>
|
||||||
|
</tr>
|
||||||
|
<!-- Footer -->
|
||||||
|
<tr>
|
||||||
|
<td style="background-color: #1a1a1a; padding: 30px 40px; text-align: center;">
|
||||||
|
<p style="margin: 0 0 15px 0; font-size: 14px; color: #ffffff; font-weight: 600;">
|
||||||
|
Munich News Daily
|
||||||
|
</p>
|
||||||
|
<p style="margin: 0 0 20px 0; font-size: 13px; color: #999999; line-height: 1.5;">
|
||||||
|
AI-powered news summaries for busy people.<br/>
|
||||||
|
Delivered daily to your inbox.
|
||||||
|
</p>
|
||||||
|
<!-- Footer Links -->
|
||||||
|
<p style="margin: 0; font-size: 12px; color: #666666;">
|
||||||
|
<a href="http://localhost:3000" style="color: #999999; text-decoration: none;">Visit Website</a>
|
||||||
|
<span style="color: #444444;"> • </span>
|
||||||
|
<a href="http://localhost:3000/unsubscribe" style="color: #999999; text-decoration: none;">Unsubscribe</a>
|
||||||
|
</p>
|
||||||
|
<!-- Privacy Notice -->
|
||||||
|
<p style="margin: 20px 0 0 0; font-size: 11px; color: #666666; line-height: 1.4;">
|
||||||
|
This email contains tracking to measure engagement and improve our content.<br/>
|
||||||
|
We respect your privacy and anonymize data after 90 days.
|
||||||
|
</p>
|
||||||
|
<p style="margin: 20px 0 0 0; font-size: 11px; color: #666666;">
|
||||||
|
© 2025 Munich News Daily. All rights reserved.
|
||||||
|
</p>
|
||||||
|
</td>
|
||||||
|
</tr>
|
||||||
|
</table>
|
||||||
|
<!-- End Main Container -->
|
||||||
|
</td>
|
||||||
|
</tr>
|
||||||
|
</table>
|
||||||
|
<!-- End Wrapper Table -->
|
||||||
|
<img alt="" height="1" src="http://localhost:5001/api/track/pixel/mock-pixel-123" style="display:block;" width="1"/></body>
|
||||||
|
</html>
|
||||||
187
tests/sender/test_tracking_integration.py
Normal file
@@ -0,0 +1,187 @@
|
|||||||
|
#!/usr/bin/env python
|
||||||
|
"""
|
||||||
|
Test script for tracking integration in newsletter sender.
|
||||||
|
Tests tracking pixel injection and link replacement.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import sys
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
# Add backend directory to path
|
||||||
|
backend_dir = Path(__file__).parent.parent / 'backend'
|
||||||
|
sys.path.insert(0, str(backend_dir))
|
||||||
|
|
||||||
|
from tracking_integration import inject_tracking_pixel, replace_article_links
|
||||||
|
|
||||||
|
|
||||||
|
def test_inject_tracking_pixel():
|
||||||
|
"""Test that tracking pixel is correctly injected into HTML"""
|
||||||
|
print("\n" + "="*70)
|
||||||
|
print("TEST 1: Inject Tracking Pixel")
|
||||||
|
print("="*70)
|
||||||
|
|
||||||
|
# Test HTML
|
||||||
|
html = """<html>
|
||||||
|
<body>
|
||||||
|
<p>Newsletter content</p>
|
||||||
|
</body>
|
||||||
|
</html>"""
|
||||||
|
|
||||||
|
tracking_id = "test-tracking-123"
|
||||||
|
api_url = "http://localhost:5001"
|
||||||
|
|
||||||
|
# Inject pixel
|
||||||
|
result = inject_tracking_pixel(html, tracking_id, api_url)
|
||||||
|
|
||||||
|
# Verify pixel is present
|
||||||
|
expected_pixel = f'<img src="{api_url}/api/track/pixel/{tracking_id}" width="1" height="1" alt="" style="display:block;" />'
|
||||||
|
|
||||||
|
if expected_pixel in result:
|
||||||
|
print("✓ Tracking pixel correctly injected")
|
||||||
|
print(f" Pixel URL: {api_url}/api/track/pixel/{tracking_id}")
|
||||||
|
return True
|
||||||
|
else:
|
||||||
|
print("✗ Tracking pixel NOT found in HTML")
|
||||||
|
print(f" Expected: {expected_pixel}")
|
||||||
|
print(f" Result: {result}")
|
||||||
|
return False
|
||||||
|
|
||||||
|
|
||||||
|
def test_replace_article_links():
|
||||||
|
"""Test that article links are correctly replaced with tracking URLs"""
|
||||||
|
print("\n" + "="*70)
|
||||||
|
print("TEST 2: Replace Article Links")
|
||||||
|
print("="*70)
|
||||||
|
|
||||||
|
# Test HTML with article links
|
||||||
|
html = """<html>
|
||||||
|
<body>
|
||||||
|
<a href="https://example.com/article1">Article 1</a>
|
||||||
|
<a href="https://example.com/article2">Article 2</a>
|
||||||
|
<a href="https://example.com/untracked">Untracked Link</a>
|
||||||
|
</body>
|
||||||
|
</html>"""
|
||||||
|
|
||||||
|
# Tracking map
|
||||||
|
link_tracking_map = {
|
||||||
|
"https://example.com/article1": "track-id-1",
|
||||||
|
"https://example.com/article2": "track-id-2"
|
||||||
|
}
|
||||||
|
|
||||||
|
api_url = "http://localhost:5001"
|
||||||
|
|
||||||
|
# Replace links
|
||||||
|
result = replace_article_links(html, link_tracking_map, api_url)
|
||||||
|
|
||||||
|
# Verify replacements
|
||||||
|
success = True
|
||||||
|
|
||||||
|
# Check article 1 link
|
||||||
|
expected_url_1 = f"{api_url}/api/track/click/track-id-1"
|
||||||
|
if expected_url_1 in result:
|
||||||
|
print(f"✓ Article 1 link replaced: {expected_url_1}")
|
||||||
|
else:
|
||||||
|
print(f"✗ Article 1 link NOT replaced")
|
||||||
|
success = False
|
||||||
|
|
||||||
|
# Check article 2 link
|
||||||
|
expected_url_2 = f"{api_url}/api/track/click/track-id-2"
|
||||||
|
if expected_url_2 in result:
|
||||||
|
print(f"✓ Article 2 link replaced: {expected_url_2}")
|
||||||
|
else:
|
||||||
|
print(f"✗ Article 2 link NOT replaced")
|
||||||
|
success = False
|
||||||
|
|
||||||
|
# Check untracked link remains unchanged
|
||||||
|
if "https://example.com/untracked" in result:
|
||||||
|
print(f"✓ Untracked link preserved: https://example.com/untracked")
|
||||||
|
else:
|
||||||
|
print(f"✗ Untracked link was modified (should remain unchanged)")
|
||||||
|
success = False
|
||||||
|
|
||||||
|
return success
|
||||||
|
|
||||||
|
|
||||||
|
def test_full_integration():
|
||||||
|
"""Test full integration: pixel + link replacement"""
|
||||||
|
print("\n" + "="*70)
|
||||||
|
print("TEST 3: Full Integration (Pixel + Links)")
|
||||||
|
print("="*70)
|
||||||
|
|
||||||
|
# Test HTML
|
||||||
|
html = """<html>
|
||||||
|
<body>
|
||||||
|
<h1>Newsletter</h1>
|
||||||
|
<a href="https://example.com/article">Read Article</a>
|
||||||
|
</body>
|
||||||
|
</html>"""
|
||||||
|
|
||||||
|
api_url = "http://localhost:5001"
|
||||||
|
pixel_tracking_id = "pixel-123"
|
||||||
|
link_tracking_map = {
|
||||||
|
"https://example.com/article": "link-456"
|
||||||
|
}
|
||||||
|
|
||||||
|
# First inject pixel
|
||||||
|
html = inject_tracking_pixel(html, pixel_tracking_id, api_url)
|
||||||
|
|
||||||
|
# Then replace links
|
||||||
|
html = replace_article_links(html, link_tracking_map, api_url)
|
||||||
|
|
||||||
|
# Verify both are present
|
||||||
|
success = True
|
||||||
|
|
||||||
|
pixel_url = f"{api_url}/api/track/pixel/{pixel_tracking_id}"
|
||||||
|
if pixel_url in html:
|
||||||
|
print(f"✓ Tracking pixel present: {pixel_url}")
|
||||||
|
else:
|
||||||
|
print(f"✗ Tracking pixel NOT found")
|
||||||
|
success = False
|
||||||
|
|
||||||
|
link_url = f"{api_url}/api/track/click/link-456"
|
||||||
|
if link_url in html:
|
||||||
|
print(f"✓ Tracking link present: {link_url}")
|
||||||
|
else:
|
||||||
|
print(f"✗ Tracking link NOT found")
|
||||||
|
success = False
|
||||||
|
|
||||||
|
if success:
|
||||||
|
print("\n✓ Full integration successful!")
|
||||||
|
print("\nFinal HTML:")
|
||||||
|
print("-" * 70)
|
||||||
|
print(html)
|
||||||
|
print("-" * 70)
|
||||||
|
|
||||||
|
return success
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == '__main__':
|
||||||
|
print("\n" + "="*70)
|
||||||
|
print("TRACKING INTEGRATION TEST SUITE")
|
||||||
|
print("="*70)
|
||||||
|
|
||||||
|
results = []
|
||||||
|
|
||||||
|
# Run tests
|
||||||
|
results.append(("Inject Tracking Pixel", test_inject_tracking_pixel()))
|
||||||
|
results.append(("Replace Article Links", test_replace_article_links()))
|
||||||
|
results.append(("Full Integration", test_full_integration()))
|
||||||
|
|
||||||
|
# Summary
|
||||||
|
print("\n" + "="*70)
|
||||||
|
print("TEST SUMMARY")
|
||||||
|
print("="*70)
|
||||||
|
|
||||||
|
passed = sum(1 for _, result in results if result)
|
||||||
|
total = len(results)
|
||||||
|
|
||||||
|
for test_name, result in results:
|
||||||
|
status = "✓ PASS" if result else "✗ FAIL"
|
||||||
|
print(f"{status}: {test_name}")
|
||||||
|
|
||||||
|
print("-" * 70)
|
||||||
|
print(f"Results: {passed}/{total} tests passed")
|
||||||
|
print("="*70 + "\n")
|
||||||
|
|
||||||
|
# Exit with appropriate code
|
||||||
|
sys.exit(0 if passed == total else 1)
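Taken together, these assertions fully specify the two helpers: the pixel tag goes in immediately before `</body>`, and only hrefs present in the tracking map are rewritten. A minimal string-based sketch that would satisfy the suite above; the real `tracking_integration` module may be implemented differently:

```python
# Hedged re-implementation of the two helpers exercised by this suite.
def inject_tracking_pixel(html, tracking_id, api_url):
    pixel = (f'<img src="{api_url}/api/track/pixel/{tracking_id}" '
             f'width="1" height="1" alt="" style="display:block;" />')
    # Place the pixel just before </body>; append if no body tag is present.
    if '</body>' in html:
        return html.replace('</body>', pixel + '</body>', 1)
    return html + pixel

def replace_article_links(html, link_tracking_map, api_url):
    # Only rewrite links we have a tracking ID for; leave everything else alone.
    for original_url, tracking_id in link_tracking_map.items():
        tracking_url = f'{api_url}/api/track/click/{tracking_id}'
        html = html.replace(f'href="{original_url}"', f'href="{tracking_url}"')
    return html
```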