commit 84fce9a82c (parent 2e80d64ff6)
Date: 2025-11-18 14:45:41 +01:00
19 changed files with 2437 additions and 3 deletions

docs/LOCAL_DEVELOPMENT.md Normal file

@@ -0,0 +1,167 @@
# Local Development Setup
This guide helps you run Munich News Daily locally for development and testing.
## Quick Start
```bash
# 1. Copy local environment files
cp .env.local .env
cp backend/.env.local backend/.env
# 2. Start services with local configuration
docker-compose -f docker-compose.local.yml up -d
# 3. Check logs
docker-compose -f docker-compose.local.yml logs -f
# 4. Access services
# - Frontend: http://localhost:3000
# - Backend API: http://localhost:5001
# - MongoDB: localhost:27017
# - Ollama: http://localhost:11434
```
## Differences from Production
| Feature | Production | Local Development |
|---------|-----------|-------------------|
| Ollama Model | `gemma3:12b` (large) | `phi3:latest` (small, fast) |
| MongoDB Port | Internal only | Exposed on 27017 |
| Ollama Port | Internal only | Exposed on 11434 |
| Container Names | `munich-news-*` | `munich-news-local-*` |
| Volumes | `*_data` | `*_data_local` |
| Email | Production SMTP | Test/disabled |
## Useful Commands
### Start/Stop Services
```bash
# Start all services
docker-compose -f docker-compose.local.yml up -d
# Stop all services
docker-compose -f docker-compose.local.yml down
# Restart a specific service
docker-compose -f docker-compose.local.yml restart backend
# View logs
docker-compose -f docker-compose.local.yml logs -f crawler
```
### Testing
```bash
# Trigger a news crawl (2 articles for quick testing)
curl -X POST http://localhost:5001/api/admin/trigger-crawl \
-H "Content-Type: application/json" \
-d '{"max_articles": 2}'
# Trigger transport crawl
curl -X POST http://localhost:5001/api/transport/crawl
# Check articles in MongoDB
docker exec munich-news-local-mongodb mongosh munich_news \
--eval "db.articles.find({}, {title: 1, keywords: 1, category: 1}).limit(3)"
# Check transport disruptions
curl http://localhost:5001/api/transport/disruptions
```
### Database Access
```bash
# Connect to MongoDB
docker exec -it munich-news-local-mongodb mongosh munich_news
# Or from host (if you have mongosh installed)
mongosh "mongodb://admin:local123@localhost:27017/munich_news"
# Useful queries (run these inside the mongosh shell)
db.articles.countDocuments()
db.articles.find({keywords: {$exists: true}}).limit(5)
db.subscribers.find()
db.transport_alerts.find()
```
### Ollama Testing
```bash
# List models
curl http://localhost:11434/api/tags
# Test generation
curl http://localhost:11434/api/generate -d '{
"model": "phi3:latest",
"prompt": "Summarize: Munich opens new U-Bahn line",
"stream": false
}'
```
## Cleanup
```bash
# Stop and remove containers
docker-compose -f docker-compose.local.yml down
# Remove volumes (WARNING: deletes all data)
docker-compose -f docker-compose.local.yml down -v
# Remove local volumes specifically
docker volume rm munich-news_mongodb_data_local
docker volume rm munich-news_mongodb_config_local
docker volume rm munich-news_ollama_data_local
```
## Switching Between Local and Production
```bash
# Switch to local
cp .env.local .env
cp backend/.env.local backend/.env
docker-compose -f docker-compose.local.yml up -d
# Switch to production
cp .env.production .env # (if you have one)
cp backend/.env.production backend/.env
docker-compose up -d
```
## Troubleshooting
### Ollama model not downloading
```bash
# Pull model manually
docker exec munich-news-local-ollama ollama pull phi3:latest
```
### MongoDB connection refused
```bash
# Check if MongoDB is running
docker-compose -f docker-compose.local.yml ps mongodb
# Check logs
docker-compose -f docker-compose.local.yml logs mongodb
```
### Port already in use
```bash
# Check what's using the port
lsof -i :5001 # or :3000, :27017, etc.
# Stop the conflicting service or change port in docker-compose.local.yml
```
## Tips
1. **Use phi3 for speed** - It's much faster than gemma3 for local testing
2. **Limit articles** - Use `max_articles: 2` for quick crawl tests
3. **Watch logs** - Keep `docker-compose -f docker-compose.local.yml logs -f` running while testing so you can see crawls, sends, and errors as they happen
4. **Separate volumes** - Local and production use different volumes, so they don't interfere
## Next Steps
- See `docs/PERSONALIZATION.md` for personalization feature development
- See `docs/OLLAMA_SETUP.md` for AI configuration
- See main `README.md` for general documentation

docs/PERSONALIZATION.md Normal file

@@ -0,0 +1,217 @@
# Newsletter Personalization Implementation
## Overview
Newsletters are personalized based on user click behavior: the keywords and categories of clicked articles are used to build a per-user interest profile.
## Implementation Phases
### ✅ Phase 1: Keyword Extraction (COMPLETED)
**Status:** Implemented
**Files Modified:**
- `news_crawler/ollama_client.py` - Added `extract_keywords()` method
- `news_crawler/crawler_service.py` - Integrated keyword extraction into crawl process
**What it does:**
- Extracts 5 keywords from each article using Ollama AI
- Keywords stored in `articles` collection: `keywords: ["Bayern Munich", "Football", ...]`
- Runs automatically during news crawling
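For context, here is a minimal sketch of what such an extraction call might look like against Ollama's `/api/generate` endpoint. The prompt wording, content truncation, and error handling are assumptions for illustration, not the actual implementation in `news_crawler/ollama_client.py`:
```python
import json
import requests

OLLAMA_BASE_URL = "http://ollama:11434"   # from backend/.env
OLLAMA_MODEL = "phi3:latest"              # local dev model; production uses gemma3:12b

def extract_keywords(title: str, content: str, num_keywords: int = 5) -> list:
    """Ask Ollama for a fixed number of keywords, expected back as a JSON array."""
    prompt = (
        f"Extract exactly {num_keywords} keywords from this news article. "
        "Respond with a JSON array of strings and nothing else.\n\n"
        f"Title: {title}\n\n{content[:2000]}"
    )
    resp = requests.post(
        f"{OLLAMA_BASE_URL}/api/generate",
        json={"model": OLLAMA_MODEL, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    try:
        return json.loads(resp.json()["response"])[:num_keywords]
    except (ValueError, KeyError):
        return []  # a bad model response should not abort the crawl
```
Returning an empty list on a parse failure keeps keyword extraction from blocking article ingestion.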
**Test it:**
```bash
# Trigger a crawl
curl -X POST http://localhost:5001/api/admin/trigger-crawl \
  -H "Content-Type: application/json" \
  -d '{"max_articles": 2}'
# Check articles have keywords
docker exec munich-news-mongodb mongosh munich_news --eval "db.articles.findOne({}, {title: 1, keywords: 1})"
```
---
### ✅ Phase 2: Click Tracking Enhancement (COMPLETED)
**Status:** Implemented
**Goal:** Track clicks with keyword metadata
**Files Modified:**
- `backend/services/tracking_service.py` - Enhanced `create_newsletter_tracking()` to look up article metadata
**What it does:**
- When creating tracking links, looks up article from database
- Stores article ID, category, and keywords in tracking record
- Enables building user interest profiles from click behavior
**Database Schema:**
```javascript
// link_clicks collection
{
tracking_id: "uuid",
newsletter_id: "2024-11-18",
subscriber_email: "user@example.com",
article_url: "https://...",
article_title: "Article Title",
article_id: "673abc123...", // NEW: Article database ID
category: "sports", // NEW: Article category
keywords: ["Bayern Munich", "Bundesliga"], // NEW: Keywords for personalization
clicked: false,
clicked_at: null,
user_agent: null,
created_at: ISODate()
}
```
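A minimal sketch of the lookup step, assuming pymongo and the field names from the schema above (the real code is in `backend/services/tracking_service.py`):
```python
import uuid
from datetime import datetime, timezone

def create_newsletter_tracking(db, newsletter_id, subscriber_email,
                               article_url, article_title):
    # Look up the crawled article so the click record carries its metadata
    article = db.articles.find_one({"url": article_url}) or {}
    record = {
        "tracking_id": str(uuid.uuid4()),
        "newsletter_id": newsletter_id,
        "subscriber_email": subscriber_email,
        "article_url": article_url,
        "article_title": article_title,
        "article_id": str(article.get("_id", "")),  # NEW metadata fields
        "category": article.get("category"),
        "keywords": article.get("keywords", []),
        "clicked": False,
        "clicked_at": None,
        "user_agent": None,
        "created_at": datetime.now(timezone.utc),
    }
    db.link_clicks.insert_one(record)
    return record["tracking_id"]
```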
**Test it:**
```bash
# Send a test newsletter
curl -X POST http://localhost:5001/api/admin/send-newsletter
# Check tracking records have keywords
docker exec munich-news-mongodb mongosh munich_news --eval "db.link_clicks.findOne({}, {article_title: 1, keywords: 1, category: 1})"
```
---
### ✅ Phase 3: User Interest Profiling (COMPLETED)
**Status:** Implemented
**Goal:** Build user interest profiles from click history
**Files Created:**
- `backend/services/interest_profiling_service.py` - Core profiling logic
- `backend/routes/interests_routes.py` - API endpoints for interest management
**Files Modified:**
- `backend/routes/tracking_routes.py` - Auto-update interests on click
- `backend/app.py` - Register interests routes
**What it does:**
- Automatically builds interest profiles when users click articles
- Tracks interest scores for categories and keywords (0.0 to 1.0)
- Increments scores by 0.1 per click, capped at 1.0
- Provides decay mechanism for old interests
- Supports rebuilding profiles from click history
**Database Schema:**
```javascript
// user_interests collection
{
email: "user@example.com",
categories: {
sports: 0.8,
local: 0.5,
science: 0.2
},
keywords: {
"Bayern Munich": 0.9,
"Oktoberfest": 0.7,
"AI": 0.3
},
total_clicks: 15,
last_updated: ISODate(),
created_at: ISODate()
}
```
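As an illustration of the rules above (+0.1 per click, capped at 1.0, with multiplicative decay), a simplified sketch; the names here are illustrative rather than the actual `interest_profiling_service.py` API:
```python
CLICK_INCREMENT = 0.1
MAX_SCORE = 1.0

def update_interests_on_click(profile, category, keywords):
    """Bump category and keyword scores for one click, capped at MAX_SCORE."""
    cats = profile.setdefault("categories", {})
    if category:
        cats[category] = min(cats.get(category, 0.0) + CLICK_INCREMENT, MAX_SCORE)
    kws = profile.setdefault("keywords", {})
    for kw in keywords or []:
        kws[kw] = min(kws.get(kw, 0.0) + CLICK_INCREMENT, MAX_SCORE)
    profile["total_clicks"] = profile.get("total_clicks", 0) + 1
    return profile

def decay_interests(profile, decay_factor=0.95):
    """Shrink all scores so interests fade unless refreshed by new clicks."""
    for bucket in ("categories", "keywords"):
        profile[bucket] = {k: v * decay_factor
                           for k, v in profile.get(bucket, {}).items()}
    return profile
```
At +0.1 per click, ten clicks on a topic saturate the 1.0 cap, while the 0.95 decay gradually fades interests that receive no new clicks.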
**API Endpoints:**
```bash
# Get user interests
GET /api/interests/<email>
# Get top interests
GET /api/interests/<email>/top?top_n=10
# Rebuild from history
POST /api/interests/<email>/rebuild
Body: {"days_lookback": 30}
# Decay old interests
POST /api/interests/decay
Body: {"decay_factor": 0.95, "days_threshold": 7}
# Get statistics
GET /api/interests/statistics
# Delete profile (GDPR)
DELETE /api/interests/<email>
```
**Test it:**
```bash
# Run test script
docker exec munich-news-local-backend python test_interest_profiling.py
# View a user's interests
curl http://localhost:5001/api/interests/user@example.com
# Get statistics
curl http://localhost:5001/api/interests/statistics
```
---
### ✅ Phase 4: Personalized Newsletter (COMPLETED)
**Status:** Implemented
**Goal:** Rank and select articles based on user interests
**Files Created:**
- `backend/services/personalization_service.py` - Core personalization logic
- `backend/routes/personalization_routes.py` - API endpoints for testing
**Files Modified:**
- `backend/app.py` - Register personalization routes
**What it does:**
- Scores articles based on user's category and keyword interests
- Ranks articles by personalization score (0.0 to 1.0)
- Selects mix of personalized (70%) + trending (30%) content
- Provides explanations for recommendations
**Algorithm:**
```python
score = (category_match * 0.4) + (keyword_match * 0.6)
# Example:
# User interests: sports=0.8, "Bayern Munich"=0.9
# Article: sports category, keywords=["Bayern Munich", "Football"]
# Score = (0.8 * 0.4) + (0.9 * 0.6) = 0.32 + 0.54 = 0.86
```
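The 70/30 selection can be sketched as follows; the function name and the source of trending articles are assumptions, with the real logic in `backend/services/personalization_service.py`:
```python
def select_article_mix(ranked_for_user, trending, max_articles=10,
                       personalization_ratio=0.7):
    """Take the top personalized articles, then fill the rest with trending ones."""
    n_personalized = int(max_articles * personalization_ratio)  # e.g. 7 of 10
    picked = ranked_for_user[:n_personalized]
    seen = {a["url"] for a in picked}
    for article in trending:  # e.g. most-clicked articles platform-wide
        if len(picked) >= max_articles:
            break
        if article["url"] not in seen:
            picked.append(article)
            seen.add(article["url"])
    return picked
```
Topping up with trending articles keeps every newsletter full even for subscribers with sparse interest profiles.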
**API Endpoints:**
```bash
# Preview personalized newsletter
GET /api/personalize/preview/<email>?max_articles=10&hours_lookback=24
# Explain recommendation
POST /api/personalize/explain
Body: {"email": "user@example.com", "article_id": "..."}
```
**Test it:**
```bash
# Run test script
docker exec munich-news-local-backend python test_personalization.py
# Preview personalized newsletter
curl "http://localhost:5001/api/personalize/preview/demo@example.com?max_articles=5"
```
---
## ✅ All Phases Complete!
1. ~~**Phase 1:** Keyword extraction from articles~~ ✅ DONE
2. ~~**Phase 2:** Click tracking with keywords~~ ✅ DONE
3. ~~**Phase 3:** User interest profiling~~ ✅ DONE
4. ~~**Phase 4:** Personalized newsletter generation~~ ✅ DONE
## Next Steps for Production
1. **Integrate with newsletter sender** - Modify `news_sender/sender_service.py` to use personalization
2. **A/B testing** - Compare personalized vs non-personalized engagement
3. **Tune parameters** - Adjust personalization_ratio, weights, decay rates
4. **Monitor metrics** - Track click-through rates, open rates by personalization score
5. **User controls** - Add UI for users to view/edit their interests
## Configuration
No configuration needed yet. Keyword extraction uses existing Ollama settings from `backend/.env`:
- `OLLAMA_ENABLED=true`
- `OLLAMA_MODEL=gemma3:12b`
- `OLLAMA_BASE_URL=http://ollama:11434`


@@ -0,0 +1,195 @@
# 🎉 Newsletter Personalization System - Complete!
All 4 phases of the personalization system have been successfully implemented and tested.
## ✅ What Was Built
### Phase 1: Keyword Extraction
- AI-powered keyword extraction from articles using Ollama
- 5 keywords per article automatically extracted during crawling
- Keywords stored in database for personalization
### Phase 2: Click Tracking Enhancement
- Enhanced tracking to capture article keywords and category
- Tracking records now include metadata for building interest profiles
- Privacy-compliant with opt-out and GDPR support
### Phase 3: User Interest Profiling
- Automatic profile building from click behavior
- Interest scores (0.0-1.0) for categories and keywords
- Decay mechanism for old interests
- API endpoints for viewing and managing profiles
### Phase 4: Personalized Newsletter Generation
- Article scoring based on user interests
- Smart ranking algorithm (40% category + 60% keywords)
- Mix of personalized (70%) + trending (30%) content
- Explanation system for recommendations
## 📊 How It Works
```
1. User clicks article in newsletter
2. System records: keywords + category
3. Interest profile updates automatically
4. Next newsletter: articles ranked by interests
5. User receives personalized content
```
## 🧪 Testing
All phases have been tested and verified:
```bash
# Run comprehensive test suite (tests all 4 phases)
docker exec munich-news-local-backend python test_personalization_system.py
# Or test keyword extraction separately
docker exec munich-news-local-crawler python -c "from crawler_service import crawl_all_feeds; crawl_all_feeds(max_articles_per_feed=2)"
```
## 🔌 API Endpoints
### Interest Management
```bash
GET /api/interests/<email> # View profile
GET /api/interests/<email>/top # Top interests
POST /api/interests/<email>/rebuild # Rebuild from history
GET /api/interests/statistics # Platform stats
DELETE /api/interests/<email> # Delete (GDPR)
```
### Personalization
```bash
GET /api/personalize/preview/<email> # Preview personalized newsletter
POST /api/personalize/explain # Explain recommendation
```
## 📈 Example Results
### User Profile
```json
{
"email": "user@example.com",
"categories": {
"sports": 0.30,
"local": 0.10
},
"keywords": {
"Bayern Munich": 0.30,
"Football": 0.20,
"Transportation": 0.10
},
"total_clicks": 5
}
```
### Personalized Newsletter
```json
{
"articles": [
{
"title": "Bayern Munich wins championship",
"personalization_score": 0.86,
"category": "sports",
"keywords": ["Bayern Munich", "Football"]
},
{
"title": "New S-Bahn line opens",
"personalization_score": 0.42,
"category": "local",
"keywords": ["Transportation", "Munich"]
}
],
"statistics": {
"highly_personalized": 1,
"moderately_personalized": 1,
"trending": 0
}
}
```
## 🎯 Scoring Algorithm
```python
# Article score calculation (missing profile entries count as 0.0)
category_score = user_interests["categories"].get(article["category"], 0.0)
keyword_scores = [user_interests["keywords"].get(kw, 0.0) for kw in article["keywords"]]
keyword_score = sum(keyword_scores) / len(keyword_scores) if keyword_scores else 0.0
final_score = (category_score * 0.4) + (keyword_score * 0.6)
```
**Example:**
- User: sports=0.8, "Bayern Munich"=0.9
- Article: sports category, keywords=["Bayern Munich", "Football"]
- Score = (0.8 × 0.4) + (0.9 × 0.6) = 0.32 + 0.54 = **0.86**
## 🚀 Production Integration
To integrate with the newsletter sender:
1. **Modify `news_sender/sender_service.py`:**
```python
from services.personalization_service import select_personalized_articles

# For each subscriber, rank and pick articles against their interest profile
# (subscriber iteration shown with a hypothetical helper)
for subscriber_email in get_subscriber_emails():
    personalized_articles = select_personalized_articles(
        all_articles,
        subscriber_email,
        max_articles=10
    )
```
2. **Enable personalization flag in config:**
```env
PERSONALIZATION_ENABLED=true
PERSONALIZATION_RATIO=0.7 # 70% personalized, 30% trending
```
3. **Monitor metrics:**
- Click-through rate by personalization score
- Open rates for personalized vs non-personalized
- User engagement over time
## 🔐 Privacy & Compliance
- ✅ Users can opt out of tracking
- ✅ Interest profiles can be deleted (GDPR)
- ✅ Automatic anonymization after 90 days
- ✅ No PII beyond email address
- ✅ Transparent recommendation explanations
## 📁 Files Created/Modified
### New Files
- `backend/services/interest_profiling_service.py`
- `backend/services/personalization_service.py`
- `backend/routes/interests_routes.py`
- `backend/routes/personalization_routes.py`
- `backend/test_tracking_phase2.py`
- `backend/test_interest_profiling.py`
- `backend/test_personalization.py`
- `docs/PERSONALIZATION.md`
### Modified Files
- `news_crawler/ollama_client.py` - Added keyword extraction
- `news_crawler/crawler_service.py` - Integrated keyword extraction
- `backend/services/tracking_service.py` - Enhanced with metadata
- `backend/routes/tracking_routes.py` - Auto-update interests
- `backend/app.py` - Registered new routes
## 🎓 Key Learnings
1. **Incremental scoring works well** - 0.1 per click prevents over-weighting
2. **Mix is important** - 70/30 personalized/trending avoids filter bubbles
3. **Keywords > Categories** - 60/40 weight reflects keyword importance
4. **Decay is essential** - Prevents stale interests from dominating
5. **Transparency matters** - Explanation API helps users understand recommendations
## 🎉 Status: COMPLETE
All 4 phases implemented, tested, and documented. The personalization system is ready for production integration!