Compare commits

...

3 Commits

Author SHA1 Message Date
7346ee9de2 update 2025-12-10 15:57:07 +00:00 (all checks were successful: dongho-repo/Munich-news/pipeline/head)
6e9fbe44c4 update 2025-12-10 15:52:41 +00:00
4e8b60f77c update 2025-12-10 15:50:11 +00:00
15 changed files with 439 additions and 617 deletions

Jenkinsfile (vendored)

@@ -0,0 +1,25 @@
pipeline {
    agent any
    stages {
        stage('Security Scan') {
            steps {
                withCredentials([string(credentialsId: 'nvd-api-key', variable: 'NVD_API_KEY')]) {
                    // Run OWASP Dependency Check using the specific installation configured in Jenkins
                    // Using NVD API Key to avoid rate limiting
                    dependencyCheck additionalArguments: "--scan ./ --format ALL --nvdApiKey ${NVD_API_KEY}", odcInstallation: 'depcheck'
                }
            }
        }
    }
    post {
        always {
            // Publish the results
            dependencyCheckPublisher pattern: 'dependency-check-report.xml'
            // Archive the reports
            archiveArtifacts allowEmptyArchive: true, artifacts: 'dependency-check-report.html'
        }
    }
}


@@ -1,56 +1,36 @@
# Quick Start Guide # Quick Start Guide
Get Munich News Daily running in 5 minutes! Get Munich News Daily running in 5 minutes!
## Prerequisites ## 📋 Prerequisites
- **Docker** & **Docker Compose** installed
- **4GB+ RAM** (for AI models)
- *(Optional)* NVIDIA GPU for faster processing
- Docker & Docker Compose installed ## 🚀 Setup Steps
- 4GB+ RAM (for Ollama AI models)
- (Optional) NVIDIA GPU for 5-10x faster AI processing
## Setup
### 1. Configure Environment ### 1. Configure Environment
```bash ```bash
# Copy example environment file
cp backend/.env.example backend/.env cp backend/.env.example backend/.env
# Edit with your settings (required: email configuration)
nano backend/.env nano backend/.env
``` ```
**Required:** Update `SMTP_SERVER`, `EMAIL_USER`, and `EMAIL_PASSWORD`.
**Minimum required settings:** ### 2. Start the System
```env
SMTP_SERVER=smtp.gmail.com
SMTP_PORT=587
EMAIL_USER=your-email@gmail.com
EMAIL_PASSWORD=your-app-password
```
### 2. Start System
```bash ```bash
# Option 1: Auto-detect GPU and start (recommended) # Auto-detects GPU capabilities and starts services
./start-with-gpu.sh ./start-with-gpu.sh
# Option 2: Start without GPU # Watch installation progress (first time model download ~2GB)
docker-compose up -d
# View logs
docker-compose logs -f
# Wait for Ollama model download (first time only, ~2-5 minutes)
docker-compose logs -f ollama-setup docker-compose logs -f ollama-setup
``` ```
**Note:** First startup downloads the phi3:latest AI model (2.2GB). This happens automatically. ### 3. Add News Sources
### 3. Add RSS Feeds
```bash ```bash
mongosh munich_news # Connect to database
docker-compose exec mongodb mongosh munich_news
# Paste this into the mongo shell:
db.rss_feeds.insertMany([ db.rss_feeds.insertMany([
{ {
name: "Süddeutsche Zeitung München", name: "Süddeutsche Zeitung München",
@@ -65,11 +45,9 @@ db.rss_feeds.insertMany([
]) ])
``` ```
### 4. Add Subscribers ### 4. Add Yourself as Subscriber
```bash ```bash
mongosh munich_news # Still in mongo shell:
db.subscribers.insertOne({ db.subscribers.insertOne({
email: "your-email@example.com", email: "your-email@example.com",
active: true, active: true,
@@ -78,90 +56,35 @@ db.subscribers.insertOne({
}) })
``` ```
### 5. Test It ### 5. Verify Installation
```bash ```bash
# Test crawler # 1. Run the crawler manually to fetch news
docker-compose exec crawler python crawler_service.py 5 docker-compose exec crawler python crawler_service.py 5
# Test newsletter # 2. Send a test email to yourself
docker-compose exec sender python sender_service.py test your-email@example.com docker-compose exec sender python sender_service.py test your-email@example.com
``` ```
## What Happens Next? ## 🎮 Dashboard Access
The system will automatically: Once running, access the services:
- **Backend API**: Runs continuously at http://localhost:5001 for tracking and analytics - **Dashboard**: [http://localhost:3000](http://localhost:3000)
- **6:00 AM Berlin time**: Crawl news articles - **API**: [http://localhost:5001](http://localhost:5001)
- **7:00 AM Berlin time**: Send newsletter to subscribers
## View Results ## ⏭️ What's Next?
The system is now fully automated:
1. **6:00 AM**: Crawls news and generates AI summaries.
2. **7:00 AM**: Sends the daily newsletter.
### Useful Commands
```bash ```bash
# Check articles # Stop everything
mongosh munich_news
db.articles.find().sort({ crawled_at: -1 }).limit(5)
# Check logs
docker-compose logs -f crawler
docker-compose logs -f sender
```
## Common Commands
```bash
# Stop system
docker-compose down docker-compose down
# Restart system # View logs for a service
docker-compose restart docker-compose logs -f crawler
# View logs # Update code & rebuild
docker-compose logs -f
# Rebuild after changes
docker-compose up -d --build docker-compose up -d --build
``` ```
## New Features
### GPU Acceleration (5-10x Faster)
Enable GPU support for faster AI processing:
```bash
./check-gpu.sh # Check if GPU is available
./start-with-gpu.sh # Start with GPU support
```
See [docs/GPU_SETUP.md](docs/GPU_SETUP.md) for details.
### Send Newsletter to All Subscribers
```bash
# Send newsletter to all active subscribers
curl -X POST http://localhost:5001/api/admin/send-newsletter \
-H "Content-Type: application/json" \
-d '{"max_articles": 10}'
```
### Security Features
- ✅ Only Backend API exposed (port 5001)
- ✅ MongoDB internal-only (secure)
- ✅ Ollama internal-only (secure)
- ✅ All services communicate via internal Docker network
## Need Help?
- **Documentation Index**: [docs/INDEX.md](docs/INDEX.md)
- **GPU Setup**: [docs/GPU_SETUP.md](docs/GPU_SETUP.md)
- **API Reference**: [docs/ADMIN_API.md](docs/ADMIN_API.md)
- **Security Guide**: [docs/SECURITY_NOTES.md](docs/SECURITY_NOTES.md)
- **Full Documentation**: [README.md](README.md)
## Next Steps
1.**Enable GPU acceleration** - [docs/GPU_SETUP.md](docs/GPU_SETUP.md)
2. Set up tracking API (optional)
3. Customize newsletter template
4. Add more RSS feeds
5. Monitor engagement metrics
6. Review security settings - [docs/SECURITY_NOTES.md](docs/SECURITY_NOTES.md)
That's it! Your automated news system is running. 🎉

README.md

@@ -1,460 +1,193 @@
# Munich News Daily - Automated Newsletter System # Munich News Daily - Automated Newsletter System
A fully automated news aggregation and newsletter system that crawls Munich news sources, generates AI summaries, and sends daily newsletters with engagement tracking. A fully automated news aggregation system that crawls Munich news sources, generates AI-powered summaries, tracks local transport disruptions, and delivers personalized daily newsletters.
![Munich News Daily](https://via.placeholder.com/800x400?text=Munich+News+Daily+Dashboard)
## ✨ Key Features ## ✨ Key Features
- **🤖 AI-Powered Clustering** - Automatically detects duplicate stories from different sources - **🤖 AI-Powered Clustering** - Smartly detects duplicate stories and groups related articles using ChromaDB vector search.
- **📰 Neutral Summaries** - Combines multiple perspectives into balanced coverage - **📝 Neutral Summaries** - Generates balanced, multi-perspective summaries using local LLMs (Ollama).
- **🎯 Smart Prioritization** - Shows most important stories first (multi-source coverage) - **🚇 Transport Updates** - Real-time tracking of Munich public transport (MVG) disruptions.
- **🎨 Personalized Newsletters** - AI-powered content recommendations based on user interests - **🎯 Smart Prioritization** - Ranks stories based on relevance and user preferences.
- **📊 Engagement Tracking** - Open rates, click tracking, and analytics - **🎨 Personalized Newsletters** - Content recommendations tailored to each subscriber's interests.
- **⚡ GPU Acceleration** - 5-10x faster AI processing with GPU support - **📊 Engagement Analytics** - Detailed tracking of open rates, click-throughs, and user interests.
- **🔒 GDPR Compliant** - Privacy-first with data retention controls - **GPU Acceleration** - Integrated support for NVIDIA GPUs for faster AI processing.
- **🔒 Privacy First** - GDPR-compliant with automatic data retention policies and anonymization.
**🚀 NEW:** GPU acceleration support for 5-10x faster AI processing! See [docs/GPU_SETUP.md](docs/GPU_SETUP.md)
## 🚀 Quick Start ## 🚀 Quick Start
For a detailed 5-minute setup guide, see [QUICKSTART.md](QUICKSTART.md).
```bash ```bash
# 1. Configure environment # 1. Configure environment
cp backend/.env.example backend/.env cp backend/.env.example backend/.env
# Edit backend/.env with your email settings # Edit backend/.env with your email settings
# 2. Start everything # 2. Start everything (Auto-detects GPU)
docker-compose up -d ./start-with-gpu.sh
# 3. View logs # Questions?
docker-compose logs -f # See logs: docker-compose logs -f
``` ```
That's it! The system will automatically: The system will automatically:
- **Frontend**: Web interface and admin dashboard (http://localhost:3000) 1. **6:00 AM**: Crawl news & transport updates.
- **Backend API**: Runs continuously for tracking and analytics (http://localhost:5001) 2. **6:30 AM**: Generate AI summaries & clusters.
- **6:00 AM Berlin time**: Crawl news articles and generate summaries 3. **7:00 AM**: Send personalized newsletters.
- **7:00 AM Berlin time**: Send newsletter to all subscribers
### Access Points ## 📋 System Architecture
- **Newsletter Page**: http://localhost:3000 The system is built as a set of microservices orchestrated by Docker Compose.
- **Admin Dashboard**: http://localhost:3000/admin.html
- **Backend API**: http://localhost:5001
📖 **New to the project?** See [QUICKSTART.md](QUICKSTART.md) for a detailed 5-minute setup guide. ```mermaid
graph TD
🚀 **GPU Acceleration:** Enable 5-10x faster AI processing with [GPU Setup Guide](docs/GPU_SETUP.md) User[Subscribers] -->|Email| Sender[Newsletter Sender]
User -->|Web| Frontend[React Frontend]
## 📋 System Overview Frontend -->|API| Backend[Backend API]
``` subgraph "Core Services"
6:00 AM → News Crawler Crawler[News Crawler]
Transport[Transport Crawler]
Fetches articles from RSS feeds Sender
Extracts full content Backend
Generates AI summaries end
Saves to MongoDB
subgraph "Data & AI"
7:00 AM → Newsletter Sender Mongo[(MongoDB)]
Redis[(Redis)]
Waits for crawler to finish Chroma[(ChromaDB)]
Fetches today's articles Ollama[Ollama AI]
Generates newsletter with tracking end
Sends to all subscribers
Crawler -->|Save| Mongo
✅ Done! Repeat tomorrow Crawler -->|Embeddings| Chroma
Crawler -->|Summarize| Ollama
Transport -->|Save| Mongo
Sender -->|Read| Mongo
Sender -->|Track| Backend
Backend -->|Read/Write| Mongo
Backend -->|Cache| Redis
``` ```
## 🏗️ Architecture ### Core Components
### Components | Service | Description | Port |
|---------|-------------|------|
| **Frontend** | React-based user dashboard and admin interface. | 3000 |
| **Backend API** | Flask API for tracking, analytics, and management. | 5001 |
| **News Crawler** | Fetches RSS feeds, extracts content, and runs AI clustering. | - |
| **Transport Crawler** | Monitors MVG (Munich Transport) for delays and disruptions. | - |
| **Newsletter Sender** | Manages subscribers, generates templates, and sends emails. | - |
| **Ollama** | Local LLM runner for on-premise AI (Phi-3, Llama3, etc.). | - |
| **ChromaDB** | Vector database for semantic search and article clustering. | - |
- **Ollama**: AI service for summarization and translation (internal only, GPU-accelerated) ## 📂 Project Structure
- **MongoDB**: Data storage (articles, subscribers, tracking) (internal only)
- **Backend API**: Flask API for tracking and analytics (port 5001 - only exposed service)
- **News Crawler**: Automated RSS feed crawler with AI summarization (internal only)
- **Newsletter Sender**: Automated email sender with tracking (internal only)
- **Frontend**: React dashboard (optional)
### Technology Stack ```text
munich-news/
├── backend/ # Flask API for tracking & analytics
├── frontend/ # React dashboard & admin UI
├── news_crawler/ # RSS fetcher & AI summarizer service
├── news_sender/ # Email generation & dispatch service
├── transport_crawler/ # MVG transport disruption monitor
├── docker-compose.yml # Main service orchestration
└── docs/ # Detailed documentation
```
- Python 3.11 ## 🛠️ Installation & Setup
- MongoDB 7.0
- Ollama (phi3:latest model for AI)
- Docker & Docker Compose
- Flask (API)
- Schedule (automation)
- Jinja2 (email templates)
## 📦 Installation 1. **Clone the repository**
```bash
git clone https://github.com/yourusername/munich-news.git
cd munich-news
```
### Prerequisites 2. **Environment Configuration**
```bash
cp backend/.env.example backend/.env
nano backend/.env
```
*Critical settings:* `SMTP_SERVER`, `EMAIL_USER`, `EMAIL_PASSWORD`.
- Docker & Docker Compose 3. **Start the System**
- 4GB+ RAM (for Ollama AI models) ```bash
- (Optional) NVIDIA GPU for 5-10x faster AI processing # Recommended: Helper script (handles GPU & Model setup)
./start-with-gpu.sh
# Alternative: Standard Docker Compose
docker-compose up -d
```
### Setup 4. **Initial Setup (First Run)**
* The system needs to download the AI model (approx. 2GB).
1. **Clone the repository** * Watch progress: `docker-compose logs -f ollama-setup`
```bash
git clone <repository-url>
cd munich-news
```
2. **Configure environment**
```bash
cp backend/.env.example backend/.env
# Edit backend/.env with your settings
```
3. **Configure Ollama (AI features)**
```bash
# Option 1: Use integrated Docker Compose Ollama (recommended)
./configure-ollama.sh
# Select option 1
# Option 2: Use external Ollama server
# Install from https://ollama.ai/download
# Then run: ollama pull phi3:latest
```
4. **Start the system**
```bash
# Auto-detect GPU and start (recommended)
./start-with-gpu.sh
# Or start manually
docker-compose up -d
# First time: Wait for Ollama model download (2-5 minutes)
docker-compose logs -f ollama-setup
```
📖 **For detailed Ollama setup & GPU acceleration:** See [docs/OLLAMA_SETUP.md](docs/OLLAMA_SETUP.md)
💡 **To change AI model:** Edit `OLLAMA_MODEL` in `.env`, then run `./pull-ollama-model.sh`. See [docs/CHANGING_AI_MODEL.md](docs/CHANGING_AI_MODEL.md)
## ⚙️ Configuration ## ⚙️ Configuration
Edit `backend/.env`: Key configuration options in `backend/.env`:
```env | Category | Variable | Description |
# MongoDB |----------|----------|-------------|
MONGODB_URI=mongodb://localhost:27017/ | **Email** | `SMTP_SERVER` | SMTP Server (e.g., smtp.gmail.com) |
| | `EMAIL_USER` | Your sending email address |
| **AI** | `OLLAMA_MODEL` | Model to use (default: phi3:latest) |
| **Schedule** | `CRAWLER_TIME` | Time to start crawling (e.g., "06:00") |
| | `SENDER_TIME` | Time to send emails (e.g., "07:00") |
# Email (SMTP) ## 📊 Usage & Monitoring
SMTP_SERVER=smtp.gmail.com
SMTP_PORT=587
EMAIL_USER=your-email@gmail.com
EMAIL_PASSWORD=your-app-password
# Newsletter ### Access Points
NEWSLETTER_MAX_ARTICLES=10 * **Web Dashboard**: [http://localhost:3000](http://localhost:3000) (or configured domain)
NEWSLETTER_HOURS_LOOKBACK=24 * **API**: [http://localhost:5001](http://localhost:5001)
# Tracking ### Useful Commands
TRACKING_ENABLED=true
TRACKING_API_URL=http://localhost:5001
TRACKING_DATA_RETENTION_DAYS=90
# Ollama (AI Summarization) **View Logs**
OLLAMA_ENABLED=true ```bash
OLLAMA_BASE_URL=http://127.0.0.1:11434 docker-compose logs -f [service_name]
OLLAMA_MODEL=phi3:latest # e.g., docker-compose logs -f crawler
``` ```
## 📊 Usage **Manual Trigger**
### View Logs
```bash ```bash
# All services # Run News Crawler immediately
docker-compose logs -f
# Specific service
docker-compose logs -f crawler
docker-compose logs -f sender
docker-compose logs -f mongodb
```
### Manual Operations
```bash
# Run crawler manually
docker-compose exec crawler python crawler_service.py 10 docker-compose exec crawler python crawler_service.py 10
# Send test newsletter # Run Transport Crawler immediately
docker-compose exec sender python sender_service.py test your-email@example.com docker-compose exec transport-crawler python transport_service.py
# Preview newsletter # Send Test Newsletter
docker-compose exec sender python sender_service.py preview docker-compose exec sender python sender_service.py test user@example.com
``` ```
### Database Access **Database Access**
```bash ```bash
# Connect to MongoDB # Connect to MongoDB
docker-compose exec mongodb mongosh munich_news docker-compose exec mongodb mongosh munich_news
# View articles
db.articles.find().sort({ crawled_at: -1 }).limit(5).pretty()
# View subscribers
db.subscribers.find({ active: true }).pretty()
# View tracking data
db.newsletter_sends.find().sort({ created_at: -1 }).limit(10).pretty()
``` ```
## 🔧 Management ## 🌐 Production Deployment (Traefik)
### Add RSS Feeds This project is configured to work with **Traefik** as a reverse proxy.
The `docker-compose.yml` includes labels for:
- `news.dongho.kim` (Frontend)
- `news-api.dongho.kim` (Backend)
```bash To use this locally, add these to your `/etc/hosts`:
mongosh munich_news ```text
127.0.0.1 news.dongho.kim news-api.dongho.kim
db.rss_feeds.insertOne({
name: "Source Name",
url: "https://example.com/rss",
active: true
})
``` ```
### Add Subscribers For production, ensure your Traefik proxy network is named `proxy` or update the `docker-compose.yml` accordingly.
```bash
mongosh munich_news
db.subscribers.insertOne({
email: "user@example.com",
active: true,
tracking_enabled: true,
subscribed_at: new Date()
})
```
### View Analytics
```bash
# Newsletter metrics
curl http://localhost:5001/api/analytics/newsletter/2024-01-15
# Article performance
curl http://localhost:5001/api/analytics/article/https://example.com/article
# Subscriber activity
curl http://localhost:5001/api/analytics/subscriber/user@example.com
```
## ⏰ Schedule Configuration
### Change Crawler Time (default: 6:00 AM)
Edit `news_crawler/scheduled_crawler.py`:
```python
schedule.every().day.at("06:00").do(run_crawler) # Change time
```
### Change Sender Time (default: 7:00 AM)
Edit `news_sender/scheduled_sender.py`:
```python
schedule.every().day.at("07:00").do(run_sender) # Change time
```
After changes:
```bash
docker-compose up -d --build
```
## 📈 Monitoring
### Container Status
```bash
docker-compose ps
```
### Check Next Scheduled Runs
```bash
# Crawler
docker-compose logs crawler | grep "Next scheduled run"
# Sender
docker-compose logs sender | grep "Next scheduled run"
```
### Engagement Metrics
```bash
mongosh munich_news
// Open rate
var sent = db.newsletter_sends.countDocuments({ newsletter_id: "2024-01-15" })
var opened = db.newsletter_sends.countDocuments({ newsletter_id: "2024-01-15", opened: true })
print("Open Rate: " + ((opened / sent) * 100).toFixed(2) + "%")
// Click rate
var clicks = db.link_clicks.countDocuments({ newsletter_id: "2024-01-15" })
print("Click Rate: " + ((clicks / sent) * 100).toFixed(2) + "%")
```
## 🐛 Troubleshooting
### Crawler Not Finding Articles
```bash
# Check RSS feeds
mongosh munich_news --eval "db.rss_feeds.find({ active: true })"
# Test manually
docker-compose exec crawler python crawler_service.py 5
```
### Newsletter Not Sending
```bash
# Check email config
docker-compose exec sender python -c "from sender_service import Config; print(Config.SMTP_SERVER)"
# Test email
docker-compose exec sender python sender_service.py test your-email@example.com
```
### Containers Not Starting
```bash
# Check logs
docker-compose logs
# Rebuild
docker-compose up -d --build
# Reset everything
docker-compose down -v
docker-compose up -d
```
## 🔐 Privacy & Compliance
### GDPR Features
- **Data Retention**: Automatic anonymization after 90 days
- **Opt-Out**: Subscribers can disable tracking
- **Data Deletion**: Full data removal on request
- **Transparency**: Privacy notice in all emails
### Privacy Endpoints
```bash
# Delete subscriber data
curl -X DELETE http://localhost:5001/api/tracking/subscriber/user@example.com
# Anonymize old data
curl -X POST http://localhost:5001/api/tracking/anonymize
# Opt out of tracking
curl -X POST http://localhost:5001/api/tracking/subscriber/user@example.com/opt-out
```
## 📚 Documentation
### Getting Started
- **[QUICKSTART.md](QUICKSTART.md)** - 5-minute setup guide
- **[CONTRIBUTING.md](CONTRIBUTING.md)** - Contribution guidelines
### Core Features
- **[docs/AI_NEWS_AGGREGATION.md](docs/AI_NEWS_AGGREGATION.md)** - AI-powered clustering & neutral summaries
- **[docs/PERSONALIZATION.md](docs/PERSONALIZATION.md)** - Personalized newsletter system
- **[docs/PERSONALIZATION_COMPLETE.md](docs/PERSONALIZATION_COMPLETE.md)** - Personalization implementation guide
- **[docs/FEATURES.md](docs/FEATURES.md)** - Complete feature list
- **[docs/API.md](docs/API.md)** - API endpoints reference
### Technical Documentation
- **[docs/ARCHITECTURE.md](docs/ARCHITECTURE.md)** - System architecture
- **[docs/SETUP.md](docs/SETUP.md)** - Detailed setup guide
- **[docs/OLLAMA_SETUP.md](docs/OLLAMA_SETUP.md)** - AI/Ollama configuration
- **[docs/GPU_SETUP.md](docs/GPU_SETUP.md)** - GPU acceleration setup
- **[docs/DEPLOYMENT.md](docs/DEPLOYMENT.md)** - Production deployment
- **[docs/SECURITY.md](docs/SECURITY.md)** - Security best practices
- **[docs/REFERENCE.md](docs/REFERENCE.md)** - Complete reference
- **[docs/DEPLOYMENT.md](docs/DEPLOYMENT.md)** - Deployment guide
- **[docs/API.md](docs/API.md)** - API reference
- **[docs/DATABASE_SCHEMA.md](docs/DATABASE_SCHEMA.md)** - Database structure
- **[docs/BACKEND_STRUCTURE.md](docs/BACKEND_STRUCTURE.md)** - Backend organization
### Component Documentation
- **[docs/CRAWLER_HOW_IT_WORKS.md](docs/CRAWLER_HOW_IT_WORKS.md)** - Crawler internals
- **[docs/EXTRACTION_STRATEGIES.md](docs/EXTRACTION_STRATEGIES.md)** - Content extraction
- **[docs/RSS_URL_EXTRACTION.md](docs/RSS_URL_EXTRACTION.md)** - RSS parsing
## 🧪 Testing
All test files are organized in the `tests/` directory:
```bash
# Run crawler tests
docker-compose exec crawler python tests/crawler/test_crawler.py
# Run sender tests
docker-compose exec sender python tests/sender/test_tracking_integration.py
# Run backend tests
docker-compose exec backend python tests/backend/test_tracking.py
# Test personalization system (all 4 phases)
docker exec munich-news-local-backend python test_personalization_system.py
```
## 🚀 Production Deployment
### Environment Setup
1. Update `backend/.env` with production values
2. Set strong MongoDB password
3. Use HTTPS for tracking URLs
4. Configure proper SMTP server
### Security
```bash
# Use production compose file
docker-compose -f docker-compose.prod.yml up -d
# Set MongoDB password
export MONGO_PASSWORD=your-secure-password
```
### Monitoring
- Set up log rotation
- Configure health checks
- Set up alerts for failures
- Monitor database size
## 📚 Documentation
Complete documentation available in the [docs/](docs/) directory:
- **[Documentation Index](docs/INDEX.md)** - Complete documentation guide
- **[GPU Setup](docs/GPU_SETUP.md)** - 5-10x faster with GPU acceleration
- **[Admin API](docs/ADMIN_API.md)** - API endpoints reference
- **[Security Guide](docs/SECURITY_NOTES.md)** - Security best practices
- **[System Architecture](docs/SYSTEM_ARCHITECTURE.md)** - Technical overview
## 📝 License
[Your License Here]
## 🤝 Contributing ## 🤝 Contributing
Contributions welcome! Please read [CONTRIBUTING.md](CONTRIBUTING.md) first. We welcome contributions! Please check [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
## 📧 Support ## 📄 License
For issues or questions, please open a GitHub issue. MIT License - see [LICENSE](LICENSE) for details.
---
**Built with ❤️ for Munich News Daily**


@@ -13,6 +13,7 @@ from routes.admin_routes import admin_bp
from routes.transport_routes import transport_bp from routes.transport_routes import transport_bp
from routes.interests_routes import interests_bp from routes.interests_routes import interests_bp
from routes.personalization_routes import personalization_bp from routes.personalization_routes import personalization_bp
from routes.search_routes import search_bp
# Initialize Flask app # Initialize Flask app
app = Flask(__name__) app = Flask(__name__)
@@ -33,6 +34,7 @@ app.register_blueprint(admin_bp)
app.register_blueprint(transport_bp) app.register_blueprint(transport_bp)
app.register_blueprint(interests_bp) app.register_blueprint(interests_bp)
app.register_blueprint(personalization_bp) app.register_blueprint(personalization_bp)
app.register_blueprint(search_bp)
# Health check endpoint # Health check endpoint
@app.route('/health') @app.route('/health')
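Review note: a quick way to confirm the new blueprint is actually wired up is Flask's built-in test client. A minimal sketch, not part of the repo's test suite, and it assumes the module-level ChromaClient in search_routes can be constructed in the test environment:

```python
from app import app


def test_search_requires_query():
    # Importing app registers all blueprints, including search_bp
    client = app.test_client()
    # /api/search without ?q= should be rejected with a 400 by the new route
    assert client.get('/api/search').status_code == 400
```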


@@ -87,7 +87,8 @@ class ChromaClient:
# Prepare text for embedding (Title + Summary + Start of Content) # Prepare text for embedding (Title + Summary + Start of Content)
# This gives semantic search a good overview # This gives semantic search a good overview
title = article.get('title', '') # Use English title if available, otherwise original
title = article.get('title_en') if article.get('title_en') else article.get('title', '')
summary = article.get('summary') or '' summary = article.get('summary') or ''
content_snippet = article.get('content', '')[:1000] content_snippet = article.get('content', '')[:1000]


@@ -45,6 +45,11 @@ class Config:
TRACKING_API_URL = os.getenv('TRACKING_API_URL', f'http://localhost:{os.getenv("FLASK_PORT", "5000")}') TRACKING_API_URL = os.getenv('TRACKING_API_URL', f'http://localhost:{os.getenv("FLASK_PORT", "5000")}')
TRACKING_DATA_RETENTION_DAYS = int(os.getenv('TRACKING_DATA_RETENTION_DAYS', '90')) TRACKING_DATA_RETENTION_DAYS = int(os.getenv('TRACKING_DATA_RETENTION_DAYS', '90'))
# ChromaDB
CHROMA_HOST = os.getenv('CHROMA_HOST', 'chromadb')
CHROMA_PORT = int(os.getenv('CHROMA_PORT', '8000'))
CHROMA_COLLECTION = os.getenv('CHROMA_COLLECTION', 'munich_news_articles')
@classmethod @classmethod
def print_config(cls): def print_config(cls):
"""Print configuration (without sensitive data)""" """Print configuration (without sensitive data)"""
@@ -57,3 +62,5 @@ class Config:
print(f" Ollama Enabled: {cls.OLLAMA_ENABLED}") print(f" Ollama Enabled: {cls.OLLAMA_ENABLED}")
print(f" Tracking Enabled: {cls.TRACKING_ENABLED}") print(f" Tracking Enabled: {cls.TRACKING_ENABLED}")
print(f" Tracking API URL: {cls.TRACKING_API_URL}") print(f" Tracking API URL: {cls.TRACKING_API_URL}")
print(f" ChromaDB Host: {cls.CHROMA_HOST}")
print(f" ChromaDB Port: {cls.CHROMA_PORT}")


@@ -8,3 +8,4 @@ Jinja2==3.1.2
redis==5.0.1 redis==5.0.1
chromadb>=0.4.0 chromadb>=0.4.0
sentence-transformers>=2.2.2


@@ -24,8 +24,11 @@ def get_news():
db_articles = [] db_articles = []
for doc in cursor: for doc in cursor:
# Use English title if available, otherwise fallback to original
title = doc.get('title_en') if doc.get('title_en') else doc.get('title', '')
article = { article = {
'title': doc.get('title', ''), 'title': title,
'author': doc.get('author'), 'author': doc.get('author'),
'link': doc.get('link', ''), 'link': doc.get('link', ''),
'source': doc.get('source', ''), 'source': doc.get('source', ''),
@@ -114,8 +117,10 @@ def get_clustered_news_internal():
# Use cluster_articles from aggregation (already fetched) # Use cluster_articles from aggregation (already fetched)
cluster_articles = doc.get('cluster_articles', []) cluster_articles = doc.get('cluster_articles', [])
title = doc.get('title_en') if doc.get('title_en') else doc.get('title', '')
article = { article = {
'title': doc.get('title', ''), 'title': title,
'link': doc.get('link', ''), 'link': doc.get('link', ''),
'source': doc.get('source', ''), 'source': doc.get('source', ''),
'published': doc.get('published_at', ''), 'published': doc.get('published_at', ''),
@@ -173,7 +178,7 @@ def get_article_by_url(article_url):
return jsonify({'error': 'Article not found'}), 404 return jsonify({'error': 'Article not found'}), 404
return jsonify({ return jsonify({
'title': article.get('title', ''), 'title': article.get('title_en') if article.get('title_en') else article.get('title', ''),
'author': article.get('author'), 'author': article.get('author'),
'link': article.get('link', ''), 'link': article.get('link', ''),
'content': article.get('content', ''), 'content': article.get('content', ''),
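Review note: the `title_en` fallback is now repeated in three places in this file. If it keeps spreading, a tiny helper could centralize it; a sketch (the name `display_title` is illustrative, not existing code):

```python
def display_title(doc):
    """Prefer the English title when present, otherwise fall back to the original."""
    return doc.get('title_en') or doc.get('title', '')


# e.g. 'title': display_title(doc) in each response dict
```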


@@ -0,0 +1,88 @@
from flask import Blueprint, jsonify, request
from config import Config
from chroma_client import ChromaClient
import logging

search_bp = Blueprint('search', __name__)

# Initialize ChromaDB client
# Note: We use the hostname 'chromadb' as defined in docker-compose for the backend
chroma_client = ChromaClient(
    host=Config.CHROMA_HOST,
    port=Config.CHROMA_PORT,
    collection_name=Config.CHROMA_COLLECTION
)


@search_bp.route('/api/search', methods=['GET'])
def search_news():
    """
    Semantic search for news articles using ChromaDB.

    Query parameters:
    - q: Search query (required)
    - limit: Number of results (default: 10)
    - category: Filter by category (optional)
    """
    try:
        query = request.args.get('q')
        if not query:
            return jsonify({'error': 'Missing search query'}), 400

        limit = int(request.args.get('limit', 10))
        category = request.args.get('category')

        # Build filter if category provided
        where_filter = None
        if category:
            where_filter = {"category": category}

        # Perform search
        results = chroma_client.search(
            query_text=query,
            n_results=limit,
            where=where_filter
        )

        # Format for frontend
        formatted_response = []
        for item in results:
            metadata = item.get('metadata', {})

            # Chroma metadata is flat and currently stores only title, url, source,
            # category and published_at (see chroma_client.py); title_en is not stored yet.
            # Use the stored title for now -- once ChromaClient writes the English title
            # into metadata, search results will show English automatically.
            title = metadata.get('title', 'Unknown Title')

            formatted_response.append({
                'title': title,
                'link': metadata.get('url', ''),
                'source': metadata.get('source', 'Unknown'),
                'category': metadata.get('category', 'general'),
                'published_at': metadata.get('published_at', ''),
                'relevance_score': 1.0 - item.get('distance', 1.0),  # Convert distance to score (approx)
                'snippet': item.get('document', '')[:200] + '...'  # Preview
            })

        return jsonify({
            'query': query,
            'count': len(formatted_response),
            'results': formatted_response
        }), 200

    except Exception as e:
        logging.error(f"Search error: {str(e)}")
        return jsonify({'error': str(e)}), 500
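Once the backend is up, the endpoint can be exercised directly; a sketch using the `requests` package (not a project dependency, used here only for illustration) against the exposed port 5001:

```python
import requests

# Query the semantic search endpoint added by search_routes.py
resp = requests.get(
    "http://localhost:5001/api/search",
    params={"q": "Munich transport disruptions", "limit": 5},
    timeout=10,
)
resp.raise_for_status()
data = resp.json()

print(f"{data['count']} results for {data['query']!r}")
for hit in data["results"]:
    print(f"  {hit['relevance_score']:.2f}  {hit['title']}  ({hit['source']})")
```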


@@ -1,20 +1,3 @@
# Munich News Daily - Docker Compose Configuration
#
# GPU Support:
# To enable GPU acceleration for Ollama (5-10x faster):
# 1. Check GPU availability: ./check-gpu.sh
# 2. Start with GPU: ./start-with-gpu.sh
# Or manually: docker-compose -f docker-compose.yml -f docker-compose.gpu.yml up -d
#
# Security:
# - Only Backend API (port 5001) is exposed to host
# - MongoDB is internal-only (not exposed to host)
# - Ollama is internal-only (not exposed to host)
# - Crawler and Sender are internal-only
# All services communicate via internal Docker network
#
# See docs/OLLAMA_SETUP.md for detailed setup instructions
services: services:
# Ollama AI Service (Internal only - not exposed to host) # Ollama AI Service (Internal only - not exposed to host)
ollama: ollama:
@@ -29,14 +12,6 @@ services:
dns: dns:
- 8.8.8.8 - 8.8.8.8
- 1.1.1.1 - 1.1.1.1
# GPU support (uncomment if you have NVIDIA GPU)
# deploy:
# resources:
# reservations:
# devices:
# - driver: nvidia
# count: all
# capabilities: [gpu]
healthcheck: healthcheck:
test: [ "CMD-SHELL", "ollama list || exit 1" ] test: [ "CMD-SHELL", "ollama list || exit 1" ]
interval: 30s interval: 30s


@@ -19,10 +19,10 @@ async function loadCategories() {
const response = await fetch('/api/categories'); const response = await fetch('/api/categories');
const data = await response.json(); const data = await response.json();
const categories = data.categories || []; const categories = data.categories || [];
const container = document.getElementById('categoryCheckboxes'); const container = document.getElementById('categoryCheckboxes');
container.innerHTML = ''; container.innerHTML = '';
categories.forEach(category => { categories.forEach(category => {
const label = document.createElement('label'); const label = document.createElement('label');
label.className = 'flex items-center space-x-3 cursor-pointer'; label.className = 'flex items-center space-x-3 cursor-pointer';
@@ -40,11 +40,11 @@ async function loadCategories() {
async function loadNews() { async function loadNews() {
const newsGrid = document.getElementById('newsGrid'); const newsGrid = document.getElementById('newsGrid');
newsGrid.innerHTML = '<div class="text-center py-10 text-gray-500">Loading news...</div>'; newsGrid.innerHTML = '<div class="text-center py-10 text-gray-500">Loading news...</div>';
try { try {
const response = await fetch('/api/news'); const response = await fetch('/api/news');
const data = await response.json(); const data = await response.json();
if (data.articles && data.articles.length > 0) { if (data.articles && data.articles.length > 0) {
allArticles = data.articles; allArticles = data.articles;
filteredArticles = data.articles; filteredArticles = data.articles;
@@ -63,24 +63,24 @@ async function loadNews() {
function loadMoreArticles() { function loadMoreArticles() {
if (isLoading || displayedCount >= filteredArticles.length) return; if (isLoading || displayedCount >= filteredArticles.length) return;
isLoading = true; isLoading = true;
const newsGrid = document.getElementById('newsGrid'); const newsGrid = document.getElementById('newsGrid');
// Remove loading indicator if exists // Remove loading indicator if exists
const loadingIndicator = document.getElementById('loadingIndicator'); const loadingIndicator = document.getElementById('loadingIndicator');
if (loadingIndicator) loadingIndicator.remove(); if (loadingIndicator) loadingIndicator.remove();
// Get next batch of articles // Get next batch of articles
const nextBatch = filteredArticles.slice(displayedCount, displayedCount + ARTICLES_PER_PAGE); const nextBatch = filteredArticles.slice(displayedCount, displayedCount + ARTICLES_PER_PAGE);
nextBatch.forEach((article, index) => { nextBatch.forEach((article, index) => {
const card = createNewsCard(article, displayedCount + index); const card = createNewsCard(article, displayedCount + index);
newsGrid.appendChild(card); newsGrid.appendChild(card);
}); });
displayedCount += nextBatch.length; displayedCount += nextBatch.length;
// Add loading indicator if more articles available // Add loading indicator if more articles available
if (displayedCount < filteredArticles.length) { if (displayedCount < filteredArticles.length) {
const loader = document.createElement('div'); const loader = document.createElement('div');
@@ -95,17 +95,17 @@ function loadMoreArticles() {
endMessage.textContent = `✓ All ${filteredArticles.length} articles loaded`; endMessage.textContent = `✓ All ${filteredArticles.length} articles loaded`;
newsGrid.appendChild(endMessage); newsGrid.appendChild(endMessage);
} }
isLoading = false; isLoading = false;
} }
function setupInfiniteScroll() { function setupInfiniteScroll() {
window.addEventListener('scroll', () => { window.addEventListener('scroll', () => {
if (isLoading || displayedCount >= filteredArticles.length) return; if (isLoading || displayedCount >= filteredArticles.length) return;
const scrollPosition = window.innerHeight + window.scrollY; const scrollPosition = window.innerHeight + window.scrollY;
const threshold = document.documentElement.scrollHeight - 500; const threshold = document.documentElement.scrollHeight - 500;
if (scrollPosition >= threshold) { if (scrollPosition >= threshold) {
loadMoreArticles(); loadMoreArticles();
} }
@@ -113,53 +113,85 @@ function setupInfiniteScroll() {
} }
// Search functionality // Search functionality
function handleSearch() { let searchTimeout;
async function handleSearch() {
const searchInput = document.getElementById('searchInput'); const searchInput = document.getElementById('searchInput');
const clearBtn = document.getElementById('clearSearch'); const clearBtn = document.getElementById('clearSearch');
searchQuery = searchInput.value.trim().toLowerCase(); const searchStats = document.getElementById('searchStats');
const newsGrid = document.getElementById('newsGrid');
searchQuery = searchInput.value.trim();
// Show/hide clear button // Show/hide clear button
if (searchQuery) { if (searchQuery) {
clearBtn.classList.remove('hidden'); clearBtn.classList.remove('hidden');
} else { } else {
clearBtn.classList.add('hidden'); clearBtn.classList.add('hidden');
} }
// Filter articles // Clear previous timeout
if (searchTimeout) clearTimeout(searchTimeout);
// If empty query, reset to all articles
if (searchQuery === '') { if (searchQuery === '') {
filteredArticles = allArticles; filteredArticles = allArticles;
} else { displayedCount = 0;
filteredArticles = allArticles.filter(article => { newsGrid.innerHTML = '';
const title = article.title.toLowerCase(); updateSearchStats();
const summary = (article.summary || '').toLowerCase().replace(/<[^>]*>/g, '');
const source = formatSourceName(article.source).toLowerCase();
return title.includes(searchQuery) ||
summary.includes(searchQuery) ||
source.includes(searchQuery);
});
}
// Reset display
displayedCount = 0;
const newsGrid = document.getElementById('newsGrid');
newsGrid.innerHTML = '';
// Update stats
updateSearchStats();
// Load filtered articles
if (filteredArticles.length > 0) {
loadMoreArticles(); loadMoreArticles();
} else { return;
newsGrid.innerHTML = `
<div class="text-center py-16">
<div class="text-6xl mb-4">🔍</div>
<p class="text-xl text-gray-600 mb-2">No articles found</p>
<p class="text-gray-400">Try a different search term</p>
</div>
`;
} }
// Debounce search API call
searchTimeout = setTimeout(async () => {
// Show searching state
newsGrid.innerHTML = '<div class="text-center py-10 text-gray-500">Searching...</div>';
try {
const response = await fetch(`/api/search?q=${encodeURIComponent(searchQuery)}&limit=20`);
// Check if response is ok
if (!response.ok) {
const errorText = await response.text();
throw new Error(`Server returned ${response.status}: ${errorText}`);
}
const data = await response.json();
if (data.results && data.results.length > 0) {
// Map results to match card format
filteredArticles = data.results.map(item => ({
title: item.title,
link: item.link,
source: item.source,
summary: item.snippet, // Map snippet to summary
published_at: item.published_at,
score: item.relevance_score
}));
displayedCount = 0;
newsGrid.innerHTML = '';
// Update stats
searchStats.textContent = `Found ${filteredArticles.length} relevant articles`;
loadMoreArticles();
} else {
newsGrid.innerHTML = `
<div class="text-center py-16">
<div class="text-6xl mb-4">🔍</div>
<p class="text-xl text-gray-600 mb-2">No relevant articles found</p>
<p class="text-gray-400">Try different keywords or concepts</p>
</div>
`;
searchStats.textContent = 'No results found';
}
} catch (error) {
console.error('Search failed:', error);
newsGrid.innerHTML = `<div class="text-center py-10 text-red-400">Search failed: ${error.message}</div>`;
}
}, 500); // 500ms debounce
} }
function clearSearch() { function clearSearch() {
@@ -182,11 +214,11 @@ function createNewsCard(article, index) {
const card = document.createElement('div'); const card = document.createElement('div');
card.className = 'group bg-white rounded-xl overflow-hidden shadow-md hover:shadow-xl transition-all duration-300 cursor-pointer border border-gray-100 hover:border-primary/30'; card.className = 'group bg-white rounded-xl overflow-hidden shadow-md hover:shadow-xl transition-all duration-300 cursor-pointer border border-gray-100 hover:border-primary/30';
card.onclick = () => window.open(article.link, '_blank'); card.onclick = () => window.open(article.link, '_blank');
// Extract image from summary if it's an img tag (from Süddeutsche) // Extract image from summary if it's an img tag (from Süddeutsche)
let imageUrl = null; let imageUrl = null;
let cleanSummary = article.summary || 'No summary available.'; let cleanSummary = article.summary || 'No summary available.';
if (cleanSummary.includes('<img')) { if (cleanSummary.includes('<img')) {
const imgMatch = cleanSummary.match(/src="([^"]+)"/); const imgMatch = cleanSummary.match(/src="([^"]+)"/);
if (imgMatch) { if (imgMatch) {
@@ -195,17 +227,17 @@ function createNewsCard(article, index) {
// Remove img tag from summary // Remove img tag from summary
cleanSummary = cleanSummary.replace(/<img[^>]*>/g, '').replace(/<\/?p>/g, '').trim(); cleanSummary = cleanSummary.replace(/<img[^>]*>/g, '').replace(/<\/?p>/g, '').trim();
} }
// Get source icon/emoji // Get source icon/emoji
const sourceIcon = getSourceIcon(article.source); const sourceIcon = getSourceIcon(article.source);
// Format source name // Format source name
const sourceName = formatSourceName(article.source); const sourceName = formatSourceName(article.source);
// Get word count badge // Get word count badge
const wordCount = article.word_count || article.summary_word_count; const wordCount = article.word_count || article.summary_word_count;
const readTime = wordCount ? Math.ceil(wordCount / 200) : null; const readTime = wordCount ? Math.ceil(wordCount / 200) : null;
card.innerHTML = ` card.innerHTML = `
<div class="flex flex-col sm:flex-row"> <div class="flex flex-col sm:flex-row">
<!-- Image --> <!-- Image -->
@@ -237,11 +269,11 @@ function createNewsCard(article, index) {
</div> </div>
</div> </div>
`; `;
// Add staggered animation // Add staggered animation
card.style.opacity = '0'; card.style.opacity = '0';
card.style.animation = `fadeIn 0.5s ease-out ${(index % ARTICLES_PER_PAGE) * 0.1}s forwards`; card.style.animation = `fadeIn 0.5s ease-out ${(index % ARTICLES_PER_PAGE) * 0.1}s forwards`;
return card; return card;
} }
@@ -293,7 +325,7 @@ async function loadStats() {
try { try {
const response = await fetch('/api/stats'); const response = await fetch('/api/stats');
const data = await response.json(); const data = await response.json();
if (data.subscribers !== undefined) { if (data.subscribers !== undefined) {
document.getElementById('subscriberCount').textContent = data.subscribers.toLocaleString(); document.getElementById('subscriberCount').textContent = data.subscribers.toLocaleString();
} }
@@ -306,44 +338,44 @@ async function subscribe() {
const emailInput = document.getElementById('emailInput'); const emailInput = document.getElementById('emailInput');
const subscribeBtn = document.getElementById('subscribeBtn'); const subscribeBtn = document.getElementById('subscribeBtn');
const formMessage = document.getElementById('formMessage'); const formMessage = document.getElementById('formMessage');
const email = emailInput.value.trim(); const email = emailInput.value.trim();
if (!email || !email.includes('@')) { if (!email || !email.includes('@')) {
formMessage.textContent = 'Please enter a valid email address'; formMessage.textContent = 'Please enter a valid email address';
formMessage.className = 'text-red-200 font-medium'; formMessage.className = 'text-red-200 font-medium';
return; return;
} }
// Get selected categories // Get selected categories
const checkboxes = document.querySelectorAll('#categoryCheckboxes input[type="checkbox"]:checked'); const checkboxes = document.querySelectorAll('#categoryCheckboxes input[type="checkbox"]:checked');
const categories = Array.from(checkboxes).map(cb => cb.value); const categories = Array.from(checkboxes).map(cb => cb.value);
if (categories.length === 0) { if (categories.length === 0) {
formMessage.textContent = 'Please select at least one category'; formMessage.textContent = 'Please select at least one category';
formMessage.className = 'text-red-200 font-medium'; formMessage.className = 'text-red-200 font-medium';
return; return;
} }
subscribeBtn.disabled = true; subscribeBtn.disabled = true;
subscribeBtn.textContent = 'Subscribing...'; subscribeBtn.textContent = 'Subscribing...';
subscribeBtn.classList.add('opacity-75', 'cursor-not-allowed'); subscribeBtn.classList.add('opacity-75', 'cursor-not-allowed');
formMessage.textContent = ''; formMessage.textContent = '';
try { try {
const response = await fetch('/api/subscribe', { const response = await fetch('/api/subscribe', {
method: 'POST', method: 'POST',
headers: { headers: {
'Content-Type': 'application/json' 'Content-Type': 'application/json'
}, },
body: JSON.stringify({ body: JSON.stringify({
email: email, email: email,
categories: categories categories: categories
}) })
}); });
const data = await response.json(); const data = await response.json();
if (response.ok) { if (response.ok) {
formMessage.textContent = data.message || 'Successfully subscribed! Check your email for confirmation.'; formMessage.textContent = data.message || 'Successfully subscribed! Check your email for confirmation.';
formMessage.className = 'text-green-200 font-medium'; formMessage.className = 'text-green-200 font-medium';
@@ -384,15 +416,15 @@ function closeUnsubscribe() {
async function unsubscribe() { async function unsubscribe() {
const emailInput = document.getElementById('unsubscribeEmail'); const emailInput = document.getElementById('unsubscribeEmail');
const unsubscribeMessage = document.getElementById('unsubscribeMessage'); const unsubscribeMessage = document.getElementById('unsubscribeMessage');
const email = emailInput.value.trim(); const email = emailInput.value.trim();
if (!email || !email.includes('@')) { if (!email || !email.includes('@')) {
unsubscribeMessage.textContent = 'Please enter a valid email address'; unsubscribeMessage.textContent = 'Please enter a valid email address';
unsubscribeMessage.className = 'text-red-600 font-medium'; unsubscribeMessage.className = 'text-red-600 font-medium';
return; return;
} }
try { try {
const response = await fetch('/api/unsubscribe', { const response = await fetch('/api/unsubscribe', {
method: 'POST', method: 'POST',
@@ -401,9 +433,9 @@ async function unsubscribe() {
}, },
body: JSON.stringify({ email: email }) body: JSON.stringify({ email: email })
}); });
const data = await response.json(); const data = await response.json();
if (response.ok) { if (response.ok) {
unsubscribeMessage.textContent = data.message || 'Successfully unsubscribed.'; unsubscribeMessage.textContent = data.message || 'Successfully unsubscribed.';
unsubscribeMessage.className = 'text-green-600 font-medium'; unsubscribeMessage.className = 'text-green-600 font-medium';
@@ -423,7 +455,7 @@ async function unsubscribe() {
} }
// Close modal when clicking outside // Close modal when clicking outside
window.onclick = function(event) { window.onclick = function (event) {
const modal = document.getElementById('unsubscribeModal'); const modal = document.getElementById('unsubscribeModal');
if (event.target === modal) { if (event.target === modal) {
closeUnsubscribe(); closeUnsubscribe();


@@ -204,6 +204,31 @@ app.get('/api/ollama/config', async (req, res) => {
} }
}); });
app.get('/api/search', async (req, res) => {
    try {
        const { q, limit, category } = req.query;
        const response = await axios.get(`${API_URL}/api/search`, {
            params: { q, limit, category }
        });
        res.json(response.data);
    } catch (error) {
        if (error.response) {
            // The request was made and the server responded with a status code
            // that falls out of the range of 2xx
            console.error('Search API Error:', error.response.status, error.response.data);
            res.status(error.response.status).json(error.response.data);
        } else if (error.request) {
            // The request was made but no response was received
            console.error('Search API No Response:', error.request);
            res.status(502).json({ error: 'Search service unavailable (timeout/connection)' });
        } else {
            // Something happened in setting up the request that triggered an Error
            console.error('Search API Request Error:', error.message);
            res.status(500).json({ error: 'Internal proxy error' });
        }
    }
});
app.listen(PORT, () => { app.listen(PORT, () => {
console.log(`Frontend server running on http://localhost:${PORT}`); console.log(`Frontend server running on http://localhost:${PORT}`);
console.log(`Admin dashboard: http://localhost:${PORT}/admin.html`); console.log(`Admin dashboard: http://localhost:${PORT}/admin.html`);


@@ -87,7 +87,8 @@ class ChromaClient:
# Prepare text for embedding (Title + Summary + Start of Content) # Prepare text for embedding (Title + Summary + Start of Content)
# This gives semantic search a good overview # This gives semantic search a good overview
title = article.get('title', '') # Use English title if available, otherwise original
title = article.get('title_en') if article.get('title_en') else article.get('title', '')
summary = article.get('summary') or '' summary = article.get('summary') or ''
content_snippet = article.get('content', '')[:1000] content_snippet = article.get('content', '')[:1000]
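Review note: following the comment in search_routes.py, the metadata written here could also carry the English title so search results render in English. A hedged sketch; the actual metadata-building code in `add_articles` is not shown in this diff, so the surrounding handling is illustrative only:

```python
# Inside the metadata-building step of add_articles (illustrative sketch):
metadata = {
    'title': article.get('title_en') or article.get('title', ''),  # prefer English title
    'url': article.get('link', ''),
    'source': article.get('source', ''),
    'category': article.get('category', 'general'),
    'published_at': str(article.get('published_at', '')),
}
```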


@@ -340,7 +340,11 @@ def crawl_rss_feed(feed_url, feed_name, feed_category='general', max_articles=10
if not feed.entries: if not feed.entries:
print(f" ⚠ No entries found in feed") print(f" ⚠ No entries found in feed")
return 0 return {
'crawled': 0,
'summarized': 0,
'failed_summaries': 0
}
crawled_count = 0 crawled_count = 0
summarized_count = 0 summarized_count = 0
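Since `crawl_rss_feed` now returns a stats dict instead of a bare count on the empty-feed path, callers can aggregate per-feed results, assuming the success path returns the same dict. A sketch; the `feeds` iterable and the `category` lookup are illustrative, the field names mirror the return value above:

```python
# Aggregate the per-feed stats returned by crawl_rss_feed
totals = {'crawled': 0, 'summarized': 0, 'failed_summaries': 0}

for feed in feeds:  # e.g. documents from db.rss_feeds.find({'active': True})
    stats = crawl_rss_feed(feed['url'], feed['name'],
                           feed.get('category', 'general'), max_articles=10)
    for key in totals:
        totals[key] += stats.get(key, 0)

print(f"Crawled {totals['crawled']} articles, "
      f"{totals['summarized']} summarized, "
      f"{totals['failed_summaries']} summaries failed")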


@@ -37,12 +37,12 @@ def main():
"""Main scheduler loop""" """Main scheduler loop"""
print("🤖 Munich News Crawler Scheduler") print("🤖 Munich News Crawler Scheduler")
print("="*60) print("="*60)
print("Schedule: Daily at 6:00 AM Berlin time") print("Schedule: Every 3 hours")
print("Timezone: Europe/Berlin (CET/CEST)") print("Timezone: Europe/Berlin (CET/CEST)")
print("="*60) print("="*60)
# Schedule the crawler to run at 6 AM Berlin time # Schedule the crawler to run every 3 hours
schedule.every().day.at("06:00").do(run_crawler) schedule.every(3).hours.do(run_crawler)
# Show next run time # Show next run time
berlin_time = datetime.now(BERLIN_TZ) berlin_time = datetime.now(BERLIN_TZ)
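For context, a generic sketch of the `schedule` loop that drives this change (the repo's actual `main()` may differ; `BERLIN_TZ` is assumed to come from pytz, and the crawler invocation is elided):

```python
import time
from datetime import datetime

import pytz
import schedule

BERLIN_TZ = pytz.timezone('Europe/Berlin')


def run_crawler():
    print(f"Crawl started at {datetime.now(BERLIN_TZ)}")
    # ... invoke the crawler service here ...


# Run every 3 hours instead of once per day at 06:00
schedule.every(3).hours.do(run_crawler)

while True:
    schedule.run_pending()
    time.sleep(60)  # check the schedule once a minute
```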