2025-11-10 19:13:33 +01:00
commit ac5738c29d
64 changed files with 9445 additions and 0 deletions

143
backend/DATABASE_SCHEMA.md Normal file

@@ -0,0 +1,143 @@
# MongoDB Database Schema
This document describes the MongoDB collections and their structure for Munich News Daily.
## Collections
### 1. Articles Collection (`articles`)
Stores all news articles aggregated from Munich news sources.
**Document Structure:**
```javascript
{
  _id: ObjectId,              // Auto-generated MongoDB ID
  title: String,              // Article title (required)
  author: String,             // Article author (optional, extracted during crawl)
  link: String,               // Article URL (required, unique)
  content: String,            // Full article content (no length limit)
  summary: String,            // AI-generated English summary (≤150 words)
  word_count: Number,         // Word count of full content
  summary_word_count: Number, // Word count of AI summary
  source: String,             // News source name (e.g., "Süddeutsche Zeitung München")
  published_at: String,       // Original publication date from the RSS feed or crawled page
  crawled_at: DateTime,       // When article content was crawled (UTC)
  summarized_at: DateTime,    // When AI summary was generated (UTC)
  created_at: DateTime        // When article was added to database (UTC)
}
```
**Indexes:**
- `link` - Unique index to prevent duplicate articles
- `created_at` - Index for efficient sorting by date
**Example Document:**
```javascript
{
  _id: ObjectId("507f1f77bcf86cd799439011"),
  title: "New U-Bahn Line Opens in Munich",
  author: "Max Mustermann",
  link: "https://www.sueddeutsche.de/muenchen/ubahn-1.123456",
  content: "The new U-Bahn line connecting the city center with the airport opened today. Mayor Dieter Reiter attended the opening ceremony... [full article text continues]",
  summary: "Munich's new U-Bahn line connecting the city center to the airport opened today with Mayor Dieter Reiter in attendance. The line features 10 stations and runs every 10 minutes during peak hours, significantly reducing travel time. Construction took five years and cost approximately 2 billion euros.",
  word_count: 1250,
  summary_word_count: 48,
  source: "Süddeutsche Zeitung München",
  published_at: "Mon, 15 Jan 2024 10:00:00 +0100",
  crawled_at: ISODate("2024-01-15T09:30:00.000Z"),
  summarized_at: ISODate("2024-01-15T09:30:15.000Z"),
  created_at: ISODate("2024-01-15T09:00:00.000Z")
}
```
### 2. Subscribers Collection (`subscribers`)
Stores all newsletter subscribers.
**Document Structure:**
```javascript
{
  _id: ObjectId,           // Auto-generated MongoDB ID
  email: String,           // Subscriber email (required, unique, lowercase)
  subscribed_at: DateTime, // When user subscribed (UTC)
  status: String           // Subscription status: 'active' or 'inactive'
}
```
**Indexes:**
- `email` - Unique index for email lookups and preventing duplicates
- `subscribed_at` - Index for analytics and sorting
**Example Document:**
```javascript
{
  _id: ObjectId("507f1f77bcf86cd799439012"),
  email: "user@example.com",
  subscribed_at: ISODate("2024-01-15T08:30:00.000Z"),
  status: "active"
}
```
## Design Decisions
### Why MongoDB?
1. **Flexibility**: Easy to add new fields without schema migrations
2. **Scalability**: Handles large volumes of articles and subscribers efficiently
3. **Performance**: Indexes on frequently queried fields (link, email, created_at)
4. **Document Model**: Natural fit for news articles and subscriber data
### Schema Choices
1. **Unique Link Index**: Prevents duplicate articles from being stored, even if fetched multiple times
2. **Status Field**: Soft delete for subscribers (set to 'inactive' instead of deleting) - allows for analytics and easy re-subscription
3. **UTC Timestamps**: All dates stored in UTC for consistency across timezones
4. **Lowercase Emails**: Emails stored in lowercase to prevent case-sensitivity issues
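The soft-delete and lowercase-email conventions above can be sketched as small helpers. This is a hedged illustration, not the backend's actual code: the helper names (`normalize_email`, `subscriber_doc`, `unsubscribe_update`) are invented here, but the document shape, UTC timestamps, and status values match the schema described above.

```python
from datetime import datetime, timezone

def normalize_email(email: str) -> str:
    """Lowercase and trim so the unique email index is effectively case-insensitive."""
    return email.strip().lower()

def subscriber_doc(email: str) -> dict:
    """Build a subscriber document in the shape described above."""
    return {
        'email': normalize_email(email),
        'subscribed_at': datetime.now(timezone.utc),  # all timestamps stored in UTC
        'status': 'active',                           # 'inactive' marks a soft delete
    }

def unsubscribe_update() -> dict:
    """Soft delete: flip the status flag instead of removing the document."""
    return {'$set': {'status': 'inactive'}}
```

Because unsubscribing only flips `status`, re-subscription is a single `update_one` back to `'active'` rather than a fresh insert.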
### Future Enhancements
Potential fields to add in the future:
**Articles:**
- `category`: String (e.g., "politics", "sports", "culture")
- `tags`: Array of Strings
- `image_url`: String
- `sent_in_newsletter`: Boolean (track if article was sent)
- `sent_at`: DateTime (when article was included in newsletter)
**Subscribers:**
- `preferences`: Object (newsletter frequency, categories, etc.)
- `last_sent_at`: DateTime (last newsletter sent date)
- `unsubscribed_at`: DateTime (when user unsubscribed)
- `verification_token`: String (for email verification)
## AI Summarization Workflow
When the crawler processes an article:
1. **Extract Content**: Full article text is extracted from the webpage
2. **Summarize with Ollama**: If `OLLAMA_ENABLED=true`, the content is sent to Ollama for summarization
3. **Store Both**: Both the original `content` and AI-generated `summary` are stored
4. **Fallback**: If Ollama is unavailable or fails, only the original content is stored
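The steps above can be sketched as a single function. This is an illustrative sketch, not the crawler's actual implementation: `call_ollama` stands in for the backend's `services.ollama_service.call_ollama`, which returns a `(response_text, error_message)` tuple.

```python
def summarize_article(content, ollama_enabled, call_ollama):
    """Steps 2-4: summarize when enabled, fall back to content-only on failure."""
    doc = {'content': content}          # step 1: extracted content is always stored
    if not ollama_enabled:
        return doc                      # Ollama disabled: original content only
    summary, error = call_ollama(content)
    if summary:                         # step 3: store both content and summary
        doc['summary'] = summary
        doc['summary_word_count'] = len(summary.split())
    # step 4: on error, doc keeps only the original content (fallback)
    return doc
```

The fallback means a slow or unreachable Ollama server never blocks article storage; summaries can be backfilled on a later crawl.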
### Summary Field Details
- **Language**: Always in English, regardless of source article language
- **Length**: Maximum 150 words
- **Format**: Plain text, concise and clear
- **Purpose**: Quick preview for newsletters and frontend display
### Querying Articles
```javascript
// Get articles with AI summaries
db.articles.find({ summary: { $exists: true, $ne: null } })
// Get articles without summaries
db.articles.find({ summary: { $exists: false } })
// Count summarized articles
db.articles.countDocuments({ summary: { $exists: true, $ne: null } })
```
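The same queries can be issued from the Python backend via pymongo; the filter documents below mirror the mongosh queries above (`articles_collection` is the handle defined in `database.py`):

```python
# Filter documents mirroring the mongosh queries above.
HAS_SUMMARY = {'summary': {'$exists': True, '$ne': None}}
NO_SUMMARY = {'summary': {'$exists': False}}

# With the pymongo collection from database.py:
#   articles_collection.find(HAS_SUMMARY)             # articles with AI summaries
#   articles_collection.find(NO_SUMMARY)              # articles without summaries
#   articles_collection.count_documents(HAS_SUMMARY)  # count summarized articles
```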

98
backend/STRUCTURE.md Normal file

@@ -0,0 +1,98 @@
# Backend Structure
The backend has been modularized for better maintainability and scalability.
## Directory Structure
```
backend/
├── app.py # Main Flask application entry point
├── config.py # Configuration management
├── database.py # Database connection and initialization
├── requirements.txt # Python dependencies
├── .env # Environment variables
├── routes/ # API route handlers (blueprints)
│ ├── __init__.py
│ ├── subscription_routes.py # /api/subscribe, /api/unsubscribe
│ ├── news_routes.py # /api/news, /api/stats
│ ├── rss_routes.py # /api/rss-feeds (CRUD operations)
│ └── ollama_routes.py # /api/ollama/* (AI features)
└── services/ # Business logic layer
├── __init__.py
├── news_service.py # News fetching and storage logic
├── email_service.py # Newsletter email sending
└── ollama_service.py # Ollama AI integration
```
## Key Components
### app.py
- Main Flask application
- Registers all blueprints
- Minimal code, just wiring things together
### config.py
- Centralized configuration
- Loads environment variables
- Single source of truth for all settings
### database.py
- MongoDB connection setup
- Collection definitions
- Database initialization with indexes
### routes/
Each route file is a Flask Blueprint handling specific API endpoints:
- **subscription_routes.py**: User subscription management
- **news_routes.py**: News fetching and statistics
- **rss_routes.py**: RSS feed management (add/remove/list/toggle)
- **ollama_routes.py**: AI/Ollama integration endpoints
### services/
Business logic separated from route handlers:
- **news_service.py**: Fetches news from RSS feeds, saves to database
- **email_service.py**: Sends newsletter emails to subscribers
- **ollama_service.py**: Communicates with Ollama AI server
## Benefits of This Structure
1. **Separation of Concerns**: Routes handle HTTP, services handle business logic
2. **Testability**: Each module can be tested independently
3. **Maintainability**: Easy to find and modify specific functionality
4. **Scalability**: Easy to add new routes or services
5. **Reusability**: Services can be used by multiple routes
## Adding New Features
### To add a new API endpoint:
1. Create a new route file in `routes/` or add to existing one
2. Create a Blueprint and define routes
3. Register the blueprint in `app.py`
### To add new business logic:
1. Create a new service file in `services/`
2. Import and use in your route handlers
### Example:
```python
# services/my_service.py
def my_business_logic():
    return "Hello"

# routes/my_routes.py
from flask import Blueprint
from services.my_service import my_business_logic

my_bp = Blueprint('my', __name__)

@my_bp.route('/api/my-endpoint')
def my_endpoint():
    result = my_business_logic()
    return {'message': result}

# app.py
from routes.my_routes import my_bp

app.register_blueprint(my_bp)
```

29
backend/app.py Normal file

@@ -0,0 +1,29 @@
from flask import Flask
from flask_cors import CORS
from config import Config
from database import init_db
from routes.subscription_routes import subscription_bp
from routes.news_routes import news_bp
from routes.rss_routes import rss_bp
from routes.ollama_routes import ollama_bp
from routes.newsletter_routes import newsletter_bp
# Initialize Flask app
app = Flask(__name__)
CORS(app)
# Initialize database
init_db()
# Register blueprints
app.register_blueprint(subscription_bp)
app.register_blueprint(news_bp)
app.register_blueprint(rss_bp)
app.register_blueprint(ollama_bp)
app.register_blueprint(newsletter_bp)
# Print configuration
Config.print_config()
if __name__ == '__main__':
    app.run(debug=True, port=Config.FLASK_PORT, host='127.0.0.1')

52
backend/config.py Normal file

@@ -0,0 +1,52 @@
import os
from dotenv import load_dotenv
from pathlib import Path
# Get the directory where this script is located
backend_dir = Path(__file__).parent
env_path = backend_dir / '.env'
# Load .env file
load_dotenv(dotenv_path=env_path)
# Debug: Print if .env file exists (for troubleshooting)
if env_path.exists():
    print(f"✓ Loading .env file from: {env_path}")
else:
    print(f"⚠ Warning: .env file not found at {env_path}")
    print(f"  Current working directory: {os.getcwd()}")
    print(f"  Looking for .env in: {env_path}")

class Config:
    """Application configuration"""
    # MongoDB
    MONGODB_URI = os.getenv('MONGODB_URI', 'mongodb://localhost:27017/')
    DB_NAME = 'munich_news'

    # Email
    SMTP_SERVER = os.getenv('SMTP_SERVER', 'smtp.gmail.com')
    SMTP_PORT = int(os.getenv('SMTP_PORT', '587'))
    EMAIL_USER = os.getenv('EMAIL_USER', '')
    EMAIL_PASSWORD = os.getenv('EMAIL_PASSWORD', '')

    # Ollama
    OLLAMA_BASE_URL = os.getenv('OLLAMA_BASE_URL', 'http://localhost:11434')
    OLLAMA_MODEL = os.getenv('OLLAMA_MODEL', 'llama2')
    OLLAMA_API_KEY = os.getenv('OLLAMA_API_KEY', '')
    OLLAMA_ENABLED = os.getenv('OLLAMA_ENABLED', 'false').lower() == 'true'

    # Flask
    FLASK_PORT = int(os.getenv('FLASK_PORT', '5000'))

    @classmethod
    def print_config(cls):
        """Print configuration (without sensitive data)"""
        print("\nApplication Configuration:")
        print(f"  MongoDB URI: {cls.MONGODB_URI}")
        print(f"  Database: {cls.DB_NAME}")
        print(f"  Flask Port: {cls.FLASK_PORT}")
        print(f"  Ollama Base URL: {cls.OLLAMA_BASE_URL}")
        print(f"  Ollama Model: {cls.OLLAMA_MODEL}")
        print(f"  Ollama Enabled: {cls.OLLAMA_ENABLED}")

53
backend/database.py Normal file

@@ -0,0 +1,53 @@
from pymongo import MongoClient
from datetime import datetime
from config import Config
# MongoDB setup
client = MongoClient(Config.MONGODB_URI)
db = client[Config.DB_NAME]
# Collections
articles_collection = db['articles']
subscribers_collection = db['subscribers']
rss_feeds_collection = db['rss_feeds']
def init_db():
    """Initialize database with indexes"""
    # Create unique index on article links to prevent duplicates
    articles_collection.create_index('link', unique=True)
    # Create index on created_at for faster sorting
    articles_collection.create_index('created_at')

    # Create unique index on subscriber emails
    subscribers_collection.create_index('email', unique=True)
    # Create index on subscribed_at
    subscribers_collection.create_index('subscribed_at')

    # Create unique index on RSS feed URLs
    rss_feeds_collection.create_index('url', unique=True)

    # Initialize default RSS feeds if collection is empty
    if rss_feeds_collection.count_documents({}) == 0:
        default_feeds = [
            {
                'name': 'Süddeutsche Zeitung München',
                'url': 'https://www.sueddeutsche.de/muenchen/rss',
                'active': True,
                'created_at': datetime.utcnow()
            },
            {
                'name': 'Münchner Merkur',
                'url': 'https://www.merkur.de/muenchen/rss',
                'active': True,
                'created_at': datetime.utcnow()
            },
            {
                'name': 'Abendzeitung München',
                'url': 'https://www.abendzeitung-muenchen.de/rss',
                'active': True,
                'created_at': datetime.utcnow()
            }
        ]
        rss_feeds_collection.insert_many(default_feeds)
        print(f"Initialized {len(default_feeds)} default RSS feeds")

    print("Database initialized with indexes")

32
backend/env.template Normal file

@@ -0,0 +1,32 @@
# MongoDB Configuration
# For Docker Compose (no authentication):
MONGODB_URI=mongodb://localhost:27017/
# For Docker Compose with authentication:
# MONGODB_URI=mongodb://admin:password@localhost:27017/
# For MongoDB Atlas (cloud):
# MONGODB_URI=mongodb+srv://username:password@cluster.mongodb.net/
# Email Configuration (for sending newsletters)
SMTP_SERVER=smtp.gmail.com
SMTP_PORT=587
EMAIL_USER=your-email@gmail.com
EMAIL_PASSWORD=your-app-password
# Note: For Gmail, use an App Password: https://support.google.com/accounts/answer/185833
# Ollama Configuration (for AI-powered features)
# Remote Ollama server URL (e.g., http://your-server-ip:11434 or https://your-domain.com)
OLLAMA_BASE_URL=http://localhost:11434
# Optional: API key if your Ollama server requires authentication
# OLLAMA_API_KEY=your-api-key-here
# Model name to use (e.g., llama2, mistral, codellama, llama3, phi3:latest)
OLLAMA_MODEL=phi3:latest
# Enable/disable Ollama features (true/false)
# When enabled, the crawler will automatically summarize articles in English (≤150 words)
OLLAMA_ENABLED=true
# Timeout for Ollama requests in seconds (default: 30)
OLLAMA_TIMEOUT=30
# Flask Server Configuration
# Port for the Flask server (5001 here avoids the macOS AirPlay Receiver conflict on port 5000; the code default is 5000)
FLASK_PORT=5001

61
backend/fix_duplicates.py Normal file

@@ -0,0 +1,61 @@
"""
Script to fix duplicate RSS feeds and create unique index
Run this once: python fix_duplicates.py
"""
from pymongo import MongoClient
from config import Config
# Connect to MongoDB
client = MongoClient(Config.MONGODB_URI)
db = client[Config.DB_NAME]
rss_feeds_collection = db['rss_feeds']
print("Fixing duplicate RSS feeds...")
# Get all feeds
all_feeds = list(rss_feeds_collection.find())
print(f"Total feeds found: {len(all_feeds)}")
# Find duplicates by URL
seen_urls = {}
duplicates_to_remove = []
for feed in all_feeds:
    url = feed.get('url')
    if url in seen_urls:
        # This is a duplicate, mark for removal
        duplicates_to_remove.append(feed['_id'])
        print(f"  Duplicate found: {feed['name']} - {url}")
    else:
        # First occurrence, keep it
        seen_urls[url] = feed['_id']

# Remove duplicates
if duplicates_to_remove:
    result = rss_feeds_collection.delete_many({'_id': {'$in': duplicates_to_remove}})
    print(f"Removed {result.deleted_count} duplicate feeds")
else:
    print("No duplicates found")

# Drop existing indexes (if any)
print("\nDropping existing indexes...")
try:
    rss_feeds_collection.drop_indexes()
    print("Indexes dropped")
except Exception as e:
    print(f"Note: {e}")

# Create unique index on URL
print("\nCreating unique index on 'url' field...")
rss_feeds_collection.create_index('url', unique=True)
print("✓ Unique index created successfully")

# Verify
remaining_feeds = list(rss_feeds_collection.find())
print(f"\nFinal feed count: {len(remaining_feeds)}")
print("\nRemaining feeds:")
for feed in remaining_feeds:
    print(f"  - {feed['name']}: {feed['url']}")

print("\n✓ Done! Duplicates removed and unique index created.")
print("You can now restart your Flask app.")

8
backend/requirements.txt Normal file

@@ -0,0 +1,8 @@
Flask==3.0.0
flask-cors==4.0.0
feedparser==6.0.10
python-dotenv==1.0.0
pymongo==4.6.1
requests==2.31.0
Jinja2==3.1.2


@@ -0,0 +1 @@
# Routes package


@@ -0,0 +1,123 @@
from flask import Blueprint, jsonify
from database import articles_collection
from services.news_service import fetch_munich_news, save_articles_to_db
news_bp = Blueprint('news', __name__)
@news_bp.route('/api/news', methods=['GET'])
def get_news():
    """Get latest Munich news"""
    try:
        # Fetch fresh news and save to database
        articles = fetch_munich_news()
        save_articles_to_db(articles)

        # Get articles from MongoDB, sorted by created_at (newest first)
        cursor = articles_collection.find().sort('created_at', -1).limit(20)
        db_articles = []
        for doc in cursor:
            article = {
                'title': doc.get('title', ''),
                'author': doc.get('author'),
                'link': doc.get('link', ''),
                'source': doc.get('source', ''),
                'published': doc.get('published_at', ''),
                'word_count': doc.get('word_count'),
                'has_full_content': bool(doc.get('content')),
                'has_summary': bool(doc.get('summary'))
            }
            # Include AI summary if available
            if doc.get('summary'):
                article['summary'] = doc.get('summary', '')
                article['summary_word_count'] = doc.get('summary_word_count')
                article['summarized_at'] = doc['summarized_at'].isoformat() if doc.get('summarized_at') else None
            # Fallback: include a preview of the content if no summary (first 200 chars)
            elif doc.get('content'):
                article['preview'] = doc.get('content', '')[:200] + '...'
            db_articles.append(article)

        # Combine fresh articles with database articles and deduplicate
        seen_links = set()
        combined = []
        # Add fresh articles first (they're more recent)
        for article in articles:
            link = article.get('link', '')
            if link and link not in seen_links:
                seen_links.add(link)
                combined.append(article)
        # Add database articles
        for article in db_articles:
            link = article.get('link', '')
            if link and link not in seen_links:
                seen_links.add(link)
                combined.append(article)

        return jsonify({'articles': combined[:20]}), 200
    except Exception as e:
        return jsonify({'error': str(e)}), 500

@news_bp.route('/api/news/<path:article_url>', methods=['GET'])
def get_article_by_url(article_url):
    """Get full article content by URL"""
    try:
        # Decode URL
        from urllib.parse import unquote
        decoded_url = unquote(article_url)

        # Find article by link
        article = articles_collection.find_one({'link': decoded_url})
        if not article:
            return jsonify({'error': 'Article not found'}), 404

        return jsonify({
            'title': article.get('title', ''),
            'author': article.get('author'),
            'link': article.get('link', ''),
            'content': article.get('content', ''),
            'summary': article.get('summary'),
            'word_count': article.get('word_count', 0),
            'summary_word_count': article.get('summary_word_count'),
            'source': article.get('source', ''),
            'published_at': article.get('published_at', ''),
            'crawled_at': article['crawled_at'].isoformat() if article.get('crawled_at') else None,
            'summarized_at': article['summarized_at'].isoformat() if article.get('summarized_at') else None,
            'created_at': article['created_at'].isoformat() if article.get('created_at') else None
        }), 200
    except Exception as e:
        return jsonify({'error': str(e)}), 500

@news_bp.route('/api/stats', methods=['GET'])
def get_stats():
    """Get subscription statistics"""
    try:
        from database import subscribers_collection
        # Count only active subscribers
        subscriber_count = subscribers_collection.count_documents({'status': 'active'})
        # Also get total article count
        article_count = articles_collection.count_documents({})
        # Count crawled articles
        crawled_count = articles_collection.count_documents({'content': {'$exists': True, '$ne': ''}})
        # Count summarized articles
        summarized_count = articles_collection.count_documents({'summary': {'$exists': True, '$ne': ''}})

        return jsonify({
            'subscribers': subscriber_count,
            'articles': article_count,
            'crawled_articles': crawled_count,
            'summarized_articles': summarized_count
        }), 200
    except Exception as e:
        return jsonify({'error': str(e)}), 500


@@ -0,0 +1,62 @@
from flask import Blueprint, Response
from pathlib import Path
from jinja2 import Template
from datetime import datetime
from database import articles_collection
newsletter_bp = Blueprint('newsletter', __name__)
@newsletter_bp.route('/api/newsletter/preview', methods=['GET'])
def preview_newsletter():
    """Preview the newsletter HTML (for testing)"""
    try:
        # Get latest articles with AI summaries
        cursor = articles_collection.find(
            {'summary': {'$exists': True, '$ne': None}}
        ).sort('created_at', -1).limit(10)

        articles = []
        for doc in cursor:
            articles.append({
                'title': doc.get('title', ''),
                'author': doc.get('author'),
                'link': doc.get('link', ''),
                'summary': doc.get('summary', ''),
                'source': doc.get('source', ''),
                'published_at': doc.get('published_at', '')
            })

        if not articles:
            return Response(
                "<h1>No articles with summaries found</h1><p>Run the crawler with Ollama enabled first.</p>",
                mimetype='text/html'
            )

        # Load template
        template_path = Path(__file__).parent.parent / 'templates' / 'newsletter_template.html'
        with open(template_path, 'r', encoding='utf-8') as f:
            template_content = f.read()
        template = Template(template_content)

        # Prepare data
        now = datetime.now()
        template_data = {
            'date': now.strftime('%A, %B %d, %Y'),
            'year': now.year,
            'article_count': len(articles),
            'articles': articles,
            'unsubscribe_link': 'http://localhost:3000/unsubscribe',
            'website_link': 'http://localhost:3000'
        }

        # Render and return HTML
        html_content = template.render(**template_data)
        return Response(html_content, mimetype='text/html')
    except Exception as e:
        return Response(
            f"<h1>Error</h1><p>{str(e)}</p>",
            mimetype='text/html'
        ), 500


@@ -0,0 +1,158 @@
from flask import Blueprint, jsonify
from config import Config
from services.ollama_service import call_ollama, list_ollama_models
import os
ollama_bp = Blueprint('ollama', __name__)
@ollama_bp.route('/api/ollama/ping', methods=['GET', 'POST'])
def ping_ollama():
    """Test connection to Ollama server"""
    try:
        # Check if Ollama is enabled
        if not Config.OLLAMA_ENABLED:
            return jsonify({
                'status': 'disabled',
                'message': 'Ollama is not enabled. Set OLLAMA_ENABLED=true in your .env file.',
                'ollama_config': {
                    'base_url': Config.OLLAMA_BASE_URL,
                    'model': Config.OLLAMA_MODEL,
                    'enabled': False
                }
            }), 200

        # Send a simple test prompt
        test_prompt = "Say 'Hello! I am connected and working.' in one sentence."
        system_prompt = "You are a helpful assistant. Respond briefly and concisely."
        response_text, error_message = call_ollama(test_prompt, system_prompt)

        if response_text:
            return jsonify({
                'status': 'success',
                'message': 'Successfully connected to Ollama',
                'response': response_text,
                'ollama_config': {
                    'base_url': Config.OLLAMA_BASE_URL,
                    'model': Config.OLLAMA_MODEL,
                    'enabled': True
                }
            }), 200
        else:
            # Try to get available models for a better error message
            available_models, _ = list_ollama_models()
            troubleshooting = {
                'check_server': f'Verify Ollama is running at {Config.OLLAMA_BASE_URL}',
                'check_model': f'Verify model "{Config.OLLAMA_MODEL}" is available (run: ollama list)',
                'test_connection': f'Test manually: curl {Config.OLLAMA_BASE_URL}/api/generate -d \'{{"model":"{Config.OLLAMA_MODEL}","prompt":"test"}}\''
            }
            if available_models:
                troubleshooting['available_models'] = available_models
                troubleshooting['suggestion'] = f'Try setting OLLAMA_MODEL to one of: {", ".join(available_models[:5])}'
            return jsonify({
                'status': 'error',
                'message': error_message or 'Failed to get response from Ollama',
                'error_details': error_message,
                'ollama_config': {
                    'base_url': Config.OLLAMA_BASE_URL,
                    'model': Config.OLLAMA_MODEL,
                    'enabled': True
                },
                'troubleshooting': troubleshooting
            }), 500
    except Exception as e:
        return jsonify({
            'status': 'error',
            'message': f'Error connecting to Ollama: {str(e)}',
            'ollama_config': {
                'base_url': Config.OLLAMA_BASE_URL,
                'model': Config.OLLAMA_MODEL,
                'enabled': Config.OLLAMA_ENABLED
            }
        }), 500

@ollama_bp.route('/api/ollama/config', methods=['GET'])
def get_ollama_config():
    """Get current Ollama configuration (for debugging)"""
    try:
        from pathlib import Path
        backend_dir = Path(__file__).parent.parent
        env_path = backend_dir / '.env'
        return jsonify({
            'ollama_config': {
                'base_url': Config.OLLAMA_BASE_URL,
                'model': Config.OLLAMA_MODEL,
                'enabled': Config.OLLAMA_ENABLED,
                'has_api_key': bool(Config.OLLAMA_API_KEY)
            },
            'env_file_path': str(env_path),
            'env_file_exists': env_path.exists(),
            'current_working_directory': os.getcwd()
        }), 200
    except Exception as e:
        return jsonify({
            'error': str(e),
            'ollama_config': {
                'base_url': Config.OLLAMA_BASE_URL,
                'model': Config.OLLAMA_MODEL,
                'enabled': Config.OLLAMA_ENABLED
            }
        }), 500

@ollama_bp.route('/api/ollama/models', methods=['GET'])
def get_ollama_models():
    """List available models on Ollama server"""
    try:
        if not Config.OLLAMA_ENABLED:
            return jsonify({
                'status': 'disabled',
                'message': 'Ollama is not enabled. Set OLLAMA_ENABLED=true in your .env file.',
                'ollama_config': {
                    'base_url': Config.OLLAMA_BASE_URL,
                    'model': Config.OLLAMA_MODEL,
                    'enabled': False
                }
            }), 200

        models, error_message = list_ollama_models()
        if models is not None:
            return jsonify({
                'status': 'success',
                'models': models,
                'current_model': Config.OLLAMA_MODEL,
                'ollama_config': {
                    'base_url': Config.OLLAMA_BASE_URL,
                    'model': Config.OLLAMA_MODEL,
                    'enabled': True
                }
            }), 200
        else:
            return jsonify({
                'status': 'error',
                'message': error_message or 'Failed to list models',
                'ollama_config': {
                    'base_url': Config.OLLAMA_BASE_URL,
                    'model': Config.OLLAMA_MODEL,
                    'enabled': True
                }
            }), 500
    except Exception as e:
        return jsonify({
            'status': 'error',
            'message': f'Error listing models: {str(e)}',
            'ollama_config': {
                'base_url': Config.OLLAMA_BASE_URL,
                'model': Config.OLLAMA_MODEL,
                'enabled': Config.OLLAMA_ENABLED
            }
        }), 500


@@ -0,0 +1,124 @@
from flask import Blueprint, request, jsonify
from datetime import datetime
from pymongo.errors import DuplicateKeyError
from bson.objectid import ObjectId
import feedparser
from database import rss_feeds_collection
rss_bp = Blueprint('rss', __name__)
@rss_bp.route('/api/rss-feeds', methods=['GET'])
def get_rss_feeds():
    """Get all RSS feeds"""
    try:
        cursor = rss_feeds_collection.find().sort('created_at', -1)
        feeds = []
        for feed in cursor:
            feeds.append({
                'id': str(feed['_id']),
                'name': feed.get('name', ''),
                'url': feed.get('url', ''),
                'active': feed.get('active', True),
                'created_at': feed['created_at'].isoformat() if feed.get('created_at') else ''
            })
        return jsonify({'feeds': feeds}), 200
    except Exception as e:
        return jsonify({'error': str(e)}), 500

@rss_bp.route('/api/rss-feeds', methods=['POST'])
def add_rss_feed():
    """Add a new RSS feed"""
    data = request.json
    name = data.get('name', '').strip()
    url = data.get('url', '').strip()

    if not name or not url:
        return jsonify({'error': 'Name and URL are required'}), 400
    if not url.startswith('http://') and not url.startswith('https://'):
        return jsonify({'error': 'URL must start with http:// or https://'}), 400

    try:
        # Test if the RSS feed is valid
        try:
            feed = feedparser.parse(url)
            if not feed.entries:
                return jsonify({'error': 'Invalid RSS feed or no entries found'}), 400
        except Exception as e:
            return jsonify({'error': f'Failed to parse RSS feed: {str(e)}'}), 400

        feed_doc = {
            'name': name,
            'url': url,
            'active': True,
            'created_at': datetime.utcnow()
        }
        try:
            result = rss_feeds_collection.insert_one(feed_doc)
            return jsonify({
                'message': 'RSS feed added successfully',
                'id': str(result.inserted_id)
            }), 201
        except DuplicateKeyError:
            return jsonify({'error': 'RSS feed URL already exists'}), 409
    except Exception as e:
        return jsonify({'error': str(e)}), 500

@rss_bp.route('/api/rss-feeds/<feed_id>', methods=['DELETE'])
def remove_rss_feed(feed_id):
    """Remove an RSS feed"""
    try:
        # Validate ObjectId
        try:
            obj_id = ObjectId(feed_id)
        except Exception:
            return jsonify({'error': 'Invalid feed ID'}), 400

        result = rss_feeds_collection.delete_one({'_id': obj_id})
        if result.deleted_count > 0:
            return jsonify({'message': 'RSS feed removed successfully'}), 200
        else:
            return jsonify({'error': 'RSS feed not found'}), 404
    except Exception as e:
        return jsonify({'error': str(e)}), 500

@rss_bp.route('/api/rss-feeds/<feed_id>/toggle', methods=['PATCH'])
def toggle_rss_feed(feed_id):
    """Toggle RSS feed active status"""
    try:
        # Validate ObjectId
        try:
            obj_id = ObjectId(feed_id)
        except Exception:
            return jsonify({'error': 'Invalid feed ID'}), 400

        # Get current status
        feed = rss_feeds_collection.find_one({'_id': obj_id})
        if not feed:
            return jsonify({'error': 'RSS feed not found'}), 404

        # Toggle status
        new_status = not feed.get('active', True)
        result = rss_feeds_collection.update_one(
            {'_id': obj_id},
            {'$set': {'active': new_status}}
        )
        if result.modified_count > 0:
            return jsonify({
                'message': f'RSS feed {"activated" if new_status else "deactivated"} successfully',
                'active': new_status
            }), 200
        else:
            return jsonify({'error': 'Failed to update RSS feed'}), 500
    except Exception as e:
        return jsonify({'error': str(e)}), 500


@@ -0,0 +1,63 @@
from flask import Blueprint, request, jsonify
from datetime import datetime
from pymongo.errors import DuplicateKeyError
from database import subscribers_collection
subscription_bp = Blueprint('subscription', __name__)
@subscription_bp.route('/api/subscribe', methods=['POST'])
def subscribe():
    """Subscribe a user to the newsletter"""
    data = request.json
    email = data.get('email', '').strip().lower()

    if not email or '@' not in email:
        return jsonify({'error': 'Invalid email address'}), 400

    try:
        subscriber_doc = {
            'email': email,
            'subscribed_at': datetime.utcnow(),
            'status': 'active'
        }
        # Try to insert; a duplicate key error means the subscriber already exists
        try:
            subscribers_collection.insert_one(subscriber_doc)
            return jsonify({'message': 'Successfully subscribed!'}), 201
        except DuplicateKeyError:
            # Check if subscriber is active
            existing = subscribers_collection.find_one({'email': email})
            if existing and existing.get('status') == 'active':
                return jsonify({'message': 'Email already subscribed'}), 200
            else:
                # Reactivate if previously unsubscribed
                subscribers_collection.update_one(
                    {'email': email},
                    {'$set': {'status': 'active', 'subscribed_at': datetime.utcnow()}}
                )
                return jsonify({'message': 'Successfully re-subscribed!'}), 200
    except Exception as e:
        return jsonify({'error': str(e)}), 500

@subscription_bp.route('/api/unsubscribe', methods=['POST'])
def unsubscribe():
    """Unsubscribe a user from the newsletter"""
    data = request.json
    email = data.get('email', '').strip().lower()

    try:
        result = subscribers_collection.update_one(
            {'email': email},
            {'$set': {'status': 'inactive'}}
        )
        if result.matched_count > 0:
            return jsonify({'message': 'Successfully unsubscribed'}), 200
        else:
            return jsonify({'error': 'Email not found in subscribers'}), 404
    except Exception as e:
        return jsonify({'error': str(e)}), 500


@@ -0,0 +1 @@
# Services package


@@ -0,0 +1,88 @@
import smtplib
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
from datetime import datetime
from pathlib import Path
from jinja2 import Template
from config import Config
from database import subscribers_collection, articles_collection
def send_newsletter(max_articles=10):
"""Send newsletter to all subscribers with AI-summarized articles"""
if not Config.EMAIL_USER or not Config.EMAIL_PASSWORD:
print("Email credentials not configured")
return
# Get latest articles with AI summaries from database
cursor = articles_collection.find(
{'summary': {'$exists': True, '$ne': None}}
).sort('created_at', -1).limit(max_articles)
articles = []
for doc in cursor:
articles.append({
'title': doc.get('title', ''),
'author': doc.get('author'),
'link': doc.get('link', ''),
'summary': doc.get('summary', ''),
'source': doc.get('source', ''),
'published_at': doc.get('published_at', '')
})
if not articles:
print("No articles with summaries to send")
return
# Load email template
template_path = Path(__file__).parent.parent / 'templates' / 'newsletter_template.html'
with open(template_path, 'r', encoding='utf-8') as f:
template_content = f.read()
template = Template(template_content)
# Prepare template data
now = datetime.now()
template_data = {
'date': now.strftime('%A, %B %d, %Y'),
'year': now.year,
'article_count': len(articles),
'articles': articles,
'unsubscribe_link': 'http://localhost:3000', # Update with actual unsubscribe link
'website_link': 'http://localhost:3000'
}
# Render HTML
html_content = template.render(**template_data)
# Get all active subscribers
subscribers_cursor = subscribers_collection.find({'status': 'active'})
subscribers = [doc['email'] for doc in subscribers_cursor]
# Send emails
for subscriber in subscribers:
try:
msg = MIMEMultipart('alternative')
msg['Subject'] = f'Munich News Daily - {datetime.now().strftime("%B %d, %Y")}'
msg['From'] = f'Munich News Daily <{Config.EMAIL_USER}>'
msg['To'] = subscriber
            # Use a timezone-aware datetime so %z is populated, and keep the
            # Message-ID to a single '@' as RFC 5322 requires
            msg['Date'] = datetime.now().astimezone().strftime('%a, %d %b %Y %H:%M:%S %z')
            msg['Message-ID'] = f'<{datetime.now().timestamp()}.{subscriber.replace("@", ".")}@dongho.kim>'
msg['X-Mailer'] = 'Munich News Daily'
# Add plain text version as fallback
plain_text = "This email requires HTML support. Please view it in an HTML-capable email client."
msg.attach(MIMEText(plain_text, 'plain', 'utf-8'))
# Add HTML version
msg.attach(MIMEText(html_content, 'html', 'utf-8'))
server = smtplib.SMTP(Config.SMTP_SERVER, Config.SMTP_PORT)
server.starttls()
server.login(Config.EMAIL_USER, Config.EMAIL_PASSWORD)
server.send_message(msg)
server.quit()
print(f"Newsletter sent to {subscriber}")
except Exception as e:
print(f"Error sending to {subscriber}: {e}")
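The message assembly inside the send loop can be checked without an SMTP server, since the MIME structure is built entirely in memory. A minimal sketch (the helper name and example addresses are illustrative, not part of the service):

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.utils import formatdate, make_msgid

def build_newsletter_message(sender, recipient, subject, html):
    # Hypothetical helper mirroring the message assembly in send_newsletter
    msg = MIMEMultipart('alternative')
    msg['Subject'] = subject
    msg['From'] = f'Munich News Daily <{sender}>'
    msg['To'] = recipient
    msg['Date'] = formatdate(localtime=True)        # RFC 2822 date with timezone
    msg['Message-ID'] = make_msgid(domain='example.org')
    # Plain-text part first, HTML last: clients render the last part they support
    msg.attach(MIMEText('This email requires HTML support.', 'plain', 'utf-8'))
    msg.attach(MIMEText(html, 'html', 'utf-8'))
    return msg

msg = build_newsletter_message('news@example.org', 'reader@example.org',
                               'Munich News Daily', '<p>Servus!</p>')
```

`email.utils.formatdate` and `make_msgid` produce standards-compliant headers and avoid hand-rolled `strftime` formats.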

View File

@@ -0,0 +1,90 @@
import feedparser
from datetime import datetime
from pymongo.errors import DuplicateKeyError
from database import articles_collection, rss_feeds_collection
from utils.rss_utils import extract_article_url, extract_article_summary, extract_published_date
def get_active_rss_feeds():
"""Get all active RSS feeds from database"""
feeds = []
cursor = rss_feeds_collection.find({'active': True})
for feed in cursor:
feeds.append({
'name': feed.get('name', ''),
'url': feed.get('url', '')
})
return feeds
def fetch_munich_news():
"""Fetch news from Munich news sources"""
articles = []
# Get RSS feeds from database instead of hardcoded list
sources = get_active_rss_feeds()
for source in sources:
try:
feed = feedparser.parse(source['url'])
for entry in feed.entries[:5]: # Get top 5 from each source
# Extract article URL using utility function
article_url = extract_article_url(entry)
if not article_url:
print(f" ⚠ No valid URL for: {entry.get('title', 'Unknown')[:50]}")
continue # Skip entries without valid URL
# Extract summary
summary = extract_article_summary(entry)
if summary:
summary = summary[:200] + '...' if len(summary) > 200 else summary
articles.append({
'title': entry.get('title', ''),
'link': article_url,
'summary': summary,
'source': source['name'],
'published': extract_published_date(entry)
})
except Exception as e:
print(f"Error fetching from {source['name']}: {e}")
return articles
def save_articles_to_db(articles):
"""Save articles to MongoDB, avoiding duplicates"""
saved_count = 0
for article in articles:
try:
# Prepare article document
article_doc = {
'title': article.get('title', ''),
'link': article.get('link', ''),
'summary': article.get('summary', ''),
'source': article.get('source', ''),
'published_at': article.get('published', ''),
'created_at': datetime.utcnow()
}
# Use update_one with upsert to handle duplicates
# This will insert if link doesn't exist, or update if it does
result = articles_collection.update_one(
{'link': article_doc['link']},
{'$setOnInsert': article_doc}, # Only set on insert, don't update existing
upsert=True
)
if result.upserted_id:
saved_count += 1
except DuplicateKeyError:
# Link already exists, skip
pass
except Exception as e:
print(f"Error saving article {article.get('link', 'unknown')}: {e}")
if saved_count > 0:
print(f"Saved {saved_count} new articles to database")
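The `$setOnInsert` upsert above inserts an article only when its link is new and never overwrites an existing document. The same semantics can be illustrated with a plain dict keyed by link (a standalone analogue, not MongoDB code):

```python
def upsert_on_insert(store, article):
    # Dict-based analogue of update_one({'link': ...}, {'$setOnInsert': doc}, upsert=True):
    # insert only if the link is new; never modify an existing document.
    if article['link'] in store:
        return False          # matched an existing doc, nothing written
    store[article['link']] = article
    return True               # corresponds to result.upserted_id being set
```

This is why `saved_count` only increments on `result.upserted_id`: a matched-but-unmodified document is not a new article.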

View File

@@ -0,0 +1,96 @@
import requests
from config import Config
def list_ollama_models():
"""List available models on Ollama server"""
if not Config.OLLAMA_ENABLED:
return None, "Ollama is not enabled"
try:
url = f"{Config.OLLAMA_BASE_URL}/api/tags"
headers = {}
if Config.OLLAMA_API_KEY:
headers["Authorization"] = f"Bearer {Config.OLLAMA_API_KEY}"
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()
result = response.json()
models = result.get('models', [])
model_names = [model.get('name', '') for model in models]
return model_names, None
except requests.exceptions.RequestException as e:
return None, f"Error listing models: {str(e)}"
except Exception as e:
return None, f"Unexpected error: {str(e)}"
def call_ollama(prompt, system_prompt=None):
"""Call Ollama API to generate text"""
if not Config.OLLAMA_ENABLED:
return None, "Ollama is not enabled"
try:
url = f"{Config.OLLAMA_BASE_URL}/api/generate"
payload = {
"model": Config.OLLAMA_MODEL,
"prompt": prompt,
"stream": False
}
if system_prompt:
payload["system"] = system_prompt
headers = {}
if Config.OLLAMA_API_KEY:
headers["Authorization"] = f"Bearer {Config.OLLAMA_API_KEY}"
print(f"Calling Ollama at {url} with model {Config.OLLAMA_MODEL}")
response = requests.post(url, json=payload, headers=headers, timeout=30)
response.raise_for_status()
result = response.json()
response_text = result.get('response', '').strip()
if not response_text:
return None, "Ollama returned empty response"
return response_text, None
except requests.exceptions.ConnectionError as e:
error_msg = f"Cannot connect to Ollama server at {Config.OLLAMA_BASE_URL}. Is Ollama running?"
print(f"Connection error: {error_msg}")
return None, error_msg
except requests.exceptions.Timeout:
error_msg = "Request to Ollama timed out after 30 seconds"
print(f"Timeout error: {error_msg}")
return None, error_msg
except requests.exceptions.HTTPError as e:
# Check if it's a model not found error
if e.response.status_code == 404:
try:
error_data = e.response.json()
if 'model' in error_data.get('error', '').lower() and 'not found' in error_data.get('error', '').lower():
# Try to get available models
available_models, _ = list_ollama_models()
if available_models:
error_msg = f"Model '{Config.OLLAMA_MODEL}' not found. Available models: {', '.join(available_models)}"
else:
error_msg = f"Model '{Config.OLLAMA_MODEL}' not found. Use 'ollama list' on the server to see available models."
else:
error_msg = f"HTTP error from Ollama: {e.response.status_code} - {e.response.text}"
except (ValueError, KeyError):
error_msg = f"HTTP error from Ollama: {e.response.status_code} - {e.response.text}"
else:
error_msg = f"HTTP error from Ollama: {e.response.status_code} - {e.response.text}"
print(f"HTTP error: {error_msg}")
return None, error_msg
except requests.exceptions.RequestException as e:
error_msg = f"Request error: {str(e)}"
print(f"Request error: {error_msg}")
return None, error_msg
except Exception as e:
error_msg = f"Unexpected error: {str(e)}"
print(f"Unexpected error: {error_msg}")
return None, error_msg
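The 404 branch in `call_ollama` inspects the error body to distinguish a missing model from other HTTP failures. A distilled, self-contained version of that decision (function name and sample bodies are illustrative):

```python
import json

def model_not_found_message(status_code, body, configured_model, available=None):
    # Hypothetical distillation of the 404 handling in call_ollama
    if status_code == 404:
        try:
            err = json.loads(body).get('error', '')
        except ValueError:  # json.JSONDecodeError subclasses ValueError
            err = ''
        if 'model' in err.lower() and 'not found' in err.lower():
            if available:
                return (f"Model '{configured_model}' not found. "
                        f"Available models: {', '.join(available)}")
            return (f"Model '{configured_model}' not found. "
                    "Use 'ollama list' on the server to see available models.")
    return f"HTTP error from Ollama: {status_code} - {body}"
```

Keeping this logic pure (no `requests` call) makes the error-message paths easy to exercise without a running Ollama server.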

View File

@@ -0,0 +1,162 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<title>Munich News Daily</title>
<!--[if mso]>
<style type="text/css">
body, table, td {font-family: Arial, Helvetica, sans-serif !important;}
</style>
<![endif]-->
</head>
<body style="margin: 0; padding: 0; background-color: #f4f4f4; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, 'Helvetica Neue', Arial, sans-serif;">
<!-- Wrapper Table -->
<table role="presentation" width="100%" cellpadding="0" cellspacing="0" border="0" style="background-color: #f4f4f4;">
<tr>
<td align="center" style="padding: 20px 0;">
<!-- Main Container -->
<table role="presentation" width="600" cellpadding="0" cellspacing="0" border="0" style="background-color: #ffffff; max-width: 600px;">
<!-- Header -->
<tr>
<td style="background-color: #1a1a1a; padding: 30px 40px; text-align: center;">
<h1 style="margin: 0 0 8px 0; font-size: 28px; font-weight: 700; color: #ffffff; letter-spacing: -0.5px;">
Munich News Daily
</h1>
<p style="margin: 0; font-size: 14px; color: #999999; letter-spacing: 0.5px;">
{{ date }}
</p>
</td>
</tr>
<!-- Greeting -->
<tr>
<td style="padding: 30px 40px 20px 40px;">
<p style="margin: 0; font-size: 16px; line-height: 1.5; color: #333333;">
Good morning ☀️
</p>
<p style="margin: 15px 0 0 0; font-size: 15px; line-height: 1.6; color: #666666;">
Here's what's happening in Munich today. We've summarized {{ article_count }} stories using AI so you can stay informed in under 5 minutes.
</p>
</td>
</tr>
<!-- Divider -->
<tr>
<td style="padding: 0 40px;">
<div style="height: 1px; background-color: #e0e0e0;"></div>
</td>
</tr>
<!-- Articles -->
{% for article in articles %}
<tr>
<td style="padding: 25px 40px;">
<!-- Article Number Badge -->
<table role="presentation" width="100%" cellpadding="0" cellspacing="0" border="0">
<tr>
<td>
<span style="display: inline-block; background-color: #000000; color: #ffffff; width: 24px; height: 24px; line-height: 24px; text-align: center; border-radius: 50%; font-size: 12px; font-weight: 600;">
{{ loop.index }}
</span>
</td>
</tr>
</table>
<!-- Article Title -->
<h2 style="margin: 12px 0 8px 0; font-size: 19px; font-weight: 700; line-height: 1.3; color: #1a1a1a;">
{{ article.title }}
</h2>
<!-- Article Meta -->
<p style="margin: 0 0 12px 0; font-size: 13px; color: #999999;">
<span style="color: #000000; font-weight: 600;">{{ article.source }}</span>
{% if article.author %}
<span> • {{ article.author }}</span>
{% endif %}
</p>
<!-- Article Summary -->
<p style="margin: 0 0 15px 0; font-size: 15px; line-height: 1.6; color: #333333;">
{{ article.summary }}
</p>
<!-- Read More Link -->
<a href="{{ article.link }}" style="display: inline-block; color: #000000; text-decoration: none; font-size: 14px; font-weight: 600; border-bottom: 2px solid #000000; padding-bottom: 2px;">
Read more →
</a>
</td>
</tr>
<!-- Article Divider -->
{% if not loop.last %}
<tr>
<td style="padding: 0 40px;">
<div style="height: 1px; background-color: #f0f0f0;"></div>
</td>
</tr>
{% endif %}
{% endfor %}
<!-- Bottom Divider -->
<tr>
<td style="padding: 25px 40px 0 40px;">
<div style="height: 1px; background-color: #e0e0e0;"></div>
</td>
</tr>
<!-- Summary Box -->
<tr>
<td style="padding: 30px 40px;">
<table role="presentation" width="100%" cellpadding="0" cellspacing="0" border="0" style="background-color: #f8f8f8; border-radius: 8px;">
<tr>
<td style="padding: 25px; text-align: center;">
<p style="margin: 0 0 8px 0; font-size: 13px; color: #666666; text-transform: uppercase; letter-spacing: 1px; font-weight: 600;">
Today's Digest
</p>
<p style="margin: 0; font-size: 36px; font-weight: 700; color: #000000;">
{{ article_count }}
</p>
<p style="margin: 8px 0 0 0; font-size: 14px; color: #666666;">
stories • AI-summarized • 5 min read
</p>
</td>
</tr>
</table>
</td>
</tr>
<!-- Footer -->
<tr>
<td style="background-color: #1a1a1a; padding: 30px 40px; text-align: center;">
<p style="margin: 0 0 15px 0; font-size: 14px; color: #ffffff; font-weight: 600;">
Munich News Daily
</p>
<p style="margin: 0 0 20px 0; font-size: 13px; color: #999999; line-height: 1.5;">
AI-powered news summaries for busy people.<br>
Delivered daily to your inbox.
</p>
<!-- Footer Links -->
<p style="margin: 0; font-size: 12px; color: #666666;">
<a href="{{ website_link }}" style="color: #999999; text-decoration: none;">Visit Website</a>
                                <span style="color: #444444;"> &bull; </span>
<a href="{{ unsubscribe_link }}" style="color: #999999; text-decoration: none;">Unsubscribe</a>
</p>
<p style="margin: 20px 0 0 0; font-size: 11px; color: #666666;">
© {{ year }} Munich News Daily. All rights reserved.
</p>
</td>
</tr>
</table>
<!-- End Main Container -->
</td>
</tr>
</table>
<!-- End Wrapper Table -->
</body>
</html>

View File

@@ -0,0 +1,128 @@
#!/usr/bin/env python
"""
Test RSS feed URL extraction
Run from backend directory with venv activated:
cd backend
source venv/bin/activate # or venv\Scripts\activate on Windows
python test_rss_extraction.py
"""
from pymongo import MongoClient
from config import Config
import feedparser
from utils.rss_utils import extract_article_url, extract_article_summary, extract_published_date
print("\n" + "="*80)
print("RSS Feed URL Extraction Test")
print("="*80)
# Connect to database
print(f"\nConnecting to MongoDB: {Config.MONGODB_URI}")
client = MongoClient(Config.MONGODB_URI)
db = client[Config.DB_NAME]
# Get RSS feeds
print("Fetching RSS feeds from database...")
feeds = list(db['rss_feeds'].find())
if not feeds:
print("\n❌ No RSS feeds in database!")
print("\nAdd a feed first:")
print(" curl -X POST http://localhost:5001/api/rss-feeds \\")
print(" -H 'Content-Type: application/json' \\")
print(" -d '{\"name\": \"Süddeutsche Politik\", \"url\": \"https://rss.sueddeutsche.de/rss/Politik\"}'")
exit(1)
print(f"✓ Found {len(feeds)} feed(s)\n")
# Test each feed
total_success = 0
total_fail = 0
for feed_doc in feeds:
name = feed_doc.get('name', 'Unknown')
url = feed_doc.get('url', '')
active = feed_doc.get('active', True)
print("\n" + "="*80)
print(f"Feed: {name}")
print(f"URL: {url}")
print(f"Active: {'Yes' if active else 'No'}")
print("="*80)
if not active:
print("⏭ Skipping (inactive)")
continue
try:
# Parse RSS
print("\nFetching RSS feed...")
feed = feedparser.parse(url)
if not feed.entries:
print("❌ No entries found in feed")
continue
print(f"✓ Found {len(feed.entries)} entries")
# Test first 3 entries
print(f"\nTesting first 3 entries:")
print("-" * 80)
for i, entry in enumerate(feed.entries[:3], 1):
print(f"\n📰 Entry {i}:")
# Title
title = entry.get('title', 'No title')
print(f" Title: {title[:65]}")
# Test URL extraction
article_url = extract_article_url(entry)
if article_url:
print(f" ✓ URL: {article_url}")
total_success += 1
else:
print(f" ❌ Could not extract URL")
print(f" Available fields: {list(entry.keys())[:10]}")
print(f" link: {entry.get('link', 'N/A')}")
print(f" guid: {entry.get('guid', 'N/A')}")
print(f" id: {entry.get('id', 'N/A')}")
total_fail += 1
# Test summary
summary = extract_article_summary(entry)
if summary:
print(f" ✓ Summary: {summary[:70]}...")
else:
print(f" ⚠ No summary")
# Test date
pub_date = extract_published_date(entry)
if pub_date:
print(f" ✓ Date: {pub_date}")
else:
print(f" ⚠ No date")
except Exception as e:
print(f"❌ Error: {e}")
import traceback
traceback.print_exc()
# Summary
print("\n" + "="*80)
print("SUMMARY")
print("="*80)
print(f"Total URLs tested: {total_success + total_fail}")
print(f"✓ Successfully extracted: {total_success}")
print(f"❌ Failed to extract: {total_fail}")
if total_fail == 0:
print("\n🎉 All URLs extracted successfully!")
print("\nYou can now run the crawler:")
print(" cd ../news_crawler")
print(" pip install -r requirements.txt")
print(" python crawler_service.py 5")
else:
print(f"\n{total_fail} URL(s) could not be extracted")
print("Check the output above for details")
print("="*80 + "\n")

View File

@@ -0,0 +1 @@
# Utils package

View File

@@ -0,0 +1,98 @@
"""
Utility functions for RSS feed processing
"""
def extract_article_url(entry):
"""
Extract article URL from RSS entry.
Different RSS feeds use different fields for the article URL.
Args:
entry: feedparser entry object
Returns:
str: Article URL or None if not found
Examples:
- Most feeds use 'link'
- Some use 'guid' as the URL
- Some use 'id' as the URL
- Some have guid as a dict with 'href'
"""
# Try 'link' first (most common)
if entry.get('link') and entry.get('link', '').startswith('http'):
return entry.get('link')
# Try 'guid' if it's a valid URL
if entry.get('guid'):
guid = entry.get('guid')
# guid can be a string
if isinstance(guid, str) and guid.startswith('http'):
return guid
# or a dict with 'href'
elif isinstance(guid, dict) and guid.get('href', '').startswith('http'):
return guid.get('href')
# Try 'id' if it's a valid URL
if entry.get('id') and entry.get('id', '').startswith('http'):
return entry.get('id')
# Try 'links' array (some feeds have multiple links)
if entry.get('links'):
for link in entry.get('links', []):
if isinstance(link, dict) and link.get('href', '').startswith('http'):
# Prefer 'alternate' type, but accept any http link
if link.get('type') == 'text/html' or link.get('rel') == 'alternate':
return link.get('href')
# If no alternate found, return first http link
for link in entry.get('links', []):
if isinstance(link, dict) and link.get('href', '').startswith('http'):
return link.get('href')
return None
def extract_article_summary(entry):
"""
Extract article summary/description from RSS entry.
Args:
entry: feedparser entry object
Returns:
str: Article summary or empty string
"""
# Try different fields
if entry.get('summary'):
return entry.get('summary', '')
elif entry.get('description'):
return entry.get('description', '')
elif entry.get('content'):
# content is usually a list of dicts
content = entry.get('content', [])
if content and isinstance(content, list) and len(content) > 0:
return content[0].get('value', '')
return ''
def extract_published_date(entry):
"""
Extract published date from RSS entry.
Args:
entry: feedparser entry object
Returns:
str: Published date or empty string
"""
# Try different fields
if entry.get('published'):
return entry.get('published', '')
elif entry.get('updated'):
return entry.get('updated', '')
elif entry.get('created'):
return entry.get('created', '')
return ''
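Because `extract_article_url` only relies on `.get()`, its precedence rules (`link` → `guid` → `id` → `links[]`) can be exercised with plain dicts. A condensed restatement of the same logic, for illustration outside the package:

```python
def resolve_url(entry):
    # Condensed restatement of extract_article_url's precedence:
    # link -> guid (str or {'href': ...}) -> id -> links[] first http href
    link = entry.get('link', '')
    if isinstance(link, str) and link.startswith('http'):
        return link
    guid = entry.get('guid')
    if isinstance(guid, str) and guid.startswith('http'):
        return guid
    if isinstance(guid, dict) and guid.get('href', '').startswith('http'):
        return guid['href']
    ident = entry.get('id', '')
    if isinstance(ident, str) and ident.startswith('http'):
        return ident
    for l in entry.get('links', []) or []:
        if isinstance(l, dict) and l.get('href', '').startswith('http'):
            return l['href']
    return None
```

Feeds that publish `guid` as `{'href': ...}` or only populate `id` are the cases that motivated the fallback chain.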