MongoDB Database Schema
This document describes the MongoDB collections and their structure for Munich News Daily.
Collections
1. Articles Collection (articles)
Stores all news articles aggregated from Munich news sources.
Document Structure:
{
_id: ObjectId, // Auto-generated MongoDB ID
title: String, // Article title (required)
author: String, // Article author (optional, extracted during crawl)
link: String, // Article URL (required, unique)
content: String, // Full article content (no length limit)
summary: String, // AI-generated English summary (≤150 words)
word_count: Number, // Word count of full content
summary_word_count: Number, // Word count of AI summary
source: String, // News source name (e.g., "Süddeutsche Zeitung München")
published_at: String, // Original publication date, taken from the RSS feed or the crawled page
crawled_at: DateTime, // When article content was crawled (UTC)
summarized_at: DateTime, // When AI summary was generated (UTC)
created_at: DateTime // When article was added to database (UTC)
}
Indexes:
- link: Unique index to prevent duplicate articles
- created_at: Index for efficient sorting by date
Example Document:
{
_id: ObjectId("507f1f77bcf86cd799439011"),
title: "New U-Bahn Line Opens in Munich",
author: "Max Mustermann",
link: "https://www.sueddeutsche.de/muenchen/ubahn-1.123456",
content: "The new U-Bahn line connecting the city center with the airport opened today. Mayor Dieter Reiter attended the opening ceremony... [full article text continues]",
summary: "Munich's new U-Bahn line connecting the city center to the airport opened today with Mayor Dieter Reiter in attendance. The line features 10 stations and runs every 10 minutes during peak hours, significantly reducing travel time. Construction took five years and cost approximately 2 billion euros.",
word_count: 1250,
summary_word_count: 46,
source: "Süddeutsche Zeitung München",
published_at: "Mon, 15 Jan 2024 10:00:00 +0100",
crawled_at: ISODate("2024-01-15T09:30:00.000Z"),
summarized_at: ISODate("2024-01-15T09:30:15.000Z"),
created_at: ISODate("2024-01-15T09:00:00.000Z")
}
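As a minimal sketch, a document of this shape could be assembled as shown below. The helper names `countWords` and `buildArticleDocument` are hypothetical, and counting whitespace-separated tokens is an assumption about how word_count and summary_word_count are derived.

```javascript
// Hypothetical helper: counts whitespace-separated tokens.
// How the real crawler counts words is an assumption here.
function countWords(text) {
  const trimmed = (text || "").trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Hypothetical builder for a document matching the articles schema.
// Throws when a required field (title, link) is missing.
function buildArticleDocument(input, now = new Date()) {
  const { title, link, content, author, source, publishedAt, summary } = input;
  if (!title) throw new Error("title is required");
  if (!link) throw new Error("link is required");

  const doc = {
    title,
    link,
    content: content || "",
    word_count: countWords(content),
    source,
    published_at: publishedAt, // kept as the original string
    crawled_at: now,
    created_at: now,
  };
  if (author) doc.author = author; // optional field
  if (summary) {
    doc.summary = summary;
    doc.summary_word_count = countWords(summary);
    doc.summarized_at = now;
  }
  return doc;
}
```

The summary fields are only set when a summary is present, which keeps unsummarized articles easy to find with a query on the summary field.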
2. Subscribers Collection (subscribers)
Stores all newsletter subscribers.
Document Structure:
{
_id: ObjectId, // Auto-generated MongoDB ID
email: String, // Subscriber email (required, unique, lowercase)
subscribed_at: DateTime, // When user subscribed (UTC)
status: String // Subscription status: 'active' or 'inactive'
}
Indexes:
- email: Unique index for email lookups and preventing duplicates
- subscribed_at: Index for analytics and sorting
Example Document:
{
_id: ObjectId("507f1f77bcf86cd799439012"),
email: "user@example.com",
subscribed_at: ISODate("2024-01-15T08:30:00.000Z"),
status: "active"
}
3. Newsletter Sends Collection (newsletter_sends)
Stores one record per newsletter sent to a subscriber, used for email open tracking.
Document Structure:
{
_id: ObjectId, // Auto-generated MongoDB ID
newsletter_id: String, // Unique ID for this newsletter batch (date-based)
subscriber_email: String, // Recipient email
tracking_id: String, // Unique tracking ID for this send (UUID)
sent_at: DateTime, // When email was sent (UTC)
opened: Boolean, // Whether email was opened
first_opened_at: DateTime, // First open timestamp (null if not opened)
last_opened_at: DateTime, // Most recent open timestamp
open_count: Number, // Number of times opened
created_at: DateTime // Record creation time (UTC)
}
Indexes:
- tracking_id: Unique index for fast pixel request lookups
- newsletter_id: Index for analytics queries
- subscriber_email: Index for user activity queries
- sent_at: Index for time-based queries
Example Document:
{
_id: ObjectId("507f1f77bcf86cd799439013"),
newsletter_id: "2024-01-15",
subscriber_email: "user@example.com",
tracking_id: "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
sent_at: ISODate("2024-01-15T08:00:00.000Z"),
opened: true,
first_opened_at: ISODate("2024-01-15T09:30:00.000Z"),
last_opened_at: ISODate("2024-01-15T14:20:00.000Z"),
open_count: 3,
created_at: ISODate("2024-01-15T08:00:00.000Z")
}
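When a tracking pixel is requested, the open-related fields need to change together. One possible way to do that in a single write is a pipeline-style update (available in MongoDB 4.2 and later), sketched below. The helper name `buildOpenUpdate` is hypothetical, and this is not necessarily how the project records opens.

```javascript
// Hypothetical helper: builds the filter and update for one pixel hit.
// Uses an aggregation-pipeline update so expressions can read current values.
function buildOpenUpdate(trackingId, now = new Date()) {
  return {
    filter: { tracking_id: trackingId },
    update: [
      {
        $set: {
          opened: { $literal: true },
          // Keep the first open timestamp once it has been set.
          first_opened_at: { $ifNull: ["$first_opened_at", now] },
          last_opened_at: now,
          // Treat a missing or null counter as 0 before incrementing.
          open_count: { $add: [{ $ifNull: ["$open_count", 0] }, 1] },
        },
      },
    ],
  };
}

// Usage in mongosh (assuming the collection name from this document):
// const { filter, update } = buildOpenUpdate("a1b2c3d4-e5f6-7890-abcd-ef1234567890");
// db.newsletter_sends.updateOne(filter, update);
```

Because `$ifNull` falls back only when first_opened_at is null or missing, repeated opens advance last_opened_at and open_count without overwriting the first open.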
4. Link Clicks Collection (link_clicks)
Tracks individual link clicks from newsletters.
Document Structure:
{
_id: ObjectId, // Auto-generated MongoDB ID
tracking_id: String, // Unique tracking ID for this link (UUID)
newsletter_id: String, // Which newsletter this link was in
subscriber_email: String, // Who clicked
article_url: String, // Original article URL
article_title: String, // Article title for reporting
clicked_at: DateTime, // When link was clicked (UTC)
user_agent: String, // Browser/client info
created_at: DateTime // Record creation time (UTC)
}
Indexes:
- tracking_id: Unique index for fast redirect request lookups
- newsletter_id: Index for analytics queries
- article_url: Index for article performance queries
- subscriber_email: Index for user activity queries
Example Document:
{
_id: ObjectId("507f1f77bcf86cd799439014"),
tracking_id: "b2c3d4e5-f6a7-8901-bcde-f12345678901",
newsletter_id: "2024-01-15",
subscriber_email: "user@example.com",
article_url: "https://www.sueddeutsche.de/muenchen/ubahn-1.123456",
article_title: "New U-Bahn Line Opens in Munich",
clicked_at: ISODate("2024-01-15T09:35:00.000Z"),
user_agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
created_at: ISODate("2024-01-15T09:35:00.000Z")
}
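The newsletter_id and article_url indexes suggest per-newsletter article performance reports. A rough sketch of such an aggregation follows. It assumes that every document with a non-null clicked_at represents one click, and the helper name `buildTopArticlesPipeline` is hypothetical.

```javascript
// Hypothetical helper: builds a pipeline ranking articles by clicks
// within a single newsletter.
function buildTopArticlesPipeline(newsletterId, limit = 10) {
  return [
    // Only rows for this newsletter that have a recorded click.
    { $match: { newsletter_id: newsletterId, clicked_at: { $ne: null } } },
    {
      $group: {
        _id: "$article_url",
        article_title: { $first: "$article_title" },
        clicks: { $sum: 1 },
        clickers: { $addToSet: "$subscriber_email" },
      },
    },
    {
      $project: {
        article_title: 1,
        clicks: 1,
        unique_clickers: { $size: "$clickers" },
      },
    },
    { $sort: { clicks: -1 } },
    { $limit: limit },
  ];
}

// Usage in mongosh:
// db.link_clicks.aggregate(buildTopArticlesPipeline("2024-01-15", 5));
```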
5. Subscriber Activity Collection (subscriber_activity)
Aggregated activity status for each subscriber.
Document Structure:
{
_id: ObjectId, // Auto-generated MongoDB ID
email: String, // Subscriber email (unique)
status: String, // 'active', 'inactive', or 'dormant'
last_opened_at: DateTime, // Most recent email open (UTC)
last_clicked_at: DateTime, // Most recent link click (UTC)
total_opens: Number, // Lifetime open count
total_clicks: Number, // Lifetime click count
newsletters_received: Number, // Total newsletters sent
newsletters_opened: Number, // Total newsletters opened
updated_at: DateTime // Last status update (UTC)
}
Indexes:
- email: Unique index for fast lookups
- status: Index for filtering by activity level
- last_opened_at: Index for time-based queries
Activity Status Classification:
- active: Opened an email in the last 30 days
- inactive: Most recent open was 30-60 days ago
- dormant: Most recent open was more than 60 days ago
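The classification above can be sketched as a small function. Two points are assumptions and not taken from this document: the exact boundary handling (30 days counts as active, 60 days as inactive) and treating a subscriber who has never opened an email as dormant. The name `classifyActivity` is hypothetical.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical helper: maps the last open timestamp to an activity status.
function classifyActivity(lastOpenedAt, now = new Date()) {
  // Assumption: no recorded open is treated as dormant.
  if (!lastOpenedAt) return "dormant";
  const daysSinceOpen = (now.getTime() - lastOpenedAt.getTime()) / DAY_MS;
  if (daysSinceOpen <= 30) return "active";
  if (daysSinceOpen <= 60) return "inactive";
  return "dormant";
}

// Usage:
// classifyActivity(new Date("2024-01-15T09:30:00Z"), new Date("2024-01-20T00:00:00Z"));
```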
Example Document:
{
_id: ObjectId("507f1f77bcf86cd799439015"),
email: "user@example.com",
status: "active",
last_opened_at: ISODate("2024-01-15T09:30:00.000Z"),
last_clicked_at: ISODate("2024-01-15T09:35:00.000Z"),
total_opens: 45,
total_clicks: 23,
newsletters_received: 60,
newsletters_opened: 45,
updated_at: ISODate("2024-01-15T10:00:00.000Z")
}
Design Decisions
Why MongoDB?
- Flexibility: Easy to add new fields without schema migrations
- Scalability: Handles large volumes of articles and subscribers efficiently
- Performance: Indexes on frequently queried fields (link, email, created_at)
- Document Model: Natural fit for news articles and subscriber data
Schema Choices
- Unique Link Index: Prevents duplicate articles from being stored, even if fetched multiple times
- Status Field: Soft delete for subscribers (set to 'inactive' instead of deleting) - allows for analytics and easy re-subscription
- UTC Timestamps: All DateTime fields are stored in UTC for consistency across time zones; published_at is kept as the original string from the source, including its offset
- Lowercase Emails: Emails stored in lowercase to prevent case-sensitivity issues
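The soft-delete and lowercase-email choices can be illustrated with the sketch below. The helper names are hypothetical, and preserving the original subscribed_at on re-subscription is an assumption, not something this document specifies.

```javascript
// Hypothetical helper: normalizes an email before any lookup or write.
function normalizeEmail(email) {
  return String(email).trim().toLowerCase();
}

// Subscribe or re-subscribe: an upsert that reactivates an existing record
// and only sets subscribed_at when the document is first inserted.
function buildSubscribeUpsert(email, now = new Date()) {
  return {
    filter: { email: normalizeEmail(email) },
    update: {
      $set: { status: "active" },
      $setOnInsert: { subscribed_at: now },
    },
    options: { upsert: true },
  };
}

// Unsubscribe: soft delete by flipping the status, never removing the record.
function buildUnsubscribeUpdate(email) {
  return {
    filter: { email: normalizeEmail(email) },
    update: { $set: { status: "inactive" } },
  };
}

// Usage in mongosh:
// const s = buildSubscribeUpsert("User@Example.com");
// db.subscribers.updateOne(s.filter, s.update, s.options);
// const u = buildUnsubscribeUpdate("user@example.com");
// db.subscribers.updateOne(u.filter, u.update);
```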
Future Enhancements
Potential fields to add in the future:
Articles:
- category: String (e.g., "politics", "sports", "culture")
- tags: Array of Strings
- image_url: String
- sent_in_newsletter: Boolean (track if article was sent)
- sent_at: DateTime (when article was included in newsletter)
Subscribers:
- preferences: Object (newsletter frequency, categories, etc.)
- last_sent_at: DateTime (last newsletter sent date)
- unsubscribed_at: DateTime (when user unsubscribed)
- verification_token: String (for email verification)
AI Summarization Workflow
When the crawler processes an article:
- Extract Content: Full article text is extracted from the webpage
- Summarize with Ollama: If OLLAMA_ENABLED=true, the content is sent to Ollama for summarization
- Store Both: Both the original content and the AI-generated summary are stored
- Fallback: If Ollama is unavailable or fails, only the original content is stored
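The steps above can be sketched as follows. This is an illustration only: `processArticle` is a hypothetical name, and `summarize` stands in for whatever function actually calls Ollama. It is passed in as a parameter so that no specific Ollama API is assumed.

```javascript
// Hypothetical workflow sketch. `summarize` is an injected async function
// (content string -> summary string) standing in for the Ollama call.
async function processArticle(article, { ollamaEnabled, summarize }, now = new Date()) {
  const doc = { ...article, crawled_at: now };

  // Summarization is skipped entirely when the feature flag is off.
  if (!ollamaEnabled) return doc;

  try {
    const summary = await summarize(article.content);
    if (summary) {
      doc.summary = summary;
      doc.summary_word_count = summary.trim().split(/\s+/).length;
      doc.summarized_at = new Date();
    }
  } catch (err) {
    // Fallback: the summarizer is unavailable or failed,
    // so only the original content is stored.
  }
  return doc;
}

// Usage (with a stand-in summarizer):
// const doc = await processArticle(
//   { title: "T", link: "https://example.com/a", content: "..." },
//   { ollamaEnabled: process.env.OLLAMA_ENABLED === "true", summarize: mySummarizer }
// );
```

Catching the error inside the function means a summarizer outage never blocks the article itself from being stored.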
Summary Field Details
- Language: Always in English, regardless of source article language
- Length: Maximum 150 words
- Format: Plain text, concise and clear
- Purpose: Quick preview for newsletters and frontend display
Querying Articles
// Get articles with AI summaries
db.articles.find({ summary: { $exists: true, $ne: null } })
// Get articles without summaries (matches a missing or null summary field)
db.articles.find({ summary: null })
// Count summarized articles
db.articles.countDocuments({ summary: { $exists: true, $ne: null } })