Munich-news/docs/DATABASE_SCHEMA.md
2025-11-11 14:09:21 +01:00

MongoDB Database Schema

This document describes the MongoDB collections and their structure for Munich News Daily.

Collections

1. Articles Collection (articles)

Stores all news articles aggregated from Munich news sources.

Document Structure:

{
  _id: ObjectId,                    // Auto-generated MongoDB ID
  title: String,                    // Article title (required)
  author: String,                   // Article author (optional, extracted during crawl)
  link: String,                     // Article URL (required, unique)
  content: String,                  // Full article content (no length limit)
  summary: String,                  // AI-generated English summary (≤150 words)
  word_count: Number,               // Word count of full content
  summary_word_count: Number,       // Word count of AI summary
  source: String,                   // News source name (e.g., "Süddeutsche Zeitung München")
  published_at: String,             // Original publication date as a string, taken from the RSS feed or the crawled page
  crawled_at: DateTime,             // When article content was crawled (UTC)
  summarized_at: DateTime,          // When AI summary was generated (UTC)
  created_at: DateTime              // When article was added to database (UTC)
}

Indexes:

  • link - Unique index to prevent duplicate articles
  • created_at - Index for efficient sorting by date
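
As a minimal sketch, the two indexes above could be created as follows. The helper name `ensureIndexes` and the descending direction on `created_at` are assumptions for illustration, not taken from the project; `createIndex(keys, options)` is the standard collection method in mongosh and the Node.js driver.

```javascript
// Index specs for the articles collection, as listed above.
// Assumption: created_at is indexed descending (newest first).
const articleIndexes = [
  { keys: { link: 1 }, options: { unique: true } },
  { keys: { created_at: -1 }, options: {} },
];

// Hypothetical helper: applies each spec to a collection-like object
// exposing createIndex(keys, options), e.g. db.articles in mongosh.
function ensureIndexes(collection, specs) {
  return specs.map(({ keys, options }) => collection.createIndex(keys, options));
}

// Usage in mongosh (not executed here):
//   ensureIndexes(db.articles, articleIndexes);
```

The same pattern would apply to the other collections by swapping in their index lists.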

Example Document:

{
  _id: ObjectId("507f1f77bcf86cd799439011"),
  title: "New U-Bahn Line Opens in Munich",
  author: "Max Mustermann",
  link: "https://www.sueddeutsche.de/muenchen/ubahn-1.123456",
  content: "The new U-Bahn line connecting the city center with the airport opened today. Mayor Dieter Reiter attended the opening ceremony... [full article text continues]",
  summary: "Munich's new U-Bahn line connecting the city center to the airport opened today with Mayor Dieter Reiter in attendance. The line features 10 stations and runs every 10 minutes during peak hours, significantly reducing travel time. Construction took five years and cost approximately 2 billion euros.",
  word_count: 1250,
  summary_word_count: 46,
  source: "Süddeutsche Zeitung München",
  published_at: "Mon, 15 Jan 2024 10:00:00 +0100",
  crawled_at: ISODate("2024-01-15T09:30:00.000Z"),
  summarized_at: ISODate("2024-01-15T09:30:15.000Z"),
  created_at: ISODate("2024-01-15T09:00:00.000Z")
}

2. Subscribers Collection (subscribers)

Stores all newsletter subscribers.

Document Structure:

{
  _id: ObjectId,                    // Auto-generated MongoDB ID
  email: String,                    // Subscriber email (required, unique, lowercase)
  subscribed_at: DateTime,          // When user subscribed (UTC)
  status: String                    // Subscription status: 'active' or 'inactive'
}

Indexes:

  • email - Unique index for email lookups and preventing duplicates
  • subscribed_at - Index for analytics and sorting

Example Document:

{
  _id: ObjectId("507f1f77bcf86cd799439012"),
  email: "user@example.com",
  subscribed_at: ISODate("2024-01-15T08:30:00.000Z"),
  status: "active"
}
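
A subscribe and unsubscribe flow consistent with this structure might look like the sketch below. The function names are hypothetical, trimming whitespace is an added assumption, and keeping the original `subscribed_at` on re-subscription (via `$setOnInsert`) is one possible policy rather than the project's confirmed behavior.

```javascript
// Normalize an email the way the schema expects: lowercase.
// Trimming surrounding whitespace is an extra assumption.
function normalizeEmail(email) {
  return String(email).trim().toLowerCase();
}

// Build the arguments for an upsert that (re)activates a subscriber.
// $setOnInsert only applies when a new document is created.
function buildSubscribe(email, now = new Date()) {
  return {
    filter: { email: normalizeEmail(email) },
    update: { $set: { status: "active" }, $setOnInsert: { subscribed_at: now } },
    options: { upsert: true },
  };
}

// Build the arguments for a soft delete: the document stays, status flips.
function buildUnsubscribe(email) {
  return {
    filter: { email: normalizeEmail(email) },
    update: { $set: { status: "inactive" } },
  };
}

// Usage in mongosh (not executed here):
//   const s = buildSubscribe(" User@Example.com ");
//   db.subscribers.updateOne(s.filter, s.update, s.options);
```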

3. Newsletter Sends Collection (newsletter_sends)

Tracks each newsletter sent to each subscriber for email open tracking.

Document Structure:

{
  _id: ObjectId,                    // Auto-generated MongoDB ID
  newsletter_id: String,            // Unique ID for this newsletter batch (date-based)
  subscriber_email: String,         // Recipient email
  tracking_id: String,              // Unique tracking ID for this send (UUID)
  sent_at: DateTime,                // When email was sent (UTC)
  opened: Boolean,                  // Whether email was opened
  first_opened_at: DateTime,        // First open timestamp (null if not opened)
  last_opened_at: DateTime,         // Most recent open timestamp
  open_count: Number,               // Number of times opened
  created_at: DateTime              // Record creation time (UTC)
}

Indexes:

  • tracking_id - Unique index for fast pixel request lookups
  • newsletter_id - Index for analytics queries
  • subscriber_email - Index for user activity queries
  • sent_at - Index for time-based queries

Example Document:

{
  _id: ObjectId("507f1f77bcf86cd799439013"),
  newsletter_id: "2024-01-15",
  subscriber_email: "user@example.com",
  tracking_id: "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  sent_at: ISODate("2024-01-15T08:00:00.000Z"),
  opened: true,
  first_opened_at: ISODate("2024-01-15T09:30:00.000Z"),
  last_opened_at: ISODate("2024-01-15T14:20:00.000Z"),
  open_count: 3,
  created_at: ISODate("2024-01-15T08:00:00.000Z")
}
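
When the tracking pixel is requested, the open-related fields could be updated in a single operation. The sketch below assumes MongoDB 4.2 or later (aggregation-pipeline updates); the function name `buildOpenUpdate` is hypothetical, and the project may implement this differently.

```javascript
// Build an aggregation-pipeline update for one pixel request.
// $ifNull treats both null and missing fields as "not set yet", so
// first_opened_at is written only on the first open.
function buildOpenUpdate(now = new Date()) {
  return [
    {
      $set: {
        opened: { $literal: true },
        first_opened_at: { $ifNull: ["$first_opened_at", now] },
        last_opened_at: now,
        open_count: { $add: [{ $ifNull: ["$open_count", 0] }, 1] },
      },
    },
  ];
}

// Usage in mongosh (not executed here); trackingId comes from the pixel URL:
//   db.newsletter_sends.updateOne({ tracking_id: trackingId }, buildOpenUpdate());
```

Doing this in one update avoids a read-then-write race when the same email is opened from two clients at nearly the same time.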

4. Link Clicks Collection

Tracks individual link clicks from newsletters.

Document Structure:

{
  _id: ObjectId,                    // Auto-generated MongoDB ID
  tracking_id: String,              // Unique tracking ID for this link (UUID)
  newsletter_id: String,            // Which newsletter this link was in
  subscriber_email: String,         // Who clicked
  article_url: String,              // Original article URL
  article_title: String,            // Article title for reporting
  clicked_at: DateTime,             // When link was clicked (UTC)
  user_agent: String,               // Browser/client info
  created_at: DateTime              // Record creation time (UTC)
}

Indexes:

  • tracking_id - Unique index for fast redirect request lookups
  • newsletter_id - Index for analytics queries
  • article_url - Index for article performance queries
  • subscriber_email - Index for user activity queries

Example Document:

{
  _id: ObjectId("507f1f77bcf86cd799439014"),
  tracking_id: "b2c3d4e5-f6a7-8901-bcde-f12345678901",
  newsletter_id: "2024-01-15",
  subscriber_email: "user@example.com",
  article_url: "https://www.sueddeutsche.de/muenchen/ubahn-1.123456",
  article_title: "New U-Bahn Line Opens in Munich",
  clicked_at: ISODate("2024-01-15T09:35:00.000Z"),
  user_agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
  created_at: ISODate("2024-01-15T09:35:00.000Z")
}
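
The `newsletter_id` and `article_url` fields lend themselves to a per-newsletter ranking of articles. The following is a sketch only: the function name is hypothetical, and the collection name `link_clicks` in the usage comment is an assumption, since the actual name is not stated here.

```javascript
// Build an aggregation pipeline that ranks articles by clicks within
// one newsletter. Only documents with a click timestamp are counted.
function buildTopArticlesPipeline(newsletterId, limit = 5) {
  return [
    { $match: { newsletter_id: newsletterId, clicked_at: { $ne: null } } },
    {
      $group: {
        _id: "$article_url",
        title: { $first: "$article_title" },
        clicks: { $sum: 1 },
      },
    },
    { $sort: { clicks: -1 } },
    { $limit: limit },
  ];
}

// Usage in mongosh (not executed here; `link_clicks` is a hypothetical name):
//   db.link_clicks.aggregate(buildTopArticlesPipeline("2024-01-15"));
```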

5. Subscriber Activity Collection (subscriber_activity)

Aggregated activity status for each subscriber.

Document Structure:

{
  _id: ObjectId,                    // Auto-generated MongoDB ID
  email: String,                    // Subscriber email (unique)
  status: String,                   // 'active', 'inactive', or 'dormant'
  last_opened_at: DateTime,         // Most recent email open (UTC)
  last_clicked_at: DateTime,        // Most recent link click (UTC)
  total_opens: Number,              // Lifetime open count
  total_clicks: Number,             // Lifetime click count
  newsletters_received: Number,     // Total newsletters sent to this subscriber
  newsletters_opened: Number,       // Total newsletters opened
  updated_at: DateTime              // Last status update (UTC)
}

Indexes:

  • email - Unique index for fast lookups
  • status - Index for filtering by activity level
  • last_opened_at - Index for time-based queries

Activity Status Classification:

  • active: Most recent open was within the last 30 days
  • inactive: Most recent open was 30-60 days ago
  • dormant: No opens for more than 60 days
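
The classification above can be sketched as a small function. The boundary handling (exactly 30 days counts as active, exactly 60 as inactive) and the treatment of subscribers who have never opened an email (dormant) are assumptions, since the rules above do not specify them.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Classify a subscriber from the timestamp of the most recent open.
// Assumptions: <= 30 days is active, <= 60 days is inactive,
// and a missing last_opened_at is treated as dormant.
function classifyActivity(lastOpenedAt, now = new Date()) {
  if (!lastOpenedAt) return "dormant";
  const days = (now.getTime() - lastOpenedAt.getTime()) / DAY_MS;
  if (days <= 30) return "active";
  if (days <= 60) return "inactive";
  return "dormant";
}

// Usage: classifyActivity(doc.last_opened_at) when refreshing the status field.
```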

Example Document:

{
  _id: ObjectId("507f1f77bcf86cd799439015"),
  email: "user@example.com",
  status: "active",
  last_opened_at: ISODate("2024-01-15T09:30:00.000Z"),
  last_clicked_at: ISODate("2024-01-15T09:35:00.000Z"),
  total_opens: 45,
  total_clicks: 23,
  newsletters_received: 60,
  newsletters_opened: 45,
  updated_at: ISODate("2024-01-15T10:00:00.000Z")
}

Design Decisions

Why MongoDB?

  1. Flexibility: Easy to add new fields without schema migrations
  2. Scalability: Handles large volumes of articles and subscribers efficiently
  3. Performance: Indexes on frequently queried fields (link, email, created_at)
  4. Document Model: Natural fit for news articles and subscriber data

Schema Choices

  1. Unique Link Index: Prevents duplicate articles from being stored, even if fetched multiple times
  2. Status Field: Soft delete for subscribers (set to 'inactive' instead of deleting) - allows for analytics and easy re-subscription
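  3. UTC Timestamps: All DateTime fields are stored in UTC for consistency across timezones (published_at is the exception: it keeps the source's original date string, including its offset)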
  4. Lowercase Emails: Emails stored in lowercase to prevent case-sensitivity issues
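
To illustrate how the unique link index supports deduplication, here is one possible insert path, sketched under assumptions: the function name `buildArticleUpsert` is hypothetical, and using an upsert with `$setOnInsert` (instead of catching a duplicate-key error) is a design choice, not necessarily what the crawler does.

```javascript
// Build an upsert keyed on the unique link field. $setOnInsert writes
// the fields only when the article is new, so fetching the same link
// again leaves the stored document untouched.
function buildArticleUpsert(article, now = new Date()) {
  const { link, ...fields } = article;
  return {
    filter: { link },
    // A JavaScript Date is an absolute instant; MongoDB stores it as UTC.
    update: { $setOnInsert: { ...fields, created_at: now } },
    options: { upsert: true },
  };
}

// Usage in mongosh (not executed here):
//   const a = buildArticleUpsert({
//     title: "New U-Bahn Line Opens in Munich",
//     link: "https://www.sueddeutsche.de/muenchen/ubahn-1.123456",
//     source: "Süddeutsche Zeitung München",
//   });
//   db.articles.updateOne(a.filter, a.update, a.options);
```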

Future Enhancements

Potential fields to add in the future:

Articles:

  • category: String (e.g., "politics", "sports", "culture")
  • tags: Array of Strings
  • image_url: String
  • sent_in_newsletter: Boolean (track if article was sent)
  • sent_at: DateTime (when article was included in newsletter)

Subscribers:

  • preferences: Object (newsletter frequency, categories, etc.)
  • last_sent_at: DateTime (last newsletter sent date)
  • unsubscribed_at: DateTime (when user unsubscribed)
  • verification_token: String (for email verification)

AI Summarization Workflow

When the crawler processes an article:

  1. Extract Content: Full article text is extracted from the webpage
  2. Summarize with Ollama: If OLLAMA_ENABLED=true, the content is sent to Ollama for summarization
  3. Store Both: Both the original content and AI-generated summary are stored
  4. Fallback: If Ollama is unavailable or fails, only the original content is stored
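
The steps above can be sketched roughly as follows. This is not the crawler's actual code: `processArticle` and `countWords` are hypothetical names, `summarize` stands in for whatever function sends the text to Ollama, and whitespace-based word counting is an assumption.

```javascript
// Whitespace-based word count (an assumption about how counts are computed).
const countWords = (text) => text.trim().split(/\s+/).filter(Boolean).length;

// Sketch of the crawl step. `summarize` is a hypothetical async callback
// (for example, a wrapper around an Ollama request) injected by the caller.
async function processArticle(article, summarize, ollamaEnabled) {
  const doc = {
    ...article,
    word_count: countWords(article.content),
    crawled_at: new Date(),
  };
  if (!ollamaEnabled) return doc;
  try {
    const summary = await summarize(article.content);
    return {
      ...doc,
      summary,
      summary_word_count: countWords(summary),
      summarized_at: new Date(),
    };
  } catch (err) {
    // Fallback: Ollama unavailable or failed, keep the original content only.
    return doc;
  }
}

// Usage: the flag would come from the environment, e.g.
//   processArticle(article, summarize, process.env.OLLAMA_ENABLED === "true");
```

Injecting `summarize` keeps the fallback path easy to exercise without a running Ollama instance.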

Summary Field Details

  • Language: Always in English, regardless of source article language
  • Length: Maximum 150 words
  • Format: Plain text, concise and clear
  • Purpose: Quick preview for newsletters and frontend display
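
If the 150-word limit needs to be enforced on the application side, a guard like the one below is one option. It is a sketch under assumptions: the function names are hypothetical, words are split on whitespace, and truncation is only one possible policy (rejecting the summary or re-prompting the model are alternatives).

```javascript
// Split on whitespace; an assumption about how words are counted.
function splitWords(text) {
  return text.trim().split(/\s+/).filter(Boolean);
}

// Return the summary unchanged if it fits, otherwise cut it at maxWords.
function enforceSummaryLimit(summary, maxWords = 150) {
  const words = splitWords(summary);
  if (words.length <= maxWords) {
    return { summary, summary_word_count: words.length };
  }
  const cut = words.slice(0, maxWords);
  return { summary: cut.join(" "), summary_word_count: cut.length };
}

// Usage: const { summary, summary_word_count } = enforceSummaryLimit(rawSummary);
```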

Querying Articles

// Get articles with AI summaries
db.articles.find({ summary: { $exists: true, $ne: null } })

// Get articles without summaries (field missing or null)
db.articles.find({ summary: null })

// Count summarized articles
db.articles.countDocuments({ summary: { $exists: true, $ne: null } })