dongho/Munich-news

Dongho Kim 1075a91eac update

2025-11-11 14:09:21 +01:00

MongoDB Database Schema

This document describes the MongoDB collections and their structure for Munich News Daily.

Collections

1. Articles Collection (`articles`)

Stores all news articles aggregated from Munich news sources.

Document Structure:

{
  _id: ObjectId,                    // Auto-generated MongoDB ID
  title: String,                    // Article title (required)
  author: String,                   // Article author (optional, extracted during crawl)
  link: String,                     // Article URL (required, unique)
  content: String,                  // Full article content (no length limit)
  summary: String,                  // AI-generated English summary (≤150 words)
  word_count: Number,               // Word count of full content
  summary_word_count: Number,       // Word count of AI summary
  source: String,                   // News source name (e.g., "Süddeutsche Zeitung München")
  published_at: String,             // Original publication date string, from the RSS feed or the crawled page
  crawled_at: DateTime,             // When article content was crawled (UTC)
  summarized_at: DateTime,          // When AI summary was generated (UTC)
  created_at: DateTime              // When article was added to database (UTC)
}

Indexes:

link - Unique index to prevent duplicate articles
created_at - Index for efficient sorting by date
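
Because `link` carries a unique index, a crawler can rely on an upsert keyed by `link` instead of checking for duplicates first. The sketch below only builds the arguments for such a call. `buildArticleUpsert` is a hypothetical helper, not part of the project, and it assumes writes go through an `updateOne(filter, update, options)` call as offered by mongosh and the official drivers.

```javascript
// Hypothetical helper: builds updateOne() arguments that insert an article
// only if no document with the same link exists yet.
function buildArticleUpsert(article, now = new Date()) {
  if (!article.title || !article.link) {
    throw new Error("title and link are required");
  }
  // link is supplied by the filter, so it is left out of the update body.
  const { link, ...rest } = article;
  return {
    filter: { link },
    // $setOnInsert applies only when the upsert creates a new document,
    // so a re-fetched article leaves the stored one untouched.
    update: { $setOnInsert: { ...rest, created_at: now } },
    options: { upsert: true },
  };
}

const op = buildArticleUpsert({
  title: "New U-Bahn Line Opens in Munich",
  link: "https://www.sueddeutsche.de/muenchen/ubahn-1.123456",
  source: "Süddeutsche Zeitung München",
});
// In mongosh: db.articles.updateOne(op.filter, op.update, op.options)
```

With this shape, fetching the same article twice results in one stored document, and the unique index remains the final safeguard if two crawlers race.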

Example Document:

{
  _id: ObjectId("507f1f77bcf86cd799439011"),
  title: "New U-Bahn Line Opens in Munich",
  author: "Max Mustermann",
  link: "https://www.sueddeutsche.de/muenchen/ubahn-1.123456",
  content: "The new U-Bahn line connecting the city center with the airport opened today. Mayor Dieter Reiter attended the opening ceremony... [full article text continues]",
  summary: "Munich's new U-Bahn line connecting the city center to the airport opened today with Mayor Dieter Reiter in attendance. The line features 10 stations and runs every 10 minutes during peak hours, significantly reducing travel time. Construction took five years and cost approximately 2 billion euros.",
  word_count: 1250,
  summary_word_count: 46,
  source: "Süddeutsche Zeitung München",
  published_at: "Mon, 15 Jan 2024 10:00:00 +0100",
  crawled_at: ISODate("2024-01-15T09:30:00.000Z"),
  summarized_at: ISODate("2024-01-15T09:30:15.000Z"),
  created_at: ISODate("2024-01-15T09:00:00.000Z")
}
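
The two count fields and the string-typed `published_at` can be derived with a few lines of code. This is a minimal sketch under stated assumptions: words are counted as whitespace-separated tokens (the project may count differently), and `countWords` and `parsePublishedAt` are hypothetical helper names.

```javascript
// Counts whitespace-separated tokens; an assumption about how
// word_count and summary_word_count are computed.
function countWords(text) {
  const trimmed = (text || "").trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// published_at is stored as the feed's original string. Parsing of
// RFC 2822 style strings by Date is implementation-defined in JavaScript,
// but Node.js (V8) accepts this form. Returns null when parsing fails.
function parsePublishedAt(value) {
  const parsed = new Date(value);
  return Number.isNaN(parsed.getTime()) ? null : parsed;
}

const published = parsePublishedAt("Mon, 15 Jan 2024 10:00:00 +0100");
console.log(countWords("Munich's new U-Bahn line opened today")); // 6
console.log(published.toISOString()); // 2024-01-15T09:00:00.000Z
```

Converting `published_at` at read time like this makes it comparable with the UTC `DateTime` fields without changing what is stored.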

2. Subscribers Collection (`subscribers`)

Stores all newsletter subscribers.

Document Structure:

{
  _id: ObjectId,                    // Auto-generated MongoDB ID
  email: String,                    // Subscriber email (required, unique, lowercase)
  subscribed_at: DateTime,          // When user subscribed (UTC)
  status: String                    // Subscription status: 'active' or 'inactive'
}

Indexes:

email - Unique index for email lookups and preventing duplicates
subscribed_at - Index for analytics and sorting

Example Document:

{
  _id: ObjectId("507f1f77bcf86cd799439012"),
  email: "user@example.com",
  subscribed_at: ISODate("2024-01-15T08:30:00.000Z"),
  status: "active"
}

3. Newsletter Sends Collection (`newsletter_sends`)

Stores one record per newsletter sent to each subscriber; used for email open tracking.

Document Structure:

{
  _id: ObjectId,                    // Auto-generated MongoDB ID
  newsletter_id: String,            // Unique ID for this newsletter batch (date-based)
  subscriber_email: String,         // Recipient email
  tracking_id: String,              // Unique tracking ID for this send (UUID)
  sent_at: DateTime,                // When email was sent (UTC)
  opened: Boolean,                  // Whether email was opened
  first_opened_at: DateTime,        // First open timestamp (null if not opened)
  last_opened_at: DateTime,         // Most recent open timestamp
  open_count: Number,               // Number of times opened
  created_at: DateTime              // Record creation time (UTC)
}

Indexes:

tracking_id - Unique index for fast pixel request lookups
newsletter_id - Index for analytics queries
subscriber_email - Index for user activity queries
sent_at - Index for time-based queries

Example Document:

{
  _id: ObjectId("507f1f77bcf86cd799439013"),
  newsletter_id: "2024-01-15",
  subscriber_email: "user@example.com",
  tracking_id: "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  sent_at: ISODate("2024-01-15T08:00:00.000Z"),
  opened: true,
  first_opened_at: ISODate("2024-01-15T09:30:00.000Z"),
  last_opened_at: ISODate("2024-01-15T14:20:00.000Z"),
  open_count: 3,
  created_at: ISODate("2024-01-15T08:00:00.000Z")
}
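
The open-tracking fields change together each time the tracking pixel is requested. The sketch below models that state change on a plain object. `recordOpen` is a hypothetical name, and the rule that `first_opened_at` is set once and never overwritten is inferred from the field comments above.

```javascript
// In-memory sketch of the change an open-tracking request applies to a
// newsletter_sends document. Returns a new object; the input is not mutated.
function recordOpen(send, openedAt = new Date()) {
  return {
    ...send,
    opened: true,
    // Keep the first timestamp once set; only fill it on the first open.
    first_opened_at: send.first_opened_at ?? openedAt,
    last_opened_at: openedAt,
    open_count: (send.open_count ?? 0) + 1,
  };
}

const fresh = {
  tracking_id: "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  opened: false,
  first_opened_at: null,
  open_count: 0,
};
const once = recordOpen(fresh, new Date("2024-01-15T09:30:00Z"));
const twice = recordOpen(once, new Date("2024-01-15T14:20:00Z"));
console.log(twice.open_count); // 2
```

Against the database, the same change would need to be applied atomically in a single update keyed by `tracking_id`, so that concurrent opens do not lose counts.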

4. Link Clicks Collection (`link_clicks`)

Tracks individual link clicks from newsletters.

Document Structure:

{
  _id: ObjectId,                    // Auto-generated MongoDB ID
  tracking_id: String,              // Unique tracking ID for this link (UUID)
  newsletter_id: String,            // Which newsletter this link was in
  subscriber_email: String,         // Who clicked
  article_url: String,              // Original article URL
  article_title: String,            // Article title for reporting
  clicked_at: DateTime,             // When link was clicked (UTC)
  user_agent: String,               // Browser/client info
  created_at: DateTime              // Record creation time (UTC)
}

Indexes:

tracking_id - Unique index for fast redirect request lookups
newsletter_id - Index for analytics queries
article_url - Index for article performance queries
subscriber_email - Index for user activity queries

Example Document:

{
  _id: ObjectId("507f1f77bcf86cd799439014"),
  tracking_id: "b2c3d4e5-f6a7-8901-bcde-f12345678901",
  newsletter_id: "2024-01-15",
  subscriber_email: "user@example.com",
  article_url: "https://www.sueddeutsche.de/muenchen/ubahn-1.123456",
  article_title: "New U-Bahn Line Opens in Munich",
  clicked_at: ISODate("2024-01-15T09:35:00.000Z"),
  user_agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
  created_at: ISODate("2024-01-15T09:35:00.000Z")
}
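
The `newsletter_id` and `article_url` indexes are meant for per-newsletter and per-article reporting. One way such a report could look is sketched below as an aggregation pipeline. `clicksPerArticlePipeline` is a hypothetical helper, and the `clicked_at` filter is a defensive assumption in case documents exist for links that were never clicked.

```javascript
// Hypothetical helper: aggregation pipeline that ranks the articles of one
// newsletter by number of recorded clicks.
function clicksPerArticlePipeline(newsletterId, limit = 10) {
  return [
    // Only count documents that actually record a click.
    { $match: { newsletter_id: newsletterId, clicked_at: { $ne: null } } },
    {
      $group: {
        _id: "$article_url",
        article_title: { $first: "$article_title" },
        clicks: { $sum: 1 },
      },
    },
    { $sort: { clicks: -1 } },
    { $limit: limit },
  ];
}

const pipeline = clicksPerArticlePipeline("2024-01-15");
// In mongosh: db.link_clicks.aggregate(pipeline)
```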

5. Subscriber Activity Collection (`subscriber_activity`)

Aggregated activity status for each subscriber.

Document Structure:

{
  _id: ObjectId,                    // Auto-generated MongoDB ID
  email: String,                    // Subscriber email (unique)
  status: String,                   // 'active', 'inactive', or 'dormant'
  last_opened_at: DateTime,         // Most recent email open (UTC)
  last_clicked_at: DateTime,        // Most recent link click (UTC)
  total_opens: Number,              // Lifetime open count
  total_clicks: Number,             // Lifetime click count
  newsletters_received: Number,     // Total newsletters sent
  newsletters_opened: Number,       // Total newsletters opened
  updated_at: DateTime              // Last status update (UTC)
}

Indexes:

email - Unique index for fast lookups
status - Index for filtering by activity level
last_opened_at - Index for time-based queries

Activity Status Classification:

active: Opened an email within the last 30 days
inactive: Most recent open was 30-60 days ago
dormant: No opens for more than 60 days
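
The classification can be expressed as a small function of `last_opened_at`. The thresholds come from the list above; how the exact 30-day and 60-day boundaries fall, and how to treat a subscriber who has never opened an email, are assumptions of this sketch. `classifyActivity` is a hypothetical name.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Assumptions: a last open exactly 30 or exactly 60 days old counts as
// 'inactive', and a subscriber with no recorded open is 'dormant'.
function classifyActivity(lastOpenedAt, now = new Date()) {
  if (!lastOpenedAt) return "dormant";
  const days = (now.getTime() - lastOpenedAt.getTime()) / DAY_MS;
  if (days < 30) return "active";
  if (days <= 60) return "inactive";
  return "dormant";
}

const today = new Date("2024-03-01T00:00:00Z");
console.log(classifyActivity(new Date("2024-02-15T00:00:00Z"), today)); // active
console.log(classifyActivity(new Date("2024-01-15T00:00:00Z"), today)); // inactive
console.log(classifyActivity(new Date("2023-12-01T00:00:00Z"), today)); // dormant
```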

Example Document:

{
  _id: ObjectId("507f1f77bcf86cd799439015"),
  email: "user@example.com",
  status: "active",
  last_opened_at: ISODate("2024-01-15T09:30:00.000Z"),
  last_clicked_at: ISODate("2024-01-15T09:35:00.000Z"),
  total_opens: 45,
  total_clicks: 23,
  newsletters_received: 60,
  newsletters_opened: 45,
  updated_at: ISODate("2024-01-15T10:00:00.000Z")
}

Design Decisions

Why MongoDB?

Flexibility: Easy to add new fields without schema migrations
Scalability: Handles large volumes of articles and subscribers efficiently
Performance: Indexes on frequently queried fields (link, email, created_at)
Document Model: Natural fit for news articles and subscriber data
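
The indexes listed for each collection could be created in one pass at startup. The sketch below gathers them in a table and applies them through any object that exposes `collection(name).createIndex(keys, options)`, which is the shape of the Node.js driver's `Db` object. `INDEXES` and `ensureIndexes` are hypothetical names, and ascending order is an assumption (direction does not matter for single-field indexes).

```javascript
// Index definitions collected from the "Indexes" lists in this document.
const INDEXES = {
  articles: [
    { keys: { link: 1 }, options: { unique: true } },
    { keys: { created_at: 1 }, options: {} },
  ],
  subscribers: [
    { keys: { email: 1 }, options: { unique: true } },
    { keys: { subscribed_at: 1 }, options: {} },
  ],
  newsletter_sends: [
    { keys: { tracking_id: 1 }, options: { unique: true } },
    { keys: { newsletter_id: 1 }, options: {} },
    { keys: { subscriber_email: 1 }, options: {} },
    { keys: { sent_at: 1 }, options: {} },
  ],
  link_clicks: [
    { keys: { tracking_id: 1 }, options: { unique: true } },
    { keys: { newsletter_id: 1 }, options: {} },
    { keys: { article_url: 1 }, options: {} },
    { keys: { subscriber_email: 1 }, options: {} },
  ],
  subscriber_activity: [
    { keys: { email: 1 }, options: { unique: true } },
    { keys: { status: 1 }, options: {} },
    { keys: { last_opened_at: 1 }, options: {} },
  ],
};

// Creates every index in INDEXES and returns "collection:field" labels.
async function ensureIndexes(db) {
  const created = [];
  for (const [name, specs] of Object.entries(INDEXES)) {
    for (const { keys, options } of specs) {
      await db.collection(name).createIndex(keys, options);
      created.push(`${name}:${Object.keys(keys).join(",")}`);
    }
  }
  return created;
}
```

Creating an index that already exists with the same definition is a no-op in MongoDB, so a function like this can run on every start.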

Schema Choices

Unique Link Index: Prevents duplicate articles from being stored, even if fetched multiple times
Status Field: Soft delete for subscribers (set to 'inactive' instead of deleting) - allows for analytics and easy re-subscription
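UTC Timestamps: All DateTime fields are stored in UTC for consistency across timezones; published_at is the exception, kept as the original string from the source (including its UTC offset)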
Lowercase Emails: Emails stored in lowercase to prevent case-sensitivity issues
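
The lowercase-email and soft-delete choices can be sketched together as helpers that build `updateOne` arguments for the `subscribers` collection. All three function names are hypothetical. Trimming whitespace and keeping the original `subscribed_at` on re-subscription are assumptions, not behavior confirmed by this document.

```javascript
// Lowercases (and, as an assumption, trims) an address before it is
// stored or used in a lookup.
function normalizeEmail(email) {
  return String(email).trim().toLowerCase();
}

// Soft delete: flip status instead of removing the document.
function buildUnsubscribe(email) {
  return {
    filter: { email: normalizeEmail(email) },
    update: { $set: { status: "inactive" } },
  };
}

// Re-subscription reactivates an existing document, or creates one.
function buildSubscribe(email, now = new Date()) {
  return {
    filter: { email: normalizeEmail(email) },
    update: {
      $set: { status: "active" },
      // Only written when the upsert inserts a new subscriber.
      $setOnInsert: { subscribed_at: now },
    },
    options: { upsert: true },
  };
}

console.log(normalizeEmail("  User@Example.COM ")); // user@example.com
```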

Future Enhancements

Potential fields to add in the future:

Articles:

category: String (e.g., "politics", "sports", "culture")
tags: Array of Strings
image_url: String
sent_in_newsletter: Boolean (track if article was sent)
sent_at: DateTime (when article was included in newsletter)

Subscribers:

preferences: Object (newsletter frequency, categories, etc.)
last_sent_at: DateTime (last newsletter sent date)
unsubscribed_at: DateTime (when user unsubscribed)
verification_token: String (for email verification)

AI Summarization Workflow

When the crawler processes an article:

Extract Content: Full article text is extracted from the webpage
Summarize with Ollama: If OLLAMA_ENABLED=true, the content is sent to Ollama for summarization
Store Both: Both the original content and AI-generated summary are stored
Fallback: If Ollama is unavailable or fails, only the original content is stored
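
The four steps above can be sketched as a single function. This is not the crawler's actual code: `processArticle` is a hypothetical name, and `summarize` is a hypothetical async function `(content) => summary` standing in for whatever client the crawler uses to call Ollama. The `ollamaEnabled` flag mirrors `OLLAMA_ENABLED`.

```javascript
// Returns the document to store. A summary is attached only when
// summarization is enabled and succeeds; otherwise the original content
// is stored on its own.
async function processArticle(article, { ollamaEnabled, summarize, now = () => new Date() }) {
  const doc = { ...article, crawled_at: now() };
  if (!ollamaEnabled) return doc;
  try {
    const summary = await summarize(article.content);
    if (summary && summary.trim() !== "") {
      doc.summary = summary.trim();
      doc.summarized_at = now();
    }
  } catch (err) {
    // Fallback: keep the original content and store no summary.
    console.warn(`Summarization failed for ${article.link}: ${err.message}`);
  }
  return doc;
}
```

Leaving `summary` unset on failure, rather than storing an empty string, keeps failed articles easy to find and re-process later.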

Summary Field Details

Language: Always in English, regardless of source article language
Length: Maximum 150 words
Format: Plain text, concise and clear
Purpose: Quick preview for newsletters and frontend display
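
A model can overshoot a requested length, so the 150-word limit may need to be enforced after generation. The sketch below does this by truncating; that choice, rather than rejecting or regenerating the summary, is an assumption, and `enforceSummaryLimit` is a hypothetical name.

```javascript
const MAX_SUMMARY_WORDS = 150;

// Collapses whitespace so the summary is single-spaced plain text, then
// cuts it at the word limit.
function enforceSummaryLimit(summary, maxWords = MAX_SUMMARY_WORDS) {
  const words = summary.trim().split(/\s+/).filter(Boolean);
  const kept = words.slice(0, maxWords);
  return {
    text: kept.join(" "),
    wordCount: kept.length,
    truncated: words.length > maxWords,
  };
}

const result = enforceSummaryLimit("Munich's  new U-Bahn line\nopened today.");
console.log(result.wordCount, result.truncated); // 6 false
```

The returned `wordCount` corresponds to what would be stored in `summary_word_count`.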

Querying Articles

// Get articles with AI summaries
db.articles.find({ summary: { $exists: true, $ne: null } })

// Get articles without summaries (matches both a missing and a null summary field)
db.articles.find({ summary: null })

// Count summarized articles
db.articles.countDocuments({ summary: { $exists: true, $ne: null } })
