# MongoDB Database Schema

This document describes the MongoDB collections and their structure for Munich News Daily.

## Collections

### 1. Articles Collection (`articles`)

Stores all news articles aggregated from Munich news sources.

**Document Structure:**
```javascript
{
  _id: ObjectId,               // Auto-generated MongoDB ID
  title: String,               // Article title (required)
  author: String,              // Article author (optional, extracted during crawl)
  link: String,                // Article URL (required, unique)
  content: String,             // Full article content (no length limit)
  summary: String,             // AI-generated English summary (≤150 words)
  word_count: Number,          // Word count of full content
  summary_word_count: Number,  // Word count of AI summary
  source: String,              // News source name (e.g., "Süddeutsche Zeitung München")
  published_at: String,        // Original publication date (from the RSS feed or the crawled page)
  crawled_at: DateTime,        // When article content was crawled (UTC)
  summarized_at: DateTime,     // When AI summary was generated (UTC)
  created_at: DateTime         // When article was added to database (UTC)
}
```

**Indexes:**
- `link` - Unique index to prevent duplicate articles
- `created_at` - Index for efficient sorting by date
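
The two indexes listed above could be created along these lines. This is a minimal sketch, not the project's actual setup code: `ensureArticleIndexes` is a hypothetical helper, `db` is assumed to be a connected database handle from the MongoDB Node.js driver, and the descending direction on `created_at` is an assumption (a single-field index serves sorts in either direction).

```javascript
// Hypothetical helper. `db` is assumed to expose collection(name).createIndex(),
// as the official MongoDB Node.js driver does.
async function ensureArticleIndexes(db) {
  const articles = db.collection("articles");

  // Unique index: a second insert with the same `link` is rejected.
  await articles.createIndex({ link: 1 }, { unique: true });

  // Supports "newest first" listings sorted by created_at.
  // The direction (-1) is an assumption, not taken from the schema.
  await articles.createIndex({ created_at: -1 });
}
```

Creating an index that already exists with the same definition is a no-op, so a helper like this can run on every application start. The indexes of the other collections would follow the same pattern.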

**Example Document:**
```javascript
{
  _id: ObjectId("507f1f77bcf86cd799439011"),
  title: "New U-Bahn Line Opens in Munich",
  author: "Max Mustermann",
  link: "https://www.sueddeutsche.de/muenchen/ubahn-1.123456",
  content: "The new U-Bahn line connecting the city center with the airport opened today. Mayor Dieter Reiter attended the opening ceremony... [full article text continues]",
  summary: "Munich's new U-Bahn line connecting the city center to the airport opened today with Mayor Dieter Reiter in attendance. The line features 10 stations and runs every 10 minutes during peak hours, significantly reducing travel time. Construction took five years and cost approximately 2 billion euros.",
  word_count: 1250,
  summary_word_count: 48,
  source: "Süddeutsche Zeitung München",
  published_at: "Mon, 15 Jan 2024 10:00:00 +0100",
  crawled_at: ISODate("2024-01-15T09:30:00.000Z"),
  summarized_at: ISODate("2024-01-15T09:30:15.000Z"),
  created_at: ISODate("2024-01-15T09:00:00.000Z")
}
```
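
Because `link` carries a unique index, inserting an article that was already fetched fails with MongoDB's duplicate-key error (code 11000). The sketch below shows one way a crawler might treat that case as "already stored" rather than as a failure. `insertArticleIfNew` is a hypothetical helper, and `articles` is assumed to be a collection handle from the MongoDB Node.js driver.

```javascript
// Hypothetical helper. `articles` is assumed to expose insertOne(), as a
// collection handle from the MongoDB Node.js driver does.
async function insertArticleIfNew(articles, article) {
  try {
    await articles.insertOne({ ...article, created_at: new Date() });
    return true; // stored as a new article
  } catch (err) {
    // 11000 is MongoDB's duplicate-key error: this `link` is already stored.
    if (err && err.code === 11000) {
      return false;
    }
    throw err; // anything else is a real failure and should surface
  }
}
```

Relying on the index instead of a prior `findOne` check avoids a race between two crawler runs that fetch the same article at the same time.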

### 2. Subscribers Collection (`subscribers`)

Stores all newsletter subscribers.

**Document Structure:**
```javascript
{
  _id: ObjectId,            // Auto-generated MongoDB ID
  email: String,            // Subscriber email (required, unique, lowercase)
  subscribed_at: DateTime,  // When user subscribed (UTC)
  status: String            // Subscription status: 'active' or 'inactive'
}
```

**Indexes:**
- `email` - Unique index for email lookups and preventing duplicates
- `subscribed_at` - Index for analytics and sorting

**Example Document:**
```javascript
{
  _id: ObjectId("507f1f77bcf86cd799439012"),
  email: "user@example.com",
  subscribed_at: ISODate("2024-01-15T08:30:00.000Z"),
  status: "active"
}
```

### 3. Newsletter Sends Collection (`newsletter_sends`)

Tracks each newsletter sent to each subscriber for email open tracking.

**Document Structure:**
```javascript
{
  _id: ObjectId,              // Auto-generated MongoDB ID
  newsletter_id: String,      // Unique ID for this newsletter batch (date-based)
  subscriber_email: String,   // Recipient email
  tracking_id: String,        // Unique tracking ID for this send (UUID)
  sent_at: DateTime,          // When email was sent (UTC)
  opened: Boolean,            // Whether email was opened
  first_opened_at: DateTime,  // First open timestamp (null if not opened)
  last_opened_at: DateTime,   // Most recent open timestamp
  open_count: Number,         // Number of times opened
  created_at: DateTime        // Record creation time (UTC)
}
```

**Indexes:**
- `tracking_id` - Unique index for fast pixel request lookups
- `newsletter_id` - Index for analytics queries
- `subscriber_email` - Index for user activity queries
- `sent_at` - Index for time-based queries

**Example Document:**
```javascript
{
  _id: ObjectId("507f1f77bcf86cd799439013"),
  newsletter_id: "2024-01-15",
  subscriber_email: "user@example.com",
  tracking_id: "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  sent_at: ISODate("2024-01-15T08:00:00.000Z"),
  opened: true,
  first_opened_at: ISODate("2024-01-15T09:30:00.000Z"),
  last_opened_at: ISODate("2024-01-15T14:20:00.000Z"),
  open_count: 3,
  created_at: ISODate("2024-01-15T08:00:00.000Z")
}
```
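
The open-tracking fields above (`opened`, `first_opened_at`, `last_opened_at`, `open_count`) can be maintained in a single atomic update when the tracking pixel is requested. The following is a sketch under assumptions, not the project's actual handler: `buildOpenUpdate` is a hypothetical helper, and it uses an aggregation-pipeline update, which requires MongoDB 4.2 or later.

```javascript
// Hypothetical helper: builds the filter and update for updateOne() when the
// tracking pixel for `trackingId` is requested.
function buildOpenUpdate(trackingId, now = new Date()) {
  const filter = { tracking_id: trackingId };
  const update = [
    {
      $set: {
        opened: true,
        // Keep the existing value if present; set it only on the first open.
        first_opened_at: { $ifNull: ["$first_opened_at", now] },
        last_opened_at: now,
        // Treat a missing or null counter as 0, then add one.
        open_count: { $add: [{ $ifNull: ["$open_count", 0] }, 1] },
      },
    },
  ];
  return { filter, update };
}
```

With a driver collection handle this would be applied as `collection.updateOne(filter, update)`. The unique index on `tracking_id` keeps the lookup fast.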

### 4. Link Clicks Collection (`link_clicks`)

Tracks individual link clicks from newsletters.

**Document Structure:**
```javascript
{
  _id: ObjectId,             // Auto-generated MongoDB ID
  tracking_id: String,       // Unique tracking ID for this link (UUID)
  newsletter_id: String,     // Which newsletter this link was in
  subscriber_email: String,  // Who clicked
  article_url: String,       // Original article URL
  article_title: String,     // Article title for reporting
  clicked_at: DateTime,      // When link was clicked (UTC)
  user_agent: String,        // Browser/client info
  created_at: DateTime       // Record creation time (UTC)
}
```

**Indexes:**
- `tracking_id` - Unique index for fast redirect request lookups
- `newsletter_id` - Index for analytics queries
- `article_url` - Index for article performance queries
- `subscriber_email` - Index for user activity queries

**Example Document:**
```javascript
{
  _id: ObjectId("507f1f77bcf86cd799439014"),
  tracking_id: "b2c3d4e5-f6a7-8901-bcde-f12345678901",
  newsletter_id: "2024-01-15",
  subscriber_email: "user@example.com",
  article_url: "https://www.sueddeutsche.de/muenchen/ubahn-1.123456",
  article_title: "New U-Bahn Line Opens in Munich",
  clicked_at: ISODate("2024-01-15T09:35:00.000Z"),
  user_agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
  created_at: ISODate("2024-01-15T09:35:00.000Z")
}
```
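
As an illustration of the "article performance queries" that the `newsletter_id` and `article_url` indexes are meant to serve, the sketch below builds an aggregation pipeline that counts clicks per article for one newsletter. `buildClicksPerArticlePipeline` is a hypothetical helper. It assumes that every document with a non-null `clicked_at` represents one click; whether unclicked links are also stored is not stated in this schema, so the `clicked_at` filter is a precaution.

```javascript
// Hypothetical helper: returns an aggregation pipeline for link_clicks.
function buildClicksPerArticlePipeline(newsletterId) {
  return [
    // Only this newsletter, and only records that have actually been clicked.
    { $match: { newsletter_id: newsletterId, clicked_at: { $ne: null } } },
    // One result row per article URL.
    {
      $group: {
        _id: "$article_url",
        article_title: { $first: "$article_title" },
        clicks: { $sum: 1 },
      },
    },
    // Most-clicked articles first.
    { $sort: { clicks: -1 } },
  ];
}
```

In mongosh this could be run as `db.link_clicks.aggregate(buildClicksPerArticlePipeline("2024-01-15"))`.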

### 5. Subscriber Activity Collection (`subscriber_activity`)

Aggregated activity status for each subscriber.

**Document Structure:**
```javascript
{
  _id: ObjectId,                 // Auto-generated MongoDB ID
  email: String,                 // Subscriber email (unique)
  status: String,                // 'active', 'inactive', or 'dormant'
  last_opened_at: DateTime,      // Most recent email open (UTC)
  last_clicked_at: DateTime,     // Most recent link click (UTC)
  total_opens: Number,           // Lifetime open count
  total_clicks: Number,          // Lifetime click count
  newsletters_received: Number,  // Total newsletters sent
  newsletters_opened: Number,    // Total newsletters opened
  updated_at: DateTime           // Last status update (UTC)
}
```

**Indexes:**
- `email` - Unique index for fast lookups
- `status` - Index for filtering by activity level
- `last_opened_at` - Index for time-based queries

**Activity Status Classification:**
- **active**: Opened an email within the last 30 days
- **inactive**: Most recent open was 30-60 days ago
- **dormant**: Most recent open was more than 60 days ago
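
The classification above can be expressed as a small pure function. This is a sketch, not the project's implementation: `classifyActivity` is a hypothetical name, the handling of the exact 30-day and 60-day boundaries is an assumption, and treating a subscriber who has never opened an email as dormant is also an assumption.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical helper: maps the last open time to an activity status.
function classifyActivity(lastOpenedAt, now = new Date()) {
  // Assumption: no recorded open at all counts as dormant.
  if (!lastOpenedAt) return "dormant";

  const daysSinceOpen = (now.getTime() - lastOpenedAt.getTime()) / DAY_MS;
  if (daysSinceOpen <= 30) return "active";   // boundary handling is an assumption
  if (daysSinceOpen <= 60) return "inactive";
  return "dormant";
}
```

Passing `now` explicitly keeps the function deterministic, which makes it straightforward to unit test or to re-run over historical data.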

**Example Document:**
```javascript
{
  _id: ObjectId("507f1f77bcf86cd799439015"),
  email: "user@example.com",
  status: "active",
  last_opened_at: ISODate("2024-01-15T09:30:00.000Z"),
  last_clicked_at: ISODate("2024-01-15T09:35:00.000Z"),
  total_opens: 45,
  total_clicks: 23,
  newsletters_received: 60,
  newsletters_opened: 45,
  updated_at: ISODate("2024-01-15T10:00:00.000Z")
}
```

## Design Decisions

### Why MongoDB?

1. **Flexibility**: Easy to add new fields without schema migrations
2. **Scalability**: Handles large volumes of articles and subscribers efficiently
3. **Performance**: Indexes on frequently queried fields (link, email, created_at)
4. **Document Model**: Natural fit for news articles and subscriber data

### Schema Choices

1. **Unique Link Index**: Prevents duplicate articles from being stored, even if fetched multiple times
2. **Status Field**: Soft delete for subscribers (set to 'inactive' instead of deleting) - allows for analytics and easy re-subscription
3. **UTC Timestamps**: All dates stored in UTC for consistency across timezones
4. **Lowercase Emails**: Emails stored in lowercase to prevent case-sensitivity issues
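
The soft-delete and lowercase-email choices above might translate into update specifications like the following. All three function names are hypothetical, trimming surrounding whitespace is an assumption beyond what the schema states, and implementing re-subscription as an upsert is one possible approach rather than the documented one.

```javascript
// Hypothetical helper: canonical form used for storage and lookups.
function normalizeEmail(email) {
  return email.trim().toLowerCase(); // trim() is an assumption; lowercase is from the schema
}

// Soft delete: flip the status instead of removing the document.
function buildUnsubscribe(email) {
  return {
    filter: { email: normalizeEmail(email) },
    update: { $set: { status: "inactive" } },
  };
}

// First subscription or re-subscription, expressed as a single upsert.
function buildSubscribe(email, now = new Date()) {
  return {
    filter: { email: normalizeEmail(email) },
    update: {
      $set: { status: "active" },
      // Only written when the document is newly created.
      $setOnInsert: { subscribed_at: now },
    },
    options: { upsert: true },
  };
}
```

Each result would be passed to `updateOne(filter, update, options)` on the `subscribers` collection. Because every lookup goes through `normalizeEmail`, `User@Example.com` and `user@example.com` resolve to the same document.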

### Future Enhancements

Potential fields to add in the future:

**Articles:**
- `category`: String (e.g., "politics", "sports", "culture")
- `tags`: Array of Strings
- `image_url`: String
- `sent_in_newsletter`: Boolean (track if article was sent)
- `sent_at`: DateTime (when article was included in newsletter)

**Subscribers:**
- `preferences`: Object (newsletter frequency, categories, etc.)
- `last_sent_at`: DateTime (last newsletter sent date)
- `unsubscribed_at`: DateTime (when user unsubscribed)
- `verification_token`: String (for email verification)

## AI Summarization Workflow

When the crawler processes an article:

1. **Extract Content**: Full article text is extracted from the webpage
2. **Summarize with Ollama**: If `OLLAMA_ENABLED=true`, the content is sent to Ollama for summarization
3. **Store Both**: Both the original `content` and AI-generated `summary` are stored
4. **Fallback**: If Ollama is unavailable or fails, only the original content is stored
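
The four steps above can be sketched as follows. This is an illustration only: `processArticle` is a hypothetical function, and `summarize` is a hypothetical async callback (text in, summary string out) standing in for the Ollama call, so no particular Ollama client API is assumed.

```javascript
// Hypothetical sketch of the workflow. `summarize` is injected so the Ollama
// integration itself stays outside this example.
async function processArticle(article, summarize, env = process.env) {
  // Step 1 is assumed done: `article.content` holds the extracted text.
  const doc = { ...article, crawled_at: new Date() };

  // Step 2: only summarize when the feature flag is switched on.
  if (env.OLLAMA_ENABLED !== "true") return doc;

  try {
    const summary = await summarize(article.content);
    if (summary) {
      // Step 3: store the summary alongside the original content.
      doc.summary = summary;
      doc.summarized_at = new Date();
    }
  } catch (err) {
    // Step 4: fallback. The article is still stored, just without a summary.
    console.warn("Summarization failed, storing content only:", err.message);
  }
  return doc;
}
```

Catching the error inside the function means one failed summarization never prevents the article itself from being saved.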

### Summary Field Details

- **Language**: Always in English, regardless of source article language
- **Length**: Maximum 150 words
- **Format**: Plain text, concise and clear
- **Purpose**: Quick preview for newsletters and frontend display
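
The 150-word limit and the two word-count fields could be enforced before a summary is stored, roughly as below. Both function names are hypothetical, and counting words by splitting on whitespace is an assumption; the project's own counting rule is not specified here.

```javascript
const MAX_SUMMARY_WORDS = 150;

// Assumption: a "word" is any run of non-whitespace characters.
function countWords(text) {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Returns the summary-related fields, or null if the summary is empty
// or longer than the limit.
function buildSummaryFields(content, summary) {
  const summaryWords = countWords(summary);
  if (summaryWords === 0 || summaryWords > MAX_SUMMARY_WORDS) return null;
  return {
    summary: summary.trim(),
    word_count: countWords(content),
    summary_word_count: summaryWords,
  };
}
```

Returning `null` for an over-long summary lets the caller decide whether to retry, truncate, or fall back to storing the content alone.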

### Querying Articles

```javascript
// Get articles with AI summaries
db.articles.find({ summary: { $exists: true, $ne: null } })

// Get articles without summaries (field missing or set to null)
db.articles.find({ summary: null })

// Count summarized articles
db.articles.countDocuments({ summary: { $exists: true, $ne: null } })
```