update
143
backend/DATABASE_SCHEMA.md
Normal file
@@ -0,0 +1,143 @@
# MongoDB Database Schema

This document describes the MongoDB collections and their structure for Munich News Daily.

## Collections

### 1. Articles Collection (`articles`)

Stores all news articles aggregated from Munich news sources.

**Document Structure:**
```javascript
{
  _id: ObjectId,               // Auto-generated MongoDB ID
  title: String,               // Article title (required)
  author: String,              // Article author (optional, extracted during crawl)
  link: String,                // Article URL (required, unique)
  content: String,             // Full article content (no length limit)
  summary: String,             // AI-generated English summary (≤150 words)
  word_count: Number,          // Word count of full content
  summary_word_count: Number,  // Word count of AI summary
  source: String,              // News source name (e.g., "Süddeutsche Zeitung München")
  published_at: String,        // Original publication date from RSS feed or crawled
  crawled_at: DateTime,        // When article content was crawled (UTC)
  summarized_at: DateTime,     // When AI summary was generated (UTC)
  created_at: DateTime         // When article was added to database (UTC)
}
```

**Indexes:**
- `link` - Unique index to prevent duplicate articles
- `created_at` - Index for efficient sorting by date

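The two indexes listed above could be created with the official Node.js driver roughly as follows. This is a minimal sketch rather than the project's actual setup code: `ARTICLE_INDEXES` and `ensureArticleIndexes` are hypothetical names, `db` is assumed to be a connected driver `Db` instance, and the descending direction on `created_at` is an assumption.

```javascript
// Hypothetical index specifications mirroring the list above.
const ARTICLE_INDEXES = [
  // Unique index: a second insert with the same link is rejected.
  { keys: { link: 1 }, options: { unique: true } },
  // For a single-field index the direction does not limit which sorts it can serve.
  { keys: { created_at: -1 }, options: {} },
];

// `db` is assumed to be a connected `Db` from the official `mongodb` driver.
// Re-running createIndex with an identical specification leaves the index as is.
function ensureArticleIndexes(db) {
  const articles = db.collection("articles");
  return Promise.all(
    ARTICLE_INDEXES.map(({ keys, options }) => articles.createIndex(keys, options))
  );
}
```

Keeping the specifications in a plain array makes them easy to review next to this document and to reuse in tests.
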
**Example Document:**
```javascript
{
  _id: ObjectId("507f1f77bcf86cd799439011"),
  title: "New U-Bahn Line Opens in Munich",
  author: "Max Mustermann",
  link: "https://www.sueddeutsche.de/muenchen/ubahn-1.123456",
  content: "The new U-Bahn line connecting the city center with the airport opened today. Mayor Dieter Reiter attended the opening ceremony... [full article text continues]",
  summary: "Munich's new U-Bahn line connecting the city center to the airport opened today with Mayor Dieter Reiter in attendance. The line features 10 stations and runs every 10 minutes during peak hours, significantly reducing travel time. Construction took five years and cost approximately 2 billion euros.",
  word_count: 1250,
  summary_word_count: 48,
  source: "Süddeutsche Zeitung München",
  published_at: "Mon, 15 Jan 2024 10:00:00 +0100",
  crawled_at: ISODate("2024-01-15T09:30:00.000Z"),
  summarized_at: ISODate("2024-01-15T09:30:15.000Z"),
  created_at: ISODate("2024-01-15T09:00:00.000Z")
}
```
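
Because `published_at` is stored as a string in the feed's own format, it has to be parsed before it can be compared with the `ISODate` fields. The following is a minimal sketch, assuming the value looks like the RFC 2822 style date in the example; `parsePublishedAt` is a hypothetical helper, not part of the codebase.

```javascript
// Hypothetical helper: convert the stored `published_at` string to a Date.
// Returns null when the value cannot be parsed, so callers can skip or log it.
// Parsing of non-ISO formats is engine dependent; V8 (Node.js) accepts this form.
function parsePublishedAt(publishedAt) {
  if (typeof publishedAt !== "string") return null;
  const timestamp = Date.parse(publishedAt);
  return Number.isNaN(timestamp) ? null : new Date(timestamp);
}

const published = parsePublishedAt("Mon, 15 Jan 2024 10:00:00 +0100");
console.log(published.toISOString()); // 2024-01-15T09:00:00.000Z
```

Dates taken from crawled pages may use other formats, in which case the helper returns `null` instead of guessing.
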

### 2. Subscribers Collection (`subscribers`)

Stores all newsletter subscribers.

**Document Structure:**
```javascript
{
  _id: ObjectId,            // Auto-generated MongoDB ID
  email: String,            // Subscriber email (required, unique, lowercase)
  subscribed_at: DateTime,  // When user subscribed (UTC)
  status: String            // Subscription status: 'active' or 'inactive'
}
```

**Indexes:**
- `email` - Unique index for email lookups and preventing duplicates
- `subscribed_at` - Index for analytics and sorting

**Example Document:**
```javascript
{
  _id: ObjectId("507f1f77bcf86cd799439012"),
  email: "user@example.com",
  subscribed_at: ISODate("2024-01-15T08:30:00.000Z"),
  status: "active"
}
```

## Design Decisions

### Why MongoDB?

1. **Flexibility**: Easy to add new fields without schema migrations
2. **Scalability**: Handles large volumes of articles and subscribers efficiently
3. **Performance**: Indexes on frequently queried fields (`link`, `email`, `created_at`) keep lookups and date sorting fast
4. **Document Model**: Natural fit for news articles and subscriber data

### Schema Choices

1. **Unique Link Index**: Prevents duplicate articles from being stored, even if fetched multiple times
2. **Status Field**: Soft delete for subscribers (set to 'inactive' instead of deleting) - allows for analytics and easy re-subscription
3. **UTC Timestamps**: All dates stored in UTC for consistency across timezones
4. **Lowercase Emails**: Emails stored in lowercase to prevent case-sensitivity issues

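The subscriber-related choices above (soft delete and lowercase emails) can be illustrated with a few small helpers. This is a sketch under stated assumptions, not the project's code: the function names are hypothetical, trimming whitespace is an addition beyond the lowercase rule, and resetting `subscribed_at` on re-subscription is one possible policy.

```javascript
// Normalize an email before every lookup or write so the unique index treats
// "User@Example.com" and "user@example.com" as the same subscriber.
// Trimming is an assumption; the schema only requires lowercase.
function normalizeEmail(email) {
  return email.trim().toLowerCase();
}

// Arguments for an upsert: a new subscriber is inserted, an inactive one is
// reactivated. Timestamps are plain Date objects, which MongoDB stores as UTC.
function buildSubscribeUpdate(email, now = new Date()) {
  return {
    filter: { email: normalizeEmail(email) },
    update: { $set: { status: "active", subscribed_at: now } },
  };
}

// Soft delete: flip the status instead of removing the document.
function buildUnsubscribeUpdate(email) {
  return {
    filter: { email: normalizeEmail(email) },
    update: { $set: { status: "inactive" } },
  };
}
```

With the Node.js driver, the subscribe case would be passed to `updateOne(filter, update, { upsert: true })` and the unsubscribe case to `updateOne(filter, update)`.
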
### Future Enhancements

Potential fields to add in the future:

**Articles:**
- `category`: String (e.g., "politics", "sports", "culture")
- `tags`: Array of Strings
- `image_url`: String
- `sent_in_newsletter`: Boolean (track if article was sent)
- `sent_at`: DateTime (when article was included in newsletter)

**Subscribers:**
- `preferences`: Object (newsletter frequency, categories, etc.)
- `last_sent_at`: DateTime (last newsletter sent date)
- `unsubscribed_at`: DateTime (when user unsubscribed)
- `verification_token`: String (for email verification)

## AI Summarization Workflow

When the crawler processes an article:

1. **Extract Content**: Full article text is extracted from the webpage
2. **Summarize with Ollama**: If `OLLAMA_ENABLED=true`, the content is sent to Ollama for summarization
3. **Store Both**: Both the original `content` and AI-generated `summary` are stored
4. **Fallback**: If Ollama is unavailable or fails, only the original content is stored

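The steps above can be sketched as a single function with the summarizer injected. This is an illustration only: `buildArticleFields` is a hypothetical name, `summarize` stands in for whatever wraps the Ollama call, and leaving `summary` absent (rather than `null`) on failure is an assumption.

```javascript
// Hypothetical sketch of the crawl-time summarization step.
// `summarize` is an injected async function standing in for the Ollama call;
// it is assumed to resolve to a summary string or to throw on failure.
async function buildArticleFields(content, { ollamaEnabled, summarize }) {
  const fields = { content, crawled_at: new Date() };

  if (!ollamaEnabled) {
    return fields; // Summarization disabled: store the content only.
  }

  try {
    const summary = await summarize(content);
    if (typeof summary === "string" && summary.trim() !== "") {
      fields.summary = summary.trim();
      fields.summarized_at = new Date();
    }
  } catch (err) {
    // Fallback: Ollama unavailable or failed, so keep only the original content.
    console.warn(`Summarization failed: ${err.message}`);
  }

  return fields;
}
```

In a real crawler, `ollamaEnabled` would typically be derived from the environment, for example `process.env.OLLAMA_ENABLED === "true"`.
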
### Summary Field Details

- **Language**: Always in English, regardless of source article language
- **Length**: Maximum 150 words
- **Format**: Plain text, concise and clear
- **Purpose**: Quick preview for newsletters and frontend display

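The document does not say how the 150-word limit is enforced (for example through the prompt or through post-processing). One possible post-processing guard is sketched below; `enforceSummaryLimit` is a hypothetical helper, and counting words by splitting on whitespace is an assumption that may differ from how `summary_word_count` is actually computed.

```javascript
const MAX_SUMMARY_WORDS = 150; // Limit stated in the list above.

// Hypothetical guard: truncate an over-long summary to the word limit and
// return it together with its word count. Runs of whitespace are collapsed
// to single spaces, which keeps the result plain text.
function enforceSummaryLimit(summary, maxWords = MAX_SUMMARY_WORDS) {
  const words = summary.trim().split(/\s+/).filter(Boolean);
  const limited = words.slice(0, maxWords);
  return {
    summary: limited.join(" "),
    summary_word_count: limited.length,
  };
}
```

Truncating can cut a sentence in half, so a real implementation might prefer to re-request a shorter summary instead.
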
### Querying Articles

```javascript
// Get articles with AI summaries
db.articles.find({ summary: { $exists: true, $ne: null } })

// Get articles without summaries (field missing or set to null)
db.articles.find({ summary: null })

// Count summarized articles
db.articles.countDocuments({ summary: { $exists: true, $ne: null } })
```