# Recent Changes - Full Content Storage

## ✅ What Changed
### 1. Removed Content Length Limit

**Before:**

```python
'content': content_text[:10000]  # Limited to 10k chars
```

**After:**

```python
'content': content_text  # Full content, no limit
```
### 2. Simplified Database Schema

**Before:**

```js
{
  summary: String,      // Short summary
  full_content: String  // Limited content
}
```

**After:**

```js
{
  content: String  // Full article content, no limit
}
```
### 3. Enhanced API Response

**Before:**

```js
{
  title: "...",
  link: "...",
  summary: "..."
}
```

**After:**

```js
{
  title: "...",
  author: "...",          // NEW!
  link: "...",
  preview: "...",         // First 200 chars
  word_count: 1250,       // NEW!
  has_full_content: true  // NEW!
}
```
## 📊 Database Structure

### Articles Collection

```js
{
  _id: ObjectId,
  title: String,         // Article title
  author: String,        // Article author (extracted)
  link: String,          // Article URL (unique)
  content: String,       // FULL article content (no limit)
  word_count: Number,    // Word count
  source: String,        // RSS feed name
  published_at: String,  // Publication date
  crawled_at: DateTime,  // When crawled
  created_at: DateTime   // When added
}
```
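On the crawler side, a document matching this schema might be assembled like so (a sketch; `build_article_doc` is a hypothetical name, not the crawler's actual function, and `_id` is left to MongoDB):

```python
from datetime import datetime, timezone

def build_article_doc(title, author, link, content, source, published_at):
    """Assemble one article document matching the schema above."""
    now = datetime.now(timezone.utc)
    return {
        "title": title,
        "author": author,
        "link": link,                         # unique per article
        "content": content,                   # full text, no truncation
        "word_count": len(content.split()),
        "source": source,
        "published_at": published_at,
        "crawled_at": now,
        "created_at": now,
    }

doc = build_article_doc("T", "A", "https://example.com/a",
                        "full text here", "Feed", "2024-11-10T10:00:00Z")
```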
## 🆕 New API Endpoint

### GET /api/news/&lt;article_url&gt;

Get the full article content by URL.

**Example:**

```bash
# URL-encode the article URL
curl "http://localhost:5001/api/news/https%3A%2F%2Fexample.com%2Farticle"
```

**Response:**

```json
{
  "title": "New U-Bahn Line Opens in Munich",
  "author": "Max Mustermann",
  "link": "https://example.com/article",
  "content": "The full article text here... (complete, no truncation)",
  "word_count": 1250,
  "source": "Süddeutsche Zeitung München",
  "published_at": "2024-11-10T10:00:00Z",
  "crawled_at": "2024-11-10T16:30:00Z",
  "created_at": "2024-11-10T16:00:00Z"
}
```
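Because the article URL is embedded as a path segment, every reserved character (including `:` and `/`) must be percent-encoded. One way to build the endpoint URL in Python (the base URL is the local dev server from the examples above):

```python
from urllib.parse import quote

def article_endpoint(base: str, article_url: str) -> str:
    # safe="" also encodes "/" and ":", as a path segment requires
    return f"{base}/api/news/{quote(article_url, safe='')}"

url = article_endpoint("http://localhost:5001", "https://example.com/article")
# → "http://localhost:5001/api/news/https%3A%2F%2Fexample.com%2Farticle"
```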
## 📈 Enhanced Stats

### GET /api/stats

Now includes the crawled article count:

```js
{
  "subscribers": 150,
  "articles": 500,
  "crawled_articles": 350  // NEW!
}
```
## 🎯 Benefits

- Complete Content - No truncation, full articles stored
- Better for AI - Full context for summarization/analysis
- Cleaner Schema - Single `content` field instead of `summary` + `full_content`
- More Metadata - Author, word count, crawl timestamp
- Better API - Preview in list, full content on demand
## 🔄 Migration

Existing articles with a `full_content` field will continue to work; new articles use the `content` field.

To migrate old articles:

```js
// MongoDB shell
db.articles.updateMany(
  { full_content: { $exists: true } },
  [
    { $set: { content: "$full_content" } },
    { $unset: ["full_content", "summary"] }
  ]
)
```
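The per-document effect of that pipeline can be sketched in pure Python, which is handy for testing the mapping before running it server-side (a hypothetical helper; the real migration runs entirely inside MongoDB via `updateMany`):

```python
def migrate_doc(doc: dict) -> dict:
    """Copy full_content into content and drop the old fields,
    mirroring the $set/$unset pipeline for a single document."""
    if "full_content" in doc:
        doc = dict(doc)  # leave the input untouched
        doc["content"] = doc.pop("full_content")
        doc.pop("summary", None)
    return doc

old = {"title": "T", "summary": "s", "full_content": "full text"}
new = migrate_doc(old)
```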
## 🚀 Usage

### Crawl Articles

```bash
cd news_crawler
python crawler_service.py 10
```

### Get Article List (with previews)

```bash
curl http://localhost:5001/api/news
```

### Get Full Article Content

```bash
# Get the article URL from the list, then:
curl "http://localhost:5001/api/news/<encoded_url>"
```

### Check Stats

```bash
curl http://localhost:5001/api/stats
```
## 📝 Example Workflow

1. Add an RSS Feed

   ```bash
   curl -X POST http://localhost:5001/api/rss-feeds \
     -H "Content-Type: application/json" \
     -d '{"name": "News Source", "url": "https://example.com/rss"}'
   ```

2. Crawl Articles

   ```bash
   cd news_crawler
   python crawler_service.py 10
   ```

3. View Articles

   ```bash
   curl http://localhost:5001/api/news
   ```

4. Get Full Content

   ```bash
   # Copy the article link from above and URL-encode it
   curl "http://localhost:5001/api/news/https%3A%2F%2Fexample.com%2Farticle"
   ```

Now you have complete article content ready for AI processing! 🎉