Munich-news/news_crawler/CHANGES.md
2025-11-10 19:13:33 +01:00


Recent Changes - Full Content Storage

What Changed

1. Removed Content Length Limit

Before:

'content': content_text[:10000]  # Limited to 10k chars

After:

'content': content_text  # Full content, no limit

2. Simplified Database Schema

Before:

{
  summary: String,      // Short summary
  full_content: String  // Limited content
}

After:

{
  content: String  // Full article content, no limit
}

3. Enhanced API Response

Before:

{
  title: "...",
  link: "...",
  summary: "..."
}

After:

{
  title: "...",
  author: "...",        // NEW!
  link: "...",
  preview: "...",       // First 200 chars
  word_count: 1250,     // NEW!
  has_full_content: true // NEW!
}
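The preview, word_count, and has_full_content fields in the list response can all be derived from the stored content. A minimal sketch (to_list_item is a hypothetical helper for illustration, not the actual service code):

```python
# Hypothetical helper: derive the list-endpoint fields from a stored article.
def to_list_item(article: dict) -> dict:
    content = article.get("content", "") or ""
    return {
        "title": article.get("title"),
        "author": article.get("author"),
        "link": article.get("link"),
        "preview": content[:200],            # first 200 chars only
        "word_count": len(content.split()),  # recomputed here; the DB also stores it
        "has_full_content": bool(content),
    }
```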

📊 Database Structure

Articles Collection

{
  _id: ObjectId,
  title: String,           // Article title
  author: String,          // Article author (extracted)
  link: String,            // Article URL (unique)
  content: String,         // FULL article content (no limit)
  word_count: Number,      // Word count
  source: String,          // RSS feed name
  published_at: String,    // Publication date
  crawled_at: DateTime,    // When crawled
  created_at: DateTime     // When added
}
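Because link is unique, a crawler can write this document idempotently with an upsert. A sketch assuming a pymongo-style collection (upsert_article is a hypothetical helper; field names follow the schema above):

```python
from datetime import datetime, timezone

# Hypothetical helper: insert or refresh one article, keyed on the unique
# link field. `collection` is any pymongo-style collection object.
def upsert_article(collection, article: dict) -> None:
    now = datetime.now(timezone.utc)
    collection.update_one(
        {"link": article["link"]},
        {
            "$set": {**article, "crawled_at": now},  # refreshed on re-crawl
            "$setOnInsert": {"created_at": now},     # set only on first insert
        },
        upsert=True,
    )
```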

🆕 New API Endpoint

GET /api/news/<article_url>

Get full article content by URL.

Example:

# URL encode the article URL
curl "http://localhost:5001/api/news/https%3A%2F%2Fexample.com%2Farticle"

Response:

{
  "title": "New U-Bahn Line Opens in Munich",
  "author": "Max Mustermann",
  "link": "https://example.com/article",
  "content": "The full article text here... (complete, no truncation)",
  "word_count": 1250,
  "source": "Süddeutsche Zeitung München",
  "published_at": "2024-11-10T10:00:00Z",
  "crawled_at": "2024-11-10T16:30:00Z",
  "created_at": "2024-11-10T16:00:00Z"
}
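The URL encoding shown in the curl example can be done with the standard library. A minimal client sketch, assuming the service runs on localhost:5001 as above (both function names are illustrative):

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Illustrative client helpers; base URL matches the curl example above.
def full_article_url(article_link: str, base: str = "http://localhost:5001") -> str:
    # safe="" percent-encodes "/" and ":" too, as the endpoint expects
    return f"{base}/api/news/{quote(article_link, safe='')}"

def fetch_article(article_link: str) -> dict:
    with urlopen(full_article_url(article_link)) as resp:
        return json.load(resp)
```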

📈 Enhanced Stats

GET /api/stats

Now includes crawled article count:

{
  "subscribers": 150,
  "articles": 500,
  "crawled_articles": 350  // NEW!
}

🎯 Benefits

  1. Complete Content - No truncation, full articles stored
  2. Better for AI - Full context for summarization/analysis
  3. Cleaner Schema - Single content field instead of summary + full_content
  4. More Metadata - Author, word count, crawl timestamp
  5. Better API - Preview in list, full content on demand

🔄 Migration

If you have existing articles with a full_content field, they will continue to work. New articles will use the content field instead.

To migrate old articles:

// MongoDB shell
db.articles.updateMany(
  { full_content: { $exists: true } },
  [
    {
      $set: {
        content: "$full_content"
      }
    },
    {
      $unset: ["full_content", "summary"]
    }
  ]
)
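The same migration can be run from Python, assuming a pymongo-style db handle (migrate_full_content is a hypothetical name; pipeline-style update_many requires MongoDB 4.2+):

```python
# Hypothetical migration function; `db` is a pymongo-style database handle.
# The pipeline form of update_many requires MongoDB 4.2+.
def migrate_full_content(db) -> int:
    result = db.articles.update_many(
        {"full_content": {"$exists": True}},
        [
            {"$set": {"content": "$full_content"}},
            {"$unset": ["full_content", "summary"]},
        ],
    )
    return result.modified_count  # number of migrated articles
```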

🚀 Usage

Crawl Articles

cd news_crawler
python crawler_service.py 10

Get Article List (with previews)

curl http://localhost:5001/api/news

Get Full Article Content

# Get the article URL from the list, then:
curl "http://localhost:5001/api/news/<encoded_url>"

Check Stats

curl http://localhost:5001/api/stats

📝 Example Workflow

  1. Add RSS Feed
curl -X POST http://localhost:5001/api/rss-feeds \
  -H "Content-Type: application/json" \
  -d '{"name": "News Source", "url": "https://example.com/rss"}'
  2. Crawl Articles
cd news_crawler
python crawler_service.py 10
  3. View Articles
curl http://localhost:5001/api/news
  4. Get Full Content
# Copy article link from above, URL encode it
curl "http://localhost:5001/api/news/https%3A%2F%2Fexample.com%2Farticle"
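Step 1 can also be issued from Python with only the standard library; a sketch matching the curl payload above (build_add_feed_request is an illustrative helper):

```python
import json
from urllib.request import Request

# Illustrative helper building the step-1 request; URL and payload match
# the curl example above. Send it with urllib.request.urlopen(req).
def build_add_feed_request(name: str, url: str,
                           base: str = "http://localhost:5001") -> Request:
    body = json.dumps({"name": name, "url": url}).encode()
    return Request(f"{base}/api/rss-feeds", data=body,
                   headers={"Content-Type": "application/json"},
                   method="POST")
```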

Now you have complete article content ready for AI processing! 🎉