# Recent Changes - Full Content Storage
## ✅ What Changed
### 1. Removed Content Length Limit
**Before:**
```python
'content': content_text[:10000] # Limited to 10k chars
```
**After:**
```python
'content': content_text # Full content, no limit
```
### 2. Simplified Database Schema
**Before:**
```javascript
{
summary: String, // Short summary
full_content: String // Limited content
}
```
**After:**
```javascript
{
content: String // Full article content, no limit
}
```
### 3. Enhanced API Response
**Before:**
```javascript
{
title: "...",
link: "...",
summary: "..."
}
```
**After:**
```javascript
{
title: "...",
author: "...", // NEW!
link: "...",
preview: "...", // First 200 chars
word_count: 1250, // NEW!
has_full_content: true // NEW!
}
```
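The list-endpoint fields above can be derived directly from the stored full content. A minimal Python sketch (the function name `summarize_article` is illustrative, not the actual crawler code):

```python
def summarize_article(doc: dict) -> dict:
    """Build a list-view entry from a stored article document."""
    content = doc.get('content') or ''
    return {
        'title': doc.get('title'),
        'author': doc.get('author'),
        'link': doc.get('link'),
        'preview': content[:200],            # first 200 chars
        'word_count': len(content.split()),  # whitespace-delimited words
        'has_full_content': bool(content),
    }
```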
## 📊 Database Structure
### Articles Collection
```javascript
{
_id: ObjectId,
title: String, // Article title
author: String, // Article author (extracted)
link: String, // Article URL (unique)
content: String, // FULL article content (no limit)
word_count: Number, // Word count
source: String, // RSS feed name
published_at: String, // Publication date
crawled_at: DateTime, // When crawled
created_at: DateTime // When added
}
```
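Because `link` is unique, re-crawling the same URL updates the existing document rather than inserting a duplicate. A tiny in-memory sketch of that upsert-by-link behaviour (illustrative only; the real service stores documents in MongoDB):

```python
class ArticleStore:
    """Toy store modelling the unique-link upsert semantics."""

    def __init__(self):
        self._by_link = {}

    def upsert(self, doc: dict) -> None:
        # Same link -> replace the stored document, never duplicate it.
        self._by_link[doc['link']] = doc

    def count(self) -> int:
        return len(self._by_link)
```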
## 🆕 New API Endpoint
### GET /api/news/<article_url>
Get full article content by URL.
**Example:**
```bash
# URL encode the article URL
curl "http://localhost:5001/api/news/https%3A%2F%2Fexample.com%2Farticle"
```
**Response:**
```json
{
"title": "New U-Bahn Line Opens in Munich",
"author": "Max Mustermann",
"link": "https://example.com/article",
"content": "The full article text here... (complete, no truncation)",
"word_count": 1250,
"source": "Süddeutsche Zeitung München",
"published_at": "2024-11-10T10:00:00Z",
"crawled_at": "2024-11-10T16:30:00Z",
"created_at": "2024-11-10T16:00:00Z"
}
```
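Building the encoded request URL from Python is straightforward with `urllib.parse.quote`; passing `safe=''` ensures the `:` and `/` in the article URL are percent-encoded so the whole URL travels as a single path segment. (The helper name `article_endpoint` is hypothetical.)

```python
from urllib.parse import quote

def article_endpoint(base: str, article_url: str) -> str:
    # safe='' encodes '/' and ':' as well, which the default safe='/'
    # would leave intact and thereby break the route.
    return f"{base}/api/news/{quote(article_url, safe='')}"
```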
## 📈 Enhanced Stats
### GET /api/stats
Now includes crawled article count:
```json
{
"subscribers": 150,
"articles": 500,
"crawled_articles": 350 // NEW!
}
```
## 🎯 Benefits
1. **Complete Content** - No truncation, full articles stored
2. **Better for AI** - Full context for summarization/analysis
3. **Cleaner Schema** - Single `content` field instead of `summary` + `full_content`
4. **More Metadata** - Author, word count, crawl timestamp
5. **Better API** - Preview in list, full content on demand
## 🔄 Migration
Existing articles that still have a `full_content` field will continue to work; new articles use the single `content` field.
To migrate old articles:
```javascript
// MongoDB shell
db.articles.updateMany(
{ full_content: { $exists: true } },
[
{
$set: {
content: "$full_content"
}
},
{
$unset: ["full_content", "summary"]
}
]
)
```
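The per-document transformation the pipeline above performs server-side can be dry-run in plain Python before touching the database, for example to preview what a migrated document will look like (`migrate_doc` is an illustrative helper, not part of the crawler):

```python
def migrate_doc(doc: dict) -> dict:
    """Rename full_content -> content and drop summary, mirroring the
    MongoDB pipeline. Documents without full_content pass through."""
    if 'full_content' in doc:
        doc = dict(doc)  # don't mutate the caller's copy
        doc['content'] = doc.pop('full_content')
        doc.pop('summary', None)
    return doc
```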
## 🚀 Usage
### Crawl Articles
```bash
cd news_crawler
python crawler_service.py 10
```
### Get Article List (with previews)
```bash
curl http://localhost:5001/api/news
```
### Get Full Article Content
```bash
# Get the article URL from the list, then:
curl "http://localhost:5001/api/news/<encoded_url>"
```
### Check Stats
```bash
curl http://localhost:5001/api/stats
```
## 📝 Example Workflow
1. **Add RSS Feed**
```bash
curl -X POST http://localhost:5001/api/rss-feeds \
-H "Content-Type: application/json" \
-d '{"name": "News Source", "url": "https://example.com/rss"}'
```
2. **Crawl Articles**
```bash
cd news_crawler
python crawler_service.py 10
```
3. **View Articles**
```bash
curl http://localhost:5001/api/news
```
4. **Get Full Content**
```bash
# Copy the article link from the list above and URL-encode it
curl "http://localhost:5001/api/news/https%3A%2F%2Fexample.com%2Farticle"
```
Now you have complete article content ready for AI processing! 🎉