# Recent Changes - Full Content Storage
## ✅ What Changed
### 1. Removed Content Length Limit
**Before:**
```python
'content': content_text[:10000] # Limited to 10k chars
```
**After:**
```python
'content': content_text # Full content, no limit
```
### 2. Simplified Database Schema
**Before:**
```javascript
{
summary: String, // Short summary
full_content: String // Limited content
}
```
**After:**
```javascript
{
content: String // Full article content, no limit
}
```
### 3. Enhanced API Response
**Before:**
```javascript
{
title: "...",
link: "...",
summary: "..."
}
```
**After:**
```javascript
{
title: "...",
author: "...", // NEW!
link: "...",
preview: "...", // First 200 chars
word_count: 1250, // NEW!
has_full_content: true // NEW!
}
```
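The list-endpoint fields above can be derived directly from the stored full content. A minimal Python sketch (the function name `summarize_article` is illustrative, not the actual crawler code):

```python
def summarize_article(doc: dict) -> dict:
    """Build a list-view entry from a stored article document."""
    content = doc.get('content') or ''
    return {
        'title': doc.get('title'),
        'author': doc.get('author'),
        'link': doc.get('link'),
        'preview': content[:200],            # first 200 chars
        'word_count': len(content.split()),  # whitespace-delimited words
        'has_full_content': bool(content),
    }
```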
## 📊 Database Structure
### Articles Collection
```javascript
{
_id: ObjectId,
title: String, // Article title
author: String, // Article author (extracted)
link: String, // Article URL (unique)
content: String, // FULL article content (no limit)
word_count: Number, // Word count
source: String, // RSS feed name
published_at: String, // Publication date
crawled_at: DateTime, // When crawled
created_at: DateTime // When added
}
```
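Because `link` is unique, re-crawling the same URL updates the existing document rather than inserting a duplicate. A tiny in-memory sketch of that upsert-by-link behaviour (illustrative only; the real service stores documents in MongoDB):

```python
class ArticleStore:
    """Toy store modelling the unique-link upsert semantics."""

    def __init__(self):
        self._by_link = {}

    def upsert(self, doc: dict) -> None:
        # Same link -> replace the stored document, never duplicate it.
        self._by_link[doc['link']] = doc

    def count(self) -> int:
        return len(self._by_link)
```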
## 🆕 New API Endpoint
### GET /api/news/<article_url>
Get full article content by URL.
**Example:**
```bash
# URL encode the article URL
curl "http://localhost:5001/api/news/https%3A%2F%2Fexample.com%2Farticle"
```
**Response:**
```json
{
"title": "New U-Bahn Line Opens in Munich",
"author": "Max Mustermann",
"link": "https://example.com/article",
"content": "The full article text here... (complete, no truncation)",
"word_count": 1250,
"source": "Süddeutsche Zeitung München",
"published_at": "2024-11-10T10:00:00Z",
"crawled_at": "2024-11-10T16:30:00Z",
"created_at": "2024-11-10T16:00:00Z"
}
```
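Building the encoded request URL from Python is straightforward with `urllib.parse.quote`; passing `safe=''` ensures the `:` and `/` in the article URL are percent-encoded so the whole URL travels as a single path segment. (The helper name `article_endpoint` is hypothetical.)

```python
from urllib.parse import quote

def article_endpoint(base: str, article_url: str) -> str:
    # safe='' encodes '/' and ':' as well, which the default safe='/'
    # would leave intact and thereby break the route.
    return f"{base}/api/news/{quote(article_url, safe='')}"
```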
## 📈 Enhanced Stats
### GET /api/stats
Now includes crawled article count:
```json
{
"subscribers": 150,
"articles": 500,
"crawled_articles": 350 // NEW!
}
```
## 🎯 Benefits
1. **Complete Content** - No truncation, full articles stored
2. **Better for AI** - Full context for summarization/analysis
3. **Cleaner Schema** - Single `content` field instead of `summary` + `full_content`
4. **More Metadata** - Author, word count, crawl timestamp
5. **Better API** - Preview in list, full content on demand
## 🔄 Migration
Existing articles that still have a `full_content` field will continue to work; new articles use the single `content` field.
To migrate old articles:
```javascript
// MongoDB shell
db.articles.updateMany(
{ full_content: { $exists: true } },
[
{
$set: {
content: "$full_content"
}
},
{
$unset: ["full_content", "summary"]
}
]
)
```
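The per-document transformation the pipeline above performs server-side can be dry-run in plain Python before touching the database, for example to preview what a migrated document will look like (`migrate_doc` is an illustrative helper, not part of the crawler):

```python
def migrate_doc(doc: dict) -> dict:
    """Rename full_content -> content and drop summary, mirroring the
    MongoDB pipeline. Documents without full_content pass through."""
    if 'full_content' in doc:
        doc = dict(doc)  # don't mutate the caller's copy
        doc['content'] = doc.pop('full_content')
        doc.pop('summary', None)
    return doc
```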
## 🚀 Usage
### Crawl Articles
```bash
cd news_crawler
python crawler_service.py 10
```
### Get Article List (with previews)
```bash
curl http://localhost:5001/api/news
```
### Get Full Article Content
```bash
# Get the article URL from the list, then:
curl "http://localhost:5001/api/news/<encoded_url>"
```
### Check Stats
```bash
curl http://localhost:5001/api/stats
```
## 📝 Example Workflow
1. **Add RSS Feed**
```bash
curl -X POST http://localhost:5001/api/rss-feeds \
-H "Content-Type: application/json" \
-d '{"name": "News Source", "url": "https://example.com/rss"}'
```
2. **Crawl Articles**
```bash
cd news_crawler
python crawler_service.py 10
```
3. **View Articles**
```bash
curl http://localhost:5001/api/news
```
4. **Get Full Content**
```bash
# Copy the article link from the list above and URL-encode it
curl "http://localhost:5001/api/news/https%3A%2F%2Fexample.com%2Farticle"
```
Now you have complete article content ready for AI processing! 🎉