# Recent Changes - Full Content Storage

## ✅ What Changed

### 1. Removed Content Length Limit

**Before:**
```python
'content': content_text[:10000]  # Limited to 10k chars
```

**After:**
```python
'content': content_text  # Full content, no limit
```

### 2. Simplified Database Schema

**Before:**
```javascript
{
  summary: String,      // Short summary
  full_content: String  // Limited content
}
```

**After:**
```javascript
{
  content: String  // Full article content, no limit
}
```

### 3. Enhanced API Response

**Before:**
```javascript
{
  title: "...",
  link: "...",
  summary: "..."
}
```

**After:**
```javascript
{
  title: "...",
  author: "...",          // NEW!
  link: "...",
  preview: "...",         // First 200 chars
  word_count: 1250,       // NEW!
  has_full_content: true  // NEW!
}
```

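The new list fields can all be derived from the stored `content`. The following is a minimal sketch of that projection, not the project's actual code. It assumes the preview is simply the first 200 characters and that words are counted with a whitespace split; the helper name `to_list_item` and the constant `PREVIEW_LENGTH` are hypothetical.

```python
PREVIEW_LENGTH = 200  # assumption: matches the "First 200 chars" note above


def to_list_item(article: dict) -> dict:
    """Project a stored article document onto the list-response shape."""
    content = article.get("content") or ""
    return {
        "title": article.get("title", ""),
        "author": article.get("author"),
        "link": article.get("link", ""),
        # Only a short slice goes into the list; the full text stays in the DB.
        "preview": content[:PREVIEW_LENGTH],
        # Prefer the stored count; fall back to a whitespace split.
        "word_count": article.get("word_count", len(content.split())),
        "has_full_content": bool(content),
    }
```

Keeping the full text out of the list response keeps that payload small, while `has_full_content` tells a client whether a follow-up request for the full article is worthwhile.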
## 📊 Database Structure

### Articles Collection
```javascript
{
  _id: ObjectId,
  title: String,         // Article title
  author: String,        // Article author (extracted)
  link: String,          // Article URL (unique)
  content: String,       // FULL article content (no limit)
  word_count: Number,    // Word count
  source: String,        // RSS feed name
  published_at: String,  // Publication date
  crawled_at: DateTime,  // When crawled
  created_at: DateTime   // When added
}
```

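As an illustration of the schema above, the sketch below assembles one article document in Python. It is an assumption-laden example rather than the crawler's real code: the function name `build_article_document` is hypothetical, the word count is a plain whitespace split, and both timestamps are set to the same moment, which may differ from what the crawler does.

```python
from datetime import datetime, timezone


def build_article_document(title, author, link, content_text, source, published_at):
    """Assemble an article document in the shape of the articles collection."""
    now = datetime.now(timezone.utc)
    return {
        "title": title,
        "author": author,
        "link": link,
        "content": content_text,  # stored in full, no slicing
        "word_count": len(content_text.split()),
        "source": source,
        "published_at": published_at,  # kept as a string, as in the schema
        "crawled_at": now,  # assumption: same instant as created_at
        "created_at": now,
    }
```

`_id` is left out on purpose, since MongoDB assigns it on insert.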
## 🆕 New API Endpoint

### GET /api/news/<article_url>
Returns the full content of a single article, looked up by its URL. The article URL must be URL-encoded before it is placed in the path.

**Example:**
```bash
# URL-encode the article URL
curl "http://localhost:5001/api/news/https%3A%2F%2Fexample.com%2Farticle"
```

**Response:**
```json
{
  "title": "New U-Bahn Line Opens in Munich",
  "author": "Max Mustermann",
  "link": "https://example.com/article",
  "content": "The full article text here... (complete, no truncation)",
  "word_count": 1250,
  "source": "Süddeutsche Zeitung München",
  "published_at": "2024-11-10T10:00:00Z",
  "crawled_at": "2024-11-10T16:30:00Z",
  "created_at": "2024-11-10T16:00:00Z"
}
```

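The URL-encoding step is easy to get wrong by hand, so here is a small sketch that builds the request URL with Python's standard library. The helper name `full_article_url` and the `BASE_URL` constant are illustrative; the host and port are taken from the curl example and may differ in your setup.

```python
from urllib.parse import quote

BASE_URL = "http://localhost:5001"  # assumption: local server from the curl example


def full_article_url(article_link: str, base_url: str = BASE_URL) -> str:
    """Build the request URL for GET /api/news/<article_url>."""
    # safe="" makes quote() escape "/" too, which it leaves alone by default.
    return f"{base_url}/api/news/{quote(article_link, safe='')}"


print(full_article_url("https://example.com/article"))
# prints http://localhost:5001/api/news/https%3A%2F%2Fexample.com%2Farticle
```

Passing `safe=""` matters here: with the default, the slashes in the article link would stay unescaped and be read as extra path segments.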
## 📈 Enhanced Stats

### GET /api/stats
The response now includes the number of crawled articles in a new `crawled_articles` field:

```json
{
  "subscribers": 150,
  "articles": 500,
  "crawled_articles": 350
}
```

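The source does not say how `crawled_articles` is computed. One plausible reading, shown here purely as an assumption, is that an article counts as crawled once it carries a `crawled_at` timestamp. The sketch works on plain dictionaries, and `build_stats` is a hypothetical name.

```python
def build_stats(subscriber_count: int, articles: list) -> dict:
    """Build a stats payload in the shape returned by GET /api/stats."""
    # Assumption: "crawled" means the document has a non-empty crawled_at value.
    crawled = sum(1 for a in articles if a.get("crawled_at") is not None)
    return {
        "subscribers": subscriber_count,
        "articles": len(articles),
        "crawled_articles": crawled,
    }
```

Under this reading, the gap between `articles` and `crawled_articles` is the number of feed entries whose full content has not been fetched yet.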
## 🎯 Benefits

1. **Complete Content** - No truncation, full articles stored
2. **Better for AI** - Full context for summarization/analysis
3. **Cleaner Schema** - Single `content` field instead of `summary` + `full_content`
4. **More Metadata** - Author, word count, crawl timestamp
5. **Better API** - Preview in list, full content on demand

## 🔄 Migration

Existing articles that still have a `full_content` field will continue to work. New articles use the `content` field.

To migrate old articles, copy `full_content` into `content` and drop the old `full_content` and `summary` fields. The update uses an aggregation pipeline, which requires MongoDB 4.2 or later:
```javascript
// MongoDB shell
db.articles.updateMany(
  { full_content: { $exists: true } },
  [
    {
      $set: {
        content: "$full_content"
      }
    },
    {
      $unset: ["full_content", "summary"]
    }
  ]
)
```

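Until the migration has run, a collection can hold both document shapes. One way for reading code to tolerate that is a small accessor that prefers the new field and falls back to the legacy one. This is a sketch only; `get_content` is a hypothetical helper, not part of the project.

```python
def get_content(article: dict) -> str:
    """Return the article text from either the new or the legacy field."""
    # Prefer the new field written by current crawls.
    if article.get("content"):
        return article["content"]
    # Fall back to the pre-migration field, or an empty string if neither exists.
    return article.get("full_content") or ""
```

Once every document has been migrated, the fallback branch is dead code and can be removed.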
## 🚀 Usage

### Crawl Articles
```bash
cd news_crawler
python crawler_service.py 10
```

### Get Article List (with previews)
```bash
curl http://localhost:5001/api/news
```

### Get Full Article Content
```bash
# Get the article URL from the list, then:
curl "http://localhost:5001/api/news/<encoded_url>"
```

### Check Stats
```bash
curl http://localhost:5001/api/stats
```

## 📝 Example Workflow

1. **Add RSS Feed**
   ```bash
   curl -X POST http://localhost:5001/api/rss-feeds \
     -H "Content-Type: application/json" \
     -d '{"name": "News Source", "url": "https://example.com/rss"}'
   ```

2. **Crawl Articles**
   ```bash
   cd news_crawler
   python crawler_service.py 10
   ```

3. **View Articles**
   ```bash
   curl http://localhost:5001/api/news
   ```

4. **Get Full Content**
   ```bash
   # Copy article link from above, URL encode it
   curl "http://localhost:5001/api/news/https%3A%2F%2Fexample.com%2Farticle"
   ```

Now you have complete article content ready for AI processing! 🎉