# Recent Changes - Full Content Storage

## ✅ What Changed

### 1. Removed Content Length Limit

**Before:**

```python
'content': content_text[:10000]  # Limited to 10k chars
```

**After:**

```python
'content': content_text  # Full content, no limit
```

### 2. Simplified Database Schema

**Before:**

```javascript
{
  summary: String,       // Short summary
  full_content: String   // Limited content
}
```

**After:**

```javascript
{
  content: String   // Full article content, no limit
}
```

### 3. Enhanced API Response

**Before:**

```javascript
{
  title: "...",
  link: "...",
  summary: "..."
}
```

**After:**

```javascript
{
  title: "...",
  author: "...",           // NEW!
  link: "...",
  preview: "...",          // First 200 chars
  word_count: 1250,        // NEW!
  has_full_content: true   // NEW!
}
```

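The new list-view fields can all be derived from a stored article document. The following is an illustrative sketch; the function name and the exact derivation are assumptions, not the service's actual code:

```python
def summarize_article(doc):
    """Build the list-view fields from one stored article document.

    Illustrative sketch: field names match the schema above, but this
    helper is not part of the crawler codebase.
    """
    content = doc.get("content") or ""
    return {
        "title": doc.get("title"),
        "author": doc.get("author"),
        "link": doc.get("link"),
        "preview": content[:200],             # First 200 chars
        "word_count": len(content.split()),   # Whitespace-delimited words
        "has_full_content": bool(content),
    }
```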
## 📊 Database Structure

### Articles Collection

```javascript
{
  _id: ObjectId,
  title: String,          // Article title
  author: String,         // Article author (extracted)
  link: String,           // Article URL (unique)
  content: String,        // FULL article content (no limit)
  word_count: Number,     // Word count
  source: String,         // RSS feed name
  published_at: String,   // Publication date
  crawled_at: DateTime,   // When crawled
  created_at: DateTime    // When added
}
```

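With pymongo, writing documents of this shape while respecting the unique `link` could look like the sketch below. The helper only builds the `update_one` arguments; it is illustrative, not the crawler's actual persistence code:

```python
from datetime import datetime, timezone

def make_upsert(article: dict):
    """Build the filter and update documents for upserting one
    crawled article, keyed on its unique URL.

    Illustrative sketch of the write shape; not the crawler's code.
    """
    now = datetime.now(timezone.utc)
    filter_doc = {"link": article["link"]}
    update_doc = {
        "$set": {**article, "crawled_at": now},     # refreshed on re-crawl
        "$setOnInsert": {"created_at": now},        # set only on first insert
    }
    return filter_doc, update_doc
```

After creating a unique index on `link`, this would be applied as `articles.update_one(*make_upsert(article), upsert=True)`.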
## 🆕 New API Endpoint

### GET /api/news/<article_url>

Get full article content by URL.

**Example:**

```bash
# URL encode the article URL
curl "http://localhost:5001/api/news/https%3A%2F%2Fexample.com%2Farticle"
```

**Response:**

```json
{
  "title": "New U-Bahn Line Opens in Munich",
  "author": "Max Mustermann",
  "link": "https://example.com/article",
  "content": "The full article text here... (complete, no truncation)",
  "word_count": 1250,
  "source": "Süddeutsche Zeitung München",
  "published_at": "2024-11-10T10:00:00Z",
  "crawled_at": "2024-11-10T16:30:00Z",
  "created_at": "2024-11-10T16:00:00Z"
}
```

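On the client side, the percent-encoding these examples rely on can be produced with `urllib.parse.quote`. The helper below is a hypothetical convenience, assuming the default `localhost:5001` address:

```python
from urllib.parse import quote

API_BASE = "http://localhost:5001"  # assumption: default dev address

def article_request_url(article_link: str) -> str:
    """Build the GET /api/news/<article_url> request URL.

    quote(..., safe="") percent-encodes every reserved character,
    including '/' and ':', matching the curl example above.
    """
    return f"{API_BASE}/api/news/{quote(article_link, safe='')}"
```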
## 📈 Enhanced Stats

### GET /api/stats

Now includes crawled article count:

```json
{
  "subscribers": 150,
  "articles": 500,
  "crawled_articles": 350   // NEW!
}
```

## 🎯 Benefits

1. **Complete Content** - No truncation, full articles stored
2. **Better for AI** - Full context for summarization/analysis
3. **Cleaner Schema** - Single `content` field instead of `summary` + `full_content`
4. **More Metadata** - Author, word count, crawl timestamp
5. **Better API** - Preview in the list, full content on demand

## 🔄 Migration

Existing articles with a `full_content` field will continue to work; new articles use the `content` field.

To migrate old articles:

```javascript
// MongoDB shell
db.articles.updateMany(
  { full_content: { $exists: true } },
  [
    { $set: { content: "$full_content" } },
    { $unset: ["full_content", "summary"] }
  ]
)
```

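The same migration can be issued from Python via pymongo. This sketch only builds the `update_many` arguments so they can be inspected before running; the helper is illustrative, not part of the codebase:

```python
def migration_args():
    """Filter plus aggregation-pipeline update mirroring the
    MongoDB shell command above."""
    filter_doc = {"full_content": {"$exists": True}}
    pipeline = [
        {"$set": {"content": "$full_content"}},   # copy old field
        {"$unset": ["full_content", "summary"]},  # then drop both
    ]
    return filter_doc, pipeline
```

With a live connection this would run as `db.articles.update_many(*migration_args())`.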
## 🚀 Usage

### Crawl Articles

```bash
cd news_crawler
python crawler_service.py 10
```

### Get Article List (with previews)

```bash
curl http://localhost:5001/api/news
```

### Get Full Article Content

```bash
# Get the article URL from the list, then:
curl "http://localhost:5001/api/news/<encoded_url>"
```

### Check Stats

```bash
curl http://localhost:5001/api/stats
```

## 📝 Example Workflow

1. **Add RSS Feed**

   ```bash
   curl -X POST http://localhost:5001/api/rss-feeds \
     -H "Content-Type: application/json" \
     -d '{"name": "News Source", "url": "https://example.com/rss"}'
   ```

2. **Crawl Articles**

   ```bash
   cd news_crawler
   python crawler_service.py 10
   ```

3. **View Articles**

   ```bash
   curl http://localhost:5001/api/news
   ```

4. **Get Full Content**

   ```bash
   # Copy an article link from above, URL encode it
   curl "http://localhost:5001/api/news/https%3A%2F%2Fexample.com%2Farticle"
   ```

Now you have complete article content ready for AI processing! 🎉