# News Crawler - Quick Start

## 1. Install Dependencies

```bash
cd news_crawler
pip install -r requirements.txt
```

## 2. Configure Environment

Make sure MongoDB is running and accessible. The crawler uses the same database as the backend.

Default connection: `mongodb://localhost:27017/`

To use a different MongoDB URI, create a `.env` file:

```env
MONGODB_URI=mongodb://localhost:27017/
```

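If you want to sanity-check the connection from Python before running the crawler, a snippet along these lines works. It assumes `pymongo` and `python-dotenv` are installed (check `requirements.txt`) and that the database is named `munich_news`, as in the verification step below.

```python
# check_connection.py - hypothetical helper, not part of the crawler itself
import os

from dotenv import load_dotenv
from pymongo import MongoClient

load_dotenv()  # picks up MONGODB_URI from a .env file if present

uri = os.getenv("MONGODB_URI", "mongodb://localhost:27017/")
client = MongoClient(uri, serverSelectionTimeoutMS=5000)

# "ping" forces a round trip so a bad URI fails fast instead of lazily
client.admin.command("ping")
print(f"Connected to {uri}")
print("Collections in munich_news:", client["munich_news"].list_collection_names())
```
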
## 3. Run the Crawler

```bash
# Crawl up to 10 articles per feed (default)
python crawler_service.py

# Crawl up to 20 articles per feed
python crawler_service.py 20
```

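The optional number caps how many articles are fetched per feed. Exactly how `crawler_service.py` reads it is an implementation detail of the script, but the usual pattern is a plain positional argument, roughly like this sketch (not the actual code):

```python
# Sketch of how the per-feed limit argument is typically read;
# the real crawler_service.py may differ.
import sys

DEFAULT_LIMIT = 10

def parse_limit(argv: list[str]) -> int:
    """Return the per-feed article limit from the command line."""
    if len(argv) > 1:
        return int(argv[1])
    return DEFAULT_LIMIT

if __name__ == "__main__":
    limit = parse_limit(sys.argv)
    print(f"Crawling up to {limit} articles per feed")
```
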
## 4. Verify Results

Check your MongoDB database:

```bash
# Using mongosh
mongosh
use munich_news
db.articles.countDocuments({full_content: {$exists: true}})
db.articles.findOne({full_content: {$exists: true}})
```

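The same checks can be run from Python with `pymongo` if `mongosh` isn't installed; this mirrors the shell queries above and assumes the `munich_news` database and `articles` collection shown there.

```python
# Equivalent of the mongosh checks above, using pymongo
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
articles = client["munich_news"]["articles"]

crawled = {"full_content": {"$exists": True}}
print("Articles with full content:", articles.count_documents(crawled))

sample = articles.find_one(crawled)
if sample:
    print(sample["title"], "-", sample.get("word_count"), "words")
```
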
## 5. Schedule Regular Crawling

### Option A: Cron (Linux/Mac)

```bash
# Edit crontab
crontab -e

# Add this line to run every 6 hours
0 */6 * * * cd /path/to/news_crawler && /path/to/venv/bin/python crawler_service.py
```

### Option B: Docker

```bash
# Build and run
docker-compose up

# Or run as a one-off
docker-compose run --rm crawler
```

### Option C: Manual

Just run the script whenever you want to fetch new articles:

```bash
python crawler_service.py
```

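If neither cron nor Docker fits (for example on Windows), a small pure-Python loop can trigger the crawler on a schedule. This is only a sketch: it assumes the third-party `schedule` package (`pip install schedule`) and runs `crawler_service.py` as a subprocess from the `news_crawler` directory.

```python
# Minimal pure-Python scheduler; assumes `pip install schedule`
import subprocess
import sys
import time

import schedule

def run_crawler() -> None:
    # Invoke the crawler exactly as you would from the command line
    subprocess.run([sys.executable, "crawler_service.py"], check=False)

schedule.every(6).hours.do(run_crawler)

run_crawler()  # also crawl once at startup
while True:
    schedule.run_pending()
    time.sleep(60)
```
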
## What Gets Crawled?

The crawler:
1. Fetches all active RSS feeds from the database
2. Gets the latest articles from each feed
3. Crawls the full content from each article URL
4. Saves: title, full_content, word_count, crawled_at
5. Skips articles that already have content

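For orientation, the flow above boils down to roughly the following. This is a simplified sketch, not the actual `crawler_service.py`; the library choices (`feedparser`, `requests`, `BeautifulSoup`, `pymongo`) and the collection/field names are assumptions based on the behaviour described in this guide.

```python
# Simplified sketch of the crawl loop described above (not the real implementation)
from datetime import datetime, timezone

import feedparser
import requests
from bs4 import BeautifulSoup
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017/")["munich_news"]

for feed in db.rss_feeds.find({"active": True}):               # 1. active feeds
    for entry in feedparser.parse(feed["url"]).entries[:10]:   # 2. latest articles
        if db.articles.find_one({"link": entry.link, "full_content": {"$exists": True}}):
            continue                                            # 5. skip already-crawled
        html = requests.get(entry.link, timeout=30).text        # 3. fetch the full page
        text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
        db.articles.update_one(                                 # 4. save the result
            {"link": entry.link},
            {"$set": {
                "title": entry.title,
                "full_content": text,
                "word_count": len(text.split()),
                "crawled_at": datetime.now(timezone.utc),
            }},
            upsert=True,
        )
```
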
## Output Example

```
============================================================
🚀 Starting RSS Feed Crawler
============================================================
Found 3 active feed(s)

📰 Crawling feed: Süddeutsche Zeitung München
URL: https://www.sueddeutsche.de/muenchen/rss
🔍 Crawling: New U-Bahn Line Opens in Munich...
✓ Saved (1250 words)
🔍 Crawling: Munich Weather Update...
✓ Saved (450 words)
✓ Crawled 2 articles from Süddeutsche Zeitung München

============================================================
✓ Crawling Complete!
Total feeds processed: 3
Total articles crawled: 15
Duration: 45.23 seconds
============================================================
```

## Troubleshooting

**No feeds found:**
- Make sure you've added RSS feeds via the backend API
- Check the MongoDB connection

**Can't extract content:**
- Some sites block scrapers; a browser-like `User-Agent` header sometimes helps (see the sketch below)
- Some sites require JavaScript (not supported yet)
- Check that the URL is accessible

**Timeout errors:**
- Increase the request timeout in the code (see the sketch below)
- Check your internet connection

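Both blocked requests and timeouts are usually handled where the article HTML is fetched. Assuming the crawler uses `requests` (an assumption — check the code), the relevant knobs look like this:

```python
# Where the crawler fetches article HTML, a browser-like User-Agent and a
# longer timeout often resolve blocked requests and slow sites.
import requests

headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"}
response = requests.get(
    "https://example.com/some-article",   # placeholder URL
    headers=headers,
    timeout=60,                           # seconds; raise this for slow sites
)
response.raise_for_status()
print(len(response.text), "bytes fetched")
```
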
## Next Steps

Once articles are crawled, you can:
- View them in the frontend
- Use Ollama to summarize them (see the sketch below)
- Generate newsletters with full content
- Perform text analysis
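
As a taste of the Ollama step, summarizing one crawled article can be a single HTTP call to a locally running Ollama instance. This is a sketch: the model name is an assumption, so substitute whatever model you have pulled.

```python
# Summarize one crawled article with a local Ollama instance (http://localhost:11434)
import requests
from pymongo import MongoClient

article = MongoClient("mongodb://localhost:27017/")["munich_news"]["articles"].find_one(
    {"full_content": {"$exists": True}}
)

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",  # assumption: use any model you have pulled locally
        "prompt": "Summarize this article in three sentences:\n\n" + article["full_content"],
        "stream": False,
    },
    timeout=120,
)
print(response.json()["response"])
```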