# Testing RSS Feed URL Extraction

## Quick Test (Recommended)

Run this from the project root with the backend virtual environment activated:

```bash
# 1. Activate the backend virtual environment
cd backend
source venv/bin/activate  # On Windows: venv\Scripts\activate

# 2. Go back to the project root
cd ..

# 3. Run the test
python test_feeds_quick.py
```

This will:
- ✓ Check what RSS feeds are in your database
- ✓ Fetch each feed
- ✓ Test URL extraction on the first 3 articles
- ✓ Show what fields are available
- ✓ Verify summary and date extraction

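The URL-extraction step above can be sketched as follows. This is a minimal, hypothetical version of the lookup (the project's real utility may differ); feedparser entries behave like dictionaries, so a plain dict stands in for one here:

```python
def extract_url(entry):
    """Return the first http(s) URL found in an RSS entry dict, or None."""
    # Most feeds populate `link` directly; some use `id` or `guid` instead.
    for key in ("link", "id", "guid"):
        value = entry.get(key)
        if isinstance(value, str) and value.startswith(("http://", "https://")):
            return value
    # Fall back to the `links` list that some feeds provide.
    for link in entry.get("links", []):
        href = link.get("href", "")
        if href.startswith(("http://", "https://")):
            return href
    return None

# Example entry shaped like feedparser output:
entry = {
    "title": "New U-Bahn Line Opens in Munich",
    "links": [{"rel": "alternate", "href": "https://www.sueddeutsche.de/muenchen/article-123"}],
}
print(extract_url(entry))  # https://www.sueddeutsche.de/muenchen/article-123
```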
## Expected Output

```
================================================================================
RSS Feed Test - Checking Database Feeds
================================================================================

✓ Found 3 feed(s) in database

================================================================================
Feed: Süddeutsche Zeitung München
URL: https://www.sueddeutsche.de/muenchen/rss
Active: True
================================================================================
Fetching RSS feed...
✓ Found 20 entries

--- Entry 1 ---
Title: New U-Bahn Line Opens in Munich
✓ URL extracted: https://www.sueddeutsche.de/muenchen/article-123
✓ Summary: The new U-Bahn line connecting the city center...
✓ Date: Mon, 10 Nov 2024 10:00:00 +0100

--- Entry 2 ---
Title: Munich Weather Update
✓ URL extracted: https://www.sueddeutsche.de/muenchen/article-124
✓ Summary: Weather forecast for the week...
✓ Date: Mon, 10 Nov 2024 09:30:00 +0100

...
```

## If No Feeds Found

Add a feed first:

```bash
curl -X POST http://localhost:5001/api/rss-feeds \
  -H "Content-Type: application/json" \
  -d '{"name": "Süddeutsche Politik", "url": "https://rss.sueddeutsche.de/rss/Politik"}'
```

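If you prefer to add the feed from Python, a stdlib-only equivalent of the curl call above might look like this (assuming the backend listens on `localhost:5001` as in the example):

```python
import json
import urllib.request

# Same payload as the curl example above.
payload = json.dumps({
    "name": "Süddeutsche Politik",
    "url": "https://rss.sueddeutsche.de/rss/Politik",
}).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:5001/api/rss-feeds",
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST",
)
try:
    with urllib.request.urlopen(req, timeout=5) as resp:
        print(resp.status, resp.read().decode())
except Exception as exc:  # backend not running, connection refused, etc.
    print("Request failed:", exc)
```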
## Testing News Crawler

Once feeds are verified, test the crawler:

```bash
# 1. Install crawler dependencies
cd news_crawler
pip install -r requirements.txt

# 2. Run the test
python test_rss_feeds.py

# 3. Or run the actual crawler
python crawler_service.py 5
```

## Troubleshooting

### "No module named 'pymongo'"
- Activate the backend virtual environment first
- Or install the dependencies: `pip install -r backend/requirements.txt`

### "No RSS feeds in database"
- Make sure the backend is running
- Add feeds via the API (see above)
- Or check whether MongoDB is running: `docker-compose ps`

### "Could not extract URL"
- The test will show the available fields
- Check whether the feed uses `guid`, `id`, or `links` instead of `link`
- Our utility should handle most cases automatically

### "No entries found"
- The RSS feed URL might be invalid
- Try opening the URL in a browser
- Check whether it returns valid XML

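For the last case, a quick stdlib-only sanity check can tell you whether a feed URL returns parseable XML at all. This sketch counts RSS `<item>` elements (note that Atom feeds use `<entry>` instead); the URL is the one from the example output above:

```python
import urllib.request
import xml.etree.ElementTree as ET

def count_items(xml_bytes):
    """Count <item> elements in an RSS document (0 means no entries)."""
    root = ET.fromstring(xml_bytes)  # raises ParseError on invalid XML
    return sum(1 for _ in root.iter("item"))

def check_feed(url, timeout=10):
    """Fetch a feed URL; return (True, item_count) or (False, error_text)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return True, count_items(resp.read())
    except Exception as exc:  # network failure or invalid XML
        return False, str(exc)

ok, detail = check_feed("https://www.sueddeutsche.de/muenchen/rss")
print("entries found:" if ok else "feed check failed:", detail)
```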
## Manual Database Check

Using mongosh:

```bash
mongosh
use munich_news
db.rss_feeds.find()
db.articles.find().limit(3)
```

## What to Look For

✅ **Good signs:**
- URLs are extracted successfully
- URLs start with `http://` or `https://`
- Summaries are present
- Dates are extracted

⚠️ **Warning signs:**
- "Could not extract URL" messages
- Empty summaries (not critical)
- Missing dates (not critical)

❌ **Problems:**
- No entries found in feed
- All URL extractions fail
- Feed parsing errors