# Testing RSS Feed URL Extraction

## Quick Test (Recommended)

Run this from the project root with the backend virtual environment activated:

```bash
# 1. Activate backend virtual environment
cd backend
source venv/bin/activate  # On Windows: venv\Scripts\activate

# 2. Go back to project root
cd ..

# 3. Run the test
python test_feeds_quick.py
```

This will:

- ✓ Check what RSS feeds are in your database
- ✓ Fetch each feed
- ✓ Test URL extraction on the first 3 articles
- ✓ Show what fields are available
- ✓ Verify summary and date extraction

## Expected Output

```
================================================================================
RSS Feed Test - Checking Database Feeds
================================================================================

✓ Found 3 feed(s) in database

================================================================================
Feed: Süddeutsche Zeitung München
URL: https://www.sueddeutsche.de/muenchen/rss
Active: True
================================================================================

Fetching RSS feed...
✓ Found 20 entries

--- Entry 1 ---
Title: New U-Bahn Line Opens in Munich
✓ URL extracted: https://www.sueddeutsche.de/muenchen/article-123
✓ Summary: The new U-Bahn line connecting the city center...
✓ Date: Mon, 10 Nov 2024 10:00:00 +0100

--- Entry 2 ---
Title: Munich Weather Update
✓ URL extracted: https://www.sueddeutsche.de/muenchen/article-124
✓ Summary: Weather forecast for the week...
✓ Date: Mon, 10 Nov 2024 09:30:00 +0100

...
```

## If No Feeds Found

Add a feed first:

```bash
curl -X POST http://localhost:5001/api/rss-feeds \
  -H "Content-Type: application/json" \
  -d '{"name": "Süddeutsche Politik", "url": "https://rss.sueddeutsche.de/rss/Politik"}'
```

## Testing News Crawler

Once feeds are verified, test the crawler:

```bash
# 1. Install crawler dependencies
cd news_crawler
pip install -r requirements.txt

# 2. Run the test
python test_rss_feeds.py
```
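The URL extraction that the tests exercise has to cope with feeds that publish the article link under different fields (`link`, `links`, `guid`, or `id`). Below is a minimal sketch of that kind of fallback logic, operating on a plain dict shaped like a feedparser entry; the function name and exact precedence order are illustrative, not the project's actual utility:

```python
def extract_entry_url(entry):
    """Best-effort article URL from a parsed RSS entry (illustrative sketch).

    Tries `link` first, then the first href in `links`, then `id`/`guid`,
    and only accepts absolute http(s) URLs.
    """
    url = entry.get("link")
    if not url:
        # Some feeds expose a list of link objects instead of a single link.
        for alt in entry.get("links", []):
            if alt.get("href"):
                url = alt["href"]
                break
    if not url:
        # Many feeds put the permalink in the guid/id field.
        url = entry.get("id") or entry.get("guid")
    if url and url.startswith(("http://", "https://")):
        return url
    return None


# Entries shaped like feedparser output (hypothetical data):
print(extract_entry_url({"link": "https://www.sueddeutsche.de/muenchen/article-123"}))
print(extract_entry_url({"guid": "https://example.org/article-124"}))  # falls back to guid
print(extract_entry_url({"id": "article-125"}))  # not an absolute URL -> None
```

A non-URL `guid` (like `article-125` above) is rejected rather than returned, which matches the test's behavior of reporting "Could not extract URL" instead of storing an unusable value.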
```bash
# 3. Or run the actual crawler
python crawler_service.py 5
```

## Troubleshooting

### "No module named 'pymongo'"

- Activate the backend virtual environment first
- Or install dependencies: `pip install -r backend/requirements.txt`

### "No RSS feeds in database"

- Make sure backend is running
- Add feeds via API (see above)
- Or check if MongoDB is running: `docker-compose ps`

### "Could not extract URL"

- The test will show available fields
- Check if the feed uses `guid`, `id`, or `links` instead of `link`
- Our utility should handle most cases automatically

### "No entries found"

- The RSS feed URL might be invalid
- Try opening the URL in a browser
- Check if it returns valid XML

## Manual Database Check

Using mongosh:

```bash
mongosh
use munich_news
db.rss_feeds.find()
db.articles.find().limit(3)
```

## What to Look For

✅ **Good signs:**

- URLs are extracted successfully
- URLs start with `http://` or `https://`
- Summaries are present
- Dates are extracted

⚠️ **Warning signs:**

- "Could not extract URL" messages
- Empty summaries (not critical)
- Missing dates (not critical)

❌ **Problems:**

- No entries found in feed
- All URL extractions fail
- Feed parsing errors
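The good/warning/problem checklist above can be turned into a small automated check instead of eyeballing the output. A minimal sketch, assuming each crawled entry is a dict with `url`, `summary`, and `published` keys — these field names are assumptions for illustration, not necessarily the project's schema:

```python
def classify_entry(entry):
    """Grade one crawled entry against the checklist (illustrative sketch).

    Returns a (status, messages) pair where status is "ok", "warn", or
    "error". The keys checked (url, summary, published) are assumed names.
    """
    problems, warnings = [], []

    url = entry.get("url")
    if not url:
        problems.append("Could not extract URL")
    elif not url.startswith(("http://", "https://")):
        problems.append("URL is not an absolute http(s) URL")

    # Missing summaries and dates are flagged but not fatal.
    if not entry.get("summary"):
        warnings.append("Empty summary (not critical)")
    if not entry.get("published"):
        warnings.append("Missing date (not critical)")

    if problems:
        return "error", problems + warnings
    return ("warn", warnings) if warnings else ("ok", [])
```

Running this over the crawler's output and counting statuses gives a quick pass/fail summary per feed: any `error` entries point at URL-extraction failures, while a pile of `warn` entries usually just means the feed omits summaries or dates.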