# Content Extraction Strategies

The crawler uses multiple strategies to dynamically extract article metadata from any website.

## What Gets Extracted

1. **Title** - Article headline
2. **Author** - Article writer/journalist
3. **Published Date** - When the article was published
4. **Content** - Main article text
5. **Description** - Meta description/summary

## Extraction Strategies

### 1. Title Extraction

Tries multiple methods in order of reliability:

#### Strategy 1: H1 Tag

---

### 2. Author Extraction

#### Strategy 3: Common Class Names

```html
John Doe
```

Searches for: author-name, author, byline, writer

#### Strategy 4: Schema.org Markup

```html
John Doe
```

Structured data

#### Strategy 5: JSON-LD Structured Data

```html
```

Most structured, very reliable

**Cleaning:**

- Removes "By " prefix
- Validates length (< 100 chars)

---

### 3. Date Extraction

Tries multiple methods:

#### Strategy 1: Time Tag with Datetime

```html
```

Most reliable - ISO format

#### Strategy 2: Article Published Time Meta

```html
```

Open Graph standard

#### Strategy 3: OG Published Time

```html
```

Facebook standard

#### Strategy 4: Common Class Names

```html
November 10, 2024
```

Searches for: publish-date, published, date, timestamp

#### Strategy 5: Schema.org Markup

```html
```

Structured data

#### Strategy 6: JSON-LD Structured Data

```html
```

Most structured

---

### 4. Content Extraction

Tries multiple methods:

#### Strategy 1: Semantic HTML Tags

```html
Article content here...
Content here...
Content here...
Content here...
```

Last resort, may include navigation

**Content Filtering:**

- Removes `