# Content Extraction Strategies

The crawler uses multiple strategies to dynamically extract article metadata from any website.

## 🎯 What Gets Extracted

1. **Title** - Article headline
2. **Author** - Article writer/journalist
3. **Published Date** - When article was published
4. **Content** - Main article text
5. **Description** - Meta description/summary

## 📋 Extraction Strategies

### 1. Title Extraction

Tries multiple methods in order of reliability:

#### Strategy 1: H1 Tag

```html

<h1>Article Title Here</h1>
```
✅ Most reliable - usually the main headline

#### Strategy 2: Open Graph Meta Tag

```html
<meta property="og:title" content="..." />
```
✅ Used by Facebook, very reliable

#### Strategy 3: Twitter Card Meta Tag

```html
<meta name="twitter:title" content="..." />
```
✅ Used by Twitter, reliable

#### Strategy 4: Title Tag (Fallback)

```html
<title>Article Title | Site Name</title>
```
⚠️ Often includes site name, needs cleaning

**Cleaning:**
- Removes " | Site Name"
- Removes " - Site Name"

---

### 2. Author Extraction

Tries multiple methods:

#### Strategy 1: Meta Author Tag

```html
<meta name="author" content="..." />
```
✅ Standard HTML meta tag

#### Strategy 2: Rel="author" Link

```html
<a rel="author" href="...">...</a>
```
✅ Semantic HTML

#### Strategy 3: Common Class Names

```html
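<span class="author-name">John Doe</span>
<div class="byline">By John Doe</div>
<p class="author">John Doe</p>
```
✅ Searches for: author-name, author, byline, writer

#### Strategy 4: Schema.org Markup

```html
<span itemprop="author">...</span>
```
✅ Structured data

#### Strategy 5: JSON-LD Structured Data

```html
<script type="application/ld+json">
{"@type": "Article", "author": {"@type": "Person", "name": "..."}}
</script>
```
✅ Most structured, very reliable

**Cleaning:**
- Removes "By " prefix
- Validates length (< 100 chars)

---

### 3. Date Extraction

Tries multiple methods:

#### Strategy 1: Time Tag with Datetime

```html
<time datetime="...">...</time>
```
✅ Most reliable - ISO format

#### Strategy 2: Article Published Time Meta

```html
<meta property="article:published_time" content="..." />
```
✅ Open Graph standard

#### Strategy 3: OG Published Time

```html
<meta property="og:published_time" content="..." />
```
✅ Facebook standard

#### Strategy 4: Common Class Names

```html
<span class="publish-date">November 10, 2024</span>
<div class="timestamp">10:00 AM, Nov 10</div>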
```
✅ Searches for: publish-date, published, date, timestamp

#### Strategy 5: Schema.org Markup

```html
<meta itemprop="datePublished" content="..." />
```
✅ Structured data

#### Strategy 6: JSON-LD Structured Data

```html
<script type="application/ld+json">
{"@type": "Article", "datePublished": "..."}
</script>
```
✅ Most structured

---

### 4. Content Extraction

Tries multiple methods:

#### Strategy 1: Semantic HTML Tags

```html
<article>
  Article content here...
</article>
```
✅ Best practice HTML5

#### Strategy 2: Common Class Names

```html
...
...
...
...
...
```
✅ Searches for common patterns

#### Strategy 3: Schema.org Markup

```html
<div itemprop="articleBody">
  Content here...
</div>
```
✅ Structured data

#### Strategy 4: Main Tag

```html
<main>
  Content here...
</main>
```
✅ Semantic HTML5

#### Strategy 5: Body Tag (Fallback)

```html
<body>
  Content here...
</body>
```
⚠️ Last resort, may include navigation

**Content Filtering:**
- Removes `
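
The ordered fallback and the cleaning rules for titles and authors described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the crawler's actual code: the names `first_match`, `clean_title`, and `clean_author` are hypothetical, the site name is assumed to be whatever follows the last ` | ` or ` - ` separator, and the "By " prefix is matched case-insensitively.

```python
import re
from typing import Callable, Iterable, Optional


def first_match(strategies: Iterable[Callable[[], Optional[str]]]) -> Optional[str]:
    """Run strategies in order and return the first non-empty result."""
    for strategy in strategies:
        value = strategy()
        if value and value.strip():
            return value.strip()
    return None


def clean_title(raw: str) -> str:
    """Drop a trailing ' | Site Name' or ' - Site Name' suffix.

    Assumption: the site name is the text after the LAST separator.
    """
    title = raw.strip()
    for sep in (" | ", " - "):
        head, found, _tail = title.rpartition(sep)
        if found and head:
            return head.strip()
    return title


def clean_author(raw: str, max_len: int = 100) -> Optional[str]:
    """Strip a leading 'By ' and reject empty or overly long values."""
    name = re.sub(r"^\s*by\s+", "", raw, flags=re.IGNORECASE).strip()
    if not name or len(name) >= max_len:
        return None
    return name


# Hypothetical page where the <h1> and og:title lookups found nothing,
# so the <title> fallback (with cleaning) supplies the result.
title = first_match([
    lambda: None,                                      # no <h1>
    lambda: "",                                        # no og:title
    lambda: clean_title("Article Title | Site Name"),  # <title> fallback
])
print(title)                        # Article Title
print(clean_author("By John Doe"))  # John Doe
```

Each strategy is wrapped in a zero-argument callable so that later lookups only run when the earlier, more reliable ones return nothing.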