# Content Extraction Strategies

The crawler uses multiple strategies to dynamically extract article metadata from any website.

## 🎯 What Gets Extracted

1. **Title** - Article headline
2. **Author** - Article writer/journalist
3. **Published Date** - When article was published
4. **Content** - Main article text
5. **Description** - Meta description/summary

## 📋 Extraction Strategies

### 1. Title Extraction

Tries multiple methods in order of reliability:

#### Strategy 1: H1 Tag

```html

<h1>Article Title Here</h1>

```

✅ Most reliable - usually the main headline

#### Strategy 2: Open Graph Meta Tag

```html
<meta property="og:title" content="Article Title Here">
```

✅ Used by Facebook, very reliable

#### Strategy 3: Twitter Card Meta Tag

```html
<meta name="twitter:title" content="Article Title Here">
```

✅ Used by Twitter, reliable

#### Strategy 4: Title Tag (Fallback)

```html
<title>Article Title | Site Name</title>
```

⚠️ Often includes site name, needs cleaning

**Cleaning:**
- Removes " | Site Name"
- Removes " - Site Name"

---

### 2. Author Extraction

Tries multiple methods:

#### Strategy 1: Meta Author Tag

```html
<meta name="author" content="John Doe">
```

✅ Standard HTML meta tag

#### Strategy 2: Rel="author" Link

```html
<a rel="author">John Doe</a>
```

✅ Semantic HTML

#### Strategy 3: Common Class Names

```html

<div class="author-name">John Doe</div>
<div class="byline">By John Doe</div>
<div class="author">John Doe</div>

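```

✅ Searches for: author-name, author, byline, writer

#### Strategy 4: Schema.org Markup

```html
<span itemprop="author">John Doe</span>
```

✅ Structured data

#### Strategy 5: JSON-LD Structured Data

```html
<script type="application/ld+json">
{
  "@type": "Article",
  "author": {"@type": "Person", "name": "John Doe"}
}
</script>
```

✅ Most structured, very reliable

**Cleaning:**
- Removes "By " prefix
- Validates length (< 100 chars)

---

### 3. Date Extraction

Tries multiple methods:

#### Strategy 1: Time Tag with Datetime

```html
<time datetime="2024-11-10">November 10, 2024</time>
```

✅ Most reliable - ISO format

#### Strategy 2: Article Published Time Meta

```html
<meta property="article:published_time" content="2024-11-10">
```

✅ Open Graph standard

#### Strategy 3: OG Published Time

```html
<meta property="og:published_time" content="2024-11-10">
```

✅ Facebook standard

#### Strategy 4: Common Class Names

```html
<span class="publish-date">November 10, 2024</span>
<span class="published">2024-11-10</span>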

<div class="timestamp">10:00 AM, Nov 10</div>

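```

✅ Searches for: publish-date, published, date, timestamp

#### Strategy 5: Schema.org Markup

```html
<meta itemprop="datePublished" content="2024-11-10">
```

✅ Structured data

#### Strategy 6: JSON-LD Structured Data

```html
<script type="application/ld+json">
{
  "@type": "Article",
  "datePublished": "2024-11-10"
}
</script>
```

✅ Most structured

---

### 4. Content Extraction

Tries multiple methods:

#### Strategy 1: Semantic HTML Tags

```html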

<article>Article content here...</article>

```

✅ Best practice HTML5

#### Strategy 2: Common Class Names

```html

<div class="article-content">...</div>

```

✅ Searches for common patterns

#### Strategy 3: Schema.org Markup

```html

<div itemprop="articleBody">Content here...</div>

```

✅ Structured data

#### Strategy 4: Main Tag

```html

<main>Content here...</main>

```

✅ Semantic HTML5

#### Strategy 5: Body Tag (Fallback)

```html

<body>Content here...</body>

```

⚠️ Last resort, may include navigation

**Content Filtering:**
- Removes `