đŸ€Ż I Used to Struggle With Web Scraping... Until I Found This Game-Changer

Turn Any Website into AI-Ready Data with an AI Website Scraper.


Introduction: The Problem with LLM Knowledge

There’s something frustrating about large language models (LLMs). They know a little bit about a lot of things—but when it comes to new, niche, or rapidly changing topics? Useless.

Even LLMs that can search the web don’t do much better. The results are scattered, outdated, or missing key details. They’re like that one friend who half-remembers a story and just fills in the blanks.

So I did what any desperate person would do—I gave the AI the information myself.

That’s the whole point of Retrieval-Augmented Generation (RAG). Instead of hoping an LLM knows something, you feed it the knowledge you want it to have. You make it an expert on your terms.

But here’s the problem: getting data into an LLM isn’t easy. Scraping an entire website to build a knowledge base can take forever. Most web scrapers are slow, messy, or break halfway through.

That’s where an AI website scraper comes in.

I found one that changed everything. It’s called Crawl4AI, and it can scrape an entire website, clean up the data, and prepare it for an LLM—all in seconds.

In this post, I’ll show you how it works, why it’s different, and how you can use it to make any LLM as smart as you want.

I. The Challenge: How Do You Feed an LLM Custom Knowledge?


The problem isn’t that LLMs are stupid. They just don’t know things. At least, not the things you care about.

They’re trained on general knowledge, but if you need them to answer specific questions about your niche, your tools, your workflows—good luck. They won’t have a clue.

You could manually add the information. But let’s be real, no one has time to copy-paste hundreds of pages into a chatbot. And even if you did, the formatting would be a mess.

That’s where an AI website scraper comes in.

You need a way to pull data from an entire website, structure it properly, and feed it into an LLM—fast. But most web scrapers? Slow, unreliable, and just a pain to deal with.

So what do you do? Sit there, waiting for an LLM update that may never come? Hope it magically learns what you need?

Or take control and make your LLM smarter yourself?

II. The Solution: Crawl4AI – A Web Scraping Framework for LLMs


There’s a reason most web scrapers are a nightmare. They break. They miss half the data. They return a mess of HTML that no LLM can understand.

So when I found Crawl4AI, I didn’t expect much. Just another AI website scraper promising to do the impossible. But this one? Actually worked.

It’s fast, simple, and efficient—no endless configurations, no struggling with blocked pages, no wasted hours cleaning up useless data.

Why It’s Different:

  • It doesn’t just scrape websites—it organizes them. Messy HTML? Gone. It converts everything into clean, structured Markdown that an LLM can actually read.

  • It removes all the junk. No more ads, pop-ups, or random scripts clogging up your knowledge base.

  • It handles proxies and sessions automatically. No more getting blocked mid-scrape.

  • It’s open-source and deployable with Docker. Which means you actually own your data, and no one can pull the plug on your work.

It’s not just about scraping a site—it’s about making LLMs actually useful. And for that, Crawl4AI is the AI website scraper I didn’t know I needed.

Learn How to Make AI Work For You!

Transform your AI skills with the AI Fire Academy Premium Plan – FREE for 14 days! Gain instant access to 200+ AI workflows, advanced tutorials, exclusive case studies, and unbeatable discounts. No risks, cancel anytime.

Start Your Free Trial Today >>

III. How to Use Crawl4AI to Scrape Any Website

I don’t like struggling with things that should be simple. And setting up an AI website scraper? Shouldn’t be hard. Shouldn’t require 10 hours of tutorials, debugging broken scripts, or crying over rate limits.

So when I found Crawl4AI, I expected the usual mess. But installing it? Took 30 seconds. Running it? Even faster.

1. Installing Crawl4AI

This part is almost too easy.

  • Run a single PIP install command. 

# Install the package
pip install -U crawl4ai

# Run post-installation setup
crawl4ai-setup

# Verify your installation
crawl4ai-doctor
  • Crawl4AI sets up Playwright in the background during crawl4ai-setup. No extra configuration, no fixing dependencies. If the browser install ever hiccups, you can run it manually:

python -m playwright install --with-deps chromium

It just works. Which is rare.

2. Scraping a Single Webpage

The real test: pulling an actual page. I tried it on the NBC News business page.

Raw HTML? A mess. Completely unreadable. If I fed that into an LLM, it would hallucinate half the response and still not give me what I needed.


Crawl4AI fixed that. It converted everything into Markdown—structured, readable, clean. No more messy tags or useless data. Just exactly what an LLM needs to actually understand the content.

import asyncio
from crawl4ai import *

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())

That’s the difference. Most web scrapers give you data. Crawl4AI gives you data that actually works.

IV. Scaling Up: Scraping an Entire Website Efficiently

Scraping a single page is nice. But if you need an AI website scraper that can handle an entire site? That’s a different story.

Manually copying links? No. Absolutely not. If a site has hundreds of pages, adding them one by one isn’t an option.

1. Extracting All URLs Automatically

Most websites have a sitemap. It’s like a map of every page they want search engines to find.

  • Instead of manually tracking down every URL, you pull the sitemap.

  • It works for blogs, e-commerce stores, documentation sites—anything structured.

  • Example: Pydantic AI’s sitemap.xml. One request, and you get all the links in seconds.

No more guesswork. No more wasted time.
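If you want to see what that looks like in practice before the full script later in this post, here’s a minimal sketch. The get_sitemap_urls name is mine; the same logic appears as get_pydantic_ai_docs_urls in the complete code below.

import requests
from xml.etree import ElementTree

def get_sitemap_urls(sitemap_url: str) -> list:
    """Download a sitemap.xml and return every <loc> URL it lists."""
    response = requests.get(sitemap_url)
    response.raise_for_status()
    root = ElementTree.fromstring(response.content)
    # Sitemaps use this standard namespace for their elements
    namespace = {"ns": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    return [loc.text for loc in root.findall(".//ns:loc", namespace)]

print(len(get_sitemap_urls("https://ai.pydantic.dev/sitemap.xml")), "URLs found")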

2. Crawling Multiple Pages in Sequence

Scraping one page at a time? Feels slow. But it’s better than starting a new browser session for every request.

  • A single session scrapes each page, one by one.

  • Success rates and data sizes are logged as you go.

  • Faster, but not the fastest.

3. Optimizing with Parallel Processing

This is where it gets fun. Instead of scraping one page at a time, what if you scrape ten at once?

  • Parallel crawling speeds things up.

  • Batch processing (10 pages at a time) cuts waiting time drastically.

  • Memory usage? Only 119MB RAM. Even while handling multiple requests.

For small sites, scraping in sequence is fine. But if you’re dealing with thousands of pages—or an entire e-commerce catalog—parallel processing is the only way to keep things moving.

Here’s the code for the sequential version: it pulls the sitemap, then crawls every URL in one reused browser session.

import asyncio
from typing import List
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
import requests
from xml.etree import ElementTree

async def crawl_sequential(urls: List[str]):
    print("\n=== Sequential Crawling with Session Reuse ===")

    browser_config = BrowserConfig(
        headless=True,
        # For better performance in Docker or low-memory environments:
        extra_args=["--disable-gpu", "--disable-dev-shm-usage", "--no-sandbox"],
    )

    crawl_config = CrawlerRunConfig(
        markdown_generator=DefaultMarkdownGenerator()
    )

    # Create the crawler (opens the browser)
    crawler = AsyncWebCrawler(config=browser_config)
    await crawler.start()

    try:
        session_id = "session1"  # Reuse the same session across all URLs
        for url in urls:
            result = await crawler.arun(
                url=url,
                config=crawl_config,
                session_id=session_id
            )
            if result.success:
                print(f"Successfully crawled: {url}")
                # E.g. check markdown length
                print(f"Markdown length: {len(result.markdown_v2.raw_markdown)}")
            else:
                print(f"Failed: {url} - Error: {result.error_message}")
    finally:
        # After all URLs are done, close the crawler (and the browser)
        await crawler.close()

def get_pydantic_ai_docs_urls():
    """
    Fetches all URLs from the Pydantic AI documentation.
    Uses the sitemap (https://ai.pydantic.dev/sitemap.xml) to get these URLs.

    Returns:
        List[str]: List of URLs
    """
    sitemap_url = "https://ai.pydantic.dev/sitemap.xml"
    try:
        response = requests.get(sitemap_url)
        response.raise_for_status()

        # Parse the XML
        root = ElementTree.fromstring(response.content)

        # Extract all URLs from the sitemap
        # The namespace is usually defined in the root element
        namespace = {'ns': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
        urls = [loc.text for loc in root.findall('.//ns:loc', namespace)]

        return urls
    except Exception as e:
        print(f"Error fetching sitemap: {e}")
        return []

async def main():
    urls = get_pydantic_ai_docs_urls()
    if urls:
        print(f"Found {len(urls)} URLs to crawl")
        await crawl_sequential(urls)
    else:
        print("No URLs found to crawl")

if __name__ == "__main__":
    asyncio.run(main())

And just like that, you have an AI website scraper that can handle full websites. Fast, clean, and without losing your sanity.
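The script above is the sequential, session-reuse version. For the parallel approach from section 3, here’s a rough sketch of the idea, assuming arun can be awaited concurrently on one crawler; the crawl_parallel name, batch size, and session-id scheme are illustrative, not the article’s exact code.

import asyncio
from typing import List
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def crawl_parallel(urls: List[str], max_concurrent: int = 10):
    browser_config = BrowserConfig(
        headless=True,
        extra_args=["--disable-gpu", "--disable-dev-shm-usage", "--no-sandbox"],
    )
    crawl_config = CrawlerRunConfig()

    crawler = AsyncWebCrawler(config=browser_config)
    await crawler.start()
    try:
        # Work through the URL list in batches of max_concurrent pages
        for i in range(0, len(urls), max_concurrent):
            batch = urls[i:i + max_concurrent]
            tasks = [
                # A separate session id per task keeps the parallel runs isolated
                crawler.arun(url=url, config=crawl_config, session_id=f"parallel_{i + j}")
                for j, url in enumerate(batch)
            ]
            results = await asyncio.gather(*tasks, return_exceptions=True)
            for url, result in zip(batch, results):
                if isinstance(result, Exception) or not result.success:
                    print(f"Failed: {url}")
                else:
                    print(f"Crawled: {url}")
    finally:
        await crawler.close()

Swap crawl_sequential for crawl_parallel in main() and the same sitemap URLs get processed ten at a time.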

V. Final Step: Feeding the Data into an LLM with RAG

Scraping a website is one thing. Making that data actually useful is another.

An AI website scraper can pull in pages, clean them up, and format them beautifully. But if all that data just sits there? It’s worthless.

That’s where RAG (Retrieval-Augmented Generation) comes in.

Making the AI Actually Know Things

LLMs aren’t great at remembering details. You ask a question, and they either:

  1. Make something up, or

  2. Give you a vague, useless answer.

But when you feed the scraped data into a vector database, the AI stops guessing. It actually retrieves the right information.
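The article doesn’t pin down a specific vector database, so here’s a minimal sketch using ChromaDB, assuming the scraped Markdown has already been split into chunks; the chunk text and collection name are just placeholders.

import chromadb

# Illustrative chunks - in practice these come from the Crawl4AI markdown output
chunks = [
    "Crawl4AI converts raw HTML into clean, structured Markdown for LLMs.",
    "Pydantic AI agents can call tools and return structured outputs.",
]

client = chromadb.Client()
collection = client.create_collection(name="scraped_docs")

# Chroma embeds each chunk with its default embedding model and stores the vectors
collection.add(documents=chunks, ids=[f"chunk-{i}" for i in range(len(chunks))])

# At question time, retrieve the most relevant chunks...
results = collection.query(query_texts=["How does Crawl4AI clean up HTML?"], n_results=1)

# ...and paste them into the LLM prompt as context instead of hoping it already knows
print(results["documents"][0])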

VI. Web Scraping Ethics: What You Should Know

Just because you can scrape a website doesn’t mean you should. And if you’re using an AI website scraper, you need to know where the line is.

Some websites want you to scrape their data. Others don’t. And ignoring that? It’s an easy way to get blocked—or worse.

1. Check robots.txt Before Scraping

Most websites have a robots.txt file—a set of rules for web crawlers. It tells you:

  • What’s allowed (public pages, product listings, open data).

  • What’s off-limits (private user data, login-protected content).

  • If they care about scraping at all.


You can check it by adding /robots.txt to any website URL. If it says “Disallow: /”, stop. If it lists specific blocked pages, avoid them.
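If you’d rather automate that check than eyeball it, Python’s standard library can parse robots.txt for you; the example.com URLs below are placeholders.

from urllib.robotparser import RobotFileParser

# Point the parser at the site's robots.txt (placeholder domain)
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# can_fetch() answers: is this user agent allowed to crawl this page?
if rp.can_fetch("*", "https://example.com/some-page"):
    print("Allowed: go ahead and crawl it")
else:
    print("Disallowed: skip this page")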

2. Ethical Web Scraping Rules

There are a few things that separate responsible scraping from just being a nuisance.

  • Avoid restricted pages. If a site blocks scrapers, don’t force it.

  • Don’t overload the server. Hitting a website with thousands of requests in seconds? That’s how you get IP-banned.

  • Contact the site owner if unsure. Some sites are okay with scraping—as long as you ask.

At the end of the day, an AI website scraper should help, not harm. And respecting ethical guidelines? That’s what keeps web scraping useful, legal, and sustainable.

Conclusion: Why Crawl4AI is a Game-Changer

I don’t like wasting time. And if you’ve ever tried to scrape a website for LLM training, you know how frustrating it can be.

Messy data. Slow speeds. Too many hoops to jump through.

But Crawl4AI changes that.

It’s not just another AI website scraper—it’s fast, simple, and actually usable. It pulls data, cleans it up, and makes it ready for AI without you having to fight with broken scripts or endless formatting issues.

And the best part? It works for any website—whether you’re building an AI agent, pulling research, or scraping an e-commerce catalog.

If you need real data, structured for an AI, this is the tool that actually does the job.

For now? The AI website scraper works. And that’s more than I can say for most things.

If you are interested in other topics and how AI is transforming different aspects of our lives, or even in making money using AI with more detailed, step-by-step guidance, you can find our other articles here:
