Building a Flexible Web Crawler Architecture for AI Agents
Note: This architectural pattern was inspired by reading “LLM Engineer’s Handbook” by Paul Iusztin. The book provides excellent insights into building production-ready LLM applications, and I highly recommend it for anyone working in this field.
As AI agents become more sophisticated, they often need to gather information from various online sources. Each source might have its own structure, authentication requirements, and data format. How do we build a flexible architecture that can handle multiple data sources while remaining maintainable and extensible? Let’s explore a practical solution using Python.
The Challenge
Imagine you’re building an AI agent that needs to gather information from different platforms:
- Professional profiles from LinkedIn
- Project details from GitHub
- Articles from various news sites
- Custom webpage content
Each platform has its unique characteristics:
- Different HTML structures
- Various authentication methods
- Rate limiting considerations
- Platform-specific APIs
We need an architecture that can:
- Handle each source appropriately
- Be easily extended for new sources
- Maintain clean, readable code
- Provide a unified interface for the AI agent
The Solution: Dispatcher Pattern with Specialized Crawlers
Let’s look at an elegant solution using the Dispatcher pattern combined with specialized crawlers. This approach provides a clean way to route URLs to their appropriate handlers while maintaining extensibility.
The Base Architecture
First, we define our base crawler class:

    class BaseCrawler:
        def crawl(self, url: str) -> dict:
            """Base method for crawling a specific URL"""
            raise NotImplementedError

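The plain NotImplementedError base works, but if you want Python to refuse to instantiate a crawler that forgets to implement crawl, the abc module gives a stricter variant. This is just a minimal sketch of an alternative, not part of the original design:

    from abc import ABC, abstractmethod

    class BaseCrawler(ABC):
        @abstractmethod
        def crawl(self, url: str) -> dict:
            """Base method for crawling a specific URL."""
            ...
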
Then, we implement specialized crawlers for each platform:

    class LinkedInCrawler(BaseCrawler):
        def crawl(self, url: str) -> dict:
            # LinkedIn-specific crawling logic
            # Handle authentication, rate limiting, etc.
            return {"platform": "linkedin", "data": {...}}

    class GithubCrawler(BaseCrawler):
        def crawl(self, url: str) -> dict:
            # GitHub-specific crawling logic
            return {"platform": "github", "data": {...}}

    class CustomArticleCrawler(BaseCrawler):
        def crawl(self, url: str) -> dict:
            # General article crawling logic
            return {"platform": "article", "data": {...}}

The Dispatcher
The heart of our architecture is the CrawlerDispatcher:

    from urllib.parse import urlparse
    import re
    import logging

    logger = logging.getLogger(__name__)

    class CrawlerDispatcher:
        def __init__(self) -> None:
            # Maps a URL regex pattern to the crawler class that handles it.
            self._crawlers = {}

        @classmethod
        def build(cls) -> "CrawlerDispatcher":
            dispatcher = cls()
            return dispatcher

        def register(self, domain: str, crawler: type[BaseCrawler]) -> None:
            # Normalize the domain, then store a pattern that matches it
            # with or without a leading "www.".
            parsed_domain = urlparse(domain)
            domain = parsed_domain.netloc
            self._crawlers[r"https://(www\.)?{}/*".format(re.escape(domain))] = crawler

        def register_linkedin(self) -> "CrawlerDispatcher":
            self.register("https://linkedin.com", LinkedInCrawler)
            return self

        def register_github(self) -> "CrawlerDispatcher":
            self.register("https://github.com", GithubCrawler)
            return self

        def get_crawler(self, link: str) -> BaseCrawler:
            # Return an instance of the first crawler whose pattern matches.
            for domain, crawler in self._crawlers.items():
                if re.match(domain, link):
                    return crawler()

            logger.warning(f"No crawler found for link: {link}, defaulting to CustomArticleCrawler.")
            return CustomArticleCrawler()

How It Works
- Registration: During initialization, we register different crawlers for their respective domains:

      dispatcher = (CrawlerDispatcher.build()
                    .register_linkedin()
                    .register_github())

- URL Matching: When a URL needs to be crawled, the dispatcher matches it against registered patterns:

      crawler = dispatcher.get_crawler("https://www.linkedin.com/in/amirlayegh")
      data = crawler.crawl("https://www.linkedin.com/in/amirlayegh")

- Fallback Handling: If no specific crawler is found, it defaults to a general-purpose CustomArticleCrawler.
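For instance, a URL from a domain that was never registered falls through every pattern and lands on the fallback (example.com here is just an illustrative, unregistered domain):

    crawler = dispatcher.get_crawler("https://example.com/some-post")
    # A warning is logged, and crawler is a CustomArticleCrawler instance.
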
Key Benefits
- Extensibility: Adding support for new platforms is as simple as (see the sketch after this list):
  - Creating a new crawler class
  - Adding a registration method
  - Registering it with the dispatcher
- Separation of Concerns: Each crawler handles its own platform-specific logic:
  - Authentication
  - Rate limiting
  - HTML parsing
  - API interactions
- Maintainability: Platform-specific changes only require updates to the relevant crawler.
- Flexibility: The architecture can handle both API-based and HTML-scraping approaches.
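To make the extensibility point concrete, here is what adding support for a hypothetical new platform, Medium, might look like; MediumCrawler and register_medium are illustrative names, not part of the original code:

    # 1. Create a new crawler class.
    class MediumCrawler(BaseCrawler):
        def crawl(self, url: str) -> dict:
            # Medium-specific crawling logic would go here.
            return {"platform": "medium", "data": {}}

    # 2. Add a registration method to CrawlerDispatcher:
    #    def register_medium(self) -> "CrawlerDispatcher":
    #        self.register("https://medium.com", MediumCrawler)
    #        return self

    # 3. Register it with the dispatcher during setup.
    dispatcher = (CrawlerDispatcher.build()
                  .register_linkedin()
                  .register_github()
                  .register_medium())
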
Usage in AI Agent Context
This architecture is particularly valuable for AI agents because it:
- Provides a Unified Data Format: Despite the different sources, each crawler returns data in a consistent format.
- Handles Complexity Behind the Scenes: The AI agent doesn’t need to know about platform-specific details.
- Enables Easy Extension: As the agent needs new data sources, we can add them without changing existing code.
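One lightweight way to pin down that "consistent format" is a shared result schema; the field names below mirror the dictionaries returned earlier and are an illustrative assumption:

    from typing import TypedDict

    class CrawlResult(TypedDict):
        platform: str  # e.g. "linkedin", "github", or "article"
        data: dict     # platform-specific payload, normalized by the crawler

With such a schema in place, each crawl() method can be annotated to return CrawlResult, so downstream agent code can rely on the same top-level keys regardless of the source.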

    # AI Agent usage example
    from typing import List

    class ResearchAgent:
        def __init__(self):
            self.crawler_dispatcher = (CrawlerDispatcher.build()
                                       .register_linkedin()
                                       .register_github())

        async def research_topic(self, urls: List[str]) -> List[dict]:
            results = []
            for url in urls:
                crawler = self.crawler_dispatcher.get_crawler(url)
                data = crawler.crawl(url)  # blocking call; see the async sketch below
                results.append(data)
            return results

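Driving the agent from a script is then a single asyncio.run call (the GitHub URL below is a hypothetical example):

    import asyncio

    agent = ResearchAgent()
    results = asyncio.run(agent.research_topic([
        "https://www.linkedin.com/in/amirlayegh",
        "https://github.com/some-user/some-repo",  # hypothetical URL
    ]))
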
Future Enhancements
This architecture can be extended further:
- Async Support: Add async crawling for better performance (see the sketch after this list)
- Caching Layer: Implement caching to avoid repeated crawls
- Rate Limiting: Add global rate limiting across crawlers
- Error Handling: Implement retry mechanisms and circuit breakers
- Validation: Add schema validation for crawler outputs
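As a sketch of the first enhancement, the blocking crawl calls can be pushed onto worker threads so multiple URLs are fetched concurrently; this assumes Python 3.9+ for asyncio.to_thread and is only one possible approach:

    import asyncio

    class AsyncResearchAgent(ResearchAgent):
        async def research_topic(self, urls: list[str]) -> list[dict]:
            async def crawl_one(url: str) -> dict:
                crawler = self.crawler_dispatcher.get_crawler(url)
                # Run the blocking crawl in a worker thread so fetches overlap.
                return await asyncio.to_thread(crawler.crawl, url)

            return await asyncio.gather(*(crawl_one(u) for u in urls))
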
Conclusion
The Dispatcher pattern with specialized crawlers provides a robust foundation for AI agents that need to gather data from various online sources. It offers a practical balance between flexibility and maintainability while keeping the complexity manageable.
By using this architecture, you can focus on implementing the specific crawling logic for each platform while maintaining a clean, extensible codebase that your AI agent can easily interact with.
Remember to always check and respect each platform’s terms of service and rate limits when implementing your crawlers. Happy coding!
Tags: Python, Architecture, Web Crawling, AI, Design Patterns