Building a Flexible Web Crawler Architecture for AI Agents

Note: This architectural pattern was inspired by reading “LLM Engineer’s Handbook” by Paul Iusztin and Maxime Labonne. The book provides excellent insights into building production-ready LLM applications, and I highly recommend it for anyone working in this field.

As AI agents become more sophisticated, they often need to gather information from various online sources. Each source might have its own structure, authentication requirements, and data format. How do we build a flexible architecture that can handle multiple data sources while remaining maintainable and extensible? Let’s explore a practical solution using Python.

The Challenge

Imagine you’re building an AI agent that needs to gather information from different platforms:

  • Professional profiles from LinkedIn
  • Project details from GitHub
  • Articles from various news sites
  • Custom webpage content

Each platform has its unique characteristics:

  • Different HTML structures
  • Various authentication methods
  • Rate limiting considerations
  • Platform-specific APIs

We need an architecture that can:

  1. Handle each source appropriately
  2. Be easily extended for new sources
  3. Maintain clean, readable code
  4. Provide a unified interface for the AI agent

The Solution: Dispatcher Pattern with Specialized Crawlers

Let’s look at an elegant solution using the Dispatcher pattern combined with specialized crawlers. This approach provides a clean way to route URLs to their appropriate handlers while maintaining extensibility.

The Base Architecture

First, we define our base crawler class:

class BaseCrawler:
    def crawl(self, url: str) -> dict:
        """Base method for crawling a specific URL"""
        raise NotImplementedError

Then, we implement specialized crawlers for each platform:

class LinkedInCrawler(BaseCrawler):
    def crawl(self, url: str) -> dict:
        # LinkedIn-specific crawling logic
        # Handle authentication, rate limiting, etc.
        return {"platform": "linkedin", "data": {...}}

class GithubCrawler(BaseCrawler):
    def crawl(self, url: str) -> dict:
        # GitHub-specific crawling logic
        return {"platform": "github", "data": {...}}

class CustomArticleCrawler(BaseCrawler):
    def crawl(self, url: str) -> dict:
        # General article crawling logic
        return {"platform": "article", "data": {...}}

The Dispatcher

The heart of our architecture is the CrawlerDispatcher:

from urllib.parse import urlparse
import re
import logging

logger = logging.getLogger(__name__)

class CrawlerDispatcher:
    def __init__(self) -> None:
        self._crawlers = {}

    @classmethod
    def build(cls) -> "CrawlerDispatcher":
        dispatcher = cls()
        return dispatcher

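    def register(self, domain: str, crawler: type[BaseCrawler]) -> None:
        # Accept a full URL such as "https://github.com" and keep only the host part.
        netloc = urlparse(domain).netloc
        # Optional "www." prefix; the host must be followed by a path, query,
        # fragment, port, or the end of the URL, so look-alike hosts do not match.
        self._crawlers[r"https://(www\.)?{}([/?#:]|$)".format(re.escape(netloc))] = crawler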

    def register_linkedin(self) -> "CrawlerDispatcher":
        self.register("https://linkedin.com", LinkedInCrawler)
        return self

    def register_github(self) -> "CrawlerDispatcher":
        self.register("https://github.com", GithubCrawler)
        return self

    def get_crawler(self, link: str) -> BaseCrawler:
        for domain, crawler in self._crawlers.items():
            if re.match(domain, link):
                return crawler()
        
        logger.warning(f"No crawler found for link: {link}, defaulting to CustomArticleCrawler.")
        return CustomArticleCrawler()

How It Works

  1. Registration: During initialization, we register different crawlers for their respective domains:
    dispatcher = (CrawlerDispatcher.build()
              .register_linkedin()
              .register_github())
    
  2. URL Matching: When a URL needs to be crawled, the dispatcher matches it against registered patterns:
    crawler = dispatcher.get_crawler("https://www.linkedin.com/in/amirlayegh")
    data = crawler.crawl("https://www.linkedin.com/in/amirlayegh")
    
  3. Fallback Handling: If no specific crawler is found, it defaults to a general-purpose CustomArticleCrawler.
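
To make steps 2 and 3 concrete, here is a minimal sketch of the matching in isolation. The helper name build_pattern is hypothetical, and the exact trailing group of the regex is an assumption chosen for illustration; the point is only to show which links a registered domain would claim and which would fall through to the default crawler.

```python
import re
from urllib.parse import urlparse


def build_pattern(domain_url: str) -> str:
    """Hypothetical helper: turn "https://github.com" into a URL-prefix regex."""
    netloc = urlparse(domain_url).netloc
    # Optional "www." prefix; the host must be followed by a path, query,
    # fragment, port, or the end of the string.
    return r"https://(www\.)?{}([/?#:]|$)".format(re.escape(netloc))


pattern = build_pattern("https://github.com")

for link in (
    "https://github.com/python/cpython",
    "https://www.github.com/python/cpython",
    "https://github.com.example.org/python",  # look-alike host
    "https://example.org/post",
):
    matched = re.match(pattern, link) is not None
    print(f"{link} -> {'GithubCrawler' if matched else 'fallback'}")
```

Because re.match anchors only at the start of the string, the pattern has to say explicitly what may follow the host name; otherwise a look-alike host such as the third link would be routed to the GitHub crawler.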

Key Benefits

  1. Extensibility: Adding support for new platforms is as simple as:
    • Creating a new crawler class
    • Adding a registration method
    • Registering it with the dispatcher
  2. Separation of Concerns: Each crawler handles its own platform-specific logic:
    • Authentication
    • Rate limiting
    • HTML parsing
    • API interactions
  3. Maintainability: Platform-specific changes only require updates to the relevant crawler.

  4. Flexibility: The architecture can handle both API-based and HTML-scraping approaches.
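
The three extensibility steps can be sketched as follows. MediumCrawler, ExtendedDispatcher, and register_medium are hypothetical names introduced for illustration, the crawl logic is a placeholder, and the dispatcher is a condensed stand-in so the example runs on its own. Adding the registration method through a subclass is one option among several; adding it directly to the dispatcher class works just as well.

```python
import re
from urllib.parse import urlparse


class BaseCrawler:
    def crawl(self, url: str) -> dict:
        raise NotImplementedError


class CustomArticleCrawler(BaseCrawler):
    def crawl(self, url: str) -> dict:
        return {"platform": "article", "data": {"url": url}}


class CrawlerDispatcher:
    """Condensed stand-in for the dispatcher, kept small for the example."""

    def __init__(self) -> None:
        self._crawlers: dict[str, type[BaseCrawler]] = {}

    def register(self, domain: str, crawler: type[BaseCrawler]) -> None:
        netloc = urlparse(domain).netloc
        pattern = r"https://(www\.)?{}([/?#:]|$)".format(re.escape(netloc))
        self._crawlers[pattern] = crawler

    def get_crawler(self, link: str) -> BaseCrawler:
        for pattern, crawler in self._crawlers.items():
            if re.match(pattern, link):
                return crawler()
        return CustomArticleCrawler()


# Step 1: a new crawler class (hypothetical platform, placeholder logic).
class MediumCrawler(BaseCrawler):
    def crawl(self, url: str) -> dict:
        return {"platform": "medium", "data": {"url": url}}


# Step 2: a registration method, added here by subclassing so the
# existing dispatcher code stays untouched.
class ExtendedDispatcher(CrawlerDispatcher):
    def register_medium(self) -> "ExtendedDispatcher":
        self.register("https://medium.com", MediumCrawler)
        return self


# Step 3: register it and dispatch as usual.
dispatcher = ExtendedDispatcher().register_medium()
url = "https://medium.com/some-post"
result = dispatcher.get_crawler(url).crawl(url)
print(result["platform"])
```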

Usage in AI Agent Context

This architecture is particularly valuable for AI agents because it:

  1. Provides a Unified Data Format: Regardless of the source, every crawler returns a dict with the same top-level keys (platform and data).

  2. Handles Complexity Behind the Scenes: The AI agent doesn’t need to know about platform-specific details.

  3. Enables Easy Extension: As the agent needs new data sources, we can add them without changing the existing code.
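
The unified format in point 1 is only a convention unless something checks it. A minimal sketch of such a check, assuming the two top-level keys used by the example crawlers (platform and data); CrawlResult and validate_result are hypothetical names, and a real project might prefer a schema library instead.

```python
from typing import Any, TypedDict


class CrawlResult(TypedDict):
    platform: str
    data: dict[str, Any]


def validate_result(result: dict) -> CrawlResult:
    """Raise ValueError unless result has the shape the example crawlers return."""
    if not isinstance(result.get("platform"), str):
        raise ValueError("'platform' must be a string")
    if not isinstance(result.get("data"), dict):
        raise ValueError("'data' must be a dict")
    return CrawlResult(platform=result["platform"], data=result["data"])


ok = validate_result({"platform": "github", "data": {"stars": 1}})
print(ok["platform"])
```

Running each crawler's output through a check like this gives the agent one place where a malformed result fails loudly, instead of surfacing later as a missing key.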

# AI Agent usage example
class ResearchAgent:
    def __init__(self) -> None:
        self.crawler_dispatcher = (CrawlerDispatcher.build()
                                 .register_linkedin()
                                 .register_github())

    def research_topic(self, urls: list[str]) -> list[dict]:
        # crawl() is synchronous, so this method is a plain function as well.
        results = []
        for url in urls:
            # The dispatcher picks the crawler; the agent only sees the result dict.
            crawler = self.crawler_dispatcher.get_crawler(url)
            results.append(crawler.crawl(url))
        return results
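
The loop in research_topic crawls one URL at a time. If that becomes a bottleneck, one option is to offload each blocking crawl() call to a worker thread and await them together. The sketch below assumes Python 3.9+ (for asyncio.to_thread); StubCrawler, StubDispatcher, and crawl_all are hypothetical stand-ins so the example runs without network access.

```python
import asyncio


class StubCrawler:
    """Hypothetical stand-in for a crawler with a blocking crawl() method."""

    def crawl(self, url: str) -> dict:
        return {"platform": "article", "data": {"url": url}}


class StubDispatcher:
    """Hypothetical stand-in for CrawlerDispatcher; always returns StubCrawler."""

    def get_crawler(self, link: str) -> StubCrawler:
        return StubCrawler()


async def crawl_all(dispatcher: StubDispatcher, urls: list[str]) -> list[dict]:
    async def crawl_one(url: str) -> dict:
        crawler = dispatcher.get_crawler(url)
        # crawl() is blocking, so run it in a worker thread to keep the
        # event loop free for the other URLs.
        return await asyncio.to_thread(crawler.crawl, url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(crawl_one(url) for url in urls))


results = asyncio.run(
    crawl_all(StubDispatcher(), ["https://example.org/a", "https://example.org/b"])
)
print([r["data"]["url"] for r in results])
```

Threads help here because crawling is mostly waiting on the network. Concurrency also makes it easier to exceed a platform's rate limits, so some per-domain throttling would be needed in practice.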

Future Enhancements

This architecture can be extended further:

  1. Async Support: Add async crawling for better performance
  2. Caching Layer: Implement caching to avoid repeated crawls
  3. Rate Limiting: Add global rate limiting across crawlers
  4. Error Handling: Implement retry mechanisms and circuit breakers
  5. Validation: Add schema validation for crawler outputs
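
As an illustration of items 2 and 4, caching and retries can be layered on by wrapping a crawler rather than editing it. This is a rough sketch under stated assumptions: ResilientCrawler and FlakyCrawler are hypothetical names, the cache is an unbounded in-memory dict, and only ConnectionError is treated as retryable. A circuit breaker is not shown.

```python
import time


class FlakyCrawler:
    """Hypothetical crawler that fails once before succeeding."""

    def __init__(self) -> None:
        self.calls = 0

    def crawl(self, url: str) -> dict:
        self.calls += 1
        if self.calls == 1:
            raise ConnectionError("simulated transient failure")
        return {"platform": "article", "data": {"url": url}}


class ResilientCrawler:
    """Wraps any object with a crawl(url) method, adding retries and a cache."""

    def __init__(self, inner, max_attempts: int = 3, base_delay: float = 0.0) -> None:
        self._inner = inner
        self._max_attempts = max_attempts
        self._base_delay = base_delay
        self._cache: dict[str, dict] = {}

    def crawl(self, url: str) -> dict:
        # Serve repeated requests for the same URL from memory.
        if url in self._cache:
            return self._cache[url]
        for attempt in range(1, self._max_attempts + 1):
            try:
                result = self._inner.crawl(url)
            except ConnectionError:
                if attempt == self._max_attempts:
                    raise
                # Exponential backoff: base_delay, then 2x, 4x, ...
                time.sleep(self._base_delay * 2 ** (attempt - 1))
            else:
                self._cache[url] = result
                return result
        raise RuntimeError("max_attempts must be at least 1")


flaky = FlakyCrawler()
crawler = ResilientCrawler(flaky)
first = crawler.crawl("https://example.org/a")   # one failure, then a retry
second = crawler.crawl("https://example.org/a")  # answered from the cache
print(flaky.calls)
```

Since the wrapper exposes the same crawl(url) method, the dispatcher could return wrapped crawlers without the agent noticing any difference.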

Conclusion

The Dispatcher pattern with specialized crawlers provides a solid foundation for AI agents that need to gather data from various online sources. It balances flexibility and maintainability while keeping complexity manageable.

By using this architecture, you can focus on implementing the specific crawling logic for each platform while maintaining a clean, extensible codebase that your AI agent can easily interact with.

Remember to always check and respect each platform’s terms of service and rate limits when implementing your crawlers. Happy coding!


Tags: Python, Architecture, Web Crawling, AI, Design Patterns