crawl4ai version
0.7.4
Expected Behavior
The crawl4ai server can decode the filter_chain and execute the deep crawl accordingly
Current Behavior
- The docker server returns 500 if a filter_chain is specified in the deep_crawl_strategy
- Using the REST API, the server performs the crawl, but it does not take the filter_chain into account (the crawler crawls pages even though they are matched by the filter)
Is this reproducible?
Yes
Inputs Causing the Bug
filter_chain = [
    URLPatternFilter(
        patterns=["*about-us*"],
        reverse=True,
    ),
]

crawler_config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=2, filter_chain=FilterChain(filter_chain)
    ),
    cache_mode=CacheMode.BYPASS,
)
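For reference, the same configuration can be run locally through AsyncWebCrawler to confirm whether the filter itself behaves (a minimal sketch using the documented arun() entry point; if the filter is honored here but not via the server, the bug lies in the (de)serialization path):

# Local sanity check: same config, no server round-trip
import asyncio

from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy, FilterChain, URLPatternFilter


async def local_check():
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=2,
            filter_chain=FilterChain([
                URLPatternFilter(patterns=["*about-us*"], reverse=True),
            ]),
        ),
        cache_mode=CacheMode.BYPASS,
    )
    async with AsyncWebCrawler() as crawler:
        # With stream=False (the default), a deep crawl returns a list of results
        for result in await crawler.arun("https://httpbin.org/html", config=config):
            print(result.url)  # no URL here should contain "about-us"


asyncio.run(local_check())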
Steps to Reproduce
1. Set up the Docker server
2. Use the Crawl4aiDockerClient and specify a BFSDeepCrawlStrategy with a filter_chain
Code snippets
# CODE FOR DOCKER CLIENT ISSUE
import asyncio

from crawl4ai import (  # Assuming you have crawl4ai installed
    BrowserConfig,
    CacheMode,
    CrawlerRunConfig,
)
from crawl4ai.deep_crawling import (
    BFSDeepCrawlStrategy,
    FilterChain,
    URLPatternFilter,
)
from crawl4ai.docker_client import Crawl4aiDockerClient


async def main():
    # Point to the correct server port
    async with Crawl4aiDockerClient(
        base_url="http://xyz:11235/",
        verbose=True,
    ) as client:
        # If JWT is enabled on the server, authenticate first
        # (see the Server Configuration section):
        await client.authenticate("user@example.com")

        filter_chain = [
            URLPatternFilter(
                patterns=["*about-us*"],
                reverse=True,  # exclude URLs matching the patterns
            ),
        ]
        crawler_config = CrawlerRunConfig(
            deep_crawl_strategy=BFSDeepCrawlStrategy(
                max_depth=2, filter_chain=FilterChain(filter_chain)
            ),
            cache_mode=CacheMode.BYPASS,
        )

        results = await client.crawl(
            ["https://httpbin.org/html"],
            browser_config=BrowserConfig(headless=True),  # Use library classes for config aid
            crawler_config=crawler_config,
        )
        if results:  # client.crawl returns None on failure
            print(type(results))
            print(f"Non-streaming results success: {results.success}")
            if results.success:
                for result in results:  # Iterate through the CrawlResultContainer
                    print(result)
        else:
            print("Non-streaming crawl failed.")


if __name__ == "__main__":
    asyncio.run(main())
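The stray keyword shown in the traceback at the end of this report can be surfaced client-side by inspecting the serialized config (a sketch; it assumes dump(), the serializer documented for building REST payloads, produces the same dict the Docker client sends):

# Inspect the serialized form of the config; if the serializer is the
# source of the problem, an internal "simple_suffixes" key should appear
# under the URLPatternFilter params.
import json

from crawl4ai import CacheMode, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy, FilterChain, URLPatternFilter

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=2,
        filter_chain=FilterChain(
            [URLPatternFilter(patterns=["*about-us*"], reverse=True)]
        ),
    ),
    cache_mode=CacheMode.BYPASS,
)
print(json.dumps(config.dump(), indent=2))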
# CODE FOR REST API ISSUE
import httpx

URL_FILTERS: list = [
    "*privacy*",
    "*terms*",
    "*cookie*",
    "*contact*",
    "*support*",
    "*board*",
    "*blog*",
    "*press*",
    "*career*",
    "*about*",
    "*sustainability*",
    "*governance*",
    "*youtube*",
]

browser_config_payload = {"type": "BrowserConfig", "params": {"headless": True}}

deep_crawl_strategy_payload = {
    "type": "BFSDeepCrawlStrategy",
    "params": {
        "max_depth": 2,
        "max_pages": 10,
        "include_external": True,
        "filter_chain": {
            "type": "FilterChain",
            "params": {
                "filters": [
                    {
                        "type": "URLPatternFilter",
                        "params": {"patterns": URL_FILTERS},
                    }
                ],
            },
        },
    },
}

crawler_config_payload = {
    "type": "CrawlerRunConfig",
    "params": {
        "stream": False,
        "cache_mode": "bypass",  # Use string value of enum
        "deep_crawl_strategy": deep_crawl_strategy_payload,
    },
}

crawl_payload = {
    "urls": ["https://www.drdgold.com/investors/sens-news/2023"],
    "browser_config": browser_config_payload,
    "crawler_config": crawler_config_payload,
}

response = httpx.post(
    "http://xyz:11235/crawl",  # Updated port
    json=crawl_payload,
    timeout=90,
)
print(f"Status Code: {response.status_code}")
OS
Linux
Python version
3.12
Browser
chromium
Browser version
No response
Error logs & Screenshots (if applicable)
Traceback (most recent call last):
  File "/app/api.py", line 433, in handle_crawl_request
    crawler_config = CrawlerRunConfig.load(crawler_config)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/crawl4ai/async_configs.py", line 1553, in load
    config = from_serializable_dict(data)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/crawl4ai/async_configs.py", line 148, in from_serializable_dict
    k: from_serializable_dict(v) for k, v in data["params"].items()
       ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/crawl4ai/async_configs.py", line 148, in from_serializable_dict
    k: from_serializable_dict(v) for k, v in data["params"].items()
       ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/crawl4ai/async_configs.py", line 148, in from_serializable_dict
    k: from_serializable_dict(v) for k, v in data["params"].items()
       ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/crawl4ai/async_configs.py", line 154, in from_serializable_dict
    return [from_serializable_dict(item) for item in data]
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/crawl4ai/async_configs.py", line 150, in from_serializable_dict
    return cls(**constructor_args)
           ^^^^^^^^^^^^^^^^^^^^^^^
TypeError: URLPatternFilter.__init__() got an unexpected keyword argument 'simple_suffixes'
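The failing keyword suggests the client-side serializer emits URLPatternFilter's internal simple_suffixes attribute, which the server-side load() then forwards to __init__(). Until that is fixed, a possible client-side workaround is to strip the key from the serialized config and POST the payload directly (a sketch; it assumes dump() yields the same serializable dict the Docker client would send, and reuses the placeholder host from the snippets above):

# Workaround sketch: drop the key the server cannot deserialize, then POST.
import httpx

from crawl4ai import BrowserConfig, CacheMode, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy, FilterChain, URLPatternFilter


def strip_key(obj, key="simple_suffixes"):
    """Recursively drop a dict key that the server-side load() rejects."""
    if isinstance(obj, dict):
        return {k: strip_key(v, key) for k, v in obj.items() if k != key}
    if isinstance(obj, list):
        return [strip_key(item, key) for item in obj]
    return obj


crawler_config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=2,
        filter_chain=FilterChain(
            [URLPatternFilter(patterns=["*about-us*"], reverse=True)]
        ),
    ),
    cache_mode=CacheMode.BYPASS,
)
payload = {
    "urls": ["https://httpbin.org/html"],
    "browser_config": BrowserConfig(headless=True).dump(),
    "crawler_config": strip_key(crawler_config.dump()),
}
response = httpx.post("http://xyz:11235/crawl", json=payload, timeout=90)
print(response.status_code)

Note that this only avoids the 500; per the REST API symptom above, the reconstructed filter may still be ignored server-side.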