Purge phase fails if primary phase ends with empty queue #381

@seanstory

Bug Description

Originally reported here: #172 (comment)
Easily reproduced with:

output_sink: elasticsearch
output_index: web-crawl-test

elasticsearch:
  host: http://host.docker.internal
  port: 9200
  api_key: <yourkeyhere>
  pipeline_enabled: false

domains:
  - url: https://traderjoes.com

Full error:

[2025-09-04T18:19:06.251Z] [crawl:68b9d81a7453b252bf8e2f94] [primary] Initialized an in-memory URL queue for up to 10000 URLs
[2025-09-04T18:19:06.255Z] [crawl:68b9d81a7453b252bf8e2f94] [primary] ES connections will be authorized with configured API key
[2025-09-04T18:19:06.287Z] [crawl:68b9d81a7453b252bf8e2f94] [primary] Connected to ES at http://host.docker.internal:9200 - version: 9.2.0-SNAPSHOT; build flavor: default
[2025-09-04T18:19:06.550Z] [crawl:68b9d81a7453b252bf8e2f94] [primary] Index [web-crawl-test-2] did not exist, but was successfully created!
[2025-09-04T18:19:06.550Z] [crawl:68b9d81a7453b252bf8e2f94] [primary] Elasticsearch sink initialized for index [web-crawl-test-2] with pipeline disabled
[2025-09-04T18:19:06.561Z] [crawl:68b9d81a7453b252bf8e2f94] [primary] Starting the primary crawl with up to 10 parallel thread(s)...
[2025-09-04T18:19:06.796Z] [crawl:68b9d81a7453b252bf8e2f94] [primary] Following the redirect from 'https://traderjoes.com/robots.txt' to 'https://www.traderjoes.com/robots.txt'...
[2025-09-04T18:19:06.915Z] [crawl:68b9d81a7453b252bf8e2f94] [primary] Error while fetching robots.txt for https://traderjoes.com:443: Forbidden
[2025-09-04T18:19:06.930Z] [crawl:68b9d81a7453b252bf8e2f94] [primary] Crawl status: queue_size=0, pages_visited=1, urls_allowed=1, urls_denied={}, crawl_duration_msec=371, crawling_time_msec=113.0, avg_response_time_msec=113.0, active_threads=1, http_client={:max_connections=>100, :used_connections=>1}, status_codes={"403"=>1}
[2025-09-04T18:19:07.007Z] [crawl:68b9d81a7453b252bf8e2f94] [primary] Crawl queue is empty, finishing the primary crawl
[2025-09-04T18:19:07.008Z] [crawl:68b9d81a7453b252bf8e2f94] [primary] Finished a crawl stage. Result: success; Successfully finished the primary crawl with an empty crawl queue
[2025-09-04T18:19:07.027Z] [crawl:68b9d81a7453b252bf8e2f94] [primary] Search attempt 1/4 failed: '[400] {"error":{"root_cause":[{"type":"query_shard_exception","reason":"No mapping found for [last_crawled_at] in order to sort on","index_uuid":"L4UjVa_TTvmo9O0lbQnwiA","index":"web-crawl-test-2"}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"web-crawl-test-2","node":"vOvjAFReRA61cPaCDK1DXg","reason":{"type":"query_shard_exception","reason":"No mapping found for [last_crawled_at] in order to sort on","index_uuid":"L4UjVa_TTvmo9O0lbQnwiA","index":"web-crawl-test-2"}}]},"status":400}'. Retrying in 2.0s..
[2025-09-04T18:19:09.038Z] [crawl:68b9d81a7453b252bf8e2f94] [primary] Search attempt 2/4 failed: '[400] {"error":{"root_cause":[{"type":"query_shard_exception","reason":"No mapping found for [last_crawled_at] in order to sort on","index_uuid":"L4UjVa_TTvmo9O0lbQnwiA","index":"web-crawl-test-2"}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"web-crawl-test-2","node":"vOvjAFReRA61cPaCDK1DXg","reason":{"type":"query_shard_exception","reason":"No mapping found for [last_crawled_at] in order to sort on","index_uuid":"L4UjVa_TTvmo9O0lbQnwiA","index":"web-crawl-test-2"}}]},"status":400}'. Retrying in 4.0s..
[2025-09-04T18:19:13.060Z] [crawl:68b9d81a7453b252bf8e2f94] [primary] Search attempt 3/4 failed: '[400] {"error":{"root_cause":[{"type":"query_shard_exception","reason":"No mapping found for [last_crawled_at] in order to sort on","index_uuid":"L4UjVa_TTvmo9O0lbQnwiA","index":"web-crawl-test-2"}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"web-crawl-test-2","node":"vOvjAFReRA61cPaCDK1DXg","reason":{"type":"query_shard_exception","reason":"No mapping found for [last_crawled_at] in order to sort on","index_uuid":"L4UjVa_TTvmo9O0lbQnwiA","index":"web-crawl-test-2"}}]},"status":400}'. Retrying in 8.0s..
[2025-09-04T18:19:21.085Z] [crawl:68b9d81a7453b252bf8e2f94] [primary] Search failed after 4 attempts: '[400] {"error":{"root_cause":[{"type":"query_shard_exception","reason":"No mapping found for [last_crawled_at] in order to sort on","index_uuid":"L4UjVa_TTvmo9O0lbQnwiA","index":"web-crawl-test-2"}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"web-crawl-test-2","node":"vOvjAFReRA61cPaCDK1DXg","reason":{"type":"query_shard_exception","reason":"No mapping found for [last_crawled_at] in order to sort on","index_uuid":"L4UjVa_TTvmo9O0lbQnwiA","index":"web-crawl-test-2"}}]},"status":400}'.
[2025-09-04T18:19:21.086Z] [crawl:68b9d81a7453b252bf8e2f94] [primary] Crawl Error: Unexpected error while running the crawl: Elastic::Transport::Transport::Errors::BadRequest: [400] {"error":{"root_cause":[{"type":"query_shard_exception","reason":"No mapping found for [last_crawled_at] in order to sort on","index_uuid":"L4UjVa_TTvmo9O0lbQnwiA","index":"web-crawl-test-2"}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"web-crawl-test-2","node":"vOvjAFReRA61cPaCDK1DXg","reason":{"type":"query_shard_exception","reason":"No mapping found for [last_crawled_at] in order to sort on","index_uuid":"L4UjVa_TTvmo9O0lbQnwiA","index":"web-crawl-test-2"}}]},"status":400} /usr/local/bundle/gems/elastic-transport-8.3.2/lib/elastic/transport/transport/base.rb:228:in `__raise_transport_error'
/usr/local/bundle/gems/elastic-transport-8.3.2/lib/elastic/transport/transport/base.rb:346:in `perform_request'
/usr/local/bundle/gems/elastic-transport-8.3.2/lib/elastic/transport/transport/http/faraday.rb:36:in `perform_request'
/usr/local/bundle/gems/elastic-transport-8.3.2/lib/elastic/transport/client.rb:197:in `perform_request'
/usr/local/bundle/gems/elasticsearch-8.13.0/lib/elasticsearch.rb:71:in `method_missing'
/usr/local/bundle/gems/elasticsearch-api-8.13.0/lib/elasticsearch/api/actions/search.rb:105:in `search'
/home/app/lib/es/client.rb:79:in `block in paginated_search'
/home/app/lib/es/client.rb:237:in `execute_with_retry'
/home/app/lib/es/client.rb:78:in `block in paginated_search'
org/jruby/RubyKernel.java:1725:in `loop'
/home/app/lib/es/client.rb:77:in `paginated_search'
/home/app/lib/crawler/output_sink/elasticsearch.rb:123:in `fetch_purge_docs'
/home/app/lib/crawler/coordinator.rb:98:in `run_purge_crawl!'
/home/app/lib/crawler/coordinator.rb:70:in `run_crawl!'
/home/app/lib/crawler/api/crawl.rb:88:in `start!'
/home/app/lib/crawler/cli/crawl.rb:25:in `call'
/usr/local/bundle/gems/dry-cli-0.7.0/lib/dry/cli.rb:116:in `perform_registry'
/usr/local/bundle/gems/dry-cli-0.7.0/lib/dry/cli.rb:65:in `call'
bin/crawler:28:in `<main>'
[2025-09-04T18:19:21.087Z] [crawl:68b9d81a7453b252bf8e2f94] [primary] Finished a crawl. Result: failure; Unexpected error while running the crawl, check system logs for details
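
The 400 in the log above comes from sorting on `last_crawled_at` in an index that has no mapping for that field yet. Elasticsearch's sort clause accepts an `unmapped_type` option that makes it treat a missing field as if it were mapped with the given type, so the search returns zero hits instead of failing. The sketch below only builds a request body; the method name `purge_search_body`, the range query, and the sample timestamp are assumptions for illustration, not the crawler's actual purge query.

```ruby
require 'json'

# Hypothetical purge query: documents last crawled before the current crawl began.
# The real query in lib/crawler/output_sink/elasticsearch.rb may differ.
def purge_search_body(crawl_start_time)
  {
    query: { range: { last_crawled_at: { lt: crawl_start_time } } },
    # unmapped_type lets the sort succeed when the index has no
    # last_crawled_at mapping (for example, when nothing was ever indexed).
    sort: [{ last_crawled_at: { order: 'asc', unmapped_type: 'date' } }]
  }
end

puts JSON.pretty_generate(purge_search_body('2025-09-04T18:19:06Z'))
```

This would avoid the error without special-casing the empty index, at the cost of still issuing a search that cannot return anything.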

Expected behavior

We should be able to catch the 400 (`No mapping found for [last_crawled_at] in order to sort on`) in the purge phase and short-circuit. There is nothing to delete if the destination index is empty.
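
A minimal sketch of that short-circuit, under stated assumptions: the log shows the purge search raising `Elastic::Transport::Transport::Errors::BadRequest` with the reason `No mapping found for [last_crawled_at]`. To keep the example runnable without the gem, it uses a stand-in `BadRequest` class, and the wrapper name `purge_docs_or_empty` is hypothetical rather than part of the crawler.

```ruby
# Stand-in for Elastic::Transport::Transport::Errors::BadRequest so the sketch
# runs without the elastic-transport gem installed.
class BadRequest < StandardError; end

MISSING_SORT_MAPPING = 'No mapping found for [last_crawled_at]'

# Hypothetical wrapper: yields to the block that runs the purge search and
# returns its result, or an empty list when the index has no mapping yet.
# Any other BadRequest is re-raised unchanged.
def purge_docs_or_empty
  yield
rescue BadRequest => e
  raise unless e.message.include?(MISSING_SORT_MAPPING)

  [] # nothing was ever indexed, so there is nothing to purge
end

# Simulated search against an empty index: short-circuits to an empty list.
empty_index_result = purge_docs_or_empty do
  raise BadRequest, '[400] {"error":{"reason":"No mapping found for [last_crawled_at] in order to sort on"}}'
end

# Simulated search that succeeds: the result passes through untouched.
populated_result = purge_docs_or_empty { ['https://example.com/stale'] }

p empty_index_result
p populated_result
```

Matching on the message text is brittle; checking up front whether the index holds any documents would be an alternative, but it costs an extra request per crawl.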

Environment

Crawler 0.4.2, with Elasticsearch 9.2.0-SNAPSHOT
