
feat: Add CRW document loader and scrape tool nodes #6066

Open

us wants to merge 2 commits into FlowiseAI:main from us:feat/add-crw-node

Conversation


@us us commented Mar 26, 2026

Summary

  • Add CRW document loader node with scrape, crawl, and map modes for web content extraction
  • Add CRW Scrape tool node for agent-based web scraping via DynamicStructuredTool
  • Add CRW API credential supporting both self-hosted instances and fastcrw.com cloud

Details

Document Loader (CRW)

  • Scrape: Extract content from a single URL with JS rendering, CSS/XPath selectors, stealth mode, proxy support
  • Crawl: BFS crawl of a site with configurable max depth and page limits
  • Map: Discover all URLs on a site via link following and optional sitemap parsing
  • Outputs a Document array or concatenated text, with optional text splitter support (see the sketch after this list)
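
A rough sketch of the output handling described above: scraped pages become LangChain Documents, optionally run through a text splitter, and are returned either as an array or as concatenated text. The CRWPage shape and helper names are illustrative assumptions, not the PR's actual code.

import { Document } from '@langchain/core/documents'
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter'

// Hypothetical shape of one scraped page; the real CRW response may differ.
interface CRWPage {
    url: string
    markdown: string
}

// Convert scraped pages into Documents, optionally running a text splitter.
async function toDocuments(pages: CRWPage[], splitter?: RecursiveCharacterTextSplitter): Promise<Document[]> {
    const docs = pages.map((p) => new Document({ pageContent: p.markdown, metadata: { source: p.url } }))
    return splitter ? splitter.splitDocuments(docs) : docs
}

// 'documents' output returns the array; 'text' output concatenates page content.
function toOutput(docs: Document[], output: 'documents' | 'text') {
    return output === 'text' ? docs.map((d) => d.pageContent).join('\n\n') : docs
}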

Tool (CRW Scrape)

  • Exposes CRW scrape as a tool for LLM agents
  • Configurable tool name/description, output format, JS rendering, and main content extraction (sketched below)
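
A minimal sketch of how the scrape could be exposed to agents via DynamicStructuredTool. The tool name, zod schema, and crwScrape stand-in are assumptions for illustration, not the node's exact implementation.

import { z } from 'zod'
import { DynamicStructuredTool } from '@langchain/core/tools'

// Stand-in for the node's real CRW HTTP client (hypothetical signature).
declare function crwScrape(url: string, opts: { onlyMainContent?: boolean }): Promise<string>

const crwScrapeTool = new DynamicStructuredTool({
    name: 'crw_scrape', // overridable via the node's tool name input
    description: 'Scrape a web page with CRW and return its main content as markdown.',
    schema: z.object({
        url: z.string().describe('URL of the page to scrape')
    }),
    func: async ({ url }) => crwScrape(url, { onlyMainContent: true })
})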

Credential (CRW API)

  • API key (optional for self-hosted, required for cloud)
  • Configurable API URL (defaults to http://localhost:3000); a sketch of the credential definition follows
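
Roughly what the credential definition could look like, following the pattern of existing Flowise credential files; the label and input names below are guesses rather than the PR's actual identifiers.

import { INodeParams, INodeCredential } from '../src/Interface'

class CRWApi implements INodeCredential {
    label: string
    name: string
    version: number
    description: string
    inputs: INodeParams[]

    constructor() {
        this.label = 'CRW API'
        this.name = 'crwApi'
        this.version = 1.0
        this.description = 'API key is optional for self-hosted CRW instances and required for fastcrw.com cloud'
        this.inputs = [
            {
                label: 'API Key',
                name: 'crwApiKey',
                type: 'password',
                optional: true
            },
            {
                label: 'API URL',
                name: 'crwApiUrl',
                type: 'string',
                default: 'http://localhost:3000'
            }
        ]
    }
}

module.exports = { credClass: CRWApi }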

Test plan

  • Verify document loader node appears in Flowise UI under "Document Loaders"
  • Verify scrape tool node appears under "Tools"
  • Test scrape mode with a sample URL
  • Test crawl mode with a small site
  • Test map mode for URL discovery
  • Test CRW Scrape tool within an agent flow
  • Verify credential creation and connection

Add CRW integration with:
- Document loader node supporting scrape, crawl, and map modes
- Scrape tool node for agent-based web scraping
- CRW API credential with support for self-hosted and cloud instances
Copilot AI review requested due to automatic review settings March 26, 2026 12:18
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances web content extraction capabilities by integrating CRW, an AI-native web scraper. It introduces a versatile document loader that can scrape, crawl, or map websites, and a dedicated tool node for agents to leverage CRW's scraping features. A new credential type facilitates secure and flexible connections to CRW services, making it easier to incorporate dynamic web data into flows.

Highlights

  • CRW Document Loader Node: Introduced a new document loader node for CRW, supporting 'scrape' for single URLs, 'crawl' for multi-page BFS traversal, and 'map' for URL discovery via link following and sitemap parsing. It includes options for JS rendering, CSS/XPath selectors, stealth mode, proxy support, and text splitting.
  • CRW Scrape Tool Node: Added a CRW Scrape tool node, enabling LLM agents to perform web scraping. This tool is configurable with a custom name, description, output format, JS rendering options, and main content extraction.
  • CRW API Credential: Implemented a new credential type for CRW API, allowing connection to both self-hosted CRW instances and the fastcrw.com cloud service. It supports optional API keys and configurable API URLs.


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces integration with the CRW web scraping service, adding a new credential type, a document loader for scraping, crawling, and mapping websites, and a dedicated tool for CRW scrape functionality. Feedback includes making the 5-minute crawl timeout configurable for flexibility and removing an unused stealth parameter from the scrape method signature in the CRWScrape tool.


// Poll until completed or failed
const jobId = startData.id
const maxWaitMs = 5 * 60 * 1000 // 5 minutes

Severity: medium

A hardcoded 5-minute timeout for crawling can be problematic. It might be too long for some use cases, potentially causing gateway timeouts in a web server environment, or too short for very large sites. Consider making this timeout configurable as a node input to provide more flexibility.
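
One way to do this, sketched with a hypothetical crawlTimeout node input (in seconds) that falls back to the current 5-minute default when unset:

// Hypothetical node input definition; name, label, and default are illustrative.
const crawlTimeoutInput = {
    label: 'Crawl Timeout (seconds)',
    name: 'crawlTimeout',
    type: 'number',
    default: 300,
    optional: true,
    additionalParams: true
}

// Resolve the timeout from node inputs, defaulting to 5 minutes.
function resolveCrawlTimeoutMs(inputs: Record<string, unknown>): number {
    const seconds = Number(inputs.crawlTimeout)
    return (Number.isFinite(seconds) && seconds > 0 ? seconds : 300) * 1000
}

// In the polling loop: const maxWaitMs = resolveCrawlTimeoutMs(nodeData.inputs ?? {})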


async scrape(
url: string,
params: { onlyMainContent?: boolean; renderJs?: string; formats?: string[]; stealth?: boolean }

Severity: medium

The stealth parameter is defined in the scrape method's signature but is never used, as there is no corresponding input in the tool's configuration. It can be removed to simplify the code and avoid confusion.

Suggested change:
- params: { onlyMainContent?: boolean; renderJs?: string; formats?: string[]; stealth?: boolean }
+ params: { onlyMainContent?: boolean; renderJs?: string; formats?: string[] }


Copilot AI left a comment


Pull request overview

Adds CRW integrations to Flowise by introducing a new CRW document loader (scrape/crawl/map) and a CRW Scrape tool node, along with a credential for configuring CRW API access (self-hosted or fastcrw.com).

Changes:

  • Added CRW document loader node supporting scrape, crawl, and map modes with configurable extraction options.
  • Added CRW Scrape tool node exposing CRW scraping as a DynamicStructuredTool for agent use.
  • Added CRW API credential (API key + base URL) and node icons.

Reviewed changes

Copilot reviewed 3 out of 5 changed files in this pull request and generated 2 comments.

Summary per file:
  • packages/components/nodes/tools/CRWScrape/crw.svg: Adds CRW icon for the tool node.
  • packages/components/nodes/tools/CRWScrape/CRWScrape.ts: Adds CRW Scrape tool node implementation and a minimal CRW HTTP client.
  • packages/components/nodes/documentloaders/CRW/crw.svg: Adds CRW icon for the document loader node.
  • packages/components/nodes/documentloaders/CRW/CRW.ts: Adds CRW document loader node (scrape/crawl/map) with optional text splitting and metadata merge.
  • packages/components/credentials/CRWApi.credential.ts: Adds CRW API credential definition (key optional + configurable API base URL).


Comment on lines +533 to +535
if (output === 'text') {
return docs.map((doc) => doc.pageContent).join('\n\n')
}

Copilot AI Mar 26, 2026


When returning the text output, this loader returns the raw concatenated string without applying handleEscapeCharacters. Most other document loaders wrap their final text output with handleEscapeCharacters(..., false) to keep escaping consistent across nodes and avoid downstream parsing/display issues with special characters.
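
A sketch of that fix applied to the excerpt above, assuming handleEscapeCharacters is imported from the components src utils the way other loaders do it:

import { handleEscapeCharacters } from '../../../src/utils'

// Keep escaping consistent with other loaders when returning concatenated text.
if (output === 'text') {
    return handleEscapeCharacters(docs.map((doc) => doc.pageContent).join('\n\n'), false)
}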

onlyMainContent: onlyMainContent ?? true
}
if (renderJs && renderJs !== 'auto') params.renderJs = renderJs
if (waitFor) params.waitFor = waitFor

Copilot AI Mar 26, 2026


waitFor is only forwarded when it is truthy (if (waitFor)), so a valid value of 0 ms cannot be passed through to the CRW API. Use an explicit undefined/null check instead (e.g., waitFor !== undefined) so zero is respected.

Suggested change:
- if (waitFor) params.waitFor = waitFor
+ if (waitFor !== undefined && waitFor !== null) params.waitFor = waitFor

- Make crawl timeout configurable via node input parameter (default 5min)
- Remove unused stealth param from CRWScrape tool's scrape method signature
- Apply handleEscapeCharacters to text output for consistency with other loaders
- Use explicit null/undefined check for waitFor to allow waitFor=0
