feat: Add CRW document loader and scrape tool nodes (#6066)
Conversation
Add CRW integration with:
- Document loader node supporting scrape, crawl, and map modes
- Scrape tool node for agent-based web scraping
- CRW API credential with support for self-hosted and cloud instances
Summary of Changes

This pull request enhances web content extraction by integrating CRW, an AI-native web scraper. It introduces a document loader that can scrape, crawl, or map websites, and a dedicated tool node for agents to leverage CRW's scraping features. A new credential type enables secure, flexible connections to both self-hosted and cloud CRW services, making it easier to incorporate dynamic web data into flows.
Code Review
This pull request introduces integration with the CRW web scraping service, adding a new credential type, a document loader for scraping, crawling, and mapping websites, and a dedicated tool for CRW scrape functionality. Feedback includes making the 5-minute crawl timeout configurable for flexibility and removing an unused stealth parameter from the scrape method signature in the CRWScrape tool.
```typescript
// Poll until completed or failed
const jobId = startData.id
const maxWaitMs = 5 * 60 * 1000 // 5 minutes
```
The crawl polling timeout is hard-coded to 5 minutes. Consider exposing it as a node input parameter so users can adjust it for larger crawls.
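One way to lift the hard-coded timeout into a parameter is to inject the status check so the loop stays testable. A minimal sketch (the `checkStatus` callback and `JobStatus` shape are illustrative, not CRW's actual client API):

```typescript
// Sketch: poll a crawl job until it completes/fails or a configurable
// timeout elapses. `checkStatus` stands in for the real HTTP status call.
type JobStatus = { status: 'completed' | 'failed' | 'pending' }

async function pollCrawlJob(
    checkStatus: () => Promise<JobStatus>,
    maxWaitMs: number = 5 * 60 * 1000, // default 5 minutes, as in the diff
    intervalMs: number = 1000
): Promise<JobStatus> {
    const deadline = Date.now() + maxWaitMs
    while (Date.now() < deadline) {
        const result = await checkStatus()
        // Stop polling on any terminal state
        if (result.status === 'completed' || result.status === 'failed') return result
        await new Promise((resolve) => setTimeout(resolve, intervalMs))
    }
    throw new Error(`Crawl job did not finish within ${maxWaitMs} ms`)
}
```

Passing `maxWaitMs` through from a node input keeps the current 5-minute behavior as the default while letting long crawls opt into a larger budget.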
```typescript
async scrape(
    url: string,
    params: { onlyMainContent?: boolean; renderJs?: string; formats?: string[]; stealth?: boolean }
```
The `stealth` parameter is defined in the `scrape` method's signature but is never used, as there is no corresponding input in the tool's configuration. It can be removed to simplify the code and avoid confusion.
Suggested change:

```diff
- params: { onlyMainContent?: boolean; renderJs?: string; formats?: string[]; stealth?: boolean }
+ params: { onlyMainContent?: boolean; renderJs?: string; formats?: string[] }
```
Pull request overview
Adds CRW integration to Flowise by introducing a new CRW document loader (scrape/crawl/map) and a CRW Scrape tool node, along with a credential for configuring CRW API access (self-hosted or fastcrw.com).
Changes:
- Added `CRW` document loader node supporting scrape, crawl, and map modes with configurable extraction options.
- Added `CRW Scrape` tool node exposing CRW scraping as a `DynamicStructuredTool` for agent use.
- Added `CRW API` credential (API key + base URL) and node icons.
Reviewed changes
Copilot reviewed 3 out of 5 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| packages/components/nodes/tools/CRWScrape/crw.svg | Adds CRW icon for the tool node. |
| packages/components/nodes/tools/CRWScrape/CRWScrape.ts | Adds CRW Scrape tool node implementation and minimal CRW HTTP client. |
| packages/components/nodes/documentloaders/CRW/crw.svg | Adds CRW icon for the document loader node. |
| packages/components/nodes/documentloaders/CRW/CRW.ts | Adds CRW document loader node (scrape/crawl/map) with optional text splitting and metadata merge. |
| packages/components/credentials/CRWApi.credential.ts | Adds CRW API credential definition (key optional + configurable API base URL). |
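For reference, a hedged sketch of what a credential file like `CRWApi.credential.ts` might contain, using locally defined stand-ins for Flowise's `INodeCredential`/`INodeParams` interfaces. Field names and defaults here are illustrative assumptions, not the merged code:

```typescript
// Minimal local stand-ins mirroring Flowise's credential interface shapes
interface INodeParams {
    label: string
    name: string
    type: string
    optional?: boolean
    default?: string
    placeholder?: string
}
interface INodeCredential {
    label: string
    name: string
    version: number
    inputs: INodeParams[]
}

// Sketch of the CRW API credential: optional key plus configurable base URL
class CRWApi implements INodeCredential {
    label = 'CRW API'
    name = 'crwApi'
    version = 1.0
    inputs: INodeParams[] = [
        {
            label: 'API Key',
            name: 'crwApiKey',
            type: 'password',
            optional: true // self-hosted instances may not require a key
        },
        {
            label: 'API Base URL',
            name: 'crwApiUrl',
            type: 'string',
            default: 'http://localhost:3000',
            placeholder: 'http://localhost:3000'
        }
    ]
}
```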
```typescript
if (output === 'text') {
    return docs.map((doc) => doc.pageContent).join('\n\n')
}
```
When returning the text output, this loader returns the raw concatenated string without applying `handleEscapeCharacters`. Most other document loaders wrap their final text output with `handleEscapeCharacters(..., false)` to keep escaping consistent across nodes and avoid downstream parsing/display issues with special characters.
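A sketch of that fix, with a simplified stand-in for the real utility (the actual `handleEscapeCharacters` lives in Flowise's shared utils; the behavior implemented here is an assumption for demonstration only):

```typescript
// Hypothetical stand-in for Flowise's handleEscapeCharacters(text, false).
// The real utility's exact escape handling is assumed, not reproduced.
const handleEscapeCharacters = (text: string, _reverse: boolean): string =>
    text.replace(/\\n/g, '\n').replace(/\\t/g, '\t')

interface Doc {
    pageContent: string
}

// The text-output branch from the loader, with the escape pass applied
// so the joined string matches other document loaders' behavior.
function textOutput(docs: Doc[]): string {
    return handleEscapeCharacters(docs.map((d) => d.pageContent).join('\n\n'), false)
}
```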
```typescript
    onlyMainContent: onlyMainContent ?? true
}
if (renderJs && renderJs !== 'auto') params.renderJs = renderJs
if (waitFor) params.waitFor = waitFor
```
`waitFor` is only forwarded when it is truthy (`if (waitFor)`), so a valid value of 0 ms cannot be passed through to the CRW API. Use an explicit undefined/null check instead (e.g., `waitFor !== undefined`) so zero is respected.
Suggested change:

```diff
- if (waitFor) params.waitFor = waitFor
+ if (waitFor !== undefined && waitFor !== null) params.waitFor = waitFor
```
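To see why the truthiness check drops zero, here is a small self-contained sketch of the parameter-building logic with the explicit check applied (`buildParams` and `ScrapeParams` are illustrative names, not the PR's code):

```typescript
interface ScrapeParams {
    onlyMainContent: boolean
    waitFor?: number
}

// 0 is falsy in JavaScript, so `if (waitFor)` would silently drop a
// legitimate 0 ms wait; the explicit null/undefined check preserves it.
function buildParams(onlyMainContent?: boolean, waitFor?: number | null): ScrapeParams {
    const params: ScrapeParams = { onlyMainContent: onlyMainContent ?? true }
    if (waitFor !== undefined && waitFor !== null) params.waitFor = waitFor
    return params
}
```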
- Make crawl timeout configurable via node input parameter (default 5 min)
- Remove unused stealth param from CRWScrape tool's scrape method signature
- Apply handleEscapeCharacters to text output for consistency with other loaders
- Use explicit null/undefined check for waitFor to allow waitFor=0
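The first fix above might be declared as a node input along these lines (a hedged sketch in the Flowise `INodeParams` style; the field names and `crawlTimeoutMs` identifier are assumptions, not the committed code):

```typescript
// Illustrative node input exposing the crawl timeout, defaulting to the
// previously hard-coded 5 minutes so existing flows keep their behavior.
const timeoutInput = {
    label: 'Crawl Timeout (ms)',
    name: 'crawlTimeoutMs',
    type: 'number',
    default: 5 * 60 * 1000, // 5 minutes
    optional: true,
    additionalParams: true
}
```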
Summary

Adds CRW integration: a document loader, a scrape tool node exposed as a `DynamicStructuredTool`, and a credential for CRW API access.

Details

- Document Loader (`CRW`): scrape, crawl, and map modes
- Tool (`CRW Scrape`): agent-usable web scraping
- Credential (`CRW API`): API key and configurable base URL (e.g. `http://localhost:3000`)

Test plan