This module implements text splitting functionality for Retrieval-Augmented Generation (RAG) systems. It takes large documents and breaks them into smaller, manageable chunks while preserving context through overlapping segments.
Why Split Text?
- Embedding models have token limits (typically 512-8192 tokens)
- Smaller chunks improve retrieval precision
- Overlap maintains context across boundaries
- Better semantic search results
The Core Problem:

```
Large Document (10,000 chars)
              ↓
[Chunk 1] [Chunk 2] [Chunk 3] [Chunk 4]
    ↑overlap↑  ↑overlap↑  ↑overlap↑
```
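The diagram above corresponds to simple fixed-size chunking with a sliding window. A minimal sketch (the `naiveSplit` helper is hypothetical, not part of the module):

```javascript
// Hypothetical helper (not the module's actual code): fixed-size
// character chunks with a sliding-window overlap.
function naiveSplit(text, chunkSize = 1000, chunkOverlap = 200) {
  const chunks = [];
  const step = chunkSize - chunkOverlap; // advance by size minus overlap
  for (let i = 0; i < text.length; i += step) {
    chunks.push(text.slice(i, i + chunkSize));
    if (i + chunkSize >= text.length) break; // last window reached
  }
  return chunks;
}

const chunks = naiveSplit('x'.repeat(10000));
console.log(chunks.length); // 13 chunks, each sharing 200 chars with the next
```

Real splitters improve on this by cutting at natural boundaries (paragraphs, sentences) instead of arbitrary character offsets.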
```
TextSplitter (Base Class)
├── CharacterTextSplitter
├── RecursiveCharacterTextSplitter
└── TokenTextSplitter
```
- Single Responsibility: Each class has one job
- Inheritance: Common logic lives in the base class
- Polymorphism: Different splitting strategies via splitText()
- Composition: Complex splitters use simpler ones internally
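These principles can be sketched in miniature (class and method names follow the hierarchy above; the bodies are illustrative, not the module's actual code):

```javascript
// Illustrative miniature of the design: shared configuration lives in
// the base class, and each subclass supplies its own splitText() strategy.
class TextSplitter {
  constructor({ chunkSize = 1000, chunkOverlap = 200 } = {}) {
    this.chunkSize = chunkSize;
    this.chunkOverlap = chunkOverlap;
  }
  // Polymorphic hook: subclasses must override this.
  splitText(text) {
    throw new Error('splitText() must be implemented by a subclass');
  }
}

class CharacterTextSplitter extends TextSplitter {
  constructor({ separator = '\n\n', ...rest } = {}) {
    super(rest);
    this.separator = separator;
  }
  splitText(text) {
    // Simplified: split on one separator; the real class would also
    // merge small pieces back up to chunkSize with overlap.
    return text.split(this.separator).filter(p => p.length > 0);
  }
}

const s = new CharacterTextSplitter();
console.log(s.splitText('para one\n\npara two')); // [ 'para one', 'para two' ]
```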
```javascript
// Our Implementation
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200
});
const chunks = await splitter.splitDocuments(documents);
```

```javascript
// LangChain.js
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200
});
const chunks = await splitter.splitDocuments(documents);
```

Result: drop-in compatible! The same code works with both.
Both implementations share the same class hierarchy:

```
TextSplitter (base)
├── CharacterTextSplitter
├── RecursiveCharacterTextSplitter
└── TokenTextSplitter
```

- Both use the "merge with overlap" algorithm
- Both apply the same recursive strategy to oversized chunks
- Both walk the same separator hierarchy
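The "merge with overlap" idea can be sketched as follows (the `mergeWithOverlap` helper is illustrative, not either library's actual function): accumulate small pieces until the chunk size is reached, then start the next chunk from the trailing pieces so context carries over.

```javascript
// Sketch of merge-with-overlap: `pieces` are fragments produced by
// splitting on a separator; we pack them into chunks of at most
// chunkSize characters, carrying ~chunkOverlap characters forward.
function mergeWithOverlap(pieces, chunkSize, chunkOverlap, sep = ' ') {
  const chunks = [];
  let current = [];
  let currentLen = 0;
  for (const piece of pieces) {
    if (currentLen + piece.length > chunkSize && current.length > 0) {
      chunks.push(current.join(sep));
      // Drop leading pieces until the remainder fits in the overlap budget.
      while (currentLen > chunkOverlap && current.length > 0) {
        currentLen -= current.shift().length + sep.length;
      }
    }
    current.push(piece);
    currentLen += piece.length + sep.length;
  }
  if (current.length > 0) chunks.push(current.join(sep));
  return chunks;
}

console.log(mergeWithOverlap(['aaaa', 'bbbb', 'cccc', 'dddd'], 10, 5));
// → [ 'aaaa bbbb', 'bbbb cccc', 'cccc dddd' ]
```

Note how each chunk repeats the tail of the previous one; a piece larger than chunkSize would be handed to the recursive strategy in the real implementations.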
Chunks are returned as documents with metadata:

```javascript
{
  pageContent: "chunk text",
  metadata: {
    source: "...",
    chunk: 0,
    totalChunks: 5
  }
}
```

Our Implementation:
```javascript
// Concise constructor
constructor({chunkSize = 1000, chunkOverlap = 200, lengthFunction = t => t.length} = {}) {
  if (chunkOverlap >= chunkSize) {
    throw new Error('chunkOverlap must be less than chunkSize');
  }
  Object.assign(this, {chunkSize, chunkOverlap, lengthFunction});
}
```

LangChain.js:
```javascript
// More verbose, more validation
constructor(fields) {
  super(fields);
  this.chunkSize = fields?.chunkSize ?? 1000;
  this.chunkOverlap = fields?.chunkOverlap ?? 200;
  // ... many more fields
  // ... extensive validation
  // ... error handling
}
```

Why Simpler?
- Educational focus
- Fewer edge cases
- Easier to understand
- Less production overhead
Our Implementation:
```javascript
const lengthFunction = text => Math.ceil(text.length / 4);
```

- Simple approximation
- No external dependencies
- Fast but less accurate
LangChain.js:
```javascript
import { encodingForModel } from "js-tiktoken";

const encoder = encodingForModel("gpt-4");
const tokens = encoder.encode(text);
```

- Uses the js-tiktoken library
- Exact token counting
- Slower but accurate
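To see what the trade-off means in practice, here is the approximation on its own (a sketch; the splitter wiring shown in the comment assumes the constructor described earlier):

```javascript
// Rule-of-thumb length function: roughly 4 characters per token for
// typical English text, so token limits can be enforced cheaply.
const approxTokens = text => Math.ceil(text.length / 4);

// Passed as lengthFunction to the constructor shown earlier, it makes
// chunkSize and chunkOverlap count approximate tokens instead of
// characters, e.g.:
//   new RecursiveCharacterTextSplitter({ chunkSize: 256, chunkOverlap: 32,
//                                        lengthFunction: approxTokens });
console.log(approxTokens('hello world')); // 11 chars → 3 approximate tokens
```

The heuristic drifts for code, dense punctuation, or non-English text, which is where exact tokenization earns its extra cost.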
Our Implementation:
```javascript
// Minimal error handling
if (chunkOverlap >= chunkSize) {
  throw new Error('chunkOverlap must be less than chunkSize');
}
```

LangChain.js:
```javascript
// Extensive error handling
if (chunkOverlap >= chunkSize) {
  throw new Error(`chunkOverlap (${chunkOverlap}) must be less than chunkSize (${chunkSize})`);
}
if (chunkSize <= 0) {
  throw new Error('chunkSize must be positive');
}
// ... many more validations
```

Feature comparison:

| Feature | Our Implementation | LangChain.js |
|---|---|---|
| Basic splitting | ✓ | ✓ |
| Recursive splitting | ✓ | ✓ |
| Token splitting | ✓ (approximate) | ✓ (exact) |
| Metadata tracking | ✓ | ✓ |
| Custom separators | ✓ | ✓ |
| Markdown splitting | ✗ | ✓ |
| Code splitting | ✗ | ✓ |
| LaTeX splitting | ✗ | ✓ |
| HTML splitting | ✗ | ✓ |
| Transform callbacks | ✗ | ✓ |
| Document transformers | ✗ | ✓ |
Our Implementation:
- Pure JavaScript
- JSDoc comments for type hints
- Simpler to understand
LangChain.js:
- Written in TypeScript
- Full type safety
- Better IDE support
- More complex codebase
From Our Implementation to LangChain.js:
```javascript
// Step 1: Change the import
// From:
// import { RecursiveCharacterTextSplitter } from './example.js';
// To:
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
```

```javascript
// Step 2: The code stays the same!
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200
});
const chunks = await splitter.splitDocuments(documents);
```

```javascript
// Step 3: Optionally add LangChain features
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
  // LangChain-specific option:
  separators: ["\n\n", "\n", ".", "!", "?", ";", ",", " ", ""]
});
```

Compatibility: 95% of code works without changes!
- TextSplitter Pattern: Base class + specialized subclasses
- Core Algorithm: Merge with overlap for context continuity
- Recursive Strategy: Try large separators first, fall back to smaller ones
- API Compatibility: Same interface as LangChain.js
- Simplicity: Focused on clarity over features
✓ Modular design (easy to extend)
✓ Clear separation of concerns
✓ Reusable components
✓ Well-documented
✓ Production-ready algorithm
✓ LangChain-compatible
What you learned:
- How text splitting works algorithmically
- Why overlap matters for context
- Recursive splitting strategy
- Token vs character splitting
- How to choose the right configuration
- Differences from LangChain.js
Next steps:
- Implement custom splitters for your domain
- Add specialized separators for your document types
- Experiment with different chunk sizes
- Measure retrieval quality with your data