Japanese-optimized semantic text chunking for RAG applications.
Unlike general-purpose text splitters, Bunsetsu understands Japanese text structure—no spaces between words, particles that bind phrases, and sentence patterns that differ from English. This results in more coherent chunks and better retrieval accuracy for Japanese RAG systems.
| Feature | Generic Splitters | Bunsetsu |
|---|---|---|
| Japanese word boundaries | ❌ Breaks mid-word | ✅ Respects morphology |
| Particle handling | ❌ Splits は/が from nouns | ✅ Keeps phrases intact |
| Sentence detection | ❌ Limited | ✅ Full (。!?、etc.) |
| Topic boundaries | ❌ Ignores | ✅ Detects は/が patterns |
| Dependencies | Heavy | Zero by default |
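
To make the comparison concrete, here is a small sketch contrasting plain fixed-width slicing with the `chunk_text` call from the Quick Start below (exact chunk boundaries depend on the strategy and sizes you pick):

```python
from bunsetsu import chunk_text

text = "大規模言語モデルの登場により、自然言語処理の分野は大きく変わりました。"

# A generic fixed-width split cuts 登場 and 処理 in half:
naive = [text[i:i + 10] for i in range(0, len(text), 10)]

# Bunsetsu keeps phrases and sentence boundaries intact instead:
for chunk in chunk_text(text, strategy="semantic", chunk_size=50):
    print(chunk.text)
```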
```bash
# Basic installation (zero dependencies)
pip install bunsetsu

# With MeCab tokenizer (higher accuracy)
pip install bunsetsu[mecab]

# With Sudachi tokenizer (multiple granularity modes)
pip install bunsetsu[sudachi]

# All tokenizers
pip install bunsetsu[all]
```

```python
from bunsetsu import chunk_text

text = """
人工知能の発展は目覚ましいものがあります。
特に大規模言語モデルの登場により、自然言語処理の分野は大きく変わりました。
"""
# Simple semantic chunking
chunks = chunk_text(text, strategy="semantic", chunk_size=200)
for chunk in chunks:
print(f"[{chunk.char_count} chars] {chunk.text[:50]}...")Splits text based on meaning and topic boundaries:
The `SemanticChunker` splits text based on meaning and topic boundaries:

```python
from bunsetsu import SemanticChunker

chunker = SemanticChunker(
    min_chunk_size=100,
    max_chunk_size=500,
)
chunks = chunker.chunk(text)
```
The `FixedSizeChunker` does character-based splitting that respects sentence boundaries:

```python
from bunsetsu import FixedSizeChunker

chunker = FixedSizeChunker(
    chunk_size=500,
    chunk_overlap=50,
    respect_sentences=True,  # Don't break mid-sentence
)
chunks = chunker.chunk(text)
```
The `RecursiveChunker` splits hierarchically by headings, paragraphs, sentences, then clauses:

```python
from bunsetsu import RecursiveChunker

chunker = RecursiveChunker(
    chunk_size=500,
    chunk_overlap=50,
)
chunks = chunker.chunk(markdown_text)
```
The `SimpleTokenizer` is regex-based with zero dependencies and is good enough for most use cases:

```python
from bunsetsu import SimpleTokenizer

tokenizer = SimpleTokenizer()
tokens = tokenizer.tokenize("日本語のテキスト")
```
The `MeCabTokenizer` uses MeCab via fugashi for proper morphological analysis:

```python
from bunsetsu import MeCabTokenizer, SemanticChunker

tokenizer = MeCabTokenizer()
chunker = SemanticChunker(tokenizer=tokenizer)
```
The `SudachiTokenizer` supports three tokenization modes (A/B/C):

```python
from bunsetsu import SudachiTokenizer

# Mode C: Longest unit (compound words kept together)
tokenizer = SudachiTokenizer(mode="C")

# Mode A: Shortest unit (fine-grained)
tokenizer = SudachiTokenizer(mode="A")
```
A LangChain-compatible text splitter is available via `bunsetsu.integrations`:

```python
from bunsetsu.integrations import LangChainTextSplitter
from langchain.schema import Document

splitter = LangChainTextSplitter(
    strategy="semantic",
    chunk_size=500,
)

# Split plain text
chunks = splitter.split_text(text)
# Split Documents
docs = [Document(page_content=text, metadata={"source": "file.txt"})]
split_docs = splitter.split_documents(docs)
```
For LlamaIndex, a node parser is provided:

```python
from bunsetsu.integrations import LlamaIndexNodeParser

parser = LlamaIndexNodeParser(
    strategy="semantic",
    chunk_size=500,
)
nodes = parser.get_nodes_from_documents(documents)
```
`chunk_text` is a convenience function for quick chunking:

```python
chunks = chunk_text(
    text,
    strategy="semantic",         # "fixed", "semantic", or "recursive"
    chunk_size=500,              # Target chunk size
    chunk_overlap=50,            # Overlap between chunks
    tokenizer_backend="simple",  # "simple", "mecab", or "sudachi"
)
```
Each returned chunk exposes the following attributes:

```python
chunk.text # The chunk content
chunk.start_char # Start position in original text
chunk.end_char # End position in original text
chunk.char_count # Number of characters
chunk.metadata # Additional metadata dict
```
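
The offsets make it straightforward to map a retrieved chunk back to its place in the source document, for example to show a little surrounding context. A sketch reusing `text` and `chunks` from the Quick Start:

```python
chunk = chunks[0]
start = max(0, chunk.start_char - 20)
end = min(len(text), chunk.end_char + 20)

print(f"{chunk.char_count} chars at [{chunk.start_char}:{chunk.end_char}]")
print(text[start:end])  # chunk plus a little surrounding context
```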
Tokens produced by the tokenizer backends expose:

```python
token.surface # Surface form (as written)
token.token_type # TokenType enum (NOUN, VERB, PARTICLE, etc.)
token.reading # Reading (if available)
token.base_form # Dictionary form (if available)
token.is_content_word # True for nouns, verbs, adjectives
```
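
For example, these attributes make it easy to keep only content words, say for lightweight keyword extraction. A sketch using the `SimpleTokenizer` shown earlier (the MeCab or Sudachi backends will give more accurate part-of-speech information):

```python
from bunsetsu import SimpleTokenizer

tokenizer = SimpleTokenizer()
tokens = tokenizer.tokenize("自然言語処理の分野は大きく変わりました。")

# Drop particles and punctuation, keep nouns/verbs/adjectives.
content_words = [t.surface for t in tokens if t.is_content_word]
print(content_words)
```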
Benchmarked on a 100KB Japanese document:

| Chunker | Time | Chunks | Avg Size |
|---|---|---|---|
| FixedSizeChunker | 12ms | 203 | 492 chars |
| SemanticChunker (simple) | 45ms | 187 | 534 chars |
| SemanticChunker (mecab) | 89ms | 192 | 521 chars |
| RecursiveChunker | 23ms | 198 | 505 chars |
- Japanese-first: Built specifically for Japanese text, not adapted from English
- Zero dependencies by default: Works out of the box, optional backends for accuracy
- RAG-optimized: Chunks designed for embedding and retrieval, not just display
- Framework-agnostic: Core library works standalone, integrations provided separately
Contributions are welcome! Please check CONTRIBUTING.md for guidelines.
```bash
# Development setup
git clone https://github.com/YUALAB/bunsetsu.git
cd bunsetsu
pip install -e ".[dev]"

# Run tests
pytest

# Run linter
ruff check src/
```

MIT License - see LICENSE for details.
Developed by YUA LAB (AQUA LLC), Tokyo.
We build AI agents and RAG systems for enterprise. This library powers our production RAG deployments.
- Website: aquallc.jp
- AI Assistant: YUA
- Contact: desk@aquallc.jp