feat(knowledge): add token, sentence, recursive, and regex chunkers#4102

Merged
waleedlatif1 merged 20 commits into staging from waleedlatif1/add-chunkers
Apr 11, 2026

@waleedlatif1
Collaborator

Summary

  • Add 4 new chunking strategies: token, sentence, recursive, regex
  • Extract shared chunking utilities (estimateTokens, cleanText, addOverlap, splitAtWordBoundaries) into utils.ts
  • Wire strategy selection through full stack: UI → API → service → document processor
  • Default to auto-detection (existing behavior), users can optionally select a strategy
  • Add ReDoS protection for user-supplied regex patterns
  • Add Zod validation for strategy options at API layer

Type of Change

  • New feature

Testing

Tested manually. All 53 existing chunker tests pass.

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

@vercel

vercel bot commented Apr 11, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
docs Skipped Skipped Apr 11, 2026 4:33am

@cursor

cursor bot commented Apr 11, 2026

PR Summary

Medium Risk
Touches the document ingestion/chunking pipeline end-to-end (UI → API → processing), which can change chunk boundaries, token counts, and embedding behavior across all new uploads. Regex-based splitting adds some input-safety checks but still carries performance/edge-case risk if patterns or separators behave unexpectedly on large inputs.

Overview
Adds user-selectable chunking strategies for knowledge bases (token, sentence, recursive, regex, plus auto) and wires them through creation UI, /api/knowledge validation, query types, stored chunkingConfig, and processDocument strategy selection.

Introduces new chunker implementations (TokenChunker, SentenceChunker, RecursiveChunker, RegexChunker) plus shared helpers in chunkers/utils.ts (token estimation, cleaning, overlap, word-boundary splitting, metadata building), and refactors existing chunkers (notably TextChunker, JsonYamlChunker, DocsChunker, StructuredDataChunker) to use the shared utilities and adjusted detection/token-estimation behavior.

Enhances safety/validation around strategy inputs (e.g., overlap < chunk size, regex pattern required and length-limited, basic catastrophic-backtracking checks) and updates/extends chunker test coverage accordingly.
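The input checks listed above could look roughly like the following plain TypeScript sketch. The PR itself uses Zod at the API layer; the `ChunkingInput` shape, the `validateChunkingInput` name, and the nested-quantifier heuristic are all assumptions made for illustration, not the PR's code. Only the rules themselves (overlap below chunk size, pattern required for regex, 500-character limit, a basic backtracking check) come from the description.

```typescript
// Hypothetical input shape; the PR's real StrategyOptions type may differ.
type ChunkingStrategy = 'auto' | 'text' | 'token' | 'sentence' | 'recursive' | 'regex'

interface ChunkingInput {
  strategy: ChunkingStrategy
  maxSize: number
  overlap: number
  pattern?: string
}

// Pattern length limit stated in the PR.
const MAX_PATTERN_LENGTH = 500

// Crude heuristic (an assumption, not the PR's check): flag a quantified group
// whose body already contains a quantifier, such as (a+)+ or (\w*)*.
const NESTED_QUANTIFIER = /\([^()]*[+*][^()]*\)[+*{]/

function validateChunkingInput(input: ChunkingInput): string[] {
  const errors: string[] = []
  if (input.overlap >= input.maxSize) {
    errors.push('overlap must be less than maxSize')
  }
  if (input.strategy === 'regex') {
    const pattern = input.pattern ?? ''
    if (pattern.length === 0) {
      errors.push('regex strategy requires a pattern')
    } else if (pattern.length > MAX_PATTERN_LENGTH) {
      errors.push('pattern exceeds maximum length')
    } else if (NESTED_QUANTIFIER.test(pattern)) {
      errors.push('pattern may cause catastrophic backtracking')
    } else {
      try {
        new RegExp(pattern) // reject patterns that do not compile
      } catch {
        errors.push('pattern is not a valid regular expression')
      }
    }
  }
  return errors
}
```

A heuristic like this only catches the most obvious nested quantifiers, so it would complement rather than replace testing patterns against adversarial input.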

Reviewed by Cursor Bugbot for commit 97a0bd4.

@greptile-apps
Contributor

greptile-apps bot commented Apr 11, 2026

Greptile Summary

This PR adds four new chunking strategies (token, sentence, recursive, regex) by extracting shared utilities into utils.ts and wiring strategy selection through the full stack — UI → API → service → document processor. Previously flagged regressions (wrong overlap in TokenChunker, misleading label for the text strategy) have been addressed in follow-up commits.

Confidence Score: 5/5

Safe to merge; all remaining findings are P2 edge-case suggestions with no data-loss or correctness risk on typical inputs.

The two previously blocking issues (token chunker overlap, misleading label) are resolved. The two remaining findings require uncommon user inputs (regex capture groups, trailing commas in separators) and neither corrupts data for the default or normal usage paths.

Files affected by the two remaining findings: apps/sim/lib/chunkers/regex-chunker.ts (capture group handling) and apps/sim/app/workspace/[workspaceId]/knowledge/components/create-base-modal/create-base-modal.tsx (separator filtering).

Important Files Changed

| Filename | Overview |
| --- | --- |
| apps/sim/lib/chunkers/utils.ts | New shared utility module; buildChunks, addOverlap, splitAtWordBoundaries, and estimateTokens are all well-implemented with correct edge-case handling. |
| apps/sim/lib/chunkers/token-chunker.ts | New fixed-size token chunker; previously flagged buildChunks(chunks, 0) bug has been fixed to buildChunks(chunks, this.chunkOverlap). |
| apps/sim/lib/chunkers/regex-chunker.ts | New regex chunker with ReDoS guards; split() with capture groups will include matched delimiter text as spurious chunk content when user patterns contain capturing groups. |
| apps/sim/lib/chunkers/sentence-chunker.ts | New sentence-aware chunker; lookbehind regex correctly avoids abbreviations, overlap is applied at sentence granularity, and the minSentencesPerChunk guard is intentional. |
| apps/sim/lib/chunkers/recursive-chunker.ts | New recursive delimiter chunker; RECIPES are well-chosen, empty-string sentinel is handled correctly, overlap and buildChunks usage is consistent with other chunkers. |
| apps/sim/app/workspace/[workspaceId]/knowledge/components/create-base-modal/create-base-modal.tsx | Strategy selector wired correctly; customSeparators parsing omits a .filter() for empty strings, causing bypassed separators and early fallback to word-boundary splitting. |
| apps/sim/app/api/knowledge/route.ts | New Zod validation for strategy and strategyOptions including regex-requires-pattern refinement; the .default('auto').optional() combination is redundant but harmless. |
| apps/sim/lib/knowledge/documents/document-processor.ts | Strategy selection wired end-to-end through applyStrategy; strategy is correctly read from the knowledge base config in processDocumentAsync so async job processing is unaffected. |
| apps/sim/lib/chunkers/types.ts | Clean type additions: ChunkingStrategy, StrategyOptions, and chunker-specific option interfaces are all well-defined and consistently used. |
| apps/sim/lib/knowledge/types.ts | ChunkingConfig extended with optional strategy/strategyOptions; backwards-compatible with existing knowledge bases that have no strategy stored. |
| apps/sim/hooks/queries/kb/knowledge.ts | CreateKnowledgeBaseParams updated to include strategy/strategyOptions in chunkingConfig; mutation invalidates the correct query key prefix. |
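
The two remaining findings both come down to standard `String.prototype.split` behavior, which the short sketch below reproduces. The sample strings and the `parseSeparators` helper are hypothetical and are not taken from the PR; they only illustrate why capture groups leak delimiter text and why a trailing comma yields an empty separator.

```typescript
// split() includes the text matched by capturing groups in its result,
// so a user pattern with a capture group produces delimiter-only entries.
const withCapture = 'alpha---beta---gamma'.split(/(-{3})/)

// A non-capturing group splits on the same delimiter without leaking it.
const withoutCapture = 'alpha---beta---gamma'.split(/(?:-{3})/)

// A trailing comma in a comma-separated list yields an empty string entry.
const unfiltered = '##,**,'.split(',')

// Hypothetical helper: drop empty entries before using the list as separators.
function parseSeparators(raw: string): string[] {
  return raw.split(',').filter((separator) => separator !== '')
}

console.log(withCapture) // ['alpha', '---', 'beta', '---', 'gamma']
console.log(withoutCapture) // ['alpha', 'beta', 'gamma']
console.log(unfiltered) // ['##', '**', '']
console.log(parseSeparators('##,**,')) // ['##', '**']
```

The helper deliberately does not trim entries, since a separator made of whitespace may be intentional.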

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    UI["CreateBaseModal\n(strategy selector)"] -->|"POST /api/knowledge"| API["route.ts\nZod validation\n(strategy + strategyOptions)"]
    API --> KB_SVC["knowledge/service.ts\nstoreChunkingConfig in DB"]
    KB_SVC --> JOB["dispatchDocumentProcessingJob"]
    JOB --> PROC["processDocumentAsync\nreads strategy from DB"]
    PROC --> DP["document-processor.ts\napplyStrategy()"]
    DP -->|"auto"| AUTO{"auto-detect\ncontent type"}
    AUTO -->|json/yaml| JY["JsonYamlChunker"]
    AUTO -->|csv/xlsx| SD["StructuredDataChunker"]
    AUTO -->|"default"| TC["TextChunker"]
    DP -->|"token"| TK["TokenChunker"]
    DP -->|"sentence"| SC["SentenceChunker"]
    DP -->|"recursive"| RC["RecursiveChunker\n(plain/markdown/code recipe)"]
    DP -->|"regex"| RX["RegexChunker\n(user pattern + ReDoS guard)"]
    TK & SC & RC & RX & TC & JY & SD --> UTILS["utils.ts\nestimateTokens · cleanText\nsplitAtWordBoundaries · buildChunks · addOverlap"]
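
The routing in the flowchart can be summarized as a small dispatch function. This is a sketch under assumptions: `ContentKind` and `selectChunker` are hypothetical names, and the function returns chunker class names as strings rather than constructing the PR's actual classes.

```typescript
type ChunkingStrategy = 'auto' | 'token' | 'sentence' | 'recursive' | 'regex'

// Hypothetical content classification; the PR's detection logic is not shown here.
type ContentKind = 'json' | 'yaml' | 'csv' | 'xlsx' | 'other'

// Returns the name of the chunker the flowchart routes to.
function selectChunker(strategy: ChunkingStrategy, kind: ContentKind): string {
  switch (strategy) {
    case 'token':
      return 'TokenChunker'
    case 'sentence':
      return 'SentenceChunker'
    case 'recursive':
      return 'RecursiveChunker'
    case 'regex':
      return 'RegexChunker'
    case 'auto':
      // Auto-detection keeps the pre-existing behavior.
      if (kind === 'json' || kind === 'yaml') return 'JsonYamlChunker'
      if (kind === 'csv' || kind === 'xlsx') return 'StructuredDataChunker'
      return 'TextChunker'
  }
}
```

An explicit strategy takes precedence over the detected content kind, which matches the flowchart: only the `auto` branch inspects the content type.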
Reviews (5): Last reviewed commit: "fix(chunkers): restore separator-as-join..."

- Refactor all existing chunkers (Text, JsonYaml, StructuredData, Docs) to use shared utils
- Fix inconsistent token estimation (JsonYaml used tiktoken, StructuredData used /3 ratio)
- Fix DocsChunker operator precedence bug and hard-coded 300-token limit
- Fix JsonYamlChunker isStructuredData false positive on plain strings
- Add MAX_DEPTH recursion guard to JsonYamlChunker
- Replace @/components/ui/select with emcn DropdownMenu in strategy selector
- Expand RecursiveChunker recipes: markdown adds horizontal rules, code
  fences, blockquotes; code adds const/let/var/if/for/while/switch/return
- RecursiveChunker fallback uses splitAtWordBoundaries instead of char slicing
- RegexChunker ReDoS test uses adversarial strings (repeated chars, spaces)
- SentenceChunker abbreviation list adds St/Rev/Gen/No/Fig/Vol/months
  and single-capital-letter lookbehind
- Add overlap < maxSize validation in Zod schema and UI form
- Add pattern max length (500) validation in Zod schema
- Fix StructuredDataChunker footer grammar
- DocsChunker: extract headers from cleaned content (not raw markdown)
  to fix position mismatch between header positions and chunk positions
- DocsChunker: strip export statements and JSX expressions in cleanContent
- DocsChunker: fix table merge dedup using equality instead of includes
- JsonYamlChunker: preserve path breadcrumbs when nested value fits in
  one chunk, matching LangChain RecursiveJsonSplitter behavior
- StructuredDataChunker: detect 2-column CSV (lowered threshold from >2
  to >=1) and use 20% relative tolerance instead of absolute +/-2
- TokenChunker: use sliding window overlap (matching LangChain/Chonkie)
  where chunks stay within chunkSize instead of exceeding it
- utils: splitAtWordBoundaries accepts optional stepChars for sliding
  window overlap; addOverlap uses newline join instead of space
- Fix SentenceChunker regex: lookbehinds now include the period to correctly handle abbreviations (Mr., Dr., etc.), initials (J.K.), and decimals
- Fix RegexChunker ReDoS: reset lastIndex between adversarial test iterations, add poisoned-suffix test strings
- Fix DocsChunker: skip code blocks during table boundary detection to prevent false positives from pipe characters
- Fix JsonYamlChunker: oversized primitive leaf values now fall back to text chunking instead of emitting a single chunk
- Fix TokenChunker: pass 0 to buildChunks for overlap metadata since sliding window handles overlap inherently
- Add defensive guard in splitAtWordBoundaries to prevent infinite loops if step is 0
- Add tests for utils, TokenChunker, SentenceChunker, RecursiveChunker, RegexChunker (236 total tests, 0 failures)
- Fix existing test expectations for updated footer format and isStructuredData behavior
Strip 445 lines of redundant TSDoc, math calculation comments,
implementation rationale notes, and assertion-restating comments
across all chunker source and test files.
- Fix regex fallback path: use sliding window for overlap instead of
  passing chunkOverlap to buildChunks without prepended overlap text
- Fix misleading strategy label: "Text (hierarchical splitting)" →
  "Text (word boundary splitting)"
Use addOverlap + buildChunks(chunks, overlap) in the regex fallback
path to match the main path and all other chunkers (TextChunker,
RecursiveChunker). The sliding window approach was inconsistent.
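
The SentenceChunker fix above (lookbehinds that include the period) can be illustrated with a minimal sketch. The abbreviation list here is deliberately shortened and the `splitSentences` name is an assumption; the PR's actual regex and list are not reproduced in this conversation.

```typescript
// Split on whitespace that follows ., ! or ?, unless the period belongs to a
// known abbreviation (Mr., Dr., ...) or a single capital-letter initial (J.).
// Built with new RegExp so the lookbehinds are not subject to target checks.
const SENTENCE_BOUNDARY = new RegExp(
  '(?<=[.!?])' + // boundary must follow sentence punctuation
    '(?<!\\b(?:Mr|Mrs|Ms|Dr|St)\\.)' + // but not an abbreviation plus period
    '(?<!\\b[A-Z]\\.)' + // and not an initial such as "J."
    '\\s+'
)

function splitSentences(text: string): string[] {
  return text
    .split(SENTENCE_BOUNDARY)
    .map((sentence) => sentence.trim())
    .filter((sentence) => sentence.length > 0)
}

console.log(splitSentences('Dr. Smith met J. K. Rowling. They talked! Pi is 3.14 today.'))
// ['Dr. Smith met J. K. Rowling.', 'They talked!', 'Pi is 3.14 today.']
```

Decimals such as 3.14 survive because the boundary requires whitespace after the period. One limitation of this sketch is that a sentence ending in a single capital letter ("Group A. Next...") would not be split.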
@waleedlatif1
Collaborator Author

@greptile

@waleedlatif1
Collaborator Author

@cursor review

When splitAtWordBoundaries snaps end back to a word boundary, advance
pos from end (not pos + step) in non-overlapping mode. The step-based
advancement is preserved for the sliding window case (TokenChunker).
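
A rough sketch of the advancement rule described in this commit message follows. The function and `stepChars` names come from the PR, but the body, the `chunkChars` parameter, and the way the window start is re-aligned are assumptions for illustration, not the PR's implementation.

```typescript
function splitAtWordBoundaries(text: string, chunkChars: number, stepChars?: number): string[] {
  const size = Math.max(1, chunkChars)
  const chunks: string[] = []
  let pos = 0
  while (pos < text.length) {
    let end = Math.min(pos + size, text.length)
    // Snap the end back to the last space in the window so words stay whole.
    if (end < text.length) {
      const lastSpace = text.lastIndexOf(' ', end)
      if (lastSpace > pos) end = lastSpace
    }
    const piece = text.slice(pos, end).trim()
    if (piece.length > 0) chunks.push(piece)
    if (end >= text.length) break
    if (stepChars === undefined) {
      // Non-overlapping mode: continue from the snapped end so nothing is skipped.
      pos = end
    } else {
      // Sliding-window mode: advance by the step (at least 1, which guards
      // against an infinite loop when step is 0), capped at end.
      const prev = pos
      pos = Math.min(pos + Math.max(1, stepChars), end)
      // Assumption of this sketch: back up to the start of the current word.
      while (pos > prev + 1 && text.charAt(pos - 1) !== ' ') pos--
    }
  }
  return chunks
}

console.log(splitAtWordBoundaries('aaa bbb ccc', 5))
// ['aaa', 'bbb', 'ccc']
console.log(splitAtWordBoundaries('one two three four five', 13, 8))
// ['one two three', 'three four', 'four five']
```

Advancing by `pos + step` after a snap-back would skip the text between the snapped end and the nominal window end, which is the situation the commit addresses.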
@waleedlatif1
Collaborator Author

@cursor review

When no complete sentence fits within the overlap budget,
fall back to character-level word-boundary overlap from the
previous group's text. This ensures buildChunks metadata is
always correct.
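
The fallback described here might be sketched as follows. The `buildOverlap` name, the character-based budget, and the exact snapping rule are assumptions made for illustration; the PR's code may measure the budget in tokens.

```typescript
// Build overlap text from the previous group of sentences within a character budget.
function buildOverlap(prevSentences: string[], budgetChars: number): string {
  if (budgetChars <= 0 || prevSentences.length === 0) return ''

  // Preferred path: take whole trailing sentences while they fit.
  const kept: string[] = []
  let used = 0
  for (let i = prevSentences.length - 1; i >= 0; i--) {
    const cost = prevSentences[i].length + (kept.length > 0 ? 1 : 0)
    if (used + cost > budgetChars) break
    kept.unshift(prevSentences[i])
    used += cost
  }
  if (kept.length > 0) return kept.join(' ')

  // Fallback: no complete sentence fits, so take the tail of the previous
  // text and drop any partial leading word.
  const prevText = prevSentences.join(' ')
  const start = prevText.length - budgetChars
  const tail = prevText.slice(start)
  if (prevText.charAt(start - 1) === ' ') return tail.trim()
  const firstSpace = tail.indexOf(' ')
  return firstSpace === -1 ? '' : tail.slice(firstSpace + 1).trim()
}

console.log(buildOverlap(['First sentence.', 'Second one.'], 20))
// 'Second one.'
console.log(buildOverlap(['A fairly long closing sentence here.'], 10))
// 'here.'
```

Because the function always returns the text it actually selected, the overlap recorded in chunk metadata can be derived from that return value rather than from the configured budget.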
@waleedlatif1
Collaborator Author

@greptile

@waleedlatif1
Collaborator Author

@cursor review

- Fix regex fallback log: "character splitting" → "word-boundary splitting"
- Add Jun and Jul to sentence chunker abbreviation list
avgCount >= 1 was too permissive — prose with consistent comma usage
would be misclassified as CSV. Restore original > 2 threshold while
keeping the improved proportional tolerance.
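
A detection heuristic along these lines might look like the sketch below. It assumes that `avgCount` is the average number of delimiters per line and that every line must fall within the tolerance; the `looksLikeCsv` name and both of those details are guesses, while the `> 2` threshold and the 20% relative tolerance come from the commit messages.

```typescript
function looksLikeCsv(text: string, delimiter: string = ','): boolean {
  const lines = text
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
  if (lines.length < 2) return false

  // Count delimiters on each non-empty line.
  const counts = lines.map((line) => line.split(delimiter).length - 1)
  const avgCount = counts.reduce((sum, count) => sum + count, 0) / counts.length

  // Restored threshold: require more than 2 delimiters per line on average.
  if (avgCount <= 2) return false

  // Proportional tolerance: each line must be within 20% of the average.
  const tolerance = avgCount * 0.2
  return counts.every((count) => Math.abs(count - avgCount) <= tolerance)
}

console.log(looksLikeCsv('a,b,c,d\n1,2,3,4\n5,6,7,8')) // true
console.log(looksLikeCsv('Well, yes.\nNo commas here\nOne, two')) // false
```

Under this threshold a two-column CSV (one delimiter per line) is not detected, which is the trade-off the commit accepts to avoid misclassifying prose.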
@waleedlatif1
Collaborator Author

@greptile

@waleedlatif1
Collaborator Author

@cursor review

@waleedlatif1
Collaborator Author

@greptile

@waleedlatif1
Collaborator Author

@cursor review

Separator was unconditionally prepended to parts after the first,
leaving leading punctuation on chunks after a boundary reset.
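
The separator-as-joiner behavior can be sketched as below: the separator is inserted only between parts that end up in the same chunk, so a chunk that starts after a boundary reset begins with clean text. The `mergeParts` name and the character-based size limit are assumptions for illustration.

```typescript
function mergeParts(parts: string[], separator: string, maxChars: number): string[] {
  const chunks: string[] = []
  let current = ''
  for (const part of parts) {
    if (part.length === 0) continue
    // Use the separator as a joiner only when appending to an existing chunk.
    const candidate = current.length === 0 ? part : current + separator + part
    if (current.length === 0 || candidate.length <= maxChars) {
      // An oversized first part is kept as-is here; a real chunker would split it further.
      current = candidate
    } else {
      chunks.push(current)
      current = part // new chunk starts without a leading separator
    }
  }
  if (current.length > 0) chunks.push(current)
  return chunks
}

console.log(mergeParts(['one', 'two', 'three', 'four'], '. ', 10))
// ['one. two', 'three', 'four']
```

Prepending the separator to every part after the first would instead produce chunks such as `. three`, which is the leading punctuation the commit removes.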
@waleedlatif1
Collaborator Author

@greptile

@waleedlatif1
Collaborator Author

@cursor review

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Reviewed by Cursor Bugbot for commit 97a0bd4.

Parses JSON Lines files by splitting on newlines and converting to a
JSON array, which then flows through the existing JsonYamlChunker.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
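
The JSON Lines handling described in this commit could be sketched as follows. The `parseJsonLines` name is hypothetical, and skipping blank lines is an assumption; the commit message only states that the file is split on newlines and converted to a JSON array.

```typescript
// Convert JSON Lines content into an array of parsed values.
function parseJsonLines(content: string): unknown[] {
  return content
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0) // assumption: blank lines are ignored
    .map((line) => JSON.parse(line) as unknown)
}

console.log(JSON.stringify(parseJsonLines('{"a":1}\n\n{"a":2}\n')))
// [{"a":1},{"a":2}]
```

The resulting array can then be handed to whatever already chunks JSON arrays, which per the commit message is the existing JsonYamlChunker. A malformed line makes `JSON.parse` throw, so a caller would need to decide whether to skip or reject such input.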
@waleedlatif1 waleedlatif1 merged commit 1acafe8 into staging Apr 11, 2026
11 checks passed
@waleedlatif1 waleedlatif1 deleted the waleedlatif1/add-chunkers branch April 11, 2026 04:33