feat(knowledge): add token, sentence, recursive, and regex chunkers by waleedlatif1 · Pull Request #4102 · simstudioai/sim

waleedlatif1 · 2026-04-11T00:37:11Z

Summary

- Add 4 new chunking strategies: token, sentence, recursive, regex
- Extract shared chunking utilities (estimateTokens, cleanText, addOverlap, splitAtWordBoundaries) into utils.ts
- Wire strategy selection through the full stack: UI → API → service → document processor
- Default to auto-detection (existing behavior); users can optionally select a strategy
- Add ReDoS protection for user-supplied regex patterns
- Add Zod validation for strategy options at the API layer

Type of Change

New feature

Testing

Tested manually. All 53 existing chunker tests pass.

Checklist

Code follows project style guidelines
Self-reviewed my changes
Tests added/updated and passing
No new warnings introduced
I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

vercel · 2026-04-11T00:37:16Z

The latest updates on your projects.

1 Skipped Deployment

Project	Deployment	Actions	Updated (UTC)
docs	Skipped		Apr 11, 2026 4:33am

cursor · 2026-04-11T00:37:20Z

PR Summary

Medium Risk
Touches the document ingestion/chunking pipeline end-to-end (UI → API → processing), which can change chunk boundaries, token counts, and embedding behavior across all new uploads. Regex-based splitting adds some input-safety checks but still carries performance/edge-case risk if patterns or separators behave unexpectedly on large inputs.

Overview
Adds user-selectable chunking strategies for knowledge bases (token, sentence, recursive, regex, plus auto) and wires them through creation UI, /api/knowledge validation, query types, stored chunkingConfig, and processDocument strategy selection.

Introduces new chunker implementations (TokenChunker, SentenceChunker, RecursiveChunker, RegexChunker) plus shared helpers in chunkers/utils.ts (token estimation, cleaning, overlap, word-boundary splitting, metadata building), and refactors existing chunkers (notably TextChunker, JsonYamlChunker, DocsChunker, StructuredDataChunker) to use the shared utilities and adjusted detection/token-estimation behavior.

Enhances safety/validation around strategy inputs (e.g., overlap < chunk size, regex pattern required and length-limited, basic catastrophic-backtracking checks) and updates/extends chunker test coverage accordingly.
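
The validation rules listed above (overlap smaller than chunk size, a required and length-limited regex pattern) can be sketched without any library. This is a minimal, dependency-free illustration and not the PR's Zod schema: the `ChunkingInput` shape, the field names, and `validateChunkingInput` are hypothetical. Only the three rules and the 500-character limit come from the PR text.

```typescript
// Hypothetical input shape; the real request schema in the PR may differ.
interface ChunkingInput {
  strategy: string
  maxSize: number
  overlap: number
  pattern?: string
}

// The PR states a maximum pattern length of 500.
const MAX_PATTERN_LENGTH = 500

// Returns a list of human-readable problems; an empty list means the input passed.
function validateChunkingInput(input: ChunkingInput): string[] {
  const errors: string[] = []

  // Rule 1: overlap must be strictly smaller than the chunk size.
  if (input.overlap >= input.maxSize) {
    errors.push('overlap must be less than maxSize')
  }

  // Rules 2 and 3 only apply to the regex strategy.
  if (input.strategy === 'regex') {
    if (!input.pattern) {
      errors.push('regex strategy requires a pattern')
    } else if (input.pattern.length > MAX_PATTERN_LENGTH) {
      errors.push('pattern must be at most ' + MAX_PATTERN_LENGTH + ' characters')
    }
  }

  return errors
}
```

In a Zod schema the same checks would typically live in a refinement on the object, so that the cross-field rule (overlap versus size) can see both values.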

Reviewed by Cursor Bugbot for commit 97a0bd4.

greptile-apps · 2026-04-11T00:41:37Z

Greptile Summary

This PR adds four new chunking strategies (token, sentence, recursive, regex) by extracting shared utilities into utils.ts and wiring strategy selection through the full stack — UI → API → service → document processor. Previously flagged regressions (wrong overlap in TokenChunker, misleading label for the text strategy) have been addressed in follow-up commits.

Confidence Score: 5/5

Safe to merge; all remaining findings are P2 edge-case suggestions with no data-loss or correctness risk on typical inputs.

The two previously blocking issues (token chunker overlap, misleading label) are resolved. The two remaining findings require uncommon user inputs (regex capture groups, trailing commas in separators) and neither corrupts data for the default or normal usage paths.

apps/sim/lib/chunkers/regex-chunker.ts (capture group handling), apps/sim/app/workspace/[workspaceId]/knowledge/components/create-base-modal/create-base-modal.tsx (separator filtering)
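
The separator-filtering finding comes down to how `String.prototype.split` treats a trailing comma. The sketch below is an illustration only: `parseSeparators` is a hypothetical helper, and the assumption that the modal parses a comma-separated string is taken from the review summary, not from the actual component code.

```typescript
// Hypothetical helper: turn a comma-separated input into a separator list,
// dropping the empty strings that a trailing or doubled comma produces.
// Entries are deliberately not trimmed, since whitespace can be a valid separator.
function parseSeparators(raw: string): string[] {
  return raw.split(',').filter((s) => s !== '')
}

// Without the filter, a trailing comma leaves an empty-string entry behind.
const unfiltered = '##,---,'.split(',')
const filtered = parseSeparators('##,---,')

console.log(JSON.stringify(unfiltered)) // ["##","---",""]
console.log(JSON.stringify(filtered)) // ["##","---"]
```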

Important Files Changed

| Filename | Overview |
| --- | --- |
| apps/sim/lib/chunkers/utils.ts | New shared utility module; buildChunks, addOverlap, splitAtWordBoundaries, and estimateTokens are all well-implemented with correct edge-case handling. |
| apps/sim/lib/chunkers/token-chunker.ts | New fixed-size token chunker; previously flagged buildChunks(chunks, 0) bug has been fixed to buildChunks(chunks, this.chunkOverlap). |
| apps/sim/lib/chunkers/regex-chunker.ts | New regex chunker with ReDoS guards; split() with capture groups will include matched delimiter text as spurious chunk content when user patterns contain capturing groups. |
| apps/sim/lib/chunkers/sentence-chunker.ts | New sentence-aware chunker; lookbehind regex correctly avoids abbreviations, overlap is applied at sentence granularity, and the minSentencesPerChunk guard is intentional. |
| apps/sim/lib/chunkers/recursive-chunker.ts | New recursive delimiter chunker; RECIPES are well-chosen, empty-string sentinel is handled correctly, overlap and buildChunks usage is consistent with other chunkers. |
| apps/sim/app/workspace/[workspaceId]/knowledge/components/create-base-modal/create-base-modal.tsx | Strategy selector wired correctly; customSeparators parsing omits a .filter() for empty strings, causing bypassed separators and early fallback to word-boundary splitting. |
| apps/sim/app/api/knowledge/route.ts | New Zod validation for strategy and strategyOptions including regex-requires-pattern refinement; the .default('auto').optional() combination is redundant but harmless. |
| apps/sim/lib/knowledge/documents/document-processor.ts | Strategy selection wired end-to-end through applyStrategy; strategy is correctly read from the knowledge base config in processDocumentAsync so async job processing is unaffected. |
| apps/sim/lib/chunkers/types.ts | Clean type additions: ChunkingStrategy, StrategyOptions, and chunker-specific option interfaces are all well-defined and consistently used. |
| apps/sim/lib/knowledge/types.ts | ChunkingConfig extended with optional strategy/strategyOptions; backwards-compatible with existing knowledge bases that have no strategy stored. |
| apps/sim/hooks/queries/kb/knowledge.ts | CreateKnowledgeBaseParams updated to include strategy/strategyOptions in chunkingConfig; mutation invalidates the correct query key prefix. |
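
The capture-group note for regex-chunker.ts reflects standard JavaScript behavior: when the pattern passed to `split()` contains a capturing group, the captured delimiter text is spliced into the result array. The snippet below only demonstrates that language behavior with made-up sample text. It does not reproduce RegexChunker, and using a non-capturing group is just one possible way to avoid the effect.

```typescript
const sample = 'alpha---beta---gamma'

// Capturing group: the matched delimiters appear as extra array entries.
const withCapture = sample.split(/(-{3})/)

// Non-capturing group: only the text between delimiters is returned.
const withoutCapture = sample.split(/(?:-{3})/)

console.log(JSON.stringify(withCapture)) // ["alpha","---","beta","---","gamma"]
console.log(JSON.stringify(withoutCapture)) // ["alpha","beta","gamma"]
```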

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    UI["CreateBaseModal\n(strategy selector)"] -->|"POST /api/knowledge"| API["route.ts\nZod validation\n(strategy + strategyOptions)"]
    API --> KB_SVC["knowledge/service.ts\nstoreChunkingConfig in DB"]
    KB_SVC --> JOB["dispatchDocumentProcessingJob"]
    JOB --> PROC["processDocumentAsync\nreads strategy from DB"]
    PROC --> DP["document-processor.ts\napplyStrategy()"]
    DP -->|"auto"| AUTO{"auto-detect\ncontent type"}
    AUTO -->|json/yaml| JY["JsonYamlChunker"]
    AUTO -->|csv/xlsx| SD["StructuredDataChunker"]
    AUTO -->|"default"| TC["TextChunker"]
    DP -->|"token"| TK["TokenChunker"]
    DP -->|"sentence"| SC["SentenceChunker"]
    DP -->|"recursive"| RC["RecursiveChunker\n(plain/markdown/code recipe)"]
    DP -->|"regex"| RX["RegexChunker\n(user pattern + ReDoS guard)"]
    TK & SC & RC & RX & TC & JY & SD --> UTILS["utils.ts\nestimateTokens · cleanText\nsplitAtWordBoundaries · buildChunks · addOverlap"]

Reviews (5): Last reviewed commit: "fix(chunkers): restore separator-as-join..."

- Refactor all existing chunkers (Text, JsonYaml, StructuredData, Docs) to use shared utils
- Fix inconsistent token estimation (JsonYaml used tiktoken, StructuredData used /3 ratio)
- Fix DocsChunker operator precedence bug and hard-coded 300-token limit
- Fix JsonYamlChunker isStructuredData false positive on plain strings
- Add MAX_DEPTH recursion guard to JsonYamlChunker
- Replace @/components/ui/select with emcn DropdownMenu in strategy selector

- Expand RecursiveChunker recipes: markdown adds horizontal rules, code fences, blockquotes; code adds const/let/var/if/for/while/switch/return
- RecursiveChunker fallback uses splitAtWordBoundaries instead of char slicing
- RegexChunker ReDoS test uses adversarial strings (repeated chars, spaces)
- SentenceChunker abbreviation list adds St/Rev/Gen/No/Fig/Vol/months and single-capital-letter lookbehind
- Add overlap < maxSize validation in Zod schema and UI form
- Add pattern max length (500) validation in Zod schema
- Fix StructuredDataChunker footer grammar

- DocsChunker: extract headers from cleaned content (not raw markdown) to fix position mismatch between header positions and chunk positions
- DocsChunker: strip export statements and JSX expressions in cleanContent
- DocsChunker: fix table merge dedup using equality instead of includes
- JsonYamlChunker: preserve path breadcrumbs when nested value fits in one chunk, matching LangChain RecursiveJsonSplitter behavior
- StructuredDataChunker: detect 2-column CSV (lowered threshold from >2 to >=1) and use 20% relative tolerance instead of absolute +/-2
- TokenChunker: use sliding window overlap (matching LangChain/Chonkie) where chunks stay within chunkSize instead of exceeding it
- utils: splitAtWordBoundaries accepts optional stepChars for sliding window overlap; addOverlap uses newline join instead of space
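
The difference between sliding-window overlap and prepended overlap mentioned in the last two items can be sketched on plain character windows. This is an illustration under simplifying assumptions: it works on characters rather than estimated tokens, and `slidingWindows` and `prependOverlap` are hypothetical names, not the PR's TokenChunker or addOverlap.

```typescript
// Sliding window: each chunk is at most windowChars long, and consecutive
// chunks share overlapChars characters because the start advances by the step.
function slidingWindows(text: string, windowChars: number, overlapChars: number): string[] {
  const step = windowChars - overlapChars
  // Guard against a zero or negative step, which would never make progress.
  if (windowChars <= 0 || step <= 0) {
    throw new Error('window must be positive and larger than the overlap')
  }
  const out: string[] = []
  for (let pos = 0; pos < text.length; pos += step) {
    out.push(text.slice(pos, pos + windowChars))
    if (pos + windowChars >= text.length) break // last window reached the end
  }
  return out
}

// Prepended overlap: the tail of the previous chunk is joined onto the next
// one with a newline, so chunks after the first grow beyond the base size.
function prependOverlap(chunks: string[], overlapChars: number): string[] {
  return chunks.map((chunk, i) => {
    if (i === 0 || overlapChars <= 0) return chunk
    return chunks[i - 1].slice(-overlapChars) + '\n' + chunk
  })
}

console.log(JSON.stringify(slidingWindows('abcdefghij', 4, 2)))
// ["abcd","cdef","efgh","ghij"]
console.log(JSON.stringify(prependOverlap(['abcd', 'efgh', 'ij'], 2)))
// ["abcd","cd\nefgh","gh\nij"]
```

With a window of 4 and an overlap of 2, every sliding-window chunk stays at 4 characters, while the prepended variant produces a 7-character second chunk.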

- Fix SentenceChunker regex: lookbehinds now include the period to correctly handle abbreviations (Mr., Dr., etc.), initials (J.K.), and decimals
- Fix RegexChunker ReDoS: reset lastIndex between adversarial test iterations, add poisoned-suffix test strings
- Fix DocsChunker: skip code blocks during table boundary detection to prevent false positives from pipe characters
- Fix JsonYamlChunker: oversized primitive leaf values now fall back to text chunking instead of emitting a single chunk
- Fix TokenChunker: pass 0 to buildChunks for overlap metadata since sliding window handles overlap inherently
- Add defensive guard in splitAtWordBoundaries to prevent infinite loops if step is 0
- Add tests for utils, TokenChunker, SentenceChunker, RecursiveChunker, RegexChunker (236 total tests, 0 failures)
- Fix existing test expectations for updated footer format and isStructuredData behavior
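
The lookbehind approach described in the first item can be illustrated with a small sentence splitter. This is a sketch under assumptions: the abbreviation list is a short subset chosen for the example, `splitSentences` is a hypothetical name, and the actual SentenceChunker pattern is not shown in this PR page. The pattern is built with `new RegExp` from a string so that it does not depend on the compiler's regex-literal checks.

```typescript
// Split on whitespace that follows ., ! or ?, unless the text just before the
// whitespace is a known abbreviation ("Dr.") or a single capital initial ("J.").
// Note that the lookbehinds include the period itself.
const SENTENCE_BOUNDARY = new RegExp(
  '(?<!\\b(?:Mr|Mrs|Ms|Dr|St|Fig|Vol)\\.)' + // abbreviation followed by a period
    '(?<!\\b[A-Z]\\.)' + // single capital letter followed by a period
    '(?<=[.!?])\\s+' // sentence-ending punctuation, then whitespace
)

function splitSentences(text: string): string[] {
  return text
    .split(SENTENCE_BOUNDARY)
    .map((s) => s.trim())
    .filter((s) => s.length > 0)
}

console.log(splitSentences('Dr. Smith met J. K. Rowling in St. Louis. It cost 3.5 dollars! Really?'))
// [ 'Dr. Smith met J. K. Rowling in St. Louis.', 'It cost 3.5 dollars!', 'Really?' ]
```

Decimals such as 3.5 need no special case here because no whitespace follows the period. The trade-off of the initial rule is that a sentence ending in a single capital letter (for example "Plan B.") is not split.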

waleedlatif1 · 2026-04-11T01:51:29Z

@greptile

waleedlatif1 · 2026-04-11T01:51:32Z

@cursor review

waleedlatif1 · 2026-04-11T02:03:09Z

@cursor review

waleedlatif1 · 2026-04-11T02:30:41Z

@greptile

waleedlatif1 · 2026-04-11T02:30:45Z

@cursor review

waleedlatif1 · 2026-04-11T02:49:19Z

@greptile

waleedlatif1 · 2026-04-11T02:49:22Z

@cursor review

waleedlatif1 · 2026-04-11T03:40:57Z

@greptile

waleedlatif1 · 2026-04-11T03:41:04Z

@cursor review

waleedlatif1 · 2026-04-11T03:55:32Z

@greptile

waleedlatif1 · 2026-04-11T03:55:35Z

@cursor review

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

feat(knowledge): add token, sentence, recursive, and regex chunkers

9f83f87

greptile-apps bot reviewed Apr 11, 2026

apps/sim/lib/chunkers/regex-chunker.ts (resolved)

...sim/app/workspace/[workspaceId]/knowledge/components/create-base-modal/create-base-modal.tsx (outdated, resolved)

cursor bot reviewed Apr 11, 2026

apps/sim/lib/chunkers/sentence-chunker.ts (outdated, resolved)

apps/sim/lib/chunkers/regex-chunker.ts (resolved)

vercel bot temporarily deployed to Preview April 11, 2026 00:50 Inactive

vercel bot temporarily deployed to Preview April 11, 2026 00:55 Inactive

vercel bot temporarily deployed to Preview April 11, 2026 01:01 Inactive

chore(chunkers): lint formatting

4872e75

vercel bot temporarily deployed to Preview April 11, 2026 01:13 Inactive

updated styling

fc006ee

vercel bot temporarily deployed to Preview April 11, 2026 01:17 Inactive

vercel bot temporarily deployed to Preview April 11, 2026 01:32 Inactive

chore(chunkers): remove unnecessary comments and dead code

cb814ff

Strip 445 lines of redundant TSDoc, math calculation comments, implementation rationale notes, and assertion-restating comments across all chunker source and test files.

vercel bot temporarily deployed to Preview April 11, 2026 01:46 Inactive

fix(chunkers): address PR review comments

899fc68

- Fix regex fallback path: use sliding window for overlap instead of passing chunkOverlap to buildChunks without prepended overlap text
- Fix misleading strategy label: "Text (hierarchical splitting)" → "Text (word boundary splitting)"

vercel bot temporarily deployed to Preview April 11, 2026 01:47 Inactive

fix(chunkers): use consistent overlap pattern in regex fallback

4c3508b

Use addOverlap + buildChunks(chunks, overlap) in the regex fallback path to match the main path and all other chunkers (TextChunker, RecursiveChunker). The sliding window approach was inconsistent.

vercel bot temporarily deployed to Preview April 11, 2026 01:51 Inactive

cursor bot reviewed Apr 11, 2026

apps/sim/lib/chunkers/utils.ts (resolved)

apps/sim/lib/chunkers/text-chunker.ts (resolved)

fix(chunkers): prevent content loss in word boundary splitting

3a26dad

When splitAtWordBoundaries snaps end back to a word boundary, advance pos from end (not pos + step) in non-overlapping mode. The step-based advancement is preserved for the sliding window case (TokenChunker).
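
The fix described here can be sketched for the non-overlapping case. This is a simplified stand-in, not the PR's utility: it measures size in characters, treats only a space as a word boundary, and omits the optional step parameter used for sliding windows. The signature shown is an assumption for illustration.

```typescript
// Non-overlapping word-boundary splitting. After snapping `end` back to the
// last space, the next chunk starts at `end`. Advancing by a fixed step
// instead would skip the characters between the snapped end and pos + maxChars.
function splitAtWordBoundaries(text: string, maxChars: number): string[] {
  if (maxChars <= 0) throw new Error('maxChars must be positive')
  const parts: string[] = []
  let pos = 0
  while (pos < text.length) {
    let end = Math.min(pos + maxChars, text.length)
    if (end < text.length) {
      // Snap back to the nearest space at or before `end`, if there is one
      // after the current start; otherwise keep the hard cut.
      const lastSpace = text.lastIndexOf(' ', end)
      if (lastSpace > pos) end = lastSpace
    }
    const piece = text.slice(pos, end).trim()
    if (piece) parts.push(piece)
    pos = end // advance from the snapped end so no content is dropped
  }
  return parts
}

console.log(splitAtWordBoundaries('the quick brown fox jumps', 12))
// [ 'the quick', 'brown fox', 'jumps' ]
```

With a limit of 12 the first window would end inside "brown", so it snaps back to "the quick". Had the position advanced to 12 rather than 9, the letters "br" would have been lost from the next chunk.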

vercel bot temporarily deployed to Preview April 11, 2026 02:01 Inactive

fix(chunkers): fall back to character-level overlap in sentence chunker

ec6fa58

When no complete sentence fits within the overlap budget, fall back to character-level word-boundary overlap from the previous group's text. This ensures buildChunks metadata is always correct.

vercel bot temporarily deployed to Preview April 11, 2026 02:29 Inactive

fix(chunkers): fix log message and add missing month abbreviations

e391efa

- Fix regex fallback log: "character splitting" → "word-boundary splitting"
- Add Jun and Jul to sentence chunker abbreviation list

vercel bot temporarily deployed to Preview April 11, 2026 02:37 Inactive

cursor bot reviewed Apr 11, 2026

apps/sim/lib/chunkers/structured-data-chunker.ts (resolved)

lint

f7fe06a

vercel bot temporarily deployed to Preview April 11, 2026 02:45 Inactive

fix(chunkers): restore structured data detection threshold to > 2

9c624db

avgCount >= 1 was too permissive — prose with consistent comma usage would be misclassified as CSV. Restore original > 2 threshold while keeping the improved proportional tolerance.
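
The threshold trade-off can be made concrete with a small heuristic. This sketch is an assumption-heavy illustration: only the "> 2" average threshold and the 20% relative tolerance come from the commit message, while `looksLikeCsv`, the per-line comma counting, and the way the tolerance is applied are hypothetical and may differ from StructuredDataChunker.

```typescript
// Heuristic: treat content as CSV when lines average more than 2 commas and
// every line's comma count is within 20% of that average.
function looksLikeCsv(lines: string[]): boolean {
  const counts = lines
    .filter((line) => line.trim().length > 0)
    .map((line) => line.split(',').length - 1)
  if (counts.length < 2) return false

  const avgCount = counts.reduce((sum, c) => sum + c, 0) / counts.length
  // Restored threshold: an average of 1 comma per line is common in prose.
  if (avgCount <= 2) return false

  const tolerance = avgCount * 0.2 // proportional rather than absolute +/-2
  return counts.every((c) => Math.abs(c - avgCount) <= tolerance)
}

console.log(looksLikeCsv(['a,b,c,d', '1,2,3,4', '5,6,7,8'])) // true
console.log(looksLikeCsv(['Hello, world.', 'Yes, indeed.'])) // false
```

Under this rule a two-column CSV (one comma per line) is no longer detected automatically, which is the cost of not misclassifying prose.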

vercel bot temporarily deployed to Preview April 11, 2026 02:46 Inactive

cursor bot reviewed Apr 11, 2026

apps/sim/lib/chunkers/utils.ts (resolved)

apps/sim/lib/knowledge/documents/document-processor.ts (resolved)

apps/sim/lib/chunkers/recursive-chunker.ts (resolved)

greptile-apps bot reviewed Apr 11, 2026

apps/sim/lib/chunkers/token-chunker.ts (outdated, resolved)

fix(chunkers): pass chunkOverlap to buildChunks in TokenChunker

4fd7685

vercel bot temporarily deployed to Preview April 11, 2026 03:15 Inactive

cursor bot reviewed Apr 11, 2026

apps/sim/lib/chunkers/text-chunker.ts (outdated, resolved)

fix(chunkers): restore separator-as-joiner pattern in splitRecursively

97a0bd4

Separator was unconditionally prepended to parts after the first, leaving leading punctuation on chunks after a boundary reset.

vercel bot temporarily deployed to Preview April 11, 2026 03:54 Inactive

cursor bot reviewed Apr 11, 2026

apps/sim/lib/chunkers/json-yaml-chunker.ts (resolved)

apps/sim/lib/chunkers/regex-chunker.ts (resolved)

feat(knowledge): add JSONL file support for knowledge base uploads

2c5a852

Parses JSON Lines files by splitting on newlines and converting to a JSON array, which then flows through the existing JsonYamlChunker. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
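
The JSONL handling described above amounts to parsing each line on its own and collecting the results into one array. The sketch below shows that idea only. `parseJsonl` is a hypothetical name, and skipping blank lines is an assumption rather than something the commit message states.

```typescript
// Parse JSON Lines content into an array of values, one per non-empty line.
// A malformed line makes JSON.parse throw; error handling is left out here.
function parseJsonl(content: string): unknown[] {
  return content
    .split(/\r?\n/) // accept both LF and CRLF line endings
    .map((line) => line.trim())
    .filter((line) => line.length > 0) // assumption: blank lines are ignored
    .map((line) => JSON.parse(line) as unknown)
}

const records = parseJsonl('{"id":1,"text":"first"}\n{"id":2,"text":"second"}\n')
console.log(JSON.stringify(records))
// [{"id":1,"text":"first"},{"id":2,"text":"second"}]
```

The resulting array can then be handed to whatever already processes a JSON array, which matches the commit's description of reusing the existing JsonYamlChunker.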

vercel bot temporarily deployed to Preview April 11, 2026 04:33 Inactive

waleedlatif1 merged commit 1acafe8 into staging Apr 11, 2026
11 checks passed

waleedlatif1 deleted the waleedlatif1/add-chunkers branch April 11, 2026 04:33

waleedlatif1 mentioned this pull request Apr 11, 2026

v0.6.36: new chunkers, sockets state machine, google sheets/drive/calendar triggers, docs updates, integrations/models pages improvements #4106

Merged
