enhance: better word division for highlighting #2251
Merged
love-linger merged 1 commit into sourcegit-scm:develop (Apr 8, 2026)
Conversation
Each line is divided into several chunks to highlight the changes.

The previous implementation splits text at a fixed set of delimiter characters (spaces, tabs, and common ASCII symbols such as `+-*/=!,;`). Non-delimiter characters — including CJK ideographs, Hiragana, and Katakana — are never treated as boundaries, so they tend to form large, coarse chunks in languages like Japanese or Chinese that do not use spaces to separate words. A small change within such text causes the entire surrounding phrase to be highlighted.

This new implementation classifies each character into one of three categories and groups consecutive characters of the same category into one chunk, except for the Other category, which is always split character by character:

- Letter (Unicode Ll/Lu/Lt/Lm + digits): ASCII letters, digits, and letters with diacritics such as é, ü, ß, ñ, ё. Consecutive Letter characters form one chunk, keeping European words intact.
- OtherLetter (Unicode Lo): CJK, Hiragana, Katakana, Hangul, Thai, Arabic, Hebrew, etc. Consecutive OtherLetter characters form one chunk. CJK punctuation (。、「」…) falls into the Other category and therefore acts as a natural boundary between chunks.
- Other (default): whitespace, control characters, punctuation, and symbols. This category corresponds to the delimiter characters of the previous implementation. Each character is always its own chunk, preserving the same per-character precision as before for operators, spaces, and punctuation.

Category values for all 65,536 `char` values are pre-computed into a static read-only array at startup for lock-free O(1) lookup.
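The grouping rule described above can be sketched in Python (the actual change is C#; `category` and `chunk` are illustrative names, not the PR's identifiers):

```python
import unicodedata

def category(ch: str) -> str:
    """Classify one character per the three-category scheme (sketch)."""
    cat = unicodedata.category(ch)
    if cat in ("Ll", "Lu", "Lt", "Lm") or ch.isdigit():
        return "letter"        # European letters and digits
    if cat == "Lo":
        return "other_letter"  # CJK, Kana, Hangul, Thai, ...
    return "other"             # whitespace, punctuation, symbols

def chunk(line: str) -> list[str]:
    """Group consecutive same-category chars; "other" splits per character."""
    chunks: list[str] = []
    for ch in line:
        cat = category(ch)
        if chunks and cat != "other" and category(chunks[-1][0]) == cat:
            chunks[-1] += ch   # extend the current same-category run
        else:
            chunks.append(ch)  # start a new chunk
    return chunks
```

For example, `chunk("日本語です。")` yields one chunk for the ideograph/kana run and a separate chunk for the CJK period, so an edit inside the run no longer highlights the punctuation around it.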
Summary
Improve inline diff highlighting by replacing the ASCII delimiter–based
word division with a Unicode category–based approach, primarily to
produce more precise highlights in languages like Japanese and Chinese.
Problem
The previous implementation splits text at a fixed set of delimiter
characters (spaces, tabs, and common ASCII symbols such as
`+-*/=!,;`). Non-delimiter characters — including CJK ideographs, Hiragana, and
Katakana — are never treated as boundaries, so a small change within
Japanese or Chinese text causes the entire surrounding phrase to be
highlighted as changed.
Solution
Each character is classified into one of three categories. Consecutive characters of the same category are grouped into one chunk, except for the Other category, which retains the same per-character behavior as the previous implementation:

- Letter (Unicode Ll/Lu/Lt/Lm + digits): ASCII letters, digits, and letters with diacritics such as é, ü, ß, ñ, ё; consecutive characters form one chunk, keeping European words intact.
- OtherLetter (Unicode Lo): CJK, Hiragana, Katakana, Hangul, Thai, Arabic, Hebrew, etc.; consecutive characters form one chunk.
- Other (default): whitespace, control characters, punctuation, and symbols; each character is always its own chunk.

CJK punctuation (。、「」…) falls into Other and acts as a natural boundary between OtherLetter chunks, making highlighted changes more precise without requiring language-specific word segmentation.

Category values for all 65,536 `char` values are pre-computed into a static read-only array at startup for lock-free O(1) lookup.
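A minimal sketch of such a pre-computed table, using Python stand-ins for the C# static read-only array (all names here are hypothetical):

```python
import unicodedata

LETTER, OTHER_LETTER, OTHER = 0, 1, 2

def _classify(cp: int) -> int:
    """Classify one UTF-16 code unit into the three categories (sketch)."""
    ch = chr(cp)
    cat = unicodedata.category(ch)
    if cat in ("Ll", "Lu", "Lt", "Lm") or ch.isdigit():
        return LETTER
    if cat == "Lo":
        return OTHER_LETTER
    return OTHER

# Built once at startup: one entry per 16-bit char value (0..65535).
# Afterwards every lookup is a plain read-only array index — no locking,
# no per-character Unicode query.
CATEGORY_TABLE = [_classify(cp) for cp in range(0x10000)]

def category_of(ch: str) -> int:
    return CATEGORY_TABLE[ord(ch)]
```

Trading ~64 KiB of memory for constant-time lookups keeps the per-character classification off the diff hot path.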
Japanese — before / after
English — no regression
Check points
- `/` vs `*`, `==` vs `!=`, etc.
- café, über, …: treated as single word chunks