bun.sh

mirror of https://github.com/oven-sh/bun synced 2026-02-09 18:38:55 +00:00

Author	SHA1	Message	Date
Jarred Sumner	86d4d87beb	feat(unicode): migrate grapheme breaking to uucode with GB9c support (#26376)

## Summary

Replace Bun's outdated grapheme breaking implementation with [Ghostty's approach](https://github.com/ghostty-org/ghostty/tree/main/src/unicode) using the [uucode](https://github.com/jacobsandlund/uucode) library. This adds proper GB9c (Indic Conjunct Break) support: Devanagari and other Indic script conjuncts now correctly form single grapheme clusters.

## Motivation

The previous implementation used a `GraphemeBoundaryClass` enum with only 12 values and a 2-bit `BreakState` (just `extended_pictographic` and `regional_indicator` flags). It had no support for Unicode's GB9c rule, meaning Indic conjunct sequences (consonant + virama + consonant) were incorrectly split into multiple grapheme clusters.

## Architecture

### Runtime (zero uucode dependency, two table lookups)

```
codepoint → [3-level LUT] → GraphemeBreakNoControl enum (u5, 17 values)
(state, gb1, gb2) → [8KB precomputed array] → (break_result, new_state)
```

The full grapheme break algorithm (GB6-GB13, GB9c, GB11, GB999) runs only at comptime to populate the 8KB decision array. At runtime it's pure table lookups.
### File Layout

```
src/deps/uucode/                ← Vendored library (MIT, build-time only)
src/unicode/uucode/             ← Build-time integration
├── uucode_config.zig           ← What Unicode properties to generate
├── grapheme_gen.zig            ← Generator: queries uucode → writes tables
├── lut.zig                     ← 3-level lookup table generator
└── CLAUDE.md                   ← Maintenance docs
src/string/immutable/           ← Runtime (no uucode dependency)
├── grapheme.zig                ← Grapheme break API + comptime decisions
├── grapheme_tables.zig         ← Pre-generated tables (committed, ~91KB source)
└── visible.zig                 ← Width calculation (2 lines changed)
scripts/update-uucode.sh        ← Update vendored uucode + regenerate
```

### Key Types

| Type | Size | Values |
|------|------|--------|
| `GraphemeBreakNoControl` | u5 | 17 (adds `indic_conjunct_break_{consonant,linker,extend}`, `emoji_modifier_base`, `zwnj`, etc.) |
| `BreakState` | u3 | 5 (`default`, `regional_indicator`, `extended_pictographic`, `indic_conjunct_break_consonant`, `indic_conjunct_break_linker`) |

### Binary Size

The tables store only the `GraphemeBreakNoControl` enum per codepoint (not width or emoji properties, which visible.zig handles separately):

- stage1: 8192 × u16 = 16KB (maps high byte → stage2 offset)
- stage2: 27392 × u8 = 27KB (maps to stage3 index; max value is 16)
- stage3: 17 × u5 = ~17 bytes (one per enum value)
- Precomputed decisions: 8KB
- Total: ~51KB (vs previous ~70KB+)

## How to Regenerate Tables

```bash
# After updating src/deps/uucode/:
./scripts/update-uucode.sh

# Or manually:
vendor/zig/zig build generate-grapheme-tables
```

Normal builds never run the generator; they use the committed `grapheme_tables.zig`.
## Testing

```bash
bun bd test test/js/bun/util/stringWidth.test.ts
```

New test cases verify Devanagari conjuncts (GB9c):

- `क्ष` (Ka+Virama+Ssa) → single cluster, width 2
- `क्‍ष` (Ka+Virama+ZWJ+Ssa) → single cluster, width 2
- `क्क्क` (Ka+Virama+Ka+Virama+Ka) → single cluster, width 3

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>	2026-01-23 00:07:06 -08:00
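The "precompute every decision, then only look things up at runtime" design described in this commit can be sketched in TypeScript. This is a toy illustration, not Bun's code: the four classes, three states, the `classify` ranges, and the names `decideSlow`, `DECISIONS`, and `countClusters` are all assumptions made for the example, and `classify` stands in for the real 3-level LUT. Only a simplified form of GB9 and GB9c is modeled.

```typescript
// Toy character classes (a hypothetical subset of the real 17-value enum).
const OTHER = 0, CONSONANT = 1, LINKER = 2, EXTEND = 3;
const NUM_CLASSES = 4;
// Toy break states (hypothetical; the real BreakState has 5 values).
const S_DEFAULT = 0, S_CONSONANT = 1, S_LINKER = 2;
const NUM_STATES = 3;

// Stand-in for the per-codepoint lookup table: a few hard-coded ranges.
function classify(cp: number): number {
  if (cp >= 0x0915 && cp <= 0x0939) return CONSONANT; // Devanagari consonants
  if (cp === 0x094d) return LINKER; // Devanagari virama
  if (cp === 0x200d) return EXTEND; // zero width joiner
  return OTHER;
}

// Slow rule logic, run only while building the table.
// `state` summarizes the text before `a`. Returns (nextState << 1) | breakBit.
function decideSlow(state: number, a: number, b: number): number {
  // Fold `a` into the state.
  let s = S_DEFAULT;
  if (a === CONSONANT) s = S_CONSONANT;
  else if (a === LINKER) s = state === S_DEFAULT ? S_DEFAULT : S_LINKER;
  else if (a === EXTEND) s = state;
  // Simplified GB9: never break before an extender or linker.
  // Simplified GB9c: never break before a consonant that follows
  // consonant + linker (with optional extenders in between).
  const joins =
    b === EXTEND || b === LINKER || (b === CONSONANT && s === S_LINKER);
  return joins ? s << 1 : (S_DEFAULT << 1) | 1;
}

// Precompute every (state, a, b) decision once.
const DECISIONS = new Uint8Array(NUM_STATES * NUM_CLASSES * NUM_CLASSES);
for (let state = 0; state < NUM_STATES; state++) {
  for (let a = 0; a < NUM_CLASSES; a++) {
    for (let b = 0; b < NUM_CLASSES; b++) {
      DECISIONS[(state * NUM_CLASSES + a) * NUM_CLASSES + b] =
        decideSlow(state, a, b);
    }
  }
}

// Runtime path: one classification and one array read per code point.
function countClusters(text: string): number {
  let count = 0;
  let state = S_DEFAULT;
  let prev = -1;
  for (const ch of text) {
    const cls = classify(ch.codePointAt(0)!);
    if (prev === -1) {
      count = 1;
    } else {
      const d = DECISIONS[(state * NUM_CLASSES + prev) * NUM_CLASSES + cls];
      if (d & 1) count++;
      state = d >> 1;
    }
    prev = cls;
  }
  return count;
}
```

With these toy rules, `countClusters("\u0915\u094D\u0937")` (Ka+Virama+Ssa) yields one cluster, while the same two consonants without the virama yield two.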
Jarred Sumner	98cee5a57e	Improve Bun.stringWidth accuracy and robustness (#25447)

This PR significantly improves `Bun.stringWidth` to handle a wider variety of Unicode characters and escape sequences correctly.

## Zero-width character handling

Added support for many previously unhandled zero-width characters:

- Soft hyphen (U+00AD)
- Word joiner and invisible operators (U+2060-U+2064)
- Lone surrogates (U+D800-U+DFFF)
- Arabic formatting characters (U+0600-U+0605, U+06DD, U+070F, U+08E2)
- Indic script combining marks (Devanagari through Malayalam)
- Thai and Lao combining marks
- Combining Diacritical Marks Extended and Supplement
- Tag characters (U+E0000-U+E007F)

## ANSI escape sequence handling

### CSI sequences

- Now properly handles ALL CSI final bytes (0x40-0x7E), not just `m`
- This means cursor movement (A/B/C/D), erase (J/K), scroll (S/T), and other CSI commands are now correctly excluded from width calculation

### OSC sequences

- Added support for OSC sequences (ESC ] ... BEL/ST)
- OSC 8 hyperlinks are now properly handled
- Supports both BEL (0x07) and ST (ESC \) terminators

### ESC ESC fix

- Fixed state machine bug where `ESC ESC` would incorrectly reset state
- Now correctly handles consecutive ESC characters

## Emoji handling

Added proper grapheme-aware emoji width calculation:

- Flag emoji (regional indicator pairs) → width 2
- Skin tone modifiers → width 2
- ZWJ sequences (family, professions, etc.) → width 2
- Keycap sequences → width 2
- Variation selectors (VS15 for text, VS16 for emoji presentation)
- Uses ICU's `UCHAR_EMOJI` property for accurate emoji detection

## Test coverage

Added comprehensive test suite with 94 tests covering:

- All zero-width character categories
- All CSI final bytes
- OSC sequences with various terminators
- Emoji edge cases (flags, skin tones, ZWJ, keycaps, variation selectors)
- East Asian width (CJK, fullwidth, halfwidth katakana)
- Indic and Thai script combining marks
- Fuzzer-like stress tests for robustness

## Breaking changes

This is a behavior change: `stringWidth` will return different values for some inputs. However, the new values are more accurate representations of terminal display width:

| Input | Old | New | Why |
|-------|-----|-----|-----|
| Flag emoji 🇺🇸 | 1 | 2 | Flags display as 2 cells |
| Skin tone 👋🏽 | 4 | 2 | Emoji + modifier = 1 grapheme |
| ZWJ family 👨‍👩‍👧 | 8 | 2 | ZWJ sequence = 1 grapheme |
| Word joiner U+2060 | 1 | 0 | Invisible character |
| OSC 8 hyperlinks | counted URL | just visible text | URLs are invisible |
| Cursor movement ESC[5A | counted | 0 | Control sequence |

🤖 Generated with [Claude Code](https://claude.ai/code)

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Claude Bot <claude-bot@bun.sh>	2025-12-10 16:17:57 -08:00
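The CSI, OSC, and `ESC ESC` rules listed in this commit can be illustrated with a small scanner. The sketch below is not Bun's implementation; `stripEscapes` is a hypothetical helper that only removes escape sequences and returns the visible text. It does no width calculation.

```typescript
const ESC = 0x1b;
const BEL = 0x07;

// Hypothetical helper: returns `input` with CSI and OSC sequences removed.
function stripEscapes(input: string): string {
  let out = "";
  let i = 0;
  while (i < input.length) {
    const c = input.charCodeAt(i);
    if (c !== ESC) {
      out += input.charAt(i);
      i++;
      continue;
    }
    const next = input.charCodeAt(i + 1); // NaN when ESC is the last char
    if (next === 0x5b) {
      // CSI: ESC [ ... final byte. Any byte in 0x40-0x7E ends the
      // sequence, not just 'm'.
      i += 2;
      while (i < input.length) {
        const b = input.charCodeAt(i++);
        if (b >= 0x40 && b <= 0x7e) break;
      }
    } else if (next === 0x5d) {
      // OSC: ESC ] ... terminated by BEL or by ST (ESC \).
      i += 2;
      while (i < input.length) {
        const b = input.charCodeAt(i);
        if (b === BEL) {
          i++;
          break;
        }
        if (b === ESC && input.charCodeAt(i + 1) === 0x5c) {
          i += 2;
          break;
        }
        i++;
      }
    } else {
      // Lone ESC, including the first of ESC ESC: drop only this byte so
      // the following ESC can still start a sequence.
      i++;
    }
  }
  return out;
}
```

For example, an OSC 8 hyperlink such as `"\x1b]8;;https://example.com\x07link\x1b]8;;\x07"` reduces to `"link"`, and `"\x1b[5Aup"` reduces to `"up"`.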
Jarred Sumner	cd6785771e	run prettier and add back format action (#13722)	2024-09-03 21:32:52 -07:00
Meghan Denny	702cae51f6	test: Bun.stringWidth is enabled by default (#9321)	2024-03-08 17:54:51 -08:00
Meghan Denny	ed339b367d	improve Bun.stringWidth's algorithm (#9022)

* improve Bun.stringWidth's algorithm
* add a bunch more tests from string-width package
* make typescript happy
* undo typescript changes
* use better #define check for debug mode
* properly handle latin1 width tests
* support grapheme clusters
* fix trailing newline
* visibleUTF16WidthFn- add fast path for leading ascii
* add firstNonASCII16IgnoreMin
* fix firstNonASCII16CheckMin
* vectorize visibleUTF16WidthFn
* support emoji variation selector
* expose stringWidth in release mode too
* vectorize visibleLatin1Width
* support ambiguousIsNarrow option
* add typescript definition for stringWidth	2024-02-22 19:16:17 -08:00
Georgijs Vilums	3bc0f90a7c	skip invalid stringWidth test	2024-01-22 12:25:49 -08:00
Jarred Sumner	1560a866fe	Skip stringWidth tests for now	2024-01-21 19:25:57 -08:00
Jarred Sumner	a8ff7be642	Disable Bun.stringWidth until failing test case passes	2024-01-21 06:10:07 -08:00
Jarred Sumner	b82656d9fc	Introduce `Bun.stringWidth` (#8327)

* Introduce `Bun.stringWidth`
* [autofix.ci] apply automated fixes
* Update utils.md

---------

Co-authored-by: Jarred Sumner <709451+Jarred-Sumner@users.noreply.github.com>
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>	2024-01-21 04:47:36 -08:00

9 Commits