mirror of
https://github.com/oven-sh/bun
synced 2026-02-02 15:08:46 +00:00
## Summary

Replace Bun's outdated grapheme breaking implementation with [Ghostty's approach](https://github.com/ghostty-org/ghostty/tree/main/src/unicode) using the [uucode](https://github.com/jacobsandlund/uucode) library. This adds proper **GB9c (Indic Conjunct Break)** support: Devanagari and other Indic script conjuncts now correctly form single grapheme clusters.

## Motivation

The previous implementation used a `GraphemeBoundaryClass` enum with only 12 values and a 2-bit `BreakState` (just `extended_pictographic` and `regional_indicator` flags). It had no support for Unicode's GB9c rule, meaning Indic conjunct sequences (consonant + virama + consonant) were incorrectly split into multiple grapheme clusters.

## Architecture

### Runtime (zero uucode dependency, two table lookups)

```
codepoint → [3-level LUT] → GraphemeBreakNoControl enum (u5, 17 values)
(state, gb1, gb2) → [8KB precomputed array] → (break_result, new_state)
```

The full grapheme break algorithm (GB6-GB13, GB9c, GB11, GB999) runs only at **comptime** to populate the 8KB decision array. At runtime it's pure table lookups.

### File Layout

```
src/deps/uucode/            ← Vendored library (MIT, build-time only)
src/unicode/uucode/         ← Build-time integration
├── uucode_config.zig       ← What Unicode properties to generate
├── grapheme_gen.zig        ← Generator: queries uucode → writes tables
├── lut.zig                 ← 3-level lookup table generator
└── CLAUDE.md               ← Maintenance docs
src/string/immutable/       ← Runtime (no uucode dependency)
├── grapheme.zig            ← Grapheme break API + comptime decisions
├── grapheme_tables.zig     ← Pre-generated tables (committed, ~91KB source)
└── visible.zig             ← Width calculation (2 lines changed)
scripts/update-uucode.sh    ← Update vendored uucode + regenerate
```

### Key Types

| Type | Size | Values |
|------|------|--------|
| `GraphemeBreakNoControl` | u5 | 17 (adds `indic_conjunct_break_{consonant,linker,extend}`, `emoji_modifier_base`, `zwnj`, etc.) |
| `BreakState` | u3 | 5 (`default`, `regional_indicator`, `extended_pictographic`, `indic_conjunct_break_consonant`, `indic_conjunct_break_linker`) |

### Binary Size

The tables store only the `GraphemeBreakNoControl` enum per codepoint (not width or emoji properties, which `visible.zig` handles separately):

- stage1: 8192 × u16 = **16KB** (maps high byte → stage2 offset)
- stage2: 27392 × u8 = **27KB** (maps to stage3 index; max value is 16)
- stage3: 17 × u5 = **~17 bytes** (one per enum value)
- Precomputed decisions: **8KB**
- **Total: ~51KB** (vs previous ~70KB+)

## How to Regenerate Tables

```bash
# After updating src/deps/uucode/:
./scripts/update-uucode.sh

# Or manually:
vendor/zig/zig build generate-grapheme-tables
```

Normal builds never run the generator; they use the committed `grapheme_tables.zig`.

## Testing

```bash
bun bd test test/js/bun/util/stringWidth.test.ts
```

New test cases verify Devanagari conjuncts (GB9c):

- `क्ष` (Ka+Virama+Ssa) → single cluster, width 2
- `क्ष` (Ka+Virama+ZWJ+Ssa) → single cluster, width 2
- `क्क्क` (Ka+Virama+Ka+Virama+Ka) → single cluster, width 3

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
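
To make the 3-level LUT from the Architecture and Binary Size sections concrete, here is a minimal TypeScript sketch of the general technique (deduplicated 256-entry blocks behind a stage1 index). It is not a port of `lut.zig`: the names `buildLut`, `lookupClass`, and `toyClassify`, the class ids, and the toy Devanagari-only classifier are all assumptions made for illustration. Only the stage1 length of 8192 and the 256-codepoint block size are taken from the figures above.

```typescript
// Hypothetical sketch of a 3-stage lookup table. Names and class ids are
// illustrative only and do not mirror lut.zig or grapheme_tables.zig.

const BLOCK_SIZE = 256;  // codepoints per stage2 block
const STAGE1_LEN = 8192; // 8192 * 256 covers every 21-bit value

// Toy class ids (assumed for this sketch).
const OTHER = 0;
const CONSONANT = 3;
const LINKER = 4;

// Toy classifier: Devanagari KA..HA and the virama; everything else is OTHER.
function toyClassify(cp: number): number {
  if (cp >= 0x0915 && cp <= 0x0939) return CONSONANT;
  if (cp === 0x094d) return LINKER;
  return OTHER;
}

interface Lut {
  stage1: Uint16Array; // high bits of the codepoint -> offset of a block in stage2
  stage2: Uint8Array;  // deduplicated blocks of stage3 indices
  stage3: number[];    // distinct class values
}

function buildLut(classify: (cp: number) => number): Lut {
  const stage3: number[] = [];
  const stage3Index = new Map<number, number>();
  const stage2: number[] = [];
  const blockOffsets = new Map<string, number>();
  const stage1 = new Uint16Array(STAGE1_LEN);

  for (let hi = 0; hi < STAGE1_LEN; hi++) {
    // Classify one block of 256 codepoints and intern each distinct value.
    const block: number[] = [];
    for (let lo = 0; lo < BLOCK_SIZE; lo++) {
      const value = classify((hi << 8) | lo);
      let idx = stage3Index.get(value);
      if (idx === undefined) {
        idx = stage3.length;
        stage3.push(value);
        stage3Index.set(value, idx);
      }
      block.push(idx);
    }
    // Identical blocks are stored once; stage1 points at the shared copy.
    const key = block.join(",");
    let offset = blockOffsets.get(key);
    if (offset === undefined) {
      offset = stage2.length;
      stage2.push(...block);
      blockOffsets.set(key, offset);
    }
    stage1[hi] = offset;
  }
  return { stage1, stage2: Uint8Array.from(stage2), stage3 };
}

// Runtime path: three array reads, no branching on Unicode ranges.
function lookupClass(lut: Lut, cp: number): number {
  const blockOffset = lut.stage1[cp >> 8];
  const valueIndex = lut.stage2[blockOffset + (cp & 0xff)];
  return lut.stage3[valueIndex];
}

const toyLut = buildLut(toyClassify);
```

With the toy classifier only two distinct blocks exist, so stage2 holds 512 entries. The real table reports 27392 entries (107 blocks), which still fits the u16 offsets stored in stage1.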
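
The second lookup, `(state, gb1, gb2) → (break_result, new_state)`, can be sketched the same way. A u3 state plus two u5 classes packs into 13 bits, which gives 8192 one-byte entries and matches the 8KB figure above. The sketch below is an assumption-laden simplification, not Bun's or Ghostty's code: it uses a reduced set of classes and states, covers only GB9, GB9c, GB12/GB13, and GB999, and the names `packIndex`, `slowBreak`, `DECISIONS`, and `countClusters` are invented here. The table is filled at module load, standing in for what the PR does at comptime.

```typescript
// Hypothetical sketch: reduced rule set and made-up names, not Bun's Zig code.

// Reduced break classes (the real enum has 17 values and fits in a u5).
const Gb = { Other: 0, Extend: 1, RegionalIndicator: 2, Consonant: 3, Linker: 4 } as const;
const GB_COUNT = 5;
// Reduced states (the real BreakState has 5 values and fits in a u3).
const St = { Default: 0, RegionalIndicator: 1, Consonant: 2, Linker: 3 } as const;
const ST_COUNT = 4;

const GB_BITS = 5;
const STATE_BITS = 3;

// Pack (state, gb1, gb2) into a 13-bit table index.
function packIndex(state: number, gb1: number, gb2: number): number {
  return (state << (2 * GB_BITS)) | (gb1 << GB_BITS) | gb2;
}

// Slow path, used only while filling the table. Returns [isBreak, nextState].
function slowBreak(state: number, gb1: number, gb2: number): [boolean, number] {
  const extendLike = gb2 === Gb.Extend || gb2 === Gb.Linker;
  // GB9c: consonant [extend | linker]* linker [extend | linker]* x consonant
  if (gb1 === Gb.Consonant) {
    if (gb2 === Gb.Linker) return [false, St.Linker];
    if (gb2 === Gb.Extend) return [false, St.Consonant];
  } else if (state === St.Consonant) {
    if (gb2 === Gb.Linker) return [false, St.Linker];
    if (gb2 === Gb.Extend) return [false, St.Consonant];
  } else if (state === St.Linker) {
    if (extendLike) return [false, St.Linker];
    if (gb2 === Gb.Consonant) return [false, St.Default];
  }
  // GB9: never break before an extending character.
  if (extendLike) return [false, St.Default];
  // GB12/GB13: regional indicators join in pairs.
  if (gb1 === Gb.RegionalIndicator && gb2 === Gb.RegionalIndicator) {
    if (state === St.RegionalIndicator) return [true, St.Default];
    return [false, St.RegionalIndicator];
  }
  // GB999: otherwise break.
  return [true, St.Default];
}

// Precompute every decision once: bit 0 is the break flag, upper bits the next state.
const DECISIONS = new Uint8Array(1 << (STATE_BITS + 2 * GB_BITS)); // 8192 bytes
for (let state = 0; state < ST_COUNT; state++) {
  for (let gb1 = 0; gb1 < GB_COUNT; gb1++) {
    for (let gb2 = 0; gb2 < GB_COUNT; gb2++) {
      const [isBreak, next] = slowBreak(state, gb1, gb2);
      DECISIONS[packIndex(state, gb1, gb2)] = (next << 1) | (isBreak ? 1 : 0);
    }
  }
}

// Runtime path: one table read per adjacent pair of classes.
function countClusters(classes: number[]): number {
  if (classes.length === 0) return 0;
  let state: number = St.Default;
  let count = 1;
  for (let i = 1; i < classes.length; i++) {
    const entry = DECISIONS[packIndex(state, classes[i - 1], classes[i])];
    if (entry & 1) count++;
    state = entry >> 1;
  }
  return count;
}
```

For example, the class sequence for Ka+Virama+Ssa is `[Gb.Consonant, Gb.Linker, Gb.Consonant]`, and `countClusters` reports one cluster for it, while two adjacent consonants with no linker between them stay two clusters.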