Jarred Sumner | 86d4d87beb
feat(unicode): migrate grapheme breaking to uucode with GB9c support (#26376)
## Summary
Replace Bun's outdated grapheme breaking implementation with [Ghostty's
approach](https://github.com/ghostty-org/ghostty/tree/main/src/unicode)
using the [uucode](https://github.com/jacobsandlund/uucode) library.
This adds proper **GB9c (Indic Conjunct Break)** support — Devanagari
and other Indic script conjuncts now correctly form single grapheme
clusters.
## Motivation
The previous implementation used a `GraphemeBoundaryClass` enum with
only 12 values and a 2-bit `BreakState` (just `extended_pictographic`
and `regional_indicator` flags). It had no support for Unicode's GB9c
rule, meaning Indic conjunct sequences (consonant + virama + consonant)
were incorrectly split into multiple grapheme clusters.
## Architecture
### Runtime (zero uucode dependency, two table lookups)
```
codepoint → [3-level LUT] → GraphemeBreakNoControl enum (u5, 17 values)
(state, gb1, gb2) → [8KB precomputed array] → (break_result, new_state)
```
The full grapheme break algorithm (GB6-GB13, including GB9c and GB11,
plus the GB999 fallback) runs only at **comptime** to populate the 8KB
decision array. At runtime, breaking is pure table lookups.
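
As a rough illustration of the two lookups, here is a minimal TypeScript sketch. It is not Bun's Zig code: the function names, the toy tables, and the bit layout of the decision key (3 bits of state followed by two 5-bit classes) are assumptions chosen to match the sizes quoted above.

```typescript
// Illustrative sketch in TypeScript; Bun's real implementation is Zig and
// its generated tables are far larger. All names here are hypothetical.

// Lookup 1: codepoint -> grapheme break class via three stages.
function lookupClass(
  stage1: Uint16Array, // indexed by codepoint >> 8, yields a stage2 offset
  stage2: Uint8Array,  // indexed by offset + low byte, yields a stage3 index
  stage3: Uint8Array,  // indexed by that index, yields the class value
  codepoint: number,
): number {
  const offset = stage1[codepoint >> 8];
  const index = stage2[offset + (codepoint & 0xff)];
  return stage3[index];
}

// Lookup 2: pack (state, gb1, gb2) into one key for the decision array.
// Assumed layout: 3 bits of state, then two 5-bit classes (13 bits total).
function decisionKey(state: number, gb1: number, gb2: number): number {
  return (state << 10) | (gb1 << 5) | gb2;
}

// Toy tables: block 0 is all "other" (0); block 1 covers U+0900..U+09FF.
const OTHER = 0, CONSONANT = 1, LINKER = 2;
const stage1 = new Uint16Array(8192);   // every block points at offset 0...
stage1[0x09] = 256;                     // ...except the Devanagari block
const stage2 = new Uint8Array(512);
stage2[256 + 0x15] = CONSONANT;         // U+0915 KA
stage2[256 + 0x4d] = LINKER;            // U+094D VIRAMA
const stage3 = Uint8Array.from([OTHER, CONSONANT, LINKER]);

const classOfKa = lookupClass(stage1, stage2, stage3, 0x0915);
const classOfVirama = lookupClass(stage1, stage2, stage3, 0x094d);
const keySpace = decisionKey(7, 31, 31) + 1; // number of distinct keys
```

Under the assumed layout the key space is 8192 entries, which at one byte per entry matches the 8KB figure for the precomputed array.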
### File Layout
```
src/deps/uucode/ ← Vendored library (MIT, build-time only)
src/unicode/uucode/ ← Build-time integration
├── uucode_config.zig ← What Unicode properties to generate
├── grapheme_gen.zig ← Generator: queries uucode → writes tables
├── lut.zig ← 3-level lookup table generator
└── CLAUDE.md ← Maintenance docs
src/string/immutable/ ← Runtime (no uucode dependency)
├── grapheme.zig ← Grapheme break API + comptime decisions
├── grapheme_tables.zig ← Pre-generated tables (committed, ~91KB source)
└── visible.zig ← Width calculation (2 lines changed)
scripts/update-uucode.sh ← Update vendored uucode + regenerate
```
### Key Types
| Type | Size | Values |
|------|------|--------|
| `GraphemeBreakNoControl` | u5 | 17 (adds `indic_conjunct_break_{consonant,linker,extend}`, `emoji_modifier_base`, `zwnj`, etc.) |
| `BreakState` | u3 | 5 (`default`, `regional_indicator`, `extended_pictographic`, `indic_conjunct_break_consonant`, `indic_conjunct_break_linker`) |
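
To show why two extra states are enough to track GB9c, the following TypeScript sketch models only the Indic conjunct part of the algorithm. It is a simplification under stated assumptions: the type and function names are hypothetical, the `regional_indicator` and `extended_pictographic` states are left out, and every character that is not a consonant, linker, or extend is treated as always starting a new cluster.

```typescript
// Simplified model of GB9c only; not Bun's actual state machine.
type InCB = "consonant" | "linker" | "extend" | "other";
type State = "default" | "consonant" | "linker";

function step(state: State, next: InCB): { isBreak: boolean; state: State } {
  if (next === "consonant") {
    // GB9c: no break before a consonant if a linker has been seen
    // since the previous consonant.
    return { isBreak: state !== "linker", state: "consonant" };
  }
  if (next === "linker") {
    // GB9: no break before Extend-class characters (viramas are in this
    // group). A linker only counts once a consonant has been seen.
    return { isBreak: false, state: state === "default" ? "default" : "linker" };
  }
  if (next === "extend") {
    // Extend characters (including ZWJ) keep whatever state is active.
    return { isBreak: false, state };
  }
  // Simplification: anything else starts a new cluster and resets the state.
  return { isBreak: true, state: "default" };
}

function countClusters(classes: InCB[]): number {
  if (classes.length === 0) return 0;
  let state: State = "default";
  let clusters = 1;
  for (let i = 0; i < classes.length; i++) {
    const result = step(state, classes[i]);
    if (i > 0 && result.isBreak) clusters++;
    state = result.state;
  }
  return clusters;
}

// Ka + Virama + Ssa: the linker suppresses the break before Ssa.
const conjunctClusters = countClusters(["consonant", "linker", "consonant"]);
// Ka + Ssa with no virama: an ordinary break between the consonants.
const plainClusters = countClusters(["consonant", "consonant"]);
```

In this model, a consonant followed by extend characters but no linker still breaks before the next consonant, which is why the `consonant` and `linker` states need to be distinct.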
### Binary Size
The tables store only the `GraphemeBreakNoControl` enum per codepoint
(not width or emoji properties, which visible.zig handles separately):
- stage1: 8192 × u16 = **16KB** (maps `cp >> 8`, the high 13 bits of a 21-bit codepoint, → stage2 offset)
- stage2: 27392 × u8 = **27KB** (maps to stage3 index; max value is 16)
- stage3: 17 × u5 = **~17 bytes** (one per enum value)
- Precomputed decisions: **8KB**
- **Total: ~51KB** (vs previous ~70KB+)
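
The total can be checked from the figures in the list above. This small TypeScript calculation assumes each `u5` in stage3 and each precomputed decision occupies one byte, which the 17-byte and 8KB figures suggest but the list does not state outright.

```typescript
// Sizes in bytes, taken from the list above.
const stage1Bytes = 8192 * 2;   // 8192 entries of u16
const stage2Bytes = 27392 * 1;  // 27392 entries of u8
const stage3Bytes = 17 * 1;     // 17 enum values, assumed one byte each
const decisionBytes = 2 ** 13;  // 3-bit state + two 5-bit classes, assumed one byte per entry

const totalBytes = stage1Bytes + stage2Bytes + stage3Bytes + decisionBytes;
const totalKiB = totalBytes / 1024;

console.log(`${totalBytes} bytes, about ${totalKiB.toFixed(1)} KiB`);
```

The sum is 51985 bytes, or roughly 50.8 KiB, consistent with the ~51KB total.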
## How to Regenerate Tables
```bash
# After updating src/deps/uucode/:
./scripts/update-uucode.sh
# Or manually:
vendor/zig/zig build generate-grapheme-tables
```
Normal builds never run the generator — they use the committed
`grapheme_tables.zig`.
## Testing
```bash
bun bd test test/js/bun/util/stringWidth.test.ts
```
New test cases verify Devanagari conjuncts (GB9c):
- `क्ष` (Ka+Virama+Ssa) → single cluster, width 2
- `क्ष` (Ka+Virama+ZWJ+Ssa) → single cluster, width 2
- `क्क्क` (Ka+Virama+Ka+Virama+Ka) → single cluster, width 3
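
Because the ZWJ is invisible when rendered, the first two inputs above look identical. The sketch below writes the three strings with explicit escapes so the codepoint sequences are unambiguous. The helper name `codepoints` is hypothetical, and the snippet only decomposes the strings; it does not call Bun's width or segmentation code.

```typescript
// The three test inputs, spelled out as escapes.
const conjunct = "\u0915\u094D\u0937";             // Ka + Virama + Ssa
const withZwj = "\u0915\u094D\u200D\u0937";        // Ka + Virama + ZWJ + Ssa
const triple = "\u0915\u094D\u0915\u094D\u0915";   // Ka + Virama + Ka + Virama + Ka

// Format each codepoint of a string as U+XXXX.
function codepoints(s: string): string[] {
  return Array.from(s, (ch) =>
    "U+" + ch.codePointAt(0)!.toString(16).toUpperCase().padStart(4, "0"),
  );
}

console.log(codepoints(conjunct).join(" ")); // U+0915 U+094D U+0937
console.log(codepoints(withZwj).join(" "));  // U+0915 U+094D U+200D U+0937
console.log(codepoints(triple).join(" "));   // U+0915 U+094D U+0915 U+094D U+0915
```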
---------
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
2026-01-23 00:07:06 -08:00