Mirror of https://github.com/oven-sh/bun (synced 2026-02-02 15:08:46 +00:00)
feat(unicode): migrate grapheme breaking to uucode with GB9c support (#26376)
## Summary

Replace Bun's outdated grapheme breaking implementation with [Ghostty's approach](https://github.com/ghostty-org/ghostty/tree/main/src/unicode) using the [uucode](https://github.com/jacobsandlund/uucode) library. This adds proper **GB9c (Indic Conjunct Break)** support: Devanagari and other Indic script conjuncts now correctly form single grapheme clusters.

## Motivation

The previous implementation used a `GraphemeBoundaryClass` enum with only 12 values and a 2-bit `BreakState` (just `extended_pictographic` and `regional_indicator` flags). It had no support for Unicode's GB9c rule, meaning Indic conjunct sequences (consonant + virama + consonant) were incorrectly split into multiple grapheme clusters.

## Architecture

### Runtime (zero uucode dependency, two table lookups)

```
codepoint → [3-level LUT] → GraphemeBreakNoControl enum (u5, 17 values)
(state, gb1, gb2) → [8KB precomputed array] → (break_result, new_state)
```

The full grapheme break algorithm (GB6-GB13, GB9c, GB11, GB999) runs only at **comptime** to populate the 8KB decision array. At runtime it's pure table lookups.

### File Layout

```
src/deps/uucode/              ← Vendored library (MIT, build-time only)
src/unicode/uucode/           ← Build-time integration
├── uucode_config.zig         ← What Unicode properties to generate
├── grapheme_gen.zig          ← Generator: queries uucode → writes tables
├── lut.zig                   ← 3-level lookup table generator
└── CLAUDE.md                 ← Maintenance docs
src/string/immutable/         ← Runtime (no uucode dependency)
├── grapheme.zig              ← Grapheme break API + comptime decisions
├── grapheme_tables.zig       ← Pre-generated tables (committed, ~91KB source)
└── visible.zig               ← Width calculation (2 lines changed)
scripts/update-uucode.sh      ← Update vendored uucode + regenerate
```

### Key Types

| Type | Size | Values |
|------|------|--------|
| `GraphemeBreakNoControl` | u5 | 17 (adds `indic_conjunct_break_{consonant,linker,extend}`, `emoji_modifier_base`, `zwnj`, etc.) |
| `BreakState` | u3 | 5 (`default`, `regional_indicator`, `extended_pictographic`, `indic_conjunct_break_consonant`, `indic_conjunct_break_linker`) |

### Binary Size

The tables store only the `GraphemeBreakNoControl` enum per codepoint (not width or emoji properties, which visible.zig handles separately):

- stage1: 8192 × u16 = **16KB** (maps high byte → stage2 offset)
- stage2: 27392 × u8 = **27KB** (maps to stage3 index; max value is 16)
- stage3: 17 × u5 = **~17 bytes** (one per enum value)
- Precomputed decisions: **8KB**
- **Total: ~51KB** (vs previous ~70KB+)

## How to Regenerate Tables

```bash
# After updating src/deps/uucode/:
./scripts/update-uucode.sh

# Or manually:
vendor/zig/zig build generate-grapheme-tables
```

Normal builds never run the generator; they use the committed `grapheme_tables.zig`.

## Testing

```bash
bun bd test test/js/bun/util/stringWidth.test.ts
```

New test cases verify Devanagari conjuncts (GB9c):

- `क्ष` (Ka+Virama+Ssa) → single cluster, width 2
- `क्‍ष` (Ka+Virama+ZWJ+Ssa) → single cluster, width 2
- `क्क्क` (Ka+Virama+Ka+Virama+Ka) → single cluster, width 3

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
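The size figures above can be sanity-checked with plain arithmetic. The sketch below is only an illustration: it assumes stage3 and the decision array use one byte per entry, and the `decision_index` helper with its bit packing of `(state, gb1, gb2)` is hypothetical, not taken from `grapheme.zig`. It does show why a u3 state combined with two u5 classes yields exactly 8192 entries.

```bash
# Sanity-check the table sizes quoted in the commit message.
# Assumption: stage3 and the decision array use one byte per entry.
stage1_bytes=$(( 8192 * 2 ))    # 8192 x u16
stage2_bytes=$(( 27392 * 1 ))   # 27392 x u8
stage3_bytes=17                 # 17 enum values, assumed 1 byte each

# A u3 state plus two u5 break classes gives 3 + 5 + 5 = 13 index bits.
decision_entries=$(( 1 << (3 + 5 + 5) ))
decision_bytes=$decision_entries

total_bytes=$(( stage1_bytes + stage2_bytes + stage3_bytes + decision_bytes ))
total_kib=$(( (total_bytes + 512) / 1024 ))   # rounded to the nearest KiB

# Hypothetical packing of (state, gb1, gb2) into a single array index.
decision_index() {
    local state="$1" gb1="$2" gb2="$3"
    echo $(( (state << 10) | (gb1 << 5) | gb2 ))
}

echo "decision entries: $decision_entries"
echo "total: $total_bytes bytes (~${total_kib}KB)"
echo "largest index: $(decision_index 7 31 31)"
```

Under these assumptions the total rounds to the ~51KB quoted above, and the largest possible index stays inside the 8192-entry array.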
scripts/update-uucode.sh (new executable file, 82 lines added)
@@ -0,0 +1,82 @@
#!/bin/bash
# Updates the vendored uucode library and regenerates grapheme tables.
#
# Usage:
#   ./scripts/update-uucode.sh                     # update from the zig global cache (default)
#   ./scripts/update-uucode.sh /path/to/uucode     # update from local directory
#   ./scripts/update-uucode.sh https://url.tar.gz  # update from URL
#
# After running, verify with:
#   bun bd test test/js/bun/util/stringWidth.test.ts

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
BUN_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
UUCODE_DIR="$BUN_ROOT/src/deps/uucode"
ZIG="$BUN_ROOT/vendor/zig/zig"

if [ ! -x "$ZIG" ]; then
    echo "error: zig not found at $ZIG"
    echo "  run scripts/bootstrap.sh first"
    exit 1
fi

update_from_dir() {
    local src="$1"
    echo "Updating uucode from: $src"
    rm -rf "$UUCODE_DIR"
    mkdir -p "$UUCODE_DIR"
    cp -r "$src"/* "$UUCODE_DIR/"
}

update_from_url() {
    local url="$1"
    local tmp
    tmp=$(mktemp -d)
    # Expand $tmp now: it is local and out of scope when the EXIT trap fires.
    trap "rm -rf '$tmp'" EXIT

    echo "Downloading uucode from: $url"
    curl -fsSL "$url" | tar -xz -C "$tmp" --strip-components=1

    update_from_dir "$tmp"
}

# Handle source argument
if [ $# -ge 1 ]; then
    SOURCE="$1"
    if [ -d "$SOURCE" ]; then
        update_from_dir "$SOURCE"
    elif [[ "$SOURCE" == http* ]]; then
        update_from_url "$SOURCE"
    else
        echo "error: argument must be a directory or URL"
        exit 1
    fi
else
    # Default: use the zig global cache if available.
    # `|| true` keeps a missing cache directory from aborting the script under
    # `set -eo pipefail` before the error message below can be printed.
    CACHED=$(find "$HOME/.cache/zig/p" -maxdepth 1 -name "uucode-*" -type d 2>/dev/null | sort -V | tail -1 || true)
    if [ -n "$CACHED" ]; then
        update_from_dir "$CACHED"
    else
        echo "error: no uucode source specified and none found in zig cache"
        echo ""
        echo "usage: $0 <path-to-uucode-dir-or-url>"
        exit 1
    fi
fi

echo ""
echo "Regenerating grapheme tables..."
cd "$BUN_ROOT"
"$ZIG" build generate-grapheme-tables

echo ""
echo "Done. Updated files:"
echo "  src/deps/uucode/                          (vendored library)"
echo "  src/string/immutable/grapheme_tables.zig  (regenerated)"
echo ""
echo "Next steps:"
echo "  1. bun bd test test/js/bun/util/stringWidth.test.ts"
echo "  2. git add src/deps/uucode src/string/immutable/grapheme_tables.zig"
echo "  3. git commit -m 'Update uucode to <version>'"