mirror of
https://github.com/oven-sh/bun
synced 2026-02-02 15:08:46 +00:00
## Summary

Replace Bun's outdated grapheme breaking implementation with [Ghostty's approach](https://github.com/ghostty-org/ghostty/tree/main/src/unicode) using the [uucode](https://github.com/jacobsandlund/uucode) library. This adds proper **GB9c (Indic Conjunct Break)** support: Devanagari and other Indic script conjuncts now correctly form single grapheme clusters.

## Motivation

The previous implementation used a `GraphemeBoundaryClass` enum with only 12 values and a 2-bit `BreakState` (just `extended_pictographic` and `regional_indicator` flags). It had no support for Unicode's GB9c rule, meaning Indic conjunct sequences (consonant + virama + consonant) were incorrectly split into multiple grapheme clusters.

## Architecture

### Runtime (zero uucode dependency, two table lookups)

```
codepoint → [3-level LUT] → GraphemeBreakNoControl enum (u5, 17 values)
(state, gb1, gb2) → [8KB precomputed array] → (break_result, new_state)
```

The full grapheme break algorithm (GB6-GB13, GB9c, GB11, GB999) runs only at **comptime** to populate the 8KB decision array. At runtime it's pure table lookups.

### File Layout

```
src/deps/uucode/              ← Vendored library (MIT, build-time only)
src/unicode/uucode/           ← Build-time integration
├── uucode_config.zig         ← What Unicode properties to generate
├── grapheme_gen.zig          ← Generator: queries uucode → writes tables
├── lut.zig                   ← 3-level lookup table generator
└── CLAUDE.md                 ← Maintenance docs
src/string/immutable/         ← Runtime (no uucode dependency)
├── grapheme.zig              ← Grapheme break API + comptime decisions
├── grapheme_tables.zig       ← Pre-generated tables (committed, ~91KB source)
└── visible.zig               ← Width calculation (2 lines changed)
scripts/update-uucode.sh      ← Update vendored uucode + regenerate
```

### Key Types

| Type | Size | Values |
|------|------|--------|
| `GraphemeBreakNoControl` | u5 | 17 (adds `indic_conjunct_break_{consonant,linker,extend}`, `emoji_modifier_base`, `zwnj`, etc.) |
| `BreakState` | u3 | 5 (`default`, `regional_indicator`, `extended_pictographic`, `indic_conjunct_break_consonant`, `indic_conjunct_break_linker`) |

### Binary Size

The tables store only the `GraphemeBreakNoControl` enum per codepoint (not width or emoji properties, which visible.zig handles separately):

- stage1: 8192 × u16 = **16KB** (maps high byte → stage2 offset)
- stage2: 27392 × u8 = **27KB** (maps to stage3 index; max value is 16)
- stage3: 17 × u5 = **~17 bytes** (one per enum value)
- Precomputed decisions: **8KB**
- **Total: ~51KB** (vs previous ~70KB+)

## How to Regenerate Tables

```bash
# After updating src/deps/uucode/:
./scripts/update-uucode.sh

# Or manually:
vendor/zig/zig build generate-grapheme-tables
```

Normal builds never run the generator; they use the committed `grapheme_tables.zig`.

## Testing

```bash
bun bd test test/js/bun/util/stringWidth.test.ts
```

New test cases verify Devanagari conjuncts (GB9c):

- `क्ष` (Ka+Virama+Ssa) → single cluster, width 2
- `क्ष` (Ka+Virama+ZWJ+Ssa) → single cluster, width 2
- `क्क्क` (Ka+Virama+Ka+Virama+Ka) → single cluster, width 3

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
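The Binary Size figures quoted in the commit message can be cross-checked with a little shell arithmetic. This is only a sketch under stated assumptions: it assumes each `u5` entry in stage3 occupies one byte and each precomputed decision occupies one byte, neither of which the commit message says outright. The variable names are illustrative and do not come from the Bun source.

```bash
#!/bin/bash
# Table sizes in bytes, taken from the entry counts in the commit message.
stage1=$((8192 * 2))    # 8192 entries x u16
stage2=$((27392 * 1))   # 27392 entries x u8
stage3=$((17 * 1))      # 17 entries; assumes one byte per u5 value

# The decision array is indexed by (state, gb1, gb2):
# 3 bits (u3) + 5 bits (u5) + 5 bits (u5) = 13 bits of index space.
# Assumes one byte per entry, which matches the quoted 8KB.
decisions=$((1 << (3 + 5 + 5)))

total=$((stage1 + stage2 + stage3 + decisions))
echo "stage1:    $stage1 bytes"
echo "stage2:    $stage2 bytes"
echo "stage3:    $stage3 bytes"
echo "decisions: $decisions bytes"
echo "total:     $total bytes (~$(( (total + 512) / 1024 )) KiB)"
```

Under those assumptions the sum lands at roughly 51 KiB, in line with the total stated in the commit message.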
83 lines
2.2 KiB
Bash
Executable File
#!/bin/bash
# Updates the vendored uucode library and regenerates grapheme tables.
#
# Usage:
#   ./scripts/update-uucode.sh                     # update from the Zig global cache
#   ./scripts/update-uucode.sh /path/to/uucode     # update from local directory
#   ./scripts/update-uucode.sh https://url.tar.gz  # update from URL
#
# After running, verify with:
#   bun bd test test/js/bun/util/stringWidth.test.ts

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
BUN_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
UUCODE_DIR="$BUN_ROOT/src/deps/uucode"
ZIG="$BUN_ROOT/vendor/zig/zig"

if [ ! -x "$ZIG" ]; then
  echo "error: zig not found at $ZIG" >&2
  echo "  run scripts/bootstrap.sh first" >&2
  exit 1
fi

update_from_dir() {
  local src="$1"
  echo "Updating uucode from: $src"
  rm -rf "$UUCODE_DIR"
  mkdir -p "$UUCODE_DIR"
  cp -r "$src"/* "$UUCODE_DIR/"
}

update_from_url() {
  local url="$1"
  local tmp
  tmp=$(mktemp -d)
  trap "rm -rf '$tmp'" EXIT # expand $tmp now; it is local and out of scope once the function returns

  echo "Downloading uucode from: $url"
  curl -fsSL "$url" | tar -xz -C "$tmp" --strip-components=1

  update_from_dir "$tmp"
}

# Handle source argument
if [ $# -ge 1 ]; then
  SOURCE="$1"
  if [ -d "$SOURCE" ]; then
    update_from_dir "$SOURCE"
  elif [[ "$SOURCE" == http* ]]; then
    update_from_url "$SOURCE"
  else
    echo "error: argument must be a directory or URL" >&2
    exit 1
  fi
else
  # Default: use the newest uucode package in the zig global cache, if any
  CACHED=$(find "$HOME/.cache/zig/p" -maxdepth 1 -name "uucode-*" -type d 2>/dev/null | sort -V | tail -1)
  if [ -n "$CACHED" ]; then
    update_from_dir "$CACHED"
  else
    echo "error: no uucode source specified and none found in zig cache" >&2
    echo "" >&2
    echo "usage: $0 <path-to-uucode-dir-or-url>" >&2
    exit 1
  fi
fi

echo ""
echo "Regenerating grapheme tables..."
cd "$BUN_ROOT"
"$ZIG" build generate-grapheme-tables

echo ""
echo "Done. Updated files:"
echo "  src/deps/uucode/ (vendored library)"
echo "  src/string/immutable/grapheme_tables.zig (regenerated)"
echo ""
echo "Next steps:"
echo "  1. bun bd test test/js/bun/util/stringWidth.test.ts"
echo "  2. git add src/deps/uucode src/string/immutable/grapheme_tables.zig"
echo "  3. git commit -m 'Update uucode to <version>'"