Jarred Sumner 86d4d87beb feat(unicode): migrate grapheme breaking to uucode with GB9c support (#26376)
## Summary

Replace Bun's outdated grapheme breaking implementation with [Ghostty's
approach](https://github.com/ghostty-org/ghostty/tree/main/src/unicode)
using the [uucode](https://github.com/jacobsandlund/uucode) library.
This adds proper **GB9c (Indic Conjunct Break)** support — Devanagari
and other Indic script conjuncts now correctly form single grapheme
clusters.

## Motivation

The previous implementation used a `GraphemeBoundaryClass` enum with
only 12 values and a 2-bit `BreakState` (just `extended_pictographic`
and `regional_indicator` flags). It had no support for Unicode's GB9c
rule, meaning Indic conjunct sequences (consonant + virama + consonant)
were incorrectly split into multiple grapheme clusters.

## Architecture

### Runtime (zero uucode dependency, two table lookups)

```
codepoint → [3-level LUT] → GraphemeBreakNoControl enum (u5, 17 values)
(state, gb1, gb2) → [8KB precomputed array] → (break_result, new_state)
```

The full grapheme break algorithm (GB6-GB13, GB9c, GB11, GB999) runs
only at **comptime** to populate the 8KB decision array. At runtime it's
pure table lookups.
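
As a rough illustration of that split, the following TypeScript sketch fills an 8192-entry byte array once from a slow rule function and then answers every boundary query by indexing it. It is a toy model, not the Zig code from this PR: the class and state numbering, the bit layout in `pack`, the hard-coded `classify` stand-in, and the reduced rule set (only GB9, GB9c, and GB999) are all assumptions made for the example.

```typescript
// Toy model of "run the rules once, then only index a table at runtime".
// Names echo the PR text; numeric values, bit layout, and the reduced rule
// set (GB9, GB9c, GB999 only) are assumptions made for this sketch.
const Gb = { Other: 0, Consonant: 1, Linker: 2, ConjunctExtend: 3 } as const;
const St = { Default: 0, Consonant: 1, Linker: 2 } as const;

const GB_BITS = 5; // a class fits in a u5
const STATE_BITS = 3; // a state fits in a u3
// 2^(3+5+5) = 8192 one-byte entries = 8KB
const DECISIONS = new Uint8Array(1 << (STATE_BITS + 2 * GB_BITS));

function pack(state: number, gb1: number, gb2: number): number {
  return (state << (2 * GB_BITS)) | (gb1 << GB_BITS) | gb2;
}

// Fold one class into the conjunct-tracking state.
function advance(state: number, gb: number): number {
  if (gb === Gb.Consonant) return St.Consonant; // a consonant (re)starts a sequence
  if (state === St.Default) return St.Default; // nothing to continue
  if (gb === Gb.Linker) return St.Linker; // consonant ... linker seen
  if (gb === Gb.ConjunctExtend) return state; // extenders keep the state
  return St.Default;
}

// Slow path: the rule logic. Called only while filling DECISIONS.
// Returns one byte: bit 0 = break, bits 1-3 = next state.
function decide(state: number, gb1: number, gb2: number): number {
  const next = advance(state, gb1);
  let isBreak = true; // GB999: otherwise, break
  if (gb2 === Gb.Linker || gb2 === Gb.ConjunctExtend) isBreak = false; // GB9
  else if (gb2 === Gb.Consonant && next === St.Linker) isBreak = false; // GB9c
  return (next << 1) | (isBreak ? 1 : 0);
}

for (const state of Object.values(St))
  for (const gb1 of Object.values(Gb))
    for (const gb2 of Object.values(Gb))
      DECISIONS[pack(state, gb1, gb2)] = decide(state, gb1, gb2);

// Stand-in for the real codepoint-to-class lookup (a 3-level LUT in the PR).
function classify(cp: number): number {
  if (cp === 0x0915 || cp === 0x0937) return Gb.Consonant; // KA, SSA
  if (cp === 0x094d) return Gb.Linker; // VIRAMA
  if (cp === 0x200d) return Gb.ConjunctExtend; // ZWJ
  return Gb.Other;
}

// Fast path: one table read per adjacent pair of code points.
function countClusters(text: string): number {
  const classes = Array.from(text, (ch) => classify(ch.codePointAt(0)!));
  let count = classes.length === 0 ? 0 : 1;
  let state: number = St.Default;
  for (let i = 1; i < classes.length; i++) {
    const entry = DECISIONS[pack(state, classes[i - 1], classes[i])];
    if (entry & 1) count++;
    state = entry >> 1;
  }
  return count;
}
```

In this model, `countClusters("\u0915\u094D\u0937")` (Ka+Virama+Ssa) yields one cluster, while `countClusters("\u0915\u0937")` (Ka+Ssa with no virama) yields two, because the `Linker` state is never reached.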

### File Layout

```
src/deps/uucode/              ← Vendored library (MIT, build-time only)
src/unicode/uucode/           ← Build-time integration
  ├── uucode_config.zig       ← What Unicode properties to generate
  ├── grapheme_gen.zig        ← Generator: queries uucode → writes tables
  ├── lut.zig                 ← 3-level lookup table generator
  └── CLAUDE.md               ← Maintenance docs
src/string/immutable/         ← Runtime (no uucode dependency)
  ├── grapheme.zig            ← Grapheme break API + comptime decisions
  ├── grapheme_tables.zig     ← Pre-generated tables (committed, ~91KB source)
  └── visible.zig             ← Width calculation (2 lines changed)
scripts/update-uucode.sh      ← Update vendored uucode + regenerate
```

### Key Types

| Type | Size | Values |
|------|------|--------|
| `GraphemeBreakNoControl` | u5 | 17 (adds `indic_conjunct_break_{consonant,linker,extend}`, `emoji_modifier_base`, `zwnj`, etc.) |
| `BreakState` | u3 | 5 (`default`, `regional_indicator`, `extended_pictographic`, `indic_conjunct_break_consonant`, `indic_conjunct_break_linker`) |

### Binary Size

The tables store only the `GraphemeBreakNoControl` enum per codepoint
(not width or emoji properties, which `visible.zig` handles separately):

- stage1: 8192 × u16 = **16KB** (maps the codepoint's high bits → stage2 offset)
- stage2: 27392 × u8 = **27KB** (maps to stage3 index; max value is 16)
- stage3: 17 × u5 = **~17 bytes** (one per enum value)
- Precomputed decisions: **8KB**
- **Total: ~51KB** (vs previous ~70KB+)
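
To make the three stages concrete, here is a hedged TypeScript sketch of how such a table can be built and queried. It assumes 256-codepoint blocks over a 21-bit codepoint space, which would be consistent with the 8192 stage1 entries above (and would make 27392 stage2 entries equal to 107 distinct blocks), but the block size, the `buildLut` and `lookup` names, and the `toyClass` classifier are illustrative assumptions rather than the contents of `lut.zig`.

```typescript
// Sketch of a 3-stage lookup table with block deduplication.
// BLOCK = 256 and all function names are assumptions for illustration.
const BLOCK = 256;
const NUM_BLOCKS = (1 << 21) / BLOCK; // 8192 blocks cover every u21 value

interface Lut {
  stage1: Uint16Array; // block number -> offset into stage2
  stage2: Uint8Array; // offset + low byte -> index into stage3
  stage3: number[]; // index -> property value
}

function buildLut(classify: (cp: number) => number): Lut {
  const stage1 = new Uint16Array(NUM_BLOCKS);
  const stage2: number[] = [];
  const stage3: number[] = [];
  const blockOffsets = new Map<string, number>(); // dedupe identical blocks
  const valueIndex = new Map<number, number>(); // dedupe property values

  for (let b = 0; b < NUM_BLOCKS; b++) {
    const block: number[] = [];
    for (let i = 0; i < BLOCK; i++) {
      const value = classify(b * BLOCK + i);
      let idx = valueIndex.get(value);
      if (idx === undefined) {
        idx = stage3.length;
        stage3.push(value);
        valueIndex.set(value, idx);
      }
      block.push(idx);
    }
    const key = block.join(",");
    let offset = blockOffsets.get(key);
    if (offset === undefined) {
      offset = stage2.length; // first time this block content is seen
      stage2.push(...block);
      blockOffsets.set(key, offset);
    }
    stage1[b] = offset;
  }
  return { stage1, stage2: Uint8Array.from(stage2), stage3 };
}

function lookup(lut: Lut, cp: number): number {
  const offset = lut.stage1[cp >> 8];
  const idx = lut.stage2[offset + (cp & 0xff)];
  return lut.stage3[idx];
}

// Toy classifier: 1 = consonant, 2 = linker, 3 = extend, 0 = everything else.
function toyClass(cp: number): number {
  if (cp === 0x0915 || cp === 0x0937) return 1;
  if (cp === 0x094d) return 2;
  if (cp === 0x200d) return 3;
  return 0;
}

const lut = buildLut(toyClass);
```

Because most blocks are identical, stage2 only grows when a block with new content appears. With the toy classifier, just three distinct blocks are stored even though stage1 still has 8192 entries.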

## How to Regenerate Tables

```bash
# After updating src/deps/uucode/:
./scripts/update-uucode.sh

# Or manually:
vendor/zig/zig build generate-grapheme-tables
```

Normal builds never run the generator — they use the committed
`grapheme_tables.zig`.

## Testing

```bash
bun bd test test/js/bun/util/stringWidth.test.ts
```

New test cases verify Devanagari conjuncts (GB9c):
- `क्ष` (Ka+Virama+Ssa) → single cluster, width 2
- `क्‍ष` (Ka+Virama+ZWJ+Ssa) → single cluster, width 2
- `क्क्क` (Ka+Virama+Ka+Virama+Ka) → single cluster, width 3
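
For experimenting with these sequences outside the test suite, the sketch below builds the same three strings from their code points and counts grapheme clusters with the standard `Intl.Segmenter` API. This is not the code in `stringWidth.test.ts`; the `samples` and `graphemeCount` names are hypothetical, and the counts printed depend on whether the runtime's Unicode data already includes GB9c (introduced in Unicode 15.1), so no expected output is shown.

```typescript
// Code points are standard Unicode assignments; helper names are hypothetical.
const KA = 0x0915;
const SSA = 0x0937;
const VIRAMA = 0x094d;
const ZWJ = 0x200d;

const samples: Record<string, string> = {
  "Ka+Virama+Ssa": String.fromCodePoint(KA, VIRAMA, SSA),
  "Ka+Virama+ZWJ+Ssa": String.fromCodePoint(KA, VIRAMA, ZWJ, SSA),
  "Ka+Virama+Ka+Virama+Ka": String.fromCodePoint(KA, VIRAMA, KA, VIRAMA, KA),
};

// Counts extended grapheme clusters as seen by the host's segmenter.
function graphemeCount(text: string): number {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return Array.from(segmenter.segment(text)).length;
}

for (const [label, text] of Object.entries(samples)) {
  console.log(label, JSON.stringify(text), graphemeCount(text));
}
```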

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
2026-01-23 00:07:06 -08:00

Bun

What is Bun?

Bun is an all-in-one toolkit for JavaScript and TypeScript apps. It ships as a single executable called bun.

At its core is the Bun runtime, a fast JavaScript runtime designed as a drop-in replacement for Node.js. It's written in Zig and powered by JavaScriptCore under the hood, dramatically reducing startup times and memory usage.

```bash
bun run index.tsx             # TS and JSX supported out-of-the-box
```

The bun command-line tool also implements a test runner, script runner, and Node.js-compatible package manager. Instead of 1,000 node_modules for development, you only need bun. Bun's built-in tools are significantly faster than existing options and usable in existing Node.js projects with little to no changes.

```bash
bun test                      # run tests
bun run start                 # run the `start` script in `package.json`
bun install <pkg>             # install a package
bunx cowsay 'Hello, world!'   # execute a package
```

Install

Bun supports Linux (x64 & arm64), macOS (x64 & Apple Silicon) and Windows (x64).

Linux users — Kernel version 5.6 or higher is strongly recommended, but the minimum is 5.1.

x64 users — if you see "illegal instruction" or similar errors, check our CPU requirements

```bash
# with install script (recommended)
curl -fsSL https://bun.com/install | bash

# on windows
powershell -c "irm bun.sh/install.ps1 | iex"

# with npm
npm install -g bun

# with Homebrew
brew tap oven-sh/bun
brew install bun

# with Docker
docker pull oven/bun
docker run --rm --init --ulimit memlock=-1:-1 oven/bun
```

Upgrade

To upgrade to the latest version of Bun, run:

```bash
bun upgrade
```

Bun automatically releases a canary build on every commit to main. To upgrade to the latest canary build, run:

```bash
bun upgrade --canary
```

Contributing

Refer to the Project > Contributing guide to start contributing to Bun.

License

Refer to the Project > License page for information about Bun's licensing.

Description
Bun is a fast, incrementally adoptable all-in-one JavaScript, TypeScript & JSX toolkit. Use individual tools like bun test or bun install in Node.js projects, or adopt the complete stack with a fast JavaScript runtime, bundler, test runner, and package manager built in. Bun aims for 100% Node.js compatibility.