Files
bun.sh/LICENSE.md
Jarred Sumner 86d4d87beb feat(unicode): migrate grapheme breaking to uucode with GB9c support (#26376)
## Summary

Replace Bun's outdated grapheme breaking implementation with [Ghostty's
approach](https://github.com/ghostty-org/ghostty/tree/main/src/unicode)
using the [uucode](https://github.com/jacobsandlund/uucode) library.
This adds proper **GB9c (Indic Conjunct Break)** support — Devanagari
and other Indic script conjuncts now correctly form single grapheme
clusters.

## Motivation

The previous implementation used a `GraphemeBoundaryClass` enum with
only 12 values and a 2-bit `BreakState` (just `extended_pictographic`
and `regional_indicator` flags). It had no support for Unicode's GB9c
rule, meaning Indic conjunct sequences (consonant + virama + consonant)
were incorrectly split into multiple grapheme clusters.

## Architecture

### Runtime (zero uucode dependency, two table lookups)

```
codepoint → [3-level LUT] → GraphemeBreakNoControl enum (u5, 17 values)
(state, gb1, gb2) → [8KB precomputed array] → (break_result, new_state)
```

The full grapheme break algorithm (GB6-GB13, GB9c, GB11, GB999) runs
only at **comptime** to populate the 8KB decision array. At runtime it's
pure table lookups.

### File Layout

```
src/deps/uucode/              ← Vendored library (MIT, build-time only)
src/unicode/uucode/           ← Build-time integration
  ├── uucode_config.zig       ← What Unicode properties to generate
  ├── grapheme_gen.zig        ← Generator: queries uucode → writes tables
  ├── lut.zig                 ← 3-level lookup table generator
  └── CLAUDE.md               ← Maintenance docs
src/string/immutable/         ← Runtime (no uucode dependency)
  ├── grapheme.zig            ← Grapheme break API + comptime decisions
  ├── grapheme_tables.zig     ← Pre-generated tables (committed, ~91KB source)
  └── visible.zig             ← Width calculation (2 lines changed)
scripts/update-uucode.sh      ← Update vendored uucode + regenerate
```

### Key Types

| Type | Size | Values |
|------|------|--------|
| `GraphemeBreakNoControl` | u5 | 17 (adds
`indic_conjunct_break_{consonant,linker,extend}`, `emoji_modifier_base`,
`zwnj`, etc.) |
| `BreakState` | u3 | 5 (`default`, `regional_indicator`,
`extended_pictographic`, `indic_conjunct_break_consonant`,
`indic_conjunct_break_linker`) |

### Binary Size

The tables store only the `GraphemeBreakNoControl` enum per codepoint
(not width or emoji properties, which visible.zig handles separately):

- stage1: 8192 × u16 = **16KB** (maps high byte → stage2 offset)
- stage2: 27392 × u8 = **27KB** (maps to stage3 index; max value is 16)
- stage3: 17 × u5 = **~17 bytes** (one per enum value)
- Precomputed decisions: **8KB**
- **Total: ~51KB** (vs previous ~70KB+)

## How to Regenerate Tables

```bash
# After updating src/deps/uucode/:
./scripts/update-uucode.sh

# Or manually:
vendor/zig/zig build generate-grapheme-tables
```

Normal builds never run the generator — they use the committed
`grapheme_tables.zig`.

## Testing

```bash
bun bd test test/js/bun/util/stringWidth.test.ts
```

New test cases verify Devanagari conjuncts (GB9c):
- `क्ष` (Ka+Virama+Ssa) → single cluster, width 2
- `क्‍ष` (Ka+Virama+ZWJ+Ssa) → single cluster, width 2
- `क्क्क` (Ka+Virama+Ka+Virama+Ka) → single cluster, width 3

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
2026-01-23 00:07:06 -08:00

4.4 KiB
Raw Permalink Blame History

Bun itself is MIT-licensed.

JavaScriptCore

Bun statically links JavaScriptCore (and WebKit) which is LGPL-2 licensed. WebCore files from WebKit are also licensed under LGPL2. Per LGPL2:

(1) If you statically link against an LGPLd library, you must also provide your application in an object (not necessarily source) format, so that a user has the opportunity to modify the library and relink the application.

You can find the patched version of WebKit used by Bun here: https://github.com/oven-sh/webkit. If you would like to relink Bun with changes:

  • git submodule update --init --recursive
  • make jsc
  • zig build

This compiles JavaScriptCore, compiles Buns .cpp bindings for JavaScriptCore (which are the object files using JavaScriptCore) and outputs a new bun binary with your changes.

Linked libraries

Bun statically links these libraries:

Library License
boringssl several licenses
brotli MIT
libarchive several licenses
lol-html BSD 3-Clause
mimalloc MIT
picohttp dual-licensed under the Perl License or the MIT License
zstd dual-licensed under the BSD License or GPLv2 license
simdutf Apache 2.0
tinycc LGPL v2.1
uSockets Apache 2.0
zlib-cloudflare zlib
c-ares MIT licensed
libicu 72 license here
libbase64 BSD 2-Clause
libuv (on Windows) MIT
libdeflate MIT
uucode MIT
A fork of uWebsockets Apache 2.0 licensed
Parts of Tigerbeetle's IO code Apache 2.0 licensed

Polyfills

For compatibility reasons, the following packages are embedded into Bun's binary and injected if imported.

Package License
assert MIT
browserify-zlib MIT
buffer MIT
constants-browserify MIT
crypto-browserify MIT
domain-browser MIT
events MIT
https-browserify MIT
os-browserify MIT
path-browserify MIT
process MIT
punycode MIT
querystring-es3 MIT
stream-browserify MIT
stream-http MIT
string_decoder MIT
timers-browserify MIT
tty-browserify MIT
url MIT
util MIT
vm-browserify MIT

Additional credits

  • Bun's JS transpiler, CSS lexer, and Node.js module resolver source code is a Zig port of @evanws esbuild project.
  • Credit to @kipply for the name "Bun"!