Files
bun.sh/LICENSE.md
Jarred Sumner 86d4d87beb feat(unicode): migrate grapheme breaking to uucode with GB9c support (#26376)
## Summary

Replace Bun's outdated grapheme breaking implementation with [Ghostty's
approach](https://github.com/ghostty-org/ghostty/tree/main/src/unicode)
using the [uucode](https://github.com/jacobsandlund/uucode) library.
This adds proper **GB9c (Indic Conjunct Break)** support — Devanagari
and other Indic script conjuncts now correctly form single grapheme
clusters.

## Motivation

The previous implementation used a `GraphemeBoundaryClass` enum with
only 12 values and a 2-bit `BreakState` (just `extended_pictographic`
and `regional_indicator` flags). It had no support for Unicode's GB9c
rule, meaning Indic conjunct sequences (consonant + virama + consonant)
were incorrectly split into multiple grapheme clusters.

## Architecture

### Runtime (zero uucode dependency, two table lookups)

```
codepoint → [3-level LUT] → GraphemeBreakNoControl enum (u5, 17 values)
(state, gb1, gb2) → [8KB precomputed array] → (break_result, new_state)
```

The full grapheme break algorithm (GB6-GB13, GB9c, GB11, GB999) runs
only at **comptime** to populate the 8KB decision array. At runtime it's
pure table lookups.

### File Layout

```
src/deps/uucode/              ← Vendored library (MIT, build-time only)
src/unicode/uucode/           ← Build-time integration
  ├── uucode_config.zig       ← What Unicode properties to generate
  ├── grapheme_gen.zig        ← Generator: queries uucode → writes tables
  ├── lut.zig                 ← 3-level lookup table generator
  └── CLAUDE.md               ← Maintenance docs
src/string/immutable/         ← Runtime (no uucode dependency)
  ├── grapheme.zig            ← Grapheme break API + comptime decisions
  ├── grapheme_tables.zig     ← Pre-generated tables (committed, ~91KB source)
  └── visible.zig             ← Width calculation (2 lines changed)
scripts/update-uucode.sh      ← Update vendored uucode + regenerate
```

### Key Types

| Type | Size | Values |
|------|------|--------|
| `GraphemeBreakNoControl` | u5 | 17 (adds
`indic_conjunct_break_{consonant,linker,extend}`, `emoji_modifier_base`,
`zwnj`, etc.) |
| `BreakState` | u3 | 5 (`default`, `regional_indicator`,
`extended_pictographic`, `indic_conjunct_break_consonant`,
`indic_conjunct_break_linker`) |

### Binary Size

The tables store only the `GraphemeBreakNoControl` enum per codepoint
(not width or emoji properties, which visible.zig handles separately):

- stage1: 8192 × u16 = **16KB** (maps high byte → stage2 offset)
- stage2: 27392 × u8 = **27KB** (maps to stage3 index; max value is 16)
- stage3: 17 × u5 = **~17 bytes** (one per enum value)
- Precomputed decisions: **8KB**
- **Total: ~51KB** (vs previous ~70KB+)

## How to Regenerate Tables

```bash
# After updating src/deps/uucode/:
./scripts/update-uucode.sh

# Or manually:
vendor/zig/zig build generate-grapheme-tables
```

Normal builds never run the generator — they use the committed
`grapheme_tables.zig`.

## Testing

```bash
bun bd test test/js/bun/util/stringWidth.test.ts
```

New test cases verify Devanagari conjuncts (GB9c):
- `क्ष` (Ka+Virama+Ssa) → single cluster, width 2
- `क्‍ष` (Ka+Virama+ZWJ+Ssa) → single cluster, width 2
- `क्क्क` (Ka+Virama+Ka+Virama+Ka) → single cluster, width 3

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
2026-01-23 00:07:06 -08:00

74 lines
4.4 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
Bun itself is MIT-licensed.
## JavaScriptCore
Bun statically links JavaScriptCore (and WebKit) which is LGPL-2 licensed. WebCore files from WebKit are also licensed under LGPL2. Per LGPL2:
> (1) If you statically link against an LGPLd library, you must also provide your application in an object (not necessarily source) format, so that a user has the opportunity to modify the library and relink the application.
You can find the patched version of WebKit used by Bun here: <https://github.com/oven-sh/webkit>. If you would like to relink Bun with changes:
- `git submodule update --init --recursive`
- `make jsc`
- `zig build`
This compiles JavaScriptCore, compiles Buns `.cpp` bindings for JavaScriptCore (which are the object files using JavaScriptCore) and outputs a new `bun` binary with your changes.
## Linked libraries
Bun statically links these libraries:
| Library | License |
|---------|---------|
| [`boringssl`](https://boringssl.googlesource.com/boringssl/) | [several licenses](https://boringssl.googlesource.com/boringssl/+/refs/heads/master/LICENSE) |
| [`brotli`](https://github.com/google/brotli) | MIT |
| [`libarchive`](https://github.com/libarchive/libarchive) | [several licenses](https://github.com/libarchive/libarchive/blob/master/COPYING) |
| [`lol-html`](https://github.com/cloudflare/lol-html/tree/master/c-api) | BSD 3-Clause |
| [`mimalloc`](https://github.com/microsoft/mimalloc) | MIT |
| [`picohttp`](https://github.com/h2o/picohttpparser) | dual-licensed under the Perl License or the MIT License |
| [`zstd`](https://github.com/facebook/zstd) | dual-licensed under the BSD License or GPLv2 license |
| [`simdutf`](https://github.com/simdutf/simdutf) | Apache 2.0 |
| [`tinycc`](https://github.com/tinycc/tinycc) | LGPL v2.1 |
| [`uSockets`](https://github.com/uNetworking/uSockets) | Apache 2.0 |
| [`zlib-cloudflare`](https://github.com/cloudflare/zlib) | zlib |
| [`c-ares`](https://github.com/c-ares/c-ares) | MIT licensed |
| [`libicu`](https://github.com/unicode-org/icu) 72 | [license here](https://github.com/unicode-org/icu/blob/main/icu4c/LICENSE) |
| [`libbase64`](https://github.com/aklomp/base64/blob/master/LICENSE) | BSD 2-Clause |
| [`libuv`](https://github.com/libuv/libuv) (on Windows) | MIT |
| [`libdeflate`](https://github.com/ebiggers/libdeflate) | MIT |
| [`uucode`](https://github.com/jacobsandlund/uucode) | MIT |
| A fork of [`uWebsockets`](https://github.com/jarred-sumner/uwebsockets) | Apache 2.0 licensed |
| Parts of [Tigerbeetle's IO code](https://github.com/tigerbeetle/tigerbeetle/blob/532c8b70b9142c17e07737ab6d3da68d7500cbca/src/io/windows.zig#L1) | Apache 2.0 licensed |
## Polyfills
For compatibility reasons, the following packages are embedded into Bun's binary and injected if imported.
| Package | License |
|---------|---------|
| [`assert`](https://npmjs.com/package/assert) | MIT |
| [`browserify-zlib`](https://npmjs.com/package/browserify-zlib) | MIT |
| [`buffer`](https://npmjs.com/package/buffer) | MIT |
| [`constants-browserify`](https://npmjs.com/package/constants-browserify) | MIT |
| [`crypto-browserify`](https://npmjs.com/package/crypto-browserify) | MIT |
| [`domain-browser`](https://npmjs.com/package/domain-browser) | MIT |
| [`events`](https://npmjs.com/package/events) | MIT |
| [`https-browserify`](https://npmjs.com/package/https-browserify) | MIT |
| [`os-browserify`](https://npmjs.com/package/os-browserify) | MIT |
| [`path-browserify`](https://npmjs.com/package/path-browserify) | MIT |
| [`process`](https://npmjs.com/package/process) | MIT |
| [`punycode`](https://npmjs.com/package/punycode) | MIT |
| [`querystring-es3`](https://npmjs.com/package/querystring-es3) | MIT |
| [`stream-browserify`](https://npmjs.com/package/stream-browserify) | MIT |
| [`stream-http`](https://npmjs.com/package/stream-http) | MIT |
| [`string_decoder`](https://npmjs.com/package/string_decoder) | MIT |
| [`timers-browserify`](https://npmjs.com/package/timers-browserify) | MIT |
| [`tty-browserify`](https://npmjs.com/package/tty-browserify) | MIT |
| [`url`](https://npmjs.com/package/url) | MIT |
| [`util`](https://npmjs.com/package/util) | MIT |
| [`vm-browserify`](https://npmjs.com/package/vm-browserify) | MIT |
## Additional credits
- Bun's JS transpiler, CSS lexer, and Node.js module resolver source code is a Zig port of [@evanw](https://github.com/evanw)s [esbuild](https://github.com/evanw/esbuild) project.
- Credit to [@kipply](https://github.com/kipply) for the name "Bun"!