Files
bun.sh/test/js/bun/md/coverage.txt
Jarred Sumner 1bfe5c6b37 feat(md): Zig markdown parser with Bun.markdown API (#26440)
## Summary

- Port md4c (CommonMark-compliant markdown parser) from C to Zig under
`src/md/`
- Three output modes:
  - `Bun.markdown.html(input, options?)` — render to HTML string
- `Bun.markdown.render(input, callbacks?)` — render with custom
callbacks for each element
- `Bun.markdown.react(input, options?)` — render to a React Fragment
element, directly usable as a component return value
- React element creation uses a cached JSC Structure with
`putDirectOffset` for fast allocation
- Component overrides in `react()`: pass tag names as options keys to
replace default HTML elements with custom components
- GFM extensions: tables, strikethrough, task lists, permissive
autolinks, disallowed raw HTML tag filter
- Wire up `.md` as a bundler loader (via explicit `{ type: "md" }`)

## JavaScript API

### `Bun.markdown.html(input, options?)`

Renders markdown to an HTML string:

```js
const html = Bun.markdown.html("# Hello **world**");
// "<h1>Hello <strong>world</strong></h1>\n"

Bun.markdown.html("## Hello", { headingIds: true });
// '<h2 id="hello">Hello</h2>\n'
```

### `Bun.markdown.render(input, callbacks?)`

Renders markdown with custom JavaScript callbacks for each element. Each
callback receives children as a string and optional metadata, and
returns a string:

```js
// Custom HTML with classes
const html = Bun.markdown.render("# Title\n\nHello **world**", {
  heading: (children, { level }) => `<h${level} class="title">${children}</h${level}>`,
  paragraph: (children) => `<p>${children}</p>`,
  strong: (children) => `<b>${children}</b>`,
});

// ANSI terminal output
const ansi = Bun.markdown.render("# Hello\n\n**bold**", {
  heading: (children) => `\x1b[1;4m${children}\x1b[0m\n`,
  paragraph: (children) => children + "\n",
  strong: (children) => `\x1b[1m${children}\x1b[22m`,
});

// Strip all formatting
const text = Bun.markdown.render("# Hello **world**", {
  heading: (children) => children,
  paragraph: (children) => children,
  strong: (children) => children,
});
// "Hello world"

// Return null to omit elements
const result = Bun.markdown.render("# Title\n\n![logo](img.png)\n\nHello", {
  image: () => null,
  heading: (children) => children,
  paragraph: (children) => children + "\n",
});
// "Title\nHello\n"
```

Parser options can be included alongside callbacks:

```js
Bun.markdown.render("Visit www.example.com", {
  link: (children, { href }) => `[${children}](${href})`,
  paragraph: (children) => children,
  permissiveAutolinks: true,
});
```

### `Bun.markdown.react(input, options?)`

Returns a React Fragment element — use it directly as a component return
value:

```tsx
// Use as a component
function Markdown({ text }: { text: string }) {
  return Bun.markdown.react(text);
}

// With custom components
function Heading({ children }: { children: React.ReactNode }) {
  return <h1 className="title">{children}</h1>;
}
const element = Bun.markdown.react("# Hello", { h1: Heading });

// Server-side rendering
import { renderToString } from "react-dom/server";
const html = renderToString(Bun.markdown.react("# Hello **world**"));
// "<h1>Hello <strong>world</strong></h1>"
```

#### React 18 and older

By default, `react()` uses `Symbol.for('react.transitional.element')` as
the `$$typeof` symbol, which is what React 19 expects. For React 18 and
older, pass `reactVersion: 18`:

```tsx
const el = Bun.markdown.react("# Hello", { reactVersion: 18 });
```

### Component Overrides

Tag names can be overridden in `react()`:

```tsx
Bun.markdown.react(input, {
  h1: MyHeading,      // block elements
  p: CustomParagraph,
  a: CustomLink,      // inline elements
  img: CustomImage,
  pre: CodeBlock,
  // ... h1-h6, p, blockquote, ul, ol, li, pre, hr, html,
  //     table, thead, tbody, tr, th, td,
  //     em, strong, a, img, code, del, math, u, br
});
```

Boolean values are ignored (not treated as overrides), so parser options
like `{ strikethrough: true }` don't conflict with component overrides.

### Options

```js
Bun.markdown.html(input, {
  tables: true,              // GFM tables (default: true)
  strikethrough: true,       // ~~deleted~~ (default: true)
  tasklists: true,           // - [x] items (default: true)
  headingIds: true,          // Generate id attributes on headings
  autolinkHeadings: true,    // Wrap heading content in <a> tags
  tagFilter: false,          // GFM disallowed HTML tags
  wikiLinks: false,          // [[wiki]] links
  latexMath: false,          // $inline$ and $$display$$
  underline: false,          // __underline__ (instead of <strong>)
  // ... and more
});
```

## Architecture

### Parser (`src/md/`)

The parser is split into focused modules using Zig's delegation pattern:

| Module | Purpose |
|--------|---------|
| `parser.zig` | Core `Parser` struct, state, and re-exported method
delegation |
| `blocks.zig` | Block-level parsing: document processing, line
analysis, block start/end |
| `containers.zig` | Container management: blockquotes, lists, list
items |
| `inlines.zig` | Inline parsing: emphasis, code spans, HTML tags,
entities |
| `links.zig` | Link/image resolution, reference links, autolink
rendering |
| `autolinks.zig` | Permissive autolink detection (www, url, email) |
| `line_analysis.zig` | Line classification: headings, fences, HTML
blocks, tables |
| `ref_defs.zig` | Reference definition parsing and lookup |
| `render_blocks.zig` | Block rendering dispatch (code, HTML, table
blocks) |
| `html_renderer.zig` | HTML renderer implementing `Renderer` VTable |
| `types.zig` | Shared types: `Renderer` VTable, `BlockType`,
`SpanType`, `TextType`, etc. |

### Renderer Abstraction

Parsing is decoupled from output via a `Renderer` VTable interface:

```zig
pub const Renderer = struct {
    ptr: *anyopaque,
    vtable: *const VTable,

    pub const VTable = struct {
        enterBlock: *const fn (...) void,
        leaveBlock: *const fn (...) void,
        enterSpan:  *const fn (...) void,
        leaveSpan:  *const fn (...) void,
        text:       *const fn (...) void,
    };
};
```

Four renderers are implemented:
- **`HtmlRenderer`** (`src/md/html_renderer.zig`) — produces HTML string
output
- **`JsCallbackRenderer`** (`src/bun.js/api/MarkdownObject.zig`) — calls
JS callbacks for each element, accumulates string output
- **`ParseRenderer`** (`src/bun.js/api/MarkdownObject.zig`) — builds
React element AST with `MarkedArgumentBuffer` for GC safety
- **`JSReactElement`** (`src/bun.js/bindings/JSReactElement.cpp`) — C++
fast path for React element creation using cached JSC Structure +
`putDirectOffset`

## Test plan

- [x] 792 spec tests pass (CommonMark, GFM tables, strikethrough,
tasklists, permissive autolinks, GFM tag filter, wiki links, coverage,
regressions)
- [x] 114 API tests pass (`html()`, `render()`, `react()`,
`renderToString` integration, component overrides)
- [x] 58 GFM compatibility tests pass

```
bun bd test test/js/bun/md/md-spec.test.ts       # 792 pass
bun bd test test/js/bun/md/md-render-api.test.ts  # 114 pass
bun bd test test/js/bun/md/gfm-compat.test.ts     # 58 pass
```

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Co-authored-by: Dylan Conway <dylan.conway567@gmail.com>
Co-authored-by: SUZUKI Sosuke <sosuke@bun.com>
Co-authored-by: robobun <robobun@oven.sh>
Co-authored-by: Claude Bot <claude-bot@bun.sh>
Co-authored-by: Kirill Markelov <kerusha.chubko@gmail.com>
Co-authored-by: Ciro Spaciari <ciro.spaciari@gmail.com>
Co-authored-by: Alistair Smith <hi@alistair.sh>
2026-01-28 20:24:02 -08:00

298 lines
6.1 KiB
Plaintext
Raw Permalink Blame History

This file contains invisible Unicode characters
This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Coverage
This file is just a collection of tests designed to activate code in MD4C
which may otherwise be hard to hit. It's to improve our test coverage.
## `md_is_unicode_whitespace__()`
Unicode whitespace (here U+2000) forms a word boundary so these cannot be
resolved as emphasis span because there is no closer mark.
```````````````````````````````` example
*foo *bar
.
<p>*foo *bar</p>
````````````````````````````````
## `md_is_unicode_punct__()`
Ditto for Unicode punctuation (here U+00A1).
```````````````````````````````` example
*foo¡*bar
.
<p>*foo¡*bar</p>
````````````````````````````````
## `md_get_unicode_fold_info()`
```````````````````````````````` example
[Příliš žluťoučký kůň úpěl ďábelské ódy.]
[PŘÍLIŠ ŽLUŤOUČKÝ KŮŇ ÚPĚL ĎÁBELSKÉ ÓDY.]: /url
.
<p><a href="/url">Příliš žluťoučký kůň úpěl ďábelské ódy.</a></p>
````````````````````````````````
## `md_decode_utf8__()` and `md_decode_utf8_before__()`
### Alphanumerical Character (i.e. not whitespace, not punctuation)
Non-whitespace & non-punctuation characters below suppress `_` from being
recognized as an emphasis because `_` should be seen as in-word character:
Example of 1-byte UTF-8 sequence (U+0058):
```````````````````````````````` example
X__foo__X
.
<p>X__foo__X</p>
````````````````````````````````
Example of 2-byte UTF-8 sequence (U+0158):
```````````````````````````````` example
Ř__foo__Ř
.
<p>Ř__foo__Ř</p>
````````````````````````````````
Example of 3-byte UTF-8 sequence (U+0BA3):
```````````````````````````````` example
ண__foo__ண
.
<p>ண__foo__ண</p>
````````````````````````````````
Example of 4-byte UTF-8 sequence (U+13142):
```````````````````````````````` example
𓅂__foo__𓅂
.
<p>𓅂__foo__𓅂</p>
````````````````````````````````
### Whitespace character
Whitespace on the other hand should not suppress `_`:
Example of 1-byte UTF-8 sequence (U+0009):
```````````````````````````````` example
x→__foo__→
.
<p>x <strong>foo</strong></p>
````````````````````````````````
(The initial `x` to suppress indented code block.)
Example of 2-byte UTF-8 sequence (U+00A0):
```````````````````````````````` example
__foo__
.
<p> <strong>foo</strong> </p>
````````````````````````````````
Example of 3-byte UTF-8 sequence (U+2000):
```````````````````````````````` example
 __foo__
.
<p> <strong>foo</strong> </p>
````````````````````````````````
(AFAIK, there is no 4-byte UTF-8 whitespace.)
### Punctuation character
Punctuation also should not suppress `_`:
Example of 1-byte UTF-8 sequence (U+002E):
```````````````````````````````` example
.__foo__.
.
<p>.<strong>foo</strong>.</p>
````````````````````````````````
Example of 2-byte UTF-8 sequence (U+00B7):
```````````````````````````````` example
·__foo__·
.
<p>·<strong>foo</strong>·</p>
````````````````````````````````
Example of 3-byte UTF-8 sequence (U+0C84):
```````````````````````````````` example
಄__foo__಄
.
<p>಄<strong>foo</strong>಄</p>
````````````````````````````````
Example of 4-byte UTF-8 sequence (U+1039F):
```````````````````````````````` example
𐎟__foo__𐎟
.
<p>𐎟<strong>foo</strong>𐎟</p>
````````````````````````````````
## `md_is_link_destination_A()`
```````````````````````````````` example
[link](</url\.with\.escape>)
.
<p><a href="/url.with.escape">link</a></p>
````````````````````````````````
## `md_link_label_eq()`
```````````````````````````````` example
[foo bar]
[foo bar]: /url
.
<p><a href="/url">foo bar</a></p>
````````````````````````````````
## `md_is_inline_link_spec()`
```````````````````````````````` example
> [link](/url 'foo
> bar')
.
<blockquote>
<p><a href="/url" title="foo
bar">link</a></p>
</blockquote>
````````````````````````````````
## `md_build_ref_def_hashtable()`
All link labels in the following example all have the same FNV1a hash (after
normalization of the label, which means after converting to a vector of Unicode
codepoints and lowercase folding).
So the example triggers quite complex code paths which are not otherwise easily
tested.
```````````````````````````````` example
[foo]: /foo
[qnptgbh]: /qnptgbh
[abgbrwcv]: /abgbrwcv
[abgbrwcv]: /abgbrwcv2
[abgbrwcv]: /abgbrwcv3
[abgbrwcv]: /abgbrwcv4
[alqadfgn]: /alqadfgn
[foo]
[qnptgbh]
[abgbrwcv]
[alqadfgn]
[axgydtdu]
.
<p><a href="/foo">foo</a>
<a href="/qnptgbh">qnptgbh</a>
<a href="/abgbrwcv">abgbrwcv</a>
<a href="/alqadfgn">alqadfgn</a>
[axgydtdu]</p>
````````````````````````````````
For the sake of completeness, the following C program was used to find the hash
collisions by brute force:
~~~
#include <stdio.h>
#include <string.h>
static unsigned etalon;
#define MD_FNV1A_BASE 2166136261
#define MD_FNV1A_PRIME 16777619
static inline unsigned
fnv1a(unsigned base, const void* data, size_t n)
{
const unsigned char* buf = (const unsigned char*) data;
unsigned hash = base;
size_t i;
for(i = 0; i < n; i++) {
hash ^= buf[i];
hash *= MD_FNV1A_PRIME;
}
return hash;
}
static unsigned
unicode_hash(const char* data, size_t n)
{
unsigned value;
unsigned hash = MD_FNV1A_BASE;
int i;
for(i = 0; i < n; i++) {
value = data[i];
hash = fnv1a(hash, &value, sizeof(unsigned));
}
return hash;
}
static void
recurse(char* buffer, size_t off, size_t len)
{
int ch;
if(off < len - 1) {
for(ch = 'a'; ch <= 'z'; ch++) {
buffer[off] = ch;
recurse(buffer, off+1, len);
}
} else {
for(ch = 'a'; ch <= 'z'; ch++) {
buffer[off] = ch;
if(unicode_hash(buffer, len) == etalon) {
printf("Dup: %.*s\n", (int)len, buffer);
}
}
}
}
int
main(int argc, char** argv)
{
char buffer[32];
int len;
if(argc < 2)
etalon = unicode_hash("foo", 3);
else
etalon = unicode_hash(argv[1], strlen(argv[1]));
for(len = 1; len <= sizeof(buffer); len++)
recurse(buffer, 0, len);
return 0;
}
~~~
## Flag `MD_FLAG_COLLAPSEWHITESPACE`
```````````````````````````````` example
foo bar → baz
.
<p>foo bar baz</p>
.
--fcollapse-whitespace
````````````````````````````````