mirror of
https://github.com/oven-sh/bun
synced 2026-02-10 02:48:50 +00:00
## Summary
- Port md4c (CommonMark-compliant markdown parser) from C to Zig under `src/md/`
- Three output modes:
  - `Bun.markdown.html(input, options?)`: render to an HTML string
  - `Bun.markdown.render(input, callbacks?)`: render with custom callbacks for each element
  - `Bun.markdown.react(input, options?)`: render to a React Fragment element, directly usable as a component return value
- React element creation uses a cached JSC Structure with `putDirectOffset` for fast allocation
- Component overrides in `react()`: pass tag names as option keys to replace default HTML elements with custom components
- GFM extensions: tables, strikethrough, task lists, permissive autolinks, disallowed raw HTML tag filter
- Wire up `.md` as a bundler loader (via explicit `{ type: "md" }`)
## JavaScript API
### `Bun.markdown.html(input, options?)`
Renders markdown to an HTML string:
```js
const html = Bun.markdown.html("# Hello **world**");
// "<h1>Hello <strong>world</strong></h1>\n"
Bun.markdown.html("## Hello", { headingIds: true });
// '<h2 id="hello">Hello</h2>\n'
```
### `Bun.markdown.render(input, callbacks?)`
Renders markdown with custom JavaScript callbacks for each element. Each
callback receives children as a string and optional metadata, and
returns a string:
```js
// Custom HTML with classes
const html = Bun.markdown.render("# Title\n\nHello **world**", {
heading: (children, { level }) => `<h${level} class="title">${children}</h${level}>`,
paragraph: (children) => `<p>${children}</p>`,
strong: (children) => `<b>${children}</b>`,
});
// ANSI terminal output
const ansi = Bun.markdown.render("# Hello\n\n**bold**", {
heading: (children) => `\x1b[1;4m${children}\x1b[0m\n`,
paragraph: (children) => children + "\n",
strong: (children) => `\x1b[1m${children}\x1b[22m`,
});
// Strip all formatting
const text = Bun.markdown.render("# Hello **world**", {
heading: (children) => children,
paragraph: (children) => children,
strong: (children) => children,
});
// "Hello world"
// Return null to omit elements
const result = Bun.markdown.render("# Title\n\n\n\nHello", {
image: () => null,
heading: (children) => children,
paragraph: (children) => children + "\n",
});
// "Title\nHello\n"
```
Parser options can be included alongside callbacks:
```js
Bun.markdown.render("Visit www.example.com", {
link: (children, { href }) => `[${children}](${href})`,
paragraph: (children) => children,
permissiveAutolinks: true,
});
```
### `Bun.markdown.react(input, options?)`
Returns a React Fragment element — use it directly as a component return
value:
```tsx
// Use as a component
function Markdown({ text }: { text: string }) {
return Bun.markdown.react(text);
}
// With custom components
function Heading({ children }: { children: React.ReactNode }) {
return <h1 className="title">{children}</h1>;
}
const element = Bun.markdown.react("# Hello", { h1: Heading });
// Server-side rendering
import { renderToString } from "react-dom/server";
const html = renderToString(Bun.markdown.react("# Hello **world**"));
// "<h1>Hello <strong>world</strong></h1>"
```
#### React 18 and older
By default, `react()` uses `Symbol.for('react.transitional.element')` as
the `$$typeof` symbol, which is what React 19 expects. For React 18 and
older, pass `reactVersion: 18`:
```tsx
const el = Bun.markdown.react("# Hello", { reactVersion: 18 });
```
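The two markers differ only in the key passed to `Symbol.for`. The sketch below shows that choice in isolation; `elementTypeSymbol` is a hypothetical helper written for illustration and is not part of Bun's API.

```typescript
// Hypothetical helper, not part of Bun: returns the `$$typeof` marker that a
// given React major version looks for on element objects.
function elementTypeSymbol(reactVersion: number): symbol {
  return reactVersion >= 19
    ? Symbol.for("react.transitional.element") // React 19 and newer
    : Symbol.for("react.element"); // React 18 and older
}

console.log(String(elementTypeSymbol(19))); // Symbol(react.transitional.element)
console.log(String(elementTypeSymbol(18))); // Symbol(react.element)
```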
### Component Overrides
Tag names can be overridden in `react()`:
```tsx
Bun.markdown.react(input, {
h1: MyHeading, // block elements
p: CustomParagraph,
a: CustomLink, // inline elements
img: CustomImage,
pre: CodeBlock,
// ... h1-h6, p, blockquote, ul, ol, li, pre, hr, html,
// table, thead, tbody, tr, th, td,
// em, strong, a, img, code, del, math, u, br
});
```
Boolean values are ignored (not treated as overrides), so parser options
like `{ strikethrough: true }` don't conflict with component overrides.
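One way to picture this rule is a pass over the options object that sorts keys by value type. This is only a sketch of the idea, not Bun's implementation: `splitOptions` and `ComponentLike` are hypothetical names, and treating only function values as overrides is an assumption made for the example.

```typescript
// Hypothetical helper for illustration only; not part of Bun's API.
type ComponentLike = (props: Record<string, unknown>) => unknown;

function splitOptions(options: Record<string, unknown>) {
  const overrides: Record<string, ComponentLike> = {};
  const parserFlags: Record<string, boolean> = {};
  for (const [key, value] of Object.entries(options)) {
    if (typeof value === "boolean") {
      parserFlags[key] = value; // parser option, never a component override
    } else if (typeof value === "function") {
      overrides[key] = value as ComponentLike; // candidate component override
    }
    // Other value types (such as `reactVersion: 18`) are left out of this sketch.
  }
  return { overrides, parserFlags };
}

const Heading: ComponentLike = () => null; // stand-in component
const { overrides, parserFlags } = splitOptions({ h1: Heading, strikethrough: true });
console.log(Object.keys(overrides).join(",")); // h1
console.log(JSON.stringify(parserFlags)); // {"strikethrough":true}
```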
### Options
```js
Bun.markdown.html(input, {
tables: true, // GFM tables (default: true)
strikethrough: true, // ~~deleted~~ (default: true)
tasklists: true, // - [x] items (default: true)
headingIds: true, // Generate id attributes on headings
autolinkHeadings: true, // Wrap heading content in <a> tags
tagFilter: false, // GFM disallowed HTML tags
wikiLinks: false, // [[wiki]] links
latexMath: false, // $inline$ and $$display$$
underline: false, // __underline__ (instead of <strong>)
// ... and more
});
```
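For context on `tagFilter`: the GFM "Disallowed Raw HTML" extension neutralizes a fixed set of tags by replacing their leading `<` with `&lt;`. The sketch below applies that rule to a string with a regular expression. It illustrates the GFM rule only; the helper name `filterDisallowedTags` is hypothetical, and nothing here is taken from Bun's Zig implementation.

```typescript
// Tag names listed by the GFM "Disallowed Raw HTML" extension.
const DISALLOWED_TAGS = [
  "title", "textarea", "style", "xmp", "iframe",
  "noembed", "noframes", "script", "plaintext",
];

// Matches "<" or "</" followed by a disallowed tag name and a tag boundary.
const DISALLOWED_TAG_PATTERN = new RegExp(
  "<(/?(?:" + DISALLOWED_TAGS.join("|") + "))(?=[\\s/>]|$)",
  "gi",
);

// Sketch only: escapes the leading "<" of each disallowed tag.
function filterDisallowedTags(html: string): string {
  return html.replace(DISALLOWED_TAG_PATTERN, "&lt;$1");
}

console.log(filterDisallowedTags("<strong> <title> <style> <em>"));
// <strong> &lt;title> &lt;style> <em>
```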
## Architecture
### Parser (`src/md/`)
The parser is split into focused modules using Zig's delegation pattern:
| Module | Purpose |
|--------|---------|
| `parser.zig` | Core `Parser` struct, state, and re-exported method delegation |
| `blocks.zig` | Block-level parsing: document processing, line analysis, block start/end |
| `containers.zig` | Container management: blockquotes, lists, list items |
| `inlines.zig` | Inline parsing: emphasis, code spans, HTML tags, entities |
| `links.zig` | Link/image resolution, reference links, autolink rendering |
| `autolinks.zig` | Permissive autolink detection (www, url, email) |
| `line_analysis.zig` | Line classification: headings, fences, HTML blocks, tables |
| `ref_defs.zig` | Reference definition parsing and lookup |
| `render_blocks.zig` | Block rendering dispatch (code, HTML, table blocks) |
| `html_renderer.zig` | HTML renderer implementing `Renderer` VTable |
| `types.zig` | Shared types: `Renderer` VTable, `BlockType`, `SpanType`, `TextType`, etc. |
### Renderer Abstraction
Parsing is decoupled from output via a `Renderer` VTable interface:
```zig
pub const Renderer = struct {
ptr: *anyopaque,
vtable: *const VTable,
pub const VTable = struct {
enterBlock: *const fn (...) void,
leaveBlock: *const fn (...) void,
enterSpan: *const fn (...) void,
leaveSpan: *const fn (...) void,
text: *const fn (...) void,
};
};
```
Four renderers are implemented:
- **`HtmlRenderer`** (`src/md/html_renderer.zig`) — produces HTML string
output
- **`JsCallbackRenderer`** (`src/bun.js/api/MarkdownObject.zig`) — calls
JS callbacks for each element, accumulates string output
- **`ParseRenderer`** (`src/bun.js/api/MarkdownObject.zig`) — builds
React element AST with `MarkedArgumentBuffer` for GC safety
- **`JSReactElement`** (`src/bun.js/bindings/JSReactElement.cpp`) — C++
fast path for React element creation using cached JSC Structure +
`putDirectOffset`
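The decoupling can be pictured outside Zig as one event stream consumed by interchangeable renderers. The TypeScript sketch below mirrors the five callback names from the VTable; the argument types, the two renderer classes, and the `emitHeading` stand-in for the parser are assumptions made for illustration.

```typescript
// Sketch of the five-callback renderer shape; names mirror the VTable above,
// but the argument types here are assumptions.
interface Renderer {
  enterBlock(type: string): void;
  leaveBlock(type: string): void;
  enterSpan(type: string): void;
  leaveSpan(type: string): void;
  text(content: string): void;
}

// Emits HTML-like tags into a string buffer (no escaping in this sketch).
class TagRenderer implements Renderer {
  out = "";
  enterBlock(type: string) { this.out += `<${type}>`; }
  leaveBlock(type: string) { this.out += `</${type}>`; }
  enterSpan(type: string) { this.out += `<${type}>`; }
  leaveSpan(type: string) { this.out += `</${type}>`; }
  text(content: string) { this.out += content; }
}

// Discards structure and keeps only the text.
class PlainTextRenderer implements Renderer {
  out = "";
  enterBlock() {}
  leaveBlock() {}
  enterSpan() {}
  leaveSpan() {}
  text(content: string) { this.out += content; }
}

// Stand-in for the parser: replays the events for "# Hello **world**".
function emitHeading(r: Renderer): void {
  r.enterBlock("h1");
  r.text("Hello ");
  r.enterSpan("strong");
  r.text("world");
  r.leaveSpan("strong");
  r.leaveBlock("h1");
}

const tagRenderer = new TagRenderer();
emitHeading(tagRenderer);
console.log(tagRenderer.out); // <h1>Hello <strong>world</strong></h1>

const plainRenderer = new PlainTextRenderer();
emitHeading(plainRenderer);
console.log(plainRenderer.out); // Hello world
```

The event source never needs to know which renderer it is driving, which is the property the VTable provides on the Zig side.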
## Test plan
- [x] 792 spec tests pass (CommonMark, GFM tables, strikethrough,
tasklists, permissive autolinks, GFM tag filter, wiki links, coverage,
regressions)
- [x] 114 API tests pass (`html()`, `render()`, `react()`,
`renderToString` integration, component overrides)
- [x] 58 GFM compatibility tests pass
```
bun bd test test/js/bun/md/md-spec.test.ts # 792 pass
bun bd test test/js/bun/md/md-render-api.test.ts # 114 pass
bun bd test test/js/bun/md/gfm-compat.test.ts # 58 pass
```
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Co-authored-by: Dylan Conway <dylan.conway567@gmail.com>
Co-authored-by: SUZUKI Sosuke <sosuke@bun.com>
Co-authored-by: robobun <robobun@oven.sh>
Co-authored-by: Claude Bot <claude-bot@bun.sh>
Co-authored-by: Kirill Markelov <kerusha.chubko@gmail.com>
Co-authored-by: Ciro Spaciari <ciro.spaciari@gmail.com>
Co-authored-by: Alistair Smith <hi@alistair.sh>
298 lines
6.1 KiB
Plaintext
# Coverage

This file is just a collection of tests designed to activate code in MD4C
which may otherwise be hard to hit. It's to improve our test coverage.


## `md_is_unicode_whitespace__()`

Unicode whitespace (here U+2000) forms a word boundary so these cannot be
resolved as emphasis span because there is no closer mark.

```````````````````````````````` example
*foo *bar
.
<p>*foo *bar</p>
````````````````````````````````


## `md_is_unicode_punct__()`

Ditto for Unicode punctuation (here U+00A1).

```````````````````````````````` example
*foo¡*bar
.
<p>*foo¡*bar</p>
````````````````````````````````


## `md_get_unicode_fold_info()`

```````````````````````````````` example
[Příliš žluťoučký kůň úpěl ďábelské ódy.]

[PŘÍLIŠ ŽLUŤOUČKÝ KŮŇ ÚPĚL ĎÁBELSKÉ ÓDY.]: /url
.
<p><a href="/url">Příliš žluťoučký kůň úpěl ďábelské ódy.</a></p>
````````````````````````````````


## `md_decode_utf8__()` and `md_decode_utf8_before__()`

### Alphanumerical Character (i.e. not whitespace, not punctuation)

Non-whitespace & non-punctuation characters below suppress `_` from being
recognized as an emphasis because `_` should be seen as in-word character:

Example of 1-byte UTF-8 sequence (U+0058):
```````````````````````````````` example
X__foo__X
.
<p>X__foo__X</p>
````````````````````````````````

Example of 2-byte UTF-8 sequence (U+0158):
```````````````````````````````` example
Ř__foo__Ř
.
<p>Ř__foo__Ř</p>
````````````````````````````````

Example of 3-byte UTF-8 sequence (U+0BA3):
```````````````````````````````` example
ண__foo__ண
.
<p>ண__foo__ண</p>
````````````````````````````````

Example of 4-byte UTF-8 sequence (U+13142):
```````````````````````````````` example
𓅂__foo__𓅂
.
<p>𓅂__foo__𓅂</p>
````````````````````````````````
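
The byte counts quoted in the four examples above follow from the standard UTF-8 encoding ranges. A minimal sketch that recomputes them (the helper name `utf8Length` is hypothetical and exists only for this illustration):

```typescript
// Number of bytes UTF-8 needs for a single Unicode code point.
function utf8Length(codePoint: number): number {
  if (codePoint < 0x80) return 1; // U+0000..U+007F
  if (codePoint < 0x800) return 2; // U+0080..U+07FF
  if (codePoint < 0x10000) return 3; // U+0800..U+FFFF
  return 4; // U+10000..U+10FFFF
}

// Code points taken from the four examples above.
const samples: Array<[string, number]> = [
  ["U+0058", 0x0058],
  ["U+0158", 0x0158],
  ["U+0BA3", 0x0ba3],
  ["U+13142", 0x13142],
];

for (const [label, codePoint] of samples) {
  console.log(`${label}: ${utf8Length(codePoint)} byte(s)`);
}
// U+0058: 1 byte(s)
// U+0158: 2 byte(s)
// U+0BA3: 3 byte(s)
// U+13142: 4 byte(s)
```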

### Whitespace character

Whitespace on the other hand should not suppress `_`:

Example of 1-byte UTF-8 sequence (U+0009):
```````````````````````````````` example
x→__foo__→
.
<p>x <strong>foo</strong></p>
````````````````````````````````
(The initial `x` to suppress indented code block.)

Example of 2-byte UTF-8 sequence (U+00A0):
```````````````````````````````` example
__foo__
.
<p> <strong>foo</strong> </p>
````````````````````````````````

Example of 3-byte UTF-8 sequence (U+2000):
```````````````````````````````` example
__foo__
.
<p> <strong>foo</strong> </p>
````````````````````````````````

(AFAIK, there is no 4-byte UTF-8 whitespace.)

### Punctuation character

Punctuation also should not suppress `_`:

Example of 1-byte UTF-8 sequence (U+002E):
```````````````````````````````` example
.__foo__.
.
<p>.<strong>foo</strong>.</p>
````````````````````````````````

Example of 2-byte UTF-8 sequence (U+00B7):
```````````````````````````````` example
·__foo__·
.
<p>·<strong>foo</strong>·</p>
````````````````````````````````

Example of 3-byte UTF-8 sequence (U+0C84):
```````````````````````````````` example
಄__foo__಄
.
<p>಄<strong>foo</strong>಄</p>
````````````````````````````````

Example of 4-byte UTF-8 sequence (U+1039F):
```````````````````````````````` example
𐎟__foo__𐎟
.
<p>𐎟<strong>foo</strong>𐎟</p>
````````````````````````````````


## `md_is_link_destination_A()`

```````````````````````````````` example
[link](</url\.with\.escape>)
.
<p><a href="/url.with.escape">link</a></p>
````````````````````````````````


## `md_link_label_eq()`

```````````````````````````````` example
[foo bar]

[foo bar]: /url
.
<p><a href="/url">foo bar</a></p>
````````````````````````````````


## `md_is_inline_link_spec()`

```````````````````````````````` example
> [link](/url 'foo
> bar')
.
<blockquote>
<p><a href="/url" title="foo
bar">link</a></p>
</blockquote>
````````````````````````````````


## `md_build_ref_def_hashtable()`

All link labels in the following example have the same FNV1a hash (after
normalization of the label, which means after converting to a vector of Unicode
codepoints and lowercase folding).

So the example triggers quite complex code paths which are not otherwise easily
tested.

```````````````````````````````` example
[foo]: /foo
[qnptgbh]: /qnptgbh
[abgbrwcv]: /abgbrwcv
[abgbrwcv]: /abgbrwcv2
[abgbrwcv]: /abgbrwcv3
[abgbrwcv]: /abgbrwcv4
[alqadfgn]: /alqadfgn

[foo]
[qnptgbh]
[abgbrwcv]
[alqadfgn]
[axgydtdu]
.
<p><a href="/foo">foo</a>
<a href="/qnptgbh">qnptgbh</a>
<a href="/abgbrwcv">abgbrwcv</a>
<a href="/alqadfgn">alqadfgn</a>
[axgydtdu]</p>
````````````````````````````````

For the sake of completeness, the following C program was used to find the hash
collisions by brute force:

~~~
#include <stdio.h>
#include <string.h>


static unsigned etalon;


#define MD_FNV1A_BASE       2166136261
#define MD_FNV1A_PRIME      16777619

static inline unsigned
fnv1a(unsigned base, const void* data, size_t n)
{
    const unsigned char* buf = (const unsigned char*) data;
    unsigned hash = base;
    size_t i;

    for(i = 0; i < n; i++) {
        hash ^= buf[i];
        hash *= MD_FNV1A_PRIME;
    }

    return hash;
}


static unsigned
unicode_hash(const char* data, size_t n)
{
    unsigned value;
    unsigned hash = MD_FNV1A_BASE;
    int i;

    for(i = 0; i < n; i++) {
        value = data[i];
        hash = fnv1a(hash, &value, sizeof(unsigned));
    }

    return hash;
}


static void
recurse(char* buffer, size_t off, size_t len)
{
    int ch;

    if(off < len - 1) {
        for(ch = 'a'; ch <= 'z'; ch++) {
            buffer[off] = ch;
            recurse(buffer, off+1, len);
        }
    } else {
        for(ch = 'a'; ch <= 'z'; ch++) {
            buffer[off] = ch;
            if(unicode_hash(buffer, len) == etalon) {
                printf("Dup: %.*s\n", (int)len, buffer);
            }
        }
    }
}

int
main(int argc, char** argv)
{
    char buffer[32];
    int len;

    if(argc < 2)
        etalon = unicode_hash("foo", 3);
    else
        etalon = unicode_hash(argv[1], strlen(argv[1]));

    for(len = 1; len <= sizeof(buffer); len++)
        recurse(buffer, 0, len);

    return 0;
}
~~~
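
To spot-check the listed labels without rerunning the brute-force search, the same hash can be recomputed directly. The sketch below is a TypeScript port of `unicode_hash()` from the C program, under the assumptions that `unsigned` is 4 bytes wide and stored little-endian (as on x86) and that the labels are plain ASCII. If the claim above holds, all five lines print the same value.

```typescript
const FNV1A_BASE = 2166136261;
const FNV1A_PRIME = 16777619;

// Feeds each character to FNV-1a as a 4-byte little-endian integer,
// mirroring `unicode_hash()` in the C program (assumed 32-bit `unsigned`).
function unicodeHash(label: string): number {
  let hash = FNV1A_BASE;
  for (let i = 0; i < label.length; i++) {
    const value = label.charCodeAt(i);
    for (let shift = 0; shift < 32; shift += 8) {
      hash ^= (value >>> shift) & 0xff; // XOR in the next byte
      hash = Math.imul(hash, FNV1A_PRIME) >>> 0; // 32-bit wrapping multiply
    }
  }
  return hash >>> 0;
}

// Labels taken from the example above.
for (const label of ["foo", "qnptgbh", "abgbrwcv", "alqadfgn", "axgydtdu"]) {
  console.log(label, unicodeHash(label).toString(16));
}
```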


## Flag `MD_FLAG_COLLAPSEWHITESPACE`

```````````````````````````````` example
foo bar → baz
.
<p>foo bar baz</p>
.
--fcollapse-whitespace
````````````````````````````````