From 1bfe5c6b37e65995ac58e761ccbff7bb7b8dc954 Mon Sep 17 00:00:00 2001
From: Jarred Sumner
Date: Wed, 28 Jan 2026 20:24:02 -0800
Subject: [PATCH] feat(md): Zig markdown parser with Bun.markdown API (#26440)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary

- Port md4c (CommonMark-compliant markdown parser) from C to Zig under `src/md/`
- Three output modes:
  - `Bun.markdown.html(input, options?)` — render to HTML string
  - `Bun.markdown.render(input, callbacks?)` — render with custom callbacks for each element
  - `Bun.markdown.react(input, options?)` — render to a React Fragment element, directly usable as a component return value
- React element creation uses a cached JSC Structure with `putDirectOffset` for fast allocation
- Component overrides in `react()`: pass tag names as options keys to replace default HTML elements with custom components
- GFM extensions: tables, strikethrough, task lists, permissive autolinks, disallowed raw HTML tag filter
- Wire up `.md` as a bundler loader (via explicit `{ type: "md" }`)

## JavaScript API

### `Bun.markdown.html(input, options?)`

Renders markdown to an HTML string:

```js
const html = Bun.markdown.html("# Hello **world**");
// "<h1>Hello <strong>world</strong></h1>\n"

Bun.markdown.html("## Hello", { headingIds: true });
// '<h2 id="hello">Hello</h2>\n'
```

### `Bun.markdown.render(input, callbacks?)`

Renders markdown with custom JavaScript callbacks for each element. Each callback receives children as a string and optional metadata, and returns a string:

```js
// Custom HTML with classes
const html = Bun.markdown.render("# Title\n\nHello **world**", {
  heading: (children, { level }) => `<h${level} class="title">${children}</h${level}>`,
  paragraph: (children) => `<p>${children}</p>`,
  strong: (children) => `<b>${children}</b>`,
});

// ANSI terminal output
const ansi = Bun.markdown.render("# Hello\n\n**bold**", {
  heading: (children) => `\x1b[1;4m${children}\x1b[0m\n`,
  paragraph: (children) => children + "\n",
  strong: (children) => `\x1b[1m${children}\x1b[22m`,
});

// Strip all formatting
const text = Bun.markdown.render("# Hello **world**", {
  heading: (children) => children,
  paragraph: (children) => children,
  strong: (children) => children,
});
// "Hello world"

// Return null to omit elements
const result = Bun.markdown.render("# Title\n\n![logo](img.png)\n\nHello", {
  image: () => null,
  heading: (children) => children,
  paragraph: (children) => children + "\n",
});
// "Title\nHello\n"
```

Parser options can be included alongside callbacks:

```js
Bun.markdown.render("Visit www.example.com", {
  link: (children, { href }) => `[${children}](${href})`,
  paragraph: (children) => children,
  permissiveAutolinks: true,
});
```

### `Bun.markdown.react(input, options?)`

Returns a React Fragment element — use it directly as a component return value:

```tsx
// Use as a component
function Markdown({ text }: { text: string }) {
  return Bun.markdown.react(text);
}

// With custom components
function Heading({ children }: { children: React.ReactNode }) {
  return <h1 className="title">{children}</h1>;
}
const element = Bun.markdown.react("# Hello", { h1: Heading });

// Server-side rendering
import { renderToString } from "react-dom/server";
const html = renderToString(Bun.markdown.react("# Hello **world**"));
// "<h1>Hello <strong>world</strong></h1>"
```

#### React 18 and older

By default, `react()` uses `Symbol.for('react.transitional.element')` as the `$$typeof` symbol, which is what React 19 expects. For React 18 and older, pass `reactVersion: 18`:

```tsx
const el = Bun.markdown.react("# Hello", { reactVersion: 18 });
```

### Component Overrides

Tag names can be overridden in `react()`:

```tsx
Bun.markdown.react(input, {
  h1: MyHeading,      // block elements
  p: CustomParagraph,
  a: CustomLink,      // inline elements
  img: CustomImage,
  pre: CodeBlock,
  // ... h1-h6, p, blockquote, ul, ol, li, pre, hr, html,
  //     table, thead, tbody, tr, th, td,
  //     em, strong, a, img, code, del, math, u, br
});
```

Boolean values are ignored (not treated as overrides), so parser options like `{ strikethrough: true }` don't conflict with component overrides.

### Options

```js
Bun.markdown.html(input, {
  tables: true,            // GFM tables (default: true)
  strikethrough: true,     // ~~deleted~~ (default: true)
  tasklists: true,         // - [x] items (default: true)
  headingIds: true,        // Generate id attributes on headings
  autolinkHeadings: true,  // Wrap heading content in <a> tags
  tagFilter: false,        // GFM disallowed HTML tags
  wikiLinks: false,        // [[wiki]] links
  latexMath: false,        // $inline$ and $$display$$
  underline: false,        // __underline__ (instead of <strong>)
  // ... and more
});
```

## Architecture

### Parser (`src/md/`)

The parser is split into focused modules using Zig's delegation pattern:

| Module | Purpose |
|--------|---------|
| `parser.zig` | Core `Parser` struct, state, and re-exported method delegation |
| `blocks.zig` | Block-level parsing: document processing, line analysis, block start/end |
| `containers.zig` | Container management: blockquotes, lists, list items |
| `inlines.zig` | Inline parsing: emphasis, code spans, HTML tags, entities |
| `links.zig` | Link/image resolution, reference links, autolink rendering |
| `autolinks.zig` | Permissive autolink detection (www, url, email) |
| `line_analysis.zig` | Line classification: headings, fences, HTML blocks, tables |
| `ref_defs.zig` | Reference definition parsing and lookup |
| `render_blocks.zig` | Block rendering dispatch (code, HTML, table blocks) |
| `html_renderer.zig` | HTML renderer implementing `Renderer` VTable |
| `types.zig` | Shared types: `Renderer` VTable, `BlockType`, `SpanType`, `TextType`, etc. |

### Renderer Abstraction

Parsing is decoupled from output via a `Renderer` VTable interface:

```zig
pub const Renderer = struct {
    ptr: *anyopaque,
    vtable: *const VTable,
    pub const VTable = struct {
        enterBlock: *const fn (...) void,
        leaveBlock: *const fn (...) void,
        enterSpan: *const fn (...) void,
        leaveSpan: *const fn (...) void,
        text: *const fn (...) void,
    };
};
```

Four renderers are implemented:

- **`HtmlRenderer`** (`src/md/html_renderer.zig`) — produces HTML string output
- **`JsCallbackRenderer`** (`src/bun.js/api/MarkdownObject.zig`) — calls JS callbacks for each element, accumulates string output
- **`ParseRenderer`** (`src/bun.js/api/MarkdownObject.zig`) — builds React element AST with `MarkedArgumentBuffer` for GC safety
- **`JSReactElement`** (`src/bun.js/bindings/JSReactElement.cpp`) — C++ fast path for React element creation using cached JSC Structure + `putDirectOffset`

## Test plan

- [x] 792 spec tests pass (CommonMark, GFM tables, strikethrough, tasklists, permissive autolinks, GFM tag filter, wiki links, coverage, regressions)
- [x] 114 API tests pass (`html()`, `render()`, `react()`, `renderToString` integration, component overrides)
- [x] 58 GFM compatibility tests pass

```
bun bd test test/js/bun/md/md-spec.test.ts         # 792 pass
bun bd test test/js/bun/md/md-render-api.test.ts   # 114 pass
bun bd test test/js/bun/md/gfm-compat.test.ts      # 58 pass
```

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Co-authored-by: Dylan Conway
Co-authored-by: SUZUKI Sosuke
Co-authored-by: robobun
Co-authored-by: Claude Bot
Co-authored-by: Kirill Markelov
Co-authored-by: Ciro Spaciari
Co-authored-by: Alistair Smith
---
 cmake/Sources.json | 5 +-
 docs/docs.json | 1 +
 docs/runtime/bun-apis.mdx | 2 +-
 docs/runtime/markdown.mdx | 344 +
 packages/bun-types/bun.d.ts | 411 +
 packages/bun-types/jsx.d.ts | 11 +
 src/api/schema.zig | 1 +
 src/bake/DevServer/DirectoryWatchStore.zig | 1 +
 src/bun.js/ModuleLoader.zig | 2 +-
 src/bun.js/api.zig | 1 +
 src/bun.js/api/BunObject.zig | 6 +
 src/bun.js/api/MarkdownObject.zig | 1135 ++
 src/bun.js/bindings/BunObject+exports.h | 1 +
 src/bun.js/bindings/BunObject.cpp | 1 +
 src/bun.js/bindings/JSReactElement.cpp | 121 +
 src/bun.js/bindings/JSReactElement.h | 26 +
 src/bun.js/bindings/ModuleLoader.cpp | 4 +-
 src/bun.js/bindings/ZigGlobalObject.cpp | 6 +
 src/bun.js/bindings/ZigGlobalObject.h | 2 +
 src/bun.js/bindings/headers-handwritten.h | 1 +
 src/bun.zig | 1 +
 src/bundler/LinkerContext.zig | 2 +-
 src/bundler/ParseTask.zig | 16 +
 src/js_printer.zig | 1 +
 src/md/autolinks.zig | 300 +
 src/md/blocks.zig | 865 ++
 src/md/containers.zig | 192 +
 src/md/entity.zig | 2164 ++++
 src/md/helpers.zig | 482 +
 src/md/html_renderer.zig | 709 ++
 src/md/inlines.zig | 746 ++
 src/md/line_analysis.zig | 527 +
 src/md/links.zig | 527 +
 src/md/parser.zig | 285 +
 src/md/ref_defs.zig | 351 +
 src/md/render_blocks.zig | 153 +
 src/md/root.zig | 104 +
 src/md/types.zig | 387 +
 src/md/unicode.zig | 477 +
 src/options.zig | 14 +-
 src/transpiler.zig | 35 +-
 test/integration/bun-types/bun-types.test.ts | 668 +-
 test/internal/ban-limits.json | 4 +-
 test/js/bun/md/coverage.txt | 297 +
 test/js/bun/md/gfm-compat.test.ts | 753 ++
 test/js/bun/md/md-edge-cases.test.ts | 517 +
 test/js/bun/md/md-heading-ids.test.ts | 105 +
 test/js/bun/md/md-react.test.ts | 553 +
 test/js/bun/md/md-render-callback.test.ts | 258 +
 test/js/bun/md/md-spec.test.ts | 267 +
 test/js/bun/md/regressions.txt | 787 ++
 test/js/bun/md/spec-gfm.txt | 227 +
 test/js/bun/md/spec-hard-soft-breaks.txt | 28 +
 test/js/bun/md/spec-latex-math.txt | 75 +
 test/js/bun/md/spec-permissive-autolinks.txt | 248 +
 test/js/bun/md/spec-strikethrough.txt | 63 +
 test/js/bun/md/spec-tables.txt | 278 +
 test/js/bun/md/spec-tasklists.txt | 127 +
 test/js/bun/md/spec-underline.txt | 47 +
 test/js/bun/md/spec-wiki-links.txt | 278 +
 test/js/bun/md/spec.txt | 9756 ++++++++++++++++++
 61 files changed, 25495 insertions(+), 261 deletions(-)
 create mode 100644 docs/runtime/markdown.mdx
 create mode 100644 packages/bun-types/jsx.d.ts
 create mode 100644 src/bun.js/api/MarkdownObject.zig
 create mode 100644 src/bun.js/bindings/JSReactElement.cpp
 create mode 100644 src/bun.js/bindings/JSReactElement.h
 create mode 100644 src/md/autolinks.zig
 create mode 100644 src/md/blocks.zig
 create mode 100644 src/md/containers.zig
 create mode 100644 src/md/entity.zig
 create mode 100644 src/md/helpers.zig
 create mode 100644 src/md/html_renderer.zig
 create mode 100644 src/md/inlines.zig
 create mode 100644 src/md/line_analysis.zig
 create mode 100644 src/md/links.zig
 create mode 100644 src/md/parser.zig
 create mode 100644 src/md/ref_defs.zig
 create mode 100644 src/md/render_blocks.zig
 create mode 100644 src/md/root.zig
 create mode 100644 src/md/types.zig
 create mode 100644 src/md/unicode.zig
 create mode 100644 test/js/bun/md/coverage.txt
 create mode 100644 test/js/bun/md/gfm-compat.test.ts
 create mode 100644 test/js/bun/md/md-edge-cases.test.ts
 create mode 100644 test/js/bun/md/md-heading-ids.test.ts
 create mode 100644 test/js/bun/md/md-react.test.ts
 create mode 100644 test/js/bun/md/md-render-callback.test.ts
 create mode 100644 test/js/bun/md/md-spec.test.ts
 create mode 100644 test/js/bun/md/regressions.txt
 create mode 100644 test/js/bun/md/spec-gfm.txt
 create mode 100644 test/js/bun/md/spec-hard-soft-breaks.txt
 create mode 100644 test/js/bun/md/spec-latex-math.txt
 create mode 100644 test/js/bun/md/spec-permissive-autolinks.txt
 create mode 100644 test/js/bun/md/spec-strikethrough.txt
 create mode 100644 test/js/bun/md/spec-tables.txt
 create mode 100644 test/js/bun/md/spec-tasklists.txt
 create mode 100644 test/js/bun/md/spec-underline.txt
 create mode 100644 test/js/bun/md/spec-wiki-links.txt
 create mode 100644 test/js/bun/md/spec.txt

diff --git a/cmake/Sources.json b/cmake/Sources.json
index 5ae4930693..2f5339f047 100644
--- a/cmake/Sources.json
+++ b/cmake/Sources.json
@@ -13,10 +13,7 @@
   },
   {
     "output": "JavaScriptSources.txt",
-    "paths": [
-      "src/js/**/*.{js,ts}",
-      "src/install/PackageManager/scanner-entry.ts"
-    ]
+    "paths": ["src/js/**/*.{js,ts}", "src/install/PackageManager/scanner-entry.ts"]
   },
   {
     "output": "JavaScriptCodegenSources.txt",
diff --git a/docs/docs.json b/docs/docs.json
index 7b8acdcbbc..9f5f486201
100644 --- a/docs/docs.json +++ b/docs/docs.json @@ -150,6 +150,7 @@ "/runtime/secrets", "/runtime/console", "/runtime/yaml", + "/runtime/markdown", "/runtime/json5", "/runtime/jsonl", "/runtime/html-rewriter", diff --git a/docs/runtime/bun-apis.mdx b/docs/runtime/bun-apis.mdx index aeb5934703..9e138686f7 100644 --- a/docs/runtime/bun-apis.mdx +++ b/docs/runtime/bun-apis.mdx @@ -55,5 +55,5 @@ Click the link in the right column to jump to the associated documentation. | Stream Processing | [`Bun.readableStreamTo*()`](/runtime/utils#bun-readablestreamto), `Bun.readableStreamToBytes()`, `Bun.readableStreamToBlob()`, `Bun.readableStreamToFormData()`, `Bun.readableStreamToJSON()`, `Bun.readableStreamToArray()` | | Memory & Buffer Management | `Bun.ArrayBufferSink`, `Bun.allocUnsafe`, `Bun.concatArrayBuffers` | | Module Resolution | [`Bun.resolveSync()`](/runtime/utils#bun-resolvesync) | -| Parsing & Formatting | [`Bun.semver`](/runtime/semver), `Bun.TOML.parse`, [`Bun.color`](/runtime/color) | +| Parsing & Formatting | [`Bun.semver`](/runtime/semver), `Bun.TOML.parse`, [`Bun.markdown`](/runtime/markdown), [`Bun.color`](/runtime/color) | | Low-level / Internals | `Bun.mmap`, `Bun.gc`, `Bun.generateHeapSnapshot`, [`bun:jsc`](https://bun.com/reference/bun/jsc) | diff --git a/docs/runtime/markdown.mdx b/docs/runtime/markdown.mdx new file mode 100644 index 0000000000..c7490cb6c7 --- /dev/null +++ b/docs/runtime/markdown.mdx @@ -0,0 +1,344 @@ +--- +title: Markdown +description: Parse and render Markdown with Bun's built-in Markdown API, supporting GFM extensions and custom rendering callbacks +--- + +{% callout type="note" %} +**Unstable API** — This API is under active development and may change in future versions of Bun. +{% /callout %} + +Bun includes a fast, built-in Markdown parser written in Zig. 
It supports GitHub Flavored Markdown (GFM) extensions and provides three APIs: + +- `Bun.markdown.html()` — render Markdown to an HTML string +- `Bun.markdown.render()` — render Markdown with custom callbacks for each element +- `Bun.markdown.react()` — render Markdown to React JSX elements + +--- + +## `Bun.markdown.html()` + +Convert a Markdown string to HTML. + +```ts +const html = Bun.markdown.html("# Hello **world**"); +// "

<h1>Hello <strong>world</strong></h1>
\n" +``` + +GFM extensions like tables, strikethrough, and task lists are enabled by default: + +```ts +const html = Bun.markdown.html(` +| Feature | Status | +|-------------|--------| +| Tables | ~~done~~ | +| Strikethrough| ~~done~~ | +| Task lists | done | +`); +``` + +### Options + +Pass an options object as the second argument to configure the parser: + +```ts +const html = Bun.markdown.html("some markdown", { + tables: true, // GFM tables (default: true) + strikethrough: true, // GFM strikethrough (default: true) + tasklists: true, // GFM task lists (default: true) + tagFilter: true, // GFM tag filter for disallowed HTML tags + autolinks: true, // Autolink URLs, emails, and www. links +}); +``` + +All available options: + +| Option | Default | Description | +| ---------------------- | ------- | ----------------------------------------------------------- | +| `tables` | `false` | GFM tables | +| `strikethrough` | `false` | GFM strikethrough (`~~text~~`) | +| `tasklists` | `false` | GFM task lists (`- [x] item`) | +| `autolinks` | `false` | Enable autolinks — see [Autolinks](#autolinks) | +| `headings` | `false` | Heading IDs and autolinks — see [Heading IDs](#heading-ids) | +| `hardSoftBreaks` | `false` | Treat soft line breaks as hard breaks | +| `wikiLinks` | `false` | Enable `[[wiki links]]` | +| `underline` | `false` | `__text__` renders as `` instead of `` | +| `latexMath` | `false` | Enable `$inline$` and `$$display$$` math | +| `collapseWhitespace` | `false` | Collapse whitespace in text | +| `permissiveAtxHeaders` | `false` | ATX headers without space after `#` | +| `noIndentedCodeBlocks` | `false` | Disable indented code blocks | +| `noHtmlBlocks` | `false` | Disable HTML blocks | +| `noHtmlSpans` | `false` | Disable inline HTML | +| `tagFilter` | `false` | GFM tag filter for disallowed HTML tags | + +#### Autolinks + +Pass `true` to enable all autolink types, or an object for granular control: + +```ts +// Enable all autolinks (URL, WWW, email) 
+Bun.markdown.html("Visit www.example.com", { autolinks: true }); + +// Enable only specific types +Bun.markdown.html("Visit www.example.com", { + autolinks: { url: true, www: true }, +}); +``` + +#### Heading IDs + +Pass `true` to enable both heading IDs and autolink headings, or an object for granular control: + +```ts +// Enable heading IDs and autolink headings +Bun.markdown.html("## Hello World", { headings: true }); +// '

<h2 id="hello-world"><a href="#hello-world">Hello World</a></h2>
\n' + +// Enable only heading IDs (no autolink) +Bun.markdown.html("## Hello World", { headings: { ids: true } }); +// '

<h2 id="hello-world">Hello World</h2>
\n' +``` + +--- + +## `Bun.markdown.render()` + +Parse Markdown and render it using custom JavaScript callbacks. This gives you full control over the output format — you can generate HTML with custom classes, React elements, ANSI terminal output, or any other string format. + +```ts +const result = Bun.markdown.render("# Hello **world**", { + heading: (children, { level }) => `${children}`, + strong: children => `${children}`, + paragraph: children => `

<p>${children}</p>
`, +}); +// '

<h1 class="title">Hello <b>world</b></h1>
' +``` + +### Callback signature + +Each callback receives: + +1. **`children`** — the accumulated content of the element as a string +2. **`meta`** (optional) — an object with element-specific metadata + +Return a string to replace the element's rendering. Return `null` or `undefined` to omit the element from the output entirely. If no callback is registered for an element, its children pass through unchanged. + +### Block callbacks + +| Callback | Meta | Description | +| ------------ | ------------------------------------------- | ---------------------------------------------------------------------------------------- | +| `heading` | `{ level: number, id?: string }` | Heading level 1–6. `id` is set when `headings: { ids: true }` is enabled | +| `paragraph` | — | Paragraph block | +| `blockquote` | — | Blockquote block | +| `code` | `{ language?: string }` | Fenced or indented code block. `language` is the info-string when specified on the fence | +| `list` | `{ ordered: boolean, start?: number }` | Ordered or unordered list. `start` is the start number for ordered lists | +| `listItem` | `{ checked?: boolean }` | List item. `checked` is set for task list items (`- [x]` / `- [ ]`) | +| `hr` | — | Horizontal rule | +| `table` | — | Table block | +| `thead` | — | Table head | +| `tbody` | — | Table body | +| `tr` | — | Table row | +| `th` | `{ align?: "left" \| "center" \| "right" }` | Table header cell. `align` is set when alignment is specified | +| `td` | `{ align?: "left" \| "center" \| "right" }` | Table data cell. 
`align` is set when alignment is specified | +| `html` | — | Raw HTML content | + +### Inline callbacks + +| Callback | Meta | Description | +| --------------- | ---------------------------------- | ---------------------------- | +| `strong` | — | Strong emphasis (`**text**`) | +| `emphasis` | — | Emphasis (`*text*`) | +| `link` | `{ href: string, title?: string }` | Link | +| `image` | `{ src: string, title?: string }` | Image | +| `codespan` | — | Inline code (`` `code` ``) | +| `strikethrough` | — | Strikethrough (`~~text~~`) | +| `text` | — | Plain text content | + +### Examples + +#### Custom HTML with classes + +```ts +const html = Bun.markdown.render("# Title\n\nHello **world**", { + heading: (children, { level }) => `${children}`, + paragraph: children => `

<p>${children}</p>
`, + strong: children => `${children}`, +}); +``` + +#### Stripping all formatting + +```ts +const plaintext = Bun.markdown.render("# Hello **world**", { + heading: children => children, + paragraph: children => children, + strong: children => children, + emphasis: children => children, + link: children => children, + image: () => "", + code: children => children, + codespan: children => children, +}); +// "Hello world" +``` + +#### Omitting elements + +Return `null` or `undefined` to remove an element from the output: + +```ts +const result = Bun.markdown.render("# Title\n\n![logo](img.png)\n\nHello", { + image: () => null, // Remove all images + heading: children => children, + paragraph: children => children + "\n", +}); +// "Title\nHello\n" +``` + +#### ANSI terminal output + +```ts +const ansi = Bun.markdown.render("# Hello\n\nThis is **bold** and *italic*", { + heading: (children, { level }) => `\x1b[1;4m${children}\x1b[0m\n`, + paragraph: children => children + "\n", + strong: children => `\x1b[1m${children}\x1b[22m`, + emphasis: children => `\x1b[3m${children}\x1b[23m`, +}); +``` + +#### Code block syntax highlighting + +````ts +const result = Bun.markdown.render("```js\nconsole.log('hi')\n```", { + code: (children, meta) => { + const lang = meta?.language ?? ""; + return `
<pre><code class="language-${lang}">${children}</code></pre>
`; + }, +}); +```` + +### Parser options + +Parser options are passed as a separate third argument: + +```ts +const result = Bun.markdown.render( + "Visit www.example.com", + { + link: (children, { href }) => `[${children}](${href})`, + paragraph: children => children, + }, + { autolinks: true }, +); +``` + +--- + +## `Bun.markdown.react()` + +Render Markdown directly to React elements. Returns a `` that you can use as a component return value. + +```tsx +function Markdown({ text }: { text: string }) { + return Bun.markdown.react(text); +} +``` + +### Server-side rendering + +Works with `renderToString()` and React Server Components: + +```tsx +import { renderToString } from "react-dom/server"; + +const html = renderToString(Bun.markdown.react("# Hello **world**")); +// "

<h1>Hello <strong>world</strong></h1>
" +``` + +### Component overrides + +Replace any HTML element with a custom React component by passing it in the second argument, keyed by tag name: + +```tsx +function Code({ language, children }) { + return ( +
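    <pre data-language={language}>
+      <code>{children}</code>
+    </pre>
+  );
+}
+
+function Link({ href, title, children }) {
+  return (
+    <a href={href} title={title}>
+      {children}
+    </a>
+  );
+}
+
+function Heading({ id, children }) {
+  return (
+    <h2 id={id}>
+      <a href={`#${id}`}>{children}</a>
+    </h2>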
+ ); +} + +const el = Bun.markdown.react( + content, + { + pre: Code, + a: Link, + h2: Heading, + }, + { headings: { ids: true } }, +); +``` + +#### Available overrides + +Every HTML tag produced by the parser can be overridden: + +| Option | Props | Description | +| ------------ | ---------------------------- | --------------------------------------------------------------- | +| `h1`–`h6` | `{ id?, children }` | Headings. `id` is set when `headings: { ids: true }` is enabled | +| `p` | `{ children }` | Paragraph | +| `blockquote` | `{ children }` | Blockquote | +| `pre` | `{ language?, children }` | Code block. `language` is the info string (e.g. `"js"`) | +| `hr` | `{}` | Horizontal rule (no children) | +| `ul` | `{ children }` | Unordered list | +| `ol` | `{ start, children }` | Ordered list. `start` is the first item number | +| `li` | `{ checked?, children }` | List item. `checked` is set for task list items | +| `table` | `{ children }` | Table | +| `thead` | `{ children }` | Table head | +| `tbody` | `{ children }` | Table body | +| `tr` | `{ children }` | Table row | +| `th` | `{ align?, children }` | Table header cell | +| `td` | `{ align?, children }` | Table data cell | +| `em` | `{ children }` | Emphasis (`*text*`) | +| `strong` | `{ children }` | Strong (`**text**`) | +| `a` | `{ href, title?, children }` | Link | +| `img` | `{ src, alt?, title? }` | Image (no children) | +| `code` | `{ children }` | Inline code | +| `del` | `{ children }` | Strikethrough (`~~text~~`) | +| `br` | `{}` | Hard line break (no children) | + +### React 18 and older + +By default, elements use `Symbol.for('react.transitional.element')` as the `$$typeof` symbol. 
For React 18 and older, pass `reactVersion: 18` in the options (third argument): + +```tsx +function Markdown({ text }: { text: string }) { + return Bun.markdown.react(text, undefined, { reactVersion: 18 }); +} +``` + +### Parser options + +All [parser options](#options) are passed as the third argument: + +```tsx +const el = Bun.markdown.react("## Hello World", undefined, { + headings: { ids: true }, + autolinks: true, +}); +``` diff --git a/packages/bun-types/bun.d.ts b/packages/bun-types/bun.d.ts index 6098ea4db1..59c93bc82a 100644 --- a/packages/bun-types/bun.d.ts +++ b/packages/bun-types/bun.d.ts @@ -905,6 +905,417 @@ declare module "bun" { export function stringify(input: unknown, replacer?: undefined | null, space?: string | number): string; } + /** + * Markdown related APIs. + * + * Provides fast markdown parsing and rendering with three output modes: + * - `html()` — render to an HTML string + * - `render()` — render with custom callbacks for each element + * - `react()` — parse to React-compatible JSX elements + * + * Supports GFM extensions (tables, strikethrough, task lists, autolinks) and + * component overrides to replace default HTML tags with custom components. + * + * @example + * ```tsx + * // Render markdown to HTML + * const html = Bun.markdown.html("# Hello **world**"); + * // "

<h1>Hello <strong>world</strong></h1>
\n" + * + * // Render with custom callbacks + * const ansi = Bun.markdown.render("# Hello **world**", { + * heading: (children, { level }) => `\x1b[1m${children}\x1b[0m\n`, + * strong: (children) => `\x1b[1m${children}\x1b[22m`, + * paragraph: (children) => children + "\n", + * }); + * + * // Render as a React component + * function Markdown({ text }: { text: string }) { + * return Bun.markdown.react(text); + * } + * + * // With component overrides + * const element = Bun.markdown.react("# Hello", { h1: MyHeadingComponent }); + * ``` + */ + namespace markdown { + /** + * Options for configuring the markdown parser. + * + * By default, GFM extensions (tables, strikethrough, task lists) are enabled. + */ + interface Options { + /** Enable GFM tables. Default: `true`. */ + tables?: boolean; + /** Enable GFM strikethrough (`~~text~~`). Default: `true`. */ + strikethrough?: boolean; + /** Enable GFM task lists (`- [x] item`). Default: `true`. */ + tasklists?: boolean; + /** Treat soft line breaks as hard line breaks. Default: `false`. */ + hardSoftBreaks?: boolean; + /** Enable wiki-style links (`[[target]]` or `[[target|label]]`). Default: `false`. */ + wikiLinks?: boolean; + /** Enable underline syntax (`__text__` renders as `` instead of ``). Default: `false`. */ + underline?: boolean; + /** Enable LaTeX math (`$inline$` and `$$display$$`). Default: `false`. */ + latexMath?: boolean; + /** Collapse whitespace in text content. Default: `false`. */ + collapseWhitespace?: boolean; + /** Allow ATX headers without a space after `#`. Default: `false`. */ + permissiveAtxHeaders?: boolean; + /** Disable indented code blocks. Default: `false`. */ + noIndentedCodeBlocks?: boolean; + /** Disable HTML blocks. Default: `false`. */ + noHtmlBlocks?: boolean; + /** Disable inline HTML spans. Default: `false`. */ + noHtmlSpans?: boolean; + /** + * Enable the GFM tag filter, which replaces `<` with `<` for disallowed + * HTML tags (e.g. 
`, , , (case insensitive) + if (self.text[pos] == '<' and pos + 1 < self.size and self.text[pos + 1] == '/') { + if (self.matchHtmlTag(pos, "script") or self.matchHtmlTag(pos, "pre") or + self.matchHtmlTag(pos, "style") or self.matchHtmlTag(pos, "textarea")) + return true; + } + }, + 2 => { + // Type 2: --> + if (self.text[pos] == '-' and pos + 2 < self.size and + self.text[pos + 1] == '-' and self.text[pos + 2] == '>') + return true; + }, + 3 => { + // Type 3: ?> + if (self.text[pos] == '?' and pos + 1 < self.size and self.text[pos + 1] == '>') + return true; + }, + 4 => { + // Type 4: > + if (self.text[pos] == '>') + return true; + }, + 5 => { + // Type 5: ]]> + if (self.text[pos] == ']' and pos + 2 < self.size and + self.text[pos + 1] == ']' and self.text[pos + 2] == '>') + return true; + }, + else => return false, + } + pos += 1; + } + return false; +} + +pub fn matchHtmlTag(self: *const Parser, off: OFF, tag: []const u8) bool { + if (off + 1 + tag.len >= self.size) return false; + const start = off + 1; + // Allow optional / for closing tags + var pos = start; + if (pos < self.size and self.text[pos] == '/') pos += 1; + if (pos + tag.len > self.size) return false; + if (!helpers.asciiCaseEql(self.text[pos .. 
pos + tag.len], tag)) return false; + pos += @intCast(tag.len); + if (pos >= self.size) return true; + const after = self.text[pos]; + return after == '>' or after == '/' or helpers.isBlank(after) or helpers.isNewline(after); +} + +pub fn isBlockLevelHtmlTag(self: *const Parser, off: OFF) bool { + const block_tags = [_][]const u8{ + "address", "article", "aside", "base", "basefont", "blockquote", "body", + "caption", "center", "col", "colgroup", "dd", "details", "dialog", + "dir", "div", "dl", "dt", "fieldset", "figcaption", "figure", + "footer", "form", "frame", "frameset", "h1", "h2", "h3", + "h4", "h5", "h6", "head", "header", "hr", "html", + "iframe", "legend", "li", "link", "main", "menu", "menuitem", + "nav", "noframes", "ol", "optgroup", "option", "p", "param", + "search", "section", "summary", "table", "tbody", "td", "tfoot", + "th", "thead", "title", "tr", "track", "ul", + }; + + for (block_tags) |tag| { + if (self.matchHtmlTag(off, tag)) return true; + } + return false; +} + +pub fn isCompleteHtmlTag(self: *const Parser, off: OFF) bool { + if (off + 1 >= self.size) return false; + var pos = off + 1; + + // Closing tag + if (pos < self.size and self.text[pos] == '/') { + pos += 1; + if (pos >= self.size or !helpers.isAlpha(self.text[pos])) return false; + while (pos < self.size and (helpers.isAlphaNum(self.text[pos]) or self.text[pos] == '-')) + pos += 1; + while (pos < self.size and helpers.isBlank(self.text[pos])) pos += 1; + if (pos >= self.size or self.text[pos] != '>') return false; + pos += 1; + // Rest of line must be whitespace only + while (pos < self.size and helpers.isBlank(self.text[pos])) pos += 1; + return pos >= self.size or helpers.isNewline(self.text[pos]); + } + + // Opening tag: + if (!helpers.isAlpha(self.text[pos])) return false; + while (pos < self.size and (helpers.isAlphaNum(self.text[pos]) or self.text[pos] == '-')) + pos += 1; + + // Parse attributes + while (true) { + const ws_start = pos; + while (pos < self.size and 
helpers.isBlank(self.text[pos])) pos += 1; + if (pos >= self.size or helpers.isNewline(self.text[pos])) return false; + + // Check for end of tag + if (self.text[pos] == '>') { + pos += 1; + break; + } + if (self.text[pos] == '/' and pos + 1 < self.size and self.text[pos + 1] == '>') { + pos += 2; + break; + } + + // Attributes must be preceded by whitespace + if (pos == ws_start) return false; + + // Attribute name: [a-zA-Z_:][a-zA-Z0-9_.:-]* + if (!helpers.isAlpha(self.text[pos]) and self.text[pos] != '_' and self.text[pos] != ':') + return false; + pos += 1; + while (pos < self.size and (helpers.isAlphaNum(self.text[pos]) or + self.text[pos] == '_' or self.text[pos] == '.' or + self.text[pos] == ':' or self.text[pos] == '-')) + pos += 1; + + // Optional attribute value + var ws_pos = pos; + while (ws_pos < self.size and helpers.isBlank(self.text[ws_pos])) ws_pos += 1; + if (ws_pos < self.size and self.text[ws_pos] == '=') { + pos = ws_pos + 1; + while (pos < self.size and helpers.isBlank(self.text[pos])) pos += 1; + if (pos >= self.size or helpers.isNewline(self.text[pos])) return false; + + if (self.text[pos] == '"') { + pos += 1; + while (pos < self.size and self.text[pos] != '"' and !helpers.isNewline(self.text[pos])) + pos += 1; + if (pos >= self.size or self.text[pos] != '"') return false; + pos += 1; + } else if (self.text[pos] == '\'') { + pos += 1; + while (pos < self.size and self.text[pos] != '\'' and !helpers.isNewline(self.text[pos])) + pos += 1; + if (pos >= self.size or self.text[pos] != '\'') return false; + pos += 1; + } else { + // Unquoted value + while (pos < self.size and !helpers.isBlank(self.text[pos]) and + !helpers.isNewline(self.text[pos]) and + self.text[pos] != '"' and self.text[pos] != '\'' and + self.text[pos] != '=' and self.text[pos] != '<' and + self.text[pos] != '>' and self.text[pos] != '`') + pos += 1; + } + } + } + + // Rest of line must be whitespace only + while (pos < self.size and helpers.isBlank(self.text[pos])) pos += 1; 
+ return pos >= self.size or helpers.isNewline(self.text[pos]); +} + +pub fn isTableUnderline(self: *Parser, off: OFF) struct { is_underline: bool, col_count: u32 } { + var pos = off; + var col_count: u32 = 0; + var had_pipe = false; + + // Skip leading pipe + if (pos < self.size and self.text[pos] == '|') { + had_pipe = true; + pos += 1; + while (pos < self.size and helpers.isBlank(self.text[pos])) pos += 1; + } + + while (pos < self.size and !helpers.isNewline(self.text[pos])) { + // Expect optional ':' then dashes then optional ':' + const has_left_colon = pos < self.size and self.text[pos] == ':'; + if (has_left_colon) pos += 1; + + var dash_count: u32 = 0; + while (pos < self.size and self.text[pos] == '-') { + dash_count += 1; + pos += 1; + } + + if (dash_count == 0) return .{ .is_underline = false, .col_count = 0 }; + + const has_right_colon = pos < self.size and self.text[pos] == ':'; + if (has_right_colon) pos += 1; + + // Determine alignment + if (col_count < types.TABLE_MAXCOLCOUNT) { + self.table_alignments[col_count] = if (has_left_colon and has_right_colon) + .center + else if (has_left_colon) + .left + else if (has_right_colon) + .right + else + .default; + } + + col_count += 1; + + // Skip whitespace + while (pos < self.size and helpers.isBlank(self.text[pos])) pos += 1; + + // Pipe separator or end + if (pos < self.size and self.text[pos] == '|') { + had_pipe = true; + pos += 1; + while (pos < self.size and helpers.isBlank(self.text[pos])) pos += 1; + if (pos >= self.size or helpers.isNewline(self.text[pos])) break; + } else if (pos >= self.size or helpers.isNewline(self.text[pos])) { + break; + } else { + return .{ .is_underline = false, .col_count = 0 }; + } + } + + if (col_count == 0 or (!had_pipe and col_count < 2)) + return .{ .is_underline = false, .col_count = 0 }; + + self.table_col_count = col_count; + return .{ .is_underline = true, .col_count = col_count }; +} + +/// Count the number of pipe-delimited columns in a table row. 
+/// Used to validate that header and delimiter row column counts match (GFM requirement). +pub fn countTableRowColumns(self: *const Parser, beg: OFF, end: OFF) u32 { + const row = self.text[beg..end]; + var col_count: u32 = 0; + var pos: usize = 0; + + // Skip leading whitespace + while (pos < row.len and helpers.isBlank(row[pos])) pos += 1; + + // Skip leading pipe + if (pos < row.len and row[pos] == '|') { + pos += 1; + } + + // Count cells between pipes + var in_cell = false; + while (pos < row.len) { + if (row[pos] == '|') { + col_count += 1; + in_cell = false; + pos += 1; + } else if (row[pos] == '\\' and pos + 1 < row.len) { + in_cell = true; + pos += 2; + } else if (helpers.isNewline(row[pos])) { + break; + } else { + in_cell = true; + pos += 1; + } + } + // If there's content after the last pipe (no trailing pipe), count it as a column + if (in_cell) col_count += 1; + return col_count; +} + +pub fn isContainerMark(self: *const Parser, indent: u32, off: OFF) struct { + is_container: bool, + container: Container, + off: OFF, +} { + if (off >= self.size) return .{ .is_container = false, .container = .{}, .off = off }; + + // md4c: indent >= code_indent_offset means this is indented code, not a container + if (indent >= self.code_indent_offset) return .{ .is_container = false, .container = .{}, .off = off }; + + const c = self.text[off]; + + // Blockquote + // Note: off points just past '>' — the optional space and remaining + // indent are handled by the caller via lineIndentation + the + // whitespace adjustment logic, matching md4c's behavior. + if (c == '>') { + return .{ + .is_container = true, + .container = .{ + .ch = '>', + .mark_indent = indent, + .contents_indent = indent + 1, + }, + .off = off + 1, + }; + } + + // Unordered list: -, +, * + // off points just past the marker (before the mandatory space). + // The space is included in the lineIndentation computation by the caller. 
+ if ((c == '-' or c == '+' or c == '*') and + off + 1 < self.size and helpers.isBlank(self.text[off + 1])) + { + return .{ + .is_container = true, + .container = .{ + .ch = c, + .mark_indent = indent, + .contents_indent = indent + 1, + }, + .off = off + 1, + }; + } + // Empty unordered list item: marker followed by newline or EOF + if ((c == '-' or c == '+' or c == '*') and + (off + 1 >= self.size or helpers.isNewline(self.text[off + 1]))) + { + return .{ + .is_container = true, + .container = .{ + .ch = c, + .mark_indent = indent, + .contents_indent = indent + 1, + }, + .off = off + 1, + }; + } + + // Ordered list: digits followed by . or ) + if (helpers.isDigit(c)) { + var pos = off; + var num: u32 = 0; + while (pos < self.size and helpers.isDigit(self.text[pos]) and pos - off < 9) { + num = num * 10 + @as(u32, self.text[pos] - '0'); + pos += 1; + } + if (pos < self.size and (self.text[pos] == '.' or self.text[pos] == ')')) { + const delim = self.text[pos]; + pos += 1; // Past delimiter + if (pos < self.size and helpers.isBlank(self.text[pos])) { + // contents_indent = indent + marker_width (digits + delimiter) + const mark_width = pos - off; + return .{ + .is_container = true, + .container = .{ + .ch = delim, + .start = num, + .mark_indent = indent, + .contents_indent = indent + @as(u32, @intCast(mark_width)), + }, + .off = pos, + }; + } + // Empty list item + if (pos >= self.size or helpers.isNewline(self.text[pos])) { + const mark_width = pos - off; + return .{ + .is_container = true, + .container = .{ + .ch = delim, + .start = num, + .mark_indent = indent, + .contents_indent = indent + @as(u32, @intCast(mark_width)), + }, + .off = pos, + }; + } + } + } + + return .{ .is_container = false, .container = .{}, .off = off }; +} + +const helpers = @import("./helpers.zig"); +const std = @import("std"); + +const parser_mod = @import("./parser.zig"); +const Parser = parser_mod.Parser; + +const types = @import("./types.zig"); +const Attribute = types.Attribute; +const 
Container = types.Container; +const OFF = types.OFF; diff --git a/src/md/links.zig b/src/md/links.zig new file mode 100644 index 0000000000..89143a7944 --- /dev/null +++ b/src/md/links.zig @@ -0,0 +1,527 @@ +pub fn processLink(self: *Parser, content: []const u8, start: usize, base_off: OFF, is_image: bool) Parser.Error!?usize { + _ = base_off; + // start points at '[' + // Find matching ']', skipping code spans and HTML tags (which take precedence) + var pos = start + 1; + var bracket_depth: u32 = 1; + var has_inner_bracket = false; + while (pos < content.len and bracket_depth > 0) { + if (content[pos] == '\\' and pos + 1 < content.len) { + pos += 2; + continue; + } + // Skip code spans — they take precedence over brackets (CommonMark §6.3) + if (content[pos] == '`') { + const count = inlines_mod.countBackticks(content, pos); + if (self.findCodeSpanEnd(content, pos + count, count)) |end_pos| { + pos = end_pos + count; + continue; + } + } + // Skip HTML tags and autolinks — they take precedence over brackets + if (content[pos] == '<' and !self.flags.no_html_spans) { + if (self.findHtmlTag(content, pos)) |tag_end| { + pos = tag_end; + continue; + } + if (self.findAutolink(content, pos)) |autolink| { + pos = autolink.end_pos; + continue; + } + } + if (content[pos] == '[') { + bracket_depth += 1; + has_inner_bracket = true; + } + if (content[pos] == ']') bracket_depth -= 1; + if (bracket_depth > 0) pos += 1; + } + + if (bracket_depth != 0) return null; + + const label_end = pos; + const label = content[start + 1 .. 
label_end]; + pos += 1; // skip ']' + + // Inline link: [text](url "title") + if (pos < content.len and content[pos] == '(') { + pos += 1; + // Skip whitespace (including newlines from merged paragraph lines) + while (pos < content.len and (helpers.isBlank(content[pos]) or content[pos] == '\n' or content[pos] == '\r')) pos += 1; + + // Parse destination + var dest_start = pos; + var dest_end = pos; + + if (pos < content.len and content[pos] == '<') { + // Angle-bracket destination (no newlines allowed) + dest_start = pos + 1; + pos += 1; + var angle_valid = true; + while (pos < content.len and content[pos] != '>') { + if (content[pos] == '\n' or content[pos] == '\r') { + angle_valid = false; + break; + } + if (content[pos] == '\\' and pos + 1 < content.len) { + pos += 2; + } else { + pos += 1; + } + } + if (!angle_valid) return null; + dest_end = pos; + if (pos < content.len) pos += 1; // skip > + } else { + // Bare destination — balance parentheses + var paren_depth: u32 = 0; + while (pos < content.len and !helpers.isWhitespace(content[pos])) { + if (content[pos] == '(') { + paren_depth += 1; + } else if (content[pos] == ')') { + if (paren_depth == 0) break; + paren_depth -= 1; + } + if (content[pos] == '\\' and pos + 1 < content.len) { + pos += 2; + } else { + pos += 1; + } + } + dest_end = pos; + } + + // Skip whitespace (including newlines) + while (pos < content.len and (helpers.isBlank(content[pos]) or content[pos] == '\n' or content[pos] == '\r')) pos += 1; + + // Optional title + var title: []const u8 = ""; + if (pos < content.len and (content[pos] == '"' or content[pos] == '\'' or content[pos] == '(')) { + const close_char: u8 = if (content[pos] == '(') ')' else content[pos]; + pos += 1; + const title_start = pos; + while (pos < content.len and content[pos] != close_char) { + if (content[pos] == '\\' and pos + 1 < content.len) { + pos += 2; + } else { + pos += 1; + } + } + title = content[title_start..pos]; + if (pos < content.len) pos += 1; // skip 
closing quote + } + + // Skip whitespace (including newlines) + while (pos < content.len and (helpers.isBlank(content[pos]) or content[pos] == '\n' or content[pos] == '\r')) pos += 1; + + // Must end with ')' + if (pos < content.len and content[pos] == ')') { + pos += 1; + const dest = content[dest_start..dest_end]; + + // Link nesting prohibition: links cannot contain other links (CommonMark §6.7) + if (!is_image and has_inner_bracket and self.labelContainsLink(label)) { + return null; + } + + if (self.image_nesting_level > 0) { + // Inside image alt text — emit only text, no HTML tags + try self.processInlineContent(label, 0); + } else if (is_image) { + try self.renderer.enterSpan(.img, .{ .href = dest, .title = title }); + self.image_nesting_level += 1; + try self.processInlineContent(label, 0); + self.image_nesting_level -= 1; + try self.renderer.leaveSpan(.img); + } else { + try self.renderer.enterSpan(.a, .{ .href = dest, .title = title }); + self.link_nesting_level += 1; + try self.processInlineContent(label, 0); + self.link_nesting_level -= 1; + try self.renderer.leaveSpan(.a); + } + + return pos; + } + } + + // Reference link: [text][ref] or [text][] or shortcut [text] + if (pos < content.len and content[pos] == '[') { + const bracket_pos = pos; + pos += 1; + const ref_start = pos; + while (pos < content.len and content[pos] != ']') { + if (content[pos] == '[') break; // nested [ not allowed in ref + if (content[pos] == '\\' and pos + 1 < content.len) { + pos += 2; + } else { + pos += 1; + } + } + if (pos < content.len and content[pos] == ']') { + const ref_label = if (pos > ref_start) content[ref_start..pos] else label; + pos += 1; + if (self.lookupRefDef(ref_label)) |ref_def| { + // Link nesting prohibition + if (!is_image and has_inner_bracket and self.labelContainsLink(label)) { + return null; + } + try self.renderRefLink(label, ref_def, is_image); + return pos; + } + } else { + // Reset pos if we didn't find a valid ] + pos = bracket_pos; + } + } + + 
// Shortcut reference link: [text] (no following [) + // Per CommonMark spec, shortcut refs must NOT be followed by [ + // Note: if followed by ( and inline link parsing failed above, still try shortcut + const char_after_label: u8 = if (label_end + 1 < content.len) content[label_end + 1] else 0; + if (char_after_label != '[') { + if (self.lookupRefDef(label)) |ref_def| { + // Link nesting prohibition + if (!is_image and has_inner_bracket and self.labelContainsLink(label)) { + return null; + } + try self.renderRefLink(label, ref_def, is_image); + return label_end + 1; + } + } + + return null; +} + +/// Try to match a bracket pair starting at `start` and check if it forms a link. +/// Returns whether it's a link, where the label ends, and the full link end position. +pub fn tryMatchBracketLink(self: *Parser, content: []const u8, start: usize) struct { is_link: bool, label_end: usize, link_end: usize } { + var pos = start + 1; + var depth: u32 = 1; + while (pos < content.len and depth > 0) { + if (content[pos] == '\\' and pos + 1 < content.len) { + pos += 2; + continue; + } + if (content[pos] == '`') { + const count = inlines_mod.countBackticks(content, pos); + if (self.findCodeSpanEnd(content, pos + count, count)) |end_pos| { + pos = end_pos + count; + continue; + } + } + if (content[pos] == '<' and !self.flags.no_html_spans) { + if (self.findHtmlTag(content, pos)) |tag_end| { + pos = tag_end; + continue; + } + if (self.findAutolink(content, pos)) |al| { + pos = al.end_pos; + continue; + } + } + if (content[pos] == '[') depth += 1; + if (content[pos] == ']') depth -= 1; + if (depth > 0) pos += 1; + } + if (depth != 0) return .{ .is_link = false, .label_end = 0, .link_end = 0 }; + + const label_end = pos; + pos += 1; // skip ] + + if (pos >= content.len) { + // Shortcut reference check + const inner_label = content[start + 1 .. 
label_end]; + const is_ref = self.lookupRefDef(inner_label) != null; + return .{ .is_link = is_ref, .label_end = label_end, .link_end = label_end + 1 }; + } + + // Inline link: ](...) + if (content[pos] == '(') { + var p = pos + 1; + // Skip whitespace + while (p < content.len and (helpers.isBlank(content[p]) or content[p] == '\n' or content[p] == '\r')) p += 1; + // Parse dest + if (p < content.len and content[p] == '<') { + p += 1; + while (p < content.len and content[p] != '>' and content[p] != '\n') { + if (content[p] == '\\' and p + 1 < content.len) { + p += 2; + } else { + p += 1; + } + } + if (p < content.len and content[p] == '>') p += 1 else return .{ .is_link = false, .label_end = label_end, .link_end = label_end + 1 }; + } else { + var paren_depth: u32 = 0; + while (p < content.len and !helpers.isWhitespace(content[p])) { + if (content[p] == '(') { + paren_depth += 1; + } else if (content[p] == ')') { + if (paren_depth == 0) break; + paren_depth -= 1; + } + if (content[p] == '\\' and p + 1 < content.len) { + p += 2; + } else { + p += 1; + } + } + } + // Skip whitespace + while (p < content.len and (helpers.isBlank(content[p]) or content[p] == '\n' or content[p] == '\r')) p += 1; + // Optional title + if (p < content.len and (content[p] == '"' or content[p] == '\'' or content[p] == '(')) { + const close_ch: u8 = if (content[p] == '(') ')' else content[p]; + p += 1; + while (p < content.len and content[p] != close_ch) { + if (content[p] == '\\' and p + 1 < content.len) { + p += 2; + } else { + p += 1; + } + } + if (p < content.len) p += 1; + } + // Skip whitespace + while (p < content.len and (helpers.isBlank(content[p]) or content[p] == '\n' or content[p] == '\r')) p += 1; + if (p < content.len and content[p] == ')') { + return .{ .is_link = true, .label_end = label_end, .link_end = p + 1 }; + } + } + + // Reference link: ][...] 
+ if (content[pos] == '[') { + var p = pos + 1; + while (p < content.len and content[p] != ']') { + if (content[p] == '[') break; + if (content[p] == '\\' and p + 1 < content.len) { + p += 2; + } else { + p += 1; + } + } + if (p < content.len and content[p] == ']') { + const ref_label = if (p > pos + 1) content[pos + 1 .. p] else content[start + 1 .. label_end]; + if (self.lookupRefDef(ref_label) != null) return .{ .is_link = true, .label_end = label_end, .link_end = p + 1 }; + } + } + + // Shortcut reference + const inner_label = content[start + 1 .. label_end]; + if (self.lookupRefDef(inner_label) != null) return .{ .is_link = true, .label_end = label_end, .link_end = label_end + 1 }; + + return .{ .is_link = false, .label_end = label_end, .link_end = label_end + 1 }; +} + +/// Check if a link label contains an inner link construct. +/// Used to enforce the "links cannot contain other links" rule (CommonMark §6.7). +pub fn labelContainsLink(self: *Parser, label: []const u8) bool { + var pos: usize = 0; + while (pos < label.len) { + if (label[pos] == '\\' and pos + 1 < label.len) { + pos += 2; + continue; + } + // Skip code spans + if (label[pos] == '`') { + const count = inlines_mod.countBackticks(label, pos); + if (self.findCodeSpanEnd(label, pos + count, count)) |end_pos| { + pos = end_pos + count; + continue; + } + } + // Skip HTML tags and autolinks + if (label[pos] == '<' and !self.flags.no_html_spans) { + if (self.findHtmlTag(label, pos)) |tag_end| { + pos = tag_end; + continue; + } + if (self.findAutolink(label, pos)) |al| { + pos = al.end_pos; + continue; + } + } + if (label[pos] == '[') { + // Skip images (![...]) — images are allowed inside links + const is_inner_image = pos > 0 and label[pos - 1] == '!'; + // Try to find matching ] and check for link syntax + const inner = self.tryMatchBracketLink(label, pos); + if (inner.is_link and !is_inner_image) return true; + if (inner.link_end > pos) { + // Skip past entire construct (including (url) or [ref] 
for images) + pos = inner.link_end; + continue; + } + } + pos += 1; + } + return false; +} + +/// Process wiki link: [[destination]] or [[destination|label]] +pub fn processWikiLink(self: *Parser, content: []const u8, start: usize) Parser.Error!?usize { + // start points at first '[', next char is also '[' + var pos = start + 2; + + // Find closing ']]', checking for constraints + const inner_start = pos; + var pipe_pos: ?usize = null; + var bracket_depth: u32 = 0; + + while (pos < content.len) { + if (content[pos] == '\n' or content[pos] == '\r') { + return null; + } + if (content[pos] == '[') { + bracket_depth += 1; + } else if (content[pos] == ']') { + if (bracket_depth > 0) { + bracket_depth -= 1; + } else if (pos + 1 < content.len and content[pos + 1] == ']') { + break; + } else { + // Single ] without matching [, not a valid close + return null; + } + } else if (content[pos] == '|' and pipe_pos == null and bracket_depth == 0) { + pipe_pos = pos; + } + pos += 1; + } + + // Must end with ]] + if (pos >= content.len or content[pos] != ']') { + return null; + } + + const inner_end = pos; + + // Determine target and label + const target = if (pipe_pos) |pp| content[inner_start..pp] else content[inner_start..inner_end]; + const label = if (pipe_pos) |pp| content[pp + 1 .. inner_end] else content[inner_start..inner_end]; + + // Target must not exceed 100 bytes (target.len is a byte count, not a character count) + if (target.len > 100) { + return null; + } + + // Render the wikilink + try self.renderer.enterSpan(.wikilink, .{ .href = target }); + try self.processInlineContent(label, 0); + try self.renderer.leaveSpan(.wikilink); + + return pos + 2; // skip both ']' +} + +/// Render a reference link/image given the resolved ref def. 
+pub fn renderRefLink(self: *Parser, label_content: []const u8, ref: RefDef, is_image: bool) Parser.Error!void { + if (self.image_nesting_level > 0) { + // Inside image alt text: emit only text, no HTML tags + try self.processInlineContent(label_content, 0); + } else if (is_image) { + try self.renderer.enterSpan(.img, .{ .href = ref.dest, .title = ref.title }); + self.image_nesting_level += 1; + try self.processInlineContent(label_content, 0); + self.image_nesting_level -= 1; + try self.renderer.leaveSpan(.img); + } else { + try self.renderer.enterSpan(.a, .{ .href = ref.dest, .title = ref.title }); + self.link_nesting_level += 1; + try self.processInlineContent(label_content, 0); + self.link_nesting_level -= 1; + try self.renderer.leaveSpan(.a); + } +} + +pub fn findAutolink(self: *const Parser, content: []const u8, start: usize) ?struct { end_pos: usize, is_email: bool } { + _ = self; + if (start + 1 >= content.len) return null; + + const pos = start + 1; + + // Check for URI autolink: a 2-32 character scheme followed by ':' (no "//" is required) + if (helpers.isAlpha(content[pos])) { + var scheme_end = pos; + while (scheme_end < content.len and (helpers.isAlphaNum(content[scheme_end]) or + content[scheme_end] == '+' or content[scheme_end] == '-' or content[scheme_end] == '.')) + { + scheme_end += 1; + } + const scheme_len = scheme_end - pos; + if (scheme_len >= 2 and scheme_len <= 32 and scheme_end < content.len and content[scheme_end] == ':') { + // URI autolink + var uri_end = scheme_end + 1; + while (uri_end < content.len and content[uri_end] != '>' and !helpers.isWhitespace(content[uri_end])) { + uri_end += 1; + } + if (uri_end < content.len and content[uri_end] == '>') { + return .{ .end_pos = uri_end + 1, .is_email = false }; + } + } + + // Check for email autolink + var email_pos = pos; + // username part + while (email_pos < content.len and (helpers.isAlphaNum(content[email_pos]) or + content[email_pos] == '.' 
or content[email_pos] == '-' or + content[email_pos] == '_' or content[email_pos] == '+')) + { + email_pos += 1; + } + if (email_pos < content.len and content[email_pos] == '@' and email_pos > pos) { + email_pos += 1; + // domain part: labels separated by '.', each 1-63 chars, alphanumeric or hyphen + const domain_start = email_pos; + var label_len: u32 = 0; + var dot_count: u32 = 0; + var valid_domain = true; + while (email_pos < content.len and (helpers.isAlphaNum(content[email_pos]) or + content[email_pos] == '.' or content[email_pos] == '-')) + { + if (content[email_pos] == '.') { + if (label_len == 0) { + valid_domain = false; + break; + } + label_len = 0; + dot_count += 1; + } else { + label_len += 1; + if (label_len > 63) { + valid_domain = false; + break; + } + } + email_pos += 1; + } + if (valid_domain and email_pos < content.len and content[email_pos] == '>' and + email_pos > domain_start and label_len > 0 and dot_count > 0 and + helpers.isAlphaNum(content[email_pos - 1])) + { + return .{ .end_pos = email_pos + 1, .is_email = true }; + } + } + } + + return null; +} + +pub fn renderAutolink(self: *Parser, url: []const u8, is_email: bool) bun.JSError!void { + try self.renderer.enterSpan(.a, .{ .href = url, .autolink = true, .autolink_email = is_email }); + try self.emitText(.normal, url); + try self.renderer.leaveSpan(.a); +} + +const bun = @import("bun"); +const helpers = @import("./helpers.zig"); +const inlines_mod = @import("./inlines.zig"); + +const parser_mod = @import("./parser.zig"); +const Parser = parser_mod.Parser; + +const ref_defs_mod = @import("./ref_defs.zig"); +const RefDef = ref_defs_mod.RefDef; + +const types = @import("./types.zig"); +const OFF = types.OFF; diff --git a/src/md/parser.zig b/src/md/parser.zig new file mode 100644 index 0000000000..4ad649bfde --- /dev/null +++ b/src/md/parser.zig @@ -0,0 +1,285 @@ +// Sub-modules + +/// Parser context holding all state during parsing. 
+pub const Parser = struct { + allocator: Allocator, + text: []const u8, + size: OFF, + flags: Flags, + + // Output + renderer: Renderer, + image_nesting_level: u32 = 0, + link_nesting_level: u32 = 0, + + // Code indent offset: 4 normally, maxInt if no_indented_code_blocks + code_indent_offset: u32, + doc_ends_with_newline: bool, + + // Mark character map — bitset of characters that need special handling + mark_char_map: bun.bit_set.StaticBitSet(256) = bun.bit_set.StaticBitSet(256).initEmpty(), + + // Dynamic arrays + marks: std.ArrayListUnmanaged(Mark) = .{}, + containers: std.ArrayListUnmanaged(Container) = .{}, + block_bytes: std.ArrayListAlignedUnmanaged(u8, .@"4") = .{}, + buffer: std.ArrayListUnmanaged(u8) = .{}, + emph_delims: std.ArrayListUnmanaged(EmphDelim) = .{}, + + // Number of active containers + n_containers: u32 = 0, + + // Current block being built + current_block: ?usize = null, + current_block_lines: std.ArrayListUnmanaged(VerbatimLine) = .{}, + + // Opener stacks + opener_stacks: [types.NUM_OPENER_STACKS]types.OpenerStack = + [_]types.OpenerStack{.{}} ** types.NUM_OPENER_STACKS, + + // Linked lists through marks + unresolved_link_head: i32 = -1, + unresolved_link_tail: i32 = -1, + table_cell_boundaries_head: i32 = -1, + table_cell_boundaries_tail: i32 = -1, + + // HTML block tracking + html_block_type: u8 = 0, + // Fenced code block indent + fence_indent: u32 = 0, + + // Table column alignments + table_col_count: u32 = 0, + table_alignments: [types.TABLE_MAXCOLCOUNT]Align = [_]Align{.default} ** types.TABLE_MAXCOLCOUNT, + + // Ref defs + ref_defs: std.ArrayListUnmanaged(RefDef) = .{}, + + // State + last_line_has_list_loosening_effect: bool = false, + last_list_item_starts_with_two_blank_lines: bool = false, + max_ref_def_output: u64 = 0, + + // Stack overflow protection for recursive inline processing + stack_check: bun.StackCheck, + + pub const BlockHeader = extern struct { + block_type: BlockType, + _pad: [3]u8 = .{ 0, 0, 0 }, + flags: u32, + 
data: u32, + n_lines: u32, + }; + + pub const EmphDelim = inlines_mod.EmphDelim; + pub const MAX_EMPH_MATCHES = inlines_mod.MAX_EMPH_MATCHES; + pub const RefDef = ref_defs_mod.RefDef; + + pub const Error = bun.JSError || bun.StackOverflow; + + fn init(allocator: Allocator, text: []const u8, flags: Flags, rend: Renderer) Parser { + const size: OFF = @intCast(text.len); + var p = Parser{ + .allocator = allocator, + .text = text, + .size = size, + .flags = flags, + .renderer = rend, + .code_indent_offset = if (flags.no_indented_code_blocks) std.math.maxInt(u32) else 4, + .doc_ends_with_newline = size > 0 and helpers.isNewline(text[size - 1]), + .max_ref_def_output = @min(@min(16 * @as(u64, size), 1024 * 1024), std.math.maxInt(u32)), + .stack_check = bun.StackCheck.init(), + }; + p.buildMarkCharMap(); + return p; + } + + fn deinit(self: *Parser) void { + self.marks.deinit(self.allocator); + self.containers.deinit(self.allocator); + self.block_bytes.deinit(self.allocator); + self.buffer.deinit(self.allocator); + self.current_block_lines.deinit(self.allocator); + self.ref_defs.deinit(self.allocator); + self.emph_delims.deinit(self.allocator); + } + + pub inline fn ch(self: *const Parser, off: OFF) u8 { + if (off >= self.size) return 0; + return self.text[off]; + } + + fn buildMarkCharMap(self: *Parser) void { + self.mark_char_map.set('\\'); + self.mark_char_map.set('*'); + self.mark_char_map.set('_'); + self.mark_char_map.set('`'); + self.mark_char_map.set('&'); + self.mark_char_map.set(';'); + self.mark_char_map.set('['); + self.mark_char_map.set('!'); + self.mark_char_map.set(']'); + self.mark_char_map.set(0); + self.mark_char_map.set('\n'); // newlines always need handling (hard/soft breaks) + if (!self.flags.no_html_spans) { + self.mark_char_map.set('<'); + self.mark_char_map.set('>'); + } + if (self.flags.strikethrough) self.mark_char_map.set('~'); + if (self.flags.latex_math) self.mark_char_map.set('$'); + if (self.flags.permissive_email_autolinks or 
self.flags.permissive_url_autolinks) + self.mark_char_map.set(':'); + if (self.flags.permissive_email_autolinks) self.mark_char_map.set('@'); + if (self.flags.permissive_www_autolinks) self.mark_char_map.set('.'); + if (self.flags.collapse_whitespace) { + self.mark_char_map.set(' '); + self.mark_char_map.set('\t'); + self.mark_char_map.set('\r'); + } + } + + // ======================================== + // Delegated methods (re-exports) + // ======================================== + + // render_blocks.zig + pub const enterBlock = render_blocks_mod.enterBlock; + pub const leaveBlock = render_blocks_mod.leaveBlock; + pub const processCodeBlock = render_blocks_mod.processCodeBlock; + pub const processHtmlBlock = render_blocks_mod.processHtmlBlock; + pub const processTableBlock = render_blocks_mod.processTableBlock; + pub const processTableRow = render_blocks_mod.processTableRow; + + // blocks.zig + pub const processDoc = blocks_mod.processDoc; + pub const analyzeLine = blocks_mod.analyzeLine; + pub const processLine = blocks_mod.processLine; + pub const startNewBlock = blocks_mod.startNewBlock; + pub const addLineToCurrentBlock = blocks_mod.addLineToCurrentBlock; + pub const endCurrentBlock = blocks_mod.endCurrentBlock; + pub const consumeRefDefsFromCurrentBlock = blocks_mod.consumeRefDefsFromCurrentBlock; + pub const getBlockHeaderAt = blocks_mod.getBlockHeaderAt; + pub const getBlockAt = blocks_mod.getBlockAt; + + // containers.zig + pub const pushContainer = containers_mod.pushContainer; + pub const pushContainerBytes = containers_mod.pushContainerBytes; + pub const enterChildContainers = containers_mod.enterChildContainers; + pub const leaveChildContainers = containers_mod.leaveChildContainers; + pub const isContainerCompatible = containers_mod.isContainerCompatible; + pub const processAllBlocks = containers_mod.processAllBlocks; + + // inlines.zig + pub const processLeafBlock = inlines_mod.processLeafBlock; + pub const processInlineContent = 
inlines_mod.processInlineContent; + pub const enterSpan = inlines_mod.enterSpan; + pub const leaveSpan = inlines_mod.leaveSpan; + pub const emitText = inlines_mod.emitText; + pub const emitEmphOpenTags = inlines_mod.emitEmphOpenTags; + pub const emitEmphCloseTags = inlines_mod.emitEmphCloseTags; + pub const findCodeSpanEnd = inlines_mod.findCodeSpanEnd; + pub const normalizeCodeSpanContent = inlines_mod.normalizeCodeSpanContent; + pub const isLeftFlanking = inlines_mod.isLeftFlanking; + pub const isRightFlanking = inlines_mod.isRightFlanking; + pub const canOpenEmphasis = inlines_mod.canOpenEmphasis; + pub const canCloseEmphasis = inlines_mod.canCloseEmphasis; + pub const collectEmphasisDelimiters = inlines_mod.collectEmphasisDelimiters; + pub const resolveEmphasisDelimiters = inlines_mod.resolveEmphasisDelimiters; + pub const findEntity = inlines_mod.findEntity; + pub const findHtmlTag = inlines_mod.findHtmlTag; + + // links.zig + pub const processLink = links_mod.processLink; + pub const tryMatchBracketLink = links_mod.tryMatchBracketLink; + pub const labelContainsLink = links_mod.labelContainsLink; + pub const processWikiLink = links_mod.processWikiLink; + pub const renderRefLink = links_mod.renderRefLink; + pub const findAutolink = links_mod.findAutolink; + pub const renderAutolink = links_mod.renderAutolink; + + // line_analysis.zig + pub const isSetextUnderline = line_analysis_mod.isSetextUnderline; + pub const isHrLine = line_analysis_mod.isHrLine; + pub const isAtxHeaderLine = line_analysis_mod.isAtxHeaderLine; + pub const isOpeningCodeFence = line_analysis_mod.isOpeningCodeFence; + pub const isClosingCodeFence = line_analysis_mod.isClosingCodeFence; + pub const isHtmlBlockStartCondition = line_analysis_mod.isHtmlBlockStartCondition; + pub const isHtmlBlockEndCondition = line_analysis_mod.isHtmlBlockEndCondition; + pub const matchHtmlTag = line_analysis_mod.matchHtmlTag; + pub const isBlockLevelHtmlTag = line_analysis_mod.isBlockLevelHtmlTag; + pub const 
isCompleteHtmlTag = line_analysis_mod.isCompleteHtmlTag; + pub const isTableUnderline = line_analysis_mod.isTableUnderline; + pub const countTableRowColumns = line_analysis_mod.countTableRowColumns; + pub const isContainerMark = line_analysis_mod.isContainerMark; + + // ref_defs.zig + pub const normalizeLabel = ref_defs_mod.normalizeLabel; + pub const lookupRefDef = ref_defs_mod.lookupRefDef; + pub const parseRefDef = ref_defs_mod.parseRefDef; + pub const skipRefDefWhitespace = ref_defs_mod.skipRefDefWhitespace; + pub const parseRefDefDest = ref_defs_mod.parseRefDefDest; + pub const parseRefDefTitle = ref_defs_mod.parseRefDefTitle; + pub const buildRefDefHashtable = ref_defs_mod.buildRefDefHashtable; +}; + +// ======================================== +// Public API +// ======================================== + +pub fn renderToHtml(text: []const u8, allocator: Allocator, flags: Flags, render_opts: root.RenderOptions) Parser.Error![]u8 { + // Skip UTF-8 BOM + const input = helpers.skipUtf8Bom(text); + + var html_renderer = HtmlRenderer.init(allocator, input, render_opts); + errdefer html_renderer.deinit(); + + var parser = Parser.init(allocator, input, flags, html_renderer.renderer()); + defer parser.deinit(); + + // HtmlRenderer never returns JSError/JSTerminated, so only OutOfMemory and StackOverflow can propagate. + parser.processDoc() catch |err| switch (err) { + error.OutOfMemory => return error.OutOfMemory, + error.JSError, error.JSTerminated => unreachable, + error.StackOverflow => return error.StackOverflow, + }; + + return html_renderer.toOwnedSlice(); +} + +/// Parse and render using a custom renderer. The caller provides its own +/// Renderer implementation (e.g. for JS callback-based rendering). +/// `render_options` carries render-only flags (tag_filter, heading_ids, +/// autolink_headings) so they are not silently dropped by the API. 
+pub fn renderWithRenderer(text: []const u8, allocator: Allocator, flags: Flags, render_options: root.RenderOptions, rend: Renderer) Parser.Error!void { + _ = render_options; // Available for renderer implementations; parse layer does not use these. + const input = helpers.skipUtf8Bom(text); + + var p = Parser.init(allocator, input, flags, rend); + defer p.deinit(); + + try p.processDoc(); +} + +const blocks_mod = @import("./blocks.zig"); +const bun = @import("bun"); +const containers_mod = @import("./containers.zig"); +const helpers = @import("./helpers.zig"); +const inlines_mod = @import("./inlines.zig"); +const line_analysis_mod = @import("./line_analysis.zig"); +const links_mod = @import("./links.zig"); +const ref_defs_mod = @import("./ref_defs.zig"); +const render_blocks_mod = @import("./render_blocks.zig"); +const root = @import("./root.zig"); +const std = @import("std"); +const HtmlRenderer = @import("./html_renderer.zig").HtmlRenderer; +const Allocator = std.mem.Allocator; + +const types = @import("./types.zig"); +const Align = types.Align; +const BlockType = types.BlockType; +const Container = types.Container; +const Flags = types.Flags; +const Mark = types.Mark; +const OFF = types.OFF; +const Renderer = types.Renderer; +const VerbatimLine = types.VerbatimLine; diff --git a/src/md/ref_defs.zig b/src/md/ref_defs.zig new file mode 100644 index 0000000000..e2208165b4 --- /dev/null +++ b/src/md/ref_defs.zig @@ -0,0 +1,351 @@ +pub const RefDef = struct { + label: []const u8, // normalized label + dest: []const u8, // raw destination (slice of source) + title: []const u8, // raw title (slice of source) +}; + +/// Normalize a link label for comparison: collapse whitespace runs to single space, +/// strip leading/trailing whitespace, case-fold. 
+pub fn normalizeLabel(self: *Parser, raw: []const u8) []const u8 { + // Collapse whitespace and apply Unicode case folding (per CommonMark §6.7) + var result = std.ArrayListUnmanaged(u8){}; + var in_ws = true; // skip leading whitespace + var i: usize = 0; + while (i < raw.len) { + const c = raw[i]; + switch (c) { + ' ', '\t', '\n', '\r' => { + if (!in_ws and result.items.len > 0) { + result.append(self.allocator, ' ') catch return raw; + in_ws = true; + } + i += 1; + }, + 0x80...0xFF => { + // Multi-byte UTF-8: decode, case fold, re-encode + const decoded = helpers.decodeUtf8(raw, i); + const fold = unicode.caseFold(decoded.codepoint); + var j: u2 = 0; + while (j < fold.n_codepoints) : (j += 1) { + var buf: [4]u8 = undefined; + const len = helpers.encodeUtf8(fold.codepoints[j], &buf); + if (len > 0) { + result.appendSlice(self.allocator, buf[0..len]) catch return raw; + } + } + in_ws = false; + i += @as(usize, decoded.len); + }, + else => { + // ASCII: simple toLower + result.append(self.allocator, std.ascii.toLower(c)) catch return raw; + in_ws = false; + i += 1; + }, + } + } + // Strip trailing space + if (result.items.len > 0 and result.items[result.items.len - 1] == ' ') { + result.items.len -= 1; + } + return result.items; +} + +/// Look up a reference definition by label (case-insensitive, whitespace-normalized). +pub fn lookupRefDef(self: *Parser, raw_label: []const u8) ?RefDef { + if (raw_label.len == 0) return null; + const normalized = self.normalizeLabel(raw_label); + if (normalized.len == 0) return null; // whitespace-only labels are invalid + for (self.ref_defs.items) |rd| { + if (std.mem.eql(u8, rd.label, normalized)) return rd; + } + return null; +} + +/// Try to parse a link reference definition from merged paragraph text at position `pos`. +/// Returns the end position and the parsed ref def, or null if not a valid ref def. 
+pub fn parseRefDef(self: *Parser, text: []const u8, pos: usize) ?struct { end_pos: usize, label: []const u8, dest: []const u8, title: []const u8 } { + var p = pos; + + // Must start with [ + if (p >= text.len or text[p] != '[') return null; + p += 1; + + // Parse label: content up to ], no unescaped [ or ] + const label_start = p; + var label_len: usize = 0; + while (p < text.len and text[p] != ']') { + if (text[p] == '[') return null; // no nested [ + if (text[p] == '\\' and p + 1 < text.len) { + p += 2; + label_len += 2; + } else { + p += 1; + label_len += 1; + } + if (label_len > 999) return null; // label too long + } + if (p >= text.len) return null; // no closing ] + const label = text[label_start..p]; + if (label.len == 0) return null; // empty label + p += 1; // skip ] + + // Must be followed by : + if (p >= text.len or text[p] != ':') return null; + p += 1; + + // Skip optional whitespace including up to one newline + p = self.skipRefDefWhitespace(text, p); + + // Parse destination + const dest_result = self.parseRefDefDest(text, p) orelse return null; + p = dest_result.end_pos; + const dest = dest_result.dest; + + // Save position before trying title (may need to backtrack) + const pos_after_dest = p; + + // Skip optional whitespace including up to one newline + const p_before_title_ws = p; + p = self.skipRefDefWhitespace(text, p); + const had_newline_before_title = blk: { + var i = p_before_title_ws; + while (i < p) : (i += 1) { + if (text[i] == '\n') break :blk true; + } + break :blk false; + }; + + // Parse optional title + var title: []const u8 = ""; + var had_whitespace_before_title = false; + if (p < text.len and (text[p] == '"' or text[p] == '\'' or text[p] == '(')) { + // Check that there was actual whitespace between dest and title + had_whitespace_before_title = (p > pos_after_dest); + if (had_whitespace_before_title) { + if (self.parseRefDefTitle(text, p)) |title_result| { + // Title must be followed by optional whitespace then end of line or 
end of text + var after_title = title_result.end_pos; + while (after_title < text.len and (text[after_title] == ' ' or text[after_title] == '\t')) after_title += 1; + if (after_title >= text.len or text[after_title] == '\n') { + title = title_result.title; + p = after_title; + if (p < text.len and text[p] == '\n') p += 1; + return .{ .end_pos = p, .label = label, .dest = dest, .title = title }; + } + // Title present but not followed by end of line — if title was on same line as dest, invalid + // If title was on new line, treat as no title (title line is separate paragraph content) + if (!had_newline_before_title) { + return null; // title on same line as dest but not at end of line + } + } else { + // Invalid title syntax + if (!had_newline_before_title) { + return null; + } + } + } + } + + // No title: backtrack to right after destination and check for end-of-line + p = pos_after_dest; + while (p < text.len and (text[p] == ' ' or text[p] == '\t')) p += 1; + if (p < text.len and text[p] != '\n') return null; + if (p < text.len and text[p] == '\n') p += 1; + + return .{ .end_pos = p, .label = label, .dest = dest, .title = title }; +} + +pub fn skipRefDefWhitespace(self: *const Parser, text: []const u8, start: usize) usize { + _ = self; + var p = start; + while (p < text.len and (text[p] == ' ' or text[p] == '\t')) p += 1; + if (p < text.len and text[p] == '\n') { + p += 1; + while (p < text.len and (text[p] == ' ' or text[p] == '\t')) p += 1; + } + return p; +} + +pub fn parseRefDefDest(self: *const Parser, text: []const u8, start: usize) ?struct { dest: []const u8, end_pos: usize } { + _ = self; + var p = start; + if (p >= text.len) return null; + + if (text[p] == '<') { + // Angle-bracket destination + p += 1; + const dest_start = p; + while (p < text.len and text[p] != '>' and text[p] != '\n') { + if (text[p] == '\\' and p + 1 < text.len) { + p += 2; + } else { + p += 1; + } + } + if (p >= text.len or text[p] != '>') return null; + const dest = 
text[dest_start..p]; + p += 1; // skip > + return .{ .dest = dest, .end_pos = p }; + } else { + // Bare destination — balance parentheses + const dest_start = p; + var paren_depth: u32 = 0; + while (p < text.len and !helpers.isWhitespace(text[p])) { + if (text[p] == '(') { + paren_depth += 1; + } else if (text[p] == ')') { + if (paren_depth == 0) break; + paren_depth -= 1; + } + if (text[p] == '\\' and p + 1 < text.len) { + p += 2; + } else { + p += 1; + } + } + if (p == dest_start) return null; // empty dest not allowed for bare + return .{ .dest = text[dest_start..p], .end_pos = p }; + } +} + +pub fn parseRefDefTitle(self: *const Parser, text: []const u8, start: usize) ?struct { title: []const u8, end_pos: usize } { + _ = self; + var p = start; + if (p >= text.len) return null; + + const open_char = text[p]; + const close_char: u8 = switch (open_char) { + '"', '\'' => open_char, + '(' => ')', + else => return null, + }; + p += 1; + const title_start = p; + + while (p < text.len and text[p] != close_char) { + if (text[p] == '\\' and p + 1 < text.len) { + p += 2; + } else { + // For () titles, nested ( is not allowed + if (open_char == '(' and text[p] == '(') return null; + p += 1; + } + } + if (p >= text.len) return null; // no closing quote/paren + const title = text[title_start..p]; + p += 1; // skip close + return .{ .title = title, .end_pos = p }; +} + +pub fn buildRefDefHashtable(self: *Parser) error{OutOfMemory}!void { + var off: usize = 0; + const bytes = self.block_bytes.items; + + while (off < bytes.len) { + // Align to BlockHeader + const align_mask: usize = @alignOf(BlockHeader) - 1; + off = (off + align_mask) & ~align_mask; + if (off + @sizeOf(BlockHeader) > bytes.len) break; + + const hdr: *BlockHeader = @ptrCast(@alignCast(bytes.ptr + off)); + const hdr_off = off; + off += @sizeOf(BlockHeader); + + const n_lines = hdr.n_lines; + const lines_size = n_lines * @sizeOf(VerbatimLine); + if (off + lines_size > bytes.len) break; + + const line_ptr: 
[*]VerbatimLine = @ptrCast(@alignCast(bytes.ptr + off)); + const block_lines = line_ptr[0..n_lines]; + off += lines_size; + + // Only process paragraph blocks (not container openers/closers) + if (hdr.block_type != .p or hdr.flags & types.BLOCK_CONTAINER_OPENER != 0 or hdr.flags & types.BLOCK_CONTAINER_CLOSER != 0) { + continue; + } + + if (n_lines == 0) continue; + + // Merge lines into buffer to parse ref defs + self.buffer.clearRetainingCapacity(); + for (block_lines) |vline| { + if (vline.beg > vline.end or vline.end > self.size) continue; + if (self.buffer.items.len > 0) { + self.buffer.append(self.allocator, '\n') catch {}; + } + self.buffer.appendSlice(self.allocator, self.text[vline.beg..vline.end]) catch {}; + } + + const merged = self.buffer.items; + var pos: usize = 0; + var lines_consumed: u32 = 0; + + // Try to parse consecutive ref defs from the start + while (pos < merged.len) { + const result = self.parseRefDef(merged, pos) orelse break; + + // Normalize and store the ref def (first definition wins) + const norm_label = self.normalizeLabel(result.label); + if (norm_label.len == 0) break; // whitespace-only labels are invalid + var already_exists = false; + for (self.ref_defs.items) |existing| { + if (std.mem.eql(u8, existing.label, norm_label)) { + already_exists = true; + break; + } + } + if (!already_exists) { + // Dupe dest and title since they point into self.buffer which gets reused + const dest_dupe = self.allocator.dupe(u8, result.dest) catch return error.OutOfMemory; + const title_dupe = self.allocator.dupe(u8, result.title) catch return error.OutOfMemory; + try self.ref_defs.append(self.allocator, .{ + .label = norm_label, + .dest = dest_dupe, + .title = title_dupe, + }); + } + + // Count how many newlines were consumed to track lines + var newlines: u32 = 0; + for (merged[pos..result.end_pos]) |mc| { + if (mc == '\n') newlines += 1; + } + // If end_pos is at the end and last char wasn't \n, that's still a consumed line + if (result.end_pos 
>= merged.len and (result.end_pos == pos or merged[result.end_pos - 1] != '\n')) { + newlines += 1; + } + lines_consumed += newlines; + pos = result.end_pos; + } + + // Update the block: mark consumed lines + if (lines_consumed > 0) { + if (lines_consumed >= n_lines) { + // Entire paragraph is ref defs — flag to skip during rendering + hdr.flags |= types.BLOCK_REF_DEF_ONLY; + } else { + // Mark consumed lines as invalid (beg > end triggers skip in processLeafBlock) + const line_base: [*]VerbatimLine = @ptrCast(@alignCast(bytes.ptr + hdr_off + @sizeOf(BlockHeader))); + var i: u32 = 0; + while (i < lines_consumed) : (i += 1) { + line_base[i].beg = 1; + line_base[i].end = 0; + } + } + } + } +} + +const helpers = @import("./helpers.zig"); +const parser_mod = @import("./parser.zig"); +const std = @import("std"); +const unicode = @import("./unicode.zig"); + +const Parser = parser_mod.Parser; +const BlockHeader = Parser.BlockHeader; + +const types = @import("./types.zig"); +const Align = types.Align; +const Mark = types.Mark; +const VerbatimLine = types.VerbatimLine; diff --git a/src/md/render_blocks.zig b/src/md/render_blocks.zig new file mode 100644 index 0000000000..abece86188 --- /dev/null +++ b/src/md/render_blocks.zig @@ -0,0 +1,153 @@ +pub fn enterBlock(self: *Parser, block_type: BlockType, data: u32, flags: u32) bun.JSError!void { + if (self.image_nesting_level > 0) return; + try self.renderer.enterBlock(block_type, data, flags); +} + +pub fn leaveBlock(self: *Parser, block_type: BlockType, data: u32) bun.JSError!void { + if (self.image_nesting_level > 0) return; + try self.renderer.leaveBlock(block_type, data); +} + +pub fn processCodeBlock(self: *Parser, block_lines: []const VerbatimLine, data: u32, flags: u32) bun.JSError!void { + _ = data; + + var count = block_lines.len; + + // Trim trailing blank lines from indented code blocks (not fenced) + if (flags & types.BLOCK_FENCED_CODE == 0) { + while (count > 0 and block_lines[count - 1].beg >= block_lines[count - 
1].end) { + count -= 1; + } + } + + for (block_lines[0..count]) |vline| { + // Output indented content + for (0..vline.indent) |_| { + try self.emitText(.normal, " "); + } + const content = self.text[vline.beg..vline.end]; + try self.emitText(.normal, content); + try self.emitText(.normal, "\n"); + } +} + +pub fn processHtmlBlock(self: *Parser, block_lines: []const VerbatimLine) bun.JSError!void { + for (block_lines, 0..) |vline, i| { + if (i > 0) try self.emitText(.html, "\n"); + for (0..vline.indent) |_| { + try self.emitText(.html, " "); + } + try self.emitText(.html, self.text[vline.beg..vline.end]); + } + try self.emitText(.html, "\n"); +} + +pub fn processTableBlock(self: *Parser, block_lines: []const VerbatimLine, col_count: u32) Parser.Error!void { + if (block_lines.len < 2) return; + + // First line is header, second is underline, rest are body + try self.enterBlock(.thead, 0, 0); + try self.enterBlock(.tr, 0, 0); + try self.processTableRow(block_lines[0], true, col_count); + try self.leaveBlock(.tr, 0); + try self.leaveBlock(.thead, 0); + + if (block_lines.len > 2) { + try self.enterBlock(.tbody, 0, 0); + for (block_lines[2..]) |vline| { + try self.enterBlock(.tr, 0, 0); + try self.processTableRow(vline, false, col_count); + try self.leaveBlock(.tr, 0); + } + try self.leaveBlock(.tbody, 0); + } +} + +pub fn processTableRow(self: *Parser, vline: VerbatimLine, is_header: bool, col_count: u32) Parser.Error!void { + const row_text = self.text[vline.beg..vline.end]; + var start: usize = 0; + var cell_index: u32 = 0; + + // Skip leading pipe + if (start < row_text.len and row_text[start] == '|') start += 1; + + while (start < row_text.len and cell_index < col_count) { + // Find cell end (the next unescaped `|`), skipping backslash-escaped chars + var end = start; + while (end < row_text.len and row_text[end] != '|') { + if (row_text[end] == '\\' and end + 1 < row_text.len) { + end += 2; + } else { + end += 1; + } + } + + // Skip trailing pipe cell + if (end == row_text.len and 
start == end) break; + + // Trim cell content + var cell_beg = start; + var cell_end = end; + while (cell_beg < cell_end and helpers.isBlank(row_text[cell_beg])) cell_beg += 1; + while (cell_end > cell_beg and helpers.isBlank(row_text[cell_end - 1])) cell_end -= 1; + + const cell_type: BlockType = if (is_header) .th else .td; + const align_data: u32 = if (cell_index < types.TABLE_MAXCOLCOUNT) @intFromEnum(self.table_alignments[cell_index]) else 0; + try self.enterBlock(cell_type, align_data, 0); + if (cell_beg < cell_end) { + const cell_content = row_text[cell_beg..cell_end]; + // GFM: \| in table cells should be consumed at the table level, + // replacing \| with | before inline processing. This matters for + // code spans where backslash escapes don't apply. + if (std.mem.indexOf(u8, cell_content, "\\|") != null) { + var buf: std.ArrayListUnmanaged(u8) = .{}; + defer buf.deinit(self.allocator); + const unescaped = if (buf.ensureTotalCapacity(self.allocator, cell_content.len)) |_| blk: { + var ci: usize = 0; + while (ci < cell_content.len) { + if (cell_content[ci] == '\\' and ci + 1 < cell_content.len and cell_content[ci + 1] == '|') { + buf.appendAssumeCapacity('|'); + ci += 2; + } else { + buf.appendAssumeCapacity(cell_content[ci]); + ci += 1; + } + } + break :blk buf.items; + } else |_| cell_content; + try self.processInlineContent(unescaped, vline.beg + @as(OFF, @intCast(cell_beg))); + } else { + try self.processInlineContent(cell_content, vline.beg + @as(OFF, @intCast(cell_beg))); + } + } + try self.leaveBlock(cell_type, 0); + cell_index += 1; + + if (end < row_text.len) { + start = end + 1; // skip | + } else { + break; + } + } + + // Pad short rows with empty cells + const cell_type: BlockType = if (is_header) .th else .td; + while (cell_index < col_count) { + const align_data: u32 = if (cell_index < types.TABLE_MAXCOLCOUNT) @intFromEnum(self.table_alignments[cell_index]) else 0; + try self.enterBlock(cell_type, align_data, 0); + try 
self.leaveBlock(cell_type, 0); + cell_index += 1; + } +} + +const bun = @import("bun"); +const helpers = @import("./helpers.zig"); +const std = @import("std"); + +const parser_mod = @import("./parser.zig"); +const Parser = parser_mod.Parser; + +const types = @import("./types.zig"); +const BlockType = types.BlockType; +const OFF = types.OFF; +const VerbatimLine = types.VerbatimLine; diff --git a/src/md/root.zig b/src/md/root.zig new file mode 100644 index 0000000000..219c51861c --- /dev/null +++ b/src/md/root.zig @@ -0,0 +1,104 @@ +// Re-export types needed by external renderers (e.g. JS callback renderer). +pub const Renderer = types.Renderer; +pub const BlockType = types.BlockType; +pub const SpanType = types.SpanType; +pub const TextType = types.TextType; +pub const SpanDetail = types.SpanDetail; +pub const Align = types.Align; +pub const BLOCK_FENCED_CODE = types.BLOCK_FENCED_CODE; + +pub const RenderOptions = struct { + tag_filter: bool = false, + heading_ids: bool = false, + autolink_headings: bool = false, +}; + +pub const Options = struct { + tables: bool = true, + strikethrough: bool = true, + tasklists: bool = true, + permissive_autolinks: bool = false, + permissive_url_autolinks: bool = false, + permissive_www_autolinks: bool = false, + permissive_email_autolinks: bool = false, + hard_soft_breaks: bool = false, + wiki_links: bool = false, + underline: bool = false, + latex_math: bool = false, + collapse_whitespace: bool = false, + permissive_atx_headers: bool = false, + no_indented_code_blocks: bool = false, + no_html_blocks: bool = false, + no_html_spans: bool = false, + /// GFM tag filter: replaces `<` with `&lt;` for disallowed HTML tags + /// (title, textarea, style, xmp, iframe, noembed, noframes, script, plaintext). 
+ tag_filter: bool = false, + heading_ids: bool = false, + autolink_headings: bool = false, + + pub const commonmark: Options = .{ + .tables = false, + .strikethrough = false, + .tasklists = false, + }; + + pub const github: Options = .{ + .tables = true, + .strikethrough = true, + .tasklists = true, + .permissive_autolinks = true, + .permissive_www_autolinks = true, + .permissive_email_autolinks = true, + .tag_filter = true, + }; + + pub fn toFlags(self: Options) Flags { + return .{ + .tables = self.tables, + .strikethrough = self.strikethrough, + .tasklists = self.tasklists, + .permissive_url_autolinks = self.permissive_url_autolinks or self.permissive_autolinks, + .permissive_www_autolinks = self.permissive_www_autolinks or self.permissive_autolinks, + .permissive_email_autolinks = self.permissive_email_autolinks or self.permissive_autolinks, + .hard_soft_breaks = self.hard_soft_breaks, + .wiki_links = self.wiki_links, + .underline = self.underline, + .latex_math = self.latex_math, + .collapse_whitespace = self.collapse_whitespace, + .permissive_atx_headers = self.permissive_atx_headers, + .no_indented_code_blocks = self.no_indented_code_blocks, + .no_html_blocks = self.no_html_blocks, + .no_html_spans = self.no_html_spans, + }; + } + + pub fn toRenderOptions(self: Options) RenderOptions { + return .{ + .tag_filter = self.tag_filter, + .heading_ids = self.heading_ids, + .autolink_headings = self.autolink_headings, + }; + } +}; + +pub fn renderToHtml(text: []const u8, allocator: std.mem.Allocator) parser.Parser.Error![]u8 { + return renderToHtmlWithOptions(text, allocator, .{}); +} + +pub fn renderToHtmlWithOptions(text: []const u8, allocator: std.mem.Allocator, options: Options) parser.Parser.Error![]u8 { + return parser.renderToHtml(text, allocator, options.toFlags(), options.toRenderOptions()); +} + +/// Parse and render using a custom renderer implementation. 
+pub fn renderWithRenderer(text: []const u8, allocator: std.mem.Allocator, options: Options, renderer: Renderer) parser.Parser.Error!void { + return parser.renderWithRenderer(text, allocator, options.toFlags(), options.toRenderOptions(), renderer); +} + +pub const types = @import("./types.zig"); +const Flags = types.Flags; + +pub const entity = @import("./entity.zig"); +pub const helpers = @import("./helpers.zig"); + +const parser = @import("./parser.zig"); +const std = @import("std"); diff --git a/src/md/types.zig b/src/md/types.zig new file mode 100644 index 0000000000..996a2b0fcc --- /dev/null +++ b/src/md/types.zig @@ -0,0 +1,387 @@ +/// Offset into the input document. +pub const OFF = u32; +/// Size type. +pub const SZ = u32; + +/// Block types reported via enter_block / leave_block callbacks. +pub const BlockType = enum(u8) { + doc, + quote, + ul, + ol, + li, + hr, + h, + code, + html, + p, + table, + thead, + tbody, + tr, + th, + td, +}; + +/// Span (inline) types reported via enter_span / leave_span callbacks. +pub const SpanType = enum(u8) { + em, + strong, + a, + img, + code, + del, + latexmath, + latexmath_display, + wikilink, + u, +}; + +/// Text types reported via the text callback. +pub const TextType = enum(u8) { + normal, + null_char, + br, + softbr, + entity, + code, + html, + latexmath, +}; + +/// Table cell alignment. 
+pub const Align = enum(u8) { + default, + left, + center, + right, +}; + +// --- Detail structs --- + +pub const UlDetail = struct { + is_tight: bool, + mark: u8, +}; + +pub const OlDetail = struct { + start: u32, + is_tight: bool, + mark_delimiter: u8, +}; + +pub const LiDetail = struct { + is_task: bool, + task_mark: u8, + task_mark_offset: OFF, +}; + +pub const HDetail = struct { + level: u8, +}; + +pub const CodeDetail = struct { + info: Attribute, + lang: Attribute, + fence_char: u8, +}; + +pub const TableDetail = struct { + col_count: u32, + head_row_count: u32, + body_row_count: u32, +}; + +pub const TdDetail = struct { + alignment: Align, +}; + +pub const ADetail = struct { + href: Attribute, + title: Attribute, +}; + +pub const ImgDetail = struct { + src: Attribute, + title: Attribute, +}; + +pub const WikilinkDetail = struct { + target: Attribute, +}; + +/// Renderer interface. The parser calls these methods to produce output. +pub const Renderer = struct { + ptr: *anyopaque, + vtable: *const VTable, + + pub const VTable = struct { + enterBlock: *const fn (ptr: *anyopaque, block_type: BlockType, data: u32, flags: u32) bun.JSError!void, + leaveBlock: *const fn (ptr: *anyopaque, block_type: BlockType, data: u32) bun.JSError!void, + enterSpan: *const fn (ptr: *anyopaque, span_type: SpanType, detail: SpanDetail) bun.JSError!void, + leaveSpan: *const fn (ptr: *anyopaque, span_type: SpanType) bun.JSError!void, + text: *const fn (ptr: *anyopaque, text_type: TextType, content: []const u8) bun.JSError!void, + }; + + pub inline fn enterBlock(self: Renderer, block_type: BlockType, data: u32, flags: u32) bun.JSError!void { + return self.vtable.enterBlock(self.ptr, block_type, data, flags); + } + pub inline fn leaveBlock(self: Renderer, block_type: BlockType, data: u32) bun.JSError!void { + return self.vtable.leaveBlock(self.ptr, block_type, data); + } + pub inline fn enterSpan(self: Renderer, span_type: SpanType, detail: SpanDetail) bun.JSError!void { + return 
self.vtable.enterSpan(self.ptr, span_type, detail); + } + pub inline fn leaveSpan(self: Renderer, span_type: SpanType) bun.JSError!void { + return self.vtable.leaveSpan(self.ptr, span_type); + } + pub inline fn text(self: Renderer, text_type: TextType, content: []const u8) bun.JSError!void { + return self.vtable.text(self.ptr, text_type, content); + } +}; + +/// Detail data for span events (links, images, wikilinks). +pub const SpanDetail = struct { + href: []const u8 = "", + title: []const u8 = "", + /// Standard autolink (angle-bracket): use writeUrlEscaped (no entity/escape processing) + autolink: bool = false, + /// Standard autolink is an email: prepend "mailto:" to href + autolink_email: bool = false, + /// Permissive autolink: use HTML-escaping for href (not URL-escaping) + permissive_autolink: bool = false, + /// Permissive www autolink: prepend "http://" to href + autolink_www: bool = false, +}; + +/// An attribute is a string that may contain embedded entities. +/// The text is split into substrings, each with a type (normal or entity). +pub const Attribute = struct { + /// Slices into the source text, one per substring. + substr_offsets: []const SubstrOffset, + substr_types: []const SubstrType, + + pub const SubstrType = enum(u8) { + normal, + entity, + }; + + pub const SubstrOffset = struct { + beg: OFF, + end: OFF, + }; + + pub fn text(self: Attribute, src: []const u8) []const u8 { + if (self.substr_offsets.len == 0) return ""; + const first = self.substr_offsets[0].beg; + const last = self.substr_offsets[self.substr_offsets.len - 1].end; + return src[first..last]; + } +}; + +// --- Internal types used by the parser --- + +/// Line types during block analysis. +pub const LineType = enum(u8) { + blank, + hr, + atxheader, + setextunderline, + setextheader, + indentedcode, + fencedcode, + html, + text, + table, + tableunderline, +}; + +/// A line analysis result. 
+pub const Line = struct { + type: LineType = .blank, + beg: OFF = 0, + end: OFF = 0, + indent: u32 = 0, + data: u32 = 0, + enforce_new_block: bool = false, +}; + +/// A verbatim line (stores beg/end offsets plus indent for indented code). +pub const VerbatimLine = extern struct { + beg: OFF, + end: OFF, + indent: u32, +}; + +/// Container types: blockquote or list item. +pub const Container = struct { + ch: u8 = 0, + is_loose: bool = false, + is_task: bool = false, + task_mark_off: OFF = 0, + start: u32 = 0, + mark_indent: u32 = 0, + contents_indent: u32 = 0, + block_byte_off: u32 = 0, +}; + +/// Block flags stored in MD_BLOCK. +pub const BlockFlags = packed struct(u32) { + container_closer: bool = false, + container_opener: bool = false, + loose_list: bool = false, + setext_header: bool = false, + _padding: u28 = 0, +}; + +pub const BLOCK_CONTAINER_CLOSER: u32 = 0x01; +pub const BLOCK_CONTAINER_OPENER: u32 = 0x02; +pub const BLOCK_LOOSE_LIST: u32 = 0x04; +pub const BLOCK_SETEXT_HEADER: u32 = 0x08; +pub const BLOCK_FENCED_CODE: u32 = 0x10; +pub const BLOCK_REF_DEF_ONLY: u32 = 0x20; + +/// Block descriptor stored in block_bytes buffer. +pub const Block = struct { + type: BlockType, + flags: u32 = 0, + data: u32 = 0, + n_lines: u32 = 0, +}; + +/// Mark flags. +pub const MarkFlags = struct { + pub const POTENTIAL_OPENER: u16 = 0x01; + pub const POTENTIAL_CLOSER: u16 = 0x02; + pub const OPENER: u16 = 0x04; + pub const CLOSER: u16 = 0x08; + pub const RESOLVED: u16 = 0x10; + + // Emphasis analysis flags + pub const EMPH_INTRAWORD: u16 = 0x20; + pub const EMPH_MOD3_0: u16 = 0x40; + pub const EMPH_MOD3_1: u16 = 0x80; + pub const EMPH_MOD3_2: u16 = 0x100; + + pub const EMPH_OC: u16 = POTENTIAL_OPENER | POTENTIAL_CLOSER; +}; + +/// A mark in the inline processing system. +pub const Mark = struct { + beg: OFF = 0, + end: OFF = 0, + prev: i32 = -1, + next: i32 = -1, + ch: u8 = 0, + flags: u16 = 0, +}; + +/// Parser flags controlling which extensions are enabled. 
+pub const Flags = struct { + collapse_whitespace: bool = false, + permissive_atx_headers: bool = false, + permissive_url_autolinks: bool = false, + permissive_www_autolinks: bool = false, + permissive_email_autolinks: bool = false, + no_indented_code_blocks: bool = false, + no_html_blocks: bool = false, + no_html_spans: bool = false, + tables: bool = true, + strikethrough: bool = true, + tasklists: bool = true, + latex_math: bool = false, + wiki_links: bool = false, + underline: bool = false, + hard_soft_breaks: bool = false, + + pub const commonmark: Flags = .{ + .tables = false, + .strikethrough = false, + .tasklists = false, + }; + + pub const github: Flags = .{ + .tables = true, + .strikethrough = true, + .tasklists = true, + .permissive_url_autolinks = true, + .permissive_www_autolinks = true, + .permissive_email_autolinks = true, + }; + + pub fn permissiveAutolinks(self: Flags) bool { + return self.permissive_url_autolinks or self.permissive_www_autolinks or self.permissive_email_autolinks; + } +}; + +/// Number of opener stacks used during inline analysis. +/// 6 for *, 6 for _, 2 for ~, 1 for brackets, 1 for $ +pub const NUM_OPENER_STACKS: usize = 16; + +// Opener stack indices +pub const ASTERISK_OPENERS_OO_0: usize = 0; +pub const ASTERISK_OPENERS_OO_1: usize = 1; +pub const ASTERISK_OPENERS_OO_2: usize = 2; +pub const ASTERISK_OPENERS_OC_0: usize = 3; +pub const ASTERISK_OPENERS_OC_1: usize = 4; +pub const ASTERISK_OPENERS_OC_2: usize = 5; +pub const UNDERSCORE_OPENERS_OO_0: usize = 6; +pub const UNDERSCORE_OPENERS_OO_1: usize = 7; +pub const UNDERSCORE_OPENERS_OO_2: usize = 8; +pub const UNDERSCORE_OPENERS_OC_0: usize = 9; +pub const UNDERSCORE_OPENERS_OC_1: usize = 10; +pub const UNDERSCORE_OPENERS_OC_2: usize = 11; +pub const TILDE_OPENERS_1: usize = 12; +pub const TILDE_OPENERS_2: usize = 13; +pub const BRACKET_OPENERS: usize = 14; +pub const DOLLAR_OPENERS: usize = 15; + +/// An opener stack: a doubly-linked list through mark indices. 
+pub const OpenerStack = struct { + top: i32 = -1, +}; + +/// Internal limits matching md4c. +pub const CODESPAN_MARK_MAXLEN: u32 = 255; +pub const TABLE_MAXCOLCOUNT: u32 = 128; + +/// Reference definition used for link resolution. +pub const RefDef = struct { + label: []const u8, + title: Attribute, + dest_beg: OFF, + dest_end: OFF, + label_needs_free: bool = false, + title_needs_free: bool = false, +}; + +// ======================================== +// Metadata extraction helpers +// ======================================== + +/// Extract table cell alignment from block data. +pub fn alignmentFromData(data: u32) Align { + return @enumFromInt(@as(u2, @truncate(data))); +} + +/// Get string name for alignment, or null for default. +pub fn alignmentName(alignment: Align) ?[]const u8 { + return switch (alignment) { + .left => "left", + .center => "center", + .right => "right", + .default => null, + }; +} + +/// Extract task list item mark from block data. Returns 0 for non-task items. +pub fn taskMarkFromData(data: u32) u8 { + return @truncate(data); +} + +/// Check if a task mark indicates a checked box. +pub fn isTaskChecked(task_mark: u8) bool { + return task_mark != 0 and task_mark != ' '; +} + +const bun = @import("bun"); diff --git a/src/md/unicode.zig b/src/md/unicode.zig new file mode 100644 index 0000000000..eb2158a8cd --- /dev/null +++ b/src/md/unicode.zig @@ -0,0 +1,477 @@ +pub const FoldInfo = struct { + codepoints: [3]u21, + n_codepoints: u2, +}; + +/// A map entry represents either a single codepoint or a range. +/// Ranges are encoded as two consecutive entries: +/// (min_codepoint | range_start_flag), (max_codepoint | range_end_flag) +/// Single codepoints are stored as-is (no flags). +/// +/// The corresponding data array is indexed in the same way: each physical +/// slot in the map array maps to `n_codepoints` consecutive entries in the +/// data array, at position `physical_index * n_codepoints`. 
+const MapEntry = u32; + +const range_start_flag: u32 = 0x40000000; +const range_end_flag: u32 = 0x80000000; +const codepoint_mask: u32 = 0x00ffffff; + +/// Extract the raw codepoint value from a map entry. +inline fn rawCodepoint(entry: MapEntry) u21 { + return @intCast(entry & codepoint_mask); +} + +/// Binary search over a sorted map of codepoints, supporting both single values +/// and ranges encoded with flag bits. Returns the physical index of the found +/// record (for ranges, the index of the range-start entry), or null on failure. +/// +/// This is a direct port of md4c's `md_unicode_bsearch__`. +fn unicodeBsearch(codepoint: u21, map: []const MapEntry) ?usize { + if (map.len == 0) return null; + + var beg: usize = 0; + var end: usize = map.len - 1; + + while (beg <= end) { + var pivot_beg: usize = (beg + end) / 2; + var pivot_end: usize = pivot_beg; + + // If pivot points at a range-start entry, the range-end is the next entry. + if (map[pivot_end] & range_start_flag != 0) + pivot_end += 1; + // If pivot points at a range-end entry, the range-start is the previous entry. + if (map[pivot_beg] & range_end_flag != 0) { + if (pivot_beg == 0) return null; + pivot_beg -= 1; + } + + const lo_cp = rawCodepoint(map[pivot_beg]); + const hi_cp = rawCodepoint(map[pivot_end]); + + if (codepoint < lo_cp) { + if (pivot_beg == 0) return null; + end = pivot_beg - 1; + } else if (codepoint > hi_cp) { + beg = pivot_end + 1; + } else { + return pivot_beg; + } + } + + return null; +} + +// --------------------------------------------------------------------------- +// Fold Map 1: single-codepoint mappings +// (generated by scripts/build_folding_map.py in md4c) +// +// Each physical slot in fold_map_1 corresponds to exactly 1 entry in +// fold_map_1_data (at the same index). Range pairs (start, end) occupy +// two physical slots and thus two data entries -- the start entry's data +// value is the base folded codepoint for the range. 
+// --------------------------------------------------------------------------- + +const fold_map_1 = [_]MapEntry{ + range_start_flag | 0x0041, range_end_flag | 0x005a, + 0x00b5, range_start_flag | 0x00c0, + range_end_flag | 0x00d6, range_start_flag | 0x00d8, + range_end_flag | 0x00de, range_start_flag | 0x0100, + range_end_flag | 0x012e, range_start_flag | 0x0132, + range_end_flag | 0x0136, range_start_flag | 0x0139, + range_end_flag | 0x0147, range_start_flag | 0x014a, + range_end_flag | 0x0176, 0x0178, + range_start_flag | 0x0179, range_end_flag | 0x017d, + 0x017f, 0x0181, + 0x0182, 0x0184, + 0x0186, 0x0187, + 0x0189, 0x018a, + 0x018b, 0x018e, + 0x018f, 0x0190, + 0x0191, 0x0193, + 0x0194, 0x0196, + 0x0197, 0x0198, + 0x019c, 0x019d, + 0x019f, range_start_flag | 0x01a0, + range_end_flag | 0x01a4, 0x01a6, + 0x01a7, 0x01a9, + 0x01ac, 0x01ae, + 0x01af, 0x01b1, + 0x01b2, 0x01b3, + 0x01b5, 0x01b7, + 0x01b8, 0x01bc, + 0x01c4, 0x01c5, + 0x01c7, 0x01c8, + 0x01ca, range_start_flag | 0x01cb, + range_end_flag | 0x01db, range_start_flag | 0x01de, + range_end_flag | 0x01ee, 0x01f1, + 0x01f2, 0x01f4, + 0x01f6, 0x01f7, + range_start_flag | 0x01f8, range_end_flag | 0x021e, + 0x0220, range_start_flag | 0x0222, + range_end_flag | 0x0232, 0x023a, + 0x023b, 0x023d, + 0x023e, 0x0241, + 0x0243, 0x0244, + 0x0245, range_start_flag | 0x0246, + range_end_flag | 0x024e, 0x0345, + 0x0370, 0x0372, + 0x0376, 0x037f, + 0x0386, range_start_flag | 0x0388, + range_end_flag | 0x038a, 0x038c, + 0x038e, 0x038f, + range_start_flag | 0x0391, range_end_flag | 0x03a1, + range_start_flag | 0x03a3, range_end_flag | 0x03ab, + 0x03c2, 0x03cf, + 0x03d0, 0x03d1, + 0x03d5, 0x03d6, + range_start_flag | 0x03d8, range_end_flag | 0x03ee, + 0x03f0, 0x03f1, + 0x03f4, 0x03f5, + 0x03f7, 0x03f9, + 0x03fa, range_start_flag | 0x03fd, + range_end_flag | 0x03ff, range_start_flag | 0x0400, + range_end_flag | 0x040f, range_start_flag | 0x0410, + range_end_flag | 0x042f, range_start_flag | 0x0460, + range_end_flag | 0x0480, 
range_start_flag | 0x048a, + range_end_flag | 0x04be, 0x04c0, + range_start_flag | 0x04c1, range_end_flag | 0x04cd, + range_start_flag | 0x04d0, range_end_flag | 0x052e, + range_start_flag | 0x0531, range_end_flag | 0x0556, + range_start_flag | 0x10a0, range_end_flag | 0x10c5, + 0x10c7, 0x10cd, + range_start_flag | 0x13f8, range_end_flag | 0x13fd, + 0x1c80, 0x1c81, + 0x1c82, 0x1c83, + 0x1c84, 0x1c85, + 0x1c86, 0x1c87, + 0x1c88, range_start_flag | 0x1c90, + range_end_flag | 0x1cba, range_start_flag | 0x1cbd, + range_end_flag | 0x1cbf, range_start_flag | 0x1e00, + range_end_flag | 0x1e94, 0x1e9b, + range_start_flag | 0x1ea0, range_end_flag | 0x1efe, + range_start_flag | 0x1f08, range_end_flag | 0x1f0f, + range_start_flag | 0x1f18, range_end_flag | 0x1f1d, + range_start_flag | 0x1f28, range_end_flag | 0x1f2f, + range_start_flag | 0x1f38, range_end_flag | 0x1f3f, + range_start_flag | 0x1f48, range_end_flag | 0x1f4d, + 0x1f59, 0x1f5b, + 0x1f5d, 0x1f5f, + range_start_flag | 0x1f68, range_end_flag | 0x1f6f, + 0x1fb8, 0x1fb9, + 0x1fba, 0x1fbb, + 0x1fbe, range_start_flag | 0x1fc8, + range_end_flag | 0x1fcb, 0x1fd8, + 0x1fd9, 0x1fda, + 0x1fdb, 0x1fe8, + 0x1fe9, 0x1fea, + 0x1feb, 0x1fec, + 0x1ff8, 0x1ff9, + 0x1ffa, 0x1ffb, + 0x2126, 0x212a, + 0x212b, 0x2132, + range_start_flag | 0x2160, range_end_flag | 0x216f, + 0x2183, range_start_flag | 0x24b6, + range_end_flag | 0x24cf, range_start_flag | 0x2c00, + range_end_flag | 0x2c2f, 0x2c60, + 0x2c62, 0x2c63, + 0x2c64, range_start_flag | 0x2c67, + range_end_flag | 0x2c6b, 0x2c6d, + 0x2c6e, 0x2c6f, + 0x2c70, 0x2c72, + 0x2c75, 0x2c7e, + 0x2c7f, range_start_flag | 0x2c80, + range_end_flag | 0x2ce2, 0x2ceb, + 0x2ced, 0x2cf2, + range_start_flag | 0xa640, range_end_flag | 0xa66c, + range_start_flag | 0xa680, range_end_flag | 0xa69a, + range_start_flag | 0xa722, range_end_flag | 0xa72e, + range_start_flag | 0xa732, range_end_flag | 0xa76e, + 0xa779, 0xa77b, + 0xa77d, range_start_flag | 0xa77e, + range_end_flag | 0xa786, 0xa78b, + 0xa78d, 
0xa790, + 0xa792, range_start_flag | 0xa796, + range_end_flag | 0xa7a8, 0xa7aa, + 0xa7ab, 0xa7ac, + 0xa7ad, 0xa7ae, + 0xa7b0, 0xa7b1, + 0xa7b2, 0xa7b3, + range_start_flag | 0xa7b4, range_end_flag | 0xa7c2, + 0xa7c4, 0xa7c5, + 0xa7c6, 0xa7c7, + 0xa7c9, 0xa7d0, + 0xa7d6, 0xa7d8, + 0xa7f5, range_start_flag | 0xab70, + range_end_flag | 0xabbf, range_start_flag | 0xff21, + range_end_flag | 0xff3a, range_start_flag | 0x10400, + range_end_flag | 0x10427, range_start_flag | 0x104b0, + range_end_flag | 0x104d3, range_start_flag | 0x10570, + range_end_flag | 0x1057a, range_start_flag | 0x1057c, + range_end_flag | 0x1058a, range_start_flag | 0x1058c, + range_end_flag | 0x10592, 0x10594, + 0x10595, range_start_flag | 0x10c80, + range_end_flag | 0x10cb2, range_start_flag | 0x118a0, + range_end_flag | 0x118bf, range_start_flag | 0x16e40, + range_end_flag | 0x16e5f, range_start_flag | 0x1e900, + range_end_flag | 0x1e921, +}; + +// Data indexed by physical map index. Each physical map slot has one +// corresponding data entry. For range start/end pairs, the range-start +// slot's data value is the base folded codepoint; the range-end slot's +// data value is the end folded codepoint. 
+const fold_map_1_data = [_]u21{ + 0x0061, 0x007a, 0x03bc, 0x00e0, 0x00f6, 0x00f8, 0x00fe, 0x0101, 0x012f, 0x0133, 0x0137, 0x013a, 0x0148, + 0x014b, 0x0177, 0x00ff, 0x017a, 0x017e, 0x0073, 0x0253, 0x0183, 0x0185, 0x0254, 0x0188, 0x0256, 0x0257, + 0x018c, 0x01dd, 0x0259, 0x025b, 0x0192, 0x0260, 0x0263, 0x0269, 0x0268, 0x0199, 0x026f, 0x0272, 0x0275, + 0x01a1, 0x01a5, 0x0280, 0x01a8, 0x0283, 0x01ad, 0x0288, 0x01b0, 0x028a, 0x028b, 0x01b4, 0x01b6, 0x0292, + 0x01b9, 0x01bd, 0x01c6, 0x01c6, 0x01c9, 0x01c9, 0x01cc, 0x01cc, 0x01dc, 0x01df, 0x01ef, 0x01f3, 0x01f3, + 0x01f5, 0x0195, 0x01bf, 0x01f9, 0x021f, 0x019e, 0x0223, 0x0233, 0x2c65, 0x023c, 0x019a, 0x2c66, 0x0242, + 0x0180, 0x0289, 0x028c, 0x0247, 0x024f, 0x03b9, 0x0371, 0x0373, 0x0377, 0x03f3, 0x03ac, 0x03ad, 0x03af, + 0x03cc, 0x03cd, 0x03ce, 0x03b1, 0x03c1, 0x03c3, 0x03cb, 0x03c3, 0x03d7, 0x03b2, 0x03b8, 0x03c6, 0x03c0, + 0x03d9, 0x03ef, 0x03ba, 0x03c1, 0x03b8, 0x03b5, 0x03f8, 0x03f2, 0x03fb, 0x037b, 0x037d, 0x0450, 0x045f, + 0x0430, 0x044f, 0x0461, 0x0481, 0x048b, 0x04bf, 0x04cf, 0x04c2, 0x04ce, 0x04d1, 0x052f, 0x0561, 0x0586, + 0x2d00, 0x2d25, 0x2d27, 0x2d2d, 0x13f0, 0x13f5, 0x0432, 0x0434, 0x043e, 0x0441, 0x0442, 0x0442, 0x044a, + 0x0463, 0xa64b, 0x10d0, 0x10fa, 0x10fd, 0x10ff, 0x1e01, 0x1e95, 0x1e61, 0x1ea1, 0x1eff, 0x1f00, 0x1f07, + 0x1f10, 0x1f15, 0x1f20, 0x1f27, 0x1f30, 0x1f37, 0x1f40, 0x1f45, 0x1f51, 0x1f53, 0x1f55, 0x1f57, 0x1f60, + 0x1f67, 0x1fb0, 0x1fb1, 0x1f70, 0x1f71, 0x03b9, 0x1f72, 0x1f75, 0x1fd0, 0x1fd1, 0x1f76, 0x1f77, 0x1fe0, + 0x1fe1, 0x1f7a, 0x1f7b, 0x1fe5, 0x1f78, 0x1f79, 0x1f7c, 0x1f7d, 0x03c9, 0x006b, 0x00e5, 0x214e, 0x2170, + 0x217f, 0x2184, 0x24d0, 0x24e9, 0x2c30, 0x2c5f, 0x2c61, 0x026b, 0x1d7d, 0x027d, 0x2c68, 0x2c6c, 0x0251, + 0x0271, 0x0250, 0x0252, 0x2c73, 0x2c76, 0x023f, 0x0240, 0x2c81, 0x2ce3, 0x2cec, 0x2cee, 0x2cf3, 0xa641, + 0xa66d, 0xa681, 0xa69b, 0xa723, 0xa72f, 0xa733, 0xa76f, 0xa77a, 0xa77c, 0x1d79, 0xa77f, 0xa787, 0xa78c, + 0x0265, 0xa791, 0xa793, 0xa797, 0xa7a9, 0x0266, 0x025c, 
0x0261, 0x026c, 0x026a, 0x029e, 0x0287, 0x029d, + 0xab53, 0xa7b5, 0xa7c3, 0xa794, 0x0282, 0x1d8e, 0xa7c8, 0xa7ca, 0xa7d1, 0xa7d7, 0xa7d9, 0xa7f6, 0x13a0, + 0x13ef, 0xff41, 0xff5a, 0x10428, 0x1044f, 0x104d8, 0x104fb, 0x10597, 0x105a1, 0x105a3, 0x105b1, 0x105b3, 0x105b9, + 0x105bb, 0x105bc, 0x10cc0, 0x10cf2, 0x118c0, 0x118df, 0x16e60, 0x16e7f, 0x1e922, 0x1e943, +}; + +// --------------------------------------------------------------------------- +// Fold Map 2: two-codepoint mappings +// --------------------------------------------------------------------------- + +const fold_map_2 = [_]MapEntry{ + 0x00df, 0x0130, 0x0149, 0x01f0, 0x0587, 0x1e96, 0x1e97, 0x1e98, 0x1e99, + 0x1e9a, 0x1e9e, 0x1f50, range_start_flag | 0x1f80, range_end_flag | 0x1f87, range_start_flag | 0x1f88, range_end_flag | 0x1f8f, range_start_flag | 0x1f90, range_end_flag | 0x1f97, + range_start_flag | 0x1f98, range_end_flag | 0x1f9f, range_start_flag | 0x1fa0, range_end_flag | 0x1fa7, range_start_flag | 0x1fa8, range_end_flag | 0x1faf, 0x1fb2, 0x1fb3, 0x1fb4, + 0x1fb6, 0x1fbc, 0x1fc2, 0x1fc3, 0x1fc4, 0x1fc6, 0x1fcc, 0x1fd6, 0x1fe4, + 0x1fe6, 0x1ff2, 0x1ff3, 0x1ff4, 0x1ff6, 0x1ffc, 0xfb00, 0xfb01, 0xfb02, + 0xfb05, 0xfb06, 0xfb13, 0xfb14, 0xfb15, 0xfb16, 0xfb17, +}; + +// Each physical map slot corresponds to 2 data entries. 
+const fold_map_2_data = [_]u21{ + 0x0073, 0x0073, 0x0069, 0x0307, 0x02bc, 0x006e, 0x006a, 0x030c, 0x0565, 0x0582, 0x0068, 0x0331, 0x0074, 0x0308, + 0x0077, 0x030a, 0x0079, 0x030a, 0x0061, 0x02be, 0x0073, 0x0073, 0x03c5, 0x0313, 0x1f00, 0x03b9, 0x1f07, 0x03b9, + 0x1f00, 0x03b9, 0x1f07, 0x03b9, 0x1f20, 0x03b9, 0x1f27, 0x03b9, 0x1f20, 0x03b9, 0x1f27, 0x03b9, 0x1f60, 0x03b9, + 0x1f67, 0x03b9, 0x1f60, 0x03b9, 0x1f67, 0x03b9, 0x1f70, 0x03b9, 0x03b1, 0x03b9, 0x03ac, 0x03b9, 0x03b1, 0x0342, + 0x03b1, 0x03b9, 0x1f74, 0x03b9, 0x03b7, 0x03b9, 0x03ae, 0x03b9, 0x03b7, 0x0342, 0x03b7, 0x03b9, 0x03b9, 0x0342, + 0x03c1, 0x0313, 0x03c5, 0x0342, 0x1f7c, 0x03b9, 0x03c9, 0x03b9, 0x03ce, 0x03b9, 0x03c9, 0x0342, 0x03c9, 0x03b9, + 0x0066, 0x0066, 0x0066, 0x0069, 0x0066, 0x006c, 0x0073, 0x0074, 0x0073, 0x0074, 0x0574, 0x0576, 0x0574, 0x0565, + 0x0574, 0x056b, 0x057e, 0x0576, 0x0574, 0x056d, +}; + +// --------------------------------------------------------------------------- +// Fold Map 3: three-codepoint mappings +// --------------------------------------------------------------------------- + +const fold_map_3 = [_]MapEntry{ + 0x0390, 0x03b0, 0x1f52, 0x1f54, 0x1f56, 0x1fb7, 0x1fc7, 0x1fd2, 0x1fd3, + 0x1fd7, 0x1fe2, 0x1fe3, 0x1fe7, 0x1ff7, 0xfb03, 0xfb04, +}; + +// Each physical map slot corresponds to 3 data entries. +const fold_map_3_data = [_]u21{ + 0x03b9, 0x0308, 0x0301, 0x03c5, 0x0308, 0x0301, 0x03c5, 0x0313, 0x0300, 0x03c5, 0x0313, 0x0301, + 0x03c5, 0x0313, 0x0342, 0x03b1, 0x0342, 0x03b9, 0x03b7, 0x0342, 0x03b9, 0x03b9, 0x0308, 0x0300, + 0x03b9, 0x0308, 0x0301, 0x03b9, 0x0308, 0x0342, 0x03c5, 0x0308, 0x0300, 0x03c5, 0x0308, 0x0301, + 0x03c5, 0x0308, 0x0342, 0x03c9, 0x0342, 0x03b9, 0x0066, 0x0066, 0x0069, 0x0066, 0x0066, 0x006c, +}; + +/// Get the Unicode case-folded version of a codepoint. +/// Returns the original codepoint wrapped in a FoldInfo if no folding is needed. +/// +/// This is a direct port of md4c's `md_get_unicode_fold_info`. 
+pub fn caseFold(codepoint: u21) FoldInfo { + // Fast path for ASCII characters. + if (codepoint <= 0x7f) { + if (codepoint >= 'A' and codepoint <= 'Z') { + return .{ .codepoints = .{ codepoint + ('a' - 'A'), 0, 0 }, .n_codepoints = 1 }; + } + return .{ .codepoints = .{ codepoint, 0, 0 }, .n_codepoints = 1 }; + } + + // Try each fold map in order: 1-codepoint, 2-codepoint, 3-codepoint. + // The data arrays are indexed by: physical_index * n_codepoints. + return caseFoldFromMap(1, &fold_map_1, &fold_map_1_data, codepoint) orelse + caseFoldFromMap(2, &fold_map_2, &fold_map_2_data, codepoint) orelse + caseFoldFromMap(3, &fold_map_3, &fold_map_3_data, codepoint) orelse + // No mapping found -- map the codepoint to itself. + .{ .codepoints = .{ codepoint, 0, 0 }, .n_codepoints = 1 }; +} + +fn caseFoldFromMap( + comptime n: u2, + map: []const MapEntry, + data: []const u21, + codepoint: u21, +) ?FoldInfo { + const index = unicodeBsearch(codepoint, map) orelse return null; + const data_offset = index * @as(usize, n); + + var result: FoldInfo = .{ .codepoints = .{ 0, 0, 0 }, .n_codepoints = n }; + + // Copy the base codepoints from the data table. + inline for (0..n) |k| { + result.codepoints[k] = data[data_offset + k]; + } + + // If the codepoint doesn't exactly match the map entry, we are + // inside a range and need to adjust the first output codepoint. + const map_cp = rawCodepoint(map[index]); + if (map_cp != codepoint) { + if (map_cp + 1 == result.codepoints[0]) { + // Alternating type of the range. + // e.g. 0x0100->0x0101, 0x0101->0x0101, 0x0102->0x0103, ... + result.codepoints[0] = codepoint + (if ((codepoint & 0x1) == (map_cp & 0x1)) @as(u21, 1) else @as(u21, 0)); + } else { + // Range-to-range mapping: offset from range start. 
+ result.codepoints[0] += (codepoint - map_cp); + } + } + + return result; +} + +// --------------------------------------------------------------------------- +// Tests +// --------------------------------------------------------------------------- + +test "ASCII case folding" { + // Uppercase ASCII letters fold to lowercase. + const info_A = caseFold('A'); + try std.testing.expectEqual(@as(u21, 'a'), info_A.codepoints[0]); + try std.testing.expectEqual(@as(u2, 1), info_A.n_codepoints); + + const info_Z = caseFold('Z'); + try std.testing.expectEqual(@as(u21, 'z'), info_Z.codepoints[0]); + try std.testing.expectEqual(@as(u2, 1), info_Z.n_codepoints); + + // Lowercase ASCII letters map to themselves. + const info_a = caseFold('a'); + try std.testing.expectEqual(@as(u21, 'a'), info_a.codepoints[0]); + try std.testing.expectEqual(@as(u2, 1), info_a.n_codepoints); + + // Non-letter ASCII maps to itself. + const info_0 = caseFold('0'); + try std.testing.expectEqual(@as(u21, '0'), info_0.codepoints[0]); + try std.testing.expectEqual(@as(u2, 1), info_0.n_codepoints); +} + +test "Latin-1 supplement case folding" { + // U+00C0 LATIN CAPITAL LETTER A WITH GRAVE -> U+00E0 + const info = caseFold(0x00c0); + try std.testing.expectEqual(@as(u21, 0x00e0), info.codepoints[0]); + try std.testing.expectEqual(@as(u2, 1), info.n_codepoints); + + // U+00D6 LATIN CAPITAL LETTER O WITH DIAERESIS -> U+00F6 + const info2 = caseFold(0x00d6); + try std.testing.expectEqual(@as(u21, 0x00f6), info2.codepoints[0]); +} + +test "Greek case folding" { + // U+0391 GREEK CAPITAL LETTER ALPHA -> U+03B1 + const info = caseFold(0x0391); + try std.testing.expectEqual(@as(u21, 0x03b1), info.codepoints[0]); + try std.testing.expectEqual(@as(u2, 1), info.n_codepoints); +} + +test "Cyrillic case folding" { + // U+0410 CYRILLIC CAPITAL LETTER A -> U+0430 + const info = caseFold(0x0410); + try std.testing.expectEqual(@as(u21, 0x0430), info.codepoints[0]); + try std.testing.expectEqual(@as(u2, 1), 
info.n_codepoints); +} + +test "Two-codepoint fold: German eszett" { + // U+00DF LATIN SMALL LETTER SHARP S -> 0x0073, 0x0073 ("ss") + const info = caseFold(0x00df); + try std.testing.expectEqual(@as(u2, 2), info.n_codepoints); + try std.testing.expectEqual(@as(u21, 0x0073), info.codepoints[0]); + try std.testing.expectEqual(@as(u21, 0x0073), info.codepoints[1]); +} + +test "Two-codepoint fold: fi ligature" { + // U+FB01 LATIN SMALL LIGATURE FI -> 0x0066, 0x0069 ("fi") + const info = caseFold(0xfb01); + try std.testing.expectEqual(@as(u2, 2), info.n_codepoints); + try std.testing.expectEqual(@as(u21, 0x0066), info.codepoints[0]); + try std.testing.expectEqual(@as(u21, 0x0069), info.codepoints[1]); +} + +test "Three-codepoint fold: ffi ligature" { + // U+FB03 LATIN SMALL LIGATURE FFI -> 0x0066, 0x0066, 0x0069 ("ffi") + const info = caseFold(0xfb03); + try std.testing.expectEqual(@as(u2, 3), info.n_codepoints); + try std.testing.expectEqual(@as(u21, 0x0066), info.codepoints[0]); + try std.testing.expectEqual(@as(u21, 0x0066), info.codepoints[1]); + try std.testing.expectEqual(@as(u21, 0x0069), info.codepoints[2]); +} + +test "Alternating range: Latin Extended-A" { + // U+0100 LATIN CAPITAL LETTER A WITH MACRON -> U+0101 + const info1 = caseFold(0x0100); + try std.testing.expectEqual(@as(u21, 0x0101), info1.codepoints[0]); + + // U+0101 is already lowercase -> maps to itself + const info2 = caseFold(0x0101); + try std.testing.expectEqual(@as(u21, 0x0101), info2.codepoints[0]); + + // U+0102 LATIN CAPITAL LETTER A WITH BREVE -> U+0103 + const info3 = caseFold(0x0102); + try std.testing.expectEqual(@as(u21, 0x0103), info3.codepoints[0]); +} + +test "No mapping: already lowercase or not a letter" { + // Some arbitrary codepoint with no fold mapping + const info = caseFold(0x0600); + try std.testing.expectEqual(@as(u21, 0x0600), info.codepoints[0]); + try std.testing.expectEqual(@as(u2, 1), info.n_codepoints); +} + +test "Omega symbol" { + // U+2126 OHM SIGN -> U+03C9 
GREEK SMALL LETTER OMEGA + const info = caseFold(0x2126); + try std.testing.expectEqual(@as(u21, 0x03c9), info.codepoints[0]); + try std.testing.expectEqual(@as(u2, 1), info.n_codepoints); +} + +test "Fullwidth Latin uppercase" { + // U+FF21 FULLWIDTH LATIN CAPITAL LETTER A -> U+FF41 + const info = caseFold(0xff21); + try std.testing.expectEqual(@as(u21, 0xff41), info.codepoints[0]); + try std.testing.expectEqual(@as(u2, 1), info.n_codepoints); +} + +test "Deseret alphabet" { + // U+10400 DESERET CAPITAL LETTER LONG I -> U+10428 + const info = caseFold(0x10400); + try std.testing.expectEqual(@as(u21, 0x10428), info.codepoints[0]); + try std.testing.expectEqual(@as(u2, 1), info.n_codepoints); +} + +test "Micro sign" { + // U+00B5 MICRO SIGN -> U+03BC GREEK SMALL LETTER MU + const info = caseFold(0x00b5); + try std.testing.expectEqual(@as(u21, 0x03bc), info.codepoints[0]); + try std.testing.expectEqual(@as(u2, 1), info.n_codepoints); +} + +test "Kelvin sign" { + // U+212A KELVIN SIGN -> U+006B LATIN SMALL LETTER K + const info = caseFold(0x212a); + try std.testing.expectEqual(@as(u21, 0x006b), info.codepoints[0]); + try std.testing.expectEqual(@as(u2, 1), info.n_codepoints); +} + +const std = @import("std"); diff --git a/src/options.zig b/src/options.zig index d1e76d5acf..81d0c28d3b 100644 --- a/src/options.zig +++ b/src/options.zig @@ -635,6 +635,7 @@ pub const Loader = enum(u8) { html = 17, yaml = 18, json5 = 19, + md = 20, pub const Optional = enum(u8) { none = 254, @@ -705,7 +706,7 @@ pub const Loader = enum(u8) { .css => bun.http.MimeType.css, .toml, .yaml, .json, .jsonc, .json5 => bun.http.MimeType.json, .wasm => bun.http.MimeType.wasm, - .html => bun.http.MimeType.html, + .html, .md => bun.http.MimeType.html, else => { for (paths) |path| { var extname = std.fs.path.extension(path); @@ -758,6 +759,7 @@ pub const Loader = enum(u8) { map.set(.text, "input.txt"); map.set(.bunsh, "input.sh"); map.set(.html, "input.html"); + map.set(.md, "input.md"); break :brk map; 
}; @@ -777,7 +779,7 @@ pub const Loader = enum(u8) { if (zig_str.len == 0) return null; return fromString(zig_str.slice()) orelse { - return global.throwInvalidArguments("invalid loader - must be js, jsx, tsx, ts, css, file, toml, yaml, wasm, bunsh, or json", .{}); + return global.throwInvalidArguments("invalid loader - must be js, jsx, tsx, ts, css, file, toml, yaml, wasm, bunsh, json, or md", .{}); }; } @@ -808,6 +810,8 @@ pub const Loader = enum(u8) { .{ "sqlite", .sqlite }, .{ "sqlite_embedded", .sqlite_embedded }, .{ "html", .html }, + .{ "md", .md }, + .{ "markdown", .md }, }); pub const api_names = bun.ComptimeStringMap(api.Loader, .{ @@ -835,6 +839,8 @@ pub const Loader = enum(u8) { .{ "sh", .file }, .{ "sqlite", .sqlite }, .{ "html", .html }, + .{ "md", .md }, + .{ "markdown", .md }, }); pub fn fromString(slice_: string) ?Loader { @@ -873,6 +879,7 @@ pub const Loader = enum(u8) { .dataurl => .dataurl, .text => .text, .sqlite_embedded, .sqlite => .sqlite, + .md => .md, }; } @@ -899,6 +906,7 @@ pub const Loader = enum(u8) { .html => .html, .sqlite => .sqlite, .sqlite_embedded => .sqlite_embedded, + .md => .md, _ => .file, }; } @@ -938,7 +946,7 @@ pub const Loader = enum(u8) { pub fn sideEffects(this: Loader) bun.resolver.SideEffects { return switch (this) { - .text, .json, .jsonc, .toml, .yaml, .json5, .file => bun.resolver.SideEffects.no_side_effects__pure_data, + .text, .json, .jsonc, .toml, .yaml, .json5, .file, .md => bun.resolver.SideEffects.no_side_effects__pure_data, else => bun.resolver.SideEffects.has_side_effects, }; } diff --git a/src/transpiler.zig b/src/transpiler.zig index 3315f61245..37a063355e 100644 --- a/src/transpiler.zig +++ b/src/transpiler.zig @@ -626,7 +626,7 @@ pub const Transpiler = struct { }; switch (loader) { - .jsx, .tsx, .js, .ts, .json, .jsonc, .toml, .yaml, .json5, .text => { + .jsx, .tsx, .js, .ts, .json, .jsonc, .toml, .yaml, .json5, .text, .md => { var result = transpiler.parse( ParseOptions{ .allocator = 
transpiler.allocator, @@ -1357,6 +1357,39 @@ pub const Transpiler = struct { .input_fd = input_fd, }; }, + .md => { + const html = bun.md.renderToHtml(source.contents, allocator) catch { + transpiler.log.addErrorFmt( + null, + logger.Loc.Empty, + transpiler.allocator, + "Failed to render markdown to HTML", + .{}, + ) catch {}; + return null; + }; + const expr = js_ast.Expr.init(js_ast.E.String, js_ast.E.String{ + .data = html, + }, logger.Loc.Empty); + const stmt = js_ast.Stmt.alloc(js_ast.S.ExportDefault, js_ast.S.ExportDefault{ + .value = js_ast.StmtOrExpr{ .expr = expr }, + .default_name = js_ast.LocRef{ + .loc = logger.Loc{}, + .ref = Ref.None, + }, + }, logger.Loc{ .start = 0 }); + var stmts = allocator.alloc(js_ast.Stmt, 1) catch unreachable; + stmts[0] = stmt; + var parts = allocator.alloc(js_ast.Part, 1) catch unreachable; + parts[0] = js_ast.Part{ .stmts = stmts }; + + return ParseResult{ + .ast = js_ast.Ast.fromParts(parts), + .source = source.*, + .loader = loader, + .input_fd = input_fd, + }; + }, .wasm => { if (transpiler.options.target.isBun()) { if (!source.isWebAssembly()) { diff --git a/test/integration/bun-types/bun-types.test.ts b/test/integration/bun-types/bun-types.test.ts index 31d2b1b1f5..f066b70468 100644 --- a/test/integration/bun-types/bun-types.test.ts +++ b/test/integration/bun-types/bun-types.test.ts @@ -1,4 +1,4 @@ -import { fileURLToPath, $ as Shell } from "bun"; +import { $ as Shell, fileURLToPath } from "bun"; import { afterAll, beforeAll, describe, expect, test } from "bun:test"; import { makeTree } from "harness"; import { readFileSync } from "node:fs"; @@ -27,16 +27,16 @@ const DEFAULT_COMPILER_OPTIONS = ts.parseJsonConfigFileContent( const $ = Shell.cwd(BUN_REPO_ROOT); let TEMP_DIR: string; -let TEMP_FIXTURE_DIR: string; +let BASE_FIXTURE_DIR: string; beforeAll(async () => { TEMP_DIR = await mkdtemp(join(tmpdir(), "bun-types-test-")); - TEMP_FIXTURE_DIR = join(TEMP_DIR, "fixture"); + BASE_FIXTURE_DIR = join(TEMP_DIR, 
"base-fixture"); try { - await $`mkdir -p ${TEMP_FIXTURE_DIR}`.quiet(); + await $`mkdir -p ${BASE_FIXTURE_DIR}`.quiet(); - await cp(FIXTURE_SOURCE_DIR, TEMP_FIXTURE_DIR, { recursive: true }); + await cp(FIXTURE_SOURCE_DIR, BASE_FIXTURE_DIR, { recursive: true }); await $` cd ${BUN_TYPES_PACKAGE_ROOT} @@ -51,16 +51,16 @@ beforeAll(async () => { await $` cd ${BUN_TYPES_PACKAGE_ROOT} bun run build - bun pm pack --destination ${TEMP_FIXTURE_DIR} + bun pm pack --destination ${BASE_FIXTURE_DIR} rm CLAUDE.md mv package.json.backup package.json - cd ${TEMP_FIXTURE_DIR} + cd ${BASE_FIXTURE_DIR} bun add bun-types@${BUN_TYPES_TARBALL_NAME} rm ${BUN_TYPES_TARBALL_NAME} `.quiet(); - const atTypesBunDir = join(TEMP_FIXTURE_DIR, "node_modules", "@types", "bun"); + const atTypesBunDir = join(BASE_FIXTURE_DIR, "node_modules", "@types", "bun"); await mkdir(atTypesBunDir, { recursive: true }); await makeTree(atTypesBunDir, { @@ -84,6 +84,52 @@ beforeAll(async () => { } }); +type Diagnostic = { line: string | null; message: string; code: number }; + +interface TypeTestConfig { + /** Extra tsconfig compiler options */ + options?: Partial<ts.CompilerOptions>; + /** Specify extra files to include in the build */ + files?: Record<string, string>; + /** Extra packages to install before type checking */ + packages?: string[]; + /** Expected empty interfaces */ + emptyInterfaces: Set<string>; + /** Expected diagnostics - array for exact match, or function for custom assertions */ + diagnostics: Diagnostic[] | ((diagnostics: Diagnostic[]) => void); +} + +let fixtureCounter = 0; + +async function createIsolatedFixture(packages?: string[]): Promise<string> { + const fixtureDir = join(TEMP_DIR, `fixture-${fixtureCounter++}`); + await cp(BASE_FIXTURE_DIR, fixtureDir, { recursive: true }); + + if (packages?.length) { + await $`cd ${fixtureDir} && bun add ${packages}`.quiet(); + } + + return fixtureDir; +} + +function typeTest(name: string, config: TypeTestConfig) { + test(name, async () => { + const fixtureDir = await 
createIsolatedFixture(config.packages); + const { diagnostics, emptyInterfaces } = await diagnose(fixtureDir, { + options: config.options, + files: config.files, + }); + + expect(emptyInterfaces).toEqual(config.emptyInterfaces); + + if (typeof config.diagnostics === "function") { + config.diagnostics(diagnostics); + } else { + expect(diagnostics).toEqual(config.diagnostics); + } + }); +} + async function diagnose( fixtureDir: string, config: { @@ -194,8 +240,6 @@ async function diagnose( }; } -const expectedEmptyInterfacesWhenNoDOM = new Set(["ThisType"]); - function checkForEmptyInterfaces(program: ts.Program) { const empties = new Set(); @@ -252,11 +296,11 @@ function checkForEmptyInterfaces(program: ts.Program) { afterAll(async () => { if (TEMP_DIR) { if (Bun.env.TYPES_INTEGRATION_TEST_KEEP_TEMP_DIR === "true") { - console.log(`Keeping temp dir ${TEMP_DIR}/fixture for debugging`); + console.log(`Keeping temp dir ${TEMP_DIR} for debugging`); // Write tsconfig with skipLibCheck disabled for proper type checking const tsconfig = structuredClone(sourceTsconfig); tsconfig.compilerOptions.skipLibCheck = false; - await Bun.write(join(TEMP_DIR, "fixture", "tsconfig.json"), JSON.stringify(tsconfig, null, 2)); + await Bun.write(join(TEMP_DIR, "base-fixture", "tsconfig.json"), JSON.stringify(tsconfig, null, 2)); } else { await rm(TEMP_DIR, { recursive: true, force: true }); } @@ -265,11 +309,9 @@ afterAll(async () => { describe("@types/bun integration test", () => { describe("basic type checks", () => { - test("checks without lib.dom.d.ts", async () => { - const { diagnostics, emptyInterfaces } = await diagnose(TEMP_FIXTURE_DIR); - - expect(emptyInterfaces).toEqual(expectedEmptyInterfacesWhenNoDOM); - expect(diagnostics).toEqual([]); + typeTest("checks without lib.dom.d.ts", { + emptyInterfaces: expectedEmptyInterfacesWhenNoDOM, + diagnostics: [], }); }); @@ -287,250 +329,245 @@ describe("@types/bun integration test", () => { const vi_shouldBeDefined: object = vi; `; - 
test("checks without lib.dom.d.ts and test-globals references", async () => { - const { diagnostics, emptyInterfaces } = await diagnose(TEMP_FIXTURE_DIR, { - files: { - "reference-the-globals.ts": `/// <reference types="bun-types/test-globals" />`, - "my-test.test.ts": code, - }, - }); - - expect(emptyInterfaces).toEqual(expectedEmptyInterfacesWhenNoDOM); - expect(diagnostics).toEqual([]); + typeTest("checks without lib.dom.d.ts and test-globals references", { + files: { + "reference-the-globals.ts": `/// <reference types="bun-types/test-globals" />`, + "my-test.test.ts": code, + }, + emptyInterfaces: expectedEmptyInterfacesWhenNoDOM, + diagnostics: [], }); - test("test-globals FAILS when the test-globals.d.ts is not referenced", async () => { - const { diagnostics, emptyInterfaces } = await diagnose(TEMP_FIXTURE_DIR, { - files: { "my-test.test.ts": code }, // no reference to bun-types/test-globals - }); - - expect(emptyInterfaces).toEqual(expectedEmptyInterfacesWhenNoDOM); // should still have no empty interfaces - expect(diagnostics).toMatchInlineSnapshot(` - [ - { - "code": 2582, - "line": "my-test.test.ts:2:48", - "message": "Cannot find name 'test'. Do you need to install type definitions for a test runner? Try \`npm i --save-dev @types/jest\` or \`npm i --save-dev @types/mocha\`.", - }, - { - "code": 2582, - "line": "my-test.test.ts:3:46", - "message": "Cannot find name 'it'. Do you need to install type definitions for a test runner? Try \`npm i --save-dev @types/jest\` or \`npm i --save-dev @types/mocha\`.", - }, - { - "code": 2582, - "line": "my-test.test.ts:4:52", - "message": "Cannot find name 'describe'. Do you need to install type definitions for a test runner? 
Try \`npm i --save-dev @types/jest\` or \`npm i --save-dev @types/mocha\`.", - }, - { - "code": 2304, - "line": "my-test.test.ts:5:50", - "message": "Cannot find name 'expect'.", - }, - { - "code": 2304, - "line": "my-test.test.ts:6:53", - "message": "Cannot find name 'beforeAll'.", - }, - { - "code": 2304, - "line": "my-test.test.ts:7:54", - "message": "Cannot find name 'beforeEach'.", - }, - { - "code": 2304, - "line": "my-test.test.ts:8:53", - "message": "Cannot find name 'afterEach'.", - }, - { - "code": 2304, - "line": "my-test.test.ts:9:52", - "message": "Cannot find name 'afterAll'.", - }, - { - "code": 2304, - "line": "my-test.test.ts:10:44", - "message": "Cannot find name 'jest'.", - }, - { - "code": 2304, - "line": "my-test.test.ts:11:42", - "message": "Cannot find name 'vi'.", - }, - ] - `); + typeTest("test-globals FAILS when the test-globals.d.ts is not referenced", { + files: { "my-test.test.ts": code }, + emptyInterfaces: expectedEmptyInterfacesWhenNoDOM, + diagnostics: [ + { + "code": 2582, + "line": "my-test.test.ts:2:48", + "message": + "Cannot find name 'test'. Do you need to install type definitions for a test runner? Try \`npm i --save-dev @types/jest\` or \`npm i --save-dev @types/mocha\`.", + }, + { + "code": 2582, + "line": "my-test.test.ts:3:46", + "message": + "Cannot find name 'it'. Do you need to install type definitions for a test runner? Try \`npm i --save-dev @types/jest\` or \`npm i --save-dev @types/mocha\`.", + }, + { + "code": 2582, + "line": "my-test.test.ts:4:52", + "message": + "Cannot find name 'describe'. Do you need to install type definitions for a test runner? 
Try \`npm i --save-dev @types/jest\` or \`npm i --save-dev @types/mocha\`.", + }, + { + "code": 2304, + "line": "my-test.test.ts:5:50", + "message": "Cannot find name 'expect'.", + }, + { + "code": 2304, + "line": "my-test.test.ts:6:53", + "message": "Cannot find name 'beforeAll'.", + }, + { + "code": 2304, + "line": "my-test.test.ts:7:54", + "message": "Cannot find name 'beforeEach'.", + }, + { + "code": 2304, + "line": "my-test.test.ts:8:53", + "message": "Cannot find name 'afterEach'.", + }, + { + "code": 2304, + "line": "my-test.test.ts:9:52", + "message": "Cannot find name 'afterAll'.", + }, + { + "code": 2304, + "line": "my-test.test.ts:10:44", + "message": "Cannot find name 'jest'.", + }, + { + "code": 2304, + "line": "my-test.test.ts:11:42", + "message": "Cannot find name 'vi'.", + }, + ], }); }); describe("bun:bundle feature()", () => { - test("Registry augmentation restricts feature() to known flags", async () => { - const testCode = ` - // Augment the Registry to define known flags - declare module "bun:bundle" { - interface Registry { - features: "DEBUG" | "PREMIUM" | "BETA"; + typeTest("Registry augmentation restricts feature() to known flags", { + files: { + "registry-test.ts": ` + // Augment the Registry to define known flags + declare module "bun:bundle" { + interface Registry { + features: "DEBUG" | "PREMIUM" | "BETA"; + } } - } - import { feature } from "bun:bundle"; + import { feature } from "bun:bundle"; - // Valid flags work - const a: boolean = feature("DEBUG"); - const b: boolean = feature("PREMIUM"); - const c: boolean = feature("BETA"); + // Valid flags work + const a: boolean = feature("DEBUG"); + const b: boolean = feature("PREMIUM"); + const c: boolean = feature("BETA"); - // Invalid flags are caught at compile time - // @ts-expect-error - "INVALID_FLAG" is not assignable to "DEBUG" | "PREMIUM" | "BETA" - const invalid: boolean = feature("INVALID_FLAG"); + // Invalid flags are caught at compile time + // @ts-expect-error - "INVALID_FLAG" 
is not assignable to "DEBUG" | "PREMIUM" | "BETA" + const invalid: boolean = feature("INVALID_FLAG"); - // @ts-expect-error - typos are caught - const typo: boolean = feature("DEUBG"); - `; - - const { diagnostics, emptyInterfaces } = await diagnose(TEMP_FIXTURE_DIR, { - files: { - "registry-test.ts": testCode, - }, - }); - - expect(emptyInterfaces).toEqual(expectedEmptyInterfacesWhenNoDOM); - // Filter to only our test file - no diagnostics because @ts-expect-error suppresses errors - const relevantDiagnostics = diagnostics.filter(d => d.line?.startsWith("registry-test.ts")); - expect(relevantDiagnostics).toEqual([]); + // @ts-expect-error - typos are caught + const typo: boolean = feature("DEUBG"); + `, + }, + emptyInterfaces: expectedEmptyInterfacesWhenNoDOM, + diagnostics: diagnostics => { + const relevantDiagnostics = diagnostics.filter(d => d.line?.startsWith("registry-test.ts")); + expect(relevantDiagnostics).toEqual([]); + }, }); - test("Registry augmentation produces type errors for invalid flags", async () => { - // Verify that without @ts-expect-error, invalid flags actually produce errors - const invalidTestCode = ` - declare module "bun:bundle" { - interface Registry { - features: "ALLOWED_FLAG"; + typeTest("Registry augmentation produces type errors for invalid flags", { + files: { + "registry-invalid-test.ts": ` + declare module "bun:bundle" { + interface Registry { + features: "ALLOWED_FLAG"; + } } - } - import { feature } from "bun:bundle"; + import { feature } from "bun:bundle"; - // This should cause a type error - INVALID_FLAG is not in Registry.features - const invalid: boolean = feature("INVALID_FLAG"); - `; - - const { diagnostics, emptyInterfaces } = await diagnose(TEMP_FIXTURE_DIR, { - files: { - "registry-invalid-test.ts": invalidTestCode, - }, - }); - - expect(emptyInterfaces).toEqual(expectedEmptyInterfacesWhenNoDOM); - const relevantDiagnostics = diagnostics.filter(d => d.line?.startsWith("registry-invalid-test.ts")); - 
expect(relevantDiagnostics).toMatchInlineSnapshot(` - [ + // This should cause a type error - INVALID_FLAG is not in Registry.features + const invalid: boolean = feature("INVALID_FLAG"); + `, + }, + emptyInterfaces: expectedEmptyInterfacesWhenNoDOM, + diagnostics: diagnostics => { + const relevantDiagnostics = diagnostics.filter(d => d.line?.startsWith("registry-invalid-test.ts")); + expect(relevantDiagnostics).toEqual([ { "code": 2345, - "line": "registry-invalid-test.ts:11:42", + "line": "registry-invalid-test.ts:11:44", "message": "Argument of type '\"INVALID_FLAG\"' is not assignable to parameter of type '\"ALLOWED_FLAG\"'.", }, - ] - `); + ]); + }, }); - test("without Registry augmentation, feature() accepts any string", async () => { - // When Registry is not augmented, feature() falls back to accepting any string - const testCode = ` - import { feature } from "bun:bundle"; + typeTest("without Registry augmentation, feature() accepts any string", { + files: { + "no-registry-test.ts": ` + import { feature } from "bun:bundle"; - // Any string works when Registry.features is not defined - const a: boolean = feature("ANY_FLAG"); - const b: boolean = feature("ANOTHER_FLAG"); - const c: boolean = feature("whatever"); - `; + // Any string works when Registry.features is not defined + const a: boolean = feature("ANY_FLAG"); + const b: boolean = feature("ANOTHER_FLAG"); + const c: boolean = feature("whatever"); + `, + }, + emptyInterfaces: expectedEmptyInterfacesWhenNoDOM, + diagnostics: diagnostics => { + const relevantDiagnostics = diagnostics.filter(d => d.line?.startsWith("no-registry-test.ts")); + expect(relevantDiagnostics).toEqual([]); + }, + }); + }); - const { diagnostics, emptyInterfaces } = await diagnose(TEMP_FIXTURE_DIR, { - files: { - "no-registry-test.ts": testCode, - }, - }); + describe("Bunland reaching for JSX", () => { + typeTest("Bun.markdown.react() returns type compatible with React.ReactElement", { + packages: ["@types/react", 
"@types/react-dom"], + files: { + "jsx-test.tsx": ` + import {expectType, expectAssignable} from './utilities.ts'; + import type React from "react"; - expect(emptyInterfaces).toEqual(expectedEmptyInterfacesWhenNoDOM); - const relevantDiagnostics = diagnostics.filter(d => d.line?.startsWith("no-registry-test.ts")); - expect(relevantDiagnostics).toEqual([]); + const markdownResult = Bun.markdown.react("# Hello"); + expectType(markdownResult).is>>(); + expectAssignable(markdownResult); + + function App() { + return
{markdownResult}
; + } + `, + }, + emptyInterfaces: expectedEmptyInterfacesThatReactDeclareWhenNoDOM, + diagnostics: [], + }); + + typeTest("Bun.markdown.react() returns unknown if React is not installed", { + files: { + "jsx-test.tsx": ` + import {expectType} from './utilities.ts'; + expectType(Bun.markdown.react("# Hello")).is(); + `, + }, + emptyInterfaces: expectedEmptyInterfacesWhenNoDOM, + diagnostics: [], }); }); describe("lib configuration", () => { - test("checks with no lib at all", async () => { - const { diagnostics, emptyInterfaces } = await diagnose(TEMP_FIXTURE_DIR, { - options: { - lib: [], - }, - }); - - expect(emptyInterfaces).toEqual(expectedEmptyInterfacesWhenNoDOM); - expect(diagnostics).toEqual([]); + typeTest("checks with no lib at all", { + options: { + lib: [], + }, + emptyInterfaces: expectedEmptyInterfacesWhenNoDOM, + diagnostics: [], }); - test("fails with types: [] and no jsx", async () => { - const { diagnostics, emptyInterfaces } = await diagnose(TEMP_FIXTURE_DIR, { - options: { - lib: [], - types: [], - jsx: ts.JsxEmit.None, - }, - }); - - expect(emptyInterfaces).toEqual(expectedEmptyInterfacesWhenNoDOM); - expect(diagnostics).toEqual([ - // // This is expected because we, of course, can't check that our tsx file is passing - // // when tsx is turned off... 
- // { - // "code": 17004, - // "line": "[slug].tsx:17:10", - // "message": "Cannot use JSX unless the '--jsx' flag is provided.", - // }, - ]); + typeTest("fails with types: [] and no jsx", { + options: { + lib: [], + types: [], + jsx: ts.JsxEmit.None, + }, + emptyInterfaces: expectedEmptyInterfacesWhenNoDOM, + diagnostics: [], }); - test("checks with lib.dom.d.ts", async () => { - const { diagnostics, emptyInterfaces } = await diagnose(TEMP_FIXTURE_DIR, { - options: { - lib: ["ESNext", "DOM", "DOM.Iterable", "DOM.AsyncIterable"].map(name => `lib.${name.toLowerCase()}.d.ts`), - }, - }); - - expect(emptyInterfaces).toEqual( - new Set([ - "ThisType", - "RTCAnswerOptions", - "RTCOfferAnswerOptions", - "RTCSetParameterOptions", - "EXT_color_buffer_float", - "EXT_float_blend", - "EXT_frag_depth", - "EXT_shader_texture_lod", - "FragmentDirective", - "MediaSourceHandle", - "OES_element_index_uint", - "OES_fbo_render_mipmap", - "OES_texture_float", - "OES_texture_float_linear", - "OES_texture_half_float_linear", - "PeriodicWave", - "RTCRtpScriptTransform", - "WebGLBuffer", - "WebGLFramebuffer", - "WebGLProgram", - "WebGLQuery", - "WebGLRenderbuffer", - "WebGLSampler", - "WebGLShader", - "WebGLSync", - "WebGLTexture", - "WebGLTransformFeedback", - "WebGLUniformLocation", - "WebGLVertexArrayObject", - "WebGLVertexArrayObjectOES", - ]), - ); - expect(diagnostics).toEqual([ + typeTest("checks with lib.dom.d.ts", { + options: { + lib: ["ESNext", "DOM", "DOM.Iterable", "DOM.AsyncIterable"].map(name => `lib.${name.toLowerCase()}.d.ts`), + }, + emptyInterfaces: new Set([ + "ThisType", + "RTCAnswerOptions", + "RTCOfferAnswerOptions", + "RTCSetParameterOptions", + "EXT_color_buffer_float", + "EXT_float_blend", + "EXT_frag_depth", + "EXT_shader_texture_lod", + "FragmentDirective", + "MediaSourceHandle", + "OES_element_index_uint", + "OES_fbo_render_mipmap", + "OES_texture_float", + "OES_texture_float_linear", + "OES_texture_half_float_linear", + "PeriodicWave", + 
"RTCRtpScriptTransform", + "WebGLBuffer", + "WebGLFramebuffer", + "WebGLProgram", + "WebGLQuery", + "WebGLRenderbuffer", + "WebGLSampler", + "WebGLShader", + "WebGLSync", + "WebGLTexture", + "WebGLTransformFeedback", + "WebGLUniformLocation", + "WebGLVertexArrayObject", + "WebGLVertexArrayObjectOES", + ]), + diagnostics: [ { code: 2322, line: "24154.ts:11:3", @@ -595,35 +632,35 @@ describe("@types/bun integration test", () => { message: "Property 'text' does not exist on type 'ReadableStream>'.", }, { - "code": 2769, - "line": "streams.ts:18:3", - "message": + code: 2769, + line: "streams.ts:18:3", + message: "No overload matches this call.\nOverload 1 of 3, '(underlyingSource: UnderlyingByteSource, strategy?: { highWaterMark?: number | undefined; } | undefined): ReadableStream>', gave the following error.\nType '\"direct\"' is not assignable to type '\"bytes\"'.", }, { - "code": 2339, - "line": "streams.ts:20:16", - "message": "Property 'write' does not exist on type 'ReadableByteStreamController'.", + code: 2339, + line: "streams.ts:20:16", + message: "Property 'write' does not exist on type 'ReadableByteStreamController'.", }, { - "code": 2339, - "line": "streams.ts:46:19", - "message": "Property 'json' does not exist on type 'ReadableStream>'.", + code: 2339, + line: "streams.ts:46:19", + message: "Property 'json' does not exist on type 'ReadableStream>'.", }, { - "code": 2339, - "line": "streams.ts:47:19", - "message": "Property 'bytes' does not exist on type 'ReadableStream>'.", + code: 2339, + line: "streams.ts:47:19", + message: "Property 'bytes' does not exist on type 'ReadableStream>'.", }, { - "code": 2339, - "line": "streams.ts:48:19", - "message": "Property 'text' does not exist on type 'ReadableStream>'.", + code: 2339, + line: "streams.ts:48:19", + message: "Property 'text' does not exist on type 'ReadableStream>'.", }, { - "code": 2339, - "line": "streams.ts:49:19", - "message": "Property 'blob' does not exist on type 'ReadableStream>'.", + code: 
2339, + line: "streams.ts:49:19", + message: "Property 'blob' does not exist on type 'ReadableStream>'.", }, { code: 2345, @@ -749,7 +786,144 @@ describe("@types/bun integration test", () => { line: "worker.ts:25:11", message: "Property 'threadId' does not exist on type 'Worker'.", }, - ]); + ], }); }); }); + +const expectedEmptyInterfacesWhenNoDOM = new Set(["ThisType"]); + +const expectedEmptyInterfacesThatReactDeclareWhenNoDOM = new Set([ + ...expectedEmptyInterfacesWhenNoDOM, + "Document", + "DataTransfer", + "StyleMedia", + "Element", + "DocumentFragment", + "HTMLElement", + "HTMLAnchorElement", + "HTMLAreaElement", + "HTMLAudioElement", + "HTMLBaseElement", + "HTMLBodyElement", + "HTMLBRElement", + "HTMLButtonElement", + "HTMLCanvasElement", + "HTMLDataElement", + "HTMLDataListElement", + "HTMLDetailsElement", + "HTMLDialogElement", + "HTMLDivElement", + "HTMLDListElement", + "HTMLEmbedElement", + "HTMLFieldSetElement", + "HTMLFormElement", + "HTMLHeadingElement", + "HTMLHeadElement", + "HTMLHRElement", + "HTMLHtmlElement", + "HTMLIFrameElement", + "HTMLImageElement", + "HTMLInputElement", + "HTMLModElement", + "HTMLLabelElement", + "HTMLLegendElement", + "HTMLLIElement", + "HTMLLinkElement", + "HTMLMapElement", + "HTMLMetaElement", + "HTMLMeterElement", + "HTMLObjectElement", + "HTMLOListElement", + "HTMLOptGroupElement", + "HTMLOptionElement", + "HTMLOutputElement", + "HTMLParagraphElement", + "HTMLParamElement", + "HTMLPreElement", + "HTMLProgressElement", + "HTMLQuoteElement", + "HTMLSlotElement", + "HTMLScriptElement", + "HTMLSelectElement", + "HTMLSourceElement", + "HTMLSpanElement", + "HTMLStyleElement", + "HTMLTableElement", + "HTMLTableColElement", + "HTMLTableDataCellElement", + "HTMLTableHeaderCellElement", + "HTMLTableRowElement", + "HTMLTableSectionElement", + "HTMLTemplateElement", + "HTMLTextAreaElement", + "HTMLTimeElement", + "HTMLTitleElement", + "HTMLTrackElement", + "HTMLUListElement", + "HTMLVideoElement", + "HTMLWebViewElement", + 
"SVGElement", + "SVGSVGElement", + "SVGCircleElement", + "SVGClipPathElement", + "SVGDefsElement", + "SVGDescElement", + "SVGEllipseElement", + "SVGFEBlendElement", + "SVGFEColorMatrixElement", + "SVGFEComponentTransferElement", + "SVGFECompositeElement", + "SVGFEConvolveMatrixElement", + "SVGFEDiffuseLightingElement", + "SVGFEDisplacementMapElement", + "SVGFEDistantLightElement", + "SVGFEDropShadowElement", + "SVGFEFloodElement", + "SVGFEFuncAElement", + "SVGFEFuncBElement", + "SVGFEFuncGElement", + "SVGFEFuncRElement", + "SVGFEGaussianBlurElement", + "SVGFEImageElement", + "SVGFEMergeElement", + "SVGFEMergeNodeElement", + "SVGFEMorphologyElement", + "SVGFEOffsetElement", + "SVGFEPointLightElement", + "SVGFESpecularLightingElement", + "SVGFESpotLightElement", + "SVGFETileElement", + "SVGFETurbulenceElement", + "SVGFilterElement", + "SVGForeignObjectElement", + "SVGGElement", + "SVGImageElement", + "SVGLineElement", + "SVGLinearGradientElement", + "SVGMarkerElement", + "SVGMaskElement", + "SVGMetadataElement", + "SVGPathElement", + "SVGPatternElement", + "SVGPolygonElement", + "SVGPolylineElement", + "SVGRadialGradientElement", + "SVGRectElement", + "SVGSetElement", + "SVGStopElement", + "SVGSwitchElement", + "SVGSymbolElement", + "SVGTextElement", + "SVGTextPathElement", + "SVGTSpanElement", + "SVGUseElement", + "SVGViewElement", + "Text", + "TouchList", + "WebGLRenderingContext", + "WebGL2RenderingContext", + "TrustedHTML", + "MediaStream", + "MediaSource", +]); diff --git a/test/internal/ban-limits.json b/test/internal/ban-limits.json index 9f9902af9b..fea8a53a01 100644 --- a/test/internal/ban-limits.json +++ b/test/internal/ban-limits.json @@ -7,7 +7,7 @@ ".arguments_old(": 263, ".jsBoolean(false)": 0, ".jsBoolean(true)": 0, - ".stdDir()": 41, + ".stdDir()": 42, ".stdFile()": 16, "// autofix": 148, ": [^=]+= undefined,$": 256, @@ -44,4 +44,4 @@ "undefined != ": 0, "undefined == ": 0, "usingnamespace": 0 -} \ No newline at end of file +} diff --git 
a/test/js/bun/md/coverage.txt b/test/js/bun/md/coverage.txt new file mode 100644 index 0000000000..146210c005 --- /dev/null +++ b/test/js/bun/md/coverage.txt @@ -0,0 +1,297 @@ + +# Coverage + +This file is just a collection of tests designed to activate code in MD4C +which may otherwise be hard to hit. It's to improve our test coverage. + + +## `md_is_unicode_whitespace__()` + +Unicode whitespace (here U+2000) forms a word boundary so these cannot be +resolved as emphasis span because there is no closer mark. + +```````````````````````````````` example +*foo *bar +. +

*foo *bar

+```````````````````````````````` + + +## `md_is_unicode_punct__()` + +Ditto for Unicode punctuation (here U+00A1). + +```````````````````````````````` example +*foo¡*bar +. +

*foo¡*bar

+```````````````````````````````` + + +## `md_get_unicode_fold_info()` + +```````````````````````````````` example +[Příliš žluťoučký kůň úpěl ďábelské ódy.] + +[PŘÍLIŠ ŽLUŤOUČKÝ KŮŇ ÚPĚL ĎÁBELSKÉ ÓDY.]: /url +. +

Příliš žluťoučký kůň úpěl ďábelské ódy.

+```````````````````````````````` + + +## `md_decode_utf8__()` and `md_decode_utf8_before__()` + +### Alphanumerical Character (i.e. not whitespace, not punctuation) + +Non-whitespace & non-punctuation characters below suppress `_` from being +recognized as an emphasis because `_` should be seen as in-word character: + +Example of 1-byte UTF-8 sequence (U+0058): +```````````````````````````````` example +X__foo__X +. +

X__foo__X

+```````````````````````````````` + +Example of 2-byte UTF-8 sequence (U+0158): +```````````````````````````````` example +Ř__foo__Ř +. +

Ř__foo__Ř

+```````````````````````````````` + +Example of 3-byte UTF-8 sequence (U+0BA3): +```````````````````````````````` example +ண__foo__ண +. +

ண__foo__ண

+```````````````````````````````` + +Example of 4-byte UTF-8 sequence (U+13142): +```````````````````````````````` example +𓅂__foo__𓅂 +. +

𓅂__foo__𓅂

+```````````````````````````````` + +### Whitespace character + +Whitespace on the other hand should not suppress `_`: + +Example of 1-byte UTF-8 sequence (U+0009): +```````````````````````````````` example +x→__foo__→ +. +

x foo

+```````````````````````````````` +(The initial `x` to suppress indented code block.) + +Example of 2-byte UTF-8 sequence (U+00A0): +```````````````````````````````` example + __foo__ +. +

foo

+```````````````````````````````` + +Example of 3-byte UTF-8 sequence (U+2000): +```````````````````````````````` example + __foo__ +. +

 foo 

+```````````````````````````````` + +(AFAIK, there is no 4-byte UTF-8 whitespace.) + +### Punctuation character + +Punctuation also should not suppress `_`: + +Example of 1-byte UTF-8 sequence (U+002E): +```````````````````````````````` example +.__foo__. +. +

.foo.

+```````````````````````````````` + +Example of 2-byte UTF-8 sequence (U+00B7): +```````````````````````````````` example +·__foo__· +. +

·foo·

+```````````````````````````````` + +Example of 3-byte UTF-8 sequence (U+0C84): +```````````````````````````````` example +಄__foo__಄ +. +

foo

+```````````````````````````````` + +Example of 4-byte UTF-8 sequence (U+1039F): +```````````````````````````````` example +𐎟__foo__𐎟 +. +

𐎟foo𐎟

+```````````````````````````````` + + +## `md_is_link_destination_A()` + +```````````````````````````````` example +[link]() +. +

link

+```````````````````````````````` + + +## `md_link_label_eq()` + +```````````````````````````````` example +[foo bar] + +[foo bar]: /url +. +

foo bar

+```````````````````````````````` + + +## `md_is_inline_link_spec()` + +```````````````````````````````` example +> [link](/url 'foo +> bar') +. +
+

link

+
+```````````````````````````````` + + +## `md_build_ref_def_hashtable()` + +All link labels in the following example have the same FNV1a hash (after +normalization of the label, which means after converting to a vector of Unicode +codepoints and lowercase folding). + +So the example triggers quite complex code paths which are not otherwise easily +tested. + +```````````````````````````````` example +[foo]: /foo +[qnptgbh]: /qnptgbh +[abgbrwcv]: /abgbrwcv +[abgbrwcv]: /abgbrwcv2 +[abgbrwcv]: /abgbrwcv3 +[abgbrwcv]: /abgbrwcv4 +[alqadfgn]: /alqadfgn + +[foo] +[qnptgbh] +[abgbrwcv] +[alqadfgn] +[axgydtdu] +. +

foo +qnptgbh +abgbrwcv +alqadfgn +[axgydtdu]

+```````````````````````````````` + +For the sake of completeness, the following C program was used to find the hash +collisions by brute force: + +~~~ + +#include <stdio.h> +#include <string.h> + + +static unsigned etalon; + + + +#define MD_FNV1A_BASE 2166136261 +#define MD_FNV1A_PRIME 16777619 + +static inline unsigned +fnv1a(unsigned base, const void* data, size_t n) +{ + const unsigned char* buf = (const unsigned char*) data; + unsigned hash = base; + size_t i; + + for(i = 0; i < n; i++) { + hash ^= buf[i]; + hash *= MD_FNV1A_PRIME; + } + + return hash; +} + + +static unsigned +unicode_hash(const char* data, size_t n) +{ + unsigned value; + unsigned hash = MD_FNV1A_BASE; + int i; + + for(i = 0; i < n; i++) { + value = data[i]; + hash = fnv1a(hash, &value, sizeof(unsigned)); + } + + return hash; +} + + +static void +recurse(char* buffer, size_t off, size_t len) +{ + int ch; + + if(off < len - 1) { + for(ch = 'a'; ch <= 'z'; ch++) { + buffer[off] = ch; + recurse(buffer, off+1, len); + } + } else { + for(ch = 'a'; ch <= 'z'; ch++) { + buffer[off] = ch; + if(unicode_hash(buffer, len) == etalon) { + printf("Dup: %.*s\n", (int)len, buffer); + } + } + } +} + +int +main(int argc, char** argv) +{ + char buffer[32]; + int len; + + if(argc < 2) + etalon = unicode_hash("foo", 3); + else + etalon = unicode_hash(argv[1], strlen(argv[1])); + + for(len = 1; len <= sizeof(buffer); len++) + recurse(buffer, 0, len); + + return 0; +} +~~~ + + +## Flag `MD_FLAG_COLLAPSEWHITESPACE` + +```````````````````````````````` example +foo bar → baz +. +

foo bar baz

+. +--fcollapse-whitespace +```````````````````````````````` diff --git a/test/js/bun/md/gfm-compat.test.ts b/test/js/bun/md/gfm-compat.test.ts new file mode 100644 index 0000000000..c4dcce5dcd --- /dev/null +++ b/test/js/bun/md/gfm-compat.test.ts @@ -0,0 +1,753 @@ +import { describe, expect, test } from "bun:test"; + +/** + * GFM Compatibility Tests + * + * These tests verify areas where md4c (and Bun's markdown parser derived from it) + * differs from cmark-gfm (the reference GFM implementation). Expected outputs were + * generated using cmark-gfm 0.29.0.gfm.13 with the appropriate extensions. + * + * Each section corresponds to a known incompatibility between md4c and GFM. + */ + +const markdown = Bun.markdown; + +function render(input: string, options?: Record): string { + return markdown.html(input + "\n", options ?? {}); +} + +function renderGFM(md: string): string { + return render(md, { + tables: true, + strikethrough: true, + tasklists: true, + autolinks: true, + tagFilter: true, + }); +} + +// Normalize HTML for comparison: collapse whitespace, normalize tags +function normalize(html: string): string { + return html.replace(/\s+/g, " ").replace(/>\s+</g, "><").trim(); +} + +// ============================================================================ +// 1. Tables Cannot Interrupt Paragraphs +// +// In GFM (cmark-gfm), a table can appear immediately after paragraph text +// without a blank line separator. md4c requires a blank line before a table. +// +// References: md4c issues #262, #282 +// ============================================================================ +describe("tables interrupting paragraphs", () => { + test("table immediately after paragraph text", () => { + const md = `Some paragraph text. +| Col 1 | Col 2 | +|-------|-------| +| a | b |`; + + // cmark-gfm: paragraph + table + const expected = normalize(`

Some paragraph text.

+ + + + + + + + + + + + + +
Col 1Col 2
ab
`); + + expect(normalize(renderGFM(md))).toBe(expected); + }); + + test("table after blank line works in both", () => { + const md = `Some paragraph text. + +| Col 1 | Col 2 | +|-------|-------| +| a | b |`; + + const expected = normalize(`

Some paragraph text.

+ + + + + + + + + + + + + +
Col 1Col 2
ab
`); + + expect(normalize(renderGFM(md))).toBe(expected); + }); + + test("table interrupts paragraph with multiple rows", () => { + const md = `Hello world +| a | b | c | +|---|---|---| +| 1 | 2 | 3 | +| 4 | 5 | 6 |`; + + const expected = normalize(`

Hello world

+ + + + + + + + + + + + + + + + + + + + +
abc
123
456
`); + + expect(normalize(renderGFM(md))).toBe(expected); + }); +}); + +// ============================================================================ +// 2. Table Header/Delimiter Column Count Mismatch +// +// GFM rejects a table entirely if the header row and delimiter row have +// different column counts — it renders as a plain paragraph. md4c is more +// permissive and accepts the table using only the columns from the delimiter. +// +// Reference: md4c issue #137 +// ============================================================================ +describe("table column count mismatch", () => { + test("more header columns than delimiter columns", () => { + const md = `| abc | def | +| --- | +| bar |`; + + // cmark-gfm: rejects as table, renders as paragraph + const expected = normalize(`

| abc | def | +| --- | +| bar |

`); + + expect(normalize(renderGFM(md))).toBe(expected); + }); + + test("fewer header columns than delimiter columns", () => { + const md = `| abc | +| --- | --- | +| bar | baz |`; + + const expected = normalize(`

| abc | +| --- | --- | +| bar | baz |

`); + + expect(normalize(renderGFM(md))).toBe(expected); + }); + + test("three header columns, two delimiter columns", () => { + const md = `| a | b | c | +| --- | --- | +| 1 | 2 | 3 |`; + + const expected = normalize(`

| a | b | c | +| --- | --- | +| 1 | 2 | 3 |

`); + + expect(normalize(renderGFM(md))).toBe(expected); + }); + + test("one header column, three delimiter columns", () => { + const md = `| a | +| --- | --- | --- | +| 1 |`; + + const expected = normalize(`

| a | +| --- | --- | --- | +| 1 |

`); + + expect(normalize(renderGFM(md))).toBe(expected); + }); +}); + +// ============================================================================ +// 3. Pipes Inside Code Spans in Tables +// +// In cmark-gfm, pipe characters inside backtick code spans within table rows +// are treated as cell delimiters (code spans do NOT take precedence over +// table cell boundaries). md4c treats code spans as higher precedence. +// +// Reference: md4c issues #136, #262 +// ============================================================================ +describe("pipes in code spans in tables", () => { + test("pipe in code span splits cell in GFM", () => { + const md = `| Column 1 | Column 2 | +|---------|---------| +| \`foo | bar\` | baz |`; + + // cmark-gfm: the pipe inside backticks acts as a cell delimiter + // so `foo becomes cell 1, bar` becomes cell 2 + const expected = normalize(` + + + + + + + + + + + + +
Column 1Column 2
\`foobar\`
`); + + expect(normalize(renderGFM(md))).toBe(expected); + }); + + test("escaped pipe in code span preserves code span", () => { + const md = `| Column 1 | Column 2 | +|---------|---------| +| \`foo \\| bar\` | baz |`; + + // cmark-gfm: the escaped pipe is not a delimiter, code span is preserved + const expected = normalize(` + + + + + + + + + + + + +
Column 1Column 2
foo | barbaz
`); + + expect(normalize(renderGFM(md))).toBe(expected); + }); + + test("multiple pipes in code span", () => { + const md = `| a | b | +|---|---| +| \`x | y | z\` | w |`; + + // cmark-gfm: pipes in backticks are cell delimiters + const expected = normalize(` + + + + + + + + + + + + +
ab
\`xy
`); + + expect(normalize(renderGFM(md))).toBe(expected); + }); + + test("code span with pipe as second cell value", () => { + const md = `| a | b | +|---|---| +| \`code\` | \`a|b\` |`; + + // cmark-gfm: pipe in second cell's backticks splits the cell + const expected = normalize(` + + + + + + + + + + + + +
ab
code\`a
`); + + expect(normalize(renderGFM(md))).toBe(expected); + }); +}); + +// ============================================================================ +// 4. Empty Table Body (No Data Rows) +// +// When a table has only a header row and no body rows, GFM omits the +// tags entirely. md4c includes empty tags. +// +// Reference: md4c issue #138 +// ============================================================================ +describe("empty table body", () => { + test("table with header only omits tbody", () => { + const md = `| abc | def | +| --- | --- |`; + + // cmark-gfm: no at all + const expected = normalize(` + + + + + + +
abcdef
`); + + expect(normalize(renderGFM(md))).toBe(expected); + }); + + test("table with header followed by blank line omits tbody", () => { + const md = `| abc | def | +| --- | --- | + +Next paragraph.`; + + const expected = normalize(` + + + + + + +
abcdef
+

Next paragraph.

`); + + expect(normalize(renderGFM(md))).toBe(expected); + }); +}); + +// ============================================================================ +// 5. Disallowed Raw HTML (Tagfilter) +// +// GFM spec section 6.11: nine specific HTML tags have their leading `<` +// replaced with `&lt;` to prevent rendering. md4c has no equivalent; it +// either allows all HTML or disables all HTML. +// +// Filtered tags: script, style, iframe, textarea, title, plaintext, xmp, +// noframes, noembed +// ============================================================================ +describe("disallowed raw HTML (tagfilter)", () => { + // These tests verify the GFM tagfilter behavior (enabled via tagFilter option). + + test("script tag is filtered", () => { + const md = `<script>alert("xss")</script>`; + const expected = `&lt;script>alert("xss")&lt;/script>`; + expect(renderGFM(md).trim()).toBe(expected); + }); + + test("style tag is filtered", () => { + const md = `<style>body{color:red}</style>`; + const expected = `&lt;style>body{color:red}&lt;/style>`; + expect(renderGFM(md).trim()).toBe(expected); + }); + + test("iframe tag is filtered", () => { + const md = `<iframe src="https://example.com"></iframe>`; + const expected = `&lt;iframe src="https://example.com">&lt;/iframe>`; + expect(renderGFM(md).trim()).toBe(expected); + }); + + test("textarea tag is filtered", () => { + const md = `<textarea>hello</textarea>`; + const expected = `&lt;textarea>hello&lt;/textarea>`; + expect(renderGFM(md).trim()).toBe(expected); + }); + + test("title tag is filtered", () => { + const md = `<title>hi</title>`; + const expected = `&lt;title>hi&lt;/title>`; + expect(renderGFM(md).trim()).toBe(expected); + }); + + test("plaintext tag is filtered", () => { + const md = `<plaintext>stuff`; + const expected = `<p>&lt;plaintext>stuff</p>`; + expect(renderGFM(md).trim()).toBe(expected); + }); + + test("xmp tag is filtered", () => { + const md = `<xmp>stuff</xmp>`; + const expected = `<p>&lt;xmp>stuff&lt;/xmp></p>`; + expect(renderGFM(md).trim()).toBe(expected); + }); + + test("noframes tag is filtered", () => { + const md = `<noframes>stuff</noframes>`; + const expected = 
`&lt;noframes>stuff&lt;/noframes>`; + expect(renderGFM(md).trim()).toBe(expected); + }); + + test("noembed tag is filtered", () => { + const md = `<noembed>stuff</noembed>`; + const expected = `<p>&lt;noembed>stuff&lt;/noembed></p>`; + expect(renderGFM(md).trim()).toBe(expected); + }); + + test("allowed tags pass through unchanged", () => { + const md = `<strong>bold</strong> and <em>italic</em>`; + const expected = `<p><strong>bold</strong> and <em>italic</em></p>`; + expect(renderGFM(md).trim()).toBe(expected); + }); + + test("filtering is case insensitive", () => { + const md = `<SCRIPT>alert("xss")</SCRIPT>`; + const expected = `&lt;SCRIPT>alert("xss")&lt;/SCRIPT>`; + expect(renderGFM(md).trim()).toBe(expected); + }); + + test("tagfilter in inline context", () => { + const md = `hello <script>alert("xss")</script> world`; + const expected = `<p>hello &lt;script>alert("xss")&lt;/script> world</p>`; + expect(renderGFM(md).trim()).toBe(expected); + }); + + test("self-closing filtered tag", () => { + const md = `<script />`; + const expected = `&lt;script />`; + expect(renderGFM(md).trim()).toBe(expected); + }); + + test("filtered tag with attributes", () => { + const md = `<script type="text/javascript">`; + const expected = `&lt;script type="text/javascript">`; + expect(renderGFM(md).trim()).toBe(expected); + }); + + test("similar tag names are not filtered", () => { + const md = `<scripting>not filtered</scripting>`; + const expected = `<p><scripting>not filtered</scripting></p>`; + expect(renderGFM(md).trim()).toBe(expected); + }); +}); + +// ============================================================================ +// 6. Autolinks with Formatting Delimiters in URLs +// +// md4c incorrectly treats ~, *, ** inside URLs as span delimiters, which can +// corrupt parser state and produce unbalanced/invalid output. 
+// +// Reference: md4c issues #294, #251 +// ============================================================================ +describe("autolinks with special characters", () => { + test("tilde in URL path", () => { + const md = `https://example.com/~user/file`; + const expected = `<p><a href="https://example.com/~user/file">https://example.com/~user/file</a></p>`; + expect(normalize(render(md, { autolinks: true }))).toBe(normalize(expected)); + }); + + test("asterisk in URL path", () => { + const md = `https://example.com/path*file`; + const expected = `<p><a href="https://example.com/path*file">https://example.com/path*file</a></p>`; + expect(normalize(render(md, { autolinks: true }))).toBe(normalize(expected)); + }); + + test("double asterisk in URL path", () => { + const md = `https://example.com/**path`; + const expected = `<p><a href="https://example.com/**path">https://example.com/**path</a></p>`; + expect(normalize(render(md, { autolinks: true }))).toBe(normalize(expected)); + }); + + test("tilde in URL followed by text", () => { + const md = `Visit https://example.com/~user then go home`; + const expected = `<p>Visit <a href="https://example.com/~user">https://example.com/~user</a> then go home</p>`; + expect(normalize(render(md, { autolinks: true }))).toBe(normalize(expected)); + }); + + test("multiple tildes in URL path", () => { + const md = `https://example.com/~user1/~user2/file`; + const expected = `<p><a href="https://example.com/~user1/~user2/file">https://example.com/~user1/~user2/file</a></p>`; + expect(normalize(render(md, { autolinks: true }))).toBe(normalize(expected)); + }); + + test("plus sign in URL path", () => { + const md = `https://codereview.qt-project.org/c/qt/qtwayland/+/545836`; + const expected = `<p><a href="https://codereview.qt-project.org/c/qt/qtwayland/+/545836">https://codereview.qt-project.org/c/qt/qtwayland/+/545836</a></p>`; + expect(normalize(render(md, { autolinks: true }))).toBe(normalize(expected)); + }); + + test("URL 
autolink not preceded by alphanumeric", () => { + const md = `texthttp://example.com`; + // cmark-gfm: does NOT autolink because preceded by alpha + const expected = `<p>texthttp://example.com</p>`; + expect(normalize(render(md, { autolinks: true }))).toBe(normalize(expected)); + }); +}); + +// ============================================================================ +// 7. Autolink Parenthesis Balancing and Trailing Punctuation +// +// GFM has complex rules for parenthesis balancing in URLs and stripping +// trailing punctuation. These are areas where parsers commonly diverge. +// +// Reference: md4c issue #135 (fixed), GFM spec section 6.9 +// ============================================================================ +describe("autolink parentheses and trailing punctuation", () => { + test("balanced parentheses in URL are preserved", () => { + const md = `www.google.com/search?q=Markup+(business)`; + const expected = `<p><a href="http://www.google.com/search?q=Markup+(business)">www.google.com/search?q=Markup+(business)</a></p>`; + expect(normalize(render(md, { autolinks: true }))).toBe(normalize(expected)); + }); + + test("unbalanced closing paren with trailing text", () => { + const md = `www.google.com/search?q=(business))+ok`; + const expected = `<p><a href="http://www.google.com/search?q=(business))+ok">www.google.com/search?q=(business))+ok</a></p>`; + expect(normalize(render(md, { autolinks: true }))).toBe(normalize(expected)); + }); + + test("URL ending in paren wrapped in parens strips outer paren", () => { + const md = `(www.google.com/search?q=Markup+(business))`; + const expected = `<p>(<a href="http://www.google.com/search?q=Markup+(business)">www.google.com/search?q=Markup+(business)</a>)</p>`; + expect(normalize(render(md, { autolinks: true }))).toBe(normalize(expected)); + }); + + test("entity-like suffix excluded from URL", () => { + const md = `www.google.com/search?q=commonmark&hl;`; + const expected = `<p><a 
href="http://www.google.com/search?q=commonmark">www.google.com/search?q=commonmark</a>&amp;hl;</p>`; + expect(normalize(render(md, { autolinks: true }))).toBe(normalize(expected)); + }); + + test("less-than terminates autolink", () => { + const md = `www.example.com<more`; + const expected = `<p><a href="http://www.example.com">www.example.com</a>&lt;more</p>`; + expect(normalize(render(md, { autolinks: true }))).toBe(normalize(expected)); + }); + + test("trailing period stripped from URL", () => { + const md = `Visit www.commonmark.org.`; + const expected = `<p>Visit <a href="http://www.commonmark.org">www.commonmark.org</a>.</p>`; + expect(normalize(render(md, { autolinks: true }))).toBe(normalize(expected)); + }); + + test("email autolink with trailing period excluded", () => { + const md = `Email foo@bar.baz.`; + const expected = `<p>Email <a href="mailto:foo@bar.baz">foo@bar.baz</a>.</p>`; + expect(normalize(render(md, { autolinks: true }))).toBe(normalize(expected)); + }); + + test("www autolink at start of line", () => { + const md = `www.example.com/path`; + const expected = `<p><a href="http://www.example.com/path">www.example.com/path</a></p>`; + expect(normalize(render(md, { autolinks: true }))).toBe(normalize(expected)); + }); + + test("email autolink in sentence", () => { + const md = `Contact foo@bar.baz for info`; + const expected = `<p>Contact <a href="mailto:foo@bar.baz">foo@bar.baz</a> for info</p>`; + expect(normalize(render(md, { autolinks: true }))).toBe(normalize(expected)); + }); +}); + +// ============================================================================ +// 8. Strikethrough Edge Cases +// +// - GFM spec formally requires ~~ (double tilde), but GitHub.com also accepts +// ~ (single tilde). md4c and cmark-gfm both accept single tildes. +// - Flanking delimiter rules apply to tildes. +// - Strikethrough does not span across paragraph boundaries. 
+// +// Reference: md4c issues #242, #243 +// ============================================================================ +describe("strikethrough edge cases", () => { + test("double tilde strikethrough", () => { + const md = `~~strikethrough~~`; + const expected = `<p><del>strikethrough</del></p>`; + expect(normalize(render(md, { strikethrough: true }))).toBe(normalize(expected)); + }); + + test("single tilde strikethrough", () => { + const md = `~strikethrough~`; + const expected = `<p><del>strikethrough</del></p>`; + expect(normalize(render(md, { strikethrough: true }))).toBe(normalize(expected)); + }); + + test("tilde adjacent to quotes does not trigger strikethrough", () => { + const md = `copy "~user1/file" to "~user2/file"`; + // cmark-gfm: no strikethrough due to flanking rules + const expected = `<p>copy &quot;~user1/file&quot; to &quot;~user2/file&quot;</p>`; + expect(normalize(render(md, { strikethrough: true }))).toBe(normalize(expected)); + }); + + test("strikethrough does not span across paragraphs", () => { + const md = `This ~~has a + +new paragraph~~.`; + const expected = normalize(`<p>This ~~has a</p> +<p>new paragraph~~.</p>`); + expect(normalize(render(md, { strikethrough: true }))).toBe(expected); + }); + + test("triple tilde is treated as code fence, not strikethrough", () => { + const md = `~~~not strikethrough~~~`; + // cmark-gfm: treated as a code fence with "not" as the info string + const expected = normalize(`<pre><code class="language-not"></code></pre>`); + expect(normalize(render(md, { strikethrough: true }))).toBe(expected); + }); +}); + +// ============================================================================ +// 9. Code Fence Closing with Tab +// +// CommonMark spec: the closing code fence "may be followed only by spaces or +// tabs, which are ignored." md4c may fail to recognize a closing fence +// followed by a tab. 
+// +// Reference: md4c issue #292 +// ============================================================================ +describe("code fence closing with tab", () => { + test("closing fence followed by tab", () => { + const md = "```\ncode here\n```\t"; + const expected = normalize(`<pre><code>code here +</code></pre>`); + expect(normalize(render(md))).toBe(expected); + }); + + test("closing fence followed by spaces", () => { + const md = "```\ncode here\n``` "; + const expected = normalize(`<pre><code>code here +</code></pre>`); + expect(normalize(render(md))).toBe(expected); + }); + + test("closing fence followed by tab and spaces", () => { + const md = "```\ncode here\n```\t "; + const expected = normalize(`<pre><code>code here +</code></pre>`); + expect(normalize(render(md))).toBe(expected); + }); +}); + +// ============================================================================ +// 10. ATX Heading + Emphasis Interactions +// +// md4c can mishandle emphasis markers in headings combined with inline links +// containing underscores, producing malformed output. 
+// +// Reference: md4c issue #278 +// ============================================================================ +describe("heading emphasis interactions", () => { + test("underscore emphasis in heading with link", () => { + const md = `# _foo [bar_](/url)`; + // cmark-gfm: underscores don't form emphasis here + const expected = normalize(`<h1>_foo <a href="/url">bar_</a></h1>`); + expect(normalize(renderGFM(md))).toBe(expected); + }); + + test("asterisk emphasis in heading with link", () => { + const md = `# *foo [bar*](/url)`; + const expected = normalize(`<h1>*foo <a href="/url">bar*</a></h1>`); + expect(normalize(renderGFM(md))).toBe(expected); + }); + + test("underscores in link URL inside heading", () => { + const md = `# heading [link_text](http://example.com/foo_bar)`; + const expected = normalize(`<h1>heading <a href="http://example.com/foo_bar">link_text</a></h1>`); + expect(normalize(renderGFM(md))).toBe(expected); + }); +}); + +// ============================================================================ +// 11. Combined GFM Extensions +// +// Test that multiple GFM extensions work correctly together. 
+// ============================================================================ +describe("combined GFM extensions", () => { + test("table with strikethrough and autolinks in cells", () => { + const md = `| Feature | Status | +|---------|--------| +| ~~old~~ | https://example.com | +| new | www.example.com |`; + + const expected = normalize(`<table> +<thead> +<tr> +<th>Feature</th> +<th>Status</th> +</tr> +</thead> +<tbody> +<tr> +<td><del>old</del></td> +<td><a href="https://example.com">https://example.com</a></td> +</tr> +<tr> +<td>new</td> +<td><a href="http://www.example.com">www.example.com</a></td> +</tr> +</tbody> +</table>`); + + expect(normalize(renderGFM(md))).toBe(expected); + }); + + test("table without leading/trailing pipes", () => { + const md = `abc | def +--- | --- +bar | baz`; + + const expected = normalize(`<table> +<thead> +<tr> +<th>abc</th> +<th>def</th> +</tr> +</thead> +<tbody> +<tr> +<td>bar</td> +<td>baz</td> +</tr> +</tbody> +</table>`); + + expect(normalize(renderGFM(md))).toBe(expected); + }); + + test("table with alignment", () => { + const md = `| left | center | right | +|:-----|:------:|------:| +| a | b | c |`; + + const expected = normalize(`<table> +<thead> +<tr> +<th align="left">left</th> +<th align="center">center</th> +<th align="right">right</th> +</tr> +</thead> +<tbody> +<tr> +<td align="left">a</td> +<td align="center">b</td> +<td align="right">c</td> +</tr> +</tbody> +</table>`); + + expect(normalize(renderGFM(md))).toBe(expected); + }); +}); diff --git a/test/js/bun/md/md-edge-cases.test.ts b/test/js/bun/md/md-edge-cases.test.ts new file mode 100644 index 0000000000..5ce7fb530a --- /dev/null +++ b/test/js/bun/md/md-edge-cases.test.ts @@ -0,0 +1,517 @@ +import { describe, expect, test } from "bun:test"; +import { renderToString } from "react-dom/server"; + +const Markdown = Bun.markdown; + +// ============================================================================ +// Fuzzer-like tests: edge cases, pathological 
inputs, invariant checks +// ============================================================================ + +describe("fuzzer-like edge cases", () => { + // ---- Empty / whitespace-only inputs ---- + + test("empty string produces empty output across all APIs", () => { + expect(Markdown.html("")).toBe(""); + expect(Markdown.render("", {})).toBe(""); + const el = Markdown.react("", undefined, { reactVersion: 18 }); + expect(renderToString(el)).toBe(""); + }); + + test("whitespace-only inputs", () => { + for (const ws of [" ", "\t", "\n", "\r\n", " \n\t \n\n"]) { + expect(typeof Markdown.html(ws)).toBe("string"); + expect(typeof Markdown.render(ws, {})).toBe("string"); + Markdown.react(ws, undefined, { reactVersion: 18 }); // should not throw + } + }); + + // ---- Null bytes and control characters ---- + + test("null bytes are replaced with U+FFFD", () => { + const html = Markdown.html("a\0b\n"); + expect(html).toContain("\uFFFD"); + }); + + test("input with many null bytes does not crash", () => { + const input = Buffer.alloc(200, "\0").toString(); + expect(typeof Markdown.html(input)).toBe("string"); + expect(typeof Markdown.render(input, {})).toBe("string"); + Markdown.react(input, undefined, { reactVersion: 18 }); + }); + + test("control characters in input", () => { + const ctrl = + "\x01\x02\x03\x04\x05\x06\x07\x08\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f"; + expect(typeof Markdown.html(ctrl)).toBe("string"); + }); + + // ---- Binary / non-UTF8 ---- + + test("Buffer input works", () => { + const buf = Buffer.from("# Hello\n"); + expect(Markdown.html(buf)).toContain("<h1>"); + }); + + test("binary-ish buffer does not crash", () => { + const buf = Buffer.alloc(256); + for (let i = 0; i < 256; i++) buf[i] = i; + expect(typeof Markdown.html(buf)).toBe("string"); + }); + + // ---- Deeply nested structures ---- + + test("deeply nested blockquotes", () => { + const depth = 100; + const input = Buffer.alloc(depth, "> ").toString() + 
"deep\n"; + expect(typeof Markdown.html(input)).toBe("string"); + }); + + test("deeply nested lists", () => { + let input = ""; + for (let i = 0; i < 50; i++) { + input += Buffer.alloc(i * 2, " ").toString() + "- item\n"; + } + expect(typeof Markdown.html(input)).toBe("string"); + }); + + test("deeply nested emphasis", () => { + const depth = 50; + const open = Buffer.alloc(depth, "*").toString(); + const close = open; + const input = open + "text" + close + "\n"; + expect(typeof Markdown.html(input)).toBe("string"); + }); + + test("deeply nested links", () => { + let input = ""; + for (let i = 0; i < 30; i++) { + input += "["; + } + input += "text"; + for (let i = 0; i < 30; i++) { + input += "](url)"; + } + input += "\n"; + expect(typeof Markdown.html(input)).toBe("string"); + }); + + // ---- Long inputs ---- + + test("very long single line", () => { + const input = Buffer.alloc(100_000, "a").toString() + "\n"; + const result = Markdown.html(input); + expect(typeof result).toBe("string"); + expect(result.length).toBeGreaterThan(100_000); + }); + + test("many short lines", () => { + const lines: string[] = []; + for (let i = 0; i < 5_000; i++) { + lines.push("line " + i); + } + const input = lines.join("\n") + "\n"; + expect(typeof Markdown.html(input)).toBe("string"); + }); + + test("long heading text for slug generation", () => { + const longTitle = Buffer.alloc(10_000, "x").toString(); + const input = "# " + longTitle + "\n"; + const result = Markdown.html(input, { headings: { ids: true } }); + expect(result).toContain("id="); + }); + + // ---- Pathological patterns ---- + + test("many unclosed brackets", () => { + const input = Buffer.alloc(500, "[").toString() + "\n"; + expect(typeof Markdown.html(input)).toBe("string"); + }); + + test("many unclosed parentheses after link", () => { + const input = "[text](" + Buffer.alloc(500, "(").toString() + "\n"; + expect(typeof Markdown.html(input)).toBe("string"); + }); + + test("alternating backticks", () => { + const 
input = Buffer.alloc(1000, "`a").toString() + "\n"; + expect(typeof Markdown.html(input)).toBe("string"); + }); + + test("many consecutive heading markers", () => { + const input = Buffer.alloc(500, "# ").toString() + "text\n"; + expect(typeof Markdown.html(input)).toBe("string"); + }); + + test("many consecutive horizontal rules", () => { + const lines: string[] = []; + for (let i = 0; i < 500; i++) { + lines.push("---"); + } + expect(typeof Markdown.html(lines.join("\n") + "\n")).toBe("string"); + }); + + test("table with many columns", () => { + const cols = 100; + const header = "|" + Buffer.alloc(cols, "h|").toString(); + const sep = "|" + Buffer.alloc(cols, "-|").toString(); + const row = "|" + Buffer.alloc(cols, "d|").toString(); + const input = header + "\n" + sep + "\n" + row + "\n"; + expect(typeof Markdown.html(input)).toBe("string"); + }); + + test("table with many rows", () => { + const lines = ["| a | b |", "| - | - |"]; + for (let i = 0; i < 1000; i++) { + lines.push("| x | y |"); + } + expect(typeof Markdown.html(lines.join("\n") + "\n")).toBe("string"); + }); + + // ---- HTML injection patterns ---- + + test("script tags are passed through or filtered", () => { + const input = '<script>alert("xss")</script>\n'; + // with tagFilter enabled, disallowed tags should be escaped + const filtered = Markdown.html(input, { tagFilter: true }); + expect(filtered).not.toContain("<script>"); + }); + + test("nested HTML entities", () => { + const input = "&amp;amp;amp;amp;\n"; + expect(typeof Markdown.html(input)).toBe("string"); + }); + + // ---- Option combinations ---- + + const allOptions = { + tables: true, + strikethrough: true, + tasklists: true, + tagFilter: true, + autolinks: true, + hardSoftBreaks: true, + wikiLinks: true, + underline: true, + latexMath: true, + collapseWhitespace: true, + permissiveAtxHeaders: true, + noIndentedCodeBlocks: true, + noHtmlBlocks: true, + noHtmlSpans: true, + headings: true, + }; + + test("all options enabled 
simultaneously", () => { + const input = `# Heading + +**bold** *italic* ~~strike~~ __underline__ + +- [x] task +- [ ] unchecked + +| a | b | +| - | - | +| 1 | 2 | + +$E=mc^2$ + +$$ +\\int_0^1 x^2 dx +$$ + +[[wiki link]] + +www.example.com +user@example.com +https://example.com + +\`\`\`js +code +\`\`\` + +--- +`; + const result = Markdown.html(input, allOptions); + expect(typeof result).toBe("string"); + expect(result.length).toBeGreaterThan(0); + }); + + test("all options work with render()", () => { + const input = "# Hello **world**\n"; + const result = Markdown.render( + input, + { + heading: (c: string, m: any) => `[H${m.level}:${c}]`, + strong: (c: string) => `[B:${c}]`, + }, + allOptions, + ); + expect(result).toContain("[H1:"); + expect(result).toContain("[B:world]"); + }); + + test("all options work with react()", () => { + const input = "# Hello **world**\n"; + const el = Markdown.react(input, undefined, { ...allOptions, reactVersion: 18 }); + const html = renderToString(el); + expect(html).toContain("<h1"); + expect(html).toContain("<strong>"); + }); + + // ---- Invariant checks ---- + + test("html() always returns a string", () => { + const inputs = [ + "", + " ", + "\n", + "# H\n", + "```\ncode\n```\n", + "| a |\n| - |\n| b |\n", + "> quote\n", + "- list\n", + "1. 
ordered\n", + "![img](url)\n", + "[link](url)\n", + "**bold**\n", + "*italic*\n", + "~~strike~~\n", + "`code`\n", + "---\n", + "<div>html</div>\n", + "&amp;\n", + ]; + for (const input of inputs) { + const result = Markdown.html(input); + expect(typeof result).toBe("string"); + } + }); + + test("render() always returns a string", () => { + const inputs = ["", "# H\n", "**b**\n", "[l](u)\n", "```\nc\n```\n"]; + for (const input of inputs) { + const result = Markdown.render(input, {}); + expect(typeof result).toBe("string"); + } + }); + + test("render() with all callbacks returning null produces empty string", () => { + const nullCb = () => null; + const result = Markdown.render("# Hello **world**\n\nParagraph\n", { + heading: nullCb, + paragraph: nullCb, + strong: nullCb, + text: nullCb, + }); + expect(result).toBe(""); + }); + + test("render() with all callbacks returning empty string", () => { + const emptyCb = () => ""; + const result = Markdown.render("# Hello\n\nWorld\n", { + heading: emptyCb, + paragraph: emptyCb, + }); + expect(result).toBe(""); + }); + + // ---- Callback error handling ---- + + test("render() callback that throws propagates the error", () => { + expect(() => { + Markdown.render("# Hello\n", { + heading: () => { + throw new Error("callback error"); + }, + }); + }).toThrow("callback error"); + }); + + test("react() component override that throws propagates during render", () => { + // Component overrides are used as element types, so they throw during + // renderToString, not during Markdown.react() itself. 
+ expect(() => { + renderToString( + Markdown.react( + "# Hello\n", + { + h1: () => { + throw new Error("component error"); + }, + }, + { reactVersion: 18 }, + ), + ); + }).toThrow("component error"); + }); + + // ---- Invalid argument types ---- + + test("html() with non-string/buffer throws TypeError", () => { + expect(() => Markdown.html(123 as any)).toThrow(); + expect(() => Markdown.html(null as any)).toThrow(); + expect(() => Markdown.html(undefined as any)).toThrow(); + expect(() => Markdown.html({} as any)).toThrow(); + }); + + test("render() with non-string/buffer throws TypeError", () => { + expect(() => Markdown.render(123 as any, {})).toThrow(); + expect(() => Markdown.render(null as any, {})).toThrow(); + }); + + test("react() with non-string/buffer throws TypeError", () => { + expect(() => Markdown.react(123 as any)).toThrow(); + expect(() => Markdown.react(null as any)).toThrow(); + }); + + // ---- Emoji and Unicode ---- + + test("emoji in markdown", () => { + const input = "# Hello \u{1F600}\n\n\u{1F4A9} **bold \u{1F60D}**\n"; + const result = Markdown.html(input); + expect(result).toContain("\u{1F600}"); + expect(result).toContain("\u{1F4A9}"); + }); + + test("CJK characters", () => { + const input = "# \u4F60\u597D\u4E16\u754C\n\n\u3053\u3093\u306B\u3061\u306F\n"; + expect(Markdown.html(input)).toContain("\u4F60\u597D"); + }); + + test("RTL text", () => { + const input = "# \u0645\u0631\u062D\u0628\u0627\n\n\u0634\u0643\u0631\u0627\n"; + expect(typeof Markdown.html(input)).toBe("string"); + }); + + test("mixed scripts and combining characters", () => { + const input = + "# Caf\u00E9 na\u00EFve r\u00E9sum\u00E9\n\nZ\u0361\u035C\u0321a\u030A\u0326l\u0338\u031Bg\u030D\u0320o\u0362\n"; + expect(typeof Markdown.html(input)).toBe("string"); + }); + + // ---- Entity edge cases ---- + + test("all HTML5 named entities", () => { + const input = "&amp; &lt; &gt; &quot; &apos; &nbsp; &copy; &reg; &trade;\n"; + const result = Markdown.html(input); + 
expect(result).toContain("&amp;"); + expect(result).toContain("&lt;"); + }); + + test("numeric entities", () => { + const input = "&#65; &#x41; &#128512;\n"; + expect(typeof Markdown.html(input)).toBe("string"); + }); + + test("invalid entities pass through", () => { + const input = "&notavalidentity; &#99999999;\n"; + expect(typeof Markdown.html(input)).toBe("string"); + }); + + // ---- Rapid API alternation ---- + + test("alternating between html/render/react does not corrupt state", () => { + const input = "# Hello **world**\n\n- item 1\n- item 2\n"; + for (let i = 0; i < 50; i++) { + const html = Markdown.html(input); + expect(html).toContain("<h1>"); + expect(html).toContain("<strong>"); + + const rendered = Markdown.render(input, { + heading: (c: string) => `[H:${c}]`, + strong: (c: string) => `[B:${c}]`, + }); + expect(rendered).toContain("[H:"); + expect(rendered).toContain("[B:world]"); + + const el = Markdown.react(input, undefined, { reactVersion: 18 }); + const reactHtml = renderToString(el); + expect(reactHtml).toContain("<h1>"); + } + }); + + // ---- GFM extension edge cases ---- + + test("wiki links with special characters", () => { + const input = "[[page with spaces]] [[page/with/slashes]] [[page#with#hashes]]\n"; + const result = Markdown.html(input, { wikiLinks: true }); + expect(typeof result).toBe("string"); + }); + + test("latex math edge cases", () => { + const inputs = [ + "$$ $$\n", // empty display math + "$ $\n", // empty inline math + "$a$b$c$\n", // adjacent math + "$$\n\\frac{1}{2}\n$$\n", // multi-line display math + ]; + for (const input of inputs) { + expect(typeof Markdown.html(input, { latexMath: true })).toBe("string"); + } + }); + + test("strikethrough edge cases", () => { + const inputs = [ + "~~~~\n", // 4 tildes + "~~ ~~\n", // space-only content + "~~a~~ ~~b~~\n", // adjacent + "~~**bold strike**~~\n", // nested + ]; + for (const input of inputs) { + expect(typeof Markdown.html(input)).toBe("string"); + } + }); + + 
test("task list edge cases", () => { + const inputs = [ + "- [x]\n", // checked, no text + "- [ ]\n", // unchecked, no text + "- [X] capital\n", // capital X + "- [x] **bold task**\n", // nested inline + ]; + for (const input of inputs) { + expect(typeof Markdown.html(input)).toBe("string"); + } + }); + + // ---- Autolink edge cases ---- + + test("autolink edge cases", () => { + const inputs = [ + "www.example.com\n", + "www.example.com/path?q=1&r=2#hash\n", + "user@example.com\n", + "https://example.com\n", + "https://example.com/path(with)parens\n", + ]; + for (const input of inputs) { + const result = Markdown.html(input, { autolinks: true }); + expect(typeof result).toBe("string"); + } + }); + + // ---- Heading ID collision ---- + + test("duplicate heading IDs get deduplicated", () => { + const input = "# Hello\n\n# Hello\n\n# Hello\n"; + const result = Markdown.html(input, { headings: { ids: true } }); + expect(result).toContain('id="hello"'); + expect(result).toContain('id="hello-1"'); + expect(result).toContain('id="hello-2"'); + }); + + test("heading ID deduplication with render()", () => { + const ids: string[] = []; + Markdown.render( + "# A\n\n# A\n\n# A\n", + { + heading: (_c: string, m: any) => { + ids.push(m.id); + return ""; + }, + }, + { headings: { ids: true } }, + ); + expect(ids).toEqual(["a", "a-1", "a-2"]); + }); +}); diff --git a/test/js/bun/md/md-heading-ids.test.ts b/test/js/bun/md/md-heading-ids.test.ts new file mode 100644 index 0000000000..29ca96692e --- /dev/null +++ b/test/js/bun/md/md-heading-ids.test.ts @@ -0,0 +1,105 @@ +import { describe, expect, test } from "bun:test"; + +const Markdown = Bun.markdown; + +// ============================================================================ +// Heading IDs and Autolink Headings (HTML output) +// ============================================================================ + +describe("headingIds option", () => { + test("basic heading gets an id attribute", () => { + const result = 
Markdown.html("## Hello World\n", { headings: { ids: true } }); + expect(result).toBe('<h2 id="hello-world">Hello World</h2>\n'); + }); + + test("heading levels 1-6 all get ids", () => { + for (let i = 1; i <= 6; i++) { + const md = Buffer.alloc(i, "#").toString() + " Test\n"; + const result = Markdown.html(md, { headings: { ids: true } }); + expect(result).toBe(`<h${i} id="test">Test</h${i}>\n`); + } + }); + + test("special characters are stripped from slug", () => { + const result = Markdown.html("## Hello, World!\n", { headings: { ids: true } }); + expect(result).toBe('<h2 id="hello-world">Hello, World!</h2>\n'); + }); + + test("uppercase is lowercased in slug", () => { + const result = Markdown.html("## ALLCAPS\n", { headings: { ids: true } }); + expect(result).toBe('<h2 id="allcaps">ALLCAPS</h2>\n'); + }); + + test("duplicate headings get deduplicated with -N suffix", () => { + const md = "## Foo\n\n## Foo\n\n## Foo\n"; + const result = Markdown.html(md, { headings: { ids: true } }); + expect(result).toContain('<h2 id="foo">Foo</h2>'); + expect(result).toContain('<h2 id="foo-1">Foo</h2>'); + expect(result).toContain('<h2 id="foo-2">Foo</h2>'); + }); + + test("inline markup is stripped from slug", () => { + const result = Markdown.html("## Hello **World**\n", { headings: { ids: true } }); + expect(result).toBe('<h2 id="hello-world">Hello <strong>World</strong></h2>\n'); + }); + + test("inline code is included in slug text", () => { + const result = Markdown.html("## Use `foo()` here\n", { headings: { ids: true } }); + expect(result).toBe('<h2 id="use-foo-here">Use <code>foo()</code> here</h2>\n'); + }); + + test("hyphens in heading text are preserved", () => { + const result = Markdown.html("## my-heading-text\n", { headings: { ids: true } }); + expect(result).toBe('<h2 id="my-heading-text">my-heading-text</h2>\n'); + }); + + test("numbers are kept in slug", () => { + const result = Markdown.html("## Step 3\n", { headings: { ids: true } }); + 
expect(result).toBe('<h2 id="step-3">Step 3</h2>\n'); + }); + + test("empty heading produces empty id", () => { + const result = Markdown.html("##\n", { headings: { ids: true }, permissiveAtxHeaders: true }); + expect(result).toBe('<h2 id=""></h2>\n'); + }); + + test("headingIds defaults to false", () => { + const result = Markdown.html("## Test\n"); + expect(result).toBe("<h2>Test</h2>\n"); + }); + + test("multiple spaces collapse to single hyphen", () => { + const result = Markdown.html("## hello world\n", { headings: { ids: true } }); + expect(result).toBe('<h2 id="hello-world">hello world</h2>\n'); + }); + + test("mixed content heading", () => { + const md = "## Install `bun` on Linux\n"; + const result = Markdown.html(md, { headings: { ids: true } }); + expect(result).toBe('<h2 id="install-bun-on-linux">Install <code>bun</code> on Linux</h2>\n'); + }); +}); + +describe("autolinkHeadings option", () => { + test("wraps heading content in anchor tag", () => { + const result = Markdown.html("## Hello\n", { headings: true }); + expect(result).toBe('<h2 id="hello"><a href="#hello">Hello</a></h2>\n'); + }); + + test("anchor wraps all inline content", () => { + const result = Markdown.html("## Hello **World**\n", { headings: true }); + expect(result).toBe('<h2 id="hello-world"><a href="#hello-world">Hello <strong>World</strong></a></h2>\n'); + }); + + test("autolink with deduplication", () => { + const md = "## Foo\n\n## Foo\n"; + const result = Markdown.html(md, { headings: true }); + expect(result).toContain('<h2 id="foo"><a href="#foo">Foo</a></h2>'); + expect(result).toContain('<h2 id="foo-1"><a href="#foo-1">Foo</a></h2>'); + }); + + test("autolinkHeadings without headingIds has no effect", () => { + const result = Markdown.html("## Test\n", { headings: { autolink: true } }); + expect(result).toBe("<h2>Test</h2>\n"); + }); +}); diff --git a/test/js/bun/md/md-react.test.ts b/test/js/bun/md/md-react.test.ts new file mode 100644 index 0000000000..d9be7f019b --- 
/dev/null +++ b/test/js/bun/md/md-react.test.ts @@ -0,0 +1,553 @@ +import { describe, expect, test } from "bun:test"; +import React from "react"; +import { renderToString } from "react-dom/server"; + +const Markdown = Bun.markdown; + +/** renderToString the Fragment returned by Markdown.react. + * Uses reactVersion: 18 since the project has react-dom@18 installed. */ +function reactRender(md: string, components?: any, opts?: any): string { + return renderToString(Markdown.react(md, components, { reactVersion: 18, ...opts })); +} + +// ============================================================================ +// Bun.markdown.react() — React element AST +// ============================================================================ + +describe("Bun.markdown.react", () => { + const REACT_ELEMENT_SYMBOL = Symbol.for("react.element"); + const REACT_FRAGMENT_SYMBOL = Symbol.for("react.fragment"); + const REACT_TRANSITIONAL_SYMBOL = Symbol.for("react.transitional.element"); + + /** Helper: get the children array from the Fragment returned by react() */ + function children(md: string, components?: any, opts?: any): any[] { + return Markdown.react(md, components, opts).props.children; + } + + test("returns a Fragment element", () => { + const result = Markdown.react("# Hello\n"); + expect(result.$$typeof).toBe(REACT_TRANSITIONAL_SYMBOL); + expect(result.type).toBe(REACT_FRAGMENT_SYMBOL); + expect(result.key).toBeNull(); + expect(result.ref).toBeNull(); + expect(result.props.children).toBeArray(); + }); + + test("fragment children are React elements", () => { + const els = children("# Hello\n"); + expect(els).toHaveLength(1); + expect(els[0].$$typeof).toBe(REACT_TRANSITIONAL_SYMBOL); + }); + + test("element has type, key, ref, props", () => { + const el = children("# Hello\n")[0]; + expect(el.type).toBe("h1"); + expect(el.key).toBeNull(); + expect(el.ref).toBeNull(); + expect(el.props).toEqual({ children: ["Hello"] }); + }); + + test("heading levels 1-6", () => { + for 
(let i = 1; i <= 6; i++) { + const md = Buffer.alloc(i, "#").toString() + " Level\n"; + const el = children(md)[0]; + expect(el.$$typeof).toBe(REACT_TRANSITIONAL_SYMBOL); + expect(el.type).toBe(`h${i}`); + expect(el.props.children).toEqual(["Level"]); + } + }); + + test("text is plain strings in children", () => { + expect(children("Hello world\n")[0].props.children).toEqual(["Hello world"]); + }); + + test("nested inline elements are React elements", () => { + const p = children("Hello **world**\n")[0]; + expect(p.$$typeof).toBe(REACT_TRANSITIONAL_SYMBOL); + expect(p.props.children[0]).toBe("Hello "); + const strong = p.props.children[1]; + expect(strong.$$typeof).toBe(REACT_TRANSITIONAL_SYMBOL); + expect(strong.type).toBe("strong"); + expect(strong.props.children).toEqual(["world"]); + }); + + test("link has href in props", () => { + const link = children("[click](https://example.com)\n")[0].props.children[0]; + expect(link.$$typeof).toBe(REACT_TRANSITIONAL_SYMBOL); + expect(link.type).toBe("a"); + expect(link.props.href).toBe("https://example.com"); + expect(link.props.children).toEqual(["click"]); + }); + + test("image has src and alt in props", () => { + const img = children("![alt](img.png)\n")[0].props.children[0]; + expect(img.$$typeof).toBe(REACT_TRANSITIONAL_SYMBOL); + expect(img.type).toBe("img"); + expect(img.props.src).toBe("img.png"); + expect(img.props.alt).toBe("alt"); + expect(img.props.children).toBeUndefined(); + }); + + test("code block with language", () => { + const pre = children("```ts\nconst x = 1;\n```\n")[0]; + expect(pre.$$typeof).toBe(REACT_TRANSITIONAL_SYMBOL); + expect(pre.type).toBe("pre"); + expect(pre.props.language).toBe("ts"); + }); + + test("hr is void element", () => { + const hr = children("---\n")[0]; + expect(hr.$$typeof).toBe(REACT_TRANSITIONAL_SYMBOL); + expect(hr.type).toBe("hr"); + expect(hr.key).toBeNull(); + expect(hr.ref).toBeNull(); + expect(hr.props).toEqual({}); + }); + + test("br produces React element", () => { + 
const pChildren = children("line1 \nline2\n")[0].props.children; + const br = pChildren.find((c: any) => typeof c === "object" && c?.type === "br"); + expect(br).toBeDefined(); + expect(br.$$typeof).toBe(REACT_TRANSITIONAL_SYMBOL); + expect(br.props).toEqual({}); + }); + + test("ordered list with start", () => { + const ol = children("3. first\n4. second\n")[0]; + expect(ol.type).toBe("ol"); + expect(ol.props.start).toBe(3); + expect(ol.props.children).toHaveLength(2); + expect(ol.props.children[0].type).toBe("li"); + }); + + test("table structure", () => { + const table = children("| A | B |\n|---|---|\n| 1 | 2 |\n")[0]; + expect(table.$$typeof).toBe(REACT_TRANSITIONAL_SYMBOL); + expect(table.type).toBe("table"); + const thead = table.props.children.find((c: any) => c.type === "thead"); + expect(thead).toBeDefined(); + expect(thead.$$typeof).toBe(REACT_TRANSITIONAL_SYMBOL); + }); + + test("headingIds adds id to props", () => { + const el = children("## Hello World\n", undefined, { headings: { ids: true } })[0]; + expect(el.type).toBe("h2"); + expect(el.props.id).toBe("hello-world"); + expect(el.props.children).toEqual(["Hello World"]); + }); + + test("default $$typeof is react.transitional.element", () => { + const result = Markdown.react("# Hi\n"); + expect(result.$$typeof).toBe(REACT_TRANSITIONAL_SYMBOL); + expect(result.props.children[0].$$typeof).toBe(REACT_TRANSITIONAL_SYMBOL); + }); + + test("reactVersion 18 uses react.element symbol on all elements", () => { + const result = Markdown.react("Hello **world**\n", undefined, { reactVersion: 18 }); + expect(result.$$typeof).toBe(REACT_ELEMENT_SYMBOL); + const p = result.props.children[0]; + expect(p.$$typeof).toBe(REACT_ELEMENT_SYMBOL); + const strong = p.props.children[1]; + expect(strong.$$typeof).toBe(REACT_ELEMENT_SYMBOL); + }); + + test("multiple blocks", () => { + const els = children("# Title\n\nParagraph\n"); + expect(els).toHaveLength(2); + expect(els[0].$$typeof).toBe(REACT_TRANSITIONAL_SYMBOL); + 
expect(els[0].type).toBe("h1"); + expect(els[1].$$typeof).toBe(REACT_TRANSITIONAL_SYMBOL); + expect(els[1].type).toBe("p"); + }); + + test("complete document", () => { + const els = children(`# Hello + +This is **bold** and *italic*. + +- item one +- item two + +--- +`); + expect(els[0].type).toBe("h1"); + expect(els[1].type).toBe("p"); + expect(els[2].type).toBe("ul"); + expect(els[3].type).toBe("hr"); + for (const el of els) { + expect(el.$$typeof).toBe(REACT_TRANSITIONAL_SYMBOL); + } + }); + + test("blockquote contains nested React elements", () => { + const bq = children("> quoted text\n")[0]; + expect(bq.$$typeof).toBe(REACT_TRANSITIONAL_SYMBOL); + expect(bq.type).toBe("blockquote"); + const p = bq.props.children[0]; + expect(p.$$typeof).toBe(REACT_TRANSITIONAL_SYMBOL); + expect(p.type).toBe("p"); + expect(p.props.children).toEqual(["quoted text"]); + }); + + test("deeply nested elements are all React elements", () => { + const bq = children("> **bold *and italic***\n")[0]; + expect(bq.$$typeof).toBe(REACT_TRANSITIONAL_SYMBOL); + const p = bq.props.children[0]; + expect(p.$$typeof).toBe(REACT_TRANSITIONAL_SYMBOL); + const strong = p.props.children[0]; + expect(strong.$$typeof).toBe(REACT_TRANSITIONAL_SYMBOL); + expect(strong.type).toBe("strong"); + const em = strong.props.children[1]; + expect(em.$$typeof).toBe(REACT_TRANSITIONAL_SYMBOL); + expect(em.type).toBe("em"); + expect(em.props.children).toEqual(["and italic"]); + }); + + test("link with title in React element", () => { + const link = children('[text](https://example.com "My Title")\n')[0].props.children[0]; + expect(link.$$typeof).toBe(REACT_TRANSITIONAL_SYMBOL); + expect(link.type).toBe("a"); + expect(link.props.href).toBe("https://example.com"); + expect(link.props.title).toBe("My Title"); + expect(link.props.children).toEqual(["text"]); + }); + + test("image with title in React element", () => { + const img = children('![alt](pic.jpg "Photo")\n')[0].props.children[0]; + 
expect(img.$$typeof).toBe(REACT_TRANSITIONAL_SYMBOL); + expect(img.type).toBe("img"); + expect(img.props.src).toBe("pic.jpg"); + expect(img.props.title).toBe("Photo"); + expect(img.props.alt).toBe("alt"); + expect(img.props.children).toBeUndefined(); + }); + + test("inline code is a React element", () => { + const code = children("`code`\n")[0].props.children[0]; + expect(code.$$typeof).toBe(REACT_TRANSITIONAL_SYMBOL); + expect(code.type).toBe("code"); + expect(code.props.children).toEqual(["code"]); + }); + + test("strikethrough is a React element", () => { + const del = children("~~deleted~~\n")[0].props.children[0]; + expect(del.$$typeof).toBe(REACT_TRANSITIONAL_SYMBOL); + expect(del.type).toBe("del"); + expect(del.props.children).toEqual(["deleted"]); + }); + + test("unordered list children are React elements", () => { + const ul = children("- a\n- b\n")[0]; + expect(ul.type).toBe("ul"); + for (const li of ul.props.children) { + expect(li.$$typeof).toBe(REACT_TRANSITIONAL_SYMBOL); + expect(li.type).toBe("li"); + } + }); + + test("entities are decoded in React output", () => { + const text = children("&amp; &lt; &gt;\n")[0].props.children.join(""); + expect(text).toContain("&"); + expect(text).toContain("<"); + expect(text).toContain(">"); + }); + + test("softbr becomes newline string", () => { + const pChildren = children("line1\nline2\n")[0].props.children; + expect(pChildren).toContain("\n"); + }); +}); + +// ============================================================================ +// Bun.markdown.react() + React renderToString integration +// ============================================================================ + +describe("Bun.markdown.react renderToString", () => { + test("heading", () => { + expect(reactRender("# Hello\n")).toBe("<h1>Hello</h1>"); + }); + + test("heading levels 1-6", () => { + for (let i = 1; i <= 6; i++) { + const md = Buffer.alloc(i, "#").toString() + " Level\n"; + expect(reactRender(md)).toBe(`<h${i}>Level</h${i}>`); + } + 
}); + + test("paragraph", () => { + expect(reactRender("Hello world\n")).toBe("<p>Hello world</p>"); + }); + + test("bold text", () => { + expect(reactRender("**bold**\n")).toBe("<p><strong>bold</strong></p>"); + }); + + test("italic text", () => { + expect(reactRender("*italic*\n")).toBe("<p><em>italic</em></p>"); + }); + + test("nested bold and italic", () => { + expect(reactRender("**bold *and italic***\n")).toBe("<p><strong>bold <em>and italic</em></strong></p>"); + }); + + test("strikethrough", () => { + expect(reactRender("~~deleted~~\n")).toBe("<p><del>deleted</del></p>"); + }); + + test("inline code", () => { + expect(reactRender("`code`\n")).toBe("<p><code>code</code></p>"); + }); + + test("link", () => { + expect(reactRender("[click](https://example.com)\n")).toBe('<p><a href="https://example.com">click</a></p>'); + }); + + test("link with title", () => { + expect(reactRender('[click](https://example.com "title")\n')).toBe( + '<p><a href="https://example.com" title="title">click</a></p>', + ); + }); + + test("image", () => { + expect(reactRender("![alt](img.png)\n")).toBe('<p><img src="img.png" alt="alt"/></p>'); + }); + + test("hr", () => { + expect(reactRender("---\n")).toBe("<hr/>"); + }); + + test("br", () => { + const html = reactRender("line1 \nline2\n"); + expect(html).toContain("<br/>"); + expect(html).toContain("line1"); + expect(html).toContain("line2"); + }); + + test("blockquote", () => { + expect(reactRender("> quoted\n")).toBe("<blockquote><p>quoted</p></blockquote>"); + }); + + test("unordered list", () => { + expect(reactRender("- a\n- b\n")).toBe("<ul><li>a</li><li>b</li></ul>"); + }); + + test("ordered list", () => { + expect(reactRender("1. a\n2. b\n")).toBe('<ol start="1"><li>a</li><li>b</li></ol>'); + }); + + test("ordered list with start", () => { + const html = reactRender("3. a\n4. 
b\n"); + expect(html).toContain('<ol start="3">'); + }); + + test("table", () => { + const html = reactRender("| A | B |\n|---|---|\n| 1 | 2 |\n"); + expect(html).toContain("<table>"); + expect(html).toContain("<thead>"); + expect(html).toContain("<tbody>"); + expect(html).toContain("<th>A</th>"); + expect(html).toContain("<td>1</td>"); + }); + + test("mixed document", () => { + const html = reactRender(`# Title + +Hello **world**, this is *important*. + +- item one +- item two +`); + expect(html).toContain("<h1>Title</h1>"); + expect(html).toContain("<strong>world</strong>"); + expect(html).toContain("<em>important</em>"); + expect(html).toContain("<li>item one</li>"); + expect(html).toContain("<li>item two</li>"); + }); + + test("entities are decoded", () => { + const html = reactRender("&amp; &lt; &gt;\n"); + expect(html).toContain("&amp;"); // React re-escapes & in output + expect(html).toContain("&lt;"); + expect(html).toContain("&gt;"); + }); + + test("headingIds produce id attribute", () => { + const html = reactRender("## Hello World\n", undefined, { headings: { ids: true } }); + expect(html).toBe('<h2 id="hello-world">Hello World</h2>'); + }); + + test("code block renders as pre", () => { + const html = reactRender("```\ncode here\n```\n"); + expect(html).toContain("<pre>"); + expect(html).toContain("code here"); + }); + + test("nested blockquote with formatting", () => { + const html = reactRender("> **bold** in quote\n"); + expect(html).toBe("<blockquote><p><strong>bold</strong> in quote</p></blockquote>"); + }); + + test("link inside heading", () => { + const html = reactRender("# [Bun](https://bun.sh)\n"); + expect(html).toBe('<h1><a href="https://bun.sh">Bun</a></h1>'); + }); + + test("multiple paragraphs", () => { + const html = reactRender("First paragraph.\n\nSecond paragraph.\n"); + expect(html).toBe("<p>First paragraph.</p><p>Second paragraph.</p>"); + }); + + test("reactVersion 18 produces correct structure", () => { + const result = 
Markdown.react("# Hello\n", undefined, { reactVersion: 18 }); + const els = result.props.children; + expect(els[0].type).toBe("h1"); + expect(els[0].props.children).toEqual(["Hello"]); + }); +}); + +// ============================================================================ +// Component overrides (render + react) +// ============================================================================ + +// (render() is callback-based, component overrides are only for react()) + +describe("Bun.markdown.react component overrides", () => { + const REACT_TRANSITIONAL_SYMBOL = Symbol.for("react.transitional.element"); + const REACT_ELEMENT_SYMBOL = Symbol.for("react.element"); + + /** Helper: get fragment children */ + function children(md: string, components?: any, opts?: any): any[] { + return Markdown.react(md, components, opts).props.children; + } + + test("function component override replaces type", () => { + function MyHeading({ children }: any) { + return React.createElement("div", { className: "heading" }, ...children); + } + const el = children("# Hello\n", { h1: MyHeading })[0]; + expect(el.$$typeof).toBe(REACT_TRANSITIONAL_SYMBOL); + expect(el.type).toBe(MyHeading); + expect(el.props.children).toEqual(["Hello"]); + }); + + test("string override in react mode", () => { + const el = children("# Hello\n", { h1: "section" })[0]; + expect(el.$$typeof).toBe(REACT_TRANSITIONAL_SYMBOL); + expect(el.type).toBe("section"); + expect(el.props.children).toEqual(["Hello"]); + }); + + test("multiple component overrides", () => { + function P({ children }: any) { + return React.createElement("div", null, ...children); + } + function Strong({ children }: any) { + return React.createElement("b", null, ...children); + } + const els = children("Hello **world**\n", { p: P, strong: Strong }); + expect(els[0].type).toBe(P); + const strong = els[0].props.children[1]; + expect(strong.type).toBe(Strong); + }); + + test("boolean override is ignored in react mode", () => { + 
expect(children("# Hello\n", { h1: true })[0].type).toBe("h1"); + }); + + test("override with reactVersion 18", () => { + const el = children("# Hello\n", { h1: "custom-h1" }, { reactVersion: 18 })[0]; + expect(el.$$typeof).toBe(REACT_ELEMENT_SYMBOL); + expect(el.type).toBe("custom-h1"); + }); + + test("link override preserves href prop", () => { + function Link({ href, children }: any) { + return React.createElement("a", { href, className: "custom" }, ...children); + } + const link = children("[click](https://example.com)\n", { a: Link })[0].props.children[0]; + expect(link.type).toBe(Link); + expect(link.props.href).toBe("https://example.com"); + }); + + test("image override preserves src and alt props", () => { + function Img(props: any) { + return React.createElement("img", props); + } + const img = children("![photo](pic.jpg)\n", { img: Img })[0].props.children[0]; + expect(img.type).toBe(Img); + expect(img.props.src).toBe("pic.jpg"); + expect(img.props.alt).toBe("photo"); + }); + + test("hr override in react mode", () => { + const el = children("---\n", { hr: "custom-hr" })[0]; + expect(el.$$typeof).toBe(REACT_TRANSITIONAL_SYMBOL); + expect(el.type).toBe("custom-hr"); + expect(el.props).toEqual({}); + }); +}); + +describe("Bun.markdown.react renderToString with component overrides", () => { + test("function component renders custom HTML", () => { + function Heading({ children }: any) { + return React.createElement("div", { className: "title" }, ...children); + } + const html = reactRender("# Hello\n", { h1: Heading }); + expect(html).toBe('<div class="title">Hello</div>'); + }); + + test("multiple custom components", () => { + function P({ children }: any) { + return React.createElement("section", null, ...children); + } + function Strong({ children }: any) { + return React.createElement("b", null, ...children); + } + const html = reactRender("Hello **world**\n", { p: P, strong: Strong }); + expect(html).toBe("<section>Hello <b>world</b></section>"); + }); + 
+ test("custom link component", () => { + function Link({ href, children }: any) { + return React.createElement("a", { href, target: "_blank" }, ...children); + } + const html = reactRender("[click](https://example.com)\n", { a: Link }); + expect(html).toBe('<p><a href="https://example.com" target="_blank">click</a></p>'); + }); + + test("custom image component", () => { + function Img({ src, alt }: any) { + return React.createElement("figure", null, React.createElement("img", { src, alt })); + } + const html = reactRender("![photo](pic.jpg)\n", { img: Img }); + expect(html).toBe('<p><figure><img src="pic.jpg" alt="photo"/></figure></p>'); + }); + + test("custom code block with language", () => { + function Code({ language, children }: any) { + return React.createElement("pre", { "data-lang": language || "text" }, ...children); + } + const html = reactRender("```js\nconst x = 1;\n```\n", { pre: Code }); + expect(html).toContain('data-lang="js"'); + expect(html).toContain("const x = 1;"); + }); + + test("custom list components", () => { + function List({ children }: any) { + return React.createElement("div", { className: "list" }, ...children); + } + function Item({ children }: any) { + return React.createElement("span", null, ...children); + } + const html = reactRender("- a\n- b\n", { ul: List, li: Item }); + expect(html).toBe('<div class="list"><span>a</span><span>b</span></div>'); + }); + + test("override only specific elements", () => { + function H1({ children }: any) { + return React.createElement("h1", { className: "big" }, ...children); + } + const html = reactRender("# Title\n\nParagraph\n", { h1: H1 }); + expect(html).toBe('<h1 class="big">Title</h1><p>Paragraph</p>'); + }); +}); diff --git a/test/js/bun/md/md-render-callback.test.ts b/test/js/bun/md/md-render-callback.test.ts new file mode 100644 index 0000000000..b5d5087b6e --- /dev/null +++ b/test/js/bun/md/md-render-callback.test.ts @@ -0,0 +1,258 @@ +import { describe, expect, test } from "bun:test"; 
+ +const Markdown = Bun.markdown; + +// ============================================================================ +// Bun.markdown.render() — callback-based string renderer +// ============================================================================ + +describe("Bun.markdown.render", () => { + test("returns a string", () => { + const result = Markdown.render("# Hello\n", { + heading: (children: string) => `<h1>${children}</h1>`, + }); + expect(typeof result).toBe("string"); + }); + + test("without callbacks, children pass through unchanged", () => { + const result = Markdown.render("Hello world\n"); + expect(result).toBe("Hello world"); + }); + + test("heading callback with level metadata", () => { + const result = Markdown.render("# Hello\n", { + heading: (children: string, { level }: any) => `<h${level}>${children}</h${level}>`, + paragraph: (children: string) => children, + }); + expect(result).toBe("<h1>Hello</h1>"); + }); + + test("heading levels 1-6", () => { + for (let i = 1; i <= 6; i++) { + const md = Buffer.alloc(i, "#").toString() + " Level\n"; + const result = Markdown.render(md, { + heading: (children: string, { level }: any) => `[h${level}:${children}]`, + }); + expect(result).toBe(`[h${i}:Level]`); + } + }); + + test("paragraph callback", () => { + const result = Markdown.render("Hello world\n", { + paragraph: (children: string) => `<p>${children}</p>`, + }); + expect(result).toBe("<p>Hello world</p>"); + }); + + test("strong callback", () => { + const result = Markdown.render("**bold**\n", { + strong: (children: string) => `<b>${children}</b>`, + paragraph: (children: string) => children, + }); + expect(result).toBe("<b>bold</b>"); + }); + + test("emphasis callback", () => { + const result = Markdown.render("*italic*\n", { + emphasis: (children: string) => `<i>${children}</i>`, + paragraph: (children: string) => children, + }); + expect(result).toBe("<i>italic</i>"); + }); + + test("link callback with href metadata", () => { + const result = 
Markdown.render("[click](https://example.com)\n", { + link: (children: string, { href }: any) => `<a href="${href}">${children}</a>`, + paragraph: (children: string) => children, + }); + expect(result).toBe('<a href="https://example.com">click</a>'); + }); + + test("link callback with title metadata", () => { + const result = Markdown.render('[click](https://example.com "My Title")\n', { + link: (children: string, { href, title }: any) => `<a href="${href}" title="${title}">${children}</a>`, + paragraph: (children: string) => children, + }); + expect(result).toBe('<a href="https://example.com" title="My Title">click</a>'); + }); + + test("image callback with src metadata", () => { + const result = Markdown.render("![alt text](image.png)\n", { + image: (children: string, { src }: any) => `<img src="${src}" alt="${children}" />`, + paragraph: (children: string) => children, + }); + expect(result).toBe('<img src="image.png" alt="alt text" />'); + }); + + test("code block callback with language metadata", () => { + const result = Markdown.render("```js\nconsole.log('hi');\n```\n", { + code: (children: string, meta: any) => `<pre lang="${meta?.language}">${children}</pre>`, + }); + expect(result).toBe("<pre lang=\"js\">console.log('hi');\n</pre>"); + }); + + test("code block without language", () => { + const result = Markdown.render("```\nplain code\n```\n", { + code: (children: string, meta: any) => `<pre lang="${meta?.language ?? 
"none"}">${children}</pre>`, + }); + expect(result).toBe('<pre lang="none">plain code\n</pre>'); + }); + + test("codespan callback", () => { + const result = Markdown.render("`code`\n", { + codespan: (children: string) => `<code>${children}</code>`, + paragraph: (children: string) => children, + }); + expect(result).toBe("<code>code</code>"); + }); + + test("hr callback", () => { + const result = Markdown.render("---\n", { + hr: () => "<hr />", + }); + expect(result).toBe("<hr />"); + }); + + test("blockquote callback", () => { + const result = Markdown.render("> quoted text\n", { + blockquote: (children: string) => `<blockquote>${children}</blockquote>`, + paragraph: (children: string) => `<p>${children}</p>`, + }); + expect(result).toBe("<blockquote><p>quoted text</p></blockquote>"); + }); + + test("list callbacks (ordered)", () => { + const result = Markdown.render("1. first\n2. second\n", { + list: (children: string, { ordered, start }: any) => + ordered ? `<ol start="${start}">${children}</ol>` : `<ul>${children}</ul>`, + listItem: (children: string) => `<li>${children}</li>`, + }); + expect(result).toBe('<ol start="1"><li>first</li><li>second</li></ol>'); + }); + + test("list callbacks (unordered)", () => { + const result = Markdown.render("- a\n- b\n", { + list: (children: string, { ordered }: any) => (ordered ? `<ol>${children}</ol>` : `<ul>${children}</ul>`), + listItem: (children: string) => `<li>${children}</li>`, + }); + expect(result).toBe("<ul><li>a</li><li>b</li></ul>"); + }); + + test("ordered list with start number", () => { + const result = Markdown.render("3. first\n4. 
second\n", { + list: (children: string, { start }: any) => `<ol start="${start}">${children}</ol>`, + listItem: (children: string) => `<li>${children}</li>`, + }); + expect(result).toBe('<ol start="3"><li>first</li><li>second</li></ol>'); + }); + + test("strikethrough callback", () => { + const result = Markdown.render("~~deleted~~\n", { + strikethrough: (children: string) => `<del>${children}</del>`, + paragraph: (children: string) => children, + }); + expect(result).toBe("<del>deleted</del>"); + }); + + test("text callback", () => { + const result = Markdown.render("Hello world\n", { + text: (text: string) => text.toUpperCase(), + paragraph: (children: string) => children, + }); + expect(result).toBe("HELLO WORLD"); + }); + + test("returning null omits element", () => { + const result = Markdown.render("# Title\n\n![logo](img.png)\n\nHello\n", { + image: () => null, + heading: (children: string) => children, + paragraph: (children: string) => children + "\n", + }); + expect(result).toBe("Title\nHello\n"); + }); + + test("returning undefined omits element", () => { + const result = Markdown.render("# Title\n\nHello\n", { + heading: () => undefined, + paragraph: (children: string) => children, + }); + expect(result).toBe("Hello"); + }); + + test("multiple callbacks combined", () => { + const result = Markdown.render("# Title\n\nHello **world**\n", { + heading: (children: string, { level }: any) => `<h${level} class="heading">${children}</h${level}>`, + paragraph: (children: string) => `<p class="body">${children}</p>`, + strong: (children: string) => `<strong class="bold">${children}</strong>`, + }); + expect(result).toBe('<h1 class="heading">Title</h1><p class="body">Hello <strong class="bold">world</strong></p>'); + }); + + test("stripping all formatting", () => { + const result = Markdown.render("# Hello **world**\n", { + heading: (children: string) => children, + paragraph: (children: string) => children, + strong: (children: string) => children, + emphasis: 
(children: string) => children, + link: (children: string) => children, + image: () => "", + code: (children: string) => children, + codespan: (children: string) => children, + }); + expect(result).toBe("Hello world"); + }); + + test("ANSI terminal output", () => { + const result = Markdown.render("# Hello\n\nThis is **bold** and *italic*\n", { + heading: (children: string) => `\x1b[1;4m${children}\x1b[0m\n`, + paragraph: (children: string) => children + "\n", + strong: (children: string) => `\x1b[1m${children}\x1b[22m`, + emphasis: (children: string) => `\x1b[3m${children}\x1b[23m`, + }); + expect(result).toBe("\x1b[1;4mHello\x1b[0m\nThis is \x1b[1mbold\x1b[22m and \x1b[3mitalic\x1b[23m\n"); + }); + + test("parser options work alongside callbacks", () => { + const result = Markdown.render( + "Visit www.example.com\n", + { + link: (children: string, { href }: any) => `[${children}](${href})`, + paragraph: (children: string) => children, + }, + { autolinks: true }, + ); + expect(result).toContain("[www.example.com]"); + }); + + test("headings option provides id in heading meta", () => { + const result = Markdown.render( + "## Hello World\n", + { + heading: (children: string, { level, id }: any) => `<h${level} id="${id}">${children}</h${level}>`, + }, + { headings: { ids: true } }, + ); + expect(result).toBe('<h2 id="hello-world">Hello World</h2>'); + }); + + test("table callbacks", () => { + const result = Markdown.render("| A | B |\n|---|---|\n| 1 | 2 |\n", { + table: (children: string) => `<table>${children}</table>`, + thead: (children: string) => `<thead>${children}</thead>`, + tbody: (children: string) => `<tbody>${children}</tbody>`, + tr: (children: string) => `<tr>${children}</tr>`, + th: (children: string) => `<th>${children}</th>`, + td: (children: string) => `<td>${children}</td>`, + }); + expect(result).toContain("<table>"); + expect(result).toContain("<th>A</th>"); + expect(result).toContain("<td>1</td>"); + }); + + test("entities are decoded", () => { 
+ const result = Markdown.render("&amp;\n", { + paragraph: (children: string) => children, + }); + expect(result).toBe("&"); + }); +}); diff --git a/test/js/bun/md/md-spec.test.ts b/test/js/bun/md/md-spec.test.ts new file mode 100644 index 0000000000..db0aa590ad --- /dev/null +++ b/test/js/bun/md/md-spec.test.ts @@ -0,0 +1,267 @@ +import { describe, expect, test } from "bun:test"; +import { readFileSync } from "fs"; +import { join } from "path"; + +const SPEC_DIR = import.meta.dir; + +interface SpecExample { + markdown: string; + expected: string; + line: number; + section: string; + flags: string[]; +} + +function parseSpecFile(path: string): SpecExample[] { + const content = readFileSync(path, "utf8").replace(/\r\n?/g, "\n"); + const lines = content.split("\n"); + const examples: SpecExample[] = []; + const fence = "`".repeat(32); + let i = 0; + let currentSection = ""; + + while (i < lines.length) { + const line = lines[i]; + // Track section headers + if (line.startsWith("# ") || line.startsWith("## ") || line.startsWith("### ")) { + currentSection = line.replace(/^#+\s*/, ""); + } + if (line.startsWith(fence + " example")) { + const startLine = i + 1; + i++; + // Collect markdown input (until lone "." line) + const mdLines: string[] = []; + while (i < lines.length && lines[i] !== ".") { + mdLines.push(lines[i]); + i++; + } + i++; // skip the "." + // Collect expected HTML (until closing fence) + const htmlLines: string[] = []; + while (i < lines.length && !lines[i].startsWith(fence)) { + htmlLines.push(lines[i]); + i++; + } + // Extension spec files have a second "." followed by flags (e.g. "--ftables"). + // Strip trailing ".\n--fXXX\n--fYYY\n..." from expected HTML and save flags. 
+ let expectedHtml = htmlLines.join("\n"); + let flags: string[] = []; + const flagMatch = expectedHtml.match(/\n\.\n((?:--[^\n]+\n?)+)$/); + if (flagMatch) { + expectedHtml = expectedHtml.slice(0, -flagMatch[0].length); + flags = flagMatch[1] + .trim() + .split("\n") + .flatMap((line: string) => line.split(/\s+/)) + .filter((f: string) => f.startsWith("--f")); + } + examples.push({ + markdown: mdLines.join("\n").replaceAll("\u2192", "\t"), + expected: expectedHtml.replaceAll("\u2192", "\t"), + line: startLine, + section: currentSection, + flags, + }); + } + i++; + } + return examples; +} + +const markdown = Bun.markdown; + +function renderMarkdown(md: string, flags?: string[]): string { + const options: Record<string, any> = {}; + if (flags && flags.length > 0) { + for (const flag of flags) { + // Strip --f prefix, replace - with _ + const name = flag.slice(3).replace(/-/g, "_"); + // Map autolink flags to compound option + if (name === "permissive_autolinks") { + options.autolinks = true; + } else if (name === "permissive_url_autolinks") { + if (typeof options.autolinks !== "object") options.autolinks = {}; + options.autolinks.url = true; + } else if (name === "permissive_www_autolinks") { + if (typeof options.autolinks !== "object") options.autolinks = {}; + options.autolinks.www = true; + } else if (name === "permissive_email_autolinks") { + if (typeof options.autolinks !== "object") options.autolinks = {}; + options.autolinks.email = true; + } else { + options[name] = true; + } + } + } + return markdown.html(md + "\n", options); +} + +// Normalize HTML for comparison, ported from md4c's normalize.py. 
+// This ignores insignificant output differences: +// - Whitespace around block-level tags is removed +// - Multiple whitespace chars collapsed to single space (outside <pre>) +// - Self-closing tags converted to open tags (<br /> → <br>) +function normalizeHtml(html: string): string { + const blockTags = new Set([ + "article", + "header", + "aside", + "hgroup", + "blockquote", + "hr", + "iframe", + "body", + "li", + "map", + "button", + "object", + "canvas", + "ol", + "caption", + "output", + "col", + "p", + "colgroup", + "pre", + "dd", + "progress", + "div", + "section", + "dl", + "table", + "td", + "dt", + "tbody", + "embed", + "textarea", + "fieldset", + "tfoot", + "figcaption", + "th", + "figure", + "thead", + "footer", + "tr", + "form", + "ul", + "h1", + "h2", + "h3", + "h4", + "h5", + "h6", + "video", + "script", + "style", + ]); + + let output = ""; + let lastType = "starttag"; + let lastTag = ""; + let inPre = false; + + // Simple HTML tokenizer: splits into tags and text + const tokens = html.match(/<!\[CDATA\[.*?\]\]>|<!--.*?-->|<!\S[^>]*>|<\?[^>]*>|<\/?[a-zA-Z][^>]*\/?>|[^<]+/gs) || []; + + for (const token of tokens) { + if (token.startsWith("<![CDATA")) { + output += token; + lastType = "data"; + } else if (token.startsWith("<!--")) { + output += token; + lastType = "comment"; + } else if (token.startsWith("<!") || token.startsWith("<?")) { + output += token; + lastType = "decl"; + } else if (token.startsWith("</")) { + // End tag + const tag = token.slice(2, -1).trim().toLowerCase(); + if (tag === "pre") inPre = false; + if (blockTags.has(tag)) output = output.trimEnd(); + output += `</${tag}>`; + lastTag = tag; + lastType = "endtag"; + } else if (token.startsWith("<")) { + // Start tag (possibly self-closing) + const selfClosing = token.endsWith("/>"); + const inner = token.slice(1, selfClosing ? -2 : -1).trim(); + const spaceIdx = inner.search(/[\s\/]/); + const tag = (spaceIdx === -1 ? 
inner : inner.slice(0, spaceIdx)).toLowerCase(); + + if (tag === "pre") inPre = true; + if (blockTags.has(tag)) output = output.trimEnd(); + + // Parse attributes + let attrStr = spaceIdx === -1 ? "" : inner.slice(spaceIdx).replace(/\/$/, "").trim(); + let attrs: [string, string | null][] = []; + const attrRe = /([a-zA-Z_:][a-zA-Z0-9_.:-]*)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|(\S+)))?/g; + let m; + while ((m = attrRe.exec(attrStr)) !== null) { + const name = m[1].toLowerCase(); + const value = m[2] ?? m[3] ?? m[4] ?? null; + attrs.push([name, value]); + } + attrs.sort((a, b) => a[0].localeCompare(b[0])); + + output += `<${tag}`; + for (const [k, v] of attrs) { + output += ` ${k}`; + if (v !== null) output += `="${v}"`; + } + output += ">"; + + lastTag = tag; + // Self-closing tags are treated as endtags for whitespace purposes + lastType = selfClosing ? "endtag" : "starttag"; + } else { + // Text data + let data = token; + const afterTag = lastType === "endtag" || lastType === "starttag"; + const afterBlockTag = afterTag && blockTags.has(lastTag); + + if (afterTag && lastTag === "br") data = data.replace(/^\n/, ""); + if (!inPre) data = data.replace(/\s+/g, " "); + if (afterBlockTag && !inPre) { + if (lastType === "starttag") data = data.trimStart(); + else if (lastType === "endtag") data = data.trim(); + } + + output += data; + lastType = "data"; + } + } + + return output.trim(); +} + +const specFiles = [ + { name: "CommonMark", file: "spec.txt" }, + { name: "GFM Tables", file: "spec-tables.txt" }, + { name: "GFM Strikethrough", file: "spec-strikethrough.txt" }, + { name: "GFM Tasklists", file: "spec-tasklists.txt" }, + { name: "Permissive Autolinks", file: "spec-permissive-autolinks.txt" }, + { name: "GFM", file: "spec-gfm.txt" }, + { name: "Coverage", file: "coverage.txt" }, + { name: "Regressions", file: "regressions.txt" }, +]; + +for (const { name, file } of specFiles) { + const specPath = join(SPEC_DIR, file); + let examples: SpecExample[]; + try { + examples = 
parseSpecFile(specPath); + } catch { + continue; + } + if (examples.length === 0) continue; + + describe(name, () => { + for (let i = 0; i < examples.length; i++) { + const ex = examples[i]; + test(`example ${i + 1} (line ${ex.line}): ${ex.section}`, () => { + const actual = renderMarkdown(ex.markdown, ex.flags.length > 0 ? ex.flags : undefined); + expect(normalizeHtml(actual)).toBe(normalizeHtml(ex.expected)); + }); + } + }); +} diff --git a/test/js/bun/md/regressions.txt b/test/js/bun/md/regressions.txt new file mode 100644 index 0000000000..70c2d16779 --- /dev/null +++ b/test/js/bun/md/regressions.txt @@ -0,0 +1,787 @@ + +# Regressions + +These are (mostly) unit tests collected via bug resolving. Adding them here +prevents us from reintroducing subtle bugs which we've already seen. + + +## [Issue 2](https://github.com/mity/md4c/issues/2) + +Raw HTML block: + +```````````````````````````````` example +<gi att1=tok1 att2=tok2> +. +<gi att1=tok1 att2=tok2> +```````````````````````````````` + +Inline: + +```````````````````````````````` example +foo <gi att1=tok1 att2=tok2> bar +. +<p>foo <gi att1=tok1 att2=tok2> bar</p> +```````````````````````````````` + +Inline with a line break: + +```````````````````````````````` example +foo <gi att1=tok1 +att2=tok2> bar +. +<p>foo <gi att1=tok1 +att2=tok2> bar</p> +```````````````````````````````` + + +## [Issue 4](https://github.com/mity/md4c/issues/4) + +```````````````````````````````` example +![alt text with *entity* &copy;](img.png 'title') +. +<p><img src="img.png" alt="alt text with entity ©" title="title"></p> +```````````````````````````````` + + +## [Issue 9](https://github.com/mity/md4c/issues/9) + +```````````````````````````````` example +> [foo +> bar]: /url +> +> [foo bar] +. +<blockquote> +<p><a href="/url">foo +bar</a></p> +</blockquote> +```````````````````````````````` + + +## [Issue 10](https://github.com/mity/md4c/issues/10) + +```````````````````````````````` example +[x]: +x +- <? + + x +. 
+<ul> +<li><? + +x +</li> +</ul> +```````````````````````````````` + + +## [Issue 11](https://github.com/mity/md4c/issues/11) + +```````````````````````````````` example +x [link](/url "foo &ndash; bar") x +. +<p>x <a href="/url" title="foo – bar">link</a> x</p> +```````````````````````````````` + + +## [Issue 14](https://github.com/mity/md4c/issues/14) + +```````````````````````````````` example +a***b* c* +. +<p>a*<em><em>b</em> c</em></p> +```````````````````````````````` + + +## [Issue 15](https://github.com/mity/md4c/issues/15) + +```````````````````````````````` example +***b* c* +. +<p>*<em><em>b</em> c</em></p> +```````````````````````````````` + + +## [Issue 21](https://github.com/mity/md4c/issues/21) + +```````````````````````````````` example +a*b**c* +. +<p>a<em>b**c</em></p> +```````````````````````````````` + + +## [Issue 33](https://github.com/mity/md4c/issues/33) + +```````````````````````````````` example +```&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp; +. +<pre><code class="language-&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;"></code></pre> +```````````````````````````````` + + +## [Issue 36](https://github.com/mity/md4c/issues/36) + +```````````````````````````````` example +__x_ _x___ +. +<p><em><em>x</em> <em>x</em></em>_</p> +```````````````````````````````` + + +## [Issue 39](https://github.com/mity/md4c/issues/39) + +```````````````````````````````` example +[\\]: x +. +```````````````````````````````` + + +## [Issue 40](https://github.com/mity/md4c/issues/40) + +```````````````````````````````` example +[x](url +'title' +)x +. +<p><a href="url" title="title">x</a>x</p> +```````````````````````````````` + + +## [Issue 41](https://github.com/mity/md4c/issues/41) +```````````````````````````````` example +* x|x +---|--- +. +<ul> +<li>x|x +---|---</li> +</ul> +. +--ftables +```````````````````````````````` +(Not a table, because the underline has wrong indentation and is not part of the +list item.) 
+ +```````````````````````````````` example +* x|x + ---|--- +x|x +. +<ul> +<li><table> +<thead> +<tr> +<th>x</th> +<th>x</th> +</tr> +</thead> +</table> +</li> +</ul> +<p>x|x</p> +. +--ftables +```````````````````````````````` +(Here the underline has the right indentation so the table is detected. +But the last line is not part of it due its indentation.) + + +## [Issue 42](https://github.com/mity/md4c/issues/42) + +```````````````````````````````` example +] http://x.x *x* + +|x|x| +|---|---| +|x| +. +<p>] http://x.x <em>x</em></p> +<table> +<thead> +<tr> +<th>x</th> +<th>x</th> +</tr> +</thead> +<tbody> +<tr> +<td>x</td> +<td></td> +</tr> +</tbody> +</table> +. +--ftables +```````````````````````````````` + + +## [Issue 53](https://github.com/mity/md4c/issues/53) + +```````````````````````````````` example +This is [link](http://github.com/). +. +<p>This is <a href="http://github.com/">link</a>.</p> +. +--fpermissive-url-autolinks +```````````````````````````````` + +```````````````````````````````` example +This is [link](http://github.com/)X +. +<p>This is <a href="http://github.com/">link</a>X</p> +. +--fpermissive-url-autolinks +```````````````````````````````` + + +## [Issue 65](https://github.com/mity/md4c/issues/65) + +```````````````````````````````` example +` +. +<p>`</p> +```````````````````````````````` + + +## [Issue 69](https://github.com/mity/md4c/issues/69) +```````````````````````````````` example +~`foo`~ +. +<p><del><code>foo</code></del></p> +. +--fstrikethrough +```````````````````````````````` + +```````````````````````````````` example +~*foo*~ +. +<p><del><em>foo</em></del></p> +. +--fstrikethrough +```````````````````````````````` + +```````````````````````````````` example +*~foo~* +. +<p><em><del>foo</del></em></p> +. +--fstrikethrough +```````````````````````````````` + + +## [Issue 74](https://github.com/mity/md4c/issues/74) + +```````````````````````````````` example +[f]: +- + xx +- +. 
+<pre><code>xx +</code></pre> +<ul> +<li></li> +</ul> +```````````````````````````````` + + +## [Issue 76](https://github.com/mity/md4c/issues/76) + +```````````````````````````````` example +*(http://example.com)* +. +<p><em>(<a href="http://example.com">http://example.com</a>)</em></p> +. +--fpermissive-url-autolinks +```````````````````````````````` + + +## [Issue 78](https://github.com/mity/md4c/issues/78) + +```````````````````````````````` example +[SS ẞ]: /url +[ẞ SS] +. +<p><a href="/url">ẞ SS</a></p> +```````````````````````````````` + + +## [Issue 83](https://github.com/mity/md4c/issues/83) + +```````````````````````````````` example +foo +> +. +<p>foo</p> +<blockquote> +</blockquote> + +```````````````````````````````` + + +## [Issue 95](https://github.com/mity/md4c/issues/95) + +```````````````````````````````` example +. foo +. +<p>. foo</p> +```````````````````````````````` + + +## [Issue 96](https://github.com/mity/md4c/issues/96) + +```````````````````````````````` example +[ab]: /foo +[a] [ab] [abc] +. +<p>[a] <a href="/foo">ab</a> [abc]</p> +```````````````````````````````` + +```````````````````````````````` example +[a b]: /foo +[a b] +. +<p><a href="/foo">a b</a></p> +```````````````````````````````` + + +## [Issue 97](https://github.com/mity/md4c/issues/97) + +```````````````````````````````` example +*a **b c* d** +. +<p><em>a <em><em>b c</em> d</em></em></p> + +```````````````````````````````` + + +## [Issue 100](https://github.com/mity/md4c/issues/100) + +```````````````````````````````` example +<foo@123456789012345678901234567890123456789012345678901234567890123.123456789012345678901234567890123456789012345678901234567890123> +. 
+<p><a href="mailto:foo@123456789012345678901234567890123456789012345678901234567890123.123456789012345678901234567890123456789012345678901234567890123">foo@123456789012345678901234567890123456789012345678901234567890123.123456789012345678901234567890123456789012345678901234567890123</a></p> +```````````````````````````````` + +```````````````````````````````` example +<foo@123456789012345678901234567890123456789012345678901234567890123x.123456789012345678901234567890123456789012345678901234567890123> +. +<p>&lt;foo@123456789012345678901234567890123456789012345678901234567890123x.123456789012345678901234567890123456789012345678901234567890123&gt;</p> +```````````````````````````````` +(Note the `x` here which turns it over the max. allowed length limit.) + + +## [Issue 104](https://github.com/mity/md4c/issues/104) + +```````````````````````````````` example +A | B +--- | --- +[x](url) +. +<table> +<thead> +<tr> +<th>A</th> +<th>B</th> +</tr> +</thead> +<tbody> +<tr> +<td><a href="url">x</a></td> +<td></td> +</tr> +</tbody> +</table> +. +--ftables +```````````````````````````````` + + +## [Issue 107](https://github.com/mity/md4c/issues/107) + +```````````````````````````````` example +***foo *bar baz*** +. +<p>*<strong>foo <em>bar baz</em></strong></p> + +```````````````````````````````` + + +## [Issue 124](https://github.com/mity/md4c/issues/124) + +```````````````````````````````` example +~~~ + x +~~~ + +~~~ + x +~~~ +. +<pre><code> x +</code></pre> +<pre><code> x +</code></pre> +```````````````````````````````` + + +## [Issue 131](https://github.com/mity/md4c/issues/131) + +```````````````````````````````` example +[![alt][img]][link] + +[img]: img_url +[link]: link_url +. +<p><a href="link_url"><img src="img_url" alt="alt"></a></p> +```````````````````````````````` + + +## [Issue 138](https://github.com/mity/md4c/issues/138) + +```````````````````````````````` example +| abc | def | +| --- | --- | +. 
+<table> +<thead> +<tr> +<th>abc</th> +<th>def</th> +</tr> +</thead> +</table> +. +--ftables +```````````````````````````````` + + +## [Issue 142](https://github.com/mity/md4c/issues/142) + +```````````````````````````````` example +[fooﬗ]: /url +[fooﬕ] +. +<p>[fooﬕ]</p> +```````````````````````````````` + + +## [Issue 149](https://github.com/mity/md4c/issues/149) + +```````````````````````````````` example +- <script> +- foo +bar +</script> +. +<ul> +<li><script> +</li> +<li>foo +bar +</script></li> +</ul> +```````````````````````````````` + + +## [Issue 152](https://github.com/mity/md4c/issues/152) + +```````````````````````````````` example +[http://example.com](http://example.com) +. +<p><a href="http://example.com">http://example.com</a></p> +. +--fpermissive-url-autolinks +```````````````````````````````` + + +## [Issue 190](https://github.com/mity/md4c/issues/190) + +```````````````````````````````` example +- + + foo +. +<ul> +<li></li> +</ul> +<pre><code>foo +</code></pre> +```````````````````````````````` + + +## [Issue 200](https://github.com/mity/md4c/issues/200) + +```````````````````````````````` example +<!-- foo --> + ``` + bar + ``` +. +<!-- foo --> +<pre><code>``` +bar +``` +</code></pre> +```````````````````````````````` + + +## [Issue 201](https://github.com/mity/md4c/issues/201) + +```````````````````````````````` example +foo + ``` + bar + ``` +. +<p>foo +<code>bar</code></p> +```````````````````````````````` + + +## [Issue 207](https://github.com/mity/md4c/issues/207) + +```````````````````````````````` example +<textarea> + +*foo* + +_bar_ + +</textarea> + +baz +. +<textarea> + +*foo* + +_bar_ + +</textarea> +<p>baz</p> +```````````````````````````````` + + +## [Issue 210](https://github.com/mity/md4c/issues/210) + +```````````````````````````````` example +![outer ![inner](img_inner "inner title")](img_outer "outer title") +. 
+<p><img src="img_outer" alt="outer inner" title="outer title"></p> +```````````````````````````````` + + +## [Issue 212](https://github.com/mity/md4c/issues/212) + +```````````````````````````````` example +x +|- +|[*x*]() +. +<table> +<thead> +<tr> +<th>x</th> +</tr> +</thead> +<tbody> +<tr> +<td><a href=""><em>x</em></a></td> +</tr> +</tbody> +</table> +. +--ftables +```````````````````````````````` + + +## [Issue 213](https://github.com/mity/md4c/issues/215) + +```````````````````````````````` example +x +|- +[| + +[[ ]][[![|]()]] +. +<table> +<thead> +<tr> +<th>x</th> +</tr> +</thead> +<tbody> +<tr> +<td>[</td> +</tr> +</tbody> +</table> +<p><x-wikilink data-target=" "> </x-wikilink><x-wikilink data-target="![|]()"><img src="" alt="|"></x-wikilink></p> +. +--ftables --fwiki-links +```````````````````````````````` + + +## [Issue 215](https://github.com/mity/md4c/issues/215) + +```````````````````````````````` example +title +--→ +. +<h2>title</h2> +```````````````````````````````` + + +## [Issue 216](https://github.com/mity/md4c/issues/216) + +```````````````````````````````` example +x <!A> +. +<p>x <!A></p> +```````````````````````````````` + + +## [Issue 217](https://github.com/mity/md4c/issues/217) + +```````````````````````````````` example +__!_!__ + +__!x!__ + +**!*!** + +--- + +_*__*_* + +_*xx*_* + +_*__-_- + +_*xx-_- +. +<p><strong>!_!</strong></p> +<p><strong>!x!</strong></p> +<p><strong>!*!</strong></p> +<hr /> +<p><em><em>__</em></em>*</p> +<p><em><em>xx</em></em>*</p> +<p><em>*__-</em>-</p> +<p><em>*xx-</em>-</p> +```````````````````````````````` + + +## [Issue 222](https://github.com/mity/md4c/issues/222) + +```````````````````````````````` example +~foo ~bar baz~ +. +<p>~foo <del>bar baz</del></p> +. +--fstrikethrough +```````````````````````````````` + + +## [Issue 223](https://github.com/mity/md4c/issues/223) + +If from one side (and the other has no space/newline), replace new line with +space. 
+ +```````````````````````````````` example [no-normalize] +` +foo` +. +<p><code> foo</code></p> +```````````````````````````````` + +```````````````````````````````` example [no-normalize] +`foo +` +. +<p><code>foo </code></p> +```````````````````````````````` + +If from both side, eat it. + +```````````````````````````````` example [no-normalize] +` +foo +` +. +<p><code>foo</code></p> +```````````````````````````````` + + +## [Issue 226](https://github.com/mity/md4c/issues/226) + +```````````````````````````````` example +https://example.com/ +https://example.com/dir/ +. +<p><a href="https://example.com/">https://example.com/</a> +<a href="https://example.com/dir/">https://example.com/dir/</a></p> +. +--fpermissive-url-autolinks +```````````````````````````````` + + +## [Issue 242](https://github.com/mity/md4c/issues/242) + +```````````````````````````````` example +copy ~user1/file to ~user2/file + +copy "~user1/file" to "~user2/file" +. +<p>copy ~user1/file to ~user2/file</p> +<p>copy &quot;~user1/file&quot; to &quot;~user2/file&quot;</p> +. +--fstrikethrough +```````````````````````````````` + + +## [Issue 248](https://github.com/mity/md4c/issues/248) + +(These are in spec.txt, but we need the [no-normalize] flag in order to +catch the whitespace issues.) + +```````````````````````````````` example [no-normalize] +#→Foo +. +<h1>Foo</h1> +```````````````````````````````` + +```````````````````````````````` example [no-normalize] + Foo *bar +baz*→ +==== +. +<h1>Foo <em>bar +baz</em></h1> +```````````````````````````````` + + +## [Issue 250](https://github.com/mity/md4c/issues/250) + +Handling trailing tabulator character versus hard break. + +Space + space + tab + newline is not hard break: +```````````````````````````````` example [no-normalize] +foo → +bar +. +<p>foo +bar</p> +```````````````````````````````` + +Tab + space + space + newline is hard break: +```````````````````````````````` example [no-normalize] +foo→ +bar +. 
+<p>foo<br> +bar</p> +```````````````````````````````` + diff --git a/test/js/bun/md/spec-gfm.txt b/test/js/bun/md/spec-gfm.txt new file mode 100644 index 0000000000..97800af0db --- /dev/null +++ b/test/js/bun/md/spec-gfm.txt @@ -0,0 +1,227 @@ +--- +title: GitHub Flavored Markdown Spec (GFM-specific examples) +version: '0.29' +date: '2019-04-06' +license: '[CC-BY-SA 4.0](http://creativecommons.org/licenses/by-sa/4.0/)' +... + + +# Inlines + + +## Backslash escapes + +```````````````````````````````` example +<http://example.com?find=\*> +. +<p><a href="http://example.com?find=%5C*">http://example.com?find=\*</a></p> +```````````````````````````````` + + +## Entity and numeric character references + +```````````````````````````````` example +&nbsp &x; &#; &#x; +&#987654321; +&#abcdef0; +&ThisIsNotDefined; &hi?; +. +<p>&amp;nbsp &amp;x; &amp;#; &amp;#x; +&amp;#987654321; +&amp;#abcdef0; +&amp;ThisIsNotDefined; &amp;hi?;</p> +```````````````````````````````` + + +## Code spans + +```````````````````````````````` example +`<http://foo.bar.`baz>` +. +<p><code>&lt;http://foo.bar.</code>baz&gt;`</p> +```````````````````````````````` + +```````````````````````````````` example +<http://foo.bar.`baz>` +. +<p><a href="http://foo.bar.%60baz">http://foo.bar.`baz</a>`</p> +```````````````````````````````` + + +## Emphasis and strong emphasis + +```````````````````````````````` example +**a<http://foo.bar/?q=**> +. +<p>**a<a href="http://foo.bar/?q=**">http://foo.bar/?q=**</a></p> +```````````````````````````````` + +```````````````````````````````` example +__a<http://foo.bar/?q=__> +. +<p>__a<a href="http://foo.bar/?q=__">http://foo.bar/?q=__</a></p> +```````````````````````````````` + + +## Links + +```````````````````````````````` example +[link](#fragment) + +[link](http://example.com#fragment) + +[link](http://example.com?foo=3#frag) +. 
+<p><a href="#fragment">link</a></p> +<p><a href="http://example.com#fragment">link</a></p> +<p><a href="http://example.com?foo=3#frag">link</a></p> +```````````````````````````````` + +```````````````````````````````` example +[foo<http://example.com/?search=](uri)> +. +<p>[foo<a href="http://example.com/?search=%5D(uri)">http://example.com/?search=](uri)</a></p> +```````````````````````````````` + +```````````````````````````````` example +[foo *bar][ref] + +[ref]: /uri +. +<p><a href="/uri">foo *bar</a></p> +```````````````````````````````` + +```````````````````````````````` example +[foo<http://example.com/?search=][ref]> + +[ref]: /uri +. +<p>[foo<a href="http://example.com/?search=%5D%5Bref%5D">http://example.com/?search=][ref]</a></p> +```````````````````````````````` + +```````````````````````````````` example +[Толпой][Толпой] is a Russian word. + +[ТОЛПОЙ]: /url +. +<p><a href="/url">Толпой</a> is a Russian word.</p> +```````````````````````````````` + + +## Autolinks + +```````````````````````````````` example +<http://foo.bar.baz/test?q=hello&id=22&boolean> +. +<p><a href="http://foo.bar.baz/test?q=hello&amp;id=22&amp;boolean">http://foo.bar.baz/test?q=hello&amp;id=22&amp;boolean</a></p> +```````````````````````````````` + +```````````````````````````````` example +<http://../> +. +<p><a href="http://../">http://../</a></p> +```````````````````````````````` + +```````````````````````````````` example +<http://foo.bar/baz bim> +. +<p>&lt;http://foo.bar/baz bim&gt;</p> +```````````````````````````````` + +```````````````````````````````` example +<http://example.com/\[\> +. +<p><a href="http://example.com/%5C%5B%5C">http://example.com/\[\</a></p> +```````````````````````````````` + +```````````````````````````````` example +< http://foo.bar > +. +<p>&lt; http://foo.bar &gt;</p> +```````````````````````````````` + +```````````````````````````````` example +http://example.com +. 
+<p>http://example.com</p> +```````````````````````````````` + + +## Disallowed Raw HTML + +```````````````````````````````` example +<strong> <title> <style> <em> + +<blockquote> + <xmp> is disallowed. <XMP> is also disallowed. +</blockquote> +. +<p><strong> &lt;title> &lt;style> <em></p> +<blockquote> + &lt;xmp> is disallowed. &lt;XMP> is also disallowed. +</blockquote> +. +--ftag-filter +```````````````````````````````` + +```````````````````````````````` example +<script>alert('xss')</script> +. +&lt;script>alert('xss')&lt;/script> +. +--ftag-filter +```````````````````````````````` + +```````````````````````````````` example +<style> +body { color: red; } +</style> +. +&lt;style> +body { color: red; } +&lt;/style> +. +--ftag-filter +```````````````````````````````` + +```````````````````````````````` example +<textarea> +form content +</textarea> +. +&lt;textarea> +form content +&lt;/textarea> +. +--ftag-filter +```````````````````````````````` + +Inline disallowed tags are also filtered: + +```````````````````````````````` example +Inline <iframe src="x"> and <noembed> and <plaintext> tags. +. +<p>Inline &lt;iframe src="x"> and &lt;noembed> and &lt;plaintext> tags.</p> +. +--ftag-filter +```````````````````````````````` + +Allowed tags pass through: + +```````````````````````````````` example +<b>bold</b> and <em>emphasis</em> are fine. +. +<p><b>bold</b> and <em>emphasis</em> are fine.</p> +. +--ftag-filter +```````````````````````````````` + +Without the tag filter, disallowed tags pass through: + +```````````````````````````````` example +<strong> <title> <style> <em> +. 
+<p><strong> <title> <style> <em></p> +```````````````````````````````` + + diff --git a/test/js/bun/md/spec-hard-soft-breaks.txt b/test/js/bun/md/spec-hard-soft-breaks.txt new file mode 100644 index 0000000000..b33f6c783b --- /dev/null +++ b/test/js/bun/md/spec-hard-soft-breaks.txt @@ -0,0 +1,28 @@ + +# Hard Soft Breaks + +With the flag `MD_FLAG_HARD_SOFT_BREAKS`, MD4C treats all newline characters as +hard breaks. + +```````````````````````````````` example +foo +baz +. +<p>foo<br> +baz</p> +. +--fhard-soft-breaks +```````````````````````````````` + +```````````````````````````````` example +A quote from the CommonMark Spec below: + +A renderer may also provide an option to +render soft line breaks as hard line breaks. +. +<p>A quote from the CommonMark Spec below:</p> +<p>A renderer may also provide an option to<br> +render soft line breaks as hard line breaks.</p> +. +--fhard-soft-breaks +```````````````````````````````` diff --git a/test/js/bun/md/spec-latex-math.txt b/test/js/bun/md/spec-latex-math.txt new file mode 100644 index 0000000000..32849915ff --- /dev/null +++ b/test/js/bun/md/spec-latex-math.txt @@ -0,0 +1,75 @@ + +# LaTeX Math + +With the flag `MD_FLAG_LATEXMATHSPANS`, MD4C enables extension for recognition +of LaTeX style math spans. + +A math span is is any text wrapped in dollars or double dollars (`$...$` or +`$$...$$`). + +```````````````````````````````` example +$a+b=c$ Hello, world! +. +<p><x-equation>a+b=c</x-equation> Hello, world!</p> +. +--flatex-math +```````````````````````````````` + +However the LaTeX math spans cannot be nested: + +```````````````````````````````` example +$$foo $bar$ baz$$ +. +<p>$$foo <x-equation>bar</x-equation> baz$$</p> +. +--flatex-math +```````````````````````````````` + +The opening delimiter cannot be preceded with an alphanumerical character: + +```````````````````````````````` example +x$a+b=c$ +. +<p>x$a+b=c$</p> +. 
+--flatex-math +```````````````````````````````` + +Similarly the closing delimiter cannot be followed with an alphanumerical character: + +```````````````````````````````` example +$a+b=c$x +. +<p>$a+b=c$x</p> +. +--flatex-math +```````````````````````````````` + +If the double dollar sign is used, the math span is a display math span. + +```````````````````````````````` example +This is a display equation: $$\int_a^b x dx$$. +. +<p>This is a display equation: <x-equation type="display">\int_a^b x dx</x-equation>.</p> +. +--flatex-math +```````````````````````````````` + +Math spans may span multiple lines as they are normal spans: + +```````````````````````````````` example +$$ +\int_a^b +f(x) dx +$$ +. +<p><x-equation type="display"> \int_a^b f(x) dx </x-equation></p> +. +--flatex-math +```````````````````````````````` + +Note though that many (simple) renderers may output the math spans just as a +verbatim text. (This includes the HTML renderer used by the `md2html` utility.) + +Only advanced renderers which implement LaTeX math syntax can be expected to +provide better results. diff --git a/test/js/bun/md/spec-permissive-autolinks.txt b/test/js/bun/md/spec-permissive-autolinks.txt new file mode 100644 index 0000000000..f7d267194d --- /dev/null +++ b/test/js/bun/md/spec-permissive-autolinks.txt @@ -0,0 +1,248 @@ + +# Permissive Autolinks + +Standard autolinks (as per CommonMark specification) have to be decorated with +`<` and `>` so for example: + +```````````````````````````````` example +<mailto:john.doe@gmail.com> +<https://example.com> +. +<p><a href="mailto:john.doe@gmail.com">mailto:john.doe@gmail.com</a> +<a href="https://example.com">https://example.com</a></p> +```````````````````````````````` + +With flags `MD_FLAG_PERMISSIVEURLAUTOLINKS`, `MD_FLAG_PERMISSIVEWWWAUTOLINKS` +and `MD_FLAG_PERMISSIVEEMAILAUTOLINKS`, MD4C is able also to recognize autolinks +without those marks. 
+ +Example of permissive autolinks follows: + +```````````````````````````````` example +john.doe@gmail.com +https://www.example.com +www.example.com +. +<p><a href="mailto:john.doe@gmail.com">john.doe@gmail.com</a> +<a href="https://www.example.com">https://www.example.com</a> +<a href="http://www.example.com">www.example.com</a></p> +. +--fpermissive-email-autolinks +--fpermissive-url-autolinks +--fpermissive-www-autolinks +```````````````````````````````` + +However as this syntax also brings some more danger of false positives, more +strict rules apply to what characters may or may not form such autolinks. +When a need arises to use a link which does not satisfy these restrictions, +standard Markdown autolinks have to be used. + +First and foremost, these autolinks have to be delimited from surrounding text, +i.e. whitespace, beginning/end of line, or very limited punctuation must +precede and follow respectively. + +Therefore these are not autolinks because `:` precedes or follows: + +```````````````````````````````` example +:john.doe@gmail.com +:https://www.example.com +:www.example.com +. +<p>:john.doe@gmail.com +:https://www.example.com +:www.example.com</p> +. +--fpermissive-email-autolinks +--fpermissive-url-autolinks +--fpermissive-www-autolinks +```````````````````````````````` + +Allowed punctuation right before autolink includes only opening brackets `(`, +`{` or `[`: + +```````````````````````````````` example +[john.doe@gmail.com +(https://www.example.com +{www.example.com +. +<p>[<a href="mailto:john.doe@gmail.com">john.doe@gmail.com</a> +(<a href="https://www.example.com">https://www.example.com</a> +{<a href="http://www.example.com">www.example.com</a></p> +. +--fpermissive-email-autolinks +--fpermissive-url-autolinks +--fpermissive-www-autolinks +```````````````````````````````` + +Correspondingly, the respective closing brackets may follow the autolinks. 
+ +```````````````````````````````` example +john.doe@gmail.com] +https://www.example.com) +www.example.com} +. +<p><a href="mailto:john.doe@gmail.com">john.doe@gmail.com</a>] +<a href="https://www.example.com">https://www.example.com</a>) +<a href="http://www.example.com">www.example.com</a>}</p> +. +--fpermissive-email-autolinks +--fpermissive-url-autolinks +--fpermissive-www-autolinks +```````````````````````````````` + +Some other punctuation characters are also allowed after the autolink so that +the autolinks may appear at the end of a sentence or clause (`.`, `!`, `?`, +`,`, `;`): + +```````````````````````````````` example +Have you ever visited http://zombo.com? +. +<p>Have you ever visited <a href="http://zombo.com">http://zombo.com</a>?</p> +. +--fpermissive-url-autolinks +```````````````````````````````` + +Markdown emphasis mark can also precede (but only opening mark) or follow +(only closer mark): + +```````````````````````````````` example +You may contact me at **john.doe@example.com**. +. +<p>You may contact me at <strong><a href="mailto:john.doe@example.com">john.doe@example.com</a></strong>.</p> +. +--fpermissive-email-autolinks +```````````````````````````````` + +However the following is not, because in this example `*` is literal `*` and +such punctuation is not allowed before the autolink: + +```````````````````````````````` example +*john.doe@example.com + +john.doe@example.com* +. +<p>*john.doe@example.com</p> +<p>john.doe@example.com*</p> +. +--fpermissive-email-autolinks +```````````````````````````````` + +## Permissive URL Autolinks + +Permissive URL autolinks (`MD_FLAG_PERMISSIVEURLAUTOLINKS`) are formed +by mandatory URL scheme, mandatory host, optional path, optional query and +optional fragment. + +The permissive URL autolinks recognize only `http://`, `https://` and `ftp://` +as the scheme: + +```````````````````````````````` example +https://example.com +http://example.com +ftp://example.com + +ssh://example.com +. 
+<p><a href="https://example.com">https://example.com</a> +<a href="http://example.com">http://example.com</a> +<a href="ftp://example.com">ftp://example.com</a></p> +<p>ssh://example.com</p> +. +--fpermissive-url-autolinks +```````````````````````````````` + +The host is a sequence made of alphanumerical characters, `.`, `-` and `_`. +It has to include at least two components delimited with `.`, last component +has to have at least two characters, and occurrence of `.`, `-` and `_` has to +be immediately preceded and followed with a letter or digit. + +The host specification may optionally be followed with path. Path begins with +character `/` and uses it also for delimiting path components. Every path +component is made of alhanumerical characters and `.`, `-`, `_`. Once again, +any occurrence of `.`, `-`, `_` has to be surrounded with alphanumerical +character. + +```````````````````````````````` example +https://example.com/images/branding/logo_272x92.png +. +<p><a href="https://example.com/images/branding/logo_272x92.png">https://example.com/images/branding/logo_272x92.png</a></p> +. +--fpermissive-url-autolinks +```````````````````````````````` + +Then optionally query may follow. The query is made of `?` and then with +alhanumerical characters, `&`, `.`, `-`, `+`, `_`, `=`, `(` and `)`. Once again any +of those non-alhanumerical characters has to be surrounded with alpha-numerical +characters, and also brackets `(` have to be balanced `)`. + +```````````````````````````````` example +https://www.google.com/search?q=md4c+markdown +. +<p><a href="https://www.google.com/search?q=md4c+markdown">https://www.google.com/search?q=md4c+markdown</a></p> +. +--fpermissive-url-autolinks +```````````````````````````````` + +And finally there may be an optional fragment. + +```````````````````````````````` example +https://example.com#fragment +. +<p><a href="https://example.com#fragment">https://example.com#fragment</a></p> +. 
+--fpermissive-url-autolinks +```````````````````````````````` + +And finally one complex example: + +```````````````````````````````` example +http://commonmark.org + +(Visit https://encrypted.google.com/search?q=Markup+(business)) + +Anonymous FTP is available at ftp://foo.bar.baz. +. +<p><a href="http://commonmark.org">http://commonmark.org</a></p> +<p>(Visit <a href="https://encrypted.google.com/search?q=Markup+(business)">https://encrypted.google.com/search?q=Markup+(business)</a>)</p> +<p>Anonymous FTP is available at <a href="ftp://foo.bar.baz">ftp://foo.bar.baz</a>.</p> +. +--fpermissive-url-autolinks +```````````````````````````````` + + +## Permissive WWW Autolinks + +Permissive WWW autolinks (`MD_FLAG_PERMISSIVEWWWAUTOLINKS`) are very similar +to the permissive URL autolinks. Actually the only difference is that instead +of providing an explicit scheme, they have to begin with `www.`. + +```````````````````````````````` example +www.google.com/search?q=Markdown +. +<p><a href="http://www.google.com/search?q=Markdown">www.google.com/search?q=Markdown</a></p> +. +--fpermissive-www-autolinks +```````````````````````````````` + + +## Permissive E-mail Autolinks + +Permissive E-mail autolinks (`MD_FLAG_PERMISSIVEEMAILAUTOLINKS`) impose the +following limitations to the e-mail addresses: + +1. The username (before the `@`) can only use alphanumerical characters and + characters `.`, `-`, `_` and `+`. However every such non-alphanumerical + character must be immediately preceded and followed by an alphanumerical + character. + + For example this is not an auto-link because of that double underscore `__`. + + ```````````````````````````````` example + john__doe@example.com + . + <p>john__doe@example.com</p> + . + --fpermissive-email-autolinks + ```````````````````````````````` + +2. Same rules for domain as for URL and WWW autolinks apply. 
diff --git a/test/js/bun/md/spec-strikethrough.txt b/test/js/bun/md/spec-strikethrough.txt new file mode 100644 index 0000000000..06610f58e5 --- /dev/null +++ b/test/js/bun/md/spec-strikethrough.txt @@ -0,0 +1,63 @@ + +# Strike-Through + +With the flag `MD_FLAG_STRIKETHROUGH`, MD4C enables extension for recognition +of strike-through spans. + +Strike-through text is any text wrapped in one or two tildes (`~`). + +```````````````````````````````` example +~Hi~ Hello, world! +. +<p><del>Hi</del> Hello, world!</p> +. +--fstrikethrough +```````````````````````````````` + +If the length of the opener and closer doesn't match, the strike-through is +not recognized. + +```````````````````````````````` example +This ~text~~ is curious. +. +<p>This ~text~~ is curious.</p> +. +--fstrikethrough +```````````````````````````````` + +Too long tilde sequence won't be recognized: + +```````````````````````````````` example +foo ~~~bar~~~ +. +<p>foo ~~~bar~~~</p> +. +--fstrikethrough +```````````````````````````````` + +Also note the markers cannot open a strike-through span if they are followed +with a whitespace; and similarly, they cannot close the span if they are +preceded with a whitespace: + +```````````````````````````````` example +~foo ~bar +. +<p>~foo ~bar</p> +. +--fstrikethrough +```````````````````````````````` + + +As with regular emphasis delimiters, a new paragraph will cause the cessation +of parsing a strike-through: + +```````````````````````````````` example +This ~~has a + +new paragraph~~. +. +<p>This ~~has a</p> +<p>new paragraph~~.</p> +. +--fstrikethrough +```````````````````````````````` diff --git a/test/js/bun/md/spec-tables.txt b/test/js/bun/md/spec-tables.txt new file mode 100644 index 0000000000..b17b0b62f1 --- /dev/null +++ b/test/js/bun/md/spec-tables.txt @@ -0,0 +1,278 @@ + +# Tables + +With the flag `MD_FLAG_TABLES`, MD4C enables extension for recognition of +tables.
+ +Basic table example of a table with two columns and three lines (when not +counting the header) is as follows: + +```````````````````````````````` example +| Column 1 | Column 2 | +|----------|----------| +| foo | bar | +| baz | qux | +| quux | quuz | +. +<table> +<thead> +<tr><th>Column 1</th><th>Column 2</th></tr> +</thead> +<tbody> +<tr><td>foo</td><td>bar</td></tr> +<tr><td>baz</td><td>qux</td></tr> +<tr><td>quux</td><td>quuz</td></tr> +</tbody> +</table> +. +--ftables +```````````````````````````````` + +The leading and succeeding pipe characters (`|`) on each line are optional: + +```````````````````````````````` example +Column 1 | Column 2 | +---------|--------- | +foo | bar | +baz | qux | +quux | quuz | +. +<table> +<thead> +<tr><th>Column 1</th><th>Column 2</th></tr> +</thead> +<tbody> +<tr><td>foo</td><td>bar</td></tr> +<tr><td>baz</td><td>qux</td></tr> +<tr><td>quux</td><td>quuz</td></tr> +</tbody> +</table> +. +--ftables +```````````````````````````````` + +```````````````````````````````` example +| Column 1 | Column 2 +|----------|--------- +| foo | bar +| baz | qux +| quux | quuz +. +<table> +<thead> +<tr><th>Column 1</th><th>Column 2</th></tr> +</thead> +<tbody> +<tr><td>foo</td><td>bar</td></tr> +<tr><td>baz</td><td>qux</td></tr> +<tr><td>quux</td><td>quuz</td></tr> +</tbody> +</table> +. +--ftables +```````````````````````````````` + +```````````````````````````````` example +Column 1 | Column 2 +---------|--------- +foo | bar +baz | qux +quux | quuz +. +<table> +<thead> +<tr><th>Column 1</th><th>Column 2</th></tr> +</thead> +<tbody> +<tr><td>foo</td><td>bar</td></tr> +<tr><td>baz</td><td>qux</td></tr> +<tr><td>quux</td><td>quuz</td></tr> +</tbody> +</table> +. +--ftables +```````````````````````````````` + +However for one-column table, at least one pipe has to be used in the table +header underline, otherwise it would be parsed as a Setext title followed by +a paragraph. 
+ +```````````````````````````````` example +Column 1 +-------- +foo +baz +quux +. +<h2>Column 1</h2> +<p>foo +baz +quux</p> +. +--ftables +```````````````````````````````` + +Leading and trailing whitespace in a table cell is ignored and the columns do +not need to be aligned. + +```````````````````````````````` example +Column 1 |Column 2 +---|--- +foo | bar +baz| qux +quux|quuz +. +<table> +<thead> +<tr><th>Column 1</th><th>Column 2</th></tr> +</thead> +<tbody> +<tr><td>foo</td><td>bar</td></tr> +<tr><td>baz</td><td>qux</td></tr> +<tr><td>quux</td><td>quuz</td></tr> +</tbody> +</table> +. +--ftables +```````````````````````````````` + +A table can interrupt a paragraph (GFM behavior). + +```````````````````````````````` example +Lorem ipsum dolor sit amet. +| Column 1 | Column 2 +| ---------|--------- +| foo | bar +| baz | qux +| quux | quuz +. +<p>Lorem ipsum dolor sit amet.</p> +<table> +<thead> +<tr><th>Column 1</th><th>Column 2</th></tr> +</thead> +<tbody> +<tr><td>foo</td><td>bar</td></tr> +<tr><td>baz</td><td>qux</td></tr> +<tr><td>quux</td><td>quuz</td></tr> +</tbody> +</table> +```````````````````````````````` + +Similarly, paragraph cannot interrupt a table: + +```````````````````````````````` example +Column 1 | Column 2 +---------|--------- +foo | bar +baz | qux +quux | quuz +Lorem ipsum dolor sit amet. +. +<table> +<thead> +<tr><th>Column 1</th><th>Column 2</th></tr> +</thead> +<tbody> +<tr><td>foo</td><td>bar</td></tr> +<tr><td>baz</td><td>qux</td></tr> +<tr><td>quux</td><td>quuz</td></tr> +<tr><td>Lorem ipsum dolor sit amet.</td><td></td></tr> +</tbody> +</table> +. 
+--ftables +```````````````````````````````` + +The first, the last or both the first and the last dash in each column +underline can be replaced with a colon (`:`) to request left, right or middle +alignment of the respective column: + +```````````````````````````````` example +| Column 1 | Column 2 | Column 3 | Column 4 | +|----------|:---------|:--------:|---------:| +| default | left | center | right | +. +<table> +<thead> +<tr><th>Column 1</th><th align="left">Column 2</th><th align="center">Column 3</th><th align="right">Column 4</th></tr> +</thead> +<tbody> +<tr><td>default</td><td align="left">left</td><td align="center">center</td><td align="right">right</td></tr> +</tbody> +</table> +. +--ftables +```````````````````````````````` + +To include a literal pipe character in any cell, it has to be escaped. + +```````````````````````````````` example +Column 1 | Column 2 +---------|--------- +foo | bar +baz | qux \| xyzzy +quux | quuz +. +<table> +<thead> +<tr><th>Column 1</th><th>Column 2</th></tr> +</thead> +<tbody> +<tr><td>foo</td><td>bar</td></tr> +<tr><td>baz</td><td>qux | xyzzy</td></tr> +<tr><td>quux</td><td>quuz</td></tr> +</tbody> +</table> +. +--ftables +```````````````````````````````` + +The contents of each cell are parsed as inline text, which may contain any +inline Markdown spans like emphasis, strong emphasis, links etc. + +```````````````````````````````` example +Column 1 | Column 2 +---------|--------- +*foo* | bar +**baz** | [qux] +quux | [quuz](/url2) + +[qux]: /url +. +<table> +<thead> +<tr><th>Column 1</th><th>Column 2</th></tr> +</thead> +<tbody> +<tr><td><em>foo</em></td><td>bar</td></tr> +<tr><td><strong>baz</strong></td><td><a href="/url">qux</a></td></tr> +<tr><td>quux</td><td><a href="/url2">quuz</a></td></tr> +</tbody> +</table> +. +--ftables +```````````````````````````````` + +Pipes inside code spans are recognized as cell boundaries (GFM behavior).
+ +```````````````````````````````` example +Column 1 | Column 2 +---------|--------- +`foo | bar` +baz | qux +quux | quuz +. +<table> +<thead> +<tr><th>Column 1</th><th>Column 2</th></tr> +</thead> +<tbody> +<tr><td>`foo</td><td>bar`</td></tr> +<tr><td>baz</td><td>qux</td></tr> +<tr><td>quux</td><td>quuz</td></tr> +</tbody> +</table> +. +--ftables +```````````````````````````````` diff --git a/test/js/bun/md/spec-tasklists.txt b/test/js/bun/md/spec-tasklists.txt new file mode 100644 index 0000000000..59de3f144d --- /dev/null +++ b/test/js/bun/md/spec-tasklists.txt @@ -0,0 +1,127 @@ + +# Tasklists + +With the flag `MD_FLAG_TASKLISTS`, MD4C enables extension for recognition of +task lists. + +Basic task list may look as follows: + +```````````````````````````````` example + * [x] foo + * [X] bar + * [ ] baz +. +<ul> +<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled checked>foo</li> +<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled checked>bar</li> +<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled>baz</li> +</ul> +. +--ftasklists +```````````````````````````````` + +Task lists can also be in ordered lists: + +```````````````````````````````` example + 1. [x] foo + 2. [X] bar + 3. [ ] baz +. +<ol> +<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled checked>foo</li> +<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled checked>bar</li> +<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled>baz</li> +</ol> +. +--ftasklists +```````````````````````````````` + +Task lists can also be nested in ordinary lists: + +```````````````````````````````` example + * xxx: + * [x] foo + * [x] bar + * [ ] baz + * yyy: + * [ ] qux + * [x] quux + * [ ] quuz +. 
+<ul> +<li>xxx: +<ul> +<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled checked>foo</li> +<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled checked>bar</li> +<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled>baz</li> +</ul></li> +<li>yyy: +<ul> +<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled>qux</li> +<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled checked>quux</li> +<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled>quuz</li> +</ul></li> +</ul> +. +--ftasklists +```````````````````````````````` + +Or in a parent task list: + +```````````````````````````````` example + 1. [x] xxx: + * [x] foo + * [x] bar + * [ ] baz + 2. [ ] yyy: + * [ ] qux + * [x] quux + * [ ] quuz +. +<ol> +<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled checked>xxx: +<ul> +<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled checked>foo</li> +<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled checked>bar</li> +<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled>baz</li> +</ul></li> +<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled>yyy: +<ul> +<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled>qux</li> +<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled checked>quux</li> +<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled>quuz</li> +</ul></li> +</ol> +. +--ftasklists +```````````````````````````````` + +Also, ordinary lists can be nested in the task lists. 
+ +```````````````````````````````` example + * [x] xxx: + * foo + * bar + * baz + * [ ] yyy: + * qux + * quux + * quuz +. +<ul> +<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled checked>xxx: +<ul> +<li>foo</li> +<li>bar</li> +<li>baz</li> +</ul></li> +<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled>yyy: +<ul> +<li>qux</li> +<li>quux</li> +<li>quuz</li> +</ul></li> +</ul> +. +--ftasklists +```````````````````````````````` diff --git a/test/js/bun/md/spec-underline.txt b/test/js/bun/md/spec-underline.txt new file mode 100644 index 0000000000..d7e66ac61c --- /dev/null +++ b/test/js/bun/md/spec-underline.txt @@ -0,0 +1,47 @@ + +# Underline + +With the flag `MD_FLAG_UNDERLINE`, MD4C sees underscore `_` rather as a mark +denoting an underlined span rather than an ordinary emphasis (or a strong +emphasis). + +```````````````````````````````` example +_foo_ +. +<p><u>foo</u></p> +. +--funderline +```````````````````````````````` + +In sequences of multiple underscores, each single one translates into an +underline span mark. + +```````````````````````````````` example +___foo___ +. +<p><u><u><u>foo</u></u></u></p> +. +--funderline +```````````````````````````````` + +Intra-word underscores are not recognized as underline marks: + +```````````````````````````````` example +foo_bar_baz +. +<p>foo_bar_baz</p> +. +--funderline +```````````````````````````````` + +Also the parser follows the standard understanding when the underscore can +or cannot open or close a span. Therefore there is no underline in the following +example because no underline can be seen as a closing mark. + +```````````````````````````````` example +_foo _bar +. +<p>_foo _bar</p> +. 
+--funderline +```````````````````````````````` diff --git a/test/js/bun/md/spec-wiki-links.txt b/test/js/bun/md/spec-wiki-links.txt new file mode 100644 index 0000000000..0793a32c13 --- /dev/null +++ b/test/js/bun/md/spec-wiki-links.txt @@ -0,0 +1,278 @@ + +# Wiki Links + +With the flag `MD_FLAG_WIKILINKS`, MD4C recognizes wiki links. + +The simple wiki-link is a wiki-link destination enclosed in `[[` followed with +`]]`. + +```````````````````````````````` example +[[foo]] +. +<p><x-wikilink data-target="foo">foo</x-wikilink></p> +. +--fwiki-links +```````````````````````````````` + +However wiki-link may contain an explicit label, delimited from the destination +with `|`. + +```````````````````````````````` example +[[foo|bar]] +. +<p><x-wikilink data-target="foo">bar</x-wikilink></p> +. +--fwiki-links +```````````````````````````````` + +A wiki-link destination cannot be empty. + +```````````````````````````````` example +[[]] +. +<p>[[]]</p> +. +--fwiki-links +```````````````````````````````` + +```````````````````````````````` example +[[|foo]] +. +<p>[[|foo]]</p> +. +--fwiki-links +```````````````````````````````` + + +The wiki-link destination cannot contain a new line. + +```````````````````````````````` example +[[foo +bar]] +. +<p>[[foo +bar]]</p> +. +--fwiki-links +```````````````````````````````` + +```````````````````````````````` example +[[foo +bar|baz]] +. +<p>[[foo +bar|baz]]</p> +. +--fwiki-links +```````````````````````````````` + +The wiki-link destination is rendered verbatim; inline markup in it is not +recognized. + +```````````````````````````````` example +[[*foo*]] +. +<p><x-wikilink data-target="*foo*">*foo*</x-wikilink></p> +. +--fwiki-links +```````````````````````````````` + +```````````````````````````````` example +[[foo|![bar](bar.jpg)]] +. +<p><x-wikilink data-target="foo"><img src="bar.jpg" alt="bar"></x-wikilink></p> +. 
+--fwiki-links +```````````````````````````````` + +With multiple `|` delimiters, only the first one is recognized and the other +ones are part of the label. + +```````````````````````````````` example +[[foo|bar|baz]] +. +<p><x-wikilink data-target="foo">bar|baz</x-wikilink></p> +. +--fwiki-links +```````````````````````````````` + +However the delimiter `|` can be escaped with `\`. + +```````````````````````````````` example +[[foo\|bar|baz]] +. +<p><x-wikilink data-target="foo|bar">baz</x-wikilink></p> +. +--fwiki-links +```````````````````````````````` + +The label can contain inline elements. + +```````````````````````````````` example +[[foo|*bar*]] +. +<p><x-wikilink data-target="foo"><em>bar</em></x-wikilink></p> +. +--fwiki-links +```````````````````````````````` + +Empty explicit label is the same as using the implicit label; i.e. the verbatim +destination string is used as the label. + +```````````````````````````````` example +[[foo|]] +. +<p><x-wikilink data-target="foo">foo</x-wikilink></p> +. +--fwiki-links +```````````````````````````````` + +The label can span multiple lines. + +```````````````````````````````` example +[[foo|foo +bar +baz]] +. +<p><x-wikilink data-target="foo">foo +bar +baz</x-wikilink></p> +. +--fwiki-links +```````````````````````````````` + +Wiki-links have higher priority than links. + +```````````````````````````````` example +[[foo]](foo.jpg) +. +<p><x-wikilink data-target="foo">foo</x-wikilink>(foo.jpg)</p> +. +--fwiki-links +```````````````````````````````` + +```````````````````````````````` example +[foo]: /url + +[[foo]] +. +<p><x-wikilink data-target="foo">foo</x-wikilink></p> +. +--fwiki-links +```````````````````````````````` + +Wiki links can be inlined in tables. + +```````````````````````````````` example +| A | B | +|------------------|-----| +| [[foo|*bar*]] | baz | +.
+<table> +<thead> +<tr> +<th>A</th> +<th>B</th> +</tr> +</thead> +<tbody> +<tr> +<td><x-wikilink data-target="foo"><em>bar</em></x-wikilink></td> +<td>baz</td> +</tr> +</tbody> +</table> +. +--fwiki-links --ftables +```````````````````````````````` + +Wiki-links are not prioritized over images. + +```````````````````````````````` example +![[foo]](foo.jpg) +. +<p><img src="foo.jpg" alt="[foo]"></p> +. +--fwiki-links +```````````````````````````````` + +Something that may look like a wiki-link at first, but turns out not to be, +is recognized as a normal link. + +```````````````````````````````` example +[[foo] + +[foo]: /url +. +<p>[<a href="/url">foo</a></p> +. +--fwiki-links +```````````````````````````````` + +Escaping the opening `[` escapes only that one character, not the whole `[[` +opener: + +```````````````````````````````` example +\[[foo]] + +[foo]: /url +. +<p>[<a href="/url">foo</a>]</p> +. +--fwiki-links +```````````````````````````````` + +Like with other inline links, the innermost wiki-link is preferred. + +```````````````````````````````` example +[[foo[[bar]]]] +. +<p>[[foo<x-wikilink data-target="bar">bar</x-wikilink>]]</p> +. +--fwiki-links +```````````````````````````````` + +There is limit of 100 characters for the wiki-link destination. + +```````````````````````````````` example +[[12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901]] +[[12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901|foo]] +. +<p>[[12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901]] +[[12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901|foo]]</p> +. +--fwiki-links +```````````````````````````````` + +100 characters inside a wiki link target works. 
+ +```````````````````````````````` example +[[1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890]] +[[1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890|foo]] +. +<p><x-wikilink data-target="1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890">1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890</x-wikilink> +<x-wikilink data-target="1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890">foo</x-wikilink></p> +. +--fwiki-links +```````````````````````````````` + +The limit on link content does not include any characters belonging to a block +quote, if the label spans multiple lines contained in a block quote. + +```````````````````````````````` example +> [[12345678901234567890123456789012345678901234567890|1234567890 +> 1234567890 +> 1234567890 +> 1234567890 +> 123456789]] +. +<blockquote> +<p><x-wikilink data-target="12345678901234567890123456789012345678901234567890">1234567890 +1234567890 +1234567890 +1234567890 +123456789</x-wikilink></p> +</blockquote> +. +--fwiki-links +```````````````````````````````` diff --git a/test/js/bun/md/spec.txt b/test/js/bun/md/spec.txt new file mode 100644 index 0000000000..f1fab281e9 --- /dev/null +++ b/test/js/bun/md/spec.txt @@ -0,0 +1,9756 @@ +--- +title: CommonMark Spec +author: John MacFarlane +version: '0.31.2' +date: '2024-01-28' +license: '[CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/)' +... + +# Introduction + +## What is Markdown? + +Markdown is a plain text format for writing structured documents, +based on conventions for indicating formatting in email +and usenet posts. 
It was developed by John Gruber (with +help from Aaron Swartz) and released in 2004 in the form of a +[syntax description](https://daringfireball.net/projects/markdown/syntax) +and a Perl script (`Markdown.pl`) for converting Markdown to +HTML. In the next decade, dozens of implementations were +developed in many languages. Some extended the original +Markdown syntax with conventions for footnotes, tables, and +other document elements. Some allowed Markdown documents to be +rendered in formats other than HTML. Websites like Reddit, +StackOverflow, and GitHub had millions of people using Markdown. +And Markdown started to be used beyond the web, to author books, +articles, slide shows, letters, and lecture notes. + +What distinguishes Markdown from many other lightweight markup +syntaxes, which are often easier to write, is its readability. +As Gruber writes: + +> The overriding design goal for Markdown's formatting syntax is +> to make it as readable as possible. The idea is that a +> Markdown-formatted document should be publishable as-is, as +> plain text, without looking like it's been marked up with tags +> or formatting instructions. +> (<https://daringfireball.net/projects/markdown/>) + +The point can be illustrated by comparing a sample of +[AsciiDoc](https://asciidoc.org/) with +an equivalent sample of Markdown. Here is a sample of +AsciiDoc from the AsciiDoc manual: + +``` +1. List item one. ++ +List item one continued with a second paragraph followed by an +Indented block. ++ +................. +$ ls *.sh +$ mv *.sh ~/tmp +................. ++ +List item continued with a third paragraph. + +2. List item two continued with an open block. ++ +-- +This paragraph is part of the preceding list item. + +a. This list is nested and does not require explicit item +continuation. ++ +This paragraph is part of the preceding list item. + +b. List item b. + +This paragraph belongs to item two of the outer list. 
+-- +``` + +And here is the equivalent in Markdown: +``` +1. List item one. + + List item one continued with a second paragraph followed by an + Indented block. + + $ ls *.sh + $ mv *.sh ~/tmp + + List item continued with a third paragraph. + +2. List item two continued with an open block. + + This paragraph is part of the preceding list item. + + 1. This list is nested and does not require explicit item continuation. + + This paragraph is part of the preceding list item. + + 2. List item b. + + This paragraph belongs to item two of the outer list. +``` + +The AsciiDoc version is, arguably, easier to write. You don't need +to worry about indentation. But the Markdown version is much easier +to read. The nesting of list items is apparent to the eye in the +source, not just in the processed document. + +## Why is a spec needed? + +John Gruber's [canonical description of Markdown's +syntax](https://daringfireball.net/projects/markdown/syntax) +does not specify the syntax unambiguously. Here are some examples of +questions it does not answer: + +1. How much indentation is needed for a sublist? The spec says that + continuation paragraphs need to be indented four spaces, but is + not fully explicit about sublists. It is natural to think that + they, too, must be indented four spaces, but `Markdown.pl` does + not require that. This is hardly a "corner case," and divergences + between implementations on this issue often lead to surprises for + users in real documents. (See [this comment by John + Gruber](https://web.archive.org/web/20170611172104/http://article.gmane.org/gmane.text.markdown.general/1997).) + +2. Is a blank line needed before a block quote or heading? + Most implementations do not require the blank line. However, + this can lead to unexpected results in hard-wrapped text, and + also to ambiguities in parsing (note that some implementations + put the heading inside the blockquote, while others do not). 
+ (John Gruber has also spoken [in favor of requiring the blank + lines](https://web.archive.org/web/20170611172104/http://article.gmane.org/gmane.text.markdown.general/2146).) + +3. Is a blank line needed before an indented code block? + (`Markdown.pl` requires it, but this is not mentioned in the + documentation, and some implementations do not require it.) + + ``` markdown + paragraph + code? + ``` + +4. What is the exact rule for determining when list items get + wrapped in `<p>` tags? Can a list be partially "loose" and partially + "tight"? What should we do with a list like this? + + ``` markdown + 1. one + + 2. two + 3. three + ``` + + Or this? + + ``` markdown + 1. one + - a + + - b + 2. two + ``` + + (There are some relevant comments by John Gruber + [here](https://web.archive.org/web/20170611172104/http://article.gmane.org/gmane.text.markdown.general/2554).) + +5. Can list markers be indented? Can ordered list markers be right-aligned? + + ``` markdown + 8. item 1 + 9. item 2 + 10. item 2a + ``` + +6. Is this one list with a thematic break in its second item, + or two lists separated by a thematic break? + + ``` markdown + * a + * * * * * + * b + ``` + +7. When list markers change from numbers to bullets, do we have + two lists or one? (The Markdown syntax description suggests two, + but the perl scripts and many other implementations produce one.) + + ``` markdown + 1. fee + 2. fie + - foe + - fum + ``` + +8. What are the precedence rules for the markers of inline structure? + For example, is the following a valid link, or does the code span + take precedence ? + + ``` markdown + [a backtick (`)](/url) and [another backtick (`)](/url). + ``` + +9. What are the precedence rules for markers of emphasis and strong + emphasis? For example, how should the following be parsed? + + ``` markdown + *foo *bar* baz* + ``` + +10. What are the precedence rules between block-level and inline-level + structure? For example, how should the following be parsed? 
+ + ``` markdown + - `a long code span can contain a hyphen like this + - and it can screw things up` + ``` + +11. Can list items include section headings? (`Markdown.pl` does not + allow this, but does allow blockquotes to include headings.) + + ``` markdown + - # Heading + ``` + +12. Can list items be empty? + + ``` markdown + * a + * + * b + ``` + +13. Can link references be defined inside block quotes or list items? + + ``` markdown + > Blockquote [foo]. + > + > [foo]: /url + ``` + +14. If there are multiple definitions for the same reference, which takes + precedence? + + ``` markdown + [foo]: /url1 + [foo]: /url2 + + [foo][] + ``` + +In the absence of a spec, early implementers consulted `Markdown.pl` +to resolve these ambiguities. But `Markdown.pl` was quite buggy, and +gave manifestly bad results in many cases, so it was not a +satisfactory replacement for a spec. + +Because there is no unambiguous spec, implementations have diverged +considerably. As a result, users are often surprised to find that +a document that renders one way on one system (say, a GitHub wiki) +renders differently on another (say, converting to docbook using +pandoc). To make matters worse, because nothing in Markdown counts +as a "syntax error," the divergence often isn't discovered right away. + +## About this document + +This document attempts to specify Markdown syntax unambiguously. +It contains many examples with side-by-side Markdown and +HTML. These are intended to double as conformance tests. An +accompanying script `spec_tests.py` can be used to run the tests +against any Markdown program: + + python test/spec_tests.py --spec spec.txt --program PROGRAM + +Since this document describes how Markdown is to be parsed into +an abstract syntax tree, it would have made sense to use an abstract +representation of the syntax tree instead of HTML. 
But HTML is capable +of representing the structural distinctions we need to make, and the +choice of HTML for the tests makes it possible to run the tests against +an implementation without writing an abstract syntax tree renderer. + +Note that not every feature of the HTML samples is mandated by +the spec. For example, the spec says what counts as a link +destination, but it doesn't mandate that non-ASCII characters in +the URL be percent-encoded. To use the automatic tests, +implementers will need to provide a renderer that conforms to +the expectations of the spec examples (percent-encoding +non-ASCII characters in URLs). But a conforming implementation +can use a different renderer and may choose not to +percent-encode non-ASCII characters in URLs. + +This document is generated from a text file, `spec.txt`, written +in Markdown with a small extension for the side-by-side tests. +The script `tools/makespec.py` can be used to convert `spec.txt` into +HTML or CommonMark (which can then be converted into other formats). + +In the examples, the `→` character is used to represent tabs. + +# Preliminaries + +## Characters and lines + +Any sequence of [characters] is a valid CommonMark +document. + +A [character](@) is a Unicode code point. Although some +code points (for example, combining accents) do not correspond to +characters in an intuitive sense, all code points count as characters +for purposes of this spec. + +This spec does not specify an encoding; it thinks of lines as composed +of [characters] rather than bytes. A conforming parser may be limited +to a certain encoding. + +A [line](@) is a sequence of zero or more [characters] +other than line feed (`U+000A`) or carriage return (`U+000D`), +followed by a [line ending] or by the end of file. + +A [line ending](@) is a line feed (`U+000A`), a carriage return +(`U+000D`) not followed by a line feed, or a carriage return and a +following line feed. 
+ +A line containing no characters, or a line containing only spaces +(`U+0020`) or tabs (`U+0009`), is called a [blank line](@). + +The following definitions of character classes will be used in this spec: + +A [Unicode whitespace character](@) is a character in the Unicode `Zs` general +category, or a tab (`U+0009`), line feed (`U+000A`), form feed (`U+000C`), or +carriage return (`U+000D`). + +[Unicode whitespace](@) is a sequence of one or more +[Unicode whitespace characters]. + +A [tab](@) is `U+0009`. + +A [space](@) is `U+0020`. + +An [ASCII control character](@) is a character between `U+0000–1F` (both +including) or `U+007F`. + +An [ASCII punctuation character](@) +is `!`, `"`, `#`, `$`, `%`, `&`, `'`, `(`, `)`, +`*`, `+`, `,`, `-`, `.`, `/` (U+0021–2F), +`:`, `;`, `<`, `=`, `>`, `?`, `@` (U+003A–0040), +`[`, `\`, `]`, `^`, `_`, `` ` `` (U+005B–0060), +`{`, `|`, `}`, or `~` (U+007B–007E). + +A [Unicode punctuation character](@) is a character in the Unicode `P` +(punctuation) or `S` (symbol) general categories. + +## Tabs + +Tabs in lines are not expanded to [spaces]. However, +in contexts where spaces help to define block structure, +tabs behave as if they were replaced by spaces with a tab stop +of 4 characters. + +Thus, for example, a tab can be used instead of four spaces +in an indented code block. (Note, however, that internal +tabs are passed through as literal tabs, not expanded to +spaces.) + +```````````````````````````````` example +→foo→baz→→bim +. +<pre><code>foo→baz→→bim +</code></pre> +```````````````````````````````` + +```````````````````````````````` example + →foo→baz→→bim +. +<pre><code>foo→baz→→bim +</code></pre> +```````````````````````````````` + +```````````````````````````````` example + a→a + ὐ→a +.
+<pre><code>a→a +ὐ→a +</code></pre> +```````````````````````````````` + +In the following example, a continuation paragraph of a list +item is indented with a tab; this has exactly the same effect +as indentation with four spaces would: + +```````````````````````````````` example + - foo + +→bar +. +<ul> +<li> +<p>foo</p> +<p>bar</p> +</li> +</ul> +```````````````````````````````` + +```````````````````````````````` example +- foo + +→→bar +. +<ul> +<li> +<p>foo</p> +<pre><code> bar +</code></pre> +</li> +</ul> +```````````````````````````````` + +Normally the `>` that begins a block quote may be followed +optionally by a space, which is not considered part of the +content. In the following case `>` is followed by a tab, +which is treated as if it were expanded into three spaces. +Since one of these spaces is considered part of the +delimiter, `foo` is considered to be indented six spaces +inside the block quote context, so we get an indented +code block starting with two spaces. + +```````````````````````````````` example +>→→foo +. +<blockquote> +<pre><code> foo +</code></pre> +</blockquote> +```````````````````````````````` + +```````````````````````````````` example +-→→foo +. +<ul> +<li> +<pre><code> foo +</code></pre> +</li> +</ul> +```````````````````````````````` + + +```````````````````````````````` example + foo +→bar +. +<pre><code>foo +bar +</code></pre> +```````````````````````````````` + +```````````````````````````````` example + - foo + - bar +→ - baz +. +<ul> +<li>foo +<ul> +<li>bar +<ul> +<li>baz</li> +</ul> +</li> +</ul> +</li> +</ul> +```````````````````````````````` + +```````````````````````````````` example +#→Foo +. +<h1>Foo</h1> +```````````````````````````````` + +```````````````````````````````` example +*→*→*→ +. +<hr /> +```````````````````````````````` + + +## Insecure characters + +For security reasons, the Unicode character `U+0000` must be replaced +with the REPLACEMENT CHARACTER (`U+FFFD`). 
+ + +## Backslash escapes + +Any ASCII punctuation character may be backslash-escaped: + +```````````````````````````````` example +\!\"\#\$\%\&\'\(\)\*\+\,\-\.\/\:\;\<\=\>\?\@\[\\\]\^\_\`\{\|\}\~ +. +<p>!&quot;#$%&amp;'()*+,-./:;&lt;=&gt;?@[\]^_`{|}~</p> +```````````````````````````````` + + +Backslashes before other characters are treated as literal +backslashes: + +```````````````````````````````` example +\→\A\a\ \3\φ\« +. +<p>\→\A\a\ \3\φ\«</p> +```````````````````````````````` + + +Escaped characters are treated as regular characters and do +not have their usual Markdown meanings: + +```````````````````````````````` example +\*not emphasized* +\<br/> not a tag +\[not a link](/foo) +\`not code` +1\. not a list +\* not a list +\# not a heading +\[foo]: /url "not a reference" +\&ouml; not a character entity +. +<p>*not emphasized* +&lt;br/&gt; not a tag +[not a link](/foo) +`not code` +1. not a list +* not a list +# not a heading +[foo]: /url &quot;not a reference&quot; +&amp;ouml; not a character entity</p> +```````````````````````````````` + + +If a backslash is itself escaped, the following character is not: + +```````````````````````````````` example +\\*emphasis* +. +<p>\<em>emphasis</em></p> +```````````````````````````````` + + +A backslash at the end of the line is a [hard line break]: + +```````````````````````````````` example +foo\ +bar +. +<p>foo<br /> +bar</p> +```````````````````````````````` + + +Backslash escapes do not work in code blocks, code spans, autolinks, or +raw HTML: + +```````````````````````````````` example +`` \[\` `` +. +<p><code>\[\`</code></p> +```````````````````````````````` + + +```````````````````````````````` example + \[\] +. +<pre><code>\[\] +</code></pre> +```````````````````````````````` + + +```````````````````````````````` example +~~~ +\[\] +~~~ +. +<pre><code>\[\] +</code></pre> +```````````````````````````````` + + +```````````````````````````````` example +<https://example.com?find=\*> +. 
+<p><a href="https://example.com?find=%5C*">https://example.com?find=\*</a></p> +```````````````````````````````` + + +```````````````````````````````` example +<a href="/bar\/)"> +. +<a href="/bar\/)"> +```````````````````````````````` + + +But they work in all other contexts, including URLs and link titles, +link references, and [info strings] in [fenced code blocks]: + +```````````````````````````````` example +[foo](/bar\* "ti\*tle") +. +<p><a href="/bar*" title="ti*tle">foo</a></p> +```````````````````````````````` + + +```````````````````````````````` example +[foo] + +[foo]: /bar\* "ti\*tle" +. +<p><a href="/bar*" title="ti*tle">foo</a></p> +```````````````````````````````` + + +```````````````````````````````` example +``` foo\+bar +foo +``` +. +<pre><code class="language-foo+bar">foo +</code></pre> +```````````````````````````````` + + +## Entity and numeric character references + +Valid HTML entity references and numeric character references +can be used in place of the corresponding Unicode character, +with the following exceptions: + +- Entity and character references are not recognized in code + blocks and code spans. + +- Entity and character references cannot stand in place of + special characters that define structural elements in + CommonMark. For example, although `&#42;` can be used + in place of a literal `*` character, `&#42;` cannot replace + `*` in emphasis delimiters, bullet list markers, or thematic + breaks. + +Conforming CommonMark parsers need not store information about +whether a particular character was represented in the source +using a Unicode character or an entity reference. + +[Entity references](@) consist of `&` + any of the valid +HTML5 entity names + `;`. The +document <https://html.spec.whatwg.org/entities.json> +is used as an authoritative source for the valid entity +references and their corresponding code points. 
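As a rough sketch of the entity reference rule just stated, the function below recognizes `&` + name + `;` and substitutes the character only when the name is known. The `NAMED` table is a tiny hand-picked subset used for illustration (a real implementation would use the full list from `entities.json`), and `decodeNamedEntities` is a hypothetical name, not a Bun or md4c API.

```js
// Hypothetical helper for illustration; not part of Bun or md4c.
// Tiny subset of the HTML5 named entities; the real list is much longer.
const NAMED = new Map([
  ["amp", "&"],
  ["copy", "\u00A9"],
  ["nbsp", "\u00A0"],
  ["ouml", "\u00F6"],
]);

// Replace `&name;` with its character when `name` is in the table.
// Unknown names and references without the trailing `;` are left as-is.
function decodeNamedEntities(text) {
  return text.replace(/&([A-Za-z][A-Za-z0-9]*);/g, (match, name) =>
    NAMED.has(name) ? NAMED.get(name) : match,
  );
}

console.log(decodeNamedEntities("&copy; &MadeUpEntity; &copy"));
// "© &MadeUpEntity; &copy"
```

The sketch returns plain text only; escaping the remaining `&` characters for HTML output is a separate rendering step.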
+ +```````````````````````````````` example +&nbsp; &amp; &copy; &AElig; &Dcaron; +&frac34; &HilbertSpace; &DifferentialD; +&ClockwiseContourIntegral; &ngE; +. +<p>  &amp; © Æ Ď +¾ ℋ ⅆ +∲ ≧̸</p> +```````````````````````````````` + + +[Decimal numeric character +references](@) +consist of `&#` + a string of 1--7 arabic digits + `;`. A +numeric character reference is parsed as the corresponding +Unicode character. Invalid Unicode code points will be replaced by +the REPLACEMENT CHARACTER (`U+FFFD`). For security reasons, +the code point `U+0000` will also be replaced by `U+FFFD`. + +```````````````````````````````` example +&#35; &#1234; &#992; &#0; +. +<p># Ӓ Ϡ �</p> +```````````````````````````````` + + +[Hexadecimal numeric character +references](@) consist of `&#` + +either `X` or `x` + a string of 1-6 hexadecimal digits + `;`. +They too are parsed as the corresponding Unicode character (this +time specified with a hexadecimal numeral instead of decimal). + +```````````````````````````````` example +&#X22; &#XD06; &#xcab; +. +<p>&quot; ആ ಫ</p> +```````````````````````````````` + + +Here are some nonentities: + +```````````````````````````````` example +&nbsp &x; &#; &#x; +&#87654321; +&#abcdef0; +&ThisIsNotDefined; &hi?; +. +<p>&amp;nbsp &amp;x; &amp;#; &amp;#x; +&amp;#87654321; +&amp;#abcdef0; +&amp;ThisIsNotDefined; &amp;hi?;</p> +```````````````````````````````` + + +Although HTML5 does accept some entity references +without a trailing semicolon (such as `&copy`), these are not +recognized here, because it makes the grammar too ambiguous: + +```````````````````````````````` example +&copy +. +<p>&amp;copy</p> +```````````````````````````````` + + +Strings that are not on the list of HTML5 named entities are not +recognized as entity references either: + +```````````````````````````````` example +&MadeUpEntity; +. 
+<p>&amp;MadeUpEntity;</p> +```````````````````````````````` + + +Entity and numeric character references are recognized in any +context besides code spans or code blocks, including +URLs, [link titles], and [fenced code block][] [info strings]: + +```````````````````````````````` example +<a href="&ouml;&ouml;.html"> +. +<a href="&ouml;&ouml;.html"> +```````````````````````````````` + + +```````````````````````````````` example +[foo](/f&ouml;&ouml; "f&ouml;&ouml;") +. +<p><a href="/f%C3%B6%C3%B6" title="föö">foo</a></p> +```````````````````````````````` + + +```````````````````````````````` example +[foo] + +[foo]: /f&ouml;&ouml; "f&ouml;&ouml;" +. +<p><a href="/f%C3%B6%C3%B6" title="föö">foo</a></p> +```````````````````````````````` + + +```````````````````````````````` example +``` f&ouml;&ouml; +foo +``` +. +<pre><code class="language-föö">foo +</code></pre> +```````````````````````````````` + + +Entity and numeric character references are treated as literal +text in code spans and code blocks: + +```````````````````````````````` example +`f&ouml;&ouml;` +. +<p><code>f&amp;ouml;&amp;ouml;</code></p> +```````````````````````````````` + + +```````````````````````````````` example + f&ouml;f&ouml; +. +<pre><code>f&amp;ouml;f&amp;ouml; +</code></pre> +```````````````````````````````` + + +Entity and numeric character references cannot be used +in place of symbols indicating structure in CommonMark +documents. + +```````````````````````````````` example +&#42;foo&#42; +*foo* +. +<p>*foo* +<em>foo</em></p> +```````````````````````````````` + +```````````````````````````````` example +&#42; foo + +* foo +. +<p>* foo</p> +<ul> +<li>foo</li> +</ul> +```````````````````````````````` + +```````````````````````````````` example +foo&#10;&#10;bar +. +<p>foo + +bar</p> +```````````````````````````````` + +```````````````````````````````` example +&#9;foo +. 
+<p>→foo</p> +```````````````````````````````` + + +```````````````````````````````` example +[a](url &quot;tit&quot;) +. +<p>[a](url &quot;tit&quot;)</p> +```````````````````````````````` + + + +# Blocks and inlines + +We can think of a document as a sequence of +[blocks](@)---structural elements like paragraphs, block +quotations, lists, headings, rules, and code blocks. Some blocks (like +block quotes and list items) contain other blocks; others (like +headings and paragraphs) contain [inline](@) content---text, +links, emphasized text, images, code spans, and so on. + +## Precedence + +Indicators of block structure always take precedence over indicators +of inline structure. So, for example, the following is a list with +two items, not a list with one item containing a code span: + +```````````````````````````````` example +- `one +- two` +. +<ul> +<li>`one</li> +<li>two`</li> +</ul> +```````````````````````````````` + + +This means that parsing can proceed in two steps: first, the block +structure of the document can be discerned; second, text lines inside +paragraphs, headings, and other block constructs can be parsed for inline +structure. The second step requires information about link reference +definitions that will be available only at the end of the first +step. Note that the first step requires processing lines in sequence, +but the second can be parallelized, since the inline parsing of +one block element does not affect the inline parsing of any other. + +## Container blocks and leaf blocks + +We can divide blocks into two types: +[container blocks](#container-blocks), +which can contain other blocks, and [leaf blocks](#leaf-blocks), +which cannot. + +# Leaf blocks + +This section describes the different kinds of leaf block that make up a +Markdown document. 
+ +## Thematic breaks + +A line consisting of optionally up to three spaces of indentation, followed by a +sequence of three or more matching `-`, `_`, or `*` characters, each followed +optionally by any number of spaces or tabs, forms a +[thematic break](@). + +```````````````````````````````` example +*** +--- +___ +. +<hr /> +<hr /> +<hr /> +```````````````````````````````` + + +Wrong characters: + +```````````````````````````````` example ++++ +. +<p>+++</p> +```````````````````````````````` + + +```````````````````````````````` example +=== +. +<p>===</p> +```````````````````````````````` + + +Not enough characters: + +```````````````````````````````` example +-- +** +__ +. +<p>-- +** +__</p> +```````````````````````````````` + + +Up to three spaces of indentation are allowed: + +```````````````````````````````` example + *** + *** + *** +. +<hr /> +<hr /> +<hr /> +```````````````````````````````` + + +Four spaces of indentation is too many: + +```````````````````````````````` example + *** +. +<pre><code>*** +</code></pre> +```````````````````````````````` + + +```````````````````````````````` example +Foo + *** +. +<p>Foo +***</p> +```````````````````````````````` + + +More than three characters may be used: + +```````````````````````````````` example +_____________________________________ +. +<hr /> +```````````````````````````````` + + +Spaces and tabs are allowed between the characters: + +```````````````````````````````` example + - - - +. +<hr /> +```````````````````````````````` + + +```````````````````````````````` example + ** * ** * ** * ** +. +<hr /> +```````````````````````````````` + + +```````````````````````````````` example +- - - - +. +<hr /> +```````````````````````````````` + + +Spaces and tabs are allowed at the end: + +```````````````````````````````` example +- - - - +. 
+<hr /> +```````````````````````````````` + + +However, no other characters may occur in the line: + +```````````````````````````````` example +_ _ _ _ a + +a------ + +---a--- +. +<p>_ _ _ _ a</p> +<p>a------</p> +<p>---a---</p> +```````````````````````````````` + + +It is required that all of the characters other than spaces or tabs be the same. +So, this is not a thematic break: + +```````````````````````````````` example + *-* +. +<p><em>-</em></p> +```````````````````````````````` + + +Thematic breaks do not need blank lines before or after: + +```````````````````````````````` example +- foo +*** +- bar +. +<ul> +<li>foo</li> +</ul> +<hr /> +<ul> +<li>bar</li> +</ul> +```````````````````````````````` + + +Thematic breaks can interrupt a paragraph: + +```````````````````````````````` example +Foo +*** +bar +. +<p>Foo</p> +<hr /> +<p>bar</p> +```````````````````````````````` + + +If a line of dashes that meets the above conditions for being a +thematic break could also be interpreted as the underline of a [setext +heading], the interpretation as a +[setext heading] takes precedence. Thus, for example, +this is a setext heading, not a paragraph followed by a thematic break: + +```````````````````````````````` example +Foo +--- +bar +. +<h2>Foo</h2> +<p>bar</p> +```````````````````````````````` + + +When both a thematic break and a list item are possible +interpretations of a line, the thematic break takes precedence: + +```````````````````````````````` example +* Foo +* * * +* Bar +. +<ul> +<li>Foo</li> +</ul> +<hr /> +<ul> +<li>Bar</li> +</ul> +```````````````````````````````` + + +If you want a thematic break in a list item, use a different bullet: + +```````````````````````````````` example +- Foo +- * * * +. 
+<ul> +<li>Foo</li> +<li> +<hr /> +</li> +</ul> +```````````````````````````````` + + +## ATX headings + +An [ATX heading](@) +consists of a string of characters, parsed as inline content, between an +opening sequence of 1--6 unescaped `#` characters and an optional +closing sequence of any number of unescaped `#` characters. +The opening sequence of `#` characters must be followed by spaces or tabs, or +by the end of line. The optional closing sequence of `#`s must be preceded by +spaces or tabs and may be followed by spaces or tabs only. The opening +`#` character may be preceded by up to three spaces of indentation. The raw +contents of the heading are stripped of leading and trailing space or tabs +before being parsed as inline content. The heading level is equal to the number +of `#` characters in the opening sequence. + +Simple headings: + +```````````````````````````````` example +# foo +## foo +### foo +#### foo +##### foo +###### foo +. +<h1>foo</h1> +<h2>foo</h2> +<h3>foo</h3> +<h4>foo</h4> +<h5>foo</h5> +<h6>foo</h6> +```````````````````````````````` + + +More than six `#` characters is not a heading: + +```````````````````````````````` example +####### foo +. +<p>####### foo</p> +```````````````````````````````` + + +At least one space or tab is required between the `#` characters and the +heading's contents, unless the heading is empty. Note that many +implementations currently do not require the space. However, the +space was required by the +[original ATX implementation](http://www.aaronsw.com/2002/atx/atx.py), +and it helps prevent things like the following from being parsed as +headings: + +```````````````````````````````` example +#5 bolt + +#hashtag +. +<p>#5 bolt</p> +<p>#hashtag</p> +```````````````````````````````` + + +This is not a heading, because the first `#` is escaped: + +```````````````````````````````` example +\## foo +. 
+<p>## foo</p> +```````````````````````````````` + + +Contents are parsed as inlines: + +```````````````````````````````` example +# foo *bar* \*baz\* +. +<h1>foo <em>bar</em> *baz*</h1> +```````````````````````````````` + + +Leading and trailing spaces or tabs are ignored in parsing inline content: + +```````````````````````````````` example +# foo +. +<h1>foo</h1> +```````````````````````````````` + + +Up to three spaces of indentation are allowed: + +```````````````````````````````` example + ### foo + ## foo + # foo +. +<h3>foo</h3> +<h2>foo</h2> +<h1>foo</h1> +```````````````````````````````` + + +Four spaces of indentation is too many: + +```````````````````````````````` example + # foo +. +<pre><code># foo +</code></pre> +```````````````````````````````` + + +```````````````````````````````` example +foo + # bar +. +<p>foo +# bar</p> +```````````````````````````````` + + +A closing sequence of `#` characters is optional: + +```````````````````````````````` example +## foo ## + ### bar ### +. +<h2>foo</h2> +<h3>bar</h3> +```````````````````````````````` + + +It need not be the same length as the opening sequence: + +```````````````````````````````` example +# foo ################################## +##### foo ## +. +<h1>foo</h1> +<h5>foo</h5> +```````````````````````````````` + + +Spaces or tabs are allowed after the closing sequence: + +```````````````````````````````` example +### foo ### +. +<h3>foo</h3> +```````````````````````````````` + + +A sequence of `#` characters with anything but spaces or tabs following it +is not a closing sequence, but counts as part of the contents of the +heading: + +```````````````````````````````` example +### foo ### b +. +<h3>foo ### b</h3> +```````````````````````````````` + + +The closing sequence must be preceded by a space or tab: + +```````````````````````````````` example +# foo# +. 
+<h1>foo#</h1> +```````````````````````````````` + + +Backslash-escaped `#` characters do not count as part +of the closing sequence: + +```````````````````````````````` example +### foo \### +## foo #\## +# foo \# +. +<h3>foo ###</h3> +<h2>foo ###</h2> +<h1>foo #</h1> +```````````````````````````````` + + +ATX headings need not be separated from surrounding content by blank +lines, and they can interrupt paragraphs: + +```````````````````````````````` example +**** +## foo +**** +. +<hr /> +<h2>foo</h2> +<hr /> +```````````````````````````````` + + +```````````````````````````````` example +Foo bar +# baz +Bar foo +. +<p>Foo bar</p> +<h1>baz</h1> +<p>Bar foo</p> +```````````````````````````````` + + +ATX headings can be empty: + +```````````````````````````````` example +## +# +### ### +. +<h2></h2> +<h1></h1> +<h3></h3> +```````````````````````````````` + + +## Setext headings + +A [setext heading](@) consists of one or more +lines of text, not interrupted by a blank line, of which the first line does not +have more than 3 spaces of indentation, followed by +a [setext heading underline]. The lines of text must be such +that, were they not followed by the setext heading underline, +they would be interpreted as a paragraph: they cannot be +interpretable as a [code fence], [ATX heading][ATX headings], +[block quote][block quotes], [thematic break][thematic breaks], +[list item][list items], or [HTML block][HTML blocks]. + +A [setext heading underline](@) is a sequence of +`=` characters or a sequence of `-` characters, with no more than 3 +spaces of indentation and any number of trailing spaces or tabs. + +The heading is a level 1 heading if `=` characters are used in +the [setext heading underline], and a level 2 heading if `-` +characters are used. The contents of the heading are the result +of parsing the preceding lines of text as CommonMark inline +content. + +In general, a setext heading need not be preceded or followed by a +blank line. 
However, it cannot interrupt a paragraph, so when a +setext heading comes after a paragraph, a blank line is needed between +them. + +Simple examples: + +```````````````````````````````` example +Foo *bar* +========= + +Foo *bar* +--------- +. +<h1>Foo <em>bar</em></h1> +<h2>Foo <em>bar</em></h2> +```````````````````````````````` + + +The content of the heading may span more than one line: + +```````````````````````````````` example +Foo *bar +baz* +==== +. +<h1>Foo <em>bar +baz</em></h1> +```````````````````````````````` + +The contents are the result of parsing the heading's raw +content as inlines. The heading's raw content is formed by +concatenating the lines and removing initial and final +spaces or tabs. + +```````````````````````````````` example + Foo *bar +baz*→ +==== +. +<h1>Foo <em>bar +baz</em></h1> +```````````````````````````````` + + +The underlining can be any length: + +```````````````````````````````` example +Foo +------------------------- + +Foo += +. +<h2>Foo</h2> +<h1>Foo</h1> +```````````````````````````````` + + +The heading content can be preceded by up to three spaces of indentation, and +need not line up with the underlining: + +```````````````````````````````` example + Foo +--- + + Foo +----- + + Foo + === +. +<h2>Foo</h2> +<h2>Foo</h2> +<h1>Foo</h1> +```````````````````````````````` + + +Four spaces of indentation is too many: + +```````````````````````````````` example + Foo + --- + + Foo +--- +. +<pre><code>Foo +--- + +Foo +</code></pre> +<hr /> +```````````````````````````````` + + +The setext heading underline can be preceded by up to three spaces of +indentation, and may have trailing spaces or tabs: + +```````````````````````````````` example +Foo + ---- +. +<h2>Foo</h2> +```````````````````````````````` + + +Four spaces of indentation is too many: + +```````````````````````````````` example +Foo + --- +. 
+<p>Foo +---</p> +```````````````````````````````` + + +The setext heading underline cannot contain internal spaces or tabs: + +```````````````````````````````` example +Foo += = + +Foo +--- - +. +<p>Foo += =</p> +<p>Foo</p> +<hr /> +```````````````````````````````` + + +Trailing spaces or tabs in the content line do not cause a hard line break: + +```````````````````````````````` example +Foo +----- +. +<h2>Foo</h2> +```````````````````````````````` + + +Nor does a backslash at the end: + +```````````````````````````````` example +Foo\ +---- +. +<h2>Foo\</h2> +```````````````````````````````` + + +Since indicators of block structure take precedence over +indicators of inline structure, the following are setext headings: + +```````````````````````````````` example +`Foo +---- +` + +<a title="a lot +--- +of dashes"/> +. +<h2>`Foo</h2> +<p>`</p> +<h2>&lt;a title=&quot;a lot</h2> +<p>of dashes&quot;/&gt;</p> +```````````````````````````````` + + +The setext heading underline cannot be a [lazy continuation +line] in a list item or block quote: + +```````````````````````````````` example +> Foo +--- +. +<blockquote> +<p>Foo</p> +</blockquote> +<hr /> +```````````````````````````````` + + +```````````````````````````````` example +> foo +bar +=== +. +<blockquote> +<p>foo +bar +===</p> +</blockquote> +```````````````````````````````` + + +```````````````````````````````` example +- Foo +--- +. +<ul> +<li>Foo</li> +</ul> +<hr /> +```````````````````````````````` + + +A blank line is needed between a paragraph and a following +setext heading, since otherwise the paragraph becomes part +of the heading's content: + +```````````````````````````````` example +Foo +Bar +--- +. +<h2>Foo +Bar</h2> +```````````````````````````````` + + +But in general a blank line is not required before or after +setext headings: + +```````````````````````````````` example +--- +Foo +--- +Bar +--- +Baz +. 
+<hr /> +<h2>Foo</h2> +<h2>Bar</h2> +<p>Baz</p> +```````````````````````````````` + + +Setext headings cannot be empty: + +```````````````````````````````` example + +==== +. +<p>====</p> +```````````````````````````````` + + +Setext heading text lines must not be interpretable as block +constructs other than paragraphs. So, the line of dashes +in these examples gets interpreted as a thematic break: + +```````````````````````````````` example +--- +--- +. +<hr /> +<hr /> +```````````````````````````````` + + +```````````````````````````````` example +- foo +----- +. +<ul> +<li>foo</li> +</ul> +<hr /> +```````````````````````````````` + + +```````````````````````````````` example + foo +--- +. +<pre><code>foo +</code></pre> +<hr /> +```````````````````````````````` + + +```````````````````````````````` example +> foo +----- +. +<blockquote> +<p>foo</p> +</blockquote> +<hr /> +```````````````````````````````` + + +If you want a heading with `> foo` as its literal text, you can +use backslash escapes: + +```````````````````````````````` example +\> foo +------ +. +<h2>&gt; foo</h2> +```````````````````````````````` + + +**Compatibility note:** Most existing Markdown implementations +do not allow the text of setext headings to span multiple lines. +But there is no consensus about how to interpret + +``` markdown +Foo +bar +--- +baz +``` + +One can find four different interpretations: + +1. paragraph "Foo", heading "bar", paragraph "baz" +2. paragraph "Foo bar", thematic break, paragraph "baz" +3. paragraph "Foo bar --- baz" +4. heading "Foo bar", paragraph "baz" + +We find interpretation 4 most natural, and interpretation 4 +increases the expressive power of CommonMark, by allowing +multiline headings. Authors who want interpretation 1 can +put a blank line after the first paragraph: + +```````````````````````````````` example +Foo + +bar +--- +baz +. 
+<p>Foo</p> +<h2>bar</h2> +<p>baz</p> +```````````````````````````````` + + +Authors who want interpretation 2 can put blank lines around +the thematic break, + +```````````````````````````````` example +Foo +bar + +--- + +baz +. +<p>Foo +bar</p> +<hr /> +<p>baz</p> +```````````````````````````````` + + +or use a thematic break that cannot count as a [setext heading +underline], such as + +```````````````````````````````` example +Foo +bar +* * * +baz +. +<p>Foo +bar</p> +<hr /> +<p>baz</p> +```````````````````````````````` + + +Authors who want interpretation 3 can use backslash escapes: + +```````````````````````````````` example +Foo +bar +\--- +baz +. +<p>Foo +bar +--- +baz</p> +```````````````````````````````` + + +## Indented code blocks + +An [indented code block](@) is composed of one or more +[indented chunks] separated by blank lines. +An [indented chunk](@) is a sequence of non-blank lines, +each preceded by four or more spaces of indentation. The contents of the code +block are the literal contents of the lines, including trailing +[line endings], minus four spaces of indentation. +An indented code block has no [info string]. + +An indented code block cannot interrupt a paragraph, so there must be +a blank line between a paragraph and a following indented code block. +(A blank line is not needed, however, between a code block and a following +paragraph.) + +```````````````````````````````` example + a simple + indented code block +. +<pre><code>a simple + indented code block +</code></pre> +```````````````````````````````` + + +If there is any ambiguity between an interpretation of indentation +as a code block and as indicating that material belongs to a [list +item][list items], the list item interpretation takes precedence: + +```````````````````````````````` example + - foo + + bar +. +<ul> +<li> +<p>foo</p> +<p>bar</p> +</li> +</ul> +```````````````````````````````` + + +```````````````````````````````` example +1. foo + + - bar +. 
+<ol> +<li> +<p>foo</p> +<ul> +<li>bar</li> +</ul> +</li> +</ol> +```````````````````````````````` + + + +The contents of a code block are literal text, and do not get parsed +as Markdown: + +```````````````````````````````` example + <a/> + *hi* + + - one +. +<pre><code>&lt;a/&gt; +*hi* + +- one +</code></pre> +```````````````````````````````` + + +Here we have three chunks separated by blank lines: + +```````````````````````````````` example + chunk1 + + chunk2 + + + + chunk3 +. +<pre><code>chunk1 + +chunk2 + + + +chunk3 +</code></pre> +```````````````````````````````` + + +Any initial spaces or tabs beyond four spaces of indentation will be included in +the content, even in interior blank lines: + +```````````````````````````````` example + chunk1 + + chunk2 +. +<pre><code>chunk1 + + chunk2 +</code></pre> +```````````````````````````````` + + +An indented code block cannot interrupt a paragraph. (This +allows hanging indents and the like.) + +```````````````````````````````` example +Foo + bar + +. +<p>Foo +bar</p> +```````````````````````````````` + + +However, any non-blank line with fewer than four spaces of indentation ends +the code block immediately. So a paragraph may occur immediately +after indented code: + +```````````````````````````````` example + foo +bar +. +<pre><code>foo +</code></pre> +<p>bar</p> +```````````````````````````````` + + +And indented code can occur immediately before and after other kinds of +blocks: + +```````````````````````````````` example +# Heading + foo +Heading +------ + foo +---- +. +<h1>Heading</h1> +<pre><code>foo +</code></pre> +<h2>Heading</h2> +<pre><code>foo +</code></pre> +<hr /> +```````````````````````````````` + + +The first line can be preceded by more than four spaces of indentation: + +```````````````````````````````` example + foo + bar +. 
+<pre><code> foo +bar +</code></pre> +```````````````````````````````` + + +Blank lines preceding or following an indented code block +are not included in it: + +```````````````````````````````` example + + + foo + + +. +<pre><code>foo +</code></pre> +```````````````````````````````` + + +Trailing spaces or tabs are included in the code block's content: + +```````````````````````````````` example + foo +. +<pre><code>foo +</code></pre> +```````````````````````````````` + + + +## Fenced code blocks + +A [code fence](@) is a sequence +of at least three consecutive backtick characters (`` ` ``) or +tildes (`~`). (Tildes and backticks cannot be mixed.) +A [fenced code block](@) +begins with a code fence, preceded by up to three spaces of indentation. + +The line with the opening code fence may optionally contain some text +following the code fence; this is trimmed of leading and trailing +spaces or tabs and called the [info string](@). If the [info string] comes +after a backtick fence, it may not contain any backtick +characters. (The reason for this restriction is that otherwise +some inline code would be incorrectly interpreted as the +beginning of a fenced code block.) + +The content of the code block consists of all subsequent lines, until +a closing [code fence] of the same type as the code block +began with (backticks or tildes), and with at least as many backticks +or tildes as the opening code fence. If the leading code fence is +preceded by N spaces of indentation, then up to N spaces of indentation are +removed from each line of the content (if present). (If a content line is not +indented, it is preserved unchanged. If it is indented N spaces or less, all +of the indentation is removed.) + +The closing code fence may be preceded by up to three spaces of indentation, and +may be followed only by spaces or tabs, which are ignored. 
If the end of the +containing block (or document) is reached and no closing code fence +has been found, the code block contains all of the lines after the +opening code fence until the end of the containing block (or +document). (An alternative spec would require backtracking in the +event that a closing code fence is not found. But this makes parsing +much less efficient, and there seems to be no real downside to the +behavior described here.) + +A fenced code block may interrupt a paragraph, and does not require +a blank line either before or after. + +The content of a code fence is treated as literal text, not parsed +as inlines. The first word of the [info string] is typically used to +specify the language of the code sample, and rendered in the `class` +attribute of the `code` tag. However, this spec does not mandate any +particular treatment of the [info string]. + +Here is a simple example with backticks: + +```````````````````````````````` example +``` +< + > +``` +. +<pre><code>&lt; + &gt; +</code></pre> +```````````````````````````````` + + +With tildes: + +```````````````````````````````` example +~~~ +< + > +~~~ +. +<pre><code>&lt; + &gt; +</code></pre> +```````````````````````````````` + +Fewer than three backticks is not enough: + +```````````````````````````````` example +`` +foo +`` +. +<p><code>foo</code></p> +```````````````````````````````` + +The closing code fence must use the same character as the opening +fence: + +```````````````````````````````` example +``` +aaa +~~~ +``` +. +<pre><code>aaa +~~~ +</code></pre> +```````````````````````````````` + + +```````````````````````````````` example +~~~ +aaa +``` +~~~ +. +<pre><code>aaa +``` +</code></pre> +```````````````````````````````` + + +The closing code fence must be at least as long as the opening fence: + +```````````````````````````````` example +```` +aaa +``` +`````` +. 
+<pre><code>aaa +``` +</code></pre> +```````````````````````````````` + + +```````````````````````````````` example +~~~~ +aaa +~~~ +~~~~ +. +<pre><code>aaa +~~~ +</code></pre> +```````````````````````````````` + + +Unclosed code blocks are closed by the end of the document +(or the enclosing [block quote][block quotes] or [list item][list items]): + +```````````````````````````````` example +``` +. +<pre><code></code></pre> +```````````````````````````````` + + +```````````````````````````````` example +````` + +``` +aaa +. +<pre><code> +``` +aaa +</code></pre> +```````````````````````````````` + + +```````````````````````````````` example +> ``` +> aaa + +bbb +. +<blockquote> +<pre><code>aaa +</code></pre> +</blockquote> +<p>bbb</p> +```````````````````````````````` + + +A code block can have all empty lines as its content: + +```````````````````````````````` example +``` + + +``` +. +<pre><code> + +</code></pre> +```````````````````````````````` + + +A code block can be empty: + +```````````````````````````````` example +``` +``` +. +<pre><code></code></pre> +```````````````````````````````` + + +Fences can be indented. If the opening fence is indented, +content lines will have equivalent opening indentation removed, +if present: + +```````````````````````````````` example + ``` + aaa +aaa +``` +. +<pre><code>aaa +aaa +</code></pre> +```````````````````````````````` + + +```````````````````````````````` example + ``` +aaa + aaa +aaa + ``` +. +<pre><code>aaa +aaa +aaa +</code></pre> +```````````````````````````````` + + +```````````````````````````````` example + ``` + aaa + aaa + aaa + ``` +. +<pre><code>aaa + aaa +aaa +</code></pre> +```````````````````````````````` + + +Four spaces of indentation is too many: + +```````````````````````````````` example + ``` + aaa + ``` +. 
+<pre><code>``` +aaa +``` +</code></pre> +```````````````````````````````` + + +Closing fences may be preceded by up to three spaces of indentation, and their +indentation need not match that of the opening fence: + +```````````````````````````````` example +``` +aaa + ``` +. +<pre><code>aaa +</code></pre> +```````````````````````````````` + + +```````````````````````````````` example + ``` +aaa + ``` +. +<pre><code>aaa +</code></pre> +```````````````````````````````` + + +This is not a closing fence, because it is indented 4 spaces: + +```````````````````````````````` example +``` +aaa + ``` +. +<pre><code>aaa + ``` +</code></pre> +```````````````````````````````` + + + +Code fences (opening and closing) cannot contain internal spaces or tabs: + +```````````````````````````````` example +``` ``` +aaa +. +<p><code> </code> +aaa</p> +```````````````````````````````` + + +```````````````````````````````` example +~~~~~~ +aaa +~~~ ~~ +. +<pre><code>aaa +~~~ ~~ +</code></pre> +```````````````````````````````` + + +Fenced code blocks can interrupt paragraphs, and can be followed +directly by paragraphs, without a blank line between: + +```````````````````````````````` example +foo +``` +bar +``` +baz +. +<p>foo</p> +<pre><code>bar +</code></pre> +<p>baz</p> +```````````````````````````````` + + +Other blocks can also occur before and after fenced code blocks +without an intervening blank line: + +```````````````````````````````` example +foo +--- +~~~ +bar +~~~ +# baz +. +<h2>foo</h2> +<pre><code>bar +</code></pre> +<h1>baz</h1> +```````````````````````````````` + + +An [info string] can be provided after the opening code fence. +Although this spec doesn't mandate any particular treatment of +the info string, the first word is typically used to specify +the language of the code block. In HTML output, the language is +normally indicated by adding a class to the `code` element consisting +of `language-` followed by the language name. 
+ +```````````````````````````````` example +```ruby +def foo(x) + return 3 +end +``` +. +<pre><code class="language-ruby">def foo(x) + return 3 +end +</code></pre> +```````````````````````````````` + + +```````````````````````````````` example +~~~~ ruby startline=3 $%@#$ +def foo(x) + return 3 +end +~~~~~~~ +. +<pre><code class="language-ruby">def foo(x) + return 3 +end +</code></pre> +```````````````````````````````` + + +```````````````````````````````` example +````; +```` +. +<pre><code class="language-;"></code></pre> +```````````````````````````````` + + +[Info strings] for backtick code blocks cannot contain backticks: + +```````````````````````````````` example +``` aa ``` +foo +. +<p><code>aa</code> +foo</p> +```````````````````````````````` + + +[Info strings] for tilde code blocks can contain backticks and tildes: + +```````````````````````````````` example +~~~ aa ``` ~~~ +foo +~~~ +. +<pre><code class="language-aa">foo +</code></pre> +```````````````````````````````` + + +Closing code fences cannot have [info strings]: + +```````````````````````````````` example +``` +``` aaa +``` +. +<pre><code>``` aaa +</code></pre> +```````````````````````````````` + + + +## HTML blocks + +An [HTML block](@) is a group of lines that is treated +as raw HTML (and will not be escaped in HTML output). + +There are seven kinds of [HTML block], which can be defined by their +start and end conditions. The block begins with a line that meets a +[start condition](@) (after up to three optional spaces of indentation). +It ends with the first subsequent line that meets a matching +[end condition](@), or the last line of the document, or the last line of +the [container block](#container-blocks) containing the current HTML +block, if no line is encountered that meets the [end condition]. If +the first line meets both the [start condition] and the [end +condition], the block will contain just that line. + +1. 
**Start condition:** line begins with the string `<pre`, +`<script`, `<style`, or `<textarea` (case-insensitive), followed by a space, +a tab, the string `>`, or the end of the line.\ +**End condition:** line contains an end tag +`</pre>`, `</script>`, `</style>`, or `</textarea>` (case-insensitive; it +need not match the start tag). + +2. **Start condition:** line begins with the string `<!--`.\ +**End condition:** line contains the string `-->`. + +3. **Start condition:** line begins with the string `<?`.\ +**End condition:** line contains the string `?>`. + +4. **Start condition:** line begins with the string `<!` +followed by an ASCII letter.\ +**End condition:** line contains the character `>`. + +5. **Start condition:** line begins with the string +`<![CDATA[`.\ +**End condition:** line contains the string `]]>`. + +6. **Start condition:** line begins with the string `<` or `</` +followed by one of the strings (case-insensitive) `address`, +`article`, `aside`, `base`, `basefont`, `blockquote`, `body`, +`caption`, `center`, `col`, `colgroup`, `dd`, `details`, `dialog`, +`dir`, `div`, `dl`, `dt`, `fieldset`, `figcaption`, `figure`, +`footer`, `form`, `frame`, `frameset`, +`h1`, `h2`, `h3`, `h4`, `h5`, `h6`, `head`, `header`, `hr`, +`html`, `iframe`, `legend`, `li`, `link`, `main`, `menu`, `menuitem`, +`nav`, `noframes`, `ol`, `optgroup`, `option`, `p`, `param`, +`search`, `section`, `summary`, `table`, `tbody`, `td`, +`tfoot`, `th`, `thead`, `title`, `tr`, `track`, `ul`, followed +by a space, a tab, the end of the line, the string `>`, or +the string `/>`.\ +**End condition:** line is followed by a [blank line]. + +7. **Start condition:** line begins with a complete [open tag] +(with any [tag name] other than `pre`, `script`, +`style`, or `textarea`) or a complete [closing tag], +followed by zero or more spaces and tabs, followed by the end of the line.\ +**End condition:** line is followed by a [blank line]. 
+ +HTML blocks continue until they are closed by their appropriate +[end condition], or the last line of the document or other [container +block](#container-blocks). This means any HTML **within an HTML +block** that might otherwise be recognised as a start condition will +be ignored by the parser and passed through as-is, without changing +the parser's state. + +For instance, `<pre>` within an HTML block started by `<table>` will not affect +the parser state; as the HTML block was started in by start condition 6, it +will end at any blank line. This can be surprising: + +```````````````````````````````` example +<table><tr><td> +<pre> +**Hello**, + +_world_. +</pre> +</td></tr></table> +. +<table><tr><td> +<pre> +**Hello**, +<p><em>world</em>. +</pre></p> +</td></tr></table> +```````````````````````````````` + +In this case, the HTML block is terminated by the blank line — the `**Hello**` +text remains verbatim — and regular parsing resumes, with a paragraph, +emphasised `world` and inline and block HTML following. + +All types of [HTML blocks] except type 7 may interrupt +a paragraph. Blocks of type 7 may not interrupt a paragraph. +(This restriction is intended to prevent unwanted interpretation +of long tags inside a wrapped paragraph as starting HTML blocks.) + +Some simple examples follow. Here are some basic HTML blocks +of type 6: + +```````````````````````````````` example +<table> + <tr> + <td> + hi + </td> + </tr> +</table> + +okay. +. +<table> + <tr> + <td> + hi + </td> + </tr> +</table> +<p>okay.</p> +```````````````````````````````` + + +```````````````````````````````` example + <div> + *hello* + <foo><a> +. + <div> + *hello* + <foo><a> +```````````````````````````````` + + +A block can also start with a closing tag: + +```````````````````````````````` example +</div> +*foo* +. 
+</div> +*foo* +```````````````````````````````` + + +Here we have two HTML blocks with a Markdown paragraph between them: + +```````````````````````````````` example +<DIV CLASS="foo"> + +*Markdown* + +</DIV> +. +<DIV CLASS="foo"> +<p><em>Markdown</em></p> +</DIV> +```````````````````````````````` + + +The tag on the first line can be partial, as long +as it is split where there would be whitespace: + +```````````````````````````````` example +<div id="foo" + class="bar"> +</div> +. +<div id="foo" + class="bar"> +</div> +```````````````````````````````` + + +```````````````````````````````` example +<div id="foo" class="bar + baz"> +</div> +. +<div id="foo" class="bar + baz"> +</div> +```````````````````````````````` + + +An open tag need not be closed: +```````````````````````````````` example +<div> +*foo* + +*bar* +. +<div> +*foo* +<p><em>bar</em></p> +```````````````````````````````` + + + +A partial tag need not even be completed (garbage +in, garbage out): + +```````````````````````````````` example +<div id="foo" +*hi* +. +<div id="foo" +*hi* +```````````````````````````````` + + +```````````````````````````````` example +<div class +foo +. +<div class +foo +```````````````````````````````` + + +The initial tag doesn't even need to be a valid +tag, as long as it starts like one: + +```````````````````````````````` example +<div *???-&&&-<--- +*foo* +. +<div *???-&&&-<--- +*foo* +```````````````````````````````` + + +In type 6 blocks, the initial tag need not be on a line by +itself: + +```````````````````````````````` example +<div><a href="bar">*foo*</a></div> +. +<div><a href="bar">*foo*</a></div> +```````````````````````````````` + + +```````````````````````````````` example +<table><tr><td> +foo +</td></tr></table> +. +<table><tr><td> +foo +</td></tr></table> +```````````````````````````````` + + +Everything until the next blank line or end of document +gets included in the HTML block. 
So, in the following +example, what looks like a Markdown code block +is actually part of the HTML block, which continues until a blank +line or the end of the document is reached: + +```````````````````````````````` example +<div></div> +``` c +int x = 33; +``` +. +<div></div> +``` c +int x = 33; +``` +```````````````````````````````` + + +To start an [HTML block] with a tag that is *not* in the +list of block-level tags in (6), you must put the tag by +itself on the first line (and it must be complete): + +```````````````````````````````` example +<a href="foo"> +*bar* +</a> +. +<a href="foo"> +*bar* +</a> +```````````````````````````````` + + +In type 7 blocks, the [tag name] can be anything: + +```````````````````````````````` example +<Warning> +*bar* +</Warning> +. +<Warning> +*bar* +</Warning> +```````````````````````````````` + + +```````````````````````````````` example +<i class="foo"> +*bar* +</i> +. +<i class="foo"> +*bar* +</i> +```````````````````````````````` + + +```````````````````````````````` example +</ins> +*bar* +. +</ins> +*bar* +```````````````````````````````` + + +These rules are designed to allow us to work with tags that +can function as either block-level or inline-level tags. +The `<del>` tag is a nice example. We can surround content with +`<del>` tags in three different ways. In this case, we get a raw +HTML block, because the `<del>` tag is on a line by itself: + +```````````````````````````````` example +<del> +*foo* +</del> +. +<del> +*foo* +</del> +```````````````````````````````` + + +In this case, we get a raw HTML block that just includes +the `<del>` tag (because it ends with the following blank +line). So the contents get interpreted as CommonMark: + +```````````````````````````````` example +<del> + +*foo* + +</del> +. +<del> +<p><em>foo</em></p> +</del> +```````````````````````````````` + + +Finally, in this case, the `<del>` tags are interpreted +as [raw HTML] *inside* the CommonMark paragraph. 
(Because +the tag is not on a line by itself, we get inline HTML +rather than an [HTML block].) + +```````````````````````````````` example +<del>*foo*</del> +. +<p><del><em>foo</em></del></p> +```````````````````````````````` + + +HTML tags designed to contain literal content +(`pre`, `script`, `style`, `textarea`), comments, processing instructions, +and declarations are treated somewhat differently. +Instead of ending at the first blank line, these blocks +end at the first line containing a corresponding end tag. +As a result, these blocks can contain blank lines: + +A pre tag (type 1): + +```````````````````````````````` example +<pre language="haskell"><code> +import Text.HTML.TagSoup + +main :: IO () +main = print $ parseTags tags +</code></pre> +okay +. +<pre language="haskell"><code> +import Text.HTML.TagSoup + +main :: IO () +main = print $ parseTags tags +</code></pre> +<p>okay</p> +```````````````````````````````` + + +A script tag (type 1): + +```````````````````````````````` example +<script type="text/javascript"> +// JavaScript example + +document.getElementById("demo").innerHTML = "Hello JavaScript!"; +</script> +okay +. +<script type="text/javascript"> +// JavaScript example + +document.getElementById("demo").innerHTML = "Hello JavaScript!"; +</script> +<p>okay</p> +```````````````````````````````` + + +A textarea tag (type 1): + +```````````````````````````````` example +<textarea> + +*foo* + +_bar_ + +</textarea> +. +<textarea> + +*foo* + +_bar_ + +</textarea> +```````````````````````````````` + +A style tag (type 1): + +```````````````````````````````` example +<style + type="text/css"> +h1 {color:red;} + +p {color:blue;} +</style> +okay +. 
+<style + type="text/css"> +h1 {color:red;} + +p {color:blue;} +</style> +<p>okay</p> +```````````````````````````````` + + +If there is no matching end tag, the block will end at the +end of the document (or the enclosing [block quote][block quotes] +or [list item][list items]): + +```````````````````````````````` example +<style + type="text/css"> + +foo +. +<style + type="text/css"> + +foo +```````````````````````````````` + + +```````````````````````````````` example +> <div> +> foo + +bar +. +<blockquote> +<div> +foo +</blockquote> +<p>bar</p> +```````````````````````````````` + + +```````````````````````````````` example +- <div> +- foo +. +<ul> +<li> +<div> +</li> +<li>foo</li> +</ul> +```````````````````````````````` + + +The end tag can occur on the same line as the start tag: + +```````````````````````````````` example +<style>p{color:red;}</style> +*foo* +. +<style>p{color:red;}</style> +<p><em>foo</em></p> +```````````````````````````````` + + +```````````````````````````````` example +<!-- foo -->*bar* +*baz* +. +<!-- foo -->*bar* +<p><em>baz</em></p> +```````````````````````````````` + + +Note that anything on the last line after the +end tag will be included in the [HTML block]: + +```````````````````````````````` example +<script> +foo +</script>1. *bar* +. +<script> +foo +</script>1. *bar* +```````````````````````````````` + + +A comment (type 2): + +```````````````````````````````` example +<!-- Foo + +bar + baz --> +okay +. +<!-- Foo + +bar + baz --> +<p>okay</p> +```````````````````````````````` + + + +A processing instruction (type 3): + +```````````````````````````````` example +<?php + + echo '>'; + +?> +okay +. +<?php + + echo '>'; + +?> +<p>okay</p> +```````````````````````````````` + + +A declaration (type 4): + +```````````````````````````````` example +<!DOCTYPE html> +. 
+<!DOCTYPE html> +```````````````````````````````` + + +CDATA (type 5): + +```````````````````````````````` example +<![CDATA[ +function matchwo(a,b) +{ + if (a < b && a < 0) then { + return 1; + + } else { + + return 0; + } +} +]]> +okay +. +<![CDATA[ +function matchwo(a,b) +{ + if (a < b && a < 0) then { + return 1; + + } else { + + return 0; + } +} +]]> +<p>okay</p> +```````````````````````````````` + + +The opening tag can be preceded by up to three spaces of indentation, but not +four: + +```````````````````````````````` example + <!-- foo --> + + <!-- foo --> +. + <!-- foo --> +<pre><code>&lt;!-- foo --&gt; +</code></pre> +```````````````````````````````` + + +```````````````````````````````` example + <div> + + <div> +. + <div> +<pre><code>&lt;div&gt; +</code></pre> +```````````````````````````````` + + +An HTML block of types 1--6 can interrupt a paragraph, and need not be +preceded by a blank line. + +```````````````````````````````` example +Foo +<div> +bar +</div> +. +<p>Foo</p> +<div> +bar +</div> +```````````````````````````````` + + +However, a following blank line is needed, except at the end of +a document, and except for blocks of types 1--5, [above][HTML +block]: + +```````````````````````````````` example +<div> +bar +</div> +*foo* +. +<div> +bar +</div> +*foo* +```````````````````````````````` + + +HTML blocks of type 7 cannot interrupt a paragraph: + +```````````````````````````````` example +Foo +<a href="bar"> +baz +. +<p>Foo +<a href="bar"> +baz</p> +```````````````````````````````` + + +This rule differs from John Gruber's original Markdown syntax +specification, which says: + +> The only restrictions are that block-level HTML elements — +> e.g. `<div>`, `<table>`, `<pre>`, `<p>`, etc. — must be separated from +> surrounding content by blank lines, and the start and end tags of the +> block should not be indented with spaces or tabs. 
+ +In some ways Gruber's rule is more restrictive than the one given +here: + +- It requires that an HTML block be preceded by a blank line. +- It does not allow the start tag to be indented. +- It requires a matching end tag, which it also does not allow to + be indented. + +Most Markdown implementations (including some of Gruber's own) do not +respect all of these restrictions. + +There is one respect, however, in which Gruber's rule is more liberal +than the one given here, since it allows blank lines to occur inside +an HTML block. There are two reasons for disallowing them here. +First, it removes the need to parse balanced tags, which is +expensive and can require backtracking from the end of the document +if no matching end tag is found. Second, it provides a very simple +and flexible way of including Markdown content inside HTML tags: +simply separate the Markdown from the HTML using blank lines: + +Compare: + +```````````````````````````````` example +<div> + +*Emphasized* text. + +</div> +. +<div> +<p><em>Emphasized</em> text.</p> +</div> +```````````````````````````````` + + +```````````````````````````````` example +<div> +*Emphasized* text. +</div> +. +<div> +*Emphasized* text. +</div> +```````````````````````````````` + + +Some Markdown implementations have adopted a convention of +interpreting content inside tags as text if the open tag has +the attribute `markdown=1`. The rule given above seems a simpler and +more elegant way of achieving the same expressive power, which is also +much simpler to parse. + +The main potential drawback is that one can no longer paste HTML +blocks into Markdown documents with 100% reliability. However, +*in most cases* this will work fine, because the blank lines in +HTML are usually followed by HTML block tags. For example: + +```````````````````````````````` example +<table> + +<tr> + +<td> +Hi +</td> + +</tr> + +</table> +. 
+<table> +<tr> +<td> +Hi +</td> +</tr> +</table> +```````````````````````````````` + + +There are problems, however, if the inner tags are indented +*and* separated by spaces, as then they will be interpreted as +an indented code block: + +```````````````````````````````` example +<table> + + <tr> + + <td> + Hi + </td> + + </tr> + +</table> +. +<table> + <tr> +<pre><code>&lt;td&gt; + Hi +&lt;/td&gt; +</code></pre> + </tr> +</table> +```````````````````````````````` + + +Fortunately, blank lines are usually not necessary and can be +deleted. The exception is inside `<pre>` tags, but as described +[above][HTML blocks], raw HTML blocks starting with `<pre>` +*can* contain blank lines. + +## Link reference definitions + +A [link reference definition](@) +consists of a [link label], optionally preceded by up to three spaces of +indentation, followed +by a colon (`:`), optional spaces or tabs (including up to one +[line ending]), a [link destination], +optional spaces or tabs (including up to one +[line ending]), and an optional [link +title], which if it is present must be separated +from the [link destination] by spaces or tabs. +No further character may occur. + +A [link reference definition] +does not correspond to a structural element of a document. Instead, it +defines a label which can be used in [reference links] +and reference-style [images] elsewhere in the document. [Link +reference definitions] can come either before or after the links that use +them. + +```````````````````````````````` example +[foo]: /url "title" + +[foo] +. +<p><a href="/url" title="title">foo</a></p> +```````````````````````````````` + + +```````````````````````````````` example + [foo]: + /url + 'the title' + +[foo] +. +<p><a href="/url" title="the title">foo</a></p> +```````````````````````````````` + + +```````````````````````````````` example +[Foo*bar\]]:my_(url) 'title (with parens)' + +[Foo*bar\]] +. 
+<p><a href="my_(url)" title="title (with parens)">Foo*bar]</a></p> +```````````````````````````````` + + +```````````````````````````````` example +[Foo bar]: +<my url> +'title' + +[Foo bar] +. +<p><a href="my%20url" title="title">Foo bar</a></p> +```````````````````````````````` + + +The title may extend over multiple lines: + +```````````````````````````````` example +[foo]: /url ' +title +line1 +line2 +' + +[foo] +. +<p><a href="/url" title=" +title +line1 +line2 +">foo</a></p> +```````````````````````````````` + + +However, it may not contain a [blank line]: + +```````````````````````````````` example +[foo]: /url 'title + +with blank line' + +[foo] +. +<p>[foo]: /url 'title</p> +<p>with blank line'</p> +<p>[foo]</p> +```````````````````````````````` + + +The title may be omitted: + +```````````````````````````````` example +[foo]: +/url + +[foo] +. +<p><a href="/url">foo</a></p> +```````````````````````````````` + + +The link destination may not be omitted: + +```````````````````````````````` example +[foo]: + +[foo] +. +<p>[foo]:</p> +<p>[foo]</p> +```````````````````````````````` + + However, an empty link destination may be specified using + angle brackets: + +```````````````````````````````` example +[foo]: <> + +[foo] +. +<p><a href="">foo</a></p> +```````````````````````````````` + +The title must be separated from the link destination by +spaces or tabs: + +```````````````````````````````` example +[foo]: <bar>(baz) + +[foo] +. +<p>[foo]: <bar>(baz)</p> +<p>[foo]</p> +```````````````````````````````` + + +Both title and destination can contain backslash escapes +and literal backslashes: + +```````````````````````````````` example +[foo]: /url\bar\*baz "foo\"bar\baz" + +[foo] +. +<p><a href="/url%5Cbar*baz" title="foo&quot;bar\baz">foo</a></p> +```````````````````````````````` + + +A link can come before its corresponding definition: + +```````````````````````````````` example +[foo] + +[foo]: url +. 
+<p><a href="url">foo</a></p> +```````````````````````````````` + + +If there are several matching definitions, the first one takes +precedence: + +```````````````````````````````` example +[foo] + +[foo]: first +[foo]: second +. +<p><a href="first">foo</a></p> +```````````````````````````````` + + +As noted in the section on [Links], matching of labels is +case-insensitive (see [matches]). + +```````````````````````````````` example +[FOO]: /url + +[Foo] +. +<p><a href="/url">Foo</a></p> +```````````````````````````````` + + +```````````````````````````````` example +[ΑΓΩ]: /φου + +[αγω] +. +<p><a href="/%CF%86%CE%BF%CF%85">αγω</a></p> +```````````````````````````````` + + +Whether something is a [link reference definition] is +independent of whether the link reference it defines is +used in the document. Thus, for example, the following +document contains just a link reference definition, and +no visible content: + +```````````````````````````````` example +[foo]: /url +. +```````````````````````````````` + + +Here is another one: + +```````````````````````````````` example +[ +foo +]: /url +bar +. +<p>bar</p> +```````````````````````````````` + + +This is not a link reference definition, because there are +characters other than spaces or tabs after the title: + +```````````````````````````````` example +[foo]: /url "title" ok +. +<p>[foo]: /url &quot;title&quot; ok</p> +```````````````````````````````` + + +This is a link reference definition, but it has no title: + +```````````````````````````````` example +[foo]: /url +"title" ok +. +<p>&quot;title&quot; ok</p> +```````````````````````````````` + + +This is not a link reference definition, because it is indented +four spaces: + +```````````````````````````````` example + [foo]: /url "title" + +[foo] +. 
+<pre><code>[foo]: /url &quot;title&quot; +</code></pre> +<p>[foo]</p> +```````````````````````````````` + + +This is not a link reference definition, because it occurs inside +a code block: + +```````````````````````````````` example +``` +[foo]: /url +``` + +[foo] +. +<pre><code>[foo]: /url +</code></pre> +<p>[foo]</p> +```````````````````````````````` + + +A [link reference definition] cannot interrupt a paragraph. + +```````````````````````````````` example +Foo +[bar]: /baz + +[bar] +. +<p>Foo +[bar]: /baz</p> +<p>[bar]</p> +```````````````````````````````` + + +However, it can directly follow other block elements, such as headings +and thematic breaks, and it need not be followed by a blank line. + +```````````````````````````````` example +# [Foo] +[foo]: /url +> bar +. +<h1><a href="/url">Foo</a></h1> +<blockquote> +<p>bar</p> +</blockquote> +```````````````````````````````` + +```````````````````````````````` example +[foo]: /url +bar +=== +[foo] +. +<h1>bar</h1> +<p><a href="/url">foo</a></p> +```````````````````````````````` + +```````````````````````````````` example +[foo]: /url +=== +[foo] +. +<p>=== +<a href="/url">foo</a></p> +```````````````````````````````` + + +Several [link reference definitions] +can occur one after another, without intervening blank lines. + +```````````````````````````````` example +[foo]: /foo-url "foo" +[bar]: /bar-url + "bar" +[baz]: /baz-url + +[foo], +[bar], +[baz] +. +<p><a href="/foo-url" title="foo">foo</a>, +<a href="/bar-url" title="bar">bar</a>, +<a href="/baz-url">baz</a></p> +```````````````````````````````` + + +[Link reference definitions] can occur +inside block containers, like lists and block quotations. They +affect the entire document, not just the container in which they +are defined: + +```````````````````````````````` example +[foo] + +> [foo]: /url +. 
+<p><a href="/url">foo</a></p> +<blockquote> +</blockquote> +```````````````````````````````` + + +## Paragraphs + +A sequence of non-blank lines that cannot be interpreted as other +kinds of blocks forms a [paragraph](@). +The contents of the paragraph are the result of parsing the +paragraph's raw content as inlines. The paragraph's raw content +is formed by concatenating the lines and removing initial and final +spaces or tabs. + +A simple example with two paragraphs: + +```````````````````````````````` example +aaa + +bbb +. +<p>aaa</p> +<p>bbb</p> +```````````````````````````````` + + +Paragraphs can contain multiple lines, but no blank lines: + +```````````````````````````````` example +aaa +bbb + +ccc +ddd +. +<p>aaa +bbb</p> +<p>ccc +ddd</p> +```````````````````````````````` + + +Multiple blank lines between paragraphs have no effect: + +```````````````````````````````` example +aaa + + +bbb +. +<p>aaa</p> +<p>bbb</p> +```````````````````````````````` + + +Leading spaces or tabs are skipped: + +```````````````````````````````` example + aaa + bbb +. +<p>aaa +bbb</p> +```````````````````````````````` + + +Lines after the first may be indented any amount, since indented +code blocks cannot interrupt paragraphs. + +```````````````````````````````` example +aaa + bbb + ccc +. +<p>aaa +bbb +ccc</p> +```````````````````````````````` + + +However, the first line may be preceded by up to three spaces of indentation. +Four spaces of indentation is too many: + +```````````````````````````````` example + aaa +bbb +. +<p>aaa +bbb</p> +```````````````````````````````` + + +```````````````````````````````` example + aaa +bbb +. +<pre><code>aaa +</code></pre> +<p>bbb</p> +```````````````````````````````` + + +Final spaces or tabs are stripped before inline parsing, so a paragraph +that ends with two or more spaces will not end with a [hard line +break]: + +```````````````````````````````` example +aaa +bbb +. 
+<p>aaa<br /> +bbb</p> +```````````````````````````````` + + +## Blank lines + +[Blank lines] between block-level elements are ignored, +except for the role they play in determining whether a [list] +is [tight] or [loose]. + +Blank lines at the beginning and end of the document are also ignored. + +```````````````````````````````` example + + +aaa + + +# aaa + + +. +<p>aaa</p> +<h1>aaa</h1> +```````````````````````````````` + + + +# Container blocks + +A [container block](#container-blocks) is a block that has other +blocks as its contents. There are two basic kinds of container blocks: +[block quotes] and [list items]. +[Lists] are meta-containers for [list items]. + +We define the syntax for container blocks recursively. The general +form of the definition is: + +> If X is a sequence of blocks, then the result of +> transforming X in such-and-such a way is a container of type Y +> with these blocks as its content. + +So, we explain what counts as a block quote or list item by explaining +how these can be *generated* from their contents. This should suffice +to define the syntax, although it does not give a recipe for *parsing* +these constructions. (A recipe is provided below in the section entitled +[A parsing strategy](#appendix-a-parsing-strategy).) + +## Block quotes + +A [block quote marker](@), +optionally preceded by up to three spaces of indentation, +consists of (a) the character `>` together with a following space of +indentation, or (b) a single character `>` not followed by a space of +indentation. + +The following rules define [block quotes]: + +1. **Basic case.** If a string of lines *Ls* constitute a sequence + of blocks *Bs*, then the result of prepending a [block quote + marker] to the beginning of each line in *Ls* + is a [block quote](#block-quotes) containing *Bs*. + +2. 
**Laziness.** If a string of lines *Ls* constitute a [block + quote](#block-quotes) with contents *Bs*, then the result of deleting + the initial [block quote marker] from one or + more lines in which the next character other than a space or tab after the + [block quote marker] is [paragraph continuation + text] is a block quote with *Bs* as its content. + [Paragraph continuation text](@) is text + that will be parsed as part of the content of a paragraph, but does + not occur at the beginning of the paragraph. + +3. **Consecutiveness.** A document cannot contain two [block + quotes] in a row unless there is a [blank line] between them. + +Nothing else counts as a [block quote](#block-quotes). + +Here is a simple example: + +```````````````````````````````` example +> # Foo +> bar +> baz +. +<blockquote> +<h1>Foo</h1> +<p>bar +baz</p> +</blockquote> +```````````````````````````````` + + +The space or tab after the `>` characters can be omitted: + +```````````````````````````````` example +># Foo +>bar +> baz +. +<blockquote> +<h1>Foo</h1> +<p>bar +baz</p> +</blockquote> +```````````````````````````````` + + +The `>` characters can be preceded by up to three spaces of indentation: + +```````````````````````````````` example + > # Foo + > bar + > baz +. +<blockquote> +<h1>Foo</h1> +<p>bar +baz</p> +</blockquote> +```````````````````````````````` + + +Four spaces of indentation is too many: + +```````````````````````````````` example + > # Foo + > bar + > baz +. +<pre><code>&gt; # Foo +&gt; bar +&gt; baz +</code></pre> +```````````````````````````````` + + +The Laziness clause allows us to omit the `>` before +[paragraph continuation text]: + +```````````````````````````````` example +> # Foo +> bar +baz +. +<blockquote> +<h1>Foo</h1> +<p>bar +baz</p> +</blockquote> +```````````````````````````````` + + +A block quote can contain some lazy and some non-lazy +continuation lines: + +```````````````````````````````` example +> bar +baz +> foo +. 
+<blockquote> +<p>bar +baz +foo</p> +</blockquote> +```````````````````````````````` + + +Laziness only applies to lines that would have been continuations of +paragraphs had they been prepended with [block quote markers]. +For example, the `> ` cannot be omitted in the second line of + +``` markdown +> foo +> --- +``` + +without changing the meaning: + +```````````````````````````````` example +> foo +--- +. +<blockquote> +<p>foo</p> +</blockquote> +<hr /> +```````````````````````````````` + + +Similarly, if we omit the `> ` in the second line of + +``` markdown +> - foo +> - bar +``` + +then the block quote ends after the first line: + +```````````````````````````````` example +> - foo +- bar +. +<blockquote> +<ul> +<li>foo</li> +</ul> +</blockquote> +<ul> +<li>bar</li> +</ul> +```````````````````````````````` + + +For the same reason, we can't omit the `> ` in front of +subsequent lines of an indented or fenced code block: + +```````````````````````````````` example +> foo + bar +. +<blockquote> +<pre><code>foo +</code></pre> +</blockquote> +<pre><code>bar +</code></pre> +```````````````````````````````` + + +```````````````````````````````` example +> ``` +foo +``` +. +<blockquote> +<pre><code></code></pre> +</blockquote> +<p>foo</p> +<pre><code></code></pre> +```````````````````````````````` + + +Note that in the following case, we have a [lazy +continuation line]: + +```````````````````````````````` example +> foo + - bar +. +<blockquote> +<p>foo +- bar</p> +</blockquote> +```````````````````````````````` + + +To see why, note that in + +```markdown +> foo +> - bar +``` + +the `- bar` is indented too far to start a list, and can't +be an indented code block because indented code blocks cannot +interrupt paragraphs, so it is [paragraph continuation text]. + +A block quote can be empty: + +```````````````````````````````` example +> +. +<blockquote> +</blockquote> +```````````````````````````````` + + +```````````````````````````````` example +> +> +> +. 
+<blockquote> +</blockquote> +```````````````````````````````` + + +A block quote can have initial or final blank lines: + +```````````````````````````````` example +> +> foo +> +. +<blockquote> +<p>foo</p> +</blockquote> +```````````````````````````````` + + +A blank line always separates block quotes: + +```````````````````````````````` example +> foo + +> bar +. +<blockquote> +<p>foo</p> +</blockquote> +<blockquote> +<p>bar</p> +</blockquote> +```````````````````````````````` + + +(Most current Markdown implementations, including John Gruber's +original `Markdown.pl`, will parse this example as a single block quote +with two paragraphs. But it seems better to allow the author to decide +whether two block quotes or one are wanted.) + +Consecutiveness means that if we put these block quotes together, +we get a single block quote: + +```````````````````````````````` example +> foo +> bar +. +<blockquote> +<p>foo +bar</p> +</blockquote> +```````````````````````````````` + + +To get a block quote with two paragraphs, use: + +```````````````````````````````` example +> foo +> +> bar +. +<blockquote> +<p>foo</p> +<p>bar</p> +</blockquote> +```````````````````````````````` + + +Block quotes can interrupt paragraphs: + +```````````````````````````````` example +foo +> bar +. +<p>foo</p> +<blockquote> +<p>bar</p> +</blockquote> +```````````````````````````````` + + +In general, blank lines are not needed before or after block +quotes: + +```````````````````````````````` example +> aaa +*** +> bbb +. +<blockquote> +<p>aaa</p> +</blockquote> +<hr /> +<blockquote> +<p>bbb</p> +</blockquote> +```````````````````````````````` + + +However, because of laziness, a blank line is needed between +a block quote and a following paragraph: + +```````````````````````````````` example +> bar +baz +. +<blockquote> +<p>bar +baz</p> +</blockquote> +```````````````````````````````` + + +```````````````````````````````` example +> bar + +baz +. 
+<blockquote> +<p>bar</p> +</blockquote> +<p>baz</p> +```````````````````````````````` + + +```````````````````````````````` example +> bar +> +baz +. +<blockquote> +<p>bar</p> +</blockquote> +<p>baz</p> +```````````````````````````````` + + +It is a consequence of the Laziness rule that any number +of initial `>`s may be omitted on a continuation line of a +nested block quote: + +```````````````````````````````` example +> > > foo +bar +. +<blockquote> +<blockquote> +<blockquote> +<p>foo +bar</p> +</blockquote> +</blockquote> +</blockquote> +```````````````````````````````` + + +```````````````````````````````` example +>>> foo +> bar +>>baz +. +<blockquote> +<blockquote> +<blockquote> +<p>foo +bar +baz</p> +</blockquote> +</blockquote> +</blockquote> +```````````````````````````````` + + +When including an indented code block in a block quote, +remember that the [block quote marker] includes +both the `>` and a following space of indentation. So *five spaces* are needed +after the `>`: + +```````````````````````````````` example +> code + +> not code +. +<blockquote> +<pre><code>code +</code></pre> +</blockquote> +<blockquote> +<p>not code</p> +</blockquote> +```````````````````````````````` + + + +## List items + +A [list marker](@) is a +[bullet list marker] or an [ordered list marker]. + +A [bullet list marker](@) +is a `-`, `+`, or `*` character. + +An [ordered list marker](@) +is a sequence of 1--9 arabic digits (`0-9`), followed by either a +`.` character or a `)` character. (The reason for the length +limit is that with 10 digits we start seeing integer overflows +in some browsers.) + +The following rules define [list items]: + +1. 
**Basic case.** If a sequence of lines *Ls* constitute a sequence of + blocks *Bs* starting with a character other than a space or tab, and *M* is + a list marker of width *W* followed by 1 ≤ *N* ≤ 4 spaces of indentation, + then the result of prepending *M* and the following spaces to the first line + of *Ls*, and indenting subsequent lines of *Ls* by *W + N* spaces, is a + list item with *Bs* as its contents. The type of the list item + (bullet or ordered) is determined by the type of its list marker. + If the list item is ordered, then it is also assigned a start + number, based on the ordered list marker. + + Exceptions: + + 1. When the first list item in a [list] interrupts + a paragraph---that is, when it starts on a line that would + otherwise count as [paragraph continuation text]---then (a) + the lines *Ls* must not begin with a blank line, and (b) if + the list item is ordered, the start number must be 1. + 2. If any line is a [thematic break][thematic breaks] then + that line is not a list item. + +For example, let *Ls* be the lines + +```````````````````````````````` example +A paragraph +with two lines. + + indented code + +> A block quote. +. +<p>A paragraph +with two lines.</p> +<pre><code>indented code +</code></pre> +<blockquote> +<p>A block quote.</p> +</blockquote> +```````````````````````````````` + + +And let *M* be the marker `1.`, and *N* = 2. Then rule #1 says +that the following is an ordered list item with start number 1, +and the same contents as *Ls*: + +```````````````````````````````` example +1. A paragraph + with two lines. + + indented code + + > A block quote. +. +<ol> +<li> +<p>A paragraph +with two lines.</p> +<pre><code>indented code +</code></pre> +<blockquote> +<p>A block quote.</p> +</blockquote> +</li> +</ol> +```````````````````````````````` + + +The most important thing to notice is that the position of +the text after the list marker determines how much indentation +is needed in subsequent blocks in the list item. 
If the list +marker takes up two spaces of indentation, and there are three spaces between +the list marker and the next character other than a space or tab, then blocks +must be indented five spaces in order to fall under the list +item. + +Here are some examples showing how far content must be indented to be +put under the list item: + +```````````````````````````````` example +- one + + two +. +<ul> +<li>one</li> +</ul> +<p>two</p> +```````````````````````````````` + + +```````````````````````````````` example +- one + + two +. +<ul> +<li> +<p>one</p> +<p>two</p> +</li> +</ul> +```````````````````````````````` + + +```````````````````````````````` example + - one + + two +. +<ul> +<li>one</li> +</ul> +<pre><code> two +</code></pre> +```````````````````````````````` + + +```````````````````````````````` example + - one + + two +. +<ul> +<li> +<p>one</p> +<p>two</p> +</li> +</ul> +```````````````````````````````` + + +It is tempting to think of this in terms of columns: the continuation +blocks must be indented at least to the column of the first character other than +a space or tab after the list marker. However, that is not quite right. +The spaces of indentation after the list marker determine how much relative +indentation is needed. Which column this indentation reaches will depend on +how the list item is embedded in other constructions, as shown by +this example: + +```````````````````````````````` example + > > 1. one +>> +>> two +. +<blockquote> +<blockquote> +<ol> +<li> +<p>one</p> +<p>two</p> +</li> +</ol> +</blockquote> +</blockquote> +```````````````````````````````` + + +Here `two` occurs in the same column as the list marker `1.`, +but is actually contained in the list item, because there is +sufficient indentation after the last containing blockquote marker. + +The converse is also possible. 
In the following example, the word `two` +occurs far to the right of the initial text of the list item, `one`, but +it is not considered part of the list item, because it is not indented +far enough past the blockquote marker: + +```````````````````````````````` example +>>- one +>> + > > two +. +<blockquote> +<blockquote> +<ul> +<li>one</li> +</ul> +<p>two</p> +</blockquote> +</blockquote> +```````````````````````````````` + + +Note that at least one space or tab is needed between the list marker and +any following content, so these are not list items: + +```````````````````````````````` example +-one + +2.two +. +<p>-one</p> +<p>2.two</p> +```````````````````````````````` + + +A list item may contain blocks that are separated by more than +one blank line. + +```````````````````````````````` example +- foo + + + bar +. +<ul> +<li> +<p>foo</p> +<p>bar</p> +</li> +</ul> +```````````````````````````````` + + +A list item may contain any kind of block: + +```````````````````````````````` example +1. foo + + ``` + bar + ``` + + baz + + > bam +. +<ol> +<li> +<p>foo</p> +<pre><code>bar +</code></pre> +<p>baz</p> +<blockquote> +<p>bam</p> +</blockquote> +</li> +</ol> +```````````````````````````````` + + +A list item that contains an indented code block will preserve +empty lines within the code block verbatim. + +```````````````````````````````` example +- Foo + + bar + + + baz +. +<ul> +<li> +<p>Foo</p> +<pre><code>bar + + +baz +</code></pre> +</li> +</ul> +```````````````````````````````` + +Note that ordered list start numbers must be nine digits or less: + +```````````````````````````````` example +123456789. ok +. +<ol start="123456789"> +<li>ok</li> +</ol> +```````````````````````````````` + + +```````````````````````````````` example +1234567890. not ok +. +<p>1234567890. not ok</p> +```````````````````````````````` + + +A start number may begin with 0s: + +```````````````````````````````` example +0. ok +. 
+<ol start="0"> +<li>ok</li> +</ol> +```````````````````````````````` + + +```````````````````````````````` example +003. ok +. +<ol start="3"> +<li>ok</li> +</ol> +```````````````````````````````` + + +A start number may not be negative: + +```````````````````````````````` example +-1. not ok +. +<p>-1. not ok</p> +```````````````````````````````` + + + +2. **Item starting with indented code.** If a sequence of lines *Ls* + constitute a sequence of blocks *Bs* starting with an indented code + block, and *M* is a list marker of width *W* followed by + one space of indentation, then the result of prepending *M* and the + following space to the first line of *Ls*, and indenting subsequent lines + of *Ls* by *W + 1* spaces, is a list item with *Bs* as its contents. + If a line is empty, then it need not be indented. The type of the + list item (bullet or ordered) is determined by the type of its list + marker. If the list item is ordered, then it is also assigned a + start number, based on the ordered list marker. + +An indented code block will have to be preceded by four spaces of indentation +beyond the edge of the region where text will be included in the list item. +In the following case that is 6 spaces: + +```````````````````````````````` example +- foo + + bar +. +<ul> +<li> +<p>foo</p> +<pre><code>bar +</code></pre> +</li> +</ul> +```````````````````````````````` + + +And in this case it is 11 spaces: + +```````````````````````````````` example + 10. foo + + bar +. +<ol start="10"> +<li> +<p>foo</p> +<pre><code>bar +</code></pre> +</li> +</ol> +```````````````````````````````` + + +If the *first* block in the list item is an indented code block, +then by rule #2, the contents must be preceded by *one* space of indentation +after the list marker: + +```````````````````````````````` example + indented code + +paragraph + + more code +. 
+<pre><code>indented code +</code></pre> +<p>paragraph</p> +<pre><code>more code +</code></pre> +```````````````````````````````` + + +```````````````````````````````` example +1. indented code + + paragraph + + more code +. +<ol> +<li> +<pre><code>indented code +</code></pre> +<p>paragraph</p> +<pre><code>more code +</code></pre> +</li> +</ol> +```````````````````````````````` + + +Note that an additional space of indentation is interpreted as space +inside the code block: + +```````````````````````````````` example +1. indented code + + paragraph + + more code +. +<ol> +<li> +<pre><code> indented code +</code></pre> +<p>paragraph</p> +<pre><code>more code +</code></pre> +</li> +</ol> +```````````````````````````````` + + +Note that rules #1 and #2 only apply to two cases: (a) cases +in which the lines to be included in a list item begin with a +character other than a space or tab, and (b) cases in which +they begin with an indented code +block. In a case like the following, where the first block begins with +three spaces of indentation, the rules do not allow us to form a list item by +indenting the whole thing and prepending a list marker: + +```````````````````````````````` example + foo + +bar +. +<p>foo</p> +<p>bar</p> +```````````````````````````````` + + +```````````````````````````````` example +- foo + + bar +. +<ul> +<li>foo</li> +</ul> +<p>bar</p> +```````````````````````````````` + + +This is not a significant restriction, because when a block is preceded by up to +three spaces of indentation, the indentation can always be removed without +a change in interpretation, allowing rule #1 to be applied. So, in +the above case: + +```````````````````````````````` example +- foo + + bar +. +<ul> +<li> +<p>foo</p> +<p>bar</p> +</li> +</ul> +```````````````````````````````` + + +3. 
**Item starting with a blank line.** If a sequence of lines *Ls* + starting with a single [blank line] constitute a (possibly empty) + sequence of blocks *Bs*, and *M* is a list marker of width *W*, + then the result of prepending *M* to the first line of *Ls*, and + preceding subsequent lines of *Ls* by *W + 1* spaces of indentation, is a + list item with *Bs* as its contents. + If a line is empty, then it need not be indented. The type of the + list item (bullet or ordered) is determined by the type of its list + marker. If the list item is ordered, then it is also assigned a + start number, based on the ordered list marker. + +Here are some list items that start with a blank line but are not empty: + +```````````````````````````````` example +- + foo +- + ``` + bar + ``` +- + baz +. +<ul> +<li>foo</li> +<li> +<pre><code>bar +</code></pre> +</li> +<li> +<pre><code>baz +</code></pre> +</li> +</ul> +```````````````````````````````` + +When the list item starts with a blank line, the number of spaces +following the list marker doesn't change the required indentation: + +```````````````````````````````` example +- + foo +. +<ul> +<li>foo</li> +</ul> +```````````````````````````````` + + +A list item can begin with at most one blank line. +In the following example, `foo` is not part of the list +item: + +```````````````````````````````` example +- + + foo +. +<ul> +<li></li> +</ul> +<p>foo</p> +```````````````````````````````` + + +Here is an empty bullet list item: + +```````````````````````````````` example +- foo +- +- bar +. +<ul> +<li>foo</li> +<li></li> +<li>bar</li> +</ul> +```````````````````````````````` + + +It does not matter whether there are spaces or tabs following the [list marker]: + +```````````````````````````````` example +- foo +- +- bar +. +<ul> +<li>foo</li> +<li></li> +<li>bar</li> +</ul> +```````````````````````````````` + + +Here is an empty ordered list item: + +```````````````````````````````` example +1. foo +2. +3. bar +. 
+<ol> +<li>foo</li> +<li></li> +<li>bar</li> +</ol> +```````````````````````````````` + + +A list may start or end with an empty list item: + +```````````````````````````````` example +* +. +<ul> +<li></li> +</ul> +```````````````````````````````` + +However, an empty list item cannot interrupt a paragraph: + +```````````````````````````````` example +foo +* + +foo +1. +. +<p>foo +*</p> +<p>foo +1.</p> +```````````````````````````````` + + +4. **Indentation.** If a sequence of lines *Ls* constitutes a list item + according to rule #1, #2, or #3, then the result of preceding each line + of *Ls* by up to three spaces of indentation (the same for each line) also + constitutes a list item with the same contents and attributes. If a line is + empty, then it need not be indented. + +Indented one space: + +```````````````````````````````` example + 1. A paragraph + with two lines. + + indented code + + > A block quote. +. +<ol> +<li> +<p>A paragraph +with two lines.</p> +<pre><code>indented code +</code></pre> +<blockquote> +<p>A block quote.</p> +</blockquote> +</li> +</ol> +```````````````````````````````` + + +Indented two spaces: + +```````````````````````````````` example + 1. A paragraph + with two lines. + + indented code + + > A block quote. +. +<ol> +<li> +<p>A paragraph +with two lines.</p> +<pre><code>indented code +</code></pre> +<blockquote> +<p>A block quote.</p> +</blockquote> +</li> +</ol> +```````````````````````````````` + + +Indented three spaces: + +```````````````````````````````` example + 1. A paragraph + with two lines. + + indented code + + > A block quote. +. +<ol> +<li> +<p>A paragraph +with two lines.</p> +<pre><code>indented code +</code></pre> +<blockquote> +<p>A block quote.</p> +</blockquote> +</li> +</ol> +```````````````````````````````` + + +Four spaces indent gives a code block: + +```````````````````````````````` example + 1. A paragraph + with two lines. + + indented code + + > A block quote. +. +<pre><code>1. 
A paragraph + with two lines. + + indented code + + &gt; A block quote. +</code></pre> +```````````````````````````````` + + + +5. **Laziness.** If a string of lines *Ls* constitute a [list + item](#list-items) with contents *Bs*, then the result of deleting + some or all of the indentation from one or more lines in which the + next character other than a space or tab after the indentation is + [paragraph continuation text] is a + list item with the same contents and attributes. The unindented + lines are called + [lazy continuation line](@)s. + +Here is an example with [lazy continuation lines]: + +```````````````````````````````` example + 1. A paragraph +with two lines. + + indented code + + > A block quote. +. +<ol> +<li> +<p>A paragraph +with two lines.</p> +<pre><code>indented code +</code></pre> +<blockquote> +<p>A block quote.</p> +</blockquote> +</li> +</ol> +```````````````````````````````` + + +Indentation can be partially deleted: + +```````````````````````````````` example + 1. A paragraph + with two lines. +. +<ol> +<li>A paragraph +with two lines.</li> +</ol> +```````````````````````````````` + + +These examples show how laziness can work in nested structures: + +```````````````````````````````` example +> 1. > Blockquote +continued here. +. +<blockquote> +<ol> +<li> +<blockquote> +<p>Blockquote +continued here.</p> +</blockquote> +</li> +</ol> +</blockquote> +```````````````````````````````` + + +```````````````````````````````` example +> 1. > Blockquote +> continued here. +. +<blockquote> +<ol> +<li> +<blockquote> +<p>Blockquote +continued here.</p> +</blockquote> +</li> +</ol> +</blockquote> +```````````````````````````````` + + + +6. **That's all.** Nothing that is not counted as a list item by rules + #1--5 counts as a [list item](#list-items). + +The rules for sublists follow from the general rules +[above][List items]. 
A sublist must be indented the same number +of spaces of indentation a paragraph would need to be in order to be included +in the list item. + +So, in this case we need two spaces indent: + +```````````````````````````````` example +- foo + - bar + - baz + - boo +. +<ul> +<li>foo +<ul> +<li>bar +<ul> +<li>baz +<ul> +<li>boo</li> +</ul> +</li> +</ul> +</li> +</ul> +</li> +</ul> +```````````````````````````````` + + +One is not enough: + +```````````````````````````````` example +- foo + - bar + - baz + - boo +. +<ul> +<li>foo</li> +<li>bar</li> +<li>baz</li> +<li>boo</li> +</ul> +```````````````````````````````` + + +Here we need four, because the list marker is wider: + +```````````````````````````````` example +10) foo + - bar +. +<ol start="10"> +<li>foo +<ul> +<li>bar</li> +</ul> +</li> +</ol> +```````````````````````````````` + + +Three is not enough: + +```````````````````````````````` example +10) foo + - bar +. +<ol start="10"> +<li>foo</li> +</ol> +<ul> +<li>bar</li> +</ul> +```````````````````````````````` + + +A list may be the first block in a list item: + +```````````````````````````````` example +- - foo +. +<ul> +<li> +<ul> +<li>foo</li> +</ul> +</li> +</ul> +```````````````````````````````` + + +```````````````````````````````` example +1. - 2. foo +. +<ol> +<li> +<ul> +<li> +<ol start="2"> +<li>foo</li> +</ol> +</li> +</ul> +</li> +</ol> +```````````````````````````````` + + +A list item can contain a heading: + +```````````````````````````````` example +- # Foo +- Bar + --- + baz +. +<ul> +<li> +<h1>Foo</h1> +</li> +<li> +<h2>Bar</h2> +baz</li> +</ul> +```````````````````````````````` + + +### Motivation + +John Gruber's Markdown spec says the following about list items: + +1. "List markers typically start at the left margin, but may be indented + by up to three spaces. List markers must be followed by one or more + spaces or a tab." + +2. "To make lists look nice, you can wrap items with hanging indents.... 
+ But if you don't want to, you don't have to." + +3. "List items may consist of multiple paragraphs. Each subsequent + paragraph in a list item must be indented by either 4 spaces or one + tab." + +4. "It looks nice if you indent every line of the subsequent paragraphs, + but here again, Markdown will allow you to be lazy." + +5. "To put a blockquote within a list item, the blockquote's `>` + delimiters need to be indented." + +6. "To put a code block within a list item, the code block needs to be + indented twice — 8 spaces or two tabs." + +These rules specify that a paragraph under a list item must be indented +four spaces (presumably, from the left margin, rather than the start of +the list marker, but this is not said), and that code under a list item +must be indented eight spaces instead of the usual four. They also say +that a block quote must be indented, but not by how much; however, the +example given has four spaces indentation. Although nothing is said +about other kinds of block-level content, it is certainly reasonable to +infer that *all* block elements under a list item, including other +lists, must be indented four spaces. This principle has been called the +*four-space rule*. + +The four-space rule is clear and principled, and if the reference +implementation `Markdown.pl` had followed it, it probably would have +become the standard. However, `Markdown.pl` allowed paragraphs and +sublists to start with only two spaces indentation, at least on the +outer level. Worse, its behavior was inconsistent: a sublist of an +outer-level list needed two spaces indentation, but a sublist of this +sublist needed three spaces. It is not surprising, then, that different +implementations of Markdown have developed very different rules for +determining what comes under a list item. 
(Pandoc and python-Markdown, +for example, stuck with Gruber's syntax description and the four-space +rule, while discount, redcarpet, marked, PHP Markdown, and others +followed `Markdown.pl`'s behavior more closely.) + +Unfortunately, given the divergences between implementations, there +is no way to give a spec for list items that will be guaranteed not +to break any existing documents. However, the spec given here should +correctly handle lists formatted with either the four-space rule or +the more forgiving `Markdown.pl` behavior, provided they are laid out +in a way that is natural for a human to read. + +The strategy here is to let the width and indentation of the list marker +determine the indentation necessary for blocks to fall under the list +item, rather than having a fixed and arbitrary number. The writer can +think of the body of the list item as a unit which gets indented to the +right enough to fit the list marker (and any indentation on the list +marker). (The laziness rule, #5, then allows continuation lines to be +unindented if needed.) + +This rule is superior, we claim, to any rule requiring a fixed level of +indentation from the margin. The four-space rule is clear but +unnatural. It is quite unintuitive that + +``` markdown +- foo + + bar + + - baz +``` + +should be parsed as two lists with an intervening paragraph, + +``` html +<ul> +<li>foo</li> +</ul> +<p>bar</p> +<ul> +<li>baz</li> +</ul> +``` + +as the four-space rule demands, rather than a single list, + +``` html +<ul> +<li> +<p>foo</p> +<p>bar</p> +<ul> +<li>baz</li> +</ul> +</li> +</ul> +``` + +The choice of four spaces is arbitrary. It can be learned, but it is +not likely to be guessed, and it trips up beginners regularly. + +Would it help to adopt a two-space rule? 
The problem is that such +a rule, together with the rule allowing up to three spaces of indentation for +the initial list marker, allows text that is indented *less than* the +original list marker to be included in the list item. For example, +`Markdown.pl` parses + +``` markdown + - one + + two +``` + +as a single list item, with `two` a continuation paragraph: + +``` html +<ul> +<li> +<p>one</p> +<p>two</p> +</li> +</ul> +``` + +and similarly + +``` markdown +> - one +> +> two +``` + +as + +``` html +<blockquote> +<ul> +<li> +<p>one</p> +<p>two</p> +</li> +</ul> +</blockquote> +``` + +This is extremely unintuitive. + +Rather than requiring a fixed indent from the margin, we could require +a fixed indent (say, two spaces, or even one space) from the list marker (which +may itself be indented). This proposal would remove the last anomaly +discussed. Unlike the spec presented above, it would count the following +as a list item with a subparagraph, even though the paragraph `bar` +is not indented as far as the first paragraph `foo`: + +``` markdown + 10. foo + + bar +``` + +Arguably this text does read like a list item with `bar` as a subparagraph, +which may count in favor of the proposal. However, on this proposal indented +code would have to be indented six spaces after the list marker. And this +would break a lot of existing Markdown, which has the pattern: + +``` markdown +1. foo + + indented code +``` + +where the code is indented eight spaces. The spec above, by contrast, will +parse this text as expected, since the code block's indentation is measured +from the beginning of `foo`. + +The one case that needs special treatment is a list item that *starts* +with indented code. How much indentation is required in that case, since +we don't have a "first paragraph" to measure from? Rule #2 simply stipulates +that in such cases, we require one space indentation from the list marker +(and then the normal four spaces for the indented code). 
This will match the +four-space rule in cases where the list marker plus its initial indentation +takes four spaces (a common case), but diverge in other cases. + +## Lists + +A [list](@) is a sequence of one or more +list items [of the same type]. The list items +may be separated by any number of blank lines. + +Two list items are [of the same type](@) +if they begin with a [list marker] of the same type. +Two list markers are of the +same type if (a) they are bullet list markers using the same character +(`-`, `+`, or `*`) or (b) they are ordered list numbers with the same +delimiter (either `.` or `)`). + +A list is an [ordered list](@) +if its constituent list items begin with +[ordered list markers], and a +[bullet list](@) if its constituent list +items begin with [bullet list markers]. + +The [start number](@) +of an [ordered list] is determined by the list number of +its initial list item. The numbers of subsequent list items are +disregarded. + +A list is [loose](@) if any of its constituent +list items are separated by blank lines, or if any of its constituent +list items directly contain two block-level elements with a blank line +between them. Otherwise a list is [tight](@). +(The difference in HTML output is that paragraphs in a loose list are +wrapped in `<p>` tags, while paragraphs in a tight list are not.) + +Changing the bullet or ordered list delimiter starts a new list: + +```````````````````````````````` example +- foo +- bar ++ baz +. +<ul> +<li>foo</li> +<li>bar</li> +</ul> +<ul> +<li>baz</li> +</ul> +```````````````````````````````` + + +```````````````````````````````` example +1. foo +2. bar +3) baz +. +<ol> +<li>foo</li> +<li>bar</li> +</ol> +<ol start="3"> +<li>baz</li> +</ol> +```````````````````````````````` + + +In CommonMark, a list can interrupt a paragraph. That is, +no blank line is needed to separate a paragraph from a following +list: + +```````````````````````````````` example +Foo +- bar +- baz +. 
+<p>Foo</p> +<ul> +<li>bar</li> +<li>baz</li> +</ul> +```````````````````````````````` + +`Markdown.pl` does not allow this, through fear of triggering a list +via a numeral in a hard-wrapped line: + +``` markdown +The number of windows in my house is +14. The number of doors is 6. +``` + +Oddly, though, `Markdown.pl` *does* allow a blockquote to +interrupt a paragraph, even though the same considerations might +apply. + +In CommonMark, we do allow lists to interrupt paragraphs, for +two reasons. First, it is natural and not uncommon for people +to start lists without blank lines: + +``` markdown +I need to buy +- new shoes +- a coat +- a plane ticket +``` + +Second, we are attracted to a + +> [principle of uniformity](@): +> if a chunk of text has a certain +> meaning, it will continue to have the same meaning when put into a +> container block (such as a list item or blockquote). + +(Indeed, the spec for [list items] and [block quotes] presupposes +this principle.) This principle implies that if + +``` markdown + * I need to buy + - new shoes + - a coat + - a plane ticket +``` + +is a list item containing a paragraph followed by a nested sublist, +as all Markdown implementations agree it is (though the paragraph +may be rendered without `<p>` tags, since the list is "tight"), +then + +``` markdown +I need to buy +- new shoes +- a coat +- a plane ticket +``` + +by itself should be a paragraph followed by a nested sublist. + +Since it is well established Markdown practice to allow lists to +interrupt paragraphs inside list items, the [principle of +uniformity] requires us to allow this outside list items as +well. ([reStructuredText](https://docutils.sourceforge.net/rst.html) +takes a different approach, requiring blank lines before lists +even inside other list items.) + +In order to solve the problem of unwanted lists in paragraphs with +hard-wrapped numerals, we allow only lists starting with `1` to +interrupt paragraphs. 
Thus, + +```````````````````````````````` example +The number of windows in my house is +14. The number of doors is 6. +. +<p>The number of windows in my house is +14. The number of doors is 6.</p> +```````````````````````````````` + +We may still get an unintended result in cases like + +```````````````````````````````` example +The number of windows in my house is +1. The number of doors is 6. +. +<p>The number of windows in my house is</p> +<ol> +<li>The number of doors is 6.</li> +</ol> +```````````````````````````````` + +but this rule should prevent most spurious list captures. + +There can be any number of blank lines between items: + +```````````````````````````````` example +- foo + +- bar + + +- baz +. +<ul> +<li> +<p>foo</p> +</li> +<li> +<p>bar</p> +</li> +<li> +<p>baz</p> +</li> +</ul> +```````````````````````````````` + +```````````````````````````````` example +- foo + - bar + - baz + + + bim +. +<ul> +<li>foo +<ul> +<li>bar +<ul> +<li> +<p>baz</p> +<p>bim</p> +</li> +</ul> +</li> +</ul> +</li> +</ul> +```````````````````````````````` + + +To separate consecutive lists of the same type, or to separate a +list from an indented code block that would otherwise be parsed +as a subparagraph of the final list item, you can insert a blank HTML +comment: + +```````````````````````````````` example +- foo +- bar + +<!-- --> + +- baz +- bim +. +<ul> +<li>foo</li> +<li>bar</li> +</ul> +<!-- --> +<ul> +<li>baz</li> +<li>bim</li> +</ul> +```````````````````````````````` + + +```````````````````````````````` example +- foo + + notcode + +- foo + +<!-- --> + + code +. +<ul> +<li> +<p>foo</p> +<p>notcode</p> +</li> +<li> +<p>foo</p> +</li> +</ul> +<!-- --> +<pre><code>code +</code></pre> +```````````````````````````````` + + +List items need not be indented to the same level. 
The following +list items will be treated as items at the same list level, +since none is indented enough to belong to the previous list +item: + +```````````````````````````````` example +- a + - b + - c + - d + - e + - f +- g +. +<ul> +<li>a</li> +<li>b</li> +<li>c</li> +<li>d</li> +<li>e</li> +<li>f</li> +<li>g</li> +</ul> +```````````````````````````````` + + +```````````````````````````````` example +1. a + + 2. b + + 3. c +. +<ol> +<li> +<p>a</p> +</li> +<li> +<p>b</p> +</li> +<li> +<p>c</p> +</li> +</ol> +```````````````````````````````` + +Note, however, that list items may not be preceded by more than +three spaces of indentation. Here `- e` is treated as a paragraph continuation +line, because it is indented more than three spaces: + +```````````````````````````````` example +- a + - b + - c + - d + - e +. +<ul> +<li>a</li> +<li>b</li> +<li>c</li> +<li>d +- e</li> +</ul> +```````````````````````````````` + +And here, `3. c` is treated as an indented code block, +because it is indented four spaces and preceded by a +blank line. + +```````````````````````````````` example +1. a + + 2. b + + 3. c +. +<ol> +<li> +<p>a</p> +</li> +<li> +<p>b</p> +</li> +</ol> +<pre><code>3. c +</code></pre> +```````````````````````````````` + + +This is a loose list, because there is a blank line between +two of the list items: + +```````````````````````````````` example +- a +- b + +- c +. +<ul> +<li> +<p>a</p> +</li> +<li> +<p>b</p> +</li> +<li> +<p>c</p> +</li> +</ul> +```````````````````````````````` + + +So is this, with an empty second item: + +```````````````````````````````` example +* a +* + +* c +. +<ul> +<li> +<p>a</p> +</li> +<li></li> +<li> +<p>c</p> +</li> +</ul> +```````````````````````````````` + + +These are loose lists, even though there are no blank lines between the items, +because one of the items directly contains two block-level elements +with a blank line between them: + +```````````````````````````````` example +- a +- b + + c +- d +. 
+<ul> +<li> +<p>a</p> +</li> +<li> +<p>b</p> +<p>c</p> +</li> +<li> +<p>d</p> +</li> +</ul> +```````````````````````````````` + + +```````````````````````````````` example +- a +- b + + [ref]: /url +- d +. +<ul> +<li> +<p>a</p> +</li> +<li> +<p>b</p> +</li> +<li> +<p>d</p> +</li> +</ul> +```````````````````````````````` + + +This is a tight list, because the blank lines are in a code block: + +```````````````````````````````` example +- a +- ``` + b + + + ``` +- c +. +<ul> +<li>a</li> +<li> +<pre><code>b + + +</code></pre> +</li> +<li>c</li> +</ul> +```````````````````````````````` + + +This is a tight list, because the blank line is between two +paragraphs of a sublist. So the sublist is loose while +the outer list is tight: + +```````````````````````````````` example +- a + - b + + c +- d +. +<ul> +<li>a +<ul> +<li> +<p>b</p> +<p>c</p> +</li> +</ul> +</li> +<li>d</li> +</ul> +```````````````````````````````` + + +This is a tight list, because the blank line is inside the +block quote: + +```````````````````````````````` example +* a + > b + > +* c +. +<ul> +<li>a +<blockquote> +<p>b</p> +</blockquote> +</li> +<li>c</li> +</ul> +```````````````````````````````` + + +This list is tight, because the consecutive block elements +are not separated by blank lines: + +```````````````````````````````` example +- a + > b + ``` + c + ``` +- d +. +<ul> +<li>a +<blockquote> +<p>b</p> +</blockquote> +<pre><code>c +</code></pre> +</li> +<li>d</li> +</ul> +```````````````````````````````` + + +A single-paragraph list is tight: + +```````````````````````````````` example +- a +. +<ul> +<li>a</li> +</ul> +```````````````````````````````` + + +```````````````````````````````` example +- a + - b +. +<ul> +<li>a +<ul> +<li>b</li> +</ul> +</li> +</ul> +```````````````````````````````` + + +This list is loose, because of the blank line between the +two block elements in the list item: + +```````````````````````````````` example +1. ``` + foo + ``` + + bar +. 
+<ol> +<li> +<pre><code>foo +</code></pre> +<p>bar</p> +</li> +</ol> +```````````````````````````````` + + +Here the outer list is loose, the inner list tight: + +```````````````````````````````` example +* foo + * bar + + baz +. +<ul> +<li> +<p>foo</p> +<ul> +<li>bar</li> +</ul> +<p>baz</p> +</li> +</ul> +```````````````````````````````` + + +```````````````````````````````` example +- a + - b + - c + +- d + - e + - f +. +<ul> +<li> +<p>a</p> +<ul> +<li>b</li> +<li>c</li> +</ul> +</li> +<li> +<p>d</p> +<ul> +<li>e</li> +<li>f</li> +</ul> +</li> +</ul> +```````````````````````````````` + + +# Inlines + +Inlines are parsed sequentially from the beginning of the character +stream to the end (left to right, in left-to-right languages). +Thus, for example, in + +```````````````````````````````` example +`hi`lo` +. +<p><code>hi</code>lo`</p> +```````````````````````````````` + +`hi` is parsed as code, leaving the backtick at the end as a literal +backtick. + + + +## Code spans + +A [backtick string](@) +is a string of one or more backtick characters (`` ` ``) that is neither +preceded nor followed by a backtick. + +A [code span](@) begins with a backtick string and ends with +a backtick string of equal length. The contents of the code span are +the characters between these two backtick strings, normalized in the +following ways: + +- First, [line endings] are converted to [spaces]. +- If the resulting string both begins *and* ends with a [space] + character, but does not consist entirely of [space] + characters, a single [space] character is removed from the + front and back. This allows you to include code that begins + or ends with backtick characters, which must be separated by + whitespace from the opening or closing backtick strings. + +This is a simple code span: + +```````````````````````````````` example +`foo` +. +<p><code>foo</code></p> +```````````````````````````````` + + +Here two backticks are used, because the code contains a backtick. 
+This example also illustrates stripping of a single leading and +trailing space: + +```````````````````````````````` example +`` foo ` bar `` +. +<p><code>foo ` bar</code></p> +```````````````````````````````` + + +This example shows the motivation for stripping leading and trailing +spaces: + +```````````````````````````````` example +` `` ` +. +<p><code>``</code></p> +```````````````````````````````` + +Note that only *one* space is stripped: + +```````````````````````````````` example +` `` ` +. +<p><code> `` </code></p> +```````````````````````````````` + +The stripping only happens if the space is on both +sides of the string: + +```````````````````````````````` example +` a` +. +<p><code> a</code></p> +```````````````````````````````` + +Only [spaces], and not [unicode whitespace] in general, are +stripped in this way: + +```````````````````````````````` example +` b ` +. +<p><code> b </code></p> +```````````````````````````````` + +No stripping occurs if the code span contains only spaces: + +```````````````````````````````` example +` ` +` ` +. +<p><code> </code> +<code> </code></p> +```````````````````````````````` + + +[Line endings] are treated like spaces: + +```````````````````````````````` example +`` +foo +bar +baz +`` +. +<p><code>foo bar baz</code></p> +```````````````````````````````` + +```````````````````````````````` example +`` +foo +`` +. +<p><code>foo </code></p> +```````````````````````````````` + + +Interior spaces are not collapsed: + +```````````````````````````````` example +`foo bar +baz` +. +<p><code>foo bar baz</code></p> +```````````````````````````````` + +Note that browsers will typically collapse consecutive spaces +when rendering `<code>` elements, so it is recommended that +the following CSS be used: + + code{white-space: pre-wrap;} + + +Note that backslash escapes do not work in code spans. All backslashes +are treated literally: + +```````````````````````````````` example +`foo\`bar` +. 
+<p><code>foo\</code>bar`</p> +```````````````````````````````` + + +Backslash escapes are never needed, because one can always choose a +string of *n* backtick characters as delimiters, where the code does +not contain any strings of exactly *n* backtick characters. + +```````````````````````````````` example +``foo`bar`` +. +<p><code>foo`bar</code></p> +```````````````````````````````` + +```````````````````````````````` example +` foo `` bar ` +. +<p><code>foo `` bar</code></p> +```````````````````````````````` + + +Code span backticks have higher precedence than any other inline +constructs except HTML tags and autolinks. Thus, for example, this is +not parsed as emphasized text, since the second `*` is part of a code +span: + +```````````````````````````````` example +*foo`*` +. +<p>*foo<code>*</code></p> +```````````````````````````````` + + +And this is not parsed as a link: + +```````````````````````````````` example +[not a `link](/foo`) +. +<p>[not a <code>link](/foo</code>)</p> +```````````````````````````````` + + +Code spans, HTML tags, and autolinks have the same precedence. +Thus, this is code: + +```````````````````````````````` example +`<a href="`">` +. +<p><code>&lt;a href=&quot;</code>&quot;&gt;`</p> +```````````````````````````````` + + +But this is an HTML tag: + +```````````````````````````````` example +<a href="`">` +. +<p><a href="`">`</p> +```````````````````````````````` + + +And this is code: + +```````````````````````````````` example +`<https://foo.bar.`baz>` +. +<p><code>&lt;https://foo.bar.</code>baz&gt;`</p> +```````````````````````````````` + + +But this is an autolink: + +```````````````````````````````` example +<https://foo.bar.`baz>` +. +<p><a href="https://foo.bar.%60baz">https://foo.bar.`baz</a>`</p> +```````````````````````````````` + + +When a backtick string is not closed by a matching backtick string, +we just have literal backticks: + +```````````````````````````````` example +```foo`` +. 
+<p>```foo``</p> +```````````````````````````````` + + +```````````````````````````````` example +`foo +. +<p>`foo</p> +```````````````````````````````` + +The following case also illustrates the need for opening and +closing backtick strings to be equal in length: + +```````````````````````````````` example +`foo``bar`` +. +<p>`foo<code>bar</code></p> +```````````````````````````````` + + +## Emphasis and strong emphasis + +John Gruber's original [Markdown syntax +description](https://daringfireball.net/projects/markdown/syntax#em) says: + +> Markdown treats asterisks (`*`) and underscores (`_`) as indicators of +> emphasis. Text wrapped with one `*` or `_` will be wrapped with an HTML +> `<em>` tag; double `*`'s or `_`'s will be wrapped with an HTML `<strong>` +> tag. + +This is enough for most users, but these rules leave much undecided, +especially when it comes to nested emphasis. The original +`Markdown.pl` test suite makes it clear that triple `***` and +`___` delimiters can be used for strong emphasis, and most +implementations have also allowed the following patterns: + +``` markdown +***strong emph*** +***strong** in emph* +***emph* in strong** +**in strong *emph*** +*in emph **strong*** +``` + +The following patterns are less widely supported, but the intent +is clear and they are useful (especially in contexts like bibliography +entries): + +``` markdown +*emph *with emph* in it* +**strong **with strong** in it** +``` + +Many implementations have also restricted intraword emphasis to +the `*` forms, to avoid unwanted emphasis in words containing +internal underscores. (It is best practice to put these in code +spans, but users often do not.) + +``` markdown +internal emphasis: foo*bar*baz +no emphasis: foo_bar_baz +``` + +The rules given below capture all of these patterns, while allowing +for efficient parsing strategies that do not backtrack. + +First, some definitions. 
A [delimiter run](@) is either +a sequence of one or more `*` characters that is not preceded or +followed by a non-backslash-escaped `*` character, or a sequence +of one or more `_` characters that is not preceded or followed by +a non-backslash-escaped `_` character. + +A [left-flanking delimiter run](@) is +a [delimiter run] that is (1) not followed by [Unicode whitespace], +and either (2a) not followed by a [Unicode punctuation character], or +(2b) followed by a [Unicode punctuation character] and +preceded by [Unicode whitespace] or a [Unicode punctuation character]. +For purposes of this definition, the beginning and the end of +the line count as Unicode whitespace. + +A [right-flanking delimiter run](@) is +a [delimiter run] that is (1) not preceded by [Unicode whitespace], +and either (2a) not preceded by a [Unicode punctuation character], or +(2b) preceded by a [Unicode punctuation character] and +followed by [Unicode whitespace] or a [Unicode punctuation character]. +For purposes of this definition, the beginning and the end of +the line count as Unicode whitespace. + +Here are some examples of delimiter runs. + + - left-flanking but not right-flanking: + + ``` + ***abc + _abc + **"abc" + _"abc" + ``` + + - right-flanking but not left-flanking: + + ``` + abc*** + abc_ + "abc"** + "abc"_ + ``` + + - Both left and right-flanking: + + ``` + abc***def + "abc"_"def" + ``` + + - Neither left nor right-flanking: + + ``` + abc *** def + a _ b + ``` + +(The idea of distinguishing left-flanking and right-flanking +delimiter runs based on the character before and the character +after comes from Roopesh Chander's +[vfmd](https://web.archive.org/web/20220608143320/http://www.vfmd.org/vfmd-spec/specification/#procedure-for-identifying-emphasis-tags). +vfmd uses the terminology "emphasis indicator string" instead of "delimiter +run," and its rules for distinguishing left- and right-flanking runs +are a bit more complex than the ones given here.) 
+ +The following rules define emphasis and strong emphasis: + +1. A single `*` character [can open emphasis](@) + iff (if and only if) it is part of a [left-flanking delimiter run]. + +2. A single `_` character [can open emphasis] iff + it is part of a [left-flanking delimiter run] + and either (a) not part of a [right-flanking delimiter run] + or (b) part of a [right-flanking delimiter run] + preceded by a [Unicode punctuation character]. + +3. A single `*` character [can close emphasis](@) + iff it is part of a [right-flanking delimiter run]. + +4. A single `_` character [can close emphasis] iff + it is part of a [right-flanking delimiter run] + and either (a) not part of a [left-flanking delimiter run] + or (b) part of a [left-flanking delimiter run] + followed by a [Unicode punctuation character]. + +5. A double `**` [can open strong emphasis](@) + iff it is part of a [left-flanking delimiter run]. + +6. A double `__` [can open strong emphasis] iff + it is part of a [left-flanking delimiter run] + and either (a) not part of a [right-flanking delimiter run] + or (b) part of a [right-flanking delimiter run] + preceded by a [Unicode punctuation character]. + +7. A double `**` [can close strong emphasis](@) + iff it is part of a [right-flanking delimiter run]. + +8. A double `__` [can close strong emphasis] iff + it is part of a [right-flanking delimiter run] + and either (a) not part of a [left-flanking delimiter run] + or (b) part of a [left-flanking delimiter run] + followed by a [Unicode punctuation character]. + +9. Emphasis begins with a delimiter that [can open emphasis] and ends + with a delimiter that [can close emphasis], and that uses the same + character (`_` or `*`) as the opening delimiter. The + opening and closing delimiters must belong to separate + [delimiter runs]. 
If one of the delimiters can both + open and close emphasis, then the sum of the lengths of the + delimiter runs containing the opening and closing delimiters + must not be a multiple of 3 unless both lengths are + multiples of 3. + +10. Strong emphasis begins with a delimiter that + [can open strong emphasis] and ends with a delimiter that + [can close strong emphasis], and that uses the same character + (`_` or `*`) as the opening delimiter. The + opening and closing delimiters must belong to separate + [delimiter runs]. If one of the delimiters can both open + and close strong emphasis, then the sum of the lengths of + the delimiter runs containing the opening and closing + delimiters must not be a multiple of 3 unless both lengths + are multiples of 3. + +11. A literal `*` character cannot occur at the beginning or end of + `*`-delimited emphasis or `**`-delimited strong emphasis, unless it + is backslash-escaped. + +12. A literal `_` character cannot occur at the beginning or end of + `_`-delimited emphasis or `__`-delimited strong emphasis, unless it + is backslash-escaped. + +Where rules 1--12 above are compatible with multiple parsings, +the following principles resolve ambiguity: + +13. The number of nestings should be minimized. Thus, for example, + an interpretation `<strong>...</strong>` is always preferred to + `<em><em>...</em></em>`. + +14. An interpretation `<em><strong>...</strong></em>` is always + preferred to `<strong><em>...</em></strong>`. + +15. When two potential emphasis or strong emphasis spans overlap, + so that the second begins before the first ends and ends after + the first ends, the first takes precedence. Thus, for example, + `*foo _bar* baz_` is parsed as `<em>foo _bar</em> baz_` rather + than `*foo <em>bar* baz</em>`. + +16. When there are two potential emphasis or strong emphasis spans + with the same closing delimiter, the shorter one (the one that + opens later) takes precedence. 
Thus, for example, + `**foo **bar baz**` is parsed as `**foo <strong>bar baz</strong>` + rather than `<strong>foo **bar baz</strong>`. + +17. Inline code spans, links, images, and HTML tags group more tightly + than emphasis. So, when there is a choice between an interpretation + that contains one of these elements and one that does not, the + former always wins. Thus, for example, `*[foo*](bar)` is + parsed as `*<a href="bar">foo*</a>` rather than as + `<em>[foo</em>](bar)`. + +These rules can be illustrated through a series of examples. + +Rule 1: + +```````````````````````````````` example +*foo bar* +. +<p><em>foo bar</em></p> +```````````````````````````````` + + +This is not emphasis, because the opening `*` is followed by +whitespace, and hence not part of a [left-flanking delimiter run]: + +```````````````````````````````` example +a * foo bar* +. +<p>a * foo bar*</p> +```````````````````````````````` + + +This is not emphasis, because the opening `*` is preceded +by an alphanumeric and followed by punctuation, and hence +not part of a [left-flanking delimiter run]: + +```````````````````````````````` example +a*"foo"* +. +<p>a*&quot;foo&quot;*</p> +```````````````````````````````` + + +Unicode nonbreaking spaces count as whitespace, too: + +```````````````````````````````` example +* a * +. +<p>* a *</p> +```````````````````````````````` + + +Unicode symbols count as punctuation, too: + +```````````````````````````````` example +*$*alpha. + +*£*bravo. + +*€*charlie. +. +<p>*$*alpha.</p> +<p>*£*bravo.</p> +<p>*€*charlie.</p> +```````````````````````````````` + + +Intraword emphasis with `*` is permitted: + +```````````````````````````````` example +foo*bar* +. +<p>foo<em>bar</em></p> +```````````````````````````````` + + +```````````````````````````````` example +5*6*78 +. +<p>5<em>6</em>78</p> +```````````````````````````````` + + +Rule 2: + +```````````````````````````````` example +_foo bar_ +. 
+<p><em>foo bar</em></p> +```````````````````````````````` + + +This is not emphasis, because the opening `_` is followed by +whitespace: + +```````````````````````````````` example +_ foo bar_ +. +<p>_ foo bar_</p> +```````````````````````````````` + + +This is not emphasis, because the opening `_` is preceded +by an alphanumeric and followed by punctuation: + +```````````````````````````````` example +a_"foo"_ +. +<p>a_&quot;foo&quot;_</p> +```````````````````````````````` + + +Emphasis with `_` is not allowed inside words: + +```````````````````````````````` example +foo_bar_ +. +<p>foo_bar_</p> +```````````````````````````````` + + +```````````````````````````````` example +5_6_78 +. +<p>5_6_78</p> +```````````````````````````````` + + +```````````````````````````````` example +пристаням_стремятся_ +. +<p>пристаням_стремятся_</p> +```````````````````````````````` + + +Here `_` does not generate emphasis, because the first delimiter run +is right-flanking and the second left-flanking: + +```````````````````````````````` example +aa_"bb"_cc +. +<p>aa_&quot;bb&quot;_cc</p> +```````````````````````````````` + + +This is emphasis, even though the opening delimiter is +both left- and right-flanking, because it is preceded by +punctuation: + +```````````````````````````````` example +foo-_(bar)_ +. +<p>foo-<em>(bar)</em></p> +```````````````````````````````` + + +Rule 3: + +This is not emphasis, because the closing delimiter does +not match the opening delimiter: + +```````````````````````````````` example +_foo* +. +<p>_foo*</p> +```````````````````````````````` + + +This is not emphasis, because the closing `*` is preceded by +whitespace: + +```````````````````````````````` example +*foo bar * +. +<p>*foo bar *</p> +```````````````````````````````` + + +A line ending also counts as whitespace: + +```````````````````````````````` example +*foo bar +* +. 
+<p>*foo bar +*</p> +```````````````````````````````` + + +This is not emphasis, because the second `*` is +preceded by punctuation and followed by an alphanumeric +(hence it is not part of a [right-flanking delimiter run]): + +```````````````````````````````` example +*(*foo) +. +<p>*(*foo)</p> +```````````````````````````````` + + +The point of this restriction is more easily appreciated +with this example: + +```````````````````````````````` example +*(*foo*)* +. +<p><em>(<em>foo</em>)</em></p> +```````````````````````````````` + + +Intraword emphasis with `*` is allowed: + +```````````````````````````````` example +*foo*bar +. +<p><em>foo</em>bar</p> +```````````````````````````````` + + + +Rule 4: + +This is not emphasis, because the closing `_` is preceded by +whitespace: + +```````````````````````````````` example +_foo bar _ +. +<p>_foo bar _</p> +```````````````````````````````` + + +This is not emphasis, because the second `_` is +preceded by punctuation and followed by an alphanumeric: + +```````````````````````````````` example +_(_foo) +. +<p>_(_foo)</p> +```````````````````````````````` + + +This is emphasis within emphasis: + +```````````````````````````````` example +_(_foo_)_ +. +<p><em>(<em>foo</em>)</em></p> +```````````````````````````````` + + +Intraword emphasis is disallowed for `_`: + +```````````````````````````````` example +_foo_bar +. +<p>_foo_bar</p> +```````````````````````````````` + + +```````````````````````````````` example +_пристаням_стремятся +. +<p>_пристаням_стремятся</p> +```````````````````````````````` + + +```````````````````````````````` example +_foo_bar_baz_ +. +<p><em>foo_bar_baz</em></p> +```````````````````````````````` + + +This is emphasis, even though the closing delimiter is +both left- and right-flanking, because it is followed by +punctuation: + +```````````````````````````````` example +_(bar)_. +. 
+<p><em>(bar)</em>.</p> +```````````````````````````````` + + +Rule 5: + +```````````````````````````````` example +**foo bar** +. +<p><strong>foo bar</strong></p> +```````````````````````````````` + + +This is not strong emphasis, because the opening delimiter is +followed by whitespace: + +```````````````````````````````` example +** foo bar** +. +<p>** foo bar**</p> +```````````````````````````````` + + +This is not strong emphasis, because the opening `**` is preceded +by an alphanumeric and followed by punctuation, and hence +not part of a [left-flanking delimiter run]: + +```````````````````````````````` example +a**"foo"** +. +<p>a**&quot;foo&quot;**</p> +```````````````````````````````` + + +Intraword strong emphasis with `**` is permitted: + +```````````````````````````````` example +foo**bar** +. +<p>foo<strong>bar</strong></p> +```````````````````````````````` + + +Rule 6: + +```````````````````````````````` example +__foo bar__ +. +<p><strong>foo bar</strong></p> +```````````````````````````````` + + +This is not strong emphasis, because the opening delimiter is +followed by whitespace: + +```````````````````````````````` example +__ foo bar__ +. +<p>__ foo bar__</p> +```````````````````````````````` + + +A line ending counts as whitespace: +```````````````````````````````` example +__ +foo bar__ +. +<p>__ +foo bar__</p> +```````````````````````````````` + + +This is not strong emphasis, because the opening `__` is preceded +by an alphanumeric and followed by punctuation: + +```````````````````````````````` example +a__"foo"__ +. +<p>a__&quot;foo&quot;__</p> +```````````````````````````````` + + +Intraword strong emphasis is forbidden with `__`: + +```````````````````````````````` example +foo__bar__ +. +<p>foo__bar__</p> +```````````````````````````````` + + +```````````````````````````````` example +5__6__78 +. +<p>5__6__78</p> +```````````````````````````````` + + +```````````````````````````````` example +пристаням__стремятся__ +. 
+<p>пристаням__стремятся__</p> +```````````````````````````````` + + +```````````````````````````````` example +__foo, __bar__, baz__ +. +<p><strong>foo, <strong>bar</strong>, baz</strong></p> +```````````````````````````````` + + +This is strong emphasis, even though the opening delimiter is +both left- and right-flanking, because it is preceded by +punctuation: + +```````````````````````````````` example +foo-__(bar)__ +. +<p>foo-<strong>(bar)</strong></p> +```````````````````````````````` + + + +Rule 7: + +This is not strong emphasis, because the closing delimiter is preceded +by whitespace: + +```````````````````````````````` example +**foo bar ** +. +<p>**foo bar **</p> +```````````````````````````````` + + +(Nor can it be interpreted as an emphasized `*foo bar *`, because of +Rule 11.) + +This is not strong emphasis, because the second `**` is +preceded by punctuation and followed by an alphanumeric: + +```````````````````````````````` example +**(**foo) +. +<p>**(**foo)</p> +```````````````````````````````` + + +The point of this restriction is more easily appreciated +with these examples: + +```````````````````````````````` example +*(**foo**)* +. +<p><em>(<strong>foo</strong>)</em></p> +```````````````````````````````` + + +```````````````````````````````` example +**Gomphocarpus (*Gomphocarpus physocarpus*, syn. +*Asclepias physocarpa*)** +. +<p><strong>Gomphocarpus (<em>Gomphocarpus physocarpus</em>, syn. +<em>Asclepias physocarpa</em>)</strong></p> +```````````````````````````````` + + +```````````````````````````````` example +**foo "*bar*" foo** +. +<p><strong>foo &quot;<em>bar</em>&quot; foo</strong></p> +```````````````````````````````` + + +Intraword emphasis: + +```````````````````````````````` example +**foo**bar +. +<p><strong>foo</strong>bar</p> +```````````````````````````````` + + +Rule 8: + +This is not strong emphasis, because the closing delimiter is +preceded by whitespace: + +```````````````````````````````` example +__foo bar __ +. 
+<p>__foo bar __</p> +```````````````````````````````` + + +This is not strong emphasis, because the second `__` is +preceded by punctuation and followed by an alphanumeric: + +```````````````````````````````` example +__(__foo) +. +<p>__(__foo)</p> +```````````````````````````````` + + +The point of this restriction is more easily appreciated +with this example: + +```````````````````````````````` example +_(__foo__)_ +. +<p><em>(<strong>foo</strong>)</em></p> +```````````````````````````````` + + +Intraword strong emphasis is forbidden with `__`: + +```````````````````````````````` example +__foo__bar +. +<p>__foo__bar</p> +```````````````````````````````` + + +```````````````````````````````` example +__пристаням__стремятся +. +<p>__пристаням__стремятся</p> +```````````````````````````````` + + +```````````````````````````````` example +__foo__bar__baz__ +. +<p><strong>foo__bar__baz</strong></p> +```````````````````````````````` + + +This is strong emphasis, even though the closing delimiter is +both left- and right-flanking, because it is followed by +punctuation: + +```````````````````````````````` example +__(bar)__. +. +<p><strong>(bar)</strong>.</p> +```````````````````````````````` + + +Rule 9: + +Any nonempty sequence of inline elements can be the contents of an +emphasized span. + +```````````````````````````````` example +*foo [bar](/url)* +. +<p><em>foo <a href="/url">bar</a></em></p> +```````````````````````````````` + + +```````````````````````````````` example +*foo +bar* +. +<p><em>foo +bar</em></p> +```````````````````````````````` + + +In particular, emphasis and strong emphasis can be nested +inside emphasis: + +```````````````````````````````` example +_foo __bar__ baz_ +. +<p><em>foo <strong>bar</strong> baz</em></p> +```````````````````````````````` + + +```````````````````````````````` example +_foo _bar_ baz_ +. 
+<p><em>foo <em>bar</em> baz</em></p> +```````````````````````````````` + + +```````````````````````````````` example +__foo_ bar_ +. +<p><em><em>foo</em> bar</em></p> +```````````````````````````````` + + +```````````````````````````````` example +*foo *bar** +. +<p><em>foo <em>bar</em></em></p> +```````````````````````````````` + + +```````````````````````````````` example +*foo **bar** baz* +. +<p><em>foo <strong>bar</strong> baz</em></p> +```````````````````````````````` + +```````````````````````````````` example +*foo**bar**baz* +. +<p><em>foo<strong>bar</strong>baz</em></p> +```````````````````````````````` + +Note that in the preceding case, the interpretation + +``` markdown +<p><em>foo</em><em>bar<em></em>baz</em></p> +``` + + +is precluded by the condition that a delimiter that +can both open and close (like the `*` after `foo`) +cannot form emphasis if the sum of the lengths of +the delimiter runs containing the opening and +closing delimiters is a multiple of 3 unless +both lengths are multiples of 3. + + +For the same reason, we don't get two consecutive +emphasis sections in this example: + +```````````````````````````````` example +*foo**bar* +. +<p><em>foo**bar</em></p> +```````````````````````````````` + + +The same condition ensures that the following +cases are all strong emphasis nested inside +emphasis, even when the interior whitespace is +omitted: + + +```````````````````````````````` example +***foo** bar* +. +<p><em><strong>foo</strong> bar</em></p> +```````````````````````````````` + + +```````````````````````````````` example +*foo **bar*** +. +<p><em>foo <strong>bar</strong></em></p> +```````````````````````````````` + + +```````````````````````````````` example +*foo**bar*** +. 
+<p><em>foo<strong>bar</strong></em></p> +```````````````````````````````` + + +When the lengths of the interior closing and opening +delimiter runs are *both* multiples of 3, though, +they can match to create emphasis: + +```````````````````````````````` example +foo***bar***baz +. +<p>foo<em><strong>bar</strong></em>baz</p> +```````````````````````````````` + +```````````````````````````````` example +foo******bar*********baz +. +<p>foo<strong><strong><strong>bar</strong></strong></strong>***baz</p> +```````````````````````````````` + + +Indefinite levels of nesting are possible: + +```````````````````````````````` example +*foo **bar *baz* bim** bop* +. +<p><em>foo <strong>bar <em>baz</em> bim</strong> bop</em></p> +```````````````````````````````` + + +```````````````````````````````` example +*foo [*bar*](/url)* +. +<p><em>foo <a href="/url"><em>bar</em></a></em></p> +```````````````````````````````` + + +There can be no empty emphasis or strong emphasis: + +```````````````````````````````` example +** is not an empty emphasis +. +<p>** is not an empty emphasis</p> +```````````````````````````````` + + +```````````````````````````````` example +**** is not an empty strong emphasis +. +<p>**** is not an empty strong emphasis</p> +```````````````````````````````` + + + +Rule 10: + +Any nonempty sequence of inline elements can be the contents of an +strongly emphasized span. + +```````````````````````````````` example +**foo [bar](/url)** +. +<p><strong>foo <a href="/url">bar</a></strong></p> +```````````````````````````````` + + +```````````````````````````````` example +**foo +bar** +. +<p><strong>foo +bar</strong></p> +```````````````````````````````` + + +In particular, emphasis and strong emphasis can be nested +inside strong emphasis: + +```````````````````````````````` example +__foo _bar_ baz__ +. +<p><strong>foo <em>bar</em> baz</strong></p> +```````````````````````````````` + + +```````````````````````````````` example +__foo __bar__ baz__ +. 
+<p><strong>foo <strong>bar</strong> baz</strong></p> +```````````````````````````````` + + +```````````````````````````````` example +____foo__ bar__ +. +<p><strong><strong>foo</strong> bar</strong></p> +```````````````````````````````` + + +```````````````````````````````` example +**foo **bar**** +. +<p><strong>foo <strong>bar</strong></strong></p> +```````````````````````````````` + + +```````````````````````````````` example +**foo *bar* baz** +. +<p><strong>foo <em>bar</em> baz</strong></p> +```````````````````````````````` + + +```````````````````````````````` example +**foo*bar*baz** +. +<p><strong>foo<em>bar</em>baz</strong></p> +```````````````````````````````` + + +```````````````````````````````` example +***foo* bar** +. +<p><strong><em>foo</em> bar</strong></p> +```````````````````````````````` + + +```````````````````````````````` example +**foo *bar*** +. +<p><strong>foo <em>bar</em></strong></p> +```````````````````````````````` + + +Indefinite levels of nesting are possible: + +```````````````````````````````` example +**foo *bar **baz** +bim* bop** +. +<p><strong>foo <em>bar <strong>baz</strong> +bim</em> bop</strong></p> +```````````````````````````````` + + +```````````````````````````````` example +**foo [*bar*](/url)** +. +<p><strong>foo <a href="/url"><em>bar</em></a></strong></p> +```````````````````````````````` + + +There can be no empty emphasis or strong emphasis: + +```````````````````````````````` example +__ is not an empty emphasis +. +<p>__ is not an empty emphasis</p> +```````````````````````````````` + + +```````````````````````````````` example +____ is not an empty strong emphasis +. +<p>____ is not an empty strong emphasis</p> +```````````````````````````````` + + + +Rule 11: + +```````````````````````````````` example +foo *** +. +<p>foo ***</p> +```````````````````````````````` + + +```````````````````````````````` example +foo *\** +. 
+<p>foo <em>*</em></p> +```````````````````````````````` + + +```````````````````````````````` example +foo *_* +. +<p>foo <em>_</em></p> +```````````````````````````````` + + +```````````````````````````````` example +foo ***** +. +<p>foo *****</p> +```````````````````````````````` + + +```````````````````````````````` example +foo **\*** +. +<p>foo <strong>*</strong></p> +```````````````````````````````` + + +```````````````````````````````` example +foo **_** +. +<p>foo <strong>_</strong></p> +```````````````````````````````` + + +Note that when delimiters do not match evenly, Rule 11 determines +that the excess literal `*` characters will appear outside of the +emphasis, rather than inside it: + +```````````````````````````````` example +**foo* +. +<p>*<em>foo</em></p> +```````````````````````````````` + + +```````````````````````````````` example +*foo** +. +<p><em>foo</em>*</p> +```````````````````````````````` + + +```````````````````````````````` example +***foo** +. +<p>*<strong>foo</strong></p> +```````````````````````````````` + + +```````````````````````````````` example +****foo* +. +<p>***<em>foo</em></p> +```````````````````````````````` + + +```````````````````````````````` example +**foo*** +. +<p><strong>foo</strong>*</p> +```````````````````````````````` + + +```````````````````````````````` example +*foo**** +. +<p><em>foo</em>***</p> +```````````````````````````````` + + + +Rule 12: + +```````````````````````````````` example +foo ___ +. +<p>foo ___</p> +```````````````````````````````` + + +```````````````````````````````` example +foo _\__ +. +<p>foo <em>_</em></p> +```````````````````````````````` + + +```````````````````````````````` example +foo _*_ +. +<p>foo <em>*</em></p> +```````````````````````````````` + + +```````````````````````````````` example +foo _____ +. +<p>foo _____</p> +```````````````````````````````` + + +```````````````````````````````` example +foo __\___ +. 
+<p>foo <strong>_</strong></p> +```````````````````````````````` + + +```````````````````````````````` example +foo __*__ +. +<p>foo <strong>*</strong></p> +```````````````````````````````` + + +```````````````````````````````` example +__foo_ +. +<p>_<em>foo</em></p> +```````````````````````````````` + + +Note that when delimiters do not match evenly, Rule 12 determines +that the excess literal `_` characters will appear outside of the +emphasis, rather than inside it: + +```````````````````````````````` example +_foo__ +. +<p><em>foo</em>_</p> +```````````````````````````````` + + +```````````````````````````````` example +___foo__ +. +<p>_<strong>foo</strong></p> +```````````````````````````````` + + +```````````````````````````````` example +____foo_ +. +<p>___<em>foo</em></p> +```````````````````````````````` + + +```````````````````````````````` example +__foo___ +. +<p><strong>foo</strong>_</p> +```````````````````````````````` + + +```````````````````````````````` example +_foo____ +. +<p><em>foo</em>___</p> +```````````````````````````````` + + +Rule 13 implies that if you want emphasis nested directly inside +emphasis, you must use different delimiters: + +```````````````````````````````` example +**foo** +. +<p><strong>foo</strong></p> +```````````````````````````````` + + +```````````````````````````````` example +*_foo_* +. +<p><em><em>foo</em></em></p> +```````````````````````````````` + + +```````````````````````````````` example +__foo__ +. +<p><strong>foo</strong></p> +```````````````````````````````` + + +```````````````````````````````` example +_*foo*_ +. +<p><em><em>foo</em></em></p> +```````````````````````````````` + + +However, strong emphasis within strong emphasis is possible without +switching delimiters: + +```````````````````````````````` example +****foo**** +. +<p><strong><strong>foo</strong></strong></p> +```````````````````````````````` + + +```````````````````````````````` example +____foo____ +. 
+<p><strong><strong>foo</strong></strong></p> +```````````````````````````````` + + + +Rule 13 can be applied to arbitrarily long sequences of +delimiters: + +```````````````````````````````` example +******foo****** +. +<p><strong><strong><strong>foo</strong></strong></strong></p> +```````````````````````````````` + + +Rule 14: + +```````````````````````````````` example +***foo*** +. +<p><em><strong>foo</strong></em></p> +```````````````````````````````` + + +```````````````````````````````` example +_____foo_____ +. +<p><em><strong><strong>foo</strong></strong></em></p> +```````````````````````````````` + + +Rule 15: + +```````````````````````````````` example +*foo _bar* baz_ +. +<p><em>foo _bar</em> baz_</p> +```````````````````````````````` + + +```````````````````````````````` example +*foo __bar *baz bim__ bam* +. +<p><em>foo <strong>bar *baz bim</strong> bam</em></p> +```````````````````````````````` + + +Rule 16: + +```````````````````````````````` example +**foo **bar baz** +. +<p>**foo <strong>bar baz</strong></p> +```````````````````````````````` + + +```````````````````````````````` example +*foo *bar baz* +. +<p>*foo <em>bar baz</em></p> +```````````````````````````````` + + +Rule 17: + +```````````````````````````````` example +*[bar*](/url) +. +<p>*<a href="/url">bar*</a></p> +```````````````````````````````` + + +```````````````````````````````` example +_foo [bar_](/url) +. +<p>_foo <a href="/url">bar_</a></p> +```````````````````````````````` + + +```````````````````````````````` example +*<img src="foo" title="*"/> +. +<p>*<img src="foo" title="*"/></p> +```````````````````````````````` + + +```````````````````````````````` example +**<a href="**"> +. +<p>**<a href="**"></p> +```````````````````````````````` + + +```````````````````````````````` example +__<a href="__"> +. +<p>__<a href="__"></p> +```````````````````````````````` + + +```````````````````````````````` example +*a `*`* +. 
+<p><em>a <code>*</code></em></p> +```````````````````````````````` + + +```````````````````````````````` example +_a `_`_ +. +<p><em>a <code>_</code></em></p> +```````````````````````````````` + + +```````````````````````````````` example +**a<https://foo.bar/?q=**> +. +<p>**a<a href="https://foo.bar/?q=**">https://foo.bar/?q=**</a></p> +```````````````````````````````` + + +```````````````````````````````` example +__a<https://foo.bar/?q=__> +. +<p>__a<a href="https://foo.bar/?q=__">https://foo.bar/?q=__</a></p> +```````````````````````````````` + + + +## Links + +A link contains [link text] (the visible text), a [link destination] +(the URI that is the link destination), and optionally a [link title]. +There are two basic kinds of links in Markdown. In [inline links] the +destination and title are given immediately after the link text. In +[reference links] the destination and title are defined elsewhere in +the document. + +A [link text](@) consists of a sequence of zero or more +inline elements enclosed by square brackets (`[` and `]`). The +following rules apply: + +- Links may not contain other links, at any level of nesting. If + multiple otherwise valid link definitions appear nested inside each + other, the inner-most definition is used. + +- Brackets are allowed in the [link text] only if (a) they + are backslash-escaped or (b) they appear as a matched pair of brackets, + with an open bracket `[`, a sequence of zero or more inlines, and + a close bracket `]`. + +- Backtick [code spans], [autolinks], and raw [HTML tags] bind more tightly + than the brackets in link text. Thus, for example, + `` [foo`]` `` could not be a link text, since the second `]` + is part of a code span. + +- The brackets in link text bind more tightly than markers for + [emphasis and strong emphasis]. Thus, for example, `*[foo*](url)` is a link. 
+ +A [link destination](@) consists of either + +- a sequence of zero or more characters between an opening `<` and a + closing `>` that contains no line endings or unescaped + `<` or `>` characters, or + +- a nonempty sequence of characters that does not start with `<`, + does not include [ASCII control characters][ASCII control character] + or [space] character, and includes parentheses only if (a) they are + backslash-escaped or (b) they are part of a balanced pair of + unescaped parentheses. + (Implementations may impose limits on parentheses nesting to + avoid performance issues, but at least three levels of nesting + should be supported.) + +A [link title](@) consists of either + +- a sequence of zero or more characters between straight double-quote + characters (`"`), including a `"` character only if it is + backslash-escaped, or + +- a sequence of zero or more characters between straight single-quote + characters (`'`), including a `'` character only if it is + backslash-escaped, or + +- a sequence of zero or more characters between matching parentheses + (`(...)`), including a `(` or `)` character only if it is + backslash-escaped. + +Although [link titles] may span multiple lines, they may not contain +a [blank line]. + +An [inline link](@) consists of a [link text] followed immediately +by a left parenthesis `(`, an optional [link destination], an optional +[link title], and a right parenthesis `)`. +These four components may be separated by spaces, tabs, and up to one line +ending. +If both [link destination] and [link title] are present, they *must* be +separated by spaces, tabs, and up to one line ending. + +The link's text consists of the inlines contained +in the [link text] (excluding the enclosing square brackets). +The link's URI consists of the link destination, excluding enclosing +`<...>` if present, with backslash-escapes in effect as described +above. 
The link's title consists of the link title, excluding its +enclosing delimiters, with backslash-escapes in effect as described +above. + +Here is a simple inline link: + +```````````````````````````````` example +[link](/uri "title") +. +<p><a href="/uri" title="title">link</a></p> +```````````````````````````````` + + +The title, the link text and even +the destination may be omitted: + +```````````````````````````````` example +[link](/uri) +. +<p><a href="/uri">link</a></p> +```````````````````````````````` + +```````````````````````````````` example +[](./target.md) +. +<p><a href="./target.md"></a></p> +```````````````````````````````` + + +```````````````````````````````` example +[link]() +. +<p><a href="">link</a></p> +```````````````````````````````` + + +```````````````````````````````` example +[link](<>) +. +<p><a href="">link</a></p> +```````````````````````````````` + + +```````````````````````````````` example +[]() +. +<p><a href=""></a></p> +```````````````````````````````` + +The destination can only contain spaces if it is +enclosed in pointy brackets: + +```````````````````````````````` example +[link](/my uri) +. +<p>[link](/my uri)</p> +```````````````````````````````` + +```````````````````````````````` example +[link](</my uri>) +. +<p><a href="/my%20uri">link</a></p> +```````````````````````````````` + +The destination cannot contain line endings, +even if enclosed in pointy brackets: + +```````````````````````````````` example +[link](foo +bar) +. +<p>[link](foo +bar)</p> +```````````````````````````````` + +```````````````````````````````` example +[link](<foo +bar>) +. +<p>[link](<foo +bar>)</p> +```````````````````````````````` + +The destination can contain `)` if it is enclosed +in pointy brackets: + +```````````````````````````````` example +[a](<b)c>) +. 
+<p><a href="b)c">a</a></p> +```````````````````````````````` + +Pointy brackets that enclose links must be unescaped: + +```````````````````````````````` example +[link](<foo\>) +. +<p>[link](&lt;foo&gt;)</p> +```````````````````````````````` + +These are not links, because the opening pointy bracket +is not matched properly: + +```````````````````````````````` example +[a](<b)c +[a](<b)c> +[a](<b>c) +. +<p>[a](&lt;b)c +[a](&lt;b)c&gt; +[a](<b>c)</p> +```````````````````````````````` + +Parentheses inside the link destination may be escaped: + +```````````````````````````````` example +[link](\(foo\)) +. +<p><a href="(foo)">link</a></p> +```````````````````````````````` + +Any number of parentheses are allowed without escaping, as long as they are +balanced: + +```````````````````````````````` example +[link](foo(and(bar))) +. +<p><a href="foo(and(bar))">link</a></p> +```````````````````````````````` + +However, if you have unbalanced parentheses, you need to escape or use the +`<...>` form: + +```````````````````````````````` example +[link](foo(and(bar)) +. +<p>[link](foo(and(bar))</p> +```````````````````````````````` + + +```````````````````````````````` example +[link](foo\(and\(bar\)) +. +<p><a href="foo(and(bar)">link</a></p> +```````````````````````````````` + + +```````````````````````````````` example +[link](<foo(and(bar)>) +. +<p><a href="foo(and(bar)">link</a></p> +```````````````````````````````` + + +Parentheses and other symbols can also be escaped, as usual +in Markdown: + +```````````````````````````````` example +[link](foo\)\:) +. +<p><a href="foo):">link</a></p> +```````````````````````````````` + + +A link can contain fragment identifiers and queries: + +```````````````````````````````` example +[link](#fragment) + +[link](https://example.com#fragment) + +[link](https://example.com?foo=3#frag) +. 
+<p><a href="#fragment">link</a></p> +<p><a href="https://example.com#fragment">link</a></p> +<p><a href="https://example.com?foo=3#frag">link</a></p> +```````````````````````````````` + + +Note that a backslash before a non-escapable character is +just a backslash: + +```````````````````````````````` example +[link](foo\bar) +. +<p><a href="foo%5Cbar">link</a></p> +```````````````````````````````` + + +URL-escaping should be left alone inside the destination, as all +URL-escaped characters are also valid URL characters. Entity and +numerical character references in the destination will be parsed +into the corresponding Unicode code points, as usual. These may +be optionally URL-escaped when written as HTML, but this spec +does not enforce any particular policy for rendering URLs in +HTML or other formats. Renderers may make different decisions +about how to escape or normalize URLs in the output. + +```````````````````````````````` example +[link](foo%20b&auml;) +. +<p><a href="foo%20b%C3%A4">link</a></p> +```````````````````````````````` + + +Note that, because titles can often be parsed as destinations, +if you try to omit the destination and keep the title, you'll +get unexpected results: + +```````````````````````````````` example +[link]("title") +. +<p><a href="%22title%22">link</a></p> +```````````````````````````````` + + +Titles may be in single quotes, double quotes, or parentheses: + +```````````````````````````````` example +[link](/url "title") +[link](/url 'title') +[link](/url (title)) +. +<p><a href="/url" title="title">link</a> +<a href="/url" title="title">link</a> +<a href="/url" title="title">link</a></p> +```````````````````````````````` + + +Backslash escapes and entity and numeric character references +may be used in titles: + +```````````````````````````````` example +[link](/url "title \"&quot;") +. 
+<p><a href="/url" title="title &quot;&quot;">link</a></p> +```````````````````````````````` + + +Titles must be separated from the link using spaces, tabs, and up to one line +ending. +Other [Unicode whitespace] like non-breaking space doesn't work. + +```````````````````````````````` example +[link](/url "title") +. +<p><a href="/url%C2%A0%22title%22">link</a></p> +```````````````````````````````` + + +Nested balanced quotes are not allowed without escaping: + +```````````````````````````````` example +[link](/url "title "and" title") +. +<p>[link](/url &quot;title &quot;and&quot; title&quot;)</p> +```````````````````````````````` + + +But it is easy to work around this by using a different quote type: + +```````````````````````````````` example +[link](/url 'title "and" title') +. +<p><a href="/url" title="title &quot;and&quot; title">link</a></p> +```````````````````````````````` + + +(Note: `Markdown.pl` did allow double quotes inside a double-quoted +title, and its test suite included a test demonstrating this. +But it is hard to see a good rationale for the extra complexity this +brings, since there are already many ways---backslash escaping, +entity and numeric character references, or using a different +quote type for the enclosing title---to write titles containing +double quotes. `Markdown.pl`'s handling of titles has a number +of other strange features. For example, it allows single-quoted +titles in inline links, but not reference links. And, in +reference links but not inline links, it allows a title to begin +with `"` and end with `)`. `Markdown.pl` 1.0.1 even allows +titles with no closing quotation mark, though 1.0.2b8 does not. +It seems preferable to adopt a simple, rational rule that works +the same way in inline links and link reference definitions.) + +Spaces, tabs, and up to one line ending is allowed around the destination and +title: + +```````````````````````````````` example +[link]( /uri + "title" ) +. 
+<p><a href="/uri" title="title">link</a></p> +```````````````````````````````` + + +But it is not allowed between the link text and the +following parenthesis: + +```````````````````````````````` example +[link] (/uri) +. +<p>[link] (/uri)</p> +```````````````````````````````` + + +The link text may contain balanced brackets, but not unbalanced ones, +unless they are escaped: + +```````````````````````````````` example +[link [foo [bar]]](/uri) +. +<p><a href="/uri">link [foo [bar]]</a></p> +```````````````````````````````` + + +```````````````````````````````` example +[link] bar](/uri) +. +<p>[link] bar](/uri)</p> +```````````````````````````````` + + +```````````````````````````````` example +[link [bar](/uri) +. +<p>[link <a href="/uri">bar</a></p> +```````````````````````````````` + + +```````````````````````````````` example +[link \[bar](/uri) +. +<p><a href="/uri">link [bar</a></p> +```````````````````````````````` + + +The link text may contain inline content: + +```````````````````````````````` example +[link *foo **bar** `#`*](/uri) +. +<p><a href="/uri">link <em>foo <strong>bar</strong> <code>#</code></em></a></p> +```````````````````````````````` + + +```````````````````````````````` example +[![moon](moon.jpg)](/uri) +. +<p><a href="/uri"><img src="moon.jpg" alt="moon" /></a></p> +```````````````````````````````` + + +However, links may not contain other links, at any level of nesting. + +```````````````````````````````` example +[foo [bar](/uri)](/uri) +. +<p>[foo <a href="/uri">bar</a>](/uri)</p> +```````````````````````````````` + + +```````````````````````````````` example +[foo *[bar [baz](/uri)](/uri)*](/uri) +. +<p>[foo <em>[bar <a href="/uri">baz</a>](/uri)</em>](/uri)</p> +```````````````````````````````` + + +```````````````````````````````` example +![[[foo](uri1)](uri2)](uri3) +. 
+<p><img src="uri3" alt="[foo](uri2)" /></p> +```````````````````````````````` + + +These cases illustrate the precedence of link text grouping over +emphasis grouping: + +```````````````````````````````` example +*[foo*](/uri) +. +<p>*<a href="/uri">foo*</a></p> +```````````````````````````````` + + +```````````````````````````````` example +[foo *bar](baz*) +. +<p><a href="baz*">foo *bar</a></p> +```````````````````````````````` + + +Note that brackets that *aren't* part of links do not take +precedence: + +```````````````````````````````` example +*foo [bar* baz] +. +<p><em>foo [bar</em> baz]</p> +```````````````````````````````` + + +These cases illustrate the precedence of HTML tags, code spans, +and autolinks over link grouping: + +```````````````````````````````` example +[foo <bar attr="](baz)"> +. +<p>[foo <bar attr="](baz)"></p> +```````````````````````````````` + + +```````````````````````````````` example +[foo`](/uri)` +. +<p>[foo<code>](/uri)</code></p> +```````````````````````````````` + + +```````````````````````````````` example +[foo<https://example.com/?search=](uri)> +. +<p>[foo<a href="https://example.com/?search=%5D(uri)">https://example.com/?search=](uri)</a></p> +```````````````````````````````` + + +There are three kinds of [reference link](@)s: +[full](#full-reference-link), [collapsed](#collapsed-reference-link), +and [shortcut](#shortcut-reference-link). + +A [full reference link](@) +consists of a [link text] immediately followed by a [link label] +that [matches] a [link reference definition] elsewhere in the document. + +A [link label](@) begins with a left bracket (`[`) and ends +with the first right bracket (`]`) that is not backslash-escaped. +Between these brackets there must be at least one character that is not a space, +tab, or line ending. +Unescaped square bracket characters are not allowed inside the +opening and closing square brackets of [link labels]. 
A link +label can have at most 999 characters inside the square +brackets. + +One label [matches](@) +another just in case their normalized forms are equal. To normalize a +label, strip off the opening and closing brackets, +perform the *Unicode case fold*, strip leading and trailing +spaces, tabs, and line endings, and collapse consecutive internal +spaces, tabs, and line endings to a single space. If there are multiple +matching reference link definitions, the one that comes first in the +document is used. (It is desirable in such cases to emit a warning.) + +The link's URI and title are provided by the matching [link +reference definition]. + +Here is a simple example: + +```````````````````````````````` example +[foo][bar] + +[bar]: /url "title" +. +<p><a href="/url" title="title">foo</a></p> +```````````````````````````````` + + +The rules for the [link text] are the same as with +[inline links]. Thus: + +The link text may contain balanced brackets, but not unbalanced ones, +unless they are escaped: + +```````````````````````````````` example +[link [foo [bar]]][ref] + +[ref]: /uri +. +<p><a href="/uri">link [foo [bar]]</a></p> +```````````````````````````````` + + +```````````````````````````````` example +[link \[bar][ref] + +[ref]: /uri +. +<p><a href="/uri">link [bar</a></p> +```````````````````````````````` + + +The link text may contain inline content: + +```````````````````````````````` example +[link *foo **bar** `#`*][ref] + +[ref]: /uri +. +<p><a href="/uri">link <em>foo <strong>bar</strong> <code>#</code></em></a></p> +```````````````````````````````` + + +```````````````````````````````` example +[![moon](moon.jpg)][ref] + +[ref]: /uri +. +<p><a href="/uri"><img src="moon.jpg" alt="moon" /></a></p> +```````````````````````````````` + + +However, links may not contain other links, at any level of nesting. + +```````````````````````````````` example +[foo [bar](/uri)][ref] + +[ref]: /uri +. 
+<p>[foo <a href="/uri">bar</a>]<a href="/uri">ref</a></p> +```````````````````````````````` + + +```````````````````````````````` example +[foo *bar [baz][ref]*][ref] + +[ref]: /uri +. +<p>[foo <em>bar <a href="/uri">baz</a></em>]<a href="/uri">ref</a></p> +```````````````````````````````` + + +(In the examples above, we have two [shortcut reference links] +instead of one [full reference link].) + +The following cases illustrate the precedence of link text grouping over +emphasis grouping: + +```````````````````````````````` example +*[foo*][ref] + +[ref]: /uri +. +<p>*<a href="/uri">foo*</a></p> +```````````````````````````````` + + +```````````````````````````````` example +[foo *bar][ref]* + +[ref]: /uri +. +<p><a href="/uri">foo *bar</a>*</p> +```````````````````````````````` + + +These cases illustrate the precedence of HTML tags, code spans, +and autolinks over link grouping: + +```````````````````````````````` example +[foo <bar attr="][ref]"> + +[ref]: /uri +. +<p>[foo <bar attr="][ref]"></p> +```````````````````````````````` + + +```````````````````````````````` example +[foo`][ref]` + +[ref]: /uri +. +<p>[foo<code>][ref]</code></p> +```````````````````````````````` + + +```````````````````````````````` example +[foo<https://example.com/?search=][ref]> + +[ref]: /uri +. +<p>[foo<a href="https://example.com/?search=%5D%5Bref%5D">https://example.com/?search=][ref]</a></p> +```````````````````````````````` + + +Matching is case-insensitive: + +```````````````````````````````` example +[foo][BaR] + +[bar]: /url "title" +. +<p><a href="/url" title="title">foo</a></p> +```````````````````````````````` + + +Unicode case fold is used: + +```````````````````````````````` example +[ẞ] + +[SS]: /url +. +<p><a href="/url">ẞ</a></p> +```````````````````````````````` + + +Consecutive internal spaces, tabs, and line endings are treated as one space for +purposes of determining matching: + +```````````````````````````````` example +[Foo + bar]: /url + +[Baz][Foo bar] +. 
+<p><a href="/url">Baz</a></p> +```````````````````````````````` + + +No spaces, tabs, or line endings are allowed between the [link text] and the +[link label]: + +```````````````````````````````` example +[foo] [bar] + +[bar]: /url "title" +. +<p>[foo] <a href="/url" title="title">bar</a></p> +```````````````````````````````` + + +```````````````````````````````` example +[foo] +[bar] + +[bar]: /url "title" +. +<p>[foo] +<a href="/url" title="title">bar</a></p> +```````````````````````````````` + + +This is a departure from John Gruber's original Markdown syntax +description, which explicitly allows whitespace between the link +text and the link label. It brings reference links in line with +[inline links], which (according to both original Markdown and +this spec) cannot have whitespace after the link text. More +importantly, it prevents inadvertent capture of consecutive +[shortcut reference links]. If whitespace is allowed between the +link text and the link label, then in the following we will have +a single reference link, not two shortcut reference links, as +intended: + +``` markdown +[foo] +[bar] + +[foo]: /url1 +[bar]: /url2 +``` + +(Note that [shortcut reference links] were introduced by Gruber +himself in a beta version of `Markdown.pl`, but never included +in the official syntax description. Without shortcut reference +links, it is harmless to allow space between the link text and +link label; but once shortcut references are introduced, it is +too dangerous to allow this, as it frequently leads to +unintended results.) + +When there are multiple matching [link reference definitions], +the first is used: + +```````````````````````````````` example +[foo]: /url1 + +[foo]: /url2 + +[bar][foo] +. +<p><a href="/url1">bar</a></p> +```````````````````````````````` + + +Note that matching is performed on normalized strings, not parsed +inline content. 
So the following does not match, even though the +labels define equivalent inline content: + +```````````````````````````````` example +[bar][foo\!] + +[foo!]: /url +. +<p>[bar][foo!]</p> +```````````````````````````````` + + +[Link labels] cannot contain brackets, unless they are +backslash-escaped: + +```````````````````````````````` example +[foo][ref[] + +[ref[]: /uri +. +<p>[foo][ref[]</p> +<p>[ref[]: /uri</p> +```````````````````````````````` + + +```````````````````````````````` example +[foo][ref[bar]] + +[ref[bar]]: /uri +. +<p>[foo][ref[bar]]</p> +<p>[ref[bar]]: /uri</p> +```````````````````````````````` + + +```````````````````````````````` example +[[[foo]]] + +[[[foo]]]: /url +. +<p>[[[foo]]]</p> +<p>[[[foo]]]: /url</p> +```````````````````````````````` + + +```````````````````````````````` example +[foo][ref\[] + +[ref\[]: /uri +. +<p><a href="/uri">foo</a></p> +```````````````````````````````` + + +Note that in this example `]` is not backslash-escaped: + +```````````````````````````````` example +[bar\\]: /uri + +[bar\\] +. +<p><a href="/uri">bar\</a></p> +```````````````````````````````` + + +A [link label] must contain at least one character that is not a space, tab, or +line ending: + +```````````````````````````````` example +[] + +[]: /uri +. +<p>[]</p> +<p>[]: /uri</p> +```````````````````````````````` + + +```````````````````````````````` example +[ + ] + +[ + ]: /uri +. +<p>[ +]</p> +<p>[ +]: /uri</p> +```````````````````````````````` + + +A [collapsed reference link](@) +consists of a [link label] that [matches] a +[link reference definition] elsewhere in the +document, followed by the string `[]`. +The contents of the link label are parsed as inlines, +which are used as the link's text. The link's URI and title are +provided by the matching reference link definition. Thus, +`[foo][]` is equivalent to `[foo][foo]`. + +```````````````````````````````` example +[foo][] + +[foo]: /url "title" +. 
+<p><a href="/url" title="title">foo</a></p> +```````````````````````````````` + + +```````````````````````````````` example +[*foo* bar][] + +[*foo* bar]: /url "title" +. +<p><a href="/url" title="title"><em>foo</em> bar</a></p> +```````````````````````````````` + + +The link labels are case-insensitive: + +```````````````````````````````` example +[Foo][] + +[foo]: /url "title" +. +<p><a href="/url" title="title">Foo</a></p> +```````````````````````````````` + + + +As with full reference links, spaces, tabs, or line endings are not +allowed between the two sets of brackets: + +```````````````````````````````` example +[foo] +[] + +[foo]: /url "title" +. +<p><a href="/url" title="title">foo</a> +[]</p> +```````````````````````````````` + + +A [shortcut reference link](@) +consists of a [link label] that [matches] a +[link reference definition] elsewhere in the +document and is not followed by `[]` or a link label. +The contents of the link label are parsed as inlines, +which are used as the link's text. The link's URI and title +are provided by the matching link reference definition. +Thus, `[foo]` is equivalent to `[foo][]`. + +```````````````````````````````` example +[foo] + +[foo]: /url "title" +. +<p><a href="/url" title="title">foo</a></p> +```````````````````````````````` + + +```````````````````````````````` example +[*foo* bar] + +[*foo* bar]: /url "title" +. +<p><a href="/url" title="title"><em>foo</em> bar</a></p> +```````````````````````````````` + + +```````````````````````````````` example +[[*foo* bar]] + +[*foo* bar]: /url "title" +. +<p>[<a href="/url" title="title"><em>foo</em> bar</a>]</p> +```````````````````````````````` + + +```````````````````````````````` example +[[bar [foo] + +[foo]: /url +. +<p>[[bar <a href="/url">foo</a></p> +```````````````````````````````` + + +The link labels are case-insensitive: + +```````````````````````````````` example +[Foo] + +[foo]: /url "title" +. 
+<p><a href="/url" title="title">Foo</a></p> +```````````````````````````````` + + +A space after the link text should be preserved: + +```````````````````````````````` example +[foo] bar + +[foo]: /url +. +<p><a href="/url">foo</a> bar</p> +```````````````````````````````` + + +If you just want bracketed text, you can backslash-escape the +opening bracket to avoid links: + +```````````````````````````````` example +\[foo] + +[foo]: /url "title" +. +<p>[foo]</p> +```````````````````````````````` + + +Note that this is a link, because a link label ends with the first +following closing bracket: + +```````````````````````````````` example +[foo*]: /url + +*[foo*] +. +<p>*<a href="/url">foo*</a></p> +```````````````````````````````` + + +Full and collapsed references take precedence over shortcut +references: + +```````````````````````````````` example +[foo][bar] + +[foo]: /url1 +[bar]: /url2 +. +<p><a href="/url2">foo</a></p> +```````````````````````````````` + +```````````````````````````````` example +[foo][] + +[foo]: /url1 +. +<p><a href="/url1">foo</a></p> +```````````````````````````````` + +Inline links also take precedence: + +```````````````````````````````` example +[foo]() + +[foo]: /url1 +. +<p><a href="">foo</a></p> +```````````````````````````````` + +```````````````````````````````` example +[foo](not a link) + +[foo]: /url1 +. +<p><a href="/url1">foo</a>(not a link)</p> +```````````````````````````````` + +In the following case `[bar][baz]` is parsed as a reference, +`[foo]` as normal text: + +```````````````````````````````` example +[foo][bar][baz] + +[baz]: /url +. +<p>[foo]<a href="/url">bar</a></p> +```````````````````````````````` + + +Here, though, `[foo][bar]` is parsed as a reference, since +`[bar]` is defined: + +```````````````````````````````` example +[foo][bar][baz] + +[baz]: /url1 +[bar]: /url2 +. 
+<p><a href="/url2">foo</a><a href="/url1">baz</a></p> +```````````````````````````````` + + +Here `[foo]` is not parsed as a shortcut reference, because it +is followed by a link label (even though `[bar]` is not defined): + +```````````````````````````````` example +[foo][bar][baz] + +[baz]: /url1 +[foo]: /url2 +. +<p>[foo]<a href="/url1">bar</a></p> +```````````````````````````````` + + + +## Images + +Syntax for images is like the syntax for links, with one +difference. Instead of [link text], we have an +[image description](@). The rules for this are the +same as for [link text], except that (a) an +image description starts with `![` rather than `[`, and +(b) an image description may contain links. +An image description has inline elements +as its contents. When an image is rendered to HTML, +this is standardly used as the image's `alt` attribute. + +```````````````````````````````` example +![foo](/url "title") +. +<p><img src="/url" alt="foo" title="title" /></p> +```````````````````````````````` + + +```````````````````````````````` example +![foo *bar*] + +[foo *bar*]: train.jpg "train & tracks" +. +<p><img src="train.jpg" alt="foo bar" title="train &amp; tracks" /></p> +```````````````````````````````` + + +```````````````````````````````` example +![foo ![bar](/url)](/url2) +. +<p><img src="/url2" alt="foo bar" /></p> +```````````````````````````````` + + +```````````````````````````````` example +![foo [bar](/url)](/url2) +. +<p><img src="/url2" alt="foo bar" /></p> +```````````````````````````````` + + +Though this spec is concerned with parsing, not rendering, it is +recommended that in rendering to HTML, only the plain string content +of the [image description] be used. Note that in +the above example, the alt attribute's value is `foo bar`, not `foo +[bar](/url)` or `foo <a href="/url">bar</a>`. Only the plain string +content is rendered, without formatting. 
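
The recommendation above, to use only the plain string content of the image description, can be sketched as a recursive walk over the description's inline nodes. This is a minimal illustration, not the reference implementation: the `Inline` type and the `altText` function are hypothetical names, and the spec does not prescribe any particular AST shape.

```typescript
// Hypothetical inline node shape, for illustration only; the spec does not
// define an AST format.
type Inline =
  | { type: "text"; literal: string }
  | { type: "code"; literal: string }
  | { type: "emph" | "strong" | "link" | "image"; children: Inline[] };

// Collect only the plain string content of an image description,
// dropping emphasis, link, and nested image markup.
function altText(nodes: Inline[]): string {
  let out = "";
  for (const node of nodes) {
    switch (node.type) {
      case "text":
      case "code":
        out += node.literal;
        break;
      default:
        // Container node: keep its text, discard the formatting.
        out += altText(node.children);
    }
  }
  return out;
}

// Description of ![foo [bar](/url)](/url2): text "foo " plus a link around "bar".
const description: Inline[] = [
  { type: "text", literal: "foo " },
  { type: "link", children: [{ type: "text", literal: "bar" }] },
];
console.log(altText(description)); // foo bar
```

A renderer built this way would emit the result as the `alt` attribute value, after the usual HTML attribute escaping.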
+ +```````````````````````````````` example +![foo *bar*][] + +[foo *bar*]: train.jpg "train & tracks" +. +<p><img src="train.jpg" alt="foo bar" title="train &amp; tracks" /></p> +```````````````````````````````` + + +```````````````````````````````` example +![foo *bar*][foobar] + +[FOOBAR]: train.jpg "train & tracks" +. +<p><img src="train.jpg" alt="foo bar" title="train &amp; tracks" /></p> +```````````````````````````````` + + +```````````````````````````````` example +![foo](train.jpg) +. +<p><img src="train.jpg" alt="foo" /></p> +```````````````````````````````` + + +```````````````````````````````` example +My ![foo bar](/path/to/train.jpg "title" ) +. +<p>My <img src="/path/to/train.jpg" alt="foo bar" title="title" /></p> +```````````````````````````````` + + +```````````````````````````````` example +![foo](<url>) +. +<p><img src="url" alt="foo" /></p> +```````````````````````````````` + + +```````````````````````````````` example +![](/url) +. +<p><img src="/url" alt="" /></p> +```````````````````````````````` + + +Reference-style: + +```````````````````````````````` example +![foo][bar] + +[bar]: /url +. +<p><img src="/url" alt="foo" /></p> +```````````````````````````````` + + +```````````````````````````````` example +![foo][bar] + +[BAR]: /url +. +<p><img src="/url" alt="foo" /></p> +```````````````````````````````` + + +Collapsed: + +```````````````````````````````` example +![foo][] + +[foo]: /url "title" +. +<p><img src="/url" alt="foo" title="title" /></p> +```````````````````````````````` + + +```````````````````````````````` example +![*foo* bar][] + +[*foo* bar]: /url "title" +. +<p><img src="/url" alt="foo bar" title="title" /></p> +```````````````````````````````` + + +The labels are case-insensitive: + +```````````````````````````````` example +![Foo][] + +[foo]: /url "title" +. 
+<p><img src="/url" alt="Foo" title="title" /></p> +```````````````````````````````` + + +As with reference links, spaces, tabs, and line endings are not allowed +between the two sets of brackets: + +```````````````````````````````` example +![foo] +[] + +[foo]: /url "title" +. +<p><img src="/url" alt="foo" title="title" /> +[]</p> +```````````````````````````````` + + +Shortcut: + +```````````````````````````````` example +![foo] + +[foo]: /url "title" +. +<p><img src="/url" alt="foo" title="title" /></p> +```````````````````````````````` + + +```````````````````````````````` example +![*foo* bar] + +[*foo* bar]: /url "title" +. +<p><img src="/url" alt="foo bar" title="title" /></p> +```````````````````````````````` + + +Note that link labels cannot contain unescaped brackets: + +```````````````````````````````` example +![[foo]] + +[[foo]]: /url "title" +. +<p>![[foo]]</p> +<p>[[foo]]: /url &quot;title&quot;</p> +```````````````````````````````` + + +The link labels are case-insensitive: + +```````````````````````````````` example +![Foo] + +[foo]: /url "title" +. +<p><img src="/url" alt="Foo" title="title" /></p> +```````````````````````````````` + + +If you just want a literal `!` followed by bracketed text, you can +backslash-escape the opening `[`: + +```````````````````````````````` example +!\[foo] + +[foo]: /url "title" +. +<p>![foo]</p> +```````````````````````````````` + + +If you want a link after a literal `!`, backslash-escape the +`!`: + +```````````````````````````````` example +\![foo] + +[foo]: /url "title" +. +<p>!<a href="/url" title="title">foo</a></p> +```````````````````````````````` + + +## Autolinks + +[Autolink](@)s are absolute URIs and email addresses inside +`<` and `>`. They are parsed as links, with the URL or email address +as the link label. + +A [URI autolink](@) consists of `<`, followed by an +[absolute URI] followed by `>`. It is parsed as +a link to the URI, with the URI as the link's label. 
+ +An [absolute URI](@), +for these purposes, consists of a [scheme] followed by a colon (`:`) +followed by zero or more characters other than [ASCII control +characters][ASCII control character], [space], `<`, and `>`. +If the URI includes these characters, they must be percent-encoded +(e.g. `%20` for a space). + +For purposes of this spec, a [scheme](@) is any sequence +of 2--32 characters beginning with an ASCII letter and followed +by any combination of ASCII letters, digits, or the symbols plus +("+"), period ("."), or hyphen ("-"). + +Here are some valid autolinks: + +```````````````````````````````` example +<http://foo.bar.baz> +. +<p><a href="http://foo.bar.baz">http://foo.bar.baz</a></p> +```````````````````````````````` + + +```````````````````````````````` example +<https://foo.bar.baz/test?q=hello&id=22&boolean> +. +<p><a href="https://foo.bar.baz/test?q=hello&amp;id=22&amp;boolean">https://foo.bar.baz/test?q=hello&amp;id=22&amp;boolean</a></p> +```````````````````````````````` + + +```````````````````````````````` example +<irc://foo.bar:2233/baz> +. +<p><a href="irc://foo.bar:2233/baz">irc://foo.bar:2233/baz</a></p> +```````````````````````````````` + + +Uppercase is also fine: + +```````````````````````````````` example +<MAILTO:FOO@BAR.BAZ> +. +<p><a href="MAILTO:FOO@BAR.BAZ">MAILTO:FOO@BAR.BAZ</a></p> +```````````````````````````````` + + +Note that many strings that count as [absolute URIs] for +purposes of this spec are not valid URIs, because their +schemes are not registered or because of other problems +with their syntax: + +```````````````````````````````` example +<a+b+c:d> +. +<p><a href="a+b+c:d">a+b+c:d</a></p> +```````````````````````````````` + + +```````````````````````````````` example +<made-up-scheme://foo,bar> +. +<p><a href="made-up-scheme://foo,bar">made-up-scheme://foo,bar</a></p> +```````````````````````````````` + + +```````````````````````````````` example +<https://../> +. 
+<p><a href="https://../">https://../</a></p> +```````````````````````````````` + + +```````````````````````````````` example +<localhost:5001/foo> +. +<p><a href="localhost:5001/foo">localhost:5001/foo</a></p> +```````````````````````````````` + + +Spaces are not allowed in autolinks: + +```````````````````````````````` example +<https://foo.bar/baz bim> +. +<p>&lt;https://foo.bar/baz bim&gt;</p> +```````````````````````````````` + + +Backslash-escapes do not work inside autolinks: + +```````````````````````````````` example +<https://example.com/\[\> +. +<p><a href="https://example.com/%5C%5B%5C">https://example.com/\[\</a></p> +```````````````````````````````` + + +An [email autolink](@) +consists of `<`, followed by an [email address], +followed by `>`. The link's label is the email address, +and the URL is `mailto:` followed by the email address. + +An [email address](@), +for these purposes, is anything that matches +the [non-normative regex from the HTML5 +spec](https://html.spec.whatwg.org/multipage/forms.html#e-mail-state-(type=email)): + + /^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])? + (?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/ + +Examples of email autolinks: + +```````````````````````````````` example +<foo@bar.example.com> +. +<p><a href="mailto:foo@bar.example.com">foo@bar.example.com</a></p> +```````````````````````````````` + + +```````````````````````````````` example +<foo+special@Bar.baz-bar0.com> +. +<p><a href="mailto:foo+special@Bar.baz-bar0.com">foo+special@Bar.baz-bar0.com</a></p> +```````````````````````````````` + + +Backslash-escapes do not work inside email autolinks: + +```````````````````````````````` example +<foo\+@bar.example.com> +. +<p>&lt;foo+@bar.example.com&gt;</p> +```````````````````````````````` + + +These are not autolinks: + +```````````````````````````````` example +<> +. 
+<p>&lt;&gt;</p> +```````````````````````````````` + + +```````````````````````````````` example +< https://foo.bar > +. +<p>&lt; https://foo.bar &gt;</p> +```````````````````````````````` + + +```````````````````````````````` example +<m:abc> +. +<p>&lt;m:abc&gt;</p> +```````````````````````````````` + + +```````````````````````````````` example +<foo.bar.baz> +. +<p>&lt;foo.bar.baz&gt;</p> +```````````````````````````````` + + +```````````````````````````````` example +https://example.com +. +<p>https://example.com</p> +```````````````````````````````` + + +```````````````````````````````` example +foo@bar.example.com +. +<p>foo@bar.example.com</p> +```````````````````````````````` + + +## Raw HTML + +Text between `<` and `>` that looks like an HTML tag is parsed as a +raw HTML tag and will be rendered in HTML without escaping. +Tag and attribute names are not limited to current HTML tags, +so custom tags (and even, say, DocBook tags) may be used. + +Here is the grammar for tags: + +A [tag name](@) consists of an ASCII letter +followed by zero or more ASCII letters, digits, or +hyphens (`-`). + +An [attribute](@) consists of spaces, tabs, and up to one line ending, +an [attribute name], and an optional +[attribute value specification]. + +An [attribute name](@) +consists of an ASCII letter, `_`, or `:`, followed by zero or more ASCII +letters, digits, `_`, `.`, `:`, or `-`. (Note: This is the XML +specification restricted to ASCII. HTML5 is laxer.) + +An [attribute value specification](@) +consists of optional spaces, tabs, and up to one line ending, +a `=` character, optional spaces, tabs, and up to one line ending, +and an [attribute value]. + +An [attribute value](@) +consists of an [unquoted attribute value], +a [single-quoted attribute value], or a [double-quoted attribute value]. + +An [unquoted attribute value](@) +is a nonempty string of characters not +including spaces, tabs, line endings, `"`, `'`, `=`, `<`, `>`, or `` ` ``. 
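
The three lexical definitions given so far (tag name, attribute name, and unquoted attribute value) translate fairly directly into regular expressions. The sketch below is one possible transcription, not part of the spec; the constant names are made up for illustration.

```typescript
// ASCII letter, then zero or more ASCII letters, digits, or hyphens.
const TAG_NAME = /^[A-Za-z][A-Za-z0-9-]*$/;

// ASCII letter, `_`, or `:`, then zero or more ASCII letters, digits,
// `_`, `.`, `:`, or `-`.
const ATTRIBUTE_NAME = /^[A-Za-z_:][A-Za-z0-9_.:-]*$/;

// Nonempty; no space, tab, line ending, `"`, `'`, `=`, `<`, `>`, or backtick.
const UNQUOTED_ATTRIBUTE_VALUE = /^[^ \t\r\n"'=<>`]+$/;

console.log(TAG_NAME.test("responsive-image"));    // true
console.log(TAG_NAME.test("33"));                  // false
console.log(ATTRIBUTE_NAME.test("zoop:33"));       // true
console.log(ATTRIBUTE_NAME.test("h*#ref"));        // false
console.log(UNQUOTED_ATTRIBUTE_VALUE.test("hi'")); // false
```

A real parser would normally match these pieces while scanning the surrounding tag rather than testing isolated strings; the anchored patterns here only make the character classes easy to check.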
+ +A [single-quoted attribute value](@) +consists of `'`, zero or more +characters not including `'`, and a final `'`. + +A [double-quoted attribute value](@) +consists of `"`, zero or more +characters not including `"`, and a final `"`. + +An [open tag](@) consists of a `<` character, a [tag name], +zero or more [attributes], optional spaces, tabs, and up to one line ending, +an optional `/` character, and a `>` character. + +A [closing tag](@) consists of the string `</`, a +[tag name], optional spaces, tabs, and up to one line ending, and the character +`>`. + +An [HTML comment](@) consists of `<!-->`, `<!--->`, or `<!--`, a string of +characters not including the string `-->`, and `-->` (see the +[HTML spec](https://html.spec.whatwg.org/multipage/parsing.html#markup-declaration-open-state)). + +A [processing instruction](@) +consists of the string `<?`, a string +of characters not including the string `?>`, and the string +`?>`. + +A [declaration](@) consists of the string `<!`, an ASCII letter, zero or more +characters not including the character `>`, and the character `>`. + +A [CDATA section](@) consists of +the string `<![CDATA[`, a string of characters not including the string +`]]>`, and the string `]]>`. + +An [HTML tag](@) consists of an [open tag], a [closing tag], +an [HTML comment], a [processing instruction], a [declaration], +or a [CDATA section]. + +Here are some simple open tags: + +```````````````````````````````` example +<a><bab><c2c> +. +<p><a><bab><c2c></p> +```````````````````````````````` + + +Empty elements: + +```````````````````````````````` example +<a/><b2/> +. +<p><a/><b2/></p> +```````````````````````````````` + + +Whitespace is allowed: + +```````````````````````````````` example +<a /><b2 +data="foo" > +. +<p><a /><b2 +data="foo" ></p> +```````````````````````````````` + + +With attributes: + +```````````````````````````````` example +<a foo="bar" bam = 'baz <em>"</em>' +_boolean zoop:33=zoop:33 /> +. 
+<p><a foo="bar" bam = 'baz <em>"</em>' +_boolean zoop:33=zoop:33 /></p> +```````````````````````````````` + + +Custom tag names can be used: + +```````````````````````````````` example +Foo <responsive-image src="foo.jpg" /> +. +<p>Foo <responsive-image src="foo.jpg" /></p> +```````````````````````````````` + + +Illegal tag names, not parsed as HTML: + +```````````````````````````````` example +<33> <__> +. +<p>&lt;33&gt; &lt;__&gt;</p> +```````````````````````````````` + + +Illegal attribute names: + +```````````````````````````````` example +<a h*#ref="hi"> +. +<p>&lt;a h*#ref=&quot;hi&quot;&gt;</p> +```````````````````````````````` + + +Illegal attribute values: + +```````````````````````````````` example +<a href="hi'> <a href=hi'> +. +<p>&lt;a href=&quot;hi'&gt; &lt;a href=hi'&gt;</p> +```````````````````````````````` + + +Illegal whitespace: + +```````````````````````````````` example +< a>< +foo><bar/ > +<foo bar=baz +bim!bop /> +. +<p>&lt; a&gt;&lt; +foo&gt;&lt;bar/ &gt; +&lt;foo bar=baz +bim!bop /&gt;</p> +```````````````````````````````` + + +Missing whitespace: + +```````````````````````````````` example +<a href='bar'title=title> +. +<p>&lt;a href='bar'title=title&gt;</p> +```````````````````````````````` + + +Closing tags: + +```````````````````````````````` example +</a></foo > +. +<p></a></foo ></p> +```````````````````````````````` + + +Illegal attributes in closing tag: + +```````````````````````````````` example +</a href="foo"> +. +<p>&lt;/a href=&quot;foo&quot;&gt;</p> +```````````````````````````````` + + +Comments: + +```````````````````````````````` example +foo <!-- this is a -- +comment - with hyphens --> +. +<p>foo <!-- this is a -- +comment - with hyphens --></p> +```````````````````````````````` + +```````````````````````````````` example +foo <!--> foo --> + +foo <!---> foo --> +. 
+<p>foo <!--> foo --&gt;</p> +<p>foo <!---> foo --&gt;</p> +```````````````````````````````` + + +Processing instructions: + +```````````````````````````````` example +foo <?php echo $a; ?> +. +<p>foo <?php echo $a; ?></p> +```````````````````````````````` + + +Declarations: + +```````````````````````````````` example +foo <!ELEMENT br EMPTY> +. +<p>foo <!ELEMENT br EMPTY></p> +```````````````````````````````` + + +CDATA sections: + +```````````````````````````````` example +foo <![CDATA[>&<]]> +. +<p>foo <![CDATA[>&<]]></p> +```````````````````````````````` + + +Entity and numeric character references are preserved in HTML +attributes: + +```````````````````````````````` example +foo <a href="&ouml;"> +. +<p>foo <a href="&ouml;"></p> +```````````````````````````````` + + +Backslash escapes do not work in HTML attributes: + +```````````````````````````````` example +foo <a href="\*"> +. +<p>foo <a href="\*"></p> +```````````````````````````````` + + +```````````````````````````````` example +<a href="\""> +. +<p>&lt;a href=&quot;&quot;&quot;&gt;</p> +```````````````````````````````` + + +## Hard line breaks + +A line ending (not in a code span or HTML tag) that is preceded +by two or more spaces and does not occur at the end of a block +is parsed as a [hard line break](@) (rendered +in HTML as a `<br />` tag): + +```````````````````````````````` example +foo +baz +. +<p>foo<br /> +baz</p> +```````````````````````````````` + + +For a more visible alternative, a backslash before the +[line ending] may be used instead of two or more spaces: + +```````````````````````````````` example +foo\ +baz +. +<p>foo<br /> +baz</p> +```````````````````````````````` + + +More than two spaces can be used: + +```````````````````````````````` example +foo +baz +. +<p>foo<br /> +baz</p> +```````````````````````````````` + + +Leading spaces at the beginning of the next line are ignored: + +```````````````````````````````` example +foo + bar +. 
+<p>foo<br /> +bar</p> +```````````````````````````````` + + +```````````````````````````````` example +foo\ + bar +. +<p>foo<br /> +bar</p> +```````````````````````````````` + + +Hard line breaks can occur inside emphasis, links, and other constructs +that allow inline content: + +```````````````````````````````` example +*foo +bar* +. +<p><em>foo<br /> +bar</em></p> +```````````````````````````````` + + +```````````````````````````````` example +*foo\ +bar* +. +<p><em>foo<br /> +bar</em></p> +```````````````````````````````` + + +Hard line breaks do not occur inside code spans + +```````````````````````````````` example +`code +span` +. +<p><code>code span</code></p> +```````````````````````````````` + + +```````````````````````````````` example +`code\ +span` +. +<p><code>code\ span</code></p> +```````````````````````````````` + + +or HTML tags: + +```````````````````````````````` example +<a href="foo +bar"> +. +<p><a href="foo +bar"></p> +```````````````````````````````` + + +```````````````````````````````` example +<a href="foo\ +bar"> +. +<p><a href="foo\ +bar"></p> +```````````````````````````````` + + +Hard line breaks are for separating inline content within a block. +Neither syntax for hard line breaks works at the end of a paragraph or +other block element: + +```````````````````````````````` example +foo\ +. +<p>foo\</p> +```````````````````````````````` + + +```````````````````````````````` example +foo +. +<p>foo</p> +```````````````````````````````` + + +```````````````````````````````` example +### foo\ +. +<h3>foo\</h3> +```````````````````````````````` + + +```````````````````````````````` example +### foo +. +<h3>foo</h3> +```````````````````````````````` + + +## Soft line breaks + +A regular line ending (not in a code span or HTML tag) that is not +preceded by two or more spaces or a backslash is parsed as a +[softbreak](@). (A soft line break may be rendered in HTML either as a +[line ending] or as a space. 
The result will be the same in +browsers. In the examples here, a [line ending] will be used.) + +```````````````````````````````` example +foo +baz +. +<p>foo +baz</p> +```````````````````````````````` + + +Spaces at the end of the line and beginning of the next line are +removed: + +```````````````````````````````` example +foo + baz +. +<p>foo +baz</p> +```````````````````````````````` + + +A conforming parser may render a soft line break in HTML either as a +line ending or as a space. + +A renderer may also provide an option to render soft line breaks +as hard line breaks. + +## Textual content + +Any characters not given an interpretation by the above rules will +be parsed as plain textual content. + +```````````````````````````````` example +hello $.;'there +. +<p>hello $.;'there</p> +```````````````````````````````` + + +```````````````````````````````` example +Foo χρῆν +. +<p>Foo χρῆν</p> +```````````````````````````````` + + +Internal spaces are preserved verbatim: + +```````````````````````````````` example +Multiple spaces +. +<p>Multiple spaces</p> +```````````````````````````````` + + +<!-- END TESTS --> + +# Appendix: A parsing strategy + +In this appendix we describe some features of the parsing strategy +used in the CommonMark reference implementations. + +## Overview + +Parsing has two phases: + +1. In the first phase, lines of input are consumed and the block +structure of the document---its division into paragraphs, block quotes, +list items, and so on---is constructed. Text is assigned to these +blocks but not parsed. Link reference definitions are parsed and a +map of links is constructed. + +2. In the second phase, the raw text contents of paragraphs and headings +are parsed into sequences of Markdown inline elements (strings, +code spans, links, emphasis, and so on), using the map of link +references constructed in phase 1. + +At each point in processing, the document is represented as a tree of +**blocks**. 
The root of the tree is a `document` block. The `document` +may have any number of other blocks as **children**. These children +may, in turn, have other blocks as children. The last child of a block +is normally considered **open**, meaning that subsequent lines of input +can alter its contents. (Blocks that are not open are **closed**.) +Here, for example, is a possible document tree, with the open blocks +marked by arrows: + +``` tree +-> document + -> block_quote + paragraph + "Lorem ipsum dolor\nsit amet." + -> list (type=bullet tight=true bullet_char=-) + list_item + paragraph + "Qui *quodsi iracundia*" + -> list_item + -> paragraph + "aliquando id" +``` + +## Phase 1: block structure + +Each line that is processed has an effect on this tree. The line is +analyzed and, depending on its contents, the document may be altered +in one or more of the following ways: + +1. One or more open blocks may be closed. +2. One or more new blocks may be created as children of the + last open block. +3. Text may be added to the last (deepest) open block remaining + on the tree. + +Once a line has been incorporated into the tree in this way, +it can be discarded, so input can be read in a stream. + +For each line, we follow this procedure: + +1. First we iterate through the open blocks, starting with the +root document, and descending through last children down to the last +open block. Each block imposes a condition that the line must satisfy +if the block is to remain open. For example, a block quote requires a +`>` character. A paragraph requires a non-blank line. +In this phase we may match all or just some of the open +blocks. But we cannot close unmatched blocks yet, because we may have a +[lazy continuation line]. + +2. Next, after consuming the continuation markers for existing +blocks, we look for new block starts (e.g. `>` for a block quote). 
+If we encounter a new block start, we close any blocks unmatched +in step 1 before creating the new block as a child of the last +matched container block. + +3. Finally, we look at the remainder of the line (after block +markers like `>`, list markers, and indentation have been consumed). +This is text that can be incorporated into the last open +block (a paragraph, code block, heading, or raw HTML). + +Setext headings are formed when we see a line of a paragraph +that is a [setext heading underline]. + +Reference link definitions are detected when a paragraph is closed; +the accumulated text lines are parsed to see if they begin with +one or more reference link definitions. Any remainder becomes a +normal paragraph. + +We can see how this works by considering how the tree above is +generated by four lines of Markdown: + +``` markdown +> Lorem ipsum dolor +sit amet. +> - Qui *quodsi iracundia* +> - aliquando id +``` + +At the outset, our document model is just + +``` tree +-> document +``` + +The first line of our text, + +``` markdown +> Lorem ipsum dolor +``` + +causes a `block_quote` block to be created as a child of our +open `document` block, and a `paragraph` block as a child of +the `block_quote`. Then the text is added to the last open +block, the `paragraph`: + +``` tree +-> document + -> block_quote + -> paragraph + "Lorem ipsum dolor" +``` + +The next line, + +``` markdown +sit amet. +``` + +is a "lazy continuation" of the open `paragraph`, so it gets added +to the paragraph's text: + +``` tree +-> document + -> block_quote + -> paragraph + "Lorem ipsum dolor\nsit amet." +``` + +The third line, + +``` markdown +> - Qui *quodsi iracundia* +``` + +causes the `paragraph` block to be closed, and a new `list` block +opened as a child of the `block_quote`. A `list_item` is also +added as a child of the `list`, and a `paragraph` as a child of +the `list_item`. 
The text is then added to the new `paragraph`: + +``` tree +-> document + -> block_quote + paragraph + "Lorem ipsum dolor\nsit amet." + -> list (type=bullet tight=true bullet_char=-) + -> list_item + -> paragraph + "Qui *quodsi iracundia*" +``` + +The fourth line, + +``` markdown +> - aliquando id +``` + +causes the `list_item` (and its child the `paragraph`) to be closed, +and a new `list_item` opened up as child of the `list`. A `paragraph` +is added as a child of the new `list_item`, to contain the text. +We thus obtain the final tree: + +``` tree +-> document + -> block_quote + paragraph + "Lorem ipsum dolor\nsit amet." + -> list (type=bullet tight=true bullet_char=-) + list_item + paragraph + "Qui *quodsi iracundia*" + -> list_item + -> paragraph + "aliquando id" +``` + +## Phase 2: inline structure + +Once all of the input has been parsed, all open blocks are closed. + +We then "walk the tree," visiting every node, and parse raw +string contents of paragraphs and headings as inlines. At this +point we have seen all the link reference definitions, so we can +resolve reference links as we go. + +``` tree +document + block_quote + paragraph + str "Lorem ipsum dolor" + softbreak + str "sit amet." + list (type=bullet tight=true bullet_char=-) + list_item + paragraph + str "Qui " + emph + str "quodsi iracundia" + list_item + paragraph + str "aliquando id" +``` + +Notice how the [line ending] in the first paragraph has +been parsed as a `softbreak`, and the asterisks in the first list item +have become an `emph`. + +### An algorithm for parsing nested emphasis and links + +By far the trickiest part of inline parsing is handling emphasis, +strong emphasis, links, and images. This is done using the following +algorithm. + +When we're parsing inlines and we hit either + +- a run of `*` or `_` characters, or +- a `[` or `![` + +we insert a text node with these symbols as its literal content, and we +add a pointer to this text node to the [delimiter stack](@). 
+ +The [delimiter stack] is a doubly linked list. Each +element contains a pointer to a text node, plus information about + +- the type of delimiter (`[`, `![`, `*`, `_`) +- the number of delimiters, +- whether the delimiter is "active" (all are active to start), and +- whether the delimiter is a potential opener, a potential closer, + or both (which depends on what sort of characters precede + and follow the delimiters). + +When we hit a `]` character, we call the *look for link or image* +procedure (see below). + +When we hit the end of the input, we call the *process emphasis* +procedure (see below), with `stack_bottom` = NULL. + +#### *look for link or image* + +Starting at the top of the delimiter stack, we look backwards +through the stack for an opening `[` or `![` delimiter. + +- If we don't find one, we return a literal text node `]`. + +- If we do find one, but it's not *active*, we remove the inactive + delimiter from the stack, and return a literal text node `]`. + +- If we find one and it's active, then we parse ahead to see if + we have an inline link/image, reference link/image, collapsed reference + link/image, or shortcut reference link/image. + + + If we don't, then we remove the opening delimiter from the + delimiter stack and return a literal text node `]`. + + + If we do, then + + * We return a link or image node whose children are the inlines + after the text node pointed to by the opening delimiter. + + * We run *process emphasis* on these inlines, with the `[` opener + as `stack_bottom`. + + * We remove the opening delimiter. + + * If we have a link (and not an image), we also set all + `[` delimiters before the opening delimiter to *inactive*. (This + will prevent us from getting links within links.) + +#### *process emphasis* + +Parameter `stack_bottom` sets a lower bound to how far we +descend in the [delimiter stack]. If it is NULL, we can +go all the way to the bottom. Otherwise, we stop before +visiting `stack_bottom`. 
+ +Let `current_position` point to the element on the [delimiter stack] +just above `stack_bottom` (or the first element if `stack_bottom` +is NULL). + +We keep track of the `openers_bottom` for each delimiter +type (`*`, `_`), indexed to the length of the closing delimiter run +(modulo 3) and to whether the closing delimiter can also be an +opener. Initialize this to `stack_bottom`. + +Then we repeat the following until we run out of potential +closers: + +- Move `current_position` forward in the delimiter stack (if needed) + until we find the first potential closer with delimiter `*` or `_`. + (This will be the potential closer closest + to the beginning of the input -- the first one in parse order.) + +- Now, look back in the stack (staying above `stack_bottom` and + the `openers_bottom` for this delimiter type) for the + first matching potential opener ("matching" means same delimiter). + +- If one is found: + + + Figure out whether we have emphasis or strong emphasis: + if both closer and opener spans have length >= 2, we have + strong, otherwise regular. + + + Insert an emph or strong emph node accordingly, after + the text node corresponding to the opener. + + + Remove any delimiters between the opener and closer from + the delimiter stack. + + + Remove 1 (for regular emph) or 2 (for strong emph) delimiters + from the opening and closing text nodes. If they become empty + as a result, remove them and remove the corresponding element + of the delimiter stack. If the closing node is removed, reset + `current_position` to the next element in the stack. + +- If none is found: + + + Set `openers_bottom` to the element before `current_position`. + (We know that there are no openers for this kind of closer up to and + including this point, so this puts a lower bound on future searches.) + + + If the closer at `current_position` is not a potential opener, + remove it from the delimiter stack (since we know it can't + be a closer either). 
+ + + Advance `current_position` to the next element in the stack. + +After we're done, we remove all delimiters above `stack_bottom` from the +delimiter stack.
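The *process emphasis* steps above can be sketched roughly as follows. This is a minimal illustration, not the code in md4c or in the Zig port: every name here (`Delim`, `scan`, `processEmphasis`, `emphasisToHtml`) is hypothetical, the delimiter stack is a plain array rather than a doubly linked list, and `stack_bottom` is always treated as NULL. It also assumes a simplified opener/closer test (a run can open if followed by non-whitespace and can close if preceded by non-whitespace), and it leaves out the `openers_bottom` lower bound, which only limits how far back the search goes.

```typescript
// Hypothetical types for this sketch; not taken from md4c or Bun's sources.
type TextNode = { text: string };
type Node = TextNode | { tag: "em" | "strong"; children: Node[] };
interface Delim {
  node: TextNode;    // text node holding the literal delimiter run
  ch: string;        // "*" or "_"
  count: number;     // delimiters left in the run
  canOpen: boolean;  // potential opener
  canClose: boolean; // potential closer
}

// Split the input into text nodes and push one stack entry per `*`/`_` run.
// Assumption: simplified flanking test based only on adjacent whitespace.
function scan(src: string): { nodes: Node[]; stack: Delim[] } {
  const nodes: Node[] = [];
  const stack: Delim[] = [];
  const re = /\*+|_+|[^*_]+/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(src)) !== null) {
    const run = m[0];
    const node: TextNode = { text: run };
    nodes.push(node);
    if (run[0] === "*" || run[0] === "_") {
      const before = src.charAt(m.index - 1) || " ";
      const after = src.charAt(m.index + run.length) || " ";
      stack.push({ node, ch: run[0], count: run.length,
        canOpen: !/\s/.test(after), canClose: !/\s/.test(before) });
    }
  }
  return { nodes, stack };
}

function processEmphasis(nodes: Node[], stack: Delim[]): void {
  let cur = 0; // current_position, starting at the bottom of the stack
  while (cur < stack.length) {
    const closer = stack[cur];
    if (!closer.canClose) { cur++; continue; } // move forward to a closer
    // Look back for the first potential opener with the same delimiter.
    let oi = cur - 1;
    while (oi >= 0 && !(stack[oi].ch === closer.ch && stack[oi].canOpen)) oi--;
    if (oi < 0) {
      // No opener: drop the closer unless it may still open something later.
      if (closer.canOpen) cur++; else stack.splice(cur, 1);
      continue;
    }
    const opener = stack[oi];
    const strong = opener.count >= 2 && closer.count >= 2;
    const used = strong ? 2 : 1;
    // Wrap the nodes between opener and closer in an emph/strong node.
    const from = nodes.indexOf(opener.node) + 1;
    const to = nodes.indexOf(closer.node);
    const children = nodes.splice(from, to - from);
    nodes.splice(from, 0, { tag: strong ? "strong" : "em", children });
    // Remove the delimiters between opener and closer from the stack.
    stack.splice(oi + 1, cur - oi - 1);
    cur = oi + 1; // the closer now sits directly above the opener
    // Consume 1 or 2 delimiter characters from both text nodes.
    opener.count -= used; opener.node.text = opener.node.text.slice(used);
    closer.count -= used; closer.node.text = closer.node.text.slice(used);
    if (closer.count === 0) {
      nodes.splice(nodes.indexOf(closer.node), 1);
      stack.splice(cur, 1); // cur now points at the next element
    }
    if (opener.count === 0) {
      nodes.splice(nodes.indexOf(opener.node), 1);
      stack.splice(oi, 1);
      cur--; // everything above the opener shifted down by one
    }
  }
  stack.length = 0; // finally, remove all remaining delimiters
}

function render(nodes: Node[]): string {
  return nodes
    .map((n) => ("tag" in n ? `<${n.tag}>${render(n.children)}</${n.tag}>` : n.text))
    .join("");
}

function emphasisToHtml(src: string): string {
  const { nodes, stack } = scan(src);
  processEmphasis(nodes, stack);
  return render(nodes);
}
```

With these assumptions, `emphasisToHtml("Qui *quodsi iracundia*")` yields `Qui <em>quodsi iracundia</em>`, matching the `emph` node in the tree shown earlier. Unmatched delimiters such as the `*` in `a * b` are left as literal text. A real implementation would also handle `[` and `![` delimiters, a non-NULL `stack_bottom`, and the full opener/closer rules.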