Latex
# Latex Module
Complete LaTeX parser for `blueprint.tex` and `paper.tex` files. Produces a typed AST that is consumed by both the HTML renderer (`Html/Render.lean`, `Latex/ToHtml.lean`) and the LaTeX-back-to-string converter (`ToLatex.lean`).
## Pipeline Position
```
.tex source --> Token.lean (lexer tokens) --> Lexer.lean (tokenization)
|
v
Parser.lean (AST construction)
|
+--------------+--------------+
| |
v v
ToHtml.lean ToLatex.lean
(prose HTML for chapters) (PDF generation with
node expansion)
```
## Files
### `Token.lean`
Defines the token types produced by the lexer.
**Types:**
- `Position`: Line, column, byte offset in source
- `Span`: Start and stop positions
- `TokenKind`: The token variants:
- `command name` -- LaTeX commands like `\chapter`, `\begin`
- `braceOpen` / `braceClose` -- `{` and `}`
- `bracketOpen` / `bracketClose` -- `[` and `]`
- `mathInline content` -- `$...$`
- `mathDisplay content` -- `$$...$$` or `\[...\]`
- `text content` -- Plain text
- `comment content` -- `%...`
- `newline` / `paragraph` -- Single newline vs blank line
- `whitespace content` -- Spaces and tabs
- `eof` -- End of input
- `Token`: A `TokenKind` with its `Span` and raw source text
### `Lexer.lean`
Tokenizes LaTeX source into a stream of `Token` values.
**Key design:**
- Uses byte-position tracking for correct UTF-8 handling
- `LexerState`: Holds source string, byte position, line/column counters
- `LexerM`: `StateM LexerState`
**Entry point:**
- `tokenize (source : String) : Array Token` -- Tokenizes a complete source string
**Notable behavior:**
- Math delimiters (`$`, `$$`, `\[`, `\]`) are consumed by the lexer, producing `mathInline` or `mathDisplay` tokens with the content between delimiters
- Backslash-escaped characters inside math are preserved
- Consecutive newlines are collapsed into a single `paragraph` token
- Comments (`%...`) consume everything to end of line
### `Ast.lean`
Defines the abstract syntax tree for parsed LaTeX documents.
**Inline types** (`Inline`):
- `text`, `emph`, `bold`, `code`, `math`, `ref`, `cite`, `href`, `lean`, `image`, `raw`, `space`, `seq`
**Block types** (`Block`):
- Document structure: `document`, `chapter`, `section`
- Theorem environments: `theorem` (with env name, title, metadata, statement), `proof`
- Content: `paragraph`, `displayMath`, `itemize`, `enumerate`, `figure`, `raw`, `comment`
- Blueprint-specific: `inputLeanModule`, `inputLeanNode`
- Paper-specific: `paperStatement`, `paperFull`, `paperProof`
**Supporting types:**
- `Preamble`: Document class, packages, title/author/date/abstract, raw preamble content. Has `authors` method that splits on `\and`.
- `TheoremMetadata`: Blueprint-enriched metadata -- `leanDecls`, `uses`, `label`, `signatureHtml`, `proofHtml`, `hoverData`, `above`, `below`, plus status flags (`leanOk`, `notReady`, `mathLibOk`)
- `Document`: Root block with optional source path
**Utility functions:**
- `Inline.toPlainText`: Strips formatting to get raw text
- `Block.getLabel`: Extracts label from any block type
- `isTheoremEnv`: Checks if an environment name is one of the recognized theorem-like environments (theorem, lemma, proposition, corollary, definition, example, remark, note, axiom, conjecture)
### `Parser.lean`
Recursive descent parser that builds `Block`/`Inline` AST from token arrays.
**State:** `ParserState` holds token array, current position, and error accumulator.
**Key functions:**
- `parseDocument`: Top-level entry point. Parses preamble, then `\begin{document}` body, including abstract extraction.
- `parsePreamble`: Extracts `\documentclass`, `\usepackage`, `\title`, `\author`, `\date`, and captures unknown commands as raw content.
- `parseBody`: Parses the document body, handling `\chapter`, `\section`, `\subsection`, `\begin{...}`, blueprint commands, and prose paragraphs.
- `parseSectionBody`: Like `parseBody` but breaks when encountering a same-or-higher-level heading.
- `parseEnv`: Dispatches environment parsing based on name. Theorem-like environments get `TheoremMetadata`; lists get item parsing; figures get caption extraction.
- `parseTheoremMeta`: Collects all blueprint metadata commands (`\label`, `\lean`, `\uses`, `\leanok`, `\leansignaturesourcehtml`, `\leanproofsourcehtml`, `\leanhoverdata`, `\leanabove`, `\leanbelow`, etc.) that precede the statement.
- `parseInlines`: Parses inline content until a stop condition is met.
**Infinite loop protection:** All catch-all cases include `advance` to consume at least one token, preventing stuck parsers on unrecognized input.
**Entry points:**
- `parse (source : String) : Document * Array String` -- Parse from string, returns document and errors
- `parseFile (path : System.FilePath) : IO (Document * Array String)` -- Parse from file
### `ToHtml.lean`
Converts parsed AST blocks to HTML strings for prose rendering in chapter pages.
**State:** `HtmlState` accumulates HTML output and tracks module/node references found during conversion.
**Key functions:**
- `blockToHtml`: Converts each `Block` variant to HTML. Theorem environments produce `div.theorem-env`, proofs produce `div.proof`, math is left for MathJax.
- `inlineToHtml`: Converts inline content. Math is wrapped in `\(...\)` / `\[...\]` for MathJax.
- `extractChapters`: Walks document blocks and extracts `ChapterExtract` records (number, title, label, body).
- `extractSections`: Extracts sections from a chapter body.
- `extractModuleRefs` / `extractNodeRefs`: Traverses the block tree to find `\inputleanmodule` and `\inputleannode` references.
- `extractLabeledTheorems`: Finds theorem environments with `\label{}` for detecting LaTeX-only stated nodes.
**Placeholder generation:** `\inputleanmodule`, `\inputleannode`, `\paperstatement`, `\paperfull`, and `\paperproof` commands produce placeholder `<div>` elements with `data-module` or `data-node` attributes.
**Result type:** `HtmlResult` contains the generated HTML string plus arrays of module and node references discovered during conversion.
### `ToLatex.lean`
Converts the parsed AST back to LaTeX strings, resolving node references against Dress artifacts. Used for PDF generation where `\inputleannode{label}` hooks are expanded to actual theorem content from the Lean formalization.
**State:** `LatexState` accumulates LaTeX output, tracks node references, and collects warnings.
**Configuration:** `LatexConfig` controls whether the source preamble is preserved in output.
## Integration with Adjacent Modules
- **Dress/Load.lean** decodes base64-encoded fields in `TheoremMetadata` (signatureHtml, proofHtml, hoverData, above, below) after parsing
- **Html/Render.lean** provides an alternative rendering path using Verso's `Html` type instead of string accumulation
- **Theme.lean** uses `extractChapters` and `extractSections` to build the chapter navigation structure
- **Paper.lean** uses the parser for `paper.tex` and processes paper-specific hooks (`\paperstatement`, `\paperfull`, `\paperproof`)