Latex

# Latex Module Complete LaTeX parser for `blueprint.tex` and `paper.tex` files. Produces a typed AST that is consumed by both the HTML renderer (`Html/Render.lean`, `Latex/ToHtml.lean`) and the LaTeX-back-to-string converter (`ToLatex.lean`). ## Pipeline Position ``` .tex source --> Token.lean (lexer tokens) --> Lexer.lean (tokenization) | v Parser.lean (AST construction) | +--------------+--------------+ | | v v ToHtml.lean ToLatex.lean (prose HTML for chapters) (PDF generation with node expansion) ``` ## Files ### `Token.lean` Defines the token types produced by the lexer. **Types:** - `Position`: Line, column, byte offset in source - `Span`: Start and stop positions - `TokenKind`: The token variants: - `command name` -- LaTeX commands like `\chapter`, `\begin` - `braceOpen` / `braceClose` -- `{` and `}` - `bracketOpen` / `bracketClose` -- `[` and `]` - `mathInline content` -- `$...$` - `mathDisplay content` -- `$$...$$` or `\[...\]` - `text content` -- Plain text - `comment content` -- `%...` - `newline` / `paragraph` -- Single newline vs blank line - `whitespace content` -- Spaces and tabs - `eof` -- End of input - `Token`: A `TokenKind` with its `Span` and raw source text ### `Lexer.lean` Tokenizes LaTeX source into a stream of `Token` values. **Key design:** - Uses byte-position tracking for correct UTF-8 handling - `LexerState`: Holds source string, byte position, line/column counters - `LexerM`: `StateM LexerState` **Entry point:** - `tokenize (source : String) : Array Token` -- Tokenizes a complete source string **Notable behavior:** - Math delimiters (`$`, `$$`, `\[`, `\]`) are consumed by the lexer, producing `mathInline` or `mathDisplay` tokens with the content between delimiters - Backslash-escaped characters inside math are preserved - Consecutive newlines are collapsed into a single `paragraph` token - Comments (`%...`) consume everything to end of line ### `Ast.lean` Defines the abstract syntax tree for parsed LaTeX documents. **Inline types** (`Inline`): - `text`, `emph`, `bold`, `code`, `math`, `ref`, `cite`, `href`, `lean`, `image`, `raw`, `space`, `seq` **Block types** (`Block`): - Document structure: `document`, `chapter`, `section` - Theorem environments: `theorem` (with env name, title, metadata, statement), `proof` - Content: `paragraph`, `displayMath`, `itemize`, `enumerate`, `figure`, `raw`, `comment` - Blueprint-specific: `inputLeanModule`, `inputLeanNode` - Paper-specific: `paperStatement`, `paperFull`, `paperProof` **Supporting types:** - `Preamble`: Document class, packages, title/author/date/abstract, raw preamble content. Has `authors` method that splits on `\and`. - `TheoremMetadata`: Blueprint-enriched metadata -- `leanDecls`, `uses`, `label`, `signatureHtml`, `proofHtml`, `hoverData`, `above`, `below`, plus status flags (`leanOk`, `notReady`, `mathLibOk`) - `Document`: Root block with optional source path **Utility functions:** - `Inline.toPlainText`: Strips formatting to get raw text - `Block.getLabel`: Extracts label from any block type - `isTheoremEnv`: Checks if an environment name is one of the recognized theorem-like environments (theorem, lemma, proposition, corollary, definition, example, remark, note, axiom, conjecture) ### `Parser.lean` Recursive descent parser that builds `Block`/`Inline` AST from token arrays. **State:** `ParserState` holds token array, current position, and error accumulator. **Key functions:** - `parseDocument`: Top-level entry point. Parses preamble, then `\begin{document}` body, including abstract extraction. - `parsePreamble`: Extracts `\documentclass`, `\usepackage`, `\title`, `\author`, `\date`, and captures unknown commands as raw content. - `parseBody`: Parses the document body, handling `\chapter`, `\section`, `\subsection`, `\begin{...}`, blueprint commands, and prose paragraphs. - `parseSectionBody`: Like `parseBody` but breaks when encountering a same-or-higher-level heading. - `parseEnv`: Dispatches environment parsing based on name. Theorem-like environments get `TheoremMetadata`; lists get item parsing; figures get caption extraction. - `parseTheoremMeta`: Collects all blueprint metadata commands (`\label`, `\lean`, `\uses`, `\leanok`, `\leansignaturesourcehtml`, `\leanproofsourcehtml`, `\leanhoverdata`, `\leanabove`, `\leanbelow`, etc.) that precede the statement. - `parseInlines`: Parses inline content until a stop condition is met. **Infinite loop protection:** All catch-all cases include `advance` to consume at least one token, preventing stuck parsers on unrecognized input. **Entry points:** - `parse (source : String) : Document * Array String` -- Parse from string, returns document and errors - `parseFile (path : System.FilePath) : IO (Document * Array String)` -- Parse from file ### `ToHtml.lean` Converts parsed AST blocks to HTML strings for prose rendering in chapter pages. **State:** `HtmlState` accumulates HTML output and tracks module/node references found during conversion. **Key functions:** - `blockToHtml`: Converts each `Block` variant to HTML. Theorem environments produce `div.theorem-env`, proofs produce `div.proof`, math is left for MathJax. - `inlineToHtml`: Converts inline content. Math is wrapped in `$...$` / `\[...\]` for MathJax. - `extractChapters`: Walks document blocks and extracts `ChapterExtract` records (number, title, label, body). - `extractSections`: Extracts sections from a chapter body. - `extractModuleRefs` / `extractNodeRefs`: Traverses the block tree to find `\inputleanmodule` and `\inputleannode` references. - `extractLabeledTheorems`: Finds theorem environments with `\label{}` for detecting LaTeX-only stated nodes. **Placeholder generation:** `\inputleanmodule`, `\inputleannode`, `\paperstatement`, `\paperfull`, and `\paperproof` commands produce placeholder `<div>` elements with `data-module` or `data-node` attributes. **Result type:** `HtmlResult` contains the generated HTML string plus arrays of module and node references discovered during conversion. ### `ToLatex.lean` Converts the parsed AST back to LaTeX strings, resolving node references against Dress artifacts. Used for PDF generation where `\inputleannode{label}` hooks are expanded to actual theorem content from the Lean formalization. **State:** `LatexState` accumulates LaTeX output, tracks node references, and collects warnings. **Configuration:** `LatexConfig` controls whether the source preamble is preserved in output. ## Integration with Adjacent Modules - **Dress/Load.lean** decodes base64-encoded fields in `TheoremMetadata` (signatureHtml, proofHtml, hoverData, above, below) after parsing - **Html/Render.lean** provides an alternative rendering path using Verso's `Html` type instead of string accumulation - **Theme.lean** uses `extractChapters` and `extractSections` to build the chapter navigation structure - **Paper.lean** uses the parser for `paper.tex` and processes paper-specific hooks (`\paperstatement`, `\paperfull`, `\paperproof`)