Guide: HTML Processing Pipeline

The core function of the Deepwiki MCP Server is to transform cluttered web pages into clean, structured Markdown. This is achieved through a multi-stage pipeline that processes the raw HTML fetched from the web.

The Pipeline Stages

1. Crawling

File: src/lib/httpCrawler.ts

The process begins with the crawl function, a breadth-first web crawler responsible for fetching the HTML content. Key features of this stage include:

Concurrency: Uses p-queue to run a bounded pool of concurrent fetch requests, so pages download in parallel without overwhelming the target server.
Robots.txt: Fetches and parses the target domain's robots.txt before crawling, so that disallowed paths are never requested.
Depth Limiting: Stops following links beyond the maxDepth parameter, which keeps the crawl from expanding to an entire website.
Domain Lock: Only crawls URLs within the same hostname as the starting URL.
Retries: Implements an exponential backoff strategy to automatically retry failed network requests.
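
The depth limit, domain lock, and backoff behavior described above can be sketched in a few lines. This is a minimal illustration, not the code in httpCrawler.ts: planCrawlSketch, linksOf, and backoffDelayMs are hypothetical names, the traversal is sequential rather than queued through p-queue, robots.txt handling is omitted, and the sketch assumes that a depth of 0 means the starting page only. The base delay and cap are likewise made-up defaults.

```typescript
// Hypothetical helper: linksOf stands in for "fetch the page and extract its hrefs".
function planCrawlSketch(
  rootUrl: string,
  maxDepth: number,
  linksOf: (url: string) => string[],
): string[] {
  const start = new URL(rootUrl);
  start.hash = '';
  const seen = new Set<string>([start.toString()]);
  const queue: Array<{ url: string; depth: number }> = [
    { url: start.toString(), depth: 0 },
  ];
  const visited: string[] = [];

  // A FIFO queue gives breadth-first order: all depth-1 pages before any depth-2 page.
  while (queue.length > 0) {
    const current = queue.shift();
    if (current === undefined) break;
    visited.push(current.url);

    // Depth limit: pages at maxDepth are visited but their links are not followed.
    if (current.depth >= maxDepth) continue;

    for (const href of linksOf(current.url)) {
      const resolved = new URL(href, current.url); // resolve relative links
      resolved.hash = '';
      const key = resolved.toString();
      // Domain lock: skip anything outside the starting hostname.
      if (resolved.hostname !== start.hostname) continue;
      if (seen.has(key)) continue;
      seen.add(key);
      queue.push({ url: key, depth: current.depth + 1 });
    }
  }
  return visited;
}

// Exponential backoff: the wait doubles after each failed attempt, up to a cap.
// attempt is zero-based; baseMs and capMs are assumed values for illustration.
function backoffDelayMs(attempt: number, baseMs = 250, capMs = 8000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}
```

Separating the traversal rule from the network call keeps the sketch easy to reason about: the same visit order applies whether pages are fetched one at a time or through a concurrent queue.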

2. Sanitization

Files: src/converter/htmlToMarkdown.ts, src/lib/sanitizeSchema.ts

Once the raw HTML is fetched, it's passed to the htmlToMarkdown function. The first step within this function is sanitization, which aims to remove all non-essential content.

Sanitization uses the rehype-sanitize plugin with a custom schema defined in sanitizeSchema.ts.
The schema starts from the plugin's default schema and filters <script>, <style>, <img>, <header>, <footer>, and <nav> out of the list of allowed tag names.
It also strips potentially harmful attributes such as onload and onclick from all remaining elements.

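// src/lib/sanitizeSchema.ts
// Imports added for completeness (an assumption about the real file);
// rehype-sanitize exports its schema type as Options.
import { defaultSchema } from 'rehype-sanitize';
import type { Options as SanitizeOptions } from 'rehype-sanitize';

export const sanitizeSchema: SanitizeOptions = {
  ...defaultSchema,
  // Start from the default allowlist and drop the tags we never want to keep.
  tagNames: (defaultSchema.tagNames ?? []).filter(
    t => !['img', 'script', 'style', 'header', 'footer', 'nav'].includes(t),
  ),
  // ...
};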
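
rehype-sanitize works from allowlists: only the tag names and attributes named in the schema are kept. The sketch below illustrates that idea on a toy tree and does not call rehype-sanitize itself. ToyNode, allowedTags, allowedAttributes, and sanitizeSketch are hypothetical names, the lists are far shorter than a real schema, and dropping a disallowed element together with its children is a simplification chosen for this example.

```typescript
// Toy element tree; real HAST nodes carry more fields.
type ToyNode =
  | { type: 'text'; value: string }
  | {
      type: 'element';
      tagName: string;
      properties: Record<string, string>;
      children: ToyNode[];
    };

// Hypothetical allowlists for illustration only.
const allowedTags = new Set<string>(['p', 'a', 'h1', 'code']);
const allowedAttributes = new Set<string>(['href', 'title']);

function sanitizeSketch(nodes: ToyNode[]): ToyNode[] {
  const out: ToyNode[] = [];
  for (const node of nodes) {
    if (node.type === 'text') {
      out.push(node);
      continue;
    }
    // Simplification: a disallowed element is dropped along with its subtree.
    if (!allowedTags.has(node.tagName)) continue;

    // Keep only allowlisted attributes; everything else (onclick, onload, ...) is discarded.
    const properties: Record<string, string> = {};
    for (const name of Object.keys(node.properties)) {
      if (allowedAttributes.has(name)) properties[name] = node.properties[name];
    }
    out.push({ ...node, properties, children: sanitizeSketch(node.children) });
  }
  return out;
}
```

Because the filter is an allowlist rather than a blocklist, an attribute such as onclick is removed simply because nothing lists it as allowed; no one has to anticipate every dangerous attribute in advance.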

3. Link Rewriting

File: src/lib/linkRewrite.ts

After the HTML is sanitized, all internal links (<a> tags with relative href attributes) must be adjusted to work correctly in the final Markdown output. The rehypeRewriteLinks plugin handles this based on the chosen mode:

aggregate mode: Internal links like <a href="/docs/setup"> are converted to anchor links pointing to a slugified version of the path: <a href="#docs/setup">.
pages mode: The same link is converted to a relative link to another Markdown file: <a href="docs/setup.md">.

This ensures that navigation remains functional within the context of the generated documentation.
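
The two modes can be expressed as a small pure function over an href string. This is a sketch of the mapping described above, not the rehypeRewriteLinks plugin: rewriteHrefSketch and RewriteMode are hypothetical names, and the handling of external links, bare fragments, and the root path is an assumption. Query strings and fragments inside relative links are not handled.

```typescript
type RewriteMode = 'aggregate' | 'pages';

function rewriteHrefSketch(href: string, mode: RewriteMode): string {
  // Assumption: links with a scheme (https:, mailto:), protocol-relative links,
  // and bare fragments are left untouched.
  if (/^([a-z][a-z0-9+.-]*:|\/\/|#)/i.test(href)) return href;

  // Trim leading and trailing slashes so "/docs/setup/" and "docs/setup" agree.
  const path = href.replace(/^\/+/, '').replace(/\/+$/, '');
  if (path === '') return href; // the site root has no page name to point at

  return mode === 'aggregate' ? `#${path}` : `${path}.md`;
}
```

For the example in the text, rewriteHrefSketch('/docs/setup', 'aggregate') returns '#docs/setup' and rewriteHrefSketch('/docs/setup', 'pages') returns 'docs/setup.md'.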

4. Markdown Conversion

File: src/converter/htmlToMarkdown.ts

The final stage of the pipeline converts the processed HTML tree into a Markdown string.

rehype-remark converts the HTML AST (HAST) into a Markdown AST (MDAST).
remark-gfm adds support for GitHub-Flavored Markdown features such as tables and strikethrough.
Finally, remark-stringify serializes the Markdown AST into the final text output.
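
The two-step shape of this stage (tree-to-tree conversion, then serialization) can be illustrated with a toy version that uses no libraries. It is a sketch only and does not reflect how rehype-remark or remark-stringify are implemented: ToyHast, ToyMdast, toMdastSketch, stringifySketch, and htmlTreeToMarkdownSketch are hypothetical names, only a handful of node types are covered, and Markdown escaping is ignored.

```typescript
// Toy node shapes; the real HAST and MDAST types are much richer.
type ToyHast =
  | { type: 'text'; value: string }
  | { type: 'element'; tagName: string; properties?: Record<string, string>; children: ToyHast[] };

type ToyMdast =
  | { type: 'text'; value: string }
  | { type: 'heading'; depth: number; children: ToyMdast[] }
  | { type: 'paragraph'; children: ToyMdast[] }
  | { type: 'link'; url: string; children: ToyMdast[] }
  | { type: 'delete'; children: ToyMdast[] };

// Step 1 (the role rehype-remark plays): map HTML elements to Markdown node types.
function toMdastSketch(node: ToyHast): ToyMdast[] {
  if (node.type === 'text') return [{ type: 'text', value: node.value }];
  const children: ToyMdast[] = [];
  for (const child of node.children) children.push(...toMdastSketch(child));

  const heading = /^h([1-6])$/.exec(node.tagName);
  if (heading) return [{ type: 'heading', depth: Number(heading[1]), children }];
  if (node.tagName === 'p') return [{ type: 'paragraph', children }];
  if (node.tagName === 'a') {
    const url = node.properties && node.properties.href ? node.properties.href : '';
    return [{ type: 'link', url, children }];
  }
  if (node.tagName === 'del') return [{ type: 'delete', children }]; // GFM strikethrough
  return children; // unknown elements contribute only their content
}

// Step 2 (the role remark-stringify plays): serialize the Markdown tree to text.
function stringifySketch(node: ToyMdast): string {
  if (node.type === 'text') return node.value;
  const inner = node.children.map(stringifySketch).join('');
  if (node.type === 'heading') return `${'#'.repeat(node.depth)} ${inner}`;
  if (node.type === 'link') return `[${inner}](${node.url})`;
  if (node.type === 'delete') return `~~${inner}~~`;
  return inner; // paragraph
}

function htmlTreeToMarkdownSketch(roots: ToyHast[]): string {
  const blocks: ToyMdast[] = [];
  for (const root of roots) blocks.push(...toMdastSketch(root));
  return blocks.map(stringifySketch).join('\n\n');
}
```

Keeping the intermediate Markdown tree is what lets plugins such as remark-gfm add node types (tables, strikethrough) without touching the HTML side of the pipeline.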

Together, these stages produce output that is clean, safe, and well suited to consumption by Large Language Models.