Guide: HTML Processing Pipeline

The core function of the Deepwiki MCP Server is to transform cluttered web pages into clean, structured Markdown. This is achieved through a multi-stage pipeline that processes the raw HTML fetched from the web.

The Pipeline Stages

1. Crawling

File: src/lib/httpCrawler.ts

The process begins with the crawl function, a breadth-first web crawler responsible for fetching the HTML content. Key features of this stage include:

  • Concurrency: Uses p-queue to cap the number of concurrent fetch requests, keeping the crawl fast without flooding the target server.
  • Robots.txt: Fetches and parses the target domain's robots.txt before crawling, and skips any disallowed paths.
  • Depth Limiting: Follows links only up to the maxDepth parameter, so the crawl does not expand to the entire website.
  • Domain Lock: Only crawls URLs on the same hostname as the starting URL.
  • Retries: Retries failed network requests automatically, using exponential backoff between attempts.
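
The traversal rules above can be sketched in a few lines. This is a minimal illustration, not the code in httpCrawler.ts: crawlSketch, fetchWithRetry, backoffDelay, and the FetchLinks callback are hypothetical names, the retry count and delays are assumed values, and the p-queue pool and robots.txt check are left out to keep the example short. It also assumes that maxDepth counts link hops from the starting URL.

```typescript
// Hypothetical callback: resolves to the hrefs found on the page at `url`.
type FetchLinks = (url: string) => Promise<string[]>;

// Delay before retry number `attempt` (0-based): base, 2x base, 4x base, ... capped at maxMs.
function backoffDelay(attempt: number, baseMs = 250, maxMs = 8000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Retries a failing fetch up to `retries` times, sleeping with exponential backoff in between.
async function fetchWithRetry(url: string, fetchLinks: FetchLinks, retries = 3): Promise<string[]> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fetchLinks(url);
    } catch (err) {
      if (attempt >= retries) throw err;
      await new Promise<void>(resolve => setTimeout(resolve, backoffDelay(attempt)));
    }
  }
}

// Breadth-first traversal with a depth limit and a same-hostname rule.
// Returns the visited URLs in the order they were fetched.
async function crawlSketch(root: string, maxDepth: number, fetchLinks: FetchLinks): Promise<string[]> {
  const start = new URL(root);
  start.hash = '';
  const visited = new Set<string>([start.href]);
  const queue: Array<{ url: string; depth: number }> = [{ url: start.href, depth: 0 }];
  const order: string[] = [];

  while (queue.length > 0) {
    const { url, depth } = queue.shift()!;
    const links = await fetchWithRetry(url, fetchLinks);
    order.push(url);
    if (depth >= maxDepth) continue; // depth limit: do not expand this page's links

    for (const href of links) {
      let next: URL;
      try {
        next = new URL(href, url); // resolve relative links against the current page
      } catch {
        continue; // skip hrefs that are not valid URLs
      }
      next.hash = '';
      // Domain lock and de-duplication.
      if (next.hostname !== start.hostname || visited.has(next.href)) continue;
      visited.add(next.href);
      queue.push({ url: next.href, depth: depth + 1 });
    }
  }
  return order;
}
```

Passing the fetcher in as a callback keeps the traversal logic testable without network access. With maxDepth set to 1, only the start page and the same-host pages it links to are visited.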

2. Sanitization

Files: src/converter/htmlToMarkdown.ts, src/lib/sanitizeSchema.ts

Once the raw HTML is fetched, it's passed to the htmlToMarkdown function. The first step within this function is sanitization, which aims to remove all non-essential content.

  • It uses the rehype-sanitize plugin with a custom schema defined in sanitizeSchema.ts.
  • This schema filters tags such as <script>, <style>, <img>, <header>, <footer>, and <nav> out of the list of allowed tag names, so the sanitizer no longer lets those elements through.
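  • It also strips potentially harmful attributes, such as the onload and onclick event handlers, from all remaining elements.

// src/lib/sanitizeSchema.ts
// Imports assumed for this excerpt: rehype-sanitize exports defaultSchema and an Options type.
import type { Options as SanitizeOptions } from 'rehype-sanitize';
import { defaultSchema } from 'rehype-sanitize';

// Start from the default schema and drop unwanted tags from the allowlist.
export const sanitizeSchema: SanitizeOptions = {
  ...defaultSchema,
  tagNames: (defaultSchema.tagNames ?? []).filter(
    t => !['img', 'script', 'style', 'header', 'footer', 'nav'].includes(t),
  ),
  // ...
};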
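
To make the allowlist idea concrete, here is a simplified, dependency-free model of what a schema like this does to a tree. It is not rehype-sanitize's implementation: ToyNode, ToySchema, and sanitizeToy are hypothetical names, and the node shape is reduced to the bare minimum. One deliberate simplification: this toy drops a disallowed element together with its subtree, whereas rehype-sanitize by default keeps the children of a disallowed element unless its tag is listed in the schema's strip option.

```typescript
// Reduced stand-in for an HTML syntax tree node (hypothetical shape).
type ToyNode =
  | { type: 'text'; value: string }
  | { type: 'element'; tagName: string; properties: Record<string, string>; children: ToyNode[] };

// Reduced stand-in for a sanitize schema (hypothetical shape).
interface ToySchema {
  tagNames: string[];   // elements that may stay
  attributes: string[]; // attribute names that may stay on any element
}

// Keeps only allowlisted elements and attributes; everything else is dropped.
function sanitizeToy(nodes: ToyNode[], schema: ToySchema): ToyNode[] {
  const out: ToyNode[] = [];
  for (const node of nodes) {
    if (node.type === 'text') {
      out.push(node);
    } else if (schema.tagNames.includes(node.tagName)) {
      const properties: Record<string, string> = {};
      for (const name of Object.keys(node.properties)) {
        if (schema.attributes.includes(name)) properties[name] = node.properties[name];
      }
      out.push({ ...node, properties, children: sanitizeToy(node.children, schema) });
    }
    // Elements outside the allowlist fall through and are omitted (toy behavior).
  }
  return out;
}
```

Because the filter is an allowlist, attributes such as onclick disappear simply by not being listed; nothing has to enumerate every dangerous attribute.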

3. Link Rewriting

File: src/lib/linkRewrite.ts

After the HTML is sanitized, all internal links (<a> tags with relative href attributes) must be adjusted to work correctly in the final Markdown output. The rehypeRewriteLinks plugin handles this based on the chosen mode:

  • aggregate mode: Internal links like <a href="/docs/setup"> are converted to anchor links pointing to a slugified version of the path: <a href="#docs/setup">.
  • pages mode: The same link is converted to a relative link to another Markdown file: <a href="docs/setup.md">.

This ensures that navigation remains functional within the context of the generated documentation.
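
The two modes can be illustrated with a small pure function. This is a sketch that reproduces only the two examples given above, not the rehypeRewriteLinks plugin itself: rewriteHref and RewriteMode are hypothetical names, and the handling of absolute URLs, in-page anchors, and trailing slashes is an assumption.

```typescript
type RewriteMode = 'aggregate' | 'pages';

// Hypothetical helper: rewrites one href according to the chosen mode.
// Query strings and fragments on internal links are not handled in this sketch.
function rewriteHref(href: string, mode: RewriteMode): string {
  // Leave empty hrefs, in-page anchors, protocol-relative URLs, and URLs with a scheme untouched.
  const hasScheme = /^[a-z][a-z0-9+.-]*:/i.test(href);
  if (href === '' || href.startsWith('#') || href.startsWith('//') || hasScheme) {
    return href;
  }
  // Trim leading and trailing slashes so "/docs/setup/" and "/docs/setup" behave the same.
  const path = href.replace(/^\/+/, '').replace(/\/+$/, '');
  if (path === '') return href; // a bare "/" has no path to rewrite
  return mode === 'aggregate' ? `#${path}` : `${path}.md`;
}
```

For example, rewriteHref('/docs/setup', 'aggregate') yields '#docs/setup' and rewriteHref('/docs/setup', 'pages') yields 'docs/setup.md', while an external link such as 'https://example.com/x' is returned unchanged.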

4. Markdown Conversion

File: src/converter/htmlToMarkdown.ts

The final stage of the pipeline converts the processed HTML tree into a Markdown string.

  • It uses rehype-remark to bridge the gap from the HTML AST (HAST) to a Markdown AST (MDAST).
  • remark-gfm is added to ensure support for GitHub-Flavored Markdown features like tables and strikethrough.
  • Finally, remark-stringify serializes the Markdown AST into the final text output.
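
The two-step shape of this stage (tree-to-tree conversion, then serialization) can be shown with a toy model. This sketch does not use rehype-remark or remark-stringify and covers only a handful of node types; HtmlNode, MdNode, toMdTree, and stringifyMd are hypothetical names standing in for HAST, MDAST, and the two plugins' roles. GFM features such as tables are outside its scope.

```typescript
// Toy stand-in for HAST nodes (the real tree has many more fields).
type HtmlNode =
  | { type: 'text'; value: string }
  | { type: 'element'; tagName: string; href?: string; children: HtmlNode[] };

// Toy stand-in for MDAST nodes.
type MdNode =
  | { type: 'text'; value: string }
  | { type: 'heading'; depth: number; children: MdNode[] }
  | { type: 'paragraph'; children: MdNode[] }
  | { type: 'strong'; children: MdNode[] }
  | { type: 'link'; url: string; children: MdNode[] };

// Step 1 (the role rehype-remark plays): map HTML-shaped nodes to Markdown-shaped nodes.
function toMdTree(nodes: HtmlNode[]): MdNode[] {
  const out: MdNode[] = [];
  for (const node of nodes) {
    if (node.type === 'text') {
      out.push(node);
      continue;
    }
    const children = toMdTree(node.children);
    const heading = /^h([1-6])$/.exec(node.tagName);
    if (heading) out.push({ type: 'heading', depth: Number(heading[1]), children });
    else if (node.tagName === 'p') out.push({ type: 'paragraph', children });
    else if (node.tagName === 'strong' || node.tagName === 'b') out.push({ type: 'strong', children });
    else if (node.tagName === 'a') out.push({ type: 'link', url: node.href ?? '', children });
    else out.push(...children); // unknown elements: keep their content, drop the wrapper
  }
  return out;
}

// Step 2 (the role remark-stringify plays): serialize the Markdown-shaped tree to text.
function stringifyMd(nodes: MdNode[]): string {
  return nodes
    .map(node => {
      switch (node.type) {
        case 'text':
          return node.value;
        case 'heading':
          return '#'.repeat(node.depth) + ' ' + stringifyMd(node.children) + '\n\n';
        case 'paragraph':
          return stringifyMd(node.children) + '\n\n';
        case 'strong':
          return '**' + stringifyMd(node.children) + '**';
        case 'link':
          return '[' + stringifyMd(node.children) + '](' + node.url + ')';
      }
    })
    .join('');
}
```

Separating the tree conversion from the serialization is what lets plugins such as remark-gfm operate on the Markdown tree in between, before any text is produced.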

Together, these stages produce Markdown that is free of scripts, styling, and page chrome and that keeps its internal links working, which makes the output well suited for use by Large Language Models.