Guide: HTML Processing Pipeline
The core function of the Deepwiki MCP Server is to transform cluttered web pages into clean, structured Markdown. This is achieved through a multi-stage pipeline that processes the raw HTML fetched from the web.
The Pipeline Stages
1. Crawling
File: src/lib/httpCrawler.ts
The process begins with the crawl function, a breadth-first web crawler responsible for fetching the HTML content. Key features of this stage include:
- Concurrency: Uses p-queue to manage a pool of concurrent fetch requests, maximizing speed while being respectful to the server.
- Robots.txt: It first fetches and parses the robots.txt file from the target domain to ensure it doesn't access disallowed paths.
- Depth Limiting: Adheres to the maxDepth parameter to prevent crawling an entire website.
- Domain Lock: Only crawls URLs within the same hostname as the starting URL.
- Retries: Implements an exponential backoff strategy to automatically retry failed network requests.
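The depth limit, domain lock, and retry behaviour described above can be sketched as follows. This is a minimal illustration, not the code in src/lib/httpCrawler.ts: the names backoffDelay, withRetry, crawlSketch, and the fetchLinks callback are hypothetical, the page fetcher is injected so the example runs without network access, and the p-queue concurrency and robots.txt checks are left out for brevity.

```typescript
// Hypothetical sketch; names and signatures are assumptions, not the project's API.
type FetchLinks = (url: string) => Promise<string[]>;

// Exponential backoff: baseMs, 2 * baseMs, 4 * baseMs, ...
function backoffDelay(attempt: number, baseMs = 250): number {
  return baseMs * 2 ** attempt;
}

// Run an async operation, retrying up to `retries` times with growing delays.
async function withRetry<T>(op: () => Promise<T>, retries = 3, baseMs = 250): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await op();
    } catch (err) {
      if (attempt >= retries) throw err;
      await new Promise<void>((resolve) => setTimeout(resolve, backoffDelay(attempt, baseMs)));
    }
  }
}

// Breadth-first crawl: visits pages level by level, never leaves the start
// hostname, and does not follow links found on pages at depth `maxDepth`.
async function crawlSketch(
  startUrl: string,
  maxDepth: number,
  fetchLinks: FetchLinks,
  retries = 3,
  baseMs = 250,
): Promise<string[]> {
  const start = new URL(startUrl);
  start.hash = '';
  const seen = new Set<string>([start.href]);
  const fetched: string[] = [];
  let frontier: string[] = [start.href];

  for (let depth = 0; frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const url of frontier) {
      const links = await withRetry(() => fetchLinks(url), retries, baseMs);
      fetched.push(url);
      if (depth === maxDepth) continue; // depth limit reached: do not follow links
      for (const link of links) {
        const resolved = new URL(link, url); // resolve relative links against the page
        resolved.hash = '';
        if (resolved.hostname !== start.hostname) continue; // domain lock
        if (seen.has(resolved.href)) continue; // already queued or fetched
        seen.add(resolved.href);
        next.push(resolved.href);
      }
    }
    frontier = next;
  }
  return fetched;
}
```

Processing one level at a time keeps the depth bookkeeping trivial. A concurrent version would hand each frontier URL to a task queue instead of awaiting them one by one.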
2. Sanitization
Files: src/converter/htmlToMarkdown.ts, src/lib/sanitizeSchema.ts
Once the raw HTML is fetched, it's passed to the htmlToMarkdown function. The first step within this function is sanitization, which aims to remove all non-essential content.
- It uses the rehype-sanitize plugin with a custom schema defined in sanitizeSchema.ts.
- This schema explicitly removes tags like <script>, <style>, <img>, <header>, <footer>, and <nav>.
- It also strips potentially harmful attributes like onload and onclick from all remaining elements.
// src/lib/sanitizeSchema.ts
export const sanitizeSchema: SanitizeOptions = {
  ...defaultSchema,
  // Start from the default allow-list and drop the tags we never want to keep.
  tagNames: (defaultSchema.tagNames ?? []).filter(
    t => !['img', 'script', 'style', 'header', 'footer', 'nav'].includes(t),
  ),
  // ...
};
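To make the effect of sanitization concrete, here is a library-free sketch of the idea applied to a tiny tree. It is an assumption-laden simplification, not how rehype-sanitize works internally: SketchNode, blockedTags, and sanitizeSketch are hypothetical names, blocked elements are dropped together with their subtree, and event-handler attributes are detected by a simple "on" prefix check.

```typescript
// Hypothetical stand-in for a HAST node; the real pipeline uses the hast types.
type SketchNode =
  | { type: 'text'; value: string }
  | {
      type: 'element';
      tagName: string;
      properties: Record<string, string>;
      children: SketchNode[];
    };

// Tags the guide lists as removed by the custom schema.
const blockedTags = new Set(['img', 'script', 'style', 'header', 'footer', 'nav']);

// Returns a cleaned copy of the node, or null when the node itself is blocked.
function sanitizeSketch(node: SketchNode): SketchNode | null {
  if (node.type === 'text') return node;
  if (blockedTags.has(node.tagName)) return null; // drop the element and its subtree

  const properties: Record<string, string> = {};
  for (const name of Object.keys(node.properties)) {
    // Simplification: treat any attribute starting with "on" as an event handler.
    if (!name.toLowerCase().startsWith('on')) properties[name] = node.properties[name];
  }

  const children: SketchNode[] = [];
  for (const child of node.children) {
    const cleaned = sanitizeSketch(child);
    if (cleaned !== null) children.push(cleaned);
  }
  return { type: 'element', tagName: node.tagName, properties, children };
}
```

Applied to a div that carries an onclick attribute and contains both a nav and a p, this returns the div without onclick and with only the p as a child.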
3. Link Rewriting
File: src/lib/linkRewrite.ts
After the HTML is sanitized, all internal links (<a> tags with relative href attributes) must be adjusted to work correctly in the final Markdown output. The rehypeRewriteLinks plugin handles this based on the chosen mode:
- aggregate mode: Internal links like <a href="/docs/setup"> are converted to anchor links pointing to a slugified version of the path: <a href="#docs/setup">.
- pages mode: The same link is converted to a relative link to another Markdown file: <a href="docs/setup.md">.
This ensures that navigation remains functional within the context of the generated documentation.
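The two modes can be illustrated with a small string-level helper. This is only a sketch of the mapping described above, not the rehypeRewriteLinks plugin itself: rewriteHrefSketch and LinkMode are hypothetical names, and query strings and fragments on internal links are not handled.

```typescript
// Hypothetical helper; the real plugin operates on <a> elements in the HTML tree.
type LinkMode = 'aggregate' | 'pages';

function rewriteHrefSketch(href: string, mode: LinkMode): string {
  // Leave absolute URLs (scheme:), protocol-relative URLs, and in-page anchors untouched.
  if (/^[a-z][a-z0-9+.-]*:/i.test(href) || href.startsWith('//') || href.startsWith('#')) {
    return href;
  }
  // Trim leading and trailing slashes so "/docs/setup/" becomes "docs/setup".
  const path = href.replace(/^\/+/, '').replace(/\/+$/, '');
  if (path === '') return href; // nothing to rewrite for the site root
  return mode === 'aggregate' ? `#${path}` : `${path}.md`;
}
```

For example, rewriteHrefSketch('/docs/setup', 'aggregate') yields '#docs/setup', and the same call with 'pages' yields 'docs/setup.md', matching the two cases listed above.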
4. Markdown Conversion
File: src/converter/htmlToMarkdown.ts
The final stage of the pipeline converts the processed HTML tree into a Markdown string.
- It uses rehype-remark to bridge the gap from the HTML AST (HAST) to a Markdown AST (MDAST).
- remark-gfm is added to ensure support for GitHub-Flavored Markdown features like tables and strikethrough.
- Finally, remark-stringify serializes the Markdown AST into the final text output.
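The two-step shape of this stage, tree-to-tree conversion followed by serialization, can be shown with a toy example. It does not use rehype-remark, remark-gfm, or remark-stringify, and covers only headings, paragraphs, and strikethrough. HtmlNode, MdNode, toMdNodes, and stringifyMd are hypothetical names chosen for this sketch.

```typescript
// Toy HTML tree (stand-in for HAST) and Markdown tree (stand-in for MDAST).
type HtmlNode =
  | { type: 'text'; value: string }
  | { type: 'element'; tagName: string; children: HtmlNode[] };

type MdNode =
  | { type: 'text'; value: string }
  | { type: 'heading'; depth: number; children: MdNode[] }
  | { type: 'paragraph'; children: MdNode[] }
  | { type: 'delete'; children: MdNode[] };

// Step 1: map HTML nodes to Markdown nodes (the role rehype-remark plays).
function toMdNodes(node: HtmlNode): MdNode[] {
  if (node.type === 'text') return [{ type: 'text', value: node.value }];
  const children: MdNode[] = [];
  for (const child of node.children) children.push(...toMdNodes(child));

  const heading = /^h([1-6])$/.exec(node.tagName);
  if (heading) return [{ type: 'heading', depth: Number(heading[1]), children }];
  if (node.tagName === 'p') return [{ type: 'paragraph', children }];
  if (node.tagName === 'del') return [{ type: 'delete', children }]; // GFM strikethrough
  return children; // unknown elements are unwrapped in this toy version
}

// Step 2: serialize the Markdown tree to text (the role remark-stringify plays).
function stringifyMd(node: MdNode): string {
  if (node.type === 'text') return node.value;
  const inner = node.children.map((child) => stringifyMd(child)).join('');
  if (node.type === 'heading') return `${'#'.repeat(node.depth)} ${inner}\n\n`;
  if (node.type === 'delete') return `~~${inner}~~`;
  return `${inner}\n\n`; // paragraph
}
```

Keeping the conversion and the serialization separate is what lets an extension such as GFM support slot in between the two steps without touching either one.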
Together, these four stages produce Markdown that is free of scripts and page chrome, keeps internal links working, and is formatted for use by Large Language Models.