Guide: HTML Processing Pipeline
The core function of the Deepwiki MCP Server is to transform cluttered web pages into clean, structured Markdown. This is achieved through a multi-stage pipeline that processes the raw HTML fetched from the web.
The Pipeline Stages
1. Crawling
File: src/lib/httpCrawler.ts
The process begins with the crawl function, a breadth-first web crawler responsible for fetching the HTML content. Key features of this stage include:
- Concurrency: Uses p-queue to manage a pool of concurrent fetch requests, maximizing speed while being respectful to the server.
- Robots.txt: It first fetches and parses the robots.txt file from the target domain to ensure it doesn't access disallowed paths.
- Depth Limiting: Adheres to the maxDepth parameter to prevent crawling an entire website.
- Domain Lock: Only crawls URLs within the same hostname as the starting URL.
- Retries: Implements an exponential backoff strategy to automatically retry failed network requests.
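The depth limit, domain lock, and backoff schedule described above can be sketched without any network access. This is a minimal illustration, not the code in src/lib/httpCrawler.ts: the names planCrawl, LinkGraph, and backoffDelayMs are hypothetical, the link graph is an in-memory stand-in for fetched pages, and the choice that maxDepth 0 means "start page only" and the 250 ms base delay are assumptions.

```typescript
// Hypothetical stand-in for the network: page URL -> hrefs found on that page.
type LinkGraph = Record<string, string[]>;

// Returns the order in which a breadth-first crawl would visit pages.
// Assumption: maxDepth 0 visits only the start page.
function planCrawl(startUrl: string, graph: LinkGraph, maxDepth: number): string[] {
  const start = new URL(startUrl);
  start.hash = "";
  const startKey = start.toString();

  const visited = new Set<string>([startKey]);
  const order: string[] = [];
  let frontier: string[] = [startKey];

  for (let depth = 0; depth <= maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const pageUrl of frontier) {
      order.push(pageUrl);
      for (const href of graph[pageUrl] ?? []) {
        // Resolve relative links against the page they appear on.
        const resolved = new URL(href, pageUrl);
        resolved.hash = "";
        if (resolved.hostname !== start.hostname) continue; // domain lock
        const key = resolved.toString();
        if (visited.has(key)) continue; // never queue a page twice
        visited.add(key);
        next.push(key);
      }
    }
    frontier = next; // the next loop iteration handles the next depth level
  }
  return order;
}

// Exponential backoff: the delay doubles with each attempt, up to a cap.
// Assumption: base and cap values are illustrative, not the server's settings.
function backoffDelayMs(attempt: number, baseMs = 250, capMs = 8000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}
```

With a start page that links to /a, /b, and an external host, and maxDepth set to 2, the plan visits the start page, then /a and /b, then whatever those two link to, and never leaves the starting hostname. A real crawler would replace the graph lookup with a fetch call scheduled through a concurrency-limited queue.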
2. Sanitization
Files: src/converter/htmlToMarkdown.ts, src/lib/sanitizeSchema.ts
Once the raw HTML is fetched, it's passed to the htmlToMarkdown function. The first step within this function is sanitization, which aims to remove all non-essential content.
- It uses the rehype-sanitize plugin with a custom schema defined in sanitizeSchema.ts.
- This schema explicitly removes tags like <script>, <style>, <img>, <header>, <footer>, and <nav>.
- It also strips potentially harmful attributes like onload and onclick from all remaining elements.
// src/lib/sanitizeSchema.ts
export const sanitizeSchema: SanitizeOptions = {
  ...defaultSchema,
  tagNames: (defaultSchema.tagNames ?? []).filter(
    t => !['img', 'script', 'style', 'header', 'footer', 'nav'].includes(t),
  ),
  // ...
};
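To make the effect of this stage concrete, here is a dependency-free sketch that applies the same two ideas (drop certain tags, strip event-handler attributes) to a tiny HTML tree. It is a deliberate simplification and not how rehype-sanitize works internally: the real plugin is allowlist-based, while this toy uses a denylist and discards a removed element together with everything inside it. The names HNode and sanitizeTree are hypothetical.

```typescript
// Minimal HAST-like node shapes, invented for this sketch.
interface TextNode {
  type: "text";
  value: string;
}
interface ElementNode {
  type: "element";
  tagName: string;
  properties: Record<string, string>;
  children: HNode[];
}
type HNode = TextNode | ElementNode;

// Same tag list as the filter in the schema excerpt above.
const removedTags = new Set(["img", "script", "style", "header", "footer", "nav"]);

function sanitizeTree(nodes: HNode[]): HNode[] {
  const out: HNode[] = [];
  for (const node of nodes) {
    if (node.type === "text") {
      out.push(node);
      continue;
    }
    // Assumption: drop the element and its whole subtree.
    if (removedTags.has(node.tagName)) continue;

    // Strip event-handler attributes such as onload and onclick.
    const properties: Record<string, string> = {};
    for (const name of Object.keys(node.properties)) {
      if (!name.toLowerCase().startsWith("on")) {
        properties[name] = node.properties[name] as string;
      }
    }
    out.push({
      type: "element",
      tagName: node.tagName,
      properties,
      children: sanitizeTree(node.children),
    });
  }
  return out;
}
```

Given a div with id and onclick attributes that contains a script, a paragraph, and a nav, the result is the div with only its id and only the paragraph as a child.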
3. Link Rewriting
File: src/lib/linkRewrite.ts
After the HTML is sanitized, all internal links (<a> tags with relative href attributes) must be adjusted to work correctly in the final Markdown output. The rehypeRewriteLinks plugin handles this based on the chosen mode:
- aggregate mode: Internal links like <a href="/docs/setup"> are converted to anchor links pointing to a slugified version of the path: <a href="#docs/setup">.
- pages mode: The same link is converted to a relative link to another Markdown file: <a href="docs/setup.md">.
This ensures that navigation remains functional within the context of the generated documentation.
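The two modes can be illustrated with a small string-level function that reproduces the examples above. This is a sketch under assumptions, not the rehypeRewriteLinks implementation: the name rewriteHref is hypothetical, and the handling of absolute URLs, query strings, and in-page anchors is a guess at reasonable behavior rather than something the plugin is known to do.

```typescript
type LinkMode = "aggregate" | "pages";

function rewriteHref(href: string, mode: LinkMode): string {
  // Assumption: absolute URLs (https:, mailto:, ...), protocol-relative URLs,
  // and in-page anchors are left untouched.
  const hasScheme = /^[a-z][a-z0-9+.-]*:/i.test(href);
  if (hasScheme || href.startsWith("//") || href.startsWith("#")) return href;

  // Assumption: drop any query string or fragment, then trim slashes.
  const [pathPart = ""] = href.split(/[?#]/);
  const path = pathPart.replace(/^\/+|\/+$/g, "");
  if (path === "") return href;

  // aggregate: anchor inside one combined document.
  // pages: relative link to a sibling Markdown file.
  return mode === "aggregate" ? `#${path}` : `${path}.md`;
}
```

For example, rewriteHref("/docs/setup", "aggregate") yields "#docs/setup" and rewriteHref("/docs/setup", "pages") yields "docs/setup.md", matching the conversions listed above.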
4. Markdown Conversion
File: src/converter/htmlToMarkdown.ts
The final stage of the pipeline converts the processed HTML tree into a Markdown string.
- It uses rehype-remark to bridge the gap from the HTML AST (HAST) to a Markdown AST (MDAST).
- remark-gfm is added to ensure support for GitHub-Flavored Markdown features like tables and strikethrough.
- Finally, remark-stringify serializes the Markdown AST into the final text output.
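The two-step shape of this stage (tree-to-tree conversion, then serialization) can be shown with a toy, dependency-free version. It is only an illustration of the idea and does not use or mirror the APIs of rehype-remark or remark-stringify: the functions toMdast and stringify and the reduced node types are hypothetical, and only headings, paragraphs, bold text, and strikethrough are handled.

```typescript
type Hast =
  | { type: "text"; value: string }
  | { type: "element"; tagName: string; children: Hast[] };

type Mdast =
  | { type: "text"; value: string }
  | { type: "heading"; depth: number; children: Mdast[] }
  | { type: "paragraph" | "strong" | "delete"; children: Mdast[] };

// Step 1: map HTML nodes onto Markdown nodes.
function toMdast(nodes: Hast[]): Mdast[] {
  const out: Mdast[] = [];
  for (const node of nodes) {
    if (node.type === "text") {
      out.push({ type: "text", value: node.value });
      continue;
    }
    const children = toMdast(node.children);
    const heading = /^h([1-6])$/.exec(node.tagName);
    if (heading) {
      out.push({ type: "heading", depth: Number(heading[1]), children });
    } else if (node.tagName === "p") {
      out.push({ type: "paragraph", children });
    } else if (node.tagName === "strong" || node.tagName === "b") {
      out.push({ type: "strong", children });
    } else if (node.tagName === "del" || node.tagName === "s") {
      out.push({ type: "delete", children }); // strikethrough is a GFM feature
    } else {
      out.push(...children); // unknown wrapper: keep its content only
    }
  }
  return out;
}

// Step 2: serialize the Markdown tree to text.
function stringifyNode(node: Mdast): string {
  switch (node.type) {
    case "text":
      return node.value;
    case "heading":
      return `${"#".repeat(node.depth)} ${stringifyInline(node.children)}`;
    case "paragraph":
      return stringifyInline(node.children);
    case "strong":
      return `**${stringifyInline(node.children)}**`;
    case "delete":
      return `~~${stringifyInline(node.children)}~~`;
  }
}

function stringifyInline(children: Mdast[]): string {
  return children.map(stringifyNode).join("");
}

// Assumption: every top-level node is a block, separated by a blank line.
function stringify(nodes: Mdast[]): string {
  return nodes.map(stringifyNode).join("\n\n") + "\n";
}
```

Keeping the conversion and the serialization separate is what lets an extension such as GFM add node types in the middle without touching the HTML side.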
Together, these four stages produce Markdown that is free of scripts and page chrome, keeps internal links working, and is formatted for use by Large Language Models.