html2md-mcp

MCP server converting HTML pages to Markdown with browser support and significant size reduction.

Stars: 1
Forks: 0
Releases: 1

Overview

HTML to Markdown MCP Server converts web pages into concise Markdown suitable for AI context. It preserves essential content such as tables, images, and links while removing noise like scripts, styles, navigation, headers, and footers, reducing size by approximately 90-95% compared to the original HTML.

The server uses trafilatura for content extraction and BeautifulSoup4 for HTML parsing, with an architecture designed for streaming processing so that large pages are handled efficiently. Options control whether images, tables, and links are included, and the server enforces content-size limits (1 MB to 50 MB) to prevent excessive downloads. Optional caching speeds up repeated conversions.

For JavaScript-heavy sites and authenticated pages, browser mode uses Playwright to render pages, execute JavaScript, and reuse cookies from a user profile. It supports Chromium, Firefox, and WebKit and provides configurable wait strategies for dynamic content.

Deployment options include uv, pip install, and Docker images preloaded with Playwright. The server is intended to run as an MCP endpoint and can be integrated with Claude Desktop or other MCP clients for automated conversions.

Details

Owner: sunshad0w
Language: Python
License: MIT License
Updated: 2025-12-07

Features

HTML to Markdown conversion

Converts HTML content fetched from URLs into clean Markdown.

Content preservation

Preserves essential content such as tables, images, and links.

Content trimming

Removes unnecessary elements (scripts, styles, navigation, footers, headers).
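
To make the trimming step concrete, here is a minimal sketch that drops the listed noise elements and keeps the remaining text. It uses only Python's standard `html.parser` as a stand-in. The server itself is described as using trafilatura and BeautifulSoup4, so the names `NOISE_TAGS`, `NoiseStripper`, and `strip_noise` are hypothetical and not part of the project.

```python
from html.parser import HTMLParser

# Tags whose whole subtree is treated as noise (taken from the feature text).
NOISE_TAGS = {"script", "style", "nav", "header", "footer"}


class NoiseStripper(HTMLParser):
    """Collect text that is not nested inside any noise tag (illustrative only)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0    # number of noise tags currently open
        self.chunks = []  # text fragments that survive trimming

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only when we are outside every noise subtree.
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


def strip_noise(html: str) -> str:
    """Return the visible text of `html` with noise subtrees removed."""
    parser = NoiseStripper()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

A real extractor also has to preserve structure (tables, links, images) and cope with malformed markup, which is what the dedicated libraries are for.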

Size reduction

Achieves significant compression (~90-95%) while preserving content.
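
The reduction figure can be read as the fraction of bytes removed relative to the original HTML. A small sketch of that arithmetic follows; the function name `size_reduction` and the choice to measure UTF-8 bytes are assumptions for illustration, not something the project documents.

```python
def size_reduction(html: str, markdown: str) -> float:
    """Return the fraction of bytes removed; 0.9 means the output is 90% smaller."""
    original = len(html.encode("utf-8"))
    converted = len(markdown.encode("utf-8"))
    if original == 0:
        raise ValueError("original HTML is empty")
    return 1 - converted / original
```

Under this definition, a 1,000-byte page converted to 80 bytes of Markdown corresponds to a 92% reduction, which falls inside the quoted range.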

Configurable rendering options

Options to include or exclude images, tables, and links.

Extraction stack

Built with trafilatura and BeautifulSoup4 for robust extraction.

Streaming processing

Stream processing for efficient handling of large pages.
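
One way streaming can combine with the content-size limits mentioned in the overview is to read the body in chunks and stop as soon as a cap is exceeded. The sketch below works on any binary file-like object. The names `clamp_limit` and `read_capped`, the 64 KB chunk size, and the reading of 1 MB as 1024 * 1024 bytes are all assumptions, not the server's actual behavior.

```python
import io
from typing import BinaryIO, Iterator

MIN_LIMIT = 1 * 1024 * 1024   # 1 MB lower bound (binary megabytes assumed)
MAX_LIMIT = 50 * 1024 * 1024  # 50 MB upper bound


def clamp_limit(requested: int) -> int:
    """Force a requested size limit into the 1 MB to 50 MB range."""
    return max(MIN_LIMIT, min(requested, MAX_LIMIT))


def read_capped(stream: BinaryIO, limit: int,
                chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield chunks from `stream`; raise once more than `limit` bytes are seen."""
    seen = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return
        seen += len(chunk)
        if seen > limit:
            raise ValueError(f"content exceeds {limit} bytes")
        yield chunk


# Usage with an in-memory stream standing in for a network response.
body = b"".join(read_capped(io.BytesIO(b"<p>hi</p>"), clamp_limit(0)))
```

Because the check runs per chunk, an oversized page is rejected without buffering it in full.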

Browser mode with Playwright

Browser mode renders JavaScript and enables authenticated access. It supports Chromium, Firefox, and WebKit, uses cookies from a user profile, and offers configurable wait strategies.
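
A rough sketch of how such a browser mode could be built on Playwright's synchronous API: a persistent context reuses a profile directory (and therefore its cookies), and `wait_until` selects the wait strategy. The helper names `check_browser_options` and `render_html` and their parameters are hypothetical; how the server actually wires this up is not documented here.

```python
from pathlib import Path

BROWSERS = {"chromium", "firefox", "webkit"}
# Values accepted by Playwright's page.goto(wait_until=...).
WAIT_STRATEGIES = {"commit", "domcontentloaded", "load", "networkidle"}


def check_browser_options(browser: str, wait_until: str) -> None:
    """Reject unsupported browser names or wait strategies early."""
    if browser not in BROWSERS:
        raise ValueError(f"unsupported browser: {browser!r}")
    if wait_until not in WAIT_STRATEGIES:
        raise ValueError(f"unsupported wait strategy: {wait_until!r}")


def render_html(url: str, profile_dir: Path, browser: str = "chromium",
                wait_until: str = "networkidle") -> str:
    """Render `url` in a real browser and return the resulting HTML."""
    check_browser_options(browser, wait_until)
    # Imported lazily so the validation helper works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser_type = getattr(p, browser)  # p.chromium, p.firefox or p.webkit
        # A persistent context keeps cookies and logins in profile_dir.
        context = browser_type.launch_persistent_context(
            str(profile_dir), headless=True
        )
        try:
            page = context.new_page()
            page.goto(url, wait_until=wait_until)
            return page.content()
        finally:
            context.close()
```

The returned HTML would then go through the same extraction path as a plainly fetched page.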

Audience

AI developers: Convert web pages to compact Markdown to feed LLMs and AI agents with essential content.
Claude Desktop users: Configure and run html2md MCP via Claude Desktop using Docker or uv, enabling fast URL-to-Markdown conversions.
Web developers: Leverage Playwright browser mode to render JS-heavy sites and access authenticated content.
Data scientists: Obtain compact Markdown of pages with preserved tables/images for data extraction and analysis.

Tags

HTML, Markdown, MCP, Browser, Playwright, trafilatura, BeautifulSoup4, JS-rendered content, Authentication