WebScraping.AI

WebScraping.AI

Interact with for web data extraction and scraping

34
Stars
13
Forks
1
Releases

Overview

An MCP server implementation that integrates with WebScraping.AI to enable web data extraction capabilities for LLM-driven workflows. It provides tools for querying page content, extracting structured data, retrieving HTML with JavaScript rendering, and pulling plain text. It supports CSS selector extraction, multiple proxy types with country selection, and JavaScript rendering via headless Chrome/Chromium. The server manages concurrent requests with rate limiting, allows custom JavaScript execution on target pages, and supports device emulation (desktop, mobile, tablet). It includes account usage monitoring and an optional content sandboxing feature that wraps scraped content to prevent prompt injection and protect against untrusted content. Configuration is via environment variables (API key, concurrency limit, proxy type, timeouts, and sandboxing flag). It exposes a set of tools (Question, Fields, HTML, Text, Selected, Selected Multiple, Account) for LLMs and integrates with MCP-enabled LLM platforms. It includes error handling with retries and rate limit backoff, and is designed to work with systems like Cursor and Claude Desktop.

Details

Owner
webscraping-ai
Language
JavaScript
License
Updated
2025-12-07

Features

Question answering about web page content

Answer questions about a page's content using integrated web scraping and rendering capabilities.

Structured data extraction

Extracts structured data from web pages based on user instructions.

HTML content retrieval with JavaScript rendering

Retrieves full HTML with JavaScript executed to reflect dynamic content.

Plain text extraction

Extracts visible text content from web pages.

CSS selector-based content extraction

Targets specific content using CSS selectors for precise extraction.

Proxy support with country selection

Supports multiple proxy types (datacenter, residential) with country targeting.

JavaScript rendering via headless Chrome/Chromium

Renders on-page JavaScript to enable accurate data extraction.

Content sandboxing option

Optionally wraps scraped content in a security boundary to mitigate prompt injection.

Audience

DevelopersIntegrate WebScraping.AI tooling into MCP-enabled LLM workflows for web data extraction.
ML engineersBuild ML-enabled web scraping apps within MCP-enabled environments for scalable data collection.
Data scientistsPrototype and evaluate web data extraction pipelines using MCP tools with real-time results.

Tags

web scrapingMCP serverWebScraping.AIproxyheadless ChromeJavaScript renderingcontent sandboxingLLM integrationHTML extractionstructured data extraction