VisionAgent MCP

A simple MCP server that enables your LLM to better reason over images, video and documents.

Stars: 20
Forks: 8
Releases: 0

Overview

VisionAgent MCP Server is a lightweight sidecar MCP server that runs locally over STDIN/STDOUT. It translates each tool call from an MCP-compatible client (Claude Desktop, Cursor, Cline, etc.) into an authenticated HTTPS request to Landing AI’s VisionAgent REST APIs. The response JSON, plus any images or masks, is streamed back to the model, so you can issue natural-language computer-vision and document-analysis commands from your editor without writing custom REST code or loading an extra SDK.

v0.1 adds support for agentic-document-analysis (text extraction from PDFs and images), text-to-object-detection (bounding boxes), text-to-instance-segmentation (pixel masks), activity-recognition (video actions with timestamps), and depth-pro (monocular depth maps).

The server validates inputs with Zod schemas derived from the live OpenAPI spec, auto-generates tool definitions via the generate-tools script, reads file-based inputs, and forwards authenticated requests to VisionAgent. When IMAGE_DISPLAY_ENABLED is true, outputs are visualized and saved to OUTPUT_DIRECTORY. The server runs locally with no telemetry and is configured through a minimal JSON config; sample MCP client entries are provided.

Details

Owner: landing-ai
Language: TypeScript
License: not specified
Updated: 2025-12-07

Features

MCP tool call translation

Translates MCP tool calls from clients into authenticated HTTPS requests to VisionAgent REST APIs.
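
To make the translation step concrete, here is a minimal sketch of how a tool call might be mapped to an HTTPS request description. The base URL, the route table, and the Bearer auth scheme are illustrative assumptions, not the server's actual values; the real server derives its routes from the OpenAPI spec.

```typescript
// Sketch: turn an MCP tool call into a description of an authenticated HTTPS request.
// BASE_URL, ROUTES, and the Bearer scheme are hypothetical placeholders.

interface ToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

interface HttpRequestSpec {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

const BASE_URL = "https://api.example.com"; // assumed, not the real endpoint
const ROUTES: Record<string, string> = {
  "text-to-object-detection": "/v1/tools/text-to-object-detection",
  "depth-pro": "/v1/tools/depth-pro",
};

function toHttpRequest(call: ToolCall, apiKey: string): HttpRequestSpec {
  const route = ROUTES[call.name];
  if (route === undefined) {
    // Reject tools that have no known route instead of guessing a URL.
    throw new Error(`Unknown tool: ${call.name}`);
  }
  return {
    url: BASE_URL + route,
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`, // auth scheme is an assumption
      "Content-Type": "application/json",
    },
    body: JSON.stringify(call.arguments),
  };
}

const request = toHttpRequest(
  { name: "text-to-object-detection", arguments: { prompt: "car", image: "<base64>" } },
  "demo-key",
);
console.log(request.method, request.url);
```

Keeping the request as plain data makes the mapping easy to inspect before it is handed to an HTTP client such as Axios or fetch.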

Response streaming

Streams JSON results and any base64 media (images, masks) back to the MCP client.
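
As a rough illustration of what reaches the client, the sketch below packs a JSON result and any base64 media into MCP-style content items (text and image entries). The ApiResponse shape, including its images field, is a hypothetical stand-in for whatever the VisionAgent API actually returns.

```typescript
// Sketch: convert an API response into MCP tool-result content items.
// ApiResponse is a hypothetical shape used only for this example.

type McpContent =
  | { type: "text"; text: string }
  | { type: "image"; data: string; mimeType: string };

interface ApiResponse {
  data: unknown; // JSON payload returned by the API
  images?: { base64: string; mimeType: string }[]; // assumed field for returned media
}

function toMcpContent(response: ApiResponse): McpContent[] {
  // The JSON payload always goes back as a single text item.
  const content: McpContent[] = [
    { type: "text", text: JSON.stringify(response.data) },
  ];
  // Each image or mask is appended as its own base64 image item.
  for (const image of response.images ?? []) {
    content.push({ type: "image", data: image.base64, mimeType: image.mimeType });
  }
  return content;
}

const content = toMcpContent({
  data: { detections: [{ label: "car", score: 0.9 }] },
  images: [{ base64: "AAAA", mimeType: "image/png" }],
});
console.log(content.map((item) => item.type));
```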

OpenAPI-driven tool map

Fetches the VisionAgent OpenAPI spec and auto-generates the MCP tool map, including validation schemas.
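
The idea can be sketched as a walk over an already-fetched OpenAPI document that keeps one entry per POST operation. The types, the generateToolMap helper, and the sample paths are all hypothetical; the real generate-tools script may select and name tools differently.

```typescript
// Sketch: derive a tool map from an OpenAPI document that is already in memory.
// Type names, generateToolMap, and the sample paths are illustrative assumptions.

interface OpenApiOperation {
  operationId?: string;
  summary?: string;
  requestBody?: { content?: Record<string, { schema?: object }> };
}

interface OpenApiSpec {
  paths: Record<string, Record<string, OpenApiOperation>>;
}

interface ToolDefinition {
  name: string;
  description: string;
  route: string;
  inputSchema: object;
}

function generateToolMap(spec: OpenApiSpec): Record<string, ToolDefinition> {
  const tools: Record<string, ToolDefinition> = {};
  for (const [route, methods] of Object.entries(spec.paths)) {
    const op = methods["post"];
    // Only POST operations with an operationId become tools in this sketch.
    if (!op || !op.operationId) continue;
    const schema =
      op.requestBody?.content?.["application/json"]?.schema ?? { type: "object" };
    tools[op.operationId] = {
      name: op.operationId,
      description: op.summary ?? "",
      route,
      inputSchema: schema,
    };
  }
  return tools;
}

const sampleSpec: OpenApiSpec = {
  paths: {
    "/v1/tools/depth-pro": {
      post: {
        operationId: "depth-pro",
        summary: "Monocular depth estimation",
        requestBody: {
          content: { "application/json": { schema: { type: "object", required: ["image"] } } },
        },
      },
    },
    "/health": { get: { operationId: "health" } },
  },
};

const toolMap = generateToolMap(sampleSpec);
console.log(Object.keys(toolMap));
```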

Argument validation with Zod

Validates incoming tool arguments against Zod schemas derived from the live OpenAPI spec.
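
The server itself uses Zod for this step. To stay dependency-free, the sketch below swaps in a small hand-written check that only enforces required fields and primitive types; it shows the shape of schema-driven validation, not Zod's API. The ArgSchema type and the field names are hypothetical.

```typescript
// Sketch: schema-driven argument validation with a hand-written stand-in for Zod.
// ArgSchema, validateArgs, and the example fields are illustrative assumptions.

type FieldType = "string" | "number" | "boolean";

interface FieldSpec {
  type: FieldType;
  required: boolean;
}

type ArgSchema = Record<string, FieldSpec>;

type ValidationResult = { ok: true } | { ok: false; errors: string[] };

function validateArgs(schema: ArgSchema, args: Record<string, unknown>): ValidationResult {
  const errors: string[] = [];
  for (const [name, field] of Object.entries(schema)) {
    const value = args[name];
    if (value === undefined) {
      // Missing values are only an error when the field is required.
      if (field.required) errors.push(`${name}: required`);
      continue;
    }
    if (typeof value !== field.type) {
      errors.push(`${name}: expected ${field.type}`);
    }
  }
  return errors.length === 0 ? { ok: true } : { ok: false, errors };
}

// Hypothetical schema for a detection tool.
const detectionSchema: ArgSchema = {
  imagePath: { type: "string", required: true },
  prompt: { type: "string", required: true },
  threshold: { type: "number", required: false },
};

const good = validateArgs(detectionSchema, { imagePath: "/tmp/a.png", prompt: "car" });
const bad = validateArgs(detectionSchema, { prompt: 42 });
console.log(good, bad);
```

Rejecting malformed arguments before any network call keeps bad requests from ever reaching the remote API.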

File-based arg handling

Reads and base64-encodes file-based arguments (e.g., imagePath, pdfPath) for upload.
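
A minimal sketch of this step, assuming that any string argument whose name ends in "Path" refers to a local file and that the encoded value is sent under the name without that suffix (imagePath becomes image). Both conventions are assumptions made for illustration.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Sketch: replace file-path arguments with base64-encoded file contents.
// The "Path" suffix convention and the renaming are assumptions.
function encodeFileArgs(args: Record<string, unknown>): Record<string, unknown> {
  const encoded: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(args)) {
    if (key.endsWith("Path") && typeof value === "string") {
      // e.g. imagePath -> image, pdfPath -> pdf
      const targetKey = key.slice(0, -"Path".length);
      encoded[targetKey] = fs.readFileSync(value).toString("base64");
    } else {
      // Non-file arguments pass through unchanged.
      encoded[key] = value;
    }
  }
  return encoded;
}

// Usage: write a small temporary file and encode it.
const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), "visionagent-"));
const tmpFile = path.join(tmpDir, "sample.txt");
fs.writeFileSync(tmpFile, "hello");

const payload = encodeFileArgs({ imagePath: tmpFile, prompt: "car" });
console.log(Object.keys(payload));
```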

Output visualization & storage

Optionally post-processes outputs (masks, boxes, depth maps) and saves them to OUTPUT_DIRECTORY when IMAGE_DISPLAY_ENABLED is true.
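
The gating on the two settings can be sketched as follows. The saveOutputImage helper is hypothetical, and the sketch assumes IMAGE_DISPLAY_ENABLED is compared against the literal string "true"; it only writes the decoded bytes and does not draw boxes or masks.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Sketch: save a base64 image only when output display is enabled.
// saveOutputImage and the "true" string comparison are assumptions.
function saveOutputImage(
  base64: string,
  fileName: string,
  env: Record<string, string | undefined>,
): string | null {
  if (env["IMAGE_DISPLAY_ENABLED"] !== "true") return null;
  const dir = env["OUTPUT_DIRECTORY"];
  if (!dir) return null;
  fs.mkdirSync(dir, { recursive: true });
  // basename() keeps the file inside the output directory.
  const target = path.join(dir, path.basename(fileName));
  fs.writeFileSync(target, Buffer.from(base64, "base64"));
  return target;
}

// Usage with an explicit environment object; process.env could be passed instead.
const outputDir = path.join(fs.mkdtempSync(path.join(os.tmpdir(), "va-out-")), "results");
const saved = saveOutputImage("aGVsbG8=", "depth.png", {
  IMAGE_DISPLAY_ENABLED: "true",
  OUTPUT_DIRECTORY: outputDir,
});
const skipped = saveOutputImage("aGVsbG8=", "depth.png", {
  IMAGE_DISPLAY_ENABLED: "false",
  OUTPUT_DIRECTORY: outputDir,
});
console.log(saved, skipped);
```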

Local, private operation

Runs locally on STDIN/STDOUT with no telemetry; data is only sent to VisionAgent APIs.

Developer tooling & quick start

Includes the generate-tools script and an example MCP client configuration for easy setup.
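
For orientation, the sketch below builds a client entry in the mcpServers layout used by clients such as Claude Desktop and Cursor and prints it as JSON. The server key, the package placeholder, and the VISION_AGENT_API_KEY variable name are assumptions; only OUTPUT_DIRECTORY and IMAGE_DISPLAY_ENABLED come from the description above. Consult the project's own sample entries for the real values.

```typescript
// Sketch: generate an MCP client configuration entry as JSON.
// Package name and API-key variable name are placeholders, not confirmed values.

interface McpServerEntry {
  command: string;
  args: string[];
  env: Record<string, string>;
}

const clientConfig: { mcpServers: Record<string, McpServerEntry> } = {
  mcpServers: {
    visionagent: {
      command: "npx",
      args: ["<visionagent-mcp-package>"], // placeholder: use the project's package name
      env: {
        VISION_AGENT_API_KEY: "<your-api-key>", // assumed variable name
        OUTPUT_DIRECTORY: "/path/to/output",
        IMAGE_DISPLAY_ENABLED: "true",
      },
    },
  },
};

const clientConfigJson = JSON.stringify(clientConfig, null, 2);
console.log(clientConfigJson);
```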

Audience

MCP clients: Developers using Claude Desktop, Cursor, Cline, and other MCP clients to access vision and document-analysis tools via VisionAgent without custom REST code.
LLM engineers: LLM teams integrating image, video, and document reasoning capabilities into workflows through textual prompts.
ML/AI integrators: Integrators building local toolchains that rely on VisionAgent endpoints for vision tasks.

Tags

MCP, VisionAgent, computer-vision, image-analysis, document-analysis, object-detection, segmentation, depth-estimation, activity-recognition, OpenAPI, Zod, Axios, local-server, stdin-stdout, privacy