Patronus AI

Test, evaluate, and optimize AI agents and RAG apps

13 Stars · 4 Forks · 0 Releases

Overview

An MCP server implementation for the Patronus SDK, providing a standardized interface for running LLM system evaluations, experiments, and optimizations. The server is initialized with an API key and project settings, supports single evaluations with configurable evaluators, and runs batch evaluations that apply multiple evaluators to the same task. It also runs experiments over datasets, combining remote evaluators, custom evaluators, and adapters into flexible evaluation pipelines.

The API surface covers the initialize, evaluate, batch_evaluate, and run_experiment workflows, along with utilities to list evaluator information and create evaluation criteria. The design emphasizes modular evaluators, configurability, and interactive testing, so developers can test and compare model outputs against criteria and receive structured results with metadata.

The server is geared toward accelerating testing, evaluation, and optimization workflows for AI agents and LLM-driven applications, including retrieval-augmented generation (RAG) contexts. Developers can extend it by adding new features with request models and tool endpoints, and by writing tests to ensure the reliability and reproducibility of evaluation experiments.
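
Below is a minimal sketch of driving the server from the official MCP Python client over stdio. The launch command and module path (patronus_mcp.server) are assumptions; adjust them to the repository's actual entry point.

import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    # Assumed launch command; replace with the server's documented invocation.
    server = StdioServerParameters(command="python", args=["-m", "patronus_mcp.server"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()  # MCP protocol handshake (not the Patronus initialize tool)
            tools = await session.list_tools()
            # Expect tools such as initialize, evaluate, batch_evaluate,
            # run_experiment, list_evaluator_info, and create_criteria.
            print([tool.name for tool in tools.tools])

asyncio.run(main())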

Details

Owner
patronus-ai
Language
Python
License
Apache License 2.0
Updated
2025-12-07

Features

Initialize Patronus with API key and project settings

Set up the Patronus MCP server by providing an API key and project configuration to enable subsequent evaluation and experiment workflows.
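
A minimal sketch of calling the initialize tool through an open ClientSession (see the Overview sketch). The payload field names (api_key, project_name, app) and the PATRONUS_API_KEY environment variable are illustrative assumptions, not the server's documented schema.

import os

from mcp import ClientSession

async def init_patronus(session: ClientSession) -> None:
    # Illustrative request shape; consult the server's request models for exact fields.
    await session.call_tool("initialize", {
        "request": {
            "api_key": os.environ["PATRONUS_API_KEY"],  # assumed environment variable
            "project_name": "mcp-demo",
            "app": "agent-under-test",
        }
    })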

Run single evaluations with configurable evaluators

Execute individual evaluations using configurable evaluators (e.g., RemoteEvaluatorConfig) to assess a model output against a task.
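
A sketch of a single evaluation call, reusing an open ClientSession. The evaluator name, criterion string, and payload keys are illustrative stand-ins for the server's RemoteEvaluatorConfig-based request model.

from mcp import ClientSession

async def evaluate_once(session: ClientSession) -> None:
    result = await session.call_tool("evaluate", {
        "request": {
            # Hypothetical remote evaluator and criterion names.
            "evaluator": {"name": "lynx", "criteria": "patronus:hallucination"},
            "task_input": "What is the capital of France?",
            "task_output": "Paris is the capital of France.",
            "task_context": ["France's capital city is Paris."],
        }
    })
    print(result.content)  # structured pass/fail result plus score metadata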

Run batch evaluations with multiple evaluators

Perform batch evaluations across multiple evaluators to compare results on a single task and gather aggregated insights.
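
A sketch of a batch evaluation that runs several evaluators against one task; the evaluator and criterion names are again illustrative.

from mcp import ClientSession

async def batch_evaluate(session: ClientSession) -> None:
    result = await session.call_tool("batch_evaluate", {
        "request": {
            "evaluators": [
                {"name": "lynx", "criteria": "patronus:hallucination"},
                {"name": "judge", "criteria": "patronus:is-concise"},
            ],
            "task_input": "Summarize the support ticket.",
            "task_output": "Customer reports a login failure after the 2.3 update.",
        }
    })
    print(result.content)  # one structured result per evaluator for side-by-side comparison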

Run experiments with datasets

Run experiments with datasets, supporting asynchronous operations, custom evaluators, and adapters for flexible evaluation pipelines.
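
A sketch of the run_experiment workflow over a small inline dataset; the dataset row shape, option names, and criterion are assumptions for illustration only.

from mcp import ClientSession

async def run_dataset_experiment(session: ClientSession) -> None:
    dataset = [
        {"task_input": "Define RAG.", "gold_answer": "Retrieval-augmented generation."},
        {"task_input": "Define MCP.", "gold_answer": "Model Context Protocol."},
    ]
    result = await session.call_tool("run_experiment", {
        "request": {
            "project_name": "mcp-demo",
            "experiment_name": "baseline-eval",
            "dataset": dataset,
            "evaluators": [{"name": "judge", "criteria": "patronus:fuzzy-match"}],  # illustrative
        }
    })
    print(result.content)  # experiment identifier and aggregated scores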

Audience

Developers: Integrate the MCP server to run LLM evaluations and experiments programmatically.
Researchers: Design and assess evaluators and criteria for model optimization experiments.
Engineers: Build automated evaluation pipelines and dashboards for iterating on prompts.