Overview
Together AI provides an end-to-end AI Acceleration Cloud for training, fine-tuning, and deploying open-source and specialized generative models. The platform offers serverless inference APIs, token-based fine-tuning, managed GPU clusters (instant and reserved), and private AI factory solutions. Together emphasizes open-source research and no vendor lock-in, with an optimized inference stack (ATLAS + Together Inference Engine), a 200+ model library, and hardware options up to NVIDIA Blackwell/GB200. Pricing is modular and usage-based (per-token, per-megapixel, per-audio-minute, per-GPU-hour). The platform targets developers, researchers, and enterprises that need production-scale throughput, lower latency, and cost-optimized model deployment and training.
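For a sense of the developer experience, here is a minimal sketch of a serverless chat call through Together's OpenAI-compatible endpoint. It assumes the `openai` Python package, a `TOGETHER_API_KEY` environment variable, and an illustrative model id; substitute any model from the catalog.

```python
# Minimal sketch: serverless chat completion via Together's
# OpenAI-compatible API. Assumes `pip install openai` and a
# TOGETHER_API_KEY environment variable; the model id is illustrative.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.together.xyz/v1",  # Together's OpenAI-compatible endpoint
    api_key=os.environ["TOGETHER_API_KEY"],
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",  # any catalog model id
    messages=[{"role": "user", "content": "Summarize what a serverless inference API is."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

Because the surface is OpenAI-compatible, migrating an existing OpenAI integration is largely a matter of swapping the base URL, API key, and model id.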
Key Features
High-performance Inference
Serverless APIs backed by a custom inference stack (ATLAS + Together Inference Engine) deliver low-latency, cost-efficient inference.
Fine-tuning & Model Ownership
Token-based fine-tuning (LoRA & full) with deployment into dedicated endpoints; customers own resulting models.
Scalable GPU Cloud & Private Clusters
Instant self-service GPUs, reserved clusters, and custom AI Factory deployments scaling to thousands of GPUs.
Large Model Library & Open-source Focus
200+ curated open-source and specialized models (text, vision, audio, code) ready to deploy with examples.
Pretraining and Research Tooling
Support for pretraining with the Together Kernel Collection and research contributions (FlashAttention, Dragonfly, RedPajama).
Enterprise-grade Security & Support
Tiered support plans, SOC 2 Type 2 compliance, and enterprise SLAs for dedicated customers.
Who Can Use This Tool?
- Developers: Build and deploy models quickly using serverless APIs, model catalog, and SDKs.
- Researchers: Run experiments, pretrain/fine-tune models, and access specialized model architectures.
- Enterprises: Deploy dedicated hardware, private clusters, and enterprise-grade support for production workloads.
- Startups: Prototype with serverless inference and scale to reserved GPUs or custom clusters as needed.
Pricing Plans
Language & Multimodal Models
State-of-the-art language and multimodal models billed by usage.
- ✓Price per 1M tokens
- ✓Batch API price
- ✓Access to language and multimodal models
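To see how per-token billing adds up, here is a small sketch; the per-1M-token rates below are placeholder assumptions, not Together's published prices.

```python
# Hypothetical per-token cost estimate; the rates below are
# illustrative placeholders, not Together's published prices.
INPUT_PRICE_PER_1M = 0.18   # USD per 1M input tokens (assumed)
OUTPUT_PRICE_PER_1M = 0.59  # USD per 1M output tokens (assumed)

def monthly_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1e6) * INPUT_PRICE_PER_1M + \
           (output_tokens / 1e6) * OUTPUT_PRICE_PER_1M

# e.g., 500M input and 50M output tokens in a month
print(f"${monthly_cost(500_000_000, 50_000_000):,.2f}")  # $119.50
```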
Image Generation
Generate high-quality images, billed per megapixel and per extra step.
- ✓Price per MP
- ✓Images per $1 (1 MP)
- ✓Default steps included; extra cost if exceeded
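A hedged sketch of an image request follows, assuming Together's OpenAI-style images route; the model id, resolution, and step count are illustrative, so check the catalog for current names.

```python
# Sketch of a per-megapixel image request against the images endpoint.
# Endpoint shape follows Together's OpenAI-style API; the model id,
# size, and step count here are assumptions for illustration.
import os, requests

resp = requests.post(
    "https://api.together.xyz/v1/images/generations",
    headers={"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"},
    json={
        "model": "black-forest-labs/FLUX.1-schnell",  # illustrative model id
        "prompt": "isometric illustration of a GPU cluster",
        "width": 1024,   # 1024x1024 is ~1.05 MP, so billing is roughly 1 MP
        "height": 1024,
        "steps": 4,      # staying within default steps avoids extra cost
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["data"][0])  # contains the generated image reference
```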
Text-to-Speech
Text-to-speech and speech processing billed by character volume.
- ✓Price per 1M Characters
- ✓Multiple speech models available (e.g., Cartesia Sonic-2)
Video Generation
Create high-quality videos billed per generated video output.
- ✓Price per video
- ✓Multiple video models and presets (720p, 1080p, audio options)
Speech-to-Text
Speech-to-text and translation billed per audio minute.
- ✓Price per audio minute
- ✓Batch API pricing available
- ✓Models such as Whisper Large v3
Embeddings
Embeddings for semantic search and RAG billed by tokens.
- ✓Price per 1M tokens
- ✓Models: BGE-Base-EN v1.5, BGE-Large-EN v1.5, e5 variants
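A minimal sketch of an embeddings call, reusing the OpenAI-compatible client from the earlier example with one of the BGE models listed above (the exact model id string is an assumption to verify against the catalog).

```python
# Minimal sketch: embeddings for semantic search / RAG through the
# OpenAI-compatible endpoint. The model id is assumed to match the
# BGE-Base-EN v1.5 entry listed above.
import os
from openai import OpenAI

client = OpenAI(base_url="https://api.together.xyz/v1",
                api_key=os.environ["TOGETHER_API_KEY"])

out = client.embeddings.create(
    model="BAAI/bge-base-en-v1.5",  # assumed catalog id for BGE-Base-EN v1.5
    input=["vector databases store embeddings", "GPUs accelerate training"],
)
print(len(out.data), "vectors of dim", len(out.data[0].embedding))
```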
Reranking
Improve search relevance, billed per token volume processed.
- ✓Price per 1M tokens
- ✓Models: Mxbai Rerank Large V2, Salesforce Llama Rank V1
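Reranking follows a similar pattern; the sketch below assumes a `/v1/rerank` route and request fields modeled on common rerank APIs, with the Salesforce model id from the list above.

```python
# Hedged sketch of a rerank call; the /v1/rerank route, request fields,
# and model id are assumptions based on the models listed above.
import os, requests

resp = requests.post(
    "https://api.together.xyz/v1/rerank",
    headers={"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"},
    json={
        "model": "Salesforce/Llama-Rank-V1",  # assumed model id
        "query": "how is fine-tuning billed?",
        "documents": [
            "Fine-tuning is billed by tokens in training and eval datasets.",
            "Dedicated endpoints guarantee performance with no sharing.",
        ],
        "top_n": 1,
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # ranked document indices with relevance scores
```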
Safety Models
Safety and compliance filtering billed per token usage.
- ✓Price per 1M tokens
- ✓Models: VirtueGuard Text Lite, Llama Guard family
Dedicated Endpoints
Deploy models on dedicated hardware with guaranteed performance.
- ✓Guaranteed performance (no sharing)
- ✓Support for custom models
- ✓Autoscaling & traffic spike handling
Fine-Tuning
Fine-tuning billed by tokens processed in training and evaluation datasets.
- ✓Price based on sum of tokens (training + evaluation)
- ✓Minimum charges apply for certain models
- ✓LoRA fine-tuning supported
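As a sketch of how a LoRA job might be launched with the `together` Python SDK; the upload helper, parameter names, and base model id here are assumptions to verify against the SDK docs for your version.

```python
# Sketch of launching a LoRA fine-tuning job with the `together` SDK
# (pip install together). Parameter names and the base model id are
# assumptions for illustration, not a definitive API reference.
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

train_file = client.files.upload(file="train.jsonl")  # token count drives cost

job = client.fine_tuning.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # illustrative base model
    training_file=train_file.id,
    lora=True,       # LoRA instead of full fine-tuning
    n_epochs=3,
)
print(job.id)  # poll this id; billing = price x (training + evaluation tokens)
```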
VM Sandboxes
Customizable VM sandboxes billed by the hour and by RAM usage.
- ✓Price per hour
- ✓Price per GiB RAM
- ✓Choice of Kubernetes or Slurm on Kubernetes
Code Execution
Secure execution of LLM-generated code billed per session.
- ✓Price per session
- ✓Session duration: 60 minutes
Instant GPU Clusters
Ready-to-use self-service GPUs billed per GPU-hour.
- ✓Price per hour per GPU (usage-based)
- ✓Reservation options from 1 week to 3 months
- ✓Free network ingress and egress
H200 GPU Clusters
Dedicated H200 GPUs with expert support, billed from a starting hourly price.
- ✓Dedicated capacity with expert support
- ✓NVIDIA H200 141GB HBM3e
- ✓Starting at $2.09 per hour (usage billed hourly)
H100 GPU Clusters
Dedicated H100 GPUs with expert support, billed from a starting hourly price.
- ✓Dedicated capacity with expert support
- ✓NVIDIA H100 (SXM) 80GB
- ✓Starting at $1.75 per hour (usage billed hourly)
A100 GPU Clusters
Dedicated A100 GPUs with expert support, billed from a starting hourly price.
- ✓Dedicated capacity with expert support
- ✓NVIDIA A100 (SXM/PCIe) 40/80GB variants
- ✓Starting at $1.30 per hour (usage billed hourly)
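The quoted starting prices make GPU-hour budgeting straightforward; here is a back-of-envelope sketch (actual rates vary by configuration and commitment terms).

```python
# Back-of-envelope GPU-hour cost from the starting prices quoted above;
# actual rates depend on configuration and commitment terms.
STARTING_PRICE = {"H200": 2.09, "H100": 1.75, "A100": 1.30}  # USD per GPU-hour

def cluster_cost(gpu: str, num_gpus: int, hours: float) -> float:
    return STARTING_PRICE[gpu] * num_gpus * hours

# e.g., 8x H100 for a 72-hour training run
print(f"${cluster_cost('H100', 8, 72):,.2f}")  # 1.75 * 8 * 72 = $1,008.00
```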
AI Factory Private Clusters
Large-scale, custom-built private GPU clusters; request a project plan.
- ✓Scale from 1K to 10K to 100K+ NVIDIA GPUs
- ✓High-bandwidth parallel filesystem colocated with compute
- ✓Custom pricing via project request
Pros & Cons
✓ Pros
- ✓Production-grade, optimized inference stack with low latency and high throughput (ATLAS + Together Inference Engine).
- ✓Broad model library (200+ open-source and specialized models) and OpenAI-compatible APIs for easier migration.
- ✓Flexible, modular pricing across inference, fine-tuning, GPU cloud, and private clusters.
- ✓Strong open-source research contributions and enterprise-grade hardware options up to NVIDIA Blackwell.
✗ Cons
- ✗No obvious public free trial; pricing is primarily usage-based, which can make costs hard to estimate for new users.
- ✗Enterprise hardware/pricing can require contacting sales or custom contracts for large deployments.
- ✗Fine-tuning minimum charges and LoRA limitations for some workflows may constrain small experiments.
Compare with Alternatives
| Feature | Together AI | Vertex AI | Run:ai (NVIDIA Run:ai) |
|---|---|---|---|
| Pricing | N/A | N/A | N/A |
| Rating | 8.4/10 | 8.8/10 | 8.4/10 |
| Inference Throughput | High throughput inference | High-scale cloud inference | GPU utilization optimized for higher throughput |
| Fine-tuning Control | Yes | Yes | Partial |
| GPU Orchestration | Yes | Partial | Yes |
| Model Ecosystem | Large open-source model library | Extensive model garden and integrations | Orchestration-focused with limited model library |
| Research Tooling | Yes | Partial | Partial |
| Deployment Flexibility | Cloud and private cluster deployments | Managed Google Cloud deployments | Kubernetes-native flexible on‑prem and cloud |
| Enterprise Governance | Yes | Yes | Yes |
| MLOps Observability | Partial | Yes | Yes |