Business · Freemium

Together AI

A full-stack AI acceleration cloud for fast inference, fine-tuning, and scalable GPU training.
Rating: 8.4
Price: Freemium
Key Features: 6

Overview

Together AI provides an end-to-end AI Acceleration Cloud for training, fine-tuning, and deploying open-source and specialized generative models. The platform offers serverless inference APIs, token-based fine-tuning, managed GPU clusters (instant and reserved), and private AI factory solutions. Together emphasizes open-source research and no vendor lock-in, with an optimized inference stack (ATLAS + Together Inference Engine), a 200+ model library, and hardware options up to NVIDIA Blackwell/GB200. Pricing is modular and usage-based (per-token, per-megapixel, per-audio-minute, per-GPU-hour). The platform targets developers, researchers, and enterprises that need production-scale throughput, lower latency, and cost-optimized model deployment and training.
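
To make the developer workflow concrete, here is a minimal sketch of a serverless inference call. Because the APIs are OpenAI-compatible, the standard openai Python client can be pointed at Together's documented endpoint; the model name below is illustrative, so substitute any entry from the catalog.

```python
# Minimal serverless inference sketch using the OpenAI-compatible API.
# Assumptions: base_url is Together's documented endpoint, and the model
# name is one illustrative entry from the 200+ model catalog.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key="YOUR_TOGETHER_API_KEY",
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Explain what an inference engine does."}],
    max_tokens=200,
)
print(response.choices[0].message.content)
```

Because only the base URL and API key change, existing OpenAI-based code can usually be migrated with minimal edits.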

Details

Developer: together.ai
Launch Year: 2022
Free Trial: No
Updated: 2025-12-07

Features

High-performance Inference

Serverless APIs and a custom inference stack tuned for lower latency and better cost-efficiency.

Fine-tuning & Model Ownership

Token-based fine-tuning (LoRA & full) with deployment into dedicated endpoints; customers own resulting models.
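
As a rough sketch of that workflow, the snippet below uploads a JSONL dataset and starts a LoRA job with the official together Python SDK. The files.upload and fine_tuning.create calls follow Together's published examples, and the model name is illustrative; verify both against current docs.

```python
# Hedged sketch: token-based LoRA fine-tuning via the `together` SDK.
# The calls mirror Together's documented examples; the model name is
# illustrative and should be checked against the fine-tunable catalog.
from together import Together

client = Together(api_key="YOUR_TOGETHER_API_KEY")

# Upload a JSONL training set, then launch a LoRA fine-tuning job.
train_file = client.files.upload(file="train.jsonl")
job = client.fine_tuning.create(
    training_file=train_file.id,
    model="meta-llama/Meta-Llama-3.1-8B-Instruct-Reference",
    lora=True,      # LoRA adapters; full fine-tuning is also offered
    n_epochs=3,
)
print(job.id, job.status)  # poll the job id until training completes
```

The resulting weights belong to the customer and can be served from a dedicated endpoint.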

Scalable GPU Cloud & Private Clusters

Instant self-service GPUs, reserved clusters, and custom AI Factory deployments scaling to thousands of GPUs.

Large Model Library & Open-source Focus

200+ curated open-source and specialized models (text, vision, audio, code) ready to deploy with examples.

Pretraining and Research Tooling

Support for pretraining with the Together Kernel Collection and research contributions (FlashAttention, Dragonfly, RedPajama).

Enterprise-grade Security & Support

Support tiers, stated SOC 2 Type 2 compliance, and enterprise SLAs for dedicated customers.


Pricing

Serverless Inference (API)
Usage-based

State-of-the-art language and multimodal models billed by usage.

  • Price per 1M tokens
  • Batch API price
  • Access to language and multimodal models
Image Generation (Serverless)
Usage-based

Generate high-quality images billed per megapixel or step usage.

  • Price per MP
  • Images Per $1 (1MP)
  • Default steps included; extra cost if exceeded
Speech Synthesis & Processing
Usage-based

Text-to-speech and speech processing billed by character volume.

  • Price per 1M Characters
  • Multiple speech models available (e.g., Cartesia Sonic-2)
Video Generation (API)
Usage-based

Create high-quality videos billed per generated video output.

  • Price per video
  • Multiple video models and presets (720p, 1080p, audio options)
Automatic Speech Recognition (ASR)
Usage-based

Speech-to-text and translation billed per audio minute.

  • Price per audio minute
  • Batch API pricing available
  • Models such as Whisper Large v3
Vector Embeddings
Usage-based

Embeddings for semantic search and RAG, billed by tokens; a usage sketch follows this list.

  • Price per 1M tokens
  • Models: BGE-Base-EN v1.5, BGE-Large-EN v1.5, e5 variants
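
A minimal usage sketch, again through the OpenAI-compatible endpoint; the catalog identifier for BGE-Base-EN v1.5 is assumed to be BAAI/bge-base-en-v1.5 and should be verified against current docs.

```python
# Sketch: embeddings for semantic search / RAG via the same
# OpenAI-compatible endpoint used for chat. The model identifier is
# assumed from Together's catalog naming; confirm before relying on it.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key="YOUR_TOGETHER_API_KEY",
)

result = client.embeddings.create(
    model="BAAI/bge-base-en-v1.5",
    input=["what is retrieval-augmented generation?"],
)
vector = result.data[0].embedding  # one float vector per input string
print(len(vector))                 # embedding dimensionality
```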
Reranking Models
Usage-based

Improve search relevance, billed per token volume processed; see the sketch after this list.

  • Price per 1M tokens
  • Models: Mxbai Rerank Large V2, Salesforce Llama Rank V1
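
A hedged sketch of a rerank call via the together SDK; the rerank.create call and the catalog model name mirror Together's published examples but should be treated as assumptions.

```python
# Sketch: rerank candidate documents against a query. The rerank.create
# call and the model name follow Together's announced examples; verify
# both against current documentation.
from together import Together

client = Together(api_key="YOUR_TOGETHER_API_KEY")

response = client.rerank.create(
    model="Salesforce/Llama-Rank-V1",  # illustrative catalog name
    query="How do I deploy a fine-tuned model?",
    documents=[
        "Dedicated endpoints host customer-owned fine-tuned models.",
        "The GPU cloud offers instant and reserved clusters.",
        "Embeddings power semantic search and RAG pipelines.",
    ],
    top_n=2,
)
for item in response.results:          # ordered by relevance score
    print(item.index, item.relevance_score)
```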
Content Filtering & Classification
Usage-based

Safety and compliance filtering billed per token usage.

  • Price per 1M tokens
  • Models: VirtueGuard Text Lite, Llama Guard family
Custom Hardware Deployment
Usage-based

Deploy models on dedicated hardware with guaranteed performance.

  • Guaranteed performance (no sharing)
  • Support for custom models
  • Autoscaling & traffic spike handling
Fine-tuning (LoRA & Token-based)
Usage-based

Fine-tuning billed by the tokens processed across training and evaluation datasets; a worked cost example follows this list.

  • Price based on sum of tokens (training + evaluation)
  • Minimum charges apply for certain models
  • LoRA fine-tuning supported
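
A worked cost example under stated assumptions: the per-token rate below is hypothetical (actual rates vary by model), but the formula mirrors the billing rule above.

```python
# Worked fine-tuning cost example. PRICE_PER_1M_TOKENS is HYPOTHETICAL;
# the billing rule is: charge = (training tokens + evaluation tokens)
# * per-1M-token rate, subject to any per-model minimum charge.
PRICE_PER_1M_TOKENS = 3.00      # hypothetical USD rate
training_tokens = 40_000_000    # dataset tokens multiplied by epochs
evaluation_tokens = 2_000_000   # evaluation dataset tokens

billable = training_tokens + evaluation_tokens
cost = billable / 1_000_000 * PRICE_PER_1M_TOKENS
print(f"${cost:.2f} for {billable:,} billable tokens")  # $126.00
```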
VM Sandboxes (Development Environments)
Usage-based

Customizable VM sandboxes, billed per hour and per GiB of RAM.

  • Price per hour
  • Price per GiB RAM
  • Choice of Kubernetes or Slurm on Kubernetes
Code Execution API (Sessions)
Usage-based

Secure execution of LLM-generated code billed per session.

  • Price per session
  • Session duration: 60 minutes
GPU Cloud — Self-service GPUs (Instant/Reserved)
Usage-based

Ready-to-use self-service GPUs billed per GPU-hour.

  • Price per hour per GPU (usage-based)
  • Reservation options from 1 week to 3 months
  • Free network ingress and egress
GPU Cloud — Dedicated Capacity (NVIDIA H200)
Usage-based

Dedicated H200 GPUs with expert support, starting price per hour.

  • Dedicated capacity with expert support
  • NVIDIA H200 141GB HBM3e
  • Starting at $2.09 per hour (usage billed hourly)
GPU Cloud — Dedicated Capacity (NVIDIA H100)
Usage-based

Dedicated H100 GPUs with expert support, starting price per hour; a monthly cost sketch follows this list.

  • Dedicated capacity with expert support
  • NVIDIA H100 (SXM) 80GB
  • Starting at $1.75 per hour (usage billed hourly)
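
To translate the hourly rate into a budget figure, here is a quick sketch using the listed starting price; the node size and duration are illustrative assumptions, not Together quotas.

```python
# Monthly cost sketch at the listed H100 starting rate. The 8-GPU node
# and 30-day duration are assumptions for illustration only.
RATE_USD_PER_GPU_HOUR = 1.75   # starting price from the plan above
gpus = 8                       # e.g., one 8x H100 node (assumption)
hours = 24 * 30                # one month of continuous use

cost = RATE_USD_PER_GPU_HOUR * gpus * hours
print(f"${cost:,.2f} per month")  # $10,080.00 per month
```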
GPU Cloud — Dedicated Capacity (NVIDIA A100)
Usage-based

Dedicated A100 GPUs with expert support, starting price per hour.

  • Dedicated capacity with expert support
  • NVIDIA A100 (SXM/PCIe) 40/80GB variants
  • Starting at $1.30 per hour (usage billed hourly)
AI Factory / Private GPU Clusters (Custom)
Custom

Large-scale custom-built private GPU clusters; request a project plan.

  • Scales from 1K to 10K to 100K+ NVIDIA GPUs
  • High-bandwidth parallel filesystem colocated with compute
  • Custom pricing via project request

Pros & Cons

Pros

  • Production-grade, optimized inference stack with low latency and high throughput (ATLAS + Together Inference Engine).
  • Broad model library (200+ open-source and specialized models) and OpenAI-compatible APIs for easier migration.
  • Flexible, modular pricing across inference, fine-tuning, GPU cloud, and private clusters.
  • Strong open-source research contributions and enterprise-grade hardware options up to NVIDIA Blackwell.

Cons

  • No obvious public free trial; pricing is primarily usage-based, which can make costs hard to estimate for new users.
  • Enterprise hardware/pricing can require contacting sales or custom contracts for large deployments.
  • Fine-tuning minimum charges and LoRA limitations for some workflows may constrain small experiments.

Compare with Alternatives

Feature                 | Together AI                           | Vertex AI                               | Run:ai (NVIDIA Run:ai)
Pricing                 | N/A                                   | N/A                                     | N/A
Rating                  | 8.4/10                                | 8.8/10                                  | 8.4/10
Inference Throughput    | High-throughput inference             | High-scale cloud inference              | GPU utilization optimized for higher throughput
Fine-tuning Control     | Yes                                   | Yes                                     | Partial
GPU Orchestration       | Yes                                   | Partial                                 | Yes
Model Ecosystem         | Large open-source model library       | Extensive model garden and integrations | Orchestration-focused with limited model library
Research Tooling        | Yes                                   | Partial                                 | Partial
Deployment Flexibility  | Cloud and private cluster deployments | Managed Google Cloud deployments        | Kubernetes-native, flexible on-prem and cloud
Enterprise Governance   | Yes                                   | Yes                                     | Yes
MLOps Observability     | Partial                               | Yes                                     | Yes

Audience

Developers: Build and deploy models quickly using serverless APIs, the model catalog, and SDKs.
Researchers: Run experiments, pretrain/fine-tune models, and access specialized model architectures.
Enterprises: Deploy dedicated hardware, private clusters, and enterprise-grade support for production workloads.
Startups: Prototype with serverless inference and scale to reserved GPUs or custom clusters as needed.

Tags

ai, infrastructure, inference, fine-tuning, gpu-cloud, open-source, models, api