https://together.ai
Together AI – The AI Acceleration Cloud - Fast Inference, Fine-Tuning & Training
Run and fine-tune generative AI models with easy-to-use APIs and highly scalable infrastructure. Train & deploy models at scale on our AI Acceleration Cloud and scalable GPU clusters. Optimize performance and cost.
🦙 The Llama 4 herd is here! Now available on the Together API.

The AI Acceleration Cloud
Turbocharge model training and inference on NVIDIA GPUs. Build with open source and fine-tune your own AI.

200+ generative AI models
Build with open-source and specialized multimodal models for chat, images, code, and more. Migrate from closed models with OpenAI-compatible APIs. Categories: Chat, Image, Vision, Audio, Language, Code, Embeddings, Rerank.

- Llama 4 Maverick (Chat, New) – SOTA 128-expert MoE powerhouse for multilingual image/text understanding, creative writing, and enterprise-scale applications.
- Llama 4 Scout (Chat, New) – SOTA 109B model with 17B active params and a large context window, excelling at multi-document analysis, codebase reasoning, and personalized tasks.
- DeepSeek R1 (Chat) – Open-source reasoning model rivaling OpenAI o1, excelling in math, code, reasoning, and cost efficiency.
- DeepSeek R1 Distilled Llama 70B Free (Chat, Free) – Free endpoint to experiment with the power of reasoning models. This distilled model beats GPT-4o on math and matches o1-mini on coding.
- Gemma 3 27B (Chat, New) – Lightweight model with vision-language input, multilingual support, visual reasoning, and top-tier performance per size.
- Llama 3.3 70B Instruct Turbo Free (Chat, Free) – Free endpoint to try this 70B multilingual LLM optimized for dialogue, excelling in benchmarks and surpassing many chat models.
- Qwen2.5-VL 72B Instruct (Vision, New) – Vision-language model with advanced visual reasoning, video understanding, structured outputs, and agentic capabilities.
- Cogito V1 Preview Llama 70B (Chat, New) – Best-in-class open-source LLM trained with IDA for alignment, reasoning, and self-reflective, agentic applications.
- FLUX.1 [schnell] Free (Image, Free) – Free endpoint for the SOTA open-source image generation model by Black Forest Labs.
- Llama 3.2 11B Free (Vision, Free) – Free endpoint to try Llama 3.2 11B.
- DeepSeek-V3-0324 (Chat) – DeepSeek's latest open Mixture-of-Experts model, challenging top AI models at much lower cost.
- Qwen QwQ-32B (Chat, New) – Qwen-series reasoning model excelling in complex tasks, outperforming conventional instruction-tuned models on hard problems.
- Cogito V1 Preview Qwen 32B (Chat, New) – Best-in-class open-source LLM trained with IDA for alignment, reasoning, and self-reflective, agentic applications.
- Qwen2.5 72B (Chat) – Powerful decoder-only models available in 7B and 72B variants, developed by Alibaba Cloud's Qwen team for advanced language processing.
- Cogito V1 Preview Qwen 14B (Chat, New) – Best-in-class open-source LLM trained with IDA for alignment, reasoning, and self-reflective, agentic applications.
- Cartesia Sonic-2 (Audio, New) – Low-latency, ultra-realistic voice model, served in partnership with Cartesia.
- Mistral Small 3 (Chat) – 24B model rivaling GPT-4o mini and larger models like Llama 3.3 70B. Ideal for chat use cases like customer support, translation, and summarization.
- Cogito V1 Preview Llama 8B (Chat, New) – Best-in-class open-source LLM trained with IDA for alignment, reasoning, and self-reflective, agentic applications.
- Llama 3.1 Nemotron 70B Instruct (Chat) – Customized by NVIDIA to improve the helpfulness of LLM-generated responses to user queries.
- Cogito V1 Preview Llama 3B (Chat, New) – Best-in-class open-source LLM trained with IDA for alignment, reasoning, and self-reflective, agentic applications.
- Llama 3.3 70B (Chat) – Meta's multilingual 70B pretrained and instruction-tuned generative model (text in/text out). The instruction-tuned, text-only model is optimized for multilingual dialogue and outperforms many open-source and closed chat models on common industry benchmarks.
- FLUX1.1 [pro] (Image) – Premium image generation model by Black Forest Labs.
- Llama 3.1 405B (Chat) – Part of Meta's Llama 3.1 collection of multilingual LLMs, pretrained and instruction-tuned in 8B, 70B, and 405B sizes, outperforming many open-source and closed chat models on common industry benchmarks.
- Qwen 2.5 Coder 32B Instruct (Code) – SOTA code LLM with advanced code generation, reasoning, and fixing, plus support for up to 128K tokens.
- Gemma-2 Instruct (27B) (Chat) – Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models.
- Llama 3.1 8B (Chat) – Part of Meta's Llama 3.1 collection of multilingual LLMs, pretrained and instruction-tuned in 8B, 70B, and 405B sizes, outperforming many open-source and closed chat models on common industry benchmarks.
- Qwen2-VL-72B-Instruct (Vision) – A powerful OSS vision model by Alibaba that combines advanced vision capabilities with instruction-tuned language understanding for sophisticated visual reasoning tasks.
- DeepSeek R1 Distilled Llama 70B (Chat) – Llama 70B distilled with reasoning capabilities from DeepSeek R1. Surpasses GPT-4o with 94.5% on MATH-500 and matches o1-mini on coding.
- DeepSeek R1 Distilled Qwen 14B (Chat) – Qwen 14B distilled with reasoning capabilities from DeepSeek R1. Outperforms GPT-4o in math and matches o1-mini on coding.
- DeepSeek R1 Distilled Qwen 1.5B (Chat) – Small Qwen 1.5B distilled with reasoning capabilities from DeepSeek R1. Beats GPT-4o on MATH-500 while being a fraction of the size.
- Gemma 3 1B (Chat, New) – The most lightweight Gemma 3 model (1B) with 128K context, vision-language input, and multilingual support for on-device AI.
- Gemma 3 4B (Chat, New) – Lightweight model with vision-language input, multilingual support, visual reasoning, and top-tier performance per size.
- Gemma 3 12B (Chat, New) – Lightweight model with vision-language input, multilingual support, visual reasoning, and top-tier performance per size.
- DBRX-Instruct (Chat) – A mixture-of-experts (MoE) large language model trained from scratch by Databricks, specializing in few-turn interactions.
- FLUX.1 [dev] (Image) – 12-billion-parameter rectified flow transformer capable of generating images from text descriptions.
- FLUX.1 [pro] (Image) – First-generation premium image generation model by Black Forest Labs.
- FLUX.1 Canny [dev] (Image) – 12-billion-parameter rectified flow transformer that generates an image from a text description while following the structure of a given input image.
- FLUX.1 Depth [dev] (Image) – 12-billion-parameter rectified flow transformer that generates an image from a text description while following the structure of a given input image.
- FLUX.1 Redux [dev] (Image) – Adapter for FLUX.1 models enabling image variation, refining input images, and integration into advanced restyling workflows.
- FLUX.1 [schnell] (Image) – Fastest available endpoint for the SOTA open-source image generation model by Black Forest Labs.
- Deepseek-67B (Chat) – Trained from scratch on a vast dataset of 2 trillion tokens in both English and Chinese.
- Llama 3.2 90B (Vision) – The Llama 3.2-Vision collection features multimodal LLMs (11B and 90B) optimized for visual recognition, image reasoning, captioning, and answering image-related questions.
- Llama 3.2 11B (Vision) – The Llama 3.2-Vision collection features multimodal LLMs (11B and 90B) optimized for visual recognition, image reasoning, captioning, and answering image-related questions.
- Llama 3.2 3B Instruct Turbo (Chat) – Lightweight 3B text model from the Llama 3.2 collection, instruction-tuned for multilingual dialogue.
- Gemma Instruct (2B) (Chat) – 2B instruct Gemma model by Google: lightweight, open, text-to-text LLM for QA, summarization, reasoning, and resource-efficient deployment.
- Gemma-2 Instruct (9B) (Chat) – Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models.
- Qwen2.5 7B Instruct Turbo (Chat) – Qwen2.5 is the latest series of Qwen large language models, released as base and instruction-tuned models ranging from 0.5B to 72B parameters.
- Mistral (Language) – 7.3B-parameter model that outperforms Llama 2 13B on all benchmarks and approaches CodeLlama 7B performance on code; uses grouped-query attention (GQA) for faster inference and sliding-window attention (SWA) to handle longer sequences at smaller cost.
- Stable Diffusion XL 1.0 (Image) – A text-to-image generative AI model that excels at creating 1024x1024 images.
- Mistral Instruct (Chat) – Instruct fine-tuned version of Mistral-7B-v0.1.
- Mixtral 8x7B Instruct v0.1 (Chat) – The Mixtral-8x7B large language model is a pretrained generative sparse mixture-of-experts.
- Mixtral 8x7B v0.1 (Language) – The Mixtral-8x7B large language model is a pretrained generative sparse mixture-of-experts.
- BGE-Large-EN v1.5 (Embeddings) – BAAI general embedding model (large, English, v1.5). FlagEmbedding maps any text to a low-dimensional dense vector for retrieval, classification, clustering, or semantic search; it can also back vector databases for LLMs.
- Salesforce LlamaRank (Rerank) – Salesforce Research's proprietary fine-tuned rerank model with 8K context, outperforming Cohere Rerank for superior document retrieval.
- Mistral (7B) Instruct v0.2 (Chat) – An improved instruct fine-tuned version of Mistral-7B-Instruct-v0.1.
- Mistral (7B) Instruct v0.3 (Chat) – An instruct fine-tuned version of Mistral-7B-v0.3.
- Upstage SOLAR Instruct v1 (11B) (Chat) – Built on the Llama 2 architecture, SOLAR-10.7B incorporates the innovative Upstage Depth Up-Scaling.
- WizardLM-2 (8x22B) (Chat) – Wizard's most advanced model; highly competitive with leading proprietary models and consistently outperforms existing state-of-the-art open-source models.
- Llama 3.1 70B (Chat) – Part of Meta's Llama 3.1 collection of multilingual LLMs, pretrained and instruction-tuned in 8B, 70B, and 405B sizes, outperforming many open-source and closed chat models on common industry benchmarks.
- Mixtral-8x22B Instruct v0.1 (Chat) – An instruct fine-tuned version of Mixtral-8x22B-v0.1.
- Nous Hermes 2 – Mixtral 8x7B-DPO (Chat) – The flagship Nous Research model trained over the Mixtral 8x7B MoE LLM on more than 1,000,000 entries of primarily GPT-4-generated data, plus other high-quality data from open datasets, achieving state-of-the-art performance on a variety of tasks.
- Qwen 2 (Chat) – A transformer-based decoder-only language model pretrained on a large amount of data, improving on the previously released Qwen models.
- Typhoon 2 70B Instruct (Chat) – Instruct Thai large language model with 70 billion parameters, based on Llama 3.1 70B.
- Typhoon 2 8B Instruct (Chat) – Instruct Thai large language model with 8 billion parameters, based on Llama 3.1 8B.
- Typhoon 1.5 8B Instruct (Chat) – Instruct Thai large language model with 8 billion parameters, based on Llama 3 8B.
- Llama Guard (7B) (Language) – LLM-based input-output safeguard for human-AI conversations.
- Llama Guard 2 8B (Language) – 8B Llama 3-based safeguard model for classifying LLM inputs and outputs, detecting unsafe content and policy violations.
- Llama Guard 3 11B Vision Turbo (Language) – Built on Llama 3.2, an auto-regressive model with an optimized transformer architecture, tuned with supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) for helpfulness and safety.
- Llama Guard 3 8B (Language) – 8B Llama 3.1 model fine-tuned for content safety, moderating prompts and responses in 8 languages with MLCommons alignment.
- MythoMax-L2 (Chat) – A merge of MythoLogic-L2 and Huginn using a highly experimental tensor-type merge technique; unlike MythoMix, more of Huginn is allowed to intermingle with the single tensors at the front and end of the model.
- UAE-Large v1 (Embeddings) – A universal English sentence embedding model by WhereIsAI, with an embedding dimension of 1024 and a context length of up to 512 tokens.
- Typhoon 1.5X 70B-awq (Chat) – Thai-language 70B instruct model rivaling GPT-4-0612; optimized for RAG, constrained generation, and reasoning tasks.
- M2-BERT 80M 2K Retrieval (Embeddings) – An 80M checkpoint of M2-BERT, pretrained with sequence length 2048 and fine-tuned for long-context retrieval.
- M2-BERT 80M 8K Retrieval (Embeddings) – An 80M checkpoint of M2-BERT, pretrained with sequence length 8192 and fine-tuned for long-context retrieval.
- BGE-Base-EN v1.5 (Embeddings) – BAAI general embedding model (base, English, v1.5). FlagEmbedding maps any text to a low-dimensional dense vector for retrieval, classification, clustering, or semantic search; it can also back vector databases for LLMs.
- Gryphe MythoMax L2 Lite (13B) (Chat) – A merge of MythoLogic-L2 and Huginn using a highly experimental tensor-type merge technique; unlike MythoMix, more of Huginn is allowed to intermingle with the single tensors at the front and end of the model.
- Llama 3 70B Instruct Lite (Chat) – Llama 3 is an auto-regressive language model with an optimized transformer architecture; tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.
- LLaMA-2 (Language) – Language model trained on 2 trillion tokens with double the context length of Llama 1; available in 7B, 13B, and 70B parameter sizes.
- Llama 3 70B Instruct Reference (Chat) – Tuned Llama 3 model aligned with SFT and RLHF for helpfulness and safety.
- Llama 3 70B Instruct Turbo (Chat) – Tuned Llama 3 model aligned with SFT and RLHF for helpfulness and safety.
- Llama 3 8B Instruct Lite (Chat) – Tuned Llama 3 model aligned with SFT and RLHF for helpfulness and safety.
- Llama 3 8B Instruct Reference (Chat) – Tuned Llama 3 model aligned with SFT and RLHF for helpfulness and safety.
- Llama 3 8B Instruct Turbo (Chat) – Tuned Llama 3 model aligned with SFT and RLHF for helpfulness and safety.
- LLaMA-2 Chat (13B) (Chat) – Llama 2-chat leverages publicly available instruction datasets and over 1 million human annotations; available in 7B, 13B, and 70B parameter sizes.
- LLaMA-2 Chat (7B) (Chat) – Llama 2-chat leverages publicly available instruction datasets and over 1 million human annotations; available in 7B, 13B, and 70B parameter sizes.
- FLUX.1 Schnell [fixedres] (Image) – FLUX.1 [schnell] is a 12-billion-parameter rectified flow transformer that generates images from text descriptions.
- M2-BERT 80M 32K Retrieval (Embeddings) – An 80M checkpoint of M2-BERT, pretrained with sequence length 32768 and fine-tuned for long-context retrieval.
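Every model in the library is served behind the same HTTP API. As a minimal sketch of querying one of the embeddings models listed above (the endpoint path and payload follow the OpenAI convention; the base URL and exact model id are assumptions to verify against the Together docs):

```python
# Sketch: call the OpenAI-style /v1/embeddings endpoint with the stdlib only,
# then compare two embeddings with cosine similarity.
import json
import math
import os
import urllib.request

API_URL = "https://api.together.xyz/v1/embeddings"  # assumed endpoint URL


def build_request(model: str, texts: list[str], api_key: str) -> urllib.request.Request:
    """Build the POST request for an embeddings call."""
    payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


if __name__ == "__main__" and os.environ.get("TOGETHER_API_KEY"):
    req = build_request(
        "BAAI/bge-large-en-v1.5",  # example model id; verify in the model library
        ["How do I reset my password?", "Password reset instructions"],
        os.environ["TOGETHER_API_KEY"],
    )
    with urllib.request.urlopen(req) as resp:
        vectors = [d["embedding"] for d in json.load(resp)["data"]]
    print(round(cosine(vectors[0], vectors[1]), 3))
```

The same request shape works for any of the embeddings models above; only the model id changes.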
End-to-end platform for the full generative AI lifecycle
Leverage pre-trained models, fine-tune them for your needs, or build custom models from scratch. Whatever your generative AI needs, Together AI offers a seamless continuum of AI compute solutions to support your entire journey.

Inference – the fastest way to launch AI models:
✔ Serverless or dedicated endpoints
✔ Deploy in enterprise VPC
✔ SOC 2 and HIPAA compliant

Fine-Tuning – tailored customization for your tasks (full fine-tuning and LoRA fine-tuning):
✔ Complete model ownership
✔ Fully tune or adapt models
✔ Easy-to-use APIs

GPU Clusters – full control for massive AI workloads:
✔ Accelerate large model training
✔ GB200, H200, and H100 GPUs
✔ Pricing from $1.75 / hour
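Because the serverless endpoints speak the OpenAI-compatible chat protocol, a chat call needs nothing beyond the standard library. A minimal sketch (the base URL and model id are assumptions; check the Together docs for current values):

```python
# Sketch: OpenAI-style chat completion against Together's serverless API,
# stdlib only. The network call is guarded so the helpers can be reused.
import json
import os
import urllib.request

BASE_URL = "https://api.together.xyz/v1"  # assumed OpenAI-compatible base URL


def chat_request(model: str, messages: list[dict], api_key: str) -> urllib.request.Request:
    """Build the POST request for a /chat/completions call."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def extract_reply(response: dict) -> str:
    """Pull the assistant message out of an OpenAI-style response dict."""
    return response["choices"][0]["message"]["content"]


if __name__ == "__main__" and os.environ.get("TOGETHER_API_KEY"):
    req = chat_request(
        "meta-llama/Llama-3.3-70B-Instruct-Turbo",  # example id; verify in the model library
        [{"role": "user", "content": "Say hello in one word."}],
        os.environ["TOGETHER_API_KEY"],
    )
    with urllib.request.urlopen(req) as resp:
        print(extract_reply(json.load(resp)))
```

Since the request and response shapes follow the OpenAI convention, migrating an existing OpenAI SDK client is typically just a matter of pointing its base_url at the Together endpoint.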
Speed, cost, and accuracy. Pick all three.
Speed relative to vLLM: 4x faster. Llama-3 8B at full precision: 400 tokens/sec. Cost relative to GPT-4o: 11x lower.

Why Together Inference
Powered by the Together Inference Engine, combining research-driven innovation with deployment flexibility.

Accelerated by cutting-edge research:
- Transformer-optimized kernels: our researchers' custom FP8 inference kernels are 75%+ faster than base PyTorch.
- Quality-preserving quantization: accelerating inference while maintaining accuracy with advances such as QTIP.
- Speculative decoding: faster throughput, powered by novel algorithms and draft models trained on the RedPajama dataset.

Flexibility to choose a model that fits your needs:
- Turbo: best performance without losing accuracy.
- Reference: full precision, available for 100% accuracy.
- Lite: optimized for fast performance at the lowest cost.

Available via dedicated instances and a serverless API:
- Dedicated instances: fast, consistent performance, without rate limits, on your own single-tenant NVIDIA GPUs.
- Serverless API: quickly switch from closed LLMs to models like Llama, using our OpenAI-compatible APIs.

Control your IP. Own your AI.
Fine-tune open-source models like Llama on your data and run them on Together Cloud or in a hyperscaler VPC. With no vendor lock-in, your AI remains fully under your control.
Fine-tuning example – start simple with a single command, then go deep, controlling hyperparameters like learning rate, batch size, and epochs to optimize model quality:

  $ together files upload acme_corp_customer_support.jsonl
  {
    "filename": "acme_corp_customer_support.json",
    "id": "file-aab9997e-bca8-4b7e-a720-e820e682a10a",
    "object": "file"
  }

  $ together finetune create --training-file file-aab9997-bca8-4b7e-a720-e820e682a10a \
      --model togethercomputer/RedPajama-INCITE-7B-Chat

  $ together finetune create --training-file $FILE_ID --model $MODEL_NAME \
      --wandb-api-key $WANDB_API_KEY --n-epochs 10 --n-checkpoints 5 \
      --batch-size 8 --learning-rate 0.0003
  {
    "training_file": "file-aab9997-bca8-4b7e-a720-e820e682a10a",
    "model_output_name": "username/togethercomputer/llama-2-13b-chat",
    "model_output_path": "s3://together/finetune/63e2b89da6382c4d75d5ef22/username/togethercomputer/llama-2-13b-chat",
    "Suffix": "Llama-2-13b 1",
    "model": "togethercomputer/llama-2-13b-chat",
    "n_epochs": 4,
    "batch_size": 128,
    "learning_rate": 1e-06,
    "checkpoint_steps": 2,
    "created_at": 1687982945,
    "updated_at": 1687982945,
    "status": "pending",
    "id": "ft-5bf8990b-841d-4d63-a8a3-5248d73e045f",
    "epochs_completed": 3,
    "events": [
      {
        "object": "fine-tune-event",
        "created_at": 1687982945,
        "message": "Fine tune request created",
        "type": "JOB_PENDING"
      }
    ],
    "queue_depth": 0,
    "wandb_project_name": "Llama-2-13b Fine-tuned 1"
  }

Forge the AI frontier. Train on expert-built GPU clusters.
Built by AI researchers for AI innovators, Together GPU Clusters are powered by NVIDIA GB200, H200, and H100 GPUs, along with the Together Kernel Collection – delivering up to 24% faster training operations.

- Top-tier NVIDIA GPUs: NVIDIA's latest GPUs, like GB200, H200, and H100, for peak AI performance, supporting both training and inference.
- Accelerated software stack: the Together Kernel Collection includes custom CUDA kernels, reducing training times and costs with superior throughput.
- High-speed interconnects: InfiniBand and NVLink ensure fast communication between GPUs, eliminating bottlenecks and enabling rapid processing of large datasets.
- Highly scalable and reliable: deploy 16 to 1000+ GPUs across global locations, with a 99.9% uptime SLA.
- Expert AI advisory services: Together AI's expert team offers consulting for custom model development and scalable training best practices.
- Robust management tools: Slurm and Kubernetes orchestrate dynamic AI workloads, optimizing training and inference seamlessly.

Together GPU Clusters: training-ready clusters – Blackwell and Hopper.

The AI Acceleration Cloud: built on leading AI research.
Our research team is behind breakthrough AI models, datasets, and optimizations.

- Cocktail SGD: addresses a key hindrance to training generative AI models in a distributed environment, networking overhead, with a set of optimizations that reduces network overhead by up to 117x.
- FlashAttention-3: achieves up to 75% GPU utilization on H100s, making AI models up to 2x faster and enabling efficient processing of longer text inputs. It allows faster training and inference of LLMs and supports lower-precision operations for improved efficiency.
- RedPajama: enables leading generative AI models to be fully open-source. The RedPajama models have been downloaded millions of times, and the RedPajama dataset has been used to create over 500 leading models.
- Sub-quadratic model architectures: in close collaboration with Hazy Research, we're working on the next core architecture for generative AI models, providing even faster performance with longer context. Our published research in this area includes StripedHyena, Monarch Mixer, and FlashConv.

Customer stories
See how we support leading teams around the world. Our customers are creating innovative generative AI applications, faster.
- Pika creates next-gen text-to-video models on Together GPU Clusters.
- Nexusflow uses Together GPU Clusters to build cybersecurity models.
- Arcee builds domain-adaptive language models with Together Custom Models.

Testing conducted by Together AI in November 2023 using Llama-2-70B running on Together Inference, TGI, vLLM, Anyscale, Perplexity, and OpenAI. MosaicML comparison based on published numbers in the MosaicML blog.
Detailed results and methodology published here.
Testing conducted by Together AI in November 2023 using Llama-2-70B running on Together Inference. Detailed results and methodology published here.
Based on published pricing November 8th, 2023, comparing OpenAI GPT-3.5-Turbo to Llama-2-13B on Together Inference using Serverless Endpoints. Assumes an equal number of input and output tokens.
Compared to a standard attention implementation in PyTorch, FlashAttention-2 can be up to 9x faster. Source.
Testing methodology and results published in this research paper.
Based on published pricing November 8th, 2023, comparing AWS Capacity Blocks and AWS p5.48xlarge instances to Together GPU Clusters configured with an equal number of H100 SXM5 GPUs on our 3200 Gbps InfiniBand networking configuration.

© 2025 Together AI, San Francisco, CA 94114