https://together.xyz
Together AI | The AI Native Cloud
Build what's next on the AI Native Cloud. Full-stack AI platform for inference, fine-tuning, and GPU clusters — powered by cutting-edge research.
⚡️ FlashAttention-4: up to 1.3× faster than cuDNN on NVIDIA Blackwell
Introducing Together AI's new look
🔎 ATLAS: runtime-learning accelerators delivering up to 4x faster LLM inference
⚡ Together GPU Clusters: self-service NVIDIA GPUs, now generally available
📦 Batch Inference API: process billions of tokens at 50% lower cost for most models
🪛 Fine-Tuning Platform Upgrades: larger models, longer contexts

Build what's next on the AI Native Cloud
Full-stack AI platform, powered by cutting-edge research.

The Together AI Platform
Accelerate inference, model shaping, and pre-training on a research-optimized platform.
2x faster inference, powered by cutting-edge research.
60% lower cost, with workload-specific optimization.
90% faster pre-training, with the Together Kernel Collection.

Full-stack cloud
Powering every step of the AI development journey, from experimentation to massive scale.

Serverless Inference: The fastest way to run open-source models on demand. Powered by cutting-edge inference research. No infrastructure to manage, no long-term commitments. (A minimal API sketch follows this list.)
Batch Inference: Cost-effectively process massive workloads asynchronously. Scale to 30 billion tokens per model with any serverless model or private deployment. (See the batch sketch after this list.)
Dedicated Model Inference: Deploy models on dedicated infrastructure. Purpose-built for teams who need speed, control, and the best economics in the market.
Dedicated Container Inference: GPU infrastructure purpose-built for generative media workloads. Deploy video, audio, and image models with performance acceleration powered by Together Research.
Accelerated Compute: Scale from self-serve instant clusters to thousands of GPUs, all optimized for better performance with the Together Kernel Collection.
Sandbox: Use fast, secure code sandboxes at scale to set up full-scale development environments for AI apps and agents.
Managed Storage: High-performance managed storage for AI-native workloads. Object storage and parallel filesystems optimized for AI, with zero egress fees.
Fine-Tuning: Fine-tune open-source models for production workloads, using the latest research techniques. Improve accuracy, reduce hallucinations, and control behavior, without managing training infrastructure. (A job-launch sketch follows this list.)
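For orientation, here is a minimal sketch of a Serverless Inference call through the Together Python SDK (`pip install together`). The SDK reads `TOGETHER_API_KEY` from the environment; the model ID is an assumption, and any ID from the model library can be substituted.

```python
from together import Together

client = Together()  # picks up TOGETHER_API_KEY from the environment

# One-off chat completion against a serverless endpoint. The model ID is an
# assumption for illustration; substitute any ID from the model library.
response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Summarize FlashAttention in two sentences."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```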
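Batch Inference follows an asynchronous upload-then-poll flow. The sketch below assumes an OpenAI-style batch surface; the method names (`files.upload` with a "batch-api" purpose, `batches.create_batch`) and the polling step are assumptions to verify against the Batch Inference docs.

```python
import json

from together import Together

client = Together()

# Write one self-contained chat-completion request per line, keyed by custom_id.
with open("batch_requests.jsonl", "w") as f:
    for i, prompt in enumerate(["What is speculative decoding?", "Define KV cache."]):
        request = {
            "custom_id": f"req-{i}",
            "body": {
                "model": "openai/gpt-oss-120b",  # assumed model ID
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        f.write(json.dumps(request) + "\n")

# Upload the request file, then start the asynchronous batch job.
# Both calls are assumptions patterned on the documented flow.
uploaded = client.files.upload(file="batch_requests.jsonl", purpose="batch-api")
batch = client.batches.create_batch(uploaded.id, endpoint="/v1/chat/completions")
print(batch.id)  # poll the job status and download results once it completes
```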
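And a hedged sketch of launching a Fine-Tuning job from a conversational JSONL dataset. The base model ID and hyperparameters are illustrative assumptions; supported models and options are listed in the fine-tuning docs.

```python
from together import Together

client = Together()

# Training data: one JSON object per line, e.g.
# {"messages": [{"role": "user", "content": ...}, {"role": "assistant", "content": ...}]}
train_file = client.files.upload(file="train.jsonl", purpose="fine-tune")

# Launch the job. Model ID and hyperparameters are illustrative assumptions.
job = client.fine_tuning.create(
    training_file=train_file.id,
    model="meta-llama/Meta-Llama-3.1-8B-Instruct-Reference",
    n_epochs=3,
    learning_rate=1e-5,
)
print(job.id)  # check progress with client.fine_tuning.retrieve(job.id)
```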
Grounded in cutting-edge research
Foundational systems research for production AI. (The online-softmax recurrence underlying the FlashAttention line of work is sketched after this list.)

Kernels · Key research and product announcements at the AI Native Conf
Kernels · FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling · Ted Zadouri (Princeton University, Together AI), Markus Hoehnerbach (Meta), Jay Shah (Colfax Research), Timmy Liu (NVIDIA), Vijay Thakkar (Meta, Georgia Tech), Tri Dao (Princeton University, Together AI)
Inference · Cache-aware prefill–decode disaggregation (CPD) for up to 40% faster long-context LLM serving · Jiejing Zhang, Yubo Wang, Yinghui Liu, Mourya Vangala Srinivasa, Chenxi Li, Jue Wang, Yineng Zhang, Shuaiwen Leon Song, Ce Zhang
Agents · CoderForge-Preview: SOTA open dataset for training efficient coding agents · Alpay Ariyak*, Junda Zhang, Junxiong Wang, Shang Zhu, Federico Bianchi, Sanjana Srivastava, Ashwinee Panda, Siddhant Bharti, Chenfeng Xu, John Heo, Xiaoxia Shirley Wu, James Zou, Percy Liang, Leon Song, Ce Zhang, Ben Athiwaratkun, Zhongzhu Zhou*, Qingyang Wu* (*Project Core Leads)
Agents · How speech models fail where it matters the most and what to do about it · Kaitlyn Zhou, Martijn Bartelds, Federico Bianchi, James Zou
Inference · Consistency diffusion language models: Up to 14x faster inference without sacrificing quality · Minseo Kim, Chenfeng Xu, Coleman Richard Charles Hooper, Harman Singh, Ben Athiwaratkun, Ce Zhang, Kurt Keutzer, Amir Gholami (Seoul National University, University of California, Berkeley, Together AI)
Agents · What do LLMs think when you don't tell them what to think about? · Yongchan Kwon and James Zou
Agents · DSGym: A holistic framework for evaluating and training data science agents · Fan Nie, Junlin Wang, Harper Hua, Federico Bianchi, Yongchan Kwon, Zhenting Qi, Owen Queen, Shang Zhu, James Zou
Kernels · Research POV: Yes, AGI Can Happen – A Computational Perspective · Together AI
Model Shaping · How to run TorchForge reinforcement learning pipelines in the Together AI Native Cloud · Together AI Training and Research, the PyTorch team at Meta
Model Shaping · Introducing AutoJudge: Streamlined inference acceleration via automated dataset curation · Roman Garipov, Fedor Velikonivtsev, Ivan Ermakov, Ruslan Svirschevski, Vage Egiazarian, Max Ryabinin
Agents · Large Reasoning Models Fail to Follow Instructions During Reasoning: A Benchmark Study · Yongchan Kwon, Shang Zhu, Federico Bianchi, Kaitlyn Zhou, James Zou
Inference · AdapTive-LeArning Speculator System (ATLAS): A New Paradigm in LLM Inference via Runtime-Learning Accelerators · Junxiong Wang, Shirley Wu, Zelei Shao, Vikranth Srivatsa, Jue Wang, Roy Yuan, Qingyang Wu, Alpay Ariyak, Rupert Wu, Wai Tong Chung, Chenfeng Xu, Yonatan Oren, Pragaash Ponnusamy, Yineng Zhang, Avner May, Leon Song, Tri Dao, Percy Liang, Ce Zhang, Ben Athiwaratkun
Agents · How Together AI Uses AI Agents to Automate Complex Engineering Tasks: Lessons from Developing Efficient LLM Inference Systems · Shang Zhu, Federico Bianchi, Wai Tong Chung, Zain Hasan, Rupert Wu, Ce Zhang, James Zou, Ben Athiwaratkun
Agents · Back to The Future: Evaluating AI Agents on Predicting Future Events · Federico Bianchi, Junlin Wang, Zain Hasan, Shang Zhu, Roy Yuan, Clémentine Fourrier, James Zou
Inference · DeepSWE: Training a Fully Open-sourced, State-of-the-Art Coding Agent by Scaling RL · Michael Luo*, Naman Jain*, Jaskirat Singh*, Sijun Tan*, Ameen Patel*, Qingyang Wu*, Alpay Ariyak*, Colin Cai*, Tarun Venkat, Shang Zhu, Ben Athiwaratkun, Manan Roongta, Ce Zhang, Li Erran Li, Raluca Ada Popa, Koushik Sen, Ion Stoica
Agents · From Zero to One: Building An Autonomous and Open Data Scientist Agent from Scratch · Federico Bianchi, Shang Zhu, Zain Hasan, Ben Athiwaratkun and James Zou
Inference · Model-Preserving Adaptive Rounding with YAQA · Albert Tseng, Zhaofeng Sun, and Chris De Sa
Agents · Mixture-of-Agents Alignment: Harnessing the Collective Intelligence of Open-Source LLMs to Improve Post-Training · Junlin Wang, Roy Xie, Shang Zhu, Jue Wang, Ben Athiwaratkun, Bhuwan Dhingra, Shuaiwen Leon Song, Ce Zhang, James Zou
Inference · Boosting DeepSeek-R1's Speed with Customized Speculative Decoding · Wai Tong Chung, Dan Waters, Avner May, Ben Athiwaratkun
Kernels · Chipmunk: Training-Free Acceleration of Diffusion Transformers with Dynamic Column-Sparse Deltas · Austin Silveria, Soham Govande, Dan Fu
Finetuning · Direct Preference Optimization: A Technical Deep Dive · Ivan Provilkov, Zain Hasan, Max Ryabinin
Finetuning · Continued Fine-tuning of LLMs: A Technical Deep Dive · Artem Chumachenko, Zain Hasan, Max Ryabinin
Agents · Open Deep Research · Together AI
Inference · DeepCoder: A Fully Open-Source 14B Coder at O3-mini Level · Michael Luo*, Sijun Tan*, Roy Huang*, Ameen Patel*, Alpay Ariyak*, Qingyang Wu*, Xiaoxiang Shi, Rachel Xin, Colin Cai, Maurice Weber, Ce Zhang, Li Erran Li, Raluca Ada Popa, Ion Stoica
Kernels · ThunderKittens Now Optimized for NVIDIA Blackwell GPUs · Benjamin Spector, Aaryan Singhal, Dan Fu, Chris Ré
Inference · Minions: embracing small LMs, shifting compute on-device, and cutting cloud costs in the process · Avanika Narayan*, Dan Biderman*, Sabri Eyuboglu*, Avner May, Scott Linderman, James Zou, Christopher Ré
Model Shaping · Long Context Fine-Tuning: A Technical Deep Dive · George Grigorev, Zain Hasan, Max Ryabinin
Model Shaping · Fine-Tuning LLMs for Multi-Turn Conversations: A Technical Deep Dive · Artem Chumachenko, Zain Hasan, Max Ryabinin
Inference · Even Better, Even Faster Quantized LLMs with QTIP · Albert Tseng, Qingyao Sun, David Hou, Chris De Sa
Inference · Linearizing LLMs with LoLCATs · Michael Zhang, Simran Arora, Rahul Chalamala, Alan Wu, Benjamin Spector, Aaryan Singhal, Krithik Ramesh, Christopher Ré
Applications · Multimodal Document RAG with Llama 3.2 Vision and ColQwen2 · Zain Hasan
Inference · The Mamba in the Llama: Distilling and Accelerating Hybrid Models · Junxiong Wang, Daniele Paliotta, Avner May, Alexander M. Rush, Tri Dao
Inference · Speculative decoding for high-throughput long-context inference · Jian Chen, Vashisth Tiwari, Ranajoy Sadhukhan, Yunho Jin, Zhuoming Chen, Jinyuan Shi, Ian En-Hsu Yen, Avner May, Beidi Chen
Inference · TEAL: Training-Free Activation Sparsity in Large Language Models · James Liu, Pragaash Ponnusamy, Tianle Cai, Han Guo, Yoon Kim, Ben Athiwaratkun
Kernels · FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision · Jay Shah (Colfax Research), Ganesh Bikshandi (Colfax Research), Ying Zhang (Meta), Vijay Thakkar (NVIDIA), Pradeep Ramani (NVIDIA), Tri Dao (Princeton University, Together AI)
Applications · Building a personalized code assistant with open-source LLMs using RAG Fine-tuning · Kezhen Chen, Linda He, Ben Athiwaratkun, Jue Wang, Maurice Weber, Heejin Jeong, Yonatan Oren, Michael Poli
Inference · SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices · Ruslan Svirschevski, Avner May, Zhuoming Chen, Beidi Chen, Zhihao Jia, Max Ryabinin
Agents · Together MoA — collective intelligence of open-source models pushing the frontier of LLM capabilities · Junlin Wang, Jue Wang, Ben Athiwaratkun, Ce Zhang, James Zou
Inference · Dragonfly: A large vision-language model with multi-resolution zoom · Kezhen Chen, Rahul Thapa, Rahul Chalamala, Ben Athiwaratkun, Shuaiwen Leon Song, James Zou
Kernels · ThunderKittens: A Simple Embedded DSL for AI kernels · Benjamin Spector, Aaryan Singhal, Simran Arora, Chris Re
Inference · FAQ: Building LLMs with RedPajama-v2, a 30 trillion token web dataset · Together AI
Inference · Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding · Zhuoming Chen, Avner May, Ruslan Svirschevski, Yuhsun Huang, Max Ryabinin, Zhihao Jia, Beidi Chen
Inference · BASED: Simple linear attention language models balance the recall-throughput tradeoff · Simran Arora, Sabri Eyuboglu, Michael Zhang, Aman Timalsina, Silas Alberti, Dylan Zinsley, James Zou, Atri Rudra, Christopher Ré
Inference · Evo: Long-context modeling from molecular to genome scale · Eric Nguyen, Michael Poli, Matthew Durrant, Patrick Hsu, Brian Hie
Inference · BitDelta: Your Fine-Tune May Only Be Worth One Bit · James Liu, Guangxuan Xiao, Kai Li, Jason D. Lee, Song Han, Tri Dao, Tianle Cai
Inference · Long context retrieval models with Monarch Mixer · Jon Saad-Falcon, Dan Fu, Simran Arora
Inference · Mamba-3B-SlimPJ: State-space models rivaling the best Transformer architecture · Tri Dao, Albert Gu
Inference · Paving the way to efficient architectures: StripedHyena-7B, open source models offering a glimpse into a world beyond Transformers · Together
Kernels · FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores · Dan Fu, Hermann Kumbong, Eric Nguyen, Chris Ré
Inference · RedPajama-Data-v2: An open dataset with 30 trillion tokens for training large language models · Together
Kernels · Flash-Decoding for long-context inference · Tri Dao, Daniel Haziza, Francisco Massa, Grigory Sizov
Inference · Medusa: Simple framework for accelerating LLM generation with multiple decoding heads · Tianle Cai*, Yuhong Li*, Zhengyang Geng, Hongwu Peng, Tri Dao (*Equal contribution)
Finetuning · Llama-2-7B-32K-Instruct — and fine-tuning for Llama-2 models with Together API · Together
Inference · Faster inference enables up to 5x price reduction on Together API · Together
Inference · Preparing for the era of 32K context: Early learnings and explorations · Together
Kernels · Monarch Mixer: A new model architecture for increased efficiency · Dan Fu, Simran Arora, Chris Ré
Finetuning · Fine-tuning language models over slow networks using activation compression with guarantees · Jue Wang, Binhang Yuan, Luka Rimanic, Yongjun He, Tri Dao, Beidi Chen, Christopher Re, Ce Zhang
Finetuning · Decentralized training of foundation models in heterogeneous environments · Binhang Yuan, Yongjun He, Jared Quincy Davis, Tianyi Zhang, Tri Dao, Beidi Chen, Percy Liang, Christopher Re, Ce Zhang
Kernels · FlashAttention: Fast and memory-efficient exact attention with IO-Awareness · Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré
Finetuning · CocktailSGD: Fine-tuning foundation models over 500Mbps networks · Jue Wang, Binhang Yuan, Luka Rimanic, Yongjun He, Tri Dao, Beidi Chen, Christopher Re, Ce Zhang
Inference · FlexGen: High-throughput generative inference of large language models with a single GPU · Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang
Inference · Hyena Hierarchy: Towards larger convolutional language models · Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y. Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, Christopher Ré
Kernels · FlashConv: Speeding up state space models · Dan Fu and Tri Dao
Inference · Hungry Hungry Hippos: Towards language modeling with state space models · Daniel Y. Fu, Tri Dao, Khaled K. Saab, Armin W. Thomas, Atri Rudra, Christopher Ré
Finetuning · NeurIPS 2022: Overcoming communication bottlenecks for decentralized training (2/2) · Together
Finetuning · NeurIPS 2022: Overcoming communication bottlenecks for decentralized training (1/2) · Together
Inference · HELM: benchmarking large language models on the Together Research Computer · Together
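For context on the FlashAttention entries above: that line of work computes exact attention blockwise, never materializing the full score matrix, by carrying an online-softmax recurrence across key/value blocks. A standard sketch (our notation, not taken verbatim from the papers), with running row-max m, normalizer l, and unnormalized accumulator O initialized to negative infinity, 0, and 0:

```latex
% Exact attention O = softmax(Q K^T / sqrt(d)) V, streamed over blocks (K_j, V_j):
\begin{aligned}
S_j    &= Q K_j^\top / \sqrt{d} \\
m_j    &= \max\bigl(m_{j-1},\ \operatorname{rowmax}(S_j)\bigr) \\
\ell_j &= e^{\,m_{j-1}-m_j}\,\ell_{j-1} + \operatorname{rowsum}\bigl(e^{\,S_j-m_j}\bigr) \\
O_j    &= e^{\,m_{j-1}-m_j}\,O_{j-1} + e^{\,S_j-m_j}\,V_j
\end{aligned}
\qquad \text{final output } O = O_N / \ell_N
```

Each block update only rescales previously accumulated state, which is what lets these kernels keep the working set in on-chip memory.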
AI natives build on Together AI
See how Together AI powers customers building the next generation of AI products.

How Cursor partnered with Together AI to deliver real-time, low-latency inference at scale (Inference, GPU Clusters, Research · Enterprise)
How Decagon Engineered Sub-Second Voice AI with Together AI (Inference, GPU Clusters, Research): 6x cost reduction per turn vs. gpt-5 mini; 11x faster inference

What's new at Together AI

Research · Key research and product announcements at the AI Native Conf: At AI Native Conf, Together AI announced breakthroughs across kernels, RL, and inference optimization, including FlashAttention-4, ThunderAgent, and together.compile. Research that ships to production. That's the AI Native Cloud.
Research · FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling: As GPU throughput outpaces memory bandwidth, kernels must evolve. We introduce FlashAttention-4, featuring new pipelining for maximum overlap, 2-CTA MMA modes to reduce shared-memory traffic, and a hardware-software hybrid approach to softmax exponentials.
Company · Introducing Together AI's new look: We've refreshed our visual identity, designed with Pentagram to express how Together AI connects open-source innovation, systems research, and builders to unlock new possibilities.
Inference · Introducing Dedicated Container Inference: Production-grade orchestration for custom AI models, delivering 1.4x–2.6x faster inference.

Start building on Together AI
From optimized training and model shaping to large-scale production inference.

© 2026 Together AI. All Rights Reserved.