Decart
"I think, therefore I am." - AI, 2032
2023-11-30
25,000 tokens of Llama2 70B per second on 8xH100 - A Brief Overview of Serving LLMs at Scale
Last week, we announced our [API](https://dashboard.decart.ai), currently the cheapest LLM inference API on the market by a margin of almost 2x - serving Llama2 70B at only $0.50 per million tokens. We provide simple-to-use interfaces that enable output streaming in Python using libraries such as requests, openai, and langchain with only a few lines of code.
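For example, streaming output through the openai client takes only a few lines. This is a minimal sketch assuming an OpenAI-compatible endpoint; the base URL and model id below are hypothetical, so check https://dashboard.decart.ai for the actual values:

```python
# A minimal streaming sketch with the openai Python client (v1); the
# base_url and model id are assumptions, not the documented values.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.decart.ai/v1",  # hypothetical endpoint
    api_key="YOUR_API_KEY",
)

stream = client.chat.completions.create(
    model="llama2-70b",  # hypothetical model id
    messages=[{"role": "user", "content": "Summarize paged attention in two sentences."}],
    stream=True,
)

# Print tokens as they arrive instead of waiting for the full completion.
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```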
2023-11-21
Decart & Cerebrium Commit to Empowering Next Million Users With LLM Applications
The Decart and Cerebrium partnership allows you to process 1 million tokens of Llama 2 70B for just $0.50 – try out the API at https://dashboard.decart.ai! Decart built a proprietary LLM inference engine from scratch in C++ and NVIDIA CUDA to outperform existing engines, and Cerebrium built a cutting-edge serverless compute platform. Key to this achievement was leveraging NVIDIA H100 Tensor Core GPUs and the CUTLASS library, as well as writing kernels custom to the H100 GPU to ensure low latency even at the unprecedented $0.50 price tag. The partnership was supported by the strong developer ecosystem NVIDIA has created, which allows multiple solutions to be fused into NVIDIA-accelerated computing applications. A proof point has already been delivered on CoreWeave GPU infrastructure.
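For a dependency-free way to try the API, a plain requests call also works. This is a minimal sketch assuming an OpenAI-compatible chat completions route; the URL, model id, and payload shape are assumptions, so consult https://dashboard.decart.ai for the real ones:

```python
# A minimal sketch with the requests library; endpoint path, model id,
# and response shape are assumed to follow the OpenAI convention.
import requests

resp = requests.post(
    "https://api.decart.ai/v1/chat/completions",  # hypothetical endpoint
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "llama2-70b",  # hypothetical model id
        "messages": [{"role": "user", "content": "Hello!"}],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```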
2023-07-11
+34% throughput in vLLM
vLLM took the AI world by storm a few weeks ago by showcasing a technique to speed up LLM serving at scale. We investigated whether we could further accelerate vLLM by profiling its performance with GPU counters. We ultimately achieved a speed-up of 1.34x and identified several interesting directions that remain open.
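As a starting point for this kind of investigation, one can time vLLM's CUDA kernels with torch.profiler to see where GPU time goes; the counter-level analysis described above would then drill into the hottest kernels with a tool such as NVIDIA Nsight Compute. A minimal sketch, with a stand-in model id and a single-GPU setup assumed:

```python
# Rank vLLM's CUDA kernels by total GPU time to find optimization
# candidates; the model id here is a hypothetical stand-in.
import torch
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")  # hypothetical model choice
params = SamplingParams(temperature=0.8, max_tokens=128)

with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CUDA],
) as prof:
    llm.generate(["The future of LLM serving is"], params)

# The hottest kernels at the top of this table are where counter-level
# profiling pays off.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```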