NVIDIA has unveiled TensorRT-LLM, a software library aimed at enhancing the performance of AI inference processing. TensorRT-LLM is an open-source library that runs on NVIDIA Tensor Core GPUs. It focuses on inference, the stage at which an already-trained model is put to work making predictions and generating output. The library facilitates the definition, optimization, and execution of Large Language Models (LLMs), which form the foundation of generative AI applications such as ChatGPT.
TensorRT-LLM enables faster inference on NVIDIA GPUs by leveraging various optimizations. It includes the TensorRT deep learning compiler, optimized kernels, pre- and post-processing capabilities, multi-GPU and multi-node communication, and an open-source Python application programming interface. A significant advantage of TensorRT-LLM is that developers can use it without deep knowledge of C++ or NVIDIA CUDA.
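As a rough illustration of the pipeline stages the library bundles together, the sketch below strings pre-processing, engine execution, and post-processing into one call chain. It is plain Python with placeholder names, not the TensorRT-LLM API: the tiny vocabulary, the `run_engine` stand-in, and the function names are all hypothetical, chosen only to show where each stage sits.

```python
# Conceptual sketch of an LLM inference pipeline (NOT the TensorRT-LLM API).
# All names and the toy vocabulary below are illustrative placeholders.

VOCAB = {"hello": 0, "world": 1, "<eos>": 2}
INV = {v: k for k, v in VOCAB.items()}

def preprocess(text):
    # pre-processing: tokenize the prompt into integer ids
    return [VOCAB[w] for w in text.split()]

def run_engine(ids):
    # stand-in for the compiled, GPU-optimized engine:
    # here it just echoes the input and appends an end-of-sequence token
    return ids + [VOCAB["<eos>"]]

def postprocess(ids):
    # post-processing: drop the end-of-sequence marker, map ids back to text
    return " ".join(INV[i] for i in ids if i != VOCAB["<eos>"])

out = postprocess(run_engine(preprocess("hello world")))
```

In the real library, the middle stage is a TensorRT-compiled model rather than a pass-through; the point is only that tokenization and detokenization are handled alongside execution rather than left to the user.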
With TensorRT-LLM, developers can build versions of popular LLMs such as Meta Llama 2, OpenAI GPT-2, GPT-3, Falcon, Mosaic MPT, and BLOOM, among others. By using TensorRT-LLM, LLMs designed for tasks like article summarization can see significant performance improvements on NVIDIA GPUs. For GPT-J 6B inference, for instance, an H100 GPU alone delivered a four-fold speedup over the previous-generation A100; with TensorRT-LLM added, the advantage over the A100 grew to eight-fold.
TensorRT-LLM employs tensor parallelism, a technique that distributes different weight matrices across devices, allowing inference to be performed in parallel across multiple GPUs and servers simultaneously. Additionally, in-flight batching enhances efficiency by evicting finished sequences from a batch as soon as they complete and immediately admitting new requests in their place, rather than waiting for the entire batch to finish before starting the next one. These optimizations improve GPU utilization and reduce the total cost of ownership associated with LLMs.
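Both techniques can be sketched in a few lines of plain Python. This is a minimal illustration under simplifying assumptions, not TensorRT-LLM code: the "devices" are just separate variables, sequence lengths stand in for token-by-token generation, and all function names are hypothetical. The first part shards a weight matrix's columns across two simulated GPUs and checks that gathering the partial products reproduces the single-device result; the second compares how many steps a static batch and an in-flight batch need to serve the same requests.

```python
def matmul(A, B):
    # naive matrix multiply: (m x k) @ (k x n)
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

# --- Tensor parallelism: column-shard one weight matrix across 2 "GPUs" ---
x = [[1, 2, 3, 4], [5, 6, 7, 8]]                    # batch of activations
W = [[1, 0, 2, 1], [0, 1, 1, 2], [3, 1, 0, 1], [2, 2, 1, 0]]

W0 = [row[:2] for row in W]     # columns held by device 0
W1 = [row[2:] for row in W]     # columns held by device 1
y0 = matmul(x, W0)              # partial result on device 0
y1 = matmul(x, W1)              # partial result on device 1
y = [r0 + r1 for r0, r1 in zip(y0, y1)]  # gather shards into the full output
assert y == matmul(x, W)        # identical to the single-device computation

# --- In-flight batching vs. static batching, counted in generation steps ---
from collections import deque

def static_batching(lengths, batch_size):
    # the whole batch occupies the GPU until its longest sequence finishes
    steps, queue = 0, deque(lengths)
    while queue:
        batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
        steps += max(batch)
    return steps

def in_flight_batching(lengths, batch_size):
    # finished sequences leave each step; queued requests slot in immediately
    steps, queue, active = 0, deque(lengths), []
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        active = [n - 1 for n in active if n - 1 > 0]
    return steps

lengths = [8, 1, 1, 1]                    # generation lengths of 4 requests
assert static_batching(lengths, 2) == 9   # [8,1] costs 8 steps, then [1,1] costs 1
assert in_flight_batching(lengths, 2) == 8  # short requests ride alongside the long one
```

The step counts make the intuition concrete: with static batching, a long sequence pins its whole batch to the GPU, while in-flight batching backfills freed slots so the same work finishes in fewer steps, which is where the utilization gain comes from.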
TensorRT-LLM is currently available in early access to members of the NVIDIA Developer Program, with a wider release expected in the coming weeks. The library offers an accessible solution for accelerating AI inference on NVIDIA hardware, enabling developers to build powerful generative AI models more efficiently.
– NVIDIA TensorRT-LLM Press Release