Theory
Introduction
TensorRT is an NVIDIA SDK designed to maximize the performance of deep learning inference on NVIDIA GPUs. It converts a trained model into a highly optimized runtime engine.
Key Features
- Precision Calibration: FP32, FP16, and INT8 allow for a trade-off between accuracy and performance.
- Layer Fusion: Fuses multiple layers of the network into a single kernel, reducing memory traffic and computation overhead.
- Kernel Auto-Tuning: Selects the optimal kernel for each operation by profiling the model against a library of highly tuned kernel implementations for the target GPU. This minimizes inference latency and improves throughput and memory efficiency. A minimal engine-build sketch follows this list.
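To make these features concrete, here is a minimal sketch that builds a TensorRT engine from an ONNX model with FP16 enabled; layer fusion and kernel auto-tuning happen automatically inside the build call. It assumes the TensorRT 8.x Python API and a hypothetical model file path.

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(onnx_path="model.onnx", use_fp16=True):
    """Parse an ONNX model and build a serialized TensorRT engine."""
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)

    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError("Failed to parse ONNX model")

    config = builder.create_builder_config()
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GiB
    if use_fp16 and builder.platform_has_fast_fp16:
        config.set_flag(trt.BuilderFlag.FP16)  # enable reduced precision

    # Layer fusion and kernel auto-tuning are performed inside this call.
    return builder.build_serialized_network(network, config)
```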
Triton: Open-Source Inference Server
Triton Inference Server is an open-source inference server developed by NVIDIA to deploy AI models at scale. It supports TensorFlow, PyTorch, ONNX Runtime, and TensorRT backends.
Features
- Hosts models from multiple frameworks in the same container simultaneously.
- Dynamic Batching: Groups multiple inference requests to improve throughput.
- Deploy multiple model versions for A/B testing and rollback.
- Protocol support for HTTP/gRPC, enabling easy integration with many applications (a minimal client sketch follows this list).
- Simplifies model management, improves scaling, and ensures optimal GPU resource utilization.
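To illustrate the HTTP protocol support, here is a minimal client sketch using the tritonclient Python package. The model name ("resnet50") and the tensor names ("input__0", "output__0") are hypothetical and must match the deployed model's configuration.

```python
import numpy as np
import tritonclient.http as httpclient

# Model and tensor names are hypothetical; they must match the server's
# model repository and the model's config.pbtxt.
client = httpclient.InferenceServerClient(url="localhost:8000")

data = np.random.rand(1, 3, 224, 224).astype(np.float32)
infer_input = httpclient.InferInput("input__0", list(data.shape), "FP32")
infer_input.set_data_from_numpy(data)
requested_output = httpclient.InferRequestedOutput("output__0")

result = client.infer(model_name="resnet50",
                      inputs=[infer_input],
                      outputs=[requested_output])
print(result.as_numpy("output__0").shape)
```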
Optimization of Inference Performance
Precision Calibration
- FP16: Lowers memory consumption and enhances performance with minimal accuracy loss.
- INT8: Further reduces memory and compute requirements, but requires calibration with representative data; suited to production environments with tight latency budgets (see the calibrator sketch after this list).
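INT8 calibration lets TensorRT observe activation ranges on representative inputs. Below is a minimal sketch of an entropy calibrator using the TensorRT Python API and PyCUDA; the batch iterable and cache file name are assumptions, and the calibrator would be attached to the builder config from the earlier engine-build sketch.

```python
import os
import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # noqa: F401, creates a CUDA context

class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    """Feeds representative input batches to TensorRT during INT8 calibration."""

    def __init__(self, batches, cache_file="calibration.cache"):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.batches = iter(batches)          # iterable of np.float32 arrays
        self.cache_file = cache_file
        self.device_input = None

    def get_batch_size(self):
        return 1                              # batch size used during calibration

    def get_batch(self, names):
        try:
            batch = np.ascontiguousarray(next(self.batches))
        except StopIteration:
            return None                       # no more calibration data
        if self.device_input is None:
            self.device_input = cuda.mem_alloc(batch.nbytes)
        cuda.memcpy_htod(self.device_input, batch)
        return [int(self.device_input)]

    def read_calibration_cache(self):
        if os.path.exists(self.cache_file):
            with open(self.cache_file, "rb") as f:
                return f.read()
        return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)

# Attached during engine building (see the earlier build sketch):
#   config.set_flag(trt.BuilderFlag.INT8)
#   config.int8_calibrator = EntropyCalibrator(calibration_batches)
```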
Batching Techniques
- Static Batching: The client submits fixed-size batches of requests.
- Dynamic Batching: Collects incoming requests over a short window and groups them into larger batches to improve GPU utilization (a conceptual sketch follows this list).
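The dynamic-batching idea can be sketched in a few lines of Python: requests are queued as they arrive, the batcher waits for a short window (or until the batch is full), and a single GPU call then serves the whole group. This is a conceptual illustration only; Triton implements the same idea server-side, configured per model.

```python
import queue
import time
import numpy as np

def dynamic_batcher(request_queue, run_batch, max_batch_size=8, max_wait_ms=5):
    """Collect requests for up to max_wait_ms (or until the batch is full),
    then execute them as a single batched call."""
    while True:
        batch = [request_queue.get()]            # block until the first request arrives
        deadline = time.monotonic() + max_wait_ms / 1000.0
        while len(batch) < max_batch_size:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(request_queue.get(timeout=remaining))
            except queue.Empty:
                break
        run_batch(np.stack(batch))               # one call serves the whole group
```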
Performance Metrics
- Latency: Time spent on a single inference request.
- Throughput: Number of inference requests processed per second. Both metrics are reported by the measurement sketch below.
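A simple way to make both metrics concrete is to time repeated calls to an inference function. The sketch below assumes a hypothetical infer_fn (for example, a wrapper around the Triton client call shown earlier) and reports latency percentiles plus overall throughput.

```python
import time
import numpy as np

def measure(infer_fn, request, warmup=10, iters=100):
    """Report per-request latency percentiles and overall throughput."""
    for _ in range(warmup):                       # exclude one-time startup costs
        infer_fn(request)

    latencies = []
    start = time.perf_counter()
    for _ in range(iters):
        t0 = time.perf_counter()
        infer_fn(request)
        latencies.append((time.perf_counter() - t0) * 1000.0)   # milliseconds
    total = time.perf_counter() - start

    print(f"p50 latency:  {np.percentile(latencies, 50):.2f} ms")
    print(f"p99 latency:  {np.percentile(latencies, 99):.2f} ms")
    print(f"throughput:   {iters / total:.1f} requests/s")
```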