Theory

Introduction

TensorRT is an NVIDIA SDK designed to maximize the performance of deep learning inference on NVIDIA GPUs. It converts a trained model into a highly optimized runtime engine.

Key Features

  • Precision Calibration: Supports FP32, FP16, and INT8 precision, allowing a trade-off between accuracy and performance (see the build sketch after this list).
  • Layer Fusion: Fuses multiple layers of the neural net into one unit, minimizing memory and computation overhead.
  • Kernel Auto-Tuning: Selects the best-performing kernel for each operation on the target GPU from a repository of highly tuned kernels, minimizing inference latency and improving throughput and memory efficiency.
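
As a concrete illustration, here is a minimal sketch of building a TensorRT engine from an ONNX file with FP16 enabled, using the TensorRT Python API. The file names are placeholders, and exact flags and calls vary between TensorRT versions.

```python
import tensorrt as trt

# Build a TensorRT engine from an ONNX model with FP16 enabled.
# "model.onnx" and "model.plan" are placeholder file names.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse the ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # allow reduced-precision kernels

# Layer fusion and kernel auto-tuning happen inside this build call.
serialized_engine = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(serialized_engine)
```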

Triton: Open-Source Inference Server

Triton is an open-source inference server developed by NVIDIA for deploying AI models in production. It supports models from TensorFlow, PyTorch, ONNX, and TensorRT.

Features

  • Hosts models from multiple frameworks in the same container simultaneously.
  • Dynamic Batching: Groups multiple inference requests to improve throughput.
  • Deploy multiple model versions for A/B testing and rollback.
  • Protocol support for HTTP/gRPC, enabling easy integration with many applications (a client sketch follows this list).
  • Simplifies model management, improves scaling, and makes better use of GPU resources.
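
To make the HTTP/gRPC point concrete, here is a minimal sketch of querying a running Triton server with the tritonclient Python package. The model name, tensor names, and shapes are hypothetical and would have to match the model's config.pbtxt.

```python
import numpy as np
import tritonclient.http as httpclient

# Assumes a Triton server is running locally on its default HTTP port (8000).
client = httpclient.InferenceServerClient(url="localhost:8000")

# "my_model", "input__0", and "output__0" are placeholder names; use the
# names and shapes declared in the model's config.pbtxt.
inp = httpclient.InferInput("input__0", [1, 3, 224, 224], "FP32")
inp.set_data_from_numpy(np.random.rand(1, 3, 224, 224).astype(np.float32))

result = client.infer(model_name="my_model", inputs=[inp])
print(result.as_numpy("output__0").shape)
```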

Optimization of Inference Performance

Precision Calibration

  • FP16: Lowers memory consumption and enhances performance with minimal accuracy loss.
  • INT8: Further reduces memory and compute requirements, making it suitable for production environments with strict latency budgets; it typically requires calibration with representative data to preserve accuracy (see the calibrator sketch below).
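
INT8 mode needs a calibrator that feeds representative data through the network so TensorRT can choose quantization ranges. Below is a skeleton of such a calibrator using the TensorRT Python API and PyCUDA; the class name and data handling are illustrative only.

```python
import numpy as np
import pycuda.autoinit  # noqa: F401  (initializes a CUDA context)
import pycuda.driver as cuda
import tensorrt as trt

# Hypothetical calibrator that streams a small, representative dataset to TensorRT.
class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, samples, batch_size=8):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.samples = samples            # NumPy array of calibration inputs
        self.batch_size = batch_size
        self.index = 0
        self.device_input = cuda.mem_alloc(samples[:batch_size].nbytes)

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        if self.index + self.batch_size > len(self.samples):
            return None                   # no more calibration data
        batch = np.ascontiguousarray(
            self.samples[self.index:self.index + self.batch_size]
        )
        cuda.memcpy_htod(self.device_input, batch)
        self.index += self.batch_size
        return [int(self.device_input)]

    def read_calibration_cache(self):
        return None                       # no cached scales on the first run

    def write_calibration_cache(self, cache):
        pass                              # a real setup would persist the cache

# During engine building (see the FP16 sketch above), enable INT8 and attach it:
#   config.set_flag(trt.BuilderFlag.INT8)
#   config.int8_calibrator = EntropyCalibrator(calibration_samples)
```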

Batching Techniques

  • Static Batching: Processes requests in fixed-size batches assembled ahead of time.
  • Dynamic Batching: Collects incoming requests on the server and groups them into larger batches to improve GPU utilization (a toy sketch follows this list).
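
Triton implements dynamic batching inside the server, but the idea can be sketched in plain Python: buffer requests for a short window (or until a maximum batch size is reached), then run them through the model in a single call. This is a toy illustration, not Triton's actual scheduler.

```python
import queue
import threading
import time

class DynamicBatcher:
    def __init__(self, model_fn, max_batch=8, max_wait=0.005):
        self.model_fn = model_fn          # callable that runs one batched inference
        self.max_batch = max_batch
        self.max_wait = max_wait          # seconds to wait for more requests
        self.requests = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, inputs):
        # Each request carries an Event so the caller can wait for its result.
        done, result = threading.Event(), {}
        self.requests.put((inputs, done, result))
        done.wait()
        return result["output"]

    def _loop(self):
        while True:
            batch = [self.requests.get()]                 # block until work arrives
            deadline = time.monotonic() + self.max_wait
            while len(batch) < self.max_batch and time.monotonic() < deadline:
                try:
                    timeout = max(0.0, deadline - time.monotonic())
                    batch.append(self.requests.get(timeout=timeout))
                except queue.Empty:
                    break
            outputs = self.model_fn([inp for inp, _, _ in batch])  # one batched call
            for (_, done, result), out in zip(batch, outputs):
                result["output"] = out
                done.set()
```

Callers on different threads invoke submit(); the background loop waits up to max_wait for more work before dispatching, trading a small amount of latency for higher GPU utilization.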

Performance Metrics

  • Latency: Time spent on a single inference request.
  • Throughput: Number of inference requests processed per second.
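
A simple way to observe both metrics is to time individual requests (latency) while counting how many complete per second overall (throughput). The sketch below reuses the hypothetical Triton client setup from earlier; NVIDIA's perf_analyzer tool does the same job far more thoroughly.

```python
import time
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

def build_inputs():
    # Placeholder tensor name and shape; must match the deployed model.
    inp = httpclient.InferInput("input__0", [1, 3, 224, 224], "FP32")
    inp.set_data_from_numpy(np.random.rand(1, 3, 224, 224).astype(np.float32))
    return [inp]

n_requests = 100
latencies = []
start = time.perf_counter()
for _ in range(n_requests):
    t0 = time.perf_counter()
    client.infer(model_name="my_model", inputs=build_inputs())
    latencies.append(time.perf_counter() - t0)   # per-request latency
elapsed = time.perf_counter() - start

latencies.sort()
print(f"median latency: {latencies[len(latencies) // 2] * 1000:.1f} ms")
print(f"throughput:     {n_requests / elapsed:.1f} requests/s")
```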