Procedure

  1. Environment Setup
    • Install the CUDA toolkit and the TensorRT SDK on every machine that hosts NVIDIA GPUs.
    • Install and configure Triton Inference Server so that it can serve models exported from the front-end deep-learning frameworks in use (e.g., TensorFlow, PyTorch); a quick environment check is sketched after this procedure.
  2. Model Transformation to TensorRT Format
    • Convert the pretrained models with either the TensorFlow-TensorRT (TF-TRT) integration or the ONNX-TensorRT toolchain.
    • Reduced-precision modes such as FP16 or INT8 can be enabled during conversion for better throughput and latency (INT8 additionally requires calibration data); see the conversion sketch after this procedure.
  3. Model Deployment on Triton Inference Server
    • Set up the Triton model repository with the expected directory layout and a configuration file (config.pbtxt) per model; an example layout follows this procedure.
    • Place the serialized TensorRT engines in the repository and load them into Triton Inference Server.
    • Verify that every model reports READY and is reachable through its serving endpoints.
  4. Inference
    • Send inference requests over HTTP or gRPC with a range of batch sizes (a minimal Python client is sketched after this procedure).
    • Record latency, throughput, and GPU-utilization metrics for each configuration.
  5. Performance Analysis and Optimization
    • Monitor the deployment with Triton's built-in metrics, scraped by Prometheus and visualized in Grafana (see the metrics snippet after this procedure).
    • Identify bottlenecks such as suboptimal batch sizes or too few model instances, and tune dynamic batching and instance-group settings accordingly.
    • If needed, reconvert the models at a different precision to strike an acceptable trade-off between speed and accuracy.
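
A minimal sanity check for step 1, assuming the TensorRT Python bindings and the tritonclient package are installed and that Triton is listening on its default HTTP port 8000 (the URL is an assumption, not a requirement):

  import tensorrt as trt
  import tritonclient.http as httpclient

  # Confirm the TensorRT Python bindings are importable and report their version.
  print("TensorRT version:", trt.__version__)

  # Confirm the Triton server is reachable; localhost:8000 assumes the default HTTP port.
  client = httpclient.InferenceServerClient(url="localhost:8000")
  print("Server live:", client.is_server_live())
  print("Server ready:", client.is_server_ready())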
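
For step 2, a sketch of the ONNX-to-TensorRT path using the TensorRT Python API (TensorRT 8.x assumed; model.onnx and model.plan are placeholder file names). The trtexec command-line tool shipped with TensorRT can perform the same conversion.

  import tensorrt as trt

  logger = trt.Logger(trt.Logger.WARNING)
  builder = trt.Builder(logger)
  network = builder.create_network(
      1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
  parser = trt.OnnxParser(network, logger)

  # Parse the pretrained ONNX model into a TensorRT network definition.
  with open("model.onnx", "rb") as f:
      if not parser.parse(f.read()):
          for i in range(parser.num_errors):
              print(parser.get_error(i))
          raise RuntimeError("ONNX parsing failed")

  # Enable FP16; INT8 would additionally require a calibrator and
  # representative calibration data.
  config = builder.create_builder_config()
  config.set_flag(trt.BuilderFlag.FP16)

  # Build and serialize an engine optimized for the current GPU.
  engine = builder.build_serialized_network(network, config)
  if engine is None:
      raise RuntimeError("Engine build failed")
  with open("model.plan", "wb") as f:
      f.write(engine)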
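
For step 3, an illustrative repository layout for a hypothetical classifier named resnet50_trt (the model name, tensor names, and shapes are assumptions and must match the actual engine):

  model_repository/
    resnet50_trt/
      config.pbtxt
      1/
        model.plan      (serialized TensorRT engine from step 2)

A config.pbtxt for that model might look like:

  name: "resnet50_trt"
  platform: "tensorrt_plan"
  max_batch_size: 32
  input [
    {
      name: "input"
      data_type: TYPE_FP32
      dims: [ 3, 224, 224 ]
    }
  ]
  output [
    {
      name: "output"
      data_type: TYPE_FP32
      dims: [ 1000 ]
    }
  ]
  instance_group [ { kind: KIND_GPU, count: 1 } ]
  dynamic_batching { }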
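
For step 4, a minimal Python HTTP client using the tritonclient package; the model and tensor names match the hypothetical config above, and the batch size is arbitrary. The same request can be issued over gRPC with tritonclient.grpc.

  import numpy as np
  import tritonclient.http as httpclient

  client = httpclient.InferenceServerClient(url="localhost:8000")

  # Build a random batch of 8 images shaped to match the config above.
  batch = np.random.rand(8, 3, 224, 224).astype(np.float32)
  infer_input = httpclient.InferInput("input", list(batch.shape), "FP32")
  infer_input.set_data_from_numpy(batch)

  response = client.infer(
      model_name="resnet50_trt",
      inputs=[infer_input],
      outputs=[httpclient.InferRequestedOutput("output")],
  )

  # One row of class scores per image in the batch.
  print(response.as_numpy("output").shape)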
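
For step 5, Triton exposes Prometheus-format metrics on port 8002 by default; Prometheus scrapes this endpoint and Grafana visualizes it, but the counters can also be read directly, as in this sketch (the metric names below are Triton's built-in ones; the port is the default and may differ):

  import urllib.request

  # Triton publishes Prometheus-format metrics at /metrics (default port 8002).
  text = urllib.request.urlopen("http://localhost:8002/metrics").read().decode()

  # Print a few of the built-in inference and GPU counters.
  prefixes = ("nv_inference_count",
              "nv_inference_request_duration_us",
              "nv_gpu_utilization")
  for line in text.splitlines():
      if line.startswith(prefixes):
          print(line)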