Procedure
- Environment Setup
  - Install the CUDA toolkit and the TensorRT SDK on every machine that hosts NVIDIA GPUs (a quick environment check is sketched after this list).
  - Configure Triton Inference Server so that it is compatible with the front-end deep-learning frameworks in use (e.g., TensorFlow, PyTorch).
- Model Transformation to TensorRT Format
  - Convert pretrained models with either the TensorFlow-TensorRT (TF-TRT) integration or the ONNX-TensorRT toolchain (a conversion sketch follows this list).
  - Reduced-precision modes such as FP16 or INT8 can be enabled during conversion to improve performance.
- Model Deployment on Triton Inference Server
  - Lay out the Triton model repository with a suitable configuration file (config.pbtxt) for each model (an example configuration follows this list).
  - Load the TensorRT-optimized models into Triton Inference Server.
  - Verify that every model exposes a working serving endpoint.
- Inference
  - Send inference requests over HTTP/gRPC with a range of batch sizes (a client sketch follows this list).
  - Monitor latency, throughput, and GPU utilization.
- Performance Analysis and Optimization
  - Monitor the deployment with Triton's built-in metrics endpoint together with Prometheus and Grafana (a metrics sketch follows this list).
  - Identify batching bottlenecks and tune dynamic batching and instance-group settings to improve throughput.
  - If necessary, reconvert models at a precision that balances speed against accuracy.
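
The following is a minimal environment check for the setup step, assuming the NVIDIA driver and the TensorRT Python bindings are installed on the host; it only confirms that the GPUs are visible and that TensorRT imports cleanly.

```python
import shutil
import subprocess

# Confirm the NVIDIA driver is installed and the GPUs are visible.
assert shutil.which("nvidia-smi"), "nvidia-smi not found; install the NVIDIA driver"
subprocess.run(["nvidia-smi", "-L"], check=True)  # lists the detected GPUs

# Confirm the TensorRT Python bindings (shipped with the TensorRT SDK) import cleanly.
import tensorrt as trt
print("TensorRT version:", trt.__version__)
```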
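
A sketch of the ONNX-TensorRT conversion path with optional FP16, assuming the TensorRT 8.x Python API and a hypothetical model file `model.onnx`; the TF-TRT path uses a different API and is not shown.

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(onnx_path: str, engine_path: str, use_fp16: bool = True) -> None:
    """Parse an ONNX model and serialize a TensorRT engine (.plan) to disk."""
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)

    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError("ONNX parsing failed")

    config = builder.create_builder_config()
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GiB workspace
    if use_fp16 and builder.platform_has_fast_fp16:
        config.set_flag(trt.BuilderFlag.FP16)  # enable reduced precision when supported

    serialized = builder.build_serialized_network(network, config)
    with open(engine_path, "wb") as f:
        f.write(serialized)

# Hypothetical file names; INT8 would additionally require a calibration dataset.
build_engine("model.onnx", "model.plan", use_fp16=True)
```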
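
An illustrative model configuration for the deployment step, assuming a hypothetical classification engine named resnet50_trt with tensors named "input" and "output"; the names, shapes, and batching values must match the actual engine. Triton expects one directory per model containing config.pbtxt and numbered version subdirectories (e.g., resnet50_trt/1/model.plan).

```
# model_repository/resnet50_trt/config.pbtxt
name: "resnet50_trt"
platform: "tensorrt_plan"
max_batch_size: 32
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
instance_group [ { kind: KIND_GPU, count: 1 } ]
dynamic_batching {
  preferred_batch_size: [ 4, 8, 16 ]
  max_queue_delay_microseconds: 100
}
```

The dynamic_batching block is also the usual place to address batching bottlenecks found during performance analysis, by adjusting preferred batch sizes, queue delay, and the instance_group count.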
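
A minimal HTTP client sketch for the inference step, using the tritonclient Python package and assuming the hypothetical resnet50_trt model and tensor names from the configuration above; the gRPC client (tritonclient.grpc) follows the same pattern.

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")  # Triton's default HTTP port

# Build a random batch just to exercise the endpoint; vary the first dimension
# to test different batch sizes.
batch = np.random.rand(8, 3, 224, 224).astype(np.float32)

inputs = [httpclient.InferInput("input", batch.shape, "FP32")]
inputs[0].set_data_from_numpy(batch)
outputs = [httpclient.InferRequestedOutput("output")]

result = client.infer(model_name="resnet50_trt", inputs=inputs, outputs=outputs)
print(result.as_numpy("output").shape)
```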
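
A small sketch for the monitoring step that scrapes Triton's Prometheus-format metrics endpoint (port 8002 by default) and prints a few of the latency, throughput, and GPU-utilization counters; in practice Prometheus scrapes this endpoint on a schedule and Grafana visualizes the results.

```python
import requests

METRICS_URL = "http://localhost:8002/metrics"  # Triton's default metrics port

# A few of Triton's built-in metric families relevant to this procedure.
INTERESTING = (
    "nv_inference_request_success",      # throughput (successful requests)
    "nv_inference_request_duration_us",  # end-to-end latency
    "nv_gpu_utilization",                # GPU utilization
)

text = requests.get(METRICS_URL, timeout=5).text
for line in text.splitlines():
    if line.startswith(INTERESTING):
        print(line)
```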