Objective
- 1. To understand TensorRT's capabilities for optimizing deep-learning models for faster inference.
- 2. To learn how to deploy AI models with Triton Inference Server in CPU and GPU environments.
- 3. To learn about precision calibration, using FP16 and INT8 among other methods, to reduce inference time.
- 4. To implement optimization techniques such as layer fusion and kernel auto-tuning (see the build sketch after this list).
- 5. To assess and analyze inference performance in terms of latency and throughput (see the measurement sketch after this list).
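The following is a minimal sketch of how objectives 1, 3, and 4 can be exercised with the TensorRT Python API: building an engine from an ONNX model with FP16 enabled, while layer fusion and kernel auto-tuning run automatically during the build. The model path `model.onnx`, the output name `model.plan`, and the 1 GiB workspace limit are illustrative assumptions, not values from this document.

```python
import tensorrt as trt

ONNX_PATH = "model.onnx"  # hypothetical model path; replace with your own

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

# Parse the ONNX graph into a TensorRT network definition.
with open(ONNX_PATH, "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse ONNX model")

config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # assumed 1 GiB

# Enable reduced precision (objective 3) when the GPU supports fast FP16.
if builder.platform_has_fast_fp16:
    config.set_flag(trt.BuilderFlag.FP16)

# Layer fusion and kernel auto-tuning (objective 4) are applied automatically
# here: TensorRT fuses compatible layers and times candidate kernels while
# building the serialized engine.
engine_bytes = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(engine_bytes)
```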
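For objectives 2 and 5, a simple measurement sketch is shown below using the Triton HTTP client to time requests against a running server. The server address `localhost:8000`, the model name `resnet50_trt`, the input tensor name `input`, and its shape are assumptions for illustration; Triton's own `perf_analyzer` tool can provide the same metrics more rigorously.

```python
import time
import numpy as np
import tritonclient.http as httpclient

# Assumed deployment details; adjust to your Triton server and model config.
client = httpclient.InferenceServerClient(url="localhost:8000")
inputs = [httpclient.InferInput("input", [1, 3, 224, 224], "FP32")]
inputs[0].set_data_from_numpy(
    np.random.rand(1, 3, 224, 224).astype(np.float32))

latencies = []
n_requests = 100
start = time.perf_counter()
for _ in range(n_requests):
    t0 = time.perf_counter()
    client.infer(model_name="resnet50_trt", inputs=inputs)
    latencies.append(time.perf_counter() - t0)
elapsed = time.perf_counter() - start

# Report latency percentiles and overall throughput (objective 5).
print(f"mean latency: {1000 * np.mean(latencies):.2f} ms")
print(f"p95 latency:  {1000 * np.percentile(latencies, 95):.2f} ms")
print(f"throughput:   {n_requests / elapsed:.1f} inferences/s")
```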