Objective

  1. To understand how TensorRT optimizes deep-learning models for faster inference.
  2. To learn how to deploy AI models with Triton Inference Server in CPU and GPU environments.
  3. To learn precision-calibration methods, such as FP16 and INT8, that reduce inference time.
  4. To implement optimization techniques such as layer fusion and kernel auto-tuning.
  5. To measure and analyze inference performance in terms of latency and throughput.