Procedure

  1. Set Up CUDA on Your System
    • Ensure you have a compatible NVIDIA GPU installed.
    • Download and install the CUDA Toolkit from NVIDIA’s website.
    • Follow the installation prompts for your operating system.
    • Verify the installation by running the command:
      nvcc --version
      This should return the CUDA compiler version.
  2. Verify GPU Availability
    • Open a Command Prompt or Terminal and run the following:
      nvidia-smi
      This will display the GPU name, driver version, memory usage, and utilization.
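    • As a cross-check, the CUDA runtime API can enumerate devices from code. A minimal sketch (cudaGetDeviceCount and cudaGetDeviceProperties are standard runtime calls):
      #include <iostream>
      #include <cuda_runtime.h>
      
      int main() {
          int count = 0;
          cudaError_t err = cudaGetDeviceCount(&count);
          if (err != cudaSuccess || count == 0) {
              std::cerr << "No CUDA device found: " << cudaGetErrorString(err) << std::endl;
              return 1;
          }
          for (int i = 0; i < count; ++i) {
              cudaDeviceProp prop;
              cudaGetDeviceProperties(&prop, i);
              std::cout << "Device " << i << ": " << prop.name
                        << ", compute capability " << prop.major << "." << prop.minor
                        << ", " << (prop.totalGlobalMem >> 20) << " MiB global memory"
                        << std::endl;
          }
          return 0;
      }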
  3. Create a CUDA Program to Test Performance
    • Write a simple CUDA program to perform matrix multiplication. Example:
    • #include <iostream>
      #include <cuda_runtime.h>
      
      // Each thread computes one element of C = A x B for N x N row-major matrices.
      __global__ void matMul(float* A, float* B, float* C, int N) {
          int row = threadIdx.y + blockIdx.y * blockDim.y;
          int col = threadIdx.x + blockIdx.x * blockDim.x;
      
          if (row < N && col < N) {
              float value = 0.0f;
              for (int k = 0; k < N; ++k) {
                  value += A[row * N + k] * B[k * N + col];
              }
              C[row * N + col] = value;
          }
      }
      
      int main() {
          int N = 512;
          size_t size = N * N * sizeof(float);
          float *h_A, *h_B, *h_C;  // host buffers
          float *d_A, *d_B, *d_C;  // device buffers
      
          h_A = (float*)malloc(size);
          h_B = (float*)malloc(size);
          h_C = (float*)malloc(size);
      
          cudaMalloc(&d_A, size);
          cudaMalloc(&d_B, size);
          cudaMalloc(&d_C, size);
      
          // Fill both inputs with ones, so every element of C should equal N.
          for (int i = 0; i < N * N; ++i) {
              h_A[i] = 1.0f;
              h_B[i] = 1.0f;
          }
      
          cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
          cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
      
          // Round the grid up so matrix sizes that are not multiples of 16 are still covered.
          dim3 threadsPerBlock(16, 16);
          dim3 numBlocks((N + threadsPerBlock.x - 1) / threadsPerBlock.x,
                         (N + threadsPerBlock.y - 1) / threadsPerBlock.y);
          matMul<<<numBlocks, threadsPerBlock>>>(d_A, d_B, d_C, N);
          cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);  // also waits for the kernel to finish
      
          std::cout << "C[0] = " << h_C[0] << " (expected " << N << ")" << std::endl;
      
          cudaFree(d_A);
          cudaFree(d_B);
          cudaFree(d_C);
          free(h_A);
          free(h_B);
          free(h_C);
      
          return 0;
      }
    • Compile the program:
      nvcc -o matmul matmul.cu
    • Run the program:
      ./matmul
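    • Optionally, verify the result on the host before benchmarking: since A and B are filled with 1.0f, every element of C should equal N exactly. A minimal check, reusing h_C and N from the program above:
      // Run after the device-to-host copy in main().
      bool ok = true;
      for (int i = 0; i < N * N; ++i) {
          if (h_C[i] != (float)N) { ok = false; break; }
      }
      std::cout << (ok ? "Result verified" : "Mismatch found") << std::endl;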
  4. Benchmark GPGPU Performance
    • Use nvidia-smi or a CUDA profiler to monitor GPU utilization during execution (nvprof on older toolkits; it has been superseded by Nsight Systems and Nsight Compute on current ones).
    • Test with different matrix sizes and compare the speedup against a CPU implementation, as in the timing sketch below.
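    • One way to time the kernel is with CUDA events, which measure device-side execution. The sketch below assumes the buffers, kernel, and launch configuration from step 3 and adds a naive single-threaded CPU loop for comparison (add #include <chrono> at the top of the file):
      // GPU timing with CUDA events.
      cudaEvent_t start, stop;
      cudaEventCreate(&start);
      cudaEventCreate(&stop);
      cudaEventRecord(start);
      matMul<<<numBlocks, threadsPerBlock>>>(d_A, d_B, d_C, N);
      cudaEventRecord(stop);
      cudaEventSynchronize(stop);  // block until the kernel has finished
      float gpuMs = 0.0f;
      cudaEventElapsedTime(&gpuMs, start, stop);
      
      // Naive CPU reference: the same triple loop, single-threaded.
      auto t0 = std::chrono::high_resolution_clock::now();
      for (int row = 0; row < N; ++row)
          for (int col = 0; col < N; ++col) {
              float value = 0.0f;
              for (int k = 0; k < N; ++k)
                  value += h_A[row * N + k] * h_B[k * N + col];
              h_C[row * N + col] = value;
          }
      auto t1 = std::chrono::high_resolution_clock::now();
      double cpuMs = std::chrono::duration<double, std::milli>(t1 - t0).count();
      
      std::cout << "GPU: " << gpuMs << " ms, CPU: " << cpuMs << " ms" << std::endl;
      cudaEventDestroy(start);
      cudaEventDestroy(stop);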
  5. Analyze Results
    • Compare execution times for the CPU and GPU versions.
    • Measure memory usage and kernel efficiency using the profiling tools from step 4.
    • Document findings, speedup, and bottlenecks; a small helper for the arithmetic is sketched below.
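    • For the write-up: speedup is the CPU time divided by the GPU time, and an N×N matrix multiplication performs roughly 2·N³ floating-point operations. A small helper, assuming the cpuMs and gpuMs timings from the step 4 sketch:
      double speedup = cpuMs / gpuMs;
      double gflops  = (2.0 * N * N * N) / (gpuMs * 1e-3) / 1e9;  // 2*N^3 FLOPs over seconds
      std::cout << "Speedup: " << speedup << "x, GPU throughput: " << gflops << " GFLOP/s" << std::endl;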