Procedure

For Kubeflow
  1. Create a Kubernetes Cluster with GPU Support
    • Provision a cluster with GPU capability: Use a cloud provider (e.g., GKE, EKS, AKS) or on-premises Kubernetes cluster with GPU nodes.
    • On GKE, attach GPUs at node pool creation time (e.g., --accelerator type=nvidia-tesla-v100,count=1).
    • Install GPU drivers: cloud-managed clusters can handle this for you (newer GKE node pools install NVIDIA drivers automatically; older setups require applying Google's driver-installer DaemonSet). For on-premises clusters, install the NVIDIA drivers, the NVIDIA container toolkit, and the NVIDIA Kubernetes device plugin on GPU nodes; CUDA and cuDNN themselves typically ship inside the container images.
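    • One way to confirm the GPUs are actually schedulable is to list each node's allocatable nvidia.com/gpu resources; a minimal sketch, assuming the kubernetes Python client (pip install kubernetes) and a working kubeconfig:
      from kubernetes import client, config

      # Load credentials from the local kubeconfig, as kubectl does.
      config.load_kube_config()

      # Print each node's allocatable NVIDIA GPU count.
      for node in client.CoreV1Api().list_node().items:
          gpus = node.status.allocatable.get('nvidia.com/gpu', '0')
          print(f"{node.metadata.name}: {gpus} GPU(s) allocatable")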
  2. Install Kubeflow
    • Follow the official Kubeflow installation guide. Since Kubeflow 1.3, the supported install path uses kustomize with the kubeflow/manifests repo rather than the deprecated kfctl. For instance, for Kubeflow 1.6:
      git clone -b v1.6-branch https://github.com/kubeflow/manifests.git && cd manifests
      while ! kustomize build example | kubectl apply -f -; do echo "Retrying..."; sleep 10; done
    • Verify the installation: check that all pods are Running with kubectl get pods -n kubeflow, then access the dashboard (e.g., kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80 and browse to http://localhost:8080).
  3. Create a GPU-Enabled Notebook Server
    • In the Kubeflow UI:
      • Navigate to Notebooks → New Server.
      • Under Resource Configuration, request one or more GPUs.
      • Choose a GPU-enabled image (e.g., tensorflow/tensorflow:2.12.0-gpu or pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime).
  4. Write ML Code
    • In the Jupyter notebook, check for GPU access:
      import tensorflow as tf
      print("GPUs Available:", tf.config.list_physical_devices('GPU'))
    • GPU-enabled TensorFlow and PyTorch builds use CUDA and cuDNN under the hood, so supported operations run on the GPU automatically; you can also pin work to a specific device explicitly, as sketched below.
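    • For example, to place a computation on the first GPU explicitly (a minimal sketch; the matrix sizes are arbitrary):
      import tensorflow as tf

      # Run a small matrix multiplication explicitly on the first GPU.
      with tf.device('/GPU:0'):
          a = tf.random.normal((1024, 1024))
          b = tf.random.normal((1024, 1024))
          c = tf.matmul(a, b)
      print("Result computed on:", c.device)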
  5. Specify a Kubeflow Pipeline
    • Define GPU resources using the KFP SDK v2 (new syntax; the v1 set_gpu_limit is deprecated in favor of set_accelerator_limit):
      from kfp import dsl

      @dsl.component(base_image='tensorflow/tensorflow:2.12.0-gpu')
      def train_model():
          import tensorflow as tf
          # Training code here

      @dsl.pipeline
      def ml_pipeline():
          train_model().set_accelerator_type('NVIDIA_TESLA_V100').set_accelerator_limit(1)
  6. Compile and Run the Pipeline
    • Compile the pipeline:
      from kfp import compiler
      compiler.Compiler().compile(ml_pipeline, 'pipeline.yaml')
    • Upload pipeline.yaml through the Kubeflow UI and start a run, or submit it programmatically as sketched below.
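    • A minimal programmatic-submission sketch using kfp.Client, assuming a port-forwarded Pipelines endpoint (the host URL and run name are illustrative placeholders; adjust auth for your deployment):
      import kfp

      # Connect to the Kubeflow Pipelines API.
      client = kfp.Client(host='http://localhost:8080')

      # Submit the compiled package as a new run.
      run = client.create_run_from_pipeline_package(
          'pipeline.yaml',
          arguments={},
          run_name='gpu-training-run',
      )
      print('Run ID:', run.run_id)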
For SLURM
  1. Access the Cluster
    • In bash:
      ssh username@cluster.address
  2. Set up the Environment
    • Load modules (adjust versions as needed):
      module purge
      module load python/3.10 cuda/11.8 cudnn/8.6
      python -m venv myenv && source myenv/bin/activate
      pip install tensorflow==2.12.0
  3. Write a SLURM Job Script (ml_job.sh)
    • #!/bin/bash
      #SBATCH --job-name=ml_gpu
      #SBATCH --nodes=1
      #SBATCH --gres=gpu:v100:1
      #SBATCH --time=02:00:00
      #SBATCH --output=%x_%j.out
      #SBATCH --error=%x_%j.err
      module load python/3.10 cuda/11.8 cudnn/8.6
      source myenv/bin/activate
      python train.py
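    • A minimal train.py sketch the script above could run (the model and data are illustrative placeholders, not from the original):
      import tensorflow as tf

      # Confirm the GPU allocated by SLURM is visible inside the job.
      print("GPUs Available:", tf.config.list_physical_devices('GPU'))

      # Illustrative placeholder: fit a tiny model on random data.
      model = tf.keras.Sequential([
          tf.keras.layers.Dense(64, activation='relu', input_shape=(32,)),
          tf.keras.layers.Dense(1),
      ])
      model.compile(optimizer='adam', loss='mse')

      x = tf.random.normal((1024, 32))
      y = tf.random.normal((1024, 1))
      model.fit(x, y, epochs=5, batch_size=128)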
  4. Submit and Monitor the Job
    • Submit the job: sbatch ml_job.sh
    • Check status: squeue -u $USER
    • Stream logs: tail -f ml_gpu_<JOBID>.out