Procedure

For Kubeflow
  1. Create a Kubernetes Cluster with GPU Support
    • Provision a cluster with GPU capability: Use a cloud provider (e.g., GKE, EKS, AKS) or on-premises Kubernetes cluster with GPU nodes.
    • On GKE, attach GPUs at node pool creation time (e.g., --accelerator type=nvidia-tesla-v100,count=1).
    • Install GPU drivers: cloud-managed clusters can handle this for you (newer GKE node pools install NVIDIA drivers automatically; older setups require applying Google's driver-installer DaemonSet). For on-premises clusters, install the NVIDIA drivers, the NVIDIA container toolkit, and the NVIDIA Kubernetes device plugin on GPU nodes; CUDA and cuDNN themselves typically ship inside the container images.
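    • One way to confirm the GPUs are actually schedulable is to list each node's allocatable nvidia.com/gpu resources; a minimal sketch, assuming the kubernetes Python client (pip install kubernetes) and a working kubeconfig:
      from kubernetes import client, config

      # Load credentials from the local kubeconfig, as kubectl does.
      config.load_kube_config()

      # Print each node's allocatable NVIDIA GPU count.
      for node in client.CoreV1Api().list_node().items:
          gpus = node.status.allocatable.get('nvidia.com/gpu', '0')
          print(f"{node.metadata.name}: {gpus} GPU(s) allocatable")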
  2. Install Kubeflow
    • Follow the official Kubeflow installation guide. Since Kubeflow 1.3, the supported install path uses kustomize with the kubeflow/manifests repo rather than the deprecated kfctl. For instance, for Kubeflow 1.6:
      git clone -b v1.6-branch https://github.com/kubeflow/manifests.git && cd manifests
      while ! kustomize build example | kubectl apply -f -; do echo "Retrying..."; sleep 10; done
    • Verify the installation: check that all pods are Running with kubectl get pods -n kubeflow, then access the dashboard (e.g., kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80 and browse to http://localhost:8080).
  3. Create a GPU-Enabled Notebook Server
    • In the Kubeflow UI:
      • Navigate to Notebooks → New Server.
      • Under Resource Configuration, request one or more GPUs.
      • Choose a GPU-enabled image (e.g., tensorflow/tensorflow:2.12.0-gpu or pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime).
  4. Write ML Code
    • In the Jupyter notebook, check for GPU access:
      import tensorflow as tf
      print("GPUs Available:", tf.config.list_physical_devices('GPU'))
    • GPU-enabled TensorFlow and PyTorch builds use CUDA and cuDNN under the hood, so supported operations run on the GPU automatically; you can also pin work to a specific device explicitly, as sketched below.
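    • For example, to place a computation on the first GPU explicitly (a minimal sketch; the matrix sizes are arbitrary):
      import tensorflow as tf

      # Run a small matrix multiplication explicitly on the first GPU.
      with tf.device('/GPU:0'):
          a = tf.random.normal((1024, 1024))
          b = tf.random.normal((1024, 1024))
          c = tf.matmul(a, b)
      print("Result computed on:", c.device)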
  5. Specify a Kubeflow Pipeline
    • Define GPU resources using the KFP SDK v2 (new syntax; the v1 set_gpu_limit is deprecated in favor of set_accelerator_limit):
      from kfp import dsl

      @dsl.component(base_image='tensorflow/tensorflow:2.12.0-gpu')
      def train_model():
          import tensorflow as tf
          # Training code here

      @dsl.pipeline
      def ml_pipeline():
          train_model().set_accelerator_type('NVIDIA_TESLA_V100').set_accelerator_limit(1)
  6. Compile and Run the Pipeline
    • Compile the pipeline:
      from kfp import compiler
      compiler.Compiler().compile(ml_pipeline, 'pipeline.yaml')
    • Upload pipeline.yaml through the Kubeflow UI and start a run, or submit it programmatically as sketched below.
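    • A minimal programmatic-submission sketch using kfp.Client, assuming a port-forwarded Pipelines endpoint (the host URL and run name are illustrative placeholders; adjust auth for your deployment):
      import kfp

      # Connect to the Kubeflow Pipelines API.
      client = kfp.Client(host='http://localhost:8080')

      # Submit the compiled package as a new run.
      run = client.create_run_from_pipeline_package(
          'pipeline.yaml',
          arguments={},
          run_name='gpu-training-run',
      )
      print('Run ID:', run.run_id)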
For SLURM
  1. Access the Cluster
    • In bash:
      ssh username@cluster.address
  2. Set up the Environment
    • Load modules (adjust versions as needed):
      module purge
      module load python/3.10 cuda/11.8 cudnn/8.6
      python -m venv myenv && source myenv/bin/activate
      pip install tensorflow==2.12.0
  3. Write a SLURM Job Script (ml_job.sh)
    • #!/bin/bash
      #SBATCH --job-name=ml_gpu
      #SBATCH --nodes=1
      #SBATCH --gres=gpu:v100:1
      #SBATCH --time=02:00:00
      #SBATCH --output=%x_%j.out
      #SBATCH --error=%x_%j.err
      module load python/3.10 cuda/11.8 cudnn/8.6
      source myenv/bin/activate
      python train.py
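    • A minimal train.py sketch the script above could run (the model and data are illustrative placeholders, not from the original):
      import tensorflow as tf

      # Confirm the GPU allocated by SLURM is visible inside the job.
      print("GPUs Available:", tf.config.list_physical_devices('GPU'))

      # Illustrative placeholder: fit a tiny model on random data.
      model = tf.keras.Sequential([
          tf.keras.layers.Dense(64, activation='relu', input_shape=(32,)),
          tf.keras.layers.Dense(1),
      ])
      model.compile(optimizer='adam', loss='mse')

      x = tf.random.normal((1024, 32))
      y = tf.random.normal((1024, 1))
      model.fit(x, y, epochs=5, batch_size=128)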
  4. Submit and Monitor the Job
    • Submit the job: sbatch ml_job.sh
    • Check status: squeue -u $USER
    • Stream logs: tail -f ml_gpu_<JOBID>.out