Procedure
For Kubeflow
- Create a Kubernetes Cluster with GPU Support
- Provision a cluster with GPU capability: Use a cloud provider (e.g., GKE, EKS, AKS) or on-premises Kubernetes cluster with GPU nodes.
- On GKE, specify GPUs when creating the node pool (e.g., --accelerator type=nvidia-tesla-v100,count=1).
- Install GPU drivers: Cloud-managed clusters (e.g., GKE) will have drivers automatically installed. For on-premises, install NVIDIA GPU drivers and the CUDA toolkit manually.
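To confirm the cluster can actually schedule GPU workloads before installing Kubeflow, a minimal test pod that requests one GPU can be applied with `kubectl`. This is an illustrative config fragment; the pod name, container name, and image tag are our choices, not part of any Kubeflow setup:

```yaml
# gpu-test.yaml — illustrative pod requesting a single NVIDIA GPU
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda-smoke-test
      image: nvidia/cuda:11.8.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
```

If scheduling and drivers are working, `kubectl apply -f gpu-test.yaml` followed by `kubectl logs gpu-test` should show `nvidia-smi` output listing the allocated GPU.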
- Install Kubeflow
- Follow the official Kubeflow installation guide. For instance:
```shell
kfctl apply -V -f https://raw.githubusercontent.com/kubeflow/manifests/v1.6-branch/kfdef/kfctl_istio_dex.v1.6.0.yaml
```
- Check the installation: access the dashboard at `http://<cluster-ip>/` and verify all pods are running with `kubectl get pods -n kubeflow`.
- Create a GPU-Enabled Notebook Server
- In the Kubeflow UI:
- Navigate to Notebooks → New Server.
- Under Resource Configuration, request one or more GPUs.
- Choose a GPU-enabled image (e.g., tensorflow/tensorflow:2.12.0-gpu or pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime).
- Write ML Code
- In the Jupyter notebook, check for GPU access:
```python
import tensorflow as tf
print("GPUs Available:", tf.config.list_physical_devices('GPU'))
```
- Use GPU-accelerated libraries such as CUDA or cuDNN in your code.
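Beyond listing devices, it often helps to pin the notebook to a single GPU and enable memory growth so TensorFlow does not reserve all GPU memory up front. A minimal sketch, assuming a GPU-enabled TensorFlow install; the helper name is ours, and the import is deferred so the snippet also parses on CPU-only machines:

```python
def pick_gpu(index: int = 0):
    """Restrict TensorFlow to a single GPU and enable memory growth.

    Sketch only: returns the selected physical device, or None when no
    GPU is visible. Must run before TensorFlow initializes the GPUs.
    """
    import tensorflow as tf  # deferred so this parses without TensorFlow

    gpus = tf.config.list_physical_devices('GPU')
    if not gpus:
        return None
    # Hide every GPU except the chosen one from this process.
    tf.config.set_visible_devices(gpus[index], 'GPU')
    # Allocate GPU memory on demand instead of grabbing it all at start.
    tf.config.experimental.set_memory_growth(gpus[index], True)
    return gpus[index]
```

Call `pick_gpu()` once at the top of the notebook, before building any models.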
- Specify a Kubeflow Pipeline
- Define GPU resources using the KFP SDK v2 syntax (note: `set_gpu_limit` is the deprecated v1 method; v2 uses `set_accelerator_type` and `set_accelerator_limit`):

```python
from kfp import dsl

@dsl.component(base_image='tensorflow/tensorflow:2.12.0-gpu',
               target_image='<your-registry>/train-pipeline:latest')
def train_model():
    import tensorflow as tf
    # Training code here

@dsl.pipeline
def ml_pipeline():
    train_model().set_accelerator_type('NVIDIA_TESLA_V100').set_accelerator_limit(1)
```

- Compile and Run the Pipeline
- Compile the pipeline:

```python
from kfp import compiler
compiler.Compiler().compile(ml_pipeline, 'pipeline.yaml')
```

- Upload `pipeline.yaml` through the Kubeflow UI and start a run.
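As an alternative to uploading through the UI, the compiled package can be submitted with the KFP SDK's `Client`. A hedged sketch: the host URL is a placeholder for your cluster's Pipelines endpoint, and the import is deferred so the snippet parses without `kfp` installed:

```python
def submit_pipeline(host: str, package_path: str = 'pipeline.yaml'):
    """Upload a compiled pipeline package and start a run (sketch).

    Requires `pip install kfp` and a reachable KFP API server.
    """
    from kfp import Client  # deferred; needs the kfp package at call time

    kfp_client = Client(host=host)
    # Creates a one-off run directly from the compiled YAML package.
    return kfp_client.create_run_from_pipeline_package(package_path, arguments={})

# Example (placeholder host):
# submit_pipeline('http://<cluster-ip>/pipeline')
```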
For SLURM
- Access the Cluster (SLURM)
- In bash:
```shell
ssh username@cluster.address
```

- Set up the Environment
- Load modules (adjust versions as needed):

```shell
module purge
module load python/3.10 cuda/11.8 cudnn/8.6
python -m venv myenv && source myenv/bin/activate
pip install tensorflow==2.12.0
```

- Write a SLURM Job Script (ml_job.sh)
```shell
#!/bin/bash
#SBATCH --job-name=ml_gpu
#SBATCH --nodes=1
#SBATCH --gres=gpu:v100:1
#SBATCH --time=02:00:00
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err

module load python/3.10 cuda/11.8
source myenv/bin/activate
python train.py
```

- Submit and Monitor the Job
- Submit the job:
```shell
sbatch ml_job.sh
```

- Check status:

```shell
squeue -u $USER
```

- Stream logs:

```shell
tail -f ml_gpu_<JOBID>.out
```
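When many similar jobs are submitted, the job script above lends itself to templating. A stdlib-only sketch that renders a batch script matching `ml_job.sh`; the function name and parameter defaults are ours, mirroring the example's directives:

```python
def render_slurm_script(job_name='ml_gpu', gres='gpu:v100:1',
                        time_limit='02:00:00', script='train.py'):
    """Render a SLURM batch script with the layout shown above."""
    directives = [
        f'#SBATCH --job-name={job_name}',
        '#SBATCH --nodes=1',
        f'#SBATCH --gres={gres}',
        f'#SBATCH --time={time_limit}',
        '#SBATCH --output=%x_%j.out',
        '#SBATCH --error=%x_%j.err',
    ]
    body = [
        'module load python/3.10 cuda/11.8',
        'source myenv/bin/activate',
        f'python {script}',
    ]
    # Shebang, directives, a blank separator line, then the commands.
    return '\n'.join(['#!/bin/bash', *directives, '', *body]) + '\n'

# Write the script, then submit it as before:
# with open('ml_job.sh', 'w') as f:
#     f.write(render_slurm_script())
# $ sbatch ml_job.sh
```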