How to Deploy Kubeflow as an AWS SageMaker Alternative

Updated on 08 May, 2026
Deploy Kubeflow as a self-hosted AWS SageMaker alternative with pipelines, notebooks, training, and model serving.

AWS SageMaker is a fully managed Machine Learning (ML) platform that provides notebooks, training jobs, pipelines, model serving, and hyperparameter tuning as hosted services within the AWS ecosystem. It abstracts infrastructure management but ties teams to AWS-specific APIs, regional data boundaries, and usage-based pricing that scales with compute time.

Kubeflow is an open-source alternative built on Kubernetes that replicates each SageMaker capability as a portable, self-hosted component, giving teams full control over infrastructure, data residency, and costs.

This article explains how to deploy Kubeflow on a Kubernetes cluster as a self-hosted replacement for AWS SageMaker. It covers installation via Kustomize manifests, notebook configuration, pipeline orchestration, distributed training with the Trainer v2 API, model serving with KServe, hyperparameter tuning with Katib, access control, object storage integration, and migration steps from SageMaker.

Understanding AWS SageMaker vs Kubeflow

AWS SageMaker and Kubeflow provide comparable ML platform capabilities, but they differ in deployment and operational models. SageMaker delivers these as fully managed services within the AWS ecosystem, while Kubeflow provides equivalent open-source components that run on any Kubernetes cluster. The following table maps each SageMaker feature to its Kubeflow counterpart.

AWS SageMaker Feature    | Kubeflow Equivalent      | Description
-------------------------|--------------------------|-----------------------------------------------------------------------------
SageMaker Studio         | Kubeflow Notebooks       | Interactive development environments with JupyterLab, VS Code, and RStudio
SageMaker Training Jobs  | Kubeflow Trainer         | Distributed training for PyTorch, DeepSpeed, MLX, JAX, and XGBoost workloads
SageMaker Pipelines      | Kubeflow Pipelines (KFP) | Directed Acyclic Graph (DAG) based ML workflow orchestration
SageMaker Model Registry | Kubeflow Model Registry  | Versioned model artifact management with metadata tracking
SageMaker Endpoints      | KServe                   | Serverless model serving with autoscaling and canary deployments
SageMaker Experiments    | Katib                    | Automated hyperparameter tuning with multiple search algorithms

Self-hosting with Kubeflow eliminates per-minute compute charges, keeps all data within your own cluster, runs on any cloud provider or on-premises hardware, and allows complete customization of every component.

Prerequisites

Before you begin, you need to:

  • Have access to a multi-node Kubernetes cluster that runs Kubernetes 1.31 or later with at least 4 CPU cores and 16 GB of RAM per node (minimum 3 nodes recommended).
  • Install kubectl and configure it to connect to your cluster.
  • Install Kustomize version 5.4.3 or later.
  • Have a default StorageClass that is configured in your cluster for provisioning persistent volumes.

Install Kubeflow

Kubeflow uses Kustomize to deploy its components as Kubernetes resources. The official kubeflow/manifests repository contains all component manifests that are organized under common/ for shared infrastructure services such as Istio, cert-manager, and Dex, and under applications/ for Kubeflow-specific applications such as Pipelines, Notebooks, and KServe.

Deploy Kubeflow via Manifests

The following steps clone the Kubeflow manifests repository and deploy all components to the cluster.

  1. Verify the Kubernetes cluster connection.

    console
    $ kubectl cluster-info
    
  2. Check the Kubernetes server version.

    console
    $ kubectl version
    

    Verify that the Server Version field shows version 1.31 or later.

  3. Clone the official Kubeflow manifests repository.

    console
    $ git clone https://github.com/kubeflow/manifests.git
    
  4. Switch to the manifests directory.

    console
    $ cd manifests
    
  5. Check out the latest stable release tag.

    console
    $ git checkout 26.03
    
  6. Deploy all Kubeflow components. The command uses a bounded retry loop that attempts the installation up to 5 times, which accommodates the time that Kubernetes CRDs and webhooks need to register before dependent resources apply. The loop exits automatically after a successful apply or after reaching the retry limit.

    console
    $ for i in 1 2 3 4 5; do kustomize build example | kubectl apply --server-side --force-conflicts -f - && break || { echo "Attempt $i failed, retrying in 30s..."; sleep 30; }; done
    

    The first one or two attempts may output errors about CRDs or webhooks not being established. These errors are expected and resolve on subsequent attempts after the CRDs register. The loop exits automatically when the apply succeeds, which typically happens on the second or third attempt. The full installation takes approximately 10 to 15 minutes after the final successful apply for all pods to reach a Running state. The --server-side --force-conflicts flags are required because some Kubeflow CRDs exceed the annotation size limit that standard kubectl apply supports.

    Note
    The default installation uses the email user@example.com and password 12341234. Change these credentials before exposing Kubeflow to any network. See the Set Up Access Control section later in this article for instructions.

Verify Installation

After the deployment completes, verify that all Kubeflow components are running and the CRDs are registered.

  1. Check that all pods in the kubeflow namespace reach a Running state.

    console
    $ kubectl get pods -n kubeflow --field-selector=status.phase!=Succeeded
    

    Verify that all listed pods display a Running status with all containers ready. If any pods show CrashLoopBackOff or Pending, check their logs with kubectl logs -n kubeflow POD-NAME and verify that the cluster meets the minimum resource requirements.

  2. Check that the Istio ingress gateway service is running.

    console
    $ kubectl get svc istio-ingressgateway -n istio-system
    

    Verify that the service appears in the output.

  3. Check that Kubeflow and its component CRDs are registered.

    console
    $ kubectl get crd | grep -E "kubeflow|kserve|katib|istio|knative|trainer" | wc -l
    

    The output displays the count of registered CRDs across Kubeflow and its components. A count of 40 or more indicates a complete installation.

Create Default User Profile

Kubeflow uses profiles to provide namespace-level isolation for each user. The default installation does not automatically provision a user namespace, so you need to create one manually.

  1. Create a new file called user-profile.yaml.

    console
    $ nano user-profile.yaml
    
  2. Add the following configuration:

    yaml
    apiVersion: kubeflow.org/v1
    kind: Profile
    metadata:
      name: kubeflow-user-example-com
    spec:
      owner:
        kind: User
        name: user@example.com
    

    Save and close the file.

  3. Apply the profile manifest.

    console
    $ kubectl apply -f user-profile.yaml
    

    This command creates an isolated namespace called kubeflow-user-example-com with default Role-Based Access Control (RBAC) policies and a service account for the default user.

  4. Verify that the namespace exists.

    console
    $ kubectl get namespace kubeflow-user-example-com
    
  5. Verify that the default service account exists. The service account takes a few seconds to provision after the profile is created. Wait 10 seconds before running this command.

    console
    $ kubectl get serviceaccount default-editor -n kubeflow-user-example-com
    
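When onboarding several users, the same Profile manifest can be generated programmatically instead of editing YAML by hand. The following is a minimal sketch (the profile_manifest helper and the user roster are our own illustrations, not part of Kubeflow); kubectl accepts JSON manifests as well as YAML, so the output can be piped straight into kubectl apply -f -.

```python
import json

def profile_manifest(email: str, namespace: str) -> dict:
    """Build a Kubeflow Profile manifest for one user.

    The profile name becomes the user's isolated namespace; pick any
    DNS-safe string, such as kubeflow-user-example-com for user@example.com.
    """
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "Profile",
        "metadata": {"name": namespace},
        "spec": {"owner": {"kind": "User", "name": email}},
    }

if __name__ == "__main__":
    # Hypothetical roster: map each user's email to their namespace name.
    users = {"user@example.com": "kubeflow-user-example-com"}
    for email, ns in users.items():
        print(json.dumps(profile_manifest(email, ns)))
```

Running the script and piping its output to kubectl apply -f - creates one profile per listed user.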

Configure Storage

Kubeflow components such as Notebooks, Pipelines, and the Model Registry require persistent storage. The cluster needs a default StorageClass to dynamically provision Persistent Volume Claims (PVCs).

  1. Verify that a default StorageClass exists.

    console
    $ kubectl get storageclass
    

    The default StorageClass shows (default) next to its name. If no default exists, set one by annotating an existing StorageClass. Replace STORAGE-CLASS-NAME with the name of an existing StorageClass from the output above.

    console
    $ kubectl patch storageclass STORAGE-CLASS-NAME -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'
    

Configure Kubeflow Notebooks

Kubeflow Notebooks provides managed JupyterLab, VS Code, and RStudio environments that run as Kubernetes pods with direct access to cluster resources, GPUs, and persistent storage. This component replaces SageMaker Studio notebooks with a self-hosted alternative.

Access Dashboard

Set up port forwarding and log in to the Kubeflow Central Dashboard.

  1. Set up port forwarding to access the Kubeflow Central Dashboard.

    console
    $ kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
    
  2. Open http://localhost:8080 in a web browser. The Kubeflow login screen appears. Click Sign in with Dex.

    Kubeflow Sign in with Dex

  3. Enter the default credentials on the Dex login form and click Login.

    • Email: user@example.com
    • Password: 12341234

    Dex Login Form

  4. The Kubeflow Central Dashboard loads with links to Notebooks, Pipelines, Katib Experiments, KServe Endpoints, and other components. Select kubeflow-user-example-com from the namespace dropdown at the top.

    Kubeflow Central Dashboard

Create Notebook Server

Launch a new notebook server from the Kubeflow dashboard.

  1. Navigate to Notebooks in the left sidebar and click New Notebook.

  2. Enter ml-workspace in the Name field.

  3. Select the notebook environment from the image cards. Choose JupyterLab for a general-purpose data science environment. Select VisualStudio Code for a code editor interface, or RStudio for R-based statistical computing. To use a specific image version, select Custom Notebook from the dropdown below the cards.

  4. Set Minimum CPU to 0.5 and Minimum Memory Gi to 1.

  5. Leave the Workspace Volume at the default 5Gi. This volume persists data across notebook restarts.

  6. Click Launch and wait for the notebook pod to reach a Running state. The status indicator turns green when the notebook is ready.

Connect to Notebook

Open the JupyterLab interface and verify that ML libraries are accessible.

  1. Click Connect next to the notebook server name. A new tab opens with the JupyterLab interface.

  2. Click Python 3 (ipykernel) under the Notebook section in the launcher to create a new notebook. Paste the following code into a cell and press Shift + Enter to run it.

    python
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    
    X = np.random.randn(1000, 10)
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    
    model = RandomForestClassifier(n_estimators=100)
    model.fit(X_train, y_train)
    print(f"Accuracy: {model.score(X_test, y_test):.4f}")
    

    The cell outputs an accuracy score such as Accuracy: 0.9750.

Set Up Kubeflow Pipelines

Kubeflow Pipelines (KFP) is an orchestration platform that enables building and deploying ML workflows as Directed Acyclic Graphs (DAGs). Each pipeline step runs in its own container, which enables reproducibility and version tracking. KFP replaces SageMaker Pipelines with an open-source alternative.

Deploy Sample Pipeline

Install the KFP SDK, define a three-step pipeline, compile it, and submit a run through the internal API. The following steps use both the notebook terminal and the cluster management terminal where specified.

  1. Inside JupyterLab, click File > New > Terminal to open a terminal tab.

  2. Create the pipeline definition file.

    console
    $ nano sample_pipeline.py
    
  3. Add the following configuration:

    python
    from kfp import dsl, compiler
    
    @dsl.component(base_image="python:3.11-slim")
    def preprocess() -> str:
        import json
        data = {"samples": 1000, "features": 10, "status": "preprocessed"}
        return json.dumps(data)
    
    @dsl.component(base_image="python:3.11-slim")
    def train(input_data: str) -> str:
        import json
        data = json.loads(input_data)
        result = {"model": "random_forest", "accuracy": 0.95, "input": data}
        return json.dumps(result)
    
    @dsl.component(base_image="python:3.11-slim")
    def evaluate(input_data: str):
        import json
        result = json.loads(input_data)
        print(f"Model: {result['model']}, Accuracy: {result['accuracy']}")
    
    @dsl.pipeline(name="sample-ml-pipeline")
    def ml_pipeline():
        preprocess_task = preprocess()
        train_task = train(input_data=preprocess_task.output)
        evaluate(input_data=train_task.output)
    
    compiler.Compiler().compile(ml_pipeline, "pipeline.yaml")
    print("Pipeline compiled successfully")
    

    Save and close the file.

  4. Compile the pipeline to generate the YAML definition.

    console
    $ python3 sample_pipeline.py
    
  5. Switch to the cluster management terminal and create the authorization policy manifest. This policy allows the notebook namespace to call the Kubeflow Pipelines API through the Istio service mesh.

    console
    $ nano allow-pipeline-access.yaml
    
  6. Add the following configuration:

    yaml
    apiVersion: security.istio.io/v1beta1
    kind: AuthorizationPolicy
    metadata:
      name: allow-notebook-to-pipeline
      namespace: kubeflow
    spec:
      selector:
        matchLabels:
          app: ml-pipeline
      rules:
      - from:
        - source:
            namespaces: ["kubeflow-user-example-com"]
    

    Save and close the file.

  7. Apply the authorization policy.

    console
    $ kubectl apply -f allow-pipeline-access.yaml
    
  8. Switch to the notebook terminal and upload the compiled pipeline to Kubeflow Pipelines through the internal API.

    console
    $ curl -s -F "uploadfile=@pipeline.yaml" -H "kubeflow-userid: user@example.com" http://ml-pipeline.kubeflow.svc.cluster.local:8888/apis/v2beta1/pipelines/upload
    

    The command returns a JSON response that contains the pipeline_id. Note this value for the next step.

  9. Create an experiment to organize pipeline runs.

    console
    $ curl -s -X POST -H "Content-Type: application/json" -H "kubeflow-userid: user@example.com" http://ml-pipeline.kubeflow.svc.cluster.local:8888/apis/v2beta1/experiments -d '{"display_name":"default","namespace":"kubeflow-user-example-com"}'
    

    The command returns a JSON response that contains the experiment_id. Note this value for the next step.

  10. Start a pipeline run. Replace PIPELINE-ID and EXPERIMENT-ID with the values from the previous steps.

    console
    $ curl -s -X POST -H "Content-Type: application/json" -H "kubeflow-userid: user@example.com" http://ml-pipeline.kubeflow.svc.cluster.local:8888/apis/v2beta1/runs -d '{"display_name":"test-run","experiment_id":"EXPERIMENT-ID","pipeline_version_reference":{"pipeline_id":"PIPELINE-ID"},"runtime_config":{}}'
    
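The curl calls above can also be scripted from notebook code using only the Python standard library. The sketch below mirrors the same v2beta1 endpoints and kubeflow-userid header (the run_body and post helper names are ours, and the service URL is only reachable from inside the cluster):

```python
import json
from urllib import request

API = "http://ml-pipeline.kubeflow.svc.cluster.local:8888/apis/v2beta1"
HEADERS = {"Content-Type": "application/json", "kubeflow-userid": "user@example.com"}

def run_body(experiment_id: str, pipeline_id: str, name: str = "test-run") -> dict:
    # Same JSON body as the /runs curl call above.
    return {
        "display_name": name,
        "experiment_id": experiment_id,
        "pipeline_version_reference": {"pipeline_id": pipeline_id},
        "runtime_config": {},
    }

def post(path: str, body: dict) -> dict:
    # POST a JSON body to the Pipelines API and decode the JSON response.
    req = request.Request(API + path, data=json.dumps(body).encode(), headers=HEADERS)
    with request.urlopen(req, timeout=30) as resp:
        return json.load(resp)

# Example usage from the notebook (replace PIPELINE-ID with your pipeline_id):
#   exp = post("/experiments", {"display_name": "default",
#                               "namespace": "kubeflow-user-example-com"})
#   run = post("/runs", run_body(exp["experiment_id"], "PIPELINE-ID"))
```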

Monitor Execution

The pipeline run status is visible from the command line and from the Kubeflow dashboard. The dashboard provides a graph view that shows each step's completion state.

  1. Verify that the workflow completed.

    console
    $ kubectl get workflows -n kubeflow-user-example-com
    

    Verify that the STATUS column shows Succeeded.

  2. Navigate to Pipelines in the left sidebar of the Kubeflow dashboard, then click Experiments. Click default, then click the test-run entry. The graph view shows the preprocess, train, and evaluate steps each marked with a green checkmark when the run completes successfully.

    Pipeline Successful Run

Configure Distributed Training

Kubeflow Trainer v2 provides a unified TrainJob API for running distributed training jobs across frameworks including PyTorch, DeepSpeed, MLX, JAX, and XGBoost. The Trainer uses ClusterTrainingRuntime resources that define pre-configured runtime environments, which separates infrastructure configuration from training logic. Kubeflow Trainer replaces SageMaker Training Jobs with native Kubernetes-based distributed training.

Create Training Job

Create and deploy a distributed PyTorch training job that uses the torch-distributed runtime. Run the following commands from the cluster management terminal.

  1. Create the TrainJob manifest.

    console
    $ nano trainjob.yaml
    
  2. Add the following configuration:

    yaml
    apiVersion: trainer.kubeflow.org/v1alpha1
    kind: TrainJob
    metadata:
      name: pytorch-training
      namespace: kubeflow-user-example-com
    spec:
      runtimeRef:
        name: torch-distributed
      trainer:
        image: ghcr.io/kubeflow/katib/pytorch-mnist-cpu:v0.19.0
        numNodes: 2
        resourcesPerNode:
          requests:
            cpu: "500m"
            memory: "1Gi"
          limits:
            cpu: "1"
            memory: "2Gi"
    

    Save and close the file. The runtimeRef field references the torch-distributed ClusterTrainingRuntime, which configures the PyTorch distributed training environment. The numNodes field specifies the number of training nodes that Kubeflow provisions for the job.

  3. Apply the training manifest.

    console
    $ kubectl apply -f trainjob.yaml
    

Monitor Training

Check the training job status from the cluster management terminal.

  1. Verify the TrainJob status.

    console
    $ kubectl get trainjob -n kubeflow-user-example-com
    

    The STATE column shows Complete when training finishes successfully.

Deploy Model Serving with KServe

KServe provides a Kubernetes CRD called InferenceService for deploying, scaling, and managing ML model endpoints. It supports serverless inference with autoscaling from zero, canary rollouts, and multi-model serving across frameworks including TensorFlow, PyTorch, scikit-learn, XGBoost, and ONNX. KServe also supports deploying models directly from Hugging Face Hub using the hf:// URI scheme and from the Kubeflow Model Registry using the model-registry:// scheme. KServe is included in the Kubeflow installation and replaces SageMaker Endpoints.

Create InferenceService

Deploy a pre-trained scikit-learn model and expose it as a serving endpoint. Run the following commands from the cluster management terminal.

  1. Create the model serving manifest.

    console
    $ nano sklearn-iris.yaml
    
  2. Add the following configuration:

    yaml
    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: sklearn-iris
      namespace: kubeflow-user-example-com
      annotations:
        sidecar.istio.io/inject: "false"
    spec:
      predictor:
        model:
          modelFormat:
            name: sklearn
          storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: "1"
              memory: 1Gi
    

    Save and close the file.

  3. Apply the InferenceService manifest.

    console
    $ kubectl apply -f sklearn-iris.yaml
    
  4. Wait for the InferenceService to become ready.

    console
    $ kubectl get inferenceservice sklearn-iris -n kubeflow-user-example-com -w
    

    The READY column changes to True when the model is loaded and serving. Press Ctrl + C to stop watching. Navigate to KServe Endpoints in the Kubeflow dashboard sidebar to view the deployed model.
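Readiness can also be checked programmatically, for example from a pipeline step that waits for a rollout. The parser below is a minimal sketch (the is_ready helper and sample dict are our own) that reads the same Ready condition the kubectl READY column reflects, from the JSON returned by kubectl get inferenceservice ... -o json:

```python
def is_ready(inference_service: dict) -> bool:
    """True when the InferenceService status reports a Ready=True condition."""
    conditions = inference_service.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Ready" and c.get("status") == "True" for c in conditions
    )

# Illustrative shape of the object's status block:
sample = {"status": {"conditions": [{"type": "Ready", "status": "True"}]}}
```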

Test Model Endpoint

Send a test inference request to verify that the model is serving predictions.

  1. Run the following command from the Kubeflow notebook terminal, which has direct access to the cluster-internal service endpoint.

    console
    $ curl -s --max-time 30 -H "Content-Type: application/json" http://sklearn-iris-predictor-00001-private.kubeflow-user-example-com.svc.cluster.local/v1/models/sklearn-iris:predict -d '{"instances": [[6.8, 2.8, 4.8, 1.4], [6.0, 3.4, 4.5, 1.6]]}'
    

    The response returns predicted class labels.

    {"predictions": [1, 1]}
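The same request can be issued from notebook Python code using only the standard library. This sketch mirrors the curl command above (the v1_body and predict helper names are ours; the endpoint URL is the same cluster-internal address):

```python
import json
from urllib import request

ENDPOINT = ("http://sklearn-iris-predictor-00001-private"
            ".kubeflow-user-example-com.svc.cluster.local"
            "/v1/models/sklearn-iris:predict")

def v1_body(instances: list) -> dict:
    # KServe V1 protocol: rows of feature values under an "instances" key.
    return {"instances": instances}

def predict(instances: list) -> list:
    # POST the payload and return the predicted class labels.
    req = request.Request(
        ENDPOINT,
        data=json.dumps(v1_body(instances)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req, timeout=30) as resp:
        return json.load(resp)["predictions"]

# predict([[6.8, 2.8, 4.8, 1.4], [6.0, 3.4, 4.5, 1.6]]) returns the class labels.
```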

Configure Hyperparameter Tuning

Katib is the Kubeflow component that provides automated hyperparameter tuning and neural architecture search. It supports multiple search algorithms including random search, grid search, Bayesian optimization, Tree-structured Parzen Estimator (TPE), and CMA Evolution Strategy. Katib replaces SageMaker Experiments and Automatic Model Tuning with a Kubernetes-native solution.

Create Tuning Experiment

Katib experiments are defined and submitted through the Katib Python SDK from a JupyterLab notebook cell. The SDK creates the experiment resource on the cluster and manages trial pod configuration and metrics collection automatically.

  1. Open a terminal in JupyterLab by clicking File > New > Terminal and install the Katib Python SDK.

    console
    $ pip install kubeflow-katib
    
  2. Close the terminal tab and create a new Python notebook by clicking File > New > Notebook, then selecting Python 3 (ipykernel) as the kernel. Run the following code in a cell to define an objective function and launch a tuning experiment.

    python
    import kubeflow.katib as katib
    
    def objective(parameters):
        import time
        time.sleep(5)
        result = 4 * int(parameters["a"]) - float(parameters["b"]) ** 2
        print(f"result={result}")
    
    parameters = {
        "a": katib.search.int(min=10, max=20),
        "b": katib.search.double(min=0.1, max=0.2)
    }
    
    katib_client = katib.KatibClient(namespace="kubeflow-user-example-com")
    name = "tune-experiment"
    katib_client.tune(
        name=name,
        objective=objective,
        parameters=parameters,
        objective_metric_name="result",
        objective_type="maximize",
        algorithm_name="random",
        max_trial_count=4,
        parallel_trial_count=2,
        resources_per_trial={"cpu": "1", "memory": "1Gi"},
    )
    

The tune() method creates a Katib experiment that runs 4 trials (2 in parallel) using random search. The cell output includes a line reading Katib Experiment tune-experiment link here. Click the here link to open the experiment directly in the Katib Experiments tab and monitor trial progress. The experiment status turns green when all trials complete.

  3. Retrieve the optimal hyperparameters by running the following code in the next cell.

    python
    katib_client.wait_for_experiment_condition(name=name)
    print(katib_client.get_optimal_hyperparameters(name))
    

    Katib Experiment

Set Up Access Control

Kubeflow uses Dex as its OpenID Connect (OIDC) identity provider and Istio for network-level authorization. Each user gets an isolated namespace, which is called a profile, with its own resources, secrets, and RBAC policies.

Configure Authentication

The Dex ConfigMap uses hashFromEnv: DEX_USER_PASSWORD to read the password hash from an environment variable rather than storing it directly in the ConfigMap. To change the default password, update the Secret that provides this environment variable to the Dex pod.

  1. Install the bcrypt Python package to generate a password hash.

    console
    $ pip install bcrypt
    
  2. Generate a bcrypt hash for the new password. Replace YOUR-SECURE-PASSWORD with the password you want to set.

    console
    $ python3 -c "import bcrypt; print(bcrypt.hashpw(b'YOUR-SECURE-PASSWORD', bcrypt.gensalt()).decode())"
    

    Copy the output hash for use in the next step.

  3. Update the dex-passwords Secret with the new hash. Replace GENERATED-BCRYPT-HASH with the hash output from the previous step.

    console
    $ kubectl create secret generic dex-passwords -n auth \
        --from-literal=DEX_USER_PASSWORD='GENERATED-BCRYPT-HASH' \
        --dry-run=client -o yaml | kubectl apply -f -
    
    Note
    The command outputs a warning about a missing annotation. This is expected because dex-passwords was created by Kubeflow without --save-config. Verify that the output ends with secret/dex-passwords configured.
  4. Restart the Dex deployment to apply the changes.

    console
    $ kubectl rollout restart deployment dex -n auth
    
  5. To add additional static users or configure external identity providers such as Lightweight Directory Access Protocol (LDAP), GitHub, or Google, add entries to the staticPasswords list or connector entries in the Dex ConfigMap. See the Dex documentation for details.
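As a sketch, an additional static user entry in the staticPasswords list might look like the following (the email, username, and userID here are hypothetical; the field names follow the Dex staticPasswords format, and the hash value is a bcrypt hash generated as in the earlier step):

```yaml
staticPasswords:
- email: jane@example.com
  hash: "$2y$12$REPLACE-WITH-BCRYPT-HASH"
  username: jane
  userID: "jane-unique-id"
```

After editing the ConfigMap, restart the Dex deployment as shown above so the new entry takes effect.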

Implement RBAC

Each additional user needs a profile that follows the same manifest structure as the default user profile. Replace the metadata.name and owner.name fields with the new user's details, then apply the manifest with kubectl apply -f.

To give an existing user access to another user's namespace without creating a separate profile, navigate to the target namespace in the Kubeflow dashboard namespace dropdown. Click Manage Contributors in the left sidebar and enter the user's email address.

Integrate Object Storage

ML workflows generate large artifacts including trained models, pipeline outputs, datasets, and logs. Kubeflow uses SeaweedFS as its default S3-compatible object storage backend for artifact persistence. KServe also supports S3-compatible storage for loading model artifacts.

Configure Storage Backend

The default Kubeflow installation deploys SeaweedFS in the kubeflow namespace with pre-configured credentials. Verify that the storage deployment is running.

  1. Check the SeaweedFS pod status.

    console
    $ kubectl get pods -n kubeflow -l app=seaweedfs
    
  2. For production deployments, replace SeaweedFS with an external S3-compatible object storage service. Replace YOUR-ACCESS-KEY and YOUR-SECRET-KEY with the access key and secret key for your storage service, then update the mlpipeline-minio-artifact secret with the new credentials.

    console
    $ kubectl create secret generic mlpipeline-minio-artifact -n kubeflow --from-literal=accesskey=YOUR-ACCESS-KEY --from-literal=secretkey=YOUR-SECRET-KEY --dry-run=client -o yaml | kubectl apply -f -
    

Test Storage Integration

Configure KServe to access SeaweedFS for loading model artifacts stored in the cluster.

  1. Create a storage secret for KServe model storage.

    console
    $ nano s3-storage-secret.yaml
    
  2. Add the following configuration:

    yaml
    apiVersion: v1
    kind: Secret
    metadata:
      name: s3-storage-secret
      namespace: kubeflow-user-example-com
      annotations:
        serving.kserve.io/s3-endpoint: "seaweedfs.kubeflow:8333"
        serving.kserve.io/s3-usehttps: "0"
    type: Opaque
    stringData:
      AWS_ACCESS_KEY_ID: "admin"
      AWS_SECRET_ACCESS_KEY: "admin"
    

    Save and close the file.

  3. Apply the storage secret.

    console
    $ kubectl apply -f s3-storage-secret.yaml
    
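KServe picks up S3 credentials through a ServiceAccount that lists the secret; the InferenceService then references that ServiceAccount by name. The following is a sketch of that pattern (the s3-sa name and the s3://models/sklearn-iris bucket path are hypothetical placeholders for your own storage layout):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: s3-sa
  namespace: kubeflow-user-example-com
secrets:
- name: s3-storage-secret
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris-s3
  namespace: kubeflow-user-example-com
spec:
  predictor:
    serviceAccountName: s3-sa
    model:
      modelFormat:
        name: sklearn
      storageUri: "s3://models/sklearn-iris"
```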

Migration from AWS SageMaker

Migrating from SageMaker to Kubeflow involves exporting existing assets and mapping each SageMaker component to its Kubeflow equivalent. Notebooks transfer without format changes since both platforms use the standard Jupyter notebook format. Training scripts, pipelines, and experiments require rewriting to replace SageMaker SDK calls with Kubeflow and KFP SDK equivalents.

  1. Export Notebooks: Download SageMaker Studio notebooks as .ipynb files from the SageMaker console or by using the AWS CLI. Upload them directly to Kubeflow Notebook servers, since both platforms use standard Jupyter notebook format. Update any AWS SDK calls such as sagemaker.Session() or boto3 references that rely on SageMaker-specific APIs.

  2. Convert Training Scripts: SageMaker training scripts that use the SageMaker Python SDK estimator pattern (such as sagemaker.pytorch.PyTorch()) need conversion to standard framework training scripts. Replace SageMaker-specific environment variables like SM_MODEL_DIR and SM_CHANNEL_TRAINING with Kubernetes volume mount paths. Package the training code into container images and reference them in TrainJob manifests.

  3. Migrate Pipelines: SageMaker Pipelines that are defined using the SageMaker SDK need rewriting with the KFP SDK. Map each sagemaker.workflow.steps step to a KFP @dsl.component function. Replace sagemaker.processing.ProcessingStep with KFP container components, and replace sagemaker.workflow.pipeline.Pipeline with @dsl.pipeline decorated functions.

  4. Export Models: Download trained model artifacts from SageMaker S3 buckets using the AWS CLI. Upload them to your Kubeflow-connected object storage. Update the storageUri in KServe InferenceService manifests to point to the new storage location. KServe supports the same model formats that SageMaker uses (SavedModel, TorchScript, ONNX, scikit-learn pickle files) without conversion.

  5. Migrate Experiments: Export SageMaker experiment tracking data using the SageMaker SDK list_trials() and describe_trial_component() API calls. For hyperparameter tuning, recreate tuning jobs as Katib experiments with equivalent search spaces and objective metrics using the Katib Python SDK.
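As an illustration of the training-script conversion in step 2, a small environment shim lets the same script resolve its paths on both platforms during a gradual migration. This is a sketch: the resolve_dir helper and the /mnt fallback paths are assumptions, and the fallbacks should match the volume mounts you configure in your TrainJob.

```python
import os

def resolve_dir(sm_var: str, kubeflow_default: str) -> str:
    """Prefer the SageMaker-injected env var when present (so the script still
    runs on SageMaker), otherwise fall back to the Kubernetes mount path."""
    return os.environ.get(sm_var, kubeflow_default)

# SageMaker injects SM_MODEL_DIR and SM_CHANNEL_TRAINING into training
# containers; on Kubeflow these default to the TrainJob's assumed mounts.
MODEL_DIR = resolve_dir("SM_MODEL_DIR", "/mnt/models")
TRAIN_DATA = resolve_dir("SM_CHANNEL_TRAINING", "/mnt/data")
```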

Conclusion

You have successfully deployed Kubeflow on a Kubernetes cluster as a self-hosted alternative to AWS SageMaker. The deployment includes interactive notebooks for model development, Kubeflow Pipelines for workflow orchestration, the Trainer v2 API for distributed PyTorch training, KServe for model serving with serverless inference, and Katib for automated hyperparameter tuning. For more information on GPU scheduling, multi-tenancy, and production hardening, visit the official Kubeflow documentation.
