Why This Matters

Deploying inference workloads on Kubernetes-native infrastructure has traditionally required AI teams to navigate a maze of Helm charts, IAM role configurations, dependency management, and manual upgrades, often spending hours before a single model can serve predictions. The new Amazon SageMaker HyperPod Inference Operator, now available as a native EKS add-on, removes most of this work with one-click installation and managed upgrades.

Based on the official announcement from the AWS Architecture Blog, this tutorial walks you through the complete setup process, three installation methods, and a real deployment example.

Prerequisites

  • An existing Amazon SageMaker HyperPod cluster with EKS orchestration
  • IAM permissions for EKS cluster administration
  • kubectl configured for cluster access
  • (Optional) Terraform or AWS CLI if you prefer non-console workflows

Installation Methods

Method 1: One-Click via SageMaker Console (Recommended)

  1. Navigate to SageMaker Console → HyperPod Clusters → Cluster Management.
  2. Select your cluster and go to the Inference tab.
  3. Choose Quick Install (automated setup with defaults) or Custom Install (reuse existing IAM roles, S3 buckets, etc.).
  4. Click Install.
  5. Verify installation:
kubectl get pods -n hyperpod-inference-system
aws eks describe-addon --cluster-name YOUR-CLUSTER --addon-name amazon-sagemaker-hyperpod-inference --region us-west-2
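
The add-on can take a few minutes to become active, so a single status check may run too early. A minimal sketch that blocks until the add-on is ready, assuming the same cluster name and region placeholders as above; wait_for_inference_addon is a hypothetical helper name, while aws eks wait addon-active and the --query filter are standard AWS CLI features:

```shell
# Hypothetical helper: block until the add-on is ACTIVE, then print its status.
# $1 = EKS cluster name, $2 = AWS region
wait_for_inference_addon() {
  # 'aws eks wait addon-active' polls DescribeAddon until the status is ACTIVE
  aws eks wait addon-active \
    --cluster-name "$1" \
    --addon-name amazon-sagemaker-hyperpod-inference \
    --region "$2" || return 1

  # Print only the status field of the add-on description
  aws eks describe-addon \
    --cluster-name "$1" \
    --addon-name amazon-sagemaker-hyperpod-inference \
    --region "$2" \
    --query 'addon.status' --output text
}

# Usage (same placeholders as the commands above):
# wait_for_inference_addon YOUR-CLUSTER us-west-2
```

If the waiter times out, the describe-addon output from step 5 is the place to look for the reported health issues.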

Method 2: Using AWS CLI

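To install from the command line, create the add-on directly through the EKS API. All prerequisites (IAM roles, S3 bucket, VPC endpoints, dependency add-ons) must be created manually beforehand; see the installation guide for details. Replace ACCOUNT-ID, REGION, CLUSTER-ID, and the cluster name in the command with your own values.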

aws eks create-addon \
  --cluster-name my-hyperpod-cluster \
  --addon-name amazon-sagemaker-hyperpod-inference \
  --addon-version v1.0.0-eksbuild.1 \
  --configuration-values '{
    "executionRoleArn": "arn:aws:iam::ACCOUNT-ID:role/SageMakerHyperPodInference-inference-role",
    "tlsCertificateS3Bucket": "hyperpod-tls-certificate-bucket",
    "hyperpodClusterArn": "arn:aws:sagemaker:REGION:ACCOUNT-ID:cluster/CLUSTER-ID",
    "alb": {
      "serviceAccount": {
        "create": true,
        "roleArn": "arn:aws:iam::ACCOUNT-ID:role/alb-controller-role"
      }
    },
    "keda": {
      "auth": {
        "aws": {
          "irsa": {
            "roleArn": "arn:aws:iam::ACCOUNT-ID:role/keda-operator-role"
          }
        }
      }
    }
  }' \
  --region us-west-2
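
Long inline JSON is easy to mistype, and the pinned version string may not match what is published for your cluster. One possible approach, sketched under assumptions: render the configuration to a file and pass it with the AWS CLI's file:// prefix, and list the published versions before choosing one. The helper names render_addon_config and list_addon_versions and the sample account and cluster IDs are hypothetical; the JSON keys are copied from the command above.

```shell
# Hypothetical helper: write the add-on configuration to a JSON file.
# $1 = AWS account ID, $2 = region, $3 = HyperPod cluster ID, $4 = output file
render_addon_config() {
  cat > "$4" <<EOF
{
  "executionRoleArn": "arn:aws:iam::${1}:role/SageMakerHyperPodInference-inference-role",
  "tlsCertificateS3Bucket": "hyperpod-tls-certificate-bucket",
  "hyperpodClusterArn": "arn:aws:sagemaker:${2}:${1}:cluster/${3}",
  "alb": { "serviceAccount": { "create": true, "roleArn": "arn:aws:iam::${1}:role/alb-controller-role" } },
  "keda": { "auth": { "aws": { "irsa": { "roleArn": "arn:aws:iam::${1}:role/keda-operator-role" } } } }
}
EOF
}

# Hypothetical helper: list the versions published for the add-on.
# $1 = AWS region
list_addon_versions() {
  aws eks describe-addon-versions \
    --addon-name amazon-sagemaker-hyperpod-inference \
    --region "$1" \
    --query 'addons[].addonVersions[].addonVersion' --output text
}

# Usage (sample values are placeholders):
# render_addon_config 111122223333 us-west-2 abc123 addon-config.json
# list_addon_versions us-west-2
# aws eks create-addon ... --configuration-values file://addon-config.json
```

Keeping the configuration in a file also makes it easy to review in version control before the add-on is created.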

Method 3: Infrastructure as Code with Terraform

If you manage your infrastructure with Terraform, use the modules from the awesome-distributed-training GitHub repo. Set the variable create_hyperpod_inference_operator_module = true in your custom.tfvars:

kubernetes_version = "1.33"
eks_cluster_name = "tf-eks-cluster"
hyperpod_cluster_name = "tf-hp-cluster"
resource_name_prefix = "tf-eks-test"
aws_region = "us-east-1"
instance_groups = [
  {
    name = "accelerated-instance-group-1"
    instance_type = "ml.g5.8xlarge"
    instance_count = 2
    availability_zone_id = "use1-az2"
    ebs_volume_size_in_gb = 100
    threads_per_core = 1
    enable_stress_check = false
    enable_connectivity_check = false
    lifecycle_script = "on_create.sh"
  }
]
create_hyperpod_inference_operator_module = true

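Run terraform apply -var-file=custom.tfvars and the add-on is deployed as part of the apply. Terraform loads only terraform.tfvars and *.auto.tfvars automatically, so a file named custom.tfvars has to be passed explicitly.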
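
For a first run it can help to review the plan before anything is created. A minimal sketch of that workflow, assuming it is run from the directory containing the Terraform modules; deploy_with_terraform and the plan file name hyperpod.tfplan are hypothetical, while the init, plan, and apply subcommands and their flags are standard Terraform:

```shell
# Hypothetical wrapper around the standard Terraform workflow.
# $1 = path to the variables file (custom.tfvars in this tutorial)
deploy_with_terraform() {
  # Download providers and modules
  terraform init || return 1
  # Save the plan so that apply executes exactly what was reviewed
  terraform plan -var-file="$1" -out=hyperpod.tfplan || return 1
  # Apply the saved plan
  terraform apply hyperpod.tfplan
}

# Usage:
# deploy_with_terraform custom.tfvars
```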

Deploying Your First Model

Once the add-on is installed, deploy a model using the JumpStartModel custom resource. Here's an example for DeepSeek-R1-Distill-Qwen-1.5B:

apiVersion: inference.sagemaker.aws.amazon.com/v1
kind: JumpStartModel
metadata:
  name: deepseek-test-endpoint
spec:
  model:
    modelId: "deepseek-llm-r1-distill-qwen-1-5b"
  sageMakerEndpoint:
    name: deepseek-test-endpoint
  server:
    instanceType: "ml.g5.8xlarge"

Save the manifest as deepseek-model.yaml, then apply it:

kubectl apply -f deepseek-model.yaml
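
After applying the manifest, you will usually want to confirm that the resource was accepted and that the endpoint answers. A rough sketch, with several assumptions: AWS CLI v2 is installed (the --cli-binary-format flag is v2-only), the endpoint name matches the manifest above, and the model accepts a JSON body of the form {"inputs": "..."}. That payload shape is a guess, so check the model's documentation. The helper names are hypothetical.

```shell
# Hypothetical helper: print the applied resource, including its status.
# $1 = manifest file
show_model_status() {
  kubectl get -f "$1" -o yaml
}

# Hypothetical helper: send one test prompt to the SageMaker endpoint.
# $1 = endpoint name, $2 = region, $3 = prompt (no double quotes), $4 = response file
invoke_endpoint() {
  aws sagemaker-runtime invoke-endpoint \
    --endpoint-name "$1" \
    --region "$2" \
    --content-type application/json \
    --cli-binary-format raw-in-base64-out \
    --body "{\"inputs\": \"$3\"}" \
    "$4"
}

# Usage:
# show_model_status deepseek-model.yaml
# invoke_endpoint deepseek-test-endpoint us-west-2 "What is 2 + 2?" response.json
```

The response body is written to the file given as the last argument; the command itself prints only metadata such as the content type.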

Multi-Instance Type Deployment

For higher availability, specify a prioritized list of instance types. The scheduler automatically falls back to the next available type:

apiVersion: inference.sagemaker.aws.amazon.com/v1
kind: InferenceEndpointConfig
metadata:
  name: lmcache-test-1
  namespace: default
spec:
  replicas: 13
  modelName: Llama-3.1-8B-Instruct
  instanceTypes: ["ml.p4d.24xlarge","ml.g5.24xlarge","ml.g5.8xlarge"]
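
Before writing the instanceTypes list, it is worth checking which instance types the cluster actually contains. A small sketch, assuming the nodes carry the well-known Kubernetes label node.kubernetes.io/instance-type; count_nodes_by_instance_type is a hypothetical helper name:

```shell
# Hypothetical helper: count cluster nodes per instance type.
count_nodes_by_instance_type() {
  # Print one instance-type label value per node, then tally and sort them
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.labels.node\.kubernetes\.io/instance-type}{"\n"}{end}' |
    awk 'NF { n[$1]++ } END { for (t in n) print t, n[t] }' |
    sort
}

# Usage:
# count_nodes_by_instance_type
```

Instance types that do not appear in this output cannot satisfy any entry in the list until matching nodes are added to the cluster.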

Node Affinity for Fine-Grained Scheduling

Use Kubernetes-native nodeAffinity to target specific instance types, exclude Spot instances, or prefer certain Availability Zones. The example below expresses a soft preference for ml.g5.4xlarge nodes:

apiVersion: inference.sagemaker.aws.amazon.com/v1
kind: InferenceEndpointConfig
metadata:
  name: lmcache-test-1
  namespace: default
spec:
  replicas: 15
  modelName: Llama-3.1-8B-Instruct
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      preference:
        matchExpressions:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["ml.g5.4xlarge"]
  worker:
    resources:
      limits:
        nvidia.com/gpu: "1"
      requests:
        cpu: "6"
        memory: 30Gi
        nvidia.com/gpu: "1"
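
Because preferredDuringSchedulingIgnoredDuringExecution is a soft rule in Kubernetes, pods can still be scheduled onto other nodes when the preferred ones are full. A quick way to see where the replicas landed, sketched with a hypothetical helper name and assuming the pods run in the namespace from the manifest:

```shell
# Hypothetical helper: list each pod with the node it was scheduled on.
# $1 = namespace
show_pod_placement() {
  kubectl get pods -n "$1" \
    -o custom-columns='POD:.metadata.name,NODE:.spec.nodeName,PHASE:.status.phase'
}

# Usage:
# show_pod_placement default
```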

Key Benefits & Limitations

What You Gain

  • Faster time to value: Deploy your first inference endpoint within minutes of cluster creation.
  • Reduced complexity: No more manual Helm charts, IAM role tweaks, or dependency wrangling.
  • Managed upgrades: One-click version updates with rollback capabilities.
  • Advanced features: Integrated tiered KV cache (up to 40% latency reduction for long-context workloads), intelligent routing, and built-in observability via Amazon Managed Grafana.

Limitations & Considerations

  • Prerequisite overhead: If using CLI or Terraform, you must still create IAM roles, S3 buckets, and VPC endpoints manually.
  • Dependency conflicts: If you already run cert-manager, KEDA, or ALB controller on your cluster, you must toggle off the add-on's bundled versions to avoid collisions.
  • Migration complexity: Existing Helm-based deployments must be moved with the AWS-provided migration script, which supports rollback.
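
To find out whether the dependency-conflict caveat applies to your cluster, you can look for the relevant controllers before installing. A rough sketch that matches on deployment names; the name patterns are assumptions based on the projects' usual defaults, and find_existing_controllers is a hypothetical helper:

```shell
# Hypothetical helper: list deployments whose names suggest a controller
# that the add-on would otherwise install itself.
find_existing_controllers() {
  kubectl get deployments --all-namespaces --no-headers \
    -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name' |
    grep -E 'cert-manager|keda|aws-load-balancer-controller' || true
}

# Usage:
# find_existing_controllers
```

Empty output suggests the bundled versions can stay enabled; any match is a candidate for disabling the corresponding bundled component.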

Next Steps

  1. Clean up after testing: delete the add-on via SageMaker console or aws eks delete-addon.
  2. Explore advanced features: Enable managed tiered KV cache for long-context LLMs or configure intelligent routing strategies.
  3. Scale your workloads: Combine with distributed Python training on Ray clusters for end-to-end ML pipelines.
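
The cleanup in step 1 can be sketched from the command line as follows. Deleting the model resource before the add-on is an assumption about a sensible order, not a documented requirement; cleanup_inference_addon is a hypothetical helper, while aws eks delete-addon, aws eks wait addon-deleted, and kubectl delete --ignore-not-found are standard commands:

```shell
# Hypothetical helper: remove the test model, then the add-on.
# $1 = EKS cluster name, $2 = AWS region, $3 = model manifest file
cleanup_inference_addon() {
  # Remove the deployed model while the operator is still running (assumed order)
  kubectl delete -f "$3" --ignore-not-found || return 1

  aws eks delete-addon \
    --cluster-name "$1" \
    --addon-name amazon-sagemaker-hyperpod-inference \
    --region "$2" || return 1

  # Block until EKS reports that the add-on is gone
  aws eks wait addon-deleted \
    --cluster-name "$1" \
    --addon-name amazon-sagemaker-hyperpod-inference \
    --region "$2"
}

# Usage:
# cleanup_inference_addon my-hyperpod-cluster us-west-2 deepseek-model.yaml
```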

Reference: AWS Architecture Blog – Unlock Efficient Model Deployment: Simplified Inference Operator Setup on Amazon SageMaker HyperPod

This content was drafted using AI tools based on reliable sources, and has been reviewed by our editorial team before publication. It is not intended to replace professional advice.