Why This Matters
Deploying inference workloads on Kubernetes-native infrastructure has traditionally required AI teams to navigate a maze of Helm charts, IAM role configurations, dependency management, and manual upgrades — often taking hours before a single model can serve predictions. The new Amazon SageMaker HyperPod Inference Operator, now available as a native EKS add-on, eliminates this complexity with one-click installation and managed upgrades.
Based on the official announcement from the AWS Architecture Blog, this tutorial walks you through the complete setup process, three installation methods, and a real deployment example.
Prerequisites
- An existing Amazon SageMaker HyperPod cluster with EKS orchestration
- IAM permissions for EKS cluster administration
- kubectl configured for cluster access
- (Optional) Terraform or AWS CLI if you prefer non-console workflows

Installation Methods
Method 1: One-Click via SageMaker Console (Recommended)
- Navigate to SageMaker Console → HyperPod Clusters → Cluster Management.
- Select your cluster and go to the Inference tab.
- Choose Quick Install (automated setup with defaults) or Custom Install (reuse existing IAM roles, S3 buckets, etc.).
- Click Install.
- Verify installation:
kubectl get pods -n hyperpod-inference-system
aws eks describe-addon --cluster-name YOUR-CLUSTER --addon-name amazon-sagemaker-hyperpod-inference --region us-west-2
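The status check above can also be scripted. The following is a minimal sketch, assuming the JSON printed by `aws eks describe-addon` has already been captured into a string. The helper name `addon_is_active` and the trimmed sample payload are illustrative and do not come from the announcement.

```python
import json


def addon_is_active(describe_addon_json: str) -> bool:
    """Return True when the add-on reports the ACTIVE status."""
    payload = json.loads(describe_addon_json)
    # `aws eks describe-addon` nests its fields under the "addon" key.
    return payload.get("addon", {}).get("status") == "ACTIVE"


# Hypothetical, trimmed-down sample output used only for illustration.
sample = json.dumps(
    {
        "addon": {
            "addonName": "amazon-sagemaker-hyperpod-inference",
            "status": "ACTIVE",
        }
    }
)
print(addon_is_active(sample))
```

Any other status, such as `CREATING` or `DEGRADED`, makes the helper return `False`, which is usually the signal to wait or to inspect the add-on's health details.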
Method 2: Using AWS CLI
For command-line enthusiasts, install directly via EKS APIs. All prerequisites (IAM roles, S3 bucket, VPC endpoints, dependency add-ons) must be created manually beforehand. See the installation guide for details.
aws eks create-addon \
  --cluster-name my-hyperpod-cluster \
  --addon-name amazon-sagemaker-hyperpod-inference \
  --addon-version v1.0.0-eksbuild.1 \
  --configuration-values '{
    "executionRoleArn": "arn:aws:iam::ACCOUNT-ID:role/SageMakerHyperPodInference-inference-role",
    "tlsCertificateS3Bucket": "hyperpod-tls-certificate-bucket",
    "hyperpodClusterArn": "arn:aws:sagemaker:REGION:ACCOUNT-ID:cluster/CLUSTER-ID",
    "alb": {
      "serviceAccount": {
        "create": true,
        "roleArn": "arn:aws:iam::ACCOUNT-ID:role/alb-controller-role"
      }
    },
    "keda": {
      "auth": {
        "aws": {
          "irsa": {
            "roleArn": "arn:aws:iam::ACCOUNT-ID:role/keda-operator-role"
          }
        }
      }
    }
  }' \
  --region us-west-2
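Hand-editing nested JSON inside a shell string is error-prone, so it can help to generate the `--configuration-values` payload programmatically. The sketch below mirrors the keys and role names from the command above. The function names and the sample account, cluster, and bucket values are hypothetical placeholders, and the script only prints the command rather than running it.

```python
import json
import shlex


def build_config_values(account_id: str, region: str, cluster_id: str, tls_bucket: str) -> dict:
    """Assemble the add-on configuration using the keys from the CLI example."""
    iam_prefix = f"arn:aws:iam::{account_id}:role"
    return {
        "executionRoleArn": f"{iam_prefix}/SageMakerHyperPodInference-inference-role",
        "tlsCertificateS3Bucket": tls_bucket,
        "hyperpodClusterArn": f"arn:aws:sagemaker:{region}:{account_id}:cluster/{cluster_id}",
        "alb": {
            "serviceAccount": {
                "create": True,
                "roleArn": f"{iam_prefix}/alb-controller-role",
            }
        },
        "keda": {
            "auth": {"aws": {"irsa": {"roleArn": f"{iam_prefix}/keda-operator-role"}}}
        },
    }


def build_create_addon_command(cluster_name: str, addon_version: str, region: str, config: dict) -> str:
    """Return a shell-quoted `aws eks create-addon` command line."""
    args = [
        "aws", "eks", "create-addon",
        "--cluster-name", cluster_name,
        "--addon-name", "amazon-sagemaker-hyperpod-inference",
        "--addon-version", addon_version,
        "--configuration-values", json.dumps(config),
        "--region", region,
    ]
    # shlex.join quotes the JSON payload so it survives the shell intact.
    return shlex.join(args)


# Placeholder values for illustration only.
config = build_config_values("123456789012", "us-west-2", "example-cluster-id", "hyperpod-tls-certificate-bucket")
print(build_create_addon_command("my-hyperpod-cluster", "v1.0.0-eksbuild.1", "us-west-2", config))
```

Generating the payload with `json.dumps` guarantees well-formed JSON, and printing the command first leaves room to review it before anything is created.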
Method 3: Infrastructure as Code with Terraform
If you manage your infrastructure with Terraform, use the modules from the awesome-distributed-training GitHub repo. Set the variable create_hyperpod_inference_operator_module = true in your custom.tfvars:
kubernetes_version    = "1.33"
eks_cluster_name      = "tf-eks-cluster"
hyperpod_cluster_name = "tf-hp-cluster"
resource_name_prefix  = "tf-eks-test"
aws_region            = "us-east-1"
instance_groups = [
  {
    name                      = "accelerated-instance-group-1"
    instance_type             = "ml.g5.8xlarge"
    instance_count            = 2
    availability_zone_id      = "use1-az2"
    ebs_volume_size_in_gb     = 100
    threads_per_core          = 1
    enable_stress_check       = false
    enable_connectivity_check = false
    lifecycle_script          = "on_create.sh"
  }
]
create_hyperpod_inference_operator_module = true
Run terraform apply and the add-on will be deployed automatically.

Deploying Your First Model
Once the add-on is installed, deploy a model using the JumpStartModel custom resource. Here's an example for DeepSeek-R1-Distill-Qwen-1.5B:
apiVersion: inference.sagemaker.aws.amazon.com/v1
kind: JumpStartModel
metadata:
  name: deepseek-test-endpoint
spec:
  model:
    modelId: "deepseek-llm-r1-distill-qwen-1-5b"
  sageMakerEndpoint:
    name: deepseek-test-endpoint
  server:
    instanceType: "ml.g5.8xlarge"
Apply it:
kubectl apply -f deepseek-model.yaml
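If you deploy several models, generating the manifest from parameters avoids copy-paste drift. The sketch below builds the same JumpStartModel resource as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. It assumes the usual nesting of `model`, `sageMakerEndpoint`, and `server` under `spec`, and the function name `jumpstart_model_manifest` is hypothetical.

```python
import json


def jumpstart_model_manifest(name: str, model_id: str, instance_type: str) -> dict:
    """Build a JumpStartModel manifest as a plain dictionary."""
    return {
        "apiVersion": "inference.sagemaker.aws.amazon.com/v1",
        "kind": "JumpStartModel",
        "metadata": {"name": name},
        "spec": {
            # Assumed nesting: each block sits directly under `spec`.
            "model": {"modelId": model_id},
            "sageMakerEndpoint": {"name": name},
            "server": {"instanceType": instance_type},
        },
    }


manifest = jumpstart_model_manifest(
    "deepseek-test-endpoint",
    "deepseek-llm-r1-distill-qwen-1-5b",
    "ml.g5.8xlarge",
)
print(json.dumps(manifest, indent=2))
```

Redirecting the output to a file and passing that file to `kubectl apply -f` should behave the same as applying the hand-written YAML.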
Multi-Instance Type Deployment
For higher availability, specify a prioritized list of instance types. The scheduler automatically falls back to the next available type:
apiVersion: inference.sagemaker.aws.amazon.com/v1
kind: InferenceEndpointConfig
metadata:
  name: lmcache-test-1
  namespace: default
spec:
  replicas: 13
  modelName: Llama-3.1-8B-Instruct
  instanceTypes: ["ml.p4d.24xlarge","ml.g5.24xlarge","ml.g5.8xlarge"]
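To make the fallback idea concrete, here is a small sketch of priority-ordered selection. It is a conceptual illustration only, not the operator's actual scheduling logic; the function name `pick_instance_type` and the `free_capacity` map are hypothetical.

```python
from typing import Mapping, Optional, Sequence


def pick_instance_type(priorities: Sequence[str], free_capacity: Mapping[str, int]) -> Optional[str]:
    """Return the first instance type, in priority order, that has free capacity."""
    for instance_type in priorities:
        if free_capacity.get(instance_type, 0) > 0:
            return instance_type
    # Nothing in the list can host a replica right now.
    return None


priorities = ["ml.p4d.24xlarge", "ml.g5.24xlarge", "ml.g5.8xlarge"]
# Hypothetical capacity snapshot: the preferred type is fully used.
free_capacity = {"ml.p4d.24xlarge": 0, "ml.g5.24xlarge": 2, "ml.g5.8xlarge": 5}
print(pick_instance_type(priorities, free_capacity))  # ml.g5.24xlarge
```

The order of the list matters: the first entry is the preferred type, and later entries are used only when earlier ones have no room.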
Node Affinity for Fine-Grained Scheduling
Use Kubernetes native nodeAffinity to target specific instance types, exclude spot instances, or prefer certain availability zones:
apiVersion: inference.sagemaker.aws.amazon.com/v1
kind: InferenceEndpointConfig
metadata:
  name: lmcache-test-1
  namespace: default
spec:
  replicas: 15
  modelName: Llama-3.1-8B-Instruct
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
            - key: node.kubernetes.io/instanceType
              operator: In
              values: ["ml.g5.4xlarge"]
  worker:
    resources:
      limits:
        nvidia.com/gpu: "1"
      requests:
        cpu: "6"
        memory: 30Gi
        nvidia.com/gpu: "1"
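A preferred affinity rule is a soft constraint: nodes that match earn the rule's weight as a scoring bonus, while nodes that do not match remain eligible. The sketch below approximates that scoring for the `In` and `NotIn` operators only. It is a simplified illustration of the Kubernetes behavior rather than scheduler code, the helper names are hypothetical, and the label key is copied verbatim from the manifest above.

```python
from typing import Mapping, Sequence


def expression_matches(expr: Mapping, labels: Mapping[str, str]) -> bool:
    """Evaluate one matchExpression against a node's labels."""
    value = labels.get(expr["key"])
    if expr["operator"] == "In":
        return value in expr["values"]
    if expr["operator"] == "NotIn":
        # A missing label also satisfies NotIn.
        return value not in expr["values"]
    raise ValueError(f"operator not covered by this sketch: {expr['operator']}")


def preference_score(preferred_terms: Sequence[Mapping], labels: Mapping[str, str]) -> int:
    """Sum the weights of all preferred terms whose expressions all match."""
    score = 0
    for term in preferred_terms:
        expressions = term["preference"]["matchExpressions"]
        if all(expression_matches(e, labels) for e in expressions):
            score += term["weight"]
    return score


terms = [
    {
        "weight": 100,
        "preference": {
            "matchExpressions": [
                {
                    "key": "node.kubernetes.io/instanceType",
                    "operator": "In",
                    "values": ["ml.g5.4xlarge"],
                }
            ]
        },
    }
]
print(preference_score(terms, {"node.kubernetes.io/instanceType": "ml.g5.4xlarge"}))  # 100
print(preference_score(terms, {"node.kubernetes.io/instanceType": "ml.g5.8xlarge"}))  # 0
```

A node that scores 0 can still receive pods if no better-scoring node has capacity, which is the difference between the preferred and required forms of affinity.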

Key Benefits & Limitations
What You Gain
- Faster time to value: Deploy your first inference endpoint within minutes of cluster creation.
- Reduced complexity: No more manual Helm charts, IAM role tweaks, or dependency wrangling.
- Managed upgrades: One-click version updates with rollback capabilities.
- Advanced features: Integrated tiered KV cache (up to 40% latency reduction for long-context workloads), intelligent routing, and built-in observability via Amazon Managed Grafana.
Limitations & Considerations
- Prerequisite overhead: If using CLI or Terraform, you must still create IAM roles, S3 buckets, and VPC endpoints manually.
- Dependency conflicts: If you already run cert-manager, KEDA, or ALB controller on your cluster, you must toggle off the add-on's bundled versions to avoid collisions.
- Migration complexity: Existing Helm-based deployments require running a migration script (provided by AWS) with rollback support.
Next Steps
- Clean up after testing: delete the add-on via the SageMaker console or aws eks delete-addon.
- Explore advanced features: Enable managed tiered KV cache for long-context LLMs or configure intelligent routing strategies.
- Scale your workloads: Combine with distributed Python training on Ray clusters for end-to-end ML pipelines.