AWS Cluster Create and Manage
Before you begin, you'll need a Kubernetes cluster. Once it's created, you can reuse it as many times as needed. If you're sharing a cluster that another user has already created, you can skip this section and go to Sharing your K8s Cluster.
SigOpt currently only supports launching a Kubernetes (K8s) cluster on AWS. If you have a Kubernetes cluster you'd like to use for your orchestration, please see the Bring your own cluster page for instructions on how to do so.
AWS clusters created by SigOpt have autoscaling enabled. For autoscaling to work properly, SigOpt needs to provision at least one system node that runs at all times. This allows your cluster to scale down your more expensive compute nodes when they aren't being used.
If your local development environment is not already configured to use AWS, the easiest way to get started is to configure your AWS Access Key and AWS Secret Key via the aws command line interface:
$ aws configure
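The command interactively prompts for your credentials and defaults. As a sketch, a session looks like the following (the key values shown are AWS's documentation placeholders; the region and output format are illustrative choices):

```
$ aws configure
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name [None]: us-west-2
Default output format [None]: json
```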
SigOpt requires that AWS accounts creating clusters have access to the following services:
- AutoScaling full access
- CloudFormation full access
- IAM full access
- EC2 full access
- ECR full access
- EKS full access
- SSM full access
- S3 full access
If you are an account admin, you may already have the correct permissions. Otherwise, for your convenience we have created a JSON policy document that you can use to create an IAM Policy for yourself or other users:
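As an illustrative sketch, a policy document granting full access to the services listed above could look like the following. This version is intentionally broad (`*` actions and resources, matching the "full access" requirements); scope it down to fit your organization's security requirements:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "autoscaling:*",
        "cloudformation:*",
        "iam:*",
        "ec2:*",
        "ecr:*",
        "eks:*",
        "ssm:*",
        "s3:*"
      ],
      "Resource": "*"
    }
  ]
}
```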
Next, you can use the awscli (installed with SigOpt) to create this policy for your team:

$ aws iam create-policy \
    --policy-name sigopt-orchestrate-full-access \
    --policy-document file://policy.json

Here, policy.json is the JSON policy document saved to a local file.
SigOpt clusters with GPU machines use an AWS-managed EKS-optimized AMI with GPU support. To use this AMI, AWS requires that you accept an end user license agreement (EULA). You can do this by subscribing to the AMI in the AWS Marketplace.
The cluster configuration file is commonly referred to as cluster.yml, but you can name yours anything you like. The file is used when you create a SigOpt cluster with sigopt cluster create -f cluster.yml. After the cluster has been created, you can update the configuration file to change the number of nodes in your cluster or change instance types, then apply the changes by running sigopt cluster update -f cluster.yml. Some updates might not be supported, for example introducing GPU nodes to your cluster in some regions. If an update is not supported, you will need to destroy the cluster and create it again.
The available fields are:
You must provide at least one of either a CPU node configuration or a GPU node configuration for your cluster.
You must provide a name for your cluster. You will share this with anyone else who wants to connect to your cluster.
Override environment-provided values for your AWS credentials.
The version of Kubernetes to use for your cluster. Currently supports Kubernetes 1.16, 1.17, 1.18, and 1.19. Defaults to the latest stable version supported by SigOpt, which is currently 1.18.
Currently, AWS is our only supported provider for creating clusters. You can, however, use a custom provider to connect to your own Kubernetes cluster with the sigopt cluster connect command.
System nodes are required to run the autoscaler. You can specify the number and type of system nodes in your cluster configuration file.
The example YAML file below defines a CPU cluster named tiny-cluster:
# AWS is currently our only supported provider for cluster create
# You can connect to custom clusters via `sigopt cluster connect`
provider: aws

# We have provided a name that is short and descriptive
cluster_name: tiny-cluster

# Your cluster config can have CPU nodes, GPU nodes, or both.
# The configuration of your nodes is defined in the sections below.

# (Optional) Define CPU compute here
cpu:
  # AWS instance type
  instance_type: t2.medium
  max_nodes: 2
  min_nodes: 0

# # (Optional) Define GPU compute here
# gpu:
#   # AWS GPU-enabled instance type
#   # This can be any p* instance type
#   instance_type: p2.xlarge
#   max_nodes: 2
#   min_nodes: 0
It is tempting to choose an instance type that exactly matches the needs of a single training run. Because SigOpt is focused on experimentation, however, you will likely be running many training runs at once, maybe even hundreds. For this reason it is a good idea to make your cluster as efficient as possible.
Each node reserves some amount of resources for the system and for Kubernetes system pods, so your runs will not be able to use 100% of the resources on each node. If you choose larger instances for your cluster, your training runs will be able to use closer to 100% of each node's resources.
Another reason to choose larger instance types is to support varying workloads. If you switch to a different project, invite another user to share your cluster, or simply change the resources used by each training run, you will benefit from instance types that have more resources.
To create the cluster on AWS, run:
$ sigopt cluster create -f cluster.yml
Cluster creation can take between 15 and 30 minutes. If you notice an error, try re-running the same command; SigOpt will reuse the same EKS cluster, so the second run will be much faster.
Test that your cluster was created correctly:
$ sigopt cluster test
SigOpt will respond with:
Successfully connected to kubernetes cluster: tiny-cluster
Destroying your cluster can take between 15 and 30 minutes. To destroy your cluster, run:
$ sigopt cluster destroy
You can grant other users permission to run on your Kubernetes cluster by modifying the relevant IAM Role.
SigOpt creates a role for every cluster, named <cluster-name>-k8s-access-role, which we call the cluster access role. SigOpt uses the cluster access role under the hood to access your cluster. To allow a second user to access the cluster, modify the cluster access role's trust relationship. See the AWS instructions for Modifying a Role for how to change the trust relationship.
Below is an example trust relationship from a newly created cluster:
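As an illustrative sketch (the account ID and user name are placeholders), a newly created cluster's trust relationship typically allows only the creating user to assume the cluster access role:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:user/cluster-creator"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```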
Below is an example trust relationship from a cluster which two people can access:
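An illustrative version with a second user added to the Principal list (again, the account ID and user names are placeholders):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "arn:aws:iam::123456789012:user/cluster-creator",
          "arn:aws:iam::123456789012:user/second-user"
        ]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```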
After the IAM Role has been modified, new users can run:
$ sigopt cluster connect --cluster-name <cluster-name> --provider aws
Now, the second user should be able to run commands on the cluster. Try running something simple, such as:
$ sigopt cluster test