AWS Cluster Create and Manage
Before you begin, you'll need a Kubernetes cluster. Once it's created, you can reuse that cluster as many times as needed. Alternatively, if you're sharing a cluster that's already been created by another user, you can skip this section and go to Share your Kubernetes cluster below.
SigOpt currently only supports launching a Kubernetes (K8s) cluster on AWS. If you have a Kubernetes cluster you'd like to use for your orchestration, please see the Bring your own cluster page for instructions on how to do so.
AWS clusters created by SigOpt will have autoscaling enabled. In order for autoscaling to work properly, SigOpt needs to provision at least 1 system node to run at all times. This will allow your cluster to scale down your expensive compute nodes when they aren't being used.
AWS Configuration
If your local development environment is not already configured to use AWS, the easiest way to get started is to configure your AWS Access Key and AWS Secret Key via the aws command line interface:
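For example, using the standard awscli prompt (your existing profiles and environment variables may already cover this):

```bash
# Prompts for your AWS Access Key ID, AWS Secret Access Key,
# default region, and default output format
aws configure
```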
See the AWS docs for more about configuring your AWS credentials.
Enable Full Access
SigOpt requires that AWS accounts creating clusters have access to the following services:
AutoScaling full access
CloudFormation full access
IAM full access
EC2 full access
ECR full access
EKS full access
SSM full access
S3 full access
If you are an account admin, you may already have the correct permissions. Otherwise, for your convenience we have created a JSON policy document that you can use to create an IAM Policy for yourself or other users:
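If you don't have that document handy, a minimal sketch of such a policy, granting full access to the services listed above, looks like the following (the document provided by SigOpt may be scoped differently):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "autoscaling:*",
        "cloudformation:*",
        "iam:*",
        "ec2:*",
        "ecr:*",
        "eks:*",
        "ssm:*",
        "s3:*"
      ],
      "Resource": "*"
    }
  ]
}
```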
Next, you can use the awscli (installed with SigOpt) to create this policy for your team:
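For example, assuming you saved the policy document above as sigopt-policy.json (the policy name here is just a placeholder):

```bash
# Create the managed policy (name is a placeholder)
aws iam create-policy \
  --policy-name sigopt-cluster-access \
  --policy-document file://sigopt-policy.json

# Attach it to each teammate who will create or manage clusters
aws iam attach-user-policy \
  --user-name <teammate-user-name> \
  --policy-arn arn:aws:iam::<account-id>:policy/sigopt-cluster-access
```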
SigOpt clusters with GPU machines use an AWS-managed, EKS-optimized AMI with GPU support. To use this AMI, AWS requires that you accept an end user license agreement (EULA), which you can do by subscribing to the AMI in the AWS Marketplace.
Cluster Configuration File
The cluster configuration file is commonly referred to as `cluster.yml`, but you can name yours anything you like. The file is used when you create a SigOpt cluster with `sigopt cluster create -f cluster.yml`. You can update your cluster configuration file after the cluster has been created to change the number of nodes in your cluster or change instance types. These changes can be applied by running `sigopt cluster update -f cluster.yml`. Some updates might not be supported, for example introducing GPU nodes to your cluster in some regions. If the update is not supported, you will need to destroy the cluster and create it again.
The available fields are:
| Field | Required | Description |
| --- | --- | --- |
| `cpu` or `gpu` | Yes | You must provide at least one of either `cpu` or `gpu`. Define the compute that your cluster will need in terms of `instance_type`, `max_nodes`, and `min_nodes`. It is recommended that you set `min_nodes` to 0 so the autoscaler can remove all of your expensive compute nodes when they aren't in use. It's ok if `max_nodes` and `min_nodes` are the same value, as long as `max_nodes` is not 0. |
| `cluster_name` | Yes | You must provide a name for your cluster. You will share this name with anyone else who wants to connect to your cluster. |
| `aws` | No | Override environment-provided values for `aws_access_key_id` or `aws_secret_access_key`. |
| `kubernetes_version` | No | The version of Kubernetes to use for your cluster. Currently supports Kubernetes 1.16, 1.17, 1.18, and 1.19. Defaults to the latest stable version supported by SigOpt, which is currently 1.18. |
| `provider` | No | Currently, AWS is the only supported provider for creating clusters. You can, however, use a custom provider to connect to your own Kubernetes cluster with `sigopt cluster connect`. See the page on Bringing your own K8s cluster. |
| `system` | No | System nodes are required to run the autoscaler. You can specify the number and type of system nodes with `min_nodes`, `max_nodes`, and `instance_type`. The value of `min_nodes` must be at least 1 so that you always have at least one system node. The defaults for `system` are `min_nodes: 1`, `max_nodes: 2`, and `instance_type: "t3.large"`. |
Example
The example YAML file below defines a CPU cluster named `tiny-cluster` with two `t2.small` AWS instances.
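A sketch of that file, assuming the fields described in the table above (set `min_nodes` to 2 instead if you want both instances running at all times):

```yaml
# cluster.yml
cluster_name: tiny-cluster
cpu:
  instance_type: t2.small
  max_nodes: 2
  min_nodes: 0
```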
Notes on Choosing Instance Types
It is tempting to choose an instance type that exactly matches the needs of a single training run. Because SigOpt is focused on experimentation, you will likely be running more than one training run at a time, maybe even hundreds at a time. For this reason it is a good idea to make your cluster as efficient as possible.
Each node will reserve some amount of resources for the system and for Kubernetes system pods. Because of this, your runs will not be able to use 100% of the resources on each node. If you choose larger instances for your cluster, your training runs will be able to use closer to 100% of the resources on each node.
Another reason to choose larger instance types is to support varying workloads. If you switch to a different project, invite another user to share your cluster, or even just change the resources used by each training run, you will benefit from instance types that have more resources.
Create Cluster
To create the cluster on AWS, run:
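Using the configuration file from the example above:

```bash
sigopt cluster create -f cluster.yml
```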
Cluster creation can take between 15 and 30 minutes. If you notice an error, please try re-running the same command. SigOpt will reuse the same EKS cluster, so the second run will be much faster.
Check Cluster Status
Test that your cluster was created correctly:
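Assuming the same CLI pattern as the cluster commands above (the subcommand name is an assumption; see `sigopt cluster --help` for the exact options):

```bash
sigopt cluster test
```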
SigOpt will respond with a summary of your cluster's status.
Destroy your Cluster
Destroying your cluster can take between 15 and 30 minutes. To destroy your cluster, run:
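A sketch of the command, assuming the same CLI pattern as above (the exact flags may differ; check `sigopt cluster destroy --help`):

```bash
sigopt cluster destroy --cluster-name tiny-cluster
```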
Share your Kubernetes cluster
You can grant other users permission to run on your Kubernetes cluster by modifying the relevant IAM Role.
SigOpt creates a role for every cluster, named `<cluster-name>-k8s-access-role`, which we call the cluster access role. SigOpt uses the cluster access role under the hood to access your cluster. To allow a second user to access the cluster, modify the cluster access role's trust relationship to give another user access. See the instructions for Modifying a Role on AWS for how to change the trust relationship.
Below is an example trust relationship from a newly created cluster:
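The account ID and user name below are placeholders; the document generated for your cluster may include additional statements:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:user/cluster-creator"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```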
Below is an example trust relationship from a cluster which two people can access:
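Again with placeholder ARNs, the only change is that the Principal now lists both users:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "arn:aws:iam::123456789012:user/cluster-creator",
          "arn:aws:iam::123456789012:user/second-user"
        ]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```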
After the IAM Role has been modified, new users can run:
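For example (flag names are assumptions; see `sigopt cluster connect --help` for the exact options):

```bash
sigopt cluster connect --provider aws --cluster-name tiny-cluster
```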
Now, the second user should be able to run commands on the cluster. Try running something simple, such as:
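For instance, the same status check used earlier (subcommand name assumed, as above):

```bash
sigopt cluster test
```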