AWS Cluster Create and Manage

Before you begin, you'll need a Kubernetes cluster. Once it's created, you can reuse that cluster as many times as needed. If you're sharing a cluster that another user has already created, you can skip this section and go to Share your Kubernetes cluster.

SigOpt currently only supports launching a Kubernetes (K8s) cluster on AWS. If you have a Kubernetes cluster you'd like to use for your orchestration, please see the Bring your own cluster page for instructions on how to do so.

AWS clusters created by SigOpt have autoscaling enabled. For autoscaling to work properly, SigOpt provisions at least one system node that runs at all times. This allows your cluster to scale down your expensive compute nodes when they aren't being used.

AWS Configuration

If your local development environment is not already configured to use AWS, the easiest way to get started is to configure your AWS Access Key and AWS Secret Key via the aws command line interface:

$ aws configure

See the AWS docs for more about configuring your AWS credentials.
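Alternatively, the AWS CLI (and the SDKs that share its configuration) will read credentials from environment variables. A minimal sketch, with placeholder values and an illustrative region:

# All values below are placeholders; substitute your own
$ export AWS_ACCESS_KEY_ID=<your-access-key-id>
$ export AWS_SECRET_ACCESS_KEY=<your-secret-access-key>
$ export AWS_DEFAULT_REGION=us-west-2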

Enable Full Access

SigOpt requires that AWS accounts creating clusters have access to the following services:

  • AutoScaling full access

  • CloudFormation full access

  • IAM full access

  • EC2 full access

  • ECR full access

  • EKS full access

  • SSM full access

  • S3 full access

If you are an account admin, you may already have the correct permissions. Otherwise, for your convenience we have created a JSON policy document that you can use to create an IAM Policy for yourself or other users:

{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Sid":"VisualEditor0",
         "Effect":"Allow",
         "Action":[
            "iam:*",
            "ecr:*",
            "ec2:*",
            "cloudformation:*",
            "autoscaling:*",
            "eks:*",
            "ssm:*",
            "s3:*"
         ],
         "Resource":"*"
      }
   ]
}

Next, download the policy document and use the awscli (installed with SigOpt) to create this policy for your team:

$ curl -O https://raw.githubusercontent.com/sigopt/sigopt-examples/master/orchestrate/aws_iam/orchestrate_full_access_policy_document.json
$ aws iam create-policy \
  --policy-name sigopt-orchestrate-full-access \
  --policy-document file://orchestrate_full_access_policy_document.json
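Creating the policy does not grant it to anyone by itself; it still needs to be attached to each user who will manage clusters. As a sketch, assuming a hypothetical user name jane and a placeholder account ID:

# The user name and account ID below are placeholders
$ aws iam attach-user-policy \
  --user-name jane \
  --policy-arn arn:aws:iam::123456789012:policy/sigopt-orchestrate-full-access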

SigOpt clusters with GPU machines use an AWS-managed, EKS-optimized AMI with GPU support. To use this AMI, AWS requires that you accept an end user license agreement (EULA), which you can do by subscribing to the AMI in the AWS Marketplace.

Cluster Configuration File

The cluster configuration file is commonly referred to as cluster.yml, but you can name yours anything you like. It is used when you create a SigOpt cluster with sigopt cluster create -f cluster.yml. After the cluster has been created, you can update the configuration file to change the number of nodes or the instance types, and apply those changes by running sigopt cluster update -f cluster.yml. Some updates are not supported (for example, introducing GPU nodes to your cluster in some regions); if an update is not supported, you will need to destroy the cluster and create it again.
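For example, a typical lifecycle might look like:

# Create the cluster from your configuration file
$ sigopt cluster create -f cluster.yml

# Later: edit cluster.yml (e.g. raise max_nodes), then apply the change
$ sigopt cluster update -f cluster.yml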

The available fields are:

cpu, gpu (at least one required)

You must provide at least one of cpu or gpu. Define the compute that your cluster will need in terms of: instance_type, max_nodes, and min_nodes. We recommend setting min_nodes to 0 so the autoscaler can remove all of your expensive compute nodes when they aren't in use. It's fine for max_nodes and min_nodes to be the same value, as long as max_nodes is not 0.

cluster_name (required)

A name for your cluster. You will share this name with anyone else who wants to connect to your cluster.

aws (optional)

Override environment-provided values for aws_access_key_id or aws_secret_access_key.

kubernetes_version (optional)

The version of Kubernetes to use for your cluster. Currently supports Kubernetes 1.16, 1.17, 1.18, and 1.19. Defaults to the latest stable version supported by SigOpt, which is currently 1.18.

provider (optional)

Currently, AWS is our only supported provider for creating clusters. You can, however, use a custom provider to connect to your own Kubernetes cluster with the sigopt cluster connect command. See the Bring your own cluster page.

system (optional)

System nodes are required to run the autoscaler. You can specify the number and type of system nodes with min_nodes, max_nodes, and instance_type. The value of min_nodes must be at least 1 so that you always have at least one system node. The defaults for system are listed below, followed by a sketch of how to override them:

  • min_nodes: 1

  • max_nodes: 2

  • instance_type: "t3.large"
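A hedged sketch of overriding these system defaults in cluster.yml (the values shown are illustrative, not recommendations):

# cluster.yml (excerpt)
system:
  # Illustrative values; min_nodes must stay at least 1
  instance_type: t3.xlarge
  min_nodes: 1
  max_nodes: 3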

Example

The example YAML file below defines a CPU cluster named tiny-cluster with up to two t2.small AWS instances.

# cluster.yml

# AWS is currently our only supported provider for cluster create
# You can connect to custom clusters via `sigopt cluster connect`
provider: aws

# We have provided a name that is short and descriptive
cluster_name: tiny-cluster

# Your cluster config can have CPU nodes, GPU nodes, or both.
# The configuration of your nodes is defined in the sections below.

# (Optional) Define CPU compute here
cpu:
  # AWS instance type
  instance_type: t2.small
  max_nodes: 2
  min_nodes: 0

# # (Optional) Define GPU compute here
# gpu:
#   # AWS GPU-enabled instance type
#   # This can be any p* instance type
#   instance_type: p2.xlarge
#   max_nodes: 2
#   min_nodes: 0

# Must be one of the Kubernetes versions supported by SigOpt
kubernetes_version: '1.19'

Notes on Choosing Instance Types

It is tempting to choose an instance type that exactly matches the needs of a single training run. But because SigOpt is focused on experimentation, you will likely be running more than one training run at a time, maybe even hundreds, so it's a good idea to make your cluster as efficient as possible.

Each node reserves some amount of resources for the system and for Kubernetes system pods, so your runs will never be able to use 100% of the resources on each node. If you choose larger instances for your cluster, your training runs will be able to use closer to 100% of each node's resources.

Another reason to choose larger instance types is to support varying workloads: if you switch to a different project, invite another user to share your cluster, or simply change the resources used by each training run, instance types with more headroom will serve you better, as the sketch below illustrates.
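As a hedged illustration (the instance sizes here are arbitrary, not recommendations), a few larger nodes often serve varied workloads better than many small ones:

# cluster.yml (excerpt): fewer, larger CPU nodes
cpu:
  instance_type: m5.2xlarge  # illustrative; choose a size with headroom
  max_nodes: 4
  min_nodes: 0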

Create Cluster

To create the cluster on AWS, run:

$ sigopt cluster create -f cluster.yml

Cluster creation can take between 15 and 30 minutes. If you notice an error, try re-running the same command. SigOpt will reuse the same EKS cluster, so the second run will be much faster.

Check Cluster Status

Test that your cluster was created correctly:

$ sigopt cluster test

SigOpt will respond with:

Successfully connected to kubernetes cluster: tiny-cluster

Destroy your Cluster

Destroying your cluster can also take between 15 and 30 minutes. To destroy your cluster, run:

$ sigopt cluster destroy

Share your Kubernetes cluster

You can grant other users permission to run on your Kubernetes cluster by modifying the relevant IAM Role.

SigOpt creates a role for every cluster, named <cluster-name>-k8s-access-role, which we call the cluster access role. SigOpt uses this role under the hood to access your cluster. To allow a second user to access the cluster, modify the cluster access role's trust relationship. See the AWS instructions for Modifying a Role for how to change the trust relationship.

Below is an example trust relationship from a newly created cluster:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789:user/alexandra"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

Below is an example trust relationship from a cluster which two people can access:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "arn:aws:iam::123456789:user/alexandra",
          "arn:aws:iam::123456789:user/ben"
        ]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
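If you prefer the CLI to the AWS console, one way to apply an edited trust policy is aws iam update-assume-role-policy. A sketch, assuming your cluster is named tiny-cluster and the edited document is saved as trust.json (a hypothetical filename):

# Role name follows the <cluster-name>-k8s-access-role convention above
$ aws iam update-assume-role-policy \
  --role-name tiny-cluster-k8s-access-role \
  --policy-document file://trust.json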

After the IAM Role has been modified, new users can run:

$ sigopt cluster connect --cluster-name <cluster-name> --provider aws

Now, the second user should be able to run commands on the cluster. Try running something simple, such as:

$ sigopt cluster test
