Orchestrate an HPO Experiment
In this part of the docs, we will walk through how to execute an HPO experiment on a Kubernetes cluster using SigOpt. This walkthrough assumes SigOpt is already connected to a Kubernetes cluster of your choice.

Set Up

First, test whether you are connected to a cluster with SigOpt by running:

$ sigopt cluster test
SigOpt will output:

Successfully connected to kubernetes cluster: tiny-cluster
If you're using a custom Kubernetes cluster, you will need to install plugins to get the controller image working:

$ sigopt cluster install-plugins
SigOpt expects all of the files for your model to be located in the same folder. Create an example directory (mkdir), then change directories (cd) into it:

$ mkdir example && cd example
Then auto-generate templates for a Dockerfile and a SigOpt configuration YAML file:

$ sigopt init
Next, you will create some files and put them in this example directory.

Dockerfile: Define your model environment

For this tutorial, we'll be using a very simple Dockerfile. For instructions on how to specify more requirements, see our guide on Dockerfiles. Copy and paste the following snippet into the autogenerated file named Dockerfile.
FROM python:3.9

RUN pip install --no-cache-dir sigopt

RUN pip install --no-cache-dir scipy==1.7.1
RUN pip install --no-cache-dir scikit-learn==0.24.2
RUN pip install --no-cache-dir numpy==1.21.2

COPY . /sigopt
WORKDIR /sigopt

Define a Model

This code defines a simple SGDClassifier model that measures accuracy classifying labels for the Iris flower dataset. Copy and paste the snippet below into a file titled model.py. Note that the snippet uses SigOpt's Runs to track model attributes.
# model.py
# SGDClassifier example

# You'll use the SigOpt Training Runs API to communicate with SigOpt
# while your model is running on the cluster.
import sigopt


# These packages will need to be installed in order to run your model.
# To do this, define a requirements.txt file, and provide instructions
# for installing it in your Dockerfile.
from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
import numpy

# https://en.wikipedia.org/wiki/Iris_flower_data_set
def load_data():
    iris = datasets.load_iris()
    return (iris.data, iris.target)


def evaluate_model(X, y):
    sigopt.params.setdefaults(
        loss="log",
        penalty="elasticnet",
        log_alpha=-4,
        l1_ratio=0.15,
        max_iter=1000,
        tol=0.001,
    )
    classifier = SGDClassifier(
        loss=sigopt.params.loss,
        penalty=sigopt.params.penalty,
        alpha=10 ** sigopt.params.log_alpha,
        l1_ratio=sigopt.params.l1_ratio,
        max_iter=sigopt.params.max_iter,
        tol=sigopt.params.tol,
    )
    cv_accuracies = cross_val_score(classifier, X, y, cv=5)
    return (numpy.mean(cv_accuracies), numpy.std(cv_accuracies))


# Each execution of model.py should represent one evaluation of your model.
# When this file is run, it loads data, evaluates the model using assignments
# from sigopt.params, and logs the resulting accuracy back to SigOpt.
if __name__ == "__main__":
    (X, y) = load_data()
    (mean, std) = evaluate_model(X=X, y=y)
    print("Accuracy: {} +/- {}".format(mean, std))
    sigopt.log_metric("accuracy", mean, std)
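One detail worth noting in the snippet: log_alpha is tuned on a log scale and converted back with 10 ** log_alpha, so the optimizer explores regularization strengths across orders of magnitude rather than linearly. A minimal sketch of that mapping (the helper function name below is illustrative, not part of the model):

```python
# Hypothetical helper illustrating the log-scale parameterization used above:
# log_alpha in [-5, 2] maps to alpha in [1e-5, 100].
def alpha_from_log(log_alpha):
    return 10 ** log_alpha

print(alpha_from_log(-4))  # the default set via sigopt.params.setdefaults
```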

Notes on implementing your model

When your model runs on a node in the cluster, it can use all of the CPUs on that node via multithreading. This is good for performance if your model is the only process running on the node, but in many cases it will need to share those CPUs with other processes (e.g. other model runs). For this reason, it is a good idea to limit the number of threads your model library can create to match the amount of cpu specified in your resources_per_model. How to do this varies by implementation, but some common libraries are listed below:
NumPy
Threads spawned by NumPy can be configured with environment variables, which can be set in your Dockerfile (replace N with the cpu value from your resources_per_model):

ENV MKL_NUM_THREADS=N
ENV NUMEXPR_NUM_THREADS=N
ENV OMP_NUM_THREADS=N
TensorFlow/Keras
Threads can be configured through the TensorFlow threading API, see: https://www.tensorflow.org/api_docs/python/tf/config/threading
PyTorch
Threads can be configured in the PyTorch module, see: https://pytorch.org/docs/stable/generated/torch.set_num_threads.html
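The same environment-variable limits can also be applied from inside your model code rather than the Dockerfile, as long as they are set before the scientific libraries are imported. A minimal sketch, assuming a resources_per_model cpu value of 2 (the CPU_COUNT name is illustrative):

```python
import os

# Assumed cpu value from resources_per_model; adjust to match your config.
CPU_COUNT = "2"

# These variables are read when NumPy first loads its threading backends,
# so set them before importing numpy or anything that imports it.
for var in ("MKL_NUM_THREADS", "NUMEXPR_NUM_THREADS", "OMP_NUM_THREADS"):
    os.environ.setdefault(var, CPU_COUNT)
```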

Create an orchestration configuration file

Here's a sample SigOpt configuration file that runs the model.py defined above and specifies the compute resources each run needs.
Copy and paste the following into a file named run.yml.
# Choose a descriptive name for your model
name: SGD Classifier

# Here, we run the model
run: python model.py
resources:
  requests:
    cpu: 0.5
    memory: 512Mi
  limits:
    cpu: 2
    memory: 512Mi

# We don't need any GPUs for this example, so we'll leave this commented out
# gpus: 1

# SigOpt creates a container for your model. Since we're using an AWS
# cluster, it's easy to securely store the model in the Amazon Elastic Container Registry.
# Choose a descriptive and unique name for each new experiment configuration file.
image: sgd-classifier
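The resources block follows Kubernetes conventions: cpu may be fractional, and memory uses binary suffixes such as Mi (mebibytes). A quick sketch of what 512Mi amounts to in bytes (the helper name is ours, for illustration only):

```python
# 1 Mi (mebibyte) = 2**20 bytes, per Kubernetes resource-quantity notation.
def mebibytes_to_bytes(mi):
    return mi * 2**20

print(mebibytes_to_bytes(512))  # 536870912 bytes, i.e. 512Mi
```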
Please copy and paste the following to a file named experiment.yml.
# experiment.yml
name: SGD Classifier HPO

metrics:
  - name: accuracy
parameters:
  - name: l1_ratio
    type: double
    bounds:
      min: 0
      max: 1.0
  - name: log_alpha
    type: double
    bounds:
      min: -5
      max: 2

parallel_bandwidth: 2
budget: 60
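To make the search space concrete: the experiment explores l1_ratio in [0, 1] and log_alpha in [-5, 2], with two runs in flight at a time and 60 runs total. Below is a hedged sketch of naive uniform sampling over that same space, as a stand-in for the (much smarter) suggestions SigOpt generates; the suggest helper is ours, not SigOpt's API:

```python
import random

# The parameter space from experiment.yml above.
BOUNDS = {"l1_ratio": (0.0, 1.0), "log_alpha": (-5.0, 2.0)}

def suggest(rng):
    # Naive uniform sampling within the declared bounds.
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in BOUNDS.items()}

rng = random.Random(0)
suggestions = [suggest(rng) for _ in range(60)]  # budget: 60
```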

Execute

So far, SigOpt is connected to your cluster, the Dockerfile defines your model's requirements, and you've filled in the SigOpt configuration files. SigOpt can now execute an HPO experiment on your cluster:
$ sigopt cluster optimize -r run.yml -e experiment.yml

Monitor progress through CLI

You can monitor the status of SigOpt Experiments from the command line using the run name or the Experiment ID.
$ sigopt cluster status experiment/999999
experiment/999999:
Experiment Name: SGD Classifier HPO
5.0 / 60.0 Observation budget
5 Observation(s) failed
Run Name      Pod phase  Status  Link
run-25ne1woa  Succeeded  failed  https://app.sigopt.com/run/49950
run-2lkc1ppa  Succeeded  failed  https://app.sigopt.com/run/49975
run-zggujklx  Succeeded  failed  https://app.sigopt.com/run/49980
run-zhc9c5q0  Succeeded  failed  https://app.sigopt.com/run/49967
run-zydkuibj  Succeeded  failed  https://app.sigopt.com/run/49966
Follow logs: sigopt cluster kubectl logs -ltype=run,experiment=999999 --max-log-requests=4 -f
View more at: https://app.sigopt.com/experiment/999999
The status will include a command that you can run in your terminal to follow the logs as they are generated by your code.

Monitor progress in the web app

You can monitor experiment progress on https://app.sigopt.com/experiment/[id].
The History tab, https://app.sigopt.com/experiment/[id]/history, shows a complete table of training runs created in the experiment. The State column displays the current state of each training run.

Stop

You can stop your HPO Experiment at any point while it's running. This command stops and deletes an HPO Experiment on the cluster. All in-progress Training Runs will be terminated.
$ sigopt cluster stop <experiment-id>