Orchestrate an AI Experiment
In this part of the docs, we'll walk through how to execute an AI Experiment on a Kubernetes cluster using SigOpt. At this point, SigOpt should already be connected to a Kubernetes cluster of your choice.
Set Up
If you haven't connected to a cluster yet, you can launch a cluster on AWS, connect to an existing Kubernetes cluster, or connect to an existing, shared Kubernetes cluster.
Then, test whether you are connected to a cluster with SigOpt by running:
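```bash
# Verify the active SigOpt cluster connection
sigopt cluster test
```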
SigOpt will output a confirmation along these lines (the exact message will vary with your cluster name):
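```
Successfully connected to kubernetes cluster: my-cluster
```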
If you're using a custom Kubernetes cluster, you will need to install plugins to get the controller image working:
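```bash
# Install the plugins the SigOpt controller needs on a custom cluster;
# if this subcommand differs on your CLI version, check `sigopt cluster --help`
sigopt cluster install-plugins
```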
SigOpt works when all of the files for your model are located in the same folder. So, please create an example directory (mkdir), and then change directories (cd) into it:
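```bash
mkdir sigopt-example  # the directory name is arbitrary
cd sigopt-example
```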
Then, auto-generate templates for a Dockerfile and a SigOpt configuration YAML file:
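```bash
# Writes template files, including a Dockerfile and SigOpt YAML configs,
# into the current directory
sigopt init
```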
Next, you will create some files and put them in this example directory.
Dockerfile: Define your model environment
For the tutorial, we'll be using a very simple Dockerfile. For instructions on how to specify more requirements, see our guide on Dockerfiles. Please copy and paste the following snippet into the autogenerated file named Dockerfile.
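A minimal Dockerfile along the following lines should be enough here; this is a sketch that assumes a standard python:3.9 base image and installs only the packages model.py needs:

```dockerfile
# Simple environment for the tutorial model
FROM python:3.9
RUN pip install --no-cache-dir sigopt scikit-learn
```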
Define a Model
This code defines a simple SGDClassifier model that measures accuracy when classifying labels for the Iris flower dataset. Copy and paste the snippet below into a file named model.py. Note that the snippet uses SigOpt's Runs to track model attributes.
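The sketch below assumes the SigOpt Python client's Runs interface (sigopt.params for hyperparameters, sigopt.log_metric and the log_dataset/log_model helpers for tracking) together with scikit-learn:

```python
import sigopt
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

# Load the Iris flower dataset
X, y = load_iris(return_X_y=True)

# Track dataset and model type on the active SigOpt Run
sigopt.log_dataset("Iris")
sigopt.log_model("SGDClassifier")

# Hyperparameters come from SigOpt during an AI Experiment;
# setdefault supplies values for standalone runs
sigopt.params.setdefault("alpha", 1e-4)
sigopt.params.setdefault("max_iter", 1000)

classifier = SGDClassifier(
    alpha=sigopt.params.alpha,
    max_iter=int(sigopt.params.max_iter),
    random_state=42,
)

# Report mean cross-validated accuracy back to SigOpt
accuracy = cross_val_score(classifier, X, y, cv=5).mean()
sigopt.log_metric("accuracy", accuracy)
```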
Notes on implementing your model
When your model runs on a node in the cluster, it can use all of the CPUs on that node via multithreading. This is good for performance if your model is the only process running on the node, but in many cases it will need to share those CPUs with other processes (e.g. other model runs). For this reason, it is a good idea to limit the number of threads that your model library can create in line with the amount of CPU specified in your resources_per_model. This varies by implementation, but some common libraries are listed below:
NumPy
Threads spawned by NumPy can be configured with environment variables, which can be set in your Dockerfile:
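```dockerfile
# Limit the thread pools of NumPy's underlying math libraries
ENV OMP_NUM_THREADS=1 \
    OPENBLAS_NUM_THREADS=1 \
    MKL_NUM_THREADS=1
```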
TensorFlow/Keras
Thread usage can be configured through the TensorFlow threading API; see https://www.tensorflow.org/api_docs/python/tf/config/threading
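For example:

```python
import tensorflow as tf

# Use a single thread within ops and across independent ops
tf.config.threading.set_intra_op_parallelism_threads(1)
tf.config.threading.set_inter_op_parallelism_threads(1)
```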
PyTorch
Thread usage can be configured in the PyTorch module; see https://pytorch.org/docs/stable/generated/torch.set_num_threads.html
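For example:

```python
import torch

# Limit PyTorch's intra-op thread pool to a single thread
torch.set_num_threads(1)
```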
Create an orchestration configuration file
Here are sample SigOpt configuration files for the model.py defined above, running on one CPU: one for a single Run and one for an AI Experiment.
Please copy and paste the following into a file named run.yml.
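The sketch below shows the shape such a file might take. The resources_per_model field is described earlier on this page; treat the other field names (name, image, run) and all values as assumptions to verify against the SigOpt configuration reference:

```yaml
# run.yml -- a single training run on one CPU
name: Tutorial Run
image: sgd-classifier  # image name is illustrative
run: python model.py
resources_per_model:
  requests:
    cpu: 1
    memory: 512Mi
  limits:
    cpu: 1
    memory: 512Mi
```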
Please copy and paste the following into a file named experiment.yml.
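Similarly, a sketch of the AI Experiment configuration: the experiment section follows the SigOpt AI Experiments format (metrics, parameters, parallel_bandwidth, budget), while the surrounding fields carry the same assumptions as in run.yml:

```yaml
# experiment.yml -- an AI Experiment over the model's hyperparameters
name: Tutorial Experiment
image: sgd-classifier  # image name is illustrative
run: python model.py
resources_per_model:
  requests:
    cpu: 1
    memory: 512Mi
  limits:
    cpu: 1
    memory: 512Mi
experiment:
  metrics:
    - name: accuracy
      objective: maximize
  parameters:
    - name: alpha
      type: double
      bounds:
        min: 1.0e-6
        max: 1.0e-1
      transformation: log
    - name: max_iter
      type: int
      bounds:
        min: 100
        max: 2000
  parallel_bandwidth: 2
  budget: 30
```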
Execute
So far, SigOpt is connected to your cluster, the Dockerfile defines your model requirements, and you've updated the SigOpt configuration files. SigOpt can now execute an AI Experiment on your cluster.
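A typical invocation might look like the following sketch; check sigopt cluster --help for the exact subcommands and flags in your CLI version:

```bash
# Launch a single run from run.yml
sigopt cluster run -r run.yml

# Launch the AI Experiment defined in experiment.yml
sigopt cluster optimize -e experiment.yml
```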
Monitor progress through the CLI
You can monitor the status of SigOpt AI Experiments from the command line using the run name or the Experiment ID.
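For example:

```bash
# Pass either a run name or an AI Experiment ID
sigopt cluster status <run-name-or-experiment-id>
```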
The status will include a command that you can run in your terminal to follow the logs as they are generated by your code.
Monitor progress in the web app
You can monitor experiment progress on https://app.sigopt.com/experiment/[id].
The History tab, https://app.sigopt.com/experiment/[id]/history, shows a complete table of training runs created in the experiment. The State column displays the current state of each training run.
Stop
You can stop your AI Experiment at any point while it's running. The command below stops and deletes an AI Experiment on the cluster; all in-progress training runs will be terminated.
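```bash
# Stops and deletes the AI Experiment with the given ID
sigopt cluster stop <experiment-id>
```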