Orchestrate an AI Experiment
Last updated
In this part of the docs, we will walk through how to execute an AI experiment on a Kubernetes cluster using SigOpt. This guide assumes SigOpt is already connected to a Kubernetes cluster of your choice.
If you haven't connected to a cluster yet, first follow one of the guides on creating or connecting to a cluster.
Then, test whether you are connected to a cluster with SigOpt by running:
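With the SigOpt CLI installed, the connectivity check uses the cluster subcommands (treat the exact invocation as an assumption if your CLI version differs):

```shell
sigopt cluster test
```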
SigOpt will output a confirmation message naming the connected cluster.
If you're using a custom Kubernetes cluster, you will need to install plugins to get the controller image working:
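Under the Orchestrate CLI conventions, the plugin installation is a single command (the subcommand name is an assumption; check your CLI's help output):

```shell
sigopt cluster install-plugins
```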
SigOpt works when all of the files for your model are located in the same folder. Create an example directory with mkdir, and then change directories into it with cd:
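For example (the directory name is illustrative):

```shell
# Create a working directory for the tutorial files and enter it
mkdir -p sigopt-example
cd sigopt-example
```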
Then, auto-generate templates for a Dockerfile and a SigOpt configuration YAML file:
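Template generation is typically a single CLI call (subcommand name is an assumption based on the SigOpt CLI's conventions):

```shell
sigopt init
```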
Next, you will create some files and put them in this example directory.
For the tutorial, we'll be using a very simple Dockerfile. For instructions on how to specify more requirements, see our guide on defining model requirements. Please copy and paste the following snippet into the autogenerated file named Dockerfile.
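A minimal Dockerfile along these lines might look like the following (the base image and package list are assumptions; adjust them to your model's requirements):

```docker
# Illustrative base image; pick one matching your model's Python version.
FROM python:3.9
# Install only what model.py needs.
RUN pip install --no-cache-dir numpy
```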
When your model runs on a node in the cluster, it can use all of the CPUs on that node with multithreading. This is good for performance if your model is the only process running on the node, but in many cases it will need to share those CPUs with other processes (e.g. other model runs). For this reason, it is a good idea to limit the number of threads that your model library can create, in line with the amount of CPU specified in your resources_per_model. How to do this varies by implementation; some common libraries are listed below:
Numpy
Threads spawned by Numpy can be configured with environment variables, which can be set in your Dockerfile:
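For example, these ENV lines cap the thread pools of the common BLAS backends (which variable applies depends on whether your NumPy build uses OpenMP, OpenBLAS, or MKL):

```docker
ENV OMP_NUM_THREADS=1
ENV OPENBLAS_NUM_THREADS=1
ENV MKL_NUM_THREADS=1
```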
Tensorflow/Keras
Thread usage can be configured through TensorFlow's threading settings:
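A sketch of capping TensorFlow's thread pools to match your resources_per_model; these calls must run before TensorFlow executes any ops:

```python
import tensorflow as tf

# Limit threads used within a single op (e.g. matrix multiplication)
tf.config.threading.set_intra_op_parallelism_threads(1)
# Limit threads used to run independent ops in parallel
tf.config.threading.set_inter_op_parallelism_threads(1)
```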
PyTorch
Thread usage can be configured through PyTorch's threading settings:
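A sketch of capping PyTorch's CPU thread usage to match your resources_per_model:

```python
import torch

# Limit intra-op parallelism (threads used inside a single operation)
torch.set_num_threads(1)
# Limit inter-op parallelism (threads used to run operations in parallel)
torch.set_num_interop_threads(1)
```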
Here's a sample SigOpt configuration file that specifies an AI Experiment for the model.py specified above, on one CPU.
Please copy and paste the following to a file named run.yml.
Please copy and paste the following to a file named experiment.yml.
So far, SigOpt is connected to your cluster, the Dockerfile defines your model requirements, and you've updated the SigOpt configuration files. SigOpt can now execute an AI Experiment on your cluster.
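Under the Orchestrate CLI conventions, launching the experiment on the cluster is typically a single command run from the example directory (the subcommand name is an assumption):

```shell
sigopt optimize
```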
You can monitor the status of SigOpt AI Experiments from the command line using the run name or the Experiment ID.
The status will include a command that you can run in your terminal to follow the logs as they are generated by your code.
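Assuming a status subcommand in the Orchestrate CLI (the exact name is an assumption), checking progress looks like:

```shell
# [id] is a placeholder for your run name or Experiment ID
sigopt status [id]
```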
You can monitor experiment progress on https://app.sigopt.com/experiment/[id].
The History tab, https://app.sigopt.com/experiment/[id]/history, shows a complete table of training runs created in the experiment. The State column displays the current state of each training run.
You can stop your AI Experiment at any point while it's running. This command stops and deletes an AI Experiment on the cluster. All in-progress Training Runs will be terminated.
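A hypothetical invocation of the stop operation described above (the subcommand name is an assumption; check your CLI's help output):

```shell
# [id] is a placeholder for your Experiment ID
sigopt stop [id]
```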
This code defines a simple model that measures accuracy when classifying labels for a small dataset. Copy and paste the snippet below into a file titled model.py. Note that the snippet below uses SigOpt's Runs to track model attributes.
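A minimal stand-in for such a model.py, assuming a toy threshold classifier; the sigopt.log_metric call is only attempted when the sigopt package is importable (it would be installed in the image built for the cluster):

```python
# Hypothetical model.py sketch: a toy threshold classifier whose
# accuracy is tracked with SigOpt's Runs when available.
try:
    import sigopt
except ImportError:
    sigopt = None


def train_and_evaluate():
    # Toy labeled dataset: (feature, label) pairs standing in for real data.
    data = [(0.1, 0), (0.2, 0), (0.9, 1), (1.0, 1)]
    threshold = 0.5  # the single "model parameter"
    correct = sum(1 for x, y in data if (x > threshold) == bool(y))
    accuracy = correct / len(data)
    if sigopt is not None:
        # Record the metric on the current SigOpt Run.
        sigopt.log_metric("accuracy", accuracy)
    return accuracy


if __name__ == "__main__":
    print(train_and_evaluate())
```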