Orchestrate a Tracked Training Run
This section walks through executing a training job on a Kubernetes cluster using SigOpt. It assumes that SigOpt is already connected to a Kubernetes cluster of your choice.
Set Up
If you haven't connected to a cluster yet, you can launch a cluster on AWS, connect to an existing Kubernetes cluster, or connect to an existing, shared K8s cluster.
Then, test whether SigOpt is connected to a cluster by running:
SigOpt will output:
If you're using a custom Kubernetes cluster, you will need to install plugins to get the controller image working:
SigOpt works when all of the files for your model are located in the same folder. So, please create an example directory (mkdir), and then change directories (cd) into that directory:
Then auto-generate templates for a Dockerfile and a SigOpt configuration YAML file:
Next, you will create some files and put them in this example directory.
Dockerfile: Define your model environment
For the tutorial, we'll be using a very simple Dockerfile. For instructions on how to specify more requirements, see our guide on Dockerfiles. Please copy and paste the following snippet into the autogenerated file named Dockerfile.
Define a Model
This code defines a simple SGDClassifier model and measures its accuracy at classifying labels for the Iris flower dataset. Copy and paste the snippet below to a file titled model.py. Note that the snippet below uses SigOpt’s Runs to track model attributes.
Notes on implementing your model
When your model runs on a node in the cluster, it can use all of the CPUs on that node through multithreading. This is good for performance if your model is the only process running on the node, but in many cases it will need to share those CPUs with other processes (for example, other model runs). For this reason, it is a good idea to limit the number of threads that your model library can create, in line with the amount of CPU specified in your resources_per_model. How to do this varies by implementation, but some common libraries are listed below:
NumPy
Threads spawned by NumPy can be configured with environment variables, which can be set in your Dockerfile:
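When editing the Dockerfile is inconvenient, a similar limit can be applied from Python, provided the variables are set before NumPy is first imported. The sketch below is a minimal illustration rather than SigOpt's own snippet: the helper name limit_numpy_threads is hypothetical, and which variable actually takes effect depends on the backend your NumPy build uses (OpenMP, OpenBLAS, or MKL), so it sets all three.

```python
import os

# Environment variables read by common NumPy backends when they load.
THREAD_ENV_VARS = ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")


def limit_numpy_threads(num_threads: int) -> dict:
    """Set backend thread limits; call this before the first `import numpy`.

    Hypothetical helper for illustration; not part of SigOpt or NumPy.
    """
    if num_threads < 1:
        raise ValueError("num_threads must be at least 1")
    for name in THREAD_ENV_VARS:
        os.environ[name] = str(num_threads)
    # Return the values that were applied so callers can log them.
    return {name: os.environ[name] for name in THREAD_ENV_VARS}


if __name__ == "__main__":
    print(limit_numpy_threads(1))
```

Setting the variables after NumPy has already been imported generally has no effect, so the call belongs at the very top of the script.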
TensorFlow/Keras
Threading can be configured in the TensorFlow module; see: https://www.tensorflow.org/api_docs/python/tf/config/threading
PyTorch
Threading can be configured in the PyTorch module; see: https://pytorch.org/docs/stable/generated/torch.set_num_threads.html
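One way to keep the thread limit consistent with the CPU amount in your configuration is to derive it from that value. The sketch below assumes the CPU amount is written in Kubernetes quantity notation (a plain number such as 1 or 0.5, or millicores such as 500m). The function name threads_for_cpu is hypothetical and not part of SigOpt.

```python
import math


def threads_for_cpu(cpu) -> int:
    """Convert a Kubernetes-style CPU quantity into a thread count (minimum 1).

    Hypothetical helper for illustration. Accepts numbers (1, 0.5) or
    strings ("2", "500m"), where the "m" suffix means millicores.
    """
    text = str(cpu).strip()
    if text.endswith("m"):
        cores = float(text[:-1]) / 1000.0
    else:
        cores = float(text)
    if cores <= 0:
        raise ValueError(f"CPU quantity must be positive, got {cpu!r}")
    # Round down so threads never outnumber the CPUs requested,
    # but always allow at least one thread.
    return max(1, math.floor(cores))


if __name__ == "__main__":
    for value in (1, "500m", "2500m"):
        print(value, "->", threads_for_cpu(value))
```

The resulting number could then be passed to torch.set_num_threads or tf.config.threading.set_intra_op_parallelism_threads, or written into the NumPy environment variables.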
Create an orchestration configuration
Here's a sample SigOpt configuration file that specifies a training run on one CPU for the model.py defined above.
Please copy and paste the following to a file named run.yml.
Execute
So far, SigOpt is connected to your cluster, the Dockerfile defines your model requirements, and you've updated the SigOpt configuration file. Now is a good time to test that you can create your run and verify that your model code works in the cluster.
Once you are confident that your runs will finish, you can kick one off in the background and continue your experimentation.
Note that you can also directly run the Python script that is executed in the run section of run.yml.
Monitor
You can monitor the status of SigOpt Runs from the command line using the run name or the Run ID.
The status will include a command that you can run in your terminal to follow the logs as they are generated by your code.
You can see all of the activity on your cluster with the following command:
Monitor progress in the web app
You can monitor training run progress on https://app.sigopt.com/run/[id].
At the top of the page under the training run name, you’ll find the status of the run. Once the run is completed, the Performance and Metric sections will fill in.
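If you launch runs from scripts, it can be handy to print the link to the run page directly. The sketch below simply fills a run ID into the URL pattern above; the helper name run_url is hypothetical and not part of the SigOpt client.

```python
from urllib.parse import quote

# Base of the run page URL pattern given in this section.
RUN_URL_BASE = "https://app.sigopt.com/run"


def run_url(run_id) -> str:
    """Build the web app URL for a run from its ID (hypothetical helper)."""
    run_id = str(run_id).strip()
    if not run_id:
        raise ValueError("run_id must not be empty")
    # Percent-encode the ID so unexpected characters cannot alter the path.
    return f"{RUN_URL_BASE}/{quote(run_id, safe='')}"


if __name__ == "__main__":
    print(run_url(12345))
```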
Stop
You can stop an in-progress run and mark it as failed on the SigOpt website by archiving it.