Orchestrate a Tracked Training Run
In this part of the docs, we will walk through how to execute a training job on a Kubernetes cluster using SigOpt. SigOpt should now be connected to a Kubernetes cluster of your choice.
If you haven't connected to a cluster yet, follow one of the cluster setup guides before continuing.
Then, test whether you are connected to a cluster with SigOpt by running:
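A sketch of the command, assuming the cluster test subcommand of the SigOpt CLI:

```bash
# Check the connection between the SigOpt CLI and your Kubernetes cluster
# (assumes the `sigopt cluster test` subcommand is available in your CLI version)
sigopt cluster test
```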
SigOpt will output a confirmation that includes the name of the connected cluster.
If you're using a custom Kubernetes cluster, you will need to install plugins to get the controller image working:
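A sketch of the command, assuming the install-plugins subcommand of the SigOpt CLI:

```bash
# Install the cluster plugins needed by the SigOpt controller image
# (assumes the `sigopt cluster install-plugins` subcommand is available in your CLI version)
sigopt cluster install-plugins
```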
SigOpt expects all of the files for your model to be located in the same folder. So, please create an example directory (mkdir) and then change directories (cd) into it:
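For example (the directory name sigopt-example is purely illustrative; use any name you like):

```bash
# Create a working directory for the example and change into it
mkdir sigopt-example
cd sigopt-example
```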
Then, auto-generate templates for a Dockerfile and a SigOpt configuration YAML file:
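A sketch of the command, assuming the init subcommand of the SigOpt CLI:

```bash
# Generate template files (e.g. Dockerfile and run.yml) in the current directory
# (assumes the `sigopt init` subcommand is available in your CLI version)
sigopt init
```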
Next, you will create some files and put them in this example directory.
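First, create the model file. This code defines a simple model and measures its accuracy at classifying labels. Copy and paste the snippet below into a file titled model.py. Note that the snippet uses SigOpt's Runs to track model attributes.

The snippet below is a stand-in, not the exact model from the original tutorial: the dataset (scikit-learn's iris) and the classifier (RandomForestClassifier) are assumptions, and it uses the module-level Runs API from the sigopt Python client (sigopt.params, sigopt.log_dataset, sigopt.log_model, sigopt.log_metric):

```python
# model.py -- illustrative stand-in for the tutorial's model
# (dataset and classifier are assumptions; adjust to your own model)
import sigopt
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Track the dataset and model type on the active SigOpt Run
sigopt.log_dataset("iris")
sigopt.log_model("RandomForestClassifier")

# Track hyperparameters; setdefault keeps them overridable by SigOpt later
sigopt.params.setdefault("n_estimators", 100)
sigopt.params.setdefault("max_depth", 4)

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = RandomForestClassifier(
    n_estimators=int(sigopt.params.n_estimators),
    max_depth=int(sigopt.params.max_depth),
)
model.fit(X_train, y_train)

# Record the evaluation metric on the run
sigopt.log_metric("accuracy", model.score(X_test, y_test))
```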
For the tutorial, we'll be using a very simple Dockerfile. For instructions on how to specify additional requirements, see the related guide in these docs. Please copy and paste the following snippet into the auto-generated file named Dockerfile.
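A minimal sketch of what the Dockerfile might contain; the base image and package list here are assumptions, so adjust them to your model's needs:

```dockerfile
# Base image and installed packages are illustrative, not the exact tutorial snippet
FROM python:3.9
RUN pip install --no-cache-dir sigopt scikit-learn
COPY . .
```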
When your model runs on a node in the cluster, it can use all of the CPUs on that node via multithreading. This is good for performance if your model is the only process running on the node, but in many cases it will need to share those CPUs with other processes (e.g. other model runs). For this reason, it is a good idea to limit the number of threads your model's libraries can create so that it matches the amount of CPU specified in your resources_per_model. How to do this varies by implementation, but some common libraries are covered below:
Numpy
Threads spawned by Numpy can be configured with environment variables, which can be set in your Dockerfile:
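For example, assuming one CPU per model (match these values to the CPU count in your resources_per_model), the standard environment variables read by the BLAS/OpenMP backends that NumPy relies on can be set like this:

```dockerfile
# Limit the threads spawned by NumPy's underlying BLAS/OpenMP libraries
# (the value 1 assumes one CPU in resources_per_model)
ENV OMP_NUM_THREADS=1 \
    OPENBLAS_NUM_THREADS=1 \
    MKL_NUM_THREADS=1 \
    VECLIB_MAXIMUM_THREADS=1 \
    NUMEXPR_NUM_THREADS=1
```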
Tensorflow/Keras
Thread usage can be configured in the Tensorflow module; see the TensorFlow documentation on threading.
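A sketch using TensorFlow's threading configuration API (tf.config.threading); the value 1 assumes one CPU per model, and these calls should run before any TensorFlow operations execute:

```python
import tensorflow as tf

# Limit TensorFlow's thread pools to match resources_per_model
# (call these before any TensorFlow ops are executed)
tf.config.threading.set_intra_op_parallelism_threads(1)
tf.config.threading.set_inter_op_parallelism_threads(1)
```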
PyTorch
Thread usage can be configured in the PyTorch module; see the PyTorch documentation on CPU threading.
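A sketch using PyTorch's thread controls; the value 1 assumes one CPU per model:

```python
import torch

# Limit the threads PyTorch uses for intra-op and inter-op parallelism
torch.set_num_threads(1)
torch.set_num_interop_threads(1)
```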
Here's a sample SigOpt configuration file that specifies a training run for the model.py defined above on one CPU. Please copy and paste the following into the auto-generated file named run.yml.
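A sketch of what run.yml might look like. The run and resources_per_model fields are referenced elsewhere in this guide; the remaining fields and values are assumptions, so keep whatever the auto-generated template provides and edit it to match:

```yaml
# run.yml -- illustrative sketch; keep the fields from your auto-generated template
name: Orchestration example run      # display name for the run (assumed field)
image: sigopt-example                # image name used for the build (assumed field)
run: python model.py                 # command executed for the training run
resources_per_model:                 # CPU/memory given to each model run
  requests:
    cpu: 1
    memory: 512Mi
  limits:
    cpu: 1
    memory: 512Mi
```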
So far, SigOpt is connected to your cluster, the Dockerfile defines your model requirements, and you've updated the SigOpt configuration file. Now is a good time to test that you can create your run and verify that your model code works in the cluster.
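A sketch of the command, assuming the cluster test-run subcommand of the SigOpt CLI, which runs in the foreground and streams logs so you can watch the run complete:

```bash
# Build the image, start the run on the cluster, and stream its logs
# (assumes the `sigopt cluster test-run` subcommand is available in your CLI version)
sigopt cluster test-run
```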
Once you are confident that your runs will finish, you can kick one off in the background and continue your experimentation.
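A sketch of the command, assuming the cluster run subcommand of the SigOpt CLI starts the run in the background:

```bash
# Kick off the run on the cluster and return immediately
# (assumes the `sigopt cluster run` subcommand is available in your CLI version)
sigopt cluster run
```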
Note that you can also directly run the Python script specified in the run section of run.yml.
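For example, while developing you could run the same command locally (the sigopt run wrapper shown second is an assumption about the CLI, included only as an option):

```bash
# Run the training script directly, outside the cluster
python model.py

# or, as a CLI-tracked local run (assumes the `sigopt run` subcommand)
sigopt run python model.py
```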
You can monitor the status of SigOpt Runs from the command line using the run name or the Run ID.
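A sketch of the command, assuming the cluster status subcommand of the SigOpt CLI; RUN_ID is a placeholder for your run's name or ID:

```bash
# Show the status of a single run on the cluster
# (assumes the `sigopt cluster status` subcommand; replace RUN_ID with a run name or ID)
sigopt cluster status RUN_ID
```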
The status will include a command that you can run in your terminal to follow the logs as they are generated by your code.
You can see all of the activity on your cluster with the following command:
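A sketch of the command, assuming that the cluster status subcommand with no arguments reports everything running on the connected cluster:

```bash
# Show all SigOpt activity on the connected cluster
# (assumes `sigopt cluster status` without arguments lists all runs)
sigopt cluster status
```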
You can monitor training run progress on https://app.sigopt.com/run/[id].
At the top of the page under the training run name, you’ll find the status of the run. Once the run is completed, the Performance and Metric sections will fill in.
You can stop an in-progress run and mark it as failed on the SigOpt website by archiving it.