Orchestrate a Tracked Training Run


Last updated 2 years ago

In this part of the docs, we will walk through how to execute a training job on a Kubernetes cluster using SigOpt. SigOpt should now be connected to a Kubernetes cluster of your choice.

Set Up

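If you haven't connected to a cluster yet, you can launch a cluster on AWS, connect to an existing Kubernetes cluster, or connect to an existing, shared K8s cluster.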

Then, test whether or not you are connected to a cluster with SigOpt by running:

$ sigopt cluster test

SigOpt will output:

Successfully connected to kubernetes cluster: tiny-cluster

If you're using a custom Kubernetes cluster, you will need to install plugins to get the controller image working:

$ sigopt cluster install-plugins

SigOpt expects all of the files for your model to be located in the same folder. So, please create an example directory (mkdir), and then change directories (cd) into that directory:

$ mkdir example && cd example

Then auto-generate templates for a Dockerfile and a SigOpt configuration YAML file:

$ sigopt init

Next, you will create some files and put them in this example directory.

Dockerfile: Define your model environment

For the tutorial, we'll be using a very simple Dockerfile. For instructions on how to specify more requirements, see our guide on Dockerfiles. Please copy and paste the following snippet into the autogenerated file named Dockerfile.

FROM python:3.9

RUN pip install --no-cache-dir sigopt

RUN pip install --no-cache-dir scipy==1.7.1
RUN pip install --no-cache-dir scikit-learn==0.24.2
RUN pip install --no-cache-dir numpy==1.21.2

COPY . /sigopt
WORKDIR /sigopt

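Define a Model

This code defines a simple SGDClassifier model that measures accuracy classifying labels for the Iris flower dataset. Copy and paste the snippet below to a file titled model.py. Note the snippet below uses SigOpt’s Runs to track model attributes.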

# model.py
# SGDClassifier example

# You'll use the SigOpt Training Runs API to communicate with SigOpt
# while your model is running on the cluster.
import sigopt


# These packages need to be installed in the image in order to run your model.
# The Dockerfile above installs them with pip.
from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
import numpy

# https://en.wikipedia.org/wiki/Iris_flower_data_set
def load_data():
  iris = datasets.load_iris()
  return (iris.data, iris.target)


def evaluate_model(X, y):
  sigopt.params.setdefaults(
    loss="log",
    penalty="elasticnet",
    log_alpha=-4,
    l1_ratio=0.15,
    max_iter=1000,
    tol=0.001,
  )
  classifier = SGDClassifier(
    loss=sigopt.params.loss,
    penalty=sigopt.params.penalty,
    alpha=10 ** sigopt.params.log_alpha,
    l1_ratio=sigopt.params.l1_ratio,
    max_iter=sigopt.params.max_iter,
    tol=sigopt.params.tol,
  )
  cv_accuracies = cross_val_score(classifier, X, y, cv=5)
  return (numpy.mean(cv_accuracies), numpy.std(cv_accuracies))


# Each execution of model.py should represent one evaluation of your model.
# When this file is run, it loads data, evaluates the model using the
# parameter values in sigopt.params, and logs the accuracy metric to SigOpt.
if __name__ == "__main__":
  (X, y) = load_data()
  (mean, std) = evaluate_model(X=X, y=y)
  print("Accuracy: {} +/- {}".format(mean, std))
  sigopt.log_metric("accuracy", mean, std)

Notes on implementing your model

When your model runs on a node in the cluster, it can use all of the CPUs on that node with multithreading. This is good for performance if your model is the only process running on the node, but in many cases it will need to share those CPUs with other processes (e.g., other model runs). For this reason, it is a good idea to limit the number of threads that your model library can create, in line with the amount of CPU specified under resources in your run configuration. How to do this varies by implementation; some common libraries are listed below:

Numpy

Threads spawned by Numpy can be configured with environment variables, which can be set in your Dockerfile:

ENV MKL_NUM_THREADS=N
ENV NUMEXPR_NUM_THREADS=N
ENV OMP_NUM_THREADS=N
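
The same limits can also be applied from Python, provided this happens before Numpy is first imported. The following is a minimal sketch, not part of SigOpt: the helper name limit_numpy_threads is hypothetical, and it assumes the threading backends read these variables when they are loaded.

```python
import os

# Environment variables read by the threading backends listed above.
THREAD_ENV_VARS = ("MKL_NUM_THREADS", "NUMEXPR_NUM_THREADS", "OMP_NUM_THREADS")


def limit_numpy_threads(n):
    """Hypothetical helper: cap backend threads at n via environment variables.

    Call this before the first `import numpy`. The backends read these
    variables when they are loaded, so later changes may have no effect.
    """
    for name in THREAD_ENV_VARS:
        os.environ[name] = str(n)
    return {name: os.environ[name] for name in THREAD_ENV_VARS}


limits = limit_numpy_threads(2)
print(limits)
```

Setting the variables in the Dockerfile remains the simpler option when every run in the image should share the same limit.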

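Tensorflow/Keras

Threads can be configured in the Tensorflow module, see: https://www.tensorflow.org/api_docs/python/tf/config/threading

PyTorch

Threads can be configured in the PyTorch module, see: https://pytorch.org/docs/stable/generated/torch.set_num_threads.html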
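
As a rough illustration of the framework-level settings, the sketch below applies a thread limit to whichever of the two libraries is installed. The helper name limit_framework_threads is hypothetical. It assumes the TensorFlow calls are made before TensorFlow has executed any operation, since the threading settings cannot be changed after the runtime is initialized.

```python
def limit_framework_threads(n):
    """Hypothetical helper: cap threads in PyTorch and/or TensorFlow, if present.

    Returns the names of the libraries the limit was applied to.
    """
    applied = []
    try:
        import torch

        # Threads used for intra-op parallelism on CPU.
        torch.set_num_threads(n)
        applied.append("torch")
    except ImportError:
        pass
    try:
        import tensorflow as tf

        # Must run before TensorFlow initializes its runtime.
        tf.config.threading.set_intra_op_parallelism_threads(n)
        tf.config.threading.set_inter_op_parallelism_threads(n)
        applied.append("tensorflow")
    except ImportError:
        pass
    return applied


applied = limit_framework_threads(2)
print(applied)
```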

Create an orchestration configuration

Here's a sample SigOpt configuration file that specifies a training run for the model.py defined above, requesting 0.5 CPU with a limit of 2 CPUs.

Please copy and paste the following to a file named run.yml.

# Choose a descriptive name for your model
name: SGD Classifier

# Here, we run the model
run: python model.py
resources:
  requests:
    cpu: 0.5
    memory: 512Mi
  limits:
    cpu: 2
    memory: 512Mi

# We don't need any GPUs for this example, so we'll leave this commented out
# gpus: 1

# SigOpt creates a container for your model. Since we're using an AWS
# cluster, it's easy to securely store the model in the Amazon Elastic Container Registry.
# Choose a descriptive and unique name for each new experiment configuration file.
image: sgd-classifier
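
The requests and limits values follow Kubernetes resource conventions, where a request may not exceed its limit. The sketch below is a small sanity check under that assumption. The helper names are hypothetical and not part of SigOpt, and only the m CPU suffix and the Ki, Mi, and Gi memory suffixes are handled.

```python
from decimal import Decimal

# Binary memory suffixes handled by this sketch (Kubernetes supports more).
_MEMORY_SUFFIXES = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}


def parse_cpu(value):
    """Return CPU cores as a Decimal; '250m' means 250 millicores."""
    text = str(value)
    if text.endswith("m"):
        return Decimal(text[:-1]) / 1000
    return Decimal(text)


def parse_memory(value):
    """Return memory in bytes for values such as '512Mi'."""
    text = str(value)
    for suffix, factor in _MEMORY_SUFFIXES.items():
        if text.endswith(suffix):
            return int(Decimal(text[: -len(suffix)]) * factor)
    return int(text)


def check_resources(resources):
    """Return a list of problems where a request exceeds its limit."""
    problems = []
    requests, limits = resources["requests"], resources["limits"]
    if parse_cpu(requests["cpu"]) > parse_cpu(limits["cpu"]):
        problems.append("cpu request exceeds cpu limit")
    if parse_memory(requests["memory"]) > parse_memory(limits["memory"]):
        problems.append("memory request exceeds memory limit")
    return problems


# The resources section of run.yml above, written as a Python dict.
resources = {
    "requests": {"cpu": 0.5, "memory": "512Mi"},
    "limits": {"cpu": 2, "memory": "512Mi"},
}
problems = check_resources(resources)
print(problems)
```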

Execute

So far, SigOpt is connected to your cluster, the Dockerfile defines your model requirements, and you've updated the SigOpt configuration file. Now is a good time to test that you can create your run and verify that your model code works in the cluster.

$ sigopt cluster test-run -r run.yml

Once you are confident that your runs will finish you can kick one off in the background and continue your experimentation.

$ sigopt cluster run -r run.yml

Note that we can also directly run the Python script that we execute in the run section of run.yml.

$ sigopt cluster run python model.py
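
If you want to script the test-then-run sequence, one possible approach is sketched below. The helper name verify_then_run is hypothetical, and the sketch assumes that the sigopt CLI exits with a non-zero status when a test run fails, which you should confirm for your installation.

```python
import subprocess


def verify_then_run(config_path, runner=subprocess.run):
    """Hypothetical helper: start a run only if the test run succeeds.

    `runner` takes a command list and returns an object with a
    `returncode` attribute; it defaults to subprocess.run.
    """
    test_cmd = ["sigopt", "cluster", "test-run", "-r", config_path]
    run_cmd = ["sigopt", "cluster", "run", "-r", config_path]
    # Assumption: a non-zero exit status means the test run failed.
    if runner(test_cmd).returncode != 0:
        return False
    return runner(run_cmd).returncode == 0


# Example usage (requires the sigopt CLI and a connected cluster):
# verify_then_run("run.yml")
```

The runner argument exists so the sequence can be exercised without a cluster, for example by passing a stub that records the commands it receives.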

Monitor

You can monitor the status of SigOpt Runs from the command line using the run name or the Run ID.

$ sigopt cluster status run/99999
run/99999:
  Run Name: run-jwc5fyyr
  Link: https://app.sigopt.com/run/99999
  State: failed
  Experiment link: https://app.sigopt.com/experiment/111111
  Suggestion id: 42613040
  Observation id: 28531050
  Pod phase: Deleted
  Node name: ip-111-11-11-111.us-west-2.compute.internal
  Follow logs: sigopt cluster kubectl logs pod/run-jwc5fyyr -f

The status will include a command that you can run in your terminal to follow the logs as they are generated by your code.
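
To act on the status from a script, the output can be parsed into key/value pairs. The sketch below assumes the output keeps the indented "Key: value" layout shown above, which is not a documented, stable format. The helper name parse_run_status is hypothetical.

```python
def parse_run_status(text):
    """Parse the indented 'Key: value' lines of a run status report."""
    fields = {}
    for line in text.splitlines():
        if not line.startswith(" "):
            continue  # skip the unindented header line, e.g. 'run/99999:'
        key, sep, value = line.strip().partition(": ")
        if sep:
            fields[key] = value
    return fields


# Sample text copied from the status output shown above.
sample = """run/99999:
  Run Name: run-jwc5fyyr
  Link: https://app.sigopt.com/run/99999
  State: failed
  Pod phase: Deleted
"""
status = parse_run_status(sample)
print(status["State"])
```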

You can see all of the activity on your cluster with the following command:

$ sigopt cluster status
You are currently connected to the cluster: test-cluster
Experiments: 1 total
    Experiment 374876: 45 runs
        Succeeded: 41 runs
        Pending: 3 runs
            run-868o4gou	Pending
            run-shb0yvxd	Pending
            run-t8co05nt	Pending
        Running: 1 runs
            run-dba7gdlc	Running
Nodes: 1 total
    ip-111-11-11-111.us-west-2.compute.internal:
        cpu:
            Allocatable: 1.93 CPU
            Requests: 250.00 mCPU, 12.95 %
            Limits: 250.00 mCPU, 12.95 %
        memory:
            Allocatable: 7.44 GB
            Requests: 1.07 GB, 14.44 %
            Limits: 1.07 GB, 14.44 %

Monitor progress in the web app

You can monitor training run progress on https://app.sigopt.com/run/[id].

At the top of the page under the training run name, you’ll find the status of the run. Once the run is completed, the Performance and Metric sections will fill in.

Stop

You can stop an in-progress run and mark it as failed on the SigOpt website by archiving it.

$ sigopt cluster stop run/<run-id>
