Distributed Workloads

Built-in support

Graphsignal has built-in support for distributed training and inference, e.g., multi-node and multi-GPU training. If a workload involves multiple workers, the dashboards seamlessly aggregate, structure and visualize data from all workers.

Single-node, single-process, multi-GPU workloads

This setup is supported out of the box without any additional configuration.
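
As an illustration, here is a minimal sketch of a single-process, multi-GPU training script; the api_key parameter shown here is an assumption about graphsignal.configure and is not taken from this page:

import graphsignal

# Standard configuration only; no distributed-specific arguments are
# needed when a single process drives all GPUs on the host.
graphsignal.configure(api_key='my-api-key')  # assumed parameter name

# ... regular training code, e.g. a model wrapped in torch.nn.DataParallel ...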

Single-node, multi-process, multi-GPU workloads

In a multi-process setup, support is framework-specific. If environment variables from the rank-zero worker are passed to the other workers, the profiler automatically groups the workers into a run. Otherwise, the run_id argument of the graphsignal.configure method or the GRAPHSIGNAL_RUN_ID environment variable should be used to pass the same run ID to all workers, as in the sketch below. The ID should be unique for each run.
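
As a minimal sketch of the explicit approach, each worker passes the same ID via the run_id argument; the MY_SHARED_RUN_ID variable name is hypothetical and stands for whatever mechanism the launcher uses to distribute the ID:

import os

import graphsignal

# The launcher distributes the same ID to every worker, e.g. through a
# custom environment variable or a command-line argument.
shared_run_id = os.environ['MY_SHARED_RUN_ID']  # hypothetical variable name

graphsignal.configure(run_id=shared_run_id)  # other configure arguments omitted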

Depending on the framework and, more specifically, on how it passes or shares arguments with all workers, the run ID can be generated before every run and provided to all worker scripts.

Here is an example of defining the GRAPHSIGNAL_RUN_ID environment variable for a script that launches multiple processes on a single host, for example using torch.distributed.launch. Every time this command is run, a new ID is generated.

GRAPHSIGNAL_RUN_ID=$(cat /proc/sys/kernel/random/uuid) ./my_distributed_train.sh

If the launcher script is written in Python, this can also be done with the help of the graphsignal package:

import os

import graphsignal

os.environ['GRAPHSIGNAL_RUN_ID'] = graphsignal.generate_uuid()
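
Putting it together, a launcher could generate the ID once and let the worker processes inherit it through the environment; the launch command and script name below are hypothetical placeholders:

import os
import subprocess

import graphsignal

# Generate one run ID per launch; child processes inherit the environment,
# so every worker reports under the same run.
os.environ['GRAPHSIGNAL_RUN_ID'] = graphsignal.generate_uuid()

# Hypothetical launch command; replace with your framework's launcher.
subprocess.run(
    ['python', '-m', 'torch.distributed.launch', '--nproc_per_node=4', 'train.py'],
    check=True)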

Check framework-specific integration documentation for recommended setup.

Multi-node, multi-process, multi-GPU workloads

A multi-node setup is similar to the single-node, multi-process setup, although a different mechanism for passing the run ID may be necessary. See the framework-specific integration documentation for recommended setup.
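
As one hedged example, when a cluster scheduler already exposes a job-wide identifier on every node, that identifier can serve as the shared run ID; SLURM_JOB_ID below is only an illustration of such a variable:

import os

import graphsignal

# Reuse an ID that is identical on every node of the job; any value that is
# unique per run works as a run ID.
run_id = os.environ.get('SLURM_JOB_ID')  # assumption: scheduler-provided job ID

graphsignal.configure(run_id=run_id)  # other configure arguments omitted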