Distributed Inference Monitoring
Integration
Graphsignal has built-in support for distributed inference, e.g. multi-node and multi-GPU inference. When runs involve multiple workers, the dashboards seamlessly aggregate, structure and visualize data from all workers.
The ranks of workers are automatically recorded for some frameworks. In other cases, to identify each worker, you can provide tags to the configure method. Tags can also be used to identify and compare runs and jobs.
import graphsignal

graphsignal.configure(
    api_key='my-api-key',
    deployment='my-model-prod',
    tags=dict(rank=0))
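In a multi-worker run, the rank is typically known only at launch time rather than hardcoded. A minimal sketch, assuming the workers are started by a launcher such as torchrun, which sets the RANK environment variable for each worker:

import os

import graphsignal

# Launchers such as torchrun set RANK for each worker;
# fall back to 0 for single-worker runs.
rank = int(os.environ.get('RANK', '0'))

graphsignal.configure(
    api_key='my-api-key',
    deployment='my-model-prod',
    tags=dict(rank=rank))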
Just like api_key and deployment, tags can also be provided via an environment variable.
Both online model serving and offline jobs are supported. Simply wrap the inference code with the start_trace context manager or use the @trace_function decorator.
for sample in dataset:
    with graphsignal.start_trace(endpoint='predict'):
        # inference code
        ...
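The decorator form works the same way; a minimal sketch, where predict and its body are placeholders for your own inference function:

import graphsignal

@graphsignal.trace_function
def predict(sample):
    # inference code for a single sample
    ...

for sample in dataset:
    predict(sample)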
Examples
The DeepSpeed GPT Neo example illustrates a distributed inference use case.