Distributed Inference Monitoring

Integration

Graphsignal has built-in support for distributed inference, e.g. multi-node and multi-GPU inference. When a run involves multiple workers, the dashboards seamlessly aggregate, structure, and visualize data from all of them.

Worker ranks are recorded automatically for some frameworks. In other cases, you can identify each worker by passing tags to the configure method. Tags can also be used to identify and compare runs and jobs.

import graphsignal

graphsignal.configure(
    api_key='my-api-key', deployment='my-model-prod', tags=dict(rank=0))
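
In multi-worker runs, the rank is usually provided by the launcher rather than hardcoded. A minimal sketch, assuming a torchrun-style RANK environment variable:

import os

import graphsignal

# RANK is set by common launchers such as torchrun; default to 0 for single-process runs
rank = int(os.environ.get('RANK', '0'))

graphsignal.configure(
    api_key='my-api-key', deployment='my-model-prod', tags=dict(rank=rank))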

Just like api_key and deployment, tags can also be provided via an environment variable.
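
As a sketch, assuming the GRAPHSIGNAL_API_KEY and GRAPHSIGNAL_DEPLOYMENT variable names (see the configuration reference for the exact names and the tag variable format); in practice these would be exported in the shell or the job spec rather than set in code:

import os

# Assumed variable names; normally exported before the process starts
os.environ['GRAPHSIGNAL_API_KEY'] = 'my-api-key'
os.environ['GRAPHSIGNAL_DEPLOYMENT'] = 'my-model-prod'

import graphsignal

# With the environment set, the corresponding arguments can be omitted
graphsignal.configure(tags=dict(rank=0))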

Both online model serving and offline jobs are supported. Simply wrap the inference code in the start_trace context manager or use the @trace_function decorator, as shown below.

for sample in dataset:
    with graphsignal.start_trace(endpoint='predict'):
        # inference code
        ...
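
The decorator form traces each call to the wrapped function; a minimal sketch where predict and model are placeholder names:

import graphsignal

# Each call to the decorated function is recorded as a trace
@graphsignal.trace_function
def predict(sample):
    return model(sample)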

Examples

The DeepSpeed GPT Neo example illustrates a distributed inference use case.