TensorRT over SSH - Embedl Hub

This guide walks you through compiling, profiling, and invoking a TensorRT model on your own NVIDIA hardware over SSH using NVIDIA’s trtexec tool.

You will learn how to:

Set up trtexec on the target device
Connect to the device over SSH
Compile an ONNX model to a TensorRT engine
Profile the compiled engine
Invoke the engine with real input data

Prerequisites

Make sure you have completed the setup guide and your hardware prerequisites, including passwordless SSH access to the target device.
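If you want to confirm passwordless access before continuing, a minimal sketch of such a check follows. It is not part of Embedl Hub: the helper names are hypothetical, and it relies only on the standard OpenSSH options BatchMode and ConnectTimeout. The username and host are the example values used later in this guide.

```python
import subprocess


def build_ssh_check_command(username: str, host: str, timeout_s: int = 5) -> list[str]:
    """Return an ssh argv that fails instead of prompting for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",  # never prompt; fail if key-based auth is not set up
        "-o", f"ConnectTimeout={timeout_s}",
        f"{username}@{host}",
        "true",  # no-op remote command; only the exit code matters
    ]


def has_passwordless_ssh(username: str, host: str) -> bool:
    """Run the check; True means the login succeeded without a prompt."""
    result = subprocess.run(
        build_ssh_check_command(username, host),
        capture_output=True,
        check=False,
    )
    return result.returncode == 0


# Example values from this guide; replace them with your own device.
print(" ".join(build_ssh_check_command("nvidia", "192.168.1.10")))
```

Calling has_passwordless_ssh("nvidia", "192.168.1.10") runs the command; a False result usually means the key has not been copied to the device or the host is unreachable.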

Locating trtexec on the target device

The trtexec provider requires NVIDIA’s trtexec tool, which is included with TensorRT. You can find it on your device by running:

ssh user@host find / -name trtexec -type f 2>/dev/null

Common paths include:

/usr/src/tensorrt/bin/trtexec

/opt/tensorrt/bin/trtexec

If trtexec is not on the device’s $PATH, you will need to provide the full path when connecting to the device (see Connecting to your device below).
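When find reports more than one hit, you still have to choose a single path. The sketch below shows one way to do that from the captured output, preferring the common paths listed above. The function name and the sample output are illustrative assumptions, not part of Embedl Hub.

```python
from typing import Optional

# Paths listed in this guide, checked in order of preference.
COMMON_TRTEXEC_PATHS = (
    "/usr/src/tensorrt/bin/trtexec",
    "/opt/tensorrt/bin/trtexec",
)


def pick_trtexec_path(find_output: str) -> Optional[str]:
    """Pick a trtexec path from the text printed by find."""
    candidates = [line.strip() for line in find_output.splitlines() if line.strip()]
    for known in COMMON_TRTEXEC_PATHS:
        if known in candidates:
            return known
    # Fall back to the first hit, or None if find printed nothing.
    return candidates[0] if candidates else None


# Made-up sample of what find might print on a device with two installs.
sample_output = "/opt/tensorrt/bin/trtexec\n/usr/src/tensorrt/bin/trtexec\n"
print(pick_trtexec_path(sample_output))
```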

Creating a project

from embedl_hub.core import HubContext
from embedl_hub.core import LocalPath

ctx = HubContext(
    project_name="TensorRT SSH",
    artifact_base_dir=LocalPath("my-artifacts"),
)

The HubContext is your entry point. It manages the project, artifact directory, devices, and tracking. We’ll register a device in the next section.

The artifact_base_dir is where compiled models, profiling results, and other outputs are stored on disk. If omitted, HubContext creates a temporary directory when used as a context manager (with ctx:), and cleans it up automatically when the context exits. This is convenient for scripts where you only need the in-memory results and don’t need to persist artifacts to disk.
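The temporary-directory lifecycle described above resembles the standard library's tempfile.TemporaryDirectory. The sketch below uses that class purely as an analogy; it makes no claim about how HubContext is implemented, and the file name is a placeholder.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    artifact_dir = Path(tmp)
    # Stand-in for an artifact written during a run.
    (artifact_dir / "model.engine").write_bytes(b"placeholder")
    existed_inside = artifact_dir.exists()

# Once the block exits, the directory and its contents are removed.
exists_after = artifact_dir.exists()
print(existed_inside, exists_after)
```

If you need to inspect artifacts after the script finishes, set artifact_base_dir explicitly so that nothing is cleaned up on exit.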

For alternative ways to configure project context, see the configuration guide.

Connecting to your device

Next, configure a connection to your target device over SSH.

from embedl_hub.core import HubContext
from embedl_hub.core.device import DeviceManager
from embedl_hub.core.device import SSHConfig
from embedl_hub.core import LocalPath

device = DeviceManager.get_tensorrt_device(
    SSHConfig(host="192.168.1.10", username="nvidia"),
    name="jetson",
)

ctx = HubContext(
    project_name="TensorRT SSH",
    artifact_base_dir=LocalPath("my-artifacts"),
    devices=[device],
)

If trtexec is not on the device’s $PATH, pass the full path via TrtexecConfig:

from embedl_hub.core.device import TrtexecConfig

device = DeviceManager.get_tensorrt_device(
    SSHConfig(host="192.168.1.10", username="nvidia"),
    name="jetson",
    provider_config=TrtexecConfig(
        trtexec_path="/usr/src/tensorrt/bin/trtexec",
    ),
)

The name parameter is a label you choose for this device; you reference it by that label when creating components later (e.g. device="jetson").

Preparing a model

The compile step expects an ONNX file. You can save your existing PyTorch model in ONNX format using torch.onnx.export:

import torch
from torchvision.models import mobilenet_v2

model = mobilenet_v2(weights="IMAGENET1K_V2")
example_input = torch.rand(1, 3, 224, 224)

torch.onnx.export(
    model,
    example_input,
    "mobilenet_v2.onnx",
    input_names=["input"],
    output_names=["output"],
    opset_version=18,
    external_data=False,
    dynamo=False,
)

Compiling a model

Compile the ONNX model to a TensorRT engine on the target device. The model is transferred over SSH, compiled using trtexec, and the engine file is fetched back:

from embedl_hub.core import HubContext
from embedl_hub.core import LocalPath
from embedl_hub.core.device import DeviceManager
from embedl_hub.core.device import SSHConfig
from embedl_hub.core.compile import TensorRTCompiler

device = DeviceManager.get_tensorrt_device(
    SSHConfig(host="192.168.1.10", username="nvidia"),
    name="jetson",
    # provider_config=trtexec_config,  # if trtexec is not on PATH
)

ctx = HubContext(
    project_name="TensorRT SSH",
    artifact_base_dir=LocalPath("my-artifacts"),
    devices=[device],
)

compiler = TensorRTCompiler(device="jetson")

with ctx:
    compiled = compiler.run(ctx, LocalPath("mobilenet_v2.onnx"))
    print(compiled.path.file_path)

TensorRT optimizes the model as part of compilation, applying FP16 precision by default to reduce memory usage and inference latency on the target GPU. For further gains, INT8 quantization is supported when a calibration cache is provided.
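To make the compile step less of a black box, here is a sketch of how a comparable trtexec command line could be assembled by hand. The flags --onnx, --saveEngine, --fp16, --int8, and --calib are standard trtexec options, but the helper itself is hypothetical and the exact command Embedl Hub issues on the device may differ.

```python
import shlex
from typing import Optional, Sequence


def build_trtexec_command(
    onnx_path: str,
    engine_path: str,
    trtexec: str = "trtexec",
    fp16: bool = True,
    calib_path: Optional[str] = None,
    extra_args: Sequence[str] = (),
) -> list[str]:
    """Assemble a trtexec argv that builds an engine from an ONNX file."""
    argv = [trtexec, f"--onnx={onnx_path}", f"--saveEngine={engine_path}"]
    if fp16:
        argv.append("--fp16")
    if calib_path is not None:
        # INT8 needs both the mode flag and the calibration cache.
        argv += ["--int8", f"--calib={calib_path}"]
    argv += list(extra_args)
    return argv


# File names are placeholders for paths on the target device.
print(shlex.join(build_trtexec_command("mobilenet_v2.onnx", "mobilenet_v2.engine")))
```

Running the printed command directly on the device can help when debugging a failed compilation, since trtexec reports unsupported layers and build errors on its own output.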

Providing a calibration cache for INT8

To enable INT8 quantization, you need a TensorRT calibration cache file (.cache) containing pre-computed per-tensor dynamic ranges. This file must be generated externally using the TensorRT Python API (e.g. trt.IInt8EntropyCalibrator2) from your calibration dataset before calling the compiler.

compiler = TensorRTCompiler(
    device="jetson",
    calib_path=LocalPath("path/to/calibration.cache"),
)

The calib_path parameter accepts a local path to the .cache file. It is automatically uploaded to the target device and passed to trtexec with the --calib flag. You also need to enable INT8 mode via trtexec_cli_args:

from embedl_hub.core.device import TrtexecConfig

device = DeviceManager.get_tensorrt_device(
    SSHConfig(host="192.168.1.10", username="nvidia"),
    name="jetson",
    provider_config=TrtexecConfig(
        trtexec_cli_args=("--int8",),
    ),
)

Note: Some models have operations that are notoriously difficult to quantize, which can lead to a large drop in accuracy. One example is the softmax function in the attention layers of large language models (LLMs).
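To illustrate why a per-tensor dynamic range can hurt accuracy, the sketch below applies a simplified symmetric INT8 scheme to a tensor with one large outlier. It is an approximation for intuition only, not TensorRT's actual calibration or kernel behavior, and the values are made up.

```python
def quantize(x: float, amax: float) -> int:
    """Map a float to INT8 using the symmetric range [-amax, amax]."""
    q = round(x * 127 / amax)
    return max(-127, min(127, q))  # clamp values outside the range


def dequantize(q: int, amax: float) -> float:
    """Map an INT8 code back to an approximate float."""
    return q * amax / 127


values = [0.01, 0.02, 0.03, 8.0]  # one outlier dominates the range
amax = max(abs(v) for v in values)  # per-tensor dynamic range

for v in values:
    restored = dequantize(quantize(v, amax), amax)
    print(f"{v} -> {restored:.4f}")
```

With the range set by the outlier, the three small values all round to zero, so any information they carried is lost. Tensors with this kind of spread are the ones most likely to need higher precision.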

Specifying a TensorRT version

If the target device has multiple TensorRT versions installed, you can specify which one to use:

compiler = TensorRTCompiler(
    device="jetson",
    tensorrt_version="10.0",
)

Profiling a model

Profile the compiled engine on the target device:

from embedl_hub.core import HubContext
from embedl_hub.core import LocalPath
from embedl_hub.core.device import DeviceManager
from embedl_hub.core.device import SSHConfig
from embedl_hub.core.compile import TensorRTCompiler
from embedl_hub.core.profile import TensorRTProfiler

device = DeviceManager.get_tensorrt_device(
    SSHConfig(host="192.168.1.10", username="nvidia"),
    name="jetson",
    # provider_config=trtexec_config,  # if trtexec is not on PATH
)

ctx = HubContext(
    project_name="TensorRT SSH",
    artifact_base_dir=LocalPath("my-artifacts"),
    devices=[device],
)

compiler = TensorRTCompiler(device="jetson")
profiler = TensorRTProfiler(device="jetson")

with ctx:
    compiled = compiler.run(ctx, LocalPath("mobilenet_v2.onnx"))
    result = profiler.run(ctx, compiled)
    print("Latency:", result.latency.value)
    print("FPS:", result.fps.value)

Your runs are automatically synced to your project on hub.embedl.com.

Profiling gives you the model’s latency on the target hardware, which layers are slowest, the number of layers executed on each compute unit type, and more. You can use this information to iterate on the model’s design and answer questions like:

Can we optimize the slowest layer?
Why aren’t certain layers running on the expected compute unit?
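As a sketch of how you might start answering the questions above, the code below ranks layers by latency and counts layers per compute unit. The tuple layout, layer names, and timings are hypothetical placeholders; this guide does not specify how Embedl Hub exposes per-layer data, so adapt the input to whatever your results contain.

```python
from collections import defaultdict

# Hypothetical per-layer rows: (layer name, compute unit, latency in ms).
layer_timings = [
    ("conv_0", "GPU", 0.12),
    ("conv_1", "GPU", 0.31),
    ("softmax", "GPU", 0.05),
    ("resize", "DLA", 0.22),
]


def slowest_layers(timings, top_n=3):
    """Return the top_n rows with the highest latency, slowest first."""
    return sorted(timings, key=lambda row: row[2], reverse=True)[:top_n]


def layers_per_unit(timings):
    """Count how many layers ran on each compute unit."""
    counts = defaultdict(int)
    for _name, unit, _latency_ms in timings:
        counts[unit] += 1
    return dict(counts)


for name, unit, latency_ms in slowest_layers(layer_timings, top_n=2):
    print(f"{name} ({unit}): {latency_ms} ms")
print(layers_per_unit(layer_timings))
```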

Invoking a model

Invoke the compiled engine with real input data to get inference outputs:

import numpy as np

from embedl_hub.core import HubContext
from embedl_hub.core import LocalPath
from embedl_hub.core.device import DeviceManager
from embedl_hub.core.device import SSHConfig
from embedl_hub.core.compile import TensorRTCompiler
from embedl_hub.core.invoke import TensorRTInvoker

device = DeviceManager.get_tensorrt_device(
    SSHConfig(host="192.168.1.10", username="nvidia"),
    name="jetson",
    # provider_config=trtexec_config,  # if trtexec is not on PATH
)

ctx = HubContext(
    project_name="TensorRT SSH",
    artifact_base_dir=LocalPath("my-artifacts"),
    devices=[device],
)

compiler = TensorRTCompiler(device="jetson")
invoker = TensorRTInvoker(device="jetson")

input_data = dict(input=np.random.rand(1, 3, 224, 224).astype(np.float32))

with ctx:
    compiled = compiler.run(ctx, LocalPath("mobilenet_v2.onnx"))
    invocation = invoker.run(ctx, compiled, input_data)
    print(invocation.output)

The input_data dictionary maps input tensor names to NumPy arrays.
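Because the keys must match the input names chosen at export time ("input" in this guide), a quick pre-flight check can catch mismatches before anything is sent to the device. The helper below is a hypothetical sketch, not an Embedl Hub function. It works with any object that exposes a .shape attribute, such as a NumPy array; the demo uses a simple stand-in so that it runs without extra dependencies.

```python
from types import SimpleNamespace

# Name and shape used in the export step of this guide.
EXPECTED_INPUTS = {"input": (1, 3, 224, 224)}


def find_input_mismatches(input_data, expected=EXPECTED_INPUTS):
    """Return a list of human-readable problems; empty means all good."""
    problems = []
    for name in sorted(set(expected) - set(input_data)):
        problems.append(f"missing input: {name}")
    for name in sorted(set(input_data) - set(expected)):
        problems.append(f"unexpected input: {name}")
    for name in sorted(set(expected) & set(input_data)):
        shape = tuple(input_data[name].shape)
        if shape != tuple(expected[name]):
            problems.append(f"{name}: shape {shape}, expected {tuple(expected[name])}")
    return problems


# Stand-in for an array with the wrong batch size.
bad_batch = {"input": SimpleNamespace(shape=(2, 3, 224, 224))}
print(find_input_mismatches(bad_batch))
```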

Next steps

Learn how to view, name, and tag your runs, and how to interpret profiling results in the exploring results guide.
See the providers guide for the full reference of supported provider and toolchain combinations.