
TensorRT over SSH

Compile, profile, and invoke TensorRT models on NVIDIA hardware.

This guide walks you through compiling, profiling, and invoking a TensorRT model on your own NVIDIA hardware over SSH using NVIDIA’s trtexec tool.

TensorRT compilation is more heavyweight than compilation for ONNX Runtime, but still practical for iterative development on NVIDIA GPUs. Building a MobileNetV2 engine takes around 70 seconds, while a ResNet-50 completes in about 40 seconds. Profiling is fast at around 15 seconds per model. The total turnaround (build plus profile) therefore stays under two minutes for most models, compared to around 10 minutes for the same workflow on a cloud provider. On the other hand, cloud providers let you test on a wide variety of edge devices without managing any hardware.

You will learn how to:

  • Set up trtexec on the target device
  • Connect to the device over SSH
  • Compile an ONNX model to a TensorRT engine
  • Profile the compiled engine
  • Invoke the engine with real input data

Prerequisites

Make sure you have completed the setup guide and the prerequisites for your hardware, including passwordless SSH access to the target device.

Locating trtexec on the target device

The trtexec provider requires NVIDIA’s trtexec tool, which is included with TensorRT. You can find it on your device by running:

ssh user@host "find / -name trtexec -type f 2>/dev/null"

Common paths include:

/usr/src/tensorrt/bin/trtexec
/opt/tensorrt/bin/trtexec

If trtexec is not on the device’s $PATH, you will need to provide the full path when connecting to the device (see Connecting to your device below).
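If you would rather script this check, the sketch below probes the common paths listed above over SSH instead of searching the whole filesystem. It is only an illustration and not part of embedl-hub: the helper names probe_command and locate_trtexec are hypothetical, and it assumes an ssh client on your local machine and passwordless access to the device.

```python
import shlex
import subprocess

# Common install locations mentioned in this guide.
COMMON_PATHS = [
    "/usr/src/tensorrt/bin/trtexec",
    "/opt/tensorrt/bin/trtexec",
]


def probe_command(user, host, path):
    """Build an ssh command that exits with 0 if `path` is executable on the device."""
    remote = f"test -x {shlex.quote(path)}"
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    return ["ssh", "-o", "BatchMode=yes", f"{user}@{host}", remote]


def locate_trtexec(user, host, paths=COMMON_PATHS):
    """Return the first candidate path that exists on the device, or None."""
    for path in paths:
        result = subprocess.run(probe_command(user, host, path), capture_output=True)
        if result.returncode == 0:
            return path
    return None
```

Calling locate_trtexec("nvidia", "192.168.1.10") would return a path you can pass as --exec-path, or None if trtexec is installed somewhere else, in which case the find command above is still the fallback.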

Creating a project

embedl-hub init \
    --project "TensorRT SSH" \
    --artifact-dir ~/my-artifacts

This sets the default project and artifact directory for subsequent commands. The artifact directory is where compiled models, profiling results, and other outputs are stored on disk. Later commands — such as profiling a model from a previous compile step — look here for previously produced artifacts. If omitted, a platform-specific default location is used.

You can view your current settings at any time:

embedl-hub show

Connecting to your device

Next, configure a connection to your target device over SSH.

In the CLI, device connection details are passed directly to each command:

embedl-hub compile tensorrt trtexec \
    --host 192.168.1.10 \
    --user nvidia \
    --exec-path /usr/src/tensorrt/bin/trtexec \
    ...

If trtexec is on the device’s $PATH, you can omit the --exec-path flag.
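Because the connection flags are repeated on every command, it can be convenient to assemble them in one place when scripting. The following is a minimal sketch under that assumption: hub_command and connection_args are hypothetical helper names, and only the flags shown in this guide (--host, --user, --exec-path) are used.

```python
import shlex


def connection_args(host, user, exec_path=None):
    """Return the SSH connection flags shared by compile, profile, and invoke."""
    args = ["--host", host, "--user", user]
    if exec_path is not None:
        # Only needed when trtexec is not on the device's $PATH.
        args += ["--exec-path", exec_path]
    return args


def hub_command(action, *extra, host, user, exec_path=None):
    """Build an argument list for `embedl-hub <action> tensorrt trtexec ...`."""
    return [
        "embedl-hub", action, "tensorrt", "trtexec",
        *extra,
        *connection_args(host, user, exec_path),
    ]


# Print the resulting command line; pass the list to subprocess.run to execute it.
cmd = hub_command("profile", "--from-run", "latest", host="192.168.1.10", user="nvidia")
print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.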

Preparing a model

The compile step expects an ONNX file. You can save your existing PyTorch model in ONNX format using torch.onnx.export:

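import torch
from torchvision.models import mobilenet_v2

# Load a pretrained MobileNetV2 and switch it to inference mode.
model = mobilenet_v2(weights="IMAGENET1K_V2")
model.eval()

# Example input that fixes the exported shape: (batch, channels, height, width).
example_input = torch.rand(1, 3, 224, 224)

# Export to a single ONNX file with named input and output tensors.
torch.onnx.export(
    model,
    example_input,
    "mobilenet_v2.onnx",
    input_names=["input"],
    output_names=["output"],
    opset_version=18,
    external_data=False,
    dynamo=False,
)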

Compiling a model

Compile the ONNX model to a TensorRT engine on the target device. The model is transferred over SSH, compiled using trtexec, and the engine file is fetched back:

embedl-hub compile tensorrt trtexec \
    --model /path/to/mobilenet_v2.onnx \
    --host 192.168.1.10 \
    --user nvidia

TensorRT optimizes the model as part of compilation, applying FP16 precision by default to reduce memory usage and inference latency on the target GPU. For further gains, INT8 quantization is supported when a calibration cache is provided.

Providing a calibration cache for INT8

To enable INT8 quantization, you need a TensorRT calibration cache file (.cache) containing pre-computed per-tensor dynamic ranges. This file must be generated externally using the TensorRT Python API (e.g. trt.IInt8EntropyCalibrator2) from your calibration dataset before calling the compiler.

The CLI does not expose a dedicated calibration flag, but you can forward the --int8 and --calib arguments directly to trtexec with --cli-args:

embedl-hub compile tensorrt trtexec \
    --model /path/to/mobilenet_v2.onnx \
    --host 192.168.1.10 \
    --user nvidia \
    --cli-args --int8 \
    --cli-args --calib=/path/on/device/calibration.cache

Note that the calibration file must already be present on the remote device when using --cli-args.
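One way to get the cache onto the device is to copy it with scp before compiling. The sketch below assumes scp is available locally and that the same passwordless SSH access is in place; upload_command and upload are hypothetical helper names, and the local file name is a placeholder.

```python
import subprocess


def upload_command(local_path, user, host, remote_path):
    """Build an scp command that copies a local file to the target device."""
    return ["scp", local_path, f"{user}@{host}:{remote_path}"]


def upload(local_path, user, host, remote_path):
    """Copy the file and raise CalledProcessError if scp fails."""
    subprocess.run(upload_command(local_path, user, host, remote_path), check=True)
```

For example, upload("calibration.cache", "nvidia", "192.168.1.10", "/path/on/device/calibration.cache") would place the file at the path passed to --calib in the command above.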

Note: Some models have operations that are notoriously difficult to quantize, which can lead to a large drop in accuracy. One example is the softmax function in the attention layers of large language models (LLMs).

Specifying a TensorRT version

If the target device has multiple TensorRT versions installed, you can specify which one to use. The CLI auto-detects the TensorRT version from the remote device; to override it, use the Python API instead.

Profiling a model

Profile the compiled engine on the target device:

embedl-hub profile tensorrt trtexec \
    --from-run latest \
    --host 192.168.1.10 \
    --user nvidia

Use embedl-hub log to view your runs.

Profiling gives you the model’s latency on the target hardware, which layers are slowest, the number of layers executed on each compute unit type, and more. You can use this information to iterate on the model’s design and answer questions like:

  • Can we optimize the slowest layer?
  • Why aren’t certain layers running on the expected compute unit?

Invoking a model

Invoke the compiled engine with real input data to get inference outputs:

embedl-hub invoke tensorrt trtexec \
    --from-run latest \
    --host 192.168.1.10 \
    --user nvidia \
    --input /path/to/input.npz

The --input flag accepts a .npz file — a NumPy archive where each key is an input tensor name and each value is the corresponding array.
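As an illustration, such an archive can be written with numpy.savez. The sketch below assumes the model was exported as in Preparing a model, so the single input tensor is named "input" with shape (1, 3, 224, 224). The helper name write_placeholder_input is hypothetical, and the random values are only a stand-in: for a meaningful result you would store a preprocessed image instead.

```python
import numpy as np


def write_placeholder_input(path, name="input", shape=(1, 3, 224, 224), seed=0):
    """Write a .npz archive with one float32 array stored under the key `name`."""
    rng = np.random.default_rng(seed)
    # Placeholder data; replace with a real, preprocessed input tensor.
    array = rng.random(shape, dtype=np.float32)
    # The keyword name becomes the key in the archive, so it must match
    # the input tensor name used when exporting the ONNX model.
    np.savez(path, **{name: array})
    return array
```

Calling write_placeholder_input("input.npz") produces a file you can pass to --input. For models with several inputs, add one keyword per input tensor name to the np.savez call.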

Next steps