TensorRT over SSH
Compile, profile, and invoke TensorRT models on NVIDIA hardware.
This guide walks you through compiling, profiling, and invoking a TensorRT
model on your own NVIDIA hardware over SSH using NVIDIA’s trtexec tool.
TensorRT compilation is more heavyweight than ONNX Runtime, but still practical for iterative development on NVIDIA GPUs. Building a MobileNetV2 engine takes around 70 seconds, while a ResNet-50 completes in about 40 seconds. Profiling is fast at around 15 seconds per model. The total turnaround stays under two minutes for most models — compared to around 10 minutes for the same workflow on a cloud provider. On the other hand, cloud providers let you test on a wide variety of edge devices without managing any hardware.
You will learn how to:
- Set up trtexec on the target device
- Connect to the device over SSH
- Compile an ONNX model to a TensorRT engine
- Profile the compiled engine
- Invoke the engine with real input data
Prerequisites
Make sure you have completed the setup guide and the hardware prerequisites, including passwordless SSH access to the target device.
Locating trtexec on the target device
The trtexec provider requires NVIDIA’s trtexec tool, which is included
with TensorRT. You can find it on your device by running:
ssh user@host find / -name trtexec -type f 2>/dev/null
Common paths include:
/usr/src/tensorrt/bin/trtexec
/opt/tensorrt/bin/trtexec
If trtexec is not on the device’s $PATH, you will need to provide
the full path when connecting to the device (see Connecting to your device below).
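If you want to script this lookup, the following is a minimal sketch rather than part of the tool. It runs the same find command over SSH and prefers the common paths listed above. The helper names pick_trtexec and find_trtexec are hypothetical, and the sketch assumes passwordless SSH is already configured.

```python
import subprocess

# Common install locations listed in this guide.
COMMON_PATHS = (
    "/usr/src/tensorrt/bin/trtexec",
    "/opt/tensorrt/bin/trtexec",
)


def pick_trtexec(find_output, preferred=COMMON_PATHS):
    """Return the best trtexec path from `find` output, or None if there is none."""
    candidates = [line.strip() for line in find_output.splitlines() if line.strip()]
    # Prefer a well-known location when one is present.
    for path in preferred:
        if path in candidates:
            return path
    # Otherwise fall back to the first path that find reported.
    return candidates[0] if candidates else None


def find_trtexec(user, host):
    """Run the find command from this guide over SSH (hypothetical helper)."""
    remote_cmd = "find / -name trtexec -type f 2>/dev/null"
    # find often exits non-zero on unreadable directories, so do not raise on that.
    result = subprocess.run(
        ["ssh", f"{user}@{host}", remote_cmd],
        capture_output=True,
        text=True,
        check=False,
    )
    return pick_trtexec(result.stdout)
```

Calling find_trtexec("nvidia", "192.168.1.10") would return a path you can pass as --exec-path, or None if nothing was found.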
Creating a project
embedl-hub init \
--project "TensorRT SSH" \
--artifact-dir ~/my-artifacts
This sets the default project and artifact directory for subsequent commands. The artifact directory is where compiled models, profiling results, and other outputs are stored on disk. Later commands, such as profiling a model from a previous compile step, look here for previously produced artifacts. If omitted, a platform-specific default location is used.
You can view your current settings at any time:
embedl-hub show
Connecting to your device
Next, configure a connection to your target device over SSH.
In the CLI, device connection details are passed directly to each command:
embedl-hub compile tensorrt trtexec \
--host 192.168.1.10 \
--user nvidia \
--exec-path /usr/src/tensorrt/bin/trtexec \
...
If trtexec is on the device’s $PATH, you can omit the --exec-path flag.
Preparing a model
The compile step expects an ONNX file. You can save
your existing PyTorch model in ONNX format using torch.onnx.export:
import torch
from torchvision.models import mobilenet_v2

model = mobilenet_v2(weights="IMAGENET1K_V2")
example_input = torch.rand(1, 3, 224, 224)

torch.onnx.export(
    model,
    example_input,
    "mobilenet_v2.onnx",
    input_names=["input"],
    output_names=["output"],
    opset_version=18,
    external_data=False,
    dynamo=False,
)
Compiling a model
Compile the ONNX model to a TensorRT engine on the target device. The model
is transferred over SSH, compiled using trtexec, and the engine file is
fetched back:
embedl-hub compile tensorrt trtexec \
--model /path/to/mobilenet_v2.onnx \
--host 192.168.1.10 \
--user nvidia
TensorRT optimizes the model as part of compilation, applying FP16 precision by default to reduce memory usage and inference latency on the target GPU. For further gains, INT8 quantization is supported when a calibration cache is provided.
Providing a calibration cache for INT8
To enable INT8 quantization, you need a TensorRT calibration cache file
(.cache) containing pre-computed per-tensor dynamic ranges. This file
must be generated externally using the TensorRT Python API (e.g. trt.IInt8EntropyCalibrator2) from your calibration dataset before
calling the compiler.
The CLI does not expose a dedicated calibration flag, but you can
forward the --calib and --int8 arguments directly to trtexec:
embedl-hub compile tensorrt trtexec \
--model /path/to/mobilenet_v2.onnx \
--host 192.168.1.10 \
--user nvidia \
--cli-args --int8 \
--cli-args --calib=/path/on/device/calibration.cache
Note that the calibration file must already be present on the remote
device when using --cli-args.
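If you drive the CLI from a script, it can help to assemble the argument list in one place. The sketch below is illustrative only. build_compile_command is a hypothetical helper, and it uses only the flags shown in this guide (--model, --host, --user, --exec-path, and repeated --cli-args).

```python
import shlex


def build_compile_command(model, host, user, exec_path=None, trtexec_args=()):
    """Assemble the argv for `embedl-hub compile tensorrt trtexec` (hypothetical helper)."""
    cmd = [
        "embedl-hub", "compile", "tensorrt", "trtexec",
        "--model", model,
        "--host", host,
        "--user", user,
    ]
    # Only needed when trtexec is not on the device's $PATH.
    if exec_path:
        cmd += ["--exec-path", exec_path]
    # Each trtexec argument is forwarded with its own --cli-args flag.
    for arg in trtexec_args:
        cmd += ["--cli-args", arg]
    return cmd


cmd = build_compile_command(
    model="/path/to/mobilenet_v2.onnx",
    host="192.168.1.10",
    user="nvidia",
    trtexec_args=["--int8", "--calib=/path/on/device/calibration.cache"],
)
# Print the command in a form that can be pasted into a shell.
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True) once the calibration cache is in place on the device.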
Note: Some models have operations that are notoriously difficult to quantize, which can lead to a large drop in accuracy. One example is the softmax function in the attention layers of large language models (LLMs).
Specifying a TensorRT version
If the target device has multiple TensorRT versions installed, you may want to control which one is used.
The CLI auto-detects the TensorRT version from the remote device. To override it, use the Python API instead.
Profiling a model
Profile the compiled engine on the target device:
embedl-hub profile tensorrt trtexec \
--from-run latest \
--host 192.168.1.10 \
--user nvidia
Use embedl-hub log to view your runs.
Profiling gives you the model’s latency on the target hardware, which layers are slowest, the number of layers executed on each compute unit type, and more. You can use this information to iterate on the model’s design and answer questions like:
- Can we optimize the slowest layer?
- Why aren’t certain layers running on the expected compute unit?
Invoking a model
Invoke the compiled engine with real input data to get inference outputs:
embedl-hub invoke tensorrt trtexec \
--from-run latest \
--host 192.168.1.10 \
--user nvidia \
--input /path/to/input.npz
The --input flag accepts a .npz file, a NumPy archive where each key
is an input tensor name and each value is the corresponding array.
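As a minimal sketch, an input file for the MobileNetV2 model exported earlier could be created as shown below. It assumes the input tensor is named "input" (matching input_names in the export step) and that the model expects float32 data of shape (1, 3, 224, 224). Random values stand in for a real preprocessed image.

```python
import numpy as np

# Placeholder data; replace with a real preprocessed image in practice.
rng = np.random.default_rng(0)
image = rng.random((1, 3, 224, 224), dtype=np.float32)

# The keyword name becomes the key in the archive, so it must match
# the model's input tensor name ("input" in the export step above).
np.savez("input.npz", input=image)

# Read the archive back to confirm the key, shape, and dtype.
with np.load("input.npz") as archive:
    stored_keys = list(archive.files)
    stored = archive["input"]

print(stored_keys, stored.shape, stored.dtype)
```

For a model with several inputs, pass one keyword argument per input tensor name to np.savez.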
Next steps
- Learn how to view, name, and tag your runs, and how to interpret profiling results in the exploring results guide.
- See the providers guide for the full reference of supported provider and toolchain combinations.