Local Streaming Latency

SB07 · June 16, 2026, 5:31pm

Hi,

I’m working with two ZED X One cameras on a ZED Box (Orin NX, SDK 5.3, JetPack 6). I’m setting up live streaming and testing locally over loopback first: one process opens both cameras and enables streaming (one stream per camera on separate ports), and a second process receives both streams and processes the frames.

I’m measuring ~150ms end-to-end latency (capture timestamp to frame available on the receiver), and I broke it down by pipeline stage to find where the time goes:

Grab only (capture → grab, on the box): ~23ms @ SVGA/60fps, ~39ms @ 30fps
Grab + stream + receive (capture → received on the other process): ~110ms @ 60fps, ~160ms @ 30fps
Full pipeline incl. frame matching: ~125ms @ 60fps, ~167ms @ 30fps

So the streaming step itself (encode + transport + decode) adds ~85-120ms, which is the dominant cost. Grabbing and matching are minor.
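As a quick check of that subtraction, here is a minimal sketch using the approximate figures listed above. The dictionary keys and the function name are illustrative only, not part of the ZED SDK.

```python
# Cumulative latencies in milliseconds, copied from the measurements above.
# Keys are the camera frame rate in FPS.
measurements = {
    60: {"grab": 23, "grab_stream_receive": 110, "full_pipeline": 125},
    30: {"grab": 39, "grab_stream_receive": 160, "full_pipeline": 167},
}


def stage_breakdown(m):
    """Split cumulative latencies into per-stage contributions."""
    return {
        "grab": m["grab"],
        "streaming": m["grab_stream_receive"] - m["grab"],
        "matching": m["full_pipeline"] - m["grab_stream_receive"],
    }


breakdown = {fps: stage_breakdown(m) for fps, m in measurements.items()}
for fps, stages in breakdown.items():
    print(f"{fps} FPS: {stages}")
```

With these inputs the streaming stage comes out at 87 ms at 60fps and 121 ms at 30fps, which is where the ~85-120ms range comes from.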

My streaming params: H264, bitrate 4000, gop_size 0, chunk_size 1200.

A few questions:

Since both encode (2 streams) and decode (2 streams) are running on the same Jetson over loopback, could the shared NVENC/NVDEC be inflating this? I saw NVENC/NVDEC utilization around 50-70% in tegrastats (not saturated), with the engines bursting/idling rather than maxed.
Are there StreamingParameters settings you’d recommend tuning to reduce the encode/decode latency specifically?
Would you expect the latency to drop meaningfully when I move decode to a separate machine over real ethernet?

Myzhar · June 17, 2026, 1:05pm

Hi @SB07
Welcome to the StereoLabs community.

Thanks for the detailed breakdown; the stage-by-stage measurements make this much easier to reason about.

A few points on what you are seeing:

1. Shared NVENC/NVDEC on the same Jetson

Yes, running 2 encoders and 2 decoders on the same Orin NX over loopback is very likely the main contributor. The engines bursting/idling at 50-70% rather than saturating is typical of latency-bound (not throughput-bound) load: the issue is serialization and scheduling between the four sessions sharing the same hardware engines, plus the queuing introduced by the encoder, not raw engine capacity. The decode side competing with encode on the same box adds round-trip queuing on top of that.

2. StreamingParameters tuning

A few things worth trying:

gop_size: with 0 the SDK uses its default GOP. For low latency, set an explicit short GOP (or all-intra style behavior) so the decoder does not wait on reference frames; this reduces decode-side buffering.
adaptative_bitrate: keep it disabled for fixed local conditions, since the rate controller can add latency while it converges.
bitrate: 4000 kbps at SVGA is comfortable; raising it modestly can reduce per-frame encode complexity stalls, though this is secondary here.
chunk_size 1200: this is tuned for typical MTU and is fine for loopback; no change needed.
On the receiver, make sure you are calling grab() in a tight loop so received frames are not waiting in the input buffer, and consider reducing any internal queue depth on your processing side.

You can find the full parameter reference here: StreamingParameters Struct Reference | C++ API Reference | Stereolabs

3. Moving decode to a separate machine

I would expect a meaningful improvement, yes. Offloading the 2 decode sessions to a separate host frees the Orin NX engines to handle encode only, removing the encode/decode contention that is dominating your numbers. Real Ethernet adds some transport latency, but on a local gigabit link that is typically far smaller than the contention you are currently paying. This is also the intended deployment model for the streaming feature, so it is well worth testing.

As a quick sanity check on the engine-contention theory, you could try streaming a single camera (1 encode + 1 decode) on the box and compare the streaming-stage latency; if it drops sharply, that confirms the shared-engine bottleneck.

Let me know how the separate-host test goes, and feel free to share updated numbers.

SB07 · June 17, 2026, 1:54pm

Thank you so much for your response! With a single camera, grab + stream + receive takes 160 ms. Adding the second camera made little difference and only raised it to ~164 ms. I also tested encoding alone (starting the stream and grabbing images on the sender side, with no receiver), and that added almost no latency, so I will try moving the receiving end to another device and see if it helps.

Myzhar · June 17, 2026, 3:06pm

Hi @SB07
Thank you for your feedback. What you say makes sense.

SB07 · June 17, 2026, 3:59pm

Also, since the capture timestamp is stamped on the sender and I’d be comparing it against the receiver’s wall clock, the two machines’ clocks being out of sync would show up directly in the latency number. For measuring end-to-end latency across two hosts, do you rely on NTP/PTP to sync the clocks, or is there a recommended way to measure this within the SDK that avoids the cross-clock comparison? I just want to make sure I’m measuring real latency and not clock skew.
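To illustrate the concern, here is a minimal sketch of the standard four-timestamp (NTP-style) offset estimate and how an uncorrected offset would leak into a cross-host latency figure. This is not an SDK feature and not a substitute for running NTP/PTP; the function names and the example timestamps are hypothetical, and the estimate assumes a symmetric network path.

```python
def estimate_offset_and_delay(t0, t1, t2, t3):
    """NTP-style estimate from one request/reply exchange.

    t0: request sent     (receiver clock)
    t1: request received (sender clock)
    t2: reply sent       (sender clock)
    t3: reply received   (receiver clock)
    Returns (offset, round_trip), where offset = sender clock - receiver clock.
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2.0
    round_trip = (t3 - t0) - (t2 - t1)
    return offset, round_trip


def corrected_latency_ms(capture_sender_ms, received_receiver_ms, offset_ms):
    """Convert the capture timestamp to the receiver clock, then subtract."""
    return received_receiver_ms - (capture_sender_ms - offset_ms)


# Hypothetical exchange: sender clock runs 40 ms ahead, 1 ms each way,
# 0.5 ms spent on the sender before replying.
offset, round_trip = estimate_offset_and_delay(1000.0, 1041.0, 1041.5, 1002.5)

naive = 2090.0 - 2040.0  # raw difference of timestamps from two different clocks
true_latency = corrected_latency_ms(2040.0, 2090.0, offset)
print(offset, round_trip, naive, true_latency)
```

In this made-up case the naive figure reads 50 ms while the corrected one is 90 ms, so the whole 40 ms offset ends up in the measurement error.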

Myzhar · June 18, 2026, 3:27pm

Yes, NTP/PTP is required in this case:

Please refer to this section of the documentation:

SB07 · June 23, 2026, 11:25pm

Hi, thank you for your response and I will try NTP/PTP.

The stream settings I am using are SVGA at 30 FPS. Before I test, I was just wondering if there is an expected latency, measured from image capture to being received on the other device, over ethernet for capturing in parallel (multi-threads) with 2 cameras on device 1, and sending their frames downstream in parallel? Is there also an expected latency for capturing in parallel (multi-threads) with 4 cameras on device 1, and sending their frames downstream in parallel?

Myzhar · June 24, 2026, 3:11pm

Hi @SB07,


There isn’t a single guaranteed number I can quote, because end-to-end streaming latency depends on several factors that are specific to your deployment: encoder/decoder load, network medium and switch behavior, the chosen StreamingParameters, and the CPU/GPU headroom left on the Orin NX while it runs your processing pipeline in parallel. So please treat the values below as realistic expectations, not contractual figures.

Why the loopback numbers are not representative

A couple of important points before estimating the Ethernet case:

First, your current measurement mixes two unrelated things. The ~150-167ms you see includes the clock offset between the two processes if their timestamps are not referenced to the same synchronized clock. This is exactly why NTP/PTP matters here: until both ends share a common time base, “capture timestamp on device 1” and “received timestamp on the receiver” are not directly comparable, and part of what you are reading as latency can be clock skew. After you align the clocks (and ideally use the new monotonic TIMESTAMP_CLOCK option in SDK 5.3), the measured end-to-end figure should both drop and become stable.

Second, running both encode (2 streams) and decode (2 streams) on the same Jetson over loopback is the worst case for NVENC/NVDEC contention. Even if tegrastats shows the engines at 50-70% rather than saturated, they are time-shared and bursty, so queuing delay between the encode and decode stages inflates the figure. This is expected on loopback.

What to expect over real Ethernet

Moving the decoder to a separate machine should help meaningfully, for two reasons: the receiver’s NVDEC no longer competes with the box’s NVENC, and the box frees the GPU cycles it was spending on decode. With SVGA at 30 FPS, H264, and a synchronized clock, an end-to-end capture-to-received latency in the low tens of milliseconds per stream is a reasonable target on a clean GigE link. The dominant remaining contributors then become the sensor grab time you already measured (~23-39ms is itself a large share of your budget) and the encoder GOP/queue behavior.

For the 2-camera case, the two encoders on the box run comfortably within Orin NX NVENC capacity at SVGA/30, so the per-stream latency should be close to the single-stream case.

For the 4-camera case, you are doubling the concurrent encode load on a single Orin NX NVENC engine while also competing for ISP, memory bandwidth, and CPU threads. Expect higher and more variable per-stream latency, and validate it under your real processing load rather than in isolation. If you need 4x ZED X One streaming with tight latency, the higher-tier Jetson modules (Orin NX 16GB at MAXN SUPER, or AGX Orin) give the encoder and memory subsystem more headroom.

Settings worth tuning (relevant to your post #1 question 2)

gop_size: you have it at 0. Keeping I-frames frequent (small GOP) reduces decode dependency latency at the cost of bitrate; this is usually the right tradeoff for low-latency local streaming.
bitrate: 4000 kbps at SVGA is on the lower side; a moderate increase can reduce encoder buffering pressure without stressing a GigE link.
chunk_size: smaller chunks reduce packetization latency over loopback but matter less on real ethernet; leave at the default unless you see fragmentation.
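To put the bitrate suggestion in perspective, here is a rough sketch of the aggregate stream bandwidth against a gigabit link. It assumes the bitrate is expressed in kbps as stated above, ignores packet and protocol overhead, and uses 8000 kbps purely as a hypothetical "moderate increase".

```python
GIGE_MBPS = 1000.0  # nominal gigabit Ethernet capacity


def aggregate_mbps(num_streams, bitrate_kbps):
    """Total encoded payload bandwidth for several equal-bitrate streams."""
    return num_streams * bitrate_kbps / 1000.0


for num_streams in (2, 4):
    for bitrate_kbps in (4000, 8000):  # 8000 is a hypothetical increase
        total = aggregate_mbps(num_streams, bitrate_kbps)
        share = 100.0 * total / GIGE_MBPS
        print(f"{num_streams} streams @ {bitrate_kbps} kbps: "
              f"{total:.0f} Mbps ({share:.1f}% of GigE)")
```

Even four streams at the doubled bitrate stay at 32 Mbps, a small fraction of the nominal link capacity, so the link itself is unlikely to be the constraint when raising the bitrate.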

Reference docs:

Local streaming: Local Video Streaming | StereoLabs
Multi-camera and PTP setup: Setting Up Multiple 3D Cameras | StereoLabs

My honest recommendation is to run your test exactly as planned (clocks synchronized, decode on the remote machine), measure the per-stream figure for the 2-camera case first, then add the 3rd and 4th streams and watch how NVENC headroom and the variance evolve. That will give you the numbers that actually apply to your system far more reliably than any generic estimate.
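When comparing the 2-, 3- and 4-stream runs, it helps to look at the spread as well as the average. A minimal sketch of a per-stream summary follows; the sample values are hypothetical placeholders, and the nearest-rank percentile is just one simple choice.

```python
import math
import statistics


def summarize_latency(samples_ms):
    """Mean, spread and tail of a list of per-frame latencies (ms)."""
    ordered = sorted(samples_ms)
    # Nearest-rank 95th percentile: smallest value with >= 95% of samples at or below it.
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return {
        "mean": statistics.mean(ordered),
        "stdev": statistics.pstdev(ordered),
        "p95": ordered[rank - 1],
        "max": ordered[-1],
    }


# Hypothetical per-frame latencies for one stream, in milliseconds.
samples = [82, 85, 90, 88, 130, 84, 86, 91, 87, 89]
summary = summarize_latency(samples)
print(summary)
```

A rising p95 or standard deviation as streams are added is usually a clearer sign of shrinking headroom than the mean alone.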

SB07 · June 29, 2026, 3:28pm

Hey Myzhar, thank you so much for your help. After moving to a second device and ensuring PTP sync, the latency fell into a range of 80 ms to 130 ms. Does it also make sense that running at SVGA 60 FPS should have a lower latency than SVGA 30 FPS?

Myzhar · June 30, 2026, 12:21pm

Yes, it does. Here you can find the reason explained:

It’s part of the ROS 2 documentation, but the concepts are always valid.
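One way to see the effect in the numbers from this thread is to express the 30 FPS vs 60 FPS latency differences in units of frame period. The sketch below uses the loopback figures from the first post; reading the result as "stages hold whole frames" is an interpretation of those measurements, not something documented for the SDK.

```python
def frame_period_ms(fps):
    """Time between two consecutive frames, in milliseconds."""
    return 1000.0 / fps


# Extra time per frame when dropping from 60 FPS to 30 FPS (~16.7 ms).
period_delta = frame_period_ms(30) - frame_period_ms(60)

# Loopback measurements from the first post (ms).
grab_delta = 39 - 23                      # grab stage, 30 FPS minus 60 FPS
stream_delta = (160 - 39) - (110 - 23)    # streaming stage, 30 FPS minus 60 FPS

grab_ratio = grab_delta / period_delta
stream_ratio = stream_delta / period_delta
print(round(grab_ratio, 2), round(stream_ratio, 2))
```

The grab difference is about one frame period and the streaming difference about two, which is consistent with latency that scales with the frame interval and therefore shrinks at 60 FPS.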