I’m working with two ZED X One cameras on a ZED Box (Orin NX, SDK 5.3, JetPack 6). I’m setting up live streaming and testing locally over loopback first: one process opens both cameras and enables streaming (one stream per camera on separate ports), and a second process receives both streams and processes the frames.
I’m measuring ~150ms end-to-end latency (capture timestamp to frame available on the receiver), and I broke it down by pipeline stage to find where the time goes:
Grab only (capture → grab, on the box): ~23ms @ SVGA/60fps, ~39ms @ 30fps
Grab + stream + receive (capture → received on the other process): ~110ms @ 60fps, ~160ms @ 30fps
So the streaming step itself (encode + transport + decode) adds ~85-120ms, which is the dominant cost. Grabbing and matching are minor.
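For reference, here is the subtraction behind that estimate as a small Python sketch. The dictionary and function names are only illustrative; the values are the measurements listed above.

```python
# Capture-to-stage latencies in milliseconds, taken from the measurements above.
measurements_ms = {
    60: {"grab": 23, "grab_stream_receive": 110},
    30: {"grab": 39, "grab_stream_receive": 160},
}


def streaming_overhead_ms(stages):
    """Latency attributable to encode + transport + decode."""
    return stages["grab_stream_receive"] - stages["grab"]


overhead_by_fps = {
    fps: streaming_overhead_ms(stages) for fps, stages in measurements_ms.items()
}

for fps, overhead in sorted(overhead_by_fps.items(), reverse=True):
    # Express the overhead in frame periods to see whether it scales with fps.
    frame_periods = overhead * fps / 1000.0
    print(f"{fps} fps: streaming adds ~{overhead} ms (~{frame_periods:.1f} frame periods)")
```

Expressed in frame periods, the overhead is roughly five frames at 60 fps and under four at 30 fps, so it does not look like a fixed number of buffered frames.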
My streaming params: H264, bitrate 4000, gop_size 0, chunk_size 1200.
A few questions:
Since both encode (2 streams) and decode (2 streams) are running on the same Jetson over loopback, could the shared NVENC/NVDEC be inflating this? I saw NVENC/NVDEC utilization around 50-70% in tegrastats (not saturated), with the engines bursting/idling rather than maxed.
Are there StreamingParameters settings you’d recommend tuning to reduce the encode/decode latency specifically?
Would you expect the latency to drop meaningfully when I move decode to a separate machine over real ethernet?
Thanks for the detailed breakdown; the stage-by-stage measurements make this much easier to reason about.
A few points on what you are seeing:
1. Shared NVENC/NVDEC on the same Jetson
Yes, running 2 encoders and 2 decoders on the same Orin NX over loopback is very likely the main contributor. The engines bursting/idling at 50-70% rather than saturating is typical of latency-bound (not throughput-bound) load: the issue is serialization and scheduling between the four sessions sharing the same hardware engines, plus the queuing introduced by the encoder, not raw engine capacity. The decode side competing with encode on the same box adds round-trip queuing on top of that.
2. StreamingParameters tuning
A few things worth trying:
gop_size: with 0 the SDK uses its default GOP. For low latency, set an explicit short GOP (or all-intra style behavior) so the decoder does not wait on reference frames; this reduces decode-side buffering.
adaptative_bitrate: keep it disabled for fixed local conditions, since the rate controller can add latency while it converges.
bitrate: 4000 kbps at SVGA is comfortable; raising it modestly gives the encoder more headroom on complex frames, though this is secondary here.
chunk_size 1200: this is tuned for typical MTU and is fine for loopback; no change needed.
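To put the bitrate and chunk_size values in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes bitrate is in kbit/s (1 kbit = 1000 bits) and chunk_size is in bytes, and it only gives the average frame size; keyframes will be considerably larger than this average. The function name is hypothetical and not part of the SDK.

```python
import math


def average_frame_budget(bitrate_kbps, fps, chunk_size_bytes):
    """Return (average encoded bytes per frame, chunks needed to carry it)."""
    bytes_per_second = bitrate_kbps * 1000 / 8
    bytes_per_frame = bytes_per_second / fps
    chunks = math.ceil(bytes_per_frame / chunk_size_bytes)
    return bytes_per_frame, chunks


for fps in (60, 30):
    size, chunks = average_frame_budget(4000, fps, 1200)
    print(f"{fps} fps: ~{size:.0f} bytes per frame on average, ~{chunks} chunks")
```

At these settings an average frame spans only a handful of chunks, which is why chunk_size is unlikely to matter much on loopback.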
On the receiver, make sure you are calling grab() in a tight loop so received frames are not waiting in the input buffer, and consider reducing any internal queue depth on your processing side.
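One common way to keep the processing-side queue depth at one is a "latest frame wins" slot between the grab loop and the processing code. The sketch below is plain Python with no SDK calls; the class name and the string frames are placeholders, and it assumes your grab loop runs in one thread and processing in another.

```python
import threading


class LatestFrameSlot:
    """Single-slot buffer: a new frame overwrites any frame not yet consumed."""

    def __init__(self):
        self._cond = threading.Condition()
        self._frame = None
        self._has_frame = False
        self.dropped = 0  # frames overwritten before the consumer took them

    def put(self, frame):
        """Called from the grab loop; never blocks on a slow consumer."""
        with self._cond:
            if self._has_frame:
                self.dropped += 1
            self._frame = frame
            self._has_frame = True
            self._cond.notify()

    def get(self, timeout=None):
        """Called from the processing thread; returns None on timeout."""
        with self._cond:
            if not self._cond.wait_for(lambda: self._has_frame, timeout):
                return None
            frame, self._frame = self._frame, None
            self._has_frame = False
            return frame


# Usage: three frames arrive before the consumer runs; only the newest is kept.
slot = LatestFrameSlot()
for i in range(3):
    slot.put(f"frame-{i}")
latest = slot.get(timeout=0.1)
print(latest, "dropped:", slot.dropped)
```

The trade-off is that stale frames are dropped rather than processed late, which is usually what you want when measuring or minimizing latency.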
3. Moving decode to a separate machine
I would expect a meaningful improvement, yes. Offloading the 2 decode sessions to a separate host frees the Orin NX engines to handle encode only, removing the encode/decode contention that is dominating your numbers. Real Ethernet adds some transport latency, but on a local gigabit link that is typically far smaller than the contention you are currently paying. This is also the intended deployment model for the streaming feature, so it is well worth testing.
As a quick sanity check on the engine-contention theory, you could try streaming a single camera (1 encode + 1 decode) on the box and compare the streaming-stage latency; if it drops sharply, that confirms the shared-engine bottleneck.
Let me know how the separate-host test goes, and feel free to share updated numbers.
Thank you so much for your response! With a single camera, grab + stream + receive comes to 160 ms. Running two cameras in the same process made little difference and only bumped it up to ~164 ms. However, I also tested encoding on its own (just enabling the stream and grabbing images on the sender side), and that added almost no latency, so I will try moving the receiving end to another device and see if it helps.
Also, since the capture timestamp is stamped on the sender and I’d be comparing it against the receiver’s wall clock, the two machines’ clocks being out of sync would show up directly in the latency number. For measuring end-to-end latency across two hosts, do you rely on NTP/PTP to sync the clocks, or is there a recommended way to measure this within the SDK that avoids the cross-clock comparison? I just want to make sure I’m measuring real latency and not clock skew.
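For context, the usual way to quantify that skew without relying on the two clocks already agreeing is the NTP-style four-timestamp exchange. Below is a minimal Python sketch of the arithmetic only. The function names are hypothetical, nothing here is ZED SDK API, and the estimate assumes the network delay is roughly symmetric in both directions.

```python
def estimate_offset_and_delay(t0, t1, t2, t3):
    """NTP-style estimate from one request/reply exchange.

    t0: request sent      (sender clock)
    t1: request received  (receiver clock)
    t2: reply sent        (receiver clock)
    t3: reply received    (sender clock)

    Returns (offset, round_trip_delay), where offset is the receiver clock
    minus the sender clock, assuming a symmetric network path.
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2.0
    round_trip_delay = (t3 - t0) - (t2 - t1)
    return offset, round_trip_delay


def corrected_latency(capture_ts_sender, receive_ts_receiver, offset):
    """Map the receiver timestamp onto the sender clock, then subtract."""
    return (receive_ts_receiver - offset) - capture_ts_sender


# Synthetic example: receiver clock runs 5 s ahead, 1 ms each way,
# 0.5 ms spent on the receiver between receiving the request and replying.
offset, delay = estimate_offset_and_delay(100.0, 105.001, 105.0015, 100.0025)
latency = corrected_latency(200.0, 205.05, offset)
print(f"offset={offset:.4f} s, round trip={delay:.4f} s, latency={latency:.4f} s")
```

Repeating the exchange and keeping the sample with the smallest round-trip delay tends to give the most trustworthy offset, since that sample was least affected by queuing.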