Hi @withcargocam,
Thanks for the detailed follow-up and for posting your Cyclone DDS configuration; that helps a lot.
The fact that the stalls are synchronized across all topics from both cameras, that no subscriber receives anything during the window, and that the NVargus, zed_x_daemon, and node logs all stay clean moves me away from my initial “generic DDS saturation” guess. That pattern points to something blocking inside the single container rather than the wire itself dropping samples: either the socket buffers are smaller than your config requests, or a reliable writer is back-pressuring your shared executor. A few things to check, in order:
1. Confirm the socket buffers are actually granted, not just requested
Your config asks for SocketReceiveBufferSize min="64MB", but Cyclone can only obtain that if the kernel allows it. net.core.rmem_max / net.core.wmem_max are not namespaced in Docker; they come from the host kernel. This is very likely the same setting that “wasn’t permanent” in your earlier test. If the host sysctl reverts on reboot, the kernel caps the socket buffer at the default limit (about 208 KiB), and your NEURAL_LIGHT depth + dual-camera traffic backs up periodically. That would explain “less frequent but still there.”
Run inside the container:
sysctl net.core.rmem_max net.core.wmem_max
If you don’t see the 2 GiB value from our docs, the request is being clamped. The fix has to be applied on the host (the /etc/sysctl.d/60-zed-buffers.conf approach in the docs), not inside the container.
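If you want something you can rerun after every reboot, here is a minimal sketch of that check. It assumes the 64MB in your config means 64 * 1024 * 1024 bytes, and the check_limit helper is a name I made up for illustration. It only compares rmem_max against the receive-buffer request and prints wmem_max for reference.

```shell
#!/bin/sh
# Assumption: "64MB" in the Cyclone DDS config is read as 64 * 1024 * 1024 bytes.
REQUIRED_BYTES=$((64 * 1024 * 1024))

# Hypothetical helper: prints "ok" when the kernel limit covers the request,
# "too small" otherwise.
check_limit() {
    limit="$1"
    required="$2"
    if [ "$limit" -ge "$required" ]; then
        echo "ok"
    else
        echo "too small"
    fi
}

# Read the limits straight from procfs; some container setups hide these keys.
for key in rmem_max wmem_max; do
    path="/proc/sys/net/core/$key"
    if [ -r "$path" ]; then
        value=$(cat "$path")
        if [ "$key" = "rmem_max" ]; then
            echo "net.core.$key=$value ($(check_limit "$value" "$REQUIRED_BYTES") for a 64MB receive buffer)"
        else
            echo "net.core.$key=$value"
        fi
    else
        echo "net.core.$key is not readable here; check it on the host instead"
    fi
done
```

If rmem_max reports "too small" right after a reboot, the host-side sysctl file is not being applied.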
To see what Cyclone actually obtained at startup, temporarily add tracing to your XML:
<Tracing>
  <Verbosity>config</Verbosity>
  <OutputFile>stderr</OutputFile>
</Tracing>
Cyclone will log the real socket buffer size it got — this confirms or kills the buffer hypothesis immediately.
2. Your WhcHigh is likely throttling the reliable writers
WhcHigh at 500kB is quite tight for four large image streams plus depth all flowing through one process. When a reliable writer hits the high-water mark, it blocks the calling thread waiting for ACKs, and in a single composable container with one executor, a blocked publish can stall the callbacks serving the other camera as well. That is a strong candidate for the “everything freezes at once” behavior.
Two options:
- Raise WhcHigh substantially (try 16MB).
- Better for your case: since you’ve bound everything to lo and disabled multicast, this is effectively a single-host deployment. I’d strongly recommend enabling Iceoryx shared-memory transport for the large image/point-cloud topics. It bypasses the loopback socket path entirely and is the cleanest fix for intra-host transport of heavy messages.
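To put rough numbers on why 500kB is tight, here is a back-of-the-envelope sketch. The frame geometry (1920x1200, 4 bytes per pixel) is a placeholder I picked for illustration, not something taken from your launch files, and I am assuming Cyclone reads kB and MB as powers of 1024. Substitute your actual resolution and encoding.

```shell
#!/bin/sh
# Hypothetical helper: raw image size in bytes = width * height * bytes per pixel.
frame_bytes() {
    echo $(( $1 * $2 * $3 ))
}

# Assumed units: 1 kB = 1024 bytes, 1 MB = 1024 * 1024 bytes.
WHC_HIGH_CURRENT=$((500 * 1024))
WHC_HIGH_RAISED=$((16 * 1024 * 1024))

# Placeholder geometry: 1920x1200 pixels, 4 bytes per pixel.
FRAME=$(frame_bytes 1920 1200 4)

echo "one frame:                  $FRAME bytes"
echo "WhcHigh (500kB):            $WHC_HIGH_CURRENT bytes"
echo "frame / WhcHigh (500kB):    $(( FRAME / WHC_HIGH_CURRENT ))x"
echo "whole frames in 16MB:       $(( WHC_HIGH_RAISED / FRAME ))"
```

With those placeholder numbers a single unacknowledged frame is already many times the current high-water mark, and even 16MB leaves room for very few frames in flight per writer, which is part of why I lean toward shared memory for the heavy topics.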
3. Isolate transport vs. executor with a quick A/B test
Because all nodes are loaded into a single component_container_isolated, one blocking publish or callback can stall everything. To find out whether the cause is transport-side or executor-side, split the two cameras into two separate containers and run again. If the synchronized stall disappears, the problem is executor blocking, not the wire, in which case a multithreaded executor or permanent container separation is the answer.
4. Rule out power/thermal throttling
NEURAL_LIGHT + dual capture with jetson_clocks on the stock 90W supply is worth double-checking; throttling won’t necessarily show up in the ZED/Argus logs. Could you capture tegrastats across a stall window and share the output, in particular the clock frequencies, temperatures, and power readings? Also confirm whether the AGX Orin is still on the 90W PSU when both cameras run at full clocks.
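To make it easier to line the tegrastats output up with a stall, something like the sketch below can prefix each sample with a wall-clock time. The stamp_lines helper and the tegrastats_stall.log file name are just illustrative names, and the 500 ms interval is an arbitrary choice.

```shell
#!/bin/sh
# Hypothetical helper: prefix every line read from stdin with the current time.
stamp_lines() {
    while IFS= read -r line; do
        printf '%s %s\n' "$(date '+%H:%M:%S')" "$line"
    done
}

if command -v tegrastats >/dev/null 2>&1; then
    # Sample every 500 ms; stop with Ctrl-C once a stall has been captured.
    tegrastats --interval 500 | stamp_lines | tee tegrastats_stall.log
else
    echo "tegrastats not found; run this on the Jetson itself" >&2
fi
```

Note the time at which the topics freeze, then share the few seconds of the log around it.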
My bet is on (1) + (2) together, but the tracing block and the single-vs-dual-container test will tell us definitively. Let me know what you find.