ZED ROS2 Wrapper frames stall randomly during prolonged operation

Setup

  • Hardware: NVIDIA AGX Orin Developer Kit, ZEDLink Duo capture card
  • Cameras:
    • 1× ZED X stereo camera
    • 1× virtual stereo pair from 2× ZED X One GS cameras
  • Software: JetPack 6.2.2 (L4T 36.5.0, Kernel 5.15), stereolabs-zedlink-duo 1.4.2-LI-MAX96712-L4T36.5.0, latest ZED ROS 2 Wrapper; SDK version 5.2.3 (latest supported by ros2 wrapper)
  • Depth mode: NEURAL_LIGHT
  • Launch: both cameras launched together via zed_multi_camera.launch.py, all nodes loaded as composable nodes into a single component_container_isolated running inside a Docker container
  • Power: MAXN + jetson_clocks, stock 19V/4.74A (90W) power supply

Issue

When running the cameras alone (no other workload, no downstream consumers beyond standard ZED wrapper publishing) for prolonged periods, we see intermittent frame publishing stalls of up to ~10 seconds in the worst case. Stalls between ~0.5-1s appear quite frequently. There are no errors in the SDK, Argus, or zed_x_daemon logs during these stalls. The pipeline resumes on its own afterward. tegrastats shows nominal GPU utilization during the stall windows.

Questions

  1. Is this combination — one ZED X stereo + one virtual stereo from a ZED X One GS pair, all in a single composable container via zed_multi_camera.launch.py — a validated configuration?
  2. Are there known SDK-level mechanisms (recovery paths, internal rate limiting, synchronization waits) that could silently pause grab() for 10+ seconds without surfacing in logs?
  3. Is there anything I could disable in my launch files to mitigate this problem?

Hi @withcargocam
This looks like a typical saturation of the DDS middleware communication.

Have you applied the configurations described in this section of the documentation?

I recommend you also read this part of the documentation to improve the performance of your setup.

I had applied the DDS configuration, but it seems my system-wide configuration changes had not been made permanent. It appears the issue is fixed.

@Myzhar Unfortunately, after a few more days of testing, it seems this issue persists. The proposed changes make the stalls less frequent, but they still occur regularly. We have experimented with a few different DDS configurations, but none of them fully resolves the problem. Here is the current Cyclone DDS configuration we are using:

<?xml version='1.0' encoding='us-ascii'?>
<CycloneDDS xmlns="https://cdds.io/config" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://cdds.io/config https://raw.githubusercontent.com/eclipse-cyclonedds/cyclonedds/master/etc/cyclonedds.xsd">
    <Domain id="any">
        <General>
            <Interfaces>
                <NetworkInterface autodetermine="false" name="lo" />
            </Interfaces>
            <AllowMulticast>false</AllowMulticast>
            <EnableMulticastLoopback>false</EnableMulticastLoopback>
            <MaxMessageSize>65500B</MaxMessageSize>
        </General>
        <Discovery>
            <ParticipantIndex>auto</ParticipantIndex>
            <MaxAutoParticipantIndex>150</MaxAutoParticipantIndex>
            <Peers>
                <Peer Address="127.0.0.1" />
            </Peers>
        </Discovery>
        <Internal>
            <!-- Improves the performance of topic subscribers receiving large messages. -->
            <SocketReceiveBufferSize min="64MB" />
            <SocketSendBufferSize min="64MB" />
            <MaxQueuedRexmitBytes>64MB</MaxQueuedRexmitBytes>
            <HeartbeatInterval min="20ms" max="100ms">20ms</HeartbeatInterval>
            <Watermarks>
                <WhcHigh>500kB</WhcHigh>
            </Watermarks>
        </Internal>
    </Domain>
</CycloneDDS>

Any other suggestions that we should try? It seems that when these stalls happen, no subscriber receives a frame. The NVargus and ZED daemon service logs look clean, and so do the ZED node logs.

Hi @withcargocam,

Thanks for the detailed follow-up and for posting your Cyclone DDS configuration; that helps a lot.

The fact that the stalls are synchronized across all topics from both cameras, that no subscriber receives anything during the window, and that NVargus / zed_x_daemon / the node logs all stay clean makes me move away from my initial “generic DDS saturation” guess. That pattern points to something blocking inside the single container, either the socket buffers not being what you think they are, or a reliable writer back-pressuring your shared executor, rather than the wire itself dropping samples. A few things to check, in order:

1. Confirm the socket buffers are actually granted, not just requested

Your config asks for SocketReceiveBufferSize min="64MB", but Cyclone can only obtain that if the kernel allows it. net.core.rmem_max / net.core.wmem_max are not namespaced in Docker; they come from the host kernel. This is very likely the same setting that “wasn’t permanent” in your earlier test. If the host sysctl reverts on reboot, Cyclone silently clamps the socket buffer down to the default (~208 KB), and your NEURAL_LIGHT depth + dual-camera traffic backs up periodically. That would explain “less frequent but still there.”

Run inside the container:

sysctl net.core.rmem_max net.core.wmem_max

If you don’t see the 2 GiB value from our docs, the request is being clamped. The fix has to be applied on the host (the /etc/sysctl.d/60-zed-buffers.conf approach in the docs), not inside the container.
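
The same comparison can be scripted. This is only a minimal sketch: it assumes the standard Linux procfs path for the net.core limits is readable from where you run it, and the helper names (parse_size, kernel_limit, is_clamped) are made up for illustration. The unit table follows the Cyclone DDS convention that kB and MB are binary (2^10 and 2^20 bytes).

```python
import re
from pathlib import Path

# Cyclone DDS memory-size units: kB/KiB = 2**10, MB/MiB = 2**20, GB/GiB = 2**30.
_UNITS = {
    "B": 1,
    "kB": 2**10, "KiB": 2**10,
    "MB": 2**20, "MiB": 2**20,
    "GB": 2**30, "GiB": 2**30,
}


def parse_size(text: str) -> int:
    """Convert a Cyclone DDS size string such as '64MB' to bytes."""
    match = re.fullmatch(r"\s*(\d+)\s*(B|kB|KiB|MB|MiB|GB|GiB)\s*", text)
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def kernel_limit(name: str, root: Path = Path("/proc/sys/net/core")) -> int:
    """Read a net.core limit (for example 'rmem_max') in bytes."""
    return int((root / name).read_text().strip())


def is_clamped(requested: str, limit_bytes: int) -> bool:
    """True when the requested buffer is larger than the kernel limit."""
    return parse_size(requested) > limit_bytes
```

For example, is_clamped("64MB", kernel_limit("rmem_max")) should come back False on a correctly configured host and True if the limit has fallen back to a small default.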

To see what Cyclone actually obtained at startup, temporarily add tracing to your XML:

<Tracing>
    <Verbosity>config</Verbosity>
    <OutputFile>stderr</OutputFile>
</Tracing>

Cyclone will log the real socket buffer size it got — this confirms or kills the buffer hypothesis immediately.

2. Your WhcHigh is likely throttling the reliable writers

WhcHigh at 500kB is quite tight for four large image streams plus depth all flowing through one process. When a reliable writer hits the high-water mark, it blocks the calling thread waiting for ACKs, and in a single composable container with one executor, a blocked publish can stall the callbacks serving the other camera as well. That is a strong candidate for the “everything freezes at once” behavior.

Two options:

  • Raise WhcHigh substantially (try 16MB).
  • Better for your case: since you’ve bound everything to lo and disabled multicast, this is effectively a single-host deployment. I’d strongly recommend enabling Iceoryx shared-memory transport for the large image/point-cloud topics. It bypasses the loopback socket path entirely and is the cleanest fix for intra-host transport of heavy messages.
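
To put the 500kB figure in perspective, here is a rough back-of-the-envelope calculation. The frame size is an assumption, since the thread does not state the grab resolution or encoding: one 1920×1200 image at 4 bytes per pixel. Substitute your own values.

```python
# Assumption, not from the thread: one 1920x1200 image, 4 bytes per pixel.
WIDTH, HEIGHT, BYTES_PER_PIXEL = 1920, 1200, 4

KIB = 2**10  # Cyclone DDS treats "kB" as 2**10 bytes
MIB = 2**20  # and "MB" as 2**20 bytes


def frames_in_cache(whc_high_bytes: int, frame_bytes: int) -> float:
    """How many frame payloads fit below the writer's high-water mark."""
    return whc_high_bytes / frame_bytes


frame_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL
current = frames_in_cache(500 * KIB, frame_bytes)   # WhcHigh = 500kB
proposed = frames_in_cache(16 * MIB, frame_bytes)   # WhcHigh = 16MB

print(f"frame size: {frame_bytes} bytes")
print(f"500kB holds {current:.3f} frames, 16MB holds {proposed:.2f} frames")
```

Under these assumptions, 500kB covers only about 6% of a single frame, so a reliable writer would reach the high-water mark partway through every image, while 16MB leaves room for a little under two full frames.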

3. Isolate transport vs. executor with a quick A/B test

Because all nodes are loaded into a single component_container_isolated, one blocking publish or callback stalls everything. To find out whether the cause is transport-side or executor-side, split the two cameras into two separate containers and run again. If the synchronized stall disappears, the problem is executor blocking, not the wire — in which case a multithreaded executor or permanent container separation is the answer.
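
To compare the two runs objectively, it helps to measure stalls from recorded message timestamps instead of watching the output by eye. The following is a minimal sketch in plain Python. It assumes you have already collected the receive times (in seconds) for one topic into a list, for example from a small subscriber or a bag file; find_stalls and the 0.5 s threshold are illustrative choices, not part of the ZED wrapper.

```python
from typing import Iterable, List, Tuple


def find_stalls(stamps: Iterable[float],
                threshold_s: float = 0.5) -> List[Tuple[float, float]]:
    """Return (start_time, gap_seconds) for every gap above threshold_s."""
    stalls = []
    previous = None
    for stamp in stamps:
        if previous is not None and stamp - previous > threshold_s:
            stalls.append((previous, stamp - previous))
        previous = stamp
    return stalls


# Synthetic timestamps for illustration: 4 Hz with one 1.5 s hole.
example = [0.0, 0.25, 0.5, 2.0, 2.25]
print(find_stalls(example))
```

Running this once per camera topic in each configuration shows whether the gaps still start at the same moments on both cameras after the split.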

4. Rule out power/thermal throttling

NEURAL_LIGHT + dual capture with jetson_clocks on the stock 90W supply is worth double-checking; throttling won’t necessarily show up in the ZED/Argus logs. Could you capture tegrastats across a stall window and share the throttling columns? Also confirm whether the AGX Orin is still on the 90W PSU when both cameras run at full clocks.

My bet is on (1) + (2) together, but the tracing block and the single-vs-dual-container test will tell us definitively. Let me know what you find.

@Myzhar Thanks so much for your responses — they were incredibly useful and really helped me narrow this down. Following up with the resolution in case it helps anyone else.

The suggestions from the team (the network/DDS-side tuning) reduced the frequency but didn’t fully resolve it — and that’s because the real root cause turned out to be completely unrelated to the ZED stack or DDS. It was NVMe interrupt handling.

Setup

  • Jetson AGX Orin, JetPack / L4T R36.5 (upgraded from 35.4)
  • ZED stereo cameras over GMSL2
  • Entire stack deployed via Docker; because of image size, all Docker data is mounted to an NVMe SSD
  • Symptom: sporadic multi-second frame stalls, intermittent. Worked fine on 35.4, started after the 36.5 upgrade.

The tell-tale sign

The stalls lined up with this in dmesg:

nvme nvme0: I/O <tag> QID <n> timeout, completion polled

This comes straight from the Linux NVMe PCI driver, and it’s a well-documented behavior with the kernel shipped in 36.5 (it didn’t happen for us on 35.4). It means the drive actually completed the I/O fine, but the host never received (or missed) the MSI-X interrupt for it. The block layer only notices at the command timeout (~30s default), polls the completion queue, finds the result sitting there, and unblocks. Anything waiting on that I/O — for us, all the container I/O sitting on the NVMe — stalls until then. So it’s a missed-interrupt problem, not a failing drive.
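
For anyone who wants to correlate these messages with frame stalls, here is a minimal sketch that pulls the matching lines out of saved dmesg output. The regular expression follows the message format quoted above; the sample lines in the code are synthetic, made up for illustration, and newer kernels may word the message slightly differently.

```python
import re

# Matches the message format quoted above (kernel 5.15 wording).
POLLED = re.compile(
    r"nvme (?P<ctrl>nvme\d+): I/O (?P<tag>\d+) QID (?P<qid>\d+) "
    r"timeout, completion polled"
)


def polled_timeouts(lines):
    """Return (controller, tag, queue_id) for each 'completion polled' line."""
    hits = []
    for line in lines:
        match = POLLED.search(line)
        if match:
            hits.append((match["ctrl"], int(match["tag"]), int(match["qid"])))
    return hits


# Synthetic dmesg lines, for illustration only.
sample = [
    "[ 1234.567890] nvme nvme0: I/O 12 QID 3 timeout, completion polled",
    "[ 1240.000000] usb 1-1: new high-speed USB device number 2",
]
print(polled_timeouts(sample))
```

Feeding it the output of dmesg saved during a test run gives a quick count of how many I/Os were affected and on which queues.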

The fix

Add these two parameters to the active boot entry’s APPEND line in /boot/extlinux/extlinux.conf:

pcie_aspm=off nvme_core.default_ps_max_latency_us=0

pcie_aspm=off disables PCIe Active State Power Management; nvme_core.default_ps_max_latency_us=0 disables the drive's APST low-power states.

Aggressive ASPM/APST is a known trigger for these missed-interrupt stalls under bursty I/O and after idle wake-ups.
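
Before rebooting, it can be worth checking that the flags ended up on the boot entry that is actually used. Below is a minimal sketch that parses extlinux.conf text, finds the label named by DEFAULT, and reports which of the two flags are missing from its APPEND line. The function names and the SAMPLE config are hypothetical (the label names, root devices and other arguments are placeholders, not a copy of a real Jetson file).

```python
REQUIRED = ("pcie_aspm=off", "nvme_core.default_ps_max_latency_us=0")


def default_append(conf_text: str) -> str:
    """Return the APPEND arguments of the label that DEFAULT points to."""
    default = None
    current = None
    appends = {}
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        key = parts[0].upper()
        value = parts[1] if len(parts) > 1 else ""
        if key == "DEFAULT":
            default = value
        elif key == "LABEL":
            current = value
        elif key == "APPEND" and current is not None:
            appends[current] = value
    if default is None or default not in appends:
        raise ValueError("DEFAULT label has no APPEND line")
    return appends[default]


def missing_flags(append: str):
    """List the required kernel parameters absent from an APPEND line."""
    present = set(append.split())
    return [flag for flag in REQUIRED if flag not in present]


# Hypothetical config for illustration: DEFAULT points at 'custom'.
SAMPLE = """\
TIMEOUT 30
DEFAULT custom

LABEL primary
      MENU LABEL primary kernel
      LINUX /boot/Image
      APPEND ${cbootargs} root=/dev/mmcblk0p1 rw rootwait

LABEL custom
      MENU LABEL custom kernel
      LINUX /boot/Image
      APPEND ${cbootargs} root=/dev/nvme0n1p1 rw rootwait pcie_aspm=off
"""
print(missing_flags(default_append(SAMPLE)))
```

On a real system you would pass in the contents of /boot/extlinux/extlinux.conf instead of SAMPLE; an empty list means both flags are on the active entry.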

Two gotchas that cost me time:

  • Edit the APPEND line of the label your DEFAULT directive points to. For me that was a custom label, not primary. Editing the wrong label does nothing.
  • APPEND must stay a single line, no breaks, or it won’t boot.

Verify after reboot

cat /proc/cmdline                                              # should show the new flags
cat /sys/module/nvme_core/parameters/default_ps_max_latency_us # should print 0

Then run the normal workload and watch dmesg -w — the completion polled lines stopped for us, and the frame stalls went away.

Link to related issue: Nvme timeout happened after upgrading from Jetpack 5.1 to Jetpack 6 - Jetson AGX Orin - NVIDIA Developer Forums

I doubt many people have the same setup I do, but posting in case anyone faces the same issue as I did.

Hi @withcargocam
Thanks for the detailed solution to your problem.
It will be helpful to the Stereolabs community.