Recommended ZED setup for indoor multi-person body tracking, gestures, hand pointing, and multi-room tracking (latency <100 ms)

Hi Stereolabs community,

We are looking for advice on which Stereolabs camera(s) and development setup to use for a set of indoor depth-based prototypes we want to build and benchmark.

Use cases (indoor)

  1. Single-person full-body tracking
    Goal: robust skeleton/body tracking for one person in a defined interaction zone, stable over time and tolerant of typical guest behavior (turning, partial occlusions).

  2. Multiple simultaneous full-body tracking
    Goal: track multiple people at once, maintain identities, handle occlusions and people crossing, and keep tracking stable in a public-facing environment.

  3. Gesture control / recognition
    Goal: detect and classify a limited set of gestures (interactive triggers), with low false positives and consistent behavior across different body types and lighting.

  4. Wrist / finger tracking and pointing direction
    Goal: track wrist/hand position plus pointing direction, with useful precision at roughly 10 cm scale for “point at object/UI” interactions.

  5. Multi-room tracking (multi-camera, multi-space)
    Goal: start with 2 cameras across two spaces, with a seamless tracking experience across zones. We need practical guidance on calibration, synchronization, and coordinate alignment between rooms.

Targets / constraints

  • Environment: indoor, varying room sizes. We want to explore practical limits (small to large spaces, single to multi-person).

  • End-to-end latency target: <100 ms.

  • Compute: we can process on a PC (possibly preferred). If a Jetson-based setup is recommended for latency/robustness, we are open to it.

What we are asking for

A) Recommended hardware configuration

  • Which camera model(s) fit these use cases best (especially multi-person + hand/pointing + multi-room)?

  • Any required accessories (mounts, sync hardware if applicable, cables, recommended cable lengths/limits)?

  • If multi-camera: best-practice setup and constraints we should know up front.

B) Recommended compute platform

  • PC specs (CPU/GPU suggestions, USB/PCIe requirements) for <100 ms end-to-end

  • When is a Jetson setup actually the better choice here?

C) Key limitations / gotchas

  • Expected limitations in multi-person identity persistence, occlusion handling, and hand/pointing precision

  • Any known pitfalls in lighting, reflective surfaces, crowded scenes, or calibration drift

  • Best practices for stable calibration/alignment in a real venue (multi-room)

Thanks in advance for any recommendations. Pointers to relevant docs, examples, or threads are also welcome.

Hi @MartijnDekker
Welcome to the Stereolabs community.

All of our cameras can be used for the tasks you described.

While the ZED SDK provides “Single-person full-body tracking” and “Multiple simultaneous full-body tracking,” you must use external libraries or your own solutions to perform “Gesture control/recognition”, “Wrist/finger tracking and pointing direction”, and “Multi-room tracking (multi-camera, multi-space)”.
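
As a rough illustration of the "own solution" route for pointing direction, here is a minimal sketch. It assumes you already have 3D elbow and wrist positions (in meters, in one common coordinate frame) from your body-tracking source, and it uses the forearm as a proxy for the pointing ray, which it intersects with a plane such as a wall or screen. The function name `pointing_hit`, the plane, and the sample coordinates are hypothetical and not part of the ZED SDK.

```python
import math

def pointing_hit(elbow, wrist, plane_point, plane_normal):
    """Intersect the elbow->wrist ray with a plane; return the 3D hit point or None."""
    # Ray direction: from the elbow through the wrist (forearm as pointing proxy).
    d = [w - e for w, e in zip(wrist, elbow)]
    length = math.sqrt(sum(c * c for c in d))
    if length < 1e-9:
        return None  # degenerate keypoints (elbow and wrist coincide)
    d = [c / length for c in d]

    # Standard ray/plane intersection: solve n . (wrist + t*d - plane_point) = 0 for t.
    denom = sum(n * c for n, c in zip(plane_normal, d))
    if abs(denom) < 1e-9:
        return None  # ray is parallel to the plane
    t = sum(n * (p - w) for n, p, w in zip(plane_normal, plane_point, wrist)) / denom
    if t < 0:
        return None  # plane lies behind the hand
    return tuple(w + t * c for w, c in zip(wrist, d))

# Hypothetical sample: a wall at z = 3 m whose normal faces the user (-z).
hit = pointing_hit(elbow=(0.0, 1.2, 0.5), wrist=(0.1, 1.3, 0.8),
                   plane_point=(0.0, 0.0, 3.0), plane_normal=(0.0, 0.0, -1.0))
print(hit)
```

Keep in mind that angular noise in the keypoints is amplified with distance: at 3 m, an angular error of about 2 degrees already shifts the hit point by roughly 10 cm, so some temporal smoothing of the keypoints is likely needed for the precision you mention.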

You can work with PCs or Jetson devices. A Jetson device is required if you select a ZED camera of the ZED X series.

I recommend consulting our online documentation, which covers most of your questions in detail.

In case you need additional information, do not hesitate to ask for it.

Walter

Walter Lucetti
Senior Computer Engineer
SDK / Robotics / HW
Stereolabs Support