Hi Stereolabs community,
We are looking for advice on which Stereolabs camera(s) and development setup to use for a set of indoor depth-based prototypes we want to build and benchmark.
Use cases (indoor)
- Single-person full-body tracking
  Goal: robust skeleton/body tracking for one person in a defined interaction zone, stable over time and tolerant of typical guest behavior (turning, partial occlusions).
- Simultaneous multi-person full-body tracking
  Goal: track multiple people at once, maintain identities, handle occlusions and people crossing, and keep tracking stable in a public-facing environment.
- Gesture control / recognition
  Goal: detect and classify a limited set of gestures (interactive triggers), with a low false-positive rate and consistent behavior across different body types and lighting conditions.
- Wrist / finger tracking and pointing direction
  Goal: track wrist/hand position plus pointing direction, with precision on the order of 10 cm, enough for “point at object/UI” interactions.
- Multi-room tracking (multi-camera, multi-space)
  Goal: start with 2 cameras across two spaces, with a seamless tracking experience across zones. We need practical guidance on calibration, synchronization, and coordinate alignment between rooms.
Targets / constraints
- Environment: indoor, varying room sizes. We want to explore practical limits (small to large spaces, single to multi-person).
- End-to-end latency target: <100 ms.
- Compute: we can process on a PC (possibly preferred). If a Jetson-based setup is recommended for latency/robustness, we are open to it.
What we are asking for
A) Recommended hardware configuration
- Which camera model(s) fit these use cases best (especially multi-person + hand/pointing + multi-room)?
- Any required accessories (mounts, sync hardware if applicable, cables, recommended cable lengths/limits)?
- If multi-camera: best-practice setup and constraints we should know up front.
B) Recommended compute platform
- PC specs (CPU/GPU suggestions, USB/PCIe requirements) needed to reach <100 ms end-to-end latency
- When is a Jetson setup actually the better choice here?
C) Key limitations / gotchas
- Expected limitations in multi-person identity persistence, occlusion handling, and hand/pointing precision
- Any known pitfalls in lighting, reflective surfaces, crowded scenes, or calibration drift
- Best practices for stable calibration/alignment in a real venue (multi-room)
Thanks in advance for any recommendations. Pointers to relevant docs, examples, or threads are also welcome.