Creating 3D Bounding Boxes

Hi, I successfully installed the ZED SDK, ROS, and YOLO. I can visualize object detection in RViz, and it creates 3D bounding boxes around people. As you already know, YOLO is already trained on 83 objects, and we can create 2D bounding boxes around those objects.
My question is: when you trained your model to detect people and cars, did you first create 2D bounding boxes around the objects and then make them 3D by adding depth information, or did you train directly on 3D point clouds? I am asking because I want to create 3D bounding boxes around the objects that I detect with YOLO.

The objects are detected in 2D (therefore the first output is the 2D bounding box).

To get a 3D bounding box, you need to extract the depth map associated with the 2D image, then convert the 2D points into 3D points.
A simple way is to use the point cloud, which maps [i,j] in pixels to [x,y,z] in world coordinates.
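
For readers without a ready-made point cloud, the same [i,j] to [x,y,z] mapping can be sketched from the depth map with a pinhole camera model. This is only an illustration: the function name and the intrinsics `fx`, `fy`, `cx`, `cy` are assumptions, not part of the ZED SDK, and the result is in the camera frame (Z forward), not a world frame.

```python
from typing import Tuple

Point = Tuple[float, float, float]


def pixel_to_point(i: float, j: float, depth: float,
                   fx: float, fy: float, cx: float, cy: float) -> Point:
    """Back-project pixel (i = column, j = row) with depth Z into camera-frame X, Y, Z.

    fx, fy are focal lengths in pixels and cx, cy the principal point
    (hypothetical values, normally taken from the camera calibration).
    """
    x = (i - cx) * depth / fx
    y = (j - cy) * depth / fy
    return (x, y, depth)


# A pixel at the principal point lies on the optical axis.
center = pixel_to_point(640, 360, 2.0, fx=700.0, fy=700.0, cx=640.0, cy=360.0)
```

When the SDK already provides a point cloud, indexing it at [i,j] gives the same kind of result without doing this computation by hand.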

For something more stable, you can apply a median filter around the [i,j] pixel so that NaN values in the point cloud are handled, then apply a temporal filter to the 3D bounding box positions.
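
A minimal sketch of both filters, assuming the point cloud is available as a row-major grid of (x, y, z) tuples. The helper names, the window radius, and the choice of an exponential moving average as the temporal filter are illustrative assumptions; other temporal filters would work as well.

```python
import math
from statistics import median
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float, float]


def median_point(cloud: Sequence[Sequence[Point]], i: int, j: int,
                 radius: int = 2) -> Optional[Point]:
    """Per-axis median over a window around column i, row j, skipping NaN/inf points."""
    height, width = len(cloud), len(cloud[0])
    valid = []
    for row in range(max(0, j - radius), min(height, j + radius + 1)):
        for col in range(max(0, i - radius), min(width, i + radius + 1)):
            p = cloud[row][col]
            if all(math.isfinite(c) for c in p):
                valid.append(p)
    if not valid:
        return None  # no usable depth in the whole window
    return tuple(median(p[k] for p in valid) for k in range(3))


def smooth(previous: Optional[Point], current: Point, alpha: float = 0.3) -> Point:
    """Exponential moving average; alpha close to 1 follows new measurements faster."""
    if previous is None:
        return current
    return tuple(alpha * c + (1 - alpha) * p for p, c in zip(previous, current))


# 3x3 cloud whose center pixel has no valid depth.
nan = float("nan")
cloud = [[(1.0, 2.0, 3.0)] * 3 for _ in range(3)]
cloud[1][1] = (nan, nan, nan)
filtered = median_point(cloud, i=1, j=1, radius=1)
```

Calling `smooth` once per frame on each box corner (or on the box center) keeps the 3D box from jittering when the depth under a pixel flickers.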


Thanks @obraun-sl,

I am a little confused about how you get the final 3D box from the translated 3D points. Is there any documentation or an example that I can follow?


There is no ready-to-use sample that converts 2D boxes to 3D boxes, but you can take a look at the repository here :

You can use this function here :

Instead of using the center of the 2D bbox (bounds[0] / bounds[1]), use the 4 corner points of the 2D bbox and convert them to 3D using the same code. It will return an X,Y,Z for each of the 4 points.

To create the 4 remaining points, you need to generate them using an arbitrary rule that might depend on the object class.
A simple way would be to say that the 4 remaining points are 1 m away from the existing 4 3D points, either along the Z axis or along the camera -> object axis.
This 1 m value fits people, but you might change it for other objects.
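
The rule above can be sketched as follows. This assumes the 4 front corners are already in a camera frame with the camera at the origin and Z pointing forward; the function name `extrude_box` and its parameters are hypothetical, and the 1 m default is just the value suggested for people.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def extrude_box(front: Sequence[Point], thickness: float = 1.0,
                along_ray: bool = False) -> List[Point]:
    """Return 8 corners: the 4 given front corners plus 4 pushed back by `thickness`.

    along_ray=False pushes along the Z axis; along_ray=True pushes along the
    camera -> object direction (origin to the centroid of the front face).
    """
    if along_ray:
        centroid = [sum(p[k] for p in front) / len(front) for k in range(3)]
        norm = math.sqrt(sum(c * c for c in centroid))
        if norm == 0.0:
            raise ValueError("front face centroid coincides with the camera origin")
        direction = tuple(c / norm for c in centroid)
    else:
        direction = (0.0, 0.0, 1.0)
    back = [tuple(c + thickness * d for c, d in zip(p, direction)) for p in front]
    return list(front) + back


# Front face of a person-sized box, 3 m in front of the camera.
front = [(-0.5, -1.0, 3.0), (0.5, -1.0, 3.0), (0.5, 1.0, 3.0), (-0.5, 1.0, 3.0)]
box = extrude_box(front)
```

A per-class lookup table for `thickness` (for example a larger value for cars) is one simple way to make the rule depend on the object class.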

@obraun-sl thank you. I am checking that out.

Hi @MBaranPeker, I am working on a similar project to generate 3D bounding boxes from a 2D detector and a depth map. Do you have any ideas for implementing it?

@obraun-sl is there sample code that shows how I can use these 3D points to calculate the velocity of the detected object?

Hello, I am working on a similar project. I am using Mask R-CNN to find the bounding boxes of objects in 2D color images, then using the depth map to create 3D bounding boxes. Has anyone solved this problem?

Hi @sim and @Nilesh-Hampiholi, I am also trying to do the same thing. I am using MaskRCNN benchmark to get the 2D bounding boxes and then using the depth image to get the 3D bounding box. Have you solved it? If you have, can you please give me some hints? Thanks!

Hi @bibekyess, I found a few solutions for monocular images. They use the approximate size of cars and the fact that the 3D box fits inside the 2D box. But I was not able to find a proper solution for custom objects.

@bibekyess @Nilesh-Hampiholi You can do this by reprojecting the image points to 3D using OpenCV (look at stereoRectify and reprojectImageTo3D). The general pipeline would be:

  • Detect Objects in 2D and get Stereo Depth
  • Project all image points to a 3D point cloud (downsample this)
  • Convert detected object center points to 3D and cluster in 3D space
  • Get associated clusters and their 3D bounding boxes

Then you can convert the 3D bounding box points back to 2D via the projection matrix and draw them on the image. Open3D helps once you have the point cloud. I feel like there is a better way to do this, but this should at least provide a basic solution.
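
The last two steps (box from a cluster, then back to 2D) can be illustrated in plain Python, assuming the cluster for one object has already been extracted as a list of camera-frame points. The function names and the intrinsics are hypothetical; in practice an Open3D bounding-box helper and the projection matrix from calibration would take the place of these two functions.

```python
from itertools import product
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Pixel = Tuple[float, float]


def aabb_corners(points: Sequence[Point]) -> List[Point]:
    """8 corners of the axis-aligned bounding box enclosing a cluster of 3D points."""
    mins = [min(p[k] for p in points) for k in range(3)]
    maxs = [max(p[k] for p in points) for k in range(3)]
    # One (min, max) pair per axis; product picks every min/max combination.
    return [tuple(corner) for corner in product(*zip(mins, maxs))]


def project(point: Point, fx: float, fy: float, cx: float, cy: float) -> Pixel:
    """Pinhole projection of a camera-frame point (Z forward) to pixel coordinates."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    return (fx * x / z + cx, fy * y / z + cy)


cluster = [(0.0, 0.0, 2.0), (1.0, 1.0, 4.0), (0.5, 0.2, 3.0)]
corners = aabb_corners(cluster)
pixels = [project(c, fx=700.0, fy=700.0, cx=640.0, cy=360.0) for c in corners]
```

Drawing lines between the projected corners that share two coordinates in 3D gives the usual wireframe box on the image.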

I have tried the same pipeline as yours: using instance segmentation to get the instance area, then combining it with the depth map to compute the 3D bounding box. But real-time performance is a problem. So far I still haven't found a better approach to solve that problem.