Creating 3D Bounding Boxes

MBaranPeker · January 7, 2021, 5:30am

Hi, I successfully installed the ZED SDK, ROS, and YOLO. I can visualize object detection in RViz, and it creates 3D bounding boxes around people. As you already know, YOLO comes pre-trained on 83 objects, and we can create 2D bounding boxes around those objects.
My question is: when you trained your object detection model for people and cars, did you first create 2D bounding boxes around the objects and then make them 3D by adding depth information, or did you train directly on 3D point clouds? The reason I ask is that I want to create 3D bounding boxes around the objects I detect with YOLO.
Thanks

obraun-sl · January 7, 2021, 7:49am

Hi,
The objects are detected in 2D (therefore the first output is the 2D bounding box).

To get a 3D bounding box, you will need to extract the depth map associated with the 2D image, then convert the 2D points into 3D points.
A simple way is to use the point cloud, which maps [i,j] in pixels to [x,y,z] in world coordinates.

To get something more stable, you can apply a median filter around the [i,j] pixel so that you can handle NaN values in the point cloud, then apply a temporal filter to the 3D bounding box positions.
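
For illustration, here is a minimal sketch of the median lookup and a temporal filter. It is not code from the ZED SDK or the zed-yolo sample. It assumes the point cloud can be indexed as point_cloud[row][col] and yields at least (x, y, z) per pixel (a nested list or a NumPy array both work), with i as the pixel column and j as the pixel row. The names median_xyz and smooth, the window radius, and the choice of an exponential moving average as the temporal filter are all assumptions made for this example.

```python
import math
from statistics import median


def median_xyz(point_cloud, i, j, radius=2):
    """Median (x, y, z) over a (2*radius+1)^2 window around column i, row j.

    Points with a NaN or infinite coordinate are skipped. Returns None when
    every point in the window is invalid.
    """
    height, width = len(point_cloud), len(point_cloud[0])
    xs, ys, zs = [], [], []
    # Clamp the window to the image so border pixels do not index out of range.
    for row in range(max(0, j - radius), min(height, j + radius + 1)):
        for col in range(max(0, i - radius), min(width, i + radius + 1)):
            x, y, z = point_cloud[row][col][:3]
            if all(math.isfinite(v) for v in (x, y, z)):
                xs.append(x)
                ys.append(y)
                zs.append(z)
    if not xs:
        return None
    return (median(xs), median(ys), median(zs))


def smooth(previous, current, alpha=0.5):
    """Exponential moving average used here as a simple temporal filter.

    alpha near 1 follows new measurements closely; alpha near 0 smooths more.
    A missing measurement (None) falls back to the other value.
    """
    if previous is None:
        return current
    if current is None:
        return previous
    return tuple(alpha * c + (1 - alpha) * p for p, c in zip(previous, current))
```

Calling smooth once per frame for each box corner, with the previous frame's filtered value as previous, damps frame-to-frame jitter at the cost of some lag on fast-moving objects.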

Best,
OB

MBaranPeker · January 8, 2021, 12:41am

Thanks @obraun-sl,

I am a little confused about how you get the final 3D box from the converted 3D points. Is there any documentation or an example that I can follow?

Thanks

obraun-sl · January 8, 2021, 7:28am

There is no ready-to-use sample that converts 2D boxes to 3D boxes, but you can take a look at this function in the zed-yolo repository:

github.com/stereolabs/zed-yolo/blob/master/zed_python_sample/darknet_zed.py#L264


    res = sorted(res, key=lambda x: -x[1])
    free_detections(dets, num)
    return res
netMain = None
metaMain = None
altNames = None
def get_object_depth(depth, bounds):
    '''
    Calculates the median x, y, z position of top slice(area_div) of point cloud
    in camera frame.
    Arguments:
        depth: Point cloud data of whole frame.
        bounds: Bounding box for object in pixels.
            bounds[0]: x-center
            bounds[1]: y-center
            bounds[2]: width of bounding box.
            bounds[3]: height of bounding box.
    '''
    # (preview ends here; the function body is in the linked file)

Instead of using the center of the 2D bbox (bounds[0] / bounds[1]), use the four corners of the 2D bbox and convert each one to 3D using the same code. It will return an X, Y, Z for each of the four points.

To create the four remaining points, you need to generate them using an arbitrary rule that might depend on the object class.
A simple way would be to say that the four remaining points are 1 m away from the existing four 3D points, either along the Z axis or along the camera -> object axis. This 1 m value fits people, but you may need to change it for other objects.
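
As a rough sketch of that rule (not code from the zed-yolo repository): the function names below are made up for this example, bounds follows the (x-center, y-center, width, height) layout from the docstring above, and pixel_to_xyz is a hypothetical callable that returns an (x, y, z) tuple for a pixel, or None when no valid depth is available. It also assumes Z is the camera's forward axis and that the camera sits at the origin; adjust this if your coordinate frame differs (ROS frames, for example, usually put X forward).

```python
import math


def bbox_corners_px(bounds):
    """Pixel corners of a box given as (x_center, y_center, width, height)."""
    xc, yc, w, h = bounds
    left, right = int(xc - w / 2), int(xc + w / 2)
    top, bottom = int(yc - h / 2), int(yc + h / 2)
    return [(left, top), (right, top), (right, bottom), (left, bottom)]


def extrude_box(front, extent=1.0, along_ray=False):
    """Return 8 corners: the 4 front points plus 4 points pushed back by extent.

    along_ray=False shifts along the Z axis (assumed to be the forward axis).
    along_ray=True shifts along the camera -> object axis, taken here as the
    direction from the origin to the centroid of the front points.
    """
    if along_ray:
        cx = sum(p[0] for p in front) / len(front)
        cy = sum(p[1] for p in front) / len(front)
        cz = sum(p[2] for p in front) / len(front)
        norm = math.sqrt(cx * cx + cy * cy + cz * cz)
        if norm == 0.0:
            raise ValueError("front face is centered on the camera origin")
        offset = (extent * cx / norm, extent * cy / norm, extent * cz / norm)
    else:
        offset = (0.0, 0.0, extent)
    back = [(x + offset[0], y + offset[1], z + offset[2]) for x, y, z in front]
    return list(front) + back


def box_from_detection(bounds, pixel_to_xyz, extent=1.0, along_ray=False):
    """2D detection -> 8 3D corners, or None if any corner lacks valid depth."""
    front = [pixel_to_xyz(i, j) for i, j in bbox_corners_px(bounds)]
    if any(p is None for p in front):
        return None
    return extrude_box(front, extent, along_ray)
```

Two caveats with this sketch: corners on the image border may need clamping before the lookup, and the corners of a 2D box often land on the background rather than on the object, so the depth values there can be noticeably larger than the object's own depth.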

MBaranPeker · January 8, 2021, 7:37am

@obraun-sl thank you. I am checking that out.

sim · May 8, 2022, 6:33pm

Hi @MBaranPeker, I am working on a similar project to generate 3D bounding boxes from a 2D detector and a depth map. Do you have any ideas for implementing it?
Thanks

MadhuriPatil1694 · May 9, 2022, 2:59pm

@obraun-sl Is there sample code showing how I can use these 3D points to calculate the velocity of the detected object?

Nilesh-Hampiholi · June 17, 2022, 12:54pm

Hello, I am working on a similar project. I am using Mask R-CNN to find bounding boxes for objects in 2D color images, then using the depth map to create 3D bounding boxes. Has anyone solved this problem?

bibekyess · August 26, 2022, 11:10am

Hi @sim and @Nilesh-Hampiholi, I am also trying to do the same thing. I am using MaskRCNN benchmark to get the 2D bounding boxes and then using the depth image to get the 3D bounding box. Have you solved it? If you have, can you please give me some hints? Thanks!

Nilesh-Hampiholi · August 29, 2022, 1:08am

Hi @bibekyess, I found a few solutions for monocular images. They use the approximate size of cars and the fact that the projected 3D box fits inside the 2D box. But I was not able to find a proper solution for custom objects.

blueeagle100 · September 5, 2022, 9:36pm

@bibekyess @Nilesh-Hampiholi You can do this by reprojecting the image points to 3D using OpenCV (look at stereoRectify and reprojectImageTo3D). The general pipeline would be:

1. Detect objects in 2D and get stereo depth
2. Project all image points to a 3D point cloud (downsample this)
3. Convert detected object center points to 3D and cluster in 3D space
4. Get associated clusters and their 3D bounding boxes

Then you can convert the 3D bounding box points back to 2D via the projection matrix and draw them on the image. Open3D helps once you have the point cloud. I feel like there is a better way to do this, but this should at least provide a basic solution.
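
A simplified, dependency-free sketch of the pipeline above, under several stated assumptions: it uses a plain pinhole model with intrinsics (fx, fy, cx, cy) instead of OpenCV's reprojectImageTo3D, it only back-projects pixels inside one 2D box instead of the whole image, and it replaces real 3D clustering with a crude depth gate around the median depth. All function names and the tolerance and step parameters are made up for this example.

```python
import math
from statistics import median_low


def backproject(u, v, z, fx, fy, cx, cy):
    """Pinhole model: pixel (u, v) at depth z -> camera-frame (x, y, z)."""
    return ((u - cx) * z / fx, (v - cy) * z / fy, z)


def project(point, fx, fy, cx, cy):
    """Pinhole model: camera-frame (x, y, z) with z > 0 -> pixel (u, v)."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


def box_3d_from_depth(depth, box_px, intrinsics, tolerance=0.5, step=2):
    """Axis-aligned 3D box of the points in a 2D box that lie near its median depth.

    depth: depth[row][col] in metres (NaN, inf or <= 0 means invalid).
    box_px: (left, top, right, bottom) in pixels, right/bottom exclusive.
    step: pixel stride, a simple form of downsampling.
    Returns (mins, maxs) as two (x, y, z) tuples, or None without valid depth.
    """
    fx, fy, cx, cy = intrinsics
    left, top, right, bottom = box_px
    samples = []
    for v in range(top, bottom, step):
        for u in range(left, right, step):
            z = depth[v][u]
            if math.isfinite(z) and z > 0:
                samples.append((u, v, z))
    if not samples:
        return None
    # median_low always returns an actual sample, so the gate below keeps
    # at least one point.
    z_med = median_low(z for _, _, z in samples)
    points = [backproject(u, v, z, fx, fy, cx, cy)
              for u, v, z in samples if abs(z - z_med) <= tolerance]
    mins = tuple(min(p[k] for p in points) for k in range(3))
    maxs = tuple(max(p[k] for p in points) for k in range(3))
    return mins, maxs


def box_corners(mins, maxs):
    """The 8 corners of an axis-aligned box, ready to pass through project()."""
    return [(x, y, z)
            for x in (mins[0], maxs[0])
            for y in (mins[1], maxs[1])
            for z in (mins[2], maxs[2])]
```

The depth gate only works when the object covers most of the 2D box; if the background dominates, the median lands on the background instead, which is where a segmentation mask or proper clustering would be needed.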

sim · September 6, 2022, 8:05am

I have tried the same pipeline as yours, using instance segmentation to get the instance area and then combining it with the depth map to compute the 3D bounding box. But real-time performance is a problem. So far I still haven't found a better approach to solve that problem.