I just reviewed the course “Perception in 5 Days.” I am now wondering how I could obtain the 3D coordinates of an object (let’s say a person) in my 3D environment to potentially navigate (using nav2) toward that coordinate. I could even add these coordinates to a map with markers.
I see that we are extracting “persons” and adjusting the position of the bot based on raw image input. However, I don’t see an easy way in the course to deduce the 3D spatial coordinates. I suspect we would need to use depth images with a stereo camera?
Also, is this process called 3D pose estimation, or is there another term for it? Perhaps you have a course specifically about this?
Hi @rouiller.romain13, one way to do this is to detect the object in the 2D image and then read the depth information from a depth camera (or a PointCloud2 publisher) at the pixels where the object was detected. You could cluster the points belonging to the object (there is a rough point-cloud sketch after the list below) and then publish a TF transform from the robot to the object:
1. Use Depth Images: Convert the pixel coordinates of a detection (e.g., a person) into 3D space by combining them with the depth value at that pixel and the camera intrinsics (a combined sketch of steps 1–3 follows this list).
2. TF Transform: Convert those 3D coordinates from the camera’s reference frame into the robot’s base frame using tf2.
3. Publish Markers: Add markers to your map by publishing the 3D coordinates as visualization messages, which tools like RViz2 can display to show the person’s position.
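Here is a minimal Python sketch of steps 1–3 combined, assuming your detector already gives you a pixel (u, v), the depth at that pixel in meters, and the camera intrinsics (fx, fy, cx, cy) from the CameraInfo topic. The frame names (`camera_depth_optical_frame`, `base_link`), the topic name, and the node name are placeholders to adapt to your robot:

```python
import rclpy
from rclpy.node import Node
from rclpy.time import Time
from geometry_msgs.msg import PointStamped
from visualization_msgs.msg import Marker
from tf2_ros import Buffer, TransformListener
from tf2_geometry_msgs import do_transform_point


class PersonLocator(Node):
    def __init__(self):
        super().__init__('person_locator')  # hypothetical node name
        self.tf_buffer = Buffer()
        self.tf_listener = TransformListener(self.tf_buffer, self)
        self.marker_pub = self.create_publisher(Marker, 'person_marker', 10)

    def locate(self, u, v, depth, fx, fy, cx, cy):
        # Step 1: back-project the pixel (u, v) and its depth (meters)
        # with the pinhole model; fx, fy, cx, cy come from CameraInfo.
        pt = PointStamped()
        pt.header.frame_id = 'camera_depth_optical_frame'  # assumption
        pt.header.stamp = self.get_clock().now().to_msg()
        pt.point.x = (u - cx) * depth / fx
        pt.point.y = (v - cy) * depth / fy
        pt.point.z = float(depth)

        # Step 2: transform into the robot's base frame with tf2.
        # Raises if the transform is not available yet.
        tf = self.tf_buffer.lookup_transform(
            'base_link', pt.header.frame_id, Time())
        pt_base = do_transform_point(pt, tf)

        # Step 3: publish a sphere marker that RViz2 can display.
        marker = Marker()
        marker.header = pt_base.header
        marker.type = Marker.SPHERE
        marker.action = Marker.ADD
        marker.pose.position = pt_base.point
        marker.pose.orientation.w = 1.0
        marker.scale.x = marker.scale.y = marker.scale.z = 0.3
        marker.color.r = 1.0
        marker.color.a = 1.0
        self.marker_pub.publish(marker)


def main():
    rclpy.init()
    node = PersonLocator()
    # In practice, call node.locate(...) from your detection callback, e.g.:
    # node.locate(u=320, v=240, depth=2.0, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
    rclpy.spin(node)
```

Note that in the optical-frame convention z points forward out of the lens, which is why the depth value becomes the point’s z coordinate.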
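If you subscribe to a PointCloud2 topic instead of a depth image, a rough way to isolate the object is to cluster the cloud and take the centroid of the largest cluster. This sketch uses sensor_msgs_py and scikit-learn’s DBSCAN; the eps and min_samples values are assumptions you would tune to your sensor:

```python
import numpy as np
from sensor_msgs.msg import PointCloud2
from sensor_msgs_py import point_cloud2
from sklearn.cluster import DBSCAN


def object_centroid(cloud_msg: PointCloud2):
    """Return the (x, y, z) centroid of the largest cluster in the cloud,
    expressed in the cloud's own frame (cloud_msg.header.frame_id)."""
    xyz = np.array([list(p) for p in point_cloud2.read_points(
        cloud_msg, field_names=('x', 'y', 'z'), skip_nans=True)])
    if xyz.size == 0:
        return None
    # eps = 10 cm neighborhood, min 20 points per cluster (tune these)
    labels = DBSCAN(eps=0.1, min_samples=20).fit_predict(xyz)
    valid = labels[labels >= 0]  # DBSCAN marks noise points as -1
    if valid.size == 0:
        return None
    biggest = np.bincount(valid).argmax()
    return xyz[labels == biggest].mean(axis=0)
```

You could then broadcast a TF frame at that centroid with tf2_ros’s TransformBroadcaster, as mentioned above.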
I don’t think it’s called 3D pose estimation; that term refers to estimating the joint positions of an articulated object, like a person’s skeleton. 3D object detection is a better fit.
Hello, thanks for answering @roalgoal. Unfortunately, the process isn’t that clear to me at the moment, especially steps 1 and 2. Maybe there is an existing repo or some libraries implementing just that?
Well, since steps 1 and 2 are the main part and step 3 is just for visualization, I’d recommend first taking our ROS2 basics course and then our TF course to get a clearer picture of the process you would need to follow.
I don’t know if there is an existing repo implementing this, maybe darknet_ros can help you?
You need computer vision knowledge (check out our OpenCV course) to be able to detect an object in an image. Then, since you need a depth camera to get spatial information about that object, you would match the detection’s image pixels to the corresponding depth values (see the sketch below).
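As a sketch of that matching step, assuming the depth image is registered (aligned) to the RGB image so pixel coordinates correspond, and that your detector returns a bounding box (x, y, w, h):

```python
import numpy as np
from cv_bridge import CvBridge
from sensor_msgs.msg import Image

bridge = CvBridge()


def depth_of_detection(depth_msg: Image, bbox):
    """Estimate the depth (meters) of a detection from an aligned depth image.
    bbox is (x, y, w, h) in pixels, as returned by a hypothetical detector."""
    depth = bridge.imgmsg_to_cv2(depth_msg, desired_encoding='passthrough')
    x, y, w, h = bbox
    roi = depth[y:y + h, x:x + w].astype(np.float32)
    if depth_msg.encoding == '16UC1':
        roi /= 1000.0  # many RGB-D cameras publish 16-bit depth in millimeters
    roi = roi[np.isfinite(roi) & (roi > 0.0)]  # drop NaN/zero (invalid) readings
    if roi.size == 0:
        return None
    # The median is robust to background pixels that fall inside the box
    return float(np.median(roi))
```

The returned depth can then be fed into the back-projection step shown earlier.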