Issues when testing on a real robot

General issue: the find-wall part of rosject part 2 works in simulation but not on the real robot, and it is becoming hard to troubleshoot with an inconsistent connection to the real robot.

My intended logic:

  • The /scan subscriber callback calculates the minimum distance and its index.
  • The server callback runs a while loop that goes through a series of states:
  • FACE_FORWARD: spin until the front laser index matches the min index.
  • DISTANCE_ALIGN: move forward or backward until the min distance is approximately 0.3 m.
  • FACE_RIGHT: same as FACE_FORWARD, but the target is the right laser index.
  • Finally, send the response and end the callback.
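The transition logic above can be sketched as a pure function, decoupled from ROS, which also makes it unit-testable without the robot. This is a simplified sketch, not my actual node: the index constants below are hypothetical stand-ins for the real ranges.

```python
# Hypothetical index constants standing in for the real laser index ranges.
FORWARD_IDX = range(355, 363)
RIGHT_IDX = range(266, 274)
BACKWARD_IDX = 180
TARGET_DIST = 0.3   # meters
TOLERANCE = 0.02    # meters

def step(mode, min_idx, min_dist, turn=0.1, drive=0.1):
    """Return ((linear_x, angular_z), next_mode) for one control tick."""
    if mode == "FACE_FORWARD":
        if min_idx in FORWARD_IDX:
            return (0.0, 0.0), "DISTANCE_ALIGN"
        # Turn toward the front index, choosing direction by which half
        # of the scan the minimum currently sits in.
        return (0.0, -turn if min_idx >= BACKWARD_IDX else turn), mode
    if mode == "DISTANCE_ALIGN":
        if abs(min_dist - TARGET_DIST) <= TOLERANCE:
            return (0.0, 0.0), "FACE_RIGHT"
        return (drive if min_dist > TARGET_DIST else -drive, 0.0), mode
    if mode == "FACE_RIGHT":
        if min_idx in RIGHT_IDX:
            return (0.0, 0.0), "ALIGNED"
        return (0.0, turn), mode
    return (0.0, 0.0), mode  # ALIGNED / unknown: stop
```

Keeping the state machine pure like this means the while loop in the node only has to read the latest scan, call `step`, and publish the result.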

What is actually happening on the real robot:

The robot keeps spinning after switching to DISTANCE_ALIGN mode. After 1-2 minutes it stops spinning and then drives forward/backward without stopping. ros2 topic echo /cmd_vel shows that the correct command is being sent, but the robot is not reflecting it (rostopic echo /cmd_vel shows the same). The robot seems to be stuck: in this state, stopping the program with Ctrl+C and trying to restart it, or trying to control the robot with the joystick, does nothing; the robot keeps performing the previous action, like spinning or driving forward. After another few minutes the robot comes "back to normal", responding to the joystick, and the program can be run again. If the robot manages to get through all the stages and complete the service, the wall-following behavior works fine.

I understand that troubleshooting why something does not work in reality is the fun part of robotics, but the remote real-robot experience is starting to confuse me.
My experience with connecting to real robots:
While doing the course, I tried testing my project on a real robot after each part:
The first part (wall following) was working fine, although there was no joystick.
The second part (finding the wall) was giving me issues, but I noticed I had not adjusted for the different laser scan sensors, so I moved on to the third part and finished the course.
I booked a session again (a third time), and then I started to have the problems mentioned above.
Right now I am stuck figuring out why my second part (find wall) is not working on the real robot, and I just keep booking sessions and trying to troubleshoot.
What is confusing is that the real robot is not consistent:

  • In the last two sessions I could not get the robot functional. Both times I saw the robot sitting slightly on top of one of those sign base platforms; in one case I saw the wheels actually spinning, in the other I am not sure.
  • In one session my programs were just giving me errors, with nothing changed since the previous session.
  • In another session my service would just not get seen, meaning the main wall_follower node would just print ‘service not available, waiting again…’. This problem is the only one that happened to get fixed by restarting/reconnecting the connection to the real robot.

After encountering an issue I tried:

  • checking if the joystick is responsive (waiting around 30 s in case the camera feed is lagging)
  • disconnecting/reconnecting inside the rosject
  • restarting the whole rosject
  • rerunning the ros1_bridge using the commands provided in the Jupyter notebook (although I saw in the forums that the bridge is now run automatically)
  • checking if /cmd_vel messages are constantly being published (as I understand, on the real robot a /cmd_vel message is not a permanent order as it is in simulation)
  • checking the scan rate with ros2 topic hz /scan
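For the rate checks, a rough offline estimate can also be computed from logged message arrival times (pure Python, no ROS; `stamps` is a hypothetical list of arrival times in seconds, e.g. collected in a subscriber callback):

```python
def estimate_hz(stamps):
    """Estimate the average publish rate from a list of arrival times (seconds)."""
    if len(stamps) < 2:
        return 0.0
    span = stamps[-1] - stamps[0]
    # N messages define N-1 inter-arrival gaps.
    return (len(stamps) - 1) / span if span > 0 else 0.0

# Messages arriving every 0.1 s estimate to roughly 10 Hz.
```

This can help distinguish "the scan topic is slow/stalled" from "my control loop is misbehaving".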

Conclusion
It is starting to be very difficult to troubleshoot, and I am starting to question whether the problem is my code or the connection to the real robot.
I just want to finish this course. What kind of completion of the real robot project is needed to start making arrangements to present?

Here is my find_wall.py server code for context:

#!/usr/bin/env python3
from wall_follower_interfaces.srv import FindWall
from geometry_msgs.msg import Twist
import rclpy
from rclpy.node import Node
from rclpy.executors import MultiThreadedExecutor
from rclpy.callback_groups import ReentrantCallbackGroup
from sensor_msgs.msg import LaserScan
from rclpy.qos import ReliabilityPolicy, QoSProfile
import time


class Service(Node):

    def __init__(self):
        super().__init__('find_wall_service')
        self.group = ReentrantCallbackGroup()
        self.srv = self.create_service(FindWall, 'find_wall', self.service_callback, callback_group=self.group)
        self.publisher_ = self.create_publisher(Twist, 'cmd_vel', 10)
        self.subscriber = self.create_subscription(
        LaserScan, 
        '/scan', 
        self.laser_callback, 
        QoSProfile(depth=10, reliability=ReliabilityPolicy.RELIABLE),
        callback_group=self.group)

        self.smallestDist = 0
        self.smallestDistIndex = 0
        self.alignment_complete = False
        self.service_response = None
        self.laser_source = "None"
        self.is_first_call = True
    
    def laser_callback(self,msg):
        self.laser_all = msg.ranges

        if self.laser_source == "None":
            length_of_ranges = len(msg.ranges)
            if length_of_ranges <500:
                self.laser_source = "Sim"
                self.get_logger().info('Using SIMULATION Laser values')
            else:
                self.laser_source = "Real"
                self.get_logger().info('Using REAL Laser values')
                #self.timer = self.create_timer(1.0, self.show_laser_range, callback_group=self.group)

        
        # Ignore NaN/Inf readings (d != d is True only for NaN) so the
        # minimum is not corrupted by invalid rays on the real lidar.
        valid = [(d, i) for i, d in enumerate(msg.ranges)
                 if d == d and d != float('inf')]
        if valid:
            self.smallestDist, self.smallestDistIndex = min(valid)

    def show_laser_range(self):
        #self.get_logger().info('Laser 0 {0} '.format(self.laser_all[0]))
        #self.get_logger().info('Laser 180 {0} '.format(self.laser_all[180]))
        #self.get_logger().info('Laser 360 {0} '.format(self.laser_all[360]))
        #self.get_logger().info('Laser 540 {0} '.format(self.laser_all[540]))
        #self.get_logger().info('Laser 719 {0} '.format(self.laser_all[719]))
        self.get_logger().info('MODE: %s Smallest Distance: %.2f at index: %d' % (self.mode, self.smallestDist, self.smallestDistIndex))

    def service_callback(self, request, response):
        self.get_logger().info('Service callback started')
        if self.is_first_call:
            self.is_first_call = False
            self.get_logger().info('Fake workaround for cancelling the first callback')
            response.wallfound = False
            return response
        
        self.alignment_complete = False
        self.mode = "FACE_FORWARD"
        self.get_logger().info('MODE: %s'%(self.mode))

        msg = Twist()
        driveSpeed = 0.1
        targetDistance = 0.3
        targetDistanceAllowedError = 0.02
        if self.laser_source == "Sim":
            turnSpeed = 0.1
            laser_forward_index = range(355,363)#359
            laser_right_index = range(266,274)#270
            laser_backward_index = 180 #range(176,184)
            laser_left_index = 90
        elif self.laser_source == "Real":
            turnSpeed = -0.1
            laser_forward_index = range(355,363)
            laser_right_index = range(176,184)
            laser_backward_index = 360#719
            laser_left_index = 0#540
        #self.get_logger().info('Just before while loop')
        while not self.alignment_complete:
            #self.get_logger().info('Inside while')
            #time.sleep(0.1)
            if self.mode=="FACE_FORWARD":
                if self.smallestDistIndex not in laser_forward_index:
                    if self.smallestDistIndex >=laser_backward_index:
                        msg.angular.z = -turnSpeed
                        #self.get_logger().info('Turning right')
                    else:
                        msg.angular.z = turnSpeed
                        #self.get_logger().info('Turning left')
                else: 
                    msg.angular.z = 0.0
                    self.get_logger().info('Stopping turning, changing to distance align')
                    self.mode = "DISTANCE_ALIGN"
                    self.get_logger().info('MODE: %s'%(self.mode))

            elif self.mode=="DISTANCE_ALIGN":
                if abs(self.smallestDist - targetDistance)>targetDistanceAllowedError:
                    if self.smallestDist>targetDistance:
                        msg.linear.x = driveSpeed
                        #self.get_logger().info('Moving closer')
                    else:
                        msg.linear.x = -driveSpeed
                        #self.get_logger().info('Moving away')
                else:
                    msg.linear.x = 0.0
                    self.get_logger().info('Stopping distance align, changing to rotate right align')
                    self.mode="FACE_RIGHT"
                    self.get_logger().info('MODE: %s'%(self.mode))
                    

            elif self.mode=="FACE_RIGHT":
                if self.smallestDistIndex not in laser_right_index:
                    if self.laser_source == "Sim":
                        msg.angular.z = turnSpeed
                    else:
                        msg.angular.z = -turnSpeed
                    #self.get_logger().info('Turning left')
                    #if self.smallestDistIndex < laser_right_index and self.smallestDistIndex > laser_left_index :
                    if self.smallestDistIndex < min(laser_right_index) and self.smallestDistIndex > laser_left_index :
                        if self.laser_source == "Sim":
                            msg.angular.z = -turnSpeed
                        else:
                            msg.angular.z = turnSpeed
                        #self.get_logger().info('Turning right')
                        
                else: 
                    msg.angular.z = 0.0
                    self.get_logger().info('Aligned, stopping')
                    self.mode = "ALIGNED"
                    self.get_logger().info('MODE: %s'%(self.mode))
            elif self.mode=="ALIGNED":
                self.alignment_complete = True
                self.get_logger().info('Aligned_complete set to True')
                #self.timer.cancel()
            else:
                self.mode = "ERROR"
                self.get_logger().info('Error inside findWall main logic')
                #self.timer.cancel()


            
            self.publisher_.publish(msg)
            #self.get_logger().info('MODE: %s Smallest Disntance %.2f at index: %d' % (self.mode,self.smallestDist, self.smallestDistIndex))
    
            #self.get_logger().info('Inside while after callback')
        #self.get_logger().info('just outside while')
        response.wallfound = True
        self.get_logger().info('Server is sending response')
        return response
        
        

def main(args=None):
    rclpy.init(args=args)
    try:
        moving_service = Service()
        executor = MultiThreadedExecutor(num_threads=3)
        executor.add_node(moving_service)
        try:
            executor.spin()
        finally:
            executor.shutdown()
            moving_service.destroy_node()
    finally:
        rclpy.shutdown()


if __name__ == '__main__':
    main()
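One likely source of the sim/real divergence in the hard-coded indices above is that the two lidars differ in ray count and zero-angle orientation. Instead of hard-coding 359 vs 360, the index for a given bearing can be derived from the LaserScan metadata (`angle_min`, `angle_increment`, `len(ranges)`). A sketch, with the scan parameters below being assumptions for illustration, not the actual values either robot reports:

```python
import math

def index_for_angle(angle, angle_min, angle_increment, n_ranges):
    """Map a bearing (radians, in the scan frame) to the nearest ranges[] index,
    clamped to the valid index range."""
    idx = round((angle - angle_min) / angle_increment)
    return max(0, min(n_ranges - 1, idx))

# Hypothetical 360-ray scan covering -pi..pi:
n = 360
a_min = -math.pi
a_inc = 2 * math.pi / n
front = index_for_angle(0.0, a_min, a_inc, n)           # 180 for this layout
right = index_for_angle(-math.pi / 2, a_min, a_inc, n)  # 90 for this layout
```

Computing the indices once in `laser_callback` from the first message would make the same code work for both lidars without the Sim/Real branch.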

In order to present and get your certificate, you need to have the real robot do the same thing as in the simulation. You can check out the dozens of students that have done this:

https://www.youtube.com/@becomingarosdeveloper6394/streams

  • Yes, there will be changes from the simulation to the real robot, like the lidar orientation you mentioned. That is intentional. If you write robust code, it should not matter much.

Now commenting on your experience. I am sorry you feel confused. Here are some pointers to understand the system better:

  • There needs to be an understanding whether the issue is with the data that comes from the robot, the data that is sent to the robot, or the programs you are running in your rosject.
  • You mention that the first part was working fine. That means that the other two parts will work fine. The information/commands to/from the robot are the same for all parts. That leads me to believe the issue is with the programs you are writing. Usually the frequency at which you spin the nodes makes a difference, and you have to make sure you are not using single data points that will always work in simulation but not in the real robot (i.e. looking at one lidar range point vs an array)
  • The joystick is not an indication of the robot working or not working. It is a tool we developed to help you move the robot more easily (you can also move the robot through the terminal using teleop_twist_keyboard). If it’s not there, we are probably working on improving it, as well as other systems, like the ros1_bridge. Users used to have to run the bridge for ROS 2 every time, but we have now upgraded it so the data automatically comes through in ROS 2, so no bridge is needed.
  • What do you mean by “my programs were just giving me error”? Can you share a terminal output of those errors?
  • Your service not being seen is not an issue with the real robot, since it is contained in your rosject. Make sure you can see the server actually running that service.
  • Again, you don’t have to run the ros1_bridge anymore. The fact that you are trying to rerun it means there may be errors on that side that affect the whole system. You can get a new, updated rosject that removes the bridge instructions by re-forking the project rosject, as this is a recent change.
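On the single-data-point issue mentioned above: one common pattern is to take a filtered minimum over a small window of rays around the direction of interest, discarding NaN/Inf readings, so that a single noisy ray cannot flip the state machine. A generic sketch (not the course's reference solution):

```python
def window_min(ranges, center, half_width=4, default=float('inf')):
    """Smallest valid reading within +/- half_width rays of `center`.

    NaN and Inf readings (common on real lidars) are ignored; `default`
    is returned if every ray in the window is invalid.
    """
    lo = max(0, center - half_width)
    hi = min(len(ranges), center + half_width + 1)
    valid = [d for d in ranges[lo:hi]
             if d == d and d != float('inf')]  # d == d filters out NaN
    return min(valid) if valid else default

# A noisy scan: the NaN and Inf inside the window are skipped.
scan = [1.0, float('nan'), 0.4, float('inf'), 0.6]
print(window_min(scan, center=2, half_width=2))  # prints 0.4
```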

And finally, I have trouble understanding what your issue is. Could you try and summarize it in a couple of bullet points?

Thank you for your response.

I will try to be as clear as my knowledge allows. After this post I had a session with a real robot, and my new experience illustrates the problem more clearly.
The same program can work and not work on the robot.

  • Since this post I have tried once more on a real robot and my wall_following behavior was working flawlessly. I checked multiple times.
  • This is the same unchanged code that was not working in multiple previous sessions.
  • I ran only the server and sent calls from the terminal. I then checked the whole combined program, and to my surprise it also worked without any technical issues: it found the wall, started following the wall, and recorded odometry.
  • I had not even had the chance to test the odometry part before, but it was working fine. The only minor issue was that the distance to the starting point never reached 0.05 m, so I decided to adjust the laser index inside the wall_following program (the right-side index was still 270, so I changed it to 200). I thought I would make this small adjustment and be done.
  • I made the adjustment, built, sourced, launched, and then we were back in the “laggy” state I am trying to describe.
  • I thought, OK, I will try the isolated find_wall service that was not changed in any way and was confirmed working flawlessly multiple times. Now it is not working; it is in this “laggy” state.
  • The “laggy” state in the context of my program means: when at the start of the program the robot is receiving Twist messages with angular.z = 0.1, linear.x = 0.0 and is physically spinning, then if it starts receiving angular.z = 0.0, linear.x = 0.1 it keeps physically spinning for an undetermined amount of time. That time can vary from tens of seconds to minutes or even longer; usually my patience runs out and I stop the program after 5 minutes of spinning, while the whole program has already finished.

I can see you have summarized my problem perfectly.

With the experience I have described, I am confused about what I am doing wrong. How can my program work when it was not working in another session? Can I break something by building incorrectly? What am I not understanding? Can you give me a hint?
These are the steps I took, in as much detail as I can remember:

  1. I launch the rosject and enable the real robot connection button.
  2. After the video feed has loaded, I use the joystick to move the robot a little to confirm it is responsive and the video feed is not drastically lagging behind the joystick commands. All is well; it is responsive.
  3. I launch my find_wall server, meaning I open a terminal and run only this command: ros2 launch wall_follower start_wall_finder.launch.py
  4. I open a new terminal and send a service call with this command: ros2 service call /find_wall wall_follower_interfaces/srv/FindWall "{}"
  5. The robot is working, to my surprise.
  6. I launch another terminal and type: rviz2
  7. I configure it to show TF data and can see the wheels turning smoothly and correctly.
  8. I call the server again; the program and robot work as expected.
  9. I then stop the server with Ctrl+C and type: ros2 launch wall_follower main.launch.py
  10. I see my whole program working as expected. The only issue is that the robot is a little too close to the wall, since the laser index for the right side is still 270, and I see that the distance to the origin is not reaching the minimum allowed 0.05 m, so I stop the robot.
  11. I run the main program again; it works the same: finds the wall, follows it, and reaches the origin point with about 0.06 m precision.
  12. I open the VS Code alternative and edit this single number inside the wall_follower program.
  13. I go to a terminal and type: cd ros2_ws, then colcon build --packages-select wall_follower, and then source install/setup.bash
  14. Then I launch the main program (ros2 launch wall_follower main.launch.py) and it is “laggy”.
  15. I can't believe it. I repeat steps 2, 3, 4; still not working.
  16. I try restarting the connection, the browser, and the rosject; still the same behavior.

I understand, and I was not using the bridge commands in this new test.

The earlier errors were highly likely due to me using the bridge commands, and I believe that is not as relevant to the main issue.

Can you elaborate on or point me to resources regarding node spinning frequency? Regarding single points versus arrays, I understand, and I have implemented it as such in the service.

Thank you for your time and patience.

I have tested once more and managed to again see my whole project working fine multiple times. I can't say for sure what all the issues were, but I want to try to present, and I have sent an email with the request.
My hypothesis for the solution is the real robot being “choked” with a ton of /cmd_vel messages: the program would send so many messages that by the time the program logic had changed, for example, from a spinning movement to forward, the actual robot was still processing spinning commands, hence the “laggy” behavior. I added a delay of 0.1 seconds when publishing Twist messages, and that seemed to make it function normally. I have no idea if that is how things work here, but that is the best I could come up with given my limited knowledge.
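The 0.1 s delay effectively turns the busy while loop into a roughly 10 Hz control loop; without it, the loop publishes /cmd_vel as fast as Python can run, which can back up queues over a remote connection. A generic throttle makes that rate explicit. This is a pure-Python sketch (the `clock` parameter is injectable so it can be tested without sleeping), not anything the course provides:

```python
import time

class Throttle:
    """Allow an action at most once per `period` seconds."""

    def __init__(self, period, clock=time.monotonic):
        self.period = period
        self.clock = clock
        self._last = None  # time of the last allowed action

    def ready(self):
        """Return True (and record the time) if enough time has passed."""
        now = self.clock()
        if self._last is None or now - self._last >= self.period:
            self._last = now
            return True
        return False

# In the control loop, one would publish only when throttle.ready() is True,
# instead of publishing on every iteration of the while loop.
```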

Since I am not sure my solution will work in future sessions, I made a video proof just in case. Either way, I hope everything works out when the time comes.