<?xml version="1.0" encoding="utf-8"?>

<feed xmlns="http://www.w3.org/2005/Atom" >
  <generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator>
  <link href="https://zhul49.github.io/feed.xml" rel="self" type="application/atom+xml" />
  <link href="https://zhul49.github.io/" rel="alternate" type="text/html" />
  <updated>2026-03-20T02:31:22+00:00</updated>
  <id>https://zhul49.github.io/</id>

  
    <title type="html">Robert Zhu</title>
  

  
    <subtitle>Robotics + software engineer focused on perception and motion planning. I build end-to-end robot manipulation systems with ROS 2, MoveIt 2, and computer vision.</subtitle>
  

  

  
  
    <entry>
      <title type="html">Hand2Rob</title>
      <link href="https://zhul49.github.io/2026/01/19/hand2rob/" rel="alternate" type="text/html" title="Hand2Rob" />
      <published>2026-01-19T00:00:00+00:00</published>
      <updated>2026-01-19T00:00:00+00:00</updated>
      <id>https://zhul49.github.io/2026/01/19/hand2rob</id>
      <content type="html" xml:base="https://zhul49.github.io/2026/01/19/hand2rob/">&lt;h2 id=&quot;overview&quot;&gt;Overview&lt;/h2&gt;

&lt;p&gt;Hand2Rob teaches a Franka robot to grasp delicate objects by watching human hand demonstrations captured with a stereo camera pair. MediaPipe and CoTracker extract hand and object keypoints that get triangulated into 3D trajectories the robot can follow directly. The catch is that spatial imitation alone isn’t enough for fragile things, so I integrated a ResKin tactile sensor into custom gripper fingertips to give the robot a sense of force.&lt;/p&gt;

&lt;p&gt;This project builds on Point Policy and Feel The Force by Siddhant Haldar and Lerrel Pinto. Big thanks to them for the original work. I adapted and extended both systems with my own changes to get them running on the Franka Panda, including modifications to the trajectory execution, force control integration, data collection, and the custom gripper hardware.&lt;/p&gt;

&lt;h2 id=&quot;data-collection-and-annotation&quot;&gt;Data Collection and Annotation&lt;/h2&gt;

&lt;div class=&quot;clearfix&quot;&gt;
&lt;video class=&quot;float-right&quot; autoplay=&quot;&quot; loop=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot;&gt;
  &lt;source src=&quot;https://zhul49.github.io/img/portfolio/Mediapipe_Pingpong.mp4&quot; type=&quot;video/mp4&quot; /&gt;
&lt;/video&gt;

I collect demonstrations by having a human perform the task in front of two calibrated cameras. MediaPipe tracks the hand in real time, extracting semantic keypoints on the fingers and palm that serve as the basis for the robot&apos;s trajectory. I also annotate object keypoints using CoTracker, giving the model a consistent spatial representation of both the hand and the target object across frames. A ResKin tactile sensor mounted on my thumb records contact forces during grasping trials, providing the force labels used for training.
&lt;/div&gt;
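&lt;p&gt;As a rough sketch of the per-frame extraction step: MediaPipe returns hand landmarks normalized to the image, which need to be converted to pixel coordinates before stereo matching. The landmark indices follow MediaPipe’s 21-point hand model; the function names and everything else here are illustrative, not the exact pipeline code.&lt;/p&gt;

```python
import numpy as np

# MediaPipe Hands landmark indices used as semantic keypoints
WRIST, THUMB_TIP, INDEX_TIP = 0, 4, 8

def landmarks_to_pixels(landmarks, width, height):
    """Convert normalized MediaPipe landmarks [(x, y, z), ...] with x, y
    in [0, 1] image coordinates into integer pixel coordinates."""
    pts = np.asarray(landmarks, dtype=float)
    pix = np.stack([pts[:, 0] * width, pts[:, 1] * height], axis=1)
    return np.round(pix).astype(int)
```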

&lt;h2 id=&quot;translating-points-to-franka-end-points&quot;&gt;Translating Points to Franka End Points&lt;/h2&gt;

&lt;div class=&quot;clearfix&quot;&gt;
&lt;video class=&quot;float-left&quot; autoplay=&quot;&quot; loop=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot;&gt;
  &lt;source src=&quot;https://zhul49.github.io/img/portfolio/handtorobot.mp4&quot; type=&quot;video/mp4&quot; /&gt;
&lt;/video&gt;

The core challenge is mapping a human hand demonstration onto a robot gripper that has far fewer degrees of freedom. The tracked 2D keypoints are triangulated across both camera views to recover 3D positions, and a transformation aligns the human hand&apos;s keypoint cloud to the robot&apos;s end-effector frame. This produces a full 6-DoF pose trajectory and a gripper open/close signal derived from the distance between the thumb and index finger. The result is a set of robot-executable actions that reproduce the intent of the original human demonstration.
&lt;/div&gt;
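&lt;p&gt;The triangulation step can be sketched with a standard linear (DLT) solve, and the open/close signal reduces to thresholding the 3D thumb-index distance. This is a minimal sketch under those assumptions; the 0.04 m closing threshold is a made-up placeholder, not the value used on the robot.&lt;/p&gt;

```python
import numpy as np

def triangulate(P1, P2, uv1, uv2):
    """Linear (DLT) triangulation of one keypoint seen in two calibrated
    views. P1, P2 are 3x4 projection matrices; uv1, uv2 pixel coords."""
    A = np.stack([
        uv1[0] * P1[2] - P1[0],
        uv1[1] * P1[2] - P1[1],
        uv2[0] * P2[2] - P2[0],
        uv2[1] * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]  # homogeneous to Euclidean

def gripper_command(thumb_xyz, index_xyz, close_below=0.04):
    """Binary open/close signal from the 3D thumb-index distance.
    The 0.04 m threshold is an illustrative placeholder."""
    dist = np.linalg.norm(np.asarray(thumb_xyz) - np.asarray(index_xyz))
    return "open" if dist >= close_below else "close"
```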

&lt;h2 id=&quot;robot-execution&quot;&gt;Robot Execution&lt;/h2&gt;

&lt;div style=&quot;display: flex; gap: 16px; margin: 20px auto; flex-wrap: nowrap; align-items: flex-start; max-width: 560px;&quot;&gt;
  &lt;div style=&quot;flex: 1; text-align: center;&quot;&gt;
    &lt;video class=&quot;img-responsive&quot; autoplay=&quot;&quot; loop=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; style=&quot;border-radius: 8px; box-shadow: 0 4px 14px rgba(0,0,0,0.12);&quot;&gt;
      &lt;source src=&quot;https://zhul49.github.io/img/portfolio/Egg_cracked.mp4&quot; type=&quot;video/mp4&quot; /&gt;
    &lt;/video&gt;
    &lt;p style=&quot;margin-top: 8px; font-size: 0.9em; color: #666;&quot;&gt;Without force feedback&lt;/p&gt;
  &lt;/div&gt;
  &lt;div style=&quot;flex: 1; text-align: center;&quot;&gt;
    &lt;video class=&quot;img-responsive&quot; autoplay=&quot;&quot; loop=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; style=&quot;border-radius: 8px; box-shadow: 0 4px 14px rgba(0,0,0,0.12);&quot;&gt;
      &lt;source src=&quot;https://zhul49.github.io/img/portfolio/Egg_saved.mp4&quot; type=&quot;video/mp4&quot; /&gt;
    &lt;/video&gt;
    &lt;p style=&quot;margin-top: 8px; font-size: 0.9em; color: #666;&quot;&gt;With ResKin force control&lt;/p&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;Without force feedback, the robot has no way to modulate its grip strength: it simply closes the gripper until the binary close command is fully executed. For fragile objects like eggs, that uncontrolled force is enough to break them. Here the robot successfully reaches and grasps the egg using the learned trajectory, but the unmodulated grip crushes it. This failure motivates closed-loop force control during the grasp phase.&lt;/p&gt;

&lt;p&gt;With the ResKin sensor in the loop, the robot can feel how much force it is applying and stop closing once a target threshold is reached. The force controller reads the live tactile signal and adjusts the gripper incrementally, holding the egg securely without exceeding the pressure that would crack it. This demonstrates that combining learned spatial policies with real-time tactile feedback enables safe manipulation of objects that would otherwise be damaged.&lt;/p&gt;
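&lt;p&gt;The incremental closing loop can be sketched as below. Here &lt;code&gt;read_force&lt;/code&gt; and &lt;code&gt;step_gripper&lt;/code&gt; are hypothetical stand-ins for the ResKin driver and the gripper interface, and the 1.5 N target, step size, and widths are illustrative numbers, not the tuned values.&lt;/p&gt;

```python
def force_controlled_close(read_force, step_gripper, target_force=1.5,
                           step=0.002, min_width=0.0, start_width=0.08):
    """Close the gripper in small width increments until the tactile
    reading reaches target_force (newtons). read_force and step_gripper
    are hypothetical stand-ins for the ResKin driver and the gripper
    interface; all numeric values are illustrative."""
    width = start_width
    while width > min_width:
        if read_force() >= target_force:
            break  # grip is firm enough, stop closing
        width -= step
        step_gripper(width)
    return width
```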

&lt;div style=&quot;display: flex; gap: 4px; margin: 20px auto; flex-wrap: nowrap; align-items: flex-start; max-width: 560px;&quot;&gt;
  &lt;div style=&quot;flex: 1;&quot;&gt;
    &lt;video class=&quot;img-responsive&quot; autoplay=&quot;&quot; loop=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; style=&quot;border-radius: 8px; box-shadow: 0 4px 14px rgba(0,0,0,0.12);&quot;&gt;
      &lt;source src=&quot;https://zhul49.github.io/img/portfolio/egg1.mp4&quot; type=&quot;video/mp4&quot; /&gt;
    &lt;/video&gt;
  &lt;/div&gt;
  &lt;div style=&quot;flex: 1;&quot;&gt;
    &lt;video class=&quot;img-responsive&quot; autoplay=&quot;&quot; loop=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; style=&quot;border-radius: 8px; box-shadow: 0 4px 14px rgba(0,0,0,0.12);&quot;&gt;
      &lt;source src=&quot;https://zhul49.github.io/img/portfolio/egg2.mp4&quot; type=&quot;video/mp4&quot; /&gt;
    &lt;/video&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;Above are two evaluation runs showing the full pipeline end-to-end: the robot approaches the egg, descends to the grasp position, and closes with a force-controlled grip. Each camera view is shown separately, with the live force reading overlaid in blue. The consistency across runs shows that the learned policy transfers reliably from the human demonstrations.&lt;/p&gt;

&lt;h2 id=&quot;system-architecture&quot;&gt;System Architecture&lt;/h2&gt;

&lt;div style=&quot;text-align: center; margin: 20px 0;&quot;&gt;
  &lt;img src=&quot;https://zhul49.github.io/img/portfolio/hand2rob_architecture(1).png&quot; class=&quot;img-responsive&quot; style=&quot;margin: 0 auto; max-width: 600px; border-radius: 8px; box-shadow: 0 4px 14px rgba(0,0,0,0.10);&quot; alt=&quot;Hand2Rob system architecture&quot; /&gt;
  &lt;p style=&quot;margin-top: 10px; font-size: 0.9em; color: #666;&quot;&gt;&lt;strong&gt;Figure 1.&lt;/strong&gt; Overview of the Hand2Rob pipeline. Stereo camera footage and MediaPipe hand tracking are processed through CoTracker and stereo triangulation to build the dataset. A policy is then trained to predict both the end-effector trajectory and the grasp force, and is deployed on the Franka robot with ResKin tactile feedback for manipulating fragile objects.&lt;/p&gt;
&lt;/div&gt;

&lt;h2 id=&quot;cad-models&quot;&gt;CAD Models&lt;/h2&gt;

&lt;div style=&quot;display: flex; gap: 24px; margin: 20px 0; flex-wrap: wrap; align-items: flex-start; justify-content: center;&quot;&gt;
  &lt;div style=&quot;flex: 0 1 260px; text-align: center;&quot;&gt;
    &lt;img src=&quot;https://zhul49.github.io/img/portfolio/gripper_without_sensor.png&quot; class=&quot;img-responsive&quot; style=&quot;margin: 0 auto; border-radius: 8px; box-shadow: 0 4px 14px rgba(0,0,0,0.10);&quot; alt=&quot;Gripper without sensor&quot; /&gt;
    &lt;p style=&quot;margin-top: 8px; font-size: 0.9em; color: #666;&quot;&gt;Gripper without sensor&lt;/p&gt;
  &lt;/div&gt;
  &lt;div style=&quot;flex: 0 1 260px; text-align: center;&quot;&gt;
    &lt;img src=&quot;https://zhul49.github.io/img/portfolio/gripper_with_sensor.png&quot; class=&quot;img-responsive&quot; style=&quot;margin: 0 auto; border-radius: 8px; box-shadow: 0 4px 14px rgba(0,0,0,0.10);&quot; alt=&quot;Gripper with sensor&quot; /&gt;
    &lt;p style=&quot;margin-top: 8px; font-size: 0.9em; color: #666;&quot;&gt;Gripper with ResKin tactile sensor&lt;/p&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;Special thanks to Miguel Pegues for helping me design custom gripper fingertips to house the ResKin magnetometer-based tactile sensor. The left model shows the standard Franka gripper fingers, while the right model integrates a recessed pocket that secures the ResKin sensing pad flush against the contact surface.&lt;/p&gt;</content>

      
      
      
      
      

      <author>
          <name></name>
        
        
      </author>

      

      
        <category term="Python" />
      
        <category term="Imitation Learning" />
      
        <category term="Tactile Sensing" />
      
        <category term="Computer Vision" />
      
        <category term="PyTorch" />
      
        <category term="LibFranka" />
      

      
        <summary type="html">Overview</summary>
      

      
      
    </entry>
  
    <entry>
      <title type="html">Vision-Guided Pen Grasping</title>
      <link href="https://zhul49.github.io/2025/12/18/pengrasp/" rel="alternate" type="text/html" title="Vision-Guided Pen Grasping" />
      <published>2025-12-18T00:00:00+00:00</published>
      <updated>2025-12-18T00:00:00+00:00</updated>
      <id>https://zhul49.github.io/2025/12/18/pengrasp</id>
      <content type="html" xml:base="https://zhul49.github.io/2025/12/18/pengrasp/">&lt;h2 id=&quot;overview&quot;&gt;Overview&lt;/h2&gt;
&lt;p&gt;This project implements a vision-guided grasping system for a robotic arm that autonomously detects and grasps a pen using an RGB-D camera. The objective was to build a complete perception-to-action pipeline that converts raw camera data into executable robot motion, enabling the robot to locate and grasp a pen without manual alignment or human intervention.&lt;/p&gt;

&lt;p&gt;The task was intentionally constrained to a known object type and workspace, allowing the system to emphasize robustness, accuracy, and correct geometric reasoning rather than general-purpose object recognition.&lt;/p&gt;

&lt;h2 id=&quot;perception-and-localization&quot;&gt;Perception and Localization&lt;/h2&gt;
&lt;p&gt;Perception was implemented using classical computer vision techniques. Since the target object was a &lt;strong&gt;purple pen&lt;/strong&gt;, the RGB image was converted to the HSV color space and color thresholding was applied to segment purple regions from the background. Depth data from the RealSense camera was used to remove background pixels beyond a fixed range, improving robustness under clutter and lighting variation.&lt;/p&gt;
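&lt;p&gt;The segmentation step might look like the following NumPy-only sketch, combining an HSV threshold with a depth gate. The hue range, saturation/value minimums, and depth cutoff are placeholder values (hue in OpenCV’s [0, 179] convention), not the calibrated thresholds.&lt;/p&gt;

```python
import numpy as np

def segment_pen(hsv, depth, hue_range=(120, 160), s_min=80, v_min=50,
                max_depth_m=1.0):
    """Binary mask of purple pixels within reach. hsv is an HxWx3 array
    in OpenCV HSV convention (hue in [0, 179]); depth is HxW in meters.
    All threshold values here are illustrative placeholders."""
    h, s, v = hsv[..., 0], hsv[..., 1], hsv[..., 2]
    color_ok = np.logical_and.reduce(
        [h >= hue_range[0], hue_range[1] >= h, s >= s_min, v >= v_min])
    # drop background pixels beyond a fixed range (and invalid zero depth)
    depth_ok = np.logical_and(depth > 0, max_depth_m > depth)
    return np.logical_and(color_ok, depth_ok)
```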

&lt;p&gt;Contours were extracted from the segmented mask, and the most relevant contour was selected based on geometric properties. From this contour, the pen’s image-space centroid and orientation were estimated. The centroid pixel was aligned with the depth image, and the corresponding depth value was used to deproject the pixel into a 3D point in the camera coordinate frame using the camera intrinsics. To reduce sensor noise, multiple measurements were collected over a short time window and averaged.&lt;/p&gt;
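&lt;p&gt;The deprojection and averaging steps reduce to the standard pinhole model, sketched below. Function names and the intrinsics are illustrative; the real system reads the intrinsics from the RealSense driver.&lt;/p&gt;

```python
import numpy as np

def deproject(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) at a measured depth into a 3D point in
    the camera frame using the standard pinhole model."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])

def averaged_point(points_3d):
    """Average several deprojected 3D points collected over a short
    window to suppress depth sensor noise."""
    return np.mean(np.stack(points_3d), axis=0)
```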

&lt;!-- ![Pen detection overlay](/img/portfolio/cabin.png) --&gt;

&lt;h2 id=&quot;coordinate-transformation-and-grasp-execution&quot;&gt;Coordinate Transformation and Grasp Execution&lt;/h2&gt;
&lt;p&gt;The 3D pen position estimated in the camera frame was transformed into the robot base frame using a precomputed camera-to-robot extrinsic calibration. This calibration was represented as a rigid-body transform consisting of a rotation matrix (R) and translation vector (t), applied directly as:&lt;/p&gt;

&lt;p style=&quot;text-align:center; font-size: 1.15em;&quot;&gt;
  P&lt;sub&gt;robot&lt;/sub&gt; = R · P&lt;sub&gt;camera&lt;/sub&gt; + t
&lt;/p&gt;

&lt;p&gt;A small tool offset was then added to account for the physical geometry of the gripper.&lt;/p&gt;
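&lt;p&gt;Applying the calibration and tool offset is a one-liner in practice; a minimal sketch, with the function name and offset handling being mine:&lt;/p&gt;

```python
import numpy as np

def camera_to_robot(p_cam, R, t, tool_offset=None):
    """Map a 3D point from the camera frame into the robot base frame
    using the extrinsic calibration (R, t): p_robot = R @ p_cam + t,
    then add the gripper tool offset if one is given."""
    p = R @ np.asarray(p_cam) + np.asarray(t)
    if tool_offset is not None:
        p = p + np.asarray(tool_offset)
    return p
```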

&lt;p&gt;Once the target position was expressed in the robot frame, the &lt;strong&gt;PincherX 100&lt;/strong&gt; arm was controlled using direct API commands. The robot moved to a hover pose above the pen, descended to the grasp location, closed the gripper, lifted the object to verify a successful grasp, and returned to a safe pose. This demonstrated a complete vision-driven manipulation pipeline operating under real sensor noise and hardware constraints.&lt;/p&gt;</content>

      
      
      
      
      

      <author>
          <name></name>
        
        
      </author>

      

      
        <category term="OpenCV" />
      
        <category term="Camera-to-Robot TF" />
      
        <category term="Interbotix PX100" />
      
        <category term="Python" />
      

      
        <summary type="html">Overview This project implements a vision-guided grasping system for a robotic arm that autonomously detects and grasps a pen using an RGB-D camera. The objective was to build a complete perception-to-action pipeline that converts raw camera data into executable robot motion, enabling the robot to locate and grasp a pen without manual alignment or human intervention.</summary>
      

      
      
    </entry>
  
    <entry>
      <title type="html">Autonomous Pick-and-Place</title>
      <link href="https://zhul49.github.io/2025/12/18/omnipplace/" rel="alternate" type="text/html" title="Autonomous&lt;br&gt;Pick-and-Place" />
      <published>2025-12-18T00:00:00+00:00</published>
      <updated>2025-12-18T00:00:00+00:00</updated>
      <id>https://zhul49.github.io/2025/12/18/omnipplace</id>
      <content type="html" xml:base="https://zhul49.github.io/2025/12/18/omnipplace/">&lt;h2 id=&quot;overview&quot;&gt;Overview&lt;/h2&gt;
&lt;p&gt;OmniPlace is an autonomous pick and place system built on &lt;strong&gt;ROS 2&lt;/strong&gt; for a &lt;strong&gt;Franka Panda&lt;/strong&gt; with an &lt;strong&gt;Intel RealSense&lt;/strong&gt; camera. The robot scans a tabletop, detects both &lt;strong&gt;objects&lt;/strong&gt; (squares, rectangles, cylinders) and &lt;strong&gt;targets&lt;/strong&gt; (their flat cross sections), matches each object to the correct target, and executes pick and place until no targets remain.&lt;/p&gt;

&lt;h2 id=&quot;system-diagrams&quot;&gt;System diagrams&lt;/h2&gt;

&lt;div style=&quot;display: flex; gap: 24px; margin: 20px 0; flex-wrap: wrap; align-items: flex-start;&quot;&gt;
  &lt;div style=&quot;flex: 1; min-width: 260px;&quot;&gt;
    &lt;img src=&quot;/img/portfolio/omniplace_setup_diagram.png&quot; class=&quot;img-responsive&quot; alt=&quot;User setup diagram&quot; /&gt;
    &lt;p class=&quot;img-caption&quot;&gt;Hardware setup: objects, camera, and robot arm&lt;/p&gt;
  &lt;/div&gt;
  &lt;div style=&quot;flex: 1; min-width: 260px;&quot;&gt;
    &lt;img src=&quot;/img/portfolio/omniplace_subsystem.png&quot; class=&quot;img-responsive&quot; alt=&quot;Subsystem block diagram&quot; /&gt;
    &lt;p class=&quot;img-caption&quot;&gt;Software pipeline from sensing to planning and execution&lt;/p&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;perception-pipeline-yolo-obb-and-detection-stabilization&quot;&gt;Perception pipeline (YOLO-OBB and detection stabilization)&lt;/h2&gt;

&lt;p&gt;Perception is handled by a YOLO-based detector that outputs &lt;strong&gt;oriented bounding boxes&lt;/strong&gt;, providing both object position and in-plane rotation directly from the image. Rather than relying on single-frame detections, the system performs a structured scan of the workspace and aggregates detections across multiple frames.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/img/portfolio/omniplace_yolo.gif&quot; style=&quot;display:block; margin: 16px auto; max-width: 75%; border-radius: 8px; box-shadow: 0 4px 14px rgba(0,0,0,0.12);&quot; alt=&quot;YOLO OBB detection&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Detections are associated over time by comparing center location, shape class, bounding box dimensions, and orientation. Candidates that do not appear consistently across the scan window are discarded as noise. This temporal filtering produces a stable set of object and target poses that can be safely used for motion planning.&lt;/p&gt;
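&lt;p&gt;A simplified version of that temporal filtering is sketched below, matching only by shape class and center distance (the full system also compares box dimensions and orientation). The distance threshold and hit count are illustrative, and the data structures are mine.&lt;/p&gt;

```python
import numpy as np

def associate(tracks, detections, max_dist=0.03):
    """Greedily match detections to existing tracks by shape class and
    center distance; unmatched detections start new tracks. A detection
    is (class_id, center_xy). Thresholds are illustrative; the real
    system also compares box dimensions and orientation."""
    for cls, center in detections:
        center = np.asarray(center, dtype=float)
        best, best_d = None, max_dist
        for trk in tracks:
            if trk["cls"] != cls:
                continue
            d = np.linalg.norm(center - trk["center"])
            if best_d > d:
                best, best_d = trk, d
        if best is None:
            tracks.append({"cls": cls, "center": center, "hits": 1})
        else:
            best["center"] = 0.5 * (best["center"] + center)  # smooth center
            best["hits"] += 1
    return tracks

def stable(tracks, min_hits=3):
    """Keep only tracks detected consistently across the scan window."""
    return [t for t in tracks if t["hits"] >= min_hits]
```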

&lt;p&gt;To train the YOLO model, a &lt;strong&gt;Python-based synthetic data generation pipeline&lt;/strong&gt; was developed. The script programmatically generated large datasets of labeled images containing geometric objects with randomized pose, scale, orientation, lighting, and background variation. This approach made it possible to rapidly scale the training dataset without manual labeling and ensured strong coverage of object orientations required for reliable OBB prediction.&lt;/p&gt;

&lt;p&gt;Objects and targets are distinguished using a height-based heuristic. Targets are flat cross-sections placed on the table, while objects have nonzero height. This separation allows the system to independently build object and target sets and perform shape-based matching between them.&lt;/p&gt;

&lt;h2 id=&quot;motion-planning-and-task-execution&quot;&gt;Motion Planning and Task Execution&lt;/h2&gt;

&lt;p&gt;Motion planning is handled through a custom Python interface built on top of MoveIt 2, designed to simplify interaction with the robot while still exposing fine control when needed. Instead of calling MoveIt APIs directly throughout the codebase, the system is structured around a small set of modular wrappers that separate state queries, planning logic, and environment management.&lt;/p&gt;

&lt;p&gt;The motion planning interface provides utilities for querying the robot’s current state, generating collision-aware trajectories, and executing both Cartesian and joint-space motions. A dedicated planning scene manager dynamically adds and removes collision objects corresponding to detected items and targets, ensuring that planned motions remain valid as the workspace changes. All motion commands are funneled through a single high-level interface that exposes actions such as moving to poses, executing grasps, and controlling the gripper, keeping task logic clean and readable.&lt;/p&gt;

&lt;p&gt;Task execution is coordinated by a central control script that drives the full pick-and-place pipeline. When triggered, the robot first moves to a known home configuration and performs camera calibration using an ArUco marker to establish consistent transforms between the camera, marker, and robot base. The perception pipeline then scans the workspace, producing a stable set of object and target poses that are visualized in RViz and added to the planning scene.&lt;/p&gt;

&lt;p&gt;Objects are matched to targets based on shape and size, and the robot executes pick-and-place operations one pair at a time. After each attempt, the system re-scans the workspace rather than assuming a static scene. This design allows the robot to recover from failed grasps, handle objects being moved during execution, and remain robust to detection ordering changes. The task continues until no valid targets remain, at which point the robot safely returns to its home position.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/img/portfolio/tf_frames.png&quot; style=&quot;display:block; margin: 16px auto; max-width: 70%; border-radius: 8px; box-shadow: 0 4px 14px rgba(0,0,0,0.12);&quot; alt=&quot;TF Tree diagram&quot; /&gt;&lt;/p&gt;</content>

      
      
      
      
      

      <author>
          <name></name>
        
        
      </author>

      

      
        <category term="ROS 2" />
      
        <category term="MoveIt 2" />
      
        <category term="YOLO-OBB" />
      
        <category term="TF2 Frames" />
      
        <category term="Franka Panda" />
      
        <category term="3D Pose Estimation" />
      

      
        <summary type="html">Overview OmniPlace is an autonomous pick and place system built on ROS 2 for a Franka Panda with an Intel RealSense camera. The robot scans a tabletop, detects both objects (squares, rectangles, cylinders) and targets (their flat cross sections), matches each object to the correct target, and executes pick and place until no targets remain.</summary>
      

      
      
    </entry>
  
    <entry>
      <title type="html">Jack-in-the-Box Physics Simulation</title>
      <link href="https://zhul49.github.io/2025/12/18/jackinthebox/" rel="alternate" type="text/html" title="Jack-in-the-Box Physics Simulation" />
      <published>2025-12-18T00:00:00+00:00</published>
      <updated>2025-12-18T00:00:00+00:00</updated>
      <id>https://zhul49.github.io/2025/12/18/jackinthebox</id>
      <content type="html" xml:base="https://zhul49.github.io/2025/12/18/jackinthebox/">&lt;h2 id=&quot;project-summary&quot;&gt;Project Summary&lt;/h2&gt;
&lt;p&gt;This project simulates a jack-in-the-box mechanism to study the interaction between rigid-body motion and internal impacts. The system’s motion is computed from analytical dynamics, with collisions between the jack and the box resolved using impulse-based methods.&lt;/p&gt;

&lt;p&gt;The result is a compact simulation that highlights the role of coordinate frames, equations of motion, and contact dynamics in mechanical behavior.&lt;/p&gt;

&lt;h2 id=&quot;frames-dynamics-and-impacts&quot;&gt;Frames, Dynamics, and Impacts&lt;/h2&gt;
&lt;p&gt;The system is represented using multiple coordinate frames. A fixed world frame defines the inertial reference. A box frame moves and rotates with the enclosure. A jack frame rotates inside the box and defines the positions of the four tip masses. Points are mapped between frames using rigid-body transformations of the form&lt;/p&gt;

&lt;p style=&quot;text-align:center;&quot;&gt;
  p&lt;sub&gt;world&lt;/sub&gt; = R · p&lt;sub&gt;body&lt;/sub&gt; + t
&lt;/p&gt;

&lt;p&gt;Expressing the jack in the box frame simplifies contact detection, since the box walls are fixed in that frame.&lt;/p&gt;

&lt;p&gt;The motion of the box and jack is derived using the Euler–Lagrange formulation,&lt;/p&gt;

&lt;p style=&quot;text-align:center;&quot;&gt;
  d/dt(∂L/∂q̇) − ∂L/∂q = Q
&lt;/p&gt;

&lt;p&gt;which governs smooth translational and rotational motion between collisions. These equations are numerically integrated forward in time.&lt;/p&gt;

&lt;p&gt;When a jack tip contacts a wall, the collision is treated as an instantaneous event. Instead of integrating through contact, velocities are updated using an impulse-based formulation. The post-impact generalized velocities satisfy&lt;/p&gt;

&lt;p style=&quot;text-align:center;&quot;&gt;
  ∂L/∂q̇⁺ = ∂L/∂q̇⁻ + Jᵀλ
&lt;/p&gt;

&lt;p&gt;where (J) is the contact Jacobian and (λ) is the impulse magnitude. The impulse is solved by enforcing the contact constraint together with an energy consistency condition, resulting in realistic momentum transfer between the internal jack and the box. These repeated impacts produce rotation, bouncing, and jitter of the enclosure.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/img/portfolio/jackframes.png&quot; alt=&quot;Jack Frames&quot; /&gt;&lt;/p&gt;</content>

      
      
      
      
      

      <author>
          <name></name>
        
        
      </author>

      

      
        <category term="Euler–Lagrange" />
      
        <category term="Rigid-Body Dynamics" />
      
        <category term="Impulse Impacts" />
      
        <category term="Coordinate Frames" />
      

      
        <summary type="html">Project Summary This project simulates a jack-in-the-box mechanism to study the interaction between rigid-body motion and internal impacts. The system’s motion is computed from analytical dynamics, with collisions between the jack and the box resolved using impulse-based methods.</summary>
      

      
      
    </entry>
  
    <entry>
      <title type="html">EKF SLAM on Turtlebot3</title>
      <link href="https://zhul49.github.io/2025/01/18/navigation-slam-homework/" rel="alternate" type="text/html" title="EKF SLAM on Turtlebot3" />
      <published>2025-01-18T00:00:00+00:00</published>
      <updated>2025-01-18T00:00:00+00:00</updated>
      <id>https://zhul49.github.io/2025/01/18/navigation-slam-homework</id>
      <content type="html" xml:base="https://zhul49.github.io/2025/01/18/navigation-slam-homework/">&lt;h2 id=&quot;demo&quot;&gt;Demo&lt;/h2&gt;

&lt;div style=&quot;text-align: center; margin: 20px 0;&quot;&gt;
  &lt;video class=&quot;img-responsive img-centered&quot; autoplay=&quot;&quot; loop=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; style=&quot;border-radius: 8px; box-shadow: 0 4px 14px rgba(0,0,0,0.12);&quot;&gt;
    &lt;source src=&quot;https://zhul49.github.io/img/portfolio/Closed_loop_run.mp4&quot; type=&quot;video/mp4&quot; /&gt;
  &lt;/video&gt;
  &lt;p style=&quot;margin-top: 8px; font-size: 0.9em; color: #666;&quot;&gt;Red = ground truth &amp;nbsp;·&amp;nbsp; Blue = odometry &amp;nbsp;·&amp;nbsp; Green = SLAM estimate and landmark map&lt;/p&gt;
&lt;/div&gt;

&lt;h2 id=&quot;overview&quot;&gt;Overview&lt;/h2&gt;

&lt;p&gt;This project builds feature-based EKF SLAM on a TurtleBot3 from scratch. Five ROS 2 packages make up the system:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;turtlelib&lt;/strong&gt; — Pure C++ library for SE(2) transforms, differential drive kinematics, and the EKF. No ROS dependency; fully unit-tested with Catch2.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;nusim&lt;/strong&gt; — 2D physics simulator with Gaussian wheel noise, wheel slip, lidar ray-casting, and collision detection.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;nuturtle_description&lt;/strong&gt; — Multi-colour TurtleBot3 URDF with a parametric &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;diff_params.yaml&lt;/code&gt; for wheel radius, track width, and collision geometry.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;nuturtle_control&lt;/strong&gt; — Turtle interface node, odometry node, and a circle-driving node.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;nuslam&lt;/strong&gt; — Landmark detection pipeline and EKF SLAM node.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;algorithms&quot;&gt;Algorithms&lt;/h2&gt;

&lt;h3 id=&quot;odometry&quot;&gt;Odometry&lt;/h3&gt;

&lt;p&gt;The odometry node integrates wheel encoder deltas using the constant-curvature arc model from the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DiffDrive&lt;/code&gt; class. It publishes &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nav_msgs/Odometry&lt;/code&gt; and broadcasts the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;odom → base_footprint&lt;/code&gt; TF, with an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;initial_pose&lt;/code&gt; service to reset the origin.&lt;/p&gt;
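&lt;p&gt;One step of that constant-curvature update can be sketched as below. This is a Python paraphrase of what the C++ &lt;code&gt;DiffDrive&lt;/code&gt; class computes, with variable names of my choosing.&lt;/p&gt;

```python
import numpy as np

def update_odometry(x, y, theta, d_left, d_right, track_width):
    """Integrate one pair of wheel arc lengths (meters) with the
    constant-curvature model, mirroring what a DiffDrive class does."""
    d = 0.5 * (d_right + d_left)               # forward arc length
    dtheta = (d_right - d_left) / track_width  # heading change
    if abs(dtheta) > 1e-9:
        radius = d / dtheta
        dx = radius * np.sin(dtheta)
        dy = radius * (1.0 - np.cos(dtheta))
    else:                                      # straight-line limit
        dx, dy = d, 0.0
    # rotate the body-frame displacement into the world frame
    c, s = np.cos(theta), np.sin(theta)
    return x + c * dx - s * dy, y + s * dx + c * dy, theta + dtheta
```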

&lt;div style=&quot;display: flex; gap: 16px; margin: 20px 0; flex-wrap: nowrap; align-items: flex-start;&quot;&gt;
  &lt;div style=&quot;flex: 0 0 30%; text-align: center;&quot;&gt;
    &lt;video class=&quot;img-responsive&quot; autoplay=&quot;&quot; loop=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; style=&quot;border-radius: 8px; box-shadow: 0 4px 14px rgba(0,0,0,0.12); width: 100%;&quot;&gt;
      &lt;source src=&quot;https://zhul49.github.io/img/portfolio/Irl_Turtle_spin.mp4&quot; type=&quot;video/mp4&quot; /&gt;
    &lt;/video&gt;
    &lt;p style=&quot;margin-top: 8px; font-size: 0.9em; color: #666;&quot;&gt;Physical TurtleBot3 driving in a circle&lt;/p&gt;
  &lt;/div&gt;
  &lt;div style=&quot;flex: 0 0 68%; text-align: center;&quot;&gt;
    &lt;video class=&quot;img-responsive&quot; autoplay=&quot;&quot; loop=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; style=&quot;border-radius: 8px; box-shadow: 0 4px 14px rgba(0,0,0,0.12); width: 100%;&quot;&gt;
      &lt;source src=&quot;https://zhul49.github.io/img/portfolio/Rviz_Turtle_spin.mp4&quot; type=&quot;video/mp4&quot; /&gt;
    &lt;/video&gt;
    &lt;p style=&quot;margin-top: 8px; font-size: 0.9em; color: #666;&quot;&gt;Odometry estimate visualized in RViz&lt;/p&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;h3 id=&quot;landmark-detection&quot;&gt;Landmark Detection&lt;/h3&gt;

&lt;p&gt;Landmarks are cylinders detected from 2D lidar scans in three stages:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Clustering&lt;/strong&gt; — consecutive scan points within a distance threshold form a cluster; clusters with fewer than 4 points are discarded.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Circle fitting&lt;/strong&gt; — the Hyper algebraic fit finds the centre and radius of each cluster. Clusters with implausible radii are rejected.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Arc classification&lt;/strong&gt; — uses the inscribed angle theorem: for a true circular arc, the angle ∠P₁PP₂ is nearly constant across all points P. Clusters whose mean angle falls between 90° and 135° with standard deviation below 0.15 rad are classified as circles.&lt;/li&gt;
&lt;/ol&gt;
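&lt;p&gt;The inscribed-angle test in step 3 can be sketched as follows, using the thresholds stated above (90°–135° mean, 0.15 rad standard deviation); the function signature is mine.&lt;/p&gt;

```python
import numpy as np

def is_circle(cluster, lo_deg=90.0, hi_deg=135.0, max_std=0.15):
    """Inscribed-angle test: for points on a circular arc, the angle
    subtended at each interior point by the cluster endpoints is nearly
    constant. cluster is an Nx2 array of ordered scan points."""
    p1, p2 = cluster[0], cluster[-1]
    angles = []
    for p in cluster[1:-1]:
        a, b = p1 - p, p2 - p
        cosang = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        angles.append(np.arccos(np.clip(cosang, -1.0, 1.0)))
    mean, std = np.mean(angles), np.std(angles)
    in_range = np.deg2rad(hi_deg) >= mean >= np.deg2rad(lo_deg)
    return bool(in_range and max_std > std)
```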

&lt;h3 id=&quot;ekf-slam&quot;&gt;EKF SLAM&lt;/h3&gt;

&lt;p&gt;The EKF maintains a joint state vector [θ, x, y, mx₁, my₁, …, mxN, myN] and a single covariance matrix over robot pose and all landmark positions simultaneously.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Prediction&lt;/strong&gt; — wheel encoder deltas propagate through the nonlinear motion model; the Jacobian is computed analytically.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Correction&lt;/strong&gt; — each detected landmark is matched to a known map entry by world-frame Euclidean distance, then used to correct the full state via a range-bearing Jacobian.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Initialisation&lt;/strong&gt; — on first sighting, a landmark is placed using the inverse measurement model and the EKF correction is skipped for that frame, avoiding a degenerate update with uninformative covariance.&lt;/li&gt;
&lt;/ul&gt;
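&lt;p&gt;The initialisation step’s inverse measurement model is simple enough to show directly, assuming the state ordering [θ, x, y, …] given above; the function name is mine.&lt;/p&gt;

```python
import numpy as np

def init_landmark(state, r, phi):
    """Inverse measurement model: place a newly sighted landmark from a
    range-bearing measurement (r, phi), given state = [theta, x, y, ...]."""
    theta, x, y = state[0], state[1], state[2]
    return np.array([x + r * np.cos(theta + phi),
                     y + r * np.sin(theta + phi)])
```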

&lt;h2 id=&quot;tf-tree&quot;&gt;TF Tree&lt;/h2&gt;

&lt;div style=&quot;text-align: center; margin: 20px 0;&quot;&gt;
  &lt;a href=&quot;https://zhul49.github.io/img/portfolio/tf_tree.png&quot; target=&quot;_blank&quot; title=&quot;Click to view full size&quot;&gt;
    &lt;img src=&quot;https://zhul49.github.io/img/portfolio/tf_tree.png&quot; style=&quot;width: 100%; border-radius: 8px; box-shadow: 0 4px 14px rgba(0,0,0,0.10); cursor: zoom-in;&quot; alt=&quot;TF Tree&quot; /&gt;
  &lt;/a&gt;
  &lt;p style=&quot;margin-top: 8px; font-size: 0.85em; color: #999;&quot;&gt;Click to view full size&lt;/p&gt;
&lt;/div&gt;</content>

      
      
      
      
      

      <author>
          <name></name>
        
        
      </author>

      

      
        <category term="ROS 2" />
      
        <category term="C++" />
      
        <category term="Unsupervised Learning" />
      
        <category term="SLAM" />
      
        <category term="EKF" />
      
        <category term="TurtleBot3" />
      

      
        <summary type="html">Demo</summary>
      

      
      
    </entry>
  
</feed>
