An RGB-D Dataset with Cluttered Box Scenes Containing Household Objects

Tonci Novkovic, Fadri Furrer, Marko Panjek, Margarita Grinvald, Roland Siegwart, and Juan Nieto

Autonomous Systems Lab, ETH Zurich, Switzerland

CLUBS is an RGB-D dataset that can be used for segmentation, classification, and detection of household objects in realistic warehouse box scenarios. The dataset contains the object scenes, the reconstructed models, and box scenes that contain multiple objects packed in different configurations. Additionally, the raw scanning data, 2D object bounding boxes and pixel-wise labels in RGB images, 3D bounding boxes, and calibration data are available.
The dataset contains 85 object scenes and 25 box scenes. The box scenes differ in their object configurations and in clutter level. More details about the specific box scenes can be found by clicking on the scene gif image in the Box scenes section below.

The data was collected using a robotic arm (UR10) with one RGB (Chameleon3) and three RGB-D (PrimeSense Carmine 1.09, Intel RealSense D415, Intel RealSense D435) cameras. All the cameras were calibrated for intrinsic and extrinsic parameters. More details about the calibration and the notation used can be found below.

Object scenes
The dataset contains 85 different objects. The path for scanning a single object scene contains 19 different poses at three different height levels. It covers the whole object from every side, except the bottom. Every object was, therefore, rotated and scanned a second time to also cover the bottom face.

The objects within the dataset are separated into 5 main categories by the type of the product, 3 categories based on shape, and 2 categories based on rigidity:
More details about object scenes and their individual download links can be found here:

Box scenes
The robot path for box scenes is shorter and contains 9 poses. These poses were chosen such that the inside of a box of size 58x38x34 cm is covered from all sides, including a top view. The dataset contains 25 different box scenes, of which 5 have 40 objects and the rest have 30 objects. The overall object distribution within the boxes is shown below:

To get more details about a specific box click on its image below:

All the raw calibration data and results are available for download below. For the distortion, a radial-tangential model was used, with the coefficients represented as:

$$[r_1, r_2, t_1, t_2, r_3]$$
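As an illustration of how such a coefficient vector is typically applied, the following minimal sketch distorts a normalized image point and projects it to pixels. It assumes that the coefficients follow the common Brown-Conrady convention with the same ordering as OpenCV's (k1, k2, p1, p2, k3); this convention and the function names are assumptions for illustration, not something the calibration files confirm.

```python
def distort_radtan(x, y, coeffs):
    """Apply radial-tangential distortion to a normalized image point.

    ASSUMPTION: coeffs = [r_1, r_2, t_1, t_2, r_3] uses the Brown-Conrady
    convention, ordered like OpenCV's (k1, k2, p1, p2, k3).
    """
    r1, r2, t1, t2, r3 = coeffs
    rr = x * x + y * y  # squared distance from the optical axis
    radial = 1.0 + r1 * rr + r2 * rr ** 2 + r3 * rr ** 3
    x_d = x * radial + 2.0 * t1 * x * y + t2 * (rr + 2.0 * x * x)
    y_d = y * radial + t1 * (rr + 2.0 * y * y) + 2.0 * t2 * x * y
    return x_d, y_d


def project(point_cam, fx, fy, cx, cy, coeffs):
    """Project a 3D point in the camera frame to distorted pixel coordinates."""
    X, Y, Z = point_cam
    x_d, y_d = distort_radtan(X / Z, Y / Z, coeffs)
    return fx * x_d + cx, fy * y_d + cy
```

With all coefficients set to zero, the sketch reduces to the plain pinhole projection, which is a quick sanity check before plugging in calibrated values.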
All the frames and transformations between the cameras and the robot are displayed in the image below:
For the RealSense cameras, two different calibration files are provided: *_device_depth.yaml contains the depth intrinsic parameters read out from the device itself for the lower-resolution depth, and *_stereo_depth.yaml contains the higher-resolution depth intrinsic parameters obtained in the calibration step. If the regular depth from the raw data is used, the former file should be used. If the depth is generated from stereo using our Python script, the latter file should be used. The calibration folder is provided as:
  • - calibration
    • - primesense.yaml
    • - realsense_d415_device_depth.yaml
    • - realsense_d415_stereo_depth.yaml
    • - realsense_d435_device_depth.yaml
    • - realsense_d435_stereo_depth.yaml
    • - chameleon3.yaml
    • - calibration_raw_data
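The rule for choosing between the two RealSense depth calibration files can be captured in a small helper. This is only a sketch: the function name and the `depth_source` argument are hypothetical, and the file names are taken from the folder listing above.

```python
REALSENSE_SENSORS = ("realsense_d415", "realsense_d435")
SINGLE_FILE_SENSORS = ("primesense", "chameleon3")


def calibration_file(sensor, depth_source="device"):
    """Return the calibration file name for a sensor.

    depth_source (hypothetical argument) is "device" for the raw depth
    images and "stereo" for depth computed from the IR stereo pair.
    """
    if sensor in REALSENSE_SENSORS:
        if depth_source not in ("device", "stereo"):
            raise ValueError(f"unknown depth source: {depth_source!r}")
        return f"{sensor}_{depth_source}_depth.yaml"
    if sensor in SINGLE_FILE_SENSORS:
        # These sensors have a single calibration file in the listing.
        return f"{sensor}.yaml"
    raise ValueError(f"unknown sensor: {sensor!r}")
```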
A MATLAB script for obtaining the calibration parameters is available on the clubs_dataset_tools GitHub page.

Notation and data structure
Camera poses are represented by a translation vector and a Hamiltonian unit quaternion:

$$[stamp, t_x, t_y, t_z, q_x, q_y, q_z, q_w]$$
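A pose in this format can be converted into a 4x4 homogeneous transformation matrix. The sketch below uses the standard rotation matrix of a Hamiltonian unit quaternion; the function names are hypothetical, and it assumes each pose is available as a sequence of eight values in the order shown above.

```python
def quat_to_rot(qx, qy, qz, qw):
    """Rotation matrix of a Hamiltonian quaternion given in (x, y, z, w) order."""
    # Normalize defensively in case of rounding in the stored values.
    n = (qx * qx + qy * qy + qz * qz + qw * qw) ** 0.5
    qx, qy, qz, qw = qx / n, qy / n, qz / n, qw / n
    return [
        [1 - 2 * (qy * qy + qz * qz), 2 * (qx * qy - qz * qw), 2 * (qx * qz + qy * qw)],
        [2 * (qx * qy + qz * qw), 1 - 2 * (qx * qx + qz * qz), 2 * (qy * qz - qx * qw)],
        [2 * (qx * qz - qy * qw), 2 * (qy * qz + qx * qw), 1 - 2 * (qx * qx + qy * qy)],
    ]


def pose_row_to_matrix(row):
    """Convert [stamp, t_x, t_y, t_z, q_x, q_y, q_z, q_w] to (stamp, 4x4 matrix)."""
    stamp = row[0]
    tx, ty, tz, qx, qy, qz, qw = (float(v) for v in row[1:8])
    R = quat_to_rot(qx, qy, qz, qw)
    T = [R[0] + [tx], R[1] + [ty], R[2] + [tz], [0.0, 0.0, 0.0, 1.0]]
    return stamp, T
```

The matrix is returned as nested lists to keep the sketch dependency-free; it can be wrapped in a NumPy array if matrix products are needed.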
Scenes contain either 9 or 19 poses depending on the type of scene being scanned (box or object scene, respectively). Object scenes are labeled as label_side_name, where label is a 3-digit number uniquely defining an object, side is either 0 or 1, since all objects are flipped in order to get full coverage, and name describes the object. Box scenes follow the convention box_label_iteration, where label identifies a box with a defined subset of N contained objects, and iteration relates to the current number of objects in the box (N - iteration). In the first iteration all the objects are in the box; the box is scanned, one object is taken out, and then the next iteration starts. This is repeated until there are no objects left in the box.
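The naming conventions above can be parsed with a few lines of Python. This is a sketch under assumptions: the function names and the example scene names in the usage note are hypothetical, and `remaining_objects` follows the (N - iteration) expression literally, which implies that iterations are counted from 0.

```python
def parse_scene_name(scene):
    """Split a scene folder name into its components.

    Object scenes: label_side_name (the name may contain underscores).
    Box scenes:    box_label_iteration.
    """
    parts = scene.split("_")
    if parts[0] == "box":
        if len(parts) != 3:
            raise ValueError(f"unexpected box scene name: {scene!r}")
        return {"type": "box", "label": parts[1], "iteration": int(parts[2])}
    if (len(parts) < 3 or len(parts[0]) != 3 or not parts[0].isdigit()
            or parts[1] not in ("0", "1")):
        raise ValueError(f"unexpected object scene name: {scene!r}")
    return {
        "type": "object",
        "label": parts[0],
        "side": int(parts[1]),
        "name": "_".join(parts[2:]),
    }


def remaining_objects(n_objects, iteration):
    """Objects left in the box, ASSUMING iterations are counted from 0."""
    return n_objects - iteration
```

For example, a hypothetical name such as `001_0_some_object` would be parsed as object 001, side 0, named `some_object`.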
Each sensor folder contains the available raw data for that sensor and an additional folder with pixel-wise labeled images and .json files which include object 2D bounding boxes and polygon vertices used for generating the label image. Objects present in the scene, together with the 3D bounding box size and pose, are stored in the scene_objects.csv file:
$$[object\_id, x, y, z, qx, qy, qz, qw, size_x, size_y, size_z]$$
where the vector [x, y, z] represents the center coordinates, the quaternion [qx, qy, qz, qw] represents the orientation, and [size_x, size_y, size_z] represents the size of the 3D bounding box.
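From the center, orientation, and size, the eight corners of a 3D bounding box can be recovered. The following sketch assumes that each row holds the eleven fields in the order given above and that the size is measured along the axes of the rotated box frame; whether the file has a header row is not stated here, so the function works on a single already-split row. The function name is hypothetical.

```python
from itertools import product


def box_corners(row):
    """Return (object_id, list of 8 corner points) for one scene_objects row."""
    object_id = row[0]
    x, y, z, qx, qy, qz, qw, sx, sy, sz = (float(v) for v in row[1:11])
    # Rotation matrix of the Hamiltonian unit quaternion (x, y, z, w order).
    R = [
        [1 - 2 * (qy * qy + qz * qz), 2 * (qx * qy - qz * qw), 2 * (qx * qz + qy * qw)],
        [2 * (qx * qy + qz * qw), 1 - 2 * (qx * qx + qz * qz), 2 * (qy * qz - qx * qw)],
        [2 * (qx * qz - qy * qw), 2 * (qy * qz + qx * qw), 1 - 2 * (qx * qx + qy * qy)],
    ]
    corners = []
    for dx, dy, dz in product((-0.5, 0.5), repeat=3):
        # Corner offset in the box frame, then rotated and shifted to the center.
        lx, ly, lz = dx * sx, dy * sy, dz * sz
        corners.append((
            x + R[0][0] * lx + R[0][1] * ly + R[0][2] * lz,
            y + R[1][0] * lx + R[1][1] * ly + R[1][2] * lz,
            z + R[2][0] * lx + R[2][1] * ly + R[2][2] * lz,
        ))
    return object_id, corners
```

Rows can be obtained with Python's `csv.reader`, skipping a header line if the file turns out to have one.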
All the poses are also stored using the quaternion and translation format as described above. The robot's end-effector poses are stored in W_H_poses.csv, and for each sensor, the poses of the RGB camera and the IR cameras are stored in W_RGB_poses.csv, W_IR1_poses.csv and W_IR2_poses.csv respectively. The folder structure is as follows:
  • - scene
    • - scene_objects.csv
    • - W_H_poses.csv
    • - sensor
      • - W_RGB_poses.csv
      • - W_IR1_poses.csv
      • - W_IR2_poses.csv
      • - depth_images
        • - stamp_depth.png
      • - ir1_images
        • - stamp_ir1.png
      • - ir2_images
        • - stamp_ir2.png
      • - rgb_images
        • - stamp_rgb.png
      • - labels
        • - rgb_images
          • - stamp_rgb_label.json
          • - stamp_rgb_label.png
      • - scene_sensor_pointcloud.ply
For each sensor in one scene, we provide the reconstructed point cloud scene_sensor_pointcloud.ply obtained by integrating the point clouds with their corresponding camera poses into a TSDF volume. The pixel-wise labels and 2D bounding boxes are stored in json files timestamp_rgb_label.json as follows:
    {
        "poly":
            [[[x00, y00], ..., [x0L, y0L]],
             ...
             [[xN0, yN0], ..., [xNL, yNL]]],
        "bbox":
            [[bx0, by0, w0, h0],
             ...
             [bxN, byN, wN, hN]],
        "labels":
            [label0, ..., labelN]
    }
where poly is a list of lists of image coordinates that define the polygon surrounding each object, bbox is a list of lists that contain the 2D bounding box as the coordinates of the top-left corner followed by the width and height, and labels is a list of label names corresponding to the objects present in the scene. These json files are used to generate the label images timestamp_rgb_label.png. All pixels in these images have values from 0 to 41, where 0 corresponds to the background and the remaining values index the labels in the json file, i.e. value 1 is the first label in the json file, value 2 is the second, etc.
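The mapping between pixel values and label names can be sketched as follows. The key names `poly`, `bbox`, and `labels` are taken from the description above and are assumed to be spelled exactly like this in the files; the function name and the sample content in the test are hypothetical.

```python
import json


def index_labels(json_text):
    """Return (pixel value -> label name, list of (name, corner box)).

    Pixel value 0 is the background; value i (i >= 1) refers to the
    i-th entry of the labels list.
    """
    data = json.loads(json_text)
    names = {0: "background"}
    for value, name in enumerate(data["labels"], start=1):
        names[value] = name
    boxes = []
    for name, (bx, by, w, h) in zip(data["labels"], data["bbox"]):
        # Convert (top-left corner, width, height) to corner coordinates.
        boxes.append((name, (bx, by, bx + w, by + h)))
    return names, boxes
```

A pixel of the label image with value v then belongs to `names[v]`, so the label image and the json file should always be read together.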
By using the provided Python scripts for generating depth images from stereo, computing point clouds from RGB-D images, and registering depth images to the RGB image of the corresponding sensor, additional folders for each sensor are created. Namely, these are stereo_depth_images, point_clouds, and registered_depth_images.

If you are using this dataset in your research, please cite the following publication:

    author      = {Novkovic, Tonci and Furrer, Fadri and Panjek, Marko and Grinvald, Margarita and Siegwart, Roland and Nieto, Juan},
    journal     = {The International Journal of Robotics Research (IJRR)},
    title       = {CLUBS: An RGB-D dataset with cluttered box scenes containing household objects},
    year        = {2019},
    pages       = {1538-1548},
    volume      = {38},
    number      = {14},
    doi         = {10.1177/0278364919875221}

Individual object scenes can be downloaded from the list in the Object scenes section. Furthermore, individual box scenes can be downloaded by clicking on the scene image in the Box scenes section. Finally, the whole dataset can be downloaded using the following links:

This dataset comes with a set of tools which are available on our repo:

These tools include:

  • Download script for different parts of the dataset
  • Script for computing depth images from an IR stereo pair
  • Script for generating point clouds
  • Script for registering depth images to corresponding RGB images
  • Script for displaying label images in color
  • Camera calibration script