In recent years, the computer vision community has made significant progress in multi-object tracking (MOT) and video object segmentation (VOS) individually. Further progress can be achieved by effectively combining three tasks: detection, segmentation, and tracking. In this work, we propose a multi-stage framework called "Lidar and monocular Image Fusion based multi-object Tracking and Segmentation (LIFTS)" for multi-object tracking and segmentation (MOTS). In the first stage, we apply a 3D Part-Aware and Aggregation Network detector to the point cloud data to obtain 3D object locations. Then a graph-based 3D TrackletNet Tracker (3D TNT), which takes as input both CNN appearance features and the spatial information of the detections, robustly associates objects over time. The second stage uses a Cascade Mask R-CNN based network with a PointRend head to obtain instance segmentation results from monocular images. Its pre-computed region proposals are generated by projecting the 3D detections from the first stage onto the 2D image plane. Finally, two post-processing techniques are applied in the last stage: (1) the generated masks are refined by an optical-flow guided instance segmentation network; (2) object re-identification (ReID) is applied to recover ID switches caused by long-term occlusion. Our proposed framework is evaluated on the BMTT Challenge 2020 Track 2: KITTI-MOTS dataset and achieves sMOTSA scores of 79.6 for Car and 64.9 for Pedestrian, ranking 2nd in the competition.
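The projection step that links the two stages, turning a 3D detection into a 2D region proposal, can be sketched as follows. This is a minimal illustration using a standard pinhole camera model, not the authors' implementation; the 3x4 projection matrix `P` and the box corner values in the usage below are hypothetical stand-ins for real camera calibration and detector outputs.

```python
import numpy as np

def project_to_image(points_3d, P):
    """Project Nx3 camera-frame 3D points to pixel coordinates
    using a 3x4 projection matrix P (homogeneous pinhole model)."""
    n = points_3d.shape[0]
    pts_h = np.hstack([points_3d, np.ones((n, 1))])  # Nx4 homogeneous coords
    proj = pts_h @ P.T                               # Nx3 projected coords
    return proj[:, :2] / proj[:, 2:3]                # divide by depth

def box_3d_to_2d_proposal(corners_3d, P):
    """Axis-aligned 2D box (x1, y1, x2, y2) enclosing the projection
    of the eight corners of a 3D bounding box."""
    pts = project_to_image(corners_3d, P)
    return (pts[:, 0].min(), pts[:, 1].min(),
            pts[:, 0].max(), pts[:, 1].max())

# Hypothetical projection matrix (focal length 700, principal point (600, 170)):
P = np.array([[700.,   0., 600., 0.],
              [  0., 700., 170., 0.],
              [  0.,   0.,   1., 0.]])

# Hypothetical 3D box: unit half-extents centered 10 m in front of the camera.
corners = np.array([[sx, sy, 10.0 + sz]
                    for sx in (-1., 1.) for sy in (-1., 1.) for sz in (-1., 1.)])
proposal = box_3d_to_2d_proposal(corners, P)
```

In the paper's pipeline such proposals replace the usual learned region proposal stage, so the segmentation network only needs to predict masks inside regions already validated by the LiDAR detector.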