text-shape data, the substantial semantic gap between these two modalities, and the
structural complexity of 3D shapes. This paper presents a new framework called Image as
Stepping Stone (ISS) for the task by introducing 2D images as a stepping stone to connect the
two modalities and to eliminate the need for paired text-shape data. Our key contribution is a
two-stage feature-space-alignment approach that maps CLIP features to shapes by …