Because collecting expert demonstrations for robots is costly, robot imitation learning suffers from the demonstration-insufficiency problem. A promising solution is self-supervised learning, which leverages pretext tasks to extract general, high-level features from a relatively small amount of data. Since imitation learning tasks are typically composed of primitives (e.g., primary skills such as grasping and reaching), learning representations of these primitives is crucial. However, existing methods represent primitives poorly, which leads to unsatisfactory generalization in learning scenarios with limited data. To address this problem, we propose a novel primitive-contrastive network (PCN) and a pretext task whose learning objective optimizes the distances between pseudo-primitive distributions. Experimental results show that the proposed PCN learns a more discriminative embedding space of primitives than existing self-supervised learning methods. Four representative robot manipulation experiments further demonstrate the superior data efficiency of the proposed method.