Speech balloons and thought bubbles are among the most recognizable signs of the visual language used in comics. These enclosed graphic containers provide a way for text and image to interface with each other. However, their stereotypical meanings as representing speech or thought belie a much deeper semantic richness. This paper uses these graphic signs as a platform for examining the multimodal interfaces between text and image, and details four types of interfaces that characterize the connections between modalities: Inherent, Emergent, Adjoined, and Independent relationships. Each interface facilitates a different level of multimodal integration, tempered by principles of Gestalt grouping and underlying semantic features. This process makes it possible to create singular, cohesive units of text and image, an integration on par with that of other multimodal interfaces, such as the one between speech and gesture.