Sentiment analysis of online user-generated content is important for many social media analytics tasks. Researchers have largely relied on textual sentiment analysis to develop systems that predict political elections, measure economic indicators, and so on. Recently, social media users have increasingly been using images and videos alongside text to express their opinions and share their experiences. Sentiment analysis of such large-scale textual and visual content can help better extract user sentiments toward events or topics. Motivated by the need to leverage large-scale social multimedia content for sentiment analysis, we combine state-of-the-art visual and textual sentiment analysis techniques for joint visual-textual sentiment analysis. We fine-tune a convolutional neural network (CNN) for image sentiment analysis and train a paragraph vector model for textual sentiment analysis. We conduct extensive experiments on both weakly machine-labeled and manually labeled image tweets. The results show that joint visual-textual features achieve state-of-the-art performance, outperforming textual-only and visual-only sentiment analysis algorithms.