Robust visual tracking is challenging because of pose variation, occlusion, and cluttered backgrounds. No single feature is robust to every scenario in a video sequence, but combining multiple features has proven effective in such difficult situations. We propose a new framework for multi-modal fusion at both the feature and decision levels: a reconstructive and discriminative dictionary and a classifier are trained simultaneously for each modality, under the additional constraint of label consistency across modalities. In addition, a joint decision measure based on both reconstruction and classification error adaptively adjusts the weights of the different features so that unreliable features can be removed from tracking. The proposed tracking scheme is referred to as label-consistent and fusion-based joint sparse coding (LC-FJSC). Extensive experiments on publicly available videos demonstrate that LC-FJSC outperforms state-of-the-art trackers.
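
To make the decision-level idea concrete, the following is a minimal sketch of how per-feature reconstruction and classification errors might be combined into adaptive fusion weights. The abstract does not give the actual joint decision measure, so everything here is an assumption for illustration: the function name `fusion_weights`, the convex combination controlled by `alpha`, the exponential mapping with scale `sigma`, the hard cutoff `drop_above`, and the example error values.

```python
import math


def fusion_weights(recon_err, cls_err, alpha=0.5, sigma=1.0, drop_above=None):
    """Map per-feature errors to normalized fusion weights.

    Illustrative only: the combination rule below is an assumed form,
    not the joint decision measure defined for LC-FJSC.
    """
    if not recon_err or len(recon_err) != len(cls_err):
        raise ValueError("need one reconstruction and one classification error per feature")

    # Assumed joint error: convex combination of the two error types.
    joint = [alpha * r + (1.0 - alpha) * c for r, c in zip(recon_err, cls_err)]

    # Lower joint error gives a higher raw weight; a feature whose joint
    # error exceeds the (hypothetical) cutoff is removed by zeroing its weight.
    raw = [
        0.0 if (drop_above is not None and e > drop_above) else math.exp(-e / sigma)
        for e in joint
    ]

    total = sum(raw)
    if total == 0.0:
        # Every feature was dropped: fall back to uniform weights.
        return [1.0 / len(raw)] * len(raw)
    return [w / total for w in raw]


# Hypothetical errors for three features; the third is unreliable
# (for example, heavily occluded) and is excluded by the cutoff.
weights = fusion_weights([0.2, 0.3, 2.5], [0.1, 0.4, 1.9], drop_above=1.0)
print(weights)
```

In this sketch the unreliable third feature receives zero weight, and the remaining weight is shared between the first two features in favor of the one with the lower joint error.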