Recently, deepfakes have raised severe concerns about the authenticity of online media. Prior works on deepfake detection have made considerable efforts to capture intra-modal artifacts. However, deepfake videos in real-world scenarios often combine both audio and visual content. In this paper, we propose AVoiD-DF, an Audio-Visual Joint Learning framework for Detecting Deepfakes, which exploits audio-visual inconsistency for multi-modal forgery detection. Specifically, AVoiD-DF first embeds temporal-spatial information with a Temporal-Spatial Encoder. A Multi-Modal Joint-Decoder is then designed to fuse multi-modal features and jointly learn their inherent relationships. Afterward, a Cross-Modal Classifier is devised to detect manipulation based on inter-modal and intra-modal disharmony. Since existing datasets for deepfake detection mainly focus on a single modality and cover only a few forgery methods, we build DefakeAVMiT, a novel benchmark for multi-modal deepfake detection. DefakeAVMiT contains sufficient visual clips with corresponding audio, where either modality may be maliciously modified by multiple deepfake methods. Experimental results on DefakeAVMiT, FakeAVCeleb, and DFDC demonstrate that AVoiD-DF outperforms state-of-the-art methods in deepfake detection. Our proposed method also generalizes better across various forgery techniques.
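
To make the encoder-decoder-classifier pipeline described above concrete, the PyTorch sketch below shows one possible rendering of the three stages (per-modality temporal-spatial encoding, joint decoding for fusion, cross-modal classification). It is a minimal illustration under our own assumptions; the layer types, dimensions, fusion direction, and pooling are hypothetical placeholders and not the authors' implementation.

```python
import torch
import torch.nn as nn


class AVoiDDFSketch(nn.Module):
    """Illustrative sketch of an encoder -> joint decoder -> classifier pipeline.
    All module choices and sizes are assumptions, not the paper's actual model."""

    def __init__(self, dim=256, n_heads=4, n_layers=2):
        super().__init__()
        # Temporal-Spatial Encoders: one per modality (visual tokens, audio tokens)
        self.visual_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True),
            num_layers=n_layers,
        )
        self.audio_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True),
            num_layers=n_layers,
        )
        # Multi-Modal Joint-Decoder: cross-attention fusing the two token streams
        self.joint_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=dim, nhead=n_heads, batch_first=True),
            num_layers=n_layers,
        )
        # Cross-Modal Classifier: binary real/fake prediction from fused features
        self.classifier = nn.Linear(dim, 2)

    def forward(self, visual_tokens, audio_tokens):
        # visual_tokens: (B, Tv, dim) frame/patch embeddings
        # audio_tokens:  (B, Ta, dim) spectrogram embeddings
        v = self.visual_encoder(visual_tokens)
        a = self.audio_encoder(audio_tokens)
        # One possible fusion: visual queries attend to audio memory
        fused = self.joint_decoder(tgt=v, memory=a)
        # Mean-pool fused tokens and classify
        return self.classifier(fused.mean(dim=1))


if __name__ == "__main__":
    model = AVoiDDFSketch()
    vis = torch.randn(2, 16, 256)  # dummy visual token sequence
    aud = torch.randn(2, 32, 256)  # dummy audio token sequence
    print(model(vis, aud).shape)   # torch.Size([2, 2])
```

In this reading, inter-modal disharmony is captured by the cross-attention in the joint decoder, while intra-modal artifacts are captured by the per-modality self-attention encoders; how the original model combines the two signals is detailed in the method section rather than here.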