Violence detection in videos has strong practical potential given the rapid growth of video data in recent years. Most previous work treats violence detection as a simple video classification task and relies on a single modality, e.g., the visual signal, from small-scale datasets. Such solutions are insufficient. To mitigate this problem, we study weakly supervised violence detection on large-scale audio-visual violence data and first introduce two complementary tasks, coarse-grained violent frame detection and fine-grained violent event detection, which advance simple violent-video classification to frame-level violent event localization, i.e., accurately locating violent events in untrimmed videos. We then propose a novel network that takes audio-visual data as input and contains three parallel branches that capture different relationships among video snippets and integrate their features: the similarity branch and the proximity branch capture long-range dependencies using a similarity prior and a proximity prior, respectively, and the score branch dynamically captures the closeness of predicted scores. On both the coarse-grained and fine-grained tasks, our approach outperforms other state-of-the-art approaches on two public datasets. Moreover, experimental results show the positive effect of audio-visual input and relationship modeling.
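To make the three snippet relationships concrete, the following is a minimal sketch rather than the proposed network itself. It assumes each snippet is a plain feature vector and each predicted score lies in [0, 1]. The function names, the cosine threshold, the exponential distance decay, and the row normalization are all illustrative assumptions that the text above does not specify.

```python
import math


def row_normalize(matrix):
    """Scale each row to sum to 1; all-zero rows are returned unchanged."""
    out = []
    for row in matrix:
        total = sum(row)
        out.append([v / total for v in row] if total > 0 else list(row))
    return out


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def similarity_weights(features, threshold=0.5):
    """Similarity prior (assumed form): link snippets whose features are alike,
    regardless of how far apart they are in time."""
    raw = [[cosine(fi, fj) for fj in features] for fi in features]
    raw = [[v if v >= threshold else 0.0 for v in row] for row in raw]
    return row_normalize(raw)


def proximity_weights(num_snippets, sigma=1.0):
    """Proximity prior (assumed form): weight decays with temporal distance."""
    raw = [[math.exp(-abs(i - j) / sigma) for j in range(num_snippets)]
           for i in range(num_snippets)]
    return row_normalize(raw)


def score_weights(scores):
    """Score closeness (assumed form): snippets with similar predicted
    scores in [0, 1] receive larger mutual weights."""
    raw = [[1.0 - abs(si - sj) for sj in scores] for si in scores]
    return row_normalize(raw)


def aggregate(weights, features):
    """Replace each snippet feature by a weighted average over all snippets."""
    dim = len(features[0])
    return [[sum(w * f[d] for w, f in zip(row, features)) for d in range(dim)]
            for row in weights]


# Three toy snippets: the first two look alike, the third differs.
features = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
scores = [0.9, 0.8, 0.1]  # hypothetical per-snippet violence scores

sim = similarity_weights(features)
prox = proximity_weights(len(features))
close = score_weights(scores)
fused = aggregate(sim, features)
```

In this toy setup each branch produces its own snippet-to-snippet weight matrix, and `aggregate` shows one simple way such a matrix could be used to mix features across snippets. A real model would learn these mappings and combine the branch outputs rather than use fixed formulas.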