videos are classified into one of five real-life crowded scenes: 'Riot', 'Noise-Street',
'Firework-Event', 'Music-Event', and 'Sport-Atmosphere'. To this end, we first collect an
audio-visual dataset (videos) of these five crowded contexts from YouTube (in-the-wild scenes). Then, a
wide range of deep learning classification models are proposed, each trained on either audio or
visual input data independently. Finally, results obtained from high-performance models are …