Recent deep learning techniques have achieved satisfactory results on various image-related problems, but many research questions remain open in tasks involving video sequences. Several applications, such as traffic monitoring, person re-identification, security and surveillance, demand the understanding of complex events in videos. In this work, we address the problem of human action recognition in videos through a multi-stream network that incorporates both spatial and temporal information. Our main contribution is a stream based on a new variant of the visual rhythm, called Learnable Visual Rhythm (LVR). To generate the rhythm, we employ a deep network to extract features from the video frames. The features are collected at multiple depths of the network to enable analysis at different levels of abstraction. This strategy significantly outperforms the handcrafted visual rhythm on the UCF101 and HMDB51 datasets. Experiments on both datasets show that our final multi-stream network achieves results competitive with state-of-the-art approaches.
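
As a rough illustration of the general idea of building a rhythm from deep features, the sketch below stacks one descriptor per frame into a 2D array whose columns follow time. It is a minimal sketch under stated assumptions, not the LVR architecture itself: the names `frame_descriptor` and `build_rhythm` are hypothetical, the random arrays stand in for feature maps that a deep network would produce at two depths, and global average pooling is an assumed way of reducing each feature map to a vector.

```python
import numpy as np


def frame_descriptor(feature_maps):
    """Reduce the feature maps of one frame to a single vector.

    feature_maps: list of arrays shaped (C_i, H_i, W_i), one per network depth.
    Global average pooling is an assumption made for this sketch.
    """
    pooled = [fm.mean(axis=(1, 2)) for fm in feature_maps]
    # Concatenating across depths keeps several abstraction levels side by side.
    return np.concatenate(pooled)


def build_rhythm(per_frame_feature_maps):
    """Stack one descriptor per frame into a (D, T) array; columns are frames."""
    columns = [frame_descriptor(maps) for maps in per_frame_feature_maps]
    return np.stack(columns, axis=1)


# Hypothetical stand-in for deep features: 8 frames, two depths
# with 4 and 6 channels respectively.
rng = np.random.default_rng(0)
video = [[rng.random((4, 7, 7)), rng.random((6, 3, 3))] for _ in range(8)]

rhythm = build_rhythm(video)
print(rhythm.shape)  # (10, 8): 4 + 6 feature rows, 8 time steps
```

The resulting array can be treated as a single image that summarizes the whole video, which is what allows a 2D network to process it as one stream.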