Quality and safety are critical in surgery, so surgical trainees must reach the required level of expertise before operating on patients. Conventionally, a trainee's performance is evaluated by qualitative methods that are time-consuming and prone to bias. Automated, quantitative surgical skill assessment improves the consistency, repeatability, and reliability of the evaluation. To this end, this paper proposes a video-based deep learning framework for surgical skill assessment. By incorporating prior knowledge of the surgeon's activity into the system design, we decompose the complex task of spatio-temporal representation learning from video recordings into two independent, relatively simple learning processes, which greatly reduces the model size. We evaluate the proposed framework on the publicly available JIGSAWS robotic surgery dataset and demonstrate that it effectively learns both the underlying features of surgical maneuvers and the dynamic interplay between sequences of actions. The framework achieves a skill-level classification accuracy of 97.27% on this dataset, outperforming prior video-based skill assessment methods. The code for this paper will be available on GitHub at ${\color{blue}{\text{sourceCode}}}$.