Speech impairments are among the earliest manifestations of Parkinson’s disease. In particular, patients show articulation deficits related to the speaker’s ability to start and stop the vibration of the vocal folds. These difficulties can be assessed by modeling the transitions between voiced and unvoiced segments in speech. This study proposes a robust strategy to model the articulatory deficits related to starting or stopping vocal fold vibration. The transitions between voiced and unvoiced segments are modeled with a convolutional neural network that extracts suitable information from two time–frequency representations: the short-time Fourier transform and the continuous wavelet transform. The proposed approach improves on the results previously reported in the literature, with accuracies of up to 89% for the classification of Parkinson’s patients vs. healthy speakers. This study is a step towards the robust modeling of speech impairments in patients with neurodegenerative disorders.
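To make the input representation concrete, the following minimal sketch cuts a window around one voiced/unvoiced border and computes its log-magnitude short-time Fourier transform, the kind of time–frequency image a convolutional network could take as input. It is an illustration under stated assumptions, not the configuration used in the study: the function names `transition_segment` and `log_stft`, the 80 ms half-width, the 20 ms frame, the 5 ms hop, the 512-point FFT, and the synthetic test signal are all hypothetical choices, and the border position is assumed to be already known. The continuous wavelet transform and the network itself are not shown.

```python
import numpy as np


def transition_segment(signal, border, fs, half_width_s=0.08):
    """Cut a window centered on a voiced/unvoiced border (a sample index).

    half_width_s is an assumed value, not taken from the study.
    """
    half = int(round(half_width_s * fs))
    start = max(border - half, 0)
    stop = min(border + half, len(signal))
    return signal[start:stop]


def log_stft(segment, fs, frame_s=0.02, hop_s=0.005, n_fft=512):
    """Log-magnitude STFT with shape (n_fft // 2 + 1, n_frames).

    Frame length, hop, and FFT size are assumed values.
    """
    frame = int(round(frame_s * fs))
    hop = int(round(hop_s * fs))
    window = np.hanning(frame)
    n_frames = 1 + (len(segment) - frame) // hop
    # Slice the segment into overlapping, windowed frames.
    frames = np.stack(
        [segment[i * hop : i * hop + frame] * window for i in range(n_frames)]
    )
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)
    # Small constant avoids log(0) on silent frames.
    return np.log(np.abs(spectrum).T + 1e-10)


# Synthetic stand-in for an onset: noise (unvoiced) followed by a 150 Hz tone (voiced).
fs = 16000
rng = np.random.default_rng(0)
unvoiced = 0.1 * rng.standard_normal(fs // 2)
voiced = np.sin(2 * np.pi * 150 * np.arange(fs // 2) / fs)
signal = np.concatenate([unvoiced, voiced])

border = fs // 2  # sample index where voicing starts in this toy signal
segment = transition_segment(signal, border, fs)
spectrogram = log_stft(segment, fs)
```

In practice the border positions would come from a voicing detector applied to real recordings, and each resulting spectrogram would be paired with a wavelet-based representation of the same segment.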