Speech emotion recognition (SER) is a challenging task, since emotions are expressed differently across languages. This paper proposes a robust SER approach aimed at improving performance for low-resource languages such as Sindhi. To the best of our knowledge, this is the first SER work on the Sindhi language to utilize data augmentation (DA) and deep learning techniques. The proposed method first applies the optimally modified log-spectral amplitude (OMLSA) estimator to suppress noise in the speech data. Second, to cope with the imbalanced and limited datasets typical of low-resource languages, a DA technique combining prosodic modifications (i.e., time-stretching and pitch shifting) with additive white noise is proposed. SER is then performed with the proposed one-dimensional convolutional neural network (1DCNN) model. We contribute further by introducing a novel Sindhi speech emotion dataset (NSSED) comprising 1231 audio files categorized into four emotions (i.e., happy, sad, angry, and neutral). To demonstrate the superior performance and cross-lingual adaptability of the proposed method, it is compared with two baselines, i.e., a support vector machine (SVM) and a long short-term memory (LSTM) network, on both the NSSED and an Urdu-language dataset. Experimental results show that the proposed method achieves up to 91% and 88% accuracy on the NSSED and Urdu datasets, respectively, an improvement of approximately 22% over the baseline model.
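
To make the DA step concrete, the following is a minimal sketch of waveform-level augmentation by time-stretching, pitch shifting, and additive white noise, assuming librosa and NumPy. The function name `augment` and all parameter values (stretch rate, semitone shift, noise scale, sample rate) are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np
import librosa

def augment(y: np.ndarray, sr: int) -> dict:
    """Return three augmented variants of one mono waveform y at sample rate sr."""
    # Prosodic modification 1: slow the utterance down by 10% (duration changes, pitch preserved).
    stretched = librosa.effects.time_stretch(y, rate=0.9)
    # Prosodic modification 2: raise the pitch by 2 semitones (duration preserved).
    shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)
    # Additive white noise at a small amplitude relative to typical speech levels.
    noisy = y + 0.005 * np.random.randn(len(y))
    return {"time_stretch": stretched, "pitch_shift": shifted, "white_noise": noisy}

# Example usage (hypothetical file path and sample rate):
# y, sr = librosa.load("sample.wav", sr=16000)
# variants = augment(y, sr)
```

Applying such transforms to each training utterance multiplies the effective dataset size, which is the usual rationale for DA on small or imbalanced corpora.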