Handling the curse of dimensionality in the acoustic features of the speech signal has recently been regarded as a crucial objective in machine learning-based emotion detection. Contemporary emotion prediction methods suffer from false alarms because of the high dimensionality of the features used in the training phase of the machine learning models. Most contemporary models have attempted to handle the high dimensionality of the training corpus. However, they focus mainly on fusing multiple classifiers, which barely improves decision accuracy when the training corpus is large. This manuscript presents a novel ensemble model that uses a fusion of diversity measures to select the optimal features. Moreover, the proposed method attempts to reduce the impact of high dimensionality in the feature values by using a novel clustering process. The experimental study assesses the performance of the proposed method in terms of emotion prediction from speech signals and compares it with contemporary machine learning models of emotion detection. Fourfold cross-validation on a standard data corpus is used in the performance analysis.