Conventional models for emotion recognition from the speech signal are trained in a supervised fashion on speech utterances with emotion labels. In this study we hypothesize that the speech signal depends on multiple latent variables, including the emotional state, age, gender, and speech content. We propose an Adversarial Autoencoder (AAE) to perform variational inference over these latent variables and reconstruct the input feature representations. Reconstruction of the feature representations serves as an auxiliary task that aids the primary emotion recognition task. Experiments on the IEMOCAP dataset demonstrate that the auxiliary learning tasks improve emotion classification accuracy compared to a baseline supervised classifier. Further, we demonstrate that the proposed learning approach can be used for end-to-end speech emotion recognition, as it is applicable to models that operate on frame-level inputs.
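To make the multi-task setup described above concrete, the following is a minimal sketch, not the authors' implementation: an encoder produces a latent code from frame-level features, an emotion head carries the primary classification task, and a decoder reconstructs the input as the auxiliary task. All layer sizes, names (`MultiTaskAAE`, `feat_dim`, `latent_dim`), and the choice of PyTorch are assumptions for illustration; the adversarial training step that matches the latent code to a prior is only indicated by the discriminator module.

```python
import torch
import torch.nn as nn

class MultiTaskAAE(nn.Module):
    """Hypothetical sketch of the described setup: shared encoder, primary
    emotion classifier, auxiliary reconstruction decoder, and a discriminator
    that would be used for adversarial prior matching (training loop omitted)."""

    def __init__(self, feat_dim=40, latent_dim=64, num_emotions=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        self.emotion_head = nn.Linear(latent_dim, num_emotions)  # primary task
        self.decoder = nn.Sequential(                             # auxiliary reconstruction task
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, feat_dim),
        )
        self.discriminator = nn.Sequential(                       # adversarial prior matching (not trained here)
            nn.Linear(latent_dim, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.emotion_head(z), self.decoder(z), z


# Toy usage: combine the primary classification loss with the auxiliary
# reconstruction loss on a batch of frame-level feature vectors.
model = MultiTaskAAE()
x = torch.randn(8, 40)                 # assumed 40-dimensional frame features
y = torch.randint(0, 4, (8,))          # emotion labels for 4 classes
logits, recon, z = model(x)
loss = nn.CrossEntropyLoss()(logits, y) + nn.MSELoss()(recon, x)
loss.backward()
```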