Explanation supervision is a technique in which the model is guided by human-generated explanations during training. This technique aims to improve both the interpretability and predictability of the model by incorporating human understanding into the training process. Since explanation supervision requires a large scale of training data, the data augmentation technique is necessary to be applied to increase the size and diversity of the original dataset. However, data augmentation on sophisticated data like medical images is particularly challenging due to the following: 1) scarcity of data in training the learning-based data augmenter, 2) difficulty in generating realistic and sophisticated images, and 3) difficulty in ensuring the augmented data indeed boosts the performance of explanation-guided learning. To solve these challenges, we propose an Explanation Iterative Supervision via Saliency-guided Data Augmentation (ESSA) framework for conducting explanation supervision and adversarial-trained image data augmentation via a synergized iterative loop that handles the translation from annotation to sophisticated images and the generation of synthetic image-annotation pairs with an alternating training strategy. Extensive experiments on two datasets from the medical imaging domain demonstrate the effectiveness of our proposed framework in improving both the predictability and explainability of the model.