completed, the dataset will contain over one and a half million captions describing over
330,000 images. For the training and validation images, five independent human-generated
captions will be provided. To ensure consistency in the evaluation of automatic caption
generation algorithms, an evaluation server is used. The evaluation server receives
candidate captions and scores them using several popular metrics, including BLEU …