contains an order of magnitude more images than the MS-COCO dataset (Lin et al., 2014)
and represents a wider variety of both images and image caption styles. We achieve this by
extracting and filtering image caption annotations from billions of webpages. We also
present quantitative evaluations of several image captioning models and show that a
model architecture based on Inception-ResNet-v2 (Szegedy et al., 2016) for image-feature …