X Hu, Z Gan, J Wang, Z Yang, Z Liu, Y Lu… - arXiv preprint arXiv …, 2021 - arxiv.org
In recent years, we have witnessed a significant performance boost in the image captioning
task based on vision-language pre-training (VLP). Scale is believed to be an important factor …