Noise-aware learning from web-crawled image-text data for image captioning

W Kang, J Mun, S Lee, B Roh - Proceedings of the IEEE/CVF International Conference on …, 2023 - openaccess.thecvf.com
Abstract
Image captioning is one of the most straightforward tasks that can take advantage of large-scale web-crawled data, which provides rich knowledge about the visual world for a captioning model. However, because web-crawled data contains image-text pairs that are aligned at different levels, the inherent noise (e.g., misaligned pairs) makes it difficult to learn a precise captioning model. A filtering strategy can effectively remove noisy data, but it reduces the learnable knowledge and sometimes introduces a new problem of data deficiency. To get the best of both worlds, we propose a Noise-aware Captioning (NoC) framework, which learns rich knowledge from the whole web-crawled data while being less affected by the noise. This is achieved by the proposed alignment-level-controllable captioner, which is trained using the alignment levels of the image-text pairs as a control signal. This alignment-level-conditioned training allows the model to generate high-quality captions by simply setting the control signal to the desired alignment level at inference time. An in-depth analysis shows the effectiveness of our framework in handling noise. On two tasks, zero-shot captioning and text-to-image retrieval using generated captions (i.e., self-retrieval), we also demonstrate that our model can produce high-quality captions in terms of descriptiveness and distinctiveness. The code is available at https://github.com/kakaobrain/noc.
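
The idea of using the alignment level as a control signal can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how alignment is scored or how the signal is fed to the captioner, so the `alignment_score` field, the equal-frequency bucketing, the `<align_k>` prefix token, and all function names below are assumptions made for illustration only.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Pair:
    caption: str
    # Hypothetical image-text alignment score (higher = better aligned),
    # e.g. a similarity from some pretrained image-text model (assumption).
    alignment_score: float


def make_boundaries(scores, num_levels):
    """Equal-frequency bucket boundaries over the observed scores (assumed scheme)."""
    ordered = sorted(scores)
    n = len(ordered)
    return [ordered[(i * n) // num_levels] for i in range(1, num_levels)]


def alignment_level(score, boundaries):
    """Map a score to a discrete level in 1..len(boundaries) + 1."""
    return bisect_right(boundaries, score) + 1


def training_target(pair, boundaries):
    """Keep every pair, noisy or not, but tag it with its alignment level."""
    level = alignment_level(pair.alignment_score, boundaries)
    return f"<align_{level}> {pair.caption}"


def inference_prefix(num_levels):
    """At inference time, request the best-aligned level as the control signal."""
    return f"<align_{num_levels}>"


pairs = [
    Pair("img_0042.jpg", 0.1),
    Pair("click here to buy", 0.2),
    Pair("our trip last summer", 0.3),
    Pair("a beach at sunset", 0.4),
    Pair("a dog running on a beach", 0.5),
    Pair("a brown dog running on a sandy beach at sunset", 0.6),
]
boundaries = make_boundaries([p.alignment_score for p in pairs], num_levels=3)
targets = [training_target(p, boundaries) for p in pairs]
```

In this sketch no pair is filtered out; poorly aligned captions are still seen during training but are associated with a low level, while generation is conditioned on the highest one.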