Vision-language models such as CLIP are widely used for multimodal retrieval tasks. However, they can learn historical biases from their training data and thereby perpetuate …
Current foundation models have shown impressive performance across a variety of tasks. However, several studies have revealed that these models are not equally effective for everyone …