Baidu offers ERNIE-VIL 2.0, a multi-view contrastive learning framework that aims to learn a more robust cross-modal representation by simultaneously building intra-modal and cross-modal correlations between distinct views.

Vision-language pre-training (VLP) models have made significant progress on several cross-modal tasks, such as visual question answering (VQA) and cross-modal retrieval, over the past two years. Most previous efforts based on cross-modal transformer encoders focus on designing proxy pre-training tasks (e.g., masked language modeling (MLM) and masked region modeling (MRM)) to learn …
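To make the multi-view contrastive idea concrete, here is a minimal sketch, not Baidu's implementation, of how cross-modal (image-text) and intra-modal (view-view) InfoNCE terms can be combined over a batch of paired embeddings. All function names, the temperature value, and the equal weighting of the terms are illustrative assumptions, not details from ERNIE-VIL 2.0.

```python
# Illustrative sketch of a multi-view contrastive objective (assumed names).
import torch
import torch.nn.functional as F


def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two batches of embeddings.

    Matching rows of `a` and `b` are positives; all other rows in the
    batch serve as negatives.
    """
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                  # (N, N) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # Average the two retrieval directions (a -> b and b -> a).
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def multi_view_contrastive_loss(img_v1, img_v2, txt_v1, txt_v2):
    """Hypothetical combined objective: cross-modal terms align images with
    texts, while intra-modal terms align two views of the same modality
    (e.g., two augmentations of an image, or a caption and related text)."""
    cross = info_nce(img_v1, txt_v1) + info_nce(img_v2, txt_v2)
    intra = info_nce(img_v1, img_v2) + info_nce(txt_v1, txt_v2)
    return cross + intra


if __name__ == "__main__":
    n, d = 8, 256  # batch size and embedding dim, chosen arbitrarily
    loss = multi_view_contrastive_loss(
        torch.randn(n, d), torch.randn(n, d),
        torch.randn(n, d), torch.randn(n, d))
    print(f"combined loss: {loss.item():.4f}")
```

The intra-modal terms are what distinguish this setup from a plain dual-encoder objective: each modality is also pulled toward a second view of itself, which is the source of the more robust representations the framework claims.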
