Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese

Qwen Blog

Qwen releases Chinese CLIP, an open-source CLIP variant optimized for Chinese-language vision-language tasks, including cross-modal retrieval.

Categories: Model Releases

Excerpt

CLIP [1] is a pivotal model in vision and multimodal representation learning. It serves not only as a foundation model but also as a bridge between vision and language, and it has spurred research across several fields, especially text-to-image generation. However, we find that applications, especially cross-modal retrieval, need a language-specific CLIP, and there is no open-source Chinese CLIP with good performance. We therefore launched this project to promote Chinese multimodal representation learning.