This project focuses on improving the training of CLIP (Contrastive Language-Image Pretraining) models by optimizing the global contrastive loss for bimodal contrastive self-supervised learning. Self-supervised learning (SSL) has gained prominence for its ability to generalize across downstream tasks in areas such as natural language processing and computer vision. Among SSL frameworks, contrastive learning (CL) has proven effective by maximizing the similarity between positive pairs and minimizing it between negative pairs. While CLIP has demonstrated success in aligning image and text representations, challenges persist, such as slow convergence on large-scale bimodal datasets.

Participants are tasked with accelerating the optimization of the global contrastive loss and improving model performance on the provided benchmarks. CLIP models are trained on a 100k subset of the Conceptual Captions 3M dataset and validated on the MSCOCO and ImageNet datasets, with performance evaluated on image-text retrieval accuracy and zero-shot classification metrics. Models are restricted to ResNet-50 and DistilBERT as encoders, with fixed hyperparameters, and participants must compare at least two optimizers and three loss functions.

Deliverables include the model code, trained models, and a detailed report covering the experimental results, all adhering to the specified guidelines. Evaluation criteria include experimental breadth, report quality, presentation, and innovative ideas.
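For reference, the sketch below shows a minimal mini-batch CLIP-style (bidirectional InfoNCE) contrastive loss in PyTorch, the baseline that the global contrastive loss objective refines by accounting for negatives beyond the current batch. This is an illustrative sketch only: the function name, temperature value, and embedding dimension are assumptions for the example and are not taken from the challenge code, which pairs a ResNet-50 image encoder with a DistilBERT text encoder to produce the embeddings.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    """Symmetric (bidirectional) InfoNCE loss over a batch of image-text pairs.

    image_embeds, text_embeds: (batch_size, dim) tensors; the i-th row of each
    is assumed to form a positive pair, and all other combinations serve as
    in-batch negatives.
    """
    # L2-normalize so dot products become cosine similarities.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Pairwise image-text similarity matrix, scaled by the temperature.
    logits = image_embeds @ text_embeds.t() / temperature

    # Positive pairs lie on the diagonal of the similarity matrix.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2


if __name__ == "__main__":
    # Toy usage with random embeddings standing in for encoder outputs.
    imgs = torch.randn(8, 256)
    txts = torch.randn(8, 256)
    print(clip_contrastive_loss(imgs, txts).item())
```

Because this formulation only contrasts against negatives inside the mini-batch, its quality depends heavily on batch size; global contrastive objectives instead estimate the loss over all pairs in the dataset, which is the optimization problem this project targets.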
Built with