This project features the implementation of a custom CLIP (Contrastive Language-Image Pretraining) model, leveraging a Vision Transformer (ViT) backbone and trained on the COCO dataset. Inspired by the original CLIP paper and Stable Diffusion's architecture, we developed a custom training pipeline that optimizes performance despite hardware limitations.

Key Highlights:

- Final loss: 0.72 after 85 epochs with a batch size of 1024.
- Resource efficiency: trained on an NVIDIA Tesla P100 GPU (16 GiB), with 150 GPU hours of training time.
- Custom training pipeline with efficient data loading and evaluation.
- Zero-shot capabilities for tasks such as image retrieval and classification (see the sketch below).

This project highlights expertise in transformer-based architectures, contrastive learning, and bridging vision-language modalities. Scripts for evaluating the pre-trained models on custom inputs are available.
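For orientation, the sketch below illustrates the general shape of the two pieces this project relies on: the symmetric contrastive (InfoNCE-style) objective used to align image and text embeddings, and zero-shot classification from the resulting features. It is a minimal, self-contained illustration, not the repository's actual training code; the function names, the fixed temperature of 0.07, and the feature shapes are assumptions for the example.

```python
# Minimal sketch (not the project's exact code) of a CLIP-style contrastive
# loss and a zero-shot classification helper. Names and the fixed temperature
# are illustrative assumptions.
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_features: torch.Tensor,
                          text_features: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric cross-entropy over the image-text similarity matrix.

    image_features, text_features: (batch, dim) embeddings from the two towers.
    """
    # L2-normalize so the dot product is a cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (batch, batch) similarity logits scaled by the temperature.
    logits = image_features @ text_features.t() / temperature

    # Matching image-text pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2


@torch.no_grad()
def zero_shot_classify(image_features: torch.Tensor,
                       class_text_features: torch.Tensor) -> torch.Tensor:
    """Return a probability distribution over classes for each image.

    class_text_features: (num_classes, dim) embeddings of prompts such as
    "a photo of a {label}".
    """
    image_features = F.normalize(image_features, dim=-1)
    class_text_features = F.normalize(class_text_features, dim=-1)
    return (image_features @ class_text_features.t()).softmax(dim=-1)
```

In the original CLIP paper the temperature is a learned parameter (stored as a log-scale scalar) rather than a fixed constant; it is fixed here only to keep the example short.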