This project evaluates the robustness of the CLIPcap image captioning model against adversarial attacks generated with an encoder-decoder captioning model, using the MSCOCO 2014 dataset. The encoder is based on ResNet-101. The decoder is an LSTM with an attention mechanism, and captions are decoded with beam search.
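To illustrate the beam search step mentioned above, here is a minimal, framework-free sketch. It is not this project's implementation: `beam_search`, `toy_step`, and `TOY_MODEL` are hypothetical names, and the hard-coded probability table stands in for the LSTM decoder's next-word distribution.

```python
import math

def beam_search(step_fn, start_token, end_token, beam_size=3, max_len=10):
    """Return the highest-scoring (sequence, log_prob) pair found.

    step_fn(sequence) must return a dict mapping each candidate next
    token to its probability (an assumption of this sketch).
    """
    beams = [([start_token], 0.0)]  # (sequence, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for token, prob in step_fn(seq).items():
                if prob > 0.0:
                    candidates.append((seq + [token], score + math.log(prob)))
        if not candidates:
            break
        # Keep only the beam_size best partial captions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            if seq[-1] == end_token:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    # Fall back to unfinished hypotheses if nothing reached end_token.
    pool = finished or beams
    return max(pool, key=lambda c: c[1])

# Hypothetical toy "decoder": next-word probabilities given the last word.
TOY_MODEL = {
    "<start>": {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.5, "cat": 0.5},
    "the": {"dog": 0.9, "cat": 0.1},
    "dog": {"<end>": 1.0},
    "cat": {"<end>": 1.0},
}

def toy_step(seq):
    return TOY_MODEL[seq[-1]]

best_seq, best_score = beam_search(toy_step, "<start>", "<end>", beam_size=2)
greedy_seq, greedy_score = beam_search(toy_step, "<start>", "<end>", beam_size=1)
```

In this toy setting, a beam of size 2 finds "the dog" (probability 0.36), while greedy decoding (beam size 1) commits to "a" first and ends with a caption of probability 0.30. This is the usual motivation for beam search over greedy decoding.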