Developed a prompt-testing framework for LLMs that checks prompt quality by defining multiple system prompts (e.g., chain-of-thought, few-shot) and comparing them on the quality of the model-generated answers. Answer quality can be measured with NLP metrics such as ROUGE, BLEU, or BERTScore, and with Responsible AI metrics such as Faithfulness, Answer Relevancy, and Harmfulness. A minimal sketch of the comparison loop follows.
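
The sketch below assumes Hugging Face's evaluate library for ROUGE and an OpenAI-compatible chat client; the model name, system prompts, and evaluation examples are placeholders, and BERTScore or Responsible AI metrics could be swapped in the same way.

```python
# Minimal sketch: compare system prompts by scoring model answers with ROUGE.
# Assumptions: the `evaluate` and `openai` packages are installed, an
# OPENAI_API_KEY is set, and MODEL_NAME / prompts / eval_set are placeholders.

import evaluate
from openai import OpenAI

MODEL_NAME = "gpt-4o-mini"  # assumption: any chat-completion model works here
client = OpenAI()           # reads OPENAI_API_KEY from the environment

# Candidate system prompts to compare (few-shot vs. chain-of-thought).
system_prompts = {
    "few_shot": "You are a helpful assistant. Example: Q: 2+2? A: 4.",
    "chain_of_thought": "You are a helpful assistant. Think step by step before answering.",
}

# Evaluation set: user questions paired with reference answers.
eval_set = [
    {"question": "What is the capital of France?",
     "reference": "The capital of France is Paris."},
]

rouge = evaluate.load("rouge")  # NLP metric; load "bertscore" the same way


def generate(system_prompt: str, question: str) -> str:
    """Ask the model one question under a given system prompt."""
    response = client.chat.completions.create(
        model=MODEL_NAME,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content


# Score every system prompt over the whole evaluation set and report ROUGE-L.
for name, system_prompt in system_prompts.items():
    predictions = [generate(system_prompt, ex["question"]) for ex in eval_set]
    references = [ex["reference"] for ex in eval_set]
    scores = rouge.compute(predictions=predictions, references=references)
    print(f"{name}: ROUGE-L = {scores['rougeL']:.3f}")
```

The same loop extends to Responsible AI metrics by replacing the ROUGE computation with a faithfulness or answer-relevancy scorer and keeping the per-prompt comparison structure unchanged.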