We are living in a very exciting time for the emergence of contrastive language-image pre-training and the multi-modal extensions that are beginning to promise true zero-shot capability.
CLIP, SigLIP and multi-stage pipelines such as Grounding DINO offer some really exciting prospects, and we're already seeing CLIP's vision encoder married to cutting-edge LLMs in models like LLaVA and InternLM-XComposer.
The project was born out of two notions - excitement for the emergence of these foundation models, and the fact that I find myself continually working with single board computers (SBCs), which means keeping up with optimisation platforms like ONNX and TensorRT - platforms that will help us realise the world of true embodied AI we're all pushing for. So why not create an optimised vision foundation model Swiss Army knife!
Project overview:
Full onnxruntime support for CLIP and SigLIP, with Grounding DINO around the corner (see the sketch after this list).
Automatic switching of model-specific preprocessing and postprocessing (sketched below).
Stripped of reliance on the Hugging Face transformers and tokenizers libraries in order to keep the dependency set lightweight. A bare-metal implementation remains as a "Hugging Face lite" nod, if you will.
Consolidated, pre-packaged inference session classes, with a pip package on the way.
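
To make the onnxruntime point concrete, here is a minimal sketch of zero-shot classification with a CLIP model exported to ONNX. The file names, input names, logit scale and the shape of the tokenised prompts are illustrative assumptions, not this project's actual API.

```python
# Minimal sketch: zero-shot image classification with CLIP ONNX exports via onnxruntime.
# Model file names and input layouts are assumptions for illustration only.
import numpy as np
import onnxruntime as ort

# Common pattern: separate graphs for the image and text encoders.
visual = ort.InferenceSession("clip_visual.onnx", providers=["CPUExecutionProvider"])
textual = ort.InferenceSession("clip_textual.onnx", providers=["CPUExecutionProvider"])

def encode_image(pixels: np.ndarray) -> np.ndarray:
    # pixels: (1, 3, 224, 224) float32, already resized and normalised.
    (features,) = visual.run(None, {visual.get_inputs()[0].name: pixels})
    return features / np.linalg.norm(features, axis=-1, keepdims=True)

def encode_text(token_ids: np.ndarray) -> np.ndarray:
    # token_ids: (num_prompts, 77) int64, produced by whatever tokenizer the export expects.
    (features,) = textual.run(None, {textual.get_inputs()[0].name: token_ids})
    return features / np.linalg.norm(features, axis=-1, keepdims=True)

def classify(pixels: np.ndarray, token_ids: np.ndarray, logit_scale: float = 100.0) -> np.ndarray:
    # Scaled cosine similarities, then a softmax over the candidate prompts.
    logits = logit_scale * encode_image(pixels) @ encode_text(token_ids).T
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)
```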
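
And a rough idea of what automatic postprocessing switching can look like: CLIP-style models score a set of prompts jointly with a temperature-scaled softmax, whereas SigLIP scores each image-text pair independently with a sigmoid. The function and argument names here are again illustrative, not the package's real interface.

```python
# Sketch of model-aware postprocessing: softmax over prompts for CLIP,
# independent per-pair sigmoid scores for SigLIP.
import numpy as np

def postprocess(logits: np.ndarray, model_family: str) -> np.ndarray:
    if model_family == "clip":
        exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
        return exp / exp.sum(axis=-1, keepdims=True)   # probabilities over the prompt set
    if model_family == "siglip":
        return 1.0 / (1.0 + np.exp(-logits))           # each image-text pair scored on its own
    raise ValueError(f"unknown model family: {model_family}")
```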
Built with