Advances in large language models have significantly accelerated the development of natural language processing (NLP). The introduction of the transformer architecture proved to be a milestone, enabling a new wave of language models, including OPT and BERT, that exhibit deep linguistic understanding. The arrival of GPT, or Generative Pre-trained Transformer, models introduced a new paradigm built on autoregressive modeling and established a robust method for language prediction and generation. The advent of models like GPT-4, ChatGPT, Mixtral, and LLaMA has further fueled this rapid evolution, with each model demonstrating enhanced performance on complex language processing tasks. Among existing techniques, instruction tuning has emerged as a key method for refining the output of pre-trained large language models, and the integration of these models with tools for visual tasks has highlighted their adaptability and opened doors for applications that extend far beyond traditional text-based processing to include multimodal interactions.
Furthermore, the convergence of natural language processing and computer vision has given rise to Vision Language Models (VLMs), which combine linguistic and vision models to achieve cross-modal comprehension and reasoning. The integration of visual and linguistic models has played a crucial role in advancing tasks that require both language processing and visual understanding. Revolutionary models like CLIP have further bridged the gap between vision tasks and language models, demonstrating the feasibility and practicality of cross-modal applications. More recent frameworks like LLaVA and BLIP leverage tailored instruction data to devise efficient strategies that demonstrate the potent capabilities of these models. Additionally, combining large language models with image outputs has become a focus of recent multimodal research, with some methods bypassing direct generation by using image retrieval to produce image outputs interleaved with text.
That said, and despite the rapid advancements in vision language models facilitating basic reasoning and visual dialogue, a significant performance gap still exists between advanced models like GPT-4 and typical vision language models. Mini-Gemini is an attempt to narrow that gap by mining the potential of VLMs for better performance from three aspects: VLM-guided generation, high-quality data, and high-resolution visual tokens. To enhance visual tokens, the Mini-Gemini framework utilizes an additional visual encoder for high-resolution refinement without increasing the count of visual tokens. The framework further constructs a high-quality dataset to promote precise image comprehension and reasoning-based generation. Overall, Mini-Gemini attempts to mine the potential of vision language models and aims to empower existing frameworks with image reasoning, understanding, and generative capabilities simultaneously. This article covers the Mini-Gemini framework in depth: we explore its mechanism, methodology, and architecture, along with its comparison against state-of-the-art frameworks. So let's get started.
Over the years, large language models have evolved to boast remarkable multi-modal capabilities, and they are becoming an essential part of current vision language models. However, a gap remains between the multi-modal performance of large language models and that of vision language models, with recent research looking for ways to combine vision with large language models using images and videos. For vision tasks themselves, image resolution is a crucial element for explicitly depicting the surrounding environment with minimal visual hallucinations. To bridge the gap, researchers are developing models to improve the visual understanding of current vision language models, and the two most common approaches are increasing the resolution and increasing the number of visual tokens. Although increasing the number of visual tokens alongside higher-resolution images does enhance visual understanding, the boost is often accompanied by increased computational requirements and associated costs, especially when processing multiple images. Furthermore, the capabilities of existing models, the quality of existing data, and their applicability remain inadequate for an accelerated development process, leaving researchers with the question: how can the development of vision language models be accelerated at acceptable cost?
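To see why simply scaling resolution is costly, consider a quick back-of-the-envelope sketch of how visual token counts grow in a ViT-style encoder. The numbers below assume a patch size of 14, as used by common CLIP ViT-L/14 encoders; they are purely illustrative and not taken from the Mini-Gemini paper.

```python
# Illustrative only: visual token count for a ViT-style encoder that splits a
# square image into patch_size x patch_size patches (patch size 14 assumed).
def vit_token_count(image_size: int, patch_size: int = 14) -> int:
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side

for size in (224, 336, 672):
    print(f"{size}px -> {vit_token_count(size)} visual tokens")
# 224px -> 256 visual tokens
# 336px -> 576 visual tokens
# 672px -> 2304 visual tokens
```

Because these tokens enter the language model's input sequence, doubling the resolution roughly quadruples the visual portion of the sequence and the attention cost over it, which is exactly the overhead Mini-Gemini tries to avoid.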
The Mini-Gemini framework is an attempt to answer this question by exploring the potential of vision language models from three aspects: VLM-guided generation (expanded applications), high-quality data, and high-resolution visual tokens. First, the framework implements a ConvNet architecture to generate higher-resolution candidates efficiently, enhancing visual detail while maintaining the visual token count for the large language model. Second, it amalgamates publicly available high-quality datasets to enhance the quality of the data. Third, it integrates these enhancements with state-of-the-art generative and large language models to improve the performance of VLMs and the user experience. This multifaceted strategy enables Mini-Gemini to explore the hidden capabilities of vision language models and achieve significant advancements under evident resource constraints.
In general, the Mini-Gemini framework employs an any-to-any paradigm, since it is capable of handling both text and images as input and output. In particular, the framework introduces an efficient pipeline for enhancing the visual tokens of input images, featuring a dual-encoder system: the first encoder handles high-resolution images, while the second produces low-resolution visual embeddings. During inference, the two encoders cooperate through an attention mechanism in which the low-resolution encoder generates visual queries while the high-resolution encoder provides keys and values for reference. To augment data quality, the framework collects and produces more data based on public resources, including task-oriented instructions, generation-related data, and high-resolution responses, with the increased amount and enhanced quality improving the overall performance and capabilities of the model. Furthermore, Mini-Gemini supports concurrent text and image generation as a result of integrating the vision language model with advanced generative models.
Mini-Gemini : Methodology and Architecture
At its core, the Mini-Gemini framework is conceptually simple, and comprises three components.
- The framework employs dual vision encoders to provide low-resolution visual embeddings and high-resolution candidates.
- The framework proposes to implement patch info mining to conduct mining at patch level between low-resolution visual queries, and high-resolution regions.
- The Mini-Gemini framework utilizes a large language model to marry text with images for both generation and comprehension simultaneously.
Dual-Vision Encoders
The Mini-Gemini framework can process both text and image inputs, with the option to handle them either individually or in combination. As demonstrated in the following image, the Mini-Gemini framework starts the process by employing bilinear interpolation to generate a low-resolution image from its corresponding high-resolution image.
The framework then processes these images and encodes them into multi-grid visual embeddings via two parallel image flows. More specifically, Mini-Gemini maintains the traditional pipeline for the low-resolution flow and employs a CLIP-pretrained Vision Transformer to encode the visual embeddings, enabling the model to preserve long-range relations between visual patches for subsequent interactions in the large language model. For the high-resolution flow, the framework adopts a CNN (Convolutional Neural Network) based encoder for adaptive and efficient high-resolution image processing.
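The sketch below illustrates this dual-flow step in PyTorch. The module names clip_vit and conv_encoder are placeholders for a CLIP-pretrained Vision Transformer and a ConvNeXt-style CNN encoder, and the 336-pixel low-resolution size is an assumption for illustration rather than a fixed setting of the framework.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the dual-flow encoding step; `clip_vit` and `conv_encoder`
# stand in for a CLIP-pretrained ViT and a ConvNeXt-style CNN encoder and are
# assumed to be loaded elsewhere (names are illustrative, not the paper's API).
def encode_dual(image_hr: torch.Tensor, clip_vit, conv_encoder, lr_size: int = 336):
    # Bilinear interpolation produces the low-resolution counterpart of the input.
    image_lr = F.interpolate(image_hr, size=(lr_size, lr_size),
                             mode="bilinear", align_corners=False)

    # Low-resolution flow: ViT patch embeddings, shape (B, N, C) with N visual tokens.
    lr_embeddings = clip_vit(image_lr)

    # High-resolution flow: CNN feature map, shape (B, C, H', W'), kept at finer granularity.
    hr_features = conv_encoder(image_hr)
    return lr_embeddings, hr_features
```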
Patch Info Mining
With the dual vision encoders generating the LR embeddings and HR features, the Mini-Gemini framework implements patch info mining with the aim of extending the potential of vision language models through enhanced visual tokens. To maintain the number of visual tokens for efficiency in the large language model, the framework takes the low-resolution visual embeddings as queries and retrieves relevant visual cues from the HR feature candidates, with the HR feature map serving as keys and values.
As demonstrated in the above image, the formula encapsulates the process of refining and synthesizing visual cues, which leads to the generation of advanced visual tokens for subsequent large language model processing. The process ensures that the framework confines the mining for each query to its corresponding sub-region in the HR feature map, which contains only a small number of pixel-wise features, resulting in enhanced efficiency. Owing to this design, the Mini-Gemini framework is able to extract HR feature details without increasing the count of visual tokens, maintaining a balance between computational feasibility and richness of detail.
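A minimal sketch of this patch-level attention is shown below, assuming each low-resolution query corresponds to an M x M block of pixel-wise features in the high-resolution map. The projection layers, the scaling factor, and the tensor shapes are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of patch info mining (shapes and projection layers are
# illustrative assumptions, not the exact implementation).
def patch_info_mining(q_lr, kv_hr, proj_q, proj_k, proj_v, mlp):
    """
    q_lr : (B, N, C)        low-resolution visual embeddings, one query per token
    kv_hr: (B, N, M*M, C)   high-resolution features, M*M pixel-wise features
                            from the sub-region that corresponds to each query
    """
    q = proj_q(q_lr).unsqueeze(2)              # (B, N, 1, C)
    k = proj_k(kv_hr)                          # (B, N, M*M, C)
    v = proj_v(kv_hr)                          # (B, N, M*M, C)

    # Each query only attends to its own sub-region, so the token count stays N.
    scores = torch.matmul(q, k.transpose(-1, -2))         # (B, N, 1, M*M)
    attn = F.softmax(scores / k.shape[-1] ** 0.5, dim=-1)
    mined = torch.matmul(attn, v).squeeze(2)              # (B, N, C)

    # Residual connection keeps the original low-resolution cue.
    return mlp(q_lr + mined)                              # enhanced visual tokens, (B, N, C)
```

Because each query attends only to its own sub-region, the output still contains exactly N visual tokens, which is what keeps the language model's sequence length unchanged.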
Text and Image Generation
The Mini-Gemini framework concatenates the visual tokens and input text tokens as the input to the large language model for auto-regressive generation. Unlike traditional vision language models, Mini-Gemini supports text-only as well as text-image generation as both input and output, i.e. any-to-any inference, and it is as a result of these outstanding image-text understanding and reasoning capabilities that Mini-Gemini is able to generate high-quality images. Unlike recent works that focus on the domain gap between the text embeddings of the generation models and the large language models, Mini-Gemini attempts to optimize the gap in the domain of language prompts by translating user instructions into high-quality prompts that produce contextually relevant images in latent diffusion models. Furthermore, for better instruction finetuning and cross-modality alignment, the framework collects samples from publicly available high-quality datasets and uses GPT-4 Turbo to further construct a 13K instruction-following dataset to support image generation.
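The routing logic can be pictured as in the sketch below. Everything here is a simplified assumption: llm, tokenizer, and diffusion_model stand in for the language model, its tokenizer, and a latent diffusion model, and the <GEN> marker is a hypothetical convention for a generation prompt rather than the framework's actual special token.

```python
# Minimal sketch of the any-to-any inference flow. `llm`, `tokenizer`, and
# `diffusion_model` are placeholders; the <GEN> tag used to mark generation
# prompts is a hypothetical convention, not the framework's actual token.
def any_to_any_inference(visual_tokens, text, llm, tokenizer, diffusion_model):
    # Visual tokens and text tokens are concatenated into a single sequence
    # for auto-regressive generation.
    text_tokens = tokenizer(text)
    response = llm.generate(visual_tokens + text_tokens)

    # If the model emits a generation prompt, hand it to the diffusion model;
    # otherwise the textual response is returned as-is.
    if "<GEN>" in response:
        text_part, image_prompt = response.split("<GEN>", maxsplit=1)
        image = diffusion_model(prompt=image_prompt.strip())
        return text_part.strip(), image
    return response, None
```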
Mini-Gemini : Experiments and Results
To evaluate its performance, the Mini-Gemini framework is instantiated with the pre-trained ConvNeXt-L backbone as the HR vision encoder and with a CLIP-pretrained Vision Transformer as the LR vision encoder. To ensure training efficiency, the framework keeps the two vision encoders fixed, optimizes the projectors of patch info mining in all stages, and optimizes the large language model only during the instruction tuning stage.
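A minimal sketch of this training recipe is shown below; the attribute names (hr_encoder, lr_encoder, patch_info_mining_projector, llm) are illustrative placeholders rather than the actual module names.

```python
# Sketch of the training recipe described above (module names are illustrative):
# both vision encoders stay frozen, the patch info mining projectors are trained
# in every stage, and the LLM is only unfrozen for instruction tuning.
def configure_trainable(model, stage: str):
    for p in model.hr_encoder.parameters():
        p.requires_grad = False
    for p in model.lr_encoder.parameters():
        p.requires_grad = False

    for p in model.patch_info_mining_projector.parameters():
        p.requires_grad = True          # optimized in all stages

    llm_trainable = (stage == "instruction_tuning")
    for p in model.llm.parameters():
        p.requires_grad = llm_trainable
```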
The following table compares the performance of the Mini-Gemini framework against state-of-the-art models across different settings, also taking private models into consideration. As can be observed, Mini-Gemini consistently outperforms existing frameworks across a wide range of LLMs at normal resolution, and demonstrates superior performance when configured with Gemma-2B in the category of efficient models. Furthermore, when larger LLMs are employed, the scalability of the Mini-Gemini framework is evident.
To evaluate its performance with high resolution and extended visual tokens, the experiments are performed with an input size of 672 for the LR vision encoder and 1536 for the HR vision encoder. As mentioned earlier, the main purpose of the HR vision encoder is to offer high-resolution candidate information. As can be observed, the Mini-Gemini framework delivers superior performance when compared against state-of-the-art frameworks.
Furthermore, to assess the visual comprehension prowess of the Mini-Gemini framework in real-world settings, the developers apply the model to a variety of reasoning and understanding tasks, as demonstrated in the following image. As can be observed, the Mini-Gemini framework is able to solve a wide array of complex tasks thanks to the implementation of patch info mining and high-quality data. What is more impressive is that the framework demonstrates a keen attention to detail that extends beyond mere recognition prowess, describing intricate elements with precision.
The following figure provides a comprehensive evaluation of the generative abilities of the Mini-Gemini framework.
When compared against recent models like ChatIllusion and AnyGPT, the Mini-Gemini framework demonstrates stronger multi-modal understanding abilities, allowing it to generate text-to-image captions that align better with the input instructions and to deliver image-to-text answers with stronger conceptual similarity. What is more impressive is that the Mini-Gemini framework demonstrates remarkable proficiency in generating high-quality content from multi-modal human instructions using only text training data, a capability that illustrates Mini-Gemini's robust semantic interpretation and image-text alignment skills.
Final Thoughts
In this article we have talked about Mini-Gemini, a potent and streamlined framework for multi-modality vision language models. The primary aim of the Mini-Gemini framework is to harness the latent capabilities of vision language models using high quality data, strategic design of the framework, and an expanded functional scope. Mini-Gemini is an attempt to narrow the gap that exists between vision language models and more advanced models by mining the potential of VLMs for better performance from three aspects: VLM-guided generation, high-quality data, and high-resolution visual tokens. To enhance visual tokens, the Mini-Gemini framework proposes to utilize an additional visual encoder for high-resolution refinement without increasing the count of visual tokens. The Mini-Gemini framework further constructs a high-quality dataset in an attempt to promote precise comprehension of images and reasoning-based generation. Overall, the Mini-Gemini framework attempts to mine the potential of vision language models, and aims to empower existing frameworks with image reasoning, understanding, and generative capabilities simultaneously.