Meta has unveiled an impressive generative artificial intelligence (AI) model called "CM3leon" (pronounced like "chameleon") that can perform both text-to-image and image-to-text generation. The new model is said to be a strong alternative to DALL-E 2, the text-to-image model from OpenAI, Meta's competitor in the AI domain.
In a blog post published on Friday, Meta explained that CM3leon stands out as the first multimodal model trained using a recipe adapted from text-only language models. This recipe encompasses two crucial stages: a large-scale retrieval-augmented pre-training stage and a second multitask supervised fine-tuning (SFT) stage. These stages enable CM3leon to achieve remarkable results in generating coherent imagery that aligns closely with the provided input prompts.
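To make that two-stage recipe concrete, here is a minimal, purely illustrative sketch in Python. Every name in it (the Example record, the toy word-overlap retriever, the train_step stub) is a hypothetical stand-in for exposition; Meta's blog post does not include code, and the real system works on tokenized text and images at far larger scale.

```python
from dataclasses import dataclass

@dataclass
class Example:
    text: str                # caption, question, or instruction
    image_tokens: list[int]  # image encoded as a sequence of discrete tokens

def retrieve(query: Example, index: list[Example], k: int = 2) -> list[Example]:
    """Toy retriever: rank the other index entries by word overlap with the query."""
    def overlap(a: str, b: str) -> int:
        return len(set(a.lower().split()) & set(b.lower().split()))
    candidates = [e for e in index if e is not query]
    return sorted(candidates, key=lambda e: -overlap(e.text, query.text))[:k]

def train_step(sequence: list) -> None:
    """Stand-in for one gradient update on an interleaved token sequence."""
    print(f"training on a sequence of {len(sequence)} tokens")

# Stage 1: retrieval-augmented pretraining. Each example is prefixed with
# retrieved neighbours, so the model sees related text/image pairs in context.
index = [
    Example("a photo of a red fox", [7, 3, 9]),
    Example("a sketch of a cat", [1, 4]),
    Example("a fox running through snow", [2, 8, 5]),
]
for example in index:
    context = retrieve(example, index, k=1) + [example]
    sequence = [t for ex in context for t in ex.text.split() + ex.image_tokens]
    train_step(sequence)

# Stage 2: multitask supervised fine-tuning (SFT) on instruction-style
# (prompt, target) pairs spanning generation, captioning, editing, and more.
sft_pairs = [
    ("Generate an image of a fox", [7, 3, 9]),             # text-to-image
    ("Describe this image", "a red fox in snow".split()),  # image-to-text
]
for prompt, target in sft_pairs:
    train_step(prompt.split() + list(target))
```

The point of the sketch is the shape of the pipeline: in stage 1 the model trains on raw examples augmented with retrieved context, and in stage 2 the same model is fine-tuned on many instruction-formatted tasks at once.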
One notable advantage of CM3leon, as highlighted by Meta, is its efficiency in terms of computing power and training data requirements. Compared with previous transformer-based methods, CM3leon requires five times less training compute and a smaller training dataset.
When evaluated on the widely adopted zero-shot MS-COCO image generation benchmark, CM3leon achieved an FID (Fréchet Inception Distance) score of 4.88. This result establishes CM3leon as the new state-of-the-art model for text-to-image generation, surpassing Google's text-to-image model Parti.
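For background (this is the standard definition of the metric, not something from Meta's post): FID measures how close the statistics of generated images are to those of real images, computed over Inception-v3 features, with lower scores indicating better fidelity:

\[
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)
\]

where \((\mu_r, \Sigma_r)\) and \((\mu_g, \Sigma_g)\) are the feature means and covariances of real and generated images. A score of 4.88 thus means CM3leon's outputs are statistically closer to MS-COCO's real images than those of prior models.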
Furthermore, Meta emphasized that CM3leon excels at a wide range of vision-language tasks, including visual question answering and long-form captioning. Despite being trained on a dataset of only three billion text tokens, CM3leon's zero-shot performance compares favorably with that of larger models trained on more data.
Meta framed this broad performance as progress toward its larger goals, stating, "With the goal of creating high-quality generative models, we believe CM3leon's strong performance across a variety of tasks is a step toward higher-fidelity image generation and understanding."
In addition, Meta expressed enthusiasm about the potential of models like CM3leon to enhance creativity and improve applications in the metaverse. They look forward to exploring the boundaries of multimodal language models and releasing more models in the future, anticipating the positive impact these advancements will have.