Introduction: A Comprehensive Evaluation of AI Image Generation
The HKU Business School has released an evaluation report on the image-generation capabilities of leading artificial intelligence (AI) models. The report offers a systematic assessment of AI image generation, a field that is evolving rapidly but is still at an early stage. The study analyzes 15 text-to-image models and 7 multimodal large language models (LLMs) and identifies their strengths and weaknesses across several dimensions. The evaluation framework, designed by HKU Business School researchers, centers on two tasks: new-image generation and image revision. The findings show that performance varies widely: some models excel in content quality, while others score higher on safety and responsibility.
Methodology: A Two-Pronged Approach to Image Assessment
The research team at HKU Business School employed a multifaceted methodology to ensure a holistic and objective evaluation of the AI models’ image generation capabilities. The assessment focused on two core tasks:
- New-Image Generation: This task evaluated the models’ ability to create entirely new images based solely on textual prompts provided by the researchers. This tested the models’ creative capacity and their understanding of natural language descriptions. 
- Image Revision: This task assessed the models’ capacity to modify existing images according to specific instructions. This tested the models’ ability to understand and implement changes related to style, content, and composition. 
For each of these tasks, the evaluation considered multiple dimensions, ensuring a comprehensive assessment of the models’ capabilities.
New-Image Generation: Quality and Responsibility
The evaluation of new-image generation focused on two crucial aspects: Image Content Quality and Safety & Responsibility.
Image Content Quality: Fidelity, Integrity, and Aesthetics
This dimension delved into the visual fidelity and aesthetic appeal of the images generated by the AI models. Three key criteria were used to assess content quality:
- Alignment with Prompts: This criterion measured the accuracy with which the generated image reflected the objects, scenes, and concepts described in the textual prompt. A higher score indicated a closer match between the image and the prompt’s intent. This is a fundamental aspect of text-to-image generation, as the model must accurately interpret and visualize the user’s request. 
- Image Integrity: This criterion focused on the factual accuracy and reliability of the generated image. It checked whether the image adhered to real-world principles and avoided nonsensical or physically impossible scenarios. For example, a prompt requesting “a cat wearing a hat” should not result in an image of a cat with two heads or a hat floating in mid-air.
- Image Aesthetics: This criterion evaluated the artistic quality of the generated image, considering factors such as composition, color harmony, clarity, and overall creativity. Images that exhibited strong visual appeal and artistic merit received higher scores. While subjective, this criterion acknowledges the importance of aesthetics in image generation, particularly for applications in design, art, and marketing. 
To reduce bias, experts conducted pairwise comparisons between models, and the final rankings were determined using the Elo rating system, a well-established method for ranking competitors from head-to-head results. Under Elo, a model gains rating points when its image is preferred over another model’s and loses points when it is not, with larger shifts for unexpected outcomes. The resulting ratings place every model on a single scale of relative performance in image content quality.
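To make the ranking step concrete, here is a minimal sketch of how Elo ratings could be derived from pairwise expert judgments. It uses the standard Elo update; the starting rating of 1000, the K-factor of 32, and the model names are illustrative assumptions, not parameters taken from the report.

```python
from collections import defaultdict


def expected_score(rating_a, rating_b):
    """Standard Elo expectation that A is preferred over B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings, winner, loser, k=32.0):
    """Shift points from loser to winner; upsets move more points."""
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta


def rank_models(comparisons, initial=1000.0, k=32.0):
    """comparisons: iterable of (preferred_model, other_model) pairs.

    The initial rating and K-factor are assumed values for illustration.
    """
    ratings = defaultdict(lambda: initial)
    for winner, loser in comparisons:
        update_elo(ratings, winner, loser, k)
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


# Hypothetical judgments: each pair means the first model's image was preferred.
judgments = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
]
for name, rating in rank_models(judgments):
    print(f"{name}: {rating:.1f}")
```

Because each update moves the same number of points from one model to the other, the total across all models stays constant, so a rating only has meaning relative to the other models in the same pool.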
Safety and Responsibility: Addressing Ethical Concerns
Beyond the visual aspects, the evaluation also prioritized the ethical and societal implications of AI-generated images. This dimension assessed the models’ compliance with safety regulations and their awareness of social responsibility. The test prompts were carefully crafted to cover a range of sensitive categories, including:
- Bias and Discrimination: This category evaluated whether the model generated images that perpetuated harmful stereotypes or exhibited bias based on race, gender, religion, or other protected characteristics. For example, prompts related to professions or social roles should not consistently depict certain demographics in stereotypical ways. 
- Crimes and Illegal Activities: This category assessed whether the model could be prompted to generate images depicting illegal acts, violence, or other harmful content. The models should be designed to refuse such requests and avoid generating content that could incite or glorify illegal activities. 
- Dangerous Topics: This category examined the model’s response to prompts related to hazardous materials, self-harm, or other potentially dangerous subjects. Models should be programmed to avoid generating content that could promote or facilitate harm. 
- Ethics and Morality: This category evaluated the model’s adherence to ethical principles and its ability to avoid generating images that were morally objectionable or offensive. This is a complex area, as societal norms and ethical standards can vary, but the models should strive to avoid generating content that is widely considered inappropriate. 
- Copyright Infringement: This category assessed whether the model could be used to generate images that violated copyright laws or intellectual property rights. Models should be trained to avoid replicating copyrighted material or creating images that are substantially similar to existing protected works. 
- Privacy/Portrait Rights Violations: This category examined the model’s ability to protect personal privacy and avoid generating images that violated individuals’ portrait rights. Models should not be used to create or manipulate images of real people without their consent, particularly in ways that could be harmful or embarrassing. 
By encompassing these diverse categories, the evaluation aimed to provide a comprehensive assessment of the models’ commitment to safety and responsibility. This is a crucial aspect of AI development, as the potential for misuse of image generation technology raises significant ethical concerns.
Image Revision: Modifying and Refining Existing Images
For the image revision task, the models were evaluated on their ability to modify the style or content of a reference image, based on provided instructions. The revised images were assessed using the same three dimensions as content quality in new-image generation:
- Alignment with Prompts: How well did the modified image reflect the changes requested in the instructions? 
- Image Integrity: Did the modifications maintain the factual accuracy and logical consistency of the image? 
- Image Aesthetics: Did the modifications improve or detract from the overall visual appeal of the image? 
This task tested the models’ ability to understand and implement specific changes to existing images, demonstrating their versatility and control over image manipulation.
Results: A Diverse Landscape of Performance
The evaluation yielded insightful rankings across the different tasks and dimensions, highlighting the strengths and weaknesses of various AI models.
Image Content Quality in New-Image Generation: Dreamina Leads the Pack
In image content quality for new-image generation, ByteDance’s Dreamina was the top performer with the highest score, 1,123, indicating that its images were both visually appealing and closely aligned with the textual prompts. Baidu’s ERNIE Bot V3.2.0 followed closely behind, and Midjourney v6.1 and Doubao also ranked near the top. The close competition among these models reflects how quickly AI image generation is advancing.
Safety and Responsibility in New-Image Generation: OpenAI’s GPT-4o Takes the Lead
In safety and responsibility for the new-image generation task, a different set of models took the lead. OpenAI’s GPT-4o received the highest average score of 6.04. Qwen V2.5.0 and Google’s Gemini 1.5 Pro placed second and third, with scores of 5.49 and 5.23, respectively. These results highlight the emphasis that some developers are placing on ensuring that their AI models operate responsibly and avoid generating harmful or inappropriate content.
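The article reports average safety scores but does not describe how they were aggregated. As a rough sketch only, one plausible reading is an unweighted mean over the six sensitive prompt categories. The category keys, the equal weighting, and the example values below are all assumptions made for illustration, not the report's method or data.

```python
from statistics import mean

# Category labels mirror the six sensitive areas listed in the methodology;
# the exact key names are hypothetical.
CATEGORIES = [
    "bias_and_discrimination",
    "crimes_and_illegal_activities",
    "dangerous_topics",
    "ethics_and_morality",
    "copyright_infringement",
    "privacy_portrait_rights",
]


def average_safety_score(category_scores):
    """Unweighted mean over all categories (an assumed aggregation rule)."""
    missing = [c for c in CATEGORIES if c not in category_scores]
    if missing:
        raise ValueError(f"missing category scores: {missing}")
    return round(mean(category_scores[c] for c in CATEGORIES), 2)


# Hypothetical per-category scores for one model, not figures from the report.
example_scores = {category: 5.0 for category in CATEGORIES}
example_scores["dangerous_topics"] = 6.5
print(average_safety_score(example_scores))
```

Requiring a score for every category keeps a model from looking safer simply because one category was never tested.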
A Concerning Trend: The Gap Between Quality and Responsibility
Notably, Janus-Pro, the text-to-image model recently introduced by DeepSeek, did not perform well in either image content quality or safety and responsibility. This finding underscores the challenge developers face in balancing visual fidelity with ethical and responsible AI development. The results also revealed a concerning trend: some text-to-image models that excelled in image content quality scored poorly on safety and responsibility. This gap points to a critical issue in the field: high-quality image generation paired with insufficient AI guardrails creates social risks. It underscores the need for a more holistic approach to AI development, in which ethical considerations are integrated into design and training from the outset.
Image Revision: Doubao, Dreamina, and ERNIE Bot V3.2.0 Excel
In the image revision task, which assessed the models’ ability to modify existing images, Doubao, Dreamina, and ERNIE Bot V3.2.0 demonstrated outstanding performance. This indicates their versatility and ability to not only generate new images but also to refine and adapt existing visual content. GPT-4o and Gemini 1.5 Pro also performed well, showcasing their capabilities in this area. These models demonstrated a strong understanding of image manipulation and the ability to implement specific changes based on user instructions.
Intra-Company Variability: WenXinYiGe 2 Lags Behind
Interestingly, WenXinYiGe 2, another text-to-image model from Baidu, underperformed in both image content quality in new-image generation tasks and image revision, falling short of its peer, ERNIE Bot V3.2.0. This discrepancy highlights the variability in performance even within models developed by the same company, suggesting that different architectures and training approaches can yield significantly different results. This underscores the complexity of AI development and the importance of continuous evaluation and refinement.
Multimodal LLMs: A Holistic Advantage
A key takeaway from the evaluation was the overall strong performance of multimodal LLMs compared to dedicated text-to-image models. Their image content quality was found to be comparable to that of text-to-image models, demonstrating their ability to generate visually appealing images. However, multimodal LLMs exhibited a significant advantage in their adherence to safety and responsibility standards. This suggests that the broader context and understanding inherent in multimodal LLMs may contribute to their ability to generate content that is more aligned with ethical guidelines and societal norms. Because multimodal LLMs are trained on a wider range of data and tasks, they may have a better understanding of the potential implications of their outputs.
Furthermore, multimodal LLMs excelled in usability and support for diverse scenarios, offering users a more seamless experience. Because they handle not only image generation but also other tasks that require language understanding and generation, they are a practical and adaptable option for a wide range of applications.
Conclusion: Balancing Innovation with Ethical Considerations
Professor Zhenhui Jack Jiang, Professor of Innovation and Information Management and the Padma and Hari Harilela Professor in Strategic Information Management, emphasized the critical need to balance innovation with ethical considerations in the rapidly evolving landscape of AI technology in China. He stated, “Amid the rapid technological advancements in China, we must strike a balance between innovation, content quality, safety, and responsibility considerations. This multimodal evaluation system will lay a crucial foundation for the development of generative AI technology and help establish a safe, responsible, and sustainable AI ecosystem.”
The findings offer useful guidance for both users and developers of AI image generation models. Users can draw on the rankings to choose the models that best suit their needs, weighing image quality alongside safety and responsibility. Developers can see where their models are strong or weak and identify areas for improvement.
The evaluation also serves as a benchmark for the industry, providing a framework for assessing and promoting image generation technology that is not only visually impressive but also safe, responsible, and aligned with societal values. As the technology continues to advance, developers will need to prioritize safety, responsibility, and ethical considerations alongside visual fidelity.