Arm and Alibaba Collaboration: Enhanced Multimodal AI at the Edge

The rapid evolution of AI is ushering in a new era of multimodal models: systems that can process and interpret information from a variety of sources, including text, images, audio, video, and even sensor data. Deploying these models on edge devices, however, presents significant hurdles. Edge hardware is constrained in power and memory, and processing several data types simultaneously compounds the difficulty.

Arm Kleidi: Optimizing AI Inference on Arm CPUs

Arm Kleidi is specifically designed to address this challenge, providing seamless performance optimization for all AI inference workloads that run on Arm CPUs. At the heart of Kleidi is KleidiAI, a streamlined suite of highly efficient, open-source Arm routines built to accelerate AI.

KleidiAI is already integrated into the latest versions of widely used AI frameworks for edge devices. These include ExecuTorch, Llama.cpp, LiteRT via XNNPACK, and MediaPipe. This widespread integration offers a significant advantage to millions of developers, who can now automatically benefit from AI performance optimizations without any extra effort.

Partnership with Alibaba: Qwen2-VL-2B-Instruct Model

A new milestone in the advancement of multimodal AI on edge devices has been achieved through a close collaboration around MNN, a lightweight, open-source deep learning framework developed and maintained by Alibaba. The partnership has resulted in the successful integration of KleidiAI, enabling multimodal AI workloads to run efficiently on mobile devices using Arm CPUs. At the center of this work is Alibaba’s instruction-tuned, 2-billion-parameter Qwen2-VL-2B-Instruct model, designed for image understanding, text-to-image reasoning, and multimodal generation across multiple languages, all within the constraints of edge devices.

Measurable Performance Gains

The integration of KleidiAI with MNN has yielded significant, measurable performance improvements for the Qwen2-VL-2B-Instruct model, with faster response times observed across key multimodal AI use cases at the edge. These improvements unlock enhanced user experiences in a variety of customer-facing Alibaba applications. Examples include:

  • Chatbots for customer service: Providing quicker and more efficient responses to customer inquiries.
  • E-shopping applications: Enabling photo-to-goods searching, allowing customers to quickly find the items they are looking for by simply uploading an image.

The enhanced speed in these applications is a direct result of substantial performance gains:

  • Pre-fill Improvement: A 57 percent performance improvement in pre-fill, the stage where the model processes the full multi-source prompt (for example, an image plus a text instruction) before generating a response.
  • Decode Enhancement: A 28 percent performance improvement in decode, the stage where the model generates output tokens one at a time after the prompt has been processed. The note after this list sketches what these figures imply for latency.
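
The announcement does not spell out how these percentages translate into wall-clock time. As a rough, illustrative reading only (assuming, as is conventional, that the figures are throughput gains, i.e., tokens processed per second), stage time scales with the reciprocal of throughput:

```latex
% Illustrative arithmetic only; assumes the quoted figures are throughput gains.
\[
t_{\text{new}} = \frac{t_{\text{old}}}{1 + g}
\qquad\Longrightarrow\qquad
\begin{cases}
g = 0.57 \ \text{(pre-fill)}: & t_{\text{new}} \approx 0.64\, t_{\text{old}} \\
g = 0.28 \ \text{(decode)}:   & t_{\text{new}} \approx 0.78\, t_{\text{old}}
\end{cases}
\]
```

Under this reading, a prompt-processing stage that previously took one second would take roughly 0.64 seconds, and each generated token would arrive in roughly 78 percent of its previous time.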

Beyond speed, the KleidiAI integration also makes processing of AI workloads at the edge more efficient by lowering the overall computational cost of multimodal workloads. These performance and efficiency gains are readily accessible: any developer running applications on the MNN framework, or on any other popular edge AI framework into which KleidiAI is integrated, benefits immediately.

Real-World Demonstration: MWC Showcase

The practical capabilities of the Qwen2-VL-2B-Instruct model, powered by the new KleidiAI integration with MNN, were showcased at Mobile World Congress (MWC). A demonstration at the Arm booth highlighted the model’s ability to understand diverse combinations of visual and text inputs and respond with a concise summary of the image content. The entire process ran on the Arm CPU of smartphones built on MediaTek’s Arm-powered Dimensity 9400 mobile system-on-chip (SoC), including the vivo X200 Series.

A Significant Step Forward in User Experience

The integration of Arm’s KleidiAI with the MNN framework for Alibaba’s Qwen2-VL-2B-Instruct model represents a substantial step forward in the user experience of multimodal AI workloads, delivered directly at the edge and powered entirely by the Arm CPU. These capabilities are available on mobile devices today, with leading customer-facing applications already benefiting from KleidiAI.

The Future of Multimodal AI on Edge Devices

Looking ahead, KleidiAI’s seamless optimizations for AI workloads will continue to empower millions of developers to create increasingly sophisticated multimodal experiences on edge devices, paving the way for the next wave of intelligent computing.

Quotes from Alibaba Leadership

‘We are pleased to see the collaboration between Alibaba Cloud’s large language model Qwen, Arm KleidiAI, and MNN. Integrating MNN’s on-device inference framework with Arm KleidiAI has significantly improved the latency and energy efficiency of Qwen. This partnership validates the potential of LLMs on mobile devices and enhances the AI user experience. We look forward to continued efforts in advancing on-device AI computing.’ - Dong Xu, GM of Tongyi Large Model Business, Alibaba Cloud.

‘The technical integration between the MNN inference framework and Arm KleidiAI marks a major breakthrough in on-device acceleration. With joint optimization of the architecture, we have greatly improved the Tongyi LLM’s on-device inference efficiency, bridging the gap between limited mobile computing power and advanced AI capabilities. This achievement highlights our technical expertise and cross-industry collaboration. We look forward to continuing this partnership to enhance the on-device computing ecosystem, delivering smoother and more efficient AI experiences on mobile.’ - Xiaotang Jiang, Head of MNN, Taobao and Tmall Group, Alibaba.

Delving Deeper into the Technical Aspects

To fully appreciate the significance of this collaboration, it’s helpful to examine some of the underlying technical details.

The Role of MNN

MNN’s design philosophy centers around efficiency and portability. It achieves this through several key features:

  • Lightweight Architecture: MNN is designed to have a small footprint, minimizing the storage and memory requirements on edge devices. This is crucial for mobile phones and other devices with limited resources. The reduced size also contributes to faster loading times and lower power consumption.
  • Optimized Operations: The framework incorporates highly optimized mathematical operations specifically tailored for Arm CPUs, maximizing performance. These optimizations are often at the assembly level, taking advantage of specific CPU instructions and features to achieve maximum speed. This includes optimized implementations of matrix multiplications, convolutions, and other core operations used in deep learning models.
  • Cross-Platform Compatibility: MNN supports a wide range of operating systems and hardware platforms, making it a versatile choice for developers. This includes support for Android, iOS, Linux, and other embedded operating systems. This broad compatibility allows developers to deploy their models on a variety of devices without significant code modifications.
  • Graph Optimization: MNN employs various graph optimization techniques to further enhance performance. This includes operator fusion (combining multiple operations into a single one), constant folding (pre-computing constant values), and memory allocation optimization.
  • Quantization Support: MNN supports model quantization, which reduces the precision of model weights and activations (e.g., from 32-bit floating-point to 8-bit integer). This significantly reduces model size and computational cost, making it possible to run larger models on resource-constrained devices; a minimal sketch of the technique follows this list.
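
To make the quantization item concrete, here is a minimal sketch of symmetric per-tensor int8 quantization. This is not MNN’s actual implementation; all names are our own, and production frameworks add refinements such as per-channel scales and calibration over representative data.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

// Minimal sketch of symmetric per-tensor int8 quantization.
// Illustrative only; not MNN code.
struct QuantizedTensor {
    std::vector<int8_t> data;
    float scale;  // dequantized value = data[i] * scale
};

QuantizedTensor quantize_int8(const std::vector<float>& weights) {
    // One scale for the whole tensor, chosen so the largest magnitude maps to 127.
    float max_abs = 0.0f;
    for (float w : weights) max_abs = std::max(max_abs, std::fabs(w));

    QuantizedTensor q;
    q.scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
    q.data.reserve(weights.size());
    for (float w : weights) {
        int v = static_cast<int>(std::lround(w / q.scale));
        q.data.push_back(static_cast<int8_t>(std::clamp(v, -127, 127)));
    }
    return q;  // 4x smaller than float32, at some accuracy cost
}

float dequantize_at(const QuantizedTensor& q, size_t i) {
    return q.data[i] * q.scale;
}

int main() {
    std::vector<float> w = {0.12f, -0.93f, 0.45f, -0.07f};
    QuantizedTensor q = quantize_int8(w);
    for (size_t i = 0; i < w.size(); ++i)
        std::printf("%+.3f -> %4d -> %+.3f\n", w[i], q.data[i], dequantize_at(q, i));
}
```

Storing weights as int8 with a single float scale cuts memory traffic roughly fourfold versus float32, which matters on mobile CPUs where inference is often memory-bound.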

KleidiAI’s Contribution

KleidiAI complements MNN’s strengths by providing a set of specialized routines that further accelerate AI inference. These routines leverage Arm’s extensive experience in CPU architecture to unlock performance gains that would be difficult to achieve otherwise. Key aspects of KleidiAI’s contribution include:

  • Highly Optimized Kernels: KleidiAI provides highly optimized kernels for common AI operations, such as matrix multiplication and convolution. These kernels are tuned to exploit specific features of Arm CPUs, including NEON and SVE (Scalable Vector Extension) instructions; low-level optimizations of this kind are crucial for maximum performance on Arm-based devices, and a minimal illustration follows this list.
  • Automatic Integration: The seamless integration of KleidiAI into popular AI frameworks (like MNN, ExecuTorch, Llama.cpp, LiteRT, and MediaPipe) means that developers don’t need to manually incorporate these optimizations. The performance benefits are automatically applied, simplifying the development process and reducing the barrier to entry for developers who may not have expertise in low-level optimization.
  • Continuous Improvement: Arm is committed to continuously updating and improving KleidiAI, ensuring that it remains at the forefront of AI acceleration technology. This includes adding support for new AI models and operations, as well as further optimizing existing kernels for the latest Arm CPU architectures. This ongoing development ensures that developers will continue to benefit from performance improvements over time.
  • Targeted Optimizations: KleidiAI provides optimizations that are specifically tailored to different types of AI workloads, such as computer vision, natural language processing, and speech recognition. This allows for more fine-grained control over performance and efficiency.
  • Open Source: KleidiAI is open-source, fostering collaboration and transparency within the AI community. This allows developers to contribute to the project and adapt it to their specific needs.
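
To give a flavor of what “highly optimized kernels” means in practice, the sketch below vectorizes a dot product, the innermost building block of matrix multiplication, using NEON intrinsics. It is illustrative only and is not KleidiAI source code; real micro-kernels add register blocking, data packing, multi-row tiles, and SVE variants.

```cpp
#include <arm_neon.h>  // NEON intrinsics; build for AArch64
#include <cstdio>

// Minimal NEON-vectorized dot product. Illustrative only, not KleidiAI code.
float dot_neon(const float* a, const float* b, int n) {
    float32x4_t acc = vdupq_n_f32(0.0f);   // four partial sums in one register
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i); // load 4 floats from each input
        float32x4_t vb = vld1q_f32(b + i);
        acc = vfmaq_f32(acc, va, vb);      // fused multiply-add, 4 lanes at once
    }
    float sum = vaddvq_f32(acc);           // horizontal add of the 4 lanes (AArch64)
    for (; i < n; ++i) sum += a[i] * b[i]; // scalar tail for leftover elements
    return sum;
}

int main() {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    std::printf("dot = %f\n", dot_neon(a, b, 8));  // expect 120
}
```

Processing four lanes per instruction with fused multiply-adds is the basic pattern; production kernels layer cache-aware blocking and instruction scheduling on top of it.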

Qwen2-VL-2B-Instruct: A Powerful Multimodal Model

The Qwen2-VL-2B-Instruct model is a testament to Alibaba’s expertise in large language models and multimodal AI. Its key features include:

  • Instruction Tuning: The model is specifically tuned to follow instructions, making it highly adaptable to a wide range of tasks. This means that the model can be given a specific instruction, such as “Summarize this image” or “Translate this text into French,” and it will generate a response that directly addresses the instruction. This makes the model more versatile and easier to use than models that are not instruction-tuned.
  • Multimodal Capabilities: It excels at understanding and processing both visual and textual information, enabling applications like image captioning and visual question answering. The model can take both an image and a text prompt as input and generate a text response that is relevant to both. This opens up a wide range of possibilities for applications that combine visual and textual information.
  • Multilingual Support: The model is designed to work with multiple languages, broadening its applicability across different regions and user bases. This is crucial for creating AI applications that can be used by a global audience.
  • Optimized for Edge Devices: Despite its powerful capabilities, the model is carefully designed to operate within the resource constraints of edge devices. This is achieved through techniques such as model compression, quantization, and knowledge distillation. The 2B parameter size is a key factor in making it suitable for edge deployment, striking a balance between performance and resource requirements.
  • Vision-Language Understanding: The model demonstrates strong capabilities in understanding the relationship between images and text. This allows it to perform tasks such as identifying objects in an image, describing the scene, and answering questions about the image content.
  • Text-to-Image Reasoning: The model can reason about the relationship between text and images, allowing it to generate captions that accurately describe image content or answer questions that require understanding both the image and the text. A sketch of how such a model is typically driven on-device follows this list.
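
To connect the model back to the pre-fill and decode stages measured earlier, the sketch below shows the shape of a typical on-device generation loop for an instruction-tuned vision-language model. Every type and function name here is invented for illustration; this is not MNN’s, KleidiAI’s, or Qwen’s actual API.

```cpp
#include <vector>

// Hypothetical pseudo-API: all names below are invented for illustration
// and do not correspond to MNN, KleidiAI, or Qwen interfaces.
struct Token { int id; };

struct VisionLanguageModel {
    int step = 0;  // dummy state so the sketch runs

    // Pre-fill: one pass over the whole prompt (vision tokens from the image
    // encoder plus text tokens), populating the attention KV cache.
    // Compute-bound and highly parallel: the stage with the 57% gain.
    void prefill(const std::vector<Token>& prompt) { (void)prompt; }

    // Decode: generate one token per step, reusing the KV cache.
    // Memory-bound and sequential: the stage with the 28% gain.
    Token decode_next() { return Token{++step <= 5 ? step : 0}; }

    bool is_end_of_sequence(Token t) const { return t.id == 0; }
};

std::vector<Token> generate(VisionLanguageModel& model,
                            const std::vector<Token>& prompt,
                            int max_new_tokens) {
    model.prefill(prompt);              // process image + instruction once
    std::vector<Token> output;
    for (int i = 0; i < max_new_tokens; ++i) {
        Token t = model.decode_next();  // then emit tokens one at a time
        if (model.is_end_of_sequence(t)) break;
        output.push_back(t);
    }
    return output;
}

int main() {
    VisionLanguageModel model;
    std::vector<Token> prompt = {{101}, {102}, {103}};  // stand-in tokens
    generate(model, prompt, 32);
}
```

The split explains why both quoted gains matter to the user: pre-fill speed governs the delay before the first token appears, while decode speed governs how quickly the rest of the response streams out.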

Expanding the Scope of Multimodal AI

The advancements discussed here are not limited to smartphones. The same principles and technologies can be applied to a wide range of edge devices, including:

  • Smart Home Devices: Enabling voice assistants, image recognition for security cameras, and other intelligent features. This could include smart speakers that can understand and respond to voice commands, security cameras that can identify people and objects, and smart appliances that can be controlled remotely.
  • Wearable Devices: Powering health monitoring, fitness tracking, and augmented reality applications. This could include smartwatches that can track heart rate and activity levels, fitness trackers that can provide personalized workout recommendations, and augmented reality glasses that can overlay digital information onto the real world.
  • Industrial IoT: Facilitating predictive maintenance, quality control, and automation in manufacturing settings. This could include sensors that can monitor the condition of machinery and predict when maintenance is needed, cameras that can inspect products for defects, and robots that can automate repetitive tasks.
  • Automotive: Enhancing driver assistance systems, in-cabin entertainment, and autonomous driving capabilities. This could include systems that can help drivers stay in their lane, park their car, and avoid collisions, as well as in-car entertainment systems that can provide personalized recommendations and autonomous driving systems that can take over control of the vehicle in certain situations.
  • Retail: Enhancing customer experiences through personalized recommendations, virtual try-on, and automated checkout.
  • Healthcare: Assisting with diagnosis, treatment planning, and remote patient monitoring.

The potential applications of multimodal AI at the edge are vast and continue to expand. As models become more sophisticated and hardware more powerful, we can expect even more innovative and impactful use cases to emerge.

This collaboration between Arm and Alibaba is a significant step in that direction, bringing the power of multimodal AI to a wider audience and enabling a new generation of intelligent devices. The combination of MNN’s lightweight, versatile framework, KleidiAI’s highly optimized kernels, and Qwen2-VL-2B-Instruct’s multimodal capabilities creates a synergistic effect, and the shared focus on efficiency, performance, and developer accessibility ensures these advances will have a broad and lasting impact on the future of technology.