Paddle-Lite improves performance of mobile devices with Arm technology

Liliya Wu
December 3, 2021
3 minute read time.
This blog was written by Yabin Zheng, Liliya Wu, and Mary Bennion.

As AI technology rapidly spreads to mobile and edge devices, Arm plays a key role in the AI domain, with significant influence on the underlying technology. Arm has delivered continuous breakthroughs in mobile processor performance, enabling engineers to deploy more deep-learning algorithms on mobile devices.

Baidu is a leading AI company in China with a strong Internet foundation. Its open-source platform, PaddlePaddle, integrates multi-level components into an efficient, flexible, and scalable deep-learning platform. Among its many products, Paddle Lite is an industry-leading high-performance inference engine for endpoint devices, and it is continuously developed to improve support for Arm-based platforms.
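
For context, running a model through Paddle Lite on an Arm device follows a simple predictor pattern. The sketch below is a minimal example based on Paddle Lite's public C++ API (MobileConfig and CreatePaddlePredictor); the model file name and input shape are placeholders, and the exact API may differ between releases.

```cpp
// Minimal sketch of Paddle Lite inference on an Arm device using the public
// C++ API. The model path and shape are placeholders; check the Paddle Lite
// documentation for the release you use.
#include <memory>
#include "paddle_api.h"  // from the Paddle Lite prebuilt library

using namespace paddle::lite_api;  // NOLINT

int main() {
  // Load an optimized (.nb) model produced by the Paddle Lite opt tool.
  MobileConfig config;
  config.set_model_from_file("mobilenet_v1.nb");

  std::shared_ptr<PaddlePredictor> predictor =
      CreatePaddlePredictor<MobileConfig>(config);

  // Fill the input tensor (here a dummy 1x3x224x224 image).
  auto input = predictor->GetInput(0);
  input->Resize({1, 3, 224, 224});
  float* in_data = input->mutable_data<float>();
  for (int i = 0; i < 1 * 3 * 224 * 224; ++i) in_data[i] = 0.f;

  // Run inference on the Arm CPU (or Mali GPU when an OpenCL build is used).
  predictor->Run();

  // Read back the output scores.
  auto output = predictor->GetOutput(0);
  const float* out_data = output->data<float>();
  (void)out_data;
  return 0;
}
```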

Paddle and Arm share a vision for the mobile hardware ecosystem and have therefore maintained a long-term collaboration. In the past few months, the Arm Compute Library (ACL) team has engaged deeply with Paddle's core R&D team to improve overall performance on Arm Cortex-A CPUs and Mali GPUs in mobile and edge devices. The collaboration aims to give users a better experience when Arm-based hardware serves as the back-end inference engine. Guided by the instruction-set characteristics of the different Arm architectures, the technical exchanges cover multiple scenarios of compute and memory-access optimization. Combining analysis of key Paddle-Lite operators with the ACL team's experience, Paddle's R&D team optimized the operator implementations along several dimensions, including but not limited to the following:

Use cases for Arm Cortex-A CPUs:

  • Rearranging instructions in the assembly implementations to account for the multiply-instruction characteristics of Cortex-A53 and Cortex-A35.
  • Implementing adaptive blocking strategies for specific calculations based on the number of registers available on different processors (see the sketch after this list).
  • Using the characteristics of the data to streamline the logic and adjust the calculation strategy, reducing redundant computation.
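
To make the blocking idea concrete, the sketch below shows a generic register-blocked GEMM micro-kernel of the kind such optimizations target, written with AArch64 NEON intrinsics. It keeps a 4x4 tile of the output matrix in registers for the whole inner loop. It is an illustration of the technique only, not Paddle-Lite's or the Compute Library's actual kernel code, and the packed-panel layout is an assumption.

```cpp
// Illustrative register-blocked GEMM micro-kernel (not Paddle-Lite/ACL code).
// a: packed A panel, 4 values (one per output row) for each step of k
// b: packed B panel, 4 values (one per output column) for each step of k
// c: 4x4 output tile inside a row-major matrix with leading dimension ldc
#include <arm_neon.h>

void micro_kernel_4x4(const float* a, const float* b, float* c,
                      int k, int ldc) {
  // The whole 4x4 accumulator tile stays in registers for the K loop,
  // so every loaded value of A and B is reused four times.
  float32x4_t c0 = vld1q_f32(c + 0 * ldc);
  float32x4_t c1 = vld1q_f32(c + 1 * ldc);
  float32x4_t c2 = vld1q_f32(c + 2 * ldc);
  float32x4_t c3 = vld1q_f32(c + 3 * ldc);

  for (int p = 0; p < k; ++p) {
    float32x4_t bp = vld1q_f32(b + 4 * p);  // one row of the B panel
    float32x4_t ap = vld1q_f32(a + 4 * p);  // one column of the A panel
    c0 = vfmaq_laneq_f32(c0, bp, ap, 0);    // row 0 += a[0] * B row
    c1 = vfmaq_laneq_f32(c1, bp, ap, 1);    // row 1 += a[1] * B row
    c2 = vfmaq_laneq_f32(c2, bp, ap, 2);
    c3 = vfmaq_laneq_f32(c3, bp, ap, 3);
  }

  vst1q_f32(c + 0 * ldc, c0);
  vst1q_f32(c + 1 * ldc, c1);
  vst1q_f32(c + 2 * ldc, c2);
  vst1q_f32(c + 3 * ldc, c3);
}
```

The tile size is exactly what an adaptive blocking strategy varies: AArch64 provides 32 128-bit NEON registers while 32-bit Armv7 NEON provides 16, so a larger accumulator tile can be afforded on Armv8 than on Armv7. On in-order cores such as Cortex-A53 and Cortex-A35, hand-written assembly versions of such kernels also interleave the loads with the multiply-accumulate instructions to hide load latency, which is the kind of instruction rearrangement described above.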

Use cases for Arm Mali GPUs:

  • Using buffer objects to achieve highly efficient memory access for operators, based on the data-access characteristics of the Mali GPU architecture.
  • Specializing the implementation of 1x1 convolution and optimizing the multi-threaded calculation logic (see the sketch after this list).
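
To illustrate why the 1x1 convolution case is worth specializing, the sketch below (plain C++ for readability; the actual Mali kernels are written in OpenCL) shows that with a 1x1 kernel, stride 1, and no padding, the convolution collapses into a single matrix multiplication over the channel dimension: no im2col rearrangement is needed, and the work maps directly onto a tuned GEMM. This is a generic sketch under those assumptions, not Paddle-Lite's implementation.

```cpp
// 1x1 convolution written as a plain matrix multiplication (NCHW layout,
// stride 1, no padding). Illustrative only, not Paddle-Lite kernel code.
#include <vector>

// input:   [ci][h*w]    weights: [co][ci]    output: [co][h*w]
// output must be pre-sized to co * hw elements.
void conv1x1_as_gemm(const std::vector<float>& input,
                     const std::vector<float>& weights,
                     std::vector<float>& output,
                     int ci, int co, int hw) {
  for (int oc = 0; oc < co; ++oc) {
    for (int px = 0; px < hw; ++px) {
      float acc = 0.f;
      for (int ic = 0; ic < ci; ++ic) {
        acc += weights[oc * ci + ic] * input[ic * hw + px];
      }
      // output(oc, px) = W(oc, :) . X(:, px)
      output[oc * hw + px] = acc;
    }
  }
}
```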

Through these and other optimizations, both general and target-specific, Paddle Lite models running on Cortex-A CPUs and Mali GPUs achieve a considerable performance improvement. The accuracy of some models has also increased. We measured operator-level and model-level performance of Paddle Lite before and after the optimization across several dimensions.

On Cortex-A CPUs:

  • Significant improvement in operator running efficiency.

Figure 1: Operators' performance improvement on Armv8

Figure 2: Operators' performance improvement on Armv7

  • Performance improvement of typical models.

Figure 3: Performance of typical models on Armv8

Figure 4: Performance of typical models on Armv7

For Mali GPU-based devices, we ran similar tests with the following results.

  • Operator calculation time is significantly reduced across different problem sizes.
  • Overall model performance is improved across different devices.

Figure 5: Model performance improvement of Mali-G76 (OpenCL) in the Mate 30 (Kirin 990)

Figure 6: Model performance improvement of Mali-T860 (OpenCL) on the RK3399

The Paddle team has benefited greatly from this collaboration. Paddle-Lite, Paddle's mobile inference engine, plays a key role in supporting inference tasks in Baidu's mobile applications. After the optimization, it shows a remarkable performance improvement in many commercial applications. Taking some general visual-recognition models in mobile phone applications as an example (such as long-press recognition), the optimization delivered a 22% performance acceleration and a 3.4% accuracy improvement, substantially improving the user experience in those applications. As Paddle Lite continues to grow in operator coverage, running efficiency, and more, it becomes possible to deploy more complex structures and higher-performance algorithms and models on mobile devices.

With the rapid development of AI today, Arm and Baidu look forward to continued collaboration in shaping the future of AI.

Learn more about Arm Compute Library
Learn more about Paddle Lite
