I will train a TensorFlow or Caffe CNN model on an NVIDIA CUDA GPU, and would like to deploy it to an embedded system with an Arm Mali-G71 or Mali-G72 GPU to run inference. Is this possible without major code modification? It seems the Mali GPU supports only OpenCL. Any solutions? Thanks!
Well, I had a look on the web and I can't find anything to back up what I said below, so it seems my memory failed me. I believe you'd need an NVIDIA Tegra K1 or better if you really want to run on ARM and use CUDA.
CUDA can be used with GPUs other than NVIDIA's via OpenCL. I haven't done anything like that myself, but it is worth doing a bit of Googling on using a non-NVIDIA GPU with TensorFlow. It probably isn't too bad.
Of course you can run CNNs on Mali!
Caffe has a stable OpenCL branch to which we have recently contributed support for Android. You can see some public benchmarking results on the ARM Mali-T860 GPU here: https://github.com/dividiti/ck-caffe-firefly-rk3399. This is enabled by our CK-Caffe framework: https://github.com/dividiti/ck-caffe.
OpenCL/SYCL support for TensorFlow is tracked here: https://github.com/tensorflow/tensorflow/issues/22 but we haven't been able to test it.
The ARM Compute Library should also become useful at some point: github.com/.../computelibrary
CNN on Mali is a joke (in my experience with the Mali-T628).
1) There's an old TensorFlow branch (v0.11) that uses Coriander to translate CUDA to OpenCL. It works for some stuff, but it is way slower than upstream CPU TensorFlow compiled with some NEON compiler flags.
2) I tried Theano with the GPU array backend (OpenCL). It was WAY (>100 times) slower than on the CPU.
3) There's a CaffeOnACL (ARM Compute Library) branch, done by ARM, which supposedly uses NEON, the GPU, etc. Another sad joke. The same example (classifying an image with a pre-trained model) was 2 times slower with CaffeOnACL than with the Caffe mainline branch on the CPU.
4) Caffe supports OpenCL. I tried that too. Caffe detects the GPU and all, but trying to run something with the GPU enabled fails with:
F0923 22:04:38.238814 10416 syncedmem.cpp:256] Check failed: mapped_ptr == cpu_ptr_ (0 vs. 0x7d433000) Device claims it support zero copy but failed to create correct user ptr buffer
So yeah... don't bother...
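For anyone wanting to reproduce this kind of CPU vs GPU comparison, here is a minimal timing sketch. It is not tied to any of the frameworks above: `benchmark`, `speedup` and the two placeholder workloads are hypothetical names made up for illustration, and you would swap the placeholders for a real forward pass on each backend. The warm-up runs are left out of the measurement on the assumption that the first calls may include one-off setup cost (OpenCL kernels, for instance, are typically built at runtime).

```python
import statistics
import time


def benchmark(run_inference, warmup=3, runs=10):
    """Return the median wall-clock time (seconds) of one run_inference() call."""
    # Warm-up calls are not timed: they may include one-off setup work.
    for _ in range(warmup):
        run_inference()

    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        run_inference()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def speedup(baseline_seconds, candidate_seconds):
    """Ratio baseline/candidate: above 1 means the candidate is faster."""
    return baseline_seconds / candidate_seconds


if __name__ == "__main__":
    # Placeholder workloads (assumptions); replace with real forward passes,
    # e.g. one using a CPU build and one using an OpenCL build.
    def cpu_forward():
        sum(i * i for i in range(20000))

    def gpu_forward():
        sum(i * i for i in range(40000))

    cpu_time = benchmark(cpu_forward)
    gpu_time = benchmark(gpu_forward)
    print(f"CPU median: {cpu_time:.6f} s")
    print(f"GPU median: {gpu_time:.6f} s")
    print(f"GPU speedup over CPU: {speedup(cpu_time, gpu_time):.2f}x")
```

Using the median rather than the mean keeps a single slow run (thermal throttling is common on small boards) from skewing the result.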
Hi Marianmi,
Thanks for sharing that info. Do you know the approximate dimensions of the CNN model you were using?
I just checked the Mali specs, and it really has minimal compute power. It may simply have been overwhelmed.
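To put a rough number on "dimensions", here is a small sketch that estimates the parameter count and multiply-accumulate operations (MACs) of a single convolution layer. The function names are hypothetical, and the example shape (227x227x3 input, 96 filters of 11x11, stride 4, no padding) is an assumption based on the commonly quoted first convolution of AlexNet as configured in Caffe's reference model. It ignores grouping, pooling and fully connected layers, so treat it as a back-of-the-envelope estimate only.

```python
def conv_output_size(in_size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (in_size + 2 * padding - kernel) // stride + 1


def conv_layer_cost(in_h, in_w, in_c, out_c, kernel, stride=1, padding=0):
    """Estimate output shape, parameters and MACs of one square-kernel conv layer."""
    out_h = conv_output_size(in_h, kernel, stride, padding)
    out_w = conv_output_size(in_w, kernel, stride, padding)
    weights_per_filter = kernel * kernel * in_c
    params = out_c * weights_per_filter + out_c  # weights plus one bias per filter
    macs = out_h * out_w * out_c * weights_per_filter  # one MAC per weight per output
    return {"output": (out_h, out_w, out_c), "params": params, "macs": macs}


if __name__ == "__main__":
    # Assumed shape of AlexNet's first convolution (see the note above).
    cost = conv_layer_cost(227, 227, 3, 96, kernel=11, stride=4)
    print(cost["output"])  # (55, 55, 96)
    print(cost["params"])  # 34944
    print(cost["macs"])    # 105415200
```

Summing this over all layers gives a figure you can compare against the GPU's rated throughput.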
I realise this question was asked some time ago and you may have moved on, but have you had a look at these blogs from 3 months ago?
https://community.arm.com/processors/b/blog/posts/running-alexnet-on-raspberry-pi-with-compute-library
and
https://community.arm.com/tools/b/blog/posts/profiling-alexnet-on-raspberry-pi-and-hikey-960-with-the-compute-library
I followed the first one to install and run the AlexNet CNN on the Odroid XU4 with the Arm Compute Library, and the second one helped me get Streamline community edition working from a remote PC so that I could monitor the GPU activity. It may be worth a look!