OpenCL Mali seg faults on unmap operations

Alexey over 8 years ago

Hi,

I'm working on an Android image processing app that uses OpenCL on a Mali GPU.

I get random seg faults in clEnqueueUnmapMemObject or in the clReleaseMemObject call that follows it.

The process simply crashes, so I never get the chance to check the OpenCL error codes returned by these calls.

The crash is random, but I noticed it tends to happen with one particular image size.

The app works fine with sizes larger and smaller than that one.

I think it has something to do with how the buffer is laid out in memory: certain buffer sizes might end close to a memory page boundary, and that case might be handled badly.

Or maybe memory gets moved around and my mapped pointers become invalid after some time or as a result of a memory page change.

I'm not really sure about any of this; I'm just throwing out some ideas.

I always map my buffers before processing on the CPU and unmap them before processing on the GPU.
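The map-before-CPU, unmap-before-GPU rule described above can be checked mechanically while debugging. The following is only a minimal sketch: `tracked_buffer`, `track_map`, `track_unmap` and `gpu_may_use` are hypothetical helper names, not part of OpenCL, and the sketch only records ownership; it makes no driver calls itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical debugging aid: records which side currently owns a buffer. */
typedef enum { OWNER_GPU, OWNER_CPU } buffer_owner;

typedef struct {
    void        *mapped_ptr; /* pointer returned by the map call, NULL while unmapped */
    buffer_owner owner;      /* who may touch the contents right now */
} tracked_buffer;

/* Call after a successful blocking map. Returns false on a double map. */
bool track_map(tracked_buffer *b, void *ptr)
{
    assert(b != NULL);
    if (b->owner == OWNER_CPU || ptr == NULL)
        return false;
    b->mapped_ptr = ptr;
    b->owner = OWNER_CPU;
    return true;
}

/* Call just before unmapping. Returns false if ptr is not the live mapping. */
bool track_unmap(tracked_buffer *b, const void *ptr)
{
    assert(b != NULL);
    if (b->owner != OWNER_CPU || b->mapped_ptr != ptr)
        return false;
    b->mapped_ptr = NULL; /* the old pointer must not be dereferenced again */
    b->owner = OWNER_GPU;
    return true;
}

/* Call before enqueueing a kernel that uses the buffer. */
bool gpu_may_use(const tracked_buffer *b)
{
    assert(b != NULL);
    return b->owner == OWNER_GPU;
}
```

In an app, `track_map` would be called right after a blocking clEnqueueMapBuffer returns and `track_unmap` right before clEnqueueUnmapMemObject. A `false` result would point at a double unmap, an unmap of a stale pointer, or a kernel enqueued while the buffer is still mapped.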

Any insight on my problem will be appreciated.

0 Alexey over 8 years ago in reply to Anthony Barbier

I can't share my code here due to company policy, but I have some additional information on the problem.
There seems to be a serious problem with Android + OpenCL.
I get a lot of seg faults on unmap operations, with a message mentioning mali renderer mem purge.
It's as if the GPU releases memory in a region it is not allowed to touch after the unmap.
Sometimes there are other messages, and sometimes the seg faults occur during processing itself.
Once I saw a seg fault that happened consistently, with a violation at the same address every time.
A device reset made the crash at that address go away, and everything worked fine afterwards.
I guess something changed in memory after the reset, but the crash can randomly come back at another time and at another address.
I should mention that I have a service which holds the whole OpenCL environment and runs in a process of its own; I send processing requests to that process.
This way I don't need to initialize the OpenCL environment for every processing job. The service and its process are bound to the lifetime of the main Android activity.
I noticed that I can't monitor the memory the GPU uses in Android Studio: I only see Java and native C allocations, not the large buffers in which OpenCL holds my raw images.
Perhaps there is a conflict between the heap Android allocates for my application and the memory OpenCL uses, which is global.
Maybe something goes wrong when a persistent OpenCL environment is kept alive, or maybe something needs to be refreshed in response to some change.
I still can't figure out this problem.
Any insight will be appreciated.

0 Anthony Barbier over 8 years ago in reply to Alexey

Hi yakovenko,
Unfortunately, without more detailed steps to reproduce the issue, our driver team will not be able to investigate.
We need at least:
- A driver revision + platform
- A sequence of OpenCL calls that triggers the error (e.g. create a buffer of size X, map size Y, unmap, enqueue kernel, repeat ~Z times and it should fail)
- A rough estimate of how often this triggers the bug: every time, 1/10, 1/100, etc.

0 Alexey over 8 years ago in reply to Anthony Barbier

Hi
I tried the app on two different devices, the Galaxy Note 5 and the Galaxy S7 Edge, and the random crashes occur on both of them.
Here are the OpenCL queries from the devices:
// Galaxy S7 edge
CL_PLATFORM_NAME: ARM Platform
CL_PLATFORM_VERSION: OpenCL 1.2 v1.r9p0-12dev0.c37094c9ad948aa7a7b056e38dcda762
CL_PLATFORM_PROFILE: FULL_PROFILE
CL_DEVICE_NAME: Mali-T880
CL_DEVICE_MAX_COMPUTE_UNITS: 12 Cores
CL_DEVICE_MAX_CLOCK_FREQUENCY: 650 MHz
CL_DEVICE_AVAILABLE: 1
CL_DEVICE_COMPILER_AVAILABLE: 1
CL_DEVICE_GLOBAL_MEM_SIZE: 4089249032 Bytes
CL_DEVICE_LOCAL_MEM_SIZE: 4089249068 Bytes
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 4089249104 Bytes
CL_DEVICE_MEM_BASE_ADDR_ALIGN: 1024 Bits
CL_DEVICE_MIN_DATA_TYPE_ALIGN_SIZE: 128 Bytes
CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR: 16
CL_DEVICE_MAX_WORK_ITEM_SIZES: {256, 256 ,256}
CL_DEVICE_MAX_WORK_GROUP_SIZE: 256
// Galaxy Note 5
CL_PLATFORM_NAME: ARM Platform
CL_PLATFORM_VERSION: OpenCL 1.1 v1.r7p0-03rel0.b596bd02e7d0169c10574b57180c8b57
CL_PLATFORM_PROFILE: FULL_PROFILE
CL_DEVICE_NAME: Mali-T760
CL_DEVICE_MAX_COMPUTE_UNITS: 8 Cores
CL_DEVICE_MAX_CLOCK_FREQUENCY: 772 MHz
CL_DEVICE_AVAILABLE: 1
CL_DEVICE_COMPILER_AVAILABLE: 1
CL_DEVICE_GLOBAL_MEM_SIZE: 4082507076 Bytes
CL_DEVICE_LOCAL_MEM_SIZE: 4082507112 Bytes
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 4082507148 Bytes
CL_DEVICE_MEM_BASE_ADDR_ALIGN: 1024 Bits
CL_DEVICE_MIN_DATA_TYPE_ALIGN_SIZE: 128 Bytes
CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR: 16
CL_DEVICE_MAX_WORK_ITEM_SIZES: {256, 256 ,256}
CL_DEVICE_MAX_WORK_GROUP_SIZE: 256
Here is an example of logcat crashes on unmap using a 10 MB buffer.
The mapped pointer is stored in a struct together with the mem object, so I'm certain I am unmapping the correct pointer.
The buffer is mapped in order to load an image from a file into it, and it is unmapped afterwards.
Since the log reached UNMAPPING, the image was loaded without errors; that line is printed just before the call to clEnqueueUnmapMemObject.
E/SR_DEBUG_NATIVE: Pixel buffer size: 10245120 bytes
E/SR_DEBUG_NATIVE: MAPPING
E/SR_DEBUG_NATIVE: UNMAPPING
A/libc: Fatal signal 7 (SIGBUS), code 1, fault addr 0x7e87738f in tid 3116 (Thread-871)
E/SR_DEBUG_NATIVE: Pixel buffer size: 10245120 bytes
E/SR_DEBUG_NATIVE: MAPPING
E/SR_DEBUG_NATIVE: UNMAPPING
A/libc: Fatal signal 11 (SIGSEGV), code 1, fault addr 0x99857a98 in tid 5850 (mali-renderer)
Notice that the signals differ: SIGBUS in the first crash and SIGSEGV in the second.
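Both log lines report "code 1", but that code is interpreted per signal. On Linux, which Android is built on, code 1 is BUS_ADRALN for SIGBUS and SEGV_MAPERR for SIGSEGV. A small sketch for decoding the two log lines follows; `fault_code_name` and `is_aligned` are hypothetical helper names introduced for illustration only.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <stdint.h>

/* Name the si_code values that logcat prints as "code N" for these two signals. */
const char *fault_code_name(int signo, int code)
{
    if (signo == SIGBUS) {
        switch (code) {
        case BUS_ADRALN: return "BUS_ADRALN (invalid address alignment)";
        case BUS_ADRERR: return "BUS_ADRERR (nonexistent physical address)";
        case BUS_OBJERR: return "BUS_OBJERR (object-specific hardware error)";
        }
    } else if (signo == SIGSEGV) {
        switch (code) {
        case SEGV_MAPERR: return "SEGV_MAPERR (address not mapped to object)";
        case SEGV_ACCERR: return "SEGV_ACCERR (invalid permissions for mapped object)";
        }
    }
    return "unknown";
}

/* True if addr is a multiple of alignment; alignment must be a power of two. */
bool is_aligned(uint64_t addr, uint64_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (addr & (alignment - 1)) == 0;
}
```

Applied to the log above, `is_aligned(0x7e87738f, 2)` is false (the SIGBUS address is odd) while `is_aligned(0x99857a98, 8)` is true. An odd fault address would be consistent with a misaligned access, although that alone does not identify the cause.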
I can't tie the crash to a number of executions; it seems to depend on whether a run hits the fault address or not. That can change after a device restart or with a different buffer size.
In each app session the fault can show up with a different buffer size, so no specific buffer size is responsible either.
I had crashes with buffers both larger and smaller than 10 MB.
What I mean is that the app runs fine most of the time, but in some sessions it crashes every single time on the same buffer size at the same fault address.
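To relate this to the page-boundary idea from the first post, the buffer size from the log can be checked against page and alignment units. This is only a sketch: the 4096-byte page size is an assumption (it was not queried on the devices), the 128-byte unit comes from the CL_DEVICE_MEM_BASE_ADDR_ALIGN value of 1024 bits listed above, and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes by which size overshoots the previous multiple of unit (0 if exact). */
size_t bytes_past_boundary(size_t size, size_t unit)
{
    assert(unit > 0);
    return size % unit;
}

/* Number of whole units needed to hold size bytes (rounds up). */
size_t units_needed(size_t size, size_t unit)
{
    assert(unit > 0);
    return size / unit + (size % unit != 0);
}
```

For the 10245120-byte buffer, `bytes_past_boundary(10245120, 4096)` gives 1024 and `units_needed(10245120, 4096)` gives 2502, so under the 4096-byte assumption the last page is only a quarter used. The size is an exact multiple of 128 bytes, so it matches the base address alignment.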
I am certain that it isn't an out-of-bounds memory access, because I already fixed that bug and its effect was crashing kernels, not the whole process.
When I accessed a forbidden address on the GPU during processing, the kernel was simply terminated, but the process stayed alive and there was no seg fault.
Also, this crash occurs in unmap, which is part of the API.
I'm pretty sure it is related to the mismatch between the limited memory Android gives each app and the global memory the GPU accesses.
The buffers I allocate for the GPU are not visible in the Android Studio memory monitor, as if they were not part of my app.
But my app has a limited heap, which is allocated by Android.
Maybe the crash occurs when my app attempts to use memory that belongs to GPU global memory, which is outside my app's heap, and then a seg fault occurs.

+1 Anthony Barbier over 8 years ago in reply to Alexey

You can check dmesg to see if mali_kbase reports any illegal memory accesses.
For us to be able to reproduce the issue we need a sequence of commands to run on the device which will trigger the crash, something like:
while (1)
{
    buf = clCreateBuffer(...)
    mapping = clEnqueueMapBuffer(queue, buf, ...)
    clEnqueueUnmapMemObject(queue, buf, mapping, ...)
    etc.
}
We already have some unit tests which cover all basic mapping operations and they pass on these devices, so if the problem comes from the driver there must be a special sequence of commands you do which triggers the issue that we don't cover in our tests.
Thanks,
Anthony