Hi,
I'm working on an Android image processing app which uses OpenCL on a Mali GPU.
I have a problem where I get random segfaults in clEnqueueUnmapMemObject or in a subsequent clReleaseMemObject.
The process simply segfaults, so I never get a chance to check the OpenCL error codes for these operations.
This bug happens randomly but I noticed it has a preference to happen on a certain image size.
The app works fine with sizes larger than and smaller than that size.
I think it has something to do with how the buffer is laid out in memory: certain buffer sizes might end close to a memory page boundary, for example,
and that case might be handled badly.
Or maybe memory gets moved around and my mapped pointers become invalid after some time or after a page change.
I'm not really sure about any of this; I'm just throwing out some ideas I had.
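One cheap way to test the page-boundary idea is to check how a given buffer size sits relative to the page size and the cache line size. The sketch below is only an illustration: the helper names are hypothetical, and the 4096-byte page and 64-byte cache line are assumptions that should be confirmed for the actual device.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers for checking how a buffer size relates to a
   granule such as the page size (assumed 4096 bytes) or the cache
   line size (assumed 64 bytes). Neither value comes from the driver. */

/* Number of bytes the buffer occupies in its last, partially filled
   granule. 0 means the size is an exact multiple of the granule. */
size_t tail_bytes(size_t size, size_t granule) {
    assert(granule != 0);
    return size % granule;
}

/* Smallest multiple of granule that is >= size. Allocating this many
   bytes makes the buffer end exactly on a granule boundary. */
size_t round_up(size_t size, size_t granule) {
    assert(granule != 0);
    size_t rem = size % granule;
    return rem == 0 ? size : size + (granule - rem);
}
```

For the 10245120-byte size that appears in the log in this thread, `tail_bytes(10245120, 4096)` is 1024 and `tail_bytes(10245120, 64)` is 0, so that buffer is cache-line sized but ends partway through a page. If crashes disappear when every allocation is passed through `round_up(size, 4096)`, that would support the page-boundary theory; if not, it probably rules it out.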
I always map my buffers before processing on the CPU and unmap them before processing on the GPU.
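A rule like this is easy to break by accident (a double unmap, or a CPU write through a pointer that has already been unmapped), and either mistake can corrupt memory long before the crash shows up. Below is a minimal sketch of bookkeeping that makes such mistakes visible. `tracked_buf` and its functions are hypothetical and not part of OpenCL; the comments mark where the real clEnqueueMapBuffer and clEnqueueUnmapMemObject calls would go.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping struct, not part of the OpenCL API. */
typedef struct {
    void  *mapped; /* host pointer from the map call; NULL while the GPU owns the buffer */
    size_t size;   /* buffer size in bytes */
} tracked_buf;

/* Record the pointer returned by a completed (blocking) clEnqueueMapBuffer.
   Returns 0 on success, -1 if the buffer is already mapped or ptr is NULL. */
int tracked_map(tracked_buf *b, void *ptr) {
    assert(b != NULL);
    if (b->mapped != NULL || ptr == NULL) return -1;
    b->mapped = ptr;
    return 0;
}

/* Hand back the pointer to pass to clEnqueueUnmapMemObject and forget it,
   so that a second unmap or a later CPU access is detected.
   Returns NULL if the buffer is not currently mapped. */
void *tracked_unmap(tracked_buf *b) {
    assert(b != NULL);
    void *p = b->mapped;
    b->mapped = NULL;
    return p;
}

/* Nonzero if the CPU may touch bytes [offset, offset + len) right now. */
int tracked_cpu_range_ok(const tracked_buf *b, size_t offset, size_t len) {
    assert(b != NULL);
    return b->mapped != NULL && offset <= b->size && len <= b->size - offset;
}
```

Calling `tracked_cpu_range_ok` before each CPU-side read or write (for example, before the image loader fills the buffer) turns a silent overrun or a use-after-unmap into an immediate, reportable failure.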
Any insight on my problem will be appreciated.
Hi yakovenko,
Unfortunately, without more detailed steps to reproduce the issue, our driver team will not be able to investigate.
We need at least:
- A driver revision + platform
- A sequence of OpenCL calls that will trigger the error (e.g. create a buffer of size X, map size Y, unmap, enqueue kernel, repeat ~Z times and it should fail)
- A rough estimate of how often this triggers the bug: every time, 1/10, 1/100, etc...
Hi
I tried the app on 2 different devices, the Galaxy Note 5 and the Galaxy S7 Edge, and the random crashes occur on both of them.
Here are the OpenCL queries from the devices:
// Galaxy S7 edge
CL_PLATFORM_NAME: ARM Platform
CL_PLATFORM_VERSION: OpenCL 1.2 v1.r9p0-12dev0.c37094c9ad948aa7a7b056e38dcda762
CL_PLATFORM_PROFILE: FULL_PROFILE
CL_DEVICE_NAME: Mali-T880
CL_DEVICE_MAX_COMPUTE_UNITS: 12 Cores
CL_DEVICE_MAX_CLOCK_FREQUENCY: 650 MHz
CL_DEVICE_AVAILABLE: 1
CL_DEVICE_COMPILER_AVAILABLE: 1
CL_DEVICE_GLOBAL_MEM_SIZE: 4089249032 Bytes
CL_DEVICE_LOCAL_MEM_SIZE: 4089249068 Bytes
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 4089249104 Bytes
CL_DEVICE_MEM_BASE_ADDR_ALIGN: 1024 Bits
CL_DEVICE_MIN_DATA_TYPE_ALIGN_SIZE: 128 Bytes
CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR: 16
CL_DEVICE_MAX_WORK_ITEM_SIZES: {256, 256 ,256}
CL_DEVICE_MAX_WORK_GROUP_SIZE: 256
// Galaxy Note 5
CL_PLATFORM_VERSION: OpenCL 1.1 v1.r7p0-03rel0.b596bd02e7d0169c10574b57180c8b57
CL_DEVICE_NAME: Mali-T760
CL_DEVICE_MAX_COMPUTE_UNITS: 8 Cores
CL_DEVICE_MAX_CLOCK_FREQUENCY: 772 MHz
CL_DEVICE_GLOBAL_MEM_SIZE: 4082507076 Bytes
CL_DEVICE_LOCAL_MEM_SIZE: 4082507112 Bytes
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 4082507148 Bytes
Here is an example of logcat output for crashes on unmap using a 10 MB buffer.
The mapped pointer is stored in a struct together with the mem object, so I'm certain I am unmapping the correct pointer.
The buffer is mapped in order to load an image from a file into it, and is unmapped afterwards.
Since the log reached UNMAPPING, the image was loaded without errors and the crash happened at the call to clEnqueueUnmapMemObject.
E/SR_DEBUG_NATIVE: Pixel buffer size: 10245120 bytes
E/SR_DEBUG_NATIVE: MAPPING
E/SR_DEBUG_NATIVE: UNMAPPING
A/libc: Fatal signal 7 (SIGBUS), code 1, fault addr 0x7e87738f in tid 3116 (Thread-871)
A/libc: Fatal signal 11 (SIGSEGV), code 1, fault addr 0x99857a98 in tid 5850 (mali-renderer)
Notice that the two crashes raise different signals (SIGBUS and SIGSEGV)...
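The signal number and code in those two lines carry more information than just "segfault". The sketch below decodes them using the standard Linux values for SIGBUS/SIGSEGV and their si_code fields. The function names are hypothetical, and the numbers are hard-coded so that a log from the device can be decoded on any host.

```c
#include <assert.h>
#include <stdint.h>

/* Linux signal numbers as printed by Android's libc in logcat. */
enum { LINUX_SIGBUS = 7, LINUX_SIGSEGV = 11 };

/* Map a (signal, si_code) pair from a "Fatal signal" line to its
   Linux name and meaning. Only kernel-generated codes are covered. */
const char *describe_fault(int signo, int code) {
    if (signo == LINUX_SIGSEGV) {
        switch (code) {
        case 1: return "SEGV_MAPERR: address not mapped to an object";
        case 2: return "SEGV_ACCERR: invalid permissions for mapped object";
        }
    } else if (signo == LINUX_SIGBUS) {
        switch (code) {
        case 1: return "BUS_ADRALN: invalid address alignment";
        case 2: return "BUS_ADRERR: nonexistent physical address";
        case 3: return "BUS_OBJERR: object-specific hardware error";
        }
    }
    return "unknown signal/code combination";
}

/* Nonzero if addr is a multiple of alignment (a power of two). */
int is_aligned(uint64_t addr, uint64_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (addr & (alignment - 1)) == 0;
}
```

For the log above, `describe_fault(7, 1)` gives BUS_ADRALN and the fault address 0x7e87738f is odd, so that crash looks like a misaligned access through a bad pointer. `describe_fault(11, 1)` gives SEGV_MAPERR, meaning the mali-renderer thread touched an address that was not mapped at all. Neither result proves where the bad pointer came from, but both are worth including in a bug report.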
I can't say that the crash depends on the number of executions; it seems to depend on whether a run hits the faulting address or not. That can change after a device restart or with a different buffer size.
In each app session the fault can show up at a different buffer size, so no single buffer size is responsible either.
I have had crashes with buffers both larger and smaller than 10 MB.
What I mean is that the app runs fine most of the time, but in some sessions it crashes every time, on the same buffer size and at the same fault address.
I am certain that it isn't an out-of-bounds memory access, because I already fixed a bug of that kind, and its effect was crashing kernels, not the whole process.
It's like if I accessed a forbidden address on the GPU during processing then the kernel would just be terminated but the process is still alive and there is no seg fault.
Also this seg fault crash occurs on unmap which is part of the API.
I'm pretty sure that it is related to a conflict between the limited memory Android gives each app and the GPU's access to global memory.
The buffers I allocate for the GPU are not visible in Android Studio memory monitor as if they are not part of my app.
But my app has a limited heap which is allocated by android.
Maybe the crash occurs when my app attempts to use memory that belongs to GPU global memory, which is outside of my app's heap, and then a segfault occurs.
You can check dmesg to see if mali_kbase reports any illegal memory accesses.
For us to be able to reproduce the issue we need a sequence of commands to run on the device which will trigger the crash, something like:
while (1)
{
    buf = clCreateBuffer()
    mapping = clEnqueueMapBuffer(buf)
    clEnqueueUnmapMemObject(buf, mapping)
    etc.
}
We already have some unit tests which cover all basic mapping operations and they pass on these devices, so if the problem comes from the driver there must be a special sequence of commands you do which triggers the issue that we don't cover in our tests.
Thanks,
Anthony