I am playing with a Mali T624 and OpenCL. By playing with the kernel-space Midgard driver, I am now able to access some I/O memory from an OpenCL kernel. However, the I/O memory I am accessing is volatile. For example, assume we have a kernel function with an input buffer of size 0x1000. The data at offset 0x10 of the buffer changes after each read (each read returns a different value), and I attempt to read from that offset 10 times (expecting 10 different values). The problem is that the GPU caches the data at that offset, so every read returns the same value. So, is there any way to manually flush the GPU cache from the OpenCL kernel code?
Thanks for any help and discussion!
No, it's not possible.
As per my other replies, this is not a supported use case.
Kind regards, Pete
Thanks for explaining! I understand that this is not a normal use case. Actually, I am not supposed to do this from user level. However, since there is not much documentation for the GPU and the kernel-level driver, I chose to do it with a user-level application and a modified kernel driver. So, basically, I assume that I have at least kernel privilege. In this case, should flushing the GPU cache be a valid operation? How can I invalidate the GPU cache with kernel privilege?
Thanks for your help!
> In this case, should flushing the GPU cache be a valid operation?
No, it's not possible to clean or clean+invalidate the GPU caches from a shader during execution.
Manual cache maintenance for the entire cache is triggered at job chain boundaries (e.g. when the GPU starts a task assigned by the driver, or when the GPU finishes a task assigned by the driver). The memory model for GPU workloads is not generally designed to cope with volatile I/O resources - it's a GPU not a CPU ...
Recent Mali GPUs can optionally support full coherency, which uses system-level hardware coherency protocols, if the silicon chip implements them. However, full coherency requires both sides of the link to implement it, and I doubt simple I/O peripherals will implement coherency protocols even if the CPU and GPU do. Mali-T620 doesn't implement full coherency in either case, so this won't be able to help in this specific case.
You might be able to set up the GPU memory mapping as uncached in the GPU page table, which bypasses the cache completely, but again this isn't something which a GPU is really designed to do so it might have weird side-effects on e.g. performance.
Thanks again for your explanation! That helps a lot!
Moreover, if you could kindly provide some hints on how to mark the pages as uncached in the GPU page table, that would be great! Am I supposed to modify the GPU MMU configuration directly, or is there already an implementation in the GPU kernel driver?
Thanks for your help again!
On further inspection, I found that there are 5 MemAttrs in the GPU MMU configuration (0x48, 0x4f, 0x4d, 0x88, 0x8d), and none of them matches the expected non-cacheable attributes. Thus, I added a 6th MemAttr, 0x04, to the GPU MMU configuration and used index 5 when setting the page table entry. With this configuration, I expected that the GPU would not cache the data. However, it turns out that multiple reads from the same offset still return the same value. Do you happen to know the reason for the incorrect result?
For more information, I used CL_MEM_ALLOC_HOST_PTR to create the buffer, so there should not be any buffer copy. Also, reads and writes to the I/O device work correctly. For example, if I write something to a control register in the GPU, the corresponding status register shows that the write operation was performed successfully. To avoid the influence of compiler optimization, I also added the "-cl-opt-disable" option when building the kernel.
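For illustration, the configuration change described above can be sketched in C. This is only a sketch under stated assumptions, not the driver's actual code: it assumes the MemAttr register is 64 bits wide with eight 8-bit slots, that the five existing values sit in slots 0 to 4 in the order listed, and that the page table entry selects a slot through a 3-bit attribute index in bits [4:2], as in LPAE-format descriptors. The names `memattr_set_slot` and `pte_set_attr_index` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: eight 8-bit attribute slots packed into one 64-bit register,
 * slot i occupying bits [8*i+7 : 8*i]. */
#define MEMATTR_SLOTS 8u

/* Return `memattr` with slot `index` replaced by `attr`. */
static uint64_t memattr_set_slot(uint64_t memattr, unsigned index, uint8_t attr)
{
    assert(index < MEMATTR_SLOTS);
    unsigned shift = 8u * index;
    memattr &= ~((uint64_t)0xFF << shift);   /* clear the old slot value */
    memattr |= (uint64_t)attr << shift;      /* write the new slot value */
    return memattr;
}

/* Assumption: the page table entry carries a 3-bit attribute index in
 * bits [4:2] (LPAE-style). Not confirmed for this GPU. */
#define PTE_ATTR_SHIFT 2u
#define PTE_ATTR_MASK  ((uint64_t)7 << PTE_ATTR_SHIFT)

/* Return `pte` with its attribute index replaced by `index`. */
static uint64_t pte_set_attr_index(uint64_t pte, unsigned index)
{
    assert(index < MEMATTR_SLOTS);
    return (pte & ~PTE_ATTR_MASK) | ((uint64_t)index << PTE_ATTR_SHIFT);
}
```

Under these assumptions, packing 0x48, 0x4f, 0x4d, 0x88, 0x8d into slots 0 to 4 and 0x04 into slot 5 gives 0x0000048D884D4F48, and a page table entry would then use `pte_set_attr_index(pte, 5)` to select the new slot.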
Thanks in advance for your help!!
> Zhenyu Ning said: Do you happen to know the reason for the incorrect result?
Not tested, but try 0x4C.
Note that the GPU isn't designed to talk to I/O peripherals, so all memory accesses are treated as "normal" memory from the Arm architecture point of view, whereas you really want device or strongly ordered memory when talking to a peripheral.
There is a high probability that this won't work, even with uncached memory, as the hardware can legally drop writes which are overwritten by later writes, change access size, etc. These are all valid optimizations for "normal" memory, but invalid when talking to a physical peripheral.
To make sure the compiler won't get in the way, you should use a volatile pointer.
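As a rough illustration of the volatile pointer suggestion, here is a minimal sketch in plain C; in OpenCL C the kernel argument would instead be declared along the lines of `__global volatile uint *`. The function and parameter names are hypothetical. Note that `volatile` only stops the compiler from merging or dropping the loads; it places no constraint on the hardware caches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read the same 32-bit word `count` times through a volatile pointer.
 * `base`, `byte_offset`, and `out` are illustrative names only. */
static void read_repeatedly(const volatile uint32_t *base, size_t byte_offset,
                            uint32_t *out, size_t count)
{
    assert(base != NULL && out != NULL);
    assert(byte_offset % sizeof(uint32_t) == 0);

    const volatile uint32_t *reg = base + byte_offset / sizeof(uint32_t);
    for (size_t i = 0; i < count; i++) {
        /* Because `reg` is volatile, the compiler must issue a separate
         * load on every iteration instead of reusing the first value. */
        out[i] = *reg;
    }
}
```

For the scenario in this thread, the call would look like `read_repeatedly(buffer, 0x10, values, 10)`. Whether the ten loads actually reach the device still depends on the memory attributes of the mapping.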
Thanks for your suggestion. I used the "volatile" modifier in the kernel parameters, but the pointer does not really work like a volatile one.
Thanks for your suggestion and explanation. I tried both 0x4C and 0x44, but no luck. As you mentioned the hardware optimization, I did some other experiments to identify the real reason for this problem.
First, in an OpenCL kernel, I read the target I/O memory, then added about 1000 dummy reads to other I/O memory, and finally read the target memory again. My guess was that the hardware optimization is based on the instruction pipelines, and that the 1000 dummy reads would fill up the pipelines so that the GPU would not be aware of the final read while doing the optimization. However, it turns out that the first read and the final read return the same value.
Then, to verify whether the cache is still the problem, I did a similar experiment. First, I read the target memory, then read 0x10000 different locations to fill up the 256K GPU L2 cache, and finally read the target memory again. Now the first read and the final read give me different results.
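The shape of this second experiment can be sketched in plain C as follows. This is a sketch under assumptions, not the kernel used in the post: the 64-byte line size is assumed rather than confirmed for this GPU, and all function names are hypothetical. If the 0x10000 reads are consecutive 4-byte words, they cover exactly 256 KiB, which matches the L2 size mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define L2_SIZE_BYTES   (256u * 1024u)  /* L2 size quoted in the post */
#define LINE_SIZE_BYTES 64u             /* assumption, not confirmed for this GPU */

/* Number of line-strided reads needed to touch every cache line once. */
static size_t lines_to_cover_cache(size_t cache_bytes, size_t line_bytes)
{
    assert(line_bytes != 0 && cache_bytes % line_bytes == 0);
    return cache_bytes / line_bytes;
}

/* Read `target`, sweep over `scratch` to displace cached lines, then read
 * `target` again. Returns 1 if the two reads differ, 0 otherwise. */
static int reread_after_sweep(const volatile uint32_t *target,
                              const volatile uint32_t *scratch,
                              size_t scratch_bytes, size_t stride_bytes)
{
    assert(stride_bytes != 0 && stride_bytes % sizeof(uint32_t) == 0);

    uint32_t first = *target;
    uint32_t sink = 0;
    for (size_t off = 0; off < scratch_bytes; off += stride_bytes)
        sink ^= scratch[off / sizeof(uint32_t)];  /* volatile load per step */
    (void)sink;
    uint32_t second = *target;

    return first != second;
}
```

On ordinary memory the function returns 0. For a register whose value changes on every read, a return value of 1 after the sweep but not after unrelated dummy reads is consistent with the target line having been held in the cache.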
So, basically, I guess the cache is still the main cause of the problem. Do you happen to know any other way to disable caching for the page?
Thanks in advance!
Not that I'm aware of, sorry. This is really something the GPU isn't designed to be able to do, so I'm not sure it's even possible.
Got it. Thanks anyway for your kind reply all the time!