Hello, I'm trying to optimize some OpenCL code. The kernels we enqueue work on 2 write-only buffers and 2 read-only buffers, all of them mapped to the host.
The proposed simplified workflow is the following:
- Compile Programs (P1, P2)
- Allocate buffers (W1, W2, R1, R2), initialization, (...)
- Map Buffers to Host
- Fill W1
- Unmap W1, R1
------loop-------
- enqueue P1, with W1 and R1 as Arg into OpenCL Device CmdQueue
- Schedule() <P2 End>
- Map W2, R2
- Fill W2
- Read R2
- Unmap W2, R2
- enqueue P2, with W2 and R2 as Arg into OpenCL Device CmdQueue
- Schedule() <P1 End>
- Map W1, R1
- Read R1
----end loop ---
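To make the ordering explicit, here is a minimal sketch of the loop as a plain C model. It makes no OpenCL calls: `BufferPair`, `map_pair`, `unmap_pair`, `enqueue_kernel`, `wait_kernel` and `run_pingpong` are hypothetical names that only stand in for the real commands named in the comments. It assumes the conservative rule that a pair must be unmapped before a kernel may use it, and it assumes W1 is refilled and unmapped at the end of each round, which the list above leaves implicit.

```c
#include <stdbool.h>

/* Hypothetical model, no OpenCL calls: tracks who may touch a buffer pair. */
typedef enum { OWNER_HOST, OWNER_DEVICE } Owner;

typedef struct {
    Owner owner;      /* OWNER_HOST while mapped, OWNER_DEVICE after unmap */
    bool kernel_busy; /* true between enqueue and the kernel's end event */
} BufferPair;

/* Stands in for clEnqueueMapBuffer on both buffers of the pair. */
static bool map_pair(BufferPair *p) {
    if (p->kernel_busy || p->owner == OWNER_HOST) return false;
    p->owner = OWNER_HOST;
    return true;
}

/* Stands in for clEnqueueUnmapMemObject on both buffers of the pair. */
static bool unmap_pair(BufferPair *p) {
    if (p->owner != OWNER_HOST) return false;
    p->owner = OWNER_DEVICE;
    return true;
}

/* Stands in for clEnqueueNDRangeKernel; refuses a pair that is still mapped. */
static bool enqueue_kernel(BufferPair *p) {
    if (p->owner != OWNER_DEVICE || p->kernel_busy) return false;
    p->kernel_busy = true;
    return true;
}

/* Stands in for waiting on the kernel's end event (e.g. clWaitForEvents). */
static void wait_kernel(BufferPair *p) { p->kernel_busy = false; }

/* Runs `iterations` rounds of the loop. Returns the number of pair-level
 * map/unmap steps issued inside the loop, or -1 if a step broke the rule. */
int run_pingpong(int iterations) {
    BufferPair b1 = { OWNER_HOST, false }; /* W1/R1, mapped during setup */
    BufferPair b2 = { OWNER_HOST, false }; /* W2/R2, mapped during setup */
    int steps = 0;

    /* Setup: fill W1, then unmap W1/R1 (not counted as a loop step). */
    if (!unmap_pair(&b1)) return -1;

    for (int i = 0; i < iterations; ++i) {
        if (!enqueue_kernel(&b1)) return -1;   /* enqueue P1 */
        wait_kernel(&b2);                      /* Schedule() <P2 End> */
        if (b2.owner != OWNER_HOST) {          /* round 0: still mapped from setup */
            if (!map_pair(&b2)) return -1;
            ++steps;
        }
        /* fill W2, read R2 */
        if (!unmap_pair(&b2)) return -1;
        ++steps;
        if (!enqueue_kernel(&b2)) return -1;   /* enqueue P2 */
        wait_kernel(&b1);                      /* Schedule() <P1 End> */
        if (!map_pair(&b1)) return -1;
        ++steps;
        /* read R1, refill W1 (assumed; not spelled out in the list) */
        if (!unmap_pair(&b1)) return -1;
        ++steps;
    }
    return steps;
}
```

In this model every round after the first costs four pair-level map/unmap steps, which is eight buffer-level commands, and those are the calls I would like to avoid if the implementation allows it.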
Is it necessary to unmap the buffers from the host (and then remap them) before a kernel starts using them? As far as I remember, the spec says this is implementation-defined. I already asked the board manufacturer (ODROID Forum, topic "OpenCL Mapped Buffer Map (Unmap) Implementation Behavior"), but they told me to ask here.
How much time would be gained by not enqueueing map and unmap commands for each buffer? The point is that the kernels run very fast, so those calls get queued pretty often.
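I don't have measurements yet, but this is roughly how I would estimate the possible gain. The helper below is a hypothetical sketch, not part of any API: it takes per-command durations in nanoseconds, such as the difference between `CL_PROFILING_COMMAND_END` and `CL_PROFILING_COMMAND_START` from `clGetEventProfilingInfo` on a queue created with `CL_QUEUE_PROFILING_ENABLE`, and returns the share of one loop iteration spent in map/unmap commands. It assumes the commands are not overlapped with kernel execution, so the result is only an upper bound.

```c
#include <stdint.h>

/* Hypothetical helper: fraction of one loop iteration spent in map/unmap
 * commands. All durations are in nanoseconds and must be measured by the
 * caller; nothing here talks to OpenCL. Returns 0.0 if iter_ns is zero. */
double map_unmap_fraction(unsigned maps_per_iter, uint64_t map_ns,
                          unsigned unmaps_per_iter, uint64_t unmap_ns,
                          uint64_t iter_ns) {
    if (iter_ns == 0) return 0.0;
    /* Total time attributed to map and unmap commands in one iteration. */
    uint64_t overhead_ns = (uint64_t)maps_per_iter * map_ns
                         + (uint64_t)unmaps_per_iter * unmap_ns;
    return (double)overhead_ns / (double)iter_ns;
}
```

For the loop above, `maps_per_iter` and `unmaps_per_iter` would both be 4 (each of the four buffers is mapped and unmapped once per round).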
-- Platform Data --
Board: ODROID-XU3
Processor: Samsung Exynos5422 ARM® Cortex™-A15 Quad 2.0GHz/Cortex™-A7 Quad 1.4GHz
GPU: Mali™-T628 MP6 OpenGL ES 3.0 / 2.0 / 1.1 and OpenCL 1.1 Full profile
(Details: ODROID-XU3)
Best Regards!