Memory Optimization on Mali GPU

Hi everyone,

I have recently been working on a GPU application that will run on an Arndale board and use its Mali GPU. To make the program run faster, I want to optimize its memory usage. According to the OpenCL guide, CL_MEM_ALLOC_HOST_PTR should be used to improve performance, and the use of CL_MEM_USE_HOST_PTR is discouraged.

However, in my experiments I found that CL_MEM_USE_HOST_PTR actually reduces data transfer time but increases kernel execution overhead. I also found that a data copy is unavoidable in both cases (CL_MEM_ALLOC_HOST_PTR and CL_MEM_USE_HOST_PTR).

Can anyone confirm? Is it possible at all to have a zero copy?

The Mali OpenCL guide says that using CL_MEM_ALLOC_HOST_PTR requires no copy, but there is one. Say I have a pointer A and I create a buffer using CL_MEM_ALLOC_HOST_PTR. To make the data at A available to the GPU, I have to memcpy it from A into the allocated space I get from CL_MEM_ALLOC_HOST_PTR.

So a data copy is still needed. Is there a way to access the data directly from the GPU without any copying?
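
To make the copy in question concrete, here is a minimal sketch in plain C that contrasts the two host-side patterns: building the data in a separate array A and copying it into the mapped region, versus writing straight into the mapped pointer. The names `produce`, `fill_via_copy`, and `fill_in_place` are hypothetical, and an ordinary heap allocation stands in for the pointer returned by clEnqueueMapBuffer so that the sketch runs without an OpenCL device. It only illustrates the application-level memcpy; whether the driver itself copies on map/unmap is a separate matter that this sketch does not address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical producer: writes n copies of value into any destination. */
void produce(float *dst, size_t n, float value)
{
    assert(dst != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = value;
}

/* Pattern 1: data is built in a separate host array A, then copied
 * into the mapped region. Returns 0 on success, -1 if malloc fails. */
int fill_via_copy(float *mapped, size_t n, float value)
{
    float *A = malloc(n * sizeof *A);
    if (A == NULL)
        return -1;
    produce(A, n, value);
    memcpy(mapped, A, n * sizeof *A);   /* the extra copy */
    free(A);
    return 0;
}

/* Pattern 2: the producer writes directly into the mapped region,
 * so no intermediate array and no memcpy are needed. */
void fill_in_place(float *mapped, size_t n, float value)
{
    produce(mapped, n, value);
}
```

In a real program, `mapped` would be the pointer obtained from clEnqueueMapBuffer on a CL_MEM_ALLOC_HOST_PTR buffer, and it would have to be released with clEnqueueUnmapMemObject before a kernel uses the buffer. Pattern 2 only applies when the data can be generated or loaded directly into that pointer rather than already living at A.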

PS: I have attached my code for your feedback.

UPDATE: I have uploaded a version that uses CL_MEM_ALLOC_HOST_PTR for your review.

This is the code snippet:

#ifdef mem_alloc_host

	start = getTime();
	a_st=getTime();
	bufferA = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR, sizeof(cl_float) * ELE_NUM, NULL, &err);
	cl_float* src_a = (cl_float*)clEnqueueMapBuffer(commandQueue, bufferA, CL_TRUE, CL_MAP_WRITE, 0, sizeof(cl_float) * ELE_NUM, 0, NULL, NULL, &err);

	bufferB = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR, sizeof(cl_float) * ELE_NUM, NULL, &err);
	cl_float* src_b = (cl_float*)clEnqueueMapBuffer(commandQueue, bufferB, CL_TRUE, CL_MAP_WRITE, 0, sizeof(cl_float) * ELE_NUM, 0, NULL, NULL, &err);
	clFinish(commandQueue);
	a_en=getTime();
	a_time=a_time+(a_en-a_st);

	pfill_s=getTime();
	for (int i = 0; i < ELE_NUM; i++){
		src_a[i] = 100.0;
		src_b[i] = 11.1;

	}
	pfill_e=getTime();
	pfill_time=pfill_time+(pfill_e-pfill_s);

	b_st=getTime();
	clEnqueueUnmapMemObject(commandQueue, bufferB, src_b, 0, NULL, NULL);
	clEnqueueUnmapMemObject(commandQueue, bufferA, src_a, 0, NULL, NULL);
	clFinish(commandQueue);
	b_en=getTime();
	b_time=b_time+(b_en-b_st);

	end = getTime();
	creat_buffer += (end-start);
	bufferC = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR, sizeof(cl_float) * ELE_NUM, NULL, &err);

#endif
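
The snippet relies on a getTime() helper whose definition is not shown. A minimal sketch of such a timer follows, assuming it returns wall-clock time in milliseconds as a double; the name `get_time_ms`, the unit, and the choice of a monotonic clock are assumptions for illustration, not necessarily what the attached code does.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <time.h>

/* Hypothetical stand-in for getTime(): milliseconds from a monotonic
 * clock, so differences between two calls are not affected by
 * adjustments to the system time. */
double get_time_ms(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;  /* avoid an unused-variable warning when NDEBUG is set */
    return (double)ts.tv_sec * 1000.0 + (double)ts.tv_nsec / 1.0e6;
}
```

Because the map and unmap calls are enqueued on a command queue, an interval measured this way is only meaningful when the queue has been drained (for example with clFinish, as the snippet does) before the second timestamp is taken.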

6176.zip

Top replies

Michael McGeagh over 11 years ago in reply to Veeranna +1 verified

Hi veerannah, It is worth remembering that the Arndale is a development board and is designed to be pushed to its limits. What you do on this hardware may not work well on production devices due to different...