Double loop with CPU vs GPU

Hi,

I have a technical question about loops. Let's take an example.

int A[3000][4];

int B[3000][4];

int C[3000][4];

Using the CPU is very simple: I compare every A with every B.

for (int x = 0; x < 3000; x++) {
    for (int y = 0; y < 3000; y++) {
        // check what matches between A[x] and B[y], and write the result to C
    }
}
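
For reference, here is a minimal runnable sketch of the CPU loop above. It assumes that two rows "match" when all four values are equal, and that C receives one copy of each A row that has a match in B. The post does not define either point, and the names rows_match and match_all are made up for illustration.

```c
#include <assert.h>
#include <string.h>

#define N 3000   /* rows in A and B, as in the question */
#define W 4      /* ints per row */

/* Assumption: two rows "match" when all W values are equal. */
static int rows_match(const int *a, const int *b) {
    return memcmp(a, b, sizeof(int) * W) == 0;
}

/* Compares the first n rows of A with the first n rows of B.
 * Each A row that matches at least one B row is copied to C once.
 * Returns the number of rows written to C (at most n). */
static int match_all(int n, int A[][W], int B[][W], int C[][W]) {
    assert(n >= 0 && n <= N);
    int count = 0;
    for (int x = 0; x < n; x++) {
        for (int y = 0; y < n; y++) {
            if (rows_match(A[x], B[y])) {
                memcpy(C[count], A[x], sizeof(int) * W);
                count++;
                break;  /* one output row per matching A row */
            }
        }
    }
    return count;
}
```

Calling match_all(3000, A, B, C) runs the full 3000 x 3000 comparison from the question in the worst case.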

If I want to do the same thing on the GPU, I will need to call the same kernel 3000 times, sending each A to be compared with all of B. In this case, which would be faster: the CPU or the GPU?

With the CPU I can use multi-core threading, and I need to do it 8 times. So with the GPU I will need to run 24,000 kernels with a range of (16*16) and a buffer of (400,32), which gives 50 work-groups per kernel and 1,200,000 work-groups altogether for the whole processing.
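
The work-group figures above can be checked with a short sketch. The helper names are hypothetical, and reading the 24,000 launches as 3000 kernels times the 8 passes is an assumption about how that figure was obtained.

```c
#include <assert.h>

/* Work-groups needed to cover a buf_x by buf_y buffer with
 * local_x by local_y work-groups. The local sizes must divide
 * the buffer sizes evenly, as they do for (400,32) and (16,16). */
static long groups_per_kernel(long buf_x, long buf_y,
                              long local_x, long local_y) {
    assert(local_x > 0 && local_y > 0);
    assert(buf_x % local_x == 0 && buf_y % local_y == 0);
    return (buf_x / local_x) * (buf_y / local_y);
}

/* Total work-groups over all kernel launches. */
static long total_groups(long kernels, long per_kernel) {
    return kernels * per_kernel;
}
```

With these, groups_per_kernel(400, 32, 16, 16) gives 25 * 2 = 50, and total_groups(3000 * 8, 50) gives 1,200,000, which agrees with the numbers in the post.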

I hope the question is not stupid.

Thanks in advance.

Parents
  • After a lot of testing, I would say that for this kind of X^2 problem it depends on the size of X and, on mobile, on the moment when CPU frequency scaling starts to slow the CPU down.

    For X under 1000 to 1500 the CPU performs a lot better, until scaling starts. At X around 2000 the two look equal, but above 2000 the GPU performs better.

    The problem is CPU scaling, so I will try to use only the GPU for both small and large values of X.

    Maybe the trick on mobile is to avoid massive CPU work. It looks like it does not like it too much, but now I know why. And if I run the loop as a function of the amount of data to process, it is 1 kernel for every 256 items to check.

    The question was not so stupid ;))

Children
  • Hi,

    I tried it for 3 days and my conclusion is that a GPU does not work like a CPU. I knew that, but I tried anyway.

    So, comparing A[3000] with B[3000] can be done on the GPU, but it is complicated, and the output buffer must hold 3000*3000 entries in case every A matches every B. The work is also done in no particular order, so nothing is sequential. The GPU will always be faster if the amount of data is huge. It is really made for massive matrix calculations.

    But with the CPU you can use an index file and sequential work, so there are more possibilities for a double loop, like:

    for (int X = 0; X < end; X++) {
        for (int Y = X; Y < end; Y++) {
        }
    }   // 1/2 * X^2 iterations if the data is ordered

    which is not possible on the GPU because the global indexes X and Y cannot be shared between all threads of all groups. It does not work. I tried it last week (see the post about debugging on Khronos).
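
The 1/2 * X^2 figure for the triangular loop above can be checked with a small sketch that counts iterations for both loop shapes. The function names are made up for illustration. Because the inner loop starts at Y = X, the diagonal is included, so the exact count is n*(n+1)/2, which is roughly half of n^2 for large n.

```c
#include <assert.h>

/* Iterations of the full double loop: every x against every y. */
static long count_full(long n) {
    assert(n >= 0);
    long count = 0;
    for (long x = 0; x < n; x++)
        for (long y = 0; y < n; y++)
            count++;
    return count;
}

/* Iterations of the triangular loop: y starts at x, so each
 * unordered pair is visited once (diagonal included). */
static long count_triangular(long n) {
    assert(n >= 0);
    long count = 0;
    for (long x = 0; x < n; x++)
        for (long y = x; y < n; y++)
            count++;
    return count;
}
```

For n = 3000 the full loop runs 9,000,000 times and the triangular loop 4,501,500 times.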

    So the question was not so stupid, but the GPU world is very different from the CPU world. Both have their advantages and disadvantages. The GPU is for calculation and massive work on matrices in random order. The CPU is for logic work in sequential or indexed order.

    The real problem is frequency scaling on the CPU. So it would be a very good idea to produce a mobile device for gaming and AI purposes with a good cooling system to avoid scaling. This would be a step toward laptops and desktops.

    Frequency scaling is the real bottleneck on mobile. We have CPUs that run very fast, but we can only use, let's say, 25% of their capability.

    Let's wait for the Nvidia N1X and see what we can do with it.

    GPU vs CPU is not a question of speed; it is a question of what you need done and how you plan to do it.

    The problem I am trying to solve is associating vectors with each other. Loops are good on the CPU. I will try to find out whether I can do this on the GPU; I need to find another way. But I will always need to do some work on the CPU, because the GPU works in random order and its output is not indexed: a global index does not work between work-groups because they run in parallel.

    PS: I may be wrong on some points, so do not hesitate to let me know.

  • Hi,

    After porting the CPU work to the GPU, I found that the problem is not CPU versus GPU. I tried GPU only, many small CPU cores only, and big CPU cores only. The big CPU cores are still a little faster. The worst case is running the GPU and CPU at the same time, which just doubles the processing time.

    So I think that on mobile and laptops the limit is the number of instructions that can be processed per unit of time. Using the GPU or the CPU just depends on what kind of work you need to do, but you cannot do more work than the processor can sustain before overheating. ;))

    So I understand why they try to reduce the processor's feature size and use RISC instructions.

    Conclusion: mobile devices are limited by instructions per unit of time. That is why the time can change from one frame to the next: one goes faster, the next slower, but on average it is the same. The 6 seconds are just the time needed to work out the right frequency to use. Still, removing as many "if" statements as you can and reducing the array size is a good idea.