Cycles calculation in beagle board

Note: This was originally posted on 22nd November 2010 at http://forums.arm.com

Hello All,

I am trying to use the cycle counter registers of the Cortex-A8 to count cycles.
The following is the code I am using.

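#include <stdio.h>

/* ccnt_init(), ccnt_start(), ccnt_read() and ccnt_stop() are helper
   functions for the cycle counter, defined elsewhere (not shown). */
int main()
{
     int a,b,c,n;
     unsigned int cycles;
     printf("Enter a: ");
     scanf("%d",&a);
     printf("Enter b: ");
     scanf("%d",&b);
     printf("Enter c: ");
     scanf("%d",&c);
     printf("No of times to run: ");
     scanf("%d",&n);

     ccnt_init();
     ccnt_start();
     cycles=ccnt_read();
     {
             c=a+b;   /* the operation being timed */
     }
     cycles=ccnt_read()-cycles;
     ccnt_stop();
     printf("Sum : %d\n",c);
     printf("Cycles : %u\n",cycles);
     return 0;
}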

The simple integer addition above takes 5 cycles, since it includes load operations.
If all of the above variables are made double, it takes 3800 cycles with NEON enabled and 1900 cycles with NEON disabled.
Kindly explain why I am getting these values.
I am running this code on a BeagleBoard xM-A3 and compiling it with the CodeSourcery 2010q1 toolchain.
I am wondering whether this is due to interrupts. If so, how do I disable them?

Thanks in advance..

Peter Harris over 12 years ago

Note: This was originally posted on 25th November 2010 at http://forums.arm.com

You may also want to look at how your floating point is being provided. In many cases on Linux it defaults to a floating-point library (which may be hardware or software) that is loaded at run-time. A lot of Linux distros default to this shared object implementation, which adds veneer call overheads to every FPU operation. Even with "hard float" you may still find that you link against a library rather than emitting float instructions directly into the binary - so you may want to dump the image using objdump to make sure you are emitting hard-float instructions directly.
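As a complement to the objdump check, the compiler's predefined macros give a quick hint about which floating-point mode a build targets. This is only a sketch: `float_mode` and `print_float_mode` are hypothetical helper names, and `__SOFTFP__` and `__ARM_PCS_VFP` are GCC predefined macros that older toolchains may not all define, so treat the result as a hint and still inspect the disassembly.

```c
#include <stdio.h>

/* Hypothetical helper: reports the floating-point mode this translation
   unit was compiled for, based on GCC's predefined macros. It says nothing
   about which libraries end up being linked in. */
const char *float_mode(void)
{
#if defined(__SOFTFP__)
    return "soft-float: FP arithmetic is done by library calls";
#elif defined(__ARM_PCS_VFP)
    return "hard-float ABI: FP instructions, FP values passed in VFP registers";
#elif defined(__arm__)
    return "softfp: FP instructions, FP values passed in integer registers";
#else
    return "not a 32-bit ARM build";
#endif
}

/* Prints the detected mode; intended to be called once at start-up. */
void print_float_mode(void)
{
    printf("Float mode: %s\n", float_mode());
}
```

Calling `print_float_mode()` at the top of `main()` in the test program would show whether the double-precision build was compiled for soft-float at all, before spending time on cycle counts.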

Secondly - symbols in a shared object are commonly resolved "lazily" on Linux - that is, each one is resolved by the dynamic linker the first time it is called and found to be missing. You may well find that part of your 1900 cycles is spent resolving a link into the shared object. ttfn's idea of performing the operation once outside the timed region, before timing it, should solve this one.
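The warm-up idea can be sketched as follows. This is an illustration under stated assumptions, not the original poster's code: `read_counter`, `add_doubles`, `measure_min` and the `USE_PMCCNTR` macro are hypothetical names, and the ARM path assumes the cycle counter has already been enabled with user-mode access granted. Without `USE_PMCCNTR` the sketch falls back to the standard `clock()` so that it builds anywhere.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Read a free-running counter. With USE_PMCCNTR (hypothetical macro) on
   32-bit ARM this reads the Cortex-A8 cycle counter, assuming it is already
   enabled and readable from user mode. Otherwise clock() is a stand-in. */
uint32_t read_counter(void)
{
#if defined(__arm__) && defined(USE_PMCCNTR)
    uint32_t v;
    __asm__ volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(v));
    return v;
#else
    return (uint32_t)clock();
#endif
}

/* volatile keeps the compiler from folding the addition away. */
volatile double op_a = 1.5, op_b = 2.25, result;
unsigned calls;   /* number of times the workload has run */

/* Workload under test: one double-precision addition. */
void add_doubles(void)
{
    result = op_a + op_b;
    calls++;
}

/* Run fn once untimed, so that any lazy symbol resolution and cold-cache
   cost is paid up front, then time it `runs` times and keep the smallest
   reading. The result still includes the overhead of reading the counter. */
uint32_t measure_min(void (*fn)(void), int runs)
{
    uint32_t best = UINT32_MAX;

    assert(runs > 0);
    fn();   /* warm-up call, outside the timed region */
    for (int i = 0; i < runs; i++) {
        uint32_t start = read_counter();
        fn();
        uint32_t elapsed = read_counter() - start;
        if (elapsed < best)
            best = elapsed;
    }
    return best;
}
```

A call such as `measure_min(add_doubles, 100)` would then report the best case after warm-up. Taking the minimum over several runs also reduces the effect of an occasional interrupt landing inside the timed region.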

Finally - don't use doubles if you can use floats. For an embedded platform doubles are horrendously expensive - and for most use cases float is fine.