
Using Streamline's Instruction Executed counter to measure MIPS

Hello,

We are adding some extra sound effects to Android's mediaserver. We're using DS-5's Streamline to measure the performance of the active thread that implements this sound effect on a Nexus 5 phone. The phone's CPU has four cores, which Streamline detects correctly. We build the entire Android AOSP platform for Android 5/6 using the prebuilt toolchain supplied by Android; the shared library containing the thread being measured is compiled with gcc.

I use DS-5/Streamline by playing a media file for one minute while capturing the CPU activity with Streamline. I've done the following:

- compiled all the code that implements the thread with the flags -g -fno-inline -fno-omit-frame-pointer (a sketch of how these can be added to the module's build file follows this list), as described in

   Streamline User Guide | Recommended compiler options | ARM DS-5 Development Studio

- pushed the compiled shared library (with symbols) to the phone

- in Streamline's Capture & Analysis Options, selected "High Resolution Timeline", and added the location of the shared library with symbols to "Program Images"
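
For illustration, assuming the effect library is built as an ordinary Android.mk module (the module name below is hypothetical), the recommended flags can be added to the module definition like this:

    # Hypothetical module name; the flags are applied to both C and C++ sources.
    LOCAL_MODULE   := libmysoundfx
    LOCAL_CFLAGS   += -g -fno-inline -fno-omit-frame-pointer
    LOCAL_CPPFLAGS += -g -fno-inline -fno-omit-frame-pointer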

After the test, I expand the Cross Section Marker to cover the one-minute period. The Instructions Executed counter then shows the total number of instructions executed during this elapsed period of time.

I filter all counters for the process I want to measure, and divide the filtered Instructions Executed count by the 60-second duration (and by 10^6) to get the MIPS figure, averaged across the four CPU cores. A worked example follows.
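
As a worked example, using the 40 Ginstruction figure quoted in question 3 below:

    40e9 instructions / 60 s ≈ 667e6 instructions per second, i.e. ≈ 667 MIPS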

My questions are:

1. Is this the best way to measure MIPS using DS-5?

2. Using these compile-time options seems counter-intuitive when taking profiling measurements. For example, the whole point of inlining is to speed up performance, yet the recommended options disable it. Do these flags affect the profiling measurements?

3. When I don't use the compile-time flags -fno-inline -fno-omit-frame-pointer listed above, the total Instruction Executed count is about 35% lower (26 Ginstructions vs 40 Ginstructions). However, the indicated CPU activity for the same thread, averaged over the four cores, is only about 15% lower (11.3% vs 13.4%). Using or omitting the -g flag makes no difference, which also seems counter-intuitive.

Many thanks,

Paul

Reply
  • So I think the question is, why is Instruction Executed 35% less but CPU activity is 15% less, shouldn't they be the same?

    One other possibility to consider: assuming the instructions that were removed were non-memory-accessing ones (e.g. the branches and veneers of the previously non-inlined functions that are now inlined), then this seems "normal" for a high-frequency applications core.


    One load instruction which touches memory and misses in the L1 cache will be a lot slower than one arithmetic instruction: more than 20 cycles to reach L2, and more than 100 cycles to reach DDR, versus 1 cycle for an arithmetic operation (or even less than one on a multi-issue core). If you significantly reduce the number of arithmetic instructions, the relative number of memory accesses per instruction goes up, and so the average achieved instructions-per-clock will also drop; the cycle count, and hence the CPU activity figure, therefore falls by less than the instruction count. A rough worked model is sketched below.

    TLDR: Memory accesses are expensive

    Cheers,
    Pete
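
    As a purely illustrative model (the numbers below are invented to roughly match the shape of the figures in the question, not measured values), here is a small C sketch of why the cycle count falls less than the instruction count when only cheap instructions are removed:

        /* Illustrative only: assume the number of L1-missing memory accesses is the
         * same in both builds, and that enabling inlining removes only cheap
         * ~1-cycle instructions. */
        #include <stdio.h>

        int main(void)
        {
            const double mem_ops       = 2.0e9;   /* memory accesses that miss L1       */
            const double mem_penalty   = 20.0;    /* cycles per such access (L2 hit)    */
            const double cheap_noinl   = 38.0e9;  /* ~1-cycle instructions, -fno-inline */
            const double cheap_inlined = 24.0e9;  /* ~1-cycle instructions, inlined     */

            const double instr_noinl    = cheap_noinl   + mem_ops;  /* ~40 G */
            const double instr_inlined  = cheap_inlined + mem_ops;  /* ~26 G */
            const double cycles_noinl   = cheap_noinl   + mem_ops * mem_penalty;
            const double cycles_inlined = cheap_inlined + mem_ops * mem_penalty;

            printf("instructions drop:           %.0f%%\n",
                   100.0 * (1.0 - instr_inlined / instr_noinl));
            printf("cycles (~CPU activity) drop: %.0f%%\n",
                   100.0 * (1.0 - cycles_inlined / cycles_noinl));
            return 0;
        }

    With these made-up numbers the instruction count drops by ~35% while the cycle count drops by only ~18%, which is the shape of the effect described above.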

