
Using Streamline Instruction Executed counter to measure MIPS

Hello,

We are adding some extra sound effects to Android's mediaserver. We're using DS-5's Streamline to measure the performance of an active thread that implements this sound effect on a Nexus 5 phone. This phone's CPU has four cores, which are correctly detected by Streamline. We build the entire Android AOSP platform for Android 5/6, using the prebuilt toolchain supplied by Android. The shared library containing the thread being measured was compiled with gcc.

I use DS-5/Streamline by playing a media file for one minute while Streamline captures the CPU activity. I've done the following:

- compiled all code that implements the thread using the flags -g -fno-inline -fno-omit-frame-pointer, as described in

   Streamline User Guide | Recommended compiler options | ARM DS-5 Development Studio

- pushed the compiled shared library (with symbols) to the phone

- in Streamline's Capture & Analysis Options, selected "High Resolution Timeline", and added the location of the shared library with symbols to "Program Images"

After the test, I expand the Cross Section Marker to cover a time period of one minute. The Instructions Executed counter then displays the total number of instructions executed during this period.

I filter all counters for the process I want to measure, and divide the filtered Instruction Executed count by 60 to get the MIPS figure, averaged for the four cpu cores.
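As a rough sketch of that arithmetic, assuming the filtered counter value is a raw instruction count summed over the 60-second window (the function name `mips` is made up for illustration, and the two counts are the figures quoted in question 3 below):

```python
def mips(instruction_count: float, seconds: float) -> float:
    """Average millions of instructions per second over a capture window."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return instruction_count / seconds / 1e6


# Assumed inputs: 40 Ginstructions (with the flags) and 26 Ginstructions
# (without them), each measured over a 60-second cross-section.
with_flags = mips(40e9, 60.0)
without_flags = mips(26e9, 60.0)
print(f"{with_flags:.1f} MIPS with flags, {without_flags:.1f} MIPS without")
```

This only converts a count to a rate; whether the result should be read as a per-core or an all-core figure depends on how Streamline aggregates the filtered counter.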

My questions are:

1. Is this the best way to measure the MIPS, using DS-5?

2. Using these compile-time options seems counter-intuitive when taking profiling measurements. For example, the whole point of inlining is to improve performance. Should these flags be used for profiling measurements?

3. When I don't use the compile-time flags -fno-inline -fno-omit-frame-pointer listed above, I get a total Instruction Executed count about 35% lower (26 Ginstruction vs 40 Ginstruction). However, the indicated CPU activity for the same thread, averaged over the four cores, is only about 15% lower (11.3% vs 13.4%). Using or omitting the -g flag makes no difference, which also seems counter-intuitive.

Many thanks,

Paul

  • Paul,

    Regarding question 3, I believe that -g only adds extra sections to the ELF image which are not loaded at runtime, so it does not affect performance. But as Wade states, for what you're doing, you don't need to provide -g or -fno-inline -fno-omit-frame-pointer anyway.

    If I understand correctly, with -fno-inline -fno-omit-frame-pointer you get an Instruction Executed count of 40 Ginstruction and an average CPU activity of 13.4%, and without them you get 26 Ginstruction and 11.3%. So I think the question is: why is Instruction Executed 35% lower while CPU activity is only 15% lower; shouldn't they drop by the same amount? There are a few reasons this could be happening. The most likely is that the clock frequency is lower when -fno-inline -fno-omit-frame-pointer is omitted. You should be able to check this in Streamline. Other possibilities are that instructions per cycle are also lower, or, if your device is big.LITTLE, that you're running on a different core type.
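To put a number on that mismatch, here is a small sketch. It assumes CPU activity is proportional to the time the thread spends running, so that instructions divided by activity tracks clock frequency times instructions per cycle. The function name `relative_throughput` is made up for illustration.

```python
def relative_throughput(instr_a: float, activity_a: float,
                        instr_b: float, activity_b: float) -> float:
    """Ratio of instructions per unit of busy time, build B relative to build A.

    Assumes activity (in percent) is proportional to busy time, so the
    result approximates (frequency * IPC) of B divided by that of A.
    """
    return (instr_b / activity_b) / (instr_a / activity_a)


# Build A: with -fno-inline -fno-omit-frame-pointer (40 Ginstr, 13.4% activity).
# Build B: without those flags (26 Ginstr, 11.3% activity).
ratio = relative_throughput(40e9, 13.4, 26e9, 11.3)
print(f"{ratio:.2f}")  # about 0.77
```

Under that assumption, the build without the flags retires roughly 23% fewer instructions per unit of busy time, which is the gap that a lower clock frequency, lower instructions per cycle, or a different core type would have to account for.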

    Drew
