What is considered "good" and "average" pipeline utilization?

We are using an ARM Cortex-M4 in our application. Recently I have been working on optimized, performance-critical DSP code. The code is written in C and compiled for the target using ARM Compiler 6 (armclang). When testing, I get a cycle count which is considerably higher than expected. Looking at the disassembly, the generated code seems pretty good, which makes me assume the difference comes from pipeline stalls.

Assuming a small function is executed in an interrupt-free environment, what would be considered good pipeline utilization that I can expect from compiled code?

As a baseline: in our case, the C code is translated to ~3200 instructions (LDRSH, SMLABB), while the execution time is ~5000 cycles.
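To put those figures in perspective, they work out to roughly 1.56 cycles per instruction, or about 1800 cycles beyond an ideal of one cycle per instruction. Below is a minimal sketch of that arithmetic; the helper names `cpi_milli` and `excess_cycles` are hypothetical and only illustrate the calculation. The ratio is scaled by 1000 to stay in integer math.

```c
#include <stdint.h>

/* Average cycles per instruction, scaled by 1000 (integer math only).
 * Hypothetical helper for illustration. Returns 0 if instructions == 0. */
uint32_t cpi_milli(uint32_t cycles, uint32_t instructions)
{
    if (instructions == 0u) {
        return 0u;
    }
    /* Widen to 64 bits so cycles * 1000 cannot overflow. */
    return (uint32_t)(((uint64_t)cycles * 1000u) / instructions);
}

/* Cycles spent beyond an ideal of one cycle per instruction.
 * This is an upper bound on stall cycles, since some instructions
 * legitimately take more than one cycle. */
uint32_t excess_cycles(uint32_t cycles, uint32_t instructions)
{
    return (cycles > instructions) ? (cycles - instructions) : 0u;
}
```

With the numbers above, `cpi_milli(5000, 3200)` gives 1562 (about 1.56 cycles per instruction) and `excess_cycles(5000, 3200)` gives 1800.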

On a related note: using the Arm Development Studio environment and a DSTREAM debugger, is it possible to observe individual pipeline stages and see where bubbles are formed?
