
Trying to find basic performance measurements of ARM cores

Howdy, I was trying to find some basic performance benchmarks for a few different ARM cores: the ARM926EJ-S, the Cortex-A9, and the Cortex-M7.

I am primarily looking for DMIPS (per MHz, or any form I can scale to my specific chip), MFLOPS (where applicable; I don't think the 926EJ-S has a floating-point coprocessor), CPI (since it varies with the instruction mix, a standardized average is fine), and, if possible, figures such as the branch-misprediction penalty and the cache-miss penalty.

If it makes a difference, the actual chips we are (or would be) purchasing are, respectively: the NXP i.MX27, the NXP i.MX6, and the NXP i.MX RT1050.

I have spent hours looking through the ARM Information Center. I found some of the data on the ARM926EJ-S there (an average CPI of 1.5 and 1.1 DMIPS per MHz), but the newer parts didn't seem to have equivalent figures.

As a side note, I don't have hardware in hand for all of them; is this information available in something like the ARMulator, which I have seen references to? I realize that question probably belongs in a different forum category, though (and I can ask there if it isn't publicly available information).

Thanks

  •  It can be quite complex, as some cores have TCM (tightly coupled memory), which is just another name for a scratchpad (like the one in the PlayStation 1). If your code and data are in TCM, other elements of the chip can be doing other work in parallel; TCM doesn't collide with the FPC bus, for example. The beauty of ARM is that the compilers do a very good job, so extracting 85-90% of the possible performance isn't tricky. But there is also an artful side: you can dynamically change the cache characteristics (for example), or organize preloads (read a byte from almost-certainly-uncached memory but don't use the result) to take advantage of speculative execution. I admit I've not tried this on the really high-performance cores, but reading from uncached RAM and then immediately loading or XORing into the destination register of that read makes the result of the fetch (which has already begun) unnecessary, so it doesn't stall the pipeline. That can, in effect, make all data and most code appear to come from zero-wait-state RAM.
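     The preload idea above can be sketched in C without dropping to assembly; __builtin_prefetch is a GCC/Clang builtin that emits a PLD on ARM cores that have one (the buffer and prefetch distance here are purely illustrative):

```c
#include <stddef.h>
#include <stdint.h>

/* Hide memory latency by issuing a prefetch several iterations ahead of
   use; by the time buf[i+8] is actually read, the fetch (which has
   already begun) has had time to complete. */
uint32_t sum_with_prefetch(const uint32_t *buf, size_t n) {
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 8 < n)
            __builtin_prefetch(&buf[i + 8]);  /* start the fetch early */
        sum += buf[i];
    }
    return sum;
}
```

Whether this helps (and the right prefetch distance) depends entirely on the core, the bus, and the memory behind it, so measure on your actual part.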

     If you have a more specific need, don't be frightened to delve into assembly language. Little things help, like aligning code and data on cache-line boundaries: 50% of the time that will save a line fill. If DMA or similar is sharing the same bus as the CPU, Thumb may be of benefit; after all, the machine can read two instructions at once. All of the old tricks work fine, from unrolling loops to rewriting constructors. I seem to recall there are tricks for converting IEEE 754-1985 standard floats (within certain ranges) to integers without using the FPC. Everything Knuth did, ARM does well at.
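     One such integer-only conversion (a sketch of the general idea, not any particular library's routine) pulls the exponent and mantissa straight out of the IEEE 754 bit pattern, so it needs no FPC at all; it assumes a non-negative float whose truncated value fits in 32 bits:

```c
#include <stdint.h>
#include <string.h>

/* Truncate a non-negative IEEE 754 binary32 float to an integer using
   only integer operations -- no floating-point hardware required. */
int32_t float_bits_to_int(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);                      /* reinterpret the bits */
    int32_t  exp  = (int32_t)((bits >> 23) & 0xFF) - 127; /* unbiased exponent */
    if (exp < 0)
        return 0;                                         /* magnitude < 1.0 */
    uint32_t mant = (bits & 0x7FFFFF) | 0x800000;         /* restore implicit 1 */
    /* The mantissa holds value * 2^(23 - exp); shift to undo that scale. */
    return (exp <= 23) ? (int32_t)(mant >> (23 - exp))
                       : (int32_t)(mant << (exp - 23));
}
```

On a core like the 926EJ-S, where floats are otherwise software-emulated, this kind of bit-level conversion avoids the emulation library entirely for the ranges it handles.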

