Howdy, I was trying to find some basic performance benchmarks for a few different ARM cores: the ARM926EJ-S, the Cortex-A9, and the Cortex-M7.
I am primarily looking for DMIPS (per MHz, or any form that I can scale to my specific chip), MFLOPS (where applicable; I don't think the 926EJ-S has a floating-point coprocessor), and CPI (since it varies depending on the instructions, some standardized average is fine). If possible, I would also like figures such as the branch misprediction penalty and the cache miss penalty.
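To show why the two penalties matter to me, here is a rough sketch of the usual textbook stall model for folding them into an average CPI. Every number in it is a made-up placeholder rather than a measured figure for any of these cores, and the function name `effective_cpi` is just mine for illustration.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, mispredict_penalty,
                  mem_fraction, miss_rate, miss_penalty):
    """Average CPI once branch and cache stalls are added to the base CPI.

    Fractions and rates are per instruction (0.0 to 1.0); penalties are in cycles.
    """
    branch_stalls = branch_fraction * mispredict_rate * mispredict_penalty
    memory_stalls = mem_fraction * miss_rate * miss_penalty
    return base_cpi + branch_stalls + memory_stalls


# Placeholder inputs only (assumptions, not data for any real core):
# 20% branches, 10% of them mispredicted at 10 cycles each;
# 30% loads/stores, 5% of them missing the cache at 40 cycles each.
example = effective_cpi(
    base_cpi=1.0,
    branch_fraction=0.2, mispredict_rate=0.1, mispredict_penalty=10,
    mem_fraction=0.3, miss_rate=0.05, miss_penalty=40,
)
print(f"effective CPI: {example:.2f}")
```

With real penalty figures for each core I could plug in my own workload mix instead of relying on a single averaged CPI.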
If it makes a difference, the actual chips we would be (or are) purchasing are, respectively, the NXP i.MX27, the NXP i.MX6, and the NXP i.MX RT 1050.
I have spent hours looking through the ARM information center. I found some of the data on the ARM926EJ-S in there (average CPI of 1.5, 1.1 DMIPS per MHz), but the newer parts didn't seem to have that.
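For reference, the scaling I have in mind is nothing more than the sketch below. It uses the two ARM926EJ-S figures above; the 400 MHz clock is only an assumed example value, so substitute whatever your part actually runs at. Note that DMIPS (Dhrystone score relative to a VAX 11/780) and native MIPS derived from CPI are different measures and will not match.

```python
def dmips(dmips_per_mhz: float, clock_mhz: float) -> float:
    """Scale a per-MHz Dhrystone figure to a specific core clock."""
    return dmips_per_mhz * clock_mhz


def native_mips(clock_mhz: float, avg_cpi: float) -> float:
    """Millions of native instructions per second from clock and average CPI."""
    return clock_mhz / avg_cpi


# ARM926EJ-S figures quoted in the post: 1.1 DMIPS/MHz, average CPI of 1.5.
# CLOCK_MHZ is an assumed example value, not taken from a datasheet here.
CLOCK_MHZ = 400.0

arm926_dmips = dmips(1.1, CLOCK_MHZ)
arm926_mips = native_mips(CLOCK_MHZ, 1.5)
print(f"{arm926_dmips:.0f} DMIPS, {arm926_mips:.1f} native MIPS at {CLOCK_MHZ:.0f} MHz")
```

What I am missing is the equivalent pair of inputs for the Cortex-A9 and Cortex-M7.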
As a side note, I don't have hardware in hand for all of them, but is all of this information in something like the ARMulator which I have seen references to? I think asking about that would belong in a different forum category though (which I can do if it isn't publicly available information).
Thanks
Dear eskimoalva,
A quick search for "DMIPS" on NXP site yields:
For more details, Instruction Cycle Times are available on ARM site:
Best regards,
Vincent.
Excellent, thanks. I looked through the NXP data sheets, but didn't search their site. I was surprised that I couldn't find the information (though admittedly it was a lot more important before the days of application level processors) but it looks like my searching skills have failed me.
Thanks,
No problem. I pride myself on my assembly language programming, as I must: I am unable to deal with the abstraction of C, let alone C++ and Java.
Jazelle is worth a mention. Officially, the CPU uses a BXJ to enter Java bytecode, but the way back out is shrouded in secrecy. In fact, it's just virtual_register_0 = virtual_register_1 = 0xFFFF followed by bytecode $FF. There is a design fault in Jazelle that seems to be in every core with it. Bits 0-7 of the Status Register are supposed to be privileged, but a BX to Thumb isn't a privileged instruction and J is bit 24 (so unprivileged), thus instruction format 11 (reserved) can be entered. That is just one of the few examples of addenda that haven't been listed (AFAIK). I point it out because the SecureCore line of chips is amenable to this attack, and I note the chip you are considering has the -S option.
On the pure performance front, one 'old school' trick is to use the I-cache to store data. If you place a branch with the ALWAYS condition (deprecated on newer cores) at the very end of a tag-line, the data in the next cache line will be read into the I-cache (speculative read). The L2 & L3 are mixed data/instruction, so it will be sped up to a certain degree, but since the CPU uses literal pools, it accepts that although this is technically data, it can be read into the I-cache. If you have a small piece of code that performs some operation that thrashes the cache, this is a way to get more D-cache (dynamic cache changes).
If you want to know the most powerful ARM optimization that is available in ALL cores, keep in mind that there are 16 'general purpose' registers. R13 is the stack pointer and R14 is the link register. If your subroutine doesn't use the stack frame, you push R14 onto the stack and store the stack pointer into RAM. Now you are free to use 15 of the 16, not the 13 that compilers seem to use. Want more?
Well, a core-dependent trick is the realization that the IP isn't used in certain cache designs. If you are coding for a core it works on, then as long as you ensure that the code is always in L1 cache, you can store R15 into memory and voila, you get to use all 16 registers. This is in the realm of madness, but if you know the core your code is running on, you would be amazed at just how far you can push things.
I am presuming that the Cortex-M0 & Mbed will become the baseline for IoT, so optimizing audio, video and communications for this core is the best way to set some kind of standard.
Think back to when US 'bubblegum' cards, in the form of Star Wars cards, became prevalent. Now consider such a thing, but each card also has, say, a 15-20 minute storyline spoken in character by the actors, along with 'blueprint' images and so forth. That is just one way to use the new flexible plastic cores PragmatIC are working on. Sometimes we specifically do not want all data to be accessed from a central computer. Sometimes discrete storage will be of value. I suppose top chefs who have branded food in shops could have audiovisual recipes. It really has huge possibilities.
What attracted me was that the PragmatIC system can currently print 10 layers of metal-in-plastic, so mask ROM will not have the huge initial outlays. I remember writing cartridge games for consoles, and the cost was something like $13 per unit. Get the order wrong and your game could sell 150,000 units, yet you would still make a loss because of the 50,000 you didn't sell.