NEON: Is Cortex-A7 4 times slower than Cortex-A8?

I'm looking at the Cortex-A7 cycle-timing table here:

http://hardwarebug.org/2014/05/15/cortex-a7-instruction-cycle-timings/


For example, 

VADD.F32 Dd, Dn, Dm takes 2 cycles

VADD.F32 Qd, Qn, Qm takes 4 cycles

The same goes for VMUL.

Is this really the case? I seem to remember that both take 1 cycle on Cortex-A8.

I'm wondering what the benefit of using NEON is in this case, except maybe for compatibility, where some parts could run 4 times faster on other NEON implementations?
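
To make the arithmetic behind the question explicit, here is a minimal sketch that turns the two quoted cycle counts into a per-float cost. It assumes the table's figures can be read as the number of cycles each instruction occupies the pipeline, which is an interpretation rather than something stated in this thread; the function names are made up for illustration.

```c
/* Cycles spent per 32-bit float lane for one instruction.
 * cycles: cycle count quoted in the timing table
 * lanes:  number of single-precision lanes the instruction processes */
double cycles_per_lane(int cycles, int lanes)
{
    return (double)cycles / (double)lanes;
}

/* Cortex-A7 figures quoted in the post above (hypothetical helper names). */
double a7_vadd_f32_d(void) { return cycles_per_lane(2, 2); } /* D register: 2 floats */
double a7_vadd_f32_q(void) { return cycles_per_lane(4, 4); } /* Q register: 4 floats */
```

Both forms come out at one cycle per float, so by these numbers the Q form gives no per-element advantage over the D form on this core. How that compares with Cortex-A8 depends on Cortex-A8 figures that the thread only recalls from memory.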

  • > except for compatibility reason maybe,


    We have a winner. App developers hate recompiling apps for 10 different variants of an architecture, so compatibility and ensuring that all apps run is a really important design objective.


    > where some parts could run 4 times faster on some other NEON implementations ?


    Cortex-A7 is much smaller and lower power than Cortex-A8 - if you want pure clock-for-clock performance we have plenty of cores which are faster than Cortex-A8, so it really depends what tradeoffs your design is trying to make in terms of silicon area, power, and performance.


    HTH,

    Pete


  • Thanks for your answer Pete, I had naively thought the NEON implementation was the same between A7 and A8 and that performance would be close.

    Now I realize that although the NEON instructions are the same, their implementation in silicon must be quite different (stripped down?). For example, there must be a single floating-point multiplier unit in A7 whereas A8 probably has 4.

    After all it makes sense, otherwise how would the A7 be less expensive and consume less power?


    Of course I understand the motivation of having the same NEON instructions on the A cores, but it's important for me to realize this huge performance difference because I have a bunch of proprietary C DSP algorithms (floating-point) I want to use on an A7 only.

    I was already thinking of taking the time to convert them to NEON, which would take a while, especially those with tests/branches that require more care.

    But it looks like I will not gain much from using NEON on the A7, so I had better use my C code almost "as is", making sure VFP instructions are correctly issued by the compiler.
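
As an illustration of keeping C code "as is" while staying friendly to scalar VFP, here is a minimal sketch of a small floating-point kernel with a branch in it. The filter itself, its name and its parameters are hypothetical and not taken from the poster's algorithms. The point is only that every literal carries an `f` suffix, so nothing is silently promoted to double precision. With GCC, an invocation along the lines of `-O2 -mcpu=cortex-a7 -mfpu=neon-vfpv4 -mfloat-abi=hard` is one commonly used combination, but the toolchain is an assumption here.

```c
#include <stddef.h>

/* One-pole low-pass filter with input clipping (hypothetical example kernel).
 * All arithmetic stays in single precision: a bare literal such as 1.0
 * would be a double and force float<->double conversions in the loop. */
void clipped_lowpass(float *restrict out, const float *restrict in,
                     size_t n, float alpha)
{
    float state = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float x = in[i];
        /* Data-dependent tests like these are the awkward part to vectorise. */
        if (x > 1.0f)  x = 1.0f;
        if (x < -1.0f) x = -1.0f;
        /* Loop-carried dependence: each output needs the previous state. */
        state += alpha * (x - state);
        out[i] = state;
    }
}
```

Inspecting the generated assembly (for example with the compiler's `-S` option) is a simple way to check that the loop uses single-precision VFP instructions and contains no conversions to double.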

  • > Now I realize although NEON instructions are the same, its implementation in the silicon must be quite different

    Yep, exactly this.

    > But it looks like I will not gain much on using NEON on the A7


    If you know you are going to use only Cortex-A7 then it is unlikely you will gain too much, but NEON has some nice instructions which are not always available as scalar integer or VFP equivalents, so it does tend to be a little faster over a whole algorithm (just not the same multiplier you would get on a bigger core). What you would buy yourself is a little future portability - you could just run the same app on a different platform with a wider NEON implementation and it would automatically go faster without any extra work.
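
The portability point can be sketched with NEON intrinsics from `arm_neon.h`. This is only an illustration under assumptions: the function name and the multiply-accumulate operation are hypothetical, and the scalar fallback is there so the same source also builds on targets without NEON. The same intrinsic code runs unchanged on any NEON implementation, and how many cycles each 128-bit operation takes is decided by the core.

```c
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* acc[i] += a[i] * b[i] for i in [0, n) (hypothetical example kernel). */
void mul_accumulate(float *acc, const float *a, const float *b, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON)
    /* Process four floats per iteration using Q registers. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va   = vld1q_f32(a + i);
        float32x4_t vb   = vld1q_f32(b + i);
        float32x4_t vacc = vld1q_f32(acc + i);
        vacc = vmlaq_f32(vacc, va, vb);   /* vacc + va * vb, per lane */
        vst1q_f32(acc + i, vacc);
    }
#endif
    /* Scalar tail, and the whole loop when NEON is not available. */
    for (; i < n; ++i) {
        acc[i] += a[i] * b[i];
    }
}
```

On a Cortex-A7 this may not run much faster than the scalar loop, in line with the comments above, but no source changes are needed to benefit from a wider NEON unit elsewhere.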


    HTH,

    Pete

  • If you're using your board / device as 'bare metal', writing purely in assembly language (thus knowing exactly which registers are used for what), then you could keep things in registers and thus gain a little extra speed by not reloading all the time.

    That means: If you're using an operating system - such as Linux - then this isn't really possible.

    peterharris - I've understood that basically the higher number that a core has, the faster it is.

    I've seen that this seems to be true for the Cortex-A5, Cortex-A7 and Cortex-A8, but some people claim that parts of the Cortex-A8 is faster than the Cortex-A9 ?

    When moving up to Cortex-A12, Cortex-A15 and Cortex-A17, my knowledge starts to run out (except that Cortex-A17 uses less power than Cortex-A15).

    Is there a good overview of the 'core speeds' and the 'NEON speeds' for the Cortex-A architectures ?

  • > I've understood that basically the higher number that a core has, the faster it is.

    Not sure how well it holds up now that we have Cortex-A50 and A70 series, but as a very rough rule of thumb it's probably not too far off "on average". As always there will be bits which perform slightly worse and bits which perform slightly better than that.

    > but some people claim that parts of the Cortex-A8 is faster than the Cortex-A9 ?

    Cortex-A8 NEON is dual issue for some pairings of instructions, whereas Cortex-A9 NEON is single issue, so Cortex-A9 can go slower. What it does do a lot better than Cortex-A8 is interoperate ARM and NEON code, so the cost of moving from ARM to NEON/VFP is much lower. This is very important for normal "float" programming in C, as the soft-float ABI was so common back when these cores were released.

    > Is there a good overview of the 'core speeds' and the 'NEON speeds' for the Cortex-A architectures ?

    Nothing specific I am aware of for NEON.

    HTH,

    Pete

  • Thank you Peter, this definitely clears up a lot of things.