• Problem understanding the behaviour of the GCC compiler (aarch64-none-elf-gcc) with Neon intrinsics on ARM Cortex-A53

    Hi,

    I am using the Xilinx SDK 2019.1 IDE for my application and running it on an ARM Cortex-A53 processor with Neon and floating-point engine support available. I am working on a bare-metal application.

    The problem I am facing is that I am unable to understand…

  • Compiler optimization options for the ARMv8 GCC compiler on ARM Cortex-A53 (bare-metal application)

    I am using the ARMv8 GCC compiler (aarch64-none-elf-gcc) for my bare-metal application on ARM Cortex-A53. I am using Neon intrinsics with plain C in my code, so I would like to make sure I use all the optimization options available for this compiler.

    I tried -mfpu…

  • Optimization of Neon intrinsics on ARM Cortex-A53

    I am using the ARMv8 GCC compiler and I would like to optimize Neon intrinsics code for better execution time. I have already tried loop unrolling, and I am using a lookup table for the computation of log10. Any ideas?

    Here is the code:

    static inline…

  • Building the Ne10 library with Arm Compiler 5 on ARM Cortex-A9

    Hello everyone,

    I am having difficulty compiling the Ne10 library with Arm Compiler 5. As I understand it, the Ne10 library requires a GNU compiler, or Arm Compiler 6, which is more GNU-like; however, we are currently using Arm Compiler 5 in our project.

    Is there…

  • Partial register dependency in Neon

    I'm having trouble finding any information on partial Neon register dependencies.

    Take for example the following code:

    ld2 {v0.b, v1.b}[0], [x0]
    ld2 {v0.b, v1.b}[1], [x1]
    ld2 {v0.b, v1.b}[2], [x2]
    ...

    Does the second load have to wait…

  • What is an arrangement specifier (.16b, .8b) in ARM assembly language instructions?

    I want to know what exactly an arrangement specifier is in ARM assembly instructions.

    I have gone through the ARM TRMs, and I think it is the size of the Neon register that will be used for the computation.

    For example: TBL Vd.Ta, {Vn.16B, Vn+1.16B}, Vm.Ta

    They mentioned Ta to…

  • I'm not seeing any flush-to-zero (FTZ) effects with NEON intrinsics on an ARM A9, any advice?

    Hi everyone,

    As the title states - I've had issues reproducing flush-to-zero (FTZ) using the NEON intrinsics provided in the 'arm_neon.h' header. For test purposes I'm using an iPhone 6 with an ARMv8-A dual-core ('Twister') CPU.…

  • Embedded assembly function problem

    Hello all,

    I wrote an embedded assembly function for an ARM Cortex-A9 (the specific device is a Zynq, from Xilinx) as follows:

    float my_fun(float x)
    {
        asm volatile ("vdup.f32 d0, r0 \n\t");…

  • Float behaviour on AArch64

    Hello,

    Forgive me if my question is a little weak in content and language. I'm only a hobbyist and English is not my native language.

    I'm trying to compile an app from Einstein@Home for AArch64 using GCC. Einstein@Home is a DC project using BOINC. The app…

  • Cortex-A15 instruction set and ways to optimize on this platform?

    Dear,

    I am a novice developer on Cortex-A15.

    I need the following information:

    Where can I get the instruction set of the Cortex-A15?

    Are there any documents about optimization techniques on Cortex-A15 (image processing optimization)?

    Thanks a lot.

  • Hi, why can't the VFP vector mode be used in Cortex-A series processors?

    Hi, why can't the VFP vector mode be used in Cortex-A series processors?

  • ARMv8 instruction cycle timings

    Hi, can anyone suggest how to find the cycle timings of ARMv8 instructions? Does it take more cycles to transfer from Neon to basic ARM instructions in ARMv8?

    Please suggest how to calculate instruction cycles in ARMv8.

  • In NEON, do the three instructions (VCLS, VCLZ, VCNT) all count sign bits?

    In the NEON spec:

    VCLS (Vector Count Leading Sign bits) counts the number of consecutive bits following the topmost bit, that are the same as the topmost bit, in each element in a vector, and places the results in a second vector.

    VCLZ (Vector Count Leading…
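
    The VCLS definition quoted above can be modelled in scalar C for a single lane. The following is only a minimal sketch, assuming a signed 8-bit element; the names cls8 and cls8_demo are hypothetical helpers for illustration and are not part of arm_neon.h or any Arm library:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of VCLS for ONE signed 8-bit lane:
 * count the consecutive bits below the topmost bit that are
 * equal to the topmost (sign) bit. */
int cls8(int8_t x)
{
    uint8_t u = (uint8_t)x;          /* work on the raw bit pattern */
    unsigned sign = (u >> 7) & 1u;   /* topmost bit of the lane */
    int count = 0;

    /* Walk from bit 6 down to bit 0, stopping at the first bit
     * that differs from the sign bit. */
    for (int bit = 6; bit >= 0; --bit) {
        if (((u >> bit) & 1u) != sign)
            break;
        ++count;
    }
    return count;
}

/* Small illustrative self-check of the model. */
void cls8_demo(void)
{
    assert(cls8(0) == 7);     /* 0000 0000: seven bits match the sign bit */
    assert(cls8(1) == 6);     /* 0000 0001: bit 0 differs */
    assert(cls8(-1) == 7);    /* 1111 1111: seven bits match the sign bit */
    assert(cls8(-128) == 0);  /* 1000 0000: bit 6 already differs */
}
```

    Under this model the sign bit itself is not counted; the result is the number of bits after the topmost bit that repeat it, so an 8-bit lane yields a value from 0 to 7.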

  • The Cortex-A7's pipeline supports dual-issue, so I want to ask: what does dual-issue mean?

    The Cortex-A7's pipeline supports dual-issue, so I want to ask: what does dual-issue mean?

    I found some answers saying that dual-issue means the Cortex-A7 can issue two instructions per clock.

    But in the Cortex-A7's pipeline diagram, it has integer…

  • Question about accumulator word length in A8 core

    Hi,

    I have used some 32-bit microprocessor cores (non-ARM) which have a long word-length accumulator for some DSP operations, to avoid overflow etc. After checking the A8 core documentation, I was surprised not to see anything about this in the specification. It looks…

  • Why is the coprocessor removed in A64?

    From an architectural point of view, why was the coprocessor removed from the A64 instruction set?

  • Trap control and instruction enable/disable in ARMv8

    Hi Experts,

    What is the trap control feature, and what is its typical use case?

    How is the instruction enable/disable feature in ARMv8 useful?

    Regards,

    Techguyz

  • Explain the 8-stage pipeline of the ARM Cortex-A7?

    A brief explanation of each stage of ARM pipelining.

    How many Neon pipeline stages are there?

    What is dual issue in ARM pipelining?

  • How can a 4-core ARM Cortex-A53 (CA53) include NEON on only 2 cores?

    Our project only wants 2 cores to support NEON for cost reasons. How can I do this?

    1. Can this be done in a single cluster?

    2. Split into 2 clusters, each with 2 cores. What is the difference between the performance of ARM HMP scheduling 4 cores and the performance…

  • Arm Neon not vectorising nested loops

    Hi,

    I am using an A9 processor on a Zynq board, running a test project with the Neon and SIMD options enabled. In my code I have nested loops which are not vectorised; below is the build log:

     not vectorized: multiple nested loops. 

    Can anyone help me on thi…

  • No segmentation fault when expected with aligned load and store

    Hi all,

    It is a well-known fact that performing an aligned vector load with an unaligned memory address should lead to a segmentation fault.

    However, when I try to run the code segment below doing exactly that, I do not see any segmentation fault.

    ---------…

  • NE10-Library -> FIR-Filter cycle counts: C-version faster than NEON-version?

    Hi,

    I'm currently trying to measure cycle counts for FIR filtering with the NE10 library. I'm using a Raspberry Pi 2 with an ARM Cortex-A7 running Raspbian as the target.

    I activated the Cortex-A7 performance counter register to read out the cycles…

  • Questions regarding NEON

    Hi,

    For a project on digital signal processing on ARM SoCs, I'm currently gathering some information about the ARM NEON engine and would like some clarification on whether my assumptions are correct.

    I found an instruction timing table in the "Cortex…

  • NEON SIMD Dn Register and Parallel Operations

    Thank you for your reply. A few more questions:

    Is Dn a 128-bit wide register? Is Dd also a 128-bit wide register? (Referring to the diagram in the original question)

    Also, the diagram shows 4 parallel operations. Is this the actual number of parallel operations…

  • NEON SIMD Register Diagram

    Hello,

    I’m new to ARM architecture and was looking to get a better understanding of how it works. Most notably, the Cortex-A series and its DSP functionality.

    When looking through the NEON SIMD page on ARM's webpage (NEON - ARM), it mentions that…