Chinese Version 中文版:NEON编码 - 第4部分: 左右移位
This article introduces the shifting operations provided by Neon, and shows how they can be used to convert image data between commonly used color depths.
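As a rough illustration of the kind of color-depth conversion mentioned above, here is a minimal scalar C sketch that converts between RGB565 and RGB888 using only shifts and masks. It is not the article's Neon code: the type `rgb888_t` and both function names are hypothetical, and a Neon version would apply the same shifts to many pixels at once.

```c
#include <stdint.h>

/* Hypothetical 8-bit-per-channel pixel type, used only for this sketch. */
typedef struct { uint8_t r, g, b; } rgb888_t;

/* Expand a 16-bit RGB565 pixel (RRRRRGGG GGGBBBBB) to 8 bits per channel. */
static inline rgb888_t rgb565_to_rgb888(uint16_t p)
{
    uint8_t r5 = (uint8_t)((p >> 11) & 0x1F);  /* top 5 bits    */
    uint8_t g6 = (uint8_t)((p >> 5) & 0x3F);   /* middle 6 bits */
    uint8_t b5 = (uint8_t)(p & 0x1F);          /* low 5 bits    */
    rgb888_t out;
    /* Shift left to fill 8 bits, then copy the top bits into the
       vacated low bits so that full-scale input maps to 255. */
    out.r = (uint8_t)((r5 << 3) | (r5 >> 2));
    out.g = (uint8_t)((g6 << 2) | (g6 >> 4));
    out.b = (uint8_t)((b5 << 3) | (b5 >> 2));
    return out;
}

/* Reduce an RGB888 pixel to RGB565 by discarding the low bits of each channel. */
static inline uint16_t rgb888_to_rgb565(rgb888_t c)
{
    return (uint16_t)(((c.r >> 3) << 11) | ((c.g >> 2) << 5) | (c.b >> 3));
}
```

Replicating the top bits into the low bits is one common design choice; a plain left shift alone would map the brightest 5-bit value to 248 rather than 255.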
Previous articles in this series:
eXecute-Only-Memory (XOM) is a firmware protection technique that helps prevent third parties from stealing or reverse engineering firmware, while still allowing third parties to add software to the…
Typically when teaching a class about embedded C programming, one of the early questions we ask is "Where does the memory come from for function arguments?"
Take, for example, the following simple C function:
void test_function…
Recently I spoke about an LZ4 decompression routine I converted from 6502 code into Arm Cortex-M0 code.
For some reason, I could not find my decompression routine, so I decided to convert it again. The result is below; the routine is now…
The Statistical Profiling Extension is an optional feature in ARMv8.2. This article will provide an overview of the Extension, describe how it works, and the advantages it provides over other profiling mechanisms.
Recently, Will Deacon posted a request…
Some of us need to find out how many leading zero-bits there are in a 32-bit word. Such a feature is useful on many occasions, especially when writing a fast divide subroutine.
The Cortex-M3 and later have a CLZ instruction which can count…
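For cores without a CLZ instruction, the count can be computed in software. The following is a minimal portable C sketch using a binary search over the word; the function name `clz32` is hypothetical and this is not necessarily the routine the article goes on to present.

```c
#include <stdint.h>

/* Count the leading zero bits of a 32-bit word.
   Returns 32 for an input of zero. */
static inline unsigned clz32(uint32_t x)
{
    unsigned n = 0;
    if (x == 0)
        return 32;
    /* At each step, if the upper part of the remaining window is empty,
       add its width to the count and shift the lower part up. */
    if ((x & 0xFFFF0000u) == 0) { n += 16; x <<= 16; }
    if ((x & 0xFF000000u) == 0) { n += 8;  x <<= 8;  }
    if ((x & 0xF0000000u) == 0) { n += 4;  x <<= 4;  }
    if ((x & 0xC0000000u) == 0) { n += 2;  x <<= 2;  }
    if ((x & 0x80000000u) == 0) { n += 1; }
    return n;
}
```

In a divide routine, a count like this is typically used to normalize the divisor, that is, to shift it left until its top bit is set.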
This is the third in a series of blogs that gives a technical introduction to the ARM CoreSight Debug and Trace technology and architecture. You can check out my previous blogs How to debug: CoreSight basics (Part 1) and How to debug: CoreSight basics…
I discussed in a previous blog post that it is possible to set some condition flags based on the result of an arithmetic operation. Consider the following code:
adds r0, r0, r1
bvs <some_address>
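In that snippet, adds updates the condition flags and bvs branches when the overflow (V) flag is set, that is, when the signed addition overflowed. As a sketch of what the V flag represents, the C function below computes the same condition for a 32-bit add; the name `add_sets_v` is hypothetical and the code is an illustration rather than anything taken from the post.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true when a + b overflows as a signed 32-bit addition,
   which is the condition under which ADDS sets the V flag. */
static inline bool add_sets_v(int32_t a, int32_t b)
{
    /* Do the arithmetic on unsigned values, where wraparound is well defined. */
    uint32_t ua = (uint32_t)a;
    uint32_t ub = (uint32_t)b;
    uint32_t sum = ua + ub;
    /* Overflow occurs when both operands have the same sign
       and the sign of the result differs from it. */
    return ((~(ua ^ ub) & (ua ^ sum)) & 0x80000000u) != 0;
}
```

For example, `add_sets_v(INT32_MAX, 1)` is true, while `add_sets_v(-1, 1)` is false because the operands have different signs and so cannot overflow.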
Chinese Version中文版:扩展系统一致性 - 第 2 部分 - 实施、big.LITTLE、GPU 计算和企业级应用
This is the second part of a series of blogs about hardware coherency. In the first blog I introduced the fundamentals of cache coherency. This part talks about the implementation of hardware…
Let's be honest, debug can be a bit of a pain. At the best of times it's a nuisance and in the worst case scenario a complex web of wires that need to be configured properly in order to diagnose and solve your SoC design problems.
A study conducted…
Page colouring is a technique for allocating pages for an MMU such that the pages exist in the cache in a particular order. The technique is sometimes used as an optimization (and is not specific to ARM), but as a result of the cache architecture some…
“At the end of the day, we must go forward with hope and not backward by fear and division.” – Jesse Jackson.
It often surprises me how many people believe that “ARM doesn’t do division” or “ARM cores don’t have…
With ARM entering the server space, a key technology in play in this segment is Virtualization. Virtualization is not a tool solely for servers and the data center; it is also used in the embedded space in segments like automotive and it is also starting…
This post is part of a series:
Every practical…
My previous post provided an introduction to the concept of memory access ordering. It did not, however, provide any solution to the problem, or necessarily specify where such ordering can be significant.
Now, not all software developers need to be deeply…
As described in my last article, AArch64 performs stack pointer alignment checks in hardware. In particular, whenever the stack pointer is used as the base register in an address operand, it must have 16-byte alignment.
The alignment checks can be very…
Note: Armv8 deprecates…
Hello, and welcome to my Arm programming tutorial series. I would like to give a big thank you to Abhishek Agrawal, a final-year undergraduate student at IIT Kharagpur, for his help in completing this blog.
Let’s start with basics. RISC machines have…
A branch, quite simply, is a break in the sequential flow of instructions that the processor is executing. Some other architectures call them jumps, but they're essentially the same thing. The following is a trivial, and hopefully…
In part 1 of this series we dealt with how to load and store data with NEON, and part 2 involved how to handle the leftovers resulting from vector processing. Let us move on to doing some useful data processing - multiplying matrices.
In this…
This article describes the instructions provided by Neon for rearranging data within vectors. Previous articles in this series:
Once you move beyond short sequences of optimised Arm assembler, the next likely step will be managing more complex, optimised routines using macros and functions. Macros are good for short repeated sequences, but often quickly increase the size of…
Arm's Neon technology is a 64/128-bit hybrid SIMD architecture designed to accelerate the performance of multimedia and signal processing applications, including video encoding and decoding, audio encoding and decoding, 3D graphics, speech and image…
In part 1 of this series on Neon about loads and stores we looked at transferring data between the Neon processing unit and memory. In this post, we deal with an often encountered problem: input data that is not a multiple of the length of the vectors…
If you’d like to develop your Convolutional Neural Networks using just the Compute Library and a Raspberry Pi, this step-by-step guide will show you how… and it comes complete with all the tools you’ll need to get up and running.
If…