On the Cortex-M0, does the multiply set the flags? If so, a fast exit could be added to a 32x32=64-bit multiply.

On the Cortex-M0, does the multiply instruction set the flags? If it does, a fast exit could be added when writing a 32x32=64-bit multiply. I'm just looking at this code I found and wondered whether it would be worth inserting one.

        /* Slow version for both THUMB and older ARMs lacking umull. */
        mul     xxh, yyl            /* xxh := AH*BL */
        push    {r4, r5, r6, r7}
        mul     yyh, xxl            /* yyh := AL*BH */
        ldr     r4, .L_mask
        lsr     r5, xxl, #16        /* r5 := (AL>>16) */
        lsr     r6, yyl, #16        /* r6 := (BL>>16) */
        lsr     r7, xxl, #16        /* r7 := (AL>>16) */
        mul     r5, r6              /* r5 = (AL>>16) * (BL>>16) */
        and     xxl, r4             /* xxl = AL & 0xffff */
        and     yyl, r4             /* yyl = BL & 0xffff */
        add     xxh, yyh            /* xxh = AH*BL+AL*BH */
        mul     r6, xxl             /* r6 = (AL&0xffff) * (BL>>16) */
        mul     r7, yyl             /* r7 = (AL>>16) * (BL&0xffff) */
        add     xxh, r5
        mul     xxl, yyl            /* xxl = (AL&0xffff) * (BL&0xffff) */
        mov     r4, #0
        adds    r6, r7              /* partial sum to result[47:16]. */
        adc     r4, r4              /* carry to result[48]. */
        lsr     yyh, r6, #16
        lsl     r4, r4, #16
        lsl     yyl, r6, #16
        add     xxh, r4
        adds    xxl, yyl
        adc     xxh, yyh
        pop     {r4, r5, r6, r7}
        RET
        .align  2
.L_mask:
        .word   65535
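For readers who find the register shuffling hard to follow, here is a rough C model of the same partial-product scheme. It is a sketch, not the original source: the function names `umull_model` and `muldi3_model` are made up for illustration, and the model assumes the routine takes two 64-bit operands (xxh:xxl and yyh:yyl) and returns the low 64 bits of their product, which is what the comments in the listing suggest.

```c
#include <stdint.h>

/* 32x32 -> 64 unsigned multiply built only from 32-bit multiplies that
   keep the low 32 bits, which is all the Cortex-M0 multiplier offers.
   (Hypothetical name, for illustration only.) */
uint64_t umull_model(uint32_t a, uint32_t b)
{
    uint32_t a_hi = a >> 16, a_lo = a & 0xffffu;
    uint32_t b_hi = b >> 16, b_lo = b & 0xffffu;

    uint32_t hi_hi = a_hi * b_hi;        /* lands in result[63:32] */
    uint32_t lo_hi = a_lo * b_hi;        /* lands in result[47:16] */
    uint32_t hi_lo = a_hi * b_lo;        /* lands in result[47:16] */
    uint32_t lo_lo = a_lo * b_lo;        /* lands in result[31:0]  */

    uint32_t mid = lo_hi + hi_lo;        /* may wrap around */
    uint32_t mid_carry = (mid < lo_hi);  /* the carry into result[48] */

    uint32_t lo = lo_lo + (mid << 16);
    uint32_t lo_carry = (lo < lo_lo);    /* carry out of the low word */

    uint32_t hi = hi_hi + (mid >> 16) + (mid_carry << 16) + lo_carry;
    return ((uint64_t)hi << 32) | lo;
}

/* 64x64 -> low 64 bits: the cross terms AH*BL and AL*BH only ever
   reach the high word, so their own high halves are discarded. */
uint64_t muldi3_model(uint64_t a, uint64_t b)
{
    uint32_t al = (uint32_t)a, ah = (uint32_t)(a >> 32);
    uint32_t bl = (uint32_t)b, bh = (uint32_t)(b >> 32);
    uint32_t cross = ah * bl + al * bh;
    return umull_model(al, bl) + ((uint64_t)cross << 32);
}
```

The four 16x16 products correspond to the four `mul` instructions after the push, and `mid_carry` plays the role of the `adc r4, r4` that captures the carry to result[48].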


0 Jens Bauer over 9 years ago

According to the Cortex-M0 Generic User Guide, muls sets the N and Z flags.
That means the result of the operation can be tested for zero, nonzero, negative or positive, but not greater than, less than, greater or equal, less or equal.
E.g. it is valid to use BMI, BPL, BNE and BEQ after muls, but it would be invalid to depend on BGE, BLT, BGT, BLE, BCC, BCS, BHI, BLO, BVS or BVC, because those conditions read the C or V flags, which muls does not update.
Instruction details:
Cortex-M0 Generic User Guide - Instruction set summary
Instruction timing:
Cortex-M0 Technical Reference Manual Instruction set summary
Note: The Cortex-M0 instructions will behave the same on the other Cortex-M implementations, except for the number of clock cycles spent. In other words: they're binary compatible.
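To make the flag behaviour concrete, here is a small C model of what muls reports. It is only a sketch: the type `muls_result` and the function `muls_model` are invented for this illustration, and the model covers just the N and Z flags described above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of MULS on the Cortex-M0: a 32-bit result plus
   the two flags the instruction updates. */
typedef struct {
    uint32_t result;  /* low 32 bits of the product */
    bool n;           /* N flag: bit 31 of the result */
    bool z;           /* Z flag: result is zero */
} muls_result;

muls_result muls_model(uint32_t rd, uint32_t rm)
{
    muls_result r;
    r.result = rd * rm;            /* the upper 32 bits are simply lost */
    r.n = (r.result >> 31) != 0;
    r.z = (r.result == 0);
    return r;                      /* C and V are not modelled: muls leaves them alone */
}
```

Both flags are derived from the truncated 32-bit result, so neither of them says anything about bits lost off the top of the product.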


0 daith over 9 years ago in reply to Jens Bauer

Multiplying 0x10000 by 0x10000 gives zero in the 32-bit result, so the Z flag is set even though the true product is 0x100000000. What one would really like is to have the overflow flag set depending on whether there was an overflow, but it isn't.
If one had the CLZ instruction, that would be another way of checking, but unfortunately the Cortex-M0 doesn't have it.
If the operands are very often both less than 0x10000, you could insert a test into the multiply routine above to check for that and exit early if so. This can be done before the push, so there could be an appreciable saving.
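The early-exit test can be sketched in C as follows. The names `both_fit_16` and `umull_with_fast_exit` are hypothetical, and the native 64-bit multiply stands in for the slow routine. For the 64-bit routine in the question, the high words of both operands would have to be checked for zero as well.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when both operands are below 0x10000, in which case the full
   product is at most 0xfffe0001 and fits in 32 bits. */
bool both_fit_16(uint32_t a, uint32_t b)
{
    return ((a | b) >> 16) == 0;
}

/* 32x32 -> 64 multiply with the early exit taken before any slow work. */
uint64_t umull_with_fast_exit(uint32_t a, uint32_t b)
{
    if (both_fit_16(a, b)) {
        return (uint32_t)(a * b);  /* one 32-bit multiply, high word is zero */
    }
    return (uint64_t)a * b;        /* stand-in for the partial-product routine */
}
```

OR-ing the operands first means a single shift decides the case for both of them, so the check costs only a few instructions ahead of the push.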
+1 Sean Dunlevy over 9 years ago in reply to daith
Well, what I am starting to do, even without a BBC Microbit devkit, is to write a fixed-point MP3 player. I'm going to write it in 100% assembly language. MiniMP3 (C) has the whole decoder in a single file BUT uses floating point. Madplay offers a fixed-point solution but is spread over many files and is rather rambling. This is all without knowing whether the Microbit has the 1-cycle multiply for the FFT calculations. The Bluetooth chip (Nordic Semiconductor nRF51822) also has an M0 core, with 128K of program flash memory (read only) & 16K of RAM, to go with the 128K of program flash memory & 54K of RAM of the CPU. You can imagine how one could load the FFT code into the Bluetooth chip's RAM and set it going. It could be as simple as the left-channel FFT being calculated by the CPU & the right-channel FFT being done by the nRF51822.
Looking at the MiniMP3 player, it uses these maths-coprocessor functions:
POW
COS
SQRT
and divides all over the place. The COS I will swap for a table. There are a number of routes to calculating a square root BUT, lacking a divide, there are a few different ways to do that, and I will likely have a standard one & case-specific ones where the range is small. The code also has an instruction to find a reciprocal, i.e. 1.0 / (1.0 + CX * CX). IF there is the 1-cycle multiply, I'm wondering if anyone has code to calculate reciprocals (a special case of a divide).
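One common multiply-only route to a reciprocal is Newton-Raphson iteration, r' = r * (2 - m * r). The sketch below is an illustration, not code from MiniMP3: it assumes an unsigned Q16.16 input, and the name `recip_q16`, the Q2.30 working format and the three-iteration count are all choices made here. Each 64-bit product would itself need a 32x32->64 routine on the M0, so it is only attractive if the multiply is fast.

```c
#include <assert.h>
#include <stdint.h>

/* Approximate 1/x for an unsigned Q16.16 value using multiplies and
   shifts only. Hypothetical helper; the result can be off by a few
   units in the last place. */
uint32_t recip_q16(uint32_t x)
{
    assert(x >= 2);  /* below this, 1/x does not fit in Q16.16 */

    /* Normalise: shift left until bit 31 is set, so that m read as
       Q0.32 lies in [0.5, 1). A loop is used because the M0 has no CLZ. */
    uint32_t m = x;
    unsigned s = 0;
    while ((m & 0x80000000u) == 0) {
        m <<= 1;
        s++;
    }

    /* Initial estimate r0 = 48/17 - (32/17) * m, held in Q2.30. */
    const uint32_t c1 = 3031741621u;  /* 48/17 in Q2.30 */
    const uint32_t c2 = 2021161080u;  /* 32/17 in Q2.30 */
    uint32_t r = c1 - (uint32_t)(((uint64_t)c2 * m) >> 32);

    /* Each pass roughly doubles the number of correct bits. */
    for (int i = 0; i < 3; i++) {
        uint32_t mr = (uint32_t)(((uint64_t)m * r) >> 32);  /* m*r in Q2.30 */
        uint32_t t = 0x80000000u - mr;                       /* 2 - m*r */
        r = (uint32_t)(((uint64_t)r * t) >> 30);             /* r * (2 - m*r) */
    }

    /* r is now about 2^62 / m; shifting undoes the normalisation. */
    return r >> (30 - s);
}
```

For example, an input of 0x30000 (3.0) yields 0x5555, the Q16.16 value of one third. If the range of the argument is known in advance, the normalisation loop can be replaced by a fixed shift.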
So, as you can see, I'm having to consider ALL of the CPU power I can access. One minor point I have spotted is that the M0 ALWAYS does a 32-bit instruction read. If you are going to perform a B instruction and the instruction before it is a 16-bit atomic operation, there isn't a wasted read.
The machine has no external memory chips, so I can only presume that the core is a custom M0. So, unless it's too late, if I can present a player that only manages low bit-rates because of the slow multiply, it gives me a strong reason to suggest adding the 1-cycle multiply and, ideally, to ask for a UMULL instruction, with the result appearing in the 2 registers that held the values to be multiplied.
So, POW is the only one for which I haven't found a suitable piece of code that gives a good approximation in fixed point. A log & antilog table could be used to find an approximation and then home in on the correct answer... or I'm guessing that's a reasonable route. I'm sure that there is code in the math include file, but in-line code is optimal, especially if I know the range.
I know I'm touching on a lot of subjects here, but I'm a volunteer; I'm not being paid. I just see the Microbit as being much more than something for simple experiments.
I've been looking at G.728, short-delay CELP. The 2 reasons are its RAM footprint and its low complexity: 30 MIPS & 2K of RAM. If it's used for audiobooks, the encoding could be altered so that the fixed codebook is much bigger (so, OK, it's going to use 16.2Kb/s). It would be possible to find the optimal codebook for each speech file (Audible does this). I also wonder about using parts of 2 adjacent vector tables, e.g. vector_number 200, index 4, so you end up using 12 vectors from vector_number 200 & 4 vectors from vector_number 201. Audible use ACELP. An exhaustive search would be a mammoth task, but if an audiobook that is coursework can replace physical books, it's a one-time cost per book. An exhaustive search would use a LOT of MIPS, but services like Amazon EC2 would allow 10s, 100s or 1000s of instances, so the task becomes possible. If a book costs £4 (if bought in bulk) then as long as the book encoding costs less than £4 million, it's still cheaper.
MP3 is a good way to interest the pupils; a headline feature. That means that they will look after their Microbit, and I foresee manufacturers offering cases for the machine. A customizable front end using the accelerometers in 3 axes would minimize button use. A walkie-talkie (ADPCM is built in to the Bluetooth chip) would be fun for playtime. Then there are crossover fun/educational possibilities. There are MIDI->(micro)USB converters, so a Roland TB-303 emulator would be possible:
Ram chips : NEC PD-444C CMOS RAM, 1024 x 4 Bit Static
CPU type : NEC PD-650C-133, 4-bit microcomputer
Filter: 24db filter
Waveforms: sawtooth and square

In fact, it would be a minor addition to support sine & triangle waves. All that is needed is a 256-entry PCM table for each waveform. Adding white noise could be useful; the noise, with a specific scaling, could be added to the waveform. I reckon 5-channel polyphony could also be added: 5 so that each column of the display shows a channel & allows editing. A bit limited, but it would be possible to make a freestanding MIDI rack. Looking at all of the guitar (or bass) pedals out there, I'm sure someone would make pedals based on the Microbit. This is where the fast multiply is a must. For the 303+ it's needed for the filter, and for pedals all kinds of effects can be produced through digital filtering. You could set up high-pass, low-pass & notch filters.
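As a rough sketch of the table idea, the C below fills a 256-entry triangle table and reads it with a phase accumulator. Everything here is an assumption made for illustration: signed 8-bit samples, a 32-bit phase whose top 8 bits index the table, and the names `fill_triangle`, `osc_t` and `osc_next`.

```c
#include <stdint.h>

#define TABLE_SIZE 256

/* Fill a 256-entry signed 8-bit triangle wave: up from 0 to +127,
   down to -128, then back up towards 0. */
void fill_triangle(int8_t table[TABLE_SIZE])
{
    for (int i = 0; i < TABLE_SIZE; i++) {
        int v;
        if (i < 64) {
            v = i * 2;                /* 0 .. 126 */
        } else if (i < 192) {
            v = 127 - (i - 64) * 2;   /* 127 .. -127 */
        } else {
            v = -128 + (i - 192) * 2; /* -128 .. -2 */
        }
        table[i] = (int8_t)v;
    }
}

/* One oscillator voice: step = frequency * 2^32 / sample rate. */
typedef struct {
    uint32_t phase;
    uint32_t step;
} osc_t;

/* Return the current sample, then advance the phase. The accumulator
   wraps naturally at 2^32, which is one full cycle of the table. */
int8_t osc_next(osc_t *o, const int8_t table[TABLE_SIZE])
{
    int8_t sample = table[o->phase >> 24];
    o->phase += o->step;
    return sample;
}
```

Swapping in a sawtooth, square or sine table changes the waveform without touching the oscillator code, and several `osc_t` voices summed together would give the polyphony mentioned above.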
I coded Tomb Raider on the Game Boy Color and we managed to get all of Lara's moves in. You can click or hold each button, so you get:
click A
hold A & click B
click B
hold B & click A
hold A & B
So, as you can see, 3 axes of movement give 6 buttons, e.g. volume could be push forward and twist left or right, pull back to set. Button A & left/right = next/last track. Button B & left/right.
Presuming the teacher has a Microbit, lower quality (but still clear and with no glaring glitches) would mean that the whole lesson could be recorded real time & broadcast to the whole class, another improvement.
For pure education, the Microbit means that ANY classroom is a language-lab. The teacher broadcasts the speech to the whole class and can hear the speech from individual pupils. THAT is a big saving and gives so much more flexibility. In this case, presuming that the audio is broadcast using ADPCM (which is VERY LOW complexity), the pupil could record their lesson.
Basically, I'm waiting for them to finally give me a manual for the system, and I hope it's still in beta so I can ask for a few small (in terms of gates) extras: the 32x32->64 multiply & accumulate (for the FFT matrix maths), and the nRF51822 memory-mapped and usable when not engaged in dealing with Bluetooth. A figure of 0.9 DMIPS/MHz for the Cortex M0 gives a total of >86 DMIPS. Compare that with the DSPs running fixed-point MP3 and you can see that it fits.
Gosh, I'm sorry I've rambled on for so long. I'm just trying to get experts in the field to see just how potentially powerful the Microbit could be for UK education. Books that can be listened to anywhere and the possibility of simple testing in which the pupil says yes or no can be utilized. No need for complex speech->text style decoding. YES has an unvoiced section while NO does not. It would be nice to have real speech recognition but I've not looked into that. I envisage questions asking if the answer is right or wrong. If 100 questions are recorded, a test could choose 25. If the pupil failed to get a given %, they could retake the test using 25 of the 75 unused questions.
I do apologize for this long response, but I'm doing this for free and I spent 20 years as a game-engine coder, so I look very hard at what is possible but HARD; that is what puts you ahead of the competition. I'm still looking to see if code can run from RAM, allowing for self-modifying code. Things like Huffman trees can be hand-coded so that they are as efficient as possible.
I thank you for plowing through this missive. I guess I'm just excited to have the chance of putting something back into education. I worked until an explosion left me physically disabled and with PTSD (flashbacks and so on). My only relief from this is to end up in a flow state (also known as debug mode or hyperfocus). You know the feeling: you lose track of time, you're totally concentrating on every way to make the code smaller & faster (for the FFT, each row of the matrix is unrolled, for example).
When I get the MP3 player working, of course it's free to use, as is any code for the Microbit I develop. Please, if you have any ideas, I would be so grateful.
Thank you all,
Sean
0 Jens Bauer over 9 years ago in reply to Sean Dunlevy

Hi Sean.

muffin wrote:

I'm still looking to see if code can run from RAM allowing for self-modifying code. Things like huffman trees can be hand-coded so that they are as efficient as possible.

There are a lot of questions I cannot answer, but I can definitely confirm that you can run code from the internal SRAM.
Also, I once converted a decompression routine from 6502 to Cortex-M0. I forgot where I put it, but the conversion wasn't hard and the result was fairly small (smaller than the original 6502 routine, though I never tested it).
bzip2 compresses very well and decompresses fast, but I believe the size of the decompression routine might be too large for most Cortex-M0.
I believe you will probably also want to know that it's fairly easy to interface with a SD/MMC card using the SPI protocol.
Almost all Cortex-M microcontrollers have built-in SPI interface. The 3 basic interfaces are: SPI, UART/USART and I2C; most Cortex-M microcontrollers implement all three.
In case you run out of GPIO pins, you can add I/O-expanders; this can be done via I2C or SPI. I/O-expanders typically give you 8 or 16 extra (slow) GPIO pins - these can be useful for reading buttons and controlling LEDs, but normally it shouldn't be necessary, as you can easily connect 2 LEDs or 2 switches to a single GPIO pin (if you need details, let me know).
Something else you will probably want to play with is extending the Flash memory by adding an external SPI-based Flash memory. I tend to recommend Spansion's 8Mbit (1MB) S25FL208K0RMFI041, which costs less than $1! The only extra component required is a 100nF capacitor at the VCC pin (from VCC to GND).
For memory card support (e.g. SD/MMC), you need slightly more components, including resistors (mainly 33K) and a couple of capacitors (again a 100nF capacitor and, say, a 4.7uF capacitor).