
The BBC Micro:bit - 2 totally different computers... how convenient

I have been asked to write an audio driver for the BBC Micro:bit. The problem is that the Mark 1 and Mark 2 versions of this computer are utterly different. To access the full power of these computers, I intend to develop the code in 100% assembly language. Luckily I have quite a lot of experience programming the Cortex-M0/M0+ thanks to the kind help of Jens Bauer, who is a truly great guy and who has produced fragments of truly amazing code, e.g. a 32-bit x 32-bit -> 64-bit multiply routine that takes only 17 instructions and so only 17 cycles. He has provided many other tricks, although it has to be admitted that I have programmed 30+ instruction sets in anger (commercial games including audio drivers, area fills and unrolled loops - unrolling only knocks off 1 cycle per iteration, BUT if the loop goes down from 6 to 5 cycles, that's a BIG speed-up). Anyway, here are the specifications of the two different versions of the Micro:bit:

v1:
Nordic nRF51822 (contains 16MHz M0)
16 MHz ARM Cortex-M0 core
256 KB Flash
16 KB RAM


v2:
Nordic nRF52833 (contains 64 MHz M4)
64 MHz ARM Cortex-M4 core
512 KB Flash
128 KB RAM

Now, the Cortex-M4 supports the Thumb-2 instruction set, which means it has more or less all of the original 32-bit ARM instruction set. ARM states that the M0 achieves around 0.9 MIPS/MHz, so 14.4 MIPS at 16 MHz, whereas the M4 achieves around 1.25 MIPS/MHz, so 80 MIPS at 64 MHz. The v2 therefore processes around 5½ times faster. Those figures don't even account for Thumb-2 reducing the number of instructions needed. Both processors have a 3-stage pipeline, although it appears that the M4 is able to read 2 x 16-bit instructions in one cycle, and the M4 can fetch instructions early so that instructions that access memory do not stall the pipeline.
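The arithmetic behind that 5½x figure, spelled out as a tiny sketch (these are ARM's approximate Dhrystone ratings, not guaranteed throughput):

```c
/* Rough throughput comparison from the figures quoted above:
   0.9 MIPS/MHz for the M0 and 1.25 MIPS/MHz for the M4. */
static double mips(double mhz, double mips_per_mhz) {
    return mhz * mips_per_mhz;
}
/* v1: mips(16.0, 0.9)  -> ~14.4 MIPS
   v2: mips(64.0, 1.25) -> 80 MIPS, roughly 5.5x the v1 */
```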

I have asked the people behind the Micro:bit if they intend to replace all of the v1 machines with the v2 but they have yet to reply.

Now, I for one was impressed by the use of PWM to produce a 4-channel tracker with audio quality similar to 8-bit samples. It would be nice to do something using the audio. I've already reverse-engineered the C64 version of SAM (the speech utility provided with the Micro:bit). The quality is pretty poor because SAM was originally developed for the Apple ][, which only had a 1-bit beeper. I don't know exactly how that hardware works, but I suspect that when the bit is set to 1, a positive DC voltage is sent to the speaker for as long as the bit is 1, and when the bit is set to 0, a negative DC voltage is sent through the speaker. The difference is that the Micro:bit is sufficiently powerful to use PWM to increase the perceived bit depth. I would like to apply PWM to SAM.
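For anyone following along, the basic PWM-as-DAC idea can be sketched as below. This is a minimal illustration, not the real nRF PWM interface: the carrier runs fast enough that the speaker and any filtering average the pulse train, and the duty cycle carries the sample value.

```c
#include <stdint.h>

/* 8-bit resolution: carrier frequency = timer_clock / 256.
   The compare value sets the duty cycle; the analogue average
   of the pulse train is proportional to it. */
#define PWM_TOP 255u

/* Map a signed 8-bit sample onto a duty-cycle compare value. */
static uint32_t sample_to_duty(int8_t sample) {
    return (uint32_t)(sample + 128);   /* -128..127 -> 0..255 */
}
```

In a real driver you would load this compare value into the PWM peripheral once per sample period; the register names and update mechanism are chip-specific.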

I have already identified the table of phonemes stored by SAM and wondered if it would be possible to replace the samples with 2-bit ADPCM. I found a routine on GitHub which supports 2-bit ADPCM. The only problem with this routine is that it converts 2 -> 16 bits, whereas I think 2 -> 8 is more appropriate, ALTHOUGH I am quite willing to be guided by someone more expert than myself (that's most of you). Of course, IF I write the audio player with the v2 hardware in mind, I may well be able to mix more than 4 channels, in which case having 16-bit values to combine will be more accurate.
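I don't have the exact GitHub routine to hand, but the common 2-bit IMA-style scheme (1 magnitude bit + 1 sign bit per code, 89-entry step table) decodes roughly like this; the index adjustments here are an assumption based on the usual 2-bit variant. Shifting the result right by 8 gives the 2 -> 8 conversion:

```c
#include <stdint.h>

static const int16_t step_table[89] = {
    7, 8, 9, 10, 11, 12, 13, 14, 16, 17,
    19, 21, 23, 25, 28, 31, 34, 37, 41, 45,
    50, 55, 60, 66, 73, 80, 88, 97, 107, 118,
    130, 143, 157, 173, 190, 209, 230, 253, 279, 307,
    337, 371, 408, 449, 494, 544, 598, 658, 724, 796,
    876, 963, 1060, 1166, 1282, 1411, 1552, 1707, 1878, 2066,
    2272, 2499, 2749, 3024, 3327, 3660, 4026, 4428, 4871, 5358,
    5894, 6484, 7132, 7845, 8630, 9493, 10442, 11487, 12635, 13899,
    15289, 16818, 18500, 20350, 22385, 24623, 27086, 29794, 32767
};
/* Assumed index deltas: magnitude 0 -> -1, magnitude 1 -> +2. */
static const int8_t index_table[4] = { -1, 2, -1, 2 };

typedef struct { int16_t sample; int8_t index; } adpcm2_state;

/* Decode one 2-bit code (bit 0 = magnitude, bit 1 = sign). */
static int16_t adpcm2_decode(adpcm2_state *s, uint8_t code) {
    int step = step_table[s->index];
    int diff = step >> 1;               /* magnitude 0: half a step */
    if (code & 1) diff += step;         /* magnitude 1: 1.5 steps   */
    if (code & 2) diff = -diff;         /* sign bit                 */

    int sample = s->sample + diff;
    if (sample > 32767) sample = 32767;
    if (sample < -32768) sample = -32768;
    s->sample = (int16_t)sample;

    int index = s->index + index_table[code & 3];
    if (index < 0) index = 0;
    if (index > 88) index = 88;
    s->index = (int8_t)index;
    return s->sample;
}
```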

Long ago I found a drum machine for the Commodore 64. The way it worked was quite simple. As people might know, the C64 does not have a dedicated DAC, but writing values between $00 and $0F to the volume register acts as one. The clever trick here was that the drum machine had 3 tables: one was used when just 1 channel was in use, a second when 2 channels were in use and a third when all 3 channels were in use. I'm sure you can imagine their contents.
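The trick can be sketched like this (my guess at the table contents, pre-scaled per active-channel count so the 4-bit sum never clips):

```c
#include <stdint.h>

/* One table per "number of active channels": each channel's 4-bit
   value is pre-divided so the mixed sum stays within 0..15. */
static uint8_t mix1[16], mix2[16], mix3[16];

static void build_mix_tables(void) {
    for (int v = 0; v < 16; v++) {
        mix1[v] = (uint8_t)v;        /* 1 channel: full range */
        mix2[v] = (uint8_t)(v / 2);  /* 2 channels: half each */
        mix3[v] = (uint8_t)(v / 3);  /* 3 channels: a third   */
    }
}

/* Mix up to three 4-bit drum values into one 4-bit output. */
static uint8_t mix_drums(uint8_t a, uint8_t b, uint8_t c, int active) {
    switch (active) {
    case 1:  return mix1[a];
    case 2:  return (uint8_t)(mix2[a] + mix2[b]);
    default: return (uint8_t)(mix3[a] + mix3[b] + mix3[c]);
    }
}
```

Table lookups replace the divides at playback time, which is the whole point on a 1 MHz 6510.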

Ideally I would love to support 4 drum channels and 4 'real' sample channels, i.e. channels with frequency and amplitude control plus vibrato, glissando, trill and all of those other tricks that allow a limited number of channels to act like a LOT of channels. Of course, this would require the 4 harmonic channels to be decompressed from 2 to 16 bits, mixed, and then the drum channels added.
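A minimal sketch of the usual way a variable-pitch channel (and effects like vibrato) is done, assuming a 16.16 fixed-point phase accumulator; all names here are illustrative:

```c
#include <stdint.h>

typedef struct {
    uint32_t phase;   /* 16.16 position within the sample */
    uint32_t incr;    /* 16.16 step per output sample     */
} channel;

/* Fetch the next output sample; the increment sets the pitch.
   Simple clamp at the end of the sample, no looping. */
static int16_t channel_next(channel *ch, const int16_t *pcm, uint32_t len) {
    uint32_t idx = ch->phase >> 16;
    if (idx >= len) idx = len - 1;
    ch->phase += ch->incr;
    return pcm[idx];
}

/* Vibrato: wobble the increment around the base pitch. 'lfo' is a
   small signed value from a sine table, 'depth' a 16.16 offset. */
static void apply_vibrato(channel *ch, uint32_t base_incr,
                          int32_t lfo, uint32_t depth) {
    ch->incr = base_incr + (uint32_t)(((int64_t)lfo * depth) >> 15);
}
```

Glissando and trill fall out of the same mechanism: both are just schedules of increment changes per tick.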

I do realise that this is quite a complex task but over the 14 years I was a professional computer games programmer, I got to write an awful lot of music/SFX drivers:

- Commodore C64
- Apple ][
- ZX Spectrum (including a single channel of sample sound)
- Sega Master System
- Sega Megadrive (including a single channel of sample sound)
- Sega 32X
- Sega Saturn
- Nintendo Entertainment System (1 channel of Δ samples)
- Super Nintendo Entertainment System
- Nintendo Virtual Boy (the waveform was 32 6-bit values, so rewrites allowed 5 channels of samples; channel 6 was noise)
- Nintendo 64
- Nintendo Gameboy Color (including a single channel of sample sound)
- Neo Geo Pocket Color
- Nintendo Gameboy Advance
- PSX (PlayStation)
- Nintendo DS (16 sample channels)

I'm not showing off - I learnt the hard way that a different technique is needed for each and every platform. Since these drivers had to use a minimal amount of bus bandwidth so that they didn't slow the games down, many potential techniques were not possible. If I had had all of the NES CPU time, I could have mixed and written 7-bit values directly to the DAC. That's on a 2 MHz 6502 (well, a 2A03, which is a 6502 with all of the BCD circuitry removed).

In this case I'm intending to write a stand-alone synthesizer. I have experience writing a drum machine: because all of the samples are played at a fixed rate, it's merely a case of reading the sample data for each of the 3 channels, mixing them using simple tables and writing to the DAC. The issue is the variable frequency of the harmonic channels (4, I hope), which first need decompressing from 2-bit ADPCM to 16-bit, then mixing with each other, then mixing with the drum tracks, and finally outputting the whole lot to the DAC. In this case the quality relies on PWM, a methodology that I am not familiar with.
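The per-sample pipeline just described can be sketched at the top level like this; it's a crude equal-weight mix with clamping, and all the names are placeholders rather than a real driver:

```c
#include <stdint.h>

#define MELODIC 4

/* Combine the decoded melodic channels with the pre-mixed drum
   value into one 16-bit output sample. */
static int16_t render_sample(const int16_t melodic[MELODIC], int16_t drums) {
    int32_t acc = 0;
    for (int i = 0; i < MELODIC; i++)
        acc += melodic[i];
    acc = acc / MELODIC + drums;     /* equal-weight melodic mix */
    if (acc > 32767) acc = 32767;    /* clamp to 16-bit range    */
    if (acc < -32768) acc = -32768;
    return (int16_t)acc;
}
```

In the real driver the division would be folded into the decode tables, and the result scaled down to whatever resolution the PWM output provides.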

The Micro:bit only has very simple input and output, so I am wondering if I will need to add either a touch screen, or a plain screen plus a selection of other input methods. Waveshare and a number of other vendors offer 1.8" TFT screens (but I cannot find anything larger than 1.8"), and none of them are touch screens. A few simply offer 256 x 16 pixels, i.e. two rows of 32 characters. On reflection, I think the TFT screens look the better option.

But that leaves me with the problem of input. I have found just 1 solution:

https://www.youtube.com/watch?v=6EP4AaF8HHE

I don't know if anyone has got one of those 1.8" TFT screens and/or a PS2 keyboard adaptor. If so, I really would appreciate your input.

It would be lovely if the whole setup would work on the Mark 1 hardware, but 4 drum channels and 4 harmonic channels is going to use a fair amount of processing power. I do not think 14.4 MIPS of Thumb will be reliably sufficient. It would be dreadful to discover that the whole thing works UNLESS the user plays a high frequency on all four of the melody channels! I've run into such bugs - the ones you only find on the last day and for which you have no answers. I think the 80 MIPS with the Thumb-2 instruction set offers the programmer the chance to optimise and to optimise and to optimise.

I've not coded in Thumb-2 yet, but I am keen to explore the cache behaviour, because the instruction stream is a mixture of 16- and 32-bit instructions. I PRESUME this means that if the PC is on a 32-bit boundary and the next 2 instructions are 16-bit, it reads 2 instructions at the same time. Does this then mean that the bus goes unused for a cycle? In short, can DMA be set to a lower priority than the CPU so that it uses up those unused instruction-fetch slots? I have read that some M4 processors come equipped with a 4 KB mixed cache (16-byte lines), i.e. 256 cache lines. With that in mind, I will design the mixer to keep the code within the cache. Since the drum patterns are played at a fixed rate, they might benefit from being cached, i.e. 16 samples can be loaded at once, BUT the melodic channels may have to skip samples (after 2 -> 16-bit decompression).

Depending on the behaviour of the cache, the lookup tables for decoding the ADPCM only take StepSizeTable (89 x 2 = 178) + StepSizeTable2 (11 x 2 = 22) + IndexTable (16 x 1 = 16) bytes, i.e. a total of 216 bytes, or 14 of the 256 cache lines. If I place the ADPCM decoder and the drum mixer into the cache, it SHOULD just about fit. I am wondering if I should place some of the resulting 1-bit beeper values in the cache, although if they live outside the cache, I can simply DMA each 1-bit beeper value to the appropriate hardware register.
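For the record, the footprint arithmetic above, assuming 16-byte lines and a hypothetical 4 KB cache:

```c
#define LINE 16
#define LINES_TOTAL (4096 / LINE)    /* 256 lines in a 4 KB cache */

/* Cache lines needed for a table, rounding up to whole lines. */
static int lines_for(int bytes) {
    return (bytes + LINE - 1) / LINE;
}
/* 89*2 + 11*2 + 16*1 = 216 bytes -> 14 of the 256 lines. */
```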

On that front, I have looked at many documents, but nowhere do they mention where that 1-bit beeper register IS. Can someone enlighten me as to its address?

Now, I have covered a lot of ground. There are a lot of ideas that may or may not work, and I'm certainly going to need to listen to you experts. I would just like to make it clear that this is a not-for-profit project, and anyone who helps out, even in the smallest way, WILL receive an equal credit. I have always believed that the product is the important thing. I am MORE than happy to take advice, and however you help, you will not find yourself forgotten.

So, many thanks for reading this, many thanks for your time and effort.

  • Hi there, I have moved your query to the Architectures and Processors forum. Many thanks.

  • Have you looked at the PJRC "Teensy Audio Library"?  AFAIK, it provides a relatively well-regarded audio library for some of the moderate-to-high functioning NXP ARM CPUs, and I'm tempted to say that a library that is more or less compatible with that (for the Nordic chips) would be more useful than a higher-performance brand new library.  For that matter, "more portable" code is probably more useful than highly-tuned chip-specific code.

    OTOH, it's not clear whether you're writing a library, a driver, or a synthesizer application.

    I suspect that what you're aiming at is nearly impossible on a 16MHz CM0, and relatively trivial on a 64MHz M4F.  Demonstrating why most people are less likely to write ASM code for modern CPUs.

    Meh.  Just my random opinions.  Good luck!

    > can DMA be set to a lower priority than the CPU so that it uses up those unused instruction fetches?

    The way I read the datasheet, the cache is for flash only, and DMA is for RAM only.  I'm not sure how that interacts with the "bus matrix", contention-wise.