Can the Cortex-M3 stream videos?

I'm trying to build a device that can receive video wirelessly and play it on an embedded system. What would be the best course to take? If the ARM Cortex-M3 is a viable option for this, do I need a separate video controller with it, or can I use an alternative solution in the ARM family? (I understand that I am targeting a wide range of chips by simply saying Cortex-M3. I would like to know whether I can throw such a wide net, or whether it has to be certain models.)

Thanks!

  • Cortex-M3 comes in many colours and shapes.

    Said in a different way: The maximum speed of a Cortex-M3 can be 16 MHz, or it can be 180 MHz, depending on which model you choose.

    If you're to try receiving video and output it to some kind of display, you should know that there are differences between display types as well.

    It also depends on how many pixels you need to output, as well as the colour-depth.

    And finally, it depends on the video format you're receiving.

    It would be a fairly easy task to output a live picture on a 320x240 display in 16-bit colour. However, if you want to handle high resolutions, such as 1024x768@60 Hz, and do H.264 decoding while you're also receiving data wirelessly, I am very sure that you'll see a lot of choppy playback.
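
    To put rough numbers on that gap, here is a minimal sketch that compares the raw (uncompressed) data rate of the two cases. The 30 FPS figure for the small display and the 16-bit colour depth for the large one are assumptions made for illustration; they are not stated above.

```python
def raw_video_rate(width, height, bits_per_pixel, fps):
    """Bytes per second needed to move uncompressed frames."""
    bytes_per_frame = width * height * bits_per_pixel // 8
    return bytes_per_frame * fps

# Assumed: 30 FPS on the small display, 16-bit colour on the large one.
small = raw_video_rate(320, 240, 16, 30)
large = raw_video_rate(1024, 768, 16, 60)

print(f"320x240@30:  {small} bytes/s")
print(f"1024x768@60: {large} bytes/s")
print(f"ratio: {large / small:.2f}x")
```

    Under those assumptions the large case needs about 20 times the bandwidth of the small one, before any decoding work is counted.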

    Since you're building the device that is going to send the stream to the microcontroller, you're in charge of the compression format. You should know that the Cortex-M4 has a better chance of decoding video, as it has built-in DSP instructions, and you also get the CMSIS DSP library for free, which you can start using right away.

    In addition, NXP makes the LPC43xx, which is a hybrid Cortex-M4/Cortex-M0 running at 204 MHz. The LPC4357 has a built-in LCD controller, and the large versions (BGA256) can also be interfaced with SDRAM and SRAM at the same time as they interface with the LCD. It supports 100 Mbit Ethernet as well, so it might be a good device to try things out on. The LPC43xx also has a very strong feature called the State Configurable Timer (SCT). It's not just another timer; it's a feature that enables this device to handle data quicker than usual.

    If crazy solutions are one of your strong points, you might be able to combine the SCT and the DMA, in order to make those two units do their part of decompressing the data while receiving it. (Note: This is challenging and requires some skill! The states of the SCT can be used for data decompression, plus the SCT can control two DMAs at the same time. The DMAs can be set up in a crazy way, so that they can modify each other's variables, including source and destination addresses, number of bytes to transfer and transfer modes. The DMAs can also modify the timer registers, so they will be able to modify the SCT. Make sure you're not sleepy while writing this kind of code, though.)

    You should know that the code that outputs video should most likely be run from the internal SRAM instead of the flash-memory, because it runs much faster from SRAM.

    Two other strong devices are from STMicroelectronics. First, the STM32F439. This one runs at up to 180 MHz; it does not have the SCT, but it has another interesting feature called Chrom-ART. This is a 2D graphics accelerator, which can be used for a different kind of decompression.

    Basically, it's a feature that copies an image portion of any size from one location to another, and it can also crop the image if necessary.

    In addition to just copying, it can blend two images into a third. As it supports indexed colours, this feature can also be used in the decompression.
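
    The idea of using indexed colours as a cheap form of decompression can be sketched in software. This is only an illustration of palette expansion, not ST's API; the function names and the tiny palette are made up for the example. The sender transmits one byte per pixel plus a palette, and the receiver expands each index to a 16-bit colour, which is the kind of lookup a 2D accelerator could do instead of the CPU.

```python
def rgb565(r, g, b):
    """Pack 8-bit R, G, B into one 16-bit RGB565 value."""
    return ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3)

def expand_indexed(indices, palette):
    """Replace each 8-bit palette index with its 16-bit colour."""
    return [palette[i] for i in indices]

# Hypothetical 3-entry palette: black, red, white.
palette = [rgb565(0, 0, 0), rgb565(255, 0, 0), rgb565(255, 255, 255)]

# One byte per pixel on the wire instead of two.
row = bytes([0, 1, 1, 2])
pixels = expand_indexed(row, palette)

print([hex(p) for p in pixels])
```

    With an 8-bit index per pixel, the stream carries half the bytes of raw 16-bit colour, at the cost of being limited to 256 colours per palette.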

    The STM32F439 has a flash accelerator called the ART accelerator (Adaptive Real-Time accelerator). This makes your code run about as fast from flash memory as it would from SRAM, so you will not have to copy your code to SRAM before executing it in this case. The STM32F439 is also a Cortex-M4 based device, so it includes the DSP instructions.

    Finally, there's the Cortex-M7, which has not really been released yet. (I believe it'll start becoming available at Farnell, RS, DigiKey, Mouser, Avnet and the like in the beginning of next year, perhaps already in January. This is my personal impression; I do not know if this is actually the case.)

    The Cortex-M7 is by my estimation approximately 1.65 times faster than the Cortex-M4. STM's Cortex-M7 will run at 200MHz, and decoding plus displaying received video streams should not be a real problem for this microcontroller.

    Now, let's get back to the original question. Will it be possible to use a Cortex-M3 for video streaming ?

    -Yes. But it does depend on many things as mentioned earlier.

    But you can get small 320x240 displays, which are connected to your microcontroller via SPI, for instance. One of NXP's smaller devices would be able to transmit data via an SPI interface (actually the specific implementation is called SSP); the LPC1751 can clock the SSP at 50 MHz. At one bit per clock that is 6.25 MB per second raw, so 5 MB (5000000 bytes) per second is a conservative working figure. If you are using 16-bit pixels, that means 2500000 pixels per second.

    Since you have 320x240, your maximum frame-rate of uncompressed video would be 2500000 / 320 / 240 = 32.5 FPS.
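
    The same arithmetic as a small sketch, so other display sizes or transfer rates can be plugged in. The function name is only illustrative; the inputs are the 5 MB per second and 16-bit pixel figures used above.

```python
def max_fps(bytes_per_second, width, height, bytes_per_pixel):
    """Upper bound on frame rate when every pixel is sent uncompressed."""
    pixels_per_second = bytes_per_second / bytes_per_pixel
    return pixels_per_second / (width * height)

fps = max_fps(5_000_000, 320, 240, 2)
print(f"{fps:.2f} FPS")
```

    This is only a ceiling on the display link; it leaves no time for receiving or decoding the stream.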

    You will of course also have to receive the video stream and decompress it. The microcontroller is running at 100MHz.

    Depending on how complex your video compression format is, you may not be able to get the full 32 FPS, and it might be a bit tricky to design a suitable compression format, because you don't have the luxury of the DSP instructions from the Cortex-M4.
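
    One way to judge how much decoding is affordable is to work out the CPU cycle budget per pixel. The sketch below uses the 100 MHz clock and the 2500000 pixels per second from above; the 15 FPS case is an assumption added for comparison.

```python
def cycles_per_pixel(cpu_hz, width, height, fps):
    """CPU cycles available for each displayed pixel at a given frame rate."""
    return cpu_hz / (width * height * fps)

CPU_HZ = 100_000_000          # LPC1751 clock mentioned above
FULL_RATE_FPS = 2_500_000 / (320 * 240)   # about 32.5 FPS

budget_full = cycles_per_pixel(CPU_HZ, 320, 240, FULL_RATE_FPS)
budget_15 = cycles_per_pixel(CPU_HZ, 320, 240, 15)  # assumed lower target

print(f"full rate: {budget_full:.1f} cycles/pixel")
print(f"15 FPS:    {budget_15:.1f} cycles/pixel")
```

    Roughly 40 cycles per pixel has to cover receiving, decompressing and sending each pixel, unless DMA takes over part of that work.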

    Still the LPC1751 has a DMA, and you can do tricks with this as well. (These tricks are not documented anywhere; I speak from personal experience).

    The LPC175x does not have any support for external RAM, so you'd have to make do with maximum 32KB local SRAM. That's not much.
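
    To see how tight 32 KB is, here is a quick sketch comparing it with one 320x240 frame in 16-bit colour. It ignores the stack, variables and receive buffers, which would all take their share of the same SRAM.

```python
SRAM_BYTES = 32 * 1024
WIDTH, HEIGHT, BYTES_PER_PIXEL = 320, 240, 2

frame_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL   # one full frame
line_bytes = WIDTH * BYTES_PER_PIXEL             # one display line
lines_that_fit = SRAM_BYTES // line_bytes        # whole lines in SRAM

print(f"frame: {frame_bytes} bytes, SRAM: {SRAM_BYTES} bytes")
print(f"at most {lines_that_fit} of {HEIGHT} lines fit at once")
```

    A full frame buffer does not fit, so on such a device the video would have to be processed in strips of a few lines and pushed to the display as it arrives.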

    A better choice would be the LPC1769; it has support for Ethernet and runs at 120 MHz. The LPC1788 is also a Cortex-M3; it has an LCD controller, runs at 120 MHz and has Ethernet support as well. The LPC4088 is more or less identical to the LPC1788, except that it is a Cortex-M4, so it has the DSP instruction set.

    I am no expert on other brands, perhaps someone else can tell you about features in Texas Instruments, Silicon Labs, Microchip, Freescale, Infineon Technologies, Analog Devices Inc, Nordic Semiconductor, Spansion, Fujitsu Semiconductor, Cypress and all those I've forgotten to mention (if I did forget any, it was not intended).

    But my recommendations are:

    Go for highest possible speeds; 72MHz is probably too low. I suggest 160MHz or more.

    Choose a device with Ethernet.

    Consider choosing a device that has a built-in LCD controller.

    Look at as many datasheets as you can, and find the benefits of each device, consider whether or not they can be used for video acceleration in one way or the other (decompression, especially).

    If the device you've chosen also supports I2S, you might want to use this to interface with a low-cost, fairly good quality audio DAC (the WM8523, for instance).

    An alternative to using I2S and a DAC, is to use the PWM timers to generate PWM audio (if you're using the LPC series, you should consider using the Motor Control PWM for this feature).
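
    For the PWM route, the achievable audio resolution follows from the timer clock and the sample rate. The sketch below assumes the PWM timer is clocked at 120 MHz (the LPC1769 speed mentioned above) and that one PWM period is used per 44.1 kHz sample; both are assumptions for illustration.

```python
import math

def pwm_resolution(timer_hz, sample_rate_hz):
    """Duty-cycle steps per PWM period, and the equivalent bit depth."""
    steps = timer_hz // sample_rate_hz
    return steps, math.log2(steps)

# Assumed: 120 MHz timer clock, one PWM period per 44.1 kHz sample.
steps, bits = pwm_resolution(120_000_000, 44_100)
print(f"{steps} steps per period, about {bits:.1f} bits")
```

    Under those assumptions PWM audio gets a little over 11 bits of resolution, which is one reason an I2S DAC may sound better.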

    If you can disclose further details on your requirements, I'll be happy to narrow down your search and give more hints.

    Note: The LCD-controllers allow you to use for instance external SDRAM for video frame buffers. This makes it easier to decompress video, but not necessarily faster. It all depends on the kind of video stream we're dealing with.

  • paulstoffregen - Great to hear about other Cortex-M options! -Thank you for chipping in (eh, that was lame, wasn't it?)

    Yes, I made a lot of assumptions in my answer above; that's true. I've personally used the 50MHz clock with my SPI display, which is a low-cost display module from China (you probably know already). -I must say that your library is definitely impressive. Note: I just cloned the sources and viewed them lightly. I think you may be able to improve on the character-output by drawing the characters vertically instead of horizontally - I did this in my own code.

    The reason I think this is the case is that there are many more adjacent pixels vertically than horizontally (well, in my font anyway). Most of the savings come from being able to move the window without resizing it.

    x-627 - I think I forgot to mention that there are 8-bit parallel and 16-bit parallel displays available as well. If pushing things to the limit, I believe it would be possible to get a data rate of 8 bits per clock cycle on a 16-bit parallel display or 4 bits per clock cycle on an 8-bit parallel display, however, the display would most likely not be able to handle a clock rate that high. I'm not currently able to test this theory, but this particular setup will most likely require the SCT found in the LPC43xx or LPC541xx.

  • Wow, that's an impressive analysis, at least of the abstract capability of a wide range of Cortex-M3 chips.

    Please allow me to offer a couple of specific examples.  These are based on a Cortex-M4 (Freescale Kinetis K20) running at 96 MHz.  Minimal use of the DSP extensions was made, so you could probably expect similar performance from a Cortex-M3 at about 100 MHz, assuming similar buses and DMA capability.

    First, here's a LED panel project I made several months ago, displaying 30 Hz video at low resolution (90x48 pixels) and streaming 44.1 kHz (mono) audio.

    LED Video Panel at Maker Faire 2014: Concept and Development

    The video and audio data leverage the Kinetis eDMA engine.  The uncompressed data is read from a SD card, in SPI mode, using polling for the SPI peripheral.  Testing showed approx 50% CPU usage, mostly for the SPI data transfer.
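
    For a sense of scale, here is a rough estimate of the data rate that has to come off the SD card for such a panel. The pixel format and audio sample size are not given above, so 3 bytes per pixel and 16-bit audio samples are assumptions made purely for illustration.

```python
# Figures from the project description.
WIDTH, HEIGHT, FPS = 90, 48, 30
AUDIO_RATE_HZ = 44_100

# Assumptions (not stated in the description).
BYTES_PER_PIXEL = 3      # assumed 24-bit RGB
BYTES_PER_SAMPLE = 2     # assumed 16-bit mono samples

video_rate = WIDTH * HEIGHT * BYTES_PER_PIXEL * FPS
audio_rate = AUDIO_RATE_HZ * BYTES_PER_SAMPLE
total_rate = video_rate + audio_rate

print(f"video {video_rate} + audio {audio_rate} = {total_rate} bytes/s")
```

    Even with these guesses the total stays under half a megabyte per second, far below what a 320x240 display at the same frame rate would need.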

    A 320x240 SPI-interface display was mentioned.  This is another place I've done significant optimization work.  Here's a blog article, with a sample video:

    Display & SPI Optimization | DorkbotPDX

    As you can see in that article, simplistic software design for these displays results in slow performance, even with a fast CPU.

    Above, a theoretical 32.5 Hz refresh rate was mentioned, based on the assumption of 50 MHz SPI clock.  Since publishing that article, I've talked with others attempting similar optimization.  Testing has shown many of those displays do not work reliably with 42 MHz clock speed.  Reliable specs are hard to find, but some datasheets spec a maximum SPI clock of only 10 MHz.  I have personally done a LOT of testing with 24 MHz SPI clock, but always while running the display at 3.3V (well above its minimum 2.2 or 2.4V), with good results.
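
    To show what those clock limits mean for refresh rate, here is a small sketch of the raw ceiling for a 320x240, 16-bit display at the SPI clocks mentioned. It assumes one bit per clock with no command or gap overhead, so real figures will be lower; note that the 32.5 Hz estimate above was based on 5 MB per second rather than the raw 50 MHz bit rate.

```python
def spi_fps(clock_hz, width=320, height=240, bits_per_pixel=16):
    """Raw frame-rate ceiling: one bit per SPI clock, no overhead."""
    return clock_hz / (width * height * bits_per_pixel)

for mhz in (10, 24, 42, 50):
    print(f"{mhz} MHz -> {spi_fps(mhz * 1_000_000):.1f} FPS")
```

    At a 24 MHz clock the ceiling is a little under 20 full-screen updates per second, and at 10 MHz only about 8.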

    If your video is compressed with any DCT-based algorithm, odds are slim a Cortex-M3 will be capable of decoding the data in real time at any significant resolution.  Even just moving the bytes from a SPI port to RAM can take a lot of CPU time if DMA is not leveraged efficiently.

    Cortex-M7, when chips appear in the 350 to 400 MHz range, might open up more possibilities.  Maybe?  As you can see, I actually do quite a bit of work on optimizing open source middleware.  If anyone from ARM actually reads this message, please let me know when an updated v7m architecture reference manual is published?

  • Yesterday, Hackaday was highlighting a project from Karl Hunt on outputting 800x600 VGA from an STMicroelectronics STM32 F4 series.

    The project is available, but it only takes care of displaying the information; I don't believe it does any of the video decoding you might wish to do.