I've been working on an edge computing project using Arm Cortex-A-based gateways. I currently manage a modest fleet but plan to scale to hundreds of devices across multiple sites.
OTA updates have been straightforward at small scale, but I'm starting to think harder about failure scenarios: partial updates, rollback strategies, staged rollouts, and how to handle devices with poor connectivity mid-update.
A few things I'm genuinely uncertain about:
Curious what approaches others have landed on, especially at production scale. Are there Arm-specific tools or PSA-aligned patterns that make this more robust?