So I have a question regarding the lossless compression that you mentioned in the key features and benefits; I did not quite understand it. Are you implying that when you quantize the model, it reduces to a size which is less than 75% of its original size? Or am I missing something?
Hi, I presume you are referring to the statement on the Ethos-U landing page on developer.arm.com: "advanced, lossless model compression reduces model size up to 75%". The first part is "lossless":
Ethos-U supports int8/uint8 input values, so models need to be quantized before being passed to Vela. The F32-to-int8 quantization costs some accuracy; how much depends on whether you are doing post-training quantization or training the model itself with quantized weights. What "lossless" refers to for Ethos is that there is no further accuracy loss after this stage. Vela runs the model through many internal stages and generates its command stream for Ethos, but it does not drop any accuracy. So, after the accuracy drop from quantization, Ethos is bit accurate and there is no further accuracy loss.
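To make the "lossy" step concrete, here is a minimal sketch of symmetric per-tensor int8 quantization in plain Python. It is only an illustration of where the accuracy loss comes from, not how any particular converter implements it; the sample weights and the helper names `quantize` and `dequantize` are made up for this example.

```python
def quantize(values, scale, zero_point=0):
    """Affine quantization to int8: q = round(x / scale) + zero_point, clamped."""
    out = []
    for x in values:
        q = round(x / scale) + zero_point
        out.append(max(-128, min(127, q)))
    return out


def dequantize(qvalues, scale, zero_point=0):
    """Map int8 values back to floats: x ~ (q - zero_point) * scale."""
    return [(q - zero_point) * scale for q in qvalues]


# Hypothetical float32 weights, chosen only for illustration.
weights = [0.013, -0.402, 0.250, 0.999, -1.000]

# Symmetric per-tensor scale: the largest magnitude maps to 127.
scale = max(abs(w) for w in weights) / 127

q = quantize(weights, scale)
restored = dequantize(q, scale)

# Rounding error introduced by quantization (the lossy step).
errors = [abs(w - r) for w, r in zip(weights, restored)]

print("int8 weights:", q)
print("max rounding error:", max(errors))
```

The rounding error here is bounded by half the scale per weight. Everything downstream of this step works on the int8 values, which is where the bit-accurate claim applies.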
Weight Compression: The Ethos-U55 hardware processes the model weights via a hardware weight decoder. Vela arranges the model weights into blocks, and the blocks are then fed into the hardware weight decoder of the NPU. To arrange the weights into blocks, a header is added to the weights of the input model (this is the encoding). The weights may get compressed as part of the encoding, but this is not always the case. Note that the encoding happens for every weight tensor of the model. This is the weight compression, and it can reduce the size significantly.
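As a rough illustration of "header plus block, compressed only when it helps", here is a toy sketch in Python. It is not Vela's actual encoding or the NPU's weight format: the 3-byte header layout, `BLOCK_SIZE`, and the use of `zlib` as a stand-in compressor are all assumptions made for this example. What it does show is that the round trip is bit exact, and that the size reduction depends on the weights themselves.

```python
import random
import struct
import zlib

BLOCK_SIZE = 1024  # hypothetical block size, chosen only for this sketch


def encode_blocks(weights: bytes) -> bytes:
    """Split weights into blocks and prefix each with a small header."""
    out = bytearray()
    for start in range(0, len(weights), BLOCK_SIZE):
        block = weights[start:start + BLOCK_SIZE]
        packed = zlib.compress(block, 9)
        # Store the block raw when compression does not make it smaller.
        if len(packed) < len(block):
            flag, payload = 1, packed
        else:
            flag, payload = 0, block
        # Header: 1-byte "compressed" flag + 2-byte payload length.
        out += struct.pack("<BH", flag, len(payload)) + payload
    return bytes(out)


def decode_blocks(stream: bytes) -> bytes:
    """Inverse of encode_blocks: recovers the original bytes exactly."""
    out = bytearray()
    pos = 0
    while pos < len(stream):
        flag, size = struct.unpack_from("<BH", stream, pos)
        pos += 3
        payload = stream[pos:pos + size]
        pos += size
        out += zlib.decompress(payload) if flag else payload
    return bytes(out)


rng = random.Random(0)
# Mostly-zero int8 weights (stored as bytes) versus uniformly random ones.
sparse = bytes(rng.choice([0, 0, 0, 0, 0, 0, 1, 255]) for _ in range(4096))
dense = bytes(rng.randrange(256) for _ in range(4096))

sparse_enc = encode_blocks(sparse)
dense_enc = encode_blocks(dense)

print("sparse:", len(sparse), "->", len(sparse_enc))
print("dense: ", len(dense), "->", len(dense_enc))
```

In this sketch the mostly-zero tensor shrinks a lot, while the random one does not shrink and still pays for the per-block headers, which matches the point that encoding always happens but compression is not guaranteed.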
Hope this clears up both of your points.