
How can I read the Vela compiler report?

I have some questions about the report produced by ethos-u-vela.

The Vela version I use is 3.0.0.

This is an introduction to Vela's options:
review.mlplatform.org/.../OPTIONS.md

The page above includes an example configuration file.

I would like to ask:

1.
There are burst length and latency settings.

Does the latency in the example refer to the memory's CAS latency?

And does burst length mean that, once an address has been issued, several data transfers can be read consecutively?

2.
What I most want to ask concerns the three memory modes (Sram Only, Shared Sram, Dedicated Sram): do I select a mode by adjusting the four settings (const_mem_area, arena, cache_mem_area, arena_cache_size) rather than by setting it directly for the model?


3.
The total DRAM MB/batch in the figure below has a value.
But there is no value in the last DRAM access cycle.
What could be the problem?


4.
Finally, I would like to ask about the cycle values (picture above).
Where can I find the cycle count used by each op?
I assume these values can only be calculated once the execution process of each op is understood.

Thank you for your answers in advance

  • Thanks for posting on Arm Technical Forum. Please find the response inline.

    1.

    Read AXI latency is defined as the number of cycles between the accepted read address and the arrival of the corresponding read data.

    Write AXI latency is defined as the number of cycles between the accepted write address and the arrival of the corresponding write response.

    Similarly, the DMA (between Ethos-U55 and Cortex-M*) also handles latency hiding and out-of-order burst handling, and for this purpose it requires an internal burst buffer. This handles a run-time configurable number of outstanding bursts as well as a configurable maximum burst length.

    Coming back to setting this up in Vela: these parameters matter because they are used by Vela's performance estimation, which in turn is used to determine the optimisations performed.

    2. Yes. If you select a memory mode that is already defined in the config file by passing it on the Vela CLI as

    --memory-mode My_Mem_Mode

    then you do not need to set the remaining settings explicitly. Please note that "My_Mem_Mode" must be defined in your configuration file, where you set const_mem_area, arena and cache_mem_area (each to Axi0 or Axi1) and arena_cache_size.

    For example:

    https://review.mlplatform.org/plugins/gitiles/ml/ethos-u/ethos-u-vela/+/refs/heads/master/vela.ini
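
As a rough illustration of what such a memory mode section can look like and how its four settings can be inspected, here is a minimal Python sketch using the standard-library configparser. The section name My_Mem_Mode comes from the reply above; the key name arena_mem_area and all values are assumptions made for illustration, so please check them against the linked vela.ini.

```python
import configparser

# Illustrative config text. The values are assumptions, not a recommendation.
EXAMPLE_INI = """
[Memory_Mode.My_Mem_Mode]
const_mem_area=Axi1
arena_mem_area=Axi1
cache_mem_area=Axi0
arena_cache_size=393216
"""


def read_memory_mode(ini_text, mode_name):
    """Return the settings of one Memory_Mode section as a dict of strings."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return dict(parser["Memory_Mode." + mode_name])


settings = read_memory_mode(EXAMPLE_INI, "My_Mem_Mode")
for key, value in settings.items():
    print(key, "=", value)
```

This only reads the file; it does not reproduce any of Vela's own validation of the settings.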

    3. This needs further investigation. Which memory mode did you select? I need to check your configuration (.ini file) to investigate; it might be a bug.
    4. No, we cannot get per-op NPU cycles. All the ops that map to the Ethos-U55 are packed into a custom op that runs on the NPU. The active NPU cycle count is based on PMU counters.

    Hope this helps. If you are a licensed customer of Ethos-U55, please send your issue (3) and any others to support-ml@arm.com.

     

    Thanks,

    Sandeep Singh

  • Hi

    The Vela compiler

    This tool is used to compile a TensorFlow Lite for Microcontrollers (TFLμ) model into an optimized version that can run on the Ethos-U NPU.

    The optimized model contains TensorFlow Lite custom operators (supported operators) for those parts of the model that can be accelerated by the Ethos-U NPU. Parts of the model that cannot be accelerated are left unchanged and will instead run on the Cortex-M series CPU using an appropriate kernel.

    Vela trials a number of different compilation strategies and applies a cost function to each one. It then chooses the optimal execution schedule for each supported operator or group of operators.

  • First of all, thank you for your answer.

    1. Regarding the first question, I think I now understand the definition of latency.

    However, I still do not fully understand the basis of burst length. As I understand it, is this value defined in terms of AXI rather than in terms of the memory?

    I do not understand how I should set this value.

    2. Thanks for your second answer.
    But if I run Sram Only mode, the following message appears:


    Info: Changing const_mem_area from Sram to OnChipFlash.

    The following are my settings:

    I'm not sure whether this means Sram Only mode cannot be used.

    3. Here are my settings.

    The system configuration is the same as above.
    Only the memory mode is changed (Shared Sram).

    I later tried Inception V4 (from TensorFlow Hub, tfhub.dev), and it produced a value under the same settings.

    inception V4:

    MNIST:

    But it is still not clear why no cycles are reported when the model is very small.

    Thanks again for your answer

  • @Danter

    1. Efficient burst length: this parameter lets Vela estimate memory performance as a function of burst length.
    When it is set to 64B, for instance, any burst shorter than 64B is treated as taking the same time as a 64B burst.
    This can be used to model a memory where short bursts are inefficient.
    So the value depends on the interconnect and memory of the hardware, and you should set burst length to the hardware burst length.
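
The rule described in point 1 can be sketched in a few lines of Python. This is a simplified model written for this thread, not Vela's actual estimator: the function name is hypothetical, and it only captures the statement that a burst shorter than the configured burst length is costed as a full-length burst.

```python
def modelled_burst_bytes(transfer_bytes, burst_length):
    """Size that a single burst is costed at in this simplified model.

    Bursts shorter than burst_length are treated as a full burst;
    longer transfers keep their own size.
    """
    if transfer_bytes <= 0 or burst_length <= 0:
        raise ValueError("sizes must be positive")
    return max(transfer_bytes, burst_length)


# With a 64B burst length, a 16B access is costed like a 64B access.
for size in (16, 64, 100):
    print(size, "->", modelled_burst_bytes(size, 64))
```

In this model, a larger burst length makes small scattered accesses look more expensive, which is why the value should follow the real hardware.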

     

    2. "Info: Changing const_mem_area from Sram to OnChipFlash."
    When you see this message, const_mem_area has been changed from Sram to OnChipFlash. OnChipFlash will use the same characteristics as Sram.

    In the code, this happens:

                    if self.const_mem_area == MemPort.Axi0:
                        self.const_mem_area = MemPort.Axi1
                        self.axi1_port = MemArea.OnChipFlash

    OnChipFlash is internal to Vela and, as shown in the code, it is the name used to specify that const_mem_area is in read-only SRAM. In the vela.ini config file, the user can specify either OnChipFlash or Sram. If Sram is specified, Vela automatically changes it to OnChipFlash and prints a message indicating this. Two SRAM areas are then created, one called 'Sram' and the other called 'OnChipFlash': the 'Sram' area is read/write and the 'OnChipFlash' area is read-only.
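
To make the remapping easier to follow, here is a self-contained toy version of the excerpt quoted above. The enums and the ToyArch class are hypothetical stand-ins for Vela's internal classes, and the trigger condition (constants on Axi0, with Axi0 mapped to Sram) is an assumption; only the three lines inside the `if` are taken from the thread.

```python
from dataclasses import dataclass
from enum import Enum


class MemPort(Enum):
    Axi0 = 0
    Axi1 = 1


class MemArea(Enum):
    Sram = "Sram"
    OnChipFlash = "OnChipFlash"
    OffChipFlash = "OffChipFlash"


@dataclass
class ToyArch:
    """Hypothetical stand-in for Vela's architecture object."""
    axi0_port: MemArea
    axi1_port: MemArea
    const_mem_area: MemPort

    def area_of(self, port):
        """Memory area currently attached to the given AXI port."""
        return self.axi0_port if port == MemPort.Axi0 else self.axi1_port

    def remap_const_area(self):
        """Move constants from Sram to a read-only 'OnChipFlash' area."""
        # Assumed trigger: constants live on Axi0 and Axi0 is Sram.
        if self.const_mem_area == MemPort.Axi0 and self.axi0_port == MemArea.Sram:
            print("Info: Changing const_mem_area from Sram to OnChipFlash.")
            self.const_mem_area = MemPort.Axi1
            self.axi1_port = MemArea.OnChipFlash
            return True
        return False


arch = ToyArch(MemArea.Sram, MemArea.OffChipFlash, MemPort.Axi0)
arch.remap_const_area()
print(arch.area_of(arch.const_mem_area).value)  # area now holding constants
print(arch.axi0_port.value)                     # read/write area is unchanged
```

After the call, the read/write 'Sram' area stays on Axi0 while constants are reported under 'OnChipFlash', which matches the two areas described above.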

    3. Need to investigate this further. I will get back to you. 

  • For point 3: I checked the behaviour and it is consistent with the Vela code. Vela DMAs the weights stored in Flash/DRAM (permanent storage) to SRAM so that they do not have to be read more than once. That is why, for the smaller model, you are getting 0 off-chip flash cycles: the weights are cached in SRAM. Hope this answers all your questions.