I was wondering whether there is any documentation on how to analyze the ARM pipeline. I have access to ThunderX2 nodes, and I'd like to do the kind of bottleneck analysis that can be done on Intel chips. For Skylake, I can get the formulas used to compute the different metrics here: https://github.com/andikleen/pmu-tools/blob/master/skl_client_ratios.py. I checked the regular Arm docs (the ARM Architecture Reference Manual ARMv8, for ARMv8-A architecture profile, and the Programmer’s Guide for ARMv8-A) and did not find any information.
Thanks for your query. I would be curious to learn a bit more about your application to make sure I point you in the right direction (for instance the programming language you use, the type of application and paradigms you rely on).
If you are working with C/C++, Fortran or Python codes, Arm Forge (and in particular Arm MAP) may be suitable for you. We did a webinar recently about this very topic. I encourage you to have a look here: https://www.brighttalk.com/webcast/17792/384060/top-down-performance-analysis
If you would like to give it a go, you can download a trial version of Forge on this page: https://developer.arm.com/tools-and-software/server-and-hpc/trials
I hope this helps. Let me know if I can be of further assistance.
Thanks for the links. I'll definitely take a look. I have a number of OpenMP benchmarks and I'm trying to understand bottlenecks on the ThunderX2. I know my way around perf, so I am interested in figuring out which perf events to focus on. I'll take a look at the trial version.
Saw the video, thanks. I was hoping to find a resource like this,
here you can get a breakdown and information on how the top-down metrics are computed. I like to understand what is going on. When I check the aarch64, it is empty.
Is there a white paper, document or URL where I could get more info on the ThunderX2 top-down approach? I'm trying to learn more about the pipeline, etc.
The application is a set of benchmarks, SPEC OMP2012. The goal is to understand the underlying architecture, bottlenecks, etc. when using different computational kernels.
Thanks for clarifying. This is an interesting query and I do not believe such a document exists today. I have forwarded your request to my colleague Florent, who created the webinar and is one of our experts on the subject. We will do our best to assist you.
If you share the formulas in this forum, or in an Arm document, people can cite the URL in the references section of a white paper or other publication. That way people can share knowledge, and profiling on aarch64 becomes easier. That is a win for everyone, Arm included. What is needed are formulas for the main categories and subcategories: people want to see why their code is core bound or memory bound, and whether it is L1 bound, L2 bound, external memory bound, etc.
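Until official formulas are available, a very rough first-level breakdown can be sketched from the common ARMv8 PMU events CPU_CYCLES, INST_RETIRED, STALL_FRONTEND and STALL_BACKEND. This is only a sketch under assumptions: it assumes the ThunderX2 core actually exposes the two stall events (worth checking with `perf list`), the function name `stall_breakdown` and the sample counts are made up for illustration, and the result is a cycle-based approximation, not the slot-based top-down method used in the Skylake ratios file.

```python
def stall_breakdown(cpu_cycles, inst_retired, stall_frontend, stall_backend):
    """Rough cycle-based breakdown from four raw PMU counts.

    Assumption: STALL_FRONTEND and STALL_BACKEND count cycles in which no
    operation was issued because of the frontend or backend respectively.
    This is NOT an official Arm or ThunderX2 top-down formula.
    """
    if cpu_cycles <= 0:
        raise ValueError("cpu_cycles must be positive")

    frontend = stall_frontend / cpu_cycles
    backend = stall_backend / cpu_cycles
    return {
        "ipc": inst_retired / cpu_cycles,
        "frontend_stall_fraction": frontend,
        "backend_stall_fraction": backend,
        # Cycles attributed to neither stall event.
        "remaining_fraction": 1.0 - frontend - backend,
    }


if __name__ == "__main__":
    # Hypothetical counts, e.g. copied by hand from `perf stat` output.
    for name, value in stall_breakdown(1000, 1500, 200, 300).items():
        print(f"{name}: {value:.3f}")
```

A large backend stall fraction would only say that the core is waiting on execution or memory resources. Splitting that further into L1, L2 or external memory bound needs implementation-specific events, which is exactly the information being asked for here.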