OpenFOAM, developed by ESI-OpenCFD, is one of the most popular tools for developing CFD (Computational Fluid Dynamics) applications, along with ANSYS Fluent and CD-adapco STAR-CCM+.
Most modules of OpenFOAM are heavily optimized and offer little room for improvement at the code level, but there are still many rewards to be had by making sure that OpenFOAM makes the best use of your system.
Arm Performance Reports looks inside applications and diagnoses how well they are performing and where issues might lie.
In this article, we focus on OpenFOAM's interFoam solver at a small scale, on a single server.
We show how Arm Performance Reports helps to increase the efficiency of our usage and reach the highest level of performance our machine can offer.
We'll assume you have OpenFOAM already up and running, and will take an example from OpenFOAM's tutorials: damBreak.
This example solves a two-dimensional dam-break problem, using the interFoam solver for two incompressible, isothermal and immiscible fluids.
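If you want to follow along, the case can be copied from the tutorials directory and prepared for a parallel run. Here is a minimal sketch, assuming OpenFOAM 2.3.0 conventions; the exact tutorial path and field-initialisation step vary between versions, so check your own installation:

$ cp -r $FOAM_TUTORIALS/multiphase/interFoam/laminar/damBreak .
$ cd damBreak
$ blockMesh       # generate the mesh
$ setFields       # initialise the column of water
$ decomposePar    # split the domain, using system/decomposeParDict (shown later)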
Getting started with Arm Performance Reports is very easy: just add "perf-report" in front of the mpirun command and you are good to go.
$ perf-report mpirun -n 8 \
    /home/allinea/OpenFOAM/OpenFOAM-2.3.0/applications/linuxGccx86_64/interFoam -parallel
NB: if you are trying this for yourself but see an error, you may need to recompile OpenFOAM for profiling first.
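Those steps essentially amount to rebuilding with debug information while keeping optimization, so the profiler can attribute time to code. As a sketch, assuming a 64-bit Linux GCC build (the rules file to edit depends on your platform and compiler):

# in wmake/rules/linux64Gcc/c++Opt, add -g to the optimized flags:
c++OPT = -O3 -g

$ cd $WM_PROJECT_DIR
$ ./Allwmake      # or run wmake in the solver's directory only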
This command runs the application and generates the scientific results you would normally expect, while Arm Performance Reports creates an HTML file and a plain-text file containing profiling information about the run.
Here is the information displayed by Arm Performance Reports for this example, on 8 processes:
At a glance, we have an overview of the application's behaviour across communication, computation and disk access, plus more specific profiling information and hints to help us understand what could be improved.
Although the application is CPU-bound, most of the CPU time is spent on memory accesses. The report also indicates that the code is poorly vectorized, and the time spent in MPI communications does not seem to be used very efficiently either.
There may be room for improvement here, and Arm Performance Reports provides us with several hints:
Communication at over 14% of the run time sounds high for an application running on a single server. Perhaps we should explore the workload distribution, i.e. the domain decomposition. Can we get some hints as to how the mesh is split across the processes?
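As a reminder, the decomposition is configured in system/decomposeParDict. For our 8 processes, a simple geometric split might look like the following; this is an illustrative sketch, not the exact dictionary from our run:

numberOfSubdomains 8;

method          simple;

simpleCoeffs
{
    n           (4 2 1);   // 4 x 2 x 1 = 8 subdomains
    delta       0.001;
}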
A good proxy for data distribution is per-process memory usage, and here Arm Performance Reports suggests a reasonable balance has been achieved:
So, there's very little we can do there this time. Let's try another optimization.
Processor usage, at almost 85% of the run time, is high, which is where we want it to be; but is it good usage?
We can see from the CPU section of the report that a lot of CPU time (59%) is spent in memory accesses. This is very high: it is a sign that we do not have a great memory access pattern, when we would rather be spending that time on floating-point operations. In short, we are suffering from poor cache usage.
We probably cannot change the vectorization achieved (that usually requires source code changes or compiler magic). However, we may be able to improve cache usage by improving spatial and temporal locality: with more MPI processes, each process works on a smaller subdomain, so a larger share of its data fits in cache. Let's try that by increasing the number of MPI processes to 12.
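In practice, that means updating the decomposition and re-running. A sketch, assuming the case was decomposed as above (for the simple method, the coefficients must multiply to 12, e.g. n (4 3 1)):

$ decomposePar -force    # re-decompose after setting numberOfSubdomains to 12
$ perf-report mpirun -n 12 \
      /home/allinea/OpenFOAM/OpenFOAM-2.3.0/applications/linuxGccx86_64/interFoam -parallel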
Let's have a look at the memory accesses again by profiling the application with Arm Performance Reports.
Time spent in memory accesses has decreased to 38%. This is still high, but it brings a noticeable increase in performance: the execution time has been reduced to approximately 40 seconds from 45. Communication time, however, now represents 28% of the run time, and the overall MPI communications are worse, with less bandwidth and more synchronization. That is to say, some of the time saved by improving cache usage is lost to longer MPI calls and poorer communication!
Even though we were working on one node and at a small scale, we already have a good understanding of OpenFOAM's behaviour: for this data set, the limiting factors are memory bandwidth and communication.
With only two runs of OpenFOAM through Arm Performance Reports, we have been able to understand this key behaviour.
In a future article, we will validate these findings in a multi-node environment. And as we try to scale up, new questions will arise.
Exploring bottlenecks and finding improvements without touching the source code is really easy with Arm Performance Reports. With this tool, you can answer questions such as: Is my application compute-bound, memory-bound or I/O-bound? Is the workload balanced across processes? Is communication time being used efficiently?
The Arm report also forms a reference you can rely on. Hardware faults, software upgrade issues: all of these can impact the profile and the efficiency of your applications. With Arm reports, you can track those problems down and get the best from your cluster in production. Why not take a trial of Arm Performance Reports on your CFD simulations today?