OpenFOAM, developed by ESI-OpenCFD, is one of the most popular tools for developing CFD (Computational Fluid Dynamics) applications, along with ANSYS Fluent and CD-adapco STAR-CCM+.
Most modules of OpenFOAM are heavily optimized and offer little room for improvement at the code level, but there are still many rewards to be had by making sure that OpenFOAM makes the best use of your system.
Arm Performance Reports looks inside applications and diagnoses how well they are performing and where issues might lie.
In this article, we focus on OpenFOAM's interFoam solver at a small scale, on a single server.
We show how Arm Performance Reports helps to increase the efficiency of our usage and reach the highest level of performance our machine can offer.
We'll assume you have OpenFOAM already up and running, and will take an example from OpenFOAM's tutorials: damBreak.
This example solves a two-dimensional dam-break problem, using the interFoam solver for two incompressible, isothermal and immiscible fluids.
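If you want to follow along, the case can be copied from the tutorials directory and prepared for a parallel run. Here is a minimal sketch, assuming OpenFOAM 2.3.0 conventions; the exact tutorial path and field-initialisation step vary between versions, so check your own installation:

$ cp -r $FOAM_TUTORIALS/multiphase/interFoam/laminar/damBreak .
$ cd damBreak
$ blockMesh       # generate the mesh
$ setFields       # initialise the column of water
$ decomposePar    # split the domain, using system/decomposeParDict (shown later)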
Getting started with Arm Performance Reports is very easy: just add "perf-report" in front of the mpirun command and you are good to go.
$ perf-report mpirun -n 8 \
    /home/allinea/OpenFOAM/OpenFOAM-2.3.0/applications/linuxGccx86_64/interFoam -parallel
NB: if you are trying this for yourself but see an error, you may need to recompile OpenFOAM for profiling first.
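Those steps essentially amount to rebuilding with debug information while keeping optimization, so the profiler can attribute time to code. As a sketch, assuming a 64-bit Linux GCC build (the rules file to edit depends on your platform and compiler):

# in wmake/rules/linux64Gcc/c++Opt, add -g to the optimized flags:
c++OPT = -O3 -g

$ cd $WM_PROJECT_DIR
$ ./Allwmake      # or run wmake in the solver's directory only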
This command runs the application and generates the scientific results you would normally expect, while Arm Performance Reports creates an HTML file and a plain-text file containing profiling information about the run.
Here is the information displayed by Arm Performance Reports for this example, on 8 processes:
At a glance, we have an overview of the application's behaviour across communication, computation and disk access, plus more specific profiling information and hints to help us understand what could be improved.
Although the application is CPU-bound, most of the CPU time is spent on memory accesses. The report also indicates that the code is poorly vectorized, and the time spent in MPI communications does not seem to be used very efficiently either.
There may be room for improvement here, and Arm Performance Reports provides us with several hints:
Communication at over 14% of the run time sounds high for an application running on a single server. Perhaps we should explore the workload distribution, i.e. the domain decomposition. Can we get some hints as to how the mesh is split across the processes?
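As a reminder, the decomposition is configured in system/decomposeParDict. For our 8 processes, a simple geometric split might look like the following; this is an illustrative sketch, not the exact dictionary from our run:

numberOfSubdomains 8;

method          simple;

simpleCoeffs
{
    n           (4 2 1);   // 4 x 2 x 1 = 8 subdomains
    delta       0.001;
}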
A good proxy for data distribution is per-process memory usage, and here Arm Performance Reports suggests a reasonable balance has been achieved:
So, there's very little we can do there this time. Let's try another optimization.
Processor usage, at almost 85% of the run time, is high, which is where we want it to be; but is it good usage?
We can see from the CPU section of the report that a lot of CPU time (59%) is spent in memory accesses. This is very high: it is a sign that we do not have a great memory access pattern, when we would rather be spending that time on floating-point operations. In short, we are suffering from poor cache usage.
We probably cannot change the vectorization achieved (that usually requires source code changes or compiler magic). However, we may be able to improve cache usage by improving spatial and temporal locality: with more MPI processes, each process works on a smaller subdomain, so a larger share of its data fits in cache. Let's try that by increasing the number of MPI processes to 12.
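In practice, that means updating the decomposition and re-running. A sketch, assuming the case was decomposed as above (for the simple method, the coefficients must multiply to 12, e.g. n (4 3 1)):

$ decomposePar -force    # re-decompose after setting numberOfSubdomains to 12
$ perf-report mpirun -n 12 \
      /home/allinea/OpenFOAM/OpenFOAM-2.3.0/applications/linuxGccx86_64/interFoam -parallel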
Let's have a look at the memory accesses again by profiling the application with Arm Performance Reports.
Time spent in memory accesses has decreased to 38%. This is still high, but it brings a noticeable increase in performance: the execution time has been reduced to approximately 40 seconds from 45. Communication time, however, now represents 28% of the run time, and the overall MPI communications are worse, with less bandwidth and more synchronization. That is to say, some of the time saved by improving cache usage is lost to longer MPI calls and poorer communication!
Even though we were working on one node and at a small scale, we already have a good understanding of OpenFOAM's behaviour: for this data set, the limiting factors are memory bandwidth and communication.
With only two runs of OpenFOAM through Arm Performance Reports, we have been able to understand this key behaviour.
In a future article, we will validate these findings in a multi-node environment. And as we try to scale up, new questions will arise.
Exploring bottlenecks and finding improvements without touching the source code is really easy with Arm Performance Reports. With this tool, you can answer questions such as: Is my application compute-bound, memory-bound or I/O-bound? Is the workload balanced across processes? Is communication time being used efficiently?
The Arm report also forms a reference you can rely on. Hardware faults, software upgrade issues: all of these can impact the profile and the efficiency of your applications. With Arm reports, you can track those problems down and get the best from your cluster in production. Why not take a trial of Arm Performance Reports on your CFD simulations today?