Python is pretty commonplace in scientific computing these days. It is easy to code and powerful - but heavy numerical computation is not one of Python's strengths: its interpreter simply can't apply the advanced optimizations to your loops and floating-point operations that a C++ or Fortran compiler can. However, calling native code written in C, C++ or F90 from within Python scripts is easy - and that's why Python combined with C++ or Fortran is such a popular pairing.
One of the most popular methods for Fortran developers is the f2py utility - part of the NumPy package. With f2py, Fortran code can be wrapped and used within a Python script. Indeed, NumPy itself pulls in routines from the BLAS libraries - bringing highly efficient matrix routines into the world of Python.
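To give a flavour of that workflow, here's a minimal sketch of calling an f2py-wrapped routine from Python. It assumes a Fortran subroutine has already been compiled into an extension module with f2py - the module name flaplace and the exact timestep() argument list are placeholders for this illustration, not taken from any particular project:

# Hypothetical sketch: calling an f2py-wrapped Fortran routine from Python.
# Assume the Fortran source was compiled into an extension module with
#   % f2py -c -m flaplace flaplace.f90
# "flaplace" and the timestep() argument list are placeholders here.
import numpy as np
import flaplace                          # the f2py-generated extension module

u = np.zeros((500, 500), dtype=np.float64)
u[0, :] = 1.0                            # a boundary condition on one edge
error = flaplace.timestep(u, 0.1, 0.1)   # the heavy lifting runs in Fortran
print("residual after one step:", error)

From Python's point of view the Fortran routine is just another module function - which is exactly why the debugging and profiling techniques below work unchanged.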
We're regularly asked how to debug or profile codes invoked or steered by Python - so today we're going to show you how it's done. You can still use your Allinea tools to debug your Fortran - even if it's been invoked from Python - and then profile the whole application so that you can improve its performance.
It's worth noting too that the techniques here apply equally to the mpi4py package, with almost all MPI implementations.
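For reference, an mpi4py-steered program looks something like the following minimal sketch - each Python rank does some local numerical work (which could just as well be a call into Fortran) and the results are combined over MPI:

# Minimal mpi4py sketch: per-rank NumPy work combined with an allreduce.
# Debugging and profiling work the same way as in the serial case - the
# native code still runs inside each Python process launched by mpirun.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

local = np.full(1000, float(rank))        # per-rank data
local_sum = float(local.sum())            # local computation (native NumPy)
total = comm.allreduce(local_sum, op=MPI.SUM)

if rank == 0:
    print("global sum:", total)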
So, you think you have a bug in some Fortran/F90 called from inside Python. Let's see how we'll tackle that (C++ users can use exactly the same method).
Arm DDT is a native code debugger - which means it can't debug the Python script itself, but it can debug any native code used by a process. For Python scripts that call F90 or C++ directly, that native code executes within the same Unix process as the Python interpreter - so we start the debugger by having it debug that process. As soon as your native code executes, the debugger will be debugging it.
Let's take inspiration from the Laplace example on the SciPy Performance Python pages - which iteratively solves Laplace's equation on a grid, using various implementations including native code.
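To give a feel for the computation being timed, here is a schematic pure-Python version of the solver's inner kernel - not the exact code from the SciPy page, just the standard Gauss-Seidel update for Laplace's equation on a grid:

# Schematic pure-Python Laplace kernel (illustrative, not the exact code
# from the SciPy Performance Python page): one Gauss-Seidel sweep over the
# interior points, returning the size of the update as a convergence measure.
import numpy as np

def timestep_py(u, dx, dy):
    nx, ny = u.shape
    dx2, dy2 = dx * dx, dy * dy
    err = 0.0
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            previous = u[i, j]
            u[i, j] = ((u[i - 1, j] + u[i + 1, j]) * dy2 +
                       (u[i, j - 1] + u[i, j + 1]) * dx2) / (2.0 * (dx2 + dy2))
            err += (u[i, j] - previous) ** 2
    return np.sqrt(err)

The Fortran versions of this kernel are built as Python extension modules, so we compile them in place first: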
% python setup.py build_ext --inplace
Now we execute inside DDT. Let's set a breakpoint at the function timestep, to stop the process as soon as it reaches the code we are interested in:
% ddt --start --break-at timestep /usr/bin/python laplace.py
When DDT starts you will see a warning that it cannot find the timestep function - that's okay, it just isn't loaded yet - click "OK" to dismiss it. That part of your program is not in the Python process's memory yet, and DDT doesn't know what the Python script will load. You might instead prefer to set a breakpoint in one of your Fortran files: once your process has loaded, open the Project Files window in the DDT GUI, right-click to "Add file" to the project, load the source file, and then right-click on a source line to set the breakpoint.
Let's press the green "Play" button - and the code runs for a few seconds before reaching your Fortran.
That's it - you're now debugging your native code! You can see those Fortran variables and stack traces (including the stack frames of the Python process) - and explore to find that bug!
Profiling is even easier to accomplish. Let's switch from debugging to profiling - select "Session/End Session" in DDT, then pick Arm MAP in the Forge GUI, choose "Profile a program" and click "Run". Alternatively, launch MAP straight from the command line:
% map /usr/bin/python laplace.py
That's it.
You'll now see the code run, and on completion MAP will present the performance profile. There's no need to tell MAP about your source files - but initially it won't have source to show for the top stack frames, as these are actually the Python frames that call your Fortran.
In this first view you can see memory usage creeping up in stages - and that floating-point instructions are intensive for only part of the run.
I'm going to select my favourite metrics (right-click in the timeline for this) - memory usage, floating-point vector instructions and floating-point instructions - and then select the region of the timeline where the floating-point work is happening.
That's interesting - a lot of floating-point instructions but very few vectorized ones! Let's look at the code by going to the Main Thread Stacks view.
Did you also notice the flat-lining of the vectorization? Yes, the original Fortran code has none - although the later F90 and F95 versions, to the right of the selection in the timeline, have plenty (and hence run much faster!).
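The NumPy version in the same example expresses the idea the F90/F95 array-syntax versions exploit: update the whole interior of the grid as array operations rather than an explicit element-by-element loop. Schematically (again a sketch, not the exact code from the page):

# Schematic whole-array (Jacobi-style) update, analogous to the F90/F95
# array-syntax kernels: the grid interior is updated with slices instead of
# explicit loops, which is much friendlier to vector instructions.
import numpy as np

def timestep_numpy(u, dx, dy):
    dx2, dy2 = dx * dx, dy * dy
    old = u.copy()
    u[1:-1, 1:-1] = ((old[:-2, 1:-1] + old[2:, 1:-1]) * dy2 +
                     (old[1:-1, :-2] + old[1:-1, 2:]) * dx2) / (2.0 * (dx2 + dy2))
    return np.sqrt(((u - old) ** 2).sum())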
Why not try your usual optimization techniques on the example code for fun - you now have a profiler that can help!