Core counts continue to increase in High-Performance Computing (HPC) systems, but several factors may prevent current software from fully utilizing the additional threads. In particular, inter-thread communication and serialized execution can limit the efficiency gained from higher thread counts. In recent work, published at the International Workshop on OpenMP (IWOMP) 2019, we introduce our methodology for analyzing this behavior and present an inter-cache communication characterization of multi-threaded HPC codes. Our characterization quantifies data sharing between core caches as well as inter-cache communication. We also expose behaviors such as code serialization and false sharing, which may reduce the parallel efficiency of these multi-threaded applications.
OpenMP is a programming interface that manages multiple execution threads within an application process. OpenMP leverages shared memory between threads, which often makes communication between threads implicit. While multiple methodologies exist for analyzing inter-thread communication, these techniques typically ignore whether that communication results in cache-to-cache transfers. We created a methodology for viewing inter-thread communication from a cache coherence perspective, showing the frequency and patterns of communication events between hardware caches when running multi-threaded codes. In our work, we analyze rates and patterns in cache line sharing, producer-consumer cache interactions, and write invalidation events between caches. We characterize HPC proxy applications to reveal insights into the execution behaviors of these multi-threaded codes.
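As a minimal sketch of this implicit communication (our own illustration, not code from the paper), consider the fragment below: one thread fills a shared array, and after a barrier every other thread reads it. No send or receive appears in the source, yet the data moves between private caches through the coherence protocol.

```c
#include <omp.h>
#include <stdio.h>

int main(void) {
    int shared_data[16] = {0};

    #pragma omp parallel
    {
        /* Thread 0 produces: these stores leave the cache lines
         * dirty in its private cache. */
        if (omp_get_thread_num() == 0)
            for (int i = 0; i < 16; i++)
                shared_data[i] = i * i;

        #pragma omp barrier  /* make the writes visible to all threads */

        /* Every thread consumes: these loads may be served by
         * cache-to-cache transfers rather than by memory. */
        int sum = 0;
        for (int i = 0; i < 16; i++)
            sum += shared_data[i];

        printf("thread %d sees sum %d\n", omp_get_thread_num(), sum);
    }
    return 0;
}
```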
Cache line sharing may occur when multiple threads access the same or adjacent data within a short period of time. We examine patterns in cache line sharing to observe how threads actively share data for these proxy applications. For some of these codes, cache lines were primarily shared between neighboring cores. CoMD is an example of this sharing pattern, shown in the figure below.
Each box in the figure corresponds to the frequency with which a core on the X axis shares a cache line with a core on the Y axis. This data assumes an ordered static mapping of threads to cores, and darker boxes in the heat map indicate a higher frequency of cache line sharing between that pair of cores. The diagonal pattern in this figure shows that numerically local threads share data more frequently than threads that are numerically distant. In CoMD, work is statically distributed among threads such that neighboring threads process neighboring data. Each thread’s calculations require it to access neighboring data, which causes cache line sharing among neighboring threads.
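CoMD’s force kernels are more involved, but the mechanism can be illustrated with a simple hypothetical stencil: under a static schedule, each thread owns a contiguous chunk of the array, and the elements at chunk boundaries are read by two numerically adjacent threads, which yields exactly this diagonal sharing pattern.

```c
/* Hypothetical 3-point stencil, not CoMD source: with schedule(static),
 * thread t is assigned a contiguous range of i. Computing out[i] at the
 * edges of that range reads in[] elements owned by threads t-1 and t+1,
 * so the cache lines at chunk boundaries are shared between neighbors. */
void smooth(const double *in, double *out, int n) {
    #pragma omp parallel for schedule(static)
    for (int i = 1; i < n - 1; i++)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0;
}
```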
Producer-consumer interactions are observed at the coherence level when one cache requests a line that is in a dirty state in another core’s private cache. Our coherence-based analysis of communication includes the effects of false sharing and disregards communication that does not result in cache-to-cache transactions. When data is produced long before another core uses that cache line, the corresponding dirty line may be evicted from the producer’s private cache, and the consuming access may instead hit in the last level cache.
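The pattern itself is simple to sketch (again an illustration of ours, not AMG code): one thread repeatedly writes a buffer that another thread then reads, and each read of a line still dirty in the producer’s cache is observed as a cache-to-cache transfer.

```c
#include <omp.h>

#define N 1024
static double buf[N];  /* shared buffer written by one thread, read by another */

void pipeline(int steps) {
    #pragma omp parallel num_threads(2)
    {
        int tid = omp_get_thread_num();
        for (int s = 0; s < steps; s++) {
            if (tid == 0)                     /* producer: lines become dirty */
                for (int i = 0; i < N; i++)
                    buf[i] = s + i;
            #pragma omp barrier               /* hand off to the consumer */
            if (tid == 1) {
                double sum = 0.0;
                for (int i = 0; i < N; i++)   /* consumer: pulls dirty lines */
                    sum += buf[i];
                (void)sum;
            }
            #pragma omp barrier               /* keep iterations from racing */
        }
    }
}
```

The following figure shows a producer-consumer communication pattern for AMG.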
The boxes of communication visible in the figure are caused by false sharing of cache lines. This occurs because each thread accumulates values into adjacent locations, many of which fit within a single cache line. This false communication may degrade performance if the affected cache lines ping-pong unnecessarily between caches.
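The behavior can be reproduced with a classic accumulation pattern (a sketch of the behavior we observed, not AMG’s source): each thread updates its own slot of an array, but several 8-byte slots share one 64-byte cache line, so every update invalidates the other threads’ copies even though no value is logically shared.

```c
#include <omp.h>

/* Each thread accumulates into partial[tid]. Eight adjacent doubles
 * occupy a single 64-byte cache line, so updates from different
 * threads ping-pong that line between their private caches.
 * Assumes at most 64 threads. */
double sum_with_false_sharing(const double *x, long n) {
    double partial[64] = {0};

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        #pragma omp for
        for (long i = 0; i < n; i++)
            partial[tid] += x[i];   /* write invalidates neighbors' copies */
    }

    double total = 0.0;
    for (int t = 0; t < 64; t++)
        total += partial[t];
    return total;
}
```

Padding each slot to a full cache line, or accumulating into a thread-private variable and combining the results at the end (as an OpenMP reduction does), removes the false communication.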
Our producer-consumer analysis also exposed asymmetric communication as well as serialized execution in some of the proxy apps. Asymmetric communication typically occurred due to the order of computation within threads at each timestep. Serialized sections of code may limit the scalability of OpenMP threads, and our analysis exposed serialization in multiple proxy apps, including LULESH, CoMD, SWFFT, and miniVite. Serial sections reduce the efficiency of multi-threaded software because the other threads must wait for these sections to complete. Communication during a serial phase of execution further increases serial execution time by increasing the latency of the serial thread’s memory accesses.
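A sketch of the kind of construct this analysis flags (not lifted from any one proxy app) is a `single` region that gathers every thread’s result: the other threads idle at the implicit barrier while the serial thread pulls dirty lines from each of their caches.

```c
#include <omp.h>
#include <stdio.h>

void timestep(double *per_thread_result, int nthreads) {
    #pragma omp parallel num_threads(nthreads)
    {
        int tid = omp_get_thread_num();
        per_thread_result[tid] = tid * 2.0;  /* parallel phase */
        #pragma omp barrier                  /* all results written */

        #pragma omp single
        {
            /* Serial phase: this one thread reads a dirty line from
             * every other core's cache, so its memory latency grows
             * with the thread count while everyone else waits. */
            double total = 0.0;
            for (int t = 0; t < nthreads; t++)
                total += per_thread_result[t];
            printf("total = %f\n", total);
        } /* implicit barrier: remaining threads wait here */
    }
}
```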
Write invalidations occur when a core writes to a cache line that is held in another core’s cache. This communication may be unnecessary if the writing core only modifies data in the line that the invalidated cache never uses. The following figure shows the rates of write invalidations observed in the proxy apps we studied.
We measure communication events per thousand instructions to capture the rate at which this type of communication occurs. For weak-scaled codes, we increased the problem size proportionally as thread counts increased, while strong-scaled codes used the same problem size at every thread count. The data shows that write invalidation events typically scaled up dramatically in the strongly scaled proxy applications, while they did not necessarily increase in the codes we weak scaled. This is expected: under strong scaling, higher thread counts leave less data per thread, causing a higher rate of cache-to-cache interaction.
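The underlying event is easy to provoke (a hypothetical demonstration, not from the paper): once several threads have read a variable, its cache line sits in a shared state in each of their private caches, and a single write by any one thread invalidates all of the other copies.

```c
#include <omp.h>

int flag = 0;  /* one cache line, read by all threads */

void invalidation_demo(void) {
    #pragma omp parallel
    {
        int seen = flag;             /* line becomes shared in every cache */
        (void)seen;
        #pragma omp barrier
        if (omp_get_thread_num() == 0)
            flag = 1;                /* write invalidates every other copy */
    }
}
```

Under strong scaling, the same total data is spread across more private caches, so each write tends to find more copies to invalidate, which is consistent with the growth we measured.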
We also analyze patterns of write invalidations to gain insights into the proxy apps. As with the producer-consumer pattern analysis, these patterns highlighted serialization in the code, which may harm the efficiency of multi-threaded software.
We developed a tool that enables a coherence-level analysis of multi-threaded software. This analysis yields insights into how data is shared between cores and how that data moves between them. We found that communication between core caches may increase as thread counts increase, especially when a strong scaling strategy is employed. Our tool also exposed serial sections of code by showing asymmetric communication to or from a single thread; eliminating this serialization may increase the scalability of multi-threaded software. The tool can also uncover cases of false sharing, which may result in false communication and hinder performance. Finally, we observed that some communication events between threads may be predictable, suggesting that data movement optimizations could eagerly move data and reduce the latency of future memory accesses. Further insights and analysis can be found in our paper, “Cache Line Sharing and Communication in ECP Proxy Applications”, which appears in the International Workshop on OpenMP (IWOMP) 2019.
We hope to make our code available in the near future so that others can perform this analysis on their own software.
This work was part of a team effort with Alejandro Rico and Jose Joao, carried out in collaboration with Cray and funded in part by the DOE ECP PathForward program. Part of the methodology for this work has been contributed to the DynamoRIO open source project.