What do you get when you combine 30 teams, 53 mentors, 5 days, 12 time zones, almost 30,000 AWS Graviton2 vCPUs and a bunch of pizza? You get the first ever AWS, Arm, and Arm HPC Users Group (A-HUG) cloud hackathon. With a global pandemic forcing events to go virtual, we wanted to come up with something big to drive participation and engagement. What we came up with was the A-HUG Hackathon: a cloud hackathon for Arm-based HPC. Teams of up to four were randomly assigned an HPC application and asked to make it run well on Arm. We made sure Spack build recipes were available for each application, and teams had access to Arm Forge (debugger and profiler), the Arm Performance Libraries, compilers from Arm, NVIDIA, and GCC, and a team of HPC experts from Arm, AWS, and NVIDIA.
After five days, one team – DogeCoinToTheMoon – walked away with the first prize of M1 MacBooks for their efforts in getting Tinker to run and scale well on the Arm architecture. But we were so impressed with the efforts of teams Wolfpack and Iman that we ended up awarding them iPads as well.
In total, 31 HPC codes were ported and built for Arm, often with multiple compilers. The results showed that, with a bit of tuning and detective work, many HPC codes can be ported to Arm with performance meeting, and sometimes exceeding, that of x86.
We considered this event a huge success and plan to run it again in the future, perhaps with a vertical application focus (e.g. oil & gas) and with less of a global footprint (lesson learned: supporting teams separated by 17 hours is hard). Our next hackathon is planned for Supercomputing 2021, so please save the date (November 11th) if you would like to participate.
We will also be featuring more content from AWS on running HPC in the cloud on AWS Graviton2 at Arm's upcoming DevSummit 2021. Do not miss 'Disrupting the HPC balance (again) - Stories from the Cloud'.
And if that summary caught your interest and you have a few minutes to spare, please read on for all of the details of this first-of-its-kind event.
One of the biggest challenges for many during the pandemic has been missing in-person events. As a small community distributed across the globe, the HPC community is especially sensitive to this. The challenge of outreach has been very prominent for us these past 18 months.
Pre-pandemic, Arm’s HPC Field Applications Team would host or participate in over 30 events a year – including BoFs, workshops, tutorials, guest lectures, and hackathons. These events centered on the adoption of Arm technology for HPC and scientific computing, with particular attention to ‘upcoming’ technology features (such as SVE).
This has helped Arm build a strong standing within the community and an engaged user base. However, as with everyone else, the pandemic has forced these in-person events to become virtual. The good news is that the show goes on as planned; the bad news is that virtual events tend to lack engagement.
Not wanting to lose two years of engagement, we decided to come up with something big to draw in the Arm HPC community. This blog is an in-depth review of the planning, execution, and results of Arm’s biggest HPC hackathon to date – one we hosted in the cloud with the help of Amazon and the Arm HPC Users Group (A-HUG).
Early in 2021, plans started forming to host a virtual Arm HPC hackathon unlike anything we had attempted before. Normally our events run for 1-3 days, usually at a university or a conference center, and are targeted specifically at attendees of that university or conference. This time, however, we started planning an open-invite event that would last a whole week.
While a daunting undertaking, we were certainly not alone in this endeavor. We wanted to host this event in collaboration with AWS and showcase the AWS Graviton2 processor on HPC workloads (an active topic of research for our team last year [1] [2] [3] [4]). We were also supported by the Arm HPC User Group (A-HUG), comprising a board of industry and academic professionals who support the furthering of Arm in HPC. And another unexpected but greatly appreciated supporter also came forward – NVIDIA.
The sketch of the event looked something like this: a competitive hackathon, aimed at students but open to all, with teams of four porting and optimizing on AWS virtual clusters over a whole week to win M1 MacBooks. The key goal was to make an impact within the whole HPC community – lifting as many HPC codes as possible to work on Arm, and fixing all the bugs along the way.
With that simple yet ambitious plan, we set about filling in the blanks. How many teams would want to participate? Which codes could they work on? How would we score it? How would we handle the logistics – and, most importantly (as we would find out), how would we manage the time zones?
With so much potential, there were many directions we could take this event. However, a week is not long enough to learn all of the HPC and cloud technologies available, so we decided to hide as much of that from the participants as possible and let them focus on applications and science.
What we really wanted as an outcome from the event was a set of HPC applications that we can demonstrate work on Arm, with some understanding of their performance, and potential comparisons (between compilers and between architectures).
From an organizational perspective, we also wanted to ensure that what we were creating was a framework for future events, not just a one-off. Where possible, all components were documented and scripted to be reusable for future events.
As organizers, we curated a list of about 50 mini-apps and 50 full applications and assigned them to teams at random. Then, in accordance with a ‘play-book’, teams were awarded points for different porting, validation, profiling, and optimization tasks.
The team that amassed the most points by the end of the week was rewarded with an M1-based MacBook for each team member.
Validation was placed at the heart of this event, with a stipulation that proof of a working test case is essential for all activities.
To simplify the event, we fixed on two key pieces of software which would become central to the students’ activities and the event as a whole: Spack [5] and ReFrame [6].
Spack is a package manager for installing HPC-focused applications (with over 5,000 application recipes available in its repository), with a focus on software dependency tracking. This allowed students to have consistent software builds with different compilers on both x86 and Arm. We ensured that all 100 of our applications had existing Spack packages, but the status of each on Arm was unknown.
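To give a flavour of what teams were working with, below is a minimal sketch of a Spack recipe. The package name, URL, variant, and dependencies are purely illustrative (not one of the event's applications); real recipes are Python classes in Spack's builtin repository.

```python
# Minimal, hypothetical Spack recipe - illustrative only, not a real package.
from spack import *


class Miniapp(CMakePackage):
    """An illustrative mini-app."""

    homepage = "https://example.com/miniapp"
    url = "https://example.com/miniapp-1.0.tar.gz"

    version("1.0", sha256="...")  # placeholder checksum

    variant("openmp", default=True, description="Enable OpenMP threading")

    # Spack resolves and builds the whole dependency tree, once per
    # compiler, keeping builds consistent across x86 and Arm
    depends_on("mpi")
    depends_on("fftw-api@3")

    def cmake_args(self):
        # Translate the Spack variant into the build system's own option
        return [self.define_from_variant("ENABLE_OPENMP", "openmp")]
```

A team could then build the same package with each toolchain simply by changing the compiler in the spec (for example, %gcc versus %arm).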
ReFrame was utilized to provide a testing framework for the applications: student-defined test scripts validated each build and documented its performance.
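As a sketch of what such a test looks like, here is a minimal ReFrame check in the style of the event's scripts; the binary name, output patterns, and programming-environment names are hypothetical:

```python
import reframe as rfm
import reframe.utility.sanity as sn


@rfm.simple_test
class MiniAppCheck(rfm.RunOnlyRegressionTest):
    def __init__(self):
        self.descr = 'Run a hypothetical mini-app case and validate its output'
        self.valid_systems = ['*']
        # Assumed programming-environment names, one per installed compiler
        self.valid_prog_environs = ['gcc', 'arm', 'nvhpc']
        self.executable = 'miniapp'              # hypothetical Spack-built binary
        self.executable_opts = ['--input', 'case1.in']
        self.num_tasks = 64                      # one full Graviton2 node
        # A run only counts if the application's own correctness check passes
        self.sanity_patterns = sn.assert_found(r'Solution converged', self.stdout)
        # Extract the figure of merit so performance is logged per compiler
        self.perf_patterns = {
            'total_time': sn.extractsingle(r'Total time:\s+(\S+)\s+s',
                                           self.stdout, 1, float)
        }
        self.reference = {'*': {'total_time': (0, None, None, 's')}}
```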
Teams could then pick their strategy for competing – a deep dive into one or two applications, or a light touch across many. From our perspective, we did not mind: this event was all about ‘moving the needle’, and that can be achieved with either strategy.
To aid teams, a GitHub repo [7] was created with templates for each application, allowing them to quickly add the relevant data as they fulfilled their tasks. Teams then submitted pull requests to this repo to upload their data and supporting collateral. Application submissions could then be marked per pull request.
When initially scoping this event, we toyed with the idea of staying small – a few select teams, closely monitored and supported, to ensure that the event format and infrastructure worked. However, in the spirit of openness, we decided to go big. There were no restrictions on entering, but we asked teams to provide an academic or industrial reference to allow us to check their credentials.
We were met with overwhelming interest from the community: 47 teams applied and 30 were selected, representing just over 100 participants. What drew them in was interesting. The chance to win a new M1 MacBook was a big draw, yes, but we also saw a large number of ‘social’ entrants just looking for hands-on experience and to learn more about Arm in HPC.
Figure 1: Entry form motivation question
To support the teams, we put together a host of domain experts, technology experts, and mentors from across Arm, AWS, A-HUG, and NVIDIA. With the addition of the development teams from Spack and ReFrame, we were well equipped – 53 supporting mentors in total.
We also utilized this expanse of industry knowledge to host a lecture series during the week, educating teams on topics relevant to their work. These lectures can all be found on the Arm/AWS HPC Cloud Hackathon YouTube playlist.
As previously mentioned, we wanted to make access as simple as possible and allow for rapid onboarding, without participants having to learn different cloud or HPC technologies. Additionally, we wanted users to have a positive experience with both Arm and the cloud, so we deliberately overprovisioned resources.
Each team was provided with two dedicated virtual clusters hosted on AWS, provisioned using AWS ParallelCluster.
We built the Arm cluster around AWS Graviton2-based C6gn.16xlarge (64-core) instances, and the x86 cluster around Intel Skylake-based C5n.18xlarge (36-core, 72-thread) instances. Each cluster had one login node and 8 compute nodes (dynamically provisioned), with both clusters supporting AWS’s 100 Gbps EFA networking.
The clusters were controlled through the Slurm resource manager and job scheduler, so each team had access to 512 Arm cores for parallel jobs, connected with a high-speed interconnect.
Each cluster also presented shared filesystems across the login and compute nodes, including a 1.2 TB Amazon FSx for Lustre mount for high-performance file I/O.
Figure 2: HPC Cloud Hackathon cluster infrastructure
To hasten adoption, we had pre-installed the three compilers of choice: Arm’s ACfL 21.1, GCC 10.3, and NVIDIA’s NVHPC 21.2 on the Arm clusters, and GCC 10.3 and NVHPC 21.2 on the x86 clusters. Spack and ReFrame were also pre-configured to work with this environment. In all other regards, the x86 and Arm clusters were configured to mimic each other and provide as comparable an environment as possible (Amazon Linux 2 as the OS, with all other packaging provided through Spack).
The clusters were also configured with the Arm Allinea Studio profiler and debugger (Forge), to facilitate the porting and profiling of the codes.
We then provisioned these clusters for the teams, with the goal of one dedicated cluster per team. In total, 63 clusters were provisioned (30 Arm and 33 x86), with a combined total of 29,172 cores available (though most compute nodes only spun up when needed).
These clusters were geo-located with the teams, both to minimize access latency and to reduce the load on any single AWS Region (5 in US-West-2, 30 in US-East-1, 16 in EU-West-1, and 12 in AP-Northeast-1).
Figure 3: Distribution of HPC Hackathon clusters
To support management and oversight, we also deployed a Graylog server to collect all of the run logs from ReFrame. This provided us with a centralized record of all job executions throughout the week. Through mandatory metadata tagging, we were able to analyze which applications were being worked on and the progress being made.
One key consideration for us was to provide the students with a comprehensive HPC development environment, and a key component of this was making the Arm tools available.
We wanted to make sure they had access to common open-source compilers, such as GCC, but also a license to use the Arm compiler, in addition to the Arm Performance Libraries. We also wanted to make the Arm Forge suite available, so that students had access to our world-class debugger and profiler.
The additional benefit of making our tools available to students is the feedback we can gain from their use.
During the event, we saw many instances of the profiler being used to identify performance issues, compare compiler performance, and uncover compilation bugs. Thanks to dedicated Slack channels, all of this discussion could happen directly with the Arm product teams.
Without a doubt, the biggest challenge of the event was managing the time zones. Our participants spanned 12 different time zones, with a 17-hour spread. A bias in the organizing committee towards Europe and the US meant that we were very short-staffed for our teams in Asia and Australia. However, where possible we tried to accommodate everyone – at the expense of sleep for the organizers.
Three kickoff events were hosted on the Monday: 2 AM BST for Asia/Australia, 9 AM BST for Europe, and 5 PM BST for the Americas.
Then, each day, two sync-up meetings were hosted (9 AM BST and 5 PM BST) as catch-up and Q&A sessions, and we co-hosted the talks alongside these sessions. For those unable to attend, all of the talks were recorded and made available to participants.
Communication took place in a dedicated Slack workspace, with dedicated help channels for each technology and private channels for each team. Analyzing the Slack data showed we had between 120 and 150 daily active users, with over 8,000 messages sent over the week.
Daily summaries (mainly curated from the Graylog server) were posted every day on Twitter for those ‘following along’. These received good feedback and engagement, with high impressions (~4k per tweet).
Additionally, to build some community around the event, we hosted a pizza day on the Wednesday of the event. Participants (and mentors) were encouraged to order or make pizza and tweet about it and the event, in return for an Amazon voucher.
Figure 4: Pizza night
Graylog became a critical component for us to monitor and verify the work of the students, and we have also been able to mine its data after the event to look for trends.
The ReFrame integration with Graylog allowed us to automate the logging of every job the students ran. Additionally, we injected a number of routines into their ReFrame scripts to collect extra metadata – such as metadata from Spack about the build properties of an application – allowing us to analyze the results in more depth.
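The exact hooks were event-specific, but as an illustration of the mechanism: because ReFrame tests are Python, run records can be shipped to Graylog in GELF format with a few lines, for example via the graypy library (the endpoint and field values below are hypothetical):

```python
import logging

import graypy  # third-party GELF handlers for Graylog


# Hypothetical Graylog endpoint; the real address was event-specific
handler = graypy.GELFUDPHandler('graylog.example.com', 12201)

logger = logging.getLogger('hackathon')
logger.setLevel(logging.INFO)
logger.addHandler(handler)

# 'extra' fields arrive in Graylog as searchable metadata, which is how
# runs can be broken down by team, application, and compiler
logger.info('job completed', extra={
    'team': 'team01',                   # hypothetical team tag
    'app': 'minigmg',
    'compiler': 'arm@21.1',
    'spack_spec': 'minigmg %arm@21.1',  # illustrative Spack build metadata
})
```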
During the week, we were able to keep a close eye on the Graylog server to see what the students were up to.
This provided us with a continuous source of data for analysis, such as breakdowns per compiler – either total runs, or runs by test case (shown in the following diagram).
Figure 5: Live usage reports from Graylog
Figure 6: Test cases per application on Day 3
Graylog was also a good mechanism for on-the-fly data plotting, to validate the behavior the teams were seeing – such as this simple on-node scaling graph for CoMD, comparing different compilers on the C6g and C5 instances.
Throughout the week, we were continually impressed by the quality of the work coming out of the teams, and by their ability to work with new codes and find and fix bugs as they went.
The net result was a wealth of application build and performance data, allowing us to see what works on Arm and what the comparative performance is between architectures and compilers.
Figure 7: List of ported applications with compilation status per compiler
However, behind each of these applications are multiple test cases (to prove their validity). We observed over 200 test cases, most of which have been uploaded to the event GitHub page for later use.
In addition, teams submitted a total of 115 hot-spot profiles (45 serial, 70 parallel), 127 scaling studies, 12 examples of compiler flag tuning, and 5 maths library analyses (BLAS/LAPACK/FFT).
To maximize the impact of their work, teams were also encouraged to document and upstream their application changes – mainly to Spack, but in some cases by interacting directly with the application owners and communities. This means the benefits of the event can be enjoyed by any future users, something we were very happy to see. These changes often addressed little paper cuts, or added explicit support for the Arm and NVIDIA compilers on Arm hardware.
It is worth remembering that applications do not exist in isolation. Package management is a significant source of complexity in HPC, which is what motivated the choice of Spack to manage the installs.
Figure 8: Spack dependency graph for FleCSPH
To demonstrate this, we can examine the dependency tree of the FleCSPH package (shown in Figure 8). For this one application, a student team needed 34 dependency packages (largely built by Spack) to work before FleCSPH could build and run. Considering that the majority of this dependency tree must be rebuilt for each compiler, this really highlights the achievement of the students during the week.
In total, providing the dependencies for all 31 applications required over 200 dependency packages to be built (often with all three compilers). The sheer density of this dependency graph highlights the value Spack brings to managing HPC software dependencies – and the achievements of the students.
Figure 9: Spack dependency graph for the entire event
One of the best outcomes from the event is all of the pull requests. By fixing software in the open, the next users will have a much easier time. This is how we grow the Arm community and improve the end-user experience.
Figure 10: A selection of pull requests to Spack to fix packages for Arm
To highlight a few of the key achievements, we will now focus on the two highest-scoring applications.
Tinker [8] is a molecular dynamics package with shared-memory parallelism through OpenMP and a threading-enabled FFTW implementation.
The team identified that the Spack build recipe was not enabling OpenMP support correctly, so users were being left with an essentially serial build. After fixing this, test cases can now scale across the node.
With some extra enhancements to the Spack package, it is now also possible to swap in different FFTW implementations, such as the Arm Performance Libraries.
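The actual recipe changes differ in detail, but the mechanism can be sketched as follows: Spack exposes FFT through the virtual fftw-api package, so any provider of that interface can be swapped in at install time. The excerpt below is a hypothetical simplification, not the real Tinker recipe.

```python
# Hypothetical excerpt from a Spack package.py - illustrative only;
# the real Tinker recipe changes differ in detail.
from spack import *


class Tinker(CMakePackage):
    variant('openmp', default=True, description='Build with OpenMP threading')

    # 'fftw-api' is a Spack virtual package: several packages (FFTW,
    # the Arm Performance Libraries, ...) can provide it, so the FFT
    # backend becomes a user choice in the install spec
    depends_on('fftw-api@3')

    def cmake_args(self):
        # Hypothetical option name, shown to illustrate variant wiring
        return [self.define_from_variant('OPENMP', 'openmp')]
```

Depending on which installed package provides fftw-api, the same recipe then links against FFTW or the Arm Performance Libraries without any source changes.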
This enables scaling studies comparing the three different compilers on Arm and two different FFT implementations (shown in Figure 11).
Figure 11: Tinker performance comparison on C6g for GCC, ACfL, and NVHPC using FFTW and ArmPL.
From this scaling study, we can see that Tinker now works well on Arm, showing good scaling behavior up to about 32 cores with all three compilers. While GCC outperforms on a single core, both ACfL and NVHPC shine at larger core counts – most likely attributable to improved OpenMP efficiency. We also note that there is very little performance difference between FFTW3 and ArmPL.
miniGMG [9] is a geometric multi-grid application with hybrid (MPI + OpenMP) parallelism. And while technically superseded, it is still a relevant piece of software.
By profiling the code with the Arm MAP tool, the team saw that the default OpenMP configuration was highly inefficient (shown in Figure 12). By adjusting the build flags of the code, they were able to exploit application features to vastly improve this performance.
Figure 12: MAP profile of miniGMG showing OpenMP overhead
Additionally, the code leveraged Intel-specific intrinsics for enhanced vectorization, and while the compiler was able to vectorize some of the code itself, there was still room for improvement. So the team set about using SIMD Everywhere [10] to replace the x86 headers, controlled through a ‘variant’ in Spack: those building on x86 keep the original headers, while those on Arm can choose between native auto-vectorization and the SIMDe optimizations.
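As a sketch of how a Spack variant can gate that header swap (the team's actual recipe changes may differ):

```python
# Hypothetical sketch of a Spack variant gating the SIMDe header swap;
# the team's actual miniGMG recipe changes may differ.
from spack import *


class Minigmg(MakefilePackage):
    variant('simde', default=False,
            description='Use SIMD Everywhere in place of x86 intrinsics')

    depends_on('simde', when='+simde')

    def setup_build_environment(self, env):
        if self.spec.satisfies('+simde'):
            # SIMDe's native-alias mode maps _mm* intrinsics onto NEON
            env.append_flags(
                'CFLAGS',
                '-DSIMDE_ENABLE_NATIVE_ALIASES '
                '-I{0}'.format(self.spec['simde'].prefix.include))
```

The variant keeps the default x86 build untouched, while `+simde` pulls in the SIMDe dependency and redirects the intrinsics headers at compile time.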
The resulting performance gain was significant (shown in Figure 13), and vastly improves the user experience with the code.
Figure 13: Improved performance on miniGMG on C6gn
1st: DogeCoinToTheMoon
Team of four MSc (CS/Robotics) students at the University of Edinburgh.
2nd: Wolfpack
Team of three students and one professor from North Carolina State University.
3rd: Iman
A one-person team: Iman Hosseini, a PhD student from New York.
Our first joint Arm-AWS-A-HUG cloud hackathon was a smashing success, and the event was very positively received by all parties. While we had expected interest to wane during the week, thanks to the competitive nature of some teams we actually saw interest increase as the week progressed.
We saw broad adoption of all compiler and architecture choices, with positive performance data coming from all of them, showing the value of a diverse toolchain environment. We also observed a number of cases where AWS Graviton2 outperformed the x86 instances.
Speaking to participants, we feel we provided a positive user experience on Arm and in the cloud for those inexperienced with them, and we have received a number of requests for continued access to resources and AWS funding for their projects.
In conclusion, I feel we can confidently say we have achieved our primary objective – to ‘move the needle’ and enhance the user experience of HPC on Arm. Given the success of the event, we plan to host more in the future, though with a more focused application domain (e.g. genomics for HPC).
AWS have also taken the structure of the event and made a DIY tutorial [11] for those wishing to recreate the infrastructure.
We would like to thank our co-hosts AWS and A-HUG, and sponsor NVIDIA, for being gracious with their time and expertise. Hopefully in the not-too-distant future we can all convene under one roof to port more HPC codes to the Arm architecture.
[1] N. Ashton and O. Perks, “OpenFOAM on Amazon EC2 C6g Arm-based Graviton2 Instances – up to 37% better price/performance,” 11 06 2020. [Online]. Available: https://aws.amazon.com/blogs/compute/c6g-openfoam-better-price-performance/.
[2] S. Vadlamani, “Demonstration of low mach-number CFD modeling with Nalu on AWS Graviton2 M6g instances,” Arm HPC Blogs, 25 06 2020. [Online]. Available: https://community.arm.com/developer/tools-software/hpc/b/hpc-blog/posts/low-mach-number-cfd-modeling-with-nalu-on-graviton2-aws-m6g.
[3] F. Dupros, “Seismic Modeling with Arm Neoverse N1 and AWS Graviton2,” Arm HPC Blogs, 20 06 2020. [Online]. Available: https://community.arm.com/developer/tools-software/hpc/b/hpc-blog/posts/seismic-modeling-with-arm-neoverse-n1-and-aws-graviton2.
[4] C. Hillairet, “Assessing Seismic Wave Modelling on AWS Graviton2 with SW4Lite,” Arm HPC Blogs, 09 09 2020. [Online]. Available: https://community.arm.com/developer/tools-software/hpc/b/hpc-blog/posts/assessing-seismic-wave-modelling-on-aws-graviton-2-with-sw4lite.
[5] Spack, [Online]. Available: https://spack.readthedocs.io/en/latest/.
[6] ReFrame, [Online]. Available: https://reframe-hpc.readthedocs.io/en/stable/index.html.
[7] A-HUG, “A-HUG Cloud HPC Hackathon,” [Online]. Available: https://github.com/arm-hpc-user-group/Cloud-HPC-Hackathon-2021.
[8] Tinker, “Tinker Molecular Modeling,” [Online]. Available: https://dasher.wustl.edu/tinker/.
[9] Lawrence Berkeley National Laboratory, “miniGMG,” [Online]. Available: https://crd.lbl.gov/departments/computer-science/PAR/research/previous-projects/miniGMG/.
[10] E. Nemerson, “SIMD Everywhere,” [Online]. Available: https://github.com/simd-everywhere/simde.
[11] AWS, “A-HUG/Spack/ReFrame Cloud Hackathon 2021,” [Online]. Available: https://cloud-hpc-hackathon.workshop.aws/.