Cosmic particles like gamma rays can have a dramatic impact on basic compute, affecting everything from election votes to airplane safety. A small team at the University of Rochester, New York State, has been working with Arm technology to assess the resilience of Neural Processing Units. These underpin much of today’s AI, and the team is working to understand how we can make them safer. Abhishek Tyagi, a PhD student at the University of Rochester, walks us through it.
In 2003, there was an election in Belgium using electronic voting machines. One of the candidates spontaneously received 4,096 votes that could not be accounted for. Observers noticed something special about the number 4,096: it is 2 to the power of 12, a number with an obvious significance in computing.
What had happened? In the number representation being used to store the votes, there was what is known as a bit flip. A particular position in the memory, which had a value of zero, flipped to one instead. This change manifested as 4,096 extra votes [1].
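To make the arithmetic concrete, here is a minimal Python sketch of that failure mode. The vote count is a made-up number, but the bit position is the one implied by the 4,096-vote jump.

```python
# Illustrative sketch only: the vote total below is hypothetical, but the mechanism is the
# one described above. Flipping bit 12 (counting from zero) of a stored integer that
# previously held a zero in that position adds exactly 2**12 = 4,096 to the value.

def flip_bit(value: int, position: int) -> int:
    """Return value with the bit at the given position inverted."""
    return value ^ (1 << position)

stored_votes = 1_435                      # hypothetical genuine vote count
corrupted = flip_bit(stored_votes, 12)    # bit 12 was 0, so the flip adds 4,096

print(corrupted - stored_votes)           # 4096
```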
Bits are the smallest unit of data that a computer can process and store. They are always either ‘on’ or ‘off’. These bit flips can be caused by environmental factors, such as particle strikes from cosmic rays (including gamma rays) as they filter through the atmosphere.
“Investigators discovered a host of corrupted data in the system – and concluded the plane’s nosedives were probably due to soft errors caused by cosmic particles.”
We call these unintentional bit flips ‘soft errors.’ They are a form of silent data corruption, where the fault is not detected anywhere in the system, yet users can observe anomalous behavior.
Another famous example occurred in 2008. A Qantas plane flying over Western Australia suddenly went into two unexplained nosedives, seriously injuring a crew member and some of the passengers. Investigators later discovered a host of corrupted data in the system that led to the plane’s nosedives, and which may have been caused by soft errors from cosmic particle strikes [2].
Soft errors may persist for only a cycle or two, which on modern hardware can mean just nanoseconds. And the likelihood of them happening is very small. The basic storage components in the hardware, known as flip-flops (FFs), have a failure rate defined using a unit called the Failure In Time (FIT) rate. The FIT rate tells us the number of failures per billion hours of operation for a device.
For flip-flops, the most common unit used is FIT/MB, the FIT rate per megabyte of data, and that value is below 50 FIT/MB for modern FFs. Most of these errors will end up being masked before they can affect execution. In memory devices, errors that are not masked can be handled by methods such as Error Correction Codes (ECC). But with the risks attached being so high, the detection of soft errors is an important field to investigate.
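As a rough illustration of what those figures mean in practice, the sketch below turns a FIT/MB value into an expected raw error rate. The chip’s flip-flop capacity is an assumed, illustrative number, not a measured one.

```python
# Back-of-the-envelope sketch: what a FIT rate implies for one device.
# The flip-flop capacity below is an assumed figure for illustration only.

fit_per_mb = 50                 # upper bound quoted above: failures per billion hours, per MB
flip_flop_mb = 4                # assumed amount of flip-flop state in a hypothetical chip
hours_per_year = 24 * 365

failures_per_hour = fit_per_mb * flip_flop_mb / 1e9
failures_per_year = failures_per_hour * hours_per_year

print(f"~{failures_per_year:.4f} raw soft errors per device-year (before masking or ECC)")
```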
In 2021, a small team at the University of Rochester began analyzing an industrial-scale Neural Processing Unit (NPU) to characterize it for soft errors. Would the NPU meet our reliability requirements?
The question of reliability has been an area of study for as long as computer chips have been around. Applying those considerations to NPUs is increasingly important, as NPUs run artificial intelligence (AI) workloads, and AI is now so prevalent in everyday life.
NPUs are made up of various blocks, each responsible for a certain task. Together, these blocks combine to perform one large action. We wanted to figure out which blocks are more susceptible to soft errors, and why. Armed with that information, you can then design the IP so those blocks do not impact the whole in an erroneous manner, making systems more resilient and safer.
There are two approaches that researchers tend to follow in this area. One is academic, the other industrial. And there’s a lot of disparity between them in terms of results.
“Arm’s products are very impactful for this kind of study, because they’re used in almost everything, from phones to refrigerators. We worked on a specific Arm machine learning NPU called Ethos-U55.”
Our work was a true collaboration between both worlds. I had done an internship at Arm, where I first heard about the Academic Access program that gives universities free-of-charge access to Arm’s IP. When I started my PhD at Rochester, we applied and gained institutional access to the Arm Academic Access IP, joining more than 115 other universities and institutes around the world.
Arm’s products are very impactful for this kind of study, because they are used in almost everything, from phones to refrigerators. We worked on a specific Arm machine learning Neural Processing Unit called Ethos-U55.
Our collaborators at Arm gave an important industrial perspective to the work. Academic research is supposed to look 10 to 15 years down the line, so you can end up making a lot of assumptions in order to move the work forward. There is no obligation to share academic paper drafts as part of Arm Academic Access, but when we did, we got some extremely useful feedback. Arm showed us how accurate we were in our assumptions and any limitations we had. And they helped us to understand where this work might be most useful, both now and in the future.
Such insight is always fascinating, and it’s very motivating to a PhD student to know how the work may be used in a product someday.
I have always been interested in this area of NPUs and AI. I come from a computer architecture background, working at Samsung prior to my PhD. But when I joined the PhD program, the reliability aspect was what really caught my attention. It was fascinating to see that very little work was being done to understand the reliability of these chips, even though they are used almost everywhere.
One of the biggest challenges we faced was the enormity of the study. We had to do these bit flips – from one to zero or zero to one – six to seven billion times. That is the largest scale I’m aware of anyone attempting. It took us almost a month just to start the experiment.
We needed an immense amount of computing resources to do this. When I was doing my internship, we were lucky to have such resources available. A tool provided by another collaborator ran the bit flips in the hardware. We had to make sure we configured that tool in the best possible manner to get what we wanted, and even then, we were pushing the limits of its capability.
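The campaign we ran follows the general shape of a fault-injection study, even though the real experiments flipped bits in the hardware state of the NPU using a dedicated tool rather than in a software model. The sketch below is a conceptual, hypothetical illustration of that loop only: the “workload” is a stand-in dot product, not the Ethos-U55 or the actual tool we used.

```python
import random

# Conceptual sketch of a fault-injection campaign (not the actual tool used in the study).
# The tiny "workload" is a hypothetical stand-in for an NPU inference; real campaigns
# inject flips into the hardware state of each block and rerun billions of times.

def workload(weights, activations):
    """Stand-in computation: a dot product, loosely analogous to one NPU output."""
    return sum(w * a for w, a in zip(weights, activations))

def inject_bit_flip(values, bit_width=8):
    """Flip one randomly chosen bit in one randomly chosen value (in place)."""
    idx = random.randrange(len(values))
    values[idx] ^= 1 << random.randrange(bit_width)

weights = [3, 1, 4, 1, 5, 9, 2, 6]
activations = [2, 7, 0, 8, 2, 0, 1, 8]   # zeros let some weight flips be masked
golden = workload(weights, activations)

outcomes = {"masked": 0, "silent data corruption": 0}
for _ in range(10_000):                  # the real study used billions of injections
    faulty = list(weights)
    inject_bit_flip(faulty)
    key = "masked" if workload(faulty, activations) == golden else "silent data corruption"
    outcomes[key] += 1

print(outcomes)
```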
One of our key takeaways was that not every block in the Neural Processing Unit is susceptible to soft errors to the same degree. There are a few you can leave as is. Register units, for example, will work fine in 99.99% of scenarios and don’t need any extra attention.
But there are certain blocks that you should really be focusing on because they could produce errors 30-40% of the time.
For example, one of the main tasks that takes place in machine learning is the multiplication of two numbers. Machine Learning (ML) models, when executed on the Arm IP, do many such multiplication operations. One of those numbers comes from a block called the ‘weight dispatcher’ in Ethos-U55 and is used quite a few times in these operations. So, if that value is wrong, whatever other values are computed using it could be wrong too.
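To illustrate that reuse effect, here is a hypothetical sketch in which one weight value is multiplied against several activations; the numbers are made up, but they show how a single flipped bit in a reused value corrupts every output that depends on it.

```python
# Sketch of error propagation from a single corrupted weight (values are hypothetical).
# In an NPU, one weight fetched by the weight dispatcher can be reused across many
# multiply operations, so one bit flip can corrupt many outputs.

weight = 5
activations = [3, 7, 2, 9, 4, 6, 1, 8]           # the weight is reused against each of these

correct = [weight * a for a in activations]
flipped = weight ^ (1 << 6)                      # a single bit flip turns 5 into 69
corrupted = [flipped * a for a in activations]

wrong = sum(c != r for c, r in zip(corrupted, correct))
print(f"{wrong} of {len(correct)} outputs corrupted by one bit flip")  # 8 of 8
```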
Our work showed that, when the next set of NPUs are being designed, that’s the block that will need the greatest protection. This was fascinating information.
The work on NPU reliability is now complete and has been accepted at the ISPASS 2024 Conference.
And our direction has shifted a little since. We are still interested in reliability but are trying to explore similar questions in optical computing.
The data requirements of AI and machine learning are extremely large, and data centers are consuming megawatts of energy. So, people are exploring alternative ways of doing computation that don’t require as much power. Optical computing is one enticing option. Our work there is in its very early stages, and we don't yet know where it's going to go, but it may prove incredibly important to understand how reliability works there too.
[1] Wikipedia, ‘Electronic voting in Belgium’ (retrieved 25 January 2024).
[2] Medium (2019), ‘Ghosts in the code: the near crash of Qantas flight 72’ (retrieved 25 January 2024).
Arm has an internship program that provides opportunities to work on real world projects in software and hardware engineering, research, data and business roles.