The Heterogeneous System Architecture 1.0 specifications have now been released.
Jem Davies has talked about earlier releases of the HSA specification which gave an overview of the Programmer's model, and it's great to see the System Architecture, Programmer's Reference Manual and Runtime specifications finalised and available for download. The HSA System Architecture specifies hardware which fits into modern SoCs and a clean Programming Model for software and compiler design which fits well into modern operating systems. It's well suited to CPUs, GPUs, DSPs and any other devices which support offloading of computation tasks, programmable or otherwise.
For developers, HSA provides a simple and consistent interface to the hardware and then gets out of the way. Moreover, HSA exposes hardware in a standardised way that has support from a large number of companies in both the Desktop and Mobile space. One of the key things this allows for is opening up acceleration API design to more developers; HSA is meant to be built on.
You build APIs on top of it.
There's a lot of active debate about what the right API is for accelerating general purpose programs in heterogeneous systems. It's still a young area, as evidenced by the large number of APIs available today. We're still trying to find the right API (or APIs) to make it easier to speed up programs and make them more power efficient. HSA makes it easier for any person or company to prototype or develop better tools and APIs for heterogeneous systems.
For domain specific issues it also means that the API design on top of HSA can be made to suit the problem at hand, and those who best understand their problem can design the solution.
HSA is a low level interface with a focus on direct hardware support, so it's not typically something you would program to directly if you're writing applications. It's also got a relatively high bar to entry, so if writing your own compiler front-end isn't something you had in mind, you should probably look at APIs built on top of HSA.
The good thing is, there are already API implementations written on top of HSA to get you started. Take a look at the open source OpenCL C compiler and C++ AMP implementations that are already available.
HSA simplifies the programming model by providing mechanisms consistent with those seen for CPU development.
The hope is that HSA can become the basis for innovation in heterogeneous compute APIs. As implementations continue to advance and the overheads of using accelerators continue to fall, it will become easier to offload small pieces of work to accelerators.
HSA allows agents to communicate at the hardware level, without CPU involvement. This means fewer cycles where your CPU needs to be active, and the CPU isn't unnecessarily used as a communication path between two independent units. It also means (with the help of support libraries and OS software) that common formats can be used on multiple devices in the system and no copies are required for reformatting data shared between devices. With the flat address space of the HSAIL model, programmers can target normal CPU algorithms more easily to other devices and not have to specialise algorithms as heavily for features like local scratch memory or segmented addressing.
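To make the "no copies for reformatting" point concrete, here is a minimal C++ sketch that runs entirely on the CPU. It is an analogy, not HSA code: Node, sum_list, flatten and sum_flat are hypothetical names made up for illustration. sum_list walks the host's pointers directly, which is the style a shared flat address space is meant to allow on other agents, while flatten shows the extra repacking pass that is typically needed when a device cannot follow host pointers.

```cpp
#include <vector>

// A pointer-linked structure, as a normal CPU program would build it.
struct Node {
    int value;
    const Node* next;
};

// Traverses the list in place. With a shared flat address space, an agent
// could in principle follow the same pointers the host wrote.
int sum_list(const Node* head) {
    int total = 0;
    for (const Node* n = head; n != nullptr; n = n->next) {
        total += n->value;
    }
    return total;
}

// The repacking step that is needed when pointers cannot be shared:
// every value is copied into a contiguous array before hand-off.
std::vector<int> flatten(const Node* head) {
    std::vector<int> out;
    for (const Node* n = head; n != nullptr; n = n->next) {
        out.push_back(n->value);
    }
    return out;
}

// The algorithm rewritten for the repacked layout.
int sum_flat(const std::vector<int>& values) {
    int total = 0;
    for (int v : values) {
        total += v;
    }
    return total;
}
```

Both paths compute the same result; the difference is that the second one pays for a copy and needs a second version of the algorithm.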
Mobile SoCs are also more constrained by the power and thermal budget available in the small form factors of phones and tablets. As a result, mobile development is more sensitive to overheads from copying or driver validation; HSA avoids this by adding hardware support for coherency that can be easily extended to the non-CPU devices in the system.
Full coherency and a shared address space mean algorithms can be working on different areas of an input buffer without having to carefully partition work statically to cache line boundaries. This allows programs to run across more devices in the system, often at a lower frequency and voltage, resulting in lower total energy for equivalent computations. This can be used to either speed up the computation, or to conserve battery.
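As a rough illustration of the partitioning point, the C++ sketch below compares two ways of dividing a buffer between devices. Range, split_even and split_line_aligned are hypothetical helpers, not part of any HSA API, and the sketch assumes the buffer starts on a cache line boundary. The first split cuts at any element, which is safe when hardware keeps caches coherent; the second rounds each internal boundary up to a whole cache line, as software would need to do if no two devices may write to the same line.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open range [begin, end) of element indices given to one device.
struct Range {
    std::size_t begin;
    std::size_t end;
};

// Split n elements across `parts` devices at arbitrary element boundaries.
std::vector<Range> split_even(std::size_t n, std::size_t parts) {
    assert(parts > 0);
    std::vector<Range> out;
    std::size_t begin = 0;
    for (std::size_t i = 0; i < parts; ++i) {
        // The first (n % parts) ranges take one extra element each.
        std::size_t len = n / parts + (i < n % parts ? 1 : 0);
        out.push_back({begin, begin + len});
        begin += len;
    }
    return out;
}

// Same split, but each internal boundary is rounded up to a multiple of
// `line_elems` (elements per cache line) so no line is shared.
std::vector<Range> split_line_aligned(std::size_t n, std::size_t parts,
                                      std::size_t line_elems) {
    assert(parts > 0 && line_elems > 0);
    std::vector<Range> out = split_even(n, parts);
    std::size_t begin = 0;
    for (std::size_t i = 0; i < parts; ++i) {
        std::size_t end = n;  // the last range always ends at n
        if (i + 1 < parts) {
            end = (out[i].end + line_elems - 1) / line_elems * line_elems;
            if (end > n) end = n;
        }
        out[i] = {begin, end};
        begin = end;
    }
    return out;
}
```

For 100 floats, three devices and 16 floats per 64-byte line, the unconstrained split gives ranges of 34, 33 and 33 elements, while the line-aligned split gives 48, 32 and 20. The alignment constraint alone unbalances the work.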
Having HSA as a low level hardware interface means it is naturally applicable to more problems and opens up the development of higher level APIs which can now accelerate general purpose compute on GPUs, DSPs and other devices. The task based queue interface implemented in user process mapped memory allows for low overhead task dispatch and minimal CPU involvement.
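The idea behind a user-mode queue can be sketched in a few lines of C++. This is a simplified, single-producer model and not the HSA runtime API: Packet, UserQueue, submit, drain and add_one are hypothetical names, the packet layout is invented, and drain is an ordinary function standing in for an agent's packet processor. What it tries to show is that dispatch is just a write into a ring buffer in process memory followed by a doorbell update, with no system call on the submission path.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical task descriptor; real packets have a layout set by the spec.
struct Packet {
    void (*kernel)(void* arg);  // work to run
    void* arg;                  // argument in the shared address space
};

// Hypothetical ring buffer living in ordinary process memory.
class UserQueue {
public:
    explicit UserQueue(std::uint64_t capacity)
        : slots_(capacity), mask_(capacity - 1) {
        // A power-of-two capacity lets indices wrap with a simple mask.
        assert(capacity != 0 && (capacity & mask_) == 0);
    }

    // Producer side: plain loads and stores, no driver call.
    bool submit(const Packet& p) {
        std::uint64_t w = write_index_.load(std::memory_order_relaxed);
        std::uint64_t r = read_index_.load(std::memory_order_acquire);
        if (w - r == slots_.size()) return false;  // queue is full
        slots_[w & mask_] = p;
        write_index_.store(w + 1, std::memory_order_release);
        doorbell_.store(w + 1, std::memory_order_release);  // notify consumer
        return true;
    }

    // Consumer side: runs every packet published via the doorbell and
    // returns how many packets were processed.
    std::uint64_t drain() {
        std::uint64_t done = 0;
        std::uint64_t r = read_index_.load(std::memory_order_relaxed);
        while (r != doorbell_.load(std::memory_order_acquire)) {
            Packet p = slots_[r & mask_];
            p.kernel(p.arg);
            read_index_.store(++r, std::memory_order_release);
            ++done;
        }
        return done;
    }

private:
    std::vector<Packet> slots_;
    std::uint64_t mask_;
    std::atomic<std::uint64_t> write_index_{0};
    std::atomic<std::uint64_t> read_index_{0};
    std::atomic<std::uint64_t> doorbell_{0};
};

// Example kernel: increments the int that arg points to.
inline void add_one(void* arg) { ++*static_cast<int*>(arg); }
```

A caller would construct something like UserQueue q(4), pass Packet{add_one, &counter} to submit, and let the consumer drain it. In a real system the consumer is hardware or firmware running concurrently, which is why the indices and doorbell are atomics here even though this sketch can be exercised on a single thread.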
HSA hardware complements existing standards such as OpenCL, SYCL, C++ AMP and OpenMP 4, with features useful for many of these APIs. HSA isn't a revolution but an evolution and standardisation of key hardware features that open up development to a much larger community.