The modern web is built primarily from three technologies: HTML, CSS and JavaScript. It is JavaScript that drives the interactive web; slow JavaScript means slow web pages. So today, a huge amount of effort is being put into improving the performance of JavaScript, giving us access to powerful web applications that offer desktop-class features wherever you are.
Web applications like Gmail, Google Maps and Google Docs use JavaScript extensively, and the user experience is greatly improved on systems with fast, efficient JavaScript engines. In 2008, this motivated Google to create the V8 JavaScript engine project.
V8 is now, on modern benchmarks, the fastest JavaScript engine available. Rather than interpreting JavaScript as older engines did, V8 uses a Just-In-Time compiler to produce and execute native instructions tailored to the processor on which it is running. The generated instructions are cached, avoiding the overhead of repeated code generation, and deleted when no longer needed.
V8 is now the core technology in a number of important applications. It is the JavaScript engine in Google's super-fast Chrome browser and in the Android mobile OS. It is used in HP's mobile OS, webOS. And it is at the heart of cool new server applications built on the node.js framework.
The web is increasingly mobile. With iPhones, Android phones, tablets and other devices, we can leave our desktops behind. Powerful web applications running on fast JavaScript engines let us cut the ties with the desk, and work or play on the move. It is therefore essential that JavaScript is quick on mobile devices, which means quick on ARM.
Google's V8 engine is an open source project, driven by the contributions of hundreds of coders. Its development is rapid, with features added and performance improved every day. Over the last year, ARM has been contributing to this effort, helping to make V8 on ARM super fast.
ARM has pushed many large and small patches to the V8 project. Here are some of the more interesting changes.
The return stack has been a part of ARM processors since the ARM11. It is a small stack of addresses and ARM/Thumb state information used to accelerate returning from function calls. It works by pushing addresses onto the stack when a function call is recognised, and popping them off again on return from the function. It saves valuable cycles when code calls lots of functions.
However, only certain instructions activate the return stack's push and pop behaviour, and these are listed in the processor's Technical Reference Manual. For example, on Cortex-A9, the following are recognised as calls and returns in ARM and Thumb state:
BL
BLX
BX lr
MOV pc, lr
LDM sp, {... pc}
LDR pc, [sp]
Only these instructions cause the return stack to be used; mismatched call and return sequences defeat its prediction. ARM's first patch committed to V8 made the generated call and return instructions consistent, giving a big performance boost.
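To illustrate, here is a minimal sketch in GNU assembler syntax, not code taken from V8 itself: the caller uses BL, which the processor recognises as a call, and the function returns with BX lr, which it recognises as the matching return.

        BL      double_it       @ recognised call: pushes the return
                                @ address onto the return stack
        ...
double_it:
        ADD     r0, r0, r0      @ trivial function body
        BX      lr              @ recognised return: pops the predicted
                                @ address, so fetch continues without a stall

Had the function returned with, say, MOV pc, r2, the processor would not recognise the instruction as a return; the entry pushed by BL would never be popped, and subsequent return predictions would be wrong.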
Modern ARM cores provide hardware support for floating point operations in two ways: the VFP floating point unit, and the NEON advanced SIMD extension.
JavaScript's native numeric type is double precision floating point. So, where V8 cannot optimize operations to use integers, the natural choice is to use VFP to speed up calculations. But to do this efficiently, V8 has to support VFP code generation directly, rather than suffer the costly overhead of repeated calls into library code.
ARM has provided a number of patches to broaden the use of VFP in V8, such as adding some of the new features found in VFPv3, and adding support for these new features in V8's built-in ARM simulator.
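One example of such a feature is VFPv3's ability to encode common floating point constants directly in a VMOV instruction, avoiding a load from a literal pool. A minimal sketch in GNU assembler syntax, not V8's actual generated code:

        @ VFPv2: the constant has to be fetched from memory.
        VLDR    d0, one
        ...
one:
        .double 1.0

        @ VFPv3: the constant is encoded in the instruction itself.
        VMOV.F64 d0, #1.0

Only a small set of values fits the instruction's 8-bit immediate encoding, but constants such as 1.0, 0.5 and 2.0 occur frequently in generated code.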
ARM architecture version 7 introduced new instructions to manipulate bitfields, useful when operating on space-efficient packed data structures.
UBFX
SBFX
BFI
BFC
These operations would previously have been implemented using masking (BIC) and bitwise-or (ORR), so one bitfield instruction can often replace two or three traditional instructions. As less code is required to achieve the same effect, the processor's instruction cache is used more efficiently.
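For example, extracting or inserting an 8-bit field takes a single instruction on ARMv7 where older code needed two. A minimal sketch in GNU assembler syntax, not code generated by V8:

        @ Extract bits [11:4] of r1 into r0.
        LSR     r0, r1, #4          @ pre-v7: shift the field down...
        AND     r0, r0, #0xFF       @ ...then mask off the upper bits
        UBFX    r0, r1, #4, #8      @ v7: extract 8 bits starting at bit 4

        @ Insert the low 8 bits of r2 into bits [11:4] of r0
        @ (the traditional sequence assumes the upper bits of r2 are clear).
        BIC     r0, r0, #0xFF0      @ pre-v7: clear the destination field...
        ORR     r0, r0, r2, LSL #4  @ ...then OR in the shifted value
        BFI     r0, r2, #4, #8      @ v7: insert 8 bits at bit 4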
There is a further benefit. Developing a JIT requires balancing the amount and quality of generated code against the time taken to generate it. Users experience this time as an annoying latency: the delay between loading a web page and being able to use it. The bitfield instructions make a small contribution towards reducing that latency, by allowing the JIT to generate less code for the same operation.
At the end of 2010, Google introduced a new technology to V8, called Crankshaft.
It pairs a fast, simple compiler with a slower, profile-guided optimizing compiler: the simple compiler gets code running quickly, while the optimizer recompiles the hottest functions using profiling data gathered at run time. We have contributed a number of patches that helped to complete support for Crankshaft on ARM, and in March 2011, Crankshaft became the default code generator in V8. It gives a huge performance boost on many benchmarks.
Crankshaft needs a modern processor, which for ARM means architecture version 7 with VFP support; in practice, an ARM Cortex-A class processor is required.
The many contributions of the V8 coders, including the patches provided by ARM, have resulted in huge performance gains. We have benchmarked the latest development version of the V8 engine on an ARM Cortex-A9 system, and compared the results to those produced by the V8 engine from a year ago. The results are striking.
The V8 benchmark suite (version 6) contains seven benchmarks that are used to tune the V8 engine, including ray tracing, regular expression matching, cryptography and OS simulation tests. On the same hardware, performance has increased by up to 500%.
Other benchmarks tell a similar story. Sunspider, a suite containing a set of very simple operations, runs over 50% faster than it did a year ago.
Sunspider was designed before the creation of modern, high-performance JavaScript engines, and it is often difficult to make performance gains here that are relevant to today's JavaScript-heavy web applications.
Kraken is a recent benchmark from Mozilla that focuses on the more iterative tasks that you would encounter in real web applications, using workloads much larger than those present in Sunspider; in terms of execution time, Kraken is approximately 20 times larger than Sunspider.
V8 on ARM has also seen an impressive performance gain here. The benchmark is over four times faster on today's engine, compared to that from a year ago. Crankshaft is particularly important in delivering this result, as it is most suited to the tight, iterative loops seen in the Kraken suite.
It takes a few months of work for Google to integrate and test the latest V8 engine with new devices, so you will not be able to see these performance improvements appearing in products until the second half of 2011. But the V8 developers continue to increase the speed of the engine, so you can expect even higher performance in 2012.
Further out, you will see the introduction of devices based on the latest ARM core, Cortex-A15, with advanced features that will push the speed of JavaScript to new heights.
ARM will continue to contribute to the V8 project, with both optimizations and support for new processors. However, as V8 is an open source project, good patches are welcome from any interested ARM coders. So, if you want to be part of the evolution of the web in mobile devices, check out the code from the Google repository and start hacking!