In a previous blog, I wrote about the reproducibility crisis in research, how it is impacting the quality of research and public trust in it, and how the community of knowledge can address this crisis. In this blog, I will focus on measures of performance, and on how confusion about what really matters, combined with agency problems, can lead to considerable inefficiencies and distortions. At the heart of many of these problems is the conflation of outputs, outcomes and impacts. The value pipeline figure below aims to clarify this.
Figure 1. Value Pipeline
Naturally, any project requires a set of "inputs", the resources necessary for the work. These can include funding and raw materials, but also human resources and other forms of capital. The cost of acquiring inputs, at the desired quality, is an important economic factor.
Inputs are subsequently converted, through activity, into "outputs": the direct, measurable deliverables of the project. In the context of research projects, for example, these could be publications or the number of researchers trained. The rate of converting inputs into outputs dictates the efficiency of the project.
From outputs then come "outcomes", which capture the short- and medium-term value obtained from the project. In the context of research projects, this could be successful knowledge and technology transfer. Longer-term value is referred to as "impact", which in the context of research projects could be the wider socio-economic value, e.g. the creation of a self-sustained ecosystem of talent or the wide deployment of a life-enhancing technology. The rate of converting outputs into outcomes and impacts dictates the efficacy of the project.
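To make the pipeline concrete, here is a minimal sketch in Python; the class, its fields and the example figures are all illustrative, not drawn from any real project:

```python
from dataclasses import dataclass

@dataclass
class ValuePipeline:
    """Toy model of the input -> output -> outcome pipeline.
    All fields are illustrative counts; real projects would use
    domain-specific units (funding, publications, transfers, etc.)."""
    inputs: float    # resources consumed (e.g. funding units)
    outputs: float   # direct deliverables (e.g. publications)
    outcomes: float  # short/medium-term value (e.g. technology transfers)

    @property
    def efficiency(self) -> float:
        # Rate of converting inputs into outputs.
        return self.outputs / self.inputs

    @property
    def efficacy(self) -> float:
        # Rate of converting outputs into outcomes.
        return self.outcomes / self.outputs

# A project can look efficient (many outputs per input) yet have low
# efficacy (few outcomes per output) - precisely the conflation at issue.
p = ValuePipeline(inputs=100, outputs=50, outcomes=5)
print(f"efficiency: {p.efficiency:.2f}, efficacy: {p.efficacy:.2f}")
```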
In the remainder of this blog, I will attempt to demonstrate how conflating the above pipeline stages can lead to confusion, inefficiencies and distortions. As in my previous blog, I will do this through examples from three different fields: engineering, medical sciences, and social sciences.
Benchmarking is routinely used to compare the relative performance of different processors. CoreMark [1] is a widely used benchmark for measuring the performance of processors in embedded systems. SPEC CPU benchmarks [2] such as SPECint and SPECfp are used to measure the performance of CPUs in higher-performance systems, e.g. servers. However, nothing prevents anyone from running and publishing the results of any benchmark on any system. So it is possible to run the CoreMark benchmark on a high-performance multi-core processor and show a very high score compared to lower-performance embedded systems. That score could then be used to market the processor to non-discerning users. However, the CoreMark score is just an output; the value derives from outcomes and impacts, which reflect customer or user value within an application area, e.g. low-power embedded systems. The value would typically derive from a Performance-Power-Area (PPA) or Performance-Power-Cost (PPC) sweet spot in an application area, but also from ease of use, programmability, maintainability and security. Seen through this lens, a high CoreMark score on its own, without context, is meaningless.
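A back-of-the-envelope sketch shows why; all chip names and figures below are invented for illustration only:

```python
# Hypothetical, invented figures - for illustration only, not real products.
chips = {
    "high-end multi-core server CPU": {"coremark": 500_000, "watts": 150.0},
    "low-power embedded MCU":         {"coremark": 1_000,   "watts": 0.05},
}

for name, c in chips.items():
    score_per_watt = c["coremark"] / c["watts"]
    print(f"{name}: CoreMark={c['coremark']:,}, "
          f"CoreMark/W={score_per_watt:,.0f}")

# The server CPU "wins" on raw score (500,000 vs 1,000), but the MCU
# delivers far more CoreMark per watt (20,000 vs ~3,333) - the relevant
# sweet spot for a battery-powered device.
```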
It is also worth noting that some of the confusion emanates from what constitutes an embedded system in the first place. Most people would say: an embedded system is a computer system dedicated to a particular function. That is generally the case for highly constrained environments where a computer must perform specific tasks under constraints, e.g. power, area, time, cost. By this definition, mobile phones could qualify as embedded systems - and they certainly did in their early days, when they were used mostly for making calls and sending text messages. However, as mobile phones evolved to integrate more computing power, memory and connectivity, they have become the main computer system used by billions of people around the world, not just to connect with others (via audio, text and video) but also to access various services and to develop applications for personal and enterprise use. Seen from this angle, mobile phones are general-purpose computers, not embedded systems. This argument could go on forever - essentially, it is about the irreducible complexity and instability of human language itself. What is less complex to grasp, however, is the concept of user value, as it can be clearly determined through objective events: purchase and usage.
In medical sciences, large-scale population testing is often used to understand the prevalence of diseases, their causes, ways to prevent their spread and ways to speed up their suppression. However, the number of tests, as important as that is, is merely an output of the testing intervention. The outcome is a better understanding of the underlying disease, with a view to suppressing it in the most efficient and efficacious way possible, e.g. through track and trace or targeted and timely treatment. The long-term impact is to reduce the likelihood of recurrence and to diminish the possible negative consequences of the suppression intervention itself. Indeed, in some cases certain treatments, or non-treatment, can lead to long-term health consequences that are worse than the original disease. Understanding these risks upfront, through a systematic analysis of the above value pipeline, can greatly reduce the actual risk.
Thus, treating the number of medical tests conducted as the "be-all and end-all" is counterproductive: it diverts attention from what really matters (outcomes and impacts) and can easily lead to destructive behaviors aimed at meeting targets at the expense of the desired outcome, e.g. wasting testing capacity. Conversely, fewer tests with superior sampling quality can lead to better outcomes and impacts at lower cost, as the sketch below illustrates.
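The following toy simulation (with entirely made-up prevalence and sample sizes) shows how a smaller, properly randomized sample can estimate disease prevalence more accurately than a much larger sample biased towards symptomatic individuals:

```python
import random

random.seed(42)

TRUE_PREVALENCE = 0.05   # assumed true disease rate, made up for illustration
POPULATION = 1_000_000

# Synthetic population: 1 = infected, 0 = healthy.
population = [1 if random.random() < TRUE_PREVALENCE else 0
              for _ in range(POPULATION)]

def estimate(sample):
    return sum(sample) / len(sample)

# Large but biased sample: testing concentrated on symptomatic people,
# modeled crudely here by deliberately oversampling the infected.
infected = [x for x in population if x == 1]
healthy  = [x for x in population if x == 0]
biased_sample = random.sample(infected, 5_000) + random.sample(healthy, 45_000)

# Smaller but properly randomized sample of the whole population.
random_sample = random.sample(population, 5_000)

print(f"true prevalence:       {TRUE_PREVALENCE:.3f}")
print(f"50k biased tests:      {estimate(biased_sample):.3f}")  # ~0.100
print(f"5k well-sampled tests: {estimate(random_sample):.3f}")  # ~0.050
```

Ten times fewer tests, yet the randomized sample lands near the true prevalence while the biased one overestimates it roughly twofold.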
The third and final example is in the field of economics, namely Gross Domestic Product (GDP). GDP measures the market value of all final goods and services produced within a specific period of time, e.g. a year or a quarter. At the national level, adjusted for population size and the cost of living, GDP per capita at purchasing power parity (PPP) is routinely used to compare standards of living between countries. However, this too is an output measure: it reflects economic activity, not outcomes and impacts. Activity can be good or bad, but outcomes and impacts have to be linked to people's welfare and well-being, the ultimate aim of economic activity.
GDP per capita at PPP is an average and does not account for how wealth is distributed across the population. Nor does it account for the environmental impact of economic activity, the safety and security of citizens, their access to education and health services, or their political freedoms. For these, there are several other measures, e.g. the Gini coefficient for wealth distribution or the Human Development Index (HDI), a composite of life expectancy, education, and per capita income. These measures may or may not accurately reflect the underlying facet of economic activity, but what is certain is that no single measure can fully capture the well-being of people. Furthermore, any measure of output must be linked to the underlying outcomes, impacts or value sought.
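The distribution point is easy to demonstrate. The Gini coefficient has a compact definition based on mean absolute differences; the sketch below (with made-up income data) shows two toy economies with identical averages, and thus identical "GDP per capita", but very different Gini coefficients:

```python
def gini(incomes):
    """Gini coefficient via the mean absolute difference:
    G = sum_i sum_j |x_i - x_j| / (2 * n^2 * mean(x)).
    0 = perfect equality; values near 1 = extreme concentration."""
    n = len(incomes)
    mean = sum(incomes) / n
    diff_sum = sum(abs(xi - xj) for xi in incomes for xj in incomes)
    return diff_sum / (2 * n * n * mean)

# Two toy economies with the same total income (and so the same average):
equal   = [50, 50, 50, 50]   # everyone earns the same
unequal = [5, 5, 10, 180]    # same total, highly concentrated at the top

print(f"equal:   Gini = {gini(equal):.2f}")    # 0.00
print(f"unequal: Gini = {gini(unequal):.2f}")  # ~0.66
```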
There has been growing recognition of the above issues in recent times. The Gross National Wellness (GNW) index, for instance, is an attempt to capture the well-being aim of socio-economic development across economic, environmental, physical, mental, work, social, and political dimensions. Nonetheless, even if this index (or a similar one) were to be adopted widely, it is crucial not to fall into the same trap of concentrating on output measures rather than outcomes and impacts. These cannot be captured solely through metrics, no matter how complex and sophisticated those metrics are.
Quantitative performance measures are valuable tools for assessing the economy, efficiency and efficacy of various human endeavors. However, these tools become dangerous when they are not anchored to the underlying value sought. Indeed, such measures are often abused to portray a simple and definitive conclusion, e.g. that an electronic chip is far superior to alternatives based on a CoreMark figure alone. They can also give the illusion of success, e.g. based on GDP figures or the number of medical tests performed, as explained above. Unfortunately, this is amplified in today's noisy social media environment, where arguments are often won through oversimplified messaging.
The scientific community should resist this oversimplification and bring the debate back to the fundamentals, i.e. outcomes, impacts and the value sought. The additional difficulty of assessing outcomes, impacts and value in general should not be used as an excuse to avoid deeper discussion: why should we expect to resolve a difficult question by answering a different and simpler one? On the contrary, these difficulties should spur all of us to scratch below the surface of headline-grabbing, oversimplified messages and metrics.
[1] https://www.eembc.org/coremark/
[2] https://www.spec.org/benchmarks.html