Triangles Per Second: Performance Metric or Chocolate Teapot?

For the second blog on this subject, see Triangles Per Second 2: A Chocolate Teapot of a Graphics Benchmark.

The practice of characterizing GPU performance in terms of triangle rate (triangles per second) has largely died out in the world of desktop graphics. Unfortunately, it continues to linger in the mobile graphics community. It is time to put a stop to it. There are three things wrong with it: first, it is ill-defined, making it impossible to compare numbers reported by different GPU vendors; second, it forces otherwise honest GPU vendors (not to mention unscrupulous ones) to mislead their customers, by measuring triangle rate under wildly unrealistic conditions; and third, if you do manage to address those problems and measure a well-defined, "honest" triangle rate, it is useless for any conceivable purpose.  In this two-part blog, I'll explain each of these points in detail, with digressions on how GPUs work, engineering trash-talk, and the relevance of chocolate teapots.

Measuring Graphics Performance
We've been thinking a lot lately about graphics performance and how to measure it. Our latest graphics core, the recently announced ARM Mali-T604, raises the bar enormously on graphics performance for mobile devices; but how much does it raise it? The answer, of course, is that it depends on how you measure it. Sadly, a lot of the metrics in common use are dangerously misleading. This has provoked us to write a series of blogs (more are coming) about what various metrics mean and how to interpret them.

My buddy Ed Plowman kicked the series off with an essay on fill rate. Ed's blog got deeply into existentialist philosophy; his point was that in order to use fill rate as a GPU performance metric, you must define it correctly and measure it honestly. (In particular, when you are counting pixels per second, you shouldn't count pixels that don't actually exist.) Here, I'll be taking a swing at triangles per second, aka triangle rate. Unlike fill rate, triangle rate is a topic we can discuss without philosophy, subtlety, or even politeness; instead, we'll be making liberal use of the good old clue-by-four.  This is because, where fill rate is a performance metric worth saving, triangle rate most definitely is not. It is deeply and fundamentally flawed, and cannot be fixed; it should never be used, and its name should never be spoken.  I hate the thought of spending two whole blogs on it, but it is time to drive a stake through the heart of this vampire once and for all.

Did I mention that I feel strongly about this? (And you should have seen this blog before the censors got hold of it. Catch up with me at GDC and I'll tell you how I really feel.) But first, a digression:

A Digression: Engineering Vocabulary
For a Texas-Yankee-Californian like me, one of the minor pleasures of coming to work for ARM has been the opportunity to learn UK engineering slang.  Usually, it's a matter of learning new words for familiar concepts: where a US engineer in a hurry will hack something up, her UK counterpart will bodge it together; where the hacked-up code may be evil or flaky, the result of the bodge-fest may be dodgy; and so on. Those sorts of expressions, I can generally figure out from context.  But the really fun ones are the terms that have no US counterpart, or at least not one I know. Recently, when my boss referred to something as a chocolate teapot, I had to ask what he meant.

A chocolate teapot, it turns out, is the ultimate in uselessness: a profoundly flawed design, whose defining characteristics make it unfit for its intended use.  The metaphor makes sense; it's obvious that if you try to make tea in a chocolate teapot, it will melt, and (boiling water being what it is) you'll have a mess (and maybe a lawsuit) on your hands.

(Actually, there is controversy over whether a teapot made of chocolate is, in fact, a chocolate teapot.  Experiments by Bradshaw et al. seem to support the popular consensus, but other results suggest that the problems can be solved through better engineering.  More research is clearly needed. But I digress.)

The apotheosis of uselessness: a chocolate teapot
Photo via Echostains

The World's Most Useless Graphics Benchmark
As Ed explained in his blog, the problem with fill rate as a benchmark is that you have to define it consistently; if different vendors mean different things by it, you can't compare their results. However, once you do define it properly, you have a number which is somewhat useful.

Like fill rate, triangle rate suffers from the definition problem – in fact it's a lot harder to define (read: easier to confuse people with) than fill rate.  But unlike fill rate, triangle rate is utterly and completely useless; even if you define it consistently and measure it precisely, it doesn't tell you anything you might conceivably want to know. It is a terrible, horrible, no good, very bad way to measure graphics performance. It is a chocolate teapot.

Why Defining Triangle Rate is Hard
The definition problem boils down to this: some triangles are easier to draw than others. In order to compare triangle rates reported by two different GPU vendors, you need to be sure they are drawing similar things. Unfortunately, not only is triangle rate tricky to define, it is also trickier to define than it looks.  Over the years, GPU vendors have found not only the obvious loopholes in a naive definition of the metric, but also many non-obvious ones. This makes it extremely difficult to tell whether rates from two different vendors are in fact comparable, and it means that the vendor quoting the highest triangle rate doesn't always have the fastest GPU.

To get a sense of the possibilities, we need to review briefly how GPUs work. (People who know this stuff can skip ahead.)

Another Digression: How a GPU Draws a Triangle
To draw a triangle in, say, OpenGL ES, an application hands the GPU an array of vertices, a vertex shader, a fragment shader, and a bunch of state. The state is just control information and global data. The vertices are what you think they are: the coordinates of the points defining the corners of the triangle, in some coordinate system. The vertex shader is a tiny program that the GPU executes for each vertex; its job is to read in the coordinates of that vertex and some of the state, and figure out where the vertex will appear on the screen.  The GPU then groups sets of three vertices (now in screen coordinates) into triangles, and decides whether each resulting triangle is visible.  (The triangle may be non-visible for any of several reasons; it may be too small to cover any pixels, or it may be off-screen, or it may be back-facing – meaning that the triangle is part of a surface that is facing away from the viewer, and therefore can't be seen.) If a triangle isn't visible, it is culled, i.e. discarded. If the triangle is visible, the GPU performs triangle setup to compute a convenient mathematical description of the triangle. It uses that description to figure out which screen pixels the triangle overlaps, and generates a fragment (a chunk of data that describes the triangle's impact on the pixel) for each one. The fragment shader is a tiny program that the GPU executes once per fragment; it reads the fragment data and some more of the state, and figures out what color the fragment should have.  The GPU combines each fragment color with the color of the corresponding screen pixel according to rules defined by the state, and the triangle is done.  Whew!
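To make those stages concrete, here is a toy software model of the pipeline just described. It is a sketch, not how any real GPU or OpenGL ES driver works: the 2D coordinates, the `state` dictionary, and every function name are assumptions invented for illustration, and the "blend" step simply overwrites the pixel.

```python
# Toy model of the draw sequence described above. All names are hypothetical.

def vertex_shader(vertex, state):
    """Map a 2D model-space vertex to screen coordinates."""
    x, y = vertex
    sx, sy = state["scale"]
    ox, oy = state["offset"]
    return (x * sx + ox, y * sy + oy)

def signed_area(a, b, c):
    """Twice the signed area of triangle abc; positive means counter-clockwise."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])

def is_visible(tri, width, height):
    """Cull back-facing, zero-area, and fully off-screen triangles."""
    if signed_area(*tri) <= 0:
        return False                      # back-facing or degenerate
    xs = [p[0] for p in tri]
    ys = [p[1] for p in tri]
    if max(xs) < 0 or min(xs) >= width or max(ys) < 0 or min(ys) >= height:
        return False                      # bounding box is entirely off-screen
    return True

def rasterize(tri, width, height):
    """Yield every pixel whose center lies inside the triangle."""
    a, b, c = tri
    for py in range(height):
        for px in range(width):
            p = (px + 0.5, py + 0.5)
            if (signed_area(a, b, p) >= 0 and signed_area(b, c, p) >= 0
                    and signed_area(c, a, p) >= 0):
                yield (px, py)

def fragment_shader(pixel, state):
    """Pick a color for one fragment; here, a constant taken from the state."""
    return state["color"]

def draw(vertices, state, width, height):
    """Shade vertices, assemble triangles, cull, rasterize, shade fragments."""
    shaded = [vertex_shader(v, state) for v in vertices]
    framebuffer = {}
    stats = {"shaded": len(shaded), "culled": 0, "fragments": 0}
    for i in range(0, len(shaded) - 2, 3):
        tri = tuple(shaded[i:i + 3])
        if not is_visible(tri, width, height):
            stats["culled"] += 1
            continue
        for pixel in rasterize(tri, width, height):
            framebuffer[pixel] = fragment_shader(pixel, state)  # overwrite, no blending
            stats["fragments"] += 1
    return framebuffer, stats

state = {"scale": (1, 1), "offset": (0, 0), "color": (255, 0, 0)}
front = [(0, 0), (4, 0), (0, 4)]          # counter-clockwise: drawn
back = [(0, 0), (0, 4), (4, 0)]           # clockwise: culled as back-facing
framebuffer, stats = draw(front + back, state, width=8, height=8)
print(stats)
```

Even in this toy, the two triangles cost very different amounts of work: both have their vertices shaded, but only one goes on to generate fragments.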

Factors Affecting Triangle Rate
Given how complicated drawing a triangle turns out to be, it's pretty obvious that there are a lot of things that might affect how fast a GPU can draw them:

How complicated is the vertex shader? Normally, the vertex shader program converts triangle coordinates into screen coordinates using a bunch of floating point math. It also (normally) computes information about object appearance and lighting and attaches it to the vertices. But it doesn't have to do these things; the application could be submitting the vertices already transformed, with the appearance information already attached.

How many vertices must be shaded per triangle? You might think that for every triangle drawn, the GPU has to shade three vertices. Sometimes it does, but if vertices are shared between multiple triangles, you can usually avoid shading the same vertex twice, using a GPU feature called indexed drawing. In the simplest version, the application supplies a list of vertices and then describes triangles in terms of the positions of their vertices in the list. For example, to draw a square, the application provides vertices [A,B,C,D] and then asks to draw the two triangles defined by the index list [0,1,2,1,3,2]. The first three indices (0,1,2) select vertices [A,B,C], drawing triangle ABC. The second three indices (1,3,2) select vertices [B,D,C], drawing triangle BDC. Presto, we've drawn two triangles but only shaded four vertices. In the limit, a clever application can draw N triangles while only shading about N/2 vertices.

How much does triangle setup cost? On some architectures, the cost of triangle setup depends on how much appearance and lighting information is attached to the vertices.

How many triangles are culled? Triangles that are culled don't have to be set up, and don't generate fragments, so they are cheap to "draw".  Of course, being invisible, they aren't very interesting to look at. It may be worth drawing a certain number of them, though, as I'll explain below.

How many fragments are generated? A single triangle can generate anywhere from zero (if no fragments are visible) to millions of fragments (if the triangle covers the whole screen). The more fragments are generated, the more work the GPU has to do.

How complex is the fragment shader? Fragment shader programs used in modern applications typically do quite a lot of work. A basic shader might figure out the approximate surface orientation by interpolating values stored at the vertices, perturb that by a "bump map" texture that captures surface relief, and then compute a color by evaluating a complex lighting equation. But it doesn't have to.

Unless we know the answers to these questions (and a whole lot more), we don't know what a quoted triangles-per-second (TPS) rate means, and we can't compare it to the TPS rate quoted by another vendor.
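The indexed-drawing question above is the easiest one to put numbers on. The sketch below counts how many vertices need shading with and without indexing, first for the square with index list [0,1,2,1,3,2] and then for a larger grid. It assumes an idealized GPU that never shades the same index twice; the function names and the grid layout are made up for illustration.

```python
def count_shaded_vertices(indices, indexed=True):
    """Vertex-shader runs needed, assuming an ideal cache that never evicts."""
    return len(set(indices)) if indexed else len(indices)

def grid_mesh_indices(cols, rows):
    """Index list for a cols x rows grid of quads, two triangles per quad."""
    indices = []
    stride = cols + 1                     # vertices in each row of the grid
    for r in range(rows):
        for c in range(cols):
            v0 = r * stride + c           # first corner of this quad
            v1 = v0 + 1
            v2 = v0 + stride
            v3 = v2 + 1
            indices += [v0, v1, v2, v1, v3, v2]
    return indices

square = [0, 1, 2, 1, 3, 2]               # triangles ABC and BDC
print(count_shaded_vertices(square, indexed=False))   # one shade per index
print(count_shaded_vertices(square, indexed=True))    # one shade per distinct vertex

big = grid_mesh_indices(100, 100)
triangles = len(big) // 3
vertices = count_shaded_vertices(big)
print(triangles, vertices, triangles / vertices)
```

For the 100 x 100 grid, the ratio of triangles to shaded vertices comes out just under two, which is where the "about N/2 vertices for N triangles" figure comes from. A vendor who benchmarks with one kind of mesh and a vendor who benchmarks with the other are not measuring the same thing.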

Is it fair to count culled triangles? Or, when is a triangle not a triangle?
Earlier I said that we wouldn't need philosophy for this discussion. I lied; we have to consider questions of existentialism and ethics. Specifically, we have to consider whether, for purposes of measuring triangle rate, it is cheating to count triangles that aren't actually visible. It turns out that it is reasonable to assume that some percentage of triangles are back-facing or off-screen.  That is because 3D applications typically do draw a fair number of non-visible triangles. A solid cube has six faces, for example, but at any one time, no matter where you stand, you can never see more than three of them. The others are back-facing and are going to be culled.  You can generalize that idea to spheres, teapots, or flesh-eating zombies – no matter where you stand, you are (typically) only going to be able to see about half of their surfaces. You want to include some culled triangles in your triangle rate measurement, because efficiency of culling has a real effect on GPU performance.
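If you'd like to check the cube claim rather than take my word for it, here is a small sketch. It assumes an axis-aligned cube centered at the origin and treats a face as front-facing when its outward normal points toward the eye; the names are hypothetical, and a real GPU makes this decision from the winding order of the transformed triangles instead.

```python
import random

# Outward normals of the six faces of an axis-aligned cube.
FACE_NORMALS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

def front_facing_count(eye, half_size=1.0):
    """Count the cube faces whose outward normal points toward the eye."""
    count = 0
    for normal in FACE_NORMALS:
        center = tuple(half_size * n for n in normal)
        to_eye = tuple(e - c for e, c in zip(eye, center))
        if sum(n * t for n, t in zip(normal, to_eye)) > 0:
            count += 1
    return count

print(front_facing_count((5, 0, 0)))      # looking straight at one face
print(front_facing_count((5, 5, 0)))      # looking at an edge
print(front_facing_count((5, 5, 5)))      # looking at a corner

# Try many random viewpoints; none should ever see more than three faces.
random.seed(0)
worst = max(
    front_facing_count(tuple(random.uniform(-10, 10) for _ in range(3)))
    for _ in range(10000)
)
print(worst)
```

With two triangles per face, that means at least half of the cube's twelve triangles are culled from any viewpoint, which is why a realistic triangle mix has to include culled triangles.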

"But wait," you say – "Your buddy Ed just got finished telling us that it is cheating to count invisible pixels in your fill rate.  Why are the rules different for triangles?"  There's a subtle but important point here: Applications that draw enough triangles to strain a GPU's triangle-handling capability are typically drawing complex 3D scenes, so they are going to draw a lot of triangles that will end up being culled. If they aren't doing that, they probably aren't drawing enough triangles to worry about.  So, if you care about triangle rate at all, you care about the rate observed when you are drawing a mix of visible and culled triangles.  On the other hand, applications that are fill rate limited are typically doing really simple things, like copying a video frame to the screen, or compositing the desktop in a fancy UI.  In those sorts of applications, every pixel the API draws is visible on the screen, so it doesn't make sense to apply an "overdraw discount" that is never observed in that type of application.

Time for a tea break?
We've seen that the rate at which your GPU can draw triangles depends in a very complicated way on what triangles you ask it to draw. This makes defining an "honest" triangle rate (i.e. one which can be compared across different GPUs) extremely difficult.  In the second half of this blog, coming soon, we'll see how you can use this difficulty to confuse and mislead - and in particular, how you can claim a staggeringly high triangle rate even if your GPU has trouble rendering Angry Birds. We'll also see why, even if you were able to measure an honest triangle rate, you wouldn't want to.

In the meantime, got questions? Want to correct my gross over-simplifications? Let me know...

  • Excellent Article Tom...!!! Thank you..
  • I see a similarity here with CPU vendors quoting MIPS. The numbers are plain bogus (or bogoMIPS) unless you also fix the software program used to calculate them. Once you fix the software, the numbers start to make some sense and allow vendor comparison. Going further, the nature of the software starts to make the difference. Rather than giving a single number as output, if the benchmark stresses each type of loading factor and quotes them separately, then the numbers make real sense: apples-to-apples comparisons.
  • You're right, a triangle metric has to deal with triangle culling, and that's probably the only information it could measure.
    A relevant triangle rate metric could be the triangle rate achieved for a Z-only rendering pass, which is probably the only real situation in which the raw triangle rate of the GPU could be the bottleneck. Z-only rendering is widely used in current-generation game engines, and is also probably the condition used for most of the geometry. If we consider a typical engine using a Z prepass to set up the Z-Stencil buffer/HiZ buffer and send occlusion queries, and using cascaded shadow mapping with 3 or 4 viewports, probably around 75% or more of the draw calls for the full frame are done during Z-only rendering. For Z-only rendering, the fragment shader does absolutely nothing since it does not output any data, and GPUs usually have optimisations that double the fill rate for Z-only rendering. The vertex shader is very simple for most of the meshes (4 instructions for transform + 1 for unpack scale), and the post-transform cache hit rate is very high (up to 85% is achievable) since only the position attribute of the vertex is relevant. Also, the position data in the vertex stream could be packed to use as few as 6 bytes per vertex, and the pre-transform cache hit rate is very high too, so reading vertices from memory is not a bottleneck. In these conditions, culling of back-facing and zero-area triangles, but also the way the GPU retires vertex data from the post-transform cache, could be a bottleneck.
    And if your GPU is too slow for Z only rendering, you can remove all the triangles that do not contribute to the final image at runtime using the CPU, and that's the beginning of a nightmare...
  • It's mentioned in the article that a clever application can draw N triangles while only shading about N/2 vertices. In the example mentioned, 4 vertices are shaded to draw 2 triangles. I believe it should be N+2, considering triangle strips or fans. Please explain otherwise.
    Bujji00,
    If you just send in the vertex coordinates organized into triangle strips or fans, you indeed only get N triangles out of N+2 vertices, just as you say. But you can do better with indexing. When you are using indexing, it actually doesn't matter (much) whether you use strips and fans, or just draw individual triangles. A typical GPU will store transformed vertices in a post-transform cache, using the index as the cache tag. When the index list contains, say, index k, the GPU will first look in the cache and see if the kth vertex is already there. If it is, it doesn't have to run the vertex shader; it can just use the result from the cache. If the kth vertex isn't found, then it will read in the vertex coordinates, run the vertex shader, and cache the output. That's why, in my example, we sent in six indices ([0,1,2,1,3,2]), drawing two triangles, but expect to shade only four vertices.
    Because of these caches, what matters is the total number of vertices in the vertex list compared to the total number of triangles specified by the index list. In my example, we had four vertices for two triangles, but that was just a toy example. Take a look at the illustration at http://en.wikipedia....i/Triangle_mesh, which is much more like something you'd see in a real application. If you look closely, you'll see that most vertices are shared between six neighboring triangles; this is typical for real 3D content. If three vertices participate in each triangle, but each vertex participates in six triangles, it must be the case that the number of vertices is half the number of triangles: N/2 vertices for N triangles, as I said.
    Another way to think about it is this: Suppose I've got the world's simplest mesh, with just one triangle (and three vertices). Add a new vertex in the center, and connect it to each of the original vertices, splitting the original triangle into three triangles. I've now got four vertices and three triangles: I've added one vertex, but increased the triangle count by two. Now pick one of the three triangles at random, and repeat the process to get five vertices and five triangles. Do it again and again, as many times as you like; every time you do it, you add one vertex and two triangles, so if you do it a few hundred times, you'll end up with about twice as many triangles as you have vertices. (You can prove it mathematically using limits, but I can't figure out how to write equations in this silly web interface...)
    Of course, in real life, the GPU post-transform cache will have finite size, so you will occasionally have to run the vertex shader again on a vertex you've already transformed, and you probably won't quite get two triangles per shaded vertex. You can improve your chances of success by sorting the index list so that occurrences of a given index tend to happen near each other in the list; that will make the cache work better and save the GPU work. There are lots of algorithms and software tools for this out there. If you search for "mesh optimization" on the web, you'll find lots of good stuff.
    regards,
    --Tom
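Both halves of that reply can be sketched in a few lines. The first function models a post-transform cache tagged by vertex index; the FIFO replacement policy and the cache sizes are assumptions chosen for simplicity, not a description of any particular GPU. The second function just does the bookkeeping for the "add a vertex in the middle of a triangle" argument.

```python
from collections import deque

def shader_runs(indices, cache_size):
    """Count vertex-shader runs with a FIFO post-transform cache tagged by index."""
    cache = deque(maxlen=cache_size)      # oldest entry drops out when full
    runs = 0
    for index in indices:
        if index not in cache:            # cache miss: shade and remember
            runs += 1
            cache.append(index)
    return runs

def subdivide_counts(steps):
    """Vertex and triangle counts after splitting one triangle `steps` times."""
    vertices, triangles = 3, 1            # start from a single triangle
    for _ in range(steps):
        vertices += 1                     # new vertex inside a triangle
        triangles += 2                    # one triangle becomes three
    return vertices, triangles

square = [0, 1, 2, 1, 3, 2]
print(shader_runs(square, cache_size=4))  # cache holds every vertex
print(shader_runs(square, cache_size=1))  # cache too small to help

v, t = subdivide_counts(300)
print(v, t, t / v)
```

With a cache big enough for the whole square, only four vertices are shaded; with a one-entry cache, all six indices miss. After 300 subdivisions the mesh has 303 vertices and 601 triangles, so the ratio is already close to two.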
Graphics & Multimedia blog