In the first post in a three-part series, Greg Yeric, Arm Research Fellow, shares his thoughts on three-dimensional integrated circuits and the potential impact of this technology across a variety of areas.
You may have become numb to the constant mentions of 3D integrated circuits these days, but there is exciting potential for our industry here. In this three-part blog, I will focus on three distinct 'dimensions' of this potential, which roughly correspond to: today, the near future, and farther out into the future.
We’ll need a bit of level-setting, with a common nomenclature. For that, I’m going to follow the terms published by Eric Beyne of IMEC, in Figure 1 below. As you move from left to right in this figure, “what is stacked” (or “partitioned”) becomes finer, with assumed advances in 3DIC interconnect density. All technologies in production today fall into the left-most 3D-SIC “Stacked IC” category, which I'll cover in this part. In parts II and III, we'll move along toward the right to talk about the potential for the future with 3D-SOC and 3D-IC.
Figure 1: Overview and taxonomy of evolution from 3D-SIC to 3D-IC. From: Eric Beyne, “The 3-D Interconnect Technology Landscape”, IEEE Design and Test, May/June 2016, pp. 8-20 (reproduced with permission)
3D Stacked IC includes a variety of technologies, but they all utilize semiconductor technology in some way that improves die-to-die interconnect beyond what conventional packaging can provide. This definition of 3D-SIC helps us differentiate older Multi-Chip Module (MCM) technologies, which have been with us for decades, from modern 3D-SIC technologies.
These 'modern MCM' embodiments achieve much higher interconnect densities through a defining technology: the Through-Silicon Via (TSV). While the concept of the TSV has been around since before Moore’s Law, the actual term wasn’t coined until 2000, around the time that semiconductor applications such as DRAM and CMOS Image Sensors (CIS) were driving the key technologies that made practical TSVs possible, including the etching and filling of high-aspect-ratio holes as well as wafer thinning and precision die-to-wafer or wafer-to-wafer alignment. Then, around 2011, 3D-SIC started to really take off.
Figure 2: Virtex-7 2000T with 3D packaging
That year, Xilinx brought 3D packaging into its Virtex products. The packaging technology they used was not truly 3-dimensional, in that the die were not stacked on top of one another; instead, 28nm die were placed side by side on an “interposer” built from a 65nm wafer with a few RDL wiring layers that were large by wafer standards but small by packaging standards (Figure 2). An interposer uses the RDL along with TSVs to transition from high-density chip-level interconnect to package-level interconnect density. In this case, the interposer provided Xilinx with 100x the die-to-die bandwidth per watt and one-fifth the latency of connecting 4 discrete die together through conventional packaging and the associated board-level interconnect. This so-called “2.5D” interposer method provided a significant benefit while avoiding the much more complicated design and manufacturing issues that would have been involved in driving TSVs through tiers of chips in “real” 3D-SIC.
True 3D-SIC was pioneered a few years later in a controlled design environment: stacked DRAM. Existing memory interfaces were becoming hamstrung by traditional packaging interconnect density, eventually pushing the limits of packaged pin count and frequency (some DDR5 proposals ran 512 channels at 1.7GHz), resulting in an untenable power/performance roadmap. By stacking DRAM slices in true 3D form, and using TSV technology to enable much “wider” I/O, Micron (Hybrid Memory Cube, 2013) and SK Hynix (High Bandwidth Memory, 2015) began shipping products that could vastly outperform conventional packaging’s access to DRAM (see the comparison in Figure 3).
Figure 3: Comparison of conventional and 3D-SIC style DRAM products (Based on: http://www.amd.com/Documents/High-Bandwidth-Memory-HBM.pdf)
These 3D-SIC DRAM products were designed with interposer-based access in mind, so you need to commit to the “2.5D” technology as well in order to realize the full benefit. In the last couple of years, we have seen impressive interposer-based “GPU + 3D-SIC DRAM” systems from Nvidia and AMD, both demonstrating a 3-4x improvement in bandwidth per watt compared to a conventional solution (Figure 4). Not only is there more interconnect, but the distance to the processor is dramatically reduced, as illustrated by AMD in Figure 5. AMD calculates that a 2GB memory system can realize a 94% reduction in board area going from 2D to “2.5D” HBM.
Figure 4: GPU + HBM drawing (Based on http://www.amd.com/en-gb/innovations/software-technologies/hbm)
Figure 5: AMD GPU with HBM via “2.5D” packaging From: IEDM 2017 paper 1.1 “Multi-Chip Technologies to Unleash Computing Performance Gains over the Next Decade” (Reproduced with permission)
Note the reference for Figure 5 above. The International Electron Devices Meeting (IEDM) has existed since the late 1950s, and is the pre-eminent conference for disclosing the transistor (and other electron device) advancements that have fueled progress throughout the Moore’s Law era. That pedigree is what makes this paper particularly significant. This year, in one of the invited keynote talks discussing “Computing Performance Gains over the Next Decade”, AMD did not talk about the next great FinFET or nanowire transistor. Instead, they discussed how they used 3D-SIC to achieve the GPU power efficiency gains described above. Then, they discussed how they split their 32-core EPYC server-class chip into four 8-core “chiplets” (Figure 6). Even with a 10% area overhead to add I/O to the 4 chiplets, AMD was able to reduce overall cost, owing to basic die yield: chip yield falls off very quickly with chip area (somewhere between a square law and an exponential), so if you can test for known-good chiplets, you can come out way ahead in simple cost for chips this size. In AMD’s case, they cite a 41% cost saving.
Figure 6: AMD EPYC partitioned into four 8-core chiplets From: IEDM 2017 paper 1.1 “Multi-Chip Technologies to Unleash Computing Performance Gains over the Next Decade” (Reproduced with permission)
This cost claim by AMD lines up pretty well with simple yield modeling, as shown in Figure 7 below. The D=0.22/cm2 curve (the defect density commonly used in ITRS yield calculations) comes in at around the benefit claimed by AMD. The yield benefit of breaking the gigantic 777mm2 chip into 4 chiplets is shown in the top green line: one large chip would yield 26%, whereas 4 “chiplets” would yield 59% (a 2.26x increase, if you could perfectly sort for known-good die). The additional D=1.0/cm2 curve probably lines up better with an FPGA market that accesses very early silicon, and you can see that the cost benefit an early adopter of a new process technology might obtain could be even larger than the number claimed by AMD.
Figure 7: Die yield as a function of chip area and wafer defectivity D Yield model: Bose-Einstein, defect clustering parameter=2, systematic yield limit 90%
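The 26%/59% figures are easy to sanity-check in a few lines. Below is a minimal sketch of the yield model named in the Figure 7 caption (a Bose-Einstein/negative-binomial-style form with clustering parameter 2 and a 90% systematic yield limit); the exact functional form is my assumption, chosen because it reproduces the numbers quoted above.

```python
Y_SYS = 0.90   # systematic yield limit (from the Figure 7 caption)
ALPHA = 2.0    # defect clustering parameter (from the Figure 7 caption)

def die_yield(area_cm2, d0_per_cm2):
    """Fraction of good die for a given die area and defect density D0."""
    return Y_SYS * (1.0 + area_cm2 * d0_per_cm2 / ALPHA) ** (-ALPHA)

D0 = 0.22                                  # defects/cm2, the ITRS value
mono = die_yield(7.77, D0)                 # one 777mm2 die
chiplet = die_yield(7.77 * 1.10 / 4, D0)   # four chiplets, +10% total area

print(f"monolithic yield: {mono:.0%}")     # 26%
print(f"chiplet yield:    {chiplet:.0%}")  # 59%
print(f"improvement:      {chiplet / mono:.2f}x")  # 2.26x
```

Note that this improvement ratio assumes perfect known-good-die sorting; real test escapes would eat into it.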
Oh, and that 10% chip area overhead from Figure 6? Almost all of it disappears, because you can fit about 9% more of the smaller die onto a wafer than the larger die: more fit around the edges.
Figure 8: Die per wafer for one large die vs. 4 smaller die that add 10% area overhead (Calculation: 300mm wafer, 0.06mm spacing, 3mm edge)
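The edge effect can be estimated with the classic die-per-wafer approximation (usable wafer area over die area, minus a term for partial die lost around the circumference). This is a sketch under the Figure 8 assumptions of a 300mm wafer, 0.06mm spacing, and 3mm edge exclusion, with square die assumed for simplicity; Figure 8 presumably uses an explicit grid layout, so the numbers differ slightly from the 9% in the text.

```python
import math

def die_per_wafer(die_area_mm2, wafer_mm=300.0, edge_mm=3.0, scribe_mm=0.06):
    """Classic approximation: (usable area / die area) minus an
    edge-loss term proportional to the wafer circumference."""
    side = math.sqrt(die_area_mm2) + scribe_mm  # square die + spacing
    area = side * side
    d = wafer_mm - 2.0 * edge_mm                # usable diameter
    return int(d * math.pi * (d / (4.0 * area) - 1.0 / math.sqrt(2.0 * area)))

big = die_per_wafer(777.0)               # one large die per chip
small = die_per_wafer(777.0 * 1.10 / 4)  # four chiplets, +10% total area

# Chiplets fit roughly 7-9% more usable silicon per wafer:
print(big, small, small / 4.0 / big)
```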
So, yes, it appears that yet another nugget in Gordon Moore’s landmark vision from 1965 is becoming true:
“It may prove to be more economical to build large systems out of smaller functions, which are separately packaged and interconnected.” - Gordon Moore, 1965, in his original Moore’s Law paper.
At least, it’s being shown in the market for the largest die possible. The AMD and Xilinx examples both push toward the limit of around 8 cm2 (the maximum field size of today’s steppers is around 26mm x 33mm). If you zoom the Figure 7 plot in to smaller die sizes, the yield-based cost benefit is much less: partitioning a sub-1 cm2 chip might yield a 20% cost benefit (in line with a much more detailed look at the problem given by U.C. Santa Barbara and AMD here).
Figure 9: The yield curves of Figure 7, zoomed in to smaller die sizes
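To see why the benefit shrinks, we can fold the yield model from the Figure 7 caption into a cost-per-good-die comparison, taking silicon cost as die area divided by yield and ignoring packaging and test costs. This is my own back-of-envelope sketch, not the UCSB/AMD model, so the absolute percentages will differ from the 20% cited; what it shows is the trend.

```python
def die_yield(area_cm2, d0):
    # Yield model from the Figure 7 caption: Bose-Einstein style,
    # clustering parameter 2, 90% systematic yield limit.
    return 0.90 * (1.0 + area_cm2 * d0 / 2.0) ** -2.0

def partition_saving(area_cm2, d0, n=4, overhead=0.10):
    """Relative silicon-cost saving from splitting one die into n chiplets
    with a combined area overhead, assuming known-good-die sorting."""
    mono_cost = area_cm2 / die_yield(area_cm2, d0)
    split_area = area_cm2 * (1.0 + overhead)
    split_cost = split_area / die_yield(split_area / n, d0)
    return 1.0 - split_cost / mono_cost

# The saving shrinks rapidly as the die gets smaller, and can even go
# negative once the 10% overhead outweighs the yield gain:
for area in (7.77, 4.0, 1.0, 0.5):
    print(f"{area:5.2f} cm2 -> {partition_saving(area, 0.22):6.1%}")
```

In this simple model the crossover, where partitioning stops paying for itself, lands below 1 cm2 at D=0.22/cm2; a higher defect density pushes the crossover toward smaller die.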
In these early adoption examples from the likes of FPGAs, GPUs, and server CPUs, we’ve been able to ignore the actual cost of the 3D-SIC packaging overhead, but for more mainstream products, that overhead will be significant compared to the savings. In the examples above, the interposer is more or less an additional wafer cost (albeit with only a few layers of RDL processing), which could be too steep a price, considering the large interposer areas.
For more cost-sensitive products that don’t need the absolute highest performance, “wafer-level fan-out” technologies, starting with Apple’s iPhone 7, have begun to offer a lower-cost option than true 3D-stacked chips. These fan-out technologies compress mould compound around the back side of an arrangement of chips, creating a reconstituted carrier wafer; hence the name “fan-out wafer-level packaging” (FOWLP). The carrier is then flipped, and RDL-type metal is deposited directly across the face of the chips in their carrier wafer, with no interposer cost required (see figures 4 and 6 in this nice article by semiengineering). This method can currently produce 5um line/space chip-to-chip interconnect: not as dense as silicon interposer technology, but good enough for a lot of applications. And there is plenty of ongoing research pushing toward 1um line/space and beyond. Cost-wise, even more affordable substrates, such as glass or organic materials, could substitute for the wafer (a nice summary presentation by Nvidia can be found here).
Beyond wafer-level fan-out, it turns out that there are some nicely depreciated glass-substrate facilities that could prove cheaper still: previous generations of flat-panel factories. This Fan-Out Panel-Level Packaging (FOPLP) has a potential cost advantage over FOWLP through sheer size, as shown in the figure below comparing wafer size to panel size.
Figure 10: Comparison of number of die on a 300mm wafer, vs. various panel sizes
Various industry-academia consortia are working to address the many panel-substrate challenges, including warpage, RDL processing, chip placement accuracy, and yield. These include consortia based around the Fraunhofer Institute (300mm x 600mm) and A*STAR/IME (550mm x 650mm). Production of panel-level packaging is now being announced, including by NEPES, PowerTech (PTI), and Samsung (510mm x 500mm). Interconnect pitches are expected to be initially in the 5um/5um to 2um/2um line/space range.
The summary for 3D-SIC is that, in a short amount of time, we have gone from lab demonstrators to real products showing significant advantages in cost and performance. While these advantages are most easily leveraged by gigantic, expensive chips today, with so many fabs and packaging houses pushing 3D-SIC development, I expect solutions to filter down into more consumer-level products at a relatively fast pace. These 3D-SIC solutions will improve power and performance beyond what basic transistor scaling can provide, and can ultimately lower costs as well.
I also expect this momentum to carry 3D interconnect density to levels where we can consider partitioning not just at the IC level but at the block level. That falls in the “3D-SOC” range of Figure 1, which is where an IP provider like Arm could participate fully in 3D, and that’s what I’ll cover in the second part of this series.