In the second of a three-part series exploring three-dimensional integrated circuits, Arm Research Fellow Greg Yeric moves on to what the future could hold for this technology.
In the first part of this series, I used Figure 1 below to partition the 3D space (no pun intended) into three parts, and discussed the past and present '3D-SIC' (stacked IC) work represented on the left side. In this second part, I will move on into the future, on the assumption that 3D interconnect will continue to advance to densities that allow the 3D-SoC partitioning depicted in the middle.
Figure 1: Overview and taxonomy of evolution from 3D-SIC to 3D-IC. (Source: Eric Beyne, “The 3-D Interconnect Technology Landscape”, IEEE Design and Test, May/June 2016, pp. 8-20 (reproduced with permission))
As discussed in part one, 'Stacked IC' technologies are beginning to provide significant advancements in power and performance for end systems. For the most part, however, TSV pitches in the few tens of microns limit partitioning to the 'chiplet' level, meaning that we need to limit our 3D architectures to things that can be partitioned using somewhat conventional I/O protocols. Now that TSV-based technology is delivering real results to real systems, we can expect the technology to drive toward increasingly finer pitches, and if/when they get to around 1-2um in pitch, they could support partitioning at levels inside the SoC, realizing 3D “System on Chip” (3D-SoC) partitioning.
In case you are having doubts, one specific part of the semiconductor ecosystem is already pushing into 3D-SoC interconnect densities: CMOS Image Sensors (CIS). With a > $10b market driven by mobile phones, CIS makers adopted 'backside illumination', employing advanced wafer thinning to achieve form factor reduction as well as improved imaging. In the early 2010s, they pushed further into 3D-SoC, marrying the image sensor with an image signal processor (ISP) to further reduce the form factor and improve the CIS characteristics. Last year Sony announced its 3-layer stacked IMX400 image sensor chip, adding a DRAM memory buffer layer that allows their Xperia XZ phones to capture slow motion at 960 frames per second.
Figure 2: Sony IMX400 image sensor, composed of three layers (Source: TechInsights)
A key enabling technology for this product is wafer-level bonding, as opposed to the chip-level stacking that we’ve discussed so far. Stacking wafers can ultimately be cheaper than stacking chips, but wafer-level bonding also leverages wafer alignment technology that far surpasses the accuracy of discrete chip placement. Equipment available today, such as that from EVG, can achieve 200nm alignment accuracy across the full wafer, enabling TSV pitches well into the single-digit micron range, as seen in Figure 2 above. Moreover, as the pixel size roadmap goes to around 1um, and the end goal for image sensors is a per-pixel interconnect, we can see enough momentum to plan for 1um-pitch TSV interconnect using wafer-to-wafer bonding.
At that level of interconnect density, we can partition the components of an SoC across different 3D layers, even down to the IP block level. One key benefit would derive from a particular advantage displayed by Sony in Figure 2: the three different functions of that CIS chip are made from three different process technologies. This heterogeneous integration offers us a new and powerful knob for system optimization. An obvious heterogeneous integration benefit for conventional SoCs would be to separate the I/O from the logic in a microprocessor. As wafer cost is well understood to be a limiting factor to scaling below 28nm, and many analog/mixed-signal circuits do not scale well into advanced nodes, there is no reason to scale the I/O to 7nm and beyond if it were possible to build it at an older, cheaper process node and then cost-effectively marry it to a smaller advanced logic chip. At the recent International Electron Devices Meeting (IEDM), IMEC presented a paper showing that a 28nm I/O (30% of die area) integrated with 7nm logic would be 33% cheaper than a full 7nm implementation. Ultimately the benefit would stretch beyond basic cost, because 3D-SoC-enabled heterogeneous integration would let us optimize the I/O transistors and the logic transistors separately (and then extend the same approach to various memory or even heterogeneous logic technologies). Historically the goal has been “get as good an I/O device as you can get without disturbing the logic transistors”, because the most cost-effective path was to co-integrate onto the same wafer. The extra degree of freedom would allow a solution with better logic and better I/O devices than co-integration would allow. Of course, a key issue still comes down to the cost of monolithic vs. packaging-enabled integration, but I do believe that we are seeing enough critical mass (see part 1) that we can expect cost to come down to levels that allow 3DIC use in a broad swath of products, not just high-end processors.
And we are already seeing examples of cost savings through heterogeneous partitioning. One specific way to reduce cost in an SoC was discussed by Luke England of GLOBALFOUNDRIES in a paper at the most recent International Electron Devices Meeting (IEDM) titled “Advanced Packaging Saves the Day! - How TSV Technology Will Enable Continued Scaling”. In this paper, they partitioned the cache memory and logic into two separate layers. The savings come from two places. First, an SRAM die needs significantly fewer interconnect layers, so in effect half of your chip has a reduced interconnect cost (interconnect being a big driver of wafer cost). Second, you recoup cost through yield: the native repairability of SRAM is no longer encumbered by defects in a logic portion of the die, so the dedicated SRAM layer can be very high yield, and the smaller logic-only die benefits from the yield improvement we discussed in part 1 of the blog. For the large 625 mm2 die considered, cost savings for this approach were as high as 63%.
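To make the yield part of that argument concrete, here is a minimal sketch using the textbook Poisson yield model, Y = exp(-A x D0). The defect density, the equal two-way split, and the assumption that each die is tested before stacking are all illustrative choices of mine, not figures from the GLOBALFOUNDRIES paper. The sketch also ignores SRAM repair, bonding yield, and the interconnect-layer savings, so it should not be expected to reproduce the 63% result.

```python
import math


def poisson_yield(area_mm2: float, defects_per_cm2: float) -> float:
    """Fraction of good dies under the simple Poisson model Y = exp(-A * D0)."""
    area_cm2 = area_mm2 / 100.0  # 100 mm^2 per cm^2
    return math.exp(-area_cm2 * defects_per_cm2)


def silicon_per_good_die(area_mm2: float, defects_per_cm2: float) -> float:
    """Silicon area (mm^2) that must be processed to obtain one good die."""
    return area_mm2 / poisson_yield(area_mm2, defects_per_cm2)


D0 = 0.2  # hypothetical defect density in defects/cm^2, chosen for illustration

# One monolithic 625 mm^2 die versus two 312.5 mm^2 dies that are each
# tested first, so only good dies are stacked (an assumption of this sketch).
monolithic = silicon_per_good_die(625.0, D0)
split = 2 * silicon_per_good_die(312.5, D0)
saving = 1.0 - split / monolithic

print(f"monolithic: {monolithic:.0f} mm^2 of silicon per good die")
print(f"two-die stack: {split:.0f} mm^2 of silicon per good stack")
print(f"silicon saving from yield alone: {saving:.1%}")
```

Under this model the saving depends only on the product of defect density and the area removed from each die, which is why the benefit grows quickly for very large dies.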
An excellent demonstration of the potential power of heterogeneous 3D-SoC integration was shown by Stanford in a Nature publication this year. Figure 3 below shows three connected layers that are quite similar to the Sony image sensor I showed in Figure 2: sensor layer, memory buffer layer, compute layer. However, in this case the large number of ambient sensors are married to specially designed sensor classification circuits that perform pattern recognition more directly on the sensor layer's output. The goal here was not an apples-to-apples 2D to 3D improvement of some percentage, but an apples-to-oranges improvement that fully takes advantage of 3D-SoC connectivity in a “data rich” environment. For this particular type of application in a standard 2D CMOS chip, over 85% of the clock cycles are spent in memory access, and this dedicated 3D-SoC sensor/memory/logic cell approach achieves > 100x improvement in the energy-delay product. Note that additional apples-to-oranges improvement comes through the use of a novel nonvolatile resistive RAM (RRAM) and also carbon nanotubes (CNT) for the logic underneath. A similar goal was recently advanced by DARPA’s Electronics Resurgence Initiative. In particular, goals were set to utilize 9 million interconnects per mm2 to connect an NVM layer to a dedicated machine learning logic layer in order to achieve a 50x improvement in performance at power and cost equivalent to 2D CMOS.
Figure 3: 3D-SoC sensor/machine learning chip from Stanford (Source: Nature, Vol 547, 6 July 2017, p. 74. Image courtesy of Stanford)
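It can help to translate between interconnect density and pitch when comparing figures such as the DARPA target quoted above with the micron-scale pitches discussed earlier. The sketch below assumes interconnects laid out on a uniform square grid, which is my simplification; real layouts will differ.

```python
import math


def density_per_mm2_from_pitch(pitch_um: float) -> float:
    """Interconnects per mm^2 for a uniform square grid at the given pitch."""
    per_mm = 1000.0 / pitch_um  # connections along one millimetre
    return per_mm ** 2


def pitch_um_from_density(per_mm2: float) -> float:
    """Pitch in micrometres implied by a density on a uniform square grid."""
    return 1000.0 / math.sqrt(per_mm2)


# Pitches mentioned in this series: tens of microns, below 10um, around 1um.
for pitch in (40.0, 10.0, 1.0):
    print(f"{pitch:>5.1f} um pitch -> {density_per_mm2_from_pitch(pitch):,.0f} per mm^2")

# The 9 million interconnects per mm^2 target, converted back to a pitch.
print(f"9e6 per mm^2 -> {pitch_um_from_density(9e6):.2f} um pitch")
```

On this simplified grid, a 1um pitch corresponds to one million connections per mm2, so the 9 million per mm2 target implies a pitch of roughly a third of a micron.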
The Stanford chip shown in Figure 3 actually sits on a fourth layer, a standard CMOS chip that provides the normal interface(s) to the outside world. The design therefore takes specific advantage of the heterogeneous integration afforded by 3D-SoC, because the carbon nanotube (CNT) FETs are best created at very high temperatures that are not compatible with standard CMOS. Attempting to co-integrate the CNTs with CMOS would necessarily require compromises in performance, and the 3D-SoC approach removes this constraint. This flexibility could easily enable the adoption of devices that would otherwise face a steep uphill battle to fit within the limits of standard CMOS wafer processing. Examples you might expect to see in the future range from non-silicon-based high mobility MOSFETs to novel photonic devices.
There is another advantage inherent in this approach that will become more critical as we progress into deep nanometer process nodes. Much of the additional wafer cost in the newer nodes comes directly from additional process steps. Multiple patterning is one example, but every technology node seems to need additional steps that result in the wafers requiring more time to process. With stacked heterogeneous integration we can also build these more complex but more capable systems in less time, because the various layers can be manufactured in parallel and then bonded together.
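The cycle-time argument is simple arithmetic, sketched below. All of the day counts are hypothetical placeholders, and treating a co-integrated flow as the plain sum of the per-layer processing times is a deliberately crude assumption.

```python
def sequential_cycle_days(layer_days):
    """Crude model of co-integration: every layer's steps run back to back."""
    return sum(layer_days)


def parallel_cycle_days(layer_days, bonding_days):
    """Layers are built on separate wafers at the same time, then bonded."""
    return max(layer_days) + bonding_days


# Hypothetical processing times per layer, in days (illustration only).
layers = {"sensor": 30, "memory": 45, "logic": 60}

seq = sequential_cycle_days(layers.values())
par = parallel_cycle_days(layers.values(), bonding_days=10)

print(f"sequential: {seq} days, parallel plus bonding: {par} days")
```

In this model the stacked flow is limited by its slowest layer plus the bonding step, rather than by the total number of process steps.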
The 3D-SoC examples you’ve seen above are all fairly specific use cases. There is a reason I’m not showing you more general SoC examples: design complexity. To partition a generic 2D SoC into 3D would require most aspects of the design flow to comprehend 3D constructs, and that is no easy task. It’s not just the added thermal and noise complexities: schematic capture, parasitic extraction, IR and EM checking, and static timing analysis tools would all have to be 3D-aware. In addition, functional verification, design-for-test, assembly/yield and even supply chain management would have to be addressed in order to fully push 3D-SoC design into the more general SoC market. Embedded in these challenges is a two-step challenge in System-Technology Co-Optimization: system-level changes that take advantage of reduced block-to-block latency or increased bandwidth need information abstracted from the physical design level, and the physical design level in turn needs its own 3D enablement before it can provide that information.
3D interconnect pitches that drive below 10um are enabling us to partition within the IP blocks, but we should anticipate that it won't be too much longer before we’ll be able to utilize 3D at the standard cell level. I’m going to drop the 'TSV' nomenclature here because at this level we really are talking about things that look like the conventional vias you get in the wiring of a 2D chip. As you’d imagine, this level of 3D connectivity would be particularly interesting to those of us who care about life inside the IP blocks. At Arm Research, we began looking at this kind of 3D-IC scenario a few years ago. Of course, as I just mentioned above, design tools don’t exist yet for 3D-IC, so we had to address that chicken-and-egg problem first. Professor Sung Kyu Lim at Georgia Tech had already pioneered work in this area, and we were able to partner with him to look into this from an Arm perspective. We combined his work to 'trick' 2D logic synthesis into creating 2-tier 3D-IC results with our 'Predictive Technology' 7nm standard cell libraries in order to assess what potential benefit 3D-IC could provide to Arm cores in (what was at the time) a future technology node. (Note: we maintain internal predictive standard cells for work like this, but we have also partnered with Professor Larry Clark at Arizona State University to publish an academic 7nm predictive library, 'ASAP7'; see the ASAP7 entry in the publication list at the end of this post.)
While intuitively we’d expect savings at advanced nodes where wire RC is extremely painful, we found benefits across the range from 45nm to 7nm, which we discussed at the 2015 ISLPED and 2016 ISQED conferences. At the 7nm node, we do see significant advantage both from base wire length reduction and also from buffer count reduction, but at the same time we do see the growing pain of the intra-cell 'middle-of-line' interconnect which manifests as power internal to the cells, which cell-level 3D-IC can’t help. We described this more at the 2016 DAC conference.
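For intuition about where the wire-length benefit comes from, here is an idealized back-of-the-envelope sketch. It assumes a block folds perfectly onto equal-area tiers and uses the standard distributed-RC approximation for an unbuffered wire (delay of about 0.5 x r x c x L^2). The resistance, capacitance, and wire length values are hypothetical, and the sketch says nothing about the intra-cell effects described above or about our published results.

```python
import math


def folded_length_scale(tiers: int) -> float:
    """Ideal scaling of block-spanning wire length when a block of fixed
    total area is folded onto `tiers` equal tiers: side length ~ sqrt(area)."""
    return 1.0 / math.sqrt(tiers)


def unbuffered_rc_delay(r_per_um: float, c_per_um: float, length_um: float) -> float:
    """Distributed RC (Elmore) delay of an unbuffered wire: 0.5 * r * c * L^2."""
    return 0.5 * r_per_um * c_per_um * length_um ** 2


# Hypothetical wire parameters, for illustration only.
R_PER_UM = 20.0     # ohms per micrometre
C_PER_UM = 0.2e-15  # farads per micrometre
LENGTH_2D = 200.0   # micrometres, a wire spanning the 2D block

scale = folded_length_scale(2)
delay_2d = unbuffered_rc_delay(R_PER_UM, C_PER_UM, LENGTH_2D)
delay_3d = unbuffered_rc_delay(R_PER_UM, C_PER_UM, LENGTH_2D * scale)

print(f"ideal wire length reduction with 2 tiers: {1 - scale:.1%}")
print(f"unbuffered RC delay ratio (3D / 2D): {delay_3d / delay_2d:.2f}")
```

Because unbuffered delay grows with the square of length, shorter wires also need fewer repeaters, which is consistent with the buffer count reduction mentioned above.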
With these initial studies showing potential benefit in logic path delay due to wire reduction, we formalized our work to create 2-layer 3D-IC designs using conventional 2D tools, which we call “Cascade2D” and described at the 2016 ICCAD conference. By paying attention to key blocks within the Arm core used in this study, and creating a 3D partitioning flow that considered both 2D tiers simultaneously, we demonstrated 25% higher performance at iso-power. (See also a more detailed description of 7nm 3D-IC in our 2017 IEEE Trans. VLSI paper.)
Figure 4: 3D-IC method and results. (a) “Cut and slide” method to create 3D layout from 2D tools. (b) illustration of intelligent mapping of IC blocks to optimize 3D benefits. (c) demonstration of 2D to 3D frequency uplift. (Source: K. Chang, S. Sinha, B. Cline, G. Yeric, and S. K. Lim, "Cascade2D: A Design-Aware Partitioning Approach to Monolithic 3D IC with 2D Commercial Tools," in International Conference on Computer Aided Design (ICCAD), 2016)
This result shows significant potential for 3D-IC at the standard cell level, opening up an additional path for performance scaling in addition to conventional process node size reduction. As an added benefit, we also found that for the higher end frequency targets, we were able to save 10% or more in die area, due mainly to buffer savings. So, if the combined costs of 3D-IC can be kept in this range, we could realize the performance boost of 3D-IC 'for free'.
The work above shows the potential advantage to logic circuits using 3D-IC, and given the slowing of performance scaling in advanced process nodes, justifies continued work in readying VLSI design for 3D-IC. While these are all paper studies, the availability of appropriate process technology may be closer than you might think. Limiting ourselves to 2 layers, we can utilize wafer-to-wafer bonding technology without the need for TSVs, using so-called 'hybrid' wafer bonding (referring to the fact that wafers are bonded together with both oxide and metal areas exposed). This is used in the bottom two tiers of Figure 2, but it is not readily apparent there. Here’s a better picture from the Samsung S7 phone’s image sensor, where the hybrid bonding is seen in the 4 light-shaded shapes, which are copper pad to copper pad bonds between the two wafers.
Figure 5: Wafer-to-wafer hybrid bonding in the Samsung S7 image sensor (Source: TechInsights)
It’s quite encouraging to see this technology demonstrate the cost and reliability required for consumer electronics. And while the pitch shown above is a far cry from what we need, lab demonstrations are already showing results in the range we’d need to realize cell-level 3D-IC:
Figure 6: Advanced wafer-to-wafer hybrid bonding demonstration (Source: CEA-Leti: "Leti Demonstrated World's First 300-mm Wafer-to-Wafer Direct Hybrid Bonding with 1-micron Pitch on EV Group System")
Granted, chips using this technology would be a very specific subset, because you would want to partition a die into two equally sized pieces, and you would have to be willing to pay for twice the interconnect (both tiers need interconnect, which is partially redundant compared with a 2D-only solution), both in terms of cost and wire RC. You’d also probably still need TSVs in order to get power to the face-to-face pair of chip layers, but these TSVs could be far fewer in number and we could probably tolerate standard TSV processes available today. Ultimately, we might prefer to process an additional inter-tier via, which would allow a more conventional Power Delivery Network (PDN) setup in a back-to-face arrangement like this.
Figure 7: Back-to-face 2-tier arrangement facilitating a more conventional PDN arrangement (Source: K. Chang, S. Das, S. Sinha, B. Cline, G. Yeric and S. K. Lim, "Frequency and time domain analysis of power delivery network for monolithic 3D ICs", 2017 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), Taipei, 2017)
The 3D-IC research that we’ve published to date has been limited to logic path folding, but there is potentially even greater benefit in 3D partitioning within memories, and/or enabling fine-grained memory-over-logic. As the manufacturing capabilities arrive to support cell-level 3D-IC, we expect gains in addition to the 3D-SIC and 3D-SoC technologies that are already being demonstrated.
It makes sense that those technologies are being exploited first, as 3D-IC design has a lot of work ahead of it in order to come to fruition. But they need not be mutually exclusive: it’s quite likely that we can create advanced systems with a combination of chip-level stacking and block/cell level stacking.
There is one final 'dimension' to the 3D story, described at the right of Figure 1: transistor-level stacking. That can also be 'folded in' to the overall story. Transistor-level 3D is more directly an attack at continuing Moore’s Law scaling, and we’ll be discussing that in the final part of this blog series.
Arm Research publications referenced in this blog:
1. K. Chang, S. Das, S. Sinha, B. Cline, G. Yeric and S. K. Lim, "Frequency and time domain analysis of power delivery network for monolithic 3D ICs," 2017 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), Taipei, 2017, pp. 1-6.
2. K. Chang, K. Acharya, S. Sinha, B. Cline, G. Yeric and S. K. Lim, "Impact and Design Guideline of Monolithic 3-D IC at the 7-nm Technology Node," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 25, no. 7, pp. 2118-2129, July 2017.
3. K. Chang et al., "Cascade2D: A design-aware partitioning approach to monolithic 3D IC with 2D commercial tools," 2016 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Austin, TX, 2016, pp. 1-8.
4. K. Chang, S. Sinha, B. Cline, G. Yeric and S. K. Lim, "Match-making for Monolithic 3D IC: Finding the right technology node," 2016 53rd ACM/EDAC/IEEE Design Automation Conference (DAC), Austin, TX, USA, 2016, pp. 1-6.
5. K. Acharya, K. Chang, B. W. Ku, S. Panth, S. Sinha, B. Cline, G. Yeric and S. K. Lim, "Monolithic 3D IC design: Power, performance, and area impact at 7nm," 2016 17th International Symposium on Quality Electronic Design (ISQED), Santa Clara, CA, 2016, pp. 41-48.
6. K. Chang, K. Acharya, S. Sinha, B. Cline, G. Yeric and S. K. Lim, "Power benefit study of monolithic 3D IC at the 7nm technology node," 2015 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), Rome, 2015, pp. 201-206.
7. L. T. Clark, L. Shifren, V. Vashishtha, A. Gujja, S. Sinha, B. Cline, C. Ramamurthy and G. Yeric, "ASAP7: A 7-nm FinFET Predictive Process Design Kit," Microelectronics Journal, vol. 53, pp. 105-115, July 2016.