Primeur Magazine: Applying processors from mobile devices to supercomputers sounds like an original approach to solve the power efficiency problem. Where did the idea come from?
John Shalf: The idea is not entirely original since it was also central to the design of IBM's BlueGene and the SiCortex machines, but our target is devices containing hundreds of cores. The unique nature of our approach is determining new ways to organize these cores so that they can be effective and easy to program for the more extreme many-core chips we will need for future energy-efficient supercomputing designs. Power is already the key limiting factor for supercomputing. At Berkeley, we investigated numerous possible architecture options during the course of our discussions for the "View from Berkeley" (http://view.eecs.berkeley.edu). Of all the approaches, the most practical option turned out to be large arrays of simpler cores rather than continuing with modest-size multicore processors containing dozens of complex cores. The embedded market has a longer history of expertise in power efficiency using these very simple core designs. Not only the processor itself but also the design techniques developed by the embedded technology vendors are playing a crucial role in this matter. The hardware and software co-design methodology that is commonly used to develop energy-efficient mobile devices has to be adapted to the design of supercomputers. This is the problem the researchers are trying to solve over the course of the next decade.
The vendors of handheld devices need energy efficiency for a long battery life, yet still have to deliver performance. Given that power is the leading design constraint for HPC, our needs are aligned with those of the embedded/handheld computing market. The delivered application performance per unit of energy for these designs is far greater than what can be achieved using conventional designs. Embedded processors are much more energy-efficient than traditional processors because of their simplicity, but also because they are tailored for the application. The approach reduces waste by removing any features that are unnecessary for the targeted applications. This application-targeted approach, combined with the co-design process, can achieve a solution that is hundreds of times more energy-efficient than one built from conventional desktop components. As a result, there is a lot more energy to do useful work. Embedded processors are also inexpensive. Their vendors are used to keeping prices under control, because of the competition in this mass market.
In the high end of desktop computing it takes 4 years to design a new chip, whereas in the embedded market a design firm such as Tensilica may produce upwards of 200 new designs per year. Such firms have developed sophisticated tools to accelerate the turnaround of these tailored processor designs. The software, namely the debuggers, also plays its part. Everything is customized. The processor and the software are matched to each other. Berkeley Lab is collaborating with Tensilica to explore the use of this company's processor cores as the building blocks in supercomputing design.
Primeur Magazine: So the Tensilica processors are providing more power efficiency but what about the power sufficiency?
John Shalf: Currently, the chip (e.g. an AMD Opteron or an Intel Nehalem) is considered the commodity in supercomputing. Such chips are the building blocks of our current supercomputer designs. In the embedded market, the circuits on the chip are the commodity. Intellectual Property (IP) blocks, which are pre-designed and pre-verified components, are stitched together to create ASIC designs and can be configured in novel ways to target the application requirements. So 100 cores are put on a chip together with large numbers of memory controllers in order to provide lots of memory bandwidth, and the communication between those cores can be organized to target high-level programming languages such as Unified Parallel C (UPC). The chip is as powerful as a graphics processing unit (GPU) but it works at a fraction of the watts. If you can add some features you can make it easier to program for science. A lot is being gained from what we throw away. An Intel processor, for instance, has 500 instructions, but we need only 80. So we take only what we need from the embedded processor market. Seymour Cray said: "Only put into a supercomputer what is absolutely necessary." Commercial off-the-shelf (COTS) technology has limited our ability to adhere to his advice, but the embedded tools give us much more flexibility to return to that kind of design philosophy.
Primeur Magazine: Are your processors already used in supercomputers right now?
John Shalf: Our design is still experimental. We have all the logic to build the chip but we do cycle-accurate simulation of the design using an FPGA (field-programmable gate array) simulator called RAMP (Research Accelerator for MultiProcessors). The simulation runs 10 times slower than the real chip would, but it accurately predicts the real hardware performance because it is in fact the real circuit design. Intel's Larrabee is based on some of the same design principles of using large arrays of simpler cores. It answers how many-core processing could be made easier to program. But the Berkeley solution uses far less power than Intel's Larrabee because its cores are much simpler and more closely tailored to the target scientific applications. Each one of the Tensilica processors we use in our design occupies only 2 square millimeters on a chip with full IEEE double precision floating point and consumes only 130 milliwatts.
Intel and Microsoft have co-funded the Par Lab on the Berkeley campus to study many-core processors and software for handheld devices. We work closely with the campus researchers, but our focus is more on how to scale up this approach to large-scale supercomputing systems.
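As a rough illustration of the per-core figures quoted above (2 square millimeters and 130 milliwatts per Tensilica core, with 100 cores on a chip), the following Python sketch totals the power and area of the cores alone. The function name and the assumption that the budget scales linearly with the core count are illustrative and do not come from the interview. Memory controllers, interconnect, and any other on-chip logic are not included, so this is not an estimate for the whole chip.

```python
def aggregate_core_budget(num_cores, core_power_mw=130, core_area_mm2=2):
    """Sum the power and area of the cores alone (hypothetical helper).

    Defaults are the per-core figures quoted in the interview.
    Assumes simple linear scaling; ignores memory controllers,
    interconnect, and all other on-chip logic.
    """
    total_power_w = num_cores * core_power_mw / 1000  # milliwatts -> watts
    total_area_mm2 = num_cores * core_area_mm2
    return total_power_w, total_area_mm2


# 100 cores, the core count mentioned earlier in the interview
power_w, area_mm2 = aggregate_core_budget(100)
print(f"cores only: {power_w} W, {area_mm2} mm^2")
```

Under these assumptions, the cores of a 100-core chip would together draw 13 watts and occupy 200 square millimeters.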
Primeur Magazine: Can the solution be commercialized?
John Shalf: Our goal is to influence the HPC market to adopt a radically different design methodology. As such, we have a high burden of proof to demonstrate the value of our approach. So we want to build prototype systems to demonstrate the effectiveness of our approach to industry. It can be commercialized but not by Berkeley as such. So Berkeley seeks a partner to commercialize it. The idea has to become commodity, so it is important to us to make all of the results of this research public and broadly accessible.
Primeur Magazine: What are your future plans and work in this area?
John Shalf: We need to demonstrate a scaled-up design within the next five years. We have already been working on it for three years and now there is a concrete design in simulation. This is used for research into programming models and hardware support for many-core. However, demonstrating that our energy models and design processes can hit the target power and performance will require a full-scale prototype design to validate the models. LBL is talking with different partners but it is a question of funding and that takes time.
Primeur Magazine: How do you see the future of supercomputing in general?
John Shalf: Power consumption is the largest impediment to future performance improvements in supercomputing systems of all scales - not just exascale. Therefore it is essential to pursue the new technologies and chip organizations that make systems both energy-efficient and easy to program. We are on a critical path to exascale computing but we still need three or four miracles to reach it. The embedded processor approach is one issue to solve. The second one is energy-efficient memory technology. However, there are very few memory vendors left. In fact, Micron is the only one that intends to address this problem. The third one is more energy-efficient interconnect technology, which may well involve scaling down photonics to work at chip scale. Indeed, we had a talk today from Luca Carloni, who is studying "silicon photonics": tiny optical switches and waveguides that can be integrated directly onto a CMOS chip design. The last one is storage technology. We know that disk technology is not going to scale at the rate we need to meet storage performance requirements at exascale, but the heir apparent to existing mechanical technology is not clear. Nonvolatile solid-state memory technologies, such as FLASH, phase-change memories, and other NVRAM technologies, are making great strides, but it is not yet apparent which approach will win in the marketplace in the long run.
Primeur Magazine: Thank you for your time and the best of success with the energy-efficient processor miracle!
More information about the embedded processor research in the Green Flash project at Berkeley is available at http://www.lbl.gov/cs/html/greenflash.html