<ors:list-of-processors  xmlns:ors="http://www.hoise.com/ors/0.1" id="procesors" >
<!-- ======================================================================== -->
<ors:processor id="ev67">
   <ors:name>Compaq Alpha EV7</ors:name>
   <ors:block-diagram href="ev7-1.jpg">Block diagram showing the functional units in an Alpha EV7 processor</ors:block-diagram>
   <ors:chip-layout href="ev7-2.jpg">Chip layout for the Alpha EV7 processor</ors:chip-layout>
   <ors:description>
      <p>The CPU presently employed in Compaq machines like the AlphaServer SC and the Wildfire, and in various cluster systems, is the Alpha EV68 processor. Shortly (second half of 2002), EV7 processors will become available. Because of the EV7 structure the macro-architecture of these systems may also change significantly (see below). The core of the EV7 processor is almost identical to that of the EV68 architecture and is depicted in Figure <a href="#ev7-1fig">8a</a>.</p>
      <p>A notable fact is that there are <em>two</em> duplicate integer register files, both with 80 entries, each of which services a set of integer functional units, called cluster 0 and cluster 1, respectively, by Compaq. The four integer Add/Logical units can exchange values in one cycle if required. Although this is not shown in the diagram, the integer multiply is fully pipelined. The two integer clusters and the two floating-point units enable the issuing of up to 6 instructions simultaneously. The two load/store units draw on a 64 KB instruction cache and a 64 KB data cache that are both 2-way set-associative. Four instructions can be accepted for (speculative) processing. Of the 80 integer and 72 floating-point registers, 41 in each register file can hold speculative results. The out-of-order issuing of instructions is supported via an integer queue of length 20 and a floating-point queue with 15 entries.
However, as the integer processing clusters do not contain the same functional units, not all integer instructions can be scheduled dynamically. Those instructions that need to execute in a particular unit (e.g., an integer multiply that is only available in cluster 0) are scheduled statically. As soon as an instruction is issued or is terminated due to mis-speculation it is removed from the queue and can be replaced by another instruction. Instruction fetching is governed by the branch predictor. This hardware contains global and local prediction tables and Branch History Tables (BHTs) to train the predictor in order to obtain an optimal instruction fetch to the instruction cache and registers.</p>
      <p>The feature size used is 0.18 µm instead of 0.25 µm, which makes it possible to locate a 1.5 MB secondary cache and 2 memory controllers on chip. The largest difference will be that there will be 4 dual channels (North, East, South, West) from the chip to interconnect it with neighbouring chips at a bandwidth of 1.6 GB/s per single channel. Compaq calls this "seamless SMP processing" and, as the name suggests, it is well-suited to building SMP nodes with low memory latency. The layout of the complete chip is shown in Figure <a href="#ev7-2fig">8b</a>.</p>
   </ors:description>
</ors:processor>
<!-- ======================================================================== -->
<ors:processor id="itanium">
   <ors:name>Intel Itanium 2</ors:name>
   <ors:block-diagram href="itan2.jpg">Block diagram of the Intel Itanium 2</ors:block-diagram>
   <ors:chip-layout href=""/>
   <ors:description>
      <p>The Itanium 2 is a representative of Intel's IA-64 64-bit processor family and, as such, the second generation. Its predecessor, the Itanium, has been out for almost a year but has not spread widely, primarily because the Itanium 2 would follow quickly with projected performance levels up to twice that of the first Itanium.
The Itanium 2 will become available in 1--2 months at the time of writing and will improve on some aspects of the first generation, in particular integer processing and cache/memory bandwidth.</p>
      <p>The Itanium family of processors has characteristics that are different from the RISC chips presented elsewhere in this section. A block diagram of the Itanium 2 is shown in Figure <a href="#itanfig">11</a>.</p>
      <p>The clock frequency for the Itanium 2 in the products to be shipped will be around 1 GHz. Figure <a href="#itanfig">11</a> shows a large number of functional units that must be kept busy. This is done by large instruction words of 128 bits that contain three 41-bit instructions and a 5-bit template that aids in steering and decoding the instructions. This is an idea that is inherited from the Very Large Instruction Word (VLIW) machines that were on the market for some time about ten years ago. The two load/store units fetch two instruction words per cycle, so six instructions per cycle are dispatched. The Itanium also has in common with these systems that the scheduling of instructions, unlike in RISC processors, is not done dynamically at run time but rather by the compiler. The VLIW-like operation is enhanced with predicated execution, which makes it possible to execute instructions in parallel that normally would have to wait for the result of a branch test. Intel calls this refreshed VLIW mode of operation EPIC, Explicit Parallel Instruction Computing. Furthermore, load instructions can be moved and the loaded variable used before a branch or a store, by replacing this piece of code with a test at the place it originally came from to see whether the operations have been valid. To keep track of the advanced loads an Advanced Load Address Table (ALAT) records them.
When a check is made of the validity of an operation depending on the advanced load, the ALAT is searched and, when no entry is present, the operation chain leading to the check is invalidated and the appropriate fix-up code is executed. Note that this is code that is generated at compile time, so no control speculation hardware is needed for this kind of speculative execution. Such hardware would become exceedingly complex for the many functional units that may be in operation simultaneously at any time.</p>
      <p>As can be seen from Figure <a href="#itanfig">11</a> there are four floating-point units capable of performing Fused Multiply Accumulate (FMAC) operations. However, two of these work at the full 82-bit precision, which is the internal standard on Itanium processors, while the other two can only be used for 32-bit precision operations. When working in the customary 64-bit precision the Itanium has a theoretical peak performance of 4 Gflop/s at a clock frequency of 1 GHz. Using 32-bit floating-point arithmetic, the peak is doubled. In the first generation Itanium there were 4 integer units for integer arithmetic and other integer or character manipulations. Because the integer performance of this processor was modest, 2 integer units have been added to improve this. In addition there are four MMX units to accommodate instructions for multi-media operations, an inheritance from the Intel Pentium processor family. For compatibility with this Pentium family a special IA-32 decode and control unit is present.</p>
      <p>The register files for integers and floating-point numbers are large: 128 entries each. However, only the first 32 entries of these registers are fixed while entries 33--128 are implemented as a register stack. The primary data and instruction caches are 4-way set-associative and rather small: 16 KB each. This is the same as in the former Itanium processor.
However, the speed of the L1 cache has now been doubled to full speed: data and instructions can now be delivered to the registers every clock cycle. Furthermore, the secondary cache has been enlarged from 96 KB to 256 KB and it is 8-way set-associative. Moreover, the L3 cache has been moved onto the chip and is no less than 3 MB. This cache structure greatly improves the bandwidth to the processor core, on average by a factor of 3. This does more for the performance improvement than the relatively modest increase in clock speed from 800 MHz to 1 GHz. The bandwidth from/to memory has also increased by more than a factor of 3. The bus is now 128 bits wide and operates at a clock frequency of 400 MHz, totalling 6.4 GB/s in comparison to 2.1 GB/s for its predecessor.</p>
      <p>The introduction of the first Itanium was deferred time and again, which quenched the interest in its use in high-performance systems. With the availability of the Itanium 2 in the second half of 2002 it is expected that adoption will speed up. Apart from HP/Compaq, SGI, NEC and Fujitsu will also include these processors in their systems in the not too distant future while phasing out the Alpha, PA-RISC, MIPS and SUN processors.</p>
   </ors:description>
</ors:processor>
<!-- ======================================================================== -->
<ors:processor id="opteron">
   <ors:name>AMD Opteron</ors:name>
   <ors:block-diagram href="opteron.jpg">Block diagram of Opteron processor</ors:block-diagram>
   <ors:chip-layout href=""/>
   <ors:description>
      <p>The Opteron (long known by its code name Hammer) is the newest processor from AMD and the successor of the Athlon processor. The first versions are expected to become available by the end of 2002. As it is, like the Athlon, a clone with respect to Intel's x86 Instruction Set Architecture, it will undoubtedly be used frequently in clusters.
Therefore we discuss this processor here although it is not presently used in integrated parallel systems.</p>
      <p>The Opteron processor has many features that are also present in modern RISC processors: it supports out-of-order execution, has multiple floating-point units, and can issue up to 9 instructions simultaneously. In fact, the processor core is very similar to that of the Athlon processor. A block diagram of the processor is shown in Figure <a href="#optfig">7</a>.</p>
      <p>It shows that the processor has three pairs of Integer Execution Units and Address Generation Units that, via a 24-entry Integer Scheduler, take care of the integer computations and address calculations. Both the Integer Scheduler and the Floating-Point Scheduler are fed by the 96-entry Instruction Control Unit that receives the decoded instructions from the instruction decoders. An interesting feature of the Opteron is the pre-decoding of x86 instructions into fixed-length macro-operations, called RISC Operations (ROPs), that can be stored in a Pre-decode Cache. This enables a faster and more constant instruction flow to the instruction decoders. Like in RISC processors, there is a Branch Prediction Table assisting in branch prediction.</p>
      <p>The floating-point units allow out-of-order execution of instructions via the FPU Stack Map &amp; Rename unit. It receives the floating-point instructions from the Instruction Control Unit and reorders them if necessary before handing them over to the FPU Scheduler. The Floating-Point Register File is 88 elements deep, which approaches the number of registers available on RISC processors.
(For the x86 instructions 16 registers in a flat register file are present instead of the register stack that is usual for Intel architectures.)</p>
      <p>The floating-point part of the processor contains three units: a Floating Store unit that stores results to the Load/Store Queue Unit, and Floating Add and Multiply units that can work in superscalar mode, resulting in two floating-point results per clock cycle. Because of the compatibility with Intel's Pentium III processors, the floating-point units are also able to execute Intel MMX instructions and AMD's own 3DNow! instructions. However, there is the general problem that such instructions are not accessible from higher level languages, like Fortran 90 or C(++). Both instruction sets are meant for massive processing of visualisation data and only allow for 32-bit precision to be used.</p>
      <p>Due to the shrinkage of components the chip can now harbour the secondary cache of 256 KB and the memory controller. This, together with a significantly enhanced memory bus, can deliver up to 5.3 GB/s of bandwidth, an enormous improvement over the former memory system. This memory bus, called HyperTransport by AMD, is derived from licensed Compaq technology and is similar to that employed in Compaq's EV7 processors (see the <a href="ev7.html">Compaq Alpha EV7</a>). It allows for "glueless" connection of several processors to form multi-processor systems with very low memory latencies.</p>
      <p>The clock frequency will be in the order of 2 GHz, and the Opteron is an interesting alternative for many of the RISC processors that are available at this moment. Especially the HyperTransport interconnection possibilities could be highly interesting for building SMP-type clusters.
</p>
   </ors:description>
</ors:processor>
<!-- ======================================================================== -->
<ors:processor xmlns:ors="http://www.hoise.com/ors/0.1" xmlns:top500="http://www.hoise.com/vmp/top500/1.0" xmlns:vmp="http://www.hoise.com/vmp/manual/0.1" id="pa-risc">
   <ors:name>Hewlett-Packard PA-RISC 8700</ors:name>
   <ors:block-diagram href="pa8700.jpg">Block diagram of a HP PA-RISC 8700 processor</ors:block-diagram>
   <ors:chip-layout href=""/>
   <ors:description>
      <p>The computational power for the Hewlett-Packard systems, like the SuperDome, the V-class, and N-class servers, is delivered by the PA-8600 and PA-8700 chips. The processor cores of these chips are essentially the same. However, the PA-8700 is made in 0.18 µm logic, which made it possible to fit a very large 0.75 MB instruction cache and a 1.5 MB data cache on the chip and to raise the clock frequency to 750 MHz. A block diagram of the PA-8700 chip is shown in Figure <a href="#pa8700fig">9</a>.</p>
      <p>A peculiarity of the PA-8x00 chips is the absence of a secondary cache. Instead, a very large primary cache is implemented: a 0.75 MB instruction cache and a 1.5 MB data cache. From the PA-8600 on, the shrinking of the logic has made it possible to put these caches on-chip. The latency of the caches is two cycles. To ensure that data are shipped to the registers every cycle, the load/store units work "out-of-phase": one unit loads from one half of the data cache while the other loads from the other half. The Address Reorder Buffer sets the priority for the loads and tries to load from alternate halves every cycle.</p>
      <p>Like all advanced RISC processors the PA-8700 has out-of-order execution, the sequence of instructions being determined by the instruction reorder buffer (IRB), which contains an ALU buffer that drives the computational functional units and a memory buffer that controls the load/store units.
When speculative branches have been mis-predicted the dependent instructions are retired from the IRB and new candidate instructions replace them. Branch prediction is controlled through the branch history table (BHT) but, in addition to this dynamic branch prediction, a static branch prediction can be performed at the compiler level or by using execution traces of former executions of a program. The BHT was rather small in the predecessors of the PA-8600 and has been enlarged significantly to get better prediction results. The Translation Lookaside Buffer (a component of the load/store units not shown in Figure <a href="#pa8700fig">9</a>) has also been enlarged for more effective address translation. In addition, the new PA-8700 has a pre-fetch capability from the data cache.</p>
      <p>As can be seen in Figure <a href="#pa8700fig">9</a>, there are 2 floating-point units, each of which can deliver 2 flops per cycle but only when the operation is in the <tt>axpy</tt> form <i>x = x + a·y</i>. This is called a Floating Multiply Accumulate instruction (FMAC) by HP. At a clock frequency of 750 MHz this leads to a theoretical peak performance of 3 Gflop/s. However, when the operations occur in another order or with another composition, only 1 flop per cycle per floating-point unit can be executed, with a correspondingly lower flop rate.</p>
      <p>According to HP's roadmap at least another two generations of the PA-8<i>x</i>00 are projected: the PA-8800 and PA-8900, which will be on the market concurrently with the IA-64 Itanium 2 (McKinley) and Itanium 3 (Deerfield), respectively. After that the PA-RISC family will be withdrawn to give way to the IA-64 architecture.
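</p>
      <p>The peak-performance reasoning for the PA-8700 given above can be checked with a small Python sketch. It simply multiplies the number of floating-point units, the flops each unit delivers per cycle, and the clock frequency. The function name <tt>peak_gflops</tt> and its parameter names are hypothetical and introduced here for illustration only; the inputs (2 units, 2 flops per cycle for FMAC operations, 750 MHz) are the figures quoted in this section.</p>
<pre>
```python
def peak_gflops(units, flops_per_unit_per_cycle, clock_mhz):
    """Theoretical peak in Gflop/s: units * flops per cycle * clock (MHz) / 1000."""
    return units * flops_per_unit_per_cycle * clock_mhz / 1000.0


# Inputs taken from the text: 2 floating-point units, 750 MHz clock.
# With FMAC (axpy-type) operations each unit delivers 2 flops per cycle.
fmac_peak = peak_gflops(units=2, flops_per_unit_per_cycle=2, clock_mhz=750)

# With any other mix of operations each unit delivers 1 flop per cycle.
plain_peak = peak_gflops(units=2, flops_per_unit_per_cycle=1, clock_mhz=750)

print(fmac_peak)   # 3.0
print(plain_peak)  # 1.5
```
</pre>
      <p>The factor of two between the two results is the reason why the composition of the floating-point operations matters for the flop rate that can be attained.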
</p>
   </ors:description>
</ors:processor>
<!-- ======================================================================== -->
<ors:processor xmlns:ors="http://www.hoise.com/ors/0.1" xmlns:top500="http://www.hoise.com/vmp/top500/1.0" xmlns:vmp="http://www.hoise.com/vmp/manual/0.1" id="pentium4">
   <ors:name>Intel Pentium 4</ors:name>
   <ors:block-diagram href="p4.jpg">Block diagram of the Intel Pentium 4</ors:block-diagram>
   <ors:chip-layout href=""/>
   <ors:description>
      <p>Although Pentium processors are not applied in integrated parallel systems these days, they play a major role in the cluster community as most compute nodes in Beowulf clusters are of this type. Therefore we also briefly discuss this type of processor.</p>
      <p>Intel provides only scant information on its processor. Therefore, a rough block diagram of the P4 processor can only be synthesized from various sources. It is shown in Figure <a href="#p4fig">12</a>.</p>
      <p>There are a number of distinctive features with respect to the earlier Pentium generations. There are two main ways to increase the performance of a processor: by raising the clock frequency and by increasing the number of instructions per cycle (IPC). These two approaches are generally in conflict: when one wants to increase the IPC the chip will become more complicated. This will have a negative impact on the clock frequency because more work has to be done and organised within the same clock cycle. Very seldom do chip designers succeed in raising both clock frequency and IPC simultaneously. In the Pentium 4, too, this could not be done. Intel has opted for a high clock speed (initially about 40% higher than that of the Pentium III with the same fabrication technology) while the IPC decreased by 10--20%. This still gives a net performance gain even if no other changes had been made to the processor.
To sustain the very high clock rate of the present processors, currently &gt; 2 GHz, a very deep instruction pipeline is required. The instruction pipeline has no less than 20 stages, double the number of stages in that of the Pentium III. Although this favours a high clock rate, the penalty for a pipeline miss (e.g., a branch mis-predict) is much heavier and therefore Intel has improved the branch prediction by increasing the size of the Branch Target Buffer from 0.5 to 4 KB. In addition, the Pentium 4 has an execution trace cache which holds partly decoded instructions of former execution traces that can be drawn upon, thus forgoing the instruction decode phase that might produce holes in the instruction pipeline. The allocator dispatches the decoded instructions, "micro operations", to the appropriate µop queue, one for memory operations, another for integer and floating-point operations.</p>
      <p>Two integer Arithmetic/Logical Units are kept simple in order to be able to run them at twice the clock speed. In addition there is an ALU for complex integer operations that cannot be executed within one cycle. There is only one floating-point functional unit that delivers one result per cycle. However, besides the normal Floating-point Unit, there are also additional units that execute the Streaming SIMD Extensions 2 (SSE2) repertoire of instructions, a 144-member instruction set that is especially meant for multimedia and 3-D visualisation applications. The length of the operands for these units is 128 bits. The Intel compilers have the ability to address the SSE2 units. This makes it possible in principle to achieve a two times higher floating-point performance.</p>
      <p>The primary cache is quite small by today's standards: 8 KB. This is again to accommodate the high clock speed. With this size of cache it is possible to have a latency of two cycles for the cache, where it was 3 cycles in the Pentium III.
The secondary cache has a size of 256 KB and has a wide 256-bit bus, which amounts to a bandwidth of 54.4 GB/s. The memory bandwidth has also improved significantly over that of the Pentium III: although the bus cycle frequency is 133 MHz, four transactions per cycle can be done, making it effectively a 533 MHz bus. This should give quite an improvement for codes that cannot be kept in cache.<br />Much will depend on the availability of compilers that are able to take advantage of all the facilities present in the P4 processor. If such compilers are available, the processor could form a good basis for any HPC platform.</p>
   </ors:description>
</ors:processor>
<!-- ======================================================================== -->
<ors:processor xmlns:ors="http://www.hoise.com/ors/0.1" xmlns:top500="http://www.hoise.com/vmp/top500/1.0" xmlns:vmp="http://www.hoise.com/vmp/manual/0.1" id="power4">
   <ors:name>IBM POWER4</ors:name>
   <ors:block-diagram href="pwr4-2.jpg">Block diagram of the POWER4 processor core</ors:block-diagram>
   <ors:chip-layout href="pwr4-1.jpg">Diagram of the IBM POWER4 chip layout</ors:chip-layout>
   <ors:description>
      <p>In the newest IBM SP systems the nodes contain the POWER4 chip, the latest variant of the RS/6000 family of processors. At the time of writing, the clock frequency of the POWER4 is 1.3 GHz. The chip size has become so large (or rather the feature size has become so small) that IBM now places two processor cores on one chip, as shown in Figure <a href="#pwr4-1fig">10a</a>. The chip also harbours 1.5 MB of secondary cache divided over three modules of 0.5 MB each.</p>
      <p>The L2 cache modules are connected to the processors by the Core Interface Unit (CIU) switch, a 2 x 3 crossbar with a bandwidth of 40 B/cycle per port. This makes it possible to ship 32 B to either the L1 instruction cache or the data cache of each of the processors and to store 8 B values at the same time.
Also, for each processor there is a Non-cacheable Unit that interfaces with the Fabric Controller and that takes care of non-cacheable operations. The Fabric Controller is responsible for the communication with three other chips that are embedded in the same Multi Chip Module (MCM), to L3 cache, and to other MCMs. The bandwidths at 1.3 GHz are 10.4, 6.9, and 5.2 GB/s, respectively. The chip contains a variety of further devices: the L3 cache directory and the L3 and Memory Controller, which should bring down the off-chip latency considerably, and the GX Controller that is responsible for the traffic on the GX bus. This bus transports data to/from the system and in practice is used for I/O. Some of the integrated devices, like the Performance Monitor and the logic for error detection and logging, are not shown in Figure <a href="#pwr4-1fig">10a</a>.</p>
      <p>A block diagram of the processor core is shown in Figure <a href="#pwr4-2fig">10b</a>.</p>
      <p>In many ways the POWER4 processor core is similar to the former POWER3 processor, but there are 2 integer functional units (called Fixed Point Units by IBM) instead of 3 and, instead of a fused Branch/Dispatch Unit, the POWER4 core has a separate Branch and Conditional Register Unit, giving 8 execution units in all. Oddly, the instruction cache is two times larger than the data cache (64 KB direct-mapped vs. 32 KB two-way set-associative, respectively). All execution units have instruction queues associated with them that enable the out-of-order processing of up to 200 instructions in various stages. Having so many instructions simultaneously in flight calls for very sophisticated branch prediction facilities. Instructions are fetched from the Instruction Cache under control of the Instruction Fetch Address Register, which in turn is influenced by the branch predict logic.
This consists of a local and a global Branch History Table (BHT), each with 16 K entries, and a so-called selector table which keeps track of which of the BHTs has functioned best in a particular case in order to select the prediction priority of the BHTs for similar cases coming up.</p>
      <p>Unlike in the POWER3, the fixed point units perform integer arithmetic operations that can complete in one cycle as well as multi-cycle operations like integer multiply and divide. There are no separate floating-point units for operations that require many cycles like divisions and square roots. All floating-point operations are taken care of in the FP units and, like in the HP PA-8700, there is an instruction to accommodate the <tt>axpy</tt> operation, called Fused Multiply Add (FMA) by IBM, which can deliver 2 floating-point results every cycle. With the two FP units in each core this brings the theoretical peak performance to 5.2 Gflop/s per core at the current clock frequency. Like in the HP processor, the composition of the floating-point operations should be such that the units indeed have enough FMAs to perform, otherwise the performance drops by a factor of 2.</p>
      <p>Although the dual core version of the chip, which is positioned for general processing, is described here, a single core version is also marketed that is recommended for HPC use. The reason is that in this case the bandwidth from the L2 cache does not have to be shared between the CPUs and a contention-free transfer of up to 83.2 GB/s can be achieved, while in the dual core version a peak bandwidth of 124.8 GB/s is to be shared between both CPUs.</p>
      <p>It is interesting to see that presently three vendors (AMD, Compaq, and IBM) have facilities that enable glueless coupling of processors, although the packaging and implementation are somewhat different. All implementations allow for low-latency SMP nodes with a considerable number of processors, stimulating the trend to build parallel systems based on SMP nodes.
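</p>
      <p>The L2 cache bandwidth figures quoted above for the single core and dual core versions of the POWER4 can be related to the 1.3 GHz clock frequency with a short calculation. The Python sketch below is only an illustration: the function name <tt>bytes_per_cycle</tt> and the variable names are hypothetical, and the equal sharing of the bandwidth between the two cores is an assumption made here, not a statement from the vendor.</p>
<pre>
```python
def bytes_per_cycle(bandwidth_gb_s, clock_ghz):
    """Convert a bandwidth in GB/s into bytes moved per clock cycle."""
    return bandwidth_gb_s / clock_ghz


CLOCK_GHZ = 1.3         # POWER4 clock frequency quoted in the text
SINGLE_CORE_BW = 83.2   # GB/s, contention-free, single core version
DUAL_CORE_BW = 124.8    # GB/s, shared by both cores in the dual core version

# Round to whole bytes to absorb floating-point noise in the division.
single_bpc = round(bytes_per_cycle(SINGLE_CORE_BW, CLOCK_GHZ))
dual_bpc = round(bytes_per_cycle(DUAL_CORE_BW, CLOCK_GHZ))

# Assumption for illustration only: both cores draw on the L2 cache equally.
per_core_share = DUAL_CORE_BW / 2

print(single_bpc, dual_bpc)
print(f"{per_core_share:.1f}")
```
</pre>
      <p>Under the equal-sharing assumption each core of the dual core chip would see at most 62.4 GB/s, against 83.2 GB/s for the single core version, which is consistent with the recommendation of the single core chip for HPC use.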
</p>
   </ors:description>
</ors:processor>
<!-- ======================================================================== -->
<ors:processor id="r14000">
   <ors:name>MIPS R14000A</ors:name>
   <ors:block-diagram href="r14k.jpg">Block diagram of the MIPS R14000 processor</ors:block-diagram>
   <ors:chip-layout href=""/>
   <ors:description>
      <p>The essentials of the MIPS R1<i>x</i>000 series of processors have not changed since the introduction of the first in this family, the R10000. The current processor that is at the heart of the SGI Origin3000 series is the R14000A. The R14000A is similar to the preceding R14000 except for the clock frequency: this is presently 600 MHz and as such the lowest of all RISC processors employed in High Performance systems. A block diagram of this processor is given in Figure <a href="#r14000fig">13</a>.</p>
      <p>The R14000 is a typical representative of the modern RISC processors that are capable of out-of-order and speculative instruction execution. Like in the Compaq Alpha processor there are two independent floating-point units for addition and multiplication and, additionally, two units that perform floating division and square root operations (not shown in Figure <a href="#r14000fig">13</a>). The latter, however, are not pipelined and, with latencies of about 20--30 cycles, are relatively slow. In all there are 5 pipelined functional units to be fed: an address calculation unit which is responsible for address calculations and loading/storing of data and instructions, two ALU units for general integer computation, and the floating-point add and multiply pipes already mentioned.</p>
      <p>The level 1 instruction and data caches have a moderate size of 32 KB and are 2-way set-associative. In contrast, the secondary cache can be very large: up to 16 MB.
Both the integer and the floating-point registers have a physical size of 64 entries; however, only 32 of them are accessible by software while the other half is under direct CPU control for register re-mapping.</p>
      <p>The clock frequency of the MIPS R1x000 processors has always been on the low side. The first R10000 appeared at a frequency of 180 MHz while in the new R14000A the clock frequency is 600 MHz and will rise slightly during its lifetime. With the initial 600 MHz frequency the theoretical peak performance is 1.2 Gflop/s. Because of the independent floating-point units without fused multiply-add capabilities, often a fair fraction of that speed can be realised. Some improvements have also been made with respect to the earlier chips: the bus speed has been doubled from 100 MB/s to 200 MB/s and the L1 cache that ran at 2/3 speed in the predecessor R12000 has been sped up to full speed in the R14000A.</p>
      <p>The R14000A is built in advanced 0.13 µm technology and at the present 600 MHz clock frequency it has an extremely low power consumption: only 17 Watt, several factors lower than that of the other processors discussed here. SGI intentionally keeps the clock frequency as low as possible to make it possible to build "dense" systems that can accommodate a large number of processors in a small volume.</p>
      <p>An R16000 successor is planned for next year that will be a shrunken version of the R14000 made in 0.11 µm technology and with a clock frequency of 700 MHz. In the current plans it seems that SGI will stay with the MIPS processors (along with systems with Itanium processors, like most vendors). An R18000 will in all probability become available in 2004 both as dual and single core chips, while even an R20000 is envisioned around 2005 that would double the number of floating-point units to four per processor core.
</p>
   </ors:description>
</ors:processor>
<!-- ======================================================================== -->
<ors:processor id="sparcIII">
   <ors:name>Sun UltraSPARC III</ors:name>
   <ors:block-diagram href="sparc3.jpg">Block diagram of the UltraSPARC III processor</ors:block-diagram>
   <ors:chip-layout href=""/>
   <ors:description>
      <p>The UltraSPARC-III is the third generation of the UltraSPARC family and, as one of the last RISC processor families to do so, offers full 64-bit precision and addressing range. It is built in 0.18 µm CMOS technology at a clock frequency that is currently 900 MHz. It is a complete revamp of earlier UltraSPARC designs but backward compatible with these older processors. UltraSPARCs are used in all SUN products from workstations to the heavy E10000 servers and also in Fujitsu products like the AP-3000. We show a block diagram of the UltraSPARC-III in Figure <a href="#sparcIIIfig">14</a>.</p>
      <p>The chip is characterised by a large number of caches of various sorts, as can be seen in the figure. Apart from a 4-way set-associative cache of 64 KB, the Data Cache Unit (DCU) also contains a write cache and a pre-fetch cache, both of 2 KB. The pre-fetch cache is independent of the data cache and can load data when this is deemed appropriate. The write cache defers writes to the L2 cache and so may avoid unnecessary writes of individual bytes until entire cache lines have to be updated. The Instruction Issue Unit (IIU) contains the 32 KB 4-way set-associative instruction cache together with the instruction TLB, which is called the Instruction translation buffer in SUN's terminology. The IIU also contains a so-called miss queue that holds instructions that are immediately available for the execute units when a branch has been mis-predicted. Branch prediction is fully static in the UltraSPARC-III. It is implemented as a 16 KB table in the IIU that is pipelined because of its size.
</p>
      <p>The Integer Execute Unit (IEU) has two Add/Logical Units and a branch unit. Integer adds and multiplies are pipelined but the divide operation is not. It is performed by an Arithmetic Special Unit (not shown in the figure) that does not burden the pipelines for the ALUs. The integer register file is effectively divided into two parts, called the Working and Architectural Register Files by SUN. Operands are accessed and results stored in the working registers. When an exception occurs, the results to be undone in the working registers are overwritten by those from the architectural file.</p>
      <p>The floating-point unit (FPU) has two independent pipelined units for addition and multiplication and a non-pipelined unit for floating division and square-root computation, which require in the order of 20--25 cycles. The FPU also contains graphics hardware (not shown in Figure <a href="#sparcIIIfig">14</a>) that shares the pipelined adder and multiplier with general 64-bit calculations. For the chips delivered at 900 MHz, the theoretical peak performance is 1.8 Gflop/s. It is expected that the UltraSPARC-III technology can be shrunk to reach a clock frequency of 1 GHz by the end of its life cycle.</p>
      <p>The memory controller and the L2 cache controller together with the L2 cache tags are all housed on the chip in the External Memory Unit. This shortens the latency of accesses to both memory levels. In addition, both controllers communicate with the System Interface Unit (SIU), also on-chip, to keep in touch with the snoop pipe controller in the SIU. The processor has been built with multi-processing in mind and the snoop controller keeps track of data requests in the whole system to ensure coherency of the caches when required.</p>
      <p>The UltraSPARC-III has been around for about a year at the time of writing, and the clock frequency has gone up in that period from 750 to 900 MHz.
The next generation will take some time (about a year) to appear and, after the radical redesign in the present generation, will have most of the same characteristics as the current one.</p>
   </ors:description>
</ors:processor>
<!-- ======================================================================== -->
</ors:list-of-processors>
