<ors:list-of-systems  xmlns:ors="http://www.hoise.com/ors/0.1" id="existing"  >
<!-- ======================================================================== -->
<ors:system id="ap3000">
   <ors:name>The Fujitsu AP3000</ors:name>
   <ors:machine-type>RISC-based distributed-memory multi-processor</ors:machine-type>
   <weg>AP3000</weg>
   <ors:operating-system>Cell OS (transparent to the user) and Solaris (Sun's Unix variant)</ors:operating-system>
   <ors:connection-structure>2-D torus</ors:connection-structure>
   <ors:compilers>Parallel Fortran/AP, Fortran 90, HPF, C, C++.</ors:compilers>
   <ors:vendor-website/>
   <ors:year-of-introduction>1996</ors:year-of-introduction>
   <ors:models>
      <ors:model>
         <ors:name>AP3000</ors:name>
         <ors:clock-cycle unit="MHz">300</ors:clock-cycle>
         <ors:processor-performance unit="Mflop/s">600</ors:processor-performance>
         <ors:peak-performance unit="Gflop/s">614</ors:peak-performance>
         <ors:memory-maximal unit="Tbyte">2</ors:memory-maximal>
         <ors:number-of-processors>
            <ors:min>4</ors:min>
            <ors:max>1024</ors:max>
         </ors:number-of-processors>
      </ors:model>
   </ors:models>
   <ors:remarks>
      <p>The AP3000 is the successor of the earlier AP1000 system. Although the name could suggest otherwise, few characteristics of the AP1000 have been retained except that Sun Sparc processors are used in the nodes. Unlike the former system, no front-end processor is required anymore.</p>
      <p>The communication network has also been simplified considerably with respect to that in the earlier model: where three different networks were present in the AP1000 (see <a href="references.html#Hori91">[15]</a>), in the AP3000 the nodes are connected in a 2-D torus structure with a bi-directional bandwidth of 200 MB/s.
The maximum amount of memory is large: a full 1024-node system can accommodate 2 TB.</p>
      <p>Another difference with the AP1000 system is that the fastest nodes (the U300 nodes described here) can have either 1 or 2 CPUs as opposed to only one CPU in the AP1000. The two CPUs share the on-board memory.</p>
      <p>The available software for the AP3000 is extensive: Parallel Fortran/AP is a Fortran 77 with extensions that offers a shared-memory-like programming model for the system. In addition, HPF is available and the machine can also be used with a message passing model, as customised MPI/AP and PVM/AP are offered. As sequential languages to be used with the message passing libraries, Fortran 90, C and C++ are available.</p>
      <p>The current motto on Fujitsu's English home page reads "The possibilities are infinite". This certainly is true when looking for relevant information on this system: when following the links for these machines one ends up on unreadable Japanese pages from which it is difficult to find one's way back.</p>
   </ors:remarks>
   <ors:measured-performance>
      <p>The system was announced in March 1996 and installations have been done in Japan, at the University of Singapore, and at the Australian National University, but as yet no performance figures have been published.
Although thetheoretical bandwidth is 200 MB/s, the best measured bandwidth withMPI as given by Fujitsu is 88 MB/s with a latency of 12µs.</p>       </ors:measured-performance></ors:system><!-- ======================================================================== --><ors:system id="cdac">   <ors:name>The C-DAC Param 10000 OpenFrame</ors:name>   <ors:machine-type>RISC-based distributed memory multi-processor.</ors:machine-type>   <weg>C-DAC Param 10000 OpenFrame.</weg>   <ors:operating-system>SunOS, Sun's Unix flavour</ors:operating-system>   <ors:connection-structure>Variable (see remarks)</ors:connection-structure>   <ors:compilers>Fortran 77/90, C, C++</ors:compilers>   <ors:vendor-website>http://www.cdacindia.com/html/openframe.htm</ors:vendor-website>   <ors:year-of-introduction>2000</ors:year-of-introduction>   <ors:models>      <ors:model>         <ors:name>C-DAC Param 10000 OpenFrame</ors:name>         <ors:clock-cycle unit="MHz">400</ors:clock-cycle>         <ors:processor-performance unit="Mflop/s">800</ors:processor-performance>         <ors:peak-performance unit="Mflop/s">-</ors:peak-performance>         <ors:memory-maximal unit="Gbyte">1</ors:memory-maximal>         <ors:number-of-processors>            <ors:min/>            <ors:max/>         </ors:number-of-processors>      </ors:model>   </ors:models>   <ors:remarks>      <p>The PARAM systems are highly variable machines to such an extentthat they almost can be regarded as clusters. However, CDAC hasdeveloped its own communication network and optimised MPI whichgives it the flavour of an integrated parallel machine. The maximumnumber of processors is unclear although the information on thevendor's web page suggests that systems with a peak performance ofup to a Tflop/s could be delivered. As the basic processorpresently is the UltraSPARC II processor with a clock frequency of400 MHz, this would amount to systems with more than 1200processors. 
Such systems could communicate through CDAC's own PARAMNet at a peak bandwidth of 50 MB/s bi-directional, Myrinet at 160 MB/s, ATM at 155/622 Mb/s, or Fast Ethernet. Unfortunately, there is no firm information about the structure of PARAMNet. These different possibilities stress the cluster character of the PARAM systems or, as CDAC expresses it, the OpenFrame policy for its systems.<br/>CDAC is not new in the High Performance Computing business, which shows in the software that is available for the PARAM machines. Apart from CDAC's MPI, the following are offered: KSHIPRA, a lightweight, low-latency communication layer based on Berkeley's Active Messages-II; performance profilers; and a parallel debugger, along with a complete compiler set with Fortran 77/90, C, and C++.</p>
      <p>The PARAM machines have been mostly sold on the internal Indian market where more than 20 systems have been installed, mostly with 8 processors. However, since July 2000 a system has been in place at the Russian National Academy of Sciences in Moscow in a collaboration project between India and Russia to develop parallel applications in the area of structural analysis and Computational Fluid Dynamics.</p>
   </ors:remarks>
   <ors:measured-performance>
      <p>No performance measurements of PARAM 10000 systems are available at all, although one would presume that they will not be very different from other UltraSPARC II-based systems using MPI for parallelisation.</p>
   </ors:measured-performance>
</ors:system>
<!-- ======================================================================== -->
<ors:system id="cenju-4">
   <ors:name>The NEC Cenju-4</ors:name>
   <ors:machine-type>RISC-based distributed-memory multi-processor.</ors:machine-type>
   <weg>Cenju-4.</weg>
   <ors:operating-system>Cenjuiox (Mach micro-kernel based Unix flavour).</ors:operating-system>
   <ors:connection-structure>Multi-stage crossbar.</ors:connection-structure>
   <ors:compilers>Fortran 77, Fortran 90, HPF (subset), ANSI 
C.</ors:compilers>
   <ors:vendor-website>http://kiefer.gmd.de:8002/popcorn/services/Overview.html</ors:vendor-website>
   <ors:year-of-introduction>1998</ors:year-of-introduction>
   <ors:models>
      <ors:model>
         <ors:name>Cenju-4</ors:name>
         <ors:clock-cycle unit="ns">5</ors:clock-cycle>
         <ors:processor-performance unit="Mflop/s">400</ors:processor-performance>
         <ors:peak-performance unit="Gflop/s">410</ors:peak-performance>
         <ors:memory-maximal unit="Gbyte">512</ors:memory-maximal>
         <ors:number-of-processors>
            <ors:min>8</ors:min>
            <ors:max>1024</ors:max>
         </ors:number-of-processors>
      </ors:model>
   </ors:models>
   <ors:remarks>
      <p>The name Cenju-4 suggests that there have been predecessors, Cenju-1, Cenju-2, and Cenju-3. This is indeed the case but the first two systems have only been used internally by NEC for research purposes and were never officially marketed. The Cenju-3 was also placed externally but, again, mostly for evaluation purposes. The same is the case for the present Cenju-4: it is not actively marketed, although NEC will have no objections to selling it. Officially, the Cenju series is regarded by NEC as a line of systems to gain experience in massively parallel computing and to develop the proper tools for it.</p>
      <p>The Cenju-4 is based on the MIPS R10000 RISC processor. All processors have, apart from their on-chip 32 KB primary data and instruction cache, a secondary cache of 1 MB to mitigate the problems that arise from the high data usage of the CPU.</p>
      <p>The interconnection type used in the Cenju is a multistage crossbar built from 4×4 modules that are pipelined. So, in a full configuration the maximal number of levels in the crossbar to be traversed is six. The peak transfer rate of the crossbar is quoted as 200 MB/s irrespective of the data placement.
Preliminarymeasurements of the author of this report show that the practicaltransfer rate for point-to-point communication is at least 175 MB/swith MPI; a quite high efficiency.</p>      <p>The system needs a front-end processor of the EWS4800 type(functionally equivalent to Silicon Graphics workstations) of SUN.The I/O requirements have to be fulfilled by the front-end systemas the Cenju does not have local (distributed) I/Ocapabilities.</p>      <p>There is some software support that should make the programmer'slife somewhat easier. The library PARALIB/CJ contains proprietaryfunctions for forking processes, barrier synchronisation, remoteprocedure calls, and block transfer of data. Like on the <a href="t3e.html#t3e">Cray T3E</a>, the Hitachi <a href="sr8000.html#sr8000">SR8000</a>, and on the former <a href="gone.html#cs-2">Meiko CS-2</a> the programmer has thepossibility to write/read directly to/from non-local memories whichavoids much message passing overhead.</p>     </ors:remarks>   <ors:measured-performance>      <p>: No systematic performancemeasurement have been done yet on the Cenju-4. However, fromcomparative studies it seems that the speed on some applications ispresently about 2/3 of an equivalent SGI R10000 node due to adifferent compiler technology (<a href="references.html#CaSt98">[3]</a>). 
Nagel reports a speed of90-100 Mflop/s for in-cache matrix-matrix multiplication in Fortran90 per node (<a href="references.html#Nagel98">[21]</a>).</p>     </ors:measured-performance></ors:system><!-- ======================================================================== --><ors:system id="compaqsc">   <ors:name>The Compaq AlphaServer SC</ors:name>   <ors:machine-type>RISC-based SMP-clustered DM-MIMD system.</ors:machine-type>   <weg>AlphaServer SC</weg>   <ors:operating-system>Tru64 Unix (Compaq's flavour of Unix)</ors:operating-system>   <ors:connection-structure>Fat Tree</ors:connection-structure>   <ors:compilers>Fortran 77, HPF, C, C++</ors:compilers>   <ors:vendor-website>http://www.compaq.com/hpc/systems/sys_sc.html</ors:vendor-website>   <ors:year-of-introduction>1999</ors:year-of-introduction>   <ors:models>      <ors:model>         <ors:name>AlphaServer SC</ors:name>         <ors:clock-cycle unit="GHz">1</ors:clock-cycle>         <ors:processor-performance unit="Gflop/s">2</ors:processor-performance>         <ors:peak-performance unit="Tflop/s">8</ors:peak-performance>         <ors:memory-maximal unit="Tbyte">8</ors:memory-maximal>         <ors:number-of-processors>            <ors:min>1</ors:min>            <ors:max>4096</ors:max>         </ors:number-of-processors>      </ors:model>   </ors:models>   <ors:remarks>      <p>The AlphaServer SC is the very high end of Compaq's AlphaServerline (SC stands for SuperComputer). The system is typical for thepresent development of SMP-based clustered systems. In the SCsystem the basic SMP node is the Compaq ES45, a 4-CPU SMP systemwith the Alpha 21264a (EV68) as its processor. The clock rate is 1ns. The SMP node has a crossbar as its internal network with anaggregate bandwidth of 5.2 GB/s (1.33 GB/s/processor). 
This issufficient to deliver 1.33 byte/clock cycle to each processor inthe node simultaneously.</p>      <p>Within a node the system is a shared memory machine that allowsfor shared-memory parallel processing, for instance by using<tt>OpenMP</tt>. When more than four processors are required, onehas to use a message passing programming model like MPI, PVM, orHPF (Compaq is one of the few companies that still provides its ownHPF compiler).</p>      <p>For communication between the SMP nodes the SC uses QsNet, anetwork manufactured by QSW Limited. In fact QsNet is the follow-onof the network employed in the former Meiko CS-2 systems (seesection <a href="gone.html#cs-2">4</a>). The network has thestructure of a fat tree, is based on PCI technology, and has apoint-to-point bandwidth of 210 MB/s. Because of its fat treestructure the bandwidth in the upper level of the network is 340MB/s sustained. QSW claims a very low latency of 5 µs for MPImessages.</p>       </ors:remarks>   <ors:measured-performance>      <p>In <a href="references.html#Dong02">[6]</a> a performance of 4463 Gflop/sprocessors was reported solving a full linear system of order280,000 on a configuration with 4463 processors, an efficiency of73.8%. 
For a small system of order 1000 an efficiency of about 50% was measured using the EuroBen Benchmark (see <a href="references.html#EurB99">[7]</a>).</p>
   </ors:measured-performance>
</ors:system>
<!-- ======================================================================== -->
<ors:system id="compaqgs">
   <ors:name>The Compaq GS series</ors:name>
   <ors:machine-type>RISC-based SMP system.</ors:machine-type>
   <weg>AlphaServer GS80, GS160, GS320.</weg>
   <ors:operating-system>Tru64 Unix (Compaq's flavour of Unix).</ors:operating-system>
   <ors:connection-structure>Variable (see remarks)</ors:connection-structure>
   <ors:compilers>Fortran 77, Fortran 90, HPF, C, C++.</ors:compilers>
   <ors:vendor-website>http://www.digital.com/products/quickspecs/10643_na/10643_na.html</ors:vendor-website>
   <ors:year-of-introduction>1999</ors:year-of-introduction>
   <ors:models>
      <ors:model>
         <ors:name>GS80</ors:name>
         <ors:clock-cycle unit="GHz">1</ors:clock-cycle>
         <ors:processor-performance unit="Gflop/s">2</ors:processor-performance>
         <ors:peak-performance unit="Gflop/s">16</ors:peak-performance>
         <ors:memory-maximal unit="Gbyte" >64</ors:memory-maximal>
         <ors:number-of-processors>
            <ors:min></ors:min>
            <ors:max>8</ors:max>
         </ors:number-of-processors>
      </ors:model>
      <ors:model>
         <ors:name>GS160</ors:name>
         <ors:clock-cycle unit="GHz">1</ors:clock-cycle>
         <ors:processor-performance unit="Gflop/s">2</ors:processor-performance>
         <ors:peak-performance unit="Gflop/s">32</ors:peak-performance>
         <ors:memory-maximal unit="Gbyte">128</ors:memory-maximal>
         <ors:number-of-processors>
            <ors:min></ors:min>
            <ors:max>16</ors:max>
         </ors:number-of-processors>
      </ors:model>
      <ors:model>
         <ors:name>GS320</ors:name>
         <ors:clock-cycle unit="GHz">1</ors:clock-cycle>
         <ors:processor-performance unit="Gflop/s">2</ors:processor-performance>
         <ors:peak-performance unit="Gflop/s">64</ors:peak-performance>
         <ors:memory-maximal unit="Gbyte">156</ors:memory-maximal>
         <ors:number-of-processors>
            <ors:min></ors:min>
            <ors:max>32</ors:max>
         </ors:number-of-processors>
      </ors:model>
   </ors:models>
   <ors:remarks>
      <p>The GS series is a family of SMP servers built around what is currently the fastest Alpha 21264 processor available, at 1 GHz. The systems are built from "Quad Building Blocks" (QBBs), blocks of 4 processors. The GS80 can house 2 of these blocks, while the largest configuration, the GS320, has up to 32 processors in 8 QBBs. The processors in a QBB have access to the memory via a crossbar with an aggregate bandwidth of 7.0 GB/s. This means that for each individual processor the bandwidth is 1.75 GB/s or, at 1 GHz, slightly less than a quarter of an 8-byte operand per cycle. The QBBs are again connected by a crossbar with the same bandwidth, which amounts to an aggregate bandwidth of 57 GB/s for the largest GS configuration.<br/>Because of their SMP character, users can employ OpenMP for shared-memory parallelisation on the GS systems on up to 32 processors in the GS320. Of course MPI can also be used, along with the full range of Compaq compilers.</p>
   </ors:remarks>
   <ors:measured-performance>
      <p>In <a href="references.html#Dong02">[6]</a> a performance of 47.1 Gflop/s is given for a 32-processor GS320 system in solving a linear system of order 40,000, an efficiency of 73.5%. Moreover, ES40-based GS320s at a clock frequency of 731 MHz have been 2-way and 4-way clustered, which yielded speeds of 63.8 and 87.5 Gflop/s, respectively.
As the internode bandwidth of the clusters is markedly less, the efficiencies dropped accordingly to 68.2% and 46.7%, respectively.</p>
   </ors:measured-performance>
</ors:system>
<!-- ======================================================================== -->
<ors:system id="cray-sx6">
   <ors:name>The Cray SX-6</ors:name>
   <ors:machine-type/>
   <weg/>
   <ors:operating-system/>
   <ors:connection-structure/>
   <ors:compilers/>
   <ors:vendor-website/>
   <ors:year-of-introduction/>
   <ors:remarks>
      <p>The Cray SX-6 is in fact the NEC SX-6 as marketed by Cray in the USA. See the section on the <a href="sx-6.html">NEC SX-6</a> for the description.</p>
   </ors:remarks>
</ors:system>
<!-- ======================================================================== -->
<ors:system id="gamma-II">
   <ors:name>The Cambridge Parallel Processing Gamma II Plus</ors:name>
   <ors:machine-type>Processor array</ors:machine-type>
   <weg>Gamma II Plus 1000, Gamma II Plus 4000</weg>
   <ors:operating-system>Internal OS transparent to the user, Unix on front-end (DEC, HP, or Sun workstation; stand-alone for dedicated applications)</ors:operating-system>
   <ors:connection-structure>2-D mesh, row- and column datapaths (see remarks)</ors:connection-structure>
   <ors:compilers>FORTRAN-PLUS, C++ (see remarks)</ors:compilers>
   <ors:vendor-website>http://www</ors:vendor-website>
   <ors:year-of-introduction>1995</ors:year-of-introduction>
   <ors:models>
      <ors:model>
         <ors:name>Gamma II Plus 1000</ors:name>
         <ors:clock-cycle unit="MHz">30</ors:clock-cycle>
         <ors:processor-performance unit="Mflop/s" >0.6</ors:processor-performance>
         <ors:peak-performance unit="Gflop/s" >0.6</ors:peak-performance>
         <ors:memory-maximal unit="MB">128</ors:memory-maximal >
         <ors:number-of-processors>
            <ors:min/>
            <ors:max>1024</ors:max>
         </ors:number-of-processors>
      </ors:model>
      <ors:model>
         <ors:name>Gamma II Plus 4000</ors:name>
         <ors:clock-cycle 
unit="MHz">30</ors:clock-cycle>
         <ors:processor-performance unit="Mflop/s" >0.6</ors:processor-performance>
         <ors:peak-performance unit="Gflop/s" >2.4</ors:peak-performance>
         <ors:memory-maximal unit="MB">512</ors:memory-maximal >
         <ors:number-of-processors>
            <ors:min/>
            <ors:max>4096</ors:max>
         </ors:number-of-processors>
      </ors:model>
   </ors:models>
   <ors:remarks>
      <p>In November 1995 the new Gamma II Plus models were announced by CPP. In essence there is not much difference with its predecessor, the DAP Gamma. However, the clock frequency has tripled, to a cycle time of 33 ns, with an equivalent rise in the peak performance of the systems.</p>
      <p>The Gamma II is presented as the fourth generation of this type of machine. Indeed, the macro architecture of the systems has hardly changed since the first ICL DAP (the first generation of this system) was conceived. As in the ICL DAP, in the Gamma 1000 models the 1024 processors are ordered in a 32 x 32 array, while the Gamma 4000 has 4096 processors arranged in a 64 x 64 square.</p>
      <p>The systems are able to operate byte parallel on appropriate operands to speed up floating-point operations. For logical operations, however, bit-wise operations are possible, which makes the machines quite fast in this respect. As the byte parallel code consists of separate sequences of microcode instructions, the bit processor plane and the byte processor plane are in fact independent and can work in parallel. This is also the case for I/O operations. Character handling can also be done very efficiently. This is the reason that Gamma systems are often used for full text searches.</p>
      <p>As in all processor-array machines, the control processor (called the Master Control Unit (MCU) in the Gamma II) has a separate memory to hold program instructions while the data are held in the data memory associated with each Processing Element (PE) in the processor array.
So, for a Gamma 1000 with 128 MB of data memory each PE has 128 KB of data memory directly associated with it. To access data in other PEs' memories these must be brought up to the data routing plane and shifted to the appropriate processor.</p>
      <p>As already mentioned under the heading of the connection structure, there are two ways of connecting the PEs. One is the 2-D mesh that connects each element to its North, East, West, and South neighbours. In addition there are row- and column data paths that enable the fast broadcast of a row or column to an entire matrix by replication. Conversely, they can be used for row- or column-wise reduction of matrix objects into a column or row vector of results from, e.g., a summing or maximum operation.</p>
      <p>Separate I/O processors and disk systems can be attached to the Gamma directly, thus not burdening the front-end machine (and the connection between front-end and Gamma-II) with I/O operations and unnecessary data transport. One of these I/O devices is the GIOC, which can transport data to the data memory at a sustained rate of 80 MB/s, transposing the data to the vertical storage mode of the data memory on the fly. Also, a direct video interface is available to operate a frame buffer.</p>
      <p>A nice (non-standard) feature of the FORTRAN-PLUS compiler is the possibility to use logical matrices as indexing objects for computational matrix objects. This enables a very compact notation for conditional execution on the processor array. In addition, C++ has been available since 1997.</p>
   </ors:remarks>
   <ors:measured-performance>
      <p>In <a href="references.html#Flan91">[8]</a> the speed of matrix multiplication on various DAP models (precursors of the Gamma systems) is analyzed. The documentation states a 32-bit floating-point add speed of 1.68 Gflop/s on 4096 PEs, while a 32-bit complex FFT of length 1,024 would attain 2.49 Gflop/s.
No independentperformance figures for the Gamma II systems are available.</p>       </ors:measured-performance></ors:system><!-- ======================================================================== --><ors:system id="mta-2">   <ors:name>The Cray Inc</ors:name>   <ors:machine-type>Distributed-memory multi-processor</ors:machine-type>   <weg>MTA-2x, x = 16, 32,...,256</weg>   <ors:operating-system>Unix BSD4.4 + proprietary micro kernel</ors:operating-system>   <ors:connection-structure>Fortran 77/90, ANSI C, C++</ors:connection-structure>   <ors:compilers>http://www.cray.com/products/systems/craymta/</ors:compilers>   <ors:vendor-website/>   <ors:year-of-introduction/>   <ors:models>      <ors:model>         <ors:name>MTA-2x</ors:name>         <ors:clock-cycle unit=""/>         <ors:processor-performance unit="Mflop/s">750</ors:processor-performance>         <ors:peak-performance unit="Gflops">192</ors:peak-performance>         <ors:memory-maximal unit="Tbyte">1</ors:memory-maximal>         <ors:number-of-processors>            <ors:min>16</ors:min>            <ors:max>256</ors:max>         </ors:number-of-processors>      </ors:model>   </ors:models>   <ors:remarks>      <p>The exact peak speed of the MTA-2 systems cannot be given asthis new CMOS version is yet to be delivered (see below forperformances of the first GaAs-based MTA-1 machine). The datasheets on the Cray MTA-2 are not overly informative in this respectbut a lower bound of the peak performance is quoted.</p>      <p>Although the memory in the MTA is physically distributed, thesystem is emphatically presented as a shared memory machine (withnon-uniform access time). The latency incurred in memory referencesis hidden by <i>multi-threading</i>, i.e., usually many concurrentprogram threads (instruction streams) may be active at any time.
Therefore, when for instance a load instruction cannot be satisfied because of memory latency, the thread requesting this operation is stalled and another thread that has an operation ready is switched into execution. This switching between program threads only takes 1 cycle. As there may be up to 128 instruction streams and 8 memory references can be issued without waiting for preceding ones, a latency of 1024 cycles can be tolerated. References that are stalled are retried from a retry pool. A construction that works out similarly is to be found in the Stern Computing Systems <a href="gone.html#stern">SSP</a> machines.</p>
      <p>The connection network connects a 3-D cube of <i>p</i> processors with sides of <i>p</i><sup>1/3</sup>, in which alternately the <i>x</i>- or <i>y</i>-axes are connected. Therefore, all nodes connect to four out of six neighbours. In a <i>p</i>-processor system the worst case latency is 4.5<i>p</i><sup>1/3</sup> cycles; the average latency is 2.25<i>p</i><sup>1/3</sup> cycles. Furthermore, there is an I/O port at every node. Each network port is capable of sending and receiving a 64-bit word per cycle, which amounts to a bandwidth of 5.33 GB/s per port. In case of detected failures, ports in the network can be bypassed without interrupting operations of the system.</p>
      <p>Although the MTA should be able to run "dusty-deck" Fortran programs because parallelism is automatically exploited as soon as an opportunity is detected for multi-threading, it may be (and often is) worthwhile to explicitly control the parallelism in the program and to take advantage of known data locality occurrences. MTA provides handles for this in the form of library routines, including synchronisation, barrier, and reduction operations on defined groups of threads. Controlled and uncontrolled parallelism approaches may be freely mixed.
Furthermore, each variable has afull/empty bit associated with it which can be used to controlparallelism and synchronisation with almost zero overhead.<br/>A first MTA-2 system with 28 processors (instead of the normal 32)will be installed at the Naval Research Lab, USA, in 2002.</p>       </ors:remarks>   <ors:measured-performance>      <p>: The company has presentlydelivered a 16-processor system to the San Diego SupercomputingCenter. This system runs at a clock cycle of 4.4 ns instead of theplanned 3 ns. Consequently, the peak performance of a processor is450 Mflop/s. Using the <a href="http://www.euroben.nl">EuroBenBenchmark</a> a performance of 388 Mflop/s out of 450 Mflop/s wasfound for an order 800 matrix-vector multiplication, an efficiencyof 86%. For 1-D FFTs up to 1 million elements a speed of 106Mflop/s was found on 1 processor and the about the same speed on 4processors due to an insufficient availability of parallelthreads.</p>       </ors:measured-performance></ors:system><!-- ======================================================================== --><ors:system id="p690">   <ors:name/>   <ors:machine-type>RISC-based distributed-memory multi-processor cluster</ors:machine-type>   <weg>IBM eServer p690.</weg>   <ors:operating-system>AIX (IBMs Unix variant)</ors:operating-system>   <ors:connection-structure>-switch</ors:connection-structure>   <ors:compilers>XL Fortran (Fortran 90), HPF, XL C, C++</ors:compilers>   <ors:vendor-website>http://www-1.ibm.com/servers/eserver/pseries/hardware/datactr/p690_desc.html</ors:vendor-website>   <ors:year-of-introduction>2001 (16/32-CPU POWER4 SMP)</ors:year-of-introduction>   <ors:models>      <ors:model>         <ors:name>eServer p690</ors:name>         <ors:clock-cycle unit="GHz">1.3</ors:clock-cycle>         <ors:processor-performance unit="Gflop/s">5.2</ors:processor-performance>         <ors:peak-performance unit="Gflop/s">166.4</ors:peak-performance>         
         <ors:memory-maximal unit="Tbyte">128</ors:memory-maximal>
         <ors:number-of-processors>
            <ors:min>8</ors:min>
            <ors:max>16384</ors:max>
         </ors:number-of-processors>
      </ors:model>
   </ors:models>
   <ors:remarks>
      <p>The eServer p690 is the successor of the RS/6000 SP. It retains much of the macro structure of this system: multi-CPU nodes are connected within a frame either by a dedicated switch or by other means, like switched Ethernet. The structure of the nodes, however, has changed considerably; see the section on the POWER4 processor. Up to four Multichip Modules (MCMs) are housed in a node, totaling 16 or 32 CPUs in a node depending on whether the single- or dual-core version of the chip is used. For High Performance Computing IBM recommends employing the 16-CPU, single-core nodes because a higher effective bandwidth from the L2 cache can be expected in this case. For less data-intensive work that primarily uses the L1 cache the difference would be small, while there is a large cost advantage in using the 32-CPU so-called Turbo nodes.</p>
      <p>The p690 is accessed through a front-end control workstation that also monitors system failures. Failing nodes can be taken off line and exchanged without interrupting service.</p>
      <p>The so-called high-performance switch, the SP Switch2, is an Omega-switch as described in the section on <a href="sm-mimd.html">SM-MIMD systems</a> and, although we mentioned only the highest speed option for the communication, the high-performance switch, there is a wide range of other options that could be chosen instead: Ethernet, Token Ring, FDDI, etc., are all possible. The high-performance switch is the third generation of this interconnect. The single-direction bandwidth is quoted as 500 MB/s and tests with MPI-based point-to-point communication from the EuroBen Distributed memory benchmark have shown that one can come very close to this limit.</p>
      <p>Applications can be run using PVM or MPI.
Also High PerformanceFortran is supported, both a proprietary version and a compilerfrom the Portland Group. IBM uses its own PVM version from whichthe data format converter XDR has been stripped. This results in alower overhead at the cost of generality. Also the MPIimplementation, MPI-F, is optimised for the eServer p690 systems.As the nodes are in effect shared-memory SMP systems, within thenodes OpenMP can be employed for shared-memory parallelism and itcan be freely mixed with MPI if needed.</p>      <p>The standard commercial models that are marketed contain up to128 nodes. However, on special request systems with up to 512 nodescan be built. This largest configuration is used in the table above(although never a system of a size exceeding 128 nodes has beensold yet).</p>     </ors:remarks>   <ors:measured-performance>      <p>In <a href="references.html#Dong02">[6]</a> a performance of 2310Gflop/s for an 864 processor (54 HPC-node) system is reported forsolving a 275,000-order dense linear system yielding an efficiencyof 51%. A system with 8 Turbo nodes was reported to obtain a speedof 737 Gflop/s out of 1331 Gflop/s on a linear system of size285,000, an efficiency of 55%. 
As this type of application primarily operates from the L1 cache, the more or less similar efficiencies are as expected.</p>     </ors:measured-performance></ors:system><!-- ======================================================================== --><ors:system id="quadrics">   <ors:name>The Quadrics Apemille</ors:name>   <ors:machine-type>Processor array</ors:machine-type>   <weg>Quadrics Apemille.</weg>   <ors:operating-system>Internal OS transparent to the user, Unix on front-end (almost any Unix workstation)</ors:operating-system>   <ors:connection-structure>3-D mesh (see remarks)</ors:connection-structure>   <ors:compilers>TAO (see remarks)</ors:compilers>   <ors:vendor-website>http://www</ors:vendor-website>   <ors:year-of-introduction/>   <ors:models>      <ors:model>         <ors:name>Apemille</ors:name>         <ors:clock-cycle unit="MHz">267</ors:clock-cycle>         <ors:processor-performance unit="Mflop/s">533</ors:processor-performance>         <ors:peak-performance unit="Tflop/s">1</ors:peak-performance>         <ors:memory-maximal unit="Gbyte">64</ors:memory-maximal>         <ors:number-of-processors>            <ors:min>8</ors:min>            <ors:max>2048</ors:max>         </ors:number-of-processors>      </ors:model>   </ors:models>   <ors:remarks>      <p>The Apemille is a commercial spin-off of the APE-1000 project of the Italian National Institute for Nuclear Physics and a successor to the APE-100 systems. The systems are available in multiples of 8 processor nodes where up to 16 boards can be fitted into one crate or in multiples of 128 nodes by adding up to 15 crates to the minimal 1-crate system. The interconnection topology of the Quadrics is a 3-D grid with interconnections to the opposite sides (so, in effect a 3-D torus). The 8-node floating-point boards (FPBs) are plugged into the crate backplane which provides point-to-point communication and global control distribution. 
TheFPBs are configured a 2³ cubes that are connected to the otherboards appropriately to arrive at the 3-D grid structure.</p>      <p>The basic floating-point processor, the so-called MAD chip,contains a register file of 128 registers. Of these registers thefirst two hold permanently the values 0 and 1 to be able to expressany addition or multiplication as a ``normal operation'', i.e., acombined multiply-add operation, where an addition is of the form,<i>a</i>×<i>b</i>+0 and a multiplication is<i>a</i>×1+<i>b</i>. In favourable circumstances the processorcan therefore deliver two floating-point operations per cycle.Instructions are centrally issued by the controller at a rate ofone instruction every two clock cycles.</p>      <p>Communication is controlled by the Memory Controller and theCommunication Controller which are both housed on the backplane ofa crate. When the Memory Controller generates an address it isdecoded by the Communication Controller. In case non-local accessis desired, the Communication Controller will provide the necessarydata transmission. The memory bandwidth per processor is notdisclosed in the documentation, nor the bandwidth for non-localcommunication. Regrettably, Quadrics provides no details on localor global communication speeds whatsoever.</p>      <p>The Apemille communicates with the front-end system via a PCIadapter card and should therefore have a bandwidth of about 100MB/s. The actual speed is not specified, however. The interface canwrite and read the memories of the nodes and the Controller. I/Oand should have a bandwidth up to 8.5 GB/s according to thedocumentation.</p>      <p>The TAO language has several extensions to employ the SIMDfeatures of the Quadrics. Firstly, floating-point variables areassumed to be local to the processor that owns them, while integervariables are assumed to be global. Local variables can be promotedto global variables. 
Other extensions are the <tt>ANY</tt>, <tt>ALL</tt>, and <tt>WHERE</tt>/<tt>END WHERE</tt> keywords that can be used for global testing and control. Processors that do not meet a global condition effectively skip the operation(s) that are associated with it. For easy referencing of nearest-neighbour locations special constants <tt>LEFT</tt>, <tt>RIGHT</tt>, <tt>UP</tt>, <tt>DOWN</tt>, <tt>FRONT</tt>, and <tt>BACK</tt> are provided. In addition, new data types and operators on these data types are supported together with overloading of operators. This enables very concise code for certain types of calculations.</p>       </ors:remarks>    <ors:measured-performance>    <p>No measured performances have been reported for this machine.</p>   </ors:measured-performance> </ors:system><!-- ======================================================================== --><ors:system id="sr8000">   <ors:name>The Hitachi SR8000</ors:name>   <ors:machine-type>RISC-based distributed memory multi-processor</ors:machine-type>   <weg>SR8000, SR8000 E1, SR8000 F1, SR8000 G1.</weg>   <ors:operating-system>HI-UX/MPP (Micro kernel Mach 3.0)</ors:operating-system>   <ors:connection-structure>Multi-dimensional crossbar (see remarks)</ors:connection-structure>   <ors:compilers>Fortran 77, Fortran 90, Parallel Fortran, HPF, C, C++</ors:compilers>   <ors:vendor-website>http://www.hitachi.co.jp/Prod/comp/hpc/eng/sr81e.html</ors:vendor-website>   <ors:year-of-introduction>1998, E1 and F1: 1999, G1: 2000</ors:year-of-introduction>   <ors:models>      <ors:model>         <ors:name>SR8000</ors:name>         <ors:clock-cycle unit="MHz">250</ors:clock-cycle>         <ors:processor-performance unit="Gflop/s">8</ors:processor-performance>         <ors:peak-performance unit="Tflop/s">1</ors:peak-performance>         <ors:memory-maximal unit="Tbyte" >1</ors:memory-maximal>         <ors:number-of-processors>            <ors:min>4</ors:min>            <ors:max>128</ors:max>         
</ors:number-of-processors>      </ors:model>      <ors:model>         <ors:name>SR8000 E1</ors:name>         <ors:clock-cycle unit="MHz">300</ors:clock-cycle>         <ors:processor-performance unit="Gflop/s">9.6</ors:processor-performance>         <ors:peak-performance unit="Tflop/s">4.9</ors:peak-performance>         <ors:memory-maximal unit="Tbyte">8</ors:memory-maximal>         <ors:number-of-processors>            <ors:min>4</ors:min>            <ors:max>512</ors:max>         </ors:number-of-processors>      </ors:model>      <ors:model>         <ors:name>SR8000 F1</ors:name>         <ors:clock-cycle unit="MHz">375</ors:clock-cycle>         <ors:processor-performance unit="Gflop/s">12</ors:processor-performance>          <ors:peak-performance unit="Tflop/s">6.1</ors:peak-performance>         <ors:memory-maximal unit="Tbyte">8</ors:memory-maximal>         <ors:number-of-processors>            <ors:min>4</ors:min>            <ors:max>512</ors:max>         </ors:number-of-processors>      </ors:model>      <ors:model>         <ors:name>SR8000 G1</ors:name>         <ors:clock-cycle unit="MHz">450</ors:clock-cycle>         <ors:processor-performance unit="Gflop/s">14.4</ors:processor-performance>         <ors:peak-performance unit="Tflop/s">7.3</ors:peak-performance>         <ors:memory-maximal unit="Tbyte">8</ors:memory-maximal>         <ors:number-of-processors>            <ors:min>4</ors:min>            <ors:max>512</ors:max>         </ors:number-of-processors>      </ors:model>   </ors:models>   <ors:remarks>      <p>The SR8000 is the third generation of distributed-memory parallel systems of Hitachi. It is to replace both its direct predecessor, the SR2201, and the late top-end vector processor, the S-3800 (see <a href="gone.html">Systems Disappeared from the List</a>).</p>      <p>The basic node processor is a 2.22–4 ns clock PowerPC node with major enhancements made by Hitachi. 
E.g., hardware barrier synchronisation has been added, as well as the additions required for "Pseudo Vector Processing" (PVP). The latter means that for operations on long vectors one does not incur the detrimental effects of cache misses that often ruin the performance of RISC processors unless code is carefully blocked and unrolled. This facility was already available on the SR2201 and experiments have shown that this idea seems to work well (see <a href="references.html#Hisr2201">[13]</a>).</p>      <p>The peak performance per basic processor, or IP, can be attained with 2 simultaneous multiply/add instructions resulting in a speed of 1 Gflop/s on the SR8000. However, eight basic processors are coupled to form one processing node all addressing a common part of the memory. For the user this node is the basic computing entity with a peak speed of 8 Gflop/s. Hitachi refers to this node configuration as COMPAS, <b>C</b>o-<b>o</b>perative <b>M</b>icro-<b>P</b>rocessors in single <b>A</b>ddress <b>S</b>pace. In fact this is a kind of SMP clustering as discussed in the sections on <a href="architecture.html">the main architectural classes</a> and <a href="ccNUMA.html">ccNUMA machines</a>. A difference with most of these systems is that for the user the individual processors in a cluster node are not accessible. Every node also contains an SP, a system processor that performs system tasks and manages communication with other nodes and a range of I/O devices.</p>      <p>The SR8000 has a multi-dimensional crossbar with a bi-directional link speed of 1 GB/s. For 4–8 nodes the cross-section of the network is 1 hop. For configurations of 16–64 nodes it is 2 hops and from 128-node systems on it is 3 hops.</p>      <p>The E1 and F1 models are in almost every respect equal to the basic SR8000 model; however, the clock cycles for these models are 3.3 and 2.66 ns, respectively. 
Furthermore, the E1, F1, and G1 models can house twice the amount of memory per node and the maximum configurations can be extended to 512 processors, making them at the time of writing this report the most powerful commercially available systems, at least in theory. The Hitachi documentation quotes a bandwidth of 1.2 GB/s for the network in the E1 model while it is 1 GB/s for the basic SR8000 and the F1. By contrast, the G1 model has a bandwidth of 1.6 GB/s.</p>      <p>As in some other systems, such as the <a href="t3e.html#t3e">Cray T3E</a>, the <a href="compaqsc.html#compaqsc">AlphaServer SC</a>, and the late NEC Cenju-4, one is able to directly access the memories of remote processors. Together with the very fast hardware-based barrier synchronisation this should allow for writing distributed programs with very low parallelisation overhead.</p>      <p>The following software products will be supported in addition to those already mentioned above: PVM, MPI, PARMACS, Linda, and FORGE90. In addition, numerical libraries like NAG and IMSL are offered.</p>     </ors:remarks>   <ors:measured-performance>      <p>Results for all of the SR8000 types are available from <a href="references.html#Dong02">[6]</a>, of which we quote the most significant ones. On a 144-node G1 (450 MHz) configuration a speed of 1709 Gflop/s out of 2074 was observed, an efficiency of 82% for the solution of a full linear system of order 141,000. On a 112-node 375 MHz F1 model 1035 out of 1344 Gflop/s could be achieved, an efficiency of 77%. On a single node of this processor speeds of over 6.2 and 4.1 Gflop/s were measured in solving a full linear system and a full symmetric eigenvalue problem of order 5000, respectively (see <a href="references.html#EurB99">[7]</a> for the last two results). Furthermore 2 SR8000 G1 frames have been coupled and a speed of 1709 Gflop/s out of 2074 has been attained on 1152 processors for solving a 141,000-order linear system. 
The efficiency in this case is 82%, quite high for externally coupled systems.</p>     </ors:measured-performance></ors:system><!-- ======================================================================== --><ors:system id="sun">   <ors:name>The Sun Fire 3800-15K</ors:name>   <ors:machine-type>RISC-based distributed-memory multi-processor</ors:machine-type>   <weg>Fire 3800-15K</weg>   <ors:operating-system>Solaris (Sun's Unix flavour)</ors:operating-system>   <ors:connection-structure>Crossbar (see remarks)</ors:connection-structure>   <ors:compilers>Fortran 77, Fortran 90, HPF, C, C++</ors:compilers>   <ors:vendor-website>http://www.sun.com/servers/highend/sunfire15k/details.html</ors:vendor-website>   <ors:year-of-introduction>2001</ors:year-of-introduction>   <ors:models>      <ors:model>         <ors:name>Fire 3800-15K</ors:name>         <ors:clock-cycle unit="MHz">900</ors:clock-cycle>         <ors:processor-performance unit="Gflop/s">1.8</ors:processor-performance>         <ors:peak-performance unit="Gflop/s" >190.8</ors:peak-performance>         <ors:memory-maximal unit="Gbyte" >576</ors:memory-maximal>         <ors:number-of-processors>            <ors:min/>            <ors:max>106</ors:max>         </ors:number-of-processors>      </ors:model>   </ors:models>   <ors:remarks>      <p>In the Fire 15K the processor/memory boards are plugged into a backplane that is an 18×18 flat crossbar. Each board contains four 900 MHz UltraSPARC III processors and a maximum of 32 GB of memory. So, normally the maximum number of processors would be 72. However, the 15K in fact contains <i>three</i> of these 18×18 crossbars, for data, addresses, and signals. It is possible to sacrifice I/O capacity and use 17 of the 18 slots of the second crossbar to put in 2-CPU boards without local memory, adding another 34 processors to obtain the maximum of 106. 
Obviously, sucha system is less balanced and such a configuration will normallyonly be chosen for very specific compute-intensive tasks with smallI/O requirements. Because of the flat crossbar memory access isuniform and the aggregate bandwidth of the crossbar is 172.8 GB/s.This is equivalent to 2.4 GB/s/processor or 2.66B/cycle. So, an8-byte operand needs 3 cycles to be shipped to the processor. Ofcourse, for processors in excess to 72 that are not on the databackplane the situation is more complicated and it is hard toestimate what the effective bandwidth would be.</p>      <p>The Fire 15K is a typical SMP machine with provisions forshared-memory parallelism in the Fortran and C(++) compilers bydirectives in the source code. Sun has joined the OpenMP consortiumfor standardising the shared-memory programming model.</p>     </ors:remarks>   <ors:measured-performance>      <p>In <a href="references.html#Top500">[32]</a> a speed of 357 Gflop/sis reported for a 4-way cluster of 72 processor machines in solvinga dense linear system of unspecified size. 
The efficiency for this problem is 69%.</p>     </ors:measured-performance></ors:system><!-- ======================================================================== --><ors:system id="superdome">   <ors:name>The HP 9000 SuperDome</ors:name>   <ors:machine-type>RISC-based ccNUMA system.</ors:machine-type>   <weg>HP 9000 SuperDome.</weg>   <ors:operating-system>HP-UX (HP's usual Unix flavour)</ors:operating-system>   <ors:connection-structure>Crossbar</ors:connection-structure>   <ors:compilers>Fortran 77, Fortran 90, Parallel Fortran, HPF, C, C++</ors:compilers>   <ors:vendor-website>http://www.hp.com/products1/servers/scalableservers/index.html</ors:vendor-website>   <ors:year-of-introduction>2000</ors:year-of-introduction>   <ors:models>      <ors:model>         <ors:name>HP 9000 SuperDome</ors:name>         <ors:clock-cycle unit="MHz">750</ors:clock-cycle>         <ors:processor-performance unit="Gflop/s">3</ors:processor-performance>         <ors:peak-performance unit="Gflop/s" >192</ors:peak-performance>         <ors:memory-maximal unit="Gbyte" >128</ors:memory-maximal>         <ors:number-of-processors>            <ors:min>16</ors:min>            <ors:max>64</ors:max>         </ors:number-of-processors>      </ors:model>   </ors:models>   <ors:remarks>      <p>The Superdome is to replace the Exemplar V2600 system which is also still marketed by HP but not as a multi-node system anymore (see section <a href="gone.html#gone">systems disappeared from the list</a>). 
The aggregate peak speed of the Superdome is in fact half that of the 4-node V2600 because the same CPUs are used, but the maximal configuration can only harbour 64 processors against 128 in the 4-node V2600.</p>      <p>The connection structure of the Superdome has significantly improved over that of the V2600: where the latter had a crossbar within its 32-way SMP nodes with an aggregate bandwidth of 15.4 GB/s and a 3.8 GB/s aggregate bandwidth between the SMP nodes, in the Superdome the aggregate bandwidth is 64 GB/s in a 2-level crossbar. This greatly improves the communication within the system. The PA-RISC 8600 CPUs run at a clock frequency of 750 MHz. As a CPU contains 2 floating-point units that are able to execute a combined floating multiply-add instruction, in favourable circumstances four flops/cycle can be achieved and a Theoretical Peak Performance of 3 Gflop/s per CPU can be attained. This amounts to a peak speed of 192 Gflop/s for a full configuration.</p>      <p>As in the former systems a shared memory parallel model is supported. HP is a partner in the OpenMP organisation and will therefore provide this style of shared-memory parallel programming in addition to (and later on instead of) its proprietary parallel model.</p>     </ors:remarks>   <ors:measured-performance>      <p>In <a href="references.html#Dong02">[6]</a> a speed of 86.45 Gflop/s is reported for solving a full linear system of size 41,000. This amounts to an efficiency of 61%. Also results for a 4-way coupled system with a total of 256 processors are reported: solving a full linear system of order 340,092 showed a speed of 471 Gflop/s, 61% of the 768 Gflop/s peak. 
A Hyperfabric network was used for the coupling, showing no degradation with respect to the internal network.</p>     </ors:measured-performance></ors:system><!-- ======================================================================== --><ors:system id="sv1">   <ors:name>The Cray Inc. SV1ex</ors:name>   <ors:machine-type>Shared-memory multi-vector processor.</ors:machine-type>   <weg>SV1ex-1A, SV1ex-1, SV1ex-4 (cluster).</weg>   <ors:operating-system>UNICOS (Cray Unix variant).</ors:operating-system>   <ors:connection-structure>Crossbar.</ors:connection-structure>   <ors:compilers>Fortran 90, C, C++, Pascal, ADA.</ors:compilers>   <ors:vendor-website>http://www.cray.com/products/systems/craysv1/</ors:vendor-website>   <ors:year-of-introduction>2000</ors:year-of-introduction>   <ors:models>      <ors:model>         <ors:name>Cray SV1ex-1A</ors:name>         <ors:clock-cycle unit="MHz">500</ors:clock-cycle>         <ors:processor-performance unit="Gflop/s">8</ors:processor-performance>         <ors:peak-performance unit="Gflop/s">32</ors:peak-performance>         <ors:memory-maximal unit="Gbyte" >32</ors:memory-maximal>         <ors:number-of-processors>            <ors:min>8</ors:min>            <ors:max>16</ors:max>         </ors:number-of-processors>      </ors:model>      <ors:model>         <ors:name>Cray SV1ex-1</ors:name>         <ors:clock-cycle unit="MHz">500</ors:clock-cycle>         <ors:processor-performance unit="Gflop/s">8</ors:processor-performance>         <ors:peak-performance unit="Gflop/s">64</ors:peak-performance>         <ors:memory-maximal unit="Gbyte" >96</ors:memory-maximal>         <ors:number-of-processors>           <ors:min>8</ors:min>            <ors:max>32</ors:max>         </ors:number-of-processors>      </ors:model>      <ors:model>         <ors:name>Cray SV1ex-4</ors:name>         <ors:clock-cycle unit="MHz">500</ors:clock-cycle>         <ors:processor-performance 
unit="Gflop/s">8</ors:processor-performance>         <ors:peak-performance unit="Gflop/s">256</ors:peak-performance>         <ors:memory-maximal unit="Gbyte" >384</ors:memory-maximal>         <ors:number-of-processors>            <ors:min>32</ors:min>            <ors:max>128</ors:max>         </ors:number-of-processors>      </ors:model>   </ors:models>   <ors:remarks>      <p>The Cray SV1ex series is a "midlife kicker" that bridges the gap between the Cray SV1 that appeared in 1998 and the SV2 which is expected to appear in 2002. Essentially the SV1ex machines are identical to the SV1s; however, the clock frequency has been raised by about 67%. This raises the single-processor peak performance from 1.2 to 2 Gflop/s. Furthermore, the speed of memory has increased by a factor of two with respect to the SV1.</p>      <p>The Cray SV1(ex) is the successor both to the CMOS-based Cray J90 and the Cray T90 which was based on ECL technology. The SV1ex systems are CMOS-based and therefore much cheaper to manufacture than the ECL-based systems. In this respect it has followed the trend set by Fujitsu and NEC a few years ago with their vector systems (see the <a href="vpp5000.html">Fujitsu VPP5000</a> and the <a href="sx-6.html">NEC SX-6</a>). The Cray vector processor tradition has also been followed in that the SV1ex series uses its own Cray-specific floating-point format instead of the IEEE 754 standard.</p>      <p>The single-cabinet configurations come in two sizes, the SV1ex-1A and the SV1ex-1 that can house 4 and 8 processor boards, respectively. Each processor board contains 4 CPUs that can deliver a peak rate of 4 floating-point operations per cycle, amounting to a theoretical peak performance of 2 Gflop/s per CPU. However, 4 CPUs can be coupled <i>across</i> CPU boards in a configuration to form a so-called Multi Streaming Processor (MSP) resulting in a processing unit that has effectively a Theoretical Peak Performance of 8 Gflop/s. 
The reconfiguration into MSPs and/or single CPU combinations can be done dynamically as the workload dictates. The vector start-up time for the single CPUs is smaller than for MSPs, so for small vectors single CPUs might be preferable while for programs containing long vectors the MSPs should be of advantage. The number of combinations that can be made is large but at least 8 CPUs must be configured as single 2 Gflop/s CPUs. So a full SV1ex-1 cabinet may be configured as 32 single 2 Gflop/s CPUs or as 1–6 MSPs with the remaining processors as single CPUs.</p>      <p>Another feature in the SV1ex is a combined scalar and vector cache of 256 KB per CPU. This cache is important because the bandwidth of 6.4 GB/s per CPU board amounts to only 1.5 eight-byte operands per cycle. The cache can ship 4 operands per cycle to a CPU. This relative bandwidth is much smaller than what was offered in the former Cray systems which makes the cache all the more important. As the available bandwidth from a memory interface is divided over the 4 processors on a board on an as-needed basis, and it is assumed that not all processors require the maximum amount of data all the time, it is hoped that the average data requirement of the processor boards will be met.</p>      <p>As in the NEC SX-6, single cabinets can be combined to form a cluster (Supercluster in Cray's terminology) by a so-called GigaRing. The GigaRing, which is also used to couple I/O sub-systems, is comprised of two counter-rotating rings with a bandwidth of 1 GB/s each. Where the systems in a cabinet are SM-MIMD systems, a multi-cabinet Supercluster is a DM-MIMD system and can be operated in parallel only by some parallel programming model like MPI or HPF. The SV1ex-4 is a standard configuration that is offered by Cray Inc. 
but larger clusters with up to 32 SV1ex-1 nodes are also possible.</p>     </ors:remarks>   <ors:measured-performance>      <p>In <a href="references.html#Dong02">[6]</a> a performance of 48.17 Gflop/s is reported for solving a dense linear system of size 40,320 on a 32-processor machine. This amounts to an efficiency of 75.3%.</p>     </ors:measured-performance></ors:system><!-- ======================================================================== --><ors:system id="sx-6">   <ors:name>The NEC SX-6</ors:name>   <ors:machine-type>Distributed-memory multi-vector processor</ors:machine-type>   <weg>SX-6i, SX-6A, SX-6xMy</weg>   <ors:operating-system>Super-UX (Unix variant based on BSD V.4.3 Unix).</ors:operating-system>   <ors:connection-structure>Multi-stage crossbar (see Remarks)</ors:connection-structure>   <ors:compilers>Fortran 90, HPF, ANSI C, C++</ors:compilers>   <ors:vendor-website>http://www.sw.nec.co.jp/hpc/sx-e/sx6/index.htm</ors:vendor-website>   <ors:year-of-introduction>2002</ors:year-of-introduction>   <ors:models>      <ors:model>         <ors:name>SX-6i</ors:name>         <ors:clock-cycle unit="MHz">500</ors:clock-cycle>         <ors:processor-performance unit="Gflop/s">8</ors:processor-performance>         <ors:peak-performance unit="Gflop/s">8</ors:peak-performance>         <ors:memory-maximal unit="Gbyte">8</ors:memory-maximal>         <ors:number-of-processors>            <ors:min/>            <ors:max/>         </ors:number-of-processors>      </ors:model>      <ors:model>         <ors:name>SX-6A</ors:name>         <ors:clock-cycle unit="MHz">500</ors:clock-cycle>         <ors:processor-performance unit="Gflop/s">8</ors:processor-performance>         <ors:peak-performance unit="Gflop/s">64</ors:peak-performance>         <ors:memory-maximal unit="Gbyte">64</ors:memory-maximal>         <ors:number-of-processors>            <ors:min>4</ors:min>            <ors:max>8</ors:max>         
</ors:number-of-processors>      </ors:model>      <ors:model>         <ors:name>SX-6xMy </ors:name>         <ors:clock-cycle unit="MHz">500</ors:clock-cycle>         <ors:processor-performance unit="Gflop/s">8</ors:processor-performance>         <ors:peak-performance unit=""/>         <ors:memory-maximal unit="Tbyte">8</ors:memory-maximal>         <ors:number-of-processors>            <ors:min>8</ors:min>            <ors:max>1024</ors:max>         </ors:number-of-processors>      </ors:model>   </ors:models>   <ors:remarks>      <p>The SX-6 series is offered in numerous models but most of these are just smaller frames that house a smaller number of the same processors. We only discuss the essentially different models here. All models are based on the same processor, an 8-way replicated vector processor where each set of vector pipes contains a logical, mask, add/shift, multiply, and division pipe (see section <a href="sm-simd.html">SM-SIMD systems</a> for an explanation of these components). As multiplication and addition can be chained (but not division) the peak performance of a pipe set at 500 MHz is 1 Gflop/s. Because of the 8-way replication a single CPU can deliver a peak performance of 8 Gflop/s. The vector units are complemented by a scalar processor that is 4-way super scalar and at 500 MHz has a theoretical peak of 1 Gflop/s. The peak bandwidth per CPU is 32 GB/s or 64 B/cycle. This is sufficient to ship 8 8-byte operands back or forth and just enough to feed one operand to each of the replicated pipe sets.</p>      <p>The SX-6i is the single CPU system that because of the single chip implementation is offered as a desk side model. Also a rack model is available that enables housing two systems in a rack but there is no connection between the systems.</p>      <p>A single frame of the SX-6A models holds up to 8 CPUs at the same clock frequency as the SX-6i. 
Internally the CPUs in the frame are connected by a 1-stage crossbar with the same bandwidth as that of a single CPU system: 32 GB/s/port. The fully configured frame can therefore attain a peak speed of 64 Gflop/s.</p>      <p>In addition, there are multi-frame models (SX-6<i>x</i>M<i>y</i>) where <i>x = 8,...,1024</i> is the total number of CPUs and <i>y = 2,...,128</i> is the number of frames coupling the single-frame systems into a larger system. There are two ways to couple the SX-6 frames in a multi-frame configuration: NEC provides a full crossbar, the so-called IXS crossbar, to connect the various frames together at a speed of 8 GB/s for point-to-point unidirectional out-of-frame communication (1024 GB/s bi-sectional bandwidth for a maximum configuration). Also a HiPPI interface is available for inter-frame communication at lower cost and speed. When the IXS crossbar solution is chosen, the total multi-frame system is globally addressable, turning the system into a NUMA system. However, for performance reasons it is advised to use the system in distributed memory mode with MPI.</p>      <p>The technology used is CMOS. This lowers the fabrication costs and the power consumption appreciably (the same approach is used in the <a href="vpp5000.html#vpp5000">Fujitsu VPP5000</a> and the <a href="sv1.html#sv1">Cray SV1ex</a>) and all models are air-cooled.</p>      <p>For distributed computing there is an HPF compiler and for message passing an optimised MPI (MPI/SX) is available. In addition, for shared memory parallelism, OpenMP is available.</p>     </ors:remarks>   <ors:measured-performance>      <p>Results for a 16-frame, 128-processor SX-6/128M16 system are available from <a href="references.html#Dong02">[6]</a>. The system attained 982 Gflop/s, an efficiency of 96%. 
The size of the linear system for this result was 204,800.</p>     </ors:measured-performance></ors:system><!-- ======================================================================== --><ors:system id="t3e">   <ors:name>The Cray Inc. T3E</ors:name>   <ors:machine-type>RISC-based distributed-memory multi-processor</ors:machine-type>   <weg>T3E-1200E, T3E-1350</weg>   <ors:operating-system>UNICOS/mk (micro kernel-based Unix)</ors:operating-system>   <ors:connection-structure>3-D Torus</ors:connection-structure>   <ors:compilers>Fortran 77, Fortran 90, HPF, ANSI C, C++.</ors:compilers>   <ors:vendor-website>http://www.cray.com/products/systems/crayt3e/</ors:vendor-website>   <ors:year-of-introduction>T3E-1200E: 1998, T3E-1350: 2000</ors:year-of-introduction>   <ors:models>      <ors:model>         <ors:name>T3E-1200E</ors:name>         <ors:clock-cycle unit="MHz">600</ors:clock-cycle>         <ors:processor-performance unit="Gflop/s">1.2</ors:processor-performance>         <ors:peak-performance unit="Gflop/s" >2458</ors:peak-performance>         <ors:memory-maximal unit="Tbyte" >4</ors:memory-maximal>         <ors:number-of-processors>            <ors:min>6</ors:min>            <ors:max>2048</ors:max>         </ors:number-of-processors>      </ors:model>      <ors:model>         <ors:name>T3E-1350</ors:name>         <ors:clock-cycle unit="MHz">675</ors:clock-cycle>         <ors:processor-performance unit="Gflop/s">1.35</ors:processor-performance>         <ors:peak-performance unit="Gflop/s" >2938</ors:peak-performance>        <ors:memory-maximal unit="Tbyte" >1</ors:memory-maximal>          <ors:number-of-processors>           <ors:min>40</ors:min>            <ors:max>2176</ors:max>         </ors:number-of-processors>      </ors:model>   </ors:models>   <ors:remarks>      <p>The T3E is the second generation of DM-MIMD systems from Cray.
Lexically, its name follows that of its predecessor, the T3D, whose name referred to its connection structure: a 3-D torus. In this respect it still has the same interconnection structure as the T3D. In many other respects, however, there are quite a few differences. A first and important difference is that no front-end system is required anymore (although it is still possible to connect to Cray vector systems). The systems up to 128 processors are air-cooled. The larger ones, from 256 to 2176 processors, are liquid-cooled.</p>      <p>The T3E uses the DEC Alpha 21164 for its computational tasks. In 2000, a T3E-1350 was introduced that uses the latest 21164A processors at a clock rate of only 675 MHz but that is identical in almost all other aspects to the T3E-1200E. Cray stresses that the processors are encapsulated in such a way that they can be exchanged easily for any other (faster) processor as soon as this would be available without affecting the macro-architecture of the system. However, in practice this is not likely to happen.</p>      <p>Each node in the system contains one processing element (PE) which in turn contains a CPU, memory, and a communication engine that takes care of communication between PEs. The bandwidth between nodes is quite high: 325 MB/s, bi-directional. Like the T3D, its predecessor, the T3E has hardware support for fast synchronisation. E.g., barrier synchronisation takes only one cycle per check. The node also contains a set of E-registers and streaming registers that allows for aggressive prefetching to ameliorate the restrictions of the processor/memory bottleneck. 
An interesting additional feature is the availability of 32 contexts per processor, which opens the door to multiprocessing.</p>      <p>The T3E provides distributed I/O. One I/O channel can be configured for every 8 PEs in the air-cooled systems and for every 16 nodes in the liquid-cooled systems. The maximum bandwidth of a channel is about 1 GB/s; the actual speed will be on the order of 500 MB/s.</p>      <p>The T3E supports various programming models. Apart from PVM and MPI for message passing and HPF for data distribution, a Cray proprietary one-sided communication library, the so-called <tt>shmem</tt> library, can be employed for message passing. In addition, the BSP library (see <a href="references.html#Hill97">[12]</a>), also a one-sided message passing library, is available. The <tt>shmem</tt> library is implemented close to the hardware and shows a very low latency of only 1.6 µs.</p>      <p>There are some differences in the available configurations between the T3E-1200 and the T3E-1350: in the T3E-1200 the amount of memory per node ranges from 64 MB to 2 GB, while in the 1350 model there is only a choice between 256 and 512 MB per node. Furthermore, there is an air-cooled model (up to 128 PEs) of the T3E-1200, while the larger configurations are liquid-cooled. The T3E-1350 is available only in liquid-cooled configurations, which start at 40 processors and can be extended in modules of 8 processors. The 1200 systems start at 6 processors, and modules of 4 or 8 processors can be added.</p>     </ors:remarks>   <ors:measured-performance>      <p>In <a href="references.html#Dong02">[6]</a> a speed of 1.127 Tflop/s is reported for the solution of a dense linear system of order 148,800 on a T3E-1200 with 1488 processors. The efficiency for such an exercise is 63%. 
The same source quotes a speed of 113.9 out of 172.8 Gflop/s on a 128-processor T3E-1350, giving an efficiency of 66% for solving a linear system of order 89,088.</p>     </ors:measured-performance></ors:system><!-- ======================================================================== --><ors:system id="vpp5000">   <ors:name>The Fujitsu VPP5000 series</ors:name>   <ors:machine-type>Distributed-memory vector multi-processor</ors:machine-type>   <weg>VPP5000U, VPP5000</weg>   <ors:operating-system>UXP/V (a V5.4 based variant of Unix)</ors:operating-system>   <ors:connection-structure>Full distributed crossbar</ors:connection-structure>   <ors:compilers>Fortran 90/VP (Fortran 90 Vector compiler), Fortran 90/VPP (Fortran 90 Vector Parallel compiler), C/VP (C Vector compiler), HPF, C, C++</ors:compilers>   <ors:vendor-website/>   <ors:year-of-introduction/>   <ors:models>      <ors:model>         <ors:name>VPP5000U</ors:name>         <ors:clock-cycle unit="MHz">300</ors:clock-cycle>         <ors:processor-performance unit="Gflop/s">9.6</ors:processor-performance>         <ors:peak-performance unit="Gflop/s" >9.6</ors:peak-performance>         <ors:memory-maximal unit="Gbyte">16</ors:memory-maximal>         <ors:number-of-processors>            <ors:min>1</ors:min>            <ors:max>1</ors:max>         </ors:number-of-processors>      </ors:model>      <ors:model>         <ors:name>VPP5000</ors:name>         <ors:clock-cycle unit="MHz">300</ors:clock-cycle>         <ors:processor-performance unit="Gflop/s">9.6</ors:processor-performance>       <ors:peak-performance unit="Tflop/s" >1.22</ors:peak-performance>           <ors:memory-maximal unit="Tbyte">2</ors:memory-maximal>         <ors:number-of-processors>            <ors:min>4</ors:min>            <ors:max>128</ors:max>         </ors:number-of-processors>      </ors:model>   </ors:models>   <ors:remarks>      <p>The VPP5000 is the successor of the former VPP700/VPP700E systems
(with E for extended, i.e., the clock cycle is 6.6 instead of 7 ns). The overall architectural changes with respect to the VPP700 series are slight. The clock cycle has been halved and the floating-point vectorpipes are able to deliver floating multiply-add results. With a replication factor of 16 for these vectorpipes, 32 floating-point results per clock cycle can be generated, at least in theory. In this way a four-fold increase in speed per processor can be attained with respect to the VPP700E.</p>      <p>The architecture of the VPP5000 nodes is almost identical to that of the VPP700: each node, called a Processing Element (PE) in the system, is a powerful (9.6 Gflop/s peak speed with a 3.3 ns clock) vector processor in its own right. The vector processor is complemented by a RISC scalar processor with a peak speed of 1.2 Gflop/s. The scalar instruction format is 64 bits wide and may cause the execution of up to 4 operations in parallel. Each PE has a memory of up to 16 GB and communicates with its fellow PEs at a point-to-point speed of 1.6 GB/s. This communication is taken care of by separate Data Transfer Units (DTUs). To enhance the communication efficiency, the DTU has various transfer modes such as contiguous, stride, sub-array, and indirect access. Translation of logical to physical PE-ids and of logical in-PE addresses to real addresses is also handled by the DTUs. When synchronisation is required, each PE can set its corresponding bit in the Synchronisation Register (SR). The value of the SR is broadcast to all PEs, and synchronisation has occurred if the SR has all its bits set for the relevant PEs. This method is comparable to the use of synchronisation registers in shared-memory vector processors and much faster than synchronising via memory. The network is a direct crossbar, which should lead to an excellent throughput of the network. This is in contrast to the VPP700, where a level-2 crossbar was employed for configurations larger than 16 processors. 
On special order, Fujitsu can build 512-PE systems, quadrupling the maximum amount of memory and the theoretical peak performance.</p>      <p>The VPP5000U is one of the few single-processor vector processors that are offered. It is simply a single-processor version of the VPP5000, of course without the network and data transfer extensions that are required in the VPP5000.</p>      <p>The Fortran compiler that comes with the VPP5000 has extensions that enable data decomposition by compiler directives. In many cases this avoids restructuring of the code. The directives are different from those defined in the High Performance Fortran Proposal, but it should be easy to adapt them. Furthermore, it is possible to define parallel regions, barriers, etc., via directives, while there are several intrinsic functions to enquire about the number of processors and to execute <tt>POST/WAIT</tt> commands. A message passing programming style is also possible using the PVM or MPI communication libraries that are available.</p>      <p>As for the Fujitsu AP3000, information is no longer available via a web page (except perhaps in Japanese) since the restructuring of the Fujitsu web site.</p>       </ors:remarks>   <ors:measured-performance>      <p>The system was announced in November 1999 and some results are available by now. In <a href="references.html#Dong02">[6]</a>, for a 100-processor system, a speed of 886 Gflop/s was measured in solving a full linear system of order 195,600, which amounts to an efficiency of 92%. On a single processor a speed of 6.04 Gflop/s was measured in solving a system of order 2000. 
In evaluating a 10th-order polynomial a speed of 8.68 Gflop/s was observed, also an efficiency of over 90% (see <a href="references.html#EurB99">[7]</a> for the last two results).</p>       </ors:measured-performance></ors:system><!-- ======================================================================== --><ors:system id="primepwr">   <ors:name>The Fujitsu/Siemens PRIMEPOWER</ors:name>   <ors:machine-type>RISC-based shared-memory multi-processor.</ors:machine-type>   <weg>PRIMEPOWER 800, 1000, 2000.</weg>   <ors:operating-system>Solaris (Sun's Unix variant).</ors:operating-system>   <ors:connection-structure>Crossbar.</ors:connection-structure>   <ors:compilers>Parallel Fortran 90, C, C++.</ors:compilers>   <ors:vendor-website>http://primepower.fujitsu.com/en/index.html</ors:vendor-website>   <ors:year-of-introduction>2000</ors:year-of-introduction>   <ors:models>      <ors:model>         <ors:name>PRIMEPOWER 2000</ors:name>         <ors:clock-cycle unit="MHz">675</ors:clock-cycle>         <ors:processor-performance unit="Gflop/s">1.35</ors:processor-performance>         <ors:peak-performance unit="Gflop/s">173</ors:peak-performance>         <ors:memory-maximal unit="Tbyte">4</ors:memory-maximal>         <ors:number-of-processors>            <ors:min>8</ors:min>            <ors:max>128</ors:max>         </ors:number-of-processors>      </ors:model>   </ors:models>   <ors:remarks>      <p>We only discuss here the PRIMEPOWER 2000, as the smaller models have the same structure but fewer processors (at most 16 in the 800 model and 32 in the 1000 model). In many respects this machine is akin to the <a href="sun.html">SUN Fire 3800-15K</a>. The processors are 64-bit Fujitsu implementations of SUN's SPARC processors, called SPARC 64 GP processors, and they are completely compatible with the SUN products. The processors are available in a 563 MHz and a 675 MHz variant. 
The interconnection of the processors in the PRIMEPOWER systems is also like the one in the Fire 3800-15K: a crossbar that connects all processors on the same footing, i.e., it is <em>not</em> a NUMA machine.</p>      <p>Unfortunately, there is no sound technical information available beyond the data sheets that are provided via Fujitsu's web site. These data sheets omit any information about the bandwidth of the interconnect, be it point-to-point, bi-sectional, or aggregate. Judging from the available information, the system is positioned more as a communication server than as a high-performance computer, although the structure is well suited for the latter kind of task.</p>     </ors:remarks>   <ors:measured-performance>      <p>Dongarra reports in <a href="references.html#Dong02">[6]</a> a performance of 118 Gflop/s out of a maximum of 172.8 Gflop/s for solving a system of order 116,480. This amounts to an efficiency of 68.3%.</p>        </ors:measured-performance></ors:system><!-- ======================================================================== --><ors:system id="origin">   <ors:name>The SGI Origin 3000 series</ors:name>   <ors:machine-type>RISC-based distributed-memory multi-processor</ors:machine-type>   <weg>Origin 3200, Origin 3400, Origin 3800</weg>   <ors:operating-system>IRIX (SGI's Unix variant)</ors:operating-system>   <ors:connection-structure>Crossbar, hypercube (see remarks)</ors:connection-structure>   <ors:compilers>Fortran 77, Fortran 90, C, C++, Ada, Pascal</ors:compilers>   <ors:vendor-website>http://www.sgi.com/servers/</ors:vendor-website>   <ors:year-of-introduction>2000</ors:year-of-introduction>   <ors:models>      <ors:model>         <ors:name>Origin 3400</ors:name>         <ors:clock-cycle unit="MHz">600</ors:clock-cycle>         <ors:processor-performance unit="Gflop/s">1.2</ors:processor-performance>         <ors:peak-performance unit="Gflop/s">38.3</ors:peak-performance>         <ors:memory-maximal 
unit="Gbyte">64</ors:memory-maximal>         <ors:number-of-processors>            <ors:min>4</ors:min>            <ors:max>32</ors:max>         </ors:number-of-processors>      </ors:model>      <ors:model>         <ors:name>Origin 3800</ors:name>         <ors:clock-cycle unit="MHz">600</ors:clock-cycle>         <ors:processor-performance unit="Gflop/s">1.2</ors:processor-performance>        <ors:peak-performance unit="Gflop/s">614</ors:peak-performance>          <ors:memory-maximal unit="Tbyte">1</ors:memory-maximal>         <ors:number-of-processors>              <ors:min>6</ors:min>            <ors:max>512</ors:max>        </ors:number-of-processors>      </ors:model>   </ors:models>   <ors:remarks>      <p>By July 2000 SGI had moved from its Origin2000 series to its new Origin3000 series, comprising the Origin3200, Origin3400, and Origin3800 models. In the system parameter list above we included only the 3400 and 3800 models because of their peak performance. Many of the characteristics of the Origin2000 have been retained, of which the most important is its ccNUMA character. The processor used is presently the MIPS R14000, a direct successor of the R12000s in the Origin2000 systems. The R14000 is very similar to the R12000 processor, except that the primary cache runs at full speed where that of the R12000 operated at 2/3 speed. In the newest systems the 600 MHz R14000A is offered, although 500 MHz R14000-based systems are also still available.</p>      <p>SGI has further modularised the Origin3000 in comparison with its predecessor. A system contains so-called C-bricks: CPU boards with 2-4 processors and a router chip. The router chip connects the on-board memory with the processors, with router boards called R-bricks for communication with the rest of the system, and with I-bricks that contain disks, PCI expansion slots, etc., and that together make up the I/O sub-system of the machine. 
The basic hardware bandwidth within a C-brick is 1.6 GB/s from the router chip to one pair of CPUs and 3.2 GB/s from memory to the router chip (2 x 1.6 GB/s full duplex). The same bandwidth is available for inter-node communication. The off-board I/O bandwidth is 2.4 GB/s (2 x 1.2 GB/s full duplex). The R-brick can be connected to 16 C-bricks and it has 8 ports to connect it to other R-bricks. So, at most 128 C-bricks or 512 processors can be interconnected in this way.</p>      <p>The machine is a typical representative of the ccNUMA class of systems. The memory is physically distributed over the node boards but there is one system image. Because of the structure of the system, the bi-sectional bandwidth of the system remains constant from 8 processors on: 210 GB/s. This is a large improvement over the earlier Origin2000 systems, where the bi-sectional bandwidth was 82 GB/s.</p>      <p>Parallelisation is done either automatically by the (Fortran or C) compiler or explicitly by the user, mainly through the use of directives. All synchronisation, etc., has to be done via memory. This may cause a fairly large parallelisation overhead. A message passing model is also possible on the Origin using the optimised SGI versions of PVM and MPI, and the SGI/Cray-specific <tt>shmem</tt> library. Programs implemented in this way may well run very efficiently on the system.</p>      <p>A nice feature of the Origin systems is that they may migrate processes to the nodes that should satisfy the data requests of these processes. The overhead involved in transferring data across the machine is minimised in this way. The technique is reminiscent of the late Kendall Square systems, although in those systems the data were moved to the active process. 
SGI claims that the time for non-local memory references is on average about 2 times longer than for local memory references, an improvement of 50% over the Origin2000 series.</p>   </ors:remarks>   <ors:measured-performance>      <p>As yet no performance figures for the 600 MHz-based systems are available, but in <a href="references.html#Dong02">[6]</a> the performance of the solution of a linear system of order 230,000 is quoted for a 512-processor system with 500 MHz processors. In this case a speed of 405.6 Gflop/s was found, an efficiency of 79%.</p>   </ors:measured-performance></ors:system><!-- ======================================================================== --></ors:list-of-systems>
