When Are MFLOPS Really MFLOPS?
An emerging class of computers--minisupercomputers--claims to provide the compute horsepower of a supercomputer, such as the Cray X-MP, but with a price tag closer to that of a minicomputer, such as a VAX 8600. Although these minisupercomputers tout performance benchmarks of 2 to 12 MFLOPS (Figure 1), such figures do not accurately indicate how a minisuper will perform in the real world of circuit simulations, compilations, and disk accesses. In other words, claimed peak performance often differs from achieved performance.
Consequently, prospective users who are considering buying a minisuper--and making an investment that could range from $200,000 to $1 million--must look beyond benchmarks. Users must first consider what type of computing they want to perform on a minisupercomputer (e.g., largely vector, mix of scalar and vector, or largely scalar). Once they understand the type of compute power required for an application, users can find the machine best suited for the job.
To judge the computing requirements of a particular application, users must carefully consider the type of computations that occur within a given program or routine. Examining the inner program loops will show what operations are actually occurring. The structure of inner loops is important because it often determines the degree to which an optimizing compiler can vectorize the problem or distribute it across parallel processors. The larger the fraction of inner loops that can take advantage of vector (or parallel) execution, the closer a minisuper will operate to peak performance--that theoretical level where all computer elements perform at their maximum.
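To see how the vectorizable fraction pulls achieved performance away from peak, the following Python sketch estimates a sustained rate from three inputs. The function name achieved_mflops and the sample rates (12 MFLOPS vector, 1 MFLOPS scalar) are illustrative assumptions, not figures for any particular machine; the model simply assumes every floating-point operation runs at either the vector rate or the scalar rate.

```python
def achieved_mflops(vector_mflops, scalar_mflops, vector_fraction):
    """Estimate the sustained rate when only part of the work is vectorized.

    vector_fraction is the share of floating-point operations that run at
    the vector (peak) rate; the rest run at the scalar rate. All three
    inputs are hypothetical values supplied by the caller.
    """
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must be between 0 and 1")
    if vector_mflops <= 0 or scalar_mflops <= 0:
        raise ValueError("rates must be positive")
    # Average time per operation, in microseconds, weighted by where it runs.
    time_per_op = ((1.0 - vector_fraction) / scalar_mflops
                   + vector_fraction / vector_mflops)
    return 1.0 / time_per_op


if __name__ == "__main__":
    # Assumed rates: 12 MFLOPS in vector mode, 1 MFLOPS in scalar mode.
    for fraction in (0.0, 0.5, 0.75, 0.9, 1.0):
        rate = achieved_mflops(12.0, 1.0, fraction)
        print(f"{fraction:4.0%} vectorized: {rate:5.2f} MFLOPS")
```

Under these assumed rates, a program that is 75% vectorized sustains about 3.2 MFLOPS, roughly a quarter of the 12-MFLOPS peak; only a fully vectorized program reaches the peak figure.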
A minisupercomputer that relies on vector notation for its speed achieves peak performance when an application has large data sets whose elements can be processed independently of the results of previous loop iterations. Vector notation allows the programmer to access individual elements via indexing and looping structures. Parallel computations can be performed on each array subset as the indexes are incremented by either row or column.
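The distinction between independent and dependent iterations can be sketched with two small loops. This is an illustration of loop structure only: Python itself does not vectorize these loops, and the function names scale_and_add and running_sum are made up for the example.

```python
def scale_and_add(a, x, y):
    """Independent iterations: result i depends only on x[i] and y[i].

    No iteration needs a value produced by an earlier one, so a
    vectorizing compiler could, in principle, issue the whole loop as
    vector operations.
    """
    result = []
    for i in range(len(x)):
        result.append(a * x[i] + y[i])
    return result


def running_sum(x):
    """Dependent iterations: result i needs result i-1 (a recurrence).

    Each pass must wait for the previous total, so the loop cannot be
    handed to a vector unit as one straightforward vector operation.
    """
    result = []
    total = 0.0
    for value in x:
        total += value
        result.append(total)
    return result
```

A loop shaped like the first function is the kind of inner loop that lets a vector machine approach its rated speed; a loop shaped like the second tends to run at scalar speed unless it is restructured.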
Because vectorization can yield as much as a 3:1 speedup over sequential execution (if 75% of the code is vectorized), dedicated hardware vector processors are often implemented to execute the operations in pipelined fashion for maximum effect. These vector processors, such as the SCS-40 from Scientific Computer Systems (San Diego, CA), have a single vector instruction that initiates several computations in parallel, with register files that can store as many as 512- x 64-bit …