IBM Blue Gene/Q chip and system
August 8, 2011, Hot Chips 23, Stanford, CA—Ruud Haring of IBM described the latest Blue Gene massively parallel supercomputer. He noted that one of the system design objectives was to move supercomputing beyond the scientific community into commercial applications.
Design objectives included reducing the total cost of operations and achieving greater power efficiency. In addition, the new machine had to be more reliable than previous generations. The resulting chip is optimized for flops per watt, and reliability is enhanced through ECC, BER sentry circuits, and partial redundancy. The total chip, including CPU, networking, and cache, uses 1.47 B transistors in a 45 nm process.
The chip is organized as 16 user cores, 1 service and policy engine, and 1 spare. Each core is 4-way multi-threaded and includes a 1.6 GHz FPU. Total chip power is less than 55 W in normal operation. All of the cores are interconnected through a crossbar switch, which required manual layout and runs at half clock, or 800 MHz. The cores access the outside world through 22 master ports, including 2 PCIe ports, parallel bidirectional ports for chip-to-chip communication, and Ethernet or InfiniBand.
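The per-chip numbers above allow a back-of-the-envelope peak-throughput estimate. The sketch below assumes each core's FPU retires a 4-wide double-precision fused multiply-add per cycle (8 flops/cycle); that per-cycle figure is an assumption, not stated in the talk.

```python
# Back-of-the-envelope peak throughput for one Blue Gene/Q chip.
USER_CORES = 16          # user-visible cores (17th is service, 18th spare)
CLOCK_HZ = 1.6e9         # 1.6 GHz FPU clock
FLOPS_PER_CYCLE = 8      # assumed: 4-wide SIMD fused multiply-add
CHIP_WATTS = 55          # quoted normal-operation chip power

peak_flops = USER_CORES * CLOCK_HZ * FLOPS_PER_CYCLE
print(f"peak per chip: {peak_flops / 1e9:.1f} GFlops")           # 204.8 GFlops
print(f"efficiency: {peak_flops / CHIP_WATTS / 1e9:.1f} GFlops/W")  # ~3.7
```

At roughly 3.7 GFlops/W this illustrates why the design is described as optimized for flops per watt.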
The core of the chip is a set of 64-bit Power ISA units using in-order dispatch, execution, and completion. The L1 cache has 16 KB of instruction and 16 KB of data storage, and the prefetcher supports an adaptive stream prefetch plus 4 list-based prefetches. Dual on-chip memory controllers address 16 GB of DDR3, each providing a 16-byte-wide interface plus ECC with 2-bit correction and 3-bit detection. The L2 cache is 32 MB, 16-way set associative, and multi-versioned. Data are tagged through a scoreboard that supports transactional memory and speculative execution.
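The stated L2 parameters (32 MB, 16-way) pin down the cache geometry once a line size is chosen. The sketch below assumes a 128-byte line, which is not given in the article; `l2_set_index` is a hypothetical helper illustrating how an address would map to a set under that assumption.

```python
# L2 geometry sketch: 32 MB, 16-way set associative.
CACHE_BYTES = 32 * 1024 * 1024
WAYS = 16
LINE_BYTES = 128  # assumed line size (not stated in the article)

SETS = CACHE_BYTES // (WAYS * LINE_BYTES)

def l2_set_index(addr: int) -> int:
    """Set a physical address maps to, under the assumed line size."""
    return (addr // LINE_BYTES) % SETS

print(SETS)  # 16384 sets
```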
To assemble a full supercomputer, a chip is packaged and put on a daughter card with 16 GB of DDR3 memory soldered to the board to eliminate socket issues. 32 of these compute cards are mounted on a main board with optical modules and other interconnect. These boards are placed into a drawer with 8 I/O cards and 8 PCIe slots, or 16 of them are stacked in a midplane module. A rack holds 2 midplanes and 1-4 I/O drawers. These racks are then assembled into the final system of 96 racks to provide 20 PFlops of performance.
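The packaging hierarchy above can be checked against the quoted 20 PFlops. The per-chip peak of 204.8 GFlops used below assumes 16 cores at 1.6 GHz retiring 8 flops/cycle each; the 8 flops/cycle figure is an assumption, not stated in the talk.

```python
# Worked totals for the full system described above.
COMPUTE_CARDS_PER_BOARD = 32   # compute cards per main board
BOARDS_PER_MIDPLANE = 16       # boards stacked in a midplane module
MIDPLANES_PER_RACK = 2
RACKS = 96

PEAK_PER_CHIP = 204.8e9        # assumed: 16 cores x 1.6 GHz x 8 flops/cycle

chips = RACKS * MIDPLANES_PER_RACK * BOARDS_PER_MIDPLANE * COMPUTE_CARDS_PER_BOARD
peak_pflops = chips * PEAK_PER_CHIP / 1e15
print(chips)                   # 98304 chips
print(round(peak_pflops, 1))   # ~20.1, matching the quoted 20 PFlops
```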
The design challenges included total area, power and power management, and soft-error detection and correction. The methodology was complex: a mix of ASIC and custom flows for different sections. Verification and test-pattern generation were correspondingly difficult, since some functions were not easily scanned and sections like the crossbar were too complex. The mix of chip, technology, and flows all contributed to the difficulty.