The Oracle Sparc T5 processor more than doubles the throughput of the Sparc T4 processor, while increasing per-thread performance, scalability, power efficiency, and I/O bandwidth. The authors detail the improvements and new features leading to this latest Oracle Sparc processor.
This paper describes the micro-architecture of a Coherency Hub (CoHub) ASIC for a 4-socket highly-threaded multiprocessor using Sun's UltraSPARC T2 Plus processor. UltraSPARC T2 Plus is an 8-core CMT processor in the Sun Servers with CoolThreads™ Technology family. CoHub enables cost-effective scaling to 4 nodes with a total thread count of 256 and near-linear performance scaling on transaction processing workloads. Extending a 2-node "glueless" system to a 4-node system without processor changes was a key requirement. CoHub broadcasts snoop requests, serializes requests to the same address, and consolidates snoop responses. It communicates with nodes via serial links, using a proprietary link layer implemented over FBDIMM. We present the coherency scheme, ASIC design, transaction flows, and engineering challenges created by 800 MHz operation and a 6-stage pipeline budget. We report performance scalability results measured on commercial server benchmarks.
Deflection routing resolves output port contention in packet-switched multiprocessor interconnection networks by granting the preferred port to the highest-priority packet and directing contending packets out other ports. When combined with optical links and switches, deflection routing yields simple bufferless nodes, high bit rates, scalable throughput, and low latency. We discuss the problem of packet synchronization in synchronous optical deflection networks with nodes distributed across boards, racks, and cabinets. Synchronous operation is feasible due to very predictable optical propagation delays. A routing control processor at each node examines arriving packets and assigns them to output ports. Packets arriving on different input ports must be bitwise aligned; there are no elastic buffers to correct for mismatched arrivals. "Time of flight" packet synchronization is done by balancing link delays during network design. Using a directed graph network model, we formulate a constrained minimization problem for minimizing link delays subject to synchronization and packaging constraints. We demonstrate our method on a ShuffleNet graph, and show modifications to handle multiple packet sizes and latency-critical paths.
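The port-assignment rule described in this abstract (the highest-priority packet gets its preferred port, contending packets are deflected to other ports) can be illustrated with a minimal Python sketch. The names `Packet` and `assign_ports`, the convention that a larger `priority` value wins, and the choice of the lowest-numbered free port for a deflected packet are all assumptions made for illustration; the abstract does not specify them.

```python
from typing import Dict, List, NamedTuple


class Packet(NamedTuple):
    ident: str
    priority: int        # assumption: larger value wins contention
    preferred_port: int  # output port the packet would like to take


def assign_ports(packets: List[Packet], num_ports: int) -> Dict[str, int]:
    """Give every arriving packet a distinct output port (no buffering)."""
    if len(packets) > num_ports:
        raise ValueError("a bufferless node needs one output port per packet")

    assignment: Dict[str, int] = {}
    free = set(range(num_ports))
    deflected: List[Packet] = []

    # Pass 1: in priority order, grant the preferred port when it is still free.
    for pkt in sorted(packets, key=lambda p: p.priority, reverse=True):
        if pkt.preferred_port in free:
            assignment[pkt.ident] = pkt.preferred_port
            free.remove(pkt.preferred_port)
        else:
            deflected.append(pkt)

    # Pass 2: losers of contention are deflected out a remaining port.
    # Assumption: take the lowest-numbered free port.
    for pkt in deflected:
        port = min(free)
        assignment[pkt.ident] = port
        free.remove(port)

    return assignment
```

For example, `assign_ports([Packet("a", 2, 0), Packet("b", 1, 0)], num_ports=2)` grants port 0 to packet `a` and deflects packet `b` to port 1. Handling deflections in a second pass keeps a deflected packet from taking a port that a lower-priority packet would have received as its own preference.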
To bring the benefits of CMT to larger workloads, these systems had to scale beyond a single socket. Because CMT requires massive memory bandwidth to achieve adequate throughput performance, the challenge was to develop a coherency link and fabric that would allow performance to scale along with thread count in a multinode (that is, multisocket) system. This article presents CoHub's coherency scheme, ASIC design, and transaction flows, and discusses the engineering challenges created by 800-MHz operation and a six-stage pipeline budget. The basic principles embodied in the multinode coherency protocol and CoHub design will be important building blocks for future multinode CMT systems with higher node counts.
CoHub, a coherency hub ASIC, provides a cost-effective way to extend a glueless two-node chip-multithreading system to a four-node system without changes to the processor. The four-node, 256-thread system achieves near-linear scaling of performance with thread count on transaction-processing workloads. Time-to-market pressure, 800-MHz operation, and a six-stage pipeline were among the constraints that shaped CoHub's design.
Time-of-flight synchronization is a new digital design methodology for optoelectronics that eliminates latches, allowing higher clock rates than alternative timing schemes. Synchronization is accomplished by precisely balancing connection delays. Circuits use pulse-mode signaling and clock gates to restore pulse timing. Many effective pipeline stages are created within combinational logic without extra hardware bounding the stages. Time-of-flight design principles are applicable to packet routing and sorting processors for optical interconnection networks. Circuits are unique because the clock rate is limited primarily by imprecision in propagation delay rather than absolute delay, as in circuits with latches. We develop a general model of delay uncertainty and focus on the effect that static and dynamic uncertainty accumulated over circuit paths has on the minimum feasible clock period. We present a method for traversing the circuit graph representation of a time-of-flight circuit to compute arrival time uncertainty at each pulse interaction point. Arrival time uncertainties give rise to pulse width and overlap constraints. From these constraints we formulate a constrained minimization to find the minimum clock period. We demonstrate our method on circuits implemented with 2×2 electro-optic switches and optical waveguides and find the electronic component of path uncertainty frequently limits speed.
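The idea of traversing a circuit graph to accumulate arrival time uncertainty can be sketched in Python as follows. This is not the paper's uncertainty model: it assumes a simple worst-case interval model in which each connection has a nominal delay plus or minus a fixed uncertainty, and it ignores clock gates and pulse widths. The function name `arrival_windows` and the edge format are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def arrival_windows(edges, sources):
    """Propagate (earliest, latest) pulse arrival times through a DAG.

    edges:   list of (src, dst, nominal_delay, uncertainty) tuples
    sources: dict mapping each input node to its (earliest, latest) window;
             every node without incoming edges must appear here
    """
    preds = {}
    for src, dst, nominal, unc in edges:
        preds.setdefault(dst, []).append((src, nominal, unc))
        preds.setdefault(src, [])

    # TopologicalSorter expects a mapping of node -> predecessors.
    graph = {node: [s for s, _, _ in ins] for node, ins in preds.items()}

    windows = dict(sources)
    for node in TopologicalSorter(graph).static_order():
        if node in windows:
            continue
        ins = preds[node]
        # Worst case: the earliest pulse takes the fastest path,
        # the latest pulse takes the slowest path.
        earliest = min(windows[s][0] + d - u for s, d, u in ins)
        latest = max(windows[s][1] + d + u for s, d, u in ins)
        windows[node] = (earliest, latest)
    return windows
```

With two balanced connections of nominal delay 10 and uncertainties 1 and 2 feeding one interaction point, the pulses nominally coincide at time 10 but the computed window is (8, 12). In this simplified model the width of that window is what bounds how closely successive pulses can be spaced.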
The I/O Hub (IOH) for SPARC M7 processor-based servers is an ASIC providing high performance, flexible, and virtualized access to multiple Gen3 PCIe devices. The IOH's top-level interconnect, connecting multiple PCIe Root Complexes to a set of SPARC M7 protocol interface units, proved challenging for design, verification, performance modeling, and physical implementation. Close collaboration between these disciplines was required to deliver a production quality ASIC. We report on some specific technical issues we encountered and how the interconnect design strategy shifted through the course of development as we made optimizations based on performance modeling feedback and physical design constraints.
This dissertation presents a study of how propagation delay uncertainty affects the performance of time-of-flight synchronized digital circuits. Time-of-flight synchronization is a new timing method suitable for technologies such as optoelectronics having highly controllable propagation delay. No bistable memory elements are required, and synchronization is accomplished by precise adjustments of interconnect lengths. Delay is distributed over connections so that, nominally, pulses arrive at a common destination simultaneously. Clock gating and pulse stretching are used to restore timing of pulses. Time multiplexing is used to increase computational throughput, whereby a major cycle is divided into a number of minor cycles, each representing an independent virtual machine. What limits the amount of multiplexing that is feasible is the controllability of delay. The principal focus of this research is methods for computing the minimum feasible minor cycle and the amount of stretch needed to prevent synchronization errors. Due to the unique circuit features, timing analysis differs significantly from analysis of conventional digital circuits. Models of delay uncertainty accounting for static and dynamic effects are discussed for discrete and integrated implementations. Methods for placing a minimal set of clock gates necessary for a functional circuit are presented. The minimum feasible major cycle is computed using nominal delays. A method for computing the arrival time and pulse width uncertainty at each node in the circuit is presented. The circuit graph is traversed and device uncertainty functions operating on worst-case input pulse parameters are applied at vertices. Using pulse timing parameters obtained from the traversal, timing constraints are generated. A constrained minimization problem to find the minimum feasible minor cycle is then presented and solved. Two variations on this problem are presented.
Circuit structural issues that affect the accuracy of the results are also discussed. The timing analysis algorithms are implemented in a CAD tool called XHatch. Results of XHatch experiments showing the effect of delay uncertainty on the minimum feasible minor cycle for discrete and integrated implementations of circuits are presented. The last chapter presents a statistical model of delay uncertainty and a method for estimating the probability of a synchronization error.
This paper describes the design of a coherency hub ASIC for a 4-socket highly-threaded multiprocessor using Sun's Victoria Falls processor. Victoria Falls is an 8-core CMT processor in the Niagara family, with 8 threads per core and a shared L2 cache. The coherency hub, named Zambezi, enables cost-effective scaling to 4 sockets with a total thread count of 256 and near-linear performance scaling on transaction processing workloads. Extending a 2-socket "glueless" system to a 4-socket system with no change to the processor was a key requirement. Zambezi broadcasts snoop requests to all nodes (i.e., sockets), serializes requests to the same address, and consolidates snoop responses. The hub communicates with each node via point-to-point serial links, using a proprietary data link layer implemented over an FBDIMM PHY. In this paper, we summarize the ASIC micro-architecture and coherency scheme, highlight how we addressed the engineering challenges we faced, and report performance scalability results we achieved on key commercial server benchmarks. Conflicting constraints (800 MHz operation and a 6-stage pipeline budget) presented the primary challenge to architecture, design and layout.
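The three hub functions named in this abstract (broadcasting snoop requests to all nodes, serializing requests to the same address, and consolidating snoop responses) can be illustrated with a toy Python model. It is a behavioral sketch only and makes no claim about the actual protocol or ASIC pipeline: the class `ToyHub`, its method names, the boolean hit/miss response, and the first-come, first-served ordering of waiting requests are all assumptions.

```python
from collections import deque


class ToyHub:
    """Toy behavioral model of a snoop-broadcasting coherency hub."""

    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self.active = {}   # address -> in-flight transaction state
        self.waiting = {}  # address -> deque of requesters held back
        self.log = []      # messages the hub would send, in order

    def request(self, requester, address):
        # Serialize: only one transaction per address may be in flight.
        if address in self.active:
            self.waiting.setdefault(address, deque()).append(requester)
        else:
            self._start(requester, address)

    def _start(self, requester, address):
        self.active[address] = {
            "requester": requester,
            "pending": set(range(self.num_nodes)),
            "hits": [],
        }
        # Broadcast a snoop to every node (including the requester here).
        for node in range(self.num_nodes):
            self.log.append(("snoop", node, address))

    def snoop_response(self, node, address, hit):
        txn = self.active[address]
        txn["pending"].discard(node)
        if hit:
            txn["hits"].append(node)
        if not txn["pending"]:
            # Consolidate: one combined result once every node has answered.
            self.log.append(
                ("done", txn["requester"], address, tuple(sorted(txn["hits"])))
            )
            del self.active[address]
            queue = self.waiting.get(address)
            if queue:
                self._start(queue.popleft(), address)
```

In this model, a second request to an address that already has a transaction in flight produces no snoops until all nodes have responded to the first; the hub then emits a single consolidated `done` record and only afterwards broadcasts snoops for the waiting request.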