This paper reports the first experimental demonstration of silicon photonic (SiPh) Flex-LIONS, a bandwidth-reconfigurable SiPh switching fabric based on wavelength routing in arrayed waveguide grating routers (AWGRs) and space switching. Compared with state-of-the-art bandwidth-reconfigurable switching fabrics, the Flex-LIONS architecture requires 21× fewer switching elements and exhibits 2.9× lower on-chip loss at 64 ports, indicating significant improvements in scalability and energy efficiency. System experiments carried out with an 8-port SiPh Flex-LIONS prototype demonstrate error-free one-to-eight multicast interconnection at 25 Gb/s and bandwidth reconfiguration from 25 Gb/s to 100 Gb/s between selected input and output ports. In addition, benchmarking simulation results show that replacing the core-layer switches of Fat-Tree topologies with Flex-LIONS can provide a 1.33× reduction in packet latency and >1.5× improvement in energy efficiency. Finally, we discuss the possibility of scaling Flex-LIONS up to N = 1024 ports (N = M × W) by arranging M² W-port Flex-LIONS in a Thin-CLOS architecture using W wavelengths.
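As background, the passive wavelength-routing rule of an N×N AWGR, which the LIONS family of fabrics builds on, can be sketched in a few lines. This is a minimal textbook model, not the Flex-LIONS implementation; the function name and the exact port/wavelength indexing convention are illustrative (some AWGR designs use a subtractive rather than additive mapping):

```python
def awgr_output_port(input_port: int, wavelength: int, n: int) -> int:
    """Cyclic routing rule of an N x N AWGR: light entering `input_port`
    on wavelength channel `wavelength` exits at port
    (input_port + wavelength) mod N, with no active switching elements."""
    return (input_port + wavelength) % n

# All-to-all connectivity: with N wavelengths available, every input
# port can reach every output port purely by choice of wavelength.
N = 8
for i in range(N):
    assert {awgr_output_port(i, w, N) for w in range(N)} == set(range(N))
```

Because the routing is set entirely by the transmit wavelength, the fabric itself stays passive; reconfiguration in Flex-LIONS comes from the added space-switching stage rather than from the AWGR.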
The rapid increase in data-intensive applications demands more powerful parallel computing systems capable of processing large amounts of data efficiently and effectively. While GPU-based systems are commonly used for such parallel processing, the exponentially rising data volume can easily saturate the capacity of even the largest GPU processor. One possible solution is to exploit multi-GPU systems. In a multi-GPU system, the main bottleneck is the interconnect, which is currently based on PCIe or NVLink technologies. In this study, we propose to optically interconnect multiple GPUs using Flex-LIONS, an optical all-to-all reconfigurable interconnect. By exploiting the multiple free spectral ranges (FSRs) of Flex-LIONS, it is possible to adapt (or steer) the inter-GPU connectivity to the traffic demands by reconfiguring the optical connectivity of one FSR while maintaining fixed all-to-all connectivity in another FSR. Simulation results show the benefits of the proposed reconfigurable bandwidth-steering interconnect under the traffic patterns of various applications, including convolution and max-pooling, with execution time reductions of up to 5×.
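The steering idea above can be illustrated with a deliberately simple allocation policy: the static FSR gives every GPU pair one fixed channel, while the reconfigurable FSR's channels are granted to the pairs with the highest demand. This greedy policy is a hypothetical sketch for illustration only; the abstract does not specify the actual reconfiguration algorithm:

```python
def steer_bandwidth(traffic, extra_channels):
    """Assign the reconfigurable FSR's `extra_channels` wavelength
    channels to the GPU pairs with the highest measured demand.
    `traffic` maps (src, dst) -> demand.  Every pair already has one
    fixed channel from the all-to-all FSR, so this only adds capacity;
    no pair ever loses its baseline connectivity."""
    hottest = sorted(traffic, key=traffic.get, reverse=True)
    return hottest[:extra_channels]

# Hypothetical inter-GPU traffic matrix (arbitrary demand units):
traffic = {(0, 1): 90, (2, 3): 70, (0, 2): 10, (1, 3): 5}
print(steer_bandwidth(traffic, 2))  # -> [(0, 1), (2, 3)]
```

The key architectural point survives even this toy policy: because one FSR stays all-to-all, steering can never disconnect a pair, so a bad prediction costs bandwidth but not reachability.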
Data movement has become a limiting factor in the performance, power consumption, and scalability of high-performance compute nodes as the numbers of processor and memory systems grow. Optical interconnects enabled by silicon photonics could not only overcome this limitation but also change the way we think about system architectures and memory hierarchies. This dissertation introduces and evaluates scalable high-performance computing architectures based on optical interconnects, presenting the motivation and background, architecture design, and evaluation results for the following case studies: (1) investigating the design challenges in large-scale many-core processors, the impact of the interconnection fabric on overall system performance and power consumption, and how silicon photonics can alleviate system constraints; (2) studying on-chip memory networks capable of providing HPC compute nodes with terabytes of memory capacity by interconnecting several 3D-stacked DRAM modules through a packet-switched network interface, where replacing legacy interconnects with optical networks could significantly reduce memory access time and energy, a largely unexplored research area; (3) addressing the scaling limitations of chiplet-based systems, in particular large inter-chiplet non-uniform latencies, distance-related energy overheads, and limited input-output (IO) bandwidth, and exploiting the properties of optical interconnects to propose a scalable uniform memory architecture; and (4) rethinking the architecture of state-of-the-art high-throughput accelerators, the impact of memory access latency variations on overall performance and system design, and the key challenges in scaling memory and compute capacity in these systems.
A new architecture is proposed to reduce contention within the memory system with the help of a partitioned memory controller and an all-to-all passive optical interconnect that is amenable to a 2.5D implementation using off-the-shelf memory modules.
While the Fat-Tree network topology represents the dominant state-of-the-art solution for large-scale HPC networks, its scalability in terms of power, latency, complexity, and cost is significantly challenged by the ever-increasing communication bandwidth among tens of thousands of heterogeneous computing nodes. We propose 3D-Hyper-FleX-LION, a flat hybrid electronic-photonic interconnect network that leverages the multichannel nature of modern multi-terabit switch ASICs (with 100 Gb/s granularity) and a reconfigurable all-to-all photonic fabric called Flex-LIONS. Compared to a Fat-Tree network interconnecting the same number of nodes with the same oversubscription ratio, the proposed 3D-Hyper-FleX-LION offers a 20% smaller diameter, 3× lower power consumption, 10× fewer cable connections, and a 4× reduction in the number of transceivers. When the bandwidth reconfiguration capabilities of Flex-LIONS are exploited for non-uniform traffic workloads, simulation results indicate that 3D-Hyper-FleX-LION can achieve up to a 4× improvement in energy efficiency over Fat-Tree for synthetic traffic workloads with high locality.
2.5D integrated systems exploiting electronic interposers to tightly integrate multiple processor dies into the same package suffer from significant performance degradation caused by the large latency overheads of their die-to-die multihop electrical interconnection networks. Silicon-photonic interposers with wavelength-routed interconnects can overcome this issue by enabling directly connected, scalable topologies while exhibiting low-energy optical communication even at large distances. This paper studies the use of an arrayed waveguide grating router (AWGR) as a scalable, low-latency silicon-photonic interconnection fabric for computing systems with up to 256 cores. Our results indicate that AWGRs could be a key enabler for large-scale interposer systems, offering an average performance speed-up of at least 1.25× with 1.32× lower power for 256 cores compared to state-of-the-art electrical networks, while offering a more compact solution than alternative photonic interconnects.
New high-performance processors have shifted from multi-core to many-core designs. Moreover, shared memory has become the dominant paradigm for mainstream multicore processors. As the memory wall loomed over architecture design, most modern computer systems adopted several layers in their memory hierarchy. Among these, caches have become permanent components of memory hierarchies, as they significantly reduce access time by taking advantage of locality. Major processor vendors usually rely on cache coherence and implement a variant of MESI, e.g., MOESI for AMD, to help reduce inter-chip traffic on the fast interconnection network. Supposedly, maintaining coherence keeps parallel and concurrent programmers happy while providing them with a well-known cache behavior for shared memory. This thesis challenges the assumption that coherence is well suited for large-scale many-core processors. Seeking an alternative to coherence, the LC-Cache protocol is extensively investigated. LC-Cache is a cache protocol weaker than coherence, but one that preserves causality. It relies on the Location Consistency (LC) model. The basic philosophy behind LC is to maintain a unique view of memory only when there is a reason to; other ordinary memory accesses may be observed in any order by the other processors of the system. The case against cache coherence rests on underestimated limitations that coherence imposes on system design. Observations presented in this thesis demonstrate that coherence makes a directory-based protocol impractical, since the size of such a directory grows linearly with the number of cores. In addition, coherence adds implicit latency to the protocol in many cases. This thesis presents LCCSim, a simulation framework for comparing cache protocols based on location consistency against cache coherence protocols.
A comparative analysis between the MESI and MOESI coherence protocols is provided, pitting them against LC-Cache. Both MESI and MOESI consistently generate more on-chip traffic than LC-Cache, since transitions in LC-Cache are performed locally. However, LC-Cache degrades total access latency because it does not take advantage of cache-to-cache forwarding. Additionally, LC-Cache cannot be considered a true implementation of LC, since it does not behave exactly according to the memory model. The following summarizes the contributions of this thesis: (1) a detailed specification of the LC-Cache protocol, covering the aspects missing from the original paper; (2) a simulation framework to compare cache protocols based on LC against cache coherence protocols; (3) extensive analysis of the LC-Cache protocol, leading to the discovery of several weaknesses; and (4) a demonstration of the features required for an efficient cache protocol truly based on location consistency.
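For context, the MESI baseline that LC-Cache is measured against can be summarized by a per-line next-state function. This is a textbook simplification, not the thesis's model: real implementations add transient states, and the I-state read fill is shown going to S even though it goes to E when no other cache holds the line:

```python
# Simplified MESI next-state function for a single cache line.
# States: M(odified), E(xclusive), S(hared), I(nvalid).
def mesi_next(state, event):
    """Return the next MESI state for (current state, event).
    Local events: 'read_hit', 'write_hit', and the misses
    'local_read' / 'local_write'.  Snooped bus events from other
    caches: 'remote_read', 'remote_write'."""
    table = {
        ('I', 'local_read'):   'S',  # simplified: E if no other sharer
        ('I', 'local_write'):  'M',  # read-for-ownership, then modify
        ('S', 'write_hit'):    'M',  # upgrade, invalidates other copies
        ('E', 'write_hit'):    'M',  # silent upgrade: no bus traffic
        ('E', 'remote_read'):  'S',
        ('M', 'remote_read'):  'S',  # write back dirty data, then share
        ('M', 'remote_write'): 'I',
        ('E', 'remote_write'): 'I',
        ('S', 'remote_write'): 'I',
    }
    return table.get((state, event), state)  # e.g. read_hit: unchanged

assert mesi_next('E', 'write_hit') == 'M'    # no bus transaction needed
assert mesi_next('S', 'remote_write') == 'I'
```

Even this sketch shows where the traffic comparison comes from: every remote event that invalidates or downgrades a line is a bus or directory transaction, which is exactly the coherence traffic a weaker, locally transitioning protocol like LC-Cache avoids.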
We propose Flex-LIONS, a silicon photonic switch architecture with 21× fewer switching elements and 2.9× lower on-chip loss compared with other reconfigurable switching fabrics at a scale of 64 ports. Benchmarking simulations indicate a 1.33× reduction in packet latency and >1.5× improvement in energy efficiency compared with Fat-Tree topologies.
Domain Wall Memory (DWM), with ultra-high density and read/write latency comparable to DRAM, is an attractive replacement for CMOS-based devices. Unlike DRAM, DWM has non-uniform data access latency that is proportional to the number of shift operations. While previous works have demonstrated the feasibility of using DWM as main memory and have proposed different ways to alleviate the impact of shift operations, none of them have addressed performance-critical metadata accesses, in particular page table accesses. To bridge this gap, this paper accelerates page table walks in DWM-based main memory in two novel ways. First, we propose a new page table layout and leverage the positions of access ports in DWM to differentiate the states of page table entries. Second, we propose a technique to pre-align the access ports to the positions to be accessed in the near future, thus hiding shift latency to the maximum extent. Since both address translation and context switching are affected by page table access latency, the proposed technique can effectively improve system performance and user experience.
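The latency model behind the pre-alignment idea can be sketched simply: a DWM access costs a number of shifts proportional to the distance between the access port and the target domain, so moving the port ahead of time takes those shifts off the critical path. This is an illustrative toy model with hypothetical function names, not the paper's mechanism:

```python
def shift_cost(port_pos, target):
    """DWM access latency is dominated by the number of shift
    operations needed to bring the target domain under an access
    port, i.e. the distance between them on the nanowire."""
    return abs(port_pos - target)

def prealign(port_pos, predicted_next):
    """Pre-align the port toward the entry predicted to be accessed
    next (e.g. the next page-table entry in a walk), overlapping the
    shifts with other work instead of paying them at access time."""
    return predicted_next

port = 0
naive = shift_cost(port, 12)      # 12 shifts paid on the critical path
port = prealign(port, 12)         # shifts performed ahead of time
ondemand = shift_cost(port, 12)   # 0 shifts remain when the access hits
assert (naive, ondemand) == (12, 0)
```

The sketch also shows why prediction accuracy matters: a wrong guess leaves the port even farther from the real target, so pre-alignment pays off only for accesses as predictable as sequential page-table walks.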
Smaller feature size, lower supply voltage, and faster clock rates have made modern computer systems more susceptible to faults. Although previous fault tolerance techniques usually target a relatively low fault rate and treat error recovery as less critical, with the advent of higher fault rates, recovery overhead is no longer negligible. In this paper, we propose a scheme that leverages and revises a set of compiler optimizations to design, for each application hotspot, a smart recovery plan that identifies the minimal set of instructions to be re-executed in different fault scenarios. Such fault scenario and recovery plan information is efficiently delivered to the processor for runtime fault recovery. The proposed optimizations are implemented in LLVM and GEM5. The results show that the proposed scheme can reduce runtime recovery overhead by 72%.