Quantum supremacy experiments, such as Google's Sycamore [F. Arute et al., Nature 574, 505 (2019)], pose a great challenge for classical verification due to the exponentially increasing compute cost. Using a new-generation Sunway supercomputer within 8.5 days, we provide a direct verification by computing 3×10^{6} exact amplitudes for the experimentally generated bitstrings, obtaining a cross-entropy benchmarking fidelity of 0.191% (the estimated value is 0.224%). The leap in simulation capability is built on a multiple-amplitude tensor network contraction algorithm that systematically exploits the "classical advantage" (the inherent "store-and-compute" operation mode of von Neumann machines) of current supercomputers, and a fused tensor network contraction algorithm that drastically increases the compute efficiency on heterogeneous architectures. Our method has a far-reaching impact on solving quantum many-body problems, statistical problems, and combinatorial optimization problems.
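The cross-entropy benchmarking (XEB) fidelity quoted above can be computed directly from the classically computed amplitudes of the observed bitstrings. A minimal sketch of the standard linear XEB estimator (the function name and toy data are illustrative, not the paper's code):

```python
import numpy as np

def linear_xeb_fidelity(amplitudes, n_qubits):
    """Linear XEB fidelity from exact amplitudes <x|U|0> of the
    experimentally observed bitstrings x:
        F_XEB = 2^n * <|a_x|^2>_experiment - 1.
    F ~ 0 for uniformly random bitstrings, F ~ 1 for ideal sampling."""
    probs = np.abs(np.asarray(amplitudes)) ** 2
    return (2 ** n_qubits) * probs.mean() - 1.0

# Toy check: amplitudes of a uniform superposition give F = 0.
n = 4
uniform_amps = np.full(10, 2 ** (-n / 2))
print(linear_xeb_fidelity(uniform_amps, n))  # 0.0
```

In the experiment above, the mean is taken over the three million bitstrings whose exact amplitudes were computed on the supercomputer.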
Boson sampling is expected to be an important milestone demonstrating quantum computational advantage (or quantum supremacy). This work establishes the benchmarking of Gaussian boson sampling (GBS) with threshold detection on the Sunway TaihuLight supercomputer. To achieve the best performance and provide a competitive scenario for future quantum computing studies, the selected simulation algorithm is fully optimized through a set of innovative approaches, including a parallel framework with almost perfect load balance and an instruction-level optimization scheme based on shortest-path instruction scheduling. In addition, data precision is carefully handled by an integer-instruction-based, multiple-precision fixed-point implementation with 128- and 256-bit precision modes, which can be selected appropriately by an adaptive precision optimization scheme. Based on these methods, a highly efficient parallel quantum sampling algorithm is designed. The largest run enables us to obtain one Torontonian function of a 100×100 submatrix from 50-photon GBS within 20 hours in 128-bit precision and 2 days in 256-bit precision. To our knowledge, this is the largest boson-sampling-based quantum computing simulation performed on modern supercomputers.
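The Torontonian mentioned here is the matrix function that gives threshold-detection probabilities in GBS. A brute-force sketch, assuming the common convention Tor(O) = Σ_{S⊆[N]} (−1)^{|S|} / sqrt(det(I − O_S)) over subsets of the N modes (conventions vary between references; this toy version is exponential in N and only illustrates what the supercomputer run evaluates at scale):

```python
import itertools
import numpy as np

def torontonian(O):
    """Brute-force Torontonian of a 2N x 2N matrix O.  O_S keeps the
    rows/columns of the modes in subset S and their partners at i+N.
    Exponential cost -- the paper's contribution is making a single
    100 x 100 evaluation tractable, not this naive loop."""
    N = O.shape[0] // 2
    total = 0.0
    for k in range(N + 1):
        for S in itertools.combinations(range(N), k):
            idx = list(S) + [i + N for i in S]
            sub = O[np.ix_(idx, idx)]
            total += (-1) ** k / np.sqrt(np.linalg.det(np.eye(len(idx)) - sub))
    return total

# Vacuum input (O = 0): no detector can click, so Tor = 0.
print(torontonian(np.zeros((4, 4))))  # 0.0
```

The 2^N-term sum is what makes a 100×100 submatrix (N = 50, ~10^15 terms) a supercomputer-scale job.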
We report new Gaussian boson sampling experiments with pseudo-photon-number-resolving detection, which register up to 255 photon-click events. We consider partial photon distinguishability and develop a more complete model for the characterization of noisy Gaussian boson sampling. In the quantum computational advantage regime, we use Bayesian tests and correlation function analysis to validate the samples against all current classical spoofing mockups. Estimating with the best classical algorithms to date, generating a single ideal sample from the same distribution on the supercomputer Frontier would take ∼600 yr using exact methods, whereas our quantum computer, Jiǔzhāng 3.0, takes only 1.27 μs to produce a sample. Generating the hardest sample from the experiment using an exact algorithm would take Frontier ∼3.1×10^{10} yr.
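The Bayesian tests used for sample validation boil down to accumulating a log-likelihood ratio between the quantum hypothesis and a mockup distribution over the observed samples. A minimal sketch, assuming two hypothetical probability models (the biased-coin distributions below are toy stand-ins, not GBS distributions):

```python
import numpy as np

def bayesian_log_odds(samples, p_quantum, p_mock):
    """Accumulate log[P(x|quantum)/P(x|mock)] over observed samples.
    A positive total favours the quantum hypothesis over the mockup."""
    return sum(np.log(p_quantum(x)) - np.log(p_mock(x)) for x in samples)

# Toy illustration with two biased-coin models over bitstrings.
rng = np.random.default_rng(0)
p_q = lambda x: np.prod(np.where(x == 1, 0.7, 0.3))   # "quantum" model
p_m = lambda x: np.prod(np.where(x == 1, 0.5, 0.5))   # uniform mockup
samples = (rng.random((200, 8)) < 0.7).astype(int)     # drawn from p_q
print(bayesian_log_odds(samples, p_q, p_m) > 0)        # True
```

In the experiment, the same test is run against each classical spoofing strategy, with probabilities evaluated under the noise model developed for the device.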
In this paper, we introduce a privacy-preserving stable diffusion framework leveraging homomorphic encryption, called HE-Diffusion, which primarily focuses on protecting the denoising phase of the diffusion process. HE-Diffusion is a tailored encryption framework specifically designed to align with the unique architecture of stable diffusion, ensuring both privacy and functionality. To address the inherent computational challenges, we propose a novel min-distortion method that enables efficient partial image encryption, significantly reducing the overhead without compromising the model's output quality. Furthermore, we adopt a sparse tensor representation to expedite computational operations, enhancing the overall efficiency of the privacy-preserving diffusion process. We successfully implement HE-based privacy-preserving stable diffusion inference. The experimental results show that HE-Diffusion achieves a 500× speedup compared with the baseline method and reduces the time cost of homomorphically encrypted inference to the minute level. Both the performance and accuracy of HE-Diffusion are on par with the plaintext counterpart. Our approach marks a significant step towards integrating advanced cryptographic techniques with state-of-the-art generative models, paving the way for privacy-preserving and efficient image generation in critical applications.
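The partial-encryption idea can be sketched without any HE library: route only the most sensitive pixels through the (expensive) encrypted path and keep the rest in plaintext, storing the secret part sparsely. This is a toy illustration under stated assumptions; the magnitude-based selection below is a stand-in for the paper's min-distortion criterion, and the "encryption" is elided entirely:

```python
import numpy as np

def split_for_partial_encryption(image, keep_ratio=0.1):
    """Split an image into a plaintext part and a sparse secret part.
    The secret part (indices + values) is what would go through the
    homomorphic pipeline; sparsity keeps encrypted ops cheap."""
    flat = image.ravel()
    k = max(1, int(keep_ratio * flat.size))
    idx = np.argpartition(np.abs(flat), -k)[-k:]   # top-k "sensitive" pixels
    sparse_secret = (idx, flat[idx])               # would be HE-encrypted
    public = flat.copy()
    public[idx] = 0.0                              # plaintext remainder
    return public, sparse_secret

def recombine(public, sparse_secret):
    idx, vals = sparse_secret
    out = public.copy()
    out[idx] = vals                                # after HE decryption
    return out

img = np.arange(16.0)
pub, sec = split_for_partial_encryption(img, 0.25)
print(np.allclose(recombine(pub, sec), img))  # True
```

The round trip is lossless by construction; the engineering question the paper addresses is choosing which pixels to protect so that leaking the remainder distorts the recoverable image as much as possible.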
Accelerating cryo-EM Reconstruction of RELION on the New Sunway Supercomputer Xu, Jingle; Fu, Jiayu; Gan, Lin ...
2022 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom),
2022-Dec.
Conference Proceeding
Over the long journey of understanding the micro-world, cryo-EM has become an effective technique for biomolecular structure determination. However, due to the complex algorithmic features and large amounts of computing data, sophisticated HPC solutions are in urgent demand. In this paper, we present our efforts of porting RELION to the new generation of Sunway supercomputer. Optimizations that fit well with the new hardware architecture have been proposed, including a multi-level parallel scheme that smartly maps and scales RELION onto the novel Sunway architecture, optimizations that address memory bottlenecks and improve memory efficiency, and a pipeline approach that obtains excellent computation and communication overlapping. Combining all proposed optimizations, the calculation time for one iteration is greatly reduced from 7,577 seconds to 2,017 seconds, a speedup of 3.757×. The overall design is scaled to over 131,072 cores with a parallel efficiency of 95%.
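The computation/communication overlap described above is classic double buffering: while block i is being computed, block i+1 is already in flight. A minimal sketch, assuming `load` stands in for the memory/MPI transfer and `compute` for the per-block kernel (the names are illustrative, not RELION's API):

```python
from concurrent.futures import ThreadPoolExecutor

def process_pipelined(blocks, load, compute):
    """Double-buffered pipeline: prefetch the next block on a
    background thread while the current one is computed."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(load, blocks[0])
        for nxt in blocks[1:]:
            data = pending.result()
            pending = io.submit(load, nxt)   # prefetch next block
            results.append(compute(data))    # overlaps with the prefetch
        results.append(compute(pending.result()))
    return results

# Toy check: pipelined output matches the sequential order.
blocks = list(range(5))
out = process_pipelined(blocks, load=lambda b: b * 2, compute=lambda d: d + 1)
print(out)  # [1, 3, 5, 7, 9]
```

When transfer and compute times are comparable, this hides nearly all of the transfer latency, which is where most of the overlap benefit reported above comes from.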
High-performance classical simulators for quantum circuits, in particular the tensor network contraction algorithm, have become an important tool for the validation of noisy quantum computing. To address memory limitations, the slicing technique is used to reduce the tensor dimensions, but it can also introduce additional computation overhead that greatly slows down the overall performance. This paper proposes novel lifetime-based methods to reduce the slicing overhead and improve the computing efficiency, including an interpretation method to deal with slicing overhead, an in-place slicing strategy to find the smallest slicing set, and an adaptive tensor network contraction path refiner customized for the Sunway architecture. Experiments show that in most cases the slicing overhead with our in-place slicing strategy is lower than that of Cotengra, the most widely used graph path optimization software at present. Finally, the resulting simulation time is reduced to 96.1 s for the Sycamore quantum processor RQC, with a sustained single-precision performance of 308.6 Pflops using over 41M cores to generate 1M correlated samples, more than a 5× performance improvement over the 60.4 Pflops of the 2021 Gordon Bell Prize work.
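The slicing technique itself is easy to illustrate: fix a shared index to one value at a time, contract the smaller sliced network, and sum the partial results. A minimal numpy sketch (toy three-tensor network, not Sycamore's):

```python
import numpy as np

def sliced_contraction(a, b, c):
    """Contract a small tensor network a_ik b_kj c_jl by slicing the
    shared index k.  Peak memory drops by the sliced dimension, at
    the cost of redoing work common to the slices -- exactly the
    overhead a lifetime-based slicing strategy tries to minimize."""
    dim_k = a.shape[-1]
    total = np.zeros((a.shape[0], c.shape[-1]))
    for k in range(dim_k):                       # one slice per value of k
        total += np.einsum('i,j,jl->il', a[:, k], b[k, :], c)
    return total

rng = np.random.default_rng(1)
a, b, c = rng.random((3, 4)), rng.random((4, 5)), rng.random((5, 6))
full = np.einsum('ik,kj,jl->il', a, b, c)        # unsliced reference
print(np.allclose(sliced_contraction(a, b, c), full))  # True
```

In a real simulator several indices are sliced at once, and the choice of slicing set determines how much contraction work ends up duplicated across slices.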
A highly scalable and fully optimized earthquake model is presented based on the latest Sunway supercomputer. Contributions include: 1) a flexible model based on the curvilinear grid finite-difference method (CGFDM) that applies perfectly matched layers (PML) and enables more accurate and realistic terrain descriptions; 2) a hybrid, non-uniform domain decomposition scheme that efficiently maps the model across different levels of the computing system; and 3) sophisticated optimizations that largely alleviate or even eliminate bottlenecks in memory, communication, etc., obtaining a speedup of over 140×. Combining all innovations, the design fully exploits the hardware potential in all aspects and enables us to perform the largest CGFDM-based earthquake simulation ever reported (69.7 PFlops using over 39 million cores). Based on our design, the Turkey earthquakes (February 6, 2023) and the Ridgecrest earthquake (July 4, 2019) are successfully simulated with a maximum resolution of 12 m. Precise hazard evaluations for disaster reduction in the earthquake-stricken areas are also conducted.
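The non-uniform decomposition in contribution 2) is driven by unequal per-cell cost: boundary regions carrying PML terms are more expensive than interior cells, so equal-sized blocks would load-imbalance. A greedy 1-D sketch under illustrative assumptions (the weights and the splitting rule are toy choices, not the paper's scheme):

```python
def nonuniform_partition(weights, n_parts):
    """Split a row of grid columns with unequal per-column cost into
    contiguous chunks of roughly equal total work, so each process
    group gets a balanced load."""
    target = sum(weights) / n_parts
    parts, cur, acc = [], [], 0.0
    for i, w in enumerate(weights):
        cur.append(i)
        acc += w
        if acc >= target and len(parts) < n_parts - 1:
            parts.append(cur)
            cur, acc = [], 0.0
    parts.append(cur)
    return parts

# Columns near the domain edges (PML) cost ~3x an interior column.
weights = [3, 3, 1, 1, 1, 1, 1, 1, 3, 3]
print(nonuniform_partition(weights, 3))  # [[0, 1], [2, 3, 4, 5, 6, 7], [8, 9]]
```

Each of the three chunks here carries a total weight of 6: the expensive edge columns get small chunks and the cheap interior gets a wide one, which is the essence of the non-uniform scheme.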
The quantum circuits that declare quantum supremacy, such as Google's Sycamore [Nature \textbf{574}, 505 (2019)], raise a paradox in building reliable result references. While simulation on traditional computers seems the sole way to provide reliable verification, the required run time is doomed by exponentially increasing compute complexity. To find a way to validate current ``quantum-supremacy'' circuits with more than \(50\) qubits, we propose a simulation method that exploits the ``classical advantage'' (the inherent ``store-and-compute'' operation mode of von Neumann machines) of current supercomputers and computes uncorrelated amplitudes of a random quantum circuit with an optimal reuse of the intermediate results and a minimal memory overhead throughout the process. Such a reuse strategy reduces the original linear scaling of the total compute cost in the number of amplitudes to a sublinear pattern, with greater reduction for more amplitudes. Based on a well-optimized implementation of this method on a new-generation Sunway supercomputer, we directly verify Sycamore by computing three million exact amplitudes for the experimentally generated bitstrings, obtaining an XEB fidelity of \(0.191\%\), which closely matches the estimated value of \(0.224\%\). Our computation scales up to \(41,932,800\) cores with a sustained single-precision performance of \(84.8\) Pflops, accomplished within \(8.5\) days. Our method has a far-reaching impact on solving quantum many-body problems, statistical problems, and combinatorial optimization problems, where one often needs to contract many tensor networks that share a significant portion of tensors in common.
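The reuse strategy rests on a simple observation: the tensor networks for different amplitudes differ only in how they are closed at the output, so the expensive shared part can be contracted once and stored. A minimal sketch with illustrative shapes (a two-tensor "bulk" and basis-vector closures, nothing like Sycamore's actual network):

```python
import numpy as np

def amplitudes_with_reuse(shared_net, open_tensor, bitstrings):
    """Contract the bulk of the network once ("store"), then close it
    cheaply per bitstring ("compute").  Total cost becomes sublinear
    in the number of amplitudes relative to independent contractions."""
    # Expensive shared part: done once for all amplitudes.
    intermediate = np.einsum('ab,bc->ac', *shared_net)
    # Cheap per-bitstring part: select output indices and close the net.
    return [np.einsum('ac,a,c->', intermediate,
                      open_tensor[x[0]], open_tensor[x[1]])
            for x in bitstrings]

rng = np.random.default_rng(2)
shared_net = (rng.random((2, 3)), rng.random((3, 2)))
open_tensor = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 1.0])}
amps = amplitudes_with_reuse(shared_net, open_tensor,
                             [(0, 0), (0, 1), (1, 0), (1, 1)])
print(len(amps))  # 4
```

With three million target bitstrings, the shared intermediate is amortized over all of them, which is what turns the naive linear cost into the sublinear pattern described above.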
Boson sampling is expected to be an important milestone that will demonstrate quantum supremacy. The present work establishes the benchmarking of Gaussian boson sampling (GBS) with threshold detection on the Sunway TaihuLight supercomputer. To achieve the best performance and provide a competitive scenario for future quantum computing studies, the selected simulation algorithm is fully optimized through a set of innovative approaches, including a parallel scheme and an instruction-level optimization method. Furthermore, data precision and instruction scheduling are handled in a sophisticated manner by an adaptive precision optimization scheme and a DAG-based heuristic search algorithm, respectively. Based on these methods, a highly efficient and parallel quantum sampling algorithm is designed. The largest run enables us to obtain one Torontonian function of a 100 x 100 submatrix from 50-photon GBS within 20 hours in 128-bit precision and 2 days in 256-bit precision.
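DAG-based instruction scheduling of the kind mentioned here is usually some form of list scheduling: order instructions by their longest path to the dependence graph's exit, so long chains start early. A toy sketch under illustrative assumptions (the unit costs and priority rule are stand-ins, not the paper's heuristic):

```python
import heapq

def list_schedule(deps, cost):
    """List scheduler over an instruction DAG: repeatedly emit the
    ready instruction with the greatest remaining critical-path
    length.  `deps` maps instruction -> list of prerequisites."""
    succs = {i: [] for i in deps}
    for i, ps in deps.items():
        for p in ps:
            succs[p].append(i)
    prio = {}
    def height(i):  # longest path from i to any sink, memoized
        if i not in prio:
            prio[i] = cost[i] + max((height(s) for s in succs[i]), default=0)
        return prio[i]
    indeg = {i: len(ps) for i, ps in deps.items()}
    ready = [(-height(i), i) for i in deps if indeg[i] == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        _, i = heapq.heappop(ready)
        order.append(i)
        for s in succs[i]:
            indeg[s] -= 1
            if indeg[s] == 0:
                heapq.heappush(ready, (-height(s), s))
    return order

deps = {'a': [], 'b': [], 'c': ['a'], 'd': ['a', 'b'], 'e': ['d']}
cost = dict.fromkeys(deps, 1)
print(list_schedule(deps, cost))
```

Any output is a valid topological order; the priority rule only decides ties among ready instructions, which is where the pipeline-utilization gains come from.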