To ensure security, image encryption algorithms generally include two stages: permutation and diffusion. Traditional image permutation algorithms include sort-based permutation, Arnold-based permutation, Baker-based permutation, and cyclic-shift permutation, among others. However, these algorithms suffer from either high time complexity or poor permutation performance. Therefore, by combining cyclic shift with sorting, this paper proposes a permutation algorithm that guarantees not only good permutation performance but also low time and space complexity. Most importantly, this paper proposes a parallel diffusion method that preserves the parallelism of diffusion to the utmost extent and achieves a qualitative improvement in efficiency over traditional streaming diffusion methods. Finally, combining the proposed permutation and diffusion, the paper presents a computational model for parallel image encryption algorithms.
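The combination of sorting and cyclic shift can be sketched as follows. This is a minimal illustration of the general idea, not the paper's exact algorithm: a key-driven sort reorders rows, then each row is rotated by a key-derived offset.

```python
# Minimal sketch (assumed key generation, not the paper's scheme):
# sort-based row permutation followed by per-row cyclic shifts.
import random

def permute(image, seed):
    rng = random.Random(seed)          # stand-in for a chaotic key stream
    h, w = len(image), len(image[0])
    # Sort-based step: order rows by a pseudo-random key sequence.
    keys = [rng.random() for _ in range(h)]
    order = sorted(range(h), key=lambda i: keys[i])
    rows = [image[i] for i in order]
    # Cyclic-shift step: rotate each row by a key-derived offset.
    shifts = [rng.randrange(w) for _ in range(h)]
    return [row[s:] + row[:s] for row, s in zip(rows, shifts)]
```

Both steps are bijective, so the pixel multiset is preserved and the transform is invertible given the same key.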
Recent advances in deep learning are driven by the growing scale of computation, data, and models. However, efficiently training large-scale models on distributed systems requires an intricate combination of data, operator, and pipeline parallelism, which places a heavy burden on machine learning practitioners. To this end, we propose AutoDDL, a distributed training framework that automatically explores and exploits new parallelization schemes with near-optimal bandwidth cost. AutoDDL facilitates the description and implementation of different schemes by utilizing OneFlow's Split, Broadcast, and Partial Sum (SBP) abstraction. AutoDDL is equipped with an analytical performance model combined with a customized Coordinate Descent algorithm, which significantly reduces the scheme-searching overhead. We conduct evaluations on Multi-Node-Single-GPU and Multi-Node-Multi-GPU machines using different models, including VGG and Transformer. Compared to expert-optimized implementations, AutoDDL reduces the end-to-end training time by up to 31.1% and 10% for Transformer and up to 17.7% and 71.5% for VGG on the two parallel systems, respectively.
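The search idea can be sketched generically. This is an illustrative coordinate-descent skeleton only; the cost function below is a placeholder, not AutoDDL's analytical performance model, and the coordinate/option names are assumptions.

```python
# Generic coordinate descent over a discrete configuration space:
# improve one coordinate at a time, guided by a cost function.
def coordinate_descent(cost, choices, start, sweeps=10):
    """choices[i] lists the allowed values for coordinate i."""
    point = list(start)
    for _ in range(sweeps):
        improved = False
        for i, options in enumerate(choices):
            candidates = [point[:i] + [v] + point[i + 1:] for v in options]
            best = min(candidates, key=cost)
            if cost(best) < cost(point):
                point = best
                improved = True
        if not improved:       # converged: no coordinate helps
            break
    return point
```

In a setting like AutoDDL's, the coordinates would be the per-dimension parallel degrees of a scheme and the cost an analytical estimate of communication volume; only a few sweeps are needed instead of enumerating the full space.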
Solid state drives (SSDs) achieve significantly better performance than hard disks by internally implementing channel-level, chip-level, and die-level parallelism. However, plane-level parallelism has not been sufficiently exploited, because of the constraint that the target pages of the planes must be in the same location. To overcome this constraint, a policy has been proposed that enforces the multi-plane operation by matching the positions of the target pages while wasting clean pages. However, this policy excessively increases the number of block erasures, which reduces the stability and lifetime of the SSD. To solve this problem, this Letter proposes a policy that decides whether to perform the multi-plane operation by considering the number of wasted clean pages. The performance evaluation using representative server workloads shows that the proposed policy improves average performance by up to 28.82% over the policy that does not perform the multi-plane operation, without significantly increasing the number of block erasures.
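The decision criterion can be sketched as follows. This is a hedged illustration of the idea only; the function, its inputs, and the budget threshold are assumptions, not the Letter's exact policy.

```python
# Sketch: perform a multi-plane write only when the clean pages that
# must be skipped to align page positions across planes stay within
# a budget (names and threshold are illustrative assumptions).
def use_multi_plane(plane_next_pages, waste_budget):
    """plane_next_pages: next free page index in each plane.
    Aligning every plane to the highest index skips (wastes) the
    clean pages below that index in the other planes."""
    target = max(plane_next_pages)
    wasted = sum(target - p for p in plane_next_pages)
    return wasted <= waste_budget
```

When the planes are already aligned, nothing is wasted and the multi-plane operation is always taken; as the misalignment grows, the policy falls back to single-plane writes instead of burning clean pages.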
With the increasing volumes of data samples and deep neural network (DNN) models, efficiently scaling the training of DNN models has become a significant challenge for server clusters with AI accelerators in terms of memory and computing efficiency. Existing parallelism schemes can be broadly classified into three categories: data parallelism (splitting data samples), model parallelism (splitting model parameters), and pipeline model parallelism (splitting model layers). Hybrid approaches split both data and models, offering a comprehensive solution for parallel training. However, these methods encounter limitations in efficiently scaling larger models across more computing nodes, as they incur substantial memory constraints that affect training efficiency and overall throughput. In this paper, we propose HIPPIE, a hybrid parallel training framework designed to enhance the memory efficiency and scalability of large DNN training. First, to evaluate the optimization effect more reasonably, we propose an index of Memory Efficiency (ME) to quantify the tradeoff between throughput and memory overhead. Second, driven by the informed ME optimization objective, we automatically partition the pipeline to balance throughput and memory. Third, we optimize the model training process via a novel hybrid parallel scheduler that improves throughput and scalability through informed pipeline scheduling and communication scheduling with gradient-hidden optimization. Experiments on various models show that HIPPIE achieves above 90% scaling efficiency on a 16-GPU platform. Moreover, HIPPIE increases throughput by up to 80%, while saving 57% of memory overhead and achieving a 4.18× memory-efficiency improvement.
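The abstract does not give ME's exact formula; one plausible stand-in for a throughput/memory tradeoff index is throughput per unit of peak memory, sketched below purely for illustration.

```python
# Hypothetical stand-in for a Memory Efficiency (ME) index (the
# paper's exact definition is not reproduced here): throughput
# achieved per unit of peak memory consumed.
def memory_efficiency(throughput, peak_memory):
    """throughput in samples/s, peak_memory in GB."""
    return throughput / peak_memory

# Comparing two hypothetical pipeline partitions: the faster but
# memory-hungrier one scores lower under this index.
balanced = memory_efficiency(800, 40)   # 20.0 samples/s per GB
skewed = memory_efficiency(900, 60)     # 15.0 samples/s per GB
```

Such an index rewards configurations that gain throughput without a proportional growth in memory footprint, which is the tradeoff the pipeline partitioner optimizes.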
The article presents a comparative analysis of the effectiveness of using parallelism of varying granularity in modern multicore computer systems with the most popular programming languages and libraries (such as C#, Java, C++, and OpenMP). Based on this comparison, the possibilities of increasing the efficiency of computations in multicore computer systems by combining medium- and fine-grained parallelism were also investigated. The results demonstrate the high potential efficiency of fine-grained parallelism for organizing intensive parallel computations. Based on these results, it can be argued that, in comparison with more traditional parallelization methods that use medium-grained parallelism, using fine-grained parallelism alone can reduce the computation time of a large mathematical problem by an average of 4%, and using combined parallelism can reduce it by up to 5.5%. This reduction in execution time can be significant when performing very large computations.
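The granularity distinction can be shown structurally. The article works in C#, Java, C++, and OpenMP; the Python sketch below is only an illustration of the idea: medium-grained parallelism splits the work into one chunk per worker, while fine-grained parallelism creates many small tasks that the pool can load-balance.

```python
# Illustrative sketch of task granularity (not from the article):
# tasks == workers -> medium-grained; tasks >> workers -> fine-grained.
from concurrent.futures import ThreadPoolExecutor

def square_sum(chunk):
    return sum(x * x for x in chunk)

def parallel_sum(data, workers, tasks):
    step = max(1, len(data) // tasks)
    chunks = [data[i:i + step] for i in range(0, len(data), step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(square_sum, chunks))
```

With uneven per-item costs, the fine-grained variant keeps all workers busy longer at the price of more scheduling overhead, which is exactly the tradeoff the article measures.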
Stream processing plays a crucial role in various information-oriented digital systems. Two popular frameworks for real-time data processing, Flink and Storm, provide solutions for effective parallel stream processing in Java. One option for leveraging Java's mature ecosystem for distributed stream processing is porting legacy C++ applications to Java. However, this raises questions about the adequacy of the equivalent Java mechanisms and potential throughput degradation. Therefore, our objective is to evaluate programmability and performance when converting stream processing applications from C++ to Java, while also exploring the parallelization capabilities offered by Flink and Storm. Furthermore, we aim to assess the throughput of Flink and Storm on shared-memory manycore machines, a hardware architecture commonly found in cloud environments. To achieve this, we conduct experiments involving four different stream processing applications. We highlight challenges encountered when porting C++ to Java and working with Flink and Storm, and we discuss throughput, latency, CPU, and memory usage results.
yambo is an open source project aimed at studying excited-state properties of condensed matter systems from first principles using many-body methods. As input, yambo requires ground-state electronic structure data as computed by density functional theory codes such as Quantum ESPRESSO and Abinit. yambo's capabilities include the calculation of linear response quantities (both independent-particle and including electron-hole interactions), quasi-particle corrections based on the GW formalism, optical absorption, and other spectroscopic quantities. Here we describe recent developments ranging from the inclusion of important but oft-neglected physical effects, such as electron-phonon interactions, to the implementation of a real-time propagation scheme for simulating linear and non-linear optical properties. Improvements to numerical algorithms and the user interface are outlined. Particular emphasis is given to the new and efficient parallel structure that makes it possible to exploit modern high-performance computing architectures. Finally, we demonstrate the possibility of automating workflows by interfacing with the yambopy and AiiDA software tools.
To address the challenges of segmentation complexity, high memory usage, extended training duration, and low equipment utilization in the parallel optimization of large-scale deep neural network (DNN) models, this paper proposes an asynchronous parallel optimization method, APapo. Firstly, a multi-iteration asynchronous pipeline-parallel schedule was established for model-parallel computing tasks, controlling the specific scheduling of micro-batch units to address gradient delay updating during asynchronous iteration. Secondly, for a given network model and hardware configuration, a dynamic programming strategy for computing resources and model tasks was designed to achieve dynamic segmentation of model computing tasks and optimal matching of computing resources. Finally, an optimization strategy for runtime scheduling of computing resources and model tasks was developed, using improved device streams to maximize the overlap between computation and communication, thus improving the utilization of computing resources and reducing training time. Experimental results show that the APapo method achieves fine-grained task segmentation, maximizes the utilization of each GPU's computing resources, and on average improves the training speed of large-scale deep neural network models by 2.8 times over existing parallel optimization methods while maintaining model training accuracy.
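The benefit of micro-batch pipeline scheduling can be sketched with a toy model. This is not APapo's scheduler; it only illustrates why splitting a batch into micro-batches keeps pipeline stages busy on different micro-batches at the same tick.

```python
# Toy forward-only pipeline schedule (illustrative, not APapo):
# with S stages and M micro-batches, stage s processes micro-batch m
# at tick s + m, so stages overlap on different micro-batches.
def pipeline_schedule(stages, micro_batches):
    ticks = stages + micro_batches - 1
    table = []
    for t in range(ticks):
        row = []
        for s in range(stages):
            m = t - s
            row.append(m if 0 <= m < micro_batches else None)
        table.append(row)
    return table
```

The `None` entries at the start and end are the pipeline "bubbles"; increasing the number of micro-batches shrinks their share of total ticks, which is what fine-grained micro-batch scheduling exploits.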
•We proposed an improved optimization strategy for parallel task scheduling of the pipeline model: we established a multi-iteration asynchronous parallel task management mechanism suited to large-scale model computing tasks, designed the overall execution framework for computing resources and model tasks to solve the model partitioning and device allocation problem, and addressed gradient delay updating during asynchronous iteration by controlling the micro-batch unit scheduling process.
•We proposed a model segmentation method based on augmented antichains. By transforming the computational tasks of a large-scale DNN model, we constructed an antichain Directed Acyclic Graph (DAG) state sequence that conforms to the computational iteration specification. On this basis, taking the characteristics of the hardware computing resources into account, tasks are segmented through dynamic programming to achieve a reasonable match between computing tasks and computing resources.
•We designed a runtime scheduling strategy for computing resources and tasks. By optimizing the default device streams, the dependency between computation and communication is eliminated to the maximum extent, the overlap between the two is maximized, the utilization of computing resources is improved, and the training speed of large-scale deep neural network models is increased while model training accuracy is preserved.
Dorreh Nadereh is a historical-literary text that is difficult to engage with because of its stilted, artificial language and its distance from natural speech. At the same time, through defamiliarization and the use of formal techniques, it takes on a distinctive literary quality. In the Dorreh Nadereh, Mirza Mehdi Khan Astarabadi gave his narration a literary aspect, so that linguistic foregrounding is visible in it. Since the language of the Dorreh is shaped by defamiliarization, the aspects of defamiliarization in its discourse can be retrieved and explained. The present study examines foregrounding in the Dorreh Nadereh from a formalist perspective (the addition rule) and analyzes the devices that realize the literariness of the text by creating balance and acting on the outward form of language. Among these linguistic techniques are phonetic balance (alliteration, the magic of proximity, semantic sound), lexical balance (complete and incomplete similarity), and syntactic balance (co-occurring and substituting elements governed by the same rule). In a descriptive-analytical study based on library sources, the authors examine the aspects of balance in the text of the Dorreh Nadereh by citing and analyzing examples.
Keywords: Dorreh Nadereh; Foregrounding; Parallelism; Phono-semantic