The B-tree and its variants are an indispensable tool for database systems and applications. Hence the efficiency of the B-tree is one of the few critical factors that determine the performance of a ...database system. In main-memory database systems, the computational overhead intrinsic in the B-tree algorithms for branching becomes the dominant factor in performance. In this paper, we propose yet another but disruptive variant of the B+-tree called the DB+-tree that redesigns the node structure for faster branching operations. The novel branching algorithm of the DB+-tree can be implemented in an O(1) number of SIMD and other sequential instructions, which supports fast branching, and this leads to efficient point search, range search, and update operations.
•DB+ tree redesigns the node structure of B+ tree for faster branching operation•Our branching algorithm can be implemented in an O(1) number of instructions•DB+ tree performs point search 170% faster than pkB-tree.•DB+ tree performs range search 130% faster than pkB-tree.•DB+ tree performs insertions and deletions 140% faster than pkB-tree.
Designing Arithmetic Logic Unit (ALU) is a combinational logic problem. As ALU has a regular pattern, it can be broken into identical stages connected into cascade through carry chain. We have ...designed one stage of ALU and then duplicated it depending upon the size required. The design has been tested for 4, 8, 16, 32 and 64- bit width. The idea is resource sharing and functionality sharing technique to design an ALU that leads to a significant saving of resources. Different functionality has been obtained by using a single resource (parallel adder) with different inputs at different times through control circuit. The design through this approach leads to a significant reduction in hardware requirement. The design is implemented in 3s700anfgg484-4 FPGA. Significant reduction in hardware has been achieved. The hardware used has been compared with normal function by function design. Resources saving of 66% have been observed for 4-bit wide ALU implementation on FPGA. For 8 and 16-bit implementation the saving obtained is 65%. A hardware saving of 60% has been obtained for 32 and 64-bit implementation.
In this paper, we propose a new lightweight block cipher named RECTANGLE. The main idea of the design of RECTANGLE is to allow lightweight and fast implementations using bit-slice techniques. ...RECTANGLE uses an SP-network. The substitution layer consists of 16 4×4 S-boxes in parallel. The permutation layer is composed of 3 rotations. As shown in this paper, RECTANGLE offers great performance in both hardware and software environment, which provides enough flexibility for different application scenario. The following are 3 main advantages of RECTANGLE. First, RECTANGLE is extremely hardware-friendly. For the 80-bit key version, a one-cycle-per-round parallel implementation only needs 1600 gates for a throughput of 246 Kbits/s at 100 kHz clock and an energy efficiency of 3.0 pJ/bit. Second, RECTANGLE achieves a very competitive software speed among the existing lightweight block ciphers due to its bit-slice style. Using 128-bit SSE instructions, a bit-slice implementation of RECTANGLE reaches an average encryption speed of about 3.9 cycles/byte for messages around 3000 bytes. Last but not least, we propose new design criteria for the RECTANGLE S-box. Due to our careful selection of the S-box and the asymmetric design of the permutation layer, RECTANGLE achieves a very good security-performance tradeoff. Our extensive and deep security analysis shows that the highest number of rounds that we can attack, is 18 (out of 25).
An optimized implementation of S-boxes has a significant impact on the performance of cryptographic primitives. SAT-based methods can find optimal implementations for moderately sized S-boxes but ...their efficiency decreases when handling complex S-boxes. To improve the efficiency of the implementations, we propose two different methods, namely OR-encoding and IF-encoding, to encode the implementations of S-boxes. Furthermore, we also simplify the encoding of the outputs of logic gates and introduce new SAT-based search methods to optimize the implementations of S-boxes. Finally, to get a better trade-off between the search results (optimized implementations of S-boxes) and the search efficiency (in terms of time complexity), an encoding scheme using local solutions is proposed. Compared to the previous methods, our algorithms are relatively simple and more efficient. For instance, when a serial software implementation is considered, then the S-boxes of Sycon, ASCON, and the <inline-formula> <tex-math notation="LaTeX">\chi</tex-math> </inline-formula> function in Xoodyak, require 6, 1, and 2 fewer programming instructions, respectively, than the best known methods. Similar improvements are obtained for hardware implementations of S-boxes in some cryptographic primitives (e.g. LBlock, RECTANGLE, PRESENT/PHOTON-Beetle, TWINE, and ASCON), with the saving of gate equivalent (GE) that range from 1.67GE to 5.34GE compared to the current best implementations. Furthermore, our model can be applied to 6-bit, 7-bit, and 8-bit S-boxes, when the considered S-boxes are of low complexity.
Abstract
Masking is a well used and widely deployed countermeasure against side channel attacks, both in software and hardware. With masking comes at a great cost, search has focused on how to lower ...a performance penalty or find efficient masking implementation. In particular, our contribution is 2-fold: for software masking, we first find bitsliced implementations of Sbox with Multiplicative Complexity 4 and Multiplicative Depth 2, then adapt the common shares approach introduced by Coron et al. at CHES 2016 to make many cross-products $a_{i}\cdot b_{j}$ can be reuse for parallel ISW-based 32-bit nonlinear operations. Therefore, we improve the efficiency of 2$\times b/4/32$ parallel high-order masking of ISW scheme for RECTANGLE, TANGRAM and KNOT on 32-bit ARM embedded microprocessor, with roughly a 13%-34% speed-up, at cost of $(1+d) \times 32$-bit randomness. For hardware masking, 4 bit cubic Sboxes with quadratic decomposition length 2, including RECTANGLE, TANGRAM, KNOT and LWC third-round candidates, can be implemented with a 3-share and 4-share threshold implementation (TI) by decomposing cubic permutations $S$ as a composition of sub-permutations having lower algebraic degrees. We use two decomposition form: one composition of two quadratic permutations $G$ and $F$, $S = F\circ G$, is for efficiency; the other composition of some linear permutations $A_i$ and one quadratic permutation $G$, $S=A_3 \circ G \circ A_2 \circ G \circ A_1 $, is for reducing the area requirements. For $S = F\circ G$, we introduce a new approach of searching through all possible quadratic permutations $G$ with 2$^{25.71}$, which is effcient than 2$^{26.23}$ in Poschmann et al. at J. Cryptol 2011. For $S=A_3 \circ G \circ A_2 \circ G \circ A_1 $, our approach of finding $A_i$ with complexity 2$^{27.71} $, which is effcient than the method introduced by Moradi et al. at ASIACRYPT 2016. In addition, we proposes a new decomposition that $S=G \circ A_2 \circ G \circ A_1 $. We can find the fastest and the smallest hard-ware decomposition implementation of 4-bit permutations for TI with 3 and 4 shares.
IoT devices have been widely used with the advent of 5G. These devices contain a large amount of private data during transmission. It is primely important for ensuring their security. Therefore, we ...proposed a lightweight block cipher based on dynamic S-box named DBST. It is introduced for devices with limited hardware resources and high throughput requirements. DBST is a 128-bit block cipher supporting 64-bit key, which is based on a new generalized Feistel variant structure. It retains the consistency and significantly boosts the diffusion of the traditional Feistel structure. The SubColumns of round function is implemented by combining bit-slice technology with subkeys. The S-box is dynamically associated with the key. It has been demonstrated that DBST has a good avalanche effect, low hardware area, and high throughput. Our S-box has been proven to have fewer differential features than RECTANGLE S-box. The security analysis of DBST reveals that it can against impossible differential attack, differential attack, linear attack, and other types of attacks.
A highly efficient approximate addition plays a vital role in arithmetic operations. A special addition mechanism employed in the proposed work consists of both exact and approximate modes of ...operation suitable for both error tolerant and exact applications. The proposed adder is more area and power efficient compared with other conventional approximate carry look-ahead adders. It is constructed by splitting the input into two parts namely, approximate part and augmenting part. If both produce carry it will be exact output, while the carry produced by approximate part will be imprecise. Based on this mechanism an efficient reconfigurable carry look-ahead adder is designed and applied to a bit-slice processor. Bit slicing is technique to construct a processor by using n-bit CPU. The error rate is also minimized in the proposed technique by 10% compared to existing designs. Compared with conventional approximate adder, the proposed reconfigurable approximate adder produce better results in terms of area and power
Deep neural networks (DNNs) have achieved high performance in many AI fields such as 1-D language, 2-D image, and 3-D point cloud processing applications. Since recent DNN tasks require dense matrix ...operations with various bit-precision and non-ReLU activation functions, mobile neural processing units (NPUs) suffer from the acceleration of diverse DNN tasks within their limited hardware resources and power budget. Although bit-slice architectures benefit from slice-level computation and slice-level sparsity exploitation, the conventional bit-slice representation is inefficient in bit-slice architectures resulting in poor dense DNN execution. This paper proposes the efficient signed bit-slice architecture, Sibia, with the signed bit-slice representation (SBR) for efficient dense DNN acceleration. The SBR adds a sign bit to each bit-slice and changes signed 1111 2 bit-slice to 0000 2 by borrowing a value of 1 from its lower order of the bit-slice. This scheme generates large numbers of zero bit-slices in dense DNNs even not relying on accuracy-sensitive pruning methods or retraining processes. Moreover, the SBR balances positive and negative values of 2's complement data, allowing accurate bit-slice-based output speculation that pre-computes high orders of bit-slices. Sibia integrates the signed multiplier-and-accumulate (MAC) units for efficient signed bit-slice computations, and the flexible zero skipping processing element (PE) supports the zero input bit-slice skipping and output skipping for high throughput and energy-efficiency. Additionally, the dynamic sparsity monitoring unit monitors sparsity ratio between input and weight data and determines the more sparse one for zero bit-slice skipping. The heterogeneous network-on-chip (NoC) benefits from data reusability during bit-slice computation, reducing transmission bandwidth. Finally, Sibia outperforms the previous bit-slice architecture, Bit-fusion, over 3.65× higher area-efficiency, 3.88× higher energy-efficiency, and 5.35× higher throughput.
Regular structures, like datapath, are important components of integrated circuits. Datapath logic is usually placed with high regularity and compactness for higher performance by using manual ...placement. The authors propose effective datapath regularity extraction and placement (DREP) techniques which simultaneously place datapath logic and random logic. This method detects datapath logic and effectively formats regular datapath structures while optimizing the order of functional stages and placement of datapath blocks. Moreover, the datapath structures are further optimized by using bit-slice order adjustment and partitioning techniques during global placement. Partitioning of a big datapath macro greatly increases the placement flexibility, since partitioned sub-blocks of the datapath macro can be optimally placed with other blocks. A new effective method is also suggested to decide the block to be partitioned and the granularity of partitioning. Similar to the manual placement results, the datapath logic is regularly placed and the datapath cells are aligned well, vertically or horizontally by the DREP techniques. When compared with the state-of-the-art works, the experimental results show that the new techniques produce significantly better results than other methods in terms of half perimeter wire length, Steiner wire length, and routability, measured from the detail routing results.
Data path circuits are regular and best placed in bit-sliced pattern for improved Quality of Results such as timing, power, congestion and area. The cells in a column of bit slice structure are ...normally aligned on control pins or clock pins for straight routes, reducing power. The traditional way of placing data path circuits, using separate data path placer and then bringing them as macro in main design has its significant disadvantages. It is important for modern day placement tool to place random logic and data path circuits concurrently respecting the regularity that a data path circuit has by placing them in bit-sliced manner. It is not only important to place the data path elements in bit-sliced pattern but also that structure has to be maintained throughout the flow. The different set of optimization tricks can be applied to different bits of data path which can destroy the identical footprints of cells in column and that brings challenge of maintaining pin alignment. In addition to that, in lower nanometer nodes, fixed physical only cells pre-placed throughout the core area pose challenge of keeping bit-sliced structure intact. A flow for handling data path circuits in a place and route tool along with an algorithm for bit slice tiling is being proposed in this paper which addresses the challenges mentioned above.