Large-scale (or massive) multiple-input multiple-output (MIMO) is expected to be one of the key technologies in next-generation multi-user cellular systems based on the upcoming 3GPP LTE Release 12 standard, for example. In this work, we propose, to the best of our knowledge, the first VLSI design enabling high-throughput data detection in single-carrier frequency-division multiple access (SC-FDMA)-based large-scale MIMO systems. We propose a new approximate matrix inversion algorithm relying on a Neumann series expansion, which substantially reduces the complexity of linear data detection. We analyze the associated error, and we compare its performance and complexity to those of an exact linear detector. We present corresponding VLSI architectures, which perform exact and approximate soft-output detection for large-scale MIMO systems with various antenna/user configurations. Reference implementation results for a Xilinx Virtex-7 XC7VX980T FPGA show that our designs are able to achieve more than 600 Mb/s for a 128-antenna, 8-user 3GPP LTE-based large-scale MIMO system. We finally provide a performance/complexity trade-off comparison using the presented FPGA designs, which reveals that the detector circuit of choice is determined by the ratio between BS antennas and users, as well as the desired error-rate performance.
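As an illustration of the Neumann series idea named in this abstract, the following minimal sketch approximates the inverse of a diagonally dominant matrix A = D + E (D diagonal, E off-diagonal) by the truncated sum of (-D^{-1}E)^k D^{-1} for k = 0, ..., K-1. It is not the authors' VLSI algorithm: the function names, the real-valued example matrix, and the residual measure are assumptions made for this example, and the series only converges when the off-diagonal part is small relative to the diagonal.

```python
def mat_mul(X, Y):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def neumann_inverse(A, terms):
    """Approximate A^{-1} with the first `terms` terms of the Neumann series.

    A is split as D + E (diagonal plus off-diagonal part), so that
    A^{-1} = sum_k (-D^{-1} E)^k D^{-1} whenever the series converges.
    """
    n = len(A)
    d_inv = [1.0 / A[i][i] for i in range(n)]
    # M = -D^{-1} E, which has a zero diagonal.
    M = [[0.0 if i == j else -d_inv[i] * A[i][j] for j in range(n)] for i in range(n)]
    term = [[d_inv[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
    approx = [row[:] for row in term]  # k = 0 term is D^{-1}
    for _ in range(1, terms):
        term = mat_mul(M, term)  # next term: M^k D^{-1}
        approx = [[a + t for a, t in zip(ra, rt)] for ra, rt in zip(approx, term)]
    return approx


def max_residual(A, X):
    """Largest absolute entry of I - A X, a simple accuracy measure."""
    P = mat_mul(A, X)
    n = len(A)
    return max(abs((1.0 if i == j else 0.0) - P[i][j]) for i in range(n) for j in range(n))


# Hypothetical diagonally dominant example (stands in for a regularized Gram matrix).
A = [[4.0, 1.0, 0.5],
     [1.0, 5.0, 1.0],
     [0.5, 1.0, 6.0]]
for k in (1, 3, 20):
    print(k, max_residual(A, neumann_inverse(A, k)))
```

The one-term case reduces to inverting only the diagonal, and each extra term costs one more matrix product, so the number of terms trades accuracy against complexity.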
Data detection in massive multi-user (MU) multiple-input multiple-output (MIMO) wireless systems is among the most critical tasks due to the excessively high implementation complexity. In this paper, we propose a novel, equalization-based soft-output data-detection algorithm and corresponding reference FPGA designs for wideband massive MU-MIMO systems that use orthogonal frequency-division multiplexing (OFDM). Our data-detection algorithm performs approximate minimum mean-square error (MMSE) or box-constrained equalization using coordinate descent. We deploy a variety of algorithm-level optimizations that enable near-optimal error-rate performance at low implementation complexity, even for systems with hundreds of base-station (BS) antennas and thousands of subcarriers. We design a parallel VLSI architecture that uses pipeline interleaving and can be parametrized at design time to support various antenna configurations. We develop reference FPGA designs for massive MU-MIMO-OFDM systems and provide an extensive comparison to existing designs in terms of implementation complexity, throughput, and error-rate performance. For a 128 BS antenna, 8-user massive MU-MIMO-OFDM system, our FPGA design outperforms the next-best implementation by more than 2.6× in terms of throughput per FPGA look-up table.
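The coordinate-descent equalization mentioned in this abstract can be sketched as follows. This is a textbook-style illustration under stated assumptions, not the paper's optimized algorithm: it minimizes ||y - Hx||^2 + n0 ||x||^2 one user at a time, and an optional box clip of the real and imaginary parts stands in for the box-constrained variant. The function name `cd_equalize` and all parameter names are hypothetical.

```python
def cd_equalize(H, y, n0, iters, box=None):
    """Coordinate descent on ||y - H x||^2 + n0 ||x||^2.

    H is a list of rows (BS antennas) with one complex entry per user,
    y is the received vector, n0 the regularizer (noise variance).
    If `box` is given, real and imaginary parts are clipped to [-box, box].
    """
    num_ant, num_users = len(H), len(H[0])
    cols = [[H[b][u] for b in range(num_ant)] for u in range(num_users)]
    x = [0j] * num_users
    r = list(y)  # residual y - H x, valid because x starts at zero
    for _ in range(iters):
        for u in range(num_users):
            h = cols[u]
            energy = sum(abs(v) ** 2 for v in h)
            # h^H (r + h x_u): correlate with the residual that excludes user u.
            num = sum(v.conjugate() * rb for v, rb in zip(h, r)) + energy * x[u]
            new = num / (energy + n0)
            if box is not None:
                new = complex(max(-box, min(box, new.real)),
                              max(-box, min(box, new.imag)))
            delta = new - x[u]
            r = [rb - v * delta for rb, v in zip(r, h)]  # keep residual current
            x[u] = new
    return x


# Hypothetical 4-antenna, 2-user noiseless example.
H = [[1, 0.2j], [0.1, 1], [1j, 0.1], [0.2, -1j]]
x_true = [1 + 1j, -1 + 1j]
y = [sum(hb * xu for hb, xu in zip(row, x_true)) for row in H]
print(cd_equalize(H, y, n0=0.0, iters=20))
```

Each coordinate update needs only inner products with one channel column, which is why no explicit matrix inversion appears anywhere in the loop.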
Linear data-detection algorithms that build on zero forcing (ZF) or linear minimum mean-square error (L-MMSE) equalization achieve near-optimal spectral efficiency in massive multi-user multiple-input multiple-output (MU-MIMO) systems. Such algorithms, however, typically rely on centralized processing at the base station (BS), which results in 1) excessive interconnect and chip input/output (I/O) data rates and 2) high computational complexity. Decentralized baseband processing (DBP) partitions the BS antenna array into independent clusters that are associated with separate radio-frequency circuits and computing fabrics in order to overcome the limitations of centralized processing. In this paper, we investigate decentralized equalization with feedforward architectures that minimize the latency bottlenecks of existing DBP solutions. We propose two distinct architectures with different interconnect and I/O bandwidth requirements that fuse the local equalization results of each cluster in a feedforward network. For both architectures, we consider maximum ratio combining, ZF, L-MMSE, and a nonlinear equalization algorithm that relies on approximate message passing. For these algorithms and architectures, we analyze the associated post-equalization signal-to-noise-and-interference ratio. We provide reference implementation results on a multi-graphics-processing-unit (multi-GPU) system, which demonstrate that decentralized equalization with feedforward architectures enables throughputs in the Gb/s regime and incurs no or only a small performance loss compared to centralized solutions.
This paper proposes a novel time-based method for determining the position of an IEEE 802.11g transmitter using multiple mutually synchronized 802.11g receivers. By means of baseband signal processing, the proposed algorithm obtains a high-resolution estimate of the time of arrival (TOA) of the long training sequence symbol at each receiver. An estimate of the position of the transmitter is obtained based on the estimation of the time differences of arrival (TDOA) of the symbols and the known fixed locations of the receivers. This paper investigates the effects of carrier and sampling clock offsets, in both frequency and phase, between nodes on the TOA and TDOA estimation error. In real-world experiments in a line-of-sight, low-multipath indoor environment, the method was found to achieve mean errors of 42 cm per symbol for 1-D and 1.39 m per symbol for 2-D position estimation, for ranges of up to 25 m.
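To make the TDOA-to-position step concrete, here is a minimal 1-D sketch. It assumes an idealized geometry that is not taken from the paper: two perfectly synchronized receivers at known positions with the transmitter on the line between them, so that the TDOA maps to position in closed form. The function names are hypothetical.

```python
SPEED_OF_LIGHT = 299_792_458.0  # propagation speed in m/s


def tdoa_1d(x_tx, x_rx1, x_rx2):
    """Ideal time difference of arrival (receiver 1 minus receiver 2), in seconds."""
    return (abs(x_tx - x_rx1) - abs(x_tx - x_rx2)) / SPEED_OF_LIGHT


def locate_1d(tdoa, x_rx1, x_rx2):
    """Transmitter position for x_rx1 <= x_tx <= x_rx2.

    Between the receivers, d1 - d2 = 2*x_tx - x_rx1 - x_rx2, hence
    x_tx = (x_rx1 + x_rx2 + c * tdoa) / 2.
    """
    return 0.5 * (x_rx1 + x_rx2 + SPEED_OF_LIGHT * tdoa)


# Receivers 25 m apart, transmitter 10 m from the first one.
t = tdoa_1d(10.0, 0.0, 25.0)
print(locate_1d(t, 0.0, 25.0))          # recovers the true position
print(locate_1d(t + 1e-9, 0.0, 25.0))   # same measurement with a 1 ns timing error
```

In this geometry a timing error of 1 ns moves the estimate by about 15 cm, which indicates why clock offsets between nodes matter for the estimation error.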
Non-binary low-density parity-check (NB-LDPC) codes show higher error-correcting performance than binary low-density parity-check (LDPC) codes when the codeword length is moderate and/or the channel has bursts of errors. The need for high-speed decoders for future digital communications led to the investigation of optimized NB-LDPC decoding algorithms and efficient implementations that target high throughput and low energy consumption levels. We carried out a comprehensive survey of existing NB-LDPC decoding hardware that targets the optimization of these parameters. Even though existing NB-LDPC decoders are optimized with respect to computational complexity and memory requirements, they still lag behind their binary counterparts in terms of throughput, power and area optimization. This study contributes to an overall understanding of the state of the art in application-specific integrated-circuit (ASIC), field-programmable gate array (FPGA) and graphics processing unit (GPU) based systems, and highlights the current challenges that still have to be overcome on the path to more efficient NB-LDPC decoder architectures.
Achieving high spectral efficiency in realistic massive multi-user (MU) multiple-input multiple-output (MIMO) wireless systems requires computationally complex algorithms for data detection in the uplink (users transmit to base-station) and beamforming in the downlink (base-station transmits to user). Most existing algorithms are designed to be executed on centralized computing hardware at the base-station (BS), which results in prohibitive complexity for systems with hundreds or thousands of antennas and generates raw baseband data rates that exceed the limits of current interconnect technology and chip I/O interfaces. This paper proposes a novel decentralized baseband processing architecture that alleviates these bottlenecks by partitioning the BS antenna array into clusters, each associated with independent radio-frequency chains, analog and digital modulation circuitry, and computing hardware. For this architecture, we develop novel decentralized data detection and beamforming algorithms that only access local channel-state information and require low communication bandwidth among the clusters. We study the associated tradeoffs between error-rate performance, computational complexity, and interconnect bandwidth, and we demonstrate the scalability of our solutions for massive MU-MIMO systems with thousands of BS antennas using reference implementations on a graphics processing unit (GPU) cluster.
Implementation of receivers for spatial multiplexing multiple-input multiple-output (MIMO) orthogonal frequency-division multiplexing (OFDM) systems is considered. The linear minimum mean-square error (LMMSE) and the K-best list sphere detector (LSD) are compared to the iterative successive interference cancellation (SIC) detector and the iterative K-best LSD. The performance of the algorithms is evaluated in a 3G long-term evolution (LTE) system. The SIC algorithm is found to perform worse than the K-best LSD when the MIMO channels are highly correlated, while the performance difference diminishes when the correlation decreases. The receivers are designed for 2×2 and 4×4 antenna systems and three different modulation schemes. Complexity results for FPGA and ASIC implementations are found. A modification to the K-best LSD which increases its detection rate is introduced. The ASIC receivers are designed to meet the decoding throughput requirements in LTE, and the K-best LSD is found to be the most complex receiver although it gives the best reliable data transmission throughput. The SIC receiver has the best performance-complexity tradeoff in the system, but in the 4×4 case, the K-best LSD is the most efficient. A receiver architecture which could be reconfigured to use a simple or a more complex detector as the channel conditions change would achieve the best performance while consuming the least amount of power in the receiver.
We present an efficient VLSI architecture for 3GPP LTE/LTE-Advance Turbo decoder by utilizing the algebraic-geometric properties of the quadratic permutation polynomial (QPP) interleaver. The high-throughput 3GPP LTE/LTE-Advance Turbo codes require a highly-parallel decoder architecture. Turbo interleaver is known to be the main obstacle to the decoder parallelism due to the collisions it introduces in accesses to memory. The QPP interleaver solves the memory contention issues when several MAP decoders are used in parallel to improve Turbo decoding throughput. In this paper, we propose a low-complexity QPP interleaving address generator and a multi-bank memory architecture to enable parallel Turbo decoding. Design trade-offs in terms of area and throughput efficiency are explored to find the optimal architecture. The proposed parallel Turbo decoder has been synthesized, placed and routed in a 65-nm CMOS technology with a core area of 8.3 mm² and a maximum clock frequency of 400 MHz. This parallel decoder, comprising 64 MAP decoder cores, can achieve a maximum decoding throughput of 1.28 Gbps at 6 iterations.
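The contention-free property of the QPP interleaver mentioned in this abstract can be checked numerically. The sketch below uses the standard form pi(i) = (f1*i + f2*i^2) mod K and a simple banking rule (bank = address // window) that is assumed here for illustration and is not necessarily the paper's memory mapping. The parameters K = 40, f1 = 3, f2 = 10 are believed to correspond to the smallest LTE block size, but should be treated as an assumption and checked against the specification.

```python
def qpp_interleave(i, k, f1, f2):
    """Interleaved address pi(i) = (f1*i + f2*i^2) mod k."""
    return (f1 * i + f2 * i * i) % k


def bank_indices(step, k, f1, f2, num_decoders):
    """Memory banks accessed in the same cycle by parallel MAP decoders.

    Decoder t works on index step + t*window (window = k // num_decoders);
    the bank is the interleaved address divided by the window size.
    """
    window = k // num_decoders
    return [qpp_interleave(step + t * window, k, f1, f2) // window
            for t in range(num_decoders)]


K, F1, F2 = 40, 3, 10  # assumed example parameters
addresses = [qpp_interleave(i, K, F1, F2) for i in range(K)]
print(sorted(addresses) == list(range(K)))  # is pi a permutation of 0..K-1?
for step in range(K // 4):
    banks = bank_indices(step, K, F1, F2, 4)
    print(step, banks, len(set(banks)) == len(banks))  # all banks distinct?
```

When every cycle touches distinct banks, each parallel decoder can read or write its own bank without stalling, which is the property a multi-bank memory architecture relies on.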
GPU-Based, LDPC Decoding for 5G and Beyond. Tarver, Chance; Tonnemacher, Matthew; Chen, Hao ...
IEEE Open Journal of Circuits and Systems, 2021, Volume 2. Journal article, peer reviewed, open access.
In 5G New Radio (NR), low-density parity-check (LDPC) codes are included as the error correction codes (ECC) for the data channel. While LDPC codes enable a low, near Shannon capacity, bit error rate (BER), they also become a computational bottleneck in the physical layer processing. Moreover, 5G LDPC has new challenges not seen in previous LDPC implementations, such as Wi-Fi. The LDPC specification in 5G includes many reconfigurations to support a variety of rates, block sizes, and use cases. 5G also creates targets for supporting high-throughput and low-latency applications. For this new, flexible standard, traditional hardware-based solutions in FPGA and ASIC may struggle to support all cases and may be cost-prohibitive at scale. Software solutions can trivially support all possible reconfigurations but struggle with performance. This article demonstrates the high-throughput and low-latency capabilities of graphics processing units (GPUs) for LDPC decoding as an alternative to FPGA and ASIC decoders, effectively providing the high performance needed while maintaining the benefits of a software-based solution. In particular, we highlight how, by varying the parallelization strategy for mapping GPU kernels to blocks, we can use the many GPU cores to compute one codeword quickly to target low latency, or we can use the cores to work on many codewords simultaneously to target high-throughput applications. This flexibility is particularly useful for virtualized radio access networks (vRAN), a next-generation technology that is expected to become more prominent in the coming years. In vRAN, the hardware computational resources will become decoupled from the specific computational functions in the RAN through virtualization, allowing for benefits such as load-balancing, improved scalability, and reduced costs.
To highlight and investigate how the GPU can accelerate tasks such as LDPC decoding when containerizing vRAN functionality, we integrate our decoder into the Open Air Interface (OAI) NR software stack. With our GPU-based decoder, we measure a best-case latency of 87 μs and a best-case throughput of nearly 4 Gbps using the Titan RTX GPU.
Noncontiguous transmission schemes combined with high power-efficiency requirements pose big challenges for radio transmitter and power amplifier (PA) design and implementation. Due to the nonlinear nature of the PA, severe unwanted emissions can occur, which can potentially interfere with neighboring channel signals or even desensitize the device's own receiver in frequency division duplexing transceivers. In this paper, to suppress such unwanted emissions, a low-complexity subband digital predistortion solution, specifically tailored for spectrally noncontiguous transmission schemes in low-cost devices, is proposed. The proposed technique aims at mitigating only the selected spurious intermodulation distortion components at the PA output, hence allowing for substantially reduced processing complexity compared with classical linearization solutions. Furthermore, novel decorrelation-based parameter learning solutions are also proposed and formulated, which offer reduced computing complexity in parameter estimation as well as the ability to track time-varying features adaptively. Comprehensive simulation and RF measurement results are provided, using a commercial LTE-Advanced mobile PA, to evaluate and validate the effectiveness of the proposed solution in real-world scenarios. The obtained results demonstrate that highly efficient spurious component suppression can be obtained using the proposed solutions.