This paper presents the implementation of a video segmentation unit used for embedded automated video surveillance systems. Various aspects of the underlying segmentation algorithm are explored, and modifications are made that can potentially improve segmentation results and hardware efficiency. In addition, to achieve real-time performance with high-resolution video streams, a dedicated hardware architecture with streamlined dataflow and memory access reduction schemes is developed. The whole system is implemented on a Xilinx field-programmable gate array platform, capable of real-time segmentation with VGA resolution at 25 frames per second. A memory bandwidth reduction of more than 70% is achieved by utilizing pixel locality as well as wordlength reduction. The hardware platform is intended as a real-time testbench, especially for observing long-term effects with different parameter settings.
This paper presents architectures for supporting dynamic data scaling in pipeline fast Fourier transforms (FFTs), suitable when implementing large FFTs in applications such as digital video broadcasting and digital holographic imaging. In a pipeline FFT, data is continuously streaming and must, hence, be scaled without stalling the dataflow. We propose a hybrid floating-point scheme with a tailored exponent datapath, and a co-optimized architecture between hybrid floating point and block floating point (BFP) to reduce memory requirements for 2-D signal processing. The presented co-optimization generates a higher signal-to-quantization-noise ratio and requires less memory than, for instance, convergent BFP. A 2048-point pipeline FFT has been fabricated in a standard CMOS process from AMI Semiconductor (Lenart and Owall, 2003), and a field-programmable gate array prototype integrating a 2-D FFT core in a larger design shows that the architecture is suitable for image reconstruction in digital holographic imaging.
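As a rough illustration of the block floating point idea mentioned above, the following sketch quantizes a block of integers to signed mantissas that share a single exponent. It is not the architecture from the paper: the function names `bfp_quantize` and `bfp_reconstruct`, the 8-bit mantissa width, and the sample values are all assumptions chosen for illustration.

```python
def bfp_quantize(block, mantissa_bits):
    """Quantize integers to signed mantissas that share one block exponent.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    limit = (1 << (mantissa_bits - 1)) - 1  # largest positive mantissa
    peak = max(abs(v) for v in block)
    exponent = 0
    # Grow the shared exponent until the largest magnitude fits.
    while (peak >> exponent) > limit:
        exponent += 1
    # Python's >> on ints is an arithmetic shift (rounds toward -infinity).
    mantissas = [v >> exponent for v in block]
    return mantissas, exponent


def bfp_reconstruct(mantissas, exponent):
    """Scale the mantissas back by the shared exponent."""
    return [m << exponent for m in mantissas]


block = [1000, -3, 40, -512]          # made-up sample values
mantissas, exponent = bfp_quantize(block, mantissa_bits=8)
restored = bfp_reconstruct(mantissas, exponent)
```

With these sample values the shared exponent is set by the largest sample, so the small value -3 is reconstructed as -8. This loss of precision for small values inside a block with a large peak is the kind of quantization noise that scaling schemes try to limit.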
Mathematical morphology with spatially variant structuring elements outperforms translation-invariant structuring elements in various applications and has been studied in the literature over the years. However, supporting a variable structuring element shape imposes an overwhelming computational complexity, which increases dramatically with the size of the structuring element. Limiting the supported class of structuring elements to rectangles has allowed a fast algorithm to be developed, which is efficient in terms of number of operations per pixel and has a low memory requirement and low latency. These properties make this algorithm useful in both software and hardware implementations, not only for spatially variant but also for translation-invariant morphology. This paper also presents a dedicated hardware architecture intended to be used as an accelerator in embedded system applications, with corresponding implementation results when targeted for both field-programmable gate arrays and application-specific integrated circuits.
This article describes and evaluates algorithms and their hardware architectures for binary morphological erosion and dilation. In particular, a fast stall-free low-complexity architecture is proposed that takes advantage of the morphological duality principle and structuring element (SE) decomposition. The design is intended to be used as a hardware accelerator in real-time embedded processing applications. Hence, the aim is to minimize the number of operations, memory requirement, and memory accesses per pixel. The main advantage of the proposed architecture is that for the common class of flat and rectangular SEs, the complexity and the number of memory accesses per pixel are low and independent of both image and SE size. The proposed design is compared to the more common delay-line architecture in terms of complexity, memory requirements and execution time, both for an actual implementation and as a function of image resolution and SE size. The architecture is implemented for the UMC 0.13-µm CMOS process using a resolution of 640 × 480. A maximum SE of 63 × 63 is supported at an estimated clock frequency of 333 MHz.
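The two principles named above, SE decomposition and duality, can be sketched in software as follows. This is a minimal sketch and not the proposed hardware architecture: the function names, the restriction to odd SE sizes centered on the pixel, and the border padding values are assumptions made for illustration. A flat rectangular SE is decomposed into a horizontal pass followed by a vertical pass, and dilation is obtained by eroding the complemented image and complementing the result.

```python
def erode_1d(line, size, pad=1):
    """Erode a binary sequence with a flat, centered window of odd length `size`."""
    r = size // 2
    padded = [pad] * r + list(line) + [pad] * r
    return [min(padded[i:i + size]) for i in range(len(line))]


def erode_rect(img, se_w, se_h, pad=1):
    """Erosion by an se_w x se_h rectangle, decomposed into two 1-D passes."""
    rows = [erode_1d(row, se_w, pad) for row in img]          # horizontal pass
    cols = [erode_1d(col, se_h, pad) for col in zip(*rows)]   # vertical pass
    return [list(row) for row in zip(*cols)]                  # transpose back


def erode_direct(img, se_w, se_h, pad=1):
    """Reference 2-D erosion without decomposition, for comparison."""
    h, w = len(img), len(img[0])
    rx, ry = se_w // 2, se_h // 2

    def px(y, x):
        return img[y][x] if 0 <= y < h and 0 <= x < w else pad

    return [[min(px(y + dy, x + dx)
                 for dy in range(-ry, ry + 1)
                 for dx in range(-rx, rx + 1))
             for x in range(w)] for y in range(h)]


def dilate_rect(img, se_w, se_h):
    """Dilation via duality: complement, erode, complement (border treated as 0)."""
    inv = [[1 - p for p in row] for row in img]
    return [[1 - p for p in row] for row in erode_rect(inv, se_w, se_h, pad=1)]


# Made-up test image: a 3 x 3 block of ones inside a 5 x 5 frame.
square = [[1 if 1 <= y <= 3 and 1 <= x <= 3 else 0 for x in range(5)]
          for y in range(5)]
eroded = erode_rect(square, 3, 3)
dilated_back = dilate_rect(eroded, 3, 3)
```

In this example, eroding the 3 × 3 block with a 3 × 3 SE leaves only the center pixel, and dilating that pixel with the same SE restores the block. The decomposed version needs on the order of `se_w + se_h` comparisons per pixel instead of `se_w * se_h` for the direct version.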
This paper discusses the impact of flexibility when designing a Viterbi decoder for both convolutional and TCM codes. Different trade-offs have to be considered in choosing the right architecture for the processing blocks, and the resulting hardware penalty is evaluated. We study the impact of symbol quantization, which degrades performance and affects the wordlength of the rate-flexible trellis datapath. A radix-2-based architecture for this datapath substantially relaxes the hardware requirements on the branch metric and survivor path blocks. The cost of flexibility in terms of cell area and power consumption is explored by an investigation of synthesized designs that provide different transmission rates. Two designs are fabricated in a digital 0.13-µm CMOS process. Based on post-layout simulations, a symbol rate of 168 Mbaud is achieved in TCM mode, equivalent to a maximum throughput of 840 Mbit/s using a 64-QAM constellation.
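The relation between the quoted symbol rate and throughput can be checked with a few lines of arithmetic. The two rates and the constellation size are taken from the abstract; the interpretation of the remaining bit per symbol as trellis-code redundancy is an assumption and is not stated in the abstract.

```python
import math

symbol_rate_mbaud = 168     # from the abstract
throughput_mbit_s = 840     # from the abstract
constellation_size = 64     # 64-QAM, from the abstract

# Bits carried by one 64-QAM symbol.
coded_bits_per_symbol = math.log2(constellation_size)

# Information bits per symbol implied by the two quoted rates.
info_bits_per_symbol = throughput_mbit_s / symbol_rate_mbaud

# Assumption: the difference is redundancy added by the trellis code.
redundant_bits_per_symbol = coded_bits_per_symbol - info_bits_per_symbol
```

The quoted figures imply 5 information bits per 6-bit symbol, which would be consistent with a trellis code that adds one redundant bit per symbol.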
This paper presents a digital hardware implementation of a novel wavelet-based event detector suitable for the next generation of cardiac pacemakers. Significant power savings are achieved by introducing a second operation mode that shuts down 2/3 of the hardware for long time periods when the pacemaker patient is not exposed to noise, while not degrading performance. Due to a 0.13-µm CMOS technology and the low clock frequency of 1 kHz, leakage power becomes the dominating power source. By introducing sleep transistors in the power-supply rails, leakage power of the hardware being shut off is reduced by 97%. Power estimation at the RTL level shows that the overall power consumption is reduced by 67% with a dual operation mode. Under these conditions, the detector is expected to operate in the sub-µW region. Detection performance is evaluated by means of databases containing electrograms to which five types of exogenic and endogenic interferences are added. The results show that reliable detection is obtained at moderate and low signal-to-noise ratios (SNRs). Average detection performance in terms of detected events and false alarms for 25-dB SNR is P_D = 0.98 and P_FA = 0.014, respectively.
This paper evaluates the hardware aspects of multicarrier faster-than-Nyquist (FTN) signaling transceivers. The choice of time-frequency spacing of the symbols in an FTN system for improved bandwidth efficiency is targeted towards efficient hardware implementation. This work proposes a hardware architecture for the realization of iterative decoding of FTN multicarrier modulated signals. Compatibility with existing systems has been considered for smooth switching between the faster-than-Nyquist and orthogonal signaling schemes; one such consideration is the use of fast Fourier transforms (FFTs) for multicarrier modulation. The performance of the fixed-point model is very close to that of the floating-point representation. The impact of system parameters such as number of projection points, time-frequency spacing, and finite wordlengths, and their design tradeoffs for reduced-complexity iterative decoders in FTN systems, has been investigated. The FTN decoder has been designed and synthesized in both 65-nm CMOS and FPGA. From the hardware resource usage numbers it can be concluded that FTN signaling can be used to achieve higher bandwidth efficiency with acceptable complexity overhead.
This brief proposes a new class of hybrid VLSI architectures for survivor path processing to be used in Viterbi decoders. The architecture combines the benefits of register exchange and traceforward algorithms, that is, low storage requirement and latency versus implementation efficiency. Based on a structural comparison, it becomes evident that the architecture can be efficiently applied to codes with a larger number of states where traceback-based architectures, which increase latency, are usually dominant.
This paper presents an iterative decoder for faster-than-Nyquist (FTN) and orthogonal signaling multi-carrier systems. FTN signaling is a method of improving bandwidth efficiency at the expense of higher processing complexity in the transceiver. The decoder can switch between orthogonal and FTN signaling modes and exploits channel properties to improve bandwidth efficiency. The decoder is fabricated in a 65-nm CMOS process and occupies a total area of 0.8 mm², with the decoder core taking up 0.567 mm². The power consumption of the chip is 9.6 mW at 1.2 V when clocked at 100 MHz, providing a peak information throughput of 1 Mbps with an energy efficiency of 0.6 nJ per bit per iteration. To the best of our knowledge, these measurement results are from the first ever silicon implementation of a decoder for FTN signaling.
This paper presents a hardware architecture of a pulse shaping filter used in multicarrier systems. The filter can be configured to be used for both transmitter and receiver with limited overhead. A generic implementation complexity analysis for a filter in a multicarrier system with N sub-carriers is presented, while the implemented architecture is for a system with 128 sub-carriers. The pulse shaping filter is part of a larger system based on faster-than-Nyquist signaling and aided in an overall complexity reduction. Hence, designing an efficient hardware architecture to keep the overhead moderate was the motivation behind this work. Architectural optimizations have been carried out in order to reduce area and power. The implementation of the proposed hardware architecture was carried out using a 65-nm CMOS process. The chip core occupies an area of 0.11 mm² and is estimated to consume 14.4 mW of power when running at 200 MHz.