Task mapping has been a hot topic in multiprocessor system-on-chip software design for decades. During the mapping process, load balance (LB) and communication optimization have been two important ...performance optimization factors. This paper studies the relations between LB, interprocessor communications, and communication pipeline technique during the mapping process, and proposes an integer linear programming (ILP)-based static task mapping approach, which considers both LB and communication optimization. The approach consists of an optimized ILP model for task mapping with fewer variables compared to previous ILP mapping works. Moreover, to enhance the scalability of the ILP task mapping, the task-processor-cluster algorithm is proposed to reduce the scale of the task graph and the number of processors and then solve the coarse-grained input by the ILP mapping. To increase the adaptability of the ILP task mapping, the improved augmented <inline-formula> <tex-math notation="LaTeX">\epsilon </tex-math></inline-formula>-constraint method is further integrated with the ILP formulations to select the best mapping for different applications. Experimental results on a 2/4/8/16/24-CPU platform of both synthetic and real-life benchmarks demonstrate the efficiency of the proposed approach.
Separate hardware- and software-only engineering approaches cannot meet the increasingly complex requirements of embedded systems. HW/SW interface codesign would enable the integration of components ...in heterogeneous multiprocessors. The authors analyze the evolution of this approach and define a long-term roadmap for future success.
We present a design flow for the generation of application-specific multiprocessor architectures. In the flow, architectural parameters are first extracted from a high-level system specification. ...Parameters are used to instantiate architectural components, such as processors, memory modules and communication networks. The flow includes the automatic generation of communication coprocessor that adapts the processor to the communication network in an application-specific way. Experiments with two system examples show the effectiveness of the presented design flow.
For the design of classic computers the Parallel programming concept is used to abstract HW/SW interfaces during high level specification of application software. The software is then adapted to an ...existing multiprocessor platforms using a low level software layers that implement the programming model. Unlike classic computers, the design of heterogeneous MPSoC includes also building the processors and other kind of hardware components required to execute the software. In this case, the programming model hides both hardware and software refinements. This paper deals with parallel programming models to abstract both hardware and software Interfaces in the case of heterogeneous MPSoC design. Different abstraction levels will be needed. For the long term, the use of higher level programming models will open new vistas for optimization and architecture exploration like CPU/RTOS tradeoffs.
As a solution for dealing with the design complexity of multiprocessor SoC architectures, we present a joint Simulink-SystemC design flow that enables mixed hardware/software refinement and ...simulation in the early design process. First, we introduce the Simulink combined algorithm/architecture model (CAAM) unifying the algorithm and the abstract target architecture. From the Simulink CAAM, a hardware architecture generator produces architecture models at three different abstract levels, enabling a trade-off between simulation time and accuracy. A multithread code generator produces memory-efficient multithreaded programs to be executed on the architecture models. To show the applicability of the proposed design flow, we present experimental results on two real video applications.
The Intellectual Property (IP)-based design for high-throughput dedicated digital signal processing (DSP) systems is obviously an important issue for improving not only design productivity, but also ...design from the high level of abstraction. However, in some cases, synthesizable register transfer level (RTL) model obtained by an automatic assembly of RTL IPs can be wrong due to delays induced by implementation constraints. In this paper, we present the formalization of the problem and propose an approach called automatic delay correction method (ADCM) to solve the problem without inserting an extra interface circuitry. The approach automatically inserts control structures to manage delays induced by the use of RTL IPs. It also inserts a control structure to coordinate the execution of parallel clocked IPs. The delays may be managed by registers or by counters included in the control structure. A formal theory of ADCM is developed to guide the implementation and guarantee optimal solutions in latency and area. Through experiments with synthetic example and three real world high-throughput DSP circuits, we also show the effectiveness of our approach.
Reduction of the on-chip memory size is a key issue in video codec system design. Because video codec applications involve complex algorithms that are both data-intensive and control-dependent, ...memory optimization based on global and precise analysis of data and control dependency is required. We generate a memory-efficient C code from a restricted Simulink model, which can represent both data and control dependency explicitly, by applying two buffer memory optimization techniques: copy removal and buffer sharing. Copy removal is performed while parsing the Simulink model. Buffer sharing requires global scheduling and formal lifetime analysis. Experimental results on an H.264 video decoder show that the buffer memory size and execution time of the C code generated by the proposed method are 71% and 32% less than those of the C code produced by Simulink's C code generator, respectively. When compared to the hand written C code, the memory size was reduced by 27% while its execution time was increased by only 3%.
For the design of classic computers the parallel programming concept is used to abstract HW/SW interfaces during high level specification of application software. The software is then adapted to ...existing multiprocessor platforms using a low level software layer that implements the programming model. Unlike classic computers, the design of heterogeneous MPSoC includes also building the processors and other kind of hardware components required to execute the software. In this case, the programming model hides both hardware and software refinements. This paper deals with parallel programming models to abstract both hardware and software interfaces in the case of heterogeneous MPSoC design. Different abstraction levels will be needed. For the long term, the use of higher level programming models will open new vistas for optimization and architecture exploration like CPU/RTOS tradeoffs.
Emerging embedded systems require heterogeneous multiprocessor SoC architectures that can satisfy both high-performance and programmability. However, as the complexity of embedded systems increases, ...software programming on an increasing number of multiprocessors faces several critical problems, such as multithreaded code generation, heterogeneous architecture adaptation, short design time, and low cost implementation. In this paper, we present a software code generation flow based on Simulink to address these problems. We propose a functional modeling style to capture data-intensive and control-dependent target applications, and a system architecture modeling style to seamlessly transform the functional model into the target architecture. Both models are described using Simulink. From a system architecture Simulink model, a code generator produces a multithreaded code, inserting thread and communication primitives to abstract the heterogeneity of the target architecture. In addition, the multithread code generator called LESCEA applies the extensions of dataflow based memory optimization techniques, considering both data and control dependency. Experimental results on a Motion-JPEG decoder and an H.264 decoder show that the proposed multithread code generator enables easy software programming on different multiprocessor architectures with substantially reduced data memory size (up to 68.0%) and code memory size (up to 15.9%).
In multiprocessor system-on-chip, tasks and communications should be scheduled carefully since their execution order affects the performance of the entire system. When we implement an MPSoC according ...to the scheduling result, we may find that the scheduling result is not correct or timing constraints are not met unless it takes into account the delays of MPSoC architecture. The unexpected scheduling results are mainly caused from inaccurate communication delays and or runtime scheduler's overhead. Due to the big complexity of scheduling problem, most previous work neglects the inter-processor communication, or just assumes a fixed delay proportional to the communication volume, without taking into consideration subtle effects like the communication congestion and synchronization delay, which may change dynamically throughout tasks execution. In this paper, we propose an accurate scheduling model of hardware/software communication architecture to improve timing accuracy by taking into account the effects of dynamic software synchronization and detailed hardware resource constraints such as communication congestion and buffer sharing. We also propose a method for runtime scheduler implementation and consider its performance overhead in scheduling. In particular, we introduce efficient hardware and software scheduler architectures. Furthermore, we address the issue of centralized implementation versus distributed implementation of the schedulers. We investigate the pros and cons of the two different scheduler implementations. Through experiments with significant demonstration examples, we show the effectiveness of the proposed approach.