Campus Plaza Kyoto, 5F, Lecture Room No.1 (5 min. walk from JR Kyoto Station)
Already closed.
Registration is free only for foreign participants residing outside Japan.
Moderator: Hiroto Yasuura (Kyushu Univ.)
Time | Speaker | Title |
---|---|---|
10:00-10:10 | | Opening Remarks |
10:10-11:30 | Giovanni De Micheli, Stanford Univ. | Network on a Chip: a New Paradigm for System on Chip Design |
11:30-12:30 | Ahmed Jerraya, TIMA/INPG | Multi Processors System on Chip |
Moderator: Yoshinori Takeuchi (Osaka Univ.)
Time | Speaker | Title |
---|---|---|
14:00-14:40 | Kenji Taniguchi, Osaka Univ. | CDMA Bus for Parallel Processing Systems |
14:40-15:20 | Claudio Passerone, Politecnico di Torino | A Methodology for Design Space Exploration of On-chip Networks |
15:20-16:00 | Satoshi Matsushita, NEC | Merlot: A Chip Multiprocessor with Speculative Multithreading |
16:00-16:20 | | Discussions and Closing Remarks |
We consider systems on chips (SoCs) that will be designed and produced five to ten years from today, with gate lengths in the range of 50-100 nm. We address the distinguishing features of a design methodology that aims at achieving reliable designs under the limitations of the interconnect technology. Specifically, we consider energy consumption reduction, under guaranteed quality of service (QoS), as a main objective in system design. We show that the unreliability of the physical layer is a potential show-stopper for SoC design. We argue that network technology can be used to provide a framework for designing on-chip interconnect. We visit different layers of a micro-network stack abstraction and show new directions toward designing on-chip communication.
Modern system-on-chip (SoC) design shows a clear trend towards the integration of multiple processor cores. The SoC System Driver section of the "International Technology Roadmap for Semiconductors" (http://public.itrs.net/) predicts that the number of processor cores will increase dramatically to match the processing demands of future applications. Typical multiprocessor SoC applications such as network processors, multimedia hubs and base-band telecom circuits have particularly tight time-to-market and performance constraints, which require a very efficient design cycle. A multiprocessor SoC is composed of four kinds of components: software tasks, processors executing software, specific hardware cores and a global on-chip communication network. The crucial issue when designing an SoC is to include hardware and software elements that adapt these components to each other. Multiprocessor SoCs are quite different from classic symmetric multiprocessor architectures, mainly because the implementation of system communication is much more complicated: heterogeneous processors are involved, and complex protocols and topologies are used for communication. Component-based design provides primitives to build complex architectures from basic components, allowing design architects to reuse efficient custom solutions with the best performance. This talk explores a high-level component-based methodology and design environment for application-specific multiprocessor SoC architectures. The design environment provides automatic HW-SW interface generation tools. To adapt generic components, the environment can synthesize hardware interfaces, device drivers, and operating systems that implement a high-level interconnect API. This approach, applied to the design of a VDSL system, shows a drastic reduction in design time without any significant efficiency loss in the final circuit.
A new bus architecture suitable for parallel processing systems is proposed. The multiple-access bus, which has a simple interconnection topology, is based on the direct-sequence code division multiple access (DS-CDMA) technique, which is widely used in wireless telecommunication systems. However, unlike a DS-CDMA radio system, the DS-CDMA interface uses wired buses as the communication medium and does not require up-conversion to radio frequencies. The new bus architecture features low bus power consumption, noise tolerance and dynamic programmability. The DS-CDMA interface has been successfully implemented in a 0.6 µm CMOS process. Measurements with ten pairs of transmitters and receivers show that all transmitted data were received correctly, with not a single error in 10^8 transmissions.
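The general DS-CDMA principle behind such a bus can be illustrated with a minimal sketch. It is not the implementation described in the talk: the use of Walsh codes, the ideal chip-wise summation that stands in for the shared wire, and all function names (`walsh_codes`, `spread`, `bus_sum`, `despread`) are assumptions made for illustration only.

```python
def walsh_codes(n):
    """Return n mutually orthogonal Walsh codes of length n (n a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    codes = [[1]]
    while len(codes) < n:
        # Hadamard doubling step: [H H] stacked on [H -H]
        codes = ([row + row for row in codes]
                 + [row + [-c for c in row] for row in codes])
    return codes


def spread(bits, code):
    """Map each bit to +1/-1 and multiply it by the transmitter's chip code."""
    chips = []
    for bit in bits:
        symbol = 1 if bit else -1
        chips.extend(symbol * c for c in code)
    return chips


def bus_sum(signals):
    """Assumed bus model: the wire carries the chip-wise sum of all senders."""
    return [sum(chips) for chips in zip(*signals)]


def despread(bus, code):
    """Correlate the bus signal with one code to recover that sender's bits."""
    n = len(code)
    bits = []
    for start in range(0, len(bus), n):
        corr = sum(x * c for x, c in zip(bus[start:start + n], code))
        bits.append(1 if corr > 0 else 0)
    return bits


if __name__ == "__main__":
    codes = walsh_codes(4)
    messages = [[1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 0, 0]]
    # Each transmitter gets its own code; code 0 (all ones) is left unused.
    bus = bus_sum([spread(m, codes[i + 1]) for i, m in enumerate(messages)])
    for i, m in enumerate(messages):
        assert despread(bus, codes[i + 1]) == m
```

Because the codes are orthogonal, each receiver sees only its own transmitter's contribution after correlation, so several transmitter-receiver pairs can share the same wires at the same time. Reassigning a code to a different receiver changes who talks to whom without rewiring, which is one plausible reading of the "dynamic programmability" mentioned above.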
This presentation will introduce a design methodology for architectural exploration of networks on chip based on decoupling functionality from architecture, and computation from communication. A simple example of various mapping and refinement options for a single function-to-function token-based communication will be used to practically illustrate the basic concepts. A larger realistic on-chip network will then be used to describe what sort of information can be obtained by various mapping and performance analysis experiments. Although some specific tools will be used to make the explanation more concrete, the methodology is fully tool-independent, and can be implemented on top of several publicly available design frameworks.
We have been facing the problem of diminishing returns in exploited parallelism with aggressive ILP architectures such as superscalar or VLIW. Multithreading has been researched to overcome this problem by providing larger instruction scheduling windows at a lower hardware cost. In speculative multithreading, speculative execution of threads enables us to exploit a higher degree of parallelism beyond unresolved control or data dependencies. It also relaxes the requirement of perfect memory disambiguation, so that automatic parallelization becomes more feasible. We have fabricated a prototype chip code-named Merlot. Merlot integrates four processing elements (PEs), SDRAM interfaces, and PCI interfaces on a 110 mm² die in a 0.15 µm process. The PEs share an instruction cache, a data cache, and a register file to realize fast thread manipulation. Speculative execution is realized by store reservation buffers in front of the data cache and by register renaming. The synchronous operation of the SDRAM controller and the core pipeline minimizes the cache miss penalty to less than 100 ns by removing store-and-forward latency. With four PEs, an IPC of 2.72 is estimated for restructured speech recognition code compiled with our parallelizing compiler. On the first silicon, we have successfully run a parallelized MPEG-2 decoder with software workarounds. I would like to present the parallelizing scheme, the micro-architecture, and a demonstration of the chip operation.
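The idea of holding speculative stores in a buffer until a thread is known to be safe can be sketched in software. This is a generic, simplified model of thread-level speculation and not Merlot's actual micro-architecture: the class name `SpeculativeThread`, the read-set conflict check, and the dictionary used as memory are all assumptions introduced for illustration.

```python
class SpeculativeThread:
    """Toy model: stores stay in a private buffer until the thread commits."""

    def __init__(self, memory):
        self.memory = memory        # shared dict standing in for the data cache
        self.store_buffer = {}      # speculative stores, invisible to others
        self.read_set = set()       # addresses read from shared memory (assumed)

    def load(self, addr):
        # A thread sees its own buffered stores first.
        if addr in self.store_buffer:
            return self.store_buffer[addr]
        self.read_set.add(addr)
        return self.memory.get(addr, 0)

    def store(self, addr, value):
        self.store_buffer[addr] = value

    def conflicts_with(self, written_addrs):
        # True if this thread read a value that an older thread later wrote.
        return bool(self.read_set & set(written_addrs))

    def commit(self):
        # Make buffered stores visible and report which addresses changed.
        written = list(self.store_buffer)
        self.memory.update(self.store_buffer)
        self.squash()
        return written

    def squash(self):
        # Discard all speculative state so the thread can be re-executed.
        self.store_buffer.clear()
        self.read_set.clear()


def younger_body(thread):
    # Hypothetical thread body: mem[1] = mem[0] + 1
    thread.store(1, thread.load(0) + 1)


if __name__ == "__main__":
    memory = {0: 5}
    older = SpeculativeThread(memory)
    younger = SpeculativeThread(memory)

    younger_body(younger)           # runs ahead and reads the stale mem[0]
    older.store(0, 10)              # older thread writes mem[0] afterwards
    written = older.commit()

    if younger.conflicts_with(written):
        younger.squash()            # dependency violated: discard and redo
        younger_body(younger)
    younger.commit()
    assert memory == {0: 10, 1: 11}
```

Because nothing reaches shared memory before commit, a misspeculated thread can be discarded without any undo step. This is also why perfect memory disambiguation is not required up front: a dependence the compiler could not prove absent is caught at run time and repaired by re-execution.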