Communication-Based Design for Nanoscale SoCs: Bridging the Science and Engineering of NoC Design

Bridging the Science and Engineering of NoC Design

This section addresses the customization of regular NoC architectures through application-specific long-range link insertion as a way to illustrate the interaction between the science and engineering of network design. The basic idea of inducing small-world effects into regular networks via long-range links comes from physics [33]. We explore this idea in the context of application-specific NoC architecture customization by introducing NoC-specific constraints into the problem formulation. Specifically, application-specific long-range link insertion is first formulated as a constrained optimization problem, and then a practical solution is presented. Finally, the impact of the presented approach on performance, energy consumption, and area is evaluated.

Long-Range Link Insertion as a Constrained Optimization Problem

For NoCs, the long-range links should be inserted in a smart manner rather than randomly (as in Refs. [33,34]) for the following reasons:

• Long-range links have a measurable impact on the network performance.

• The information about the communication workload can be effectively used for optimization purposes.

• Inserting long-range links has an associated cost in terms of area and wiring resources, so there exists a constraint on the proportion of long-range links that can be utilized.

† Three nodes are selected arbitrarily to act as hotspot nodes. Each node in the network sends packets to these hotspot nodes with a higher probability than to the remaining nodes.

Consequently, application-specific long-range link insertion is modeled as a constrained optimization problem [32]. Figure 16.6 outlines the long-range link insertion algorithm, which inserts long-range links with the objective of maximizing the critical traffic load λc, subject to the available resources. Given a network architecture and a target application characterized by different communication frequencies (fij) among the network tiles, the algorithm first estimates the critical traffic load, λc. Then, the improvement in λc after the insertion of each candidate long-range link is evaluated, and the link that delivers the largest gain is permanently inserted into the network. Since this step requires a special mechanism for packet routing, a routing strategy is also developed, as described in Section 16.4.2. Finally, the link-insertion procedure repeats until all available resources are used up.
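The greedy loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the algorithm of Figure 16.6: the text does not give the estimator for λc, so the sketch substitutes the traffic-weighted mean hop count as a stand-in objective (fewer hops taken as a proxy for a higher critical load). The function names, the cost model (one regular segment per unit of Manhattan distance), and the bidirectional links are likewise assumptions made for the example.

```python
from collections import deque
from itertools import product


def mesh_links(n):
    """Adjacency of an n x n mesh as {tile: set(neighbor tiles)}."""
    adj = {(x, y): set() for x, y in product(range(n), repeat=2)}
    for (x, y) in adj:
        for dx, dy in ((1, 0), (0, 1)):
            nb = (x + dx, y + dy)
            if nb in adj:
                adj[(x, y)].add(nb)
                adj[nb].add((x, y))
    return adj


def hops_from(adj, src):
    """Breadth-first search: hop count from src to every tile."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def weighted_hops(adj, freq):
    """ASSUMED proxy objective: traffic-weighted mean hop count (lower is better)."""
    cache = {}
    acc = 0.0
    for (s, d), f in freq.items():
        if s not in cache:
            cache[s] = hops_from(adj, s)
        acc += f * cache[s][d]
    return acc / sum(freq.values())


def insert_long_links(n, freq, budget):
    """Greedily add long-range links until the segment budget is used up.

    freq maps (source, destination) tile pairs to communication frequencies.
    budget counts regular link segments (ASSUMED cost: Manhattan distance).
    Each router may own at most one long-range link.
    """
    adj = mesh_links(n)
    used = set()          # routers that already own a long-range link
    chosen = []
    nodes = sorted(adj)
    while True:
        base = weighted_hops(adj, freq)
        best = None
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                cost = abs(a[0] - b[0]) + abs(a[1] - b[1])
                if cost < 2 or cost > budget or a in used or b in used:
                    continue
                adj[a].add(b)             # tentatively insert the link
                adj[b].add(a)
                gain = base - weighted_hops(adj, freq)
                adj[a].discard(b)         # undo the tentative insertion
                adj[b].discard(a)
                if gain > 0 and (best is None or gain > best[0]):
                    best = (gain, a, b, cost)
        if best is None:
            return chosen                 # no affordable link still helps
        _, a, b, cost = best
        adj[a].add(b)                     # commit the most beneficial link
        adj[b].add(a)
        used.update((a, b))
        budget -= cost
        chosen.append((a, b))
```

For instance, with a single flow from tile (0, 0) to tile (3, 3) on a 4×4 mesh and a budget of six segments, `insert_long_links(4, {((0, 0), (3, 3)): 1.0}, 6)` selects the direct corner-to-corner link. A faithful implementation would replace `weighted_hops` with the λc estimate of Ref. [32].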

Deadlock-free Routing Algorithm

There are a number of practical issues regarding packet routing that are critical for NoCs:

• First, while theoretical studies [35,36] assume infinite buffering resources, the limited amount of on-chip resources must be taken into account when dealing with NoCs.

• Second, inserting long-range links can cause cyclic dependencies. Combined with finite buffers, the arbitrary use of long-range links may cause deadlock states.

• Finally, without a customized mechanism in place, the newly added long-range links cannot be utilized by the default routing strategy.

To address these issues, a routing algorithm is developed for the use of long-range connections. More precisely, the routers without a long-range connection use the default routing strategy. For all other routers, the distance to the destination, with and without the long-range link, is computed. Since the routing decision is made locally, only the long-range link connected to the current router is taken into account. If using the long-range link results in a shorter distance to the destination, then the long-range link is utilized, provided that using it does not cause deadlock. To ensure deadlock-free operation, the use of the long-range link is restricted by extending the original turn model [38] to long-range links. More discussion is provided in Ref. [32].
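The local routing decision can be illustrated with a short Python sketch. Several details here are assumptions made for the example rather than facts from the text: the default strategy is taken to be dimension-ordered XY routing, a long-range link is counted as a single hop, and `turn_allowed` is a hypothetical placeholder for the extended turn-model check, whose actual rules are given in Ref. [32].

```python
def manhattan(a, b):
    """Hop distance between two tiles on a 2D mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def xy_next_hop(cur, dst):
    """ASSUMED default strategy: XY routing (correct X first, then Y)."""
    if cur[0] != dst[0]:
        return (cur[0] + (1 if dst[0] > cur[0] else -1), cur[1])
    if cur[1] != dst[1]:
        return (cur[0], cur[1] + (1 if dst[1] > cur[1] else -1))
    return cur  # already at the destination


def next_hop(cur, dst, long_links, turn_allowed=lambda cur, far, dst: True):
    """Choose the next tile for a packet at router `cur` heading to `dst`.

    long_links maps a router to the far end of its single long-range link.
    turn_allowed is a HYPOTHETICAL stand-in for the deadlock-avoidance check.
    """
    far = long_links.get(cur)
    if far is None or cur == dst:
        return xy_next_hop(cur, dst)          # router has no long-range link
    via_long = 1 + manhattan(far, dst)        # ASSUMED: long link counts as one hop
    if via_long < manhattan(cur, dst) and turn_allowed(cur, far, dst):
        return far
    return xy_next_hop(cur, dst)
```

With `long_links = {(0, 0): (3, 3)}`, a packet at (0, 0) bound for (3, 2) takes the long-range link, while a packet bound for (1, 0) stays on the regular mesh link because the detour would be longer.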

Area Overhead

The effect of long-range links on area and network regularity should be kept minimal in order to justify the gains in performance. To preserve the advantages of structured wiring, the number of long-range links per router is limited to one. At the same time, the long-range links are segmented into regular, fixed-length network links interconnected by repeaters. Repeaters are simplified routers consisting of only two ports; they accept an incoming flit, store it in a first-in, first-out (FIFO) buffer, and finally forward it to the output port. The use of repeaters with buffering capabilities guarantees latency-insensitive operation, as discussed in Ref. [39].
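A behavioral sketch of a segmented long-range link may help make the repeater idea concrete. It is a toy model under assumptions of this example only: the buffer depth, the `sink_ready` stall signal, and the one-stage-per-cycle timing are hypothetical and not taken from the prototype. It shows the property that matters for latency-insensitive operation: flits arrive in order and none are lost, however long the link is stalled.

```python
from collections import deque


class Repeater:
    """Two-port repeater: accept a flit, hold it in a FIFO, forward it."""

    def __init__(self, depth=2):
        self.depth = depth        # ASSUMED buffer depth, in flits
        self.fifo = deque()

    def can_accept(self):
        return len(self.fifo) < self.depth


def step(chain, incoming=None, sink_ready=True):
    """Advance a chain of repeaters by one cycle.

    Returns (delivered, accepted): the flit handed to the destination router
    this cycle (or None), and whether `incoming` entered the first repeater.
    """
    delivered = None
    if sink_ready and chain[-1].fifo:
        delivered = chain[-1].fifo.popleft()
    # Walk from the output side backwards so a flit moves one stage per cycle.
    for up, down in zip(reversed(chain[:-1]), reversed(chain[1:])):
        if up.fifo and down.can_accept():
            down.fifo.append(up.fifo.popleft())
    accepted = incoming is not None and chain[0].can_accept()
    if accepted:
        chain[0].fifo.append(incoming)
    return delivered, accepted
```

A sender simply retries a flit until `accepted` is true, so a stalled receiver backs up the chain without dropping or reordering anything.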

The feasibility of the proposed methodology is demonstrated, and realistic measurements of the area overhead are obtained, using an FPGA prototype built on a 4M-gate Xilinx Virtex2 FPGA. As shown in Table 16.1, a router with five ports designed for a 2D mesh network utilizes 397 slices, while a router with six ports utilizes 503 slices of the target device.
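As a quick check of what these slice counts imply for a single router, the relative increase can be computed directly. The helper name below is ours; the two slice counts are the ones quoted above.

```python
FIVE_PORT_SLICES = 397   # five-port mesh router (from the text)
SIX_PORT_SLICES = 503    # six-port router (from the text)


def relative_increase(before, after):
    """Fractional growth from `before` to `after`."""
    return (after - before) / before


per_router = relative_increase(FIVE_PORT_SLICES, SIX_PORT_SLICES)
print(f"A six-port router uses {per_router:.1%} more slices than a five-port one")
```

The sixth port therefore costs roughly a quarter more slices in the router that hosts it. This is a per-router figure and should not be confused with the overhead of a whole network, where only some routers carry a long-range link.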

In addition, a pure mesh network and a mesh network with four long-range links, consisting of 12 regular link segments in total, were synthesized. The extra links were observed to induce about 7% area overhead. This overhead has to be taken into account when computing the maximum number of long-range links that can be added to a regular mesh network.

Experimental Study

Demonstrating the potential of such a theoretical approach is a crucial step toward its widespread use in practice. For this reason, the impact of long-range links on performance and energy consumption is evaluated using an FPGA prototype and a cycle-accurate C++-based NoC simulator. Figure 16.7 shows some FPGA measurements for a standard mesh network before and after inserting the long-range links. The evaluations are performed using realistic benchmarks retrieved from the E3S benchmark suite [41] and using hotspot traffic. We observe that by inserting long-range links, the critical traffic load increases significantly, resulting in an improvement in the average packet latency and network throughput. Similar improvements are obtained from the C++ simulations [32].

Comparison against Torus and Higher Dimensional Networks

On-chip implementation of higher dimensional mesh and torus networks looks similar to implementing customized topologies using long-range links. However, there is a fundamental difference: the application-specific customization approach finds the optimal links to be inserted based on a rigorous analysis rather than inserting them according to a fixed rule. In fact, the topologies synthesized using long-range links reduce to the standard higher dimensional networks if we replace the optimal link-insertion algorithm with a static rule for link insertion.

Owing to the optimization process, a network architecture obtained by application-specific long-range link insertion can achieve better performance than a standard higher dimensional network, although it utilizes fewer resources. To demonstrate this, a 4×4 2D torus network with folded links [5] and a mesh network with eight unidirectional links found using our proposed technique are compared. The long-range links inserted on top of the mesh network consist of 12 regular link segments. This amounts to half of the regular links required to convert the mesh network to a torus. The simulations show a 4% improvement in the critical traffic load compared to the torus network. Likewise, the average packet latency at an injection rate of 0.48 packet/cycle, which is close to the critical load of the torus network, drops from 77.0 to 34.4 cycles with the use of application-specific long-range links. This significant gain is obtained using only half of the resources, since inserting the most beneficial links for a given traffic pattern makes more sense than blindly adding wrap-around channels all over the network, as is the case for the folded torus.
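The two figures in this comparison can be cross-checked with a little arithmetic. The segment count for the torus is our own reading, not a number from the text: we assume that turning an n×n mesh into a torus adds one wrap-around link per row and per column, each spanning n − 1 regular segments.

```python
def torus_extra_segments(n):
    """Regular segments needed to add wrap-around links to an n x n mesh.

    ASSUMPTION: 2 * n wrap-around links (one per row, one per column),
    each spanning n - 1 regular link segments.
    """
    return 2 * n * (n - 1)


def percent_reduction(before, after):
    """Reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


torus_segments = torus_extra_segments(4)          # 4x4 network from the text
long_link_segments = 12                           # from the text
latency_drop = percent_reduction(77.0, 34.4)      # cycles, from the text

print(f"segment ratio: {long_link_segments / torus_segments:.2f}")
print(f"latency reduction: {latency_drop:.1f}%")
```

Under that assumption the torus needs 24 extra segments, so the 12 segments used by the long-range links are indeed half of that, and the reported latency drop corresponds to a reduction of about 55%.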

Scalability Analysis

The scalability of the long-range link insertion technique is evaluated using networks of sizes ranging from 4×4 to 10×10. Figure 16.8 shows that consistent improvements are obtained when the network size scales up. For example, by inserting only six long-range links (equivalent to 32 regular links in total) into a 10×10 network, the critical load of the network under hotspot traffic shifts from 1.18 to 1.40 packet/cycle, giving an 18.7% improvement. This result is similar to the gain obtained for smaller networks. Figure 16.8(a) also reveals that the critical traffic load grows with the network size due to the increase in the total available bandwidth. Likewise, we observe a consistent reduction in the average packet latency across different network sizes, as shown in Figure 16.8(b).

Energy Consumption

Ideally, the long-range links should add negligible overhead in terms of area and energy consumption. For this reason, accurate energy consumption measurements are performed directly on an FPGA prototype. To preserve the structured wiring, the long-range links are segmented into regular links connected by repeaters. The repeaters can be regarded as simplified routers consisting of only two ports that accept an incoming flit, store it in a FIFO buffer, and finally forward it to the output port. Therefore, there will be minimal impact on the link and buffering energy consumption. In contrast, because the repeater design is simpler than that of the original routers, the energy consumed by the switch and routing logic is expected to decrease. At the same time, the routers with extra links will have a slightly increased energy consumption due to the larger crossbar switch. Overall, a very small change in the energy consumption is expected. Indeed, the measurements performed directly on the FPGA prototype using the technique in Ref. [40] show about a 2.2% reduction in the energy consumed when performing the same task after the insertion of long-range links; this is in good agreement with our expectations. Likewise, the cycle-accurate C++-based simulations show that the long-range links have a minimal impact on the overall energy consumption. A more detailed evaluation of energy consumption can be found in Ref. [42].
