Performance Modeling and Analysis Using VHDL and SystemC: Fully Connected Model

Fully Connected Model

The fully connected architecture is built around the same comm_medium object as the Shared Bus model. It creates a comm_array object that contains all of the logical connections and copies the addresses of the comm_medium objects into a two-dimensional array, which it uses to map each processor's communication request to the appropriate logical connection. The architecture passes the number of processors to the comm_array object, which then instantiates the number of comm_medium objects needed to provide a dedicated shared bus between each pair of processors in the simulation.
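A minimal sketch of this lookup structure is shown below. The comm_medium and comm_array names, members, and constructor values here are illustrative stand-ins for the classes described above, not the actual model code; the point is simply that each unordered pair of processors gets its own dedicated channel, found by indexing a two-dimensional table with the source and destination identifiers.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Stand-in for the comm_medium channel described in the text: one dedicated
// point-to-point link characterized by a bandwidth and a fixed per-message
// overhead (values here are placeholders).
struct comm_medium {
    double bandwidth_MBps;   // megabytes per second
    double overhead_ns;      // nanoseconds per message
};

// Illustrative comm_array: instantiates one comm_medium per processor pair and
// maps a (source, destination) request to its dedicated logical connection.
class comm_array {
public:
    explicit comm_array(std::size_t num_procs)
        : links_(num_procs,
                 std::vector<std::shared_ptr<comm_medium>>(num_procs)) {
        for (std::size_t i = 0; i < num_procs; ++i) {
            for (std::size_t j = i + 1; j < num_procs; ++j) {
                auto link = std::make_shared<comm_medium>(comm_medium{100.0, 5.0});
                links_[i][j] = link;   // the same dedicated link is reachable
                links_[j][i] = link;   // from either end of the processor pair
            }
        }
    }

    // A processor's request is routed with a simple two-dimensional lookup.
    comm_medium& link(std::size_t src, std::size_t dst) const {
        return *links_[src][dst];
    }

private:
    std::vector<std::vector<std::shared_ptr<comm_medium>>> links_;
};
```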

Crossbar Model

The crossbar architecture is similar to the fully connected architecture except that it requires only four cross_comm_medium objects, since with nine processors only four concurrent connections are allowed. As mentioned earlier, it uses its own version of the basic comm_medium object. This is necessary because the logical connections in the crossbar are not associated with any particular processor, and the communication requests are not associated with any particular logical connection. To clarify: in the fully connected architecture there is a logical connection, modeled by a comm_medium object, between every pair of processors, and the fully connected model simply directs each request it receives to the one logical connection intended for it. In the crossbar, by contrast, the number of logical connections is equal to the number of processors divided by two, and every request could potentially communicate over any of them.
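The distinction can be illustrated with a small sketch: instead of a table indexed by processor pair, the crossbar keeps a pool of interchangeable connections (the number of processors divided by two) and grants a request whichever one happens to be free. The cross_comm_medium and crossbar names and members below are illustrative, not the actual model classes.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Illustrative stand-in for the cross_comm_medium described in the text: a
// logical connection that is not tied to any particular processor pair.
struct cross_comm_medium {
    bool busy = false;
};

// The crossbar keeps num_procs / 2 interchangeable connections and hands an
// incoming request whichever one is currently idle.
class crossbar {
public:
    explicit crossbar(std::size_t num_procs) : channels_(num_procs / 2) {}

    // Returns the index of a free connection, or nothing if all are in use,
    // in which case the requesting processor would wait for a release.
    std::optional<std::size_t> acquire() {
        for (std::size_t i = 0; i < channels_.size(); ++i) {
            if (!channels_[i].busy) {
                channels_[i].busy = true;
                return i;
            }
        }
        return std::nullopt;
    }

    void release(std::size_t i) { channels_[i].busy = false; }

private:
    std::vector<cross_comm_medium> channels_;
};
```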

SystemC Performance Modeling Examples

This section contains a number of examples of performance models constructed using the SystemC modeling modules described above. The examples demonstrate both that the models execute correctly and that they reflect the performance of the systems they are intended to model. The first example is a trivial one: a set of four tasks all executing on one processor.

Since data communication inside a processor is assumed to take no time, the description should take simulation time equal to the sum of the computation times of all the tasks. The second example uses the same four tasks allocated to two processors, such that each task must send its data over the communication channel to the next task. This second example should take longer, with three 100-byte transfers over the communication channel. The third example is the same task graph description with varying bus parameters. The fourth and fifth examples include bus contention, to show that contention is handled properly. Each example lists the simulated latency and a timeline showing the simulation results.

Single Processor

Figure 77.49 shows the simple sequential task graph for the first example. Here all the tasks are allocated to processor 0. Each task has a compute value of 10 µs, and each edge has a data value of 100 bytes. Since communication within a processor is assumed to take no time, the latency for this description should be the sum of the compute times, which is 40 µs.
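A minimal SystemC sketch of this situation is shown below. It is not the modeling module described above, only an illustration under the same assumptions: each task consumes its 10 µs of compute time, intra-processor hand-off is free, and the simulation therefore ends at 40 µs.

```cpp
#include <systemc.h>

// Four tasks allocated to one processor: each consumes 10 us of compute time,
// and the hand-off between tasks on the same processor costs nothing.
SC_MODULE(seq_tasks) {
    SC_CTOR(seq_tasks) { SC_THREAD(run); }

    void run() {
        for (int task = 1; task <= 4; ++task) {
            wait(10, SC_US);                       // compute time of one task
            std::cout << "Task " << task << " done at "
                      << sc_time_stamp() << std::endl;
        }
    }
};

int sc_main(int, char*[]) {
    seq_tasks top("top");
    sc_start();        // simulation runs out of events at 40 us
    return 0;
}
```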

In addition to the processing timeline shown previously, the models generate a text output stream that describes the actions each module is taking at a given simulation time. The text output for this simulation is shown below. Note that the final task completes at 40 µs of simulation time, which is exactly as expected.

[Simulation text output for the single-processor example]

Dual Processor

The task graph for the second example is shown in Figure 77.50. The graph also shows the allocation of the tasks to two processors. Notice that the sequential tasks are on different processors, so the data must be transferred across the communication channels before the computations can begin. Here the communication channel's bandwidth determines how long a communication transaction should take to complete. The length of time is the data size in bytes divided by the bandwidth in megabytes per second. The communication channel can also take into account communication overhead, in nanoseconds, if it is specified. The channel bandwidth and communication overhead are read in from a file. If this file is not present or an item is missing, it takes on its default value. The values specified for this example are 100 Mbyte/s for bandwidth and 5 ns for channel overhead.
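The per-message cost described above can be written as a small helper; the function and parameter names below are illustrative, but the arithmetic follows the rule just stated (bytes divided by bandwidth, plus a fixed overhead).

```cpp
// Time for one message: transfer time (bytes / bandwidth) plus a fixed
// per-message overhead. With the parameters used in this example
// (100 Mbyte/s, 5 ns), a 100-byte message costs 1000 ns + 5 ns = 1005 ns.
double message_time_ns(double bytes, double bandwidth_MBps, double overhead_ns) {
    double transfer_ns = bytes / (bandwidth_MBps * 1.0e6) * 1.0e9;  // s -> ns
    return transfer_ns + overhead_ns;
}
```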

Since all the communication is of the same size, the expected latency for this example can be determined using the following equation:

expected latency = (sum of task compute times) + (number of messages × data size / bandwidth) + (number of messages × overhead)
                 = 4 × 10 µs + 3 × (100 bytes / 100 Mbyte/s) + 3 × 5 ns

Thus the expected latency is 40 µs for computation, plus 3 µs for the actual data transmission, plus 15 ns for communication overhead, giving a total latency of 43015 ns. The timeline output for this simulation is shown in Figure 77.51. Note that the final task, Task 4, completes at approximately 43 µs on the graph.

Parallel Communications Example

The next example shows the effect of various communication topologies on an application with requirements for simultaneous communications. The task graph for this example is shown in Figure 77.52. Each task (shown as a node in this graph) computes for a fixed period of time and then sends data to a second task, causing it to begin execution. The tasks are allocated to processors such that, after completion of the first set of tasks, all processors attempt to send data to another processor. Because the first tasks all have the same execution time, all the communications become ready to begin at the same time. Thus, if an architecture has parallel communication paths, this will result in a decrease in the total application run time. The four start tasks (nodes) in this example all compute for 100 µs, then attempt a nonblocking send of a 100-byte message to a task allocated to another processor. They then move on to start the read required to begin their next task.

For all of the results discussed below, the channel parameters are set to a bus bandwidth of 1 Mbyte/s and a communication overhead of 0 ns.
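The compute/send/read pattern of these tasks can be sketched in SystemC as below. This is not the authors' processor module; it is a reduced two-processor illustration using plain sc_fifo channels, which deliver data without any transfer delay, so it shows only the control flow (compute, nonblocking send, blocking read, compute), not the channel timing and contention that the comm_medium models add.

```cpp
#include <systemc.h>

// Reduced illustration of the task behavior described above: compute for a
// fixed time, fire off a nonblocking send, then block on the read that
// supplies the data for the next task.
SC_MODULE(proc_node) {
    sc_fifo<int>* out = nullptr;   // channel toward the processor we send to
    sc_fifo<int>* in  = nullptr;   // channel our next task's data arrives on

    SC_CTOR(proc_node) { SC_THREAD(run); }

    void run() {
        wait(100, SC_US);          // first task's compute time
        out->nb_write(100);        // nonblocking send of the 100-byte message
        (void)in->read();          // blocking read for the next task
        wait(100, SC_US);          // second task's compute time
        std::cout << name() << " finished at " << sc_time_stamp() << std::endl;
    }
};

int sc_main(int, char*[]) {
    sc_fifo<int> a_to_b, b_to_a;
    proc_node a("proc0"), b("proc1");
    a.out = &a_to_b;  a.in = &b_to_a;
    b.out = &b_to_a;  b.in = &a_to_b;
    sc_start();                    // both nodes finish at 200 us here, since
                                   // sc_fifo adds no transfer or contention delay
    return 0;
}
```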

Shared Bus Simulation Results

The first set of results is for a system with a single shared bus. On this system, Tasks 0–3 all execute in parallel on the four processors, at which point there is four-way contention for the single system bus. The communication operations are assigned priority on a first-come, first-served basis. In the current implementation of SystemC, the task that will get first priority to communicate its data cannot be determined ahead of time; however, the tasks will all run in the same order every time the simulation is run. With a shared bus architecture, the latency is 100 µs for all processors to compute in parallel, plus 4 × (data size/bandwidth), or 400 µs, for the four sends to occur in series, plus 100 µs for the last receiver to compute after completing its receive. Thus the overall latency should be 600 µs.
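A quick arithmetic check of that estimate, using the values stated in the text (100 µs compute per task, 100-byte messages, 1 Mbyte/s bus, no overhead):

```cpp
// Shared bus: the four sends are serialized by the arbiter.
constexpr double compute_us = 100.0;           // per-task compute time
constexpr double send_us    = 100.0 / 1.0;     // 100 bytes at 1 Mbyte/s = 100 us
constexpr double shared_bus_latency_us =
    compute_us                                 // Tasks 0-3 compute in parallel
    + 4 * send_us                              // four sends, one after another
    + compute_us;                              // last receiver's compute time
static_assert(shared_bus_latency_us == 600.0,
              "matches the simulated latency in Figure 77.53");
```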

The timeline for this example is shown in Figure 77.53. The timeline shows all the tasks beginning their blocking reads and then waiting for the arbiter to select them to communicate across the bus. Once the individual communications have taken place, the destination tasks, Tasks 4–7, execute. The timeline correctly shows the last task completing execution at 600 µs.

Fully Connected Simulation Results

In the fully connected architecture there is a dedicated communication channel between each pair of processors. However, in this architecture it was decided to model a system in which a processor cannot send a message to and receive a message from the same processor at the same time. Because of the connectivity of the task graph for this application, after the first set of tasks executes in parallel, each processor needs to both send and receive a message before it can execute the next task. For example, processor zero cannot send to processor three and receive from processor three at the same time; rather, it must do one, then the other. Thus, for this example, each channel in the fully connected architecture is effectively a half-duplex connection.

During execution the run time is 100 µs for all processors to compute the first four tasks in parallel, plus 100 µs for the first set of sends, plus 100 µs for the second set of sends (during which the tasks started by the first set of sends also execute), then finally 100 µs for the last two tasks to compute in parallel. Thus the overall latency for the fully connected architecture should be 400 µs. The timeline for this example is shown in Figure 77.54.
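The same arithmetic check for the fully connected case, under the same assumed parameter values as above:

```cpp
// Fully connected, half-duplex links: each processor must send and then
// receive, so the four communications complete in two overlapping waves.
constexpr double compute_us = 100.0;
constexpr double send_us    = 100.0;           // 100 bytes at 1 Mbyte/s
constexpr double fully_connected_latency_us =
    compute_us                                 // first four tasks in parallel
    + 2 * send_us                              // two waves of sends
    + compute_us;                              // final tasks compute in parallel
static_assert(fully_connected_latency_us == 400.0,
              "matches the simulated latency in Figure 77.54");
```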

Crossbar Simulation Results

As mentioned above, the crossbar architecture behaves like a fully connected architecture in which the maximum number of concurrent connections is limited to the number of processors divided by two. Thus, for this four-processor example, the crossbar architecture will allow only two communications at a time. Because the half-duplex fully connected architecture also completes its sends two at a time here, the crossbar has the same latency as the fully connected architecture for this example. This result is shown in Figure 77.55.

Second Contention Example

This second example expands on the previous example by showing a slightly different set of contention conditions. In the first example, the communication requirements specified by the task graph required the processors to send data to and receive data from the same processor. This effectively allowed only two active communication transactions on the fully connected architecture. In this example, as shown in Figure 77.56, the processors send to and receive from different processors during the communication portion of the application. This set of communication requirements allows all of the available communication channels to be used concurrently on the fully connected architecture.

Figure 77.57 shows the results for this example for the shared bus, fully connected, and crossbar architectures. Note that in this example, in the fully connected architecture, all of the communication operations occur in parallel, which allows the entire application to execute in 300 µs.
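The corresponding check for the fully connected architecture in this second example, again under the same assumed parameter values:

```cpp
// Second contention example, fully connected architecture: no processor both
// sends to and receives from the same partner, so all four sends overlap.
constexpr double compute_us = 100.0;
constexpr double send_us    = 100.0;           // 100 bytes at 1 Mbyte/s
constexpr double latency_us =
    compute_us                                 // first tasks in parallel
    + send_us                                  // all sends in parallel
    + compute_us;                              // final tasks in parallel
static_assert(latency_us == 300.0,
              "matches the fully connected result in Figure 77.57");
```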
