System Timing: Synchronous Timing and Clock Distribution Networks

Synchronous Timing and Clock Distribution Networks

The timing of a synchronous VLSI system is characteristically analyzed at the level of its synchronous building blocks, the local data paths. In the following sections, the simple and effective modeling of synchronous building blocks is described. The effects of clock distribution networks on circuit operation are also presented.

Background

As described in Section 50.2, most high-performance digital integrated circuits implement data-processing algorithms based on the iterative execution of basic operations. Typically, these algorithms are highly parallelized and pipelined by inserting clocked registers at specific locations throughout the circuit. The synchronization strategy for these clocked registers in the vast majority of VLSI-based digital systems is a fully synchronous approach. It is not uncommon for the computational process in these systems to be spread over hundreds of thousands of functional logic elements and tens of thousands of registers.

For such synchronous digital systems to function properly, the vast number of switching events requires a strict temporal ordering. This strict ordering is enforced by a global synchronization signal known as the clock signal. For a fully synchronous system to operate correctly, the clock signal must be delivered to every register at a precise relative time. The delivery function is accomplished by a circuit and interconnect structure known as a clock distribution network [15].

Multiple factors affect the propagation delay of the data signals through the combinational logic gates and the interconnect. Since the clock distribution network is composed of logic gates and interconnection wires, the signals in the clock distribution network are also delayed. Moreover, the dependence of the correct operation of a system on the signal delay in the clock distribution network is far greater than on the delay of the logic gates. Recall that by delivering the clock signal to registers at precise times, the clock distribution network essentially quantizes the time of a synchronous system (into clock periods), thereby permitting the simultaneous execution of operations.

The nature of the on-chip clock signal has become a primary factor in limiting circuit performance, causing the clock distribution network to become a performance bottleneck for high-speed VLSI systems. The primary source of load for the clock distribution network has shifted from the logic gates to the interconnect, thereby changing the physical nature of the load from a lumped capacitance (C) to a distributed resistive-capacitive (RC) load and eventually a distributed resistive-capacitive-inductive (RLC) load [7,16,17]. These interconnect impedances degrade the on-chip signal waveform shapes and increase the path delay. Furthermore, uncertainty is introduced into the signal timing due to statistical variations in the parameters characterizing the circuit elements along the clock and data signal paths, caused by the imperfect control of the manufacturing process and the environment. These changes in circuit behavior have a profound impact on both the choice of synchronous design methodology and on the overall circuit performance. Among the most important consequences are increased power dissipated by the clock distribution network as well as the increasingly challenging timing constraints that must be satisfied to avoid timing violations [3–6,15,18–20].

Definitions and Notation

A synchronous digital system is a network of logic gates and registers whose input and output terminals are interconnected by wires. A sequence of connected logic gates (no registers) is called a signal path. Signal paths bounded by registers are called sequentially adjacent paths.


Definition 50.1 (Sequentially adjacent pair of registers). For an arbitrary ordered pair of registers ⟨Ri, Rf⟩ in a synchronous circuit, one of the following two situations can be observed. Either there exists at least one signal path that connects some output of Ri to some input of Rf, or no input of Rf can be reached from any output of Ri by propagating through a sequence of logic elements only. In the former case, denoted by Ri ⇝ Rf, the pair of registers ⟨Ri, Rf⟩ is called a sequentially adjacent pair of registers, and switching events at the output of Ri can possibly affect the input of Rf during the same clock period. A sequentially adjacent pair of registers is also referred to as a local data path [15].

A sample local data path with a register (a flip-flop or a latch) is shown in Figure 50.3. In Figure 50.3, the clock signals Ci and Cf driving the initial register Ri and the final register Rf, respectively, of the local data path are shown.

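Definition 50.2 For any ordered pair of registers ⟨Ri, Rj⟩ in a fully synchronous circuit driven by the clock signals Ci and Cj, respectively, the clock skew TSkew(i, j) is defined as the difference

$$T_{Skew}(i, j) = t_{cd}^{i} - t_{cd}^{j},$$

where $t_{cd}^{i}$ and $t_{cd}^{j}$ are the delays of the clock signals Ci and Cj from the clock source to the registers Ri and Rj, respectively.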

Examples of local data paths with flip-flops and latches are shown in Figure 50.17 and Figure 50.21, respectively.

Note that the clock skew as defined above is only defined for sequentially-adjacent registers, that is, for local data paths [such as the path shown in Figure 50.2(b)].
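The skew of an ordered register pair can be illustrated with a few lines of Python. This is a minimal sketch that takes the skew as the clock delay to the first register minus the clock delay to the second, the same sign convention as TSkew(i, f) = Delay(Pi) − Delay(Pf) used later in the text. The function name `clock_skew` and the delay values are hypothetical, chosen only for illustration.

```python
def clock_skew(t_cd, i, j):
    """Return T_Skew(i, j) = t_cd[i] - t_cd[j] (all delays in the same time unit)."""
    return t_cd[i] - t_cd[j]


# Hypothetical clock delays (in ns) from the clock source to three registers.
t_cd = {"R1": 1.0, "R2": 1.5, "R3": 1.0}

skew_12 = clock_skew(t_cd, "R1", "R2")  # negative: the clock reaches R1 before R2
skew_21 = clock_skew(t_cd, "R2", "R1")  # same magnitude, opposite sign
skew_13 = clock_skew(t_cd, "R1", "R3")  # zero: both clock edges arrive together
```

Swapping the order of the pair flips the sign of the skew, which is why the definition is stated for an ordered pair.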

Definition 50.3 (Data propagation time). For any pair of registers ⟨Ri, Rf⟩ in a local data path Ri ⇝ Rf of a synchronous circuit, the amount of time a data signal is processed in the combinational logic block is defined as the data propagation time Di,f.

Conventionally, the timing analysis of sequential circuits is performed with the circuit components modeled by min–max timing models. In the min–max timing model, the delay information of a circuit component is represented by two quantities: the minimum, corresponding to the delay of the component under best-case operating conditions, and the maximum, corresponding to the delay under worst-case operating conditions. The subscripts m and M appended to the parameter Di,f denote the minimum and maximum data propagation times, $D_{m}^{i,f}$ and $D_{M}^{i,f}$, respectively, which constitute the min–max timing model for the local data path Ri ⇝ Rf.

A fully synchronous digital circuit is formally defined as follows:

[Formal definition: image not reproduced in this text.]

Note that in a fully synchronous digital system there are no purely combinational signal cycles, that is, the input of any logic gate Gk cannot be reached by starting at the same gate and propagating through a sequence of combinational logic gates only [15,21].
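The absence of purely combinational cycles can be checked mechanically. The sketch below is one possible way to do so, assuming the combinational part of the circuit is available as a mapping from each gate to the gates it drives directly, with all connections through registers left out. The function name `has_combinational_cycle` and the gate names are hypothetical.

```python
def has_combinational_cycle(fanout):
    """Return True if the gate-only graph contains a directed cycle.

    fanout maps each gate name to the gates it drives directly; connections
    that pass through a register are deliberately not part of this graph.
    """
    gates = set(fanout)
    for driven in fanout.values():
        gates.update(driven)
    # 0 = unvisited, 1 = on the current depth-first path, 2 = fully explored
    state = {g: 0 for g in gates}

    def visit(g):
        state[g] = 1
        for nxt in fanout.get(g, ()):
            if state[nxt] == 1:          # reached a gate already on this path
                return True
            if state[nxt] == 0 and visit(nxt):
                return True
        state[g] = 2
        return False

    return any(state[g] == 0 and visit(g) for g in gates)


# Hypothetical netlists: a simple chain, and two gates feeding each other.
chain = {"G1": ["G2"], "G2": ["G3"], "G3": []}
loop = {"G1": ["G2"], "G2": ["G1"]}

chain_has_cycle = has_combinational_cycle(chain)
loop_has_cycle = has_combinational_cycle(loop)
```

A depth-first search is used because a gate that is revisited while still on the current search path is exactly a gate that can be reached from itself through logic gates only.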

Graph Model of a Fully Synchronous Digital Circuit

Certain properties of a synchronous digital circuit may be better understood by analyzing a graph model of the circuit. A synchronous digital circuit can be modeled as a directed graph [22,23] G with a vertex set V = {v1, …, vN} and an edge set E = {e1, …, eNP} ⊆ V × V. An example of a circuit graph G is illustrated in Figure 50.5(a). The number of registers in the circuit is |V| = N, where the vertex vk corresponds to the register Rk. The number of local data paths in the circuit is |E| = NP, with NP = 11 for the example shown in Figure 50.5. An edge is directed from vi to vj iff Ri ⇝ Rj. In the case where multiple paths between a sequentially adjacent pair of registers Ri ⇝ Rj exist, only one edge connects vi to vj. The underlying graph Gu of the graph G is a nondirected graph that has the same vertex set V, where the directions have been removed from the edges. The underlying graph Gu of the graph G depicted in Figure 50.5(a) is shown in Figure 50.5(b). Furthermore, an input or an output of the circuit is indicated in Figure 50.5 by an edge incident to only one vertex.

[Figure 50.5: (a) The circuit graph G. (b) The underlying graph Gu of G.]
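The graph model can be sketched in Python as follows. The representation is an assumption made for illustration: registers are numbered 1 through N, each local data path is an ordered pair (i, j), and the function name `circuit_graph` is hypothetical. The example list of paths is invented and does not reproduce Figure 50.5.

```python
def circuit_graph(num_registers, local_paths):
    """Build the directed circuit graph G and its underlying graph Gu.

    local_paths is an iterable of ordered pairs (i, j), one per signal path
    from register Ri to register Rj; parallel paths collapse to one edge.
    """
    vertices = set(range(1, num_registers + 1))
    directed_edges = {(i, j) for i, j in local_paths}
    # Dropping the direction merges (i, j) and (j, i) into one undirected edge.
    undirected_edges = {frozenset(edge) for edge in directed_edges}
    return vertices, directed_edges, undirected_edges


# Hypothetical circuit: two parallel paths R1 -> R2 and a two-way pair R2 <-> R3.
paths = [(1, 2), (1, 2), (2, 3), (3, 2), (3, 1)]
V, E, Eu = circuit_graph(3, paths)
```

Sets are used for both edge collections so that multiple signal paths between the same register pair contribute a single edge, as the definition requires.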

Clock Skew Scheduling

The majority of the approaches used to design a clock distribution network simplify the performance goals by targeting minimal or zero global clock skew [24–26], which can be achieved by different routing strategies [27–30], buffered clock tree synthesis, symmetric n-ary trees [31] (most notably H-trees), or a distributed series of buffers connected as a mesh [15,32]. A zero clock skew scheme is established by distributing the clock signal to all synchronous components of a circuit with identical clock delays. In other words, the clock skew evaluates to zero on all of the local data paths of a zero clock skew circuit:

$$T_{Skew}(i, f) = t_{cd}^{i} - t_{cd}^{f} = 0$$

for every local data path Ri ⇝ Rf.

If the circuit operates at any clock period less than the largest maximum data propagation time, a timing hazard occurs. For any clock period greater than this value, the circuit is fully functional (no timing hazards occur). Finding a clock period TCP for which a zero clock skew circuit is fully functional (that is, a TCP equal to or greater than the largest maximum data propagation time Di,f) is always possible, making it convenient to design zero clock skew systems. Consequently, the application of zero clock skew schemes has been central to the design of fully synchronous digital circuits for decades [15,33].
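The zero clock skew bound can be written as a short calculation. The sketch below follows the simplified statement in the text, in which the smallest workable clock period equals the largest maximum data propagation time; register setup and clock-to-output times are not modeled. The function names and the delay values are hypothetical.

```python
def min_zero_skew_period(d_max):
    """Smallest clock period for a zero clock skew circuit under the simplified model.

    d_max maps each local data path (i, f) to its maximum data propagation time.
    """
    return max(d_max.values())


def is_functional_zero_skew(t_cp, d_max):
    """True if the clock period t_cp is at least the largest maximum propagation time."""
    return t_cp >= min_zero_skew_period(d_max)


# Hypothetical maximum data propagation times (in ns) for three local data paths.
d_max = {("R1", "R2"): 4.0, ("R2", "R3"): 7.0, ("R3", "R1"): 5.0}

t_min = min_zero_skew_period(d_max)
ok_at_7 = is_functional_zero_skew(7.0, d_max)
ok_at_6 = is_functional_zero_skew(6.0, d_max)
```

The slowest local data path alone sets the clock period here, which is the limitation that nonzero clock skew scheduling aims to relax.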

The column vector of clock delays TCD = [$t_{cd}^{1}$, $t_{cd}^{2}$, …] is called a clock schedule [15,34]. A clock schedule that satisfies Eq. (50.3) is called a trivial clock schedule. Note that a trivial clock schedule TCD implies global zero clock skew since, for any i and f, $t_{cd}^{i} = t_{cd}^{f}$ and thus TSkew(i, f) = 0. If TCD is chosen such that the timing constraints of a circuit are satisfied for every local data path Ri ⇝ Rf, TCD is called a consistent clock schedule.

The goal of nonzero clock skew scheduling is to compute a consistent clock schedule that is not trivial, while improving the circuit performance. It has been shown in Refs. [15,24–26,35–37] that by adopting a nonzero clock skew synchronization scheme, synchronous circuits can operate at clock periods less than the largest maximum data propagation time of the circuit. In nonzero clock skew systems, the clock signal delays $t_{cd}$ at certain registers are intentionally increased to provide additional data-processing time on slower local data paths. Mathematically, the nonzero clock skew values (also called useful skew) evaluate to TSkew(i, f) ≠ 0 for some (or all) local data paths Ri ⇝ Rf of the circuit.
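A small numerical sketch may help show how a nontrivial schedule can run below the zero skew limit. The timing constraints themselves are not stated in this section, so the two inequalities used here are an assumption: a long-path constraint TSkew(i, f) ≤ TCP − D_M and a short-path constraint TSkew(i, f) ≥ −D_m, with register setup, hold, and clock-to-output times taken as zero. The function name `is_consistent` and all numbers are hypothetical.

```python
def is_consistent(t_cd, t_cp, paths):
    """Check a clock schedule against simplified timing constraints (an assumption).

    paths maps each local data path (i, f) to (d_min, d_max), the minimum and
    maximum data propagation times. Register setup, hold, and clock-to-output
    times are assumed to be zero in this sketch.
    """
    for (i, f), (d_min, d_max) in paths.items():
        skew = t_cd[i] - t_cd[f]
        if skew > t_cp - d_max:   # data from Ri would arrive after the next edge at Rf
            return False
        if skew < -d_min:         # data from Ri would race through to Rf too early
            return False
    return True


# Hypothetical two-stage pipeline R1 -> R2 -> R3 (times in ns).
paths = {("R1", "R2"): (6.0, 8.0), ("R2", "R3"): (3.0, 4.0)}

trivial = {"R1": 0.0, "R2": 0.0, "R3": 0.0}   # zero clock skew everywhere
skewed = {"R1": 0.0, "R2": 2.0, "R3": 0.0}    # clock to R2 delayed by 2 ns

trivial_at_8 = is_consistent(trivial, 8.0, paths)
trivial_at_6 = is_consistent(trivial, 6.0, paths)
skewed_at_6 = is_consistent(skewed, 6.0, paths)
```

In this example the delayed clock at R2 lends 2 ns from the fast path R2 ⇝ R3 to the slow path R1 ⇝ R2, so a 6 ns period satisfies both assumed constraints even though the largest maximum data propagation time is 8 ns.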

The process of determining a consistent clock schedule TCD can be considered as the mathematical problem of optimizing the circuit performance under the timing constraints of a circuit. However, there are important practical issues to consider before a clock schedule can be properly implemented. A clock distribution network must be synthesized such that the clock signal is delivered to each register with the proper delay so as to satisfy the clock skew schedule TCD. Furthermore, this clock distribution network must be constructed so as to minimize the deleterious effects of interconnect impedances and process parameter variations on the implemented clock schedule. Synthesizing the clock distribution network typically consists of determining a topology for the network, together with the circuit design and physical layout of the buffers and interconnect within the clock distribution network [15].

[Figure 50.6: (a) Circuit schematic of a clock distribution network. (b) Equivalent graph of a clock tree structure that corresponds to the circuit shown in (a).]

Structure of a Clock Distribution Network

The clock distribution network is frequently organized as a rooted tree structure [15,22,24], as illustrated in Figure 50.6, and is often called a clock tree [15]. A circuit schematic of a clock distribution network is shown in Figure 50.6(a). An abstract graphical representation of the tree structure depicted in Figure 50.6(a) is shown in Figure 50.6(b). The unique source of the clock signal is at the root of the tree. This signal is distributed from the source to every register in the circuit through a sequence of buffers and interconnect. Typically, a buffer in the network drives a combination of other buffers and registers in the VLSI circuit. An interconnection network of wires connects the output of the driving buffer to the inputs of these driven buffers and registers. An internal node of the tree corresponds to a buffer and a leaf node of the tree corresponds to a register. There are N leaves¶ in the clock tree, labeled F1 through FN, where leaf Fj corresponds to register Rj. A clock tree topology that implements a given clock schedule TCD must enforce a clock skew TSkew(i, f) for each local data path Ri ⇝ Rf of the circuit to ensure that the timing constraints of the circuit are satisfied. This topology, however, can be affected by three important issues relating to the operation of a fully synchronous digital system.

Linear Dependency of the Clock Skews

An important corollary related to the conservation property [15] of clock skew is that there exists a linear dependency among the clock skews of a global data path that forms a cycle in the underlying graph of the circuit. Specifically, if v0, e1, v1 (≠ v0), …, vk−1, ek, vk ≡ v0 is a cycle in the underlying graph of the circuit, then

$$\sum_{i=0}^{k-1} T_{Skew}(i, i+1) = 0. \qquad (50.5)$$

Each term of the sum is a difference of two clock delays, so the sum telescopes to the clock delay at v0 minus the clock delay at vk, which is zero because vk ≡ v0.

The importance of this property is that Eq. (50.5) describes the inherent correlation among certain clock skews within a circuit. These correlated clock skews therefore cannot be independently optimized. Returning to Figure 50.5, note that it is not necessary that a directed cycle exists in the directed graph G of a circuit for Eq. (50.5) to hold. For example, v2, v3, v4 is not a cycle in the directed circuit graph G in Figure 50.5(a), but v2, v3, v4 is a cycle in the undirected circuit graph Gu in Figure 50.5(b). In addition, TSkew(2, 3) + TSkew(3, 4) + TSkew(4, 2) = 0, that is, the skews TSkew(2, 3), TSkew(3, 4), and TSkew(4, 2) are linearly dependent. A maximum of (|V| − 1) = (N − 1) clock skews can be chosen independently of each other in a circuit, which is easily proven by considering a spanning tree of the underlying circuit graph Gu [22,23]. Any spanning tree of Gu will contain (N − 1) edges (each edge corresponding to a local data path), and the addition of any other edge of Gu will form a cycle such that Eq. (50.5) holds for this cycle. Note, for example, that for the circuit modeled by the graph shown in Figure 50.5, four independent clock skews can be chosen such that the remaining three clock skews can be expressed in terms of the independent clock skews.

¶ N is the number of registers in the circuit.
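Both claims, the zero sum around a cycle and the count of N − 1 independent skews, can be checked on a small example. The sketch below assumes the skew of a pair is the difference of the two clock delays and uses a union-find pass to pick a spanning tree of the underlying graph. The function names, the clock delays, and the five-vertex edge list are hypothetical and are not taken from Figure 50.5.

```python
def skew_sum_around_cycle(t_cd, cycle):
    """Sum T_Skew over consecutive vertices of a cycle (the last wraps to the first)."""
    total = 0
    for a, b in zip(cycle, cycle[1:] + cycle[:1]):
        total += t_cd[a] - t_cd[b]
    return total


def spanning_tree_edges(vertices, undirected_edges):
    """Pick spanning-tree edges with union-find; their skews can be set independently."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving keeps the trees shallow
            v = parent[v]
        return v

    tree = []
    for a, b in undirected_edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:               # edge joins two components: keep it
            parent[root_a] = root_b
            tree.append((a, b))
    return tree


# Hypothetical integer clock delays and a connected underlying graph on 5 vertices.
t_cd = {1: 4, 2: 3, 3: 7, 4: 5, 5: 6}
edges = [(1, 2), (2, 3), (3, 4), (4, 2), (4, 5), (5, 1), (1, 3)]

cycle_sum = skew_sum_around_cycle(t_cd, [2, 3, 4])
tree = spanning_tree_edges(t_cd.keys(), edges)
dependent = len(edges) - len(tree)
```

Every edge rejected by the union-find pass closes a cycle with edges already in the tree, so its skew is fixed by the skews on that cycle, which mirrors the spanning tree argument in the text.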

The interdependency of the clock skew values makes the analysis of clock skew scheduling methods a difficult problem. Owing to this interdependency, a clock skew value cannot be determined independently of the remaining clock skews; thus, a typical clock skew scheduling method must simultaneously encompass the analysis of all local data paths. Such simultaneous analysis of all local data paths in a given synchronous circuit is typically structured by including the timing constraints of every local data path in a single optimization problem.

Differential Character of the Clock Tree

In a given circuit, the clock signal delay tj from the clock source to the register Rj is equal to the sum of the propagation delays of the buffers on the unique path that exists between the root of the clock tree and the leaf Fj corresponding to the jth register. Furthermore, if Ri ⇝ Rf is a sequentially adjacent pair of registers, there is a portion of the two paths between the root of the clock tree and Ri and Rf, respectively, denoted P*, that is common to both paths. This concept is illustrated in Figure 50.7. A portion of a clock tree is shown in Figure 50.7, where each of the vertices 1 through 9 corresponds to a buffer in the clock tree. The vertices 4, 5, and 9 are leaves of the tree and correspond to the registers R4, R5, and R9, respectively. The local data paths R4 ⇝ R5 and R5 ⇝ R9 are indicated in Figure 50.7 with arrows, while the paths of the clock signals to each of the registers R4, R5, and R9 are shown lightly shaded. The portion of the clock signal paths common to both registers of a local data path is shaded darker; note the segments 1 → 2 → 3 for R4 ⇝ R5 and 1 → 2 for R5 ⇝ R9.

[Figure 50.7: A portion of a clock tree.]

[Table 50.1: Target and actual clock skew values for the local data paths R4 ⇝ R5 and R5 ⇝ R9.]

Similarly, there is a portion of the clock signal path to each of the registers Ri and Rf in a sequentially adjacent pair of registers Ri ⇝ Rf, denoted by Pi and Pf, respectively, that is unique to this register.

Returning to Figure 50.7, the segments 3 → 4 and 3 → 5 are unique to the clock signal paths to the registers R4 and R5, while the segments 2 → 3 → 5 and 2 → 6 → 9 are unique to the clock signal paths to the registers R5 and R9, respectively.
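The split of two clock paths into a common portion and two unique portions can be sketched as follows. The tree is stored as a child-to-parent map, which is an assumed representation, and only the branches named in the text (1 → 2, 2 → 3, 3 → 4, 3 → 5, 2 → 6, 6 → 9) are included; the placement of vertices 7 and 8 is not given in the text, so they are left out. The function names are hypothetical.

```python
def path_from_root(parent, leaf):
    """Return the list of vertices from the root of the clock tree down to leaf."""
    path = [leaf]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path[::-1]


def split_clock_paths(parent, i, f):
    """Split the clock paths to leaves i and f into common and unique portions.

    Returns (common, unique_i, unique_f); each unique portion starts at the
    vertex where the two clock paths diverge.
    """
    path_i, path_f = path_from_root(parent, i), path_from_root(parent, f)
    k = 0
    while k < min(len(path_i), len(path_f)) and path_i[k] == path_f[k]:
        k += 1
    return path_i[:k], path_i[k - 1:], path_f[k - 1:]


# Child -> parent map for the branches of Figure 50.7 that the text names.
parent = {2: 1, 3: 2, 4: 3, 5: 3, 6: 2, 9: 6}

common_45, unique_4, unique_5 = split_clock_paths(parent, 4, 5)
common_59, unique_5b, unique_9 = split_clock_paths(parent, 5, 9)
```

The results reproduce the segments listed in the text: the pair R4, R5 shares 1 → 2 → 3, while the pair R5, R9 shares only 1 → 2.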

Note that the clock skew TSkew(i, f) between the sequentially adjacent pair of registers Ri ⇝ Rf is equal to the difference between the accumulated buffer propagation delays along Pi and Pf, that is, TSkew(i, f) = Delay(Pi) − Delay(Pf). Therefore, only variations of circuit parameters over Pi and Pf affect the value of the clock skew TSkew(i, f); variations over the common portion P* contribute equally to both clock delays and cancel. For the example shown in Figure 50.7, TSkew(4, 5) = Delay(3 → 4) − Delay(3 → 5) and TSkew(5, 9) = Delay(2 → 3 → 5) − Delay(2 → 6 → 9).

This differential feature of the clock tree suggests an approach for minimizing the effects of process parameter variations on the correct operation of the circuit. To illustrate this approach, each branch p → q of the clock tree shown in Figure 50.7 is labeled with two numbers: τp,q > 0 is the intended delay of the branch and εp,q ≥ 0 is the maximum error (deviation) of this delay. In other words, the actual delay of the branch p → q is in the interval [τp,q − εp,q, τp,q + εp,q]. With this notation, the target clock skew values for the local data paths R4 ⇝ R5 and R5 ⇝ R9 are shown in the middle column in Table 50.1. The bounds of the actual clock skew values for the local data paths R4 ⇝ R5 and R5 ⇝ R9 (considering the ε variations) are shown in the rightmost column in Table 50.1.

As the results listed in Table 50.1 demonstrate, it is advantageous to maximize P* for any local data path Ri ⇝ Rf, because the parameter variations on P* do not affect TSkew(i, f).
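The effect of the branch errors on the skew can be worked through numerically. This sketch assumes that each branch delay deviates independently within its interval, so the worst-case skew deviation is the sum of the errors over both unique portions. The symbols `tau` and `eps` stand for the intended branch delay and its maximum error; the function name and all numeric values are hypothetical and do not reproduce Table 50.1.

```python
def skew_bounds(unique_i, unique_f, tau, eps):
    """Return (lower, target, upper) for T_Skew(i, f).

    unique_i and unique_f list the branches (p, q) that are unique to the
    clock paths of the initial and final register; tau and eps map each
    branch to its intended delay and maximum error. Branch errors are
    assumed to vary independently of each other.
    """
    target = sum(tau[b] for b in unique_i) - sum(tau[b] for b in unique_f)
    spread = sum(eps[b] for b in unique_i) + sum(eps[b] for b in unique_f)
    return target - spread, target, target + spread


# Hypothetical intended delays and maximum errors for the unique branches.
tau = {(3, 4): 5, (3, 5): 3, (2, 3): 4, (2, 6): 2, (6, 9): 3}
eps = {(3, 4): 1, (3, 5): 1, (2, 3): 1, (2, 6): 1, (6, 9): 1}

# R4 -> R5 diverge at vertex 3; R5 -> R9 diverge earlier, at vertex 2.
bounds_45 = skew_bounds([(3, 4)], [(3, 5)], tau, eps)
bounds_59 = skew_bounds([(2, 3), (3, 5)], [(2, 6), (6, 9)], tau, eps)
```

Both pairs have the same target skew in this example, but the pair with the longer common portion (R4, R5) has the narrower interval, since fewer branches contribute error. The branch 1 → 2 needs no entry at all, because it lies on the common portion of both pairs.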
