C. D. Thompson, Department of Computer Science, Carnegie-Mellon University. A Complexity Theory for VLSI.

CMU-CS-80-140

C. D. Thompson





Carnegie-Mellon University

A Complexity Theory for VLSI


August 1980


A Complexity Theory for VLSI
Dissertation presented to

Carnegie-Mellon University

Computer Science Department

Submitted in partial fulfillment of the requirements

for the degree of Doctor of Philosophy


C. D. Thompson


This research is supported in part by the National Science Foundation under Grant

MCS 78-236-76 and a graduate fellowship, and in part by the Office of Naval

Research under Contract N00014-76-C-0370.


The established methodologies for studying computational complexity can be applied to the new problems posed by very large-scale integrated (VLSI) circuits. This thesis develops a "VLSI model of computation" and derives upper and lower bounds on the silicon area and time required to solve the problems of sorting and discrete Fourier transformation. In particular, the area A and time T taken by any VLSI chip using any algorithm to perform an N-point Fourier transform must satisfy AT^2 ≥ cN^2 log^2 N, for some fixed c > 0. A more general result for both sorting and Fourier transformation is that AT^(2x) = Ω(N^(1+x) log^(2x) N) for any x in the range 0 < x ≤ 1. Also, the energy dissipated by a VLSI chip during the solution of either of these problems is at least Ω(N^(3/2) log N). The tightness of these bounds is demonstrated by the existence of nearly optimal circuits for both sorting and Fourier transformation. The circuits based on the shuffle-exchange interconnection pattern are fast but large: T = O(log^2 N) for Fourier transformation, T = O(log^3 N) for sorting; both have area A of at most O(N^2 / log^(1/2) N). The circuits based on the mesh interconnection pattern are slow but small: T = O(N^(1/2) loglog N), A = O(N log^2 N).
Keywords: Computational complexity, information theory, graph embedding, mesh connections, shuffle-exchange connections, parallel algorithms, sorting, Fourier transformation, VLSI, area-time complexity.


Table of Contents

1. Introduction
1.1 Units of area, time, information, and energy
1.2 Problem definitions
1.2.1 Discrete Fourier transformation
1.2.2 Sorting
1.3 Notation
1.4 Related work
2. The VLSI model of computation
2.1 Lower bounds
2.2 Correspondence to VLSI chips
2.3 Upper bounds
2.4 Relation to the model of Mead and Rem
3. Lower bounds
3.1 Minimum bisection width
3.2 Area
3.3 Time bounds
3.3.1 Discrete Fourier transformation
3.3.2 Sorting
4. Upper bounds
4.1 Algorithms
4.2 Recirculation algorithms
4.3 VLSI implementations of the FFT
4.3.1 Performing the FFT on a mesh
4.3.2 The multiply-add cell
4.3.3 The FFT on shuffle-exchange connections
4.3.4 Area bounds for the shuffle-exchange graph
4.4 VLSI implementations of sorting
4.4.1 Sorting on shuffle-exchange connections
4.4.2 The comparison-exchange cell
4.4.3 Sorting on mesh connections
4.5 Constant factors in the VLSI implementations
5. Conclusion
Appendix A. Control program for cell 0 of a mesh-based FFT




Chapter 1

Introduction
Very large-scale integrated (VLSI) circuit technology has profoundly changed the size and speed of computing structures. A VLSI microcomputer occupies less than a square centimeter of silicon, yet outperforms several cubic feet of twenty-year-old computer components. The circuit densities attainable with VLSI are already staggering, and further improvements lie on the horizon. Chips with one hundred thousand transistors are feasible today. This figure may well increase to ten or twenty million in the next decade [Mead 80].
The computational power of a chip is often measured by the number of transistors it contains. This is a misleading approach, for the organization of a chip's circuitry has a very strong effect on its size and speed. In general, the more regular chip designs make more efficient use of silicon area. Such designs use less area for the wiring between transistors, leaving more room for the transistors themselves. This explains why present-day technology can put one hundred thousand transistors on a memory chip but only ten thousand transistors on a "random logic" chip. It also indicates that circuit size is more naturally measured by area than by counting transistors.
This thesis explores the relation between the speed and size of VLSI circuits, using the methodology of complexity theory. The first step in this methodology is to devise an accurate and precisely-defined model of a VLSI chip. It is then possible to derive limits on the area and time performance of any chip built according to the rules of the model. This thesis proves both upper and lower bounds on VLSI chip performance. A sample lower bound is that any chip that performs an N-point Fourier transform in time T must have an area A large enough to satisfy AT^2 ≥ cN^2 log^2 N, for some fixed c > 0. The corresponding upper bound is obtained by designing a chip that can solve a Fourier transform. The designs presented in this thesis are nearly optimal in their AT^2 performance, demonstrating that there is not much room for improvement in either the upper or the lower bounds.
The use of a new model of computation in this thesis is justified by the novel aspects of VLSI design. As indicated above, the size of a VLSI chip is best measured by its area. The useful area of a chip is devoted to transistors and wires; neither can be neglected in a realistic model. Unfortunately, Turing machines and other traditional models of computation lack the concept of wire area. Yet it is precisely this concept that is used in Chapter 3 to prove lower bounds on area-time performance.
The main results of this thesis are expressed as area-time tradeoffs. The product of chip area A with the square of the time T it takes to perform an N-element Fourier transform or sorting problem must satisfy AT^2 = Ω(N^2 log^2 N). That is, AT^2 ≥ cN^2 log^2 N for some fixed c > 0 and all N > N_0. This lower bound is nearly the best possible, in the sense that there exist both fast, large chips and slow, small chips that nearly achieve these bounds. The small chips solve their problems in time proportional to N^(1/2) loglog N, using area proportional to N log^2 N. The near-optimality of these designs is immediate, since their AT^2 performance exceeds the lower bound quoted above by a factor of only O(loglog^2 N). The fast chips are also near-optimal. The Fourier transform chip operates in O(log^2 N) time, while an analogous design solves a sorting problem in O(log^3 N) time. Both chips occupy O(N^2 / log^(1/2) N) area.
A more general result bounds the performance of any chip with area A that takes time T to solve an N-element sorting or Fourier transformation problem: AT^(2x) = Ω(N^(1+x) log^(2x) N), for all x such that 0 < x ≤ 1. Each value of x corresponds to a utility function AT^(2x) with slightly different weights given to area and time performance. The lower bound on AT^2 performance given above is merely a special case (x = 1) of this general result.
The general lower bound implies that a chip with performance A = O(N) and T = O(N^(1/2) log N) would be optimal under any AT^(2x) metric, 0 < x ≤ 1. The slow, small chips described above come quite close to these performance figures. The fast and large chips are far from optimal in this general sense. They only approach optimality when x is nearly 1, that is, when time is much more important than area.
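The competition between the two kinds of designs under different AT^(2x) metrics can be illustrated numerically. The sketch below plugs the asymptotic area and time expressions quoted above into the utility function, with all constant factors set to 1; this is purely an illustration, not a claim about actual chip constants:

```python
import math

def at2x(area, time, x):
    """Evaluate the utility function A * T^(2x)."""
    return area * time ** (2 * x)

# Asymptotic forms from the text, constant factors set to 1 for illustration:
N = 2 ** 16
slow = dict(A=N * math.log2(N) ** 2, T=math.sqrt(N) * math.log2(math.log2(N)))
fast = dict(A=N ** 2 / math.sqrt(math.log2(N)), T=math.log2(N) ** 2)

# At this N the slow, small design dominates under every metric shown;
# the gap narrows as x (the weight given to time) approaches 1.
for x in (0.25, 0.5, 1.0):
    ratio = at2x(fast["A"], fast["T"], x) / at2x(slow["A"], slow["T"], x)
    print(f"x={x}: fast/slow AT^2x ratio = {ratio:.3g}")
```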
The following section discusses units of measurement of VLSI circuits from a physical standpoint. The product of chip area with time is shown to be a measure of energy, leading to the following corollary of the general lower bound result: at least Ω(N^(3/2) log N) units of energy must be dissipated during the sorting or Fourier transformation of N numbers on a VLSI chip.
The remainder of the current chapter (Sections 1.2 through 1.4) is devoted to a definition of the problems of sorting and Fourier transformation, the establishment of some notational conventions, and a review of the relevant literature.
Chapter 2 develops a "VLSI model of computation" as a basis for the derivation of lower and upper bounds on chip performance. The notion of a "communication graph" is introduced as the formal analog of a VLSI chip. Communication graphs correspond to VLSI chips, according to the scheme described in this chapter.
Lower bounds are the subject of Chapter 3. A key parameter of any communication graph is defined, its "minimum bisection width" w. The minimum bisection width of a communication graph determines lower bounds on the area and speed of its corresponding VLSI chip. In brief, the area A of a chip is at least proportional to the square of the width of its communication graph, w^2. The maximum possible speed of a chip also increases with its width; more precisely, T = Ω((N log N)/w). Multiplying the first inequality by the square of the second gives the lower bound AT^2 = Ω(N^2 log^2 N). With the additional assumption that the area A is at least Ω(N), the relation becomes AT^(2x) = Ω((N + w^2)((N log N)/w)^(2x)).


It can be shown that choosing w = Θ(N^(1/2)) minimizes AT^(2x) for 0 < x < 1, leading to the general lower bound AT^(2x) = Ω(N^(1+x) log^(2x) N).
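The minimization over w can be checked numerically. The sketch below scans all candidate widths for the expression (N + w^2)((N log N)/w)^(2x); the function name and the choice N = 4096 are illustrative assumptions:

```python
import math

def bound(N, w, x):
    """The width-parameterized lower bound (N + w^2) * ((N log N) / w)^(2x)."""
    return (N + w * w) * ((N * math.log2(N)) / w) ** (2 * x)

N = 4096
for x in (0.3, 0.5, 0.7):
    # Scan every candidate width. The continuous optimum is w = sqrt(x*N/(1-x)),
    # which is Theta(sqrt(N)) for any fixed x strictly between 0 and 1.
    w_best = min(range(1, N), key=lambda w: bound(N, w, x))
    print(f"x={x}: minimizing w = {w_best}, sqrt(N) = {int(math.sqrt(N))}")
```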
Chapter 4 develops near-optimal chip designs, providing upper bounds on achievable VLSI performance. Four designs are presented: “fast” and “slow” chips for both the sorting and Fourier transformation problems. The fast chips are based on the shuffle-exchange interconnection pattern, while the slow ones are based on mesh-type connections.

1.1 Units of area, time, information, and energy

A VLSI chip may be thought of as a multi-layered, planar structure. Transistors are formed on the surface of a silicon substrate, and cannot be stacked on top of each other. Above them are one or more layers of interconnect material (often a metal) that is selectively etched away to form connections or wires between the transistors. The conductive layers are normally insulated from each other, so that wires can cross over one another if they are formed in different layers. Wire segments in different layers can be connected to each other, if a "via" or hole is formed in the insulation.
For concreteness, the area calculations of this thesis assume there is only one layer of interconnect material, and that wires are laid out on a rectangular grid. Thus wires may meet only at right angles. Wires may also cross over each other at right angles, if one of them makes a short run in a heavily doped "channel" in the silicon substrate.
The upper and lower bound results of this thesis could be extended to cover chips with multiple layers of interconnection, as indicated by the discussion on page 36 and by Theorem 3: k layers of interconnection decrease chip area by at most a factor of k^2. There is little immediate importance to such an extension, since current chips use at most two layers of interconnection. Future chips are also likely to have a small number of layers, due to manufacturing difficulties and costs. Each additional layer of interconnection requires one or more additional masking steps, to define where the wires are to run. Every manufacturing step contributes to chip cost both directly (because of increased fabrication time) and indirectly (because of reduced yields). Thus VLSI will continue to be essentially planar until radical advances are made in fabrication techniques.
For convenience, the assumptions of the VLSI model of computation made in this thesis are numbered in order of appearance and collected in a list on page 44. (In this list, each assumption is split into two parts, labelled "L" and "U." Part "L" contains the suppositions strictly necessary for the lower bound proofs, while part "U" contains the additional suppositions needed by the upper bound constructions.) The assumptions made in the previous paragraphs may be summarized as (L1): horizontal and vertical wires are formed in a single planar layer, although two wires may cross each other at right angles.
There is a natural unit of area for VLSI. Manufacturing and physical limitations give rise to a minimal spacing between the centers of parallel wires. In the terminology of Mead and Conway [Mead 80], this minimal spacing is 4λ. The square of this length, 16λ^2, is thus a convenient area unit. The area of a chip can be expressed in terms of unit squares, leading to (L1, extended): one unit square is just large enough to contain one small transistor or one wire cross-over, and just wide enough to allow one wire to enter through each (unit-length) edge. The 64K RAM currently available has an area of about 10^6 λ^2, and chips of 10^8 or 10^9 λ^2 may be possible [Mead 79].
The total area of a VLSI chip may be evaluated in two ways. In production, it is the mask size that is important, that is, the area of the smallest bounding rectangle. This is another important assumption, (U2): the area of a bounding rectangle is used to describe the upper bounds (circuit constructions) of this thesis. On the other hand, the lower bounds derived here measure only the area actually occupied by wires. This is assumption (L2). It strengthens the lower bound results and allows the derivation of bounds on energy dissipation, as noted later in this section.


Information is measured in bits. One bit of information describes the outcome of an event for which there were two equally likely possibilities. For example, a one-bit signal can communicate the result of flipping a fair coin (heads or tails). More generally, the information content of an event that has outcome i with probability p_i is Σ_i (−p_i log p_i) bits [Shannon 49]. This quantity is also called the entropy of the probability distribution P.
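The entropy quantity just defined can be written as a short routine (a hypothetical helper, using base-two logarithms as elsewhere in this thesis):

```python
import math

def entropy(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    assert abs(sum(probs) - 1.0) < 1e-9
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # a fair coin carries exactly 1 bit
print(entropy([0.25] * 4))   # four equally likely outcomes: 2 bits
print(entropy([0.9, 0.1]))   # a biased coin carries less than 1 bit
```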
The unit of time is defined by (L3): a unit-width wire has at most unit bandwidth. In other words, a signal that encodes one bit of information has a duration of at least one time unit. For binary logic, a lower bound on the length of the time unit is the duration of the shortest pulse that can change the state of a circuit.
The definitions of area, time, and information given above are chosen to simplify the theoretical model. Their values in actual implementations will depend on engineering decisions. For example, the unit of time will probably be stretched by a factor of ten or more to allow for propagation delay and synchronization overhead. In any event, the asymptotic results of this thesis will be valid to within a constant factor as long as the definitions of area and time remain fixed. These units of measurement must not change with the size of the circuit being built or with the size of the problem being solved.
Unit values for area and time in a currently feasible MOS technology are 10 µm and 50 ns, respectively [Mead 80]. In other words, wires are at 10 µm spacings and the clock signal used for synchronization has a period of 50 ns. The unit of AT^2 in this case is 250,000 µm^2 ns^2. Anticipated advances in technology will reduce unit widths and times by a factor of 10, bringing the AT^2 unit to 25 µm^2 ns^2.
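The arithmetic behind these unit figures is easy to reproduce (values as quoted from [Mead 80]):

```python
# Unit of area = (wire spacing)^2; unit of time = clock period.
spacing_um, clock_ns = 10.0, 50.0       # figures quoted in the text
at2_unit = spacing_um ** 2 * clock_ns ** 2
print(at2_unit, "um^2 ns^2")            # 250000.0, as stated

# Scaling widths and times down by the predicted factor of 10:
at2_scaled = (spacing_um / 10) ** 2 * (clock_ns / 10) ** 2
print(at2_scaled, "um^2 ns^2")          # 25.0
```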
A unit of energy is defined by the product of the units of area and time. When a signal is sent from one transistor to another, the driver must charge (or discharge) the capacitance presented by both the wire and the receiver. The energy required to charge a capacitor is proportional to its capacitance, so that the energy consumed by a wire or a transistor is proportional to its area (the constant of proportionality depends on the way in which the chip is fabricated). Thus it can be said that one unit of energy is consumed by one unit of chip area every time it is involved in the transmission of a signal.
The proof’s of Chapter 3 place a lower bound on the amount of wire that must be active continuously if a circuit is to compute a function in a given time T The general lower bound on AT2X performance can therefore be used at x = ½ to give a lower hound on the total energy required to compute a function:
AT = ~(N32 log N), for both sorting and Fourier transformation.

The unit of energy for a MOS chip can be evaluated in the following way. Currently, wires have about 10^-3 pF of capacitance per unit area (100 µm^2). Assuming that the signal voltage swing on a wire is 4 volts, .008 pJ of energy (= ½CV^2) is needed to charge each unit of wire area, each time its voltage is changed. Thus the product of wire area with the amount of time it is actively transmitting data has the dimensions of energy, and units of .008 pJ. This energy unit will be reduced to 8 × 10^-8 pJ when lengths, times, and voltages are scaled down by the predicted factor of 10 [Mead 80].
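The .008 pJ figure follows directly from the ½CV² formula; the variable names below are illustrative:

```python
# Energy to charge one unit of wire area: E = (1/2) * C * V^2.
C_pF = 1e-3    # assumed capacitance per unit wire area (10^-3 pF per 100 um^2)
V = 4.0        # signal voltage swing, in volts
E_pJ = 0.5 * C_pF * V ** 2
print(E_pJ, "pJ")   # 0.008, the energy unit used in the text
```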

1.2 Prob’em definitions
The two computational problems treated in this thesis are functions of N variables onto N variables. A VLSI chip is said to solve one of these problems if it can produce the appropriate N output values from any vector of N input values. Input and output values remain on the chip. This assumption is in accordance with a paradigm of comparing a VLSI chip with a conventional computer. Just as complexity issues can be studied without reference to the I/O devices on a conventional computer, there is no need to model off-chip communication if there is a very large amount of memory on the chip. (The assumption of on-chip computation is implicit in assumptions L6 and L7, as defined in Section 2.1.)
The values for problem variables are chosen from the elements of a finite set. That is, the value of a variable may be encoded as a "word" composed of a bounded number of bits. If there are no more than M possible values for each variable, then a word length of ⌈log M⌉ bits can be used.
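A hypothetical helper makes the word-length rule concrete (log here is base two, as in Section 1.3):

```python
import math

def word_length(M):
    """Bits needed to encode one of M equally possible values: ceil(log2 M)."""
    return math.ceil(math.log2(M))

print(word_length(2))     # 1 bit suffices for two values
print(word_length(1000))  # 10 bits, since 2^10 = 1024 >= 1000
```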

A particular assignment of values to input variables constitutes a "problem instance." These values must be chosen independently and uniformly, leading to (L4): there are M^N equally likely problem instances, if each of the N input variables can take on M different values. Assumption L4 is more powerful than it may appear at first glance. Its insistence on an independent, uniform distribution of input values allows many algebraic simplifications in the derivation of time bounds, as will be seen in Section 3.3.

A chip is said to solve a problem in average time T if it takes an average of T units of time to solve one of the M^N equally likely problem instances. In a similar fashion, a chip is said to solve a problem in worst-case time T if it takes no more than T units of time to solve any instance of the problem.
For lower bounds, an average-case result is strictly stronger than an equivalent worst-case result. A chip with an average-case time of at least T must also have a worst-case time of at least T. The lower bound on Fourier transformation covers the average case, while the sorting lower bound applies only to the worst case. Obtaining a lower bound on achievable performance for the average case of a sorting chip remains an open problem.
The upper bound results of this thesis use nonadaptive (straight-line, non-branching) algorithms, so that the chips operate at the same speed on any problem instance. The best-, average-, and worst-case performances of the chips are thus identical.
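The nonadaptive property is easy to illustrate with a sorting network. The sketch below uses the simple odd-even transposition network rather than the shuffle-exchange bitonic sorter of Chapter 4; the schedule of comparison-exchanges is fixed by N alone, so every input follows exactly the same sequence of steps:

```python
def oblivious_sort(values):
    """Sort with a fixed (data-independent) sequence of comparison-exchanges:
    the odd-even transposition network. The schedule of (i, j) pairs depends
    only on the input length, never on the data."""
    v = list(values)
    n = len(v)
    schedule = [(i, i + 1) for step in range(n) for i in range(step % 2, n - 1, 2)]
    for i, j in schedule:
        if v[i] > v[j]:          # the comparison-exchange is the only data-dependent act
            v[i], v[j] = v[j], v[i]
    return v

print(oblivious_sort([5, 1, 4, 2, 3]))  # [1, 2, 3, 4, 5]
```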

1.2.1 Discrete Fourier transformation

The central problem studied in this thesis is the computation of the discrete Fourier transform, or DFT. The DFT may be defined as a matrix-vector multiplication, y = Ax, where A is the matrix of constants defined below. The input vector is x and the output vector is y; both are of length N. The elements of the N-by-N matrix A are constants determined by the structure of the ring in which the computation is performed.
A ring is defined by a set of elements and the rules for adding and multiplying elements. The values of variables in x and y must be chosen from the elements in the ring. For the reason noted in the previous section, there can be only a finite number of values for each of the problem variables. The customary definition of a Fourier transform over the (infinite) field of complex numbers is thus inadmissible here; only finite rings will be considered in this thesis.
The structure of the ring over which the DFT is defined has some impact on the properties of the transform. It is customary [Agarwal 75, Aho 74, Bonneau 73, Nicholson 71] to restrict attention to commutative rings with additive and multiplicative identity elements, a principal Nth root of unity, and a multiplicative inverse for N. Such rings lead to DFTs with most of the properties associated with Fourier transforms in the field of complex numbers: invertibility, orthogonality, and the cyclic convolution property [Agarwal 75].
A particularly suitable ring for the DFT is the integers {0, 1, ..., M−1} under addition and multiplication modulo M. The prime factorization of the modulus M characterizes this ring. If

M = p1^r1 p2^r2 ··· pk^rk, (1.1)

then there is an Nth root of unity and an element N^(−1) if and only if N divides the greatest common divisor of {p1 − 1, p2 − 1, ..., pk − 1} [Agarwal 75, Bonneau 73]. An immediate implication of this result is that M > N.
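The divisibility condition just quoted can be checked mechanically. The helper names below are assumptions, and the trial-division factoring is only suitable for small moduli:

```python
from math import gcd

def factor(M):
    """Distinct prime factors of M (trial division; fine for small M)."""
    primes, d = [], 2
    while d * d <= M:
        while M % d == 0:
            if d not in primes:
                primes.append(d)
            M //= d
        d += 1
    if M > 1:
        primes.append(M)
    return primes

def supports_dft(M, N):
    """The condition above: an N-point DFT exists mod M iff N divides
    gcd(p - 1) taken over the distinct primes p dividing M."""
    g = 0
    for p in factor(M):
        g = gcd(g, p - 1)
    return g % N == 0

print(supports_dft(17, 16))  # True: 16 divides 17 - 1
print(supports_dft(15, 4))   # False: gcd(3 - 1, 5 - 1) = 2, not divisible by 4
```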
If the constant a is a principal Nth root of unity, the matrix A in the defining equation y = Ax is given by

A[i, j] = a^(ij), for 0 ≤ i, j ≤ N−1.

The powers a^0, a^1, ..., a^(N−1) appearing in A are all distinct [Agarwal 75].
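The defining matrix-vector product is direct to compute in the ring of integers mod M. The instance below (M = 17, N = 4, a = 4) is a hypothetical small example, chosen because 4 is a principal 4th root of unity mod 17 (4^2 ≡ −1, 4^4 ≡ 1):

```python
def dft_matrix(N, a, M):
    """The N-by-N DFT matrix A[i][j] = a^(i*j) mod M, for a principal
    N-th root of unity a in the ring of integers mod M."""
    return [[pow(a, i * j, M) for j in range(N)] for i in range(N)]

def dft(x, a, M):
    """y = A x: the defining matrix-vector product, all arithmetic mod M."""
    N = len(x)
    A = dft_matrix(N, a, M)
    return [sum(A[i][j] * x[j] for j in range(N)) % M for i in range(N)]

y = dft([1, 2, 3, 4], 4, 17)
print(y)
```

Invertibility, one of the ring properties noted above, can be seen by transforming back with a^(−1) = 13 and scaling by N^(−1) = 13 (mod 17).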


The length of the transform, N, determines what algorithms may be used to compute a DFT. Winograd's [Winograd 76] DFT algorithm is based on a recognition of this effect. The upper bounds of Chapter 4 are implementations of the "fast Fourier transform" or FFT [Aho 74], and thus assume (U4): N is a power of 2. In contrast, the lower bounds of Chapter 3 apply to all N.
The Fourier transform circuits of Chapter 4 use an additional assumption, also part of (U4): log M = O(log N). This assumption allows the value of any input or output variable to be coded in O(log N) bits, so that the dependence of circuit size on M can be expressed in terms of N alone. An interesting existence question is raised by this assumption. For arbitrary N, there had better be a ring modulus M of appropriate size capable of supporting an N-element DFT. According to Agarwal's result, cited above, an N-element DFT exists in a ring of prime modulus M if M is of the form qN + 1. Fortunately, it can be shown [Wagstaff 79, Linnik 44] that the least prime M of this form is bounded by a polynomial in N. A word-length of log M = O(log N) bits is thus available for any N-element DFT. Assumption U4 merely states that the circuit constructions of Chapter 4 use nearly minimal word-lengths.
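The existence claim can be explored empirically by searching for the least prime of the form qN + 1. Function names below are illustrative, and the naive primality test is only suitable for small N:

```python
def is_prime(m):
    """Trial-division primality test (adequate for small m)."""
    if m < 2:
        return False
    d = 2
    while d * d <= m:
        if m % d == 0:
            return False
        d += 1
    return True

def least_prime_modulus(N):
    """Smallest prime M of the form q*N + 1; such an M supports an
    N-point DFT, and M is bounded by a polynomial in N."""
    q = 1
    while not is_prime(q * N + 1):
        q += 1
    return q * N + 1

for N in (8, 16, 64, 256):
    print(N, least_prime_modulus(N))  # M stays small, so log M = O(log N)
```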

1.2.2 Sorting
The N inputs to a sorting chip may be thought of as integers between 0 and M−1. This analogy to the integers is intended to convey two things. First, the input values are members of a linearly ordered set, so that a "greater than" relation is defined. Second, there are exactly M different possibilities for each input value.
The N outputs of a sorting chip are a permutation of the inputs into sorted order. That is, output y0 is the smallest input value, output y1 is the second smallest, and so forth.
The lower bound of Theorem 15 requires an addition to assumption (L4): M = N^(1+ε), for some fixed positive ε. Note that some lower limit must be assumed for M in order to obtain a powerful result. For example, the sorting problem for M = 1 is trivial; no computation is required to determine that all output values are identical.

1.3 Notation
The following functional notation is used throughout this work.

f(n) = O(g(n))  "f is big O of g," an upper bound within a constant factor. There exists a positive constant c for which f(n) ≤ c g(n) for all sufficiently large n.

f(n) = Θ(g(n))  "f is theta of g," an exact bound within a constant factor. There exist positive constants c1 and c2 for which c1 g(n) ≤ f(n) ≤ c2 g(n) for all sufficiently large n.

f(n) = Ω(g(n))  "f is omega of g," a lower bound within a constant factor. There exists a positive constant c for which f(n) ≥ c g(n) for all sufficiently large n.

⌈x⌉  "ceiling of x," the least integer greater than or equal to x.

⌊x⌋  "floor of x," the greatest integer less than or equal to x.

log x  the base two logarithm of x.

log^k x  (log x)^k; similarly, loglog^k x denotes (loglog x)^k.

|X|  the number of elements in the set X, the number of dimensions in the vector X, or the number of distinct values in the sample space of the (discrete) random variable X.

1.4 Related work
Three existing areas of inquiry shed light on the problems studied here. First, applications of a theory of graph embedding in the plane, such as printed circuit board wire routing, are relevant to the essentially planar VLSI technology. Second, the model of computation developed here for VLSI is similar in spirit to other information-theoretic models. Third, the area-time results derived in this thesis may be contrasted with the space-time tradeoffs observed in more traditional models of computation.
Graph embedding. Results in this area are written in a wide range of styles, from a hard-nosed pragmatic approach to a carefully-formalized theoretical treatment.
On the practical end, the problem of wire and chip placement on printed circuit boards is quite similar to the problem of wire and circuit placement on the surface of VLSI chips. Donath [Donath 79] derives upper bounds for wire length in both problem domains, as a function of a parameter in the empirical "Rent's relationship" for logic networks. These upper bounds overestimate wire lengths by a factor of two or less, on a number of actual layouts. Sutherland and Oestreicher [Sutherland 73] estimate wiring requirements for printed circuit boards, under the pessimistic assumption that chips are randomly placed on the board. Both studies use the key idea, developed here in Section 3.1, of obtaining analytical results by partitioning the graph.
Formal approaches to graph embedding also yield interesting results. Cutler and Shiloach [Cutler 78] study the problem of embedding bipartite graphs in the plane, under the rather restrictive assumption that no edge "crossovers" are allowed. Lipton and Tarjan [Lipton 77] and Rosenberg [Rosenberg 79] obtain bounds on the cost of embeddings, under the very liberal assumption (for VLSI) that any number of edges may pass over a point. Leiserson [Leiserson 80] uses more natural assumptions in his algorithm for embedding a graph in near-minimal area. The last three papers cited use variants of the partitioning idea alluded to in the previous paragraph. Lipton and Tarjan take the idea one step further, and use the size of the partition to derive complexity results. Much of their work should transfer into the VLSI model of computation, but this task has not yet been attempted.

Information-theoretic models. An information-theoretic model is defined here as one that emphasizes the cost of information transmission. Most existing models have a different orientation, measuring operation counts and memory requirements for the traditional von Neumann architecture for uniprocessors. Even so, the informational emphasis of the VLSI model of computation is far from unique.
Among established models in complexity theory, cellular automata [von Neumann 66] are the most suited for informational studies. Each cell of a two-dimensional array of automata changes its state as a function of the current states of its nearest neighbors. From an information-theoretic standpoint, each cell receives a packet of information from its neighbors every time unit. This packet need have no more bits than the logarithm of the number of possible neighborhood state configurations. Minimal time and state solutions to cellular automata problems thus tend to minimize the transmission of information. Moshell and Rothstein's "bus automata" could be used to model the flow of information among cellular automata in a natural fashion [Moshell 76].
Floyd [Floyd 72] makes use of the entropy of a memory state to obtain lower bounds on the number of operations needed to perform memory reorganization in a two-level store. In his model, an operation is the production of a third "page" of information, as any function of the contents of any two pages. If n_ij is the number of records (amount of information) to be sent from the ith page to the jth page, the entropy of a memory state is defined as Σ_ij n_ij log n_ij. (Actually, Floyd's V-function is defined somewhat differently, to handle the case that log n_ij is not an integer.) One operation can change the entropy of a memory state by at most p, where p is the number of records on a page. A lower bound on the number of operations needed to achieve a reorganization is thus the entropy of the original state (relative to that reorganization) divided by p.
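Floyd's counting argument can be sketched in a few lines. The simplified entropy below ignores the adjustment his V-function makes for non-integral logarithms, and the example matrix is hypothetical:

```python
import math

def state_entropy(n):
    """Simplified Floyd-style entropy of a memory state: the sum of
    n_ij * log2(n_ij) over all page pairs (i, j)."""
    return sum(c * math.log2(c) for row in n for c in row if c > 0)

def min_operations(n, page_size):
    """One operation changes the entropy by at most page_size, so the
    initial entropy divided by page_size bounds the operation count."""
    return math.ceil(state_entropy(n) / page_size)

# Hypothetical reorganization: 4 pages, each holding 2 records bound
# for each of 4 target pages.
n = [[2, 2, 2, 2] for _ in range(4)]
print(min_operations(n, page_size=8))
```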
On another front, a theory of "distributed computing" is beginning to emerge as an outgrowth of research into parallel processing for database manipulation. A. Yao [Yao 79] outlines some of the implications of various assumptions that might be made about a distributed system: one-way vs. two-way communication, deterministic vs. probabilistic computations. Abelson [Abelson 80a] develops some analytic tools for bounding the information transfer required in the computation of continuous, differentiable functions. Both authors treat a problem as a fixed partitioning of the input data between a pair of processors. In distinction, this thesis defines a problem as an input-output relation, with no partitioning assumptions.
Space-time tradeoffs. Although operation counts and memory requirements of uniprocessors are irrelevant considerations for VLSI, the analytical tools to demonstrate space-time tradeoffs in more conventional models can be applied to VLSI area-time studies.

Grigoryev [Grigoryev 76] studies the problem of computing a set of m binary functions. He proves that any straight-line (nonbranching, nonadaptive) algorithm with T steps and S storage locations satisfies ST ≥ ml/2, if the set of functions is "l-independent." A similar notion of functional independence is the basis for the bounds of Section 3.3. Grigoryev's definition is discussed in more detail in that section. Savage and Swamy [Savage 79a] generalize Grigoryev's method and apply it to integer multiplication. Earlier, they had found a space-time tradeoff for the fast Fourier transform algorithm [Savage 77].

Space-time bounds are often derived from consideration of a "pebble game" [Paterson 70, Savage 77] played on the graph of a straight-line algorithm. Each pebble corresponds to a storage register and each node represents a function to be evaluated. An edge leads from one node to another if the value of the parent appears as a parameter in the child's function. Nodes with no parents correspond to problem inputs; nodes with no children are problem outputs. Placing a pebble on a node means storing the value of the node's function in the pebble's register, which is possible only if the node's parents (the function's parameters) are all pebbled (evaluated and stored). Removing a pebble from a node corresponds to erasing the contents of the pebble's register. The object of the game is to pebble all childless nodes, in other words, to evaluate all the problem outputs. Time in this model is measured serially as the number of pebble movements, that is, the number of function evaluations. Space is the number of different pebbles (registers) used.
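The rules of the pebble game can be made concrete with a small simulator. The sketch below checks the pebbling rule, counts placements as time, and uses one greedy removal policy among many possible strategies; the DAG and all names are hypothetical:

```python
def pebble(dag, order):
    """Pebble game on a straight-line algorithm's DAG (node -> parent list).
    Pebbling a node requires all its parents to hold pebbles; here a pebble
    is removed (greedily) as soon as all of a node's children are pebbled.
    Returns (time, peak_space) = (placements, max pebbles in use)."""
    children = {v: [] for v in dag}
    for v, parents in dag.items():
        for p in parents:
            children[p].append(v)
    remaining = {v: len(children[v]) for v in dag}
    pebbled, peak, time = set(), 0, 0
    for v in order:
        assert all(p in pebbled for p in dag[v]), "parents must be pebbled first"
        pebbled.add(v)
        time += 1
        peak = max(peak, len(pebbled))
        for p in dag[v]:
            remaining[p] -= 1
            if remaining[p] == 0:
                pebbled.discard(p)   # erase a register that is no longer needed
    return time, peak

# A 4-input binary tree of function evaluations (hypothetical example).
dag = {"x0": [], "x1": [], "x2": [], "x3": [],
       "f01": ["x0", "x1"], "f23": ["x2", "x3"], "out": ["f01", "f23"]}
print(pebble(dag, ["x0", "x1", "f01", "x2", "x3", "f23", "out"]))
```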


For VLSI, one is tempted to interpret the graph of an algorithm in quite a different fashion. Each node could be a processor waiting for its predecessors to send it the results of their function evaluations. Time would then be the depth of the graph, and space the area of the graph when embedded in the plane. The pebbling game seems irrelevant in this interpretation of a graph. Some pebbling results, however, involve proofs of necessary properties of any graph for a particular problem. For example, Valiant [Valiant 76] shows that any Fourier transform algorithm corresponds to a hyperconcentrator. Pippenger's extension of this result, as reported by Tompa [Tompa 78], is the basis of Lemma 8 of Section 3.3.1.
Unfortunately, most pebbling results are based on algorithms operating over the set of real numbers. The analytic techniques used do not always transfer into the modular arithmetic, or other finite algebraic structures, of the VLSI model of computation.


