
13.4 Schedulers


Given a Universe of functional blocks to be scheduled and a Target describing the topology and characteristics of the single- or multiple-processor system for which code is to be generated, it is the responsibility of the Scheduler object to perform some or all of the following functions: determining which processor each block (or block invocation) executes on, for multiple-processor targets; determining the order in which blocks execute on each processor; and arranging those executions into standard control structures, such as nested loops.

13.4.1 Single-processor schedulers

For targets consisting of a single processor, we provide four different scheduling techniques. The user can select the most appropriate scheduler for a given application by setting the loopingLevel target parameter.

In the first approach (loopingLevel = DEF), which is the default SDF scheduler, we conceptually construct the acyclic precedence graph (APG) corresponding to the system, and generate a schedule that is consistent with that precedence graph. Note that the precedence graph is not physically constructed. There are many possible schedules for all but the most trivial graphs; the schedule chosen takes resource costs, such as the necessity of flushing registers and the amount of buffering required, into account. The target then generates code by executing the actors in the sequence defined by this schedule. This is a quick and efficient approach when the SDF graph does not have large sample-rate changes. If there are large sample-rate changes, the size of the generated code can be huge because the codeblock for an actor might occur many times (if the number of repetitions for the actor is greater than one); in this case, it is better to use some form of loop scheduling.
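
To make the notion of repetitions concrete, the sketch below (not Ptolemy code) solves the balance equation for a single SDF connection. For the A/B example used later in this section, where A produces 5 samples and B consumes 1 per firing, it yields the repetitions 1 and 5 that lead to the flat schedule A B B B B B.

// Minimal sketch (not Ptolemy code): smallest repetition counts for one SDF
// connection A -> B, where A produces p tokens and B consumes c tokens per
// firing. The balance equation rA*p == rB*c has the smallest integer solution
// rA = c/g, rB = p/g with g = gcd(p, c).
#include <cstdio>
#include <numeric>   // std::gcd (C++17)

int main() {
    int p = 5, c = 1;                 // rates from the A/B example later in this section
    int g = std::gcd(p, c);
    printf("repetitions: A = %d, B = %d\n", c / g, p / g);   // prints 1 and 5
    return 0;
}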

We call the second approach Joe's scheduler. In this approach (loopingLevel = CLUST), actors that have the same sample rate are merged (wherever this will not cause deadlock) and loops are introduced to match the sample rates. The result is a hierarchical clustering; within each cluster, the techniques described above can be used to generate a schedule. The code then contains nested loop constructs together with sequences of code from the actors.

Since the second approach is a heuristic solution, there are cases where some looping possibilities go undetected. By setting loopingLevel to SJS, we can choose the third approach, called SJS (Shuvra-Joe-Soonhoi) scheduling after the inventors' first names [Bha94]. After performing Joe's scheduling as a front end, it attacks the remaining graph with an algorithm that is guaranteed to find the maximum amount of looping available in the graph.

In the fourth approach, obtained by setting loopingLevel to ACYLOOP, we choose a scheduler that generates single appearance schedules optimized for buffer memory usage. This scheduler was developed by Praveen Murthy and Shuvra Bhattacharyya [Mur96] [Bha96]. It tackles only acyclic SDF graphs; if it finds that the universe is not acyclic, it automatically resets the loopingLevel target parameter to SJS. Basically, for a given SDF graph, there can be many different single appearance schedules. These are all optimally compact in terms of schedule length (or program memory, for inline code generation), but they will in general require differing amounts of buffer memory; the difference between the buffer memory requirement of an arbitrary single appearance schedule and that of one optimized for buffer memory usage can be dramatic. In code generation it is essential that memory consumption be minimal, especially when generating code for embedded DSP processors, since these chips have very limited amounts of on-chip memory. Note that acyclic SDF graphs always have single appearance schedules; hence, this scheduler will always give a single appearance schedule. If the file target parameter is set, a summary of the internal scheduling steps is written to that file. Essentially, the ACYLOOP scheduler uses two different heuristics, called APGAN and RPMC, and selects the better of the two. The generated file contains the schedule produced by each algorithm, the resulting buffer memory requirement, and a lower bound on the buffer memory requirement over all possible single appearance schedules (called the BMLB).
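
The following self-contained sketch (not part of Ptolemy; the graph and rates are made up) shows why different single appearance schedules can need very different amounts of buffering. For the chain A -> B -> C, where A produces 1 token and B consumes 2 on the first edge, and B produces 1 token and C consumes 2 on the second (repetitions 4, 2, 1), the flat schedule (4A)(2B)C needs 6 words of buffer memory, while the nested schedule (2(2A)B)C needs only 4.

// Illustrative sketch: simulate two single appearance schedules of the chain
// A -> B -> C and report the total buffer memory each one requires.
#include <cstdio>
#include <algorithm>

struct Edge { int tokens = 0, maxTokens = 0; };

static void fire(char actor, Edge& ab, Edge& bc) {
    switch (actor) {
    case 'A': ab.tokens += 1; break;                 // A produces 1 on A->B
    case 'B': ab.tokens -= 2; bc.tokens += 1; break; // B consumes 2, produces 1
    case 'C': bc.tokens -= 2; break;                 // C consumes 2 on B->C
    }
    ab.maxTokens = std::max(ab.maxTokens, ab.tokens);
    bc.maxTokens = std::max(bc.maxTokens, bc.tokens);
}

static int bufferMemory(const char* firingSequence) {
    Edge ab, bc;
    for (const char* p = firingSequence; *p; ++p) fire(*p, ab, bc);
    return ab.maxTokens + bc.maxTokens;
}

int main() {
    // Flat schedule (4A)(2B)C expanded into individual firings:
    printf("(4A)(2B)C  needs %d words\n", bufferMemory("AAAABBC"));   // 6
    // Nested schedule (2(2A)B)C:
    printf("(2(2A)B)C  needs %d words\n", bufferMemory("AABAABC"));   // 4
    return 0;
}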

If the second, third, or fourth approach is taken, the code size is drastically reduced when there are large sample rate changes in the application. On the other hand, we sacrifice some efficient buffer management schemes. For example, suppose that star A produces 5 samples to star B which consumes 1 sample at a time. If we take the first approach, we schedule this graph as ABBBBB and assign a buffer of size 5 between star A and B. Since each invocation of star B knows the exact location in the allocated buffer from which to read its sample, each B invocation can read the sample directly from the buffer. If we choose the second, third, or fourth approach, the scheduling result will be A5(B). Since the body of star B is included inside a loop of factor 5, we have to use indirect addressing for star B to read a sample from the buffer. Therefore, we need an additional buffer pointer for star B (memory overhead), and one more level of memory access (runtime overhead) for indirect addressing.
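
The sketch below illustrates the flavor of that trade-off. It is hand-written for illustration only, not actual output of any code generation target; fillFromA() and process() are hypothetical stand-ins for the code blocks of stars A and B.

// Illustrative only: the shape of inline code for the flat schedule A B B B B B
// versus the looped schedule A 5(B).
#include <cstdio>

static float buf[5];                       // buffer of size 5 between A and B
static void fillFromA() { for (int i = 0; i < 5; i++) buf[i] = i; }
static void process(float x) { printf("%g\n", x); }

static void flatSchedule() {               // A B B B B B: direct addressing
    fillFromA();
    process(buf[0]);                       // B, invocation 1
    process(buf[1]);                       // B, invocation 2
    process(buf[2]);                       // B, invocation 3
    process(buf[3]);                       // B, invocation 4
    process(buf[4]);                       // B, invocation 5
}

static void loopedSchedule() {             // A 5(B): indirect addressing
    fillFromA();
    float* bufPtr = buf;                   // extra buffer pointer (memory overhead)
    for (int i = 0; i < 5; i++)
        process(*bufPtr++);                // one more level of access (run-time overhead)
}

int main() { flatSchedule(); loopedSchedule(); return 0; }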

13.4.2 Multiprocessor schedulers

A key idea in Ptolemy is that no single scheduler is expected to handle all situations. Users can write their own schedulers and use them in conjunction with the schedulers we have written. As with the rest of Ptolemy, schedulers are written following object-oriented design principles; thus a user never has to write a scheduler from the ground up, and is free to derive a new scheduler from even our most advanced ones. We have designed a suite of specialized schedulers that can be mixed and matched for specific applications.

The first step in multiprocessor scheduling, or parallel scheduling, is to translate a given SDF graph to an acyclic precedence expanded graph (APEG). The APEG describes the dependency between invocations of blocks in the SDF graph during execution of one iteration. Refer to the SDF domain documentation for the meaning of one iteration. Hence, a block in a multirate SDF graph may correspond to several APEG nodes. Parallel schedulers schedule the APEG nodes onto processors.
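
The following small sketch (a simplification that ignores initial tokens on the arc) shows how one multirate SDF connection expands into APEG nodes and precedence edges. With A producing 2 tokens and B consuming 3 per firing, the repetitions are 3 and 2, so the single blocks A and B become APEG nodes A1..A3 and B1, B2, and the program prints the dependencies A1->B1, A2->B1, A2->B2, A3->B2.

// Simplified sketch (no delays): expand one SDF edge A -> B into APEG
// precedence edges between individual invocations within one iteration.
#include <cstdio>

int main() {
    int p = 2, c = 3;                      // hypothetical rates: A produces p, B consumes c
    int rB = 2;                            // repetitions solving rA*p == rB*c (rA = 3)
    for (int j = 1; j <= rB; j++) {        // j-th invocation of B
        int firstTok = (j - 1) * c + 1;
        int lastTok  = j * c;
        int firstA   = (firstTok + p - 1) / p;   // producer of its first token
        int lastA    = (lastTok  + p - 1) / p;   // producer of its last token
        for (int i = firstA; i <= lastA; i++)
            printf("A%d -> B%d\n", i, j);
    }
    return 0;
}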

We have implemented three scheduling techniques that map SDF graphs onto multiple processors with various interconnection topologies: Hu's level-based list scheduling, Sih's dynamic level scheduling [Sih91], and Sih's declustering scheduling [Sih91]. The target architecture is described by its Target object, derived from CGMultiTarget. The Target class provides the scheduler with the necessary information on interprocessor communication to enable both scheduling and code synthesis.

The CGMultiTarget has a parameter, schedName, that allows the user to select the type of schedule. Currently, there are five different scheduling options:

DL If schedName is set to DL, Sih's dynamic level scheduler is selected; it accounts for IPC overhead during scheduling.
HU If schedName is set to HU, Hu's level-based list scheduler is selected; it ignores IPC overhead (a minimal sketch of this style of scheduling appears after this list).
DC If schedName is set to DC, Sih's declustering scheduler is selected. Because it is more expensive than the DL or HU scheduler, the declustering algorithm is advantageous only when list scheduling gives poor results, as judged from the displayed schedule.
HIER(DL) or HIER(HU) or HIER(DC)
To use Pino's hierarchical scheduler, set schedName to HIER(DL), HIER(HU), or HIER(DC). The default top-level scheduling option is the DL scheduler; to use DC or HU instead, specify it within the parentheses.
CGDDF If schedName is set to CGDDF, Ha's dynamic construct scheduler is selected. To use this scheduler, Ptolemy must be recompiled with special flags, or the mkcgddf executable must be used.

Whichever scheduler is used, communication nodes are scheduled into the generated code. For example, if Hu's level-based list scheduler is used, communication overhead is ignored when assigning stars to processors. Hence, the code is likely to contain more communication stars than with the schedulers that do account for IPC overhead.
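
As a rough illustration of the HU option, here is a minimal level-based list scheduler. It is not Ptolemy's implementation; the four-node graph and its execution times are made up, and, like HU, it ignores IPC cost entirely.

// Minimal level-based list scheduling sketch. Level of a node = longest
// execution-time path from it to a sink; ready nodes are assigned, highest
// level first, to the processor that becomes free earliest.
#include <cstdio>
#include <vector>
#include <algorithm>

struct Node { int time; std::vector<int> succ; };

int main() {
    // Hypothetical 4-node APEG (edges 0->1, 0->2, 1->3, 2->3) on 2 processors.
    std::vector<Node> g = { {2, {1, 2}}, {3, {3}}, {1, {3}}, {2, {}} };
    int n = (int)g.size(), nProcs = 2;

    // Compute levels; indices happen to be topologically sorted here,
    // so a single reverse sweep suffices.
    std::vector<int> level(n, 0);
    for (int i = n - 1; i >= 0; --i) {
        int best = 0;
        for (int s : g[i].succ) best = std::max(best, level[s]);
        level[i] = g[i].time + best;
    }

    std::vector<int> finish(n, -1), nPredsLeft(n, 0), readyTime(n, 0);
    for (int i = 0; i < n; ++i)
        for (int s : g[i].succ) ++nPredsLeft[s];
    std::vector<int> procFree(nProcs, 0);

    for (int k = 0; k < n; ++k) {
        // Pick the ready node (all predecessors scheduled) with the highest level.
        int pick = -1;
        for (int i = 0; i < n; ++i)
            if (finish[i] < 0 && nPredsLeft[i] == 0 &&
                (pick < 0 || level[i] > level[pick]))
                pick = i;
        // Assign it to the processor that becomes free earliest (IPC cost ignored).
        int p = (int)(std::min_element(procFree.begin(), procFree.end()) - procFree.begin());
        int start = std::max(procFree[p], readyTime[pick]);
        finish[pick] = start + g[pick].time;
        procFree[p] = finish[pick];
        printf("node %d on proc %d: start %d, finish %d\n", pick, p, start, finish[pick]);
        for (int s : g[pick].succ) {
            --nPredsLeft[s];
            readyTime[s] = std::max(readyTime[s], finish[pick]);
        }
    }
    return 0;
}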

There are other target parameters that direct the scheduling procedure. If the parameter manualAssignment is set to YES, then the default parallel scheduler does not perform star assignment. Instead, it checks the processor assignment of all stars (set using the procId state of CG and derived stars). By default, the procId state is set to -1, which is an illegal assignment since child targets are numbered from 0. If any star other than the Fork star has an illegal procId state, an error is generated saying that manual scheduling has failed. Otherwise, we invoke a list scheduler that determines the order of execution of blocks on each processor based on the manual assignment. We do not support the case where a block requires more than one processor. The manualAssignment option automatically sets the oneStarOneProc state, discussed next.

If there are sample rate changes, a star in the program graph may be invoked multiple times in each iteration. By default, these invocations may be assigned to different processors. We can prevent this by setting the oneStarOneProc state to YES: all invocations of a star are then assigned to the same processor, regardless of whether they are parallelizable. The advantage of doing this is simplicity in code generation, since we do not need to splice in Spread/Collect stars, which are discussed later. It also enables another scheduling option, adjustSchedule, described below. The main disadvantage of setting oneStarOneProc to YES is the performance loss from not exploiting parallelism. This loss is most severe with Sih's declustering algorithm, so that algorithm is not recommended with this option.

In this paragraph, we describe a future scheduling option that this release does not yet support. Once automatic scheduling (with the oneStarOneProc option set) is performed, the processor assignment of each star is determined. After examining the assignment, the user may want to override the scheduling decision manually. This can be done by setting the adjustSchedule parameter: after the automatic scheduling is performed, the procId state of each star is automatically updated with the assigned processor, and the programmer can override the scheduling decision by resetting that state. adjustSchedule cannot be set to YES before a scheduling decision has been made. Again, this option is not supported in this release.

Different scheduling options result in different assignments of APEG nodes. Regardless of which scheduling options are chosen, the final stage of the scheduling is to decide the execution order of stars including send/receive stars. This is done by a simple list scheduling algorithm in each child target. The final scheduling results are displayed on a Gantt chart. The multiple-processor scheduler produces a list of single processor schedules, giving them to the child targets. The schedules include send/receive stars for interprocessor communication. The child targets take their schedules and generate code.

To produce code for the child targets, we create a sub-galaxy for each child target, consisting of the stars scheduled on that target plus, if necessary, the extra stars discussed below. A child target follows the same steps to generate code as a single-processor target, except that the schedule is not computed again; the scheduling result is inherited from the parent target.

Send/Receive stars

After the assignment of APEG nodes is finished, the interprocessor communication requirements between blocks are determined in the sub-galaxies. Suppose star A is connected to star B and there is no sample rate change. By assigning star A and star B to different processors (1 and 2, respectively), the parallel scheduler introduces interprocessor communication. Processor 1 should then generate code for star A and a "send" star, while processor 2 should generate code for a "receive" star and star B. These "send" and "receive" stars are inserted automatically by the Ptolemy kernel when the execution order of blocks in each child target is determined and the sub-galaxies are created. The actual creation of the send/receive stars is done by the parallel scheduler, which invokes the createSend() and createReceive() methods of the parent multi-target (as mentioned earlier).

Once the generated code is loaded, processors run autonomously. The synchronization protocol between processors is hardwired into the "send" and "receive" stars. One common approach in shared-memory architectures is the use of semaphores. Thus a typical synchronization protocol is to have the send star set a flag when it completes the data transfer, and have the receive star read the data and reset the semaphore. The receive star will not read the data if the semaphore has not been set and similarly, the send star will not write data if the semaphore has not been reset. In a message passing architecture, the send star may form a message header to specify the source and destination processors. In this case, the receive star would decode the message by examining the message header.
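
The sketch below is a conceptual rendering of that flag-based handshake for a single-sample, shared-memory channel. The channel layout and the names are hypothetical; this is not the code emitted by any particular pair of send/receive stars.

// Conceptual sketch of the flag-based handshake described above.
#include <cstdio>

struct Channel {
    int full;      // semaphore: 1 = data present, 0 = slot free
    float data;    // one-sample buffer in shared memory
};

void sendStar(volatile Channel* ch, float sample) {
    while (ch->full) { /* busy-wait until the receiver has reset the flag */ }
    ch->data = sample;     // complete the data transfer first...
    ch->full = 1;          // ...then set the flag
}

float receiveStar(volatile Channel* ch) {
    while (!ch->full) { /* busy-wait until the sender has set the flag */ }
    float sample = ch->data;   // read the data...
    ch->full = 0;              // ...then reset the semaphore
    return sample;
}

int main() {                   // single-threaded demonstration of the protocol
    Channel ch = {0, 0.0f};
    sendStar(&ch, 3.14f);
    printf("received %g\n", receiveStar(&ch));
    return 0;
}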

To properly support arbitrary data types, the send star should have an ANYTYPE input and the receive star an ANYTYPE output. The resolved type for each of these ports can be obtained using the PortHole::resolvedType method. For a preliminary version of the communication stars, you can use a fixed datatype such as FLOAT or INT.

Send/receive stars that are declared to support ANYTYPE but fail to support a particular datatype should display an appropriate error message using the Error::abortRun method. Finally, each of these stars must call PortHole::numXfer to determine the size of the block of data that needs to be transferred on each invocation.

Spread/Collect stars

Consider a multirate example in which star A produces two tokens and star B consumes one token each time it fires. Suppose that the first invocation of star B is assigned to the same processor as star A (processor 1), but the second invocation is assigned to processor 2. After star A fires in processor 1, the first token produced should be routed to the invocation of star B on the same processor, while the second token should be shipped to processor 2: interprocessor communication is required! Since star A has one output port and that port should be connected to two different destinations (one is star B, the other is a "send" star), we insert a "spread" star after star A. As a result, the sub-galaxy created for processor 1 contains four blocks: star A is connected to a "spread" star, which in turn has two outputs connected to star B and a "send" star. The role of a "spread" star is to spread tokens from a single output porthole to multiple destinations.

On the other hand, we may need to "collect" tokens from multiple sources to a single input porthole. Suppose we reverse the connections in the above example: star B produces one token and star A consumes two tokens. We have to insert a "collect" star at the input porthole of star A to collect tokens from star B and a "receive" star that receives a token from processor 2.

The "spread" and "collect" stars are automatically inserted by the scheduler, and are invisible to the user. Moreover, these stars can not be scheduled. They are added to sub-galaxies only for the allocation of memory and other resources before generating code. The "spread" and "collect" stars themselves do not require extra memory since in most cases we can overlay memory buffers. For example, in the first example, a buffer of size 2 is assigned to the output of star A. Star B obtains the information it needs to fetch a token from the first location of the buffer via the "spread" star, while the "send" star knows that it will fetch a token from the second location. Thus, the buffers for the outputs of the "spread" star are overlaid with the output buffer of star A.

If the connection between two blocks that must be joined through "spread" or "collect" stars carries delays, or if past tokens are used on it, the data must be copied explicitly. In that case we do need extra memory for these stars, and the user will see the "spread"/"collect" stars in the generated code.

Spread/Collect stars have only been implemented in the CGC domain so far.




Copyright © 1990-1997, University of California. All rights reserved.