Distributed Modeling and Simulation with Backtracking

Researchers:	Thomas Feng
Advisor:	Edward A. Lee

Distributed modeling and simulation are active research areas. They have received increasing attention, particularly because high-speed networks and cluster computer systems are now widely available. There are many good reasons for using such systems in large-scale modeling and simulation, including:

to make full use of the combined power of the computing resources available in a network by involving them in a simulation that is too slow on, or too large to fit into, a single affordable machine;
to simulate a real application that itself relies on high-speed distributed computation;
to build distributed collaborative shared data spaces;
to model the behavior of a P2P (peer-to-peer) or CS (client-server) network, to analyze its performance with a simulation that faithfully reproduces its real behavior, and finally to optimize and re-engineer it;
and to enable cooperation in the modeling process among participants at different physical locations.

Ptolemy II is a modeling, design, and simulation framework suitable for addressing such requirements. To incorporate high-speed distributed computation into Ptolemy II, we study a number of interesting problems.

One example is the strategy employed to overcome network latency while still maximizing speed. Time Warp [1], which processes events optimistically and rolls back when a message arrives out of timestamp order, is one approach; others are under research.
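
The optimistic idea behind Time Warp can be illustrated with a minimal sketch. The class OptimisticProcess and all of its members are hypothetical names chosen for this illustration; they are not part of Ptolemy II or of the Time Warp Operating System. The sketch assumes a single logical process whose state depends on event order, and it omits anti-messages and the reclamation of old history, both of which a real Time Warp implementation needs.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical logical process; an illustration only, not Ptolemy II or Time Warp OS code. */
class OptimisticProcess {
    /** A processed event together with the state that preceded it. */
    private static final class Saved {
        final long time;
        final int digit;
        final int valueBefore;
        final long clockBefore;

        Saved(long time, int digit, int valueBefore, long clockBefore) {
            this.time = time;
            this.digit = digit;
            this.valueBefore = valueBefore;
            this.clockBefore = clockBefore;
        }
    }

    private final Deque<Saved> history = new ArrayDeque<>();
    private int value = 0;       // order-sensitive state: value = value * 10 + digit
    private long localClock = 0; // timestamp of the latest processed event
    private int rollbacks = 0;

    /** Handles an event at once; a straggler (timestamp below the local clock) forces a rollback. */
    void receive(long time, int digit) {
        List<Saved> redo = new ArrayList<>();
        if (time < localClock) {
            rollbacks++;
            // Undo processed events with later timestamps, newest first.
            while (!history.isEmpty() && history.peek().time > time) {
                Saved s = history.pop();
                value = s.valueBefore;
                localClock = s.clockBefore;
                redo.add(0, s); // keep the undone events in timestamp order
            }
        }
        process(time, digit);
        for (Saved s : redo) {
            process(s.time, s.digit);
        }
    }

    /** Saves the current state, then applies the event. */
    private void process(long time, int digit) {
        history.push(new Saved(time, digit, value, localClock));
        value = value * 10 + digit;
        localClock = time;
    }

    int value() { return value; }
    int rollbacks() { return rollbacks; }

    public static void main(String[] args) {
        OptimisticProcess p = new OptimisticProcess();
        p.receive(1, 1);
        p.receive(3, 3);
        p.receive(2, 2); // straggler: undoes the event at time 3, then replays it
        System.out.println(p.value() + " after " + p.rollbacks() + " rollback(s)");
    }
}
```

The process never waits for slow senders, which is where the speed comes from. The price is the saved history and the occasional repeated work.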

Some other problems are more tractable, and partially satisfactory solutions have been found. One example is the backtracking requirement for the above-mentioned strategy. I have prototyped a mechanism that uses aspect-oriented programming to transparently insert rollback code into Ptolemy II models with relatively small overhead. I am studying refinements, improvements, and semantic implications.
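
To make the idea of inserted rollback code concrete, the following sketch shows, written out by hand, roughly what instrumented code could look like. It is an assumption for illustration and not the prototype itself: the class BacktrackableAccumulator, the setter assignTotal, and the checkpoint handles are all hypothetical names. The sketch assumes that every assignment to a field is routed through a setter that first records the old value.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical component whose field writes are recorded so that they can be undone. */
class BacktrackableAccumulator {
    /** One recorded write: the checkpoint it happened under and the value it overwrote. */
    private static final class FieldRecord {
        final long checkpoint;
        final int oldValue;

        FieldRecord(long checkpoint, int oldValue) {
            this.checkpoint = checkpoint;
            this.oldValue = oldValue;
        }
    }

    private final Deque<FieldRecord> totalHistory = new ArrayDeque<>();
    private long currentCheckpoint = 0;
    private int total = 0;

    /** User code would read "total += x"; here the assignment goes through the setter. */
    void fire(int x) {
        assignTotal(total + x);
    }

    /** The kind of setter a weaving tool could generate in place of a raw assignment. */
    private void assignTotal(int newValue) {
        totalHistory.push(new FieldRecord(currentCheckpoint, total));
        total = newValue;
    }

    /** Starts a new checkpoint and returns its handle. */
    long checkpoint() {
        return ++currentCheckpoint;
    }

    /** Restores the state as it was at the moment the given handle was created. */
    void rollback(long handle) {
        while (!totalHistory.isEmpty() && totalHistory.peek().checkpoint >= handle) {
            total = totalHistory.pop().oldValue;
        }
        currentCheckpoint = handle - 1;
    }

    int total() { return total; }

    public static void main(String[] args) {
        BacktrackableAccumulator a = new BacktrackableAccumulator();
        a.fire(5);
        long handle = a.checkpoint();
        a.fire(10);
        a.fire(20);
        a.rollback(handle); // undoes both writes made after the checkpoint
        System.out.println(a.total());
    }
}
```

Recording only the overwritten values keeps the cost proportional to the number of writes rather than to the size of the whole state, which is one plausible way to keep the overhead small.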

In this project, I will take a formal approach to describing the global semantics of the highly autonomous but still mutually dependent components in a distributed system. This work will probably give rise to a formalism which, like PN (process networks), regards components as processes and abstracts the physical connections between them as data channels with FIFO queues. However, unlike in PN or DE (discrete-event) systems, issues inherent to distributed systems cannot be abstracted away. In this setting, components are highly flexible, and their activities are not restricted by blocking reads. Connections between them might be established, destroyed, or re-established dynamically. Components might transparently migrate from one physical location to another. Latency is observable, and in the worst case, messages might be lost. No single manager is able to arbitrate such global notions as global startup signals and global time. At the cost of increased complexity, this formalism significantly improves flexibility and maps more directly to implementation.
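
A small sketch may help to picture the kind of channel described above: a FIFO queue whose reads never block, whose latency is visible to the receiver, and which may lose messages. The class LossyChannel and its methods are hypothetical names and not part of any existing formalism or of Ptolemy II. The sketch assumes a channel-local tick counter instead of a global clock, and it takes the loss decision as an explicit argument so that the behavior is deterministic.

```java
import java.util.ArrayDeque;
import java.util.Optional;
import java.util.Queue;

/** Hypothetical FIFO channel with observable latency, possible loss, and non-blocking reads. */
class LossyChannel<T> {
    /** A message in transit and the channel tick at which it becomes readable. */
    private static final class InFlight<M> {
        final M payload;
        final long deliverAt;

        InFlight(M payload, long deliverAt) {
            this.payload = payload;
            this.deliverAt = deliverAt;
        }
    }

    private final Queue<InFlight<T>> inFlight = new ArrayDeque<>();
    private final long latency;
    private long ticks = 0; // channel-local counter; no global time is assumed

    LossyChannel(long latency) {
        this.latency = latency;
    }

    /** Advances the channel by one tick. */
    void step() {
        ticks++;
    }

    /** Sends a non-null message; when lost is true the message silently disappears. */
    void send(T payload, boolean lost) {
        if (!lost) {
            inFlight.add(new InFlight<>(payload, ticks + latency));
        }
    }

    /** Non-blocking read: returns empty instead of waiting when nothing has arrived yet. */
    Optional<T> poll() {
        InFlight<T> head = inFlight.peek();
        if (head == null || head.deliverAt > ticks) {
            return Optional.empty();
        }
        return Optional.of(inFlight.remove().payload);
    }

    public static void main(String[] args) {
        LossyChannel<String> channel = new LossyChannel<>(2);
        channel.send("a", false);
        channel.send("b", true); // lost in transit
        channel.send("c", false);
        System.out.println(channel.poll().isPresent()); // nothing has arrived yet
        channel.step();
        channel.step();
        System.out.println(channel.poll().orElse("(nothing)"));
        System.out.println(channel.poll().orElse("(nothing)"));
    }
}
```

Because poll returns immediately, a receiving component stays free to do other work, in contrast to the blocking reads of PN.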

This project also aims to provide high availability and recoverable components in a distributed system. Checkpoints for the components can be created independently and stored either on backup servers or on hard disks. When the primary component fails, one of its backups may be used instead, until the primary restarts and catches up.
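
The failover scheme can be sketched as follows. All names here (ReplicatedCounter, Replica, Checkpoint) are hypothetical. The sketch also assumes a log of inputs that is replayed so that a restored replica can catch up; the text above mentions only checkpoints, so the log is an added assumption, and everything runs in one process for simplicity.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical primary/backup pair that recovers from a checkpoint plus an input log. */
class ReplicatedCounter {
    /** Immutable checkpoint: the state and the number of inputs already absorbed. */
    static final class Checkpoint {
        final int state;
        final int inputsSeen;

        Checkpoint(int state, int inputsSeen) {
            this.state = state;
            this.inputsSeen = inputsSeen;
        }
    }

    /** One copy of the component. */
    static final class Replica {
        int state = 0;
        int inputsSeen = 0;

        void apply(int input) { state += input; inputsSeen++; }
        Checkpoint save() { return new Checkpoint(state, inputsSeen); }
        void restore(Checkpoint c) { state = c.state; inputsSeen = c.inputsSeen; }
    }

    private final List<Integer> inputLog = new ArrayList<>(); // assumed to survive failures
    private final Replica primary = new Replica();
    private final Replica backup = new Replica();
    private Checkpoint lastCheckpoint = new Checkpoint(0, 0);
    private boolean primaryUp = true;

    private Replica active() { return primaryUp ? primary : backup; }

    /** Logs the input and applies it to whichever replica is currently active. */
    void submit(int input) {
        inputLog.add(input);
        active().apply(input);
    }

    /** Stores a checkpoint of the active replica. */
    void checkpoint() { lastCheckpoint = active().save(); }

    /** Simulates a crash of the primary; the backup takes over from the last checkpoint. */
    void failPrimary() {
        primary.restore(new Checkpoint(0, 0)); // the primary's in-memory state is gone
        primaryUp = false;
        backup.restore(lastCheckpoint);
        replay(backup);
    }

    /** Restarts the primary, lets it catch up from the log, and switches back to it. */
    void restartPrimary() {
        primary.restore(lastCheckpoint);
        replay(primary);
        primaryUp = true;
    }

    /** Applies every logged input that the replica has not yet absorbed. */
    private void replay(Replica r) {
        while (r.inputsSeen < inputLog.size()) {
            r.apply(inputLog.get(r.inputsSeen));
        }
    }

    int value() { return active().state; }
    boolean primaryActive() { return primaryUp; }

    public static void main(String[] args) {
        ReplicatedCounter c = new ReplicatedCounter();
        c.submit(1);
        c.submit(2);
        c.checkpoint();
        c.submit(4);
        c.failPrimary();    // backup restores the checkpoint and replays the last input
        c.submit(5);
        c.restartPrimary(); // primary catches up and becomes active again
        System.out.println(c.value());
    }
}
```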

References:

[1] D. Jefferson, B. Beckman, et al., Distributed Simulation and the Time Warp Operating System, UCLA Computer Science Department: 870042, 1987.

Last updated 09/29/06