Component model

A component model defines the kind of information that can be known about a component, and how this information is structured.

The CAL component model is simple---actors are our components, and we use CAL to represent actors. CAL representations are composed with each other, resulting in a CAL representation of the composite, and CAL representations are used to generate target code. However, we do not operate on CAL text directly; instead, we use an XML-based source form representation (format) called CalML.

In this discussion, the term source form applies to any representation that preserves all semantically relevant information contained in the source code. For example, it has to preserve the actions, the guards, input patterns, etc., but it may remove comments and formatting, it may rename internal variables, and so on, because these transformations are semantically neutral. It may, of course, also store the actor in a 'non-textual' (binary) format, or in some 'weird' textual format, such as XML (e.g. CalML). Under this definition, Java byte-code is (almost) a source form of Java programs, because it essentially represents the complete semantically relevant information contained in a Java source file, as evidenced by the fact that it can relatively easily be decompiled.
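The idea that formatting and comments are semantically neutral can be illustrated with a small sketch. It uses Python and its standard ast module as a stand-in for CAL and CalML, purely as an analogy; the function name source_form and the sample programs are hypothetical and not part of the CAL tools. Renaming internal variables, which is also semantically neutral, is not handled by this simple comparison.

```python
import ast

# Two texts that differ only in comments and layout.
ORIGINAL = '''
def scale(x, factor):
    # multiply the input by a constant factor
    return x * factor
'''

REFORMATTED = '''
def scale(x,
          factor):
    return (x * factor)
'''

# A text that differs in meaning.
CHANGED = '''
def scale(x, factor):
    return x + factor
'''


def source_form(text: str) -> str:
    """Return a canonical dump of the syntax tree (hypothetical helper).

    ast.dump omits line and column attributes by default, so comments,
    whitespace and redundant parentheses do not show up in the result.
    """
    return ast.dump(ast.parse(text))


print(source_form(ORIGINAL) == source_form(REFORMATTED))
print(source_form(ORIGINAL) == source_form(CHANGED))
```

The first comparison holds because only semantically irrelevant details differ; the second fails because the operator changed.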

We currently support two source form formats: CAL source code and CalML, an XML dialect.

Component model FAQ

In the following we will address some of the questions concerning the choice of source form as a component model.

Why source form?

What about protecting IP?

What about encapsulation?

What about binary component models?

Okay, what about non-source-form component models then?

Isn't it inefficient to process source form all the time?

Can I use non-source-form component models?

Am I tied to using CAL only?

Why source form?

There are two key reasons for this:

Using source form, which is basically (i.e. except for its concrete representation) defined by the language, you do not need to invent yet another, non-source-form format. In designing a language, you have already in essence defined your component model. Unless there is a compelling reason for it (and those exist, see below), why define another one?

Source form, by definition, contains everything that is semantically relevant about an actor. By contrast, a non-source-form format most likely omits some of this information. This may be acceptable, or even desirable in some contexts, but if you later need the omitted information, there is no way to get it back. In other words, it is easier to throw information away later than to try to recreate it. This is also why we refer to source form as a low-entropy representation.

The last point is particularly important in our case, because we do not know how our components are going to be composed. In fact, factoring out the composition mechanism and allowing users to define their own is one of the key contributions we are hoping to make. Because we cannot (and do not want to!) make many assumptions about the composition mechanism, we do not know which information about an actor will be relevant. We have designed the CAL actor language so that it is easy to get a lot of structural information about the actor, either directly, or by simple analyses. At least in our context, it would seem a paradoxical next step to design a component model that selects a fixed subset of this information, and throws away all the rest, thereby making it inaccessible to future composition operators.
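The argument can be made concrete with a small sketch. Everything in it is an assumption made for illustration: the Actor and Action records, the port_view projection, and the has_static_rates analysis are hypothetical and do not reflect the actual CAL or CalML data structures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Action:
    consumes: dict                 # input port name -> tokens read per firing
    produces: dict                 # output port name -> tokens written per firing
    guard: Optional[str] = None    # guard expression as text, None if unguarded


@dataclass
class Actor:
    name: str
    inputs: list
    outputs: list
    actions: list


def port_view(actor: Actor) -> dict:
    """Lossy projection: keep only the name and the ports (hypothetical)."""
    return {"name": actor.name,
            "inputs": list(actor.inputs),
            "outputs": list(actor.outputs)}


def has_static_rates(actor: Actor) -> bool:
    """Analysis a composition operator might want to run (hypothetical).

    True if no action is guarded and all actions share the same token rates.
    """
    if not actor.actions:
        return True
    first = actor.actions[0]
    return all(a.guard is None
               and a.consumes == first.consumes
               and a.produces == first.produces
               for a in actor.actions)


add = Actor("Add", ["A", "B"], ["Out"],
            [Action({"A": 1, "B": 1}, {"Out": 1})])
select = Actor("Select", ["S", "A", "B"], ["Out"],
               [Action({"S": 1, "A": 1}, {"Out": 1}, guard="sel"),
                Action({"S": 1, "B": 1}, {"Out": 1}, guard="not sel")])

print(has_static_rates(add), has_static_rates(select))
print(port_view(select))
```

The analysis needs the actions and guards, which the port-only view has discarded. A composition operator that receives only port_view output cannot tell the two actors apart in this respect, whereas one that receives the full description can still compute the port view whenever it wants it.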

Source form as a component model is not a new thing. In fact, it is probably the oldest component model---punch cards contained source, as did Lisp files, shell scripts, etc. More recently, object-oriented languages such as Smalltalk, Eiffel, and Java have provided their classes in source form---in the case of Java, in a binary source form. (Smalltalk's images may also be considered 'binary source form', of course.)

What about protecting IP?

A source form component model essentially suggests making the source code a distribution format, which raises the issue of protecting intellectual property. Can source form be a vehicle for commercial IP?

Well, it already is. Every bit of Java software (to mention just the most well-known case) is essentially provided in source form---using existing techniques and tools, byte code, the binary distribution format for Java code, can be decompiled into something that is structurally almost identical to the source code. Semantically irrelevant details such as local variable names are lost, but the code structure, the algorithms, are recovered. Therefore, byte code is very close to source form.

Software vendors deal with this problem in Java by using obfuscation, i.e. by transforming the code so that it still performs the same function, but its design intent and intellectual content are very difficult to access (see e.g. the article by Greg Travis on Java obfuscation). Similar techniques would apply to any source form, including that of CAL.

[Note: In truth, Java byte code is somewhat of a limit case of what can usefully be called 'source form'. There are valid byte-code programs that are difficult to decompile, and of course the general code structure itself is really different from the original source. Furthermore, some transformations, such as constant substitution, cannot be reversed in principle. For the purposes of this discussion, however, it is close enough to being source form.]

Apart from obfuscation, sensitive code may also be relegated to external functions, which are not themselves implemented in CAL, but only called from it. Furthermore, implementors may use CAL to implement systems containing IP, and then generate code to a non-source-form component model for shipping and deployment.

What about encapsulation?

It may seem that distributing actors in source form would encourage breaking encapsulation between actors. In order to discuss this, we need to distinguish 'psychological' encapsulation from 'technical' encapsulation. The former describes the absence of knowledge that passes from one actor's internal implementation to another by way of the programmer's head---because the programmer knows how actor A works internally, he or she may be induced to assume this internal knowledge when writing an actor B which interacts with A. It seems as though the best way to prevent this is proper education of programmers and a good development process, rather than obscuring parts of the system from the programmer's view, lest he or she do something stupid.

By contrast, 'technical' encapsulation describes technical barriers to actors directly accessing internal features of each other. CAL provides perfect technical encapsulation, in the sense that actors have no way to even know about each other, much less directly access each other's state. They do not even have a way to find out who they are communicating with. This is a consequence of the CAL language design, and quite independent of the fact that they are delivered in source form.

In every component model, a component really exhibits two interfaces: one to other components, and one to the composition mechanism (see e.g. Kiczales, G., Beyond the Black Box: Open Implementation, IEEE Software, pp. 137-142, 1996). In the CAL model, the interface of an actor toward other actors is its ports. However, the interface toward the composition mechanism, the model of computation, is the entire actor description. There is nothing unusual about this, either. E.g., in the Eiffel programming language, the compiler serves as the (uniquely defined) composition mechanism. Of course, it can see the complete definition of every class, and it has to, because it needs to compile them. Similarly, in Java, the class loader serves as the byte-code composer. It, too, traverses the entire byte code to check it for validity, and to link it properly to the rest of the system.

What this means is that, while encapsulation among components is important, and is the part that engineers are quite rightly concerned about, the composition mechanism usually has much more detailed knowledge of a component. In the extreme case of source-form models, it has knowledge of every semantically relevant property of a component.

What about binary component models?

They are great. Storing components in binary form (rather than textually) often makes them more compact, and simplifies their processing (reading a textual format often requires a lexer and a parser, even though these can be relatively simple as in the case of XML). We had considered binary source-form component models for CAL, but eventually gravitated toward XML, because it provided a mature platform for transformation and manipulation.

Okay, what about non-source-form component models then?

They are great, too. However, for the reasons discussed above, there does not seem to be a good case for them in the context of a system that tries to encourage the construction of new actor composition mechanisms.

Basically, whenever the composition mechanism is known and (relatively) fixed, a non-source-form model is often the superior choice. The reason is that the composition mechanism usually defines the kind of information that needs to be known about a component to be able to compose it---i.e., it defines a component model. For instance, if pieces of software are composed by a 'traditional' linker, that linker defines precisely what it needs to know about a function to do its job. Any other information about it, or any other format than machine code with symbolic labels, would either be useless fluff, or even complicate the job of a linker.

In summary, it is important to choose the component model that is right for the job. In the case of CAL, we believe source form is the appropriate choice.

Isn't it inefficient to process source form all the time?

That depends. Source form does not mean source code. One can devise source form formats that are significantly pre-cooked, that may contain additional information extracted from the source code, and that use compact non-textual storage formats. Many of the hard parts of processing the actual source code, such as parsing, static analysis, etc., can be done by a front-end, and the results stored in the source form format.

On the other hand, when all the additional information is not necessary, as in the linker example above, carrying it along in source form may very well make composition or other operations on components less efficient.
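The front-end idea described above can be sketched as follows, again using Python's standard library as a stand-in for a CAL front-end. The record layout, the function names front_end and load, and the choice of JSON plus zlib as a compact storage format are all assumptions made for illustration.

```python
import ast
import json
import zlib

SOURCE = '''
def area(w, h):
    return w * h

def perimeter(w, h):
    return 2 * (w + h)
'''


def front_end(source: str) -> bytes:
    """Parse once and store the tree plus extracted facts (hypothetical format)."""
    tree = ast.parse(source)
    record = {
        # Canonical dump of the syntax tree: the 'source form' part.
        "tree": ast.dump(tree),
        # Pre-computed information a later tool would otherwise re-derive.
        "functions": sorted(node.name for node in ast.walk(tree)
                            if isinstance(node, ast.FunctionDef)),
    }
    return zlib.compress(json.dumps(record).encode("utf-8"))


def load(blob: bytes) -> dict:
    """Read the stored record back without touching the original text."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


record = load(front_end(SOURCE))
print(record["functions"])
```

A downstream tool that only needs the list of functions reads it straight from the record; one that needs more can still work from the stored tree, so nothing semantically relevant has been given up.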

Can I use non-source-form component models?

Of course! In fact, CAL is intended to be compilable to a large variety of different component models. We are experimenting with generating code for various models based on either Java or C/C++, and we are also planning to look at generating hardware descriptions, e.g. in VHDL. Naturally, not every CAL feature is directly represented in these models, but it also does not need to be. By the time code generation is started, CAL composition has been performed, and CAL-specific features are no longer needed (see the overviews of code generation and program transformation).

Am I tied to using CAL only?

No. CAL allows easy calls to external functions and procedures, so any legacy code, or any code for which CAL simply is not a good language, can be written in another language, and called from CAL.

Jörn W. Janneck
