A component model defines the kind of information that can be known about
a component, and how this information is structured.
The CAL component model is simple---actors are our components, and
we use CAL to represent actors. CAL representations are composed with each
other, resulting in a CAL representation of the composite (cf. here),
and CAL representations are used to generate target code. In practice, however,
we do not operate on CAL text directly; we use an XML-based source form
representation (format) called CalML.
In this discussion, the term source form applies to any representation
that preserves all semantically relevant information contained in the source
code. For example, it has to preserve the actions, the guards, input patterns,
etc., but it may remove comments and formatting, it may rename internal variables,
and so on, because these transformations are semantically neutral. It may,
of course, also store the actor in a 'non-textual' (binary) format, or in
some 'weird' textual format, such as XML (e.g. CalML).
Under this definition, Java byte-code is (almost) a source form of Java programs,
because it essentially represents the complete semantically relevant information
contained in a Java source file, as evidenced by the fact that it can relatively
easily be decompiled.
We currently support two source-form formats: CAL source code
and CalML, an XML dialect.
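To make the notion of source form concrete, here is a minimal sketch in Python. It is an illustration only: the Actor and Action classes and the XML element and attribute names are invented for this example and are not the real CalML schema. The point it tries to show is that a representation counts as source form if the semantically relevant parts (ports, input patterns, guards, output expressions) survive a round trip, whatever the concrete syntax looks like.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Action:
    inputs: dict      # port name -> list of pattern variable names
    outputs: dict     # port name -> list of output expressions (as text)
    guard: str = ""   # guard expression as text; "" means no guard


@dataclass
class Actor:
    name: str
    in_ports: list
    out_ports: list
    actions: list = field(default_factory=list)


def to_xml(actor):
    """Serialize to a made-up XML layout (hypothetical, NOT the CalML schema)."""
    root = ET.Element("actor", name=actor.name)
    for p in actor.in_ports:
        ET.SubElement(root, "port", kind="input", name=p)
    for p in actor.out_ports:
        ET.SubElement(root, "port", kind="output", name=p)
    for a in actor.actions:
        node = ET.SubElement(root, "action", guard=a.guard)
        for port, names in a.inputs.items():
            ET.SubElement(node, "input", port=port, pattern=",".join(names))
        for port, exprs in a.outputs.items():
            ET.SubElement(node, "output", port=port, exprs=";".join(exprs))
    return ET.tostring(root, encoding="unicode")


def from_xml(text):
    """Rebuild the actor from the XML text produced by to_xml."""
    root = ET.fromstring(text)
    ports = root.findall("port")
    actor = Actor(
        name=root.get("name"),
        in_ports=[p.get("name") for p in ports if p.get("kind") == "input"],
        out_ports=[p.get("name") for p in ports if p.get("kind") == "output"],
    )
    for node in root.findall("action"):
        actor.actions.append(Action(
            inputs={i.get("port"): i.get("pattern").split(",")
                    for i in node.findall("input")},
            outputs={o.get("port"): o.get("exprs").split(";")
                     for o in node.findall("output")},
            guard=node.get("guard"),
        ))
    return actor


# An adder with one guarded action; comments and layout are simply not stored.
add = Actor("Add", ["A", "B"], ["Out"],
            [Action(inputs={"A": ["a"], "B": ["b"]},
                    outputs={"Out": ["a + b"]},
                    guard="a >= 0")])
round_tripped = from_xml(to_xml(add))
```

In this sketch round_tripped compares equal to add, which is the property the definition above asks for: nothing semantically relevant was lost by moving to the XML form and back.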
Component model FAQ
In the following we will address some of the questions concerning the choice
of source form as a component model.
Why source form?
There are two key reasons for this:
- Using source form, which is basically (i.e. except for its concrete representation)
defined by the language, you do not need to invent yet another, non-source-form
format. In designing a language, you have already in essence defined your
component model. Unless there is a compelling reason for it (and those exist,
see below), why define another one?
- Source form, by definition, contains everything that is semantically relevant
about an actor. Conversely, a non-source-form format most likely omits some
of this information. This may be acceptable, or even desirable in some contexts,
but if you need the information, there is no way to get it back. In other
words, it is easier to throw information away later than to try to recreate
it. This is also why we refer to source form as a low-entropy representation.
The last point is particularly important in our case, because we do not
know how our components are going to be composed. In fact, factoring out
the composition mechanism and allowing users to define their own is one of
the key contributions we are hoping to make. Because we cannot (and do not
want to!) make many assumptions about the composition mechanism, we do
not know which information about an actor will be relevant. We have designed
the CAL actor language so that it is easy to get a lot of structural information
about the actor, either directly, or by simple analyses. At least in our context,
it would seem a paradoxical next step to design a component model that selects
a fixed subset of this information, and throws away all the rest, thereby
making it inaccessible to future composition operators.
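The following Python sketch illustrates the information-loss argument. Everything in it is hypothetical: the Actor and Action classes stand in for a full actor description, and rate_summary stands in for some fixed-subset component model that keeps only token rates. It is not meant to describe any existing model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    consumes: tuple   # (port, token count) pairs
    produces: tuple   # (port, token count) pairs
    guard: str        # guard expression as text; "true" means always enabled
    body: str         # output expression as text


@dataclass(frozen=True)
class Actor:
    name: str
    actions: tuple


def rate_summary(actor):
    """Hypothetical non-source-form model: keep only per-action token rates."""
    return tuple((a.consumes, a.produces) for a in actor.actions)


def has_nontrivial_guard(actor):
    """A property some future composition operator might need."""
    return any(a.guard != "true" for a in actor.actions)


rates_in = (("A", 1), ("B", 1))
rates_out = (("Out", 1),)

add = Actor("Add", (Action(rates_in, rates_out, "true", "a + b"),))
pass_if_less = Actor("PassIfLess", (Action(rates_in, rates_out, "a < b", "a"),))

same_summary = rate_summary(add) == rate_summary(pass_if_less)
guards = (has_nontrivial_guard(add), has_nontrivial_guard(pass_if_less))
```

Here same_summary is True while guards is (False, True): the two actors are indistinguishable in the reduced model, so a composition operator that cares about guards cannot recover that distinction from the summary alone. From the full description, both the summary and the guard information remain available.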
Source form as a component model is not a new thing. In fact, it is probably
the oldest component model---punch cards contained source, so did Lisp files,
shell scripts, etc. In more recent days, object-oriented languages such as
Smalltalk, Eiffel, and Java provide their classes in source form---in the
case of Java, in a binary source form. (Smalltalk's images may also be considered
'binary source form', of course.)
What about protecting IP?
A source form component model essentially suggests making the source code
a distribution format, which raises the issue of protecting intellectual property.
Can source form be a vehicle for commercial IP?
Well, it already is. Every bit of Java software (to mention just the most
well-known case) is essentially provided in source form---using existing techniques
and tools, byte code, the binary distribution format for Java code, can be
decompiled into something that is structurally almost identical to the source
code. Semantically irrelevant details such as local variable names are lost,
but the code structure, the algorithms, are recovered. Therefore, byte code
is very close to source form.
Software vendors deal with this problem in Java by using obfuscation,
i.e. by transforming the code so that it still performs the same function,
but its design intent and intellectual content are very difficult to access
(see e.g. this
article by Greg Travis on Java obfuscation). Similar techniques would
apply to any source form, including that of CAL.
[Note: In truth, Java byte code is somewhat of a limit case of what can usefully
be called 'source form'. There are valid byte-code programs that are difficult
to decompile, and of course the general code structure itself is really different
from the original source. Furthermore, some transformations, such as constant
substitution, cannot be reversed in principle. For the purposes of this discussion,
however, it is close enough to being source form.]
Apart from obfuscation, sensitive code may also be relegated to external
functions, which are not themselves implemented in CAL, but only called from
it. Furthermore, implementors may use CAL to implement systems containing
IP, and then generate code to a non-source-form component model for shipping
and deployment.
What about encapsulation?
It may seem that distributing actors in source form would encourage breaking
encapsulation between actors. In order to discuss this, we need to distinguish
'psychological' encapsulation from 'technical' encapsulation. The former describes
the absence of knowledge that passes from one actor's internal implementation
to another by way of the programmer's head---because the programmer knows
how actor A works internally, he or she may be induced to assume this internal
knowledge when writing an actor B which interacts with A. It seems as though
the best way to prevent this is a proper education of programmers, and a good
development process, rather than obscuring parts of the system from the programmer's
view, lest he or she do something stupid.
By contrast, 'technical' encapsulation describes technical barriers to actors
directly accessing internal features of each other. CAL provides perfect technical
encapsulation, in the sense that actors have no way to even know about each
other, much less directly access each other's state. They do not even have
a way to find out who they are communicating with. This is a consequence of
the CAL language design, and quite independent of the fact that they are delivered
in source form.
In every component model, a component really exhibits two interfaces:
one to other components, and one to the composition mechanism (see e.g. Kiczales,
G., Beyond the Black Box: Open Implementation, IEEE Software, pp. 137-142,
1996). In the CAL model, the interface of an actor toward other actors is
its ports. However, the interface toward the composition mechanism, the model
of computation, is the entire actor description. There is nothing unusual
about this, either. E.g., in the Eiffel programming language, the compiler
serves as the (uniquely defined) composition mechanism. Of course, it can
see the complete definition of every class, and it has to, because, well,
it needs to compile them. Similarly, in Java, the class loader serves as the
byte-code composer. It, too, traverses the entire byte code to check it for
validity, and to link it properly to the rest of the system.
What this means is that, while encapsulation among components is important, and
is the part that engineers are quite rightly concerned about, the composition
mechanism usually has a much more detailed knowledge of a component. In the
extreme case of source-form models, it has knowledge of every semantically
relevant property of a component.
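The two interfaces can be sketched in a few lines of Python. This is a toy under stated assumptions, not the CAL runtime: the Actor and Composer classes, the consumes field, and the scheduling loop are all invented for illustration. Actors only ever see tokens on their own ports, while the composer reads each actor's description (here, its token requirements) to decide when to fire it.

```python
from collections import deque


class Actor:
    """Sees only its own ports; never holds a reference to another actor."""

    def __init__(self, name, consumes, fire):
        self.name = name
        self.consumes = consumes  # input port -> tokens needed per firing
        self.fire = fire          # dict of input token lists -> dict of output token lists


class Composer:
    """Toy composition mechanism: it inspects every actor's description."""

    def __init__(self):
        self.actors = []
        self.queues = {}  # (actor name, input port) -> deque of tokens
        self.wires = {}   # (actor name, output port) -> (actor name, input port)

    def add(self, actor):
        self.actors.append(actor)
        for port in actor.consumes:
            self.queues[(actor.name, port)] = deque()

    def connect(self, src, out_port, dst, in_port):
        self.wires[(src.name, out_port)] = (dst.name, in_port)

    def push(self, actor, port, tokens):
        self.queues[(actor.name, port)].extend(tokens)

    def can_fire(self, actor):
        return all(len(self.queues[(actor.name, p)]) >= n
                   for p, n in actor.consumes.items())

    def run(self):
        """Fire actors until none is enabled.

        Assumes every actor consumes at least one token per firing.
        """
        outputs = {}  # tokens that arrive on unconnected output ports
        fired = True
        while fired:
            fired = False
            for actor in self.actors:
                if not self.can_fire(actor):
                    continue
                inputs = {p: [self.queues[(actor.name, p)].popleft()
                              for _ in range(n)]
                          for p, n in actor.consumes.items()}
                for port, tokens in actor.fire(inputs).items():
                    target = self.wires.get((actor.name, port))
                    if target is None:
                        outputs.setdefault((actor.name, port), []).extend(tokens)
                    else:
                        self.queues[target].extend(tokens)
                fired = True
        return outputs


double = Actor("Double", {"In": 1}, lambda i: {"Out": [2 * i["In"][0]]})
pair_sum = Actor("PairSum", {"In": 2}, lambda i: {"Out": [sum(i["In"])]})

composer = Composer()
composer.add(double)
composer.add(pair_sum)
composer.connect(double, "Out", pair_sum, "In")
composer.push(double, "In", [1, 2, 3, 4])
result = composer.run()
```

Neither lambda mentions the other actor or the wiring; only the composer knows that Double feeds PairSum and how many tokens each firing needs. With the inputs above, result maps ("PairSum", "Out") to [6, 14].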
What about binary component models?
They are great. Storing components in binary form (rather than textually)
often makes them more compact, and simplifies their processing (reading a
textual format often requires a lexer and a parser, even though these can
be relatively simple, as in the case of XML). We considered binary source-form
component models for CAL, but eventually gravitated toward XML, because it
provided a mature platform for transformation and manipulation.
Okay, what about non-source-form component models then?
They are great, too. However, for the reasons discussed above, there does
not seem to be a good case for them in the context of a system that tries
to encourage the construction of new actor composition mechanisms.
Basically, whenever the composition mechanism is known and (relatively) fixed,
a non-source-form model is often the superior choice. The reason is that the
composition mechanism usually defines the kind of information that needs to
be known about a component to be able to compose it---i.e., it defines a component
model. For instance, if pieces of software are composed by a 'traditional'
linker, that linker defines precisely what it needs to know about a function
to do its job. Any other information about it, or any other format than machine
code with symbolic labels, would either be useless fluff, or even complicate
the job of a linker.
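As a rough illustration of the linker case, the Python sketch below shows how little a fixed composition mechanism may need to know. The object layout (a dict with 'defines' and 'refs') and the link function are hypothetical simplifications; real object file formats and linkers also handle relocation, sections, and much more.

```python
def link(objects):
    """Lay out code and check that every referenced symbol is defined.

    Each object is a dict with two hypothetical fields:
      'defines': symbol name -> bytes of machine code
      'refs':    set of symbol names the object refers to
    """
    addresses = {}
    image = bytearray()
    for obj in objects:
        for symbol, code in obj["defines"].items():
            if symbol in addresses:
                raise ValueError(f"duplicate symbol: {symbol}")
            addresses[symbol] = len(image)  # symbol starts at current offset
            image += code
    for obj in objects:
        unresolved = set(obj["refs"]) - set(addresses)
        if unresolved:
            raise ValueError(f"unresolved symbols: {sorted(unresolved)}")
    return bytes(image), addresses


main_obj = {"defines": {"main": b"\x01\x02\x03"}, "refs": {"helper"}}
lib_obj = {"defines": {"helper": b"\x04\x05"}, "refs": set()}
image, addresses = link([main_obj, lib_obj])
```

The toy linker never looks inside the code bytes. Symbol names and sizes are its whole component model, which is why richer information about a function would be of no use to it.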
In summary, it is important to choose the component model that is right for
the job. In the case of CAL, we believe source form is the appropriate choice.
Isn't it inefficient to process source form all the time?
That depends. Source form does not mean source code. One can devise source
form formats that are significantly pre-cooked, maybe contain additional information
extracted from the source code, and use compact non-textual storage formats.
Much of the hard parts in processing the actual source code, such as parsing,
static analysis, etc., can be done by a front-end, and the results stored
in the source form format.
On the other hand, when the additional information is not needed,
as in the linker example above, carrying it along may very well make composition
or other operations on components less efficient.
Can I use non-source-form component models?
Of course! In fact, CAL is intended to be compilable to a large variety of
different component models. We are experimenting with generating code for
various models based on either Java or C/C++, and we are also planning to
look at generating hardware descriptions, e.g. in VHDL. Naturally, not every
CAL feature is directly represented in these models, but it also does not
need to be. By the time code generation is started, CAL composition has been
performed, and CAL-specific features are no longer needed (see the overviews
of code generation and program
transformation).
Am I tied in to using CAL only?
No. CAL allows easy calls to external functions and procedures, so any legacy
code, or any code for which CAL simply is not a good language, can be written
in another language, and called from CAL.
Jörn W. Janneck