
11.3 Tokenizer, a simple lexical analyzer class


The Tokenizer class is designed to accept input from a string or file and break it up into tokens. It is similar to the standard istream class in this regard, but it has some additional facilities. It permits character classes to be defined, specifying that certain characters are white space and that others are "special" and should be returned as single-character tokens; it permits quoted strings to override this; and it has a file inclusion facility. In short, it is a simple, reconfigurable lexical analyzer. Tokenizer has a public const data member named defWhite that contains the default white space characters: space, newline, and tab. A different definition of white space can be supplied when a particular Tokenizer is constructed.
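
As a quick sketch of typical use, the following program tokenizes a string; the header name Tokenizer.h and the old-style iostream header are assumptions:

#include "Tokenizer.h"      // assumed header for the class
#include <iostream.h>

int main() {
    // Treat '(' and ')' as special characters; each comes back
    // as a single-character token.
    Tokenizer tin("(add x 1)", "()");
    char token[256];        // caller-supplied buffer; see section 11.3.2
    while (tin) {
        tin >> token;
        if (*token) cout << "token: " << token << "\n";
    }
    return 0;
}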

11.3.1 Initializing Tokenizer objects

Tokenizer provides three different constructors:

Tokenizer();
The default constructor creates a Tokenizer that reads from the standard input stream, cin. Its special characters are simply ( and ).

Tokenizer(istream& input, const char* spec,
          const char* w = defWhite);
This constructor creates a Tokenizer that reads from the stream named by input. The other arguments specify the special characters and the white space characters.

Tokenizer(const char* buffer, const char* spec,
          const char* w = defWhite);
This constructor creates a Tokenizer that reads from the null-terminated string in buffer.

Tokenizer's destructor closes any include files associated with the Tokenizer and deletes the associated internal storage.
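
For illustration, the three constructor forms might be used as follows; the argument values here are arbitrary:

Tokenizer t1;                          // reads cin; '(' and ')' are special
Tokenizer t2(cin, "(){}");             // reads cin; default white space
Tokenizer t3("a+b*c", "+*", " \t\n");  // reads a string; explicit white space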

The following operations change the definition of white space and of special characters, respectively:

const char* setWhite(const char* w);
const char* setSpecial(const char* s);
In each case, the old value is returned.

By default, the line comment character for Tokenizer is #. It can be changed by

char setCommentChar(char n); 
Use an argument of 0 to disable the feature. The old comment character is returned.
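
For example, assuming tin is an existing Tokenizer, the reconfiguration calls might be used as follows:

const char* oldWhite = tin.setWhite(" \t\n,");  // commas now act as white space
const char* oldSpec = tin.setSpecial("()[]");   // brackets are now special
char oldComment = tin.setCommentChar(';');      // ';' now starts a line comment
tin.setCommentChar(0);                          // disable line comments entirely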

11.3.2 Reading from Tokenizers

The next operation is the basic mechanism for reading tokens from the Tokenizer:

Tokenizer& operator >> (char* pBuffer); 
Here pBuffer points to a character buffer that receives the token. There is a design flaw: there is no way to specify a maximum buffer length, so buffer overflow is a risk.
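
In practice, the caller must therefore supply a buffer large enough for the longest token expected. A minimal sketch:

char token[1024];                  // generously sized; no bound is enforced
Tokenizer tin("foo (bar)", "()");
tin >> token;                      // reads "foo"
tin >> token;                      // reads "(" as a one-character token
tin >> token;                      // reads "bar"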

By analogy with streams, the following operation is provided:

operator void*();
It returns null if EOF has already been reached and non-null otherwise. This permits loops like

Tokenizer tin;
while (tin) { ... do stuff ... }

int eof() const;
Returns true if the end of file or end of input has been reached on the Tokenizer. It is possible that nothing is left in the input but white space, so in many situations skipwhite should be called before making this test.

void skipwhite();
Skips white space in the input.
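
For example, a loop that tests reliably for end of input might look like this, assuming tin and token as above:

while (1) {
    tin.skipwhite();        // consume any trailing white space first
    if (tin.eof()) break;   // eof() is now a trustworthy test
    tin >> token;
    // ... process the token ...
}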

void flush();
If reading from an include file, the file is closed. If at the top level, the rest of the current line is discarded.

11.3.3 Tokenizer include files

Tokenizer can use include files, and can nest them to any depth. It maintains a stack of include files; as EOF is reached in each file, that file is closed and popped off the stack. The method

int fromFile(const char* name);
opens a new file, and the Tokenizer then reads from it. When that file is exhausted, the Tokenizer resumes reading the previous input at the point where it left off.
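
For instance, an application-defined include directive might be handled as follows; the directive name and the assumption that fromFile returns nonzero on success are illustrative only (and <string.h> is assumed to be included):

tin >> token;
if (strcmp(token, "include") == 0) {    // hypothetical directive
    tin >> token;                       // next token names the file
    if (!tin.fromFile(token))           // assumed: nonzero means success
        cerr << "cannot open include file " << token << "\n";
}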

const char* current_file() const;
int current_line() const;
These methods report the file name and line number at which the Tokenizer is currently reading. This information is maintained for include files. At the top level, current_file returns a null pointer, but current_line returns one more than the number of line feeds seen so far.

int readingFromFile() const;
Returns true (1) if the Tokenizer is reading from an include file, false (0) if not.
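
These methods are convenient for diagnostics. For example:

if (tin.readingFromFile())
    cerr << tin.current_file() << ", line " << tin.current_line()
         << ": unexpected token\n";
else
    cerr << "line " << tin.current_line() << ": unexpected token\n";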




Copyright © 1990-1997, University of California. All rights reserved.