MATLAB Examples

Contents

Growing a Compiler, Chapter 6

format compact  % Help MATLAB save screen space

6.1 Plenty phrase names

The limitation of 52 phrase names and the ursurpation 1/2 of them for Kleene*, and 8 of them for built-in character classes, is the last of the tasks set in Chapter 1. It would be nice to play the bootstrapping card once again, but there are difficulties. First, whatever bootstrapping grammars are devised, the target grammar will still have only 52 possible phrase names.

There are two obvious choices.

  • allow multi-character phrase names (e.g. abc)
  • introduce BNF-like brackets for phrase names (e.g. "<name>")

Allowing multi-character phrase names is notationally attractive. This choice however introduces a backward incompatibility since, at the moment, xyz means phrase x, then y, then z. This is not a fatal flaw -- extra features could have been built-in to the original iog0.h. The sweet spot would be builtin: checking, deblanking, character classes, multi-character input/output, multi-character phrase names, and interpretation of xyz*. The whole project could proceed from that point forward, and perhaps it should. However desirable the result, it would not be playing the game of bootstrapping.

Instead, a new kind of phrase name, in addition to the single letters, is introduced. Here are some examples

 <g>
 <rule>
 <This_is_a_phrase_name>
 <x123>
 <x*>

The first four examples allow identifier-like names between the BNF delimiters. The last example is introduced to support the removal of Kleene *. Since

 <x*>

will be generated from

 <x>*

the user is advised to not use the former without care.

The consequence of this extension is to allow an unlimited number of new phrase names, 52 of which can be used with Kleene *. All other constructs work as before, including x*.

The new capability is implemented in iog6.h and iog6.c.

G=gem6();    % instantiate the object (final implementation)
GEM=G.run;

6.2 Self-describing grammar

The extended GEM language can be described (not including whitespace) in its own terms.

fprintf('%s', G.self0);
<g> = <r>*;
<r> = <n> '=' <s>* ';';
<n> = l;
<n> = '<' l n* '>';
<n> = '<' l '*>';
n   = l;
n   = d;
n   = '_';
<s> = i i i;
<s> = i <i>;
<s> = o o o;
<s> = o <o>;
<s> = l '*';
<s> = l '+';
<s> = '<' l '>' '*';
<s> = '<' l '>' '+';
<s> = <n>;
i   = ''';
<i> = i;
<i> = a <i>;
o   = '"';
<o> = o;
<o> = a <o>;

One of the rules is worth a side comment. It would be visually pleasing to have rules

 <s> = i <c>+ i;
 <c> = a;

where a is the "all ASCII" class. Unfortunately this use introduces an error since the construct

 <c>+

would never terminate. Builtin character class a accepts everything including the terminating quote.

If desired, the longer phrase names can be used to improve readability.

fprintf('%s', G.self1);
<self>    = <rules>;
<rules>   = <rule> <rules>;
<rule>    = ;
<rule>    = <name> '=' <symbols> ';';
<name>    = l;
<name>    = '<' l n* '>';
<name>    = '<' l '*>';
n         = l;
n         = d;
n         = '_';
<symbols> = <symbols> <symbol>;
<symbols> = ;
<symbol>  = i i i;
<symbol>  = i <any_in>;
<symbol>  = o o o;
<symbol>  = o <any_out>;
<symbol>  = '<' l '>' '*';
<symbol>  = '<' l '>' '+';
<symbol>  = l '*';
<symbol>  = l '+';
<symbol>  = <name>;
<any_in>  = i;
<any_in>  = a <any_in>;
i         = ''';
<any_out> = o;
<any_out> = a <any_out>;
o         = '"';

The self grammars must be preprocessed to remove the Kleene*. There are three new nostar grammars.

 nostar1 changes x* to X and <x>* to <x*>
 nostar2 adds rules like X=xX;X=;
 nostar3 adds rules like <x*>=<x><x*>;<x*>=;

As in gem4, there is a GEM-based nodup function that removes duplicates.

Both of the above self-describing grammars use Kleene* so both must be preprocessed by the nostar grammars. The convention in gem6 is that the readable regular expression style grammar is given the name xxx0 for some xxx and the result of applying the nostars or noplus is called xxx. In this case, self results from preprocessing self0. The result of recognizing self0 is the null string.

fprintf('res=''%s''\n', G.run(G.scan(G.self0), G.self, 'ulda'));
res=''

6.3 A new pretty

Grammar pretty now has multiple responsibilities. It is a syntax checker. It takes out and puts in just enough whitespace. It leaves multi-character symbols alone. There are 9 places where b* must be used to remove whitespace. Function nodup is essential in singling-up the generated rules. Here is pretty0, the readable pretty-printer.

fprintf('%s', G.run(G.pretty0, G.pretty, 'ULDA'));
<g> = b* <r>*;
<r> = <p> b* '=' " =" <s>* b* ';' b* ";
";
<p> = L;
<p> = '<' "<" L n* '>' ">";
<p> = '<' "<" L '*>' "*>";
n = L;
n = D;
n = '_' "_";
<s> = b* " " <t>;
<t> = I I I;
<t> = I <i>;
<t> = O O O;
<t> = O <o>;
<t> = '<' "<" L '>' ">" b* '*' "*";
<t> = '<' "<" L '>' ">" b* '+' "+";
<t> = L b* '*' "*";
<t> = L b* '+' "+";
<t> = <p>;
I = ''' "'";
<i> = I;
<i> = A <i>;
O = '"' """;
<o> = O;
<o> = A <o>;
b = ' ';
b = '
';

Looking at the runnable pretty shows the application of the nostar preprocessing for elimination of Kleene*.

fprintf('%s', G.run(G.pretty, G.pretty, 'ULDA'));
<g> = B <r*>;
<r> = <p> B '=' " =" <s*> B ';' B ";
";
<p> = L;
<p> = '<' "<" L N '>' ">";
<p> = '<' "<" L '*>' "*>";
n = L;
n = D;
n = '_' "_";
<s> = B " " <t>;
<t> = I I I;
<t> = I <i>;
<t> = O O O;
<t> = O <o>;
<t> = '<' "<" L '>' ">" B '*' "*";
<t> = '<' "<" L '>' ">" B '+' "+";
<t> = L B '*' "*";
<t> = L B '+' "+";
<t> = <p>;
I = ''' "'";
<i> = I;
<i> = A <i>;
O = '"' """;
<o> = O;
<o> = A <o>;
b = ' ';
b = '
';
B = b B;
B =;
N = n N;
N =;
<r*> = <r> <r*>;
<r*> =;
<s*> = <s> <s*>;
<s*> =;

6.4 Executing X86 code

The framework for getting into execution is available.

fprintf('rc=%d', G.exe('5589E5606133C0C9C3'));           % execute X86 code
rc=0

6.5 Scaling

GEM contains five generous, but fixed-size static arrays (for the arguments and the stacks). Each array can give an out-of-bounds error in certain circumstances. Usually such a diagnostic signifies an input error (left-recursion, output loop, etc.). It is possible, however, to have a problem to solve that is too big to fit inside the static bounds. The easiest solution is to increase the macro definition defining the size of the affected array and recompile iog6.c. One could, instead, dynamically allocate and reallocate the arrays. The choice here is for early warning about input bugs (frequent) to the more general behavior, which would fail only when all of memory is exhausted.

6.6 Conclusion

GEM version 6 is a usable tool as it stands. The obvious direction to extend it is implementing a C-like or ML-like mini language, in which anything is possible. Another direction is to reimplement GEM according to the first choice in section 6.1 -- skipping all the steps and putting Kleene * into the first implementation. That would be an as-of-now-unwritten Chapter 7.