MATLAB Examples


Growing a Compiler, Chapter 1

format compact  % Help MATLAB save screen space

1 Introduction

This presentation proceeds in stages, each version an extension of its predecessor (bootstrapping). The development stops well short of an industrial compiler but that is not because of any known limitations to the approach.

Note: This presentation was prepared by the MATLAB publish feature which accepts a commented MATLAB script as input and produces an HTML document. This paragraph came from a MATLAB comment. The grey-bar sections are MATLAB code.

format compact  % help MATLAB save screen space

Immediately following the each code section is the corresponding output (if any). There was no output from the format statement above.

1.1 Executable Grammars

It is well-known that the rewriting rules of a Context-free Grammar can be mechanically applied, and that if some sequence of applications results in a parse, that parse is correct. The trick is, of course, in finding the sequence of applications.

The Grammar Executing Machine (GEM) presented here can do that, with reasonable efficiency, executing its input grammar one character at a time.

1.2 IOG: the Input/Output Grammar

An Input/Output Grammar (IOG), a kind of syntax directed translation schema, is a Mealy-like extension of the Context-Free Grammar (CFG). The name is chosen to emphasize the symmetry of input (what any CFG naturally defines) and output (which has been added). Following the usual conventions for CFGs, an IOG

  • ${\cal G} = \langle V_I, V_O, V_N, V_G, \Pi\rangle$

consists of a set of input symbols, a set of output symbols, a set of phrase names, a set of goals, and a set of reduction rules. The output symbols are analogous to the {actions} of YACC. In this paper $V_G$ contains only one symbol but for other uses, in particular the description of finite automata, multiple goals are useful.

An IOG satisfies the following constraints:

  • $V  = V_I \cup V_O \cup V_N$
  • $V_I \cap V_O = \{\}$
  • $V_I \cap V_N = \{\}$
  • $V_O \cap V_N = \{\}$
  • $V_G \subseteq V_N$
  • $\Pi \subseteq V_N\times V^*$

When $V_O$ is empty, an IOG is a conventional CFG with terminal symbols $V_I$.

There are some initial restrictions on the IOGs GEM can accept.

  • Whitespace is not allowed between symbols.
  • The input, output and phrase name symbols are all single-character.
  • The IOG must not be left-recursive.

Part of the task is to use IOGs to relax these restrictions on IOGs.

1.3 Grammar Executing Machine (version 0)

GEM is a Grammar Executing Machine. It can be thought of as a function

 o = GEM(i, g)

where i is the input text, o is the resulting output text, and g is the text of the IOG being executed. The final argument g can be thought of as the stored program in a Von Neumann computer.

The role of MATLAB in this presentation is as a test harness and demonstration platform; GEM itself is implemented in low-level C. The implementation of GEM and some of its grammars is bundled in a MATLAB object. GEM can be made available for use in this talk by the following lines of MATLAB code.

G   = gem0();   % Instantiate the object (base implementation).
GEM =;    % GEM is a MATLAB function

1.3.1 Running GEM

A grammar g acceptable to GEM is a sequence of rules, each of the form

p = $\alpha$;

where p is a letter and $\alpha$ is a sequence of symbols from $V$. Letters are used to signify members of $V_N$. By convention, the first rule in g defines the goal symbol, and therefore is the only member of $V_G$.

The word sequence used twice above is significant: grammar rules are tried in order, allowing specific cases to be separated out from more general cases. Useful GEM grammars may be technically ambiguous. Three examples suffice to show all the capabilities of GEM.

g0 is the simplest possible IOG.


It describes the empty language. When g0 is applied to the null string, the output is null string. More significantly GEM does not give a diagnostic, which implies it actually parsed the null string.

fprintf('res=''%s''\n', GEM('', 'r=;'));

IOG g1


shows the use of ' and " to delimit members of $V_I$ and $V_O$ respectively. This grammar accepts character x as input and produces character y as output. Note: in a MATLAB string '' stands for '.

fprintf('res=''%s''\n',GEM('x', 'r=''x''"y";'));

IOG g2


shows the use of additional rules to describe alternatives. Given either the string 1 or the string 2 as input, it produces the null string as output.

fprintf('res=''%s''\n', GEM('1', 'r=s;s=''1'';s=''2'';'));

1.3.2 GEM implementation

The (GEM) machine itself is implemented in C. Because the presentation is staged to show growth, there is both a series of C files and a series of M files calling them.

The form of the C code is a MEX wrapper file (iogN.c for some N) and an included grammar executing machine (iogN.h for the same N). The functions iogN are called from MATLAB functions gemN for N=0,1,2....

Why not use MATLAB instead of C for functions iog? Fair question. The answer is that GEM is best expressed at a very low level, even lower than C. MATLAB is in the wrong direction. Perhaps assembler would be more even more appropriate.

Here is the line-count of the C code for simplest version.

!wc -l iog0.c  iog0.h
    136    iog0.c 
     76    iog0.h 
    212    total 

The characters in the IOG are opcodes of the machine. The machine has three modes: PARSE, SEARCH and BACK. It has char output, input, and grammar vectors, corresponding to the o, i and g in the call to GEM. It has a parsing stack and backtracking stack.

The following lines of C code execute IOGs. The compiled C code takes a few nanoseconds to execute each step.

dbtype iog0.h
1     /* Grammar Executing Machine version 0 -- simple and risky */
2     static void
3     gem(void) {
4       for (;;) {        
5         switch (mode) {
6           /* executing rules */
7           case PARSE:
8             switch (*p) {
9             ALPHA                             /* call new rule */
10              ps++; *ps = p;                    /*   save return address */
11              p = p0; mode = SEARCH; break;     /*   search from beginning */
12            case '\'':                        /* input */
13              p++;                              /*   skip over 1rst ' */
14              if (*p == *i) {i++; p++; p++;}    /*   read-match */
15              else {mode = BACK; p--; p--;}     /*   read-mismatch */
16              break;
17            case '"':                         /* output */
18              p++; o++;                         /*   skip over 1rst " */
19              *o = *p;                          /*   move literal to output*/
20              p++; p++; break;                  /*   skip over 2nd " */
21            case ';':                         /* rule end (parsing) */
22              p--;                              /*   back up over ; */
23              bs++; *bs = p;                    /*   save backup pointer */
24              if (psize < 0) return;            /*   empty stack: success */
25              p = *ps; ps--;                    /*   return from rule */
26              p++; break;                       /*   skip over rule name */
27            default:                          /* bad char in grammar */
28              error("Unexpected character (PARSE)");
29            }
30            break;                            /* end of parse step */
32          /* backtracking */
33          case BACK:
34          switch (*p) {
35            ALPHA                             /* un-return from rule */
36              ps++; *ps = p;                    /* save return address */
37              p = *bs; bs--; break;             /* end of previous rule */
38            case '\'':                        /* input */
39              i--;                              /* un-get input */
40              p--; p--; p--; break;             /* un-skip literal */
41            case '"':                         /* output */
42              o--;                              /* un-put output */
43              p--; p--; p--; break;             /* un-skip literal */
44            case '=':                         /* rule begin (backtracking) */
45              mode = SEARCH;                    /* forward again */
46              p++; break;                       /* skip by = */
47            default:
48              error("Unexpected character (BACK)");
49            }
50            break;                            /* end of back step */
52          /* searching for a rule */
53          case SEARCH:
54            switch (*p) {
55            ALPHA                             /* phrase name */
56              p++; break;                       /* skip over name */
57            case '\'': 
58            case '"':                         /* input/output */
59              p++; p++; p++; break;             /* skip over 'x' or "x" */
60            case ';':                         /* rule coming */
61              p++;                              /* skip over ; */
62              if (p-p0==pN) {                   /* end of code */
63                if (psize == 0) error("Unparsable input");
64                p = *ps; ps--;                /* back out one rule */
65                mode = BACK;                    /* reverse direction */
66                p--; break;                     /* un-skip over name */
67              }
68              if (*p==**ps) mode = PARSE;       /* lhs is phrase name */
69              p++; p++; break;                  /* skip over lhs = */
70            default:
71              error("Unexpected character (SEARCH)");
72            }
73            break;
74        }
75      }
76    }

If you are interested in the MEX wrapper iog0.c and the driver gem0.m, you can download this File Exchange submission and look at them in the MATLAB editor.

1.3.3 How GEM works

One fundamental property of GEM input is that the grammars are symmetric. They can be executed right-to-left as easily as they can be executed left-to-right. When executed backward, the effect is to exactly undo the work that was done going forward. This provides a simple micro-managed kind of backtracking.

The executable grammar consists of concatenated rules. Each rule is of the form

  p = definition ;

The rule starts with a defined phrase name p and, except for the delimiters = and ;, contains only phrase names, input symbols and output symbols.

The initial mode of execution is SEARCH; the initial executable character is the goal phrase name (a letter). GEM linearly searches the entire grammar for definitions of the executable character. When a defining rule is found, mode PARSE is entered.

At each step during PARSE, GEM switches on the current character, interpreting it as one of:

  • a (recursive) call, or
  • move ahead in the input (a shift), or
  • move a character to the output, or
  • end of a grammar rule (a reduce).

When a new recursive call is needed, the mode shifts back to SEARCH.

Whenever the actual input fails to match the required input symbol in the IOG, the current trial parse fails and mode BACK is entered.

During mode BACK the IOG is un-executed backward. Since each reduce action stacks the location of the final character in its rule, the backup can always find its way back to the rule needed to unparse.

If, at some point, the input is entirely used and all recursions have returned, GEM reports the accumulated output.

Subsequent GrowingCompiler chapters continue the story.