Chapter
Five
The Pre-compiler
The symbols and symbol-combining rules of Atari
BASIC are coded into Syntax Tables, which direct the Program Pre— compiler in examining
source code and producing tokens. The information in the Syntax Tables is a transcription
of a meta— language definition of Atari BASIC. The Atari BASIC Meta-language
A meta-language is a language which describes or defines another language. Since
a meta-language is itself a language, it also has symbols and symbol-combining rules
- which define with precision the symbols and symbol-combining rules of the subject
language. Atari BASIC is precisely defined with a specially developed meta-language
called the Atari BASIC Meta-language, or ABMI,. (ABML was derived from a commonly
used compiler- technology meta-language called BNF.) The symbols and symbol-combining
rules of ABML were intentionally kept very simple. Making Up a Language To
show you how ABML works, we'll create an extremely simple language called SAP, for
Simple Arithmetic Process. SAP symbols consist of variables, constants, and operators.
- Variables: The letters A, B, and C only.
- Constants: The numbers 1,2,3,4,5,6,7,8, and 9 only.
- Operators: The characters +, -, *, /, and ! only.
Of course, you already know the functions of all the operators except "!"
The character! is a pseudo-operator of the SAP language used to denote the end of
the expression, like the period that ends this sentence.
The grammar of the SAP language is precisely defined
by the ABML definition in Figure 5-1. 33
Chapter
Five
Figure 5-1 The SAP Language Expressed in ABML SAP := <expression>!
<expression> := <value> <operation> | <operation> := <operator>
<expression> <value> := <constant> | <variable> <constant>
:= 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 <variable> := A | B | C <operator>
:= + | - | * | / The ABML symbols used to define the SAP language in Figure 5-1 are:
:= is defined as Whatever is on the left of : = is defined as consisting of
whatever is on the right of :=, and in that order. | or The symbol | allows choices
for what something is defined as. For instance, in the sixth line <variable>
can be A or B or C. If does not appear between two symbols, then there is no choice.
For example, in the second line <expression> must have both <value> and
<operation>, in that order, to be valid. <> label Whatever comes between
< and> is an ABML label. All labels, as non-terminal symbols, must be defined
at some point, though the definitions can be circular - notice that <operation>
is part of the definition of <expression> in the second line, while in the
third line <expression> is part of the definition of <operation>. terminal
Symbols used in definitions, which are not symbols enclosed by < and > and
are also not one of the ABML symbols, are terminal symbols in the language being
defined by ABML. In SAP, some terminal symbols are A, !, B, *, and 1. They cannot
be defined as consisting of other symbols - they are themselves the symbols that
the SAP language manipu— 34
Chapter
Five
lates, and must appear exactly as they are shown to be valid in
SAP. In effect, they are the vocabulary of the SAP language. Statement Generation
The ABML description of SAP can be used to generate grammatically correct statements
in the SAP language. To do this, we merely start with the first line of the definition
and replace the non-terminal symbols with the definitions of those symbols. The replacement
continues until only terminal symbols remain. These remaining terminal symbols constitute
a grammatically correct SAP statement. Since the or statement requires that one and
only one of the choices be used, we will have to arbitrarily replace the non- terminal
with the one valid choice. Figure 5-2 illustrates the ABML statement generation process.
Figure
5-2. The Generation of One Possible SAP Statement (1) SAP := <expression>! (2) SAP := <value>
<operation>! (3) SAP := <variable> <operation>! (4) SAP := B<
operation>! (5) SAP := B<operator> <expression>! (6) SAP := B*<expression>!
(7) SAP := B*<value> <operation>! (8) SAP := B*<constant> <operation>!
(9) SAP := B*4<operation>! (10) SAP := B*4< operator> <expression>!
(11) SAP := B*4+<expression>! (12) SAP := 8*4+<value> <operation>!
(13) SAP := B*4+<variable> <operation>! (14) SAP := B*4+C<operation>!
(15) SAP := B*4+C! In (2), <value> <operation> replaces <expression>
because the ABML definition of SAP (FigureS-I) defines <expression> as <value>
<operation>. In (3), the non-terminal <value> is replaced with 35
Chapter
Five
<variable>. The definition of <value> gives two choices
for the substitution of <value>. We happened to choose <variable>. In
(4), we reach a terminal symbol, and the process of defining <value> ends.
We happened to choose B to replace <variable>. In (5), we go back and
start defining <operation>. There are two choices for the replacement of <operation>,
either <operator> <expression> or nothing at all (since there is nothing
to the right of in the second line of Figures 5-1). If nothing had been chosen, then
(5) would have been: SAP :=B! The statement B! has no further non-terminals; the
process would have been finished, and a valid statement would have been produced.
Instead we happened to choose <operator> <expression>. The SAP definition
for <expression> is <value> <operation>. If we replace <operation>
with its definition we get: <expression> : = <value> <operator>
<expression> The definition of <expression> includes <expression>
as part of its definition. If the <operator> <expression> choice were
always made for <operation>, then the process of replacement would never stop.
A SAP statement can be infinitely long by definition. The only thing which prevents
us from always having an infinitely long SAP statement is that there is a second
choice for the replacement of <operation>: nothing. The replacements in (5)
and (10) reflect the repetitive choices of defining <expression> in terms of
itself. The choice in (15) reflects the nothing choice and thus finishes the
replacement process. Computerized Statement Generation If we slightly modify
our procedure for generating statements, we will have a process that could be easily
programmed into a computer. Instead of arbitrarily replacing the definition of non-
terminals, we can think of the non-terminal as a GOSUB. When we see <X> :=
<Y> <Z>, we can think of <Y> as being a subroutine-type procedure:
(a) Go to the line that has <Y> on the left side. (b) Process the definition
(right side) of <Y>. 36
Chapter
Five
(c) If while processing the definition of <Y>, other non-
terminals are found, GOSUB to them. (d) If while processing the definition of <Y>
we encounter a terminal, output the terminal symbol as the next symbol of the generated
statement. (e) When the definition of <Y> is finished, return to the place
that <Y> was called from and continue. Since ABML is structured so that it
can be programmed, a fascinating exercise is to design a simple English sentence
grammar with ABML, then write a BASIC program to generate valid English sentences
at random. The randomness of the sentences would be derived by using the RND function
to select from the definitions or choices. An example of such a grammar is shown
in Figure 5-3. (The programming exercise is left to you.) Figure 5-3. A Simple English Sentence Grammar in ABML SENTENCE := <subject>
<adverb> <verb> <object>. <subject> := The <adjective>
<noun> <verb> := eats | sleeps | drinks | talks | hugs <adverb>
:= quickly | silently | slowly | viciously | lovingly | sadly | <object> :=
at home | in the car | at the table | at school | <subject> <noun> :=
boy | girl | dog | programmer | computer | teacher <adjective> := happy | sad
| blue | light | round | smart | cool | nice | Syntactical Analysis The process of examining a language statement for grammatical
correctness is called syntactical analysis, or syntaxing. Statement verification
is similar to statement generation. Instead of arbitrarily choosing which or definition
to use, however, the choices are already made, and we must check to see whether the
statement symbols are used in valid patterns. To do this, we must process through
each or definition until we find a matching valid terminal symbol. The result of
statement generation is a valid, grammatically correct statement, but the result
of statement verification is a 37
Chapter
Five
statement validity indication, which is a simple yes or no. Either
the statement is grammatically correct or it is not.Failure occurs when some statement
symbol cannot be matched with a valid terminal symbol under the rules of the grammar.
The Reporting System To use the pass/fail result of statement verification,
we must build a reporting system into the non4erminal checking process. Whenever
we, in effect, GOSUB to a non-terminal definition, that non-terminal definition must
report its pass/fail status. A fail status is generated and returned by a
non-terminal definition when it finds no matching terminal for the current statement
symbol. If the current statement symbol is B and the <constant> definition
in the SAP language is called, then <constant> would report a fail status
to the routine that called it. A pass status is returned when a terminal symbol
is found which matches the current statement symbol. If our current statement symbol
had been 7 instead of B, then <constant> would have reported pass. Whenever
such a match does occur, we return to the statement, and the next symbol to the right
becomes the new current symbol for examination and verification. Cycling Through
the Definitions In SAP, the <constant> definition is called from the <value>
definition. If <constant> reports fail, then we examine the next or
choice, which is <variable>. The current symbol is B, so <variable> reports
pass. Since at least one of the or choices of <value> has reported
pass, <value> will report pass to its caller. If both <constant>
and <variable> had reported fail, then <value> would report fail
to its caller. The caller of <value> is <expression>. If <value>
reports pass, <operation> is called. If <operation> reports pass,
then <expression> can report pass to its caller. If either <value>
or <operation> reports fail, then <expression> must report fail,
since there are no other or choices for <expression> The definition
of <operation> contains a special pass/fail property. If either <operator>
or <expression> reports fail, 38
Chapter
Five
then the or choice must be examined. In this case the or
choice is nothing. The or nothing means something special: report pass,
but do not advance to the next symbol. The final pass/fail report is generated
from the first line of the definition. If <expression> reports pass
and the next symbol is!, then SAP reports pass. If either one of these conditions
has a fail status, then SAP must report fail to whatever called SAP
from outside the language. Backing Up Sometimes it is necessary to back up
over symbols which have already been processed. Let's assume that there was a definition
of the type <X> : <Y>I<z>. It is possible that while <Y>
is attempting to complete its definition, it will find a number of valid matching
terminal symbols before it discovers a symbol that it cannot match. In this case,
<Y> would have consumed a number of symbols before it decided to report fail.
All of the symbols that <Y> consumed must be unconsumed before <Z> can
be called, since <Z> will need to check those same symbols. The process of
unconsuming symbols is called backup. Backup is usually performed by the caller of
<Y>, which remembers which source symbol was current when it called <Y>.
If <Y> reports fail, then the caller of <Y> restores the current
symbol pointer before calling <Z>. Locating Syntax Error When a final
report of fail is given for a statement, it is often possible to guess where
the error occurred. In a left-to-right system, the symbol causing the failure is
usually the symbol which follows the rightmost symbol found to be valid. If we keep
track of the rightmost valid symbol during the various backups, we can report a best
guess as to where the failure- causing error is located. This is exactly what Atari
BASIC does with the inverse video character in the ERROR line. For simplicity, our
example was coded for SAP, but the syntactical analysis we have just described is
essentially the process that the Atari BASIC pre-compiler uses to verify the grammar
of a source statement. The Syntax Tables are an ABML description of Atari BASIC.
The pre-compiler, also known as the syntaxer, contains the routines which
verify BASIC statements. 39
Chapter
Five
Statement Syntax Tables There is one entry in the Syntax Tables for each BASIC statement.
Each statement entry in the Syntax Table is a transcription of an ABML definition
of the grammar for that particular statement. The starting address of the table entry
for a particular statement is pointed to by that statement's entry in the Statement
Name Table. The data in the Syntax Tables is very much like a computer machine language.
The pseudo-computer which executes this pseudo-machine language is the pre-compiler
code. Like any machine language, the pseudo-machine language of the Syntax Tables
has instructions and instruction operands. For example, an ABML non4erminal symbol
is transcribed to a code which the pre-compiler executes as a type of "GOSUB
and report pass/fail" command. Here are the pseudo-instruction codes
in the Syntax Tables; each is one byte in length. Absolute Non-Terminal Vector
Name: ANTV Code: $00 This is one of the forms of the non-terminal GOSUB. It is followed
by the address, minus 1, of the nonAerminal's definition within the Syntax Table.
The address is two bytes long, with the least significant byte first. External
Subroutine Call Name: ESRT Code: $01 This instruction is a special type of terminal
symbol checker. It is followed by the address, minus 1, of a 6502 machine language
routine. The address is two bytes long, with the least significant byte first. The
ESRT instruction is a deus ex machina - the "god from the machine"
who solved everybody's problems at the end of classical Greek plays. There are some
terminals whose definition in ABML would be very complex and require a great many
instructions to describe. In these cases, we go outside the pseudo-machine language
of the Syntax Tables and get help from 6502 machine language routines - the deus
ex machina that quickly gives the desired 40
Chapter
Five
result. A numeric constant is one example of where this outside
help is required. ABML or Name: OR Value: $02 This is the familiar ABML or
symbol ( | ). It provides for an alternative definition of a non-terminal. Return
Name: RTN Value: $03 This code signals the end of an ABML definition line. When we
write an ABML statement on paper, the end of a definition line is obvious - there
is no further writing on the line. When ABML is transcribed to machine codes, the
definitions are all pushed up against each other. Since the function that is performed
at the end of a definition is a return, the end of definition is called return (RTN).
Unused (Codes $04 through $0D are unused.) Expression Non-Terminal Vector
Name: VFXP Value: $0E The ABML definition for an Atari BASIC expression is located
at $A60D. Nearly every BASIC statement definition contains the possibility of having
<expression> as part of it. VEXP is a single-byte call to <expression>,
to avoid wasting the two extra bytes that ANTV would take. The pseudo- machine understands
that this instruction is the same as an ANTV call to <expression> at $A60D.
Change Last Token Name: CHNG Value: $0F This instruction is followed by a
one-byte change to token value. The operator token instructions cause a token
to be placed into the output buffer. Sometimes it is necessary to change the token
that was just produced. For example, there are several = operators. One = operator
is for the assignment 41
Chapter
Five
statement LET X = 4. Another = operator is for comparison
operations like IF Y = 5. The pseudo-machine will generate the assignment
= token when it matches =. The context of the grammar at that point may have required
a comparison = token. The CHNG instruction rectifies this problem. Operator
Token Name: (many) Value: $10 through $7F These instructions are terminal codes
for the Atari BASIC Operators. The code values are the values of each operator token.
The values, value names, and operator symbols are defined in the Operator Name Table
(see Chapter 2). When the pseudo-machine sees these terminal symbol representations,
it compares the symbol it represents to the current symbol in the source statement.
If the symbols do not match, then fail status is generated. If the symbols match,
then pass status is generated, the token (instruction value) is placed in the token
output buffer, and the next statement source symbol becomes the current symbol for
verification. Relative Non-Terminal Vectors Name: (none) Value: $80- $BF (Plus)
$C0 - $FF (Minus) This instruction is similar to ANTV, except that it is a single
byte. The upper bit is enough to signal that this one-byte code is a non-terminal
GOSUB. The destination address of the GOSUB is given as a position relative to the
current table location. The values $80 through $BF correspond to an address which
is at the current table address plus $00 through $31'. The values $C0 through $FF
correspond to an address which is at the current table address minus $01 through
$31'. Pre-compiler
Main Code Description The pre-compiler, which starts at SYNENT ($A1C3), uses the
pseudo-instructions in the Syntax Tables to verify the correctness of the source
line and to generate the tokenized statements. 42
Chapter
Five
Syntax Stack The pre-compiler uses a LIFO stack in its processing.
Each time a non-terminal vector ("GOSUB") is executed, the pre— compiler
must remember where the call was made from. It must also remember the current locations
in the input buffer (source statement) and the output buffer (tokenized statement)
in case the called routine reports fail and backup is required. This LIFO stack is
called the Syntax Stack. The Syntax Stack starts at 5480 at the label SIX. The stack
is 256 bytes in size. Each entry in the stack is four bytes long. The stack can hold
64 levels of non4erminal calls. If a sixty-fifth stack entry is attempted, the LINE
TOO LONG error is reported. (This error should be called LINE TOO COMPLEX, but the
line is most likely too long also.) The first byte of each stack entry is the current
input index (CIX). The second byte is the current output index (COX). The final two
bytes are the current address within the syntax tables. The current stack level is
managed by the STKLVL ($A9) cell. STKLVL maintains a value from $00 to $FC, which
is the displacement to the current top of the stack entry. Initialization
The editor has saved an address in SRCADR ($96). This address is the address, minus
1, of the current statement's ABML instructions in the Syntax Tables. The current
input index (CIX) and the current output index (COX) are also preset by the editor.
The initialization code resets the syntax stack manager (STKLVL) to zero and loads
the first stack entry with the values in CIX, COX, and CPC - the current program
counter, which holds the address of the next pseudo-instruction in the Syntax Tables.
PUSH Values are placed
on the stack by the PUSH routine ($A228). PUSH is entered with the new current pseudo-program
counter value on the CPU stack. PUSH saves the current CIX, COX, and CPC on the syntax
stack and increments STKLVL. Next, it sets a new CPC value from the data on the CPU
stack. Finally, PUSH goes to NEXT. 43
Chapter
Five
POP Values are removed from the stack with the POP routine ($A252).
POP is entered with the 6502 carry flag indicating pass/fail. If the carry
is clear, then pass is indicated. If the carry is set, then fail is indicated.
POP first checks STKLVL. If the current value is zero, then the pre-compiler is done.
In this case, POP returns to the editor via RTS. The carry bit status informs the
editor of the pass/fail status. If STKLVL is not zero, POP decrements STKLVL.
At this point, POP examines the carry bit status. If the carry is clear (pass),
POP goes to NEXT. If the carry is set (fail), POP goes to FAIL. NEXT and
the Processes It Calls After initialization is finished and after each Syntax
Table instruction is processed, NEXT is entered to process the next syntax instruction.
NEXT starts by calling NXSC to increment CPC and get the next syntax instruction
into the A register. The instruction value is then tested to determine which syntax
instruction code it is and where to go to process it. If the Syntax Instruction is
OR ($02) or RTN ($03), then exit is via POP. When POP is called due to these two
instructions, the carry bit is always clear, indicating pass. ERNTV. If the
instruction is RNTV ("GOSUB" $80-$FF), then ERNTV ($A201) is entered. This
code calculates the new CPC value, then exits via PUSH. GETADR. If the instruction
is ANTV ($00) or the deus ex machina ESRT ($01) instruction, then GETADR is
called. GETADR obtains the following two-byte address from the Syntax Table. If the
instruction was ANTV, then GETADR exits via PUSH. If the instruction was ESRT, then
GETADR calls the external routine indicated. The external routine will report pass/fail
via the carry bit. The pass/fail condition is examined at $A1FO. If pass
is indicated, then NEXT is entered. If fail is indicated, then FAIL is entered.
TERMTST. If the instruction is VEXP ($0E), then the code at $A1F9 will go
to TERMTST ($A2A9), which will cause the code 44
Chapter
Five
at $A2AF to be executed for VEXP. This code obtains the address,
minus 1, of the ABML for the <expression> in the Syntax Table and exits via
PUSH. ECHNG. If the instruction was CHNG ($0F), then ECHNG ($A2BA) is entered
via tests at $A1F9 and $A2AB. ECHNG will increment CPC and obtain the change-to
token which will then replace the last previously generated token in OUTBUFF. ECHNG
exits via RTS, which will take control back to NEXT. SRCONT. The Operator
Token Instructions ($10-$7F) are handled by the SRCONT routine. SRCONT is called
via tests at $A1F9 and $A2AD. SRCONT will examine the current source symbol to see
if it matches the symbol represented by the operator token. When SRCONT has made
its determination, it will return to the code at $A1FC. This code will examine the
pass/fail (carry cleariset) indicator returned by SRCONT and take the appropriate
action. (The SRCONT routine is detailed on the next page.) FAIL If any routine
returns a fail indicator, the FAIL code at $A26C will be entered. FAIL will
sequentially examine the instructions, starting at the Syntax Table address pointed
to by CPC, looking for an OR instruction. If an OR instruction is found, the code
at $A27D will be entered. This code first determines if the current statement symbol
is the rightmost source symbol to be examined thus far. If it is, it will update
MAXCIX. The editor will use MAXCIX to set the inverse video flag if the statement
is erroneous. Second, the code restores CIX and COX to their before-failure values
and goes to NEXT to try this new OR choice. If, while searching for an OR instruction,
FAIL finds a RTN instruction, it will call POP with the carry set. Since the carry
is set, POP will re-enter FAIL once it has restored things to the previous calling
level. All instruction codes other than OR and RTN are skipped over by FAIL. 45
Chapter
Five
Pre-compiler Subroutine Descriptions SRCONT ($A2E6) The SRCONT code will be entered when
an operator token instruction is found in the Syntax Tables by the main pre— compiler
code. The purpose of the routine is to determine if the current source symbol in
the user's line matches the terminal symbol represented by the operator token. If
the symbols match, the token is placed into the output buffer and pass is returned.
If the symbols do not match, fail is returned. SRCONT uses the value of the operator
token to access the terminal symbol name in the Operator Name Table. The characters
in the source symbol are compared to the characters in the terminal symbol. If all
the characters match, pass is indicated. TNVAR, TSVAR ($A32A) These
deus ex machina routines are called by the ESRT instruction. The purpose of
the routines is to determine if the current source symbol is a valid numeric (TNVAR)
or string (TSVAR) variable. If the source symbol is not a valid variable, fail
is returned. When pass is indicated, the routine will put a variable token
into the output buffer. The variable token ($80-$FF) is an index into the Variable
Name Table and the Variable Value Table, plus $80. The Variable Name Table is searched.
If the variable is already in the table, the token value for the existing variable
is used. If the variable is not in the table, it will be inserted into both tables
and a new token value will be used. A source symbol is considered a valid variable
if it starts with an alphabetic character and it is not a symbol in the Operator
Name Table, which includes all the reserved words. The variable is considered to
be a string if it ends with $; otherwise it is a numeric variable. If it is a string
variable, $ is stored with the variable name characters. The routine also determines
if the variable is an array by looking for (. If the variable is an array, (is stored
with the variable name characters in the Variable Name Table. As a result, ABC, ABC$,
and ABC(n) are all recognized as different variables. 46
Chapter
Five
TNCON ($A400) TNCON is called by the FSRT instruction. Its
purpose is to examine the current source symbol for a numeric constant, using the
floating point package. If the symbol is not a numeric constant, the routine returns
fail. If the symbol is a numeric constant, the floating point package has converted
it to a floating point number. The resulting six-byte constant is placed in the output
buffer preceded by the $OE numeric constant token. The routine then exits with pass
indicated. TSCON ($A428) TSCON is called by the ESRT instruction. Its purpose
is to examine the current symbol for a string constant. If the symbol is not a string
constant, the routine returns fail. If the first character of the symbol is
'', the symbol is a string constant. The routine will place the string constant
token ($0F) into the output buffer, followed by a string length byte, followed by
the string characters. The string constant consists of all the characters that follow
the starting double quote up to the ending double quote. If the EOL character ($9B)
is found before the ending double quote, an ending double quote is assumed. The EOL
is not part of the string. The starting and ending double quotes are not saved with
the string. All 256 character codes except $9B (EOL) and $22 (") are
allowed in the string. SEARCH ($A462) This is a general purpose table search
routine used to find a source symbol character string in a table. The table to be
searched is assumed to have entries which consist of a fixed length part (0 to 255
bytes) followed by a variable length ATASCII part. The last character of the ATASCII
part is assumed to have the most significant bit ($80) on. The last table entry is
assumed to have the first ATASCII character as $00. Upon entry, the X register contains
the length of the fixed part of the table (0 to 255). The A, Y register pair points
to the start of the table to be searched. The source string for comparison is pointed
to by INBUFF plus the value in CIX. Upon exit, the 6502 carry flag is clear if a
match was found, and set if no match was found. The X register points to the end
47
Chapter
Five
of the symbol, plus 1, in the buffer. The SRCADR ($95) two- byte
cell points to the matched table entry. STENUM ($AF) contains the number, relative
to zero, of the matched table entry. SETCODE (A2C8) The SETCODE routine is
used to place a token in the next available position in the output (token) buffer.
The value in COX determines the current displacement into the token buffer. Mter
the token is placed in the buffer, COX is incremented by one. If COX exceeds 255,
the LINE TOO LONG error message is generated. 48
<-Chapter
04Chapter 06->