Chapter Five

The Pre-compiler

The symbols and symbol-combining rules of Atari BASIC are coded into Syntax Tables, which direct the Program Pre— compiler in examining source code and producing tokens. The information in the Syntax Tables is a transcription of a meta— language definition of Atari BASIC. The Atari BASIC Meta-language A meta-language is a language which describes or defines another language. Since a meta-language is itself a language, it also has symbols and symbol-combining rules - which define with precision the symbols and symbol-combining rules of the subject language. Atari BASIC is precisely defined with a specially developed meta-language called the Atari BASIC Meta-language, or ABMI,. (ABML was derived from a commonly used compiler- technology meta-language called BNF.) The symbols and symbol-combining rules of ABML were intentionally kept very simple. Making Up a Language To show you how ABML works, we'll create an extremely simple language called SAP, for Simple Arithmetic Process. SAP symbols consist of variables, constants, and operators.

The grammar of the SAP language is precisely defined by the ABML definition in Figure 5-1. 33


Chapter Five Figure 5-1 The SAP Language Expressed in ABML SAP := <expression>! <expression> := <value> <operation> | <operation> := <operator> <expression> <value> := <constant> | <variable> <constant> := 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 <variable> := A | B | C <operator> := + | - | * | / The ABML symbols used to define the SAP language in Figure 5-1 are: := is defined as Whatever is on the left of : = is defined as consisting of whatever is on the right of :=, and in that order. | or The symbol | allows choices for what something is defined as. For instance, in the sixth line <variable> can be A or B or C. If does not appear between two symbols, then there is no choice. For example, in the second line <expression> must have both <value> and <operation>, in that order, to be valid. <> label Whatever comes between < and> is an ABML label. All labels, as non-terminal symbols, must be defined at some point, though the definitions can be circular - notice that <operation> is part of the definition of <expression> in the second line, while in the third line <expression> is part of the definition of <operation>. terminal Symbols used in definitions, which are not symbols enclosed by < and > and are also not one of the ABML symbols, are terminal symbols in the language being defined by ABML. In SAP, some terminal symbols are A, !, B, *, and 1. They cannot be defined as consisting of other symbols - they are themselves the symbols that the SAP language manipu— 34
Chapter Five lates, and must appear exactly as they are shown to be valid in SAP. In effect, they are the vocabulary of the SAP language. Statement Generation The ABML description of SAP can be used to generate grammatically correct statements in the SAP language. To do this, we merely start with the first line of the definition and replace the non-terminal symbols with the definitions of those symbols. The replacement continues until only terminal symbols remain. These remaining terminal symbols constitute a grammatically correct SAP statement. Since the or statement requires that one and only one of the choices be used, we will have to arbitrarily replace the non- terminal with the one valid choice. Figure 5-2 illustrates the ABML statement generation process. Figure 5-2. The Generation of One Possible SAP Statement (1) SAP := <expression>! (2) SAP := <value> <operation>! (3) SAP := <variable> <operation>! (4) SAP := B< operation>! (5) SAP := B<operator> <expression>! (6) SAP := B*<expression>! (7) SAP := B*<value> <operation>! (8) SAP := B*<constant> <operation>! (9) SAP := B*4<operation>! (10) SAP := B*4< operator> <expression>! (11) SAP := B*4+<expression>! (12) SAP := 8*4+<value> <operation>! (13) SAP := B*4+<variable> <operation>! (14) SAP := B*4+C<operation>! (15) SAP := B*4+C! In (2), <value> <operation> replaces <expression> because the ABML definition of SAP (FigureS-I) defines <expression> as <value> <operation>. In (3), the non-terminal <value> is replaced with 35
Chapter Five <variable>. The definition of <value> gives two choices for the substitution of <value>. We happened to choose <variable>. In (4), we reach a terminal symbol, and the process of defining <value> ends. We happened to choose B to replace <variable>. In (5), we go back and start defining <operation>. There are two choices for the replacement of <operation>, either <operator> <expression> or nothing at all (since there is nothing to the right of in the second line of Figures 5-1). If nothing had been chosen, then (5) would have been: SAP :=B! The statement B! has no further non-terminals; the process would have been finished, and a valid statement would have been produced. Instead we happened to choose <operator> <expression>. The SAP definition for <expression> is <value> <operation>. If we replace <operation> with its definition we get: <expression> : = <value> <operator> <expression> The definition of <expression> includes <expression> as part of its definition. If the <operator> <expression> choice were always made for <operation>, then the process of replacement would never stop. A SAP statement can be infinitely long by definition. The only thing which prevents us from always having an infinitely long SAP statement is that there is a second choice for the replacement of <operation>: nothing. The replacements in (5) and (10) reflect the repetitive choices of defining <expression> in terms of itself. The choice in (15) reflects the nothing choice and thus finishes the replacement process. Computerized Statement Generation If we slightly modify our procedure for generating statements, we will have a process that could be easily programmed into a computer. Instead of arbitrarily replacing the definition of non- terminals, we can think of the non-terminal as a GOSUB. When we see <X> := <Y> <Z>, we can think of <Y> as being a subroutine-type procedure: (a) Go to the line that has <Y> on the left side. (b) Process the definition (right side) of <Y>. 36
Chapter Five (c) If while processing the definition of <Y>, other non- terminals are found, GOSUB to them. (d) If while processing the definition of <Y> we encounter a terminal, output the terminal symbol as the next symbol of the generated statement. (e) When the definition of <Y> is finished, return to the place that <Y> was called from and continue. Since ABML is structured so that it can be programmed, a fascinating exercise is to design a simple English sentence grammar with ABML, then write a BASIC program to generate valid English sentences at random. The randomness of the sentences would be derived by using the RND function to select from the definitions or choices. An example of such a grammar is shown in Figure 5-3. (The programming exercise is left to you.) Figure 5-3. A Simple English Sentence Grammar in ABML SENTENCE := <subject> <adverb> <verb> <object>. <subject> := The <adjective> <noun> <verb> := eats | sleeps | drinks | talks | hugs <adverb> := quickly | silently | slowly | viciously | lovingly | sadly | <object> := at home | in the car | at the table | at school | <subject> <noun> := boy | girl | dog | programmer | computer | teacher <adjective> := happy | sad | blue | light | round | smart | cool | nice | Syntactical Analysis The process of examining a language statement for grammatical correctness is called syntactical analysis, or syntaxing. Statement verification is similar to statement generation. Instead of arbitrarily choosing which or definition to use, however, the choices are already made, and we must check to see whether the statement symbols are used in valid patterns. To do this, we must process through each or definition until we find a matching valid terminal symbol. The result of statement generation is a valid, grammatically correct statement, but the result of statement verification is a 37
Chapter Five statement validity indication, which is a simple yes or no. Either the statement is grammatically correct or it is not.Failure occurs when some statement symbol cannot be matched with a valid terminal symbol under the rules of the grammar. The Reporting System To use the pass/fail result of statement verification, we must build a reporting system into the non4erminal checking process. Whenever we, in effect, GOSUB to a non-terminal definition, that non-terminal definition must report its pass/fail status. A fail status is generated and returned by a non-terminal definition when it finds no matching terminal for the current statement symbol. If the current statement symbol is B and the <constant> definition in the SAP language is called, then <constant> would report a fail status to the routine that called it. A pass status is returned when a terminal symbol is found which matches the current statement symbol. If our current statement symbol had been 7 instead of B, then <constant> would have reported pass. Whenever such a match does occur, we return to the statement, and the next symbol to the right becomes the new current symbol for examination and verification. Cycling Through the Definitions In SAP, the <constant> definition is called from the <value> definition. If <constant> reports fail, then we examine the next or choice, which is <variable>. The current symbol is B, so <variable> reports pass. Since at least one of the or choices of <value> has reported pass, <value> will report pass to its caller. If both <constant> and <variable> had reported fail, then <value> would report fail to its caller. The caller of <value> is <expression>. If <value> reports pass, <operation> is called. If <operation> reports pass, then <expression> can report pass to its caller. If either <value> or <operation> reports fail, then <expression> must report fail, since there are no other or choices for <expression> The definition of <operation> contains a special pass/fail property. If either <operator> or <expression> reports fail, 38
Chapter Five then the or choice must be examined. In this case the or choice is nothing. The or nothing means something special: report pass, but do not advance to the next symbol. The final pass/fail report is generated from the first line of the definition. If <expression> reports pass and the next symbol is!, then SAP reports pass. If either one of these conditions has a fail status, then SAP must report fail to whatever called SAP from outside the language. Backing Up Sometimes it is necessary to back up over symbols which have already been processed. Let's assume that there was a definition of the type <X> : <Y>I<z>. It is possible that while <Y> is attempting to complete its definition, it will find a number of valid matching terminal symbols before it discovers a symbol that it cannot match. In this case, <Y> would have consumed a number of symbols before it decided to report fail. All of the symbols that <Y> consumed must be unconsumed before <Z> can be called, since <Z> will need to check those same symbols. The process of unconsuming symbols is called backup. Backup is usually performed by the caller of <Y>, which remembers which source symbol was current when it called <Y>. If <Y> reports fail, then the caller of <Y> restores the current symbol pointer before calling <Z>. Locating Syntax Error When a final report of fail is given for a statement, it is often possible to guess where the error occurred. In a left-to-right system, the symbol causing the failure is usually the symbol which follows the rightmost symbol found to be valid. If we keep track of the rightmost valid symbol during the various backups, we can report a best guess as to where the failure- causing error is located. This is exactly what Atari BASIC does with the inverse video character in the ERROR line. For simplicity, our example was coded for SAP, but the syntactical analysis we have just described is essentially the process that the Atari BASIC pre-compiler uses to verify the grammar of a source statement. The Syntax Tables are an ABML description of Atari BASIC. The pre-compiler, also known as the syntaxer, contains the routines which verify BASIC statements. 39
Chapter Five Statement Syntax Tables There is one entry in the Syntax Tables for each BASIC statement. Each statement entry in the Syntax Table is a transcription of an ABML definition of the grammar for that particular statement. The starting address of the table entry for a particular statement is pointed to by that statement's entry in the Statement Name Table. The data in the Syntax Tables is very much like a computer machine language. The pseudo-computer which executes this pseudo-machine language is the pre-compiler code. Like any machine language, the pseudo-machine language of the Syntax Tables has instructions and instruction operands. For example, an ABML non4erminal symbol is transcribed to a code which the pre-compiler executes as a type of "GOSUB and report pass/fail" command. Here are the pseudo-instruction codes in the Syntax Tables; each is one byte in length. Absolute Non-Terminal Vector Name: ANTV Code: $00 This is one of the forms of the non-terminal GOSUB. It is followed by the address, minus 1, of the nonAerminal's definition within the Syntax Table. The address is two bytes long, with the least significant byte first. External Subroutine Call Name: ESRT Code: $01 This instruction is a special type of terminal symbol checker. It is followed by the address, minus 1, of a 6502 machine language routine. The address is two bytes long, with the least significant byte first. The ESRT instruction is a deus ex machina - the "god from the machine" who solved everybody's problems at the end of classical Greek plays. There are some terminals whose definition in ABML would be very complex and require a great many instructions to describe. In these cases, we go outside the pseudo-machine language of the Syntax Tables and get help from 6502 machine language routines - the deus ex machina that quickly gives the desired 40
Chapter Five result. A numeric constant is one example of where this outside help is required. ABML or Name: OR Value: $02 This is the familiar ABML or symbol ( | ). It provides for an alternative definition of a non-terminal. Return Name: RTN Value: $03 This code signals the end of an ABML definition line. When we write an ABML statement on paper, the end of a definition line is obvious - there is no further writing on the line. When ABML is transcribed to machine codes, the definitions are all pushed up against each other. Since the function that is performed at the end of a definition is a return, the end of definition is called return (RTN). Unused (Codes $04 through $0D are unused.) Expression Non-Terminal Vector Name: VFXP Value: $0E The ABML definition for an Atari BASIC expression is located at $A60D. Nearly every BASIC statement definition contains the possibility of having <expression> as part of it. VEXP is a single-byte call to <expression>, to avoid wasting the two extra bytes that ANTV would take. The pseudo- machine understands that this instruction is the same as an ANTV call to <expression> at $A60D. Change Last Token Name: CHNG Value: $0F This instruction is followed by a one-byte change to token value. The operator token instructions cause a token to be placed into the output buffer. Sometimes it is necessary to change the token that was just produced. For example, there are several = operators. One = operator is for the assignment 41
Chapter Five statement LET X = 4. Another = operator is for comparison operations like IF Y = 5. The pseudo-machine will generate the assignment = token when it matches =. The context of the grammar at that point may have required a comparison = token. The CHNG instruction rectifies this problem. Operator Token Name: (many) Value: $10 through $7F These instructions are terminal codes for the Atari BASIC Operators. The code values are the values of each operator token. The values, value names, and operator symbols are defined in the Operator Name Table (see Chapter 2). When the pseudo-machine sees these terminal symbol representations, it compares the symbol it represents to the current symbol in the source statement. If the symbols do not match, then fail status is generated. If the symbols match, then pass status is generated, the token (instruction value) is placed in the token output buffer, and the next statement source symbol becomes the current symbol for verification. Relative Non-Terminal Vectors Name: (none) Value: $80- $BF (Plus) $C0 - $FF (Minus) This instruction is similar to ANTV, except that it is a single byte. The upper bit is enough to signal that this one-byte code is a non-terminal GOSUB. The destination address of the GOSUB is given as a position relative to the current table location. The values $80 through $BF correspond to an address which is at the current table address plus $00 through $31'. The values $C0 through $FF correspond to an address which is at the current table address minus $01 through $31'. Pre-compiler Main Code Description The pre-compiler, which starts at SYNENT ($A1C3), uses the pseudo-instructions in the Syntax Tables to verify the correctness of the source line and to generate the tokenized statements. 42
Chapter Five Syntax Stack The pre-compiler uses a LIFO stack in its processing. Each time a non-terminal vector ("GOSUB") is executed, the pre— compiler must remember where the call was made from. It must also remember the current locations in the input buffer (source statement) and the output buffer (tokenized statement) in case the called routine reports fail and backup is required. This LIFO stack is called the Syntax Stack. The Syntax Stack starts at 5480 at the label SIX. The stack is 256 bytes in size. Each entry in the stack is four bytes long. The stack can hold 64 levels of non4erminal calls. If a sixty-fifth stack entry is attempted, the LINE TOO LONG error is reported. (This error should be called LINE TOO COMPLEX, but the line is most likely too long also.) The first byte of each stack entry is the current input index (CIX). The second byte is the current output index (COX). The final two bytes are the current address within the syntax tables. The current stack level is managed by the STKLVL ($A9) cell. STKLVL maintains a value from $00 to $FC, which is the displacement to the current top of the stack entry. Initialization The editor has saved an address in SRCADR ($96). This address is the address, minus 1, of the current statement's ABML instructions in the Syntax Tables. The current input index (CIX) and the current output index (COX) are also preset by the editor. The initialization code resets the syntax stack manager (STKLVL) to zero and loads the first stack entry with the values in CIX, COX, and CPC - the current program counter, which holds the address of the next pseudo-instruction in the Syntax Tables. PUSH Values are placed on the stack by the PUSH routine ($A228). PUSH is entered with the new current pseudo-program counter value on the CPU stack. PUSH saves the current CIX, COX, and CPC on the syntax stack and increments STKLVL. Next, it sets a new CPC value from the data on the CPU stack. Finally, PUSH goes to NEXT. 43
Chapter Five POP Values are removed from the stack with the POP routine ($A252). POP is entered with the 6502 carry flag indicating pass/fail. If the carry is clear, then pass is indicated. If the carry is set, then fail is indicated. POP first checks STKLVL. If the current value is zero, then the pre-compiler is done. In this case, POP returns to the editor via RTS. The carry bit status informs the editor of the pass/fail status. If STKLVL is not zero, POP decrements STKLVL. At this point, POP examines the carry bit status. If the carry is clear (pass), POP goes to NEXT. If the carry is set (fail), POP goes to FAIL. NEXT and the Processes It Calls After initialization is finished and after each Syntax Table instruction is processed, NEXT is entered to process the next syntax instruction. NEXT starts by calling NXSC to increment CPC and get the next syntax instruction into the A register. The instruction value is then tested to determine which syntax instruction code it is and where to go to process it. If the Syntax Instruction is OR ($02) or RTN ($03), then exit is via POP. When POP is called due to these two instructions, the carry bit is always clear, indicating pass. ERNTV. If the instruction is RNTV ("GOSUB" $80-$FF), then ERNTV ($A201) is entered. This code calculates the new CPC value, then exits via PUSH. GETADR. If the instruction is ANTV ($00) or the deus ex machina ESRT ($01) instruction, then GETADR is called. GETADR obtains the following two-byte address from the Syntax Table. If the instruction was ANTV, then GETADR exits via PUSH. If the instruction was ESRT, then GETADR calls the external routine indicated. The external routine will report pass/fail via the carry bit. The pass/fail condition is examined at $A1FO. If pass is indicated, then NEXT is entered. If fail is indicated, then FAIL is entered. TERMTST. If the instruction is VEXP ($0E), then the code at $A1F9 will go to TERMTST ($A2A9), which will cause the code 44
Chapter Five at $A2AF to be executed for VEXP. This code obtains the address, minus 1, of the ABML for the <expression> in the Syntax Table and exits via PUSH. ECHNG. If the instruction was CHNG ($0F), then ECHNG ($A2BA) is entered via tests at $A1F9 and $A2AB. ECHNG will increment CPC and obtain the change-to token which will then replace the last previously generated token in OUTBUFF. ECHNG exits via RTS, which will take control back to NEXT. SRCONT. The Operator Token Instructions ($10-$7F) are handled by the SRCONT routine. SRCONT is called via tests at $A1F9 and $A2AD. SRCONT will examine the current source symbol to see if it matches the symbol represented by the operator token. When SRCONT has made its determination, it will return to the code at $A1FC. This code will examine the pass/fail (carry cleariset) indicator returned by SRCONT and take the appropriate action. (The SRCONT routine is detailed on the next page.) FAIL If any routine returns a fail indicator, the FAIL code at $A26C will be entered. FAIL will sequentially examine the instructions, starting at the Syntax Table address pointed to by CPC, looking for an OR instruction. If an OR instruction is found, the code at $A27D will be entered. This code first determines if the current statement symbol is the rightmost source symbol to be examined thus far. If it is, it will update MAXCIX. The editor will use MAXCIX to set the inverse video flag if the statement is erroneous. Second, the code restores CIX and COX to their before-failure values and goes to NEXT to try this new OR choice. If, while searching for an OR instruction, FAIL finds a RTN instruction, it will call POP with the carry set. Since the carry is set, POP will re-enter FAIL once it has restored things to the previous calling level. All instruction codes other than OR and RTN are skipped over by FAIL. 45
Chapter Five Pre-compiler Subroutine Descriptions SRCONT ($A2E6) The SRCONT code will be entered when an operator token instruction is found in the Syntax Tables by the main pre— compiler code. The purpose of the routine is to determine if the current source symbol in the user's line matches the terminal symbol represented by the operator token. If the symbols match, the token is placed into the output buffer and pass is returned. If the symbols do not match, fail is returned. SRCONT uses the value of the operator token to access the terminal symbol name in the Operator Name Table. The characters in the source symbol are compared to the characters in the terminal symbol. If all the characters match, pass is indicated. TNVAR, TSVAR ($A32A) These deus ex machina routines are called by the ESRT instruction. The purpose of the routines is to determine if the current source symbol is a valid numeric (TNVAR) or string (TSVAR) variable. If the source symbol is not a valid variable, fail is returned. When pass is indicated, the routine will put a variable token into the output buffer. The variable token ($80-$FF) is an index into the Variable Name Table and the Variable Value Table, plus $80. The Variable Name Table is searched. If the variable is already in the table, the token value for the existing variable is used. If the variable is not in the table, it will be inserted into both tables and a new token value will be used. A source symbol is considered a valid variable if it starts with an alphabetic character and it is not a symbol in the Operator Name Table, which includes all the reserved words. The variable is considered to be a string if it ends with $; otherwise it is a numeric variable. If it is a string variable, $ is stored with the variable name characters. The routine also determines if the variable is an array by looking for (. If the variable is an array, (is stored with the variable name characters in the Variable Name Table. As a result, ABC, ABC$, and ABC(n) are all recognized as different variables. 46
Chapter Five TNCON ($A400) TNCON is called by the FSRT instruction. Its purpose is to examine the current source symbol for a numeric constant, using the floating point package. If the symbol is not a numeric constant, the routine returns fail. If the symbol is a numeric constant, the floating point package has converted it to a floating point number. The resulting six-byte constant is placed in the output buffer preceded by the $OE numeric constant token. The routine then exits with pass indicated. TSCON ($A428) TSCON is called by the ESRT instruction. Its purpose is to examine the current symbol for a string constant. If the symbol is not a string constant, the routine returns fail. If the first character of the symbol is '', the symbol is a string constant. The routine will place the string constant token ($0F) into the output buffer, followed by a string length byte, followed by the string characters. The string constant consists of all the characters that follow the starting double quote up to the ending double quote. If the EOL character ($9B) is found before the ending double quote, an ending double quote is assumed. The EOL is not part of the string. The starting and ending double quotes are not saved with the string. All 256 character codes except $9B (EOL) and $22 (") are allowed in the string. SEARCH ($A462) This is a general purpose table search routine used to find a source symbol character string in a table. The table to be searched is assumed to have entries which consist of a fixed length part (0 to 255 bytes) followed by a variable length ATASCII part. The last character of the ATASCII part is assumed to have the most significant bit ($80) on. The last table entry is assumed to have the first ATASCII character as $00. Upon entry, the X register contains the length of the fixed part of the table (0 to 255). The A, Y register pair points to the start of the table to be searched. The source string for comparison is pointed to by INBUFF plus the value in CIX. Upon exit, the 6502 carry flag is clear if a match was found, and set if no match was found. The X register points to the end 47
Chapter Five of the symbol, plus 1, in the buffer. The SRCADR ($95) two- byte cell points to the matched table entry. STENUM ($AF) contains the number, relative to zero, of the matched table entry. SETCODE (A2C8) The SETCODE routine is used to place a token in the next available position in the output (token) buffer. The value in COX determines the current displacement into the token buffer. Mter the token is placed in the buffer, COX is incremented by one. If COX exceeds 255, the LINE TOO LONG error message is generated. 48

<-Chapter 04Chapter 06->