Preface Fifteen years after the introduction of the Atari 8-bit computer, virtually the entire base of machines is still using the same CPU, and thus has the same basic processing power as when the machine was first shipped. Not that we need to be like other platforms -- which seem to get CPU upgrades before you've had enough time to pay off the credit card bill on the last system -- but we have missed out on a great amount of satisfaction that comes when you find out that you can do tasks faster, easier, and better than before. Or when you find that the machine is now capable of feats that it previously could not achieve. When I met the 65816 processor for the first time, I was immediately captivated by the range of enhancements that it provided. It was a feeling surpassed only by the still cherished memory of my first successful Assembly language program. Some of these enhancements are brilliant, and are implemented in very clever ways. For the designers to have achieved this much, and yet remain completely compatible with the 6502 at the same time, is truly a work of art. It is a processor that plain and simply _belongs_ inside the Atari computer; for the Atari is unquestionably in my mind the best 8-bit computer that has ever been created, or will ever be created. It is itself a work of art, that more than anything else, is responsible for the attachment I have to computers, and to programming in particular. There is no other machine more deserving, or more able to utilize the extra power of the 65816. As an Atari enthusiast since 1979, it gives me great pleasure to be a part of this new upgrade for the machine. John Harris Ahwahnee, California Programming notes for the 65816. The 65816 is a wonderful improvement over the 6502, containing 78 new opcodes, 9 new addressing modes, block memory moves, 16-bit registers, relocatable stack and zero-page, and a 16 megabyte linear address space. This is intended to be an introduction to the many new features and techniques provided by the 65816 processor. - Native Mode - The 816 powers-up in a state known as "6502 emulation mode". This is its default state, where it remains virtually 100% compatible with the 6502. Most of the 816's enhancements are still available in emulation mode, including 24-bit addressing, and the new opcodes and addressing modes. The primary missing ingredient is 16-bit registers. These are not available until the processor is switched into a different state, known as "native mode". This mode switching was necessary so that enhancements could be provided for the CPU, while still maintaining a way to provide complete 6502 compatibility. As such, there are a number of operational differences when the 816 is in native mode, and these will be noted in the upcoming sections. Switching between native and emulation modes is accomplished with a new instruction, called XCE, for eXchange Carry and Emulation bits. The emulation bit is like a shadow of the carry flag, and so to change modes, you set the carry for emulation mode, or clear the carry for native mode, and then do an XCE. The previous state of the emulation flag will be in the carry after the exchange has been made. ; Change to Native Mode CLC XCE ; Change to Emulation Mode SEC XCE Running the 816 in native mode requires changes to the interrupt processing system in the OS. A new OS is being worked on, but until this is available, the Atari will crash if you attempt to enter native mode without completely disabling interrupts. You will find that there is still a considerable amount of power available even in emulation mode, and you still have the option of entering native mode with all interrupts disabled. The wait for an 816-aware version of the OS won't be too much longer, and in the meantime, you can create a macro which enters native mode and disables interrupts at the same time. Later on, when the OS supports native mode with the interrupts still active, you can just remove the interrupt disable from the macro, and reassemble your code. Here's an example of the native and emulation mode macros supplied with the MAE assembler as part of the MACROS include file. !!!NAT .MD ;Macro definition for NATive STZ $D40E SEI CLC XCE .ME !!!EMU .MD ;Macro definition for EMUlation SEC XCE LDA #$C0 ;Change this to #$40 if you aren't using DLIs STA $D40E CLI .ME - B accumulator - The 816 has a second "hidden" accumulator known as the B accumulator. Though there are no operations that work directly with it, there is an instruction called XBA to exchange the A and B accumulators. Thus, B makes a great temporary holding spot. The B accumulator is actually the upper 8 bits of a 16-bit wide accumulator. When A and B are used together as a 16-bit register, it is referred to as the C accumulator. The B accumulator and XBA instruction are accessible in emulation mode. Only the 16-bit wide C accumulator requires the native mode before it can be accessed. Later on, we will learn how to configure the CPU for 16-bit register use, but it can also be useful to load 16 bit values into both accumulators while still in 8-bit register mode. These three instructions perform this operation. LDA #>address ;The high byte of the address is loaded first, XBA ; then swapped into the upper accumulator, LDA #
value XBA The A+B accumulator is also a great place for passing a 16-bit value to a subroutine. - Changing Register Sizes - Register size is controlled by two new bits in the native mode processor status register. The new "M" bit controls the size of the accumulator and all memory accesses, and the "X" bit controls the size of the X and Y index registers. The native mode status register looks like this: bit 7 6 5 4 3 2 1 0 N V M X D I Z C | | | Index register size | Accumulator and memory size These bits are set to 1 when the register size is 8 bits. To enable 16-bit registers, set the desired bit to 0. It is possible to have one bit set to 1, and the other bit set to 0. When transferring between registers of unequal size, a general rule is that the size of the destination register controls the width of the data moved. The following table will summarize this. A=8-bit,X=16-bit A=16-bit,X=8-bit ---------------- ---------------- TAX - Moves A and B into X Moves 8-bit A into X TXA - Moves lower 8 bits of X into A. Moves 8 bit X into A, B remains unchanged. and 0 into B. Two new instructions are provided for setting and resetting bits in the status register. Use the instruction SEP to set processor status bits. The operand should be an immediate quantity, containing binary 1's for each of the bits you wish to set. Consider this like an ORA instruction into the status register. Use the instruction REP to reset or clear processor status bits. The operand should be an immediate quantity, containing binary 1's for each of the bits you wish to clear. REP #$30 ;Set to all 16-bit registers REP #$20 ;Set 16-bit accumulator and memory REP #$10 ;Set 16-bit index registers SEP #$30 ;Set to all 8-bit registers SEP #$20 ;Set 8-bit accumulator and memory SEP #$10 ;Set 8-bit index registers Note that the SEP and REP instructions can set or reset any of the status register bits, so their use is not restricted to only changing register sizes. Other uses could include: SEP #$09 ;Setting or resetting multiple bits at a time. SEP #$40 ;Set the V flag. SEP #$80 ;Setting or clearing N or Z flags... REP #$02 ;...without changing register values. There are many things to watch out for when changing register sizes. Switching the accumulator from 16 to 8 bits is a reversible operation, since the upper 8 bits are preserved in the B register. However, switching the index register size from 16 to 8 bits will zero the upper 8 bits of both X and Y. Even if only one register is going to be used, the downsizing operation itself can clobber the other register, so the previous value must be saved and restored if the contents need to be preserved. When pushing and pulling registers from the stack, make sure the register size is in the same state for both operations, or else the incorrect number of bytes will be pulled from the stack. When assembling code with immediate operands, the assembler needs to know what the register sizes are, in order to know how many bytes to assemble into the object code. The CPU will want to fetch 16 bits if the register size is 16-bit, so the assembler better have put a 16-bit quantity there or else the code will not run properly. The programmer must use psuedo-ops to keep the assembler in sync with the state of the register sizes in the CPU. Four such psuedo-ops are provided in the MAE assembler. .AW ;Accumulator is word sized .AB ;Accumulator is byte sized .IW ;Index registers are word sized .IB ;Index registers are byte sized So after changing the register sizes with SEP or REP, always follow with the appropriate psuedo-ops to match the new register sizes. It is best to code both the CPU instruction and the psuedo-ops into macros, to reduce the chances of any mismatches occurring. The example MACROS include file provided with the MAE assembler has macros for this purpose. Here is what they look like: !!!R16 .MD ;Macro define to set 16-bit registers REP #$30 .AW .IW .ME !!!R8 .MD ;Set 8-bit registers SEP #$30 .AB .IB .ME !!!A16 .MD ;Set 16-bit accumulator REP #$20 .AW .ME !!!A8 .MD ;Set 8-bit accumulator SEP #$20 .AB .ME There are also I16 and I8 macros for setting index register size by itself. If your code goes into native mode to use 16-bit registers, and plans to return to emulation mode when the 16-bit need is finished, then you do not need to explicitly return to 8-bit registers with SEP. Returning to emulation mode will automatically return the registers to 8-bit size. Remember that you still need to use the .AB and .IB pseudo-ops. Be extremely careful when creating code that branches or JSRs out of a routine that changes register sizes. Consider the following example: R16 ; Macro to set 16 bit registers do some code CMP #VALUE BEQ YES do the "no" code RETURN R8 ; Macro to return to 8 bit registers RTS ; YES do the "yes" code BRA RETURN The problem with this code, is that the assembler will not know that the CPU is in 16-bit register mode at the label YES. So if the code contained any immediate references, they would not be assembled properly. This simple example could easily be rewritten by placing the YES portion of the routine above the label RETURN, and having the "no" section branch around it. This is a preferable arrangement if it is feasible, but in some cases this will not apply as easily. If you need to create a structure similar to the above, you will need to add additional sizing psuedo-ops to keep the assembler in sync with the code. Add the following lines before and after the YES segment: ; .AW ;Since the yes code executes in 16-bit mode, .IW ; set the assembler register sizes to match. YES do the "yes" code BRA RETURN .AB ;Set the assembler back to 8-bit mode. .IB - Bank Registers - The 816 still has a 16-bit program counter, and while it does offer a few 24-bit data addressing modes, most modes are still 16-bit. Thus, the CPU creates 24-bit addresses by using "bank registers" added to a 16-bit address. Much like a "page" of memory is an even 256 bytes, a "bank" of memory is an even 64K. The upper 8 bits of a 24-bit address is called the bank byte. The 816 provides two 8-bit bank registers, one for the PC and one for data addressing. By storing different values into these bank registers, the CPU can address the full 16 megabytes provided by the 24-bit address space. The bank register for the PC is called the PBR, or Program Bank Register. The bank register for data addressing is called the DBR, or Data Bank Register. Things get slightly confusing when looking at the mnemonics to access these bank registers. Mnemonics referring to the DBR will use the letter B; and the letter K is used to refer to the PBR. Access to the DBR is provided only through the stack. The instruction "PHB" pushes the DBR onto the stack, while "PLB" pulls a byte from the stack and puts it into the DBR. Thus, to set the DBR to another value, push that value onto the stack, and then do a PLB. The following code will set the DBR to $40. LDA #$40 PHA PLB There is a "PHK" instruction to push the PBR onto the stack, but no "PLK" instruction. Such an instruction would make little sense, as it would cause the program counter to immediately jump to a new location in the 24-bit address space. To set the PBR, there is a new 24-bit jump instruction, "JML", or jump-long. The operand is a 3 byte, full 24-bit address. It is sometimes useful to set the DBR to the same bank as the PBR, so that data access is aligned to the same bank where the code is running. You can do this with these simple instructions: PHK ;Push the PBR, PLB ; and pull it into the DBR - Direct Page - The concept of zero-page from the 6502 has been extended for the 816 processor. By adding a 16-bit register that can point to any location in the first 64K of memory, you can now relocate zero-page to any 256 byte block of memory within the first 64K. This register is called the Direct Page Register, and the zero-page addressing modes are now referred to as Direct Page modes. The direct page register does not need to point to an even page of memory, but odd memory alignments will slow down all direct page addressing by one cycle. There is one important difference about the way the 816 handles direct page indexed references that overflow the end of the page boundary. Such as referencing $F0,X when X=$20. On the 6502, any references above $FF would wrap around to $00 again, keeping all addresses within the page. On the 816 in native mode, indexes will cross the page boundary and reference locations outside of the direct page. In 6502 emulation mode, in order to maintain compatibility, index references will wrap around to $00, but, (and this is where it gets a bit strange), only for instructions that existed on the 6502, and only when the direct page register is set to 0. All operations that are new to the 816 will always generate addresses outside of the direct page in cases where the index calculation overflows the page boundary. I think a number of examples would best demonstrate the conditions here. ; X=$20, E=emu, dp=0 LDA $F0,X ;Gets byte at $10 STZ $F0,X ;Clears byte at $110 (non-6502 instruction) ; X=$20, E=emu, dp=$4000 LDA $F0,X ;Gets byte at $4110 (dp is not 0) ; X=$20, E=nat, dp=0 LDA $F0,X ;Gets byte at $110 (native mode) ; X=$FFFF, E=nat, dp=0, 16-bit register mode LDA $F0,X ;Gets byte at $100EF (yes, past the first 64K) The last example demonstrates the interesting effect that with 16-bit index registers, direct page indexed addresses can be up to 64K bytes away from the location of the direct page, all with no cycle penalties! Setting the direct page register is similar to setting the DBR, except that the direct page register is 16 bits. There are PHD and PLD instructions, but they will be two byte accesses. Thus to set the direct page register, you will need to push a two byte address onto the stack. There is a new opcode which is handy for this use. It's called PEA, for Push Effective Absolute address. The operand is an absolute address, and it is the address itself that gets pushed onto the stack, rather than the contents at that address. It's as if the operand is treated like an immediate constant, but the assembler syntax does not use the "#" immediate identifier. For example, if you want to set the direct page register to $4000, you would use: PEA $4000 PLD The direct page register can also be transferred directly to and from the accumulator. The instructions are TCD and TDC. The mnemonic choice of C was used to indicate that this is always a 16-bit transfer of the A+B accumulators, regardless of whether the accumulator is set to 8-bit or 16-bit mode. There are all sorts of unique possibilities for this relocatable direct page. If you need to do some very fast graphic routines, you can point the direct page register into screen memory! Or even point the direct page register at $D000, which would let you access the hardware graphic registers at direct page speed! Be creative, but keep in mind that if interrupts are occurring when the direct page is pointing somewhere else, direct page accesses within the interrupt would go to the wrong place. The interrupt routine would need to push the direct page, reset it back to 0, and then restore the direct page on exit. A new version of the OS, when it becomes available, may take care of this automatically. Until then, the programmer will be responsible for not allowing interrupts to occur when the direct page register is non-zero, or else providing the enhancements necessary to the interrupt service routines to handle the situation properly. More information will be given in the section on interrupts, near the end of this document. - Stack Register - The 816 now provides a 16-bit stack pointer, which can point to any location in the first 64K of memory. It also allows the stack to be longer than 256 bytes, and can be up to as much memory as you want to set aside for it. The 16-bit stack pointer is only available in native mode. When the CPU is in 6502 emulation mode, the upper byte of the stack pointer will be set to 1, forcing all stack accesses to page 1 just like the 6502. In addition to the TSX and TXS instructions, the 816 also offers TSC and TCS instructions for transferring the accumulator to and from the stack pointer. The mnemonic uses the letter C to indicate that the full 16-bit value in the A+B accumulator will be transferred, regardless of the setting of the M bit in the CPU status. Of course, since the high byte of the stack pointer is forced to 1 when the CPU is in emulation mode, only the lower 8 bits of the accumulator are transferred in emulation mode. Be careful of using the TXS instruction in native mode. The stack pointer will be 16 bits, but if the index registers are set to 8 bits, the upper byte of the X register will be zero. A TXS instruction from native mode with the index registers set to 8 bits, will wind up positioning the stack into page zero! This is a really sad occurrence, because it often winds up being the only problem when trying to run existing 6502 code in native mode. To set the stack pointer from native mode and 8 bit registers, this is the easiest method: TSC ;This will place the current 16 bit stack pointer in A & B LDA xxx ;Replace the lower 8 bits with the new desired value. TCS ;The upper 8 bits was kept in B, and will be unchanged - New Addressing Modes - The 816 provides many new addressing modes, and in some cases, expands the allowable modes of older instructions. For example, the BIT instruction now accepts immediate operands, plus indexed,X modes for absolute and direct page. INC and DEC can operate on the accumulator, in the same manner as the shift and rotate instructions. Just write "INC A", or simply "INC". New addressing modes include: JMP (address,X) ;Absolute Indexed Indirect Available only on JMP and JSR instructions, this addressing mode allows easy access into 2 byte jump tables. It's too bad the 6502 didn't have this mode. We wouldn't have been subjected to the -1 addresses in the $E400 jump tables if it had. In fact, a neat enhancement to an 816 version of the Atari OS would be to add a NOP instruction in front of every one of the $E400 routines. In this way, the current RTS method would still work, with its +1 adjustment skipping over the NOP and winding up in the same place the routine starts now. The difference, is that now a JSR (indirect,X) would enter the routine at the NOP, providing an easier method for programmer access. LDA (dp) ;Direct Page Indirect This is similar to the indirect indexed form of addressing, but without the ",Y" index. This is very handy when you only need one byte at the indirect address, and it does not require the use of a register set to 0. This mode is available for all instructions that currently allow indirect indexed. The following modes generate 24-bit addresses, and are only useful when memory is expanded beyond the initial 64K. Unless otherwise noted, all modes are available with the types of instructions that previously allowed the full range of addressing modes. Such as LDA, STA, and math operations like ADC and ORA. Operations that modify the operand, such as DEC, do not support these new addressing modes, nor do operations with the X or Y registers. LDA >long ;Absolute Long Simply a 24-bit version of absolute addressing, taking one extra address byte for the operand, and thus, one extra cycle as well. The ">" symbol is required by the MAE assembler to denote the long addressing mode. LDA >long,X ;Absolute Long Indexed,X A 24-bit version of absolute indexed,X, again taking one extra address byte for the operand, and one extra cycle. JMP [address] ;Absolute Indirect Long The operand is a 16-bit address, but this address should contain a 24-bit pointer to the destination jump location. This mode is only valid for the JMP instruction. Square brackets "[]" are used to denote 24-bit indirect pointers. LDA [dp] ;Direct Page Indirect Long Much like direct page indirect, and the operand will still be a one byte direct page location. However, the address at that location should be a 24-bit address. LDA [dp],Y ;DP Indirect Long Indexed,Y Adds a ",Y" index to direct page indirect long addressing. The last two addressing modes deserve a topic by themselves. - Stack Relative Addressing - The most significant improvement to the addressing modes in the 816, is the addition of stack relative addressing. This gives the programmer easy and direct access to any byte or address within the last $FF bytes pushed onto the stack. The operand of stack relative addressing is a one byte index into the stack. Since the stack pointer always points to the next available location, a relative pointer of $1 should be used to access the last byte pushed onto the stack. Higher numbers will access bytes successively deeper in the stack. Note that a relative pointer of $0 will reference the last byte that was pulled off the stack, but usage of this byte will be dangerous unless the conditions are such that no interrupts could have occurred. Interrupt processing will push additional bytes onto the stack, overwriting the byte that was "left over" from the last stack use. The direct form of stack relative addressing is written in Assembly as: LDA $1,S There is also an indirect indexed form of stack relative addressing. This operates much like the "(indirect),Y" type of addressing, except that instead of the address being stored in zero page, the address is pushed onto the stack. This is best explained by an example showing the same operation in an already familiar method, along side the new stack relative mode. Indirect Indexed Stack Relative Indirect Indexed ---------------- ------------------------------- LDA #$40 PEA $4000 STA $81 LDY #0 LDY #0 LDA ($1,S),Y ;"$1,S" defines the location STY $80 ; of the address bytes LDA ($80),Y Note that in the stack relative example, the programmer is responsible for pulling the address off of the stack when it is no longer needed. This example used an immediate address, for which the 816 can easily push all 16 bits using the PEA instruction. In cases where the address is calculated, or taken from some other memory variable, push the high byte of the address first, and then the low byte. The stack relative form of indirect addressing is 1-2 cycles slower than using the direct page, so it is not ideal for speed critical applications. Still, there are many excellent uses for stack relative addressing. In its simplest form, it can be a handy method for allocating temporary variables or pointers, without requiring the use of other memory locations. This can be especially helpful if you need indirect addressing capabilities, and do not want to overwrite any direct page locations. You can push the desired address onto the stack, and then use the stack relative form of indirect indexed addressing. At some point, you may need to use a direct page variable, modifying its contents, and then be responsible for restoring it to its original value. Traditionally, you would push the contents onto the stack, or transfer it to another memory location. Stack relative gives you another option. Once you push the original contents onto the stack, you may use that stacked address as your temporary pointer. In fact, there is a single instruction that can push the original pointer, called Push Effective Indirect address, or PEI. PEI is written in Assembly as: PEI ($80) This instruction will take the 16-bit address at the location of the direct page operand, and push it onto the stack. Much like PEA, it pushes the effective indirect address rather than the contents stored at the address. For example, lets say that the direct page location $80 contains the bytes $00,$40, (basically, the address $4000), and location $4000 contains $FF. The instruction "LDA ($80)" would return the contents $FF, while the instruction "PEA ($80)" would push $4000, being the effective address that the operand points to. The stack can also be a useful method for passing parameters to subroutines. You can push as many arguments as you need onto the stack before the subroutine call, and access them from within the subroutine by using stack relative addressing. Since there will be a two byte return address on the very top of the stack, the first parameter will be at an index of $3 when using stack relative. If your subroutine needs to preserve registers on the stack first, the index will increase accordingly. Some of the more exotic uses are in the form of reentrant or recursive programming. A routine is said to be reentrant if it can be called from another source, like an interrupt, while it is in the middle of execution. A routine is recursive if it can call itself. For a routine to be able to have these properties, it must preserve all registers and use no external memory references. One of the example code routines at the end of this document shows an example of recursive programming. - Relocatable Code - Relocatable code has the ability to be moved anywhere in memory, and still execute correctly. Improvements in addressing modes and opcodes make it much easier for the 816 programmer to write fully relocatable code. In combination with the stack relative addressing modes, the final key ingredient is an instruction called PER, Push Effective address Relative. Since relocatable code cannot rely on absolute addressing, PER provides a method for referencing other parts of a program or its data by using a 16-bit signed offset relative to the current PC. The PC always contains the absolute address of where a program is running. So by knowing how far it is from the currently executing instruction to another part of the program, PER can generate an absolute address by adding or subtracting this distance from the PC. In the assembler, you only need to provide the label name for where you want to reference, as the operand of the PER instruction. The assembler will calculate the proper offset and store it in the object code. To write a relocatable program, the programmer will need to replace situations that use absolute addresses, with relative address references. Instead of JMP instructions, use BRA for short jumps within 128 bytes, or BRL for long jumps of up to 32K. There is no BSR instruction for relative subroutine calls, but such an instruction can easily be created in a macro by using two other instructions. !!!BSR .MD (ZLOC) ;BSR macro definition, with 1 parameter for the adr PER *+5 ;Push return adr, adj -1 for RTS; Points after the BRL BRL ZLOC ;Branch to the subroutine. RTS returns to next instr. .ME In the above macro, we have used the PER instruction to push the return address for the subroutine as an offset of +6 from the current PC. The RTS is going to increment the pushed address before continuing execution, so we actually have to pass a relative address of +5 to get the RTS to return where we want. Then, it is a simple matter to jump to the subroutine with a relative branch. Since some subroutine calls could be within 128 bytes, it would be more efficient to allow a short BRA to be used in these situations. This is not practical for the assembler to do automatically, since it will not know how far away forward references will be. Thus it can not determine how many bytes to reserve for the branch, and changing this during the second pass would cause an error in the location of all subsequent addresses. Therefore, we recommend creating two macros and allowing the programmer to specify the size of the branch with the macro name. The MACROS include file supplied with the MAE assembler uses the name BSL for the long branch to subroutine as shown in the example above. Another macro called BSR uses the BRA instruction, and would be used only for subroutine calls within 128 bytes. That covers the absolute JMP and JSR instructions. Using relocatable structures for data storage is more complex, because there are more varieties in the ways programs need to access data. We can explore a few generalizations, after which, you should be capable of creating approaches that fit your individual needs. The basic building block is the PER instruction. Anytime you need to access an address that is within part of the program's space, you need to find the runtime location for that address using PER. PER pushes the address onto the stack, and what you do from there depends upon how often your program needs to use that address. For references that are confined to fairly local use, it is probably best to leave the address on the stack, and use it for stack relative indirect addressing. Let's look at a fairly simple table look-up routine to see how this works. We'll examine a key code look-up with a JMP table, first in standard absolute addressing, and then in a relocatable form using PER and stack relative. Absolute addressing - 6502 method --------------------------------- LDA $2FC ;Get key code LDY #MAXCMD ?FND CMP KEYCMDS,Y BEQ ?HV DEY BPL ?FND SEC ;Return with "Not Found" RTS ?HV TYA ;When key code is found, ASL ; multiply by 2 for word indexing TAY LDA CMDADR+1,Y PHA LDA CMDADR,Y PHA RTS ; and jump to routine ; KEYCMDS .HE BF 95 92 MAXCMD = *-KEYCMDS-1 CMDADR .WO C.A-1 .WO C.B-1 .WO C.C-1 With the 816, the above example could be shortened a bit by using the JMP (indirect,X) instruction. The ending code would look like this, plus the CMDADR table would not use -1 after the addresses with this method. ?HV TYA ASL TAX JMP (CMDADR,X) Now, here's a relocatable version --------------------------------- LDA $2FC PER KEYCMDS ;Push address of KEYCMDS LDY #MAXCMD ?FND CMP ($1,S),Y BEQ ?HV DEY BPL ?FND PLA ;Pull KEYCMDS address off stack PLA SEC ; and return "Not Found" RTS ?HV R16 ;Set 16-bit registers PLA ;Pull KEYCMDS adr off stack PER CMDADR ; and push CMDADR TYA ASL TAY LDA ($1,S),Y ;Addresses in the table are offsets from CMDADR ADC $1,S ;Add it to CMDADR, which is still on the stack, and STA $1,S ; it points to the absolute address of the routine R8 ;8-bit register mode RTS ;And go ; KEYCMDS .HE BF 95 92 MAXCMD = *-KEYCMDS-1 CMDADR .WO C.A-CMDADR-1 ;Addresses are stored as offsets from CMDADR .WO C.B-CMDADR-1 ; .WO C.C-CMDADR-1 ;Also includes -1 for RTS adjust This example used 16-bit registers, so keep in mind the current restrictions regarding native mode and interrupts. Until the new OS is in place, you may want to stay in emulation mode, and just use a double byte add with 8-bit registers. Not all program data is going to fit this type of example either. If the data is accessed at many locations throughout the program, the above process can get inefficient very quickly. In some cases, you may be able to leave the address on the stack, and continue to use it with stack relative addressing. This can quickly become impractical though when needing to access the address from different subroutine levels or with other items pushed onto the stack. A better solution is to do the PER at the start of the program, and then pull off the address and store it permanently in a direct page variable. Then use direct page indirect addressing to access the data. For single bytes and other small amounts of temporary storage, put the most often used variables directly into the direct page, or into free absolute memory areas like $400-$6FF. - Block Memory Moves - The 816 provides two instructions for moving blocks of memory, one incremental and one decremental. This is to handle the potentially destructive operation of moving overlapping blocks of memory. The incremental instruction is MVN, or Move Next, and the decremental instruction is MVP, or Move Previous. The block moves are about twice as fast, or more, as normal move operations using 6502 methods. Block moves are set up by placing the source address in the X register, the destination address in the Y register, and the number of bytes -1 to be moved in the C accumulator. The C register decrements to $FFFF in the move operation, thus the reason to subtract 1 from the desired byte count. The registers should be set to 16-bit mode of course, otherwise block moves would take place in page 0 only. This also means that block moves have very little use in 6502 emulation mode. Two operands follow the block move mnemonic in the assembler syntax, which are the bank bytes for the source and destination addresses respectively. Thus block moves can move data from one bank to another if desired. Here is an example that will move $1000 bytes from $4000-$4FFF to $8000-$8FFF. LDX #$4000 LDY #$8000 LDA #$FFF MVN 0 0 The MVN process will move one byte at a time, using 7 cycles per byte, incrementing both X and Y registers while decrementing the count in C. When the C register gets to $FFFF, the move will be completed, and X and Y will contain the address one past the range that was just moved. In this example, X would contain $5000, and Y would contain $9000. When the destination address is higher than the source address, and the blocks overlap, the programmer should use the MVP instruction (Move Previous). MVP moves the bytes using decrementing registers, so X and Y need to be initialized to point at the highest byte in the source and destination blocks respectively. These values can be found very easily with 16-bit ADC instructions. Assuming the C, X, and Y registers are initialized to the start of the memory blocks, as for the MVN process, the following lines of code will set X and Y to the right locations to start a MVP. PHA ;Save byte count TXA CLC ADC $1,S ;Add the stacked byte count to X TAX TYA ADC $1,S ; ...and Y TAY PLA ;Restore byte count MVP 0 0 ;All ready for the MVP - Other New Instructions - There are a number of other new instructions that haven't been introduced yet, or haven't been discussed formally. These include: PHX ;Push X PHY ; or Y PLX ;and Pull PLY TXY ;Transfer X register to Y TYX ; or Y to X JSL >$123456 ;Jump Subroutine Long, also pushes 24-bit address RTL ;Return subroutine Long, pulls 24-bit address. Be careful to always use RTL when using JSL, and use RTS when using JSR. Mixing these up will cause an improper number of bytes to be pulled from the stack, and will most likely crash the machine. STZ operand ;Store Zero to memory. No register required. A handy instruction for clearing memory. Usable in addressing modes of: direct page direct page indexed,X absolute absolute indexed,X TRB operand ;Test and Reset Bits Similarly to the BIT instruction, TRB will logically AND the value in the accumulator with the operand, with the Z flag indicating the results. TRB goes one step further, by resetting the corresponding bits in the operand. Technically, it logically ANDs the complement of the accumulator with the operand, and stores the result. In other words, TRB resets bits in the operand, and the Z flag lets you know if they were clear before the operation. TSB operand ;Test and Set Bits The counterpart to TRB, the TSB instruction sets bits in the operand, and the Z flag will indicate if the bits were clear before the operation. TRB and TSB support only absolute and direct page addressing modes. They are useful for setting or clearing bits in the $D301 register, and other bit-mapped locations that can be both read and written. Because this is a read-modify-write operation, it must be able to read the current value at the address, and thus can not be used on any hardware locations that do not support this. Locations like $D40E, $D208, $D00C, etc., can not be used with TRB and TSB. - Interrupts - This last section is not for the faint at heart. Early on it was mentioned that without changes to the interrupt processing in the Atari OS, the 816 will crash if you try to enter native mode with interrupts still active. We are about to learn why that happens, and why 816 interrupt processing routines on the Atari are a considerable challenge. Keep in mind that once a new 816-aware version of the OS is available, most of the problems we are about to look at will have already been addressed. The information here is provided for those of you that may need to write your own interrupt handlers, or are simply curious about what goes on behind-the-scenes. The 6502 standard interrupt vectors at $FFFA and above are used whenever the 816 is in emulation mode. When in native mode however, the 816 utilizes a second set of CPU interrupt vectors. Since the Atari OS does not contain valid addresses at these vector locations, this is the primary reason the system will crash on any native mode interrupts. The following table shows the locations of the native mode interrupt vectors. $FFEE - IRQ (note there is no native RESET vector, since RESET forces emulation mode) $FFEA - NMI $FFE8 - ABORT (not used in the Atari) $FFE6 - BRK instruction $FFE4 - COP instruction Where the 6502 vectored software BRK instructions through the IRQ vector, the 816 in native mode provides a separate BRK vector. This is necessary because the B status register bit is no longer available to distinguish an interrupt caused by the BRK instruction. The other new vector is designed to allow a co-processor to be used with the 816. Though that is an unlikely scenario with the Atari, the COP vector operates the same as the BRK vector, but in response to the COP processor instruction. Thus, a programmer could use the COP instruction and COP vector for applications similar to where BRK is used. Interrupts in native mode on the 816 are a significant programming concern. Since the processor can be in many states, a great deal of care must be used to ensure that the CPU is in the proper state for the interrupt handling routine, and restored to its original state when the interrupt exits. The 6502 had only the decimal mode to worry about. When an interrupt occurred, the programmer could not be sure whether the CPU was in decimal mode or not, and so the decimal flag needed to be cleared if the interrupt routine needed to do any math. Interestingly, the 816 always clears the decimal mode when responding to an interrupt, so 816 programmers don't need to be concerned with it. In its place however, the 816 programmer must be concerned with the size of the accumulator and index registers, and the contents of the DBR and direct page bank registers. This is why retrofitting the 816 into the Atari creates some serious problems.It is more than just providing the native mode vectors, because the programmer will not need to use the native mode unless he is also planning to use 16-bit registers. In order to utilize 816 enhancements while leaving the interrupts active, the interrupt service routines need to be aware of the many available states of the 816. They need to save the current state, reset everything to the default state that the interrupt routine expects, and then restore the state back to where it started when the interrupt occurred. The first two of these three steps are comparatively simple, because all of the interrupts start at the main interrupt vectors, and a new OS could provide the required code at the start of interrupt processing. The following code segment is an example of what is required to save the state of the DBR, direct page register, and register sizes, and also reset them to what would be expected as default conditions. PHD ;Save DBR and Direct PHB PEA $0000 ;Provides both operands for the following pulls PLD ;Set DBR and Direct to 0 PLB PHX ;X&Y always need to be saved, because resetting them PHY ; to 8 bits would destroy the upper half contents. PHP ;Save M & X status bits, so we can restore ; ; the registers at their proper size. SEP #$30 ;Set 8 bit regs Now you're ready to begin the interrupt routine with the CPU in its default state. The problem, is that the interrupt service routine needs to restore all of these things when it exits, and current Atari interrupt routines do not expect all of this stuff to be pushed on the stack. While the built-in OS routines could be fixed to provide this, there is no way to control what happens when the interrupt vectors get patched by external programs. All of these external programs, and even PBI devices, are going to exit with every- thing still on the stack, which would bring the system to a crashing halt. There are no perfect solutions to this this problem. If execution speed were of no concern, a completely safe solution is to push an extra 'dummy' interrupt return address onto the stack, so that the RTI from the interrupt service routines would return to the OS, where it could pull everything else from the stack and restore the state of the machine. Then a second RTI would resume main program execution. We already had to push the status register as part of the above operation, so we would only need to push the dummy return address right before it to set the stack up properly for an RTI return. Since we are in native mode, we must also push the PBR bank byte to create a 24-bit return address. The PER instruction works ideal for pushing the remaining 16-bit portion of the address, like we used in the BSR example in a previous section. The RTI does not add $1 to the return address, so in this case we may push the exact address with PER. Here is the completed interrupt pre-handler, along with the code to restore the processor state and exit. PHD ;Save DBR and Direct PHB PEA $0000 ;Provides both operands for the following pulls PLD ;Set DBR and Direct to 0 PLB PHX ;X&Y always need to be saved, because resetting them PHY ; to 8 bits would destroy the upper half contents. PHK ;Push bank byte, PER IRETURN ; and return address, and status for the RTI PHP ;Save M & X status bits, so we can restore ; ; the registers at their proper size. SEP #$30 ;Set 8 bit regs ; now jump to the appropriate interrupt vector, such as... JMP ($216) ;Its RTI will return to IRETURN IRETURN PLY ;Status byte was pulled as part of the RTI, so PLX ; now restore X&Y PLB ;Restore DBR and Direct PLD RTI ;Return to main program This adds up to 66 or 70 cycles, (depending on index register size). A delay like this could be detrimental to PBI and serial IRQs, and immediate VBLANK, and would be completely unacceptable for DLIs. The solution in the other extreme, would be to require interrupts to be completely disabled whenever the programmer needed to change the state of the CPU. In this way, the interrupt routines would not need any of the above maintenance code, and would be just as fast as existing routines. This method however would be too restrictive to the programmer. Everything else is a compromise some- where, but by taking a few points into consideration, a workable solution can be reached. Coding improvements to the interrupt routines by using 816-specific features can regain part of the 66-70 cycle overhead. Another point to keep in mind, is that the emulation mode interrupt vectors could be kept as-is, and thus can be used if the response time of the native mode vectors is too slow. All this means, is that the program can switch to emulation mode when doing disk I/O, or other interrupt speed sensitive tasks. Until the 816-OS is finished and receives considerable testing, we won't know if and where speed problems will occur. DLI routines need access as soon as possible. Thus, the NMI service routine should always plan to get to the DLI without doing any of the safety measures. In practice, this won't be that detrimental to the DLI programmer. DLI's rarely need direct page access, often don't require the index registers, and could get around changes in the DBR by coding all accesses to the hardware with 24-bit absolute addressing. In any case, exceptions to these conditions can be handled by the programmer in many ways, and in a much more efficient manner than the "save everything" method shown above. Since there simply is not enough time to save everything, there is little choice but to handle DLIs as a special case. The big drawback, is that it forces the programmer to be aware of the potential dangers of DLI programming in native mode, since the CPU may not be in a known state. The main NMI vector will still need to set the accumulator to 8 bits, which is needed for checking the NMIST location to determine the cause of the interrupt. So the native mode NMI handler will likely look like this: SEP #$20 ;Set 8 bit accumulator/memory BIT $D40F BPL NODLI JMP ($200) NODLI would then do the safety routine as above, and JMP ($222) This is going to be 4 cycles slower getting to the DLI than the normal Atari handler, (3 for the SEP and 1 extra for the bank byte push on the NMI itself), but this is the best that can be done. Finally, the OS should provide a native mode IRQ master vector, similar to the vector at $216, which would be in front of all of the safety code. The programmer can use this vector to customize the interrupt needs, if necessary. Providing a similar master vector for the NMI would add another 5 cycle delay to the DLI, and thus be undesirable. A compromise, would be to provide a native mode VBI vector that would be called at the NODLI location in the above NMI handler. This would give programming access to the native VBI before any of the safety measures. Such a vector is currently being considered. There are a few more considerations for interrupts on the 816. When the 816 responds to an interrupt in native mode, it will push an additional byte onto the stack between the status byte and the return address. This byte is the PBR; the current bank register for the program counter. Also, one extra cycle of delay will occur due to this byte being pushed. The PBR is not pushed when the CPU is in emulation mode, which can set up a potential interrupt problem if the programmer tries to run code outside of bank 0 while in emulation mode. Because the interrupt will set the PBR to 0, there will be no way for the processor to know where to return to when the interrupt exits. The best way to avoid this scenario is to never execute code outside of bank 0 unless the CPU is in native mode, or unless interrupts are fully disabled. 816 interrupts in both native and emulation modes do not exhibit the 6502's interrupt bugs. On the 6502, if an interrupt occurs while processing a BRK instruction, the BRK gets ignored. The 6502 can also skip NMIs and IRQs under certain conditions. Interrupts on the 816 appear to be completely stable. - Sample Code: Testing for the presence of the 65816. If you want to write a program that will run on both the 6502 and 65816 CPUs, yet take advantage of extra 816 abilities if it is present, then you need a way to determine which processor the code is running on. The following short code segment will do this, based on the code's behavior in decimal mode. When doing BCD math on the 6502, the N and Z flags are not set correctly -- basically because the 6502 is setting the flags before the final BCD adjustment is made internally. The 816 sets the flags properly, so we only have to a simple BCD add where the flags come out differently. Here's the code: SED LDA #$99 CLC ADC #1 CLD BEQ HV816 ; 6502 here ;flags show N set, and Z clear, even though A=0 HV816 ;816 here - Sample Code: Printing a line of text to the screen. Many of you may already use a routine that you can call like this: JSR PRINT .BY "This text is printed" 0 ...code continues here On the 6502, this routine would normally pull the return address from the stack, and place it in a zero-page variable. After incrementing it by one, this would now point to the start of the text string. The bytes of text can then be output one by one, until reaching the $0 end-of-string marker. Then the address is pushed back onto the stack, where an RTS returns to the program code following the text string. The stack relative mode of the 816 makes this routine more elegant, as you will see in the following comparison. 6502 version 65816 version 65816 16-bit version ------------ ------------- -------------------- PRINT PLA PRINT LDA $1,S PRINT REP #$30 STA ZP INC LDA $1,S PLA STA $1,S INC STA ZP+1 BNE ?1 STA $1,S PBYTE INC ZP LDA $2,S SEP #$30 BNE ?1 INC LDY #0 INC ZP+1 STA $2,S LDA ($1,S),Y ?1 LDY #0 ?1 LDY #0 BEQ PEND LDA (ZP),Y LDA ($1,S),Y JSR PUTE BEQ PEND BEQ PEND BRA PRINT JSR PUTE JSR PUTE PEND RTS JMP BPYTE BRA PRINT PEND LDA ZP+1 PEND RTS PHA LDA ZP PHA RTS The 65816 versions do not require a direct page variable, and do not need to pull and push the address from the stack. The stack relative addressing allows the code to operate directly on the address automatically pushed by the JSR. Also, when this stacked address is incremented through the text string, it is automatically in the right place for the RTS. Sadly, INC does not work with a stack relative operand, forcing us to LDA the value to the accumulator, increment it, and STA it back. The 16-bit version makes this process a little smaller, but again require having resident OS interrupt routines that allow size changes to the register. In all routines, "JSR PUTE" can be considered to mean any suitable routine to output a byte to the screen. - Sample Code: Block Move Patch The 816 block move instructions can move memory at least twice as fast as 6502 implementations. It will usually take less code as well, which sets up an opportunity to patch existing 6502 programs with a block move instruction. All we need to do is find the existing routine that does memory moves, and replace the instructions with an 816 equivalent block move. For a demonstration of this, I chose the word processing program TextPro, version 4.55. It is best if the application needs to do a lot of memory management, and this should be a good choice. I'll load the file into memory using the debugger. Since I want the entire file contents sequential, and at the address I choose, I'll use the "-" option for the load which does a straight CIO transfer. L -3000 TP.COM The first step is to find the existing memory move code, and while you could browse through the disassembly, a little detective work with the debugger is a better choice. A memory move should have a searchable sequence of an LDA instruction, followed immediately by an STA. The most common type will be using (indirect),Y addressing, and a quick check of the opcode table shows the hex values to be $B1 for the LDA, and $91 for the STA. Since we don't know what the operand will be, we will use a wildcard for that. H 3000 7800 B1 ? 91 This found one occurrence, but examination of the code showed it to not be what we were looking for. To search for other choices, you need to think of the other ways memory can be moved. Self-modifying code is sometimes used, where you would see an LDA absolute,X or absolute,Y followed by an STA. Since the operands in this case are two bytes, we would use two wildcards. It turns out that the routines we want use the absolute,Y addressing, and the search looks like this: H 3000 7800 B9 ? ? 99 This actually finds quite a few matches, being a common programming structure for other uses. We luck out, and the routine we are looking for happens to be the very first match, at $36C4. Even if it weren't though, it would have only taken a few minutes to examine the code at each location to find the right one. Anyway, here's what the code looks like, and I have adjusted the addresses in the left column to be the actual runtime locations, to match up with the self-modifying operands. $394B LDA $80 STA $396B LDA $81 STA $396C LDA $82 STA $396E LDA $83 STA $396F LDX $85 BEQ $3983 $3963 LDA #0 $3965 STA $B73C LDY #0 $396A LDA $FFFF,Y $396D STA $FFFF,Y INY CPY $B73C BNE $396A INC $396C INC $396F CPX #0 BEQ $3987 DEX BNE $3963 $3983 LDA $84 BNE $3965 $3987 RTS The next step is to determine the locations where we can get the parameters for the move. The source address is easy to detect, being the first four lines that store into the LDA $FFFF,Y instruction. Likewise, the destination address is the next four lines, making the source address found in $80-$81, and the destination in $82-$83. Finding the length takes a bit more examination, but winds up at $84-$85. All we have to do now, is replace this subroutine with an 816 code segment that uses the block move instruction. This code gets placed at $36A5, and although it can be entered with the single line assembler in the debugger, it will get confusing since almost every instruction that you enter will be an 816-specific instruction. The disassembler will not display any of these properly, so the display will look like a real mess. The debugger will also have problems giving you the proper address to enter the next line of code, since it doesn't know the length of 816 instructions. It will assume all unknown instructions to be one byte, which is sometimes okay, but if you enter a three byte instruction like "MVN 0 0", the debugger will give you the next line with the address pointing to the first 0 operand. You need to be aware of how long the 816 instructions really are, and move down to where the next instruction needs to start. I apologize for not providing a debugger that is 816-aware yet, but this *will* be fixed soon. I need it for my own work, so that ensures it a high priority in the 'to-do' list. For now, entering code segments like this is safer in the main assembler. You can assemble a disk file with an origin of $36A5, then return to the debugger, re-load TP.COM, and then load your 816 patch on top of it. I purposely won't give detailed instructions on how to enter the following code segment within the debugger. If you aren't comfortable with what is going on, and with what is required to get around it, entering code this way is probably too error-prone. I will give a nice tip though for those of you that are going to do it anyway! First fill the target area with all $EA's. In this case, "F 36A5 36E0 EA". This does two things for you. After entering an instruction, the next line should display a NOP. If it doesn't, then you are still within the operand of the previous instruction. Thus, it makes a great visual safety measure. Also, since you will need to move down to a lower address, having all NOPs at one byte per line will ensure that the address you need will be displayed. (Even if the address you need appears in the operand of the current line, there will still be another copy of it further down the screen). If you had left the previous code there, you might need an address that is in the operand field of an old instruction, requiring you to enter the address by hand. You also should move down with the cursor keys, and not the Return key. Okay, with all that out of the way, here is the 816 code segment. $36A5 STZ $D40E ;Until Atari OS can handle interrupts in native mode SEI ; we need to disable NMIs and IRQs CLC XCE ;Set native mode, REP #$30 ; and 16-bit regs. LDA $84 ;All parameters loaded as 16-bit quantities. BEQ $36B9 ;Catch 0 size condition DEC ;Adjust size for MVN count being -1 LDX $80 LDY $82 MVN 0 0 $36B9 XCE ;Carry was not changed, so this returns where we CLI ; were, including 8-bit regs. LDA #$C0 STA $D40E ;Interrupts are back on. LDY $84 ;These last two lines are a safety measure, just LDX #0 ; to match the closing register and flag values from TXA ; the previous routine. If you have enough space, RTS ; it's a good idea to do this in case the programmer ; relied on these values being there after the call. Most programs that do memory management will need to have two routines, for doing both incremental and decremental type moves, just as the 816 CPU has two block move instructions. Right after the routine we just patched, is another routine starting at $36E2. The requirements are similar, but with the added task of needing to compute the ending address for both the source and destination memory blocks, which is where the move needs to begin. I'll skip right to the 816 code in this case. $36E2 STZ $D40E SEI CLC XCE REP #$30 LDA $84 BEQ $36FE DEC ;Same so far TAY CLC ADC $80 ;Gets to end of source block TAX TYA PHA ADC $82 ;End of destination block TAY PLA MVP 0 0 SEC ;We changed the carry this time. $36FE XCE CLI LDA #$C0 STA $D40E LDY #$FF LDX #0 TXA RTS That completes the patches. Now we need to save to new version back to disk. Obviously, you should use a different file name in case something went wrong. If you are using SpartaDOS, it is a simple matter of checking the file size in the disk directory. The file shows as 17993 bytes long, so you can enter the save command: (remember the "-", and the relationship between file size and end address -1) S -3000 3000+#17992 TP.COM If you are using a DOS that does not give you the true file sizes, then you need to find the end of the TP.COM file another way. If you have been following this example, for now you should just enter the save command listed above. The method I will suggest for finding the end of the file needs to be done before loading TP.COM, which we are too late for. What we could have done, was fill memory with some known value before loading the TP.COM file. Such as "F 3000 8000 11". Then load TP.COM, and search for a long string of 11's. If you have picked a number that is not used in long sequential quantities within the file, then the match address you find should be the end of the file. Save from 3000 to one byte lower than where the 11's start. That completes this project. Operations like inserting and deleting text will be twice as fast in this new version. Note however, there will be some screen flickering caused by when we turned off the interrupts -- including DLIs -- during the move. When the 816 OS is available, we will not have to disable interrupts, and this patch will be completely clean. At this time, you could go back and put NOPs over the STZ $D40E and SEI instructions. - Sample Code: Sieve of Eratosthenes Benchmark This is a slightly improved version of the routine that appeared in "Programming The 65816", by David Eyes and Ron Lichty. The Sieve program calculates all of the prime numbers between 3 and 16,384. The basic procedure is to eliminate every multiple of a number N, since that number can't be prime while having N as a factor. All even numbers are non-prime, so we can start N at 3, and step up to the mid-point in the table skipping all even numbers. In the following routine, even numbers are effectively skipped by using 1/2 the table size, and considering the first table entry to be for the number 3, 5 in the second entry, 7 in the third, and so on. The routine will set the direct page to $3000, so that the entire 8K table can be accessed with dp,X addressing and 16-bit index registers. SIZE = 8192 .OR $00 ;This is added to the dp register of $3000, ITER .DS 2 ; putting these 3 variables at $3000-$3005 COUNT .DS 2 PRIME .DS 2 FLAGS = $3008 ;Note: by setting FLAGS to $3008, we can use the label in absolute ; addressing, which is required when we need to index by Y. Using ; "