OS/A65 Multitasking OS for 6502
(c) 1989-96 Andre Fachat
This is the description of a small operating system for the 6502 CPU.
It is a microkernel design, with preemtive multitasking, without
memory protection and swapping though. With a page mapped MMU the
different tasks are paged out and memory up to 1 MByte is supported.
Available software includes a shell with piping and I/O redirection,
filesystems for (Commodore serial and parallel) IEEE488 interface,
as well as for PC-style disks. Devices can be handled as files as well.
A BASIC interpreter has been ported from the C64.
A version for the C64 is available with version 1.3.5.
A description of the kernel interna, the kernel interface and the
standard library is given.
Some of the devices and filesystems are introduced.
OS/A65 has been written from somewhere in 1989 or 1990 up to the end of 1992
and is currently in version 1.35. It runs on a selfbuilt 6502 computer
with (or without) a 74ls610 MMU to allow task paging.
This system has been written for a special 6502 computer, as described
below. But then the kernel is completly hardware independent,
device drivers doing all the hardware interface stuff.
A special MMU is used to allow task paging, which gives lots
of interesting features. A version for computers without MMU
is available, though. In this version the different tasks
have to be cooperative, in that different zeropage/RAM addresses
have to be used.
All tables, like environment tables, or streams etc are static, i.e.
their sizes are fixed at compile time. Otherwise a very sophisticated
memory allocation/free handler for kernel memory would have been necessary.
This would have used much of the CPU power and of the system RAM for
administrative needs only. I didn't want this.
If an IRQ occurs, the current process is marked interrupted and
the next runnable task is being started, thus providing preemptive
multitasking. A simlpe round robin scheduler is being used.
No threads are available, as the CPU needs to have its stack
in $01xx in memory. The non-MMU version is a kind of threaded
system, as the memory is the same for all tasks, but the stack
at $01xx being divided into several parts for the different tasks.
The MMU maps memory pages of 4k size. The kernel is within the uppermost
4k, to include the 6502 Reset, IRQ and NMI vectors. I/O ports are
mapped to $e800 to $efff. The rest of the memory can be mapped from
the maximum total available memory of 1 MByte in 4k Blocks.
Each task has its own memory environment, i.e. has its own view of the
memory. It communicates with the system only by the means of kernel
calls. Therefore the kernel provides a message-sending interface, as well
as simple signals and semaphores.
Data transfer is done via 128 byte sized fifos, called streams. The
stream management as well as the memory manager are within the kernel,
although it wouldn't be necessary. For performance reasons the filesystem
manager is in the kernel as well. It dispatches the messages sent to
the filesystem to registered filesystem tasks.
The system structure looks like the following diagram.
---------- --------- --------- ------ -------
| fsdev | | fsiec | | fsibm | | sh | | mon | tasks...
---------- --------- --------- ------ -------
---------------------------- -------------- ----- ---------- --------
| | | | fsm | | | | | | |
| | env | -------------- | | | | | |
| | ------------------ | | stream | | mem |
| | | | | | |
| -------------------------------------- | | | |
| devices | | | | |
------------------------------------------------- ---------- --------
--------- ------- ----------- ----------
| video | | par | | spooler | | serial | devices...
--------- ------- ----------- ----------
The CS/A hardware consists of several boards, one for the CPU and MMU,
one for the ROM, 32k RAM and
some basic I/O hardware. A video board with 64k RAM gives PAL
(european video standard, as opposed to NTSC) b/w video output. A high
resolution mode is available. There is a standard PC floppy drive
interface, two serial interfaces (unfortunately without fifo), a
parallel printer output, a keyboard interface, and a parallel IEEE 488
interface. A serial - Commodore compatible - IEC bus interface is
available as well.
Most interfaces, i.e. chip addresses and I/O connections are compatible
to the old Commodore CBM 3032 computer. Some things have been changed
to avoid connections between different boards, though (A 3032 emulator
is working).
And of course new interfaces have been added, e.g. an infrared LED to
control audio or other equipment by infrared commands.
A special interface card and cable allows to replace a 6502 in an existing
computer and make its 64k memory part of the CS/A memory.
The size of the kernel is 4 kByte and it is located at $f000 in
CPU address space. Kernel entry takes place with an IRQ, an NMI
or via kernel call. These calls take place via a jumptable
located at $f000, the beginning of the kernel.
It is possible to get into the kernel through an CPU interrupt or a
direct kernel call. After an IRQ the CPU registers are saved on the
stack and the kernel is entered via a special routine 'memsys'.
This transparent (i.e. no registers except stack pointer are changed)
routine switches from Task environment to system memory environment.
Each kernel call has, as one of the first instructions, a call to this
routine to enter kernel address space. A similar routine, 'memtask',
is called at kernel exit to enter task memory space.
A task can be described by the memory map it is using and the CPU registers.
Interrupting a task and saving this information allows to restart the task
later. Here, in addition to this, a task has some additional properties.
The mapping of the STD* streams is saved in the environment table as well
as an optional task interrupt routine (if enabled at compile time).
The task status shows if a task is blocked, interrupted, runnable
or has executed a BRK opcode (which stops the task).
No resource bookkeeping is done. A task has to release all resources
it has allocated by itself, except for the STD* streams.
GETENV allocates a new environment struct in the kernel (which
can be freed with FREENV). With SETBLK the MMU mapping of
bus address space to CPU address space for an environment can be
manipulated. READ and WRITE allow other tasks to manipulate the
contents of another environments memory.
With FORK an environment then becomes a task, i.e.
starts to live. TERM and KILL allow the task itself or another task
to remove the task and free the environment struct. WTERM waits until
a task has terminated. TRESET resets the PC of a (running) task to
another address (really dirty trick...).
GETINFO gives an overview of how many tasks are in use and what is
their status. At the moment a maximum of 16 tasks (with MMU) is
possible.
Devices are a different kind of object. They have their memory
mapping as well, it just simply consists of one memory page.
Then devices are not really running tasks but they are
waiting for a message, that is passed through the DEVCMD system call.
The DEVCMD call registeres new devices and provides an interface to
the devices. When calling a device, the device memory mapping is
established and the device entry point is called as a subroutine.
After the return the control goes back to the calling task.
The DEVCMD interface is even reentrant: the video device allows
interrupts to occur during device execution so that, for example,
serial devices can be served without loosing characters (Of course
the video device itself has a mutex construct builtin to avoid
confusion).
The memory management routines are independent from all other
routines. For performance reasons they are in the kernel and not
put into a separate task. At startup no memory is available.
The kernel then tests the memory, gets its size and calls ENMEM
to register usable RAM to the memory handler. The video
driver does the same at startup when adding the video RAM to the
system. GETMEM can then be used to get a free block of RAM.
GETBLK tries to allocate a specific memory block where, for example,
some special hardware like the video buffer is mapped in.
FREMEM releases an allocated memory block. According to the system
hardware a set of 256 memory blocks can be handled, which make up
1 MByte of memory. The memory routines can not be called from within
an interrupt.
Streams, as this word is used here, are 128 byte FIFO buffers to allow
easy data transfer. The stream routines can be called from within an
interrupt. Each stream counts how many tasks are writing and reading
from it. If both counters are zero, the stream is free. If only one is
empty, special return codes are generated to indicate the situation.
At a fork, for example, the number of writing tasks is incremented for
STDOUT and STDERR, while the number of reading tasks is incremented
for STDIN. After a task has exited, the STD* counters are decremented.
A filesystem listening on the tasks STDOUT stream then
gets the return code E_EOF, because no task is writing on the stream
anymore.
STD* streams are negative stream numbers and are the same
for all tasks. They are mapped to the real stream numbers that are being
kept in the environment table for each task. All other stream numbers are
global and there is only a limited set of stream, as they are
statically allocated.
The message interface provides another easy means of interprocess
communication. Messages are exchanged using the rendez-vous technique,
i.e. both involved tasks have to be in the SEND respectively RECEIVE
routine. The SEND routine always blocks until the given message is
taken over by the receiving task. RECEIVE either waits for a
message or returns with E_NOTX if no message has been found.
The message is then copied from the senders message buffer (at $02xx)
to the receivers message buffer (at $02xx). The return values are set
and both tasks are marked runnable.
If a RECEIVE blocks and a SEND takes place, both tasks are marked
waiting at first, but the scheduler detects that and lets them
exchange the messages.
While RECEIVE gets messages from
any task, XRECEIVE allows to receive messages from a certain task
only.
Negative task numbers are used for special purposes. Certain system
calls, that need more parameter than fit into the CPU registers
can be called not only via jumptable but also via the SEND interface.
For this, the -1 ($ff) is used for system calls. -2 ($fe) is
used for the file system manager. -3 to -12 ($fd-$f4) can be
mapped to any other task by the TDUP system call. -3 is reserved
for an error exception handler and -4 for a timer task. Messages
to these task numbers are internally redirected to the registered task.
As the 6502 CPU has no idea about atomic operations, semaphores have to
be provided by the system to allow mutex constructs
(Well, there are atomic instructions, like INC or the shift opcodes. But
then there is - in not-CMOS versions - no test-and-set). Semaphores
can be allocated using GETSEM
and released with FRESEM. Then with a PSEM operation on a certain
semaphore a task can enter a critical region of the program. All other
tasks block when they try to do the same PSEM operation. Only after
a VSEM on the semaphore another task waiting in a PSEM is revived
and can enter the critical region.
Negative semaphore numbers cannot be allocated but can be used as some kind
of system semaphores. What they mean has yet to be defined, though.
Signal handling consists of a single system call. With the carry flag
cleared, the calling task is blocked until a specified signal is received.
With the carry flag set, a signal is sent. Only listening tasks receive
the signal.
Disks are
administered as drives, as a virtual filesystem with mounts etc
would have been to much for this system.
The filesystem manager should actually be another task and is in the
kernel only for performance and resource limitation reasons.
The TDUP interface would provide an easy way to register a filesystem manager,
if it were an extra task.
But then the used filesystem manager just redirects the send message
'on the fly'. An extra task would have to receive the message,
redirect it and send it to the filesystem task - blocking until
the message is taken over!
When a filesystem task, e.g. for IBM PC disks, is started, it registers
itself with the filesystem manager by sending an appropriate message
to task -2 ($fe). The filesystem then knows about how many drives are
supported by the filesystem and adds them to its list. A request to, e.g.
open a file is sent to the filesystem manager. From the drive number
the file system manager knows to which task to redirect the message.
It also remaps the drive number to the ones used in the filesystem
task.
Once a read-only or write-only file is opened, all communication takes
place via streams. Another interface, using read and write messages
is planned, but has not yet been implemented. Some filesystems (like
fsiec for Commodore floppy drives) wouldn't even support it.
The so called standard library is a set of
subroutines that tasks may
map into their memory or not. The library is below the
I/O area, at $e800. Directly below $e800 the jumptable for the
library is located. All routines run in task space, and have no
connection to the kernel, except for calling normal system calls.
All filesystem tasks use a common Filesystem
Interface to communicate with other tasks. The SEND/RECEIVE
calls are used to open files, while the Streams are used to transfer the
data. The filesystem manager redirects the SENDs to the right
filesystem tasks automatically.
The device filesystem provides an easy interface to devices, in allowing
them to be used as files. A request to open the device, for example,
is served by calling the appropriate DEVCMD commands.
fsdev also provides an easy interface to the ROM structure. The
programs stored in the system can be started from there. They cannot
be read, as only a header is being provided. This header, when used
to start the program, remaps the ROM into the task memory. This way
one need not move all the program through the stream interface.
This filesystem can be used in two directions. First it provides an
interface for CS/A to access Commodore disk drives on an IEEE 488
interface. Reading directories as well as renaming and deleting files and
formating and checking disks is supported.
The other way around with fsiec the CS/A computer can be used as a disk
drive for Commodore computers. The filesystem drives are mapped to the
drives 0-? of a single drive unit on the IEEE 488 bus. In addition to the
standard IEC floppy commands directory handling is supported. For
programs that only understand one drive, drives can be remapped with
assign.
The FAT filesystem is used to control and handle PC-style formatted disks
on standard PC disk drives on the CS/A floppy board. Several
3.5" and 5.25" formats are supported, up to 1.44 MByte 3.5" HD disks.
As hardware a WD1770 floppy controler at address $e8e0 is used. A 6522 VIA
at $e8f0 provides some additional control lines for drive select, motor on,
etc.
For those who remember: I wrote the program 'bdos' for the german 64er
magazin, which allowed the C128 to read and write PC disks. fsibm
has not a byte of code from this project, instead it is completely
rewritten.
One flaw of this filesystem is that it assumes it has the floppy
controller for itself. So there is no way to share it, e.g.
with a disk monitor. The next revision should define a semaphore
for this.
The shell is a command line interpreter that
provides an easy to use interface to the system. A scripting language is
not available, though. For external commands and some internal commands
like DIR and more I/O redirection is possible. I.e. that the STD*
streams can be redirected to a file or even piped to other programs.
'mon' is a full featured 6502 assembly and
machine language monitor. Among the usual standard commands other
features like, for example, relocating 6502 in a certain area or
searching for opcodes with specified addressing modes is possible.
The BASIC interpreter is an extended Commodore
Basic V2 interpreter as found in the C64. In fact it is de-assembled
from the C64 and patched to make it work on CS/A. In addition, some
commands have been changed and new commands have been added, amongst
them the Commodore Basic V4 commands. In contrast to the C64 basic
interpreter, it runs in multitasking and you can quit ;-)
The used CPU is a quite old one: the MOS 6502 or one of its clones,
like the Rockwell R65C02, for example. This 8 bit CPU has been designed
in the early 70s (?) but is still in use. The Rockwell modem chipset,
for example, is based on a R65C02 core. The CPU has one general purpose
8 bit register, the accumulator. Two 8 bit index registers, x and y, help with
addressing modes. The PC is 16 bit wide, as is the address bus to the
outside of the chip. The stack pointer is only 8 bit, thus confining
the stack to the area $01xx in memory. The so called zeropage ($00xx)
is used with special addressing modes to allow smaller code and
faster execution times by leaving out the address high byte.
An 8 bit status register completes the register set.
The hardware interface is quite simple. In fact it is compatible to the
motorola 6800 series chips. Commodore lost a lawsuit started by Motorola
where commodore
was forbidden to use the number 6500 (or was it 6501) and the same
pinout as the 6800 chip, thus making it a plug in replacement for the
Motorola chip. But then they could use the hardware interface,
though. Which means that one can simply use any 68xx interface chip
with a 6502 and vice versa.
Phi2 is the CPU clock output. During Phi2 high all address lines and
the read/write line are valid. Data is given on the data bus for write
somewhere in the middle of Phi2 high and it stays valid until the
Phi2 high-low transition. During read the data lines are latched
at the same transition. The CPU always reads some memory addresses,
even if its really doing something else.
/RESET, /NMI and /IRQ are the most important input lines. They are
low active and the I/O chips have open collector outputs so that
they can be wire-or'd together.
The hardware consists of several boards, connected by a 64 pin
bus interface (DIN 41612, rows a+c). On the bus are 20 address
lines and 8 data lines. /NMI, /IRQ, /RESET, /SO, RDY, SYNC and R/W
are the usual (buffered) 6502 control lines. Phi0, Phi1 and Phi2
are on the bus as well as a 2Phi2, which is a frequency doubled
phi2. Each low-high transition on 2Phi2 gives a
transition on Phi2. This can, for example, directly be connected
to /RAS of dynamic RAMs.
The CPU boards detects any access to a 4 or 2 kByte sized Area
from $e000 or $e800 to $efff and automatically selects the bus
line /IOSEL, and /MEMSEL otherwise. This way a simple 8 bit
compare with, for example, a single 74688 with the address
lines A4 to A11 selects a 16 byte I/O window for any interface
chip.
/IOINH(ibit) disables this feature, but therefore the I/O
memory has to be detected from normal memory access
(via /MEMSEL) and /EXTIO be activated therefore from an
external circuit.
(Well, the next revision should see /IOSEL as an open collector
output...)
All interface circuits, even including the MMU on the CPU
board, are connected to the bus lines. The CPU
can be cut off from the bus with /BE, thus allowing other
CPUs or bus masters to take the bus and even access the MMU.
The CPU has to be shutdown with
RDY for this, though.
The 74ls610 (used to be on the COCOM list, haha...) is
a set of 16 12 bit registers, where only 8 bit are used
here. During Phi2 low the CPU address lines A12 to A15
are used to select the appropriate register and latch its
contents to the output address lines A12-A19. Unfortunately
this chip is quite slow (I'd say that a hand-constructed thing
would be much faster...) so that for 2 MHz CPU clock
a 4 MHz (CMOS) CPU has to be used.
During Phi2 high the I/O access
to read and write the registers from the bus takes place.
At RESET the MMU is set to 'through' mode, i.e. the
CPU address lines A12 to A15 are passed through and
A16 to A19 are left zero, to allow a defined RESET
condition. The first I/O write on the MMU registers
then enables it.
Then the CPU memory is divided into 16 4 kByte blocks, that
can each be mapped from one of the 256 4 kByte Blocks in the
1 MByte bus address space by setting the MMU registers.
The video board has 64 kByte dynamic RAM that is used as
video and CPU memory and a 6545/6845 video CRT controller.
The memory access is twice as fast as the CPU access, so that
during Phi2 low the video readout is done and at Phi2 high
the usual CPU access takes place. The lowest address bits
are connected to the row, so that the video access
automatically refreshes the RAM. With 1 MHz CPU clock
40 columns are displayed, with 2 Mhz 80 columns can
be seen. The CRT is located at $e880 in I/O space, and
a write only control port at $e888. This port gives the
missing address lines A14 and A15 for the CRT, 2 bits
to select the charset (located in an EPROM on the board
that can not be read by the CPU) and 2 bits to invert
the HSYNC and VSYNC signals if necessary.
The last used bit maps the CRT row select lines
RA0 and RA1 (that normally select the character row
in the charset)
to the RAM/CPU address lines A12 and A13. This way -
and with a 1 to 1 charset - some high
resolution mode with 640x200 pixel is possible, with
a rather strange memory map, though.
This is necessary due to the 127 row limit in the CRT.
An IEEE 488 board with a 6821 PIA at $e82x and a 6522 VIA
at $e84x makes up a mostly Commodore PET 3032 compatible
IEEE 488 interface. The PIA is used for the data transfer,
while the VIA port A is used for the control lines.
VIA port B controls a Commodore serial IEC bus interface.
Both interfaces can be used as master or slave, i.e.
the have the circuitry to set NRFD/NDAC on the IEEE 488
and the DATA (?) line on the serial bus, if ATN on the bus
is set. So the master can see that there is a device on
the bus, even if it isn't ready to serve the request.
VIA CA2 can be externally connected to the
video card to allow PET compatible switching between
upper/lowercase and uppercase/graphics charset.
VIA CB2 drives, PET compatible of course, a piezo beeper ;-)
At $e810-$e817 is, as in a PET, a 6821 PIA to handle the
keyboard, while on the same board, at $e818-$e81f, a
6551 ACIA (which is not in a PET ;-) to handle a serial
interface is located.
The same ACIA is also located on the BIOS board, that also contains
32k ROM and 32k RAM that make up the lowest 64 kByte of the
Megabyte bus address space. This BIOS card together with
the CPU makes up a complete computer. The ACIA is at $efe8-$efef,
while at $efe0 a control port is located.
One bit allows the detection of an IRQ (e.g. to stop the irq
routine if all interrupts are served), two others are used
to control and sense a 50 Hz (line frequency in Europe, which
is put on the bus by the power supply) interrupt.
Two other bits control the /IOINH line and allow the I/O address
space to be located in $0e000-$0effff in bus (!) address space,
so that it can be mapped anywhere in CPU address space.
On the floppy board a WD 1770 (like in the Commodore
VC1571) chip at $e8fx can control PC compatible disk drives.
A 6522 VIA at $e8fx supplies the timers and some additional
I/O lines and also gives a simple 8 bit parallel interface
(for e.g. printers).
A special interface board emulates a 6502 in another computer
(restricted to 1 MHz or below). A 40 line cable is used
to connect the socket for the 6502, e.g. in a VC1541, with the
board. The 64 kByte address space of this VC1541 is then
mapped into the bus address space $02000-$02ffff. So you can,
for example, test new ROMs by putting the new ROM image in
the CS/A RAM and then remapping it for the old ROMs.
The C64 uses 8 select lines to select the keyboard rows.
The return value is then read in by 8 column port lines.
This simply is a RAM readout scheme - so that a
RAM can simulate a keyboard for the C64.
The keyboard simulator board provides such a RAM. It can
be written or read by the CPU during Phi2 high, and the
row select address from the C64 is read at Phi2 low and
latched. So, for example, the C64 can be controlled remotely.
Currently two ports have been done. One is a port to the CS/A65 without
MMU, to use it in embedded applications I had in mind. The other one
is the C64 port.
The port to a system without MMU is one of the more difficult.
The problem is, that global variables cannot be mapped to different
physical memory locations for different tasks. So this has to be
avoided. Unfortunately there are global variables used to communicate
with the kernel nearly everywhere. So these variables have to be
protected by semaphores. Another thing is that a binary can normally
only be invoced once. All variables would otherwise be the same for all
invocations and running different tasks on them would normally
give a real mess. More on this issue can be found in
nommu.html.
After porting it to a system without MMU, there was only a small step
porting it to the C64. The first implementation used the Commodore
parallel IEEE488 interface, as it's easier to program - and simulate
in an emulator... But after all, it has the same deficiencies as any
other system without MMU. More details on the actual implementation
can be found in README.c64.
On the other hand, to be really useful without an MMU besides
embedded applications, some things are still missing.
A program relocator, for example, would make it much easier in that
the applications don't have to be assembled to a certain memory area.
Instead they could be relocated to some free memory when being loaded.
The C128, for example, would surely be a good candidate for a new
version: It can map the zeropage and the stack anywhere in memory and
can thus have different ones for each task. In addition to this, a
relocator would have to be implemented, to be able to load programs to
any address. The (to be implemented new) memory manager then has to
keep track of which memory is used for each program. If one says that
read/write memory (i.e. global variables, except zeropage) should only
be dynamically allocated, one could even use shared read only binaries.
I.e. several invocations would only use one copy of the binary while
running. Global variables used would be global to all invocations,
giving some kind of task shared memory. A flag in the file header could
indicate the relocator and/or shared binary mode.
Well, a next version is not planned currently. But then the whole
stuff is under the GNU public license and could be used by anyone.
The new binary format with relocation would surely justify a jump
to version 1.4. This would make it much more usable for the C64. But
for this not only the relocator has to be written, but also the assembler
has to be changed to write out a relocatable file format.
This would also imply an extended memory manager, that keeps track
which memory is used by which task. And it would also use a smaller
block size than 4kByte.
Well, there used to be some very few spurious BRK executions in a task.
That's why I reduced the maximum number of tasks from 6 to 5, increasing
the individual stack size to 40. The previously smaller stack size of
32 seemed to cause this problem, which seems to be gone now. The problem
is that a normal
interrupt alone needs 10 bytes on each tasks stack - one can reduce
that, but with some effort only.
When pressing the return key, sometimes the invers cursor stays on the
screen, causing the program to give a syntax error or whatsoever
when it reads a $a0 from its input stream.
But other than that, none. But I haven't tested the C64 version
as much as the one for my selfbuilt computer. So there might still
be bugs in the serial bus handling or in the stuff special to computers
without MMU.
Ok, when starting up the ROM, the initial tasks are set up and then the
scheduler is started. This might cause problems when e.g. a filesystem
is not fast enough detecting its drives during the first scheduled
timeslice. Another filesystem might come in between so that the drive
numbers could be assigned in a different order.
Possible improvements are:
- A malloc() equivalent, that could be located in the STDIO lib.
At least some kind of sbrk() call we already have in the kernel.
- A relocator to be able to load binaries to different memory locations,
esp. for the versions without an MMU.
- There currently is no way to handle shared memory (in the MMU version)
in a generalized manner. I think of somethink like a Unix (Linux) mmap()
equivalent.
- The data transfer via streams is quite slow. Another interface
to handle block transfer would surely improve things. Something like
'copy memory from task A, address x length l to task B, address y'.
- Starting a new task from within the shell is way too slow. Reading
a file and writing it to another environments memory with the WRITE
system function takes its time. Instead the shell should just load a small
loader program into the new environment. This then reads the program
from the stream and executes it.