Diskcomm file format version 3.2, revision 1.0

Diskcomm archives represent the contents of diskettes used for the
Classic Atari computers.  Various pieces of information related to the
format and the contents of the diskette are stored in the archive
file.  To reduce the storage space requirements, compression
algorithms are applied to the data.  For some large archive files, it
may be necessary to split the archive into multiple files, in order to
be able to store the archive on diskettes.

On an Atari disk, data is organized in sectors.  These sectors are
numbered starting from 1.  There are various disk sizes.  The most
common ones are the standard diskettes.  Common diskette formats are
the single density diskette, which holds 720 sectors of 128 bytes, the
enhanced density diskette, which holds 1040 sectors of 128 bytes, and
the double density diskette, which holds 720 sectors of 256 bytes.
There are various other formats, but the single density and the
enhanced density are used most, since these are supported by the 1050
disk drive.  Other formats require an XF551, 815, or some third party
disk drive, like the Percom, Indus GT, Trak, Black Box with floppy
board, MIO, and the HDI, to name just a few.  Diskcomm will always use
1040 sectors for enhanced density diskettes, so for this type of
format, the number of sectors in the archive is defined by the format.
For single density and double density disks, the maximum number of
sectors can be set to any value from 1 to 9999.  By definition, the first
three sectors of any Atari disk contain only 128 bytes, since this is
considered the boot area.  So the first three sectors of double
density disks will contain only 128 bytes of data.  Diskcomm still
stores these sectors as sectors of 256 bytes within the archive, and
the remaining 128 bytes will simply contain zeroes.
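
The geometries named above, and the zero padding of the short boot
sectors, can be sketched in Python as follows.  This is only an
illustration, not Diskcomm's own code; the names GEOMETRY,
BOOT_SECTORS and pad_to_sector_size are hypothetical.

```python
# Sector count and sector size for the three common formats named above.
GEOMETRY = {
    "single": (720, 128),
    "enhanced": (1040, 128),
    "double": (720, 256),
}

BOOT_SECTORS = 3  # sectors 1 to 3 hold only 128 bytes of data on any disk


def pad_to_sector_size(number: int, data: bytes, sector_size: int) -> bytes:
    """Return a sector as it would be stored, zero padded if it is short."""
    if number <= BOOT_SECTORS and len(data) == 128:
        # Boot sectors of a 256 byte format get 128 trailing zeroes.
        return data.ljust(sector_size, b"\x00")
    if len(data) != sector_size:
        raise ValueError("unexpected sector length")
    return data
```

For a double density disk, pad_to_sector_size(1, boot, 256) returns
the 128 boot bytes followed by 128 zero bytes.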

The sectors of a disk are stored in the archive sequentially.  Sectors
of data within a Diskcomm archive are compressed.  While creating the
archive, Diskcomm examines the contents of each sector that is
processed.  Based on these contents, one of several compression
algorithms is used to reduce the amount of storage required for
representing the contents of this sector.  Sectors that contain
nothing but zeroes are considered empty sectors.  Empty sectors are
not stored in the archive.  Instead, a flag in the information stored
for the preceding sector indicates that the next sector does not
follow sequentially, and the next sector that contains data is
preceded by its sector number.  It is assumed that the diskette will
be formatted before the archive is written back to it, and thus that
initially all sectors on the output disk will contain zeroes.
Therefore, there is no need to store empty sectors in the archive.
When the archive is written back to disk, the sector number included
in the archive tells the writer which sectors to skip.
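
The skipping of empty sectors can be sketched as follows.  This is a
minimal Python illustration under the assumption that all sectors are
available as a list; the function name plan_sectors is hypothetical,
and the sketch ignores pass boundaries.

```python
def plan_sectors(sectors):
    """Yield (sector_number, needs_number) for every non-empty sector.

    needs_number is True when the sector must be preceded by its sector
    number, that is for the first stored sector and after any gap.
    """
    last_stored = None
    for number, data in enumerate(sectors, start=1):
        if not any(data):
            continue  # all zeroes: an empty sector, not stored at all
        needs_number = last_stored is None or last_stored != number - 1
        yield number, needs_number
        last_stored = number
```

With sectors 2 and 3 empty, sectors 1 and 4 need a sector number,
while sector 5 follows sector 4 sequentially and does not.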

For sectors that contain data, the contents of the sector are compared
to the contents of the last preceding sector containing data.  Empty
sectors have no influence on this comparison, since they are skipped.
There are five different algorithms that can be applied.  Each of them
is applied to the sector in turn, and if the result is successful, the
resulting compressed data is appended to the archive buffer, with the
type of compression prepended.  As noted before, if the preceding
sector was empty, the sector number is prepended to all of this, in
the 6502 low/high byte order.  Older versions of Diskcomm used a sixth
algorithm.  This has been made obsolete by one of the remaining five
algorithms, so the old algorithm is no longer applied when an archive
is being created.  However, some very old archives may still contain
sectors that were compressed by this algorithm.

Compression of sectors continues until memory runs out, or until there
are no more sectors left to process.  Due to memory limitations, there
is a maximum of just over 24K of data that can be stored in the
archive buffer.  When appending the compressed data to the buffer
causes the buffer to contain 24K of compressed data, the buffer is
full, and it is flushed to disk.  A system that has more than 64K of
memory can hold multiple buffers before the data is flushed to disk.
Each buffer load is considered to be a pass in the compression of the
disk.  A pass holds a variable number of compressed sectors, and is
considered complete when hex 5F02 (dec 24322) bytes of data or more
have been accumulated.  A pass can never contain more than hex 6002
(dec 24578) bytes.

Each pass starts with the header, which consists of two bytes.  The
first byte is either hex FA or hex F9.  When the archive is split up
into multiple files, this byte will contain hex F9, otherwise it will
contain hex FA.  The second byte of the header combines three pieces
of information.  The format of the original disk is indicated in bit 5
and bit 6 of the second byte.  Bit value 00 is used for single density
disks, bit value 01 is used for enhanced density disks, and bit value
10 is used for double density disks.  Bit value 11 is undefined.  Bits
0 to 4 are used to indicate the pass number.  Passes are numbered
sequentially, starting at 1.  Since there are 5 bits available for
this, the highest possible pass number is 31.  Therefore, the largest
archive will be no larger than 31 times 24K, unless the pass count is
allowed to roll over to zero.  The high order bit of the second byte
(bit 7) is set when this pass is the last pass.

Since compression is started before asking what the user wants to do,
the question of dividing the archive into smaller files is only
presented to the user if there is more than one pass.  If all data can
be stored in one pass, this question is not presented, and an archive
with header byte hex FA is created.  The first sector within a pass
will always be preceded by its sector number.
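
The two header bytes can be taken apart as in the following Python
sketch.  It follows the bit layout described above; the function name
parse_pass_header and the dictionary keys are hypothetical and are
chosen for illustration only.

```python
# Diskette type as encoded in bits 6 and 5 of the second header byte.
DENSITY = {0b00: "single", 0b01: "enhanced", 0b10: "double"}


def parse_pass_header(header: bytes) -> dict:
    """Split the two byte pass header into its fields."""
    if len(header) < 2:
        raise ValueError("pass header is two bytes long")
    archive_type, info = header[0], header[1]
    if archive_type not in (0xFA, 0xF9):
        raise ValueError("not a Diskcomm pass header")
    density_bits = (info >> 5) & 0b11
    if density_bits not in DENSITY:
        raise ValueError("undefined diskette type")  # bit value 11
    return {
        "multi_file": archive_type == 0xF9,   # F9: split into several files
        "last_pass": bool(info & 0x80),       # bit 7
        "density": DENSITY[density_bits],     # bits 6 and 5
        "pass_number": info & 0x1F,           # bits 4 to 0
    }
```

For example, the header bytes FA A1 describe the first and last pass
of a single file archive of an enhanced density disk.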

Format description

<Diskcomm archive> = <pass>
<pass> = <archive type> <pass information> <sector number> <sector
data> <end of pass>
<pass information> = <last pass flag> + <diskette type> + <pass
number>
<sector data> = <content type> <compressed data> <sector number>
<content type> = <sequential flag> + <compression type>
<archive type> = FA | F9
<last pass flag> = 00 | 80
<diskette type> = 00 | 20 | 40
<end of pass> = 45
<sequential flag> = 00 | 80
<compression type> = 41 | 42 | 43 | 44 | 46 | 47

Format description in plain English.

Diskcomm archive
    A Diskcomm archive consists of one or more passes.  When an
    archive is split into multiple files, each pass is stored in a
    separate file.

Pass
    A pass consists of an archive type code, followed by pass
    information, followed by the starting sector number, followed by
    one or more sector data packets, followed by the end of pass code.

Archive type
    The archive type indicates whether this is a multi file archive
    or not.

Sector data
    A sector data packet consists of one byte that indicates the
    compression type for the sector.  After the compression type, the
    compressed data for the sector follows.  The contents of this
    depend on the type of compression, and it can contain any number
    of bytes, from zero up to the length of the sector for the type of
    disk, either 128 or 256 bytes.  The high order bit of the
    compression type is used to indicate whether or not a sector
    number will follow the compressed data.  If this bit is zero, a
    sector number will follow the data.  If this bit is one, there
    will not be a sector number following the compressed data.

Sector number
    An unsigned sector number, which is two bytes.  The first byte is
    the low order portion of the number, the second byte is the high
    order portion of the number.  It normally ranges from 1 to 9999.

End of pass
    The value hex 45.

Compression type
    One of the following hex values: 41, 42, 43, 44, 46 or 47.  The
    meaning of these values is described below.
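
The content type byte and the sector number described above can be
decoded as in this Python sketch.  The function names
split_content_type and read_sector_number are hypothetical; only the
bit layout and the byte order are taken from the description.

```python
import struct


def split_content_type(byte: int) -> tuple[bool, int]:
    """Return (sector_number_follows, compression_type) for a type byte."""
    # High order bit zero: a sector number follows the compressed data.
    sector_number_follows = (byte & 0x80) == 0
    return sector_number_follows, byte & 0x7F


def read_sector_number(buf: bytes, offset: int = 0) -> int:
    """Read a two byte sector number stored low byte first."""
    (number,) = struct.unpack_from("<H", buf, offset)
    return number
```

The byte hex C7 is thus an uncompressed sector (type 47) without a
sector number after it, and the bytes 0F 27 form sector number 9999.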


Type 41, modify begin.

The compression is relative to the previous sector.  The sector data
contains only the beginning portion.  The last portion is not changed.
The first byte of the sector data specifies at what offset to start
modifying the sector.  The remaining bytes of the sector data are used
to modify the beginning portion of the sector.  This modification
takes place starting at the byte at the start offset, working towards
the beginning of the sector, up to and including the byte at offset
zero, the first byte of the sector.  This implies that the data bytes
are stored in reverse order in the sector data.
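
A decoder for this type could look like the following Python sketch.
It assumes that the previous sector is available as a bytes object of
the full sector length; the name decode_modify_begin is hypothetical.

```python
def decode_modify_begin(previous: bytes, data: bytes) -> tuple[bytes, int]:
    """Decode type 41 sector data; return (sector, bytes consumed)."""
    start = data[0]                    # offset at which modification starts
    if start >= len(previous):
        raise ValueError("start offset beyond end of sector")
    count = start + 1                  # offsets start, start - 1, ..., 0
    payload = data[1:1 + count]
    if len(payload) != count:
        raise ValueError("sector data is truncated")
    sector = bytearray(previous)
    for i, value in enumerate(payload):
        sector[start - i] = value      # stored in reverse order
    return bytes(sector), 1 + count
```

With the sector data 02 AA BB CC, offset 2 becomes AA, offset 1
becomes BB and offset 0 becomes CC, while the rest of the previous
sector is kept.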

Type 42, 128 byte DOS sector.

This is an obsolete compression type that was used by early versions
of Diskcomm.  Those versions supported only single density diskettes,
so this type of sector is always 128 bytes long.
Programs that decode archives should be aware of this.  Using it for
creating new archives is not recommended.  The sector data contains
five bytes.  The first byte of the sector data is used to initialize
the first 124 bytes of the sector.  The remaining four bytes are
stored in the last four bytes of the sector.
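
Expanding this type is straightforward, as the following Python sketch
shows.  The name decode_dos_sector is hypothetical, and the layout is
taken directly from the description above.

```python
def decode_dos_sector(data: bytes) -> tuple[bytes, int]:
    """Decode type 42 sector data; return (sector, bytes consumed)."""
    if len(data) < 5:
        raise ValueError("sector data is truncated")
    fill = data[0]
    # 124 copies of the fill byte, then the four trailing bytes verbatim.
    sector = bytes([fill]) * 124 + bytes(data[1:5])
    return sector, 5
```

The result is always 128 bytes long, regardless of the diskette type
given in the pass header.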

Type 43, compressed sector.

The sector data contains substrings.  These substrings alternate
between uncompressed and compressed, starting with an uncompressed
substring.  Each of these substrings starts with a byte that specifies
the ending offset of the resulting data in the sector.  When this
ending offset position is reached, the end of the substring is
reached, and the byte at this ending offset is the starting position
for the next substring.  The starting position for the first substring
is at offset zero.  An uncompressed substring will contain as many
bytes as are needed to fill the sector from the start position up to,
but not including the end offset.  For uncompressed substrings, if the
starting position offset is equal to the ending offset, there is no
further data, so in effect, this is a null string.  This is used when
there are two portions of data within the sector that can be
compressed, without other data in between these portions.  The
uncompressed substring must be present, therefore a null string must
be used in this case.  Compressed substrings are always two bytes in
length.  The compressed substring starts with a byte that indicates
the ending offset.  The second byte contains the fill character.  The
portion of the sector starting at the start offset, up to, but not
including the ending offset, is set to the value of this fill
character.  After the compressed substring, another uncompressed
substring follows.

For double density disks, the ending offset for the last substring is
256.  Since there is only one byte to represent the ending offset,
this is stored as zero.  However, zero is an offset that can be used
for the first uncompressed string, to indicate that the first
uncompressed string is a null string.  The end of this type of
compressed sector is reached when all bytes in the sector have been
processed.  This can occur at the end of an uncompressed substring.
In this case, there will not be a compressed substring following the
uncompressed string.  Likewise, if it occurs at the end of a
compressed substring, there will not be an uncompressed string
following it.
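
The alternating substrings can be expanded as in the following Python
sketch.  The name decode_compressed is hypothetical.  The handling of
an ending offset of zero is an assumption: the sketch treats it as a
null string only for the first uncompressed substring, and as 256 in
every other position of a 256 byte sector.

```python
def decode_compressed(data: bytes, sector_size: int) -> tuple[bytes, int]:
    """Decode type 43 sector data; return (sector, bytes consumed)."""
    sector = bytearray(sector_size)
    pos = 0            # start offset of the next substring
    i = 0              # read position in the sector data
    literal = True     # the first substring is uncompressed
    while pos < sector_size:
        end = data[i]  # raises IndexError if the data is truncated
        i += 1
        first_null = literal and pos == 0
        if end == 0 and sector_size == 256 and not first_null:
            end = 256  # assumption: zero stands for offset 256 here
        if end < pos or end > sector_size:
            raise ValueError("invalid ending offset")
        if literal:
            chunk = data[i:i + end - pos]
            if len(chunk) != end - pos:
                raise ValueError("sector data is truncated")
            sector[pos:end] = chunk
            i += end - pos
        else:
            fill = data[i]
            i += 1
            sector[pos:end] = bytes([fill]) * (end - pos)
        pos = end
        literal = not literal
    return bytes(sector), i
```

For a 256 byte sector, the sector data 00 00 E5 consists of a null
uncompressed string followed by a compressed substring that fills the
whole sector with hex E5.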

Type 44, modify end.

The compression is relative to the previous sector.  The sector data
contains only the ending portion.  The beginning portion is not
changed.  The first byte of the sector data specifies at what offset
to start modifying the sector.  The remaining bytes of the sector data
are used to modify the ending portion of the sector.  This modification
takes place starting at the byte at the start offset, up to, and
including the last byte of the sector.
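
A decoder for this type could look like the following Python sketch.
It assumes that the previous sector is available as a bytes object of
the full sector length; the name decode_modify_end is hypothetical.

```python
def decode_modify_end(previous: bytes, data: bytes) -> tuple[bytes, int]:
    """Decode type 44 sector data; return (sector, bytes consumed)."""
    start = data[0]                      # offset at which modification starts
    if start >= len(previous):
        raise ValueError("start offset beyond end of sector")
    count = len(previous) - start        # bytes from start to end of sector
    payload = data[1:1 + count]
    if len(payload) != count:
        raise ValueError("sector data is truncated")
    # Keep the beginning of the previous sector, replace the rest.
    return previous[:start] + bytes(payload), 1 + count
```

For a 128 byte sector, the sector data 7E AA BB replaces only the
last two bytes of the previous sector.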

Type 45, end of pass.

This compression type indicates the end of a pass, so it is not a real
compression type.  There is no sector data for this type.  For a multi
file archive, this indicates the end of the file.  The archive is
continued in the next file, unless this pass was the last pass.  For
single file archives, this indicates that the next pass follows within
this file, unless this was the last pass.  The next pass starts with a
header again, followed by a sector number.

Type 46, same as before.

This compression type indicates that the data for this sector is
identical to the data of the previous non-zero sector.  There is no
sector data for this type.

Type 47, uncompressed sector.

The sector data contains the number of bytes required to fill an
entire sector, either 128 or 256 bytes.  No compression of any kind is
performed on this sector type.

Previous sector.

The buffer that holds the contents of the previous non-zero sector is
initialized at the start of a pass if the archive is a multi file
archive.  For single file archives, this buffer is cleared at the
start of the first pass only.

Known bugs and anomalies.

It looks like Diskcomm has a slight problem with double density disks.
Double density sectors are 256 bytes long.  If the buffer contains hex
5EFF bytes, and the sector cannot be compressed, and a sector number
must be included, we must add 259 bytes to the buffer.  To mark the
end of pass, we have to add either one hex 45 byte, or hex 45 00 45.
This might add up to three extra bytes.  The buffer starts at hex
2F00, and if we add these 259 bytes to the hex 5EFF bytes, we write up
into the Diskcomm code which starts at hex 9000.  This area happens to
hold the maximum sector number as input by the user.  This makes the
pass longer than hex 6002 bytes.  Reading such a pass is also a
problem.  Diskcomm processes the two header bytes first, and does not
store them.  Then it tries to read hex 6000 bytes.  Within these hex
6000 bytes, the end of pass compression type must be included.  It
will be missing though, so Diskcomm will not be able to process the
file.  This problem only occurs with double density disks in the
specified exceptional conditions.
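
The pass length in this situation can be checked with a few lines of
Python.  This sketch assumes that the hex 5EFF bytes in the buffer
include the two header bytes; it only looks at the pass length, not
at memory addresses.

```python
PASS_LIMIT = 0x6002         # stated maximum pass length
buffered = 0x5EFF           # bytes in the buffer (assumed to include the header)
packet = 1 + 256 + 2        # type byte, uncompressed sector, sector number
end_short, end_long = 1, 3  # hex 45, or hex 45 00 45

shortest = buffered + packet + end_short
longest = buffered + packet + end_long
print(hex(shortest), hex(longest))  # prints 0x6003 0x6005
```

Both results exceed hex 6002, so the end of pass code falls outside
the range that is read back.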