Exomizer 2 on 6809 by Puls: implementation details

The code starts by preserving the whole context on the stack. Interrupts are left in their current state. The page where the code runs is loaded into DP from PC. U was loaded with the address of the last byte of compressed data before the call, so this byte is copied to bitbuf, which is where the bit stream flows during the decompression. bitbuf is in fact the operand of a load immediate (self-modifying code), to save space and to increase speed: a load immediate is the quickest load instruction.

Y is then loaded with the address of the bits-and-base table, which is built by the following loop. To optimise both space and speed, the loop uses auto-increment addressing (one of the nice features of the 6809), and since the bits and the bases have to be stored alternately, they are interleaved in a single table (in the 6502 code, the tables are separate). This lets us walk through the table simply by letting the auto-increments advance Y. The loop is basically the C code with a few twists, again for optimisation. The << operator is implemented in the loop labeled "roll". Instead of decrementing the iteration count down to 0, which is the usual way to write a loop, we complement the iteration count and increment it up to 0. Why? Because we need a 1 for "1 << b1", and COM sets CC:Carry to 1 in a single one-byte instruction (forcing the carry value any other way is quite convoluted, and using an extra variable costs a lot of space)! That 1 in CC:Carry is the 1 of "1 << b1".

The main decrunching loop is labeled "mloop". Y is first loaded with the address of the last byte of the output data. Then again, we essentially implemented the C code, but with some optimisations. The loop starts by fetching one bit. If the bit is not 1, D contains 0 at that point, so we can use it to initialise the index calculation without explicitly loading a 0. To save a few more bytes, the calculation loop is reversed and its first step is skipped with an "fcb $8c", which is interpreted as a CMPX instruction with no effect: this amounts to a branch over the first instruction of the loop (labeled "rbl"), because that instruction is swallowed as the operand of the CMPX. The index is stored directly into the load-immediate operand labeled "idx", so that it is immediately available when exiting the loop, without a costly store on the stack or similar operation.

The index is then easily compared to 16 with a series of conditional branches. If the tests are made in the right order, only two branches are needed: when the index is 17, decrementing it once (one byte, 2 cycles) turns it into 16, which is conveniently reused as the number of bits to fetch in that case. The index calculation is immediately followed by the literal copy loop (case index = 17, or very first bit = 1). The copy loop for non-literals is somewhere else (label "cpy2"). We decided to implement two loops instead of using a flag to select the "copy mode" as in C, because this is faster and takes less space than manipulating an extra variable.

For non-literals, the length and offset are computed starting at "coffs". By reordering the values of the switch case (compare tab1/tab2 with the 6502 code), we spare a conditional branch. base[index] + readbits(&in, bits[index]) is computed by the "cook" subroutine. Since the bits and bases are interleaved, stepping through the array means advancing by 3. Using three ABX instructions to add B to X three times is a nice way to do it, but ASLB is equivalent to *2 and is faster than ABX, so 2*B + B is better than B + B + B. And since the bits and bases are interleaved, a convenient 1,X reads the base.
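To make the table and "cook" logic more concrete, here is a rough C sketch of the pattern described above: an interleaved table with 3 bytes per entry (bits, then a 16-bit base accumulated from "1 << bits"), and a cook() that computes base[index] + read_bits(bits[index]). The names, the entry count and the bit widths are placeholders, not Exomizer's exact table layout, and the stand-in bit reader works forwards over a memory buffer instead of backwards through U as the real code does.

    #include <stdint.h>

    #define TABLE_ENTRIES 52            /* placeholder entry count */

    /* One entry = 3 consecutive bytes: bits, base hi, base lo.  This mirrors
       the interleaved table the 6809 walks with Y auto-increment, and the
       index*3 step done with ASLB + ABX (2*B + B). */
    static uint8_t table[TABLE_ENTRIES * 3];

    /* Stand-in bit reader (forward, MSB first) so the sketch is
       self-contained; the real getbits reads the stream backwards. */
    static const uint8_t *in_ptr;
    static int in_bitpos;

    static uint16_t read_bits(int n)
    {
        uint16_t v = 0;
        while (n-- > 0) {
            uint8_t byte = in_ptr[in_bitpos >> 3];
            v = (uint16_t)((v << 1) | ((byte >> (7 - (in_bitpos & 7))) & 1));
            ++in_bitpos;
        }
        return v;
    }

    /* Build the interleaved bits/base table.  The accumulated "1 << bits"
       is the 1 that the COM trick provides for free on the 6809. */
    static void build_table(void)
    {
        uint16_t base = 1;              /* placeholder start value */
        for (int i = 0; i < TABLE_ENTRIES; ++i) {
            uint8_t bits = (uint8_t)read_bits(4);   /* placeholder width */
            table[i * 3 + 0] = bits;
            table[i * 3 + 1] = (uint8_t)(base >> 8);
            table[i * 3 + 2] = (uint8_t)(base & 0xFF);
            base = (uint16_t)(base + ((uint16_t)1 << bits));
        }
    }

    /* The "cook" step: base[index] + read_bits(bits[index]).  Reading the
       base at e[1]/e[2] corresponds roughly to the 1,X access mentioned
       above. */
    static uint16_t cook(uint8_t index)
    {
        const uint8_t *e = &table[index * 3];
        uint16_t base = (uint16_t)(((uint16_t)e[1] << 8) | e[2]);
        return (uint16_t)(base + read_bits(e[0]));
    }

These are fragments meant to sit inside a full decruncher; they only illustrate the interleaving and the index*3 addressing, not the actual table contents.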
The non-literal copy loop is similar to the literal one, except that it uses an offset which is hardcoded in the LDA. This offset was written just before entering the loop, as a result of the "cook" subroutine. The decruncher ends by saving Y (the address of the first byte of decompressed data) and restoring the context.

The getbits subroutine fetches up to 16 bits from the U stream and returns them in D. Using a local variable on the stack turned out to be more compact than anything else, since several accumulators are needed here and we only have two (A and B). By reordering the loop and making wise use of the ROL instruction, the carry flag in CC can be used instead of masking the value with a 16th bit (as in the C code) when the end of a sequence is reached. The carry is rolled in automatically and no extra variable is needed: when bit_buffer is down to 1, rolling it right puts that 1 into CC:Carry and leaves 0 in the buffer; the test for 0 then tells us to load the next byte from the stream and roll it, so the highest bit of the new buffer receives the carry, which was 1. So in fact, we reuse the lowest bit of the previous bit_buffer as the marker instead of masking. (A small C sketch of this marker-bit trick is given at the end of this note.)

While the code is still quite fast, we essentially worked on its size. Note that there is no extra variable space apart from the bits/base array.

Possible improvements/modifications:
- The use of DP could be suppressed completely. Pros: -1 byte in size, no need to place the code inside a page. Cons: runs slower.
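As promised above, here is a small, self-contained C sketch of the marker-bit idea. It is a simplified model, not the actual getbits routine: the names are hypothetical, it reads the stream forwards instead of backwards through U, and it writes the reused marker explicitly at refill time, where the 6809 gets it for free by rolling the carry back into the buffer.

    #include <stdint.h>
    #include <stdio.h>

    static const uint8_t *stream;   /* next crunched byte to fetch     */
    static uint8_t bit_buffer;      /* data bits plus one marker bit   */

    static void bits_init(const uint8_t *crunched)
    {
        stream = crunched;
        bit_buffer = 1;             /* only the marker: first fetch refills */
    }

    /* Fetch one bit.  Shifting the buffer right yields the bit; when the
       buffer reaches 0, the bit we just shifted out was the marker, not
       data, so we refill from the stream and put the marker back on top.
       No bit counter and no masking are needed. */
    static int get_bit(void)
    {
        int bit = bit_buffer & 1;
        bit_buffer >>= 1;
        if (bit_buffer == 0) {
            uint8_t next = *stream++;                /* real code: via ,-U */
            bit = next & 1;                          /* the bit rotated out */
            bit_buffer = (uint8_t)(0x80 | (next >> 1)); /* marker reused on top */
        }
        return bit;
    }

    /* Fetch up to 16 bits, like the getbits subroutine returning D. */
    static uint16_t read_bits(int n)
    {
        uint16_t v = 0;
        while (n-- > 0)
            v = (uint16_t)((v << 1) | (uint16_t)get_bit());
        return v;
    }

    int main(void)
    {
        static const uint8_t demo[] = { 0xB4, 0x2F };
        bits_init(demo);
        printf("%04X\n", (unsigned)read_bits(16));   /* consumes exactly 16 bits */
        return 0;
    }

Starting the buffer with only the marker forces a refill on the first fetch; the real code instead seeds bitbuf directly with the last byte of the crunched data, as described at the beginning of this note.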