Скачать CompMark - Splay Tree Adaptive Data Compression

Edwin Floyd
27.12.1989
Скачать файл (18,20 Кб)

COMPMARK.PAS - Adaptive data compression using "Splay" tree with Markov model. This algorithm was originally implemented in Pascal on the IBM PC by Kim Kokkonen [72457,2131], TurboPower Software, 8-16-88.

His documentation follows:

"Based on an article by Douglas W. Jones, 'Application of Splay Trees to Data Compression', in Communications of the ACM, August 1988, page 996.

"This is a method somewhat similar to Huffman encoding (SQZ), but which is locally adaptive. It therefore requires only a single pass over the uncompressed file, and does not require storage of a code tree with the compressed file. It is characterized by code simplicity and low data overhead. Compression efficiency is not as good as recent ARC implementations, especially for large files. However, for small files, the efficiency of SPLAY approaches that of ARC's squashing technique."

I have re-implemented the algorithm in assembler with some changes:

  1. My intended use for this unit is to compress a relatively small data buffer as one might wish to do before transmitting it over a communications channel or storing it on disk. Consequently, this unit compresses and decompresses an in-memory buffer rather than a file. InitCompress initially balances the Splay tree[s] in the work area. The work area retains any tree adaptations done during compression or expansion until InitCompress is called again. Therefore, If you wish to make each buffer independently expandable, you must call InitCompress before each call to CompressData. ExpandData will detect what you have done and automatically call InitCompress where necessary.
  2. I run-length-encode the input before compressing it with the Splay tree algorithm. This improves the compression ratio where the input contains long runs of duplicate characters.
  3. Kim's original implementation used a unique trailer character to mark the end of data. I store the pre-compressed data length as the first word in the compressed data buffer and do not use a unique trailer character. This permits the uncompressed length to be determined by inspection and, because the ExpandBuffer routine stops when the output length is achieved, transmission errors will be less likely to blow out a buffer on the receiving end. The "Bits" parameter from InitCompress is also stored as the third byte in the buffer.
  4. I have implemented the "Markov modeling" technique outlined in the Jones ACM reference. You may (indirectly) indicate the number of states in the InitCompress procedure. The work area size requirements are outlined in the comments on that proc. InitCompress(0) should reproduce the compression behavior of Kim's original SPLAY.PAS. The work area is passed as a parameter to the assembler primatives so they may be fully re-entrant.
  5. I have added objects for management of compressed sequential record files on disk (see below).

Cautions:

  1. CompressData and ExpandData both read/write their input/output under the constraints of the 8086 segmented archetecture and I do not normalize the input/output pointers before starting. Therefore, you should call these routines with normalized pointers, and expect invalid output if the input/output length exceeds 64k minus Ofs(Dest).
  2. The compressed output data may actually be longer than the input data if the input already has high "entropy". Compression typically increases the data entropy. Multiple compressions, therefore, are usually a waste of time.
  3. As indicated in the ACM reference, this compression technique does not perform as well as LZW and its variations on large text files. It should be considered only where working storage is very scarce and/or the data to be compressed is expected to contain considerable redundency at the character level. The reference indicates that this algorithm can do especially well with image files.

This program is contributed to the public domain. Please report bugs to the author:

Edwin T. Floyd [76067,747]
#9 Adams Park Ct.
Columbus, GA 31909
404-576-3305 (work)
404-322-0076 (home)

History

12-27-89 Added compressed sequential file objects
12-07-89 Added 'cld' to compmark.asm, added auto-init detection
         logic
10-15-89 Initial Upload