paq8hp10 (mirror), Mar. 26, 2007, by Alexander Ratushnyak, was derived from paq8hp9 as a Hutter Prize entry. The unzipped size is 103,224 bytes. Only the -7 option works.
Options select memory usage as shown in the table. Early versions took no options.
Compression        Compressed size          Decompressor Total size   Time (ns/byte)
Program   Options      enwik8       enwik9   size (zip)  enwik9+prog     Comp   Decomp   Mem  Note
-------   -------  ----------  -----------  -----------  -----------  -------  -------  ----  ----
p5                 31,255,092                   9,298 s                  3421              1     6
p6                 25,377,998                   9,421 s                  4190             16     6
p12                24,714,219                   9,598 s                  4160             16     6
paq1               22,156,982                  16,436 s                  7800     7790    50
paq6 v2   -8       19,589,267                  26,548 s                 47624            808
paqar 4.5 -7       18,388,609                 414,164 s                118690   119010   470
paq8f     -7       18,289,559                  34,371 x                 68960            854
          -8       18,075,265                  34,371 x                 69170           1693
paq8g     -7       17,817,246                 804,867 s                 44130            854
paq8h     -7       17,674,700  147,195,723    801,612 s  147,997,335    56511    57278   854     5
raq8g     -7       18,132,399                  33,483 x                 84555    84793  1089
          -8       17,923,022                  27,660 x                337430  ~330000  2095    17
          -8       17,923,022                  27,660 x                196540  ~196000  2095    15
paq8hp1   -7       17,566,769                 205,783 x                 60170    60660   748
          -8       17,397,023  142,477,977    205,783 x  142,683,760    63317           1595
paq8hp2   -7       17,390,490                 204,557 x                 62000    62330   747
          -8       17,223,661  141,145,684    204,557 x  141,350,241    65323           1584
paq8hp3   -7       17,241,280                 177,477 x                 61360    59690   742
          -8       17,085,021  139,905,045    177,477 x  140,082,522    63420           1586
paq8hp4   -7       17,039,173                 198,525 x                ~65000    65110   755
          -8       16,889,237  138,188,695    198,525 x  138,387,220    67956    68120  1598
paq8hp5   -7       16,898,402                 161,887 x                 76300    77710   900    19
          -8       16,761,044  137,017,311    161,887 x  137,179,198   ~85153    75162  1787
paq8hp6   -7       16,731,800  138,828,889    166,715 x  138,995,604    74953    73707   941
          -8       16,568,451  135,281,289    166,715 x  135,448,004    60865           1807    21
paq8j     -7       18,208,284                  39,366 s                138030   138260   959
          -8       17,991,628                  39,366 s                138990   136500  1896
paq8ja    -7       18,184,224                  39,781 s                148560   143200   993
          -8       17,968,233                  39,781 s                154700   153990  1965
paq8jb    -7       18,180,081                  39,982 s                148570   148200  1009
          -8       17,964,363                  39,982 s                188590   190190  1999
paq8jc    -7       18,185,705                  40,064 s                150910   152080  1017
          -8       17,970,943                  40,064 s                224410   234900  2015
paq8hp7a  -7       16,592,672  137,441,743    150,678 x  137,592,421    79795            940
          -8       16,431,239                 150,678 x                 76940    77600  1790
paq8hp7   -7       16,579,500                 151,633 x                 79620    79660   940
          -8       16,417,646  133,835,408    151,633 x  133,987,041    66074           1850    21
paq8jd    -7       18,158,159                  40,460 s                157340   156350  1030
          -8       17,943,042                  40,460 s                406730           2028
paq8hp8   -7       16,528,353                 151,711 x                 79580    79970   940
          -8       16,372,960  133,271,398    151,711 x  133,423,109    64639           1849    22
paq8k     -8       18,239,915                  41,881 s                457150           1463
paq8hp9   -7       16,516,789  136,676,674    111,653 x  136,788,327    84529    85957   940
paq8l     -6       18,518,485                  35,955 x                133910            435
          -7       18,168,563                  35,955 x                134770            837
          -8       17,916,450                  35,955 x                136000   136390  1643
paq8hp10  -7       16,490,947                 102,256 x                 86720    88890   940
paq8hp1 through paq8hp9 can be used as a preprocessor to other compressors by compressing with option -0. In the following tests on ppmonstr, options were tuned for the best possible compression of enwik8 with 2 GB memory (1.65 GB available under WinXP). The xml-wrt 2.0 options are -l0 -w -s -c -b255 -m100 -e2300 (level 0, turn off word containers, turn off space modeling, turn off containers, 255 MB buffer for dictionary, 100 MB buffer, 2300 word dictionary). The xml-wrt 3.0 options are -l0 -b255 -m255 -3 -s -e7000 (-3 = optimize for PPM).
xml-wrt prepends the dictionary to its output. To make the comparison fair, the compressed size of the paq8hp dictionary must be added to the paq8hp results. This is done in two ways: first, by compressing the preprocessed text and the dictionary separately and adding the compressed sizes; second, by prepending the dictionary to the preprocessed text before compression. The first method gives a result about 1-2 KB smaller.
The uncompressed size of each dictionary for paq8hp1 through paq8hp4 is 398,210 bytes. They contain identical words, but in different order. The first two dictionaries are identical. These two compress smaller (81,190 bytes) than the others because they are sorted alphabetically. The dictionary for paq8hp5 is 411,681 bytes. It contains all of the words in the first 4 dictionaries plus 1280 new words (44,880 total).
Preprocessor Compressor enwik8 dict total dict+enwik8
------------ ---------- ---------- ------- ---------- ---------
paq8hp1 -0 | ppmonstr J -m1650 -o64 18,322,077 81,190 18,403,267 18,403,991
paq8hp2 -0 | ppmonstr J -m1650 -o64 18,266,424 81,190 18,347,614 18,349,587
paq8hp3 -0 | ppmonstr J -m1650 -o64 18,197,797 107,583 18,305,380 18,306,690
paq8hp4 -0 | ppmonstr J -m1650 -o64 18,170,944 107,590 18,278,534 18,280,098
paq8hp5 -0 | ppmonstr J -m1650 -o64 18,154,921 111,935 18,266,856 18,267,556
xml-wrt 2.0 | ppmonstr J -m1650 -o64 18,625,624
xml-wrt 3.0 | ppmonstr J -m1650 -o64 18,494,374
(none) ppmonstr J -m1650 -o16 19,062,555
ppmonstr J -m1650 -o32 19,084,964
ppmonstr J -m1650 -o64 19,098,634
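The gap between the two ways of accounting for the dictionary can be checked directly from the paq8hp rows of the table. The following is a minimal Python sketch of that arithmetic; the names RESULTS and accounting_gap are illustrative only and do not come from any of the programs tested.

```python
# Figures copied from the paq8hp rows of the table above:
# (compressed enwik8, compressed dictionary, dictionary prepended to enwik8)
RESULTS = {
    "paq8hp1": (18_322_077, 81_190, 18_403_991),
    "paq8hp2": (18_266_424, 81_190, 18_349_587),
    "paq8hp3": (18_197_797, 107_583, 18_306_690),
    "paq8hp4": (18_170_944, 107_590, 18_280_098),
    "paq8hp5": (18_154_921, 111_935, 18_267_556),
}


def accounting_gap(results):
    """Return {name: (separate_total, prepended_total - separate_total)}."""
    gaps = {}
    for name, (text, dictionary, prepended) in results.items():
        separate = text + dictionary  # method 1: compress separately, add the sizes
        gaps[name] = (separate, prepended - separate)  # method 2 minus method 1
    return gaps


if __name__ == "__main__":
    for name, (separate, gap) in accounting_gap(RESULTS).items():
        print(f"{name}: separate = {separate:,}, prepended is {gap:,} bytes larger")
```

With the table's figures the gaps are 724, 1,973, 1,310, 1,564, and 700 bytes, so compressing the dictionary separately gives the smaller total in every row.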
The transform done by paq8hp1 through paq8hp5 is based on WRT by Przemyslaw Skibinski, which first appeared in PAsQDa and paqar, and later in paq8g and xml-wrt. The steps are as follows:
- The input is parsed into sequences of all uppercase letters or all lowercase letters, or one uppercase letter followed by lowercase letters, e.g. "THE", "the", or "The".
- All uppercase words are prefixed by a special symbol (0E hex in paq8hp3, paq8hp4, paq8hp5). If a lowercase letter follows with no intervening characters (e.g. "THEre"), then a special symbol (0C hex) marks the end of the uppercase word (e.g. "THEre" -> 0E "the" 0C "re").
- Capitalized words are prefixed with 7F hex (paq8hp3) or 40 hex (paq8hp4, paq8hp5) (e.g. "The" -> 40 "the").
- All letters are converted to lower case.
- Words are looked up in the dictionary. The first 80 words in the dictionary are coded with 1 byte: 80, 81, ... CF (hex).
- The next 2560 words (paq8hp1-4) or 3840 words (paq8hp5) are coded with 2 bytes: D080, D081, ... EFCF (paq8hp1-4), or D080, ... FFCF (paq8hp5).
- The last 40960 words are coded with 3 bytes: F0D080, F0D081, ... FFEFCF.
- If a word does not match, then the longest matching prefix with length at least 6 is coded and the rest of the word is spelled.
- If there is no matching prefix, then the longest matching suffix with length at least 6 is coded after spelling the preceding letters.
- If no matching word, prefix, or suffix is found, the word is spelled. Capitalization coding occurs regardless.
- Any input bytes with special meaning are escaped by prefixing them with 06 hex. These bytes are 06, 0C, 0E, 40 or 7F (whichever marks capitalized words), and 80-FF.
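The capitalization and dictionary coding steps above can be sketched as follows. This is a minimal Python illustration, not the paq8hp source. The function names, the regular expression, and the treatment of a single capital letter as an all-uppercase word are assumptions. The flag bytes are those given for paq8hp4, and the code layout covers paq8hp1 through paq8hp4 only. It also assumes that codes are assigned in dictionary order with the last byte varying fastest, as the sequences D080, D081, ... and F0D080, F0D081, ... suggest. Escaping, prefix and suffix matching, and the dictionary lookup itself are omitted.

```python
import re

UPPER_FLAG = 0x0E  # prefix for an all-uppercase word
UPPER_END = 0x0C   # marks where lowercase letters directly follow an uppercase word
CAP_FLAG = 0x40    # prefix for a capitalized word (paq8hp3 uses 0x7F instead)

# Assumed tokenizer: Capitalized, UPPERCASE, or lowercase runs of ASCII letters.
TOKEN = re.compile(r"[A-Z][a-z]+|[A-Z]+|[a-z]+")


def case_flags(text):
    """Lowercase every word in ASCII `text`, inserting capitalization flag bytes."""
    out = bytearray()
    pos = 0
    prev = None  # previous word match, used to detect "THEre"-style joins
    for m in TOKEN.finditer(text):
        out += text[pos:m.start()].encode("ascii")  # non-letters copied unchanged
        word = m.group()
        if word.isupper():
            out.append(UPPER_FLAG)
        elif word[0].isupper():
            out.append(CAP_FLAG)
        elif prev is not None and prev.end() == m.start() and prev.group().isupper():
            out.append(UPPER_END)  # e.g. the "re" in "THEre"
        out += word.lower().encode("ascii")
        pos, prev = m.end(), m
    out += text[pos:].encode("ascii")
    return bytes(out)


def word_code(index):
    """Return the code for the dictionary word at 0-based `index` (paq8hp1-4 layout)."""
    if index < 0:
        raise ValueError("index must be non-negative")
    if index < 80:                      # 1 byte: 80 .. CF
        return bytes([0x80 + index])
    index -= 80
    if index < 2560:                    # 2 bytes: D080 .. EFCF
        return bytes([0xD0 + index // 80, 0x80 + index % 80])
    index -= 2560
    if index < 40960:                   # 3 bytes: F0D080 .. FFEFCF
        return bytes([0xF0 + index // 2560,
                      0xD0 + index // 80 % 32,
                      0x80 + index % 80])
    raise ValueError("index beyond the 43,600 codes available")
```

Under these assumptions, case_flags("THEre") returns the bytes 0E "the" 0C "re", word_code(80) returns D0 80, and word_code(43599) returns FF EF CF. The three code ranges hold 80 + 2560 + 40960 = 43,600 words.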
WRT has additional capabilities depending on input, such as skipping encoding if little or no text is detected. The dictionary format is one word per line (linefeed only) with a 13-line header.
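Reading a dictionary in this format amounts to skipping the header and splitting on linefeeds. A minimal Python sketch follows. The function name load_dictionary is hypothetical, and the header lines are skipped without being interpreted because their contents are not described here.

```python
HEADER_LINES = 13  # header length stated in the text


def load_dictionary(data, header_lines=HEADER_LINES):
    """Return the dictionary words (as bytes objects) from raw file contents."""
    lines = data.split(b"\n")        # linefeed-only line endings
    if lines and lines[-1] == b"":   # drop the empty piece after a final linefeed
        lines.pop()
    if len(lines) < header_lines:
        raise ValueError("file is shorter than the header")
    return lines[header_lines:]
```

The file would be opened in binary mode and its contents passed in whole, so that no newline translation takes place. The position of a word in the returned list is then its 0-based rank in the dictionary.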