Compress/Compress program documentation.rtf at master · rsgoheen/Compress · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
{\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\fnil\fcharset0 Calibri;}}
{\*\generator Msftedit 5.41.21.2510;}\viewkind4\uc1\pard\sa200\sl276\slmult1\lang9\f0\fs22\par
General use\par
Compress.exe [-c | -d] InputFile OutputFile\par
Where -c compresses and -d decompresses the inputfile to the outputfile.\par
\par
Algorithm\par
I chose the Huffman algorithm for string compression because it's a straightforward, established algorithm that's relatively easy to understand and implement.  I didn't do any kind of comparison with other algorithms -- given more time I certainly would have done more research.\par
Huffman compression is based around substituting shorter keys for the more frequently appearing characters. Less frequent characters are substituted with longer keys.\par
The data structure for a Huffman dictionary is an unbalanced binary tree, with the shorter keys appearing as nodes closer to the root.   I've implemented this in the code as a tree data structure that's used for both compressing and decompressing. The first step is to analyze the input (or a representative input) to determine the frequency.  Tree nodes are generated and assembled as a tree from this frequency dictionary.  A SortedSet<T> was used as a priority queue when generating the tree.  The use of this .Net collection required the IComparable and IEquitable interfaces on the HuffingtonNode class for everything to work correctly.\par
Decompressing means reading in the array of bits and navigating down from the root until a leaf is reached, then returning to the root node until all bits are read. \par
Compressing is decompressing in reverse, so starting from the leaf representing the character to be substituted and working back to the root.  To make this easier, I added a dictionary with the character as the key for quicker location of the correct leaf (rather than searching from the root recursively), and a link from each node to its parent node for easier navigation back to the root.\par
The same dictionary that did the compression must be available when decompressing the file.  Rather than persisting the tree structure, I persisted the frequency dictionary that was used to generate the tree, as the a single frequency dictionary will always result in the same tree.\par
The data from compressing is an array of bit values, which, if written directly to disk, would take more room than the existing file.  Therefore an additional step of packing the bits into byte values is undertaken.  This results in a savings of about 50% in file size.  On my computer, the original was 18,666kb, the output file (with header) was 9,799 kb, or 53% the originals size.  By comparison, using 7-Zip at its highest settings resulted in a file of 2,695 kb.\par
The output file is a header followed by the compress stream (converted from a bit array into a byte array).  The header is the dictionary information plus the length of the bit array prior to packing into the byte array. The dictionary information is the keys and values, converted to string values, along with the length of those strings.  Size fields are read in as 4 byte int values.\par
\par
Known issues and limitations\par
The primary limitation is that the compression process is effectively a batch process in memory. Larger files or lower memory computers (I'm developing with 8GB RAM) may encounter issues with larger files.  I would further refactor this code to stream the input and output file as a next step.\par
Since the code converts to and from byte arrays, there is the risk in moving between systems where data types have different sizes.  However, since the standard .Net int is an Int32 whatever the underlying platform (if I remember correctly), this should be a pretty low risk.  I would want to evaluate and test better for different (32 vs 64) systems.\par
Reading the entire file to generate the frequency dictionary is inefficient, especially since that means reading the entire file twice.  I considered hardcoding a dictionary into the program (either in code or settings), but did not.  If all the files are pretty similar, a dictionary could be generated once.\par
There could be better validation of input, especially on the decompression step.\par
There could be more unit tests.  I need to abstract out the file specific streams to allow the overall process to be unit tested, rather than the smaller atomic steps.  Refactoring this would likely be my next step.\par
\par
Other comments\par
It has been a while since I've had any problem that benefited from generating and using a tree, and even longer since I've had to worry about bits and bytes (you can spend a long time in a managed language without needing to think about the size of an int).  It took me a bit of time to think about how best to build the tree, and I made a couple of mistakes around packing and unpacking the bits, especially since this algorithm is all about variable length keys.  The unpacking logic is the part I'm least happy with, as that logic, in Program.cs, could be much better organized, both for readability and unit testing.\par
}