I am seeking advice on achieving the most efficient storage of an ordered list, that is, the minimum storage for a list.
An ordered list of 256 unique items, where each item is a unique number from 0 to 255, would normally require 2^16 bits of storage: 2^8 places, each place holding a 2^8 value.
However, this information ought to be storable in nearly 2^15 bits.
The second item, rather than being in the 2nd place of 256, can be viewed as being the next of the remaining 255, the next item as the next of the remaining 254, and so on.
This is a continuation of not needing to store the last item of a sorted list at all, because that item must be in the last place by default.
In this case you can simply see that you have 2^8 - 1 places, each place holding a 2^8 value, which is less than 2^16.
So how does this get down to 2^15 + 1 bits of storage? Or is there a proof that says otherwise? If there is a proof, I would hope it doesn't say 2^16 bits of storage are needed, as I have just shown that that is wrong!
I am hopefully just unaware of the terminology to identify work on this subject.
Can anyone advise of work on the matter?
Thank you for your time.
Glenn
Upon clarification of the question as storage of some particular permutation of 256 items (the 8-bit numbers from 0 to 255), I have updated my answer. The prior discussion is below for posterity.
Answer
1684 bits.
Explanation
In this case, the clearest analysis comes through encoding and informational entropy. Again, we use the pigeonhole principle: in order to uniquely determine a particular permutation, we must have at least as many encodings as we have possible messages to be encoded.
An example may be helpful: consider a list of 256 numbers, each of which is an 8-bit number. The first item has 256 possible values, as does the second, as does the third, and so on. Overall, we have 256^256 possible messages, so we need at least 256^256 possible encodings. To determine the number of bits needed, we simply take the base-2 logarithm: log2(256^256) = 256 * log2(256) = 256 * log2(2^8) = 256 * 8 = 2048, so to encode this list we only need 2048 (i.e. 2^11) bits. You may note this is the same as taking 8 bits per item and multiplying it by the number of items. Your original question was incorrect on the storage needed, as you supposed that each item requires 2^8 bits, i.e. a 256-bit integer, which could store values from 0 to ~10^77.
With this understanding, we turn our attention to the problem at hand. There are 256 possibilities for the first item, then 255 possibilities for the second item, 254 possibilities for the third item, etc, until there is only 1 possibility for the last item. Overall, we have 256! possibilities, so we need at least 256! encodings. Again, we use the base 2 logarithm to determine how many bits we need, so we need log2(256!) bits. A nice property of logarithms is that they turn products into sums, so log2(256!) = log2(256) + log2(255) + log2(254) + ... + log2(2) + log2(1). This is analogous to using 8 bits for each of the 256 items, but here as each item has progressively less information, it requires fewer bits. Also note that log2(1) is 0, which corresponds to your observation that you don't need any information to encode the last item. Overall, when we perform this sum, we end up with 1683.996..., so we need 1684 bits to encode these ordered lists. Some variable-length encodings could get lower, but they can never go lower than log2(256!) bits on average.
Coming up with an encoding that uses 1684 bits is not simple, but here's one method that's more efficient than the original full storage. We can note that the first 128 items each have between 129 and 256 possibilities, and encode each of these items with 8 bits. The next 64 items each have between 65 and 128 possibilities, so we can encode each of these items with 7 bits. Continuing on, we end up using
(128 * 8) + (64 * 7) + (32 * 6) + (16 * 5) + (8 * 4) + (4 * 3) + (2 * 2) + (1 * 1) + (1 * 0) = 1793 bits to store the list.
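For concreteness, here is a rough C++ sketch (my own, not part of the original answer) of the 1793-bit scheme just described. For the i-th item there are 256 - i candidates left, so it stores the item's index among the remaining candidates in ceil(log2(256 - i)) bits; the function name and the bit-vector representation are illustrative choices.

#include <algorithm>
#include <cstdint>
#include <vector>

std::vector<bool> encode_permutation(const std::vector<uint8_t>& perm) {
    std::vector<uint8_t> remaining(256);
    for (int v = 0; v < 256; ++v) remaining[v] = static_cast<uint8_t>(v);

    std::vector<bool> bits;
    for (uint8_t item : perm) {
        size_t left = remaining.size();     // number of candidates still unused
        size_t idx = std::find(remaining.begin(), remaining.end(), item)
                     - remaining.begin();   // position of the item among them
        remaining.erase(remaining.begin() + idx);

        int width = 0;                      // bits needed to distinguish `left` candidates
        while ((size_t{1} << width) < left) ++width;

        for (int b = width - 1; b >= 0; --b)
            bits.push_back(((idx >> b) & 1) != 0);
    }
    return bits;                            // 1793 bits for a full 256-item permutation
}

The decoder runs the same loop in reverse: it reads ceil(log2(256 - i)) bits, uses them as an index into its own copy of the remaining values, and removes the chosen value.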
Pre-clarification discussion
If all you're ever interested in encoding is an ordered list of 256 unique items, where each item is an 8-bit integer, then you can do it in 1 bit: 0 means you have that list, 1 means you don't, because there's only one possible list satisfying those criteria.
If you're trying to store anything in memory, you need at least as many configurations of that memory as there are different options (otherwise by the pigeonhole principle there would be at least two options you couldn't differentiate). Assuming by "ordered" you mean that they are strictly increasing or decreasing, an n-element ordered list of 8-bit integers, without repetition, has 256 choose n possible options (as there's only one possible configuration, the ordered one). Summing 256 choose n over all possible values of n (i.e. 0 to 256), gives 2^256, or 2^(2^8). Therefore, a perfect encoding scheme could use as few as 2^8 bits to store this particular kind of list, but couldn't encode any other kind of list.
EDIT: If you want to read more about this sort of stuff, read up on information theory.
EDIT: An easier way to think about this encoding is like this: We know the list is ordered, so if we know what items are in it then we know what order they're in, so we only need to know which items are in the list. There's 256 possible items (0 through 255), and if we assume the items in the list are unique then each item is either in the list, or it isn't. For each item, we use 1 bit to store whether or not it's in the list (so bit 0 records if the list contains 0, bit 1 records if the list contains 1, etc, bit 255 records if the list contains 255). Tada, we've stored all of the information about this 256 element array of bytes in only 256 = 2^8 bits.
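A minimal C++ sketch of this membership-bit encoding (my own illustration, using std::bitset; the function names are made up):

#include <bitset>
#include <cstdint>
#include <vector>

// Bit i records whether value i appears in the ascending, duplicate-free list;
// the order is implied, so nothing else needs to be stored.
std::bitset<256> encode_sorted_unique(const std::vector<uint8_t>& list) {
    std::bitset<256> bits;
    for (uint8_t v : list) bits.set(v);
    return bits;
}

std::vector<uint8_t> decode_sorted_unique(const std::bitset<256>& bits) {
    std::vector<uint8_t> list;
    for (int v = 0; v < 256; ++v)
        if (bits.test(v)) list.push_back(static_cast<uint8_t>(v));  // comes out ascending
    return list;
}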
EDIT: Let's examine an analogous situation. We have an ordered, unique list of up to 4 items, each of which is 2 bits. We'll write out all of the possible circumstances: [], [0], [1], [2], [3], [0,1], [0,2], [0,3], [1,2], [1,3], [2,3], [0,1,2], [0,1,3], [0,2,3], [1,2,3], [0,1,2,3]. These are the only possible lists where each element is two bits, the elements are unique, and the elements are in ascending order. Notice that I can't actually make it more efficient just by leaving the 3 off of [0,1,2,3], because I also need to distinguish it from [0,1,2]. The thing is, asking how much space you need to "store" something isolated from context is almost unanswerable. If all you want is to store enough information to recover it (i.e. you want lossless compression), and if you presume you know the properties, you can get your compression ratio pretty much as low as you want. For example, if you gave me an ordered list containing every element from 0 to 1,000,000 exactly once, even though storing that list directly in memory takes roughly 2^25 bits (a million 32-bit integers), you can recover the list from the known properties and the two numbers 0 and 1,000,000, for a total of 40 bits.
Related
DNA strings can be of any length and are composed of any combination of the 5 letters A, T, G, C, N.
What could be an efficient way of compressing a DNA string over this 5-letter alphabet? Instead of using 3 bits per letter, can we compress and retrieve efficiently using fewer bits? Can anybody suggest pseudocode for efficient compression and retrieval?
You can, if you are willing to (a) use a different bit length for each character and (b) always read from the start, never from the middle. Then you can use a code like this:
A - 00
T - 01
G - 10
C - 110
N - 111
Reading from left to right, you can only split a stream of bits into characters in one way. You read 2 bits at a time, and if they are "11" you need to read one more bit to know which character it is.
This is based on the Huffman coding algorithm.
Note:
I don't know much about DNA, but if the characters are not equally probable (i.e. not 20% each), you should assign the shortest codes to the characters with the highest probability.
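For reference, here is a minimal C++ sketch (mine, not part of the original answer) of the fixed prefix code above. The bit stream is kept as a string of '0'/'1' characters for clarity; a real implementation would pack the bits into bytes.

#include <cstddef>
#include <map>
#include <string>

std::string encode(const std::string& dna) {
    static const std::map<char, std::string> code = {
        {'A', "00"}, {'T', "01"}, {'G', "10"}, {'C', "110"}, {'N', "111"}};
    std::string bits;
    for (char c : dna) bits += code.at(c);   // throws on an unknown base
    return bits;
}

std::string decode(const std::string& bits) {
    std::string dna;
    for (std::size_t i = 0; i < bits.size(); ) {
        if (bits[i] == '0')          { dna += (bits[i + 1] == '1') ? 'T' : 'A'; i += 2; }  // 00 / 01
        else if (bits[i + 1] == '0') { dna += 'G'; i += 2; }                               // 10
        else                         { dna += (bits[i + 2] == '1') ? 'N' : 'C'; i += 3; }  // 110 / 111
    }
    return dna;
}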
You have 5 unique values, so you need a base-5 encoding (e.g. A=0, T=1, G=2, C=3, N=4).
In 32 bits you can fit floor(log5(2^32)) = 13 base-5 values.
In 64 bits you can fit floor(log5(2^64)) = 27 base-5 values.
The encoding process would be:
uint8_t *input = /* base-5 encoded DNA letters, one value 0-4 per byte */;
uint64_t packed = 0;
for (int i = 0; i < 27; ++i) {
    packed = packed * 5 + *input++;   // append the next base-5 digit
}
And decoding:
uint8_t *output = /* allocate buffer for 27 letters */;
uint64_t packed = /* next encoded chunk */;
for (int i = 26; i >= 0; --i) {
    output[i] = packed % 5;   // digits come out least-significant first,
    packed /= 5;              // so fill the buffer from the back to preserve order
}
There are plenty of compression methods, but the main question is what data you want to compress:
1. Raw unaligned sequenced data from the sequencing machine (fastq)
2. Aligned data (sam/bam/cram)
3. Reference genomes
You should reorder your reads, putting reads from close genome positions next to each other. For instance, this can let plain gzip compress about 3 times better. There are many ways to do this. You can, for instance, align the fastq to bam and then export back to fastq. You can use a suffix tree/array to find similar reads, the way most aligners work (this needs a lot of memory). You can use minimizers - a very fast, low-memory solution, but not good for long reads with many errors. Good results also come from de Bruijn graph construction, which is used for this purpose (i.e. de novo assembly).
Statistical coding such as Huffman or arithmetic coding would compress this to about 1/3 (one can pass a Huffman stream through a binary arithmetic coder to gain another ~20%).
The best results here come from reference-based compression - just store the differences between a reference and the aligned read.
Little can be done here. Using statistical coding you can get 2-3 bits per nucleotide.
Frankly, I would start with some version of Lempel-Ziv compression (a class of compression algorithms that includes the general-purpose gzip compression format). I note that some of the comments say that general-purpose compression algorithms don't work well on raw genome data, but their effectiveness depends on how the data is presented to them.
Note that most general-purpose compression programs (like gzip) examine their input on a per-byte basis. This means that "pre-compressing" the genome data at 3 bits/base is counterproductive; instead, you should format the uncompressed genome data at one byte per base before running it through a general-purpose compressor. ASCII "AGTCN" coding should be fine, as long as you don't add noise by including spaces, newlines, or variations in capitalization.
Lempel-Ziv compression methods work by recognizing repetitive substrings in their input, then encoding them by reference to the preceding data; I'd expect this class of methods should do a reasonably good job on appropriately presented genome data. A more genome-specific compression method may improve on this, but unless there's some strong, non-local constraint on genome encoding that I'm unaware of, I would not expect a major improvement.
We can use a combination of Roee Gavirel's idea and the following for an even tighter result. Since Roee's idea still maps two of our five characters to 3-bit words, any section of the sequence where at least one of the five characters does not appear, but one of the 3-bit characters does, can be switched to a 2-bit-per-character mapping, reducing the final size.
The condition for switching the mapping is that there is a section where at least one of the five characters does not appear and one of the 3-bit characters appears at least 2 x (section-prefix length in bits) + 1 times. If we order our possible characters (for example, alphabetically), then three prefix bits indicating which specific character is missing (if more than one is missing, we choose the first in order; one code means none is missing) let us immediately assign a consistent 2-bit mapping to the other four characters.
Two ideas for our prefixes:
(1)
3 bits: the missing character (if none, we use Roee's encoding for the section);
x bits: a constant number of bits representing section length. For maximal length sections of 65000, we could assign x = 16.
To justify the prefix use, we'd need a section where one of the five characters does not appear and one of the 3-bit words appears 39 times or more.
(2)
3 bits: the missing character (if none, we use Roee's encoding for the section);
x bits: the number of bits in the next section of the prefix - depends on how many characters the longest section could be. x = 6 would imply the maximal section length could be 2^(2^6)! Unlikely. For maximal length sections of 65000, we could assign x = 4;
the number of bits indicated in the previous part of the prefix, indicating the current section length.
In the example just above, our prefix length could vary between, say, 11 and 23 bits, which means that to justify its use we'd need a section where one of the five characters does not appear and one of the 3-bit characters appears at least 23 to 47 times, depending on the prefix length.
I have a 6-dimensional array of doubles (it has 6 indices), and it is mostly filled with zeros. I don't know yet what type I should use to store it in memory.
But, most importantly:
I would like to save it into a file (a binary file?).
What is the most efficient way to save it?
One requirement is that I can run through all the non-zero entries without passing by the zeros.
If I run 6 nested for loops I'll need too many lives.
Moreover, I don't know how to practically save it: Do I need two files, one acting as an index and the second one containing all the values?
Thanks!
This is probably a solved problem; there are probably sparse-matrix libraries that give you efficient in-memory representations too. (e.g. each row is a list of index:value, stored in a std::vector, linked list, hash, or other data structure, depending on whether inserting single non-zero values in the middle is valuable or whatever other operation is important).
A binary format will be faster to store/load, but whether you go binary or text isn't important for some ways of representing a sparse array. If you write a binary format, endian-agnostic code is a good way to make sure it's portable and doesn't have bugs that only show up on some architectures.
Options:
Simple but kind of ugly: gzip / lz4 / lzma the buffer holding your multidimensional array, writing the result to disk. Convert to little-endian on the fly while saving/loading, or store an endianness flag in the format.
Same idea, but store all 6 indices with each value. This is good if many of the inner-most arrays have no non-zero values at all. Every non-zero value gets a separate record (a line, in a text-based format); a binary sketch of this record layout appears after this list. Sample line (triple-nested example for readability, extends to 6 just fine):
dimensions on the first line or something
a b c val
...
3 2 5 -3.1416
means: matrix[3][2][5] = -3.1416
Use a nested sparse-array representation: each row is a list of index:value. Non-present indices are zero. A text format could use spaces and newlines to separate things; a binary format could use a length field at the start of each row or a sentinel value at the end.
You could flatten the multidimensional array out to one linear index for storage with 32bit integer indices, or you could represent the nesting somehow. I'm not going to try to make up a text format for this, since it got ugly as I started to think about it.
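To make option 2 concrete, here is a hedged C++ sketch (mine; the record layout and field widths are illustrative assumptions) that writes one binary record per non-zero entry, holding all 6 indices plus the double. Iterating the loaded vector visits only the non-zero entries, which was one of the requirements.

#include <array>
#include <cstdint>
#include <fstream>
#include <vector>

struct Entry {
    std::array<uint32_t, 6> idx;   // the 6 indices of a non-zero cell
    double value;                  // its value
};

void save(const std::vector<Entry>& entries, const char* path) {
    std::ofstream out(path, std::ios::binary);
    uint64_t n = entries.size();
    out.write(reinterpret_cast<const char*>(&n), sizeof n);   // record count header
    for (const Entry& e : entries) {
        // Note: this writes host byte order; convert to a fixed endianness
        // (as suggested above) if the file must be portable.
        out.write(reinterpret_cast<const char*>(e.idx.data()), sizeof e.idx);
        out.write(reinterpret_cast<const char*>(&e.value), sizeof e.value);
    }
}

std::vector<Entry> load(const char* path) {
    std::ifstream in(path, std::ios::binary);
    uint64_t n = 0;
    in.read(reinterpret_cast<char*>(&n), sizeof n);
    std::vector<Entry> entries(n);
    for (Entry& e : entries) {
        in.read(reinterpret_cast<char*>(e.idx.data()), sizeof e.idx);
        in.read(reinterpret_cast<char*>(&e.value), sizeof e.value);
    }
    return entries;
}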
A regular flat representation of a 6-dimensional array ...
double[10][10][10][10][10][10] = 1million entries * 8 bytes ~= 8MB
An associative array Index:Value representation, assume 50% of entries are 0.0 ... using a 4 byte 32bit index ...
500,000 * 4 bytes + 500,000 * 8 bytes ~= 6MB
A bit map representation of the sparse array, assume 50% of entries are 0.0 ... bits are set so that every byte represents 8 entries in the array; 10000001b would mean 8 entries where only the first and last are stored, and the 6 middle values are omitted since they are zero ...
ceil(1million / 8) bytes + 500,000 * 8 bytes ~= 4.125MB
It sounds weird to be going bigger, but that's what I'm trying to do. I want to take the entire sequence of 16-bit integers and hash each one in such a way that it maps to 256-bit space uniformly.
The reason for this is that I'm trying to put a subset of the 16-bit number space into a 256-bit bloom filter, for fast membership testing.
I could use some well-known hashing function on each integer, but I'm looking for an extremely efficient implementation (just a few instructions) so that this runs well in a GPU shader program. I feel like the fact that the hash input is known to be only 16 bits could inform how the hash function is designed, but I am failing to see the solution.
Any ideas?
EDITS
Based on the responses, my original question is confusing. Sorry about that. I will try to restate it with a more concrete example:
I have a subset S1 of n numbers from the set S, which is in the range (0, 2^16-1). I need to represent this subset S1 with a 256-bit bloom filter constructed with a single hashing function. The reason for the bloom filter is a space consideration. I've chosen a 256-bit bloom filter because it fits my space requirements, and has a low enough probability of false positives. I'm looking to find a very simple hashing function that can take a number from set S and represent it in 256 bits such that each bit has roughly equal probability of being 1 or 0.
The reason for the requirement of simplicity in the hashing function is that this hashing function is going to have to run thousands of times per pixel, so anywhere where I can trim instructions is a win.
If you multiply (using uint32_t) a 16 bit value by prime (or for that matter any odd number) p between 2^31 and 2^32, then you "probably" smear the results fairly evenly across the 32 bit space. Then you might want to add another prime value, to prevent 0 mapping to 0 (you want each bit to have an equal probability of being 0 or 1, only one input value in 2^256 should have output all zeros, and since there are only 2^16 inputs that means you want none of them to have output all zeros).
So that's how to expand 16 bits to 32 with one operation (plus whatever instructions are needed to load the constant). Use eight different values p1 ... p8 to get 256 bits, and run some tests with different p values to find good ones (i.e. those that produce not too many more false positives than what you expect for your Bloom filter given the size of the set you're encoding and assuming an ideal hashing function). For example I'm pretty sure -1 is a bad p-value.
No matter how good the values, you'll see some correlations, though: for example as I've described it above the lowest bit of all 8 separate values will be equal, which is a pretty serious dependency. So you probably want a couple more "mixing" operations. For example you might say that each byte of the final output shall be the XOR of two of the bytes of what I've described (and not two least-significant bytes!), just to get rid of the simple arithmetic relations.
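A hedged C++ sketch of that scheme (the constants and the exact mixing step are my own illustrative choices, not from this answer): multiply the 16-bit input by eight different odd constants in [2^31, 2^32) to smear it over eight 32-bit words (8 x 32 = 256 bits), add another odd constant so 0 doesn't map to all zeros, then mix neighbouring words to break the simple arithmetic relations between them.

#include <array>
#include <cstdint>

std::array<uint32_t, 8> hash256(uint16_t x) {
    static const uint32_t p[8] = {
        0x9E3779B1u, 0x85EBCA77u, 0xC2B2AE3Du, 0xA7D4EB2Fu,
        0x965667B1u, 0xD3A2646Du, 0xFD7046C5u, 0xB55A4F09u};

    std::array<uint32_t, 8> h;
    for (int i = 0; i < 8; ++i)
        h[i] = p[i] * x + p[(i + 1) & 7];    // multiply, then add another odd constant

    for (int i = 0; i < 8; ++i) {            // cheap extra mixing, in the spirit of the
        uint32_t n = h[(i + 1) & 7];         // byte-XOR suggestion above
        h[i] ^= (n << 13) | (n >> 19);
    }
    return h;
}

Each 32-bit word would then be ORed into the corresponding lane of the 256-bit filter; whether that is a sensible Bloom-filter construction at all is exactly the question discussed below.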
Unless I've misunderstood the question, though, this is not how a Bloom filter usually works. Usually you want your hash to produce an exact fixed number of set bits for each input, and all the arithmetic to compute the false positive rate relies on this. That's why for a Bloom filter 256 bits in size you'd normally have k 8-bit hashes, not one 256-bit hash. k is normally rather less than half the size of the filter in bits (the optimal value is the number of bits per value in the filter, times ln(2) which is about 0.7). So normally you don't want the probability of each bit being 1 to be anything like as high as 0.5.
The reason is that once you've ORed as few as 4 such 256-bit values together, almost all the bits in your filter are set (15 in 16 of them). So you're looking at a lot of false positives already.
But if you've done the math and you're happy with a single hash function producing a variable number of set bits averaging half of them, then fair enough. Or is the double-occurrence of the number 256 just a coincidence, because k happens to be 32 for the set size you have chosen and you're actually using the 256-bit hash as 32 8-bit hashes?
[Edit: your comment clarifies this, but anyway k should not get so high that you need 256 bits of hash in total. Clearly there's no point in this case using a Bloom filter with more than 16 bits per value (i.e. fewer than 16 values), since using the same amount of space you could just list the values and have a false positive rate of 0. A filter with 16 bits per value gives a false positive rate of something like 1 in 2200. Even there, optimal k is only about 11, that is, you should set about 11 bits in the filter for each value in the set. If you expect the sets to be bigger than 16 values then you want to set fewer bits for each element, and you'll get a higher false positive rate.]
I believe there is some confusion in the question as posed. I will first try to clear up any inconsistencies I've noticed above.
OP originally states that he is trying to map a smaller space into a larger one. If this is truly the case, then the use of the bloom filter algorithm is unnecessary. Instead, as has been suggested in the comments above, the identity function is the only "hash" function necessary to set and test each bit. However, I make the assertion that this is not really what the OP is looking for. If so, then the OP must be storing 2^256 bits in memory (based on how the question is stated) in order for the space of 16-bit integers (i.e. 2^16) to be smaller than his set size; this is an unreasonable amount of memory to be using and is highly unlikely to be the case.
Therefore, I make the assumption that the problem constraints are as follows: we have a 256-bit bit vector in which we want to map the space of 16-bit integers. That is, we have 256 bits available to map 2^16 possible different integers. Thus, we are not actually mapping into a larger space, but, instead, a much smaller space. Similarly, it does appear (again, as previously pointed out in the comments above) that the OP is requesting a single hash function. If this is the case, there is clear misunderstanding about how bloom filters work.
Bloom filters typically use a set of independent hash functions to reduce false positives. Without going into too much detail, every input to the bloom filter runs through all n hash functions and then the resulting index in the bit vector is tested for each function. If all indices tested are set to 1, then the value may be in the set (collisions or overlap across all n hash functions are what produce false positives). Moreover, if any of the indices is set to 0, then the value is absolutely not in the set. With this in mind, it is important to notice that an entirely saturated bloom filter has no benefit. That is, every query to the bloom filter will return that the item is in the set.
Hash Function Concerns
Now, back to the OP's original question. It is likely going to be best to use known hashing algorithms (since these are mathematically difficult to write and "rolling your own" typically doesn't end well). If you are worried about efficiency down to clock cycles, implement the algorithm yourself in the appropriate assembly language for your architecture to reduce running time for each hash function. Remember, algorithmically, hash functions should run in O(1) time, so they should not contribute too much overhead if implemented properly. To start you off, I would recommend considering the modified Bernstein hash. I have written a version for your specific case below (mostly for example purposes):
unsigned char modified_bernstein(short key)
{
    unsigned k = (unsigned short) key;  // avoid sign extension of negative keys
    unsigned ret = k & 0xff;            // start from the low byte
    ret = 33 * ret ^ (k >> 8);          // classic "times 33, then XOR" step with the high byte
    return ret % 256;                   // keep the result in [0, 256); the unsigned char return type truncates anyway
}
The bernstein method I have adapted generally runs as a function of the number of bytes of the input. Since a short type is 2 bytes or 16-bits, I have removed any variables and loops from the algorithm and simply performed some bit twiddling to get at each byte. Finally, an unsigned char can return a value in the range of [0,256) which forces the hash function to return a valid index in the bit vector.
I have a square grid which contains numbers and I need to compress it a lot so it can easily be transferred over a network. For instance, I need to be able to compress a 40x40 grid into less than 512 bytes, regardless of the values of the numbers in the grid. That is my basic requirement.
Each of the grid's cells contains a number from 0-7, so each cell can fit in 3 bits.
Does anyone know of a good algorithm that can achieve what I want?
You can encode your information differently. You don't need to assign all of the numbers 0 to 7 codes with the same number of bits. You can assign codes based on how many times each number appears in the sequence.
First read the whole sequence counting the number of appearances of every number.
Based on that you can assign the code to each number.
If you assign the codes following, for example, Huffman coding, the codes will be prefix codes, meaning no extra character is needed to separate numbers.
There are certain variations that you can introduce on the algorithm based on your test results to fine tune the compression ratio.
I used this technique in a project (at university) and it provides, in general, good results. At least it should approximate your theoretical 3 bits per character and can be much better if the distribution of probabilities helps.
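As a sketch of what that looks like in code (mine, not from the original answer; C++17), the following builds a Huffman code for the 8 cell values from their counts and returns one bit string per value. Packing the bits and transmitting the counts (or the code table) alongside the data is left out for brevity.

#include <cstdint>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

struct Node {
    uint64_t count;
    int symbol;       // 0-7 for leaves, -1 for internal nodes
    int left, right;  // child indices into the node pool, -1 for leaves
};

std::vector<std::string> huffman_codes(const std::vector<uint64_t>& counts) {
    std::vector<Node> pool;
    using Item = std::pair<uint64_t, int>;   // (subtree count, pool index)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> pq;

    for (int s = 0; s < static_cast<int>(counts.size()); ++s) {
        pool.push_back({counts[s], s, -1, -1});
        pq.push({counts[s], s});
    }
    while (pq.size() > 1) {                  // repeatedly merge the two rarest subtrees
        Item a = pq.top(); pq.pop();
        Item b = pq.top(); pq.pop();
        pool.push_back({a.first + b.first, -1, a.second, b.second});
        pq.push({a.first + b.first, static_cast<int>(pool.size()) - 1});
    }

    std::vector<std::string> codes(counts.size());
    // Walk the tree, appending '0' for left edges and '1' for right edges.
    std::function<void(int, const std::string&)> walk =
        [&](int n, const std::string& prefix) {
            if (pool[n].symbol >= 0) { codes[pool[n].symbol] = prefix; return; }
            walk(pool[n].left, prefix + '0');
            walk(pool[n].right, prefix + '1');
        };
    walk(static_cast<int>(pool.size()) - 1, "");
    return codes;
}

Replacing each cell value with its code and concatenating the bits gives the compressed stream; with a skewed value distribution many cells end up costing fewer than 3 bits.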
What you want to do is perform a Burrows-Wheeler transform on your data, and then compress it. Run-length encoding will be enough in this case.
http://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform
This will likely outperform huffman in your case.
It is true that in some cases you will need more than 512 bytes. So in your protocol just make an exception for "perverse" grids. But in the general case you should be under 512 easily.
As others have stated, the problem as stated is not possible as 600 bytes are required to represent all possible grids. 600 bytes is from 40 rows, 40 columns, 3 bits per cell, and 8 bits per byte (40 * 40 * 3 / 8). As Kerrek SB explained in the comments, you pack 8 cells into 3 bytes.
In your own comments, you mentioned this was game state being transferred over the network. Assuming you have a mechanism to assure reliable transport of the data, then if there is a reasonable limit to the number of cells that can change between updates, or if you are allowed to send your updates when a certain number of cells change, you can achieve a representation in 512 bytes. If you use 1 bit to represent whether or not a cell has changed, you would need 200 bytes. Then, you have 312 remaining bytes to represent the new values of the cells that have changed. So, you can represent up to 312*8/3 = 832 modified cells.
As an aside, this representation can represent up to 1064 changed cells in less than 600 bytes.
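To make the delta idea concrete, here is a rough C++ sketch (my own, under the assumptions above: the grid arrives as a flat 1600-byte vector of values 0-7) that encodes one update as a 200-byte changed-cell bitmap followed by the packed 3-bit values of the changed cells.

#include <cstdint>
#include <vector>

std::vector<uint8_t> encode_update(const std::vector<uint8_t>& old_grid,
                                   const std::vector<uint8_t>& new_grid) {
    std::vector<uint8_t> out(200, 0);   // changed-cell bitmap (1600 bits)
    std::vector<uint8_t> values;        // packed 3-bit values of the changed cells
    uint32_t acc = 0;
    int acc_bits = 0;
    for (int i = 0; i < 1600; ++i) {
        if (old_grid[i] == new_grid[i]) continue;
        out[i / 8] |= uint8_t(1u << (i % 8));            // mark cell i as changed
        acc |= uint32_t(new_grid[i] & 7u) << acc_bits;   // append its 3-bit value
        acc_bits += 3;
        while (acc_bits >= 8) {                          // flush whole bytes
            values.push_back(uint8_t(acc & 0xffu));
            acc >>= 8;
            acc_bits -= 8;
        }
    }
    if (acc_bits > 0) values.push_back(uint8_t(acc & 0xffu));
    out.insert(out.end(), values.begin(), values.end());
    return out;   // stays within 512 bytes for up to 832 changed cells
}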
I need to compress some spatially correlated data records. Currently I am getting 1.2x-1.5x compression with zlib, but I figure it should be possible to get more like 2x. The data records have various fields, but for example, zlib seems to have trouble compressing lists of points.
The points represent a road network. They are pairs of fixed-point 4-byte integers of the form XXXXYYYY. Typically, if a single data block has 100 points, there will only be a few combinations of the top two bytes of X and Y (spatial correlation). But the bottom bytes are always changing and must look like random data to zlib.
Similarly, the records have 4-byte IDs which tend to have constant high bytes and variable low bytes.
Is there another algorithm that would be able to compress this kind of data better? I'm using C++.
Edit: Please no more suggestions to change the data itself. My question is about automatic compression algorithms. If somebody has a link to an overview of all popular compression algorithms I'll just accept that as answer.
You'll likely get much better results if you try to compress the data yourself based on your knowledge of its structure.
General-purpose compression algorithms just treat your data as a bitstream. They look for commonly-used sequences of bits, and replace them with shorter dictionary indices.
But the duplicate data doesn't go away. The duplicated sequence gets shorter, but it's still duplicated just as often as it was before.
As I understand it, you have a large number of data points of the form
XXxxYYyy, where the upper-case letters are very uniform. So factor them out.
Rewrite the list as something similar to this:
XXYY // a header describing the common high bytes of X and Y for all the subsequent entries
xxyy // the remaining bytes, which vary
xxyy
xxyy
xxyy
...
XXYY // next unique combination of the high bytes
xxyy
xxyy
...
Now, each combination of the rarely varying bytes is listed only once, rather than duplicated for every entry they occur in. That adds up to a significant space saving.
Basically, try to remove duplicate data yourself, before running it through zlib. You can do a better job of it because you have additional knowledge about the data.
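A hedged C++ sketch of the factoring idea (mine; it assumes the "rarely varying" part is the top 16 bits of each coordinate, and that no group holds more than 65535 points): group the points by their common high bytes, emit those once per group, then emit only the low bytes for each member, and hand the result to zlib.

#include <cstdint>
#include <map>
#include <utility>
#include <vector>

struct Point { uint32_t x, y; };   // fixed-point XXxx, YYyy

std::vector<uint8_t> factor_high_bytes(const std::vector<Point>& pts) {
    // Bucket points by the (high 16 bits of X, high 16 bits of Y) pair.
    std::map<std::pair<uint16_t, uint16_t>, std::vector<const Point*>> groups;
    for (const Point& p : pts)
        groups[{uint16_t(p.x >> 16), uint16_t(p.y >> 16)}].push_back(&p);

    std::vector<uint8_t> out;
    auto put16 = [&out](uint16_t v) {        // little-endian 16-bit write
        out.push_back(uint8_t(v & 0xff));
        out.push_back(uint8_t(v >> 8));
    };
    for (const auto& [high, members] : groups) {
        put16(high.first);                   // common high bytes of X
        put16(high.second);                  // common high bytes of Y
        put16(uint16_t(members.size()));     // number of points in this group
        for (const Point* p : members) {
            put16(uint16_t(p->x & 0xffff));  // varying low bytes of X
            put16(uint16_t(p->y & 0xffff));  // varying low bytes of Y
        }
    }
    return out;                              // feed this buffer to zlib
}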
Another approach might be, instead of storing these coordinates as absolute numbers, write them as deltas, relative deviations from some location chosen to be as close as possible to all the entries. Your deltas will be smaller numbers, which can be stored using fewer bits.
Not specific to your data, but I would recommend checking out 7zip instead of zlib if you can. I've seen ridiculously good compression ratios using this.
http://www.7-zip.org/
Without seeing the data and its exact distribution, I can't say for certain what the best method is, but I would suggest that you start each group of 1-4 records with a byte whose 8 bits indicate the following:
0-1 Number of bytes of ID that should be borrowed from previous record
2-4 Format of position record
6-7 Number of succeeding records that use the same 'mode' byte
Each position record may be stored one of eight ways; all types other than 000 use signed displacements. The number after the bit code is the size of the position record.
000 - 8 - Two full four-byte positions
001 - 3 - Twelve bits for X and Y
010 - 2 - Ten-bit X and six-bit Y
011 - 2 - Six-bit X and ten-bit Y
100 - 4 - Two sixteen-bit signed displacements
101 - 3 - Sixteen-bit X and 8-bit Y signed displacement
110 - 3 - Eight-bit signed displacement for X; 16-bit for Y
111 - 2 - Two eight-bit signed displacements
A mode byte of zero will store all the information applicable to a point without reference to any previous point, using a total of 13 bytes to store 12 bytes of useful information. Other mode bytes will allow records to be compacted based upon similarity to previous records. If four consecutive records differ only in the last byte of the ID, and either have both X and Y within +/- 127 of the previous record, or have X within +/- 31 and Y within +/- 511, or X within +/- 511 and Y within +/- 31, then all four records may be stored in 13 bytes (an average of 3.25 bytes each, a 73% reduction in space).
A "greedy" algorithm may be used for compression: examine a record to see what size ID and XY it will have to use in the output, and then grab up to three more records until one is found that either can't "fit" with the previous records using the chosen sizes, or could be written smaller (note that if e.g. the first record has X and Y displacements both equal to 12, the XY would be written with two bytes, but until one reads following records one wouldn't know which of the three two-byte formats to use).
Before setting your format in stone, I'd suggest running your data through it. It may be that a small adjustment (e.g. using 7+9 or 5+11 bit formats instead of 6+10) would allow many data to pack better. The only real way to know, though, is to see what happens with your real data.
It looks like the Burrows–Wheeler transform might be useful for this problem. It has a peculiar tendency to put runs of repeating bytes together, which might make zlib compress better. This article suggests I should combine other algorithms than zlib with BWT, though.
Intuitively it sounds expensive, but a look at some source code shows that reverse BWT is O(N) with 3 passes over the data and a moderate space overhead, likely making it fast enough on my target platform (WinCE). The forward transform is roughly O(N log N) or slightly over, assuming an ordinary sort algorithm.
Sort the points by some kind of proximity measure such that the average distance between adjacent points is small. Then store the difference between adjacent points.
You might do even better if you manage to sort the points so that most differences are positive in both the x and y axes, but I can't say for sure.
As an alternative to zlib, a family of compression techniques that works well when the probability distribution is skewed towards small numbers is universal codes. They would have to be tweaked for signed numbers (encode (abs(x) << 1) + (x < 0 ? 1 : 0)).
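A small C++ sketch of that signed mapping combined with an Elias gamma code, one member of the universal-code family (my own illustration; gamma cannot encode 0, so the mapped value is offset by 1 before encoding):

#include <cstdint>
#include <vector>

static uint32_t map_signed(int32_t x) {          // (abs(x) << 1) + (x < 0 ? 1 : 0)
    uint32_t m = x < 0 ? uint32_t(-int64_t(x)) : uint32_t(x);
    return (m << 1) + (x < 0 ? 1u : 0u);
}

static void gamma_encode(uint32_t v, std::vector<bool>& bits) {   // requires v >= 1
    int n = 0;
    while ((v >> n) > 1) ++n;                            // n = floor(log2(v))
    for (int i = 0; i < n; ++i) bits.push_back(false);   // n leading zeros
    for (int i = n; i >= 0; --i)
        bits.push_back(((v >> i) & 1) != 0);             // then v itself in n+1 bits
}

std::vector<bool> encode_deltas(const std::vector<int32_t>& deltas) {
    std::vector<bool> bits;
    for (int32_t d : deltas) gamma_encode(map_signed(d) + 1, bits);
    return bits;
}

Small deltas get codewords of only a few bits, while occasional large jumps still encode correctly, just at more bits each.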
You might want to write two lists to the compressed file: a NodeList and a LinkList. Each node would have an ID, x, y. Each link would have a FromNode and a ToNode, along with a list of intermediate xy values. You might be able to have a header record with a false origin and have node xy values relative to that.
This would provide the most benefit if your streets follow an urban grid network, by eliminating duplicate coordinates at intersections.
If the compression is not required to be lossless, you could use truncated deltas for intermediate coordinates. While someone above mentioned deltas, keep in mind that a loss in connectivity would likely cause more problems than a loss in shape, which is what would happen if you use truncated deltas to represent the last coordinate of a road (which is often an intersection).
Again, if your roads aren't on an urban grid, this probably wouldn't buy you much.