IPv6 address compression

I have been compressing IPv6 addresses but for some reason I can't compress this one:
2b06:0000:0000:1f2b:d77f:0000:0000:89ce
This is what I compressed it to: 2b06::1f2b:d77f::89ce
But this is not working, and I wonder why.

This is the correct way to do it: 2b06::1f2b:d77f:0:0:89ce
You can't have :: more than once in one address.
This 2b06::1f2b:d77f::89ce would be ambiguous, as it could expand to
2b06:0000:1f2b:d77f:0000:0000:0000:89ce,
2b06:0000:0000:1f2b:d77f:0000:0000:89ce, or
2b06:0000:0000:0000:1f2b:d77f:0000:89ce.

The Standards Track RFC 5952, A Recommendation for IPv6 Address Text Representation, explains:
4.2.3. Choice in Placement of "::"
When there is an alternative choice in the placement of a "::", the
longest run of consecutive 16-bit 0 fields MUST be shortened (i.e.,
the sequence with three consecutive zero fields is shortened in
2001:0:0:1:0:0:0:1). When the length of the consecutive 16-bit 0
fields are equal (i.e., 2001:db8:0:0:1:0:0:1), the first sequence of
zero bits MUST be shortened. For example, 2001:db8::1:0:0:1 is correct
representation.

You can't have more than one :: in one address;
the correct answer is 2b06::1f2b:d77f:0:0:89ce


LZ77: storing format

I started to write a little program that allows compressing a single file using the LZ77 compression algorithm. It works fine. Now I'm thinking about how to store the data. In LZ77, compressed data consists of a series of triplets. Each triplet has the following format:
<"start reading at n. positions backwards", "go ahead for n. positions", "next character">
What could be a right way to store these triplets? I thought about: <11, 5, 8> bits, then:
2048 positions for look backward
32 max length of matched string
next character is 1 byte.
This format works quite well for text compression, but it's poor for my purpose (video made of binary images); it even increases the size compared to the original file. Do you have any suggestions?
What I think you mean is more like: <go back n, copy k, insert literal byte>.
You need to look at the statistics of your matches. You are likely getting many literal bytes with zero-length matches. For that case, a good start would be to use a single bit to decide between a match and no match. If the bit is a one, then it is followed by a distance, length, and literal byte. If it is a zero, it is followed by only a literal byte.
You can do better still by Huffman coding the literals, lengths, and distances. The lengths and literals could be combined into a single code, as deflate does, to remove even that one bit.

LEX Pattern for matching compressed textual representation of an IP version 6 address

I am aware that there are lots of posts on Stack Overflow and elsewhere with regular expressions, including LEX patterns, for IPv6 addresses. None of them appear to be truly complete, and indeed some requirements do not need to parse all possible address formats.
I am looking for a LEX pattern for IP version 6 address only for addresses represented in compressed textual form. This form is described in Section 2.2 of RFC 5952 (and possibly other related RFCs) and represents a relatively small subset of all possible IPv6 address formats.
If anyone has one which is well tested or is aware of one, please forward it.
RFC 5952 §2.2 does not formally describe the compressed IPv6 address form. The goal of RFC 5952 is to produce a "canonical textual representation form"; that is, a set of textual encodings which has a one-to-one relationship with the set of IPv6 addresses. Section 2.2 enumerates a few aspects of the compressed form which lead to encoding options; a canonical representation needs to eliminate all options.
The compressed syntax is actually described in clause 2 of RFC 4291 §2.2. That syntax is easy enough to describe as a regular expression, although it's a little annoying; it would be easier in a syntax which includes the intersection of two regular expressions (Ragel provides that operator, for example), but in this case a simple enumeration of possibilities suffices.
If you really want to limit the matches to the canonical representations listed in RFC 5952 §4.2, then you have a slightly more daunting task because of the requirement that the compressed run of 0s must be the longest run of 0s in the uncompressed address, or the first such run if there is more than one longest run of the same length.
That would be possible by making a much longer enumeration of permissible forms where the compressed run satisfies the "first longest" constraint. But I'm really not sure that there is any value in creating that monster, since RFC 5952 is quite clear that the intent is to restrict the set of representations produced by a conforming application (emphasis added):
…[A]ll implementations MUST accept and be able to handle any legitimate RFC4291 format.
Since regular expressions are mostly of use in recognising and parsing inputs, it seems unnecessary to go to the trouble of writing and verifying the list of possible canonical patterns.
An IPv6 address conforming to clause 1 of RFC 4291 §2.2 can easily be described in lex syntax:
piece [[:xdigit:]]{1,4}
%%
{piece}(:{piece}){7} { /* an uncompressed IPv6 address */ }
In passing, although it seems unnecessary for the same reasons noted above, it's very simple to restrict {piece} to the canonical 16-bit representations (lower-case only, no leading zeros):
piece 0|[1-9a-f][0-9a-f]{0,3}
The complication comes with the requirement in clause 2 that only one run of 0s be compressed. It's easy to write a regular expression which allows only one number to be omitted:
(({piece}:)*{piece})?::({piece}(:{piece})*)?
but that formulation no longer limits the number of pieces to 8. It's also fairly easy to write a regular expression which allows omitted pieces, limiting the number of fields:
{piece}(:{piece}?){1,6}:{piece}|:(:{piece}){1,7}|({piece}:){1,7}:|::
What's desired is the intersection of those two patterns, plus the pattern for uncompressed addresses. But, as mentioned, there's no way of writing intersections in (f)lex. So we end up enumerating possibilities. A simple enumeration is the number of initial uncompressed pieces:
(?x: /* Flex's extended syntax allows whitespace and continuation lines */
{piece}(:{piece}){7}
| {piece} ::{piece}(:{piece}){0,5}
| {piece}:{piece} ::{piece}(:{piece}){0,4}
| {piece}(:{piece}){2}::{piece}(:{piece}){0,3}
| {piece}(:{piece}){3}::{piece}(:{piece}){0,2}
| {piece}(:{piece}){4}::{piece}(:{piece})?
| {piece}(:{piece}){5}::{piece}
| {piece}(:{piece}){0,6}::
| ::{piece}(:{piece}){0,6}
| ::
)
That still excludes the various forms of embedding IPv4 addresses in IPv6, but it should be clear how to add those, if desired.

protocol buffers : no notation for fixed size buffers?

Since I am not getting an answer on this question, I gotta prototype and check myself. As my dataset headers need to be fixed size, I need fixed size strings. So, is it possible to specify fixed size strings or byte arrays in protocol buffers? It is not readily apparent here, and I kinda feel bad about forcing fixed size strings into the header message, i.e., std::string(128, '\0');
If not, I'd rather use a #pragma pack(1) struct header {...};
edit
Question indirectly answered here. Will answer and accept.
protobuf does not have such a concept in the protocol, nor in the .proto schema language. In strings and blobs, the data is always technically variable length using a length prefix (which itself uses varint encoding, so even the length is variable length).
Of course, if you only ever store data of a particular length, then it will line up. Note also that since strings in protobuf are unicode using UTF-8 encoding, the length of the encoded data is not as simple as the number of characters (unless you are using only ASCII characters).
This is a slight clarification to the previous answer. Protocol Buffers does not encode strings as UTF-8, it encodes them as regular bytes. The on-wire format would be the number of bytes consumed followed by the actual bytes. See https://developers.google.com/protocol-buffers/docs/encoding/.
While the on-wire format is always the same, protocol buffers provides two interfaces for developers to use, string and bytes, with the primary difference being that the former will generally try to provide string types to the developer whereas the latter will try to provide byte types (e.g., Java would provide String for string and ByteString for bytes).

Binary file special characters

I'm coding a suffix array sort, and this algorithm appends a sentinel character to the original string. This character must not be in the original string.
Since this algorithm will process the bytes of binary files, is there any special byte character that I can ensure I won't find in any binary file?
If it exists, how do I represent this character in C++ coding?
I'm on Linux; I'm not sure if it makes a difference.
No, there is not. Binary files can contain every combination of byte values. I wouldn't call them 'characters' though, because they are binary data, not (necessarily) representing characters. But whatever the name, they can have any value.
This is more like a question you should answer yourself. We do not know what binary data you have and what characters can be there and what cannot. If you are talking about generic binary data - there could be any combination of bits and bytes, and characters, so there is no such character.
From the other point of view, you are talking about strings. What kind of strings? ASCII strings? ASCII codes have a very limited range, so you could use 128, for example. Some old protocols use SOH (\1) for similar purposes. So there might be a way around it if you know exactly what strings you are processing.
To the best of my knowledge, a suffix array cannot be applied to arbitrary binary data (well, it can, but it won't make any sense).
A file could contain bits only. Groups of bits could be interpreted as an ASCII character, a floating point number, a photo in JPEG format, anything you can imagine. The interpretation is based on a coding scheme (such as ASCII or BCD) you choose. If your coding scheme doesn't fill the entire table of possible codes, you could pick one for your special purposes (for example, digits could be encoded naively in 4 bits; 2^4 = 16, so you have 6 redundant codewords).

Is this an acceptable use of "ASCII arithmetic"?

I've got a string value of the form 10123X123456 where 10 is the year, 123 is the day number within the year, and the rest is unique system-generated stuff. Under certain circumstances, I need to add 400 to the day number, so that the number above, for example, would become 10523X123456.
My first idea was to substring those three characters, convert them to an integer, add 400 to it, convert them back to a string and then call replace on the original string. That works.
But then it occurred to me that the only character I actually need to change is the third one, and that the original value would always be 0-3, so there would never be any "carrying" problems. It further occurred to me that the ASCII code points for the numbers are consecutive, so adding the number 4 to the character "0", for example, would result in "4", and so forth. So that's what I ended up doing.
My question is, is there any reason that won't always work? I generally avoid "ASCII arithmetic" on the grounds that it's not cross-platform or internationalization friendly. But it seems reasonable to assume that the code points for numbers will always be sequential, i.e., "4" will always be 1 more than "3". Anybody see any problem with this reasoning?
Here's the code.
string input = "10123X123456";
input[2] += 4;
//Output should be 10523X123456
From the C++ standard, section 2.2.3:
In both the source and execution basic character sets, the value of each character after 0 in the
above list of decimal digits shall be one greater than the value of the previous.
So yes, if you're guaranteed to never need a carry, you're good to go.
The C++ language definition requires that the code-point values of the numerals be consecutive. Therefore, ASCII arithmetic is perfectly acceptable.
Always keep in mind that if this is generated by something that you do not entirely control (such as users or third-party systems), something can and will go wrong with it. (Check out Murphy's laws.)
So I think you should at least put on some validations before doing so.
It sounds like altering the string as you describe is easier than parsing the number out in the first place. So if your algorithm works (and it certainly does what you describe), I wouldn't consider it premature optimization.
Of course, after you add 400, it's no longer a day number, so you couldn't apply this process recursively.
And, <obligatory Year 2100 warning>.
A very long time ago I saw some x86 processor instructions for ASCII and BCD.
Those are AAA (ASCII Adjust After Addition), AAS (subtraction), AAM (multiplication), AAD (division).
But even if you are not sure about the target platform, you can refer to the specification of the character set you are using, and I guess you'll find that the first 128 characters of ASCII always have the same meaning in all character sets (for Unicode, that is the first code page).