Is it possible to losslessly compress 32 hexadecimal characters into 30? - compression

For example, is it possible to compress 002e3483bbdc11ddaae0754822a559f6 into something that takes at most 30 characters?

Yes, you can convert it to a base-32 number. The greatest 32-character hex number, ffffffffffffffffffffffffffffffff, is 7VVVVVVVVVVVVVVVVVVVVVVVVV in base-32 (a 7 followed by twenty-five Vs), which is only 26 characters. Also note that the base-32 string will contain only these characters: 0123456789ABCDEFGHIJKLMNOPQRSTUV
For example: 002e3483bbdc11ddaae0754822a559f6 is 5OQ87EUS27EQLO3L90HAAMFM in base-32 (24 characters).
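A minimal sketch of that conversion in Python (any language with big-integer arithmetic works; hex_to_base32 is just an illustrative helper), using the digit alphabet above:
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUV"

def hex_to_base32(hex_string):
    value = int(hex_string, 16)        # parse the 128-bit value
    out = ""
    while value:
        value, d = divmod(value, 32)   # peel off one base-32 digit at a time
        out = DIGITS[d] + out
    return out or "0"

print(hex_to_base32("002e3483bbdc11ddaae0754822a559f6"))  # prints the 24-character base-32 form shown above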

If your question is whether you can compress any 32-character hex string into 30 hex characters:
This is impossible for all inputs: there are 16^32 possible 32-character hex strings but only 16^30 possible 30-character ones, so if such a scheme existed, multiple 32-character hex strings would have to compress to the same 30-character hex string, and you wouldn't know which one it was (the pigeonhole principle).
A less proof-y argument: you'd be able to repeatedly invoke the process on a file of any size and get it down to a single 30-character hex string, which doesn't make a whole lot of sense.
Here is an article I just found. Wikipedia says something similar.

Convert the hex to binary, then use something like Base64 or any other encoding scheme; see Binary-to-text encoding (Wikipedia). This has the advantage of not requiring 128-bit arithmetic, unlike the suggested base-32 solution.
Conversion to base64 and back:
$ echo 002e3483bbdc11ddaae0754822a559f6 |xxd -r -ps |openssl base64 -e |tee >(openssl base64 -d |xxd -ps)
AC40g7vcEd2q4HVIIqVZ9g==
002e3483bbdc11ddaae0754822a559f6
Cut the pipeline at |tee to get only the encoded output. In most programming languages you will have core or external libraries to do the hex-to-binary conversion and Base64 encoding.
NB: Conversion to Base32 (the RFC 4648 binary-to-text encoding) would also be possible, but its output is padded with = to a multiple of 8 characters, so you would have to trim the padding and re-add it before decoding.
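The same round trip in Python, as a rough sketch of what those library calls look like (standard library only):
import base64

hex_string = "002e3483bbdc11ddaae0754822a559f6"
raw = bytes.fromhex(hex_string)                   # hex -> 16 raw bytes
encoded = base64.b64encode(raw).decode("ascii")   # 24 characters, "==" padding included
print(encoded)                                    # AC40g7vcEd2q4HVIIqVZ9g==
print(base64.b64decode(encoded).hex())            # 002e3483bbdc11ddaae0754822a559f6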


Print special character from utf-8 encoded string

I'm having trouble dealing with encoding in Python:
I get some strings from a CSV that I open using pandas.read_csv(); they are unicode, so I encode them to UTF-8 by doing the following
# data is from my csv
string = data.encode('utf-8')
print string
However, when I print it, I get
"Parc d'Activit\xc3\xa9s des Gravanches"
and I would like to get
"Parc d'Activités des Gravanches"
It seems like an easy issue but I'm quite new to python and did not find anything close enough to my problem.
Note: I am using Python 2.7 and my file starts with
#!/usr/bin/env python2.7
# coding: utf8
EDIT: I just saw that you are using Python 2; okay, I think the answer below is still valuable though.
In Python 2 this is even more complicated and inconsistent. Here you have str and unicode, and the default str doesn't support unicode stuff.
Anyways, the situation is more or less the same, use decode instead of encode to convert from str to unicode. That should fix it.
More info at: https://pythonhosted.org/kitchen/unicode-frustrations.html
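A minimal Python 2 sketch of that fix, assuming data really is a UTF-8 byte string (a Python 2 str) and the terminal accepts UTF-8:
# -*- coding: utf-8 -*-
data = "Parc d'Activit\xc3\xa9s des Gravanches"  # UTF-8 encoded bytes (str)
text = data.decode('utf-8')                       # str -> unicode
print text                                        # Parc d'Activités des Gravanches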
This is a common source of confusion. The issue is a bit complex, but I'll try to simplify it. I'm talking about Python 3 here; I believe there are several differences from Python 2.
There are two types of what you would call a string: str and bytes.
str is the general string type in Python; it supports Unicode seamlessly in Python 3, but the way it encodes the actual data internally is not relevant: it's an object.
bytes is a byte array, like char* in C. It's a sequence of bytes.
Strings can be represented both ways, but you need to specify an encoding to translate between the two, because bytes needs to be interpreted; it is, again, just a raw array of bytes.
encode converts a str into bytes; that's the mistake you're making. Of course, if you print the bytes it will just show its raw data, i.e. the string encoded as UTF-8.
decode does the opposite operation, which may be what you need.
However, if you open the file normally (open(file_name, 'r')) instead of in byte mode (open(file_name, 'rb')), which I doubt you are doing, you shouldn't need to do anything; printing data should just work as you want it to.
More info at: https://docs.python.org/3/howto/unicode.html
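A short Python 3 sketch of that str/bytes round trip:
text = "Parc d'Activités des Gravanches"  # str: a sequence of Unicode code points
data = text.encode('utf-8')               # bytes: the raw UTF-8 representation
print(data)                               # b"Parc d'Activit\xc3\xa9s des Gravanches"
print(data.decode('utf-8'))               # back to str: Parc d'Activités des Gravanches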

Store 32 bit value as C string in most efficient form

I am trying to find the most efficient way to encode 32-bit hashed string values into text strings for transmission/logging in low-bandwidth environments. Complex compression can't be used because the hash values need to be contained in human-readable text strings when logged and sent between client and host.
Consider the following contrived examples:
given the key/value map
table[0xFE12ABCD] = "models/texture/red.bmp";
table[0x3EF088AD] = "textures/diagnostics/pink.jpg";
and the string formats:
"Loaded asset (0x%08x)"
"Replaced (0x%08x) with (0x%08x)"
they could be printed as:
"Loaded asset models/texture/red.bmp"
"Replaced models/texture/red.bmp with textures/diagnostics/pink.jpg"
Or if the key/value map is known by the client and server:
"Loaded asset (0xFE12ABCD)"
"Replaced (0xFE12ABCD) with (0x3EF088AD)"
The receiver can then scan for the (0xNNNNNNNN) pattern and expand it locally.
This is what I am doing right now but I would like to find a way to represent the 32 bit value more efficiently. A simple step would be to use a better identifying token:
"Loaded asset $FE12ABCD"
"Replaced $1000DEEE with $3EF088AD"
Which already reduces the length of each token - $ is not used anywhere else so it is reasonable.
However, what other options are there to make that 32 bit value even smaller? I can't use an index - it has to be a full 32 bit value because in some cases the generator of the string has the hash and sometimes it has a string it will hash immediately.
A common solution is to use Base-85 coding. You can code four bytes into five Base-85 digits, since 85^5 > 2^32. Pick 85 printable characters and assign them to the digit values 0..84. Then do base conversion to go either way. Since there are 94 printable characters in ASCII, it is usually easy to find 85 that are "safe" in whatever constrains your strings to be "readable".
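A rough sketch of that scheme in Python; the 85-character alphabet below is an arbitrary illustrative choice, not a standard one:
ALPHABET = ("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "abcdefghijklmnopqrstuvwxyz"
            "!#$%&()*+-;<=>?@^_`{|}~")

def encode85(value):
    # 85**5 > 2**32, so five digits always cover a 32-bit value
    digits = []
    for _ in range(5):
        value, d = divmod(value, 85)
        digits.append(ALPHABET[d])
    return ''.join(reversed(digits))

def decode85(text):
    value = 0
    for ch in text:
        value = value * 85 + ALPHABET.index(ch)
    return value

print(encode85(0xFE12ABCD))                 # five printable characters
print(hex(decode85(encode85(0xFE12ABCD))))  # 0xfe12abcd round-trips
In practice you would pick the 85 characters so that none of them collide with whatever your logging or transport treats specially.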

How to handle unicode values in JSON strings?

I'm writing a JSON parser in C++ and am facing a problem when parsing JSON strings:
The JSON specification states that JSON strings can contain unicode characters in the form of:
"here comes a unicode character: \u05d9 !"
My JSON parser tries to map JSON strings to std::string so usually, one character of the JSON strings becomes one character of the std::string. However for those unicode characters, I really don't know what to do:
Should I just put the raw bytes values in my std::string like so:
std::string mystr;
mystr.push_back('\x05');
mystr.push_back('\xd9');
Or should I interpret the two characters with a library like iconv and store the UTF-8 encoded result in my string instead?
Should I use a std::wstring to store all the characters? What then on *NIX OSes where wchar_t is 4 bytes long?
I sense something is wrong in my solutions but I fail to understand what. What should I do in that situation?
After some digging and thanks to H2CO3's comments and Philipp's comments, I finally could understand how this is supposed to work:
Reading RFC 4627, Section 3 (Encoding):
Encoding
JSON text SHALL be encoded in Unicode. The default encoding is
UTF-8.
Since the first two characters of a JSON text will always be ASCII
characters [RFC0020], it is possible to determine whether an octet
stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
at the pattern of nulls in the first four octets.
00 00 00 xx UTF-32BE
00 xx 00 xx UTF-16BE
xx 00 00 00 UTF-32LE
xx 00 xx 00 UTF-16LE
xx xx xx xx UTF-8
So it appears a JSON octet stream can be encoded in UTF-8, UTF-16, or UTF-32 (in both their BE or LE variants, for the last two).
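As a sketch, that null-byte pattern check translates to something like this (Python just for illustration; detect_json_encoding is a made-up helper name):
def detect_json_encoding(octets):
    # Apply RFC 4627's null patterns to the first four octets.
    # Order matters: test the UTF-32 patterns before the UTF-16 ones.
    b = octets[:4]
    if b[0] == 0 and b[1] == 0 and b[2] == 0:
        return 'UTF-32BE'
    if b[0] == 0 and b[2] == 0:
        return 'UTF-16BE'
    if b[1] == 0 and b[2] == 0 and b[3] == 0:
        return 'UTF-32LE'
    if b[1] == 0 and b[3] == 0:
        return 'UTF-16LE'
    return 'UTF-8'

print(detect_json_encoding(b'{"a"'))        # UTF-8
print(detect_json_encoding(b'\x00{\x00"'))  # UTF-16BE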
Once that is clear, Section 2.5. Strings explains how to handle those \uXXXX values in JSON strings:
Any character may be escaped. If the character is in the Basic
Multilingual Plane (U+0000 through U+FFFF), then it may be
represented as a six-character sequence: a reverse solidus, followed
by the lowercase letter u, followed by four hexadecimal digits that
encode the character's code point. The hexadecimal letters A though
F can be upper or lowercase. So, for example, a string containing
only a single reverse solidus character may be represented as
"\u005C".
With more complete explanations for characters not in the Basic Multilingual Plane.
To escape an extended character that is not in the Basic Multilingual
Plane, the character is represented as a twelve-character sequence,
encoding the UTF-16 surrogate pair. So, for example, a string
containing only the G clef character (U+1D11E) may be represented as
"\uD834\uDD1E".
Hope this helps.
If I were you, I would use std::string to store UTF-8 and UTF-8 only.
If incoming JSON text does not contain any \uXXXX sequences, std::string can be used as is, byte to byte, without any conversion.
When you parse \uXXXX, you can simply decode it and convert it to UTF-8, effectively treating it as if it were a true UTF-8 character in its place - this is what most JSON parsers are doing anyway (libjson for sure).
Granted, with this approach reading JSON with \uXXXX and immediately dumping it back out using your library is likely to lose the \uXXXX sequences and replace them with their true UTF-8 representations, but who really cares? Ultimately, the net result is still exactly the same.
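A small sketch of that decoding step (Python here just to show the arithmetic; the surrogate-pair math is identical in C++, and escape_to_utf8 is a made-up helper name):
def escape_to_utf8(hi, lo=None):
    # hi and lo are the numeric values of the four hex digits in \uXXXX escapes
    if lo is not None:
        # combine a UTF-16 surrogate pair into one code point
        code_point = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
    else:
        code_point = hi
    return chr(code_point).encode('utf-8')

print(escape_to_utf8(0x05D9))           # b'\xd7\x99'
print(escape_to_utf8(0xD834, 0xDD1E))   # b'\xf0\x9d\x84\x9e' (U+1D11E, the G clef)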

URL encoding for multibyte character string in c++

I am trying to achieve URL encoding for some of my strings via C++. Strings can contain multibyte characters like ™, ®, ©, etc.
Input text: Something ™
Output should be: Something%20%E2%84%A2
I can URL-encode or decode in JS with encodeURIComponent and decodeURIComponent,
but I have some native code in C++ and hence need to encode some text via C++.
Any help here would be great relief for me.
It's not too hard to do manually if you can't find a library. First encode the string as UTF-8 (there are other posts on SO about using the standard library to do that if the string is in another encoding), then replace every byte with a value above 127, and every byte that's restricted in URLs, with the percent encoding of that byte (a percent sign followed by the two hexadecimal digits representing the byte's value).
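A minimal sketch of that approach (Python here for brevity; the same loop is straightforward to write in C++):
def url_encode(text):
    # keep RFC 3986 "unreserved" characters, percent-encode every other byte
    unreserved = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                     "abcdefghijklmnopqrstuvwxyz"
                     "0123456789-._~")
    out = []
    for byte in text.encode('utf-8'):   # UTF-8 encode first
        if chr(byte) in unreserved:
            out.append(chr(byte))
        else:
            out.append('%{:02X}'.format(byte))
    return ''.join(out)

print(url_encode("Something ™"))        # Something%20%E2%84%A2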

C/C++: How to convert 6bit ASCII to 7bit ASCII

I have a set of 6 bits that represent a 7-bit ASCII character. How can I get the correct 7-bit ASCII code out of the 6 bits I have? Just append a zero and do a bitwise OR?
Thanks for your help.
Lennart
ASCII is inherently a 7-bit character set, so what you have is not "6-bit ASCII". What characters make up your character set? The simplest decoding approach is probably something like:
char From6Bit( char c6 ) {
    // array of all 64 characters that appear in your 6-bit set
    static const char SixBitSet[] = { 'A', 'B', ... };
    return SixBitSet[ c6 ];
}
A footnote: 6-bit character sets were quite popular on old DEC hardware, some of which, like the DEC-10, had a 36-bit architecture where 6-bit characters made some sense.
You must tell us how your 6-bit set of characters looks; I don't think there is a standard.
The easiest way to do the reverse mapping would probably be to just use a lookup table, like so:
static const char sixToSeven[] = { ' ', 'A', 'B', ... };
This assumes that space is encoded as (binary) 000000, capital A as 000001, and so on.
You index into sixToSeven with one of your six-bit characters, and get the corresponding 7-bit character back.
I can't imagine why you'd be getting old DEC-10/20 SIXBIT, but if that's what it is, then just add 32 (decimal). SIXBIT took the ASCII characters starting with space (32), so just add 32 to the SIXBIT character to get the ASCII character.
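If it really is DEC SIXBIT, the whole conversion is a single addition (a tiny Python sketch):
def sixbit_to_ascii(c6):
    return chr(c6 + 32)   # SIXBIT 0 is space (ASCII 32), so the offset is constant

print(sixbit_to_ascii(0o41))  # 'A' (SIXBIT 33 -> ASCII 65)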
The only recent 6-bit code I'm aware of is base64. This uses four 6-bit printable characters to store three 8-bit values (6x4 = 8x3 = 24 bits).
The 6-bit values are drawn from the characters:
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
which are the values 0 through 63. Four of these (say UGF4) are used to represent three 8-bit values.
UGF4 = 010100 000110 000101 111000
= 01010000 01100001 01111000
= Pax
If this is how your data is encoded, there are plenty of snippets around that will tell you how to decode it (and many languages have the encoder and decoder built in, or in an included library). Wikipedia has a good article on it.
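For example, decoding the UGF4 group above with Python's built-in codec:
import base64
print(base64.b64decode('UGF4'))  # b'Pax'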
If it's not base64, then you'll need to find out the encoding scheme. Some older schemes used other methods, like the shift-in/shift-out (SI/SO) codes for choosing a page within character sets, but I think that was more for choosing extended (e.g., Japanese DBCS) characters rather than normal ASCII characters.
If I were to give you the value of a single bit, and I claimed it was taken from Windows XP, could you reconstruct the entire OS?
You can't. You've lost information. There is no way to reconstruct that, unless you have some knowledge about what was lost. If you know that, say, the most significant bit was chopped off, then you can set that to zero, and you've reconstructed at least half the characters correctly.
If you know how 'a' and 'z' are represented in your 6-bit encoding, you might be able to guess at what was removed by comparing them to their 7-bit representations.