Compress 21 Alphanumeric Characters into 16 Bytes - C++

I'm trying to take 21 bytes of data which uniquely identifies a trade and store it in a 16 byte char array. I'm having trouble coming up with the right algorithm for this.
The trade ID which I'm trying to compress consists of 2 fields:
18 alphanumeric characters, consisting of the ASCII characters 0x20 to 0x7E, inclusive (32-126)
A 3-character numeric string, "000" to "999"
So a C++ class that would encompass this data looks like this:
class ID
{
public:
    char trade_num_[18];
    char broker_[3];
};
This data needs to be stored in a 16-char data structure, which looks like this:
class Compressed
{
public:
    char sku_[16];
};
I tried to take advantage of the fact that the characters in trade_num_ are only 0-127, so there is 1 unused bit in each character. Similarly, 999 in binary is 1111100111, which is only 10 bits - 6 bits short of a 2-byte word. But when I work out how much I can squeeze this down, the smallest I can make it is 17 bytes; one byte too big.
Any ideas?
By the way, trade_num_ is a misnomer. It can contain letters and other characters. That's what the spec says.
EDIT: Sorry for the confusion. The trade_num_ field is indeed 18 bytes and not 16. After I posted this thread my internet connection died and I could not get back to this thread until just now.
EDIT2: I think it is safe to make an assumption about the dataset. For the trade_num_ field, we can assume that the non-printable ASCII characters 0-31 will not be present, nor will ASCII codes 126 (~) or 127. All the others might be present, including upper- and lower-case letters, numbers, and punctuation. This leaves a total of 94 characters in the set that trade_num_ can contain: ASCII codes 32 through 125, inclusive.

If you have 18 characters in the range 0-127 and a number in the range 0-999, then compacting this as much as possible still requires 17 bytes:
>>> math.log(128**18 * 1000, 256)
16.995723035582763
You may be able to take advantage of the fact that some characters are most likely not used. In particular, it is unlikely that there are any characters below value 32, and 127 is probably unused as well. If you can find one more unused character, you can convert the characters into base 94 and then pack them into the bytes as closely as possible:
>>> math.log(94**18 * 1000, 256)
15.993547951857446
This just fits into 16 bytes!
Example code
Here is some example code written in Python (but in a very imperative style, so that it can easily be understood by non-Python programmers). I'm assuming that there are no tildes (~) in the input; if there are, substitute them with another character before encoding the string.
def encodeChar(c):
    return ord(c) - 32

def encode(s, n):
    t = 0
    for c in s:
        t = t * 94 + encodeChar(c)
    t = t * 1000 + n
    r = []
    for i in range(16):
        r.append(t % 256)
        t //= 256
    return r

print(encode(' ' * 18, 0))    # smallest possible value
print(encode('abcdefghijklmnopqr', 123))
print(encode('}' * 18, 999))  # largest possible value
Output:
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[ 59, 118, 192, 166, 108, 50, 131, 135, 174, 93, 87, 215, 177, 56, 170, 172]
[255, 255, 159, 243, 182, 100, 36, 102, 214, 109, 171, 77, 211, 183, 0, 247]
This algorithm uses Python's ability to handle very large numbers. To convert this code to C++ you could use a big integer library.
You will of course need an equivalent decoding function; the principle is the same, with the operations performed in reverse order.

That makes (18*7 + 10) = 136 bits, or 17 bytes. You wrote that trade_num is alphanumeric? If that means the usual [a-zA-Z0-9_] set of characters, then you'd need only 6 bits per character, i.e. (18*6 + 10) = 118 bits = 15 bytes for the whole thing.
(Assuming 8 bits = 1 byte.)
Or, coming from the other direction: you have 128 bits of storage and need ~10 bits for the number part, so 118 bits are left for trade_num. 18 characters means 118/18 = 6.555 bits per character, which means you only have space to encode 2^6.555 ≈ 94 different characters, unless there is a hidden structure in trade_num that we could exploit to save more bits.

This is something that should work, assuming you need only characters from allowedchars, and there are at most 94 characters in it. This is Python, but it is written without fancy shortcuts so that you can translate it to your target language more easily. It assumes, however, that the number variable may hold integers up to 2**128; in C++ you would use some kind of big-number class.
allowedchars = ' !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}'
alphabase = len(allowedchars)  # 94

def compress(code):
    alphanumeric = code[0:18]
    number = int(code[18:21])
    for character in alphanumeric:
        # find returns the index of the character in allowedchars
        number = alphabase * number + allowedchars.find(character)
    compressed = ''
    for i in range(16):
        compressed += chr(number % 256)
        number = number // 256
    return compressed

def decompress(compressed):
    number = 0
    for byte in reversed(compressed):
        number = 256 * number + ord(byte)
    alphanumeric = ''
    for i in range(18):
        alphanumeric = allowedchars[number % alphabase] + alphanumeric
        number = number // alphabase
    # make a string padded with zeros
    number = '%03d' % number
    return alphanumeric + number

You can get this down to roughly 15 bytes (14 bytes and 6 bits).
For each character of trade_num_ you save 1 bit if you store its ASCII code in 7 bits. That frees 2 bytes and 2 bits, and you need to free 5 bytes.
Now look at the number: each of its characters is one of ten values (0 to 9), so 4 bits per digit is enough. Saving the number that way takes 1 byte and 4 bits instead of 3 bytes, so you save half of it. Now you have freed 3 bytes and 6 bits of the 5 bytes you need.
If you restrict the characters to qwertyuioplkjhgfdsazxcvbnmQWERTYUIOPLKJHGFDSAZXCVBNM1234567890[] you can store each character in 6 bits, which frees another 2 bytes and 2 bits. Now 6 bytes are free in total, so the string can be saved in 15 bytes + null termination = 16 bytes.
And if you store the number as a 10-bit integer instead, you can fit the whole thing into 14 bytes and 6 bits.
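For illustration, here is a minimal C++ sketch of the 6-bits-per-character packing, assuming the restricted 64-character alphabet above (the function name, the choice of a character's index as its 6-bit code, and the MSB-first bit order are illustrative assumptions):
#include <cstring>

// The 64-character alphabet from above; a character's 6-bit code is its index.
// Assumes every character of trade_num actually occurs in the alphabet.
static const char ALPHABET64[] =
    "qwertyuioplkjhgfdsazxcvbnmQWERTYUIOPLKJHGFDSAZXCVBNM1234567890[]";

void pack_id(const char trade_num[18], unsigned number /* 0..999 */,
             unsigned char out[16])
{
    std::memset(out, 0, 16);
    unsigned bitpos = 0;  // next free bit, counting from the MSB of out[0]
    auto put = [&](unsigned value, unsigned nbits) {
        for (unsigned i = 0; i < nbits; ++i, ++bitpos) {
            unsigned bit = (value >> (nbits - 1 - i)) & 1u;
            out[bitpos / 8] |= static_cast<unsigned char>(bit << (7 - bitpos % 8));
        }
    };
    for (int i = 0; i < 18; ++i)  // 18 characters x 6 bits = 108 bits
        put(static_cast<unsigned>(std::strchr(ALPHABET64, trade_num[i]) - ALPHABET64), 6);
    put(number, 10);              // + 10 bits for 0..999 = 118 bits, 10 bits spare
}
Unpacking reads the same fields back in the same order, 6 bits at a time and then 10.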

There are 95 characters between the space (0x20) and the tilde (0x7E), inclusive. (The 94 in other answers suffers from an off-by-one error.)
Hence the number of distinct IDs is 95^18 × 1000 ≈ 3.97×10^38.
But that compressed structure can only hold (2^8)^16 = 2^128 ≈ 3.40×10^38 distinct values.
Therefore it is impossible to represent all IDs in that structure, unless:
There is 1 unused character in ≥15 digits of trade_num_, or
There are ≥14 unused characters in 1 digit of trade_num_, or
There are only ≤856 brokers, or
You're using a PDP-10, which has a 9-bit char.
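A quick way to sanity-check these counts is a small illustrative snippet comparing the bits required against the 128-bit budget:
#include <cmath>
#include <cstdio>

int main()
{
    // Bits needed to enumerate all IDs, versus the 128 bits available:
    double full = 18 * std::log2(95.0) + std::log2(1000.0);        // 95-char set
    double restricted = 18 * std::log2(94.0) + std::log2(1000.0);  // 94-char set
    std::printf("95 characters: %.2f bits\n", full);        // ~128.22 -> too big
    std::printf("94 characters: %.2f bits\n", restricted);  // ~127.95 -> just fits
    return 0;
}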

Key questions are:
There appears to be some contradiction in your post as to whether the trade number is 16 or 18 characters; you need to clear that up. You say the total is 21, consisting of 16+3. :-(
You say the trade num characters are in the range 0x00-0x7f. Can they really be any character in that range, including tab, new line, control-C, etc? Or are they limited to printable characters, or maybe even to alphanumerics?
Does the output 16 bytes have to be printable characters, or is it basically a binary number?
EDIT, after updates to original post:
In that case, if the output can be any character in the character set, it's possible. If it can only be printable characters, it's not.
Demonstration of the mathematical possibility is straightforward enough. There are 94 possible values for each of 18 characters, and 10 possible values for each of 3. Total number of possible combinations = 94^18 * 10^3 ~= 3.28E38. This requires 128 bits: 2^127 ~= 1.70E38 is too small, while 2^128 ~= 3.40E38 is big enough. 128 bits is 16 bytes, so it will just barely fit if we can use every possible bit combination.
Given the tight fit, I think the most practical way to generate the value is to think of it as a double-long number, and then run the input through an algorithm to generate a unique integer for every possible input.
Conceptually, then, let's imagine we had a "huge integer" data type that is 16 bytes long. The algorithm would be something like this:
huge out = 0;
for (int p = 0; p < 18; ++p)
{
    out = out * 94 + tradenum[p] - 32;
}
for (int p = 0; p < 3; ++p)
{
    out = out * 10 + broker[p] - '0';
}
// Convert output to char[16]
unsigned char out16[16];
for (int p = 15; p >= 0; --p)
{
    out16[p] = out & 0xff;
    out = out >> 8;
}
return out16;
Of course we don't have a "huge" data type in C. Are you using pure C or C++? Isn't there some kind of big-number class in C++? Sorry, I haven't done C++ in a while. If not, we could easily create a little library to implement one.
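For what it's worth, GCC and Clang provide the non-standard unsigned __int128, which happens to be just wide enough here, since 94^18 * 1000 < 2^128. A sketch under that assumption (function names are illustrative):
// Assumes GCC/Clang's unsigned __int128 extension and characters in 32..125.
void compress_id(const char trade_num[18], const char broker[3],
                 unsigned char out[16])
{
    unsigned __int128 acc = 0;
    for (int i = 0; i < 18; ++i)
        acc = acc * 94 + (trade_num[i] - 32);    // base-94 digits
    for (int i = 0; i < 3; ++i)
        acc = acc * 10 + (broker[i] - '0');      // append the 000..999 part
    for (int i = 15; i >= 0; --i) {              // dump as 16 big-endian bytes
        out[i] = static_cast<unsigned char>(acc & 0xff);
        acc >>= 8;
    }
}

void decompress_id(const unsigned char in[16], char trade_num[18],
                   char broker[3])
{
    unsigned __int128 acc = 0;
    for (int i = 0; i < 16; ++i)
        acc = (acc << 8) | in[i];
    for (int i = 2; i >= 0; --i) {               // peel off the number first
        broker[i] = static_cast<char>('0' + static_cast<int>(acc % 10));
        acc /= 10;
    }
    for (int i = 17; i >= 0; --i) {              // then the base-94 digits
        trade_num[i] = static_cast<char>(32 + static_cast<int>(acc % 94));
        acc /= 94;
    }
}
Decompression simply undoes the operations in reverse order, exactly as described above.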

If it can only contain letters, then you have fewer than 64 possibilities per character (26 upper case, 26 lower case, leaving 12 codes for space, terminator, underscore, etc.). With 6 bits per character you should get there, in 15 bytes. Assuming you don't support special characters.

Use the first 10 bits for the 3-character numeric string (encode it as a binary number, and pad back out with leading zeros as appropriate when decoding).
Okay, this leaves you with 118 bits and 16 alphanumeric characters to store.
0x00 to 0x7F (if you mean inclusive) comprises 128 possible characters. That means each character can be identified by a combination of 7 bits. Come up with an index mapping each of the values those 7 bits can represent to the actual character. To represent 16 of your "alphanumeric" characters this way, you need a total of 112 bits.
We now have 122 bits (or 15.25 bytes) representing our data. Add an easter egg to fill in the remaining unused bits and you have your 16 character array.

Related

Most compact way to represent a 32-bit integer in a csv-file

I have a lot of (unsigned) integers originating from a measurement. They are stored in a CSV text file:
1111492765
562352
5362346
...
Since I have to transmit this file over a low-bandwidth connection, I am looking for a way to save storage space (characters).
What is the best way to do so, besides using compression (gzip, ...)?
So far, representing the 32-bit integers as hex values seems promising:
1111492765 = 10 bytes
is the same as
4240089D = 8 bytes
Note: at the receiving end of the transmission I can convert the file to anything I like.
Following your integer -> hex (base 16) idea, you can convert the numbers to base 64 instead - this way, you'll only need ceil(log(value)/log(64)) characters, e.g.:
ceil(log(1111492765)/log(64)) = ceil(5.008) = 6 characters
ceil(log(562352)/log(64)) = ceil(3.184) = 4 characters
For this, you'll have to convert the number by repeatedly taking "modulo 64" followed by "divide by 64". This yields values in the range 0..63 that you can encode using a Base64 alphabet (e.g. ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/). On the receiving side you recombine the characters to get the original value.
Example for "562352":
Encoding:
---------
562352 mod 64 = 48 => encode as "w"
floor(562352/64) = 8786
8786 mod 64 = 18 => encode as "S"
floor(8786/64) = 137
137 mod 64 = 9 => encode as "J"
floor(137/64) = 2
2 mod 64 = 2 => encode as "C"
Number is below 64 => finished
Decoding:
---------
wSJC = 48, 18, 9, 2
value = 48 + 18 * (64 ^ 1) + 9 * (64 ^ 2) + 2 * (64 ^ 3) = 562352
Depending on how many characters are valid in your CSV, you can extend the alphabet to get shorter encodings (e.g. there's Ascii85/Base85).
Also note: If a subset of your values are very similar to each other (not the case in your example, but might be the case for the real measured values), you can additionally use delta compression by only encoding the difference between two values.
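A small C++ sketch of that mod-64 loop (function names are illustrative; the alphabet is the one suggested above):
#include <cstring>
#include <string>

// Index 0 is 'A', index 63 is '/'.
static const char ALPHABET[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

std::string encode_base64_int(unsigned value)
{
    std::string out;
    do {
        out += ALPHABET[value % 64];  // least significant digit first
        value /= 64;
    } while (value > 0);
    return out;                       // 562352 -> "wSJC"
}

unsigned decode_base64_int(const std::string& s)
{
    unsigned value = 0;
    // Most significant digit is the last character; assumes s contains
    // only characters from the alphabet.
    for (std::string::size_type i = s.size(); i-- > 0; )
        value = value * 64
              + static_cast<unsigned>(std::strchr(ALPHABET, s[i]) - ALPHABET);
    return value;                     // "wSJC" -> 562352
}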

How to add/sub int to a BYTE type?

I've a struct defined this way:
struct IMidiMsg {
int mOffset;
BYTE mStatus, mData1, mData2;
}
and I set mData2 to int values between 0 and 127 without any problem. But if I add/subtract an int:
pNoteOff->mData2 -= 150;
pNoteOff->mData2 += 150;
I get weird results. I think it's due to the type: it's BYTE, not int, of course. (Note: BYTE is from minwindef.h.)
Let's say mData2 has the value 114: how would you first subtract 150 and then add 150, getting 114 again?
You cannot have negative values in a BYTE, as it is defined as typedef unsigned char BYTE; - it can only hold numbers between 0 and 255.
If you want to have negative values too, use signed char; that works between -128 and +127. Or just use a normal int. Saving bytes is rarely worth it nowadays, with 32- or 64-bit architectures; if anything, it makes calculations slower.
how would you first subtract 150 and then add 150, getting 114 again?
That is exactly what is supposed to happen. You start with 114, and you subtract 150 from a type capped at 255. Since 114 is less than 150, subtraction results in a "borrow", i.e. (256+114)-150=220.
When you then add 150 to 220, you get 370. Now the "carry" is dropped, so you get 370 - 256 = 114.
The same mechanism is at work here as in modulo arithmetic with any other cap. For example, if you consider single-digit decimal numbers, doing something like
3-6+6=3
3-6 --> (borrow) 13-6 --> 7
7+6 --> 13 --> 3 (drop tens)
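A tiny illustrative demo of that wraparound:
#include <cstdio>

typedef unsigned char BYTE;  // as in minwindef.h

int main()
{
    BYTE b = 114;
    b -= 150;                  // (256 + 114) - 150 = 220
    std::printf("%d\n", b);    // prints 220
    b += 150;                  // 220 + 150 = 370 -> 370 - 256 = 114
    std::printf("%d\n", b);    // prints 114 again
    return 0;
}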

Calculate the size of a Base64-decoded message

I have a Base64-encoded string:
static const unsigned char base64_test_enc[] =
"VGVzdCBzdHJpbmcgZm9yIGEgc3RhY2tvdmVyZmxvdy5jb20gcXVlc3Rpb24=";
It does not have a CRLF every 72 characters.
How do I calculate the length of the decoded message?
Well, base64 represents 3 bytes in 4 characters... so to start with you just need to divide by 4 and multiply by 3.
You then need to account for padding:
If the text ends with "==" you need to subtract 2 bytes (as the last group of 4 characters only represents 1 byte)
If the text ends with just "=" you need to subtract 1 byte (as the last group of 4 characters represents 2 bytes)
If the text doesn't end with padding at all, you don't need to subtract anything (as the last group of 4 characters represents 3 bytes as normal)
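For example, the string in the question is 60 characters long and ends with a single "=": 60 / 4 * 3 = 45, minus 1 for the padding character, gives 44 decoded bytes.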
Base 64 uses 4 characters per 3 bytes. If it uses padding it always has a multiple of 4 characters.
Furthermore, there are three padding possibilities:
two characters and two padding characters ("==") for one encoded byte,
three characters and one padding character ("=") for two encoded bytes,
and of course no padding characters, making 3 bytes.
So you can simply divide the number of characters by 4, then multiply by 3 and finally subtract the number of padding characters.
Possible C code could be (I'm extremely rusty in C, so please adjust):
#include <stddef.h>
#include <string.h>

size_t encoded_base64_bytes(const char *input)
{
    size_t len, padlen;
    const char *last, *first_pad;

    len = strlen(input);
    if (len == 0) return 0;
    last = input + len - 4;            /* final group of 4 characters */
    first_pad = strchr(last, '=');     /* first padding character, if any */
    padlen = first_pad == NULL ? 0 : (size_t)(last + 4 - first_pad);
    return (len / 4) * 3 - padlen;
}
Note that this code assumes that the input is valid base 64.
A careful observer will notice that there are spare bits, usually set to 0, in the final characters when padding is used.
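As a quick illustrative check of the function above against the string from the question (assuming the driver is compiled together with the function):
#include <stdio.h>

int main(void)
{
    const char *enc =
        "VGVzdCBzdHJpbmcgZm9yIGEgc3RhY2tvdmVyZmxvdy5jb20gcXVlc3Rpb24=";
    /* 60 characters with one '=': (60 / 4) * 3 - 1 = 44 decoded bytes */
    printf("%zu\n", encoded_base64_bytes(enc));
    return 0;
}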

Drawing up a hexadecimal number from several decimal numbers

I have a vector container of numbers from 0 to 255 - the raw data bytes. The mantissa starts at the fourth byte and may consist of several numbers, for example <120, 111, 200>, i.e. the machine bytes <0x78, 0x6F, 0xC8>. Altogether the mantissa is 0x786FC8.
I currently convert it with this method:
Convert the numbers 120, 111, 200 to hexadecimal (0x78, 0x6F, 0xC8).
Turn the numbers into strings ("78", "6F", "C8").
Concatenate the strings ("786FC8").
Convert back to an integer type: 0x786FC8.
Q: Is there any way to do this faster and without strings?
It sounds like you want <120, 111, 200> → (120 * 256 + 111) * 256 + 200.
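As a sketch (the container and function name are illustrative), the same idea as a loop over the byte vector, with no strings involved:
#include <cstdint>
#include <vector>

// Combine big-endian bytes into a single integer.
std::uint32_t combine_bytes(const std::vector<std::uint8_t>& bytes)
{
    std::uint32_t value = 0;
    for (std::uint8_t b : bytes)
        value = (value << 8) | b;  // same as value * 256 + b
    return value;
}

// combine_bytes({120, 111, 200}) == 0x786FC8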

How are numbers stored? [closed]

What kind of method does the compiler use to store numbers? One char is 0-255. Two chars side by side would be 0-255255. In C++ one short is 2 bytes, with range 0-65535. Now, how does the compiler get from 255255 to 65535, and what happens with the minus sign in unsigned numbers?
The maximum value you can store in n bits (when the lowest value is 0 and the values represented are a continuous range), is 2ⁿ − 1. For 8 bits, this gives 255. For 16 bits, this gives 65535.
Your mistake is thinking that you can just concatenate 255 with 255 to get the maximum value in two chars - this is definitely wrong. Instead, to get from the range of 8 bits, which is 256, to the range of 16 bits, you would do 256 × 256 = 65536. Since our first value is 0, the maximum value is 65535, once again.
Note that a char is only guaranteed to have at least 8 bits and a short at least 16 bits (and must be at least as large as a char).
You have got the math totally wrong. Here's how it really is.
Since each bit can take only one of two states (1 or 0), n bits as a whole can represent 2^n different combinations. When dealing with integers, a standard short of 2 bytes can therefore represent 2^n values (n = 16, so 0 to 65535), which are mapped to decimal numbers for computational convenience.
When dealing with 2 characters, they are two separate entities (a string is an array of characters). There are many ways to read two characters as a whole; if you read them as a string, it is the same as two separate characters side by side. Let me give you an example.
Remember, I will be using hexadecimal notation for simplicity! If you have doubts mapping ASCII characters to hex, take a look at an ASCII-to-hex table.
For simplicity, let us assume the characters stored in two adjacent positions are both A.
Now the hex code for A is 0x41, so the memory would look like:
1st byte ....... 2nd byte
01000001 01000001
If you were to read this from memory as a string and print it out, then the output would be
AA
If you were to read the whole 2 bytes as an integer, then this would represent
0 * 2^15 + 1 * 2^14 + 0 * 2^13 + 0 * 2^12 + 0 * 2^11 + 0 * 2^10 + 0 * 2^9 + 1 * 2^8 + 0 * 2^7 + 1 * 2^6 + 0 * 2^5 + 0 * 2^4 + 0 * 2^3 + 0 * 2^2 + 0 * 2^1 + 1 * 2^0
= 16705
If unsigned integers are used, the 2 bytes of data are mapped to integers between 0 and 65535. If the same bytes represent a signed value then, though the range is the same size, the biggest positive number that can be represented is 32767; the values lie between -32768 and 32767. This is because not all of the 2 bytes can be used for magnitude: the highest-order bit is left to determine the sign (1 represents negative, 0 positive).
You must also note that type conversion (two characters read as a single integer) might not always give you the desired results, especially when you narrow the range (for example, when a double-precision float is converted to an integer).
For more on that see this answer: double to int
Hope this helps.
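A short illustrative sketch reading the same two bytes both ways (the integer is assembled arithmetically, so the result does not depend on the machine's endianness):
#include <cstdint>
#include <cstdio>

int main()
{
    unsigned char bytes[2] = { 'A', 'A' };  // 0x41, 0x41
    std::uint16_t as_int =
        static_cast<std::uint16_t>((bytes[0] << 8) | bytes[1]);
    std::printf("as string: %c%c\n", bytes[0], bytes[1]);           // AA
    std::printf("as integer: %u\n", static_cast<unsigned>(as_int)); // 16705
    return 0;
}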
When using the decimal system, it's true that the range of one digit is 0-9 and the range of two digits is 0-99. The same applies in the hexadecimal system, but you have to do the math in hexadecimal. The range of one hexadecimal digit is 0-Fh, and the range of two hexadecimal digits (one byte) is 0-FFh. The range of two bytes is 0-FFFFh, which translates to 0-65535 in the decimal system.
Decimal is a base-10 number system. This means that each successive digit from right-to-left represents an incremental power of 10. For example, 123 is 3 + (2*10) + (1*100). You may not think of it in these terms in day-to-day life, but that's how it works.
Now you take the same concept from decimal (base-10) to binary (base-2) and now each successive digit is a power of 2, rather than 10. So 1100 is 0 + (0*2) + (1*4) + (1*8).
Now let's take an 8-bit number (char); there are 8 digits in this number so the maximum value is 255 (2**8 - 1), or another way, 11111111 == 1 + (1*2) + (1*4) + (1*8) + (1*16) + (1*32) + (1*64) + (1*128).
When there are another 8 bits available to make a 16-bit value, you just continue counting powers of 2; you don't just "stick" the two 255s together to make 255255. So the maximum value is 65535, or another way, 1111111111111111 == 1 + (1*2) + (1*4) + (1*8) + (1*16) + (1*32) + (1*64) + (1*128) + (1*256) + (1*512) + (1*1024) + (1*2048) + (1*4096) + (1*8192) + (1*16384) + (1*32768).
It depends on the type: integral types must be stored as binary (or at least, appear so to a C++ program), so you have one bit per binary digit. With very few exceptions, all of the bits are significant (although this is not required, and there is at least one machine where there are extra bits in an int). On a typical machine, char will be 8 bits and, if it isn't signed, can store values in the range [0, 2^8); in other words, between 0 and 255 inclusive. unsigned short will be 16 bits (range [0, 2^16)), unsigned int 32 bits (range [0, 2^32)), and unsigned long either 32 or 64 bits.
For the signed values, you'll have to use at least one of the bits for the sign, reducing the maximum positive value. The exact representation of negative values can vary, but on most machines it will be 2's complement, so the ranges will be, e.g., signed char: [-2^7, 2^7).
If you're not familiar with base two, I'd suggest you learn it very quickly; it's fundamental to how all modern machines store numbers. You should find out about 2's complement as well: the usual human representation is called sign plus magnitude, and it is very rare in computers.