Base 256 GUID String Representation w/Character Offset 0x01F - c++

I need to make a string representation of a 128-bit GUID that uses 8-bit chunks instead of 4-bit ones; so, like base 256 :/ Why? Because I need to shorten the hex string representation of the GUID, which is too long: I have a max array size of 31 characters plus the NULL terminator. Also, no character value may fall in the range 0x00 to 0x1F.... It's giving me a headache, but I know it can be done. Any suggestions on the best and safest way? Right now I'm just mucking around with memcpy and adding an offset of 0x1F, but the numbers keep bouncing around in my head.... can't nail 'em down! And I have to interconvert :\

Try base64 encoding: it packs 6 bits per character and is still portable. You will need 22 characters to store a 128-bit GUID that way.

If you have it as an OLESTR, first convert it into a GUID with IIDFromString(), then base64-encode those 16 bytes as Peter Tillemans suggests.
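To make that concrete, here is a hedged sketch (my own, not from the answers) of packing the 16 raw bytes of a GUID into 22 base64 characters, which satisfies both of the original constraints: it fits in a 31-character buffer and every output character is printable ASCII well above 0x1F. The Guid struct mirrors the Windows GUID layout, and copying it byte-for-byte with memcpy is an assumption about how you want the field byte order handled.

#include <cstdint>
#include <cstring>
#include <string>

struct Guid { std::uint32_t Data1; std::uint16_t Data2, Data3; std::uint8_t Data4[8]; };

std::string guid_to_base64(const Guid& g) {
    static const char tbl[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    unsigned char bytes[18] = {};          // 16 GUID bytes plus 2 zero padding bytes
    std::memcpy(bytes, &g, 16);            // assumes the in-memory field order is acceptable
    std::string out;
    for (int i = 0; i < 18; i += 3) {      // every 3 bytes become 4 base64 characters
        std::uint32_t n = (std::uint32_t(bytes[i]) << 16) |
                          (std::uint32_t(bytes[i + 1]) << 8) | bytes[i + 2];
        out += tbl[(n >> 18) & 63];
        out += tbl[(n >> 12) & 63];
        out += tbl[(n >> 6) & 63];
        out += tbl[n & 63];
    }
    out.resize(22);                        // 128 bits need only 22 characters, well under 31
    return out;
}

Interconverting is just the reverse lookup: map each character back to its 6-bit index in the same table and repack the bits into 16 bytes.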

Related

How to write an integer x in its unique xLen-digit representation in base 256?

I want to write cypher (an instance of mpz_class) in its unique xLen-digit representation in base 256. In order to do this I use the code below:
//Write the integer x in its unique xLen-digit representation in base 256
string str;
str = cypher.get_str(256);
//Print string figure by figure separated by space
for (int i = 0; i < (int)str.length(); i++) {
    if (i % 256 == 0)
        cout << " " << str[i];
    else
        cout << str[i];
}
Unfortunately, I receive
terminate called after throwing an instance of 'std::logic_error'
  what():  basic_string::_M_construct null not valid
Aborted (core dumped)
I strongly believe it's because of str=cypher.get_str(256) because changing the base to 10 returns no error.
I would really appreciate your ideas on how I could replace this block.
Thank you!
There is no such thing as an ASCII representation in base 256. Base 256 means that each digit can take 256 distinct values. As a digit is stored as a character, each character would need at least 256 printable values. Since ASCII contains only 95 printable characters (i.e. excluding control characters such as backspace or the bell), any value above that range cannot be represented. You could do this with e.g. Unicode, but you would be missing the point of base encoding.
If somebody writes "base 256" it probably simply means that the value should be stored, encoded or referenced in bytes, as each byte has 256 values. However, you'd still have to decide whether the integers should be stored in a (two's-complement) signed encoding or unsigned. Furthermore, a decision has to be made whether to store it in big-endian order (network order, with the highest bit/byte on the left) or in little-endian order as used on x86-compatible CPUs.
Unfortunately, the documentation page for the mpz_export function is exceedingly unclear. Of course, the big integers are stored internally as 32-bit or, nowadays more likely, 64-bit words. However, this fact is exposed through the function, and it seems only to be able to write the value in multiples of the word size, so 4 or 8 bytes, setting the unneeded bytes to 0. This is an unnecessary nuisance.
Nevertheless, as mentioned in the comments, this function is what you should be using to convert the number to binary representation.
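To make that concrete, here is a minimal sketch (untested, and assuming GMP's C++ interface from gmpxx.h) that uses mpz_export with a word size of 1 byte to obtain the big-endian base-256 digits of a value such as cypher, then prints each digit as two hex characters to avoid the unprintable-character problem described above.

#include <gmpxx.h>
#include <cstdio>
#include <vector>

std::vector<unsigned char> to_base256(const mpz_class& x) {
    // One output word per byte; order = 1 gives the most significant digit first.
    std::vector<unsigned char> digits((mpz_sizeinbase(x.get_mpz_t(), 2) + 7) / 8);
    size_t count = 0;
    mpz_export(digits.data(), &count, 1, 1, 0, 0, x.get_mpz_t());
    digits.resize(count);                  // count is 0 when x == 0
    return digits;
}

int main() {
    mpz_class cypher("123456789012345678901234567890");
    for (unsigned char d : to_base256(cypher))
        std::printf("%02x ", d);           // print each base-256 digit as two hex characters
    std::printf("\n");
}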

Store SHA-1 in database in less space than the 40 hex digits

I am using a hash algorithm to create a primary key for a database table. I use the SHA-1 algorithm, which is more than fine for my purposes, and the database even ships an implementation of it. The function computing the hash returns the value as 40 hex characters, so I am storing them in a char(40) column.
The table will have lots of rows, >= 200 million, which is why I am looking for a less data-intensive way of storing the hash. 40 characters times ~200 million rows will require some GB of storage... Since hex is base 16, I thought I could store the hash in base 256 in the hope of reducing the number of characters needed to around 20. Do you have tips or papers on implementations of compression with base 256?
Store it as a blob: storing 8 bits of data per character instead of 4 is a 2x reduction (you need some way to convert it, though).
Cut off some characters: you have 160 bits, but 128 bits is enough for unique keys even if the universe ends, and for most purposes 80 bits would be enough (you don't need cryptographic protection). If you have an anti-collision algorithm, even 36 or 40 bits are enough.
A SHA-1 value is 20 bytes. All the bits in these 20 bytes are significant, there's no way to compress them. By storing the bytes in their hexadecimal notation, you're wasting half the space — it takes exactly two hexadecimal digits to store a byte. So you can't compress the underlying value, but you can use a better encoding than hexadecimal.
Storing as a blob is the right answer. That's base 256. You're storing each byte as that byte with no encoding that would create some overhead. Wasted space: 0.
If for some reason you can't do that and you need to use a printable string, then you can do better than hexadecimal by using a more compact encoding. With hexadecimal, the storage requirement is twice the minimum (assuming that each character is stored as one byte). You can use Base64 to bring the storage requirements to 4 characters per 3 bytes, i.e. you would need 28 characters to store the value. In fact, given that you know that the length is 20 bytes and not 21, the base64 encoding will always end with a =, so you only need to store 27 characters and restore the trailing = before decoding.
You could improve the encoding further by using more characters. Base64 uses 64 code points out of the available 256 byte values. ASCII (the de facto portable character set) has 95 printable characters (including space), but there's no common "base95" encoding, so you'd have to roll your own. Base85 is an intermediate choice: it does get some use in practice, and it lets you store the 20-byte value in 25 printable ASCII characters.
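For the blob route, the conversion mentioned above is just hex-to-bytes packing. A minimal sketch (the function name and the strict length check are my own choices) that turns the 40 hex digits into the 20 raw bytes you would store in a BINARY(20)/blob column:

#include <array>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

std::array<std::uint8_t, 20> sha1_hex_to_bytes(const std::string& hex) {
    if (hex.size() != 40)
        throw std::invalid_argument("expected 40 hex digits");
    auto nibble = [](char c) -> std::uint8_t {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        throw std::invalid_argument("not a hex digit");
    };
    std::array<std::uint8_t, 20> out{};
    for (std::size_t i = 0; i < 20; ++i)   // two hex digits become one byte
        out[i] = static_cast<std::uint8_t>((nibble(hex[2 * i]) << 4) | nibble(hex[2 * i + 1]));
    return out;
}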

Reading hard disk sector raw data - Why hex?

I'm trying to read a hard disk sector to get the raw data. After searching a lot I found out that some people store that raw sector data as hex and some as char.
Which is better, and why ? Which will give me better performance ?
I'm trying to write it in C++ and OS is windows.
For clarification -
#include <iostream>
#include <windows.h>
#include <winioctl.h>
#include <stdio.h>
int main() {
    DWORD nRead;
    char buf[512];
    // Open the first physical drive for raw, sector-level reading
    HANDLE hDisk = CreateFile("\\\\.\\PhysicalDrive0",
                              GENERIC_READ, FILE_SHARE_READ,
                              NULL, OPEN_EXISTING, 0, NULL);
    SetFilePointer(hDisk, 0xA00, NULL, FILE_BEGIN);   // seek to byte offset 0xA00
    ReadFile(hDisk, buf, 512, &nRead, NULL);          // read one 512-byte sector
    for (int currentpos = 0; currentpos < 512; currentpos++) {
        std::cout << buf[currentpos];
    }
    CloseHandle(hDisk);
    std::cin.get();
    return 0;
}
Consider the above code written by someone else and not me.
Notice the data type: char buf[512];. The data is stored as char and hasn't been converted into hex.
Raw data is just "raw data"... you store it as it is, you do not convert it, so there is no performance issue here. At most the difference is in how you represent the raw data in a human-readable format. In general:
representing it in char format makes it easier to see whether there is some text contained in it,
while hex is better for representing numeric data (in case it follows some kind of pattern).
In your specific case, char just means 1 byte, so you are sure you are storing your data in a 512-byte buffer. Allocating that space in terms of integer-sized elements would just make things unnecessarily complicated.
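If you do want the human-readable hex view discussed here, a small sketch (the function name is illustrative) that prints a buffer as a hex dump, 16 bytes per line:

#include <cstdio>

void hex_dump(const unsigned char* buf, int len) {
    for (int i = 0; i < len; ++i) {
        std::printf("%02X ", buf[i]);      // two hex digits per byte
        if ((i + 1) % 16 == 0)             // 16 bytes per line
            std::printf("\n");
    }
}

After the ReadFile call above you would invoke it as hex_dump(reinterpret_cast<unsigned char*>(buf), 512).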
You have got yourself confused.
The data on a disk is stored as binary, just a long stream of ones and zeros.
The reason it is read in hex or char format is because that is easier to do.
decimal: 36
char: $ (potentially one way of representing this value; see below)
hex: 24
binary: 100100
The binary is the raw bit stream you would read from the disk or memory. Hex is like a shorthand representation for it; they are completely interchangeable, one hex 'digit' simply represents four bits. Again, the decimal is just yet another way to represent that value.
The char however is a little bit tricky. I could invent my own representation, taking the characters 0-9 to represent the values 0-9 and then a-z to represent the values 10-35; 36 would then need two digits, so in the table above I instead took the standard ASCII value, which gives '$'.
As to why 'char' is used when dealing with bytes: it is because the C++ 'char' type is just a single byte (which is normally 8 bits).
I will also point out the problem with negative numbers. When you have an integer that is signed (can be positive or negative), the most significant bit represents a large negative value, such that if all bits are one the value represents -1. For example, with four bits, so it is easy to see:
0010 = +2
1000 = -8
0110 = +6
1110 = -2
The key to this problem is that it is all just how you interpret/represent the binary values. The same sequence of bits can be represented more or less any way you want.
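A tiny sketch illustrating that point: the same bit pattern 11111110 prints as 254 when treated as unsigned, -2 when treated as a signed 8-bit value, and FE in hex.

#include <cstdint>
#include <cstdio>

int main() {
    std::uint8_t raw = 0xFE;                                      // the bit pattern 11111110
    std::printf("unsigned: %u\n", static_cast<unsigned>(raw));    // 254
    std::printf("signed:   %d\n", static_cast<std::int8_t>(raw)); // -2
    std::printf("hex:      %02X\n", static_cast<unsigned>(raw));  // FE
    return 0;
}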
I'm guessing you're talking about the final data being written to some file. The reason to use hex is that it's easier to read and harder to mess up. Generally, if someone is doing some sort of human analysis on the sector they're going to use a hex editor on the raw data anyway, so if you output it as hex you skip the need for a separate hex viewer/editor.
For instance, on DOS/Windows you have to make sure you open the file in binary mode if you're going to write the raw characters. Also you might have to make sure that the operating system doesn't mess with the character format anywhere in between.

C++: How to read and write multi-byte integer values in a platform-independent way?

I'm developing a simple protocol that is used to read/write integer values from/to a buffer. The vast majority of integers are below 128, but much larger values are possible, so I'm looking at some form of multi-byte encoding to store the values in a concise way.
What is the simplest and fastest way to read/write multi-byte values in a platform-independent (i.e. byte order agnostic) way?
XDR format might help you there. If I had to summarize it in one sentence, it's a kind of binary UTF-8 for integers.
Edit: As mentioned in my comment below, I "know" XDR because I use several XDR-related functions in my office job. Only after your comment did I realize that the "packed XDR" format I use every day isn't even part of the official XDR docs, so I'll describe it separately.
The idea is thus:
Inspect the most significant bit of the first byte.
If it is 0, that byte is the value.
If it is 1, the next three bits give the "byte count", i.e. the number of bytes in the value.
Mask out the top nibble (flag bit plus byte count), concatenate the appropriate number of bytes, and you've got the value.
I have no idea if this is a "real" format or my (former) coworker created this one himself (which is why I don't post code).
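For what it's worth, here is a hedged sketch of a decoder for the scheme as described. The description is ambiguous about whether the header byte's low nibble carries value bits; this sketch assumes it does (as the most significant bits), which is only one possible reading.

#include <cstddef>
#include <cstdint>

std::uint64_t decode(const unsigned char* p, std::size_t* consumed) {
    if ((p[0] & 0x80) == 0) {                   // MSB clear: this byte is the value
        *consumed = 1;
        return p[0];
    }
    std::size_t count = (p[0] >> 4) & 0x07;     // next three bits: number of value bytes
    std::uint64_t value = p[0] & 0x0F;          // mask out the top nibble, keep the low 4 bits
    for (std::size_t i = 0; i < count; ++i)     // concatenate the following bytes
        value = (value << 8) | p[1 + i];
    *consumed = 1 + count;
    return value;
}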
You might be interested in the following functions:
htonl, htons, ntohl, ntohs - convert values between host and network byte order
uint32_t htonl(uint32_t hostlong);
uint16_t htons(uint16_t hostshort);
uint32_t ntohl(uint32_t netlong);
uint16_t ntohs(uint16_t netshort);
man byteorder
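A minimal sketch of using these to write and read a 32-bit value in a byte-order-agnostic way, assuming a POSIX <arpa/inet.h> (on Windows the same functions live in <winsock2.h>):

#include <arpa/inet.h>
#include <cstdint>
#include <cstring>

void put_u32(unsigned char* buf, std::uint32_t value) {
    std::uint32_t be = htonl(value);       // host order -> network (big-endian) order
    std::memcpy(buf, &be, sizeof be);
}

std::uint32_t get_u32(const unsigned char* buf) {
    std::uint32_t be;
    std::memcpy(&be, buf, sizeof be);
    return ntohl(be);                      // network order -> host order
}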
Text would be my first choice. If you want a variable-length binary encoding you have two basic choices:
a length indication
an end marker
You can obviously merge either of those with some value bits.
For a length indication, that would give you something where the length and some value bits are packed together (see for instance UTF-8).
For an end marker, you can for instance state that a set MSB indicates the last byte and thus have 7 data bits per byte (see the sketch below).
Other variants are obviously possible.
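As an illustration of the end-marker variant, a small sketch that emits 7 data bits per byte, least significant group first, and sets the MSB only on the final byte (note this is the opposite of the protobuf varint convention, where a set MSB means more bytes follow); the group ordering is my own choice, not something prescribed above.

#include <cstdint>
#include <vector>

std::vector<unsigned char> encode(std::uint64_t value) {
    std::vector<unsigned char> out;
    while (value >= 0x80) {
        out.push_back(static_cast<unsigned char>(value & 0x7F));  // 7 data bits, MSB clear: more follow
        value >>= 7;
    }
    out.push_back(static_cast<unsigned char>(value | 0x80));      // final byte: MSB set marks the end
    return out;
}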
You could try Network Byte Order
Google's protocol buffers provide a pre-made implementation that uses variable-width encodings.

base64 to reduce digits required to encode a decimal number

I have to manage IDs for some objects.
I need these IDs to be unique.
I have the constraint that these IDs can't be too long in terms of the digits required.
Is base64 a nice way to reduce the number of digits required to encode an ID?
EDIT:
language: C++
data type: integer, then converted to a std::string
Each character in Base64 can represent 6 bits, so divide your ID length in bits by 6 to see how many characters it will take. Raw binary data is 8 bits per byte, so it will always be shorter still, but the bytes won't all be readable.
Base64 will keep the ID readable, but it still won't be good if the ID needs to be hand-entered, like a key. For that you'll want to restrict the character set further.
Base64 is a nice way to transport binary data over ASCII. It doesn't usually decrease the size of anything; in my experience it increases it by about 33% (thanks for the correction).
If you care just about the length of the output string and not the actual byte size, then converting from the decimal numeral system (base 10) to any numeral system with a base higher than 10 will make the output string shorter.
see example here
http://www.translatorscafe.com/cafe/units-converter/numbers/calculator/octal-to-decimal/
For example, in their case:
decimal 9999999999 <- 10 chars long
in the base 32 numeral system becomes 4LDQPDR <- 7 chars long
With up to 95 printable ASCII characters you could use your own base 95 numeral system and get an even shorter string.
I used this approach in one of my projects and it was enough to squeeze "long" numeric IDs into short string fields.
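For reference, a small sketch of the digits-in-a-larger-base idea (the base-62 alphabet below is an illustration, not something prescribed here); with it, 9999999999 comes out as 6 characters instead of 10 decimal digits.

#include <algorithm>
#include <cstdint>
#include <string>

std::string to_base_n(std::uint64_t id, const std::string& alphabet) {
    if (id == 0) return alphabet.substr(0, 1);
    const std::uint64_t base = alphabet.size();
    std::string out;
    while (id > 0) {
        out += alphabet[id % base];        // least significant digit first
        id /= base;
    }
    std::reverse(out.begin(), out.end());  // most significant digit first
    return out;
}

// Example: to_base_n(9999999999ULL,
//     "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz")
// yields a 6-character string.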