Alternative to Perl unpack in C++ [duplicate]

How do I write C++ code that does what the N format of Perl's pack does?
I want to convert an integer variable to a binary form such that unpack with the N format gives back the integer.
My integer variable is named timestamp.
I found that it is related to htonl, but htonl(timestamp) still does not give me the binary form.

I wrote a library, libpack, similar to Perl's pack function. It's a C library so it would be quite usable from C++ as well:
FILE *f;
fpack(f, "u32> u32>", value_a, value_b);
A u32> specifies an unsigned 32-bit integer in big-endian format, i.e. the equivalent of Perl's N format to pack().
http://www.leonerd.org.uk/code/libpack/

The N format takes 4 bytes and forms a 32-bit int as follows (each byte is widened first so that the shift by 24 cannot overflow a signed int):
uint32_t n;
n = (uint32_t)buf[0] << 24
  | (uint32_t)buf[1] << 16
  | (uint32_t)buf[2] << 8
  | (uint32_t)buf[3] << 0;
For example,
uint32_t n;
unsigned char buf[4];
size_t bytes_read = fread(buf, 1, 4, stream);
if (bytes_read < 4) {
    if (ferror(stream)) {
        // Error
        // ...
    }
    else if (feof(stream)) {
        // Premature EOF
        // ...
    }
}
else {
    n = (uint32_t)buf[0] << 24
      | (uint32_t)buf[1] << 16
      | (uint32_t)buf[2] << 8
      | (uint32_t)buf[3] << 0;
}
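For the pack direction that the question asks about, the same shifts can be run in reverse. The following is only a minimal sketch and is not taken from any of the answers: pack_be32 and unpack_be32 are hypothetical helper names, and the layout assumes that N means an unsigned 32-bit value in big-endian (network) order, as described above. On a typical system, copying the 4 bytes of htonl(timestamp) with memcpy should give the same bytes.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper: encode v as 4 bytes, most significant byte first
// (the layout assumed here for Perl's pack("N", ...)).
std::array<unsigned char, 4> pack_be32(std::uint32_t v)
{
    return {{
        static_cast<unsigned char>((v >> 24) & 0xFF),
        static_cast<unsigned char>((v >> 16) & 0xFF),
        static_cast<unsigned char>((v >> 8) & 0xFF),
        static_cast<unsigned char>(v & 0xFF)
    }};
}

// Hypothetical inverse (the unpack("N", ...) direction).
// p must point at 4 readable bytes.
std::uint32_t unpack_be32(const unsigned char* p)
{
    assert(p != nullptr);
    // Widen each byte before shifting so no bits are lost.
    return static_cast<std::uint32_t>(p[0]) << 24
         | static_cast<std::uint32_t>(p[1]) << 16
         | static_cast<std::uint32_t>(p[2]) << 8
         | static_cast<std::uint32_t>(p[3]);
}
```

To write the packed timestamp to a std::ostream, something like out.write(reinterpret_cast<const char*>(bytes.data()), bytes.size()) should do, where bytes is the array returned by pack_be32(timestamp).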

Related

Reading Binary Files Using Bitwise Shifters and Buffers in C++

I'm trying to read a binary file and simply convert the data to usable unsigned integers. The code below works for 2-byte reads at certain file locations and correctly prints the unsigned integer. When I use the 4-byte code, though, the value turns out to be much larger than it is supposed to be. I believe the issue lies in the read function: I seem to be getting the wrong character/decimal number (101, for example), which when bit-shifted becomes a number much larger than it should be (~6662342). When the program runs, it also throws an exception every now and then ("stack around the variable buf", runtime error #2 in Visual Studio). Any ideas? It may be my fundamental knowledge of how the data is stored in the char array that is affecting my output.
Working 2-byte code:
    unsigned char buf[2];
    file.seekg(3513);
    uint64_t value = readBufferLittleEndian(buf, &file);
    printf("%i", value);
    system("PAUSE");
    return 0;
}
uint64_t readBufferLittleEndian(unsigned char buf[], std::ifstream *file)
{
    file->read((char*)(&buf[0]), 2);
    return (buf[1] << 8 | buf[0]);
}
Broken 4-byte code:
    unsigned char buf[8 + 1]; //= { 0, 2 , 0 , 0 , 0 , 0 , 0, 0, 0 };
    uint64_t buf1[9];
    file.seekg(3213);
    uint64_t value = readBufferLittleEndian(buf, &file, buf1);
    std::cout << value;
    system("PAUSE");
    return 0;
}
uint64_t readBufferLittleEndian(unsigned char buf[], std::ifstream *file, uint64_t buf1[])
{
    file->read((char*)(&buf[0]), 4);
    for (int index = 0; index < 4; index++)
    {
        buf1[index] = buf[index];
    }
    buf1[0];
    buf1[1];
    buf1[2];
    buf1[3];
    //return (buf1[7] << 56 | buf1[6] << 48 | buf1[5] << 40 | buf1[4] << 32 | buf1[3] << 24 | buf1[2] << 16 | buf1[1] << 8 | buf1[0]);
    return (buf1[3] << 24 | buf1[2] << 16 | buf1[1] << 8 | buf1[0]);
    //return (buf1[1] << 8 | buf1[0]);
}
Please correct me if I got the endianness reversed.
The code is C++ except for the printf line.
You have to cast before shifting. A char is promoted to int, and an int cannot be shifted left by 56 bits,
i.e. do ((uint64_t)buf[n]) << NN.
seekg(0) is byte 1, so seekg(3212) is byte 3213. I'm not entirely sure why I was getting a zero in byte 3214 before, considering I now get 220 (indicating big-endianness). Getting 220 suggests that I was misinterpreting how seekg() works. Oh well, it is solved where it matters now anyway.
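The cast-before-shift advice can be put together into a small reader. This is a minimal sketch rather than the asker's final code: read_le32 is a hypothetical name, it assumes the 4 bytes are stored little-endian, and it reports a short read instead of returning a half-filled value.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: read 4 bytes from the stream and combine them as a
// little-endian unsigned 32-bit value. Returns false on a short read.
bool read_le32(std::istream& in, std::uint32_t& value)
{
    unsigned char buf[4];
    in.read(reinterpret_cast<char*>(buf), sizeof buf);
    if (in.gcount() != static_cast<std::streamsize>(sizeof buf))
        return false;
    // Widen each byte before shifting so no bits are lost.
    value = static_cast<std::uint32_t>(buf[0])
          | static_cast<std::uint32_t>(buf[1]) << 8
          | static_cast<std::uint32_t>(buf[2]) << 16
          | static_cast<std::uint32_t>(buf[3]) << 24;
    return true;
}

// Usage sketch with an in-memory stream standing in for the file.
void demo_read_le32()
{
    std::istringstream in(std::string("\xDC\x00\x00\x80", 4));
    std::uint32_t v = 0;
    assert(read_le32(in, v));
    assert(v == 0x800000DCu);
}
```

The buffer is exactly as large as the read, which also avoids the kind of stack corruption that reading 4 bytes into a 2-byte array would cause.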

How to decode/encode a UTF-8 char in c++ without wchar_t

As the title states, I am attempting to decode/encode UTF-8 characters to a char, but I want to do it without using wchar_t or the like. I want to do the leg work myself. This way I know that I understand it, which I obviously don't or it would be working. I've spent about a week toying with it and am just not making progress.
I have tried several ways, yet always seem to produce incorrect results. My latest attempt:
ifstream ifs(FILENAME);
if(!ifs) {
    cerr << "Open: " << FILENAME << "\n";
    exit(1);
}
char in;
while (ifs >> std::noskipws >> in) {
    int sz = 1;
    if ((in & 0xc0) == 0xc0) //0xc0 = 0b11000000
    {
        sz++;
        if((in & 0xE0) == 0xE0) //0xE0 = 0b11100000
        {
            sz++;
            if((in & 0xF0) == 0xF0) //0xF0 = 0b11110000
                sz++;
        }
    }
    cout << sz << endl;
    unsigned int a = in;
    for(int i = 1; i < sz; i++) {
        ifs >> in;
        a += in;
    }
Why does this code not work? I simply do not understand.
EDIT: Copy+Paste spaghetti...two different var names
It appears that you're testing the wrong value. Your loop is reading into the value in, but you are testing against some value named c.
When you read in additional characters, you're also going about it wrong. You're using some value length instead of presumably sz. And you're adding characters to an integer (which is not necessarily 32-bits by the way) instead of shifting and combining with bitwise OR.
Those are weird mistakes. Perhaps you didn't paste your real code in the question, or you actually have these values lying around in scope of your function.
I would also suggest rearranging your branching, which is a bit obtuse. The rule, according to your code is:
mask | sz
---------+-------
0xxxxxxx | 1
10xxxxxx | 1
110xxxxx | 2
1110xxxx | 3
1111xxxx | 4
You could define a simple table to select a size based on the upper 4 bits.
int sizes[16];
std::fill( sizes, sizes+16, 1 );
sizes[0xc] = 2;
sizes[0xd] = 2;
sizes[0xe] = 3;
sizes[0xf] = 4;
In your loop, let's fix the c and length things, use the size table to avoid silly branching, use istream::get instead of the stream input operator (>>), and combine the characters into a single value in a more normal way.
// Requires <iostream>, <iomanip> and <cstdint>.
for( char c; ifs.get(c); )
{
    // Select correct character size (bytes)
    int sz = sizes[static_cast<unsigned char>(c) >> 4];
    // Construct character; go through unsigned char to avoid sign extension
    char32_t val = static_cast<unsigned char>(c);
    while( --sz > 0 && ifs.get(c) )
    {
        val = (val << 8) | (static_cast<char32_t>(c) & 0xff);
    }
    // Output character value in hex, unless error.
    if( ifs )
    {
        std::cout << std::hex << std::setfill('0') << std::setw(8)
                  << static_cast<std::uint32_t>(val) << std::endl;
    }
}
Now, this last part concatenates the bytes in big-endian order. I don't know if this is correct, as I haven't read the standard. But it's much more correct than just adding values together. It also uses a datatype guaranteed to be at least 32 bits wide, unlike the unsigned int you used.
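Note that concatenating the raw bytes gives the encoded form, not the Unicode code point. To get the code point, the length-marker bits of the lead byte and the 10 prefix of each continuation byte have to be stripped, and the remaining 6-bit groups joined. The following is a minimal sketch of that different technique, not the answer's method: decode_utf8 is a hypothetical helper, it works on a std::string rather than a stream, and it does not reject overlong forms or surrogates.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: decode one UTF-8 sequence starting at s[pos].
// On success, stores the code point in cp, advances pos past the
// sequence and returns true. Returns false on malformed or truncated input.
bool decode_utf8(const std::string& s, std::size_t& pos, char32_t& cp)
{
    if (pos >= s.size())
        return false;
    const unsigned char lead = static_cast<unsigned char>(s[pos]);
    std::size_t len;
    if (lead < 0x80)                { len = 1; cp = lead; }         // 0xxxxxxx
    else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }  // 110xxxxx
    else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }  // 1110xxxx
    else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }  // 11110xxx
    else
        return false;  // stray continuation byte or invalid lead byte
    if (s.size() - pos < len)
        return false;  // sequence is cut off
    for (std::size_t i = 1; i < len; ++i) {
        const unsigned char c = static_cast<unsigned char>(s[pos + i]);
        if ((c & 0xC0) != 0x80)
            return false;  // continuation bytes must look like 10xxxxxx
        cp = (cp << 6) | (c & 0x3F);
    }
    pos += len;
    return true;
}

// Usage sketch: the three bytes E2 82 AC encode U+20AC (the euro sign).
void demo_decode_utf8()
{
    const std::string s = "\xE2\x82\xAC";
    std::size_t pos = 0;
    char32_t cp = 0;
    assert(decode_utf8(s, pos, cp));
    assert(cp == 0x20AC && pos == 3);
}
```

Unlike the size table above, this treats a lone 10xxxxxx byte as an error instead of as a 1-byte character.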

c++ 64 bit network to host translation

I know there are answers to this question on the web using gcc's byteswap and other alternatives, but I was wondering why my code below isn't working.
First, I get gcc warnings (which I feel shouldn't appear). The reason I don't want to use byteswap is that I would need to determine whether my machine is big-endian or little-endian and use byteswap accordingly, i.e. if my machine is big-endian I could memcpy the bytes as-is without any translation; otherwise I need to swap them and copy.
static inline uint64_t ntohl_64(uint64_t val)
{
    unsigned char *pp =(unsigned char *)&val;
    uint64_t val2 = ( pp[0] << 56 | pp[1] << 48
                    | pp[2] << 40 | pp[3] << 32
                    | pp[4] << 24 | pp[5] << 16
                    | pp[6] << 8 | pp[7]);
    return val2;
}
int main()
{
    int64_t a=0xFFFF0000;
    int64_t b=__const__byteswap64(a);
    int64_t c=ntohl_64(a);
    printf("\n %lld[%x] [%lld] [%lld]\n ", a, a, b, c);
}
Warnings:-
In function ‘uint64_t ntohl_64(uint64_t)’:
warning: left shift count >= width of type
warning: left shift count >= width of type
warning: left shift count >= width of type
warning: left shift count >= width of type
Output:-
4294901760[00000000ffff0000] 281470681743360[0000ffff00000000] 65535[000000000000ffff]
I am running this on a little-endian machine, so byteswap and ntohl_64 should produce exactly the same values, but unfortunately I get completely unexpected results. It would be great if someone could point out what's wrong.
The reason your code does not work is that you're shifting unsigned chars. Each one is promoted to int (32 bits here) before the shift, so a shift of 32 or more is undefined behaviour, and a shift of 24 can overflow into the sign bit. On x86 the shift count is typically taken modulo 32, which is consistent with the 65535 you get. You have to cast each byte to whatever you want the final size to be first, like:
((uint64_t)pp[0]) << 56
With gcc on Linux (glibc), the simplest solution is htobe64 from <endian.h>, or be64toh for the network-to-host direction. These functions do everything for you.
P.S. It's a little bit off topic, but if you want to make the function portable across endianness you could do:
Edit based on Nova Denizen's comment:
static inline uint64_t htonl_64(uint64_t val)
{
    // Note: reading a union member other than the one last written is
    // allowed in C and by gcc, but is not guaranteed by standard C++.
    union{
        uint64_t retVal;
        uint8_t bytes[8];
    };
    // Network order is big-endian: most significant byte first.
    bytes[0] = (val & 0xff00000000000000) >> 56;
    bytes[1] = (val & 0x00ff000000000000) >> 48;
    bytes[2] = (val & 0x0000ff0000000000) >> 40;
    bytes[3] = (val & 0x000000ff00000000) >> 32;
    bytes[4] = (val & 0x00000000ff000000) >> 24;
    bytes[5] = (val & 0x0000000000ff0000) >> 16;
    bytes[6] = (val & 0x000000000000ff00) >> 8;
    bytes[7] = (val & 0x00000000000000ff);
    return retVal;
}
static inline uint64_t ntohl_64(uint64_t val)
{
    union{
        uint64_t inVal;
        uint8_t bytes[8];
    };
    inVal = val;
    // bytes[0] is the most significant byte in network order.
    return ((uint64_t)bytes[0]) << 56 |
           ((uint64_t)bytes[1]) << 48 |
           ((uint64_t)bytes[2]) << 40 |
           ((uint64_t)bytes[3]) << 32 |
           ((uint64_t)bytes[4]) << 24 |
           ((uint64_t)bytes[5]) << 16 |
           ((uint64_t)bytes[6]) << 8 |
           ((uint64_t)bytes[7]);
}
Assuming the compiler doesn't do something to the uint64_t on its way back through the return, and assuming the user treats the result as an 8-byte value (and not an integer), that code should work on any system. With any luck, your compiler will be able to optimize out the whole expression if you're on a big-endian system and use some builtin byte-swapping technique if you're on a little-endian machine (and it's guaranteed to still work on any other kind of machine).
uint64_t val2 = ( pp[0] << 56 | pp[1] << 48
                | pp[2] << 40 | pp[3] << 32
                | pp[4] << 24 | pp[5] << 16
                | pp[6] << 8 | pp[7]);
pp[0] is an unsigned char and 56 is an int, so pp[0] is promoted to int and pp[0] << 56 is performed as an int shift with an int result. With a 32-bit int, a shift count of 56 is wider than the type. This isn't what you want, because you want all these shifts to have type unsigned long long.
The way to fix this is to cast, like ((unsigned long long)pp[0]) << 56.
Since pp[x] is only 8 bits wide (and is promoted to a 32-bit int at most), the expression pp[0] << 56 does not give the value you expect. Alternatively, mask the original 64-bit value and then shift:
uint64_t val2 = (( val & 0xff ) << 56 ) |
                (( val & 0xff00 ) << 40 ) |
                ...
In any case, just use compiler built-ins; they usually result in a single byte-swapping instruction.
Casting and shifting works, as PlasmaHH suggested, but I don't know why 32-bit shifts upconvert automatically and 64-bit shifts do not.
typedef uint64_t __u64;
static inline uint64_t ntohl_64(uint64_t val)
{
    unsigned char *pp =(unsigned char *)&val;
    return ((__u64)pp[0] << 56 |
            (__u64)pp[1] << 48 |
            (__u64)pp[2] << 40 |
            (__u64)pp[3] << 32 |
            (__u64)pp[4] << 24 |
            (__u64)pp[5] << 16 |
            (__u64)pp[6] << 8 |
            (__u64)pp[7]);
}
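On why the smaller shifts seem to upconvert by themselves: an unsigned char operand is promoted to int before the shift, so shifts that fit within an int appear to work, while nothing promotes the operand to 64 bits. The sketch below illustrates this under the assumption of a mainstream platform where int is wider than char. load_be64 is a hypothetical helper that avoids the problem by keeping the accumulator 64 bits wide, and it reads the bytes from a buffer rather than through a pointer to the value.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// The shift of an unsigned char is carried out in int (after integer
// promotion), not in 8 bits and not in 64 bits.
static_assert(std::is_same<decltype(static_cast<unsigned char>(1) << 8), int>::value,
              "unsigned char is promoted to int before shifting");

// Hypothetical helper: assemble a big-endian (network order) 64-bit value
// from 8 bytes. The accumulator is already 64 bits wide, so every shift
// happens in uint64_t.
std::uint64_t load_be64(const unsigned char* p)
{
    std::uint64_t v = 0;
    for (int i = 0; i < 8; ++i)
        v = (v << 8) | p[i];
    return v;
}

// Usage sketch: the fifth and sixth bytes land in bits 16 to 31.
void demo_load_be64()
{
    const unsigned char bytes[8] = {0, 0, 0, 0, 0xFF, 0xFF, 0, 0};
    assert(load_be64(bytes) == 0xFFFF0000u);
}
```

Because it works on byte values rather than on the host's memory layout, this gives the same result on big-endian and little-endian machines.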

How to simply reconstruct numbers from a buffer in little endian format

Suppose I have:
typedef unsigned long long uint64;
unsigned char data[BUF_SIZE];
uint64 MyPacket::GetCRC()
{
    return (uint64)(data[45] | data[46] << 8 |
                    data[47] << 16 | data[48] << 24 |
                    (uint64)data[49] << 32 | (uint64)data[50] << 40 |
                    (uint64)data[51] << 48 | (uint64)data[52] << 56);
}
Just wondering if there is a cleaner way. I tried a memcpy to a uint64 variable, but that gives me the wrong value. I think I need the reverse. The data is in little-endian format.
The big advantage of using the shift-or sequence is that it will work regardless of whether your host machine is big- or little-endian.
Of course, you could always tweak the expression. Personally, I try to join "pairs", that is, two bytes at a time, then two shorts, and finally two longs, as this will help compilers to generate better code.
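The pairwise joining could look something like the sketch below. The helper names join16, join32 and join64 are hypothetical, the data is assumed to be little-endian as in the question, and whether this actually produces better machine code than the flat expression depends on the compiler.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers that join little-endian bytes two at a time.
inline std::uint16_t join16(const unsigned char* p)
{
    // Two bytes into one 16-bit value; p[0] is the low byte.
    return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}

inline std::uint32_t join32(const unsigned char* p)
{
    // Two 16-bit halves into one 32-bit value.
    return join16(p) | (static_cast<std::uint32_t>(join16(p + 2)) << 16);
}

inline std::uint64_t join64(const unsigned char* p)
{
    // Two 32-bit halves into one 64-bit value.
    return join32(p) | (static_cast<std::uint64_t>(join32(p + 4)) << 32);
}

// Usage sketch: the first byte ends up as the least significant one.
void demo_join64()
{
    const unsigned char data[8] = {0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08};
    assert(join64(data) == 0x0807060504030201ull);
}
```

In the question's setting, GetCRC would then reduce to return join64(data + 45).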
Well, maybe a better idea is to swap the order and cast?
typedef unsigned long long uint64;
unsigned char data[BUF_SIZE];
uint64 MyPacket::GetCRC()
{
    uint64 retval;
    unsigned char *rdata = reinterpret_cast<unsigned char*>(&retval);
    // Copies the bytes in reverse order; this gives the right value only
    // on a big-endian host, since the data itself is little-endian.
    for(unsigned i = 0; i < 8; ++i) rdata[i] = data[52-i];
    return retval;
}
Here's one of my own that is similar to the one provided by @x13n.
uint64 MyPacket::GetCRC()
{
    int offset=45;
    uint64 crc;
    memcpy(&crc, data+offset, 8); // needs <cstring>
    //std::reverse((char*)&crc, (char*)&crc + 8); // if this was a big endian machine
    return crc;
}
Nothing much wrong with what you have there to be honest. It will work and it's quick.
The thing I'd change would be to format it a little better for readability:
uint64 MyPacket::GetCRC()
{
    return (uint64) data[45] |
           (uint64) data[46] << 8 |
           (uint64) data[47] << 16 |
           (uint64) data[48] << 24 |
           (uint64) data[49] << 32 |
           (uint64) data[50] << 40 |
           (uint64) data[51] << 48 |
           (uint64) data[52] << 56;
}
I guess your other option would be to do it in a loop instead:
uint64 MyPacket::GetCRC()
{
    const int crcoffset = 45;
    uint64 crc = 0;
    for (int i = 0; i < 8; i++)
    {
        crc |= (uint64)data[i + crcoffset] << (i * 8);
    }
    return crc;
}
That would probably result in very similar assembly (as the compiler would probably unroll such a small loop), but it is a bit harder to grok in my opinion, so you are better off with what you have.