Read values from unsigned char bytestream in C++ - c++

My task is to read metadata values from an unsigned char array that contains the bytes of a binary .shp file (Shapefile):
unsigned char* bytes;
The header of the file which is stored in the array and the order of the information stored in it looks like this:
int32_t filecode // BigEndian
int32_t skip[5] // Uninteresting stuff
int32_t filelength // BigEndian
int32_t version // LittleEndian
int32_t shapetype // LittleEndian
// Rest of the header and of the filecontent which I don't need
So my question is: how can I extract this information (except the skipped part, of course), taking the endianness into account, and read it into the corresponding variables?
I thought about using ifstream, but I couldn't figure out how to do it properly.
Example:
Read the first four bytes of the binary, ensure big-endian byte order, and store them in an int32_t. Then skip 5 * 4 bytes (5 * int32_t). Then read four bytes, ensure big-endian byte order, and store them in an int32_t. Then read four bytes, ensure little-endian byte order, and again store them in an int32_t, and so on.
Thanks for your help guys!

'Reading' a byte array just means extracting the bytes from the positions in the array where you know your data is stored. Then you do the appropriate bit manipulations to handle the endianness. For example, filecode (big endian, offset 0) would be
filecode = (bytes[0] << 24) | (bytes[1] << 16) | (bytes[2] << 8) | bytes[3];
and version (little endian) would be
version = bytes[28] | (bytes[29] << 8) | (bytes[30] << 16) | (bytes[31] << 24);
(The offset of 28 follows from the layout you stated: 4 bytes of filecode, 5 * 4 skipped bytes and 4 bytes of filelength come before version.)
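Putting the idea together for the whole header, here is a minimal sketch. It assumes the layout from the question, which places filecode at byte 0, filelength at byte 24 (after 4 + 5 * 4 bytes), version at byte 28 and shapetype at byte 32. The names read_be32, read_le32, ShpHeader and parse_shp_header are made up for illustration. Each byte is widened to uint32_t before shifting, so a byte of 0x80 or above cannot overflow a signed int.

```cpp
#include <cstddef>
#include <cstdint>

// Assemble a 32-bit value from four bytes stored most significant byte first.
inline std::uint32_t read_be32(const unsigned char* p) {
    return (static_cast<std::uint32_t>(p[0]) << 24) |
           (static_cast<std::uint32_t>(p[1]) << 16) |
           (static_cast<std::uint32_t>(p[2]) << 8)  |
            static_cast<std::uint32_t>(p[3]);
}

// Assemble a 32-bit value from four bytes stored least significant byte first.
inline std::uint32_t read_le32(const unsigned char* p) {
    return  static_cast<std::uint32_t>(p[0])        |
           (static_cast<std::uint32_t>(p[1]) << 8)  |
           (static_cast<std::uint32_t>(p[2]) << 16) |
           (static_cast<std::uint32_t>(p[3]) << 24);
}

// Hypothetical struct holding only the fields the question asks for.
struct ShpHeader {
    std::int32_t filecode;
    std::int32_t filelength;
    std::int32_t version;
    std::int32_t shapetype;
};

// Offsets are derived from the layout in the question (assumption: no padding).
// Returns false if the buffer is too short to contain all four fields.
inline bool parse_shp_header(const unsigned char* bytes, std::size_t size,
                             ShpHeader& out) {
    if (bytes == nullptr || size < 36) return false;
    out.filecode   = static_cast<std::int32_t>(read_be32(bytes + 0));
    out.filelength = static_cast<std::int32_t>(read_be32(bytes + 24));
    out.version    = static_cast<std::int32_t>(read_le32(bytes + 28));
    out.shapetype  = static_cast<std::int32_t>(read_le32(bytes + 32));
    return true;
}
```

You would call it as parse_shp_header(bytes, size, header) and check the return value before using the fields.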

Related

How to read/write a sequence of bits bit by bit in C++

I have implemented the Huffman coding algorithm in C++, and it's working fine. I want to create a text compression algorithm.
Behind every file or piece of data in the digital world, there are 0s and 1s.
I want to persist the sequence of bits (0/1) generated by the Huffman encoding algorithm in a file.
My goal is to reduce the number of bits used to store the file. I'm storing the metadata for decoding in a separate file. I want to write data to a file bit by bit, and then read it back bit by bit in C++.
The problem I'm facing with binary mode is that it does not allow me to write data bit by bit.
I want to write "10101" to the file as five bits, but it writes the ASCII value of each character, 8 bits at a time.
code
#include <iostream>
#include <fstream>
using namespace std;
int main() {
    ofstream f;
    f.open("./one.bin", ios::out | ios::binary);
    f << "10101"; // writes five characters (five bytes), not five bits
    f.close();
    return 0;
}
output
Any help or pointers would be appreciated. Thank you.
"Binary mode" means only that you have requested that the actual bytes you write are not corrupted by end-of-line conversions. (This is only a problem on Windows. No other system has the need to deliberately corrupt your data.)
You are still writing a byte at a time in binary mode.
To write bits, you accumulate them in an integer. For convenience, in an unsigned integer. This is your bit buffer. You need to decide whether to accumulate them from the least to most or from the most to least significant positions. Once you have eight or more bits accumulated, you write out one byte to your file, and remove those eight bits from the buffer.
When you're done, if there are bits left in your buffer, you write out those last one to seven bits to one byte. You need to carefully consider how exactly you do that, and how to know how many bits there were, so that you can properly decode the bits on the other end.
The accumulation and extraction are done using the bit operations in your language. In C++ (and many other languages), those are & (and), | (or), >> (right shift), and << (left shift).
For example, to insert one bit, x, into your buffer, and later three bits in y, ending up with the earliest bits in the most significant positions:
unsigned buf = 0, bits = 0;
...
// some loop
{
    ...
    // write one bit (don't need the & if you know x is 0 or 1)
    buf = (buf << 1) | (x & 1);
    bits++;
    ...
    // write three bits
    buf = (buf << 3) | (y & 7);
    bits += 3;
    ...
    // write bytes from the buffer before it fills the integer length
    if (bits >= 8) { // the if could be a while if expect 16 or more
        // out is an ostream -- must be in binary mode if on Windows
        bits -= 8;
        out.put(buf >> bits);
    }
    ...
}
...
// write any leftover bits (it is assumed here that bits is in 0..7 --
// if not, first repeat if or while from above to clear out bytes)
if (bits) {
    out.put(buf << (8 - bits));
    bits = 0;
}
...
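For a fuller picture, here is a rough sketch of the same accumulate-and-flush idea wrapped in two small classes that write into a std::vector<uint8_t> rather than an ostream. BitWriter and BitReader are hypothetical names, not from any library. The earliest bits land in the most significant positions, and the total bit count is kept so that a decoder can tell how many bits of the last byte are padding.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper: packs bits MSB-first into a byte vector.
class BitWriter {
public:
    // Append the low `count` bits of `value`; count must be in 1..16.
    void write(std::uint32_t value, unsigned count) {
        if (count == 0 || count > 16) return;  // outside the supported range
        buf_ = (buf_ << count) | (value & ((1u << count) - 1u));
        bits_ += count;
        total_ += count;
        while (bits_ >= 8) {                   // flush every complete byte
            bits_ -= 8;
            bytes_.push_back(static_cast<std::uint8_t>((buf_ >> bits_) & 0xFFu));
        }
        buf_ &= (1u << bits_) - 1u;            // keep only the unflushed bits
    }
    // Pad the last partial byte with zero bits and return everything written.
    std::vector<std::uint8_t> finish() {
        if (bits_ > 0) {
            bytes_.push_back(static_cast<std::uint8_t>((buf_ << (8 - bits_)) & 0xFFu));
            bits_ = 0;
            buf_ = 0;
        }
        return bytes_;
    }
    // Number of meaningful bits written so far (store this with your metadata).
    std::size_t bit_count() const { return total_; }
private:
    std::vector<std::uint8_t> bytes_;
    std::uint32_t buf_ = 0;
    unsigned bits_ = 0;
    std::size_t total_ = 0;
};

// Hypothetical helper: reads the bits back in the same MSB-first order.
class BitReader {
public:
    BitReader(std::vector<std::uint8_t> bytes, std::size_t bit_count)
        : bytes_(std::move(bytes)), bit_count_(bit_count) {}
    // Returns 0 or 1, or -1 once all meaningful bits have been read.
    int read_bit() {
        if (pos_ >= bit_count_ || pos_ / 8 >= bytes_.size()) return -1;
        const unsigned shift = 7u - static_cast<unsigned>(pos_ % 8);
        const int bit = (bytes_[pos_ / 8] >> shift) & 1;
        ++pos_;
        return bit;
    }
private:
    std::vector<std::uint8_t> bytes_;
    std::size_t bit_count_;
    std::size_t pos_ = 0;
};
```

Writing the bits 1, 0, 1, 0, 1 one at a time and then calling finish() yields the single byte 0xA8 (10101000), with bit_count() reporting 5, so the decoder knows to ignore the three padding bits. The bytes themselves can then be written to a file opened in binary mode.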

how to store data from std::vector<short> in std::vector<uint8_t>

What I want to do is store the data in a std::vector<short> in a std::vector<uint8_t>, splitting each short into two uint8_t values. I need to do this because I have a network application that will only send std::vector<uint8_t>'s, so I need to convert to uint8_t to send and then convert back when I receive the uint8_t vector.
Normally what I would do (and what I saw when I looked up the problem) is:
std::vector<uint8_t> newVec(oldvec.begin(),oldvec.end());
However, if I understand correctly, this will take each individual short value, truncate it to the size of a uint8_t, and make a new vector with the same number of entries but half the amount of data, when what I want is the same amount of data with twice as many entries.
Solutions that include a way to reverse the process and that avoid copying as much as possible would help a lot. Thanks!
To split a value at the 8-bit boundary, you can use right shifts and masks, e.g.
uint16_t val; // the value to split
uint8_t low = val & 0xFF;
uint8_t high = (val >> 8) & 0xFF;
Now you can push high and low onto the second vector in the order you choose.
For splitting and merging, you would have the following:
unsigned short oldShort;
uint8_t char1 = oldShort & 0xFF; // lower byte
uint8_t char2 = oldShort >> 8; // upper byte
Then push the two parts onto the vector, and send it off to your network library. On the receiving end, during re-assembly, you would read the next two bytes off of the vector and combine them back into the short.
Note: Make sure that the received vector has an even number of elements; an odd count means the data was corrupted or truncated in transit.
// Read off the next two characters and merge them again
unsigned short mergedShort = (char2 << 8) | char1;
"I need to do this because I have a network application (1) that will only send std::vector<uint8_t>'s"
Besides masking and bit shifting, you should take endianness into account when sending data over the wire.
The network representation of data is usually big endian. So you can always put the MSB first. Provide a simple function like:
std::vector<uint8_t> networkSerialize(const std::vector<uint16_t>& input) {
    std::vector<uint8_t> output;
    // Pre-allocate for the sake of performance
    output.reserve(input.size() * sizeof(uint16_t));
    for(auto snumber : input) {
        output.push_back((snumber & 0xFF00) >> 8); // Extract the MSB
        output.push_back((snumber & 0xFF)); // Extract the LSB
    }
    return output;
}
and use it like
std::vector<uint8_t> newVec = networkSerialize(oldvec);
1) Emphasis mine
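The reverse direction, which the question also asks for, might look like the sketch below. networkDeserialize is a hypothetical counterpart to the networkSerialize function above and assumes the same MSB-first order. It rejects an odd number of bytes, since that cannot be a whole number of 16-bit values.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical counterpart to networkSerialize: rebuilds 16-bit values from
// a big-endian byte sequence (MSB first, then LSB).
std::vector<std::uint16_t> networkDeserialize(const std::vector<std::uint8_t>& input) {
    if (input.size() % 2 != 0) {
        throw std::invalid_argument("byte count must be even");
    }
    std::vector<std::uint16_t> output;
    output.reserve(input.size() / 2);
    for (std::size_t i = 0; i < input.size(); i += 2) {
        // input[i] is promoted to int, so the shift by 8 cannot overflow
        output.push_back(static_cast<std::uint16_t>((input[i] << 8) | input[i + 1]));
    }
    return output;
}
```

If the original data is a std::vector<short>, each element can be cast to uint16_t before sending and back to short after receiving.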
Disclaimer: People are talking about "network byte order". If you send anything larger than one byte, of course you need to take network endianness into account. However, as far as I understand, the limitation "network application that will only send std::vector<uint8_t>" explicitly states "I don't want to mess with any of that endianness stuff". A uint8_t is already a single byte, and if you send a sequence of bytes in one order, you should get them back in exactly the same order. This is helpful: "sending the array through a socket". The client and server machines may have different system endianness, but the OP said nothing about that, so that is a different story...
Regarding the answer:
Assuming all "endianness" questions are closed.
If you just want to send a vector of shorts, I believe VTT's answer will perform the best. However, if std::vector<short> is just one particular case, you can use the pack() function from my answer to a similar question. It packs any iterable container, string, C-string and more into a vector of bytes and does not perform any endianness shenanigans. Just include byte_pack.h and then you can use it like this:
#include <cstdint>
#include <iomanip>
#include <iostream>
#include <vector>
#include "byte_pack.h" // the answerer's own header, provides pack()
void cout_bytes(const std::vector<std::uint8_t>& bytes)
{
    for(unsigned byte : bytes) {
        std::cout << "0x" << std::setfill('0') << std::setw(2) << std::hex
                  << byte << " ";
    }
    std::cout << std::endl;
}
int main()
{
    std::vector<short> test = { (short) 0xaabb, (short) 0xccdd };
    std::vector<std::uint8_t> test_result = pack(test);
    cout_bytes(test_result); // -> 0xbb 0xaa 0xdd 0xcc (on a little-endian host)
    return 0;
}
Just copy everything in one go:
::std::vector<short> shorts;
// populate shorts...
::std::vector<uint8_t> bytes;
::std::size_t const bytes_count(shorts.size() * sizeof(short) / sizeof(uint8_t));
bytes.resize(bytes_count);
::memcpy(bytes.data(), shorts.data(), bytes_count);
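To go back the other way with the same bulk-copy approach, a sketch along these lines should work. shorts_to_bytes and bytes_to_shorts are illustrative names. Note that memcpy keeps the host's byte order, so this only round-trips correctly between machines with the same endianness and the same sizeof(short).

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Copy the raw bytes of the shorts into a byte vector (host byte order).
std::vector<std::uint8_t> shorts_to_bytes(const std::vector<short>& shorts) {
    std::vector<std::uint8_t> bytes(shorts.size() * sizeof(short));
    if (!bytes.empty()) {
        std::memcpy(bytes.data(), shorts.data(), bytes.size());
    }
    return bytes;
}

// Reverse step: copy the raw bytes back into shorts. Trailing bytes that do
// not fill a whole short are ignored.
std::vector<short> bytes_to_shorts(const std::vector<std::uint8_t>& bytes) {
    std::vector<short> shorts(bytes.size() / sizeof(short));
    if (!shorts.empty()) {
        std::memcpy(shorts.data(), bytes.data(), shorts.size() * sizeof(short));
    }
    return shorts;
}
```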

"Right" way to retrieve an int from a big-endian binary file in c++

I have a binary file in big-endian format from which I am retrieving 2-byte and 4-byte integer data. The machine I'm running on is little-endian.
Does anyone have any suggestions or a best-practice on pulling integer data from a known format binary and switching endianness on the fly? I'm not sure that my current solution is even correct:
int myInt;
ifstream dataFile(dataFileLocation, ios::in | ios::binary);
dataFile.seekg(99, ios::beg); //Pull data starting at byte 100;
//For 4-byte value:
char chunk[4];
dataFile.read(chunk, 4);
myInt = (int)(chunk[0] << 24 | chunk[1] << 16 | chunk[2] << 8 | chunk[3]);
//For 2-byte value:
char chunk[2];
dataFile.read(chunk, 2);
myInt = (int)(chunk[0] << 8 | chunk[1]);
This seems to work fine for 2-byte data but gives what I believe are incorrect values on 4-byte data. I've read about htonl() but from what I've read that's not a smart way to go for flexibility.
Use unsigned integral types only and you'll be fine:
unsigned char buf[4];
infile.read(reinterpret_cast<char*>(buf), 4);
unsigned int b4 = (buf[0] << 24) + ... + (buf[3]);
unsigned int b2 = (buf[0] << 8) + (buf[1]);
Shifting involves integer promotions, and a plain char may be signed (that is implementation-defined), so a byte of 0x80 or above gets sign-extended and corrupts the combined result. Basically you always want everything to be unsigned when manipulating bits.
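As a sketch of how this could look as reusable functions, the helpers below read from any std::istream (an ifstream opened in binary mode would work the same way). read_be_u32, read_be_u16 and demo_read are hypothetical names; demo_read uses an in-memory stream purely to stand in for the file. The bytes are read into an unsigned char buffer and widened before shifting, so values with the high bit set come out intact.

```cpp
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>

// Read a 4-byte big-endian unsigned value; returns false on a short read.
bool read_be_u32(std::istream& in, std::uint32_t& value) {
    unsigned char buf[4];
    if (!in.read(reinterpret_cast<char*>(buf), sizeof buf)) return false;
    value = (static_cast<std::uint32_t>(buf[0]) << 24) |
            (static_cast<std::uint32_t>(buf[1]) << 16) |
            (static_cast<std::uint32_t>(buf[2]) << 8)  |
             static_cast<std::uint32_t>(buf[3]);
    return true;
}

// Read a 2-byte big-endian unsigned value; returns false on a short read.
bool read_be_u16(std::istream& in, std::uint16_t& value) {
    unsigned char buf[2];
    if (!in.read(reinterpret_cast<char*>(buf), sizeof buf)) return false;
    value = static_cast<std::uint16_t>((buf[0] << 8) | buf[1]);
    return true;
}

// Usage sketch: six bytes in memory, read as one 4-byte and one 2-byte value.
bool demo_read(std::uint32_t& v4, std::uint16_t& v2) {
    const std::string raw("\x80\x00\x00\x01\xFF\xFE", 6);
    std::istringstream in(raw);
    return read_be_u32(in, v4) && read_be_u16(in, v2);
}
```

With the sample bytes in demo_read, v4 ends up as 0x80000001 and v2 as 0xFFFE. With a signed char buffer, the 0x80 and 0xFF bytes are exactly the ones that would be sign-extended.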

Deciphering unsigned char*

I have a process that listens to a UDP multicast broadcast and reads in the data as an unsigned char*.
I have a specification that indicates fields within this unsigned char*.
Fields are defined in the specification with a type and size.
Types are: uInt32, uInt64, unsigned int, and single byte string.
For the single byte string I can merely access the offset of the field in the unsigned char* and cast to a char, such as:
char character = (char)(data[1]);
For a single-byte uint32, I've been doing the following, which also seems to work:
uint32_t integer = (uint32_t)(data[20]);
However, for multiple byte conversions I seem to be stuck.
How would I convert several bytes in a row (substring of data) to its corresponding datatype?
Also, is it safe to wrap data in a string (for use of substring functionality)? I am worried about losing information, since I'd have to cast unsigned char* to char*, like:
std::string wrapper((char*)(data),length); //Is this safe?
I tried something like this:
std::string wrapper((char*)(data),length); //Is this safe?
uint32_t integer = (uint32_t)(wrapper.substr(20,4).c_str()); //4 byte int
But it doesn't work.
Thoughts?
Update
I've tried the suggested bit shift:
void function(const unsigned char* data, size_t data_len)
{
    // From specification: Field type: uInt32 Byte Length: 4
    // All integer fields are big endian.
    uint32_t integer = (data[0] << 24) | (data[1] << 16) | (data[2] << 8) | (data[3]);
}
Sadly, this gives me garbage (the same number on every call, from a callback).
I think you should be very explicit, and not just do "clever" tricks with casts and pointers. Instead, write a function like this:
uint32_t read_uint32_t(unsigned char **data)
{
    const unsigned char *get = *data;
    *data += 4;
    /* widen the first byte so the shift by 24 cannot overflow a signed int */
    return ((uint32_t)get[0] << 24) | (get[1] << 16) | (get[2] << 8) | get[3];
}
This extracts a single uint32_t value from a buffer of unsigned char, and increases the buffer pointer to point at the next byte of data in the buffer.
This assumes big-endian data; you need a well-defined idea of the buffer's byte order in order to interpret it.
It depends on the byte order of the protocol. For big-endian, the so-called network byte order, do:
uint32_t i = data[0] << 24 | data[1] << 16 | data[2] << 8 | data[3];
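Without commenting on whether it's a good idea or not, the reason it doesn't work for you is that wrapper.substr(20,4).c_str() returns a pointer (const char*), so casting it to uint32_t converts the address, not the four bytes it points to. You would have to cast to a pointer type and dereference it:
uint32_t integer = *(const uint32_t*)(wrapper.substr(20,4).c_str());
Even then the bytes are interpreted in host byte order, and the cast relies on type punning that is not strictly portable.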
uint32_t integer = ntohl(*reinterpret_cast<const uint32_t*>(data + 20));
or (handles alignment issues):
uint32_t integer;
memcpy(&integer, data+20, sizeof integer);
integer = ntohl(integer);
The pointer way:
uint32_t n = *(uint32_t*)&data[20];
You will run into problems on different endian architectures though. The solution with bit shifts is better and consistent.
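Since the specification mixes 4-byte and 8-byte fields at fixed offsets, the shift approach can be generalised. The sketch below is one possible shape, assuming all integer fields are big endian as stated in the question; read_be is a hypothetical helper that also refuses to read past the end of the buffer.

```cpp
#include <cstddef>
#include <cstdint>

// Read `width` bytes (1..8) starting at `offset` as a big-endian unsigned
// integer. Returns false instead of reading outside the buffer.
inline bool read_be(const unsigned char* data, std::size_t len,
                    std::size_t offset, std::size_t width, std::uint64_t& out) {
    if (data == nullptr || width == 0 || width > 8) return false;
    if (offset > len || len - offset < width) return false;
    std::uint64_t value = 0;
    for (std::size_t i = 0; i < width; ++i) {
        value = (value << 8) | data[offset + i];  // earlier bytes are more significant
    }
    out = value;
    return true;
}
```

For the 4-byte field at offset 20 this would be read_be(data, length, 20, 4, value), followed by a cast of value to uint32_t.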
std::string wrapper((char*)(data),length); //Is this safe?
This should be safe since you specified the length of the data.
On the other hand if you did this:
std::string wrapper((char*)data);
The string length would be determined wherever the first 0 byte occurs, and you will more than likely chop off some data.
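A small sketch of the difference, with wrap_bytes and byte_at as made-up helper names: constructing the string with an explicit length keeps embedded zero bytes, and casting back to unsigned char on access recovers the original byte values.

```cpp
#include <cstddef>
#include <string>

// Build a std::string from raw bytes using an explicit length, so that
// embedded zero bytes are preserved.
inline std::string wrap_bytes(const unsigned char* data, std::size_t length) {
    return std::string(reinterpret_cast<const char*>(data), length);
}

// Read a byte back as an unsigned value (plain char may be signed).
inline unsigned char byte_at(const std::string& s, std::size_t index) {
    return static_cast<unsigned char>(s[index]);
}
```

For the four bytes 0x41 0x00 0x42 0xFF, wrap_bytes gives a string of size 4, whereas the constructor without a length stops at the zero byte and gives a string of size 1.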

C/C++: read a byte from hex input on stdin

I can't find a way to do the following in C/C++.
Input: hexadecimal values, for example: ffffffffff...
I've tried the following code to read the input:
uint16_t twoBytes;
scanf("%x",&twoBytes);
That works fine, but how do I split the two bytes into one-byte uint8_t values (or maybe even read only the first byte)? I would like to read the first byte from the input and store it in a byte matrix at a position of my choosing.
uint8_t matrix[50][50]
Since I'm not very skilled at formatting or reading input in C/C++ (and have only used scanf so far), any other ideas on how to do this easily (and fast, if possible) are greatly appreciated.
Edit: I found an even better method: the fread function lets you specify how many bytes to read from the stream (stdin in this case) and stores them in a variable/array.
size_t fread ( void * ptr, size_t size, size_t count, FILE * stream );
Parameters
ptr - Pointer to a block of memory with a minimum size of (size*count) bytes.
size - Size in bytes of each element to be read.
count - Number of elements, each one with a size of size bytes.
stream - Pointer to a FILE object that specifies an input stream.
cplusplus ref
%x reads an unsigned int, not a uint16_t (though they may be the same on your particular platform).
To read only one byte, try this:
unsigned int byteTmp; // %x expects a pointer to unsigned int
scanf("%2x", &byteTmp);
uint8_t byte = byteTmp;
This reads an unsigned int, but stops after reading two characters (two hex characters equals eight bits, or one byte).
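Wrapped in a helper, the %2x idea might look like the sketch below. parse_hex_byte is a hypothetical name, and it uses sscanf on a string rather than scanf on stdin so that the result is easy to check; the temporary is an unsigned int because that is what %x expects.

```cpp
#include <cstdint>
#include <cstdio>

// Parse at most two hex digits from the start of `text` into one byte.
// Returns false if no hex digit could be read.
inline bool parse_hex_byte(const char* text, std::uint8_t& out) {
    unsigned int tmp = 0;  // %x requires a pointer to unsigned int
    if (std::sscanf(text, "%2x", &tmp) != 1) return false;
    out = static_cast<std::uint8_t>(tmp);  // at most two digits, so tmp <= 0xFF
    return true;
}
```

The parsed byte can then be stored wherever it is needed, for example matrix[row][col] = value.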
You should be able to split the variable like this:
uint8_t LowerByte=twoBytes & 0xFF;
uint8_t HigherByte=twoBytes >> 8;
A couple of thoughts:
1) Read it as characters and convert it manually (painful).
2) If you know that the number of hex digits is a multiple of 4, you can just read into twoBytes and then convert to one-byte values with high = twoBytes >> 8; low = twoBytes & 0xFF;
3) %2x
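Option 1 is less painful than it sounds. Here is a minimal sketch of the manual conversion, assuming the input has already been read into a std::string and that the character set is ASCII-compatible (so that 'a' to 'f' are contiguous). hex_value and hex_to_bytes are illustrative names.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Convert one hex digit to its value, or -1 if it is not a hex digit.
inline int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Decode pairs of hex digits into bytes. Returns false (leaving `out`
// untouched) on a non-hex character or an odd number of digits.
inline bool hex_to_bytes(const std::string& hex, std::vector<std::uint8_t>& out) {
    if (hex.size() % 2 != 0) return false;
    std::vector<std::uint8_t> result;
    result.reserve(hex.size() / 2);
    for (std::size_t i = 0; i < hex.size(); i += 2) {
        const int hi = hex_value(hex[i]);
        const int lo = hex_value(hex[i + 1]);
        if (hi < 0 || lo < 0) return false;
        result.push_back(static_cast<std::uint8_t>((hi << 4) | lo));
    }
    out = result;
    return true;
}
```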