Interpret strings as packed binary data in C++

I have a question about interpreting strings as packed binary data in C++. In Python, I can use the struct module. Is there a module or a way in C++ to interpret strings as packed binary data without embedding Python?

As already mentioned, it is better to treat this as an array of bytes (chars, or unsigned chars), possibly held in a std::vector, rather than a string. A C-style string is null terminated, so what happens if a byte of the binary data has the value zero?
You can either cast a pointer within the array to a pointer to your struct, or copy the data over a struct:
#include <cstring> // for memcpy

#pragma pack(push, 1)
struct myData
{
    int data1;
    int data2;
    // and whatever
};
#pragma pack(pop)
char* dataStream = GetTheStreamSomehow();
//cast the whole array
myData* ptr = reinterpret_cast<myData*>( dataStream );
//cast from a known position within the array
myData* ptr2 = reinterpret_cast<myData*>( &(dataStream[index]) );
//copy the array into a struct
myData data;
memcpy( &data, dataStream, sizeof(myData) );
If you were to have the data stream in a vector, the [] operator would still work. The pragma pack directives ensure the struct is single-byte aligned; researching this is left as an exercise for the reader. :-)
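For example, a minimal sketch of the memcpy route with the stream held in a vector (ReadTheStreamSomehow is a hypothetical stand-in, like GetTheStreamSomehow above):

#include <cstring>
#include <vector>

std::vector<unsigned char> ReadTheStreamSomehow(); // hypothetical source

void parse()
{
    std::vector<unsigned char> stream = ReadTheStreamSomehow();
    myData data;
    if (stream.size() >= sizeof(myData))
        std::memcpy(&data, &stream[0], sizeof(myData)); // copying sidesteps alignment concerns
}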

Basically, you don't need to interpret anything. In C++, strings are packed binary data; you can interpret them as text, but you're not required to. Just be aware that the underlying type of a string, in C++, is char, which can be either signed (range [-128,127] on all machines I've heard of) or unsigned (usually [0,255], but I'm aware of machines where it is [0,511]).

To pass the raw data in a string to a C function, use std::string::data() and std::string::size(). Otherwise, you can access it using iterators or indexing, much as you would with std::vector<char> (which may express the intent better).
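For instance, a minimal sketch (read_packet and write_bytes are hypothetical stand-ins for your data source and consumer):

#include <cstddef>
#include <string>

void write_bytes(const char* buf, std::size_t n); // hypothetical consumer of raw bytes
std::string read_packet();                        // hypothetical source of binary data

void pass_along()
{
    std::string s = read_packet();
    write_bytes(s.data(), s.size()); // embedded zero bytes are passed through intact
}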

A string in C++ has a method called c_str() (http://www.cplusplus.com/reference/string/string/c_str/).
c_str() returns the string's binary data in the form of a null-terminated array of characters. You can cast these chars to anything you wish and read them as an array of numbers.

Even though it might be closer to pickling in Python, Boost serialization may be closest to what you want to achieve.
Otherwise you might want to do it by hand. It is not that hard to write reader/writer classes that convert primitives/classes to a packed binary format. I would do it by shifting bytes, to avoid host endianness issues.
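A minimal sketch of that byte-shifting idea for one primitive type (the names here are illustrative, not an established library):

#include <cstdint>

// Writes v into buf in big-endian order, regardless of host endianness.
void write_u32(std::uint32_t v, unsigned char* buf)
{
    buf[0] = static_cast<unsigned char>(v >> 24);
    buf[1] = static_cast<unsigned char>(v >> 16);
    buf[2] = static_cast<unsigned char>(v >> 8);
    buf[3] = static_cast<unsigned char>(v);
}

// The matching reader reassembles the value the same way on any host.
std::uint32_t read_u32(const unsigned char* buf)
{
    return (std::uint32_t(buf[0]) << 24) | (std::uint32_t(buf[1]) << 16)
         | (std::uint32_t(buf[2]) << 8)  |  std::uint32_t(buf[3]);
}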

Related

cast to pointer from integer of different size when converting uint64_t to bytes

[EDIT] I wanted to write a uint64_t into a char* array in network byte order, to send it as a UDP datagram with sendto. A uint64_t has 8 bytes, so I convert them as follows:
void strcat_number(uint64_t v, char* datagram) {
    uint64_t net_order = htobe64(v);
    for (uint8_t i = 0; i < 8; ++i) {
        strcat(datagram, (const char*)((uint8_t*)&net_order)[i]);
    }
}
which gives me
warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
strcat(datagram, (const char*)((uint8_t*)&net_order)[i]);
How can I get rid of this warning, or perhaps do this number conversion in a simpler or clearer way?
((uint8_t*)&net_order)
this is a pointer to net_order casted to a uint8_t pointer
((uint8_t*)&net_order)[i]
this is the i-th byte of the underlying representation of net_order.
(const char*)((uint8_t*)&net_order)[i]
this is the same as above, but brutally cast to a const char*. The byte value is being converted to a pointer, which is what the compiler is warning you about; the result is an invalid pointer, and using it in any way will almost surely result in a crash.
Notice that, even if you somehow managed to make this kludge work, strcat is still the wrong function, as it deals with NUL-terminated strings, while here you are trying to put binary data inside your buffer, and binary data can naturally contain embedded NULs. strcat will append at the first NUL (and stop at the first NUL in the second parameter) instead of at the "real" end.
If you are building a buffer of binary data you have to use straight memcpy, and most importantly you cannot use string-related functions that rely on the final NUL to know where the string ends, but you have to keep track explicitly of how many bytes you used (i.e. the current position in the datagram).
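A hedged sketch of that approach, reusing the question's htobe64 (the offset parameter is an assumption standing in for however you track the current position in the datagram):

#include <endian.h>  // nonstandard (glibc); provides htobe64 as used in the question
#include <cstdint>
#include <cstring>

// Appends v at datagram + *offset in network byte order and advances *offset.
void append_u64(std::uint64_t v, char* datagram, std::size_t* offset)
{
    std::uint64_t net_order = htobe64(v);
    std::memcpy(datagram + *offset, &net_order, sizeof(net_order));
    *offset += sizeof(net_order);
}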

C++ string length in bytes

string str;
str = "hello";
str.length();
sizeof(str);
I see that str.length() returns the length in bytes. Why doesn't sizeof(str) return the same?
Is there an alternative in C++ to the C function strlen(str)? What is the alternative to this command in C++?
When I use Winsock, the send function needs the length in bytes. What should I use:
str.length(), sizeof(str), or something else? I see they produce different results.
sizeof returns the size of the data structure, not the size of the data it contains.
length() returns the length of the string that str contains, and is the function you want
It might seem confusing because sizeof(char[30]) is 30, but that is because the size of the data structure is 30, and will remain 30 no matter what you put in it
The string is actually an extremely complicated structure, but suppose it was a simple class with a pointer and a length
class string
{
    char *data;
    int length;
};
then sizeof(string) would return:
The size of a char * pointer, possibly but not necessarily 4
plus the size of an int, possibly but not necessarily 4
So you might get a value of 8. What the value of data or length is has no effect on the size of the structure.
sizeof() is not really meant to be used on a string class. The string class doesn't store ONLY the string data (its character data is no different from a C-style string's); it has other stuff in it as well, which throws off sizeof(). To get the actual length of the characters in the string, use str.length().
Don't use the C strlen() on a C++ string object. Don't use sizeof() either. Use .length().
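To make the difference concrete, a small sketch (the sizeof result is implementation-defined; 32 is just a typical value):

#include <iostream>
#include <string>

int main()
{
    std::string str = "hello";
    std::cout << str.length() << '\n'; // 5: the number of characters
    std::cout << sizeof(str) << '\n';  // size of the string object itself, e.g. 32
}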
A std::string in C++ refers to its character data through a pointer, since a string may have varying length. What sizeof() is returning is the fixed size of the string object itself, including that pointer (which on a 32-bit machine will probably be 4 bytes), not the characters it refers to.
Operator sizeof() returns the size of a given type or object in bytes. The 'type version' is quite simple to understand, but with the 'object version' you need to remember one thing:
sizeof() looks only at the type definition and deduces the total size from the size and number of its members (in general, polymorphic and multiply inherited types may have additional 'hidden' members).
In other words, let's assume we have:
struct A
{
    int* p1;
    char* p2;
};
As you can probably suspect, sizeof(A) will return 8 (as a pointer is a 4-byte type on most 32-bit systems). But when you do something like this:
A a_1;
a_1.p1 = new int[64];
sizeof(a_1) will still return 8. That's because the memory allocated by new and pointed to by A's member does not 'belong' to this object.
And that is why sizeof(str) and str.length() give different results. std::string allocates memory for its characters dynamically on the heap, so that memory doesn't contribute to the string object's size.
So, if you want to send a string via the network, the proper size is str.length(), and the data pointer can be retrieved by calling str.c_str().
I didn't understand the part about a "strlen(str) equivalent". In C++ there is also a strlen() function, with the same prototype, working in exactly the same way. It simply requires a const char*, so you cannot use it on a std::string directly (but you can do strlen(str.c_str()), as std::string's internal buffer is guaranteed to be null-terminated). For std::string, use .length() as you already did.
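Putting that together for the Winsock case, a minimal sketch (assumes an initialized Winsock and a connected SOCKET named sock):

// send() takes the byte count explicitly, so embedded zero bytes are fine.
int sent = send(sock, str.data(), static_cast<int>(str.length()), 0);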

void arithmetic in C++

I have a structure:
struct {
    Header header;
    uint32_t var1;
    uint32_t var2;
    char var3;
    char var4[4];
};
You get the idea. The thing is that I am receiving byte arrays over the network, and I have to parse the Header first, and then the rest of the structure.
I tried
void* V = data; // data is sizeof(uint32_t) * 2 + sizeof(char) * 5 bytes
and then tried to parse it like (V), V + sizeof(uint32_t), etc., but it gave compiler errors (you cannot do pointer arithmetic on a void*). How do I parse the rest of this struct from the network data?
The fundamental unit of data in C++ is char. It is the smallest type that can be addressed, and it has size one by definition. Moreover, the language rules specifically allow all data to be viewed as a sequence of chars. All I/O happens in terms of sequences (or streams) of chars.
Therefore, your raw data buffer should be a char array.
(On the other hand, a void* has very specific and limited use in C++; its main purpose is to designate an object's address in memory. For example, the result of operator new() is a void*.)
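A minimal sketch along those lines (the offsets assume the packed layout from the question, with the Header already consumed; the names are illustrative):

#include <cstdint>
#include <cstring>

// 'buffer' points just past the already-parsed Header.
void parse_body(const char* buffer)
{
    std::uint32_t var1, var2;
    char var3;
    char var4[4];
    std::size_t off = 0;
    std::memcpy(&var1, buffer + off, sizeof(var1)); off += sizeof(var1);
    std::memcpy(&var2, buffer + off, sizeof(var2)); off += sizeof(var2);
    var3 = buffer[off]; off += 1;
    std::memcpy(var4, buffer + off, sizeof(var4));
    // Convert var1/var2 from network byte order here if the protocol requires it.
}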

Reinterpret float vector as unsigned char array and back

I've searched and searched stackoverflow for the answer, but have not found what I needed.
I have a routine that takes an unsigned char array as a parameter in order to encode it as Base64. I would like to encode an STL float vector (vector<float>) in Base64, and therefore need to reinterpret the bytes in the float vector as an array of unsigned chars to pass to the encode routine. I have tried a number of things, from reinterpret and static casts to mem copies, etc., but none of them seem to work (at least not the way I implemented them).
Likewise, I'll need to do the exact opposite when decoding the encoded data back to a float array. The decode routine will provide the decoded data as an unsigned char array, and I will need to reinterpret that array of bytes, converting it to a float vector again.
Here is a stripped down version of my C++ code to do the encoding:
std::string
EncodeBase64FloatVector( const vector<float>& p_vector )
{
    const unsigned char* sourceArray;
    // SOMEHOW FILL THE sourceArray WITH THE FLOAT VECTOR DATA BITS!!
    char* target;
    size_t targetSize = p_vector.size() * sizeof(float);
    target = new char[ targetSize ];
    int result = EncodeBase64( sourceArray, targetSize, target, targetSize );
    string returnResult;
    if( result != -1 )
    {
        returnResult = target;
    }
    delete[] target;
    return returnResult;
}
Any help would be greatly appreciated. Thanks.
Raymond.
std::vector guarantees the data will be contiguous, and you can get a pointer to the first element in the vector by taking the address of the first element (assuming it's not empty).
typedef unsigned char byte;

std::vector<float> original_data;
...
if (!original_data.empty()) {
    const float *p_floats = &(original_data[0]); // parens for clarity
Now, to treat that as an array of unsigned char, you use a reinterpret_cast:
    const byte *p_bytes = reinterpret_cast<const byte *>(p_floats);
    // pass p_bytes to your base-64 encoder
}
You might want to encode the length of the vector before the rest of the data, in order to make it easier to decode them.
CAUTION: You still have to worry about endianness and representation details. This will only work if you read back on the same platform (or a compatible one) that you wrote with.
sourceArray = reinterpret_cast<const unsigned char *>(&(p_vector[0]));
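For the decode direction the question also mentions, a hedged sketch (assumes n is a multiple of sizeof(float) and the data was produced on a compatible platform, per the caution above):

#include <cstring>
#include <vector>

std::vector<float> DecodeToFloatVector(const unsigned char* bytes, std::size_t n)
{
    std::vector<float> result(n / sizeof(float));
    if (!result.empty())
        std::memcpy(&result[0], bytes, result.size() * sizeof(float));
    return result;
}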
I would highly recommend checking out Google's protobuf to solve your problem. Floats and doubles can vary in size and layout between platforms and that package has solved all those problems for you. Additionally, it can easily handle your data structure should it ever become more complicated than a simple array of floats.
If you do use that, you will have to do your own base64 encoding still as protobuf encodes data assuming you have an 8-bit clean channel to work with. But that's fairly trivial.

How to read in specific sizes and store data of an unknown type in c++?

I'm trying to read data in from a binary file and then store it in a data structure for later use. The issue is I don't want to have to identify exactly what type it is when I'm just reading it in and storing it; I just want to store the information regarding what type of data it is and how much data of this certain type there is (information easily obtained in the first couple of bytes).
But how can I read in just a certain amount of data, disregarding what type it is, and still easily be able to cast (or something similar) that data into a readable form later?
My first idea would be to use characters, since all the data I will be looking at will be in byte units.
But if I did something like this:
ifstream fileStream;
fileStream.open("fileName.tiff", ios::binary);
//if I had to read in 4 bytes of data
char memory[4];
fileStream.read((char *)&memory, 4);
But how could I cast these 4 bytes if later I wanted to read them back and knew they held a double?
What's the best way to read in data of an unknown type but know size for later use?
I think a reinterpret_cast will give you what you need. If you have a char * to the bytes you can do the following:
double * x = reinterpret_cast<double *>(dataPtr);
Check out Type Casting on cplusplus.com for a more detailed description of reinterpret_cast.
You could copy it into the known data structure, which makes life easier later on:
double x;
memcpy (&x,memory,sizeof(double));
or you could just refer to it as a cast value:
if (*((double*)(memory)) == 4.0) {
// blah blah blah
}
I believe a char* is the best way to read it in, since the size of a char is guaranteed to be 1 unit (not necessarily a byte, but all other data types are defined in terms of that unit, so that, if sizeof(double) == 27, you know that it will fit into a char[27]). So, if you have a known size, that's the easiest way to do it.
You could store the data in a class that provides functions to cast it to the possible result types, like this:
#include <cassert>
#include <cstring>

enum data_type {
    TYPE_DOUBLE,
    TYPE_INT
};

class data {
public:
    data_type type;
    size_t len;
    char *buffer;

    data(data_type a_type, char *a_buffer, size_t a_len)
        : type(a_type), len(a_len), buffer(NULL) {
        buffer = new char[a_len];
        memcpy(buffer, a_buffer, a_len);
    }

    ~data() {
        delete[] buffer;
    }

    double as_double() {
        assert(TYPE_DOUBLE == type);
        assert(len >= sizeof(double));
        return *reinterpret_cast<double*>(buffer);
    }

    int as_int() {...}
};
Later you would do something like this:
data d = ...;
switch (d.type) {
    case TYPE_DOUBLE:
        something(d.as_double());
        break;
    case TYPE_INT:
        something_else(d.as_int());
        break;
    ...
}
That's at least how I'm doing this kind of thing. :)
You can use structures and anonymous unions:
struct Variant
{
    size_t size;
    enum
    {
        TYPE_DOUBLE,
        TYPE_INT,
    } type;
    union
    {
        char raw[0]; // Copy to here. *
        double asDouble;
        int asInt;
    };
};
Optional: Create a table of type => size, so you can find the size given the type at runtime. This is only needed when reading.
static unsigned char typeSizes[2] =
{
    sizeof(double),
    sizeof(int),
};
Usage:
Variant v;
v.type = Variant::TYPE_DOUBLE;
v.size = typeSizes[v.type];
fileStream.read(v.raw, v.size);
printf("%f\n", v.asDouble);
You will probably receive warnings about type punning. Read: Doing this is not portable and against the standard! Then again, so is reinterpret_cast, C-style casting, etc.
Note: First edit, I did not read your original question. I only had the union, not the size or type part.
*This is a neat trick I learned a long time ago. Basically, raw doesn't take up any bytes (thus doesn't increase the size of the union), but provides a pointer to a position in the union (in this case, the beginning). It's very useful when describing file structures:
struct Bitmap
{
    // Header stuff.
    uint32_t dataSize;
    RGBPixel data[0];
};
Then you can just fread the data into a Bitmap. =]
Be careful. In most environments I'm aware of, doubles are 8 bytes, not 4; reinterpret_casting memory to a double will produce junk, based on whatever the four bytes following memory contain. If you want a 32-bit floating point value, you probably want a float (though I should note that the C++ standard does not require any particular representation for float and double; in particular, they need not be IEEE-754 compliant).
Also, your code will not be portable unless you take endianness into account in your code. I see that the TIFF format has an endianness marker in its first two bytes that should tell you whether you're reading in big-endian or little-endian values.
So I would write a function with the following prototype:
template<typename VALUE_TYPE> VALUE_TYPE convert(char* input);
If you want full portability, specialize the template and have it actually interpret the bits in input. Otherwise, you can probably get away with e.g.
template<typename VALUE_TYPE> VALUE_TYPE convert(char* input) {
    return *reinterpret_cast<VALUE_TYPE*>(input);
}
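Usage would then look something like this (fileStream and the raw read are assumptions matching the question's snippet):

char buffer[sizeof(double)];
fileStream.read(buffer, sizeof(buffer)); // raw bytes from the file
double d = convert<double>(buffer);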