I'm trying to implement deserialization where the mapping to field/member is only known at runtime (it's complicated). Anyway, what I'm trying to do is something like the following:
class A
{
public:
    int a;   // index 0
    float b; // index 1
    char c;  // index 2
};
Then I have two arrays, one with the index of the field and the other with something that indicates the type. I then want to iterate over the arrays and write to the fields from a byte stream.
Sorry for the crappy description but I just don't know how to implement it in code. Any ideas would be appreciated thanks!
Yes you can; there are two things you need to look out for when doing it, though.
First of all, make sure you start writing from (const char*)&A.a, because compilers may place implementation data that doesn't really concern you at the start of an object (Visual C++ puts the vtable pointer there, for instance), and you won't be writing what you think you are if you start from the address of the object.
Second, you might want to do a #pragma pack(1) before declaring any class that needs to be written to disk, because compilers usually align class members for faster memory access and you might end up having problems with this as well.
On the dynamic part of it: if making one class definition for each field combination you want to have is acceptable, then it's fine to do it like this; otherwise you'd be better off including a hash table in your class and serializing/deserializing its contents by writing key-value pairs to the file (see the sketch below).
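A minimal sketch of that key-value approach, assuming the values are plain 32-bit ints and using std::unordered_map with a std::ostream (the names and types here are illustrative, not taken from the question):

#include <cstdint>
#include <ostream>
#include <string>
#include <unordered_map>

// Write each pair as: key length, key bytes, value bytes.
void save_fields(std::ostream &out, const std::unordered_map<std::string, int32_t> &fields)
{
    for (const auto &kv : fields) {
        uint32_t len = static_cast<uint32_t>(kv.first.size());
        out.write(reinterpret_cast<const char *>(&len), sizeof len);
        out.write(kv.first.data(), len);
        out.write(reinterpret_cast<const char *>(&kv.second), sizeof kv.second);
    }
}

Reading it back is the mirror image: read the length, then the key, then the value.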
I can't think of a language construct that will give you a field's address given an index at runtime. If the "type" array actually included the field sizes, you would be able to do something like:
istream &in = <get it somehow>;
size_t *field_size = <get it somehow>;
size_t num_of_fields = <get it somehow>;

A a;
// assumes the fields are laid out contiguously (no padding between them)
char *ptr = reinterpret_cast<char *>(&a);
for (size_t i = 0; i < num_of_fields; i++)
{
    in.read(ptr, field_size[i]);
    ptr += field_size[i];
}
Note that this will only hold if your class is simple and doesn't have any virtual member functions (or inherit from a class that does). If that is not the case, you would do better to include a dummy member for getting to the byte offset where the fields start within the class:
class A
{
    int __dummy; /* must be the first data member in the class */
    <rest of your class definition here>
};
and now change the initialization of ptr as follows:
ptr = reinterpret_cast<char *>(&a) + offsetof(A, __dummy);
Another implicit assumption of this code is that the machine byte order is the same on the machine running this code and the machine from which the serialized data is received. If not, you will need to convert the byte ordering of the data read from the stream. This conversion is of course type dependent, but you could have another array of conversion functions, one per field.
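As a rough illustration of that last point, the per-field conversion could be a small table of function pointers applied as each field is read; this is only a sketch, and the byte-reversing helpers here stand in for whatever conversion routines you actually need:

#include <algorithm>
#include <cstddef>

// Hypothetical per-field byte-order converters, fixing each field in place.
void swap_bytes(char *p, std::size_t n) { std::reverse(p, p + n); }
void swap4(char *p) { swap_bytes(p, 4); }   // for the 4-byte int and float fields
void swap_none(char *) {}                   // single-byte fields need no conversion

typedef void (*field_converter)(char *);
field_converter convert[] = { swap4, swap4, swap_none }; // parallel to field_size[]

In the read loop above, calling convert[i](ptr) right after in.read(...) and before advancing ptr would then apply the right conversion for each field.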
There are a lot of issues and decisions involved. At the simplest, you could keep an offset into A per field, then switch on the type and assign through a pointer to the field. For example - assuming there's an int16_t encoding field numbers in the input stream, making no effort to use static_cast<> etc. where it would be a little nicer to do so, and assuming a 0 field number terminates the input...
A a;
char* p_a = (char*)&a;
char* p_v = (char*)&input_buffer;
...
while ((field_num = *(int16_t*)p_v) && (p_v += sizeof(int16_t)))
    switch (type[field_num])
    {
      case Int32:
        *(int32_t*)(p_a + offset[field_num]) = *(int32_t*)p_v;
        p_v += sizeof(int32_t);
        break;
      ...
    }
You may want to consider using e.g. ntohl() etc. to handle endianness conversions.
Let the compiler do it:
Write an operator>> function.
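For instance, a sketch of what that could look like for the A in the question, reading the members in their declared order from a byte stream (this only covers the fixed-order case, not the runtime mapping):

#include <istream>

std::istream& operator>>(std::istream& in, A& obj)
{
    // read each member's raw bytes straight from the stream
    in.read(reinterpret_cast<char*>(&obj.a), sizeof obj.a);
    in.read(reinterpret_cast<char*>(&obj.b), sizeof obj.b);
    in.read(reinterpret_cast<char*>(&obj.c), sizeof obj.c);
    return in;
}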
Related
I've been trying to learn a bit about reverse engineering and how to essentially wrap an existing class (that we do not have the source for, we'll call it PrivateClass) with our own class (we'll call it WrapperClass).
Right now I'm basically calling the constructor of PrivateClass while feeding a pointer to WrapperClass as the this argument...
Doing this populates m_string_address, m_somevalue1, m_somevalue2, and missingBytes with the PrivateClass object data. The dilemma now is that I am noticing issues with the original program (first a crash that was resolved by adding m_u1 and m_u2) and then text not rendering that was fixed by adding mData[2900].
I'm able to deduce that m_u1 and m_u2 hold the size of the string in m_string_address, but I wasn't expecting there to be any other member variables after them (which is why I was surprised with mData[2900] resolving the text rendering problem). 2900 is also just a random large value I threw in.
So my question is: how can we determine the real size of a class that we do not have the source for? Is there a tool that will tell you what variables exist in a class and their order (or at least the correct data types or data type sizes of each variable)? I'm assuming this might be possible by processing assembly in an address range into a semi-decompiled state.
class WrapperClass
{
public:
WrapperClass(const wchar_t* original);
private:
uintptr_t m_string_address;
int m_somevalue1;
int m_somevalue2;
char missingBytes[2900];
};
WrapperClass::WrapperClass(const wchar_t* original)
{
typedef void(__thiscall* PrivateClassCtor)(void* pThis, const wchar_t* original);
PrivateClassCtor PrivateClassCtorFunc = PrivateClassCtor(DLLBase + 0x1c00);
PrivateClassCtorFunc(this, original);
}
So my question is how can we determine the real size of a class that
we do not have the source for?
You have to guess or logically deduce it for yourself. Or just guess. If guessing doesn't work out for you, you'll have to guess again.
Is there a tool that will tell you what variables exist in a class and
their order (or atleast the correct datatypes or datatype sizes of
each variable) I'm assuming by decompiling and processing assembly in
an address range.
No, there is not. The kind of meta information that describes a class, its members, etc. simply isn't written out, as the program does not need it, nor are there currently any facilities defined in the C++ Standard that would require a compiler to generate that information.
There are exactly zero guarantees that you can reliably 'guess' the size of a class. You can however probably make a reasonable estimate in most cases.
The one thing you can be sure of, though: the only real problem is having too little memory for a class instance. Having too much memory isn't really a problem at all (which is why adding the 2900 extra bytes works).
On the assumption that the code was originally well written (e.g. the developer decided to initialise all the variables nicely), then you may be able to guess the size using something like this:
#define MAGIC 0xCD
// allocate a big buffer
char temp_buffer[8092];
memset(temp_buffer, MAGIC, 8092);
// call the ctor on the probe buffer rather than on a real object
PrivateClassCtor PrivateClassCtorFunc = PrivateClassCtor(DLLBase + 0x1c00);
PrivateClassCtorFunc(temp_buffer, original);
// step backwards until we find a byte that isn't 0xCD.
// Might want to change the magic value and run again
// just to be sure (e.g. the original ctor sets the last
// few bytes of the class to 0xCD by coincidence).
//
// Obviously fails if the developer never initialises member vars though!
for(int i = 8091; i >= 0; --i) {
if(temp_buffer[i] != MAGIC) {
printf("class size might be: %d\n", i + 1);
break;
}
}
That's probably a decent guess; however, the only way to be 100% sure would be to stick a breakpoint where you call the ctor, switch to assembly view in your debugger of choice, and then step through the assembly line by line to see what the highest address written to is.
I am not sure if this is possible at all in standard C++, so whether it is even possible to do could be a secondary way to put my question.
I have this binary data which I want to read and re-create using structs. This data is originally created as a stream, with the content appended to a buffer field by field; nothing special about that. I could simply read it as a stream, the same way it was written. Instead, I merely wanted to see whether it is possible to let the compiler do the math for me, by representing the binary data as a data structure.
The fields of the binary data have a predictable order which allows it to be represented as a data type, the issue I am having is with the depth and variable length of repeating fields. I am hoping the example code below makes it clearer.
Simple Example
struct Common {
int length;
};
struct Boo {
long member0;
char member1;
};
struct FooSimple : Common {
int count;
Boo boo_list[];
};
char buffer[1024];
int index = 15;
((FooSimple *)buffer)->boo_list[index].member0;
Advanced Example
struct Common {
int length;
};
struct Boo {
long member0;
char member1;
};
struct Goo {
int count;
Boo boo_list[];
};
struct FooAdvanced : Common {
int count;
Goo goo_list[];
};
char buffer[1024];
int index0 = 5, index1 = 15;
((FooAdvanced *)buffer)->goo_list[index0].boo_list[index1].member0;
The examples are not supposed to relate. I re-used some code due to lack of creativity for unique names.
For the simple example, there is nothing unusual about it. The Boo struct is of fixed size, therefore the compiler can do the calculations just fine, to reach the member0 field.
For the advanced example, as far as I can tell at least, it isn't as trivial a case. The problem I see is that if I use the array subscript operator to select a Goo object from the inline array of Goo elements (goo_list), the compiler will not be able to do the offset calculations properly unless it makes some assumptions; possibly assuming that all preceding Goo elements in the array have zero Boo elements in their inline array (boo_list), or some other constant value. Naturally, that won't be the case.
Question(s):
What ways are there to have the offset computations done by the compiler, despite the inline arrays having variable lengths? Unless I am missing something, I believe templates can't help at all, due to their compile-time nature.
Is this even possible to achieve in C++?
How do you handle the case of instantiating a FooAdvanced object, feeding a variable number of Goo and Boo element counts to the goo_list and boo_list members, respectively?
If it is impossible, would I have to write some sort of wrapper code to handle the calculations instead? (A rough sketch of what I mean follows below.)
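To illustrate what I mean by wrapper code, this is roughly what I imagine, where the offsets are walked at runtime instead of being computed by the compiler (purely a sketch; it reuses Common, Goo and Boo from the advanced example and assumes each record occupies exactly sizeof(...) bytes in the buffer with no padding between records):

#include <cstring>

// Find the index0-th Goo by walking past its predecessors, then return its index1-th Boo.
Boo read_boo(const char *buffer, int index0, int index1)
{
    const char *p = buffer + sizeof(Common) + sizeof(int); // skip length and count
    for (int g = 0; g < index0; ++g) {
        int boo_count;
        std::memcpy(&boo_count, p, sizeof boo_count);       // Goo::count
        p += sizeof(int) + boo_count * sizeof(Boo);          // skip this Goo and its Boos
    }
    p += sizeof(int) + index1 * sizeof(Boo);                 // skip Goo::count, index into boo_list
    Boo result;
    std::memcpy(&result, p, sizeof result);
    return result;
}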
I'm trying to store a 2D array of variable length c-strings into a struct so I can transmit and rebuild it over a network socket.
The plan is to have rows and cols, which are in the header of the packet, help me read the variable-size lens and arr that come after. I believe I must be writing the pointers incorrectly syntactically, or there's some kind of aux pointer I need to use when setting them into the struct.
struct STORAGE {
    int rows; // hdr
    int cols; // hdr
    int** lens;
    const char*** arr;
};
// code
int rows = 11;
int cols = 2;
int lens[rows][cols];
const char* arr[rows][cols];
// ... fill with strings ...
// ... along with lens ...
STORAGE store;
store.rows = rows;
store.cols = cols;
store.lens = lens;
store.arr = arr;
I get these errors when compiling this code:
error: invalid conversion from int to int** [-fpermissive]
error: cannot convert const char* [11][2] to `const char***' in assignment
I come from mostly a Java background, but I do understand how pointers work and such. The syntax of this one is just a little sideways for someone with my background (mostly write java/c++ and less c). Any suggestions?
Note: the reason why I'm not using more complex types like strings, maps, vectors, etc. is that I need to transmit the structure over the network (i.e. pointers to the heap won't work if they have variable sizes). It must be low-level arrays unless someone can offer a better solution.
It must be low-level arrays unless someone can offer a better solution.
A one-dimensional std::vector<int> or std::vector<uint8_t> already provides you with a low-level array allocated contiguously, accessible through its std::vector::data() member.
Any further dimensions you need can be determined by sectioning that data properly. For network transmission, you would need to send the necessary sectioning dimensions up front, and send the data afterwards.
Something like:
Transmit num_of_dimensions
Transmit dim_size_1, dim_size_2, dim_size_3, ...
Transmit data
Receive num_of_dimensions
Loop Receiving dimension sizes
Receive dim_size_1 * dim_size_2 * dim_size_3 * ... of data
What I'd probably use to handle such a situation is a class / struct looking like:
template<typename T>
class MultiDimensional {
    size_t num_dimensions_; // if known in advance, this can be made a template parameter as well
    std::vector<size_t> dimension_sizes_;
    std::vector<T> data_;
public:
    const T& indexing_accessor(...) const;
    T& indexing_accessor(...);
    // flatten dimensions + data into a byte buffer ready to transmit
    std::vector<uint8_t> render_transmission_data();
    // construct from received transmission data
    MultiDimensional(std::vector<uint8_t>& transmission_data);
};
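The accessor boils down to row-major index arithmetic over the flat data_ vector; a minimal sketch for the two-dimensional case (a free function here, to keep it independent of the exact accessor signature above):

#include <cstddef>
#include <vector>

// Row-major layout: all of row 0 first, then row 1, and so on.
template <typename T>
T& element_at(std::vector<T>& data, std::size_t row, std::size_t col, std::size_t num_cols)
{
    return data[row * num_cols + col];
}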
Using low-level stuff like arrays won't help you much; it's complicated enough already, and it can land you in a hell of compatibility problems (like having to think about byte order).
Unless you have very strict performance constraints, use a solution designed specifically for networking instead: protocol buffers. This is a bit of overkill for your case, but it scales well in case you need to add anything.
To use protocol buffers, you first define "messages" (structures) in a .proto file, then compile them to C++ with the protocol buffer compiler (protoc).
You define your message like this (this is a complete .proto file):
syntax = "proto2";
package test;
message Storage {
message Row {
repeated string foo = 1;
}
repeated Row row = 1;
}
There is no direct support for 2D arrays, but an array of arrays will do just fine (repeated means that there can be multiple values in a given field; it is basically a vector). You could add fields for the array sizes if you need quick access to them, but checking the size of the repeated fields should suffice in most practical cases.
What you get is a class that has all the fields you need, takes care of memory management, and has a bunch of methods to serialize and deserialize itself.
The C++ code gets a little longer in places, as you need to use getters and setters, but that should be well offset by the fact that you never need to think about serialization - it happens all by itself.
Example use of this thing in C++ could look like this:
#include "test.pb.h" // Generated from test.proto
using ::test::Storage;
int main() {
Storage s;
Storage::Row* row1 = s.add_row();
row1->add_foo("foo 0,0");
row1->add_foo("foo 0,1");
Storage::Row* row2 = s.add_row();
row2->add_foo("foo 1,0");
row2->add_foo("foo 1,1");
assert(s.row_size() == 2);
assert(s.row(0).foo_size() == 2);
s.PrintDebugString(); // prints to stdout
}
As a result, you get this output (note that this is debug output, not the real serialized format):
row {
  foo: "foo 0,0"
  foo: "foo 0,1"
}
row {
  foo: "foo 1,0"
  foo: "foo 1,1"
}
For completeness: in the above example the source files were test.proto and test.cpp, compiled using:
protoc --cpp_out=. test.proto
g++ test.cpp test.pb.cc -o test -lprotobuf
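For the actual network transmission, the generated class also offers binary serialization methods; roughly, the two ends would do something like this, continuing the example above (the socket I/O itself is omitted):

// Sender side: serialize the message into a byte string and ship it.
std::string wire_bytes;
s.SerializeToString(&wire_bytes);
// ... write wire_bytes.data(), wire_bytes.size() to the socket ...

// Receiver side: rebuild the message from the received bytes.
Storage received;
received.ParseFromString(wire_bytes);

Note that protobuf messages are not self-delimiting, so over a stream socket you would normally send the byte count before the message itself.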
I have an assignment which asks one to write a function for any data type. The function is supposed to print the bytes of the structure and identify the total number of bytes the data structure uses, along with differentiating between bytes used for members and bytes used for padding.
My immediate reaction, along with most of the class's, was to use templates. This allows you to write the function once and still know the type of the objects passed into the function; using memset and typeid one can easily accomplish what has been asked. However, our prof. just saw our discussion about templates and damned templates to hell.
After seeing this I was thrown for a loop and I'm looking for a little guidance as the best way to get around this. Some things I've looked into:
void pointers with explicit casting (this seems like it'd get messy)
a base class with virtual functions only, from which all data structures inherit (seems a bit odd to do).
a base class with 'friendships' to each of our data structures.
rewriting a function for each data structure in our problem set (what I imagine is the worst possible solution).
Was hoping I overlooked a common c++ tool, does anyone have any ideas?
Treat the function as being as stupid as possible; in fact, treat it as if it doesn't know anything and all information must be passed to it.
Parameters to the function:
Structure address, as a uint8_t *. (Needed to print the bytes)
Structure size, in bytes. (Needed to print the bytes and to print the
total size)
A vector of member information: member length OR the sum of the bytes used by the members.
The vector is needed to fulfill the requirement of printing the bytes used by the members and the bytes used by padding. Optionally you could pass the sum of the members.
Example:
void Analyze_Structure(uint8_t const * p_structure,
size_t size_of_structure,
size_t size_occupied_by_members);
The trick of this assignment is to figure out how to have the calling function determine these items.
Hope this helps.
Edit 1:
struct Apple
{
    char a;
    int weight;
    double protein_per_gram;
};

int main(void)
{
    Apple granny_smith;

    Analyze_Structure((uint8_t *) &granny_smith,
                      sizeof(Apple),
                      sizeof(granny_smith.a)
                      + sizeof(granny_smith.weight)
                      + sizeof(granny_smith.protein_per_gram));
    return 0;
}
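A possible body for Analyze_Structure under those assumptions, where padding is simply the difference between the structure size and the sum of the member sizes (a sketch, not the required solution):

#include <cstdint>
#include <cstdio>

void Analyze_Structure(uint8_t const * p_structure,
                       size_t size_of_structure,
                       size_t size_occupied_by_members)
{
    // dump every byte of the structure in hex
    for (size_t i = 0; i < size_of_structure; ++i)
        printf("%02X ", p_structure[i]);
    printf("\n");

    printf("total bytes:   %zu\n", size_of_structure);
    printf("member bytes:  %zu\n", size_occupied_by_members);
    printf("padding bytes: %zu\n", size_of_structure - size_occupied_by_members);
}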
I have assignment which asks one to write a function for any data type.
This means either templates (which your prof. dismissed), void*, or a variable number of arguments (similar to printf).
The function is supposed to print the bytes of the structure
#include <cstddef>
#include <cstdint>
#include <iostream>

void your_function(void* data, std::size_t size)
{
    std::uint8_t* bytes = reinterpret_cast<std::uint8_t*>(data);
    for (auto x = bytes; x != bytes + size; ++x)
        std::clog << "0x" << std::hex << static_cast<std::uint32_t>(*x) << " ";
}
[...] and identify the total number of bytes the data structure uses along with differentiating between bytes used for members and bytes used for padding.
On this one, I'm lost: the bytes used for padding are (by definition) not part of the structure. Consider:
struct x { char c; char d; char e; }; // sizeof(x) == 3;
x instance{ 0, 0, 0 };
your_function(&instance, sizeof(x)); // passes 3, not 4 (4 for 32bits architecture)
Theoretically, you could also pass alignof(x) to the function, but that won't tell you the alignment of the individual fields in memory (as far as I know that is not standardized, but I may be wrong).
There are a few possibilities here:
Your prof. learned "hacky" C++ that was considered good code 10 or 20 years ago and didn't update his knowledge (C-style code, pointers, direct memory access and "smart hacks" are all in here).
He didn't know how to express exactly what he wanted or the terminology to use ("write a function for any data type" is too vague: as a developer, if I got this assignment, the first thing to do would be to ask for details - like "how will it be used?" and "what is the expected function signature").
For example, this could be achieved - to a degree - with macros (see the sketch after this list), but if he wants you to use macros in place of functions and templates, you should probably contemplate changing professors.
He meant that you should write some arbitrary data type (like my struct x above) and define your API around that (unlikely).
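To illustrate the macro route mentioned above: all a macro can really do here is capture the address and size at the call site and forward them to a void* function such as the your_function above (the macro name is made up):

// Hypothetical stand-in for a template: expands at the call site, so sizeof sees the real type.
#define PRINT_BYTES(obj) your_function(&(obj), sizeof(obj))

// usage:
//   x instance{ 0, 0, 0 };
//   PRINT_BYTES(instance);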
I am not sure that such a function can be built without a minimum of introspection: you need to know what the struct members are, otherwise you only have access to the size of the struct.
Anyway, here is my proposal for a solution that should work without introspection, provided the user of the code "cooperates".
Your functions will take as arguments void* and size_t for the address and sizeof of the struct.
0) let the user create a struct of the desired type.
1) let the user call a function of yours that sets all bytes to 0.
2) let the user assign a value to every field of the struct.
3) let the user call a function of yours that keeps a record of every byte that is still 0.
4) let the user call a function of yours that sets all bytes to 1.
5) let the user assign a value to every field of the struct again. (Same values as the first time!)
6) let the user call a function of yours and count the bytes that are still 1 AND were marked before. These are padding bytes.
The reason to try with the values 0 and then 1 is that the values assigned by the user could include zero bytes; but a byte can't be 0 and 1 at the same time, so one of the two tests will exclude them.
struct _S { int I; char C; } S;

Fill0(&S, sizeof(S));
// User cooperation
S.I = 0;
S.C = '\0';
Mark0(&S, sizeof(S)); // keeps a record in some form of static storage
Fill1(&S, sizeof(S));
// User cooperation
S.I = 0;
S.C = '\0';
DetectPadding(&S, sizeof(S));
You can pack all of this in a single function that takes a callback function argument that does the member assignments.
void Assign(void* pS) // User-written callback
{
    struct _S& S = *(struct _S*)pS;
    S.I = 0;
    S.C = '\0';
}
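The helper functions themselves could look roughly like this; the fixed-size static mark buffer is an assumption of this sketch:

#include <cstdio>
#include <cstring>

static unsigned char marks[1024]; // assumed big enough for any struct we probe

void Fill0(void* p, size_t n) { memset(p, 0, n); }
void Fill1(void* p, size_t n) { memset(p, 1, n); }

void Mark0(const void* p, size_t n)
{
    const unsigned char* bytes = (const unsigned char*)p;
    for (size_t i = 0; i < n; ++i)
        marks[i] = (bytes[i] == 0); // candidate: padding, or a member byte that happens to be 0
}

void DetectPadding(const void* p, size_t n)
{
    const unsigned char* bytes = (const unsigned char*)p;
    for (size_t i = 0; i < n; ++i)
        if (marks[i] && bytes[i] == 1) // still 1 after round two AND marked before: padding
            printf("byte %zu is padding\n", i);
}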
I have the following declaration in a file that gets generated by a Perl script (during compilation):
struct _gamedata
{
short res_count;
struct
{
void * resptr;
short id;
short type;
} res_table[3];
}
_gamecoderes =
{
3,
{
{ &char_resource_ID_RES_welcome_object_ID,1002, 1001 },
{ &blah_resource_ID_RES_another_object_ID,1004, 1003 },
{ &char_resource_ID_RES_someting_object_ID,8019, 1001 },
}
};
My problem is that struct _gamedata is generated during compile time and the number of items in res_table will vary. So I can't provide a type declaring the size of res_table in advance.
I need to parse an instance of this structure. Originally I was doing this via a pointer to char (and not defining struct _gamedata as a type, though I am defining res_table).
e.g.
char * pb = (char *)_gamecoderes;
// i.e. pb points to the instance of `struct _gamedata`.
short res_count = (short *)pb;
pb+=2;
res_table * entry = (res_table *)pb;
for( int i = 0; i < res_count; i++ )
{
do_something_with_entry(*entry);
}
I'm getting weird results with this. I'm not sure how to declare the type struct _gamedata, as I need to be able to handle a variable length for res_table at compile time.
Since the struct is anonymous, there's no way to refer to the type of this struct. (res_table is just the member name, not the type's name). You should provide a name for the struct:
struct GameResult {
    void* resptr;
    short id;
    short type;
};

struct _gamedata {
    short res_count;
    GameResult res_table[3];
};
Also, you shouldn't cast the data to a char*. The res_count and the entries can be extracted using the -> operator. This way the member offsets are computed correctly.
_gamedata* data = ...;
short res_count = data->res_count;
GameResult* entry = data->res_table;
or simply:
_gamedata* data = ...;
for (int i = 0; i < data->res_count; ++ i)
do_something_with_entry(data->res_table[i]);
Your problem is alignment. There will be at least two bytes of padding in between res_count and res_table, so you cannot simply add two to pb. The correct way to get a pointer to res_table is:
res_table *table = &data->res_table;
If you insist on casting to char* and back, you must use offsetof:
#include <stddef.h>
...
res_table *table = (res_table *) (pb + offsetof(_gamedata, res_table));
Note: in C++ you may not use offsetof with "non-POD" data types (approximately "types you could not have declared in plain C"). The correct idiom -- without casting to char* and back -- works either way.
Ideally use memcpy(3), at least use type _gamedata, or define a protocol
We can consider two use cases. In what I might call the programmer-API type, serialization is an internal convenience and the record format is determined by the compiler and library. In the more formally defined and bulletproof implementation, a protocol is defined and a special-purpose library is written to portably read and write a stream.
The best practice will differ depending on whether it makes sense to create a versioned protocol and develop stream I/O operations.
API
The best and most completely portable implementation when reading from compiler-object serialized streams would be to declare or dynamically allocate an exact or max-sized _gamedata and then use memcpy(3) to pull the data out of the serial stream or device memory or whatever it is. This lets the compiler allocate the object that is accessed by compiler code, and it lets the developer allocate the object that is accessed by developer (i.e., char *) logic.
But at a minimum, set a pointer to _gamedata and the compiler will do everything for you. Note also that res_table[n] will always be at the "right" address regardless of the size of the res_table[] array. It's not like making it bigger changes the location of the first element.
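A rough sketch of that max-sized-object approach (the element cap and function shape here are assumptions, and it presumes the stream was produced with the same member layout and pointer size):

#include <cstring>

enum { MAX_RES = 64 }; // assumed upper bound on res_table entries

struct gamedata_max {
    short res_count;
    struct {
        void * resptr;
        short id;
        short type;
    } res_table[MAX_RES];
};

void read_gamedata(const char *stream_bytes, size_t stream_len)
{
    gamedata_max data;                      // compiler-allocated, correctly aligned
    size_t n = stream_len < sizeof data ? stream_len : sizeof data;
    std::memcpy(&data, stream_bytes, n);    // pull the serialized bytes into it
    for (int i = 0; i < data.res_count && i < MAX_RES; ++i) {
        // data.res_table[i].resptr, .id and .type are now reachable via ordinary member access
    }
}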
General serialization best practice
If the _gamedata object itself is in a buffer and potentially misaligned, i.e., if it is anything other than an object allocated for a _gamedata type by the compiler or dynamically by a real allocator, then you still have potential alignment issues and the only correct solution is to memcpy(3) each discrete type out of the buffer.
A typical error is to use the misaligned pointer anyway, because it works (slowly) on x86. But it may not work on mobile devices, or future architectures, or on some architectures when in kernel mode, or with advanced optimizations enabled. It's best to stick with real C99.
It's a protocol
Finally, when serializing binary data in any fashion you are really defining a protocol. So, for maximum robustness, don't let the compiler define your protocol. Since you are in C, you can generally handle each fundamental object discretely with no loss in speed. If both the writer and reader do it, then only the developers have to agree on the protocol, not the developers and the compilers and the build team, and the C99 authors, and Dennis M. Ritchie, and probably some others.
As #Zack points out, there is padding between elements of your structure.
I'm assuming you have a char* because you've serialized the structure (in a cache, on disk, or over the network). Just because you are starting with a char * doesn't mean you have to access the entire struct the hard way. Cast it to a typed pointer, and let the compiler do the work for you:
_gamedata * data = (_gamedata *) my_char_pointer;
for( int i = 0; i < data->res_count; i++ )
{
    do_something_with_entry(data->res_table[i]);
}