Datastructure Storage in Filesystem

I am trying to write a persistent data structure in C++. However, I feel that I should be able to make it binary compatible with other implementations that read my data structure, so my current idea is to declare the data structure in native memory without any abstraction.
For example, I would specify a linear block of memory as the data structure (allocated with the new keyword) and then describe what the first byte means, what the second byte means and so on. I know I can do this using a struct, but then the data structure would be bound to one language and other languages would have to use this structure. Also, the layout might then change from compiler to compiler. I would instead like it to be a memory standard.
Is what I am trying to do sensible? Or am I trying to over-simplify things, and should I really proceed with a struct? Now onto the C++ part: if you believe that I should be using a struct, then what are the disadvantages of using a full-fledged class?
(I am using a class anyway to wrap around the memory structure and provide functions to it since the datastructure is anyway persistent.)
EDIT
As Justin has suggested, I do not need any such advanced interface wrapper around the memory structure, so my last point about the class wrapper was not stated properly. What I mean is that I would like to have a class interface for the memory representation; it does not necessarily have to be a wrapper.

Several file formats I have read/worked with do exactly that -- define a memory standard or layout, then typically back it up with a demonstration in C-like pseudo-structure. Sometimes they will provide struct or class representations, and some are completely abstracted by a library. Of course, these formats go on to document all fields, their sizes, the endianness of the data and so on.
I figure endian related issues, padding, complexity (e.g. introduced by variations in the data structures) and proper versioning are the biggest sources of errors. Another issue I find is the use of data structures of yesteryear and inconsistency of data structures used to represent similar functionalities -- You may receive a spec, and realize it contains several different string representations -- all of which are archaic, and somebody has to go on to support all of these (bidirectionally).
Proceeding that route:
You should not commit to a binary representation (or compilable program) if you don't want to support it (and attempts at long-lived formats fail or stumble along the way, as platforms and toolsets change). Just commit to a formal memory standard at first, then build on top of that with tests and example input files to verify the representation is serialized and deserialized correctly. A very basic test suite will help ensure your model is portable on all the systems you need, and can point out potential pitfalls or platform-specific considerations you may not have been aware of.
If you really want to provide a compilable representation, I'd stick with a very compliant struct representation -- clients can take that (in memory) representation and turn it into any C++ abstraction/representation they like. That is to say, a serialized representation should probably not reflect that of a representation in memory, apart from trivially simple representations and the intermediate storage of such a representation (flattened and packed structs).
One of the important parts is that you should have tests which confirm that the in-memory object graph you create with these structs is forward and backward serializable and deserializable, and supports proper versioning -- so it often takes a bit of work to make a complex serialized representation compatible. So you see this approach just introduces one abstraction layer on top of another. In this regard, you may want to give the C++ abstraction the ability to create itself from the packed in-memory representation, and to ensure that it can also correctly populate the packed structure without data loss.
Beyond that, is there any need to have a more advanced interface? If there is, then you may want to provide that information.
So yes, the memory standard is the part that you must get correct and stable, and the one all implementations should refer to and test against -- regardless of platform/architecture differences. IOW, you're on the right track ;)

In C++ there's no practical difference between struct and class (besides the default accessibility being public in struct). Traditionally, struct is used when a type only has (public) member variables and no member functions but this is only a convention, not a rule enforced by the compiler.
I'd certainly use a struct/class to describe the data. If someone wants to write a reader of your data structure, they can either import your header file or implement the data structure in their language of choice - in most programming languages this should be pretty simple.
I recommend you start your structure something like this:
typedef struct
{
int Version; // struct layout version
int ByteSize; // byte size of structure for validation
...
} MYDATA;
This way when your data structure is being passed around, your code can verify that the allocated structure size matches with how many bytes you'd expect for a given version of your structure. You could then easily introduce new versions of your structure by simply updating the version field and checking for the new size.
When you save your data to disk, make sure that you write it out field-by-field rather than through a single write (using a pointer and sizeof()), so that other languages won't have to deal with potential padding that your C++ compiler may decide to put in. It's possible to manually lay out fields in the structure so that there's no padding, but you have to be very, very careful while doing that and it's easy to make mistakes.
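As a minimal sketch of the field-by-field approach, assuming a format that fixes both header fields as 32-bit little-endian values (the names Header, write_header and read_header are made up for illustration, not part of any existing format):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical header: two 32-bit fields, stored little-endian with no padding.
struct Header {
    std::uint32_t version;   // struct layout version
    std::uint32_t byte_size; // total serialized size in bytes, for validation
};

// The on-disk size is fixed by the format (4 + 4), not by sizeof(Header).
constexpr std::size_t kHeaderBytes = 8;

// Append one 32-bit value, least significant byte first.
void put_u32_le(std::vector<std::uint8_t>& out, std::uint32_t v) {
    for (int shift = 0; shift < 32; shift += 8)
        out.push_back(static_cast<std::uint8_t>((v >> shift) & 0xFFu));
}

// Read one 32-bit little-endian value starting at pos.
std::uint32_t get_u32_le(const std::vector<std::uint8_t>& in, std::size_t pos) {
    assert(pos + 4 <= in.size());
    std::uint32_t v = 0;
    for (int i = 0; i < 4; ++i)
        v |= static_cast<std::uint32_t>(in[pos + i]) << (8 * i);
    return v;
}

// Serialize each field explicitly; compiler padding never reaches the output.
std::vector<std::uint8_t> write_header(const Header& h) {
    std::vector<std::uint8_t> out;
    put_u32_le(out, h.version);
    put_u32_le(out, h.byte_size);
    return out;
}

// Returns false if the buffer is too short or the size field
// disagrees with the number of bytes actually present.
bool read_header(const std::vector<std::uint8_t>& in, Header& h) {
    if (in.size() < kHeaderBytes) return false;
    h.version = get_u32_le(in, 0);
    h.byte_size = get_u32_le(in, 4);
    return h.byte_size == in.size();
}
```

Because the byte order is spelled out with shifts, the same bytes come out on big-endian and little-endian hosts, and a reader in another language only needs the field list, not the C++ struct.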

Related

What's the purpose of VkStructureType? [duplicate]

In all of the create info structs (vk*CreateInfo) in the new Vulkan API, there is ALWAYS a .sType member. Why is this there if the value can only be one thing? Also, the Vulkan specification is very explicit that you can only use vk*CreateInfo structs as parameters for their corresponding vkCreate* function. It seems a little redundant. I can see that if the driver was passing this struct straight to the GPU, you might need to have it (I did notice it is always the first member). But it seems like a really bad idea to make the app do it: if the driver did it, apps would be much less error prone, and prepending an int to a struct doesn't seem like a computationally expensive operation. I just don't see why it exists.
TL;DR
Why do the vk*CreateInfo structs have the .sType member?

They have one so that the pNext field actually works.
Yes, the API takes a struct with a proper C type, so both the caller and the receiver agree on what type that struct is. But especially nowadays, many such structs have linked lists of structures that provide additional information to the implementation. These extension structures (though many are core in Vulkan 1.1/1.2) are just like all other structures, with their own sType field.
These fields are crucial because the linked lists are built with pNext pointers... which are void*s. They have no set type. The way the implementation determines what a non-NULL pNext pointer points to is by examining the first 4 bytes stored there. This is the sType field; it allows the implementation to know what type to cast the pointer to.
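The lookup described above can be sketched as follows. The types here (StructType, BaseIn, PoolCreateInfo, DebugNameInfo) are simplified stand-ins invented for this illustration, not the real Vulkan declarations; the only thing assumed is the convention that every chainable struct starts with a type tag followed by a next pointer.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical tags; the real API defines its own enumeration.
enum class StructType : std::uint32_t { PoolCreateInfo = 1, DebugNameInfo = 2 };

// Common prefix shared by every chainable struct.
struct BaseIn {
    StructType sType;
    const void* pNext;
};

struct PoolCreateInfo {
    StructType sType;
    const void* pNext;
    std::uint32_t queueFamilyIndex;
};

struct DebugNameInfo {
    StructType sType;
    const void* pNext;
    const char* name;
};

// Walk the chain and return the first struct tagged `wanted`, or nullptr.
// The prefix is copied out with memcpy so that the untyped pointer is never
// dereferenced through a struct type it does not actually point to.
const void* find_in_chain(const void* head, StructType wanted) {
    const void* p = head;
    while (p != nullptr) {
        BaseIn prefix;
        std::memcpy(&prefix, p, sizeof prefix);
        if (prefix.sType == wanted) return p;
        p = prefix.pNext;
    }
    return nullptr;
}
```

A caller would then cast the returned pointer to the struct type that matches the tag it asked for, which is exactly the decision the sType field makes possible.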
Of course, the primary struct that an API takes doesn't strictly need an sType field, since its type is part of the API itself. However, there is a hypothetical reason to do so (it hasn't panned out in Vulkan releases).
A later version of Vulkan could expand on the creation of, for example, command buffer pools. But how would it do that? Well, they could add a whole new entrypoint: vkCreateCommandPool2. But this function would have almost the exact same signature as vkCreateCommandPool; the only difference is that they take different pCreateInfo structures.
So instead, all you have to do is declare a VkCommandPoolCreateInfo2 structure. And then declare that vkCreateCommandPool can take either one. How would the implementation tell which one you passed in?
Because the first 4 bytes of any such structure are the sType field. The implementation can test that value. If the value is VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO, then it's the old structure. If it's VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO_2, then it's the new one.
Of course, as previously stated, this hasn't panned out; post-1.0 Vulkan versions opted to incorporate extension structs rather than replacing existing ones. But the option is there.

c++ alignment using pragma pack and inheritance

I am not too familiar with the concept of packing/alignment in C++. I did some reading about this recently and have a question.
I am deriving from a base class (written by somebody else; I have the header for it). The author of this class has used pragma pack to align members to a 1-byte boundary. However, I am not sure whether the derived class needs to do the same. What are the consequences of packing or not packing the derived class with the same alignment as the base class?
Any help/suggestions will be greatly appreciated.
Thanks

In everyday, well-written C++ code it doesn't normally matter if there's padding or not, though the choice may impact performance. So, you should be able to derive from that base class without worrying about explicitly specifying any packing yourself. That said, the base class may be packed because there'll be a massive number of instances in memory or bitwise-copied to a file or network stream, in which case you'll want to consider whether instances of your new class may end up mixed in with that data, and whether you also want to use packing for the extra data members for the same reasons.
Not all code is well-written though. For example, if the program treats the objects as binary blobs of data and uses functions like memcmp on them, or does a byte-wise void*/size checksum, then garbage data in padding members may break the logic/behaviour. If the data is written object by object with particular separator or delimiter characters, then embedded garbage may inject unwanted separators/delimiters and break the reading/parsing logic. There's no way to assess these risks without doing an impact study on the existing code.
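A small sketch of the points above, assuming a compiler that supports #pragma pack (GCC, Clang and MSVC do). PackedBase, Derived and Padded are hypothetical types made up for the illustration:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

#pragma pack(push, 1)
struct PackedBase {            // members aligned to 1 byte: no padding
    std::uint8_t  tag;
    std::uint32_t value;
};
#pragma pack(pop)

// Defined outside the pragma region, so its own member uses default alignment.
struct Derived : PackedBase {
    std::uint32_t extra;
};

struct Padded {                // same members as PackedBase, default alignment
    std::uint8_t  tag;
    std::uint32_t value;
};

static_assert(sizeof(PackedBase) == 5, "1 + 4 bytes, nothing in between");
static_assert(sizeof(Derived) >= sizeof(PackedBase) + sizeof(std::uint32_t),
              "the derived part may add padding of its own");
static_assert(sizeof(Padded) >= sizeof(PackedBase), "padding can only add bytes");

// Simulate an object built on top of arbitrary old memory contents.
void fill_then_assign(Padded& p, unsigned char fill,
                      std::uint8_t tag, std::uint32_t value) {
    std::memset(&p, fill, sizeof p);
    p.tag = tag;
    p.value = value;
}

// Member-wise comparison never looks at padding bytes, unlike memcmp.
bool same_fields(const Padded& a, const Padded& b) {
    return a.tag == b.tag && a.value == b.value;
}
```

Two Padded objects filled with different background bytes and then given equal members compare equal through same_fields, while a memcmp over the whole object may report a difference wherever the compiler left padding untouched. That is the kind of hidden dependency an impact study on the existing code would need to look for.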

C++ way to serialize a message?

In my current project I have a few different interfaces that require me to serialize messages into byte buffers. I feel like I'm probably not doing it in a way that would make a true C++ programmer happy (and I'd like to).
I would typically do something like this:
struct MyStruct {
uint32_t x;
uint64_t y;
uint8_t z[80];
};
uint8_t* serialize(const MyStruct& s) {
uint8_t* buffer = new uint8_t[sizeof(s)];
uint8_t* temp = buffer;
memcpy(temp, &s.x, sizeof(s.x));
temp += sizeof(s.x);
//would also have put in network byte order...
... etc ...
return buffer;
}
Excuse any typos, that was just an example off the top of my head. Obviously it can get more complex if the structure I'm serializing has internal pointers.
So, I have two questions that are closely related:
Is there any problem in the specific scenario above with serializing by casting the struct directly to a char buffer assuming I know that the destination systems are in the same endianness?
Main question: Is there a better... erm... C++? way to do this aside from the addition of smart pointers? I feel like this is such a common problem that the STL probably handles it - and if it doesn't, I'm sure there's a better way to do it anyway using C++ mechanisms.
EDIT Bonus points if you can do a clean example of serializing this structure in a better way using standard C++/STL without added libraries.

You might want to take a look at Google Protocol Buffers (also known as protobuf). You define your data in a language neutral IDL and then run it through a generator to generate your C++ classes. It'll take care of byte ordering issues and can provide a very compact binary form.
By using this you'll not only be able to save your C++ data but it'll be useable in other languages (C#, Java, Python etc) as there are protobuf implementation available for them.

You should probably use either Boost.Serialization or streams directly. More information in either of the links to the right.
Is it possible to serialize and deserialize a class in C++?

I just voted for AzP's reply as the answer; checking Boost first is the way to go.
In addition, about your code sample:
1 - Changing the signature of your serialization function to a method taking a file:
void MyStruct::serialize(FILE* file) // or stream
{
int size = sizeof(*this); // size of the struct itself, not of the this pointer
fwrite(&size, sizeof(int), 1, file); // write size
fwrite(this, 1, size, file); // write raw bytes of struct
}
reduces the necessity to copy the struct.
2 - Yes, your code makes the serialized bytes dependent on your platform, compiler and compiler settings. This is neither good nor bad: if the same binary writes and reads the serialized bytes, it might even be beneficial because of simplicity and performance. But it is not only endianness; packing and struct layout also affect compatibility. For instance, building your app as 32-bit or as 64-bit may well change the layout of your struct. Finally, serializing the raw footprint also serializes padding bytes (the bytes the compiler might put between struct fields), an overhead that is undesirable for high-traffic network streams (see Google Protocol Buffers, which hunt for every bit they can save).
EDIT:
I see you added "embedded". Yes, then such simple serialize/deserialize methods (deserialize being the mirror implementation of the serialize above) might be a good and simple choice.
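For the "standard C++ without added libraries" part of the question, one possible sketch is to write each field into a std::vector in network byte order. MyStruct is taken from the question; the helper names put_be and get_be and the 92-byte wire size are assumptions made for this example:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct MyStruct {
    std::uint32_t x;
    std::uint64_t y;
    std::uint8_t  z[80];
};

// Wire size is defined by the fields (4 + 8 + 80), independent of sizeof(MyStruct).
constexpr std::size_t kWireSize = 92;

// Append the low n bytes of v, most significant byte first (network byte order).
void put_be(std::vector<std::uint8_t>& out, std::uint64_t v, int n) {
    for (int i = n - 1; i >= 0; --i)
        out.push_back(static_cast<std::uint8_t>(v >> (8 * i)));
}

// Read n big-endian bytes starting at p.
std::uint64_t get_be(const std::uint8_t* p, int n) {
    std::uint64_t v = 0;
    for (int i = 0; i < n; ++i) v = (v << 8) | p[i];
    return v;
}

// The vector owns the buffer, so the caller has nothing to delete[].
std::vector<std::uint8_t> serialize(const MyStruct& s) {
    std::vector<std::uint8_t> out;
    out.reserve(kWireSize);
    put_be(out, s.x, 4);
    put_be(out, s.y, 8);
    out.insert(out.end(), s.z, s.z + sizeof s.z);
    return out;
}

// Returns false when the buffer does not have exactly the expected size.
bool deserialize(const std::vector<std::uint8_t>& in, MyStruct& s) {
    if (in.size() != kWireSize) return false;
    s.x = static_cast<std::uint32_t>(get_be(in.data(), 4));
    s.y = get_be(in.data() + 4, 8);
    std::copy(in.begin() + 12, in.end(), s.z);
    return true;
}
```

Compared with writing the raw struct, this costs a few shifts per field but produces the same bytes regardless of host endianness, padding or compiler settings.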

Returning a flexible datatype from a C++ function

I'm developing for a legacy C++ application which uses ODBC for its data access. Coming from a C# background, I really miss the ADO style of data access.
I'm writing a wrapper (because we can't actually use ADO) to make our data access less painful. This means no char arrays, no manual text blob streaming, and no declarative column binding.
I'm struggling with how to store / return data values. In C# at least, you can declare an object and cast it to whatever (as long as the type is convertible).
My current C++ solution is to use boost::any to store the data value in a custom DataColumnValue object. This class has conversion and assignment operators to the various types used in our app (more than 10). There's a bit of complexity here because if you store an int in the boost::any and try to boost::any_cast<long> you get a boost::bad_any_cast. Client objects shouldn't have to know how the value is stored internally.
Does anyone have any experience trying to store / return values whose types are only known at runtime? Is there a better / cleaner way?

I used OTL (http://otl.sourceforge.net/) in some one-off projects back in the day for interfacing C++ and some SQL Server databases. It's streams-based, so it can do type conversion for you. I did find the streams paradigm a bit confusing at times, as I had to unstream the values in the query order - I never quite figured out how to pull a named value out of the record stream.
But it worked flawlessly otherwise.
In regards to Boost.Any, I've implemented similar constructs before, copying the COM Variant as a C++ union. With Boost.Variant/Any you might need to add additional template specializations to support the particular datatype conversions you're attempting (long is not an int after all). I don't see any particular downside to your approach except scalability in the number of types.
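As one possible shape for such a value class, here is a sketch that swaps boost::any for std::variant from C++17 (a different tool than the one used in the question). The class name DataColumnValue comes from the question; the member as<T>() and the chosen set of stored types are assumptions for the example:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <type_traits>
#include <utility>
#include <variant>

class DataColumnValue {
public:
    // The closed set of types a column may hold (illustrative choice).
    using Storage = std::variant<int, long, double, std::string>;

    DataColumnValue(Storage v) : value_(std::move(v)) {}

    // Return the stored value as T. Same-type and arithmetic-to-arithmetic
    // requests succeed; anything else throws std::runtime_error.
    template <typename T>
    T as() const {
        return std::visit([](const auto& stored) -> T {
            using S = std::decay_t<decltype(stored)>;
            if constexpr (std::is_same_v<S, T>) {
                return stored;
            } else if constexpr (std::is_arithmetic_v<S> && std::is_arithmetic_v<T>) {
                return static_cast<T>(stored);
            } else {
                throw std::runtime_error("unsupported conversion");
            }
        }, value_);
    }

private:
    Storage value_;
};
```

With this, a value stored as int can be read back as long or double without the caller knowing how it is held internally, which is the case that produced boost::bad_any_cast in the question. The trade-off is that the list of supported types has to be known at compile time.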

Designing a string class in C++

I need to design (and code) a "customized" string class in C++. I am seeking any documentation and pointers on design issues or potential pitfalls I should be aware of.
Links are very welcome, as are the identification of problems (if any) with current string libs (Qstring, std::string, and the others).

Despite the critics, I think this is a valid question.
std::string is not a panacea. It looks like someone took the class from a pure-OO language and dumped it in C++, which is probably the case.
Advice 1: Prefer non-member non-friend methods
Now that this is said, in this hour of internationalization, I would certainly advise you to design a class that would support Unicode. And I do say Unicode, not UTF-8 or UTF-16. It's ill-fitting (I think) to devise a class that would contain the data in a given encoding. You can provide methods to then output the information in various formats.
Advice 2: Support Unicode
Then, there is a number of points on the memory allocation schemes:
Small String Optimization: the class contains pre-allocated space for a few characters (a dozen or two), and thus avoid heap allocation for those
Copy On Write: the various strings share a buffer so that copying is cheap; when one string needs to modify its content, it copies the buffer if it's not the sole owner --> the issue is that multithreading introduces overhead here, and it's been shown that for a general-purpose technique this overhead could dwarf the actual copying cost
Immutability: "new" languages such as Java, C# or Python use immutable strings. Think of it as a pool of strings, all strings containing "Fooo" will point to the same buffer. Note that these languages support garbage collection, which rather helps here.
I would personally pick the "Small String Optimization" here (though it's not exclusive with the other two), simply because it's simple to implement and should actually benefit you (heap allocation cost, locality of reference issues).
The other two techniques are somewhat complex in the face of multi-threading, and as such are likely to be error-prone and unlikely to yield any real benefit unless carefully crafted.
And that brings my last advice:
Advice 3: Don't implement internal locking in an attempt of MultiThreading support
It will slow down the class when used in SingleThreaded context and will not yield as much benefit as you'd think when used in a MultiThreaded one.
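The Small String Optimization mentioned above can be sketched roughly as follows. SmallString and its 15-character inline capacity are made up for the illustration, and the sketch keeps the inline buffer and the heap pointer as separate members for clarity, whereas production implementations usually overlap them to save space:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

class SmallString {
public:
    static constexpr std::size_t kInlineCapacity = 15; // characters, excluding '\0'

    SmallString(const char* s = "") {
        assert(s != nullptr);
        size_ = std::strlen(s);
        std::memcpy(storage_for(size_), s, size_ + 1); // copy including '\0'
    }

    SmallString(const SmallString& other) : size_(other.size_) {
        std::memcpy(storage_for(size_), other.c_str(), size_ + 1);
    }

    // Assignment is left out to keep the sketch short.
    SmallString& operator=(const SmallString&) = delete;

    ~SmallString() { delete[] heap_; } // deleting nullptr is a no-op

    bool is_inline() const { return heap_ == nullptr; }
    std::size_t size() const { return size_; }
    const char* c_str() const { return is_inline() ? inline_ : heap_; }

private:
    // Short strings live in the object itself; longer ones get a heap buffer.
    char* storage_for(std::size_t n) {
        if (n <= kInlineCapacity) return inline_;
        heap_ = new char[n + 1];
        return heap_;
    }

    std::size_t size_ = 0;
    char* heap_ = nullptr;
    char inline_[kInlineCapacity + 1] = {};
};
```

A string such as "short" never touches the heap here, which is where the savings in allocation cost and locality come from; only strings longer than the inline capacity pay for new/delete.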
Finally, you could perhaps find something suiting your tastes (or get some pointers) by browsing existing code. I don't promise to exhibit "smooth" interfaces though:
ICU UnicodeString: Unicode support, at least
std::string: over 100 member methods (counting the various overloads)
llvm StringRef: note how many algorithms are implemented as member methods :'(

Effective STL by Scott Meyers has some interesting discussion about possible std::string implementation techniques, though it covers rather advanced issues such as copy-on-write and reference counting.

Depending on what the "customization" is (e.g. a custom allocator), you may be able to do it via a template parameter of the std::basic_string class.
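As a sketch of that route, the following plugs a hypothetical allocator into std::basic_string. CountingAllocator and CountedString are names invented for this example, the allocator only counts allocations before deferring to operator new, and the code assumes C++17 for the inline static member:

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <string>

// Hypothetical allocator: counts heap allocations, otherwise behaves like new/delete.
template <typename T>
struct CountingAllocator {
    using value_type = T;
    static inline std::size_t allocations = 0; // one counter per element type

    CountingAllocator() = default;
    template <typename U>
    CountingAllocator(const CountingAllocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        ++allocations;
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) noexcept { ::operator delete(p); }
};

// All instances are interchangeable, so they always compare equal.
template <typename T, typename U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) { return false; }

// Same interface as std::string, different memory policy.
using CountedString =
    std::basic_string<char, std::char_traits<char>, CountingAllocator<char>>;
```

Note that CountedString is a distinct type from std::string, so passing one where the other is expected requires an explicit conversion (for example through c_str() or iterators).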

Herb Sutter gives a sample of a custom string class in GotW #29. You could use it as a starting point.

From a general-purpose point of view, a "new" string class would ideally combine the good points of std::string, CString, QString and others. A few points in random order:
MFC CString supports being used in printf-like functions due to a very specific implementation. If you need or want this feature I recommend buying the book "MFC Internals" by George Shepherd. Although the book is from 1996(!), its description of how CString is implemented should be worth it. http://www.amazon.com/MFC-Internals-Microsoft-Foundation-Architecture/dp/0201407213/ref=sr_1_1?ie=UTF8&s=books&qid=1283176951&sr=8-1
Check that your string class plays nicely with all interfaces you'll use it with (iostreams, Windows API, printf*, etc.)
Don't aim for full unicode support (as in: collation, grapheme clusters, ...) as that will mean your class will never be done, but consider making it a wchar_t class with conversion options.
Consider making the ctor/function that creates your string objects from char* always take the specific encoding of the character arrays. (Can be helpful in mixed UTF-8 / other character sets environments.)
Look at the full CString interface and at the full std::string interface and decide what you are going to need and what you can skip.
Look at QString to see what the other two miss.
Do not provide implicit conversion to either char* or wchar_t*.
Consider adding convenient conversion functions to/from numeric types.
Don't write a string class without a full set of detailed Unit Tests!

The world doesn't need another string class. Is this homework? If not, use std::string.

The problem with std::string is... that you can't change it. Sometimes you need the basics of a std::string, but disagree with the implementation in your C++ library.
As an example, thread-safe reference counting, where employed, means lots of locking (or at least locked operations). Also, if you know that most of your strings are short, you might want a string class that is optimized for that use case.
So even if you like the std::string API, or at least have learned to live with it, there is room for 'competing implementations' that are more or less workalikes.
PowerDNS would love to have one, as we currently pass many DNS host names around, and a large majority of them would fit in a fixed buffer of, say, 25 bytes, which would relieve a lot of new/delete pressure.
