Are the C++ Standard Library file stream operations crippled on Microsoft platforms?

I'm asking this question because I have been working on a project that requires collecting a lot of data REALLY fast: 5.7 GBytes (with a capital B, as in bytes) per second, or 11.4 GBytes per second, depending on the scenario.
We are working with a small striped RAID array using three Samsung Pro NVMe drives (for 11.4 GB/s we have a larger array).
The project has so far been developed on Windows. I wanted to make things as portable as possible, so I focused on using the C++ Standard Library; however, no matter what I did, I could not push file transfers past 1.5 GB/s.
The strategy was simple: create a couple of huge swap buffers and write them directly to disk as one huge unformatted binary file.
Using std::ofstream
and benchmarking with manually set buffer sizes of varying lengths via:
rdbuf()->pubsetbuf(buffer, BUFFER_SIZE);
open(Filename, std::ios::binary|std::ios::trunc);
followed by my managed write loop, I was able to find a sweet spot, but I was never able to crack 1.5 GB/s.
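For reference, a minimal sketch of that setup, with illustrative values (BUFFER_SIZE, the chunk size, and the file name are placeholders, not the actual project values):

    #include <cstddef>
    #include <fstream>
    #include <vector>

    int main() {
        constexpr std::size_t BUFFER_SIZE = 1 << 20;   // 1 MiB stream buffer (illustrative)
        constexpr std::size_t CHUNK_SIZE  = 64 << 20;  // 64 MiB per write (illustrative)

        std::vector<char> streamBuf(BUFFER_SIZE);
        std::vector<char> data(CHUNK_SIZE);            // stands in for one huge swap buffer

        std::ofstream out;
        // pubsetbuf must run before any I/O; some implementations also
        // require it before open(), so do it first to be safe.
        out.rdbuf()->pubsetbuf(streamBuf.data(),
                               static_cast<std::streamsize>(streamBuf.size()));
        out.open("capture.bin", std::ios::binary | std::ios::trunc);

        // Managed write loop: in the real project this would drain the
        // swap buffers as the acquisition side fills them.
        out.write(data.data(), static_cast<std::streamsize>(data.size()));
    }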
I then found the Windows SDK and its CreateFile function, in particular with the FILE_FLAG_NO_BUFFERING flag.
This was a game-changer: as long as I fed it sector-aligned data (in my case, everything needed to be some multiple of 512 bytes), I was suddenly able to take full advantage of the RAID array's throughput.
I revisited std::ofstream in an attempt to stick with more OS-agnostic functions; however, even though one can specify a zero-sized buffer for std::ofstream, there doesn't appear to be any documentation about the caveats of using it with no buffer.
std::ofstream allows 64-bit values for its write size, unlike the Windows SDK's WriteFile, which only accepts DWORDs. That caps the maximum write size at the largest multiple of 512 that fits in a uint32_t, and you must manage your writes in a loop if a file exceeds 4 GB (mine do).
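A hedged sketch of the Win32 path just described (error handling trimmed; the 512-byte sector size is this project's case, and real code should query the volume for it):

    #include <windows.h>
    #include <cstdint>

    // Writes a sector-aligned buffer with unbuffered I/O. Assumes totalBytes
    // is a multiple of the sector size and that data was allocated with
    // VirtualAlloc/_aligned_malloc so it meets FILE_FLAG_NO_BUFFERING's
    // alignment requirements.
    bool WriteUnbuffered(const wchar_t* path, const std::uint8_t* data,
                         std::uint64_t totalBytes)
    {
        HANDLE h = CreateFileW(path, GENERIC_WRITE, 0, nullptr, CREATE_ALWAYS,
                               FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH,
                               nullptr);
        if (h == INVALID_HANDLE_VALUE) return false;

        // WriteFile takes a DWORD count, so files over 4 GB go out in chunks;
        // 0xFFFFFE00 is the largest multiple of 512 that fits in a uint32_t.
        constexpr DWORD kMaxChunk = 0xFFFFFE00;
        std::uint64_t written = 0;
        while (written < totalBytes) {
            std::uint64_t remaining = totalBytes - written;
            DWORD chunk = static_cast<DWORD>(remaining < kMaxChunk ? remaining
                                                                   : kMaxChunk);
            DWORD done = 0;
            if (!WriteFile(h, data + written, chunk, &done, nullptr)
                || done != chunk) {
                CloseHandle(h);
                return false;
            }
            written += done;
        }
        CloseHandle(h);
        return true;
    }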
This raises the question: is Microsoft simply not giving the C++ Standard Library devs access to the OS-level system calls needed to take advantage of ultra-high-speed drive arrays? Or am I missing something in how to use the C++ Standard Library to its full potential?

"is Microsoft simply not giving the C++ Standard Library Devs..."
You might notice that the product you're using is called Microsoft Visual Studio. The Standard Library developers for Visual Studio work at Microsoft, although on a different team than the Windows developers.
The reason is a bit simpler: the Visual C++ devs can't possibly know about and optimize for every possible usage scenario. It's a bit unusual to do formatted output at such high speeds. Remember, the point of ostream is to provide operator<<; ofstream is for formatted output to files. For high-speed I/O you want binary output anyway.

To put it bluntly, the bandwidth you're aiming for is within the ballpark of the physical limits of current commodity hardware (~24 GByte/s for 16×PCIe 4.0). In my own work I found it very challenging to reach single-core memory transfer rates above 8 GByte/s without the use of "dark magic" (i.e. hand-crafted assembly and optimized system-call code), and it involved carefully aligning memory accesses and making use of vector extensions. Most importantly, reaching these levels of optimization requires being aware of the kind of data being processed, knowing what kind of access patterns to expect, and/or building caching intermediaries to accommodate the underlying hardware.
Such optimizations are plainly and simply outside the scope of general-purpose standard libraries. Standard library implementations must adhere to the behaviours written down in the specification, and some of those requirements tend to collide with what has to be done to make the most of the underlying hardware.
So I'm sorry to tell you, but you'll probably have to bite the bullet and use the low-level system APIs directly, bypassing the standard library.

Related

Converting endianness of struct data

What I have is:
a hex file with the bytes of a C struct in it, ordered in big-endian
the struct definition as a *.h file
the struct information as DWARF2 debug info
My application has to be written in C/C++. Intermediate scripts using, for example, Python would be OK.
What I have to do is read the bytes of the hex file and cast them into the struct type on a system that is little-endian.
During this process, I will have to reverse the bytes of each struct member.
The obvious solution would be to write a conversion function that does byte swapping for each struct member, but since the struct has multiple layers and ~1200 members that change faster than I can update a conversion function, writing it by hand is no solution.
So I could generate the conversion function automatically by:
Finding and parsing the types inside multiple *.h files
Iterating over the members of all struct types and generating swaps for them (without some sort of reflection API, not that easy)
Loading the struct via the conversion function.
Since this solution seems like quite a bit of work, I was wondering if there is an easier way, like telling the compiler to swap the bytes or using the debug info somehow.
Does anybody know a trick that might help in this case?
Thanks and greetings!
Remark:
Changing any of the processes leading to this, changing the input conditions, or delegating responsibilities to the other developers involved is not possible.
Changing anything about the hex file as an input is not possible. This file comes out of another system that will not be changed to fix this problem.
Padding, data type sizes, etc. are identical; this is ensured by other measures, too. So endianness is definitely the only problem. This is also why I see no reason against using DWARF2 info to identify the bytes of every struct member.
I agree that the layout of the struct is very bad. But there are reasons it is the way it is, and in short, I cannot (and am not allowed to) change it anyway, for process reasons and backwards compatibility.
To give some more scope:
The software that all of this is used in is deployed to multiple different embedded devices (of multiple types). The hex file contains the calibration information of the software and is stored in a specific system that can only output this hex file.
I am now porting the software to a little-endian device, and I have to use the hex file from the "main" branch of the software, which is big-endian, as input.
There is no way to tell a C or C++ compiler to swap bytes from LE to BE or vice versa automatically. You really have to do it yourself. If your data structs are really huge, probably the best way is to implement automatic generation of the conversion code.
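To illustrate what the generated code might look like, here is a hand-written sketch for a hypothetical two-member struct (the real generator would emit one such function per struct type; names are invented for the example):

    #include <cstdint>
    #include <cstring>

    // Generic byte swap for any trivially copyable scalar.
    template <typename T>
    T byteswap(T value)
    {
        unsigned char b[sizeof(T)];
        std::memcpy(b, &value, sizeof(T));
        for (std::size_t i = 0; i < sizeof(T) / 2; ++i) {
            unsigned char tmp = b[i];
            b[i] = b[sizeof(T) - 1 - i];
            b[sizeof(T) - 1 - i] = tmp;
        }
        std::memcpy(&value, b, sizeof(T));
        return value;
    }

    // Hypothetical calibration struct; the real one has ~1200 members,
    // which is exactly why this function should be generated, not written.
    struct Calibration {
        std::uint32_t version;
        std::int16_t  offsets[4];
    };

    void swapToHost(Calibration& c)
    {
        c.version = byteswap(c.version);
        for (auto& o : c.offsets) o = byteswap(o);
    }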
The problem, as far as I understand it, is tricky but tractable. Since the data extraction won't be running on an embedded device, it won't be resource-constrained. I say: embrace the runtime inefficiency that desktop hardware allows, and go for easy-to-debug instead.
Instead of thinking of the source file as "almost what I need modulo a couple of minor adjustments", think of it as "generic binary file with an open ended, evolving schema". The schema description is the DWARF data.
What I would do: start a Python project. Use the pyelftools PyPI module to parse the DWARF. Iterate over the compile units (CUs). In each CU, walk the top-level entries (DIEs). Look for a DW_TAG_structure_type DIE with the specific DW_AT_name value (I hope the struct name is known in advance). Then go through its DW_TAG_member sub-DIEs. DW_AT_data_member_location gives you the offset, letting you work around the padding. Look at DW_AT_type to determine the member type (you'd have to resolve the DIE reference for that). Recurse into struct- and array-typed members as necessary.
From that, generate a format string for the struct.unpack method; it can read big-endian ints seamlessly. Then use struct.pack to write the data back out in whatever format the C++ consumer expects.
This depends on your being able to match the data file to the DWARF info of the generating executable, from exactly the same build. I hope the processes of the organization allow for that.
Recent versions of GCC allow declaring the desired endianness irrespective of the target platform, either for a source code section using the pragma scalar_storage_order or for a specific type using an attribute with the same identifier. The main catch: g++ does not support this. Also, it won't work in all cases; for example, taking a pointer to a member with transparent endianness conversion leads to an error. Unless you're okay with sticking to C for struct access (it all depends on your current codebase), this is not an option.
The persistence layout is based on the original struct layout; so be it. However, a more explicit approach of serializing the structs should be preferred, for exactly the reason you bring this up. Besides the endianness issue, struct packing also affects compatibility and should be explicitly specified. For persistence, a packing of 1 would be optimal; for in-memory data structures, that alignment is far from optimal in terms of performance and concurrency characteristics. Also, different platforms might have incompatible data types (e.g. sizeof(long) on 64-bit Linux vs. Windows: LP64 vs. LLP64). So keeping the persistence layout separate from the in-memory data structures tends to have a long list of advantages and therefore usually outweighs the disadvantage of having to maintain the serialization code separately, particularly if portability is a major concern.
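A minimal sketch of that separation, with invented names: the in-memory struct stays as-is, and the persistence code pins field widths and write order explicitly:

    #include <cstdint>
    #include <ostream>

    // In-memory representation: laid out for performance; sizes may
    // differ across platforms (long is 8 bytes on LP64, 4 on LLP64).
    struct Sample {
        long   id;
        double value;
    };

    // Persistence: fixed-width fields written in a defined order, so the
    // on-disk format no longer depends on the compiler or platform.
    void writeSample(std::ostream& out, const Sample& s)
    {
        const std::int64_t id64 = s.id;  // pin the width explicitly
        out.write(reinterpret_cast<const char*>(&id64), sizeof id64);
        out.write(reinterpret_cast<const char*>(&s.value), sizeof s.value);
    }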
You could take advantage of C/C++ reflection libraries or implement one yourself. In C, this will definitely require macros (e.g. Metaresc). In C++, you might actually get away with your original struct definitions (e.g. Boost.PFR, a.k.a. Precise and Flat Reflection).
If reflection is not an option, you could generate the serialization code by parsing either the headers or the debug symbols. Generally, parsing C/C++ is more complex. By moving the structs involved into dedicated headers, you might get away with a simple C/C++ parser. To make things easier, you could simplify parsing by processing the output of gdb's ptype command, which is based on the debug symbols. Or you could parse the debug symbols directly. With a scripting language like Python, both approaches should be feasible (pygccxml and pyelftools come to mind).
Rather than generating the serialization code as part of the build process, you could generate that code once and require updates whenever the structs change in the future. That's what I would do in a multi-platform scenario. It would also spare you the pain of implementing a perfect parser that can deal with all kinds of C/C++ input; it would only have to be good enough for one-time generation.

How do demomakers attain ultra small filesizes?

When I watch demoscene videos on YouTube, the authors often boast of how their file sizes are 64 KB or less, some as small as just 4 KB. When I compile even a very basic program in C++, the executable is always at least 90 KB or so. Are these demos written entirely in assembly? It was my understanding that demomakers used C/C++ as well.
I'm one of the coders of Felix's Workshop and Immersion (64k intros by Ctrl-Alt-Test). Most 64k intros nowadays use C++ (an exception: Logicoma uses Rust). Assembly may make sense for 4k intros (although most of those actually use C++ too), but not for 64k intros.
Here are the two most important things:
Compile without the standard library (in particular, the STL can make the binary quite large); see the sketch after this list.
Compress your binary (kkrunchy for 64k intros on Windows, Crinkler for 4k intros on Windows).
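To illustrate the first point, a minimal sketch of a CRT-free Windows program (assumed MSVC link options along the lines of /NODEFAULTLIB /ENTRY:entry /SUBSYSTEM:WINDOWS, plus kernel32.lib and user32.lib; details vary by toolchain):

    #include <windows.h>

    // Custom entry point: no CRT startup code runs, so there are no
    // global constructors and no main(); everything is initialized by hand.
    extern "C" void entry()
    {
        MessageBoxA(nullptr, "Hello from a tiny binary", "demo", MB_OK);
        ExitProcess(0);  // no CRT to return to, so exit explicitly
    }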
Now, you can write a ton of code before filling the 64 KB. How do you use those bytes? Procedural generation.
For music, the sheet music is compressed, and the instruments are generated with a soft synth. A popular option, although a bit outdated, is V2 by Farbrausch.
If you need textures, generate them.
If you need 3d models, generate them.
Animations and effects are procedural.
For the camera, save some key positions and interpolate.
Shaders are heavily used in modern graphics. Minifying the shaders can save quite a lot of space.
Want to hear more about procedural generation and other techniques? Check IQ's articles.
If you want to further optimise your code, here are some additional tricks:
You probably use lots of floats. Try truncating the mantissa of your floats; it can save many kB (see the sketch after this list).
Disable function inlining (it saved me 2kB).
Try the fastcall calling convention (it saved me 0.7kB).
Disable support for exceptions. You don't need them.
If you use classes, avoid inheritance.
Be careful if you use templates.
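Here is the promised sketch of the mantissa-truncation trick: zeroing the low mantissa bits changes each value only slightly, but the resulting bit patterns compress far better (the number of bits to drop is illustrative):

    #include <cstdint>
    #include <cstring>

    // Clear the low mantissa bits of a float; the value barely moves,
    // but the packer finds many more repeated byte patterns.
    float truncateMantissa(float f, int bitsToDrop /* e.g. 12 */)
    {
        std::uint32_t bits;
        std::memcpy(&bits, &f, sizeof bits);
        bits &= ~((1u << bitsToDrop) - 1u);
        std::memcpy(&f, &bits, sizeof f);
        return f;
    }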
In a typical 4k intro, the C++ code is used for the music and the initialisation. Graphics are done in a shader.
Those demos do not use the standard library (not C++'s and not even the C standard lib), nor do they link against the standard libraries (to keep import tables small). They dynamically link only the absolute minimum necessary.
The demo's "main function" is usually identical with the entry point (unlike in a normal program where the entry point is a CRT init function which does some OS-specific setup, initializes globals, runs constructors, and eventually calls main).
Usually the demo executables are not compliant with the executable format's specification (omitting minimum section sizes and alignments) and are compressed with an exe packer. Technically, these are "broken" programs, but they are "broken" in just such a way that they still run successfully.
Also, such demos rely heavily on procedurally generated content.
These ultra-small programs typically don't depend on any libraries or frameworks, unlike traditional application development; they typically access graphics, I/O, etc. directly.
I can't comment yet because I don't have 50 rep points, so I'm answering.
One way to create a smaller program is to use an older compiler, such as Microsoft Visual C/C++ 4.0, which produces a smaller .exe file than, say, Microsoft Visual Studio 2005.
It really depends on your environment, but if you don't instantiate any templates and you link everything dynamically, it's fairly easy to achieve a very small executable, since none of the code you actually execute will be in the executable.

hibernate-like saving state of a program

Is there any way in C++, Java, or Python that would allow me to save the state of my program, no questions asked? For example, I've spent an hour learning how to save a tree-like structure into a file. Very educational, but I feel I could just do:
saveState(file);
And the "file" would contain whole memory my program uses. Just like operating system's "hibernate" or "suspend-to-disk" feature. I know about boost serialization, this is probably not what I'm looking for.
What you most likely want is what we call serialization or object marshalling. There are a whole butt load of academic problems with data/object serialization that you can easily google.
That being said, given the right library (probably a very native one), you could take a true snapshot of your running program, similar to what an OS-specific hibernate does. Here is an SO answer for doing that on Linux: https://stackoverflow.com/a/12190830/318174
To do that kind of snapshotting, though, you will most likely need a process external to the one you want to save. I highly recommend you don't do that. Instead, read up on serialization or object marshalling in your language of choice (by the way, welcome to SO; don't tag every language, that annoys people)... hint: most people these days pick JSON.
I think what you describe is a feature few people would actually want to use in a real system. Usually you save something so it can be transmitted, so you can stop running the program, or to guard against the program quitting unexpectedly (or the power failing).
In most production systems you want writes to disk to be small and incremental, so that the system can remain responsive and writing inconsistent data can be avoided. Writing ALL memory to disk on a regular basis would probably cause long periods of unresponsiveness, and you would need to lock the entire system to avoid saving an inconsistent state.
Writing your own persistence is tedious and error-prone, however, so you may find this SO question of interest: Persisting graph data (Java)
There are a couple of frameworks for this. Check out Google Protocol Buffers if you need support for Java, Python, and C++: https://developers.google.com/protocol-buffers/ I've used it in some projects and it works well.
There's also Thrift (originally from Facebook): http://thrift.apache.org/ I don't have any experience with it, though.
Another option is what @QuentinUK suggests: use a class that inherits from something streamable and/or write streaming operators/functions.
I'd use a framework.
Here's your problem:
http://en.wikipedia.org/wiki/Address_space_layout_randomization
Back in ancient history (16-bit DOS programs with extenders), compilers used to support "based" pointers, which stored relative addresses. These were safe to serialize en masse, and applications did so, saving both code and data; the serialized modules were called "overlays".
Today, you'd need based-pointer support in your toolchain (making every pointer access require an extra adjustment), or else you'd have to go through all the data, distinguishing the pointers from the other data (how?) and adjusting them to their new storage location, in case the OS loaded some library at the same address your old program had used for its heap. In modern "managed" environments, where pointers already have to be identified for the garbage collector, this is feasible even if not commonly done. In native code it's very difficult, although similar metadata is created to enable relocation of shared libraries.
So instead, people end up walking their entire data structures manually and converting object links (pointers) into something that can be restored on the other end, even though the object has a new address (again, because the old address may now be used by a shared library).
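A minimal sketch of that conversion, often called pointer swizzling: links are stored as indices into a node pool, so they survive the data landing at a new address (names are illustrative):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct Node {
        int   payload;
        Node* next;                // in-memory link
    };

    // On-disk form: the pointer is replaced by an index into the pool.
    struct SavedNode {
        int          payload;
        std::int64_t nextIndex;    // -1 means null
    };

    // Assumes every next pointer points into the same nodes vector.
    SavedNode swizzle(const std::vector<Node>& nodes, const Node& n)
    {
        std::int64_t idx = n.next ? (n.next - nodes.data()) : -1;
        return SavedNode{n.payload, idx};
    }

    Node unswizzle(std::vector<Node>& nodes, const SavedNode& s)
    {
        Node* next = (s.nextIndex >= 0)
                         ? &nodes[static_cast<std::size_t>(s.nextIndex)]
                         : nullptr;
        return Node{s.payload, next};
    }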
Note that many processors have features to support based addressing... and that, since based addressing is no longer common, compilers instead use those pointer-arithmetic features to speed up user code.
Yes, derive objects from a streamable class and add the streaming functions. Then you can stream everything to disk. You will need a library for this such as MFC.

multiple access to file in r/w

I'm planning to write a program which has to access a certain file many times for reading and writing.
So I decided to use fstream, since I can use that class for both reading and writing purposes.
My idea is to open the file at application startup and close it when the application closes.
Since the file can be arbitrarily big, I was planning to use a "paging" structure, in which I:
1) preallocate a fixed amount of memory for each page, and a fixed number of pages
2) load part of the file into the first free page
3) if there is no free page, select a non-empty one by some criterion, commit all edits in it (if there are any), and then load the relevant part of the file into that page.
That's not so hard to code. But I was wondering if I'm about to reinvent the wheel... maybe fstream itself is written in a smart way, so that it already implements a similar paging mechanism. In that case, I wouldn't need to take care of it at all; I'd just write and read at any time.
Any suggestions?
Don't do this yourself. Unless you are using a very exotic implementation, the fstream classes already implement such a mechanism efficiently.
Check out http://www.cplusplus.com/doc/tutorial/files/ under "Buffers and Synchronization".
There are possible issues if you are seeking into a file larger than 2 GB with an old kernel or an old implementation of the standard library. Check this:
Large file support in C++
or use Boost.Filesystem
The internal workings of the standard C++ library vary by implementation, so a test is needed to get real data on your preferred platform. Generally, memory-mapped files are considered the fastest way to access data stored in a file (as Uflex mentioned in his comment), but they have some drawbacks as well (see the linked wiki page). You can either use the standard (POSIX) C functions mmap() and munmap(), or the Boost C++ libraries, which have a portable C++ interface for memory-mapped files.
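For comparison, a minimal POSIX sketch of the memory-mapped approach (error handling trimmed):

    #include <cstddef>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main()
    {
        int fd = open("data.bin", O_RDONLY);
        struct stat st;
        fstat(fd, &st);

        // Map the whole file; the kernel pages data in and out on demand,
        // which is essentially the "paging structure" from the question.
        void* p = mmap(nullptr, static_cast<std::size_t>(st.st_size),
                       PROT_READ, MAP_PRIVATE, fd, 0);

        const char* bytes = static_cast<const char*>(p);
        // ... random-access reads through bytes[] ...

        munmap(p, static_cast<std::size_t>(st.st_size));
        close(fd);
    }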

Binary files and cross platform compatibility

I have written a C++ library that saves my data (a collection of custom structs, etc.) into a binary file. I currently use (i.e. create and consume) the files locally, on my Windows (XP) machine. For simplicity, let's think of the library in two parts: a writer (creates the files) and a reader or consumer (simply reads data from the files).
Recently though, I have wanted to also consume (i.e. read), on my Linux machine, the data files I created on my XP machine. I must point out at this stage that both machines are PCs (and so have the same endianness, etc.).
I can build a reader (and compile it for Linux [Ubuntu 9.10, to be precise]), since I am the library's creator. My question, before I embark down this road (of building the reader, etc.), is:
Assuming I have successfully built the reader for Linux,
can I simply copy files that were created on the Windows (XP) machine across to the Linux (Ubuntu 9.10) machine and use the Linux reader to successfully read the copied file?
For the files to be binary compatible:
endianness must match (as it does for you)
bitfield packing order must be the same
sizes and signedness of types must be the same
the compiler must make the same decisions about padding and alignment
It's certainly possible for all of these conditions to be fulfilled, or for you simply not to hit any case in which they are not. At the very least, though, I'd add some sanity checks and/or sentinel members to detect problems, as sketched below.
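A sketch of those sanity checks, with illustrative values: a magic number and version in a small header, plus a compile-time size check to catch layout drift on each platform:

    #include <cstdint>

    struct Record {
        std::int32_t id;
        double       value;
    };

    struct FileHeader {
        std::uint32_t magic;      // sentinel, e.g. 0x4D594654 ("MYFT")
        std::uint32_t version;    // bump whenever the layout changes
        std::uint32_t recordSize; // writer's sizeof(Record); reader verifies
    };

    // Fails the build if padding/alignment decisions change the layout
    // (16 bytes is what the original writer produced on its platform).
    static_assert(sizeof(Record) == 16, "Record layout changed");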
Binary files should be compatible across machines with the same endianness.
The issue you may have in your code is the size of ints: you can't necessarily assume that compilers on different OSes use the same size of int. So either copy blocks of bytes and cast them, or use int16_t, int32_t, etc.
Structs are not a file format, and you shouldn't try to use them as such.
When attempting to make structs work with fread and fwrite, there are a huge number of hacks to make it work. You byte-swap integers so that you can share files between little-endian and big-endian machines. You change your structs to use fixed-width integer types so you can share between machines with different word sizes (such as between x86 and x64 machines). You add compiler-specific pragmas to control struct padding so you can share between compiler versions.
It works, but it's ugly. Not to mention, easy to get wrong.
Much like the recommendation in "The byte order fallacy", a much better idea is to write code that reads/writes the fields individually. By writing your own code, you can ensure there's no padding, you can choose integer sizes independently of the local size of integers, and you can support both endiannesses without byte-swapping (by reading/writing the bytes of an integer separately).
Unlike the hacky approach, this is hard to get wrong. Further, because you don't rely on any compiler or architecture specific behaviors, either your code will work on all compilers and architectures, or none. If you do it right, you shouldn't have any platform-specific bugs.
There is one downside: individually reading/writing the fields will be slower than just using fread/fwrite directly. You can set up a buffer (uint8_t buffer[]), write the entirety of the data into it, and then write everything out at once, which might help, but it'll still be slower (because you'd still have to move the fields into the buffer one at a time). For most purposes, though, it'll still be fast enough (exceptions being embedded/real-time systems or extremely high-performance computing).
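A sketch of that style of field-by-field I/O, in the spirit of the byte order fallacy article: the file's byte order is fixed (little-endian here), and the code never asks what the host's byte order is:

    #include <cstdint>
    #include <istream>
    #include <ostream>

    // Read a 32-bit value stored little-endian in the file, regardless
    // of the endianness of the machine running this code.
    std::uint32_t readU32le(std::istream& in)
    {
        unsigned char b[4];
        in.read(reinterpret_cast<char*>(b), 4);
        return  static_cast<std::uint32_t>(b[0])
             | (static_cast<std::uint32_t>(b[1]) << 8)
             | (static_cast<std::uint32_t>(b[2]) << 16)
             | (static_cast<std::uint32_t>(b[3]) << 24);
    }

    void writeU32le(std::ostream& out, std::uint32_t v)
    {
        const unsigned char b[4] = {
            static_cast<unsigned char>(v),
            static_cast<unsigned char>(v >> 8),
            static_cast<unsigned char>(v >> 16),
            static_cast<unsigned char>(v >> 24),
        };
        out.write(reinterpret_cast<const char*>(b), 4);
    }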
If:
the machines have the same endianess (as you stated they have) and
you open the streams in binary mode, as text mode might do funny things, e.g. with line endings, and
you have programmed cleanly so you don't stumble over implementation-defined stuff like alignments, data type sizes, and struct packing,
then yes, your files should be portable.
The third bullet point is what makes a file format "portable". Depending on what kind of data you have in your structs, it can be very easy or a bit tricky. Bitfields, or data being reinterpreted from a different type, are especially tricky.
You might consider taking a look at the Boost Serialization Library.
A lot of thought has been put into it, and it will handle many of the potential cross-platform incompatibilities for you.
Of course, it may be overkill for your particular use case, especially if you've already got your writers and readers implemented.
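For a sense of what using it looks like, a minimal sketch (a text archive is chosen here because, unlike the binary archive, its format is portable across platforms):

    #include <fstream>
    #include <boost/archive/text_iarchive.hpp>
    #include <boost/archive/text_oarchive.hpp>

    struct Sample {
        int    id = 0;
        double value = 0.0;

        template <class Archive>
        void serialize(Archive& ar, const unsigned int /*version*/)
        {
            ar & id;     // same function handles both saving and loading
            ar & value;
        }
    };

    int main()
    {
        {   // write
            std::ofstream ofs("sample.txt");
            boost::archive::text_oarchive oa(ofs);
            const Sample s{42, 3.14};
            oa << s;
        }
        {   // read back
            std::ifstream ifs("sample.txt");
            boost::archive::text_iarchive ia(ifs);
            Sample s;
            ia >> s;
        }
    }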