logical operations on large files binary data in C/C++

I have two binary files (on the order of tens of MB each) and I want to OR every bit of one with the corresponding bit of the other. Of course, I want this to be as efficient as possible.
I have two approaches in mind, but I suspect there is a more efficient way that I do not know of.
Given files a and b, what I want to compute is a = a | b.
1. Load both files, parse them into two huge std::bitsets, and OR them together.
2. Load both files byte by byte and OR them in a huge for loop.
Is there any other way to do that?

Don't go byte-by-byte. That'd be seriously slow. Instead, read the files in chunks. Find what the block size is for your system (4k? 8K? 64k?) and read the file using chunks of that size. Then you can loop through the byte streams in memory and do the OR operations there.
In logical terms, even though you might only be reading a byte at a time, the OS will still read an entire block worth of data, then throw away all but the byte you wanted. Next time around that block'll be cached, but it's still going through the full read motions for every byte you want. So... just suck the entire block into memory and save yourself that wasted overhead.

I would recommend loading the two files a chunk at a time, where a chunk is some appropriate portion of the data. The best size depends on your operating system and filesystem, but it is usually something like the cluster size or a small multiple of it. You would have to run some tests to determine the best buffer size.

I don't think you would have any performance advantage either way (if in your "second option" you are going to load the file in big chunks), after all you'd be using a big stack-allocated buffer in both cases (which is what std::bitset boils down to), so go with the one you like best.
The only advantage I see in std::bitset::operator|=, besides clarity, is that it may be able to exploit some platform-specific trick to OR big sequences of bytes, but I think the compiler would be able to optimize your big OR loop anyway.
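For comparison, a rough sketch of the std::bitset variant. The size kBits is a hypothetical compile-time constant (std::bitset needs its size at compile time), and or_into is an illustrative name. Because std::bitset stores its bits inline, a bitset covering tens of MB is safer on the heap than as a local variable, so the usage below allocates it with std::make_unique (C++14).

```cpp
#include <bitset>
#include <cstddef>
#include <memory>

// Hypothetical fixed size: 2^20 bits (128 KiB of data).
constexpr std::size_t kBits = 1 << 20;
using Bits = std::bitset<kBits>;

// ORs every bit of b into a using the bitset's own operator.
void or_into(Bits& a, const Bits& b) {
    a |= b;
}
```

Usage: `auto a = std::make_unique<Bits>(); auto b = std::make_unique<Bits>(); or_into(*a, *b);`. Filling the bitsets from file data still requires a conversion step, which the plain byte loop avoids.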

Related

C++ data block management

(Having trouble coming up with a good question title here!)
In the last few months, I seem to keep writing code for doing the same task (in minor variations). So I'm wondering if there's anything in the C++ standard libraries (or maybe Boost) that can help. Let me try to explain what I'm trying to do…
As a concrete example, suppose I have a logical file that's made up of several physical files on disk. What I want to do is write a function like
void ReadData(void * buffer, UInt64 offset, UInt64 size);
Seems simple enough, right? However, it's actually quite fiddly:
First, I need to figure out which physical file contains the requested logical offset.
Next, I need to check whether the requested size spans a physical file boundary.
Actually, in the worst case, it's plausible the size could span multiple files, although that's unlikely.
Finally, I need to fread() the right blocks of data, from the right files, into the right offsets in buffer.
Checking that neither offset nor size exceeds the available data would be prudent as well.
None of this is astonishingly hard, it's just fiddly and easy to screw up. There are endless opportunities for off-by-one bugs, out-of-bounds accesses and so forth. It's also really awkward to test thoroughly. In short, this isn't the sort of code you want to end up writing multiple times over.
Does anybody know of anything in the libraries that will help me here? I'm mostly interested in something for handling all the coordinate transformations, not the actual I/O operations themselves.
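I am not aware of a standard-library facility for this, but the coordinate transformation on its own can be sketched as below. Piece and split_request are hypothetical names, the file sizes are assumed to be known up front, and the sketch ignores 64-bit overflow of offset + size.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// One contiguous read from a single physical file.
struct Piece {
    std::size_t file_index;       // which physical file
    std::uint64_t file_offset;    // where to start reading in that file
    std::uint64_t buffer_offset;  // where the bytes land in the caller's buffer
    std::uint64_t length;         // number of bytes to read
};

// Splits a logical (offset, size) request into per-file pieces.
// Throws if the request runs past the end of the available data.
std::vector<Piece> split_request(const std::vector<std::uint64_t>& file_sizes,
                                 std::uint64_t offset, std::uint64_t size) {
    std::vector<Piece> pieces;
    std::uint64_t file_start = 0;  // logical offset where file i begins
    std::uint64_t done = 0;        // bytes of the request already assigned
    for (std::size_t i = 0; i < file_sizes.size() && done < size; ++i) {
        const std::uint64_t file_end = file_start + file_sizes[i];
        const std::uint64_t pos = offset + done;  // next logical byte needed
        if (pos < file_end) {
            const std::uint64_t len = std::min(size - done, file_end - pos);
            pieces.push_back({i, pos - file_start, done, len});
            done += len;
        }
        file_start = file_end;
    }
    if (done != size) {
        throw std::out_of_range("request exceeds available data");
    }
    return pieces;
}
```

Keeping this function free of I/O makes it easy to unit test the boundary cases; ReadData would then loop over the pieces and issue one fread() per piece.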

Fastest way of reading a file in Linux?

On Linux what would be the fastest way of reading a file in to an array of bytes/to process the bytes? This can include memory-mapping, sys calls etc. I am not familiar with the many Linux-specific functions.
In the past I have used boost memory mapping, but I need faster Linux-specific performance rather than portability.

mmap should be the fastest way to access the contents of a file if the file is large enough. There's an initial cost for setting up the memory mappings, but that's offset by not needing to copy the data from the page cache into userland. And if you want all the contents of the file, the cost to allocate the memory to your program should be more or less the same as the cost of mmap.
Your best bet, as always, is to test and benchmark.

Don't let yourself get fooled by lazy stuff like memory mapping. Rather focus on what you really need. Do you really need to read the whole file into memory? Then the straight-forward way of opening, reading chunks in a loop, and closing the file will be as fast as it can be done.
But often you don't really want that. Instead you might want to read specific parts, a block here, a block there, jump through the file, read a block at a specific position, etc.
Then still fseeking out those positions and freading the blocks won't have overheads worth mentioning. But it can be more convenient to use memory mapping to let the operating system or a library deal with stuff like memory allocation etc. It won't get the job done faster, though.
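The seek-and-read pattern described here can be sketched as follows. read_block is a hypothetical helper, and the long offset limits it to files that fseek can address on your platform.

```cpp
#include <cstddef>
#include <cstdio>
#include <stdexcept>
#include <vector>

// Reads up to `size` bytes starting at byte `offset` of an open file.
// The returned vector is shorter than `size` if the file ends early.
std::vector<unsigned char> read_block(std::FILE* f, long offset, std::size_t size) {
    if (std::fseek(f, offset, SEEK_SET) != 0) {
        throw std::runtime_error("fseek failed");
    }
    std::vector<unsigned char> block(size);
    const std::size_t got = std::fread(block.data(), 1, size, f);
    block.resize(got);  // fewer bytes than requested means EOF or an error
    return block;
}
```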

What is the most efficient way to read formatted data from a large file?

Options:
1. Reading the whole file into one huge buffer and parsing it afterwards.
2. Mapping the file to virtual memory.
3. Reading the file in chunks and parsing them one by one.
The file can contain quite arbitrary data but it's mostly numbers, values, strings and so on formatted in certain ways (commas, brackets, quotations, etc).
Which option would give me greatest overall performance?

If the file is very large, then you might consider using multiple threads with option 2 or 3. Each thread can handle a single chunk of file/memory and you can overlap IO and computation (parsing) this way.

It's hard to give a general answer to your question as choosing the "right" strategy heavily depends on the organization of the data you are reading.
Especially if there's a really huge amount of data to be processed options 1. and 2. won't work anyways as the available amount of main memory poses an upper limit to any attempt like this.
Most probably the biggest gain in terms of efficiency can be accomplished by (re)structuring the data you are going to process.
Checking if there is any chance to organize the data in a way to save from needlessly processing whole chunks would be the primary spot I'd try to improve upon before addressing the problem mentioned in the question.
In terms of efficiency there's nothing but a constant to win in choosing any of the mentioned methods while on the other hand there might be much better improvement with the right organization of your data. The bigger the data the more important your decision will get.
Some facts about the data that seem interesting enough to take into consideration include:
Is there any regular pattern to the data you are going to process ?
Is the data mostly static or highly dynamic?
Does it have to be parsed sequentially or is it possible to process data in parallel?

It makes no sense to read the entire file all at once and then convert from text to binary data; it's more convenient to write, but you run out of memory faster. I would read the text in chunks and convert as you go. The converted data, in binary format instead of text, will likely take up less space than the original source text anyway.

Speed to create and read data

I have some small questions about the speed to create and read data in C/C++:
=> If I need to fill an array of any type (think of a 2048*2048 array), is filling each cell in a loop faster than loading it from a file? (excluding the time spent opening and closing the file).
=> If I keep the data in a separate file and read it, does that cost the same as reading it from the original file? (Imagine that I need to fill an array: is it better to have the array filled in the main program, or can I read it from an external file without loss? Again excluding the time to open and close the file.)
=> Is memcpy still fast if I need to copy a lot of data?

The file operations will be MANY MANY MANY Times slower than memory operations.
memcpy is up to the compiler, but yes, in general it will do it quicker or just the same as you could without resorting to assembly.

If I need to fill an array of any type (think of a 2048*2048 array), is filling each cell in a loop faster than loading it from a file? (excluding the time spent opening and closing the file).
Where does the data come from if you don't read it from a file? In general, though, reading from a file is extremely slow. An access to main memory is nearly instantaneous, while the same operation on a file can be 1000x slower or more. In practice, avoid reading from a file unless it is necessary.
Is memcpy still fast if I need to copy a lot of data?
Yes. It is often faster than a hand-written loop, depending on your compiler and hardware, because memcpy implementations can use special CPU instructions such as SIMD (single instruction, multiple data). If your CPU does not have them, the compiler still provides a compatible fallback.

In-memory operations are many orders faster than FILE IO operations, but you might be able to utilise a half way house.
Memory Mapped files use OS technology to map the contents of the file directly to memory without you having to read and copy each byte. You can then read/write the memory as normal. It's the basis of virtual memory in many architectures and as such is highly optimised and performant.

c++: how to optimize IO?

I am working on a mathematical problem that has the advantage of being able to "pre-compute" about half of the problem, save this information to file, and then reuse it many times to compute various 'instances' of my problem. The difficulty is that uploading all of this information in order to solve the actual problem is a major bottleneck.
More specifically:
I can pre-compute a huge amount of information - tons of probabilities (long double), a ton of std::map<int,int>, and much more - and save all this stuff to disk (several Gb).
The second half of my program accepts an input argument D. For each D, I need to perform a great many computations that involve a combination of the pre-computed data (from file), and some other data that are specific to D (so that the problem is different for each D).
Sometimes I will need to pick out certain pieces of pre-computed information from the files. Other times, I will need to upload every piece of data from a (large) file.
Are there any strategies for making the IO faster?
I already have the program parallelized (MPI, via boost::mpi) for other reasons, but regardless, accessing files on the disk is making my compute time unbearable.
Any strategies or optimizations?
Currently I am doing everything with cstdio, i.e. no iostream. Will that make a big difference?

Certainly the fastest (but the most fragile) solution would be to mmap the data to a fixed address. Slap it all in one big struct, and instantiate the std::map with an allocator which will allocate in a block attached to the end of the struct. It's not simple, but it will be fast; one call to mmap, and the data is in your (virtual) memory. And because you're forcing the address in mmap, you can even store the pointers, etc.
As mentioned above, in addition to requiring a fair amount of work, it's fragile. Recompile your application, and the targeted address might not be available, or the layout might be different, or whatever. But since it's really just an optimization, this might not be an issue; anytime a compatibility issue arises, just drop the old file and start over. It will make the first run after a change which breaks compatibility extremely slow, but if you don't break compatibility too often...

The stuff that isn't in a map is easy. You put everything in one contiguous chunk of memory that you know (like a big array, or a struct/class with no pointers), and then use write() to write it out. Later use read() to read it in, in a single operation. If the size might vary, then use one operation to read a single int with the size, allocate the memory, and then use a single read() to pull it in.
The map part is a bit harder, since you can't do it all in one operation. Here you need to come up with a convention for serializing it. To make the i/o as fast as possible, your best bet is to convert it from the map to an in-memory form that is all in one place and you can convert back to the map easily and quickly. If, for example your keys are ints, and your values are of constant size then you could make an array of keys, and an array of values, copy your keys into the one array and values into the other, and then write() the two arrays, possibly writing out their size as well. Again, you read things in with only two or three calls to read().
Note that nothing ever got translated to ASCII, and there are a minimum number of system calls. The file will not be human readable, but it will be compact, and fast to read in. Three things make i/o slow: 1) system calls, if you use small reads/writes; 2) translation to/from ASCII (printf, scanf); 3) disk speed. Hard to do much about 3) (other than an SSD). You can do the read in a background thread, but you might need to block waiting for the data to be in.

Some guidelines:
multiple calls to read() are more expensive than single call
binary files are faster than text files
single file is faster than multiple files for large values of "multiple"
use memory-mapped files if you can
use 64 bit OS to let OS manage the memory for you
Ideally, I'd try to put all long doubles into memory-mapped file, and all maps into binary files.
Divide and conquer: if 64 bits is not an option, try to break your data into large chunks in such a way that all chunks are never used together, and a chunk is needed in its entirety whenever it is needed. This way you can load chunks when they are needed and discard them when they are not.

These suggestions of uploading the whole data to the RAM are good when two conditions are met:
The sum of all I/O times during the run is much greater than the cost of loading all the data into RAM
A relatively large portion of all the data is accessed during the application run
(they are usually met when some application is running for a long time processing different data)
However for other cases other options might be considered.
E.g. it is essential to understand whether the access pattern is truly random. If it is not, look into reordering the data so that items that are accessed together are close to each other. This will ensure that OS caching performs at its best, and will also reduce HDD seek times (not an issue for SSDs, of course).
If accesses are truly random, and the application does not run long enough to amortize the one-time data loading cost, I would look into the architecture, e.g. by extracting this data manager into a separate module that keeps the data preloaded.
For Windows it might be system service, for other OSes other options are available.

Cache, cache, cache. If it's only several GB it should be feasible to cache most if not all of your data in something like memcached. This is an especially good solution if you're using MPI across multiple machines rather than just multiple processors on the same machine.
If it's all running on the same machine, consider a shared memory cache if you have the memory available.
Also, make sure your file writes are being done on a separate thread. No need to block an entire process waiting for a file to write.

As was said, cache as much as you can in memory.
If you're finding that the amount you need to cache is larger than your memory will allow, try swapping the caches between memory and disk, the way virtual memory pages are swapped to disk. It is essentially the same problem.
One common method is the Least Recently Used Algorithm for determining which page will be swapped.
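A compact sketch of a least-recently-used cache, built from a std::list for recency order plus a std::unordered_map for lookup. LruCache and its interface are illustrative, not taken from any particular library.

```cpp
#include <cstddef>
#include <list>
#include <unordered_map>
#include <utility>

template <typename K, typename V>
class LruCache {
public:
    explicit LruCache(std::size_t capacity) : capacity_(capacity) {}

    // Returns a pointer to the cached value, or nullptr on a miss.
    // A hit moves the entry to the front (most recently used).
    const V* get(const K& key) {
        auto it = index_.find(key);
        if (it == index_.end()) return nullptr;
        items_.splice(items_.begin(), items_, it->second);
        return &it->second->second;
    }

    // Inserts or updates an entry, evicting the least recently used
    // entry when the cache is full.
    void put(const K& key, V value) {
        auto it = index_.find(key);
        if (it != index_.end()) {
            it->second->second = std::move(value);
            items_.splice(items_.begin(), items_, it->second);
            return;
        }
        if (capacity_ == 0) return;
        if (items_.size() == capacity_) {
            index_.erase(items_.back().first);
            items_.pop_back();
        }
        items_.emplace_front(key, std::move(value));
        index_[key] = items_.begin();
    }

private:
    using Item = std::pair<K, V>;
    std::size_t capacity_;
    std::list<Item> items_;  // front = most recently used
    std::unordered_map<K, typename std::list<Item>::iterator> index_;
};
```

On a miss, the caller would load the value from disk and put() it; std::list::splice keeps the stored iterators valid, so both operations run in constant time on average.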

It really depends on how much memory is available and what the access pattern is.
The simplest solution is to use memory mapped files. This generally requires that the file has been laid out as if the objects were in memory, so you will need to only use POD data with no pointers (but you can use relative indexes).
You need to study your access pattern to see if you can group together the values that are often used together. This will help the OS in better caching those values (ie, keeping them in memory for you, rather than always going to the disk to read them).
Another option will be to split the file into several chunks, preferably in a logical way. It might be necessary to create an index file that map a range of values to the file that contain them.
Then, you can only access the set of files required.
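One way to picture such an index, assuming integer keys and chunks that each cover a contiguous key range: store the first key of every chunk in a sorted map and look up the chunk with upper_bound. ChunkIndex, file_for_key and the file names in the usage note are hypothetical.

```cpp
#include <map>
#include <string>

// Maps the first key stored in each chunk to that chunk's file name.
using ChunkIndex = std::map<int, std::string>;

// Returns the file whose range would contain `key`, or an empty string
// if `key` is smaller than the first key of the first chunk.
std::string file_for_key(const ChunkIndex& index, int key) {
    auto it = index.upper_bound(key);  // first chunk that starts after key
    if (it == index.begin()) return std::string();
    --it;                              // last chunk that starts at or before key
    return it->second;
}
```

For example, with an index of {0: "chunk0.bin", 1000: "chunk1.bin"}, key 999 resolves to chunk0.bin and key 1000 to chunk1.bin.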
Finally, for complex data structures (where memory mapped files fail) or for sparse reading (when you only ever extract a small piece of information from a given file), it might be interesting to read about LRU caches.
The idea will be to use serialization and compression. You write several files, among which an index, and compress all of them (zip). Then, at launch time, you start by loading the index and save it in memory.
Whenever you need to access a value, you first try your cache. If it is not there, you access the file that contains it, decompress it in memory, and dump its content into your cache. Note: if the cache is too small, you have to be picky about what you dump in... or reduce the size of the files.
The frequently accessed values will stay in cache, avoiding unnecessary round-trip, and because the file is zipped there will be less IO.

Structure your data in a way that caching can be effective. For instance, when you are reading "certain pieces," if those are all contiguous it won't have to seek around the disk to gather all of them.
Reading and writing in batches, instead of record by record will help if you are sharing disk access with another process.

More specifically: I can pre-compute a huge amount of information - tons of probabilities (long double), a ton of std::map, and much more - and save all this stuff to disk (several Gb).
As far as I understand, the std::map objects are also pre-calculated and there are no insert/remove operations, only searches. How about replacing the maps with a hash table such as std::unordered_map (the standard successor to the non-standard hash_map) or sparsehash? In theory this can give a performance gain.

More specifically: I can pre-compute a huge amount of information - tons of probabilities (long double), a ton of std::map, and much more - and save all this stuff to disk (several Gb).
Don't reinvent the wheel. I'd suggest using a key-value data store, such as berkeley db: http://docs.oracle.com/cd/E17076_02/html/gsg/C/concepts.html
This will enable saving and sharing the files, caching the parts you actually use a lot and keeping other parts on disk.
