I have an input file that is 40,000 columns by 2 million rows. The file is roughly 70GB and thus too large to fit in memory in one go.
I need to effectively transpose this file; however, some lines are junk and should not be added to the output.
My current implementation uses ifstream and a nested getline, which effectively reads the whole file into memory (and thus lets the OS handle memory management) and then outputs the transpose. This works in an acceptable timescale but obviously gives the application a large memory footprint.
I now have to run this program on a cluster which makes me specify memory requirements ahead of time, and thus a large memory footprint increases job queuing time in the cluster.
I feel there has to be a more memory efficient approach to doing this. One thought I had was using mmap, which would allow me to do the transposition without reading the file into memory at all. Are there any other alternatives?
To be clear, I am happy to use any language and any method that can do this in a reasonable amount of time (my current program takes around 4 minutes on this large file on a local workstation).
Thanks
I would probably do this with a pre-processing pass over the file, that only needs to have a line at a time in its working set.
Filter the junk and make every line the same (binary) size.
Now, you can memory map the temp file, and stride the columns as rows for the output.
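A rough sketch of that last step, assuming the pre-processing pass has produced a temp file of fixed-size binary records (here one byte per cell; the dimensions, file names, and POSIX calls are my assumptions, not code from the question):

#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <vector>

int main() {
    // Placeholders: adjust the dimensions and file names to your data.
    const size_t nRows = 2000000, nCols = 40000;
    const size_t nBytes = nRows * nCols;              // assumes one byte per cell after cleaning

    int fd = open("clean.bin", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    void* p = mmap(nullptr, nBytes, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    const char* data = static_cast<const char*>(p);

    FILE* out = fopen("transposed.bin", "wb");
    if (!out) { perror("fopen"); return 1; }

    std::vector<char> row(nRows);                     // one output row = one input column
    for (size_t c = 0; c < nCols; ++c) {
        for (size_t r = 0; r < nRows; ++r)
            row[r] = data[r * nCols + c];             // stride down the column
        fwrite(row.data(), 1, nRows, out);
    }
    fclose(out);
    munmap(p, nBytes);
    close(fd);
    return 0;
}

The column-wise strides rely entirely on the OS page cache, so blocking the loops (processing a band of columns per pass) would improve locality; the sketch keeps it simple.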
I think the best way to do this would be to parse each line as you read it and decide whether it is junk; only the remaining lines go into the output. This may take more time, but it saves a lot of memory by never holding lines that are useless to the output you are trying to produce. That said, using mmap would also be a good way to achieve your goal.
Hope this helps!!
I would like to ask if anybody sees a bottleneck in my code or any way to optimize it.
I am thinking about if my code has a fault somewhere or if I need to choose a completely new approach.
I have memory-mapped a file, and I need to read doubles from this memory-mapped file.
I need to do this around 100,000 times, as fast as possible.
I was expecting that it would be quite fast in Release mode, but that is not the case.
The first time I do it, it takes over 5 seconds. The next time it takes around 200 ms. This is a bit faster (I guess it has to do with the way Windows handles a memory-mapped file), but it is still too slow.
void clsMapping::FeedJoinFeaturesFromMap(vector<double>& uJoinFeatures, int uHPIndex)
{
    // Byte offset of this entry's block of 16 doubles within the mapped file.
    int iBytePos = this->Content()[uHPIndex];
    int iByteCount = 16 * sizeof(double);

    uJoinFeatures.resize(16);
    // Copy the 16 doubles straight out of the memory-mapped region.
    memcpy(&uJoinFeatures[0], &((char*)(m_pVoiceData))[iBytePos], iByteCount);
}
Does anybody see a way to improve my code? I hardcoded iByteCount, but that did not really change anything.
Thank you for your ideas.
You're reading 12.5MB of data from the file. That's not so much, but it's still not trivial.
The difference between your first and second run is probably due to file caching - the second time you want to read the file, the data is already in memory so less I/O is required.
However, 5 seconds for reading 12.5MB of data is still a lot. The only reason I can find for this is that your doubles are scattered all over the file, requiring Windows to read a lot more than 12.5MB into memory.
You could avoid memory mapping altogether. If the data is stored in order in the file (not necessarily consecutive, but in order, so you never have to seek backwards), you can skip the memory-mapped file and just seek your way to the right place and read from there.
I doubt this will help much. Other things you can do is reorder your file, if it's at all possible, or place it on an SSD.
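For completeness, a minimal sketch of the seek-and-read alternative with plain stdio; the function name is invented and error handling is trimmed, with the offset assumed to come from the same lookup as in the question:

#include <cstdio>
#include <vector>

// Read 16 doubles at a known byte offset; 'f' would be opened once with fopen("...", "rb").
bool ReadJoinFeatures(FILE* f, long bytePos, std::vector<double>& out) {
    out.resize(16);
    if (fseek(f, bytePos, SEEK_SET) != 0)
        return false;                                      // could not seek
    return fread(out.data(), sizeof(double), 16, f) == 16; // one bulk read
}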
Options:
1. Reading the whole file into one huge buffer and parsing it afterwards.
2. Mapping the file to virtual memory.
3. Reading the file in chunks and parsing them one by one.
The file can contain quite arbitrary data but it's mostly numbers, values, strings and so on formatted in certain ways (commas, brackets, quotations, etc).
Which option would give me the greatest overall performance?
If the file is very large, then you might consider using multiple threads with option 2 or 3. Each thread can handle a single chunk of file/memory and you can overlap IO and computation (parsing) this way.
It's hard to give a general answer to your question as choosing the "right" strategy heavily depends on the organization of the data you are reading.
Especially if there is a really huge amount of data to be processed, options 1 and 2 won't work anyway, as the available main memory puts an upper limit on any such attempt.
Most probably the biggest gain in terms of efficiency can be accomplished by (re)structuring the data you are going to process.
Checking whether the data can be organized so that whole chunks never need to be processed at all is where I would try to improve first, before addressing the problem mentioned in the question.
In terms of efficiency there is only a constant factor to be gained by choosing among the methods mentioned, whereas the right organization of your data can bring much larger improvements. The bigger the data, the more important that decision becomes.
Some facts about the data that seem interesting enough to take into consideration include:
Is there any regular pattern to the data you are going to process?
Is the data mostly static or highly dynamic?
Does it have to be parsed sequentially or is it possible to process data in parallel?
It makes no sense to read the entire file all at once and then convert from text to binary data; it's more convenient to write, but you run out of memory faster. I would read the text in chunks and convert as you go. The converted data, in binary format instead of text, will likely take up less space than the original source text anyway.
I'm reading in from a CSV file, parsing it, and storing the data, pretty simple.
Right now we're using the standard readLine() method to do that, and I'm trying to squeeze some extra efficiency out of this processing loop. I don't know how much they hide behind the scenes, but I assume each call to getline is a new OS call with all the pain that entails? I don't want to pay for OS calls on each line of input. I would rather provide a huge buffer and have it filled with many lines at once.
However, I only care about full lines. I don't want to have to handle maintaining partial lines from one buffer read to append to the second buffer read to make a full line, that's just ugly and annoying.
So, is there a method out there that does this for me? It seems like there almost has to be. Any method which I can instruct to read in x number of lines, or x bytes but don't output the last partial line, or even an easy way for me to manage the memory buffer so I minimize the amount of code for handling partial strings would be appreciated. I can use Boost, though if there is a method in standard C++ I would prefer that.
Thanks.
It's very unlikely that you'll be able to do better than the built-in C++ streams. They're quite fast. In general, the fastest way to completely read a file is to use a single thread to read the entire file from start to end, especially if the file is contiguous on disk. Furthermore, it's likely that the disk is much more of a bottleneck during reading than the OS. If you need to improve the performance of your app, I have a few recommendations.
Use a profiler. If your app is reading a line then parsing it or processing it in some way, it's possible that the parsing or processing is something that can be optimized. This can be determined in profiling. If parsing or processing takes up substantial CPU resources, then optimization may be worth the effort.
If you determine that parsing or processing is responsible for a slow application, and that it can't be easily optimized, consider multiprogramming. If the processing of individual lines does not depend on the results of previous lines being processed, then use multiple threads or CPUs to do the processing.
Use pipelining if you have to process multiple files. For example, suppose you have four stages in your app: reading, parsing, processing, saving. It may be more efficient to read one file at a time rather than all of them at once. However, while reading the second file, you can still parse the first one. While reading the third file, you can parse the second and process the first, etc. One way to implement this is a staged multi-threaded application design.
Use RAID to improve disk reads. Certain RAID modes give faster reads and writes.
I am a Java programmer, but I still have a hint: read the data as a stream, meaning for example 4 or 5 blocks of 2048 bytes (or much more) at a time. You can iterate over the stream (and convert it) and search for your line ends (or some other character). But I think readLine is doing the same thing anyway...
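To make the "read a block, carry the partial tail" idea from the answers concrete, here is a hedged sketch in standard C++; ReadLinesBuffered and processLine are invented names, and the CSV parsing itself is left as a callback:

#include <functional>
#include <istream>
#include <string>

void ReadLinesBuffered(std::istream& in, std::size_t bufSize,
                       const std::function<void(const std::string&)>& processLine) {
    std::string carry;                        // partial line left over from the previous block
    std::string block(bufSize, '\0');
    while (in.read(&block[0], static_cast<std::streamsize>(bufSize)) || in.gcount() > 0) {
        carry.append(block, 0, static_cast<std::size_t>(in.gcount()));
        std::size_t start = 0, nl;
        while ((nl = carry.find('\n', start)) != std::string::npos) {
            processLine(carry.substr(start, nl - start));   // hand out only complete lines
            start = nl + 1;
        }
        carry.erase(0, start);                // keep the partial tail for the next block
    }
    if (!carry.empty()) processLine(carry);   // last line without a trailing newline
}

The partial-line bookkeeping is confined to this one helper, so the calling code only ever sees full lines.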
I am working on a mathematical problem that has the advantage of being able to "pre-compute" about half of the problem, save this information to file, and then reuse it many times to compute various 'instances' of my problem. The difficulty is that uploading all of this information in order to solve the actual problem is a major bottleneck.
More specifically:
I can pre-compute a huge amount of information - tons of probabilities (long double), a ton of std::map<int,int>, and much more - and save all this stuff to disk (several Gb).
The second half of my program accepts an input argument D. For each D, I need to perform a great many computations that involve a combination of the pre-computed data (from file), and some other data that are specific to D (so that the problem is different for each D).
Sometimes I will need to pick out certain pieces of pre-computed information from the files. Other times, I will need to upload every piece of data from a (large) file.
Are there any strategies for making the IO faster?
I already have the program parallelized (MPI, via boost::mpi) for other reasons, but regardless, accessing files on the disk is making my compute time unbearable.
Any strategies or optimizations?
Currently I am doing everything with cstdio, i.e. no iostream. Will that make a big difference?
Certainly the fastest (but most fragile) solution would be to mmap the data to a fixed address. Slap it all in one big struct, and instantiate the std::map with an allocator that allocates from a block attached to the end of the struct. It's not simple, but it will be fast; one call to mmap, and the data is in your (virtual) memory. And because you're forcing the address in mmap, you can even store the pointers, etc.
As mentioned above, in addition to requiring a fair amount of work, it's fragile. Recompile your application, and the targeted address might not be available, or the layout might be different, or whatever. But since it's really just an optimization, this might not be an issue; anytime a compatibility issue arises, just drop the old file and start over. It will make the first run after a change which breaks compatibility extremely slow, but if you don't break compatibility too often...
The stuff that isn't in a map is easy. You put everything in one contiguous chunk of memory that you know (like a big array, or a struct/class with no pointers), and then use write() to write it out. Later use read() to read it in, in a single operation. If the size might vary, then use one operation to read a single int with the size, allocate the memory, and then use a single read() to pull it in.
The map part is a bit harder, since you can't do it all in one operation. Here you need to come up with a convention for serializing it. To make the I/O as fast as possible, your best bet is to convert the map to an in-memory form that is all in one place and that you can convert back to a map easily and quickly. If, for example, your keys are ints and your values are of constant size, you can make an array of keys and an array of values, copy your keys into one array and your values into the other, and then write() the two arrays, possibly writing out their size as well. Again, you read things back in with only two or three calls to read().
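A minimal sketch of that convention for a std::map<int,int>, assuming POSIX write()/read() on a file descriptor opened elsewhere and ignoring error handling for brevity:

#include <cstdint>
#include <map>
#include <unistd.h>
#include <vector>

void SaveMap(int fd, const std::map<int, int>& m) {
    uint64_t n = m.size();
    std::vector<int> keys, values;
    keys.reserve(n); values.reserve(n);
    for (const auto& kv : m) { keys.push_back(kv.first); values.push_back(kv.second); }
    write(fd, &n, sizeof n);                  // element count first
    write(fd, keys.data(), n * sizeof(int));  // all keys in one call
    write(fd, values.data(), n * sizeof(int));// all values in one call
}

std::map<int, int> LoadMap(int fd) {
    uint64_t n = 0;
    read(fd, &n, sizeof n);
    std::vector<int> keys(n), values(n);
    read(fd, keys.data(), n * sizeof(int));
    read(fd, values.data(), n * sizeof(int));
    std::map<int, int> m;
    for (uint64_t i = 0; i < n; ++i) m.emplace(keys[i], values[i]);
    return m;
}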
Note that nothing ever got translated to ASCII, and there are a minimum number of system calls. The file will not be human readable, but it will be compact, and fast to read in. Three things make i/o slow: 1) system calls, if you use small reads/writes; 2) translation to/from ASCII (printf, scanf); 3) disk speed. Hard to do much about 3) (other than an SSD). You can do the read in a background thread, but you might need to block waiting for the data to be in.
Some guidelines:
multiple calls to read() are more expensive than a single call
binary files are faster than text files
single file is faster than multiple files for large values of "multiple"
use memory-mapped files if you can
use a 64-bit OS to let the OS manage the memory for you
Ideally, I'd try to put all long doubles into memory-mapped file, and all maps into binary files.
Divide and conquer: if 64 bits is not an option, try to break your data into large chunks such that the chunks are never all used together and each chunk is needed in its entirety whenever it is needed at all. That way you can load chunks when they are needed and discard them when they are not.
These suggestions to load the whole data set into RAM are good when two conditions are met:
The sum of all I/O time during the run is much greater than the cost of loading all the data into RAM
A relatively large portion of the data is accessed during the application run
(these conditions are usually met when an application runs for a long time, processing different data)
However for other cases other options might be considered.
E.g. it is essential to understand whether the access pattern is truly random. If not, look into reordering the data so that items accessed together are close to each other. This ensures that OS caching performs at its best, and also reduces HDD seek times (not an issue for SSDs, of course).
If accesses are truly random, and the application does not run long enough to amortize the one-time data loading cost, I would look at the architecture, e.g. extracting this data access into a separate module that keeps the data preloaded.
For Windows it might be system service, for other OSes other options are available.
Cache, cache, cache. If it's only several GB it should be feasible to cache most if not all of your data in something like memcached. This is an especially good solution if you're using MPI across multiple machines rather than just multiple processors on the same machine.
If it's all running on the same machine, consider a shared memory cache if you have the memory available.
Also, make sure your file writes are being done on a separate thread. No need to block an entire process waiting for a file to write.
As was said, cache as much as you can in memory.
If you're finding that the amount you need to cache is larger than your memory will allow, try swapping the cached data between memory and disk, the way virtual memory pages are swapped to disk. It is essentially the same problem.
One common method for deciding which page to swap out is the Least Recently Used (LRU) algorithm.
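As an illustration only, a bare-bones LRU cache of disk-backed chunks; ChunkCache is an invented name and LoadChunkFromDisk is a placeholder for whatever actually reads a chunk:

#include <cstddef>
#include <list>
#include <unordered_map>
#include <utility>
#include <vector>

class ChunkCache {
public:
    explicit ChunkCache(std::size_t maxChunks) : maxChunks_(maxChunks) {}

    const std::vector<char>& Get(int chunkId) {
        auto it = index_.find(chunkId);
        if (it != index_.end()) {
            lru_.splice(lru_.begin(), lru_, it->second);   // move to most-recently-used position
            return it->second->second;
        }
        lru_.emplace_front(chunkId, LoadChunkFromDisk(chunkId));
        index_[chunkId] = lru_.begin();
        if (lru_.size() > maxChunks_) {                     // evict the least recently used chunk
            index_.erase(lru_.back().first);
            lru_.pop_back();
        }
        return lru_.front().second;
    }

private:
    // Placeholder: replace with the real read of one chunk from disk.
    std::vector<char> LoadChunkFromDisk(int /*chunkId*/) { return std::vector<char>(1 << 20); }

    using Entry = std::pair<int, std::vector<char>>;
    std::size_t maxChunks_;
    std::list<Entry> lru_;
    std::unordered_map<int, std::list<Entry>::iterator> index_;
};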
It really depends on how much memory is available and what the access pattern is.
The simplest solution is to use memory-mapped files. This generally requires that the file has been laid out as if the objects were in memory, so you will need to use only POD data with no pointers (but you can use relative indexes).
You need to study your access pattern to see if you can group together the values that are often used together. This will help the OS in better caching those values (ie, keeping them in memory for you, rather than always going to the disk to read them).
Another option is to split the file into several chunks, preferably in a logical way. It might be necessary to create an index file that maps a range of values to the file that contains them.
Then, you can only access the set of files required.
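A small sketch of such an index, assuming each chunk file covers a contiguous key range; the class, keys, and file names are invented for illustration:

#include <iterator>
#include <map>
#include <string>

class ChunkIndex {
public:
    // Register a chunk file by the first key it contains.
    void Add(long firstKeyInChunk, const std::string& fileName) {
        index_[firstKeyInChunk] = fileName;
    }

    // Returns the file whose range contains 'key' (empty if key is below the first chunk).
    std::string FileFor(long key) const {
        auto it = index_.upper_bound(key);   // first chunk that starts after key
        if (it == index_.begin()) return {};
        return std::prev(it)->second;        // the chunk that starts at or before key
    }

private:
    std::map<long, std::string> index_;
};

For example, after Add(0, "chunk0.bin") and Add(1000000, "chunk1.bin"), FileFor(123456) returns "chunk0.bin", so only that file needs to be opened.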
Finally, for complex data structures (where memory mapped files fail) or for sparse reading (when you only ever extract only a small piece of information from a given file), it might be interesting to read about LRU caches.
The idea is to use serialization and compression. You write several files, including an index, and compress all of them (zip). Then, at launch time, you load the index and keep it in memory.
Whenever you need a value, you first try your cache; if it is not there, you access the file that contains it, decompress it in memory, and dump its contents into your cache. Note: if the cache is too small, you have to be picky about what you keep in it... or reduce the size of the files.
The frequently accessed values will stay in the cache, avoiding unnecessary round-trips, and because the files are zipped there will be less I/O.
Structure your data in a way that makes caching effective. For instance, when you are reading "certain pieces," if those are all contiguous the disk won't have to seek around to gather them.
Reading and writing in batches, instead of record by record will help if you are sharing disk access with another process.
More specifically: I can pre-compute a huge amount of information - tons of probabilities (long double), a ton of std::map, and much more - and save all this stuff to disk (several Gb).
As far as I understand, the std::maps are also pre-calculated and there are no insert/remove operations, only lookups. How about replacing the maps with something like std::hash_map or sparsehash? In theory this can give a performance gain.
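For instance, using std::unordered_map from the standard library rather than the non-standard hash_map (table and Lookup are placeholder names):

#include <unordered_map>

// Lookup-only table built once from the precomputed file; average O(1)
// lookups instead of std::map's O(log n) tree walks.
std::unordered_map<int, int> table;

int Lookup(int key) {
    auto it = table.find(key);
    return it != table.end() ? it->second : -1;   // -1 as a stand-in for "not found"
}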
More specifically: I can pre-compute a huge amount of information - tons of probabilities (long double), a ton of std::map, and much more - and save all this stuff to disk (several Gb).
Don't reinvent the wheel. I'd suggest using a key-value data store, such as Berkeley DB: http://docs.oracle.com/cd/E17076_02/html/gsg/C/concepts.html
This will enable saving and sharing the files, caching the parts you actually use a lot and keeping other parts on disk.
I'm having trouble optimizing a C++ program for the fastest runtime possible.
The requirement is to output the absolute value of the difference of two long integers, fed through a file into the program, i.e.:
./myprogram < unkownfilenamefullofdata
The file name is unknown, and the file has two numbers per line, separated by a space. There is an unknown amount of test data. I created two files of test data. One has the extreme cases and is 5 runs long. For the other, I used a Java program to generate 2,000,000 random numbers and output them to a timedrun file -- 18MB worth of tests.
The massive file runs at 3.4 seconds. I need to break that down to 1.1 seconds.
This is my code:
int main() {
    long int a, b;
    // Read pairs until EOF and print the absolute difference of each pair.
    while (scanf("%li %li", &a, &b) > -1) {
        if (b >= a)
            printf("%li\n", b - a);
        else
            printf("%li\n", a - b);
    } // end while
    return 0;
} // end main
I ran Valgrind on my program, and it showed that a lot of hold-up was in the read and write portion. How would I rewrite print/scan to the most raw form of C++ if I know that I'm only going to be receiving a number? Is there a way that I can scan the number in as a binary number, and manipulate the data with logical operations to compute the difference? I was also told to consider writing a buffer, but after ~6 hours of searching the web, and attempting the code, I was unsuccessful.
Any help would be greatly appreciated.
What you need to do is load the whole file into memory as one string, and then extract the numbers from there, rather than making repeated I/O calls. However, what you may well find is that it simply takes a lot of time to load 18MB off the hard drive.
You can improve greatly on scanf because you can guarantee the format of your file. Since you know exactly what the format is, you don't need as many error checks. Also, printf does a conversion on the new line to the appropriate line break for your platform.
I have used code similar to that found in this SPOJ forum post (see nosy's post half-way down the page) to obtain quite large speed-ups in the reading integers area. You will need to modify it to deal with negative numbers. Hopefully it will give you some ideas about how to write a faster printf function as well, but I would start with replacing scanf and see how far that gets you.
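Not the code from that post, but a sketch of the same idea with negative-number handling; getchar_unlocked is POSIX, so on Windows substitute getchar or _getchar_nolock:

#include <cstdio>

// Parse one long from stdin by hand, skipping whitespace; returns false at EOF.
bool ReadLong(long& value) {
    int c = getchar_unlocked();
    while (c == ' ' || c == '\t' || c == '\n' || c == '\r') c = getchar_unlocked();
    if (c == EOF) return false;
    bool negative = (c == '-');
    if (negative) c = getchar_unlocked();
    long result = 0;
    while (c >= '0' && c <= '9') {
        result = result * 10 + (c - '0');
        c = getchar_unlocked();
    }
    value = negative ? -result : result;
    return true;
}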
As you suggest, the problem is reading all these numbers in and converting them from text to binary.
The best improvement would be to write the numbers out as binary from whatever program generates them. This will significantly reduce the amount of data that has to be read from the disk, and slightly reduce the time needed to convert from text to binary.
You say that 2,000,000 numbers occupy 18MB = 9 bytes per number. This includes the spaces and the end of line markers, so sounds reasonable.
Storing the numbers as 4-byte integers would halve the amount of data that must be read from the disk. Along with the saving on format conversion, it would be reasonable to expect a doubling of performance.
Since you need even more, something more radical is required. You should consider splitting the data into separate files, each on its own disk, and then processing each file in its own process. If you have 4 cores, split the processing into 4 separate processes, and can connect 4 high-performance disks, then you might hope for another doubling of performance. The bottleneck is now the OS disk management, and it is impossible to guess how well the OS will manage the four disks in parallel.
I assume that this is a grossly simplified model of the processing you need to do. If your description is all there is to it, the real solution would be to do the subtraction in the program that writes the test files!
Even better than opening the file in your program and reading it all at once, would be memory-mapping it. ~18MB is no problem for the ~2GB address space available to your program.
Then use strtod to read a number and advance the pointer.
I'd expect a 5-10x speedup compared to input redirection and scanf.
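A hedged sketch of that combination, using strtol rather than strtod since the inputs are long integers; the file name is invented, whereas the program in the question reads from stdin:

#include <cstdio>
#include <cstdlib>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    int fd = open("timedrun.txt", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    fstat(fd, &st);

    // Note: strtol expects a terminated buffer; if the file size is an exact
    // multiple of the page size, copy the tail or append a '\0' first.
    char* p = static_cast<char*>(
        mmap(nullptr, static_cast<size_t>(st.st_size), PROT_READ, MAP_PRIVATE, fd, 0));
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    char* end = p + st.st_size;
    while (p < end) {
        char* next;
        long a = strtol(p, &next, 10);      // parse the first number of the pair
        if (next == p) break;               // no more digits
        long b = strtol(next, &p, 10);      // parse the second, advancing the pointer
        printf("%li\n", a > b ? a - b : b - a);
    }
    return 0;
}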