I have an input file, which is 40,000 columns by 2 million rows. This file is roughly 70GB in memory and thus too large to fit in memory in one go.
I need to effectively transpose this file, however there are some lines which are junk and should not be added to the output.
My current implementation uses ifstream and nested getline calls, which effectively reads the whole file into memory (and thus lets the OS handle memory management), and then outputs the transpose. This works in an acceptable timescale, but it obviously gives the application a large memory footprint.
I now have to run this program on a cluster which makes me specify memory requirements ahead of time, and thus a large memory footprint increases job queuing time in the cluster.
I feel there has to be a more memory efficient approach to doing this. One thought I had was using mmap, which would allow me to do the transposition without reading the file into memory at all. Are there any other alternatives?
To be clear, I am happy to use any language and any method that can do this in a reasonable amount of time (my current program takes around 4 minutes on this large file on a local workstation).
Thanks
I would probably do this with a pre-processing pass over the file, that only needs to have a line at a time in its working set.
Filter the junk and make every line the same (binary) size.
Now, you can memory map the temp file, and stride the columns as rows for the output.
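The two passes described above could be sketched roughly as follows. This is only an illustration under stated assumptions: fields are whitespace-separated numbers stored as float, a "junk" line is any line that does not parse into exactly `cols` numbers, and the function names `filter_to_binary` and `transpose_mapped` are made up for this example. It uses the POSIX mmap interface, so it will not compile unchanged on Windows.

```cpp
#include <cstddef>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Pass 1: keep only lines with exactly `cols` numeric fields and append each
// one to `tmp_path` as `cols` raw floats. Returns the number of rows kept.
// The junk rule used here is an assumption; substitute the real test.
std::size_t filter_to_binary(const std::string& in_path,
                             const std::string& tmp_path, std::size_t cols) {
    std::ifstream in(in_path);
    std::ofstream tmp(tmp_path, std::ios::binary);
    std::string line;
    std::vector<float> row;
    std::size_t kept = 0;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        row.clear();
        float v;
        while (fields >> v) row.push_back(v);
        // Wrong field count, or parsing stopped before the end of the line.
        if (row.size() != cols || !fields.eof()) continue;
        tmp.write(reinterpret_cast<const char*>(row.data()),
                  static_cast<std::streamsize>(cols * sizeof(float)));
        ++kept;
    }
    return kept;
}

// Pass 2: map the temp file and write column c of every row as output line c.
bool transpose_mapped(const std::string& tmp_path, const std::string& out_path,
                      std::size_t rows, std::size_t cols) {
    if (rows == 0 || cols == 0) return false;  // mmap rejects a zero length
    int fd = open(tmp_path.c_str(), O_RDONLY);
    if (fd < 0) return false;
    const std::size_t bytes = rows * cols * sizeof(float);
    void* p = mmap(nullptr, bytes, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (p == MAP_FAILED) return false;
    const float* data = static_cast<const float*>(p);
    std::ofstream out(out_path);
    for (std::size_t c = 0; c < cols; ++c) {
        for (std::size_t r = 0; r < rows; ++r) {
            if (r) out << ' ';
            out << data[r * cols + c];  // stride of `cols` floats between rows
        }
        out << '\n';
    }
    out.flush();
    munmap(p, bytes);
    return static_cast<bool>(out);
}
```

Pass 1 only ever holds one line in memory. Pass 2 touches one float per row for each output line, so on a file this size it would probably pay to handle a block of adjacent columns per sweep rather than a single column.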
I think that the best way for you to do this would be to instead parse each line and find out whether it is junk or not. After this, you could put the remaining lines into output. This may take more time, but it would save a lot of memory, since you would no longer hold lines that are completely useless to the output you are trying to produce. However, using mmap would also be a great way to achieve your goal.
Hope this helps!!
I'm working on a program and my "database" is some .csv files.
I have a list of objects in a .csv with some information about each. Which is the most appropriate way to treat the "data"?
A. Work with fstream, meaning that every time I want to modify or read the data I work directly on the files with the fstream tools.
B. Or, at the beginning of the program, load the data into a vector, read and write on the vector, and at the end of the program delete the previous file and write the new one.
Will there be a difference in performance, considering that the objects are numerous?
I think that it is more of a combination than a choice between A and B. Choosing only A in particular is not safe, as multiple components could access the file simultaneously. Plus, if there are many updates, then using streams each time could make your code very slow.
Therefore I believe that you should use B but also take care of the implementation to persist your data in a safe way (write your data in the file).
Regarding the data structure, this depends on the usage. One important question to ask here is whether there are many insertions and deletions. If this is the case then it may be more efficient to use a list instead of a vector, as a list provides constant-time insertion at a known position, whereas inserting into the middle of a vector has to shift every later element.
If the data include a unique attribute and fast lookups are needed then a hash or a map would be more suitable.
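As a rough illustration of option B with a lookup table on top, here is a minimal sketch. The record layout (`id,name,quantity`), the `Item` struct and the function names are all hypothetical, and the parser assumes plain fields without quotes or embedded commas. Saving goes through a temporary file that is then renamed over the original, so a crash in the middle of writing does not leave a half-written "database".

```cpp
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

struct Item {  // hypothetical record: "id,name,quantity"
    int id;
    std::string name;
    int quantity;
};

// Loads every well-formed line into memory. std::stoi throws on a
// non-numeric field; a real program would catch that or validate first.
std::vector<Item> load_items(const std::string& path) {
    std::vector<Item> items;
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream row(line);
        std::string id, name, qty;
        if (std::getline(row, id, ',') && std::getline(row, name, ',') &&
            std::getline(row, qty)) {
            items.push_back({std::stoi(id), name, std::stoi(qty)});
        }
    }
    return items;
}

// Hash index from the unique id to the position in the vector.
std::unordered_map<int, std::size_t> index_by_id(const std::vector<Item>& items) {
    std::unordered_map<int, std::size_t> idx;
    for (std::size_t i = 0; i < items.size(); ++i) idx[items[i].id] = i;
    return idx;
}

// Writes to "<path>.tmp" first, then renames it over the old file.
bool save_items(const std::vector<Item>& items, const std::string& path) {
    const std::string tmp = path + ".tmp";
    {
        std::ofstream out(tmp, std::ios::trunc);
        for (const Item& it : items)
            out << it.id << ',' << it.name << ',' << it.quantity << '\n';
        out.flush();
        if (!out) return false;
    }
    return std::rename(tmp.c_str(), path.c_str()) == 0;
}
```

On POSIX systems the rename replaces the old file in one step; on Windows `std::rename` typically fails if the target already exists, so the old file would have to be removed first.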
Take the CSV parser from my CSVtoC utility.
http://www.malcolmmclean.site11.com/www/
CSV files are not good for dynamic update as the records are not fixed in known disk physical locations. (The alternative is to contrive the CSV so that that doesn't hold, but it's a delicate and messy approach).
Reading a CSV is hard, writing one is trivial.
Options:
1. Reading the whole file into one huge buffer and parsing it afterwards.
2. Mapping the file to virtual memory.
3. Reading the file in chunks and parsing them one by one.
The file can contain quite arbitrary data but it's mostly numbers, values, strings and so on formatted in certain ways (commas, brackets, quotations, etc).
Which option would give me greatest overall performance?
If the file is very large, then you might consider using multiple threads with option 2 or 3. Each thread can handle a single chunk of file/memory and you can overlap IO and computation (parsing) this way.
It's hard to give a general answer to your question as choosing the "right" strategy heavily depends on the organization of the data you are reading.
Especially if there's a really huge amount of data to be processed, options 1 and 2 won't work anyway, as the available amount of main memory poses an upper limit on any attempt like this.
Most probably the biggest gain in terms of efficiency can be accomplished by (re)structuring the data you are going to process.
Checking if there is any chance to organize the data in a way to save from needlessly processing whole chunks would be the primary spot I'd try to improve upon before addressing the problem mentioned in the question.
In terms of efficiency there's nothing but a constant factor to win by choosing any one of the mentioned methods, while on the other hand there might be a much bigger improvement from the right organization of your data. The bigger the data, the more important this decision gets.
Some facts about the data that seem interesting enough to take into consideration include:
Is there any regular pattern to the data you are going to process?
Is the data mostly static or highly dynamic?
Does it have to be parsed sequentially or is it possible to process data in parallel?
It makes no sense to read the entire file all at once and then convert from text to binary data; it's more convenient to write, but you run out of memory faster. I would read the text in chunks and convert as you go. The converted data, in binary format instead of text, will likely take up less space than the original source text anyway.
I'm reading in from a CSV file, parsing it, and storing the data, pretty simple.
Right now we're using the standard readLine() method to do that, and I'm trying to squeeze some extra efficiency out of this processing loop. I don't know how much they hide behind the scenes, but I assume each call to getLine is a new OS call with all the pain that entails? I don't want to pay for OS calls on each line of input. I would rather provide a huge buffer and have it filled with many lines at once.
However, I only care about full lines. I don't want to have to handle maintaining partial lines from one buffer read to append to the second buffer read to make a full line, that's just ugly and annoying.
So, is there a method out there that does this for me? It seems like there almost has to be. Any method which I can instruct to read in x number of lines, or x bytes but don't output the last partial line, or even an easy way for me to manage the memory buffer so I minimize the amount of code for handling partial strings would be appreciated. I can use Boost, though if there is a method in standard C++ I would prefer that.
Thanks.
It's very unlikely that you'll be able to do better than the built-in C++ streams. They're quite fast. In general, the fastest way to completely read a file is to use a single thread to read the entire file from start to end, especially if the file is contiguous on disk. Furthermore, it's likely that the disk is much more of a bottleneck during reading than the OS. If you need to improve the performance of your app, I have a few recommendations.
Use a profiler. If your app is reading a line then parsing it or processing it in some way, it's possible that the parsing or processing is something that can be optimized. This can be determined in profiling. If parsing or processing takes up substantial CPU resources, then optimization may be worth the effort.
If you determine that parsing or processing is responsible for a slow application, and that it can't be easily optimized, consider multiprogramming. If the processing of individual lines does not depend on the results of previous lines being processed, then use multiple threads or CPUs to do the processing.
Use pipelining if you have to process multiple files. For example, suppose you have four stages in your app: reading, parsing, processing, saving. It may be more efficient to read one file at a time rather than all of them at once. However, while reading the second file, you can still parse the first one. While reading the third file, you can parse the second file and process the first one, etc. One way to implement this is a staged multi-threaded application design.
Use RAID to improve disk reads. Certain RAID modes can create faster reads and writes.
I am a Java programmer, but I still have a hint: read the data as a stream of blocks, for example 4 or 5 times 2048 bytes (or much more). You can iterate over each block (and convert it) while searching for your line ends (or some other character). But I think "readLine" is doing the same anyway.
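In C++ the block-wise idea might look something like the sketch below. It is an illustration only: the function name `for_each_line_blockwise` and the callback interface are invented for this example, not a standard or Boost facility. The caller only ever sees complete lines; the unfinished tail of each block is carried over to the next one internally.

```cpp
#include <cstddef>
#include <fstream>
#include <functional>
#include <string>
#include <vector>

// Reads `path` in blocks of `block_size` bytes and calls `on_line` once per
// complete line (without the '\n'). Returns the number of lines delivered.
std::size_t for_each_line_blockwise(
        const std::string& path, std::size_t block_size,
        const std::function<void(const std::string&)>& on_line) {
    if (block_size == 0) return 0;
    std::ifstream in(path, std::ios::binary);
    std::vector<char> block(block_size);
    std::string carry;  // unfinished tail of the previous block
    std::size_t count = 0;
    while (in) {
        in.read(block.data(), static_cast<std::streamsize>(block.size()));
        const std::size_t got = static_cast<std::size_t>(in.gcount());
        std::size_t start = 0;
        for (std::size_t i = 0; i < got; ++i) {
            if (block[i] != '\n') continue;
            carry.append(block.data() + start, i - start);
            on_line(carry);  // a full line, possibly spanning two blocks
            carry.clear();
            start = i + 1;
            ++count;
        }
        carry.append(block.data() + start, got - start);  // keep partial line
    }
    if (!carry.empty()) {  // last line had no trailing '\n'
        on_line(carry);
        ++count;
    }
    return count;
}
```

A call such as `for_each_line_blockwise("data.csv", 1 << 20, parse_line)` would read one mebibyte at a time, where `data.csv` and `parse_line` stand for your own file and parser. Windows-style "\r\n" endings are not stripped here. Whether this beats `std::getline` is something only a profiler can tell, since the stream already buffers its reads.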
I am working on a mathematical problem that has the advantage of being able to "pre-compute" about half of the problem, save this information to file, and then reuse it many times to compute various 'instances' of my problem. The difficulty is that loading all of this information back from disk in order to solve the actual problem is a major bottleneck.
More specifically:
I can pre-compute a huge amount of information - tons of probabilities (long double), a ton of std::map<int,int>, and much more - and save all this stuff to disk (several Gb).
The second half of my program accepts an input argument D. For each D, I need to perform a great many computations that involve a combination of the pre-computed data (from file), and some other data that are specific to D (so that the problem is different for each D).
Sometimes I will need to pick out certain pieces of pre-computed information from the files. Other times, I will need to load every piece of data from a (large) file.
Are there any strategies for making the IO faster?
I already have the program parallelized (MPI, via boost::mpi) for other reasons, but regardless, accessing files on the disk is making my compute time unbearable.
Any strategies or optimizations?
Currently I am doing everything with cstdio, i.e. no iostream. Will that make a big difference?
Certainly the fastest (but the most fragile) solution would be to mmap the data to a fixed address. Slap it all in one big struct, and instantiate the std::map with an allocator which will allocate in a block attached to the end of the struct. It's not simple, but it will be fast; one call to mmap, and the data is in your (virtual) memory. And because you're forcing the address in mmap, you can even store the pointers, etc.
As mentioned above, in addition to requiring a fair amount of work, it's fragile. Recompile your application, and the targeted address might not be available, or the layout might be different, or whatever. But since it's really just an optimization, this might not be an issue; anytime a compatibility issue arises, just drop the old file and start over. It will make the first run after a change which breaks compatibility extremely slow, but if you don't break compatibility too often...
The stuff that isn't in a map is easy. You put everything in one contiguous chunk of memory that you know (like a big array, or a struct/class with no pointers), and then use write() to write it out. Later use read() to read it in, in a single operation. If the size might vary, then use one operation to read a single int with the size, allocate the memory, and then use a single read() to pull it in.
The map part is a bit harder, since you can't do it all in one operation. Here you need to come up with a convention for serializing it. To make the i/o as fast as possible, your best bet is to convert it from the map to an in-memory form that is all in one place and you can convert back to the map easily and quickly. If, for example your keys are ints, and your values are of constant size then you could make an array of keys, and an array of values, copy your keys into the one array and values into the other, and then write() the two arrays, possibly writing out their size as well. Again, you read things in with only two or three calls to read().
Note that nothing ever got translated to ASCII, and there are a minimum number of system calls. The file will not be human readable, but it will be compact, and fast to read in. Three things make i/o slow: 1) system calls, if you use small reads/writes; 2) translation to/from ASCII (printf, scanf); 3) disk speed. Hard to do much about 3) (other than an SSD). You can do the read in a background thread, but you might need to block waiting for the data to be in.
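A minimal sketch of that scheme follows. It uses fwrite/fread from cstdio (which the question says is already in use) in place of the raw write()/read() calls, and the helper names `write_array`, `read_array`, `write_map` and `read_map` are hypothetical. Each array is stored as a 64-bit element count followed by the raw bytes, so a map costs two counts and two bulk reads.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <map>
#include <type_traits>
#include <vector>

// Writes a 64-bit element count followed by the raw bytes of the elements.
template <typename T>
bool write_array(std::FILE* f, const std::vector<T>& v) {
    static_assert(std::is_trivially_copyable<T>::value, "raw I/O needs plain data");
    const std::uint64_t n = v.size();
    if (std::fwrite(&n, sizeof n, 1, f) != 1) return false;
    return v.empty() || std::fwrite(v.data(), sizeof(T), v.size(), f) == v.size();
}

// Reads the count, allocates once, then pulls all elements in one call.
template <typename T>
bool read_array(std::FILE* f, std::vector<T>& v) {
    static_assert(std::is_trivially_copyable<T>::value, "raw I/O needs plain data");
    std::uint64_t n = 0;
    if (std::fread(&n, sizeof n, 1, f) != 1) return false;
    v.resize(static_cast<std::size_t>(n));  // trusts the file; no sanity limit
    return v.empty() || std::fread(v.data(), sizeof(T), v.size(), f) == v.size();
}

// Flattens the map into a key array and a value array before writing.
bool write_map(std::FILE* f, const std::map<int, int>& m) {
    std::vector<int> keys, values;
    keys.reserve(m.size());
    values.reserve(m.size());
    for (const auto& kv : m) {
        keys.push_back(kv.first);
        values.push_back(kv.second);
    }
    return write_array(f, keys) && write_array(f, values);
}

// Rebuilds the map; keys arrive sorted, so hinting at end() keeps each
// insertion cheap.
bool read_map(std::FILE* f, std::map<int, int>& m) {
    std::vector<int> keys, values;
    if (!read_array(f, keys) || !read_array(f, values) ||
        keys.size() != values.size())
        return false;
    m.clear();
    for (std::size_t i = 0; i < keys.size(); ++i)
        m.emplace_hint(m.end(), keys[i], values[i]);
    return true;
}
```

The files produced this way are only meant to be read back by the same program on the same kind of machine, since the byte layout of long double and int differs between platforms.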
Some guidelines:
multiple calls to read() are more expensive than a single call
binary files are faster than text files
single file is faster than multiple files for large values of "multiple"
use memory-mapped files if you can
use 64 bit OS to let OS manage the memory for you
Ideally, I'd try to put all long doubles into memory-mapped file, and all maps into binary files.
Divide and conquer: if 64 bits is not an option, try to break your data into large chunks in such a way that all chunks are never used together, and the entire chunk is needed when it's needed. This way you could load the chunks when they are needed and discard them when they are not.
These suggestions of loading the whole data into RAM are good when two conditions are met:
The sum of all I/O times during the run is much more than the cost of loading all data into RAM
A relatively large portion of all data is accessed during the application run
(they are usually met when some application is running for a long time processing different data)
However for other cases other options might be considered.
E.g. it is essential to understand whether the access pattern is truly random. If not, look into reordering the data to ensure that items that are accessed together are close to each other. This will ensure that OS caching is performing at its best, and will also reduce HDD seek times (not an issue for an SSD of course).
If accesses are truly random, and the application does not run long enough to amortize the one-time data loading cost, I would look into the architecture, e.g. by extracting this data manager into a separate module that will keep the data preloaded.
For Windows it might be system service, for other OSes other options are available.
Cache, cache, cache. If it's only several GB it should be feasible to cache most if not all of your data in something like memcached. This is an especially good solution if you're using MPI across multiple machines rather than just multiple processors on the same machine.
If it's all running on the same machine, consider a shared memory cache if you have the memory available.
Also, make sure your file writes are being done on a separate thread. No need to block an entire process waiting for a file to write.
As was said, cache as much as you can in memory.
If you're finding that the amount you need to cache is larger than your memory will allow, try swapping the caches between memory and disk, the way virtual memory pages are swapped to disk. It is essentially the same problem.
One common method is the Least Recently Used Algorithm for determining which page will be swapped.
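For reference, a least-recently-used cache can be sketched in a few lines with a list plus a hash map. This is a generic illustration, not code from the question: the class name `LruCache` and its interface are invented here, and in the use case above the value would be a loaded chunk of pre-computed data.

```cpp
#include <cstddef>
#include <list>
#include <unordered_map>
#include <utility>

template <typename Key, typename Value>
class LruCache {
public:
    explicit LruCache(std::size_t capacity) : capacity_(capacity) {}

    // Returns a pointer to the cached value, or nullptr on a miss.
    Value* get(const Key& key) {
        auto it = index_.find(key);
        if (it == index_.end()) return nullptr;
        order_.splice(order_.begin(), order_, it->second);  // now most recent
        return &it->second->second;
    }

    // Inserts or updates an entry, evicting the least recently used one
    // when the cache is full.
    void put(const Key& key, Value value) {
        auto it = index_.find(key);
        if (it != index_.end()) {
            it->second->second = std::move(value);
            order_.splice(order_.begin(), order_, it->second);
            return;
        }
        if (capacity_ == 0) return;
        if (order_.size() == capacity_) {
            index_.erase(order_.back().first);  // back = least recently used
            order_.pop_back();
        }
        order_.emplace_front(key, std::move(value));
        index_[key] = order_.begin();
    }

    std::size_t size() const { return order_.size(); }

private:
    using Entry = std::pair<Key, Value>;
    std::size_t capacity_;
    std::list<Entry> order_;  // front = most recently used
    std::unordered_map<Key, typename std::list<Entry>::iterator> index_;
};
```

Both `get` and `put` run in expected constant time, because splicing a list node to the front does not invalidate the iterators stored in the hash map.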
It really depends on how much memory is available and what the access pattern is.
The simplest solution is to use memory mapped files. This generally requires that the file has been laid out as if the objects were in memory, so you will need to only use POD data with no pointers (but you can use relative indexes).
You need to study your access pattern to see if you can group together the values that are often used together. This will help the OS in better caching those values (ie, keeping them in memory for you, rather than always going to the disk to read them).
Another option would be to split the file into several chunks, preferably in a logical way. It might be necessary to create an index file that maps a range of values to the file that contains them.
Then, you can only access the set of files required.
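Such an index can be as simple as an ordered map from the first key of each chunk to the chunk's file name. The sketch below is hypothetical (the `ChunkIndex` alias, `chunk_for` and the idea that keys are integers are assumptions made for illustration); it finds the chunk that would hold a given key with a single ordered lookup.

```cpp
#include <iterator>
#include <map>
#include <string>

// Maps the first key stored in each chunk to that chunk's file name.
using ChunkIndex = std::map<long, std::string>;

// Returns the file that would hold `key`, or an empty string if `key`
// is smaller than the first key of every chunk.
std::string chunk_for(const ChunkIndex& index, long key) {
    auto it = index.upper_bound(key);  // first chunk that starts after key
    if (it == index.begin()) return "";
    return std::prev(it)->second;      // the chunk starting at or before key
}
```

With an index of `{0: "a.bin", 1000: "b.bin", 2000: "c.bin"}`, for instance, key 1500 resolves to "b.bin", so only that file needs to be opened.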
Finally, for complex data structures (where memory mapped files fail) or for sparse reading (when you only ever extract a small piece of information from a given file), it might be interesting to read about LRU caches.
The idea will be to use serialization and compression. You write several files, among which an index, and compress all of them (zip). Then, at launch time, you start by loading the index and save it in memory.
Whenever you need to access a value, you first try your cache. If it is not there, you access the file that contains it, decompress it in memory, and dump its content into your cache. Note: if the cache is too small, you have to be picky about what you dump in... or reduce the size of the files.
The frequently accessed values will stay in cache, avoiding unnecessary round-trip, and because the file is zipped there will be less IO.
Structure your data in a way that caching can be effective. For instance, when you are reading "certain pieces," if those are all contiguous it won't have to seek around the disk to gather all of them.
Reading and writing in batches, instead of record by record, will help if you are sharing disk access with another process.
More specifically: I can pre-compute a huge amount of information - tons of probabilities (long double), a ton of std::map, and much more - and save all this stuff to disk (several Gb).
As far as I understand, the std::map objects are also pre-calculated and there are no insert/remove operations, only searches. How about replacing the maps with something like std::unordered_map (the standard successor to the old hash_map extensions) or sparsehash? In theory it can give a performance gain.
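If the maps are only searched after loading, converting one is short. A minimal sketch, assuming `std::map<int,int>` as in the question and a made-up helper name `to_hash_table`:

```cpp
#include <map>
#include <unordered_map>

// Copies a search-only std::map into a hash table for average O(1) lookups.
std::unordered_map<int, int> to_hash_table(const std::map<int, int>& sorted) {
    std::unordered_map<int, int> table;
    table.reserve(sorted.size());  // size the buckets once, up front
    table.insert(sorted.begin(), sorted.end());
    return table;
}
```

The hash table gives up ordered iteration, so this only fits if nothing relies on walking the keys in sorted order. Whether it is actually faster for these data should be measured.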
More specifically: I can pre-compute a huge amount of information - tons of probabilities (long double), a ton of std::map, and much more - and save all this stuff to disk (several Gb).
Don't reinvent the wheel. I'd suggest using a key-value data store, such as berkeley db: http://docs.oracle.com/cd/E17076_02/html/gsg/C/concepts.html
This will enable saving and sharing the files, caching the parts you actually use a lot and keeping other parts on disk.