C++ memcpy from mapped file too slow

I would like to ask if anybody sees a bottleneck in my code or any way to optimize it.
I am thinking about if my code has a fault somewhere or if I need to choose a completely new approach.
I have memory-mapped a file, and I need to read doubles from this memory-mapped file.
I need to do this around 100,000 times, as fast as possible.
I was expecting that it would be quite fast in Release mode, but that is not the case.
The first time I do it, it takes over 5 seconds. The next time, it takes around 200 ms. That is much faster (I guess it has to do with the way Windows handles a memory-mapped file), but it is still too slow.
void clsMapping::FeedJoinFeaturesFromMap(vector<double> &uJoinFeatures, int uHPIndex)
{
    // Look up the byte offset of this entry in the index table.
    int iBytePos = this->Content()[uHPIndex];
    int iByteCount = 16 * sizeof(double);

    uJoinFeatures.resize(16);
    memcpy(&uJoinFeatures[0], &((char*)m_pVoiceData)[iBytePos], iByteCount);
}
Does anybody see a way to improve my code? I hardcoded iByteCount, but that did not really change anything.
Thank you for your ideas.

You're reading 12.5MB of data from the file. That's not so much, but it's still not trivial.
The difference between your first and second run is probably due to file caching - the second time you want to read the file, the data is already in memory so less I/O is required.
However, 5 seconds for reading 12.5MB of data is still a lot. The only reason I can think of is that your doubles are scattered all over the file, requiring Windows to read a lot more than 12.5MB into memory.
You can avoid memory mapping altogether. If the data is stored in order in the file (not necessarily consecutive, but in order, so you never have to seek backwards), you can skip the memory-mapped file and just seek your way forward to the right places (sketched below).
I doubt this will help much, though. Other things you can do are reordering your file, if that's at all possible, or placing it on an SSD.
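For illustration, a minimal sketch of that forward-only approach (the file name, the offset table, and the plain stdio API are assumptions, not the OP's actual code):

#include <cstdio>
#include <vector>

// Read 16 doubles from each offset. The offsets must be sorted ascending,
// so the file is only ever traversed forwards.
std::vector<double> ReadBlocks(const char *path, const std::vector<long> &offsets)
{
    std::vector<double> result;
    result.reserve(offsets.size() * 16);

    FILE *f = std::fopen(path, "rb");
    if (!f) return result;

    double block[16];
    for (long pos : offsets)
    {
        std::fseek(f, pos, SEEK_SET);   // always a forward seek
        if (std::fread(block, sizeof(double), 16, f) == 16)
            result.insert(result.end(), block, block + 16);
    }
    std::fclose(f);
    return result;
}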

Related

Most memory efficient way to transpose a large file in C++

I have an input file, which is 40,000 columns by 2 million rows. This file is roughly 70GB when loaded and thus too large to fit in memory in one go.
I need to effectively transpose this file, however there are some lines which are junk and should not be added to the output.
My current implementation uses ifstream and a nested getline, which effectively reads the whole file into memory (letting the OS handle memory management) and then outputs the transpose. This works in an acceptable timescale, but obviously gives the application a large memory footprint.
I now have to run this program on a cluster which makes me specify memory requirements ahead of time, and thus a large memory footprint increases job queuing time in the cluster.
I feel there has to be a more memory efficient approach to doing this. One thought I had was using mmap, which would allow me to do the transposition without reading the file into memory at all. Are there any other alternatives?
To be clear, I am happy to use any language and any method that can do this in a reasonable amount of time (my current program takes around 4 minutes on this large file on a local workstation).
Thanks
I would probably do this with a pre-processing pass over the file that only needs to have one line at a time in its working set.
Filter the junk and make every line the same (binary) size.
Now, you can memory map the temp file, and stride the columns as rows for the output.
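A minimal sketch of that second pass, under some assumptions (the pre-processed temp file holds rows * cols doubles as fixed-size binary records, POSIX mmap is available, and all the names are made up):

#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// The temp file holds 'rows' fixed-size records of 'cols' doubles each.
// Map it and walk column-major to emit the transpose row by row.
void WriteTranspose(const char *tmpPath, const char *outPath, size_t rows, size_t cols)
{
    int fd = open(tmpPath, O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    const double *data = (const double *)mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { close(fd); return; }

    FILE *out = fopen(outPath, "wb");
    for (size_t c = 0; c < cols; ++c)       // one output row per input column
        for (size_t r = 0; r < rows; ++r)
            fwrite(&data[r * cols + c], sizeof(double), 1, out);
    fclose(out);

    munmap((void *)data, st.st_size);
    close(fd);
}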
I think the best way to do this would be to parse each line first and decide whether it is junk or not, and only then write the remaining lines to the output. This may take more time, but it saves a lot of memory that would otherwise be spent on lines which are completely useless to the output. However, using mmap would also be a great way to achieve your goal.
Hope this helps!!

Extreme performance difference when reading the same files a second time with C++

I have to read binary data into char arrays from large (2GB) binary files in a C++ program. When reading the files for the first time from my SSD, reading takes about 6.4 seconds per file. But when running the same code again, or even after running a different dummy program which does almost the same thing beforehand, the next reads take only about 1.4 seconds per file. The Windows Task Manager even shows much less disk activity on the second, third, fourth… run. So my guess is that Windows' file caching is sparing me from waiting for data from the SSD when filling the arrays another time.
Is there any clean option to read the files into the file cache before the customer runs the software? Any better option than just loading the files with fread in advance? And how can I make sure the data remains in the file cache until I need it?
Or am I totally wrong with my File Cache assumption? Is there another (better) explanation for these different loading times?
Educated guess here:
You most likely are right with your file cache assumption.
Can you preload the files before the user runs the software?
Not directly. How would your program know that it is going to be run in the next few minutes?
So you probably need a helper mechanism or tricks.
The options I see here are:
Indexing mechanisms to provide faster and better-aimed access to your data. This is helpful if you only need small chunks of the data at a time.
Attempt to parallelize the loading of the data, so even if it does not really get faster, the user has the impression it does, because they can already start working with the data they have while the rest is fetched in the background (see the sketch below).
Have a helper tool starting up with the OS and pre-fetching everything, so you already have it in memory when required. Caution: This has serious implications since you reserve either a large chunk of RAM or even SSD-cache (depending on implementation) for your tool from the start. Only consider doing this if the alternative is the apocalypse…
You can also try to combine the first two options. The key to faster data availability is figuring out what to read in which order, instead of trying to load everything at once en bloc. Divide and conquer.
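To illustrate the parallel-loading option, a minimal sketch with std::async (the file names and the split into a small and a big file are assumptions):

#include <cstdio>
#include <future>
#include <vector>

// Load one whole file into memory; can run on a background thread.
std::vector<char> LoadFile(const char *path)
{
    std::vector<char> buf;
    FILE *f = std::fopen(path, "rb");
    if (!f) return buf;
    std::fseek(f, 0, SEEK_END);
    buf.resize(std::ftell(f));
    std::fseek(f, 0, SEEK_SET);
    std::fread(buf.data(), 1, buf.size(), f);
    std::fclose(f);
    return buf;
}

int main()
{
    // Kick off the big file in the background; work with the small one now.
    auto big = std::async(std::launch::async, LoadFile, "big.bin");
    std::vector<char> small = LoadFile("small.bin");
    // ... the user can already start working with 'small' here ...
    std::vector<char> bigData = big.get();  // ready (or waits) when needed
}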
Without further details on the problem it is impossible to provide more specific solutions though.

Windows C++ Lock file in memory

If I need to read from a file very often, and I load the file into a vector of unsigned char using fread, the subsequent freads are really fast, even if the vector of unsigned char is destroyed right after reading.
It seems to me that something (Windows or the disk) caches the file and thus makes the freads very fast. I have not read anything about this behaviour, so I am unsure what really causes it.
If I don't use my application for 1 hour or so and then do an fread again, the fread is slow.
It seems to me that the cache got emptied.
Can somebody explain this behaviour to me? I would like to actively use it.
It is a problem for me when the freads are slow.
Memory-mapping the file works in theory, but the file itself is too big, so I cannot use it.
90/10 law
90% of the execution time of a computer program is spent executing 10% of the code
It is not a hard rule, but it usually holds, so many programs try to keep recently used data around, because it is very likely that the same data will be accessed again soon.
Windows is no exception: after being told to read a file, the OS keeps some information about it. It remembers the addresses of the pages where the file's data landed and, if possible, keeps part (or even all) of the binary data in memory, which makes the next read of the file much faster if it happens shortly after the first one.
All in all, you are right that there is caching, but I can't say exactly what is going on, as I don't work at Microsoft...
Also, answering the next part of the question: mapping the file into memory may be a solution, but if the file is very large the machine may not have that much memory, so it wouldn't be an option. However, you can apply the 90/10 law here too: in your case you could map just a part of the file into memory (the part that is most important), and while reading, build a table of the overall parameters.
I don't know your exact situation, but it may help.
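To illustrate mapping only a region of a large file on Windows, here is a minimal sketch using MapViewOfFile (the path, offset, and length are made up; note that view offsets must be multiples of the system allocation granularity):

#include <windows.h>

// Map 'length' bytes starting at 'offset' instead of the whole file.
// Returns a pointer to the requested byte, or nullptr on failure.
char *MapRegion(const char *path, DWORD offset, SIZE_T length)
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    // Round the offset down to the allocation granularity (usually 64 KiB).
    DWORD base  = offset - (offset % si.dwAllocationGranularity);
    DWORD delta = offset - base;

    HANDLE file = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                              OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file == INVALID_HANDLE_VALUE) return nullptr;

    HANDLE mapping = CreateFileMappingA(file, nullptr, PAGE_READONLY, 0, 0, nullptr);
    if (!mapping) { CloseHandle(file); return nullptr; }

    char *view = (char *)MapViewOfFile(mapping, FILE_MAP_READ, 0, base, length + delta);

    // The handles can be closed now; the view keeps the mapping alive.
    CloseHandle(mapping);
    CloseHandle(file);
    // Caller must later call UnmapViewOfFile(view - delta).
    return view ? view + delta : nullptr;
}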

C++ Reading from several sections of a file is too slow

I need to read byte arrays from several locations of a big file.
I have already optimized the file so that as few sections as possible have to be read, and the sections are as closely together as possible.
I have 20 calls like this one:
m_content.resize(iByteCount);
fseek(iReadFile, iStartPos, SEEK_SET);
size_t readElements = fread(&m_content[0], sizeof(unsigned char), iByteCount, iReadFile);
iByteCount is around 5000 on average.
Before using fread, I used a memory-mapped file, but the results were approximately the same.
My calls are still too slow (around 200 ms) when called for the first time. When I repeat the same call with the same sections of bytes to read, it is very fast (around 1 ms), but that does not really help me.
The file is big (around 200 MB).
After this call, I have to read double values from a different section of the file, but I can not avoid this.
I don't want to split it up into 2 files. I have seen this "huge file approach" used by other people too, and they overcame the problem somehow.
If I use memory-mapping, the first read is always slow. If I then repeat reading from this section, it is lightning fast. When I then read from a different section, it is slow for the first time, but then lightning fast the second time.
I have no idea why this is so.
Does anybody have any more ideas for me?
Thank you.
Disk drives have two (actually three) factors that limit their speed: access time, sequential bandwidth, and bus latency/bandwidth.
What you feel most is access time. Access time is typically in the millisecond ballpark. Having to do a seek takes upwards of 5 (often more than 10) milliseconds on a typical hard disk. Note that the number printed on a disk drive is the "average" time, not the worst case (and, in some cases, it seems to be much closer to "best" than "average").
Sequential read bandwidth is typically upwards of 60-80 MiB/s even for a slow disk, and 120-150 MiB/s for a faster disk (or >400 MiB/s on solid state). Bus bandwidth and latency are something you usually don't care about, as bus speed usually exceeds the drive speed (except if you use a modern solid state disk on SATA-2, or a 15k rpm hard disk on SATA-1, or any disk over USB).
Also note that you cannot change the drive's bandwidth, nor the bus bandwidth. Nor can you change the seek time. However, you can change the number of seeks.
In practice, this means you must avoid seeks as much as you can. If that means reading in data that you do not need, do not be afraid of doing so. It is much faster to read 100 kiB than to read 5 kiB, seek ahead 90 kiB, and read another 5 kiB.
If you can, read the whole file in one go, and only use the parts you are interested in. 200 MiB should not be a big hindrance on a modern computer. Reading 200 MiB with fread into an allocated buffer might however be prohibitive (that depends on your target architecture and what else your program is doing). But don't worry, you have already found the best solution to the problem: memory mapping.
While memory mapping is not a "magic accelerator", it is nevertheless as close to "magic" as you can get.
The big advantage of memory mapping is that you can directly read from the buffer cache. Which means that the OS will prefetch pages, and you can even ask it to more aggressively prefetch, so effectively all your reads will be "instantaneous". Also, what is stored in the buffer cache is in some sense "free".
Unluckily, memory mapping is not always easy to get right (especially since the documentation and the hint flags typically supplied by operating systems are deceptive or counter-productive).
While you have no guarantee that what has been read once stays in the buffers, in practice this is the case for anything of "reasonable" size. Of course the operating system cannot and will not keep a terabyte of data in RAM, but something around 200 MiB will quite reliably stay in the buffers on a "normal" modern computer. Reading from buffers works more or less in zero time.
So, your goal is to get the operating system to read the file into its buffers, as sequentially as possible. Unless the machine runs out of physical memory so it is forced to discard buffer pages, this will be lightning fast (and if that happens, every other solution will be equally slow).
Linux has the readahead syscall, which lets you prefetch data. Unluckily, it blocks until the data has been fetched, which is probably not what you want (you would thus have to use an extra thread for it). madvise(MADV_WILLNEED) is a less reliable, but probably better alternative. posix_fadvise may work too, but note that Linux limits the readahead to twice the default readahead size (i.e. 256 KiB).
Do not let yourself be fooled by the docs, as the docs are deceptive. It may seem that MADV_RANDOM is the better choice, as your access is "random". It makes sense to be honest to the OS about what you're doing, doesn't it? Usually yes, but not here. This simply turns off prefetching, which is the exact opposite of what you really want. I don't know the rationale behind this, maybe some ill-advised attempt to conserve memory, but in any case it is detrimental to your performance.
Windows (since Windows 8, for desktop only) has PrefetchVirtualMemory which does exactly what one would want here, but unluckily it's only available on the newest version. On older versions, there is just... nothing.
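For illustration, a small portability sketch of both prefetch hints (assuming you already have the mapped address and its length; the Windows branch needs Windows 8 or later):

#include <cstddef>
#ifdef _WIN32
#include <windows.h>
#else
#include <sys/mman.h>
#endif

// Ask the OS to start reading 'len' bytes at 'addr' into the page cache.
void HintPrefetch(void *addr, std::size_t len)
{
#ifdef _WIN32
    WIN32_MEMORY_RANGE_ENTRY range = { addr, len };
    PrefetchVirtualMemory(GetCurrentProcess(), 1, &range, 0);  // Windows 8+
#else
    madvise(addr, len, MADV_WILLNEED);  // non-blocking readahead hint
#endif
}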
A very easy, efficient, and portable way of populating the pages in your mapping is to launch a worker thread that faults every page. This sounds horrendous, but it works very nicely, and is operating-system agnostic.
Something like volatile int x = 0; for(int i = 0; i < len; i += 4096) x += map[i]; is entirely sufficient. I am using such code to pre-fault pages prior to accessing them; it works at speeds unrivalled by any other method of populating buffers, and uses very little CPU.
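Fleshed out as a worker thread, that pre-faulting loop might look like this (a sketch; map and len stand for whatever mapping call you used):

#include <cstddef>
#include <thread>

// Touch one byte per page so the OS faults the whole mapping in,
// without blocking the main thread.
void PrefaultAsync(const char *map, std::size_t len)
{
    std::thread([map, len] {
        volatile int x = 0;
        for (std::size_t i = 0; i < len; i += 4096)
            x += map[i];             // each access faults in one page
        (void)x;
    }).detach();
}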
(moved to an answer as requested by the OP)
You cannot read from a file any quicker (there is no magic flag that says "read faster"). Either there is an issue with your hardware, or 200 ms is simply how long it is supposed to take.
1) The difference in access speed between your first read and subsequent ones is perfectly understandable: your first call actually reads the file from the disk, and this takes time. However, your kernel (not to mention the disk controller) keeps the accessed data buffered, so when you access it a second time it is a pure memory access (1 ms).
Even if you only need to access really tiny portions of the file, libc/kernel/controller optimizations access the disk in quite large chunks. You can read the libc/OS/controller docs to try and align your reads on these chunks.
2) You're using stream input; try using the direct open/read/close functions instead: low-level I/O has less overhead (obviously). Nothing gets faster than this, so if you still find it too slow, you have an OS or hardware issue.
Since you seem to have a good benchmark, try switching the size and the count in your fread call: reading 1000 bytes in one call will be faster than 1000 calls of 1 byte each.
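In other words (a minimal illustration; the buffer and file handle are assumed to exist):

// One call that transfers 1000 bytes...
fread(buffer, 1000, 1, file);

// ...instead of 1000 calls that transfer 1 byte each:
for (int i = 0; i < 1000; ++i)
    fread(buffer + i, 1, 1, file);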
Disk is slow, and as you pointed out, the delay comes from the first access - that's the disk spinning up and accessing the sectors necessary. You're always going to pay that cost one time.
You could improve your performance a little by using memory mapped IO. See either mmap (Linux) or CreateFileMapping+MapViewOfFile (Windows).
I have already optimized the file so that as few sections as possible have to be read
Correct me if I'm wrong, but regarding the file being optimised, I'm assuming you mean you've ordered the sections to minimize the number of reads that take place, and not what I'm going to suggest.
Being IO-bound here is likely due to the seek times, so other than getting a faster storage medium, your options are limited.
Two possible ideas I had are:
1) Compress the data that is stored, which may give you slightly faster read times, but will still not help with seek times. You'd have to test whether this benefits you at all.
2) If relevant, as soon as you've retrieved one block of data, hand it to a thread and start processing it while the next read takes place (see the sketch below). You may be doing this already, but if not, I thought it worth mentioning.
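A minimal sketch of that overlap, where ReadBlock and Process are hypothetical stand-ins for the OP's actual routines:

#include <thread>
#include <vector>

// Hypothetical stand-ins for the real read and processing routines.
std::vector<char> ReadBlock(int i) { return std::vector<char>(5000); } // ~5 KB section
void Process(const std::vector<char> &block) { /* parse doubles, etc. */ }

void ReadAndProcessAll(int blockCount)
{
    std::vector<char> current = ReadBlock(0);
    for (int i = 1; i < blockCount; ++i)
    {
        std::thread worker(Process, std::cref(current)); // process block i-1...
        std::vector<char> next = ReadBlock(i);           // ...while reading block i
        worker.join();
        current = std::move(next);
    }
    Process(current);                                    // last block
}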

Method to read multiple lines from a file at once without partial lines

I'm reading in from a CSV file, parsing it, and storing the data, pretty simple.
Right now we're using the standard readLine() method to do that, and I'm trying to squeeze some extra efficiency out of this processing loop. I don't know how much it hides behind the scenes, but I assume each call to getline is a new OS call with all the pain that entails? I don't want to pay for OS calls on each line of input; I would rather provide a huge buffer and have it filled with many lines at once.
However, I only care about full lines. I don't want to have to handle maintaining partial lines from one buffer read to append to the second buffer read to make a full line, that's just ugly and annoying.
So, is there a method out there that does this for me? It seems like there almost has to be. Any method which I can instruct to read in x number of lines, or x bytes without outputting the last partial line, or even an easy way to manage the memory buffer myself so I minimize the amount of code for handling partial strings, would be appreciated. I can use Boost, though if there is a method in standard C++ I would prefer that.
Thanks.
It's very unlikely that you'll be able to do better than the built-in C++ streams. They're quite fast. In general, the fastest way to completely read a file is to use a single thread to read the entire file from start to end, especially if the file is contiguous on disk. Furthermore, it's likely that the disk is much more of a bottleneck during reading than the OS. If you need to improve the performance of your app, I have a few recommendations.
Use a profiler. If your app is reading a line then parsing it or processing it in some way, it's possible that the parsing or processing is something that can be optimized. This can be determined in profiling. If parsing or processing takes up substantial CPU resources, then optimization may be worth the effort.
If you determine that parsing or processing is responsible for a slow application, and that it can't be easily optimized, consider multiprogramming. If the processing of individual lines does not depend on the results of previous lines being processed, then use multiple threads or CPUs to do the processing.
Use pipelining if you have to process multiple files. For example, suppose you have four stages in your app: reading, parsing, processing, saving. It may be more efficient to read one file at a time rather than all of them at once. However, while reading the second file, you can still parse the first one. While reading the third file, you can parse the second and process the first, etc. One way to implement this is a staged multi-threaded application design.
Use RAID to improve disk reads. Certain RAID modes can provide faster reads and writes.
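On the first point about built-in streams: their buffer can also be enlarged, so far fewer OS reads happen while getline still hands back exactly one complete line at a time. A minimal sketch (the buffer size and file name are arbitrary; pubsetbuf must be called before open):

#include <fstream>
#include <string>
#include <vector>

int main()
{
    std::vector<char> buf(1 << 20);                 // 1 MiB stream buffer
    std::ifstream in;
    in.rdbuf()->pubsetbuf(buf.data(), buf.size()); // must precede open()
    in.open("data.csv");

    std::string line;
    while (std::getline(in, line))
    {
        // parse 'line' here; each iteration sees one complete line,
        // while the underlying reads happen in ~1 MiB chunks
    }
}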
I am a Java programmer, but I still have a hint... read the data in a stream, meaning for example 4 or 5 chunks of 2048 bytes (or much more) at a time... you can iterate over the stream (and convert it) and search for your line ends (or some other char)... but I think readLine is doing the same anyway...
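A sketch of that chunked approach, carrying the unavoidable partial tail from one read to the next (the chunk size and the callback are assumptions):

#include <cstdio>
#include <functional>
#include <string>

// Read the file in large chunks; invoke onLine for every complete line.
// The leftover partial line is kept and prepended to the next chunk.
void ForEachLine(const char *path, const std::function<void(const std::string &)> &onLine)
{
    FILE *f = std::fopen(path, "rb");
    if (!f) return;

    std::string carry;                    // partial line from the previous chunk
    char chunk[1 << 16];                  // 64 KiB per OS read
    size_t n;
    while ((n = std::fread(chunk, 1, sizeof(chunk), f)) > 0)
    {
        carry.append(chunk, n);
        size_t start = 0, nl;
        while ((nl = carry.find('\n', start)) != std::string::npos)
        {
            onLine(carry.substr(start, nl - start));
            start = nl + 1;
        }
        carry.erase(0, start);            // keep only the partial tail
    }
    if (!carry.empty()) onLine(carry);    // final line without trailing '\n'
    std::fclose(f);
}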