Best way to read 12-15GB ASCII file in C++

I am trying to count the number of lines in a huge file. This ASCII file is anywhere from 12 to 15 GB. Right now, I am using something along the lines of readline() to count each line of the file, but of course this is extremely slow. I've also tried to implement lower-level reading using seekg() and tellg(), but due to the size of my file I am unable to allocate a large enough array to store every character for a '\n' comparison (I have 8 GB of RAM). What would be a faster way of reading this ridiculously large file? I've looked through many posts here and most people don't seem to have trouble with the 32-bit system limitation, but here I see that as a problem (correct me if I'm wrong).
Also, if anyone can recommend a good way of splitting something this large, that would be helpful as well.
Thanks!

Don't try to read the whole file at once. If you're counting lines, just read in chunks of a given size. A couple of MB should be a reasonable buffer size.
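For instance, a minimal sketch of that chunked approach with std::ifstream, counting '\n' bytes in a reusable buffer (the file name and the 4 MB buffer size are placeholders):

#include <algorithm>
#include <cstdint>
#include <fstream>
#include <iostream>
#include <vector>

int main()
{
    std::ifstream in("huge.txt", std::ios::binary);   // placeholder file name
    std::vector<char> buf(4 * 1024 * 1024);           // 4 MB chunk buffer
    std::uint64_t lines = 0;
    while (in) {
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        // gcount() handles the short read at the end of the file
        lines += std::count(buf.data(), buf.data() + in.gcount(), '\n');
    }
    std::cout << lines << " lines\n";
}

Only one buffer-sized chunk is ever in memory, so the 8 GB of RAM is not a constraint.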

Try Boost memory-mapped files (boost::iostreams::mapped_file_source); the same code works on both Windows and POSIX platforms.

Memory-mapping a file does not require that you actually have enough RAM to hold the whole file. I've used this technique successfully with files up to 30 GB (I think I had 4 GB of RAM in that machine). You will need a 64-bit OS and 64-bit tools (I was using Python on FreeBSD) in order to be able to address that much.
Using a memory mapped file significantly increased the performance over explicitly reading chunks of the file.
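For reference, a rough sketch of what that looks like with Boost.Iostreams (the file name is a placeholder; a 64-bit build is assumed so the whole file fits in the address space):

#include <boost/iostreams/device/mapped_file.hpp>
#include <algorithm>
#include <cstdint>
#include <iostream>

int main()
{
    // Maps the whole file read-only; the OS pages it in on demand,
    // so it does not need to fit in physical RAM.
    boost::iostreams::mapped_file_source file("huge.txt");
    std::uint64_t lines = std::count(file.data(), file.data() + file.size(), '\n');
    std::cout << lines << " lines\n";
}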

What OS are you on? Is there no wc -l or equivalent command on that platform?

Related

Memory Mapped Files and Max File Size

I am using boost::iostreams::mapped_file_source to create memory-mapped files, more than 1024 of them. To my surprise, when I have created around 1024 memory-mapped files my program throws an exception stating there are too many files open. After some research I found that Ubuntu uses a maximum of 1024 open files per process (found via ulimit -n). Unfortunately, I need all of the files to be open at the same time. Does anyone know a way around this? Is it possible to make the files not count towards the limit somehow? I was thinking of trying to combine them into a single file; however, I would like to avoid that if possible due to the performance. I would also like to avoid modifying the operating system by changing that value. Any pointers in the right direction are much appreciated!
Why do you need many mapped files open? That seems very inefficient. Maybe you can map (regions of) a single large file?
Q. I was thinking of trying to combine them into a single file; however, I would like to avoid that if possible due to the performance
That's ... nonsense. The performance could basically only increase.
One particular thing to keep in mind is to align the different regions inside your "big mapped file" to a multiple of your memory page / disk block size. 4 KB should be a nice starting point for this coarse alignment.
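As a rough sketch of that idea, one region of a single big file can be mapped with boost::iostreams::mapped_file_params; the helper name and the 4 KB page constant here are assumptions, not part of the original question:

#include <boost/iostreams/device/mapped_file.hpp>
#include <string>

// Hypothetical helper: instead of one mapping per small file, map a region
// of one big file. The offset is rounded down to a 4 KB boundary and the
// length is extended so the requested range is still covered; the caller
// reads at data() + (offset - aligned offset).
boost::iostreams::mapped_file_source map_region(const std::string& path,
                                                boost::iostreams::stream_offset offset,
                                                std::size_t length)
{
    const boost::iostreams::stream_offset page = 4096;
    boost::iostreams::mapped_file_params params(path);
    params.offset = (offset / page) * page;                              // align down
    params.length = length + static_cast<std::size_t>(offset - params.offset);
    return boost::iostreams::mapped_file_source(params);
}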

Copying multiple file segments into a single file - Qt

I have a file split into many segments, and I have to combine the segments back into a single file. The simple code I came up with is:
QFile file;
file.setFileName(fileUrl);
file.open(QIODevice::WriteOnly);
for (int j = 0; j < totalSegments; j++)
{
    Segment[j]->fileSegment.close();
    if (!Segment[j]->fileSegment.open(QIODevice::ReadOnly))
    {
        qDebug() << "Segment not found";
        continue;
    }
    file.write(Segment[j]->fileSegment.readAll()); // is this really efficient and safe?
    Segment[j]->fileSegment.close();
    Segment[j]->fileSegment.remove();
}
The above code snippet works fine on Windows as well as Linux. But I have some questions:
1- Is this method really efficient? If the segment sizes are in the GBs, will this badly affect the performance of the system, or can it even corrupt the file or fail due to insufficient available RAM?
2- The above method fails on some Linux distros, especially Fedora, if the total size is more than 2 GB. I haven't tested this myself, but it was reported to me by many users.
3- On Linux, can it fail if the segments are on an ext4 filesystem and the target file being written to is on an NTFS filesystem? It didn't fail on Ubuntu, but many users are complaining that it does, and I can't reproduce it. Am I doing something wrong?
Please avoid multiple sub-questions per question in general, but I will try to answer your questions regardless.
1- Is this method really efficient? If the segment sizes are in the GBs, will this badly affect the performance of the system, or can it even corrupt the file or fail due to insufficient available RAM?
It is a very bad idea for large files. What you want is a chunked read and write.
2- The above method fails on some Linux distros, especially Fedora, if the total size is more than 2 GB. I haven't tested this myself, but it was reported to me by many users.
Anything over 2 GB (or was it 4 GB?) counts as a large file on 32-bit systems, so it is possible that they are using a build of the software without large-file support. You need to make sure that support is enabled while building; there used to be a -largefile configure option for Qt.
3- On Linux, can it fail if the segments are on an ext4 filesystem and the target file being written to is on an NTFS filesystem? It didn't fail on Ubuntu, but many users are complaining that it does, and I can't reproduce it. Am I doing something wrong?
Yes, it can be the same issue. You also need to pay attention to address-space fragmentation: you may not be able to allocate 2 GB in one block even if 2 GB is available, because the address space is too fragmented. On Windows, for instance, you may wish to use the /LARGEADDRESSAWARE linker option for a 32-bit process.
Overall, the best approach would be to establish a loop that reads and writes in chunks; then you can forget about large-address-aware and similar issues. You would still need to make sure that Qt can handle large files if you wish to support them for your clients. This is of course only necessary on 32-bit, because there is no practical limit on 64-bit for currently typical file sizes.
Since you requested some code in the comment to get you going, here is a simple and untested version of a chunked read and immediate write of the content from an input file into an output file. I am sure this will get you going so that you can figure out the rest.
QFileInfo fileInfo("/path/to/my/file");
qint64 size = fileInfo.size();

QFile myInputFile("/path/to/my/file");
QFile myOutputFile("/path/to/my/output");
myInputFile.open(QIODevice::ReadOnly);
myOutputFile.open(QIODevice::WriteOnly);

QByteArray data;
const int chunkSize = 4096;
for (qint64 bytes = 0; bytes < size; bytes += data.size()) { // note ';', not ','
    data = myInputFile.read(chunkSize);
    if (data.isEmpty())  // error check: stop on read failure or premature end of file
        break;
    myOutputFile.write(data);
}

Which type of file access to use?

I'm building an I/O framework for a custom file format, for Windows only. The files are anywhere between 100 MB and 100 GB. The reads and writes come in sequences of a few hundred KB to a couple of MB at unpredictable locations. Read speed is most critical, though CPU use might trump that, since I hear fstream can put a real dent in it when working with SSDs.
Originally I was planning to use fstream, but as I read a bit more about file I/O I discovered a bunch of other options. Since I have little experience with the topic, I'm torn as to which method to use. The options I've scoped out are fstream, FILE*, and memory-mapped files.
In my research so far, all I've found is a lot of contradictory benchmark results depending on chunk sizes, buffer sizes, and other bottlenecks I don't understand. It would be helpful if somebody could clarify the advantages and drawbacks of those options.
In this case the bottleneck is mainly your hardware and not the library you use, since the blocks you are reading are relatively big (200 KB to 5 MB, compared to the sector size) and each is read sequentially (all in one go).
With hard disks (high seek time) it could make sense to read more data than needed to improve caching. With SSDs I would not use big buffers but read only the exact data required, since seek time is not a big issue.
Memory-mapped files are convenient for completely random access to your data, especially in small chunks (even a few bytes). But it takes more code to set up a memory-mapped file. On a 64-bit system you could map the whole file (virtually) and let the OS caching system optimize the reads (multiple accesses to the same data). You can then return just a pointer to the required memory location, without even needing temporary buffers or memcpy. It would be very simple.
fstream gives you extra features compared to FILE*, which I think aren't of much use in your case.
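As a rough sketch of that memory-mapped approach with the raw Win32 API (the file name is a placeholder, error handling is minimal, and a 64-bit build is assumed for very large files):

#include <windows.h>
#include <cstdio>

int main()
{
    HANDLE file = CreateFileA("data.bin", GENERIC_READ, FILE_SHARE_READ,
                              nullptr, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file == INVALID_HANDLE_VALUE) return 1;

    HANDLE mapping = CreateFileMappingA(file, nullptr, PAGE_READONLY, 0, 0, nullptr);
    if (mapping == nullptr) { CloseHandle(file); return 1; }

    // Map the entire file; reads then become plain pointer dereferences,
    // with no intermediate buffers and no memcpy.
    const char* data = static_cast<const char*>(
        MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0));
    if (data != nullptr) {
        std::printf("first byte: %d\n", data[0]);
        UnmapViewOfFile(data);
    }
    CloseHandle(mapping);
    CloseHandle(file);
    return 0;
}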

Windows C++ Lock file in memory

I need to read from a file very often, and I load the file into a vector of unsigned char using fread. The subsequent freads are really fast, even if the vector of unsigned char is destroyed right after reading.
It seems to me that something (Windows or the disk) caches the file and thus freads are very fast. I have not read anything about this behaviour, so I am unsure what really causes this.
If I don't use my application for 1 hour or so and then do an fread again, the fread is slow.
It seems to me that the cache got emptied.
Can somebody explain this behaviour to me? I would like to actively use it.
It is a problem for me when the freads are slow.
Memory-mapping the file works in theory, but the file itself is too big, so I cannot use it.
90/10 law
90% of the execution time of a computer program is spent executing 10% of the code
It is not a strict rule, but it usually holds, so many programs try to keep recently used data around because it is very likely that the same data will be accessed again soon.
Windows is no exception: after receiving a command to read a file, the OS keeps some information about that file. It remembers which pages of memory hold the file's data and, if possible, keeps part (or even all) of the file's contents cached in memory, which makes the next read of that file much faster if it happens shortly after the first one.
All in all, you are right that there is caching, but I can't say exactly what is going on internally, as I don't work at Microsoft...
Also, answering the next part of the question: mapping the file into memory may be a solution, but if the file is very large the machine may not have that much memory or address space, so it might not be an option. However, you can use the 90/10 law: map just the part of the file that is most important into memory, and while reading, build a table of the overall parameters.
I don't know your exact situation, but it may help.
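As one possible sketch of mapping only part of the file on Windows, as suggested above (the file name, offset, and 16 MB window size are placeholders; the view offset must be a multiple of the system allocation granularity, typically 64 KB):

#include <windows.h>
#include <cstdio>

int main()
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);   // si.dwAllocationGranularity is the required offset alignment

    HANDLE file = CreateFileA("big.dat", GENERIC_READ, FILE_SHARE_READ,
                              nullptr, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file == INVALID_HANDLE_VALUE) return 1;
    HANDLE mapping = CreateFileMappingA(file, nullptr, PAGE_READONLY, 0, 0, nullptr);
    if (mapping == nullptr) { CloseHandle(file); return 1; }

    // Map only a 16 MB window over the "important" part of the file,
    // assuming the file is large enough to contain that range.
    ULONGLONG offset = 64ULL * 1024 * 1024;   // must be granularity-aligned
    const char* view = static_cast<const char*>(
        MapViewOfFile(mapping, FILE_MAP_READ,
                      static_cast<DWORD>(offset >> 32),
                      static_cast<DWORD>(offset & 0xFFFFFFFFu),
                      16 * 1024 * 1024));
    if (view != nullptr) {
        std::printf("first byte of window: %d\n", view[0]);
        UnmapViewOfFile(view);
    }
    CloseHandle(mapping);
    CloseHandle(file);
    return 0;
}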

Positioning an ifstream in very large files

I have to process very large log files (hundreds of gigabytes), and in order to speed things up I want to split the processing across all the cores I have available. Using seekg and tellg I'm able to estimate block sizes in relatively small files and position each thread at the beginning of its block, but when the files grow big the offsets overflow.
How can I position and index into very big files when using C++ ifstreams on Linux?
Best regards.
The easiest way would be to do the processing on a 64-bit OS, and write the code using a 64-bit compiler. This will (at least normally) give you a 64-bit type for file offsets, so the overflow no longer happens, and life is good.
You have two options:
Use a 64-bit OS.
Use OS-specific functions.
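For the OS-specific route on Linux, here is a minimal sketch using pread(), whose off_t offset is 64-bit on a 64-bit build, so positions beyond 2 GB are addressable (the file name and the four-way split are placeholders):

#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>
#include <vector>

int main()
{
    int fd = open("huge.log", O_RDONLY);
    if (fd < 0) return 1;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return 1; }

    // Split the file into, say, 4 blocks and read a chunk from the start of
    // the last block; each worker thread could do the same with its own offset.
    const off_t block = st.st_size / 4;
    std::vector<char> buf(1 << 20);                 // 1 MB read buffer
    ssize_t n = pread(fd, buf.data(), buf.size(), 3 * block);
    if (n > 0)
        std::printf("read %zd bytes at offset %lld\n", n, (long long)(3 * block));

    close(fd);
    return 0;
}

Each thread can then scan forward from its starting offset to the next '\n' so that its block begins on a line boundary.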