I have a file that is split into many segments, and I have to combine the segments back into a single file. The simple code I came up with is:
QFile file;
file.setFileName(fileUrl);
file.open(QIODevice::WriteOnly);
for (int j = 0; j < totalSegments; j++)
{
    Segment[j]->fileSegment.close();
    if (!Segment[j]->fileSegment.open(QIODevice::ReadOnly))
    {
        qDebug() << "Segment not found";
        continue;
    }
    file.write(Segment[j]->fileSegment.readAll()); // is this really efficient and safe?
    Segment[j]->fileSegment.close();
    Segment[j]->fileSegment.remove();
}
The above code snippet works fine on Windows as well as Linux. But I have some questions:
1- Is this method really efficient? If the segment sizes are in the GBs, will this badly affect the performance of the system, or could it even corrupt the file or fail because too little RAM is available?
2- The above method fails on some Linux distros, especially Fedora, if the total size is more than 2 GB. I haven't tested it myself, but it was reported to me by many users.
3- On Linux, can it fail if the segments are on an ext4 filesystem and the target file is written to an NTFS filesystem? It didn't fail on Ubuntu, but many users are complaining that it does, and I just can't replicate it. Am I doing something wrong?
Please avoid multiple sub-questions per question in general, but I will try to answer your questions regardless.
1- Is this method really efficient? If the segment sizes are in the GBs, will this badly affect the performance of the system, or could it even corrupt the file or fail because too little RAM is available?
It is a very bad idea for large files. You will want to read and write the file in chunks instead.
2- The above method fails on some Linux distros, especially Fedora, if the total size is more than 2 GB. I haven't tested it myself, but it was reported to me by many users.
Anything above 2 GB (or was it 4 GB?) counts as a large file on 32-bit systems, so it is possible that those users run a build without large file support. You need to make sure that support is enabled while building; there used to be a -largefile configure option for Qt.
3- On Linux, can it fail if the segments are on an ext4 filesystem and the target file is written to an NTFS filesystem? It didn't fail on Ubuntu, but many users are complaining that it does, and I just can't replicate it. Am I doing something wrong?
Yes, it can be the same issue. You also need to pay attention to address space fragmentation: you may not be able to allocate 2 GB in one contiguous block even if 2 GB are free, simply because the address space is too fragmented. On Windows you may also want to link a 32-bit process with the /LARGEADDRESSAWARE option.
Overall, the best solution would be a loop that reads and writes in chunks; then you can forget about the large-address-aware and related issues. You would still need to make sure that Qt is built with large file support if you want to offer it to your clients. This is of course only necessary on 32-bit builds, because 64-bit builds have no practical limit for the file sizes in use today.
Since you requested some code in the comments to get you going, here is a simple and untested version that reads a chunk from an input file and immediately writes it to an output file. It should be enough to get you going so that you can figure out the rest.
QFileInfo fileInfo("/path/to/my/file");
qint64 size = fileInfo.size();
QByteArray data;
const int chunkSize = 4096;
for (qint64 bytes = 0; bytes < size; bytes += data.size()) {
    data = myInputFile.read(chunkSize);
    if (data.isEmpty())
        break; // read error or premature end of file
    myOutputFile.write(data);
}
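For completeness, this is roughly how the chunked copy could look inside your original segment loop. It is only an untested sketch that assumes the same fileUrl, totalSegments and Segment[j]->fileSegment members as in your snippet:

QFile file(fileUrl);
if (!file.open(QIODevice::WriteOnly))
    return;

const int chunkSize = 4096;
for (int j = 0; j < totalSegments; j++) {
    QFile &segment = Segment[j]->fileSegment;
    segment.close();
    if (!segment.open(QIODevice::ReadOnly)) {
        qDebug() << "Segment not found";
        continue;
    }
    // Copy in 4 KB chunks so only one chunk is ever held in RAM.
    while (!segment.atEnd()) {
        const QByteArray data = segment.read(chunkSize);
        if (data.isEmpty())
            break; // read error
        file.write(data);
    }
    segment.close();
    segment.remove();
}

With this, the memory footprint stays constant regardless of segment size, which also removes the 32-bit allocation concerns of the readAll() approach.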
Related
I am using boost::iostreams::mapped_file_source to create memory-mapped files, more than 1024 of them. To my surprise, when I have created around 1024 memory-mapped files my program throws an exception stating there are too many files open. After some research I found that Ubuntu uses a limit of 1024 open files per process (found via ulimit -n). Unfortunately, I need all of the files to be open at the same time. Does anyone know a way around this? Is it possible to make the files not count towards the limit somehow? I was thinking of trying to combine them into a single file; however, I would like to avoid that if possible due to the performance. I would also prefer not to modify the operating system by changing that value. Any pointers in the right direction are much appreciated!
Why do you need many mapped files open? That seems very inefficient. Maybe you can map (regions of) a single large file?
Q. I was thinking of trying to combine them into a single file; however, I would like to avoid that if possible due to the performance
That's ... nonsense. The performance could basically only increase.
One particular thing to keep in mind is to align the different regions inside your "big mapped file" to a multiple of your memory page/disk block size. 4 KB is a good starting point for this coarse alignment; a sketch follows below.
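As an illustration only, here is an untested sketch of mapping one aligned region of a single large file with Boost.Iostreams; the function name and its parameters are made up for the example:

#include <cstddef>
#include <string>
#include <boost/iostreams/device/mapped_file.hpp>

namespace io = boost::iostreams;

// Map one region of a single large file instead of opening one file per region.
// The offset handed to the OS must be a multiple of the allocation granularity
// (page size on POSIX, 64 KB on Windows), which Boost exposes as alignment().
io::mapped_file_source map_region(const std::string& path,
                                  io::stream_offset offset, std::size_t length)
{
    const io::stream_offset granularity = io::mapped_file_source::alignment();
    const io::stream_offset aligned = (offset / granularity) * granularity;

    io::mapped_file_params params(path);
    params.offset = aligned;                                             // aligned start of the mapping
    params.length = length + static_cast<std::size_t>(offset - aligned); // cover the requested span

    io::mapped_file_source region(params);
    // The byte at the requested offset lives at region.data() + (offset - aligned).
    return region;
}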
I'm building an I/O framework for a custom file format, for Windows only. The files are anywhere between 100 MB and 100 GB. The reads/writes come in sequences of a few hundred KB to a couple of MB at unpredictable locations. Read speed is most critical, though CPU use might trump that, since I hear fstream can put a real dent in it when working with SSDs.
Originally I was planning to use fstream, but as I read a bit more into file I/O, I discovered a bunch of other options. Since I have little experience with the topic, I'm torn as to which method to use. The options I've scoped out are fstream, FILE and memory-mapped files.
In my research so far, all I've found is a lot of contradictory benchmark results depending on chunk sizes, buffer sizes and other bottlenecks I don't understand. It would be helpful if somebody could clarify the advantages/drawbacks of those options.
In this case the bottleneck is mainly your hardware and not the library you use, since the blocks you are reading are relatively big (200 KB - 5 MB, compared to the sector size) and sequential (all in one go).
With hard disks (high seek time) it could make sense to read more data than needed to optimize the caching. With SSDs I would not use big buffers but read only the exact required data, since the seek time is not a big issue.
Memory-mapped files are convenient for completely random access to your data, especially in small chunks (even a few bytes), but it takes more code to set up a memory-mapped file. On 64-bit systems you could map the whole file (virtually) and then let the OS caching system optimize the reads (multiple accesses to the same data). You can then return just a pointer to the required memory location without even needing temporary buffers or memcpy. It would be very simple.
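To illustrate that last point, here is a minimal, untested Win32 sketch that maps a whole file read-only on a 64-bit build; the names are made up and cleanup is omitted:

#include <windows.h>
#include <cstdint>

// Maps the whole file read-only. In a 64-bit process even a 100 GB file fits
// into the address space, and the OS cache decides what is physically resident.
struct MappedFile {
    HANDLE file = INVALID_HANDLE_VALUE;
    HANDLE mapping = nullptr;
    const std::uint8_t* data = nullptr;
    std::uint64_t size = 0;
};

bool MapWholeFile(const wchar_t* path, MappedFile& out)
{
    out.file = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (out.file == INVALID_HANDLE_VALUE)
        return false;

    LARGE_INTEGER size{};
    if (!GetFileSizeEx(out.file, &size))
        return false;
    out.size = static_cast<std::uint64_t>(size.QuadPart);

    out.mapping = CreateFileMappingW(out.file, nullptr, PAGE_READONLY, 0, 0, nullptr);
    if (out.mapping == nullptr)
        return false;

    // Length 0 maps the entire file; reads then become plain pointer arithmetic,
    // e.g. const std::uint8_t* p = out.data + someOffset; no memcpy needed.
    out.data = static_cast<const std::uint8_t*>(
        MapViewOfFile(out.mapping, FILE_MAP_READ, 0, 0, 0));
    return out.data != nullptr;
}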
fstream gives you extra features compared to FILE, which I don't think would be of much use in your case.
I am working on a C++ project, and I need to quickly access byte values.
I have experimented a lot with memory-mapped files, smart ordering of the data so that only little has to be read, etc.
I just could not get it to work reliably fast enough; there are always situations where disk access and seeking in the file seem to be the bottleneck.
I was now thinking about loading the entire byte data (unsigned chars) into RAM.
However, it is 39567865 unsigned chars. It works on my computer, but I would like it to work on all computers.
Can anybody tell me if my approach is crazy or not? In other words: is it valid for ordinary software (not some scientific application run on a supercomputer) to load such an amount of data into RAM to have it quickly accessible?
Chars are 1 byte wide, so
39567865 bytes / 1024 ≈ 38,640 KB
This is about 37.7 MB. You'll be fine, unless you plan to work on embedded machines that have very little RAM. For reference: the machine you are working on most likely has 4-8 GB of RAM, so your memory consumption is well under 1% of that.
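If it helps, here is a minimal untested sketch of loading the whole data set into RAM in one go; the file name is just a placeholder:

#include <fstream>
#include <iterator>
#include <vector>
#include <cstdio>

int main()
{
    // "bytes.dat" is a placeholder for the actual data file.
    std::ifstream in("bytes.dat", std::ios::binary);
    if (!in) {
        std::fprintf(stderr, "could not open file\n");
        return 1;
    }
    // Pull the whole ~38 MB into one contiguous buffer.
    std::vector<unsigned char> data((std::istreambuf_iterator<char>(in)),
                                    std::istreambuf_iterator<char>());
    // Byte values are now a simple index away, e.g. data[someIndex].
    std::printf("loaded %zu bytes\n", data.size());
    return 0;
}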
On today's usual Win32 (or Win64) machines, loading a 100 MB file into memory is completely fair, even preferable to the alternatives.
The general answer depends on what system requirements you set and on the usual usage pattern of the program; if it is launched in dozens of copies within seconds, some other approach might be worth considering.
Is there a way of allocating a file with a determined size with Qt?
The reason is to avoid or minimize fragmentation. I don't want to zero-write a large file (unwanted overhead), but just allocate it from the file system.
I'd like a solution which works on Win/OSX/Linux. I know there are solutions depending on the file system in question for all these platforms, but digging up the solutions and testing on each platform takes some time.
I'm not sure about fragmentation, but Qt has the QFile::resize() method, which clearly pre-allocates (or truncates) the file. The process is fast, about 1 s for 800 MB on my machine, so the file is clearly not being explicitly filled with data. Tested on Windows 7.
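For reference, a minimal untested sketch of that approach; the path and size are just placeholders:

#include <QFile>
#include <QDebug>

int main()
{
    QFile file("preallocated.bin"); // placeholder path
    if (!file.open(QIODevice::ReadWrite)) {
        qDebug() << "cannot open file";
        return 1;
    }
    // Grow the file to 800 MB in one call; no data is written by hand, and
    // whether blocks are physically reserved is up to the filesystem.
    if (!file.resize(Q_INT64_C(800) * 1024 * 1024))
        qDebug() << "resize failed:" << file.errorString();
    return 0;
}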
I am trying to count the number of lines in a huge file. This ASCII file is anywhere from 12-15 GB. Right now, I am using something along the lines of readline() to count each line of the file, but of course this is extremely slow. I've also tried to implement lower-level reading using seekg() and tellg(), but due to the size of my file I am unable to allocate a large enough array to store every character for a '\n' comparison (I have 8 GB of RAM). What would be a faster way of reading this ridiculously large file? I've looked through many posts here and most people don't seem to have trouble with the 32-bit system limitation, but here I see that as a problem (correct me if I'm wrong).
Also, if anyone can recommend me a good way of splitting something this large, that would be helpful as well.
Thanks!
Don't try to read the whole file at once. If you're counting lines, just read in chunks of a given size. A couple of MB should be a reasonable buffer size.
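As a rough illustration, here is an untested sketch that counts newlines in fixed-size chunks so the whole file never has to fit in RAM; the file name is a placeholder:

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <vector>

int main()
{
    std::ifstream in("huge.txt", std::ios::binary); // placeholder file name
    if (!in) {
        std::fprintf(stderr, "could not open file\n");
        return 1;
    }
    std::vector<char> buffer(4 * 1024 * 1024); // 4 MB per read, never the whole file
    std::uint64_t lines = 0;
    while (in) {
        in.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        lines += std::count(buffer.data(), buffer.data() + in.gcount(), '\n');
    }
    std::printf("%llu lines\n", static_cast<unsigned long long>(lines));
    return 0;
}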
Try Boost memory-mapped files; the same code works on both Windows and POSIX platforms.
Memory-mapping a file does not require that you actually have enough RAM to hold the whole file. I've used this technique successfully with files up to 30 GB (I think I had 4 GB of RAM in that machine). You will need a 64-bit OS and 64-bit tools (I was using Python on FreeBSD) in order to be able to address that much.
Using a memory mapped file significantly increased the performance over explicitly reading chunks of the file.
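For example, counting the newlines through a mapping could look roughly like this (untested, 64-bit build assumed, file name is a placeholder):

#include <algorithm>
#include <cstdio>
#include <string>
#include <boost/iostreams/device/mapped_file.hpp>

int main()
{
    // Mapping a 12-15 GB file requires a 64-bit build, but not 12-15 GB of RAM.
    const std::string path = "huge.txt"; // placeholder file name
    boost::iostreams::mapped_file_source file(path);
    const long long lines = static_cast<long long>(
        std::count(file.data(), file.data() + file.size(), '\n'));
    std::printf("%lld lines\n", lines);
    return 0;
}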
What OS are you on? Is there no wc -l or equivalent command on that platform?