I need to read a fixed number of bytes from files whose sizes are around 50MB. To be more precise, I read one frame at a time from YUV 4:2:0 CIF/QCIF files (~25KB to ~100KB per frame). Not a huge amount, but I don't want the whole file to be in memory. I'm using C++; in such a case, which of FILE* or ifstream has better (less/minimal) memory usage? Please kindly advise. Thanks!
EDIT:
I read a fixed number of bytes: ~25KB or ~100KB per frame (depending on QCIF/CIF format). The reading is in binary mode and forward-only. No seeking is needed. No writing is needed, only reading.
EDIT:
If identifying the better of the two is hard: which one does not require loading the whole file into memory?
Impossible to say - it will depend on the implementation, and how you are reading the data, which you have not described. In general, questions here regarding performance are somewhat pointless, as they depend heavily on your actual usage of library and language features, the specific implementation, your hardware etc. etc.
Edit: To answer your expanded question - neither library requires you read everything into memory. Why would you think they do?
I think the best answer would be "profile and see", but in theory FILE* should be more efficient in time and memory usage. Streams add various wrappers, error handling, and so on over the raw reading/writing routines, which could (in your particular case) affect memory usage.
You can expect a smaller executable using FILE*, since its supporting libraries are simpler than ifstream's, but the other factors (runtime memory consumption and performance) rarely make a significant difference. What small gain there is will generally be on the FILE* side, again simply because it's simpler.
If the processing you do with the file is very basic and/or you don't need to parse a text input file, FILE* will suit you well. On the other hand, if the opposite is true, I'd go for ifstream - I find the >> operator a lot handier and safer than using fscanf.
Performance-wise you're definitely better off with FILE* (I profiled that some time ago in a project of mine). Memory-wise, iostreams shouldn't pose a big problem, although I think there is some overhead since they wrap the C library.
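For what it's worth, neither approach needs more than one frame's worth of buffer in memory. A minimal sketch of both (file name and frame size are placeholders for your actual values):

```cpp
#include <cstdio>
#include <fstream>
#include <vector>

int main() {
    const std::size_t kFrameSize = 25 * 1024;  // e.g. the ~25KB QCIF case; adjust for CIF
    std::vector<char> frame(kFrameSize);

    // C stdio: only the frame buffer plus FILE's small internal buffer is in memory.
    if (FILE* f = std::fopen("video.yuv", "rb")) {
        while (std::fread(frame.data(), 1, kFrameSize, f) == kFrameSize) {
            // process one frame
        }
        std::fclose(f);
    }

    // iostreams: same memory profile, just a different interface.
    std::ifstream in("video.yuv", std::ios::binary);
    while (in.read(frame.data(), static_cast<std::streamsize>(kFrameSize))) {
        // process one frame
    }
    return 0;
}
```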
Related
I need a program that reads the contents of a file and writes them into another file, keeping only the characters that are valid UTF-8. The problem is that the file may come in any encoding, and its contents may or may not correspond to that encoding.
I know it's a mess, but that's the data I get to work with. The files I need to "clean" can be as big as a couple of terabytes, so I need the program to be as efficient as humanly possible. Currently I'm using a program I wrote in Python, but it takes as long as a week to clean 100GB.
I was thinking of reading the characters with the wchar functions, then treating them as integers and discarding all the values that are not in some range. Is this the optimal solution?
Also what's the most efficient way to read and write in C/C++?
EDIT: The problem is not the IO operations; that part of the question is only intended as extra help toward an even quicker program. The real issue is how to identify non-UTF-8 characters quickly. Also, I have already tried parallelization and RAM disks.
UTF-8 is just a nice way of encoding characters and has a very clearly defined structure, so fundamentally it is reasonably simple to read a chunk of memory and validate that it contains UTF-8. Mostly this consists of verifying that certain bit patterns do NOT occur, such as C0, C1, or F5 to FF (depending on position).
It is reasonably simple in C (sorry, I don't speak Python) to code something that does a simple fopen/fread and checks the bit patterns of each byte, although I would recommend finding some code to cut/paste (e.g. http://utfcpp.sourceforge.net/ , though I haven't used those exact routines), as there are some caveats and special cases to handle. Just treat the input bytes as unsigned char and bitmask them directly. I would paste what I use, but I'm not in the office.
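For illustration, a minimal sketch of that byte-pattern approach. It checks lead/continuation patterns, overlong forms, UTF-16 surrogates, and the out-of-range lead bytes, but deliberately leaves out handling of a sequence split across two fread chunks, which is one of the caveats mentioned:

```cpp
#include <cstdio>

// Length of a valid UTF-8 sequence starting at p (1..4), or 0 if invalid.
static int utf8_len(const unsigned char* p, const unsigned char* end) {
    unsigned char b0 = p[0];
    if (b0 < 0x80) return 1;                                   // ASCII
    if (b0 < 0xC2) return 0;                                   // stray continuation, or overlong C0/C1
    if (b0 < 0xE0)                                             // 2-byte sequence
        return (end - p >= 2 && (p[1] & 0xC0) == 0x80) ? 2 : 0;
    if (b0 < 0xF0) {                                           // 3-byte sequence
        if (end - p < 3 || (p[1] & 0xC0) != 0x80 || (p[2] & 0xC0) != 0x80) return 0;
        if (b0 == 0xE0 && p[1] < 0xA0) return 0;               // overlong
        if (b0 == 0xED && p[1] > 0x9F) return 0;               // UTF-16 surrogate range
        return 3;
    }
    if (b0 < 0xF5) {                                           // 4-byte sequence
        if (end - p < 4 || (p[1] & 0xC0) != 0x80 ||
            (p[2] & 0xC0) != 0x80 || (p[3] & 0xC0) != 0x80) return 0;
        if (b0 == 0xF0 && p[1] < 0x90) return 0;               // overlong
        if (b0 == 0xF4 && p[1] > 0x8F) return 0;               // above U+10FFFF
        return 4;
    }
    return 0;                                                  // F5..FF never appear in UTF-8
}

// Copy only the valid sequences from [p, end) to out, dropping invalid bytes.
static void filter_utf8(const unsigned char* p, const unsigned char* end, FILE* out) {
    while (p < end) {
        int n = utf8_len(p, end);
        if (n) { std::fwrite(p, 1, static_cast<std::size_t>(n), out); p += n; }
        else   { ++p; }                                        // drop one bad byte and resync
    }
}

int main() {
    const unsigned char sample[] = { 'o', 'k', 0xC3, 0xA9, 0xFF, 0xC0, 0xAF, '!' };
    filter_utf8(sample, sample + sizeof sample, stdout);       // prints "oké!"
    return 0;
}
```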
A C program will rapidly become IO bound, so suggestions about IO will then apply if you want ultimate performance; however, direct byte inspection like this will be hard to beat in performance if you do it right. UTF-8 is nice in that you can find sequence boundaries even if you start in the middle of the file, so this lends itself nicely to parallel algorithms.
If you build your own, watch for a BOM that might appear at the start of some files.
Links
http://en.wikipedia.org/wiki/UTF-8 nice clear overview with table showing valid bit patterns.
https://www.rfc-editor.org/rfc/rfc3629 the rfc describing utf8
http://www.unicode.org/ homepage of the Unicode consortium.
In my opinion your best bet is to parallelize. If you can parallelize the cleaning and process many chunks simultaneously, the whole job will be more efficient. I'd look into a framework for parallelization, e.g. MapReduce, where you can multithread the task.
I would look at memory-mapped files. This is something I know from the Microsoft world; I'm not sure if it exists on Unix etc., but it likely does.
Basically you open the file, point the OS at it, and it loads the file (or a chunk of it) into memory, which you can then access through a pointer. For a 100 GB file, you could map perhaps 1 GB at a time, process it, and then write to a memory-mapped output file.
http://msdn.microsoft.com/en-us/library/windows/desktop/aa366556(v=vs.85).aspx
http://msdn.microsoft.com/en-us/library/windows/desktop/aa366542(v=vs.85).aspx
This should I would think be the fastest way to perform big I/O, but you would need to test in order to say for sure.
HTH, good luck!
Unix/Linux and any other POSIX-compliant OS supports memory mapping (mmap) too.
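A minimal POSIX sketch of that idea (the Windows equivalent uses CreateFileMapping/MapViewOfFile as in the links above); error handling is kept to a minimum and the sliding-window remapping you would want for files larger than the address space is left out:

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main(int argc, char** argv) {
    if (argc < 2) return 1;
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { std::perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { std::perror("fstat"); return 1; }

    // Map the whole file read-only; for terabyte inputs you would map
    // a window (say 1 GB at a time) and advance it instead.
    std::size_t len = static_cast<std::size_t>(st.st_size);
    void* base = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
    if (base == MAP_FAILED) { std::perror("mmap"); return 1; }

    const unsigned char* p = static_cast<const unsigned char*>(base);
    (void)p;  // ... scan p[0 .. len) here, e.g. with the UTF-8 check above ...

    munmap(base, len);
    close(fd);
    return 0;
}
```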
I will be reading and writing large chunks of a large binary file.
Is there a class in standard C++, the upcoming standard, or Boost that will make my task easier?
If not, would it be possible to use the string class for this? What would be the dangers of doing so?
PS: A few observations that will clarify things. I expect that the blobs will be passed around a lot, so a container that is reference counted and CoW would probably be preferable.
Also, my resistance to using a string class is twofold: these are blobs, not strings, and "unprintable characters", in particular nulls, may cause difficulties when they appear.
If you have a blob of binary data you can store this easily and efficiently in a std::vector<unsigned char>.
You can increase performance if you know (or can guess) the size of the blobs by calling reserve.
And finally, if you use streams you can easily read into a vector using std::back_inserter.
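Roughly what that looks like, assuming a hypothetical blob.bin (note that istreambuf_iterator copies one byte at a time; for very large blobs, resizing the vector and doing a single read() is usually faster):

```cpp
#include <algorithm>
#include <fstream>
#include <iterator>
#include <vector>

int main() {
    std::ifstream in("blob.bin", std::ios::binary);

    std::vector<unsigned char> blob;
    blob.reserve(1 << 20);                       // guessed size, avoids repeated reallocation

    // Copy the whole stream into the vector; std::back_inserter grows it as needed.
    std::copy(std::istreambuf_iterator<char>(in),
              std::istreambuf_iterator<char>(),
              std::back_inserter(blob));
    return 0;
}
```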
Depending on what exactly you want to do, a memory mapped file, such as the one from boost, is probably a good starting point. For in-memory modification, use an std::vector, as others have suggested.
Don't bother with CoW - it's mostly frowned upon in the C++ world, with the possible exception of everything in Qt.
First of all, to be clear, I'm aware that a huge number of MD5 implementations exist in C++. What I'm wondering is whether there is a comparison showing which implementation is faster than the others. Since I'm using this MD5 hash function on files larger than 10GB, speed is indeed a major concern here.
I think the point avakar is trying to make is: with modern processing power, the IO speed of your hard drive is the bottleneck, not the calculation of the hash. Getting a more efficient algorithm will not help you, as that is not (likely) the slowest point.
If you are doing anything special (thousands of rounds, for example) then it may be different, but if you are just calculating the hash of a file, you need to speed up your IO, not your math.
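To make that concrete, the hot loop is basically read-then-update. A sketch assuming OpenSSL's MD5_* API (deprecated since OpenSSL 3.0 but still widely available); with large reads like this, the fread calls rather than MD5_Update will typically dominate on a mechanical disk:

```cpp
#include <cstdio>
#include <vector>
#include <openssl/md5.h>   // assumes OpenSSL is available; link with -lcrypto

int main(int argc, char** argv) {
    if (argc < 2) return 1;
    FILE* f = std::fopen(argv[1], "rb");
    if (!f) return 1;

    MD5_CTX ctx;
    MD5_Init(&ctx);

    std::vector<unsigned char> buf(1 << 20);   // 1 MB chunks: IO, not the hash, sets the pace
    std::size_t n;
    while ((n = std::fread(buf.data(), 1, buf.size(), f)) > 0)
        MD5_Update(&ctx, buf.data(), n);
    std::fclose(f);

    unsigned char digest[MD5_DIGEST_LENGTH];
    MD5_Final(digest, &ctx);
    for (unsigned char b : digest) std::printf("%02x", b);
    std::printf("\n");
    return 0;
}
```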
I don't think it matters much (on the same hardware; GPGPUs are indeed different, and perhaps faster, hardware for that kind of problem). The main part of MD5 is a fairly complex loop of arithmetic operations. What does matter is the quality of the compiler's optimizations.
What also matters is how you read the file. On Linux, mmap, madvise, and readahead could be relevant. Disk speed is probably the bottleneck (use an SSD if you can).
And are you sure you want MD5 specifically? There are simpler and faster hashing algorithms (MD4, etc.). Still, your problem is more I/O-bound than CPU-bound.
I'm sure there are plenty of CUDA/OpenCL adaptations of the algorithm out there which should give you a definite speedup. You could also take the basic algorithm and think a bit -> get a CUDA/OpenCL implementation going.
Block-ciphers are perfect candidates for this type of implementation.
You could also get a C implementation of it and grab a copy of the Intel C compiler and see how good that is. The vectorization extensions in Intel CPUs are amazing for speed boosts.
There's a table available here:
http://www.golubev.com/gpuest.htm
It looks like your bottleneck will probably be your hard drive's IO.
I'm asking in the context of performance. Is stringstream simply a string/vector, so that writing to it may result in its whole content being copied to a bigger chunk of memory, or is it done in a more tricky way (say, a list of strings or whatever)?
27.7.3/1 says that basic_ostringstream uses a basic_stringbuf. I think that 27.7.1.3/8 says that basic_stringbuf makes space by reallocating a buffer, and doesn't even guarantee exponential growth (and hence amortized O(1) to append).
But I find the streams section of the standard pretty impenetrable, and there's always the "as-if" rule. So I can't promise you that using a deque underneath (and consolidating when someone asks for the string / buffer) is actually forbidden.
It's up to the standard library vendor how to implement stringstream (or any library feature for that matter). You can look at the sstream header shipped with your compiler to see how it's implemented there. That much on the theoretical side...
As far as practical experience and measurements show, ostringstream is often slow compared to other methods for formatting data as character strings. But then again, only optimize after you have measured that what you want to optimize is indeed a performance bottleneck, otherwise that'll just be a waste of time at best.
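As a rough illustration of the kind of measurement to make, here is a sketch comparing ostringstream with appending to a pre-reserved std::string via snprintf. The numbers vary a lot between standard library implementations, which is exactly why measuring first matters:

```cpp
#include <chrono>
#include <cstdio>
#include <iostream>
#include <sstream>
#include <string>

int main() {
    const int N = 1000000;

    // Variant 1: format N integers with ostringstream.
    auto t0 = std::chrono::steady_clock::now();
    std::ostringstream oss;
    for (int i = 0; i < N; ++i) oss << i << ' ';
    std::string a = oss.str();
    auto t1 = std::chrono::steady_clock::now();

    // Variant 2: snprintf into a small buffer, append to a pre-reserved string.
    std::string b;
    b.reserve(a.size());
    char tmp[16];
    for (int i = 0; i < N; ++i) {
        int len = std::snprintf(tmp, sizeof tmp, "%d ", i);
        b.append(tmp, static_cast<std::size_t>(len));
    }
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::duration<double, std::milli>;
    std::cout << "ostringstream:   " << ms(t1 - t0).count() << " ms\n"
              << "string+snprintf: " << ms(t2 - t1).count() << " ms\n";
    return 0;
}
```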
If your measurements show that the performance of ostringstream really is a problem for you, consider using Boost.Karma. Of course there are more reasons to use Boost.Karma than just performance, so if you are starting new code rather than modifying existing code that uses string streams, you might well want to use Karma from the get-go.
What is the best way to parse a large floating point file stored in ASCII?
What would be the fastest way to do it? I remember someone telling me using ifstream was bad, because it worked on a small number of bytes, and it would be better to just read the file into memory first. Is that true?
Edit: I am running on Windows, and the file format is for a point cloud that is stored in rows like x y z r g b. I am attempting to read them into arrays. Also, the files are around 20 MB each, but I have around 10 GB worth of them.
Second edit: I am going to have to load the files to display every time I want to do a visualization, so it would be nice to have it as fast as possible, but honestly, if ifstream performs reasonably, I wouldn't mind sticking with readable code. It's running quite slow right now, but that might be more of a hardware I/O limitation than anything I can do in software; I just wanted to confirm.
I think your first concern should be how large the floating point numbers are. Are they float, or can there be double data too? The traditional (C) way would be to use fscanf with the format specifier for a float, and as far as I know it is rather fast. The iostreams do add a small overhead in terms of parsing the data, but that is rather negligible. For the sake of brevity, I would suggest you use iostreams (not to mention the usual stream features you'd get with them).
Also, I think it would really help the community if you could add the relevant numbers to your question, e.g. how large a file are you trying to parse? Is this a small-memory-footprint environment (like an embedded system)?
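For reference, the straightforward iostream version of that row parsing is short. A sketch assuming a hypothetical cloud.txt and that r g b are stored as floating point too:

```cpp
#include <fstream>
#include <vector>

struct Point {
    float x, y, z;
    float r, g, b;   // assumed to be floating point too; adjust if they are 0-255 ints
};

int main() {
    std::ifstream in("cloud.txt");               // hypothetical file name
    std::vector<Point> points;
    points.reserve(1 << 20);                     // rough guess to cut down on reallocations

    Point p;
    while (in >> p.x >> p.y >> p.z >> p.r >> p.g >> p.b)
        points.push_back(p);
    return 0;
}
```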
It's all based on the operating system, and the choice of C and C++ standard libraries.
The days of a slow ifstream are pretty much over; however, there is likely some overhead in handling the generic C++ interfaces.
atof/strtod might be the fastest way to deal with it if the string is already in memory.
Finally, any attempt to read the whole file into memory up front will likely be futile. Modern operating systems usually get in the way (especially if the file is larger than RAM: you will end up swapping, since the system will treat your data, which is already stored on disk, as swappable).
If you really need to be ridiculously fast (the only places I can think it will be useful are HPC and MapReduce-based approaches), try mmap (Linux/Unix) or MapViewOfFile (Windows) to get the file mapped into virtual memory in the most sensible way, and then use atof plus custom string handling.
If the file is really well organized for this kind of game, you can even get clever with mmaps and pointers and multithread the conversion. Sounds like a fun exercise if you have over 10GB of floats to convert on a regular basis.
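If profiling ever does point at the parsing rather than the disk, the strtod-over-a-buffer idea looks roughly like this (a sketch only; the buffer would come from mmap/MapViewOfFile or a single read, and strtod needs it NUL-terminated or otherwise safely bounded):

```cpp
#include <cstdlib>
#include <vector>

struct Point { float x, y, z, r, g, b; };

// Parse whitespace-separated floats from a NUL-terminated buffer, six per row.
static std::vector<Point> parse(const char* s) {
    std::vector<Point> points;
    for (;;) {
        float v[6];
        for (int i = 0; i < 6; ++i) {
            char* end = nullptr;
            v[i] = static_cast<float>(std::strtod(s, &end));
            if (end == s) return points;         // ran out of numbers
            s = end;
        }
        points.push_back({v[0], v[1], v[2], v[3], v[4], v[5]});
    }
}

int main() {
    std::vector<Point> pts = parse("1.0 2.0 3.0 0.5 0.5 0.5\n4 5 6 1 1 1\n");
    return pts.size() == 2 ? 0 : 1;
}
```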
The fastest way is probably to use an ifstream, but you can also use fscanf. If you have a specific platform in mind, you could hand-load the file into memory and parse the floats from it manually.