Best way to parse a large floating point file stored in ASCII? - c++

What is the best way to parse a large floating point file stored in ASCII?
What would be the fastest way to do it? I remember someone telling me using ifstream was bad, because it worked on a small number of bytes, and it would be better to just read the file into memory first. Is that true?
Edit: I am running on Windows, and the file format is for a point cloud that is stored in rows like x y z r g b. I am attempting to read them into arrays. Also, the files are around 20 MB each, but I have around 10 GB worth of them.
Second edit: I am going to have to load the files to display every time I want to do a visualization, so it would be nice to have it as fast as possible, but honestly, if ifstream performs reasonably, I wouldn't mind sticking with readable code. It's running quite slowly right now, but that might be more of a hardware I/O limitation than anything I can fix in software; I just wanted to confirm.

I think your first concern should be how large the floating point numbers are: are they float, or can there be double data too? The traditional (C) way would be to use fscanf with the format specifier for a float, and as far as I know it is rather fast. The iostreams do add a small overhead in terms of parsing the data, but that is rather negligible. For the sake of brevity I would suggest you use iostreams (not to mention the usual stream features you'd get with them).
Also, I think it would really help the community if you could add the relevant numbers to your question, e.g. how large a file are you trying to parse? Is this a small-memory-footprint environment (like an embedded system)?
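For the x y z r g b rows you describe, both approaches are only a few lines. Here is a rough sketch (the Point layout and the choice of float are assumptions):

    #include <cstdio>
    #include <vector>

    struct Point { float x, y, z, r, g, b; };   // one "x y z r g b" row

    // C-style: one fscanf per row.
    std::vector<Point> read_points(const char* path)
    {
        std::vector<Point> pts;
        std::FILE* fp = std::fopen(path, "r");
        if (!fp) return pts;

        Point p;
        while (std::fscanf(fp, "%f %f %f %f %f %f",
                           &p.x, &p.y, &p.z, &p.r, &p.g, &p.b) == 6)
            pts.push_back(p);

        std::fclose(fp);
        return pts;
    }

    // iostreams equivalent:
    //   std::ifstream in(path);
    //   while (in >> p.x >> p.y >> p.z >> p.r >> p.g >> p.b) pts.push_back(p);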

It all depends on the operating system and on the C and C++ standard library implementations.
The days of slow ifstream are pretty much over; however, there is likely some overhead in handling the C++ generic interfaces.
atof/strtod might be the fastest way to deal with it if the string is already in memory.
Finally, any attempt at getting the whole file read into memory up front will likely be futile. Modern operating systems usually get in the way (especially if the file is larger than RAM, you will end up swapping, since the system will treat your (already stored on disk) data as swappable).
If you really need to be ridiculously fast (the only places I can think it will be useful are HPC and Map/Reduce based approaches), try mmap (Linux/Unix) or MapViewOfFile (Windows) to get the file prefetched into virtual memory in the most sensible way, and then atof plus custom string handling.
If the file is really well organized for this kind of game, you can even get quirky with mmaps and pointers and make the conversion multithreaded. Sounds like a fun exercise if you have over 10 GB of floats to convert on a regular basis.
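As a sketch of the strtod-over-a-buffer idea (here the buffer is just the whole file slurped with ifstream; with mmap/MapViewOfFile you would point p at the mapped view instead):

    #include <cstdlib>    // std::strtod
    #include <fstream>
    #include <iterator>
    #include <string>
    #include <vector>

    // Slurp the whole file into a string, then walk it with strtod.
    // For the "x y z r g b" format, every 6 consecutive values form one point.
    std::vector<double> parse_floats(const char* path)
    {
        std::ifstream in(path, std::ios::binary);
        std::string buf((std::istreambuf_iterator<char>(in)),
                        std::istreambuf_iterator<char>());

        std::vector<double> values;
        const char* p = buf.c_str();
        char* end = 0;
        for (;;) {
            double v = std::strtod(p, &end);
            if (end == p) break;      // no more numbers (or a non-numeric token)
            values.push_back(v);
            p = end;                  // strtod skips leading whitespace/newlines
        }
        return values;
    }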

The fastest way is probably to use an ifstream, but you can also use fscanf. If you are targeting a specific platform, you could hand-load the file into memory and parse the floats from it manually.

Related

Strings vs binary for storing variables inside the file format

We aim to use HDF5 as our data format. HDF5 has been selected because it is a hierarchical, filesystem-like, cross-platform data format and it supports large amounts of data.
The file will contain arrays and some parameters. The question is about how to store the parameters (which are not made up by large amounts of data), considering also file versioning issues and the efforts to build the library. Parameters inside the HDF5 could be stored as either (A) human-readable attribute/value pairs or (B) binary data in the form of HDF5 compound data types.
Just as an example, let's consider as a parameter a polygon with three vertices. Under case A we could have, for instance, a variable named Polygon holding the string representation of the series of vertices, e.g. (1, 2); (3, 4); (4, 1). Under case B, we could instead have a variable named Polygon made up of a [2 x 3] matrix.
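For concreteness, here is a rough sketch of what the two options might look like with the HDF5 C API (the names Polygon, PolygonBin, and params.h5 are just placeholders; this is illustrative, not our actual code):

    #include <cstring>
    #include <hdf5.h>

    int main()
    {
        hid_t file = H5Fcreate("params.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

        /* Case A: a human-readable string attribute on the root group. */
        const char* text = "(1, 2); (3, 4); (4, 1)";
        hid_t str_type = H5Tcopy(H5T_C_S1);
        H5Tset_size(str_type, std::strlen(text) + 1);
        hid_t scalar = H5Screate(H5S_SCALAR);
        hid_t attr = H5Acreate2(file, "Polygon", str_type, scalar,
                                H5P_DEFAULT, H5P_DEFAULT);
        H5Awrite(attr, str_type, text);
        H5Aclose(attr); H5Sclose(scalar); H5Tclose(str_type);

        /* Case B: the same vertices as a small [2 x 3] numeric dataset. */
        double verts[2][3] = { { 1, 3, 4 },     /* x coordinates */
                               { 2, 4, 1 } };   /* y coordinates */
        hsize_t dims[2] = { 2, 3 };
        hid_t space = H5Screate_simple(2, dims, NULL);
        hid_t dset = H5Dcreate2(file, "PolygonBin", H5T_NATIVE_DOUBLE, space,
                                H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, verts);
        H5Dclose(dset); H5Sclose(space);

        H5Fclose(file);
        return 0;
    }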
We have some idea, but it would be great to have inputs from people who have already worked with something similar. More precisely, could you please list pro/cons of A and B and also say under what circumstances which would be preferable?
Speaking as someone who's had to do exactly what you're talking about a number of times, rr got it basically right, but I would change the emphasis a little.
For file versioning, text is basically the winner.
Since you're using an HDF5 library, I assume serializing and parsing take roughly equivalent human effort either way.
text files are more portable. You can transfer the files across generations of hardware with minimal risk.
text files are easier for humans to work with. If you want to extract a subset of the data and manipulate it, you can do that with many programs on many computers. If you are working with binary data, you will need a program that allows you to do so. Depending on how you see people working with your data, this can make a huge difference to the accessibility of the data and to maintenance costs. You'll be able to sed, grep, and even edit the data in Excel.
input and output of binary data (for large data sets) will be vastly faster than text.
working with those binary files in a new environment (e.g. a 128-bit little-endian computer in some sci-fi future) will require some engineering.
similarly, if you write applications in other languages, you'll need to handle the encoding identically between applications. This means either engineering effort or having the same libraries available on all platforms. With plain text this is easier...
If you want others to write applications that work with your data, plain text is simpler. If you're providing binary files, you'll have to provide a file specification which they can follow. With plain text, anyone can just look at the file and figure out how to parse it.
you can archive the text files with compression, so space concerns are primarily an issue for the data you are actively working with.
debugging binary data storage is significantly more work than debugging plain-text storage.
So in the end it depends a little on your use case. Is it meaningful to look at the data in the myriad tools that handle plain-text? Is it only meaningful to look at it with big-data hdf5 viewers? Will writing plain text be onerous to you in terms of time and space?
In general, when I'm faced with this issue, I basically always do the same thing: I store the data in plain text until I realize the speed problems are more irritating than working with binary would be, and then I switch. If you don't know in advance whether you'll cross that threshold, start with plain text, and write your interface to your persistence layer in such a way that it will be easy to switch later. This is a tiny bit of additional work, which you will probably get back thanks to plain text being easier to debug.
If you expect to edit the file by hand often (like XMLs or JSONs), then go with human readable format.
Otherwise go with binary - it's much easier to create a parser for it and it will run faster than any grammar parser.
Also note how there's nothing that prevents you from creating a converter between binary and human-readable form later.
Versioning files might sound nice, but are you really going to inspect the diffs for files "containing large arrays"?

Best approach to storing scientific data sets on disk C++

I'm currently working on a project that requires working with gigabytes of scientific data sets. The data sets are in the form of very large arrays (30,000 elements) of integers and floating point numbers. The problem here is that they are too large to fit into memory, so I need an on-disk solution for storing and working with them. To make this problem even more fun, I am restricted to using a 32-bit architecture (as this is for work), and I need to try to maximize performance for this solution.
So far, I've worked with HDF5, which worked okay, but I found it a little too complicated to work with. So, I thought the next best thing would be to try a NoSQL database, but I couldn't find a good way to store the arrays in the database short of casting them to character arrays and storing them like that, which caused a lot of bad pointer headaches.
So, I'd like to know what you guys recommend. Maybe you have a less painful way of working with HDF5 while at the same time maximizing performance. Or maybe you know of a NoSQL database that works well for storing this type of data. Or maybe I'm going in the totally wrong direction with this and you'd like to smack some sense into me.
Anyway, I'd appreciate any words of wisdom you guys can offer me :)
Smack some sense into yourself and use a production-grade library such as HDF5. So you found it too complicated, but did you look at its high-level APIs?
If you don't like that answer, try one of the emerging array databases such as SciDB, rasdaman or MonetDB. I suspect though, that if you have baulked at HDF5 you'll baulk at any of these.
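To illustrate how small the high-level route can be, here is a hedged sketch using the H5LT ("lite") interface; the file and dataset names are made up:

    #include <hdf5.h>
    #include <hdf5_hl.h>    // the "lite" high-level interface
    #include <vector>

    int main()
    {
        std::vector<double> data(30000, 0.0);     // one of the large arrays
        hsize_t dims[1] = { data.size() };

        hid_t file = H5Fcreate("arrays.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        // One call creates the dataspace and dataset and writes the data.
        H5LTmake_dataset_double(file, "/my_array", 1, dims, &data[0]);
        H5Fclose(file);

        // Reading it back is just as short:
        //   H5LTread_dataset_double(file, "/my_array", buffer);
        return 0;
    }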
In my view, and experience, it is worth the effort to learn how to properly use a tool such as HDF5 if you are going to be working with large scientific data sets for any length of time. If you pick up a tool such as a NoSQL database, which was not designed for the task at hand, then, while it may initially be easier to use, eventually (before very long would be my guess) it will lack features you need or want and you will find yourself having to program around its deficiencies.
Pick one of the right tools for the job and learn how to use it properly.
Assuming your data sets really are large enough to merit it (e.g., instead of 30,000 elements, a 30,000 x 30,000 array of doubles), you might want to consider STXXL. It provides interfaces that are intended to imitate (and largely succeed at imitating) those of the containers in the C++ standard library, but are designed to work with data too large to fit in memory.
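A rough sketch of what that looks like, assuming the classic STXXL VECTOR_GENERATOR idiom (the element count is arbitrary and the disk configuration is left to STXXL's defaults):

    #include <cstddef>
    #include <stxxl/vector>

    int main()
    {
        // Behaves much like std::vector, but keeps its blocks on disk and
        // only caches a bounded number of pages in RAM.
        stxxl::VECTOR_GENERATOR<double>::result huge;

        for (std::size_t i = 0; i < 1000000000ULL; ++i)   // ~8 GB of doubles
            huge.push_back(static_cast<double>(i));

        double sum = 0.0;
        for (std::size_t i = 0; i < huge.size(); ++i)     // sequential scans stream from disk
            sum += huge[i];

        return sum > 0.0 ? 0 : 1;
    }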
I have been working in scientific computing for years, and I think HDF5 or NetCDF is a good data format for you to work with. They can provide efficient parallel read/write, which is important for dealing with big data.
An alternative solution is to use an array database, like SciDB, MonetDB, or RasDaMan. However, it will be kind of painful if you try to load HDF5 data into an array database. I once tried to load HDF5 data into SciDB, and it required a series of data transformations. You need to know whether you will query the data often or not. If not often, then the time-consuming loading may not be worthwhile.
You may be interested in this paper; it allows you to query HDF5 data directly using SQL.

How to check if a char is valid in C++

I need a program that reads the contents of a file and writes them into another file, but only the characters that are valid UTF-8. The problem is that the file may come in any encoding, and the contents of the file may or may not correspond to that encoding.
I know it's a mess, but that's the data I get to work with. The files I need to "clean" can be as big as a couple of terabytes, so I need the program to be as efficient as humanly possible. Currently I'm using a program I wrote in Python, but it takes as long as a week to clean 100 GB.
I was thinking of reading the characters with the wchar_t functions, then handling them as integers and discarding all the values that are not in some range. Is this the optimal solution?
Also what's the most efficient way to read and write in C/C++?
EDIT: The problem is not the I/O operations; that part of the question is intended as extra help to get an even quicker program, but the real issue is how to identify non-UTF-8 characters quickly. Also, I have already tried parallelization and RAM disks.
UTF-8 is just a nice way of encoding characters and has a very clearly defined structure, so fundamentally it is reasonably simple to read a chunk of memory and validate that it contains UTF-8. Mostly this consists of verifying that certain bit patterns do NOT occur, such as C0, C1, and F5 to FF (depending on position).
It is reasonably simple in C (sorry, I don't speak Python) to code something that is a simple fopen/fread loop and checks the bit patterns of each byte, although I would recommend finding some code to cut/paste (e.g. http://utfcpp.sourceforge.net/, though I haven't used those exact routines) as there are some caveats and special cases to handle. Just treat the input bytes as unsigned char and bitmask them directly. I would paste what I use, but I'm not in the office.
A C program will rapidly become I/O bound, so the suggestions about I/O will apply if you want ultimate performance; however, direct byte inspection like this will be hard to beat in performance if you do it right. UTF-8 is nice in that you can find sequence boundaries even if you start in the middle of the file, so this lends itself nicely to parallel algorithms.
If you build your own, watch for BOMs that might appear at the start of some files.
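Since I can't paste my own code, here is a minimal sketch of that byte inspection (the range checks follow the tables in the links below; the function names are made up):

    #include <cstddef>

    // Returns the length (1-4) of the valid UTF-8 sequence starting at s,
    // or 0 if the bytes at s do not form a valid sequence.
    // 'n' is the number of bytes available at s.
    std::size_t utf8_sequence_length(const unsigned char* s, std::size_t n)
    {
        if (n == 0) return 0;
        unsigned char b0 = s[0];

        if (b0 <= 0x7F) return 1;                                   // ASCII

        auto cont = [&](std::size_t i) { return i < n && (s[i] & 0xC0) == 0x80; };

        if (b0 >= 0xC2 && b0 <= 0xDF)                               // 2-byte
            return cont(1) ? 2 : 0;

        if (b0 >= 0xE0 && b0 <= 0xEF) {                             // 3-byte
            if (!cont(1) || !cont(2)) return 0;
            if (b0 == 0xE0 && s[1] < 0xA0) return 0;                // overlong
            if (b0 == 0xED && s[1] > 0x9F) return 0;                // UTF-16 surrogates
            return 3;
        }

        if (b0 >= 0xF0 && b0 <= 0xF4) {                             // 4-byte
            if (!cont(1) || !cont(2) || !cont(3)) return 0;
            if (b0 == 0xF0 && s[1] < 0x90) return 0;                // overlong
            if (b0 == 0xF4 && s[1] > 0x8F) return 0;                // beyond U+10FFFF
            return 4;
        }

        return 0;   // C0, C1, F5..FF and stray continuation bytes are never valid
    }

    // Copy only the valid sequences from in[0..n) to out; returns bytes written.
    std::size_t copy_valid_utf8(const unsigned char* in, std::size_t n, unsigned char* out)
    {
        std::size_t w = 0;
        for (std::size_t i = 0; i < n; ) {
            std::size_t len = utf8_sequence_length(in + i, n - i);
            if (len == 0) { ++i; continue; }          // skip one invalid byte and resync
            for (std::size_t k = 0; k < len; ++k) out[w++] = in[i + k];
            i += len;
        }
        return w;
    }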
Links
http://en.wikipedia.org/wiki/UTF-8 - a nice, clear overview with a table showing the valid bit patterns.
https://www.rfc-editor.org/rfc/rfc3629 - the RFC describing UTF-8.
http://www.unicode.org/ - homepage of the Unicode consortium.
Your best bet, in my opinion, is to parallelize. If you can parallelize the cleaning and clean many files simultaneously, the process will be more efficient. I'd look into a framework for parallelization, e.g. MapReduce, where you can multithread the task.
I would look at memory-mapped files. This is something from the Microsoft world; I'm not sure if it exists on Unix etc., but it likely does.
Basically you open the file and point the OS at it, and it loads the file (or a chunk of it) into memory, which you can then access through a pointer. For a 100 GB file, you could load perhaps 1 GB at a time, process it, and then write to a memory-mapped output file.
http://msdn.microsoft.com/en-us/library/windows/desktop/aa366556(v=vs.85).aspx
http://msdn.microsoft.com/en-us/library/windows/desktop/aa366542(v=vs.85).aspx
This should, I would think, be the fastest way to perform big I/O, but you would need to test it to say for sure.
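A bare-bones sketch of that approach (read-only, mapping the whole file in one go; for a 100 GB input on 32-bit you would instead map a window at a time via the offset arguments; error handling trimmed; the function name is invented):

    #include <windows.h>
    #include <cstddef>

    // Map an existing file read-only and hand the bytes to a callback.
    template <typename Fn>
    bool with_mapped_file(const char* path, Fn process)
    {
        HANDLE file = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                                  OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        if (file == INVALID_HANDLE_VALUE) return false;

        LARGE_INTEGER size;
        GetFileSizeEx(file, &size);

        HANDLE mapping = CreateFileMappingA(file, NULL, PAGE_READONLY, 0, 0, NULL);
        const char* view = mapping
            ? static_cast<const char*>(MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0))
            : NULL;

        if (view)
            process(view, static_cast<std::size_t>(size.QuadPart));

        if (view)    UnmapViewOfFile(view);
        if (mapping) CloseHandle(mapping);
        CloseHandle(file);
        return view != NULL;
    }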
HTH, good luck!
Unix/Linux and other POSIX-compliant OSs support memory mapping (mmap) too.

Binary parser or serialization?

I want to store a graph of different objects for a game, their classes may or may not be related, they may or may not contain vectors of simple structures.
I want parsing operation to be fast, data can be pretty big.
Adding new things should not be hard, and it should not break backward compatibility.
Smaller file size is kind of important
Readability counts
By serialization I mean, making objects serialize themselves, which is effective, but I will need to write different serialization methods for different objects for that.
By binary parsing/composing I mean, creating a new tree of parsers/composers that holds and reads data for these objects, and passing this around to have my objects push/pull their data.
I can also use JSON, but it can be pretty slow to read, and it is not very size-efficient when it comes to pretty big sets of matrices and numbers.
Point by point:
Fast Parsing: binary (since you don't necessarily have to "parse", you can just deserialize)
Adding New Things: text
Smaller: text (even if gzipped text is larger than binary, it won't be much larger).
Readability: text
So that's three votes for text and one for binary. Personally, I'd go with text for everything except images (and other data that is "naturally" binary). Then store everything in one big zip file (I can think of several games that do this, or something close to it).
Good reads: The Importance of Being Textual and Power Of Plain Text.
Check out Protocol Buffers from Google or Thrift from Apache. Although billed as ways to write wire protocols easily, they are basically object serialization mechanisms that can create bindings in a dozen languages, have efficient binary representations and easy versioning, perform well, and are well supported.
We're using Boost.Serialization. Don't know how it performs next to those offered by samkass.
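For what it's worth, a small sketch of what that looks like (the class and file names are invented):

    #include <fstream>
    #include <vector>
    #include <boost/archive/binary_oarchive.hpp>
    #include <boost/archive/binary_iarchive.hpp>
    #include <boost/serialization/vector.hpp>

    struct GameObject {
        int id;
        std::vector<float> transform;   // e.g. a flattened matrix

        // One member function handles both saving and loading.
        template <class Archive>
        void serialize(Archive& ar, const unsigned int /*version*/)
        {
            ar & id;
            ar & transform;
        }
    };

    int main()
    {
        GameObject obj;
        obj.id = 42;
        obj.transform.assign(16, 0.0f);

        {   // write
            std::ofstream ofs("object.bin", std::ios::binary);
            boost::archive::binary_oarchive oa(ofs);
            oa << obj;
        }
        {   // read back
            GameObject copy;
            std::ifstream ifs("object.bin", std::ios::binary);
            boost::archive::binary_iarchive ia(ifs);
            ia >> copy;
        }
        return 0;
    }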

Which of FILE* or ifstream has better memory usage?

I need to read a fixed number of bytes from files whose sizes are around 50 MB. To be more precise, I read a frame from YUV 4:2:0 CIF/QCIF files (~25 KB to ~100 KB per frame). Not a huge number, but I don't want the whole file to be in memory. I'm using C++; in such a case, which of FILE* or ifstream has better (less/minimal) memory usage? Please kindly advise. Thanks!
EDIT:
I read a fixed number of bytes: 25 KB or 100 KB (depending on QCIF/CIF format). The reading is in binary mode and forward-only. No seeking is needed. No writing is needed, only reading.
EDIT:
If identifying the better of the two is hard, which one does not require loading the whole file into memory?
Impossible to say - it will depend on the implementation, and how you are reading the data, which you have not described. In general, questions here regarding performance are somewhat pointless, as they depend heavily on your actual usage of library and language features, the specific implementation, your hardware etc. etc.
Edit: To answer your expanded question: neither library requires you to read everything into memory. Why would you think they do?
I think the best answer would be "profile and see", but in theory FILE* should be more efficient in time and memory usage. Streams do add various wrappers, error handlers, and so on over the raw reading/writing routines, which could (in your particular case) affect memory usage.
You can expect a smaller executable using FILE*, since its supporting library is simpler than ifstream's, but the other factors (runtime memory consumption and performance) rarely make a significant difference. The small gain will generally be in favour of FILE*, again merely because it's simpler.
If the processing you do with the file is very basic and/or you don't need to parse a text input file, FILE* will suit you well. On the other hand, if the opposite is true, I'd go for ifstream - I find the >> operator a lot handier and safer than using fscanf.
Performance-wise you're definitely better off with FILE* (I profiled that some time ago in a project of mine). Memory-wise, iostreams shouldn't pose a big problem, although I think there is some overhead since they wrap the C library.
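To make the comparison concrete, here is a rough sketch of reading one fixed-size frame both ways (the frame size and file name are placeholders); note that neither approach pulls the whole file into memory:

    #include <cstddef>
    #include <cstdio>
    #include <fstream>
    #include <vector>

    const std::size_t kFrameBytes = 25 * 1024;   // placeholder frame size

    // FILE* version: one fread per frame.
    bool read_frame_c(std::FILE* fp, std::vector<unsigned char>& frame)
    {
        frame.resize(kFrameBytes);
        return std::fread(&frame[0], 1, kFrameBytes, fp) == kFrameBytes;
    }

    // ifstream version: one read() per frame.
    bool read_frame_cpp(std::ifstream& in, std::vector<unsigned char>& frame)
    {
        frame.resize(kFrameBytes);
        in.read(reinterpret_cast<char*>(&frame[0]), kFrameBytes);
        return static_cast<std::size_t>(in.gcount()) == kFrameBytes;
    }

    int main()
    {
        std::vector<unsigned char> frame;

        std::FILE* fp = std::fopen("clip.yuv", "rb");
        while (fp && read_frame_c(fp, frame)) { /* process frame */ }
        if (fp) std::fclose(fp);

        std::ifstream in("clip.yuv", std::ios::binary);
        while (read_frame_cpp(in, frame)) { /* process frame */ }
        return 0;
    }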