How to check if a char is valid in C++ - c++

I need a program that reads the contents of a file and writes them into another file, keeping only the characters that are valid UTF-8. The problem is that the file may come in any encoding, and its contents may or may not correspond to that encoding.
I know it's a mess, but that's the data I get to work with. The files I need to "clean" can be as big as a couple of terabytes, so I need the program to be as efficient as humanly possible. Currently I'm using a program I wrote in Python, but it takes as long as a week to clean 100 GB.
I was thinking of reading the characters with the wchar_t functions, then treating them as integers and discarding all the numbers that are not in certain ranges. Is this the optimal solution?
Also what's the most efficient way to read and write in C/C++?
EDIT: The problem is not the I/O operations; that part of the question is intended as extra help toward an even quicker program, but the real issue is how to identify non-UTF-8 characters quickly. Also, I have already tried parallelization and RAM disks.

UTF-8 is just a nice way of encoding characters and has a very clearly defined structure, so fundamentally it is reasonably simple to read a chunk of memory and validate that it contains UTF-8. Mostly this consists of verifying that certain bit patterns do NOT occur, such as C0, C1, and F5 through FF (depending on position).
It is reasonably simple in C (sorry, I don't speak Python) to code something that is a simple fopen/fread and checks the bit patterns of each byte, although I would recommend finding some code to cut and paste (e.g. http://utfcpp.sourceforge.net/, though I haven't used those exact routines), as there are some caveats and special cases to handle. Just treat the input bytes as unsigned char and bitmask them directly. I would paste what I use, but I'm not at the office.
A C program will rapidly become I/O-bound, so the suggestions about I/O apply if you want ultimate performance; however, direct byte inspection like this will be hard to beat in performance if you do it right. UTF-8 is nice in that you can find character boundaries even if you start in the middle of the file, which lends itself nicely to parallel algorithms.
If you build your own, watch for a BOM that might appear at the start of some files.
Links
http://en.wikipedia.org/wiki/UTF-8 nice clear overview with table showing valid bit patterns.
https://www.rfc-editor.org/rfc/rfc3629 the rfc describing utf8
http://www.unicode.org/ homepage of the Unicode Consortium.
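The byte-pattern checks above can be sketched straight from the RFC 3629 table. This is a minimal hand-rolled validator, not the utfcpp library's API (the function name is mine); it returns how many leading bytes of a buffer form valid UTF-8, so a cleaning pass could copy up to that point and then skip the offending byte.

```cpp
#include <cstddef>

// Minimal UTF-8 validator following the RFC 3629 table.
// Returns the number of leading bytes that form valid UTF-8;
// a result equal to `len` means the whole buffer is valid.
std::size_t valid_utf8_prefix(const unsigned char* p, std::size_t len) {
    std::size_t i = 0;
    while (i < len) {
        unsigned char b = p[i];
        if (b < 0x80) { ++i; continue; }            // ASCII, 1 byte
        std::size_t n;                              // continuation bytes
        unsigned char lo = 0x80, hi = 0xBF;         // range of first continuation
        if      (b >= 0xC2 && b <= 0xDF) n = 1;     // 2 bytes; C0/C1 are overlong
        else if (b == 0xE0)            { n = 2; lo = 0xA0; }  // no overlong 3-byte
        else if (b >= 0xE1 && b <= 0xEC) n = 2;
        else if (b == 0xED)            { n = 2; hi = 0x9F; }  // exclude surrogates
        else if (b >= 0xEE && b <= 0xEF) n = 2;
        else if (b == 0xF0)            { n = 3; lo = 0x90; }  // no overlong 4-byte
        else if (b >= 0xF1 && b <= 0xF3) n = 3;
        else if (b == 0xF4)            { n = 3; hi = 0x8F; }  // cap at U+10FFFF
        else return i;                              // C0, C1, F5..FF never valid
        if (i + n >= len) return i;                 // truncated sequence
        if (p[i + 1] < lo || p[i + 1] > hi) return i;
        for (std::size_t k = 2; k <= n; ++k)
            if (p[i + k] < 0x80 || p[i + k] > 0xBF) return i;
        i += n + 1;
    }
    return i;
}
```

The special `lo`/`hi` ranges are what reject overlong encodings and UTF-16 surrogates, the caveats mentioned above.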

Your best bet, in my opinion, is to parallelize. If you can parallelize the cleaning and clean many files simultaneously, the process will be more efficient. I'd look into a framework for parallelization, e.g. MapReduce, where you can multithread the task.

I would look at memory-mapped files. This is something from the Microsoft world; I'm not sure if it exists on Unix etc., but it likely does.
Basically you open the file and point the OS at it and it loads the file (or a chunk of it) into memory which you can then access using a pointer array. For a 100 GB file, you could load perhaps 1GB at a time, process and then write to a memory mapped output file.
http://msdn.microsoft.com/en-us/library/windows/desktop/aa366556(v=vs.85).aspx
http://msdn.microsoft.com/en-us/library/windows/desktop/aa366542(v=vs.85).aspx
This, I would think, should be the fastest way to perform big I/O, but you would need to test in order to say for sure.
HTH, good luck!

Unix/Linux and any other POSIX-compliant OSes support memory mapping (mmap) too.

Related

most common way to deal with endianness and files C++

I started out just reading/writing 8-bit integers to files using chars. It was not very long before I realized that I needed to be able to work with more than just 256 possible values. I did some research on how to read/write 16-bit integers to files and became aware of the concept of big and little endian. I did even more research and found a few different ways to deal with endianness and I also learned some ways to write endianness-independent code. My overall conclusion was that I have to first check if the system I am using is using big or little endian, change the endianness depending on what type the system is using, and then work with the values.
The one thing I have not been able to find is the best/most common way to deal with endianness when reading/writing to files in C++ (no networking). So how should I go about doing this? To help clarify, I am asking for the best way to read/write 16/32-bit integers to files between big and little endian systems. Because I am concerned about the endianness between different systems, I would also like a cross-platform solution.
The most common way is simply to pass your in-memory values through htons() or htonl() before writing them to the file, and also pass the read data through ntohs() or ntohl() after reading it back from the file. (htons()/ntohs() handle 16-bit values, htonl()/ntohl() handle 32-bit values)
When compiled for a big-endian CPU, these functions are no-ops (they just return the value you passed in to them verbatim), so the values will get written to the file in big-endian format. When compiled for a little-endian CPU, these functions endian-swap the passed-in value and return the swapped version, so again the values will get written to the file in big-endian format.
That way the values in the file are always stored in big-endian format, and they always get converted to/from the appropriate (CPU-native) format when being transferred to/from memory. This is the simplest way to do it (since you don't have to write or debug any conditional logic), and the most common (these functions are implemented and available on just about all platforms).
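As a minimal sketch of that convention (assuming POSIX's <arpa/inet.h>; on Windows the same functions come from <winsock2.h>), the helper names here are mine:

```cpp
#include <arpa/inet.h>   // htonl/ntohl; on Windows use <winsock2.h> instead
#include <cstdint>
#include <cstdio>

// Write a 32-bit value to a file in big-endian ("network") byte order.
void write_u32(std::FILE* f, std::uint32_t v) {
    std::uint32_t be = htonl(v);          // no-op on big-endian CPUs
    std::fwrite(&be, sizeof be, 1, f);
}

// Read it back and convert to CPU-native byte order.
std::uint32_t read_u32(std::FILE* f) {
    std::uint32_t be = 0;
    std::fread(&be, sizeof be, 1, f);
    return ntohl(be);
}
```

The bytes written to the file are identical whether this runs on a big- or little-endian machine, which is the whole point.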
In practice, a good habit is to avoid binary data (for exchanging data between computers) and to prefer text files and textual protocols. You could use textual formats like JSON, YAML, XML, etc. (or sometimes invent your own). There are many C++ libraries for them, e.g. jsoncpp.
Textual data is indeed more verbose (takes more disk space) and slightly slower to parse (but disk I/O is often the bottleneck, not the CPU time "wasted" parsing or encoding formats like JSON), yet it is much easier to work with.
Read also about serialization. You'll find lots of libraries for it (using some "common", well-defined data format such as XDR or ASN.1). Many file formats contain a header describing the concrete encoding; the elf(5) format is a good example of that.
Be aware that most of the time the data is more valuable (economically) than the software working on it. So it is very important to document very well how your data is organized in files.
Consider also using databases. Sometimes simply using sqlite with tables containing JSON is very effective.
PS. Without an actual real world case, your question is too broad, and has no meaningful universal answer. There is no single best way!
Basile, I agree that there is no universal answer.
In my world, embedded real-time systems, using a text representation is blasphemy. Textual representations such as JSON are at least two orders of magnitude slower than binary representations. That may be fine for the web, but it makes a difference when you have to process several kilobytes of data per second (to handle voice, for instance) across DSPs and GPPs.
For a more in-depth discussion of this topic, check out chapter 7 of the ZeroMQ book.

Strings vs binary for storing variables inside the file format

We aim at using HDF5 for our data format. HDF5 has been selected because it is a hierarchical filesystem-like cross-platform data format and it supports large amounts of data.
The file will contain arrays and some parameters. The question is about how to store the parameters (which are not made up by large amounts of data), considering also file versioning issues and the efforts to build the library. Parameters inside the HDF5 could be stored as either (A) human-readable attribute/value pairs or (B) binary data in the form of HDF5 compound data types.
Just as an example, let's consider as a parameter a polygon with three vertices. Under case A we could have, for instance, a variable named Polygon with the string representation of the series of vertices, e.g. (1, 2); (3, 4); (4, 1). Under case B, we could instead have a variable named Polygon made up of a [2 x 3] matrix.
We have some idea, but it would be great to have inputs from people who have already worked with something similar. More precisely, could you please list pro/cons of A and B and also say under what circumstances which would be preferable?
Speaking as someone who's had to do exactly what you're talking about a number of times, rr got it basically right, but I would change the emphasis a little.
For file versioning, text is basically the winner.
Since you're using an hdf5 library, I assume both serializing and parsing are equivalent human-effort.
text files are more portable. You can transfer the files across generations of hardware with the minimal risk.
text files are easier for humans to work with. If you want to extract a subset of the data and manipulate it, you can do that with many programs on many computers. If you are working with binary data, you will need a program that allows you to do so. Depending on how you see people working with your data, this can make a huge difference to the accessibility of the data and maintenance costs. You'll be able to sed, grep, and even edit the data in excel.
input and output of binary data (for large data sets) will be vastly faster than text.
working with those binary files in a new environment (e.g. a 128-bit little-endian computer in some sci-fi future) will require some engineering.
similarly, if you write applications in other languages, you'll need to handle the encoding identically between applications. This will mean either engineering effort or having the same libraries available on all platforms. With plain text this is easier...
If you want others to write applications that work with your data, plain text is simpler. If you're providing binary files, you'll have to provide a file specification which they can follow. With plain text, anyone can just look at the file and figure out how to parse it.
you can archive the text files with compression, so space concerns are primarily an issue for the data you are actively working with.
debugging binary data storage is significantly more work than debugging plain-text storage.
So in the end it depends a little on your use case. Is it meaningful to look at the data in the myriad tools that handle plain-text? Is it only meaningful to look at it with big-data hdf5 viewers? Will writing plain text be onerous to you in terms of time and space?
In general, when I'm faced with this issue, I basically always do the same thing: I store the data in plain text until I realize the speed problems are more irritating than working with binary would be, and then I switch. If you don't know in advance whether you'll cross that threshold, start with plain text, and write the interface to your persistence layer in such a way that it will be easy to switch later. This is a tiny bit of additional work, which you will probably get back thanks to plain text being easier to debug.
If you expect to edit the file by hand often (like XMLs or JSONs), then go with human readable format.
Otherwise go with binary - it's much easier to create a parser for it and it will run faster than any grammar parser.
Also note how there's nothing that prevents you from creating a converter between binary and human-readable form later.
Versioning files might sound nice, but are you really going to inspect the diffs for files "containing large arrays"?

Streaming Real and Debug Data To Disk in C++

What is a flexible way to stream data to disk in a c++ program in Windows?
I am looking to create a flexible stream of data that may contain arbitrary data (say time, average, a flag if reset, etc) to disk for later analysis. Data may come in at non-uniform, irregular intervals. Ideally this stream would have minimal overhead and be easily readable in something like MATLAB so I could easily analyze events and data.
I'm thinking of a binary file with a header file describing types of packets followed by a wild dump of data tagged with . I'm considering a lean, custom format but would also be interested in something like HDF5.
It is probably better to use an existing file format rather than a custom one. First you don't reinvent the wheel, second you will benefit from a well tested and optimized library.
HDF5 seems like a good bet. It is fast and reliable, and easy to read from MATLAB. It has some overhead, but that is the price of great flexibility and compatibility.
This requirement sounds suspiciously like a "database"

Best way to parse a large floating point file stored in ASCII?

What is the best way to parse a large floating point file stored in ASCII?
What would be the fastest way to do it? I remember someone telling me using ifstream was bad, because it worked on a small number of bytes, and it would be better to just read the file into memory first. Is that true?
Edit: I am running on Windows, and the file format is for a point cloud that is stored in rows like x y z r g b. I am attempting to read them into arrays. Also, the files are around 20 MB each, but I have around 10 GB worth of them.
Second edit: I am going to have to load the files for display every time I want to do a visualization, so it would be nice to have it as fast as possible, but honestly, if ifstream performs reasonably, I wouldn't mind sticking with readable code. It's running quite slowly right now, but that might be more of a hardware I/O limitation than anything I can fix in software; I just wanted to confirm.
I think your first concern should be how large the floating point numbers are. Are they float or can there be double data too? The traditional (C) way would be to use fscanf with the format specifier for a float and afaik it is rather fast. The iostreams do add a small overhead in terms of parsing the data, but that is rather negligible. For the sake of brevity I would suggest you use iostreams (not to mention the usual stream features that you'd get with it).
Also, I think it would really help the community if you could add the relevant numbers to your question, e.g. how large a file are you trying to parse? Is this a small-memory-footprint environment (like an embedded system)?
It's all based on the operating system, and the choice of C and C++ standard libraries.
The days of slow ifstream are pretty much over, however, there is likely some overhead in handling C++ generic interfaces.
atof/strtod might be the fastest way to deal with it if the string is already in the memory.
Finally, any attempt to get the whole file read into memory will likely be futile. Modern operating systems usually get in the way (especially if the file is larger than RAM: you will end up swapping, since the system already treats your on-disk data as pageable).
If you really need to be ridiculously fast (The only places I can think it will be useful are HPC and Map/Reduce based approaches) - try mmap (Linux/Unix) or MapViewOfFile to get the file prefetched into virtual memory in the most sensible approach, and then atof + custom string handling.
If the file is really well organized for this kind of game, you can even get quirky with mmaps and pointers and make the conversion multithreaded. Sounds like a fun exercise if you have over 10 GB of floats to convert on a regular basis.
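As a rough sketch of the atof/strtod approach (the function name and layout are mine, not a library API), parsing whitespace-separated floats out of a NUL-terminated buffer could look like the following. Note that a raw mmap'd region is not guaranteed to end in '\0', so in that case you would process it in chunks or copy the tail into a terminated buffer first.

```cpp
#include <cstdlib>
#include <vector>

// Parse whitespace-separated floating-point numbers from a NUL-terminated
// buffer with strtod. strtod itself skips leading whitespace, so rows like
// "x y z r g b" separated by spaces or newlines all parse the same way.
std::vector<double> parse_floats(const char* s) {
    std::vector<double> out;
    for (;;) {
        char* next = nullptr;
        double v = std::strtod(s, &next);
        if (next == s) break;         // no further number could be parsed
        out.push_back(v);
        s = next;
    }
    return out;
}
```

For the point-cloud case, every six consecutive values would form one x y z r g b row.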
The fastest way is probably to use an ifstream, but you can also use fscanf. If you are on a specific platform, you could hand-load the file into memory and parse the floats from it manually.

Which of FILE* or ifstream has better memory usage?

I need to read a fixed number of bytes from files whose sizes are around 50 MB. To be more precise, a frame from YUV 4:2:0 CIF/QCIF files (~25 KB to ~100 KB per frame). Not a huge amount, but I don't want the whole file in memory. I'm using C++; in such a case, which of FILE* or ifstream has better (less/minimal) memory usage? Please kindly advise. Thanks!
EDIT:
I read fixed number of bytes: 25KB or 100KB (depending on QCIF/CIF format). The reading is in binary mode and forward-only. No seeking needed. No writing needed, only reading.
EDIT:
If identifying the better of them is hard, which one does not require loading the whole file into memory?
Impossible to say - it will depend on the implementation, and how you are reading the data, which you have not described. In general, questions here regarding performance are somewhat pointless, as they depend heavily on your actual usage of library and language features, the specific implementation, your hardware etc. etc.
Edit: To answer your expanded question - neither library requires you read everything into memory. Why would you think they do?
I think the best answer would be "profile and see", but in theory FILE* should be more efficient in time and memory usage. Streams add various wrappers, error handlers, and so on over the raw reading/writing routines, which could (in your particular case) affect memory usage.
You can expect a smaller executable using FILE*, since its supporting libraries are simpler than ifstream's, but the other factors (runtime memory consumption and performance) rarely make a significant difference. The small gain will generally be toward FILE*, again merely because it's simpler.
If the processing you do with the file is very basic and/or you don't need to parse a text input file, FILE* will suit you well. On the other hand, if the opposite is true, I'd go for ifstream - I find the >> operator a lot handier and safer than using fscanf.
Performance-wise you're definitely better off with FILE* (I profiled that some time ago in a project of mine). Memory-wise, iostreams shouldn't pose a big problem, although I think there is some overhead as they wrap the C library.
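Either way, the memory footprint is essentially whatever buffer you read into. A minimal sketch with ifstream (the helper name and frame handling are illustrative, not from the question): read one fixed-size frame at a time in binary mode, so only one frame lives in memory at once.

```cpp
#include <cstddef>
#include <fstream>
#include <vector>

// Read one fixed-size frame into `frame`. Returns the number of bytes
// actually read (less than `frame_size` only at end of file).
std::size_t read_frame(std::ifstream& in, std::vector<char>& frame,
                       std::size_t frame_size) {
    frame.resize(frame_size);
    in.read(frame.data(), static_cast<std::streamsize>(frame_size));
    return static_cast<std::size_t>(in.gcount());
}
```

Loop until it returns 0; the FILE* equivalent is a single fread(frame.data(), 1, frame_size, f), with the same buffer reused across frames.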