C++: 32- vs 64-bit stream operations

If I write an int into an fstream in a 32-bit application and then read that int back in a 64-bit application, should I expect the value to be different? If so (and I presume it is), what is the best way to achieve architecture-independent stream operations?

If you read and write using operator<< and operator>>, it will be platform independent, assuming that the integer is small enough to fit in both types, since it will be written as text. If you use istream::read and ostream::write, it will not be platform independent, since you will be writing binary data.
If you don't need raw performance, using a text format is the easiest way to achieve platform independence. If you need better performance, you should look at a serialization library. Boost has a good cross-platform one.
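For illustration, a minimal sketch of the text-based approach (the file name is just an example):

#include <fstream>

int main()
{
    // Write the int as text; the file simply contains the characters "12345".
    {
        std::ofstream out("value.txt");
        out << 12345 << '\n';
    }
    // Read it back; this behaves the same in 32- and 64-bit builds,
    // as long as the value fits in an int on both.
    {
        std::ifstream in("value.txt");
        int value = 0;
        in >> value;
    }
}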

Well, it depends on whether you write binary data or text. If you write your numbers as text (ASCII/UTF-8), then reading them back should produce the same result.
I would recommend that you use the boost::serialization library to read and write data in a controlled and uniform manner.
Whether it works in the opposite direction, i.e. from 64-bit to 32-bit, is less certain. It depends on your compiler: if its ints are 64 bits wide, you might write values that cannot be read back into 32-bit ints, even if you write to a formatted stream.
There are no guarantees about the size of an int in C++, only that it is at least as large as a short; the rest is up to the compiler.
If you want to be really sure, you can use GMP to handle large integers and validate the data as you read it.
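As a rough illustration of the Boost.Serialization suggestion above, a minimal sketch using text archives (so the on-disk representation is portable); the file name is just an example:

#include <fstream>
#include <boost/archive/text_oarchive.hpp>
#include <boost/archive/text_iarchive.hpp>

int main()
{
    {
        std::ofstream ofs("data.txt");
        boost::archive::text_oarchive oa(ofs);   // text archive -> portable on-disk format
        int value = 42;
        oa << value;
    }
    {
        std::ifstream ifs("data.txt");
        boost::archive::text_iarchive ia(ifs);
        int value = 0;
        ia >> value;                             // value == 42, regardless of architecture
    }
}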


Most common way to deal with endianness and files in C++

I started out just reading/writing 8-bit integers to files using chars. It was not very long before I realized that I needed to be able to work with more than just 256 possible values. I did some research on how to read/write 16-bit integers to files and became aware of the concept of big and little endian. I did even more research and found a few different ways to deal with endianness, and I also learned some ways to write endianness-independent code. My overall conclusion was that I first have to check whether the system I am using is big- or little-endian, convert the endianness accordingly, and then work with the values.
The one thing I have not been able to find is the best/most common way to deal with endianness when reading/writing to files in C++ (no networking). So how should I go about doing this? To help clarify, I am asking for the best way to read/write 16/32-bit integers to files between big and little endian systems. Because I am concerned about the endianness between different systems, I would also like a cross-platform solution.
The most common way is simply to pass your in-memory values through htons() or htonl() before writing them to the file, and also pass the read data through ntohs() or ntohl() after reading it back from the file. (htons()/ntohs() handle 16-bit values, htonl()/ntohl() handle 32-bit values)
When compiled for a big-endian CPU, these functions are no-ops (they just return the value you passed in to them verbatim), so the values will get written to the file in big-endian format. When compiled for a little-endian CPU, these functions endian-swap the passed-in value and return the swapped version, so again the values will get written to the file in big-endian format.
That way the values in the file are always stored in big-endian format, and they always get converted to/from the appropriate (CPU-native) format when being transferred to/from memory. This is the simplest way to do it (since you don't have to write or debug any conditional logic), and the most common (these functions are implemented and available on just about all platforms).
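A minimal sketch of that pattern for one 32-bit value (the header differs by platform, as noted in the comment; the file name is arbitrary):

#include <arpa/inet.h>   // htonl/ntohl; on Windows use <winsock2.h> instead
#include <cstdint>
#include <fstream>

int main()
{
    // Writing: convert to big-endian ("network order") before write().
    uint32_t value = 123456;
    uint32_t be = htonl(value);
    std::ofstream out("data.bin", std::ios::binary);
    out.write(reinterpret_cast<const char*>(&be), sizeof be);
    out.close();

    // Reading: convert back to the CPU-native order after read().
    std::ifstream in("data.bin", std::ios::binary);
    uint32_t stored = 0;
    in.read(reinterpret_cast<char*>(&stored), sizeof stored);
    uint32_t back = ntohl(stored);   // back == 123456 on any platform
}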
In practice, a good habit is to avoid binary data for exchanging data between computers and to prefer text files and textual protocols instead. You could use textual formats like JSON, YAML, XML, ... (or sometimes invent your own). There are many C++ libraries for them, e.g. jsoncpp.
Textual data is indeed more verbose (takes more disk space) and slightly slower to parse (but disk I/O is often the bottleneck, not the CPU time "wasted" in parsing or encoding formats like JSON), but it is much easier to work with.
Read also about serialization. You'll find lots of libraries doing that (using some "common", well-defined data format such as XDR or ASN.1). Many file formats contain a header describing the concrete encoding. The elf(5) format is a good example of that.
Be aware that most of the time the data is more valuable (economically) than the software working on it. So it is very important to document very well how your data is organized in files.
Consider also using databases. Sometimes simply using sqlite with tables containing JSON is very effective.
PS. Without an actual real world case, your question is too broad, and has no meaningful universal answer. There is no single best way!
Basile, I agree that there is no universal answer.
In my world, embedded real-time systems, using a text representation is blasphemy. Textual representations such as JSON are at least two orders of magnitude slower than binary representations. That may be fine for the web, but it makes a difference when you have to process several kilobytes of data per second (to handle voice, for instance) across DSPs and GPPs.
For a more in-depth discussion of this topic, check out chapter 7 of the ZeroMQ book.

Choosing between 32 and 64 bit intrinsic CRC on Intel CPU

I need to calculate a CRC in order to form a hash function on an Intel machine, and I came up with the following two intrinsic functions:
_mm_crc32_u32
_mm_crc32_u64
In my project, I am dealing with 32-bit variables, and my dilemma is between shifting and ORing each pair of variables into a 64-bit variable and then using the 64-bit CRC, or running the 32-bit CRC on each of the two 32-bit variables.
I can't find anywhere how many cycles each of these functions takes, and from the Intel function specifications it is unclear which one is preferable.
The same dilemma also applies on the 16-bit version of the CRC function:
_mm_crc32_u16
I tried checking it by taking the time before and after the CRC, but the results were pretty much the same, so I need a more sophisticated way of measuring it.
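For concreteness, here is a sketch of the two alternatives the question describes (requires SSE4.2; _mm_crc32_u64 is only available when targeting x86-64, and the two variants generally produce different, but equally usable, hash values because the bytes are fed in a different order):

#include <nmmintrin.h>   // SSE4.2 CRC32-C intrinsics (compile with -msse4.2 on GCC/Clang)
#include <cstdint>

// Option A: feed the two 32-bit values through the 32-bit intrinsic.
uint32_t crc_two_u32(uint32_t a, uint32_t b)
{
    uint32_t crc = 0;
    crc = _mm_crc32_u32(crc, a);
    crc = _mm_crc32_u32(crc, b);
    return crc;
}

// Option B: combine them into one 64-bit value and use the 64-bit intrinsic.
uint32_t crc_one_u64(uint32_t a, uint32_t b)
{
    uint64_t combined = (static_cast<uint64_t>(a) << 32) | b;
    return static_cast<uint32_t>(_mm_crc32_u64(0, combined));
}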
Don't use CRC for hash values. It's not the same kind of thing.
Use MurmurHash for classic computer-science hashing needs (that is, not huge cryptographic-strength hashes). It also has implementations for different widths.
I don't understand what you mean: you have two 32-bit values and want a hash of that? That might be sensible or might not, depending on why. Can you clarify what you are trying to accomplish?

c++ binary data layout guaranteed by the standard

This is purely a theoretical problem, nothing I have really found myself in, but it has piqued my curiosity and I wanted to see if anyone has a better solution for it:
How do you portably guarantee that a specific file format / network
protocol or whatever conforms to a specific bit pattern?
Say we have a file format that uses a 64 bit header struct immediately followed by a variable length array of 32 bit structures:
Header: magic : 32 bits
        count : 32 bits
Field:  id    : 16 bits
        data  : 16 bits
My first instinct would be to write something like:
struct Field
{
    uint16_t id;
    uint16_t data;
};
Except that our compiler may decide that padding is advisable and we end up with a 64 bit structure. So our next bet is:
using Field = uint16_t[2];
and work on that.
That is, unless someone has carefully read the standard and noticed that uint16_t is optional. At this point our next best friend is uint_least16_t, which is guaranteed to be at least 16 bits long, but for all we know could be 20 bits long on a processor with 10-bit chars.
At this point, the only real solution I can come up with is some sort of bit stream, capable of reading and writing specific amounts of bits, and adaptable by std::numeric_limits.
So, is there someone out there who has very carefully read the standard and found the point I'm missing? Or is this the only real way of having a portable guarantee?
Notes:
- I've just realized that endianness would probably add another layer of complexity.
- I'm using the current working draft of the ISO standard (N3797).
How do you portably guarantee that a specific file format / network
protocol or whatever conforms to a specific bit pattern?
You can't. Not in C++, which was standardized against an abstract platform where little more than the existence of a "byte" made up of bits can be assumed. We can't even say for certain, looking only at the Standard, how many bits are in a char. You can use bitfields for everything, as bits are indivisible, but then you'll have padding to contend with at the very least.
Sometimes it is best to give up on the idea of absolute Standards conformance for the sake of conformance, and look to other means to get the job done efficiently and effectively. In this case, platform specifics in combination with almost absolute Standards conformance (aka, good programming practices) will set you free.
Every platform I work on regularly (Linux & Windows) provides a means to regulate the padding the compiler will actually apply. For network communications, under Linux and Windows I use:
#pragma pack (push, 1)
as a preface to all the data structures I'm going to send over the wire. Endianness is indeed another challenge, but one more or less easily dealt with using other resources provided by every platform: ntohl and the like.
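Applied to the header/field layout from the question, that looks roughly like this (the static_asserts are just a sanity check; MSVC, GCC and Clang all accept this pragma form):

#include <cstdint>

#pragma pack(push, 1)
struct Header
{
    uint32_t magic;
    uint32_t count;
};
struct Field
{
    uint16_t id;
    uint16_t data;
};
#pragma pack(pop)

static_assert(sizeof(Header) == 8, "unexpected padding in Header");
static_assert(sizeof(Field)  == 4, "unexpected padding in Field");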
Standards conformance is a laudable goal, and indeed in a code review I would reject most code that is non-conformant. The lack of conformance is really just a moniker for the rejection however; not the reason itself. The actual reason for the rejection is in large part difficulty in maintaining and porting non-conformant code when moving to another platform, or indeed even just upgrading the compiler on the same platform. Non-conformant code might compile and even appear to work, but it will very often fail in subtle and miserable ways when you least expect it, even after thorough testing.
The moral of the story is:
You should always write Standards-conformant code, except when you
shouldn't.
This really is just a re-imagining of Einstein's articulation of Occam's Razor:
Make everything as simple as possible, but no simpler.
If you want to ensure portability to everything standard-conforming, including platforms for which CHAR_BIT isn't 8, well, you've got your work cut out for you.
If you are comfortable limiting yourself to 98% of the computers you'll ever program, I recommend writing explicit serialization for anything that has to adhere to a particular wire-format. That includes breaking integers into bytes, etc.
Write appropriate abstractions around things and the code won't be too bad. Don't put shifts and masks everywhere. Encapsulate it.
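For the 16-bit fields from the question, a sketch of what such explicit serialization can look like (little-endian wire order is an arbitrary choice here):

#include <cstdint>
#include <istream>
#include <ostream>

// Write a 16-bit value as two bytes, low byte first.
void write_u16_le(std::ostream& out, uint16_t v)
{
    unsigned char bytes[2] = { static_cast<unsigned char>(v & 0xFF),
                               static_cast<unsigned char>(v >> 8) };
    out.write(reinterpret_cast<const char*>(bytes), 2);
}

// Read it back, independent of the host's endianness and padding rules.
uint16_t read_u16_le(std::istream& in)
{
    unsigned char bytes[2] = { 0, 0 };
    in.read(reinterpret_cast<char*>(bytes), 2);
    return static_cast<uint16_t>(bytes[0] | (bytes[1] << 8));
}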
I would use network types and network byte order. See this link: http://www.beej.us/guide/bgnet/output/html/multipage/htonsman.html. The example uses uint16_t. You can write the values a field at a time to prevent padding.
Or, if you want to read and write the entire structure at once, see this link: C++ struct alignment question.
Make the structure easy for the program to use.
Provide input methods that extract data from the input and write to the data members. This removes the issue of padding, alignment boundaries and endianness. Similarly with output.
For example, if your input data is 16-bits wide, but your platform is 32-bits wide, declare the structure using 32-bit fields. Copy the 16 bits from the input into the 32-bit fields.
Most programs read into a structure fewer times than they access the data members. Your program is not reading the input 100% of the time.
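For instance, a sketch of the 16-bits-on-disk / 32-bits-in-memory idea described above (the struct and function names are illustrative, and big-endian file order is assumed here):

#include <cstdint>
#include <istream>

struct Record
{
    uint32_t id;     // stored as 16 bits in the file
    uint32_t data;   // stored as 16 bits in the file
};

// Read the two 16-bit fields and widen them into the 32-bit members,
// so the rest of the program never deals with padding or byte order.
bool read_record(std::istream& in, Record& r)
{
    unsigned char raw[4];
    if (!in.read(reinterpret_cast<char*>(raw), sizeof raw))
        return false;
    r.id   = (static_cast<uint32_t>(raw[0]) << 8) | raw[1];
    r.data = (static_cast<uint32_t>(raw[2]) << 8) | raw[3];
    return true;
}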

How to check if a char is valid in C++

I need a program that reads the contents of a file and writes them into another file, but only the characters that are valid UTF-8. The problem is that the file may come in any encoding, and its contents may or may not correspond to that encoding.
I know it's a mess, but that's the data I get to work with. The files I need to "clean" can be as big as a couple of terabytes, so I need the program to be as efficient as humanly possible. Currently I'm using a program I wrote in Python, but it takes as long as a week to clean 100 GB.
I was thinking of reading the characters with the wide-character (wchar_t) functions, then treating them as integers and discarding all values that are not in some range. Is this the optimal solution?
Also what's the most efficient way to read and write in C/C++?
EDIT: The problem is not the I/O operations; that part of the question is intended as extra help to make the program even quicker. The real issue is how to identify non-UTF-8 characters quickly. Also, I have already tried parallelization and RAM disks.
UTF-8 is just a nice way of encoding characters and has a very clearly defined structure, so fundamentally it is reasonably simple to read a chunk of memory and validate that it contains UTF-8. Mostly this consists of verifying that certain bit patterns do NOT occur, such as C0, C1, or F5 to FF (depending on position).
It is reasonably simple in C (sorry, I don't speak Python) to code something that does a simple fopen/fread and checks the bit patterns of each byte, although I would recommend finding some code to cut/paste (e.g. http://utfcpp.sourceforge.net/, though I haven't used those exact routines), as there are some caveats and special cases to handle. Just treat the input bytes as unsigned char and bitmask them directly. I would paste what I use, but I'm not in the office.
A C program will rapidly become I/O bound, so the suggestions about I/O apply if you want ultimate performance; however, direct byte inspection like this will be hard to beat in performance if you do it right. UTF-8 is also nice in that you can find sequence boundaries even if you start in the middle of the file, which lends itself nicely to parallel algorithms.
If you build your own, watch for a BOM that might appear at the start of some files.
Links
http://en.wikipedia.org/wiki/UTF-8 nice clear overview with a table showing valid bit patterns.
https://www.rfc-editor.org/rfc/rfc3629 the RFC describing UTF-8.
http://www.unicode.org/ homepage of the Unicode Consortium.
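A minimal sketch of the byte-level validation described above (it checks lead/continuation patterns only; a full validator would also reject overlong encodings and surrogate ranges):

#include <cstddef>

// Returns true if buf[0..len) is structurally valid UTF-8.
bool is_valid_utf8(const unsigned char* buf, std::size_t len)
{
    std::size_t i = 0;
    while (i < len)
    {
        unsigned char b = buf[i];
        std::size_t extra;                            // continuation bytes expected
        if      (b <= 0x7F)              extra = 0;   // ASCII
        else if (b >= 0xC2 && b <= 0xDF) extra = 1;   // 2-byte lead
        else if (b >= 0xE0 && b <= 0xEF) extra = 2;   // 3-byte lead
        else if (b >= 0xF0 && b <= 0xF4) extra = 3;   // 4-byte lead
        else return false;                            // C0, C1, F5..FF, or stray continuation

        if (i + extra >= len)                         // sequence runs past the buffer
            return false;
        for (std::size_t k = 1; k <= extra; ++k)
            if ((buf[i + k] & 0xC0) != 0x80)          // continuation must be 10xxxxxx
                return false;
        i += extra + 1;
    }
    return true;
}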
Your best bet, in my opinion, is to parallelize. If you can parallelize the cleaning and clean many files simultaneously, the process will be more efficient. I'd look into a framework for parallelization, e.g. MapReduce, where you can multithread the task.
I would look at memory-mapped files. This is something from the Microsoft world; I'm not sure if it exists on Unix etc., but it likely does.
Basically you open the file, point the OS at it, and it loads the file (or a chunk of it) into memory, which you can then access through a pointer. For a 100 GB file, you could load perhaps 1 GB at a time, process it, and then write to a memory-mapped output file.
http://msdn.microsoft.com/en-us/library/windows/desktop/aa366556(v=vs.85).aspx
http://msdn.microsoft.com/en-us/library/windows/desktop/aa366542(v=vs.85).aspx
This should, I would think, be the fastest way to perform big I/O, but you would need to test it to say for sure.
HTH, good luck!
Unix/Linux and any other POSIX-compliant OS supports memory mapping (mmap) too.
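A minimal POSIX sketch of that approach (error handling trimmed; the file name is just an example, and the Windows equivalent uses CreateFileMapping/MapViewOfFile as linked above):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main()
{
    int fd = open("input.bin", O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    // Map the whole file read-only into the process address space.
    const unsigned char* data = static_cast<const unsigned char*>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));

    // ... scan data[0 .. st.st_size - 1] here ...

    munmap(const_cast<unsigned char*>(data), st.st_size);
    close(fd);
}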

Best way to parse a large floating point file stored in ASCII?

What is the best way to parse a large floating point file stored in ASCII?
What would be the fastest way to do it? I remember someone telling me that using ifstream was bad because it worked on a small number of bytes at a time, and that it would be better to just read the file into memory first. Is that true?
Edit: I am running on Windows, and the file format is for a point cloud that is stored in rows like x y z r g b. I am attempting to read them into arrays. Also, the files are around 20 MB each, but I have around 10 GB worth of them.
Second edit: I am going to have to load the files to display every time I want to do a visualization, so it would be nice to have it as fast as possible, but honestly, if ifstream performs reasonably, I wouldn't mind sticking with readable code. It's running quite slowly right now, but that might be more of a hardware I/O limitation than anything I can do in software; I just wanted to confirm.
I think your first concern should be how large the floating-point numbers are. Are they float, or can there be double data too? The traditional (C) way would be to use fscanf with the format specifier for a float, and as far as I know it is rather fast. The iostreams add a small overhead in terms of parsing the data, but that is rather negligible. For the sake of brevity I would suggest you use iostreams (not to mention the usual stream features that you'd get with them).
Also, I think it would really help the community if you could add the relevant numbers to your question, e.g. how large a file are you trying to parse? Is this a small-memory-footprint environment (like an embedded system)?
It all depends on the operating system and the choice of C and C++ standard libraries.
The days of slow ifstream are pretty much over, however, there is likely some overhead in handling C++ generic interfaces.
atof/strtod might be the fastest way to deal with it if the string is already in the memory.
Finally, any attempt to get the whole file read into memory yourself will likely be futile. Modern operating systems usually get in the way: especially if the file is larger than RAM, you will end up swapping, since the system will treat your data (which is already stored on disk) as swappable.
If you really need to be ridiculously fast (the only places I can think it will be useful are HPC and Map/Reduce-based approaches), try mmap (Linux/Unix) or MapViewOfFile (Windows) to get the file prefetched into virtual memory in the most sensible way, and then atof plus custom string handling.
If the file is really well organized for this kind of game, you can even get fancy with mmap and pointers and make the conversion multithreaded. Sounds like a fun exercise if you have over 10 GB of floats to convert on a regular basis.
The fastest way is probably to use an ifstream, but you can also use fscanf. If you target a specific platform, you could load the file into memory yourself and parse the floats from it manually.
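For the x y z r g b row format described in the question, the straightforward ifstream version is short; a sketch (Point and load_points are illustrative names, not from the original post):

#include <fstream>
#include <vector>

struct Point { float x, y, z, r, g, b; };

std::vector<Point> load_points(const char* path)
{
    std::vector<Point> pts;
    std::ifstream in(path);
    Point p;
    // operator>> skips whitespace, so each iteration consumes one row.
    while (in >> p.x >> p.y >> p.z >> p.r >> p.g >> p.b)
        pts.push_back(p);
    return pts;
}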