Portable support for large files - C++

I looked at this and this and this, and I still don't know how to determine the size of a file larger than 4 GB in a portable way. Notably, incorporating some of the answers failed to compile on Cygwin, while others failed on Linux.

Turns out there are quite a few functions defined by various standards:
fseek/ftell
Defined by the ANSI C standard library and available virtually everywhere. It is only guaranteed to work with offsets that fit in a long, which may be as small as 32 bits, but implementations are free to use a wider type (meaning you might get large file support out of the box).
fseeko/ftello
These are defined by the POSIX standard. On many platforms, defining _FILE_OFFSET_BITS=64 causes off_t to be defined as off64_t and fseeko to resolve to fseeko64.
fseeko64/ftello64
These are the 64-bit equivalents of fseeko and ftello. I couldn't find them described in any standard.
Cygwin inconsistency
While Cygwin conforms to POSIX, I can't get fseeko to compile under it no matter what I define, unless I use --std=gnu++11, which is obviously nonsense, since fseeko is part of POSIX rather than a GNU extension. So what gives? According to this discussion:
64 bit file access is the natural file access type for Cygwin. off_t is 8 bytes. There are no foo64 functions for that reason. Just use fopen and friends and you get 64 bit file access for free.
This means you need a separate #ifdef branch for Cygwin among the POSIX platforms.
_fseeki64 / _ftelli64
These are defined by Microsoft Visual C++ and are used exclusively with that compiler. Obviously, it doesn't support anything else from the list above (other than fseek), so you're going to need #ifdefs.
EDIT: I actually advise against using them, and I'm not the only one who thinks that. I experienced literally the following:
_wfopen a file in binary mode
fwrite 10 bytes' worth of data to it
_ftelli64 the position
It returns 12 rather than 10
Looks like this is horribly broken.
lseek and lseek64
Defined by POSIX, these are to be used with integer file descriptors obtained from open() in unistd.h rather than with FILE* streams. They are not available on Windows. Again, they use the off_t data type.
_lseek, _lseeki64
These are the Windows equivalents of lseek/lseek64. Curiously, _lseeki64 doesn't use off_t and uses __int64 instead, so you know it will work with big files. Neat.
fsetpos/fgetpos
While these are actually pretty portable, they're almost unusable, since they operate on opaque structures rather than integer offsets, meaning you cannot add or subtract them, or even navigate to a position in the file obtained by any means other than a prior fgetpos.
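For illustration, a minimal sketch of the one thing they are good for - saving a position and returning to it later (remember_and_return is a made-up name):

#include <cstdio>

// fpos_t is opaque: no arithmetic is allowed on it, so the only portable
// use is saving a position and later returning to that exact spot.
bool remember_and_return(FILE *f)
{
    fpos_t saved;
    if (fgetpos(f, &saved) != 0)
        return false;
    // ... read or write elsewhere in the file ...
    return fsetpos(f, &saved) == 0;
}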
Conclusion
So to make your program portable, depending on the platform, you should use:
fseeko with _FILE_OFFSET_BITS=64 defined on POSIX systems
fseek on Cygwin and as the default fallback
_lseeki64 on Windows - or, if you manage to work around its quirks, _fseeki64.
An example that uses _ftelli64:
#include <cstdint>
#include <cstdio>

int64_t portable_ftell(FILE *f)
{
#ifdef __CYGWIN__
    return ftell(f);      // Cygwin: 64-bit file access comes for free
#elif defined(_WIN32)
    return _ftelli64(f);  // MSVC: 64-bit variant
#else
    return ftello(f);     // POSIX: compile with -D_FILE_OFFSET_BITS=64
#endif
}
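A matching seek wrapper in the same style would be a natural companion (a sketch; portable_fseek is my own name, following the conclusion above):

#include <cstdint>
#include <cstdio>

// Companion to portable_ftell; returns 0 on success, like fseek.
int portable_fseek(FILE *f, int64_t offset, int origin)
{
#ifdef __CYGWIN__
    return fseek(f, offset, origin);      // Cygwin: 64-bit access comes for free
#elif defined(_WIN32)
    return _fseeki64(f, offset, origin);  // MSVC: 64-bit variant
#else
    return fseeko(f, offset, origin);     // POSIX: compile with -D_FILE_OFFSET_BITS=64
#endif
}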
In reality, instead of relying on #ifdefs, which have always looked fragile to me, you could test whether the functions compile using your build system and define your own constants such as HAVE_FTELLO64 accordingly.
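For example, the build system could try to compile a tiny probe program like the following and define HAVE_FTELLO64 only when it succeeds (a sketch; the constant name is whatever you choose):

// Probe: this translation unit compiles only where ftello64 is declared.
#include <cstdio>

int main()
{
    FILE *f = fopen("probe.bin", "rb");
    if (f) {
        long long pos = ftello64(f);  // fails to compile where ftello64 is missing
        (void)pos;
        fclose(f);
    }
    return 0;
}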
Note that if you do decide to use the lseek/_lseeki64 family with numeric file descriptors rather than FILE* streams, you should be aware of the following differences between open and fopen:
open doesn't buffer in user space; fopen does. Less buffering typically means worse performance for many small reads and writes.
open can't perform newline conversions for text files, fopen can.
More details in this question.
References:
http://www.lix.polytechnique.fr/~liberti/public/computing/prog/c/C/FUNCTIONS/funcref.htm#stdio
http://pubs.opengroup.org/onlinepubs/9699919799/functions/fseek.html
http://pubs.opengroup.org/onlinepubs/009695399/functions/open.html
http://pubs.opengroup.org/onlinepubs/009695399/functions/lseek.html
http://man7.org/linux/man-pages/man2/lseek.2.html
https://msdn.microsoft.com/en-us/library/75yw9bf3.aspx
http://www.cplusplus.com/reference/cstdio/fgetpos/
http://pubs.opengroup.org/onlinepubs/009695399/functions/fgetpos.html


Why are the standard datatypes not used in Win32 API? [duplicate]

I have been learning Visual C++ Win32 programming for some time now.
Why are datatypes like DWORD, WCHAR, UINT, etc. used instead of, say, unsigned long, char, unsigned int, and so on?
I have to remember when to use WCHAR instead of const char *, and it is really annoying me.
Why aren't the standard datatypes used in the first place? Will it help if I memorize Win32 equivalents and use these for my own variables as well?
Yes, you should use the correct data types for function arguments, or you are likely to find yourself in trouble.
And the reason these types are defined the way they are, rather than using int, char and so on, is that it removes the "whatever the compiler thinks an int should be sized as" from the interface of the OS. Which is a very good thing, because whether you use compiler A, compiler B, or compiler C, they will all use the same types - only the library interface header file needs to do the right thing in defining the types.
By defining types that are not standard types, it's easy to change int from 16 to 32 bits, for example. The first C/C++ compilers for Windows used 16-bit integers. It was only in the mid-to-late 1990s that Windows got a 32-bit API, and up until that point you were using an int that was 16 bits. Imagine that you have a well-working program that uses several hundred int variables, and all of a sudden you have to change ALL of those variables to something else... Wouldn't be very nice, right - especially as SOME of those variables DON'T need changing, because moving to a 32-bit int for some of your code won't make any difference.
It should be noted that WCHAR is NOT the same as const char - WCHAR is a "wide char" so wchar_t is the comparable type.
So, basically, "define our own types" is a way to guarantee that it's possible to change the underlying compiler or architecture without having to change (much of) the source code. All larger projects that do machine-dependent coding do this sort of thing.
The sizes and other characteristics of the built-in types such as int and long can vary from one compiler to another, usually depending on the underlying architecture of the system on which the code is running.
For example, on the 16-bit systems on which Windows was originally implemented, int was just 16 bits. On more modern systems, int is 32 bits.
Microsoft gets to define types like DWORD so that their sizes remain the same across different versions of their compiler, or of other compilers used to compile Windows code.
And the names are intended to reflect concepts on the underlying system, as defined by Microsoft. A DWORD is a "double word" (which, if I recall correctly, is 32 bits on Windows, even though a machine "word" is probably 32 or even 64 bits on modern systems).
It might have been better to use the fixed-width types defined in <stdint.h>, such as uint16_t and uint32_t -- but those were only introduced to the C language by the 1999 ISO C standard (which Microsoft's compiler doesn't fully support even today).
If you're writing code that interacts with the Win32 API, you should definitely use the types defined by that API. For code that doesn't interact with Win32, use whatever types you like, or whatever types are suggested by the interface you're using.
I think that it is a historical accident.
My theory is that the original Windows developers knew that the standard C type sizes depend on the compiler, that is, one compiler may have 16-bit integer and another a 32-bit integer. So they decided to make the Window API portable between different compilers using a series of typedefs: DWORD is a 32 bit unsigned integer, no matter what compiler/architecture you are using. Naturally, nowadays you will use uint32_t from <stdint.h>, but this wasn't available at that time.
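To illustrate the idea (a sketch with made-up names, not the real Windows headers):

#include <cstdint>

// The Win32 approach: pin down the meaning once, in one header, per compiler.
typedef unsigned long MY_DWORD;  // 32 bits on Windows compilers, like DWORD
typedef wchar_t       MY_WCHAR;  // wide character, like WCHAR

// The modern equivalent gets the same guarantee from <cstdint>:
typedef uint32_t MY_DWORD32;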
Then, with the UNICODE thing, they got the TCHAR vs. CHAR vs. WCHAR issue, but that's another story.
And then it grew out of control, and you get such nice things as typedef void VOID, *PVOID; which are utter nonsense.

Windows to iPhone binary files

Is it safe to pass binary files from Windows to iPhone that are written like:
std::ostream stream = // get it somehow
stream.write(&MyHugePODStruct, sizeof(MyHugePODStruct));
and read like:
std::istream stream = // get it somehow
stream.read(&MyHugePODStruct, sizeof(MyHugePODStruct));
assuming the definition of MyHugePODStruct is the same? If not, is there any way to serialize this safely with either the standard library (C++11 included) or Boost? Is there a cleaner way to do this? It seems like a non-portable piece of code.
No, for many reasons. First off, this won't compile, because you have to pass a char * to read and write. Secondly, this isn't guaranteed to work on even one single platform, because the structure may contain padding, which may itself differ among differently compiled versions of the code, even on the same platform. Next, there are 64/32-bit issues to consider, which affect many of the primitive types (e.g. long double is padded to 12 bytes on x86, but to 16 bytes on x64). Last but not least, there's endianness (though I'm not sure what the iOS endianness is).
So in short, no, don't do that.
You have to serialize each struct member separately, and according to its data type.
You might like to check out Boost.Serialization, though I have no experience with it.
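For a feel of what member-by-member serialization looks like, here is a minimal sketch (Sample and write_le are made-up names; it writes each field byte-by-byte in little-endian order, so padding, endianness, and type-size differences between the two platforms stop mattering):

#include <cstddef>
#include <cstdint>
#include <ostream>
#include <type_traits>

// Hypothetical POD used for illustration.
struct Sample {
    uint32_t id;
    int64_t  timestamp;
};

// Write an integer byte-by-byte in little-endian order, independent of
// host endianness and of any struct padding.
template <typename T>
void write_le(std::ostream &out, T value)
{
    typename std::make_unsigned<T>::type u = value;
    for (std::size_t i = 0; i < sizeof(T); ++i)
        out.put(static_cast<char>((u >> (8 * i)) & 0xFF));
}

void serialize(std::ostream &out, const Sample &s)
{
    write_le(out, s.id);
    write_le(out, s.timestamp);
}

Reading mirrors this with a read_le that reassembles each field from its bytes.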

Is there a standard way to determine at compile-time if system is 32 or 64 bit?

I need to set up #ifdef checks for conditional compilation. I want to automate the process but cannot specify the target OS/machine. Is there some way the preprocessor can resolve whether the target is 32-bit or 64-bit?
(Explanation) I need to define a type that is 64 bits in size. On a 64-bit OS it is a long; on most others it is a long long.
I found this answer - is this the correct way to go?
[edit] a handy reference for compiler macros
The only compile-time check you can do reliably is sizeof(void*) == 8, true for 64-bit targets and false for 32-bit ones. This is a constant expression, so you can pass it to templates, but you can forget about using it in an #ifdef. There is no platform-independent way to know the address size of the target architecture at preprocessing time; you will need to ask your build environment for one. The Standard doesn't even have the concept of an address size.
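A minimal sketch of that idea:

// sizeof(void*) is a constant expression: it works with static_assert and
// templates, but not with #if/#ifdef.
constexpr bool is_64bit = sizeof(void *) == 8;

static_assert(sizeof(void *) == 4 || sizeof(void *) == 8,
              "unexpected pointer size");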
No, there is no standard macro to determine whether the target machine is 64-bit or 32-bit at the preprocessing stage.
In response to your edit, there is a macro-less way to get a type that is 64 bits: #include <cstdint> and use either int64_t or uint64_t. You can also use the standard integer types provided by Boost.
Another option is to use long long. It's technically not part of the C++ standard (it will be in C++0x) but is supported on just about every compiler.
I would look at the source code of a cross-platform library; this is quite a large part of any such library. Every OS/compiler pair has its own set of definitions. A few libraries you may look at:
http://www.libsdl.org/ \include\SDL_config*.h (few files)
http://qt.nokia.com/ \src\corelib\global\qglobal.h
Boost has absorbed the old Predef project. You'll want the architecture macros, more specifically BOOST_ARCH_X86_32/BOOST_ARCH_X86_64, assuming you only care about x86.
If you need wider detection (e.g. ARM64), either add the relevant macros to your check, or check what you actually want to check, e.g.
sizeof(void*) == 8
Well, the answer is clearly going to be OS-specific, so you need to narrow down your requirements.
For example, on Unix uname -a typically gives enough info to distinguish a 32-bit build of the OS from a 64-bit build.
The command can be invoked by your build script. Depending on its output, compiler flags can be set appropriately.
I would be tempted to hoist the detection out of the code and put that into the Makefile. Then, you can leverage system tools to detect and set the appropriate macro upon which you are switching in your code.
In your Makefile ...
<do stuff to detect and set SUPPORT_XX_BIT to the appropriate value>
gcc myFile.c -D$(SUPPORT_XX_BIT) -o myFile
In your code ...
#if defined(SUPPORT_32_BIT)
...
#elif defined(SUPPORT_64_BIT)
...
#else
#error "Select either 32 or 64 bit option\n"
#endif
Probably the easiest way is comparing the sizes of int and long long. You cannot do it in the preprocessor, but you can use it in a static_assert.
Edit: Wow, all the negative votes. I made my point a bit clearer. Also, it appears I should have mentioned long long rather than long because of the way MSVC works.

Cross-platform primitive data types in C++

Unlike Java or C#, primitive data types in C++ can vary in size depending on the platform. For example, int is not guaranteed to be a 32-bit integer.
Various compiler environments define data types such as uint32 or dword for this purpose, but there seems to be no standard include file for fixed-size data types.
What is the recommended method to achieve maximum portability?
I found this header particularly useful:
BOOST cstdint
Usually better than reinventing the wheel yourself (which incurs maintenance and testing costs).
Create a header file called types.h, and define all the fixed-size primitive types you need (int32, uint32, uint8, etc.). To support multiple platforms, you can either use #ifdefs or have a separate include directory for each platform (include_x86, include_x86_64, include_sparc). In the latter case, you would have a separate build configuration for each platform, with the right include directory in its include path. The second method is preferable, according to "C++ Gotchas" by Stephen Dewhurst.
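A sketch of what the #ifdef variant of such a header might look like (the names int32/uint32 are your own, and the fallback branch assumes a 32-bit int):

// types.h - fixed-size primitive types
#ifndef TYPES_H
#define TYPES_H

#if defined(_MSC_VER)
typedef __int32          int32;
typedef unsigned __int32 uint32;
#else
typedef int              int32;   // assumes int is 32 bits on this platform
typedef unsigned int     uint32;
#endif

#endif // TYPES_H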
Just an aside, if you are planning to pass binary data between different platforms, you also have to worry about byte order.
Part of the C99 standard was a stdint.h header file to provide this kind of information. For instance, it defines a type called uint32_t. Unfortunately, a lot of compilers don't support stdint.h. The best cross-platform implementation I've seen of stdint.h is here: http://www.azillionmonkeys.com/qed/pstdint.h. You can just include that in your project.
If you're using boost, I believe it also provides something equivalent to the stdint header.
Define a type (e.g. int32) in a header file. For each platform, use another #ifdef and make sure that int32 is a 32-bit integer. Use int32 everywhere in your code, and make sure you use the right define when compiling for each platform.
There is a stdint.h header defined by the C99 standard and (I think) by some variant or another of ISO C++. This defines nice types like int16_t, uint64_t, etc., which are guaranteed to have a specific size and representation. Unfortunately, its availability isn't exactly standard (Microsoft in particular was a foot-dragger here).
The simple answer is this, which works on every 32 or 64 bit byte-addressable architecture I am aware of:
All char variables are 1 byte
All short variables are 2 bytes
All int variables are 4 bytes
DO NOT use a "long", which is of indeterminate size.
All known compilers with support for 64 bit math allow "long long" as a native 64 bit type.
Be aware that some 32 bit compilers don't have a 64 bit type at all, so using long long will limit you to 64 bit systems and a smaller set of compilers (which includes gcc and MSVC, so most people won't care about this problem).
If its name begins with two underscores (__), a data type is non-standard.
__int8 (unsigned __int8)
__int16 (unsigned __int16)
__int32 (unsigned __int32)
__int64 (unsigned __int64)
Try to use boost/cstdint.hpp
Two things:
First, there is a header file called limits.h that gives lots of useful platform-specific information. It will give the max and min values of the int type, for example; from that, you can deduce how big the int type is. You can also use the sizeof operator (evaluated at compile time) for these purposes.
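For instance, a tiny sketch:

#include <climits>
#include <cstdio>

int main()
{
    // CHAR_BIT and INT_MAX come from <climits> (the C++ face of limits.h).
    std::printf("int: %zu bits, INT_MAX = %d\n",
                sizeof(int) * CHAR_BIT, INT_MAX);
    return 0;
}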
I hope this helps . . .
K

Seeking and reading large files in a Linux C++ application

I am running into integer overflow using the standard ftell and fseek options inside of G++, but I guess I was mistaken because it seems that ftell64 and fseek64 are not available. I have been searching and many websites seem to reference using lseek with the off64_t datatype, but I have not found any examples referencing something equal to fseek. Right now the files that I am reading in are 16GB+ CSV files with the expectation of at least double that.
Without any external libraries, what is the most straightforward method for achieving something similar to the fseek/ftell pair? My application currently works using the standard GCC/G++ libraries for 4.x.
fseek64 is a C function. To make it available, you'll have to define _FILE_OFFSET_BITS=64 before including the system headers. That will more or less define fseek to actually be fseek64. Or do it in the compiler arguments, e.g.
gcc -D_FILE_OFFSET_BITS=64 ....
http://www.suse.de/~aj/linux_lfs.html has a great overview of large file support on Linux:
Compile your programs with "gcc -D_FILE_OFFSET_BITS=64". This forces all file access calls to use the 64 bit variants. Several types change also, e.g. off_t becomes off64_t. It's therefore important to always use the correct types and to not use e.g. int instead of off_t. For portability with other platforms you should use getconf LFS_CFLAGS which will return -D_FILE_OFFSET_BITS=64 on Linux platforms but might return something else on e.g. Solaris. For linking, you should use the link flags that are reported via getconf LFS_LDFLAGS. On Linux systems, you do not need special link flags.
Define _LARGEFILE_SOURCE and _LARGEFILE64_SOURCE. With these defines you can use the LFS functions like open64 directly.
Use the O_LARGEFILE flag with open to operate on large files.
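A sketch of that second approach (end_offset is my own helper name; on Linux you can equally just compile with -D_FILE_OFFSET_BITS=64 and keep using plain open and lseek):

#define _LARGEFILE64_SOURCE  // exposes open64/lseek64/off64_t on Linux
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

// Returns the size of the file in bytes, or -1 on error.
off64_t end_offset(const char *path)
{
    int fd = open64(path, O_RDONLY);
    if (fd < 0)
        return -1;
    off64_t end = lseek64(fd, 0, SEEK_END);
    close(fd);
    return end;
}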
If you want to stick to ISO C standard interfaces, use fgetpos() and fsetpos(). However, these functions are only useful for saving a file position and going back to the same position later. They represent the position using the type fpos_t, which is not required to be an integer data type. For example, on a record-based system it could be a struct containing a record number and offset within the record. This may be too limiting.
POSIX defines the functions ftello() and fseeko(), which represent the position using the off_t type. This is required to be an integer type, and the value is a byte offset from the beginning of the file. You can perform arithmetic on it, and can use fseeko() to perform relative seeks. This will work on Linux and other POSIX systems.
In addition, compile with -D_FILE_OFFSET_BITS=64 (Linux/Solaris). This will define off_t to be a 64-bit type (i.e. off64_t) instead of long, and will redefine the functions that use file offsets to be the versions that take 64-bit offsets. This is the default when you are compiling for 64-bit, so is not needed in that case.
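Putting that together, a minimal sketch (file_size is my own helper name; note that seeking to SEEK_END on a binary stream is not strictly guaranteed by ISO C, though it works on POSIX systems):

// Build with: g++ -D_FILE_OFFSET_BITS=64 example.cpp
#include <stdio.h>
#include <sys/types.h>

// Returns the size of a file in bytes, or -1 on error.
off_t file_size(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;
    if (fseeko(f, 0, SEEK_END) != 0) {
        fclose(f);
        return -1;
    }
    off_t size = ftello(f);
    fclose(f);
    return size;
}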
fseek64() isn't standard; the compiler docs should tell you where to find it.
Have you tried fgetpos and fsetpos? They're designed for large files and the implementation typically uses a 64-bit type as the base for fpos_t.
Have you tried fseeko() with the _FILE_OFFSET_BITS preprocessor symbol set to 64?
This will give you an fseek()-like interface but with an offset parameter of type off_t instead of long. Setting _FILE_OFFSET_BITS=64 will make off_t a 64-bit type.
The same goes for ftello().
Use fsetpos(3) and fgetpos(3). They use the fpos_t datatype, which I believe is guaranteed to be able to hold at least 64 bits.