What is the purpose of std::byte? - c++

Now that C++17 has std::byte, I was looking for a way to convert code that reads files into char into code that reads files into byte. A file contains bytes, not a bunch of integers.
Then I read this question and this other question where people argued that reading files into byte is wrong, and that reading files into char is right.
If byte is not designed for the purpose of accessing memory, and by analogy, files, then what is its purpose? As is quoted in the other two questions:
Like char and unsigned char, it can be used to access raw memory occupied by other objects (object representation), but unlike those types, it is not a character type and is not an arithmetic type. A byte is only a collection of bits, and only bitwise logic operators are defined for it.
This sounds like the exact type that should be used for reading files, not characters.

You are perhaps misunderstanding things.
byte is very much intended for "accessing memory". You are intended to use the type when the storage is just a sequence of bytes rather than an array of characters.
Iostream types cannot be specialized with byte, since they're designed around characters as their interface. That is, they do not think of files as sequences of bytes; they think of them as sequences of characters. Now, you can certainly read directly into a byte array by using a cast or two. But that's not the way iostream natively thinks.
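For example, here is a minimal sketch (the file name, function name and buffer size are placeholders) of reading a file into a std::byte buffer through an ifstream, using the cast mentioned above:

#include <cstddef>
#include <fstream>
#include <vector>

std::vector<std::byte> read_chunk(const char* path)
{
    std::ifstream in(path, std::ios::binary);
    std::vector<std::byte> buffer(1024);   // reads at most 1024 bytes
    // iostream's interface is char-based, so the byte buffer
    // has to be presented to it as a char array.
    in.read(reinterpret_cast<char*>(buffer.data()),
            static_cast<std::streamsize>(buffer.size()));
    buffer.resize(static_cast<std::size_t>(in.gcount()));
    return buffer;
}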
You have to make a distinction between the way iostream works and the way files work. Iostream is just one file IO library, after all; it is hardly the end-all, be-all of file APIs.
Most file APIs for reading binary data take void* rather than character arrays: std::fread, std::fwrite, and so forth.
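With std::fread the cast disappears, since the buffer parameter is void*; a rough sketch under the same assumptions (the file name is a placeholder):

#include <cstddef>
#include <cstdio>
#include <vector>

int main()
{
    std::vector<std::byte> buffer(1024);
    std::FILE* f = std::fopen("data.bin", "rb");
    if (f) {
        // No cast needed: fread's first parameter is void*.
        std::size_t n = std::fread(buffer.data(), 1, buffer.size(), f);
        buffer.resize(n);
        std::fclose(f);
    }
}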
That is, you should think of this, not as a problem with std::byte, but as a problem with iostream. Just another in a long line of them.

Related

What is the purpose of the various types added by zlib and how can I use them?

I am currently trying to create a small utility to compress and decompress files in C++, with the goal of learning the basic usage of zlib. I am using the compress2 and uncompress functions provided by zlib to facilitate this. Both of these functions, however, take various types that seem specific to zlib (Bytef, uLongf, etc.) without any automatic conversions between them and C++ types (or, rather, between the pointers to each of these types). This makes simple code to interface with zlib more complex, unless I write my entire application around zlib's types.
My question has 3 parts:
What is the purpose of these types as opposed to built-in types such as unsigned long, which I am using in my own file I/O code to represent file lengths?
What is the proper way to use these types? Can I reinterpret_cast my char pointers to the data to (de)compress into Bytef pointers, without changing the length I pass from the length of my char array? Because char is one byte and Bytef's name suggests it is the same size, I imagine that I can, but I want to make sure. Can I simply assign an unsigned long (or another non-zlib integral type) to a uLongf or other seemingly-integral zlib type?
Where is zlib's official documentation on these types?
I skimmed the zlib manual and fully read the sections that could possibly seem relevant, along with a ctrl+f aided search, to no avail. My search engine also does not know the answer.
For portability, your file lengths should use off_t, not unsigned long. On some systems those are different sizes, with off_t being longer.
Yes, you can just cast between Bytef and char. (Bytef is actually unsigned char, but there is no conversion required.) uLong is simply unsigned long. (See zconf.h.)
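As a rough sketch (the helper name is made up and error handling is reduced to the return code), compressing data held in a char buffer only needs pointer casts, not copies:

#include <vector>
#include <zlib.h>

int compress_chars(const char* input, unsigned long inputLen, std::vector<char>& output)
{
    uLongf destLen = compressBound(inputLen);       // worst-case size of the compressed output
    output.resize(destLen);
    int rc = compress2(reinterpret_cast<Bytef*>(output.data()), &destLen,
                       reinterpret_cast<const Bytef*>(input), inputLen,
                       Z_BEST_COMPRESSION);
    if (rc == Z_OK)
        output.resize(destLen);                     // destLen now holds the actual compressed size
    return rc;
}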
zlib's documentation is zlib.h, where those types are called out as parameters of the functions. You can use zlibCompileFlags() to determine the number of bits in each type. (See zlib.h.)

Why does ostream::write() require ‘const char_type*’ instead of ‘const void*’ in C++?

The fwrite() function in C uses const void *restrict buffer as the first argument, so you can pass pointer to your struct as the first parameter directly.
http://en.cppreference.com/w/c/io/fwrite
e.g. fwrite(&someStruct, sizeof(someStruct), 1, file);
But in C++, the ostream::write() requires const char_type*, which forces you to use reinterpret_cast. (In Visual Studio 2013, it's const char*.)
http://en.cppreference.com/w/cpp/io/basic_ostream/write
e.g. file.write(reinterpret_cast<char*>(&someStruct), sizeof(someStruct));
In almost all cases, the binary data to be written to files is not a char array, so why does the standard prefer the style which seems more complex?
P.S.
1. Actually I used the write() method of ofstream with ios::binary mode, but according to the reference, ofstream inherits it from ostream, so I refer to ostream::write() above.
2. If you want to print a stream of characters, you could use operator<<().
Isn't the write() method designed for writing raw data?
3. If write() is not the way to write binary data, then what is the way to do it within the standard? (Although this may hurt the portability of the code due to different memory alignment strategies on different platforms.)
The portrayal of this as a C vs C++ thing is misleading. C++ provides std::fwrite(const void*, ...) just like C. Where C++ chooses to be more defensive is specifically the std::iostream versions.
"Almost in all cases the binary data to be written to files is not char array"
That's debatable. In C++ it isn't unusual to add a level of indirection in I/O, so objects are streamed or serialised to a convenient - and possibly portable (e.g. endian-standardised, with or without standardised structure padding) - representation, then deserialised/parsed when re-read. The logic is typically localised with the individual objects involved, such that a top-level object doesn't need to know details of the memory layout of its members. Serialisation and streaming tend to be thought of / buffered etc. at the byte level - fitting in better with character buffers - and read() and write() return the number of characters that could currently be transmitted, again at the character and not the object level, so it's not very productive to pretend otherwise or you'll have a mess resuming partially successful I/O operations.
Raw binary writes / reads done naively are a bit dangerous as they don't handle these issues so it's probably a good thing that the use of these functions is made slightly harder, with reinterpret_cast<> being a bit of a code smell / warning.
That said, one unfortunate aspect of the C++ use of char* is that it may encourage some programmers to first read to a character array, then use inappropriate casts to "reinterpret" the data on the fly - like an int* aimed at the character buffer in a way that may not be appropriately aligned.
If you want to print a stream of characters, you could use operator<<(). Isn't write() method designed for writing raw data?
To print a stream of characters with operator<<() is problematic, as the only relevant overload takes a const char* and expects a '\0'/NUL-terminated buffer. That makes it useless if you want to print one or more NULs in the output. Further, when starting with a longer character buffer operator<< would often be clumsy, verbose and error prone, needing a NUL swapped in and back around the streaming, and would sometimes be a significant performance and/or memory use issue e.g. when writing some - but not the end - of a long string literal into which you can't swap a NUL, or when the character buffer may be being read from other threads that shouldn't see the NUL.
The provided std::ostream::write(p, n) function avoids these problems, letting you specify exactly how much you want printed.
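A small sketch of the difference (the buffer contents are arbitrary): operator<< stops at the first NUL, while write() emits exactly the number of characters you ask for:

#include <iostream>

int main()
{
    const char data[] = { 'a', 'b', '\0', 'c', 'd' };
    std::cout << data;                   // prints "ab" and stops at the embedded NUL
    std::cout.write(data, sizeof data);  // writes all 5 characters, NUL included
}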
char_type is not exactly char*; it's the template parameter of the stream that represents the stream's character type:
template<typename _CharT, typename _Traits>
class basic_ostream : virtual public basic_ios<_CharT, _Traits>
{
public:
    // Types (inherited from basic_ios):
    typedef _CharT char_type;
    <...>
And std::ostream is just the char instantiation:
typedef basic_ostream<char> ostream;
In C/C++, char is the data type for representing a byte, so char[] is the natural data type for binary data.
Your question, I think, is better directed at the fact C/C++ was not designed to have distinct data types for "bytes" and "characters", rather than at the design of the stream libraries.
Ree,
From the cplusplus.com site, the signature of ostream::write is:
ostream& write (const char* s, streamsize n);
I have just checked it on VS2013; you can easily write:
std::ofstream outfile("new.txt", std::ofstream::binary);
char buffer[] = "This is a string";
outfile.write(buffer, strlen(buffer));

int8_t and char: converts between pointers to integer types with different sign - but it doesn't

I'm working with some embedded code and I am writing something new from scratch so I am preferring to stick with the uint8_t, int8_t and so on types.
However, when porting a function:
void functionName(char *data)
to:
void functionName(int8_t *data)
I get the compiler warning "converts between pointers to integer types with different sign" when passing a literal string to the function (i.e. when calling functionName("put this text in");).
Now, I understand why this happens, and these lines are only for debugging; however, I wonder what people feel is the most appropriate way of handling this, short of typecasting every literal string. I don't feel that blanket typecasting is any safer in practice than using potentially ambiguous types like "char".
You seem to be doing the wrong thing, here.
Characters are not defined by C as being 8-bit integers, so why would you ever choose to use int8_t or uint8_t to represent character data, unless you are working with UTF-8?
For C's string literals, their type is array of char (which decays to pointer to char), and char is not at all guaranteed to be 8-bit.
It's also not defined whether it's signed or unsigned, so just use const char * for string literals.
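In other words, a sketch of a signature that keeps the literal-string call sites warning-free (the body is just a stand-in for the debug output mentioned in the question):

#include <cstdio>

// Keep the text interface in terms of char; reserve int8_t/uint8_t for raw octets.
void functionName(const char* data)
{
    std::printf("%s\n", data);
}

int main()
{
    functionName("put this text in");   // no sign-conversion warning now
}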
To answer your addendum (the original question was nicely answered by @unwind): I think it mostly depends on the context. If you are working with text, i.e. string literals, you have to use const char* or char*, because the compiler will convert the characters accordingly. Short of writing your own string implementation, you are probably stuck with whatever the compiler provides to you. However, the moment you have to interact with someone/something outside of your CPU context, e.g. network, serial, etc., you have to have control over the exact size (which I suppose is where your question stems from). In this case I would suggest writing functions to convert strings, or any data type for that matter, to uint8_t buffers for serialized sending (or receiving).
const char* my_string = "foo bar!";
uint8_t* buffer = string2sendbuffer(my_string);
my_send(buffer, destination);
The string2sendbuffer function would know everything there is to know about putting characters in a buffer. For example, it might know that you have to encode each char into two buffer elements using big-endian byte ordering. This function is most certainly platform dependent but encapsulates all this platform dependence, so you would gain a lot of flexibility.
The same goes for every other complex data-type. For everything else (where the compiler does not have that strong an opinion) I would advise on using the (u)intX_t types provided by stdint.h (which should be portable).
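A minimal sketch of what such a helper could look like, assuming the hypothetical two-bytes-per-character, big-endian encoding mentioned above (the name and encoding are illustrative only, and a std::vector is returned instead of the raw pointer shown in the pseudo-code):

#include <cstdint>
#include <cstring>
#include <vector>

std::vector<uint8_t> string2sendbuffer(const char* s)
{
    std::vector<uint8_t> buffer;
    const std::size_t n = std::strlen(s);
    for (std::size_t i = 0; i < n; ++i) {
        const uint16_t code = static_cast<unsigned char>(s[i]);  // widen each char to 16 bits
        buffer.push_back(static_cast<uint8_t>(code >> 8));       // high byte first (big-endian)
        buffer.push_back(static_cast<uint8_t>(code & 0xFF));
    }
    return buffer;
}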
It is implementation-defined whether the type char is signed or unsigned. It looks like you are using an environment where it is unsigned.
So, you can either use uint8_t or stick with char, whenever you are dealing with characters.

Write a C++ struct to a file and read file using another programming language?

I have a challenging situation; we will have programs on Mac, PC, iOS and Android receiving files in a legacy format and parsing data from those files. We cannot change how those files are created.
The files are produced by a C++ program filling a struct with numbers and Strings and then writing it out. Here's a sanitized version.
struct MyObject {
    String Kfkj(MAXKYS);
    String Oern(MAXKYS);
    String Vdflj(MAXKYS, 9);
    int Muic;
    int Tdfkj;
    int VdfkAsdk;
    int SsdjsdDsldsk;
    int Ndsoief;
    String TdflsajPdlj;
    String TdckjdfPas;
    String AdsfakjIdd;
    int IdkfjdKasdkj;
    int AsadkjaKadkja(MAXKYS);
    int Kasldsdkj;
    bool Usadl;
    String PsadkjOasdj(9);
    String PasdkjOsdkj;
};
Primitives and Strings, as you can see.
Then here is how they write it out to a file:
MyObject MyInstance;
FileName = "C:\\MyFile.ab2";
ofstream fout(FileName, ios::binary);
fout.write((char*)&MyInstance, sizeof(MyInstance));
There is no option for us to translate it once and then distribute the file to other platforms; we must translate it on each and every different platform, and this is what we have to work with. I'd appreciate any information on how C++ serializes data, so we know how to parse the file.
EDIT: solution
The feedback I received from multiple answers here was VERY helpful. Using that, I did extensive analysis with hex editors and discovered:
the elements come in the file one after another
a "String," in this case, starts with an int describing how many characters follow the int for that String. If the String does not exist, it will still have that int with a value of 0.
integers, for the files and machines I saw, are two bytes, little-endian, and MOSTLY unsigned (there were a few that were signed, just to keep me on my toes)
the boolean was two bytes, with apparently -1 (FF FF) representing "true"
So far we have not run into issues with different padding or endianness on different devices, but those are very real concerns. The skilled notes and warnings in these answers provide us with more ammunition to try to convince the client to change to a less fragile alternative, such as XML or JSON, for transferring data online across platforms.
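Based on those findings, here is a rough parsing sketch (the helper names are made up; it only shows how the length-prefixed strings, 2-byte little-endian integers, and 2-byte booleans described above could be decoded):

#include <cstdint>
#include <istream>
#include <string>

// Reads a 2-byte little-endian integer, as observed in the legacy files.
uint16_t read_u16(std::istream& in)
{
    unsigned char b[2] = {};
    in.read(reinterpret_cast<char*>(b), 2);
    return static_cast<uint16_t>(b[0] | (b[1] << 8));
}

// A "String" was observed to be an integer character count followed by that many characters.
std::string read_string(std::istream& in)
{
    const uint16_t len = read_u16(in);
    std::string s(len, '\0');
    in.read(&s[0], len);
    return s;
}

// The boolean was observed to be two bytes, with 0xFFFF meaning true.
bool read_bool(std::istream& in)
{
    return read_u16(in) != 0;
}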
As for those of you asking if the developer was fired... well, let's just say their code is very old, but after multiple conversations we're still having trouble convincing them writing out the C++ struct and trying to read that on different platforms is not a good idea.
You're going to run into many problems.
C++ doesn't have a specific format for serializing data per se. It is highly dependent on the computer architecture/processor that you are running on.
The compiler is allowed to add padding to help alignment on systems. When we say alignment we basically are referring to an architecture/processor's affinity for having data lie on specific byte boundaries. For example, some processors vastly prefer floating point numbers to lie at 4 or 8 byte boundaries - if they don't the processor may work much slower or may not work at all.
So, you can't simply know what padding your system is adding magically.
What you can do is use #pragma pack(push, 1) before the struct and #pragma pack(pop) after it to stop your compiler from padding the fields.
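For instance, a sketch of what that looks like (the struct is illustrative; note this only affects code you recompile, not files that were already written):

#include <cstdint>

#pragma pack(push, 1)   // no padding between the members that follow
struct Record {
    uint16_t id;
    uint8_t  flag;
    uint32_t value;
};
#pragma pack(pop)       // restore the compiler's default packing

static_assert(sizeof(Record) == 7, "Record must be exactly 7 bytes");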
PS: you also have to worry about endianness. What if one computer is running on big-endian and one is little endian? They will interpret bytes differently without a conversion.
Simply put, you either have to fix the application generating the files so it uses a proper serialization scheme OR you need to look at it running on a SPECIFIC computer, look at exactly how it writes the files, and write a translator for every target platform (which is just silly).
Interesting Suggestion
If you're really stuck, write an app that monitors the folder where you write files. Have the app pick up the files (since it's on the same PC it'll be able to read their format without issue). Have it write the files back in XML or some other true serialization format and distribute those instead.
Whoa, that's crazy. So String objects don't contain any pointers? They must not, because you claim this is working code.
Anyway, that code isn't doing any serialization. It's just writing the structure out to file exactly the way it is laid out in memory. The only issue you have is that on some platforms the padding and the sizeof integral types like int may be different.
You'll have to find the size of the integral types, and use that information in reader/writer for newer platforms to make sure they get laid out the same way on the legacy platform.
You're running a real risk with that code though. As it is, a compiler change could suddenly cause the file layout to change.
The format of your data file is entirely down to the compiler that your C++ program is compiled with, and the definition of your String class. You can rely on the fields being in the order they're declared in, and in this case, I think you can rely on there not being any padding at the start, but that's about all. Some tips that might help you out in this case:-
You don't give the definition of the String class you're using. If it's a typedef for std::string, you're completely screwed, because the contents of the string aren't stored inside the object that was written out. I assume your C++ programmers are using some special local buffer, in which case I'll guess you will find the first bytes of the object are the string, and there is some amount of useless padding afterwards. I hope the struct contains an int at the start telling you how much data in it is useful.
You'll probably find the int fields are four bytes long.
You'll probably find the bool field is one byte long, followed by three bytes of useless padding. Only one bit, most likely the bottom bit, will be set.
That's about all the useful guesswork I can offer you. In your target language, make sure to read the whole file in as the closest thing to a byte array available in the language, and only after that, use the language features to convert it into the right kind of thing in your language. Don't try reading it in as integers, as that won't let you byte-swap if you're on a platform with different endianness to the C++ program. I suggest also looking through the file in a text editor to reverse-engineer it and help you find the offset of each field.
Last piece of advice: consider printing P45s (or pink slips, or whatever you have in your country) for whichever programmers or project managers thought this kind of 'serialization' was a good idea. This kind of sloppy work might have been acceptable in a life-or-death situation, but they have seriously screwed you over in a way you're going to find it very hard to recover from. Writing the code to read in these files will not be that hard, if it's only one struct like this, but keeping it reliable will be a world of pain, and they've effectively made it impossible for themselves to change compilers or compiler version safely.
The way it's done, the struct is written in raw form to a file. So basically what you need to know to parse this file is the binary layout of your struct.
Basically, the fields are just one after the other, so to read an int, you just read 4 bytes and cast that to an int, etc.
Strings are a particular case. It's not clear from your code whether this "String" type is an inline array of characters, or a pointer to such an array. In the first case, you need to know how many characters each string contains and simply read that number of characters sequentially. In the second case, you won't be able to get the string back, since it won't have been written to file. The pointer will be useless to you.
One last concern is whether the struct is packed or not. Since you gave no indication of that: by default struct fields are aligned to their natural boundaries (often 4 bytes), so there may be space, for instance after the boolean field, that you need to account for. If the struct is packed, then each field comes directly after the previous one.
So, to make a long story short, figure out your struct binary layout using its definition and, if all else fails, inspecting the memory at run-time with the debugger, or use a hex editor to study the output file. Then write that specification down somewhere and this will give you what you need to read from the file. It's impossible to tell exactly what that layout is simply by looking at the pseudo-definition you gave.
Writing to an ofstream does not serialize data. This code writes the raw memory content of the struct as if it were a string of char. Depending on your compiler, its version, its options and the system it is running on, the content will be completely different.
Even the number of bits in a char is allowed to change between C++ implementations.
Data referenced by the struct's members won't be written (forget about the contents of a std::string).
If you cannot change the writer code, you must know the alignment policy, the sizes of the base types and the data representation. You will have to analyze files produced by hand, for example with a hexadecimal editor like http://www.physics.ohio-state.edu/~prewett/hexedit/ , and probably look at your compiler documentation.
If you can change the writer code, use proper serialization like JSON, protocol buffers, or simply XML.
No one has pointed out something that sticks out to me as particularly problematic (maybe because I've been bit by it). That problem: the data member bool Usadl;. sizeof(bool) varies across platforms, across compilers, and even across releases of the same compiler. Common values for sizeof(bool) are 4 and 1. This will bite you. It's getting hard to find a big endian machine nowadays, very, very hard to find a computer where CHAR_BIT is not 8 or sizeof(int) is not 4. This is not the case for sizeof(bool).
In agreement with everyone else, Chad's team needs to document the structure of the records in the file, and then make sure the program that produces the file writes this structure explicitly, including element sizes, padding, and endianness. Don't depend on class layout to do this for you. That's just asking for trouble.
The best way would probably be to use JSON, or, if you want a more robust solution, go with something like Avro. Avro has a C++ API and a Java API, so it covers most of the cases you're encountering.

Marshall multiple protobuf to file

Background:
I'm using Google's protobuf, and I would like to read/write several gigabytes of protobuf marshalled data to a file using C++. As it's recommended to keep the size of each protobuf object under 1MB, I figured a binary stream (illustrated below) written to a file would work. Each offset contains the number of bytes to the next offset until the end of the file is reached. This way, each protobuf can stay under 1MB, and I can glob them together to my heart's content.
[int32 offset]
[protobuf blob 1]
[int32 offset]
[protobuf blob 2]
...
[eof]
I have an implementation that works on GitHub:
src/glob.hpp
src/glob.cpp
test/readglob.cpp
test/writeglob.cpp
But I feel I have written some poor code, and would appreciate some advice on how to improve it. Thus,
Questions:
I'm using reinterpret_cast<char*> to read/write the 32 bit integers to and from the binary fstream. Since I'm using protobuf, I'm making the assumption that all machines are little-endian. I also assert that an int is indeed 4 bytes. Is there a better way to read/write a 32 bit integer to a binary fstream given these two limiting assumptions?
In reading from fstream, I create a temporary fixed-length char buffer, so that I can then pass this fixed-length buffer to the protobuf library to decode using ParseFromArray, as ParseFromIstream will consume the entire stream. I'd really prefer just to tell the library to read at most the next N bytes from fstream, but there doesn't seem to be that functionality in protobuf. What would be the most idiomatic way to pass a function at most N bytes of an fstream? Or is my design sufficiently upside down that I should consider a different approach entirely?
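One possible sketch covering both points (MyMessage and its header are placeholders for your generated protobuf type): write the length prefix byte by byte so the file is explicitly little-endian regardless of the host, and read each blob into a fixed-size buffer that is handed to ParseFromArray:

#include <cstdint>
#include <istream>
#include <ostream>
#include <vector>
#include "mymessage.pb.h"   // placeholder for the generated protobuf header

void write_u32_le(std::ostream& out, uint32_t v)
{
    const char b[4] = { char(v & 0xFF), char((v >> 8) & 0xFF),
                        char((v >> 16) & 0xFF), char((v >> 24) & 0xFF) };
    out.write(b, 4);        // byte order is fixed, independent of the host CPU
}

uint32_t read_u32_le(std::istream& in)
{
    unsigned char b[4] = {};
    in.read(reinterpret_cast<char*>(b), 4);
    return uint32_t(b[0]) | (uint32_t(b[1]) << 8)
         | (uint32_t(b[2]) << 16) | (uint32_t(b[3]) << 24);
}

bool read_blob(std::istream& in, MyMessage& msg)
{
    const uint32_t size = read_u32_le(in);
    if (!in) return false;                                   // end of file or short read
    std::vector<char> buffer(size);
    in.read(buffer.data(), size);
    if (in.gcount() != static_cast<std::streamsize>(size)) return false;
    return msg.ParseFromArray(buffer.data(), static_cast<int>(size));
}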
Edit:
@codymanix: I'm casting to char since istream::read requires a char array, if I'm not mistaken. I'm also not using the extraction operator >> since I read it was poor form to use with binary streams. Or is this last piece of advice bogus?
@Martin York: Removed new/delete in favor of std::vector<char>. glob.cpp is now updated. Thanks!
Don't use new []/delete[].
Instead use a std::vector, as deallocation is guaranteed in the event of exceptions.
Don't assume that reading will return all the bytes you requested.
Check with gcount() to make sure that you got what you asked for (a small sketch follows below).
Rather than having Glob implement the code for both input and output depending on a switch in the constructor, implement two specialized classes, like ifstream/ofstream. This will simplify both the interface and the usage.
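A minimal sketch of the gcount() check mentioned above (the function name is illustrative):

#include <istream>
#include <vector>

bool read_exact(std::istream& in, std::vector<char>& buffer)
{
    in.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
    // gcount() reports how many characters were actually extracted,
    // which can be fewer than requested near the end of the file.
    return in.gcount() == static_cast<std::streamsize>(buffer.size());
}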