Should a buffer of bytes be signed or unsigned char buffer? - c++

Should a buffer of bytes be signed char or unsigned char or simply a char buffer?
Any differences between C and C++?
Thanks.

If you intend to store arbitrary binary data, you should use unsigned char. It is the only data type that the C Standard guarantees to have no padding bits. Every other data type may contain padding bits in its object representation (the representation that contains all bits of an object, not only those that determine its value). The state of the padding bits is unspecified and they are not used to store values. So if you read binary data through char, values would be cut down to the value range of a char (only the value bits are interpreted), yet there may still be bits that are ignored for the value but are nevertheless there and copied by memcpy, much like padding bits in real struct objects. unsigned char is guaranteed not to contain any such bits. That follows from 5.2.4.2.1/2 (C99 TC2, n1124 here):
If the value of an object of type char is treated as a signed integer when used in an expression, the value of CHAR_MIN shall be the same as that of SCHAR_MIN and the value of CHAR_MAX shall be the same as that of SCHAR_MAX. Otherwise, the value of CHAR_MIN shall be 0 and the value of CHAR_MAX shall be the same as that of UCHAR_MAX. The value UCHAR_MAX shall equal 2^CHAR_BIT − 1.
From the last sentence it follows that there is no space left for any padding bits. If you use char as the type of your buffer, you also have the problem of overflow: assigning a value that fits in 8 bits, and which you might therefore expect to be OK, but lies outside the range CHAR_MIN..CHAR_MAX of char, is a conversion with an implementation-defined result, possibly including a signal being raised.
Even though such problems would probably not show up on real implementations (it would be a very poor quality of implementation), you are best off using the right type from the start, which is unsigned char.
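To make that concrete, here is a minimal sketch (not part of the answer above; the byte value 0xC3 is arbitrary). Storing an out-of-range byte value into a plain char is implementation-defined before C++20 on platforms where char is signed, while the unsigned char version is always well defined:

#include <cstdio>

int main() {
    char c = 0xC3;            // 195 is out of range if char is signed: implementation-defined result (before C++20)
    unsigned char u = 0xC3;   // always well defined, holds exactly 195

    std::printf("%d %d\n", c, static_cast<int>(u));   // typically prints "-61 195" where char is signed
    return 0;
}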
For strings, however, the data type of choice is char, which will be understood by string and print functions. Using signed char for these purposes looks like a wrong decision to me.
For further information, read this proposal, which contains a fix for the next version of the C Standard that will eventually require signed char not to have any padding bits either. It's already incorporated into the working paper.

Should a buffer of bytes be signed char or unsigned char or simply a char buffer? Any differences between C and C++?
A minor difference in how the language treats it. A huge difference in how convention treats it.
char = ASCII (or UTF-8, but the signedness gets in the way there) textual data
unsigned char = byte
signed char = rarely used
And there is code that relies on such a distinction. Just a week or two ago I encountered a bug where JPEG data was getting corrupted because it was being passed to the char* version of our Base64 encode function — which "helpfully" replaced all the invalid UTF-8 in the "string". Changing to BYTE aka unsigned char was all it took to fix it.

It depends.
If the buffer is intended to hold text, then it probably makes sense to declare it as an array of char and let the platform decide for you whether that is signed or unsigned by default. That will give you the least trouble passing the data in and out of the implementation's runtime library, for example.
If the buffer is intended to hold binary data, then it depends on how you intend to use it. For example, if the binary data is really a packed array of data samples that are signed 8-bit fixed point ADC measurements, then signed char would be best.
In most real-world cases, the buffer is just that, a buffer, and you don't really care about the types of the individual bytes because you filled the buffer in a bulk operation, and you are about to pass it off to a parser to interpret the complex data structure and do something useful. In that case, declare it in the simplest way.
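As an illustration of the signed-sample case (a sketch with made-up sample values, not from the answer above): the same storage gives different numeric readings depending on whether you view it as signed samples or as raw unsigned bytes.

#include <cstddef>
#include <cstdio>

int main() {
    // Hypothetical packed 8-bit signed ADC samples (two's complement).
    const signed char samples[] = { 16, 127, -128, -1 };

    for (signed char s : samples)
        std::printf("%d ", s);          // prints: 16 127 -128 -1
    std::printf("\n");

    // Viewing the same storage as unsigned bytes changes the reading:
    const unsigned char* raw = reinterpret_cast<const unsigned char*>(samples);
    for (std::size_t i = 0; i < sizeof(samples); ++i)
        std::printf("%d ", raw[i]);     // prints: 16 127 128 255
    std::printf("\n");
    return 0;
}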

If it actually is a buffer of 8 bit bytes, rather than a string in the machine's default locale, then I'd use uint8_t. Not that there are many machines around where a char is not a byte (or a byte an octet), but making the statement 'this is a buffer of octets' rather than 'this is a string' is often useful documentation.

You should use either char or unsigned char but never signed char. The standard has the following in 3.9/2
For any object (other than a base-class subobject) of POD type T, whether or not the object holds a valid value of type T, the underlying bytes (1.7) making up the object can be copied into an array of char or unsigned char. If the content of the array of char or unsigned char is copied back into the object, the object shall subsequently hold its original value.
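A minimal sketch of what that guarantee permits (an illustration, not part of the quote): copy a trivially copyable object's bytes into an unsigned char array and back, and the value is preserved.

#include <cstdio>
#include <cstring>

struct Point { int x; int y; };   // a POD / trivially copyable type

int main() {
    Point p{ 3, 4 };

    unsigned char bytes[sizeof(Point)];
    std::memcpy(bytes, &p, sizeof p);   // copy the underlying bytes out

    Point q{};
    std::memcpy(&q, bytes, sizeof q);   // copy them back into another object

    std::printf("%d %d\n", q.x, q.y);   // guaranteed to print "3 4"
    return 0;
}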

It is better to define it as unsigned char. In fact, the Win32 type BYTE is defined as unsigned char. There is no difference between C and C++ in this respect.

For maximum portability always use unsigned char. There are a couple of instances where this could come into play. Serialized data shared across systems with different endianness immediately comes to mind. Performing shifts or bit masking on the values is another.
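For example, a common way to serialize a 32-bit value in a fixed byte order regardless of the host's endianness is to shift and mask into an unsigned char buffer. A minimal sketch (the helper names put_be32/get_be32 and the test value are made up):

#include <cinttypes>
#include <cstdint>
#include <cstdio>

// Write v into buf in big-endian order, independent of host endianness.
void put_be32(unsigned char* buf, std::uint32_t v) {
    buf[0] = static_cast<unsigned char>((v >> 24) & 0xFF);
    buf[1] = static_cast<unsigned char>((v >> 16) & 0xFF);
    buf[2] = static_cast<unsigned char>((v >> 8) & 0xFF);
    buf[3] = static_cast<unsigned char>(v & 0xFF);
}

// Read the value back; converting each byte to uint32_t before shifting
// avoids shifting into the sign bit of a promoted int.
std::uint32_t get_be32(const unsigned char* buf) {
    return (std::uint32_t(buf[0]) << 24) |
           (std::uint32_t(buf[1]) << 16) |
           (std::uint32_t(buf[2]) << 8)  |
            std::uint32_t(buf[3]);
}

int main() {
    unsigned char buf[4];
    put_be32(buf, 0xDEADBEEF);
    std::printf("%08" PRIX32 "\n", get_be32(buf));   // prints DEADBEEF on any host
    return 0;
}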

The choice of int8_t vs uint8_t is similar to comparing a pointer to NULL.
From a functionality point of view, comparing to NULL is the same as comparing to 0 because NULL is a #define for 0.
But personally, from a coding style point of view, I choose to compare my pointers to NULL because the NULL #define connotes to the person maintaining the code that you are checking for a bad pointer...
VS
when someone sees a comparison to 0 it connotes that you are checking for a specific value.
For the above reason, I would use uint8_t.

If you fetch an element into a wider variable, it will of course be sign-extended or not, depending on whether the element type is signed or unsigned.
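For instance (an illustrative sketch, not from the answer): the same bit pattern fetched into an int gives different values depending on the element type.

#include <cstdio>

int main() {
    signed char   sbuf[] = { -1 };     // bit pattern 0xFF
    unsigned char ubuf[] = { 0xFF };   // same bit pattern

    int a = sbuf[0];   // sign-extended: a == -1
    int b = ubuf[0];   // zero-extended: b == 255

    std::printf("%d %d\n", a, b);      // prints "-1 255"
    return 0;
}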

Should and should ... I tend to prefer unsigned, since it feels more "raw", less inviting to say "hey, that's just a bunch of small ints", if I want to emphasize the binary-ness of the data.
I don't think I've ever used an explicit signed char to represent a buffer of bytes.
Of course, a third option is to represent the buffer as void * as much as possible. Many common I/O functions work with void *, so sometimes the decision of which integer type to use can be fully encapsulated, which is nice.

Several years ago I had a problem with a C++ console application that printed colored chars for ASCII values above 128, and it was solved by switching from char to unsigned char, though I think it would have been solvable while keeping the char type, too.
For now, most C/C++ functions use char, and I understand both languages much better now, so I use char in most cases.

Do you really care? If you don't, just use the default (char) and don't clutter your code with unimportant details. Otherwise, future maintainers will be left wondering why you used signed (or unsigned). Make their life simpler.

If you lie to the compiler, it will punish you.
If the buffer contains data that is just passing through, and you will not manipulate them in any way, it doesn't matter.
However, if you have to operate on the buffer contents then the correct type declaration will make your code simpler. No "int val = buf[i] & 0xff;" nonsense.
So, think about what the data actually is and how you need to use it.

typedef char byte;
Now you can make your array be of bytes. It's obvious to everyone what you meant, and you don't lose any functionality.
I know it's somewhat silly, but it makes your code read 100% as you intended.

Related

Using std::string as a generic uint8_t buffer

I am looking through the source code of Chromium to study how they implemented MediaRecorder API that encodes/records raw mic input stream to a particular format.
I came across interesting codes from their source. In short:
bool DoEncode(float* data_in, std::string* data_out) {
  ...
  data_out->resize(MAX_DATA_BYTES_OR_SOMETHING);
  opus_encode_float(
      data_in,
      reinterpret_cast<uint8_t*>(base::data(*data_out))
  );
  ...
}
So DoEncode (C++ method) here accepts an array of float and converts it to an encoded byte stream, and the actual operation is done in opus_encode_float() (which is a pure C function).
The interesting part is that the Google Chromium team used std::string for a byte array instead of std::vector<uint8_t>, and they even manually cast it to a uint8_t buffer.
Why would the guys from Google Chromium team do like this, and is there a scenario that using std::string is more useful for a generic bytes buffer than using others like std::vector<uint8_t>?
The Chromium coding style (see below) forbids using unsigned integral types without a good reason, and an external API is not such a reason. The sizes of signed and unsigned char are both 1, so why not.
I looked at opus encoder API and it seems the earlier versions used signed char:
[out] data char*: Output payload (at least max_data_bytes long)
Although the API uses unsigned char now, the description still refers to signed char. So std::string of chars was more convenient for the earlier API, and the Chromium team didn't change the already-used container after the API was updated; they used a cast in one line instead of updating tens of other lines.
Integer Types
You should not use the unsigned integer types such as uint32_t, unless there is a valid reason such as representing a bit pattern rather than a number, or you need defined overflow modulo 2^N. In particular, do not use unsigned types to say a number will never be negative. Instead, use assertions for this.
If your code is a container that returns a size, be sure to use a type that will accommodate any possible usage of your container. When in doubt, use a larger type rather than a smaller type.
Use care when converting integer types. Integer conversions and promotions can cause undefined behavior, leading to security bugs and other problems.
On Unsigned Integers
Unsigned integers are good for representing bitfields and modular arithmetic. Because of historical accident, the C++ standard also uses unsigned integers to represent the size of containers - many members of the standards body believe this to be a mistake, but it is effectively impossible to fix at this point. The fact that unsigned arithmetic doesn't model the behavior of a simple integer, but is instead defined by the standard to model modular arithmetic (wrapping around on overflow/underflow), means that a significant class of bugs cannot be diagnosed by the compiler. In other cases, the defined behavior impedes optimization.
That said, mixing signedness of integer types is responsible for an equally large class of problems. The best advice we can provide: try to use iterators and containers rather than pointers and sizes, try not to mix signedness, and try to avoid unsigned types (except for representing bitfields or modular arithmetic). Do not use an unsigned type merely to assert that a variable is non-negative.
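The modular-arithmetic pitfall the guide describes looks like this in practice (an illustrative sketch, not part of the style guide):

#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> v;   // empty on purpose

    // Unsigned wraparound bug: v.size() is unsigned, so with an empty vector
    // v.size() - 1 wraps around to a huge value and this loop would read far
    // out of bounds instead of running zero times:
    //   for (std::size_t i = 0; i < v.size() - 1; ++i) { /* use v[i] */ }

    // Keeping the arithmetic off the unsigned side avoids the wraparound:
    for (std::size_t i = 0; i + 1 < v.size(); ++i)
        std::printf("%d\n", v[i]);   // never entered for an empty vector

    return 0;
}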
We can only theorize.
My speculation: they wanted to use the built-in SSO (small string optimization) that exists in std::string but might not be available for std::vector<uint8_t>.

Is it better to use char or unsigned char array for storing raw data?

When I need to buffer some raw data in memory, for example from a stream, is it better to use an array of char or of unsigned char? I have always used char, but people at work say unsigned char is better, and I don't know why.
UPDATE: C++17 introduced std::byte, which is more suited to "raw" data buffers than using any manner of char.
For earlier C++ versions:
unsigned char emphasises that the data is not "just" text
if you've got what's effectively "byte" data from e.g. a compressed stream, a database table backup file, an executable image, a jpeg... then unsigned is appropriate for the binary-data connotation mentioned above
unsigned works better for some of the operations you might want to do on binary data, e.g. there are undefined and implementation defined behaviours for some bit operations on signed types, and unsigned values can be used directly as indices in arrays
you can't accidentally pass an unsigned char* to a function expecting char* and have it operated on as presumed text
in these situations it's usually more natural to think of the values as being in the range 0..255, after all - why should the "sign" bit have a different kind of significance to the other bits in the data?
if you're storing "raw data" that, at an application logic/design level, happens to be 8-bit numeric data, then by all means choose either unsigned or explicitly signed char as appropriate to your needs
As far as the structure of the buffer is concerned, there is no difference: in both cases you get an element size of one byte, mandated by the standard.
Perhaps the most important difference that you get is the behavior that you see when accessing the individual elements of the buffer, for example, for printing. With char you get implementation-defined signed or unsigned behavior; with unsigned char you always see unsigned behavior. This becomes important if you want to print the individual bytes of your "raw data" buffer.
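For example (an illustrative sketch with made-up byte values): when printing the bytes of a buffer in hex, a plain char element may be sign-extended to a negative int first, so it is usually converted to unsigned char before formatting.

#include <cstdio>

int main() {
    char raw[3];
    raw[0] = 0x12;
    raw[1] = static_cast<char>(0xAB);   // a high-bit byte: negative if char is signed
    raw[2] = 0x7F;

    for (char c : raw) {
        unsigned b = static_cast<unsigned char>(c);   // strip any sign extension first
        std::printf("%02x ", b);                      // prints: 12 ab 7f
    }
    std::printf("\n");
    // Printing c directly with %x would show "ffffffab" for the second byte
    // on a signed-char platform, because c is sign-extended to int first.
    return 0;
}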
Another good alternative for buffers is the exact-width integer type uint8_t. It is guaranteed to have the same width as unsigned char, its name requires less typing, and it tells the reader that you do not intend to use the individual elements of the buffer as character-based information.
Internally, it is exactly the same: Each element is a byte. The difference is given when you operate with those values.
If your value range is [0,255] you should use unsigned char, but if it is [-128,127] then you should use signed char.
Suppose you use the first range (unsigned char): then you can perform the operation 100+100. With signed char, that operation would overflow and give you an unexpected value.
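As a concrete sketch of that point (not part of the original answer):

#include <cstdio>

int main() {
    unsigned char a = 100, b = 100;
    signed char   c = 100, d = 100;

    unsigned char sum_u = static_cast<unsigned char>(a + b);   // 200 fits in 0..255
    signed char   sum_s = static_cast<signed char>(c + d);     // 200 does not fit in -128..127

    std::printf("%d %d\n", sum_u, sum_s);   // typically prints "200 -56" on an 8-bit two's-complement char
    return 0;
}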
Depending on your compiler or machine type, char may be unsigned or signed by default:
Is char signed or unsigned by default?
Thus char has one of the ranges described for the cases above.
If you are using this buffer just to store binary data without operating with it, there is no difference between using char or unsigned char.
EDIT
Note that you can even change the default signedness of char for the same machine and compiler using compiler flags:
-funsigned-char
Let the type char be unsigned, like unsigned char.
Each kind of machine has a default for what char should be. It is either like unsigned char by default or like signed char by default.
Ideally, a portable program should always use signed char or unsigned char when it depends on the signedness of an object. But many programs have been written to use plain char and expect it to be signed, or expect it to be unsigned, depending on the machines they were written for. This option, and its inverse, let you make such a program work with the opposite default.
The type char is always a distinct type from each of signed char or unsigned char, even though its behavior is always just like one of those two.
As @Pablo said in his answer, the key reason is that if you're doing arithmetic on the bytes, you'll get the 'right' answers if you declare the bytes as unsigned char: you want (in Pablo's example) 100 + 100 to add to 200; if you do that sum with signed char (which you might do by accident if char on your compiler is signed) there's no guarantee of that, and you're asking for trouble.
Another important reason is that it can help document your code, if you're explicit about what datatypes are what. It's useful to declare
typedef unsigned char byte;
or even better
#include <stdint.h>
typedef uint8_t byte;
Using byte thereafter makes it that little bit clearer what your program's intent is. Depending on how paranoid your compiler is (-Wall is your friend), this might even cause a type warning if you pass a byte* argument to a function expecting char*, thus prompting you to think slightly more carefully about whether you're doing the right thing.
A 'character' is fundamentally a pretty different thing from a 'byte'. C happens to blur the distinction (because at C's level, in a mostly ASCII world, the distinction doesn't matter in many cases). This blurring isn't always helpful, but it's at least good intellectual hygiene to keep the difference clear in your head.
It is usually better to use char but it makes so little difference it does not matter. It's raw data so you should be simply passing it around as such rather than trying to work with it via char pointers of one type or another. Since char is the native data type it makes most sense to use this rather than imagining you are forcing your data into one type or another.
If you use signed char then its range becomes -128 to +127, which only covers the valid ASCII characters (0 to 127) on the non-negative side; unsigned char gives you the full 0 to 255 range.
You can find the complete difference between char and unsigned char in this question:
diff bet char and unsigned char
and you can see the ASCII table here:
ASCII table
complete tables of raw characters
If you are able to work with C++17 there is a std::byte type that is more appropriate for working with raw data. It only has bitwise logic operators defined for it.
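A short sketch of std::byte usage (C++17; the values are arbitrary): it supports bitwise operators but no arithmetic, and needs std::to_integer (or a cast) to get an ordinary number back.

#include <cstddef>
#include <cstdio>

int main() {
    std::byte b{0x5A};

    b = b & std::byte{0x0F};   // bitwise ops are defined
    b = b << 1;                // shifting works too
    // b = b + std::byte{1};   // error: no arithmetic on std::byte

    std::printf("%d\n", std::to_integer<int>(b));   // prints 20 (0x0A << 1)
    return 0;
}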

int8_t vs char ; Which is the best one?

I know both are different types (signed char and char), but my company's coding guidelines specify using int8_t instead of char.
So I want to know why I have to use int8_t instead of the char type. Are there any best practices for using int8_t?
The use of int8_t is perfectly good for some circumstances, specifically when the type is used for calculations where a signed 8-bit value is required: calculations involving strictly sized data, e.g. data defined by external requirements to be exactly 8 bits in the result. (I used pixel colour levels in a comment above, but that really would be uint8_t, as negative pixel colours usually don't exist, except perhaps in a YUV-type colourspace.)
The type int8_t should NOT be used as a replacement for char in strings. This can lead to compiler errors (or warnings, but we don't really want to have to deal with warnings from the compiler either). For example:
int8_t *x = "Hello, World!\n";
printf(x);
may well compile fine on compiler A, but give errors or warnings on compiler B for mixing signed and unsigned char values, or because int8_t isn't even a char type on that compiler. That's just like expecting
int *ptr = "Foo";
to compile in a modern compiler...
In other words, int8_t SHOULD be used instead of char if you are using 8-bit data for calculation. It is incorrect to wholesale-replace all char with int8_t, as they are far from guaranteed to be the same.
If there is a need to use char for string/text/etc., and for some reason char is too vague (it can be signed or unsigned, etc.), then using typedef char mychar; or something like that should be used. (It's probably possible to find a better name than mychar!)
Edit: I should point out that whether you agree with this or not, I think it would be rather foolish to simply walk up to whoever is in charge of this "principle" at the company, point at a post on SO and say "I think you're wrong". Try to understand what the motivation is. There may be more to it than meets the eye.
They simply make different guarantees:
char is guaranteed to exist, to be at least 8 bits wide, and to be able to represent either all integers between -127 and 127 inclusive (if signed) or between 0 and 255 (if unsigned).
int8_t is not guaranteed to exist (and yes, there are platforms on which it doesn't), but if it exists it is guaranteed to be an 8-bit two's-complement signed integer type with no padding bits; thus it is capable of representing all integers between -128 and 127, and nothing else.
When should you use which? When the guarantees made by the type line up with your requirements. It is worth noting, however, that large portions of the standard library require char * arguments, so avoiding char entirely seems short-sighted unless there’s a deliberate decision being made to avoid usage of those library functions.
int8_t is only appropriate for code that requires a signed integer type that is exactly 8 bits wide and should not compile if there is no such type. Such requirements are far rarer than the number of questions about int8_t and its brethren indicates. Most size requirements are that the type have at least a particular number of bits. signed char works just fine if you need at least 8 bits; int_least8_t also works.
int8_t is specified by the C99 standard to be exactly eight bits wide, and fits in with the other C99 guaranteed-width types. You should use it in new code where you want an exactly 8-bit signed integer. (Take a look at int_least8_t and int_fast8_t too, though.)
char is still preferred as the element type for single-byte character strings, just as wchar_t should be preferred as the element type for wide character strings.

What is practiced in C++ to use for byte manipulation uint8 or char?

I am coming from Java to C++ and I need something similar to byte[] from Java. I can use std::vector<> for easy array-like manipulation, but I need an answer: what is common practice in C++ for byte manipulation, uint8 or char? (I do a lot of packing of bigger integers into arrays with & 0xff and >>, so it needs to be quick.)
Assuming that uint8 is an 8 bit unsigned integer type, the main difference on a "normal" C++ implementation is that char is not necessarily unsigned.
On "not normal" C++ implementations, there could be more significant differences -- char might not be 8 bits. But then, what would you define uint8 to be on such an implementation anyway?
Whether the sign difference matters or not depends on how you're using it, but as a rule of thumb it's best to use unsigned types with bitwise operators. That said, both get promoted to int in a bitwise & anyway (again, on a "normal" C++ implementation), so for & it really doesn't matter and doesn't cause surprises in practice. But using << on a negative signed value results in undefined behavior, so avoid that.
So, use an unsigned type. If the most convenient way for you to write that is uint8, and you know that your code deals in octets and will only run on systems where char is an octet, then you may as well use it.
If you want to use a standard type, use unsigned char. Or uint8_t in order to deliberately prevent your code compiling on "not normal" implementations where char is not an octet.
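A quick sketch of the byte[]-style usage the question describes (the value 0x12345678 is arbitrary): std::vector<unsigned char> as the buffer, packing a larger integer with shifts and masks.

#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    std::vector<unsigned char> buf;   // the C++ counterpart of Java's byte[]

    std::uint32_t value = 0x12345678;
    for (int shift = 24; shift >= 0; shift -= 8)
        buf.push_back(static_cast<unsigned char>((value >> shift) & 0xFF));

    for (unsigned char b : buf)
        std::printf("%02x ", static_cast<unsigned>(b));   // prints: 12 34 56 78
    std::printf("\n");
    return 0;
}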
The C++ built-in character types are char, signed char, and unsigned char, and they are all distinct types. uint8 is probably a typedef synonym for unsigned char.

How do all the different size types relate to each other?

Currently I have a scenario where I want to check whether writing a given string to a filestream will grow the file beyond a given size (this is used for logfile rotation). Now, std::ofstream::tellp() returns a streampos, but std::string::size() returns a size_t. The effect is that this does not work:
out_stream.tellp() + string.size() < limit
because apparently there is an ambiguous overload of operator + for these types. This leads me to two questions:
How can I resolve the above ambiguity?
How do all the different types (size_t, streamsize, streampos, streamoff) relate to each other? When can they be safely converted, and what are the possible pitfalls? I am generally confused about these types. All I know is that they are implementation dependent and that they make certain guarantees (e.g. size_t is always large enough to hold the size of the largest object that would fit into memory on the architecture for which the application was compiled), but what are the guarantees concerning interoperability of these types (see the example above, or comparing a streamsize to a size_t)?
You should be able to convert the result from tellp to a std::string::size_type by casting.
static_cast<std::string::size_type>(out_stream.tellp()) + string.size() < limit
EDIT: This is safe because your stream offset will never be negative and will safely convert to an unsigned value.
The real question is: what is the type of limit? The usual way of testing whether there is still room is:
limit - out_stream.tellp() >= string.size()
But you have to ensure that limit has a type from which out_stream.tellp() can be subtracted.
In theory, there's no guarantee that streampos is convertible or comparable to an integral type, nor that, converted to an integral type, it gives significant information. Nor that it supports subtraction, or comparison, for that matter. In practice, I don't think you have to worry too much about the conversion to an integral type existing, and being monotonic (although perhaps on some exotic mainframe...). But you can't be sure that arithmetic with it will work, so I'd probably prefer converting it explicitly to a streamsize (which is guaranteed to be a signed integral type). (Regardless of how you approach the problem, you'll have to deal with the fact that string.size() returns a size_t, which is required to be unsigned, whereas streamsize is required to be signed.)
With regards to your second question:
size_t is a typedef for an unsigned integral type, large enough to specify the size of any possible object;
streamsize is a typedef for a signed integral type, large enough to specify the size of an "object" in a stream;
streamoff is a typedef for an integral type capable of specifying the position of a byte in a file; and
streampos is a typedef for fpos<something>, where something is a type which can be used to maintain the state in the case of a multibyte stream.
The standard makes very few requirements concerning the relationships between them (and some of the few it makes are mathematically impossible to realize), so you're pretty much on your own.
I believe the standard says that streamsize is implementation-specific, so no help there. For a practical answer, you can check the headers where these are typedefed.
Considering that size_t might be 4 bytes while your application could conceivably operate on a stream of more than 4GB length, I believe that you should cast to a known-good-size type when interoperating, to get an airtight solution.
Of course, if you know (maybe with a compile-time assertion) that size_t or streamsize is 8 bytes long, you can use that type directly. If you have a stream whose length doesn't fit in 8 bytes, you have more serious problems than casting to the right type.
If you have big sizes, isn't unsigned long long the best you can get? If that isn't big enough, what else is?