Using std::string as a generic uint8_t buffer

Using std::string as a generic uint8_t buffer - c++

I am looking through the source code of Chromium to study how they implemented MediaRecorder API that encodes/records raw mic input stream to a particular format.
I came across interesting codes from their source. In short:
bool DoEncode(float* data_in, std::string* data_out) {
...
data_out->resize(MAX_DATA_BTYES_OR_SOMETHING);
opus_encode_float(
data_in,
reinterpret_cast<uint8_t*>(base::data(*data_out))
);
...
}
So DoEncode (C++ method) here accepts an array of float and converts it to an encoded byte stream, and the actual operation is done in opus_encode_float() (which is a pure C function).
The interesting part is the Google Chromium team used std::string for an byte array instead of std::vector<uint_8> and they even manually cast to a uint8_t buffer.
Why would the guys from Google Chromium team do like this, and is there a scenario that using std::string is more useful for a generic bytes buffer than using others like std::vector<uint8_t>?

The Chromium coding style (see below) forbids using unsigned integral types without good reason. External API is not such reason. Sizes of signed and unsigned chars are 1, so why not.
I looked at opus encoder API and it seems the earlier versions used signed char:
[out] data char*: Output payload (at least max_data_bytes long)
Although API uses unsigned chars now, the description still refers to signed char. So std::string for chars was more convenient for the earlier API and Chromium team didn't change the already used container after API was updated, they used cast in one line instead of updating tens other lines.
Integer Types
You should not use the unsigned integer types such as uint32_t, unless there is a valid reason such as representing a bit pattern rather than a number, or you need defined overflow modulo 2^N. In particular, do not use unsigned types to say a number will never be negative. Instead, use assertions for this.
If your code is a container that returns a size, be sure to use a type that will accommodate any possible usage of your container. When in doubt, use a larger type rather than a smaller type.
Use care when converting integer types. Integer conversions and promotions can cause undefined behavior, leading to security bugs and other problems.
On Unsigned Integers
Unsigned integers are good for representing bitfields and modular arithmetic. Because of historical accident, the C++ standard also uses unsigned integers to represent the size of containers - many members of the standards body believe this to be a mistake, but it is effectively impossible to fix at this point. The fact that unsigned arithmetic doesn't model the behavior of a simple integer, but is instead defined by the standard to model modular arithmetic (wrapping around on overflow/underflow), means that a significant class of bugs cannot be diagnosed by the compiler. In other cases, the defined behavior impedes optimization.
That said, mixing signedness of integer types is responsible for an equally large class of problems. The best advice we can provide: try to use iterators and containers rather than pointers and sizes, try not to mix signedness, and try to avoid unsigned types (except for representing bitfields or modular arithmetic). Do not use an unsigned type merely to assert that a variable is non-negative.

We can only theorize.
My speculation: they wanted to use the built-in SSO optimization that exists in std::string but might not be available for std::vector<uint8_t>.

Related

Why isn't there an endianness modifier in C++ like there is for signedness?

(I guess this question could apply to many typed languages, but I chose to use C++ as an example.)
Why is there no way to just write:
struct foo {
little int x; // little-endian
big long int y; // big-endian
short z; // native endianness
};
to specify the endianness for specific members, variables and parameters?
Comparison to signedness
I understand that the type of a variable not only determines how many bytes are used to store a value but also how those bytes are interpreted when performing computations.
For example, these two declarations each allocate one byte, and for both bytes, every possible 8-bit sequence is a valid value:
signed char s;
unsigned char u;
but the same binary sequence might be interpreted differently, e.g. 11111111 would mean -1 when assigned to s but 255 when assigned to u. When signed and unsigned variables are involved in the same computation, the compiler (mostly) takes care of proper conversions.
In my understanding, endianness is just a variation of the same principle: a different interpretation of a binary pattern based on compile-time information about the memory in which it will be stored.
It seems obvious to have that feature in a typed language that allows low-level programming. However, this is not a part of C, C++ or any other language I know, and I did not find any discussion about this online.
Update
I'll try to summarize some takeaways from the many comments that I got in the first hour after asking:
signedness is strictly binary (either signed or unsigned) and will always be, in contrast to endianness, which also has two well-known variants (big and little), but also lesser-known variants such as mixed/middle endian. New variants might be invented in the future.
endianness matters when accessing multiple-byte values byte-wise. There are many aspects beyond just endianness that affect the memory layout of multi-byte structures, so this kind of access is mostly discouraged.
C++ aims to target an abstract machine and minimize the number of assumptions about the implementation. This abstract machine does not have any endianness.
Also, now I realize that signedness and endianness are not a perfect analogy, because:
endianness only defines how something is represented as a binary sequence, but now what can be represented. Both big int and little int would have the exact same value range.
signedness defines how bits and actual values map to each other, but also affects what can be represented, e.g. -3 can't be represented by an unsigned char and (assuming that char has 8 bits) 130 can't be represented by a signed char.
So that changing the endianness of some variables would never change the behavior of the program (except for byte-wise access), whereas a change of signedness usually would.

What the standard says
[intro.abstract]/1:
The semantic descriptions in this document define a parameterized nondeterministic abstract machine.
This document places no requirement on the structure of conforming implementations.
In particular, they need not copy or emulate the structure of the abstract machine.
Rather, conforming implementations are required to emulate (only) the observable behavior of the abstract machine as explained below.
C++ could not define an endianness qualifier since it has no concept of endianness.
Discussion
About the difference between signness and endianness, OP wrote
In my understanding, endianness is just a variation of the same principle [(signness)]: a different interpretation of a binary pattern based on compile-time information about the memory in which it will be stored.
I'd argue signness both have a semantic and a representative aspect1. What [intro.abstract]/1 implies is that C++ only care about semantic, and never addresses the way a signed number should be represented in memory2. Actually, "sign bit" only appears once in the C++ specs and refer to an implementation-defined value.
On the other hand, endianness only have a representative aspect: endianness conveys no meaning.
With C++20, std::endian appears. It is still implementation-defined, but let us test the endian of the host without depending on old tricks based on undefined behaviour.
1) Semantic aspect: an signed integer can represent values below zero; representative aspect: one need to, for example, reserve a bit to convey the positive/negative sign.
2) In the same vein, C++ never describe how a floating point number should be represented, IEEE-754 is often used, but this is a choice made by the implementation, in any case enforced by the standard: [basic.fundamental]/8 "The value representation of floating-point types is implementation-defined".

In addition to YSC's answer, let's take your sample code, and consider what it might aim to achieve
struct foo {
little int x; // little-endian
big long int y; // big-endian
short z; // native endianness
};
You might hope that this would exactly specify layout for architecture-independent data interchange (file, network, whatever)
But this can't possibly work, because several things are still unspecified:
data type size: you'd have to use little int32_t, big int64_t and int16_t respectively, if that's what you want
padding and alignment, which cannot be controlled strictly within the language: use #pragma or __attribute__((packed)) or some other compiler-specific extension
actual format (1s- or 2s-complement signedness, floating-point type layout, trap representations)
Alternatively, you might simply want to reflect the endianness of some specified hardware - but big and little don't cover all the possibilities here (just the two most common).
So, the proposal is incomplete (it doesn't distinguish all reasonable byte-ordering arrangements), ineffective (it doesn't achieve what it sets out to), and has additional drawbacks:
Performance
Changing the endianness of a variable from the native byte ordering should either disable arithmetic, comparisons etc (since the hardware cannot correctly perform them on this type), or must silently inject more code, creating natively-ordered temporaries to work on.
The argument here isn't that manually converting to/from native byte order is faster, it's that controlling it explicitly makes it easier to minimise the number of unnecessary conversions, and much easier to reason about how code will behave, than if the conversions are implicit.
Complexity
Everything overloaded or specialized for integer types now needs twice as many versions, to cope with the rare event that it gets passed a non-native-endianness value. Even if that's just a forwarding wrapper (with a couple of casts to translate to/from native ordering), it's still a lot of code for no discernible benefit.
The final argument against changing the language to support this is that you can easily do it in code. Changing the language syntax is a big deal, and doesn't offer any obvious benefit over something like a type wrapper:
// store T with reversed byte order
template <typename T>
class Reversed {
T val_;
static T reverse(T); // platform-specific implementation
public:
explicit Reversed(T t) : val_(reverse(t)) {}
Reversed(Reversed const &other) : val_(other.val_) {}
// assignment, move, arithmetic, comparison etc. etc.
operator T () const { return reverse(val_); }
};

Integers (as a mathematical concept) have the concept of positive and negative numbers. This abstract concept of sign has a number of different implementations in hardware.
Endianness is not a mathematical concept. Little-endian is a hardware implementation trick to improve the performance of multi-byte twos-complement integer arithmetic on a microprocessor with 16 or 32 bit registers and an 8-bit memory bus. Its creation required using the term big-endian to describe everything else that had the same byte-order in registers and in memory.
The C abstract machine includes the concept of signed and unsigned integers, without details -- without requiring twos-complement arithmetic, 8-bit bytes or how to store a binary number in memory.
PS: I agree that binary data compatibility on the net or in memory/storage is a PIA.

That's a good question and I have often thought something like this would be useful. However you need to remember that C aims for platform independence and endianness is only important when a structure like this is converted into some underlying memory layout. This conversion can happen when you cast a uint8_t buffer into an int for example. While an endianness modifier looks neat the programmer still needs to consider other platform differences such as int sizes and structure alignment and packing.
For defensive programming when you want find grain control over how some variables or structures are represented in a memory buffer then it is best to code explicit conversion functions and then let the compiler optimiser generate the most efficient code for each supported platform.

Endianness is not inherently a part of a data type but rather of its storage layout.
As such, it would not be really akin to signed/unsigned but rather more like bit field widths in structs. Similar to those, they could be used for defining binary APIs.
So you'd have something like
int ip : big 32;
which would define both storage layout and integer size, leaving it to the compiler to do the best job of matching use of the field to its access. It's not obvious to me what the allowed declarations should be.

Short Answer: if it should not be possible to use objects in arithmetic expressions (with no overloaded operators) involving ints, then these objects should not be integer types. And there is no point in allowing addition and multiplication of big-endian and little-endian ints in the same expression.
Longer Answer:
As someone mentioned, endianness is processor-specific. Which really means that this is how numbers are represented when they are used as numbers in the machine language (as addresses and as operands/results of arithmetic operations).
The same is "sort of" true of signage. But not to the same degree. Conversion from language-semantic signage to processor-accepted signage is something that needs to be done to use numbers as numbers. Conversion from big-endian to little-endian and reverse is something that needs to be done to use numbers as data (send them over the network or represent metadata about data sent over the network such as payload lengths).
Having said that, this decision appears to be mostly driven by use cases. The flip side is that there is a good pragmatic reason to ignore certain use cases. The pragmatism arises out of the fact that endianness conversion is more expensive than most arithmetic operations.
If a language had semantics for keeping numbers as little-endian, it would allow developers to shoot themselves in the foot by forcing little-endianness of numbers in a program which does a lot of arithmetic. If developed on a little-endian machine, this enforcing of endianness would be a no-op. But when ported to a big-endian machine, there would a lot of unexpected slowdowns. And if the variables in question were used both for arithmetic and as network data, it would make the code completely non-portable.
Not having these endian semantics or forcing them to be explicitly compiler-specific forces the developers to go through the mental step of thinking of the numbers as being "read" or "written" to/from the network format. This would make the code which converts back and forth between network and host byte order, in the middle of arithmetic operations, cumbersome and less likely to be the preferred way of writing by a lazy developer.
And since development is a human endeavor, making bad choices uncomfortable is a Good Thing(TM).
Edit: here's an example of how this can go badly:
Assume that little_endian_int32 and big_endian_int32 types are introduced. Then little_endian_int32(7) % big_endian_int32(5) is a constant expression. What is its result? Do the numbers get implicitly converted to the native format? If not, what is the type of the result? Worse yet, what is the value of the result (which in this case should probably be the same on every machine)?
Again, if multi-byte numbers are used as plain data, then char arrays are just as good. Even if they are "ports" (which are really lookup values into tables or their hashes), they are just sequences of bytes rather than integer types (on which one can do arithmetic).
Now if you limit the allowed arithmetic operations on explicitly-endian numbers to only those operations allowed for pointer types, then you might have a better case for predictability. Then myPort + 5 actually makes sense even if myPort is declared as something like little_endian_int16 on a big endian machine. Same for lastPortInRange - firstPortInRange + 1. If the arithmetic works as it does for pointer types, then this would do what you'd expect, but firstPort * 10000 would be illegal.
Then, of course, you get into the argument of whether the feature bloat is justified by any possible benefit.

From a pragmatic programmer perspective searching Stack Overflow, it's worth noting that the spirit of this question can be answered with a utility library. Boost has such a library:
http://www.boost.org/doc/libs/1_65_1/libs/endian/doc/index.html
The feature of the library most like the language feature under discussion is a set of arithmetic types such as big_int16_t.

Because nobody has proposed to add it to the standard, and/or because compiler implementer have never felt a need for it.
Maybe you could propose it to the committee. I do not think it is difficult to implement it in a compiler: compilers already propose fundamental types that are not fundamental types for the target machine.
The development of C++ is an affair of all C++ coders.
#Schimmel. Do not listen to people who justify the status quo! All the cited arguments to justify this absence are more than fragile. A student logician could find their inconsistence without knowing anything about computer science. Just propose it, and just don't care about pathological conservatives. (Advise: propose new types rather than a qualifier because the unsigned and signed keywords are considered mistakes).

Endianness is compiler specific as a result of being machine specific, not as a support mechanism for platform independence. The standard -- is an abstraction that has no regard for imposing rules that make things "easy" -- its task is to create similarity between compilers that allows the programmer to create "platform independence" for their code -- if they choose to do so.
Initially, there was a lot of competition between platforms for market share and also -- compilers were most often written as proprietary tools by microprocessor manufacturers and to support operating systems on specific hardware platforms. Intel was likely not very concerned about writing compilers that supported Motorola microprocessors.
C was -- after all -- invented by Bell Labs to rewrite Unix.

Is it better to use char or unsigned char array for storing raw data?

When need to buffer in memory some raw data, for example from a stream, is it better to use an array of char or of unsigned char? I always used char but at work are saying it is better unsigned char and I don't know why.

UPDATE: C++17 introduced std::byte, which is more suited to "raw" data buffers than using any manner of char.
For earlier C++ versions:
unsigned char emphasises that the data is not "just" text
if you've got what's effectively "byte" data from e.g. a compressed stream, a database table backup file, an executable image, a jpeg... then unsigned is appropriate for the binary-data connotation mentioned above
unsigned works better for some of the operations you might want to do on binary data, e.g. there are undefined and implementation defined behaviours for some bit operations on signed types, and unsigned values can be used directly as indices in arrays
you can't accidentally pass an unsigned char* to a function expecting char* and have it operated on as presumed text
in these situations it's usually more natural to think of the values as being in the range 0..255, after all - why should the "sign" bit have a different kind of significance to the other bits in the data?
if you're storing "raw data" that - at an application logic/design level happens to be 8-bit numeric data, then by all means choose either unsigned or explicitly signed char as appropriate to your needs

As far as the structure of the buffer is concerned, there is no difference: in both cases you get an element size of one byte, mandated by the standard.
Perhaps the most important difference that you get is the behavior that you see when accessing the individual elements of the buffer, for example, for printing. With char you get implementation-defined signed or unsigned behavior; with unsigned char you always see unsigned behavior. This becomes important if you want to print the individual bytes of your "raw data" buffer.
Another good alternative for use for buffers is the exact-width integer uint8_t. It is guaranteed to have the same width as unsigned char, its name requires less typing, and it tells the reader that you are not intended to use the individual elements of the buffer as character-based information.

Internally, it is exactly the same: Each element is a byte. The difference is given when you operate with those values.
If your values range is [0,255] you should use unsigned char but if it is [-128,127] then you should use signed char.
Suppose you are use the first range (signed char), then you can perform the operation 100+100. Otherwise that operation will overflow and give you an unexpected value.
Depending on your compiler or machine type, char may be unsigned or signed by default:
Is char signed or unsigned by default?
Thus having char the ranges described for the cases above.
If you are using this buffer just to store binary data without operating with it, there is no difference between using char or unsigned char.
EDIT
Note that you can even change the default char for the same machine and compiler using compiler's flags:
-funsigned-char
Let the type char be unsigned, like unsigned char.
Each kind of machine has a default for what char should be. It is either likeunsigned char by default or like signed char by default.
Ideally, a portable program should always use signed char or unsigned char when it depends on the signedness of an object. But many
programs have been written to use plain char and expect it to be
signed, or expect it to be unsigned, depending on the machines they
were written for. This option, and its inverse, let you make such a
program work with the opposite default.
The type char is always a distinct type from each of signed char or unsigned char, even though its behavior is always just like one of
those two.

As #Pablo said in his answer, the key reason is that if you're doing arithmetic on the bytes, you'll get the 'right' answers if you declare the bytes as unsigned char: you want (in Pablo's example) 100 + 100 to add to 200; if you do that sum with signed char (which you might do by accident if char on your compiler is signed) there's no guarantee of that – you're asking for trouble.
Another important reason is that it can help document your code, if you're explicit about what datatypes are what. It's useful to declare
typedef unsigned char byte
or even better
#include <stdint.h>
typedef uint8_t byte
Using byte thereafter makes it that little bit clearer what your program's intent is. Depending on how paranoid your compiler is (-Wall is your friend), this might even cause a type warning if you give a byte* argument to a char* function argument, thus prompting you to think slightly more carefully about whether you're doing the right thing.
A 'character' is fundamentally a pretty different thing from a 'byte'. C happens to blur the distinction (because at C's level, in a mostly ASCII world, the distinction doesn't matter in many cases). This blurring isn't always helpful, but it's at least good intellectual hygiene to keep the difference clear in your head.

It is usually better to use char but it makes so little difference it does not matter. It's raw data so you should be simply passing it around as such rather than trying to work with it via char pointers of one type or another. Since char is the native data type it makes most sense to use this rather than imagining you are forcing your data into one type or another.

If you use unsigned char then it will take only valid ASCII characters as its range will become -127 to +127.
and you can find complete difference between char and unsigned char details in this question.
diff bet char and unsigned char
and you can see the table here.
ASCII table
complete tables of raw characters

If you are able to work with C++17 there is a std::byte type that is more appropriate for working with raw data. It only has bitwise logic operators defined for it.

How do all the different size types relate to each other?

Currently I have a scenario where I want to check whether writing a given string to a filestream will grow the file beyond a given size (this is used for logfile rotation). Now, std::ofstream::tellp() returns a streampos, but std::string::size() returns a size_t. The effect is, that this does not work:
out_stream.tellp() + string.size() < limit
because apparently there is an ambiguous overload of operator + for these types. This leads me to two questions:
How can I resolve the above ambiguity?
How do all the different types (size_t, streamsize, streampos, streamoff) relate to each other? When can they be safely converted, and what are possible pitfalls. I am generally confused about these types. All I know is that they are implementation dependent, and that they make certain guarantees (e.g. size_t is always large enough to hold the size of the larges object that would fit into memory on the architecture for which the application was compiled), but what are the guarantees concerning interoperability of these types (see example above, or comparing a streamsize to a size_t)?

You should be able to convert the result from tellp to a std::string::size_type by casting.
static_cast<std::string::size_type>(out_stream.tellp()) + string.size() < limit
EDIT: This is safe because your stream offset will never be negative and will safely convert to an unsigned value.

The real question is: what is the type of limit? The usual way of
testing if there is still room is usually:
limit - out_stream.tellp() >= string.size()
But you have to ensure that limit has a type from which
out_stream.tellp() can be subtracted.
In theory, streampos isn't convertable nor comparable to an integral
type, or that, converted to an integral type, it gives significant
information. And it needed support subtraction, or comparison, for that
matter. In practice, I don't think you have to worry too much about the
conversion to an integral type existing, and being monotonic (although
perhaps on some exotic mainframe...). But you can't be sure that
arithmetic with it will work, so I'd probably prefer converting it
explicitly to a streamsize (which is guaranteed to be a signed integral
type). (Regardless of how you approach the problem, you'll have to deal
with the fact that string.size() returns a size_t, which is required to
be unsigned, whereas streamsize is required to be signed.)
With regards to your second question:
size_t is a typedef to an unsigned integral type, large enough to
specify the size of any possible object,
streamsize is a typedef to a signed integral type, large enough to
specify the size of an "object" in a stream,
streamoff is a typedef to an integral type capable of specifying
the position of a byte in a file, and
streampos is a typedef to fpos, where
something is a type which can be used to maintain the state in
the case of a multibyte stream.
The standard makes very few requirements concerning the relationships
between them (and some of the few it makes are mathematically impossible
to realize), so you're pretty much on your own.

I believe the standard says that streamsize is implementation-specific, so no help there. For a practical answer, you can check the headers where these are typedefed.
Considering that size_t might be 4 bytes while your application could conceivably operate on a stream of more than 4GB length, I believe that you should cast to a known-good-size type for interoperating for an airtight solution.
Of course, if you know (maybe with a compile-time assertion) that size_t or streamsize is 8 bytes long, you can use that type directly. If you have a stream whose length doesn't fit in 8 bytes, you have more serious problems than casting to the right type.

If you have big sizes, isn't unsigned long long the best you can get. If that isn't big enough, what else is?

Why is uint_8 etc. used in C/C++?

I've seen some code where they don't use primitive types int, float, double etc. directly.
They usually typedef it and use it or use things like
uint_8 etc.
Is it really necessary even these days? Or is C/C++ standardized enough that it is preferable to use int, float etc directly.

Because the types like char, short, int, long, and so forth, are ambiguous: they depend on the underlying hardware. Back in the days when C was basically considered an assembler language for people in a hurry, this was okay. Now, in order to write programs that are portable -- which means "programs that mean the same thing on any machine" -- people have built special libraries of typedefs and #defines that allow them to make machine-independent definitions.
The secret code is really quite straight-forward. Here, you have uint_8, which is interpreted
u for unsigned
int to say it's treated as a number
_8 for the size in bits.
In other words, this is an unsigned integer with 8 bits (minimum) or what we used to call, in the mists of C history, an "unsigned char".

uint8_t is rather useless, because due to other requirements in the standard, it exists if and only if unsigned char is 8-bit, in which case you could just use unsigned char. The others, however, are extremely useful. int is (and will probably always be) 32-bit on most modern platforms, but on some ancient stuff it's 16-bit, and on a few rare early 64-bit systems, int is 64-bit. It could also of course be various odd sizes on DSPs.
If you want a 32-bit type, use int32_t or uint32_t, and so on. It's a lot cleaner and easier than all the nasty legacy hacks of detecting the sizes of types and trying to use the right one yourself...

Most code I read, and write, uses the fixed-size typedefs only when the size is an important assumption in the code.
For example if you're parsing a binary protocol that has two 32-bit fields, you should use a typedef guaranteed to be 32-bit, if only as documentation.
I'd only use int16 or int64 when the size must be that, say for a binary protocol or to avoid overflow or keep a struct small. Otherwise just use int.
If you're just doing "int i" to use i in a for loop, then I would not write "int32" for that. I would never expect any "typical" (meaning "not weird embedded firmware") C/C++ code to see a 16-bit "int," and the vast majority of C/C++ code out there would implode if faced with 16-bit ints. So if you start to care about "int" being 16 bit, either you're writing code that cares about weird embedded firmware stuff, or you're sort of a language pedant. Just assume "int" is the best int for the platform at hand and don't type extra noise in your code.

The sizes of types in C are not particularly well standardized. 64-bit integers are one example: a 64-bit integer could be long long, __int64, or even int on some systems. To get better portability, C99 introduced the <stdint.h> header, which has types like int32_t to get a signed type that is exactly 32 bits; many programs had their own, similar sets of typedefs before that.

C and C++ purposefully don't define the exact size of an int. This is because of a number of reasons, but that's not important in considering this problem.
Since int isn't set to a standard size, those who want a standard size must do a bit of work to guarantee a certain number of bits. The code that defines uint_8 does that work, and without it (or a technique like it) you wouldn't have a means of defining an unsigned 8 bit number.

The width of primitive types often depends on the system, not just the C++ standard or compiler. If you want true consistency across platforms when you're doing scientific computing, for example, you should use the specific uint_8 or whatever so that the same errors (or precision errors for floats) appear on different machines, so that the memory overhead is the same, etc.

C and C++ don't restrict the exact size of the numeric types, the standards only specify a minimum range of values that has to be represented. This means that int can be larger than you expect.
The reason for this is that often a particular architecture will have a size for which arithmetic works faster than other sizes. Allowing the implementor to use this size for int and not forcing it to use a narrower type may make arithmetic with ints faster.
This isn't going to go away any time soon. Even once servers and desktops are all fully transitioned to 64-bit platforms, mobile and embedded platforms may well be operating with a different integer size. Apart from anything else, you don't know what architectures might be released in the future. If you want your code to be portable, you have to use a fixed-size typedef anywhere that the type size is important to you.

Should a buffer of bytes be signed or unsigned char buffer?

Should a buffer of bytes be signed char or unsigned char or simply a char buffer?
Any differences between C and C++?
Thanks.

If you intend to store arbitrary binary data, you should use unsigned char. It is the only data type that is guaranteed to have no padding bits by the C Standard. Each other data type may contain padding bits in its object representation (that is the one that contains all bits of an object, instead of only those that determines a value). The padding bits' state is unspecified and are not used to store values. So if you read using char some binary data, things would be cut down to the value range of a char (by interpreting only the value bits), but there may still be bits that are just ignored but still are there and read by memcpy. Much like padding bits in real struct objects. Type unsigned char is guaranteed to not contain those. That follows from 5.2.4.2.1/2 (C99 TC2, n1124 here):
If the value of an object of type char is treated as a signed integer when used in an
expression, the value of CHAR_MIN shall be the same as that of SCHAR_MIN and the
value of CHAR_MAX shall be the same as that of SCHAR_MAX. Otherwise, the value of
CHAR_MIN shall be 0 and the value of CHAR_MAX shall be the same as that of
UCHAR_MAX. The value UCHAR_MAX shall equal 2^CHAR_BIT − 1
From the last sentence it follows that there is no space left for any padding bits. If you use char as the type of your buffer, you also have the problem of overflows: Assigning any value explicitly to one such element which is in the range of 8 bits - so you may expect such assignment to be OK - but not within the range of a char, which is CHAR_MIN..CHAR_MAX, such a conversion overflows and causes implementation defined results, including raise of signals.
Even if any problems regarding the above would probably not show in real implementations (would be a very poor quality of implementation), you are best to use the right type from the beginning onwards, which is unsigned char.
For strings, however, the data type of choice is char, which will be understood by string and print functions. Using signed char for these purposes looks like a wrong decision to me.
For further information, read this proposal which contain a fix for a next version of the C Standard which eventually will require signed char not have any padding bits either. It's already incorporated into the working paper.

Should a buffer of bytes be signed
char or unsigned char or simply a char
buffer? Any differences between C and
C++?
A minor difference in how the language treats it. A huge difference in how convention treats it.
char = ASCII (or UTF-8, but the signedness gets in the way there) textual data
unsigned char = byte
signed char = rarely used
And there is code that relies on such a distinction. Just a week or two ago I encountered a bug where JPEG data was getting corrupted because it was being passed to the char* version of our Base64 encode function — which "helpfully" replaced all the invalid UTF-8 in the "string". Changing to BYTE aka unsigned char was all it took to fix it.

It depends.
If the buffer is intended to hold text, then it probably makes sense to declare it as an array of char and let the platform decide for you whether that is signed or unsigned by default. That will give you the least trouble passing the data in and out of the implementation's runtime library, for example.
If the buffer is intended to hold binary data, then it depends on how you intend to use it. For example, if the binary data is really a packed array of data samples that are signed 8-bit fixed point ADC measurements, then signed char would be best.
In most real-world cases, the buffer is just that, a buffer, and you don't really care about the types of the individual bytes because you filled the buffer in a bulk operation, and you are about to pass it off to a parser to interpret the complex data structure and do something useful. In that case, declare it in the simplest way.

If it actually is a buffer of 8 bit bytes, rather than a string in the machine's default locale, then I'd use uint8_t. Not that there are many machines around where a char is not a byte (or a byte a octet), but making the statement 'this is a buffer of octets' rather than 'this is a string' is often useful documentation.

You should use either char or unsigned char but never signed char. The standard has the following in 3.9/2
For any object (other than a
base-class subobject) of POD type T,
whether or not the object holds a
valid value of type T, the underlying
bytes (1.7) making up the object can
be copied into an array of char or
unsigned char.If the content of
the array of char or unsigned char is
copied back into the object, the
object shall subsequently hold its
original value.

It is better to define it as unsigned char. Infact Win32 type BYTE is defined as unsigned char. There is no difference between C & C++ between this.

For maximum portability always use unsigned char. There are a couple of instances where this could come into play. Serialized data shared across systems with different endian type immediately comes to mind. When performing shift or bit masking the values is another.

The choice of int8_t vs uint8_t is similar to when you are comparing a ptr to be NULL.
From a functionality point of view, comparing to NULL is the same as comparing to 0 because NULL is a #define for 0.
But personally, from a coding style point of view, I choose to compare my pointers to NULL because the NULL #define connotes to the person maintaining the code that you are checking for a bad pointer...
VS
when someone sees a comparison to 0 it connotes that you are checking for a specific value.
For the above reason, I would use uint8_t.

If you fetch an element into a wider variable, it will of course be sign-extended or not.

Should and should ... I tend to prefer unsigned, since it feels more "raw", less inviting to say "hey, that's just a bunch of small ints", if I want to emphasize the binary-ness of the data.
I don't think I've ever used an explicit signed char to represent a buffer of bytes.
Of course, one third option is to represent the buffer as void * as much as possible. Many common I/O functions work with void *, so sometimes the decision of what integer type to use can be fully encapsulated, which is nice.

Several years ago I had a problem with a C++ console application that printed colored chars for ASCII values above 128 and this was solved by switching from char to unsigned char, but I think it had been solveable while keeping char type, too.
For now, most C/C++ functions use char and I understand both languages much better now, so I use char in most cases.

Do you really care? If you don't, just use the default (char) and don't clutter your code with unimportant matter. Otherwise, future maintainers will be left wondering why did you use signed (or unsigned). Make their life simpler.

If you lie to the compiler, it will punish you.
If the buffer contains data that is just passing through, and you will not manipulate them in any way, it doesn't matter.
However, if you have to operate on the buffer contents then the correct type declaration will make your code simpler. No "int val = buf[i] & 0xff;" nonsense.
So, think about what the data actually is and how you need to use it.

typedef char byte;
Now you can make your array be of bytes. It's obvious to everyone what you meant, and you don't lose any functionality.
I know it's somewhat silly, but it makes your code read 100% as you intended.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js