Size of byte (clarification) - c++

I'm writing a game server, and this might be an easy question, but I just want some clarification.
Why is it that a byte (char or unsigned char) can hold up to a value of 255 (0xFF, which I believe is 2 bytes)? When I use sizeof(unsigned char) the compiler tells me it is 1 byte.
Is it because (in ASCII) it is getting "converted" to a character?
Sorry for the poor explanation; I'm not very good at describing a question.

This touches on a bunch of subjects, including the historical meaning of a byte, the C definition of a char, and mathematics.
For starters, a byte has historically been a lot of things, but nowadays we nearly always mean an octet, which is 8 bits. As a play on words, there's also the nybble (or often nibble) which is half a byte (not called bite).
Mathematics tells us that with an ordered combination of 8 1-or-0 values, we get 2^8 = 256 combinations. Sometimes we use this unsigned, sometimes signed, but either way we want to have 0 in the range; so the unsigned range is 0..255. For the signed range, we have more options, of which two's complement is the most popular; in that case, we get one more negative value than positive, for a range of -128..+127.
C++ inherits char from C, where it is defined to have a sizeof of 1, to be the smallest addressable size (i.e. having distinct address values with &), and to cover at least -127..127 or 0..255 depending on whether it's signed or not (-128..127 on two's complement machines). That boils down to requiring at least 8 bits; on nearly all current machines a char is exactly 8 bits, i.e. one octet.
0xff is another way of writing 255. 0x is the C way of marking a hexadecimal constant, so each digit in it is 4 bits (for 16 possible digits), ergo the nibble. This translates to an unsigned octet with all bits set to 1.
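A minimal sketch of the hex-digit-to-nibble relationship, assuming an 8-bit byte. The helper names `high_nibble` and `low_nibble` are made up for illustration.

```cpp
#include <climits>

// Upper 4 bits of a byte: the first hex digit of 0xNN.
constexpr unsigned high_nibble(unsigned char byte) {
    return (byte >> 4) & 0x0Fu;
}

// Lower 4 bits of a byte: the second hex digit of 0xNN.
constexpr unsigned low_nibble(unsigned char byte) {
    return byte & 0x0Fu;
}

// 0xFF and 255 are two spellings of the same value, and it fits in one unsigned char.
static_assert(0xFF == 255, "hexadecimal FF is decimal 255");
static_assert(0xFF <= UCHAR_MAX, "0xFF fits in a single unsigned char");
```

Each hex digit covers exactly one nibble, so two digits describe one 8-bit byte: `high_nibble(0xA3)` is 10 (the digit A) and `low_nibble(0xA3)` is 3.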
If specific size matters to your code, there is a header stdint.h that defines types of minimal and exact sizes, for speed or size optimization.
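In C++ the same types are available through `<cstdint>`. The sketch below shows the three families (exact, least, fast). The `PacketHeader` struct is a hypothetical example for a game server, not something from the question, and the exact-width types exist only on platforms that actually have such a type.

```cpp
#include <climits>
#include <cstdint>

// Exact-width types: exactly N bits, available wherever the platform has them.
static_assert(sizeof(std::uint8_t) * CHAR_BIT == 8, "uint8_t is exactly 8 bits");
static_assert(sizeof(std::int32_t) * CHAR_BIT == 32, "int32_t is exactly 32 bits");

// "least" types: the smallest type with at least N bits (size optimization).
static_assert(sizeof(std::uint_least8_t) * CHAR_BIT >= 8, "at least 8 bits");

// "fast" types: the fastest type with at least N bits (speed optimization).
static_assert(sizeof(std::uint_fast16_t) * CHAR_BIT >= 16, "at least 16 bits");

// Hypothetical wire-format header whose field widths are stated explicitly.
struct PacketHeader {
    std::uint8_t  type;    // one octet
    std::uint16_t length;  // two octets
};
```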
Incidentally, ASCII is a 7-bit character set. Machines with 7-bit bytes are unusual nowadays, and wider character sets like ISO 8859-1 and UTF-8 are popular.

0xFF can be stored in 8 bits, which is one byte.
sizeof(char) is defined to always return 1, regardless of the actual size in bits of the underlying data type (see 5.3.3.1 of the current standard). The sizes of all other data types are measured in multiples of the size of a char.

When I use sizeof(unsigned char) the compiler tells me it is 1 byte.
The size of char (whether it is signed or unsigned) is always 1, as mandated by the C++ Standard.

The size of char is always 1, but the number of bits can differ. C defines the macro CHAR_BIT (in limits.h), which holds the number of bits in a char.
This means the maximum value that an unsigned char can have is pow(2, CHAR_BIT) - 1, i.e. 2^CHAR_BIT - 1.
More info there: What is CHAR_BIT?
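As a rough illustration, the maximum can be computed with an integer shift instead of the floating-point `pow`. This is a sketch that assumes CHAR_BIT is smaller than the width of `unsigned long long`; the function name `max_from_char_bit` is made up.

```cpp
#include <climits>
#include <limits>

// 2^CHAR_BIT - 1, computed with an integer shift.
// Assumption: CHAR_BIT is smaller than the width of unsigned long long.
constexpr unsigned long long max_from_char_bit() {
    return (1ULL << CHAR_BIT) - 1ULL;
}

// The library already publishes the same number in two places.
static_assert(max_from_char_bit() == UCHAR_MAX, "matches UCHAR_MAX");
static_assert(max_from_char_bit() == std::numeric_limits<unsigned char>::max(),
              "matches numeric_limits");
```

In real code `UCHAR_MAX` or `std::numeric_limits<unsigned char>::max()` is the simpler choice; the function only shows where the number comes from.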

Sizeof char or unsigned char is 1 Byte as per the standard.
Why different ranges if same size?
1 byte = 8 bits, which gives 2^8 distinct bit patterns
2^8 = 256
Hence,
signed char range is from -128 to 127
unsigned char range is from 0 to 255
This is because a signed char uses one of its bits to store the sign, while an unsigned char cannot be negative, so that bit is used to extend the positive range instead.
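A small sketch of the "same size, different range" point, using `std::numeric_limits`. The template name `value_count` is made up, and it is only meant for types narrower than `long long`.

```cpp
#include <limits>

// Number of distinct values of T: max - min + 1.
// Assumption: T is narrower than long long, so the arithmetic cannot overflow.
template <typename T>
constexpr long long value_count() {
    return static_cast<long long>(std::numeric_limits<T>::max())
         - static_cast<long long>(std::numeric_limits<T>::min()) + 1;
}

// Both types occupy one byte and hold the same number of values;
// only the position of the range differs.
static_assert(sizeof(signed char) == sizeof(unsigned char), "same size");
static_assert(value_count<signed char>() == value_count<unsigned char>(),
              "same number of values");
```

On a platform with 8-bit two's complement chars both counts are 256, covering -128..127 and 0..255 respectively.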

255, 0xFF is one byte when represented as an unsigned char. You cannot represent 255 as a signed char.

1 byte is 8 bits, so in the case of
signed: (1 bit is used for the sign, so 2^7 = 128) it holds from -128 to 127
unsigned: (2^8 = 256 values) it holds from 0 to 255

Related

C++ and unsigned types

I'm reading the C++ Primer 5th Edition, and I don't understand the following part:
In an unsigned type, all the bits represent the value. For example, an 8-bit
unsigned char can hold the values from 0 through 255 inclusive.
What does it mean with "all the bits represent the value"?
You should compare this to a signed type. In a signed value, one bit (the top bit) is used to indicate whether the value is positive or negative, while the rest of the bits are used to hold the value.
The value of an object of trivially copyable type is determined by some bits in it, while other bits do not affect its value. In the C++ standard, the bits that do not affect the value are called padding bits.
For example, consider a type with 8 bits where the last 4 bits are padding bits, then the objects represented by 00000000 and 00001111 have the same value, and compare equal.
In reality, padding bits are often used for alignment and/or error detection.
With that in mind, you can understand what the book is saying. It says there are no padding bits in an unsigned type. However, the statement is wrong. In fact, the standard only guarantees that unsigned char (and signed char, char) has no padding bits. The following is a quote from the related part of the standard, [basic.fundamental]/1:
For narrow character types, all bits of the object representation participate in the value representation.
Also, the C11 standard 6.2.6.2/1 says
For unsigned integer types other than unsigned char, the bits of the object representation shall be divided into two groups: value bits and padding bits (there need not be any of the latter).
It means that all 8 bits represent the actual value, while in a signed char only 7 bits represent the value and the 8th bit (the most significant) represents the sign of that value: positive or negative (+/-).
For example, one byte contains 8 bits, and all 8 bits are used to count up from 0.
For unsigned, all bits zero = 00000000 means 0, 00000001 = 1, 00000010 = 2, 00000011 = 3, ... up to 11111111 = 255.
For a signed byte (or signed char), the leftmost bit means the sign, and therefore cannot be used to count. (I am optically separating the leftmost bit!) 0 0000001 = 1, but 1 0000001 = -1, 0 0000010 = 2, and 1 0000010 = -2, etc, up to 0 1111111 = 127, and 1 1111111 = -127. In this example, 1 0000000 would mean -0, which is useless/wasted, so it can be given another meaning, for example -128.
There are other ways to code the bits into numbers, and some computers start from the left instead of from the right. These details are hardware specific and not relevant to understanding 'unsigned'; you only need to care about them when you want to mess with the single bits in your code (not recommended).
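To make the difference between encodings concrete, here is a sketch that decodes the same 8 bits in two ways: the sign-plus-magnitude scheme described above, and two's complement, which is what mainstream hardware uses. Both function names are made up, and an 8-bit byte is assumed.

```cpp
#include <cstdint>

// Sign-and-magnitude: the top bit is the sign, the low 7 bits are the magnitude.
constexpr int decode_sign_magnitude(std::uint8_t bits) {
    return (bits & 0x80) ? -static_cast<int>(bits & 0x7F)
                         : static_cast<int>(bits & 0x7F);
}

// Two's complement: the top bit carries a weight of -128 instead of +128.
constexpr int decode_twos_complement(std::uint8_t bits) {
    return static_cast<int>(bits & 0x7F) - ((bits & 0x80) ? 128 : 0);
}
```

The pattern 1 0000001 decodes to -1 as sign-and-magnitude but to -127 as two's complement, and 1 0000000 is "negative zero" in the first scheme and -128 in the second.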
This is mostly a theoretical thing. On real hardware, the same holds for signed integers as well. Obviously, with signed integers, some of those values are negative.
Back to unsigned - what the text says is basically that the value of an unsigned number is simply the sum of (1<<i) over every bit position i whose bit is set, for i from 0 up to the total number of bits minus one. Importantly, not only are all bits contributing, but all combinations of bits form a valid number. This is NOT guaranteed for signed integers. Therefore, if you need a bitmask, it has to be an unsigned type of sufficient width, or you could run into invalid bit patterns.
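The "sum of the set bits" reading can be sketched directly. The function name `value_from_bits` is made up for illustration.

```cpp
#include <climits>

// Rebuild the value of an unsigned char by adding the weight (1 << i)
// of every bit position i whose bit is set.
unsigned value_from_bits(unsigned char c) {
    unsigned value = 0;
    for (int i = 0; i < CHAR_BIT; ++i) {
        if ((c >> i) & 1u) {
            value += 1u << i;
        }
    }
    return value;
}
```

For every possible unsigned char the result equals the original value, which is what "all the bits represent the value" means in practice.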

How is 256 stored in a char variable and an unsigned char?

Up to 255, I can understand how the integers are stored in char and unsigned char:
#include<stdio.h>
int main()
{
unsigned char a = 256;
printf("%d\n",a);
return(0);
}
In the code above I have an output of 0 for unsigned char as well as char.
For 256, I think this is how the integer is stored in the code (this is just a guess):
First, 256 is converted to its binary representation, which is 100000000 (9 bits in total).
Then the leftmost bit (the bit which is set) is removed, because the char data type only has 8 bits of memory.
So it is stored in memory as 00000000, which is why it prints 0 as output.
Is the guess correct, or is there another explanation?
Your guess is correct. Conversion to an unsigned type uses modular arithmetic: if the value is out of range (either too large, or negative) then it is reduced modulo 2^N, where N is the number of bits in the target type. So, if (as is often the case) char has 8 bits, the value is reduced modulo 256, so that 256 becomes zero.
Note that there is no such rule for conversion to a signed type - out-of-range values give implementation-defined results. Also note that char is not specified to have exactly 8 bits, and can be larger on less mainstream platforms.
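A sketch of that rule, spelling out the modular reduction by hand next to the built-in conversion. Both function names are made up.

```cpp
#include <climits>

// The conversion rule written out: reduce n modulo (UCHAR_MAX + 1),
// which is 256 when CHAR_BIT is 8.
unsigned char to_uchar_by_hand(long long n) {
    const long long modulus = static_cast<long long>(UCHAR_MAX) + 1;
    long long r = n % modulus;   // the remainder takes the sign of n
    if (r < 0) {
        r += modulus;            // bring negative remainders into range
    }
    return static_cast<unsigned char>(r);
}

// The same thing as the language does it.
unsigned char to_uchar_by_cast(long long n) {
    return static_cast<unsigned char>(n);
}
```

With 8-bit chars both functions map 256 to 0, 257 to 1, and -1 to 255.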
On your platform (as well as on any other "normal" platform) unsigned char is 8 bits wide, so it can hold numbers from 0 to 255.
Trying to assign 256 (which is an int literal) to it is an out-of-range conversion to an unsigned type, which the standard defines to "wrap around". The result of u = n, where u is of an unsigned integral type and n is an integer outside its range, is u = n mod (max_value_of_u + 1).
This is just a convoluted way to say what you already said: the standard guarantees that in these cases the assignment is performed keeping only the bits that fit in the target variable. This rule is there because most platforms already implement this at the assembly language level (unsigned integer overflow typically results in this behavior plus some kind of overflow flag set to 1).
Notice that none of this holds for signed integers (as plain char often is): converting an out-of-range value to a signed type gives an implementation-defined result, and overflow in signed arithmetic is undefined behavior.
Yes, that's correct. 8 bits can hold 0 to 255 unsigned, or -128 to 127 signed. Go above that and you hit an overflow situation where bits are lost.
Does the compiler give you warning on the above code? You might be able to increase the warning level and see something. It won't warn you if you assign a variable that can't be determined statically (before execution), but in this case it's pretty clear you're assigning something too large for the size of the variable.
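For values that are only known at run time, one option is an explicit range check before converting. This is a sketch with a made-up helper name, `checked_to_uchar`, and it throws rather than wrapping silently.

```cpp
#include <climits>
#include <stdexcept>

// Hypothetical helper: convert to unsigned char, but refuse values
// that do not fit instead of wrapping them around.
unsigned char checked_to_uchar(int n) {
    if (n < 0 || n > UCHAR_MAX) {
        throw std::out_of_range("value does not fit in unsigned char");
    }
    return static_cast<unsigned char>(n);
}

// For constants, brace initialization turns the problem into a compile
// error: `unsigned char a{256};` is rejected as a narrowing conversion
// when unsigned char cannot hold 256.
```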

Converting unsigned char * to char *

here is my code:
std::vector<unsigned char> data;
... // put some data to data vector
char* bytes = reinterpret_cast<char*>(data.data());
My problem is that in vector 'data' I have chars of value 255. After conversion in bytes pointer I have values of -1 instead of 255. How should I convert this data properly?
EDIT
OK, it turns out that I don't really need a conversion, only the bit order. Thanks for trying to help.
char can be either signed or unsigned depending on the platform. If it is signed, like on your platform, and 8 bits wide, as on common platforms, its range is -128 to 127, so those are the only values that it can hold. This means that you can't represent 255 as a char.
Now to explain what you are seeing: the typical representation of signed numbers in modern processors is two's complement, in which -1 has the all-ones bit pattern, which is the same pattern as 255 for unsigned char. So the cast does exactly what you ask it to: it reinterprets the unsigned chars as (signed) chars.
However, I can't tell you how to convert the data properly, since that depends on what you want to do with it. The way you are doing it might be fine for your purposes; if it isn't, your only choice is to change the data type.
This works as it should. Your char type has a size of 1 byte, which equals 8 bits. If it's unsigned, all of the bits are used to hold the value, which makes the maximum value that a char can hold 255 (2^8 = 256 different values, starting with 0).
In the case of signed char, one bit is used to hold the sign instead of the value, which leaves you only 7 bits for the value, allowing you to store numbers from -128 to 127.
So, when you hold 255 in an unsigned char, all the bits are interpreted as the value, thus you have 255. If you convert it to signed char, the first bit starts to be treated as the sign bit, and the data in the variable starts to be interpreted as -1.
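If the bytes must travel through a `char*` (as many APIs require) but the 0..255 values are wanted back, one approach is to convert each element to unsigned char when reading it. This is a sketch; `as_chars` and `byte_value` are made-up names.

```cpp
#include <cstddef>
#include <vector>

// View the vector's storage as char*, as in the question.
// The bits in memory are not changed by this cast.
const char* as_chars(const std::vector<unsigned char>& data) {
    return reinterpret_cast<const char*>(data.data());
}

// Read one element as a value in 0..UCHAR_MAX, whether plain char
// is signed or unsigned on this platform.
int byte_value(const char* bytes, std::size_t index) {
    return static_cast<unsigned char>(bytes[index]);
}
```

With this, an element stored as 255 reads back as 255 through `byte_value`, even where printing `bytes[index]` directly would show -1.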

Relation between word length, character size, integer size and byte

What is the relation between word length, character size, integer size, and byte in C++?
The standard requires that certain types have minimum sizes (short is at least 16 bits, int is at least 16 bits, etc), and that some groups of types are ordered (sizeof(int) >= sizeof(short) >= sizeof(char)).
In C++ a char must be large enough to hold any character in the implementation's basic character set.
int has the "natural size suggested by the architecture of the execution environment". Note that this means that an int does not need to be at least 32 bits in size. Implementations where int is 16 bits are common (think embedded or MS-DOS).
The following are taken from various parts of the C++98 and C99 standards:
long int has to be at least as large as int
int has to be at least as large as short
short has to be at least as large as char
Note that they could all be the same size.
Also (assuming a two's complement implementation):
long int has to be at least 32-bits
int has to be at least 16-bits
short has to be at least 16-bits
char has to be at least 8 bits
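These guarantees can be written down as compile-time checks. The sketch below should compile on any conforming implementation; `storage_bits` is a made-up helper that reports the storage size of a type in bits.

```cpp
#include <climits>
#include <cstddef>

// Ordering guarantees.
static_assert(sizeof(char) == 1, "char is one byte by definition");
static_assert(sizeof(short) >= sizeof(char), "short is at least as large as char");
static_assert(sizeof(int) >= sizeof(short), "int is at least as large as short");
static_assert(sizeof(long) >= sizeof(int), "long is at least as large as int");

// Minimum ranges, which imply the minimum bit counts listed above.
static_assert(CHAR_BIT >= 8, "char has at least 8 bits");
static_assert(SHRT_MAX >= 32767, "short needs at least 16 bits");
static_assert(INT_MAX >= 32767, "int needs at least 16 bits");
static_assert(LONG_MAX >= 2147483647L, "long needs at least 32 bits");

// Storage size of T in bits on the current implementation.
template <typename T>
constexpr std::size_t storage_bits() {
    return sizeof(T) * CHAR_BIT;
}
```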
The Standard doesn't know this "word" thingy used by processors. But it says the type "int" should have the natural size for an execution environment. But even for 64 bit environments, int is usually only 32 bits. So "word" in Standard terms has pretty much no common meaning (except for the common English "word" of course).
Character size is the size of a character, and it depends on which character type you are talking about. The character types are char, unsigned char and signed char. There is also wchar_t, which is used to store characters that can have any size (determined by the implementation, but it must use one of the integer types as its underlying type, much like enumerations), while char, signed char and unsigned char each occupy exactly one byte. That means that one byte has as many bits as one char has. If an implementation says one object of type char has 16 bits, then a byte has 16 bits too.
Now a byte is the size that one char occupies. It's a unit, not some specific type. There is not much more to it, just that it is the unit in which you can address memory. I.e. you do not have pointer access to bit-fields, but you have access to units starting at one byte.
"Integer size" is a pretty wide term. What do you mean? All of bool, char, short, int, long and their unsigned counterparts are integer types. Their range is what I would call "integer size", and it is documented in the C standard and taken over by the C++ Standard. For signed char the range is -127 <-> 127, for short and int it is the same, namely -2^15+1 <-> 2^15-1, and for long it is -2^31+1 <-> 2^31-1. Their unsigned counterparts range from 0 up to 2^8-1, 2^16-1 and 2^32-1 respectively. Those are, however, minimal ranges. That is, an int may not have a maximum value of 2^14 on any platform, because that is less than 2^15-1. It follows from those values that a minimum number of bits is required: for char that is 8, for short/int that is 16, and for long that is 32. Two's-complement representation for negative numbers is not required, which is why the minimum for signed char, for example, is -127 rather than -128.
Standard C++ doesn't have a datatype called word or byte. The rest are well defined as ranges. The base is a char, which has CHAR_BIT bits. The most commonly used value of CHAR_BIT is 8.
sizeof( char ) == 1 ( one byte ) (in both C and C++)
sizeof( int ) >= sizeof( char )
word - not a C++ type; in computer architecture its size depends on the processor (on x86 it conventionally means 2 bytes)
Kind of depends on what you mean by relation. The size of numeric types is generally a multiple or a simple fraction of the machine word size. In C++ a byte is the size of a char: CHAR_BIT bits, which is at least 8 and is exactly 8 on practically every current platform. Whether plain char is signed or unsigned is left to the implementation (check the standard or your compiler documentation for details).
The general rule is, don't make any assumptions about the actual size of data types. The standard specifies relationships between the types such as a "long" integer will be either the same size or larger than an "int". Individual implementations of the language will pick specific sizes for the types that are convenient for them. For example, a compiler for a 64-bit processor will choose different sizes than a compiler for a 32-bit processor.
You can use the sizeof() operator to examine the specific sizes for the compiler you are using (on the specific target architecture).

What is an unsigned char?

In C/C++, what is an unsigned char used for? How is it different from a regular char?
In C++, there are three distinct character types:
char
signed char
unsigned char
If you are using character types for text, use the unqualified char:
it is the type of character literals like 'a' or '0' (in C++ only, in C their type is int)
it is the type that makes up C strings like "abcde"
It also works out as a number value, but it is unspecified whether that value is treated as signed or unsigned. Beware character comparisons through inequalities - although if you limit yourself to ASCII (0-127) you're just about safe.
If you are using character types as numbers, use:
signed char, which gives you at least the -127 to 127 range. (-128 to 127 is common)
unsigned char, which gives you at least the 0 to 255 range.
"At least", because the C++ standard only gives the minimum range of values that each numeric type is required to cover. sizeof (char) is required to be 1 (i.e. one byte), but a byte could in theory be for example 32 bits. sizeof would still report its size as 1 - meaning that you could have sizeof (char) == sizeof (long) == 1.
This is implementation dependent, as the C standard does NOT define the signed-ness of char. Depending on the platform, char may be signed or unsigned, so you need to explicitly ask for signed char or unsigned char if your implementation depends on it. Just use char if you intend to represent characters from strings, as this will match what your platform puts in the string.
The difference between signed char and unsigned char is as you'd expect. On most platforms, signed char will be an 8-bit two's complement number ranging from -128 to 127, and unsigned char will be an 8-bit unsigned integer (0 to 255). Note the standard does NOT require that char types have 8 bits, only that sizeof(char) return 1. You can get at the number of bits in a char with CHAR_BIT in limits.h. There are few if any platforms today where this will be something other than 8, though.
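Whether plain char is signed on a given implementation can be checked at compile time. A minimal sketch follows; the function name `plain_char_is_signed` is made up.

```cpp
#include <climits>
#include <type_traits>

// CHAR_MIN is negative exactly when plain char is a signed type.
constexpr bool plain_char_is_signed() {
    return CHAR_MIN < 0;
}

// The type trait reports the same answer.
static_assert(plain_char_is_signed() == std::is_signed<char>::value,
              "CHAR_MIN and is_signed agree");

// Whatever its signedness, char is a distinct type from both of the others.
static_assert(!std::is_same<char, signed char>::value, "char is not signed char");
static_assert(!std::is_same<char, unsigned char>::value, "char is not unsigned char");
```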
There is a nice summary of this issue here.
As others have mentioned since I posted this, you're better off using int8_t and uint8_t if you really want to represent small integers.
Because I feel it's really called for, I just want to state some rules of C and C++ (they are the same in this regard). First, all bits of unsigned char participate in determining the value of any unsigned char object. Second, unsigned char is explicitly stated to be unsigned.
Now, I had a discussion with someone about what happens when you convert the value -1 of type int to unsigned char. He refused the idea that the resulting unsigned char has all its bits set to 1, because he was worried about sign representation. But he didn't have to be. It's immediately following out of this rule that the conversion does what is intended:
If the new type is unsigned, the value is converted by repeatedly adding or subtracting one more than the maximum value that can be represented in the new type until the value is in the range of the new type. (6.3.1.3p2 in a C99 draft)
That's a mathematical description. C++ describes it in terms of modular arithmetic, which yields the same rule. Anyway, what is not guaranteed is that all bits in the integer -1 are one before the conversion. So, what do we have so we can claim that the resulting unsigned char has all its CHAR_BIT bits turned to 1?
All bits participate in determining its value - that is, no padding bits occur in the object.
Adding only one time UCHAR_MAX+1 to -1 will yield a value in range, namely UCHAR_MAX
That's enough, actually! So whenever you want to have an unsigned char having all its bits one, you do
unsigned char c = (unsigned char)-1;
It also follows that a conversion is not just truncating higher order bits. The fortunate event for two's complement is that it is just a truncation there, but the same isn't necessarily true for other sign representations.
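The claim can be checked directly. In this sketch, `all_ones` and `count_set_bits` are made-up names.

```cpp
#include <climits>

// Converting -1 to unsigned char adds UCHAR_MAX + 1 once, giving UCHAR_MAX.
constexpr unsigned char all_ones() {
    return static_cast<unsigned char>(-1);
}

static_assert(all_ones() == UCHAR_MAX, "-1 converts to the maximum value");

// Count how many bits of c are set.
int count_set_bits(unsigned char c) {
    int count = 0;
    for (int i = 0; i < CHAR_BIT; ++i) {
        count += (c >> i) & 1;
    }
    return count;
}
```

Because unsigned char has no padding bits, `count_set_bits(all_ones())` equals CHAR_BIT on any implementation.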
As for example usages of unsigned char:
unsigned char is often used in computer graphics, which very often (though not always) assigns a single byte to each colour component. It is common to see an RGB (or RGBA) colour represented as 24 (or 32) bits, each an unsigned char. Since unsigned char values fall in the range [0,255], the values are typically interpreted as:
0 meaning a total lack of a given colour component.
255 meaning 100% of a given colour pigment.
So you would end up with RGB red as (255,0,0) -> (100% red, 0% green, 0% blue).
Why not use a signed char? Arithmetic and bit shifting become problematic. As explained already, a signed char's range is essentially shifted by -128, so a full-intensity component of 255 no longer fits. A very simple and naive (mostly unused) method for converting RGB to grayscale is to average all three colour components, but this runs into problems when the values of the colour components are negative. Red (255, 0, 0) averages to (85, 85, 85) when using unsigned char arithmetic. However, if the same bytes were read as 8-bit two's complement signed chars, red would become (-1, 0, 0) and the integer average would be (0, 0, 0), i.e. black, which is incorrect.
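The naive averaging method can be sketched as follows. The `Rgb` struct and the function name `average_gray` are hypothetical, chosen only for this illustration.

```cpp
// Hypothetical pixel type: one unsigned char per colour component.
struct Rgb {
    unsigned char r;
    unsigned char g;
    unsigned char b;
};

// Naive grayscale: the plain average of the three components.
// The operands are promoted to int before the addition, so the sum
// (at most 3 * 255 with 8-bit components) cannot overflow.
unsigned char average_gray(Rgb p) {
    return static_cast<unsigned char>((p.r + p.g + p.b) / 3);
}
```

Pure red, `Rgb{255, 0, 0}`, averages to 85, matching the (85, 85, 85) result mentioned above.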
signed char has range -128 to 127; unsigned char has range 0 to 255.
char will be equivalent to either signed char or unsigned char, depending on the compiler, but is a distinct type.
If you're using C-style strings, just use char. If you need to use chars for arithmetic (pretty rare), specify signed or unsigned explicitly for portability.
unsigned char takes only positive values....like 0 to 255
where as
signed char takes both positive and negative values....like -128 to +127
char and unsigned char aren't guaranteed to be 8-bit types on all platforms—they are guaranteed to be 8-bit or larger. Some platforms have 9-bit, 32-bit, or 64-bit bytes. However, the most common platforms today (Windows, Mac, Linux x86, etc.) have 8-bit bytes.
An unsigned char is an unsigned byte value (0 to 255). You may be thinking of char in terms of being a "character" but it is really a numerical value. The regular char is often signed, in which case it has 128 non-negative values, and these values map to characters using ASCII encoding. But in either case, what you are storing in memory is a byte value.
In terms of direct values a regular char is used when the values are known to be between CHAR_MIN and CHAR_MAX while an unsigned char provides double the range on the positive end. For example, if CHAR_BIT is 8, the range of regular char is only guaranteed to be [0, 127] (because it can be signed or unsigned) while unsigned char will be [0, 255] and signed char will be [-127, 127].
In terms of what it's used for, the standards allow objects of POD (plain old data) to be directly converted to an array of unsigned char. This allows you to examine the representation and bit patterns of the object. The same guarantee of safe type punning doesn't exist for char or signed char.
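A sketch of inspecting an object's representation through unsigned char. The template name `object_bytes` is made up, and it is intended for trivially copyable types.

```cpp
#include <cstddef>
#include <vector>

// Copy the object representation of obj into a vector of bytes.
// Assumption: T is a trivially copyable (POD-like) type.
template <typename T>
std::vector<unsigned char> object_bytes(const T& obj) {
    const unsigned char* first = reinterpret_cast<const unsigned char*>(&obj);
    return std::vector<unsigned char>(first, first + sizeof(T));
}
```

The result always has `sizeof(T)` elements. The order of the bytes within a multi-byte value depends on the platform's endianness, so only the size and the set of bytes should be relied on in portable code.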
unsigned char is the heart of all bit trickery. In almost all compilers for all platforms an unsigned char is simply a byte and an unsigned integer of (usually) 8 bits that can be treated as a small integer or a pack of bits.
In addition, as someone else has said, the standard doesn't define the sign of a char. So you have 3 distinct char types: char, signed char, unsigned char.
If you like using various types of specific length and signedness, you're probably better off with uint8_t, int8_t, uint16_t, etc simply because they do exactly what they say.
Some googling found this, where people had a discussion about this.
An unsigned char is basically a single byte. So, you would use this if you need one byte of data (for example, maybe you want to use it to set flags on and off to be passed to a function, as is often done in the Windows API).
An unsigned char uses the bit that is reserved for the sign of a regular char as another number. This changes the range to [0 - 255] as opposed to [-128 - 127].
Generally unsigned chars are used when you don't want a sign. This will make a difference when doing things like shifting bits (shift extends the sign) and other things when dealing with a char as a byte rather than using it as a number.
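The shifting difference can be sketched as below, with made-up function names. The unsigned result is fully defined. For a negative signed value, the result of a right shift is implementation-defined before C++20 (mainstream compilers use an arithmetic shift, which C++20 then made mandatory), so treat the signed function as an illustration of typical behaviour.

```cpp
#include <climits>

// Right shift of an unsigned char: zeros enter from the left.
// Assumption: 0 <= n < CHAR_BIT.
constexpr unsigned char shift_right(unsigned char c, int n) {
    return static_cast<unsigned char>(c >> n);
}

// Right shift of a signed char: with an arithmetic shift, copies of the
// sign bit enter from the left, so negative values stay negative.
constexpr signed char shift_right_signed(signed char c, int n) {
    return static_cast<signed char>(c >> n);
}
```

With 8-bit chars, `shift_right(0x80, 7)` is 1, while `shift_right_signed(-128, 7)` is typically -1 even though both start from the same bit pattern.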
unsigned char takes only positive values: 0 to 255 while
signed char takes positive and negative values: -128 to +127.
Quoted from the book "The C Programming Language":
The qualifier signed or unsigned may be applied to char or any integer. unsigned numbers
are always positive or zero, and obey the laws of arithmetic modulo 2^n, where n is the number
of bits in the type. So, for instance, if chars are 8 bits, unsigned char variables have values
between 0 and 255, while signed chars have values between -128 and 127 (in a two's
complement machine.) Whether plain chars are signed or unsigned is machine-dependent,
but printable characters are always positive.
signed char and unsigned char both represent 1 byte, but they have different ranges.
Type | range
-------------------------------
signed char | -128 to +127
unsigned char | 0 to 255
With signed char, if we consider char letter = 'A', then 'A' is stored as the binary representation of 65 (its ASCII/Unicode code). If 65 can be stored, -65 can be stored too. There are no negative character codes in ASCII/Unicode, therefore there is no need to worry about negative values.
Example
#include <stdio.h>
int main()
{
signed char char1 = 255;
signed char char2 = -128;
unsigned char char3 = 255;
unsigned char char4 = -128;
printf("Signed char(255) : %d\n",char1);
printf("Unsigned char(255) : %d\n",char3);
printf("\nSigned char(-128) : %d\n",char2);
printf("Unsigned char(-128) : %d\n",char4);
return 0;
}
Output -:
Signed char(255) : -1
Unsigned char(255) : 255
Signed char(-128) : -128
Unsigned char(-128) : 128