How the bits are interpreted in memory - c++

I'm studying the C++ programming language and i have a problem with my book (Programming priciples and practice using C++). What my book says is :
the meaning of the bit in memory is completely dependent on the type used to access it.Think of it this way : computer memory doesn't know about our types, it's just memory. The bits of memory get meaning only when we decide how that memory is to be interpreted.
Can you explain me what does it mean ? Please do it in a simple way because I'm only a beginner who is learning C++ by 3 weeks.

The computer's memory only stores bits and bytes - how those values are interpreted is up to the programmer (and his programming language).
Consider, e.g., the value 01000001. If you interpret it as a number, it's 65 (e.g., in the short datatype). If you interpret it as an ASCII character (e.g., in the char datatype), it's the character 'A'.

A simple example: take the byte 01000001. It contains (as all bytes) 8 bits. There are 2 bits set on (with the value of 1), the second and the last bits. The second has a corresponding decimal value of 64 in the byte, and the last has the value of 1. So they are interpreted as different powers of 2 by convention (in this case 2ˆ6 and 2ˆ0). This byte will have a decimal value of 64 + 1 = 65. For the byte 01000001 itself, there is also an interpretation convention. For instance, it can be the number 65 or the letter 'A' (according to the ASCII Table). Or, the byte can be part of a data that has a larger representation than just one byte.

As a few people have noted, bits are just a way of representing information. We need to have a way to interpret the bits to derive meaning out of them. It's kind of like a dictionary. There are many different "dictionaries" out there for many different types of data. ASCII, 2s complement integers, and so forth.
C++ variables must have a type, so each is assigned to a category, like int, double, float, char, string and so forth. The data type tells the compiler how much space to allocate in memory for your variable, how to assign it a value, and how to modify it.

Related

What is the endianness of binary literals in C++14?

I have tried searching around but have not been able to find much about binary literals and endianness. Are binary literals little-endian, big-endian or something else (such as matching the target platform)?
As an example, what is the decimal value of 0b0111? Is it 7? Platform specific? Something else? Edit: I picked a bad value of 7 since it is represented within one byte. The question has been sufficiently answered despite this fact.
Some background: Basically I'm trying to figure out what the value of the least significant bits are, and masking it with binary literals seemed like a good way to go... but only if there is some guarantee about endianness.
Short answer: there isn't one. Write the number the way you would write it on paper.
Long answer:
Endianness is never exposed directly in the code unless you really try to get it out (such as using pointer tricks). 0b0111 is 7, it's the same rules as hex, writing
int i = 0xAA77;
doesn't mean 0x77AA on some platforms because that would be absurd. Where would the extra 0s that are missing go anyway with 32-bit ints? Would they get padded on the front, then the whole thing flipped to 0x77AA0000, or would they get added after? I have no idea what someone would expect if that were the case.
The point is that C++ doesn't make any assumptions about the endianness of the machine*, if you write code using primitives and the literals it provides, the behavior will be the same from machine to machine (unless you start circumventing the type system, which you may need to do).
To address your update: the number will be the way you write it out. The bits will not be reordered or any such thing, the most significant bit is on the left and the least significant bit is on the right.
There seems to be a misunderstanding here about what endianness is. Endianness refers to how bytes are ordered in memory and how they must be interpretted. If I gave you the number "4172" and said "if this is four-thousand one-hundred seventy-two, what is the endianness" you can't really give an answer because the question doesn't make sense. (some argue that the largest digit on the left means big endian, but without memory addresses the question of endianness is not answerable or relevant). This is just a number, there are no bytes to interpret, there are no memory addresses. Assuming 4 byte integer representation, the bytes that correspond to it are:
low address ----> high address
Big endian: 00 00 10 4c
Little endian: 4c 10 00 00
so, given either of those and told "this is the computer's internal representation of 4172" you could determine if its little or big endian.
So now consider your binary literal 0b0111 these 4 bits represent one nybble, and can be stored as either
low ---> high
Big endian: 00 00 00 07
Little endian: 07 00 00 00
But you don't have to care because this is also handled by the hardware, the language dictates that the compiler reads from left to right, most significant bit to least significant bit
Endianness is not about individual bits. Given that a byte is 8 bits, if I hand you 0b00000111 and say "is this little or big endian?" again you can't say because you only have one byte (and no addresses). Endianness doesn't pertain to the order of bits in a byte, it refers to the ordering of entire bytes with respect to address(unless of course you have one-bit bytes).
You don't have to care about what your computer is using internally. 0b0111 just saves you the time from having to write stuff like
unsigned int mask = 7; // only keep the lowest 3 bits
by writing
unsigned int mask = 0b0111;
Without needing to comment explaining the significance of the number.
* In c++20 you can check the endianness using std::endian.
All integer literals, including binary ones are interpreted in the same way as we normally read numbers (left most digit being most significant).
The C++ standard guarantees the same interpretation of literals without having to be concerned with the specific environment you're on. Thus, you don't have to concern yourself with endianness in this context.
Your example of 0b0111 is always equal to seven.
The C++ standard doesn't use terms of endianness in regards to number literals. Rather, it simply describes that literals have a consistent interpretation, and that the interpretation is the one you would expect.
C++ Standard - Integer Literals - 2.14.2 - paragraph 1
An integer literal is a sequence of digits that has no period or
exponent part, with optional separating single quotes that are ignored
when determining its value. An integer literal may have a prefix that
specifies its base and a suffix that specifies its type. The lexically
first digit of the sequence of digits is the most significant. A
binary integer literal (base two) begins with 0b or 0B and consists of
a sequence of binary digits. An octal integer literal (base eight)
begins with the digit 0 and consists of a sequence of octal digits.
A decimal integer literal (base ten) begins with a digit other than 0
and consists of a sequence of decimal digits. A hexadecimal integer
literal (base sixteen) begins with 0x or 0X and consists of a sequence
of hexadecimal digits, which include the decimal digits and the
letters a through f and A through F with decimal values ten through
fifteen. [Example: The number twelve can be written 12, 014, 0XC, or
0b1100. The literals 1048576, 1’048’576, 0X100000, 0x10’0000, and
0’004’000’000 all have the same value. — end example ]
Wikipedia describes what endianness is, and uses our number system as an example to understand big-endian.
The terms endian and endianness refer to the convention used to
interpret the bytes making up a data word when those bytes are stored
in computer memory.
Big-endian systems store the most significant byte of a word in the
smallest address and the least significant byte is stored in the
largest address (also see Most significant bit). Little-endian
systems, in contrast, store the least significant byte in the smallest
address.
An example on endianness is to think of how a decimal number is
written and read in place-value notation. Assuming a writing system
where numbers are written left to right, the leftmost position is
analogous to the smallest address of memory used, and rightmost
position the largest. For example, the number one hundred twenty three
is written 1 2 3, with the hundreds place left-most. Anyone who reads
this number also knows that the leftmost digit has the biggest place
value. This is an example of a big-endian convention followed in daily
life.
In this context, we are considering a digit of an integer literal to be a "byte of a word", and the word to be the literal itself. Also, the left-most character in a literal is considered to have the smallest address.
With the literal 1234, the digits one, two, three and four are the "bytes of a word", and 1234 is the "word". With the binary literal 0b0111, the digits zero, one, one and one are the "bytes of a word", and the word is 0111.
This consideration allows us to understand endianness in the context of the C++ language, and shows that integer literals are similar to "big-endian".
You're missing the distinction between endianness as written in the source code and endianness as represented in the object code. The answer for each is unsurprising: source-code literals are bigendian because that's how humans read them, in object code they're written however the target reads them.
Since a byte is by definition the smallest unit of memory access I don't believe it would be possible to even ascribe an endianness to any internal representation of bits in a byte -- the only way to discover endianness for larger numbers (whether intentionally or by surprise) is by accessing them from storage piecewise, and the byte is by definition the smallest accessible storage unit.
The C/C++ languages don't care about endianness of multi-byte integers. C/C++ compilers do. Compilers parse your source code and generate machine code for the specific target platform. The compiler, in general, stores integer literals the same way it stores an integer; such that the target CPU's instructions will directly support reading and writing them in memory.
The compiler takes care of the differences between target platforms so you don't have to.
The only time you need to worry about endianness is when you are sharing binary values with other systems that have different byte ordering.Then you would read the binary data in, byte by byte, and arrange the bytes in memory in the correct order for the system that your code is running on.
One picture is sometimes more than thousand words.
Endianness is implementation-defined. The standard guarantees that every object has an object representation as an array of char and unsigned char, which you can work with by calling memcpy() or memcmp(). In C++17, it is legal to reinterpret_cast a pointer or reference to any object type (not a pointer to void, pointer to a function, or nullptr) to a pointer to char, unsigned char, or std::byte, which are valid aliases for any object type.
What people mean when they talk about “endianness” is the order of bytes in that object representation. For example, if you declare unsigned char int_bytes[sizeof(int)] = {1}; and int i; then memcpy( &i, int_bytes, sizeof(i)); do you get 0x01, 0x01000000, 0x0100, 0x0100000000000000, or something else? The answer is: yes. There are real-world implementations that produce each of these results, and they all conform to the standard. The reason for this is so the compiler can use the native format of the CPU.
This comes up most often when a program needs to send or receive data over the Internet, where all the standards define that data should be transmitted in big-endian order, on a little-endian CPU like the x86. Some network libraries therefore specify whether particular arguments and fields of structures should be stored in host or network byte order.
The language lets you shoot yourself in the foot by twiddling the bits of an object representation arbitrarily, but it might get you a trap representation, which could cause undefined behavior if you try to use it later. (This could mean, for example, rewriting a virtual function table to inject arbitrary code.) The <type_traits> header has several templates to test whether it is safe to do things with an object representation. You can copy one object over another of the same type with memcpy( &dest, &src, sizeof(dest) ) if that type is_trivially_copyable. You can make a copy to correctly-aligned uninitialized memory if it is_trivially_move_constructible. You can test whether two objects of the same type are identical with memcmp( &a, &b, sizeof(a) ) and correctly hash an object by applying a hash function to the bytes in its object representation if the type has_unique_object_representations. An integral type has no trap representations, and so on. For the most part, though, if you’re doing operations on object representations where endianness matters, you’re telling the compiler to assume you know what you’re doing and your code will not be portable.
As others have mentioned, binary literals are written with the most-significant-digit first, like decimal, octal or hexidecimal literals. This is different from endianness and will not affect whether you need to call ntohs() on the port number from a TCP header read in from the Internet.
You might want to think about C or C++ or any other language as being intrinsically little endian (think about how the bitwise operators work). If the underlying HW is big endian, the compiler ensures that the data is stored in big endian (ditto for other endianness) however your bit wise operations work as if the data is little endian. Thing to remember is that as far as the language is concerned, data is in little endian. Endianness related problems arise when you cast the data from one type to the other. As long as you don't do that you are good.
I was questioned about the statement "C/C++ language as being intrinsically little endian", as such I am providing an example which many knows how it works but well here I go.
typedef union
{
struct {
int a:1;
int reserved:31;
} bits;
unsigned int value;
} u;
u test;
test.bits.a = 1;
test.bits.reserved = 0;
printf("After bits assignment, test.value = 0x%08X\n", test.value);
test.value = 0x00000001;
printf("After value assignment, test.value = 0x%08X\n", test.value);
Output on a little endian system:
After bits assignment, test.value = 0x00000001
After value assignment, test.value = 0x00000001
Output on a big endian system:
After bits assignment, test.value = 0x80000000
After value assignment, test.value = 0x00000001
So, if you do not know the processor's endianness, where does everything come out right? in the little endian system! Thus, I say that the C/C++ language is intrinsically little endian.

Which bit is first and when you bit shift, does it actually shift in that direction?

So.. wrestling with bits and bytes, It occurred to me that if i say "First bit of nth byte", it might not mean what I think it means. So far I have assumed that if I have some data like this:
00000000 00000001 00001000
then the
First byte is the leftmost of the groups and has the value of 0
First bit is the leftmost of all 0's and has the value of 0
Last byte is the rightmost of the groups and has the value of 8
Last bit of the second byte is the rightmost of the middle group and has the value of 1
Then I learned that the byte order in a typed collection of bytes is determined by the endianess of the system. In my case it should be little endian (windows, intel, right?) which would mean that something like 01 10 as a 16 bit uinteger should be 2551 while in most programs dealing with memory it would be represented as 265.. no idea whats going on there.
I also learned that bits in a byte could be ordered as whatever and there seems to be no clear answer as to which bit is the actual first one since they could also be subject to bit-endianess and peoples definition about what is first differs. For me its left to right, for somebody else it might be what first appears when you add 1 to 0 or right to left.
Why does any of this matter? Well, curiosity mostly but I was also trying to write a class that would be able to extract X number of bits, starting from bit-address Y. I envisioned it sorta like .net string where i can go and type ".SubArray(12(position), 5(length))" then in case of data like in the top of this post it would retrieve "0001 0" or 2.
So could somebody clarifiy as to what is first and last in terms of bits and bytes in my environment, does it go right to left or left to right or both, wut? And why does this question exist in the first place, why couldn't the coding ancestors have agreed on something and stuck with it?
A shift is an arithmetic operation, not a memory-based operation: it is intended to work on the value, rather than on its representation. Shifting left by one is equivalent to a multiplication by two, and shifting right by one is equivalent to a division by two. These rules hold first, and if they conflict with the arrangement of the bits of a multibyte type in memory, then so much for the arrangement in memory. (Since shifts are the only way to examine bits within one byte, this is also why there is no meaningful notion of bit order within one byte.)
As long as you keep your operations to within a single data type (rather than byte-shifting long integers and them examining them as character sequences), the results will stay predictable. Examining the same chunk of memory through different integer types is, in this case, a bit like performing integer operations and then reading the bits as a float; there will be some change, but it's not the place of the integer arithmetic definitions to say exactly what. It's out of their scope.
You have some understanding, but a couple misconceptions.
First off, arithmetic operations such as shifting are not concerned with the representation of the bits in memory, they are dealing with the value. Where memory representation comes into play is usually in distributed environments where you have cross-platform communication in the mix, where the data on one system is represented differently on another.
Your first comment...
I also learned that bits in a byte could be ordered as whatever and there seems to be no clear answer as to which bit is the actual first one since they could also be subject to bit-endianess and peoples definition about what is first differs
This isn't entirely true, though the bits are only given meaning by the reader and the writer of data, generally bits within an 8-bit byte are always read from left (MSB) to right (LSB). The byte-order is what is determined by the endian-ness of the system architecture. It has to do with the representations of the data in memory, not the arithmetic operations.
Second...
And why does this question exist in the first place, why couldn't the coding ancestors have agreed on something and stuck with it?
From Wikipedia:
The initial endianness design choice was (is) mostly arbitrary, but later technology revisions and updates perpetuate the same endianness (and many other design attributes) to maintain backward compatibility. As examples, the Intel x86 processor represents a common little-endian architecture, and IBM z/Architecture mainframes are all big-endian processors. The designers of these two processor architectures fixed their endiannesses in the 1960s and 1970s with their initial product introductions to the market. Big-endian is the most common convention in data networking (including IPv6), hence its pseudo-synonym network byte order, and little-endian is popular (though not universal) among microprocessors in part due to Intel's significant historical influence on microprocessor designs. Mixed forms also exist, for instance the ordering of bytes within a 16-bit word may differ from the ordering of 16-bit words within a 32-bit word. Such cases are sometimes referred to as mixed-endian or middle-endian. There are also some bi-endian processors which can operate either in little-endian or big-endian mode.
Finally...
Why does any of this matter? Well, curiosity mostly but I was also trying to write a class that would be able to extract X number of bits, starting from bit-address Y. I envisioned it sorta like .net string where i can go and type ".SubArray(12(position), 5(length))" then in case of data like in the top of this post it would retrieve "0001 0" or 2.
Many programming languages and libraries offer functions that allow you to convert to/from network (big endian) and host order (system dependent) so that you can ensure data you're dealing with is in the proper format, if you need to care about it. Since you're asking specifically about bit shifting, it doesn't matter in this case.
Read this post for more info

C++ Byte is implementation dependent

I have been reading C Primer Plus.
It is said that:
Note that the meaning of byte is implementation dependent. So a 2-byte int could be 16 bits on one system and 32 bits on another.
Here I think I am not sure about this. From my understanding, 1 byte always = 8 bits, so it makes sense that 2-byte int = 2 * 8 = 16 bits. But from this statement it sounds like some system define 1 byte = 16 bits. Is that correct?
In general, how I should understand this statement?
The C++ standard, section 1.7 point 1 confirms this:
The fundamental storage unit in the C++ memory model is the byte. A
byte is at least large enough to contain any member of the basic
execution character set (2.3) and the eight-bit code units of the
Unicode UTF-8 encoding form and is composed of a contiguous sequence
of bits, the number of which is implementation defined. (...)
The memory available to a C++ program consists of one or more
sequences of contiguous bytes. Every byte has a unique address.
Bytes are always composed of at least 8 bits. They can be larger than 8 bits, though this is fairly uncommon.
One byte is not always 8-bits. Before octets (the term you want to use if you want to explicitly refer to an 8-bit byte), there were 4-, 6-, and 7-bit bytes. For the purposes of [modern] programming (in pretty much any language), you can assume it is at least 8 bits.
Historically, a byte was not always 8 bits. Today it is, but long time ago, it could be 6, 7, 8, 9 ... so to have a language that could exploit the specifics of the hardware (for efficiency) but still letting the user express himself in a bit higher level language, they had to make sure the int type was mapped on the most natural fit for the hardware.

What is the difference between the types of 0x7FFF and 32767?

I'd like to know what the difference is between the values 0x7FFF and 32767. As far as I know, they should both be integers, and the only benefit is the notational convenience. They will take up the same amount of memory, and be represented the same way, or is there another reason for choosing to write a number as 0x vs base 10?
The only advantage is that some programmers find it easier to convert between base 16 and binary in their heads. Since each base 16 digit occupies exactly 4 bits, it's a lot easier to visualize the alignment of bits. And writing in base 2 is quite cumbersome.
The type of an undecorated decimal integral constants is always signed. The type of an undecorated hexadecimal or octal constant alternates between signed and unsigned as you hit the various boundary values determined by the widths of the integral types.
For constants decorated as unsigned (e.g. 0xFU), there is no difference.
Also, it's not possible to express 0 as a decimal literal.
See Table 6 in C++11 and 6.4.4.1/5 in C11.
Both are integer literals, and just provide a different means of expressing the same number. There is no technical advantage to using one form over the other.
Note that you can also use octal notation as well (by prepending the value with 0).
The 0x7FFF notation is much more clear about potential over/underflow than the decimal notation.
If you using something that is 16 bits wide, 0x7FFF alerts you to the fact that if you use those bits in a signed way, you are at the very maximum of what those 16 bits can hold for a positive, signed value. Add 1 to it, and you'll overflow.
Same goes for 32 bits wide. The maximum that it can hold (signed, positive) is 0x7FFFFFFF.
You can see these maximums straight off of the hex notation, whereas you can't off of the decimal notation. (Unless you happen to have memorized that 32767 is the positive signed max for 16 bits).
(Note: the above is true when twos complement is being used for distinguishing between positive and negative values if the 16 bits are holding a signed value).
One is hex -- base 16 -- and the other is decimal?
That is true, there is no difference. Any differences would be in the variable the value is stored in. The literals 0x7FFF and 32767 are identical to the compiler in every way.
See http://www.cplusplus.com/doc/tutorial/constants/.
Choosing to write 0x7fff or 32767 into source code it's only a programmer choice because, those values are stored in the same identical way into computer memory.
For example: I'd feel more comfortable use the 0x notation when I need to do operations with 4bit instead the classical byte.
If I need to extract the lower 4 bit of a char variable I'd do
res = charvar & 0x0f;
That's the same of:
res = charvar & 15;
The latter is just less intuitive and readable but the operation is identical

Is there a name for this compression algorithm?

Say you have a four byte integer and you want to compress it to fewer bytes. You are able to compress it because smaller values are more probable than larger values (i.e., the probability of a value decreases with its magnitude). You apply the following scheme, to produce a 1, 2, 3 or 4 byte result:
Note that in the description below (the bits are one-based and go from most significant to least significant), i.e., the first bit refers to most significant bit, the second bit to the next most significant bit, etc...)
If n<128, you encode it as a
single byte with the first bit set
to zero
If n>=128 and n<16,384 ,
you use a two byte integer. You set
the first bit to one, to indicate
and the second bit to zero. Then you
use the remaining 14 bits to encode
the number n.
If n>16,384 and
n<2,097,152 , you use a three byte
integer. You set the first bit to
one, the second bit to one, and the
third bit to zero. You use the
remaining 21 bits, to encode n.
If n>2,097,152 and n<268,435,456 ,
you use a four byte integer. You set
the first three bits to one and the
fourth bit to zero. You use the
remaining 28 bits to encode n.
If n>=268,435,456 and n<4,294,967,296,
you use a five byte integer. You set
the first four bits to one and use
the following 32-bits to set the
exact value of n, as a four byte
integer. The remainder of the bits is unused.
Is there a name for this algorithm?
This is quite close to variable-length quantity encoding or base-128. The latter name stems from the fact that each 7-bit unit in your encoding can be considered a base-128 digit.
it sounds very similar to Dlugosz' Variable-Length Integer Encoding
Huffman coding refers to using fewer bits to store more common data in exchange for using more bits to store less common data.
Your scheme is similar to UTF-8, which is an encoding scheme used for Unicode text data.
The chief difference is that every byte in a UTF-8 stream indicates whether it is a lead or trailing byte, therefore a sequence can be read starting in the middle. With your scheme a missing lead byte will make the rest of the file completely unreadable if a series of such values are stored. And reading such a sequence must start at the beginning, rather than an arbitrary location.
Varint
Using the high bit of each byte to indicate "continue" or "stop", and the remaining bits (7 bits of each byte in the sequence) interpreted as plain binary that encodes the actual value:
This sounds like the "Base 128 Varint" as used in Google Protocol Buffers.
related ways of compressing integers
In summary: this code represents an integer in 2 parts:
A first part in a unary code that indicates how many bits will be needed to read in the rest of the value, and a second part (of the indicated width in bits) in more-or-less plain binary that encodes the actual value.
This particular code "threads" the unary code with the binary code, but other, similar codes pack the complete unary code first, and then the binary code afterwards,
such as Elias gamma coding.
I suspect this code is one of the family of "Start/Stop Codes"
as described in:
Steven Pigeon — Start/Stop Codes — Procs. Data Compression Conference 2001, IEEE Computer Society Press, 2001.