Bits and bytes filling - C++

Having written code of the form:
int i = 0xFF;
And knowing that on my system the int type has 4 bytes, is there a standard way of predicting how those missing bits will be filled in the variable i, that is, will it always be filled as:
0xFFFFFFFF
or
0xFF000000
That is, can we predict whether the rest of the bits will be filled with zeros or ones?
Another question is whether there is a way to know if the value will be filled at the back or at the front, that is:
0xFF000000 //<<- bits filled at the back
or
0x000000FF //<<- bits filled from the front

int i = 0xFF; with 4-byte ints fills 1 byte with FF and the other 3 with 00.
char i = 0xFF; on the other hand (with 1-byte char) fills 1 byte with FF and you have no guarantee what the next 3 bytes are (neither their values, nor whether they are used for other variables etc. or not).
The byte direction
FF 00 00 00
00 00 00 FF
is called the "endianess" of your CPU (and the OS, compiler etc. are made that way that they match it). "Normal" computers today are "little-endian", that is FF 00 00 00, altough big-endian exists in devices like mobile phones etc.
However, you won't notice the endianess directly in your program: If you compare the int to this values, it will be equal to 00 00 00 FF which is mathematically correct (or said in the other direction: when you write 000000FF in your source file, the compiler changes it, in reality you're comparing the int value to FF 00 00 00). To get the "real" representation, casting to a byte/char pointer and accessing the 4 bytes individually is necessary.
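Here is a minimal sketch of that char-pointer trick, assuming a 4-byte int; the output depends on the machine it runs on:
#include <cstddef>
#include <cstdio>

int main() {
    int i = 0xFF;
    // Look at the object representation byte by byte.
    const unsigned char* p = reinterpret_cast<const unsigned char*>(&i);
    for (std::size_t n = 0; n < sizeof i; ++n)
        std::printf("%02X ", p[n]);   // "FF 00 00 00" on little-endian, "00 00 00 FF" on big-endian
    std::printf("\n");
}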

The rest of the bits will be 0, always. When stored to memory, this will be represented as FF 00 00 00 or 00 00 00 FF; which one depends on the endianness of the machine.
The first, FF 00 00 00, is little-endian and is typically used by Intel machines, though as pointed out in a comment by Toby Speight, the IA-64 architecture can be configured to use either layout.
The second, 00 00 00 FF, is big-endian.

int i = 0xFF;
is equivalent to
int i = 255;
which, written out as a full 32-bit value, is
0x000000FF

What is the difference between Big Endian and Little Endian Byte order?

Both of these seem to be related to Unicode and UTF-16. Where exactly do we use this?
Big-Endian (BE) / Little-Endian (LE) are two ways to organize multi-byte words. For example, when using two bytes to represent a character in UTF-16, there are two ways to represent the character 0x1234 as a string of bytes (0x00-0xFF):
Byte Index:     0  1
--------------------
Big-Endian:    12 34
Little-Endian: 34 12
In order to decide whether a text uses UTF-16BE or UTF-16LE, the specification recommends prepending a Byte Order Mark (BOM) to the string, representing the character U+FEFF. So, if the first two bytes of a UTF-16 encoded text file are FE, FF, the encoding is UTF-16BE. For FF, FE, it is UTF-16LE.
A visual example: The word "Example" in different encodings (UTF-16 with BOM):
Byte Index:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
------------------------------------------------------------
ASCII:      45 78 61 6d 70 6c 65
UTF-16BE:   FE FF 00 45 00 78 00 61 00 6d 00 70 00 6c 00 65
UTF-16LE:   FF FE 45 00 78 00 61 00 6d 00 70 00 6c 00 65 00
For further information, please read the Wikipedia page of Endianness and/or UTF-16.
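To make the choice concrete, here is a hypothetical helper (the name encode_utf16 and its interface are made up for this answer, not part of any standard API) that serializes a char16_t string to bytes in either order, with the U+FEFF BOM prepended:
#include <cstdint>
#include <string>
#include <vector>

std::vector<std::uint8_t> encode_utf16(const std::u16string& text, bool big_endian) {
    std::vector<std::uint8_t> out;
    std::u16string with_bom = u"\uFEFF" + text;   // prepend the byte order mark
    for (char16_t unit : with_bom) {
        std::uint8_t hi = static_cast<std::uint8_t>(unit >> 8);
        std::uint8_t lo = static_cast<std::uint8_t>(unit & 0xFF);
        if (big_endian) { out.push_back(hi); out.push_back(lo); }   // UTF-16BE: FE FF ...
        else            { out.push_back(lo); out.push_back(hi); }   // UTF-16LE: FF FE ...
    }
    return out;
}
Calling encode_utf16(u"Example", true) produces exactly the UTF-16BE byte sequence shown in the table above.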
Ferdinand's answer (and others) is correct, but incomplete.
Big Endian (BE) / Little Endian (LE) have nothing to do with UTF-16 or UTF-32.
They existed way before Unicode, and affect how the bytes of numbers get stored in the computer's memory. They depend on the processor.
If you have a number with the value 0x12345678 then in memory it will be represented as 12 34 56 78 (BE) or 78 56 34 12 (LE).
UTF-16 and UTF-32 code units happen to be 2 and 4 bytes wide respectively, so the order of their bytes follows whatever ordering numbers in general follow on that platform.
UTF-16 encodes Unicode into 16-bit values. Most modern filesystems operate on 8-bit bytes. So, to save a UTF-16 encoded file to disk, for example, you have to decide which part of the 16-bit value goes in the first byte, and which goes into the second byte.
Wikipedia has a more complete explanation.
little-endian: adj.
Describes a computer architecture in which, within a given 16- or 32-bit word, bytes at lower addresses have lower significance (the word is stored ‘little-end-first’). The PDP-11 and VAX families of computers and Intel microprocessors and a lot of communications and networking hardware are little-endian. The term is sometimes used to describe the ordering of units other than bytes; most often, bits within a byte.
big-endian: adj.
[common; From Swift's Gulliver's Travels via the famous paper On Holy Wars and a Plea for Peace by Danny Cohen, USC/ISI IEN 137, dated April 1, 1980]
Describes a computer architecture in which, within a given multi-byte numeric representation, the most significant byte has the lowest address (the word is stored ‘big-end-first’). Most processors, including the IBM 370 family, the PDP-10, the Motorola microprocessor families, and most of the various RISC designs are big-endian. Big-endian byte order is also sometimes called network order.
---from the Jargon File: http://catb.org/~esr/jargon/html/index.html
Byte endianness (big or little) needs to be specified for Unicode/UTF-16 encoding because character codes that occupy more than a single byte leave a choice of whether to read/write the most significant byte first or last. Since UTF-16 code units are two bytes wide, this choice has to be specified. (Note, however, that UTF-8 code units are always one byte long, even though a character may be encoded as several of them, so there is no endianness problem there.) If the encoder of a stream of bytes representing Unicode text and the decoder do not agree on which convention is being used, the wrong character codes can be interpreted. For this reason, either the endianness convention is known beforehand, or, more commonly, a byte order mark is placed at the beginning of any Unicode text file/stream to indicate whether big- or little-endian order is being used.
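As a small illustration of that convention, here is a sketch of a BOM check; the enum and function names are made up for this answer:
#include <cstddef>
#include <cstdint>

enum class Utf16Order { BigEndian, LittleEndian, Unknown };

Utf16Order detect_byte_order(const std::uint8_t* data, std::size_t size) {
    if (size >= 2) {
        if (data[0] == 0xFE && data[1] == 0xFF) return Utf16Order::BigEndian;    // U+FEFF written as FE FF
        if (data[0] == 0xFF && data[1] == 0xFE) return Utf16Order::LittleEndian; // U+FEFF written as FF FE
    }
    return Utf16Order::Unknown;   // no BOM: the convention must be known beforehand
}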

Use of CRC32 over short 16-byte messages

I want to use a CRC32 on a short message (16 bytes) to use as a unique key for another purpose. Will the CRC32 value be guaranteed to be unique for every possible 16-byte message?
Given that there can be 2^128 messages (over 3.40 × 10^38, if you prefer), it's not testable on a spreadsheet.
No. A CRC32 is 32 bits, so there is no way to map 128 bits uniquely onto 32 bits.
Checksum functions like CRC32 are useful for detecting common corruption (such as a single changed byte), not for unique identification.
The closest kind of function is the hash function, which tries to give a unique value for every input, but hash functions don't guarantee the absence of collisions either; they only try to minimize the probability of collisions for typical input distributions.
There is also the "perfect hash function", which is guaranteed to be collision-free, but it restricts the input set to the size of the hash function's output range and is typically implemented as a lookup table.
You don't need to test all 2^128 possible patterns. For 128 data bits, the best a 32-bit CRC polynomial can do is detect up to 7-bit errors.
https://users.ece.cmu.edu/~koopman/crc/crc32.html
In general, these 7-bit-error-detecting CRCs can only correct up to 3-bit errors, so there are some 4-bit patterns that will produce duplicates. A test program only needs to test comb(128,4) = 10,668,000 patterns to search for duplicates among 4-bit patterns. A 4-bit pattern can't duplicate a 1-, 2-, or 3-bit pattern, because in the worst case that would be a total of 7 error bits, which these CRCs are guaranteed to detect.
I tested using CRC polynomial 0x1f1922815, with all zero bits except for the pattern to be tested and the first collision found was between these two 4 bit patterns:
crc(00 00 00 00 00 00 40 00 00 00 00 00 11 00 08 00) = 87be3
crc(40 00 00 00 04 00 00 08 00 00 00 00 00 00 04 00) = 87be3
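For reference, a rough sketch of such a search is shown below. It uses a plain, non-reflected, zero-initialized bit-by-bit CRC with the polynomial mentioned above; with different CRC conventions (reflection, initial value, final XOR) the first collision found will be a different pair, but a collision among the 4-bit patterns will still turn up.
#include <array>
#include <cstdint>
#include <cstdio>
#include <unordered_map>

using Msg = std::array<unsigned char, 16>;

// Bit-by-bit CRC-32 with polynomial 0x1f1922815 (top bit implicit),
// no reflection, initial value 0, no final XOR.
std::uint32_t crc32(const Msg& m) {
    std::uint32_t crc = 0;
    for (unsigned char byte : m) {
        crc ^= static_cast<std::uint32_t>(byte) << 24;
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc & 0x80000000u) ? (crc << 1) ^ 0xF1922815u : (crc << 1);
    }
    return crc;
}

int main() {
    // Map from CRC value to the four bit positions that produced it,
    // packed 7 bits each into one 32-bit key (positions are 0..127).
    std::unordered_map<std::uint32_t, std::uint32_t> seen;
    seen.reserve(11000000);
    for (int a = 0; a < 128; ++a)
        for (int b = a + 1; b < 128; ++b)
            for (int c = b + 1; c < 128; ++c)
                for (int d = c + 1; d < 128; ++d) {
                    Msg m{};                                // 16 bytes of zeros
                    for (int bit : {a, b, c, d})
                        m[bit / 8] |= 1u << (bit % 8);      // set the four error bits
                    std::uint32_t packed = (a << 21) | (b << 14) | (c << 7) | d;
                    auto [it, inserted] = seen.emplace(crc32(m), packed);
                    if (!inserted) {
                        std::printf("collision at crc %08X between bit sets %07X and %07X\n",
                                    crc32(m), it->second, packed);
                        return 0;
                    }
                }
    std::printf("no collisions among 4-bit error patterns\n");
}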

Hexadecimal representation on little and big endian systems

Consider the following code:
unsigned char byte = 0x01;
In C/C++ the hexadecimal literal will be treated as an int, and therefore will be expanded to more than one byte. Because there is more than one byte, whether the system uses little or big endian could matter. My question is: to what will this be expanded?
If the system has an int of 4 bytes, to which of the following will it be expanded:
0x00000001 OR 0x01000000
And does the endianness affect which of them it will be expanded to?
It is completely architecture-specific. The compiler produces the output according to your computer's endianness. Assuming that your microprocessor is big-endian, the compiler will produce:
char x = 0x01; --> 01,
int x = 0x01; --> 00 00 00 01
On the other hand, if you use a cross compiler that targets little-endian or you compile the code on a little-endian machine, it will produce:
char x = 0x01; --> 01
int x = 0x01; --> 01 00 00 00
In summary, there is no standard or default endianness for the output; the output is specific to the architecture the compiler targets.
The endianness doesn't matter. The endianness only relates to how the value is stored in memory. But the value represented by what's stored in memory will be 0x01, no matter how it is stored.
Take for example the value 0x01020304. It can be stored in memory as either 01 02 03 04 or 04 03 02 01. But in both cases it's still the value 0x01020304 (or 16 909 060).
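A short sketch of that point; the byte dump depends on the machine it runs on, but the value comparison holds either way:
#include <cstdio>
#include <cstring>

int main() {
    int value = 0x01020304;
    unsigned char bytes[sizeof value];
    std::memcpy(bytes, &value, sizeof value);   // copy the object representation
    for (unsigned char b : bytes)
        std::printf("%02x ", b);                // "01 02 03 04" or "04 03 02 01"
    std::printf("\n%s\n", value == 0x01020304 ? "still 0x01020304" : "impossible");
}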

Understanding alignment concept

An alignment is an implementation-defined integer value representing
the number of bytes between successive addresses at which a given
object can be allocated.
That concept is a bit unclear. For instance:
struct B { long double d; };
struct D : virtual B { char c; };
When D is the type of a complete object, it will have a subobject of
type B, so it must be aligned appropriately for a long double.
What does it mean? sizeof(long double) is the number of bytes between what in that case??
Most CPUs have "preferences" about where data can be stored. When reading or writing to a memory address, the operation may be slower (or completely illegal) if the address doesn't match the size of the data you try to access. For example, it is common to require that 4-byte integers be allocated starting at an address that is divisible by 4.
That is, an int stored on address 7 is either less efficient, or completely illegal, depending on your CPU. But if it is stored at address 8, the CPU is happy.
That is what alignment expresses: for any object of type T, what must its address be divisible by in order to satisfy the CPU's requirements?
In C++, the alignment for an object is left implementation-defined (because, as said above, it depends on the CPU architecture). C++ merely says that every object has an alignment, and describes how to determine the alignment of compound objects.
Being "aligned for a long double" simply means that the object must be allocated so that its first byte is placed in an address that is valid for a long double. If the CPU architecture specifies the alignment of a long double to be 10 (for example), then it means that every object with this alignment must be allocated on an address that is divisible by 10.
I completely understand your uncertainty. That might be among the worst attempts I have seen to explain what alignment is.
Think of it in practical terms.
00100000 64 65 66 61 75 6c 74 0a 31 0a 31 0a 31 0a 31 0a |default.1.1.1.1.|
00100010 32 0a 31 0a 33 0a 30 0a 31 0a 34 0a 32 33 0a 31 |2.1.3.0.1.4.23.1|
00100020 39 0a 31 37 0a 35 0a 32 36 0a 32 34 0a 33 0a 38 |9.17.5.26.24.3.8|
00100030 0a 31 32 0a 31 31 0a 31 30 0a 31 34 0a 31 33 0a |.12.11.10.14.13.|
00100040 38 32 0a 38 33 0a 38 34 0a |82.83.84.|
This is a hexdump of arbitrary data. Assume this data has been loaded into memory at the addresses shown. Since the addresses are written in hexadecimal, it is easy to see the alignment boundaries.
The data at 0x100000 and 0x100001 could be accessed as a 16-bit aligned value (with a suitable CPU instruction or a 16-bit data access through a C pointer reference). However, the data at 0x100003 and 0x100004 is not aligned: it straddles the 16-bit word at offsets 2 and 3 and the 16-bit word at offsets 4 and 5. The former access is 16-bit aligned, the latter is not.
Likewise, the 64-bit (8-byte) value at 0x100030..0x100037 is 64-bit aligned, but an 8-byte value starting at any address between 0x100031 and 0x100037 is not.
The alignment characteristics are due to the memory bus hardware which organizes memory accesses into bus cycles. A 16-bit data bus has the ability to fetch 8 bits or 16 bits in one operation, but the latter only if the address is even (least significant address bit is zero). CPUs which allow "odd alignment" perform two successive bus cycles (one even the other odd) to accomplish the operation. Other CPUs simply issue faults for non-aligned bus cycles.
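A brief sketch related to the above: reading a 16-bit value from an odd offset inside a buffer. Dereferencing a cast pointer assumes an aligned address (and may fault or be slow on strict CPUs), whereas std::memcpy lets the compiler generate whatever access the platform allows.
#include <cstdint>
#include <cstring>
#include <cstdio>

int main() {
    // First 8 bytes of the hexdump above ("default\n").
    unsigned char buf[8] = {0x64, 0x65, 0x66, 0x61, 0x75, 0x6c, 0x74, 0x0a};

    std::uint16_t v;
    std::memcpy(&v, buf + 3, sizeof v);   // well-defined even though buf + 3 is an odd address
    std::printf("%04x\n", v);

    // *reinterpret_cast<std::uint16_t*>(buf + 3) would assume 16-bit alignment
    // (and also violates strict aliasing), so it may fault on CPUs that forbid
    // unaligned accesses.
}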

Converting Hex Dump to Double

My program currently prints a hex dump by reading from memory where a double is stored.
It gives me
00 00 00 00 00 50 6D 40
How can I make sense of this and get the value I store, which is 234.5?
I realize there are 64 bits in a double: the first bit is the sign bit, the next 11 are the exponent, and the last 52 are the mantissa:
(-1)^sign * (1.mantissa) * 2^(exponent - 1023)
However, I've tried both little endian and big endian representations of the double and I can't seem to make it work.
The first thing to realize is that most modern processors use a little-endian representation. This means that the last byte of the dump is actually the most significant. So your value, taken as a single hex constant, is 0x406d500000000000.
The sign bit is 0. The next 11 bits are 0x406. The next 52 are 0xd500000000000.
(-1)^sign is 1. 2^(exponent - 1023) is 128. Those are simple.
1.mantissa is hard to evaluate unless you realize what it really means: it is the constant 1.0 followed by the 52 bits of the mantissa as a binary fraction. To convert the mantissa from an integer to that fraction, you divide it by 2^52. 0xd500000000000 / 2^52 is 0.83203125.
Putting it all together: 1 * (1.0 + 0.83203125) * 128 is 234.5.
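Here is the same computation written out as a small program, assuming the dump came from an IEEE-754 binary64 double stored little-endian (which is what the trailing 6D 40 suggests):
#include <cmath>
#include <cstdint>
#include <cstdio>

int main() {
    const unsigned char dump[8] = {0x00, 0x00, 0x00, 0x00, 0x00, 0x50, 0x6D, 0x40};

    // Assemble the 64-bit pattern, least significant byte first (little-endian).
    std::uint64_t bits = 0;
    for (int i = 7; i >= 0; --i)
        bits = (bits << 8) | dump[i];                 // 0x406D500000000000

    std::uint64_t sign     = bits >> 63;              // 0
    std::uint64_t exponent = (bits >> 52) & 0x7FF;    // 0x406 = 1030
    std::uint64_t mantissa = bits & 0xFFFFFFFFFFFFF;  // 0xD500000000000

    // (This sketch ignores zeros, infinities, NaNs and denormals.)
    double value = (sign ? -1.0 : 1.0)
                 * (1.0 + mantissa / 4503599627370496.0)                  // mantissa / 2^52
                 * std::pow(2.0, static_cast<double>(exponent) - 1023.0); // 2^(exponent - 1023)

    std::printf("%g\n", value);                        // prints 234.5
}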
This online calculator can do it for you.
If your system is big-endian, enter
00 00 00 00 00 50 6D 40
or, if your system is little-endian, enter the bytes in reverse order:
40 6D 50 00 00 00 00 00
The first gives a strange number; the second gives 234.5.