What is the size of bitset in C++?

I want to know how bitset actually allocates memory. I read on some blog that it takes up memory in bits. However, when I run the following code:
bitset<3> bits = 001;
cout << sizeof(bits);
I get the output as 4. What is the explanation behind it?
Also is there a method to allocate space in bits in C++?

You can approximate sizeof(bitset<N>) as:
If the internal representation uses 32-bit words (like unsigned on 32-bit systems): 4 * ((N + 31) / 32)
If the internal representation uses 64-bit words (like unsigned long on 64-bit systems): 8 * ((N + 63) / 64)
It seems that the first is true here: 4 * ((3 + 31) / 32) is 4.
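As a quick sanity check, here is a minimal sketch that prints the actual sizes on your implementation (the exact numbers are implementation-defined and depend on the library's internal word size):

#include <bitset>
#include <iostream>

int main() {
    // On a library that stores the bits in 32-bit words this prints
    // 4 4 8 16, matching 4 * ((N + 31) / 32). On a 64-bit-word library
    // you would typically see 8 8 8 16 instead.
    std::cout << sizeof(std::bitset<3>)   << ' '
              << sizeof(std::bitset<32>)  << ' '
              << sizeof(std::bitset<33>)  << ' '
              << sizeof(std::bitset<100>) << '\n';
}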

I get the output as 4. What is the explanation behind it?
The standard says nothing about how bitset must be implemented; it is implementation-defined. Have a look at the <bitset> header of your compiler.
Also is there a method to allocate space in bits in C++?
No, there is no way to allocate space in units of bits in C++; the smallest allocatable unit is the byte.

Your CPU doesn't operate on individual bits, but on bytes and words. In your case, sizeof(bits) is 4 because the compiler decided to align this data structure to 4 bytes.

Typically, on a 32-bit processor the compiler will round the allocated size up to a multiple of 4 bytes; 3 bits fit in well under one byte, so the nearest multiple of 4 bytes is 4.

You cannot address separate bits; the lowest addressable unit is the byte. So no, you cannot allocate bits precisely.
Another thing is padding: you almost always get more bytes allocated than you asked for, for optimization purposes. Accessing data that is not aligned to a 32-bit boundary is often expensive, and on x86/x64 some instructions even raise an exception on misaligned access.
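To illustrate the padding point, a minimal sketch (Padded is a hypothetical struct used only for this example; the exact numbers are implementation-defined, but these are typical on 32- and 64-bit desktop platforms):

#include <iostream>

// On common platforms an int is 4-byte aligned, so the compiler
// usually inserts 3 padding bytes after 'c'.
struct Padded {
    char c;  // 1 byte
    int  i;  // typically 4 bytes
};

int main() {
    // On most mainstream compilers this prints 8 and 4, not 5 and 1.
    std::cout << sizeof(Padded) << ' ' << alignof(Padded) << '\n';
}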

Related


C++ Byte is implementation dependent

I have been reading C Primer Plus.
It is said that:
Note that the meaning of byte is implementation dependent. So a 2-byte int could be 16 bits on one system and 32 bits on another.
I am not sure about this. From my understanding, 1 byte is always 8 bits, so it makes sense that a 2-byte int = 2 * 8 = 16 bits. But from this statement it sounds like some systems define 1 byte as 16 bits. Is that correct?
In general, how should I understand this statement?
The C++ standard, section 1.7 point 1, confirms this:
The fundamental storage unit in the C++ memory model is the byte. A byte is at least large enough to contain any member of the basic execution character set (2.3) and the eight-bit code units of the Unicode UTF-8 encoding form and is composed of a contiguous sequence of bits, the number of which is implementation-defined. (...) The memory available to a C++ program consists of one or more sequences of contiguous bytes. Every byte has a unique address.
Bytes are always composed of at least 8 bits. They can be larger than 8 bits, though this is fairly uncommon.
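If you want to see what your own platform does, a minimal sketch using CHAR_BIT from <climits>:

#include <climits>
#include <iostream>

int main() {
    // CHAR_BIT is the number of bits in a byte on this implementation;
    // it is 8 on virtually every desktop system, but only guaranteed >= 8.
    std::cout << "bits per byte: " << CHAR_BIT << '\n';
    std::cout << "bits per int:  " << sizeof(int) * CHAR_BIT << '\n';
}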
One byte is not always 8-bits. Before octets (the term you want to use if you want to explicitly refer to an 8-bit byte), there were 4-, 6-, and 7-bit bytes. For the purposes of [modern] programming (in pretty much any language), you can assume it is at least 8 bits.
Historically, a byte was not always 8 bits. Today it is, but a long time ago it could be 6, 7, 8, 9... So, to have a language that could exploit the specifics of the hardware (for efficiency) while still letting the user write at a somewhat higher level, the designers had to make sure the int type mapped onto the most natural fit for the hardware.

Why does long integer take more than 4 bytes on some systems?

I understand that the standard says the size of a long integer is implementation-dependent, but I am not sure why.
All it needs to do is to be able to store -2147483647 to 2147483647 or 0 to 4294967295.
Assuming that 1 byte is 8 bits, this should never need more than 4 bytes. Is it safe to say, then, that a long integer will take more than 4 bytes only if a byte has less than 8 bits? Or could there be other possibilities as well? Like maybe inefficient implementations wasting space?
An obvious use for a long larger than 32 bits is to have a larger range available.
For example, before long long int (and company) were in the standard, DEC was selling 64-bit (Alpha) processors and a 64-bit operating system. They built a (conforming) system with:
char = 1 byte
short = 2 bytes
int = 4 bytes
long = 8 bytes
As to why they'd do this: well, an obvious reason was so their customers would have access to a 64-bit type and take advantage of their 64-bit hardware.
The extra bytes aren't a waste of space. A larger range is quite useful. The standard specifies minimum ranges, not the precise range itself; there's nothing wrong with having wider types.
When the standard originally specified an int should be at least 16 bits, common processors had registers no larger than that. Representing a long took two registers and special operations!
But then 32 bits became the norm, and now ints are 32 bits everywhere and longs are 64. Nowadays most processors have 64-bit instructions, and a long can often be stored in a single register.
You're assuming quite a few things, chiefly that a byte is 8 bits wide; in C and C++ a byte is CHAR_BIT bits wide, and CHAR_BIT is only required to be at least 8.
The PDP-10 had bytes ranging from 1 to 36 bits, and the DEC VAX supported operations on 128-bit integer types. So there's plenty of reason to go over and above what the standard mandates.
The limits for the data types are given in §3.9.1/8:
Specializations of the standard template std::numeric_limits (18.3) shall specify the maximum and minimum values of each arithmetic type for an implementation.
Look up the <limits> header.
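A minimal sketch of querying those limits at run time (the values printed depend entirely on your platform):

#include <climits>
#include <iostream>
#include <limits>

int main() {
    // The width and range of long are implementation-defined; the standard
    // only guarantees a minimum range, which is why they differ across systems.
    std::cout << "long is " << sizeof(long) * CHAR_BIT << " bits, range "
              << std::numeric_limits<long>::min() << " to "
              << std::numeric_limits<long>::max() << '\n';
}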
This article by Jack Klein may be of interest to you!
If you want an integer of a specific size, then you want to use the types with the size specified:
int8_t
int16_t
int32_t
int64_t
int128_t
...
These are declared in <stdint.h> in C and <cstdint> in C++. (int128_t is not part of the standard, though some compilers offer a similar extension such as __int128 on GCC and Clang.)
The unsigned versions use a u prefix (uint32_t).
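A minimal sketch using the fixed-width types; note that the exact-width ones are optional in the standard and only exist where the platform can provide them:

#include <cstdint>
#include <iostream>

int main() {
    std::int32_t  a = -123456;            // exactly 32 bits, where provided
    std::uint64_t b = 0xFFFFFFFFFFFFULL;  // exactly 64 bits, where provided

    static_assert(sizeof(a) == 4, "int32_t must be exactly 4 bytes");
    static_assert(sizeof(b) == 8, "uint64_t must be exactly 8 bytes");

    std::cout << a << ' ' << b << '\n';
}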
The others already answered why the size would be so and so.
Note that the newest Intel processors support 256-bit numbers too. What a waste, hey?! 8-)
Oh! And time_t is moving to 64 bits too. In 2038 a signed 32-bit time_t will go negative and give you a date back in 1901... That's a good reason to adopt 64 bits for a few things.
One reason for using an 8-byte integer is to be able to address more than 4 GiB of memory; i.e. 2^32 = 4 gigabytes, while 2^64 = well, it's a lot!
Personally, I've used 8 byte ints for implementing a radix sort on double floats (casting the floats as ints then doing magical things with it that aren't worth describing here. :))

Why are all datatypes a power of 2?

Why are all data type sizes always a power of 2?
Let's take two examples:
short int 16
char 8
Why are they not like the following?
short int 12
That's an implementation detail, and it isn't always the case. Some exotic architectures have non-power-of-two data types. For example, 36-bit words were common at one stage.
The reason powers of two are almost universal these days is that it typically simplifies internal hardware implementations. As a hypothetical example (I don't do hardware, so I have to confess that this is mostly guesswork), the portion of an opcode that indicates how large one of its arguments is might be stored as the power-of-two index of the number of bytes in the argument, thus two bits is sufficient to express which of 8, 16, 32 or 64 bits the argument is, and the circuitry required to convert that into the appropriate latching signals would be quite simple.
The reason why builtin types are those sizes is simply that this is what CPUs support natively, i.e. it is the fastest and easiest. No other reason.
As for structs, you can have variables in there which have (almost) any number of bits, but you will usually want to stay with integral types unless there is a really urgent reason for doing otherwise.
You will also usually want to group identical-size types together and start a struct with the largest types (usually pointers). That will avoid needless padding, and it will make sure you don't hit the access penalties that some CPUs exhibit with misaligned fields (some CPUs may even trigger an exception on unaligned access, but in this case the compiler would add padding to avoid it anyway).
The size of char, short, int, long etc differ depending on the platform. 32 bit architectures tend to have char=8, short=16, int=32, long=32. 64 bit architectures tend to have char=8, short=16, int=32, long=64.
Many DSPs don't have power of 2 types. For example, Motorola DSP56k (a bit dated now) has 24 bit words. A compiler for this architecture (from Tasking) has char=8, short=16, int=24, long=48. To make matters confusing, they made the alignment of char=24, short=24, int=24, long=48. This is because it doesn't have byte addressing: the minimum accessible unit is 24 bits. This has the exciting (annoying) property of involving lots of divide/modulo 3 when you really do have to access an 8 bit byte in an array of packed data.
You'll only find non-power-of-2 in special purpose cores, where the size is tailored to fit a special usage pattern, at an advantage to performance and/or power. In the case of 56k, this was because there was a multiply-add unit which could load two 24 bit quantities and add them to a 48 bit result in a single cycle on 3 buses simultaneously. The entire platform was designed around it.
The fundamental reason most general-purpose architectures use powers of 2 is that they standardized on the octet (the 8-bit byte) as the minimum-size type (aside from flags). There's no reason it couldn't have been 9 bits, and as pointed out elsewhere 24- and 36-bit words were common. This would permeate the rest of the design: if x86 had 9-bit bytes, we'd have 36-octet cache lines, 4608-octet pages, and 569 KB would be enough for everyone :) We probably wouldn't have 'nibbles' though, as you can't divide a 9-bit byte in half.
This is pretty much impossible to do now, though. It's all very well having a system designed like this from the start, but inter-operating with data generated by 8 bit byte systems would be a nightmare. It's already hard enough to parse 8 bit data in a 24 bit DSP.
Well, they are powers of 2 because they are multiples of 8, and this comes (simplifying a little) from the fact that the atomic allocation unit in memory is usually a byte, which (edit: often, but not always) is made up of 8 bits.
Bigger data sizes are made by taking multiple bytes at a time.
So you could have 8-, 16-, 24-, 32-... bit data sizes.
Then, for the sake of memory access speed, only powers of 2 are used as multipliers of the minimum size (8), so you get data sizes along these lines:
8 => 8 * 2^0 bits => char
16 => 8 * 2^1 bits => short int
32 => 8 * 2^2 bits => int
64 => 8 * 2^3 bits => long long int
8 bits is the most common size for a byte (but not the only size, examples of 9 bit bytes and other byte sizes are not hard to find). Larger data types are almost always multiples of the byte size, hence they will typically be 16, 32, 64, 128 bits on systems with 8 bit bytes, but not always powers of 2, e.g. 24 bits is common for DSPs, and there are 80 bit and 96 bit floating point types.
The sizes of the standard integral types are defined as multiples of 8 bits, because a byte is 8 bits (with a few extremely rare exceptions) and the data bus of the CPU is normally a multiple of 8 bits wide.
If you really need 12-bit integers then you could use bit fields in structures (or unions) like this:
struct mystruct
{
    short int twelveBitInt : 12;
    short int threeBitInt  : 3;
    short int bitFlag      : 1;
};
This can be handy in embedded/low-level environments, but bear in mind that the overall size of the structure will still be padded out to a whole number of bytes (and usually to the alignment of the underlying type).
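A quick sketch of how that plays out in practice (the exact size is implementation-defined; on common compilers the three fields share one 16-bit unit):

#include <iostream>

struct mystruct
{
    short int twelveBitInt : 12;
    short int threeBitInt  : 3;
    short int bitFlag      : 1;
};

int main() {
    mystruct m{};
    m.twelveBitInt = 2047;  // largest value a signed 12-bit field can hold

    // The 12 + 3 + 1 = 16 bits of fields typically pack into a single short,
    // so this usually prints 2, but bit-field layout is implementation-defined.
    std::cout << sizeof(mystruct) << ' ' << m.twelveBitInt << '\n';
}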
They aren't necessarily. On some machines and compilers, sizeof(long double) == 12 (96 bits).
It's not necessary that all data types use a power of 2 as their number of bits. For example, long double uses 80 bits (though how much storage is actually allocated for it is implementation-dependent).
One advantage you gain by using powers of 2 is that larger data types can be composed of smaller ones. For example, 4 chars (8 bits each) can make up an int (32 bits). In fact, some compilers used to simulate 64-bit numbers using two 32-bit numbers.
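As an illustration of composing a wider value from narrower ones (a sketch of the idea, not how any particular compiler implemented it):

#include <cstdint>
#include <iostream>

int main() {
    std::uint32_t high = 0x00000001;  // upper 32 bits
    std::uint32_t low  = 0x00000000;  // lower 32 bits

    // Shift the high half up by 32 bits and OR in the low half.
    std::uint64_t combined = (static_cast<std::uint64_t>(high) << 32) | low;

    std::cout << combined << '\n';  // prints 4294967296, i.e. 2^32
}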
Most of the time, your computer tries to keep every data format at either a whole multiple (2, 3, 4...) or a whole fraction (1/2, 1/3, 1/4...) of the machine word size. It does this so that each time it loads N data words it loads a whole number of items for you. That way, it doesn't have to recombine parts later on.
You can see this in the x86 for example:
a char is 1/4th of 32-bits
a short is 1/2 of 32-bits
an int / long are a whole 32 bits
a long long is 2x 32 bits
a float is a single 32-bits
a double is two times 32-bits
a long double may be either three or four times 32 bits, depending on your compiler settings. This is because on a 32-bit machine it takes three native machine words (so no overhead) to load 96 bits. On a 64-bit machine 96 bits would be 1.5 native machine words, so 128 bits is more efficient (no recombining). The actual data content of a long double on x86 is 80 bits, so both of these are already padded.
A last aside: the computer doesn't always load data in its native word size. It first fetches a cache line and then reads from that in native machine words. The cache line is larger, usually 64 or 128 bytes. It's very useful to have a meaningful piece of data fit into it rather than straddle the edge, because then you'd have to load two whole cache lines to read it. That's why most computer structures are a power of two in size: they will fit into any power-of-two-sized storage either half, completely, doubled or more, and you're guaranteed never to end up straddling a boundary.
There are a few cases where integral types must be an exact power of two. If the exact-width types in <stdint.h> exist, such as int16_t or uint32_t, their widths must be exactly that size, with no padding. Floating-point math that declares itself to follow the IEEE standard forces float and double to be powers of two (although long double often is not). There are additionally types char16_t and char32_t in the standard library now, or built-in to C++, defined as exact-width types. The requirements about support for UTF-8 in effect mean that char and unsigned char have to be exactly 8 bits wide.
In practice, a lot of legacy code would already have broken on any machine that didn’t support types exactly 8, 16, 32 and 64 bits wide. For example, any program that reads or writes ASCII or tries to connect to a network would break.
Some historically-important mainframes and minicomputers had native word sizes that were multiples of 3, not powers of two, particularly the DEC PDP-6, PDP-8 and PDP-10.
This was the main reason that base 8 used to be popular in computing: since each octal digit represented three bits, a 9-, 12-, 18- or 36-bit pattern could be represented more neatly by octal digits than decimal or hex. For example, when using base-64 to pack characters into six bits instead of eight, each packed character took up two octal digits.
The two most visible legacies of those architectures today are that, by default, character escapes such as '\123' are interpreted as octal rather than decimal in C, and that Unix file permissions/masks are represented as three or four octal digits.
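That octal legacy is easy to see in a couple of lines (a small sketch; 0644 is just the familiar rw-r--r-- permission value used as an example):

#include <iostream>

int main() {
    char c = '\123';   // octal escape: 123 octal = 83 decimal = 'S'
    int perms = 0644;  // a leading 0 means octal, as in Unix permission masks

    std::cout << c << ' ' << perms << '\n';  // prints "S 420"
}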

Endianness in casting an array of two bytes into a single short

Problem: I cannot understand the number 256 (2^8) in the extract of the IBM article:
On the other hand, if it's a big-endian system, the high byte is 1 and the value of x is 256.
Assume each element in an array consumes 4 bits; then the processor should read somehow: 1000 0000. If it is big-endian, it is 0001 0000, because endianness does not affect bits inside bytes. [2] Contradiction to the 256 in the article!?
Question: Why is the number 256_dec (=1000 0000_bin) and not 32_dec (=0001 0000_bin)?
[2] Endian issues do not affect sequences that have single bytes, because "byte" is considered an atomic unit from a storage point of view.
Because a byte is 8 bits, not 4. The 9th least significant bit in an unsigned int will have value 2^(9-1)=256. (the least significant has value 2^(1-1)=1).
From the IBM article:
unsigned char endian[2] = {1, 0};
short x;
x = *(short *) endian;
They're correct; the value is (short)256 on big-endian, or (short)1 on little-endian.
Writing out the bits, it's an array of {00000001_{base2}, 00000000_{base2}}. Big-endian would interpret that byte array read left to right; little-endian would swap the two bytes.
256dec is not 1000_0000bin, it's 0000_0001_0000_0000bin.
With swapped bytes (1 byte = 8 bits) this looks like 0000_0000_0000_0001bin, which is 1dec.
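A minimal sketch of the same experiment, using memcpy instead of the pointer cast to sidestep alignment and aliasing issues (assuming a platform that provides uint16_t):

#include <cstdint>
#include <cstring>
#include <iostream>

int main() {
    unsigned char endian[2] = {1, 0};
    std::uint16_t x;
    std::memcpy(&x, endian, sizeof x);  // copy the two bytes into a 16-bit integer

    // Little-endian machines (e.g. x86) print 1; big-endian machines print 256.
    std::cout << x << '\n';
}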
Answering your followup question: briefly, there is no "default size of an element in an array" in most programming languages.
In C (perhaps the most popular programming language), the size of an array element -- or anything, really -- depends on its type. For an array of char, the elements are usually 1 byte. But for other types, the size of each element is whatever the sizeof() operator gives. For example, many C implementations give sizeof(short) == 2, so if you make an array of short, it will then occupy 2*N bytes of memory, where N is the number of elements.
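For instance, a quick sketch of how the element size determines the memory an array occupies (assuming a typical implementation where sizeof(short) == 2):

#include <iostream>

int main() {
    short values[10];

    // The array occupies element-size times element-count bytes.
    std::cout << sizeof(values[0]) << '\n';                   // typically 2
    std::cout << sizeof(values) << '\n';                      // typically 20
    std::cout << sizeof(values) / sizeof(values[0]) << '\n';  // always 10
}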
Many high-level languages discourage you from even attempting to discover how many bytes an element of an array requires. Giving a fixed number of bytes ties the designers' hands to always using that many bytes, which is good for transparency and code that relies on its binary representation, but bad for backward compatibility whenever some reason comes along to change the representation.
Hope that helps. (I didn't see the other comments until after I wrote the first version of this.)