How can a program on a 16-bit system access integers greater than 65535 but not addresses? - c++

A 16-bit system can (normally) only access up to 64 KB of RAM. The idea is that a 16-bit system can form 2^16 distinct memory addresses, and likewise an unsigned integer can only hold 2^16 = 65536 values (0 to 65535). Thus, after a small calculation, a 16-bit system can only use addresses up to 64 KB. Now the main question: when we define an integer as a 'long int', how can it hold integers greater than 65535?

There are a bunch of misconceptions in this post:
I came to know in previous days that a 16 bit system can only access RAM upto 64kbytes
This is factually wrong: the 8086 has an external address bus of 20 bits, so it can access 1,048,576 bytes (~1 MB). You can read more about the 8086 architecture here: https://en.wikipedia.org/wiki/Intel_8086.
Is that when we define an integer to be 'long int' than how can it access integers more than 65535?
Are you asking about register size? In that case the answer is easy: it doesn't. It can access the first 16 bits, and then it can access the other 16 bits, and whatever the application does with those two 16-bit values is up to it (and the framework used, like the C runtime).
As to how you can access the full 20-bit address space with just 16-bit registers, the answer is segmentation. You have a second register (CS, DS, SS, or ES on the 8086) that stores the high part of the address, and the CPU "stitches" the two together (segment * 16 + offset) to form the address it sends to the memory controller.
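As a minimal sketch of how that stitching works (the function name is made up, but the segment * 16 + offset rule is the 8086's real-mode behavior):

#include <cstdint>
#include <cstdio>

// 8086 real mode: physical address = segment * 16 + offset, a 20-bit result.
uint32_t physical_address(uint16_t segment, uint16_t offset) {
    return (static_cast<uint32_t>(segment) << 4) + offset;
}

int main() {
    // 0xF000:0xFFF0 is the 8086 reset vector, i.e. physical address 0xFFFF0.
    std::printf("0x%05X\n", physical_address(0xF000, 0xFFF0));
}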

Computers can perform arithmetic on values larger than a machine word in much the same way as humans can perform arithmetic on values larger than a digit: by splitting operations into multiple parts, and keeping track of "carries" that would move data between them.
On the 8086, for example, if AX holds the bottom half of a 32-bit number and DX holds the top half, the sequence:
ADD AX,[someValue]
ADC DX,[someValue+2]
will add to DX::AX the 32-bit value whose lower half is at address [someValue] and whose upper half is at [someValue+2]. The ADD instruction will update a "carry" flag indicating whether there was a carry out from the addition, and the ADC instruction will add an extra 1 if the carry flag was set.
Some processors don't have a carry flag, but have an instruction that will compare two registers, and set a third register to 1 if the first was greater than the second, and 0 otherwise. On those processors, if one wants to add R1::R0 to R3::R2 and place the result in R5::R4, one can use the sequence:
Add R0 to R2 and store the result in R4
Set R5 to 1 if R4 is less than R0 (will happen if there was a carry), and 0 otherwise
Add R1 to R5, storing the result in R5
Add R3 to R5, storing the result in R5
Four times as slow as a normal single-word addition, but still at least somewhat practical. Note that while the carry-flag approach is easily extensible to operate on numbers of any size, extending this approach beyond two words is much harder.
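Here is a minimal C++ sketch of that idea, adding two 32-bit values stored as pairs of 16-bit halves; the comparison plays the role of the carry flag, and all the names are made up for illustration:

#include <cstdint>
#include <cstdio>

// Add two 32-bit numbers, each stored as {low, high} 16-bit halves.
void add32(uint16_t a_lo, uint16_t a_hi,
           uint16_t b_lo, uint16_t b_hi,
           uint16_t& r_lo, uint16_t& r_hi) {
    r_lo = static_cast<uint16_t>(a_lo + b_lo);    // "Add R0 to R2 and store the result in R4"
    uint16_t carry = (r_lo < a_lo) ? 1 : 0;       // "Set R5 to 1 if R4 is less than R0"
    r_hi = static_cast<uint16_t>(a_hi + carry);   // "Add R1 to R5"
    r_hi = static_cast<uint16_t>(r_hi + b_hi);    // "Add R3 to R5"
}

int main() {
    // 0x0001FFFF + 0x00000001 = 0x00020000: the low-half addition carries into the high half.
    uint16_t lo, hi;
    add32(0xFFFF, 0x0001, 0x0001, 0x0000, lo, hi);
    std::printf("0x%04X%04X\n", (unsigned)hi, (unsigned)lo);   // prints 0x00020000
}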

Related

Are there any performance differences between representing a number using a (4 byte) `int` and a 4 element unsigned char array?

Assuming an int in C++ is represented by 4 bytes, and an unsigned char is represented by 1 byte, you could represent an int with an array of unsigned char with 4 elements right?
My question is, are there any performance downsides to representing a number with an array of unsigned char? Like if you wanted to add two numbers together would it be just as fast to do int + int compared to adding each element in the array and dealing with carries manually?
This is just me trying to experiment and to practice working with bytes rather than some practical application.
There will be many performance downsides on any kind of manipulation using the 4-byte array. For example, take simple addition: almost any CPU these days will have a single instruction that adds two 32-bit integers, in one (maybe two) CPU cycle(s). To emulate that with your 4-byte array, you would need at least 4 separate CPU instructions.
Further, many CPUs actually work faster with 32- or 64-bit data than they do with 8-bit data - because their internal registers are optimized for 32- and 64-bit operands.
Let's scale your question up: is there any performance difference between a single addition of two 16-byte variables and four separate additions of 4-byte variables? And here comes the concept of vector registers and vector instructions (MMX, SSE, AVX). It's pretty much the same story: SIMD is faster, because there are literally fewer instructions to execute and the whole operation is done by dedicated hardware. On top of that, in your case you also have to take into account that modern CPUs don't work with 1-byte variables; they still process 32 or 64 bits at once anyway. So effectively you would do 4 individual additions using 4-byte registers, only to use the single lowest byte each time and then manually handle the carry bit. Yes, that will be very slow.
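To make that concrete, here is a minimal sketch (hypothetical names) of the manual byte-array addition next to a plain int addition; the byte version needs a loop and explicit carry handling for what the int version does in one instruction:

#include <cstdint>
#include <cstdio>

// Add two little-endian 4-byte numbers byte by byte, propagating the carry manually.
void add_bytes(const unsigned char a[4], const unsigned char b[4], unsigned char out[4]) {
    unsigned carry = 0;
    for (int i = 0; i < 4; ++i) {
        unsigned sum = a[i] + b[i] + carry;   // at most 255 + 255 + 1
        out[i] = static_cast<unsigned char>(sum & 0xFF);
        carry = sum >> 8;
    }
}

int main() {
    uint32_t x = 0x01FFFFFF, y = 1;
    uint32_t z = x + y;                               // one instruction on most CPUs

    unsigned char a[4] = {0xFF, 0xFF, 0xFF, 0x01};    // 0x01FFFFFF, little-endian
    unsigned char b[4] = {0x01, 0x00, 0x00, 0x00};
    unsigned char c[4];
    add_bytes(a, b, c);                               // several instructions plus a loop

    std::printf("%08X vs %02X%02X%02X%02X\n", z, c[3], c[2], c[1], c[0]);   // both 02000000
}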

How do I efficiently reorder bytes of a __m256i vector (convert int32_t to uint8_t)?

I need to optimize the following compression operation (on a server with AVX2 instructions available):
take the exponents of an array of floats, shift and store to a uint8_t array
I have little experience and it was suggested that I start with the https://github.com/feltor-dev/vcl library.
So far I have:
uint8_t* uin8_t_ptr = ...;
float* float_ptr = ...;
float* final_ptr = float_ptr + offset;
for (; float_ptr < final_ptr; float_ptr+=8) {
    Vec8f vec_f = Vec8f().load(float_ptr);
    Vec8i vec_i = exponent(vec_f) + 128; // range: 0~255
    ...
}
My question is how to efficiently store the vec_i results to the uint8_t array?
I couldn't find relevant functions in the vcl library and was trying to explore the intrinsic instructions since I could access the __m256i data.
My current understanding is to use something like _mm256_shuffle_epi8, but don't know the best way to do it efficiently.
I wonder if trying to fully utilize the bits and store 32 elements every time (using a loop with float_ptr+=32) would be the way to go.
Any suggestions are welcome. Thanks.
Probably your best bet for vectorization of this might be with vpackssdw / vpackuswb, and vpermd as a lane-crossing fixup after in-lane pack.
_mm256_srli_epi32 to shift the exponent (and sign bit) to the bottom in each 32-bit element. A logical shift leaves a non-negative result regardless of the sign bit.
Then pack pairs of vectors down to 16-bit with _mm256_packs_epi32 (signed input, signed saturation of output).
Then mask off the sign bit, leaving an 8-bit exponent. We wait until now so we can do 16x uint16_t elements per instruction instead of 8x uint32_t. Now you have 16-bit elements holding values that fit in uint8_t without overflowing.
Then pack pairs of vectors down to 8-bit with _mm256_packus_epi16 (signed input, unsigned saturation of output). This actually matters, packs would clip some valid values because your data uses the full range of uint8_t.
VPERMD to shuffle the eight 32-bit chunks of that vector that came from each lane of 4x 256-bit input vectors. Exactly the same __m256i lanefix = _mm256_permutevar8x32_epi32(abcd, _mm256_setr_epi32(0,4, 1,5, 2,6, 3,7)); shuffle as in How to convert 32-bit float to 8-bit signed char?, which does the same pack after using FP->int conversion instead of right-shift to grab the exponent field.
Per result vector, you have 4x load+shift (vpsrld ymm,[mem] hopefully), 2x vpackssdw shuffles, 2x vpand mask, 1x vpackuswb, and 1x vpermd. That's 4 shuffles, so the best we can hope for on Intel HSW/SKL is 1 result vector per 4 clocks. (Ryzen has better shuffle throughput, except for vpermd which is expensive.)
But that should be achievable, so 32 bytes of input / 8 bytes of output per clock on average.
The 10 total vector ALU uops (including the micro-fused load+ALU), and the 1 store should be able to execute in that time. We have room for 16 total uops including loop overhead before the front-end becomes a worse bottleneck than shuffles.
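A rough sketch of that sequence with raw intrinsics might look like the following (assuming the goal is just the biased 8-bit exponent field of each float; the unbiasing mentioned in the update below is left out, and the function name is made up):

#include <immintrin.h>
#include <cstdint>
#include <cstddef>

// Extract the biased exponent byte of 32 floats per iteration (n must be a multiple of 32).
void pack_exponents(const float* src, uint8_t* dst, size_t n) {
    const __m256i lanefix   = _mm256_setr_epi32(0, 4, 1, 5, 2, 6, 3, 7);
    const __m256i byte_mask = _mm256_set1_epi16(0x00FF);
    for (size_t i = 0; i < n; i += 32) {
        // 4x load + shift: move sign+exponent down to the low 9 bits of each dword.
        __m256i a = _mm256_srli_epi32(_mm256_castps_si256(_mm256_loadu_ps(src + i)),      23);
        __m256i b = _mm256_srli_epi32(_mm256_castps_si256(_mm256_loadu_ps(src + i + 8)),  23);
        __m256i c = _mm256_srli_epi32(_mm256_castps_si256(_mm256_loadu_ps(src + i + 16)), 23);
        __m256i d = _mm256_srli_epi32(_mm256_castps_si256(_mm256_loadu_ps(src + i + 24)), 23);

        // 2x vpackssdw: dword -> word (values are at most 0x1FF, so nothing saturates).
        __m256i ab = _mm256_packs_epi32(a, b);
        __m256i cd = _mm256_packs_epi32(c, d);

        // 2x vpand: drop the sign bit, leaving the 8-bit biased exponent in each word.
        ab = _mm256_and_si256(ab, byte_mask);
        cd = _mm256_and_si256(cd, byte_mask);

        // 1x vpackuswb: word -> byte, then 1x vpermd to undo the in-lane interleaving.
        __m256i abcd = _mm256_packus_epi16(ab, cd);
        abcd = _mm256_permutevar8x32_epi32(abcd, lanefix);

        _mm256_storeu_si256(reinterpret_cast<__m256i*>(dst + i), abcd);
    }
}

Feeding it a buffer whose length is a multiple of 32 floats keeps the loop simple; a scalar tail loop would handle any remainder.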
update: oops, I forgot to count unbiasing the exponent; that will take an extra add. But you can do that after packing down to 8-bit. (And optimize it to an XOR). I don't think we can optimize it away or into something else, like into masking away the sign bit.
With AVX512BW, you could do a byte-granularity vpaddb to unbias, with zero-masking to zero the high byte of each pair. That would fold the unbiasing into the 16-bit masking.
AVX512F also has vpmovdb 32->8 bit truncation (without saturation), but only for single inputs. So you'd get one 64-bit or 128-bit result from one input 256 or 512-bit vector, with 1 shuffle + 1 add per input instead of 2+1 shuffles + 2 zero-masked vpaddb per input vector. (Both need the right shift per input vector to align the 8-bit exponent field with a byte boundary at the bottom of a dword)
With AVX512VBMI, vpermt2b would let us grab bytes from 2 input vectors. But it costs 2 uops on CannonLake, so it is only useful on hypothetical future CPUs if it gets cheaper. The bytes it grabs can be the top byte of a dword, so we could start with a vpaddd of a vector to itself to left-shift by 1. But we're probably best off with an actual left-shift, because the EVEX encoding of vpslld or vpsrld can take the data from memory with an immediate shift count, unlike the VEX encoding; so hopefully we get a single micro-fused load+shift uop to save front-end bandwidth.
The other option is to shift + blend, resulting in byte-interleaved results that are more expensive to fix up, unless you don't mind that order.
And byte-granularity blending (without AVX512BW) requires vpblendvb which is 2 uops. (And on Haswell only runs on port 5, so potentially a huge bottleneck. On SKL it's 2 uops for any vector ALU port.)

How does a 32-bit machine compute a double precision number

If I only have a 32-bit machine, how does the CPU compute a double precision number? This number is 64 bits wide. How does an FPU handle it?
The more general question would be: how do you compute something that is wider than your ALU? I fully understand the integer way: you can simply split the numbers up. Yet with floating point numbers, you have the exponent and the mantissa, which should be handled differently.
Not everything in a "32-bit machine" has to be 32 bits. The x87-style FPU has never been "32-bit", and its inception was a very long time before AMD64 was created. It was always capable of doing math on 80-bit extended doubles, and it used to be a separate chip, so there was no chance of using the main ALU at all.
It's wider than the ALU, yes, but it doesn't go through the ALU; the floating point unit(s) use their own circuits, which are as wide as they need to be. These circuits are also much more complicated than the integer circuits, and they don't really share components with the integer ALUs.
There are several different concepts in a computer architecture that can be measured in bits, but none of them prevent handling 64-bit floating point numbers. Although these concepts may be correlated, it is worth considering them separately for this question.
Often, "32 bit" means that addresses are 32 bits. That limits each process's virtual memory to 2^32 addresses. It is the measure that makes the most direct difference to programs, because it affects the size of a pointer and the maximum size of in-memory data. It is completely irrelevant to the handling of floating point numbers.
Another possible meaning is the width of the paths that transfer data between memory and the CPU. That is not a hard limit on the sizes of data structures - one data item may take multiple transfers. For example, the Java Language Specification does not require atomic loads and stores of double or long. See 17.7. Non-Atomic Treatment of double and long. A double can be moved between memory and the processor using two separate 32 bit transfers.
A third meaning is the general register size. Many architectures use separate registers for floating point. Even if the general registers are only 32 bits the floating point registers can be wider, or it may be possible to pair two 32 bit floating point registers to represent one 64-bit number.
A typical relationship between these concepts is that a computer with 64 bit memory addresses will usually have 64 bit general registers, so that a pointer can fit in one general register.
Even 8-bit computers provided extended precision (80-bit) floating point arithmetic, with code doing the calculations in software.
Modern 32 bit computers (x86, ARM, older PowerPC etc.) have 32 bit integer and 64 or 80 bit floating-point hardware.
Let's look at integer arithmetic first, since it is simpler. Inside your 32-bit ALU there are 32 individual logic units with carry bits that spill up the chain: 1 + 1 -> 10, with the carry bit carried over to the next logic unit. The entire ALU also has a carry bit output, and you can use this to do arbitrary-length math. The only real limitation on the bit width is how many bits you can work with in one cycle. To do 64-bit math you need 2 or more cycles and have to do the carry logic yourself.
It seems that the question is just "how does FPU work?", regardless of bit widths.
FPU does addition, multiplication, division, etc. Each of them has a different algorithm.
Addition
(also subtraction)
Given two numbers with exponent and mantissa:
x1 = m1 * 2 ^ e1
x2 = m2 * 2 ^ e2
, the first step is normalization:
x1 = m1 * 2 ^ e1
x2 = (m2 * 2 ^ (e2 - e1)) * 2 ^ e1 (assuming e2 > e1)
Then one can add the mantissas:
x1 + x2 = (whatever) * 2 ^ e1
Then, one should convert the result to a valid mantissa/exponent form (e.g., the (whatever) part might be required to be between 2^23 and 2^24). This is called "renormalization" if I am not mistaken. Here one should also check for overflow and underflow.
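A minimal C++ sketch of these addition steps for IEEE 754 single precision (normal, positive, finite inputs only; no signs, NaN/Inf, subnormals or rounding), just to show the normalization, mantissa addition and renormalization described above:

#include <cstdint>
#include <cstdio>
#include <cstring>
#include <utility>

// Add two single-precision floats using only 32-bit integer operations.
float soft_add(float a, float b) {
    uint32_t ua, ub;
    std::memcpy(&ua, &a, sizeof ua);
    std::memcpy(&ub, &b, sizeof ub);

    // Decompose: 8-bit biased exponent, 23-bit mantissa with the implicit leading 1 restored.
    uint32_t e1 = (ua >> 23) & 0xFF, m1 = (ua & 0x7FFFFF) | 0x800000;
    uint32_t e2 = (ub >> 23) & 0xFF, m2 = (ub & 0x7FFFFF) | 0x800000;

    // Normalization: shift the smaller operand's mantissa right so both share the larger exponent.
    if (e2 > e1) { std::swap(e1, e2); std::swap(m1, m2); }
    uint32_t shift = e1 - e2;
    m2 = (shift < 32) ? (m2 >> shift) : 0;

    // Add the mantissas.
    uint32_t m = m1 + m2;
    uint32_t e = e1;

    // Renormalization: the sum may reach 2^24, so shift back into [2^23, 2^24) and bump the exponent.
    if (m & 0x1000000) { m >>= 1; ++e; }

    uint32_t ur = (e << 23) | (m & 0x7FFFFF);
    float r;
    std::memcpy(&r, &ur, sizeof r);
    return r;
}

int main() {
    std::printf("%g\n", soft_add(1.5f, 2.25f));   // prints 3.75
}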
Multiplication
Just multiply the mantissas and add the exponents. Then renormalize the multiplied mantissas.
Division
Do a "long division" algorithm on the mantissas, then subtract the exponents. Renormalization might not be necessary (depending on how you implement the long division).
Sine/Cosine
Convert the input to a range [0...π/2], then run the CORDIC algorithm on it.
Etc.

Difference between byte flip and byte swap

I am trying to find the difference because of the byte flip functionality I see in the Calculator on Mac in Programmer's view.
So I wrote a program to byte swap a value, which is what we do to go from little endian to big endian or the other way round, and I call that a byte swap. But when I see byte flip I do not understand what exactly it is and how it is different from byte swap. I did confirm that the results are different.
For example, for an int with value 12976128
Byte Flip gives me 198;
Byte swap gives me 50688.
I want to implement an algorithm for byte flip, since 198 is the value I want to get when reading something. Everything I find on Google treats byte flip as the same thing as byte swap, which isn't the case for me.
Byte flip and byte swap are synonyms.
The results you see are just two different ways of swapping the bytes, depending on whether you look at the number as a 32bit number (consisting of 4 bytes), or as the smallest size of a number that can hold 12976128, which is 24 bits or 3 bytes.
The 4-byte swap is more usual in computer culture, because 32-bit processors are currently predominant (even 64-bit architectures still do most of their mathematics on 32-bit numbers, partly because of backward-compatible software infrastructure, partly because it is enough for many practical purposes). But the Mac Calculator seems to use the minimum-width swap, in this case a 3-byte swap.
12976128, when converted to hexadecimal, gives you 0xC60000. That's 3 bytes total; each hexadecimal digit is 4 bits, or half a byte wide. The bytes to be swapped are 0xC6, zero, and another zero.
After the 3-byte swap: 0x0000C6 = 198
After the 4-byte swap: 0x0000C600 = 50688
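For example, a minimal sketch of the two interpretations (the helper names are made up; which one the Calculator actually implements is an assumption based on the numbers above):

#include <cstdint>
#include <cstdio>

// Classic 32-bit byte swap: reverse all 4 bytes.
uint32_t swap32(uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000FF00u)
         | ((v << 8) & 0x00FF0000u) | (v << 24);
}

// "Flip": reverse only as many bytes as are needed to hold the value (3 for 0xC60000).
uint32_t swap_min_width(uint32_t v) {
    int nbytes = 1;
    for (uint32_t t = v; t > 0xFF; t >>= 8) ++nbytes;
    uint32_t r = 0;
    for (int i = 0; i < nbytes; ++i)
        r = (r << 8) | ((v >> (8 * i)) & 0xFF);
    return r;
}

int main() {
    uint32_t x = 12976128;                    // 0x00C60000
    std::printf("%u\n", swap_min_width(x));   // 198   (3-byte flip)
    std::printf("%u\n", swap32(x));           // 50688 (4-byte swap)
}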

Why are all datatypes a power of 2?

Why are all data type sizes always a power of 2?
Let's take two examples:
short int 16
char 8
Why are they not like the following?
short int 12
That's an implementation detail, and it isn't always the case. Some exotic architectures have non-power-of-two data types. For example, 36-bit words were common at one stage.
The reason powers of two are almost universal these days is that it typically simplifies internal hardware implementations. As a hypothetical example (I don't do hardware, so I have to confess that this is mostly guesswork), the portion of an opcode that indicates how large one of its arguments is might be stored as the power-of-two index of the number of bytes in the argument, thus two bits is sufficient to express which of 8, 16, 32 or 64 bits the argument is, and the circuitry required to convert that into the appropriate latching signals would be quite simple.
The reason why builtin types are those sizes is simply that this is what CPUs support natively, i.e. it is the fastest and easiest. No other reason.
As for structs, you can have variables in there which have (almost) any number of bits, but you will usually want to stay with integral types unless there is a really urgent reason for doing otherwise.
You will also usually want to group identical-size types together and start a struct with the largest types (usually pointers). That will avoid needless padding and it will make sure you don't hit the access penalties that some CPUs exhibit with misaligned fields (some CPUs may even trigger an exception on unaligned access, but in that case the compiler would add padding to avoid it anyway).
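For example (sizes are ABI dependent, but on a typical 64-bit platform with 8-byte pointers and 4-byte int), ordering members from largest to smallest removes the padding:

#include <cstdio>

struct Bad {          // typical 64-bit layout: 1 + 7 padding + 8 + 4 + 4 tail padding = 24 bytes
    char  c;          // 1 byte, then 7 bytes of padding so the pointer is 8-byte aligned
    void* p;          // 8 bytes
    int   i;          // 4 bytes, then 4 bytes of tail padding
};

struct Good {         // largest members first: 8 + 4 + 1 + 3 tail padding = 16 bytes
    void* p;
    int   i;
    char  c;
};

int main() {
    std::printf("%zu %zu\n", sizeof(Bad), sizeof(Good));   // typically prints 24 16
}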
The size of char, short, int, long etc differ depending on the platform. 32 bit architectures tend to have char=8, short=16, int=32, long=32. 64 bit architectures tend to have char=8, short=16, int=32, long=64.
Many DSPs don't have power of 2 types. For example, Motorola DSP56k (a bit dated now) has 24 bit words. A compiler for this architecture (from Tasking) has char=8, short=16, int=24, long=48. To make matters confusing, they made the alignment of char=24, short=24, int=24, long=48. This is because it doesn't have byte addressing: the minimum accessible unit is 24 bits. This has the exciting (annoying) property of involving lots of divide/modulo 3 when you really do have to access an 8 bit byte in an array of packed data.
You'll only find non-power-of-2 in special purpose cores, where the size is tailored to fit a special usage pattern, at an advantage to performance and/or power. In the case of 56k, this was because there was a multiply-add unit which could load two 24 bit quantities and add them to a 48 bit result in a single cycle on 3 buses simultaneously. The entire platform was designed around it.
The fundamental reason most general purpose architectures use powers-of-2 is because they standardized on the octet (8 bit bytes) as the minimum size type (aside from flags). There's no reason it couldn't have been 9 bit, and as pointed out elsewhere 24 and 36 bit were common. This would permeate the rest of the design: if x86 was 9 bit bytes, we'd have 36 octet cache lines, 4608 octet pages, and 569KB would be enough for everyone :) We probably wouldn't have 'nibbles' though, as you can't divide a 9 bit byte in half.
This is pretty much impossible to do now, though. It's all very well having a system designed like this from the start, but inter-operating with data generated by 8 bit byte systems would be a nightmare. It's already hard enough to parse 8 bit data in a 24 bit DSP.
Well, they are powers of 2 because they are multiples of 8, and this comes (simplifying a little) from the fact that the atomic allocation unit in memory is usually a byte, which (edit: often, but not always) is made of 8 bits.
Bigger data sizes are made taking multiple bytes at a time.
So you could have 8,16,24,32... data sizes.
Then, for the sake of memory access speed, only powers of 2 are used as a multiplier of the minimum size (8), so you get data sizes along these lines:
8 => 8 * 2^0 bits => char
16 => 8 * 2^1 bits => short int
32 => 8 * 2^2 bits => int
64 => 8 * 2^3 bits => long long int
8 bits is the most common size for a byte (but not the only size, examples of 9 bit bytes and other byte sizes are not hard to find). Larger data types are almost always multiples of the byte size, hence they will typically be 16, 32, 64, 128 bits on systems with 8 bit bytes, but not always powers of 2, e.g. 24 bits is common for DSPs, and there are 80 bit and 96 bit floating point types.
The sizes of standard integral types are defined as multiple of 8 bits, because a byte is 8-bits (with a few extremely rare exceptions) and the data bus of the CPU is normally a multiple of 8-bits wide.
If you really need 12-bit integers then you could use bit fields in structures (or unions) like this:
struct mystruct
{
    short int twelveBitInt : 12;
    short int threeBitInt : 3;
    short int bitFlag : 1;
};
This can be handy in embedded/low-level environments - but bear in mind that the overall size of the structure will still be padded out to the full size.
They aren't necessarily. On some machines and compilers, sizeof(long double) == 12 (96 bits).
It's not necessary that all data types use a power of 2 as their number of bits. For example, long double uses 80 bits (though how many bits are actually allocated for it is implementation dependent).
One advantage you gain from using powers of 2 is that larger data types can be built out of smaller ones. For example, 4 chars (8 bits each) can make up an int (32 bits). In fact, some compilers used to simulate 64-bit numbers using two 32-bit numbers.
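For instance, a minimal sketch of building one 32-bit value out of four 8-bit pieces (little-endian order chosen arbitrarily):

#include <cstdint>
#include <cstdio>

// Combine four bytes into one 32-bit integer, least-significant byte first.
uint32_t from_bytes(const unsigned char b[4]) {
    return static_cast<uint32_t>(b[0])
         | (static_cast<uint32_t>(b[1]) << 8)
         | (static_cast<uint32_t>(b[2]) << 16)
         | (static_cast<uint32_t>(b[3]) << 24);
}

int main() {
    unsigned char bytes[4] = {0x78, 0x56, 0x34, 0x12};
    std::printf("0x%08X\n", from_bytes(bytes));   // prints 0x12345678
}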
Most of the time, your computer tries to keep all data formats at either a whole multiple (2, 3, 4...) or a whole fraction (1/2, 1/3, 1/4...) of the machine data size. It does this so that each time it loads N data words, it loads an integer number of your data items. That way, it doesn't have to recombine parts later on.
You can see this in the x86 for example:
a char is 1/4th of 32-bits
a short is 1/2 of 32-bits
an int / long are a whole 32 bits
a long long is 2x 32 bits
a float is a single 32-bits
a double is two times 32-bits
a long double may be either three or four times 32 bits, depending on your compiler settings. This is because on 32-bit machines it takes three native machine words (so no overhead) to load 96 bits. On 64-bit machines that would be 1.5 native machine words, so 128 bits is more efficient (no recombining). The actual data content of a long double on x86 is 80 bits, so both of these are already padded.
A last aside: the computer doesn't always load data in its native data size. It first fetches a cache line and then reads from that in native machine words. The cache line is larger, usually around 64 or 128 bytes. It's very useful to have a meaningful piece of data fit into this and not straddle the edge, since then you'd have to load two whole cache lines to read it. That's why most computer structures are a power of two in size; they will fit in any power-of-two-sized storage either half, completely, double or more, so you're guaranteed to never end up on a boundary.
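As a small illustration, keeping a structure a power-of-two size and cache-line aligned guarantees it never straddles a boundary (the 64-byte line size here is an assumption about the target CPU):

#include <cstdio>

struct alignas(64) Slot {   // padded and aligned to one full 64-byte cache line
    int values[14];         // 56 bytes of payload; the compiler pads the rest out to 64
};
static_assert(sizeof(Slot) == 64, "one Slot per cache line");

int main() {
    std::printf("%zu\n", sizeof(Slot));   // prints 64
}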
There are a few cases where integral types must be an exact power of two. If the exact-width types in <stdint.h> exist, such as int16_t or uint32_t, their widths must be exactly that size, with no padding. Floating-point math that declares itself to follow the IEEE standard forces float and double to be powers of two (although long double often is not). There are additionally types char16_t and char32_t in the standard library now, or built-in to C++, defined as exact-width types. The requirements about support for UTF-8 in effect mean that char and unsigned char have to be exactly 8 bits wide.
In practice, a lot of legacy code would already have broken on any machine that didn’t support types exactly 8, 16, 32 and 64 bits wide. For example, any program that reads or writes ASCII or tries to connect to a network would break.
Some historically-important mainframes and minicomputers had native word sizes that were multiples of 3, not powers of two, particularly the DEC PDP-6, PDP-8 and PDP-10.
This was the main reason that base 8 used to be popular in computing: since each octal digit represented three bits, a 9-, 12-, 18- or 36-bit pattern could be represented more neatly by octal digits than decimal or hex. For example, when using base-64 to pack characters into six bits instead of eight, each packed character took up two octal digits.
The two most visible legacies of those architectures today are that, by default, character escapes such as '\123' are interpreted as octal rather than decimal in C, and that Unix file permissions/masks are represented as three or four octal digits.