Bit-reflected constants of PCLMULQDQ fast CRC

I am confused about how to calculate the bit-reflected constants in the white paper "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction".
In the posts Fast CRC with PCLMULQDQ NOT reflected and How the bit-reflect constant is calculated when we use CLMUL in CRC32, @rcgldr mentioned that "...are adjusted to compensate for the shift, so instead of x^(a) mod poly, it's (x^(a-32) mod poly)<<32...", but I do not understand what that means.
For example, the constant k1 = (x^(4*128+64) % P(x)) = 0x8833794c (on page 16) vs. k1' = (x^(4*128+64-32) % P(x) << 32)' = (0x154442db4 >> 1) (on page 22). I can't see any reflection relationship between those two values (10001000_00110011_01111001_01001100 vs. 10101010_00100010_00010110_11011010).
I guess my question is: why does the exponent need 32 subtracted from it to compensate for the 32-bit left shift, and why are k1 and k1' not reflections of each other?
Could you please help interpret this? Thanks.
I have searched carefully for the answer to this question on the internet, especially on Stack Overflow, and tried to understand the related posts, but I need some experts to explain further.

I modified what were originally some Intel examples to work with Visual Studio on Windows, non-reflected and reflected, for 16, 32, and 64 bit CRC, in this GitHub repository:
https://github.com/jeffareid/crc
I added some missing comments and also a program to generate the constants used in the assembly code for each of the six cases.
instead of x^(a) mod poly, it's (x^(a-32) mod poly)<<32
This is done for non-reflected CRC. The CRC, as well as the constants, is kept in the upper 32 bits so that the result of PCLMULQDQ ends up in the upper 64 bits and is then shifted right. Shifting the constants left by 32 bits is the same as multiplying by 2^32, or in polynomial notation, by x^32.
For reflected CRC, the CRC is kept in the lower 32 bits, which are logically the upper 32 bits of a reflected number. The issue is that PCLMULQDQ multiplies the product by 2: bit 127 is left == 0 and the 127-bit product sits in bits 126 to 0, i.e. shifted right by one bit. To compensate for that, the constants are (x^(a) mod poly) << 1 (a left shift of a reflected number is a divide by 2).
The example code at that GitHub site includes crc32rg.cpp, which is the program that generates the constants used by crc32ra.asm.
Another issue occurs when doing 64-bit CRC. For non-reflected CRC, the constant is sometimes 65 bits (for example, if the divisor is 7), but only the lower 64 bits are stored, and the 2^64 bit is handled with a few extra instructions. For reflected 64-bit CRC, since the constants can't be shifted left, (x^(a-1) mod poly) is used instead.
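To make the arithmetic above concrete, here is a minimal sketch (not the crc32rg.cpp generator from the repository) that computes x^n mod P(x) for the non-reflected CRC-32 polynomial and applies the shift-by-32 adjustment described above. The function name xpow_mod and the choice of outputs are mine; 0x04C11DB7 is the low 32 bits of P(x), with the x^32 term implicit.

#include <cstdint>
#include <cstdio>

// Compute x^n mod P(x) over GF(2), returned with bit 31 = coefficient of x^31.
// poly holds the low 32 bits of P(x); the x^32 term is implicit.
static uint32_t xpow_mod(uint64_t n, uint32_t poly)
{
    uint32_t r = 1;                        // start with x^0
    while (n--) {
        uint32_t msb = r & 0x80000000u;
        r <<= 1;                           // multiply by x
        if (msb)
            r ^= poly;                     // an x^32 term appeared, reduce modulo P(x)
    }
    return r;
}

int main()
{
    const uint32_t poly = 0x04C11DB7u;     // CRC-32 polynomial, x^32 term implicit
    uint64_t a = 4 * 128 + 64;             // exponent used for k1 in the paper

    uint32_t k1 = xpow_mod(a, poly);       // should reproduce 0x8833794C quoted in the question
    uint64_t k1_shifted = (uint64_t)xpow_mod(a - 32, poly) << 32;  // (x^(a-32) mod P) << 32

    printf("x^%llu mod P(x)          = 0x%08X\n", (unsigned long long)a, (unsigned)k1);
    printf("(x^%llu mod P(x)) << 32  = 0x%016llX\n",
           (unsigned long long)(a - 32), (unsigned long long)k1_shifted);
    return 0;
}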

@rcgldr I don't think I caught your point, to be honest... probably I didn't make my question clear...
If my understanding of the code (reverse CRC32) is correct, then taking the simplest scenario as an example, the procedure for folding a single 32-byte block is shown here. I don't understand why the exponents used in the constants are not 128 and 192 (= 128 + 64), respectively.

Related

What is the Significance of clearing/setting/toggling the Most Significant Bit (MSB) or Least Significant Bit (LSB)?

What are the real world uses of manipulating (clearing/setting/toggling) the MSB or LSB?
By definition, the MSB is the leftmost bit, contributing the maximum value, and the LSB is the rightmost bit, contributing the least value.
Why would one manipulate these bits? What can we achieve by manipulating them?
One real-world example of each:
manipulating the LSB: a Fenwick tree, which can be used to find the sum of numbers in a range and to update an element of the array, both in O(log N) (see the sketch after this list);
manipulating the MSB: binary search using bit manipulation -- Binary searching via bitmasking?
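As a concrete illustration of the LSB example above, here is a minimal Fenwick-tree sketch (the names lowbit, Fenwick, add and prefix_sum are mine, not from any particular library); the key trick is that x & -x isolates the least significant set bit.

#include <vector>

// Isolate the least significant set bit of x (two's complement trick).
static int lowbit(int x) { return x & -x; }

// 1-indexed Fenwick tree: point update and prefix sum, both O(log N).
struct Fenwick {
    std::vector<long long> t;
    explicit Fenwick(int n) : t(n + 1, 0) {}

    void add(int i, long long delta) {          // a[i] += delta
        for (; i < (int)t.size(); i += lowbit(i))
            t[i] += delta;
    }
    long long prefix_sum(int i) const {         // a[1] + ... + a[i]
        long long s = 0;
        for (; i > 0; i -= lowbit(i))
            s += t[i];
        return s;
    }
};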
If you're using an integer value as a flags structure or to contain bitfields, then that's a reason. A reason to adjust the MSB or LSB individually might be to set a special flag where you know the bit would be otherwise unused, for example in some ISAs all memory addresses (for loading/writing) must be aligned on a word boundary (typically a word length is 32-bits) which means the last few bits of a pointer are completely insignificant and can be used by application or system, the same applies to the upper bits - but only in certain circumstances.
Other reasons include doing quick arithmetic operations on IEEE-754 numbers: e.g., toggling the sign bit, which can be quicker than going through the FPU.
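For instance, a minimal sketch of that sign-bit toggle (the function name negate_via_bits is invented for the example; memcpy is used so the bit reinterpretation is well defined):

#include <cstdint>
#include <cstring>

// Flip the sign of an IEEE-754 single by toggling bit 31, without FPU arithmetic.
static float negate_via_bits(float f)
{
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);   // reinterpret the float's bit pattern
    bits ^= 0x80000000u;                   // toggle the sign bit (MSB)
    std::memcpy(&f, &bits, sizeof bits);
    return f;
}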
From Wikipedia:
MSB
Signed magnitude representation
This representation is also called "sign-magnitude" or "sign and magnitude" representation. In this approach, the problem of representing a number's sign is solved by allocating one sign bit to represent the sign: setting that bit (often the most significant bit) to 0 is for a positive number or positive zero, and setting it to 1 is for a negative number or negative zero. The remaining bits in the number indicate the magnitude (or absolute value). Hence, in a byte with only seven bits (apart from the sign bit), the magnitude can range from 0000000 (0) to 1111111 (127). Thus numbers ranging from −127₁₀ to +127₁₀ can be represented once the sign bit (the eighth bit) is added. A consequence of this representation is that there are two ways to represent zero, 00000000 (0) and 10000000 (−0). This way, −43₁₀ encoded in an eight-bit byte is 10101011.
LSB
The least significant bits have the useful property of changing rapidly if the number changes even slightly. For example, if 1 (binary 00000001) is added to 3 (binary 00000011), the result will be 4 (binary 00000100) and three of the least significant bits will change (011 to 100). By contrast, the three most significant bits (MSBs) stay unchanged (000 to 000).
Least significant bits are frequently employed in pseudorandom number generators, hash functions and checksums.

How does a 32-bit machine compute a double precision number

If I only have a 32-bit machine, how does the CPU compute a double-precision number? That number is 64 bits wide. How does an FPU handle it?
The more general question would be: how do you compute something that is wider than your ALU? I fully understand the integer case: you can simply split the numbers up. But with floating-point numbers you have an exponent and a mantissa, which have to be handled differently.
Not everything in a "32-bit machine" has to be 32 bits. The x87-style FPU was never "32-bit"; it was created a very long time before AMD64 existed, was always capable of doing math on 80-bit extended doubles, and used to be a separate chip, so there was no chance of it using the main ALU at all.
It's wider than the ALU, yes, but it doesn't go through the ALU; the floating-point unit(s) use their own circuits, which are as wide as they need to be. Those circuits are also much more complicated than the integer circuits, and they don't really share components with the integer ALUs.
There are several different concepts in a computer architecture that can be measured in bits, but none of them prevents handling 64-bit floating-point numbers. Although these concepts may be correlated, it is worth considering them separately for this question.
Often, "32 bit" means that addresses are 32 bits. That limits each process's virtual memory to 2^32 addresses. It is the measure that makes the most direct difference to programs, because it affects the size of a pointer and the maximum size of in-memory data. It is completely irrelevant to the handling of floating point numbers.
Another possible meaning is the width of the paths that transfer data between memory and the CPU. That is not a hard limit on the sizes of data structures - one data item may take multiple transfers. For example, the Java Language Specification does not require atomic loads and stores of double or long. See 17.7. Non-Atomic Treatment of double and long. A double can be moved between memory and the processor using two separate 32 bit transfers.
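To illustrate that last point, here is a small sketch of viewing a 64-bit double as two 32-bit halves, the way it might travel over a 32-bit data path in two transfers (the variable names are just for the example):

#include <cstdint>
#include <cstring>

int main()
{
    double d = 1.5;
    uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);       // the double's full 64-bit pattern

    uint32_t lo = (uint32_t)bits;              // could be the first 32-bit transfer
    uint32_t hi = (uint32_t)(bits >> 32);      // could be the second 32-bit transfer

    (void)lo; (void)hi;                        // the two halves together make up the double
    return 0;
}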
A third meaning is the general register size. Many architectures use separate registers for floating point. Even if the general registers are only 32 bits the floating point registers can be wider, or it may be possible to pair two 32 bit floating point registers to represent one 64-bit number.
A typical relationship between these concepts is that a computer with 64 bit memory addresses will usually have 64 bit general registers, so that a pointer can fit in one general register.
Even 8 bit computers provided extended precision (80 bit) floating point arithmetic, by writing code to do the calculations.
Modern 32 bit computers (x86, ARM, older PowerPC etc.) have 32 bit integer and 64 or 80 bit floating-point hardware.
Let's look at integer arithmetic first, since it is simpler. Inside your 32-bit ALU there are 32 individual logic units with carry bits that spill up the chain: 1 + 1 -> 10, with the carry bit carried over to the second logic unit. The entire ALU also has a carry-bit output, and you can use it to do arbitrary-length math. The only real limitation of the bit width is how many bits you can work with in one cycle. To do 64-bit math you need two or more cycles and have to do the carry logic yourself.
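For instance, a sketch of a 64-bit addition built from 32-bit operations with manual carry handling (plain C++, so the carry is detected by comparison rather than by reading a CPU flag; the function name add64 is mine):

#include <cstdint>

// Add two 64-bit values given as (hi, lo) pairs of 32-bit words,
// the way a 32-bit ALU would: low words first, then propagate the carry.
static void add64(uint32_t a_hi, uint32_t a_lo,
                  uint32_t b_hi, uint32_t b_lo,
                  uint32_t &r_hi, uint32_t &r_lo)
{
    r_lo = a_lo + b_lo;                 // low 32-bit add (wraps modulo 2^32)
    uint32_t carry = (r_lo < a_lo);     // unsigned wraparound means a carry occurred
    r_hi = a_hi + b_hi + carry;         // high 32-bit add plus the carry
}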
It seems that the question is just "how does FPU work?", regardless of bit widths.
FPU does addition, multiplication, division, etc. Each of them has a different algorithm.
Addition
(also subtraction)
Given two numbers with exponent and mantissa:
x1 = m1 * 2 ^ e1
x2 = m2 * 2 ^ e2
, the first step is normalization:
x1 = m1 * 2 ^ e1
x2 = (m2 * 2 ^ (e2 - e1)) * 2 ^ e1 (assuming e2 > e1)
Then one can add the mantissas:
x1 + x2 = (whatever) * 2 ^ e1
Then, one should convert the result to a valid mantissa/exponent form (e.g., the (whatever) part might be required to be between 2^23 and 2^24). This is called "renormalization" if I am not mistaken. Here one should also check for overflow and underflow.
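Here is a hedged sketch of those addition steps using a toy, non-IEEE format where a value is simply mantissa * 2^exponent with a normalized 24-bit mantissa. Unlike the formulas above, it aligns to the larger exponent by shifting the smaller mantissa right, which is the usual hardware approach; all names (Toy, toy_add) are invented for the example.

#include <cstdint>

// Toy format: value = mantissa * 2^exponent, with 2^23 <= mantissa < 2^24 (positive only).
struct Toy { uint32_t mantissa; int exponent; };

static Toy toy_add(Toy a, Toy b)
{
    if (a.exponent < b.exponent) { Toy t = a; a = b; b = t; }   // make a the larger exponent
    int shift = a.exponent - b.exponent;
    uint32_t m2 = (shift < 32) ? (b.mantissa >> shift) : 0;     // align b (low bits are lost)
    uint64_t m = (uint64_t)a.mantissa + m2;                     // add the aligned mantissas
    int e = a.exponent;
    while (m >= (1u << 24)) { m >>= 1; ++e; }                   // renormalize back into [2^23, 2^24)
    return Toy{ (uint32_t)m, e };
}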
Multiplication
Just multiply the mantissas and add the exponents. Then renormalize the multiplied mantissas.
Division
Do a "long division" algorithm on the mantissas, then subtract the exponents. Renormalization might not be necessary (depending on how you implement the long division).
Sine/Cosine
Convert the input to a range [0...π/2], then run the CORDIC algorithm on it.
Etc.

How are Overflow situations dealt with? [duplicate]

This question already has answers here:
Why is unsigned integer overflow defined behavior but signed integer overflow isn't?
I just wanted to know: who is responsible for dealing with mathematical overflow cases in a computer?
For example, in the following C++ code:
short x = 32768;
std::cout << x;
Compiling and running this code on my machine gave me a result of -32767
A "short" variable's size is 2 bytes .. and we know 2 bytes can hold a maximum decimal value of 32767 (if signed) .. so when I assigned 32768 to x .. after exceeding its max value 32767 .. It started counting from -32767 all over again to 32767 and so on ..
What exactly happened so the value -32767 was given in this case ?
ie. what are the binary calculations done in the background the resulted in this value ?
So, who decided that this happens ? I mean who is responsible to decide that when a mathematical overflow happens in my program .. the value of the variable simply starts again from its min value, or an exception is thrown for example, or the program simply freezes .. etc ?
Is it the language standard, the compiler, my OS, my CPU, or who is it ?
And how does it deal with that overflow situation ? (Simple explanation or a link explaining it in details would be appreciated :) )
And btw, pls .. Also, who decides what a size of a 'short int' for example on my machine would be ? also is it a language standard, compiler, OS, CPU .. etc ?
Thanks in advance! :)
Edit:
OK, so I understood from here: Why is unsigned integer overflow defined behavior but signed integer overflow isn't?
that it's the processor that defines what happens in an overflow situation (for example, on my machine it started from -32767 all over again), depending on the processor's "representation for signed values", i.e., whether it is sign-magnitude, one's complement, or two's complement.
Is that right?
And in my case (when the result looked like it started from the minimum value -32767 again), how do you suppose my CPU is representing signed values, and how did the value -32767 come up (again, the binary calculations that lead to this, please)?
It doesn't start at its minimum value per se. It just truncates the value, so for a 4-bit number you can count up to 1111 (binary, = 15 decimal). If you increment by one, you get 10000, but there is no room for that, so the first digit is dropped and 0000 remains. If you calculated 1111 + 0010, you'd get 0001.
You can add them up as you would on paper:
1111
0010
---- +
10001
But instead of adding up the entire number, the processor will just add until it reaches (in this case) 4 bits. After that there is no more room to add anything, but if there is still a 1 to carry, it sets the carry/overflow flag, so you can check whether the last addition it did overflowed.
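The same truncation can be imitated in C++ by masking to the word width, here 4 bits, purely for illustration:

#include <cstdio>

int main()
{
    unsigned a = 0b1111, b = 0b0010;       // 15 and 2
    unsigned full = a + b;                 // 10001 binary = 17
    unsigned truncated = full & 0xF;       // keep only 4 bits -> 0001
    bool carry_out = (full >> 4) & 1;      // the bit that "didn't fit"

    printf("full=%u truncated=%u carry=%d\n", full, truncated, carry_out);
    return 0;
}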
Processors have basic instructions to add numbers, for both smaller and larger values. A 64-bit processor can add 64-bit numbers (actually, they usually don't add two numbers into a third; they add a second number to the first, modifying the first, but that's not really important for this story).
But apart from 64 bits, they can often also add 32-, 16- and 8-bit numbers. That's partly because it can be efficient to add only 8 bits if you don't need more, but also to remain backwards compatible with older programs written for a previous version of the processor that could add 32-bit but not 64-bit numbers.
Such a program uses an instruction to add 32-bit numbers, and the same instruction must also exist on the 64-bit processor, with the same behavior on overflow; otherwise the program wouldn't run properly on the newer processor.
Apart from adding with the core instructions of the processor, you can also add in software. You could make an increment function that treats a big chunk of bits as a single value. To increment it, you let the processor increment the first 64 bits; the result is stored in the first part of your chunk. If the overflow flag is set in the processor, you take the next 64 bits and increment those too. This way you can extend the processor's limit and handle larger numbers in software.
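A sketch of that software extension in C++ (there is no direct access to the CPU's carry flag here, so "the flag is set" is detected by the limb wrapping around to zero; the limb layout and the name big_increment are mine):

#include <cstdint>
#include <vector>

// Increment a large number stored as 64-bit limbs, least significant limb first.
// Equivalent to: increment limb 0; while it wrapped to 0 (carry), increment the next limb.
static void big_increment(std::vector<uint64_t> &limbs)
{
    for (size_t i = 0; i < limbs.size(); ++i) {
        if (++limbs[i] != 0)     // no wraparound -> no carry -> done
            return;
    }                            // falling out means the whole number wrapped to zero
}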
The same goes for the way an overflow is handled. The processor just sets the flag; your application can decide whether to act on it or not. If you want a counter that just counts up to 65535 and then wraps to 0, you (your program) don't need to do anything with the flag.

Why is the CRC32 generating polynomial 33 bits long?

First off, if there's a better site to ask this question, please migrate it or close it and let me know where to go.
Secondly, we're discussing CRC in one of my classes, and neither we nor the professor understand why CRC polynomials are one bit longer than the name (or the resulting checksum) suggests. I've done some searching, but nothing seems to discuss why it's one bit longer.
A CRC is the remainder after dividing the message by the polynomial. By definition, the remainder must be of lower degree than the polynomial, so it is at least one bit shorter. Hence the CRC for a "33-bit" polynomial is 32 bits.
Note that the largest exponent of a "33-bit" polynomial is 32 (the lowest term has exponent zero), so the degree of the polynomial, as well as the length of the CRC is 32.
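As a tiny worked example of the same reasoning, take the 4-bit polynomial 1011 (x^3 + x + 1): dividing any message by it leaves a remainder of at most 3 bits, one bit shorter than the polynomial. A minimal sketch of that bitwise division (message value and names chosen for illustration):

#include <cstdint>
#include <cstdio>

// Bitwise modulo-2 division of an 8-bit message by the 4-bit polynomial 1011 (x^3 + x + 1).
// The remainder always fits in 3 bits, one less than the polynomial's length.
int main()
{
    uint32_t msg = 0b11010110;                 // arbitrary example message
    uint32_t dividend = msg << 3;              // append 3 zero bits, as CRC computation does
    const uint32_t poly = 0b1011;

    for (int bit = 10; bit >= 3; --bit)        // dividend occupies bits 10..0
        if (dividend & (1u << bit))
            dividend ^= poly << (bit - 3);     // subtract (XOR) the aligned polynomial

    printf("3-bit CRC remainder = %u\n", dividend & 0x7);
    return 0;
}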

How can I set all bits to '1' in a binary number of an unknown size?

I'm trying to write a function in assembly (but let's assume the question is language-agnostic).
How can I use bitwise operators to set all bits of a passed in number to 1?
I know that I can use bitwise "or" with a mask of the bits I wish to set, but I don't know how to construct a mask based on a binary number of size N.
~(x & 0)
x & 0 will always result in 0, and ~ will flip all the bits to 1s.
Set it to 0, then flip all the bits to 1 with a bitwise-NOT.
You're going to find that in assembly language you have to know the size of a "passed in number". And in assembly language it really matters which machine the assembly language is for.
Given that information, you might be asking either
How do I set an integer register to all 1 bits?
or
How do I fill a region in memory with all 1 bits?
To fill a register with all 1 bits, on most machines the efficient way takes two instructions:
Clear the register, using either a special-purpose clear instruction, or load immediate 0, or xor the register with itself.
Take the bitwise complement of the register.
Filling memory with 1 bits then requires 1 or more store instructions...
You'll find a lot more bit-twiddling tips and tricks in Hank Warren's wonderful book Hacker's Delight.
Set it to -1. This is usually represented by all bits being 1.
Set x to 1
While x < number
x = x * 2
Answer = number or x - 1.
The code assumes your input is called "number". It should work fine for positive values. Note that for negative values in two's complement the operation makes no sense, as the high bit will always be one.
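A direct C++ rendering of that pseudocode, assuming a positive 32-bit input as the answer notes (the function name fill_up_to_msb is mine; x is kept 64-bit so the doubling cannot overflow):

#include <cstdint>

// Set every bit of `number` up to and including its highest set bit.
// Mirrors the pseudocode: grow x to the first power of two >= number, then OR with x - 1.
static uint32_t fill_up_to_msb(uint32_t number)
{
    uint64_t x = 1;                  // 64-bit so x * 2 can't overflow for 32-bit inputs
    while (x < number)
        x *= 2;
    return number | (uint32_t)(x - 1);
}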
Use T(~T(0)).
Where T is the typename (if we are talking about C++.)
This prevents the unwanted promotion to int if the type is smaller than int.