Lowest exponent of a CRC polynomial - crc

I have never seen a CRC polynomial without the lowest term x⁰ = 1.
Are there any exceptions I haven't seen yet?
Why do all CRC polynomials have the lowest term x⁰?

A CRC polynomial of the form x^n + ... + x^0 is used for an n-bit CRC: the data bits are divided by the CRC polynomial using a borrowless (carryless) divide, which produces an n-bit remainder, the CRC. If the CRC polynomial has the form x^n + ... + x^1 (no x^0 term), then it is effectively an (n-1)-bit CRC, because the lowest bit of the remainder is always 0.
However, there are cases where common code uses different tables for fast computation of 32-bit or 16-bit CRCs, and the only difference in the main part of the code is the constants. The code is written as if the CRC polynomial has the form x^32 + ... + x^0, but to let most of the same code generate a 16-bit CRC, the polynomial has the form x^32 + ... + x^16. A final correction step shifts the CRC right by 16 bits to place the 16-bit CRC in the proper bits. An example of this is in this 500+ line fast crc32/16 assembly example using the pclmulqdq instruction (carryless multiply), which in this case is set up to produce a 16-bit CRC.
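As a rough check of the claim above, here is a minimal Python sketch. The helper names `gf2_mod` and `crc` are made up for illustration, and no initial value, bit reflection, or final XOR is modeled. It divides M(x)·x^n by a generator whose lowest terms are missing and compares the result with the CRC of the shortened generator.

```python
def gf2_mod(value: int, poly: int) -> int:
    """Remainder of the carryless (GF(2)) division of value by poly."""
    deg = poly.bit_length() - 1
    while value.bit_length() > deg:
        # Align the top bit of poly with the top bit of value and cancel it.
        value ^= poly << (value.bit_length() - 1 - deg)
    return value


def crc(msg: int, poly: int) -> int:
    """Plain CRC: remainder of msg * x^n divided by poly, with n = deg(poly)."""
    n = poly.bit_length() - 1
    return gf2_mod(msg << n, poly)


# x^4 + x^2 + x has no x^0 term; dividing it by x gives x^3 + x + 1.
G4, H3 = 0b10110, 0b1011
no_x0_is_shorter = all(crc(m, G4) == crc(m, H3) << 1 for m in range(1 << 12))

# Same idea with 16 missing low terms: a 16-bit generator (0x11021, chosen
# only as an example) multiplied by x^16 and used as a "32-bit" CRC.
H16 = 0x11021
G32 = H16 << 16
shifted_crc16 = all(
    crc(m, G32) >> 16 == crc(m, H16) and crc(m, G32) & 0xFFFF == 0
    for m in range(1 << 12)
)

print(no_x0_is_shorter, shifted_crc16)
```

In both cases the low bits of the remainder carry no information, which is why the final right shift by 16 recovers the 16-bit CRC.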

How to efficiently perform double/int128 conversions with AVX2?

I'm trying to make software in which users can move over a very wide range (at least 1 Mly in diameter, with a position precision of at least 0.1 mm). I'm thinking of using a 128-bit fixed-point number to represent position. However, mathematical calculations (e.g. distance, sqrt, division, integration) are not well suited to fixed-point (or integer) types, so I use double- or single-precision floating point for the math. (This is usually applied to the result of subtracting two int128 coordinates to get a relative distance, so the value is usually small enough not to lose much precision, and large differences don't need that much precision.)
So I ran into a problem when implementing fixed128: how do I do a fast int128-to-double conversion with AVX2 SIMD? (AVX-512 is not widespread, so I can't use it in this software.)
What I've tried (a bit long, feel free to skip):
I've referred to this answer: How to efficiently perform double/int64 conversions with SSE/AVX?
wim's answer showed that an efficient way to convert int64 to double is to split the integer into parts that each fit within the 52-bit significand, place exponent bits above each part, and then use floating-point arithmetic to remove the extra exponent offsets.
So I tried to split a uint128 (consisting of two uint64s: ilow and ihigh) into three parts:
part1 v_lo: ilow's low 48 bits;
part2 v_mi: ilow's high 16 bits and ihigh's low 16 bits;
part3 v_hi: ihigh's high 48 bits;
We can get v_lo and v_hi with almost the same method as wim's "uint64_to_double_fast_precise", but part2, v_mi, is a problem: it costs 4 instructions, which is more than v_lo and v_hi combined (1 + 2). (My code follows.)
Maybe there's a faster way using some magical swizzle with permute/shuffle/unpackhi/unpacklo/broadcast/blend or a combination of them? These swizzle intrinsics have really swizzled me.
My code for the ufixed128-to-double conversion:
constexpr auto fix128_frac_bits = 32;
__m256d ufixed128_to_double_fast(const __m256i& ihigh, const __m256i& ilow)
{
    //constants
    __m256d magic_d_hm = _mm256_set1_pd(pow(2.0, 52 + 48 - fix128_frac_bits) + pow(2.0, 52 + 80 - fix128_frac_bits));
    __m256d magic_d_lo = _mm256_set1_pd(pow(2.0, 52 - fix128_frac_bits));
    __m256i magic_i_lo = _mm256_castpd_si256(magic_d_lo);
    __m256i magic_i_mi = _mm256_castpd_si256(_mm256_set1_pd(pow(2.0, 52 + 48 - fix128_frac_bits)));
    __m256i magic_i_hi = _mm256_castpd_si256(_mm256_set1_pd(pow(2.0, 52 + 80 - fix128_frac_bits)));
    //majik operations
    __m256i v_lo = _mm256_blend_epi16(ilow, magic_i_lo, 0b10001000);
    __m256i v_mi = _mm256_slli_epi64(ihigh, 16);
    __m256i losr48 = _mm256_srli_epi64(ilow, 48);
    v_mi = _mm256_xor_si256(v_mi, losr48);
    v_mi = _mm256_blend_epi32(magic_i_mi, v_mi, 0b01010101);
    __m256i v_hi = _mm256_srli_epi64(ihigh, 16);
    v_hi = _mm256_xor_si256(v_hi, magic_i_hi);
    //final fp
    __m256d loresult = _mm256_sub_pd(_mm256_castsi256_pd(v_lo), magic_d_lo);
    __m256d result = _mm256_sub_pd(_mm256_castsi256_pd(v_hi), magic_d_hm);
    result = _mm256_add_pd(result, _mm256_castsi256_pd(v_mi));
    result = _mm256_add_pd(result, loresult);
    return result;
}
Edit: I've successfully made a signed fixed128_to_double: just add 2.0^(127 - fix128_frac_bits), as a double, to the constants 'magic_d_hm' and 'magic_i_hi'.
But I have no idea how to write a fast 'double_to_int128' or 'double_to_uint128'. I can beat the scalar C++ 'static_cast' conversion with bit operations (mask out the exponent and sign, insert the hidden 1, and shift left/right), but that is much slower than those magical ops and uses a lot of registers for constants.
Can anyone help me?
If I'm in a blind alley and there's a better way than fixed128/double-double to represent wide-range positions, please tell me. (Except floating origin or floating grid (int64 + double): they are unstable for physics, expose a lot of complexity to the upper layers, or are hard to accelerate with AVX.)
About double-double: I plan to compare the performance of fixed128 and double-double once both are highly optimized, and decide which to use after that. That's another piece of work I'm doing.
My current code: https://github.com/Veloctor/Int128
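To make the three-way split easier to check without SIMD hardware, here is a scalar Python model of the same arithmetic, one lane at a time. It is only a sketch: `double_to_bits`, `bits_to_double`, and `FRAC_BITS` are names introduced here, and the model says nothing about instruction counts or speed.

```python
import math
import struct

FRAC_BITS = 32  # mirrors fix128_frac_bits in the question
MASK48 = (1 << 48) - 1
MASK32 = (1 << 32) - 1


def double_to_bits(d):
    """Reinterpret a double as a 64-bit unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", d))[0]


def bits_to_double(b):
    """Reinterpret a 64-bit unsigned integer as a double."""
    return struct.unpack("<d", struct.pack("<Q", b))[0]


def ufixed128_to_double(ihigh, ilow):
    """Scalar model: ihigh and ilow are the two unsigned 64-bit halves."""
    magic_lo = 2.0 ** (52 - FRAC_BITS)
    magic_mi = 2.0 ** (52 + 48 - FRAC_BITS)
    magic_hi = 2.0 ** (52 + 80 - FRAC_BITS)
    # Each part is dropped into the (all-zero) significand of its magic constant.
    v_lo = bits_to_double(double_to_bits(magic_lo) | (ilow & MASK48))
    mid32 = ((ihigh << 16) | (ilow >> 48)) & MASK32
    v_mi = bits_to_double(double_to_bits(magic_mi) | mid32)
    v_hi = bits_to_double(double_to_bits(magic_hi) | (ihigh >> 16))
    # Same order of floating-point operations as the intrinsic version.
    result = v_hi - (magic_mi + magic_hi)
    result += v_mi
    result += v_lo - magic_lo
    return result


hi, lo = 0x0123456789ABCDEF, 0xFEDCBA9876543210
exact = ((hi << 64) | lo) / (1 << FRAC_BITS)
print(math.isclose(ufixed128_to_double(hi, lo), exact, rel_tol=1e-12))
```

Because the final additions each round once, the result can differ from the correctly rounded value in the last bits, so the comparison uses a tolerance.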

Why is the result of a bitwise shift unrecoverable if there is a mathematical equivalent of the same operation?

Take for example the number 91. That number in binary is 1011011. If you shift that number to the right by 5 bits, you get 2 (10 in binary). According to a Google search, bit shifting to the left or right by a certain number of bits is the same as multiplying or dividing the number by 2 to the power of the number of bits shifted, respectively. So to get from 91 to 2 by bit shifting, the equation would look like this: 91 / 2^5, which is also 91 / 32. Now, of course, if you did that on a calculator, there would be some decimal places, which aren't included when bit shifting: the exact quotient is 2.84375, not 2. I'm sure you know that if you do a certain operation on a number and then do the inverse, the result is what you had in the first place. So does decimal precision have something to do with this?
There is a mathematical equivalent of shifting to the right... and the mathematical operation is UNRECOVERABLE.
You seem to think that shifting to the right is:
bit shifting to the left or right by a certain amount of bits is the same as multiplying or dividing the number by 2
This is what you will hear people casually say, but it is only half right: the two operations are similar, not the same.
The correct statement is:
shifting a base-2 number one digit to the right is THE SAME as dividing by two in the integer domain
If you have an integer calculator, if you did 91/32 you will get 2. You will not get ANY decimal point because we are operating in the integer domain.
For real numbers, the equivalent operation is:
FLOOR(91/32)
Which is also unrecoverable because it also results in 2.
The lesson here is be careful when listening to what people CASUALLY say. Casual speech is often imprecise and assumes the listener is familiar with the subject. You need to dig deeper what the statement is actually trying to say.
As for why it is unrecoverable? Division of integers gives two results: the quotient (which is the main result) and the remainder. When we divide 91 by 32 we are doing this:
        2
     ____
32 ) 91
     64
     --
     27
So we get the result of 2 and a remainder of 27. The reason you can't get 91 by multiplying 2*32 is because we threw away the remainder.
You can get the result back if you saved the remainder. However, calculating the remainder is not a matter of simple shifts. Here's an example of how to make it reversible in C:
int test () {
    int a = 91;
    int b = 32;
    int result;
    int remainder;
    result = a / b; // result will be 2
    remainder = a % b; // remainder will be 27
    return (result * b) + remainder; // returns 91
}
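For a power-of-two divisor such as 32, the remainder is exactly the group of bits that the shift discards, so a bit mask can save it. A small Python sketch of that idea follows; the function names are made up for illustration and the inputs are assumed to be non-negative.

```python
def shift_right_keep_remainder(x, n):
    """Return (x >> n, the n low bits that the shift throws away)."""
    quotient = x >> n
    remainder = x & ((1 << n) - 1)
    return quotient, remainder


def undo_shift(quotient, remainder, n):
    """Rebuild the original value from the shifted value and the saved bits."""
    return (quotient << n) | remainder


q, r = shift_right_keep_remainder(91, 5)
print(q, r, undo_shift(q, r, 5))
```

Without the saved low bits, every value from 64 to 95 produces the same shifted result, which is why the shift alone cannot be undone.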
You can only recover the original value after an operation if the operation has a 1-1 mapping between its inputs and outputs, i.e. it has an inverse function. But not all mathematical functions have an inverse function.
For example, if f(x) = x >> n, where >> is the shift operator, then it is equivalent to
f(x) = ⌊x/2^n⌋
with ⌊ ⌋ being the floor function. Since there are many inputs that lead to the same output, the relationship isn't 1-1 and there can't be an inverse function for it. This function works the same for both signed and unsigned right shift:
91 >> 5 == floor(91.0/32.0) == 2
-91 >> 5 == floor(-91.0/32.0) == -3
Similarly, for an unsigned left shift function g(x) = x << n, the equivalent is
g(x) = (x * 2^n) mod 2^N
with N being the size in bits of x, because integer math in hardware, C, and many other languages always reduces modulo 2^N due to the limit of register size and the use of two's complement. And it's clear that the modulo function also isn't invertible/recoverable. The signed left shift is almost the same, with some small modifications.
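The two formulas above can be checked with a short Python sketch. Python integers are unbounded, so the N-bit wraparound is modeled explicitly with a modulus; `shl_unsigned` and the 8-bit width are assumptions made for the example.

```python
N = 8  # assumed register width for the example


def shl_unsigned(x, n, bits=N):
    """Model of an unsigned left shift in a register of the given width."""
    return (x << n) % (1 << bits)


# Right shift behaves like floor division, also for negative values
# (Python's >> on ints is an arithmetic shift).
right_ok = (91 >> 5 == 91 // 32) and (-91 >> 5 == -91 // 32)

# Many inputs collapse onto the same output, so no inverse can exist.
same_output = [x for x in range(256) if x >> 5 == 2]

# Left shift in 8 bits: 91 and 99 differ only in bits that get shifted out.
left_collision = shl_unsigned(91, 5) == shl_unsigned(99, 5)

print(right_ok, same_output[0], same_output[-1], left_collision)
```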

How do I quickly compute the product of 100bit numbers

I am trying to calculate the product of two 100-bit numbers. It is supposed to mimic the behavior of multiplication of unsigned integers native to 100-bit CPU architecture. That is, the program must calculate the actual product, modulo 2^100.
To do this QUICKLY, I have opted to implement 100bit numbers as uint64_t[2], a two element array of 64bit numbers. More precisely, x = 2^64 * a + b. I need to quickly perform arithmetic and logical operations (products, bit shifts, bit rotate, xor etc). I have chosen this representation because it allows me to use the fast, native operations on the 64bit constituents. For example, rotating a 128bit 'number' is only twice as slow as rotating a 64bit int. Boost::128bit is MUCH slower and bitset and valarray don't have arithmetic. I COULD use the arrays for all operations except multiplication, and then convert the arrays to say boost:128bit and then just multiply, but that is a last resort and probably slow as hell.
I have tried the following. Let us have two such pairs of 64-bit numbers, say 2^64 a + b and 2^64 x + y. Then the product can be expressed as
2^128 ax + 2^64 (ay + bx) + by
We may ignore the first term, for it is too large. It would be almost sufficient to take the pair
ay + bx, by
to be our answer, but the more significant half is 'missing' the overflow from the b*y operation. I don't know how to calculate this without breaking the numbers b,y into four different 32bits, and using a divide and conquer approach that will ensure the expanded terms of the product each don't overflow.
This is for a 'chess engine' with magic multiplication hashing on a 10x10 board
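For the overflow that b * y might produce, the dominant contribution comes from the most significant 32 bits of each number (the cross terms between the 32-bit halves are ignored here, so the carry is approximate):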
struct Num {
    uint64_t low;
    uint64_t high;
    Num &operator*=(const Num &o) {
        high = low * o.high +
               high * o.low +
               (low >> 32u) * (o.low >> 32u); // <- handles overflow
        low *= o.low;
        high &= 0xFFFFFFFFF; // keeping number 100 bits
        return *this;
    }
};
See if your cpu supports any native 128 bit ints, because that would be optimal (though not portable).
Good luck with your chess engine!
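The expression (low >> 32u) * (o.low >> 32u) drops the cross terms between the 32-bit halves, so it only approximates the carry out of low * o.low. Below is a Python model of an exact variant that uses nothing wider than 64-bit intermediate values, so it should translate to uint64_t arithmetic. It is a sketch under that assumption, and `mulhi64` and `mul100` are names invented here.

```python
MASK64 = (1 << 64) - 1
MASK36 = (1 << 36) - 1
MASK32 = (1 << 32) - 1


def mulhi64(a, b):
    """Upper 64 bits of the 128-bit product a * b, built from 32-bit halves."""
    a0, a1 = a & MASK32, a >> 32
    b0, b1 = b & MASK32, b >> 32
    p00, p01, p10, p11 = a0 * b0, a0 * b1, a1 * b0, a1 * b1
    # Middle column: carry of p00 plus the low halves of both cross terms.
    mid = (p00 >> 32) + (p01 & MASK32) + (p10 & MASK32)
    return (p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32)) & MASK64


def mul100(x, y):
    """x and y are (low, high) pairs: 64 low bits and 36 high bits.
    Returns the (low, high) pair of the product modulo 2**100."""
    xl, xh = x
    yl, yh = y
    high = (xl * yh + xh * yl + mulhi64(xl, yl)) & MASK36
    low = (xl * yl) & MASK64
    return low, high


x = (0xFFFFFFFFFFFFFFFF, 0xFFFFFFFFF)
y = (0x123456789ABCDEF0, 0xABCDEF123)
low, high = mul100(x, y)
expected = (((x[1] << 64) | x[0]) * ((y[1] << 64) | y[0])) % (1 << 100)
print(((high << 64) | low) == expected)
```

On compilers that offer a 128-bit integer type or a high-multiply intrinsic, `mulhi64` would typically be replaced by that facility.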
Come to think of it, and borrowing basket's notation:
if you are hell-bent on 100 bits, the error would be smaller using 64 bits for high and only 36 for low.
You can compute the most significant 64 bits of "low×low" using (low >> 4u) * (o.low >> 4u), and use the upper 36 bits of that as the overflow into high.
With no effort to coin names for magic literals:
Bits100 &operator*=(const Bits100 &o) {
    high = low * o.high + // ignore high * o.high
           high * o.low +
           // parenthesized because + binds tighter than >> in C++
           ((low >> 4u) * (o.low >> 4u) >> 28); // handles overflow in most cases
    low = low * o.low & 0xFFFFFFFFF; // keep low to 100-64 bits
    return *this;
}
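To get a feel for what "in most cases" means, the Python sketch below compares the approximate carry with the exact one for 36-bit low parts. The function names are illustrative only, and the bound in the comment follows from the eight dropped bits (four per operand).

```python
MASK36 = (1 << 36) - 1


def exact_carry(low, o_low):
    """Bits 36 and up of the full product of two 36-bit values."""
    return (low * o_low) >> 36


def approx_carry(low, o_low):
    """Carry as computed in the answer: drop 4 low bits of each operand first."""
    return ((low >> 4) * (o_low >> 4)) >> 28


samples = [0, 1, 15, 16, 0x123456789, 0x8FFFFFFFF, MASK36]
errors = [exact_carry(a, b) - approx_carry(a, b) for a in samples for b in samples]
# The approximation never overshoots, and for 36-bit inputs it
# undershoots by at most 30.
print(min(errors), max(errors))
```

Whether an error of that size in the lowest bits of high matters depends on how many of those bits the hash actually uses.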

Attempting to understand different CRC implementations

Taken from IEEE 802.3,
Mathematically, the CRC value corresponding to a given MAC frame is defined by the following procedure:
a) The first 32 bits of the frame are complemented.
b) The n bits of the protected fields are then considered to be the
coefficients of a polynomial M(x) of degree n – 1. (The first bit
of the Destination Address field corresponds to the x^(n–1) term and the last
bit of the MAC Client Data field (or Pad field if present) corresponds to the
x^0 term.)
c) M(x) is multiplied by x^32 and divided by G(x), producing a remainder R(x) of degree ≤ 31.
d) The coefficients of R(x) are considered to be a 32-bit sequence.
e) The bit sequence is complemented and the result is the CRC.
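Steps a) to e) can be followed literally in a few lines of Python. The sketch below rests on two assumptions that are not part of the quoted text: the bits of each byte are taken least significant bit first (the Ethernet transmission order), and the final 32-bit value is bit-reversed so that it can be compared with `zlib.crc32`, which keeps the x^31 coefficient in bit 0. The function name `ieee_crc32` is made up for the example.

```python
import zlib

G = 0x104C11DB7  # CRC-32 generator polynomial, including the x^32 term


def ieee_crc32(data: bytes) -> int:
    """Literal model of steps a) to e); assumes at least 4 bytes of data."""
    # Bits in transmission order: least significant bit of each byte first.
    bits = [(byte >> i) & 1 for byte in data for i in range(8)]
    # a) complement the first 32 bits
    bits = [b ^ 1 for b in bits[:32]] + bits[32:]
    # b), c) long division of M(x) * x^32 by G(x): append 32 zero bits
    rem = 0
    for bit in bits + [0] * 32:
        rem = (rem << 1) | bit
        if rem >> 32:
            rem ^= G
    # d), e) treat the remainder as 32 bits and complement it
    rem ^= 0xFFFFFFFF
    # Bit-reverse to match zlib's reflected bit numbering.
    return int(f"{rem:032b}"[::-1], 2)


print(hex(ieee_crc32(b"123456789")), hex(zlib.crc32(b"123456789")))
```

Here the 32 zero bits of step c) are appended explicitly, which is the part that optimized implementations avoid.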
https://www.kernel.org/doc/Documentation/crc32.txt
A big-endian CRC written this way would be coded like:
for (i = 0; i < input_bits; i++) {
    multiple = remainder & 0x80000000 ? CRCPOLY : 0;
    remainder = (remainder << 1 | next_input_bit()) ^ multiple;
}
Where is part c), where M(x) is multiplied by x^32? I don't see 32 zeros appended to any number.
Also, the following piece of code makes no sense to me. The code and the math don't really match up.
Evaluating the differences in CRC-32 implementations
and
unsigned short
crc16_update(unsigned short crc, unsigned char nextByte)
{
    crc ^= nextByte;
    for (int i = 0; i < 8; ++i) {
        if (crc & 1)
            crc = (crc >> 1) ^ 0xA001;
        else
            crc = (crc >> 1);
    }
    return crc;
}
What are these implementations doing? None of them really resemble the original procedure.
Even after reading the very end of this it still makes no sense:
http://www.relisoft.com/science/crcmath.html
This tutorial, in particular "10. A Slightly Mangled Table-Driven Implementation", explains well the optimization that avoids feeding an extra 32 zero bits at the end.
The bottom line is that you feed the bits into the end of the register instead of the start, which has the same effect as feeding a register-length's worth of zeros at the end.
The tutorial also shows nicely how the implementation you quoted implements the long division over GF(2).
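The equivalence described above can be tried out with two small bit-at-a-time loops in Python. This is only a sketch: the function names are invented, `poly` is given without its top term, and a 16-bit polynomial (0x1021, initial value 0, no reflection) is used purely to keep the numbers short.

```python
def to_bits(data: bytes):
    """Message bits, most significant bit of each byte first."""
    return [(byte >> (7 - i)) & 1 for byte in data for i in range(8)]


def crc_append_zeros(bits, poly, n):
    """Shift message bits in at the bottom, followed by n zero bits."""
    mask = (1 << n) - 1
    rem = 0
    for bit in bits + [0] * n:
        top = (rem >> (n - 1)) & 1
        rem = ((rem << 1) | bit) & mask
        if top:
            rem ^= poly
    return rem


def crc_feed_top(bits, poly, n):
    """XOR each message bit into the top of the register; no padding needed."""
    mask = (1 << n) - 1
    rem = 0
    for bit in bits:
        top = ((rem >> (n - 1)) & 1) ^ bit
        rem = (rem << 1) & mask
        if top:
            rem ^= poly
    return rem


msg = to_bits(b"123456789")
print(hex(crc_append_zeros(msg, 0x1021, 16)), hex(crc_feed_top(msg, 0x1021, 16)))
```

The first loop has the same shape as the kernel documentation's big-endian loop and needs the zero padding; the second produces the same remainder from the message bits alone.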

Calculating polynomial division result as well as remainder (CRC)

I'm trying to write a table-based CRC routine for receiving Mode S uplink interrogator messages. On the downlink side, the CRC is just the 24-bit CRC based on polynomial P=0x1FFF409. So far, so good -- I wrote a table-based implementation that follows the usual byte-at-a-time convention, and it's working fine.
On the uplink side, though, things get weird. The protocol specification says that calculating the target uplink address is by finding:
U' = x^24 * U / G(x)
...where U is the received message and G(x) is the encoding polynomial 0x1FFF409, resulting in:
U' = x^24 * m(x) + A(x) + r(x) / G(x)
...where m(x) is the original message, A(x) is the address, and r(x) is the remainder. I want the low-order quotient A(x); i.e., the result of the GF(2) polynomial division operation instead of the remainder. The remainder is effectively discarded. The target address is encoded with the transmitted checksum such that the receiving aircraft can validate the checksum by comparing it with its address.
This is great and all, and I have a bitwise implementation which follows from the above. Please ignore the weird shifting of the polynomial and checksum, this has been cribbed from this Pascal implementation (on page 15) which assumes 32-bit registers and makes optimizations based on that assumption. In reality the message and checksum come as a single, 56-bit transmission.
#This is the reference bit-shifting implementation. It is slow.
def uplink_bitshift_crc():
    p = 0xfffa0480 #polynomial (0x1FFF409 shifted left 7 bits)
    a = 0x00000000 #rx'ed uplink data (32 bits)
    adr = 0xcc5ee900 #rx'ed checksum (24 bits, shifted left 8 bits)
    ad = 0 #will hold division result low-order bits
    for j in range(56):
        #if MSBit is 1, xor w/poly
        if a & 0x80000000:
            a = a ^ p
        #shift off the top bit of A (we're done with it),
        #and shift in the top bit of adr
        a = ((a << 1) & 0xFFFFFFFF) + ((adr >> 31) & 1)
        #shift off the top bit of adr
        adr = (adr << 1) & 0xFFFFFFFF
        if j > 30:
            #shift ad left 1 bit and shift in the msbit of a
            #this extracts the LS 24bits of the division operation
            #and ignores the remainder at the end
            ad = ad + ((a >> 31) & 1)
            ad = ((ad << 1) & 0xFFFFFFFF)
    #correct the ad
    ad = ad >> 2
    return ad
The above is of course slower than molasses in software and I'd really like to be able to construct a lookup table that would allow similar byte-at-a-time calculation of the received address, or massage the remainder (which is quickly calculated) into a quotient.
TL;DR:
Given a message, the encoding polynomial, and the remainder (calculated by the normal CRC method), is there a faster way to obtain the quotient of the polynomial division operation than by using shift registers to do polynomial division "longhand"?
You might take a look at the PyCRC library; it may answer your questions.
Too late for the OP, but I'm posting this for others that might see this question. You can generate two tables to operate a byte at a time. The first 256 by 8 bit table is indexed by the current leading 8 bits of the dividend (message), and the 8 bit values are the quotients. The second 256 by 32 bit table is indexed by the 8 bit quotient and the 32 bit values are the 32 bit product of the 8 bit quotient times the 25 bit polynomial (since this is a carryless multiply, the product is 32 bits, (x^7 * x^24 = x^31)), which you xor to the upper 32 bits of the dividend, which will zero out the upper 8 bits of the dividend. Then loop back for the next 8 bits of the dividend.
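A rough Python sketch of the two-table scheme described above follows. It assumes the 25-bit polynomial 0x1FFF409 from the question and a dividend passed as an integer of a known byte length; the function names and table names are invented here, and a bit-at-a-time divider is included only as a reference for comparison.

```python
G = 0x1FFF409  # 25-bit generator from the question (degree 24)


def gf2_divmod(a, b):
    """Reference bit-at-a-time GF(2) long division: (quotient, remainder)."""
    q = 0
    while a.bit_length() >= b.bit_length():
        shift = a.bit_length() - b.bit_length()
        a ^= b << shift
        q |= 1 << shift
    return q, a


def build_tables(poly):
    """qtab[t]: quotient byte for leading dividend byte t.
    ptab[q]: 32-bit carryless product q * poly (poly must have degree 24)."""
    qtab, ptab = [0] * 256, [0] * 256
    for t in range(256):
        window, q, product = t << 24, 0, 0
        for i in range(7, -1, -1):
            if window & (1 << (24 + i)):
                window ^= poly << i
                product ^= poly << i
                q |= 1 << i
        qtab[t] = q
        ptab[q] = product
    return qtab, ptab


QTAB, PTAB = build_tables(G)


def divide_bytewise(dividend, nbytes):
    """Divide an nbytes-long dividend by G one byte at a time."""
    quotient = 0
    for pos in range(nbytes - 4, -1, -1):
        lead = (dividend >> (8 * pos + 24)) & 0xFF   # current leading byte
        q = QTAB[lead]
        dividend ^= PTAB[q] << (8 * pos)             # zeroes the leading byte
        quotient = (quotient << 8) | q
    return quotient, dividend                         # dividend is now the remainder


msg = 0x123456789ABCDE  # arbitrary 56-bit test value
print(divide_bytewise(msg, 7) == gf2_divmod(msg, G))
```

For a 7-byte dividend the loop runs four times and yields a 32-bit quotient plus the 24-bit remainder.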
A modern X86 cpu has the carryless multiply instruction, PCLMULQDQ that operates on 128 bit xmm registers, performing a 64 bit by 64 bit multiply to produce a 128 bit product (since it's a carryless multiply bit 127 is always 0, so it's really a 127 bit product). A multiply of the 56 bit message by the 41 bit constant 2^64/G(x) will produce a 96 bit product, of which the upper 32 bits will be the quotient (lower 64 bits are not used).
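The multiply-by-reciprocal idea can be modeled in Python with a software carryless multiply standing in for PCLMULQDQ. This is a sketch only: `clmul`, `gf2_divmod`, and `MU` are names introduced here, and the constant is computed by long division rather than taken from any reference.

```python
G = 0x1FFF409  # 25-bit generator from the question (degree 24)


def clmul(a, b):
    """Carryless (GF(2)) multiplication of two non-negative integers."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def gf2_divmod(a, b):
    """Bit-at-a-time GF(2) long division: (quotient, remainder)."""
    q = 0
    while a.bit_length() >= b.bit_length():
        shift = a.bit_length() - b.bit_length()
        a ^= b << shift
        q |= 1 << shift
    return q, a


# x^64 / G(x): a degree-40 polynomial, i.e. a 41-bit constant.
MU, _ = gf2_divmod(1 << 64, G)


def quotient_by_multiply(msg):
    """Quotient of a message of up to 56 bits divided by G, via one multiply."""
    return clmul(msg, MU) >> 64


msg = 0x123456789ABCDE  # arbitrary 56-bit test value
print(MU.bit_length(), quotient_by_multiply(msg) == gf2_divmod(msg, G)[0])
```

The shift by 64 is exact here because both error terms in the product have degree below 64 for a 56-bit message and a degree-24 generator.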