Impact of data padding on CRC calculation

I am calculating a CRC over a large chunk of data every cycle in hardware (64B per cycle). To parallelize the CRC calculation, I want to calculate the CRCs of small data chunks and then XOR them in parallel.
Approach:
1. Divide the data into small chunks (64B of data divided into 8 chunks of 8B each).
2. Calculate the CRCs of all the chunks individually (8 CRCs in parallel for the 8B chunks).
3. Finally, calculate the CRC for the padded data. This answer points out that the CRC for padded data is obtained by multiplying the old CRC by x^n.
Hence, I calculate the CRC of a small chunk of data, then multiply it by the CRC of 0x1 shifted 'i' times, as shown below.
In short, I am trying to accomplish the following:
For example, using CRC-8 on this site:
Input Data=(0x05 0x07) CRC=0x54
Step-1: Data=0x5 CRC=0x1B
Step-2: Data=0x7 CRC=0x15
Step-3: Data=(0x1 0x0) CRC=0x15
Step-4: Multiply the step-1 CRC by the step-3 CRC, modulo the polynomial 0x07. So, I calculate (0x1B).(0x15) = (0x1 0xC7) mod 0x7.
Step-5: Calculate CRC Data=(0x1 0xC7) CRC=0x4E (I assume this is the same as (0x1 0xC7) mod 0x7)
Step-6: XOR the result to get the final CRC. 0x4E^0x15=0x5B
As we can see, the result in step-6 is not the correct result.
Can someone help me calculate the CRC for padded data, or tell me where I am going wrong in the above example?

Rather than calculating and then adjusting multiple CRCs, the bytes of data can be carryless-multiplied to form a set of 16 bit "folded" products, which are XORed together, with a single modulo operation performed on the XORed result. An optimized modulo operation uses two carryless multiplies, so it is deferred until all the folded products have been generated and XORed together. A carryless multiply uses XOR instead of ADD, and a borrowless divide uses XOR instead of SUB. Intel has a PDF about this using the XMM instruction PCLMULQDQ (carryless multiply): 16 bytes are read at a time and split into two 8 byte groups, each group is folded into a 16 byte product, and the two 16 byte products are XORed to form a single 16 byte product. Using 8 XMM registers to hold folding products, 128 bytes are processed at a time (256 bytes at a time in the case of AVX512 and ZMM registers).
https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-crc-computation-generic-polynomials-pclmulqdq-paper.pdf
Assume your hardware can implement a carryless multiply that takes two 8 bit operands and produces a 16 bit (technically 15 bit) product.
Let message = M = 31 32 33 34 35 36 37 38. In this case CRC(M) = C7
pre-calculated constants (all values shown in hex):
2^38%107 = DF cycles forwards 0x38 bits
2^30%107 = 29 cycles forwards 0x30 bits
2^28%107 = 62 cycles forwards 0x28 bits
2^20%107 = 16 cycles forwards 0x20 bits
2^18%107 = 6B cycles forwards 0x18 bits
2^10%107 = 15 cycles forwards 0x10 bits
2^08%107 = 07 cycles forwards 0x08 bits
2^00%107 = 01 cycles forwards 0x00 bits
16 bit folded (cycled forward) products (can be calculated in parallel):
31·DF = 16CF
32·29 = 07E2
33·62 = 0AC6
34·16 = 03F8
35·6B = 0A17
36·15 = 038E
37·07 = 0085
38·01 = 0038
----
V = 1137 the xor of the 8 folded products
CRC(V) = 113700 % 107 = C7
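The folding above can be modelled in software to check the numbers. This is a minimal sketch, assuming the CRC-8 here is the non-reflected polynomial 0x107 with a zero initial value and no final XOR (that variant reproduces the values above); the function names are illustrative, not taken from the Intel paper. `fold8` forms the eight products independently, as hardware would in parallel, and `crc_of_fold` does the one remaining modulo by plain long division.

```c
#include <stddef.h>
#include <stdint.h>

#define POLY 0x107u /* x^8 + x^2 + x + 1 */

/* Carryless multiply of two 8-bit operands: XOR replaces ADD. */
uint16_t clmul8(uint8_t a, uint8_t b)
{
    uint16_t r = 0;
    for (int i = 0; i < 8; i++)
        if ((b >> i) & 1u)
            r ^= (uint16_t)(a << i);
    return r;
}

/* 2^n % POLY: the constant that cycles a byte forward n bits. */
uint8_t powmod(int n)
{
    uint16_t d = 1;
    for (int i = 0; i < n; i++) {
        d <<= 1;
        if (d & 0x100u)
            d ^= POLY;
    }
    return (uint8_t)d;
}

/* Byte i of an 8-byte group is cycled forward 8*(7-i) bits; the eight
   products are independent, so they can be formed in parallel. */
uint16_t fold8(const uint8_t msg[8])
{
    uint16_t v = 0;
    for (int i = 0; i < 8; i++)
        v ^= clmul8(msg[i], powmod(8 * (7 - i)));
    return v;
}

/* The single modulo at the end: CRC(V) = (V << 8) % POLY. */
uint8_t crc_of_fold(uint16_t v)
{
    uint32_t r = (uint32_t)v << 8;
    for (int bit = 23; bit >= 8; bit--)
        if (r & (1ul << bit))
            r ^= (uint32_t)POLY << (bit - 8);
    return (uint8_t)r;
}

/* Reference bit-at-a-time CRC-8 over the whole message, for comparison. */
uint8_t crc8_bitwise(const uint8_t *msg, size_t len)
{
    uint8_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= msg[i];
        for (int k = 0; k < 8; k++)
            crc = (crc & 0x80u) ? (uint8_t)((crc << 1) ^ 0x07u)
                                : (uint8_t)(crc << 1);
    }
    return crc;
}
```

With M = {0x31, ..., 0x38}, fold8 returns 0x1137, and both crc_of_fold(0x1137) and crc8_bitwise(M, 8) give 0xC7.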
To avoid having to use borrowless divide for the modulo operation, CRC(V) can be computed using carryless multiply. For example
V = FFFE
CRC(V) = FFFE00 % 107 = 23.
Implementation, again all values in hex (hex 10 = decimal 16), ⊕ is XOR.
input:
V = FFFE
constants:
P = 107 polynomial
I = 2^10 / 107 = 107 "inverse" of polynomial
by coincidence, it's the same value
2^10 % 107 = 15 for folding right 16 bits
fold the upper 8 bits of FFFE00 16 bits to the right:
U = FF·15 ⊕ FE00 = 0CF3 ⊕ FE00 = F2F3 (check: F2F3%107 = 23 = CRC)
Q = ((U>>8)·I)>>8 = (F2·107)>>8 = ...
to avoid a 9 bit operand, split up 107 = 100 ⊕ 7
Q = ((F2·100) ⊕ (F2·07))>>8 = ((F2<<8) ⊕ (F2·07))>>8 = (F200 ⊕ 02DE)>>8 = F0DE>>8 = F0
X = Q·P = F0·107 = F0·100 ⊕ F0·07 = F0<<8 ⊕ F0·07 = F000 ⊕ 02D0 = F2D0
CRC = U ⊕ X = F2F3 ⊕ F2D0 = 23
Since the CRC is 8 bits, the upper 8 bits are not needed in the last two steps, although dropping them doesn't save much in the overall calculation:
X = (Q·(P&FF))&FF = (F0·07)&FF = D0
CRC = (U&FF) ⊕ X = F3 ⊕ D0 = 23
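A small C model of this carryless-multiply reduction may help when mapping it to hardware. It is a sketch under the same assumptions (polynomial 0x107, an 8x8 carryless multiplier); `clmul8`, `crc_of_fold_clmul` and `crc_of_fold_div` are hypothetical names. The long-division version is included only so the two can be compared.

```c
#include <stdint.h>

/* Carryless multiply of two 8-bit operands -> product of at most 15 bits. */
uint16_t clmul8(uint8_t a, uint8_t b)
{
    uint16_t r = 0;
    for (int i = 0; i < 8; i++)
        if ((b >> i) & 1u)
            r ^= (uint16_t)(a << i);
    return r;
}

/* CRC(V) = (V << 8) % 0x107 using only 8x8 carryless multiplies, shifts and
   XORs. Constants from the text: 0x15 = 2^0x10 % 0x107 (fold 16 bits right),
   I = 2^0x10 / 0x107 = 0x107, split as 0x100 ^ 0x07 to keep operands 8 bits. */
uint8_t crc_of_fold_clmul(uint16_t v)
{
    /* U = FF.15 ^ FE00 in the example: fold the top byte of V<<8 right */
    uint16_t u = clmul8((uint8_t)(v >> 8), 0x15) ^ (uint16_t)((v & 0xFFu) << 8);
    /* Q = ((U >> 8) . I) >> 8 with I = 0x100 ^ 0x07 */
    uint8_t uh = (uint8_t)(u >> 8);
    uint8_t q = (uint8_t)((((uint16_t)uh << 8) ^ clmul8(uh, 0x07)) >> 8);
    /* only the low byte of X = Q . P is needed: (Q . 0x07) & 0xFF */
    uint8_t x = (uint8_t)clmul8(q, 0x07);
    return (uint8_t)(u ^ x); /* CRC = (U & 0xFF) ^ X */
}

/* Reference: the same remainder by borrowless long division. */
uint8_t crc_of_fold_div(uint16_t v)
{
    uint32_t r = (uint32_t)v << 8;
    for (int bit = 23; bit >= 8; bit--)
        if (r & (1ul << bit))
            r ^= 0x107ul << (bit - 8);
    return (uint8_t)r;
}
```

crc_of_fold_clmul(0xFFFE) follows the steps above (U = F2F3, Q = F0, X&FF = D0) and returns 0x23; comparing it with crc_of_fold_div over all 65536 inputs is a quick way to confirm the constants.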
Example program to generate 2^0x10 / 0x107 and powers of 2 % 0x107:
#include <stdio.h>
#include <stdint.h>
#define poly 0x107
uint16_t geninv(void)              /* generate 2^16 / 9 bit poly */
{
    uint16_t q = 0x0000u;          /* quotient */
    uint16_t d = 0x0001u;          /* initial dividend = 2^0 */
    for(int i = 0; i < 16; i++){
        d <<= 1;                   /* bring down the next dividend bit */
        q <<= 1;
        if(d&0x0100){              /* if bit 8 set */
            q |= 1;                /* q |= 1 */
            d ^= poly;             /* d ^= poly */
        }
    }
    return q;                      /* return inverse */
}
uint8_t powmodpoly(int n)          /* generate 2^n % 9 bit poly */
{
    uint16_t d = 0x0001u;          /* initial dividend = 2^0 */
    for(int i = 0; i < n; i++){
        d <<= 1;                   /* shift dvnd left */
        if(d&0x0100){              /* if bit 8 set */
            d ^= poly;             /* d ^= poly */
        }
    }
    return (uint8_t)d;             /* return remainder */
}
int main(void)
{
    printf("%04x\n", geninv());
    printf("%02x %02x %02x %02x %02x %02x %02x %02x %02x %02x\n",
           powmodpoly(0x00), powmodpoly(0x08), powmodpoly(0x10), powmodpoly(0x18),
           powmodpoly(0x20), powmodpoly(0x28), powmodpoly(0x30), powmodpoly(0x38),
           powmodpoly(0x40), powmodpoly(0x48));
    printf("%02x\n", powmodpoly(0x77)); /* 0xd9, cycles crc backwards 8 bits */
    return 0;
}
Long-hand example for 2^0x10 / 0x107:
                             100000111   quotient
                   -------------------
divisor  100000111 | 10000000000000000   dividend
                     100000111
                     ---------
                           111000000
                           100000111
                           ---------
                            110001110
                            100000111
                            ---------
                             100010010
                             100000111
                             ---------
                                 10101   remainder
I don't know how many registers your hardware design can have, but assume there are five 16 bit registers used to hold folded values, and either two or eight 8 bit registers (depending on how parallel the folding is). Then, following the Intel paper, you fold the values for all 64 bytes, 8 bytes at a time, and only need one modulo operation. Register sizes: fold# = 16 bits, reg# = 8 bits. Note that the powers of 2 modulo poly are pre-calculated constants.
foldv = prior buffer's folding value, equivalent to folded msg[-2 -1]
reg0 = foldv>>8
reg1 = foldv&0xFF
foldv = reg0·((2^0x18)%poly) advance by 3 bytes
foldv ^= reg1·((2^0x10)%poly) advance by 2 bytes
fold0 = msg[0 1] ^ foldv handling 2 bytes at a time
fold1 = msg[2 3]
fold2 = msg[4 5]
fold3 = msg[6 7]
for(i = 8; i < 64; i += 8){
    reg0 = fold0>>8
    reg1 = fold0&ff
    fold0 = reg0·((2^0x48)%poly) advance by 9 bytes
    fold0 ^= reg1·((2^0x40)%poly) advance by 8 bytes
    fold0 ^= msg[i+0 i+1]
    reg2 = fold1>>8 if not parallel, reg0
    reg3 = fold1&ff and reg1
    fold1 = reg2·((2^0x48)%poly) advance by 9 bytes
    fold1 ^= reg3·((2^0x40)%poly) advance by 8 bytes
    fold1 ^= msg[i+2 i+3]
    ...
    fold3 ^= msg[i+6 i+7]
}
reg0 = fold0>>8
reg1 = fold0&ff
fold0 = reg0·((2^0x38)%poly) advance by 7 bytes
fold0 ^= reg1·((2^0x30)%poly) advance by 6 bytes
reg2 = fold1>>8 if not parallel, reg0
reg3 = fold1&ff and reg1
fold1 = reg2·((2^0x28)%poly) advance by 5 bytes
fold1 ^= reg3·((2^0x20)%poly) advance by 4 bytes
fold2 ... advance by 3 and 2 bytes
fold3 ... advance by 1 and 0 bytes
foldv = fold0^fold1^fold2^fold3
Say the final buffer has 5 bytes:
foldv = prior folding value, equivalent to folded msg[-2 -1]
reg0 = foldv>>8
reg1 = foldv&0xFF
foldv = reg0·((2^0x30)%poly) advance by 6 bytes
foldv ^= reg1·((2^0x28)%poly) advance by 5 bytes, now aligned with msg[4]
fold0 = msg[0 1]
reg0 = fold0>>8
reg1 = fold0&ff
fold0 = reg0·((2^0x20)%poly) advance by 4 bytes
fold0 ^= reg1·((2^0x18)%poly) advance by 3 bytes
fold1 = msg[2 3]
reg2 = fold1>>8
reg3 = fold1&ff
fold1 = reg2·((2^0x10)%poly) advance by 2 bytes
fold1 ^= reg3·((2^0x08)%poly) advance by 1 byte
fold2 = msg[4] just one byte loaded
fold3 = 0
foldv ^= fold0^fold1^fold2^fold3
now use the method above to calculate CRC(foldv)
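The scheme can be checked in software before committing it to hardware. The sketch below is a C model under the same assumptions as above (polynomial 0x107, zero initial value, no final XOR); `advance`, `pair`, `fold_block64`, `fold_tail5` and the other names are hypothetical. `advance(f, k)` cycles a 16-bit fold value so that its low byte moves forward k bytes and its high byte k + 1 bytes, i.e. the pair of "advance by" multiplies used in each step. The main loop runs over bytes 8 to 63, so the whole 64-byte buffer is folded, and in the 5-byte tail the carried-in fold value is cycled straight to the alignment of the last byte.

```c
#include <stddef.h>
#include <stdint.h>

#define POLY 0x107u /* x^8 + x^2 + x + 1 */

/* 8x8 carryless multiply -> product of at most 15 bits. */
uint16_t clmul8(uint8_t a, uint8_t b)
{
    uint16_t r = 0;
    for (int i = 0; i < 8; i++)
        if ((b >> i) & 1u)
            r ^= (uint16_t)(a << i);
    return r;
}

/* 2^n % POLY; in hardware these are pre-calculated constants. */
uint8_t powmod(int n)
{
    uint16_t d = 1;
    for (int i = 0; i < n; i++) {
        d <<= 1;
        if (d & 0x100u)
            d ^= POLY;
    }
    return (uint8_t)d;
}

/* Cycle a 16-bit fold value forward: low byte by k bytes, high byte by k+1. */
uint16_t advance(uint16_t f, int k)
{
    return clmul8((uint8_t)(f >> 8), powmod(8 * (k + 1))) ^
           clmul8((uint8_t)(f & 0xFFu), powmod(8 * k));
}

/* Load msg[0 1] as one 16-bit value, msg[0] in the high byte. */
uint16_t pair(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* One 64-byte buffer with four 16-bit fold registers, 8 bytes per step.
   foldv is the previous buffers' fold, aligned as msg[-2 -1] (0 at start). */
uint16_t fold_block64(uint16_t foldv, const uint8_t msg[64])
{
    uint16_t fold[4];
    fold[0] = pair(msg) ^ advance(foldv, 2);     /* advance 3 and 2 bytes */
    for (int j = 1; j < 4; j++)
        fold[j] = pair(msg + 2 * j);
    for (int i = 8; i < 64; i += 8)              /* bytes 8..63 */
        for (int j = 0; j < 4; j++)              /* parallel in hardware */
            fold[j] = advance(fold[j], 8) ^ pair(msg + i + 2 * j); /* 9 and 8 */
    /* align every register with msg[63] before XORing them together */
    return advance(fold[0], 6) ^ advance(fold[1], 4) ^
           advance(fold[2], 2) ^ advance(fold[3], 0);
}

/* Final buffer of exactly 5 bytes: everything is aligned with msg[4]. */
uint16_t fold_tail5(uint16_t foldv, const uint8_t msg[5])
{
    return advance(foldv, 5) ^           /* 6 and 5 bytes */
           advance(pair(msg), 3) ^       /* 4 and 3 bytes */
           advance(pair(msg + 2), 1) ^   /* 2 and 1 bytes */
           msg[4];                       /* 0 bytes */
}

/* The single modulo at the end: CRC(V) = (V << 8) % POLY. */
uint8_t crc_of_fold(uint16_t v)
{
    uint32_t r = (uint32_t)v << 8;
    for (int bit = 23; bit >= 8; bit--)
        if (r & (1ul << bit))
            r ^= (uint32_t)POLY << (bit - 8);
    return (uint8_t)r;
}

/* Reference bit-at-a-time CRC-8 (poly 0x07, init 0, no reflection). */
uint8_t crc8_bitwise(const uint8_t *msg, size_t len)
{
    uint8_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= msg[i];
        for (int k = 0; k < 8; k++)
            crc = (crc & 0x80u) ? (uint8_t)((crc << 1) ^ 0x07u)
                                : (uint8_t)(crc << 1);
    }
    return crc;
}
```

Chaining fold_block64 over the full buffers, then fold_tail5, and finally crc_of_fold should give the same result as a bit-at-a-time CRC-8 over the whole message; in hardware, the calls to advance inside the inner loop correspond to the parallel multiplies.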

As shown in your diagram, you need to calculate the CRC of 0x05 0x00, (A,0), and the CRC of 0x00 0x07, (0,B), and then exclusive-or those together. Calculating on the site you linked, you get 0x41 and 0x15 respectively. Exclusive-or those together, and, voila, you get 0x54, the CRC of 0x05 0x07.
There is a shortcut for (0,B), since for this CRC, the CRC of a string of zeros is zero. You can calculate the CRC of just 0x07 and get the same result as for 0x00 0x07, which is 0x15.
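To reproduce these numbers, here is a minimal bit-at-a-time CRC-8 in C. It assumes the site's CRC-8 is polynomial 0x07 with initial value 0x00, no reflection and no final XOR, which matches all the values quoted in the question and above; the function name is illustrative.

```c
#include <stddef.h>
#include <stdint.h>

/* CRC-8: polynomial 0x07, initial value 0x00, no reflection, no final XOR. */
uint8_t crc8(const uint8_t *msg, size_t len)
{
    uint8_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= msg[i];                  /* bring in the next byte */
        for (int k = 0; k < 8; k++)     /* one division step per bit */
            crc = (crc & 0x80u) ? (uint8_t)((crc << 1) ^ 0x07u)
                                : (uint8_t)(crc << 1);
    }
    return crc;
}
```

crc8 on {0x05, 0x00} returns 0x41, on {0x00, 0x07} (or just {0x07}) it returns 0x15, and 0x41 ^ 0x15 = 0x54, which is crc8 on {0x05, 0x07}. The XOR split works because, with a zero initial value and no final XOR, this CRC is linear in the message bits.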
See crcany for how to combine CRCs in general. crcany will generate C code to compute any specified CRC, including code to combine CRCs. It employs a technique that applies n zeros to a CRC in O(log(n)) time instead of O(n) time.
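The log-time idea can be sketched for this particular CRC-8 without crcany. This is not crcany's code: it is a simpler stand-in that computes x^(8n) mod P by repeated squaring, again assuming polynomial 0x07 with a zero initial value and no final XOR; `mulmod`, `xpow8n` and `crc8_combine` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

#define POLY 0x107u /* x^8 + x^2 + x + 1 */

/* a * b mod P for two CRC-8 residues: carryless multiply, then reduce. */
uint8_t mulmod(uint8_t a, uint8_t b)
{
    uint16_t r = 0;
    for (int i = 0; i < 8; i++)
        if ((b >> i) & 1u)
            r ^= (uint16_t)(a << i);
    for (int bit = 14; bit >= 8; bit--)
        if (r & (1u << bit))
            r ^= (uint16_t)(POLY << (bit - 8));
    return (uint8_t)r;
}

/* x^(8n) mod P by square-and-multiply: O(log n) steps instead of
   feeding n zero bytes through the CRC. */
uint8_t xpow8n(size_t n)
{
    uint8_t result = 0x01; /* x^0 */
    uint8_t base = 0x07;   /* x^8 mod P */
    while (n != 0) {
        if (n & 1u)
            result = mulmod(result, base);
        base = mulmod(base, base);
        n >>= 1;
    }
    return result;
}

/* CRC(A || B) from CRC(A), CRC(B) and the length of B in bytes.
   Only valid for this CRC's zero initial value and zero final XOR. */
uint8_t crc8_combine(uint8_t crc_a, uint8_t crc_b, size_t len_b)
{
    return (uint8_t)(mulmod(crc_a, xpow8n(len_b)) ^ crc_b);
}

/* Bit-at-a-time CRC-8 (poly 0x07, init 0), used to check the combine. */
uint8_t crc8(const uint8_t *msg, size_t len)
{
    uint8_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= msg[i];
        for (int k = 0; k < 8; k++)
            crc = (crc & 0x80u) ? (uint8_t)((crc << 1) ^ 0x07u)
                                : (uint8_t)(crc << 1);
    }
    return crc;
}
```

For the question's data, crc8_combine(0x1B, 0x15, 1) returns 0x54: appending one zero byte multiplies CRC(A) by x^8 mod P = 0x07, giving 0x41. In the question, step 4 multiplies by 0x15 (x^16 mod P) and step 5 effectively multiplies by x^8 again, which is why the result there is off.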

Related

How does this alignment work? ((n + ZBI_ALIGNMENT - 1) & -ZBI_ALIGNMENT)

I'm trying to understand how this alignment works. It should align a uint32 address to the next 8-byte-aligned address:
static inline uint32_t
ZBI_ALIGN(uint32_t n) {
    return ((n + ZBI_ALIGNMENT - 1) & -ZBI_ALIGNMENT);
}
Let's take n=10 and ZBI_ALIGNMENT=8. The next aligned address should be 16.
returns ((10 + 8 - 1) & -8) = 17 & -8
Why should this be aligned?
The key to this formula is that it is only valid if ZBI_ALIGNMENT is a power of two, which is not a big deal because alignment requirements tend to meet that criterion.
A number being aligned to (aka being a multiple of) a power of two means that all bits below that power of two are 0. You can convince yourself of that easily by looking at a few 8-bit numbers:
15: 00001111
16: 00010000 <--- aligned to 16
17: 00010001
31: 00011111
32: 00100000 <--- aligned to 16
48: 00110000 <--- aligned to 16
Assuming we have a mask with only the bits of value 16 and higher set, N & mask would be a no-op for all multiples of 16, and would give us the previous multiple of 16 for all other values.
16: 00010000
mask for 16: 11110000
15 & mask -> 00000000 : 0
16 & mask -> 00010000 : 16
17 & mask -> 00010000 : 16
32 & mask -> 00100000 : 32
To get the right value directly, we can use (N + 15) & mask instead. If N is already a multiple of 16, N + 15 lands just short of the next multiple, so the mask brings it back down to N. Otherwise, adding 15 always "bumps" the value into the next range, e.g. 1 + 15 = 16, 16 + 15 = 31, etc. This generalises as (N + (DESIRED_ALIGNMENT - 1)).
So all that's left to figure out is how to calculate the mask for a given desired alignment.
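Conveniently, in two's complement representation (which C++20 and C23 require for signed integers, and which practically all hardware uses), the negative of a power of two is exactly the mask we need. For unsigned types such as uint32_t, negation is defined modulo 2^N, so -ZBI_ALIGNMENT produces the same bit pattern.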
For 8 bit numbers it looks like this:
-1 -> 11111111
-2 -> 11111110
-4 -> 11111100
-8 -> 11111000
etc...
So mask can simply be computed as -ZBI_ALIGNMENT.
Putting all this together, we get:
((n + ZBI_ALIGNMENT - 1) & -ZBI_ALIGNMENT)
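As a sketch, the formula can be wrapped in a function that also checks the power-of-two precondition. `align_up` is a hypothetical name, and the assert is an addition that the original ZBI_ALIGN does not have.

```c
#include <assert.h>
#include <stdint.h>

/* Round n up to the next multiple of align (ZBI_ALIGN with the alignment
   passed in). Only valid when align is a power of two. */
static inline uint32_t align_up(uint32_t n, uint32_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0); /* power of two? */
    /* -align has every bit at or above the alignment bit set,
       e.g. -8 == 0xFFFFFFF8, the same as ~(align - 1). */
    return (n + align - 1) & -align;
}
```

align_up(10, 8) computes (10 + 7) & 0xFFFFFFF8 = 17 & 0xFFFFFFF8 = 16, while align_up(16, 8) stays 16. Values above UINT32_MAX - align + 1 wrap around to 0, so callers that can see such addresses need a separate overflow check.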

Walkthrough: sum 2 integers using bit manipulation

I am trying to understand the logic behind the following code which sums 2 integers using bit manipulation:
def sum(a, b):
    while b != 0:
        carry = a & b
        a = a ^ b
        b = carry << 1
    return a
As an example I used: a = 11 and b = 7
11 in binary representation is 1011
7 in binary representation is 0111
Then I walked through the algorithm:
iter #1: a = 1011, b = 0111
carry = 0011 (3 decimal)
a = 1100 (12 decimal)
b = 0110 (6 decimal)
iter #2: a = 1100, b = 0110
carry = 0100 (4 decimal)
a = 1010 (10 decimal)
b = 1000 (8 decimal)
iter #3: a = 1010, b = 1000
carry = 1000 (8 decimal)
a = 00010 (2 decimal)
b = 10000 (16 decimal)
iter #4: a = 00010, b = 10000
carry = 00000 (0 decimal)
a = 10010 (18 decimal)
b = 00000 (0 decimal)
We're done (because b is now 0).
As we can see, in all iterations a+b is always 18 which is the right answer.
However, I fail to understand what actually happens here. The value of a keeps going down with each iteration until it suddenly pops up to 18 in the last iteration. Also, can we learn anything from the value of the carry during the process?
I would love to understand the intuition behind this.
Thanks to WJS's answer, I think I got it.
Let's add 11 and 7 as before, but do it in the following order:
First, calculate it without the carry.
Second, calculate only the carry.
Then add both parts.
01011
00111
-----
01100 (neglecting carry)
00110 (finding only the carry)
-----
10010 (sum)
Now, to find the first part, how can we get rid of the carry bits? With XOR.
To find the second part, we use AND and then shift the result 1 bit to the left to place it "under" the right bit.
Now all we have to do is sum both parts. The whole point is to avoid the + operator, so how can we do that? Recursion!
We assign the first part to a and the second part to b and we repeat this process until b=0 which means we are done.
Perhaps if you take a simpler example it will help.
a = 11
b = 11
a & b == 11 since AND returns 1's where both bits in the same
position are 1. These are the carry bits.
Now get rid of the carry locations using exclusive or
a = a ^ b == 00
But a `carry` would cause addition to add bits one position to
the left, so shift the carry bits left by 1 bit.
b = carry << 1 = 110
now repeat the process
carry = a & b = 00 & 110 == 0 no more carries
a = a ^ b = 00 ^ 110 == 110
b = carry << 1 == 0
done.
11 + 11 = 110 = 3 + 3 = 6
Understanding the roles of AND (&) and XOR (^) is key. Applying them to slightly more complex examples should help. But ignore the interim decimal values, as they don't help much; think only about what is happening in binary.
I think this is easy to understand if you look at what happens with individual bits.
The first step is calculating the carry, which in binary only happens when both bits are 1, so a&b calculates it for every bit at once. Then the bitwise addition happens via XOR (ignoring the carry), and XOR works because:
0+0=0 (==0^0)
1+0=1 (==1^0)
1+1=0 (==1^1, generates carry bit which we ignore)
The next step is to shift the carry to the left (<<1), move it to b, and repeat until the carry is empty.
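The same loop in C, as a sketch with a hypothetical name `add_bits`. It uses uint32_t on purpose: with Python's unbounded integers the loop above never ends for some negative inputs (for example sum(-1, 1) keeps producing a = -2, -4, -8, ... and b = 2, 4, 8, ...), whereas with a fixed 32-bit width the carries eventually fall off the top.

```c
#include <stdint.h>

/* Add two unsigned 32-bit numbers using only AND, XOR and shift.
   The lowest set bit of b moves left every round, so after at most
   32 rounds all carries have been absorbed or shifted out. */
uint32_t add_bits(uint32_t a, uint32_t b)
{
    while (b != 0) {
        uint32_t carry = a & b; /* columns where both bits are 1 */
        a = a ^ b;              /* per-column sum, carries ignored */
        b = carry << 1;         /* each carry moves into the next column */
    }
    return a;                   /* result is modulo 2^32 */
}
```

add_bits(11, 7) returns 18 after the four rounds traced in the question, and add_bits(3, 3) returns 6.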

How many decimal number possibilities are there for a 26-bit Wiegand number?

I have successfully emulated a 26 bit Wiegand signal using an ESP32. Basically, the program transforms a manually entered decimal number into a proper 26 bit Wiegand binary number and then sends it on 2 wires following the protocol:
bool* wiegandArray = new bool[26];
void getWiegand(unsigned int dec) {
    // transform dec number into binary number using single bit shift operation
    // and store it in wiegandArray[]
    for (int i = 24; i > 0; --i) {
        wiegandArray[i] = dec & 1;
        dec >>= 1;
    }
    // check for parity of the first 12 bits
    bool even = 0;
    for(int i = 1; i < 13; i++) {
        even ^= wiegandArray[i];
    }
    // add 0 or 1 as first bit (leading parity bit - even) based on the number of 'ones' in the first 12 bits
    wiegandArray[0] = even;
    // check for parity of the last 12 bits
    bool odd = 1;
    for(int i = 13; i < 25; i++) {
        odd ^= wiegandArray[i];
    }
    // add 0 or 1 as last bit (trailing parity bit - odd) based on the number of 'ones' in the last 12 bits
    wiegandArray[25] = odd;
}
Using this online calculator I can generate appropriate decimal numbers for a 26 bit Wiegand number.
Now, the problem that I am facing is that the end-user will actually input a CARD ID. A Card ID is a decimal number that should always result in a 24 bit binary number: 8 bits of facility code and 16 bits of ID code. And upon this 24 bit number I apply the parity bits to get a 26 bit code.
For example:
CARD ID= 16336141 / 101000111000110100101101
Facility Code: 163 / 10100011
Card Number: 36141 / 1000110100101101
Resulting 26 Wiegand: 10718509 / 11010001110001101001011010
The problem is that I don't know how to tackle this.
How can I generate a 26 bit Wiegand from 0? That would be 0 00000000 0000000000000000 1.
The largest 24 bit number is 16777215. But 8 bits for site codes (0-255) and 16 bits for card numbers (0-65535) mean 255*65535 = 16711425.
What is the actual range? Should I start generating 26 bit Wiegand binary numbers from 0?
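For illustration, here is a sketch that builds the frame as one integer instead of a bool array; the same code is also valid C++. It assumes the 26-bit layout described above and used by getWiegand: an even parity bit over the first 12 payload bits, an 8-bit facility code, a 16-bit card number, and an odd parity bit over the last 12 payload bits. `wiegand26` is a hypothetical name.

```c
#include <stdint.h>

/* Build a 26-bit Wiegand frame, MSB first:
   bit 25     = even parity over payload bits 23..12 (facility + top of card)
   bits 24..1 = 24-bit payload (8-bit facility, 16-bit card number)
   bit 0      = odd parity over payload bits 11..0 */
uint32_t wiegand26(uint8_t facility, uint16_t card)
{
    uint32_t payload = ((uint32_t)facility << 16) | card; /* 0..16777215 */
    uint32_t first12 = payload >> 12;
    uint32_t last12 = payload & 0xFFFu;
    uint32_t even = 0, odd = 1;
    for (int i = 0; i < 12; i++) {
        even ^= (first12 >> i) & 1u; /* ends as 1 if first12 has an odd count of ones */
        odd ^= (last12 >> i) & 1u;   /* ends as 1 if last12 has an even count of ones */
    }
    return (even << 25) | (payload << 1) | odd;
}
```

wiegand26(163, 36141) returns 54991450, whose 26-bit binary form is 11010001110001101001011010, and wiegand26(0, 0) returns 1 (0 00000000 0000000000000000 1). The 24-bit payload is facility << 16 | card, so 163 and 36141 give 10718509; with facility 0-255 and card 0-65535 it takes every value from 0 to 16777215 (256 × 65536 combinations).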

Using python to convert a 16 bit number into 2 byte number

I have 16 bit numbers such as 65303, which I need to convert into a 2 byte number using Python. Thanks!
If you have a 16 bit number, you can compute the lo part (bit 0 to bit 7) and the hi part (bit 8 to bit 15) with:
n = 65303
lo = n & 0x00ff
hi = n >> 8
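The same masking and shifting in C, as a small sketch; the names split16 and join16 are hypothetical, and joining the bytes back is the inverse operation.

```c
#include <stdint.h>

/* Split a 16-bit value into its high and low bytes. */
void split16(uint16_t n, uint8_t *hi, uint8_t *lo)
{
    *lo = (uint8_t)(n & 0x00FFu); /* bits 0..7 */
    *hi = (uint8_t)(n >> 8);      /* bits 8..15 */
}

/* Rebuild the 16-bit value from its two bytes. */
uint16_t join16(uint8_t hi, uint8_t lo)
{
    return (uint16_t)((hi << 8) | lo);
}
```

For 65303 (0xFF17), split16 gives hi = 0xFF (255) and lo = 0x17 (23). In Python itself, n.to_bytes(2, 'big') returns the same two bytes as a bytes object.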

What do "Non-Power-Of-Two Textures" mean?

What do "Non-Power-Of-Two Textures" mean? I read this tutorial and came across some binary operations ("<<", ">>", "^", "~"), but I don't understand what they do.
For example following code:
GLuint LTexture::powerOfTwo(GLuint num)
{
    if (num != 0)
    {
        num--;
        num |= (num >> 1); //Or first 2 bits
        num |= (num >> 2); //Or next 2 bits
        num |= (num >> 4); //Or next 4 bits
        num |= (num >> 8); //Or next 8 bits
        num |= (num >> 16); //Or next 16 bits
        num++;
    }
    return num;
}
I really want to understand these operations. I also read this very short article. I wanted to see usage examples but did not find any, so I did a test:
int a = 5;
a <<= 1; //a = 10
a = 5;
a <<= 2; //a = 20
a = 5;
a <<= 3; //a = 40
Okay, this is like multiplying by two, but
int a = 5;
a >>= 1; // a = 2 Whaat??
In C++, the <<= is the "left binary shift" assignment operator; the operand on the left is treated as a binary number, the bits are moved to the left, and zero bits are inserted on the right.
The >>= is the right binary shift; bits are moved to the right and "fall off" the right end, so it's like a division by 2 (for each bit) but with truncation. For negative signed integers, by the way, typical compilers shift additional 1 bits in at the left end (an "arithmetic right shift", which C++20 guarantees), which may be surprising; for positive signed integers, or unsigned integers, 0 bits are shifted in at the left ("logical right shift").
"Powers of two" are the numbers created by successive doublings of 1: 2, 4, 8, 16, 32… Most graphics hardware prefers to work with texture maps which are powers of two in size.
As said in http://lazyfoo.net/tutorials/OpenGL/08_non_power_of_2_textures/index.php
powerOfTwo takes the argument and finds the smallest power of two that is greater than or equal to it.
GLuint powerOfTwo( GLuint num );
/*
Pre Condition:
-None
Post Condition:
-Returns nearest power of two integer that is greater
Side Effects:
-None
*/
Let's test:
num=60 (decimal) and its binary is 111100
num--; .. 59 111011
num |= (num >> 1); //Or first 2 bits 011101 | 111011 = 111111
num |= (num >> 2); //Or next 2 bits 001111 | 111111 = 111111
num |= (num >> 4); //Or next 4 bits 000011 | 111111 = 111111
num |= (num >> 8); //Or next 8 bits 000000 | 111111 = 111111
num |= (num >> 16); //Or next 16 bits 000000 | 111111 = 111111
num++; ..63+1 = 64
output 64.
For num=5: num-1 = 4 (binary 0100); after all the num |= (num >> N) steps it becomes 0111 (7 decimal). Then num+1 is equal to 8.
As you probably know, data in computers is represented in the binary system, in which digits are either 1 or 0.
So for example number 10 decimal = 1010 binary. (1*2^3 + 0*2^2 + 1*2^1 + 0*2^0).
Let's go to the operations now.
Binary | OR means that wherever at least one of the two bits is 1, the output is 1.
1010
| 0100
------
1110
~ NOT means negation i.e. all 0s become 1s and all 1s become 0s.
~ 1010
------
0101
^ XOR means you turn a pair of 1 and 0 into a 1. All other combinations leave a 0 as output.
1010
^ 0110
------
1100
Bit shift.
N >> x means we "slide" our number N, x bits to the right.
1010 >> 1 = 0101(0) // zero in the brackets is dropped,
since it goes out of the representation = 0101
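1001 >> 1 = 0100(1) // (1) is dropped = 0100
<< behaves the same way, just in the opposite direction: zeros come in on the right, and bits pushed past the left end are dropped.
0011 << 1 = 0110
1000 << 1 = (1)0000 // (1) is dropped in a 4-bit representation = 0000
Since each binary digit stands for a power of 2, shifting left by one position multiplies by 2 (as long as no 1 bit falls off the left end), and shifting right by one position divides by 2, discarding the remainder.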
Let num = 36. First subtract 1, giving 35. In binary, this is 100011.
Right shift by 1 position gives 10001 (the rightmost digit disappears). Bitwise Or'ed with num gives:
100011
 10001
-------
110011
Note that this ensures two 1's on the left.
Now right shift by 2 positions, giving 1100. Bitwise Or:
110011
  1100
-------
111111
This ensures four 1's on the left.
And so on, until the value is completely filled with 1's from the leftmost.
Add 1 and you get 1000000, a power of 2.
This procedure always generates a power of two, and you can check that it is the smallest one greater than or equal to the initial value of num (a value that is already a power of two comes back unchanged).
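To experiment with the trick, here is a fixed-width C version of the same procedure; `next_pow2` is a hypothetical name, and uint32_t stands in for GLuint, which is a 32-bit unsigned type. It makes the edge cases explicit: a value that is already a power of two is returned unchanged (that is what the initial num-- is for), 0 stays 0, and inputs above 2^31 wrap around to 0 because the answer would not fit in 32 bits.

```c
#include <stdint.h>

/* Smallest power of two >= num, same bit trick as powerOfTwo. */
uint32_t next_pow2(uint32_t num)
{
    if (num != 0) {
        num--;            /* an exact power of two must map to itself */
        num |= num >> 1;  /* copy the highest 1 bit into the bit below it, */
        num |= num >> 2;  /* then into the next 2 bits, */
        num |= num >> 4;  /* the next 4, */
        num |= num >> 8;  /* the next 8, */
        num |= num >> 16; /* and the next 16: all lower bits are now 1 */
        num++;            /* 0..0111..1 + 1 = next power of two (0 if > 2^31) */
    }
    return num;
}
```

next_pow2(60) and next_pow2(36) both return 64, next_pow2(5) returns 8, and next_pow2(64) returns 64.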