Making a CRC table for AES3 (AES-2003) - c++

As a bit of insight into what I am doing, I am attempting to process AES/EBU subframes for an SDI interface. That shouldn't be too important; let's abstract away from that.
Page 12 of a standards document calls for a CRC check using the polynomial: G(x) = x^8 + x^4 + x^3 + x^2 + 1 (or x^0).
The document can be found here: http://tech.ebu.ch/docs/tech/tech3250.pdf
As you can probably anticipate, I would like to generate a CRC table for the given formula. I've come across a code-snippet which uses the formula G(x) = x^8 + x^2 + x^1 + x^0.
The code snippet can be located here:
http://www.koders.com/cpp/fid9C544B36B8C41721691790197D38DAC91D2C29EF.aspx?s=crc#L8
Could the formula be modified (see the modified version below) to work with my AES3 CRC? Will the following work?
// x^8 + x^4 + x^3 + x^2 + x^0 or (1)
void make_crc_table( void )
{
    int i, j;
    unsigned long poly, c;
    /* terms of polynomial defining this crc (except x^8): */
    static const byte p[] = {0,2,3,4};

    poly = 0L;
    for ( i = 0; i < sizeof( p ) / sizeof( byte ); i++ )
    {
        poly |= 1L << p[i];
    }

    for ( i = 0; i < 256; i++ )
    {
        c = i;
        for ( j = 0; j < 8; j++ )
        {
            // ZeroDefect: This part has me worried.
            c = ( c & 0x80 ) ? poly ^ ( c << 1 ) : ( c << 1 );
        }
        crctable[i] = (byte) c;
    }
}
Any tips/suggestions would be much appreciated.
ZeroDefect.

As far as I can tell, this is just encoding all the possible state transitions for the CRC feedback register (see the diagrams at Wikipedia) into a lookup table.
It looks like all you should need to do is modify the p[] array to take account of your tap positions.
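For what it's worth, here is a minimal, self-contained sketch of that idea (my own naming; the 0xFF initial value and the byte ordering are assumptions, so check the Tech 3250 text for the exact CRCC conventions):

#include <array>
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Build the 256-entry table for G(x) = x^8 + x^4 + x^3 + x^2 + 1.
// 0x1D encodes the taps x^4 + x^3 + x^2 + 1 (x^8 is implicit), i.e. p[] = {0,2,3,4}.
static std::array<std::uint8_t, 256> make_crc_table()
{
    const std::uint8_t poly = 0x1D;
    std::array<std::uint8_t, 256> table{};
    for ( int i = 0; i < 256; i++ )
    {
        std::uint8_t c = static_cast<std::uint8_t>( i );
        for ( int j = 0; j < 8; j++ )
        {
            c = ( c & 0x80 ) ? static_cast<std::uint8_t>(( c << 1 ) ^ poly)
                             : static_cast<std::uint8_t>( c << 1 );
        }
        table[i] = c;
    }
    return table;
}

// Table-driven CRC over a buffer; the 0xFF seed is an assumption, not taken from the spec text.
static std::uint8_t crc8( const std::uint8_t *data, std::size_t len,
                          const std::array<std::uint8_t, 256> &table,
                          std::uint8_t crc = 0xFF )
{
    for ( std::size_t k = 0; k < len; k++ )
        crc = table[ crc ^ data[k] ];
    return crc;
}

int main()
{
    const auto table = make_crc_table();
    const std::uint8_t bytes[] = { 0x12, 0x34, 0x56 };
    std::printf( "CRC = 0x%02X\n", crc8( bytes, sizeof bytes, table ) );
}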

Related

Karatsuba - polynomials multiplication with CUDA

I'm using CUDA for the iterative Karatsuba algorithm and I would like to ask why one of my kernels always computes different results.
First, I implemented this function, which always computed the result correctly:
__global__ void kernel_res_main(TYPE *A, TYPE *B, TYPE *D, TYPE *result, TYPE size, TYPE resultSize){
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if( i > 0 && i < resultSize - 1){
        TYPE start = (i >= size) ? (i % size ) + 1 : 0;
        TYPE end = (i + 1) / 2;
        for(TYPE inner = start; inner < end; inner++){
            result[i] += ( A[inner] + A[i - inner] ) * ( B[inner] + B[i - inner] );
            result[i] -= ( D[inner] + D[i-inner] );
        }
    }
}
Now I would like to use the 2D grid and use CUDA for the for-loop, so I changed my function to this:
__global__ void kernel_res_nested(TYPE *A, TYPE *B, TYPE *D, TYPE *result, TYPE size, TYPE resultSize){
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    int j = blockDim.y * blockIdx.y + threadIdx.y;
    TYPE rtmp = result[i];
    if( i > 0 && i < resultSize - 1){
        TYPE start = (i >= size) ? (i % size ) + 1 : 0;
        TYPE end = (i + 1) >> 1;
        if(j >= start && j <= end ){
            // WRONG
            rtmp += ( A[j] + A[i - j] ) * ( B[j] + B[i - j] ) - ( D[j] + D[i - j] );
        }
    }
    result[i] = rtmp;
}
I am calling this function like this:
dim3 block( 32, 8 );
dim3 grid( (resultSize+1/32) , (resultSize+7/8) );
kernel_res_nested <<<grid, block>>> (devA, devB, devD, devResult, size, resultSize);
And the result is always wrong and always different. I can't understand why the second implementation is wrong and always computes wrong results. I can't see any logical problem connected with data dependency. Does anyone know how I can solve this problem?
For questions like this, you are supposed to provide an MCVE. (See item 1 here.) For example, I don't know what type is indicated by TYPE, and it does matter for the correctness of the solution I will propose.
In your first kernel, only one thread in your entire grid was reading and writing location result[i]. But in your second kernel, you now have multiple threads writing to the result[i] location. They are conflicting with each other. CUDA doesn't specify the order in which threads will run, and some may run before, after, or at the same time as, others. In this case, some threads may read result[i] at the same time as others. Then, when the threads write their results, they will be inconsistent. And it may vary from run-to-run. You have a race condition there (execution order dependency, not data dependency).
The canonical method to sort this out would be to employ a reduction technique.
However for simplicity, I will suggest that atomics could help you sort it out. This is easier to implement based on what you have shown, and will help confirm the race condition. After that, if you want to try a reduction method, there are plenty of tutorials for that (one is linked above) and plenty of questions here on the cuda tag about it.
You could modify your kernel to something like this, to sort out the race condition:
__global__ void kernel_res_nested(TYPE *A, TYPE *B, TYPE *D, TYPE *result, TYPE size, TYPE resultSize){
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    int j = blockDim.y * blockIdx.y + threadIdx.y;
    if( i > 0 && i < resultSize - 1){
        TYPE start = (i >= size) ? (i % size ) + 1 : 0;
        TYPE end = (i + 1) >> 1;
        if(j >= start && j < end ){ // see note below
            atomicAdd(result+i, (( A[j] + A[i - j] ) * ( B[j] + B[i - j] ) - ( D[j] + D[i - j] )));
        }
    }
}
Note that depending on your GPU type, and the actual type of TYPE you are using, this may not work (may not compile) as-is. But since you had previously used TYPE as a loop variable, I am assuming it is an integer type, and the necessary atomicAdd for those should be available.
A few other comments:
This may not be giving you the grid size you expect:
dim3 grid( (resultSize+1/32) , (resultSize+7/8) );
I think the usual calculations there would be:
dim3 grid( (resultSize+31)/32, (resultSize+7)/8 );
I always recommend proper CUDA error checking and running your codes with cuda-memcheck, any time you are having trouble with a CUDA code, to make sure there are no runtime errors.
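For reference, a minimal checking pattern looks something like this (just a sketch; the macro name is my own):

#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            std::fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                         cudaGetErrorString(err_), __FILE__, __LINE__);    \
            std::exit(EXIT_FAILURE);                                       \
        }                                                                  \
    } while (0)

// After a kernel launch (using the names from your code):
// kernel_res_nested<<<grid, block>>>(devA, devB, devD, devResult, size, resultSize);
// CUDA_CHECK(cudaGetLastError());       // reports launch-configuration errors
// CUDA_CHECK(cudaDeviceSynchronize());  // reports errors raised while the kernel runs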
It also looks to me like this:
if(j >= start && j <= end ){
should be this:
if(j >= start && j < end ){
to match your for-loop range. I am also making an assumption that size is less than resultSize (again, a MCVE would help).

C++ polynomials: indefinite integrals

I am trying to find the indefinite integral of a polynomial, however neither my maths nor my coding is great. My code compiles but I believe I have the wrong formula:
Polynomial Polynomial :: indefiniteIntegral() const
{
    Polynomial Result;
    Result.fDegree = fDegree + 1;
    for ( int i = fDegree; i > 0 ; i--){
        Result.fCoeffs[i] = pow(fCoeffs[i], (Result.fDegree)) / (Result.fDegree);
    }
    return Result;
}
Looks like what you want is
for ( int i = fDegree + 1; i > 0; --i ) {   // fDegree + 1: the integral has one more term
    Result.fCoeffs[i] = fCoeffs[i-1] / static_cast<float>(i);
}
I don't know the underlying implementation of your class, so I don't know how you're implementing fCoeffs (whether it's doubles or floats) and whether you need to worry about i being out of bounds. If it's a vector, then it definitely needs to be initialized to the right size; if it's a map, then you may not need to.
Try something like
Polynomial Polynomial::indefiniteIntegral() const
{
    Polynomial Result;
    Result.fDegree = fDegree + 1;
    for (int i = Result.fDegree; i > 0 ; i--) {
        Result.fCoeffs[i] = fCoeffs[i-1] / i;
    }
    Result.fCoeffs[0] = 0;   // arbitrary constant of integration
    return Result;
}
Each monomial a x^i is stored as value a in fCoeffs[i]; after integration it should be moved to fCoeffs[i+1], multiplied by 1/(i+1). The lowest coefficient is set to 0.
And yes, you better make sure there is room for the highest coefficient.
Example: [1 1] is 1 + x and should become C + x + 1/2 x^2 which is represented by [0 1 0.5], keeping in mind that we introduced an arbitrary constant.
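Putting that together, here is a minimal self-contained sketch (assuming fCoeffs is a std::vector<double> indexed by power; your class layout may differ):

#include <iostream>
#include <vector>

struct Polynomial
{
    std::vector<double> fCoeffs;   // fCoeffs[i] is the coefficient of x^i
    int fDegree;

    Polynomial indefiniteIntegral() const
    {
        Polynomial Result;
        Result.fDegree = fDegree + 1;
        Result.fCoeffs.assign( fDegree + 2, 0.0 );           // room for the new highest term
        Result.fCoeffs[0] = 0.0;                             // arbitrary constant, chosen as 0
        for ( int i = 0; i <= fDegree; i++ )
            Result.fCoeffs[i + 1] = fCoeffs[i] / ( i + 1 );  // a x^i  ->  a/(i+1) x^(i+1)
        return Result;
    }
};

int main()
{
    Polynomial p{ { 1.0, 1.0 }, 1 };           // 1 + x
    Polynomial q = p.indefiniteIntegral();     // expect 0 1 0.5, i.e. x + 0.5 x^2
    for ( double c : q.fCoeffs )
        std::cout << c << ' ';
    std::cout << '\n';
}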

How to compute sum of evenly spaced binomial coefficients

How to find sum of evenly spaced Binomial coefficients modulo M?
i.e. (C(n,a) + C(n,a+r) + C(n,a+2r) + C(n,a+3r) + ... + C(n,a+kr)) % M = ?
given: 0 <= a < r, a + kr <= n < a + (k+1)r, n < 10^5, r < 100
My first attempt was:
int res = 0;
int mod = 1000000009;
for (int k = 0; a + r*k <= n; k++) {
    res = (res + mod_nCr(n, a+r*k, mod)) % mod;
}
but this is not efficient. So after reading here
and this paper I found out the above sum is equivalent to:
sum over j of [ω^(-ja) * (1 + ω^j)^n / r], for 0 <= j < r; where ω = e^(i2π/r) is a primitive rth root of unity.
What would the code be to find this sum in O(r)?
Edit:
n can go up to 10^5 and r can go up to 100.
Original problem source: https://www.codechef.com/APRIL14/problems/ANUCBC
Editorial for the problem from the contest: https://discuss.codechef.com/t/anucbc-editorial/5113
After revisiting this post 6 years later, I'm unable to recall how I transformed the original problem statement into my version; nonetheless, I have shared the links above in case anyone wants to look at the correct solution approach.
Binomial coefficients are coefficients of the polynomial (1+x)^n. The sum of the coefficients of x^a, x^(a+r), etc. is the coefficient of x^a in (1+x)^n in the ring of polynomials mod x^r-1. Polynomials mod x^r-1 can be specified by an array of coefficients of length r. You can compute (1+x)^n mod (x^r-1, M) by repeated squaring, reducing mod x^r-1 and mod M at each step. This takes about log_2(n)r^2 steps and O(r) space with naive multiplication. It is faster if you use the Fast Fourier Transform to multiply or exponentiate the polynomials.
For example, suppose n=20 and r=5.
(1+x) = {1,1,0,0,0}
(1+x)^2 = {1,2,1,0,0}
(1+x)^4 = {1,4,6,4,1}
(1+x)^8 = {1,8,28,56,70,56,28,8,1}
{1+56,8+28,28+8,56+1,70}
{57,36,36,57,70}
(1+x)^16 = {3249,4104,5400,9090,13380,9144,8289,7980,4900}
{3249+9144,4104+8289,5400+7980,9090+4900,13380}
{12393,12393,13380,13990,13380}
(1+x)^20 = (1+x)^16 (1+x)^4
= {12393,12393,13380,13990,13380}*{1,4,6,4,1}
{12393,61965,137310,191440,211585,203373,149620,67510,13380}
{215766,211585,204820,204820,211585}
This tells you the sums for the 5 possible values of a. For example, for a=1, 211585 = 20c1+20c6+20c11+20c16 = 20+38760+167960+4845.
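A minimal sketch of that procedure (my own names; the modulus 1000000009 is taken from the question's first attempt):

#include <cstdint>
#include <iostream>
#include <vector>

using Poly = std::vector<std::uint64_t>;
const std::uint64_t M = 1000000009ULL;

// Multiply two length-r polynomials, reducing exponents mod r and values mod M.
Poly mul( const Poly &p, const Poly &q, int r )
{
    Poly res( r, 0 );
    for ( int i = 0; i < r; i++ )
        for ( int j = 0; j < r; j++ )
            res[( i + j ) % r] = ( res[( i + j ) % r] + p[i] * q[j] ) % M;
    return res;
}

// Returns (C(n,a) + C(n,a+r) + C(n,a+2r) + ...) % M via (1+x)^n mod (x^r - 1, M).
std::uint64_t spacedBinomSum( long long n, int r, int a )
{
    Poly base( r, 0 ), result( r, 0 );
    base[0 % r] = ( base[0 % r] + 1 ) % M;   // the "1" term of (1 + x)
    base[1 % r] = ( base[1 % r] + 1 ) % M;   // the "x" term (folds onto 1 when r == 1)
    result[0] = 1;                           // the identity polynomial
    while ( n > 0 )                          // repeated squaring
    {
        if ( n & 1 ) result = mul( result, base, r );
        base = mul( base, base, r );
        n >>= 1;
    }
    return result[a % r];                    // coefficient of x^a
}

int main()
{
    // n = 20, r = 5, a = 1 should give 211585, matching the worked example above.
    std::cout << spacedBinomSum( 20, 5, 1 ) << '\n';
}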
Something like this, but you have to check a, n and r yourself because I just put in arbitrary values without regard for the constraints:
#include <complex>
#include <cmath>
#include <iostream>
using namespace std;
int main( void )
{
    const int r = 10;
    const int a = 2;
    const int n = 4;

    complex<double> i(0.,1.), res(0., 0.), w;

    for( int j(0); j<r; ++j )
    {
        w = exp( i * 2. * M_PI / (double)r );   // primitive r-th root of unity
        res += pow( w, -j * a ) * pow( 1. + pow( w, j ), n ) / (double)r;
    }

    cout << res.real() << endl;   // the sum is real-valued up to rounding error
    return 0;
}
The mod operation is expensive, so try to avoid it as much as possible:
uint64_t res = 0;
int mod = 1000000009;
for (int k = 0; a + r*k <= n; k++) {
    res += mod_nCr(n, a+r*k, mod);
    if (res >= mod)
        res %= mod;
}
I did not test this code.
I don't know whether you got anywhere with this question, but the key to implementing this formula is to figure out that the w^i are independent and therefore form a ring. In simpler terms, you should think of implementing
(1+x)^n % (x^r - 1), i.e. finding (1+x)^n in the ring Z[x]/(x^r - 1).
If that is confusing, here is an easy implementation:
Make a vector of size r. O(r) space + O(r) time.
Initialize this vector with zeros everywhere. O(r) space + O(r) time.
Make the first two elements of that vector 1. O(1).
Calculate (x+1)^n using fast exponentiation. Each multiplication takes O(r^2) and there are log n multiplications, therefore O(r^2 log(n)).
Return the first element of the vector. O(1). (More generally, the element at index a is the coefficient of x^a.)
Complexity
O(r^2 log(n)) time and O(r) space.
The r^2 can be reduced to r log(r) using the Fourier transform.
How is the multiplication done? It is regular polynomial multiplication, with the exponents reduced mod r:
vector<long long> p1(r,0);
vector<long long> p2(r,0);
p1[0]=p1[1]=1;
p2[0]=p2[1]=1;
Now we want to do the multiplication:
vector<long long> res(r,0);
for(int i=0;i<r;i++)
{
    for(int j=0;j<r;j++)
    {
        res[(i+j)%r] += (p1[i]*p2[j]);   // in the real code, also reduce mod M here
    }
}
return res[0];   // or res[a] for the coefficient of x^a
I have implemented this part before; if you are still confused about something, let me know. I would prefer that you implement the code yourself, but if you need the code, let me know.

Gaussian blur not uniform

I have been trying to implement a simple Gaussian blur algorithm, for my image editing program. However, I have been having some trouble making this work, and I think the problem lies in the below snippet:
for( int j = 0; j < pow( kernel_size, 2 ); j++ )
{
    int idx = ( i + kx + ( ky * img.width ));

    //Try and overload this whenever possible
    valueR += ( img.p_pixelArray[ idx ].r * kernel[ j ] );
    valueG += ( img.p_pixelArray[ idx ].g * kernel[ j ] );
    valueB += ( img.p_pixelArray[ idx ].b * kernel[ j ] );

    if( kx == kernel_limit )
    {
        kx = -kernel_limit;
        ky++;
    }
    else
    {
        kx++;
    }
}
kx = -kernel_limit;
ky = -kernel_limit;
A brief explanation of the code above: kernel_size is the size of the kernel (or matrix) generated by the Gaussian blur formula. kx and ky are variables used for iterating over the kernel. i is the index of the parent loop, which nests this one and goes over every pixel in the image. Each value variable simply holds a float R, G, or B value, and is used afterwards to obtain the final result. The if-else is used to increment kx and ky. idx is used to find the correct pixel. kernel_limit is a variable set to
(kernel_size - 1) / 2
so I can have kx going from -1 (with a 3x3 kernel) to +1, and the same thing with ky. I think the problem lies with the line
int idx = ( i + kx + ( ky * img.width ));
But I am not sure. The image I get is:
As can be seen, the color is blurred in a diagonal direction, and looks more like some kind of motion blur than Gaussian blur. If someone could help out, I would be very grateful.
EDIT:
The way I fill the kernel is as follows:
for( int i = 0; i < pow( kernel_size, 2 ); i++ )
{
    // This. Is. Lisp.
    kernel[i] = (( 1 / ( 2 * pi * pow( sigma, 2 ))) * pow (e, ( -((( pow( kx, 2 ) + pow( ky, 2 )) / 2 * pow( sigma, 2 ))))));

    if(( kx + 1 ) == kernel_size )
    {
        kx = 0;
        ky++;
    }
    else
    {
        kx++;
    }
}
Few problems:
Your Gaussian is missing brackets (even though you already have plenty) around 2 * pow( sigma, 2 ). As written, you multiply by the variance instead of dividing by it.
But your actual problem is that your Gaussian is centered at kx = ky = 0, while you let kx and ky run from 0 to kernel_size instead of from -kernel_limit to kernel_limit. This results in the diagonal blurring. Something like the following should work better:
kx = -kernel_limit;
ky = -kernel_limit;
int kernel_size_sq = kernel_size * kernel_size;
for( int i = 0; i < kernel_size_sq; i++ )
{
    double sigma_sq = sigma * sigma;
    double kx_sq = kx * kx;
    double ky_sq = ky * ky;
    kernel[i] = 1.0 / ( 2 * pi * sigma_sq) * exp(-(kx_sq + ky_sq) / (2 * sigma_sq));

    if(kx == kernel_limit )
    {
        kx = -kernel_limit;
        ky++;
    }
    else
    {
        kx++;
    }
}
Also note how I got rid of your Lisp-ness, plus some improvements: use intermediate variables for clarity (the compiler will optimize them away anyway if you ask it to); simple multiplication is faster than pow(x, 2); and pow(e, x) == exp(x).

Taylor McLaughlin Series to estimate the distance of two points

Distance from point to point: dist = sqrt(dx * dx + dy * dy);
But sqrt is too slow and I can't accept that. I found a method in a book called the Taylor McLaughlin series for estimating the distance between two points, but I can't comprehend the following code. Thanks to anyone who helps me.
#define MIN(a, b) ((a < b) ? a : b)

int FastDistance2D(int x, int y)
{
    // This function computes the distance from 0,0 to x,y with 3.5% error
    // First compute the absolute value of x, y
    x = abs(x);
    y = abs(y);

    // Compute the minimum of x, y
    int mn = MIN(x, y);

    // Return the distance
    return x + y - (mn >> 1) - (mn >> 2) + (mn >> 4);
}
I have consulted related material about the McLaughlin series, but I still can't comprehend how the return value uses it to estimate the distance. Thanks, everyone.
This task is almost a duplicate of another one:
Very fast 3D distance check?
And there was link to great article:
http://www.azillionmonkeys.com/qed/sqroot.html
In the article you can find different approaches for approximating the square root. For example, maybe this one is suitable for you:
int isqrt (long r) {
    float tempf, x, y, rr;
    int is;

    rr = (long) r;
    y = rr*0.5;
    *(unsigned long *) &tempf = (0xbe6f0000 - *(unsigned long *) &rr) >> 1;
    x = tempf;
    x = (1.5*x) - (x*x)*(x*y);
    if (r > 101123) x = (1.5*x) - (x*x)*(x*y);
    is = (int) (x*rr + 0.5);
    return is + ((signed int) (r - is*is)) >> 31;
}
If you can calculate the root operation quickly, then you can calculate the distance in the regular way:
return isqrt(a*a+b*b)
And one more link:
http://www.flipcode.com/archives/Fast_Approximate_Distance_Functions.shtml
u32 approx_distance( s32 dx, s32 dy )
{
    u32 min, max;

    if ( dx < 0 ) dx = -dx;
    if ( dy < 0 ) dy = -dy;

    if ( dx < dy )
    {
        min = dx;
        max = dy;
    } else {
        min = dy;
        max = dx;
    }

    // coefficients equivalent to ( 123/128 * max ) and ( 51/128 * min )
    return ((( max << 8 ) + ( max << 3 ) - ( max << 4 ) - ( max << 1 ) +
             ( min << 7 ) - ( min << 5 ) + ( min << 3 ) - ( min << 1 )) >> 8 );
}
You are right that sqrt is quite a slow function. But do you really need to compute the distance?
In a lot of cases you can use the distance² instead.
E.g.:
If you want to find out which distance is shorter, you can compare the squares of the distances just as well as the real distances.
If you want to check whether 100 > distance, you can just as well check 10000 > distanceSquared.
By using the distance squared in your program instead of the distance, you can often avoid calculating the sqrt.
It depends on your application whether this is an option for you, but it is always worth considering.
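A tiny illustration of that idea (names are mine): comparing squared distances gives the same ordering as comparing real distances, and a radius check just squares the radius.

#include <iostream>

struct Point { int x, y; };

// Squared Euclidean distance; no sqrt involved.
long long dist2( Point a, Point b )
{
    long long dx = a.x - b.x;
    long long dy = a.y - b.y;
    return dx * dx + dy * dy;
}

int main()
{
    Point p{ 0, 0 }, q{ 3, 4 }, r{ 6, 1 };

    // "Is q closer to p than r is?"
    bool qCloser = dist2( p, q ) < dist2( p, r );

    // "Is q within a radius of 100 of p?"  ->  compare against 100 * 100
    bool within = dist2( p, q ) < 100LL * 100LL;

    std::cout << qCloser << ' ' << within << '\n';   // prints: 1 1
}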