avx2 register bits reverse - c++

Is there a (fast) way to reverse the bits of each 32-bit int value within an AVX2 register?
<do something here>
//binary: 10100010110111001010100111010010 => 01001011100101010011101101000101
//register contains 1268071237, which is the decimal representation of 01001011100101010011101101000101

Since I can't find a suitable dupe, I'll just post it.
The main idea here is to exploit pshufb's second use as a parallel 16-entry table lookup to reverse the bits of each nibble. Reversing the bytes is obvious. Reversing the order of the two nibbles in every byte can be done either by building it into the lookup tables (saves a shift) or by explicitly shifting the low nibble up (saves a LUT).
Something like this in total, not tested:
__m256i rbit32(__m256i x) {
    // reverses the byte order within each 32-bit element
    __m256i shufbytes = _mm256_setr_epi8(3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12, 3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12);
    // bit-reversed value of each nibble, for results that land in the low nibble
    __m256i luthigh = _mm256_setr_epi8(0, 8, 4, 12, 2, 10, 6, 14, 1, 9, 5, 13, 3, 11, 7, 15, 0, 8, 4, 12, 2, 10, 6, 14, 1, 9, 5, 13, 3, 11, 7, 15);
    // same table pre-shifted by 4, for results that land in the high nibble
    __m256i lutlow = _mm256_slli_epi16(luthigh, 4);
    __m256i lowmask = _mm256_set1_epi8(15);
    __m256i rbytes = _mm256_shuffle_epi8(x, shufbytes);
    __m256i high = _mm256_shuffle_epi8(lutlow, _mm256_and_si256(rbytes, lowmask));
    __m256i low = _mm256_shuffle_epi8(luthigh, _mm256_and_si256(_mm256_srli_epi16(rbytes, 4), lowmask));
    return _mm256_or_si256(low, high);
}
In a typical context in a loop, those loads should be lifted out.
Curiously, Clang uses 4 shuffles; it duplicates the first shuffle.
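For checking the vector version, a scalar model of the same steps may help. This is only a sketch and not part of the answer above: the name rbit32_scalar is made up here, and it assumes the same 16-entry nibble table as luthigh. It reverses the byte order and, within each byte, swaps the two nibbles while reversing the bits of each one:

```cpp
#include <cassert>
#include <cstdint>

// Scalar model (hypothetical helper): byte reversal plus per-nibble table lookup.
inline std::uint32_t rbit32_scalar(std::uint32_t x) {
    // lut[v] is the 4-bit value v with its bits reversed
    static const std::uint8_t lut[16] = {0, 8, 4, 12, 2, 10, 6, 14,
                                         1, 9, 5, 13, 3, 11, 7, 15};
    std::uint32_t result = 0;
    for (int byte = 0; byte < 4; ++byte) {
        const std::uint32_t b = (x >> (8 * byte)) & 0xFFu;
        // reversed low nibble goes to the high half, reversed high nibble to the low half
        const std::uint32_t r = (std::uint32_t(lut[b & 15u]) << 4) | lut[b >> 4];
        // byte 0 of the input becomes byte 3 of the output, and so on
        result |= r << (8 * (3 - byte));
    }
    return result;
}

// Usage check with the value from the question.
inline void check_rbit32_scalar() {
    assert(rbit32_scalar(0xA2DCA9D2u) == 0x4B953B45u);  // 0x4B953B45 == 1268071237
}
```

Comparing each lane of the AVX2 result against this function on random inputs is one way to test the untested code.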


Reading integers in different endianness from binary file in C++

I'm reading an ESRI Shapefile, and to my dismay it uses big endian and little endian at different points (see, for instance, the table at page 4, plus the tables from page 5 to 8).
So I created two functions in C++, one for each endianness.
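uint32_t readBig(ifstream& f) {
    uint32_t num;
    uint8_t buf[4];
    f.read(reinterpret_cast<char *>(buf),4);
    // most significant byte first; cast before shifting to avoid signed overflow
    num = uint32_t(buf[3]) | uint32_t(buf[2])<<8 | uint32_t(buf[1])<<16 | uint32_t(buf[0])<<24;
    return num;
}
uint32_t readLittle(ifstream& f) {
    uint32_t num;
    f.read(reinterpret_cast<char *>(&num),4);
    return num;
}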
But I'm not sure this is the most efficient way to do it. Can this code be improved? Keep in mind it will run thousands, maybe millions of times for a single shapefile, so having even one of the functions call the other seems worse than having two separate functions. Is there a difference in performance between using reinterpret_cast and an explicit type conversion (char*)? Should I use the same in both functions?
Casting between pointer types does not affect performance. In this case, it's just a technicality to make the compiler happy.
If you're really making a separate call to read for every 32-bit value, the time taken by the byte-swapping operation will likely be in the noise. For speed, you should probably have your own buffering layer so that your inner loop doesn't make any function calls.
It's nice if the swap compiles down to a single opcode (like bswap), but whether or not that is possible, or the fastest option, is processor-specific.
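A buffering layer along those lines could look roughly like the sketch below. It is an illustration, not the answerer's code: loadBig32, loadLittle32 and decodeBig32 are hypothetical names, and it assumes the caller has already pulled a large chunk of the file into a byte vector with a single read.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Decode one big-endian 32-bit value from 4 raw bytes (independent of host byte order).
inline std::uint32_t loadBig32(const unsigned char* p) {
    return (std::uint32_t(p[0]) << 24) | (std::uint32_t(p[1]) << 16) |
           (std::uint32_t(p[2]) << 8) | std::uint32_t(p[3]);
}

// Decode one little-endian 32-bit value from 4 raw bytes.
inline std::uint32_t loadLittle32(const unsigned char* p) {
    return std::uint32_t(p[0]) | (std::uint32_t(p[1]) << 8) |
           (std::uint32_t(p[2]) << 16) | (std::uint32_t(p[3]) << 24);
}

// Decode `count` consecutive big-endian values starting at byte `offset`.
// Assumption: `buf` was filled beforehand by one large read from the file.
inline std::vector<std::uint32_t> decodeBig32(const std::vector<unsigned char>& buf,
                                              std::size_t offset, std::size_t count) {
    assert(offset + 4 * count <= buf.size());
    std::vector<std::uint32_t> out;
    out.reserve(count);
    for (std::size_t i = 0; i < count; ++i)
        out.push_back(loadBig32(buf.data() + offset + 4 * i));
    return out;
}
```

Because the loaders work on bytes rather than on a reinterpreted integer, they give the same result on little-endian and big-endian hosts, and compilers can often reduce them to a plain load or a load plus bswap.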
If you're really interested in maximizing speed, consider using SIMD intrinsics.
In most cases the compiler should generate a bswap instruction, which is probably sufficient. If however you need something faster than that, vpshufb is your friend...
#include <immintrin.h>
#include <cstdint>
// swap byte order in 16 x int16
inline void swap_16xi16(uint16_t input[16])
{
    constexpr uint8_t mask_data[] = {
        1, 0,
        3, 2,
        5, 4,
        7, 6,
        9, 8,
        11, 10,
        13, 12,
        15, 14,
        1, 0,
        3, 2,
        5, 4,
        7, 6,
        9, 8,
        11, 10,
        13, 12,
        15, 14
    };
    const __m256i swapped = _mm256_shuffle_epi8(
        _mm256_loadu_si256((const __m256i*)input),
        _mm256_loadu_si256((const __m256i*)mask_data)
    );
    _mm256_storeu_si256((__m256i*)input, swapped);
}
// swap byte order in 8 x int32
inline void swap_8xi32(uint32_t input[8])
{
    constexpr uint8_t mask_data[] = {
        3, 2, 1, 0,
        7, 6, 5, 4,
        11, 10, 9, 8,
        15, 14, 13, 12,
        3, 2, 1, 0,
        7, 6, 5, 4,
        11, 10, 9, 8,
        15, 14, 13, 12
    };
    const __m256i swapped = _mm256_shuffle_epi8(
        _mm256_loadu_si256((const __m256i*)input),
        _mm256_loadu_si256((const __m256i*)mask_data)
    );
    _mm256_storeu_si256((__m256i*)input, swapped);
}
// swap byte order in 4 x int64
inline void swap_4xi64(uint64_t input[4])
{
    constexpr uint8_t mask_data[] = {
        7, 6, 5, 4, 3, 2, 1, 0,
        15, 14, 13, 12, 11, 10, 9, 8,
        7, 6, 5, 4, 3, 2, 1, 0,
        15, 14, 13, 12, 11, 10, 9, 8
    };
    const __m256i swapped = _mm256_shuffle_epi8(
        _mm256_loadu_si256((const __m256i*)input),
        _mm256_loadu_si256((const __m256i*)mask_data)
    );
    _mm256_storeu_si256((__m256i*)input, swapped);
}
inline void swap_16xi16(int16_t input[16])
{ swap_16xi16((uint16_t*)input); }
inline void swap_8xi32(int32_t input[8])
{ swap_8xi32((uint32_t*)input); }
inline void swap_4xi64(int64_t input[4])
{ swap_4xi64((uint64_t*)input); }
inline void swap_8f(float input[8])
{ swap_8xi32((uint32_t*)input); }
inline void swap_4d(double input[4])
{ swap_4xi64((uint64_t*)input); }

Efficient Eigen Matrix SubIndexing + Concatenation

I'm using Eigen for easy optimization of some of my matrix math. I'm currently trying to make the following operation more efficient:
Given Matrix A:
1, 2, 3
4, 5, 6
Matrix B:
7, 11, 13, 19, 26, 7, 11
8, 9, 15, 6, 8, 4, 1
and "index map" column vector IM:
0, 1, 3, 6
I'd like to append the columns of Matrix B mapping to the indexes in IM, to Matrix A as such:
1, 2, 3, 7, 11, 19, 11
4, 5, 6, 8, 9, 6, 1
I'm currently able to do this with a massive for loop, but this is the bottleneck in my code and I'd like to avoid this:
#pragma unroll
for (int i = 0; i < 25088; i++) {
    block.noalias() += _features.col(ff[i]);
}
I've seen the discussion here and pored over the docs, but I can't figure out the right syntax for Eigen matrices: http://eigen.tuxfamily.org/bz/show_bug.cgi?id=329
Any thoughts/tips would be much appreciated!
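To pin down the operation being asked for, here is a minimal sketch in plain C++ with column-major storage. It deliberately does not use Eigen's API; ColMajor and appendColumns are hypothetical names introduced only for illustration. With column-major data each selected column of B is one contiguous block, so the whole job is a sequence of block copies:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Column-major matrix: element (r, c) is stored at data[c * rows + r].
struct ColMajor {
    std::size_t rows;
    std::size_t cols;
    std::vector<int> data;
};

// Returns [A | B(:, indexMap)]: A followed by the columns of B listed in indexMap.
inline ColMajor appendColumns(const ColMajor& a, const ColMajor& b,
                              const std::vector<std::size_t>& indexMap) {
    assert(a.rows == b.rows);
    ColMajor out{a.rows, a.cols + indexMap.size(), a.data};
    out.data.reserve(out.rows * out.cols);
    for (std::size_t idx : indexMap) {
        assert(idx < b.cols);
        // one contiguous copy per selected column
        const int* col = b.data.data() + idx * b.rows;
        out.data.insert(out.data.end(), col, col + b.rows);
    }
    return out;
}
```

Applied to the A, B and IM from the question, this yields the 2 x 7 matrix shown there.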

HMAC_GOST341194 value mismatch

I'm not sure this is the right place for this kind of issue (it's more crypto-related), but unfortunately I have no other hope of spotting the error.
So, here is the code that I use to compute HMAC_GOST341194:
HMAC hmac;
string step1, step2, step3;
ipad.assign(blockSize, 0x36);
opad.assign(blockSize, 0x5c);
for (size_t i = 0uL, e = length; i < e; ++i)
{
    ipad.replace(i, 1, 1, secret[i] ^ 0x36);
    opad.replace(i, 1, 1, secret[i] ^ 0x5c);
}
step1 = ipad + text;
hmac.hash(step1, step1.length(), step2);
step3 = opad + step2;
hmac.hash(step3, step3.length(), mac);
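For comparison, the generic HMAC construction (RFC 2104) can be sketched with the hash passed in as a parameter. This is not the code from the question: hmacSketch and HashFn are hypothetical names, the hash is assumed to map a byte string to a digest no longer than the block, and blockSize is measured in bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Any hash: byte string in, digest (as a byte string) out.
using HashFn = std::function<std::string(const std::string&)>;

// HMAC(K, text) = H((K' xor opad) || H((K' xor ipad) || text)),
// where K' is the key brought to exactly blockSize bytes.
inline std::string hmacSketch(const HashFn& hash, std::size_t blockSize,
                              std::string key, const std::string& text) {
    assert(blockSize > 0);
    if (key.size() > blockSize) key = hash(key);  // keys longer than a block are hashed first
    key.resize(blockSize, '\0');                  // shorter keys are zero-padded
    std::string ipad(blockSize, '\0');
    std::string opad(blockSize, '\0');
    for (std::size_t i = 0; i < blockSize; ++i) {
        ipad[i] = static_cast<char>(key[i] ^ 0x36);
        opad[i] = static_cast<char>(key[i] ^ 0x5c);
    }
    return hash(opad + hash(ipad + text));
}
```

Two points are worth checking against a hand-written version: the pads always span a full block even when the key is shorter, and the block size has to be given in the unit the hash actually works in.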
The hash function was double-checked: there are no errors, and all the test values match other sources.
My block size is 256.
I use the following S-boxes (CryptoPro parameter set):
const unsigned char S[8][16] = {
{ 10, 4, 5, 6, 8, 1, 3, 7, 13, 12, 14, 0, 9, 2, 11, 15 },
{ 5, 15, 4, 0, 2, 13, 11, 9, 1, 7, 6, 3, 12, 14, 10, 8 },
{ 7, 15, 12, 14, 9, 4, 1, 0, 3, 11, 5, 2, 6, 10, 8, 13 },
{ 4, 10, 7, 12, 0, 15, 2, 8, 14, 1, 6, 5, 13, 11, 9, 3 },
{ 7, 6, 4, 11, 9, 12, 2, 10, 1, 8, 0, 14, 15, 13, 3, 5 },
{ 7, 6, 2, 4, 13, 9, 15, 0, 10, 1, 5, 11, 8, 14, 12, 3 },
{ 13, 14, 4, 1, 7, 0, 5, 10, 3, 12, 8, 15, 6, 2, 9, 11 },
{ 1, 3, 10, 9, 5, 11, 4, 15, 8, 6, 7, 14, 13, 0, 2, 12 },
};
Here is what I have as an example (the only sample I found):
K(ASCII) = "s=, ehesttgiyga bnss esi2leh3 mT"
K(in hex) = 733d2c20 65686573 74746769 79676120
626e7373 20657369 326c6568 33206d54 (32 bytes)
text (ASCII) = "This is message, length=32 bytes"
text (in hex) = 54686973 20697320 6D657373 6167652C
206C656E 6774683D 33322062 79746573
HMAC_GOSTR3411 = 4ff66c94 bddaae61 13360514 2b582b9c
0f38bbdf f3d7f0ee 6a9c935d 92bfa107
However, my value is: C0F2FE71C3CA016356722646308B69453BB4CD1E232231E04BEB03DB6976F128
Any help, whether by providing more test data or by verifying or rejecting the existing example, would be appreciated.
I've found the answer to this question: the example is incorrect, but there was also a bug in my code.
So, assuming
K(ASCII) = "s=, ehesttgiyga bnss esi2leh3 mT"
text (ASCII) = "This is message, length=32 bytes"
ipad || text hash is: E9D755A47F72A558AE5E75F5B141F5B174E7B1FED281436F3FE835D78D0D9F05
opad || hash(ipad || text) hash is: 8FF55DDAAB167A22DE98286F10458A1619BC45C88F6EAC9CE947ED3FFB348822
The second value is the HMAC_GOST341194 itself.
All hashes were computed using my code and verified using cpverify with the parameters cpverify.exe -mk -alg GR3411 "%YOUR_PATH_TO_FILE%\hash.txt"
I hope this saves someone a couple of hours of researching why the example HMAC is computed incorrectly.
PoC is available here.

inexplicable cuda behavior related to memory

So basically I took my C++ code (which works correctly) and rewrote it in CUDA (I have no experience with CUDA). One part of the code (the solve() kernel) is not working correctly and I really don't know why.
So my question is: what exactly does the "unspecified launch failure" error during cudaMemcpy mean, and why is it happening in my code?
My second question is: why do the variables backup_ans and ans differ when they compute the same thing?
#include "stdio.h"
#include <algorithm>
__device__ unsigned int primes[1024];
__device__ long long n = 1ll<<32; // #unsigned_integers
__device__ int hashh(long long x) {
    return (x>>1)%1024;
}
// compute (x^e)%n
__device__ unsigned long long mulmod(unsigned long long x,unsigned long long e,unsigned long long n) {
    unsigned long long ans = 1;
    while(e>0) {
        if(e&1) ans = (ans*x)%n;
        x = (x*x)%n;
        e>>=1;
    }
    return ans;
}
// determine whether n is strong probable prime base a or not.
// n is ODD
__device__ int is_SPRP(unsigned long long a,unsigned long long n) {
    int d=0;
    unsigned long long t = n-1;
    // write n-1 as t*2^d with t odd
    while(t%2==0) {
        ++d;
        t>>=1;
    }
    unsigned long long x = mulmod(a,t,n);
    if(x==1) return 1;
    for(int i=0;i<d;++i) {
        if(x==n-1) return 1;
        x=(x*x)%n;
    }
    return 0;
}
__device__ int prime(long long x) {
    return is_SPRP((unsigned long long)primes[(((long long)0xAFF7B4*x)>>7)%1024],(unsigned long long)x);
}
// copy all unsigned COMPOSITE integers which are not congruent to zero modulo 2,3,5,7 and whose hashh value = 0;
// the count of those elements is stored in c
// 335545 is just a magic constant to distribute all integers equally over all 400*32 threads
__global__ void find(unsigned int *out,unsigned int *c) {
    unsigned int buff[4096];
    int local_c = 0;
    long long b = 121+(threadIdx.x+blockIdx.x*blockDim.x)*335545;
    long long e = b+335545;
    if(b%2==0) ++b;
    for(long long i=b;i<e && i<n;i+=2) {
        if(i%3==0 || i%5==0 || i%7==0 || prime(i)) continue;
        if(hashh(i)==0) {
            buff[local_c++]=(unsigned int)i;
            // flush the local buffer when it is full
            if(local_c==4096) {
                int start = atomicAdd(c,local_c);
                for(int i=0;i<local_c;++i) out[i+start]=buff[i];
                local_c=0;
            }
        }
    }
    int start = atomicAdd(c,local_c);
    for(int i=0;i<local_c;++i) out[i+start]=buff[i];
}
// find base for which all elements in input are NOT SPRP. base is from {2,..,34} stored in 32bit uint
__global__ void solve(unsigned int *input, unsigned int *count,unsigned int *backup, unsigned int *ans) {
    __shared__ unsigned int s[32];
    unsigned int dif = (*count)/(blockDim.x*gridDim.x) +1;
    unsigned int b = (threadIdx.x+blockIdx.x*blockDim.x)*dif;
    unsigned int e = b+dif>(*count)?(*count):b+dif;
    unsigned int mysol = 0;
    for(long long i = 2; i<33; ++i) {
        int sol = 1;
        // each thread doing its part
        for(unsigned int j = b; j<e ; ++j) {
            // if some element is SPRP base i, break
            if(is_SPRP((unsigned long long)i,(unsigned long long)input[j])!=0) {
                sol = 0;
                break;
            }
        }
        // if all elements passed store base to mysol
        if(sol==1) mysol|=1<<(i-2);
    }
    s[threadIdx.x] = mysol;
    // save thread_result
    backup[threadIdx.x+blockDim.x*blockIdx.x] = mysol;
    // compute global result and store it to ans
    if(threadIdx.x==0) {
        unsigned int global_sol = ~0;
        for(int i=0;i<blockDim.x;++i) global_sol&=s[i];
        atomicAnd(ans,global_sol); // assumed: the statement that stores to ans is not shown in the listing
    }
}
int main(void) {
// number of blocks & thread for solve
const int blocks = 400;
const int threads = 32;
unsigned int prms[] = { 17, 11, 6, 60, 7, 13, 11, 34, 13, 2, 3, 37, 13, 11, 38, 2, 7, 105, 2, 7, 42, 11, 7, 3, 6, 15, 53, 44, 6, 6, 5, 15, 54, 7, 35, 10, 10, 15, 10, 10, 17, 17, 11, 10, 15, 43, 7, 5, 5, 3, 7, 43, 34, 2, 34, 2, 68, 53, 39, 10, 7, 6, 11, 2, 5, 2, 7, 2, 6, 5, 15, 40, 3, 5, 5, 2, 2, 10, 47, 13, 7, 43, 6, 7, 5, 6, 6, 13, 6, 35, 6, 15, 6, 13, 40, 10, 11, 2, 7, 2, 2, 3, 13, 3, 11, 15, 10, 5, 11, 14, 7, 11, 47, 5, 2, 2, 6, 2, 5, 55, 6, 5, 7, 2, 6, 58, 35, 11, 5, 12, 17, 6, 10, 12, 6, 6, 2, 53, 2, 2, 13, 5, 14, 7, 15, 6, 13, 62, 10, 6, 3, 7, 7, 3, 14, 5, 14, 73, 15, 11, 11, 6, 5, 17, 10, 5, 3, 37, 51, 10, 7, 5, 38, 12, 5, 11, 5, 7, 6, 5, 6, 40, 43, 57, 10, 13, 7, 15, 2, 10, 34, 7, 39, 10, 5, 3, 6, 13, 11, 5, 10, 43, 10, 5, 3, 14, 5, 2, 5, 41, 5, 39, 46, 2, 10, 2, 5, 12, 3, 2, 2, 5, 15, 43, 17, 41, 2, 13, 15, 38, 11, 11, 3, 34, 5, 6, 3, 7, 2, 37, 5, 6, 10, 17, 35, 2, 15, 6, 7, 5, 3, 13, 13, 12, 34, 2, 12, 10, 15, 13, 2, 2, 34, 6, 6, 5, 2, 7, 13, 3, 6, 11, 39, 42, 7, 2, 6, 39, 47, 3, 17, 5, 13, 7, 2, 47, 3, 7, 6, 11, 17, 37, 48, 7, 37, 11, 7, 10, 3, 14, 39, 14, 15, 43, 17, 2, 12, 7, 13, 5, 3, 6, 34, 37, 3, 17, 13, 2, 5, 10, 10, 44, 37, 2, 2, 10, 10, 7, 3, 7, 2, 7, 5, 43, 43, 11, 15, 51, 13, 17, 10, 11, 2, 5, 34, 17, 2, 2, 42, 6, 6, 5, 47, 15, 2, 12, 7, 3, 10, 15, 3, 7, 12, 12, 15, 43, 14, 7, 58, 13, 10, 6, 6, 38, 34, 5, 5, 13, 38, 6, 11, 10, 6, 7, 2, 55, 2, 13, 5, 11, 44, 15, 17, 2, 40, 2, 15, 13, 6, 2, 3, 3, 3, 3, 6, 39, 5, 11, 17, 37, 5, 7, 6, 10, 6, 12, 7, 5, 14, 10, 12, 71, 10, 35, 6, 11, 3, 2, 38, 3, 2, 34, 10, 17, 42, 2, 12, 6, 6, 11, 40, 12, 10, 6, 10, 2, 3, 3, 56, 11, 7, 42, 2, 38, 12, 2, 2, 13, 40, 12, 6, 5, 5, 59, 15, 38, 5, 5, 5, 7, 2, 10, 7, 2, 17, 10, 11, 6, 6, 6, 2, 10, 6, 54, 2, 82, 3, 34, 14, 15, 44, 5, 46, 2, 13, 5, 12, 13, 11, 10, 39, 5, 40, 3, 60, 3, 42, 11, 3, 46, 17, 3, 2, 37, 6, 42, 12, 14, 3, 12, 66, 13, 34, 7, 3, 13, 3, 11, 2, 13, 12, 38, 34, 5, 40, 10, 14, 6, 14, 11, 38, 58, 2, 48, 5, 15, 5, 73, 3, 37, 5, 11, 10, 5, 5, 13, 2, 10, 13, 
34, 17, 3, 7, 47, 2, 2, 10, 15, 3, 3, 13, 6, 34, 13, 10, 13, 3, 6, 41, 10, 6, 2, 6, 2, 6, 2, 6, 6, 37, 10, 44, 35, 13, 51, 2, 7, 53, 5, 40, 5, 2, 37, 11, 15, 11, 13, 2, 5, 2, 6, 10, 17, 15, 43, 39, 17, 2, 12, 10, 15, 17, 7, 13, 3, 7, 15, 37, 5, 15, 7, 6, 10, 51, 2, 2, 40, 61, 2, 13, 13, 11, 2, 5, 34, 5, 5, 7, 2, 2, 2, 11, 3, 6, 13, 6, 17, 11, 10, 7, 46, 15, 7, 14, 35, 11, 7, 10, 6, 11, 40, 11, 2, 39, 7, 6, 66, 5, 3, 6, 5, 11, 10, 2, 10, 7, 13, 2, 45, 34, 6, 35, 2, 11, 5, 59, 75, 10, 17, 14, 17, 17, 17, 2, 11, 7, 10, 6, 11, 6, 56, 34, 35, 11, 14, 12, 41, 40, 17, 40, 3, 11, 7, 37, 14, 7, 13, 7, 5, 2, 10, 6, 39, 2, 7, 37, 35, 10, 5, 15, 2, 7, 38, 34, 11, 17, 5, 6, 10, 3, 6, 7, 7, 43, 14, 2, 43, 3, 2, 47, 7, 35, 7, 3, 53, 2, 10, 10, 10, 60, 10, 6, 2, 6, 10, 5, 7, 57, 53, 13, 3, 35, 38, 15, 42, 3, 3, 12, 2, 10, 3, 38, 54, 13, 10, 11, 7, 13, 7, 2, 12, 39, 10, 54, 2, 12, 38, 10, 12, 12, 5, 15, 6, 10, 13, 5, 15, 10, 13, 6, 41, 40, 14, 12, 10, 11, 40, 5, 11, 10, 2, 5, 2, 13, 6, 2, 13, 5, 2, 10, 15, 5, 5, 10, 34, 13, 2, 5, 14, 5, 6, 5, 13, 3, 43, 6, 13, 11, 50, 3, 6, 6, 12, 15, 11, 37, 7, 69, 11, 14, 14, 7, 43, 5, 35, 11, 35, 11, 11, 34, 34, 39, 14, 11, 2, 10, 53, 6, 11, 2, 11, 60, 39, 11, 6, 15, 40, 17, 47, 34, 50, 7, 59, 47, 5, 13, 39, 5, 6, 53, 10, 14, 5, 51, 5, 7, 5, 6, 77, 7, 12, 7, 42, 2, 5, 2, 6, 60, 10, 13, 10, 6, 47, 6, 15, 17, 10, 11, 10, 12, 7, 7, 10, 17, 34, 5, 10, 7, 7, 2, 6, 10, 38, 2, 15, 6, 13, 7, 13, 2, 3, 13, 5, 3, 17, 2, 5, 15, 11, 39, 7, 39, 10, 10, 2, 6, 13, 3, 5, 17, 6, 14, 10, 37, 44, 3, 34, 5, 11, 7, 12, 2, 5, 3, 12, 3, 2, 3, 133, 12, 2, 2, 2, 3, 34, 14, 41, 2, 37, 11, 2, 6, 11, 6, 7, 15, 11, 35, 13, 6, 5, 2, 14, 7, 2 };
printf("primes_copy: %s\n",cudaGetErrorString(cudaMemcpyToSymbol(primes,prms,1024*4)));
// allocate buffers
unsigned int *dev_input,*dev_count;
printf("alloc_input: %s\n",cudaGetErrorString(cudaMalloc((void**)&dev_input,sizeof(int)*(1<<23))));
printf("alloc_count: %s\n",cudaGetErrorString(cudaMalloc((void**)&dev_count,4)));
printf("memset_count: %s\n",cudaGetErrorString(cudaMemset(dev_count,0,4)));
find<<<400,32>>>(dev_input,dev_count);
unsigned int count;
printf("copy_count: %s\n",cudaGetErrorString(cudaMemcpy(&count,dev_count,4,cudaMemcpyDeviceToHost)));
// sort found elements just to make debugging easier, it is not necessary
unsigned int *backup_numbers = new unsigned int[1000000];
printf("copy_backup: %s\n",cudaGetErrorString(cudaMemcpy(backup_numbers,dev_input,4*count,cudaMemcpyDeviceToHost)));
std::sort(backup_numbers,backup_numbers+count);
printf("copy_S_backup: %s\n",cudaGetErrorString(cudaMemcpy(dev_input,backup_numbers,4*count,cudaMemcpyHostToDevice)));
delete[] backup_numbers;
printf("\nsize: %u\n",count);
// allocate buffers
unsigned int *dev_backup, *dev_ans;
printf("malloc_backup: %s\n",cudaGetErrorString(cudaMalloc((void**)&dev_backup,sizeof(int)*blocks*threads)));
printf("malloc_ans: %s\n",cudaGetErrorString(cudaMalloc((void**)&dev_ans,4)));
printf("memset_ans: %s\n",cudaGetErrorString(cudaMemset(dev_ans,0xFF,4)));
solve<<<blocks,threads>>>(dev_input,dev_count,dev_backup,dev_ans);
unsigned int ans,*backup;
printf("memcpy_ans: %s\n",cudaGetErrorString(cudaMemcpy(&ans,dev_ans,4,cudaMemcpyDeviceToHost)));
backup = new unsigned int[400*32];
printf("memcpy_backup: %s\n",cudaGetErrorString(cudaMemcpy(backup,dev_backup,4*blocks*threads,cudaMemcpyDeviceToHost)));
unsigned int backup_ans = ~0;
// compute global result using the backed-up thread results
// notice backup_ans and ans MUST be the same, but they are NOT (WHY!)
for(int i=0;i<threads*blocks;++i) backup_ans&=backup[i];
printf("ans: %u\nbackup_ans %u\n",ans,backup_ans);
delete[] backup;
return 0;
}
All code except the solve() method works as intended. The solve() method computes garbage (because backup_ans and ans differ), and it also gives me the "unspecified launch failure" error on the last two cudaMemcpy calls.
When I run solve<<<1,1>>>(...) I get
ans: 134816642 backup_ans 432501552
but when I run solve<<<400,32>>>(...) it gives me
ans: 134816642 backup_ans 0
(correct answer should be 0)
In all situations it should compute backup_ans=ans=0
Any advice on what I am doing wrong would be helpful.
Code for generating primes.bin
#include <cstdlib>
#include <stdio.h>
using namespace std;
const unsigned long long n = 1ll<<32;
const int buffer_size = 2000000;
typedef unsigned char uch;
typedef unsigned int uint;
typedef unsigned long long ull;
uch *primes;
// one bit per odd number; a set bit marks a composite
int prime(long long x) {
    if(x==2) return 1;
    if(x%2==0) return 0;
    long long pos = x/16;
    long long index = (x&15)>>1;
    return (1<<index)&(~(primes[pos]));
}
void eratosten_sieve(void) {
    long long pos;
    long long index;
    for(long long i=3;i*i<n;++i) {
        if(!prime(i)) continue;
        for(long long j=i*i;j<n;j+=(i<<1)) {
            pos = j/16;
            index = ((j&15)>>1);
            primes[pos]|=(1<<index);
        }
    }
}
int main(void) {
    primes = new uch[(n/16)+1];
    for(long long i=0;i<(n/16)+1;++i) primes[i]=0;
    eratosten_sieve();
    int l = n/16 +1;
    FILE *f = fopen("primes.bin","wb");
    fwrite(primes,1,l,f); // assumed: the write to the file is not shown in the listing
    fclose(f);
    delete[] primes;
    return 0;
}
PS: I am compiling it with nvcc -arch compute_11
CUDA Driver Version / Runtime Version 5.5 / 5.5
CUDA Capability Major/Minor version number: 1.1
Total amount of global memory: 1023 MBytes (1073020928 bytes)
(14) Multiprocessors, ( 8) CUDA Cores/MP: 112 CUDA Cores
GPU Clock rate: 1500 MHz (1.50 GHz)
Memory Clock rate: 900 Mhz
Memory Bus Width: 256-bit
Maximum Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536, 32768), 3D=(2048, 2048, 2048)
Maximum Layered 1D Texture Size, (num) layers 1D=(8192), 512 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(8192, 8192), 512 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per multiprocessor: 768
Maximum number of threads per block: 512
Max dimension size of a thread block (x,y,z): (512, 512, 64)
Max dimension size of a grid size (x,y,z): (65535, 65535, 1)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.5, CUDA Runtime Version = 5.5, NumDevs = 1, Device0 = GeForce 9800 GT
Result = PASS
OK, you are out of memory. It took me a while to figure out because I was not thinking about the large static allocation:
__device__ unsigned char primes[(1<<28)+1];
Normally when folks are out of memory, they discover it on a cudaMalloc operation. In your case, your GPU has 1GB of memory, and I am guessing you are also hosting a display on it (you didn't answer that question). Take a look at how much free memory there is in the nvidia-smi -a output, it will look something like this:
FB Memory Usage
Total : 1535 MiB
Used : 3 MiB
Free : 1532 MiB
Your numbers will be smaller - the Free line is what we care about.
Your dynamic allocations (i.e. from cudaMalloc) are allocating about 350MB. But the kernel launch brings the static allocation into play, and then your total footprint rises to over 700MB (2^28 bytes is over 250MB). If you have a display running on that GPU, it will consume some of the 1GB of memory, leaving you without enough to run a kernel that requires 700MB.
If you want to run on that GPU, see if you can pare your problem size down somehow.
And it's always good to do proper cuda error checking, but apart from this issue, your code seems to run with no errors for me on devices with more memory.

Weka data load error

I want to load the data in breast-cancer-wisconsin through Weka Explorer as a C4.5 data file and I'm getting the following errors when choosing both to load C4.5 .data and C4.5 .names:
Any ideas?
It does not look like the C4.5 names file is correct. Try replacing breast-cancer-wisconsin.names with this one:
2, 4.
clump: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.
size: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.
shape: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.
adhesion: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.
epithelial: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.
nuclei: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.
chromatin: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.
nucleoli: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.
mitoses: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.
Note that class comes first (only labels).
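Here I have removed the first column (the subjects' IDs) of the original dataset, writing to a temporary file first because redirecting straight back onto the input file would truncate it before cut reads it:
$ cut -d, -f2-11 breast-cancer-wisconsin.data > tmp.data && mv tmp.data breast-cancer-wisconsin.data
but it is not difficult to adapt the above names file if you want to keep that column.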
Alternative solutions:
Generate a csv file: you just need to add a header to the *.data file and rename it as *.csv. E.g., replace breast-cancer-wisconsin.data with breast-cancer-wisconsin.csv which should look like
Construct directly an *.arff file by hand; that's not really complicated as there are few variables. An example file can be found here.
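The CSV alternative can also be scripted. The following is a rough sketch under stated assumptions, not something from the answer: dropFirstColumn and dataToCsv are hypothetical names, the header reuses the attribute names from the names file above, and the class label is assumed to be the last field of each record once the id column is removed.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Drop the leading id field of one record: "id,rest" -> "rest".
inline std::string dropFirstColumn(const std::string& line) {
    const std::string::size_type comma = line.find(',');
    assert(comma != std::string::npos);  // every record is expected to be comma-separated
    return line.substr(comma + 1);
}

// Build CSV text: a header line followed by every record without its id.
inline std::string dataToCsv(std::istream& in) {
    std::ostringstream out;
    // assumed column order: nine attributes, then the class label
    out << "clump,size,shape,adhesion,epithelial,nuclei,chromatin,nucleoli,mitoses,class\n";
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty()) continue;  // skip blank lines
        out << dropFirstColumn(line) << '\n';
    }
    return out.str();
}
```

Feeding it a std::ifstream opened on the original .data file and writing the returned string to breast-cancer-wisconsin.csv would give a file with a header row.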