Empty Destructor Crashing Program: C++

The following program calculates all primes up to a really large number (e.g. 600,851,475,143). Everything works so far, except that when I put in large numbers the destructor crashes the application. Can anyone see something wrong with my application?
After rechecking my solution, the answer it produces is wrong, but the question is still valid.
#include <iostream>
#include <iterator>
#include <algorithm>
#include <vector>
#include <cmath>
#include <stdexcept>
#include <climits>
typedef std::vector<unsigned long long>::const_iterator prime_it;
#define MAX_COL 900000
struct large_vector
{
public:
large_vector(unsigned long long size, unsigned int row) :
m_Row(row)
{
m_RowVector.reserve(size);
}
std::vector<bool> m_RowVector;
unsigned int m_Row;
};
struct prime_factor
{
public:
prime_factor(unsigned long long N);
~prime_factor() {}
void print_primes();
private:
std::vector<bool> m_Primes;
std::vector<large_vector>m_Vect_Primes;
unsigned long long m_N;
};
prime_factor::prime_factor(unsigned long long N) :
m_N(N)
{
// If number is odd then we need the ceiling of N/2 / MAX_COL
int number_of_vectors = (m_N % MAX_COL == 0) ? (m_N / MAX_COL) : ((m_N / MAX_COL) + 1);
std::cout << "There will be " << number_of_vectors << " rows";
if (number_of_vectors != 0) {
for (int x = 0; x < number_of_vectors; ++x) {
m_Vect_Primes.push_back(large_vector(MAX_COL, x));
}
m_Vect_Primes[0].m_RowVector[0] = false;
m_Vect_Primes[0].m_RowVector[1] = false;
unsigned long long increment = 2;
unsigned long long index = 0;
while (index < m_N) {
for (index = 2*increment; index < m_N; index += increment) {
unsigned long long row = index/MAX_COL;
unsigned long long col = index%MAX_COL;
m_Vect_Primes[row].m_RowVector[col] = true;
}
while (m_Vect_Primes[increment/MAX_COL].m_RowVector[increment%MAX_COL]) {
increment++;
}
}
}
}
void prime_factor::print_primes()
{
for (int index = 0; index < m_N; ++index) {
if (m_Vect_Primes[index/MAX_COL].m_RowVector[index%MAX_COL] == false) {
std::cout << index << " ";
}
}
}
/*!
* Driver
*/
int main(int argc, char *argv[])
{
static const unsigned long long N = 600851475143;
prime_factor pf(N);
pf.print_primes();
}
Update
I am pretty sure this is a working version:
#include <iostream>
#include <iterator>
#include <algorithm>
#include <vector>
#include <cmath>
#include <stdexcept>
#include <climits>
typedef std::vector<unsigned long long>::const_iterator prime_it;
#define MAX_COL 900000
struct large_vector
{
public:
large_vector(unsigned long long size, unsigned int row) :
m_Row(row)
{
m_RowVector.resize(size);
}
std::vector<bool> m_RowVector;
unsigned int m_Row;
};
struct prime_factor
{
public:
prime_factor(unsigned long long N);
~prime_factor() {}
void print_primes();
private:
std::vector<bool> m_Primes;
std::vector<large_vector>m_Vect_Primes;
unsigned long long m_N;
};
prime_factor::prime_factor(unsigned long long N) :
m_N(N)
{
// If number is odd then we need the ceiling of N/2 / MAX_COL
int number_of_vectors = (m_N % MAX_COL == 0) ? ((m_N/2) / MAX_COL) : (((m_N/2) / MAX_COL) + 1);
std::cout << "There will be " << number_of_vectors << " rows";
if (number_of_vectors != 0) {
for (int x = 0; x < number_of_vectors; ++x) {
m_Vect_Primes.push_back(large_vector(MAX_COL, x));
}
m_Vect_Primes[0].m_RowVector[0] = false;
m_Vect_Primes[0].m_RowVector[1] = false;
unsigned long long increment = 2;
unsigned long long index = 0;
while (index < m_N) {
for (index = 2*increment; index < m_N/2; index += increment) {
unsigned long long row = index/MAX_COL;
unsigned long long col = index%MAX_COL;
m_Vect_Primes[row].m_RowVector[col] = true;
}
increment += 1;
while (m_Vect_Primes[increment/MAX_COL].m_RowVector[increment%MAX_COL]) {
increment++;
}
}
}
}
void prime_factor::print_primes()
{
for (unsigned long long index = 0; index < m_N/2; ++index) {
if (m_Vect_Primes[index/MAX_COL].m_RowVector[index%MAX_COL] == false) {
std::cout << index << " ";
}
}
}
/*!
* Driver
*/
int main(int argc, char *argv[])
{
static const unsigned long long N = 400;
prime_factor pf(N);
pf.print_primes();
}

Your usage of reserve is incorrect.
m_RowVector.reserve(size);
Here m_RowVector has space reserved so that the vector can grow without being reallocated, but the size of m_RowVector is still 0, so accessing any element is undefined behavior. You must change the size of the vector with either resize() or push_back() to put elements into it.
I can't see anything else wrong, but I suspect you have other accesses beyond the end of a vector. I would replace operator[] with the at() method: it throws an exception when you access an element off the end of the vector, which gives you a clue to the actual location of the error.
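To make the difference between reserve() and resize() concrete, here is a minimal sketch. The helper names make_reserved, make_resized and at_throws are made up for this illustration and are not part of the question's code.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: capacity is requested, but size() stays 0.
std::vector<bool> make_reserved(std::size_t n) {
    std::vector<bool> v;
    v.reserve(n);
    return v;
}

// Hypothetical helper: the vector really holds n value-initialized elements.
std::vector<bool> make_resized(std::size_t n) {
    std::vector<bool> v;
    v.resize(n);
    return v;
}

// Hypothetical helper: true if v.at(index) throws std::out_of_range.
bool at_throws(const std::vector<bool>& v, std::size_t index) {
    try {
        (void)v.at(index);  // at() checks the index against size()
    } catch (const std::out_of_range&) {
        return true;
    }
    return false;
}
```

A reserved-only vector reports size() == 0 and at(0) throws, while the resized vector accepts every index below its size. Swapping operator[] for at() in the sieve loops should turn a silent out-of-range write into a std::out_of_range exception at the offending access.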

Related

Undefined behavior with determinist procedure

I am currently trying to implement "cave generation" as a 2D array, following the ideas of the "Game of Life". The idea is as follows:
I have a 2D vector of 0s and 1s (representing air and block respectively), randomly generated with a uniform_real_distribution and a density (here 0.45, so about 45% of the array will be 1).
After this we iterate x times on the array. An iteration looks as follows:
First, we copy the array into a new one.
Second, we iterate over the old array as follows: we count the blocks in the neighbourhood of the current tile and apply two rules:
IF the current tile is air and has more than 4 blocks in its neighbourhood ((-1,-1) to (1,1), excluding itself), change it to a block in the NEW ARRAY
IF the current tile is a block and has fewer than 3 blocks in its neighbourhood, change it to air in the NEW ARRAY
Third, copy the new array into the old array
The problem is that EVEN when I seed my generator with a fixed seed, sometimes (about one run in three) the map is completely filled with blocks after two or three iterations. I have no idea why, even after looking at my code for many hours, which is why I am here. Here is the code:
cavefactory.h
#ifndef CAVEFACTORY_H_
#define CAVEFACTORY_H_
#include <vector>
namespace cavegenerator {
// define cave_t as a 2d vector of integers
using cave_t = std::vector<std::vector<int>>;
// constants
namespace DEFAULT {
constexpr unsigned short int WIDTH = 64;
constexpr unsigned short int HEIGHT = 64;
constexpr float DENSITY = 0.45;
constexpr unsigned short int BIRTH_LIMIT = 4;
constexpr unsigned short int DEATH_LIMIT = 3;
} // namespace DEFAULT
class CaveFactory {
public:
CaveFactory(unsigned short int width = DEFAULT::WIDTH,
unsigned short int height = DEFAULT::HEIGHT,
float density = DEFAULT::DENSITY);
// makes a cave with the desired number of iterations and parameters
static cave_t MakeCave(unsigned short int width = DEFAULT::WIDTH,
unsigned short int height = DEFAULT::HEIGHT,
float density = DEFAULT::DENSITY,
int iterations = 3,
unsigned short int bl = DEFAULT::BIRTH_LIMIT,
unsigned short int dl = DEFAULT::DEATH_LIMIT);
// implemented in case of generalization of cave(more than two blocks)
bool isSolid(int i, int j);
cave_t getCave();
void Print();
void Iterate( unsigned short int bl = DEFAULT::BIRTH_LIMIT,
unsigned short int dl = DEFAULT::DEATH_LIMIT );
private:
cave_t cave_;
int NumberOfNeighbours(int i, int j);
void Initialize(float density = DEFAULT::DENSITY);
};
} // namespace cavegenerator
#endif // CAVEFACTORY_H_
cavefactory.cc
#include "cavefactory.h"
#include <random>
#include <iostream>
#include <ctime>
#include <algorithm>
namespace cavegenerator {
CaveFactory::CaveFactory(unsigned short int width, unsigned short int height, float density) {
cave_.resize(width);
for (auto &i : cave_) {
i.resize(height);
}
Initialize(density);
}
bool CaveFactory::isSolid(int i, int j) {
return (cave_[i][j] == 1);
}
int CaveFactory::NumberOfNeighbours(int x, int y) {
int num = 0;
for (int i = -1; i < 2; i++) {
for (int j = -1; j < 2; j++) {
if ( i == 0 && j == 0 ) continue; // we don't want to count ourselves
// if out of bounds, add a solid neighbour
if ( x + i >= (int)cave_.size() || x + i < 0 || y + j >= (int)cave_[i].size() || y + j < 0) {
++num;
} else if (isSolid(x+i, y+j)) {
++num;
}
}
}
return num;
}
cave_t CaveFactory::getCave() {
return cave_;
}
void CaveFactory::Print() {
for (auto &i : cave_) {
for (auto &j : i) {
std::cout << ((j==1) ? "x" : " ");
}
std::cout << "\n";
}
return;
}
cave_t CaveFactory::MakeCave(unsigned short int width,
unsigned short int height,
float density,
int iterations,
unsigned short int bl,
unsigned short int dl)
{
CaveFactory cave(width, height, density);
for (int i = 0; i < iterations; i++) {
cave.Iterate(bl, dl);
}
return cave.getCave();
}
// Initialize the cave with the specified density
void CaveFactory::Initialize(float density) {
std::mt19937 rd(4);
std::uniform_real_distribution<float> roll(0, 1);
for (auto &i : cave_) {
for (auto &j : i) {
if (roll(rd) < density) {
j = 1;
} else {
j = 0;
}
}
}
}
// for each cell in the original cave, if the cell is solid:
// if the number of solid neighbours is under the death limit, we kill the block
// if the cell is air, if the number of solid blocks is above the birth limit we place a block
void CaveFactory::Iterate(unsigned short int bl, unsigned short int dl) {
cave_t new_cave = cave_;
for (int i = 0; i < (int)cave_.size(); i++) {
for (int j = 0; j < (int)cave_[0].size(); j++) {
int number_of_neighbours = NumberOfNeighbours(i, j);
if (isSolid(i, j) && number_of_neighbours < dl) {
new_cave[i][j] = 0;
} else if (!isSolid(i,j) && number_of_neighbours > bl) {
new_cave[i][j] = 1;
}
}
}
std::copy(new_cave.begin(), new_cave.end(), cave_.begin());
}
} // namespace cavegenerator
main.cc
#include <iostream>
#include <vector>
#include <random>
#include <ctime>
#include <windows.h>
#include "cavefactory.h"
int main() {
cavegenerator::CaveFactory caveEE;
caveEE.Print();
for(int i = 0; i < 15; i++) {
caveEE.Iterate();
Sleep(600);
system("cls");
caveEE.Print();
}
return 0;
}
I know windows.h is a bad habit, I just used it for debugging.
I hope someone can make me understand, maybe it's just a normal behavior I'm not aware of?
Thank you very much.
(int)cave_[i].size() in NumberOfNeighbours is incorrect; it should be (int)cave_[x+i].size() (or (int)cave_[0].size(), since every inner vector has the same size). When i equals -1, cave_[i] is an out-of-bounds vector access, which is undefined behaviour.
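A sketch of the corrected bounds check, written as a free function so it can be tried in isolation. The name number_of_neighbours and the hoisted width/height variables are my own choices for this illustration; the counting rule (cells outside the grid count as solid) is the one from the question.

```cpp
#include <cassert>
#include <vector>

using cave_t = std::vector<std::vector<int>>;

// Counts solid (== 1) cells among the 8 neighbours of (x, y).
// Cells outside the grid count as solid, as in the question.
int number_of_neighbours(const cave_t& cave, int x, int y) {
    // Sizes are read once, never through the loop offsets i and j.
    const int width = static_cast<int>(cave.size());
    const int height = cave.empty() ? 0 : static_cast<int>(cave[0].size());
    int num = 0;
    for (int i = -1; i <= 1; ++i) {
        for (int j = -1; j <= 1; ++j) {
            if (i == 0 && j == 0) continue;  // skip the cell itself
            const int nx = x + i;
            const int ny = y + j;
            if (nx < 0 || nx >= width || ny < 0 || ny >= height) {
                ++num;                       // out of bounds: treated as solid
            } else if (cave[nx][ny] == 1) {
                ++num;
            }
        }
    }
    return num;
}
```

On an all-air 3x3 grid the centre cell has 0 solid neighbours, while the corner (0, 0) has 5, all of them coming from the out-of-bounds rule.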

Sieve of Eratosthenes not working beyond 200,000

My C++ program to calculate all the prime numbers using sieve of Eratosthenes method stops after 200,000. But I need to calculate the primes up to 2 million. Help would be appreciated if someone could tell me where I went wrong with my code.
#include <iostream>
#include<math.h>
using namespace std;
void isprime(long long int prime[],long int n)
{
for(long long int i=0;i<=n;i++)
{
prime[i]=1;
}
prime[0]=prime[1]=0;
for(long long int i=2;i<=sqrt(n);i++)
{
if(prime[i]==1)
{
for(long long int j=2;i*j<=n;j++)
prime[i*j]=0;
}
}
for(long long int i=0;i<=n;i++)
{
if(prime[i]==1)
cout<<i<<endl;
}
}
int main()
{
long long int n;
cout<<"enter number";
cin>>n;
long long int prime[n+1];
isprime(prime,n);
return 0;
}
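Since each sieve element contains only a 0 or 1, there is no need to use a long long int to store each one. std::vector<bool> potentially uses 1 bit per element and thus is optimal for memory efficiency. It also keeps its storage on the heap, whereas the variable-length array long long int prime[n+1] in your code (a compiler extension, not standard C++) lives on the stack; at 2 million 8-byte elements that is about 16 MB, which is more than typical default stack sizes and is the most likely reason the program stops.
Here is your code with very few modifications to use a std::vector<bool>. Since some bit manipulation is required to get and set individual elements, this version may be slower than code which uses one byte or int per sieve element. You can benchmark various versions and decide the right trade-off for your needs.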
#include <cmath>
#include <cstddef>
#include <exception>
#include <iostream>
#include <string>
#include <vector>
// returns the number of primes <= n
long isprime(long n) {
std::vector<bool> prime(n + 1);
for (long i = 0; i <= n; i++) {
prime[i] = 1;
}
prime[0] = prime[1] = 0;
long upper_bound = std::sqrt(n);
for (long i = 2; i <= upper_bound; i++) {
if (prime[i] == 1) {
for (long j = 2; i * j <= n; j++)
prime[i * j] = 0;
}
}
long num_primes = 0;
for (long i = 0; i <= n; i++) {
if (prime[i] == 1) {
++num_primes;
// std::cout << i << std::endl;
}
}
return num_primes;
}
int main() {
std::cout << "Enter the sieve size: ";
std::string line;
std::getline(std::cin, line);
std::cout << std::endl;
long len = std::stol(line);
long num_primes = isprime(len);
std::cout << "There are " << num_primes << " primes <= " << len << std::endl;
return 0;
}
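If you do want to benchmark the storage trade-off mentioned above, one way is to template the sieve on its flag container. This is only a sketch: count_primes and the Flags parameter are names invented here, and unlike the code above it starts crossing off at i*i rather than 2*i.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts primes <= n. Flags is the per-element storage being compared,
// e.g. std::vector<bool> (bit-packed) or std::vector<char> (one byte each).
template <typename Flags>
long count_primes(long n) {
    if (n < 2) return 0;
    Flags composite(static_cast<std::size_t>(n) + 1);  // all zero initially
    long count = 0;
    for (long i = 2; i <= n; ++i) {
        if (composite[i]) continue;
        ++count;
        // Start at i*i: smaller multiples were crossed off by smaller primes.
        for (long long j = static_cast<long long>(i) * i; j <= n; j += i) {
            composite[static_cast<std::size_t>(j)] = 1;
        }
    }
    return count;
}
```

Calling count_primes<std::vector<bool>>(2000000) and count_primes<std::vector<char>>(2000000) under a timer gives a direct comparison of the two layouts on your machine.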

Dynamic Programming- Primitive Calculator

The full explanation of the problem is here: http://imgur.com/a/UiE7L
I've written the code, but it fails with a segmentation fault that I'm not able to fix. The logic of the program is to save the minimum number of operations needed to reach the number n at the nth position of the array, and I'd like to keep this logic.
#include <iostream>
#include <vector>
#include <algorithm>
#include <stdio.h>
#include <stdlib.h>
long long f(long long n, vector <long long> arr)
{
arr[1]=0;
arr.push_back(n);
long long ans=0, ret=0;
if (n==1)
{
return (0);
}
ans= f(n-1, arr) + 1;
if (n%2==0)
{
ret= f(n/2, arr) + 1;
if (ret<ans)
{
ans=ret;
std::cout<<ans<<'\n';
}
}
if (n%3==0)
{
ret= f(n/3, arr) + 1;
if (ret<ans)
{
ans=ret;
std::cout<<ans<<'\n';
}
}
arr[n]=ans;
return arr[n];
}
int main() {
long long n;
std::cin >> n;
std::vector<long long> arr;
std::cout<<f(n, arr);
return 0;
}
#include <bits/stdc++.h>
using namespace std;
long long f(long long n, vector <long long> arr)
{
arr[1]=0;
//arr.push_back(n); // not required
long long ans=0, ret=0;
if (n==1)
{
return (0);
}
ans= f(n-1, arr) + 1;
if (n%2==0)
{
ret= f(n/2, arr) + 1;
if (ret<ans)
{
ans=ret;
//std::cout<<ans<<'\n';
}
}
if (n%3==0)
{
ret= f(n/3, arr) + 1;
if (ret<ans)
{
ans=ret;
//std::cout<<ans<<'\n';
}
}
arr[n]=ans;
return arr[n];
}
int main() {
long long n = 120;
std::vector<long long> arr(n+1); // declare arr with size n+1
std::cout<<f(n, arr);
return 0;
}
You accessed arr[1] without giving the vector a size of at least 2; vectors in C++ are 0-indexed, so arr[1] requires at least two elements. In addition, if you give the vector an initial size when declaring it, as in arr(n+1), there is no need to push the value of n into arr again.
With these changes your solution works correctly.
For an iterative approach:
#include <bits/stdc++.h>
using namespace std;
int main() {
long long n;
cin >> n;
vector<long long> arr(n+1);
for (int i = 1; i <= n; i++) {
arr[i] = arr[i - 1] + 1;
if (i % 2 == 0) arr[i] = min(1 + arr[i / 2], arr[i]);
if (i % 3 == 0) arr[i] = min(1 + arr[i / 3], arr[i]);
}
cout << arr[n]-1 << endl;
return 0;
}
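Note that f above takes the vector by value, so every call works on its own copy and nothing stored in arr[n] is ever reused. If you want to keep the recursive shape with real memoization, a sketch could look like this. The names min_ops and memo are hypothetical, and I am assuming the allowed operations are +1, *2 and *3, which is what the recursion in the question implies.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// memo[i] == -1 means "not computed yet". The vector is passed by
// reference so results stored by one call are visible to all others.
long long min_ops(long long n, std::vector<long long>& memo) {
    if (n == 1) return 0;
    if (memo[n] != -1) return memo[n];
    long long best = min_ops(n - 1, memo) + 1;
    if (n % 2 == 0) best = std::min(best, min_ops(n / 2, memo) + 1);
    if (n % 3 == 0) best = std::min(best, min_ops(n / 3, memo) + 1);
    memo[n] = best;
    return best;
}

// Convenience wrapper (assumes n >= 1): sizes the table to n + 1 entries.
long long min_ops(long long n) {
    std::vector<long long> memo(static_cast<std::size_t>(n) + 1, -1);
    return min_ops(n, memo);
}
```

The recursion depth still grows linearly with n because of the n - 1 branch, so for large inputs the iterative version is the safer choice.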

Can't understand why my program throws error

My code is in
#include <iostream>
#include <string>
#include <algorithm>
#include <climits>
#include <vector>
#include <cmath>
using namespace std;
struct State {
int v;
const State *rest;
void dump() const {
if(rest) {
cout << ' ' << v;
rest->dump();
} else {
cout << endl;
}
}
State() : v(0), rest(0) {}
State(int _v, const State &_rest) : v(_v), rest(&_rest) {}
};
void ss(int *ip, int *end, int target, const State &state) {
if(target < 0) return; // assuming we don't allow any negatives
if(ip==end && target==0) {
state.dump();
return;
}
if(ip==end)
return;
{ // without the first one
ss(ip+1, end, target, state);
}
{ // with the first one
int first = *ip;
ss(ip+1, end, target-first, State(first, state));
}
}
vector<int> get_primes(int N) {
int size = floor(0.5 * (N - 3)) + 1;
vector<int> primes;
primes.push_back(2);
vector<bool> is_prime(size, true);
for(long i = 0; i < size; ++i) {
if(is_prime[i]) {
int p = (i << 1) + 3;
primes.push_back(p);
// sieving from p^2, whose index is 2i^2 + 6i + 3
for (long j = ((i * i) << 1) + 6 * i + 3; j < size; j += p) {
is_prime[j] = false;
}
}
}
}
int main() {
int N;
cin >> N;
vector<int> primes = get_primes(N);
int a[primes.size()];
for (int i = 0; i < primes.size(); ++i) {
a[i] = primes[i];
}
int * start = &a[0];
int * end = start + sizeof(a) / sizeof(a[0]);
ss(start, end, N, State());
}
It takes one input N (int) and builds the vector of all prime numbers smaller than N.
Then it finds the number of unique sets from the vector that add up to N.
The get_primes(N) part works, but the other part doesn't.
I borrowed the other code from
How to find all matching numbers, that sums to 'N' in a given array
Please help me. I just want the number of unique sets.
You've forgotten to return primes; at the end of your get_primes() function.
I'm guessing the problem is:
vector<int> get_primes(int N) {
// ...
return primes; // missing this line
}
As it stands, control flows off the end of a non-void function without a return, so this initialization just reads junk:
vector<int> primes = get_primes(N);
This is undefined behavior, which in this case manifests itself as a crash.
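For completeness, a sketch of get_primes with the return in place. Apart from the return, I have added guards for N < 3 and widened the index type to long long; those changes are mine, not something the question asked for. Like the original, it includes N itself when N is an odd prime.

```cpp
#include <cassert>
#include <vector>

// Sieve over odd numbers only: index i stands for the value 2*i + 3.
std::vector<int> get_primes(int N) {
    std::vector<int> primes;
    if (N < 2) return primes;  // guard added for this sketch
    primes.push_back(2);
    if (N < 3) return primes;  // guard added for this sketch
    const long long size = (N - 3) / 2 + 1;  // odd candidates in [3, N]
    std::vector<bool> is_prime(size, true);
    for (long long i = 0; i < size; ++i) {
        if (!is_prime[i]) continue;
        const long long p = 2 * i + 3;
        primes.push_back(static_cast<int>(p));
        // p*p has index 2*i*i + 6*i + 3; stepping p indices skips even multiples.
        for (long long j = 2 * i * i + 6 * i + 3; j < size; j += p) {
            is_prime[j] = false;
        }
    }
    return primes;  // the statement missing from the question's version
}
```

Compiling with warnings enabled (for example -Wall on GCC or Clang) normally flags a non-void function that can reach its end without returning.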

radix select using cuda

I have been working on a radix select using CUDA, which uses the k smallest elements to sort a given number of elements. The main idea behind this radix select is that it scans through the 32-bit integers from MSB to LSB. It partitions all elements with a 0 bit to the left side and all elements with a 1 bit to the right side. The side which contains the k smallest elements is then solved recursively. My partition process works just fine, but I am having problems with the recursive function calls: I am unable to stop the recursion. Please help me with that!
My kernel function looks like this (this is kernel.h):
#include "header.h"
#define WARP_SIZE 32
#define BLOCK_SIZE 32
__device__ int Partition(int *d_DataIn, int firstidx, int lastidx, int k, int N, int bit)
{
int threadID = threadIdx.x + BLOCK_SIZE * blockIdx.x;
int WarpID = threadID >> 5;
int LocWarpID = threadID - 32 * WarpID;
int NumWarps = N / WARP_SIZE;
int pivot;
__shared__ int DataPartition[BLOCK_SIZE];
__shared__ int DataBinary[WARP_SIZE];
for(int i = 0; i < NumWarps; i++)
{
if(LocWarpID >= firstidx && LocWarpID <=lastidx)
{
int r = d_DataIn[i * WARP_SIZE + LocWarpID];
int p = (r>>(31-bit))&1;
unsigned int B = __ballot(p);
unsigned int B_flip = ~B;
if(p==1)
{
int b = B << (32-LocWarpID);
int RightLoc = __popc(b);
DataPartition[lastidx - RightLoc] = r;
}
else
{
int b_flip = B_flip << (32 - LocWarpID);
int LeftLoc = __popc(b_flip);
DataPartition[LeftLoc] = r;
}
if(LocWarpID <= lastidx - __popc(B))
{
d_DataIn[LocWarpID] = DataPartition[LocWarpID];
}
else
{
d_DataIn[LocWarpID] = DataPartition[LocWarpID];
}
pivot = lastidx - __popc(B);
return pivot+1;
}
}
}
__device__ int RadixSelect(int *d_DataIn, int firstidx, int lastidx, int k, int N, int bit)
{
if(firstidx == lastidx)
return *d_DataIn;
int q = Partition(d_DataIn, firstidx, lastidx, k, N, bit);
int length = q - firstidx;
if(k == length)
return *d_DataIn;
else if(k < length)
return RadixSelect(d_DataIn, firstidx, q-1, k, N, bit+1);
else
return RadixSelect(d_DataIn, q, lastidx, k-length, N, bit+1);
}
__global__ void radix(int *d_DataIn, int firstidx, int lastidx, int k, int N, int bit)
{
RadixSelect(d_DataIn, firstidx, lastidx, k, N, bit);
}
Host code is main.cu and it looks like:
#include "header.h"
#include <iostream>
#include <fstream>
#include "kernel.h"
#define BLOCK_SIZE 32
using namespace std;
int main()
{
int N = 32;
thrust::host_vector<float>h_HostFloat(N);
thrust::counting_iterator <unsigned int> Numbers(0);
thrust::transform(Numbers, Numbers + N, h_HostFloat.begin(), RandomFloatNumbers(1.f, 100.f));
thrust::host_vector<int>h_HostInt(N);
thrust::transform(h_HostFloat.begin(), h_HostFloat.end(), h_HostInt.begin(), FloatToInt());
thrust::device_vector<float>d_DeviceFloat = h_HostFloat;
thrust::device_vector<int>d_DeviceInt(N);
thrust::transform(d_DeviceFloat.begin(), d_DeviceFloat.end(), d_DeviceInt.begin(), FloatToInt());
int *d_DataIn = thrust::raw_pointer_cast(d_DeviceInt.data());
int *h_DataOut;
float *h_DataOut1;
int fsize = N * sizeof(float);
int size = N * sizeof(int);
h_DataOut = new int[size];
h_DataOut1 = new float[fsize];
int firstidx = 0;
int lastidx = BLOCK_SIZE-1;
int k = 20;
int bit = 1;
int NUM_BLOCKS = N / BLOCK_SIZE;
radix <<< NUM_BLOCKS, BLOCK_SIZE >>> (d_DataIn, firstidx, lastidx, k, N, bit);
cudaMemcpy(h_DataOut, d_DataIn, size, cudaMemcpyDeviceToHost);
WriteData(h_DataOut1, h_DataOut, 10, N);
return 0;
}
List of headers that I used:
#include "cuda.h"
#include "cuda_runtime_api.h"
#include "device_launch_parameters.h"
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/generate.h>
#include "functor.h"
#include <thrust/iterator/counting_iterator.h>
#include <thrust/copy.h>
#include <thrust/device_ptr.h>
Another header file, "functor.h", converts floating-point numbers to int and generates random floating-point numbers.
#include <thrust/random.h>
#include <sstream>
#include <fstream>
#include <iomanip>
struct RandomFloatNumbers
{
float a, b;
__host__ __device__
RandomFloatNumbers(float _a, float _b) : a(_a), b(_b) {};
__host__ __device__
float operator() (const unsigned int n) const{
thrust::default_random_engine rng;
thrust::uniform_real_distribution<float> dist(a,b);
rng.discard(n);
return dist(rng);
}
};
struct FloatToInt
{
__host__ __device__
int operator() (const float &x)
const {
union {
float f_value;
int i_value;
} value;
value.f_value = x;
return value.i_value;
}
};
float IntToFloat(int &x)
{
union{
float f_value;
int i_value;
}value;
value.i_value = x;
return value.f_value;
}
bool WriteData(float *h_DataOut1, int *h_DataOut, int bit, int N)
{
std::ofstream data;
std::stringstream file;
file << "out\\Partition_";
file << std::setfill('0') <<std::setw(2) << bit;
file << ".txt";
data.open((file.str()).c_str());
if(data.is_open() == false)
{
std::cout << "File is not open" << std::endl;
return false;
}
for(int i = 0; i < N; i++)
{
h_DataOut1[i] = IntToFloat(h_DataOut[i]);
//cout << h_HostFloat[i] << " \t" << h_DataOut1[i] << endl;
//std::bitset<32>bitshift(h_DataOut[i]&1<<31-bit);
//data << bitshift[31-bit] << "\t" <<h_DataOut1[i] <<std::endl;
data << h_DataOut1[i] << std::endl;
}
data << std::endl;
data.close();
std::cout << "Partition=" <<bit <<"\n";
return true;
}
Per your request, I'm posting the code I used to investigate this and help me in studying your code.
#include <stdio.h>
#include <stdlib.h>
__device__ int gpu_partition(unsigned int *data, unsigned int *partition, unsigned int *ones, unsigned int* zeroes, int bit, int idx, unsigned int* warp_ones){
int one = 0;
int valid = 0;
int my_one, my_zero;
if (partition[idx]){
valid = 1;
if(data[idx] & (1ULL<<(31-bit))) one=1;}
__syncthreads();
if (valid){
if (one){
my_one=1;
my_zero=0;}
else{
my_one=0;
my_zero=1;}
}
else{
my_one=0;
my_zero=0;}
ones[idx]=my_one;
zeroes[idx]=my_zero;
unsigned int warp_one = __popc(__ballot(my_one));
if (!(threadIdx.x & 31))
warp_ones[threadIdx.x>>5] = warp_one;
__syncthreads();
// reduce
for (int i = 16; i > 0; i>>=1){
if (threadIdx.x < i)
warp_ones[threadIdx.x] += warp_ones[threadIdx.x + i];
__syncthreads();}
return warp_ones[0];
}
__global__ void gpu_radixkernel(unsigned int *data, unsigned int m, unsigned int n, unsigned int *result){
__shared__ unsigned int loc_data[1024];
__shared__ unsigned int loc_ones[1024];
__shared__ unsigned int loc_zeroes[1024];
__shared__ unsigned int loc_warp_ones[32];
int l=0;
int bit = 0;
unsigned int u = n;
if (n<2){
if ((n == 1) && !(threadIdx.x)) *result = data[0];
return;}
loc_data[threadIdx.x] = data[threadIdx.x];
loc_ones[threadIdx.x] = (threadIdx.x<n)?1:0;
__syncthreads();
unsigned int *next = loc_ones;
do {
int s = gpu_partition(loc_data, next, loc_ones, loc_zeroes, bit++, threadIdx.x, loc_warp_ones);
if ((u-s) > m){
u = (u-s);
next = loc_zeroes;}
else{
l = (u-s);
next = loc_ones;}}
while ((u != l) && (bit<32));
if (next[threadIdx.x]) *result = loc_data[threadIdx.x];
}
int partition(unsigned int *data, int l, int u, int bit){
unsigned int *temp = (unsigned int *)malloc(((u-l)+1)*sizeof(unsigned int));
int pos = 0;
for (int i = l; i<=u; i++)
if(data[i] & (1ULL<<(31-bit))) temp[pos++] = data[i];
int result = u-pos;
for (int i = l; i<=u; i++)
if(!(data[i] & (1ULL<<(31-bit)))) temp[pos++] = data[i];
pos = 0;
for (int i = u; i>=l; i--)
data[i] = temp[pos++];
free(temp);
return result;
}
unsigned int radixselect(unsigned int *data, int l, int u, int m, int bit){
if (l == u) return(data[l]);
if (bit > 32) {printf("radixselect fail!\n"); return 0;}
int s = partition(data, l, u, bit);
if (s>=m) return radixselect(data, l, s, m, bit+1);
return radixselect(data, s+1, u, m, bit+1);
}
int main(){
unsigned int data[8] = {32767, 22, 88, 44, 99, 101, 0, 7};
unsigned int data1[8];
for (int i = 0; i<8; i++){
for (int j=0; j<8; j++) data1[j] = data[j];
printf("value[%d] = %d\n", i, radixselect(data1, 0, 7, i, 0));}
unsigned int *d_data;
cudaMalloc((void **)&d_data, 1024*sizeof(unsigned int));
unsigned int h_result, *d_result;
cudaMalloc((void **)&d_result, sizeof(unsigned int));
cudaMemcpy(d_data, data, 8*sizeof(unsigned int), cudaMemcpyHostToDevice);
for (int i = 0; i < 8; i++){
gpu_radixkernel<<<1,1024>>>(d_data, i, 8, d_result);
cudaMemcpy(&h_result, d_result, sizeof(unsigned int), cudaMemcpyDeviceToHost);
printf("gpu result index %d = %d\n", i, h_result);
}
unsigned int data2[1024];
unsigned int data3[1024];
for (int i = 0; i < 1024; i++) data2[i] = rand();
cudaMemcpy(d_data, data2, 1024*sizeof(unsigned int), cudaMemcpyHostToDevice);
for (int i = 0; i < 1024; i++){
for (int j = 0; j<1024; j++) data3[j] = data2[j];
unsigned int cpuresult = radixselect(data3, 0, 1023, i, 0);
gpu_radixkernel<<<1,1024>>>(d_data, i, 1024, d_result);
cudaMemcpy(&h_result, d_result, sizeof(unsigned int), cudaMemcpyDeviceToHost);
if (h_result != cpuresult) {printf("mismatch at index %d, cpu: %d, gpu: %d\n", i, cpuresult, h_result); return 1;}
}
printf("Finished\n");
return 0;
}
Here are some notes, in no particular order:
I got rid of all your thrust code, it's not doing anything useful as far as the radix select algorithm is concerned. I also find your casting of float to int curious. I haven't thought through the ramifications of trying to do a bitwise radix select in order on a sequence of exponent bits followed by a sequence of mantissa bits. It might work, (although I think if you include the sign bit, it definitely won't work) but again I don't think it's central to understanding the algorithm.
I included a host version that I wrote just to check my device results.
I'm pretty sure this algorithm will fail in some cases where there are duplicated elements. For example, if you hand it a vector of all zeroes, I think it will fail. I don't think it would be difficult to handle that case however.
My host version is recursive, but my device version is not. I don't see that recursion is that useful here, since the non-recursive form of the algorithm is easy to write as well, especially since there are at most 32 bits to travel through. Still, if you wanted to create a recursive device version, it should not be difficult: incorporate the u, s, and l manipulation code inside the partition function.
I have dispensed with the usual CUDA error checking. However, I recommend it.
I don't consider this to be a paragon of CUDA programming. If you delve into, for example, a radix sort algorithm (such as here), you will see that it is pretty complex. A fast GPU radix select would look nothing like my code. I wrote my code to be analogous to the serial recursive partitioned radix sort, which is not the best way to do it on a massively parallel architecture.
Since radix select is not a sort, I attempted to write a device code that would do no data movement of the input data, since I considered this to be expensive and unnecessary. I do a single read from global memory for the data at the beginning of the kernel, and thereafter I do all work out of shared memory, and even in shared memory I am not re-arranging the data (as I do in my host version) so as to avoid the cost of data movement. Instead I keep flag arrays of ones and zeroes partitions, to feed to the next partitioning step. The data movement would involve a fair amount of uncoalesced and/or bank-conflicted traffic, whereas the flag arrays allow all accesses to be non-bank-conflicted.
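The flag-array idea in the last note can be sketched on the host in plain C++. This is not the kernel above, just an illustration under my own assumptions: the name radix_select is made up, k is 0-based, and the result is the k-th smallest value. A single candidate array stands in for the separate ones/zeroes flag arrays.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the k-th smallest value (0-based k) without moving any input data.
// Assumes k < data.size().
std::uint32_t radix_select(const std::vector<std::uint32_t>& data, std::size_t k) {
    std::vector<char> candidate(data.size(), 1);  // 1 = still in the running
    std::size_t remaining = data.size();
    for (int bit = 31; bit >= 0 && remaining > 1; --bit) {
        const std::uint32_t mask = std::uint32_t{1} << bit;
        // Count the candidates whose current bit is 0.
        std::size_t zeros = 0;
        for (std::size_t i = 0; i < data.size(); ++i) {
            if (candidate[i] && (data[i] & mask) == 0) ++zeros;
        }
        // Keep the zero-bit group if k falls inside it, else the one-bit group.
        const bool keep_zero = k < zeros;
        if (!keep_zero) k -= zeros;
        for (std::size_t i = 0; i < data.size(); ++i) {
            const bool is_zero = (data[i] & mask) == 0;
            if (candidate[i] && is_zero != keep_zero) candidate[i] = 0;
        }
        remaining = keep_zero ? zeros : remaining - zeros;
    }
    // Any surviving candidates are equal (duplicates); return the first one.
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (candidate[i]) return data[i];
    }
    return 0;  // not reached when the precondition holds
}
```

Because candidates that survive all 32 bits must be equal, this form also copes with duplicated elements such as a vector of all zeroes.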