I'm learning C++ for a project, and I need to generate a random number on the GPU.
For this, I decided to use cuRAND.
However, I'm running into a small issue on this line:
random << <1, 1 >> >(time(NULL), gpu_x);
I get the error "expected an expression" on that line.
I'm using this code, which I got from here:
__global__ void random(unsigned int seed, int* result) {
/* CUDA's random number library uses curandState_t to keep track of the seed value
we will store a random state for every thread */
curandState_t state;
/* we have to initialize the state */
curand_init(seed, /* the seed controls the sequence of random values that are produced */
0, /* the sequence number is only important with multiple cores */
0, /* the offset is how much extra we advance in the sequence for each call, can be 0 */
&state);
/* curand works like rand - except that it takes a state as a parameter */
*result = curand(&state) % MAX;
}
void Miner::GoMine() {
int* gpu_x;
cudaMalloc((void**)&gpu_x, sizeof(int));
/* invoke the GPU to initialize all of the random states */
random << <1, 1 >> >(time(NULL), gpu_x);
/* copy the random number back */
int x;
cudaMemcpy(&x, gpu_x, sizeof(int), cudaMemcpyDeviceToHost);
printf("Random number = %d.\n", x);
/* free the memory we allocated */
cudaFree(gpu_x);
}
As I am new to C++, I couldn't figure out what is going on.
I hope somebody here is able to help me.
Cheers
I managed to fix the issue by placing the CUDA-related code in cuRAND.cu (Add -> New Item -> CUDA 9.0 -> Code -> CUDA C/C++ File); the <<<...>>> launch syntax is only understood by nvcc, so it needs to live in a file compiled by the CUDA toolchain.
I renamed the function void Miner::GoMine() to int cuRND().
I added some extra code, so my entire cuRAND.cu file now looks like this:
// For the RNG using CUDA
#include <curand.h>
#include <curand_kernel.h>
#include <ctime>     // time(NULL) for the seed
#include <cmath>     // floor()
#include <iomanip>
#include "sha256.h"
#ifndef __Kernel_CU__
#define __Kernel_CU__
#define MAX 100
__global__ void random(unsigned int seed, int* result) {
/* CUDA's random number library uses curandState_t to keep track of the seed value
we will store a random state for every thread */
curandState_t state;
/* we have to initialize the state */
curand_init(seed, /* the seed controls the sequence of random values that are produced */
0, /* the sequence number is only important with multiple cores */
0, /* the offset is how much extra we advance in the sequence for each call, can be 0 */
&state);
/* curand works like rand - except that it takes a state as a parameter */
*result = curand(&state) % MAX;
}
extern "C"
int cuRND() {
int* gpu_x;
cudaMalloc((void**)&gpu_x, sizeof(int));
/* invoke the GPU to initialize all of the random states */
random<<<1, 1>>>(time(NULL), gpu_x);
/* copy the random number back */
int x;
cudaMemcpy(&x, gpu_x, sizeof(int), cudaMemcpyDeviceToHost);
/* free the memory we allocated */
cudaFree(gpu_x);
return floor(99999999 * x);
}
#endif
I then added this declaration to my miner.cpp (the file where I need it):
extern "C"
int cuRND();
I can now make a call to cuRND() from my miner.cpp.
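For reference, the call site ends up looking roughly like this; the main() here is just a stripped-down stand-in for my actual Miner code:
// miner.cpp sketch (stand-in for the real Miner class)
#include <cstdio>

extern "C" int cuRND();   // implemented in cuRAND.cu

int main() {
    int r = cuRND();      // launches the kernel and returns the value
    printf("Random number from the GPU = %d\n", r);
    return 0;
}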
Hit start, and I was off to the races!
Thanks for the help, I hope this answer can help somebody later down the road!
Related
I need to do Viterbi decoding of a convolutionally-encoded signal. My application has to work with large files, so I cannot load the whole signal into memory; I need to process the data as a sequence of separate buffers.
I found a good library for Viterbi decoding, "Encoder and a Viterbi decoder in C++" on Dr. Dobb's. I applied the decoder from that library and it works correctly, but it doesn't provide a function for continuous use (calling a function repeatedly, once per signal buffer, while taking the previous calculations into account).
Then I found the GNU Radio C++ library, which provides the necessary functions, but I don't understand how to use them because there is no documentation. It contains the example of Viterbi decoding shown below:
extern "C" {
#include <gnuradio/fec/viterbi.h>
}
#include <cstdio>
#include <cmath>
#define MAXCHUNKSIZE 4096
#define MAXENCSIZE MAXCHUNKSIZE*16
int main()
{
unsigned char data[MAXCHUNKSIZE];
signed char syms[MAXENCSIZE];
int count = 0;
// Initialize metric table
int mettab[2][256];
int amp = 100; // What is it? ***
float RATE=0.5;
float ebn0 = 12.0;
float esn0 = RATE*pow(10.0, ebn0/10);
gen_met(mettab, amp, esn0, 0.0, 4);
// Initialize decoder state
struct viterbi_state state0[64];
struct viterbi_state state1[64];
unsigned char viterbi_in[16];
viterbi_chunks_init(state0);
while (!feof(stdin)) {
unsigned int n = fread(syms, 1, MAXENCSIZE, stdin);
unsigned char *out = data;
for (unsigned int i = 0; i < n; i++) {
// FIXME: This implements hard decoding by slicing the input stream
unsigned char sym = syms[i] > 0 ? -amp : amp; // What is it? ***
// Write the symbol to the decoder input
viterbi_in[count % 4] = sym;
// Every four symbols, perform the butterfly2 operation
if ((count % 4) == 3) {
viterbi_butterfly2(viterbi_in, mettab, state0, state1);
// Every sixteen symbols, perform the readback operation
if ((count > 64) && (count % 16) == 11) {
viterbi_get_output(state0, out);
fwrite(out++, 1, 1, stdout);
}
}
count++;
}
}
return 0;
}
The file viterbi.c from that library also contains the following function, viterbi(), without a declaration:
/* Viterbi decoder */
int viterbi(unsigned long *metric, /* Final path metric (returned value) */
unsigned char *data, /* Decoded output data */
unsigned char *symbols, /* Raw deinterleaved input symbols */
unsigned int nbits, /* Number of output bits */
int mettab[2][256] /* Metric table, [sent sym][rx symbol] */
) { ...
I also found one more implementation of Viterbi decoding, the Spiral project, but it doesn't have a proper description either and doesn't compile. There are two more implementations, in ExpertCore and the Forward Error Correction DSP library.
My question: can anyone explain how to use the GNU Radio implementation of the Viterbi algorithm above, in a continuous (buffer-by-buffer) fashion, for a binary interleaved digital signal? The encoder parameters of my signal are K=7, rate=1/2, and every bit in my file is a demodulated sample of the signal.
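To make the question more concrete, this is the kind of stateful wrapper I am after. It only reuses the calls from the example above; the class layout is just my guess at what state has to be carried over between buffers (untested):
extern "C" {
#include <gnuradio/fec/viterbi.h>
}
#include <vector>

// Sketch only: keep the decoder state alive between calls so that
// decode_buffer() can be fed one buffer of the signal at a time.
class ContinuousViterbi {
public:
    ContinuousViterbi(int amp, float esn0) : count(0), amp(amp) {
        gen_met(mettab, amp, esn0, 0.0, 4);   // metric table, as in the example
        viterbi_chunks_init(state0);          // initial decoder state
    }

    // syms: one buffer of demodulated symbols, n: its length,
    // out: decoded bytes are appended here
    void decode_buffer(const signed char *syms, unsigned int n,
                       std::vector<unsigned char> &out) {
        for (unsigned int i = 0; i < n; i++) {
            unsigned char sym = syms[i] > 0 ? -amp : amp;   // hard decision, as in the example
            viterbi_in[count % 4] = sym;
            if ((count % 4) == 3) {
                viterbi_butterfly2(viterbi_in, mettab, state0, state1);
                if ((count > 64) && (count % 16) == 11) {
                    unsigned char byte;
                    viterbi_get_output(state0, &byte);
                    out.push_back(byte);
                }
            }
            count++;
        }
    }

private:
    int mettab[2][256];
    struct viterbi_state state0[64];
    struct viterbi_state state1[64];
    unsigned char viterbi_in[16];
    unsigned long count;
    int amp;
};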
I made a custom source block that reads switch values on a ZedBoard. It accesses them via a proc driver that I wrote. /var/log/kern.log reports the proper output, and so does the debug printf in the source block.
However, pushing the data to a file sink as well as a GUI number sink only ever reads zeros. Did I not set up the block properly?
#ifdef HAVE_CONFIG_H
#include "config.h"
#endif
#include <gnuradio/io_signature.h>
#include "switches_impl.h"
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
namespace gr {
namespace zedboard {
switches::sptr
switches::make()
{
return gnuradio::get_initial_sptr
(new switches_impl());
}
/*
* The private constructor
*/
switches_impl::switches_impl()
: gr::block("switches",
gr::io_signature::make(0,0,0),
gr::io_signature::make(1, 1, sizeof(unsigned int *)))
{}
/*
* Our virtual destructor.
*/
switches_impl::~switches_impl()
{
}
void
switches_impl::forecast (int noutput_items, gr_vector_int &ninput_items_required)
{
/* <+forecast+> e.g. ninput_items_required[0] = noutput_items */
}
int
switches_impl::general_work (int noutput_items,
gr_vector_int &ninput_items,
gr_vector_const_void_star &input_items,
gr_vector_void_star &output_items)
{
//const <+ITYPE+> *in = (const <+ITYPE+> *) input_items[0];
unsigned int *out = (unsigned int *) output_items[0];
// Do <+signal processing+>
// Tell runtime system how many input items we consumed on
// each input stream.
char buffer[5];
size_t size = 1;
size_t nitems = 5;
FILE* fp;
fp = fopen("/proc/zedSwitches","r");
if (fp == NULL)
{
printf("Cannot open for read\n");
return -1;
}
/*
Expect return format:
0x00
*/
fread(buffer, size, nitems, fp);
fclose(fp);
out=(unsigned int *)strtoul(buffer,NULL,0);
printf("read: 0x%02x",out);
consume_each (noutput_items);
// Tell runtime system how many output items we produced.
return noutput_items;
}
} /* namespace zedboard */
} /* namespace gr */
A pointer is a pointer to data, usually:
unsigned int *out = (unsigned int *) output_items[0];
out refers to the buffer for your output.
But you overwrite that pointer with another pointer:
out=(unsigned int *)strtoul(buffer,NULL,0);
which just bends around your copy of that pointer, and doesn't affect the content of that buffer at all. Basic C!
You probably meant to say something like:
out[0]= strtoul(buffer,NULL,0);
That will put your value into the first element of the buffer.
However, you then tell GNU Radio that you produced not just that single item (the line above), but noutput_items items:
return noutput_items;
That must read
return 1;
when you're only producing a single item, or you must actually produce as many items as you return.
Your consume_each call makes no sense here: a source has no input to consume. GNU Radio sources are typically instances of gr::sync_block, which means you'd write a work() method instead of the general_work() you have.
From the fact alone that this is a general_work() and not a work(), I'd say you haven't used gr_modtool (with the block type set to source!) to generate the stub for this block; you really should. Again, I'd like to point you to the Guided Tutorials, which quickly explain the usage of gr_modtool as well as the underlying GNU Radio concepts.
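As a rough, untested sketch of what I mean (assuming you regenerate the block as a source so that it derives from gr::sync_block), the work() could look something like this, keeping your /proc read:
int
switches_impl::work(int noutput_items,
                    gr_vector_const_void_star &input_items,
                    gr_vector_void_star &output_items)
{
    // note: the output io_signature should then be sizeof(unsigned int),
    // not sizeof(unsigned int *)
    unsigned int *out = (unsigned int *) output_items[0];

    char buffer[5] = {0};
    FILE *fp = fopen("/proc/zedSwitches", "r");
    if (fp == NULL)
        return 0;                 // produce nothing this time around
    fread(buffer, 1, 4, fp);      // expected format: "0x00"
    fclose(fp);

    // write the value into the output buffer instead of overwriting the pointer
    out[0] = (unsigned int) strtoul(buffer, NULL, 0);

    return 1;                     // we produced exactly one item
}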
I am trying to generate a set of random numbers that are only 1 and 0. The code below almost works, but when I print them in the for loop I notice that sometimes a number is generated that is not 1 or 0. I know I am missing something, I'm just not sure what; I think it's some kind of memory problem.
#include <stdio.h>
#include <stdlib.h>   // system()
#include <time.h>     // time(NULL) for the seed
#include <curand.h>
#include <curand_kernel.h>
#include <math.h>
#include <assert.h>
#define MIN 1
#define MAX (2048*20)
#define MOD 2 // only need one and zero for each random value.
#define THREADS_PER_BLOCK 256
__global__ void setup_kernel(curandState *state, unsigned long seed)
{
int idx = threadIdx.x+blockDim.x*blockIdx.x;
curand_init(seed, idx, 0, state+idx);
}
__global__ void generate_kernel(curandState *state, unsigned int *result){
int idx = threadIdx.x + blockDim.x*blockIdx.x;
result[idx] = curand(state+idx) % MOD;
}
int main(){
curandState *d_state;
cudaMalloc(&d_state, sizeof(curandState));
unsigned *d_result, *h_result;
cudaMalloc(&d_result, (MAX-MIN+1) * sizeof(unsigned));
h_result = (unsigned *)malloc((MAX-MIN+1)*sizeof(unsigned));
cudaMemset(d_result, 0, (MAX-MIN+1)*sizeof(unsigned));
setup_kernel<<<MAX/THREADS_PER_BLOCK,THREADS_PER_BLOCK>>>(d_state,time(NULL));
generate_kernel<<<MAX/THREADS_PER_BLOCK,THREADS_PER_BLOCK>>>(d_state, d_result);
cudaMemcpy(h_result, d_result, (MAX-MIN+1) * sizeof(unsigned), cudaMemcpyDeviceToHost);
printf("Bin: Count: \n");
for (int i = MIN; i <= MAX; i++)
printf("%d %d\n", i, h_result[i-MIN]);
free(h_result);
cudaFree(d_result);
system("pause");
return 0;
}
What I am attempting to do is transform a genetic algorithm from this site.
http://www.ai-junkie.com/ga/intro/gat3.html
I thought it would be a good problem to learn CUDA and have some fun at the same time.
The first part is to generate my random array.
The problem here is that both your setup_kernel and generate_kernel are not running to completion, because of out-of-bounds memory accesses. Both kernels expect that there will be a generator state for each thread, but you are only allocating a single state on the device. This results in out-of-bounds memory reads and writes in both kernels. Change this:
curandState *d_state;
cudaMalloc(&d_state, sizeof(curandState));
to something like
curandState *d_state;
cudaMalloc(&d_state, sizeof(curandState) * (MAX-MIN+1));
so that you have one generator state per thread you are running, and things should start working. If you had made any attempt at error checking, either via the runtime API return statuses or by using cuda-memcheck, the source of the error would have been immediately apparent.
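For illustration, a minimal error-checking pattern along those lines could look like this (the macro name is arbitrary):
#include <cstdio>
#include <cstdlib>

// wrap every runtime API call with this, and check kernel launches explicitly
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err = (call);                                   \
        if (err != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",            \
                    cudaGetErrorString(err), __FILE__, __LINE__);   \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// usage in main():
CUDA_CHECK(cudaMalloc(&d_state, sizeof(curandState) * (MAX-MIN+1)));
setup_kernel<<<MAX/THREADS_PER_BLOCK,THREADS_PER_BLOCK>>>(d_state, time(NULL));
CUDA_CHECK(cudaGetLastError());        // catches launch configuration errors
CUDA_CHECK(cudaDeviceSynchronize());   // catches errors during kernel execution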
I'm running CUDA 4.2 on 64-bit Windows 7 in the Visual Studio 2010 Professional environment.
First, I have the following code running:
// include the header files
#include <iostream>
#include <stdio.h>
#include <time.h>
#include "cuda.h"
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
using namespace std;
//kernel function
__global__
void dosomething(int *d_bPtr, int count, int* d_bStopPtr)
{
int tid = threadIdx.x + blockIdx.x * blockDim.x;
if (tid==0)
d_bStopPtr[tid]=0;
else if(tid<count)
{
d_bPtr[tid]=tid;
// only if the array cell before it is 0, then change it to 0 too
if (d_bStopPtr[tid-1]==0 )
d_bStopPtr[tid]=0;
}
}
int main()
{
int count=100000;
// define the vectors
thrust::host_vector <int> h_a(count);
thrust::device_vector <int> d_b(count,0);
int* d_bPtr=thrust::raw_pointer_cast(&d_b[0]);
thrust::device_vector <int> d_bStop(count,1);
int* d_bStopPtr=thrust::raw_pointer_cast(&d_bStop[0]);
// get the device property
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
int threadsPerBlock = prop.maxThreadsDim[0];
int blocksPerGrid = min(prop.maxGridSize[0], (count + threadsPerBlock - 1) / threadsPerBlock);
//copy device to host
thrust::copy(d_b.begin(),d_b.end(),h_a.begin());
cout<<h_a[100]<<"\t"<<h_a[200]<<"\t"<<h_a[300]<<"\t"<<endl;
//run the kernel
while(d_bStop[count-1])
{
dosomething<<<blocksPerGrid, threadsPerBlock>>>(d_bPtr,count,d_bStopPtr);
}
//copy device back to host again
thrust::copy(d_b.begin(),d_b.end(),h_a.begin());
cout<<h_a[100]<<"\t"<<h_a[200]<<"\t"<<h_a[300]<<"\t"<<endl;
//wait to see the console output
int x;
cin>>x;
return 0;
}
However, I have to check the while condition on the host before every launch, and that is slow. So I'm thinking of checking the condition of this device vector inside the kernel instead, and changing the code like this:
// include the header files
#include <iostream>
#include <stdio.h>
#include <time.h>
#include "cuda.h"
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
using namespace std;
//kernel function
__global__
void dosomething(int *d_bPtr, int count, int* d_bStopPtr)
{
int tid = threadIdx.x + blockIdx.x * blockDim.x;
if (tid==0)
d_bStopPtr[tid]=0;
else if(tid<count)
{
// if the last cell of the array is still not 0 yet, repeat
while(d_bStopPtr[count-1])
{
d_bPtr[tid]=tid;
// only if the array cell before it is 0, then change it to 0 too
if (d_bStopPtr[tid-1]==0 )
d_bStopPtr[tid]=0;
}
}
}
int main()
{
int count=100000;
// define the vectors
thrust::host_vector <int> h_a(count);
thrust::device_vector <int> d_b(count,0);
int* d_bPtr=thrust::raw_pointer_cast(&d_b[0]);
thrust::device_vector <int> d_bStop(count,1);
int* d_bStopPtr=thrust::raw_pointer_cast(&d_bStop[0]);
// get the device property
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
int threadsPerBlock = prop.maxThreadsDim[0];
int blocksPerGrid = min(prop.maxGridSize[0], (count + threadsPerBlock - 1) / threadsPerBlock);
//copy device to host
thrust::copy(d_b.begin(),d_b.end(),h_a.begin());
cout<<h_a[100]<<"\t"<<h_a[200]<<"\t"<<h_a[300]<<"\t"<<endl;
//run the kernel
dosomething<<<blocksPerGrid, threadsPerBlock>>>(d_bPtr,count,d_bStopPtr);
//copy device back to host again
thrust::copy(d_b.begin(),d_b.end(),h_a.begin());
cout<<h_a[100]<<"\t"<<h_a[200]<<"\t"<<h_a[300]<<"\t"<<endl;
//wait to see the console output
int x;
cin>>x;
return 0;
}
However, the second version always causes the graphics card and the computer to hang. Can you please help me speed up the first version? How can I check the condition inside the kernel, and then jump out and stop the kernel?
You are basically looking for global thread-synchronous behavior. This is a no-no in GPU programming. Ideally each threadblock is independent and can complete its work based on its own data and processing. Creating threadblocks that depend on the results of other threadblocks to complete their work creates the possibility of a deadlock condition. Suppose I have a GPU with 14 SMs (threadblock execution units), and suppose I create 100 threadblocks. Now suppose threadblocks 0-13 are waiting for threadblock 99 to release a lock (e.g. write a zero value to a particular location). Now suppose those first 14 threadblocks begin executing on the 14 SMs, perhaps looping, spinning on the lock value. There is no mechanism in the GPU to guarantee that threadblock 99 will execute first or even execute at all if threadblocks 0-13 have the SMs tied up.
Let's not get into questions about "what about GMEM stalls that force eviction of threadblocks 0-13" because none of that guarantees that threadblock 99 will get priority to execute at any point. The only thing that guarantees that threadblock 99 will execute is the draining (i.e. completion) of other threadblocks. But if the other threadblocks are spinning, waiting for results from threadblock 99, that may never happen.
Good forward-compatible, scalable GPU code depends on independent parallel work. So you're advised to re-craft your algorithm to make the work you are trying to accomplish independent, at least at the inter-threadblock level.
If you must do global thread syncing, the kernel launch is the only truly guaranteed point for this, and thus your first approach is the working approach.
To help with this, it may be useful to study how reduction algorithms get implemented on a GPU. Various types of reductions have dependencies across all threads, but by creating intermediate results, we can break the work into independent pieces. The independent pieces can then be aggregated using a multi-kernel approach (or some other more advanced approaches) to speed up what amounts to a serial algorithm.
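To illustrate that multi-kernel idea with a simple sum reduction (nothing to do with your specific kernel): the first pass produces an independent partial result per threadblock, and a second launch (or the host) combines them, with the kernel launch boundary acting as the global synchronization point:
// Pass 1: each block reduces its slice of the input to one partial sum.
// Assumes blockDim.x is a power of two; launch with blockDim.x * sizeof(int)
// bytes of dynamic shared memory.
__global__ void block_sum(const int *in, int *partial, int n)
{
    extern __shared__ int sdata[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? in[i] : 0;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        partial[blockIdx.x] = sdata[0];   // independent per-block result
}
// Pass 2: run block_sum again on the partial array (or sum it on the host).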
Your kernel doesn't actually do much. It sets one array equal to its index, i.e. a[i] = i;, and it sets the other array to all zeroes (although sequentially), b[i] = 0;. To show an example of your first code "sped up", you could do something like this:
// include the header files
#include <iostream>
#include <stdio.h>
#include <time.h>
#include "cuda.h"
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
using namespace std;
//kernel function
__global__
void dosomething(int *d_bPtr, int count, int* d_bStopPtr)
{
int tid = threadIdx.x + blockIdx.x * blockDim.x;
while(tid<count)
{
d_bPtr[tid]=tid;
while(d_bStopPtr[tid]!=0)
// only if the array cell before it is 0, then change it to 0 too
if (tid==0) d_bStopPtr[tid] =0;
else if (d_bStopPtr[tid-1]==0 )
d_bStopPtr[tid]=0;
tid += blockDim.x;
}
}
int main()
{
int count=100000;
// define the vectors
thrust::host_vector <int> h_a(count);
thrust::device_vector <int> d_b(count,0);
int* d_bPtr=thrust::raw_pointer_cast(&d_b[0]);
thrust::device_vector <int> d_bStop(count,1);
int* d_bStopPtr=thrust::raw_pointer_cast(&d_bStop[0]);
// get the device property
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
// int threadsPerBlock = prop.maxThreadsDim[0];
int threadsPerBlock = 32;
// int blocksPerGrid = min(prop.maxGridSize[0], (count + threadsPerBlock - 1) / threadsPerBlock);
int blocksPerGrid = 1;
//copy device to host
thrust::copy(d_b.begin(),d_b.end(),h_a.begin());
cout<<h_a[100]<<"\t"<<h_a[200]<<"\t"<<h_a[300]<<"\t"<<endl;
//run the kernel
// while(d_bStop[count-1])
// {
dosomething<<<blocksPerGrid, threadsPerBlock>>>(d_bPtr,count,d_bStopPtr);
// }
//copy device back to host again
cudaDeviceSynchronize();
thrust::copy(d_b.begin(),d_b.end(),h_a.begin());
cout<<h_a[100]<<"\t"<<h_a[200]<<"\t"<<h_a[300]<<"\t"<<endl;
//wait to see the console output
int x;
cin>>x;
return 0;
}
On my machine this speeds the execution time up from 10 seconds to almost instantaneous (much less than 1 second). Note that this is not a great example of CUDA programming, because I am only launching one block of 32 threads. That's not enough to effectively utilize the machine. But the work done by your kernel is so trivial that I'm not sure what a good example would be. I could just create a kernel that sets one array to its index, a[i]=i;, and the other array to zero, b[i]=0;, all in parallel. That would be even faster, and we could use the whole machine that way.
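For completeness, that trivial fully parallel version would look something like this:
__global__ void init_arrays(int *a, int *b, int count)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < count) {
        a[tid] = tid;   // a[i] = i
        b[tid] = 0;     // b[i] = 0
    }
}

// launched with enough blocks to cover all elements, e.g.:
// init_arrays<<<(count + 255) / 256, 256>>>(d_bPtr, d_bStopPtr, count);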
main.cpp
#include "stdafx.h"
#include "random_generator.h"
int
main ( int argc, char *argv[] )
{
cout.setf(ios::fixed);
base_generator_type base_generator;
int max = pow(10, 2);
distribution_type dist(1, max);
boost::variate_generator<base_generator_type&,
distribution_type > uni(base_generator, dist);
for ( int i=0; i<10; i++ ) {
//cout << random_number(2) << endl;
cout << uni() << endl;
}
return EXIT_SUCCESS;
} /* ---------- end of function main ---------- */
random_generator.h
#include "stdafx.h"
#include <boost/random.hpp>
#include <boost/generator_iterator.hpp>
typedef boost::mt19937 base_generator_type;
typedef boost::lagged_fibonacci19937 fibo_generator_type;
typedef boost::uniform_int<> distribution_type;
typedef boost::variate_generator<fibo_generator_type&,
distribution_type> gen_type;
int
random_number ( int bits )
{
fibo_generator_type fibo_generator;
int max = pow(10, bits);
distribution_type dist(1, max);
gen_type uni(fibo_generator, dist);
return uni();
} /* ----- end of function random_number ----- */
stdafx.h
#include <iostream>
#include <cstdlib>
#include <cmath>
using namespace std;
Every time I run it, it generates the same number sequence,
like 77, 33, 5, 22, ...
How do I use boost::random correctly?
That's it. But there may still be a small problem, like the following:
This seems sound:
get_seed();
for (;;) { cout << generate_random() << endl; }  // ok
But this generates the same random number every time:
int get_random() { get_seed(); return generate_random(); }
for (;;) { cout << get_random() << endl; }  // still outputs the same random number
If you want the sequence of random numbers to change every time you run your program, you need to change the random seed, for instance by initializing it with the current time.
You will find an example there; excerpt:
/*
* Change seed to something else.
*
* Caveat: std::time(0) is not a very good truly-random seed. When
* called in rapid succession, it could return the same values, and
* thus the same random number sequences could ensue. If not the same
* values are returned, the values differ only slightly in the
* lowest bits. A linear congruential generator with a small factor
* wrapped in a uniform_smallint (see experiment) will produce the same
* values for the first few iterations. This is because uniform_smallint
* takes only the highest bits of the generator, and the generator itself
* needs a few iterations to spread the initial entropy from the lowest bits
* to the whole state.
*/
generator.seed(static_cast<unsigned int>(std::time(0)));
You need to seed your random number generator so it doesn't start from the same place each time.
Depending on what you are doing with the numbers, you may need to put some thought into how you choose your seed value. If you need high-quality randomness (for instance, if you are generating cryptographic keys and want them fairly secure), you will need a good seed value. On POSIX I would suggest /dev/random, but you look to be using Windows, so I'm not sure what a good seed source would be.
But if you don't mind a predictable seed (for games, simulations, etc.), a quick and dirty seed is the current timestamp returned by time().
If you are running on a *nix system, you could always try something like this:
#include <fstream>

int getSeed()
{
    std::ifstream rand("/dev/urandom");
    char tmp[sizeof(int)];
    rand.read(tmp, sizeof(int));
    rand.close();
    int* number = reinterpret_cast<int*>(tmp);
    return (*number);
}
I'm guessing seeding the random number generator this way is faster than simply reading /dev/urandom (or /dev/random) for all your random number needs.
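For example, used with the generator from the question (assuming the typedefs from random_generator.h), it would simply be:
base_generator_type base_generator;
base_generator.seed(static_cast<unsigned int>(getSeed()));   // seed once, up front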
You can use the boost::random::random_device class either as-is, or to seed your other generator.
You can get a one-off random number out of it with a simple:
boost::random::random_device()()
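For instance, here is a minimal sketch that seeds the mt19937 from the question with a random_device; I'm assuming the usual Boost.Random header path for your Boost version:
#include <boost/random/random_device.hpp>
#include <boost/random.hpp>
#include <iostream>

int main()
{
    boost::random::random_device rd;          // non-deterministic seed source
    boost::mt19937 gen(rd());                 // seed the PRNG once
    boost::uniform_int<> dist(1, 100);
    boost::variate_generator<boost::mt19937&, boost::uniform_int<> > uni(gen, dist);
    for (int i = 0; i < 10; i++)
        std::cout << uni() << std::endl;      // different sequence on each run
    return 0;
}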