I have been having trouble converting from a single character to an integer while in the host function of my CUDA program. After the line -
token[j] = token[j] * 10 + (buf[i] - '0' );
I use cuda-gdb to check the value of token[j], and I always get different numbers that do not seem to follow a pattern. I have also tried simple casting, not multiplying by ten (which I saw in another thread), and not subtracting '0', and I always seem to get a different result. Any help would be appreciated. This is my first time posting on Stack Overflow, so give me a break if my formatting is awful.
-A fellow struggling coder
__global__ void rread(unsigned int *table, char *buf, int *threadbytes, unsigned int *token) {
int i = 0;
int j = 0;
*token = NULL;
int tid = threadIdx.x;
unsigned int key;
char delim = ' ';
for(i = tid * *threadbytes; i <(tid * *threadbytes) + *threadbytes ; i++)
{
if (buf[i] != delim) { //check if its not a delim
token[j] = token[j] * 10 + (buf[i] - '0' );
There's a race condition on writing to token.
If you want to have a local array per block you can use shared memory. If you want a local array per thread, you will need to use local per-thread memory and declare the array on the stack. In the first case you will have to deal with concurrency inside the block as well. In the latter you don't have to, although you might potentially waste a lot more memory (and reduce collaboration).
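A minimal host-side C++ sketch of the per-thread option, assuming each thread owns one contiguous slice of buf. The name parse_slice and its parameters are illustrative and do not come from the original kernel. Every token is accumulated in a local, explicitly initialized variable, so callers working on different slices never write to the same location:

```cpp
#include <cstddef>
#include <vector>

// Parse the delimiter-separated decimal numbers in buf[begin, begin + count).
// All state lives in locals, so concurrent callers that work on different
// slices never write to shared memory. (Hypothetical helper, for illustration.)
std::vector<unsigned int> parse_slice(const char* buf, std::size_t begin,
                                      std::size_t count, char delim = ' ') {
    std::vector<unsigned int> tokens;
    unsigned int current = 0;   // local accumulator, explicitly initialized
    bool in_token = false;

    for (std::size_t i = begin; i < begin + count; ++i) {
        if (buf[i] != delim) {
            current = current * 10 + static_cast<unsigned int>(buf[i] - '0');
            in_token = true;
        } else if (in_token) {
            tokens.push_back(current);   // delimiter closes the current token
            current = 0;
            in_token = false;
        }
    }
    if (in_token) tokens.push_back(current);  // token that ends the slice
    return tokens;
}
```

In a kernel the vector would have to become a fixed-size local array, and a number that straddles two slices would still be split in two, so the slice boundaries need separate handling.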
typedef unsigned char Byte;
...
void ReverseBytes( void *start, int size )
{
Byte *buffer = (Byte *)(start);
for( int i = 0; i < size / 2; i++ ) {
std::swap( buffer[i], buffer[size - i - 1] );
}
}
What this method does right now is it reverses bytes in memory. What I would like to know is, is there a better way to get the same effect? The whole "size / 2" part seems like a bad thing, but I'm not sure.
EDIT: I just realized how bad the title I put for this question was, so I [hopefully] fixed it.
The standard library has a std::reverse function:
#include <algorithm>
void ReverseBytes( void *start, int size )
{
char *istart = static_cast<char *>(start), *iend = istart + size;
std::reverse(istart, iend);
}
A performant solution without using the STL:
void reverseBytes(void *start, int size) {
unsigned char *lo = (unsigned char *)start;
unsigned char *hi = lo + size - 1;
unsigned char swap;
while (lo < hi) {
swap = *lo;
*lo++ = *hi;
*hi-- = swap;
}
}
Though the question is 3 ½ years old, chances are that someone else will be searching for the same thing. That's why I still post this.
If you need to reverse there is a chance that you can improve your algorithms and just use reverse iterators.
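As a rough illustration of the reverse-iterator idea, the sketch below reads four bytes back to front and assembles them into a 32-bit value without modifying the buffer. The function name read_reversed_u32 is made up for this example, and it assumes the caller supplies at least four bytes:

```cpp
#include <cstdint>
#include <iterator>

// Combine four bytes into one 32-bit value, visiting them from the last
// byte to the first. The buffer itself is left untouched.
// (Hypothetical helper; assumes bytes points to at least 4 bytes.)
std::uint32_t read_reversed_u32(const unsigned char* bytes) {
    typedef std::reverse_iterator<const unsigned char*> rev;
    std::uint32_t value = 0;
    for (rev it(bytes + 4), end(bytes); it != end; ++it) {
        value = (value << 8) | *it;   // the last byte ends up most significant
    }
    return value;
}
```

Because the bytes are only read in reverse order, no swap and no temporary copy are needed.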
If you're reversing binary data from a file with different endianness you should probably use the ntoh* and hton* functions, which convert specified data sizes from network to host order and vice versa. ntohl for instance converts a 32 bit unsigned long from big endian (network order) to host order (little endian on x86 machines).
I would review std::swap and make sure it's optimized; after that I'd say you're pretty optimal for space. I'm reasonably sure that's time-optimal as well.
I have a program that generates files containing random distributions of the characters A-Z. I have written a method that reads these files (and counts each character) using fread with different buffer sizes in an attempt to determine the optimal block size for reads. Here is the method:
int get_histogram(FILE * fp, long *hist, int block_size, long *milliseconds, long *filelen)
{
char *buffer = new char[block_size];
bzero(buffer, block_size);
struct timeb t;
ftime(&t);
long start_in_ms = t.time * 1000 + t.millitm;
size_t bytes_read = 0;
while (!feof(fp))
{
bytes_read += fread(buffer, 1, block_size, fp);
if (ferror (fp))
{
return -1;
}
int i;
for (i = 0; i < block_size; i++)
{
int j;
for (j = 0; j < 26; j++)
{
if (buffer[i] == 'A' + j)
{
hist[j]++;
}
}
}
}
ftime(&t);
long end_in_ms = t.time * 1000 + t.millitm;
*milliseconds = end_in_ms - start_in_ms;
*filelen = bytes_read;
return 0;
}
However, when I plot bytes/second vs. block size (buffer size) using block sizes of 2 - 2^20, I get an optimal block size of 4 bytes -- which just can't be correct. Something must be wrong with my code but I can't find it.
Any advice is appreciated.
Regards.
EDIT:
The point of this exercise is to demonstrate the optimal buffer size by recording the read times (plus computation time) for different buffer sizes. The file pointer is opened and closed by the calling code.
There are many bugs in this code:
It uses new[], which is C++.
It doesn't free the allocated memory.
It always loops over block_size bytes of input, not bytes_read as returned by fread().
Also, the actual histogram code is rather inefficient, since it seems to loop over each character to determine which character it is.
UPDATE: Removed claim that using feof() before I/O is wrong, since that wasn't true. Thanks to Eric for pointing this out in a comment.
You're not stating what platform you're running this on, and what compile time parameters you use.
Of course, the fread() involves some overhead, leaving user mode and returning. On the other hand, instead of setting the hist[] information directly, you're looping through the alphabet. This is unnecessary and, without optimization, causes some overhead per byte.
I'd re-test this with hist[buffer[i] - 'A']++ or something similar.
Typically, the best timing would be achieved if your buffer size equals the system's buffer size for the given media.
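A minimal sketch of the counting step with both fixes applied: it processes only the bytes that were actually read and indexes the histogram directly. The helper name count_letters is an assumption, and the code relies on 'A' to 'Z' being contiguous, as they are in ASCII:

```cpp
#include <cstddef>

// Count the letters 'A'..'Z' in the first n bytes of buf.
// hist must point to 26 counters. n should be the value returned by
// fread() for this block, not the capacity of the buffer.
// (Hypothetical helper; assumes an ASCII-compatible character set.)
void count_letters(const char* buf, std::size_t n, long* hist) {
    for (std::size_t i = 0; i < n; ++i) {
        const char c = buf[i];
        if (c >= 'A' && c <= 'Z') {
            ++hist[c - 'A'];   // direct index, no inner loop over the alphabet
        }
    }
}
```

In the read loop this would be called as count_letters(buffer, n, hist), where n is the return value of the fread call for that block.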
I am using GPU to do some calculation for processing words.
Initially, I used one block (with 500 threads) to process one word.
To process 100 words, I have to loop the kernel function 100 times in my main function.
for (int i=0; i<100; i++)
kernel <<< 1, 500 >>> (length_of_word);
My kernel function looks like this:
__global__ void kernel (int *dev_length)
{
int length = *dev_length;
while (length > 4)
{ //do something;
length -=4;
}
}
Now I want to process all 100 words at the same time.
Each block will still have 500 threads, and processes one word (per block).
dev_totalwordarray: store all characters of the words (one after another)
dev_length_array: store the length of each word.
dev_accu_length: stores the accumulative length of the word (total char of all previous words)
dev_salt_ is an array of size 500, storing unsigned integers.
Hence, in my main function I have
kernel2 <<< 100, 500 >>> (dev_totalwordarray, dev_length_array, dev_accu_length, dev_salt_);
to populate the cpu array:
for (int i=0; i<wordnumber; i++)
{
int length=0;
while (word_list_ptr_array[i][length]!=0)
{
length++;
}
actualwordlength2[i] = length;
}
to copy from cpu -> gpu:
int* dev_array_of_word_length;
HANDLE_ERROR( cudaMalloc( (void**)&dev_array_of_word_length, 100 * sizeof(int) ) );
HANDLE_ERROR( cudaMemcpy( dev_array_of_word_length, actualwordlength2, 100 * sizeof(int),
My function kernel now looks like this:
__global__ void kernel2 (char* dev_totalwordarray, int *dev_length_array, int* dev_accu_length, unsigned int* dev_salt_)
{
tid = threadIdx.x + blockIdx.x * blockDim.x;
unsigned int hash[N];
int length = dev_length_array[blockIdx.x];
while (tid < 50000)
{
const char* itr = &(dev_totalwordarray[dev_accu_length[blockIdx.x]]);
hash[tid] = dev_salt_[threadIdx.x];
unsigned int loop = 0;
while (length > 4)
{ const unsigned int& i1 = *(reinterpret_cast<const unsigned int*>(itr)); itr += sizeof(unsigned int);
const unsigned int& i2 = *(reinterpret_cast<const unsigned int*>(itr)); itr += sizeof(unsigned int);
hash[tid] ^= (hash[tid] << 7) ^ i1 * (hash[tid] >> 3) ^ (~((hash[tid] << 11) + (i2 ^ (hash[tid] >> 5))));
length -=4;
}
tid += blockDim.x * gridDim.x;
}
}
However, kernel2 doesn't seem to work at all.
It seems while (length > 4) causes this.
Does anyone know why? Thanks.
I am not sure if the while is the culprit, but I see a few things in your code that worry me:
Your kernel produces no output. The optimizer will most likely detect this and convert it to an empty kernel.
In almost no situation do you want arrays allocated per thread. That will consume a lot of memory. Your hash[N] table will be allocated per thread and discarded at the end of the kernel. If N is big (and then multiplied by the total number of threads) you may run out of GPU memory. Not to mention that accessing the hash will be almost as slow as accessing global memory.
All threads in a block will have the same itr value. Is it intended?
Every thread initializes only a single field within its own copy of hash table.
I see hash[tid] where tid is a global index. Be aware that even if hash was made global, you may hit concurrency problems. Not all blocks within a grid will run at the same time. While one block will initialize a portion of hash, another block might not even start!
I'm trying to implement a basic audio delay - but all I'm getting is garbage, probably something very obvious - but I can't seem to spot it...
Audio is processed via buffers that are determined at runtime.
I think I'm doing something horribly wrong with the pointers. I tried looking at some other code, but it all seems "incomplete"; something rudimentary is always missing, which is probably what's missing in my code as well.
// Process audio
// 1
void Gain::subProcessSimpleDelay( int bufferOffset, int sampleFrames )
{
// Assign pointers to your in/output buffers.
// Each buffer is an array of float samples.
float* in1 = bufferOffset + pinInput1.getBuffer();
float* in2 = bufferOffset + pinInput2.getBuffer();
float* out1 = bufferOffset + pinOutput1.getBuffer();
// SampleFrames = how many samples to process (can vary).
// Repeat (loop) that many times
for( int s = sampleFrames; s > 0; --s )
{
// get the sample 'POINTED TO' by in1.
float input1 = *in1;
float feedback = *in2;
float output;
unsigned short int p, r;
unsigned short int len;
len = 600;
// check at delay length calculation
if (len > 65535)
len = 65535;
// otherwise, a length of 0 will output the input from
// 65536 samples ago
else if (len < 1)
len = 1;
r = p - len; // loop
output = buffer[r];
buffer[p] = input1 + output * feedback;
p++;
*out1 = output;
// store the result in the output buffer.
// increment the pointers (move to next sample in buffers).
in1++;
in2++;
out1++;
}
}
Could anyone tell me what's wrong?
You haven't initialized p. Other things to be careful of in this code:
Are you sure that sampleFrames + bufferOffset is less than the size of your input and output buffers? You could really do with a way to check that.
It's not clear where buffer comes from, or what else might be writing to it. If it's garbage before your code runs, you're going to end up with garbage everywhere, because the first thing you do is read from it.
You don't say what types pinInput1.getBuffer() etc. return. If they return a char*, and you just know that it happens to point to an array of floats, you need to cast the result to float* before you do any pointer arithmetic, to make sure you're advancing to the next float in the array, not the next byte of the array.
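To make the first point concrete, here is a small sketch of a feedback delay whose write position is a member that is initialized once and persists between calls. It swaps in a simpler technique than the question's 65536-entry buffer with a wrapping unsigned short index: a ring buffer exactly as long as the delay. The class name DelayLine and its interface are assumptions for illustration:

```cpp
#include <cstddef>
#include <vector>

// Feedback delay built on a ring buffer that is exactly `length` samples long.
// (Hypothetical class; length must be at least 1.)
class DelayLine {
public:
    explicit DelayLine(std::size_t length)
        : buffer_(length, 0.0f), pos_(0) {}   // buffer starts out silent

    // Return the sample stored `length` calls ago and store the new one.
    float process(float input, float feedback) {
        const float output = buffer_[pos_];           // oldest sample
        buffer_[pos_] = input + output * feedback;    // overwrite it
        pos_ = (pos_ + 1) % buffer_.size();           // advance and wrap
        return output;
    }

private:
    std::vector<float> buffer_;  // zero-initialized, so no garbage is read
    std::size_t pos_;            // persists between calls to process()
};
```

A processing routine would keep one DelayLine as a member and call process once per sample inside its loop, writing the return value to the output buffer.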
I have a function I've written to convert from a 64-bit integer to a base 62 string. Originally, I achieved this like so:
char* charset = " 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
int charsetLength = strlen(charset);
std::string integerToKey(unsigned long long input)
{
unsigned long long num = input;
string key = "";
while(num)
{
key += charset[num % charsetLength];
num /= charsetLength;
}
return key;
}
However, this was too slow.
I improved the speed by providing an option to generate a lookup table. The table is about 62^4 strings in size, and is generated like so:
// Create the integer to key conversion lookup table
int lookupChars;
if(lookupDisabled)
lookupChars = 1;
else
largeLookup ? lookupChars = 4 : lookupChars = 2;
lookupSize = pow(charsetLength, lookupChars);
integerToKeyLookup = new char*[lookupSize];
for(unsigned long i = 0; i < lookupSize; i++)
{
unsigned long num = i;
int j = 0;
integerToKeyLookup[i] = new char[lookupChars];
while(num)
{
integerToKeyLookup[i][j] = charset[num % charsetLength];
num /= charsetLength;
j++;
}
// Null terminate the string
integerToKeyLookup[i][j] = '\0';
}
The actual conversion then looks like this:
std::string integerToKey(unsigned long long input)
{
unsigned long long num = input;
string key = "";
while(num)
{
key += integerToKeyLookup[num % lookupSize];
num /= lookupSize;
}
return key;
}
This improved speed by a large margin, but I still believe it can be improved. Memory usage on a 32-bit system is around 300 MB, and more than 400 MB on a 64-bit system. It seems like I should be able to reduce memory and/or improve speed, but I'm not sure how.
If anyone could help me figure out how this table could be further optimized, I'd greatly appreciate it.
Using some kind of string builder rather than repeated concatenation into 'key' would provide a significant speed boost.
You may want to reserve memory in advance for your string key. This may get you a decent performance gain, as well as a gain in memory utilization. Whenever you call the append operator on std::string, it may double the size of the internal buffer if it has to reallocate. This means each string may be taking up significantly more memory than is necessary to store the characters. You can avoid this by reserving memory for the string in advance.
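A small sketch of the reservation idea, assuming a 62-symbol charset without the leading space. The function name integer_to_key_reserved is made up here. Since 62^11 exceeds 2^64, eleven characters are always enough for a 64-bit input:

```cpp
#include <string>

// Convert input to base 62, reserving the maximum length up front so that
// appending inside the loop never has to reallocate.
// (Hypothetical variant of the question's integerToKey.)
std::string integer_to_key_reserved(unsigned long long input) {
    static const char charset[] =
        "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
    const unsigned base = sizeof(charset) - 1;   // 62, known at compile time

    std::string key;
    key.reserve(11);                  // 62^11 > 2^64, so 11 digits suffice
    do {
        key += charset[input % base];
        input /= base;
    } while (input != 0);
    return key;   // least significant digit first, as in the question's code
}
```

Unlike the question's loop, the do/while form yields "0" for an input of zero instead of an empty string.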
I agree with Rob Walker - you're concentrating on improving performance in the wrong area. The string is the slowest part.
I timed the code (your original is broken, btw) and your original (when fixed) was 44982140 cycles for 100000 lookups and the following code is about 13113670.
const char* charset = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
#define CHARSET_LENGTH 62
// maximum size = 11 chars
void integerToKey(char result[13], unsigned long long input)
{
char* p = result;
while(input > 0)
{
*p++ = charset[input % CHARSET_LENGTH];
input /= CHARSET_LENGTH;
}
// null termination
*p = '\0';
// need to reverse the output
char* o = result;
while(o < p)
swap(*o++, *--p);
}
This is almost a textbook case of how not to do this. Concatenating strings in a loop is a bad idea, both because appending isn't particularly fast, and because you're constantly allocating memory.
Note: your question states that you're converting to base-62, but the code seems to have 63 symbols. Which are you trying to do?
Given a 64-bit integer, you can calculate that you won't need any more than 11 digits in the result, so using a static 12 character buffer will certainly help improve your speed. On the other hand, it's likely that your C++ library has a long-long equivalent to ultoa, which will be pretty optimal.
Edit: Here's something I whipped up. It allows you to specify any desired base as well:
std::string ullToString(unsigned long long v, int base = 64) {
assert(base < 65);
assert(base > 1);
static const char digits[]="0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ+/";
const int max_length=65;
static char buffer[max_length];
buffer[max_length-1]=0;
char *d = buffer + max_length-1;
do {
d--;
int remainder = v % base;
v /= base;
*d = digits[remainder];
} while(v>0);
return d;
}
This only creates one std::string object, and doesn't move memory around unnecessarily. It currently doesn't zero-pad the output, but it's trivial to change it to do that to however many digits of output you want.
You don't need to copy input into num, because it is passed by value. You can also compute the length of charset at compile time; there's no need to compute it at run time.
But these are very minor performance issues. I think the most significant gain comes from avoiding reallocation while concatenating in the loop. Before the loop, reserve the length of your result string (std::string::reserve) so that there is only one allocation for the string. Then, when you append to the string in the loop, it will not reallocate.
You can make things even slightly more efficient if you take the target string as a reference parameter or even as two iterators like the standard algorithms do. But that is arguably a step too far.
By the way, what if the value passed in for input is zero? You won't even enter the loop; shouldn't key then be "0"?
I see the value passed in for input can't be negative, but just so we note: the C remainder operator isn't a modulo operator.
Why not just use a base64 library? Is it really important that 63 equals '11' and not a longer string?
size_t base64_encode(char* outbuffer, size_t maxoutbuflen, const char* inbuffer, size_t inbuflen);
std::string integerToKey(unsigned long long input) {
char buffer[14];
size_t len = base64_encode(buffer, sizeof buffer, (const char*)&input, sizeof input);
return std::string(buffer, len);
}
Yes, every string will end with an equals sign. If you don't want it to, strip off the equals sign. (Just remember to add it back if you need to decode the number.)
Of course, my real question is why you are converting a fixed-width 8-byte value into a variable-length string instead of using it directly as your "key".
Footnote: I'm well aware of the endian issues with this. He didn't say what the key will be used for and so I assume it isn't being used in network communications between machines of disparate endian-ness.
If you could add two more symbols so that it is converting to base-64, your modulus and division operations would turn into a bit mask and shift. Much faster than a division.
If all you need is a short string key, converting to base-64 numbers would speed up things a lot, since div/mod 64 is very cheap (shift/mask).
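A sketch of what the shift/mask version might look like, assuming the 64-symbol digit set from the ullToString answer (the 62 alphanumerics plus '+' and '/'). The name to_base64_digits is an assumption, and the result is a positional base-64 number, not standard Base64 encoding of the bytes:

```cpp
#include <string>

// Write v as a base-64 number using a mask and a shift in place of
// % and /. (Hypothetical helper for illustration.)
std::string to_base64_digits(unsigned long long v) {
    static const char digits[] =
        "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ+/";
    char buffer[12];                  // 64 bits / 6 bits per digit -> 11 digits + NUL
    char* d = buffer + sizeof(buffer) - 1;
    *d = '\0';
    do {
        *--d = digits[v & 63];        // same result as v % 64
        v >>= 6;                      // same result as v / 64
    } while (v != 0);
    return std::string(d);            // most significant digit first
}
```

The digits are written from the end of the buffer toward the front, so no reversal step is needed.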