Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking us to recommend or find a tool, library or favorite off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it.
Closed 9 years ago.
I am looking for a hash table implementation that I can use in CUDA code — something like a Python dictionary. Are there any good ones out there? I will be using strings as my keys.
Alcantara et al. have demonstrated a data-parallel algorithm for building hash tables on the GPU. I believe the implementation was made available as part of CUDPP.
That said, you may want to reconsider your original choice of a hash table. Sorting your data by key and then performing lots of queries en masse should yield much better performance in a massively parallel setting. What problem are you trying to solve?
When I wrote an OpenCL kernel to create a simple hash table for strings, I used the hash algorithm from Java's String.hashCode(), and then just modded that over the number of rows in the table to get a row index.
Hashing function
uint getWordHash(__global char* str, uint len) {
    uint hash = 0, multiplier = 1;
    // Walk the string backwards, scaling the multiplier by 31 each step;
    // equivalent to Java's forward hash = 31 * hash + str[i].
    for(int i = len - 1; i >= 0; i--) {
        hash += str[i] * multiplier;
        uint shifted = multiplier << 5;
        multiplier = shifted - multiplier;  // multiplier *= 31
    }
    return hash;
}
Indexing
uint hash = getWordHash(word, len);
uint row = hash % nRows;
I handled collisions manually of course, and this approach worked well when I knew the number of strings ahead of time.
Closed. This question is opinion-based. It is not currently accepting answers.
Closed 1 year ago.
There have already been some questions on this topic (1, 2, 3). The thing is that there doesn't seem to be a clear-cut answer. Some answers suggest size_t (1, 2), some suggest ptrdiff_t (1, 2). Other options include int, uint32_t, auto, using decltype on a container's .size(), or the member type size_type.
This question may seem opinion-based, but I don't think that's the case. Just because there isn't yet a consensus on which type to use doesn't mean there cannot be an objective answer. The different choices aren't merely aesthetic; they can actually influence the behavior of the code.
For example, using an index variable type with mismatched signedness in the loop condition will cause compiler warnings, like this. Also, using a type that has a range that is too small can cause an overflow, which in the case of signed types is UB. At the same time, in some cases changing the loop counter type can cause "crazy performance deviations".
I also wanted to find out the most popular, though not necessarily the best, way to write a for loop, so I used GitHub* search. Here are the results:
Loop type          Code result count on GitHub (averaged; "manual" loop + range-based)
for (int           15.8m
for (size_t        11.6m
for (auto          7.5m
for (uint32_t      2.3m
std::for_each      501k
for (ptrdiff_t     98.7k
for (decltype      77.5k
There are certainly large differences in occurrence counts between the loop types; however, there doesn't seem to be a single clear leader.
So my question is: what is the best type to use for the index variable in a for loop in C++, or what are the rules or conditions based on which this type should be chosen?
*: The GitHub search tool produces varying results for "code results" (count) each time, so I averaged 26 values. As the search is text-based it includes both results of the form for (int i = 0; i < n; ++i) and for (int i : vec).
Closed. This question is opinion-based. It is not currently accepting answers.
Closed 3 years ago.
I am trying to get familiar with basic Linux system calls, and to practice writing good code, by writing a program that writes random bytes to a file. I've envisioned a couple of different ways of doing this, but I'm curious which methods are more efficient. Please feel free to suggest improvements to my code or completely different methods; I'm just trying to improve.
Currently, I'm generating random uint32_t's using mt19937 from C++11's random header and placing them in a buffer using memcpy before writing to the file. Is this an efficient way to do it?
Would I be better off initializing the buffer to 0 with memset and OR'ing/bit-shifting values in? If so, I forget how (or whether) I can write something like *(buffer + offset) |= (random32 << (BUF_SIZE - sizeof(random32) * currIndex)) and have C++ store all 32 bits rather than a single char.
// Here's the actual buffer manipulation/write code I currently have
std::mt19937 rand(std::chrono::system_clock::now().time_since_epoch().count());
for(ssize_t i = 0; i < size; i += BUF_SIZE)
{
    // Fill the buffer with random uint32_t's (the buffer size is a multiple of 32 right now)
    char buffer[BUF_SIZE];
    for(int j = 0; j < BUF_SIZE; j += sizeof(uint32_t))
    {
        uint32_t num = rand();
        std::fprintf(stderr, "DEBUG: generated random number %x\n", num);
        std::memcpy(buffer + j, &num, sizeof(num));
    }
    // Write the buffer to the file
    ssize_t bytes_left = size - i;
    ssize_t bytes_to_write = BUF_SIZE > bytes_left ? bytes_left : BUF_SIZE;
    std::fprintf(stderr, "DEBUG: writing %zd bytes to file\n", bytes_to_write);
    if(write(fd, buffer, bytes_to_write) != bytes_to_write)
    {
        std::fprintf(stderr, "Failed to write at %zd\n", i);
        exit(EXIT_FAILURE);
    }
}
Do this:
::std::array<char, BUF_SIZE> buffer;
::std::generate(buffer.begin(), buffer.end(), rand);
instead of the loop you're currently using to fill up the buffer. Then adjust the rest of the program to use a buffer that's an array with its own size method and the like, rather than a bare char array like you have now.
That's more the C++ way than what you're doing. Minimize the use of bare pointers and arrays. Use the standard algorithms where possible. If you find yourself writing a for loop, look to see if the algorithms library has already written the loop for you.
And that goes for code that isn't already in the standard library as well. If you find yourself writing a for loop, stop. Instead, figure out how to abstract what that for loop is doing into a reusable function so that you can write the for loop once and use it in a lot of situations.
And if you want to understand how the standard library works, then take it upon yourself to write those functions.
Don't Repeat Yourself is one of the absolutely most important programming principles. Larry Wall famously recast it as 'laziness'. Learn how to never repeat yourself. If the standard library makes you uncomfortable because you don't know what it's doing, try using compiler explorer to see the assembly language. And try writing the standard library functions yourself. Knowing your tools up, down, and sideways is also really important. So this work is worth doing.
But, that's what you should be doing, not repeating yourself and training yourself to do it over and over again. Train yourself to do it the right way. And if you're uncomfortable with not knowing the details, write the details instead of writing it the wrong way so you can see the details.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question appears to be off-topic because it lacks sufficient information to diagnose the problem. Describe your problem in more detail or include a minimal example in the question itself.
Closed 8 years ago.
I'm trying to find the fastest way to compute, for a given character, the sum of the indices at which it occurs in a string.
For example, if the string is:
ABAACA;
then the sum for character A is:
(A=0)+(A=2)+(A=3)+(A=5)=10;
A=10;
I know a way to do that, but it takes too long, so could you please tell me how I can compute this sum faster?
The fastest way I see to do this in C++ (since there's nothing else describing your problem) involves parallel processing:
Parallel scan and in-place indexing
Reduction
Although it might not be what you were looking for.
Fastest non-parallel solution:
- go over every char
- increase the sum on a match (if (ch == 'A') count += i;)
There is just no faster way, because you MUST visit each character.
Anyway, if you have a working solution, it's probably the fastest already.
If no parallel tools are in place, the fastest solution is to just visit each char in a loop and accumulate the sum.
int count_match( const char* str, int length, char digit )
{
    int output = 0;
    for ( int index = 0; index < length; ++index )
        // Branchless: (str[index] == digit) is 0 or 1, so its negation
        // is all-zero or all-one bits, masking `index` in or out.
        output += index & -(str[index] == digit);
    return output;
}
If the length of the string could be known at compile time, then you could conceivably template that value out and let the compiler vectorize the loop for you.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question appears to be off-topic because it lacks sufficient information to diagnose the problem. Describe your problem in more detail or include a minimal example in the question itself.
Closed 8 years ago.
I'm a newbie in C++ and I'm writing a program that asks the user to input two integers, then raises the first integer to the power specified by the second. For example, if the user enters 5 and 8, the number five will be raised to the eighth power. The program must not use any pre-defined C++ functions (like the pow function) for this task, and it should allow the user to perform another calculation if they so desire. Can anyone help?
I'm not going to give you any code, because that won't allow you to truly explore this concept. Rather, you should use this pseudo code to implement something on your own.
Create a function which accepts two inputs, the base and the exponent.
Now there are several ways to go about doing this. You can use efficient bit shifting, but let's start simple, shall we?
answer = 1
i = 1
while i is less than or equal to exponent
    answer = answer * base
    i = i + 1
return answer
Simply loop, multiplying the accumulator by the base once per iteration.
There are other ways that focus on efficiency. Look here to see something that you may want to attempt: are 2^n exponent calculations really less efficient than bit-shifts?
The program must not use any pre-defined C++ functions (like pow function) for this task
You can use a piece of C++ code like the following to compute x^y without using any predefined function:
int x = 5;
int y = 3;
int result = 1;
for(int i = 0; i < y; ++i)
{
    result *= x;
}
cout << result << endl;
Output:
125
See a working sample here.
Closed. This question is off-topic. It is not currently accepting answers.
Closed 9 years ago.
I'm working on an assignment in my Computer Architecture class where we have to implement a branch prediction algorithm in C++ (for the Alpha 21264 microprocessor architecture).
There is a solution provided as an example. This solution is an implementation of a Global Share Predictor.
I am simply trying to understand the given solution, specifically what is going on in:
*predict (branch_info &b) {...}
specifically,
if (b.br_flags & BR_CONDITIONAL) {...}
Can anyone provide me with an explanation? Thank you.
I think the following paper by Scott McFarling provides the detailed answer:
http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-TN-36.pdf
Let me use your code to explain.
unsigned char tab[1<<TABLE_BITS];
is the Pattern History Table. Each entry in the tab keeps a 2-bit saturating counter. The direction of the conditional branch is finally determined by the MSB of the counter:
u.direction_prediction (tab[u.index] >> 1);
The reason we use a counter of two or more bits instead of just one is to make the predictor less sensitive to one-off deviations, reducing mispredictions. For example,
for( int i = 0; i < m; i++ )
{
    for( int j = 0; j < n; j++ )
    {
        ...
    }
}
when the inner loop is executed the next time, a one-bit counter will mispredict the branch (having just seen the loop exit), while a two-bit counter can prevent that.
The next is how to find the correct pattern in the Pattern History Table.
The naive way is to use branch address as index. But it ignores the correlation between different branches. That is why Global Branch History is introduced (For more details, please refer to http://www.eecg.utoronto.ca/~moshovos/ACA06/readings/two-level-bpred.pdf).
In your code,
unsigned int history;
is the Branch History Register which stores the Global Branch History.
It was later found that combining the Global Branch History and the Branch Address as the index leads to more accurate prediction than using either alone, because both affect the branch pattern.
Ignoring one of them means different branch patterns may be hashed to the same position in the Pattern History Table, causing collisions.
Before Gshare is proposed, there is a solution called Gselect, which uses concatenation of Global Branch History and Branch Address as index of Pattern History Table.
The solution provided by Gshare is the hash function of
index = branch_addr XOR branch_history
This is exactly what the following code means:
u.index = (history << (TABLE_BITS - HISTORY_LENGTH)) ^ (b.address & ((1<<TABLE_BITS)-1));
Scott McFarling's paper provides a good example to show how Gshare works better than Gselect:
Branch Address=1111_1111 Global Branch History=0000_0000
Branch Address=1111_1111 Global Branch History=1000_0000
Assume that we use the following Gselect strategy to prevent bias:
index = { {branch_addr[7:4]}, {branch_history[3:0]} }
Then Gselect will produce 1111_0000 for both cases while Gshare can distinguish the different patterns.
As far as I know, Gshare has turned out to be one of the best simple schemes so far for reducing such collisions.