OpenMP: how to optimize a thread-safe hash-table - c++

I'm trying to make a hash table (unordered_map) thread-safe for a performance-critical application. I'm using OpenMP to handle multithreading. The first solution I tried was to make all access to the hash table critical, but this lowered the performance of the threads so much that the parallel version wasn't faster anymore. I then came up with the following idea: make 100 separate hash tables, initialized empty, and store them in an array. Use some hash function to map the key to the range [0, 1, ..., 99], and store the value in the corresponding hash table. Here is pseudocode to demonstrate the idea:
auto hash_vec = vector<unordered_map<int,int> >(100);

long long prime = 534161472029LL; // some big prime
long long rand_int = rand();      // random int used for hashing

long long simple_hash(int v) {
    return (rand_int * v) % prime;
}

void safe_write(int key, int val) {
    int h = simple_hash(key) % 100;
    hash_vec[h][key] = val; // must be made safe
}

int safe_read(int key) {
    int h = simple_hash(key) % 100;
    int val = hash_vec[h][key]; // must be made safe
    return val;
}
So the idea is that if the values of h differ, there won't be any race condition. For example, if simple_hash(k1) is not equal to simple_hash(k2), simultaneous calls to safe_write(k1,v1) and safe_write(k2,v2) won't cause either thread to wait. Is such a thing possible? If not, is there a better way to ensure that performance is not compromised too much?
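As a rough illustration of the striping idea described above (not from the original thread; the class name and the use of std::hash are placeholders), each sub-table can be paired with its own omp_lock_t so that two threads only wait on each other when their keys land in the same stripe:

#include <functional>
#include <omp.h>
#include <unordered_map>
#include <vector>

// One lock per sub-table ("stripe"): writes to different stripes never contend.
struct StripedMap {
    static constexpr int NUM_STRIPES = 100;
    std::vector<std::unordered_map<int, int>> tables;
    std::vector<omp_lock_t> locks;

    StripedMap() : tables(NUM_STRIPES), locks(NUM_STRIPES) {
        for (auto& l : locks) omp_init_lock(&l);
    }
    ~StripedMap() {
        for (auto& l : locks) omp_destroy_lock(&l);
    }

    void safe_write(int key, int val) {
        int h = stripe(key);
        omp_set_lock(&locks[h]);      // blocks only threads using the same stripe
        tables[h][key] = val;
        omp_unset_lock(&locks[h]);
    }

    int safe_read(int key) {
        int h = stripe(key);
        omp_set_lock(&locks[h]);      // reads need the lock too: a concurrent
        int val = tables[h][key];     // insert may rehash this sub-table
        omp_unset_lock(&locks[h]);
        return val;
    }

private:
    static int stripe(int key) {
        // any cheap hash works; std::hash keeps the sketch simple
        return static_cast<int>(std::hash<int>{}(key) % NUM_STRIPES);
    }
};

Note that std::hash<int> is typically the identity function, so in practice something closer to the multiplicative hash from the question may spread nearby keys across stripes more evenly.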

Related

Parallelization of bin packing problem with OpenMP

I am learning OpenMP, and I want to parallelize the well-known bin packing problem. But the problem is that whatever I try, I can't get a correct solution (the one I get with the sequential version).
So far, I have tried multiple different versions (including reduction, tasks, schedule) but didn't get anything useful.
Below is my most recent try.
int binPackingParallel(std::vector<int> weight, int n, int c)
{
    int result = 0;
    int bin_rem[n];
    #pragma omp parallel for schedule(dynamic) reduction(+:result)
    for (int i = 0; i < n; i++) {
        bool done = false;
        int j;
        for (j = 0; j < result && !done; j++) {
            int b;
            #pragma omp atomic
            b = bin_rem[j] - weight[i];
            if (b >= 0) {
                bin_rem[j] = bin_rem[j] - weight[i];
                done = true;
            }
        }
        if (!done) {
            #pragma omp critical
            bin_rem[result] = c - weight[i];
            result++;
        }
    }
    return result;
}
Edit: I modified the starting problem, so now a number of bins N is given and we need to check whether all elements can be put into N bins. I did this using recursion, but my parallel version is still slower.
bool can_fit_parallel(std::vector<int> arr, std::vector<int> bins, int n) {
    // base case: if the array is empty, we can fit the elements
    if (arr.empty()) {
        return true;
    }
    bool found = false;
    #pragma omp parallel for schedule(dynamic, 10)
    for (int i = 0; i < n; i++) {
        if (bins[i] >= arr[0]) {
            bins[i] -= arr[0];
            if (can_fit_parallel(std::vector<int>(arr.begin() + 1, arr.end()), bins, n)) {
                found = true;
                #pragma omp cancel for
            }
            // if the element doesn't fit or if the recursion fails,
            // restore the bin's capacity and try the next bin
            bins[i] += arr[0];
        }
    }
    // if the element doesn't fit in any of the bins, return false
    return found;
}
Any help would be great
You do not need parallelization to make your code significantly faster. You have implemented the First-Fit method (its complexity is O(n²)), but it can be significantly faster if you use binary search trees (O(n log n)). To do so, you just have to use the standard library (std::multiset). In this example I have implemented the Best-Fit algorithm:
int binPackingSTL(const std::vector<int>& weight, const int n, const int c)
{
    std::multiset<int> bins;                 // multiset to store bins
    for (const auto x : weight) {
        const auto it = bins.lower_bound(x); // find the best bin to accommodate x
        if (it == bins.end()) {
            bins.insert(c - x);              // if no suitable bin found, insert a new one
        } else {
            // suitable bin found - replace it with a smaller value
            auto value = *it;                // store its value
            bins.erase(it);                  // erase the old value
            bins.insert(value - x);          // insert the new value
        }
    }
    return bins.size();                      // number of bins
}
In my measurements, it is about 100x faster than your code in the case of n = 50000.
EDIT: Both algorithms mentioned above (First-Fit and Best-Fit) are approximations to the bin packing problem. To answer your revised question, you have to use an algorithm that finds the optimal solution. So, you need to find an algorithm for the exact solution, not an approximation. Instead of trying to reinvent the wheel, you can consider using already available libraries such as BPPLIB – A Bin Packing Problem Library.
This is not a reduction: that would give each thread its own partial result, and you want result to be global. I think that putting a critical section around the two statements might work. The atomic statement is meaningless since it is not on a shared variable.
But there is a deeper problem: each i iteration can write a result, which affects how far the search of the other iterations goes. That means that the outer iteration has to be sequential. (You really need to think hard about whether iterations are independent before you slap a parallel directive on them!) Maybe you can make the inner iteration parallel: it's a search, which would be a reduction on j. However, that loop would have to be pretty dang long before you'd see a performance improvement.
This looks to me like the sort of algorithm that you'd have to reformulate before you can make it parallel.
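For illustration only, here is a hedged sketch of what that inner-loop parallelization could look like: the outer loop over items stays sequential, and the search for the first bin that fits becomes a min-index reduction over j (reduction(min:...) needs OpenMP 3.1 or later). The function name is made up and the sketch is untested against the original code.

#include <algorithm>
#include <climits>
#include <vector>

// First Fit with only the inner search parallelized as a min-index reduction.
// The outer loop over items must stay sequential, because each item may open
// a new bin, which changes what later items see.
int binPackingInnerParallel(const std::vector<int>& weight, int c)
{
    std::vector<int> bin_rem;                    // remaining capacity of each open bin
    for (int w : weight) {
        int first = INT_MAX;                     // index of the first bin that fits w
        #pragma omp parallel for reduction(min:first)
        for (int j = 0; j < (int)bin_rem.size(); j++) {
            if (bin_rem[j] >= w)
                first = std::min(first, j);      // each thread keeps its own minimum
        }
        if (first == INT_MAX)
            bin_rem.push_back(c - w);            // no open bin fits: open a new one
        else
            bin_rem[first] -= w;                 // place the item in the first fitting bin
    }
    return (int)bin_rem.size();
}

As the answer says, the list of open bins would have to be very long before this beats the sequential search; the std::multiset version above is the more practical improvement.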

Manipulating array's values in a certain way

So I was asked to write a function that changes an array's values in such a way that:
All of the values that are the smallest aren't changed.
If, let's assume, the smallest number is 2 and there are no 3's and 4's, then all 5's are changed to 3's, etc.
For example, for the array [2, 5, 7, 5] we would get [2, 3, 4, 3]. In general, the minimal value of the array remains unchanged, and every other distinct value is replaced according to its rank. In our example, 5 is the first smallest value besides 2, so it becomes 2 (the minimum) + 1 = 3; 7 is the second smallest after 2, so it becomes 2 + 2 = 4.
I've come up with something like this:
int fillGaps(int arr[], size_t sz) {
    int min = *min_element(arr, arr + sz);
    int w = 1;
    for (int i = 0; i < sz; i++) {
        if (arr[i] == min) { continue; }
        else {
            int mini = *min_element(arr + i, arr + sz);
            for (int j = 0; j < sz; j++) {
                if (arr[j] == mini) { arr[j] = min + w; }
            }
            w++;
        }
    }
    return arr[sz - 1];
}
However, it works fine only for the 0th and 1st values; it doesn't affect any further items. Could anyone please help me with that?
I don't quite follow the logic of your function, so can't quite comment on that.
Here's how I interpret what needs to be done. Note that my example implementation is written to be as understandable as possible. There might be ways to make it faster.
Note that I'm also using an std::vector, to make things more readable and C++-like. You really shouldn't be passing raw pointers and sizes, that's super error prone. At the very least bundle them in a struct.
#include <algorithm>
#include <set>
#include <unordered_map>
#include <vector>

int fillGaps(std::vector<int>& data) {
    // Make sure we don't have to worry about edge cases in the code below.
    if (data.empty()) { return 0; }

    /* The minimum number of times we need to loop over the data is two.
     * First to check which values are in there, which lets us decide
     * what each original value should be replaced with. Second to do the
     * actual replacing.
     *
     * So let's trade some memory for speed and start by creating a lookup table.
     * Each entry will map an existing value to its new value. Let's use the
     * "define lambda and immediately invoke it" to make the scope of variables
     * used to calculate all this as small as possible.
     */
    auto const valueMapping = [&data] {
        // Use an std::set so we get all unique values in sorted order.
        std::set<int> values;
        for (int e : data) { values.insert(e); }

        std::unordered_map<int, int> result;
        result.reserve(values.size());

        // Map minimum value to itself, and increase replacement value by one for
        // each subsequent value present in the data vector.
        int replacement = *values.begin();
        for (auto e : values) { result.emplace(e, replacement++); }
        return result;
    }();

    // Now the actual algorithm is trivial: loop over the data and replace each
    // element with its replacement value.
    for (auto& e : data) { e = valueMapping.at(e); }

    return data.back();
}
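For reference, here is a small (hypothetical) driver for the fillGaps above, using the example from the question:

#include <iostream>
#include <vector>

int main() {
    std::vector<int> data{2, 5, 7, 5};
    fillGaps(data);                        // data becomes {2, 3, 4, 3}
    for (int e : data) std::cout << e << ' ';
    std::cout << '\n';                     // prints: 2 3 4 3
}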

Minimizing variable copies in recursive function

I'm looking for efficient memory allocation when dealing with a recursive function. As far as I understand, the variables I use in the function will remain allocated in memory until the recursion is finished. Is there a way to avoid this? I believe it is what makes my code below slow, since the state variable is copied every time the function is called (correct me if I'm wrong, as I'm new to C++).
#include <fstream>
#include <iostream>
#include <vector>
using namespace std;

int N = 30;
double MIN_COST = 1000000;
vector<int> MIN_CUT = {};

void minCut(vector<int> state, int index, int nodeValue) {
    double currentCost;
    if (index >= 0) {
        currentCost = getCurrentCost(state); // some magic evaluating state cost
        state.push_back(nodeValue);
        if (currentCost >= MIN_COST) { // kill branch if incomplete solution is already worse than best achieved solution
            return;
        }
    }
    if (index == N - 1) { // check if leaf node
        if (currentCost < MIN_COST) {
            MIN_COST = currentCost;
            MIN_CUT = state;
        }
        return;
    }
    minCut(state, index + 1, 1); // left subtree - adding 1 to vector
    minCut(state, index + 1, 0); // right subtree - adding 0 to vector
    return;
}

int main() {
    vector<int> state = {};
    minCut(state, -1, NULL);
    cout << MIN_COST << "\n";
    return 0;
}
Your algorithm is effectively building a tree of paths, but you're using a vector to hold the nodes for each path.
       A
      / \
     /   \
    B     C
   / \   / \
  D   E F   G
This is the tree you're traversing.
But you're creating new vectors at every node, which contain the whole path up to that node. So as you're visiting node G, in your stack you have 3 vectors:
vector { A, C, G }
vector { A, C }
vector { A }
It should be clear how this is less efficient as you have noticed, but maybe seeing it this way hints at the correct efficient implementation.
The call stack itself holds the path to the root node. The stack when visiting G would be something like
minCut < visiting G >
minCut < visiting C >
minCut < visiting A >
In order to efficiently exploit this fact, make minCut pass the minimum amount of information. In this case we're talking about something linked-list like.
You then have two options that jump out:
1. Use a vector, but pass it by reference; you must then maintain it across calls, pushing and popping nodes to keep it synchronized with the actual state.
2. Use an actual linked list. It should be easy to construct the vector by traversing pointers-to-parent-nodes.
Yes, there is a more efficient way to pass state through each function call. This is called passing by reference and can be achieved like so:
void minCut(vector<int>& state, int index, int nodeValue) { ...
This will result in the original state being referenced instead of copied each time the function is called.
For this to work correctly in the code you posted you will have to make some modifications, this is just the general concept.
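Putting both answers together, here is a minimal sketch of the by-reference version with explicit backtracking. It assumes the same globals (N, MIN_COST, MIN_CUT) and the getCurrentCost helper from the question, and it keeps the question's order of evaluating the cost before pushing the new node:

#include <vector>

// Assumed from the question:
int N = 30;
double MIN_COST = 1000000;
std::vector<int> MIN_CUT;
double getCurrentCost(const std::vector<int>& state); // "some magic", defined elsewhere

void minCut(std::vector<int>& state, int index, int nodeValue) {
    double currentCost = 0;
    if (index >= 0) {
        currentCost = getCurrentCost(state); // same "magic" cost evaluation
        state.push_back(nodeValue);
        if (currentCost >= MIN_COST) {       // prune this branch
            state.pop_back();                // undo the push before returning
            return;
        }
    }
    if (index == N - 1) {                    // leaf node
        if (currentCost < MIN_COST) {
            MIN_COST = currentCost;
            MIN_CUT = state;                 // the only copy, made when a new best is found
        }
    } else {
        minCut(state, index + 1, 1);         // left subtree - adding 1
        minCut(state, index + 1, 0);         // right subtree - adding 0
    }
    if (index >= 0) state.pop_back();        // undo this level's push on the way out
}

Compared with the original, the only remaining copy is MIN_CUT = state when a new best solution is found; everything else works on the single shared vector.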

Vector performance suffering

I've been working on state space exploration and was originally using a map to store the assignment of the world states, like map<Variable *, int>, where variables are objects in the world with a domain from 0 to n, where n is finite. The implementation was extremely quick, but I noticed that it does not scale well with the size of the state space. I changed the states to use vector<int> instead, where I use the id of a variable to find its index in the vector. Memory usage improved greatly, but the efficiency of the solver has tanked (gone from <30 seconds to 400+). The only code that I modified was generating the states and validating whether a state is the goal. I can't figure out why using a vector has degraded performance, especially since the vector operations should only take linear time at worst.
Originally this was how I generated nodes:
State * SuccessorGen::generate_successor(const Operator &op, map<Variable *, int> &var_assignment){
    map<Variable *, int> values;
    values.insert(var_assignment.begin(), var_assignment.end());
    vector<Operator::Effect> effect = op.get_effect();
    vector<Operator::Effect>::const_iterator eff_it = effect.begin();
    for (; eff_it != effect.end(); eff_it++){
        values[eff_it->var] = eff_it->after;
    }
    return new State(values);
}
And in my new implementation:
State* SuccessorGen::generate_successor(const Operator &op, const vector<int> &assignment){
    vector<int> child;
    child = assignment;
    vector<Operator::Effect> effect = op.get_effect();
    vector<Operator::Effect>::const_iterator eff_it = effect.begin();
    for (; eff_it != effect.end(); eff_it++){
        Variable *v = eff_it->var;
        int id = v->get_id();
        child[id] = eff_it->after;
    }
    return new State(child);
}
(The goal checking is similar, just looping over the goal assignment instead of operator effects.)
Are these vector operations really that much slower than using a map? Is there an equally efficient STL container I can use that has a lower overhead? The number of variables is relatively small (<50) and the vector never needs to be resized or modified after the for loop.
Edit:
I tried timing one loop through all the operators to see timing comparisons, with the effect list and assignment the vector version runs one loop in 0.3 seconds, while the map version is a little over 0.4 seconds. When I comment that section out the map was about the same, yet the vector jumped up to closer to 0.5 seconds. I added child.reserve(assignment.size()) but that did not make any change.
Edit 2:
From user63710's answer, I've also been digging through the rest of the code and noticed something really strange going on in the heuristic calculation. The vector version works fine, but for the map version I use this line: Node *n = new Node(i, transition.value, label_cost); open_list.push(n);, and once the loop finishes filling the queue the nodes get totally screwed up. Nodes are a simple struct:
struct Node{
    // Source Value, Destination Value
    int from;
    int to;
    int distance;
    Node(int &f, int &t, int &d) : from(f), to(t), distance(d){}
};
Instead of having from, to, distance, it replaces from and to with id with some random number, and that search does not do what it should and returns much faster than it should. When I tweak the map version to convert the map to a vector and run this:
Node n(i, transition.value, label_cost); open_list.push(n);
the performance is about equal to that of the vector. So that fixes my main issue, but this leaves me wondering why using Node *n gets this behaviour as opposed to Node n?
If as you say, the sizes of these structures are fairly small (~50 elements), I have to think that the issue is somewhere else. At least, I don't think it involves the memory accesses or allocation of the vector/map.
Some example code I made to test: Map version:
#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <map>
#include <memory>
#include <vector>
using namespace std;

unique_ptr<map<int, int>> make_successor_map(const vector<int> &ids,
                                             const map<int, int> &input)
{
    auto new_map = make_unique<map<int, int>>(input.begin(), input.end());
    for (size_t i = 0; i < ids.size(); ++i)
        swap((*new_map)[ids[i]], (*new_map)[i]);
    return new_map;
}

int main()
{
    auto a_map = make_unique<map<int, int>>();
    // ids to access
    vector<int> ids;
    const int n = 100;
    for (int i = 0; i < n; ++i)
    {
        a_map->insert({i, rand()});
        ids.push_back(i);
    }
    random_shuffle(ids.begin(), ids.end());
    for (int i = 0; i < 1e6; ++i)
    {
        auto temp_map = make_successor_map(ids, *a_map);
        swap(temp_map, a_map);
    }
    cout << a_map->begin()->second << endl;
}
Vector version:
#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <memory>
#include <vector>
using namespace std;

unique_ptr<vector<int>> make_successor_vec(const vector<int> &ids,
                                           const vector<int> &input)
{
    auto new_vec = make_unique<vector<int>>(input);
    for (size_t i = 0; i < ids.size(); ++i)
        swap((*new_vec)[ids[i]], (*new_vec)[i]);
    return new_vec;
}

int main()
{
    auto a_vec = make_unique<vector<int>>();
    // ids to access
    vector<int> ids;
    const int n = 100;
    for (int i = 0; i < n; ++i)
    {
        a_vec->push_back(rand());
        ids.push_back(i);
    }
    random_shuffle(ids.begin(), ids.end());
    for (int i = 0; i < 1e6; ++i)
    {
        auto temp_vec = make_successor_vec(ids, *a_vec);
        swap(temp_vec, a_vec);
    }
    cout << *a_vec->begin() << endl;
}
The map version takes around 15 seconds to run on my old Core 2 Duo T9600, and the vector version takes 0.406 seconds. Both were compiled with g++ 4.9.2 using g++ -O3 --std=c++1y. So if your code takes 0.4s per iteration (note that it took my example code 0.4s for 1 million calls), then I'm really thinking your problem is somewhere else.
That's not to say you aren't having performance decreases due to switching from map->vector, but that the code you posted doesn't show much reason for that to happen.
The problem is that you create vectors without reserving space. Vectors store elements contiguously, which is what guarantees constant-time access to elements.
So every time you add an item to the vector (for example via your inserter), the vector may have to reallocate more space and eventually move all the existing elements to the newly allocated memory. This causes slowdown and considerable heap fragmentation.
The solution to this is to reserve() elements if you know in advance how many elements you'll have. Or, if you don't, reserve() larger chunks and compare size() and capacity() to check if it's time to reserve more.
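As a small illustration of that advice (the function name and usage here are made up, not taken from the question's code):

#include <vector>

std::vector<int> build_child(const std::vector<int>& assignment) {
    std::vector<int> child;
    child.reserve(assignment.size());   // one allocation up front
    for (int v : assignment)
        child.push_back(v);             // no reallocations inside the loop
    return child;
}

Note that child = assignment, as written in the question, already allocates in one go, which may be why the extra reserve() call made no measurable difference there.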

Finding the minimum in an array (but skipping some elements) using reduction in CUDA

I have a large array of floating point numbers and I want to find out the minimum value of the array (ignoring -1s wherever present) as well as its index, using reduction in CUDA. I have written the following code to do this, which in my opinion should work:
__global__ void get_min_cost(float *d_Cost, int n, int *last_block_number, int *number_in_last_block, int *d_index){
    int tid = threadIdx.x;
    int myid = blockDim.x * blockIdx.x + threadIdx.x;
    int s;
    if(blockIdx.x == (*last_block_number)-1){
        s = (*number_in_last_block)/2;
    }else{
        s = 1024/2;
    }
    for(; s>0; s/=2){
        if(myid+s >= n)
            continue;
        if(tid < s){
            if(d_Cost[myid+s] == -1){
                continue;
            }else if(d_Cost[myid] == -1 && d_Cost[myid+s] != -1){
                d_Cost[myid] = d_Cost[myid+s];
                d_index[myid] = d_index[myid+s];
            }else{
                // both not -1
                if(d_Cost[myid] <= d_Cost[myid+s])
                    continue;
                else{
                    d_Cost[myid] = d_Cost[myid+s];
                    d_index[myid] = d_index[myid+s];
                }
            }
        }
        else
            continue;
        __syncthreads();
    }
    if(tid == 0){
        d_Cost[blockIdx.x] = d_Cost[myid];
        d_index[blockIdx.x] = d_index[myid];
    }
    return;
}
The last_block_number argument is the id of the last block, and number_in_last_block is the number of elements in last block (which is a power of 2). Thus, all blocks will launch 1024 threads every time and the last block will only use number_in_last_block threads, while others will use 1024 threads.
After this function runs, I expect the minimum values for each block to be in d_Cost[blockIdx.x] and their indices in d_index[blockIdx.x].
I call this function multiple times, each time updating the number of threads and blocks. The second time I call this function, the number of threads becomes equal to the number of blocks remaining, etc.
However, the above function isn't giving me the desired output. In fact, it gives a different output every time I run the program, i.e., it returns an incorrect value as the minimum during some intermediate iteration (though that incorrect value is quite close to the minimum every time).
What am I doing wrong here?
As I mentioned in my comment above, I would recommend to avoid writing reductions of your own and use CUDA Thrust whenever possible. This holds true even in the case when you need to customize those operations, the customization being possible by properly overloading, e.g., relational operations.
Below I'm providing simple code to evaluate the minimum in an array along with its index. It is based on a classical example contained in the An Introduction to Thrust presentation. The only addition is skipping, as you requested, the -1's from the counting. This can be reasonably done by replacing all the -1's in the array with INT_MAX, i.e., the largest representable int.
#include <climits>
#include <cstdio>

#include <thrust/device_vector.h>
#include <thrust/replace.h>
#include <thrust/sequence.h>
#include <thrust/reduce.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>

// --- Struct returning the smaller of two tuples
struct smaller_tuple
{
    __host__ __device__ thrust::tuple<int,int> operator()(thrust::tuple<int,int> a, thrust::tuple<int,int> b)
    {
        if (a < b)
            return a;
        else
            return b;
    }
};
int main() {
    const int N = 20;
    const int large_value = INT_MAX;

    // --- Setting the data vector
    thrust::device_vector<int> d_vec(N, 10);
    d_vec[3] = -1; d_vec[5] = -2;

    // --- Copying the data vector to a new vector where the -1's are changed to INT_MAX
    thrust::device_vector<int> d_vec_temp(d_vec);
    thrust::replace(d_vec_temp.begin(), d_vec_temp.end(), -1, large_value);

    // --- Creating the index sequence [0, 1, 2, ...)
    thrust::device_vector<int> indices(d_vec_temp.size());
    thrust::sequence(indices.begin(), indices.end());

    // --- Setting the initial value of the search
    thrust::tuple<int,int> init(d_vec_temp[0], 0);

    thrust::tuple<int,int> smallest;
    smallest = thrust::reduce(thrust::make_zip_iterator(thrust::make_tuple(d_vec_temp.begin(), indices.begin())),
                              thrust::make_zip_iterator(thrust::make_tuple(d_vec_temp.end(), indices.end())),
                              init, smaller_tuple());

    printf("Smallest %i %i\n", thrust::get<0>(smallest), thrust::get<1>(smallest));

    getchar();
    return 0;
}