Parallelization of the bin packing problem with OpenMP - C++

I am learning OpenMP, and I want to parallelize the well-known bin packing problem. But whatever I try, I can't get the correct solution (the one I get with the sequential version).
So far, I have tried multiple different versions (including reduction, tasks, schedule) but didn't get anything useful.
Below is my most recent try.
int binPackingParallel(std::vector<int> weight, int n, int c)
{
    int result = 0;
    int bin_rem[n];
#pragma omp parallel for schedule(dynamic) reduction(+:result)
    for (int i = 0; i < n; i++) {
        bool done = false;
        int j;
        for (j = 0; j < result && !done; j++) {
            int b;
#pragma omp atomic
            b = bin_rem[j] - weight[i];
            if (b >= 0) {
                bin_rem[j] = bin_rem[j] - weight[i];
                done = true;
            }
        }
        if (!done) {
#pragma omp critical
            bin_rem[result] = c - weight[i];
            result++;
        }
    }
    return result;
}
Edit: I modified the original problem, so now a number of bins N is given and we need to check whether all elements can be put into N bins. I did this using recursion, but my parallel version is still slower.
bool can_fit_parallel(std::vector<int> arr, std::vector<int> bins, int n) {
    // base case: if the array is empty, we can fit the elements
    if (arr.empty()) {
        return true;
    }
    bool found = false;
#pragma omp parallel for schedule(dynamic, 10)
    for (int i = 0; i < n; i++) {
        if (bins[i] >= arr[0]) {
            bins[i] -= arr[0];
            if (can_fit_parallel(std::vector<int>(arr.begin() + 1, arr.end()), bins, n)) {
                found = true;
#pragma omp cancel for
            }
            // if the element doesn't fit or if the recursion fails,
            // restore the bin's capacity and try the next bin
            bins[i] += arr[0];
        }
    }
    // if the element doesn't fit in any of the bins, return false
    return found;
}
Any help would be great

You do not need parallelization to make your code significantly faster. You have implemented the First-Fit method (its complexity is O(n²)), but it can be made significantly faster by using a binary search tree (O(n log n)). To do so, you just have to use the standard library (std::multiset). In this example I have implemented the Best-Fit algorithm:
int binPackingSTL(const std::vector<int>& weight, const int n, const int c)
{
    std::multiset<int> bins;                 // multiset to store bins
    for (const auto x : weight) {
        const auto it = bins.lower_bound(x); // find the best bin to accommodate x
        if (it == bins.end()) {
            bins.insert(c - x);              // if no suitable bin found, insert a new one
        } else {
            // suitable bin found - replace it with a smaller value
            auto value = *it;                // store its value
            bins.erase(it);                  // erase the old value
            bins.insert(value - x);          // insert the new value
        }
    }
    return bins.size();                      // number of bins
}
In my measurements, it is about 100x faster than your code for n = 50000.
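For reference, here is a small hypothetical usage (the weights and the capacity are made up for illustration):
#include <iostream>
#include <set>
#include <vector>

int main() {
    std::vector<int> w = {4, 8, 1, 4, 2, 1};        // made-up weights
    int bins = binPackingSTL(w, (int)w.size(), 10); // capacity c = 10
    std::cout << bins << "\n";                      // prints 2: bins {4,4,2} and {8,1,1}
}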
EDIT: Both algorithms mentioned above (First-Fit and Best-Fit) are approximations to the bin packing problem. To answer your revised question, you need an algorithm that finds the exact, optimal solution, not an approximation. Instead of trying to reinvent the wheel, you can consider using already available libraries such as BPPLIB – A Bin Packing Problem Library.

This is not a reduction: that would cause each thread to have its own partial result, and you want result to be global. I think that putting a critical section around the two statements might work. The atomic statement is meaningless since it is not on a shared variable.
But there is a deeper problem: each i iteration can write a result, which affects how far the search of the other iterations goes. That means that the outer iteration has to be sequential. (You really need to think hard about whether iterations are independent before you slap a parallel directive on them!) Maybe you can make the inner iteration parallel: it's a search, which would be a reduction on j. However, that loop would have to be pretty long before you'd see a performance improvement.
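For illustration, parallelizing only that inner search might look roughly like the sketch below (a hypothetical helper, not the OP's code; the outer loop over items stays sequential):
#include <climits>
#include <vector>

// Returns the index of the first (lowest-index) bin that can hold w, or -1 if none fits.
static int findFirstFit(const std::vector<int>& bin_rem, int bins_used, int w)
{
    int best = INT_MAX;
    #pragma omp parallel for reduction(min : best)
    for (int j = 0; j < bins_used; j++) {
        if (bin_rem[j] >= w && j < best)
            best = j;                      // each thread keeps its smallest fitting index
    }
    return best == INT_MAX ? -1 : best;    // min reduction picks the global first fit
}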
This looks to me like the sort of algorithm that you'd have to reformulate before you can make it parallel.

Related

Optimize C++ function to generate combinations

I'm trying to write a function to generate all possible combinations of numbers, but my problem is that the running time is too long, so I think I have to optimize it.
Problem: Generate all sets of size r from the elements 1 to n, without repeating a set in reverse order (1,2 is considered equal to 2,1).
Example:
n = 3 //elements: 1,2,3
r = 2 //size of set
Output:
2 3
1 3
1 2
The code I'm using is the following:
void func(int n, int r){
    vector<vector<int>> reas;
    vector<bool> v(n);
    fill(v.end() - r, v.end(), true);
    int a = 0;
    do {
        reas.emplace_back();
        for (int i = 0; i < n; ++i) {
            if (v[i]) {
                reas[a].push_back(i+1);
            }
        }
        a++;
    } while (next_permutation(v.begin(), v.end()));
}
If n = 3 and r = 2, the output will be the same as in the example above.
My problem is that if I set n = 50 and r = 5, the running time is too high, and I need to work with a range of n = 50...100 and r = 1...5.
Is there a way to optimize this function?
Thanks a lot
Yes, there are several things you can improve significantly. However, you should keep in mind that the number of combinations you are calculating is so large that it has to be slow if it is to enumerate all subsets. On my machine, and with my personal patience budget, (100,5) is out of reach.
Given that, here are the things you can improve without completely rewriting your entire algorithm.
First: Cache locality
A vector<vector<T>> will not be contiguous. The nested vector is rather small, so even with preallocation this will always be bad, and iterating over it will be slow because each new sub-vector (and there are a lot) will likely cause a cache miss.
Hence, use a single vector<T>. Your kth subset will then not sit at location k but at k*r. This alone is a significant speedup on my machine.
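In case it helps, indexing into the flat layout looks like this (a fragment, assuming every subset has exactly r elements stored back to back):
// element j of the k-th subset, with 0-based k and j
std::size_t element = reas[k * r + j];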
Second: Use a cpu-friendly permutation vector
Your idea to use next_permutation is not bad. But the fact that you use a vector<bool> makes this extremely slow (vector<bool> is a packed, bit-level specialization, so every access involves bit manipulation). Paradoxically, using a vector<size_t> is much faster, because it is easier to load a size_t and check it than it is to do the same with a bool.
So, if you take these together the code looks something like this:
auto func2(std::size_t n, std::size_t r){
    std::vector<std::size_t> reas;
    reas.reserve((1 << r) * n);
    std::vector<std::size_t> v(n);
    std::fill(v.end() - r, v.end(), 1);
    do {
        for (std::size_t i = 0; i < n; ++i) {
            if (v[i]) {
                reas.push_back(i + 1);
            }
        }
    } while (std::next_permutation(v.begin(), v.end()));
    return reas;
}
Third: Don't press the entire result into one huge buffer
Use a callback to process each sub-set. Thereby you avoid having to return one huge vector. Instead you call a function for each individual sub-set that you found. If you really really need to have one huge set, this callback can still push the sub-sets into a vector, but it can also operate on them in-place.
std::size_t func3(std::size_t n, std::size_t r,
                  std::function<void(std::vector<std::size_t> const&)> fun){
    std::vector<std::size_t> reas;
    reas.reserve(r);
    std::vector<std::size_t> v(n);
    std::fill(v.end() - r, v.end(), 1);
    std::size_t num = 0;
    do {
        reas.clear(); // does not shrink capacity to 0
        for (std::size_t i = 0; i < n; ++i) {
            if (v[i]) {
                reas.push_back(i + 1);
            }
        }
        ++num;
        fun(reas);
    } while (std::next_permutation(v.begin(), v.end()));
    return num;
}
This yields a speedup of well over 2x in my experiments. But the speedup goes up the more you crank up n and r.
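For example, a hypothetical caller could count the subsets and print only the first few (assuming func3 from above is in scope):
#include <cstdio>
#include <vector>

int main() {
    std::size_t printed = 0;
    std::size_t total = func3(5, 3, [&](std::vector<std::size_t> const& s) {
        if (printed++ < 3) {                        // process only the first three subsets
            for (auto e : s) std::printf("%zu ", e);
            std::printf("\n");
        }
    });
    std::printf("total subsets: %zu\n", total);     // C(5,3) == 10
}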
Also: Use compiler optimisation
Use your compiler's optimisation options to speed up the generated code as much as possible. On my system the jump from -O0 to -O1 is a speedup of well more than 10x. The jump from -O1 to -O3 is much smaller, but still there (about 1.1x).
Unrelated to performance, but still relevant: Why is "using namespace std;" considered bad practice?

modifying values in pointers is very slow?

I'm working with a huge amount of data stored in an array, and am trying to optimize the amount of time it takes to access and modify it. I'm using Windows, C++ and VS2015 (Release mode).
I ran some tests and don't really understand the results I'm getting, so I would love some help optimizing my code.
First, let's say I have the following class:
class foo
{
public:
    int x;
    foo()
    {
        x = 0;
    }
    void inc()
    {
        x++;
    }
    int X()
    {
        return x;
    }
    void addX(int &_x)
    {
        _x++;
    }
};
I start by initializing 10 million pointers to instances of that class into a std::vector of the same size.
#include <vector>
int count = 10000000;
std::vector<foo*> fooArr;
fooArr.resize(count);
for (int i = 0; i < count; i++)
{
    fooArr[i] = new foo();
}
When I run the following code, and profile the amount of time it takes to complete, it takes approximately 350ms (which, for my purposes, is far too slow):
for (int i = 0; i < count; i++)
{
    fooArr[i]->inc(); //increment all elements
}
To test how long it takes to increment an integer that many times, I tried:
int x = 0;
for (int i = 0; i < count; i++)
{
    x++;
}
Which returns in <1ms.
I thought maybe the number of integers being changed was the problem, but the following code still takes 250ms, so I don't think it's that:
for (int i = 0; i < count; i++)
{
    fooArr[0]->inc(); //only increment first element
}
I thought maybe the array index access itself was the bottleneck, but the following code takes <1ms to complete:
int x;
for (int i = 0; i < count; i++)
{
    x = fooArr[i]->X(); //set x
}
I thought maybe the compiler was doing some hidden optimizations on the loop itself for the last example (since the value of x will be the same during each iteration of the loop, so maybe the compiler skips unnecessary iterations?). So I tried the following, and it takes 350ms to complete:
int x;
for (int i = 0; i < count; i++)
{
    fooArr[i]->addX(x); //increment x inside foo function
}
So that one was slow again, but maybe only because I'm incrementing an integer with a pointer again.
I tried the following too, and it returns in 350ms as well:
for (int i = 0; i < count; i++)
{
    fooArr[i]->x++;
}
So am I stuck here? Is ~350ms the absolute fastest that I can increment an integer, inside of 10million pointers in a vector? Or am I missing some obvious thing? I experimented with multithreading (giving each thread a different chunk of the array to increment) and that actually took longer once I started using enough threads. Maybe that was due to some other obvious thing I'm missing, so for now I'd like to stay away from multithreading to keep things simple.
I'm open to trying containers other than a vector too, if it speeds things up, but whatever container I end up using, I need to be able to easily resize it, remove elements, etc.
I'm fairly new to c++ so any help would be appreciated!
Let's look at it from the CPU's point of view.
Incrementing an integer means I have it in a CPU register and just increment it. This is the fastest option.
Incrementing through a pointer means I'm given an address (vector element -> member), I must load the value into a register, increment it, and copy the result back to that address. Worse: my CPU cache is filled with the vector's pointers, not with the pointed-to members. Too few hits, too much cache "refueling".
If I could manage to have all those members just in a vector, CPU cache hits would be much more frequent.
Try the following:
int count = 10000000;
std::vector<foo> fooArr;
fooArr.resize(count, foo());
for (auto it = fooArr.begin(); it != fooArr.end(); ++it) {
    it->inc();
}
The new is killing you, and you don't actually need it, because resize inserts elements at the end if the new size is greater (check the docs: std::vector::resize).
The other issue is the use of pointers, which IMHO should be avoided until the last moment and is unnecessary in this case. The performance should be faster here since you get better locality of reference (see cache locality). If the objects were polymorphic or something more complicated, it might be different.
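If you want to verify the difference yourself, a minimal timing sketch (not the OP's exact benchmark; numbers will vary by machine) could look like this, assuming the foo class from above:
#include <chrono>
#include <iostream>
#include <vector>

int main() {
    const int count = 10000000;

    std::vector<foo*> byPtr(count);
    for (int i = 0; i < count; i++) byPtr[i] = new foo();
    std::vector<foo> byVal(count);

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < count; i++) byPtr[i]->inc();   // pointer chasing
    auto t1 = std::chrono::steady_clock::now();
    for (auto& f : byVal) f.inc();                     // contiguous objects
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::duration<double, std::milli>;
    std::cout << "pointers: " << ms(t1 - t0).count() << " ms\n";
    std::cout << "values:   " << ms(t2 - t1).count() << " ms\n";

    for (auto p : byPtr) delete p;                     // clean up
}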

Chain of doughnut - codechef

I have been trying to solve this problem for the last two days. I am not getting the correct results.
The solutions which are accepted sort the chains first. I don't understand why they do it.
Only the first task is correct. For the second task the answer is wrong, and for the third the time limit is exceeded.
Here is my code:
#include<iostream>
using namespace std;
int main() {
    int t;
    cin >> t;
    while (t--) {
        long n = 0;
        int f = 0, c = 0, cuts = 0;
        cin >> n >> c;
        int toJoint = c - 1;
        int A[c];
        for (int i = 0; i < c; i++)
            cin >> A[i];
        if (c > 2) {
            for (int i = 0; i < c; i++) {
                if (A[i] == 1) {
                    f++;
                    cuts++;
                    toJoint -= 2;
                    if (toJoint <= 1) break;
                }
            }
            if (toJoint > 0) {
                if (f == 0) cout << toJoint << endl;
                else cout << (cuts + toJoint) << endl;
            }
            else cout << cuts << endl;
        }
        else if (c == 1) cout << 0 << endl;
        else cout << ++cuts << endl;
    }
    return 0;
}
You have the following operations, each of which can be used to link two chains together:
(1) Cut a chain (length >= 3) in the middle (0 fewer chains)
(2) Cut a chain (length >= 2) at the end (1 fewer chain)
(3) Cut a single donut (2 fewer chains)
An optimal solution never needs to use (1), so the objective is to make as many operations as possible (3)s, with the rest being (2)s. The obvious best way to do this is to repeatedly cut a donut from the end of the smallest chain and use it to stick together the two biggest chains. This is the reason for sorting the chains. Even so, it might be faster to make the lengths into a heap and only extract the minimum element as many times as we need to.
Now to the question: your algorithm only uses operation (3) on single donuts, but doesn't try to make more single donuts by cutting them from the end of the smallest chain. And so, as Jarod42 points out with a counterexample, it isn't optimal.
I should also point out that your use of VLAs
int A[c];
is a non-standard extension. To be strict, you should use std::vector instead.
For completeness, here's an example:
std::sort(A.begin(), A.end());
int smallest_index = 0;
int cuts = 0;
while (M > 1)
{
    int smallest = A[smallest_index];
    if (smallest <= M - 2)
    {
        // Obliterate the smallest chain, using all its donuts to link other chains
        smallest_index++;
        M -= smallest + 1;
        cuts += smallest;
    }
    else
    {
        // Cut M - 2 donuts from the smallest chain - linking the other chains into one.
        // Now there are two chains, requiring one more cut to link
        cuts += M - 1;
        break;
    }
}
return cuts;
(disclaimer: only tested on the sample data, may fail in corner-cases or not work at all.)
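To make the fragment self-contained, here is one way to wrap it, assuming A holds the chain lengths and M is the number of chains (same disclaimer applies):
#include <algorithm>
#include <vector>

int min_cuts(std::vector<int> A)
{
    std::sort(A.begin(), A.end());
    int M = static_cast<int>(A.size());
    int smallest_index = 0;
    int cuts = 0;
    while (M > 1)
    {
        int smallest = A[smallest_index];
        if (smallest <= M - 2)
        {
            // use every donut of the smallest chain to link other chains
            smallest_index++;
            M -= smallest + 1;
            cuts += smallest;
        }
        else
        {
            // M-2 cuts merge the other chains into one; one more joins the last two
            cuts += M - 1;
            break;
        }
    }
    return cuts;
}

// e.g. min_cuts({1, 3, 5}) == 1 (cut the single donut, link the chains of 3 and 5)
//      min_cuts({1, 1, 1, 1}) == 2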

Is there only one way to implement a bubble sort algorithm?

I was trying to implement my own bubble sort algorithm without looking at any pseudo-code online, but now that I've successfully done it, my code looks really different from the examples I see online. They all involve dealing with a swapped variable that is either true or false. My implementation does not include that at all, so did I NOT make a bubble sort?
Here is an example I see online:
for i = 1:n,
    swapped = false
    for j = n:i+1,
        if a[j] < a[j-1],
            swap a[j,j-1]
            swapped = true
    → invariant: a[1..i] in final position
    break if not swapped
end
Here is my implementation of it:
void BubbleSort(int* a, int size)
{
    // arraySorted (not shown) presumably returns true once a is in ascending order
    while (!arraySorted(a, size))
    {
        int i = 0;
        while (i < (size - 1))
        {
            if (a[i] < a[i+1])
            {
                i++;
            }
            else
            {
                int tmp = 0;
                tmp = a[i+1];
                a[i+1] = a[i];
                a[i] = tmp;
                i++;
            }
        }
    }
}
It does the same job, but does it do it any differently?
As some people noted, your version without the flag works, but is needlessly slow.
However, if you take the original version and just throw away the flag (together with the break), it will still work. It's easy to see from the invariant that you conveniently posted.
The version without the break has roughly the same worst-case performance as with the break (worst case is for an array sorted in reverse order). It's better than the original one if you want an algorithm that is guaranteed to finish in a pre-defined time.
Wikipedia describes another idea for optimization of the bubble-sort, which includes throwing away the break.
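For comparison, here is a C++ sketch of the flag idea from the pseudo-code above (it bubbles large elements to the end rather than small ones to the front, but uses the same array-plus-size interface as the OP's version):
void BubbleSortWithFlag(int* a, int size)
{
    for (int i = 0; i < size - 1; ++i)
    {
        bool swapped = false;
        for (int j = 0; j < size - 1 - i; ++j)
        {
            if (a[j] > a[j + 1])
            {
                int tmp = a[j];      // swap adjacent out-of-order elements
                a[j] = a[j + 1];
                a[j + 1] = tmp;
                swapped = true;
            }
        }
        if (!swapped)                // no swaps in a whole pass: already sorted
            break;
    }
}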

Stack versus Integer

I've created a program to solve Cryptarithmetics for a class on Data Structures. The professor recommended that we utilize a stack consisting of linked nodes to keep track of which letters we replaced with which numbers, but I realized an integer could do the same trick. Instead of a stack {A, 1, B, 2, C, 3, D, 4} I could hold the same info in 1234.
My program, though, seems to run much more slowly than the estimation he gave us. Could someone explain why a stack would behave much more efficiently? I had assumed that, since I wouldn't be calling methods over and over again (push, pop, top, etc) and instead just add one to the 'solution' that mine would be faster.
This is not an open-ended question, so do not close it. Although you can implement things in different ways, I want to know why, at the heart of C++, accessing data via a stack has performance benefits over storing it in ints and extracting by modding.
Although this is homework, I don't actually need help, just very intrigued and curious.
Thanks and can't wait to learn something new!
EDIT (Adding some code)
letterAssignments is an int array of size 26. For a problem like SEND + MORE = MONEY, A isn't used, so letterAssignments[0] is set to 11. All chars that are used are initialized to 10.
answerNum is a number with as many digits as there are unique characters (in this case, 8 digits).
int Cryptarithmetic::solve(){
    while(!solved()){
        for(size_t z = 0; z < 26; z++){
            if(letterAssignments[z] != 11) letterAssignments[z] = 10;
        }
        if(answerNum < 1) return NULL;
        size_t curAns = answerNum;
        for(int i = 0; i < numDigits; i++){
            if(nextUnassigned() != '$') {
                size_t nextAssign = curAns % 10;
                if(isAssigned(nextAssign)){
                    answerNum--;
                    continue;
                }
                assign(nextUnassigned(), nextAssign);
                curAns /= 10;
            }
        }
        answerNum--;
    }
    return answerNum;
}
Two helper methods in case you'd like to see them:
char Cryptarithmetic::nextUnassigned(){
    char nextUnassigned = '$';
    for(int i = 0; i < 26; i++) {
        if(letterAssignments[i] == 10) return ('A' + i);
    }
    return nextUnassigned; // '$' when every letter is already assigned
}
void Cryptarithmetic::assign(char letter, size_t val){
    assert('A' <= letter && letter <= 'Z');      // valid letter
    assert(letterAssignments[letter-'A'] != 11); // has this letter
    assert(!isAssigned(val));                    // not already assigned.
    letterAssignments[letter-'A'] = val;
}
From the looks of things, the way you are doing things here is quite inefficient.
As a general rule, try to have as few nested loops as possible, since each additional level multiplies the amount of work your implementation does.
For instance, if we strip all other code away, your program looks like
while(thing) {
    for(z < 26) {
    }
    for(i < numDigits) {
        for(i < 26) {
        }
        for(i < 26) {
        }
    }
}
This means that for each iteration of the while loop you are doing ((26+26)*numDigits)+26 loop operations. That's assuming isAssigned() does not use a loop.
Ideally you want:
while(thing) {
    for(i < numDigits) {
    }
}
which I'm sure is possible with changes to your code.
This is why your implementation with the integer array is much slower than an implementation using the stack, which (I assume) does not use the for(i < 26) loops.
In answer to your original question, however: storing an array of integers will always be faster than any struct you can come up with, simply because a struct brings more overhead in allocating the memory, calling functions, etc.
But as with everything, implementation is the key difference between a slow program and a fast program.
The problem is that by simply counting you are also considering repetitions, whereas the problem presumably asks you to assign a different number to each different letter so that the numeric equation holds.
For example, for four letters you are testing 10*10*10*10 = 10000 letter->number mappings instead of 10*9*8*7 = 5040 of them (the more letters there are, the bigger the ratio between the two numbers becomes...).
The div instruction used by the mod function is quite expensive. Using it for your purpose can easily be less efficient than a good stack implementation. Here is an instruction timings table: http://gmplib.org/~tege/x86-timing.pdf
You should also write unit tests for your int-based stack to make sure that it works as intended.
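To make the trade-off concrete, here is a hypothetical sketch of the "integer as a stack" idea being discussed, with a tiny test; the mod/div in pop() is exactly the cost referred to above (this is not the OP's actual class):
#include <cassert>

struct IntStack
{
    long long packed = 0;   // most recently pushed digit is the lowest decimal digit
    int count = 0;

    void push(int digit)    // digit must be in 0..9
    {
        packed = packed * 10 + digit;
        ++count;
    }
    int pop()
    {
        int digit = static_cast<int>(packed % 10); // mod/div happen here
        packed /= 10;
        --count;
        return digit;
    }
};

int main()
{
    IntStack s;
    s.push(1); s.push(2); s.push(3);
    assert(s.pop() == 3);
    assert(s.pop() == 2);
    assert(s.pop() == 1);
    assert(s.count == 0);
}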
Programming is often about trading memory for time and vice versa.
Here you are packing data into an integer. You save memory but lose time.
Speed of course depends on the implementation of the stack. C++ is C with classes. If you are not using classes, it's basically C (as fast as C).
#include <cassert>

const int stack_size = 26;
struct Stack
{
    int _data[stack_size];
    int _stack_p;
    Stack()
        : _stack_p(0)
    {}
    inline void push(int val)
    {
        assert(_stack_p < stack_size); // no overhead in release builds:
                                       // asserts compile out when NDEBUG is defined
        _data[_stack_p++] = val;
    }
    inline int pop()
    {
        assert(_stack_p > 0); // same thing. assert is very useful for tracing bugs
        return _data[--_stack_p];
    }
    inline int size()
    {
        return _stack_p;
    }
    inline int val(int i)
    {
        assert(i >= 0 && i < _stack_p);
        return _data[i];
    }
};
There is no overhead like a vtable pointer. Also, pop() and push() are very simple, so they will be inlined, meaning no function-call overhead. Using int as the stack element is also good for speed, because int is typically the processor's natural word size (no alignment issues, etc.).
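A quick hypothetical usage of this struct (the values are arbitrary, standing in for letter assignments):
#include <cassert>

int main()
{
    Stack s;
    s.push(4);              // e.g. A -> 4
    s.push(7);              // e.g. B -> 7
    assert(s.size() == 2);
    assert(s.val(0) == 4);  // inspect without popping
    assert(s.pop() == 7);   // undo the most recent assignment first
    assert(s.pop() == 4);
}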