Context
I have code like this:
..
vector<int> values = ...;
vector<vector<int>> buckets;
// reserve space for values and each bucket's sub-vector
for (int i = 0; i < values.size(); i++) {
    buckets[values[i]].push_back(i);
}
...
So I get "buckets" filled with the indexes of entries that have the same value. Those buckets are then used in further processing.
Actually I'm working with native dynamic arrays (int ** buckets;) but for simplicity's sake I've used vectors above.
I know the size of each bucket before filling.
Size of vectors is about 2,000,000,000.
The problem
As you can see, the code above accesses the "buckets" array in a random manner. Thus it has constant cache misses that slow execution down dramatically.
Yes, I see such misses in the profiler report.
Question
Is there a way to improve speed of such code?
I've tried creating an auxiliary vector and putting the first occurrence of each value there, so that I can put two indexes into the corresponding bucket when I find the second one. This approach didn't give any speedup.
Thank you!
Why are you assuming it's cache misses that make your code slow? Have you profiled or is that just what came to mind?
There are a number of things very wrong with your code from a performance perspective. The first and most obvious is that you never reserve a vector size. What's happening is that your vector starts out very small (say, 2 elements), then each time you add past the capacity, it reallocates and copies the contents over to a new memory location. If you're saying there are 2 billion entries, you're reallocating maybe 30 times!
You need to call vector::reserve() (or vector::resize(), depending on which behavior works best for you) before you look at other improvements.
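For example, here is a minimal sketch of counting first so each bucket can be reserved before the fill loop (the sample data and maxValue are made-up placeholders, not your actual sizes):
#include <cstddef>
#include <vector>

int main() {
    std::vector<int> values = {2, 0, 1, 2, 0, 2};   // placeholder data
    const int maxValue = 3;                          // assumed: values are in [0, maxValue)

    std::vector<std::size_t> counts(maxValue, 0);
    for (int v : values) ++counts[v];                // one cheap sequential counting pass

    std::vector<std::vector<int>> buckets(maxValue);
    for (int v = 0; v < maxValue; ++v)
        buckets[v].reserve(counts[v]);               // no reallocation while filling

    for (std::size_t i = 0; i < values.size(); ++i)
        buckets[values[i]].push_back(static_cast<int>(i));
}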
EDIT
Seriously? You mention that you're not even using a vector in your PS? How are we supposed to guess what your actual code looks like and how it will perform?
Is the mapping from index to bucket (call it foo) at least invertible and surjective on a given interval? Then you can run through that interval and fill buckets[j] completely with bar(j,k) (if bar is the inverse of foo), for k in [0,...,MAX_BAR_J), then continue with j+1 and so forth.
If, however, foo has hashing properties, you have very little chance, because you cannot predict which bucket the next i will land in. So I see no chance right now.
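A minimal sketch of that idea with a deliberately trivial stand-in mapping foo(i) = i % numBuckets, whose inverse bar(j, k) = j + k * numBuckets enumerates the indices for bucket j (your real foo is unknown; this only shows the sequential fill pattern):
#include <cstddef>
#include <vector>

int main() {
    const std::size_t n = 1000, numBuckets = 10;
    std::vector<std::vector<std::size_t>> buckets(numBuckets);

    // Stand-in for an invertible foo: foo(i) = i % numBuckets.
    // Its inverse bar(j, k) = j + k * numBuckets enumerates exactly the
    // indices i with foo(i) == j, so each bucket is written sequentially.
    for (std::size_t j = 0; j < numBuckets; ++j) {
        for (std::size_t k = 0; j + k * numBuckets < n; ++k) {
            buckets[j].push_back(j + k * numBuckets);  // bar(j, k)
        }
    }
}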
Related
I am working with a very big matrix X (say, 1,000-by-1,000,000). My algorithm goes like the following:
1. Scan the columns of X one by one and, based on some filtering rules, identify only a subset of columns that are needed. Denote the subset of column indices by S. Its size depends on the filter, so it is unknown before the computation and will change if the filtering rules are different.
2. Loop over S and do some computation with a column x_i if i is in S. This step needs to be parallelized with OpenMP.
3. Repeat steps 1 and 2 100 times with changed filtering rules, defined by a parameter.
I am wondering what the best way is to implement this procedure in C++. Here are two ways I can think of:
(a) Use a 0-1 array (of length 1,000,000) to mark the needed columns in Step 1 above; then in Step 2, loop over 1 to 1,000,000 and check the indicator with an if, doing the computation only if the indicator is 1 for that column;
(b) Use a std::vector for S and push_back each column index identified as needed; then loop only over S, each time extracting a column index from S and doing the computation. (I thought about using this approach, but I've heard push_back is expensive if you're just storing integers.)
Since my algorithm is very time-consuming, I assume a little time saving in this basic step would mean a lot overall. So my question is, should I try (a) or (b) or some even better way for better performance (and for working with OpenMP)?
Any suggestions/comments for achieving better speedup are much appreciated. Thank you very much!
To me, it seems that step #1 really does not matter much: at the end of the day, you're going to wind up with a set of columns, however it is represented.
What's really going to matter is what happens when you unleash the parallelized step #2.
An array of ones and zeros, however large, should be fairly simple to work with in parallel, while a more "advanced" data structure might well, in this case, just get in the way.
One thousand megabits, these days? Sure, no problem (and if not, a simple array of bit-sets). However many simultaneously executing entities there are, they should be able to navigate such a data structure in parallel with a minimum of conflict, so my gut says big bit-sets win.
I think you will find std::vector easier to use. Regarding push_back, the cost comes when the vector reallocates (and maybe copies) its data. To avoid that (if it matters), you could call vector::reserve() with 1,000,000 up front. Your vector is then 8 MB, insignificant compared to your problem size. It's only one order of magnitude bigger than a bitmap would be, and a lot simpler to deal with: if we call your vector S, then the nth interesting column is S[n] and your column access is just x[S[n]].
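A minimal sketch of that approach, assuming a hypothetical keep_column() filter and compute_on() worker in place of your real ones:
#include <cstddef>
#include <vector>

// keep_column() and compute_on() are hypothetical stand-ins for your real
// filter and per-column computation.
static bool keep_column(std::ptrdiff_t j) { return j % 3 == 0; }
static void compute_on(std::ptrdiff_t /*j*/) { /* work on column x[j] */ }

int main() {
    const std::ptrdiff_t ncols = 1000000;

    std::vector<std::ptrdiff_t> S;
    S.reserve(ncols);                        // worst case: all columns kept, so no reallocation
    for (std::ptrdiff_t j = 0; j < ncols; ++j)
        if (keep_column(j)) S.push_back(j);  // step 1: collect the needed columns

    #pragma omp parallel for                 // step 2: loop only over S
    for (std::ptrdiff_t n = 0; n < (std::ptrdiff_t)S.size(); ++n)
        compute_on(S[n]);                    // column access is x[S[n]]
}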
(Based on my gut feeling) I'd probably go for pushing back into a vector, but the answer is quite simple: Measure both methods (they are both trivial to implement). Most likely you won't see a noticeable difference.
I have a bit of an issue. I was recently told that for unordered input, say a million random values, using a set would be more efficient than using a vector and then sorting that vector with the standard sort algorithm. But when I tested both and measured them with the time command in the terminal and with valgrind, both the running time and the space usage came out better for the vector, even with the additional call to the sort function. The person who gave me the advice to use the set is a lot more experienced than I am in C++, but I always have to test things out myself before taking people's advice. The test code follows.
For Set
std::set<int> testSet;
for (int i(0); i <= 1000000; ++i)
    testSet.insert(-i);
For Vector
std::vector<int> testVector;
for (int i(0); i <= 1000000; ++i)
    testVector.push_back(i * -1);
std::sort(testVector.begin(), testVector.end());
I know that these are not random values; it wouldn't be fair otherwise, since a set does not allow duplicates and a vector does, so they would end up with different sizes for this basic test. Can anyone clarify why the set should be used, aside from the no-duplicates point?
I did not do any tests with an unordered set either. I'm not too sure of the differences between the two for the points given.
This is too vague and ignores/misses out several crucial factors. If your friend said precisely this, then your friend (regardless of his or her experience) was wrong. More likely you are somewhat misinterpreting their words and reading into them a simplified version of matters.
When you want a sorted final product, the sorting is "amortized" when you insert into a set, because you get little bits of sorting action each time. If you will be inserting periodically and many times, then that spreading-out of the workload may be what you want. The total, when added up, may still be more than for a vector (consider the occasional rebalancing and so forth; your vector just needs to be moved to a larger block of memory once in a while), but you've spread it out so as not to noticeably slow down some individual other part of your program.
But if you're just dumping all the elements into a vector and sorting straight away, not only is there less work for the container & algorithm to do but you probably don't mind it taking a noticeable amount of time.
You haven't really stated your use case in any detail so I won't pretend to give specifics here, but the only possible answer to your question as posed is both "it depends" and "the question is fundamentally somewhat meaningless"; you cannot just take two data structures and sorting methodologies, and ask "which is more efficient?" without a use case. You have, however, correctly measured the time and space requirements and if you've done that against your real-world use case then, well, you have your answer don't you?
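For what it's worth, both versions are trivial to time; here's a minimal sketch of such a measurement (sizes and distribution are arbitrary, and note the set silently drops duplicates, as you pointed out):
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <set>
#include <vector>

int main() {
    const int n = 1000000;
    std::mt19937 gen(42);
    std::uniform_int_distribution<int> dist(0, 100000000);
    std::vector<int> input(n);
    for (int& v : input) v = dist(gen);

    auto t0 = std::chrono::steady_clock::now();
    std::set<int> s(input.begin(), input.end());   // sorted as it is built; duplicates dropped
    auto t1 = std::chrono::steady_clock::now();
    std::vector<int> v(input.begin(), input.end());
    std::sort(v.begin(), v.end());                 // sorted in one shot
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::duration<double, std::milli>;
    std::printf("set:         %.1f ms\n", ms(t1 - t0).count());
    std::printf("vector+sort: %.1f ms\n", ms(t2 - t1).count());
}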
Pros, I need some performance opinions on the following:
1st Question:
I want to store objects in a 3D grid structure; overall it will be ~33% filled, i.e. 2 out of 3 grid points will be empty.
Short image to illustrate:
Maybe Option A)
vector<vector<vector<deque<Obj>>>> grid; // (SizeX, SizeY, SizeZ)
grid[x][y][z].push_back(someObj);
This way I'd have a lot of empty deques, but accessing one of them would be fast, wouldn't it?
The Other Option B) would be
std::unordered_map<Pos3D, deque<Obj>, Pos3DHash, Pos3DEqual> Pos3DMap;
where I add & delete deques as data is added/deleted. Probably less memory used, but maybe slower? What do you think?
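For reference, this is roughly what I mean by Pos3DHash and Pos3DEqual (a sketch; the exact member layout and hash mixing are just placeholders):
#include <cstddef>
#include <deque>
#include <functional>
#include <unordered_map>

struct Obj { /* ... */ };

struct Pos3D { int x, y, z; };

struct Pos3DEqual {
    bool operator()(const Pos3D& a, const Pos3D& b) const {
        return a.x == b.x && a.y == b.y && a.z == b.z;
    }
};

struct Pos3DHash {
    std::size_t operator()(const Pos3D& p) const {
        // simple hash combination; any reasonable mixing would do
        std::size_t h = std::hash<int>()(p.x);
        h = h * 31 + std::hash<int>()(p.y);
        h = h * 31 + std::hash<int>()(p.z);
        return h;
    }
};

int main() {
    std::unordered_map<Pos3D, std::deque<Obj>, Pos3DHash, Pos3DEqual> Pos3DMap;
    Pos3DMap[{1, 2, 3}].push_back(Obj{});   // the deque is created on first insert
}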
2nd Question (follow up)
What if I had multiple containers at each position? Say 3 buckets for 3 different object types ObjA, ObjB, ObjC per grid point; then my data essentially becomes 4D?
Another illustration:
Using Option 1B I could just extend Pos3D to include the bucket number to account for even more sparse data.
Possible queries I want to optimize for:
- Give me all Objects out of ObjA-buckets from the entire structure
- Give me all Objects out of ObjB-buckets for a set of grid positions
- Which is the nearest non-empty ObjC-bucket to position x,y,z?
PS:
I had also thought about a tree-based data structure before, while reading about nearest-neighbour approaches. Since my data is so regular, I thought I'd skip all the tree-building subdivision of cells into smaller pieces and just make a static 3D grid of the final leaves. That's how I came to ask about the best way to store this grid here.
A question associated with this: if I have a map<int, Obj>, is there a fast way to ask for "all objects with keys between 780 and 790"? Or is the fastest way to build the above-mentioned tree?
EDIT
I ended up going with a 3D boost::multi_array with Fortran ordering. It's a bit like the chunks that games like Minecraft use, which in turn is a little like using a kd-tree with a fixed leaf size and a fixed number of leaves. It works pretty fast now, so I'm happy with this approach.
Answer to 1st question
As @Joachim pointed out, this depends on whether you prefer fast access or small data. Roughly, this corresponds to your options A and B.
A) If you want fast access, go with a multidimensional std::vector, or a plain array if you will. std::vector brings easier maintenance at minimal overhead, so I'd prefer that. In terms of space it consumes O(N^3), where N is the number of grid points along one dimension. To get the best performance when iterating over the data, iterate in memory order: the outermost loop over the first index, the innermost loop over the last index.
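A minimal sketch of option A with that iteration order (the sizes are arbitrary):
#include <cstddef>
#include <deque>
#include <vector>

struct Obj { /* ... */ };

int main() {
    const std::size_t SizeX = 64, SizeY = 64, SizeZ = 64;
    std::vector<std::vector<std::vector<std::deque<Obj>>>> grid(
        SizeX, std::vector<std::vector<std::deque<Obj>>>(
                   SizeY, std::vector<std::deque<Obj>>(SizeZ)));

    // iterate in memory order: x outermost, z innermost, so consecutive
    // accesses touch neighbouring elements of the innermost vectors
    for (std::size_t x = 0; x < SizeX; ++x)
        for (std::size_t y = 0; y < SizeY; ++y)
            for (std::size_t z = 0; z < SizeZ; ++z)
                grid[x][y][z].clear();
}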
B) If you instead wish to keep things as small as possible, use a hash map, and use one which is optimized for space. That would result in space O(N), with N being the number of elements. Here is a benchmark comparing several hash maps. I have had good experiences with google::sparse_hash_map, which has the smallest constant overhead I have seen so far. Plus, it is easy to add to your build system.
If you need a mixture of speed and small data or don't know the size of each dimension in advance, use a hash map as well.
Answer to 2nd question
I'd say your data is 4D if you have a variable number of elements along the 4th dimension, or a fixed large number of elements. With option 1B) you'd indeed add the bucket index to the key; for 1A) you'd add another nested vector.
Which is the nearest non-empty ObjC-bucket to position x,y,z?
This operation is commonly called nearest neighbor search. You want a KDTree for that. There is libkdtree++, if you prefer small libraries. Otherwise, FLANN might be an option. It is a part of the Point Cloud Library which accomplishes a lot of tasks on multidimensional data and could be worth a look as well.
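If you want a baseline before pulling in a library, a brute-force scan over the non-empty ObjC cells is trivial to write (a sketch with an assumed Cell record; the KD-tree libraries above are what you'd use for real data):
#include <cstddef>
#include <vector>

struct Cell { int x, y, z; bool hasObjC; };

// Naive baseline: scan every cell and keep the closest non-empty ObjC one.
// A KD-tree replaces this O(n) scan with roughly O(log n) per query.
const Cell* nearest_objc(const std::vector<Cell>& cells, int x, int y, int z) {
    const Cell* best = nullptr;
    double bestDist = 1e300;
    for (const Cell& c : cells) {
        if (!c.hasObjC) continue;
        double dx = c.x - x, dy = c.y - y, dz = c.z - z;
        double d = dx * dx + dy * dy + dz * dz;   // squared distance is enough for comparison
        if (d < bestDist) { bestDist = d; best = &c; }
    }
    return best;  // nullptr if no non-empty ObjC cell exists
}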
The title almost tells everything, but I will give an example: suppose that you have an array a of chars, and another array b, also of chars. Is there a better way to put into a only the chars located at prime positions in b? Suppose that we already have an array with the prime positions.
For now my naive code looks like this.
for (i = 0; i < n; i++)
    a[i] = b[j + prime[i]];
Here prime[i] stores the prime positions of b, and b is much larger than a; j is an arbitrary position in b (there will not be an out-of-bounds problem because j + prime[i] does not exceed the bounds of b).
What would be better? One way: if the prime[] locations are known at compile time, then we could add a prefetch to get the cache lines in ahead of time.
This improves the memory access time.
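A minimal sketch of that idea using the GCC/Clang __builtin_prefetch intrinsic; the prefetch distance DIST = 8 is a guess and would need tuning for the target machine:
#include <cstddef>

// Copy b[j + prime[i]] into a[i], prefetching a few iterations ahead.
void gather(char* a, const char* b, const std::size_t* prime,
            std::size_t n, std::size_t j) {
    const std::size_t DIST = 8;  // guessed prefetch distance
    for (std::size_t i = 0; i < n; ++i) {
        if (i + DIST < n)
            __builtin_prefetch(&b[j + prime[i + DIST]]);  // hint: this line will be needed soon
        a[i] = b[j + prime[i]];
    }
}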
You can either do this as you read (or copy) values into the array, using a primality function that tells you whether a number is prime or not.
A way I sketched quickly is to generate prime numbers up to your array capacity and simply iterate through them, copying the desired elements from the source array. I can think of several ways of optimizing this, such as having a "preprocess" function that generates the prime numbers once in your program so you can reuse the list.
The prime number list will get cached, and it will take a lot less time to access (it's unlikely that you have an extremely huge prime number list).
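A minimal sketch of that preprocessing step with a sieve of Eratosthenes (the limit is a placeholder):
#include <cstddef>
#include <vector>

// Sieve of Eratosthenes: returns all primes below `limit`, so the list can be
// generated once ("preprocess") and reused for every copy.
std::vector<std::size_t> primes_below(std::size_t limit) {
    std::vector<bool> is_prime(limit, true);
    std::vector<std::size_t> primes;
    for (std::size_t p = 2; p < limit; ++p) {
        if (!is_prime[p]) continue;
        primes.push_back(p);
        for (std::size_t q = p * p; q < limit; q += p)
            is_prime[q] = false;          // mark multiples of p as composite
    }
    return primes;
}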
Let's look at this from an algorithmic perspective.
You want to perform a hash function on each of the entries in array A. Assuming that you know nothing about the state of the items in array A, that places the lower bound on the algorithm's running time at O(n), linear time. You must iterate through every member because you don't have any additional information that could help you skip some elements or otherwise optimize the process.
That said, the challenge then becomes keeping the algorithm down at O(n). The code you show does do this, assuming you then follow up by copying the non-prime numbers in the same manner. So for the copying step, no, there is not a way to make this any faster from an algorithmic point of view. That doesn't mean that how you perform the hashing step won't affect the speed, though.
The function is defined as
void bucketsort(Array& A) {
    size_t numBuckets = A.size();
    iarray<List> buckets(numBuckets);

    // put in buckets
    for (size_t i = 0; i != A.size(); i++) {
        buckets[int(numBuckets * A[i])].push_back(A[i]);
    }

    //// get back from buckets
    // for (size_t i = 0, head = 0; i != numBuckets; i++) {
    //     size_t bucket_size = buckets[i].size();
    //     for (size_t j = 0; j != bucket_size; j++) {
    //         A[head + j] = buckets[i].front();
    //         buckets[i].pop_front();
    //     }
    //     head += bucket_size;
    // }
    for (size_t i = 0, head = 0; i != numBuckets; i++) {
        while (!buckets[i].empty()) {
            A[head] = buckets[i].back();
            buckets[i].pop_back();
            head++;
        }
    }

    // insertion sort
    insertionsort(A);
}
where List is just std::list<double>.
The contents of the array are generated randomly in [0, 1). Theoretically bucket sort should be faster than quicksort for large sizes since it is O(n), but it fails, as shown in the following graph.
I used google-perftools to profile it on an array of 10,000,000 doubles. It reports as follows:
It seems I should not use the STL list, but I wonder why? What does std::_List_node_base::_M_hook do? Should I write a list class myself?
PS: The experiment and improvement
I have tried keeping only the code that puts elements into the buckets, and this showed that most of the time is spent building up the buckets.
The following improvements were made:
- Use an STL vector as the buckets and reserve a reasonable amount of space for them
- Use two helper arrays to store the information used in building the buckets, thus avoiding the use of linked lists, as in the following code
void bucketsort2(Array& A) {
    size_t numBuckets = size_t(ceil(A.size() / 1000.0));  // about 1000 elements per bucket
    Array B(A.size());
    IndexArray head(numBuckets + 1, 0), offset(numBuckets, 0);
    // extra end of head is used to avoid checking i == A.size()-1

    // count the elements per bucket
    for (size_t i = 0; i != A.size(); i++) {
        head[int(numBuckets * A[i]) + 1]++;  // note the +1
    }
    // prefix sums give each bucket's start position
    for (size_t i = 2; i < numBuckets; i++) {  // head[1] is right already
        head[i] += head[i - 1];
    }
    // scatter the elements into their buckets inside B
    for (size_t i = 0; i < A.size(); i++) {
        size_t bucket_num = int(numBuckets * A[i]);
        B[head[bucket_num] + offset[bucket_num]] = A[i];
        offset[bucket_num]++;
    }
    A.swap(B);

    // insertionsort(A);
    for (size_t i = 0; i < numBuckets; i++)
        quicksort_range(A, head[i], head[i] + offset[i]);
}
The results are shown in the following graph,
where lines starting with "list" use a list as buckets, lines starting with "vector" use a vector as buckets, and lines starting with "2" use the helper arrays. By default, insertion sort is used at the end; some variants use quicksort instead because the bucket size is big.
Note that "list" and "list, only put in", as well as "vector, reserve 8" and "vector, reserve 2", nearly overlap.
I will try smaller sizes with enough memory reserved.
In my opinion, the biggest bottleneck here is memory management functions (such as new and delete).
Quicksort (of which STL probably uses an optimized version) can sort an array in-place, meaning it requires absolutely no heap allocations. That is why it performs so well in practice.
Bucket sort relies on additional working space, which is assumed to be readily available in theory (i.e. memory allocation is assumed to take no time at all). In practice, memory allocation can take anywhere from (large) constant time to linear time in the size of memory requested (Windows, for example, will take time to zero the contents of pages when they are allocated). This means standard linked list implementations are going to suffer, and dominate the running time of your sort.
Try using a custom list implementation that pre-allocates memory for a large number of items, and you should see your sort running much faster.
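One simple way to get that pre-allocation effect without writing a custom list is to use std::vector buckets with capacity reserved up front (a sketch using plain std::vector<double> rather than your Array/List types):
#include <cstddef>
#include <vector>

// Bucket phase with vectors reserved up front, so filling the buckets does
// not hit the allocator for every element.
void fill_buckets(const std::vector<double>& A,
                  std::vector<std::vector<double>>& buckets) {
    const std::size_t numBuckets = buckets.size();
    for (auto& b : buckets)
        b.reserve(2 * A.size() / numBuckets + 1);  // rough per-bucket estimate
    for (double x : A)                             // x is in [0, 1)
        buckets[static_cast<std::size_t>(numBuckets * x)].push_back(x);
}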
With
iarray<List> buckets(numBuckets);
you are basically creating a LOT of lists, and that can cost you a lot, especially in memory access, which is theoretically linear but in practice is not.
Try to reduce the number of buckets.
To verify my assertion, analyse your code's speed with only the creation of the lists.
Also, to iterate over the elements of the lists, you should not use .size() but rather:
// get back from buckets
for (size_t i = 0, head = 0; i != numBuckets; i++) {
    while (!buckets[i].empty()) {
        A[head++] = buckets[i].front();
        buckets[i].pop_front();
    }
}
In some implementations .size() can be in O(n). Unlikely but...
After some research I found
this page explaining what the code for std::_List_node_base::_M_hook does.
It seems it only inserts an element at a given place in a list. It shouldn't cost a lot.
Linked lists are not arrays. They are substantially slower for operations like lookup. The STL sort may well have a specific version for lists that takes this into account and optimizes for it, but your function blindly ignores what container it's using. You should try using an STL vector as your array.
I think perhaps the interesting question is, Why are you creating an inordinately large number of buckets?
Consider the input {1,2,3}, numBuckets = 3. The loop containing buckets[int(numBuckets*A[i])].push_back(A[i]); is going to unroll to
buckets[3].push_back(1);
buckets[6].push_back(2);
buckets[9].push_back(3);
Really? Nine buckets for three values...
Consider if you passed a permutation of the range 1..100. You'd create 10,000 buckets and only use 1% of them. ... and each of those unused buckets requires creating a List in it. ... and has to be iterated over and then discarded in the readout loop.
Even more exciting, sort the list 1..70000 and watch your heap manager explode trying to create 4.9 billion Lists.
I didn't really manage to get into the details of your code, as I don't know enough Java at this point in my studies, though I have had some experience with algorithms and C programming, so here's my opinion:
Bucket sort assumes a fair distribution of the elements across the array; that is really a precondition for your bucket sort to work in O(n). Notice that in the worst case you can put a major share of the elements into one of your buckets, and in the next iteration you're dealing with almost the same problem you were trying to fix in the first place, which leads to bad performance.
Notice that the ACTUAL time complexity of bucket sort is O(n+k), where k is the number of buckets. Did you count your buckets? Is k = O(n)?
The most time-wasting problem in bucket sort is the empty buckets left over after the partitioning into buckets is done: when concatenating your sorted buckets you can't tell whether a bucket is empty without actually checking it.
Hope I helped.