What makes this bucket sort function slow? - c++

The function is defined as
void bucketsort(Array& A){
    size_t numBuckets = A.size();
    iarray<List> buckets(numBuckets);
    // put in buckets
    for(size_t i = 0; i != A.size(); i++){
        buckets[int(numBuckets*A[i])].push_back(A[i]);
    }
    //// get back from buckets
    //for(size_t i = 0, head = 0; i != numBuckets; i++){
    //    size_t bucket_size = buckets[i].size();
    //    for(size_t j = 0; j != bucket_size; j++){
    //        A[head+j] = buckets[i].front();
    //        buckets[i].pop_front();
    //    }
    //    head += bucket_size;
    //}
    for(size_t i = 0, head = 0; i != numBuckets; i++){
        while(!buckets[i].empty()){
            A[head] = buckets[i].back();
            buckets[i].pop_back();
            head++;
        }
    }
    // insertion sort
    insertionsort(A);
}
where List is just list<double> in STL.
The contents of the array are generated randomly in [0,1). Theoretically, bucket sort should be faster than quicksort for large sizes since it is O(n), but it fails to be, as shown in the following graph.
I used google-perftools to profile it on an array of 10,000,000 doubles. It reports as follows:
It seems I should not use STL list, but I wonder why? What does std::_List_node_base::_M_hook do? Should I write a list class myself?
PS: The experiment and improvements
I have tried leaving only the code that puts elements into buckets, and this showed that most of the time is spent building up the buckets.
The following improvements were made:
- Use STL vector as the buckets and reserve a reasonable amount of space for each bucket
- Use two helper arrays to store the information used in building the buckets, thus avoiding the use of a linked list, as in the following code
void bucketsort2(Array& A){
    size_t numBuckets = ceil(A.size()/1000);
    Array B(A.size());
    IndexArray head(numBuckets+1,0), offset(numBuckets,0); // extra end of head is used to avoid checking of i == A.size()-1
    for(size_t i = 0; i != A.size(); i++){
        head[int(numBuckets*A[i])+1]++; // note the +1
    }
    for(size_t i = 2; i < numBuckets; i++){ // head[1] is right already
        head[i] += head[i-1];
    }
    for(size_t i = 0; i < A.size(); i++){
        size_t bucket_num = int(numBuckets*A[i]);
        B[head[bucket_num]+offset[bucket_num]] = A[i];
        offset[bucket_num]++;
    }
    A.swap(B);
    //insertionsort(A);
    for(size_t i = 0; i < numBuckets; i++)
        quicksort_range(A, head[i], head[i]+offset[i]);
}
The results are in the following graph,
where lines starting with "list" use list as buckets, lines starting with "vector" use vector as buckets, and lines starting with "2" use the helper arrays. By default, insertion sort is used at the end; some runs use quicksort since the bucket sizes are big.
Note that "list" and "list, only put in", as well as "vector, reserve 8" and "vector, reserve 2", nearly overlap.
I will try smaller sizes with enough memory reserved.

In my opinion, the biggest bottleneck here is memory management functions (such as new and delete).
Quicksort (of which STL probably uses an optimized version) can sort an array in-place, meaning it requires absolutely no heap allocations. That is why it performs so well in practice.
Bucket sort relies on additional working space, which is assumed to be readily available in theory (i.e. memory allocation is assumed to take no time at all). In practice, memory allocation can take anywhere from (large) constant time to linear time in the size of memory requested (Windows, for example, will take time to zero the contents of pages when they are allocated). This means standard linked list implementations are going to suffer, and dominate the running time of your sort.
Try using a custom list implementation that pre-allocates memory for a large number of items, and you should see your sort running much faster.
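For illustration, here is a minimal sketch of that idea (not the poster's code; PooledBuckets and the fixed sizes are made up). All node memory is grabbed in one upfront allocation, so pushing an element never calls new per node:

// Sketch only: all nodes live in one pre-allocated pool, so push_back never
// performs a per-element heap allocation the way std::list does.
#include <cstddef>
#include <cstdio>
#include <initializer_list>
#include <vector>

struct PooledBuckets {
    struct Node { double value; int next; };      // next == -1 marks end of a bucket
    std::vector<Node> pool;                       // one contiguous allocation for all nodes
    std::vector<int> head, tail;                  // per-bucket head/tail indices into the pool

    PooledBuckets(std::size_t numBuckets, std::size_t totalElems)
        : head(numBuckets, -1), tail(numBuckets, -1) { pool.reserve(totalElems); }

    void push_back(std::size_t b, double x) {
        pool.push_back({x, -1});
        int idx = static_cast<int>(pool.size()) - 1;
        if (head[b] == -1) head[b] = idx; else pool[tail[b]].next = idx;
        tail[b] = idx;
    }
};

int main() {
    PooledBuckets buckets(4, 8);                  // 4 buckets, room for 8 values
    for (double x : {0.1, 0.7, 0.3, 0.9})
        buckets.push_back(static_cast<std::size_t>(4 * x), x);
    for (std::size_t b = 0; b < 4; ++b)
        for (int i = buckets.head[b]; i != -1; i = buckets.pool[i].next)
            std::printf("bucket %zu: %f\n", b, buckets.pool[i].value);
    return 0;
}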

With
iarray<List> buckets(numBuckets);
you are basically creating a LOT of lists, and that can cost you a lot, especially in memory access: it is theoretically linear in the number of lists, but in practice it is not.
Try to reduce the number of buckets.
To verify this claim, measure your code's speed with only the creation of the lists.
Also, to iterate over the elements of the lists you should not use .size(), but rather:
// get back from buckets
for(size_t i = 0, head = 0; i != numBuckets; i++)
    while(!buckets[i].empty())
    {
        A[head++] = buckets[i].front();
        buckets[i].pop_front();
    }
In some implementations, .size() can be O(n). Unlikely, but possible.
After some research I found this page explaining the code for std::_List_node_base::_M_hook.
It seems it only inserts an element at a given place in a list. That shouldn't cost a lot.

Linked lists are not arrays. They are substantially slower for operations like lookup. The STL sort may well have a specific version for lists that takes this into account and optimizes for it, but your function blindly ignores what container it's using. You should try using an STL vector as your array.
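For example, a hedged sketch of the same idea with std::vector buckets (std::sort per bucket stands in for the question's insertionsort; values are assumed to be in [0,1) as in the question):

// Sketch only: vector-based buckets instead of std::list.
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

void bucketsort_vec(std::vector<double>& A) {
    std::size_t numBuckets = A.size();
    std::vector<std::vector<double>> buckets(numBuckets);
    for (double x : A)
        buckets[static_cast<std::size_t>(numBuckets * x)].push_back(x);  // x assumed in [0,1)
    std::size_t head = 0;
    for (auto& b : buckets) {
        std::sort(b.begin(), b.end());           // stands in for insertion sort per bucket
        for (double x : b) A[head++] = x;
    }
}

int main() {
    std::vector<double> A{0.42, 0.05, 0.99, 0.31, 0.77};
    bucketsort_vec(A);
    for (double x : A) std::printf("%f\n", x);
    return 0;
}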

I think perhaps the interesting question is, Why are you creating an inordinately large number of buckets?
Consider the input {1,2,3}, numBuckets = 3. The loop containing buckets[int(numBuckets*A[i])].push_back(A[i]); is going to unroll to
    buckets[3].push_back(1);
    buckets[6].push_back(2);
    buckets[9].push_back(3);
Really? Ten buckets (indices 0 through 9) for three values...
Consider if you passed a permutation of the range 1..100. You'd create 10,000 buckets and only use 1% of them. ... and each of those unused buckets requires creating a List in it. ... and has to be iterated over and then discarded in the readout loop.
Even more exciting, sort the list 1..70000 and watch your heap manager explode trying to create 4.9 billion Lists.
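For illustration (a sketch, not the poster's code): if the values are not already normalised to [0,1), the bucket index should be scaled by the value range, and the bucket count kept proportional to the input size rather than to the largest value:

// Sketch: map values from an arbitrary [lo, hi) range onto numBuckets buckets,
// so the number of buckets stays proportional to the input size.
#include <cstddef>
#include <cstdio>
#include <initializer_list>

std::size_t bucket_index(double x, double lo, double hi, std::size_t numBuckets) {
    std::size_t b = static_cast<std::size_t>((x - lo) / (hi - lo) * numBuckets);
    return b < numBuckets ? b : numBuckets - 1;   // clamp the x == hi edge case
}

int main() {
    // 1..100 mapped onto 100 buckets instead of 10,000
    for (int v : {1, 2, 3, 50, 100})
        std::printf("%d -> bucket %zu\n", v, bucket_index(v, 1.0, 101.0, 100));
    return 0;
}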

I didn't really manage to get into the details of your code, as I don't know enough Java at this point in my studies, though I have some experience with algorithms and C programming, so here's my opinion:
Bucket sort assumes a fair distribution of the elements over the array; that is actually a condition for your bucket sort to run in O(n). Note that in the worst case you can end up putting a major share of the elements into one of your buckets, so in the next step you are dealing with almost the same problem you were trying to solve in the first place, which leads to bad performance.
Note that the actual time complexity of bucket sort is O(n+k), where k is the number of buckets. Did you count your buckets? Is k = O(n)?
The most time-wasting problem in bucket sort is the empty buckets left over after the partitioning is done: when concatenating the sorted buckets, you cannot tell whether a bucket is empty without actually checking it.
Hope I helped.

Related

How to approximate the size of an std::unordered_map in C++

It is said that an unordered_map<int,int> takes up much more space than a vector<int>. While I am completely aware of that, I would like to know how to get the approximate size of a single instance of an unordered_map in C++. For now, let's say that I inserted n = 1000000 elements into it. I presume that the memory taken up is n multiplied by some kind of constant, I am, however, unable to find an accurate answer anywhere on the Internet. Here is what I'm doing. I'd like to calculate how much memory u_m uses, without writing any code. Is there a way to do that?
#include <bits/stdc++.h>
using namespace std;

const int N = 1000000;
unordered_map<int,int> u_m;

int main(){
    for(int i = 0; i < N; i++){
        u_m[i] = 123 + i;
    }
    return 0;
}
If that makes a difference, I intentionally put u_m outside of main.
There is no general purpose answer to this. The memory used can vary wildly based on the implementation. To be clear, unordered_map is not tree-based; it's typically implemented as an array of buckets.
But while the spec allows you to know how many buckets are currently in play (via bucket_count) and you can ask for the number of items in each bucket (with bucket_size), there is no way to ask how a bucket is implemented. Based on the various requirements of methods like bucket_size and extract/merge, it's likely a bare bones linked list (bucket_size is allowed to be O(n) in the size of the bucket, so it needn't know its own size directly; extract needs to be able to return a handle that can be transferred between unordered_maps, and merge is guaranteed not to copy or move when moving elements from one unordered_map to another), but the details of implementation are largely hidden.
There's also no guarantee on what is stored in the first place. It could be just key and value, or key, value and hash, or something else.
So while you can get basic info from the various .bucket* APIs, since the contents and implementation of a "bucket" is itself essentially unspecified, you'll never get an answer to "how big is an unordered_map" from any C++ standard APIs; you'd need to know the implementation details and use them alongside the .bucket* APIs, to get an estimate.
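As a rough illustration (the per-node layout below is an assumption modelled on a typical node-based implementation, not anything the standard guarantees), you can combine sizeof with the bucket APIs to get an order-of-magnitude estimate:

// Rough estimate only: assumes each element lives in a linked node of roughly
// sizeof(pair) plus two pointers' worth of overhead (next pointer / cached hash),
// plus one pointer per bucket for the bucket array. Real implementations differ.
#include <cstdio>
#include <unordered_map>
#include <utility>

int main() {
    std::unordered_map<int, int> u_m;
    for (int i = 0; i < 1000000; ++i) u_m[i] = 123 + i;

    std::size_t per_node    = sizeof(std::pair<const int, int>) + 2 * sizeof(void*);
    std::size_t node_bytes  = u_m.size() * per_node;
    std::size_t table_bytes = u_m.bucket_count() * sizeof(void*);
    std::printf("buckets: %zu, estimated memory: ~%zu bytes\n",
                u_m.bucket_count(), node_bytes + table_bytes);
    return 0;
}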
An unordered_map is not a tree; it is a hash table. Its size depends on several things: the amount you've inserted, but also things like calling reserve to pre-allocate memory. The specific settings for initial allocation size and load factors are implementation dependent, so any guess you make will probably differ between compilers, and the order of operations that determines when and by how much the hash table resizes will differ too.

Difference between multimap and unordered_multimap in c++? [duplicate]


Why is my insertion into STL list running slow?

I'm trying to implement an undirected graph using adjacency list. I used the following code:
int v, e;
scanf("%d%d", &v, &e);
list<int> graph[3000];
for(int i = 0; i < e; i++){
    int a, b;
    scanf("%d%d", &a, &b);
    graph[a].push_back(b);
    graph[b].push_back(a);
}
To test the running time of my code I created an input file with 3000 vertices and all possible edges. It took 2.2 seconds to run. I tried to optimise by changing it to a two-dimensional array as follows:
int graph[3000][3000];
for(int i = 0; i < e; i++){
    int a, b;
    scanf("%d%d", &a, &b);
    graph[a][p[a]] = b;
    graph[b][p[b]] = a;
    p[a]++;
    p[b]++;
}
where 'p' is an array of size 3000 initialised with all zeros. This code ran in just 0.35 seconds for the same input file. I'm using the gcc-4.3.2 compiler. I know insertion at the end of a list can be done in constant time, so why is the first version running slow? Is there a chance of optimising the linked-list implementation?
Thanks in advance
Avoid std::list. That's a doubly linked list, which is very cache unfriendly (the nodes are randomly distributed in memory) and involves a large overhead (2 pointers per element). So every time you append something, the list allocates 2*sizeof(void*)+sizeof(int) bytes and additionally some memory management overhead of operator new.
Later in the algorithm, when you iterate over the values, you literally jump all over memory, which is again slow.
The 2d array doesn't have this problem, but it does waste some memory.
I usually represent an adjacency list as a vector of vectors.
std::vector<std::vector<int> > graph;
Note that a vector can also push_back values in amortized O(1) (as can a std::deque, which can append even faster but is slower when traversing). If the graph is expected to be dense, then an adjacency matrix may be a better choice.
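A minimal sketch of that representation (the per-vertex reserve is optional and the expected degree here is just an illustrative assumption):

// Sketch: adjacency list as a vector of vectors; push_back is amortized O(1)
// and the neighbours of each vertex end up contiguous in memory.
#include <cstdio>
#include <vector>

int main() {
    int v = 3000;
    std::vector<std::vector<int>> graph(v);
    for (auto& adj : graph) adj.reserve(16);      // optional: pre-reserve an expected degree

    auto add_edge = [&](int a, int b) {
        graph[a].push_back(b);
        graph[b].push_back(a);
    };
    add_edge(0, 1);
    add_edge(1, 2);
    std::printf("deg(1) = %zu\n", graph[1].size());
    return 0;
}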
Insertion into a list requires allocating a new node. So when you're doing your 6000 push-backs, you have to do 6000 memory allocations. In the array case, you don't have to do any allocations at all, so that's a lot faster. That's the full difference.
To expand on the answers here, implement a linked list class yourself, and you will find out why it is slow.
There are things that can be done, such as implementing a list that holds a capacity value, a size value and a pointer to its storage. That storage is actually a dynamic array, and when size == capacity, the array is resized and the capacity increased by some factor (e.g. 10).
The drawback is that it is limited to 2^(sizeof capacity * CHAR_BIT) - 1 elements, whereas allocating nodes one at a time gives longer insertion times with the benefit of a theoretically unlimited number of nodes. You'd most likely run out of memory before maxing out the capacity of the faster list implementation, but there is no guarantee of that, not to mention that resizing the list usually involves making a copy of it, so that capacity maximum effectively has a much smaller limit anyway.
Linked lists are generally slow. They have their uses, but if you need fast run times, find a better implementation, use a different container such as std::vector, or create a solution yourself, though honestly the standard containers do pretty well.
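For illustration, a bare-bones sketch of the capacity-growing container described above (doubling is used here for simplicity; the growth factor is a tuning choice, and in practice this is essentially what std::vector already does):

// Sketch: a size/capacity/pointer container that grows by a factor when full,
// i.e. roughly what std::vector does under the hood.
#include <cstddef>
#include <cstdio>

struct IntArrayList {
    int* data = nullptr;
    std::size_t size = 0, capacity = 0;

    void push_back(int x) {
        if (size == capacity) {
            std::size_t newCap = capacity ? capacity * 2 : 8;
            int* newData = new int[newCap];
            for (std::size_t i = 0; i < size; ++i) newData[i] = data[i];  // copy on resize
            delete[] data;
            data = newData;
            capacity = newCap;
        }
        data[size++] = x;
    }
    ~IntArrayList() { delete[] data; }
};

int main() {
    IntArrayList a;
    for (int i = 0; i < 100; ++i) a.push_back(i);
    std::printf("size=%zu capacity=%zu last=%d\n", a.size, a.capacity, a.data[a.size - 1]);
    return 0;
}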

How to speedup non-sequential array filling?

Context
I have code like this:
..
vector<int> values = ...;
vector<vector<int>> buckets;
// reserve space for values and each buckets sub-vector
for (int i = 0; i < values.size(); i++) {
    buckets[values[i]].push_back(i);
}
...
So I get "buckets" containing, for each value, the indexes of the entries that have that value. Those buckets are then used in further processing.
Actually I'm working with native dynamic arrays (int ** buckets;) but for simplicity's sake I've used vectors above.
I know the size of each bucket before filling.
Size of vectors is about 2,000,000,000.
The problem
As you can see, the code above accesses the "buckets" array in a random manner. Thus it has constant cache misses that slow execution time dramatically.
Yes, I see such misses in the profiling report.
Question
Is there a way to improve the speed of such code?
I've tried creating an auxiliary vector and putting the first occurrence of each value there, so that I can put two indexes into the corresponding bucket once I find the second occurrence. This approach didn't give any speedup.
Thank you!
Why are you assuming it's cache misses that make your code slow? Have you profiled or is that just what came to mind?
There's a number of things very wrong with your code from a performance perspective. The first and most obvious is that you never reserve a vector size. What's happening is that your vector starts out very small (say, 2 elements); then each time you add past its capacity, it resizes and copies the contents over to a new memory location. If you're saying there are 2 billion entries, you're resizing maybe 30 times!
You need to call the function vector.reserve() (or vector.resize(), depending on what behavior works best for you) before you look at other improvements.
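To illustrate (a sketch based on the simplified vector version above; maxValue and the sample data are made up), the pre-sizing can come from a counting pass, since the question says the bucket sizes are known beforehand:

// Sketch: count bucket sizes first, reserve each bucket, then fill.
// This avoids repeated reallocation; the cache-miss pattern itself stays the same.
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> values = {2, 0, 1, 2, 2, 1, 0, 2};
    int maxValue = 2;                                        // assumed known
    std::vector<std::size_t> counts(maxValue + 1, 0);
    for (int v : values) ++counts[v];                        // counting pass

    std::vector<std::vector<int>> buckets(maxValue + 1);
    for (int v = 0; v <= maxValue; ++v) buckets[v].reserve(counts[v]);

    for (std::size_t i = 0; i < values.size(); ++i)
        buckets[values[i]].push_back(static_cast<int>(i));   // store index, as in the question

    std::printf("bucket 2 holds %zu indices\n", buckets[2].size());
    return 0;
}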
EDIT
Seriously? You mention in your PS that you're not even using a vector? How are we supposed to guess what your actual code looks like and how it will perform?
Is foo at least reversible and surjective on a given interval? Then you can run through that interval and fill buckets[j] completely with bar(j,k) (if bar is the inverse of foo), for k in [0, ..., MAX_BAR_J), then continue with j+1 and so forth.
If, however, foo has hashing properties, you have very little chance, because you cannot predict which index the next i will land in. So I see no chance right now.

Difference in performance between map and unordered_map in c++

I have a simple requirement: I need a map of type <int,int>; however, I need the fastest theoretically possible retrieval time.
I used both map and the newly proposed unordered_map from TR1.
I found that, at least while parsing a file and creating the map by inserting one element at a time,
map took only 2 minutes while unordered_map took 5 minutes.
As it is going to be part of code executed on a Hadoop cluster and will contain ~100 million entries, I need the smallest possible retrieval time.
Another piece of helpful information:
currently the data (keys) being inserted is a range of integers from 1, 2, ... up to ~10 million.
I can also require the user to specify the max value and to insert in order as above; will that significantly affect my implementation? (I heard map is based on red-black trees, and inserting in increasing order leads to better (or worse?) performance.)
Here is the code:
map<int,int> Label; // this is being changed to unordered_map
fstream LabelFile("Labels.txt");
string inputLine, curnode, nodelabel; // declarations added for completeness
bool failed = false;

// Creating the map from Labels.txt
if (LabelFile.is_open())
{
    while (! LabelFile.eof() )
    {
        getline(LabelFile, inputLine);
        try
        {
            curnode = inputLine.substr(0, inputLine.find_first_of("\t"));
            nodelabel = inputLine.substr(inputLine.find_first_of("\t")+1, inputLine.size()-1);
            Label[atoi(curnode.c_str())] = atoi(nodelabel.c_str());
        }
        catch (char* strerr)
        {
            failed = true;
            break;
        }
    }
    LabelFile.close();
}
Tentative solution: after reviewing the comments and answers, I believe a dynamic C++ array would be the best option, since the implementation will use dense keys. Thanks.
Insertion into an unordered_map should be O(1) and retrieval should be roughly O(1) (it's essentially a hash table).
Your timings, as a result, are way off, or there is something wrong with your implementation or usage of unordered_map.
You need to provide some more information, and possibly show how you are using the container.
As per section 6.3 of n1836, the complexities for insertion/retrieval are given:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2005/n1836.pdf
One issue you should consider is that your implementation may need to keep rehashing the structure, since you say you have 100 million+ items. In that case, when instantiating the container, if you have a rough idea of how many "unique" elements will be inserted into it, you can pass that in as a parameter to the constructor and the container will be instantiated with a bucket table of appropriate size.
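For example (a sketch; the exact rehash policy is implementation-defined, and the counts here are illustrative rather than the question's real sizes):

// Sketch: size the hash table up front to avoid repeated rehashing while loading.
#include <cstddef>
#include <unordered_map>

int main() {
    std::size_t expected_entries = 1000000;                 // illustrative; the question mentions ~100 million
    std::unordered_map<int, int> Label(expected_entries);   // constructor taking an initial bucket count
    // Alternatively: Label.reserve(expected_entries);      // reserve() works in element counts
    for (std::size_t i = 0; i < expected_entries; ++i)
        Label[static_cast<int>(i)] = static_cast<int>(i);
    return 0;
}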
The extra time loading the unordered_map is due to dynamic array resizing. The resizing schedule doubles the number of cells each time the table exceeds its load factor. So from an empty table, expect O(lg n) copies of the entire data table. You can eliminate these extra copies by sizing the hash table upfront. Specifically
Label.reserve(expected_number_of_entries / Label.max_load_factor());
Dividing by the max_load_factor is to account for the empty cells that are necessary for the hash table to operate.
unordered_map (at least in most implementations) gives fast retrieval, but relatively poor insertion speed compared to map. A tree is generally at its best when the data is randomly ordered, and at its worst when the data is ordered (you constantly insert at one end of the tree, increasing the frequency of re-balancing).
Given that it's ~10 million total entries, you could just allocate a large enough array, and get really fast lookups -- assuming enough physical memory that it didn't cause thrashing, but that's not a huge amount of memory by modern standards.
Edit: yes, a vector is basically a dynamic array.
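A minimal sketch of that approach (assuming the keys really are dense integers up to roughly 10 million, as the question describes):

// Sketch: dense integer keys -> plain array lookup instead of a map.
#include <cstdio>
#include <vector>

int main() {
    const int max_key = 10000000;                 // question: keys roughly 1..10 million
    std::vector<int> label(max_key + 1, -1);      // -1 means "no entry"
    label[42] = 123;                              // insert
    if (label[42] != -1)                          // O(1) lookup, no hashing
        std::printf("label[42] = %d\n", label[42]);
    return 0;
}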
Edit 2: The code you've added has some problems. Your while (! LabelFile.eof() ) loop is broken. You normally want to do something like while (LabelFile >> inputdata) instead. You're also reading the data somewhat inefficiently -- what you're apparently expecting is two numbers separated by a tab. That being the case, I'd write the loop something like:
while (LabelFile >> node >> label)
    Label[node] = label;