Why is my insertion into STL list running slow? - c++

I'm trying to implement an undirected graph using adjacency list. I used the following code:
int v, e;
scanf("%d%d", &v, &e);
list<int> graph[3000];
for (int i = 0; i < e; i++) {
    int a, b;
    scanf("%d%d", &a, &b);
    graph[a].push_back(b);
    graph[b].push_back(a);
}
To test the running time of my code, I created an input file with 3000 vertices and all possible edges. It took 2.2 seconds to run. I tried to optimise by changing it to a two-dimensional array as follows:
int graph[3000][3000];
int p[3000] = {0};  // per-vertex edge counts, as described below
for (int i = 0; i < e; i++) {
    int a, b;
    scanf("%d%d", &a, &b);
    graph[a][p[a]] = b;
    graph[b][p[b]] = a;
    p[a]++;
    p[b]++;
}
where 'p' is of size 3000 and initialised with all zeros. This code ran in just 0.35 seconds for the same input file. I'm using the gcc 4.3.2 compiler. I know insertion at the end of a list can be done in constant time, so why is the first version running slow? Is there a way to optimise the linked-list implementation?
Thanks in advance

Avoid std::list. That's a doubly linked list, which is very cache-unfriendly (the nodes are scattered randomly in memory) and carries a large overhead (two pointers per element). So every time you append something, the list allocates 2*sizeof(void*) + sizeof(int) bytes, plus the bookkeeping overhead of operator new.
Later in the algorithm, when you iterate over the values, you literally jump all over memory, which slows things down further.
The 2d array doesn't have this problem, but it does waste some memory.
I usually represent an adjacency list as a vector of vectors.
std::vector<std::vector<int> > graph;
Note that a vector can also push_back values in amortized O(1) (as can a std::deque, which appends even faster but is slower to traverse). If the graph is expected to be dense, an adjacency matrix may be a better choice.
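For reference, a minimal sketch of the question's input loop rewritten with a vector of vectors (this assumes 0-based vertex indices; size the outer vector with v + 1 if the input is 1-based, and the reserve() hint is optional):

#include <cstdio>
#include <vector>

int main() {
    int v, e;
    std::scanf("%d%d", &v, &e);

    // One contiguous vector of neighbours per vertex.
    std::vector<std::vector<int> > graph(v);
    for (int i = 0; i < v; ++i)
        graph[i].reserve(64);            // optional hint; tune to the expected degree

    for (int i = 0; i < e; ++i) {
        int a, b;
        std::scanf("%d%d", &a, &b);
        graph[a].push_back(b);           // amortized O(1), no per-edge node allocation
        graph[b].push_back(a);
    }
    return 0;
}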

Insertion into a list requires allocating a new node. So when you're doing your 6000 push-backs, you have to do 6000 memory allocations. In the array case, you don't have to do any allocations at all, so that's a lot faster. That's the full difference.

To expand on the answers here, implement a linked list class yourself, and you will find out why it is slow.
There are things that can be done, such as implementing a list that stores a capacity, a size, and a pointer to the start of its storage. That pointer actually refers to a dynamic array, and when size == capacity the array is resized and the capacity increased by some factor (e.g. 10).
The drawback is that such a list is limited to 2^(sizeof capacity * CHAR_BIT) - 1 elements, whereas allocating a node per insertion means longer insertion times with the benefit of a theoretically unlimited number of nodes. You'd most likely run out of memory before maxing out the capacity of the faster list implementation, but there is no guarantee of that; and since resizing the list usually involves copying it, that capacity maximum has a much smaller practical limit anyway.
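A minimal sketch of that growable-array idea (the class name and growth factor are illustrative; std::vector does essentially this internally, and copying/assignment are omitted for brevity):

#include <cstddef>  // std::size_t
#include <cstring>  // std::memcpy

// Hypothetical "list" backed by a dynamic array, as described above.
class IntList {
    int*        data;
    std::size_t count;
    std::size_t capacity;
    // copy construction/assignment omitted for brevity
public:
    IntList() : data(0), count(0), capacity(0) {}
    ~IntList() { delete[] data; }

    void push_back(int value) {
        if (count == capacity) {                    // out of room: grow by a factor
            std::size_t newCap = capacity ? capacity * 2 : 8;
            int* bigger = new int[newCap];          // one allocation per growth step
            if (count)
                std::memcpy(bigger, data, count * sizeof(int));
            delete[] data;
            data = bigger;
            capacity = newCap;
        }
        data[count++] = value;                      // amortized O(1)
    }

    int operator[](std::size_t i) const { return data[i]; }
    std::size_t size() const { return count; }
};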
Linked lists are generally slow. They have their uses, but if you need fast run times, find a better implementation, use a different container such as std::vector, or create a solution yourself, though honestly the standard containers do pretty well.

Related

Why is it so slow to add or remove elements in the middle of a vector?

According to Accelerated C++:
To use this strategy, we need a way to remove an element from a vector. The good news is that such a facility exists; the bad news is that removing elements from vectors is slow enough to argue against using this approach for large amounts of input data. If the data we process get really big, performance degrades to an astonishing extent.
For example, if all of our students were to fail, the execution time of the function that we are about to see would grow proportionally to the square of the number of students. That means that for a class of 100 students, the program would take 10,000 times as long to run as it would for one student. The problem is that our input records are stored in a vector, which is optimized for fast random access. One price of that optimization is that it can be expensive to insert or delete elements other than at the end of the vector.
The authors do not explain why the vector would be so slow for 10,000+ students, and why in general it is slow to add or remove elements to the middle of a vector. Could somebody on Stack Overflow come up with a beautiful answer for me?
Take a row of houses: if you build them in a straight line, then finding No. 32 is really easy: just walk along the road about 32 houses' worth, and you're there. But it's not quite so fun to add house No. 31½ in the middle: that's a big construction project with a lot of disruption to husband's/wife's and kids' lives. In the worst case, there is not enough space on the road for another house anyway, so you have to move all the houses to a different street before you even start.
Similarly, vectors store their data contiguously, i.e. in a continuous, sequential block in memory.
This is very good for quickly finding the nth element (as you simply have to trundle along n positions and dereference), but very bad for inserting into the middle as you have to move all the later elements along by one, one at a time.
Other containers are designed to be easy to insert elements, but the trade-off is that they are consequently not quite as easy to find things in. There is no container which is optimal for all operations.
When inserting elements into or removing elements from the middle of a std::vector<T>, all elements after the modification point need to be moved: when inserting they are moved towards the back, when removing they are moved forward to close the gap. The background is that std::vector<T> is basically just a contiguous sequence of elements.
Although this operation isn't too bad for certain types, it can become comparatively slow. Note, however, that the container needs to reach a reasonable size before the cost of the moves becomes significant: for small vectors, inserting into or removing from the middle is probably faster than using other data structures, e.g. lists. Eventually the cost of maintaining a more complex structure does pay off, however.
std::vector allocates its memory as one contiguous extent. If you need to insert an element in the middle of that extent, you have to shift every later element of the vector one position to the right to make a free slot where the new element will be inserted. Moreover, if the extent is already full, the vector has to allocate a new, larger extent and copy all elements from the original extent to the new one.
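A rough sketch that makes the cost visible (timings are illustrative and will vary by machine and compiler): appending N elements costs amortized O(1) each, while inserting each one at the front shifts everything already stored, giving O(N^2) overall.

#include <cstdio>
#include <ctime>
#include <vector>

int main() {
    const int N = 100000;

    std::vector<int> a;
    std::clock_t t0 = std::clock();
    for (int i = 0; i < N; ++i)
        a.push_back(i);                  // amortized O(1) per element
    std::clock_t t1 = std::clock();

    std::vector<int> b;
    for (int i = 0; i < N; ++i)
        b.insert(b.begin(), i);          // O(size) shift per insertion -> O(N^2) total
    std::clock_t t2 = std::clock();

    std::printf("push_back: %ld ticks, insert at front: %ld ticks\n",
                (long)(t1 - t0), (long)(t2 - t1));
    return 0;
}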

Graph memory implementation

The two ways commonly used to represent a graph in memory are to use either an adjacency list or an adjacency matrix. An adjacency list is implemented using an array of pointers to linked lists. Is there any reason that this is faster than using a vector of vectors? I feel like it should make searching and traversals faster because backtracking would be a lot simpler.
The vector of linked adjacencies is a favorite textbook meme with many variations in practice. Certainly you can use vectors of vectors. What are the differences?
One is that links (double ones anyway) allow edges to be easily added and deleted in constant time. This obviously is important only when the edge set shrinks as well as grows. With vectors for edges, any individual operation may require O(k) where k is the incident edge count.
NB: If the order of edges in adjacency lists is unimportant for your application, you can easily get O(1) deletions with vectors. Just copy the last element to the position of the one to be deleted, then delete the last! Alas, there are many cases (e.g. where you're worried about embedding in the plane) when order of adjacencies is important.
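A minimal sketch of that swap-and-pop deletion for a vector-of-vectors adjacency list (it removes only one direction of the edge and assumes the position of the entry is already known):

#include <cstddef>
#include <vector>

// Remove the edge stored at position 'pos' in vertex v's adjacency vector
// in O(1) by overwriting it with the last entry and shrinking the vector.
// Only valid when the order of adjacencies is irrelevant.
void remove_adjacency(std::vector<std::vector<int> >& graph,
                      int v, std::size_t pos) {
    std::vector<int>& adj = graph[v];
    adj[pos] = adj.back();   // copy the last element over the one being deleted
    adj.pop_back();          // drop the now-duplicated last element
}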
Even if order must be maintained, you can arrange for copying costs to amortize to an average that is O(1) per operation over many operations. Still in some applications this is not good enough, and it requires "deleted" marks (a reserved vertex number suffices) with compaction performed only when the number of marked deletions is a fixed fraction of the vector. The code is tedious and checking for deleted nodes in all operations adds overhead.
Another difference is overhead space. Adjacency list nodes are quite small: Just a node number. Double links may require 4 times the space of the number itself (if the number is 32 bits and both pointers are 64). For a very large graph, a space overhead of 400% is not so good.
Finally, linked data structures that are frequently edited over a long period may easily lead to highly non-contiguous memory accesses. This decreases cache performance compared to linear access through vectors. So here the vector wins.
In most applications, the difference is not worth worrying about. Then again, huge graphs are the way of the modern world.
As others have said, it's a good idea to use a generalized List container for the adjacencies, one that may be quickly implemented either with linked nodes or vectors of nodes. E.g. in Java, you'd use List and implement/profile with both LinkedList and ArrayList to see which works best for your application. NB ArrayList compacts the array on every remove. There is no amortization as described above, although adds are amortized.
There are other variations: Suppose you have a very dense graph, where there's a frequent need to search all edges incident to a given node for one with a certain label. Then you want maps for the adjacencies, where the keys are edge labels. Of course the maps can be hashes or trees or skiplists or whatever you like.
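For instance, with string edge labels the per-vertex adjacency might look like this (types and names are illustrative; an unordered_map, tree, or skip list could be swapped in):

#include <map>
#include <string>
#include <vector>

// Per-vertex adjacency keyed by edge label: label -> neighbouring vertex.
typedef std::map<std::string, int> Adjacency;

// Returns the neighbour of v reached via the edge with the given label,
// or -1 if no such edge exists.
int neighbour_by_label(const std::vector<Adjacency>& graph,
                       int v, const std::string& label) {
    Adjacency::const_iterator it = graph[v].find(label);   // O(log k) per lookup
    return it == graph[v].end() ? -1 : it->second;
}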
The list goes on. How to implement for efficient vertex deletion? As you might expect, there are alternatives here, too, each with advantages and disadvantages.

What makes this bucket sort function slow?

The function is defined as
void bucketsort(Array& A) {
    size_t numBuckets = A.size();
    iarray<List> buckets(numBuckets);

    // put in buckets
    for (size_t i = 0; i != A.size(); i++) {
        buckets[int(numBuckets * A[i])].push_back(A[i]);
    }

    //// get back from buckets (earlier version, kept for reference)
    //for (size_t i = 0, head = 0; i != numBuckets; i++) {
    //    size_t bucket_size = buckets[i].size();
    //    for (size_t j = 0; j != bucket_size; j++) {
    //        A[head + j] = buckets[i].front();
    //        buckets[i].pop_front();
    //    }
    //    head += bucket_size;
    //}

    // get back from buckets
    for (size_t i = 0, head = 0; i != numBuckets; i++) {
        while (!buckets[i].empty()) {
            A[head] = buckets[i].back();
            buckets[i].pop_back();
            head++;
        }
    }

    // insertion sort
    insertionsort(A);
}
where List is just list<double> in STL.
The contents of the array are generated randomly in [0,1). Theoretically, bucket sort should be faster than quicksort for large sizes because it is O(n), but it fails to be, as shown in the following graph.
I used google-perftools to profile it on an array of 10,000,000 doubles. It reports as follows:
It seems I should not use the STL list, but I wonder why. What does std::_List_node_base::_M_hook do? Should I write a list class myself?
PS: The experiment and improvements
I tried leaving only the code that puts elements into buckets, and this showed that most of the time is spent building up the buckets.
The following improvements were made:
- Use an STL vector for the buckets and reserve a reasonable amount of space for each bucket
- Use two helper arrays to store the information used in building the buckets, thus avoiding linked lists entirely, as in the following code
void bucketsort2(Array& A) {
    size_t numBuckets = ceil(A.size()/1000);
    Array B(A.size());
    // the extra slot at the end of head avoids a check for i == A.size()-1
    IndexArray head(numBuckets + 1, 0), offset(numBuckets, 0);

    for (size_t i = 0; i != A.size(); i++) {
        head[int(numBuckets * A[i]) + 1]++;   // note the +1
    }
    for (size_t i = 2; i < numBuckets; i++) { // head[1] is right already
        head[i] += head[i-1];
    }
    for (size_t i = 0; i < A.size(); i++) {
        size_t bucket_num = int(numBuckets * A[i]);
        B[head[bucket_num] + offset[bucket_num]] = A[i];
        offset[bucket_num]++;
    }
    A.swap(B);
    //insertionsort(A);
    for (size_t i = 0; i < numBuckets; i++)
        quicksort_range(A, head[i], head[i] + offset[i]);
}
The results are in the following graph,
where lines starting with "list" use a list as buckets, lines starting with "vector" use a vector as buckets, and lines starting with "2" use the helper arrays. By default insertion sort is used at the end; some runs use quicksort instead because the bucket sizes are big.
Note that "list" and "list, only put in", as well as "vector, reserve 8" and "vector, reserve 2", nearly overlap.
I will also try small sizes with enough memory reserved.
In my opinion, the biggest bottleneck here is memory management functions (such as new and delete).
Quicksort (of which STL probably uses an optimized version) can sort an array in-place, meaning it requires absolutely no heap allocations. That is why it performs so well in practice.
Bucket sort relies on additional working space, which is assumed to be readily available in theory (i.e. memory allocation is assumed to take no time at all). In practice, memory allocation can take anywhere from (large) constant time to linear time in the size of memory requested (Windows, for example, will take time to zero the contents of pages when they are allocated). This means standard linked list implementations are going to suffer, and dominate the running time of your sort.
Try using a custom list implementation that pre-allocates memory for a large number of items, and you should see your sort running much faster.
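One way to get that effect without writing a list class is to use std::vector buckets and reserve space up front, roughly as the questioner later did. A sketch (the per-bucket estimate is a guess that would need tuning):

#include <cstddef>
#include <vector>

// Vector-backed buckets with pre-reserved capacity, so most push_back
// calls need no heap allocation at all. Assumes values lie in [0,1)
// and that 'buckets' is non-empty.
void fill_buckets(const std::vector<double>& A,
                  std::vector<std::vector<double> >& buckets) {
    std::size_t numBuckets = buckets.size();
    if (numBuckets == 0) return;
    std::size_t expectedPerBucket = A.size() / numBuckets + 1;   // rough guess
    for (std::size_t i = 0; i != numBuckets; ++i)
        buckets[i].reserve(expectedPerBucket);
    for (std::size_t i = 0; i != A.size(); ++i)
        buckets[static_cast<std::size_t>(numBuckets * A[i])].push_back(A[i]);
}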
With
iarray<List> buckets(numBuckets);
you are basically creating a LOT of lists, and that can cost you a lot, especially in memory access, which is theoretically linear but in practice is not.
Try to reduce the number of buckets.
To verify my assertion, analyse your code's speed with only the creation of the lists.
Also, to drain the elements of the lists you should not loop on .size(), but rather do:
// get back from buckets
for (size_t i = 0, head = 0; i != numBuckets; i++)
    while (!buckets[i].empty())
    {
        A[head++] = buckets[i].front();
        buckets[i].pop_front();
    }
In some implementations .size() can be O(n). Unlikely, but...
After some research I found this page explaining what the code for std::_List_node_base::hook is.
It seems it only inserts an element at a given place in a list. That shouldn't cost a lot.
Linked lists are not arrays. They are substantially slower at operations like lookup. The STL sort may well have a specific version for lists that takes this into account and optimizes for it, but your function blindly ignores what container it's using. You should try using an STL vector as your array.
I think perhaps the interesting question is: why are you creating an inordinately large number of buckets?
Consider the input {1,2,3}, numBuckets = 3. The loop containing buckets[int(numBuckets*A[i])].push_back(A[i]); is going to unroll to
buckets[3].push_back(1);
buckets[6].push_back(2);
buckets[9].push_back(3);
Really? Nine buckets for three values...
Consider if you passed a permutation of the range 1..100. You'd create 10,000 buckets and only use 1% of them. ... and each of those unused buckets requires creating a List in it. ... and has to be iterated over and then discarded in the readout loop.
Even more exciting, sort the list 1..70000 and watch your heap manager explode trying to create 4.9 billion Lists.
I didn't really manage to get into the details of your code, as I don't know enough Java at this point in my studies, though I have had some experience with algorithms and C programming, so here's my opinion:
Bucket sort assumes a fair distribution of the elements across the buckets; that's really a precondition for your bucket sort to run in O(n). Notice that in the worst case you may put a major share of the elements into one of the buckets, so in the next step you are dealing with almost the same problem you were trying to fix in the first place, which leads to bad performance.
Notice that the ACTUAL time complexity of bucket sort is O(n + k), where k is the number of buckets. Did you count your buckets? Is k = O(n)?
The most time-wasting problem in bucket sort is the empty buckets left over after the partitioning into buckets is done: when concatenating your sorted buckets you can't tell whether a bucket is empty without actually testing it.
Hope I helped.

Difference in performance between map and unordered_map in c++

I have a simple requirement: I need a map of type <int, int>; however, I need the theoretically fastest possible retrieval time.
I used both map and the newly proposed unordered_map from TR1,
and I found that, at least while parsing a file and creating the map by inserting one element at a time,
map took only 2 minutes while unordered_map took 5 minutes.
As it is going to be part of code executed on a Hadoop cluster and will contain ~100 million entries, I need the smallest possible retrieval time.
Another piece of helpful information:
currently the data (keys) being inserted is a range of integers from 1, 2, ... up to ~10 million.
I can also require the user to specify the max value and to insert in increasing order as above; will that significantly affect my implementation? (I heard map is based on red-black trees, and inserting in increasing order leads to better performance (or worse?).)
Here is the code:
map<int, int> Label;  // this is being changed to unordered_map
fstream LabelFile("Labels.txt");

// Creating the map from Labels.txt
if (LabelFile.is_open())
{
    while (!LabelFile.eof())
    {
        getline(LabelFile, inputLine);
        try
        {
            curnode = inputLine.substr(0, inputLine.find_first_of("\t"));
            nodelabel = inputLine.substr(inputLine.find_first_of("\t") + 1, inputLine.size() - 1);
            Label[atoi(curnode.c_str())] = atoi(nodelabel.c_str());
        }
        catch (char* strerr)
        {
            failed = true;
            break;
        }
    }
    LabelFile.close();
}
Tentative solution: After reviewing the comments and answers, I believe a dynamic C++ array would be the best option, since the implementation will use dense keys. Thanks.
Insertion into unordered_map should be O(1) and retrieval should be roughly O(1) (it's essentially a hash table).
Your timings as a result are way OFF, or there is something WRONG with your implementation or usage of unordered_map.
You need to provide some more information, and possibly how you are using the container.
As per section 6.3 of N1836, the complexities for insertion/retrieval are given:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2005/n1836.pdf
One issue you should consider is that your implementation may need to keep rehashing the structure, since you say you have 100 million+ items. In that case, if you have a rough idea of how many "unique" elements will be inserted into the container, you can pass that in as a parameter to the constructor and the container will be instantiated with a bucket table of appropriate size.
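For example (the expected count here is a placeholder for whatever estimate you have):

#include <cstddef>
#include <tr1/unordered_map>  // <unordered_map> and std::unordered_map with C++11

// Rough, assumed estimate of how many entries will be inserted.
const std::size_t expected_entries = 100000000;

// Passing the estimate as the initial bucket count avoids repeated
// rehashing while the map is being loaded.
std::tr1::unordered_map<int, int> Label(expected_entries);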
The extra time loading the unordered_map is due to dynamic array resizing. The resizing schedule doubles the number of cells each time the table exceeds its load factor. So, starting from an empty table, expect O(lg n) copies of the entire data table. You can eliminate these extra copies by sizing the hash table upfront. Specifically:
Label.reserve(expected_number_of_entries / Label.max_load_factor());
Dividing by the max_load_factor is to account for the empty cells that are necessary for the hash table to operate.
unordered_map (at least in most implementations) gives fast retrieval, but relatively poor insertion speed compared to map. A tree is generally at its best when the data is randomly ordered, and at its worst when the data is ordered (you constantly insert at one end of the tree, increasing the frequency of re-balancing).
Given that it's ~10 million total entries, you could just allocate a large enough array, and get really fast lookups -- assuming enough physical memory that it didn't cause thrashing, but that's not a huge amount of memory by modern standards.
Edit: yes, a vector is basically a dynamic array.
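Given dense integer keys running up to about 10 million, that could be as simple as the following (max_key is an assumed bound that the user would supply):

#include <vector>

// Direct-indexed table for dense integer keys in [0, max_key].
// ~10 million ints is about 40 MB, modest by modern standards.
const int max_key = 10000000;               // assumed upper bound on the keys
std::vector<int> Label(max_key + 1, -1);    // -1 marks "no label stored yet"

// Insertion and lookup are each a single array index:
//   Label[key] = value;
//   int value = Label[key];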
Edit 2: The code you've added has some problems. Your while (! LabelFile.eof() ) is broken. You normally want to do something like while (LabelFile >> inputdata) instead. You're also reading the data somewhat inefficiently -- what you're apparently expecting is two numbers separated by a tab. That being the case, I'd write the loop something like:
while (LabelFile >> node >> label)
    Label[node] = label;
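Put together, a self-contained version of that loading loop might look like the following (the file name and map type follow the question's code; this is a sketch, not the original program):

#include <fstream>
#include <tr1/unordered_map>  // or <unordered_map> / std::unordered_map with C++11

int main() {
    std::tr1::unordered_map<int, int> Label;
    std::ifstream LabelFile("Labels.txt");

    int node, label;
    // operator>> skips whitespace (tabs and newlines alike), and the loop
    // stops cleanly on end-of-file or malformed input -- no eof() test needed.
    while (LabelFile >> node >> label)
        Label[node] = label;

    return 0;
}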