I'm working with a large dataset. When inserting elements one by one into a set or a vector, I was confused about the following (in C++):
Should I reserve enough space for the vector before insertion, and then add to the vector one by one?
Or, should I insert the elements into a set one by one (since insertion into a set is faster than into a vector) and then copy that set to a vector at once?
Which one would be more time-efficient?
Should I reserve enough space for the vector before insertion, and
then add to the vector one by one?
Yes. By allocating space for all required vector elements up front you avoid additional memory allocations and the copying they cause. Use the vector only if you don't need the elements in any particular order and if you will only add elements to the end of the vector. Inserting elements somewhere in the middle is very inefficient, because the vector stores all elements in contiguous memory and would therefore need to move all elements after the insertion point out of the way before inserting the new element.
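For example, a minimal sketch of the reserve-then-append pattern (element type and count are placeholders):
#include <cstddef>
#include <vector>

int main() {
    const std::size_t expected_count = 1000000; // known or estimated number of elements (placeholder)
    std::vector<int> values;
    values.reserve(expected_count);             // one allocation up front, no reallocations below
    for (std::size_t i = 0; i < expected_count; ++i) {
        values.push_back(static_cast<int>(i));  // always appends at the end
    }
}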
Or, should I insert the elements to a set one by one (since insertion
to a set is faster than that of a vector) and then that set at once to
a vector?
If you need the elements to be in a specific (sorted) order then you should use the set. The set will place each element into the "right" position efficiently, assuming that your elements are of a type that set knows how to compare (for example, numeric types or std::string); otherwise you may need to supply your own comparison function. Alternatively, if you have more complex data but can identify a sortable key, you may want to look at map instead of set. Note that there is no implicit conversion from a set to a vector "at once", but you can copy the whole set into a vector in a single statement using the vector's range constructor (or loop through the set and append).
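A minimal sketch of that route, using the vector's range constructor to copy the set in one statement (int chosen just for illustration):
#include <set>
#include <vector>

int main() {
    std::set<int> ordered;  // keeps elements sorted and unique
    ordered.insert(42);
    ordered.insert(7);
    ordered.insert(42);     // duplicate, silently ignored

    // Range constructor: copies the set's elements, already in sorted order.
    std::vector<int> sorted_values(ordered.begin(), ordered.end());
}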
Which one would be more time-efficient?
Considering that you have a large amount of data as input and assuming that the data is in random order:
If you don't care about the order of elements then calling push_back on a vector is most likely faster.
If you plan to insert elements somewhere in the middle then the set is most likely faster, probably even if you need to transfer the data to a vector in a second step.
All this depends a bit on the type of the data, the comparisons you may need to perform, the standard library implementation and the compiler.
Now that you know the expected result I suggest that you try both. Test and measure!
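If it helps, here is a rough and very synthetic way to time both approaches; treat it only as a sketch and benchmark with your real data and element type:
#include <chrono>
#include <cstddef>
#include <iostream>
#include <set>
#include <vector>

int main() {
    const std::size_t n = 1000000;

    auto t0 = std::chrono::steady_clock::now();
    std::vector<int> v;
    v.reserve(n);
    for (std::size_t i = 0; i < n; ++i) v.push_back(static_cast<int>(i));
    auto t1 = std::chrono::steady_clock::now();

    std::set<int> s;
    for (std::size_t i = 0; i < n; ++i) s.insert(static_cast<int>(i));
    std::vector<int> from_set(s.begin(), s.end());
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::milliseconds;
    std::cout << "vector:    " << std::chrono::duration_cast<ms>(t1 - t0).count() << " ms\n"
              << "set+copy:  " << std::chrono::duration_cast<ms>(t2 - t1).count() << " ms\n";
}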
I have a decision-making problem here. In my application, I need to merge two vectors. I can't use the STL algorithms since the data order is important (it should not be sorted).
Both vectors contain data that can sometimes be the same, or up to 75% different in the worst case.
Currently I am torn between two approaches:
Approach 1:
a. Take an element from the smaller vector.
b. Compare it with the elements in the bigger one.
c. If the element matches, skip it (I don't want duplicates).
d. If the element is not found in the bigger one, calculate the proper position to insert it.
e. Resize the bigger one to insert the element (multiple resizes may happen).
Approach 2:
a. Iterate through both vectors to find the positions of matching elements.
b. Resize the bigger one in one go by calculating the total size required.
c. Take the smaller vector and go through the elements which did not match.
d. Insert each element at the appropriate position.
Kindly help me choose the proper one. And if there is a better approach or simpler technique (like STL algorithms), or an easier container than vector, please post it here. Thank you.
You shouldn't be focusing on the resizes. In approach 1, you should use vector::insert() so you don't actually need to resize the vector yourself. This may cause reallocations of the underlying buffer to happen automatically, but std::vector is carefully implemented so that the total cost of these operations is small.
The real problem with your algorithm is the insert, and maybe the search (which you didn't detail). When you insert into a vector anywhere except at the end, all the elements after the insertion point must be moved up in memory, and this can be quite expensive.
If you want this to be fast, you should build a new vector from your two input vectors, by appending one element at a time, with no inserting in the middle.
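A sketch of that append-only idea, assuming for illustration that the merged order is simply "all of the bigger vector, then the unmatched elements of the smaller one" (the real interleaving rule depends on your application); a std::set is used here for the duplicate check:
#include <set>
#include <vector>

using Item = int; // hypothetical element type; replace with the real one

std::vector<Item> merge_unique(const std::vector<Item>& bigger,
                               const std::vector<Item>& smaller) {
    std::vector<Item> result;
    result.reserve(bigger.size() + smaller.size()); // upper bound on the final size

    std::set<Item> seen(bigger.begin(), bigger.end()); // O(n log n) to build
    result.assign(bigger.begin(), bigger.end());       // keep the bigger vector's order

    for (const Item& x : smaller) {
        if (seen.insert(x).second) { // true only if x has not been seen before
            result.push_back(x);     // append, never insert in the middle
        }
    }
    return result;
}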
It doesn't look like you can do this in better time complexity than O(n log n), because removing duplicates from an unsorted vector takes O(n log n) time. So using a set to remove the duplicates might be the best thing you can do.
Here n is the total number of elements in both vectors.
Depending on your actual setup (for example, if you're adding object pointers to a vector instead of copying values into it), you might get significantly faster results using a std::list. std::list allows constant-time insertion, which can be a huge performance win here.
Doing the insertion might be a little awkward but is completely doable by changing only a few pointers (inexpensive), versus insertion into a vector, which moves every element after the insertion point out of the way to put the new one down.
If they need to end up as vectors, you can then convert the list to a vector with something like (untested)
std::list<thing> things;
// efficiently combine the vectors into a list
// since list is MUCH better for inserts
// but we still need it as a vector anyway
std::vector<thing> things_vec;
things_vec.reserve(things.size()); // allocate memory
// now move them into the vector
things_vec.insert(
    things_vec.begin(),
    std::make_move_iterator(things.begin()),
    std::make_move_iterator(things.end())
);
// things_vec now has the same content and order as the list with very little overhead
For my application I am using std::vector. I am inserting into the vector at the end, but erasing from the vector randomly, i.e. an element can be erased from the middle, the front, anywhere. These are the only two requirements: 1) insert at the end, 2) erase from anywhere.
So should I use std::list, since erasing from a vector shifts data? Or should I retain the vector in my code for some other reason?
Please comment: if vector is the better option, how would it be better than list here?
One key reason to use std::vector over std::list is cache locality. A list is terrible in this regard, because its elements can be (and usually are) fragmented in your memory. This will degrade performance significantly.
Some would recommend using std::vector almost always. In terms of performance, cache locality is often more important than the complexity of insertion or deletion.
Here's a video of Bjarne Stroustrup's opinion on the subject.
I would refer you to this cheat sheet; its conclusion would be the list.
A list supports deletion at an arbitrary but known position in constant time.
Finding that position takes linear time, just like modifying a vector.
The only advantage of the list is if you repeatedly erase (or insert) at (approximately) the same position.
If you're erasing more or less at random, chances are that the better memory locality of the vector could win out in the end.
The only way to be sure is to measure and compare.
A list is probably better in this case. The advantage of a list over a vector is that it supports deletion at an arbitrary (known) position with constant complexity. A vector would only be a better choice if you require constant-time indexing of the container's elements. You also have to take into consideration how the element you would like to delete is passed to your deletion function: if you only pass an index, a vector will find the element in constant time, while with a list you will have to iterate to it. In this case I would benchmark the two solutions, but I would still bet on the list performing better.
It depends on many factors and on how you are using your data.
One factor: do you need an erase that maintains the order of the collection, or can you live with the order changing?
Another factor: what kind of data is in the collection: numbers (ints/floats), pointers, or objects?
Not keeping order
You could continue using vector if the data is basic numbers or pointers. To delete one element, copy the last element of the vector over the position being deleted, then pop_back(). This way you avoid moving all the data.
If using objects, you could still use the same method if the object you need to copy is small.
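A sketch of that swap-with-last trick as a small helper (index-based, order not preserved):
#include <cstddef>
#include <utility>
#include <vector>

// Erase the element at 'index' without preserving order; assumes index is valid.
template <typename T>
void erase_unordered(std::vector<T>& v, std::size_t index) {
    if (index + 1 != v.size())              // skip the self-move when erasing the last element
        v[index] = std::move(v.back());     // overwrite the victim with the last element
    v.pop_back();                           // drop the now-redundant last element
}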
Keeping order
Maybe list would be your friend here. Still, some tests are advised; it depends on the size of the data, the size of the list, etc.
I was just about to implement my own class for efficient removal from an array, but thought I'd ask to see if anything like it already exists. What I want is list-like access efficiency but using an array. I want to use an array for reasons of cache coherence and so I don't have to continually be calling a memory allocator (as using std::list would when allocating nodes).
What I thought about doing was creating a class with two arrays. The first is a set of elements and the second array is a set of integers where each integer is a free slot in the first array. So I can add/remove elements from the array fairly easily, without allocating new memory for them, simply by taking an index from the free list and using that for the new element.
Does anything like this exist already? If I do my own, I'll have to also make my own iterators, so you can iterate the set avoiding any empty slots in the array, and I don't fancy that very much.
Thanks.
Note: The kind of operations I want to perform on the set are:
Iteration
Random access of individual elements, by index (or "handle" as I'm thinking of it)
Removal of an element anywhere in the set
Addition of an element to the set (order unimportant)
std::list<T> actually does sound exactly like the theoretically correct data structure for your job, because it supports the four operations you listed, all with optimal space and time complexity. std::list<T>::iterator is a handle that remains valid even if you add/remove other items to/from the list.
It may be that there is a custom allocator (i.e. not std::allocator<T>) that you could use with std::list<T, Allocator> to get the performance you want (internally pool nodes and then don't do a runtime allocation every time you add or remove a node). But that might be overkill.
I would start just using a std::list<T> with the default allocator and then only look at custom allocators or other data structures if you find the performance is too bad for your application.
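As a minimal sketch of the "iterator as handle" idea (the element type and values are made up):
#include <iterator>
#include <list>

struct Particle { float x, y; }; // hypothetical element type

int main() {
    std::list<Particle> particles;
    particles.push_back({1.0f, 2.0f});
    auto handle = std::prev(particles.end()); // "handle" to the element just added

    particles.push_back({3.0f, 4.0f});        // other insertions...
    particles.push_front({0.0f, 0.0f});       // ...do not invalidate 'handle'

    particles.erase(handle);                  // O(1) removal via the handle
}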
If maintaining order of elements is irrelevant, use swap-and-pop.
Copy/move the last element over the one to be removed, then pop the back element. Super easy and efficient. You don't even need to bother with special checks for removing the element since it'll Just Work(tm) if you use the standard C++ vector and operations.
*iter = std::move(container.back());
container.pop_back();
I don't recall whether pop_back() invalidates iterators on a vector, but I don't think it does (it only invalidates iterators to the erased last element). If it does, just use indices directly, or use them to recalculate a valid iterator:
auto delta = iter - container.begin();
// mutate container
iter = container.begin() + delta;
You can use a single array by storing the information about the "empty" slots in the space of the empty slots.
For a contiguous block of empty slots in your array A, say of k slots starting from index n, store (k, n') at location A[n] (where n' is the index of the next block of free indexes). You may have to pack the two ints into a single word if your array is storing word-sized objects.
You're essentially storing a linked-list of free blocks, like a memory-manager might do.
It's a bit of a pain to code, but this'll allow you to allocate a free index in O(1) time, and to iterate through the allocated indices in O(n) time, where n is the number of allocated slots. Freeing an index will be O(n) time though in the worst case: this is the same problem as fragmented memory.
For the first free block, you can either store the index separately, or have the convention that you never allocate A[0] so you can always start a free-index search from there.
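This is only a simplified sketch of the idea (it tracks individual free slots rather than (k, n') blocks and keeps an explicit free-list head instead of the A[0] convention); all names are illustrative:
#include <cstddef>
#include <vector>

// Free slots form a singly linked list threaded through the slot array itself.
// Assumes T is default-constructible.
template <typename T>
class SlotArray {
public:
    static constexpr std::size_t npos = static_cast<std::size_t>(-1);

    // O(1) amortized: reuse a free slot if there is one, otherwise grow.
    std::size_t insert(const T& v) {
        std::size_t i;
        if (free_head_ != npos) {
            i = free_head_;
            free_head_ = slots_[i].next_free; // pop the first free slot
        } else {
            i = slots_.size();
            slots_.push_back(Slot{});
        }
        slots_[i].value = v;
        slots_[i].used = true;
        return i;                             // this index is the stable "handle"
    }

    // O(1): mark the slot free and push it onto the free list.
    void erase(std::size_t i) {
        slots_[i].used = false;
        slots_[i].next_free = free_head_;
        free_head_ = i;
    }

    // Random access by handle.
    T& operator[](std::size_t i) { return slots_[i].value; }

    // Iteration that skips the empty slots.
    template <typename F>
    void for_each(F f) {
        for (Slot& s : slots_)
            if (s.used) f(s.value);
    }

private:
    struct Slot {
        T value{};
        std::size_t next_free = npos; // only meaningful while the slot is free
        bool used = false;
    };
    std::vector<Slot> slots_;
    std::size_t free_head_ = npos;
};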
std::map might be useful in your case.
I understand the algorithm for doing this but I don't know what data structure (array, linked list, vector, other?) would be best for returning the final set of sets since every example I see just asks to print the sets.
Can someone explain the thought process for deciding between the 3 data structures I mentioned?
Also, are vectors even used anymore? I heard they were obsolete but still see many recent examples.
To be clear, I mean ALL subsets, so they have different sizes.
The decision of which data structure to use depends on:
Type of data to be stored
Operations that you intend to perform on the data
A normal array would give you a contiguous block of memory and random access to the elements; however, you need to know the exact number of elements beforehand so that you can allocate an array of the appropriate size.
With std::vector you get random access and contiguous storage just like an array, but a vector is a dynamic array: it grows as you add new elements, with amortized constant cost per push_back. However, insertion/deletion is fast only at the end, since anywhere else all subsequent elements have to be moved.
With std::list you don't get random access, but insertion and deletion at a known position are fast because they only involve rewiring a few pointer links.
Also, are vectors even used anymore?
That is not true at all.
They are very much in use and one of the most widely used data structures provided by the Standard Library.
Once I used bitmasks to enumerate the subsets: if the i-th bit is 1 then the i-th element of the set is included in the subset, and excluded otherwise. In this case you need to fix an ordering of the elements. The same can be done with a vector of bools, I think.
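A sketch of that bitmask idea, returning all subsets as a vector of vectors (only practical for small sets, since there are 2^n subsets):
#include <cstddef>
#include <vector>

// Bit i of 'mask' decides whether items[i] is included in the subset.
template <typename T>
std::vector<std::vector<T>> all_subsets(const std::vector<T>& items) {
    const std::size_t n = items.size();
    std::vector<std::vector<T>> result;
    result.reserve(std::size_t(1) << n);

    for (std::size_t mask = 0; mask < (std::size_t(1) << n); ++mask) {
        std::vector<T> subset;
        for (std::size_t i = 0; i < n; ++i)
            if (mask & (std::size_t(1) << i))
                subset.push_back(items[i]);
        result.push_back(std::move(subset));
    }
    return result; // includes the empty subset and the full set
}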
I am maintaining a fixed-length table of 10 entries. Each item is a structure of like 4 fields. There will be insert, update and delete operations, specified by numeric position. I am wondering which is the best data structure to use to maintain this table of information:
array - insert/delete takes linear time due to shifting; update takes constant time; no space is used for pointers; accessing an item using [] is faster.
stl vector - insert/delete takes linear time due to shifting; update takes constant time; no space is used for pointers; accessing an item is (presumably) slower than with a plain array since it goes through a call to operator[].
stl list - insert and delete take linear time since you need to iterate to a specific position before applying the insert/delete; additional space is needed for pointers; accessing an item is slower than with an array since it requires a linear traversal of the linked list.
Right now, my choice is to use an array. Is it justifiable? Or did I miss something?
Which is faster: traversing a list, then inserting a node or shifting items in an array to produce an empty position then inserting the item in that position?
What is the best way to measure this performance? Can I just display the timestamp before and after the operations?
Use STL vector. It provides an interface as rich as list's and removes the pain of managing memory that arrays require.
You will have to try very hard to expose the performance cost of operator[] - it usually gets inlined.
I do not have any number to give you, but I remember reading performance analysis that described how vector<int> was faster than list<int> even for inserts and deletes (under a certain size of course). The truth of the matter is that these processors we use are very fast - and if your vector fits in L2 cache, then it's going to go really really fast. Lists on the other hand have to manage heap objects that will kill your L2.
Premature optimization is the root of all evil.
Based on your post, I'd say there's no reason to make your choice of data structure here a performance based one. Pick whatever is most convenient and return to change it if and only if performance testing demonstrates it's a problem.
It is really worth investing some time in understanding the fundamental differences between lists and vectors.
The most significant difference between the two is the way they store elements and keep track of them.
- Lists -
A list contains elements which store the addresses of the previous and next elements. This means that you can INSERT or DELETE an element anywhere in the list in constant time, O(1), regardless of the list size. You can also splice (insert another list) into the existing list anywhere with constant speed. The reason is that the list only needs to rewire a couple of pointers (the previous and next) around the element being inserted.
Lists are not good if you need random access. If you plan to access the nth element of a list, you have to traverse the list node by node: O(n) time.
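For example, splicing one list into another in constant time (a minimal sketch):
#include <iterator>
#include <list>

int main() {
    std::list<int> a{1, 2, 5, 6};
    std::list<int> b{3, 4};

    auto pos = std::next(a.begin(), 2); // iterator at the element '5'
    a.splice(pos, b);                   // O(1): a is now 1 2 3 4 5 6, b is empty
}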
- Vectors -
A vector contains elements in sequence, just like an array. This is very convenient for random access: accessing the nth element of a vector is a simple pointer calculation (O(1)). Adding elements to a vector is, however, different. If you want to add an element in the middle of a vector, all the elements that come after it have to be shifted to make room for the new entry. The cost depends on the vector size and on the position of the new element: the worst case is inserting at the front, the best case is appending at the end. Therefore insert works in O(n), where n is the number of elements that need to be moved, not necessarily the size of the vector.
There are other differences that involve memory requirements etc., but understanding these basic principles of how lists and vectors actually work is really worth spending some time on.
As always ... "Premature optimization is the root of all evil", so first consider what is more convenient and make things work exactly the way you want them, then optimize. For the 10 entries you mention it really does not matter what you use; you will never be able to see any kind of performance difference.
Prefer std::vector over an array. Some advantages of vector are:
They allocate memory from the free store when increasing in size.
They are NOT a pointer in disguise.
They can increase/decrease in size run-time.
They can do range checking using at().
A vector knows its size, so you don't have to count elements.
The most compelling reason to use a vector is that it frees you from explicit memory management, and it does not leak memory. A vector keeps track of the memory it uses to store its elements. When a vector needs more memory for elements, it allocates more; when a vector goes out of scope, it frees that memory. Therefore, the user need not be concerned with the allocation and deallocation of memory for vector elements.
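A small sketch illustrating those points:
#include <iostream>
#include <stdexcept>
#include <vector>

int main() {
    std::vector<int> v;            // no size needed up front
    v.push_back(10);
    v.push_back(20);               // grows at run time

    std::cout << v.size() << '\n'; // the vector knows its size: prints 2

    try {
        v.at(5) = 1;               // range-checked access throws...
    } catch (const std::out_of_range&) {
        std::cout << "out of range\n";
    }
}                                  // the vector's memory is freed automatically here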
You're making assumptions you shouldn't be making, such as "accessing an item is slower than an array since it is a call to operator[]." I can understand the logic behind it, but neither you nor I can know until we profile it.
If you do, you'll find there is no overhead at all when optimizations are turned on: the compiler inlines the function calls. There is a difference in memory behaviour, though. An array is statically allocated, while a vector allocates dynamically. A list allocates per node, which can thrash the cache if you're not careful.
Some solutions are to have the vector allocate from the stack, and have a pool allocator for a list, so that the nodes can fit into cache.
So rather than worry about unsupported claims, you should worry about making your design as clean as possible. So, which makes more sense? An array, vector, or list? I don't know what you're trying to do so I can't answer you.
The "default" container tends to be a vector. Sometimes an array is perfectly acceptable too.
First a couple of notes:
A good rule of thumb about selecting data structures: Generally, if you examined all the possibilities and determined that an array is your best choice, start over. You did something very wrong.
STL lists don't support operator[], and if they did the reason that it would be slower than indexing an array has nothing to do with the overhead of a function call.
Those things being said, vector is the clear winner here. The call to operator[] is essentially negligible since the contents of a vector are guaranteed to be contiguous in memory. It supports insert() and erase() operations which you would essentially have to write yourself if you used an array. Basically it boils down to the fact that a vector is an upgraded array which already supports all the operations you need.
I am maintaining a fixed-length table of 10 entries. Each item is a
structure of like 4 fields. There will be insert, update and delete
operations, specified by numeric position. I am wondering which is the
best data structure to use to maintain this table of information:
Based on this description it seems like list might be the better choice since it's O(1) when inserting and deleting in the middle of the data structure. Unfortunately you cannot use numeric positions when using lists to do inserts and deletes like you can for arrays/vectors. This dilemma leads to a slew of questions which can be used to make an initial decision about which structure may be best to use. This structure can later be changed if testing clearly shows it's the wrong choice.
The questions you need to ask are threefold. The first is how often you plan on doing deletes/inserts in the middle relative to random reads. The second is how important using a numeric position is compared to an iterator. Finally, is the order of elements in your structure important?
If the answer to the first question is that random reads will be more prevalent, then a vector/array will probably work well. Note that iterating through a data structure is not considered a random read even if the operator[] notation is used. For the second question, if you absolutely require numeric positions, then a vector/array will be required even though this may lead to a performance hit. Later testing may show this performance hit is negligible relative to the easier coding with numeric positions. Finally, if order is unimportant, you can insert and delete in a vector/array with an O(1) algorithm; a sample is shown below.
#include <vector>

template <class T>
void erase(std::vector<T>& vect, int index) // note: the vector cannot be const since you are changing it
{
    vect[index] = vect.back(); // move the item at the back to the erased index
    vect.pop_back();           // delete the item at the back
}

template <class T>
void insert(std::vector<T>& vect, int index, T value) // note: the vector cannot be const since you are changing it
{
    vect.push_back(vect[index]); // copy the item at index to the back of the vector
    vect[index] = value;         // replace the item at index with value
}
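For example, a possible usage of those helpers (note that neither preserves the order of elements):
#include <vector>
// assumes the erase/insert templates above are in scope

int main() {
    std::vector<int> table{10, 20, 30, 40};
    insert(table, 1, 25); // table is now {10, 25, 30, 40, 20}
    erase(table, 0);      // table is now {20, 25, 30, 40}
}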
I believe it depends on your needs: if you need frequent inserts/deletes at the beginning or in the middle, use a list (doubly linked internally); if you need random access and mostly add at the end, use an array or a vector (a vector allocates dynamically, and is the better choice if you also need operations such as sorting or resizing).