I am learning the STL and have been reading about the set container. My question is: when would you want to use a set? After reading its description it looks almost useless, because it seems we could substitute a vector for it. Could you list the pros and cons of vector vs. set containers? Thanks
A set is ordered. It is guaranteed to remain in a specific ordering, according to a functor that you provide. No matter what elements you add or remove (unless you add a duplicate, which is not allowed in a set), it will always be ordered.
A vector has exactly and only the ordering you explicitly give it. Items in a vector are where you put them. If you put them in out of order, then they're out of order; you now need to sort the container to put them back in order.
Admittedly, set has relatively limited use. With proper discipline, one could insert items into a vector and keep it ordered. However, if you are constantly inserting and removing items from the container, vector will run into many issues. It will be doing a lot of copying/moving of elements and so forth, since it is effectively just an array.
The time it takes to insert an item into a vector is proportional to the number of items already in the vector. The time it takes to insert an item into a set is proportional to the log₂ of the number of items. If the number of items is large, that's a huge difference. log₂(100,000) is ~16; that's a major speed improvement. The same goes for removal.
However, if you do all of your insertions at once, at initialization time, then there's no problem. You can insert everything into the vector, sort it (paying that price once), and then use standard algorithms for sorted vectors to find elements and iterate over the sorted list. And while iteration over the elements of a set isn't exactly slow, iterating over a vector is faster.
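For example, a minimal sketch of that pattern (the sample values and the searched-for number are arbitrary):
#include <algorithm>
#include <iostream>
#include <vector>
int main()
{
    int raw[] = {5, 3, 9, 1, 7};
    std::vector<int> v(raw, raw + 5);   // insert everything up front
    std::sort(v.begin(), v.end());      // pay the sorting price once
    // then look elements up in O(log n), the same complexity a set gives you
    bool found = std::binary_search(v.begin(), v.end(), 7);
    std::cout << std::boolalpha << found << std::endl;   // true
}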
So there are cases where a sorted vector beats a set. That being said, you really shouldn't bother with the expense of this kind of optimization unless you know that it is necessary. So use a set unless you have experience with the kind of system you're writing (and thus know that you need that performance) or have profiling data in hand that tells you that you need a vector and not a set.
They are different things: you decide how vectors are ordered, and you can also put as many equal things into a vector as you please. Sets are ordered according to the set's own rules (you may supply the comparison rule, but the set does the ordering), and you cannot put multiple equal items into a set.
Of course you could maintain a vector of unique items, but your performance would suffer a lot when you do set-oriented operations. For example, assume that you have a set of 10000 items and a vector of 10000 distinct, unordered items. Now suppose that you need to check whether a value X is among the values in the set (or among the values in the vector). When X is not among the items, searching the vector means scanning all 10000 elements, while the set needs only about 14 comparisons, so the vector search is several hundred times slower. You would see similar performance differences when calculating set unions and intersections.
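To make that concrete, here is a small sketch of the membership test (the 10000-element fill and the probed value are just for illustration):
#include <algorithm>
#include <iostream>
#include <set>
#include <vector>
int main()
{
    std::set<int> s;
    std::vector<int> v;
    for (int i = 0; i < 10000; ++i) { s.insert(i * 7); v.push_back(i * 7); }
    int x = 4243;   // a value that happens not to be present
    bool in_set = s.count(x) > 0;                               // ~14 comparisons
    bool in_vec = std::find(v.begin(), v.end(), x) != v.end();  // 10000 comparisons
    std::cout << in_set << " " << in_vec << std::endl;
}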
To summarize, sets and vectors have different purposes. You can use a vector instead of a set, but it would require more work, and would likely hurt the performance rather severely.
From cplusplus.com:
set:
Sets are containers that store unique elements following a specific
order.
so the set is ordered AND its items are unique
while for vector:
Vectors are sequence containers representing arrays that can change in
size.
so vector is in the order you fill it AND can hold multiple identical items
prefer set:
if you want identical values to be filtered out (each value kept only once)
if you want to process items in a specific order (doing this with a vector requires you to sort it explicitly).
prefer vector:
if you want to keep identical values
if you want to process items in the same order you pushed them (assuming you don't reorder the vector)
The simple difference is that a set can contain only unique values, and it is sorted. So you can use it for cases where the values must stay sorted after every insert/delete.
#include <cstdlib>
#include <iostream>
#include <set>
#include <vector>
int main()
{
    std::set<int> a;
    std::vector<int> b;
    for (int i = 0; i < 10; ++i)
    {
        int val = rand() % 10;
        a.insert(val);    // duplicates are silently dropped; elements stay sorted
        b.push_back(val); // duplicates are kept, in insertion order
    }
    std::cout << "--SET---\n"; for (auto i : a) std::cout << i << ","; std::cout << std::endl;
    std::cout << "--VEC---\n"; for (auto j : b) std::cout << j << ","; std::cout << std::endl;
}
The output is
--SET---
0,1,2,4,7,8,9,
--VEC---
1,7,4,0,9,4,8,8,2,4,
It is faster to search for an item in a set than in a vector (O(log n) vs O(n)). To search for an item in a vector you have to walk through its items one by one, whereas a set is typically implemented as a red-black tree, so only a few items need to be examined to find a match.
The set is ordered: you can only iterate over it from the smallest element to the biggest, or in reverse.
The vector, on the other hand, keeps no sorted order; you traverse it in the order you inserted the elements.
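A small illustration of that difference (the three values are arbitrary):
#include <iostream>
#include <set>
#include <vector>
int main()
{
    int raw[] = {3, 1, 2};
    std::set<int> s(raw, raw + 3);
    std::vector<int> v(raw, raw + 3);
    for (std::set<int>::iterator it = s.begin(); it != s.end(); ++it)
        std::cout << *it << " ";        // 1 2 3  (ascending order)
    std::cout << "\n";
    for (std::set<int>::reverse_iterator it = s.rbegin(); it != s.rend(); ++it)
        std::cout << *it << " ";        // 3 2 1  (descending order)
    std::cout << "\n";
    for (std::vector<int>::iterator it = v.begin(); it != v.end(); ++it)
        std::cout << *it << " ";        // 3 1 2  (insertion order)
    std::cout << "\n";
}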
Related
I am trying to understand ordered and unordered maps.
I understand that an ordered map will sort your entries and an unordered map will not, based on this site.
But why does this code come out reordered anyway?
#include <iostream>
#include <string>
#include <unordered_map>
#include <utility>
int main()
{
    std::unordered_map<std::string, int> map;
    map.insert(std::make_pair("hello", 1));
    map.insert(std::make_pair("world", 2));
    map.insert(std::make_pair("call", 3));
    map.insert(std::make_pair("ZZ", 4));
    for (auto it = map.begin(); it != map.end(); it++)
    {
        std::cout << it->first << std::endl;
    }
}
result:
world
hello
call
ZZ
I don't get it. It's supposed to be hello, then world, then call, then ZZ. How did world come first if an unordered container is not supposed to sort it?
std::hash<std::string> produces hashes for your input keys. Then a transformation is applied to these hashes to get a valid index to a bucket in your map (the most basic one you can think of is a simple modulo). These operations take constant time and give you the location of your items. In order to retrieve an item from its transformed hash later, it has to be stored at exactly that place. That is why you observe an apparently random ordering of your inputs.
Now there is NO assumption you can make about the order of the computed indices. Suppose you added some more strings to your map: a rehash might be needed, and your input strings could then appear in yet another order.
This has nothing to do with the order of insertion, which there is no reason to keep; the purpose of the data structure is to place the items at the appropriate location so that their retrieval can be made in amortized constant time (here, by placing them at the indices given by the transformed hashes).
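To see this in action, here is a small sketch; the exact hashes and bucket numbers are implementation-dependent, so the printed values will vary:
#include <functional>
#include <iostream>
#include <string>
#include <unordered_map>
#include <utility>
int main()
{
    std::unordered_map<std::string, int> map;
    map.insert(std::make_pair("hello", 1));
    map.insert(std::make_pair("world", 2));
    map.insert(std::make_pair("call", 3));
    map.insert(std::make_pair("ZZ", 4));
    std::cout << "bucket count: " << map.bucket_count() << "\n";
    for (const auto& kv : map)
        std::cout << kv.first << " -> hash " << std::hash<std::string>()(kv.first)
                  << ", bucket " << map.bucket(kv.first) << "\n";
}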
Probably the best answer to this question is to say: "study a little about data structures."
An unordered map stores its data in a special data structure designed to be efficient but without any restriction on storage order, and it is therefore "unordered".
Note that preserving the order of insertion would itself be a kind of restriction on the order.
Check out this data structure: hash table.
What they mean by "unordered" is that the elements are returned in a seemingly random order (based on the element hash).
I have two sets of pairs (I cannot use C++11):
std::set<std::pair<int,int> > first;
std::set<std::pair<int,int> > second;
and I need to remove from the first set all elements that are also in the second set (i.e., if first contains an element from second, remove it from first). I can do this by iterating through the second set and, if first contains the same element, erasing it from first, but I wonder: is there a way to do this without iterating?
If I understand correctly, basically you want to calculate the difference of first and second. There is an <algorithm> function for that.
// needs the <algorithm> and <iterator> headers
std::set<std::pair<int, int> > result;
std::set_difference(first.begin(), first.end(), second.begin(), second.end(),
                    std::inserter(result, result.end()));
Yes, you can.
If you want to remove elements, not just detect them, there is another <algorithm> function for that: remove_copy_if():
http://www.cplusplus.com/reference/algorithm/remove_copy_if/
IMHO it's not so difficult to understand how it works.
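For example, a C++03-style sketch of that suggestion, copying into a new set and skipping anything that second contains (the names subtract and IsInSecond are mine):
#include <algorithm>
#include <iterator>
#include <set>
#include <utility>
typedef std::set<std::pair<int, int> > PairSet;
// predicate functor: true if the pair is present in `second`
struct IsInSecond
{
    const PairSet* second;
    explicit IsInSecond(const PairSet& s) : second(&s) {}
    bool operator()(const std::pair<int, int>& p) const { return second->count(p) > 0; }
};
PairSet subtract(const PairSet& first, const PairSet& second)
{
    PairSet result;
    std::remove_copy_if(first.begin(), first.end(),
                        std::inserter(result, result.end()),
                        IsInSecond(second));   // copies only elements NOT in second
    return result;
}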
I wonder: is there a way to do this without iterating?
No. Internally, sets are balanced binary trees - there's no way to operate on them without iterating over the structure. (I assume you're interested in the efficiency of the implementation, not convenience in code, so I've deliberately ignored library routines that must iterate internally.)
Sets are sorted, though, so you could do a single pass over each, removing as you go (so the number of operations is the sum of the set sizes), instead of an iteration plus a lookup for each element (where the number of operations is the number of elements you're iterating over times log base 2 of the number of elements in the other set). Only if one of your sets is much smaller than the other will the iterate/find approach win out. If you look at the implementation of your library's set_difference function (mentioned in Amen's answer), it should show you how to do the two iterations nicely.
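A C++03 sketch of that "walk both sorted sets in step, erasing as you go" idea; the number of comparisons is proportional to the sum of the two set sizes:
#include <set>
#include <utility>
typedef std::set<std::pair<int, int> > PairSet;
void erase_all_in_second(PairSet& first, const PairSet& second)
{
    PairSet::iterator it1 = first.begin();
    PairSet::const_iterator it2 = second.begin();
    while (it1 != first.end() && it2 != second.end())
    {
        if (*it1 < *it2)
            ++it1;                    // element only in first: keep it
        else if (*it2 < *it1)
            ++it2;                    // element only in second: skip it
        else
        {
            first.erase(it1++);       // common element: erase it from first
            ++it2;
        }
    }
}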
If you want something more efficient, you need to think about how to achieve that earlier: for example, storing your pairs as flags in an identically sized two-dimensional matrix so that you can AND with the negation of the second set. Whether that's practical depends on the range of int values you're storing and on whether the amount of memory needed is acceptable for your purposes.
I have a list of items; I want to sort them, but I want a small element of randomness so they are not strictly in order, only on average ordered.
How can I do this most efficiently?
I don't mind if the quality of the randomness is not especially good; it could, for example, simply be based on the chance ordering of the input, e.g. an early-terminated, incomplete sort.
The context is implementing a nearly-greedy search by introducing a very slight element of inexactness; this is in a tight loop, so the speed of sorting and of calling random() both matter.
My current code is to do a std::sort (this being C++) and then do a very short shuffle just in the early part of the array:
for(int i=0; i<3; i++) // I know I have more than 6 elements
std::swap(order[i],order[i+rand()%3]);
Use the first two passes of JSort: build the heap twice, but do not perform the final insertion sort. If the element of randomness is not small enough, repeat.
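One possible reading of "build the heap twice", sketched with the standard heap algorithms (this is only an approximation of JSort's first two passes, not a faithful implementation): a min-heap pass drags small elements toward the front, a max-heap pass over the reversed range drags large elements toward the back, and the array ends up roughly, but not fully, sorted:
#include <algorithm>
#include <functional>
#include <vector>
void roughly_sort(std::vector<int>& v)
{
    // heap-building pass 1: the heap property pushes small values toward the front
    std::make_heap(v.begin(), v.end(), std::greater<int>());
    // heap-building pass 2 (on the reversed range): pushes large values toward the back
    std::make_heap(v.rbegin(), v.rend());
}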
There is an approach that (unlike incomplete JSort) allows finer control over the resulting randomness and has a time complexity that depends on the randomness (the more random the result needs to be, the lower the time complexity). Use heapsort with a soft heap. For a detailed description of the soft heap, see pdf 1 or pdf 2.
You could use a standard sort algorithm (is a standard library available?) and pass a predicate that "knows", given two elements, which is less than the other, or if they are equal (returning -1, 0 or 1). In the predicate then introduce a rare (configurable) case where the answer is random, by using a random number:
pseudocode:
if random(1000) == 0 then
    return random(2)-1 <-- -1, 0 or 1, chosen randomly
Here we have a 1/1000 chance of "scrambling" two elements, but that number really depends on the size of the container you are sorting.
Another refinement for that rare case is to exclude the "right" answer, because returning it would not scramble the result!
Edit:
if random(100 * container_size) == 0 then <-- here I consider the container size
{
if element_1 < element_2
return random(1); <-- do not return the "correct" value of -1
else if element_1 > element_2
return random(1)-1; <-- do not return the "correct" value of 1
else
return random(1)==0 ? -1 : 1; <-- do not return 0
}
in my pseudocode:
random(x) = y where 0 <= y <=x
One possibility that requires a bit more space but would guarantee that existing sort algorithms could be used without modification would be to create a copy of the sort value(s) and then modify those in some fashion prior to sorting (and then use the modified value(s) for the sort).
For example, if the data to be sorted is a simple character field Name[N] then add a field (assuming data is in a structure or class) called NameMod[N]. Fill in the NameMod with a copy of Name but add some randomization. Then 3% of the time (or some appropriate amount) change the first character of the name (e.g., change it by +/- one or two characters). And then 10% of the time change the second character +/- a few characters.
Then run it through whatever sort algorithm you prefer. The benefit is that you could easily change those percentages and randomness. And the sort algorithm will still work (e.g., it would not have problems with the compare function returning inconsistent results).
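A small sketch of the same idea on plain ints (the 3% probability and the +/-2 perturbation are just example numbers):
#include <algorithm>
#include <cstdlib>
#include <utility>
#include <vector>
void nearly_sort(std::vector<int>& data)
{
    // pair each value with a slightly perturbed copy of itself to sort on
    std::vector<std::pair<int, int> > keyed;
    keyed.reserve(data.size());
    for (std::size_t i = 0; i < data.size(); ++i)
    {
        int key = data[i];
        if (rand() % 100 < 3)          // ~3% of the keys get nudged...
            key += rand() % 5 - 2;     // ...by at most +/- 2
        keyed.push_back(std::make_pair(key, data[i]));
    }
    std::sort(keyed.begin(), keyed.end());   // the comparison itself stays consistent
    for (std::size_t i = 0; i < data.size(); ++i)
        data[i] = keyed[i].second;
}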
If you are sure that each element is at most k positions away from where it should be, you can reduce quicksort's N log(N) sorting time complexity down to N log(k)...
edit
More specifically, you would split the list into buckets of k elements each, which gives N/k buckets.
Quicksorting one bucket takes k log(k) time, so sorting all N/k buckets takes N log(k); ordering the N/k buckets relative to each other costs at most (N/k) log(N/k). Overall you can do the sorting in roughly N log(max(N/k, k)).
This can be useful because you can run sorting for each bucket in parallel, reducing total running time.
This works if you are sure that every element in the list is at most k indices away from its correct position after the sorting,
but I do not think you meant any such restriction.
Split the list into two equally-sized parts. Sort each part separately, using any usual algorithm. Then merge these parts. Perform some merge iterations as usual, comparing the merged elements. For the other merge iterations, do not compare the elements; instead take the element from the same part as in the previous step. It is not necessary to use an RNG to decide how to treat each element - just ignore the sorting order for every N-th element.
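A rough sketch of this first variant (choosing every 8th merge step as the "no comparison" step is arbitrary):
#include <algorithm>
#include <vector>
std::vector<int> roughly_sorted_merge(std::vector<int> v, int skip = 8)
{
    std::vector<int>::iterator mid = v.begin() + v.size() / 2;
    std::sort(v.begin(), mid);   // sort the two halves independently
    std::sort(mid, v.end());
    std::vector<int> out;
    out.reserve(v.size());
    std::vector<int>::iterator a = v.begin(), b = mid;
    int step = 0;
    while (a != mid && b != v.end())
    {
        bool take_a;
        if (++step % skip == 0)
            take_a = ((step / skip) % 2 == 0);   // every skip-th step: ignore the ordering
        else
            take_a = (*a <= *b);                 // normal merge comparison
        out.push_back(take_a ? *a++ : *b++);
    }
    out.insert(out.end(), a, mid);               // append whichever half is left over
    out.insert(out.end(), b, v.end());
    return out;
}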
Another variant of this approach nearly sorts the array in-place. Split the array into two parts with odd/even indexes. Sort them. (It is even possible to use a standard C++ algorithm with an appropriately modified iterator, like boost::permutation_iterator.) Reserve some limited space at the end of the array. Merge the parts, starting from the end. If the merged part is about to overwrite one of the not-yet-merged elements, just take that element; otherwise take the element in sorted order. The level of randomness is determined by the amount of reserved space.
Assuming you want the array sorted in ascending order, I would do the following:
for M iterations
pick a random index i
pick a random index k
if (i<k)!=(array[i]<array[k]) then swap(array[i],array[k])
M controls the "sortedness" of the array - as M increases the array becomes more and more sorted. I would say a reasonable value for M is n^2 where n is the length of the array. If it is too slow to pick random elements then you can precompute their indices beforehand. If the method is still too slow then you can always decrease M at the cost of getting a poorer sort.
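A direct C++ translation of the pseudocode above (rand() is used for brevity; any fast RNG would do):
#include <algorithm>
#include <cstdlib>
#include <vector>
void approx_sort(std::vector<int>& array, std::size_t M)
{
    const std::size_t n = array.size();
    if (n < 2) return;
    for (std::size_t iter = 0; iter < M; ++iter)
    {
        std::size_t i = rand() % n;
        std::size_t k = rand() % n;
        // swap only when the pair is out of ascending order
        if ((i < k) != (array[i] < array[k]))
            std::swap(array[i], array[k]);
    }
}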
Take a small random subset of the data and sort it. You can use this as a map to provide an estimate of where every element should appear in the final nearly-sorted list. You can scan through the full list now and move/swap elements that are not in a good position.
This is basically O(n), assuming the small initial sorting of the subset doesn't take a long time. Hopefully you can build the map such that the estimate can be extracted quickly.
Bubblesort to the rescue!
For an unsorted array, you could pick a few random elements and bubble them up or down (maybe by rotation, which is a bit more efficient). It will be hard to control the amount of (dis)order: even if you pick all N elements, you cannot be sure the whole array will end up sorted, because elements get moved around and you cannot ensure that you touched every element only once.
BTW: this kind of problem tends to occur in game playing engines, where the list with candidate moves is kept more-or-less sorted (because of weighted sampling), and sorting after each iteration is too expensive, and only one or a few elements are expected to move.
I have a boost::multi_index_container whose elements are structs like this:
struct Elem {
A a;
B b;
C c;
};
The main key (in a database sense) is a composite_key of a and b. Other
keys exist to perform various types of queries.
I now need to retrieve the set of all different values of c. These values are
by no means unique, but iterating through all entries (albeit ordered),
or using std::unique, seems quite a waste, considering that
the number of different values of c is expected to be much smaller (<<) than the total
number of entries (say, 10 to 1000).
Am I missing a simple way to obtain this result more efficiently?
I scoured the Boost.MultiIndex documentation and can't seem to find a way to do what you want. I'm interested in knowing if it's doable.
Perhaps the best you can do is maintain a std::map<C, size_t> (or hash map) alongside your multi_index_container and keep them both "synchronized".
The map associates a C value with its occurrence count (frequency). It's essentially a histogram of C values. Each time you add an Elem to your multi_index_container, you increment the corresponding frequency in the histogram. When you remove an Elem from your multi_index_container, you decrement the corresponding frequency, and when a frequency reaches zero you delete that entry from the histogram.
To retrieve the set of distinct C values, you simply iterate through the <key,value> pairs in the histogram and look at the key part of each pair. If you used a std::map, then the distinct C values will come out sorted.
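A rough sketch of that bookkeeping; Elem and C here are stand-ins for your actual types, and the two hooks would be called wherever you insert into or erase from the multi_index_container:
#include <cstddef>
#include <map>
typedef int C;                        // placeholder for the real C type
struct Elem { int a; int b; C c; };   // placeholder for the real Elem
std::map<C, std::size_t> histogram;   // C value -> number of Elems carrying it
void on_insert(const Elem& e)         // call right after inserting an Elem
{
    ++histogram[e.c];
}
void on_erase(const Elem& e)          // call right before erasing an Elem
{
    std::map<C, std::size_t>::iterator it = histogram.find(e.c);
    if (it != histogram.end() && --it->second == 0)
        histogram.erase(it);          // last occurrence gone: drop the entry
}
// The distinct C values, in sorted order, are then just the keys of `histogram`.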
If you're going to examine the set of distinct C values only once (or rarely) then the approach I described above may be overkill. A simpler approach would be to insert all C values into a std::set<C> and then iterate through the set to retrieve the distinct C values.
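A sketch of that simpler approach, reusing the Elem/C placeholders from the previous snippet (MultiIndexContainer stands in for your actual container type):
#include <set>
template <typename MultiIndexContainer>
std::set<C> distinct_c_values(const MultiIndexContainer& container)
{
    std::set<C> distinct;
    for (typename MultiIndexContainer::const_iterator it = container.begin();
         it != container.end(); ++it)
        distinct.insert(it->c);       // duplicates collapse automatically
    return distinct;                  // each C value exactly once, in sorted order
}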
You said that the set of distinct C's is much smaller than the total number of C's. The std::set<C> approach should therefore waste much less space than copying the C's to a std::vector, sorting the vector, then running std::unique.
Let's compare the time complexity of copying to a set versus copying to a vector, sorting, then running unique. Let N be the total number of C values, and let M be the number of distinct C values. The set approach, by my reckoning, should have a time complexity of O(N*log(M)). Since M is small and does not grow much with higher N's, the complexity effectively becomes O(N). The sorting+unique technique, on the other hand, should have a time complexity of O(N*log(N)).
The approach I took to solving this problem was to use boost range adaptors as follows
const auto& indexedContainer = container.get<IndexType>();
const auto uniqueIndexRange = indexedContainer
    | boost::adaptors::transformed([&](auto&& v) {
          return indexedContainer.key_extractor()(v); })
    | boost::adaptors::uniqued;
Situation:
overview:
I have something like this:
std::vector<SomeType> values;
std::vector<int> indexes;
struct Range{
int firstElement;//first element to be used in indexes array
int numElements;//number of element to be used from indexed array
int minIndex;/*minimum index encountered between firstElement
and firstElements+numElements*/
int maxIndex;/*maximum index encountered between firstElement
and firstElements+numElements*/
Range()
:firstElement(0), numElements(0), minIndex(0), maxIndex(0){
}
};
std::vector<Range> ranges;
I need to sort values, remap indexes, and recalculate ranges to minimize maxValueIndex-minValueIndex for each range.
details:
values is an array (okay, "vector") of some type (it is irrelevant which one). Elements in values may be unique, but this is not guaranteed.
indexes is a vector of ints. Each element in "indexes" is an index that corresponds to some element in values. Elements in indexes are not unique; one value may repeat multiple times. Also, indexes.size() >= values.size().
Now, each range corresponds to a "chunk" of data from indexes. firstElement is the index of the first element to be used from indexes (i.e. used like this: indexes[range.firstElement]), numElements is (obviously) the number of elements to be used, minIndex is the minimum in (indexes[firstElement]...indexes[firstElement+numElements-1]) and maxIndex is the maximum in (indexes[firstElement]...indexes[firstElement+numElements-1]). Ranges never overlap.
I.e. for every two ranges a, b
((a.firstElement >= b.firstElement) && (a.firstElement < (b.firstElement + b.numElements))) == false
Obviously, when I do any operation on values (swap two elements, etc.), I need to update indexes (so they keep pointing at the same values) and recalculate the corresponding ranges, so that each range's minIndex and maxIndex stay correct.
Now, I need to rearrange values in a way that minimizes Range.maxIndex - Range.minIndex. I do not need the "best" result after packing; "probably the best" or "good" packing will be enough.
problem:
Remapping indexes and recalculating ranges is easy. The problem is that I'm not sure how to sort the elements in values, because the same index may be encountered in multiple ranges.
Any ideas about how to proceed?
restrictions:
Changing the container type is not allowed. Containers should be array-like: no maps, no lists.
But you're free to use whatever container you want during the sorting.
Also, no Boost or external libraries - pure C++/STL. I really need only an algorithm.
additional info:
There is no greater/less comparison defined for SomeType - only equality/non-equality.
But there should be no need to ever compare two values, only indexes.
The goal of the algorithm is to make sure that the output of
for (int i = 0; i < indexes.size(); i++){
print(values[indexes[i]]); //hypothetical print function
}
Will be identical before and after sorting, while also making sure that for each range
Range.maxIndex-Range.minIndex (after sorting) is as small as possible to achieve with reasonable effort.
I'm not looking for a "perfect" or "most optimal" solution, having a "probably perfect" or "probably most optimal" solution should be enough.
P.S. This is NOT a homework.
This is not an algorithm, just some thinking aloud. It will probably break if there are too many duplicates.
If there were no duplicates, you'd simply rearrange the values so the indexes are 0, 1, 2, and so on. So as a starting point, let's exclude the values that are multiply-referenced and arrange the rest.
Since there are duplicates, you need to figure out where to stick them. Suppose a duplicate is referred to by ranges r1, r2, r3. As long as you insert the duplicate between min([r1,r2,r3].minIndex)-1 and max([r1,r2,r3].maxIndex)+1, the sum of maxIndex-minIndex will be the same no matter where you insert it. Moving the insertion point to the left will reduce max-min for all ranges to its left but increase it for all ranges to its right. So I think the sensible thing to do is to insert the duplicate at the left edge (minIndex) of the rightmost range (the one with the largest minIndex) among r1, r2, r3. Repeat with all duplicates.
Okay, it looks like there is only one way to reliably solve this problem:
Make sure that no index is ever used by two ranges at once by duplicating values.
I.e. scan the entire array of indexes, and when you find an index (of a value) that is used in more than one range, add a copy of that value for each range - each with a unique index. After that the problem becomes trivial: you simply sort values so that the array first contains the values used only by the first range, then the values for the second range, and so on. That gives maximum packing.
Because in my app it is more important to minimize sum(ranges[i].maxIndex - ranges[i].minIndex) than to minimize the number of values, this approach works for me.
I do not think there is another reliable way to solve the problem - it is quite easy to end up in a situation where some index is used by every range, and in that case it will not be possible to "pack" the data no matter what you do. Even allowing an index to be used by two ranges at once leads to problems - you can get ranges a, b and c where a and b, b and c, and a and c have common indexes. In that case it also won't be possible to pack the data.