Python: reordering of list elements on conversion to a set

I understand that Python sets are not ordered, and that the ordering of a list will not be preserved when it is converted to a set.
My question is: why are the elements of a list reordered when converted (in some implementations)?
Reordering the elements during conversion seems like an extra step. If nothing else, it does not seem there would be much overhead in preserving the order (for convenience).

That's because sets are based on the hash table data structure, where records are stored in buckets according to their hash values, and the order of items (when printed or converted to a list) depends on those hashes; internally, the order doesn't matter. So the set doesn't really bother to "reorder" anything: it just adds each item at the slot its hash dictates. When you print the set, the items come out in the order of their table slots. (In CPython, small integers hash to themselves, which is why small-integer examples often come out in ascending order.)
As you can see from the following, a list created from a set takes on the same hash-determined order of its items.
>>> s=set([5,4,3,7,6,2,1,0])
>>> s
{0, 1, 2, 3, 4, 5, 6, 7}
>>> list(s)
[0, 1, 2, 3, 4, 5, 6, 7]

Related

C++ sets: how to check if a list of sets contains a subset

I have a list of sets; right now the list is a vector, but it does not need to be.
vector<unordered_set<int>> setlist;
Then I fill it with some data; let's say, for example, it looks like this:
[ {1, 2}, {2, 3}, {5, 9} ]
Now I have another set, let's say {1, 2, 3}.
I want to check whether any of the sets in the list is a subset of the above set. For example, setlist[0] and setlist[1] are both subsets, so the output would be true.
My idea is to loop through the whole vector and check each element for being a subset using the std::includes function, but I am looking for a faster way. Is this possible?
Consider using a list of set<int> instead. This allows you to use std::includes (which requires sorted ranges, and a set iterates in sorted order). Run your loop on the vector after sorting it by the number of elements in each set (i.e. from the sets with the fewest elements to the sets with the most). The inner loop can then start at the current index; this avoids checking whether a larger set is included in a smaller one, which can never succeed.
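If it helps, here is a minimal sketch of the std::includes approach, assuming sorted containers (the names setlist and query are illustrative, not from the question):
#include <algorithm>
#include <iostream>
#include <set>
#include <vector>

int main() {
    std::vector<std::set<int>> setlist = {{1, 2}, {2, 3}, {5, 9}};
    std::set<int> query = {1, 2, 3};

    // std::includes requires sorted ranges; std::set iterates in sorted order.
    bool any_subset = std::any_of(setlist.begin(), setlist.end(),
        [&](const std::set<int>& s) {
            return std::includes(query.begin(), query.end(),
                                 s.begin(), s.end());
        });
    std::cout << std::boolalpha << any_subset << '\n';  // true
}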
If the range of the integers is not too large, you could consider implementing each set as a std::bitset (bit n is true if n is included). The inclusion test is then a very fast logical operation, e.g. (subset & large_set) == subset (note the parentheses: == binds tighter than &). You could still sort the vector by count, but it is not clear that this would be needed, considering the speed of the bitwise operations.
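And a minimal sketch of the bitset variant, assuming the integers fall in a small fixed range (0..63 here; kRange and the other names are illustrative):
#include <bitset>
#include <iostream>
#include <vector>

int main() {
    constexpr std::size_t kRange = 64;      // assumed upper bound on the values
    using IntSet = std::bitset<kRange>;

    std::vector<IntSet> setlist(3);
    setlist[0].set(1).set(2);               // {1, 2}
    setlist[1].set(2).set(3);               // {2, 3}
    setlist[2].set(5).set(9);               // {5, 9}

    IntSet query;
    query.set(1).set(2).set(3);             // {1, 2, 3}

    bool any_subset = false;
    for (const IntSet& s : setlist) {
        if ((s & query) == s) {             // s is a subset of query
            any_subset = true;
            break;
        }
    }
    std::cout << std::boolalpha << any_subset << '\n';  // true
}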

How to store repeated data in an efficient way?

I am trying to find an efficient structure to store data in C++. Efficiency in both time and storage is important.
I have a set S = {0, 1, 2, 3, ..., N} and multiple levels, say L of them. For each level l ∈ {1, ..., L}, I need a structure, say M, that stores 2 subsets of S, which will become the rows and columns of a matrix in the next steps. For example:
S = {0, 1, 2, 3, ..., 15}
L = 2
L1_row = {0, 1, 2, 3, 4, 5}
L1_col = {3, 4, 5, 6, 9}
L2_row = {0, 2, 4, 5, 12, 14}
L2_col = {3, 6, 10, 11, 14, 15}
I have used an unordered_map with an integer key for the level and a pair of unordered_sets for the rows and columns, as follows:
unordered_map<int, pair<unordered_set<int>, unordered_set<int>>> M;
However, this is not efficient: for example, {3, 4, 5} is recorded 3 times. As S is a large set, M will contain many repeated numbers.
In the next step, I will extract the rows and columns per level from M and create a matrix, so fast access is important.
M may or may not contain all items in S.
M is filled in 2 stages: first the rows for all levels, then the columns for all levels.
That is a tough one. Memory and efficiency really depend on your concrete data set. If you don't want to store {3, 4, 5} 3 times, you'll have to create a "token" for it and use that instead.
There are patterns for that, such as
flyweight,
run-length encoding, or
dictionary encoding
(see boost-flyweight, or the dictionaries used by ZIP/7z and other compression algorithms).
However, under some circumstances this can actually use more memory than just repeating the numbers.
Without further knowledge about the concrete use case, it is very hard to suggest anything.
Example: run-length encoding is an efficient way to store {3, 4, 5, 6, 7, ...}. Basically, you just store the first value and a length, so {3, 4, ..., 12} becomes {{3, 10}}. That is easy to decode and uses a lot less memory than the original sequence, but only if you have many consecutive runs. If you have many short runs, it will be worse.
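As a rough sketch, a run-length encoder for a sorted integer sequence might look like this (rle_encode is an illustrative name, not a library function):
#include <iostream>
#include <utility>
#include <vector>

// Encode a sorted sequence of integers as (start, length) runs.
std::vector<std::pair<int, int>> rle_encode(const std::vector<int>& sorted) {
    std::vector<std::pair<int, int>> runs;
    for (int v : sorted) {
        if (!runs.empty() && runs.back().first + runs.back().second == v) {
            ++runs.back().second;          // v extends the current run
        } else {
            runs.push_back({v, 1});        // v starts a new run
        }
    }
    return runs;
}

int main() {
    std::vector<int> seq = {3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
    for (auto [start, len] : rle_encode(seq))
        std::cout << '{' << start << ", " << len << "} ";   // {3, 10}
    std::cout << '\n';
}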
Another example: if you have lots of recurring patterns, say {2, 6, 11} appears 23 times, then you could store the pattern as a flyweight or in a dictionary, replacing every occurrence of {2, 6, 11} with a token like #4. The problem here is identifying the patterns and optimizing your dictionary; that is one of the reasons why compressing files (7z, bzip2, etc.) takes longer than uncompressing them.
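A minimal sketch of the shared-token idea, assuming subsets are kept as sorted vectors (SubsetPool and intern are illustrative names; duplicated subsets end up sharing one allocation):
#include <iostream>
#include <map>
#include <memory>
#include <vector>

using Subset = std::vector<int>;
using SubsetRef = std::shared_ptr<const Subset>;

// Interns subsets: identical subsets are stored once and shared by reference.
class SubsetPool {
public:
    SubsetRef intern(const Subset& s) {
        auto it = pool_.find(s);
        if (it == pool_.end())
            it = pool_.emplace(s, std::make_shared<const Subset>(s)).first;
        return it->second;
    }
private:
    std::map<Subset, SubsetRef> pool_;   // keyed by the subset's contents
};

int main() {
    SubsetPool pool;
    SubsetRef a = pool.intern({3, 4, 5});   // first occurrence: allocated
    SubsetRef b = pool.intern({3, 4, 5});   // duplicate: shares storage
    std::cout << std::boolalpha << (a == b) << '\n';   // true: same object
}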
Or you could even store the information as a bit pattern. If your matrix is 1024 columns wide, you can record the used columns in 128 bytes, with "bit set" meaning "use this column", like a bitmask in image manipulation.
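For example, under the assumption of a fixed 1024-column matrix, a std::bitset can record the used columns:
#include <bitset>
#include <iostream>

int main() {
    std::bitset<1024> used_cols;            // 1024 columns in (typically) 128 bytes
    for (int c : {3, 4, 5, 6, 9})           // e.g. L1_col from the question
        used_cols.set(c);

    std::cout << sizeof(used_cols) << " bytes\n";   // typically 128
    std::cout << used_cols.test(4) << '\n';         // 1: column 4 is used
}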
So I guess my answer really is: "it depends".

How would I partition an array of integers into N number of partitions?

For instance, I have array1 = {1, 2, 3, 4} and want to partition it into 2 subarrays:
subarray1 = {1, 2} and subarray2 = {3, 4}
Is there a way to partition it and create the arrays automatically, depending on the user's input for N?
(For background: I am taking an array of 100000 sorted integer values and partitioning it so that finding a number in the array will be a lot more efficient. Since it's sorted and partitioned, I know the start and end range of each subarray and can search only there.)
You're asking the wrong question. If you want to find out whether a number exists in the array, the easiest and fastest way is to use std::unordered_set: the search becomes a constant-time operation on average.
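A minimal sketch of that approach (the 100000 values are elided; the names are illustrative):
#include <iostream>
#include <unordered_set>
#include <vector>

int main() {
    std::vector<int> values = {1, 5, 42, 99 /* ... the full sorted array ... */};

    // Build once: O(n) on average.
    std::unordered_set<int> lookup(values.begin(), values.end());

    // Query: O(1) on average, no partitioning needed.
    std::cout << std::boolalpha
              << (lookup.count(42) > 0) << '\n'   // true
              << (lookup.count(7)  > 0) << '\n';  // false
}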

number of elements strictly less than a given number

I want a data structure that supports inserting elements in log(n) time while keeping the elements sorted after every insertion. I can use a multiset for this.
After that, I want to find the number of elements strictly smaller than a given number, again in log(n) time. Duplicates are present and need to be counted. For example, if the query element is 5 and the structure contains {2, 2, 4, 5, 6, 8, 8}, the answer is 3 (the elements 2, 2, 4), since these 3 elements are strictly less than 5.
I could use a multiset, but even with lower_bound I would have to call std::distance, which runs in linear time for multiset iterators. How can I achieve this efficiently with the C++ STL? Also I cannot use
The data structure you need is an order statistic tree: https://en.wikipedia.org/wiki/Order_statistic_tree
The STL doesn't have one, and they're not very common, so you might have to roll your own. You can find code via Google, but I can't vouch for any specific implementation.
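If the values can be mapped into a known bounded range, one common alternative is a Fenwick (binary indexed) tree over value counts, which gives log-time insertion and strictly-less counting. A minimal sketch, assuming values in [1, max_value] (LessCounter is an illustrative name):
#include <iostream>
#include <vector>

// Fenwick tree over value counts: insert and count_less are O(log max_value).
// Values are assumed to be >= 1 (index 0 is unused).
class LessCounter {
public:
    explicit LessCounter(int max_value) : tree_(max_value + 1, 0) {}

    void insert(int v) {                   // record one occurrence of v
        for (; v < (int)tree_.size(); v += v & -v) ++tree_[v];
    }

    int count_less(int v) const {          // # of elements strictly < v
        int sum = 0;
        for (--v; v > 0; v -= v & -v) sum += tree_[v];
        return sum;
    }

private:
    std::vector<int> tree_;
};

int main() {
    LessCounter c(100);                    // assumed value range [1, 100]
    for (int v : {2, 2, 4, 5, 6, 8, 8}) c.insert(v);
    std::cout << c.count_less(5) << '\n';  // 3  (2, 2, 4)
}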

Can I sort a vector to match the sorting of an unordered_map?

Can I sort a vector so that it will match the ordering of an unordered_map? I want to find the intersection of the two containers by iterating over the unordered_map, and if the vector were in the same order, I could iterate each container just once rather than searching for each key.
So for example, given an unordered_map containing:
1, 2, 3, 4, 5, 6, 7, 8, 9
Which is hashed into this order:
1, 3, 4, 2, 5, 7, 8, 6, 9
I'd like, given a vector of:
1, 2, 3, 4
to somehow distill the ordering of the unordered_map and use it to sort the vector into:
1, 3, 4, 2
Is there a way to accomplish this? I notice that unordered_map does provide its hash_function; can I use that?
As the comments correctly state, there is no even remotely portable way of matching the ordering of an unordered_map; the ordering is unspecified.
However, in the land of unspecified behavior, sometimes for various reasons we can live with whatever our implementation does, even if it is unspecified and non-portable. So, could someone look into your map implementation and exploit whatever determinism it has on the vector?
The problem with unordered_map is that it's a hash table. Every element inserted into it is hashed, and the hash (mapped onto the bucket space) is used as an index into an internal array. This looks promising, and it would be promising if not for collisions. When keys collide, the elements are put into a collision list, and that list is not sorted at all: the order of iteration over colliding elements is determined by the order of inserts (forward or reverse). Because of that, absent information about the order of inserts, it is not possible to mimic the order of the unordered_map, even for a specific implementation.
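For illustration only, here is a deliberately non-portable sketch that orders a vector by the bucket each key hashes into, using unordered_map's bucket() member; per the collision argument above, it can only approximate the map's real iteration order:
#include <algorithm>
#include <iostream>
#include <unordered_map>
#include <vector>

int main() {
    std::unordered_map<int, int> m;
    for (int k : {1, 2, 3, 4, 5, 6, 7, 8, 9}) m[k] = k;

    std::vector<int> v = {1, 2, 3, 4};

    // Non-portable: order the vector by the bucket each key lands in.
    // Within a bucket (i.e. on collision), the real iteration order still
    // depends on insertion history, so this is only an approximation.
    std::sort(v.begin(), v.end(), [&m](int a, int b) {
        return m.bucket(a) < m.bucket(b);
    });

    for (int k : v) std::cout << k << ' ';
    std::cout << '\n';
}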