How to efficiently look up a large std::map - C++

I want to compute the transitive closure of two large key-value lists. To do so I have two std::map objects, each of which maps an integer to a vector of integers.
std::map<unsigned,vector<unsigned> > mapIntVecOfInts1;
std::map<unsigned,vector<unsigned> > mapIntVecOfInts2;
"mapIntVecOfInts1" maps keys to another set of keys(VALUES). Some of the example values in it are of the following form:
0 -> (101, 102, 201)
1 -> (101, 102, 103, 203, 817, 1673)
2 -> (201, 829, 858, 1673)
"mapIntVecOfInts2" maps the VALUES present in "mapIntVecOfInts1" to another set of values. e.g.
101 -> (4002, 8293, 9000)
102 -> (4002, 8293, 10928)
103 -> (8293, 10928, 19283, 39201)
201 -> (8293)
203 -> (9393, 9830)
817 -> (19393, 19830)
1673-> (5372, 6830)
Now I want to map the keys present in "mapIntVecOfInts1" to the values present in "mapIntVecOfInts2", using the transitive mapping from "mapIntVecOfInts1" to "mapIntVecOfInts2". E.g. for keys 0 and 1 of mapIntVecOfInts1 I want to produce (duplicates removed):
0 -> 4002, 8293, 9000, 10928
1 -> 4002, 8293, 9000, 10928, 19283, 39201, 9393, 9830, 19393, 19830, 5372, 6830
"mapIntVecOfInts1" and "mapIntVecOfInts2" contain a billion elements (keys). vector within the two maps themselves contain million unsigned integers. I tried perform this transitive closure between the two maps by storing "mapIntVecOfInts1" and "mapIntVecOfInts2" in-memory. Using the following code:
std::vector<std::pair<unsigned, vector<unsigned> > > result;
for (std::map<unsigned, vector<unsigned> >::iterator i1 = mapIntVecOfInts1.begin(), l1 = mapIntVecOfInts1.end(); i1 != l1; ++i1)
{
    vector<unsigned> vec1;
    for (vector<unsigned>::iterator i2 = (*i1).second.begin(), l2 = (*i1).second.end(); i2 != l2; ++i2)
        vec1.insert(vec1.begin(), mapIntVecOfInts2[*i2].begin(), mapIntVecOfInts2[*i2].end());
    result.push_back(make_pair((*i1).first, vec1));
}
However, performing the transitive closure this way is taking a lot of time. Is there some way I can speed this up?

One can say that your suggested code does two things:
maps the second relation to the entry of the first
builds up the new relation from the results of said mapping
The resulting map will have exactly the same key set as the first relation, so you can (kind of) avoid the whole red-black-tree building process by copying mapIntVecOfInts1 first and then modifying the values of the copy, instead of adding vectors one by one.
Of course, that will not fix the major bottleneck, which is the access speed of your second relation (mapIntVecOfInts2). You can try to reduce it to amortized O(1) with a hash table (std::unordered_map), or even a vector if your "billion of keys" is not too sparse.
Also, as @SpectralSequence said, your code does not preserve uniqueness in the value vectors; perhaps you want to do something about that.
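A minimal sketch combining both ideas (this reuses the question's map names, assumes everything fits in memory, and targets C++11):
#include <unordered_map>
#include <utility>
#include <vector>

// Copy the second relation into a hash table for amortized O(1) lookup.
std::unordered_map<unsigned, std::vector<unsigned> > rel2(
    mapIntVecOfInts2.begin(), mapIntVecOfInts2.end());

std::vector<std::pair<unsigned, std::vector<unsigned> > > result;
result.reserve(mapIntVecOfInts1.size());
for (const auto& kv : mapIntVecOfInts1)
{
    std::vector<unsigned> merged;
    for (unsigned v : kv.second)
    {
        // find() avoids operator[] silently inserting empty vectors
        auto it = rel2.find(v);
        if (it != rel2.end())
            merged.insert(merged.end(), it->second.begin(), it->second.end());
    }
    result.emplace_back(kv.first, std::move(merged));
}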

At the very least, you should insert at the end of the vector in the inner loop, since inserting at the beginning requires copying the elements already in the vector.
vec1.insert(vec1.end(), mapIntVecOfInts2[*i2].begin(), mapIntVecOfInts2[*i2].end());
Also, if you don't want duplicate values then consider using a set.
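For instance, a sketch of the inner loop with a std::set (question's names reused), which keeps each value exactly once and comes out sorted:
#include <set>

std::set<unsigned> uniq;
for (unsigned v : (*i1).second)  // the value vector of the current key
    uniq.insert(mapIntVecOfInts2[v].begin(), mapIntVecOfInts2[v].end());
// sorted, duplicate-free copy for the result
vector<unsigned> vec1(uniq.begin(), uniq.end());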

Related

Scala List of tuples becomes empty after for loop

I have a Scala list of tuples, "params", which is of size 28. I want to loop through and print each element of the list; however, nothing is printed out. After the for loop finishes, I checked the size of the list, which has now become 0.
I am new to Scala and could not figure this out after a long time of googling.
val primes = List(11, 13, 17, 19, 2, 3, 5, 7)
val params = primes.combinations(2)
println(params.size)
for (param <- params) {
print(param(0), param(1))
}
println(params.size)
The combinations method on List creates an Iterator. Once the Iterator has been consumed, by methods such as size, it is empty.
From the docs
one should never use an iterator after calling a method on it.
If you comment out println(params.size), you can see that the for loop prints the elements, but the last println(params.size) will still print 0.
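A sketch of the usual fix: materialize the Iterator into a List once, after which it can be traversed any number of times:
val primes = List(11, 13, 17, 19, 2, 3, 5, 7)
val params = primes.combinations(2).toList // a List, not an Iterator
println(params.size)                       // 28
for (param <- params) {
  println((param(0), param(1)))
}
println(params.size)                       // still 28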
Complementing Johny's great answer:
Do you know how I can save the result from the combinations method to use later?
Well, as already suggested, you can just call toList.
However, note there is a reason why combinations returns an Iterator: the data can be too big. If you are okay with materializing it, go ahead; but you may still take advantage of laziness.
For example, let's convert the inner lists into a tuples before collecting the results:
val params =
  primes
    .combinations(2)
    .collect {
      case a :: b :: Nil => (a, b)
    }.toList
In the same way, you may add extra steps to the chain, like another map or a filter, before the final toList.
Even better, if your end action is something like foreach(foo), then you do not even need to collect everything into a List.
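For example (a sketch using println as the end action), the whole pipeline stays lazy and never builds an intermediate List:
primes
  .combinations(2)
  .collect { case a :: b :: Nil => (a, b) }
  .foreach(println) // consumes the Iterator lazily, one pair at a time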
primes.combinations(2) returns an Iterator.
Iterators are data structures that allow to iterate over a sequence of
elements. They have a hasNext method for checking if there is a next
element available, and a next method which returns the next element
and discards it from the iterator.
So it is like a pointer into an Iterable collection: once you have iterated over it, you will not be able to iterate again.
When println(params.size) executes, the iteration completes while computing the size, and params then points at the end. Because of this, for (param <- params) is equivalent to looping over an empty collection.
There are two possible solutions:
Don't check the size before the for loop.
Convert the Iterator to an Iterable, e.g. a List:
val params = primes.combinations(2).toList
To learn more about Iterator and Iterable, refer to What is the relation between Iterable and Iterator?

Finding the max value of a list of tuples (applying max to the second value of the tuple)

So I have a list of tuples which I created from zipping two lists like this:
zipped = list(zip(neighbors, cv_scores))
max(zipped) produces
(49, 0.63941769316909292), where 49 is the max value.
However, I'm interested in finding the max among the latter values of the tuples (the 0.63941...). How can I do that?
The problem is that Python compares tuples lexicographically: it orders on the first item, and only if these are equal does it compare the second, and so on.
You can however use the key= in the max(..) function, to compare on the second element:
max(zipped, key=lambda x: x[1])
Note 1: you do not have to construct a list(..) if you are only interested in the maximum value. You can use
max(zip(neighbors, cv_scores), key=lambda x: x[1])
Note 2: Finding the max(..) runs in O(n) (linear time) whereas sorting a list runs in O(n log n).
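A quick sketch with made-up values (the names neighbors and cv_scores are the question's; the numbers are illustrative):
neighbors = [47, 48, 49]
cv_scores = [0.61, 0.64, 0.639]
# key= makes max compare the second element of each tuple
print(max(zip(neighbors, cv_scores), key=lambda x: x[1]))  # (48, 0.64)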
max(zipped)[1]
# returns the second element of the lexicographically largest tuple
Note that this is not necessarily the largest second value. If you want to sort your data and then read off the maximum, you can use itemgetter:
from operator import itemgetter
zipped.sort(key=itemgetter(1), reverse=True)
print(zipped[0][1])  # the maximum

How does RDD reduce work in Apache Spark

I am currently working with Apache Spark, but I cannot understand how reduce works after map.
My example is pretty simple:
val map = readme.map(line => line.split(" ").size)
I know this will return an RDD of the number of words per line, but where is the key/value here to pass to a reduce function?
map.reduce((a,b) => {if(a>b) a else b})
How does the reduce phase work? Is (a, b) a Tuple2, or is it the key/value from the map function?
Once you have
val map = readme.map(line => line.split(" ").size)
Each element of the RDD consists of a single number, the number of words in a line of the file.
You could count all the words in your dataset with map.sum() or map.reduce( (a,b) => a+b ), which are equivalent.
The code you have posted:
map.reduce((a,b) => {if(a>b) a else b})
would find the maximum number of words per line for your entire dataset.
The RDD.reduce method works by combining every two elements it encounters, which at first are taken from pairs of RDD rows, into another element, in this case a number. The aggregation function should be associative and commutative, so that it can be nested and called on the rows in any order. For example, subtraction will not yield useful results as a reduce function, because you cannot predict ahead of time in what order results would be subtracted from one another. Addition, however, or taking a maximum, works correctly no matter the order.
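A sketch in plain Scala (no Spark required) of why the combiner must tolerate any grouping, using the same max function as the question:
val counts = Seq(3, 7, 2, 9, 4) // word counts per line (illustrative)
val combine = (a: Int, b: Int) => if (a > b) a else b
// Any grouping gives the same answer, which is what lets Spark reduce
// each partition independently and then merge the partial results:
val sequential = counts.reduceLeft(combine)                        // 9
val nested     = combine(combine(3, 7), combine(2, combine(9, 4))) // 9
println(sequential == nested)                                      // true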

How to implement LSH by MapReduce?

Suppose we wish to implement Locality-Sensitive Hashing (LSH) by MapReduce. Specifically, assume chunks of the signature matrix consist of columns, and elements are key-value pairs where the key is the column number and the value is the signature itself (i.e., a vector of values).
(a) Show how to produce the buckets for all the bands as output of a single MapReduce process. Hint: remember that a Map function can produce several key-value pairs from a single element.
(b) Show how another MapReduce process can convert the output of (a) to a list of pairs that need to be compared. Specifically, for each column i, there should be a list of those columns j > i with which i needs to be compared.
(a)
Map: takes an element (a column) and its signature as input, and produces key-value pairs (bucket_id, element), one per band.
Reduce: produces the buckets for all the bands as output, i.e. (bucket_id, list(elements)).
map(key: column_id, value: signature):
    split the signature into bands
    for i, band in enumerate(bands):
        bucket_id = (i, hash(band)) // hash each band as a whole; the band index
                                    // keeps buckets of different bands apart
        collect(bucket_id, column_id)
reduce(bucket_id, column_ids):
    collect(bucket_id, column_ids)  // one bucket with all its member columns
(b)
Map: takes the output of (a) as input and produces the pairs within the same bucket, i.e. (bucket_id, list(elements)) -> (bucket_id, combinations(list(elements))), where combinations() yields every two elements chosen from the same bucket.
Reduce: outputs the item pairs that need to be compared. Specifically, for each column i, there should be a list of those columns j > i with which i needs to be compared.
map(bucket_id, column_ids):
    for (i, j) in combinations(column_ids, 2): // every pair within one bucket
        collect(min(i, j), max(i, j))          // key the pair by the smaller column
reduce(i, js):
    collect(i, distinct(js)) // the columns j > i that i must be compared with,
                             // deduplicated across buckets
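A single-machine Python sketch of both rounds, with a toy three-column signature matrix (the data and the band parameters b and r are made up for illustration):
from collections import defaultdict
from itertools import combinations

# Toy signature matrix: column id -> signature vector
signatures = {0: [1, 2, 3, 4], 1: [1, 2, 9, 9], 2: [1, 2, 3, 4]}
b, r = 2, 2  # two bands of two rows each

# Round (a): map each column to one (band, bucket) key per band, then group.
buckets = defaultdict(list)
for col, sig in signatures.items():
    for band in range(b):
        chunk = tuple(sig[band * r:(band + 1) * r])
        buckets[(band, hash(chunk))].append(col)

# Round (b): emit candidate pairs within each bucket, grouped by the smaller id.
candidates = defaultdict(set)
for cols in buckets.values():
    for i, j in combinations(sorted(cols), 2):
        candidates[i].add(j)  # j > i because cols is sorted

print(dict(candidates))  # {0: {1, 2}, 1: {2}} for this toy matrix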

Comparing dictionaries with list-type values

I have the following two dictionaries:
d1 = {"aa": [1, 2, 3], "bb": [4, 5, 6], "cc": [7, 8, 9]}
d2 = {"aa": [1, 2, 3], "bb": [1, 1, 1, 1, 1, 1], "cc": [7, 8]}
How can I compare these two dictionaries and get the positions (indexes) of UNMATCHED key-value pairs? Since I am dealing with files around 2 GB in size, the dictionaries contain very large data. How can this be implemented in an optimized way?
def getUniqueEntry(dictionary1, dictionary2, listOfKeys):
    # check that they have the same keys
    assert sorted(dictionary1.keys()) == sorted(dictionary2.keys()), "Keys don't match"
    for key in dictionary1:
        if dictionary1[key] != dictionary2[key]:
            listOfKeys.append(key)
When calling the function, the third parameter listOfKeys is an empty list where you want the keys to be stored. Note that reading 2 GB worth of data into a dict requires a lot of RAM and will most likely fail.
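For example, a quick run with the question's d1 and d2:
keys = []
getUniqueEntry(d1, d2, keys)
print(keys)  # ['bb', 'cc'] - the values differ for these keys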
And this is a more Pythonic way: the list comprehension keeps just the keys whose values are not equal in the two dictionaries:
different_keys = [key for key in d1 if d1[key] != d2[key]]