using mapreduce to find common items between users - mapreduce

Suppose I have the following user/item sets where items could also be replicates for each user (like user1)
{ "u1", "item" : [ "a", "a", "c","h" ] }
{ "u2", "item" : [ "b", "a", "f" ] }
{ "u3", "item" : [ "a", "a", "f" ] }
I want to find a map-reduce algorithm that will calculate the number of common items between each pair of users, something like this:
{ "u1_u2", "common_items" : 1 }
{ "u1_u3", "common_items" : 2 }
{ "u2_u3", "common_items" : 2 }
It basically finds the intersections of itemsets for each pair and considers replicates as new items. I am new to mapreduce, how can I do a map-reduce for this?

With these sorts of problems, you need to appreciate that some algorithms will scale better than others, and performance of any one algorithm will depend on the 'shape' and size of your data.
Comparing the item set of every user with that of every other user might be appropriate for small datasets (say thousands of users, maybe even tens of thousands, with a similar number of items), but it is an 'n squared' problem (O(n^2) in the number of users, though my Big O is rusty to say the least!):
Users Comparisons
----- -----------
2 1
3 3
4 6
5 10
6 15
n (n^2 - n)/2
So a user domain of 100,000 would yield 4,999,950,000 set comparisons.
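The figures in the table follow directly from the formula. A quick check in Python (the function name pair_count is only for illustration):

```python
def pair_count(n_users):
    """Number of unordered user pairs: n * (n - 1) / 2."""
    return n_users * (n_users - 1) // 2


for n in (2, 3, 4, 5, 6, 100_000):
    print(n, pair_count(n))
```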
Another approach to this problem would be to invert the relationship: run a Map Reduce job to generate a map of items to users:
'a' : [ 'u1', 'u2', 'u3' ],
'b' : [ 'u2' ],
'c' : [ 'u1' ],
'f' : [ 'u2', 'u3' ],
'h' : [ 'u1' ],
From there you can iterate the users for each item and output user pairs (with a count of one):
'a' would produce: [ 'u1_u2' : 1, 'u1_u3' : 1, 'u2_u3' : 1 ]
'f' would produce: [ 'u2_u3' : 1 ]
Then finally produce the sum for each user pairing:
[ 'u1_u2' : 1, 'u1_u3' : 1, 'u2_u3' : 2 ]
This doesn't produce the behavior you are interested in (counting the double a's in both the u1 and u3 item sets), but it outlines an initial implementation.
If you know your data typically has users with few items in common, a small number of items per user, or an item domain with a large number of distinct values, then this algorithm will be more efficient (previously you were comparing every user to every other, with a low probability of intersection between the two sets). I'm sure a mathematician could prove this for you, but that I am not!
This also has the same potential scaling problem as before: if you have an item that all 100,000 users have in common, you still need to generate the roughly 5 billion user pairs. This is why it is important to understand your data before blindly applying an algorithm to it.
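As a rough single-process sketch of the inverted approach, the following Python keeps a per-user copy count for every item so that duplicates can be handled too. It assumes that an item held k times by one user and m times by another contributes min(k, m) common items, which is what the counts in the question suggest. The function name common_item_counts is hypothetical, and a real MapReduce job would split the two loops into separate map and reduce phases.

```python
from collections import Counter, defaultdict
from itertools import combinations


def common_item_counts(user_items):
    # Phase 1: invert the data to item -> {user: number of copies}.
    item_to_users = defaultdict(Counter)
    for user, items in user_items.items():
        for item in items:
            item_to_users[item][user] += 1

    # Phase 2: for each item, emit every user pair and sum per pair.
    pair_counts = Counter()
    for owners in item_to_users.values():
        for u1, u2 in combinations(sorted(owners), 2):
            # Assumption: duplicates count min(copies of u1, copies of u2) times.
            pair_counts[u1 + "_" + u2] += min(owners[u1], owners[u2])
    return dict(pair_counts)


users = {
    "u1": ["a", "a", "c", "h"],
    "u2": ["b", "a", "f"],
    "u3": ["a", "a", "f"],
}
print(common_item_counts(users))
```

Sorting the owners before pairing keeps every key in the form k1_k2 with k1 < k2, so the same pair is never counted under two different keys.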

You want a step that emits all of the things the user has, like:
{ 'a': "u1" }
{ 'a': "u1" }
{ 'c': "u1" }
{ 'h': "u1" }
{ 'b': "u2" }
{ 'a': "u2" }
{ 'f': "u2" }
{ 'a': "u3" }
{ 'a': "u3" }
{ 'f': "u3" }
Then reduce them by key like:
{ 'a': ["u1", "u1", "u2", "u3", "u3"] }
{ 'b': ["u2"] }
{ 'c': ["u1"] }
{ 'f': ["u2", "u3"] }
{ 'h': ["u1"] }
And in that reducer emit every pair of distinct users found in each value (combinations rather than permutations, so each pair appears once), like:
{ 'u1_u2': 'a' }
{ 'u2_u3': 'a' }
{ 'u1_u3': 'a' }
{ 'u2_u3': 'f' }
Note that you'll want to make sure that in a key like k1_k2 that k1 < k2 so that they match up in any further mapreduce steps.
Then, if you need them all grouped like your example, run another mapreduce phase to combine them by key, and they'll end up like:
{ 'u1_u2': ['a'] }
{ 'u1_u3': ['a'] }
{ 'u2_u3': ['a', 'f'] }
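The two phases described above can be sketched in plain Python. This is only a toy, single-process stand-in for a MapReduce framework: run_phase, map_items, reduce_to_pairs, map_identity and reduce_collect are hypothetical names, and the grouping dictionary plays the role of the shuffle step.

```python
from collections import defaultdict
from itertools import combinations


def run_phase(records, mapper, reducer):
    """Toy stand-in for one MapReduce phase: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    out = []
    for key in sorted(groups):
        out.extend(reducer(key, groups[key]))
    return out


def map_items(record):
    user, items = record
    for item in items:
        yield item, user  # one emit per item the user has


def reduce_to_pairs(item, users):
    # Sorting guarantees k1 < k2 in every "k1_k2" key.
    for u1, u2 in combinations(sorted(set(users)), 2):
        yield u1 + "_" + u2, item


def map_identity(record):
    yield record  # the record is already a (pair, item) tuple


def reduce_collect(pair, items):
    yield pair, sorted(items)


users = [
    ("u1", ["a", "a", "c", "h"]),
    ("u2", ["b", "a", "f"]),
    ("u3", ["a", "a", "f"]),
]
pairs = run_phase(users, map_items, reduce_to_pairs)
grouped = run_phase(pairs, map_identity, reduce_collect)
print(grouped)
```

Note that set(users) collapses duplicates, so this version lists shared items but does not count an item twice when both users hold it twice.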

Does this work for you?
from itertools import combinations
user_sets = [
    {'u1': ['a', 'a', 'c', 'h']},
    {'u2': ['b', 'a', 'f']},
    {'u3': ['a', 'a', 'f']},
]
def compare_sets(items1, items2):
    # Count shared items, treating each duplicate as a separate item.
    remaining = list(items2)  # work on a copy so the caller's list is not mutated
    common = 0
    for item in items1:
        if item in remaining:
            common += 1
            remaining.remove(item)  # each copy can only be matched once
    return common
for first, second in combinations(user_sets, 2):
    name1, items1 = next(iter(first.items()))
    name2, items2 = next(iter(second.items()))
    print('Common items between %s and %s: %s' % (
        name1, name2, compare_sets(items1, items2)))
Here's the output:
Common items between u1 and u2: 1
Common items between u1 and u3: 2
Common items between u2 and u3: 2
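The same pairwise count can also be written with collections.Counter, whose & operator keeps the minimum count of each element, i.e. a multiset intersection. A minimal sketch (the helper name common_count is hypothetical):

```python
from collections import Counter
from itertools import combinations


def common_count(items1, items2):
    """Size of the multiset intersection of two item lists."""
    return sum((Counter(items1) & Counter(items2)).values())


user_items = {
    "u1": ["a", "a", "c", "h"],
    "u2": ["b", "a", "f"],
    "u3": ["a", "a", "f"],
}
for (name1, items1), (name2, items2) in combinations(user_items.items(), 2):
    print(name1, name2, common_count(items1, items2))
```

This is still the brute-force comparison of every pair of users, so it suits small user sets only.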

Related

Length of List with conditionals in dart

Is there a way to get 4 instead of 2 as a result?
List<String> test = [
'first',
'second',
if (false) 'third',
if (false) 'fourth',
];
print('length: ' + test.length.toString());
The length property on lists returns the number of elements in the list. In your example you are only inserting two values (because of the conditions), so a length of 4 would not make sense and would cause problems when you, for example, iterate over the list.
You can, however, add null elements when the conditions are false, like this:
void main() {
  // Nullable element type so the null placeholders are allowed under null safety.
  List<String?> list = [
    'first',
    'second',
    (false) ? 'third' : null,
    (false) ? 'fourth' : null,
  ];
  final listLengthIncludingConditions = list.length;
  list.removeWhere((x) => x == null); // drop the null placeholders
  print('Number of possible elements in list: $listLengthIncludingConditions'); // 4
  print('Number of elements in list: ${list.length}'); // 2
}
You can then save the length and remove the null elements.

RethinkDB Multiple emits in Map

I've been trying out RethinkDB for a while and I still don't know how to do something like this MongoDB example:
https://docs.mongodb.com/manual/tutorial/map-reduce-examples/#calculate-order-and-total-quantity-with-average-quantity-per-item
In Mongo, in the map function, I could iterate over an array field of one document, and emit multiple values.
I don't know how to set the key to emit in map or return more than one value per document in the map function.
For example, I would like to get from this:
{
  'num' : 1,
  'lets': ['a','b','c']
}
to
[
{'num': 1, 'let' : 'a' },
{'num': 1, 'let' : 'b' },
{'num': 1, 'let' : 'c' }
]
I'm not sure if I should think this differently in RethinkDB or use something different from map-reduce.
Thanks.
I'm not familiar with Mongo; addressing your transformation example directly:
r.expr({
  'num' : 1,
  'lets': ['a', 'b', 'c']
})
.do(function(row) {
  return row('lets').map(function(l) {
    return r.object(
      'num', row('num'),
      'let', l
    )
  })
})
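You can, of course, use map() instead of do() when the input is a sequence of documents rather than a single object (or concatMap() if you want the per-document arrays flattened into a single sequence).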

Making groups (combinations) of objects using their min/max values

First of all, this is my first question, you can tell me how to improve it and what tags to use.
What I am trying to do: I have a bunch of objects that have minimal and maximal values. From those values you can deduce whether two objects overlap, and if they do, they can be put together in a group.
This question might need dynamic programming to solve.
example objects:
1 ( min: 0, max: 2 )
2 ( min: 1, max: 3 )
3 ( min: 2, max: 4 )
4 ( min: 3, max: 5 )
object 1 can be grouped with objects 2, 3
object 2 can be grouped with objects 1, 3, 4
object 3 can be grouped with objects 1, 2, 4
object 4 can be grouped with objects 2, 3
as you can see there are multiple ways to group those elements:
[1, 2], [3, 4]
[1], [2, 3], [4]
[1], [2, 3, 4]
[1, 2, 3], [4]
now there should be some sort of rule to deduce which of the solutions is the best solution
for example, least amount of groups:
[1, 2], [3, 4]
or
[1], [2, 3, 4]
or
[1, 2, 3], [4]
or most objects in one group:
[1], [2, 3, 4]
or
[1, 2, 3], [4]
or any other rule that uses another attribute of said objects to compare the solutions
what I have now:
$objects = [...objects...];
$numberOfObjects = count($objects);
$groups = [];
for ($i = 0; $i < $numberOfObjects; $i++) {
    $MinA = $objects[$i]['min'];
    $MaxA = $objects[$i]['max'];
    $groups[$i] = [$i];
    for ($j = $i + 1; $j < $numberOfObjects; $j++) {
        $MinB = $objects[$j]['min'];
        $MaxB = $objects[$j]['max'];
        if (($MinA >= $MinB && $MinA <= $MaxB) || ($MaxA >= $MinB && $MaxA <= $MaxB) || ($MinB >= $MinA && $MinB <= $MaxA)) {
            array_push($groups[$i], $j);
        }
    }
}
this basically creates an array with indexes of objects that can be grouped together
from this point, I don't know how to proceed: how to generate all the solutions, check how good each of them is, and then pick the best one
or maybe there is even better solution that doesn't use any of this?
PHP solutions are preferred, although this problem is not PHP-specific
When I was first looking at your algorithm, I was impressed by how efficient it is :)
Here it is rewritten in JavaScript, because I moved away from PHP a good while ago:
function setsOf(objects) {
  const numberOfObjects = objects.length
  const groups = []
  for (let i = 0; i < numberOfObjects; i++) {
    const MinA = objects[i]['min']
    const MaxA = objects[i]['max']
    groups[i] = [i] // every object starts in its own group
    for (let j = i + 1; j < numberOfObjects; j++) {
      const MinB = objects[j]['min']
      const MaxB = objects[j]['max']
      // same three-clause overlap test as the PHP original
      if ((MinA >= MinB && MinA <= MaxB) || (MaxA >= MinB && MaxA <= MaxB) ||
          (MinB >= MinA && MinB <= MaxA)) {
        groups[i].push(j)
      }
    }
  }
  return groups
}
if you happen to also think well in javascript, you might find this form more direct (it is identical, however):
function setsOf(objects) {
  const groups = []
  objects.forEach((left, i) => {
    groups[i] = [i]
    // only compare against the objects that come after `left`
    objects.slice(i + 1).forEach((right, j) => {
      if ((left.min >= right.min && left.min <= right.max) ||
          (left.max >= right.min && left.max <= right.max) ||
          (right.min >= left.min && right.min <= left.max))
        groups[i].push(j + i + 1) // map back to the index in `objects`
    })
  })
  return groups
}
so if we run it, we get:
a = setsOf([{min:0, max:2}, {min:1, max:3}, {min:2, max:4}, {min:3, max: 5}])
[Array(3), Array(3), Array(2), Array(1)]
JSON.stringify(a)
"[[0,1,2],[1,2,3],[2,3],[3]]"
and it does impressively catch the compound groups :) A weakness is that it captures groups containing more objects than necessary, without capturing all available objects. You seem to have very custom selection criteria. To me, it seems like the groups should either be every last intersecting subset, or only subsets where each element in the group provides unique coverage: [0,1], [0,2], [1,2], [1,3], [2,3], [0,1,3]
the algorithm for that is perhaps more involved. this was my approach, and it is nowhere near as terse and elegant as yours, but it works:
function intersectingGroups (mmvs) {
  const min = []
  const max = []
  const muxo = [...mmvs]
  mmvs.forEach(byMin => {
    mmvs.forEach(byMax => {
      if (byMin.min === byMax.min && byMin.max === byMax.max) {
        console.log('rejecting identity', byMin, byMax)
        return // identity
      }
      if (byMax.min > byMin.max) {
        console.log('rejecting non-overlapping objects', byMin, byMax)
        return // non-overlapping objects
      }
      if ((byMax.max <= byMin.max) || (byMin.min >= byMax.min)) {
        console.log('rejecting non-expansive coverage or inversed order',
          byMin, byMax)
        return // non-expansive coverage or inversed order
      }
      const entity = {min: byMin.min, max: byMax.max,
        compositeOf: [byMin, byMax]}
      if (muxo.some(mv => mv.min === entity.min && mv.max === entity.max))
        return // enforcing Set
      muxo.push(entity)
      console.log('adding', byMin, byMax, muxo)
    })
  })
  if (muxo.length === mmvs.length) {
    return muxo.filter(m => 'compositeOf' in m)
    // solution
  } else {
    return intersectingGroups(muxo)
  }
}
now there should be some sort of rule to deduce which of the solutions is the best solution
Yeah, so, usually for puzzles or for a specification you are fulfilling, that would be given as part of the problem. As it is, you want a general method that is adaptable. It's probably best to make an object that can be configured with the results and accepts rules, then load the rules you are interested in, and the results from the search, and see what rules match where. For example, using your algorithm and sample criteria:
least amount of groups
start with code like:
let reviewerFactory = {
  getReviewer (specification) { // generate a reviewer
    return {
      matches: [], // place to load sets to
      criteria: specification,
      review (objects) { // review the sets already loaded
        let group
        let results = {}
        this.matches.forEach(mset => {
          group = [] // gather each object from the initial set for each match in the result set
          mset.forEach(m => {
            group.push(objects[m])
          })
          results[mset] = this.criteria.scoring(group) // score the match relative to the specification
        })
        return this.criteria.evaluation(results) // pick the best score
      }
    }
  },
  specifications: {}
}
now you can add specifications like this one for least amount of groups:
reviewerFactory.specifications['LEAST GROUPS'] = {
  scoring: function (set) { return set.length },
  evaluation: function (res) { return Object.keys(res).sort((a,b) => res[a] - res[b])[0] }
}
then you can use that in the evaluation of a set:
mySet = [{min:0, max:2}, {min:1, max:3}, {min:2, max:4}, {min:3, max: 5}]
rf = reviewerFactory.getReviewer(reviewerFactory.specifications['LEAST GROUPS'])
Object {matches: Array(0), criteria: Object, review: function}
rf.matches = setsOf(mySet)
[Array(3), Array(3), Array(2), Array(1)]
rf.review(mySet)
"3"
or, most objects:
reviewerFactory.specifications['MOST GROUPS'] = {
  scoring: function (set) { return set.length },
  evaluation: function (res) { return Object.keys(res).sort((a,b) => res[a] - res[b]).reverse()[0] }
}
mySet = [{min:0, max:2}, {min:1, max:3}, {min:2, max:4}, {min:3, max: 5}]
reviewer = reviewerFactory.getReviewer(reviewerFactory.specifications['MOST GROUPS'])
reviewer.matches = setsOf(mySet)
reviewer.review(mySet)
"1,2,3"
Of course this is arbitrary, but so are the criteria, by definition in the OP. Likewise, you would have to change the algorithms here to work with my intersectingGroups function because it doesn't return indices. But this is what you are looking for I believe.
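For the specific rule "least amount of groups", a greedy pass may be enough, so no search over all solutions is needed. The Python sketch below rests on an assumption that is not stated in the question: a group is valid only when all of its members overlap each other, which for closed intervals means they share a common point (touching endpoints count as overlap, as objects 1 and 3 do in the example). The function name fewest_groups is hypothetical.

```python
def fewest_groups(objects):
    """Greedy grouping: sort by max, then open a new group whenever the
    next interval starts after the current group's shared point."""
    order = sorted(range(len(objects)), key=lambda i: objects[i]["max"])
    groups = []
    point = None  # a value contained in every member of the current group
    for i in order:
        if point is not None and objects[i]["min"] <= point:
            # max >= point holds because of the sort order, so i contains point
            groups[-1].append(i)
        else:
            point = objects[i]["max"]
            groups.append([i])
    return groups


objects = [
    {"min": 0, "max": 2},
    {"min": 1, "max": 3},
    {"min": 2, "max": 4},
    {"min": 3, "max": 5},
]
print(fewest_groups(objects))
```

With zero-based indices this yields the grouping [1, 2, 3], [4] from the question. Other rules, such as scoring by a different attribute, would still need something like the reviewer approach above.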

dictionary: Unique relative values where values are of list type

I am getting the output of word2vec_basic.py in the following format
Nearest to key1 : node1, node2, node3 ..
Nearest to key2 : node2, node4, node5 ..
This implies that node2 is comparatively closer to key2 over key1 (Please correct me if I am wrong, as I am newbie here)
It would be great if I get the output in the following format
Nearest to key1 : node1, node3 , node6..
Nearest to key2 : node2, node4, node5 ..
That is, consider only the closest neighbor for clustering.
Suggestions for the same?
I am maintaining a python dictionary for the same of the following format:
{
key1: [node1,node2,node3],
key2: [node2,node4,node5]
}
But I required,
{
key1: [node1,node3,node6],
key2: [node2,node4,node5]
}
And for the above dictionary, I will be needing
Nearest to key1 : node1, node3 , node6..
Nearest to key2 : node2, node4, node5 ..
Could we do this in tensorflow itself, or should I define a function which takes dictionary as input and give me the required output?
For eg:
If we have a python dictionary of the following format:
{
a: ["abc","bcd","def"],
b: ["def","xyz"]
}
Here the values are list. I am looking for the following format from the above input:
{
a: ["abc","bcd"],
b: ["def","xyz"]
}
Suggestions are welcome on how I could achieve it.
Also, are there any python in built functions which could help me to reach the above output format?
Before Python 3.7, dicts are unordered, so which dupe gets removed is not guaranteed (from 3.7 on, iteration follows insertion order). Either way, you can keep a set of the elements seen so far as you iterate over the items, dropping an element from a list/value if it has already been seen:
d = {
"a": ["abc","bcd","def"],
"b": ["def","xyz"]
}
seen = set()
for k, v in d.items():
    d[k] = [seen.add(ele) or ele for ele in v if ele not in seen]
print(d)
This could output:
{'b': ['def', 'xyz'], 'a': ['abc', 'bcd']}
Or:
{'a': ['abc', 'bcd', 'def'], 'b': ['xyz']}
It completely depends on which key you hit first.
As you can see from this top answer with 436 upvotes, the removal logic is efficient and it maintains the order if required. To also avoid the set.add attribute lookup each time, as in the link, you can set seen_add = seen.add and use seen_add(ele) in place of seen.add(ele).
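A small sketch of the seen_add variant, written as a function that returns a new dictionary instead of modifying d in place (the name dedupe_across_lists is hypothetical):

```python
def dedupe_across_lists(d):
    """Keep each element only in the first list where it is encountered."""
    seen = set()
    seen_add = seen.add  # bind once to avoid the attribute lookup per element
    result = {}
    for key, values in d.items():
        # seen_add() returns None, so the `or` branch only records the element
        result[key] = [v for v in values if not (v in seen or seen_add(v))]
    return result


d = {
    "a": ["abc", "bcd", "def"],
    "b": ["def", "xyz"],
}
print(dedupe_across_lists(d))
```

On Python 3.7+ the keys are visited in insertion order, so "def" stays under "a" here. This only removes later duplicates and does not compare how close a node is within each list.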
Since dictionary entries in Python are unordered, you need to first build a separate dictionary keyed by node, recording each list (or sequence) it is in as well as its index in that list, so that the relative distances in each list can be compared to one another. After that's done, a second pass through the dictionary's contents can consult it to decide whether each node should stay in each list it is in.
d = {
    "a": ["abc", "bcd", "def"],
    "b": ["def", "xyz"]
}
def check_usage(k, elem_usage):
    if len(elem_usage) == 1:  # unique?
        return True
    else:
        index = elem_usage[k]  # within this elem's seq
        for key, value in elem_usage.items():
            if key != k:
                if value < index:  # closer to the front of another list
                    return False
        else:
            return True
usage = {}
for key in d:  # build usage dictionary
    for index, item in enumerate(d[key]):
        usage.setdefault(item, {})[key] = index
for k, seq in d.items():  # remove nodes that are closer in other lists
    d[k] = [elem for elem in seq if check_usage(k, usage[elem])]
# display results
print('{')
for k in sorted(d):
    print('    {!r}: {},'.format(k, d[k]))
print('}')
Output:
{
'a': ['abc', 'bcd'],
'b': ['def', 'xyz'],
}

Merging 2 lists of dicts based on common values

So I have 2 lists of dicts in Python as follows:
list1 = [
    {"medication_name": "Actemra IV", "growth": 0, "total_prescriptions": 3},
    {"medication_name": "Actemra SC", "growth": 0.0, "total_prescriptions": 2},
    {"medication_name": "Adempas", "growth": 0, "total_prescriptions": 1}
]
list2 = [
    {"medication_name": "Actemra IV", "fulfillment_time": 94340},
    {"medication_name": "Actemra SC", "fulfillment_time": 151800},
    {"medication_name": "Adempas", "fulfillment_time": 156660}
]
What I would want is to have list1 appended with the fulfillment_time key from list2 so that the output is as follows:
[
    {"medication_name": "Actemra IV", "growth": 0, "fulfillment_time": 94340, "total_prescriptions": 3},
    {"medication_name": "Actemra SC", "growth": 0.0, "fulfillment_time": 151800, "total_prescriptions": 2},
    {"medication_name": "Adempas", "growth": 0, "fulfillment_time": 156660, "total_prescriptions": 1}
]
I achieved this in the traditional way of looping over both lists as follows:
for i in list1:
    for j in list2:
        if i['medication_name'] == j['medication_name']:
            i['fulfillment_time'] = j['fulfillment_time']
What I wanted to know is whether there are any built-in one-line functions in Python that perform the same task that I may not know of?
There is no "one line" way to do what you want, mainly because it's not a very natural operation: the data structures you are using don't naturally allow the operations you want to do. This is also borne out by the fact that your algorithm is rather inefficient: it loops all the way through list2 for every element of list1. It is quadratic in the number of elements.
It seems like you are thinking of the 'medication_name' as the key for the dictionaries in the list. But the list type provides no operation to find elements by that key.
A more pythonic approach would be to convert the list into a dictionary: then finding the right dictionary will become O(1). Something like this:
d = {i['medication_name']: i for i in list2}
for i in list1:
    i.update(d[i['medication_name']])
Python does provide a one-liner to merge the dictionaries, as shown.
This does raise the question of what you want to do if list2 contains no entry for one of the entries in list1: a try/except could be used to deal with that.
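As an alternative to try/except, dict.get with an empty default leaves unmatched entries unchanged. A minimal sketch that builds new dictionaries rather than updating list1 in place (merge_by_key and its key parameter are hypothetical names):

```python
def merge_by_key(primary, secondary, key="medication_name"):
    """Merge rows of `secondary` into matching rows of `primary`."""
    lookup = {row[key]: row for row in secondary}  # O(1) access by name
    # Rows without a match are merged with {} and so stay as they are.
    return [{**row, **lookup.get(row[key], {})} for row in primary]


list1 = [
    {"medication_name": "Actemra IV", "growth": 0, "total_prescriptions": 3},
    {"medication_name": "Adempas", "growth": 0, "total_prescriptions": 1},
]
list2 = [
    {"medication_name": "Actemra IV", "fulfillment_time": 94340},
]
print(merge_by_key(list1, list2))
```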
These data structures are a little bit "database-like". Perhaps you should be using sqlite3?
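To illustrate the sqlite3 suggestion, here is a sketch using an in-memory database. The table names prescriptions and fulfillment and the column types are assumptions made for this example; the data is taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets each row be converted with dict(row)

conn.executescript("""
    CREATE TABLE prescriptions (
        medication_name TEXT PRIMARY KEY,
        growth REAL,
        total_prescriptions INTEGER
    );
    CREATE TABLE fulfillment (
        medication_name TEXT PRIMARY KEY,
        fulfillment_time INTEGER
    );
""")
conn.executemany(
    "INSERT INTO prescriptions VALUES (?, ?, ?)",
    [("Actemra IV", 0, 3), ("Actemra SC", 0.0, 2), ("Adempas", 0, 1)],
)
conn.executemany(
    "INSERT INTO fulfillment VALUES (?, ?)",
    [("Actemra IV", 94340), ("Actemra SC", 151800), ("Adempas", 156660)],
)

# LEFT JOIN keeps prescriptions that have no fulfillment row (NULL time).
rows = conn.execute("""
    SELECT p.medication_name, p.growth, p.total_prescriptions, f.fulfillment_time
    FROM prescriptions AS p
    LEFT JOIN fulfillment AS f ON f.medication_name = p.medication_name
    ORDER BY p.medication_name
""").fetchall()
merged = [dict(row) for row in rows]
print(merged)
```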