In MapReduce, how does the shuffle step decide where each key should go?

Let's consider the basic word count example for a MapReduce job and the following input:
word1
word2
word1
word2
word3
For our processing, we assume that we have three mappers and three reducers.
In the map phase, the data is processed as follows:
MAP1: (word1,1), (word2,1)
MAP2: (word1,1), (word2,1)
MAP3: (word3,1)
Now the shuffle phase starts. All word1 keys need to end up together, as do the word2 and word3 keys.
The shuffle phase could decide to send word1 to reducer1, word2 to reducer2, and word3 to reducer3, or word1 to reducer2, etc.
How is it decided which reducer each key will be shuffled to?

Before the reduce step, Hadoop uses an implementation of Partitioner to determine which reducer each key should be sent to.
By default it is HashPartitioner, with the method:
public int getPartition(K key, V value, int numReduceTasks) {
    // mask off the sign bit so the hash is non-negative, then
    // take it modulo the number of reduce tasks
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
You can use a custom implementation if your job requires additional logic:
job.setPartitionerClass(YourPartitioner.class);
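To see which reducer each key from the example lands on, here is a minimal Python sketch (not Hadoop code) that mimics this hash partitioning. It uses CRC32 as a stand-in for Java's String.hashCode(), since Python's built-in hash() is salted per process:
import zlib

def get_partition(key, num_reduce_tasks):
    # non-negative hash modulo the reducer count, as in HashPartitioner
    return (zlib.crc32(key.encode()) & 0x7FFFFFFF) % num_reduce_tasks

for word in ["word1", "word2", "word3"]:
    print(word, "-> reducer", get_partition(word, 3))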

Related

How do you check for multiple specific words in a string?

I am working on a text-based game, and want the program to search for multiple specific words, in order, in the user's answer. For example, I want to find the words "Take" and "item" in a user's response without making the user type exactly "Take item".
I know that you can use
if this in that
to check if this word is in that string, but what about multiple words with fluff in between?
The code I am using now is
if ("word1" and "word2" and "word3") in ans:
but this is lengthy and won't work for every single input in a text-based game. What else works?
A regex-based solution might be to use re.match:
import re

text = "word1 and word2 and word3"
match = re.match(r'(?=.*\bword1\b)(?=.*\bword2\b)(?=.*\bword3\b).*', text)
if match:
    print("MATCH")
The regex pattern makes use of positive lookaheads, which assert that each word appears somewhere in the string.
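If the set of required words is not fixed, the same lookahead pattern can be built programmatically. A small sketch, with an illustrative word list:
import re

required = ["take", "item"]  # hypothetical words for the game
pattern = "".join(rf"(?=.*\b{re.escape(w)}\b)" for w in required)
if re.match(pattern, "please take the shiny item", re.IGNORECASE):
    print("MATCH")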
If I understand the problem correctly, we might want to build a dictionary of keys and values here, and then look up our desired outputs:
word_action_library = {
    'Word1': 'Take item for WORD1',
    'Some other words we wish before Word1': 'Do not take item for WORD1',
    'Before that some WOrd1 and then some other words': 'Take items or do not take item, if you wish for WORD1',
    'Word2': 'Take item for WORD2',
    'Some other words we wish before Word2': 'Do not take item for WORD2',
    'Before that some WOrd2 and then some other words': 'Take items or do not take item, if you wish for WORD2',
}
print([value for key, value in word_action_library.items() if 'word1' in key.lower()])
print([value for key, value in word_action_library.items() if 'word2' in key.lower()])
Output
['Take item for WORD1', 'Do not take item for WORD1', 'Take items or do not take item, if you wish for WORD1']
['Take item for WORD2', 'Do not take item for WORD2', 'Take items or do not take item, if you wish for WORD2']

Solr search query: given word with numbers in neighborhood

I just found out that the Solr server can find words within a given distance of another word, like this:
text_original : "word1 word2"~10
So Solr searches for word1 that has a word2 within a maximal distance of 10 words around it.
great, YAY
but now I want to do the same with some undefined numbers. I want to look for numbers which occur within a given range of some keywords. As a regex I would write something like this:
myWord(\s)+(([A-Za-z]+)\s){0,10}([0-9]{3,12}(\.|\,)[0-9]{1,4})
or something like that.
So I thought it would be easy in Solr to do it similarly to words within a range:
text_original: Word1 /[0-9]{3,12}/~10
But now the two terms are linked with OR, so I find numbers OR my given word. And I can't use quotation marks, because then the regex won't work.
Can anyone give me a hint as to how this search term has to be structured so that it works as described?
You can do this through the ComplexPhraseQueryParser, with a query like:
text_original:"Word1 /[0-9]{3,12}/"~10
Keep in mind that a regex query in Lucene must match the whole term, so this would not match "word1 word2", but it would match "word1 extra stuff 20". Slop also seemed a bit odd in my testing.
If you are willing to fall back on writing a raw Lucene query, you can also accomplish this using the SpanQuery API, for example:
SpanQuery wordQuery = new SpanTermQuery(new Term("text_original", "Word1"));
SpanQuery numQuery = new SpanMultiTermQueryWrapper<>(
        new RegexpQuery(new Term("text_original", "[0-9]{3,12}")));
Query proxQuery = new SpanNearQuery(new SpanQuery[] {wordQuery, numQuery}, 10, false);
searcher.search(proxQuery, numHits);

Compare or match 2 strings and display matched word

I would like to compare 2 strings and display any matched words.
For example -
string1 = "cat feet"
string2 = "cat shoes"
The result should be "cat".
How can I do this with regular expressions? Or is there a better way to do this?
Split each string on whitespace, and convert both to sets. Their intersection will contain all of the words they have in common.
>>> set("cat feet".split()).intersection(set("cat shoes".split()))
{'cat'}
This method does not care about the ordering of words: "feet cat" and "cat shoes" will output "cat", even though "cat" does not appear at the same position in both strings. If you want to find words that occur at the same position in both strings, you can zip the split strings together and keep only the words that match position-wise:
>>> [a for a,b in zip("cat feet".split(), "cat shoes".split()) if a == b]
['cat']
>>> [a for a,b in zip("feet cat".split(), "cat shoes".split()) if a == b]
[]
Just regarding the use of regular expressions:
Regular expressions are equivalent to finite automata, and these have only a finite set of states, which in turn means they have a kind of finite memory. Thus you can't use them for tasks that involve a target string of unknown, arbitrary length.
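That said, for this particular task you can sidestep the limitation by building the pattern from one of the inputs at runtime. A small sketch:
import re

words = "cat feet".split()
pattern = r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"
print(re.findall(pattern, "cat shoes"))  # ['cat']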

Need string matching algorithm

Data:
x.txt: a simple text file (around 1 MB)
y.txt: a dictionary file (around 100,000 words)
I need to find whether any of the words in y.txt are present in x.txt.
I need an algorithm that takes little execution time, and a suggestion for a language well suited to it.
PS: Please suggest any algorithm apart from the brute-force method.
I need pattern matching rather than exact string matching.
For instance:
x.txt: "The Old Buzzards were disestablished on 27 April"
y.txt: "establish"
The output should be: Found establish in x.txt : Line 1
Thank you.
It is not clear to me whether you need this to get a job done or it is homework. If you need it to get a job done, then:
#!/usr/bin/bash
Y=`cat y.txt | tr '\n' '|'`
echo "${Y%|}"
grep -E "${Y%|}" x.txt
if [ "$?" -eq 0 ]
then
    echo "found"
else
    echo "no luck"
fi
is hard to beat: you slurp in all the patterns from a file, construct a regular expression (the echo shows the regex), and then hand it to grep, which constructs a finite state automaton for you. That is going to fly, as it compares every character in the text at most once. If it is homework, then I suggest you consult Cormen et al., 'Introduction to Algorithms', or the first few chapters of the Dragon Book, which will also explain what I just said.
Forgot to add: y.txt should contain your patterns, one per line, but as a nice side effect your patterns do not have to be single words.
Suppose you have any Set implementation in your standard library; here is some pseudo-code:
dictionary = empty set
def populate_dict():
    for word in dict_file:
        add(dictionary, word)
def validate_text(text_file):
    for word in text_file:      ### O(|text_file|)
        if word in dictionary:  ### O(log |dictionary|)
            report(word)
populate_dict()
every_now_and_then(populate_dict)
That would give you O(t * log d) instead of the brute-force O(t * d), where t and d are the lengths of the input text file and the dictionary respectively. I don't think anything faster is possible, since you can't read the file faster than O(t) and can't search faster than O(log d).
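A direct Python rendering of this pseudo-code, using the file names from the question (note that Python's set is hash-based, so lookups are O(1) on average rather than O(log d)):
# load the dictionary words into a set for fast membership tests
with open("y.txt") as f:
    dictionary = {line.strip() for line in f if line.strip()}

# scan the text and report each word that appears in the dictionary
with open("x.txt") as f:
    for lineno, line in enumerate(f, start=1):
        for word in line.split():
            if word in dictionary:
                print(f"Found {word} in x.txt : Line {lineno}")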
This is a search algorithm I have had in mind for a while.
Basically, the algorithm has two steps.
In the first step, all the words from y.txt are inserted into a tree. Every path in the tree from the root to a leaf is a word; the leaf is empty.
For example, the tree for the words dog and day is the following:
<root>-<d>-<a>-<y>-<>
         \-<o>-<g>-<>
The second part of the algorithm is a search down the tree. When you reach an empty leaf, you have found a word.
Here is an implementation in Groovy; if more comments are needed, just ask:
//create a tree to store the words in a compact and fast-to-search way
//each path of the tree from root to an empty leaf is a word
def tree = [:]
new File('y.txt').eachLine{ word ->
    def t = tree
    word.each{ c ->
        if(!t[c]){
            t[c] = [:]
        }
        t = t[c]
    }
    t[0] = 0 //word terminator (the leaf)
}
println tree //for debug purposes
//search for the words in x.txt
new File('x.txt').eachLine{ str, line ->
    for(int i = 0; i < str.length(); i++){
        if(tree[str[i]]){
            def t = tree[str[i]]
            def res = str[i]
            def found = false
            for(int j = i + 1; j < str.length(); j++){
                if(t[str[j]] == null){
                    if(found){
                        println "Found $res at line $line, col $i"
                        res = str[j]
                        found = false
                    }
                    break
                }else if(t[str[j]][0] == 0){
                    found = true
                    res += str[j]
                    t = t[str[j]]
                    continue
                }else{
                    t = t[str[j]]
                    res += str[j]
                }
                found = false
            }
            if(found) println "Found $res at line $line, col $i" //I know, an ugly repetition; it handles words at the end of a line. I will fix this later.
        }
    }
}
This is my y.txt:
dog
day
apple
daydream
and my x.txt:
This is a beautiful day and I'm walking with my dog while eating an apple.
Today it's sunny.
It's a daydream
The output is the following:
$ groovy search.groovy
[d:[o:[g:[0:0]], a:[y:[0:0, d:[r:[e:[a:[m:[0:0]]]]]]]], a:[p:[p:[l:[e:[0:0]]]]]]
Found day at line 1, col 20
Found dog at line 1, col 48
Found apple at line 1, col 68
Found day at line 2, col 2
Found daydream at line 3, col 7
This algorithm should be fast because the depth of the tree doesn't depend on the number of words in y.txt. The depth is equal to the length of the longest word in y.txt.

Regex that finds words with exactly two 'a's

I'd like a regex that finds words with exactly two a's (not 3, 4, 5, ...). What pattern do I need? The a's don't have to be in a row.
["taat","weagda","aa"] is ok,
but not this ["a","eta","aaa","aata","ssdfaasdfa"].
This one will work:
^[^a]*a[^a]*a[^a]*$
More generalized version where you can replace 2 with any number:
^(?:[^a]*a){2}[^a]*$
The two regexes above make use of the fact that a is a single character, so we can require all the other characters to not be a. The second one uses repetition notation.
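A quick Python check of the generalized pattern against the examples from the question:
import re

pattern = re.compile(r'^(?:[^a]*a){2}[^a]*$')
words = ["taat", "weagda", "aa", "a", "eta", "aaa", "aata", "ssdfaasdfa"]
print([w for w in words if pattern.match(w)])  # ['taat', 'weagda', 'aa']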
Even more generalized version, "exactly n non-overlapping occurrences of a substring" (DOTALL mode enabled):
^(?!(?:.*sstr){3})(?:.*sstr){2}.*$
where sstr is a regex-escaped substring, and the number of repetitions in the negative lookahead must be one more than the number we want to match.
This one is slightly trickier: I use a negative lookahead to make sure the string doesn't contain n + 1 non-overlapping instances of the substring sstr, and then try to find exactly n non-overlapping instances.
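A sketch of the same idea in Python, with "ab" as an illustrative substring and n = 2:
import re

sstr = re.escape("ab")  # the substring, regex-escaped
pattern = re.compile(rf'^(?!(?:.*{sstr}){{3}})(?:.*{sstr}){{2}}.*$', re.DOTALL)
print(bool(pattern.match("xxabyyab")))  # True: exactly two occurrences
print(bool(pattern.match("ababab")))    # False: three occurrences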
In this situation, I think, you can just use plain string handling with a for loop:
mylist = ["taat","weagda","aa","eta","aaa","aata","ssdfaasdfa"];
resultList = [];
for x in mylist:
count = 0;
for c in x:
if c == 'a':
count = count +1;
if count == 2:
resultList.append(x);
print(resultList);
Do it with two regexes rather than trying to cram it all into one.
Check that your word matches a[^a]*a and does not match a.*a.*a.
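In Python, that two-regex check could look like this (a minimal sketch):
import re

def exactly_two_a(word):
    # at least two a's with any gap between them, but not three or more
    return bool(re.search(r'a[^a]*a', word)) and not re.search(r'a.*a.*a', word)

print([w for w in ["taat", "weagda", "aa", "a", "eta", "aaa"] if exactly_two_a(w)])
# ['taat', 'weagda', 'aa']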
You can also use a Counter object for this task.
In [1]: from collections import Counter
In [2]: words = ["taat","weagda","aa", "a","eta","aaa","aata","ssdfaasdfa"]
In [3]: [word for word in words if Counter(word)['a'] == 2]
Out[3]: ['taat', 'weagda', 'aa']