Hadoop MapReduce Shuffle & Sort: Why is the 'group' operation needed? - mapreduce

The 'group' step of the shuffle is supposed to turn the data into <key, List<value>> form, but my reducer.py never sees such a list; it just keeps reading lines of <key, value> form from standard input.
Look at the code below:
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
So why does it work this way? Is Hadoop Streaming converting the <key, List<value>> data back into <key, value> form on standard input? If so, why is the 'group' operation needed at all? Wouldn't it be the same if the 'sort' operation simply put identical keys next to each other and fed the result line by line to reducer.py?
reducer.py:
import sys
current_word = None
current_count = 0
word = None
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print('%s\t%s' % (current_word, current_count))
        current_word = word
        current_count = count
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
sys.exit(0)
Suppose we have a word-frequency example that counts the number of occurrences of a, b, c, d.
1. With the 'group' operation, the data looks like:
(b,[2,3])
(c,[1,5])
(d,[3,6])
(a,[2,4])
2. With the 'sort' operation, the data looks like:
(a,[2,4])
(b,[2,3])
(c,[1,5])
(d,[3,6])
3. When reducer.py receives the data, it looks like:
(a,2)
(a,4)
(b,2)
(b,3)
(c,1)
(c,5)
(d,3)
(d,6)
So I want to know what turned stage 2 into stage 3. And if there were no 'group' step:
1. Without the 'group' operation, but with the 'sort' operation, the data would also look like:
(a,2)
(a,4)
(b,2)
(b,3)
(c,1)
(c,5)
(d,3)
(d,6)
2. reducer.py receives the above data. Isn't that fine? I do not understand. :-)

Only the Java API presents the values to the reducer as a list (actually an Iterable).
Yes, MapReduce has a Shuffle and Sort phase, but in Streaming the key-value pairs are presented to the reducer one per line, sorted by key. With that information you can detect the boundaries between different keys, which naturally forms the groups, and reduce over each of them.
From the source you copied the code from, you can see the output of mapper.py | sort -k1,1.
The input to the (first) reducer looks like this:
a 1
a 1
b 1
b 1
b 1
b 1
c 1
c 1
d 1
Just read the code: nothing strips out ( ) or [ ] characters, and nothing splits on a comma. Your mapper prints a tab between the key and the value 1.
The first iteration of the reducer will hit this code:
current_word = word # found a new word
current_count = count # this will always start at 1 for word count
The reducer then accumulates counts for the sorted keys until it finds a new word, at which point it prints the total for the previous word.
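The boundary detection described above can also be sketched with itertools.groupby, which only groups consecutive equal keys and therefore relies on the input already being sorted. This is an illustration, not part of Hadoop Streaming: the function name reduce_counts and the sample lines are assumptions made for the example.

```python
from itertools import groupby


def reduce_counts(lines):
    """Sum the counts of consecutive lines that share a key.

    Assumes the lines are already sorted by key and formatted as
    "key<TAB>count", which is how a streaming reducer reads them.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    totals = []
    # groupby starts a new group whenever the key changes, which is the
    # same boundary test the hand-written reducer performs.
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        totals.append((word, sum(int(count) for _, count in group)))
    return totals


# Hypothetical sorted mapper output, mirroring the sample above.
sample = ["a\t1\n", "a\t1\n", "b\t1\n", "b\t1\n", "b\t1\n",
          "b\t1\n", "c\t1\n", "c\t1\n", "d\t1\n"]
for word, total in reduce_counts(sample):
    print("%s\t%s" % (word, total))
```

In a real reducer the lines would come from sys.stdin instead of the sample list.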

Related

Scoring multiple TRUEs in Python RE Search

Background
I have a list of "bad words" in a file called bad_words.conf, which reads as follows
(I've changed it so that it's clean for the sake of this post but in real-life they are expletives);
wrote (some )?rubbish
swore
I have a user input field which is cleaned and stripped of dangerous characters before being passed as data to the following script, score.py
(for the sake of this example I've just typed in the value for data)
import re
data = 'I wrote some rubbish and swore too'
# Get list of bad words
bad_words = open("bad_words.conf", 'r')
lines = bad_words.read().split('\n')
combine = "(" + ")|(".join(lines) + ")"
# set score in case there are no results
score = 0
# search for bad words
if re.search(combine, data):
    # add one for a hit
    score += 1
# show me the score
print(str(score))
bad_words.close()
Now this finds a result and adds a score of 1, as expected, without a loop.
Question
I need to adapt this script so that I can add 1 to the score every time a line of "bad_words.conf" is found within text.
So in the instance above, data = 'I wrote some rubbish and swore too' I would like to actually score a total of 2.
1 for "wrote some rubbish" and +1 for "swore".
Thanks for the help!
Changing combine to just:
combine = "|".join(lines)
And using re.findall():
In [33]: re.findall(combine,data)
Out[33]: ['rubbish', 'swore']
The problem with having multiple capturing groups, as in your original pattern, is that re.findall() then returns a tuple per match, with an empty string for every group that did not take part in that match.
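One way to sidestep capturing groups altogether is to test each line of the file on its own and add one point per pattern that matches. This also avoids a pitfall with patterns such as wrote (some )?rubbish, where re.findall() returns the text of the group rather than the whole match. The sketch below is an assumption about the desired behavior (one point per pattern, however often it occurs); the function name score_text is made up for the example.

```python
import re


def score_text(patterns, text):
    """Return how many of the patterns match somewhere in text."""
    score = 0
    for pattern in patterns:
        pattern = pattern.strip()
        if not pattern:
            # Skip blank lines, e.g. the one left by a trailing newline;
            # an empty pattern would match any text.
            continue
        if re.search(pattern, text):
            score += 1
    return score


# Stand-in for the lines read from bad_words.conf.
bad_words = ["wrote (some )?rubbish", "swore", ""]
print(score_text(bad_words, "I wrote some rubbish and swore too"))
```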

Regex searching rows in CSV for characters getting hung up on first match

I am new to scripting, so my code may be a bit mangled; I apologize in advance.
I am trying to iterate through a CSV and write it to an Excel workbook using openpyxl. But before I write it, I am performing a few checks to determine which sheet to write the row to.
The row has content such as: "KB4462941", "kb/9191919", "kb -919", "sdfklKB91919".
I am trying to pull the first numbers following "KB" and stop reading characters once a non-numeric character is found. Once I find the number, I run a separate function that queries a DB. That function works.
The problem I am running into is that once it finds the first KB (KB4462941), it gets hung up and goes over that KB multiple times until the last time it appears in that row, and then the program finishes.
Unfortunately, there is no default location for the KB characters in the row, and no fixed character count between the KB and the first numbers.
My code:
with open('test.csv') as file:
    reader = csv.reader(file, delimiter = ',')
    for row in reader:
        if str(row).find("SSL") != -1:
            ws = book.get_sheet_by_name('SSL')
            ws.append(row)
        else:
            mylist = list(row)
            string = ''.join(mylist)
            tmplist = list()
            resultlist = list()
            pattern = 'KB.*[0-9]*'
            for i in mylist:
                tmplist += re.findall(pattern, i, re.IGNORECASE)
            for i in tmplist:
                resultlist += re.findall('[0-9]*', i)
            for i in resultlist:
                if len(i) > 4:
                    print i
                    if dbFunction(i) == 1:
                        ws = book.get_sheet_by_name('Found')
                        ws.append(row)
                    else:
                        ws = book.get_sheet_by_name('Nothing')
                        ws.append(row)
output:
1st row is skipped
2nd row is in the right place
3rd and 4th row in the right place
5th row is written for the next nine rows.
never gets to the following 3 rows.
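A likely cause, judging only from the code shown: in 'KB.*[0-9]*' the greedy .* swallows the rest of the cell, the second findall then returns every run of digits in it, and the row is appended once per run longer than four characters. A minimal sketch that captures only the first digits after "KB" and stops at the first cell that has them is below. The names KB_PATTERN and first_kb_number are made up for the example, and the database lookup and sheet writing are left out.

```python
import re

# "KB", then any run of non-digits, then capture the first run of digits.
KB_PATTERN = re.compile(r"KB\D*(\d+)", re.IGNORECASE)


def first_kb_number(row):
    """Return the first KB number found in any cell of row, or None."""
    for cell in row:
        match = KB_PATTERN.search(cell)
        if match:
            return match.group(1)
    return None


# Sample cells taken from the question.
for cell in ["KB4462941", "kb/9191919", "kb -919", "sdfklKB91919"]:
    print(cell, "->", first_kb_number([cell]))
```

Calling first_kb_number(row) once per row yields at most one value, so the row would be appended to a sheet at most once.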

Python3: Checking if a key word within a dictionary matches any part of a string

I'm having trouble converting my working code from lists to dictionaries. The basics of the code checks a file name for any keywords within the list.
But I'm having a tough time understanding dictionaries well enough to convert it. I am trying to pull the name of each key and compare it to the file name like I did with lists and tuples. Here is a mock version of what I was doing.
fname = "../crazyfdsfd/fds/ss/rabbit.txt"
hollow = "SFV"
blank = "2008"
empty = "bender"
# things is list
things = ["sheep", "goat", "rabbit"]
# other is tuple
other = ("sheep", "goat", "rabbit")
# stuff is dictionary
stuff = {"sheep": 2, "goat": 5, "rabbit": 6}
try:
    print(type(things), "things")
    for i in things:
        if i in fname:
            hollow = str(i)
            print(hollow)
            if hollow == things[2]:
                print("PERFECT")
except:
    print("c-c-c-combo breaker")
print("\n \n")
try:
    print(type(other), "other")
    for i in other:
        if i in fname:
            blank = str(i)
            print(blank)
            if blank == other[2]:
                print("Yes. You. Can.")
except:
    print("THANKS OBAMA")
print("\n \n")
try:
    print(type(stuff), "stuff")
    for i in stuff: # problem loop
        if i in fname:
            empty = str(i)
            print(empty)
            if empty == stuff[2]: # problem line
                print("Shut up and take my money!")
except:
    print("CURSE YOU ZOIDBERG!")
I am able to get a full run through the first two examples, but I cannot get the dictionary to run without hitting its exception. The loop is not converting empty into stuff[2]'s value, leaving the money regrettably in Fry's pocket. Let me know if my example isn't clear enough for what I am asking. The dictionary is just a shortcut for counting lists and adding files to other variables.
A dictionary is a collection that maps keys to values; you look items up by key, not by position. If you define stuff to be:
stuff = {"sheep": 2, "goat": 5, "rabbit": 6}
You can refer to its elements with:
stuff['sheep'], stuff['goat'], stuff['rabbit']
stuff[2] will result in a KeyError, because the key 2 is not found in your dictionary. You can't compare a string with the last or third value of a dictionary by index, because a dictionary is not a sequence and cannot be indexed by position (older Python versions did not guarantee any ordering; Python 3.7+ preserves insertion order, but positional indexing still does not exist). Use a list or tuple if you need to compare to the last item of an ordered sequence.
If you want to traverse a dictionary, you can use this as a template:
for k, v in stuff.items():
    if k == 'rabbit':
        # do something - k will be 'rabbit' and v will be 6
        pass
If you want to check the keys in a dictionary to see if they match part of a string:
for k in stuff.keys():
    if k in fname:
        print('found', k)
Some other notes:
The KeyError would be much easier to notice... if you took out your try/except blocks. Hiding python errors from end-users can be useful. Hiding that information from YOU is a bad idea - especially when you're debugging an initial pass at code.
You can compare to the last item in a list or tuple with:
if hollow == things[-1]:
if that is what you're trying to do.
In your last loop: empty == str(i) needs to be empty = str(i).
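Putting the pieces above together, here is a small sketch that looks for any dictionary key inside the file name and returns the key together with its value. The helper name match_key is made up for the example, and returning the first match (in the dictionary's iteration order) is an assumption about what is wanted.

```python
def match_key(mapping, filename):
    """Return the (key, value) pair of the first key contained in filename, or None."""
    for key, value in mapping.items():
        if key in filename:
            return key, value
    return None


stuff = {"sheep": 2, "goat": 5, "rabbit": 6}
fname = "../crazyfdsfd/fds/ss/rabbit.txt"
print(match_key(stuff, fname))
```

The comparison then becomes a test on the returned key, for example match_key(stuff, fname)[0] == "rabbit", rather than an index into the dictionary.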

IndexError: list index out of range for list of lists in for loop

I've looked at the other questions posted on the site about index errors, but I'm still not understanding how to fix my own code. I'm a beginner when it comes to Python. Based on the user's input, I want to check whether that input lies in the fourth position of each line in the list of lists.
Here's the code:
# create a list of lists from the missionPlan.txt
from __future__ import with_statement
listoflists = []
with open("missionPlan.txt", "r") as f:
    results = [elem for elem in f.read().split('\n') if elem]
    for result in results:
        listoflists.append(result.split())
#print(listoflists)
#print(listoflists[2][3])
choice = int(input('Which command would you like to alter: '))
i = 0
for rows in listoflists:
    while i < len(listoflists):
        if listoflists[i][3]==choice:
            print (listoflists[i][0])
        i += 1
This is the error I keep getting:
not getting inside the if statement
So, I think this is what you're trying to do - find any line in your "missionPlan.txt" where the 4th word (after splitting on whitespace) matches the number that was input, and print the first word of such lines.
If that is indeed accurate, then perhaps something along this line would be a better approach.
choice = int(input('Which command would you like to alter: '))
allrecords = []
with open("missionPlan.txt", "r") as f:
    for line in f:
        words = line.split()
        allrecords.append(words)
        try:
            if len(words) > 3 and int(words[3]) == choice:
                print(words[0])
        except ValueError:
            pass
Also, if, as your tags suggest, you are using Python 3.x, I'm fairly certain the from __future__ import with_statement isn't particularly necessary...
EDIT: added a couple of lines based on the comments below. In addition to examining every line as it is read, and printing the first field of every line whose fourth field matches the input, the code now gathers each line into the allrecords list, split into separate words, which corresponds to the original question's listoflists. This enables further processing of the file later in the code. Also fixed one glaring mistake: it is line that needs to be split into words, not f...
Also, to answer your "I cant seem to get inside that if statement" observation - that's because you're comparing a string (listoflists[i][3]) with an integer (choice). The code above addresses both that comparison mismatch and the check for there actually being enough words in a line to do the comparison meaningfully...
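The same check can be wrapped in a function that works on any list of lines, which makes it easy to try without the file. This is only a sketch: the name first_words_matching and the sample lines are made up, since the real contents of missionPlan.txt are not shown.

```python
def first_words_matching(lines, choice):
    """Return the first word of each line whose fourth word equals choice as an integer."""
    matches = []
    for line in lines:
        words = line.split()
        if len(words) <= 3:
            continue  # too short to have a fourth field
        try:
            if int(words[3]) == choice:
                matches.append(words[0])
        except ValueError:
            continue  # fourth field is not a number
    return matches


# Hypothetical file contents, including a blank and a short line.
sample_lines = ["takeoff 0 0 1", "waypoint 10 20 2", "", "land 0 0",
                "hover 5 5 x", "turn 1 1 2"]
print(first_words_matching(sample_lines, 2))
```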

Create a vector of occurrences the same size as an input string

I'm new to python and needed some help.
I have a string such as ACAACGG
I would now like to create 3 vectors where the elements are the counts of particular letter.
For example, for "A", this would produce (1123333)
For "C", this would produce (0111222)
etc.
I'm not sure how to put the results of the counting into a string or into a vector.
I believe this is similar to counting the occurrences of a character in a string, but I'm not sure how to have it run through the string and place the count value at each point.
For reference, I'm trying to implement the Burrows-Wheeler transform and use it for a string search. But, I'm not sure how to create the occurrence vector for the characters.
def bwt(s):
    s = s + '$'
    return ''.join([x[-1] for x in
                    sorted([s[i:] + s[:i] for i in range(len(s))])])
This gives me the transform and I'm trying to create the occurrence vector for it. Ultimately, I want to use this to search for repeats in a DNA string.
Any help would be greatly appreciated.
I'm not sure what type you want the vectors to be in, but here's a function that returns a list of ints.
In [1]: def countervector(s, char):
   ....:     c = 0
   ....:     v = []
   ....:     for x in s:
   ....:         if x == char:
   ....:             c += 1
   ....:         v.append(c)
   ....:     return v
   ....:
In [2]: countervector('ACAACGG', 'A')
Out[2]: [1, 1, 2, 3, 3, 3, 3]
In [3]: countervector('ACAACGG', 'C')
Out[3]: [0, 1, 1, 1, 2, 2, 2]
Also, here's a much shorter way to do it, but it will probably be inefficient on long strings:
def countervector(s, char):
    return [s[:i+1].count(char) for i, _ in enumerate(s)]
I hope it helps.
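If the vectors are needed for every letter at once, the running count can also be built with itertools.accumulate, which keeps the single pass per letter of the first version. This sketch assumes Python 3; the function name occurrence_vectors is made up for the example.

```python
from itertools import accumulate


def occurrence_vectors(s):
    """Map each distinct character of s to its running count at every position."""
    return {
        # 1 where the character occurs, 0 elsewhere; accumulate sums as it goes.
        char: list(accumulate(1 if x == char else 0 for x in s))
        for char in sorted(set(s))
    }


vectors = occurrence_vectors("ACAACGG")
print(vectors["A"])
print(vectors["C"])
```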
As promised, here is the finished script I wrote. For reference, I'm trying to use the Burrows-Wheeler transform to do repeat matching in strings of DNA. Basically the idea is to take a strand of DNA of some length M and find all repeats within that string. So, as an example, if I had the string acaacg and searched for all duplicated substrings of size 2, I would get a count of 1 and the starting locations of 0,3. You could then type in string[0:2] and string[3:5] to verify that they do actually match and that their result is "ac".
If anyone is interested in learning about the Burrows-Wheeler transform, a Wikipedia search on it produces very helpful results. Here is another source from Stanford that also explains it well. http://www.stanford.edu/class/cs262/notes/lecture5.pdf
Now, there are a few issues that I did not address in this. First, I'm using n^2 space to create the BW transform. Also, I'm creating a suffix array, sorting it, and then replacing it with numbers, so creating that may take up a bit of space. However, at the end I'm only really storing the occ matrix, the end column, and the word itself.
Despite the RAM problems for strings larger than 4^7 (got this to work with a string size of 40,000 but no larger...), I would call this a success seeing as, before Monday, the only thing I knew how to do in Python was to have it print my name and hello world.
# generate random string of DNA
def get_string(length):
    string=""
    for i in range(length):
        string += random.choice("ATGC")
    return string
# Make the BW transform from the generated string
def make_bwt(word):
    word = word + '$'
    return ''.join([x[-1] for x in
                    sorted([word[i:] + word[:i] for i in range(len(word))])])
# Make the occurrence matrix from the transform
def make_occ(bwt):
    letters=set(bwt)
    occ={}
    for letter in letters:
        c=0
        occ[letter]=[]
        for i in range(len(bwt)):
            if bwt[i]==letter:
                c+=1
            occ[letter].append(c)
    return occ
# Get the initial starting locations for the Pos(x) values
def get_starts(word):
    list={}
    word=word+"$"
    for letter in set(word):
        list[letter]=len([i for i in word if i < letter])
    return list
# Single range finder for the BWT. This produces a first and last position for one read.
def get_range(read,occ,pos):
    read=read[::-1]
    firstletter=read[0]
    newread=read[1:len(read)]
    readL=len(read)
    F0=pos[firstletter]
    L0=pos[firstletter]+occ[firstletter][-1]-1
    F1=F0
    L1=L0
    for letter in newread:
        F1=pos[letter]+occ[letter][F1-1]
        L1=pos[letter]+occ[letter][L1] -1
    return F1,L1
# Iterate the single read finder over the entire string to search for duplicates
# (relies on the global variable word defined in the driver code)
def get_range_large(readlength,occ,pos,bwt):
    output=[]
    for i in range(0,len(bwt)-readlength):
        output.append(get_range(word[i:(i+readlength)],occ,pos))
    return output
# Create suffix array to use later
def get_suf_array(word):
    suffix_names=[word[i:] for i in range(len(word))]
    suffix_position=range(0,len(word))
    # sorted() works on the zip iterator in Python 3 as well as on the Python 2 list
    output=sorted(zip(suffix_names,suffix_position))
    output2=[]
    for i in range(len(output)):
        output2.append(output[i][1])
    return output2
# Remove single hits that were a result of using the substrings to scan the large string
def keep_dupes(bwtrange):
    mylist=[]
    for i in range(0,len(bwtrange)):
        if bwtrange[i][1]!=bwtrange[i][0]:
            mylist.append(tuple(bwtrange[i]))
    newset=set(mylist)
    newlist=list(newset)
    newlist.sort()
    return newlist
# Count the duplicate entries
def count_dupes(hits):
    c=0
    for i in range(0,len(hits)):
        sum=hits[i][1]-hits[i][0]
        if sum > 0:
            c=c+sum
    return c
# Get the coordinates from BWT and use the suffix array to map them back to their original indices
# (relies on the global suffix array sa defined in the driver code)
def get_coord(hits):
    mylist=[]
    for element in hits:
        mylist.append(sa[element[0]-1:element[1]])
    return mylist
# Use the coordinates to get the actual strings that are duplicated
def get_dupstrings(coord,readlength):
    output=[]
    for element in coord:
        temp=[]
        for i in range(0,len(element)):
            string=word[element[i]:(element[i]+readlength)]
            temp.append(string)
        output.append(temp)
    return output
# Merge the strings and the coordinates together for one big list.
def together(dupstrings,coord):
    output=[]
    for i in range(0,len(coord)):
        merge=dupstrings[i]+coord[i]
        output.append(merge)
    return output
Now run the commands as follows
import random # This is needed to generate a random string
readlength=12 # pick read length
word=get_string(4**7) # make random word
bwt=make_bwt(word) # make bwt transform from word
occ=make_occ(bwt) # make occurrence matrix
pos=get_starts(word) # gets start positions of sorted first row
bwtrange=get_range_large(readlength,occ,pos,bwt) # Runs the get_range function over all substrings in a string.
sa=get_suf_array(word) # This function builds a suffix array and numbers it.
hits=keep_dupes(bwtrange) # Pulls out the number of entries in the bwt results that have more than one hit.
dupes=count_dupes(hits) # counts hits
coord=get_coord(hits) # This part attempts to pull out the coordinates of the hits.
dupstrings=get_dupstrings(coord,readlength) # pulls out all the duplicated strings
strings_coord=together(dupstrings,coord) # puts coordinates and strings in one file for ease of viewing.
print(dupes)
print(strings_coord)