MapReduce word count in Python - python-2.7

Well, I am new to MapReduce programs. When I search for examples of MapReduce programs, all I get is the word count program, and all of those examples use a text file as the input. I tried using a CSV file as input, and the reducer does not work the way it does for a text file.
http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
This is the example I am currently looking at. Could someone explain the reason for this?

You can use the collections.Counter class:
from collections import Counter

with open(filename) as handler:
    counter = Counter(handler.read().split())

print(counter.most_common(10))
In the documentation you will find a lot more useful information.
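Note that this splits on whitespace, just like the mapper in the tutorial you linked, which is exactly why a CSV line doesn't tokenize the way you expect. For comma-separated input you can split on commas instead (a sketch, assuming the same filename as above):
from collections import Counter

# split each line on commas so CSV fields are counted as words
with open(filename) as handler:
    counter = Counter(word.strip()
                      for line in handler
                      for word in line.split(','))

print(counter.most_common(10))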

Distinguish between
MapReduce: a programming model for parallel execution
Python: a programming language
Hadoop: a cluster platform Python can run on
Do you really need MapReduce on Hadoop, or just a pure Python example? If the latter, it can be much simpler than your link:
import multiprocessing

def word_count(line, delimiter=","):
    """Worker: count the words in a single line"""
    summary = {}
    for word in line.strip().split(delimiter):
        if word in summary:
            summary[word] += 1
        else:
            summary[word] = 1
    return summary

pool = multiprocessing.Pool()
result = {}

# Map: each line to a separate worker
for summary in pool.imap_unordered(word_count, open("/path/to/file.csv")):
    # Reduce: aggregate the result of each line into one result
    for (word, count) in summary.items():
        result[word] = result[word] + count if word in result else count

print(result)
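If you do need this to run on Hadoop, the same fix applies to the tutorial's streaming mapper: tokenize on commas instead of whitespace. A minimal sketch (an adaptation, not the tutorial's exact code; it assumes plain comma-delimited input):
#!/usr/bin/env python
# mapper.py - read CSV lines from stdin, emit "word<TAB>1" pairs
import sys

for line in sys.stdin:
    # split on commas instead of whitespace so CSV fields become words
    for word in line.strip().split(','):
        word = word.strip()
        if word:
            print('%s\t%s' % (word, 1))
The tutorial's reducer can stay as it is, since it only ever sees the sorted word/count pairs the mapper emits.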

Related

Parse CSV efficiently in python

I am writing a CSV parser which has the following structure:
class decode:
    def __init__(self):
        self.fd = open('test.csv')

    def decodeoperation(self):
        for row in self.fd:
            cmd = self.decodecmd(row)
            if cmd == 'A':
                self.decodeAopt()
            elif cmd == 'B':
                self.decodeBopt()

    def decodeAopt(self):
        for row in self.fd:
            # decode further dependencies based on cmd A till
            # a condition occurs on any further row
            return

    def decodeBopt(self):
        for row in self.fd:
            # decode further dependencies based on cmd B till
            # a condition occurs on any further row
            return
The current code is working fine for me, but I don't feel good about iterating through the CSV file in all the methods. Could it be done in a better way?
There is nothing inherently wrong with using a common iterator across multiple methods, as long as you can determine in advance which method to dispatch to at any given point in the sequence (which you are doing by decoding the cmd from the row and getting 'A', 'B', etc.). The design has issues if you have to read several items before you can determine which method to call, and might have to back up if you picked the wrong method and need to try another. In parsing, this is called backtracking. Since you are passing around a file object, backing up is difficult.
Note that your separate decoder methods will have to know when to stop before reading the next row that contains a command, so they will need some sort of terminating sentinel row that they can recognize.
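For example, a decoder that stops at a sentinel might look like this (a sketch; is_sentinel is a hypothetical predicate you would define for your row format):
def decodeAopt(self):
    # consume rows until a recognizable terminating sentinel row,
    # leaving the iterator positioned for the next command
    for row in self.fd:
        if self.is_sentinel(row):  # is_sentinel is an assumed helper
            return
        # ... decode A-specific dependencies from row ...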
Some general comments on your Python and class design:
You have a nice simple if-elif-elif dispatch table that can translate to a Python dict like this:
# put this code in place of your "if cmd == ... elif elif elif..." code
dispatch = {
    # note - no ()'s, we just want to reference the methods, not call them
    'A': self.decodeAopt,
    'B': self.decodeBopt,
    'C': self.decodeCopt,
    # look how easy it is to add more decoders
}

# look up which decoder to use for the current cmd
decoder = dispatch[cmd]
# run it
decoder()

# or do it all in one line
dispatch[cmd]()
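If an unrecognized cmd is possible in your data, dict.get with a fallback avoids a KeyError (decodeUnknown here is a hypothetical handler you would define):
# fall back to a default handler for commands not in the table
decoder = dispatch.get(cmd, self.decodeUnknown)
decoder()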
Instead of having your __init__ method open a file, let it accept an iterator object. This will make it much easier to write tests for your object, since you'll be able to pass simple Python lists containing CSV rows.
class decode:
    def __init__(self, sequence):
        self.fd = sequence
You might want to rename this variable from 'fd' to something like 'seq', since it doesn't have to be a file but could be any iterable that gives you decodable rows.
If you are doing your own CSV parsing, look at using the built-in csv module. It will do quite a bit of work for you, like parsing quoted strings that could contain commas, and it can give you easy-to-work-with dicts for each row, with headers read from the input file or specified by you. If you have modified __init__ as I suggested, you can use it like:
import csv

# assuming test.csv has a header row
reader = csv.DictReader(open('test.csv'))
# or specify headers if not - I encourage you to give these columns better names
reader.fieldnames = ['cmd', 'val1', 'val2', 'val3']

decoder = decode(reader)
decoder.decodeoperation()
Then you can write in decodeoperation:
cmd = row['cmd']
Note that this would impart a slightly different design to your class, in that it would expect to be given a sequence of dicts rather than a sequence of strings.
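One payoff of taking a sequence in __init__ is easy testing; a hypothetical test, with no file involved:
# feed the decoder a plain list of dicts - no file needed
rows = [
    {'cmd': 'A', 'val1': '1', 'val2': '2', 'val3': '3'},
    {'cmd': 'B', 'val1': '4', 'val2': '5', 'val3': '6'},
]
decoder = decode(rows)
decoder.decodeoperation()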

Why must I run this code a few times before my entire .csv file is converted into a .yaml file?

I am trying to build a tool that can convert .csv files into .yaml files for further use. I found a handy bit of code that does the job nicely from the link below:
Convert CSV to YAML, with Unicode?
which states that this line will take the dict created from a .csv file and dump it to a .yaml file:
out_file.write(ry.safe_dump(dict_example,allow_unicode=True))
However, one small kink I have noticed is that when it is run once, the generated .yaml file is typically incomplete by a line or two. In order to have the .csv file exhaustively read through to create a complete .yaml file, the code must be run two or even three times. Does anybody know why this could be?
UPDATE
Per request, here is the code I use to parse my .csv file, which is two columns wide (with a string in the first column and a list of two strings in the second column) and will typically be 50 rows long (or maybe more). Also note that it is designed to remove any '\n' or spaces that could potentially cause problems later on in the code.
import csv

csv_contents = {}
with open("example1.csv", "rU") as csvfile:
    green = csv.reader(csvfile, dialect='excel')
    for line in green:
        candidate_number = line[0]
        first_sequence = line[1].replace(' ', '').replace('\r', '').replace('\n', '')
        second_sequence = line[2].replace(' ', '').replace('\r', '').replace('\n', '')
        csv_contents[candidate_number] = [first_sequence, second_sequence]
csv_contents.pop('Header name', None)
Ultimately, it is not that important that I maintain the order of the rows from the original dict, just that all the information within the rows is properly structured.
I am not sure what the cause could be, but you might be running out of memory, as you create the YAML document in memory first and then write it out. It is much better to stream it out directly.
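For example (a sketch; dict_example stands for whatever dict you built from the CSV, and the file name is illustrative):
import ruamel.yaml as ry

# stream the YAML straight to the file instead of building the whole
# document as a string in memory first
with open("example1.yaml", "w") as out_file:
    ry.safe_dump(dict_example, out_file, allow_unicode=True)
As a side effect, the with block also flushes and closes the file, and an unflushed buffer is exactly the kind of thing that leaves a file short by a line or two.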
You should also note that the code in the question you link to doesn't preserve the order of the original columns; that is easily remedied by using round_trip_dump instead of safe_dump.
You probably want to make a top-level sequence (list) as in the desired output of the linked question, with each element being a mapping (dict).
The following parses the CSV, taking the first line as keys for mappings created for each following line:
import sys
import csv
import ruamel.yaml as ry
import dateutil.parser  # pip install python-dateutil

def process_line(line):
    """convert line elements, trying int, float, date"""
    ret_val = []
    for elem in line:
        try:
            res = int(elem)
            ret_val.append(res)
            continue
        except ValueError:
            pass
        try:
            res = float(elem)
            ret_val.append(res)
            continue
        except ValueError:
            pass
        try:
            res = dateutil.parser.parse(elem)
            ret_val.append(res)
            continue
        except ValueError:
            pass
        ret_val.append(elem.strip())
    return ret_val
csv_file_name = 'xyz.csv'
data = []
header = None
with open(csv_file_name) as inf:
    for line in csv.reader(inf):
        d = process_line(line)
        if header is None:
            header = d
            continue
        data.append(ry.comments.CommentedMap(zip(header, d)))

ry.round_trip_dump(data, sys.stdout, allow_unicode=True)
with input xyz.csv:
id, title_english, title_russian
1, A Title in English, Название на русском
2, Another Title, Другой Название
this generates:
- id: 1
  title_english: A Title in English
  title_russian: Название на русском
- id: 2
  title_english: Another Title
  title_russian: Другой Название
process_line is just some sugar that tries to convert the strings in the CSV file to more useful types, and to strings without leading spaces (resulting in far fewer quotes in your output YAML file).
I have tested the above on files with 1000 rows, without any problems (I won't post the output though).
The above was done using Python 3 as well as Python 2.7, starting with a UTF-8 encoded file xyz.csv. If you are using Python 2, you can try unicodecsv if you need to handle Unicode input and things don't work out as well as they did for me.

How can I share updated values in a dictionary among different processes?

I am a newbie here and also in Python. I have a question about dictionaries and multiprocessing. I want to run this part of the code on the second core of my Raspberry Pi (the GUI application runs on the first). So, I created a dictionary (20 keys, each with an array of length 256 - the script below is just a short example). I initialized this dictionary in a separate script and set all the values inside it to zero. Script table1.py (this dictionary should be available from both cores):
diction = {}
diction['FrameID']= [0]*10
diction['Flag']= ["Z"]*10
In the second script (which should run on the second core), I read the values from the serial port and put/set them in this dictionary (parsing + conversion) in the appropriate places. Since I get a lot of information through the serial port, the information changes all the time. Script Readmode.py:
from multiprocessing import Process
import time
import serial

import table1

def replace_all(text, dic):
    for i, j in dic.iteritems():
        text = text.replace(i, j)
    return text

def hexTobin(hex_num):
    scale = 16  # equals to hexadecimal
    num_of_bits = len(hex_num) * 4
    bin_num = bin(int(hex_num, scale))[2:].zfill(num_of_bits)  # hex to binary
    return bin_num

def readSerial():
    port = "/dev/ttyAMA0"
    baudrate = 115200
    ser = serial.Serial(port, baudrate, bytesize=8, parity=serial.PARITY_NONE,
                        stopbits=1, xonxoff=False, rtscts=False)
    line = []
    for x in xrange(1):
        ser.write(":AAFF:AA\r\n:F1\r\n")  # new
    while True:
        for c in ser.read():
            line.append(c)
            a = ''.join(line[-2:])
            if a == '\r\n':
                b = ''.join(line)
                print("what I get:" + b)
                c = b[b.rfind(":"):len(b)]  # string between last ":" start delimiter and stop delimiter
                reps = {':': '', '\r\n': ''}  # throw away start and stop delimiter
                txt = replace_all(c, reps)
                print("hex num: " + txt)
                bina_num = hexTobin(txt)  # convert hex to bin
                print("bin num: " + bina_num)
                ssbit = bina_num[:3]  # first three bits
                print("select first three bits: " + ssbit)
                abit = int(ssbit, 2)  # binary to integer
                if abit == 5:
                    table1.diction['FrameID'][0] = abit
                if abit == 7:
                    table1.diction['FrameID'][5] = abit
                print("int num: ", abit)
                print(table1.diction)
                line = []
                break
    ser.close()

p1 = Process(target=readSerial)
p1.start()
During that time I want to read the information in this dictionary and use it in another process. But when I try to read those values, they are all zero.
My question is: how can I create a dictionary that is available to both processes and can be updated based on the data received from the serial port?
Thank you in advance for your answers.
In Python, when you start two different scripts, even if they import some common modules, they share nothing (except the code). If two scripts both import table1, then they both have their own instance of the table1 module and therefore their own instance of the table1.diction variable.
If you think about it, it has to be like that. Otherwise all Python scripts would share the same sys.stdout, for example. So, more or less nothing is shared between two different scripts executing at the same time.
The multiprocessing module lets you create more than one process from the same single script. So you'll need to merge your GUI and your reading function into the same script. But then you can do what you want. Your code will look something like this:
from multiprocessing import Process

# create shared dictionary sd

p1 = Process(target=readSerial, args=(sd,))
p1.start()

# create the GUI behaviour you want
But hang on a minute. This won't work either. Because when the Process is created it starts a new instance of the Python interpreter and creates all its own instances again, just like starting a new script. So even now, by default, nothing in readSerial will be shared with the GUI process.
But fortunately the multiprocessing module provides some explicit techniques to share data between the processes. There is more than one way to do that, but here is one that works:
import multiprocessing
import time

def readSerial(d):
    d["test"] = []
    for i in range(100):
        l = d["test"]
        l.append(i)
        d["test"] = l
        time.sleep(1)

def doGUI(d):
    while True:
        print(d)
        time.sleep(.5)

if __name__ == '__main__':
    with multiprocessing.Manager() as manager:
        sd = manager.dict()
        p = multiprocessing.Process(target=readSerial, args=(sd,))
        p.start()
        doGUI(sd)
You'll notice that the append to the list in the readSerial function is a bit odd. That is because the dictionary object we are working with here is not a normal dictionary. It's actually a dictionary proxy that conceals a pipe used to send the data between the two processes. When I append to the list inside the dictionary the proxy needs to be notified (by assigning to the dictionary) so it knows to send the updated data across the pipe. That is why we need the assignment (rather than simply directly mutating the list inside the dictionary). For more on this look at the docs.
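A minimal illustration of that pitfall (assuming sd is the manager.dict() from above):
sd["test"] = []
sd["test"].append(1)  # appends to a local copy; the manager never sees it
print(sd["test"])     # -> []

l = sd["test"]        # fetch the list
l.append(1)           # mutate it locally
sd["test"] = l        # reassign so the proxy sends the update
print(sd["test"])     # -> [1]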

Regular Expressions to Update a Text File in Python

I'm trying to write a script to update a text file by replacing instances of certain characters (e.g. 'a', 'w') with a word (e.g. 'airplane', 'worm').
If a single line of the text was something like this:
a.function(); a.CallMethod(w); E.aa(w);
I'd want it to become this:
airplane.function(); airplane.CallMethod(worm); E.aa(worm);
The difference is subtle but important: I'm only changing 'a' and 'w' where they are used as variables, not as just another character in some other word. And there are many lines like this in the file. Here's what I've done so far:
import re

original = open('original.js', 'r')
modified = open('modified.js', 'w')

# iterate through each line of the file
for line in original:
    # search for the character 'a' when not part of a word of some sort
    line = re.sub(r'\W(a)\W', 'airplane', line)
    modified.write(line)

original.close()
modified.close()
I think my RE pattern is wrong, and I think I'm using the re.sub() method incorrectly as well. Any help would be greatly appreciated.
If you're concerned about the semantic meaning of the text you're changing with a regular expression, then you'd likely be better served by parsing it instead. Luckily, Python has two good modules to help you with parsing Python: the ast (Abstract Syntax Trees) and parser modules. There are probably others for JavaScript, if that's what you're doing, like slimit.
Future reference on Regular Expression questions, there's a lot of helpful information here:
https://stackoverflow.com/tags/regex/info
Reference - What does this regex mean?
And it took me 30 minutes from never having used this JavaScript parser in Python (replete with installation issues: please note the right ply version) to writing a basic solution given your example. You can too.
# Note: sudo pip3 install ply==3.4 && sudo pip3 install slimit
from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor

data = 'a.funktion(); a.CallMethod(w); E.aa(w);'

tree = Parser().parse(data)
for node in nodevisitor.visit(tree):
    if isinstance(node, ast.Identifier):
        if node.value == 'a':
            node.value = 'airplane'
        elif node.value == 'w':
            node.value = 'worm'

print(tree.to_ecma())
It runs to give this output:
$ python3 src/python_renames_js_test.py
airplane.funktion();
airplane.CallMethod(worm);
E.aa(worm);
Caveats:
function is a reserved word, so I used funktion
the to_ecma method pretty prints; there is likely another way to output it closer to the original input.
line = re.sub(r'\ba\b', 'airplane', line)
should get you closer. However, note that you will also be turning a.CallMethod("That is a house") into airplane.CallMethod("That is airplane house"), and open("file.txt", "a") into open("file.txt", "airplane"). Getting it right in a complex syntax environment using regular expressions ranges from hard to impossible.
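If you only have a handful of such variables, one alternative (a sketch, not from the answers above; the replacements mapping is hypothetical) is a single pass with a callback, so one substitution cannot be re-matched by a later one:
import re

# hypothetical variable-to-word mapping; adjust to your own names
replacements = {'a': 'airplane', 'w': 'worm'}

# \b restricts matches to whole identifiers; the callback picks the
# right replacement for whichever name matched
pattern = re.compile(r'\b(' + '|'.join(map(re.escape, replacements)) + r')\b')

line = 'a.function(); a.CallMethod(w); E.aa(w);'
print(pattern.sub(lambda m: replacements[m.group(1)], line))
# -> airplane.function(); airplane.CallMethod(worm); E.aa(worm);
The string-literal caveat above still applies, of course; the single pass only avoids one replacement stepping on another.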

Removing Duplicate Lines by Title Only

I am trying to modify a script so that it will remove duplicate lines from a text file using only the title portion of that line.
To clarify the text file lines look something like this:
Title|Image Url|Description|Page Url
At the moment the script does remove duplicates, but it does so by reading the entire line, not just the first part. All the lines in the file are not going to be 100% the same, but a few will be very similar.
I want to remove all of the lines that contain the same "title", regardless of what the rest of the line contains.
This is the script I am working with:
import sys
from collections import OrderedDict

infile = "testfile.txt"
outfile = "outfile.txt"

inf = open(infile, "r")
lines = inf.readlines()
inf.close()

newset = list(OrderedDict.fromkeys(lines))

outf = open(outfile, "w")
lstline = len(newset)
for i in range(0, lstline):
    ln = newset[i]
    outf.write(ln)
outf.close()
So far I have tried using .split() to split the lines in the list. I have also tried .readline(lines[0:25]) in hopes of using a character limit to achieve the desired result, but no luck so far. I also can't seem to find any documentation on my exact problem, so I'm stuck.
I am using Windows 8 and Python 2.7.9 for this project if that helps.
I made a few changes to the program you had set up. First, I changed your file interactions to use "with" statements, since those are very convenient and automatically handle a lot of the functionality you had to write out. Second, I used a set instead of an OrderedDict, because you were essentially emulating set functionality (exclusivity of elements) with the keys of an OrderedDict. If the title hasn't been seen, the code adds it to the set (so it can't be used again) and writes the line to the output file; if it has been seen, it moves on. I hope this helps you!
with open("testfile.txt") as infile:
with open("outfile.txt",'w') as outfile:
titleset = set()
for line in infile:
title = line.split('|')[0]
if title not in titleset:
titleset.add(title)
outfile.write(line)