How to pass input path to MRJob.mapper_raw?

I'm running a MapReduce job with the mrjob package and am trying to use mapper_raw, as below:
    from mrjob.job import MRJob
    from mrjob.step import MRStep
    class MRJOB(MRJob):
        def mapper_raw(self, input_path, input_uri):
            import csv
            print(input_path, input_uri)
            with open(input_path) as f:
                reader = csv.reader(f)
                for line in reader:
                    if line:
                        yield (0, line)
        def steps(self):
            return [
                MRStep(mapper=self.mapper_raw)
            ]  # , reducer=self.reducer)]
    if __name__ == "__main__":
        MRJOB().run()
However, if I run this as python mrjobfile.py inputfile.csv, I get the following error:
TypeError: expected str, bytes or os.PathLike object, not NoneType
How do I tell mrjob to pass the input to the mapper as a string containing the filename? It seems to just pass in the first line of the file.

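You have to pass the method as mapper_raw instead of mapper in your MRStep. With mapper=, mrjob calls the function once per input line with (key, value) arguments, so input_path receives None and input_uri receives the line, which is why open() raises the TypeError. The step should look like this:
    def steps(self):
        return [
            MRStep(mapper_raw=self.mapper_raw)
        ]  # , reducer=self.reducer)]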
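The file-reading logic inside mapper_raw does not depend on mrjob, so it can be checked on its own before running the job. The sketch below uses a hypothetical stand-alone helper, rows_from_path (not an mrjob API), that mirrors the mapper body: it takes a path string, as mapper_raw does once it is registered via mapper_raw=, and yields (0, row) for each non-empty CSV row.

```python
import csv


def rows_from_path(input_path):
    """Yield (0, row) for each non-empty CSV row in the file at input_path.

    Hypothetical helper that mirrors the body of mapper_raw above;
    it is not part of mrjob.
    """
    # newline="" is what the csv module recommends for files given to csv.reader
    with open(input_path, newline="") as f:
        for row in csv.reader(f):
            if row:  # csv.reader yields [] for blank lines; skip them
                yield (0, row)
```

If this helper behaves as expected on a sample file, the same loop inside mapper_raw should receive a usable path once the step is declared with mapper_raw=.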

Related

Python 2 - Is there a way to declare the encoding when writing a CSV file?

When I try to write a cursor result from a database query (its type is a list) to a CSV file, the following error is thrown:
a.writerow(lst) TypeError: write() argument 1 must be unicode, not str
This happens on Python 2. The script below works on Python 3, but the system requirements force me to switch to Python 2.
This is the working Python 3 script:
    results_percent = cursor.fetchall()
    with open(file4, 'w', encoding="utf-8", newline='') as fp:
        a = csv.writer(fp, delimiter=',')
        a.writerow(['MFIName','ClientCountAtSignUp','UploadCountLastMonth','UploadCount','80%','Status'])
        a.writerows(results_percent)
Below is the Python 2 version, which gives me the error a.writerow(lst) TypeError: write() argument 1 must be unicode, not str:
    results_percent = cursor.fetchall()
    with io.open(file4, 'w', encoding='utf-8') as fp:
        a = csv.writer(fp, delimiter=',')
        lst = ['MFIName','ClientCountAtSignUp','UploadCountLastMonth','UploadCount','80%','Status']
        a.writerow(lst)
        a.writerows(results_percent)
The goal is to write results_percent to a CSV file.
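The error arises because Python 2's csv module writes byte strings (str), while a file opened with io.open(..., encoding=...) accepts only unicode. One common workaround, offered here as an assumption rather than as an answer from the original thread, is to open the file in binary mode on Python 2 and encode any unicode cells to UTF-8 by hand. The names write_rows and _encode_cell are hypothetical; the version check lets the same sketch run on Python 3.

```python
import csv
import io
import sys

PY2 = sys.version_info[0] == 2


def _encode_cell(value):
    # Python 2 only: turn unicode text into UTF-8 bytes; leave other values alone.
    # The name `unicode` is never evaluated on Python 3 because PY2 is False.
    if PY2 and isinstance(value, unicode):  # noqa: F821
        return value.encode("utf-8")
    return value


def write_rows(path, header, rows):
    """Write header and rows to a UTF-8 CSV file on Python 2 or 3 (hypothetical helper)."""
    if PY2:
        fp = open(path, "wb")  # csv on Python 2 expects a binary file
    else:
        fp = io.open(path, "w", encoding="utf-8", newline="")
    with fp:
        writer = csv.writer(fp, delimiter=",")
        writer.writerow([_encode_cell(c) for c in header])
        for row in rows:
            writer.writerow([_encode_cell(c) for c in row])
```

With this helper, the question's script would call write_rows(file4, lst, results_percent) instead of opening the file itself.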

Reading multiple files in a directory with pyyaml

I'm trying to read all YAML files in a directory, but I am having trouble. I am using Python 2.7 (and cannot change to 3), and all of my files are UTF-8 (and need to stay that way).
    import os
    import yaml
    import codecs
    def yaml_reader(filepath):
        with codecs.open(filepath, "r", encoding='utf-8') as file_descriptor:
            data = yaml.load_all(file_descriptor)
        return data
    def yaml_dump(filepath, data):
        with open(filepath, 'w') as file_descriptor:
            yaml.dump(data, file_descriptor)
    if __name__ == "__main__":
        filepath = os.listdir(os.getcwd())
        data = yaml_reader(filepath)
        print data
When I run this code, Python gives me the message:
TypeError: coercing to Unicode: need string or buffer, list found.
I want this program to show the contents of the files. Can anyone help me?

I guess the issue is with filepath.
os.listdir(os.getcwd()) returns a list of the names of all entries in the directory, so you are passing a list to codecs.open() instead of a single filename.

There are multiple problems with your code, apart from it being invalid Python in the way you formatted it. Correctly indented, the reader is:
    def yaml_reader(filepath):
        with codecs.open(filepath, "r", encoding='utf-8') as file_descriptor:
            # load_all() is lazy, so consume it while the file is still open
            data = list(yaml.load_all(file_descriptor))
        return data
However, it is not necessary to do the decoding yourself; PyYAML is perfectly capable of processing UTF-8:
    def yaml_reader(filepath):
        with open(filepath, "rb") as file_descriptor:
            # load_all() is lazy, so consume it while the file is still open
            data = list(yaml.load_all(file_descriptor))
        return data
I hope you realise that you are loading multiple documents, so you always get a list as the result in data, even if your file contains a single document.
Then the line:
    filepath = os.listdir(os.getcwd())
gives you a list of files, so you need to do:
    filepath = os.listdir(os.getcwd())[0]
or decide in some other way which of the files you want to open. If you want to combine all files (assuming they are YAML) into one big YAML file, you need to do:
    if __name__ == "__main__":
        data = []
        for filepath in os.listdir(os.getcwd()):
            data.extend(yaml_reader(filepath))
        print data
And your dump routine would need to change to:
    def yaml_dump(filepath, data):
        with open(filepath, 'wb') as file_descriptor:
            yaml.dump(data, file_descriptor, allow_unicode=True, encoding='utf-8')
However, this all brings you to the biggest problem: you are using PyYAML, which will mangle your YAML, dropping flow style, comments, anchor names, special int/float formats, quotes around scalars, etc. Apart from that, PyYAML has not been updated to support YAML 1.2 documents (YAML 1.2 has been the standard since 2009). I recommend you switch to ruamel.yaml (disclaimer: I am the author of that package), which supports YAML 1.2 and leaves comments etc. in place.
And even if you are bound to Python 2, you should use Python 3-like syntax, e.g. for print, which you can get with from __future__ imports.
So I recommend you do:
    pip install pathlib2 ruamel.yaml
and then use:
    from __future__ import absolute_import, unicode_literals, print_function
    from pathlib2 import Path  # Python 2 backport; on Python 3 use: from pathlib import Path
    from ruamel.yaml import YAML
    if __name__ == "__main__":
        data = []
        yaml = YAML()
        yaml.preserve_quotes = True
        for filepath in Path('.').glob('*.yaml'):
            data.extend(yaml.load_all(filepath))
        print(data)
        yaml.dump(data, Path('your_output.yaml'))

3D Drawing from a file in an extra directory [duplicate]

I'm trying to get a data parsing script up and running. It works as far as the data manipulation is concerned. What I'm trying to do is set it up so that I can pass multiple user-defined CSVs in a single command.
e.g.
> python script.py One.csv Two.csv Three.csv
If you have any advice on how to automate the naming of the output CSV so that if input = test.csv, output = test1.csv, I'd appreciate that as well.
I'm getting
TypeError: coercing to Unicode: need string or buffer, list found
for the line
    for line in csv.reader(open(args.infile)):
My code:
    import csv
    import pprint
    pp = pprint.PrettyPrinter(indent=4)
    res = []
    import argparse
    parser = argparse.ArgumentParser()
    #parser.add_argument("infile", nargs="*", type=str)
    #args = parser.parse_args()
    parser.add_argument("infile", metavar="CSV", nargs="+", type=str, help="data file")
    args = parser.parse_args()
    with open("out.csv","wb") as f:
        output = csv.writer(f)
        for line in csv.reader(open(args.infile)):
            for item in line[2:]:
                #to skip empty cells
                if not item.strip():
                    continue
                item = item.split(":")
                item[1] = item[1].rstrip("%")
                print([line[1]+item[0],item[1]])
                res.append([line[1]+item[0],item[1]])
                output.writerow([line[1]+item[0],item[1].rstrip("%")])
I don't really understand what is going on with the error. Can someone explain this in layman's terms?
Bear in mind I am new to programming/python as a whole and am basically learning alone, so if possible could you explain what is going wrong/how to fix it so I can note it for future reference.

args.infile is a list of filenames, not a single filename. Loop over it:
    import os  # needed for os.path.splitext
    for filename in args.infile:
        base, ext = os.path.splitext(filename)
        with open("{}1{}".format(base, ext), "wb") as outf, open(filename, 'rb') as inf:
            output = csv.writer(outf)
            for line in csv.reader(inf):
                ...  # process each line as in your original loop
Here I used os.path.splitext() to split the base filename from its extension, so you can generate a new output filename by adding 1 to the base.
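The output-naming step can be tried in isolation. The following is a minimal sketch; derive_output_name is a hypothetical helper name, and the sample filenames are made up for illustration.

```python
import os


def derive_output_name(input_name, suffix="1"):
    """Return input_name with suffix inserted just before the extension."""
    base, ext = os.path.splitext(input_name)
    return "{}{}{}".format(base, suffix, ext)


for name in ["test.csv", "data/Two.csv", "notes"]:
    print(name, "->", derive_output_name(name))
# test.csv -> test1.csv
# data/Two.csv -> data/Two1.csv
# notes -> notes1
```

Note that os.path.splitext() only splits off the last extension, so a name such as archive.tar.gz would become archive.tar1.gz.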

If you pass nargs="+" (or "*", or an integer) to .add_argument, the argument will always be returned as a list.
Assuming you want to deal with all of the files specified, loop through that list:
    for filename in args.infile:
        for line in csv.reader(open(filename)):
            for item in line[2:]:
                #to skip empty cells
                [...]
Or, if you really just want to be able to specify a single file, just get rid of nargs="+".
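The difference is easy to see by parsing a hand-written argument list instead of the real command line. In this small sketch, build_parser is a hypothetical helper; the add_argument calls match the one in the question, with and without nargs="+".

```python
import argparse


def build_parser(multiple=True):
    parser = argparse.ArgumentParser()
    if multiple:
        # nargs="+" collects one or more values into a list
        parser.add_argument("infile", metavar="CSV", nargs="+", type=str, help="data file")
    else:
        # without nargs, the value is a single string
        parser.add_argument("infile", metavar="CSV", type=str, help="data file")
    return parser


many = build_parser(multiple=True).parse_args(["One.csv", "Two.csv", "Three.csv"])
print(many.infile)         # ['One.csv', 'Two.csv', 'Three.csv']

one_in_list = build_parser(multiple=True).parse_args(["One.csv"])
print(one_in_list.infile)  # ['One.csv'] (still a list)

single = build_parser(multiple=False).parse_args(["One.csv"])
print(single.infile)       # One.csv
```

Passing a list to open(), as in open(args.infile), is what triggers the "need string or buffer, list found" error.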

Explanation of this MRJob example

    from mrjob.job import job
    class KittyJob(MRJob):
        OUTPUT_PROTOCOL = JSONValueProtocol
        def mapper_cmd(self):
            return "grep kitty"
        def reducer(self, key, values):
            yield None, sum(1 for _ in values)
    if __name__ == '__main__':
        KittyJob().run()
Source: https://mrjob.readthedocs.org/en/latest/guides/writing-mrjobs.html#protocols
How does this code do its task of counting the number of lines containing kitty?
Also where is OUTPUT_PROTOCOL defined?

Well, the short answer is that this example doesn't count lines containing 'kitty'.
Here is some code using filters that does count lines containing (case-insensitive) kitty:
    from mrjob.job import MRJob
    from mrjob.protocol import JSONValueProtocol
    from mrjob.step import MRStep
    class KittyJob(MRJob):
        OUTPUT_PROTOCOL = JSONValueProtocol
        def mapper(self, _, line):
            yield 'kitty', 1
        def sum_kitties(self, key, values):
            yield None, sum(values)
        def steps(self):
            return [
                MRStep(mapper_pre_filter='grep -i "kitty"',
                       mapper=self.mapper,
                       reducer=self.sum_kitties)]
    if __name__ == '__main__':
        KittyJob().run()
If I run it using the local runner, as noted in Shell Commands as Steps, over the text of the English Wikipedia page for 'Kitty', then I get a count of all lines containing 'kitty', as expected:
    $ python grep_kitty.py -q -r local kitty.txt
    20
    $ grep -ci kitty kitty.txt
    20
It looks like the example you cite from the mrjob docs is just wrong.
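To see why the filtered job's total matches grep -ci, the three stages can be simulated in plain Python without mrjob. This is only an illustrative sketch: count_kitty_lines is a hypothetical function, the sample lines are made up, and a case-insensitive substring test stands in for the grep -i "kitty" pre-filter.

```python
def count_kitty_lines(lines, needle="kitty"):
    """Simulate pre-filter -> mapper -> reducer for the job above."""
    # Pre-filter: keep only lines containing the needle, ignoring case
    # (stands in for: grep -i "kitty")
    kept = (line for line in lines if needle.lower() in line.lower())
    # Mapper: emit one ('kitty', 1) pair per surviving line
    pairs = (("kitty", 1) for _ in kept)
    # Reducer: all pairs share one key, so summing the values gives the line count
    return sum(value for _key, value in pairs)


sample = [
    "Kitty is a given name.",
    "Hello Kitty, kitty, KITTY",  # counted once: lines, not occurrences
    "No cats here.",
    "",
]
print(count_kitty_lines(sample))  # 2
```

Because the mapper emits exactly one pair per line that survives the filter, the sum counts matching lines rather than occurrences of the word, which is the same quantity grep -c reports.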

Python 2.7 Using Tkinter to concatenate a source path and filename gives error of nonetype

I am using Python 2.7 and have imported Tkinter and Tk.
What I am trying to do is take a source path (a directory path) and concatenate it with a filename that the user picks in a Windows Explorer dialog. This saves the user from having to type in a file name.
I realized I wasn't using a return and would get the following error:
TypeError: cannot concatenate 'str' and 'NoneType' objects
After searching here for this error, I found I needed to add a return. I tried putting a string in the parentheses, but it doesn't work. I am definitely missing something.
Here is a sample of my code:
    from Tkinter import *
    from Tkinter import Tk
    from tkFileDialog import askopenfilename
    source = '\\\\Isfs\\data$\\GIS Carto\TTP_Draw_Count' ## this is a public directory path
    filename = ''
    filename = getFileName() ## this part is in a different def area.
    with open (os.path.join(source + filename), 'r' ) as f: ## this is where it is failing.
    def getFileName():
        Tk().withdraw()
        filename = askopenfilename()
        return getFileName()
I need to concatenate the source + filename to be used to process a csv file.
I didn't want to put all the code here since it is long and requires a csv file and custom dictionary to merge. All of that works. I hope I have put enough information in this question.

    def getFileName():
        Tk().withdraw()
        filename = askopenfilename()
        return getFileName()
You aren't returning the filename that you get here; the function calls itself again instead. Change this to:
    def getFileName():
        Tk().withdraw()
        filename = askopenfilename()
        return filename
Also note that askopenfilename returns the full path of the chosen file, so source+filename will evaluate to something like u'\\\\Isfs\\data$\\GIS Carto\\TTP_Draw_CountC:/Users/kevin/Desktop/myinput.txt'.
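If the intent is to open a file with the chosen name inside the source directory (an assumption; the question does not say so explicitly), one option is to keep only the file name part of the returned path and join it to source with a separator. The sketch below uses the standard ntpath module, which is what os.path resolves to on Windows, so that it behaves the same on any platform; path_under_source is a hypothetical helper, and the sample paths are taken from the answer above.

```python
import ntpath  # Windows flavour of os.path; importable on any platform


def path_under_source(source, chosen_path):
    """Join source with just the file name part of chosen_path.

    Hypothetical helper: assumes the chosen file is expected to live in source.
    """
    # basename() drops the directory part, for forward or backward slashes
    name = ntpath.basename(chosen_path)
    # join() inserts the separator that plain string concatenation leaves out
    return ntpath.join(source, name)


source = "\\\\Isfs\\data$\\GIS Carto\\TTP_Draw_Count"
chosen = "C:/Users/kevin/Desktop/myinput.txt"
print(path_under_source(source, chosen))
```

If the user should instead be able to pick a file from any folder, the path returned by askopenfilename can be passed to open() directly, without involving source at all.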
