Explanation of this MRJob example - python-2.7

from mrjob.job import MRJob
from mrjob.protocol import JSONValueProtocol
class KittyJob(MRJob):
    OUTPUT_PROTOCOL = JSONValueProtocol
    def mapper_cmd(self):
        return "grep kitty"
    def reducer(self, key, values):
        yield None, sum(1 for _ in values)
if __name__ == '__main__':
    KittyJob().run()
Source: https://mrjob.readthedocs.org/en/latest/guides/writing-mrjobs.html#protocols
How does this code do its task of counting the number of lines containing kitty?
Also where is OUTPUT_PROTOCOL defined?

Well, the short answer is that this example doesn't count lines containing 'kitty'.
Here is some code using filters that does count lines containing (case-insensitive) kitty:
from mrjob.job import MRJob
from mrjob.protocol import JSONValueProtocol
from mrjob.step import MRStep
class KittyJob(MRJob):
    OUTPUT_PROTOCOL = JSONValueProtocol
    def mapper(self, _, line):
        yield 'kitty', 1
    def sum_kitties(self, key, values):
        yield None, sum(values)
    def steps(self):
        return [
            MRStep(mapper_pre_filter='grep -i "kitty"',
                   mapper=self.mapper,
                   reducer=self.sum_kitties)]
if __name__ == '__main__':
    KittyJob().run()
If I run it with the local runner, as noted in "Shell Commands as Steps", over the text of the English Wikipedia page for 'Kitty', I get a count of all lines containing 'kitty', as expected:
$ python grep_kitty.py -q -r local kitty.txt
20
$ grep -ci kitty kitty.txt
20
It looks like the example you cite from the mrjob docs is just wrong.
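For comparison, here is a minimal plain-Python sketch of what the job above computes, without mrjob or grep. The function name count_matching_lines and the sample lines are made up for illustration, and a plain substring test stands in for the grep regex.

```python
def count_matching_lines(lines, needle="kitty"):
    """Count lines containing `needle`, ignoring case (similar to `grep -ci`)."""
    needle = needle.lower()
    # "Map" phase: emit a 1 for every line that passes the filter.
    ones = (1 for line in lines if needle in line.lower())
    # "Reduce" phase: sum the emitted ones into a single count.
    return sum(ones)


# Hypothetical sample input, not the Wikipedia text used above.
sample = ["Hello Kitty", "no cats here", "kitty kitty", "KITTY!"]
print(count_matching_lines(sample))
```

Note that a line mentioning kitty twice is still counted once, which is the same behaviour as grep -c.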

Related

How to pass input path to MRJob.mapper_raw?

I'm running a MapReduce job with the mrjob package and I am trying to use mapper_raw as below:
from mrjob.job import MRJob
from mrjob.step import MRStep
class MRJOB(MRJob):
    def mapper_raw(self, input_path, input_uri):
        import csv
        print(input_path, input_uri)
        with open(input_path) as f:
            reader = csv.reader(f)
            for line in reader:
                if line:
                    yield (0, line)
    def steps(self):
        return [
            MRStep(mapper=self.mapper_raw)
        ]  # , reducer=self.reducer)]
if __name__ == "__main__":
    MRJOB().run()
However, if I run this as python mrjobfile.py inputfile.csv, I get the following error:
TypeError: expected str, bytes or os.PathLike object, not NoneType
How do I tell mrjob to pass the filename as a string? It seems to just pass in the first line of the file.
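You have to pass the method as mapper_raw instead of mapper in your MRStep. Registered as mapper, it is called once per input line with a key of None and the line as the value, so input_path ends up as None. The steps method should look like this: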
def steps(self):
    return [
        MRStep(mapper_raw=self.mapper_raw)
    ]  # , reducer=self.reducer)]
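The error in the question can be reproduced without mrjob. The sketch below is a standalone function with the same body as the question's mapper_raw (minus self and the print), called the way a regular mapper is assumed to be called for plain text input: key None, value the line. The sample line is made up.

```python
import csv


def mapper_raw(input_path, input_uri):
    """Same logic as the question's mapper_raw, as a plain function."""
    with open(input_path) as f:
        for row in csv.reader(f):
            if row:
                yield (0, row)


# Assumption: a method registered as mapper= receives (key, value),
# where key is None and value is one line of the input file.
try:
    list(mapper_raw(None, "first,line,of,file"))
except TypeError as exc:
    # open(None) fails, which is the TypeError shown in the question.
    error_message = str(exc)
    print(error_message)
```

Registered as mapper_raw instead, the method receives a real file path and URI, so open(input_path) succeeds.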

Python watchdog Watch For the Creation of New File By Pattern Match

I'm still learning Python, so I apologize up front. I am trying to write a watcher that watches for the creation of a file that matches a specific pattern. I am using watchdog with some code that is out there to get it to watch a directory.
I don't quite understand how to have the watcher watch for a filename that matches a pattern. I've edited the regexes=[] field with what I think might work but I have not had luck.
Here is an example of a file that might come in: Hires For Monthly TM_06Jan_0946.CSV
I can get the watcher to tell me when a .CSV is created in the directory, but I can't get it to tell me only when a file matching essentially Hires.*\.zip is created.
I've checked this link but have not had any luck:
How to run the RegexMatchingEventHandler of Watchdog correctly?
Here is my code:
import time
import watchdog.events
from watchdog.observers import Observer
from watchdog.events import RegexMatchingEventHandler
import re
import watchdog.observers
import fnmatch
import os
class Handler(watchdog.events.RegexMatchingEventHandler):
    def __init__(self):
        watchdog.events.RegexMatchingEventHandler.__init__(self, regexes=['.*'], ignore_regexes=[], ignore_directories=False, case_sensitive=False)
    def on_created(self, event):
        print("Watchdog received created event - %s." % event.src_path)
    def on_modified(self, event):
        print("Watchdog received modified event - %s." % event.src_path)
if __name__ == "__main__":
    src_path = r"C:\Users\Downloads"
    event_handler = Handler()
    observer = watchdog.observers.Observer()
    observer.schedule(event_handler, path=src_path, recursive=True)
    observer.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()
The '.*' regex works while the Hires.*\.zip regex doesn't because watchdog matches the regex against the full path, starting at the beginning of the string. In regex terms, r'Hires.*\.zip' will not match "C:\Users\Downloads\Hires blah.zip", because that path does not start with "Hires".
Instead, try:
def __init__(self):
    watchdog.events.RegexMatchingEventHandler.__init__(
        self, regexes=[r'^.*?\.csv$', r'.*?Hires.*?\.zip'],
        ignore_regexes=[], ignore_directories=False, case_sensitive=False)
This will match all .csv files, plus any .zip file whose path contains "Hires". The leading '.*?' absorbs the directory part, such as "C:\Users\Downloads\", and the rest of the pattern constrains the file name and extension.
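The difference can be checked with the re module alone. This is a sketch that assumes watchdog applies each pattern with re.match semantics (anchored at the start of the path) and, with case_sensitive=False, ignores case. The example path reuses the file name from the question with a .zip extension, and the basename_only pattern is a hypothetical tighter variant, not something taken from watchdog.

```python
import re

path = r"C:\Users\Downloads\Hires For Monthly TM_06Jan_0946.zip"

# Anchored at the start of the string, so the drive and folders get in the way.
bare = re.compile(r"Hires.*\.zip", re.IGNORECASE)
# A leading wildcard lets the pattern skip over the directory part.
prefixed = re.compile(r".*?Hires.*?\.zip", re.IGNORECASE)
# Hypothetical tighter variant: "Hires" must start the file name itself.
basename_only = re.compile(r".*[\\/]Hires[^\\/]*\.zip$", re.IGNORECASE)

print(bool(bare.match(path)))           # False
print(bool(prefixed.match(path)))       # True
print(bool(basename_only.match(path)))  # True
```

The tighter variant matters if a directory in the path could itself contain "Hires": the prefixed pattern would then match every .zip file under that directory.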

Have all file updated in a directory

I need a bit of help.
I want to use Python 2.7 to check whether all files in a directory have the same modified date.
I'll be putting this into an if statement, so I only need a True or False value: True if all the files have been modified, otherwise False.
Any help would be awesome.
Update:
The method below works, but is this the most effective way?
import os,time
def get_information(directory):
    file_list = []
    for i in os.listdir(directory):
        a = os.stat(os.path.join(directory,i))
        file_list.append(str(time.ctime(a.st_atime))[:3])
    return file_list
def checkEqual2(iterator):
    return len(set(iterator)) <= 1
print get_information("C:\\Auto_Import\\")
print checkEqual2 (get_information("C:\\Auto_Import\\"))
You can use os.scandir for better efficiency, since it returns file attribute information along with the directory listing (on Windows, entry.stat() then needs no extra system call per file):
from os import scandir
def has_same_mtime(path):
    return len(set(entry.stat().st_mtime for entry in scandir(path))) == 1
If you're using Python 3.4 or before you would have to fall back to os.listdir instead:
import os
def has_same_mtime(path):
    return len(set(os.stat(os.path.join(path, name)).st_mtime for name in os.listdir(path))) == 1
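Both versions above compare exact modification timestamps. If "same modified date" means the same calendar day, a variant along these lines may be closer to the intent. It is only a sketch: the function name is made up, it uses the local time zone, and it skips subdirectories. Note also that the code in the question reads st_atime (access time) and keeps only the first three characters of the ctime string, which is the weekday abbreviation.

```python
import os
from datetime import date


def all_modified_on_same_date(directory):
    """Return True if every regular file in `directory` was last
    modified on the same calendar day (local time)."""
    days = set()
    for name in os.listdir(directory):
        full_path = os.path.join(directory, name)
        if os.path.isfile(full_path):  # skip subdirectories
            days.add(date.fromtimestamp(os.path.getmtime(full_path)))
    # An empty directory counts as "all the same" here; that is a
    # design choice for this sketch, not a rule.
    return len(days) <= 1
```

It can be dropped straight into an if statement, for example if all_modified_on_same_date("C:\\Auto_Import\\"): ...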

awk command in subprocess.Popen - Getting Error

I am trying the code below. It works fine and gives me the output I need.
import os,sys
import sitenamevalidation
import getpass
import subprocess
site_name = sys.argv[1]
def get_location(site_name):
    site_name_vali = sitenamevalidation.sitenamevalidation(site_name)
    if site_name_vali == True:
        pass
    else:
        print("Sitename should be in this format [a-z][a-z][a-z][0-9][0-9] example abc54 or xyz50")
        sys.exit(0)
    site = site_name[:3].upper()
    user = getpass.getuser()
    path = os.path.join('/home/',user,'location.attr')
    if os.path.exists(path) == True:
        pass
    else:
        print("Not able to find below path "+path)
    #p1 = subprocess.Popen(['awk', '/site/{getline; print}' , path] ,stdout=subprocess.PIPE)
    p1 = subprocess.Popen(['awk', '/ABC/{getline; print}' , path] ,stdout=subprocess.PIPE)
    output=p1.communicate()
    print str(output[0].strip().lower())
if __name__ == "__main__":
    get_location(site_name)
The main issue is this. If I use the command below,
p1 = subprocess.Popen(['awk', '/ABC/{getline; print}' , path] ,stdout=subprocess.PIPE)
where ABC is a static value that occurs in the file at path, it works fine and gives me the desired output. But I want to use the value the user enters (site in my code). When I do this:
p1 = subprocess.Popen(['awk', '/site/{getline; print}' , path] ,stdout=subprocess.PIPE)
awk looks for the literal text site in the file, which is not there. I want the value of the site variable to go into the command rather than the word site itself. I tried different combinations of single and double quotes, but nothing seems to work.
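One way to approach this, sketched under assumptions: Python does not substitute variables inside a string literal, so '/site/...' reaches awk unchanged. The value can instead be handed to awk with its -v option and used as a dynamic pattern. The helper names build_awk_command and line_after_match and the awk variable name pat are made up for illustration.

```python
import subprocess


def build_awk_command(site, path):
    """Return an argv list that prints the line following each line matching `site`."""
    # -v passes the Python value to awk as the awk variable `pat`;
    # `$0 ~ pat` then uses that value as a dynamic regular expression.
    return ['awk', '-v', 'pat=' + site, '$0 ~ pat {getline; print}', path]


def line_after_match(site, path):
    """Run awk and return its output as stripped, lowercase text."""
    p1 = subprocess.Popen(build_awk_command(site, path), stdout=subprocess.PIPE)
    output = p1.communicate()[0]
    return output.decode().strip().lower()
```

Building the program text with string formatting, for example '/%s/{getline; print}' % site, would also put the value into the command. Passing it with -v keeps the awk program itself fixed, which avoids surprises if the value ever contains a slash.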

How to remove numbers from every list entry

I have a csv file with the following entries:
"Last,First,HW1,HW2,HW3,HW4,Test 1,Test 2"
I want to remove the numbers from the HWs and the Tests. So the result would be:
"Last,First,HW,HW,HW,HW,Test ,Test"
Is there some way to do this?
Try this:
def main(ifilename, ofilename):
    with open(ifilename,'r') as i, open(ofilename,'w') as o:
        for line in i:
            for digit in (0,1,2,3,4,5,6,7,8,9):
                line = line.replace(str(digit), '')
            print(line,file=o,end='')
if __name__ == '__main__':
    from sys import argv
    main(argv[1], argv[2])
Use the script from the command line like this:
python my_script.py input.csv output.csv
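As an alternative sketch, the ten replace calls can be folded into one regular-expression substitution. The function name strip_digits is made up, and the header string is the one from the question.

```python
import re


def strip_digits(text):
    """Remove every ASCII digit (0-9) from `text`."""
    return re.sub(r"[0-9]", "", text)


header = "Last,First,HW1,HW2,HW3,HW4,Test 1,Test 2"
# repr() makes the space left behind in "Test " visible.
print(repr(strip_digits(header)))
```

Only the digits are removed, so the space in "Test 1" is kept and the last field ends with a trailing space.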