Explanation of this MRJob example - python-2.7

    from mrjob.job import MRJob

    class KittyJob(MRJob):

        def mapper_cmd(self):
            return "grep kitty"

        def reducer(self, key, values):
            yield None, sum(1 for _ in values)

    if __name__ == '__main__':
        KittyJob.run()

Source: https://mrjob.readthedocs.org/en/latest/guides/writing-mrjobs.html#protocols
How does this code do its task of counting the number of lines containing kitty?
Also where is OUTPUT_PROTOCOL defined?

Well, the short answer is that this example doesn't count lines containing 'kitty'.
Here is some code using filters that does count lines containing (case-insensitive) kitty:
    from mrjob.job import MRJob
    from mrjob.protocol import JSONValueProtocol
    from mrjob.step import MRStep

    class KittyJob(MRJob):

        OUTPUT_PROTOCOL = JSONValueProtocol

        def mapper(self, _, line):
            # Only lines that passed the grep pre-filter reach the mapper.
            yield 'kitty', 1

        def sum_kitties(self, key, values):
            yield None, sum(values)

        def steps(self):
            return [
                MRStep(mapper_pre_filter='grep -i "kitty"',
                       mapper=self.mapper, reducer=self.sum_kitties)
            ]

    if __name__ == '__main__':
        KittyJob.run()
If I run it with the local runner, as noted in "Shell Commands as Steps", over the text of the English Wikipedia page for 'Kitty', I get a count of all lines containing 'kitty', as expected:
    $ python grep_kitty.py -q -r local kitty.txt
    $ grep -ci kitty kitty.txt
It looks like the example you cite from the mrjob docs is just wrong.
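If you want to cross-check the count without mrjob or grep, here is a minimal plain-Python sketch of what `grep -ci kitty` computes. The function name count_matching_lines and the sample lines are made up for illustration; they are not part of mrjob.

```python
def count_matching_lines(lines, needle="kitty"):
    """Count lines that contain needle, ignoring case (like `grep -ci`)."""
    needle = needle.lower()
    # A line counts once, no matter how many times the word occurs in it.
    return sum(1 for line in lines if needle in line.lower())


if __name__ == "__main__":
    # Hypothetical sample input, standing in for the lines of kitty.txt.
    sample = [
        "Hello Kitty is a character",
        "nothing to see here",
        "kitty kitty kitty",
    ]
    print(count_matching_lines(sample))
```

To run it against a real file, pass the open file object instead of the list, since iterating a file yields its lines.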


How to pass input path to MRJob.mapper_raw?

I'm running a mapreduce job with mrjob package and I am trying to use mapper raw as below
    from mrjob.job import MRJob
    from mrjob.step import MRStep

    class MRJOB(MRJob):

        def mapper_raw(self, input_path, input_uri):
            import csv
            print(input_path, input_uri)
            with open(input_path) as f:
                reader = csv.reader(f)
                for line in reader:
                    if line:
                        yield (0, line)

        def steps(self):
            return [
                MRStep(mapper=self.mapper_raw)
            ]  # , reducer=self.reducer)]

    if __name__ == "__main__":
        MRJOB.run()
However, if I run this as python mrjobfile.py inputfile.csv, I get the following error:
    TypeError: expected str, bytes or os.PathLike object, not NoneType
How do I tell mrjob to pass the filename to the mapper as a string? It seems to pass in just the first line of the file.
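You have to pass mapper_raw, not mapper, as the keyword argument in your MRStep. With mapper=, mrjob calls the function once per input line with a key of None, which is why input_path arrives as None and the second argument is a line of the file. The step should look like this:
    def steps(self):
        return [
            MRStep(mapper_raw=self.mapper_raw)
        ]  # , reducer=self.reducer)]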

Python watchdog Watch For the Creation of New File By Pattern Match

I'm still learning Python, so I apologize up front. I am trying to write a watcher that watches for the creation of a file matching a specific pattern. I am using watchdog, with some code I found, to get it to watch a directory.
I don't quite understand how to have the watcher watch for a filename that matches a pattern. I've edited the regexes=[] field with what I think might work but I have not had luck.
Here is an example of a file that might come in: Hires For Monthly TM_06Jan_0946.CSV
I can get the watcher to tell me when a .CSV is created in the directory, I just can't get it to only tell me when essentially Hires.*\.zip is created.
I've checked this link but have not had any luck:
How to run the RegexMatchingEventHandler of Watchdog correctly?
Here is my code:
    import time
    import watchdog.events
    import watchdog.observers

    class Handler(watchdog.events.RegexMatchingEventHandler):
        def __init__(self):
            watchdog.events.RegexMatchingEventHandler.__init__(
                self, regexes=['.*'], ignore_regexes=[],
                ignore_directories=False, case_sensitive=False)

        def on_created(self, event):
            print("Watchdog received created event - %s." % event.src_path)

        def on_modified(self, event):
            print("Watchdog received modified event - %s." % event.src_path)

    if __name__ == "__main__":
        src_path = r"C:\Users\Downloads"
        event_handler = Handler()
        observer = watchdog.observers.Observer()
        observer.schedule(event_handler, path=src_path, recursive=True)
        observer.start()
        try:
            # Keep the main thread alive while the observer thread runs.
            while True:
                time.sleep(1)
        except KeyboardInterrupt:
            observer.stop()
        observer.join()
The reason the '.*' regex works while the Hires.*\.zip regex doesn't is that watchdog matches each regex against the full path, starting at the beginning of the string. In regex terms, r'Hires.*\.zip' will not match "C:\Users\Downloads\Hires blah.zip", because the path starts with "C:\" rather than "Hires".
Instead, try:
    def __init__(self):
        watchdog.events.RegexMatchingEventHandler.__init__(
            self, regexes=[r'^.*?\.csv$', r'.*?Hires.*?\.zip'],
            ignore_regexes=[], ignore_directories=False, case_sensitive=False)
This matches all csv files, plus zip files with "Hires" somewhere in their path. The leading '.*?' absorbs the "C:\Users\Downloads\" prefix, and the rest of each pattern pins down the file name and extension.
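The difference is easy to reproduce with the re module alone. This is a small sketch that assumes watchdog applies re.match to the event path, which is my reading of its behavior; the two compiled patterns and the sample path are just for illustration.

```python
import re

# Sample path built from the file name in the question.
path = r"C:\Users\Downloads\Hires For Monthly TM_06Jan_0946.CSV"

# A pattern written for the bare file name, and one written for the full path.
name_only = re.compile(r"Hires.*\.csv", re.IGNORECASE)
full_path = re.compile(r".*?Hires.*?\.csv$", re.IGNORECASE)

# re.match only succeeds if the pattern matches starting at position 0.
name_only_matches = name_only.match(path) is not None
full_path_matches = full_path.match(path) is not None

if __name__ == "__main__":
    print("file-name pattern matches:", name_only_matches)
    print("full-path pattern matches:", full_path_matches)
```

The file-name pattern fails because the string begins with the drive letter, while the second pattern lets '.*?' consume everything before "Hires".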

Have all files in a directory been updated?

I need a bit of help.
I'm looking to use Python 2.7 to check whether all files in a directory have the same modified date.
I'll be putting this into an if statement, so I only need a True or False value: True if all files have the same modified date, otherwise False.
Any help would be awesome.
The method below works, but is it the most efficient way?
    import os

    def get_information(directory):
        file_list = []
        for i in os.listdir(directory):
            a = os.stat(os.path.join(directory, i))
            # Collect each file's modification time.
            file_list.append(a.st_mtime)
        return file_list

    def checkEqual2(iterator):
        return len(set(iterator)) <= 1

    print get_information("C:\\Auto_Import\\")
    print checkEqual2(get_information("C:\\Auto_Import\\"))
You can use os.scandir (Python 3.5+) for better efficiency, since it can fetch file attributes while listing the directory instead of needing a separate os.stat call per file (on Windows in particular):
    from os import scandir

    def has_same_mtime(path):
        return len(set(entry.stat().st_mtime for entry in scandir(path))) == 1
If you're using Python 3.4 or earlier (including Python 2.7), you have to fall back to os.listdir instead:
    import os

    def has_same_mtime(path):
        return len(set(os.stat(os.path.join(path, name)).st_mtime for name in os.listdir(path))) == 1
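To try the check without touching a real directory, here is a minimal Python 3 sketch that builds a temporary directory and sets the timestamps by hand with os.utime. The demo function, the file names, and the timestamp values are invented for the example.

```python
import os
import tempfile


def has_same_mtime(path):
    """Return True if every entry in path has the same modification time."""
    mtimes = {entry.stat().st_mtime for entry in os.scandir(path)}
    return len(mtimes) == 1


def demo():
    """Return the check's result for identical and then differing mtimes."""
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for name in ("a.txt", "b.txt"):
            file_path = os.path.join(tmp, name)
            with open(file_path, "w") as f:
                f.write("x")
            paths.append(file_path)

        # Force identical (atime, mtime) pairs on both files, then check.
        for p in paths:
            os.utime(p, (1000000000, 1000000000))
        same = has_same_mtime(tmp)

        # Move one file's mtime forward by an hour and check again.
        os.utime(paths[1], (1000000000, 1000003600))
        different = has_same_mtime(tmp)
        return same, different


if __name__ == "__main__":
    print(demo())
```

Note that comparing with == 1 treats an empty directory as False, whereas the <= 1 test in the question treats it as True; pick whichever suits your if statement.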

awk command in subprocess.Popen - Getting Error

I am trying the code below. It works fine and gives me the output I need.
    import os, sys
    import sitenamevalidation
    import getpass
    import subprocess

    site_name = sys.argv[1]

    def get_location(site_name):
        site_name_vali = sitenamevalidation.sitenamevalidation(site_name)
        if not site_name_vali:
            print("Sitename should be in this format [a-z][a-z][a-z][0-9][0-9] example abc54 or xyz50")
            return
        site = site_name[:3].upper()
        user = getpass.getuser()
        path = os.path.join('/home/', user, 'location.attr')
        if not os.path.exists(path):
            print("Not able to find below path " + path)
            return
        #p1 = subprocess.Popen(['awk', '/site/{getline; print}', path], stdout=subprocess.PIPE)
        p1 = subprocess.Popen(['awk', '/ABC/{getline; print}', path], stdout=subprocess.PIPE)
        output = p1.communicate()
        print(str(output[0].strip().lower()))

    if __name__ == "__main__":
        get_location(site_name)
The main issue is this. If I use the command below
    p1 = subprocess.Popen(['awk', '/ABC/{getline; print}', path], stdout=subprocess.PIPE)
where ABC is a static value that occurs in the file at path, it works fine and gives me the desired output. But I want to match the value the user enters (site in my code). When I do this
    p1 = subprocess.Popen(['awk', '/site/{getline; print}', path], stdout=subprocess.PIPE)
awk looks for the literal text site in the file, which is not there. I want the value of the site variable to go into the command rather than the word site itself. I have tried single and double quotes, but nothing seems to work.
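One possible way to get the variable's value into the awk program, sketched under the assumption that site holds plain letters such as ABC: hand it to awk with the standard -v option and match with $0 ~ pat instead of a /.../ literal. The helper names build_awk_command and line_after_match, the variable name pat, and the sample path are made up for illustration, and the sketch is written for Python 3.

```python
import subprocess


def build_awk_command(site, path):
    """Build an argv list that passes site to awk as the variable pat."""
    # -v assigns an awk variable before the program runs; `$0 ~ pat`
    # then uses that variable's value as the regex to match each line.
    return ["awk", "-v", "pat=" + site, "$0 ~ pat {getline; print}", path]


def line_after_match(site, path):
    """Run awk and return its output as text (requires awk on PATH)."""
    p1 = subprocess.Popen(build_awk_command(site, path),
                          stdout=subprocess.PIPE)
    output = p1.communicate()
    return output[0].decode().strip().lower()


if __name__ == "__main__":
    # Hypothetical values, only to show the resulting command line.
    print(build_awk_command("ABC", "/home/user/location.attr"))
```

Building the program text with string formatting, for example '/%s/{getline; print}' % site, should also work for simple values, but -v keeps user input out of the awk program itself.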

How to remove numbers from every list entry

I have a csv file with the following entries:
"Last,First,HW1,HW2,HW3,HW4,Test 1,Test 2"
I want to remove the numbers from the HWs and the Tests. So the result would be:
"Last,First,HW,HW,HW,HW,Test ,Test"
Is there some way to do this?
Try this:
    def main(ifilename, ofilename):
        with open(ifilename, 'r') as i, open(ofilename, 'w') as o:
            for line in i:
                # Strip every digit character from the line.
                for digit in (0,1,2,3,4,5,6,7,8,9):
                    line = line.replace(str(digit), '')
                o.write(line)

    if __name__ == '__main__':
        from sys import argv
        main(argv[1], argv[2])
Use the script from the command line like this:
    python my_script.py input.csv output.csv
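The same digit stripping can be done in one pass with a regular expression. This is a small sketch applied to the header string from the question; strip_digits is a name invented for the example.

```python
import re


def strip_digits(text):
    """Remove every ASCII digit from text."""
    return re.sub(r"[0-9]", "", text)


if __name__ == "__main__":
    header = "Last,First,HW1,HW2,HW3,HW4,Test 1,Test 2"
    print(strip_digits(header))
```

Note that "Test 1" becomes "Test " with a trailing space; split on commas and call strip() on each field if that matters.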