Counting genres in Pig - MapReduce

I'm working with the movies.dat dataset provided by MovieLens. The first 5 rows of the data are:
1:Toy Story (1995):Adventure|Animation|Children|Comedy|Fantasy
2:Jumanji (1995):Adventure|Children|Fantasy
3:Grumpier Old Men (1995):Comedy|Romance
4:Waiting to Exhale (1995):Comedy|Drama|Romance
5:Father of the Bride Part II (1995):Comedy
I want to count the exact number of occurrences of each genre. To do this, the following MapReduce (Python) code is sufficient.
#!/usr/bin/env python
# mapper
import sys

for line in sys.stdin:
    # the genre list is the last ':'-separated field; genres are '|'-separated
    for genre in line.strip().split(":")[-1].split("|"):
        print("{x}\t1".format(x=genre))
#!/usr/bin/env python
# reducer
import sys

genre_dict = {}
for line in sys.stdin:
    data = line.strip().split("\t")
    if len(data) != 2:
        continue
    # each mapper line carries a count of 1, so counting lines per genre is enough
    if data[0] not in genre_dict:
        genre_dict[data[0]] = 1
    else:
        genre_dict[data[0]] += 1

# sort genres by count, descending
a = list(genre_dict.items())
a.sort(key=lambda x: x[1], reverse=True)
for genre, count in a:
    print("{x}\t{y}".format(x=genre, y=count))
Any suggestions for a Pig query that does the same task?
Thanks in advance...

TOKENIZE and FLATTEN can help you out here. The TOKENIZE operator in Pig takes a string and a delimiter, splits the string into parts based on the delimiter, and puts the parts into a bag. The FLATTEN operator takes a bag and explodes each element of the bag into a new record. The code will look as follows:
-- Load your initial data and split it into columns on ':'
data = LOAD 'path_to_data' USING PigStorage(':') AS (index:long, name:chararray, genres:chararray);

-- Split and explode each individual genre into a separate record
dataExploded = FOREACH data GENERATE FLATTEN(TOKENIZE(genres, '|')) AS genre;

-- GROUP and get counts for each genre
dataWithCounts = FOREACH (GROUP dataExploded BY genre) GENERATE
    group AS genre,
    COUNT(dataExploded) AS genreCount;

DUMP dataWithCounts;
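On the five sample rows above, the DUMP should produce something like this (no ordering is applied, so the tuple order may differ):

(Comedy,4)
(Adventure,2)
(Children,2)
(Fantasy,2)
(Romance,2)
(Animation,1)
(Drama,1)

If you want the same descending order as your reducer, add an ORDER BY step before the DUMP:

sortedCounts = ORDER dataWithCounts BY genreCount DESC;
DUMP sortedCounts;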

Related

Splitting the name when a word matches one in an array?

As part of my learning: after I successfully split the names with help, my next step is to split a file name when a month name found in it matches one of the months in this list:
Months = ['January','February','March','April','May','June','July','August','September','October','November','December']
My file names look like this:
1. Non IVR Entries Transactions December_16_2016_07_49_22 PM.txt
2. Denied_Calls_SMS_Sent_December_14_2016_05_33_41 PM.txt
Please note that the file names are not all the same; that is why I need to split, for example, 'Non IVR Entries Transactions' as one part and 'December_16_2016_07_49_22 PM' as another.
import os
import os.path
import csv

path = 'C:\\Users\\akhilpriyatam.k\\Desktop\\tes'
text_files = [os.path.splitext(f)[0] for f in os.listdir(path)]
for v in text_files:
    print (v[0:9])
    print (v[10:])

os.chdir('C:\\Users\\akhilpriyatam.k\\Desktop\\tes')
with open('file.csv', 'wb') as csvfile:
    thedatawriter = csv.writer(csvfile, delimiter=',')
    for v in text_files:
        s = (v[0:9])
        t = (v[10:])
        thedatawriter.writerow([s, t])
Assuming that you want the filename and timestamp as the two splits and that the month occurs only once in the string, I hope the following code solves your problem:

import re
import calendar

fullname = 'Non IVR Entries Transactions December_16_2016_07_49_22 PM.txt'
months = list(calendar.month_name[1:])
regex = re.compile('|'.join(months))
matches = list(re.finditer(regex, fullname))
if matches:
    idx = matches[0].start()
    filename, timestamp = fullname[:idx], fullname[idx:-4]
    print filename, timestamp
else:
    print "Month not found"

Note that the original if iter: test would always be true, because re.finditer returns an iterator object, which is truthy even when it yields no matches (and iter also shadows the built-in); materializing the matches into a list first fixes both problems.
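To run this over a whole folder (as in your os.listdir loop), you can wrap the same logic in a small function; a minimal sketch, reusing your folder path (split_on_month is a name chosen here for illustration):

import os
import re
import calendar

months_regex = re.compile('|'.join(calendar.month_name[1:]))

def split_on_month(fullname):
    # return (name part, timestamp part without extension), or None if no month found
    match = months_regex.search(fullname)
    if match:
        idx = match.start()
        return fullname[:idx].strip(), os.path.splitext(fullname[idx:])[0]
    return None

path = 'C:\\Users\\akhilpriyatam.k\\Desktop\\tes'
for f in os.listdir(path):
    parts = split_on_month(f)
    if parts:
        print parts[0], '|', parts[1]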

Python - webscraping; dictionary data structure

I need to scrape this website (http://setkab.go.id/profil-kabinet/#) and produce an Excel file that has headers "Cabinet names" in column 1 and "Era" in column 2. That means each Cabinet name (e.g. Kabinet Presidensil, Kabinet Sjahrir I) should have its own row - alongside its respective era (e.g. Era Revolusi Fisik, Era Republik Indonesia Serikat).
This is the closest I've gotten:
import requests
from bs4 import BeautifulSoup

response = requests.get('http://setkab.go.id/profil-kabinet/#')
soup = BeautifulSoup(response.text, 'html.parser')
eras = soup.find_all('div', attrs={'class': "wpb_accordion_section group"})

setkab = {}
for element in eras:
    setkab[element.a.get_text()] = {}

for element in eras:
    cabname = element.find('div', attrs={'class': 'wpb_wrapper'}).get_text()
    setkab[element.a.get_text()]['cbnm'] = cabname

for item in setkab.keys():
    print item + setkab[item]['cbnm']

import os, csv

os.chdir("/Users/mxcodes/Code")
with open("setkabfinal.csv", "w") as toWrite:
    writer = csv.writer(toWrite, delimiter=",")
    writer.writerow(["Era", "Cabinet name"])
    for a in setkab.keys():
        writer.writerow([a.encode("utf-8"), setkab[a]["cbnm"]])
However, this creates an Excel file with the headers "Era" and "Cabinet names" in column 1 and 2, respectively. It fails to put each Cabinet name in a separate row. For example, it has 'Era Revolusi Fisik' in column 1 and lists all the cabinets together in column 2.
My guess is that I need to switch the key-value pairs somehow so that each Cabinet becomes a key and its era becomes its value - because currently it's the other way around. But I've tried and failed to do so. Any help? Thank you!
From what I can see, the cabinets[a]["cbnm"] value you use for writing is just one long Unicode string. So when you do writer.writerow([a.encode("utf-8"), cabinets[a]["cbnm"]]), you write the era in the first column and that whole string in a single cell in the next column. Even the \n characters in the string do not split it across rows: the csv module assumes you want the value in exactly one cell, so it puts quotes around cabinets[a]["cbnm"] to make sure it stays in one cell. To write every cabinet value on its own row, call the writerow method separately for each desired row.
For example, this code worked fine for me:
cabinets = setkab
with open("cabinets.csv", "w") as toWrite:
    writer = csv.writer(toWrite, delimiter=",")
    writer.writerow(["Era", "Cabinet name"])
    for a in setkab.keys():
        writer.writerow([a.encode("utf-8")])  # write the era on its own row
        # get all the newline-separated values, skipping empty strings
        cabinets_list = [i for i in cabinets[a]["cbnm"].split('\n') if i != '']
        for i in cabinets_list:
            writer.writerow([a.encode("utf-8"), i])  # write every cabinet separately, with its era
As you can see, I changed only the last three lines.
I hope this helps!
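For what it's worth, an alternative to nesting dictionaries is to collect flat (era, cabinet) pairs while scraping, which map directly onto CSV rows. A minimal sketch, reusing the eras result and csv import from the question, and assuming (as above) that cabinet names are newline-separated in the accordion text; setkabflat.csv is just an illustrative name:

rows = []
for element in eras:
    era = element.a.get_text()
    cabnames = element.find('div', attrs={'class': 'wpb_wrapper'}).get_text()
    for cab in cabnames.split('\n'):
        if cab:  # skip empty strings between newlines
            rows.append((era, cab))

with open("setkabflat.csv", "w") as toWrite:
    writer = csv.writer(toWrite, delimiter=",")
    writer.writerow(["Era", "Cabinet name"])
    for era, cab in rows:
        writer.writerow([era.encode("utf-8"), cab.encode("utf-8")])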

How to remove unwanted items from a parse file

from googlefinance import getQuotes
import json
import time as t
import re

List = ["A", "AA", "AAB"]
Time = t.localtime()  # sets variable Time to retrieve date/time info
Date2 = ('%d-%d-%d %dh:%dm:%dsec' % (Time[0], Time[1], Time[2], Time[3], Time[4], Time[5]))  # formats time stamp

while True:
    for i in List:
        try:  # allows elements to be called and, on an error, moves to the next step
            Data = json.dumps(getQuotes(i.lower()), indent=1)  # retrieves data from Google Finance
            regex = ('"LastTradePrice": "(.+?)",')  # sets parse
            pattern = re.compile(regex)  # compiles parse
            price = re.findall(pattern, Data)  # retrieves parse
            print(i)
            print(price)
        except:  # sets error handling
            Error = (i + ' Failed to load on: ' + Date2)
            print (Error)
It will display the quote as: ['(number)'].
I would like it to only display the number, which means removing the brackets and quotes.
Any help would be great.
Changing:
print(price)
into:
print(price[0])
prints this:
A
42.14
AA
10.13
AAB
0.110
Use the type() function to check the datatype, in your case type(price).
If the datatype is a list, use print(price[0]) and you will get just the number. As for the brackets, you need to check the Google data and your regex.
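One caveat with price[0]: if the regex ever fails to match (for example, when Google returns an unexpected payload), price will be an empty list and price[0] will raise an IndexError. A guarded version of the two print lines might look like this:

if price:
    print(price[0])
else:
    print(i + ' returned no LastTradePrice')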

Using Python CSV and glob to find matching strings and print row

I have hundreds of CSV files and I'm trying to write a Python script that will parse through all of them and print out the rows that contain a matching string (or strings). I'll be happy if we can get this to work with just one string rather than a list of strings. I'm using Python 2.7.5. Here's what I've figured out so far:
The csv module in Python will print the row with the matching string in a particular column (row[8], the ninth column counting from one):
import csv

reader = csv.reader(open('2015-08-25.csv'))
for row in reader:
    col8 = str(row[8])
    if col8 == '36862210':
        print row
So the above works for one .csv file. Now I need to parse hundreds of .csv files with glob. The glob module will print out all the file names with this code:
import glob

for name in glob.glob('20??-??-??.csv'):
    print name
I tried putting the two together into one script, but the error message reads:

File "test7.py", line 6, in <module>
    reader = csv.reader(open(csvfiles))
TypeError: coercing to Unicode: need string or buffer, list found
import csv
import glob

csvfiles = glob.glob('20??-??-??.csv')
for filename in csvfiles:
    reader = csv.reader(open(csvfiles))
    for row in reader:
        col8 = str(row[8])
        if col8 == '36862210':
            print row
You are trying to open a list: csvfiles is the list you are iterating over.
Use this instead, because open() expects a filename:

reader = csv.reader(open(filename))
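Putting the fix together, the whole script might look like the sketch below; the with block (a small change from the original) makes sure each file is closed before moving on to the next:

import csv
import glob

for filename in glob.glob('20??-??-??.csv'):
    with open(filename) as f:
        for row in csv.reader(f):
            if str(row[8]) == '36862210':
                print row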

Python 2.7 csv read, modify then write with dict?

OK, I acknowledge that my question might duplicate this one, but I'm going to ask anyway because, although the ultimate goals are similar, the Python code in use seems quite different.
I often have a list of students to create user accounts for. For this, I need to generate UserIDs of the format
`Lastname[0:6].capitalize() + Firstname[0].capitalize()`
i.e., up to six characters from the last name plus the first initial. I'd like to automate this with a Python script that reads first name / last name from one .csv file and writes first name, last name, and UserID to a different .csv.
Here is my code, which almost works, but I am having difficulty with the part that writes the rows to the .csv at the end:
import csv

input_file = csv.DictReader(open("cl.csv"))
index = 0
fldnms = input_file.fieldnames
fldnms.append('UserName')
print fldnms
for row in input_file:
    index += 1
    UserID = (row["Last"][0:6].capitalize() + row["First"][0].capitalize())
    row['UserName'] = UserID
    print index, row["Last"], row["First"], row["UserName"]

with open("users.csv", 'wb') as out_csv:
    dw = csv.DictWriter(out_csv, delimiter=',', fieldnames=fldnms)
    dw.writerow(dict((fn, fn) for fn in fldnms))
    for row in input_file:
        dw.writerow(row)
Advice / thoughts welcomed.
Thanks,
Brian H.
I went back to this after a good night's sleep and, FWIW, here is the working version:
'''
Reads cl.csv (client list) as a firstname/lastname list with a header.
Writes users.csv as a lastname/firstname/userid list with a header row.
'''
import csv

INfile = open("..\\cl_old.csv")
input_file = csv.DictReader(INfile, delimiter=' ')
fldnms = input_file.fieldnames
fldnms.append('UserName')

index = 0
OUTfile = open("users.csv", 'wb')
dw = csv.DictWriter(OUTfile, delimiter=',', fieldnames=fldnms)
dw.writerow(dict((fn, fn) for fn in fldnms))  # header row
for row in input_file:
    index += 1
    UserID = (row["Last"][0:6].capitalize() + row["First"][0].capitalize())
    row['UserName'] = UserID
    print index, row["Last"], row["First"], row["UserName"]
    dw.writerow(row)
INfile.close()
OUTfile.close()
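For what it's worth, the same program can be written a little more compactly using with blocks and DictWriter.writeheader() (available in Python 2.7); a sketch assuming the same input layout and the cl.csv / users.csv file names from the question:

import csv

with open("cl.csv") as infile, open("users.csv", "wb") as outfile:
    reader = csv.DictReader(infile, delimiter=' ')
    fieldnames = reader.fieldnames + ['UserName']
    writer = csv.DictWriter(outfile, delimiter=',', fieldnames=fieldnames)
    writer.writeheader()
    for row in reader:
        # six chars of the last name + first initial, as in the original
        row['UserName'] = row["Last"][0:6].capitalize() + row["First"][0].capitalize()
        writer.writerow(row)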
cl.csv contains a list of first name / last name pairs. Results are stored in users.csv as the names plus the UserID.
I did this as an exercise in Python, since Excel will do it in a single formula:
=CONCATENATE(LEFT(A2,6),LEFT(B2,1))
Hope this is of interest.
BJH