Python 2.7 csv read, modify then write with dict? - python-2.7

OK, I acknowledge that my question might duplicate this one, but I'm going to ask anyway, because although the ultimate goals are similar, the Python code in use seems quite different.
I often have a list of students to create user accounts for. For this, I need to generate UserIDs of the format
`Lastname[0:6].capitalize() + Firstname[0].capitalize()`
i.e. up to six characters from the last name plus the first initial. I'd like to automate this with a Python script that reads firstname/lastname from one .csv file and writes firstname, lastname, userid to a different .csv.
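The naming rule itself is just slicing plus capitalize(); for example (a standalone sketch, sample name invented):

```python
def make_userid(last, first):
    # Up to six characters of the last name plus the first initial, capitalized.
    return last[:6].capitalize() + first[0].capitalize()

print(make_userid("Hendrickson", "Brian"))  # HendriB
```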
Here is my code, which almost works, but I'm having difficulty with the write-rows-to-csv part at the end:
import csv

input_file = csv.DictReader(open("cl.csv"))
index = 0
fldnms = input_file.fieldnames
fldnms.append('UserName')
print fldnms
for row in input_file:
    index += 1
    UserID = (row["Last"][0:6].capitalize() + row["First"][0].capitalize())
    row['UserName'] = UserID
    print index, row["Last"], row["First"], row["UserName"]
with open("users.csv", 'wb') as out_csv:
    dw = csv.DictWriter(out_csv, delimiter=',', fieldnames=fldnms)
    dw.writerow(dict((fn, fn) for fn in fldnms))
    for row in input_file:
        dw.writerow(row)
Advice / thoughts welcomed.
Thanks,
Brian H.

I went back to this after a good night's sleep and, fwiw, here is the working version:
'''
Reads cl.csv (client list) as firstname lastname list with header
Writes users.csv as lastname firstname userid list w/o header
'''
import csv

INfile = open("..\\cl_old.csv")
input_file = csv.DictReader(INfile, delimiter=' ')
fldnms = input_file.fieldnames
fldnms.append('UserName')
index = 0
OUTfile = open("users.csv", 'wb')
dw = csv.DictWriter(OUTfile, delimiter=',', fieldnames=fldnms)
dw.writerow(dict((fn, fn) for fn in fldnms))
for row in input_file:
    index += 1
    UserID = (row["Last"][0:6].capitalize() + row["First"][0].capitalize())
    row['UserName'] = UserID
    print index, row["Last"], row["First"], row["UserName"]
    dw.writerow(row)
INfile.close()
OUTfile.close()
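For comparison, here is the same flow in Python 3 syntax (a sketch; it generates a tiny two-line sample cl.csv first so it runs standalone, and the column names are assumed):

```python
import csv

# Create a tiny sample input (space-delimited, with a header), as in the question.
with open("cl.csv", "w", newline="") as f:
    f.write("First Last\n")
    f.write("brian hendrickson\n")

# Python 3 version of the read-modify-write flow.
with open("cl.csv", newline="") as infile, \
     open("users.csv", "w", newline="") as outfile:
    reader = csv.DictReader(infile, delimiter=" ")
    fieldnames = reader.fieldnames + ["UserName"]
    writer = csv.DictWriter(outfile, delimiter=",", fieldnames=fieldnames)
    writer.writeheader()  # DictWriter can emit the header row itself
    for row in reader:
        # Six letters of the last name + first initial, capitalized.
        row["UserName"] = row["Last"][:6].capitalize() + row["First"][0].capitalize()
        writer.writerow(row)

print(open("users.csv").read())
```

In Python 3 the files are opened in text mode with newline='' (rather than 'wb'), which is what the csv module expects there.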
cl.csv contains a list of first name, last name pairs. Results are stored in users.csv as names, userid.
I did this as an exercise in Python, as Excel will do it with a single formula:
=CONCATENATE(LEFT(A2,6),LEFT(B2,1))
Hope this is of interest.
BJH

Related

counting genres in pig

I'm working with the movies.dat dataset provided by MovieLens. The first 5 rows of the data are:
1:Toy Story (1995):Adventure|Animation|Children|Comedy|Fantasy
2:Jumanji (1995):Adventure|Children|Fantasy
3:Grumpier Old Men (1995):Comedy|Romance
4:Waiting to Exhale (1995):Comedy|Drama|Romance
5:Father of the Bride Part II (1995):Comedy
I want to count the exact number of occurrences of each genre. For this, the following MapReduce (Python) code is sufficient:
#!/usr/bin/env python
# mapper
import sys

for line in sys.stdin:
    for genre in line.strip().split(":")[-1].split("|"):
        print("{x}\t1".format(x=genre))

#!/usr/bin/env python
# reducer
import sys

genre_dict = {}
for line in sys.stdin:
    data = line.strip().split("\t")
    if len(data) != 2:
        continue
    if data[0] not in genre_dict:
        genre_dict[data[0]] = 1
    else:
        genre_dict[data[0]] += 1

a = list(genre_dict.items())
a.sort(key=lambda x: x[1], reverse=True)
for genre, count in a:
    print("{x}\t{y}".format(x=genre, y=count))
Any suggestion for a Pig query that does the same task?
Thanks in advance...
TOKENIZE and FLATTEN can help you out here. The TOKENIZE operator in Pig takes a string and a delimiter, splits the string into parts based on the delimiter and puts the parts into a bag. The FLATTEN operator in Pig takes a bag and explodes each element in the bag into a new record. The code will look as follows:
--Load your initial data and split into columns based on ':'
data = LOAD 'path_to_data' USING PigStorage(':') AS (index:long, name:chararray, genres:chararray);

--Split & explode each individual genre into a separate record
dataExploded = FOREACH data GENERATE FLATTEN(TOKENIZE(genres, '|')) AS genre;

--GROUP and get counts for each genre
dataWithCounts = FOREACH (GROUP dataExploded BY genre) GENERATE
    group AS genre,
    COUNT(dataExploded) AS genreCount;

DUMP dataWithCounts;
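For a quick local check of the same counting logic outside Hadoop, Python's collections.Counter collapses the mapper and reducer into a single pass (a sketch over a few hard-coded sample rows from the question):

```python
from collections import Counter

# Sample lines in the movies.dat format from the question.
lines = [
    "1:Toy Story (1995):Adventure|Animation|Children|Comedy|Fantasy",
    "2:Jumanji (1995):Adventure|Children|Fantasy",
    "3:Grumpier Old Men (1995):Comedy|Romance",
]

# Same split as the mapper; Counter does the reducer's tallying.
counts = Counter(
    genre
    for line in lines
    for genre in line.strip().split(":")[-1].split("|")
)
print(counts.most_common())
```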

Python3: split up list and save as file - how to?

I'm kinda new to Python, so thanks for your help!
I want to tell Python to take a big .csv list and split it up into many small lists of only two columns each.
Take this .csv file.
Always use column "year", which is the first column.
Then always take the next column (for-loop?), starting with column 2, which is "Object1", then column 3, which is "Object2", and so on...
Save each list as a .csv, now containing only two columns, and name it after its second column (e.g. "Object1").
So far I am up to this:
import csv

object = 0
f = open("/home/Data/data.csv")
csv_f = csv.reader(f, delimiter=';', quotechar='|')
writer = csv.writer(csv_f)
for row in csv_f:
    writer("[0],[object]")
    object += 1
f.close()
Your code passes the reader object itself to csv.writer, effectively trying to read and write the same stream, which may have unexpected results.
Think about your problem as a series of steps; one way to approach the problem is:
Open the big file
Read the first line of the file, which contains the column titles.
Go through the column titles (the first line of your big csv file), skipping the first one, then:
For each column title, create a new csv file, where the filename is the name of the column.
Take the value of the first column, plus the value of the column you are currently reading, and write it to the file.
Repeat till all column titles are read
Close the file
Close the big file.
Here is the same approach as above, taking advantage of Python's csv reading capabilities:
import csv

with open('big-file.csv') as f:
    reader = csv.reader(f, delimiter=';', quotechar='|')
    titles = next(reader)
    for index, column_name in enumerate(titles[1:]):
        with open('{}.csv'.format(column_name), 'w') as i:
            writer = csv.writer(i, delimiter=';', quotechar='|')
            for row in reader:
                writer.writerow((row[0], row[index+1]))
        f.seek(0)  # start from the top of the big file again
        next(reader)  # skip the header row
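This re-reads the big file once per column via seek(0). If that gets slow for a wide or large input, an alternative sketch (same assumed ';' delimiter and layout, with a tiny generated sample so it runs standalone) opens one writer per column up front and makes a single pass:

```python
import csv

# Build a small sample input matching the question's layout (column names assumed).
with open("big-file.csv", "w", newline="") as f:
    f.write("year;Object1;Object2\n")
    f.write("2000;10;20\n")
    f.write("2001;11;21\n")

# Single-pass alternative: open one writer per column up front,
# so the big file is read only once instead of once per column.
with open("big-file.csv", newline="") as f:
    reader = csv.reader(f, delimiter=";", quotechar="|")
    titles = next(reader)
    outfiles = [open("{}.csv".format(name), "w", newline="") for name in titles[1:]]
    writers = [csv.writer(o, delimiter=";", quotechar="|") for o in outfiles]
    for row in reader:
        for index, writer in enumerate(writers):
            writer.writerow((row[0], row[index + 1]))
    for o in outfiles:
        o.close()
```

The trade-off is holding one open file handle per column, which is fine for a handful of columns but worth noting for very wide files.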

Python - webscraping; dictionary data structure

I need to scrape this website (http://setkab.go.id/profil-kabinet/#) and produce an Excel file that has headers "Cabinet names" in column 1 and "Era" in column 2. That means each Cabinet name (e.g. Kabinet Presidensil, Kabinet Sjahrir I) should have its own row - alongside its respective era (e.g. Era Revolusi Fisik, Era Republik Indonesia Serikat).
This is the closest I've gotten:
import requests
from bs4 import BeautifulSoup

response = requests.get('http://setkab.go.id/profil-kabinet/#')
soup = BeautifulSoup(response.text, 'html.parser')

eras = soup.find_all('div', attrs={'class': "wpb_accordion_section group"})
setkab = {}
for element in eras:
    setkab[element.a.get_text()] = {}
for element in eras:
    cabname = element.find('div', attrs={'class': 'wpb_wrapper'}).get_text()
    setkab[element.a.get_text()]['cbnm'] = cabname

for item in setkab.keys():
    print item + setkab[item]['cbnm']

import os, csv
os.chdir("/Users/mxcodes/Code")
with open("setkabfinal.csv", "w") as toWrite:
    writer = csv.writer(toWrite, delimiter=",")
    writer.writerow(["Era", "Cabinet name"])
    for a in setkab.keys():
        writer.writerow([a.encode("utf-8"), setkab[a]["cbnm"]])
However, this creates an Excel file with the headers "Era" and "Cabinet names" in columns 1 and 2, respectively. It fails to put each Cabinet name in a separate row. For example, it has 'Era Revolusi Fisik' in column 1 and lists all of that era's cabinets together in column 2.
My guess is that I need to switch the key-value pairs somehow so that each Cabinet becomes a key and its era becomes its value - because currently it's the other way around. But I've tried and failed to do so. Any help? Thank you!
From what I can see, the cabinets[a]["cbnm"] value you write is just one long Unicode string. So when you do writer.writerow([a.encode("utf-8"), cabinets[a]["cbnm"]]), you write the era in the first column and that entire string into a single cell in the next column. Even the \n characters in the string don't split it across rows: the csv module assumes you want the whole value in one cell, so it wraps it in quotes to keep it there. To write every cabinet value on its own row, call the writerow method separately for each desired row.
for example this code worked fine for me:
cabinets = setkab
with open("cabinets.csv", "w") as toWrite:
    writer = csv.writer(toWrite, delimiter=",")
    writer.writerow(["Era", "Cabinet name"])
    for a in setkab.keys():
        writer.writerow([a.encode("utf-8")])  # write the era row
        # get all the values separated by newline chars (skipping empty strings)
        cabinets_list = [i for i in cabinets[a]["cbnm"].split('\n') if i != '']
        for i in cabinets_list:
            writer.writerow([a.encode("utf-8"), i])  # write each cabinet value separately
as you can see I changed only the last 3 lines.
I hope this will help you!
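The same split-before-writing idea can be shown without the scrape itself. This Python 3 sketch uses hard-coded sample data standing in for the scraped dict (era and cabinet names taken from the question), emitting one (era, cabinet) pair per writerow call:

```python
import csv

# Hypothetical stand-in for the scraped structure: era -> newline-joined cabinets.
setkab = {
    "Era Revolusi Fisik": {"cbnm": "Kabinet Presidensil\nKabinet Sjahrir I\n"},
}

with open("setkabfinal.csv", "w", newline="") as toWrite:
    writer = csv.writer(toWrite, delimiter=",")
    writer.writerow(["Era", "Cabinet name"])
    for era, data in setkab.items():
        # Split the newline-joined block, dropping empty strings,
        # and write one row per cabinet.
        for cabinet in filter(None, data["cbnm"].split("\n")):
            writer.writerow([era, cabinet])

print(open("setkabfinal.csv").read())
```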

Python: Reading in .csv data as dictionary and printing out data as dictionary to .csv file?

I'm writing a python executable script that does the following:
I want to gather information from a .csv file and read it into Python as a dictionary. This .csv file contains several columns of information with headings, and I only want to extract particular columns (those whose headings I specify) and print them out to another .csv file. I am using the functions DictReader and DictWriter.
I am reading in the .csv file as a dictionary (with the headings as keys and the column values as items), and outputting the information as a dictionary to another .csv file.
After I read it in, I print out the items under the particular headings (so I can double-check what I have read in). I then open up a new .csv file and want to write the data (which I have just read in) as a dictionary. I can write the keys (column headings), but my code doesn't print any of the item values for some reason. The headings that I want in this case are 'Name' and 'DOB'.
Here is my code:
#!/usr/bin/python
import os
import os.path
import re
import sys
import pdb
import csv

csv_file = csv.DictReader(open(sys.argv[1], 'rU'), delimiter=',')
for line in csv_file:
    print line['Name'] + ',' + line['DOB']

fieldnames = ['Name', 'DOB']
test_file = open('test2.csv', 'wr')
csvwriter = csv.DictWriter(test_file, delimiter=',', fieldnames=fieldnames)
csvwriter.writerow(dict((fn, fn) for fn in fieldnames))
for row in csv_file:
    csvwriter.writerow(row)
test_file.close()
Any ideas of where I'm going wrong? I want to print the item values under their corresponding column headers in the output file.
I am using python 2.7.11 on a Mac machine. I am also printing values to the terminal.
You're unfortunately tricked by your own testing, that is, the printing of the individual rows. By looping through csv_file initially, you've exhausted the iterator and are at the end. Further iterations, as done in the bottom of your code, are not possible and will be ignored.
Your question is essentially a duplicate of various other questions, such as how to read from a CSV file repeatedly. Albeit the issue here comes up in a different way: you didn't realise what the problem was, while those questions know the cause, but not the solution.
Answers to those questions tell you to simply reset the file pointer of the input file. Unfortunately, the input file gets closed promptly after reading, in your current code.
Thus, something like this should work:
infile = open(sys.argv[1], 'rU')
csv_file = csv.DictReader(infile, delimiter=',')

<all other code>

infile.seek(0)
next(csv_file)  # skip the header row, which would otherwise be re-read as data
for row in csv_file:
    csvwriter.writerow(row)
test_file.close()
infile.close()
As an aside, just use the with statement when opening files:
with open(sys.argv[1], 'rU') as infile, open('test2.csv', 'w') as outfile:
    csv_file = csv.DictReader(infile, delimiter=',')
    for line in csv_file:
        print line['Name'] + ',' + line['DOB']
    fieldnames = ['Name', 'DOB']
    csvwriter = csv.DictWriter(outfile, delimiter=',', fieldnames=fieldnames,
                               extrasaction='ignore')  # silently drop the other columns
    csvwriter.writeheader()
    infile.seek(0)
    next(csv_file)  # skip the header row, which would otherwise be re-read as data
    for row in csv_file:
        csvwriter.writerow(row)
Note: DictWriter's writeheader() takes care of the header row. No need to build a dict of field names yourself.
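If the file fits in memory, another option is to materialize the rows in a list once, so they can be iterated any number of times without rewinding. A Python 3 sketch (it generates a made-up sample input first so it runs standalone; the extra column is assumed):

```python
import csv

# Sample input (column names assumed), so the sketch is self-contained.
with open("input.csv", "w", newline="") as f:
    f.write("Name,DOB,Extra\n")
    f.write("Alice,1990-01-01,x\n")

# Reading the rows into a list sidesteps the exhausted-iterator problem:
# the list can be looped over as many times as needed.
with open("input.csv", newline="") as infile:
    rows = list(csv.DictReader(infile, delimiter=","))

for row in rows:  # first pass: print for checking
    print(row["Name"] + "," + row["DOB"])

fieldnames = ["Name", "DOB"]
with open("test2.csv", "w", newline="") as outfile:
    writer = csv.DictWriter(outfile, delimiter=",", fieldnames=fieldnames,
                            extrasaction="ignore")  # drop the unwanted columns
    writer.writeheader()
    writer.writerows(rows)  # second pass: write
```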

Changing all occurences of similar word in csv python

I want to replace one specific word, 'my', with 'your'. But it seems my code only changes one occurrence.
import csv

path1 = "/home/bankdata/levelout.csv"
path2 = "/home/bankdata/leveloutmodify.csv"

in_file = open(path1, "rb")
reader = csv.reader(in_file)
out_file = open(path2, "wb")
writer = csv.writer(out_file)

with open(path1, 'r') as csv_file:
    csvreader = csv.reader(csv_file)
    col_count = 0
    for row in csvreader:
        while row[col_count] == 'my':
            print 'my is used'
            row[col_count] = 'your'
            #writer.writerow(row[col_count])
        writer.writerow(row)
        col_count += 1
Let's say the sentence is
'my book is gone and my bag is missing'
the output is
your book is gone and my bag is missing
The second thing is that I want the output to appear without commas separating the words:
print row
the output is
your,book,is,gone,and,my,bag,is,missing,
For the second problem, I'm still trying to find the right approach, as my attempts keep giving me the same comma-separated output.
with open(path1) as infile, open(path2, "w") as outfile:
    for row in infile:
        outfile.write(row.replace(",", ""))
        print row
it gives me the result:
your,book,is,gone,and,my,bag,is,missing
I send this sentence to my Nao robot, and the robot pronounces it awkwardly because of the commas between the words.
I solved it by:
with open(path1) as infile, open(path2, "w") as outfile:
    for row in infile:
        outfile.write(row.replace(",", ""))

with open(path2) as out:
    for row in out:
        print row
It gives me what I want:
your book is gone and your bag is missing too
However, any better way to do it?
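Since the goal is plain text rather than CSV structure, a regex on the raw line can handle both problems at once (a sketch on a hard-coded sample string): re.sub with \b word boundaries replaces every standalone 'my' without touching words like 'myth', and no commas are introduced because the line is never tokenized.

```python
import re

line = "my book is gone and my bag is missing and myth stays"

# \b word boundaries match every standalone 'my' but leave 'myth' alone.
fixed = re.sub(r"\bmy\b", "your", line)
print(fixed)  # your book is gone and your bag is missing and myth stays
```

The same substitution can be applied line by line while copying path1 to path2, replacing both the while loop and the comma-stripping pass.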