Python - webscraping; dictionary data structure - python-2.7

I need to scrape this website (http://setkab.go.id/profil-kabinet/#) and produce an Excel file that has headers "Cabinet names" in column 1 and "Era" in column 2. That means each Cabinet name (e.g. Kabinet Presidensil, Kabinet Sjahrir I) should have its own row - alongside its respective era (e.g. Era Revolusi Fisik, Era Republik Indonesia Serikat).
This is the closest I've gotten:
import requests
from bs4 import BeautifulSoup

response = requests.get('http://setkab.go.id/profil-kabinet/#')
soup = BeautifulSoup(response.text, 'html.parser')

eras = soup.find_all('div', attrs={'class':"wpb_accordion_section group"})

setkab = {}
for element in eras:
    setkab[element.a.get_text()] = {}

for element in eras:
    cabname = element.find('div', attrs={'class':'wpb_wrapper'}).get_text()
    setkab[element.a.get_text()]['cbnm'] = cabname

for item in setkab.keys():
    print item + setkab[item]['cbnm']

import os, csv
os.chdir("/Users/mxcodes/Code")

with open("setkabfinal.csv", "w") as toWrite:
    writer = csv.writer(toWrite, delimiter=",")
    writer.writerow(["Era", "Cabinet name"])
    for a in setkab.keys():
        writer.writerow([a.encode("utf-8"), setkab[a]["cbnm"]])
However, this creates an Excel file with the headers "Era" and "Cabinet names" in column 1 and 2, respectively. It fails to put each Cabinet name in a separate row. For example, it has 'Era Revolusi Fisik' in column 1 and lists all the cabinets together in column 2.
My guess is that I need to switch the key-value pairs somehow so that each Cabinet becomes a key and its era becomes its value - because currently it's the other way around. But I've tried and failed to do so. Any help? Thank you!

From what I can see, the setkab[a]["cbnm"] value you pass to the writer is just one long Unicode string, so when you call writer.writerow([a.encode("utf-8"), setkab[a]["cbnm"]]) you write the era in the first column and that entire string in a single cell in the next column. The \n characters inside the string do not split it into rows; the csv module assumes you want the value in exactly one cell, so it wraps the value in quotes to keep it there. To put every cabinet name on its own row, call the writerow method separately for each desired row.
For example, this code worked fine for me:
cabinets = setkab
with open("cabinets.csv", "w") as toWrite:
    writer = csv.writer(toWrite, delimiter=",")
    writer.writerow(["Era", "Cabinet name"])
    for a in setkab.keys():
        writer.writerow([a.encode("utf-8")])  # write the era column
        cabinets_list = [i for i in cabinets[a]["cbnm"].split('\n') if i != '']  # get all the values that are separated by newline chars (if they aren't empty strings)
        for i in cabinets_list:
            writer.writerow([a.encode("utf-8"), i])  # write every value separately in the CABINET NAME row
As you can see, I changed only the last 3 lines.
I hope this will help you!
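Alternatively, here is a minimal sketch of the same idea that builds flat (era, cabinet) rows first and then writes them in one pass. It assumes the eras list and the imports from the question above, and I have not re-run it against the live page:

rows = []
for element in eras:
    era = element.a.get_text()
    names = element.find('div', attrs={'class': 'wpb_wrapper'}).get_text()
    for name in names.split('\n'):
        name = name.strip()
        if name:                       # skip the blank lines between cabinet names
            rows.append((era, name))   # one (era, cabinet name) pair per output row

with open("setkabfinal.csv", "w") as toWrite:
    writer = csv.writer(toWrite, delimiter=",")
    writer.writerow(["Era", "Cabinet name"])
    for era, name in rows:
        writer.writerow([era.encode("utf-8"), name.encode("utf-8")])

This way each cabinet name ends up on its own row next to its era, which is the layout you described.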

Related

Python3: split up list and save as file - how to?

I'm kinda new to Python, so thanks for your help!
I want to tell Python to take a big .csv list and split it up into many small lists of only two columns.
Take this .csv file:
Always use column "year", which is the first column.
Then always take the next column (for-loop?), starting with column 2, which is "Object1", then column 3, which is "Object2", and so on...
Save each list as .csv - now only containing two columns - and name it after the second column (e.g. "Object1").
So far I am up to this:
import csv

object = 0
f = open("/home/Data/data.csv")
csv_f = csv.reader(f, delimiter=';', quotechar='|')
writer = csv.writer(csv_f)
for row in csv_f:
    writer("[0],[object]")
    object += 1
f.close()
Your code is trying to read from and write to the same file at the same time (the writer is even built on top of the reader object), which may have unexpected results.
Think about your problem as a series of steps; one way to approach the problem is:
Open the big file
Read the first line of the file, which contains the column titles.
Go through the column titles (the first line of your big csv file), skipping the first one, then:
For each column title, create a new csv file, where the filename is the name of the column.
Take the value of the first column, plus the value of the column you are currently reading, and write it to the file.
Repeat till all column titles are read
Close the file
Close the big file.
Here is the same approach as above, taking advantage of Python's csv reading capabilities:
import csv

with open('big-file.csv') as f:
    reader = csv.reader(f, delimiter=';', quotechar='|')
    titles = next(reader)
    for index, column_name in enumerate(titles[1:]):
        with open('{}.csv'.format(column_name), 'w') as i:
            writer = csv.writer(i, delimiter=';', quotechar='|')
            for row in reader:
                writer.writerow((row[0], row[index+1]))
            f.seek(0)     # start from the top of the big file again
            next(reader)  # skip the header row
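One design note: the f.seek(0) approach re-reads the whole big file once per column. If the file comfortably fits in memory, a rough alternative sketch (using the same assumed delimiter and quotechar) is to read the rows once and reuse them:

import csv

with open('big-file.csv') as f:
    reader = csv.reader(f, delimiter=';', quotechar='|')
    titles = next(reader)
    rows = list(reader)          # read the data once into memory

for index, column_name in enumerate(titles[1:]):
    with open('{}.csv'.format(column_name), 'w') as out:
        writer = csv.writer(out, delimiter=';', quotechar='|')
        for row in rows:
            writer.writerow((row[0], row[index + 1]))

For very large files the seek-based version above avoids holding everything in memory, so either can be the right choice.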

reading in certain rows from a csv file using python

Say I have a csv file that looks like the following with the first column containing frequencies and the second column containing the power level (dBm).
Frequency | dBm
1 -11.43
2.3 -51.32
2.5 -12.11
2.8 -11.21
3.1 -73.22
3.2 -21.13
I only want to read in the data sets of this file that have a (dBm) value between -13 and -10. Therefore, in this example I only want the data sets (1, -11.43)(2.5, -12.11)(2.8, -11.21) to be read into my program variables x1 and y1. Could someone give me some help in how I could do this?
You can just use the csv library and check whether each row meets your criteria.
Something like this should work on your file:
import csv

with open('file.csv') as csvfile:
    reader = csv.reader(csvfile, delimiter=' ')
    reader.next()  # skip the header row
    for row in reader:
        a = [float(i) for i in row if i != '']
        if a[1] >= -13 and a[1] <= -10:
            print (a[0], a[1])
Edit: If you're working with table data I would suggest trying out Pandas; it's really helpful in these situations.
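For example, a rough pandas sketch of the same filter - the whitespace separator, the single header line, and the column names are assumptions based on the sample above:

import pandas as pd

# Sketch only: assumes whitespace-separated values and one header line to skip.
df = pd.read_csv('file.csv', delim_whitespace=True, skiprows=1,
                 names=['Frequency', 'dBm'])
subset = df[(df['dBm'] >= -13) & (df['dBm'] <= -10)]
x1 = subset['Frequency'].tolist()
y1 = subset['dBm'].tolist()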

How do I iterate a loop over several data frames in a list in python

I am very new to programming and am working with Python. For a work project I am trying to read several .csv files, convert them to data frames, concatenate some of the fields into one for a column header, and then append all of the dataframes into one big DataFrame. I have searched extensively in StackOverflow as well as in other resources but I have not been able to find an answer. Here is the code I have thus far along with some abbreviated output:
import pandas as pd
import glob
# Read a directory of files to a list
csvlist = []
for f in glob.glob("AssayCerts/*"):
    csvlist.append(f)
csvlist
['AssayCerts/CH09051590.csv', 'AssayCerts/CH09051591.csv', 'AssayCerts/CH14158806.csv', 'AssayCerts/CH14162453.csv', 'AssayCerts/CH14186004.csv']
# Read .csv files and convert to DataFrames
dflist = []
for csv in csvlist:
    df = pd.read_csv(filename, header = None, skiprows = 7)
    dflist.append(df)
dflist
[ 0 1 2 3 4 5 \
0 NaN Au-AA23 ME-ICP41 ME-ICP41 ME-ICP41 ME-ICP41
1 SAMPLE Au Ag Al As B
2 DESCRIPTION ppm ppm % ppm ppm
#concatenates the cells in the first three rows of the last dataframe; need to apply this to all of the dataframes.
for df in dflist:
    column_names = df.apply(lambda x: str(x[1]) + '-'+str(x[2])+' - '+str(x[0]), axis=0)
column_names
0 SAMPLE-DESCRIPTION - nan
1 Au-ppm - Au-AA23
2 Ag-ppm - ME-ICP41
3 Al-% - ME-ICP41
I am unable to apply the last operation across all of the DataFrames. It seems I can only get it to apply to the last DataFrame in my list. Once I get past this point I will have to append all of the DataFrames to form one large DataFrame.
As Andy Hayden mentions in his comment, the reason your loop only appears to work on the last DataFrame is that you just keep assigning the result of df.apply( ... ) to column_names, which gets written over each time. So at the end of the loop, column_names always contains the results from the last DataFrame in the list.
But you also have some other problems in your code. In the loop that begins for csv in csvlist:, you never actually reference csv - you just reference filename, which doesn't appear to be defined. And dflist just appears to have one DataFrame in it anyway.
As written in your problem, the code doesn't appear to work. I'd advise posting the real code that you're using, and only what's relevant to your problem (i.e. if building csvlist is working for you, then you don't need to show it to us).
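In the meantime, here is a rough sketch of what the two loops might look like once the filename/csv mix-up is fixed and the per-DataFrame results are collected instead of overwritten - the AssayCerts path, the skiprows=7 value, and the lambda are carried over from your question:

import glob
import pandas as pd

dflist = []
for csv_path in glob.glob("AssayCerts/*"):
    df = pd.read_csv(csv_path, header=None, skiprows=7)  # use the loop variable, not an undefined name
    dflist.append(df)

# Collect one set of column names per DataFrame instead of overwriting a single variable.
all_column_names = []
for df in dflist:
    column_names = df.apply(lambda x: str(x[1]) + '-' + str(x[2]) + ' - ' + str(x[0]), axis=0)
    all_column_names.append(column_names)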

Python 2.7 csv read, modify then write with dict?

Ok, I acknowledge that my question might duplicate this one, but I am going to ask anyway because, although the ultimate goals are similar, the Python code in use seems quite different.
I often have a list of students to create user accounts for. For this, I need to generate UserId's of the format
`Lastname[0:6].capitalize() + Firstname[0].capitalize()`
or the first six characters of the last name plus the first initial. I'd like to automate this with a Python script that reads firstname / lastname from one .csv file and writes firstname, lastname, userid to a different csv.
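For example, with a made-up student named Jane Albertson, the generated UserId would look like this:

>>> "Albertson"[0:6].capitalize() + "Jane"[0].capitalize()
'AlbertJ'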
Here is my code, which almost works but I am having difficulty with the write rows to .csv part at the end:
import csv

input_file = csv.DictReader(open("cl.csv"))
index = 0
fldnms = input_file.fieldnames
fldnms.append('UserName')
print fldnms
for row in input_file:
    index += 1
    UserID = (row["Last"][0:6].capitalize() + row["First"][0].capitalize())
    row['UserName'] = UserID
    print index, row["Last"], row["First"], row["UserName"]
with open("users.csv", 'wb') as out_csv:
    dw = csv.DictWriter(out_csv, delimiter=',', fieldnames=fldnms)
    dw.writerow(dict((fn, fn) for fn in fldnms))
    for row in input_file:
        dw.writerow(row)
Advice / thoughts welcomed.
Thanks,
Brian H.
I went back to this after a good night's sleep and, FWIW, here is the working version:
'''
Reads cl.csv (client list) as firstname lastname list with header
Writes users.csv as lastname firstname userid list w/o header
'''
import csv

INfile = open("..\cl_old.csv")
input_file = csv.DictReader(INfile, delimiter=' ')
fldnms = {}
#print type (fldnms)
fldnms = input_file.fieldnames
fldnms.append('UserName')
#print type (fldnms)
#print (fldnms["Last"],fldnms["First"],fldnms["UserName"])
index = 0
OUTfile = open("users.csv", 'wb')
dw = csv.DictWriter(OUTfile, delimiter=',', fieldnames=fldnms)
dw.writerow(dict((fn, fn) for fn in fldnms))
for row in input_file:
    index += 1
    UserID = (row["Last"][0:6].capitalize() + row["First"][0].capitalize())
    row['UserName'] = UserID
    print index, row["Last"], row["First"], row["UserName"]
    dw.writerow(row)
INfile.close()
OUTfile.close()
cl.csv contains a list of first, last name pairs. Results are stored in users.csv as names, userid.
I did this as an exercise in Python, as Excel will do this in a single instruction:
=CONCATENATE(LEFT(A2,6),LEFT(B2,1))
Hope this is of interest.
BJH
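For what it's worth, the likely reason the first attempt wrote only the header row is that the csv.DictReader had already been exhausted by the first for loop, so the second loop over input_file had nothing left to write. Another common fix, sketched below under the assumption of the same cl.csv layout as above, is to buffer the modified rows and write them afterwards:

import csv

with open("cl.csv") as in_csv:
    reader = csv.DictReader(in_csv)
    fldnms = reader.fieldnames + ['UserName']
    rows = []
    for row in reader:
        row['UserName'] = row["Last"][0:6].capitalize() + row["First"][0].capitalize()
        rows.append(row)  # buffer the modified rows while the reader is still open

with open("users.csv", 'wb') as out_csv:
    dw = csv.DictWriter(out_csv, delimiter=',', fieldnames=fldnms)
    dw.writerow(dict((fn, fn) for fn in fldnms))  # header row
    for row in rows:
        dw.writerow(row)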

Extracting columnar data correctly as it is in the file

Suppose I have a tabular column as below. Now I want to extract the column-wise data. I tried extracting the data by creating a list, and it extracts the first row correctly, but from the second row onwards there is a space, i.e. under CEN/4, so my code considers the zeroth column to be 5.000000E-01 from the second row and starts reading from there. How do I extract the data correctly column-wise? The output is scrambled.
0 1 25 CEN/4 -5.000000E-01 -3.607026E+04 -5.747796E+03 -8.912796E+02 -88.3178
5.000000E-01 3.607026E+04 5.747796E+03 8.912796E+02 1.6822
27 -5.000000E-01 -3.641444E+04 -5.783247E+03 -8.912796E+02 -88.3347
5.000000E-01 3.641444E+04 5.783247E+03 8.912796E+02 1.6653
28 -5.000000E-01 -3.641444E+04 -5.712346E+03 -8.912796E+02 -88.3386
5.000000E-01 3.641444E+04 5.712346E+03 8.912796E+02
My code is:
f1 = open('newdata1.txt', 'w')
L = []
for index, line in enumerate(open('Trial_1.txt', 'r')):
    #print index
    if index < 0:  #skip first 5 lines
        continue
    else:
        line = line.split()
        L.append('%s\t%s\t %s\n' % (line[0], line[1], line[2]))
f1.writelines(L)
f1.close()
My output looks like this:
0 1 CEN/4 -5.000000E-01 -5.120107E+04
5.000000E-01 5.120107E+04 1.028093E+04 5.979930E+03 8.1461
I want the columnar data as it is in the file. How do I do that? I am a beginner.
It's hard to tell from the way the input data is presented in your question, but I'm guessing your file is using tabs to separate columns. In any case, consider using the Python csv module with the relevant delimiter, like:
import csv

with open('input.csv') as f_in, open('newdata1', 'w') as f_out:
    reader = csv.reader(f_in, delimiter='\t')
    writer = csv.writer(f_out, delimiter='\t')
    for row in reader:
        writer.writerow(row)
See the Python csv module documentation for further details.
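If it turns out the columns are separated by runs of spaces in a fixed-width layout rather than by tabs (which is what the blank space under CEN/4 in the sample suggests), slicing each line by character position keeps the short continuation rows aligned. A sketch only; the column boundaries below are placeholders that would have to be adjusted to the real file:

# Placeholder column boundaries - adjust to the actual character positions in Trial_1.txt.
widths = [(0, 10), (10, 20), (20, 36)]

with open('Trial_1.txt') as f_in, open('newdata1.txt', 'w') as f_out:
    for line in f_in:
        fields = [line[start:end].strip() for start, end in widths]
        f_out.write('\t'.join(fields) + '\n')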