Attribute Error for strings created from lists - regex

I'm trying to create a data-scraping file for a class, and the data I have to scrape requires that I use while loops to get the right data into separate arrays-- i.e. for states, and SAT averages, etc.
However, once I set up the while loops, my regex that cleared the majority of the html tags from the data broke, and I am getting an error that reads:
Attribute Error: 'NoneType' object has no attribute 'groups'
My Code is:
import re, util
from BeautifulSoup import BeautifulStoneSoup
# create a comma-delineated file
delim = ", "
#base url for sat data
base = "http://www.usatoday.com/news/education/2007-08-28-sat-table_N.htm"
#get webpage object for site
soup = util.mysoupopen(base)
#get column headings
colCols = soup.findAll("td", {"class":"vaTextBold"})
#get data
dataCols = soup.findAll("td", {"class":"vaText"})
#append data to cols
for i in range(len(dataCols)):
colCols.append(dataCols[i])
#open a csv file to write the data to
fob=open("sat.csv", 'a')
#initiate the 5 arrays
states = []
participate = []
math = []
read = []
write = []
#split into 5 lists for each row
for i in range(len(colCols)):
if i%5 == 0:
states.append(colCols[i])
i=1
while i<=250:
participate.append(colCols[i])
i = i+5
i=2
while i<=250:
math.append(colCols[i])
i = i+5
i=3
while i<=250:
read.append(colCols[i])
i = i+5
i=4
while i<=250:
write.append(colCols[i])
i = i+5
#write data to the file
for i in range(len(states)):
states = str(states[i])
participate = str(participate[i])
math = str(math[i])
read = str(read[i])
write = str(write[i])
#regex to remove html from data scraped
#remove <td> tags
line = re.search(">(.*)<", states).groups()[0] + delim + re.search(">(.*)<", participate).groups()[0]+ delim + re.search(">(.*)<", math).groups()[0] + delim + re.search(">(.*)<", read).groups()[0] + delim + re.search(">(.*)<", write).groups()[0]
#append data point to the file
fob.write(line)
Any ideas regarding why this error suddenly appeared? The regex was working fine until I tried to split the data into different lists. I have already tried printing the various strings inside the final "for" loop to see if any of them were "None" for the first i value (0), but they were all the string that they were supposed to be.
Any help would be greatly appreciated!

It looks like the regex search is failing on (one of) the strings, so it returns None instead of a MatchObject.
Try the following instead of the very long #remove <td> tags line:
out_list = []
for item in (states, participate, math, read, write):
try:
out_list.append(re.search(">(.*)<", item).groups()[0])
except AttributeError:
print "Regex match failed on", item
sys.exit()
line = delim.join(out_list)
That way, you can find out where your regex is failing.
Also, I suggest you use .group(1) instead of .groups()[0]. The former is more explicit.

Related

Compress html as gzip instead of (pk)zip

I am trying to adapt a Python 2.7 routine for creating SQLite db dictionaries for an e-reader from TSV files.
The routine shown creates a (pk)zip file of an html string in lines 71-74 (# compress and save). I'd like the compressed file to be in gzip format instead. I'm really, really shaky on Python and I have been looking at examples of using gzip compression, trying to fit them into this routine (which I have already adapted in other ways for my own use from the original). I know that changes need to be made in line 9 (import) also, but am not sure which.
#!/usr/bin/python
# -*- coding: utf-8 -*-
# by Jiri Orsag, 2014
# https://github.com/geoRG77/nook-dictionary
# Many thanks to Homeless Ghost for his script 'createRenateNSTdictionaryfromnookdictionarydb.py'
# which was a great source of ideas for my work
import sqlite3, sys, zipfile, zlib, os
# config
DICTIONARY_FILE = 'test.txt' # input file (needed)
OUTPUT_DB = 'ox_en_GB.db' # output file
TEMP_DIRECTORY = './temp/' # will be deleted after successful run
STEP = 10000 # for print message
########################################################
print 'Converting dictionary...'
con = sqlite3.connect(OUTPUT_DB)
con.text_factory = str
cur = con.cursor()
index = 0
duplicateCount = 1
prevTerm = ''
try:
if not os.path.exists(TEMP_DIRECTORY):
os.makedirs(TEMP_DIRECTORY)
# open dict file
dict = open(DICTIONARY_FILE, 'r')
# delete previous tables
cur.execute('DROP TABLE IF EXISTS android_metadata')
cur.execute('DROP TABLE IF EXISTS tblWords')
# create tables
cur.execute('CREATE TABLE "android_metadata"("locale" TEXT)')
cur.execute('CREATE TABLE "tblWords"("_id" INTEGER PRIMARY KEY AUTOINCREMENT, "term" TEXT COLLATE nocase, "description" BLOB)')
# convert dict to sql
for line in dict:
index += 1
# uncomment next line to debug
# print '# current line = %d' % index
# split line
data = line.split('\t')
term = data.pop(0)
# create HTML
html = '<b>' + term + '</b>' + data[0].strip()
# check for duplicates
if term == prevTerm:
duplicateCount += 1
termEdited = term + '[' + str(duplicateCount) + ']'
else:
termEdited = term
duplicateCount = 1
# create html file
term_stripped = termEdited.replace('/', '')
temp_html = open(TEMP_DIRECTORY + term_stripped, 'wb')
temp_html.write(html)
temp_html.close()
# compress & save
zf = zipfile.ZipFile('_temp', mode='w')
zf.write(TEMP_DIRECTORY + term_stripped)
zf.close()
# read & insert compressed data
temp_compressed = open('_temp', 'rb')
compressed = temp_compressed.read()
cur.execute('INSERT INTO tblWords (_id, term, description) VALUES(?, ?, ?)', (index, termEdited, sqlite3.Binary(compressed)))
# if duplicate then update previous row with [1]
if duplicateCount == 2:
cur.execute('UPDATE tblWords SET term="' + str(term + "[1]") + '" WHERE _id=' + str(index - 1) + '')
os.remove(TEMP_DIRECTORY + term_stripped)
prevTerm = term
# print _id, term, description
if ((index % STEP) == 0):
print '# current line = %d' % index
#if index == 100:
# break;
# create term_index
cur.execute('CREATE INDEX term_index on tblWords (term ASC)')
cur.execute('SELECT * FROM tblWords order by _id LIMIT 10')
dict.close
# os.remove('_temp')
# os.rmdir(TEMP_DIRECTORY)
except Exception, e:
raise
else:
pass
finally:
pass
print 'Done. ' + str(index) + ' lines converted.'
Any hints or good references?
import gzip and replace zipfile.ZipFile with gzip.open. Write to the gzip file what you wrote to the temporary file, which is no longer necessary. E.g.
gz = gzip.open('_temp', mode='wb')
gz.write(html)
gz.close()
and get rid of the temp_html lines and the os.remove() of it. I recommend the b in the mode, which was missing for the zip file for some reason.

Python - creating a dictionary from large text file where the key matches regex pattern

My question: how do I create a dictionary from a list by assigning dictionary keys based on a regex pattern match ('^--L-[0-9]{8}'), and assigning the values by using all lines between each key.
Example excerpt from the raw file:
SQL> --L-93752133
SQL> --SELECT table_name, tablespace_name from dba_tables where upper(table_name) like &tablename_from_developer;
SQL>
SQL> --L-52852243
SQL>
SQL> SELECT log_mode FROM v$database;
LOG_MODE
------------
NOARCHIVELOG
SQL>
SQL> archive log list
Database log mode No Archive Mode
Automatic archival Disabled
Archive destination USE_DB_RECOVERY_FILE_DEST
Oldest online log sequence 3
Current log sequence 5
SQL>
SQL> --L-42127143
SQL>
SQL> SELECT t.name "TSName", e.encryptionalg "Algorithm", d.file_name "File Name"
2 FROM v$tablespace t
3 , v$encrypted_tablespaces e
4 , dba_data_files d
5 WHERE t.ts# = e.ts#
6 AND t.name = d.tablespace_name;
no rows selected
Some additional detail: The raw file can be large (at least 80K+ lines, but often much larger) and I need to preserve the original spacing so the output is still easy to read. Here's how I'm reading the file in and removing "SQL>" from the beginning of each line:
with open(rawFile, 'r') as inFile:
content = inFile.read()
rawList = content.splitlines()
for line in rawList:
cleanLine = re.sub('^SQL> ', '', line)
Finding the dictionary keys I'm looking for is easy:
pattern = re.compile(r'^--L-[0-9]{8}')
if pattern.search(cleanLine) is not None:
itemID = pattern.search(cleanLine)
print(itemID.group(0))
But how do I assign all lines between each key as the value belonging to the most recent key preceding them? I've been playing around with new lists, tuples, and dictionaries but everything I do is returning garbage. The goal is to have the data and keys linked to each other so that I can return them as needed later in my script.
I spent a while searching for a similar question, but in most other cases the source file was already in a dictionary-like format so creating the new dictionary was a less complicated problem. Maybe a dictionary or tuple isn't the right answer, but any help would be appreciated! Thanks!
In general, you should question why you would read the entire file, split the lines into a list, and then iterate over the list. This is a Python anti-pattern.
For line oriented text files, just do:
with open(fn) as f:
for line in f:
# process a line
It sounds, however, that you have multi-line block oriented patterns. If so, with smaller files, read the entire file into a single string and use a regex on that. Then you would use group 1 and group 2 as the key, value in your dict:
pat=re.compile(pattern, flags)
with open(file_name) as f:
di={m.group(1):m.group(2) for m in pat.finditer(f.read())}
With a larger file, use a mmap:
import re, mmap
pat=re.compile(pattern, flags)
with open(file_name, 'r+') as f:
mm = mmap.mmap(f.fileno(), 0)
for i, m in enumerate(pat.finditer(mm)):
# process each block accordingly...
As far as the regex, I am a little unclear on what you are trying to capture or not. I think this regex is what I am understanding you want:
^SQL> (--L-[0-9]{8})(.*?)(?=SQL> --L-[0-9]{8}|\Z)
Demo
In either case, running that regex with the example string yields:
>>> pat=re.compile(r'^SQL> (--L-[0-9]{8})\s*(.*?)\s*(?=SQL> --L-[0-9]{8}|\Z)', re.S | re.M)
>>> with open(file_name) as f:
... di={m.group(1):m.group(2) for m in pat.finditer(f.read())}
...
>>> di
{'--L-52852243': 'SQL> \nSQL> SELECT log_mode FROM v;\n\n LOG_MODE\n ------------\n NOARCHIVELOG\n\nSQL> \nSQL> archive log list\n Database log mode No Archive Mode\n Automatic archival Disabled\n Archive destination USE_DB_RECOVERY_FILE_DEST\n Oldest online log sequence 3\n Current log sequence 5\nSQL>',
'--L-93752133': 'SQL> --SELECT table_name, tablespace_name from dba_tables where upper(table_name) like &tablename_from_developer;\nSQL>',
'--L-42127143': 'SQL> \nSQL> SELECT t.name TSName, e.encryptionalg Algorithm, d.file_name File Name\n 2 FROM v t\n 3 , v e\n 4 , dba_data_files d\n 5 WHERE t.ts# = e.ts#\n 6 AND t.name = d.tablespace_name;\n\n no rows selected'}
Something like this?
with open(rawFile, 'r') as inFile:
content = inFile.read()
rawList = content.splitlines()
keyed_dict = {}
in_between_lines = ""
last_key = 0
for line in rawList:
cleanLine = re.sub('^SQL> ', '', line)
pattern = re.compile(r'^--L-[0-9]{8}')
if pattern.search(cleanLine) is not None:
itemID = pattern.search(cleanLine)
if last_key: keyed_dict[last_key] = in_between_lines
last_key = itemID.group(0)
in_between_lines = ""
else:
in_between_lines += cleanLine

Python - webscraping; dictionary data structure

I need to scrape this website (http://setkab.go.id/profil-kabinet/#) and produce an Excel file that has headers "Cabinet names" in column 1 and "Era" in column 2. That means each Cabinet name (e.g. Kabinet Presidensil, Kabinet Sjahrir I) should have its own row - alongside its respective era (e.g. Era Revolusi Fisik, Era Republik Indonesia Serikat).
This is the closest I've gotten:
import requests
from bs4 import BeautifulSoup
response = requests.get('http://setkab.go.id/profil-kabinet/#')
soup = BeautifulSoup(response.text, 'html.parser')
eras = soup.find_all('div', attrs={'class':"wpb_accordion_section group"})
setkab = {}
for element in eras:
setkab[element.a.get_text()] = {}
for element in eras:
cabname = element.find('div',attrs={'class':'wpb_wrapper'}).get_text()
setkab[element.a.get_text()]['cbnm'] = cabname
for item in setkab.keys():
print item + setkab[item]['cbnm']
import os, csv
os.chdir("/Users/mxcodes/Code")
with open("setkabfinal.csv", "w") as toWrite:
writer = csv.writer(toWrite, delimiter=",")
writer.writerow(["Era", "Cabinet name"])
for a in setkab.keys():
writer.writerow([a.encode("utf-8"), setkab[a]["cbnm"]])
However, this creates an Excel file with the headers "Era" and "Cabinet names" in column 1 and 2, respectively. It fails to put each Cabinet name in a separate row. For example, it has 'Era Revolusi Fisik' in column 1 and lists all the cabinets together in column 2.
My guess is that I need to switch the key-value pairs somehow so that each Cabinet becomes a key and its era becomes its value - because currently it's the other way around. But I've tried and failed to do so. Any help? Thank you!
From what I can see, the cabinets[a]["cbnm"] variable you use for writing is just a long Unicode so when you do writer.writerow([a.encode("utf-8"), cabinets[a]["cbnm"]]) what actually happens is that you write the era at the first column and the whole Unicode in the single cell in the next column (even if you have \n in your string it does not prevent it from being writed in a single cell (csv actually think that you want the unicode to be in ONLY one cell so it puts " before and after the cabinets[a]["cbnm"] value to be sure it will actually be in one cell)), what you should do to write every cabinet value in another row is to use the writerow method separately for each desired row.
for example this code worked fine for me:
cabinets = setkab
with open("cabinets.csv", "w") as toWrite:
writer = csv.writer(toWrite, delimiter=",")
writer.writerow(["Era", "Cabinet name"])
for a in setkab.keys():
writer.writerow([a.encode("utf-8")]) #write the era column
cabinets_list = [i for i in cabinets[a]["cbnm"].split('\n') if i != ''] #get all the values that are separated by newline chars (if they aren't empty strings)
for i in cabinets_list: writer.writerow([a.encode("utf-8"),i]) #write every value separately in the CABINET NAME row
as you can see I changed only the last 3 lines.
I hope this will help you!

How to remove unwanted items from a parse file

from googlefinance import getQuotes
import json
import time as t
import re
List = ["A","AA","AAB"]
Time=t.localtime() # Sets variable Time to retrieve date/time info
Date2= ('%d-%d-%d %dh:%dm:%dsec'%(Time[0],Time[1],Time[2],Time[3],Time[4],Time[5])) #formats time stamp
while True:
for i in List:
try: #allows elements to be called and if an error does the next step
Data = json.dumps(getQuotes(i.lower()),indent=1) #retrieves Data from google finance
regex = ('"LastTradePrice": "(.+?)",') #sets parse
pattern = re.compile(regex) #compiles parse
price = re.findall(pattern,Data) #retrieves parse
print(i)
print(price)
except: #sets Error coding
Error = (i + ' Failed to load on: ' + Date2)
print (Error)
It will display the quote as: ['(number)'].
I would like it to only display the number, which means removing the brackets and quotes.
Any help would be great.
Changing:
print(price)
into:
print(price[0])
prints this:
A
42.14
AA
10.13
AAB
0.110
Try to use type() function to know the datatype, in your case type(price)
it the data type is list use print(price[0])
you will get the output (number), for brecess you need to check google data and regex.

python replace string function throws asterix wildcard error

When i use * i receive the error
raise error, v # invalid expression
error: nothing to repeat
other wildcard characters such as ^ work fine.
the line of code:
df.columns = df.columns.str.replace('*agriculture', 'agri')
am using pandas and python
edit:
when I try using / to escape, the wildcard does not work as i intend
In[44]df = pd.DataFrame(columns=['agriculture', 'dfad agriculture df'])
In[45]df
Out[45]:
Empty DataFrame
Columns: [agriculture, dfad agriculture df]
Index: []
in[46]df.columns.str.replace('/*agriculture*','agri')
Out[46]: Index([u'agri', u'dfad agri df'], dtype='object')
I thought the wildcard should output Index([u'agri', u'agri'], dtype='object)
edit:
I am currently using hierarchical columns and would like to only replace agri for that specific level (level = 2).
original:
df.columns[0] = ('grand total', '2005', 'agriculture')
df.columns[1] = ('grand total', '2005', 'other')
desired:
df.columns[0] = ('grand total', '2005', 'agri')
df.columns[1] = ('grand total', '2005', 'other')
I'm looking at this link right now: Changing columns names in Pandas with hierarchical columns
and that author says it will get easier at 0.15.0 so I am hoping there are more recent updated solutions
You need to the asterisk * at the end in order to match the string 0 or more times, see the docs:
In [287]:
df = pd.DataFrame(columns=['agriculture'])
df
Out[287]:
Empty DataFrame
Columns: [agriculture]
Index: []
In [289]:
df.columns.str.replace('agriculture*', 'agri')
Out[289]:
Index(['agri'], dtype='object')
EDIT
Based on your new and actual requirements, you can use str.contains to find matches and then use this to build a dict to map the old against new names and then call rename:
In [307]:
matching_cols = df.columns[df.columns.str.contains('agriculture')]
df.rename(columns = dict(zip(matching_cols, ['agri'] * len(matching_cols))))
Out[307]:
Empty DataFrame
Columns: [agri, agri]
Index: []