Using Python regex: finding and accessing groups

I'm using Python to (1) access an xml file, (2) search it for nodes containing regex1, (3) search the nodes found for regex2 (which has a couple capture groups), then (4) do things with the groups.
I've got steps 1 and 2 working. But I'm stuck on 3 and 4. Here's an example of my code:
from bs4 import BeautifulSoup
from urllib import urlopen
import re
from lxml import etree
url='https://www.gpo.gov/fdsys/bulkdata/BILLS/113/1/hr/BILLS-113hr2146ih.xml'
soup = BeautifulSoup(urlopen(url).read(), 'xml')
pattern = r'(am)(ed)'
regex = re.compile(pattern, re.IGNORECASE)
x = soup.find_all(text=re.compile("amended"))
count = 0
for each in x:
    # I thought this would loop through x and search each result for
    # the regex, then print the 2 groups like this: am--ed
    print (regex.finditer(x[count]))
    print (each.group(1), '--', each.group(2))
    count = count + 1
But instead it prints this:
<callable-iterator object at 0x97efd0c>
Traceback (most recent call last):
  File "/media/Windows/Documents and Settings/Andy/My Documents/Misc/Computer/Python/NLTK-Python Learning/test.py", line 17, in <module>
    print (each.group(1), '--', each.group(2))
  File "/usr/lib/python2.7/dist-packages/bs4/element.py", line 615, in __getattr__
    self.__class__.__name__, attr))
AttributeError: 'NavigableString' object has no attribute 'group'
I've been playing with this for a week and have read everything relevant I can find online. But I'm obviously not understanding something. Any suggestions? - Thanks

Currently you aren't using your regex to search through each result of x. Try something like
for each in x:
    for match in regex.finditer(each):
        print (match.group(1), '--', match.group(2))
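Putting it together, here is a minimal runnable sketch of the whole loop, assuming the same URL and pattern as in your question (Python 2 imports, matching your code):

from bs4 import BeautifulSoup
from urllib import urlopen  # Python 2; on Python 3 use urllib.request.urlopen
import re

url = 'https://www.gpo.gov/fdsys/bulkdata/BILLS/113/1/hr/BILLS-113hr2146ih.xml'
soup = BeautifulSoup(urlopen(url).read(), 'xml')

regex = re.compile(r'(am)(ed)', re.IGNORECASE)
# find_all returns NavigableString objects, not match objects,
# so each result still has to be run through the compiled regex
for each in soup.find_all(text=re.compile("amended")):
    for match in regex.finditer(each):
        print (match.group(1), '--', match.group(2))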

Related

Regex & BeautifulSoup - TypeError: expected string or bytes-like object

My code is running into an unexpected error. I tried tweaking it to use 'u' instead of 'r', but I still get the same error. I tried other solutions from Stack Overflow, but they didn't go anywhere. Any suggestions?
# use urllib and BeautifulSoup to scrape a table
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import pandas as pd
url = 'https://www.example.com/profiles'
page = urlopen(url).read()
soup = BeautifulSoup(page, 'lxml')
#print(soup)
reEngName = re.compile(r'\[\*\*.+\*\*\]')
reKorName = re.compile(r'\([^\/h]*\)')
reProfile = re.compile(r'\|.+')
for line in re.findall(reEngName, soup):
    print(line)
Error message:
Traceback (most recent call last):
  File "ckurllib.py", line 18, in <module>
    for line in re.findall(reEngName, soup):
  File "C:\Users\Sammy\Anaconda3\lib\re.py", line 222, in findall
    return _compile(pattern, flags).findall(string)
TypeError: expected string or bytes-like object
Regex works with strings. If you want to search the whole raw text of the page, give the page source (a string) to the regex. BeautifulSoup is a parser: it internally splits the HTML into its syntactic components, organized into a tree, and you can iterate through them. For example, to iterate over all <a> tags:

def doThings(a):
    # do whatever you need with a matching link
    if a['href'].startswith("http://www.domain.net"):
        ...

soup = BeautifulSoup(urlopen(url).read(), 'lxml')
for a in soup('a'):
    out = doThings(a)

Naturally, at a later stage you can use regexes to check for matches in those strings.
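If you take the first route, here is a short sketch reusing your variable names: run the regex over a string, such as the text BeautifulSoup extracts or the decoded raw markup, rather than over the soup object itself.

text = soup.get_text()  # or page.decode('utf-8') for the raw markup
for line in reEngName.findall(text):
    print(line)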

Getting ParseError when parsing using xml.etree.ElementTree

I am trying to extract the <comment> tags from the XML (using xml.etree.ElementTree), find each comment's count number, and add all of the numbers together. I am reading the file from a URL using the urllib package.
sample data: http://python-data.dr-chuck.net/comments_42.xml
But currently I am just trying to print the name and count.
import urllib
import xml.etree.ElementTree as ET
serviceurl = 'http://python-data.dr-chuck.net/comments_42.xml'
address = raw_input("Enter location: ")
url = serviceurl + urllib.urlencode({'sensor': 'false', 'address': address})
print ("Retrieving: ", url)
link = urllib.urlopen(url)
data = link.read()
print("Retrieved ", len(data), "characters")
tree = ET.fromstring(data)
tags = tree.findall('.//comment')
for tag in tags:
    Name = ''
    count = ''
    Name = tree.find('commentinfo').find('comments').find('comment').find('name').text
    count = tree.find('comments').find('comments').find('comment').find('count').number
    print Name, count
Unfortunately, I am not even able to parse the XML file, because I am getting the following error:
Traceback (most recent call last):
  File "ch13_parseXML_assignment.py", line 14, in <module>
    tree = ET.fromstring(data)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1300, in XML
    parser.feed(text)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1642, in feed
    self._raiseerror(v)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1506, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: syntax error: line 1, column 49
I have read previously that in a similar situation the parser may not be accepting the XML file. Anticipating this, I put a try/except around tree = ET.fromstring(data) and was able to get past this line, but later it throws an error saying the tree variable is not defined. This defeats the purpose of the output I am expecting.
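To make that concrete, here is a minimal sketch of the pattern I mean (same names as above): if the except branch only passes, tree is never bound and later code raises a NameError, so at minimum the name has to be assigned before it is used again.

try:
    tree = ET.fromstring(data)
except ET.ParseError:
    tree = None  # without this assignment, later references to tree raise NameError

if tree is not None:
    tags = tree.findall('.//comment')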
Can somebody please point me in a direction that helps me?

Spacy is_stop function (bug?)

I am using the code below to check whether a word is a stop word or not. As you can see, if the try block fails, the is_stop function throws an error.
import spacy
nlp = spacy.load('en')
try:
    print 0/0  # raise an exception
except:
    print nlp.is_stop('is')
I get the below error:
      5 print 0/0
      6 except:
----> 7 print spacy.load('en').is_stop('is')

AttributeError: 'English' object has no attribute 'is_stop'
You need to process some text by 'calling' the nlp object as a function as explained here. You can then test for stop words on each token of the parsed sentence.
For example:
>>> import spacy
>>> nlp = spacy.load('en')
>>> sentence = nlp(u'this is a sample sentence')
>>> sentence[1].is_stop
True
In case you want to test for stop words directly from the English vocabulary, use the following:
>>> nlp.vocab[u'is'].is_stop
True

How to solve AttributeError in python active_directory?

Running the script below works for 60% of the entries from the MasterGroupList, but then it suddenly fails with the error below. Although my questions may seem poor, you guys have been able to help me before. Any idea how I can avoid getting this error, or what is throwing off the script? The MasterGroupList looks like:
Groups Pulled from AD
SET00 POWERUSER
SET00 USERS
SEF00 CREATORS
SEF00 USERS
...another 300 entries...
Error:
Traceback (most recent call last):
  File "C:\Users\ks185278\OneDrive - NCR Corporation\Active Directory Access Script\test.py", line 44, in <module>
    print group.member
  File "C:\Python27\lib\site-packages\active_directory.py", line 805, in __getattr__
    raise AttributeError
AttributeError
Code:
from active_directory import *
import os
file = open("C:\Users\NAME\Active Directory Access Script\MasterGroupList.txt", "r")
fileAsList = file.readlines()
indexOfTitle = fileAsList.index("Groups Pulled from AD\n")
i = indexOfTitle + 1
while i <= len(fileAsList):
    fileLocation = 'C:\\AD Access\\%s\\%s.txt' % (fileAsList[i][:5], fileAsList[i][:fileAsList[i].find("\n")])
    # Creates the dir if it does not exist already
    if not os.path.isdir(os.path.dirname(fileLocation)):
        os.makedirs(os.path.dirname(fileLocation))
    fileGroup = open(fileLocation, "w+")
    # writes group members to the open file
    group = find_group(fileAsList[i][:fileAsList[i].find("\n")])
    print group.member
    for group_member in group.member:  # this is line 44
        fileGroup.write(group_member.cn + "\n")
    fileGroup.close()
    i += 1
Disclaimer: I don't know python, but I know Active Directory fairly well.
If it's failing on this:
for group_member in group.member:
It could possibly mean that the group has no members.
Depending on how Python handles this, it could also mean that the group has only one member and group.member is a plain string rather than an array.
What does print group.member show?
The source code of active_directory.py is here: https://github.com/tjguk/active_directory/blob/master/active_directory.py
These are the relevant lines:
if name not in self._delegate_map:
    try:
        attr = getattr(self.com_object, name)
    except AttributeError:
        try:
            attr = self.com_object.Get(name)
        except:
            raise AttributeError
So it looks like it just can't find the attribute you're looking up, which in this case looks like the 'member' attribute.
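If you want to guard against both possibilities (no 'member' attribute at all, or a single member coming back as a bare string), a hedged sketch along these lines might work; the names follow your code, and this is only an assumption about how active_directory behaves:

try:
    members = group.member
except AttributeError:
    members = []  # the group exposes no 'member' attribute, e.g. an empty group

# A lone member might come back as a plain string instead of a list,
# so wrap it to keep the loop below uniform (Python 2 string check).
if isinstance(members, basestring):
    members = [members]

for group_member in members:
    # fall back to the raw value when the item has no .cn attribute
    fileGroup.write(getattr(group_member, 'cn', group_member) + "\n")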

removing double quotes and brackets from csv in python

I am trying to remove quotes and brackets from a CSV in Python. I tried the following code, but it doesn't give a proper CSV. The code is:
import json
import urllib2
import re
import os
from BeautifulSoup import BeautifulSoup
import csv
u = urllib2.urlopen("http://timesofindia.indiatimes.com/")
content = u.read()
u.close()
soup2 = BeautifulSoup(content)
blog_posts = []
for e in soup2.findAll("a", attrs={'pg': re.compile('^Head')}):
    for b in soup2.findAll("div", attrs={'style': re.compile('^color:#ffffff;font-size:12px;font-family:arial;padding-top:3px;text-align:center;')}):
        blog_posts.append(("The Times Of India", e.text, b.text))
print blog_posts
out_file = os.path.join('resources', 'ch05-webpages','newspapers','time1.csv')
f = open(out_file, 'wb')
wr = csv.writer(f, quoting=csv.QUOTE_MINIMAL)
#f.write(json.dumps(blog_posts, indent=1))
wr.writerow(blog_posts)
f.close()
print 'Wrote output file to %s' % (f.name, )
The CSV looks like:
"('The Times Of India', u'Missing jet: Air search expands to remote south Indian Ocean', u'Fri, Mar 21, 2014 | Updated 11.53AM IST')",
But I want the CSV to look like this:
The Times Of India,u'Missing jet: Air search expands to remote south Indian Ocean, u'Fri, Mar 21, 2014 | Updated 11.53AM IST
So what can I do to get this type of CSV?
Writer.writerow() expects a sequence containing strings or numbers. You are passing a sequence of tuples. Use Writer.writerows() instead.
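For example, a minimal sketch reusing your variables; writerows() writes one CSV row per tuple, so each field is only quoted when it actually needs to be:

wr = csv.writer(f, quoting=csv.QUOTE_MINIMAL)
wr.writerows(blog_posts)  # one row per (source, headline, date) tuple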