Parsing XML in python for numbers - python-2.7

I am new to Python as well as XML. I am trying to parse an XML file, find the values, and compute the sum of those values. I have included the code as well as the data below.
import xml.etree.ElementTree as ET
data='''
<place>
<note>Test data</note>
<hospitals>
<doctor>
<name>John</name>
<count>97</count>
</doctor>
<doctor>
<name>Sam</name>
<count>97</count>
</doctor>
<doctor>
<name>Luke</name>
<count>90</count>
</doctor>
<doctor>
<name>Mark</name>
<count>90</count>
</doctor>
</hospitals>
</place> '''
tree = ET.fromstring(data)
for lines in tree.findall('place/hospitals/doctor'):
    print lines.get('count'), lines.text
When I execute the above code, I get no output.
Then I changed the code to:
tree = ET.fromstring(data)
print 'count:', tree.find('count').text
and the output is:
Traceback (most recent call last):
File "test2.py", line 26, in <module>
print 'count:',tree.find('count').text
AttributeError: 'NoneType' object has no attribute 'text'
Any help is appreciated guys.
Thank you

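Element.findall() with a plain tag name finds only elements that are direct children of the current element. ET.fromstring() returns the root element, which here is <place> itself, so the path 'place/hospitals/doctor' looks for a <place> child inside <place> and matches nothing. Likewise, tree.find('count') returns None because <count> is not a direct child of <place>. Note also that lines.get('count') reads an attribute named count, not a child element. The ElementTree documentation explains the supported path syntax and includes code examples.
For now, try this:
for line in tree.findall('./hospitals/doctor/count'):
    print line.text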
The above code just prints the counts. You will have to write the code to sum them up.
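As a minimal sketch of the summing step, assuming every <count> holds a plain integer as in the sample data, the text of each element can be converted with int() and passed to sum(). The data string is abbreviated to the parts that matter here.

```python
import xml.etree.ElementTree as ET

# Same structure as the sample data in the question.
data = '''<place>
<note>Test data</note>
<hospitals>
<doctor><name>John</name><count>97</count></doctor>
<doctor><name>Sam</name><count>97</count></doctor>
<doctor><name>Luke</name><count>90</count></doctor>
<doctor><name>Mark</name><count>90</count></doctor>
</hospitals>
</place>'''

# fromstring() returns the root element (<place>), so paths start below it.
root = ET.fromstring(data)

# Element text is a string; convert it before summing.
counts = [int(c.text) for c in root.findall('./hospitals/doctor/count')]
total = sum(counts)

print(counts)
print(total)
```

If some counts could be missing or non-numeric, the int() conversion would need a guard; that case is not covered here.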

Related

Bypass file as parameter with a string for lxml iterparse function using Python 2.7

I am iterating over an XML tree using the lxml.etree function iterparse().
This works OK with an input file:
xml_source = "formatted_html_diff.xml"
context = ET.iterparse(xml_source, events=("start",))
event, root = context.next()
However, I would like to use a string containing the same information in the file.
I tried using
context = ET.iterparse(StringIO(result), events=("start",))
But this causes the following error:
Traceback (most recent call last):
File "c:/Users/pag/Documents/12_raw_handle/remove_from_xhtmlv02.py", line 96, in <module>
event, root = context.next()
File "src\lxml\iterparse.pxi", line 209, in lxml.etree.iterparse.__next__
TypeError: reading file objects must return bytes objects
Does anyone know how I could solve this error?
Thanks in advance.

Use BytesIO instead of StringIO: as the error message says, lxml's iterparse() needs a file object whose read() returns bytes. The following code works with both Python 2.7 and Python 3:
from lxml import etree
from io import BytesIO
xml = """
<root>
<a/>
<b/>
</root>"""
context = etree.iterparse(BytesIO(xml.encode("UTF-8")), events=("start",))
print(next(context))
print(next(context))
print(next(context))
Output:
('start', <Element root at 0x315dc10>)
('start', <Element a at 0x315dbc0>)
('start', <Element b at 0x315db98>)

Regex & BeautifulSoup - TypeError: expected string or bytes-like object

My code is running into an unexpected error. I tried using a 'u' prefix instead of 'r' on the pattern strings, but I still get the same error. I tried other solutions from Stack Overflow, but they didn't get me anywhere. Any suggestions?
# use urllib and BeautifulSoup to scrape a table
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import pandas as pd
url = 'https://www.example.com/profiles'
page = urlopen(url).read()
soup = BeautifulSoup(page, 'lxml')
#print(soup)
reEngName = re.compile(r'\[\*\*.+\*\*\]')
reKorName = re.compile(r'\([^\/h]*\)')
reProfile = re.compile(r'\|.+')
for line in re.findall(reEngName, soup):
    print(line)
Error message:
Traceback (most recent call last):
File "ckurllib.py", line 18, in <module>
for line in re.findall(reEngName, soup):
File "C:\Users\Sammy\Anaconda3\lib\re.py", line 222, in findall
return _compile(pattern, flags).findall(string)
TypeError: expected string or bytes-like object

Regex works with strings. If you want to search the whole raw text of the page, pass page (decoded to a string) to the regex, not soup. BeautifulSoup is a parser: it internally splits the HTML into its syntactic components and organizes them into a tree that you can iterate through. For example, to iterate over all <a> tags:
soup = BeautifulSoup(urlopen(url).read(), 'lxml')
for a in soup('a'):
    out = doThings(a)
where doThings(a) might contain a check such as:
def doThings(a):
    # hypothetical helper: test whether the link points at a given site
    return a.get('href', '').startswith("http://www.domain.net")
Naturally, at a later stage you can use regexes to check for matches in the strings you extract.
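A minimal sketch of running the regex on a string, using only the standard library. The bytes literal stands in for what urlopen(url).read() returns in Python 3, and its markup is invented sample data, not the real page. With a soup object, str(soup) or soup.get_text() would likewise give the regex a string to work on.

```python
import re

# Stand-in for urlopen(url).read(); the content is made up for illustration.
page = (b"<p>[**Alice Kim**] (placeholder) | singer</p>\n"
        b"<p>[**Bob Lee**] (placeholder) | actor</p>\n")

reEngName = re.compile(r'\[\*\*.+\*\*\]')

# A str pattern needs str input, so decode the bytes first.
text = page.decode('utf-8')

names = reEngName.findall(text)
for line in names:
    print(line)
```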

Getting ParseError when parsing using xml.etree.ElementTree

I am trying to extract the <comment> tag (using xml.etree.ElementTree) from the XML and find the comment count number and add all of the numbers. I am reading the file via a URL using urllib package.
sample data: http://python-data.dr-chuck.net/comments_42.xml
For now, though, I am only trying to print the name and count.
import urllib
import xml.etree.ElementTree as ET
serviceurl = 'http://python-data.dr-chuck.net/comments_42.xml'
address = raw_input("Enter location: ")
url = serviceurl + urllib.urlencode({'sensor': 'false', 'address': address})
print ("Retrieving: ", url)
link = urllib.urlopen(url)
data = link.read()
print("Retrieved ", len(data), "characters")
tree = ET.fromstring(data)
tags = tree.findall('.//comment')
for tag in tags:
    Name = ''
    count = ''
    Name = tree.find('commentinfo').find('comments').find('comment').find('name').text
    count = tree.find('comments').find('comments').find('comment').find('count').number
    print Name, count
Unfortunately, I am not even able to parse the XML file in Python, because I am getting the following error:
Traceback (most recent call last):
File "ch13_parseXML_assignment.py", line 14, in <module>
tree = ET.fromstring(data)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1300, in XML
parser.feed(text)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1642, in feed
self._raiseerror(v)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1506, in _raiseerror
raise err
xml.etree.ElementTree.ParseError: syntax error: line 1, column 49
I have read that in a similar situation the parser may not be accepting the XML file. Anticipating this, I put a try/except around tree = ET.fromstring(data) and was able to get past that line, but later the code throws an error saying the tree variable is not defined. This defeats the purpose of the output I am expecting.
Can somebody please point me in the right direction?
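One possible direction, offered as a sketch and not a confirmed diagnosis: a syntax error at line 1 often means the downloaded data is not XML at all. Note that serviceurl + urllib.urlencode(...) appends the query string without a '?' separator, so the request may not reach the file you expect; printing the first characters of data would show what actually came back. Wrapping ET.fromstring in a try/except that ignores the failure only hides it, which is why tree ends up undefined. The example below uses inline sample data; its layout (commentinfo/comments/comment with name and count children) is an assumption based on the paths in the question's code, and parse_comments is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumed sample data; the real file's structure may differ.
data = '''<commentinfo>
  <comments>
    <comment><name>Ann</name><count>10</count></comment>
    <comment><name>Bob</name><count>5</count></comment>
  </comments>
</commentinfo>'''


def parse_comments(xml_text):
    """Return a list of (name, count) pairs, or None if xml_text is not XML."""
    try:
        tree = ET.fromstring(xml_text)
    except ET.ParseError as err:
        # Report the problem instead of continuing without a tree.
        print("Not valid XML: %s" % err)
        return None
    # Look up name and count relative to each comment, not the root.
    return [(c.find('name').text, int(c.find('count').text))
            for c in tree.findall('.//comment')]


pairs = parse_comments(data)
total = sum(count for _, count in pairs)
print(pairs)
print(total)

# Non-XML input (for example an HTML error page) is reported, not swallowed.
bad = parse_comments('<html><body>Not Found')
```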

Is outputMode Still Supported In alchemy_language.entities

I have this inherited code which in Python 2.7 successfully returns results in xml that are then parsed by ElementTree.
result = alchemyObj.TextGetRankedNamedEntities(text)
root = ET.fromstring(result)
I am updating the program to Python 3.5 and am attempting the following so that I don't need to modify the XML parsing of the results:
result = alchemy_language.entities(outputMode='xml', text='text', max_items='10'),
root = ET.fromstring(result)
Per http://www.ibm.com/watson/developercloud/alchemy-language/api/v1/#entities, outputMode allows a choice between JSON (the default) and XML. However, I get this error:
Traceback (most recent call last):
File "bin/nerv35.py", line 93, in <module>
main()
File "bin/nerv35.py", line 55, in main
result = alchemy_language.entities(outputMode='xml', text='text', max_items='10'),
TypeError: entities() got an unexpected keyword argument 'outputMode'
Does outputMode actually still exist? If so, what is wrong with the entities parameters?

The watson-developer-cloud package does not appear to offer this option for entities(). The parameters it accepts are:
html
text
url
disambiguate
linked_data
coreference
quotations
sentiment
show_source_text
max_items
language
model
You can try accessing the API directly by using requests. For example:
import requests
alchemyApiKey = 'YOUR_API_KEY'
url = 'https://gateway-a.watsonplatform.net/calls/text/TextGetRankedNamedEntities'
payload = {
    'apikey': alchemyApiKey,
    'outputMode': 'xml',
    'text': 'This is an example text. IBM Corp'
}
# send the parameters as form data
r = requests.post(url, data=payload)
print(r.text)
Should return this:
<?xml version="1.0" encoding="UTF-8"?>
<results>
<status>OK</status>
<usage>By accessing AlchemyAPI or using information generated by AlchemyAPI, you are agreeing to be bound by the AlchemyAPI Terms of Use: http://www.alchemyapi.com/company/terms.html</usage>
<url></url>
<language>english</language>
<entities>
<entity>
<type>Company</type>
<relevance>0.961433</relevance>
<count>1</count>
<text>IBM Corp</text>
</entity>
</entities>
</results>
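As a minimal sketch of feeding such a response to ElementTree, assuming the XML has the shape shown above: the string below is copied from that sample output and shortened to the fields used here. In a real call the input would be the response body (r.text or r.content).

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample response shown above.
result = '''<?xml version="1.0" encoding="UTF-8"?>
<results>
<status>OK</status>
<language>english</language>
<entities>
<entity>
<type>Company</type>
<relevance>0.961433</relevance>
<count>1</count>
<text>IBM Corp</text>
</entity>
</entities>
</results>'''

root = ET.fromstring(result)
status = root.findtext('status')

# Collect (type, text, count) for every entity in the response.
entities = [(e.findtext('type'), e.findtext('text'), int(e.findtext('count')))
            for e in root.findall('./entities/entity')]

print(status)
print(entities)
```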

lxml error while migrating a tool from Windows to Linux

I have developed a tool in Python 2.7 that takes an XSD file as input and writes the processed data into a test file.
While processing the XSD file I use lxml, and I am unable to resolve this error:
AttributeError: 'Element' object has no attribute 'iterdescendants'
I don't know what is wrong with the lxml library.
I want to know whether there is a Linux-compatible version of lxml for Python 2.7.
I import it in the file like this:
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree
I import it in only one file and pass the element tree object to another file for processing.
It works in the file where it is imported and fails only in the other file.
The code that throws the error is:
for tdocNode in lincFileRootNode:
    rootNode = tdocNode.getroot()
    lchildren = rootNode.getchildren()
    for elt in lchildren:
        if 'complex' == elt.tag:
            if 'name' in elt.attrib:
                if 'element' == item.tag:
                    if 'type' in item.attrib:
                        if elt.attrib['name'] == item.attrib['type']:
                            for key in elt.iterdescendants(tag='element'):
                                bIsElemTypeSimple = False
                                bIsElemTypeSimple = process_elementtype(key, lincFileRootNode)
where:
lincFileRootNode --> is a list that contains the pointers to the XSD files to be processed
The error thrown is:
Traceback (most recent call last):
File "run.py", line 1210, in <module>
iret = xsd2dic_main()
File "run.py", line 71, in xsd2dic_main
iRet = yxsdtodic()
File "run.py", line 352, in yxsdtodic
iret = process_xsdfile(sXsdPath)
File "run.py", line 485, in xsdfile
sRet = process_dic_elementtype(item,lincFileRootNode)
File "run.py", line 817, in process_dic_elementtype
for key in elt.iterdescendants(tag='element'):
AttributeError: 'Element' object has no attribute 'iterdescendants'
I tried both cases:
1: writing all the code in the same file
2: writing it in different files
I still get the same error.

This is mostly a guess, but look into it.
You appear to be calling iterdescendants from lxml's implementation of the Element type. However, if lxml fails to import, you fall back on Python's built-in xml library instead, and its implementation of Element doesn't have an iterdescendants method of any kind. In other words, the two implementations have different public APIs. Add some print statements to see which library you're importing, and do some additional checking to see exactly what type elt is. If you want to be able to fall back on Python's built-in xml, you'll need to structure your code to accommodate the different APIs.
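A minimal sketch of one way to accommodate both APIs. Element.iter() exists in both lxml and the standard library, so a small wrapper can stand in for iterdescendants. The name iter_descendants and the sample markup are made up for illustration, and the example runs against the standard library only.

```python
import xml.etree.ElementTree as etree  # the fallback implementation


def iter_descendants(elt, tag=None):
    """Yield descendants of elt, optionally filtered by tag.

    iter() also yields elt itself, so it is skipped here to mimic
    lxml's iterdescendants().
    """
    for node in elt.iter(tag):
        if node is not elt:
            yield node


root = etree.fromstring(
    "<complex name='t'><seq><element name='a'/><element name='b'/></seq></complex>")

# Shows which implementation the element actually comes from.
print(type(root).__module__)

# The standard library Element has no iterdescendants method.
print(hasattr(root, 'iterdescendants'))

names = [e.get('name') for e in iter_descendants(root, 'element')]
print(names)
```

Printing type(elt).__module__ at the point of failure is a quick way to confirm whether the fallback import was taken on the Linux machine.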