Tags with an underscore cause failure with BeautifulSoup selector

XML File:
<?xml version="1.0" encoding="UTF-8"?>
<sites>
<site>
<name>Default</name>
<url_namespace>default</url_namespace>
</site>
</sites>
Soup info:
soup = BeautifulSoup(xml)
soup.select('url_namespace')
Error:
ValueError: Unsupported or invalid CSS selector: "url_namespace"
How does one select an XML tag, or an id, that contains an underscore?

I'd suggest lxml, simply because this could be done with a simple XPath. As for selecting with an invalid CSS selector: you can't, really. There are a couple of workarounds, one of which is to replace the offending tag with, say, a div tag with a specific class, so that you can select it.
However, one really hackish way of doing this quickly is to change the name property of each element you find.
from bs4 import BeautifulSoup as bsoup
data = """
<?xml version='1.0' encoding='UTF-8'?>
<sites>
<site>
<name>Default</name>
<url_namespace>default1</url_namespace>
<url_namespace>default2</url_namespace>
<url_namespace>default3</url_namespace>
<url_namespace>default4</url_namespace>
</site>
</sites>
"""
soup = bsoup(data)
elements = soup.find_all("url_namespace")
for element in elements:
    element.name = "urlnamespace"
print soup
The above changes the soup to the following:
<html><body><sites>
<site>
<name>Default</name>
<urlnamespace>default1</urlnamespace>
<urlnamespace>default2</urlnamespace>
<urlnamespace>default3</urlnamespace>
<urlnamespace>default4</urlnamespace>
</site>
</sites>
</body></html>
Adding the following codeblock to the above code...
targets = soup.select("urlnamespace")
for target in targets:
    print target.get_text()
... gives you the following result:
default1
default2
default3
default4
Not really the prettiest way, but it works. Out of sheer curiosity, though, why the need to select the tag this way? find_all works on the tag, as you can see above.
Anyway, let us know if this works.
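For comparison, the underscore tag can also be read without renaming anything. Below is a minimal sketch that uses the standard library's xml.etree.ElementTree in place of lxml (a substitution made here only to avoid a third-party dependency). The sample data is modelled on the question's XML, and ElementTree's limited XPath support is enough for this case. It is written for Python 3.

```python
import xml.etree.ElementTree as ET

# Sample data modelled on the question; the XML declaration is omitted
# so the string can be parsed directly.
data = """<sites>
  <site>
    <name>Default</name>
    <url_namespace>default1</url_namespace>
    <url_namespace>default2</url_namespace>
  </site>
</sites>"""

root = ET.fromstring(data)

# ".//url_namespace" matches the tag at any depth; underscores are
# ordinary characters in XML element names.
namespaces = [el.text for el in root.findall(".//url_namespace")]
print(namespaces)
```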

Related

Parsing OSM XML data with python with specific sub tags

I am trying to parse way tags out of an OSM XML file. For example, I want to search through the entire XML file and, whenever a way has a tag whose k value is bridge, save the entire way element into a CSV file together with all the other ways that have bridge tags.
<way id="108534076" visible="true" version="1" changeset="7866393" timestamp="2011-04-15T02:42:51Z" user="richlv" uid="47892">
<nd ref="1245024935"/>
<nd ref="1245025038"/>
<tag k="bridge" v="yes"/>
<tag k="highway" v="service"/>
</way>
Here is the code I have written so far, but I keep getting an AttributeError:
import xml.etree.ElementTree as ET
tree = ET.parse('MER.xml')
root = tree.getroot()
for way in root.findall('way'):
    tag = way.find('.//tag')
    if tag.attrib['k'] == 'bridge' and tag.attrib['v'] == 'yes':
        print tag
The file I have is very big, and I am looking through 4000 way tags for about 34 bridge tags.
Error Traceback

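The problem is that not all <way> elements have a <tag> underneath, so way.find() returns None for them. You can fix it by checking for None before reading the attributes (a plain "if tag" is not enough, because an element with no children evaluates as false):
import xml.etree.ElementTree as ET
tree = ET.parse('MER.xml')
root = tree.getroot()
for way in root.findall('way'):
    # note: find() returns only the first <tag> of each way
    tag = way.find('.//tag')
    if tag is not None and tag.attrib['k'] == 'bridge' and tag.attrib['v'] == 'yes':
        print tag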
Or you could jump into XPath and let the XML document do the work for you:
import xml.etree.ElementTree as ET
tree = ET.parse('MER.xml')
for tag in tree.findall('.//way//tag[@k="bridge"][@v="yes"]'):
    print tag
And for large files, lxml is usually faster:
import lxml.etree
tree = lxml.etree.parse('MER.xml')
for tag in tree.findall('.//way//tag[@k="bridge"][@v="yes"]'):
    print tag
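To get from matching tags to the stated goal (saving each bridge way to a CSV file), one option is to test every <tag> child of a way rather than only the first. This is a rough sketch for Python 3 using only the standard library. The column layout (way id, node refs, tags) and the helper name bridge_rows are assumptions made for illustration; the question does not specify an output format.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Small stand-in for the real file, modelled on the question's sample.
SAMPLE = """<osm>
  <way id="108534076">
    <nd ref="1245024935"/>
    <nd ref="1245025038"/>
    <tag k="highway" v="service"/>
    <tag k="bridge" v="yes"/>
  </way>
  <way id="1">
    <nd ref="2"/>
  </way>
</osm>"""


def bridge_rows(root):
    """Yield (way id, node refs, tags) for every way tagged bridge=yes."""
    for way in root.iter("way"):
        # Collect all tags of this way, not just the first one.
        tags = {t.get("k"): t.get("v") for t in way.findall("tag")}
        if tags.get("bridge") == "yes":
            refs = " ".join(nd.get("ref") for nd in way.findall("nd"))
            pairs = ";".join("%s=%s" % kv for kv in sorted(tags.items()))
            yield way.get("id"), refs, pairs


root = ET.fromstring(SAMPLE)
rows = list(bridge_rows(root))

out = io.StringIO()  # stand-in for open("bridges.csv", "w", newline="")
writer = csv.writer(out)
writer.writerow(["way_id", "node_refs", "tags"])
writer.writerows(rows)
print(out.getvalue())
```

For the real file, the root would come from ET.parse('MER.xml').getroot() instead of the sample string.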

Not able to remove tags that have "xsi:nil" in them via XSLT

I have the following XML, which contains several tags with xsi:nil="true". These are tags that are basically null. I have not been able to find or write an XSLT transform that removes these tags from the XML and keeps the rest of it.
<?xml version="1.0" encoding="utf-8"?>
<p849:retrieveAllValues xmlns:p849="http://package.de.bc.a">
<retrieveAllValues>
<messages xsi:nil="true" />
<existingValues>
<Values>
<value1> 10.00</value1>
<value2>123456</value2>
<value3>1234</value3>
<value4 xsi:nil="true" />
<value5 />
</Values>
</existingValues>
<otherValues xsi:nil="true" />
<recValues xsi:nil="true" />
</retrieveAllValues>
</p849:retrieveAllValues>

The reason for the error you get
[Fatal Error] file2.xml:5:30: The prefix "xsi" for attribute "xsi:nil" associated with an element type "messages" is not bound.
is that no prefix named "xsi" has been declared. You should declare it in the root element, like this:
<p849:retrieveAllValues xmlns:p849="http://package.de.bc.a"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<retrieveAllValues>
<messages xsi:nil="true" />
// other code...
Update
If you cannot change the XML document you receive from the web service, you could try the following approach (if it is acceptable for you):
Change your XSLT document to process XML documents without specifying element prefixes.
Set the namespaceAware property of DocumentBuilderFactory to false.
After this, your transformer shouldn't complain.
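If the xsi prefix can be declared as in the first suggestion above, removing the nil elements is a short tree walk. Here is a sketch in Python 3 using xml.etree.ElementTree, swapped in for XSLT purely for illustration. The function name strip_nils and the trimmed sample document are assumptions.

```python
import xml.etree.ElementTree as ET

# Fully qualified name of the xsi:nil attribute as ElementTree sees it.
XSI_NIL = "{http://www.w3.org/2001/XMLSchema-instance}nil"

SAMPLE = """<p849:retrieveAllValues xmlns:p849="http://package.de.bc.a"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <retrieveAllValues>
    <messages xsi:nil="true" />
    <existingValues>
      <Values>
        <value2>123456</value2>
        <value4 xsi:nil="true" />
        <value5 />
      </Values>
    </existingValues>
    <otherValues xsi:nil="true" />
  </retrieveAllValues>
</p849:retrieveAllValues>"""


def strip_nils(root):
    """Remove every descendant element carrying xsi:nil="true"."""
    # Snapshot the elements first so removals do not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.get(XSI_NIL) == "true":
                parent.remove(child)
    return root


root = strip_nils(ET.fromstring(SAMPLE))
remaining = [el.tag for el in root.iter()]
print(remaining)
```

Note that value5, which is empty but not marked nil, is kept.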

It doesn't look like this is going to be possible in XSLT. Because of the missing namespace declarations, you have to parse the XML file with a non-namespace-aware parser, but none of the XSLT processors I've tried get on well with such documents. They must rely on some information that is only present when parsing with namespace awareness enabled, even if the document in question doesn't actually contain any namespaced nodes.
So you'll have to approach it a different way, for example by traversing the DOM tree yourself. Since you say you're working in Java, here's an example using the Java DOM APIs (it runs as-is in the Groovy console; to run it as Java, wrap it in a proper class definition and add whatever exception handling is required):
import java.io.File;
import javax.xml.transform.*;
import javax.xml.transform.dom.*;
import javax.xml.transform.stream.*;
import javax.xml.parsers.*;
import org.w3c.dom.*;
import org.w3c.dom.ls.*;
public void stripNils(Node n) {
    if (n instanceof Element &&
            "true".equals(((Element) n).getAttribute("xsi:nil"))) {
        // element is xsi:nil - strip it out
        n.getParentNode().removeChild(n);
    } else {
        // we're keeping this node, process its children (if any) recursively;
        // iterate backwards because the NodeList is live, and removing a child
        // shifts the indexes of the siblings after it
        NodeList children = n.getChildNodes();
        for (int i = children.getLength() - 1; i >= 0; i--) {
            stripNils(children.item(i));
        }
    }
}
// load the document (NB DBF is non-namespace-aware by default)
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document xmlDoc = db.parse(new File("input.xml"));
stripNils(xmlDoc);
// write out the modified document, in this example to stdout
LSSerializer ser =
((DOMImplementationLS)xmlDoc.getImplementation()).createLSSerializer();
LSOutput out =
((DOMImplementationLS)xmlDoc.getImplementation()).createLSOutput();
out.setByteStream(System.out);
ser.write(xmlDoc, out);
On your original example XML this produces the correct result:
<?xml version="1.0" encoding="UTF-8"?>
<p849:retrieveAllValues xmlns:p849="http://package.de.bc.a">
<retrieveAllValues>
<existingValues>
<Values>
<value1> 10.00</value1>
<value2>123456</value2>
<value3>1234</value3>
<value5/>
</Values>
</existingValues>
</retrieveAllValues>
</p849:retrieveAllValues>
The empty lines are not actually empty, they contain the whitespace text nodes either side of the removed elements, as only the elements themselves are being removed here.

Adding new view to Dexterity type causes "page not found" viewing items

I'm working through the recent Professional Plone 4 Development book, on a Plone 4.1.2 install.
I have successfully defined the content types via Dexterity and am now trying to create a custom view for one of the types. The schema & view are defined as such:
from zope import schema
from plone.directives import form
from five import grok
from ctcc.contenttypes import CTCCTypesMessageFactory as _
class ITrial(form.Schema):
    """A clinical trial."""
    title = schema.TextLine(
        title = _(u'label_title', default=u'Title'),
        required = True,
    )
    description = schema.Text(
        title=_(u'label_description', default=u'Description'),
        description = _(u'help_description', default=u'A short summary of the content'),
        required = False,
        missing_value = u'',
    )
class View(grok.View):
    grok.context(ITrial)
    grok.require('zope2.View')
    grok.name('view')
Here is the relevant section from the type's FTI:
view
False
<alias from="(Default)" to="(selected layout)"/>
<alias from="edit" to="@@edit"/>
<alias from="sharing" to="@@sharing"/>
<alias from="view" to="@@view"/>
<action title="View" action_id="view" category="object" condition_expr=""
url_expr="string:${folder_url}/" visible="True">
<permission value="View"/>
</action>
And the template itself, located in ctcc.contenttypes/trial_templates/view.pt, which should simply display the title & description:
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"
xmlns:tal="http://xml.zope.org/namespaces/tal"
xmlns:metal="http://xml.zope.org/namespaces/metal"
xmlns:i18n="http://xml.zope.org/namespaces/i18n"
lang="en"
metal:use-macro="context/main_template/macros/master"
i18n:domain="ctcc.contenttypes">
<body>
<metal:content-core fill-slot="content-core">
<metal:content-core define-macro="content-core">
<div tal:replace="structure context/text/output" />
</metal:content-core>
</metal:content-core>
</body>
</html>
Accessing any instances of the type with all this in place causes a "page not found" error. Something doesn't seem to be tying up the new view to the expected path, but as this is my first week with Plone I've no idea where to begin to track this down. I'm seeing no errors running the site in foreground mode either.
Any help whatsoever would be greatly appreciated.

Did you include the dependency in setup.py?
install_requires=[
'setuptools',
'plone.app.dexterity',
...
],
Did you initialize Grok in your configure.zcml?
<configure
xmlns="http://namespaces.zope.org/zope"
...
xmlns:grok="http://namespaces.zope.org/grok">
<includeDependencies package="." />
<grok:grok package="." />
...
</configure>
Did you include Dexterity's GenericSetup profile in your metadata.xml?
<metadata>
<version>1</version>
<dependencies>
<dependency>profile-plone.app.dexterity:default</dependency>
</dependencies>
</metadata>

The problem was with this line in the template:
<div tal:replace="structure context/text/output" />
I had stripped back an example template to what I thought was the bare minimum. Thanks to David Glick's suggestion, I removed NotFound from the ignored exceptions list in error_log and saw the following:
Module Products.PageTemplates.Expressions, line 225, in evaluateText
Module zope.tales.tales, line 696, in evaluate
- URL: /opt/plone41/zeocluster/src/ctcc.contenttypes/ctcc/contenttypes/trial_templates/view.pt
- Line 13, Column 8
- Expression: <PathExpr standard:u'context/text/output'>
[...]
Module OFS.Traversable, line 299, in unrestrictedTraverse
- __traceback_info__: ([], 'text')
NotFound: text
Now that I can see what's causing the problem and have started reading deeper into TAL, I can see why it's failing: the template asks for context/text, but the schema defines no text field, only title and description. Ignorance on my part, as suspected.
Thanks, everyone!

Is it possible to search SharePoint metadata?

When I use the Search.asmx web service it won't allow me to search MetaData. Is there a way that I can do this?
Below is what I have come up with so far for my query, but it errors out with an InvalidPropertyException every time I run it.
<?xml version="1.0" encoding="utf-8" ?>
<QueryPacket xmlns="urn:Microsoft.Search.Query" Revision="1000">
<Query domain="QDomain">
<SupportedFormats><Format>urn:Microsoft.Search.Response.Document.Document</Format></SupportedFormats>
<Context>
<QueryText language="en-US" type="MSSQLFT">
<![CDATA[ SELECT Title, Rank, Size, Description, Write, Path FROM portal..scope() WHERE "Published" = 'Yes' ORDER BY "Rank" DESC ]]>
</QueryText>
</Context>
<Range><StartAt>1</StartAt><Count>20</Count></Range>
<EnableStemming>false</EnableStemming>
<TrimDuplicates>true</TrimDuplicates>
<IgnoreAllNoiseQuery>true</IgnoreAllNoiseQuery>
<ImplicitAndBehavior>true</ImplicitAndBehavior>
<IncludeRelevanceResults>true</IncludeRelevanceResults>
<IncludeSpecialTermResults>true</IncludeSpecialTermResults>
<IncludeHighConfidenceResults>true</IncludeHighConfidenceResults>
</Query></QueryPacket>

You can't just search an arbitrary metadata column; you need to make sure it gets crawled first and is made available under a sensible name (a managed property). See this blog post for an example.
Also, if Published is a boolean, I think you might want to test "Published" = 1 instead of 'Yes'.

Evernote export format (ENEX) to HTML, including pictures?

#Solved
The two subquestions I have created have been solved (yay for splitting this one up!), so this one is solved. I'll award the check mark to samjudson, since his answer was the closest. For actual working solutions though, see the below subquestions; both my implemented solutions and the checked answers.
#Deprecated
I am splitting this question into two separate questions, since this is a fairly complicated problem. Answers are still welcome though.
The subquestions are:
XSLT: Convert base64 data into image files
XSLT: Obtaining or matching hashes for base64 encoded data
Hi, just wondering if anyone here has had any success in converting Evernote's export format, which is XML, to HTML including the pictures. I do know that Evernote has an export to HTML function which does this, but I eventually want to do more fancy stuff with it.
I have managed to accomplish getting the text only using the following XSLT:
Sample code removed
See child questions for implemented solutions.
However, a.t.m. this simply ignores any pictures, and this is where I need help.
Stumbling block #1: Evernote stores its pictures as GIFs or PNGs, and when exported, it embeds these GIFs & PNGs directly in the XML using what appears to be base64 (I could be wrong). I need to be able to reconstitute the pictures. If you open the file in a text editor, look for the huge blocks of data in the **//note/resource/data**. For example (indents added manually):
<resource>
<data encoding="base64">
R0lGODlhEAAQAPMAMcDAwP/crv/erbigfVdLOyslHQAAAAECAwECAwECAwECAwECAwECAwECAwEC
AwECAyH/C01TT0ZGSUNFOS4wGAAAAAxtc09QTVNPRkZJQ0U5LjAHgfNAGQAh/wtNU09GRklDRTku
MBUAAAAJcEhZcwAACxMAAAsTAQCanBgAIf8LTVNPRkZJQ0U5LjATAAAAB3RJTUUH1AkWBTYSQXe8
fQAh+QQBAAAAACwAAAAAEAAQAAADSQhgpv7OlDGYstCIMqsZAXYJJEdRQRWRrHk2I9t28CLfX63d
ZEXovJ7htwr6dIQB7/hgJGXMzFApOBYgl6n1il0Mv5xuhBEGJAAAOw==
</data>
<mime>image/gif</mime>
<resource-attributes>
<file-name>clip_image001.gif</file-name>
</resource-attributes>
</resource>
Stumbling block #2: Evernote stores the file names of each picture under the resource node
**//note/resource/resource-attributes/file-name**
however, in the actual note in which it refers to the picture, it references the picture not by the filename, but by its hash, for example:
<en-media hash="4aaafc3e14314027bb1d89cf7d59a06c" type="image/gif" border="0" width="16" height="16" alt="Alt Text"/>
Can anyone shed some light on how to deal with (base64) encoded binary data inside XML?
Edit
I understand from the comments & answers that plain ol' XSLT won't get the job done when it comes to handling images. The XSLT processor I am using is Xalan; however, if this is not good enough for image processing or base64, then please suggest one that does handle these!
Also, as requested, here is a sample Evernote export file. The code clips above are merely selected parts of this. I have stripped it down such that it contains just one note and edited most of the text out of it, and added indents for clarity.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE en-export SYSTEM "http://xml.evernote.com/pub/evernote-export.dtd">
<en-export export-date="20091029T063411Z" application="Evernote/Windows" version="3.0">
<note>
<title>A title here</title>
<content><![CDATA[
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE en-note SYSTEM "http://xml.evernote.com/pub/enml.dtd">
<en-note bgcolor="#FFFFFF">
<p>Some text here (followed by the picture)
<p><en-media hash="4aaafc3e14314027bb1d89cf7d59a06c" type="image/gif" border="0" width="16" height="16" alt="A picture"/></p>
<p>Some more text here (preceded by the picture)
</en-note>
]]></content>
<created>20090925T063154Z</created>
<note-attributes>
<author/>
</note-attributes>
<resource>
<data encoding="base64">
R0lGODlhEAAQAPMAMcDAwP/crv/erbigfVdLOyslHQAAAAECAwECAwECAwECAwECAwECAwECAwEC
AwECAyH/C01TT0ZGSUNFOS4wGAAAAAxtc09QTVNPRkZJQ0U5LjAHgfNAGQAh/wtNU09GRklDRTku
MBUAAAAJcEhZcwAACxMAAAsTAQCanBgAIf8LTVNPRkZJQ0U5LjATAAAAB3RJTUUH1AkWBTYSQXe8
fQAh+QQBAAAAACwAAAAAEAAQAAADSQhgpv7OlDGYstCIMqsZAXYJJEdRQRWRrHk2I9t28CLfX63d
ZEXovJ7htwr6dIQB7/hgJGXMzFApOBYgl6n1il0Mv5xuhBEGJAAAOw==
</data>
<mime>image/gif</mime>
<resource-attributes>
<file-name>clip_image001.gif</file-name>
</resource-attributes>
</resource>
</note>
</en-export>
And this needs to be transformed into this:
<html>
<body>
<p>Some text here (followed by the picture)
<p><img src="clip_image001.gif" border="0" width="16" height="16" alt="A picture"/></p>
<p>Some more text here (preceded by the picture)
</body>
</html>
With the file clip_image001.gif being generated and saved.

There is a new Data URI specification (http://en.wikipedia.org/wiki/Data_URI_scheme) which may be of some help, provided you only intend to support modern browsers and your images are small (for example, IE8 only supports data URIs of up to 32 KB).
Other than that, the only other thing you can do is use some external script to export the image data to files and reference those. This would depend greatly on which XSLT processor you are using.
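The external-script route can be fairly small. Here is a sketch in Python 3, assuming (per Evernote's ENML documentation) that the hash attribute on en-media is the MD5 of the decoded resource bytes. The function names resources_by_hash and data_uri and the tiny sample payload are made up for illustration.

```python
import base64
import hashlib
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down export; the payload is a few placeholder
# bytes, not a real image.
SAMPLE = """<en-export>
  <note>
    <resource>
      <data encoding="base64">aGVsbG8gd29ybGQ=</data>
      <mime>image/gif</mime>
      <resource-attributes>
        <file-name>clip_image001.gif</file-name>
      </resource-attributes>
    </resource>
  </note>
</en-export>"""


def resources_by_hash(root):
    """Map MD5 hex digest -> (file name, mime type, raw bytes)."""
    found = {}
    for res in root.iter("resource"):
        # b64decode ignores the line breaks found in real exports.
        raw = base64.b64decode(res.findtext("data"))
        name = res.findtext("resource-attributes/file-name")
        mime = res.findtext("mime")
        found[hashlib.md5(raw).hexdigest()] = (name, mime, raw)
    return found


def data_uri(mime, raw):
    """Build an inline data: URI as an alternative to writing a file."""
    return "data:%s;base64,%s" % (mime, base64.b64encode(raw).decode("ascii"))


resources = resources_by_hash(ET.fromstring(SAMPLE))
for digest, (name, mime, raw) in resources.items():
    # To save the picture instead: open(name, "wb").write(raw)
    print(digest, name, data_uri(mime, raw))
```

Each en-media hash found in the note content can then be looked up in the dictionary and replaced with an img element pointing at either the saved file name or the data URI.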

There is a pure XSLT answer to this issue; look at this page.
