How to open non-valid HTML (from Wikipedia) via document() in XSLT?

I use XSLT 1.0 to extract information from Wikipedia infoboxes and, for certain links, fetch additional information from other Wikipedia pages.
In principle this works fine, unless the HTML returned for a Wikipedia page is invalid. Unfortunately, that happens for all pages in, e.g., the Russian Wikipedia. Try the following example:
<xsl:for-each
select="document('http://ru.wikipedia.org/wiki/%D0%91%D0%B0%D0%B4%D0%B5%D0%BD_%D0%BA%D1%83%D0%BB%D1%8C%D1%82%D1%83%D1%80%D0%B0')//text()">
<xsl:value-of select="."/>
</xsl:for-each>
The trouble is that the entity &reg; is used on every page in this language edition but never declared: the DOCTYPE declaration of Wikipedia pages is stripped down to
<!DOCTYPE html>
Instead of, say,
<!DOCTYPE html SYSTEM "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
This is clearly a Wikipedia issue, not an XSLT issue, but is there any workaround to parse these pages nevertheless? Any pointers to a more forgiving XML parser? Is there any way to inject entity declarations into the HTML before it gets parsed?
So far I have tried xsltproc, Saxon 6.5.5, Saxon-B 9.1.0.8, and Xalan, all with the same result.

Saxon and Xalan (I don't know about xsltproc) allow you to supply a URIResolver to handle document() requests. This is allowed to return any Source object. To process HTML input, return a SAXSource whose XMLReader is actually an HTML parser. There are a couple of candidates, TagSoup and validator.nu - the latter is probably better as it claims to implement the HTML5 parsing algorithm. The XSLT processor will then think it is dealing with well-formed XML.
Alternatively, in Saxon there is a saxon:parse-html() extension function. This in fact uses TagSoup underneath.
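Purely as a sketch (assuming an XSLT 2.0 stylesheet running on a Saxon version that provides saxon:parse-html(), with TagSoup on the classpath), the for-each from the question could fetch the page as plain text and re-parse it:
<xsl:for-each xmlns:saxon="http://saxon.sf.net/"
    select="saxon:parse-html(unparsed-text('http://ru.wikipedia.org/wiki/%D0%91%D0%B0%D0%B4%D0%B5%D0%BD_%D0%BA%D1%83%D0%BB%D1%8C%D1%82%D1%83%D1%80%D0%B0'))//text()">
  <xsl:value-of select="."/>
</xsl:for-each>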

Related

Eliminate javascript from HTML with XSLT

I am trying to transform an HTML report into XML, but some JavaScript in the file is throwing errors due to statements with a less-than character (e.g., for(var i=0; i<els.length;i++)). I thought I could eliminate the JavaScript with the following template, which should remove entire script nodes:
<xsl:template match="script"/>
I assumed the XSLT processor would simply skip over the entire script nodes, but it's still throwing the same errors. I also tried adding this one:
<xsl:template match="script/text()"/>
No luck. If I manually remove all the JavaScript from the file, my transform works, but that's not practical, as I need a daily automated process that extracts data from the HTML tables in these files.
As a general rule, XSLT will only process well-formed XML input: it's not designed to process other formats like HTML.
However, XSLT will generally accept input from a parser that delivers a stream of events that looks sufficiently like an XML stream. This allows parsers like TagSoup and validator.nu to be used as a front-end to your XSLT processor.
Saxon packages this up with a parse-html() function that invokes TagSoup to parse HTML input and turn it into a DOM-like tree (actually an XDM tree) that it can process as if it came from XML.
validator.nu is a more up-to-date HTML parser than TagSoup, but you would have to do a little more work to integrate that.
The question was answered by Martin Honnen in the comments:
oxygenxml.com/doc/versions/18.1/ug-editor/tasks/… suggests there is an HTML import feature, so try whether that helps. Of course, there are also standalone applications like HTML Tidy that you can use outside of the XSLT processing to first convert your HTML to XHTML.
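If you take the HTML Tidy route, a minimal sketch of the follow-up stylesheet could look like this; note that tidied output normally places elements in the XHTML namespace, which is another reason a bare match="script" would not fire:
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <!-- copy everything through unchanged... -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- ...except script elements, which are dropped entirely -->
  <xsl:template match="xhtml:script"/>
</xsl:stylesheet>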

eXist-db / XSLT / Saxon collection() slow as molasses (or errors out with memory limit)

Coming from this question, I managed one entirely unsatisfactory solution for accessing an eXist-db collection() from an XSLT 2.0 document loaded from within an eXist-db/XQuery transformation function:
The XSLT file declares a variable:
<xsl:variable name="coll" select="collection('xmldb:exist:///db/apps/deheresi/data/collection_ms609.xml')"/>
This points to a catalog XML file I created (per the Saxon documentation) in order to load the actual collection; it looks like this:
<collection stable="true">
<doc href="xmldb:exist:///db/apps/deheresi/data/ms609_0001.xml"/>
<doc href="xmldb:exist:///db/apps/deheresi/data/ms609_0002.xml"/>
...
...
<doc href="xmldb:exist:///db/apps/deheresi/data/ms609_0709.xml"/>
<doc href="xmldb:exist:///db/apps/deheresi/data/ms609_0710.xml"/>
</collection>
This allows the XSLT file to use a key that needs to search across all these files:
<xsl:key name="correspkey" match="tei:seg[#type='dep_event' and #corresp]" use="#corresp"/>
<xsl:variable name="correspvar" select="self::seg[#type='dep_event' and #corresp]/#corresp"/>
<xsl:value-of select="$coll/(key('correspid',$correspvar) except $correspvar)/#id" separator=", "/>
As it stands, if I have 50 documents in the catalog, I get a result in 2 minutes; with all 710 I get a Java GC error after 4 minutes.
I have set indexes on relevant nodes in eXist-DB, but this does nothing to performance. It seems to me Saxon is working 'outside' eXist-DB's optimisations, treating eXist-DB as a simple file system.
(For what it's worth, setting href="/db/apps/deheresi/data/ms609_0001.xml" does not let Saxon see the documents.)
I suspect all of this is why eXist-db documentation on this approach is non-existent.
As it stands, I am looking for ways to run intensive searches over collections from XSLT 2.0 stylesheets loaded within eXist-db via XQuery transform().
If anything, I hope this post helps future searchers encountering the same problem.
The general architectural principle is: try to move the searching closer to the data. In this case this means: use eXist to find the documents of interest, don't extract every possible candidate document from eXist and then ask Saxon to do the searching. Select the actual documents of interest in an eXist XQuery, and then pass the list of these documents to Saxon in a stylesheet parameter.
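A rough sketch of that division of labour, with the filter predicate, paths, and parameter name invented purely for illustration (eXist's transform:transform() takes a <parameters> element of string-valued params, so here only the URIs of the matching documents are handed to the stylesheet):
xquery version "3.1";
import module namespace transform = "http://exist-db.org/xquery/transform";
declare namespace tei = "http://www.tei-c.org/ns/1.0";

(: let eXist and its indexes do the searching :)
let $target := '#msp-1'  (: illustrative value :)
let $hits   := collection('/db/apps/deheresi/data')
               //tei:seg[@type = 'dep_event'][@corresp = $target]
let $uris   := string-join(distinct-values($hits ! document-uri(root(.))), ' ')
return
    transform:transform(root($hits[1]),
        doc('/db/apps/deheresi/deposition.xsl'),
        <parameters>
            <param name="matching-docs" value="{$uris}"/>
        </parameters>)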

Localization with XSLT: do any standards exist that deal with different languages in XML?

I am working on a project where I need to create an HTML file, supporting several languages, from an XSLT transformation. I have read several articles on this and also looked at previous questions here on Stack Overflow, like this one:
xslt localization
And the solution of putting the translations in a separate XML document works just as I want it to. But I wonder whether there is any standard or best practice for the "translate.xml" file? In the post referenced above, the following is given as an example:
<strings>
<string language="en" key="ContactDetails">Contact Details</string>
<string language="sv" key="ContactDetails">Kontaktuppgifter</string>
[...]
</strings>
As I said, the suggested solution of using keys to retrieve the strings from translate.xml works as I want it to, but I like to use standards where they are available. So my question is whether there is a standardized schema for these types of XML files, or some kind of best practice for the naming of tags etc. in such a "translate.xml"?
Good question!
Yes, there is a standardized way of dealing with languages: the xml:lang attribute. It is part of the core XML specification and is available in any XML document without declaring anything. It is defined to take a language specifier according to RFC 4646, i.e. the familiar tags like en, en-US, es, es-BR, giving the main language and optionally a regional variant.
The way it works is a bit like namespaces: if you set it on an element, it is inherited by all descendants of that element unless you redefine it, or undefine it to mark language-independent elements.
For instance:
<text xml:lang="en">
We are
<t xml:lang="en-US">organizing</t>
<t xml:lang="en-GB">organising</t>
a conference on the effects of
<t xml:lang="en-US">color</t>
<t xml:lang="en-GB">colour</t>
in December this year.
</text>
In a query language such as XSLT, combined with a copy idiom, this works excellently together with the lang() function, which takes the applicable language from the nearest ancestor-or-self element and returns true if it matches. It also matches a language variant such as en-US when you test for the main language en.
The following assumes XSLT 1.0, but works with 2.0 and 3.0 as well (this code was kindly corrected by Michael, see comments):
<!-- match English US language and default en -->
<xsl:template match="t[lang('en-US') or lang('en')">
<xsl:apply-templates />
</xsl:template>
<!-- remove any other <t> -->
<xsl:template match="t" />
Note: always set a default language on the outermost element, as lang() will return false if no language is found at all. You could test for this with the expression lang(''), which will return false only if no language was set at all.
About your XML files: if you have one file per language and don't mix and match as suggested above, you can still use the same approach by setting xml:lang on the root element. Since it is then inherited throughout the whole tree, you can still use the lang() function.
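One sketch of that one-file-per-language variant, with the file naming convention and the $lang parameter invented for illustration:
<!-- strings.en.xml -->
<strings xml:lang="en">
  <string key="ContactDetails">Contact Details</string>
</strings>

<!-- in the stylesheet -->
<xsl:param name="lang" select="'en'"/>
<xsl:variable name="strings"
    select="document(concat('strings.', $lang, '.xml'))/strings"/>

<!-- wherever a translated string is needed -->
<xsl:value-of select="$strings/string[@key = 'ContactDetails']"/>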

Preprocessing in XSLT

Is it at all possible to 'pre-process' in XSLT?
By pre-processing I mean updating the in-memory representation of the source tree.
Is this possible, or do I need to do multiple transforms for it?
Use case:
We have DocBook reference manuals for our clients, but for certain clients these need different 'skins' (different images etc.). So what I was hoping to do is rewrite the image fileref path depending on a parameter, then apply the rest of the normal DocBook XSL templates.
Expanding on Eamon's answer...
In the case of either XSLT 1.0 or 2.0, you'd start by putting the intermediate (pre-processed) result in an <xsl:variable> element, declared either globally (top-level) or locally (inside a template).
<xsl:variable name="intermediate-result">
<!-- code to create pre-processed result, e.g.: -->
<xsl:apply-templates mode="pre-process"/>
</xsl:variable>
In XSLT 2.0, the value of the $intermediate-result variable is a node sequence consisting of one document node (was called "root node" in XSLT/XPath 1.0). You can access and use it just as you would any other variable, e.g., select="$intermediate-result/doc"
But in XSLT 1.0, the value of the $intermediate-result variable is not a first-class node-set. Instead, it's something called a "result tree fragment". It behaves like a node-set containing one root node, but you're restricted in how you can use it. You can copy it and get its string-value, but you can't drill down using XPath, as in select="$intermediate-result/doc". To do that, you must first convert it to a first-class node-set using your processor's node-set() extension function. In Saxon 6.5, libxslt, and 4xslt, you can use exsl:node-set() (as in Eamon's answer). In MSXML, you'd need to use msxsl:node-set(), where xmlns:msxsl="urn:schemas-microsoft-com:xslt", and in Xalan, I believe it's called xalan:nodeset() (without the hyphen, but you'll have to Google for the namespace URI). For example: select="exsl:node-set($intermediate-result)/doc"
XSLT 2.0 simply abolished the result tree fragment, making node-set() unnecessary.
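Tying this back to the skinning use case, a minimal two-pass sketch with EXSLT might look like the following; the skins/ path convention and the $skin parameter are invented for illustration, and in practice the second pass would come from xsl:import-ing the DocBook stylesheets:
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:exsl="http://exslt.org/common"
    exclude-result-prefixes="exsl">

  <xsl:param name="skin" select="'default'"/>

  <!-- pass 1: copy everything, rewriting image paths for the chosen skin -->
  <xsl:template match="@*|node()" mode="pre-process">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()" mode="pre-process"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="imagedata/@fileref" mode="pre-process">
    <xsl:attribute name="fileref">
      <xsl:value-of select="concat('skins/', $skin, '/', .)"/>
    </xsl:attribute>
  </xsl:template>

  <!-- pass 2: hand the rewritten tree to the normal templates -->
  <xsl:template match="/">
    <xsl:variable name="intermediate-result">
      <xsl:apply-templates select="*" mode="pre-process"/>
    </xsl:variable>
    <xsl:apply-templates select="exsl:node-set($intermediate-result)/*"/>
  </xsl:template>
</xsl:stylesheet>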
This is not possible with standards-compliant XSLT 1.0. It is, however, possible in every actual implementation I've used, although the extension you need differs by engine. It is also possible in standard XSLT 2.0 (which is in any case much easier to work with, so if you can, just use that).
If your XSLT processor supports EXSLT, the exsl:node-set() function does what you're looking for. MSXML has an identically named extension function as well (but with a different namespace URI; the two are unfortunately not trivially compatible).
Since you are trying to generate slightly different output from the same DocBook XML source, you might want to look into the "profiling" (conditional markup) support in DocBook XSL stylesheets. See Chapter 26 in DocBook XSL: The Complete Guide by Bob Stayton:
Profiling is the term used in DocBook to describe conditional text. Conditional text means you can create a single XML document with some elements marked as conditional. When you process such a document, you can specify which conditions apply for that version of output, and the stylesheet will include or exclude the marked text to satisfy the conditions. This feature is useful when you need to produce more than one version of a document, and the versions differ in minor ways.
For example, to use different images for, say, Windows and Mac versions of the same document, you might have a DocBook XML fragment like this:
<figure>
<title>The Foo dialog</title>
<mediaobject>
<imageobject os="windows">
<imagedata fileref="screenshots/windows/foo.png"/>
</imageobject>
<imageobject os="mac">
<imagedata fileref="screenshots/mac/foo.png"/>
</imageobject>
</mediaobject>
</figure>
Then, you would use the profiling-enabled versions of the DocBook XSL stylesheets with the profile.os parameter set to windows or mac.
Maybe you should use XSLT's "OOP" mechanisms here: put all the templates common to every client in one stylesheet, and create a stylesheet for each client with specific templates overriding the common ones. Import the common stylesheet into the client-specific ones with xsl:import, and you only need a single transformation pass: just run the stylesheet corresponding to the client.
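A sketch of that layout, with invented file names and an illustrative override (it works because imported templates have lower import precedence than templates in the importing stylesheet):
<!-- client-acme.xsl -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- everything shared between clients lives in docbook-common.xsl -->
  <xsl:import href="docbook-common.xsl"/>

  <!-- client-specific override: this template wins over any template
       for imagedata defined in the imported stylesheet -->
  <xsl:template match="imagedata">
    <img src="{concat('skins/acme/', @fileref)}"/>
  </xsl:template>
</xsl:stylesheet>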

XSLT - how to deal with <![CDATA[

I'm trying to output some XML using XSLT; however, I've just come across this:
<description><![CDATA[<p>Using Money – recognise coins, getting change, paper money etc. A PowerPoint resource containing colour coded levels to suit different abilities – special needs. Self checking and interactive.</p>]]></description>
How do I output the actual HTML, i.e. not the escaped <p> markup as text, but real HTML?
You can use disable-output-escaping. Beware, though, that if the input value is not well-formed or valid, the output won't be either.
<xsl:value-of select="description" disable-output-escaping="yes"/>
XSLT operates on data that has already been parsed by the XML parser, and the content of a CDATA section reaches it as plain text, not as markup. If you need the embedded HTML as real elements rather than as a string, you have to parse that text yourself, for example by pre-processing the document before handing the XML to XSLT.
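If you actually need to navigate the embedded markup rather than just emit it verbatim, one option is the saxon:parse-html() extension mentioned in the first answer above. A sketch only, assuming Saxon with TagSoup on the classpath and XPath 2.0 (TagSoup puts the parsed elements in the XHTML namespace, hence the *: wildcard):
<xsl:value-of xmlns:saxon="http://saxon.sf.net/"
    select="saxon:parse-html(string(description))//*:p"/>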