Eliminate JavaScript from HTML with XSLT

I am trying to transform an HTML report into XML, but some JavaScript in the file causes parse errors because of statements containing a less-than character (e.g., for(var i=0; i<els.length;i++) ). I thought I could eliminate the JavaScript with the following template, which should remove entire script nodes:
<xsl:template match="script"/>
I assumed the XSLT processor would simply skip over the entire script nodes, but it's still throwing the same errors. I also tried adding this one:
<xsl:template match="script/text()"/>
No luck. If I manually remove all the JavaScript from the file, my transform works, but that's not practical: I need to run a daily automated process on these HTML files to extract data from the HTML tables.

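As a general rule, XSLT will only process well-formed XML input: it's not designed to process other formats like HTML. The input is parsed before any template rule is applied, so a parse error in the script content occurs before a template such as match="script" gets a chance to run.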
However, XSLT will generally accept input from a parser that delivers a stream of events that looks sufficiently like an XML stream. This allows parsers like TagSoup and validator.nu to be used as a front-end to your XSLT processor.
Saxon packages this up with a saxon:parse-html() extension function that invokes TagSoup to parse HTML input and turn it into a DOM-like tree (actually an XDM tree) that it can process as if it came from XML.
validator.nu is a more up-to-date HTML parser than TagSoup, but you would have to do a little more work to integrate that.
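As a rough illustration of the Saxon route, here is a minimal sketch that reads the HTML file as text, hands it to saxon:parse-html(), and then drops script elements with an ordinary template. The parameter name html-uri, the file name report.html, the wrapper element tables, and the entry template name main are assumptions made for this example. Depending on the Saxon version and edition, the extension function may not be available, and TagSoup has to be on the classpath.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:saxon="http://saxon.sf.net/"
    exclude-result-prefixes="saxon">

  <!-- Hypothetical parameter: location of the HTML report -->
  <xsl:param name="html-uri" select="'report.html'"/>

  <!-- Start from a named template, because there is no XML source document -->
  <xsl:template name="main">
    <!-- Read the file as plain text, then let the HTML parser build a tree -->
    <xsl:variable name="doc" select="saxon:parse-html(unparsed-text($html-uri))"/>
    <tables>
      <!-- *:table matches whatever namespace the parser assigns to HTML elements -->
      <xsl:apply-templates select="$doc//*:table"/>
    </tables>
  </xsl:template>

  <!-- Copy everything by default -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- The input now parses, so script elements can be dropped in the usual way -->
  <xsl:template match="*:script"/>
</xsl:stylesheet>
```

With the Saxon command line, a stylesheet like this would be started from the named template (the -it:main option) rather than with a source document.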

Question was answered by Martin Honnen in the comments:
oxygenxml.com/doc/versions/18.1/ug-editor/tasks/… suggests there is an HTML import feature, so try whether that helps. There are also standalone applications like HTML Tidy that I think you can use outside of the XSLT processing to first convert your HTML to XHTML.

Related

XSL disable-output-escaping XML SPY vs SAXON

I need help with my XSLT.
I have an XML with encoded HTML tags with a tag:
Using XmlSpy (Altova) this DOES work:
<xsl:value-of select="de" disable-output-escaping="yes"/>
which returns html tags within the data tag.
But executing this XSL on SAXON does not work. The XSL is executed and returns output, but the output-escaping seems to be ignored.
Any ideas?
The key thing to remember is that disable-output-escaping is an instruction to the serializer, and it has no effect unless the XSLT processor is serializing the output. The most common reason for it "not working" is that the transformation output is going to a destination other than the serializer (for example, a DOM tree). So we need to know how you are running the transformation.
Also related to this, there have been changes to the spec regarding what happens if you use disable-output-escaping while writing to a temporary tree (that is, to a variable).
Processors are allowed to ignore disable-output-escaping entirely, but Saxon doesn't do that, except of course when the output isn't serialized. (That's because "escaping" is a serialization thing, and if you're not serializing, then you're not escaping anything, so there is nothing to disable).
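To make the dependency on serialization concrete, here is a minimal sketch of a stylesheet that uses the attribute on the final result tree. The element names data and de are assumptions based on the question. If the processor writes this result through its serializer (to a file or stream), the escaped markup in de can come out as real tags. If the same result is delivered to a DOM or another non-serialized destination, the attribute has nothing to act on.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- The serializer is what applies (or skips) escaping -->
  <xsl:output method="xml" indent="no"/>

  <xsl:template match="/">
    <data>
      <!-- Takes effect only when this result is serialized by the processor -->
      <xsl:value-of select="//de" disable-output-escaping="yes"/>
    </data>
  </xsl:template>
</xsl:stylesheet>
```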

Is it possible to generate both HTML and Wiki markup at the same time using XSLT?

I would like to generate both HTML and Wiki markup at the same time using XSLT (from an XML source document) - just wondering if it's possible. It would be nice if I could use the same XSLT to do both rather than writing/maintaining two separate files.
The HTML report will be for general viewing, and the Wiki markup will be published to Confluence.
If you want to create more than one result document using a single stylesheet, then XSLT 2.0 and later support that using xsl:result-document; see the specification at http://www.w3.org/TR/xslt20/#creating-result-trees. As you then want to process the same elements twice, you usually also make use of modes to separate the different processing, e.g. use one mode to produce HTML and the other mode to produce Wiki markup.
With pure XSLT 1.0 you can only create a single result document; however, some XSLT 1.0 processors, like Xalan (http://xml.apache.org/xalan-j/extensions_xsltc.html#redirect_ext) or xsltproc (http://exslt.org/exsl/elements/document/index.html), support an extension to create more than one result document.
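A minimal XSLT 2.0 sketch of the xsl:result-document plus modes approach might look like this. The source element names (report, section, title, para) and the output file names are hypothetical, and the h1. heading syntax is only a small sample of Confluence wiki markup.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Entry point: process the same source twice, once per output format -->
  <xsl:template match="/">
    <xsl:result-document href="report.html" method="html">
      <html>
        <body>
          <xsl:apply-templates select="report/section" mode="html"/>
        </body>
      </html>
    </xsl:result-document>
    <xsl:result-document href="report.wiki" method="text">
      <xsl:apply-templates select="report/section" mode="wiki"/>
    </xsl:result-document>
  </xsl:template>

  <!-- HTML rendering of a section -->
  <xsl:template match="section" mode="html">
    <h1><xsl:value-of select="title"/></h1>
    <p><xsl:value-of select="para"/></p>
  </xsl:template>

  <!-- Wiki rendering of the same section -->
  <xsl:template match="section" mode="wiki">
    <xsl:text>h1. </xsl:text>
    <xsl:value-of select="title"/>
    <xsl:text>&#10;&#10;</xsl:text>
    <xsl:value-of select="para"/>
    <xsl:text>&#10;&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

Both documents come from one stylesheet and one pass over the source, and each format's rules stay separate because a template is only chosen when its mode matches. The relative href values are resolved against the base output URI, so where the files end up depends on how the transformation is invoked.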

I want to fix a tag that is not closed, using XPath

I want to handle a tag that is not closed, like this:
<figure class="img"><img class="immagine-in-linea-senza-cornice" width="16%" src="images/schema_1_fmt.jpeg" alt=""/>
I want to close the tag with an XSLT transformation.
XPath does not work directly on the input document, but on an abstract, tree-like representation of the document (e.g. XDM or DOM). In this model, opening and closing tags of an element are not represented at all. Instead, an element appears as a single entity in the tree.
Therefore, manipulating < and /> is completely out of the question for languages like XPath, because the concept of opening and closing tags is simply not implemented. I would argue that this abstraction is an advantage of the models, though.
Also, XSLT transformations normally take as input XML documents. If your document has unclosed elements, it will be rejected by any application that is only prepared to process XML.
In short, fix the XML document with a tool other than the combination of XSLT and XPath, and think about XSLT as soon as you have well-formed XML as input.

How to open non-valid HTML (from Wikipedia) via document() in XSLT?

I use XSLT 1.0 to extract information from Wikipedia infoboxes, and, for certain links, fetch additional information from further Wikipedia sites.
In principle, this works fine, unless the HTML returned for the Wikipedia pages is invalid. Unfortunately, this happens for all pages in, e.g., the Russian Wikipedia. Try the following example
<xsl:for-each
select="document('http://ru.wikipedia.org/wiki/%D0%91%D0%B0%D0%B4%D0%B5%D0%BD_%D0%BA%D1%83%D0%BB%D1%8C%D1%82%D1%83%D1%80%D0%B0')//text()">
<xsl:value-of select="."/>
</xsl:for-each>
The trouble is that the entity &reg; is used on every page in this language edition, but not declared: The HTML declaration of Wikipedia pages is crippled.
<!DOCTYPE html>
Instead of, say,
<!DOCTYPE html SYSTEM "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
This is clearly a Wikipedia issue, not an XSLT issue, but is there any workaround to parse these sites nevertheless? Any pointers to a more robust XSLT parser? Is there any way to infuse entity declarations into the HTML before it gets parsed?
So far, I have tried xsltproc, Saxon 6.5.5, Saxon-B 9.1.0.8, and Xalan, all with the same result.
Saxon and Xalan (I don't know about xsltproc) allow you to supply a URIResolver to handle document() requests. This is allowed to return any Source object. To process HTML input, return a SAXSource whose XMLReader is actually an HTML parser. There are a couple of candidates, TagSoup and validator.nu - the latter is probably better as it claims to implement the HTML5 parsing algorithm. The XSLT processor will then think it is dealing with well-formed XML.
Alternatively, in Saxon there is a saxon:parse-html() extension function. This in fact uses TagSoup underneath.

XSLT - how to deal with <![CDATA[

I'm trying to output some XML using XSLT, however I've just come across this:
<description><![CDATA[<p>Using Money – recognise coins, getting change, paper money etc. A PowerPoint resource containing colour coded levels to suit different abilities – special needs. Self checking and interactive.</p>]]></description>
How do I output the actual HTML, not the escaped &lt;p&gt;, but as if it were HTML?
You can use disable-output-escaping. Beware, though, that if the input value is not well-formed or valid, the output won't be either.
<xsl:value-of select="description" disable-output-escaping="yes"/>
XSLT handles data already parsed by the XML parser. The parser delivers the content of a CDATA section as plain text, not as markup. You might need to do some pre-processing to remove the CDATA tags before turning over the XML to XSLT.