XSLT - how to deal with <![CDATA[ - xslt

I'm trying to output some XML using XSLT, but I've just come across this:
<description><![CDATA[<p>Using Money – recognise coins, getting change, paper money etc. A PowerPoint resource containing colour coded levels to suit different abilities – special needs. Self checking and interactive.</p>]]></description>
How do I output the content as actual HTML markup, so that the result contains a real <p> element rather than the escaped &lt;p&gt;?

You can use disable-output-escaping. Beware, though, that if the input value is not well-formed or valid, the output won't be either.
<xsl:value-of select="description" disable-output-escaping="yes"/>

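XSLT operates on data that has already been parsed by the XML parser. The parser treats the content of a CDATA section as plain text, so the stylesheet sees a text node containing the characters <p>...</p>, not a p element. You might need to do some pre-processing to remove the CDATA markers before handing the XML over to XSLT.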
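To see what the stylesheet actually receives, here is a minimal sketch in Python (standard library only, standing in for the XML parser that sits in front of an XSLT processor; the shortened sample text is an assumption). It suggests that the CDATA markers are gone after parsing and that an ordinary serializer escapes the text again:

```python
import xml.etree.ElementTree as ET

# Shortened version of the <description> element from the question.
doc = "<description><![CDATA[<p>Using Money</p>]]></description>"

root = ET.fromstring(doc)

# The CDATA section arrives as plain character data: no markers, no child elements.
text = root.text
child_count = len(root)

# A normal serializer escapes markup characters found in text nodes.
serialized = ET.tostring(root, encoding="unicode")

print(text)         # <p>Using Money</p>
print(child_count)  # 0
print(serialized)   # <description>&lt;p&gt;Using Money&lt;/p&gt;</description>
```

This is why the paragraph shows up escaped in the output unless escaping is disabled or the markup is turned into real elements before the transformation.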

Related

XSL disable-output-escaping XML SPY vs SAXON

I need help with my XSLT.
I have an XML document with encoded HTML tags inside an element.
Using XmlSpy (Altova) this DOES work:
<xsl:value-of select="de" disable-output-escaping="yes"/>
which returns HTML tags within the data element.
But running the same stylesheet with Saxon does not work. The stylesheet is executed and returns output, but disable-output-escaping seems to be ignored.
Any ideas?
The key thing to remember is that disable-output-escaping is an instruction to the serializer, and it has no effect unless the XSLT processor is serializing the output. The most common reason for it "not working" is that the transformation output is going to a destination other than the serializer (for example, a DOM tree). So we need to know how you are running the transformation.
Also related to this, there have been changes to the spec regarding what happens if you use disable-output-escaping while writing to a temporary tree (that is, to a variable).
Processors are allowed to ignore disable-output-escaping entirely, but Saxon doesn't do that, except of course when the output isn't serialized. (That's because "escaping" is a serialization thing, and if you're not serializing, then you're not escaping anything, so there is nothing to disable).
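The distinction between the result tree and the serializer can be modeled with a small sketch in Python. It is only an illustration under stated assumptions: `serialize_text_element` is a hypothetical mini-serializer written for this example, not part of Saxon or any XSLT API, and it handles only an element with text content.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def serialize_text_element(elem, disable_output_escaping=False):
    """Hypothetical mini-serializer for an element that has only text content."""
    text = elem.text or ""
    if not disable_output_escaping:
        # Escaping happens here, at serialization time, not in the tree.
        text = escape(text)
    return f"<{elem.tag}>{text}</{elem.tag}>"


# The tree just stores characters; there is nothing "escaped" about them.
de = ET.Element("de")
de.text = "<b>bold</b>"

tree_value = de.text
escaped = serialize_text_element(de)
raw = serialize_text_element(de, disable_output_escaping=True)

print(tree_value)  # <b>bold</b>
print(escaped)     # <de>&lt;b&gt;bold&lt;/b&gt;</de>
print(raw)         # <de><b>bold</b></de>
```

If the caller takes the tree (`de`) and never runs the serializer, the flag has nothing to act on, which matches the "output goes to a DOM" case described above.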

Eliminate javascript from HTML with XSLT

I am trying to transform an HTML report into XML, but some javascript in the file is throwing errors, due to statements with a less-than character (e.g., for(var i=0; i<els.length;i++) ). I thought I could eliminate the javascript with the following template, which should remove entire script nodes:
<xsl:template match="script"/>
I assumed the XSLT processor would simply skip over the entire script nodes, but it's still throwing the same errors. I also tried adding this one:
<xsl:template match="script/text()"/>
No luck. If I manually remove all the javascript from the file, my transform works, but that's not practical as I need to create and run a daily automated process on these HTML files to extract some data in the HTML tables.
As a general rule, XSLT will only process well-formed XML input: it's not designed to process other formats like HTML.
However, XSLT will generally accept input from a parser that delivers a stream of events that looks sufficiently like an XML stream. This allows parsers like TagSoup and validator.nu to be used as a front-end to your XSLT processor.
Saxon packages this up as a saxon:parse-html() extension function that invokes TagSoup to parse HTML input and turn it into a DOM-like tree (actually an XDM tree) that it can process as if it came from XML.
validator.nu is a more up-to-date HTML parser than TagSoup, but you would have to do a little more work to integrate that.
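The same idea can be sketched in Python, with the standard library's lenient `html.parser` standing in for TagSoup or validator.nu (an assumption for illustration; the sample page and the `CellCollector` class are hypothetical). A strict XML parser rejects the script, while the HTML parser reads past it and still reaches the table cells:

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical report page with a bare "<" inside the script.
HTML = """<html><head><script>
for (var i = 0; i < els.length; i++) { run(els[i]); }
</script></head>
<body><table><tr><td>alpha</td><td>beta</td></tr></table></body></html>"""


def parses_as_xml(text):
    """Return True only if the text is well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


class CellCollector(HTMLParser):
    """Collects the text of every <td>, ignoring everything else."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Script text also arrives here, but in_cell is False at that point.
        if self.in_cell:
            self.cells[-1] += data


xml_ok = parses_as_xml(HTML)

collector = CellCollector()
collector.feed(HTML)
collector.close()
cells = collector.cells

print(xml_ok)  # False
print(cells)   # ['alpha', 'beta']
```

The template `<xsl:template match="script"/>` never gets a chance to run in the original setup because the failure happens in the parser, before any tree exists for XSLT to match against.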
Question was answered by Martin Honnen in the comments:
oxygenxml.com/doc/versions/18.1/ug-editor/tasks/… suggests there is an HTML import feature, so try whether that helps. Of course, there are also standalone applications like HTML Tidy that I think you can use outside of the XSLT processing to first convert your HTML to XHTML.

I want to find an unclosed tag with XPath

I want to find an unclosed tag with XPath, like this:
<figure class="img"><img class="immagine-in-linea-senza-cornice" width="16%" src="images/schema_1_fmt.jpeg" alt=""/>
I want to close the tag with an XSLT transformation.
XPath does not work directly on the input document, but on an abstract, tree-like representation of the document (e.g. XDM or DOM). In this model, opening and closing tags of an element are not represented at all. Instead, an element appears as a single entity in the tree.
Therefore, manipulating < and /> is completely out of the question for languages like XPath, because the concept of opening and closing tags is simply not implemented. I would argue that this abstraction is an advantage of the models, though.
Also, XSLT transformations normally take as input XML documents. If your document has unclosed elements, it will be rejected by any application that is only prepared to process XML.
In short, fix the XML document with a language other than the combination of XSLT and XPath (see e.g. here), and think about XSLT as soon as you have well-formed XML as input.
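Both points can be checked with a short Python sketch (standard library only; the trimmed-down markup is an assumption based on the snippet in the question). Two spellings of the same empty element produce an identical tree, and the unclosed figure never becomes a tree at all:

```python
import xml.etree.ElementTree as ET

# Same element written with a self-closing tag and with an explicit end tag.
a = ET.fromstring('<figure class="img"><img alt=""/></figure>')
b = ET.fromstring('<figure class="img"><img alt=""></img></figure>')

# The tree keeps elements and attributes, not the tags that were typed.
same_tree = ET.tostring(a) == ET.tostring(b)

# The input from the question: <figure> is opened but never closed.
unclosed = '<figure class="img"><img src="images/schema_1_fmt.jpeg" alt=""/>'
try:
    ET.fromstring(unclosed)
    rejected = False
except ET.ParseError:
    rejected = True

print(same_tree)  # True
print(rejected)   # True
```

Because the parser rejects the document outright, there is nothing for an XPath expression to select, so the repair has to happen before XML tooling is involved.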

How to open non-valid HTML (from Wikipedia) via document() in XSLT?

I use XSLT 1.0 to extract information from Wikipedia infoboxes, and, for certain links, fetch additional information from further Wikipedia sites.
In principle, this works fine, unless the HTML returned for the Wikipedia pages is invalid. Unfortunately, this happens for all pages in, e.g., the Russian Wikipedia. Try the following example
<xsl:for-each
select="document('http://ru.wikipedia.org/wiki/%D0%91%D0%B0%D0%B4%D0%B5%D0%BD_%D0%BA%D1%83%D0%BB%D1%8C%D1%82%D1%83%D1%80%D0%B0')//text()">
<xsl:value-of select="."/>
</xsl:for-each>
The trouble is that the entity &reg; is used on every page in this language edition but is never declared. The DOCTYPE declaration of Wikipedia pages is reduced to:
<!DOCTYPE html>
Instead of, say,
<!DOCTYPE html SYSTEM "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
This is clearly a Wikipedia issue, not an XSLT issue, but is there any workaround to parse these sites nevertheless? Any pointers to a more robust XSLT parser? Is there any way to infuse entity declarations into the HTML before it gets parsed?
So far, I have tried xsltproc, Saxon 6.5.5, Saxon-B 9.1.0.8, and Xalan, all with the same result.
Saxon and Xalan (I don't know about xsltproc) allow you to supply a URIResolver to handle document() requests. This is allowed to return any Source object. To process HTML input, return a SAXSource whose XMLReader is actually an HTML parser. There are a couple of candidates, TagSoup and validator.nu - the latter is probably better as it claims to implement the HTML5 parsing algorithm. The XSLT processor will then think it is dealing with well-formed XML.
Alternatively, in Saxon there is a saxon:parse-html() extension function. This in fact uses TagSoup underneath.
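On the question of infusing entity declarations before parsing: a rough sketch of that pre-processing idea, in Python rather than XSLT, might look like the following. It assumes the page is otherwise well-formed XML (which real Wikipedia pages may not be); the sample page, `ENTITY_DECLS`, and `inject_entities` are hypothetical names made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: bare HTML5 doctype, but uses an undeclared entity.
PAGE = "<!DOCTYPE html>\n<html><body><p>Brand&reg; name</p></body></html>"

# Assumed internal subset; only the entities the page actually uses are needed.
ENTITY_DECLS = '<!ENTITY reg "&#174;">'


def inject_entities(markup):
    """Rewrite the first bare doctype so that it declares the missing entities."""
    return markup.replace(
        "<!DOCTYPE html>", f"<!DOCTYPE html [{ENTITY_DECLS}]>", 1
    )


def try_parse(markup):
    """Return the root element, or None if the markup is not well-formed."""
    try:
        return ET.fromstring(markup)
    except ET.ParseError:
        return None


before = try_parse(PAGE)                  # fails: &reg; is undefined
after = try_parse(inject_entities(PAGE))  # parses once the entity is declared
paragraph = after.find("body/p").text

print(before is None)  # True
print(paragraph)       # Brand® name
```

This only addresses undeclared entities. For pages that are not well-formed in other ways, a real HTML parser, as suggested in the answers above, is likely the more robust route.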

Can XSLT be used to apply CSS styles?

I have some XML and a very small XSLT to convert that into HTML. When I import my XML content in InDesign using the XSLT, I can see the styles are applied to the elements on the left hand browsing side but, when I drag and drop the elements in the InDesign frames, nothing is happening. The content is flowing normally.
My question is, in InDesign, is XSLT getting used only for sequencing the elements or can we use XSLT to apply the styles (like font-size, line-spacing etc.) as well for elements?
Also, if you can send me any sample XSLT for converting an XML to HTML tags or any example, that will be great.
In general, formatting in InDesign has nothing in common with CSS styles; CSS is an HTML construct, not an XML one. With XSLT you can indeed only reorder elements (and do other element-wise work, such as removing, replacing, or adding tags).
Formatting can be applied to the tags after you imported/translated your XML using Map Styles To Tags (or Map Tags to Styles; I don't think I've ever used either).
You can use HTML within XSLT, so if you have something like:
<xsl:value-of select="node"/>
Then this can also be written like:
<div class='style'><xsl:value-of select="node"/></div>
Or you can use inline CSS like:
<div style='color:red;'><xsl:value-of select="node"/></div>
Hope this helps!