Force HTML Tidy to output XML (instead of XHTML), or force XSLTproc to parse XHTML files - xslt

I have a large number of HTML files that I need to process with XSLT, using an XML file to choose which HTML files, and what we're doing with them.
I tried:
Use HTML Tidy to convert HTML -> XHTML / XML
Use document(filename) in XSLT to read in particular XHTML/XML files
...use standard nodeset commands to access e.g. "html/body/*"
This doesn't work, because:
It seems that XSLT (tried: libXSLT/xsltproc ... and Saxon) cannot process XHTML documents as external files (it sees the xhtml DOCTYPE, and refuses to parse it as nodes).
Fine (I thought) ... XHTML is just XML, I just need to put it through HTML Tidy and say:
"output-xml yes ... output-html no ... output-xhtml no"
...but HTML Tidy ignores you if you attempt that, and forces html instead :(. It seems to be hardcoded to only output XML files if the input was XML to begin with.
Any ideas for how to:
Force HTML Tidy to obey the command-line parameters, and set the doctype I asked for
Force XSLTproc to parse xhtml DOCTYPEs as xml
...some other cunning way that will work?
NB: this has to work on OS X - it's part of a build process for iOS apps. That shouldn't be a big problem, but e.g. any windows-only tools aren't available. I'd like to achieve this with standard open-source cross-platform tools (like tidy, libxslt, etc)

I finally discovered why XSLTproc / Saxon were refusing to parse the files if they were passed-in with a DOCTYPE html:
The DOCTYPE of the external document alters how they interpret the xmlns (namespace) declaration. Tidy was declaring (correctly) "xmlns=...the xhtml: namespace", so all my node names were ... I don't know: non-existent? ... inside my XSLT. XSLT was just ignoring them, as if they didn't exist: it needed me to provide a compatible mapping to the same namespace.
...strangely, if the DOCTYPE was xml, then they happily ignored the xmlns declaration, or they allowed me to reference nodes by unqualified name. This fooled me into thinking that they were point-blank ignoring the nodesets inside the xhtml DOCTYPE'd version.
So, the "solution" is something like this:
modify your XSLT stylesheet to ALSO declare the "xhtml" namespace - NB: this is required so that you can reference the nodes in the external files
write all your XSL match / select / template rules with the "xhtml" prefix on every element name (not on attributes: unprefixed attributes such as href or class are in no namespace, so @href still matches as-is)
let Tidy output whatever it wants: it doesn't matter, it'll Just Work, once you have the namespace support in there
Example code:
Your stylesheet goes from this:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
...to this:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xhtml="http://www.w3.org/1999/xhtml">
Your select / match / document-import goes from this:
<xsl:copy-of select="document('html-files/file1.htm')/html/body"/>
...to this:
<xsl:copy-of select="document('html-files/file1.htm')/xhtml:html/xhtml:body"/>
NB: just to be clear: if you ignore namespaces, then it seems XSLT will work on files that are unDOCTYPED, even if they have a namespace in them. Don't make the mistake I made of thinking your XSLT is correct just because it appears to be :)
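The same namespace behaviour can be reproduced outside XSLT. Here is a minimal sketch, assuming Python's standard xml.etree.ElementTree and a made-up sample document (not my real files): the unqualified path finds nothing, the namespace-qualified one works, and the attribute test needs no prefix.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Made-up sample, roughly what Tidy emits in XHTML mode (assumption).
SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>t</title></head>"
    '<body><p class="x">Testing</p></body>'
    "</html>"
)

root = ET.fromstring(SAMPLE)
ns = {"xhtml": XHTML_NS}  # the prefix is arbitrary, the URI is what matters

# Unqualified name: there is no element called "body" in no namespace.
unqualified = root.find("body")

# Qualified name: matches the element in the XHTML namespace.
qualified = root.find("xhtml:body", ns)

# Elements take the prefix, the unprefixed attribute does not.
para = root.find("xhtml:body/xhtml:p[@class='x']", ns)

print(unqualified)    # None
print(qualified.tag)  # {http://www.w3.org/1999/xhtml}body
print(para.text)      # Testing
```

The prefix used in the lookup does not have to match anything in the document; only the namespace URI has to agree, which is the same rule the xmlns:xhtml declaration in the stylesheet relies on.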

XHTML is XML (if it is valid).
To get your XHTML processed as XML, you must not serve it as "text/html". Use the application/xhtml+xml MIME type instead (keep in mind that IE6 cannot render this and will prompt a download for your site).
In PHP you can serve it as application/xhtml+xml with the header() function.
I think this should do the trick:
header('Content-Type: application/xhtml+xml');
Does this help?

If you run xsltproc --help, among the accepted input flags is a very conspicuous one called --html which supposedly tells xsltproc that:
--html: the input document is(are) an HTML file(s)
Presumably for this to work you must have valid HTML files to begin with, though. So you might want to tidy them up first.

I think the main problem is given by the XML catalog doctype declaration. You can test this by removing the external entity reference in the input XHTML and see if the processor correctly works with it.
I would do as follows:
Use Tidy with doctype omit option.
Add the Doctype at XSLT side as described here
The main problem is that Saxon and xsltproc do not have an option to disable external entity resolution, as far as I know (xsltproc's --novalid only skips loading the DTD). The MSXSL.exe command-line utility supports this with its -xe option.
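To try the "remove the DOCTYPE and see" test without re-running Tidy, something like this might do. It is only a sketch in Python (standard library only); strip_doctype and the sample string are made up for illustration, and the regex deliberately does not handle a DOCTYPE with an internal subset.

```python
import re
import xml.etree.ElementTree as ET

# Matches a DOCTYPE declaration without an internal subset ("[...]").
DOCTYPE_RE = re.compile(r"<!DOCTYPE[^>\[]*>", re.IGNORECASE)

def strip_doctype(markup):
    """Return markup with the first DOCTYPE declaration removed (hypothetical helper)."""
    return DOCTYPE_RE.sub("", markup, count=1)

# Made-up sample input with the XHTML 1.0 Transitional DOCTYPE.
SAMPLE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"\n'
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n'
    '<html xmlns="http://www.w3.org/1999/xhtml"><body><p>Testing</p></body></html>'
)

stripped = strip_doctype(SAMPLE)
root = ET.fromstring(stripped)  # parses with no external DTD reference left
print(root.tag)
```

If the transform behaves the same with and without the DOCTYPE, the external DTD is probably not the culprit.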

It's been a while, but I remember trying to use HTMLTidy to prep HTML files for XSLT and was disappointed by how easily it gave up while trying to "well form" the HTML. Then I found TagSoup, and was very pleased.
TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.
I don't know if you're bound to HTMLTidy, but if not try this: http://home.ccil.org/~cowan/tagsoup/
As an example, here's a bad HTML file:
<body>
<p>Testing
</body>
And here's the tagsoup command and its output:
~ zyoung$ java -jar /usr/local/tagsoup-1.2.jar --html bad.html
src: bad.html
<html><body>
<p>Testing
</p></body></html>
Edit 01
Here is how tagsoup handles DOCTYPEs.
Here's a bad HTML file with a valid DOCTYPE:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<body>
<p>Testing
</body>
</html>
Here's how tagsoup handles it:
~ zyoung$ java -jar /usr/local/tagsoup-1.2.jar --html bad.html
src: bad.html
<html><body>
<p>Testing
</p></body></html>
It isn't until you explicitly pass a DOCTYPE to tagsoup that it attempts to output one:
~ zyoung$ java -jar /usr/local/tagsoup-1.2.jar --html --doctype-public=html bad.html
src: bad.html
<!DOCTYPE PUBLIC "html" "">
<html><body>
<p>Testing
</p></body></html>
I hope this helps,
Zachary

Related

Eliminate javascript from HTML with XSLT

I am trying to transform an HTML report into XML, but some javascript in the file is throwing errors, due to statements with a less-than character (e.g., for(var i=0; i<els.length;i++) ). I thought I could eliminate the javascript with the following template, which should remove entire script nodes:
<xsl:template match="script"/>
I assumed the XSLT processor would simply skip over the entire script nodes, but it's still throwing the same errors. I also tried adding this one:
<xsl:template match="script/text()"/>
No luck. If I manually remove all the javascript from the file, my transform works, but that's not practical as I need to create and run a daily automated process on these HTML files to extract some data in the HTML tables.
As a general rule, XSLT will only process well-formed XML input: it's not designed to process other formats like HTML.
However, XSLT will generally accept input from a parser that delivers a stream of events that looks sufficiently like an XML stream. This allows parsers like TagSoup and validator.nu to be used as a front-end to your XSLT processor.
Saxon packages this up with a parse-html() function that invokes TagSoup to parse HTML input and turn it into a DOM-like tree (actually an XDM tree) that it can process as if it came from XML.
validator.nu is a more up-to-date HTML parser than TagSoup, but you would have to do a little more work to integrate that.
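To illustrate the "lenient parser in front of the XSLT processor" idea, here is a rough sketch. It uses Python's standard html.parser rather than TagSoup or validator.nu, so it is an analogy and not what Saxon does; HtmlToTree, html_to_xml, the wrapper element name "document" and the sample report are all made up. It drops script elements and writes well-formed XML that an XSLT processor could then read.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have content or an end tag in HTML.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}

class HtmlToTree(HTMLParser):
    """Build an ElementTree from tag soup, skipping <script> (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("document")  # wrapper so there is always one root
        self.stack = [self.root]
        self.in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            return
        elem = ET.SubElement(self.stack[-1], tag, {k: (v or "") for k, v in attrs})
        if tag not in VOID:
            self.stack.append(elem)

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
            return
        # Close any unclosed children; ignore end tags that were never opened.
        if tag in [e.tag for e in self.stack[1:]]:
            while self.stack.pop().tag != tag:
                pass

    def handle_data(self, data):
        if self.in_script:
            return  # the JavaScript, "<" characters included, is dropped here
        parent = self.stack[-1]
        if len(parent):
            parent[-1].tail = (parent[-1].tail or "") + data
        else:
            parent.text = (parent.text or "") + data

def html_to_xml(markup):
    parser = HtmlToTree()
    parser.feed(markup)
    parser.close()
    return ET.tostring(parser.root, encoding="unicode")

REPORT = (
    "<body><script>for (var i=0; i<els.length; i++) {}</script>"
    "<table><tr><td>42</td></tr></table><p>Testing</body>"
)
xml_text = html_to_xml(REPORT)
print(xml_text)
```

A real HTML parser applies many more repair rules than this (implied tags, misnested formatting elements and so on), so for a daily production job one of the parsers named above is the safer choice.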
Question was answered by Martin Honnen in the comments:
oxygenxml.com/doc/versions/18.1/ug-editor/tasks/… suggests there is an HTML import feature so try whether that helps. Of course there are standalone applications like HTML Tidy I think you can use outside of the XSLT processing to first convert your HTML to XHTML.

DocBook XSL chunking

Can anyone point me to the part of this file that controls chunking?
http://docbook4j.googlecode.com/svn-history/r4/trunk/docbook4j/src/main/resources/xsl/docbook/webhelp/xsl/webhelp-common.xsl
I can't relate what I find in this doc to the code I see in that file:
http://www.sagehill.net/docbookxsl/Chunking.html
I want to modify the XSL so that it chunks at <?confluence type="page" ?> instead of <section xml:id= ...>.
The main XSL stylesheet DocBook Webhelp uses is webhelp.xsl. It includes webhelp-common.xsl and imports the xhtml chunk.xsl. The chunking code lives in chunk.xsl and the stylesheets it imports, especially chunk-common.xsl.

How to open non-valid HTML (from Wikipedia) via document() in XSLT?

I use XSLT 1.0 to extract information from Wikipedia infoboxes, and, for certain links, fetch additional information from further Wikipedia sites.
In principle, this works fine, unless the HTML returned for the Wikipedia pages is invalid. Unfortunately, this happens for all pages in, e.g., the Russian Wikipedia. Try the following example
<xsl:for-each
select="document('http://ru.wikipedia.org/wiki/%D0%91%D0%B0%D0%B4%D0%B5%D0%BD_%D0%BA%D1%83%D0%BB%D1%8C%D1%82%D1%83%D1%80%D0%B0')//text()">
<xsl:value-of select="."/>
</xsl:for-each>
The trouble is that the entity &reg; is used on every page in this language edition, but not declared: the DOCTYPE declaration of Wikipedia pages is crippled.
<!DOCTYPE html>
Instead of, say,
<!DOCTYPE html SYSTEM "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
This is clearly a Wikipedia issue, not an XSLT issue, but is there any workaround to parse these sites nevertheless? Any pointers to a more robust XSLT parser? Is there any way to infuse entity declarations into the HTML before it gets parsed?
So far, I tried XSLTproc, Saxon6.5.5, Saxon-B 9.1.0.8, and Xalan, all with the same result.
Saxon and Xalan (I don't know about xsltproc) allow you to supply a URIResolver to handle document() requests. This is allowed to return any Source object. To process HTML input, return a SAXSource whose XMLReader is actually an HTML parser. There are a couple of candidates, TagSoup and validator.nu - the latter is probably better as it claims to implement the HTML5 parsing algorithm. The XSLT processor will then think it is dealing with well-formed XML.
Alternatively, in Saxon there is a saxon:parse-html() extension function. This in fact uses TagSoup underneath.
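A lower-tech route than a custom resolver, closer to the question's idea of dealing with the entities before the XML parser sees them, is a pre-pass that rewrites HTML named entities as numeric character references, which need no declaration. This is only a sketch in Python (standard library only): numeric_entities and the sample markup are made up, it does nothing for markup that is not well-formed in other ways, and the page would have to be fetched and rewritten before the stylesheet reads it.

```python
import re
from html.entities import name2codepoint
import xml.etree.ElementTree as ET

# The five entities every XML parser knows without a DTD.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def numeric_entities(markup):
    """Replace HTML named entities with numeric references (hypothetical helper)."""
    def repl(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave XML's own and unknown names alone
        return "&#%d;" % name2codepoint[name]
    return ENTITY_RE.sub(repl, markup)

SAMPLE = "<p>Acme&reg; &amp; co&nbsp;x</p>"  # made-up input
fixed = numeric_entities(SAMPLE)
print(fixed)  # <p>Acme&#174; &amp; co&#160;x</p>

# Without the rewrite, an XML parser rejects the undeclared entity.
try:
    ET.fromstring(SAMPLE)
    raw_parses = True
except ET.ParseError:
    raw_parses = False

parsed = ET.fromstring(fixed)
```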

use javascript (or JQuery) in a standalone HTML file to select an XML and transform

I need a way to transform XML to HTML (using XSL) but without a server. So, I want to create a standalone HTML file (with a hardcoded XSL path and name).
Allow the user to select an XML
Transform it with the XSL and display results in browser
Original XML cannot be changed (so cannot just embed XSL in XML)
Is this possible? Everything I found requires post, but I'm not using a server
Regards
Mark
Yes, it's possible. And you don't need javascript to do it, but you can use javascript if you want.
Just look at the previous XSLT question: https://stackoverflow.com/questions/12964917
Use a processing-instruction like...
<?xml-stylesheet type="text/xsl" href="soccer.xslt"?>
Refer:
Direct linkage through pi: http://www.w3.org/TR/xml-stylesheet/
Transform through javascript:
http://dev.ektron.com/kb_article.aspx?id=482
Calling XSLT from javascript

XSLT: How do I trigger a template when there is no input file?

I'm creating a template which produces output based on a single string, passed via parameter, and does not use an input XML document. xsltproc seems to happily run with a single parameter specifying the stylesheet, but I don't see a way to trigger a template without an input file (no parameter to xsltproc to run a named template, for example).
I'd like to be able to run:
xsltproc --stringparam bar baz foo.xsl
But I'm currently having to run, with the "main" template matching "/":
echo '<xml/>' | xsltproc --stringparam bar baz foo.xsl -
How can I get this to work? I'm sure I've seen other templates in the past which were meant to be run without an input document, but I don't remember how they worked or where to find them again. :-)
Actually, this has been done quite often.
In XSLT 2.0 it is defined in the Spec. that providing an initial context node is optional.
If no initial context node is provided (no source XML document), then it is important to provide the name of a named template which is to be executed as the entry point to the transformation.
In XSLT 1.0 one can provide to the transformation its own primary stylesheet module (file) as the source XML document, and of course, the transformation can completely ignore this source XML document. This technique was demonstrated and used long ago by Jeni Tennison.
For example:
<?xml-stylesheet type="text/xsl" href="example.xml"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
<p>Hello, world!</p>
</xsl:template>
</xsl:stylesheet>
When the above code is saved in a file named "example.xml" and then the folder contents is displayed with Windows Explorer, double-clicking on the file "example.xml" will open IE and produce:
Hello, world!
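Applied to the xsltproc command from the question, the same trick would mean naming the stylesheet twice, once as the stylesheet and once as the input. A small sketch in Python, assuming xsltproc is on the PATH; the helper names are made up, and foo.xsl, bar and baz are taken from the question. Only the command line is built here, and run_self_source is defined but not called.

```python
import subprocess

def xsltproc_self_source_argv(stylesheet, params):
    """Build an xsltproc command that uses the stylesheet as its own input."""
    argv = ["xsltproc"]
    for name, value in params.items():
        argv += ["--stringparam", name, value]
    # xsltproc expects: [options] stylesheet file...
    argv += [stylesheet, stylesheet]
    return argv

def run_self_source(stylesheet, params):
    """Run the transform and return its output; needs xsltproc installed."""
    result = subprocess.run(
        xsltproc_self_source_argv(stylesheet, params),
        capture_output=True, text=True, check=True,
    )
    return result.stdout

argv = xsltproc_self_source_argv("foo.xsl", {"bar": "baz"})
print(" ".join(argv))  # xsltproc --stringparam bar baz foo.xsl foo.xsl
```

The template matching "/" then fires on the stylesheet's own root node and can ignore it entirely, so no dummy document has to be piped in.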
In general, you cannot do this with XSLT - the specification requires there to be an input document, and for the processing to start with applying any available templates to its root node. Some XSLT processors might give a way to do what you want (e.g. execute a named template) as an extension, but I don't know of any, and it doesn't seem that xsltproc is one of them, judging from its man page.
In fact, this sounds pretty dubious in general, as the purpose of using XSLT to produce some output from a plain string input is unclear - it's not the kind of task it's generally good at.