Editing an HTML file like an XML file - XSLT

I need to convert an HTML file to iXBRL format. iXBRL is basically HTML with some embedded nodes, or with parts of the HTML wrapped in iXBRL tags. To do this, I need to search for and remove some nodes from the HTML file and wrap other nodes in iXBRL tags.
I can't use the XML DOM because it throws an exception on the content type. The .NET HtmlDocument class doesn't support removing or replacing nodes, and I couldn't find a Save option either.
I tried the HTML Agility Pack, but it can't find the nodes because of the namespaces in the node names, and it doesn't seem to have an option for specifying namespaces (like the namespace manager in .NET).
Can I specify a namespace in an XPath expression? How?
Can anyone help me edit HTML (or XHTML) files using .NET or any free library?

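If you want to use XPath with namespaces, you just need to prefix the node names with the right namespace prefix. The prefix has to be bound to the namespace URI in whatever API evaluates the expression (in .NET, through an XmlNamespaceManager).
If your namespace declaration looks like this: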
xmlns:xbrli="http://www.xbrl.org/2003/instance"
And your Elements are like this:
<root>
<xbrli:elementname></xbrli:elementname>
</root>
Then you can select them in XPath like this:
//xbrli:elementname
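As a rough illustration of the prefix-to-URI binding, here is a minimal sketch using Python's standard xml.etree.ElementTree rather than the .NET API the question asks about. The sample document, the element text "value", and the helper name find_elements are assumptions made up for the example; only the namespace URI and the element name come from the answer above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the namespace URI is the one quoted above.
XML = """<root xmlns:xbrli="http://www.xbrl.org/2003/instance">
  <xbrli:elementname>value</xbrli:elementname>
</root>"""

# The prefix used in the path is bound to the namespace URI here.
# It does not have to be the same prefix the document itself uses.
NS = {"xbrli": "http://www.xbrl.org/2003/instance"}


def find_elements(xml_text):
    """Return every xbrli:elementname element below the root."""
    root = ET.fromstring(xml_text)
    return root.findall(".//xbrli:elementname", NS)


matches = find_elements(XML)
print([m.text for m in matches])
```

The same idea applies in .NET: the prefix in the XPath expression only resolves once it has been registered against the URI.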

Can I force the XSLT collection function to treat all files as XSL?

I'm currently building an XSLT stylesheet that documents other XSLT stylesheets in a series of folders and sub-folders. My code pulls out specific details about variables, functions, etc. and renders them in an output format. The stylesheets being read are created by a third-party product. Most of them have an XSL extension, but some have proprietary extensions. For example, I have some files with a DTCBS extension that are just XSL stylesheets.
I'm currently loading the content of these files into a variable using the XSLT function collection() as follows:
<xsl:variable name="Collection" select="collection(concat('file:///', encode-for-uri(replace($filePath, '\\', '/')),'?select=*.(xsl|dtcbs|xml);recurse=yes'))" as="node()*"/>
The variable works just fine if I use XSL|XML. But if I include the DTCBS extension, the variable fails with "the supplied value is xs:base64Binary".
If I manually put the XML declaration at the top of my DTCBS file, the variable works fine. Those DTCBS files are auto-generated without the declaration, so I can't fix that, nor can I manually edit them each time I want to run my documenter code.
From what I can tell, because the file doesn't have an XSL extension and the XML declaration isn't present, the XSLT processor thinks the content is base64 when it isn't.
I'm using Saxon as my XSLT processor, and the Saxon documentation says it uses file extensions and HTTP headers to detect the file type.
Does anyone know if there is a way to force collection() to treat every file as XSL?
I tried adding the XML declaration to the DTCBS file. This corrected the issue, but I can't do it in all cases because I am trying to automate the entire thing.
I also renamed the DTCBS extension to XSL, and the problem went away as well.
As well as Martin's suggestion, you can register content types with the Saxon configuration:
processor.getUnderlyingConfiguration()
.registerFileExtension("dtcbs", "application/xml");
This has been available since Saxon 9.7.
Try adding content-type=application/xml to the query string, e.g. '?select=*.(xsl|dtcbs|xml);recurse=yes;content-type=application/xml'.

Eliminate javascript from HTML with XSLT

I am trying to transform an HTML report into XML, but some javascript in the file is throwing errors, due to statements with a less-than character (e.g., for(var i=0; i<els.length;i++) ). I thought I could eliminate the javascript with the following template, which should remove entire script nodes:
<xsl:template match="script"/>
I assumed the XSLT processor would simply skip over the entire script nodes, but it's still throwing the same errors. I also tried adding this one:
<xsl:template match="script/text()"/>
No luck. If I manually remove all the javascript from the file, my transform works, but that's not practical as I need to create and run a daily automated process on these HTML files to extract some data in the HTML tables.
As a general rule, XSLT will only process well-formed XML input: it's not designed to process other formats like HTML.
However, XSLT will generally accept input from a parser that delivers a stream of events that looks sufficiently like an XML stream. This allows parsers like TagSoup and validator.nu to be used as a front-end to your XSLT processor.
Saxon packages this up with a parse-html() function that invokes TagSoup to parse HTML input and turn it into a DOM-like tree (actually an XDM tree) that it can process as if it came from XML.
validator.nu is a more up-to-date HTML parser than TagSoup, but you would have to do a little more work to integrate that.
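To illustrate the "HTML parser as a front-end" idea, here is a minimal sketch in Python using only the standard library. It is not TagSoup, validator.nu, or Saxon's parse-html(); it is a swapped-in stand-in that shows the same principle: a tolerant HTML parser reads the file, script elements are dropped, and what comes out is a well-formed tree that XML tooling can consume. The class name ScriptStripper, the helper html_to_tree_without_scripts, and the sample report are all hypothetical, and the sketch assumes the HTML tags are properly nested.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never have an end tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class ScriptStripper(HTMLParser):
    """Feed HTML in, build an ElementTree without <script> elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.builder = ET.TreeBuilder()
        self.in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            return
        if self.in_script:
            return
        # Valueless attributes (e.g. "disabled") get an empty string.
        self.builder.start(tag, {k: (v or "") for k, v in attrs})
        if tag in VOID:
            self.builder.end(tag)  # close void elements immediately

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
            return
        if self.in_script or tag in VOID:
            return
        self.builder.end(tag)

    def handle_data(self, data):
        # Script bodies (including any "<" characters) are discarded.
        if not self.in_script:
            self.builder.data(data)


def html_to_tree_without_scripts(html_text):
    parser = ScriptStripper()
    parser.feed(html_text)
    parser.close()
    return parser.builder.close()


# Hypothetical report containing the kind of script that breaks an XML parser.
REPORT = """<html><head><script>for (var i = 0; i < els.length; i++) {}</script></head>
<body><table><tr><td>42</td></tr></table><br></body></html>"""

root = html_to_tree_without_scripts(REPORT)
xml_text = ET.tostring(root, encoding="unicode")
```

The serialized xml_text could then be handed to an XSLT processor. For real-world, badly nested HTML a dedicated parser such as the ones named above is the safer choice.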
Question was answered by Martin Honnen in the comments:
oxygenxml.com/doc/versions/18.1/ug-editor/tasks/… suggests there is an HTML import feature, so try whether that helps. Of course, there are also standalone applications like HTML Tidy that I think you can use outside of the XSLT processing to first convert your HTML to XHTML.

Include user control .ascx into xslt

I've created an .ascx user control and I'm trying to find a way to include it in an XSLT rendering. How can I do this? I'm doing it for Sitecore. I thought about creating a placeholder, but placeholders cannot be defined in renderings. I appreciate any help you can provide.
It's not possible to include an ASCX file in an XSLT file, because XSLT transforms XML to HTML, XML, or plain text, but not to ASP.NET pages.
You can include an XSLT file in an ASCX, but not an ASCX in an XSLT file. The best way is to convert your XSLT file into an ASCX file and include your control there, either through placeholders or directly.
I'd suggest avoiding XSLT renderings.
They seem pretty easy to use, but the code is really hard to refactor.
Well, it's not possible to call user controls (.ascx) directly from XSLT files. However, depending on what you want to achieve, you can call .NET methods, known as XSLT extension methods, from an XSLT file. For instance, you may need to write code similar to the following to call a custom .NET GetData() method:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:customObject="urn:yourNamespace">
<new-data>
<xsl:value-of select="customObject:GetData()"/>
</new-data>
...
Of course, the type needs to be registered before it can be used. Type registration can be done in web.config, or dynamically by calling the AddExtensionObject method of the XsltArgumentList class.
Sitecore offers XSL extension controls too; unlike extension methods, these are not a .NET feature. XSL extension controls are XML elements in XSL renderings that correspond to .NET classes. For example, the <sc:text> XSL extension control corresponds to the Sitecore.Web.UI.XslControls.Text .NET class. It is used something like this in an XSLT file:
<sc:text field="title"/>
XSL extension controls are standalone elements in the XSL code.
To register a custom type, add the following to the <xslExtensions> element in web.config:
<extension mode="on" type="NamespaceName.ClassName, AssemblyName" namespace="http://www.w3.org/1999/XSL/Transform" singleInstance="true"/>
Reference: http://sdn.sitecore.net/upload/sitecore6/64/presentation_component_xsl_reference_sc62-64-a4.pdf

Can XSLT be used to apply CSS styles?

I have some XML and a very small XSLT to convert that into HTML. When I import my XML content in InDesign using the XSLT, I can see the styles are applied to the elements on the left hand browsing side but, when I drag and drop the elements in the InDesign frames, nothing is happening. The content is flowing normally.
My question is: in InDesign, is XSLT used only for sequencing the elements, or can we also use XSLT to apply styles (like font-size, line-spacing, etc.) to elements?
Also, if you can send me a sample XSLT for converting XML to HTML tags, or any example, that would be great.
In general, formatting in InDesign has nothing in common with CSS styles; those are an HTML construct, not an XML one. You can indeed only reorder elements (and do other element-wise things, such as removing, replacing, or adding tags).
Formatting can be applied to the tags after you have imported/translated your XML, using Map Styles to Tags (or Map Tags to Styles; I don't think I've ever used either).
You can use HTML within XSLT, so if you have something like:
<xsl:value-of select="node"/>
Then this can also be written like:
<div class='style'><xsl:value-of select="node"/></div>
Or you can use inline CSS like:
<div style='color:red;'><xsl:value-of select="node"/></div>
Hope this helps!

How to deal with presence or not of xml namespaces using xslt

I have some XML/TEI documents, and I'm writing an XSLT 2.0 stylesheet to extract their content.
Almost all of the TEI documents have no namespace, but one has the default namespace (xmlns="http://www.tei-c.org/ns/1.0").
So all the documents look the same, with unqualified tags like <TEI> or <teiHeader>. But when I try to extract the content, everything works for the non-namespaced documents, while nothing (of course) is extracted from the namespaced document.
So I used the attribute xpath-default-namespace="http://www.tei-c.org/ns/1.0", and now (of course) the only document that works is the namespaced one.
I can't edit the documents at all, so what I'm asking is whether there is a way to change xpath-default-namespace dynamically, so that XPaths like //teiHeader work with both namespaced and non-namespaced documents.
If you are using XSLT 2.0, then you do have the option for a wildcard match for the namespace in a node test.
e.g. //*:teiHeader
http://www.w3.org/TR/xpath20/#node-tests
"A node test can also have the form *:NCName. In this case, the node test is true for any node of the principal node kind of the step axis whose local name matches the given NCName, regardless of its namespace or lack of a namespace."
This is functionally equivalent to Dimitre Novatchev's example, but a little shorter/easier to type.
However, this will only work in XSLT/XPATH 2.0.
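For comparison, the same "any namespace or none" match can be tried outside XSLT. The sketch below uses Python's standard xml.etree.ElementTree, whose path syntax has accepted a {*} namespace wildcard since Python 3.8. The two tiny sample documents and the helper name find_headers are assumptions for illustration; they are heavily simplified and not real TEI files.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-ins for the two kinds of document.
PLAIN = "<TEI><teiHeader>plain</teiHeader></TEI>"
NAMESPACED = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0">'
    "<teiHeader>namespaced</teiHeader></TEI>"
)


def find_headers(xml_text):
    """Find teiHeader elements whether or not they are in a namespace."""
    root = ET.fromstring(xml_text)
    # "{*}teiHeader" plays the role of *:teiHeader in XPath 2.0 (Python 3.8+).
    return root.findall(".//{*}teiHeader")


print([e.text for e in find_headers(PLAIN)])
print([e.text for e in find_headers(NAMESPACED)])
```

One query serves both inputs, which is the property the *:teiHeader node test gives you in XSLT 2.0.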
There isn't really a clean way to do precisely what you are asking. However, there are workarounds available. You could use a two stage process whereby you strip the namespace from the document if it's present and then pass it through the same templates for all content.
There is a good example (in XSLT 1) of doing this in the DocBook XSLT. Take a look at html/docbook.xsl and common/stripns.xsl
Basically, you would need to assign the result of stripping the namespace to a variable and then call your existing templates (for the non namespaced) content but select the variable.
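The two-stage idea can be sketched as follows. This is a Python illustration using the standard xml.etree.ElementTree, not the DocBook stripns.xsl stylesheet itself: stage one removes the namespace from every element name, and stage two runs one namespace-free query over the result. The sample documents, the ./teiHeader/title path, and the function names are assumptions invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified inputs: one without and one with the TEI namespace.
PLAIN = "<TEI><teiHeader><title>A</title></teiHeader></TEI>"
NAMESPACED = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0">'
    "<teiHeader><title>B</title></teiHeader></TEI>"
)


def strip_namespaces(root):
    """Stage 1: rewrite '{uri}local' element tags to 'local', in place."""
    for elem in root.iter():
        # Only element tags are strings; anything else is left untouched.
        if isinstance(elem.tag, str) and elem.tag.startswith("{"):
            elem.tag = elem.tag.split("}", 1)[1]
    return root


def header_titles(xml_text):
    """Stage 2: run a single namespace-free query on the stripped tree."""
    root = strip_namespaces(ET.fromstring(xml_text))
    return [t.text for t in root.findall("./teiHeader/title")]


print(header_titles(PLAIN))
print(header_titles(NAMESPACED))
```

Note that this sketch only renames elements; namespaced attributes, if any, are left as they are.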
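It is ugly, but this gives you what you want:
//*[name()='teiHeader']
If you use this style for every location step in your XPath expressions, they will select elements by name only, whether or not the elements belong to a default namespace. Note that name() returns the qualified name including any prefix, so if a document binds the namespace to a prefix (e.g. tei:teiHeader), use local-name() instead.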