document() function for a file on another computer/server - xslt

I understand the use of document() as follows.
<xsl:value-of select="document('path\to\document.xml')/RootElement/Element"/>
And this has to be a path relative to the parent XSL file. But what if I need to reference a file which is hosted on another server on the local network? I've tried things such as:
<xsl:value-of select="document('\\servername\path\to\document.xml')/RootElement/Element"/>
But this throws an error, because it looks in
C:\path\to\xsl\\servername\path\to\document.xml
Which of course doesn't exist.

This solution only relates to the Saxon-HE 9.4.0.3N XSLT processor, in the console application form, on Windows 7.
In my experimentation, I found that the document() function will accept file names or URIs. However I would avoid filenames because they need to be short-form. If you use long-form, the file-name will be rejected.
Suppose your document is ...
c:\path\to\document.xml
on server 'servername' which is mapped to drive 'j'.
To form a URI from this use as the document() parameter value...
file:///j:/path/to/document.xml
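If the path is assembled by a script before the stylesheet is run, the conversion can be automated. This is a minimal sketch in Python, not part of Saxon; the helper name windows_path_to_file_uri is hypothetical, and it assumes the input is a drive-letter path such as the mapped j: path above.

```python
from urllib.parse import quote


def windows_path_to_file_uri(path: str) -> str:
    """Turn a drive-letter Windows path into a file: URI (hypothetical helper)."""
    # Backslashes are not valid URI path separators, so swap them first.
    forward = path.replace("\\", "/")
    # Percent-encode spaces and other reserved characters, but keep the
    # path separators and the colon after the drive letter readable.
    return "file:///" + quote(forward, safe="/:")


if __name__ == "__main__":
    print(windows_path_to_file_uri(r"j:\path\to\document.xml"))
```

The result can then be pasted into, or passed as a parameter to, the document() call.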
In relation to the URI, I was mistaken about Saxon not accepting long-form. This only applies to filenames. However, there are a number of gotchas...
Note the forward slashes. Backslashes will not work.
I have not found a way to build a workable file: URI with just UNC names. You need to make a drive mapping to a letter.
Any failure to open the document for any reason will be reported as the same error. With file system, there are so many things that can go wrong, that if you can't open the file, it is not safe to assume that the URI is wrong. There could be many mundane reasons why a file cannot be opened at a particular time.
Beware of firewall issues. These play a role.
Many text editors, such as Notepad++, assume that a text file with no BOM, and not encoded in one of the two UTF-16 encodings, is encoded in the system code-page. Saxon will make the default assumption that the file is encoded in UTF-8, so if you have a character that looks like this in Notepad++ (ä) with my code-page, Saxon will spit the dummy and report that it is unable to open the file. (Aside: I'm not sure what my code-page is. My o/s is Win7 and the current system locale is English (Australia). It is the system locale that determines the system code-page.) The reason why Saxon will not open the document is that the (ä), encoded in some code-page, results in a sequence of bytes which is not a valid UTF-8 sequence.
URI paths which are not URL paths are not supported by the underlying operating system. Saxon may well truthfully say that it supports URIs in relation to the document() function, but that doesn't boil any cabbages, because in practice, you can't use them. - Well at least not on the windows family of o/s.
Please ignore the MSDN page on the file protocol. The form of URL suggested on that page (with the | character etc) is not accepted by the Saxon document() function. Use the form that I have suggested above. I have tested it and it works.
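To rule out the encoding gotcha before blaming the URI, the file's bytes can be checked for UTF-8 validity first. A minimal sketch in Python; cp1252 is only an assumed stand-in for the system code-page (I said above that I am not sure what mine is), and is_valid_utf8 is a hypothetical helper name.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The same character saved two ways. cp1252 is an assumption here.
code_page_bytes = "ä".encode("cp1252")  # a single byte, 0xE4
utf8_bytes = "ä".encode("utf-8")        # two bytes, 0xC3 0xA4

if __name__ == "__main__":
    print(is_valid_utf8(code_page_bytes))
    print(is_valid_utf8(utf8_bytes))
```

In practice you would read the document with open(path, "rb").read() and pass the bytes to the check.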

Your understanding of document() is incorrect. It expects a URI, not a filename.

Related

Cross-platform way to get user friendly file type description

What's a cross-platform way for getting a user-friendly description of a file?
Examples:
foo.pdf -> "Portable Document Format (PDF)"
bar.doc -> "Microsoft Word Document"
Pointers to libraries or appropriate system APIs would be highly appreciated.
A Qt/C++ solution is preferred but anything is fine.
Target platforms are Windows and Mac OS X. I'd prefer the descriptions to match what would be found in Explorer or Finder if possible (rather than maintaining a map of extensions -> descriptions myself).
The GNU File command is builtin for Linux and OSX, and there is a version available for Windows (http://gnuwin32.sourceforge.net/packages/file.htm).
File tests each argument in an attempt to classify it. There are three
sets of tests, performed in this order: filesystem tests, magic number
tests, and language tests. The first test that succeeds causes the
file type to be printed. The type printed will usually contain one of
the words text (the file contains only printing characters and a few
common control characters and is probably safe to read on an ASCII
terminal), executable (the file contains the result of compiling a
program in a form understandable to some UNIX kernel or another), or
data meaning anything else (data is usually `binary' or
non-printable). Exceptions are well-known file formats (core files,
tar archives) that are known to contain binary data.
You could invoke the file command using QProcess and display the returned info.
Output looks like :
$ file document.pdf
document.pdf: PDF document, version 1.5
$ file test.txt
test.txt: ASCII text, with CRLF, CR, LF line terminators
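The same idea, sketched in Python rather than QProcess just to show the shape of it: run the command, then split the "name: description" line. This assumes a file executable is on the PATH; the function names are hypothetical, and the split would misbehave for file names that themselves contain ": ".

```python
import subprocess


def parse_file_output(line: str) -> tuple[str, str]:
    """Split one line of `file` output into (file name, description)."""
    name, _, description = line.partition(": ")
    return name, description.strip()


def describe(path: str) -> str:
    """Run the `file` command on a path and return only the description."""
    result = subprocess.run(
        ["file", path], capture_output=True, text=True, check=True
    )
    return parse_file_output(result.stdout.strip())[1]


if __name__ == "__main__":
    print(parse_file_output("document.pdf: PDF document, version 1.5"))
```

Note that these descriptions come from the magic database, so they will not necessarily match what Explorer or Finder shows.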
The closest that I think you can get out of Qt is QFileInfo.
Windows keeps track of the mapping through the registry, which can be accessed through Qt's QSettings. But just from brief research it sounds like it might be kind of tricky to mimic Explorer's mapping.
You can also launch the file with the default handler using QDesktopServices::openUrl().
I haven't researched how or where OSX keeps track of the file type description information.
Hope that helps.

Reading Environment Variables in an XSLT Stylesheet with Saxon

I'm trying to generate an XML file with my machine's hostname in some arbitrary element or attribute, e.g.
<hostname>myHostname</hostname>
I'm using Saxon 9.2. I can think of three ways to do this:
Read and parse /etc/sysconfig/network (I'm using Fedora)
Read the environment variable (as in $ echo $HOSTNAME)
Pass the hostname to Saxon and then somehow dereference a variable (not sure if this is possible)
Are any of these possible? I think the first option is most likely to work, but I think the other two options will produce less verbose XSLT.
I also have a related question:
Currently, I have an XSLT and source XML file that generate a bunch of XML files, and it works like I expect it to. Is there any way I can selectively generate one file per host? That is, I want to say 'if the hostname is myHostName then generate the XML file for myHostName, if the hostname is myOtherHostName then generate the XML file for myOtherHostName'.
I ask this because I'm trying to configure a large number of machines, and if I could drop an XSLT and XML file on each, then call the same command on every machine and get the right XML on each, it would be really convenient.
You should pass a parameter to your xslt when "calling" it. I think this is the most robust solution.
So at the top of your stylesheet you would have something like :
<xsl:param name="hostName"/>
Then you can use it in your .xslt via the usual notation : $hostName etc.
You just then need to pass those parameters when calling the xslt processor. Depending on how you use it this may vary.
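For example, if you run Saxon from the command line, stylesheet parameters go at the end as name=value pairs. A minimal sketch of a wrapper script in Python that looks up the hostname and builds that command; the jar location and the file names are placeholders, and build_saxon_command is a hypothetical helper.

```python
import socket


def build_saxon_command(source, stylesheet, output, host_name,
                        saxon_jar="saxon9he.jar"):
    """Build a Saxon command line that sets the hostName stylesheet parameter."""
    return [
        "java", "-cp", saxon_jar, "net.sf.saxon.Transform",
        f"-s:{source}",        # source XML document
        f"-xsl:{stylesheet}",  # stylesheet declaring <xsl:param name="hostName"/>
        f"-o:{output}",        # output file
        f"hostName={host_name}",
    ]


if __name__ == "__main__":
    # File names here are placeholders for illustration only.
    command = build_saxon_command(
        "config.xml", "config.xsl", "out.xml", socket.gethostname()
    )
    print(" ".join(command))
```

The same script can be dropped on every machine, since the hostname is looked up at run time; pass the list to subprocess.run to actually launch the transformation.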
You can generate an XML file containing all the needed parameters, then you can either pass it as a parameter to the transformation (refer to the code samples to see examples of how this is done with Saxon).
Here is a link that can help: https://www.saxonica.com/html/documentation/javadoc/net/sf/saxon/expr/instruct/GlobalParameterSet.html
Or simpler, save this XML file in the file system and just pass as parameter to the transformation the file path and name.
Then inside the transformation, use the standard XSLT function document() to load the XML document that contains the parameters.
Even further simplification is possible, if this file can be stored at a location that has exactly the same path on all machines. Then this avoids the need to pass this filepath as parameter to the transformation.
There are many possible ways of doing this: passing in parameters, reading the configuration file using the unparsed-text() function, calling an extension function.
But perhaps the most direct way is that Saxon 9.3 implements the new XPath 3.0 function get-environment-variable(). Support for XPath 3.0 requires Saxon-PE or higher.
(XPath 3.0 is of course still a draft and subject to change. In fact it has changed since Saxon 9.3 was released - the function has been renamed environment-variable()).

C++ Logger-Should I use an ordinary xml parser?

I'm working on a logging system for my 2D engine, and I'm confused on how I should go about creating/editing the file, and how I should output that file.
I've learned that XML is more of a data carrier rather than a data displayer like HTML is. I've read that I can use XML to HTML converters. One method I've thought about is writing characters to a file in HTML.
Clarity on these matters is what I ask of you, stack overflow.
Creating an XML (or HTML) file doesn't need any special library. Straightforward string concatenation is usually good enough, though you may have to encode some special characters (e.g. > into &gt;).
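To illustrate the escaping step, here is a minimal sketch in Python (the engine in question is C++, so treat this as showing the idea rather than code to drop in). The element name entry and the helper xml_log_entry are made up for the example.

```python
from xml.sax.saxutils import escape


def xml_log_entry(message: str) -> str:
    """Wrap a message in a hypothetical <entry> element, escaping &, < and >."""
    return "<entry>" + escape(message) + "</entry>"


if __name__ == "__main__":
    print(xml_log_entry("width > 0 && height < 4096"))
```

A C++ version would do the same three replacements by hand, taking care to replace & first.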
But as Owen says, plain text is a lot more common for log files. One reasonable compromise is comma-separated values in a text file; this gives you a little bit of structure without much overhead. For example, the Windows web server (IIS) uses this format by default, and if you have some fields that are output for each line, such as timestamp or source filename and line number, this makes it easy to separate those out again.
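A minimal sketch of what one such line could look like, in Python for brevity. The field layout (timestamp, source file, line number, message) and the sample values are only illustrative; the point is that a CSV writer quotes fields containing commas or quotes, so they can be split apart again later.

```python
import csv
import io


def format_log_row(fields) -> str:
    """Format one log record as a single CSV line."""
    buffer = io.StringIO()
    # Use "\n" rather than the csv module's default "\r\n" line ending.
    csv.writer(buffer, lineterminator="\n").writerow(fields)
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical record: timestamp, source file, line number, message.
    print(format_log_row(
        ["2024-01-01T00:00:00", "engine.cpp", 42, 'Loaded "atlas", 3 sprites']
    ), end="")
```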
Just about every log I've ever worked with has been pure text delimited by newlines. If you're going to depart from that, you may want to ask yourself what it is about your logging needs that you want to accomplish with markup.
If you must go the way of markup, I would suggest an XML format that contains a minimal set of markup that would be useful in your situation. You could use XML to capture structure in your log entries (timestamp, severity, and operational code, for example) that would be inconvenient to code for in HTML.
Note that you could also go hybrid and embed some XHTML tags in an XML element whose purpose is to capture displayable text, if you want.
The problem with XML or HTML files is that you cannot append at any time. You have to close the final tag (document tag) properly at the end of writing.
Therefore, it's not a popular format for logging.
For logging, I suggest using one of the existing log engines, such as Apache logger, or, John Torjo's boost log candidate. They will support log levels, runtime configuration, etc.
If you are considering writing logs in XML files, please, stop.
Log files should be simple plain text files, XML-izing it is introducing needless complexity. They are not structured data, they are meant to be read by people, not automated tools.
It all starts with XML logs, and then it goes downhill from there.

Xerces/Xalan: UNC path as argument for document function?

I'm transforming an XML document by using Xerces-C 2.5 and Xalan-C 1.8. The XSL contains a "document" function, that references a file on the network. Unfortunately I cannot access this file by HTTP. I've only got the UNC path.
Xerces refuses to parse the referenced document, because WinSockNetAccessor::makeNew is called in Xerces as the "file" protocol is only accepted for local files. WinSockNetAccessor::makeNew is implemented for HTTP only, an exception is thrown and the file is ignored.
Is there a way to fool Xerces in order to accept the unc path as local file or any other known workaround without writing my own parser or manipulating Xerces?
A simple workaround would be, I guess, to just create a mapping, so you can call the network drive O: or whatever. That often fools programs that can't work directly with a UNC path (such as cmd.exe itself).
Does the UNC as it appears in the XSL have a "file:" prefix?
BTW, Xerces C V2.5 is several years old. Have you tried the latest version - V3.0.1 at the moment?

On Windows, when should you use the "\\?\" filename prefix?

I came across a C library for opening files given a Unicode filename. Before opening the file, it first converts the filename to a path by prepending "\\?\". Is there any reason to do this other than to increase the maximum number of characters allowed in the path, per this MSDN article?
It looks like these "\\?\" paths require the Unicode versions of the Windows API and standard library.
Yes, it's just for that purpose. However, you will likely see compatibility problems if you decide to create paths over MAX_PATH length. For example, the Explorer shell and the command prompt (at least on XP, I don't know about Vista) can't handle paths over that length and will return errors.
The best use for this method is probably not to create new files, but to manage existing files, which someone else may have created.
I managed a file server which routinely would get files with path_length > MAX_PATH. You see, the users saw the files as H:\myfile.txt, but on the server it was actually H:\users\username\myfile.txt. So if a user created a file with exactly MAX_PATH characters, on the server it was MAX_PATH+len("users\username").
(Creating a file with MAX_PATH characters is not so uncommon, since when you save a web page on Internet Explorer it uses the page title as the filename, which can be quite long for some pages).
Also, sharing a drive (via network or usb) with a Mac or a Linux machine, you can find yourself with files with names like con, prn or lpt1. And again, the prefix lets you and your scripts handle those files.
I think the first thing to note is that "\\?\" does not make the path a UNC path. You were more accurate the second time when you called it a UNC-style path. But even then, the similarity only comes from having two backslashes at the start. It really has nothing to do with UNC. That's backed up by the fact that you have to use even more characters to get a UNC path with the "\\?\" prefix.
I think you've got the entire reason for using that prefix. It lifts the maximum-length limit as described in the article you cited. And it only applies to Unicode paths; non-Unicode paths don't get to avoid the limit by using that prefix.
One thing to note is that the prefix is not allowed for relative paths, only for absolute ones. You might want to double-check that your C library honors that restriction.
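As a rough sketch of what such a library might do (this is not the library's actual code, and to_extended_length is a hypothetical name), here is the prepending logic in Python, including the absolute-path check and the longer form needed for UNC paths:

```python
import ntpath

PREFIX = "\\\\?\\"          # the literal characters \\?\
UNC_PREFIX = "\\\\?\\UNC\\"  # the literal characters \\?\UNC\


def to_extended_length(path: str) -> str:
    """Prepend the \\\\?\\ prefix to an absolute Windows path (sketch only)."""
    if path.startswith(PREFIX):
        return path  # already prefixed
    if not ntpath.isabs(path):
        raise ValueError("the prefix is only valid for absolute paths")
    if path.startswith("\\\\"):
        # \\server\share\... becomes \\?\UNC\server\share\...
        return UNC_PREFIX + path[2:]
    return PREFIX + path


if __name__ == "__main__":
    print(to_extended_length(r"C:\users\username\myfile.txt"))
    print(to_extended_length(r"\\server\share\myfile.txt"))
```

The sketch assumes the input already uses backslashes, since forward slashes are not converted for you once the prefix is present.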
As well as allowing longer paths, the "\\?\" prefix also lets you use files and directory names like "con" and "aux". Normally Windows would interpret those as old-fashioned DOS devices.
I've been writing Windows code since 1995, and although I'm aware of that prefix, I've never found any reason to use it. Increasing the path length beyond MAX_PATH seems to be the only reason for it, and neither I nor any of my programs' customers have ever done so, to my knowledge.