eXist-db and XQuery: XIncludes or collections (TEI-XML)?

I have a corpus in TEI-XML which uses a 'master' corpus XML document that pulls in, via xi:include, thousands of other documents. Each of these documents in turn contains xi:includes pointing to master lists of named entities (people, places, etc., linked by xml:ids). All of this works very well in XSLT (and in my IDE, Oxygen, for fast encoding).
I am now embarking on building a website using eXist-db applications. I am rewriting everything directly in XQuery (to replace XSLT), and I have hit upon an unexpected decision. I am used to using xi:includes to traverse the corpus and the various XML files, but reading the eXist-db documentation, it seems that the encouraged practice is to use collections and query them directly, instead of navigating via xi:includes. It also seems that eXist-db does not fully implement xi:include anyway and requires some workarounds?
I am looking for guidance on best practices for eXist-db/XQuery in this context.
Many thanks in advance.

Correct, eXist's XInclude implementation is focused on output (i.e., serialization) rather than on querying or indexing. As eXist's documentation page on XInclude states:
The XInclude processor is implemented as a filter in between the serializer's output event stream and the receiver... XInclude processing is therefore applied whenever eXist-db serializes an XML fragment, whether it's a document, the result of an XQuery or an XSLT stylesheet.
Thus, if you use XInclude to assemble your corpus and you want to query/traverse this corpus, you could do so by (1) writing a query to read your XInclude and following it like a map to find the component documents, (2) pre-serializing your data into a new document and then querying the resulting document directly, or (3) placing the documents into collections that facilitate the kinds of queries you want to do.
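For option (1), a minimal sketch of following the XInclude references like a map, assuming the master document is stored at /db/corpus/master.xml and the @href values are paths relative to that collection (both paths are placeholders for whatever your layout actually is):

xquery version "3.1";
declare namespace xi = "http://www.w3.org/2001/XInclude";

(: resolve each xi:include href against the collection holding the master file :)
let $master := doc("/db/corpus/master.xml")
for $href in $master//xi:include/@href
return doc("/db/corpus/" || $href)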

Depending on the size of those thousands of documents, traversing the XIncludes when running XQueries tends to be slow and quite memory-intensive. In my experience Joe's option 3 is usually the way to go.
Unlike with straight-up XSLT, in eXist-db you can define indexes. Say you have a <listPerson> element as a wrapper for thousands of xi:includes pointing to <person> elements, each the root of its own document.
If you have defined an index for <person>, you can use e.g. ft:query() to query the index directly, irrespective of where in the tree of sub-collections and documents the element is located. This tends to be orders of magnitude faster than traversing the whole document starting at the master file and resolving the XIncludes.
As for validation, you will need to decide whether a full validation run of the whole expanded document is really always necessary. This requires some fiddling, but there isn't much general advice I can offer without seeing the actual files and code.
You can find more information about indexing in eXist-db in the documentation.
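To make the index suggestion concrete, here is a minimal sketch, assuming a TEI namespace, data stored under /db/corpus, and a Lucene full-text index on tei:person (the collection path, qname, and search string are all placeholders for your project). The index is configured in a collection.xconf stored under /db/system/config/db/corpus/:

<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index xmlns:tei="http://www.tei-c.org/ns/1.0">
        <lucene>
            <text qname="tei:person"/>
        </lucene>
    </index>
</collection>

After reindexing the collection, a query can hit the index directly, no matter which document or sub-collection the element lives in:

xquery version "3.1";
declare namespace tei = "http://www.tei-c.org/ns/1.0";

for $p in collection("/db/corpus")//tei:person[ft:query(., "some name")]
return $p/@xml:id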

Regex to exclude elements in an XML file

I am comparing two XML files using WinMerge. The files are deployment files and I'm looking for variation between the environments. The main issue is that the XML files are littered with tags that indicate a change in an underlying id (e.g. 123), but this is unimportant for the comparison.
I want to create a regex that I can use in WinMerge to exclude those elements and compare only the interesting elements, e.g. the element in the example below:
Environment 1
<table>
<tableInfo>
<tableId>293</tableId>
<name>Table Name New</name>
<repositoryId>0</repositoryId>
Environment 2
<table>
<tableInfo>
<tableId>965</tableId>
<name>Table Name Old</name>
<repositoryId>0</repositoryId>
Please note that the application producing the XML spits these out in line-by-line order, so it is not a true XML compare.
I would not recommend using a regex for this: to do it truly accurately, you would effectively need to parse the XML, which is not something you want to do with a regex.
WinMerge is a line-based diff tool, which isn't well suited to XML. I would recommend trying an XML-based diff tool, which has more of a concept of XML's tree structure. Most XML-based diff tools appear to be commercial products, but there is diffxml, which is open source and may be worth a look.
An XML-based diff of the files should inherently be more accurate, since it is not wholly line-based and takes the tree structure into account. You could then delve further into the diffs using an XML parser, such as ElementTree in Python, specifically targeting the tags you consider interesting and comparing them to each other to see where they differ.
If diffxml proves too unwieldy, it may be worth doing the parsing yourself using ElementTree or similar (e.g. lxml) and comparing the two sources directly, targeting just the tags in which you are interested.
In short, I think XML parsers, perhaps in combination with an XML-aware diff tool, will be more useful than pure regexes in this case.

Which libxml2 API should I use for large files?

Our program currently uses the libxml2 DOM API (xmlReadFile) to load an entire file into memory. Unfortunately, this breaks down on "large" XML files, as the basic memory consumption of libxml2 DOM is about 4-5 times the base file size.
It seems libxml2 offers two APIs for reading XML when I don't want to store the whole tree in memory: SAX2 and xmlReader.
I haven't dug into the APIs yet, but I'm wondering which one is preferable under which circumstances?
Note: All I need to do with the XML file is populate some C++ data structures with the data found in it, and these will in turn be a lot smaller than the (very verbose) XML definition. At the moment, with xmlReadFile and the DOM API, the process takes about 100 MB of memory for a 20 MB XML file. The C++ data in memory for such a file is more like 5 MB, so I could go from 1:4 to 4:1, which would already help a lot.
I follow this approach: if the processing is sparse (you need only an element here and there), xmlReader is better; if you need to process all elements, SAX is better. Although opinion comes into play as to whether you want to drive the processing yourself (xmlReader) or have the parser push events at your code (SAX)...
If you need to process large XML documents, then size becomes the primary consideration. As you saw with 20 MB -> 100 MB for DOM parsing, anything much larger than this can be prohibitively expensive, and SAX may be the only way to process it. For embedded or memory-constrained devices, SAX may be required even for small files.
If you want to start parsing before the file is complete, SAX is the way to go. If you are writing a browser, are streaming XML, or require responsiveness, then you will need to use SAX.
SAX is more of a pain: if you can get away with DOM parsing, that will usually lead to less and simpler code; with simple DOM queries you can avoid a state machine, for example. If you only care about a handful of fields in the document, you could even avoid querying the DOM directly and use XSLT instead.
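To illustrate the sparse case, here is a minimal xmlReader sketch using the C API (callable from C++); the file name, the record element name, and the id attribute are placeholders, and error handling is kept to a minimum:

#include <libxml/xmlreader.h>
#include <stdio.h>

int main(void) {
    /* Stream the document instead of building a DOM; "data.xml" is a placeholder. */
    xmlTextReaderPtr reader = xmlReaderForFile("data.xml", NULL, 0);
    if (reader == NULL)
        return 1;

    while (xmlTextReaderRead(reader) == 1) {
        /* React only to start tags of the elements we care about. */
        if (xmlTextReaderNodeType(reader) == XML_READER_TYPE_ELEMENT
            && xmlStrcmp(xmlTextReaderConstName(reader), BAD_CAST "record") == 0) {
            xmlChar *id = xmlTextReaderGetAttribute(reader, BAD_CAST "id");
            /* ... populate the C++ data structures here ... */
            printf("record id=%s\n", id ? (const char *)id : "(none)");
            xmlFree(id);
        }
    }
    xmlFreeTextReader(reader);
    xmlCleanupParser();
    return 0;
}

With this approach only the current node is held in memory, so the footprint stays close to the size of your own data structures rather than a multiple of the file size.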

How to get correct data in a large xml file?

I have a large XML file (containing a few million records) and need to get about 100 records out of it (based on an id or something like that).
I tried TinyXml and Xalan-C, but both of them use DOM, which causes an out-of-memory issue.
Is there a C/C++ library that can do this without loading all the data into memory as a DOM?
How about Apache Xerces?
It's pretty damn mature and is optimized for performance (its SAX and progressive-parsing APIs don't need to read your complete files into memory).
You need a SAX parser like Xerces
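For example, a minimal SAX2 handler with Xerces-C++ might look like the following; the element name record, the attribute name id, the file name, and the id values are all placeholders for whatever your data actually uses:

#include <xercesc/sax2/SAX2XMLReader.hpp>
#include <xercesc/sax2/XMLReaderFactory.hpp>
#include <xercesc/sax2/DefaultHandler.hpp>
#include <xercesc/sax2/Attributes.hpp>
#include <xercesc/util/PlatformUtils.hpp>
#include <xercesc/util/XMLString.hpp>
#include <iostream>
#include <set>
#include <string>

using namespace xercesc;

// Keeps only the <record> elements whose id attribute is in the wanted set.
class RecordFilter : public DefaultHandler {
public:
    explicit RecordFilter(const std::set<std::string>& wanted) : wanted_(wanted) {}

    void startElement(const XMLCh*, const XMLCh* localname,
                      const XMLCh*, const Attributes& attrs) override {
        char* name = XMLString::transcode(localname);
        if (std::string(name) == "record") {
            XMLCh* idName = XMLString::transcode("id");
            const XMLCh* idValue = attrs.getValue(idName);
            if (idValue) {
                char* id = XMLString::transcode(idValue);
                if (wanted_.count(id))
                    std::cout << "found record " << id << "\n"; // copy the record out here
                XMLString::release(&id);
            }
            XMLString::release(&idName);
        }
        XMLString::release(&name);
    }

private:
    std::set<std::string> wanted_;
};

int main() {
    XMLPlatformUtils::Initialize();
    {
        SAX2XMLReader* parser = XMLReaderFactory::createXMLReader();
        RecordFilter handler({"42", "1007"}); // the ~100 ids you need
        parser->setContentHandler(&handler);
        parser->setErrorHandler(&handler);
        parser->parse("big.xml");             // streams the file; memory stays flat
        delete parser;
    }
    XMLPlatformUtils::Terminate();
    return 0;
}

Because the handler only copies out the records it wants, memory use is bounded by those ~100 records, not by the size of the file.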
The Saxon-EE XSLT processor can handle a subset of XSLT in streaming mode (that is, without building a tree in memory). For details see
http://www.saxonica.com/documentation/sourcedocs/streaming.xml
It's not C/C++, but you don't say whether that's a hard constraint.

XSLT to convert an XML element containing RTF data to HTML?

OK, so here's the background:
We have a third-party piece of software that does a lot of complicated stuff to generate an XML file from a lot of tables based on a wide array of business rules. The software allows you to apply an XSL transformation by supplying an XSLT file as part of its workflow, before continuing on in the process, which is usually an upload to one or more servers, based on more business rules.
Here's the problem:
One of the elements (with more on the way) that this application processes contains RTF text, and it needs to be converted into formatted HTML before being uploaded. There are no means of transforming the XML inside the application other than through an XSLT file, and once we output the file, we cannot resume the workflow. My original thought was, "Easy! Someone must have written a few XSL transforms for converting RTF to formatted HTML!" Hours of searching later, I must conclude that I either suck at searching or it's awfully obscure.
Disclaimers:
I know the software is pretty darned limited; I'm stuck with it.
I know there are a lot of third-party tools to do this; they are not available to me because I would need to run them externally.
I know that this is not a pretty or efficient thing to do with XSLT. Changing that is not an option for me at this point.
If I cannot find a means to do this through pure XSL transforms, I will need to output the files locally, run the extra process, and handle the destination routing through a custom process. I really don't want to do that.
Does anyone have access to an XSL transformation function/scheme that will allow me to do this natively in the application? Perhaps a series of regular expressions I could use, or something?
So it turns out that external scripts can be invoked from the XSLT. It seems I will be using another scripting language to get this to work. I'm a little bummed there was no other answer available.

Is there a lightweight approach in producing XML with Xerces-C++?

This application runs on an embedded platform with low processing power and memory. I want to produce huge XML documents from the application. Currently I am constructing a DOM and serializing it to XML using Xerces-C++ 3.1.1, but the DOM construction takes a long time and consumes a lot of memory.
I know SAX is a lightweight approach to parsing XML compared to DOM. Is there, similarly, a lightweight approach for producing XML? Of course I could produce the XML by concatenating strings, but I didn't choose that approach because I want to make sure I produce well-formed XML and sanitize the text I include in it.
What you are looking for is normally called streaming serialization, where parts of the document are written out as they become available, instead of accumulating them all and writing them out at the end (which is what the DOM approach entails).
Xerces-C++ does not currently have streaming serialization support, but it is not very difficult to emulate it using DOM. The idea is to construct a small DOM document fragment when a chunk of your data is ready to be serialized, write it out using the serializer API (DOMLSSerializer in Xerces-C++ 3.x, DOMWriter in the older 2.x releases), and free it once done. When you have another chunk ready, repeat the above steps. The result is an application that uses only a fraction of the memory that would be required to build the complete document.
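A minimal sketch of that fragment-by-fragment technique with Xerces-C++ 3.x (element names, content, and the chunk loop are placeholders; in a real application you would typically write the outer root element's start and end tags around the serialized fragments yourself):

#include <xercesc/dom/DOM.hpp>
#include <xercesc/framework/StdOutFormatTarget.hpp>
#include <xercesc/util/PlatformUtils.hpp>
#include <xercesc/util/XMLString.hpp>

using namespace xercesc;

// Small helper that owns a transcoded string for the duration of a call.
struct X {
    XMLCh* s;
    explicit X(const char* c) : s(XMLString::transcode(c)) {}
    ~X() { XMLString::release(&s); }
    operator const XMLCh*() const { return s; }
};

int main() {
    XMLPlatformUtils::Initialize();
    {
        DOMImplementation* impl =
            DOMImplementationRegistry::getDOMImplementation(X("LS"));
        DOMLSSerializer* writer = impl->createLSSerializer();
        DOMLSOutput* out = impl->createLSOutput();
        StdOutFormatTarget target;            // or a file/socket format target
        out->setByteStream(&target);

        // One small throw-away document per chunk: build it, write it, release it.
        for (int i = 0; i < 3; ++i) {         // pretend these are chunks of real data
            DOMDocument* doc = impl->createDocument(nullptr, X("record"), nullptr);
            DOMElement* root = doc->getDocumentElement();
            DOMElement* value = doc->createElement(X("value"));
            value->setTextContent(X("chunk payload"));   // placeholder content
            root->appendChild(value);

            writer->write(root, out);         // serialize just this fragment
            doc->release();                   // memory use stays bounded per chunk
        }

        out->release();
        writer->release();
    }
    XMLPlatformUtils::Terminate();
    return 0;
}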
We use this approach in CodeSynthesis XSD, an XML data binding toolkit for C++, to handle XML documents that are too big to fit into memory. In fact, we have written some helper classes that simplify all this, which you can find as part of the 'streaming' example in the examples/cxx/tree/ directory (the example code is public domain, so feel free to borrow it ;-)).