Memory-efficient XSLT Processor

I need a tool to execute XSLTs against very large XML files. To be clear, I don't need anything to design, edit, or debug the XSLTs, just execute them. The transforms that I am using are already well optimized, but the large files are causing the tool I have tried (Saxon v9.1) to run out of memory.

I found a good solution: Apache's Xalan C++. It provides a pluggable memory manager, allowing me to tune allocation based on the input and transform.
In multiple cases it is consuming ~60% less memory (I'm looking at private bytes) than the others I have tried.

You may want to look into STX for streaming-based XSLT-like transformations. Alternatively, I believe StAX can integrate with XSLT nicely through the Transformer interface.

It sounds like you're sorted - but often, another potential approach is to split the data first. Obviously this only works with some transformations (i.e. where different chunks of data can be treated in isolation from the whole) - but then you can use a simple parser (rather than a DOM) to do the splitting into manageable pieces, then process each chunk separately and reassemble.
Since I'm a .NET bod, things like XmlReader can do the chunking without a DOM; I'm sure there are equivalents for every language.
I'm not aware of any specific name; maybe Divide and Conquer.
For example, if your data is actually a flat list of like objects, then you could simply split the first-level children: rather than having 2M rows, you split it into 10 lots of 200K rows, or 100 lots of 20K rows. I've done this many times before when working with bulk data (for example, uploading data in chunks [all valid] and reassembling them at the server so that each individual upload is small enough to be robust).
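The splitting step described above can be sketched with a streaming parser. This is only an illustration in Python (the answer itself mentions .NET's XmlReader); the function name `iter_chunks` and its parameters are hypothetical, and it assumes the records are direct children of the root element.

```python
import io
import xml.etree.ElementTree as ET

def iter_chunks(source, record_tag, chunk_size):
    """Yield lists of serialized record elements, chunk_size at a time.

    Assumes a flat document: every <record_tag> is a direct child of the root.
    """
    chunk = []
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            elem.tail = None  # do not carry inter-record whitespace along
            chunk.append(ET.tostring(elem, encoding="unicode"))
            root.clear()  # drop finished records so memory stays bounded
            if len(chunk) == chunk_size:
                yield chunk
                chunk = []
    if chunk:
        yield chunk  # the last, possibly shorter, chunk

if __name__ == "__main__":
    xml_bytes = b"<rows>" + b"".join(b"<row id='%d'/>" % i for i in range(5)) + b"</rows>"
    for number, records in enumerate(iter_chunks(io.BytesIO(xml_bytes), "row", 2)):
        print(number, len(records))
```

Each chunk can then be wrapped in its own root element and handed to the XSLT processor separately.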

For what it's worth, I suspect that for Java, Saxon is as good as it gets, if you need to use XSLT. It is quite efficient (both CPU and memory) for larger documents, but XSLT itself essentially forces a full in-memory tree of the contents to be created and retained, except in limited cases. Saxon-SA (the for-fee version) supposedly has extensions that take advantage of such "streaming" cases, so that might be worth checking out.
But the advice to split up the contents is the best one: if you are dealing with independent records, just split the input using other techniques (like, use StAX! :-) )

I have found that a custom tool built to run the XSLT using earlier versions of MSXML makes it very fast, but also consumes incredible amounts of memory, and will not actually complete if it is too large. You also lose out on some advanced XSLT functionality as the earlier versions of MSXML don't support the full xpath stuff.
It is worth a try if your other options take too long.

That's an interesting question. XSLT could potentially be optimized for space, but I expect all but the most obscure implementations around start by parsing the source document into DOM, which is bound to use a low multiple of the document size in memory.
Unless the stylesheet is specially designed to support a single-pass transformation, reasonable time performance would probably require parsing the source document into a disk-based hierarchical database.
It appears that Saxon 9.2 may provide an answer to your problem. If your document can be transformed without using predicates (does not reference any siblings of the current node) you may be able to use Streaming XSLT.
Are you using the Java version of Saxon, or the .Net port? If you are running out of memory, you can assign more memory to the Java VM running Saxon (using the -Xmx command line parameter, which sets the maximum heap size; -Xms only sets the initial size).
I've also found that the .Net version of Saxon runs out of memory less easily than the Java version.

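For .NET you can use the solution suggested in the Microsoft Knowledge Base:
// Requires: using System.Xml; using System.Xml.XPath; using System.Xml.Xsl;
XPathDocument srcDoc = new XPathDocument(srcFile);   // read-only document, lighter than XmlDocument
XslCompiledTransform myXslTransform = new XslCompiledTransform();
myXslTransform.Load(xsltFile);   // xsltFile: stylesheet path (assumed; the original snippet omitted the Load call)
using (XmlWriter destDoc = XmlWriter.Create(destFile))
{
    myXslTransform.Transform(srcDoc, destDoc);
}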

Marklogic xslt performance

I have an XSLT that I'm executing via the xdmp:invoke() function, and I'm running into very long processing times before I see any result (in some instances it times out completely after the maximum timeout of 3600s is reached). This XSLT runs in approximately 5 seconds in the Oxygen editor. Some areas I think may be impacting performance:
The XSLT produces multiple output files, using xsl:result-document. The MarkLogic XSLT processor outputs these as result XML nodes, as it cannot physically save these documents to a file system.
The XSLT builds variables that contain xml nodes, which then are processed by other template calls. At times these variables can hold a large set of XML nodes.
I've done some profiling on the XSLT and it seems that building the variables is the most time-consuming part of the execution. I'm wondering why that's the case, and why it runs so much faster on the Saxon processor.
Any insight is much appreciated.
My understanding is that there are some XSLT performance optimizations that are difficult or impossible to implement in the context of a database in comparison to a filesystem. Also, Saxon is the industry leader in XSLT and is significantly faster than almost anything on the market, although that probably doesn't account for the large discrepancy you describe.
You don't say which version of MarkLogic you're running, but version 8.0 has made significant improvements in XSLT performance. A few simple tests I ran suggested 3-4x speed improvement, depending on the XSLT.
I have run into some rare but serious performance edge cases for XSLT when running MarkLogic on Windows. Linux and OSX builds don't appear to have this problem. It is also far more pronounced when the XSLT tasks are running on multiple threads.
It is possible, however, to save data directly to the filesystem instead of the database using xdmp:save.
Unless your XSLTs involve very complex templating rules, I would recommend at least testing some of the performance-sensitive XSLT logic in XQuery. It may be possible to port the slowest parts and pass the results of those queries to the XSLT. It's not ideal, but you might be able to achieve acceptable performance without rewriting the XSLTs.
Another idea, if the problem is simply the construction of variables in a multi-pass XSLT, is to break the XSLT into multiple XSLTs and make multiple calls to xdmp:xslt-invoke from XQuery. However, I know there is some overhead to making an xdmp:xslt-invoke call, so it may be a wash, or it may be worse.
I have come across similar performance issues with stylesheets in ML 7. Come to think of it, my stylesheets were similar to the ones you describe, i.e. variables holding sequences of nodes. It seems XSLT cannot be optimised as well as XQuery is. If you are not satisfied with the performance of your stylesheets, I would recommend converting the XSLT to its equivalent XQuery. I did this and gained about 1 to 1.5 seconds. It may be worth the effort :)
Well in my case, it seems that using the fn:not() function in template match rules is causing the slow performance. Perhaps if someone else is experiencing the same problem this might be a good starting point.

XML library optimized for big XML with memory constraints

I need to handle big XML files, but I want to make a relatively small set of changes to them. I also want the program to adhere to strict memory constraints. We must never use more than, say, 300 MB of RAM.
Is there a library that allows me not to keep all the DOM in memory, and parse the XML on the go, while I traverse the DOM?
I know you can do that with call-back based approach, but I don't want that. I want to have my cake and eat it too. I want to use the DOM API, but to parse each element lazily, so that existing code that use the DOM API won't have to change.
There are two possible approaches I thought of for this problem:
Parse the XML lazily: each call to getChildren() parses the next bit of XML.
Parse the entire XML tree, but cache whatever you're not using right now on disk.
Either approach is acceptable. Is there an existing solution?
I'm looking for a native solution, but I'd be interested in hearing about libraries in other languages.
It sounds like what you want is something similar to the Streaming API for XML (StAX).
While it does not use the standard DOM API, it is similar in principle to your "getChildren()" approach. It does not have the memory overheads of the DOM approach, nor the complexity of the callback (SAX) approach.
There are a number of implementations linked on the Wikipedia page for StAX, most of which are for Java, but there are a couple for C++ too: Ambiera irrXML and Llamagraphics LlamaXML.
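To give a feel for the pull style, here is a minimal sketch in Python. It uses the standard library's `XMLPullParser` as a stand-in for a StAX implementation (StAX itself is a Java API); the function name `count_elements` and the sample data are made up for illustration.

```python
import xml.etree.ElementTree as ET

def count_elements(blocks, tag):
    """Count <tag> elements while feeding the document block by block.

    The caller pulls events when it wants them; no callbacks are registered
    and the whole document is never held as one complete tree.
    """
    parser = ET.XMLPullParser(events=("end",))
    count = 0
    for block in blocks:
        parser.feed(block)  # hand the parser the next piece of input
        for _event, elem in parser.read_events():
            if elem.tag == tag:
                count += 1
            elem.clear()  # discard text and children once an element is done
    parser.close()
    return count

if __name__ == "__main__":
    # The input may be split anywhere, even in the middle of a tag.
    pieces = [b"<root><item>a</it", b"em><item>b</item></root>"]
    print(count_elements(pieces, "item"))
```

Note that cleared elements remain attached to their parents as empty shells, so a production version would also detach them.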
edit: Since you mention "small changes" to the document, if you don't need to use the document contents for anything else, you might also consider Streaming Transformations for XML (STX) (described in this introduction to STX). STX is to XSLT something like what SAX/StAX is to DOM.
"I want to use the DOM API, but to parse each element lazily, so that existing code that use the DOM API won't have to change."
You want a streaming DOM-style API? Such a thing generally does not exist, and for good reason: it would be difficult if not impossible to make it actually work.
XML is generally intended to be read one-way: from front to back. What you're suggesting would require being able to random-access an XML file.
I suppose you could do something where you build a table of elements, with file offsets pointing to where that element is in the file. But at that point, you've already read and parsed the file more or less. Unless most of your data is in text elements (which is entirely possible), you may as well be using a DOM.
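The offset table mentioned above could look roughly like this. It is a sketch under the assumption that expat's `CurrentByteIndex`, read inside a start-element handler, gives the byte position where that start tag begins; `index_offsets` is a hypothetical helper name. It also shows the point being made: building the table requires a full parse.

```python
import xml.parsers.expat

def index_offsets(data, tag):
    """Return the byte offsets at which each <tag> start tag begins."""
    offsets = []
    parser = xml.parsers.expat.ParserCreate()

    def on_start(name, attrs):
        if name == tag:
            # Position of the event currently being reported by the parser.
            offsets.append(parser.CurrentByteIndex)

    parser.StartElementHandler = on_start
    parser.Parse(data, True)  # True marks this as the final block of input
    return offsets

if __name__ == "__main__":
    doc = b"<root><a/><b>x</b><a>y</a></root>"
    for offset in index_offsets(doc, "a"):
        print(offset, doc[offset:offset + 3])
```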
Really, you would be much better off just rewriting your existing code to use an xmlReader or SAX-style API.
How to do streaming transformations is a big, open, unsolved problem. There are numerous partial solutions, depending on what restrictions you are prepared to accept. Current releases of Saxon-EE, for example, have the capability to do some XSLT transformations in a streaming fashion. Also, as already mentioned, there is STX (though implementations are not especially mature).
Your title suggests you want to write the transformation in C++. That's severely limiting, because it pretty well means the programmer has to cope with the complexities rather than leaving it to the transformation engine. You can of course hand-code streaming transformations using SAX-like or StAX-like parser APIs, but both are hard work, and each case will need to be approached from scratch.
Google for "streaming XML transformation"

BizTalk: XSLT versus mapping tool

We're doing a mapping process from an XML file generated by a legacy system to EDI 834/837 files. We have BizTalk 2010 and are using the Microsoft built in EDI schemas.
The EDI files are fairly complex, and the XML file we are getting is also complex, with a lot of pieces bolted on. I started going through the mapping tool, but it seemed like there was a lot of repetition that I could eliminate by running the XML file through an XSLT.
I found the following link, but I'm not happy with just one source.
So, any other advantages on using the mapping tool over just building a custom XSLT?
My experience with BizTalk maps is that things that are very simple to do with XSLT can be very complex with maps.
For good counter-examples of BizTalk maps, look at the book "Pro Mapping in BizTalk Server 2009". The book has some examples of very complex things you can achieve with BizTalk maps, but the downside to it is that in fact they have hidden all the complexity in scripting functoids. Therefore, the maps are not visual at all anymore (they don't even have links between nodes to provide at least hints to deduce what the map is doing).
XSLT can be more visual than a map, since you can see the resulting XML in the XSLT (keep in mind that "text" does not imply "not visual" - if you are transforming between text formats, then a natural way to visualize the transformation is by looking at text)
BizTalk maps can be used for very simple mappings, where you are essentially copying a set of properties from one structure to another structure with the same properties. However, as soon as you have to map a structure to another different structure, you quickly get something that's hard to write AND hard to read/understand.
Not really, I prefer XSLT too. It's easier to document (using comments in the source) and therefore to maintain. However, keep in mind that in BizTalk 2006 R2 you could not import external XSLTs, which reduces your options for reuse. I have no idea if this has changed in subsequent versions of BizTalk, that's for you to find out and perhaps let us all know...
Not really an answer, more sharing of experience:
In my team we've had a discussion on this issue. The argument for maps was that they are understood by most colleagues (as they are covered by every basic BizTalk training), while XSLT is not.
I've personally worked with XSLT for a long time, before I started working with BizTalk, and find the mapper tool very unintuitive. Every connection I make raises more questions than it gives me comfort in knowing what the result is. What happens when the source node is nil, not present, or repeating? What happens when the target node is defined as minOccurs=2? What does the table mapping functoid do exactly? What does the table value extract functoid do when a value is not found? How do I create a node with an autonumbering sequence, and how do I relate other created nodes to those nodes by using the generated number?
Working with XSLT gives me the control back: I know exactly what happens.
XSLT maps have the added value of being text-based, which works well with branching and merging in source control, and allows us to add comments in the sources. Ever tried to merge changes to a map from two different branches?
End result is that we now prefer XSLT for mapping, but not every developer is fluent in XSLT. That requires some training.
One last tip: invest in unit test tooling for your maps. Find an open source toolkit, or write some plumbing to test your maps yourself. Most BizTalk artifacts are perfectly testable, even when it doesn't seem that way, with possible exception for orchestrations (which you should use as a last resort only anyway).
Benefits of XSLT
You get better DRY by reusing mapping functionality using XSLT apply + call templates and custom script functions (e.g. C# script) in the same map. Unfortunately AFAIK <xsl:include> doesn't work, so you will need to copy-paste to get reuse across multiple map XSLT files.
XSLT native call templates tend to be more performant than C# script (which is how most of the functoids are implemented anyhow)
You can use the XSLT debugger in Visual Studio.
And to emphasize ckarras' point that for complex maps, XSLT is actually easier to understand than a visual spider web.
Benefits of Visual Map
Productivity for trivial maps, e.g. where all elements are exactly the same name and type and can be mapped at the root level, or if you need a dummy map with hard coded output element values.
And I guess the hurdle rate for XSLT may be quite high.
As someone with experience in both BizTalk as well as another GUI-based mapping tool (BridgeGate), I can say that for the non-programmer these applications contain solutions in the form of their mapping interface to solve most problems. When they fall short, they offer a back door to exit to a more code-based solution in the form of a scripting functoid. So while XSLT is certainly an alternative, I find that those who prefer it often are those with more comfort writing code than those who are not.
My experience specifically with 837P and 837I files was with the prior mapping tool (BridgeGate), and it WAS arduous, but that was mainly the fault of the complexity of the file. What I CAN say, and what is not being mentioned, is that later changes to the process to accommodate client change requests WERE much easier in the GUI-based maps; I can only imagine what it would have been like to dive into an XSLT big enough to handle 837 transformations and touch every node involved with a change request. You know how big an 837 is, and how complex the looping can be. Keep that in mind when making your choice.
I don't envy your task, but know the satisfaction when you complete it will make it all worthwhile. Good luck!

Good Option for XML Edit/Replace

I have a huge (100k+ lines, 5MB+) XML file which acts as a database for my C++ application. The structure of the XML is quite straightforward; for example, it has chunks of:
<bar prop="true"/>
The nesting of tags is several levels deep and there are many items with multiple properties. What is a good way to find and replace chunks of this kind of a file? For example, assume that the above section is repeated a few dozen times and in each chunk the value of the tag <baz> is different. I'd like to make edits such as:
Setting all the values contained in tag <baz> to a given value.
Remove chunks containing certain values
So far, I've learnt of the following methods for accomplishing this:
Find/Replace: A no-brainer, trivial solution and also my last fall-back. This approach, IMHO is the most time consuming, error prone and painful method. The absolute last resort.
RegExes: Use regular expressions to match blocks of interest and edit them using replacement expressions. Kinda like this blog entry. But I feel this would be error prone and there could be a bunch of missed items if the regex is not exactly right the first time around.
Parser & Save: Whip up a quick program to parse the XML using Xerces or XML DOM interfaces (or some other XML library), read the XML in, manipulate it as desired and save it back to disk. Again, this approach is a slow process, but once it's up and running, it is easy to modify and more flexible than RegExes.
Are there any better ways to deal with this?
(EDIT: Thanks for all the "redo it to use a DB" suggestions. I know it's a huge mess, but by "better ways to deal with this" I meant the "find/replace" part.)
If you don't want to put the entire document in memory, I would read it using a SAX parser. As you read it, you append the transformed document to a second (or a temp) file. I think it could be pretty fast, and have only a small memory footprint.
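A minimal sketch of that read-and-rewrite approach, in Python rather than C++ for brevity. It covers the first edit from the question (setting every <baz> to one value); the names `SetText` and `rewrite` are hypothetical, and it assumes <baz> contains only text.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator

class SetText(XMLFilterBase):
    """SAX filter that replaces the text of every <tag> element with new_text."""

    def __init__(self, parent, tag, new_text):
        super().__init__(parent)
        self._tag = tag
        self._new_text = new_text
        self._inside = 0  # nesting depth inside <tag> elements

    def startElement(self, name, attrs):
        super().startElement(name, attrs)
        if name == self._tag:
            self._inside += 1
            super().characters(self._new_text)  # emit the replacement text

    def endElement(self, name):
        if name == self._tag:
            self._inside -= 1
        super().endElement(name)

    def characters(self, content):
        if not self._inside:  # swallow the original text inside <tag>
            super().characters(content)

def rewrite(src, dst, tag, new_text):
    """Stream src to dst, event by event, without building a tree."""
    reader = SetText(xml.sax.make_parser(), tag, new_text)
    reader.setContentHandler(XMLGenerator(dst, encoding="utf-8"))
    reader.parse(src)

if __name__ == "__main__":
    source = io.BytesIO(b"<db><foo><bar prop='true'/><baz>1</baz></foo></db>")
    target = io.StringIO()
    rewrite(source, target, "baz", "0")
    print(target.getvalue())
```

In real use `src` would be the input file and `dst` an open temp file that replaces the original once the pass succeeds.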
"Are there any better ways to deal with this?"
If you must use XML, you could use an XML database such as BDB XML (which has C++ APIs). It supports XQuery, transactions, etc.
Other options include TinyXML which I've used with success in the past. Quick and easy to use, not necessarily the fastest on a file of that size, but it will get the job done.
What are your actual memory constraints? 5MB is large but not enormous by current RAM standards.
I would use DOM with XPath if you can; it will be a lot less development work than SAX or other stream-based parsing. My problem with SAX is that if you are really using this as an in-memory DB, that implies random access on demand, and SAX is not well suited for that: you will have to parse and reserialize over and over, whereas once you have the DOM at least you can play with it as you like.
I echo the comments about how to store in-RAM database information, too. There are plenty of alternatives that are better suited to this than XML. Maybe you could implement a tactical solution using DOM/XPath and investigate rip-and-replace as a longer-term project.
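For comparison, the in-memory tree approach handles both edits from the question in a few lines. This sketch uses Python's ElementTree rather than a C++ DOM library, and the enclosing chunk element name `foo` is an assumption (the question only shows <bar> and <baz>); `set_text` and `remove_chunks` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

def set_text(root, tag, value):
    """Set the text of every <tag> element in the tree to value."""
    for elem in root.iter(tag):
        elem.text = value

def remove_chunks(root, chunk_tag, tag, bad_values):
    """Remove every <chunk_tag> that contains a <tag> whose text is in bad_values."""
    # Materialize the parent list first so removal does not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == chunk_tag and any(
                e.text in bad_values for e in child.iter(tag)
            ):
                parent.remove(child)

if __name__ == "__main__":
    root = ET.fromstring(
        "<db><foo><bar prop='true'/><baz>1</baz></foo>"
        "<foo><baz>2</baz></foo></db>"
    )
    remove_chunks(root, "foo", "baz", {"2"})
    set_text(root, "baz", "9")
    print(ET.tostring(root, encoding="unicode"))
```

At 5MB the whole tree fits comfortably in memory, which is why this is less work than the streaming route.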

Efficient memory storage and retrieval of categorized string literals in C++

Note: This is a follow up to this question.
I have a "legacy" program which does hundreds of string matches against big chunks of HTML. For example if the HTML matches 1 of 20+ strings, do something. If it matches 1 of 4 other strings, do something else. There are 50-100 groups of these strings to match against these chunks of HTML (usually whole pages).
I'm taking a whack at refactoring this mess of code and trying to come up with a good approach to do all these matches.
The performance requirements of this code are rather strict. It needs to not wait on I/O when doing these matches so they need to be in memory. Also there can be 100+ copies of this process running at the same time so large I/O on startup could cause slow I/O for other copies.
With these requirements in mind it would be most efficient if only one copy of these strings are stored in RAM (see my previous question linked above).
This program currently runs on Windows with Microsoft compiler but I'd like to keep the solution as cross-platform as possible so I don't think I want to use PE resource files or something.
Mmapping an external file might work but then I have the issue of keeping program version and data version in sync, one does not normally change without the other. Also this requires some file "format" which adds a layer of complexity I'd rather not have.
So after all of this preamble it seems like the best solution is to have a bunch of arrays of strings which I can then iterate over. This seems kind of messy as I'm mixing code and data heavily, but with the above requirements is there any better way to handle this sort of situation?
I'm not sure just how slow the current implementation is. So it's hard to recommend optimizations without knowing what level of optimization is needed.
Given that, however, I might suggest a two-stage approach. Take your string list and compile it into a radix tree, and then save this tree to some custom format (XML might be good enough for your purposes).
Then your process startup should consist of reading in the radix tree, and matching. If you want/need to optimize the memory storage of the tree, that can be done as a separate project, but it sounds to me like improving the matching algorithm would be a more efficient use of time. In some ways this is a 'roll your own regex system' idea. Rather similar to the suggestion to use a parser generator.
Edit: I've used something similar to this where, as a precompile step, a custom script generates a somewhat optimized structure and saves it to a large char* array. (obviously it can't be too big, but it's another option)
The idea is to keep the list there (making maintenance reasonably easy), but having the pre-compilation step speed up the access during runtime.
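A precompile step of this kind might look like the following sketch: a small script that turns a maintained string list into a C++ header of constant arrays. The group names, the `k_` prefix and the function names are all hypothetical, and it assumes the strings contain no control characters. Constant string data compiled into the binary typically lands in a read-only section that the OS can share between processes running the same executable, which fits the one-copy-in-RAM requirement.

```python
def c_literal(text):
    """Quote text as a C string literal (backslashes and double quotes only)."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'

def render_header(groups):
    """groups: dict of group name -> list of strings. Returns C++ header source."""
    lines = ["// Generated file, do not edit by hand.", "#pragma once", ""]
    for name in sorted(groups):
        lines.append(f"static const char* const k_{name}[] = {{")
        for text in sorted(groups[name]):
            lines.append(f"    {c_literal(text)},")
        lines.append("};")
        lines.append("")
    return "\n".join(lines)

if __name__ == "__main__":
    print(render_header({"ads": ['<div class="ad">', "adserver"], "nav": ["<nav"]}))
```

Running it as part of the build keeps program version and data version in sync, since the strings ship inside the executable.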
If the strings that need to be matched can be locked down at compile time you should consider using a tokenizer generator like lex to scan your input for matches. If you aren't familiar with it lex takes a source file which has some regular expressions (including the simplest regular expressions -- string literals) and C action code to be executed when a match is found. It is used often in building compilers and similar programs, and there are several other similar programs that you could also use (flex and antlr come to mind). lex builds state machine tables and then generates efficient C code for matching input against the regular expressions those state tables represent (input is standard input by default, but you can change this). Using this method would probably not result in the duplication of strings (or other data) in memory among the different instances of your program that you fear. You could probably easily generate the regular expressions from the string literals in your existing code, but it may take a good bit of work to rework your program to use the code that lex generated.
If the strings you have to match change over time there are some regular expressions libraries that can compile regular expressions at run time, but these do use lots of RAM and depending on your program's architecture these might be duplicated across different instances of the program.
The great thing about using a regular expression approach rather than lots of strcmp calls is that if you had the patterns:
and the input:
The partial match for "string" would be done just once for a DFA (Deterministic Finite-state Automaton) regular expression system (like lex), which would probably speed up your system. Building these things does require a lot of work on lex's behalf, but all of the hard work is done up front.
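To make the "shared prefixes are matched once" idea concrete, here is a hand-rolled Aho-Corasick automaton in Python. It is a related technique swapped in for illustration, not what lex generates, and the group names in the usage example are invented. The input is scanned a single time no matter how many literals there are.

```python
from collections import deque

def build_automaton(groups):
    """groups: dict of group name -> iterable of literal strings."""
    goto = [{}]    # goto[state][char] -> next state (a trie of all literals)
    fail = [0]     # failure link: where to resume after a mismatch
    out = [set()]  # group names recognised on reaching each state
    for group, words in groups.items():
        for word in words:
            state = 0
            for ch in word:
                nxt = goto[state].get(ch)
                if nxt is None:
                    nxt = len(goto)
                    goto[state][ch] = nxt
                    goto.append({})
                    fail.append(0)
                    out.append(set())
                state = nxt
            out[state].add(group)
    queue = deque(goto[0].values())  # breadth-first, so shallower links exist first
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]  # inherit matches that end here as suffixes
    return goto, fail, out

def matched_groups(text, automaton):
    """Return the set of group names with at least one literal found in text."""
    goto, fail, out = automaton
    state, found = 0, set()
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        found |= out[state]
    return found

if __name__ == "__main__":
    automaton = build_automaton({"ads": ["string one", "string two"], "nav": ["menu"]})
    print(matched_groups("<p>some string two here</p>", automaton))
```

The automaton is built once at startup (or ahead of time), after which each page costs one pass over its characters.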
Are these literal strings stored in a file? If so, as you suggested, your best option might be to use memory mapped files to share copies of the file across the hundreds of instances of the program. Also, you may want to try and adjust the working set size to try and see if you can reduce the number of page faults, but given that you have so many instances, it might prove to be counterproductive (and besides your program needs to have quota privileges to adjust the working set size).
There are other tricks you can try to optimize IO performance like allocating large pages, but it depends on your file size and the privileges granted to your program.
The bottom line is that you need to experiment to see what works best, and remember to measure after each change :)