XPath: How to optimize performance over the "preceding" axis?

I am using XSLT to transform XML files, and this XPath is a very small part of it. The main concern is a performance issue. First, I will describe the context:
Part of the transformation is a complex grouping operation, used to group up a sequence of similar elements, in the order they appear. This is a small sample from the data:
<!-- potentially a lot more data -->
<MeaningDefBlock>
    <!-- potentially a lot more data -->
    <MeaningSegment>
        <Meaning>
            <value> or </value>
        </Meaning>
    </MeaningSegment>
    <MeaningSegment>
        <MeaningInsert>
            <OpenBracket>
                <value>(</value>
            </OpenBracket>
            <Meaning>
                <value>ex.: </value>
            </Meaning>
            <IllustrationInsert>
                <value>ita, lics</value>
            </IllustrationInsert>
            <ClosedBracket>
                <value>)</value>
            </ClosedBracket>
        </MeaningInsert>
    </MeaningSegment>
    <!-- potentially a lot more data -->
</MeaningDefBlock>
<!-- potentially a lot more data -->
There are only parent elements (ex.: MeaningInsert) and elements that only contain a value element, which contains text (ex.: IllustrationInsert).
The text from the input gets grouped by the elements that hold such text segments: " or ", "(", "ex.: ", "ita, lics" and ")" (in this case, the "ita, lics" segment separates groups that would otherwise all be one). The main point is that elements from different levels can be grouped together. XPath is used to identify each group via the segments preceding it, and that is keyed in the XSL. The whole key is very complicated and not the subject of the question (but I still provide it for context):
<xsl:key name="leavesInGroupL4" match="MeaningSegment//*[value]" use="generate-id(((preceding-sibling::*[value]|ancestor-or-self::MeaningSegment/preceding-sibling::MeaningSegment//*[value])[not(boolean(self::IllustrationInsert|self::LatinName)=boolean(current()/self::IllustrationInsert|current()/self::LatinName))]|ancestor-or-self::MeaningDefBlock)[last()])"/>
The important part being:
(preceding-sibling::*[value]|ancestor-or-self::MeaningSegment/preceding-sibling::MeaningSegment//*[value])[...]
From the context of an element with a value child (like Meaning or OpenBracket), this XPath selects the previous siblings, plus all the elements with values under the preceding siblings of the parent/ancestor MeaningSegment. In practice, it basically selects all the text that came before it (or, rather, the grandparents of the text nodes themselves).
I later realized that there might be further complications with layers and differing depths of the elements with values. I might need to select all such preceding elements regardless of their parents and siblings, but still within the same block. So I substituted "the important part" with a somewhat simpler XPath expression:
preceding::*[value and generate-id(ancestor-or-self::MeaningDefBlock) = generate-id(current()/ancestor-or-self::MeaningDefBlock)]
This only checks that it's in the same block and it works! It successfully selects the preceding segments of text in the block, even if elements with values and parent elements are mixed together. Example input fragment:
...
<OpenBracket>
    <value>(</value>
</OpenBracket>
<SomeParentElement>
    <LatinName>
        <value>also italics</value>
    </LatinName>
</SomeParentElement>
<ClosedBracket>
    <value>)</value>
</ClosedBracket>
...
This is not something the first approach could do because the brackets and the LatinName are not siblings.
However, the new approach with preceding::* is extremely slow! On a real document, the XSL transformation takes up to 5 minutes instead of the usual 3 seconds the original approach takes (including overhead), a 100x increase in time taken. Of course, that is because preceding checks nearly every node in the whole document each time it is executed (which is a lot of times). The document has a lot of MeaningDefBlock blocks (nearly 2000), each with a couple of segments of text (usually single-digit) and a bunch of other straightforward elements/nodes unrelated to said text (usually in the low hundreds per block). Quite easy to see how this all adds up, with preceding trashing performance compared to preceding-sibling.
I was wondering if this could be optimized somehow. In XSL, keys have greatly improved performance multiple times in our project but I'm not sure if preceding and keys can be combined or if the XPath needs to be more complex and tailored to my specific case, perhaps enumerating the elements it should look at (and hopefully ignoring everything else).
Since the input will currently always work with the first approach, I have conceded and rolled back the change (and would probably rather take the 5-minute hit every time than attempt the optimization myself).
I use XSLT 1.0 and XPath 1.0.

I guess you've probably already worked out that
preceding::*[value and generate-id(ancestor-or-self::MeaningDefBlock)
= generate-id(current()/ancestor-or-self::MeaningDefBlock)]
is going to search back to the beginning of the document; it's not smart enough to know that it only needs to search within the containing MeaningDefBlock element.
One answer to that would be to change it to something like this:
ancestor-or-self::MeaningDefBlock//*[value][. << current()]
The << operator requires XPath 2.0, and for a problem as complex as this you really ought to consider moving forwards. However, you can simulate the operator in 1.0 with an expression like generate-id(($A|$B)[1]) = generate-id($A).
There's no guarantee this will be faster, but unlike your existing solution it should be independent of how many MeaningDefBlock elements there are in the document.
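For what it's worth, here is a minimal, untested sketch of that simulation spelled out as a 1.0 key; the key name and the final [last()] step (which picks the nearest preceding leaf, mirroring the original key) are my own assumptions about how it would be slotted in:

<!-- Sketch only: ". << current()" becomes two XPath 1.0 predicates:
     "." sorts first in the union (. | current()), and "." is not
     current() itself (the union then counts two nodes, not one). -->
<xsl:key name="leavesInBlock" match="MeaningSegment//*[value]"
         use="generate-id(ancestor-or-self::MeaningDefBlock//*[value]
                [generate-id((. | current())[1]) = generate-id(.)]
                [count(. | current()) = 2]
              [last()])"/>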

Related

XSLT -- detect if node has already been copied to the result tree

Using xsltproc to clean up input XML.
Think about a part number referencing a part description from random locations in the document. My XML input is poorly designed: it has part number references to part descriptions all over, with no real pattern as to where they are located. Some references are text in elements, some are in attributes, and sometimes an attribute changes meaning depending on context. The attribute containing the part number does not have a consistent name; the name varies depending on the value of other attributes. Maybe I could build a key selecting the dozen varying places containing part numbers, but it would be a mess. I would also worry about inadvertently selecting the wrong items with complex patterns.
So my goal is to copy each referenced part description to the output document only once (not all descriptions are referenced). I can insert tests in all of the various templates to detect the part number in context. The very simple solution would be to just test whether it has already been copied to the result tree and not copy it again. But there is no way to track this, is there?
Plan B is to copy it multiple times to the result tree and then do a second pass over the output document to remove the duplicates.
The use of temporal language in the question ("has already been") is a good clue that you're thinking about this the wrong way. In a declarative language, you shouldn't be thinking in terms of the order of processing.
What you're probably looking for is something like this:
<xsl:variable name="first-of-a-kind-part-references" as="node()*">
<xsl:for-each-group select="f:all-part-references(/)"
group-by="f:get-referenced-part(.)/#id">
<xsl:sequence select="current-group()[1]"/>
</xsl:for-each-group>
</xsl:variable>
and then when processing a part reference
<xsl:if test=". intersect $first-of-a-kind-part-references">
...
</xsl:if>
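The two f: functions above are placeholders for selection logic only you can write; here is a hypothetical sketch of their shape, with invented element and attribute names (part-ref, part-no, part-description), assuming the f namespace is declared on the stylesheet:

<xsl:function name="f:all-part-references" as="node()*">
  <xsl:param name="doc" as="document-node()"/>
  <!-- Hypothetical: enumerate every place part numbers actually appear -->
  <xsl:sequence select="$doc//part-ref | $doc//*[@part-no]/@part-no"/>
</xsl:function>

<xsl:function name="f:get-referenced-part" as="element()?">
  <xsl:param name="ref" as="node()"/>
  <!-- Hypothetical: resolve a reference to its part description -->
  <xsl:sequence select="(root($ref)//part-description[@id = string($ref)])[1]"/>
</xsl:function>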

Revisited: Sort elements of arbitrary XML document recursively

This chapter in my XSLT saga is an extension of the question here. Thanks to all of you who have helped me get this far (@Martin Honnen, @Ian Roberts, @Tim C, and anyone else I missed)!
Here is my current problem:
I reorder some siblings in A_v1.xml to create A_v2.xml. I now consider these two files to be different "versions" of the same file. The two files have exactly the same content; only some siblings are in a different order. Another way of saying it: each element in A_v2.xml still has the same parent as it did in A_v1.xml, but it may now occur before siblings it used to occur after, or after siblings it used to occur before.
I transform A_v1.xml into A_v1_transformed.xml
I transform A_v2.xml into A_v2_transformed.xml
I compare A_v1_transformed.xml to A_v2_transformed.xml and, to my dismay, they are not identical. Furthermore, neither of them is in the expected order shown in expected.xml. They have the same content, but the elements are not sorted in the same order.
My first sort is <xsl:sort select="local-name()"/>. @G. Ken Holman turned me onto <xsl:sort select="."/> (which has the same effect as the <xsl:sort select="self::*"/> I was using). When I use those two sorts in combination I get almost exactly what I want, but in some places the expected alphabetical order seems to be just randomly broken.
I have beefed up my sample files. To keep the question short I just put them on pastebin.
A_v1.xml
A_v2.xml
A_v1_transformed.xml
A_v2_transformed.xml
Here is one of the transformed files with comments added by me to help you understand where/why I think the transform sorted these files incorrectly. I didn't comment the other transformed file because it has similar "failures".
A_v1_transformed_with_comments.xml
Both of the transformed documents should have the same checksum as expected.xml, but they don't. That is my biggest concern. Alphabetical sorting seems the most sane way to sort, but as long as the transform sorts in some sane, repeatable way across different "versions" of the same file, I couldn't care less how the sort happens.
expected.xml
The following XSL files both yield the same result, but the "multi-pass" version may be easier to understand.
xsl_concise.xsl
xsl_multi_pass.xsl
Points for discussion:
I have noticed that when sorting alphabetically, CAPITALIZED letters take precedence: even if a capitalized letter comes after a lower-case letter alphabetically, it will come first in the sort.
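(For what it's worth, that precedence is the processor's default collation at work; xsl:sort's standard case-order attribute controls it. A minimal sketch; whether lower-first matches expected.xml is an assumption:

<!-- case-order is standard XSLT; lower-first sorts lowercase letters
     before their uppercase counterparts, upper-first the reverse. -->
<xsl:sort select="local-name()" case-order="lower-first"/>
)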
Partial success...
I think I may have stumbled onto a partial solution myself, but I am unclear why it works. If you look at my xsl_multi_pass.xsl file you will see:
<!-- Third pass with sortElements mode templates -->
<xsl:variable name="sortElementsRslt">
  <xsl:apply-templates mode="sortElements" select="$sortAttributesRslt"/>
</xsl:variable>
<!-- Fourth pass with deDup mode templates -->
<xsl:apply-templates mode="deDup" select="$sortElementsRslt"/>
If I turn that into:
<!-- Third pass with sortElements mode templates -->
<xsl:variable name="sortElementsRslt1">
  <xsl:apply-templates mode="sortElements" select="$sortAttributesRslt"/>
</xsl:variable>
<!-- Fourth pass with sortElements mode templates -->
<xsl:variable name="sortElementsRslt2">
  <xsl:apply-templates mode="sortElements" select="$sortElementsRslt1"/>
</xsl:variable>
<!-- Fifth pass with deDup mode templates -->
<xsl:apply-templates mode="deDup" select="$sortElementsRslt2"/>
This sorts the elements twice; I don't know why that is necessary. The result, using the example files I have provided, is what I expected, minus the CAPITALIZED letters taking precedence, but that doesn't bother me so long as the result is consistent, which it appears to be. The problem is that this "solution" causes another part of the real files I'm working with to be sorted inconsistently.
In a nutshell, my ultimate goal with all of my XSLT questions is a stylesheet that, when applied to a file, will always generate the same result, even if run on different "versions" of that file. A different "version" of a file is one that has exactly the same content, just in a different order. That means an element's attributes may have been moved around, and elements may occur earlier/later than they previously did.
Have you considered a different tool rather than XSLT for this purpose? The goal you've described sounds to me pretty much exactly like the definition of similar() in XMLUnit:
// control and test are the two XML documents you want to compare, they can
// be String, Reader, org.w3c.dom.Document or org.xml.sax.InputSource
Diff d = new Diff(control, test);
assert d.similar();
SUCCESS!
I think I finally got this working 100% how I want. I incorporated the function given in the answer here by @Dimitre Novatchev to sort elements by their attribute names and values. I still have to perform two passes to sort the elements (applying the exact same templates twice) as I described above for some reason, but it only takes an extra 3 seconds on a 20MB file, so I'm not too worried about it.
Here is the final result:
xsl_2.0_full_document_sorter.xsl
This transform is 100% generic and should be applicable to any XML document, sorting it in what I would consider the most sane way possible. The major benefit of this stylesheet is that it transforms multiple files that have the same content in different orders in exactly the same way, so the transformed results of all files with the same content will be identical.

XSLT 2.0: Limit the ancestor axes to a certain element/s level up the document tree

I'm seeing quite odd behaviour: when trying to limit the results of applying ancestor::* to an element, I always get an extra ancestor, although it is expressly excluded by the predicate.
Here is the code:
XML:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<level_a>
<level_b>
<level_c>
<level_d>
<level_e/>
</level_d>
</level_c>
</level_b>
</level_a>
<level_b>
<level_c>
<level_d>
<level_d>
<level_e/>
</level_d>
</level_d>
</level_c>
</level_b>
</root>
XPath:
(//level_d[not(level_d)])[last()]/ancestor::*[level_c|level_b]
so basically I'm selecting the level_d elements that don't have another level_d element nested inside, getting the last one of them, and trying to get all of its ancestors up to the level_b element.
But the result I'm seeing using Altova XMLSpy 2011 is:
level_a
level_b
I don't quite understand why I'm getting that result, or how I can improve my XPath to effectively limit the ancestors up to level_b (i.e. level_c and level_b).
Any hint is greatly appreciated!
Regards
Vlax
Well, ancestor::*[level_c|level_b] selects all elements on the ancestor axis that have a level_c or level_b child.
You might want (//level_d[not(level_d)])[last()]/ancestor::*[self::level_c|self::level_b].
Or with your textual description "to limit effectively the ancestors to level_b" you simply want (//level_d[not(level_d)])[last()]/ancestor::level_b.
I think you are getting the right result, because I read the clause ancestor::*[level_c|level_b] as "all ancestors containing a level_b or level_c element". So level_b is OK because it contains level_c, and level_a is OK too because it contains level_b.
So if I change your XPath to (//level_d[not(level_d)])[last()]/ancestor::*[level_c], it results in level_b only.
Probably it is not exactly what you are asking for, but I'm not sure I understand the purpose of your XPath well :-)

Create and use HTML full text search index (C++)

I need to create a search index for a collection of HTML pages.
I have no experience in implementing a search index at all, so any general information on how to build one, what information to store, and how to implement advanced searches such as "entire phrase", ranking of results, etc., would be appreciated.
I'm not afraid to build it myself, though I'd be happy to reuse an existing component (or use one to get started with a prototype). I am looking for a solution accessible from C++, preferably without requiring additional installations at runtime. The content is static (so it makes sense to aggregate search information), but a search might have to accumulate results from multiple such repositories.
I can make a few educated guesses, though: create a map word ==> pages for all (relevant) words; a rank can be assigned to the mapping by prominence (h1 > h2 > ... > <p>) and proximity to the top. Advanced searches could be built on top of that: searching for the phrase "homo sapiens" could list all pages that contain "homo" and "sapiens", then scan all the returned pages for locations where the words occur together. However, there are a lot of problematic scenarios and unanswered questions, so I am looking for references to what should be a huge amount of existing work that somehow escapes my google-fu.
[edit for bounty]
The best resource I found until now is this and the links from there.
I do have an implementation roadmap for an experimental system; however, I am still looking for:
Reference material regarding index creation and individual steps
available implementations of individual steps
reusable implementations (within the above environment restrictions)
This process is generally known as information retrieval. You'll probably find this online book helpful.
Existing libraries
Here are two existing solutions that can be fully integrated into an application without requiring a separate process (I believe both will compile with VC++).
Xapian is mature and may do much of what you need, from indexing to ranked retrieval. Separate HTML parsing would be required because, AFAIK, it does not parse HTML (it has a companion program, Omega, which is a front end for indexing web sites).
Lucene is an indexing/search Apache library in Java, with an official pre-release C version, Lucy, and an unofficial C++ version, CLucene.
Implementing information retrieval
If the above options are not viable for some reason, here's some info on the individual steps of building and using an index. Custom solutions can range from simple to very sophisticated, depending on what you need for your application. I've broken the process into 5 steps:
HTML processing
Text processing
Indexing
Retrieval
Ranking
HTML Processing
There are two approaches here
Stripping: The page you referred to discusses a technique generally known as stripping, which involves removing all the HTML elements that won't be displayed and translating others to their display form. Personally, I'd preprocess using Perl and index the resulting text files. But for an integrated solution, particularly one where you want to record significance tags (e.g. <h1>, <h2>), you probably want to roll your own. Here is a partial implementation of a C++ stripping routine (it appears in Thinking in C++, final version of the book here) that you could build from.
Parsing: A level up in complexity from stripping is HTML parsing, which would help in your case for recording significance tags. However, a good C++ HTML parser is hard to find. Some options might be htmlcxx (never used it, but it is active and looks promising) or hubbub (a C library, part of NetSurf, which claims to be portable).
If you are dealing with XHTML or are willing to use an HTML-to-XML converter, you can use one of the many available XML parsers. But again, HTML-to-XML converters are hard to find; the only one I know of is HTML Tidy. In addition to conversion to XHTML, its primary purpose is to fix missing/broken tags, and it has an API that could possibly be used to integrate it into an application. Given XHTML documents, there are many good XML parsers, e.g. Xerces-C++ and TinyXML.
Text Processing
For English at least, processing text into words is pretty straightforward. There are a couple of complications when search is involved, though.
Stop words are words known a priori not to provide a useful distinction between documents in the set, such as articles and prepositions. Often these words are not indexed and are filtered from query streams. There are many stop word lists available on the web, such as this one.
Stemming involves preprocessing documents and queries to identify the root of each word, to better generalize a search. E.g. searching for "foobarred" should yield "foobarred", "foobarring", and "foobar". The index can be built and searched on roots alone. The two general approaches to stemming are dictionary-based (lookups from word ==> root) and algorithm-based. The Porter algorithm is very common, and several implementations are available, e.g. C++ here or C here. Stemming in the Snowball C library supports several languages.
Soundex encoding: One method to make search more robust to spelling errors is to encode words with a phonetic encoding. Then, when queries contain phonetic errors, they will still map directly to indexed words. There are a lot of implementations around; here's one.
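To make the text-processing step concrete, here is a minimal sketch in plain C++ of the tokenize/lowercase/stop-word pipeline described above; the stop word set stands in for a real list like the one linked:

#include <cctype>
#include <set>
#include <string>
#include <vector>

// Split text into lowercase alphabetic tokens, dropping stop words.
std::vector<std::string> tokenize(const std::string& text,
                                  const std::set<std::string>& stopWords) {
    std::vector<std::string> tokens;
    std::string word;
    for (unsigned char c : text) {
        if (std::isalpha(c)) {
            word += static_cast<char>(std::tolower(c));
        } else if (!word.empty()) {
            if (stopWords.count(word) == 0) tokens.push_back(word);
            word.clear();
        }
    }
    if (!word.empty() && stopWords.count(word) == 0) tokens.push_back(word);
    return tokens;
}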
Indexing
The map word ==> page data structure is known as an inverted index. It's inverted because it's often generated from a forward index of page ==> words. Inverted indexes generally come in two flavors: the inverted file index, which maps words to each document they occur in, and the full inverted index, which maps words to each position in each document they occur in.
The important decision is what backend to use for the index, some possibilities are, in order of ease of implementation:
SQLite or Berkeley DB: both of these are database engines with C++ APIs that can be integrated into a project without requiring a separate server process. Persistent databases are essentially files, so multiple index sets can be searched by just changing the associated file. Using a DBMS as a backend simplifies index creation, updating, and searching.
In-memory data structure: if you're using an inverted file index that is not prohibitively large (in memory consumption and time to load), this could be implemented as a std::map<std::string, word_data_class>, using boost::serialization for persistence.
On-disk data structure: I've heard of blazingly fast results using memory-mapped files for this sort of thing, YMMV. An inverted file index would involve two index files: one representing words with something like struct { char word[n]; unsigned int offset; unsigned int count; };, and a second representing (word, document) tuples with just unsigned ints (the words being implicit in the file offset). The offset is the file offset of the first document id for the word in the second file, and count is the number of document ids associated with that word (the number of ids to read from the second file). Searching would then reduce to a binary search through the first file, with a pointer into a memory-mapped file. The downside is the need to pad/truncate words to get a constant record size.
The procedure for indexing depends on which backend you use. The classic algorithm for generating an inverted file index (detailed here) begins with reading through each document and extending a list of (page id, word) tuples, ignoring duplicate words within each document. After all documents are processed, sort the list by word, then collapse it into (word, (page id1, page id2, ...)).
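Here is a sketch of that classic algorithm against the in-memory std::map backend from above (the names are mine; document ids are simply positions in the input vector):

#include <map>
#include <set>
#include <string>
#include <vector>

// word ==> sorted ids of the documents containing it (inverted file index).
using InvertedIndex = std::map<std::string, std::vector<int>>;

InvertedIndex buildIndex(const std::vector<std::vector<std::string>>& docs) {
    InvertedIndex index;
    for (int id = 0; id < static_cast<int>(docs.size()); ++id) {
        // De-duplicate words within one document first.
        std::set<std::string> unique(docs[id].begin(), docs[id].end());
        for (const std::string& w : unique)
            index[w].push_back(id); // ids arrive in increasing order,
                                    // so each posting list stays sorted
    }
    return index;
}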
The mifluz GNU library implements inverted indexes with storage, but without document or query parsing. It is GPL, so it may not be a viable option, but it will give you an idea of the complexities involved in an inverted index that supports a large number of documents.
Retrieval
A very common method is boolean retrieval, which is simply the union/intersection of documents indexed for each of the query words that are joined with or/and, respectively. These operations are efficient if the document ids are stored in sorted order for each term, so that algorithms like std::set_union or std::set_intersection can be applied directly.
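For instance, AND-ing two query terms is a direct std::set_intersection over their sorted posting lists; a sketch reusing the InvertedIndex type from the previous snippet:

#include <algorithm>
#include <iterator>

// Ids of documents containing both terms: intersect two sorted posting lists.
std::vector<int> retrieveAnd(const InvertedIndex& index,
                             const std::string& a, const std::string& b) {
    static const std::vector<int> empty;
    auto ita = index.find(a);
    auto itb = index.find(b);
    const std::vector<int>& pa = (ita != index.end()) ? ita->second : empty;
    const std::vector<int>& pb = (itb != index.end()) ? itb->second : empty;
    std::vector<int> result;
    std::set_intersection(pa.begin(), pa.end(), pb.begin(), pb.end(),
                          std::back_inserter(result));
    return result;
}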
There are variations on retrieval; Wikipedia has an overview, but standard boolean is good for many/most applications.
Ranking
There are many methods for ranking the documents returned by boolean retrieval. Common methods are based on the bag of words model, which just means that the relative position of words is ignored. The general approach is to score each retrieved document relative to the query, and rank documents based on their calculated score. There are many scoring methods, but a good starting place is the term frequency-inverse document frequency formula.
The idea behind this formula is that if a query word occurs frequently in a document, that document should score higher, but a word that occurs in many documents is less informative so this word should be down weighted. The formula is, over query terms i=1..N and document j
score[j] = sum_over_i(word_freq[i,j] * inv_doc_freq[i])
where the word_freq[i,j] is the number of occurrences of word i in document j, and
inv_doc_freq[i] = log(M/doc_freq[i])
where M is the number of documents and doc_freq[i] is the number of documents containing word i. Notice that words that occur in all documents will not contribute to the score. A more complex scoring model that is widely used is BM25, which is included in both Lucene and Xapian.
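Translated literally into code, the scoring loop might look like this sketch (the two frequency tables are assumed to have been derived from the index above):

#include <cmath>
#include <map>
#include <string>
#include <vector>

// tf-idf score of one document j for the given query terms:
//   termFreq[w] = occurrences of w in document j   (word_freq[i,j])
//   docFreq[w]  = number of documents containing w (doc_freq[i])
//   totalDocs   = M
double tfIdfScore(const std::vector<std::string>& query,
                  const std::map<std::string, int>& termFreq,
                  const std::map<std::string, int>& docFreq,
                  int totalDocs) {
    double score = 0.0;
    for (const std::string& w : query) {
        auto tf = termFreq.find(w);
        auto df = docFreq.find(w);
        if (tf == termFreq.end() || df == docFreq.end()) continue;
        // Words occurring in all documents get log(M/M) = 0 weight.
        score += tf->second * std::log(double(totalDocs) / df->second);
    }
    return score;
}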
Often, effective ranking for a particular domain is obtained by adjusting through trial and error. A starting place for adjusting rankings by heading/paragraph context could be inflating word_freq for a word based on its heading/paragraph context, e.g. 1 for a paragraph, 10 for a top-level heading. For some other ideas, you might find this paper interesting, in which the authors adjusted BM25 ranking for positional scoring (the idea being that words closer to the beginning of the document are more relevant than words toward the end).
Objective quantification of ranking performance is obtained by precision-recall curves or mean average precision, detailed here. Evaluation requires an ideal set of queries paired with all the relevant documents in the set.
Depending on the size and number of the static pages, you might want to look at an already existing search solution.
"How do you implement full-text search for that 10+ million row table, keep up with the load, and stay relevant? Sphinx is good at those kinds of riddles."
I would choose the Sphinx engine for full-text searching. The licence is GPL, but they also have a commercial version available. It is meant to be run stand-alone [2], but it can also be embedded into applications by extracting the needed functionality (be it indexing [1], searching [3], stemming, etc.).
The data should be obtained by parsing the input HTML files and transforming them into plain text with a parser like libxml2's HTMLparser (I haven't used it, but they say it can parse even malformed HTML). If you aren't bound to C/C++, you could take a look at Beautiful Soup.
After obtaining the plain texts, you could store them in a database like MySQL or PostgreSQL. If you want to keep everything embedded, you should go with SQLite.
Note that Sphinx doesn't work out of the box with SQLite, but there is an attempt to add support (sphinx-sqlite3).
I would attack this with a little SQLite database. You could have tables for 'page', 'term', and 'page term'. 'Page' would have columns like id, text, title, and url. 'Term' would have a column containing a word, as well as a primary ID. 'Page term' would have foreign keys to a page ID and a term ID, and could also store a weight, calculated from the distance from the top and the number of occurrences (or whatever you want).
Perhaps a more efficient way would be to have only two tables: 'page' as before, and 'page term', which would hold the page ID, the weight, and a hash of the term word.
An example query: you want to search for "foo". You hash "foo", then query all 'page term' rows that have that term hash. Sort by descending weight and show the top ten results.
I think this should query reasonably quickly, though it obviously depends on the number and size of the pages in question. SQLite isn't difficult to bundle and shouldn't need an additional installation.
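As a sketch of that lookup through the SQLite C API; the table and column names follow the hypothetical schema above:

#include <cstdint>
#include <cstdio>
#include <sqlite3.h>

// Print the ten heaviest pages for one term hash.
void searchTerm(sqlite3* db, std::int64_t termHash) {
    const char* sql =
        "SELECT p.title, p.url, pt.weight "
        "FROM page_term AS pt JOIN page AS p ON p.id = pt.page_id "
        "WHERE pt.term_hash = ? "
        "ORDER BY pt.weight DESC LIMIT 10;";
    sqlite3_stmt* stmt = nullptr;
    if (sqlite3_prepare_v2(db, sql, -1, &stmt, nullptr) != SQLITE_OK)
        return; // real code would report sqlite3_errmsg(db)
    sqlite3_bind_int64(stmt, 1, termHash);
    while (sqlite3_step(stmt) == SQLITE_ROW) {
        std::printf("%s (%s) weight=%.2f\n",
                    reinterpret_cast<const char*>(sqlite3_column_text(stmt, 0)),
                    reinterpret_cast<const char*>(sqlite3_column_text(stmt, 1)),
                    sqlite3_column_double(stmt, 2));
    }
    sqlite3_finalize(stmt);
}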
Ranking pages is the really tricky bit here. With a large sample of pages you can use links quite a lot in working out ranks. Otherwise you need to check how words seem to be placed, and also make sure your engine doesn't get fooled by 'dictionary' pages.
Good luck!

How do I de-duplicate a list of nodes in XSLT - and return the last node encountered?

I've seen lots of "de-duplicate this XML" questions, but everyone wants the first node, or the nodes are identical. I have a bit of a bigger puzzle.
I have a list of articles in XML, a relevant snippet is shown:
<item><key>Article1</key><stamp>100</stamp></item>
<item><key>Article1</key><stamp>130</stamp></item>
<item><key>Article2</key><stamp>800</stamp></item>
<item><key>Article1</key><stamp>180</stamp></item>
<item><key>Article3</key><stamp>900</stamp></item>
<item><key>Article3</key><stamp>950</stamp></item>
<item><key>Article4</key><stamp>990</stamp></item>
<item><key>Article5</key><stamp>999</stamp></item>
I'd like a list of nodes where the keys are unique and where the last instance is returned, not the first. Stamp (an integer) is always increasing for elements of a particular key; ideally I'd like the "largest stamp", but since they're always in order, the shortcut is OK.
Desired result: (Order doesn't really matter.)
<item><key>Article2</key><stamp>800</stamp></item>
<item><key>Article1</key><stamp>180</stamp></item>
<item><key>Article3</key><stamp>950</stamp></item>
<item><key>Article4</key><stamp>990</stamp></item>
<item><key>Article5</key><stamp>999</stamp></item>
I'm somewhat confused on how to get this list. Any ideas?
I'm using the Saxon processor if it matters.
The short version:
Instead of using [1] in the Muenchian grouping, use [last()].
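Spelled out, a minimal sketch of that key and selection, with element names taken from the sample input:

<xsl:key name="item-by-key" match="item" use="key"/>

<!-- Keep each item that is the LAST of its key group in document order. -->
<xsl:template match="/">
  <xsl:for-each select="//item[generate-id() =
                               generate-id(key('item-by-key', key)[last()])]">
    <xsl:copy-of select="."/>
  </xsl:for-each>
</xsl:template>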