XSLT - filter out elements that are not x-referenced

I have developed a (semi-)identity transformation from which I need to filter out elements that are unused.
The source XML contains 2001 "zones". No more, no less.
It also contains any number of devices, which are placed in these zones.
One specific example source XML contains 8800 of these devices.
More than one device can be placed in the same zone.
Zone 0 is a "null zone", meaning that a device placed in this zone is currently unassigned.
This means that the number of real zones is 2000.
Simplified source XML:
<configuration>
<zones>
<zone id="0">
...
<zone id="2000"/>
</zones>
<devices>
<device addr="1">
<zone>1</zone>
</device>
...
<device addr="8800">
<zone>1</zone>
</device>
</devices>
</configuration>
The problem we have is that out of the 2000 usable zones, most often only roughly 200 of these contain one or more devices.
I need to whittle out unused zones. There are reasons for this which serve only to detract from the question at hand, so if you don't mind I will not elaborate here.
I currently have this problem solved, like so:
<xsl:for-each select="zones/zone[#id > 0]">
<xsl:when test="/configuration/devices/device[zone=current()/#id]">
<xsl:call-template name="Zone"/>
</xsl:when>
</xsl:for-each>
And this works.
But on some of the larger projects the transformation takes absolute ages.
That is because in pseudo code this translates to:
for each <zone> in <zones>
find any <device> in <devices> with reference to <zone>
if found
apply zone template
endif
endfor
With 2000 zones to iterate over - and each iteration triggering up to 8800 searches for a qualifying device - you can imagine this taking a very long time.
And to compound problems, libxslt provides no API for progress reporting. This means that for a long time our application will appear frozen while it imports and converts the customer XML.
I do have the option to write every zone unconditionally, and upon the application bootstrapping from our (output) XML, remove or ignore any zones that have no devices placed in them.
And it may turn out that this is the only option I have.
The downside to this is that my output XML then contains a lot of zones that are not referenced.
That makes it a bit difficult to consolidate what we have in our configuration and what parts of it the application is actually using.
My question to you is:
Have I got other options that ensure that the output XML only contains used zones?
I am not averse to performing a follow-up XSLT conversion.
I was for instance thinking that it may be possible(?) to write an attribute used="false" to each <Zone> element in my output.
Then as I go over the devices, I find the relevant zone in my output XML (providing it is assigned / zone is non-zero) and change this to used="true".
Then follow up with a quick second transformation to remove all zones which have used="false".
But, can I reference my own output elements during an XSLT transformation and change their contents?

You said you have a kind of identity transformation so I would use that as the starting point, plus a key:
<xsl:key name="zone-ref" match="device" use="zone"/>
and an empty template
<xsl:template match="zones/zone[not(key('zone-ref', #id))]"/>
that prevents unreferences zones from being copied.
Or, if there are other conditions, then e.g.
<xsl:template match="zones/zone[#id > 0 and not(key('zone-ref', #id))]"/>

Related

XSLT -- detect if node has already been copied to the result tree

Using xsltproc to clean up input XML.
Think about a part number referencing a part description from random locations in the document. My XML input is poorly designed and it has part number references to part descriptions all over with no real pattern as to where they are located. Some references are text in elements, some in attributes, sometimes the attribute changes meaning depending on context. The attribute containing the part number does not have a consistent name, the name used alters depending on the value of other attributes. Maybe I could build a key selecting the dozen varying places containing part number but it would be a mess. I would also worry about inadvertently selecting the wrong items with complex patterns.
So my goal is to only copy the referenced part descriptions to the output document once (not all descriptions are referenced). I can insert tests in all of the various templates to detect the part number in context. The very simple solution would be to just test if it has already been copied to the result tree and not copy it again. But there is no way to track this?
Plan B is to copy it multiple times to the result tree and then do a second pass over the output document to remove the duplicates.
The use of temporal language in the question ("has already been") is a good clue that you're thinking about this the wrong way. In a declarative language, you shouldn't be thinking in terms of the order of processing.
What you're probably looking for is something like this:
<xsl:variable name="first-of-a-kind-part-references" as="node()*">
<xsl:for-each-group select="f:all-part-references(/)"
group-by="f:get-referenced-part(.)/#id">
<xsl:sequence select="current-group()[1]"/>
</xsl:for-each-group>
</xsl:variable>
and then when processing a part reference
<xsl:if test=". intersect $first-of-a-kind-part-references">
...
</xsl:if>
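As a concrete (and entirely hypothetical) instantiation of those placeholder functions, suppose the references were `partref` elements carrying a `pnum` attribute that matches a part's `@id`; the variable could then be written without any custom functions at all:

```xml
<xsl:variable name="first-of-a-kind-part-references" as="node()*">
  <!-- group every reference by the part it points at,
       and keep only the first reference in document order -->
  <xsl:for-each-group select="//partref" group-by="@pnum">
    <xsl:sequence select="current-group()[1]"/>
  </xsl:for-each-group>
</xsl:variable>
```

In the messy real document, the select and group-by expressions would of course have to enumerate the various places a part number can appear, which is exactly what the f: functions abstract away.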

XPath: How to optimize performance over the "preceding" axis?

I am using XSLT to transform XML files and this XPath is a very small part of it. The main concern is a performance issue. First, I will describe the context:
Part of the transformation is a complex grouping operation, used to group up a sequence of similar elements, in the order they appear. This is a small sample from the data:
<!-- potentially a lot more data-->
<MeaningDefBlock>
<!-- potentially a lot more data-->
<MeaningSegment>
<Meaning>
<value> or </value>
</Meaning>
</MeaningSegment>
<MeaningSegment>
<MeaningInsert>
<OpenBracket>
<value>(</value>
</OpenBracket>
<Meaning>
<value>ex.: </value>
</Meaning>
<IllustrationInsert>
<value>ita, lics</value>
</IllustrationInsert>
<ClosedBracket>
<value>)</value>
</ClosedBracket>
</MeaningInsert>
</MeaningSegment>
<!-- potentially a lot more data-->
</MeaningDefBlock>
<!-- potentially a lot more data-->
There are only parent elements (ex.: MeaningInsert) and elements that only contain a value element, which contains text (ex.: IllustrationInsert).
The text from the input gets grouped into elements that have such text segments: or (ex.:, ita, lics and ) (in this case, the "ita, lics" segment separates the groups that would otherwise be all in one). The main point is that elements from different levels can be grouped. XPath is used to identify groups via previous segments and keyed in the XSL. The whole key is very complicated and not the object of the question (but I still provide it for context):
<xsl:key name="leavesInGroupL4" match="MeaningSegment//*[value]" use="generate-id(((preceding-sibling::*[value]|ancestor-or-self::MeaningSegment/preceding-sibling::MeaningSegment//*[value])[not(boolean(self::IllustrationInsert|self::LatinName)=boolean(current()/self::IllustrationInsert|current()/self::LatinName))]|ancestor-or-self::MeaningDefBlock)[last()])"/>
The important part being:
(preceding-sibling::*[value]|ancestor-or-self::MeaningSegment/preceding-sibling::MeaningSegment//*[value])[...]
From the context of an element with a value child (like Meaning or OpenBracket), this XPath selects the previous siblings and all the elements with values from the preceding siblings of the parent/ancestor MeaningSegment. In practice, it basically selects all the text that came before it (or, rather, the grandparent of the text itself)
I have later realized that there might be even further complications with layers and differing depth of the elements with values. I might need to select all such preceding elements regardless of their parent and siblings but still in the same block. I have substituted "the important part" with a somewhat simpler XPath expression:
preceding::*[value and generate-id(ancestor-or-self::MeaningDefBlock) = generate-id(current()/ancestor-or-self::MeaningDefBlock)]
This only checks that it's in the same block and it works! It successfully selects the preceding segments of text in the block, even if elements with values and parent elements are mixed together. Example input fragment:
...
<OpenBracket>
<value>(</value>
</OpenBracket>
<SomeParentElement>
<LatinName>
<value>also italics</value>
</LatinName>
</SomeParentElement>
<ClosedBracket>
<value>)</value>
</ClosedBracket>
...
This is not something the first approach could do because the brackets and the LatinName are not siblings.
However, the new approach with preceding::* is extremely slow! On a real document, the XSL transformation takes up to 5 minutes instead of the usual 3 seconds that the original approach takes (including overhead), which is a 100x increase in time taken. Of course, that is because preceding checks nearly every node in the whole document each time it is executed (a lot of times). The document has a lot of MeaningDefBlock blocks (nearly 2000), each with a couple of segments of text (usually single-digit) and a bunch of other straightforward elements/nodes unrelated to said text (usually in the low hundreds per block). Quite easy to see how this all adds up to preceding trashing performance compared to preceding-sibling.
I was wondering if this could be optimized somehow. In XSL, keys have greatly improved performance multiple times in our project but I'm not sure if preceding and keys can be combined or if the XPath needs to be more complex and tailored to my specific case, perhaps enumerating the elements it should look at (and hopefully ignoring everything else).
Since the input will currently always work with the first approach, I have conceded and rolled back the change (and would probably rather take the 5-minute hit every time than try optimizing it myself).
I use XSLT 1.0 and XPath 1.0
I guess you've probably already worked out that
preceding::*[value and generate-id(ancestor-or-self::MeaningDefBlock)
= generate-id(current()/ancestor-or-self::MeaningDefBlock)]
is going to search back to the beginning of the document; it's not smart enough to know that it only needs to search within the containing MeaningDefBlock element.
One answer to that would be to change it to something like this:
ancestor-or-self::MeaningDefBlock//*[value][. << current()]
The << operator requires XPath 2.0 and for a problem as complex as this, you really ought to consider moving forwards. However you can simulate the operator in 1.0 with an expression like generate-id(($A|$B)[1]) = generate-id($A).
There's no guarantee this will be faster, but unlike your existing solution it should be independent of how many MeaningDefBlock elements there are in the document.
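Spelling out the XSLT 1.0 simulation mentioned above: a union is always delivered in document order, so `$A << $B` holds exactly when `$A` is the first node of `$A | $B`. Applied to the suggested expression, an untested 1.0 sketch might look like this (the extra count() predicate excludes the current element itself):

```xpath
ancestor-or-self::MeaningDefBlock//*[value]
    [count(. | current()) != 1]
    [generate-id((. | current())[1]) = generate-id(.)]
```

Like the 2.0 version, this only ever walks nodes inside the containing MeaningDefBlock, not everything preceding in the whole document.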

Revisited: Sort elements of arbitrary XML document recursively

This chapter in my XSLT saga is an extension of the question here. Thanks to all of you who have helped me get this far (@Martin Honnen, @Ian Roberts, @Tim C, and anyone else I missed)!
Here is my current problem:
I reorder some siblings in A_v1.xml to create A_v2.xml. I now consider these two files to be different "versions" of the same file. The two files have the exact same content; only some siblings are in a different order. Another way of saying it: each element in A_v2.xml still has the same parent as it did in A_v1.xml, but it may now occur before siblings it used to occur after, or after siblings it used to occur before.
I transform A_v1.xml into A_v1_transformed.xml
I transform A_v2.xml into A_v2_transformed.xml
I compare A_v1_transformed.xml to A_v2_transformed.xml and to my dismay they are not identical. Furthermore, neither of them is in the expected order shown in expected.xml. They have the same content, but the elements are not sorted in the same order.
My first sort is <xsl:sort select="local-name()"/>. @G. Ken Holman turned me onto <xsl:sort select="."/> (which has the same effect as <xsl:sort select="self::*"/>, which I was using). When I use those two sorts in combination I get almost exactly what I want, but in some places it seems the expected alphabetical order is just randomly broken.
I have beefed up my sample files. To keep the question short I just put them on pastebin.
A_v1.xml
A_v2.xml
A_v1_transformed.xml
A_v2_transformed.xml
Here is one of the transformed files with comments added by me to help you understand where/why I think the transform sorted these files incorrectly. I didn't comment the other transformed file because it has similar "failures".
A_v1_transformed_with_comments.xml
Both of the transformed documents should have the same checksum as expected.xml, but they don't. That is my biggest concern. Alphabetical sorting seems the most sane way to sort, but so long as the transform sorted in some sane way I couldn't care less how the sort happened so long as the sort is repeatable among different "versions" of the same file.
expected.xml
The following XSL files both yield the same result, but the "multi-pass" version may be easier to understand.
xsl_concise.xsl
xsl_multi_pass.xsl
Points for discussion:
I have noticed that when sorting alphabetically CAPITALIZED letters take precedence. Even if the capitalized letter comes after a lower case letter alphabetically it will come first in the sort.
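Incidentally, that precedence can usually be pinned down explicitly with the case-order attribute on xsl:sort, which exists in both XSLT 1.0 and 2.0 (whether it takes effect depends on the processor's collation support). A one-line sketch:

```xml
<xsl:sort select="local-name()" case-order="lower-first"/>
```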
Partial success...
I think I may have stumbled onto a partial solution myself, but I am unclear why it works. If you look at my xsl_multi_pass.xsl file you will see:
<!-- Third pass with sortElements mode templates -->
<xsl:variable name="sortElementsRslt">
<xsl:apply-templates mode="sortElements" select="$sortAttributesRslt"/>
</xsl:variable>
<!-- Fourth pass with deDup mode templates -->
<xsl:apply-templates mode="deDup" select="$sortElementsRslt"/>
If I turn that into:
<!-- Third pass with sortElements mode templates -->
<xsl:variable name="sortElementsRslt1">
<xsl:apply-templates mode="sortElements" select="$sortAttributesRslt"/>
</xsl:variable>
<!-- Fourth pass with sortElements mode templates -->
<xsl:variable name="sortElementsRslt2">
<xsl:apply-templates mode="sortElements" select="$sortElementsRslt1"/>
</xsl:variable>
<!-- Fifth pass with deDup mode templates -->
<xsl:apply-templates mode="deDup" select="$sortElementsRslt2"/>
This sorts the elements twice; I don't know why that is necessary. The result using the example files I have provided is what I expected, minus the CAPITALIZED letters taking precedence, but that doesn't bother me so long as the result is consistent, which it appears to be. The problem is that this "solution" causes another part of the real files I'm working with to be sorted inconsistently.
In a nutshell, my ultimate goal with all of my XSLT questions is a stylesheet that, when applied to a file, will always generate the same result even if run on different "versions" of that file. A different "version" of a file would be one that has the exact same content, just in a different order. That means an element's attributes may have been moved around and that elements may occur earlier/later than they previously did.
Have you considered a different tool rather than XSLT for this purpose? The goal you've described sounds to me like pretty much exactly the definition of similar() in XMLUnit:
// control and test are the two XML documents you want to compare, they can
// be String, Reader, org.w3c.dom.Document or org.xml.sax.InputSource
Diff d = new Diff(control, test);
assert d.similar();
SUCCESS!
I think I finally got this working 100% how I want. I incorporated the function given in the answer here by @Dimitre Novatchev to sort elements by their attribute names and values. I still have to perform two passes to sort the elements (applying the exact same templates twice) as I described above for some reason, but it only takes an extra 3 seconds on a 20MB file, so I'm not too worried about it.
Here is the final result:
xsl_2.0_full_document_sorter.xsl
This transform is 100% generic and should be usable on any XML document to sort it in what I would consider the most sane way possible. The major benefit of this stylesheet is that it will transform multiple files that have the same content in different orders the exact same way, so the transformed results of all the files that have the same content will be identical.

XSL-FO: Need to disable page break at page sequence level

My requirement is to produce, essentially, a page WITHIN a page. The xsl defines a page 1/3 the size of an A4 sheet, but up to 3 of them must print on the sheet. The page is a standard header/body/footer, with a 'Page X of Y' on it, and of course it is simple. But if there are 4 of these "pages", only 2 sheets should be printed, 3 pages on the first and 1 on the second. Instead, 4 sheets are printed, because the output PDF "tells" the printer that each is a complete sheet. So what I want to do is either:
code the fo so it does not page break after it finished a page (something like page-break-after="avoid" but at the page sequence level)
OR
generate a page sequence within a page sequence, the outer one being defined as A4 size, the inner 1/3 of that.
I've tried the 2nd directly in a simple way, i.e.,
<fo:page-sequence master-reference="A4">
<fo:page-sequence master-reference="one-third_A4">
...
</fo:page-sequence>
</fo:page-sequence>
..and the processor definitely does NOT like it.
Are there any instructions I can apply that could do either of these? I looked in places like schemacentral and w3schools.com and cannot find anything.
Not sure if I understood your needs (a drawing would help), but can't you simply put your content into a series of fo:block-container elements with a specified height? They would come out stacked vertically.
If you need more complex geometry, check out the flow-maps feature of XSL 1.1: three body regions per page connected sequentially in a <fo:flow-target-list>, and a single fo:flow providing content for them.
Having fo:page-sequence within another fo:page-sequence is not valid according to the spec (what would a page number mean then?), and every fo:page-sequence starts a new physical page by definition.
fo:page-sequence-wrapper won't help you on sub-page level either.
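For the record, a rough sketch of the block-container suggestion, with assumed dimensions (an A4 body split into three 99mm bands; the master name "A4" is taken from the question):

```xml
<fo:page-sequence master-reference="A4">
  <fo:flow flow-name="xsl-region-body">
    <!-- one fixed-height container per 1/3-A4 "page";
         three of these stack vertically on each physical sheet -->
    <fo:block-container block-progression-dimension="99mm" overflow="hidden">
      <fo:block><!-- header / body / footer of sub-page 1 --></fo:block>
    </fo:block-container>
    <fo:block-container block-progression-dimension="99mm" overflow="hidden">
      <fo:block><!-- sub-page 2 --></fo:block>
    </fo:block-container>
    <!-- ... one container per remaining sub-page ... -->
  </fo:flow>
</fo:page-sequence>
```

One caveat: the 'Page X of Y' numbering would then have to be emitted as ordinary content, since fo:page-number counts physical pages.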

Limiting for-each criteria due to performance issues

I have some code that looks like this:
<xsl:for-each select="($home/item[#key='issues']/item[#key='cwa']/item[#key='archives']/item[#key='2012']/*/*)">
<xsl:if test="(position() < 15) and (position() > 1)">
...
It works fine, except there are hundreds of items in the result set, and I only want to show 20. The structure beneath 2012 looks something like this
2012
01
02
03
So in theory I only need the current month and last month. Is there a way to limit that in the for-each statement itself?
This is in the Sitecore CMS, so unfortunately I don't have easy access to the raw XML.
In my experience, a "template / apply-templates" construct is better performance-wise. I also had an XML document with hundreds of items. In that particular case I had to rename elements, and the performance gain was almost 30 seconds.
Also for other reasons I would recommend using the "template / apply-templates" construct:
For loops vs. apply-templates
I am sure you can also adapt the code in your example.
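A hedged sketch of what that adaptation might look like for the path in the question (the mode name is invented; the position() window is copied from the original for-each/if):

```xml
<!-- push the selected archive entries through a dedicated mode
     instead of pulling them with for-each -->
<xsl:apply-templates mode="archive-entry"
    select="($home/item[@key='issues']/item[@key='cwa']
             /item[@key='archives']/item[@key='2012']/*/*)
            [position() &gt; 1 and position() &lt; 15]"/>

<xsl:template match="*" mode="archive-entry">
  <!-- render one entry here -->
</xsl:template>
```

Note that the position() window can live in the select expression itself, so the xsl:if inside the loop disappears entirely.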
You could also create an XSL extension to populate the foreach. In the XSL extension you could use Linq to select only the items that you really need.
Another drawback of having hundreds of items as child items is that the content editor can become very slow when you try to open the parent item. Dividing the child items automatically into folders, per month for example, could resolve this (or try Tim Ward's ItemBuckets package for another approach using a Lucene search index).
Edit: it just came to mind that defining the select for the for-each in a variable before doing the actual loop could also gain some performance, like this (not exactly sure if this is true, by the way ;) ):
<xsl:variable name="list" select="($home/item[]..... etc.)" />
<xsl:for-each select="$list>
....