How to check whether a variable filled by apply-templates is a string or a (possibly empty) node in XSLT?

I need a test for a variable that evaluates to true in two cases:
The variable contains a string with at least one non-whitespace character
The variable contains any node (which may be empty)
The variable is filled with the result of an apply-templates call.
I tried
test="normalize-space($var)"
but this doesn't cover the empty tag possibility. I also tried simply this:
test="$var"
but this evaluates to true even for whitespace-only strings.
By the way, "$var/*" produces an error along the lines of "Expression ...something I don't remember... node-set", which I think is because the variable is instantiated from apply-templates.
Is there any solution for this, even one that needs a multi-level decision?
EDIT: I forgot to say that it's for XSLT 1.0 and preferably without any exslt extensions or similar.

If you want to check whether a result tree fragment contains any child nodes then you need exsl:node-set e.g.
<xsl:if test="exsl:node-set($variable)/node()"
xmlns:exsl="http://exslt.org/common">
Without that extension function (or a similar one your particular XSLT 1.0 processor offers) you can't perform that check.

Hope this helps:
"string($variable)" will test true if the string has any characters in it, false if it equals ''.
http://www.dpawson.co.uk/xsl/sect2/N3328.html#d4474e64
Contains no child nodes: not(node())
Contains no text content: not(string(.))
Contains no text other than whitespace: not(normalize-space(.))
Contains nothing except comments: not(node()[not(self::comment())])

Related

XSL "not" not working as expected attempting to compare two variables

Using XSL v3.0 I'm trying to compare two variables. One was created from .txt format directory listings imported as unparsed text. The other was created by querying xml files. Both contain references to jpgs.
I want to create a third variable using select with not() to find out which jpg references are present in one variable but not the other. I know the syntax for this, $var1[not(.=$var2)], and I was able to do it successfully in another place in this same XSL file.
I can output the values from each of the two variables and they look just like a .txt file would, and the values are what I would expect to see.
But for the life of me I cannot get the "not" to work. As far as I can tell it just returns the entire value of one of the variables.
Is there a way to just brute force these two variables into the same format so I can do this? I just want each variable to be a flat file that I can compare to the other and output another boring old flat file. I've tried all the combinations of tokenize and string-join etc. that I've stumbled across and nothing seems to work.
If I was using a bash script I would just pipe the dirs to two .txt files and use diff to do this, but achieving the same thing in XSL is killing me.
Clearly I am a novice at XSL. Any assistance appreciated.
per Michael Kay's suggestion
complete xsl available at this dropbox link
https://www.dropbox.com/s/fsltr34f5l3ci5a/jpg_report_stack.xsl?dl=0
variable with all jpg names - $jpg_all_distinct_joined
<xsl:variable name="jpg_all_distinct_joined" as="xs:string" select="string-join((distinct-values(($token_full, $token_800, $token_thumb))),'&#10;')"/>
variable with all jpg references from xml - $jpg_all_links
<xsl:variable name="jpg_all_links" select="($jpg_link_pb, $jpg_link_bibl, $jpg_link_ref)"/>
not statement
<xsl:variable name="jpgs_in_xml_not_directories" select="($jpg_all_links)[not(.=$jpg_all_distinct_joined)]"/>
outputs the value of $jpg_all_links - this is not what I want - I want the output to be all jpg references from $jpg_all_links that are not in $jpg_all_distinct_joined
The variable $jpg_all_links is a sequence concatenation of the three variables ($jpg_link_pb, $jpg_link_bibl, $jpg_link_ref). All three of these variables are constructed using string-join() with newline as the separator, so it seems likely that $jpg_all_links is a sequence of three strings, each comprising multiple strings separated by newlines. The variable $jpg_all_distinct_joined is also formed using string-join() with a newline separator. So my suspicion is that (writing # to represent a newline character), you are doing something like
("A#B#C", "D#E#F")[not(. = "C#D")]
and nothing is being eliminated because none of the strings is equal to "C#D". You want to compare the sets of individual strings, not the composite strings formed using string-join().

XSLT missing linebreaks when selecting nodes by path expression

I am copying some nodes according to XSLT: Copy child elements of a complex type only once by using a path expression within a copy-of tag:
<xsl:copy-of select="/xs:schema/xs:complexType[@name=current()/xs:element/@type]"/>
In the output, all line breaks are missing between the elements processed by this statement. (Elsewhere they are shown.) It looks like this:
...</xs:complexType><xs:complexType....
I can only add line breaks before and after the copied elements, but not between them. How can I achieve this? Thanks for your help!
You provided too little data to attempt any testing. E.g. it is not clear which output method your script uses.
Quite often an XSLT script contains an xsl:strip-space instruction, which removes whitespace-only text nodes from the input, including the line breaks between elements.
Maybe this is the cause.
Also take a look at the xsl:output instruction in your script. Does it contain the indent="yes" attribute? If it doesn't, no line breaks are inserted automatically between consecutive output elements.
Maybe your script outputs explicit line breaks in some places (e.g. <xsl:text>&#xA;</xsl:text>), so those line breaks are rendered.
Your XPath expression only selects the xs:complexType elements, not the whitespace that separates them.
When you're working with a vocabulary such as XSD that doesn't use mixed content (except perhaps in annotations) it's probably best to remove all whitespace text nodes from the input using xsl:strip-space and then to generate new whitespace in the output using xsl:output indent='yes'.

Tokenize the text depending on some specific rules. Algorithm in C++

I am writing a program which will tokenize the input text depending upon some specific rules. I am using C++ for this.
Rules
Letter 'a' should be converted to token 'V-A'
Letter 'p' should be converted to token 'C-PA'
Letter 'pp' should be converted to token 'C-PPA'
Letter 'u' should be converted to token 'V-U'
This is just a sample; in reality I have 500+ rules like this. If I provide 'appu' as input, it should be tokenized as 'V-A + C-PPA + V-U'. I have implemented an algorithm for this and want to make sure that I am doing the right thing.
Algorithm
All rules will be kept in a XML file with the corresponding mapping to the token. Something like
<rules>
<rule pattern="a" token="V-A" />
<rule pattern="p" token="C-PA" />
<rule pattern="pp" token="C-PPA" />
<rule pattern="u" token="V-U" />
</rules>
1 - When the application starts, read this XML file and keep the values in a 'std::map'. This will be available until the end of the application (singleton pattern implementation).
2 - Iterate over the input text characters. For each character, look for a match. If one is found, become more greedy and look for a longer match by taking the next characters from the input text. Do this until there is no match. So for the input text 'appu', first look for a match for 'a'. If found, try to extend the match by taking the next character from the input text. Next it tries to match 'ap', finds no match, and returns.
3 - Remove the letter 'a' from the input text, since we have a token for it.
4 - Repeat steps 2 and 3 with the remaining characters in the input text.
Here is a simpler explanation of the steps
input-text = 'appu'
tokens-generated=''
// First iteration
character-to-match = 'a'
pattern-found = true
// since pattern found, going recursive and check for more matches
character-to-match = 'ap'
pattern-found = false
tokens-generated = 'V-A'
// since no match found for 'ap', taking the first success and replacing it from input text
input-text = 'ppu'
// second iteration
character-to-match = 'p'
pattern-found = true
// since pattern found, going recursive and check for more matches
character-to-match = 'pp'
pattern-found = true
// since pattern found, going recursive and check for more matches
character-to-match = 'ppu'
pattern-found = false
tokens-generated = 'V-A + C-PPA'
// since no match found for 'ppu', taking the first success and replacing it from input text
input-text = 'u'
// third iteration
character-to-match = 'u'
pattern-found = true
tokens-generated = 'V-A + C-PPA + V-U' // we're done!
Questions
1 - Does this algorithm look fine for this problem, or is there a better way to address it?
2 - If this is the right method, is std::map a good choice here? Or do I need to create my own key/value container?
3 - Is there a library available which can tokenize string like the above?
Any help would be appreciated
:)
So you're going through all of the tokens in your map looking for matches? You might as well use a list or array, there; it's going to be an inefficient search regardless.
A much more efficient way of finding just the tokens suitable for starting or continuing a match would be to store them as a trie. A lookup of a letter there would give you a sub-trie which contains only the tokens which have that letter as the first letter, and then you just continue searching downward as far as you can go.
Edit: let me explain this a little further.
First, I should explain that I'm not familiar with the C++ std::map beyond the name, which makes this a perfect example of why one learns the theory of this stuff as well as the details of particular libraries in particular programming languages: unless that library is badly misusing the name "map" (which is rather unlikely), the name itself tells me a lot about the characteristics of the data structure. I know, for example, that there's going to be a function that, given a single key and the map, will very efficiently search for and return the value associated with that key, and that there's also likely a function that will give you a list/array/whatever of all of the keys, which you could search yourself using your own code.
My interpretation of your data structure is that you have a map where the keys are what you call a pattern, those being a list (or array, or something of that nature) of characters, and the values are tokens. Thus, you can, given a full pattern, quickly find the token associated with it.
Unfortunately, while such a map is a good match for converting your XML input format to an internal data structure, it's not a good match for the searches you need to do. Note that you're not looking up entire patterns, but the first character of a pattern, producing a set of possible tokens, followed by a lookup of the second character of a pattern from within the set of patterns produced by that first lookup, and so on.
So what you really need is not a single map, but maps of maps of maps, each keyed by a single character. A lookup of "p" on the top level should give you a new map, with two keys: p, producing the C-PPA token, and "anything else", producing the C-PA token. This is effectively a trie data structure.
Does this make sense?
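The "maps of maps" idea above can be sketched roughly as follows. This is only an illustration under assumptions, not the answerer's code: the names TrieNode, insert_rule and tokenize are made up for the example, and skipping characters that start no pattern is a guess, since the question does not say what should happen to them.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

// One trie node: children keyed by a single character, plus the token
// emitted if a pattern ends exactly at this node (empty if none does).
struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> children;
    std::string token;
};

// Add one pattern -> token rule, creating nodes along the way.
void insert_rule(TrieNode& root, const std::string& pattern, const std::string& token) {
    assert(!pattern.empty() && !token.empty());
    TrieNode* node = &root;
    for (char c : pattern) {
        auto& child = node->children[c];
        if (!child) child = std::make_unique<TrieNode>();
        node = child.get();
    }
    node->token = token;
}

// From each position, walk down the trie as far as the input allows,
// remember the longest pattern that ended on a token, emit that token,
// and continue right after it.
std::vector<std::string> tokenize(const TrieNode& root, const std::string& text) {
    std::vector<std::string> tokens;
    std::size_t pos = 0;
    while (pos < text.size()) {
        const TrieNode* node = &root;
        const std::string* best = nullptr;
        std::size_t best_len = 0;
        for (std::size_t i = pos; i < text.size(); ++i) {
            auto it = node->children.find(text[i]);
            if (it == node->children.end()) break;
            node = it->second.get();
            if (!node->token.empty()) {
                best = &node->token;
                best_len = i - pos + 1;
            }
        }
        if (best) {
            tokens.push_back(*best);
            pos += best_len;
        } else {
            ++pos;  // assumption: silently skip a character that starts no pattern
        }
    }
    return tokens;
}
```

Loading the four sample rules with insert_rule and calling tokenize(root, "appu") yields the tokens V-A, C-PPA and V-U in that order. Each input character is looked up at most once per starting position, so the cost depends on the pattern length rather than on the number of rules.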
It may help if you start out by writing the parsing code first, in this manner: imagine someone else will write the functions to do the lookups you need, and he's a really good programmer and can do pretty much any magic that you want. Writing the parsing code, concentrate on making that as simple and clean as possible, creating whatever interface using these arbitrary functions you need (while not getting trivial and replacing the whole thing with one function!). Now you can look at the lookup functions you ended up with, and that tells you how you need to access your data structure, which will lead you to the type of data structure you need. Once you've figured that out, you can then work out how to load it up.
This method will work - I'm not sure that it is efficient, but it should work.
I would use the standard std::map rather than your own system.
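For comparison, here is a minimal sketch of the std::map approach. It differs from the question's step 2 in one respect, which is my own choice: it tries the longest candidate first and shrinks, rather than growing one character at a time, because growing stops too early when an intermediate prefix is not itself a rule (for example rules "a" and "abc" without "ab"). The names Rules and tokenize_with_map are hypothetical, and unmatched characters are simply skipped.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using Rules = std::map<std::string, std::string>;  // pattern -> token

std::vector<std::string> tokenize_with_map(const Rules& rules, const std::string& text) {
    // The longest pattern bounds how many characters need to be tried.
    std::size_t max_len = 0;
    for (const auto& rule : rules) max_len = std::max(max_len, rule.first.size());

    std::vector<std::string> tokens;
    std::size_t pos = 0;
    while (pos < text.size()) {
        // Try the longest candidate first, then shorter ones.
        std::size_t len = std::min(max_len, text.size() - pos);
        for (; len > 0; --len) {
            auto it = rules.find(text.substr(pos, len));
            if (it != rules.end()) {
                tokens.push_back(it->second);
                break;
            }
        }
        // len == 0 means nothing matched here; skip one character (assumption).
        pos += (len > 0 ? len : 1);
    }
    return tokens;
}
```

With 500+ short rules this does at most max_len map lookups per position, which is likely fine for modest inputs.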
There are tools like lex (or flex) that can be used for this. The issue would be whether you can regenerate the lexical analyzer that it would construct when the XML specification changes. If the XML specification does not change often, you may be able to use tools such as lex to do the scanning and mapping more easily. If the XML specification can change at the whim of those using the program, then lex is probably less appropriate.
There are some caveats - notably that both lex and flex generate C code by default, rather than C++ (flex does have an option to generate a C++ scanner class).
I would also consider looking at pattern matching technology - the sort of stuff that egrep in particular uses. This has the merit of being something that can be handled at runtime (because egrep does it all the time). Or you could go for a scripting language - Perl, Python, ... Or you could consider something like PCRE (Perl Compatible Regular Expressions) library.
Better yet, if you're going to use the boost library, there's always the Boost tokenizer library -> http://www.boost.org/doc/libs/1_39_0/libs/tokenizer/index.html
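You could use a regex (perhaps the boost::regex library). If all of the patterns are just strings of letters, a regex like "(pp|a|p|u)" would find the longest match. Note that with Perl-style alternation (the default in boost::regex) the first alternative that matches wins, so a longer pattern must be listed before its prefixes ("pp" before "p"). So:
1 - Run a regex_search using the above pattern to locate the next match.
2 - Plug the match-text into your std::map to get the replace-text.
3 - Print the non-matched consumed input and replace-text to your output, then repeat from step 1 on the remaining input.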
And done.
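A rough sketch of those steps, using std::regex from the standard library instead of boost::regex (a substitution on my part; the interfaces are similar). The helper names build_rule_regex and replace_with_tokens are made up, and the code assumes every pattern is a non-empty string of plain letters, so nothing needs escaping. The alternatives are sorted longest first, because ECMAScript-style alternation takes the first alternative that matches.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <regex>
#include <string>
#include <vector>

// Build "pp|a|p|u" from the rule keys, longest pattern first.
std::regex build_rule_regex(const std::map<std::string, std::string>& rules) {
    std::vector<std::string> patterns;
    for (const auto& rule : rules) patterns.push_back(rule.first);
    std::sort(patterns.begin(), patterns.end(),
              [](const std::string& a, const std::string& b) { return a.size() > b.size(); });
    std::string alternation;
    for (const auto& p : patterns) {
        if (!alternation.empty()) alternation += '|';
        alternation += p;  // assumption: letters only, no regex metacharacters
    }
    return std::regex(alternation);
}

// Copy unmatched input through unchanged and replace each match by its token.
std::string replace_with_tokens(const std::map<std::string, std::string>& rules,
                                const std::string& text) {
    if (rules.empty()) return text;  // an empty regex would match forever
    const std::regex re = build_rule_regex(rules);
    std::string out;
    auto begin = text.cbegin();
    std::smatch m;
    while (std::regex_search(begin, text.cend(), m, re)) {
        out.append(m.prefix().first, m.prefix().second);  // non-matched input
        out += rules.at(m.str());                         // replace-text from the map
        begin = m[0].second;                              // continue after the match
    }
    out.append(begin, text.cend());
    return out;
}
```

The tokens are concatenated without a separator here; adding " + " between them, as in the question, would be a small change in the loop.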
It may seem a bit complicated, but the most efficient way to do that is to use a graph to represent a state chart. At first, I thought boost.statechart would help, but I figured it wasn't really appropriate. This method can be more efficient than using a simple std::map IF there are many rules, the number of possible characters is limited and the length of the text to read is quite high.
So anyway, using a simple graph :
0) create graph with "start" vertex
1) read xml configuration file and create vertices when needed (transition from one "set of characters" (eg "pp") to an additional one (eg "ppa")). Inside each vertex, store a transition table to the next vertices. If "key text" is complete, mark vertex as final and store the resulting text
2) now read text and interpret it using the graph. Start at the "start" vertex. ( * ) Use table to interpret one character and to jump to new vertex. If no new vertex has been selected, an error can be issued. Otherwise, if new vertex is final, print the resulting text and jump back to start vertex. Go back to (*) until there is no more text to interpret.
You could use boost.graph to represent the graph, but I think it is overly complex for what you need. Make your own custom representation.

How do I extract the first element in a list using replaceregex in an Ant file?

Working with Ant's regular expressions system seems to give me no end of trouble. With enough work I can usually get it to work (and understand what I was doing wrong earlier). But not this time. I have a simple target wherein I want to extract the first element out of a property that contains one or more comma separated words, like this:
tgt.list.full=word1,word2,word3,word4
(Edit: tgt.list.full is actually populated by another property: tgt.list.basic, so the actual cfg.list.file looks like this:
tgt.list.basic=word1,word2,word3,word4
tgt.list.full=${tgt.list.basic}
)
I want the first word: "word1" to replace the ${target} property. This is what my task looks like:
<target name="load-configuration-list">
<loadproperties srcfile="${cfg.list.file}">
<filterchain>
<containsregex pattern="^tgt.list.full=(.*),?.*" replace="target=\1" />
<concatfilter prepend="${cfg.list.file}" />
</filterchain>
</loadproperties>
<echo message="TGT: ${target}, FULL: ${tgt.list.full}"/>
<fail unless="target" message="A target cannot be determined"/>
</target>
With the current version I have listed under the "containsregex" task, ${target} gets populated with the full list ("word1,word2,word3,word4") and not simply "word1". I have tried a large number of variations on the theme. Here's an example:
<containsregex pattern="^tgt.list.full=(word1),?.*" replace="target=\1" />
I would expect that this would at least FORCE the target property to be populated, but in this case, ${target} remains undefined (not even the full list is put into it).
Perhaps there is a flaw in my filterchain logic. I know I could probably write a task of my own, but Ant seems to already have the components I need, if I can understand them better.
To rephrase my original question: given a comma separated list in an Ant property, how might I use an Ant task (not necessarily even using containsregex or replaceregex) to extract the first element?
Would this regex be better suited to what you need ?
^tgt.list.full=([^,]+),?.*
^tgt.list.full=([^,]+),?[^\r\n]*$
Since '.' (dot) represents any character, '(.*),?.*' does select word1,word2,... because of the greediness of the * quantifier.
Maybe '(.*?),?.*' would have been better, but at least with [^,]+, we know a greedy operator will not capture any ','.
The second form may be needed to be sure to capture only what is on one line, and not "everything that follows a ," (including the next lines, since '.' in a 'dotall' mode, can include crlf characters).
As mentioned by Adam in the comments:
The "target" prop was actually being populated by "${tgt.list.basic}," not "full".
So after everything resolved, the target was now populated by the basic list.
I moved the full list out of the cfg file to be populated by the basic list later (instead of immediately)

Element-in-List testing

For a stylesheet I'm writing (actually for a set of them, each generating a different output format), I have the need to evaluate whether a certain value is present in a list of values. In this case, the value being tested is taken from an element's attribute. The list it is to be tested against comes from the invocation of the stylesheet, and is taken as a top-level <xsl:param> (to be provided on the command-line when I call xsltproc or a Saxon equivalent invocation). For example, the input value may be:
v0_01,v0_10,v0_99
while the attribute values will each look very much like one such value. (Whether a comma is used to separate values, or a space, is not important-- I chose a comma for now because I plan on passing the value via command-line switch to xsltproc, and using a space would require quoting the argument, and I'm lazy-enough to not want to type the extra two characters.)
What I am looking for is something akin to Perl's grep, wherein I can see if the value I currently have is contained in the list. It can be done with sub-string tests, but this would have to be clever so as not to get a false-positive (v0_01 should not match a string that contains v0_011). It seems that the only non-scalar data-type that XSL/XSLT supports is a node-set. I suppose it's possible to convert the list into a set of text nodes, but that seems like over-kill, even compared to making a sub-string test with extra boundaries-checking to prevent false matches.
Actually, using XPath string functions is the right way to do it. All you have to make sure is that you test for the delimiters as well:
contains(concat(',', $list, ','), concat(',', $value, ','))
would return a Boolean value. Or you might use one of these:
substring-before(concat('|,', $list, ',|'), concat(',', $value, ','))
or
substring-after(concat('|,', $list, ',|'), concat(',', $value, ','))
If you get an empty string as the result, $value is not in the list.
EDIT:
@Dimitre's comment is correct: substring-before() (or substring-after()) would also return the empty string if the found string is the first (or the last) in the list. To avoid that, I added something at the start and the end of the list. Still, contains() is the recommended way of doing this.
In addition to the XPath 1.0 solution provided by Tomalak, using XPath 2.0 one can tokenize the list of values:
exists(tokenize($list, ',')[. = $value])
evaluates to true() if and only if $value is contained in the list of values $list