XSLT sequence and node set - xslt

I know this has been asked before, but what is the exact difference between the concept of a sequence in XSLT 2 and the earlier node set?
Also in which case/form are duplicates allowed in a sequence? If there are duplicates, would count(node1|node1) = 1 be true?

Expanding Max Toro's answer, there are two main differences between XSLT 1.0 node-sets and XSLT 2.0 sequences:
(1) sequences are ordered
(2) sequences may contain atomic values (such as strings and numbers) as well as nodes. (In 3.0 they may also contain other kinds of item, such as functions, maps, and arrays).
Rather than support both sets and sequences in 2.0, it was decided to only have sequences, but that some operations would cause a sequence of nodes to be sorted into document order with duplicates eliminated. The main operations that do this are:
(a) the union ('|'), intersect, and except operators.
(b) the path operator "/" (in the case where the right-hand operand returns nodes)
Because '|' eliminates duplicates, count($n|$n) returns count($n) (which is 1 if $n is a singleton). However, count(($m, $n)) returns count($m) + count($n), because the "," operator concatenates two sequences without eliminating duplicates. Similarly the new '!' operator does not eliminate duplicates, so count($m!$n) is count($m)*count($n).

Sequences are ordered and may contain duplicates. The path operator (/) eliminates duplicates and returns nodes in document order.
Sequences may contain atomic values.

Related

What does this '|' mean in node()|@* [duplicate]

I have been working with xslt recently, and I have been having trouble understanding the difference between | and or and when I should use which. I understand I can just use the error messages to figure out which one I need to be using but I am interested in learning why I can't use one or the other.
Would anyone be able to help point me in the right direction to someplace where I can learn the difference?
| is concerned with nodes, or is concerned with "truth", that is, Boolean values.
Explanation
The | or union operator
This operator returns the union of two sequences that, in this case, are interpreted as two sets of nodes. An interesting detail is that the union operator removes any duplicate nodes. Also, it only accepts operands of the type node()*, i.e. a sequence of nodes. The nodes are returned in document order.
The or operator
The technical term for this operator is Boolean disjunction. It takes two arguments, each of which must evaluate to a Boolean value ("true" or "false"). Or, more precisely, the operands of or are judged by their effective Boolean value, that is, by converting them to xs:boolean. All of this also applies to the and operator, by the way.
Examples
Use the union operator to enlarge a set of nodes, typically in a template match:
<xsl:template match="Root|Jon">
Why not use the or operator here? Because the match attribute expects a set of nodes as its value. union returns a set of nodes, whereas the or operator returns a Boolean value. You cannot define a template match for a Boolean.
Use the or operator to implement alternatives in XSLT code, mainly using xsl:if or xsl:choose:
<xsl:if test="$var lt 354 or $var gt 5647">
If either of the two operands of this or operation evaluates to true, then the content of xsl:if will be evaluated, too. Comparisons such as "less than" (lt) are not the only expressions with a Boolean value; the following is also perfectly legal:
<xsl:if test="$var1 or $var2">
The above expression evaluates to true if at least one of the two variables has an effective Boolean value of true, for instance because it holds a non-empty sequence of nodes. An empty sequence is defined to have the effective Boolean value of false.
Coercion
Note that because of the way XSLT coerces things to appropriate types, there are some contexts where either operator can be used. Consider these two conditionals:
<xsl:if test="Root | Jon"> ... </xsl:if>
<xsl:if test="Root or Jon"> ... </xsl:if>
The first conditional tests whether the union of the set of children named Root and the set of children named Jon is non-empty. The expression Root | Jon returns a sequence of nodes, and then that sequence is coerced to a boolean value because if requires a boolean value; if the sequence is non-empty, the effective boolean value is true.
The second conditional tests whether either of the two sets of children (children named Root and children named Jon) is non-empty. The expression Root or Jon returns a boolean value, and since the operator or requires boolean arguments, the two sets are each coerced to boolean, and then the or operator is applied.
The outcome is the same, but (as you can see) the two expressions reach that outcome in subtly different ways.
From Documentation
| - The union operator. For example, the match attribute in the element <xsl:template match="a|b"> matches all <a> and <b> elements
or - Tests whether either the first or second expressions are true. If the first expression is true, the second is not evaluated.

Find simplest regular expression matching all given strings

Is there an algorithm that can produce a regular expression (maybe limited to a simplified grammar) from a set of strings such that the evaluation of all possible strings that match the regular expression reproduces the initial set of strings?
It is probably unrealistic to find such an algorithm for grammars of regular expressions with very "complicated" syntax (including arbitrary repetitions, assertions etc.), so let's start with a simplified one which only allows for an OR of substrings:
foo(a|b|cd)bar should match fooabar, foobbar and foocdbar.
Examples
Given the set of strings h_q1_a, h_q1_b, h_q1_c, h_p2_a, h_p2_b, h_p2_c, the desired output of the algorithm would be h_(q1|p2)_(a|b|c).
Given the set of strings h_q1_a, h_q1_b, h_p2_a, the desired output of the algorithm would be h_(q1_(a|b)|p2_a). Note that h_(q1|p2)_(a|b) would not be correct because it expands to 4 strings, including h_p2_b, which was not in the original set of strings.
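The expansion check used in the second example can be sketched in a few lines of Python. This is only an illustration for the simplified grammar (literal characters, | and parentheses); the function name expand is hypothetical, and the parser assumes a well-formed pattern.

```python
def expand(pattern):
    """Expand a pattern made only of literals, '|' and parentheses
    into the set of strings it matches (assumes a well-formed pattern)."""
    pos = 0

    def parse_alternation():
        nonlocal pos
        results = parse_sequence()
        while pos < len(pattern) and pattern[pos] == "|":
            pos += 1  # skip '|'
            results |= parse_sequence()
        return results

    def parse_sequence():
        nonlocal pos
        results = {""}
        while pos < len(pattern) and pattern[pos] not in "|)":
            if pattern[pos] == "(":
                pos += 1  # skip '('
                part = parse_alternation()
                pos += 1  # skip ')'
            else:
                part = {pattern[pos]}  # a single literal character
                pos += 1
            # Concatenate every string so far with every alternative of this part.
            results = {prefix + suffix for prefix in results for suffix in part}
        return results

    return parse_alternation()
```

Comparing expand(candidate) with the original set of labels shows whether a candidate pattern is exact: expand("h_(q1|p2)_(a|b)") contains h_p2_b, whereas expand("h_(q1_(a|b)|p2_a)") reproduces the three input strings.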
Use case
I have a long list of labels which were all produced by putting together substrings. Instead of printing the vast list of strings, I would like to have a compact output indicating what labels are in the list. As the full list has been produced programmatically (using a finite set of pre- and suffixes) I expect the compact notation to be (potentially) much shorter than the initial list.
(The (simplified) regex should be as short as possible, although I am more interested in a practical solution than the best. The trivial answer is of course to just concatenate all strings like A|B|C|D|... which is, however, not helpful.)
There is a straight-forward solution to this problem, if what you want to find is the minimal finite state machine (FSM) for a set of strings. Since the resulting FSM cannot have loops (otherwise it would match an infinite number of strings), it should be easy to convert into a regular expression using only concatenation and disjunction (|) operators. Although this might not be the shortest possible regular expression, it will result in the smallest compiled regex if the regex library you use compiles to a minimized DFA. (Alternatively, you could use the DFA directly with a library like Ragel.)
The procedure is simple, if you have access to standard FSM algorithms:
1. Make a non-deterministic finite-state automaton (NFA) by just adding every string as a sequence of states, with each sequence starting from the start state. Clearly O(N) in the total size of the strings, since there will be precisely one NFA state for every character in the original strings.
2. Construct a deterministic finite-state automaton (DFA) from the NFA. The NFA is a tree, not even a DAG, which should avoid the exponential worst-case for the standard algorithm. Effectively, you're just constructing a prefix tree here, and you could have skipped step 1 and constructed the prefix tree directly, converting it directly into a DFA. The prefix tree cannot have more nodes than the original number of characters (and can have the same number of nodes if all the strings start with different characters), so its output is O(N) in size, but I don't have a proof off the top of my head that it is also O(N) in time.
3. Minimize the DFA.
DFA minimization is a well-studied problem. The Hopcroft algorithm is a worst-case O(NS log N) algorithm, where N is the number of states in the DFA and S is the size of the alphabet. Normally, S would be considered a constant; in any event, the expected time of the Hopcroft algorithm is much better.
For acyclic DFAs, there are linear-time algorithms; the most-frequently cited one is due to Dominique Revuz, and I found a rough description of it here in English; the original paper seems to be pay-walled, but Revuz's thesis (in French) is available.
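As a rough illustration of the prefix-tree step only, here is a minimal Python sketch. It builds the prefix tree directly and prints it as a regex of concatenations and alternations. It does not perform the minimization step, so shared suffixes are not merged, and the names build_trie and trie_to_regex are hypothetical.

```python
import re


def build_trie(words):
    """Build a prefix tree as nested dicts; the key '' marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}  # end-of-word marker
    return root


def trie_to_regex(node):
    """Return a regex matching exactly the suffixes stored below `node`."""
    alternatives = []
    for ch in sorted(node):
        if ch == "":
            alternatives.append("")  # a word may end here
        else:
            alternatives.append(re.escape(ch) + trie_to_regex(node[ch]))
    if len(alternatives) == 1:
        return alternatives[0]  # no branching: plain concatenation
    return "(" + "|".join(alternatives) + ")"


labels = ["h_q1_a", "h_q1_b", "h_q1_c", "h_p2_a", "h_p2_b", "h_p2_c"]
pattern = trie_to_regex(build_trie(labels))
# pattern == "h_(p2_(a|b|c)|q1_(a|b|c))"
```

The result only factors out common prefixes. Collapsing it further to h_(q1|p2)_(a|b|c) requires merging states with identical continuations, which is what the minimization step provides.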
You can try to use the Aho-Corasick algorithm to create a finite state machine from the input strings, after which it should be somewhat easy to generate the simplified regex. Taking your input strings as an example:
h_q1_a
h_q1_b
h_q1_c
h_p2_a
h_p2_b
h_p2_c
will generate a finite state machine that most probably looks like this:
      [h_]        <-level 0
     /    \
  [q1]    [p2]    <-level 1
     \    /
      [_]         <-level 2
      /\ \
     /  \ \
    a    b c      <-level 3
Now for every level/depth of the trie, all the strings (if multiple) go under OR brackets, so
h_(q1|p2)_(a|b|c)
L0  L1   L2 L3

regex - At most two pair of consecutives

I'm taking a computation course which also teaches about regular expressions. There is a difficult question that I cannot answer.
Find a regular expression for the language that accepts words that contain at most two pairs of consecutive 0's. The alphabet consists of 0 and 1.
First, I made an NFA of the language but cannot convert it to a GNFA (that can later be converted to a regex). How can I find this regular expression? With or without converting it to a GNFA?
(Since this is a homework problem, I'm assuming that you just want enough help to get started, and not a full worked solution?)
Your mileage may vary, but I don't really recommend trying to convert an NFA into a regular expression. The two are theoretically equivalent, and either can be converted into the other algorithmically, but in my opinion, it's not the most intuitive way to construct either one.
Instead, one approach is to start by enumerating various possibilities:
No pairs of consecutive zeroes at all; that is, every zero, except at the end of the string, must be followed by a one. So, the string consists of a mixed sequence of 1 and 01, optionally followed by 0:
(1|01)*(0|ε)
Exactly one pair of consecutive zeroes, at the end of the string. This is very similar to the previous:
(1|01)*00
Exactly one pair of consecutive zeroes, not at the end of the string — and, therefore, necessarily followed by a one. This is also very similar to the first one:
(1|01)*001(1|01)*(0|ε)
To continue that approach, you would then extend the above to support two pairs of consecutive zeroes; and lastly, you would merge all of these into a single regular expression.
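One way to gain confidence in the building blocks above is to check them by brute force against every short word over {0, 1}. The following Python sketch does that for the three expressions listed so far. It assumes that overlapping pairs are counted separately (so 000 contains two pairs) and writes ε as an empty alternative; the helper names are hypothetical.

```python
import re
from itertools import product

# The three expressions from the enumeration, with ε written as an empty alternative.
NO_PAIR = r"(1|01)*(0|)"
ONE_PAIR_AT_END = r"(1|01)*00"
ONE_PAIR_INSIDE = r"(1|01)*001(1|01)*(0|)"


def count_pairs(word):
    """Count positions i with word[i:i+2] == '00' (overlapping pairs count separately)."""
    return sum(1 for i in range(len(word) - 1) if word[i:i + 2] == "00")


def all_words(max_len):
    """Yield every word over the alphabet {0, 1} up to the given length."""
    for length in range(max_len + 1):
        for letters in product("01", repeat=length):
            yield "".join(letters)


def check(max_len=10):
    """Compare each expression with a direct count of pairs; raise on any mismatch."""
    for word in all_words(max_len):
        pairs = count_pairs(word)
        no_pair = re.fullmatch(NO_PAIR, word) is not None
        one_pair = (re.fullmatch(ONE_PAIR_AT_END, word) is not None
                    or re.fullmatch(ONE_PAIR_INSIDE, word) is not None)
        assert no_pair == (pairs == 0), word
        assert one_pair == (pairs == 1), word
    return True
```

The same loop can be pointed at a candidate for the full language by comparing its match result with count_pairs(word) <= 2.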
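(0+1)*00(0+1)*00(0+1)* + (0+1)*000(0+1)*
Note that this expression describes the words with at least two pairs of consecutive 0's (counting the overlapping pairs in 000 separately), not at most two.
Contains at most two pairs of consecutive 0's:
(1|01)*(00|ε)(1|10)*(00|ε)(1|10)*
Note that this expression rejects 0 and 000, and accepts 001000, which contains three pairs when overlapping pairs are counted separately.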

Is string::compare reliable to determine alphabetical order?

Simply put, if the input is always in the same case (here, lower case), and if the characters are always ASCII, can one use string::compare to determine reliably the alphabetical order of two strings?
Thus, with stringA.compare(stringB), if the result is 0, they are the same; if it is negative, stringA comes before stringB alphabetically; and if it is positive, stringA comes after?
According to the docs at cplusplus.com,
The member function returns 0 if all the characters in the compared contents compare equal, a negative value if the first character that does not match compares to less in the object than in the comparing string, and a positive value in the opposite case.
So it will sort strings in ASCII order, which will be alphabetical for English strings (with no diacritical marks or other extended characters) of the same case.
Yes, as long as all of the characters in both strings are of the same case, and as long as both strings consist only of letters, this will work.
compare is a member function, though, so you would call it like so:
stringA.compare(stringB);
In C++, string is the instantiation of the template class basic_string with the default parameters: basic_string<char, char_traits<char>, allocator<char> >. The compare function in the basic_string template will use the char_traits<TChar>::compare function to determine the result value.
For std::string the ordering will be that of the default character code for the implementation (compiler), and that is usually ASCII order. If you require a different ordering (say you want to consider { a, á, à, â } as equivalent), you can instantiate a basic_string with your own char_traits<> implementation that provides a different compare function.
Yes,
The member function returns 0 if all the characters in the compared contents compare equal, a negative value if the first character that does not match compares to less in the object than in the comparing string, and a positive value in the opposite case.
For string objects, the result of a character comparison depends only on its character code (i.e., its ASCII code), so the result has some limited alphabetical or numerical ordering meaning.
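The C and C++ language specifications do not themselves guarantee the ordering 'A' < 'B' < 'C' ... < 'Z'; it holds in ASCII (and in the encodings used by virtually all current platforms). The same is true for lowercase.
The ordering of the decimal digits, however, is guaranteed by the standards: '0' < ... < '9', with consecutive values.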
When working with multiple languages, many people create an array of characters. The array is searched for the character. Instead of comparing characters, the indices are compared.
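The index-comparison idea can be sketched as follows. The thread is about C++, but the sketch is in Python for brevity. The collation order ALPHABET and the name collation_key are made up for the illustration: lower and upper case are interleaved so that case no longer dominates the ordering, and characters outside the table are not handled.

```python
import string

# Hypothetical collation order "aAbBcC...zZ": a character's index is its rank.
ALPHABET = "".join(lo + up for lo, up in
                   zip(string.ascii_lowercase, string.ascii_uppercase))
RANK = {ch: index for index, ch in enumerate(ALPHABET)}


def collation_key(word):
    """Map each character to its index in ALPHABET; words are compared by these indices."""
    return [RANK[ch] for ch in word]


words = ["banana", "Apple", "apple", "Cherry"]
by_code = sorted(words)                      # plain character-code order
by_table = sorted(words, key=collation_key)  # order defined by ALPHABET
```

With plain character codes all upper-case words come first; with the table, words are ordered letter by letter and case only breaks ties.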

Sorting names with numbers correctly

For sorting item names, I want to support numbers correctly. i.e. this:
1 Hamlet
2 Ophelia
...
10 Laertes
instead of
1 Hamlet
10 Laertes
2 Ophelia
...
Does anyone know of a comparison functor that already supports that?
(i.e. a predicate that can be passed to std::sort)
I basically have two patterns to support: Leading number (as above), and number at end, similar to explorer:
Dolly
Dolly (2)
Dolly (3)
(I guess I could work that out: compare by character, and treat numeric values differently. However, that would probably break Unicode collation and whatnot.)
That's called alphanumeric sorting.
Check out this link: The Alphanum Algorithm
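The core idea of alphanumeric sorting is to split each name into runs of digits and runs of other characters, and to compare the digit runs as numbers. The following Python sketch is a simplified illustration of that idea, not the linked Alphanum implementation. The name natural_key is hypothetical, only ASCII digits are treated as numbers, and in C++ the same chunk-by-chunk comparison would be written as a predicate for std::sort.

```python
import re

_DIGIT_RUN = re.compile(r"([0-9]+)")


def natural_key(name):
    """Split a name into text and digit chunks; digit chunks compare as integers."""
    key = []
    # With a capturing group, split() alternates: text, digits, text, digits, ...
    for index, chunk in enumerate(_DIGIT_RUN.split(name)):
        if not chunk:
            continue  # skip empty text chunks, e.g. before a leading number
        if index % 2:
            key.append((0, int(chunk), ""))  # numeric chunk
        else:
            key.append((1, 0, chunk))        # text chunk
    return key


plays = sorted(["1 Hamlet", "10 Laertes", "2 Ophelia"], key=natural_key)
copies = sorted(["Dolly (10)", "Dolly (2)", "Dolly", "Dolly (3)"], key=natural_key)
```

Tagging each chunk as numeric or text keeps the comparison well defined when a number in one name lines up with text in another; numeric chunks sort first in that case.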
I think you can use a pair object (number, name), make a vector of such pairs, and then sort this vector.
Pairs are compared by their first elements (and by their second elements when the first ones are equal), so this way you can get the sort order you want.