Network Mask to Shorthand in XSL - xslt

I'm running into a bit of a pickle: I'm converting network masks into shorthand notation (i.e. 255.255.255.0 = /24).
I did a whole bunch of googling, and weirdly enough, no one has ever asked how to compute this in XSL.
So I came up with my own solution: why not do a whole bunch of statements like:
<xsl:choose>
  <xsl:when test=". = '255.255.255.0'">
    <xsl:value-of select="'/24'"/>
  </xsl:when>
  <xsl:when test=". = '255.255.0.0'">
    <xsl:value-of select="'/16'"/>
  </xsl:when>
...
and so on and so forth. Then I realized I was overthinking it. There has to be a way to calculate this directly; there are far too many possible network masks to list them all. Does anyone know how to calculate it?

Theoretically it is possible, but it's probably going to require extraordinarily verbose code, especially in XSLT 1.0.
Network masks rely heavily on bitwise logic, which this answer covers in XSLT. On top of that, you'll need to tokenize the string first, which isn't easy or short in XSLT 1.0.
Then you need to verify that each octet is valid (i.e. consecutive 1s followed by consecutive 0s).
All in all, it might just be shorter, code-wise, to list the 31 cases and check against them in a little named template somewhere. It may even be computationally quicker, since both the string tokenization and the bit logic would have to be recursive.
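For reference, this is the general calculation such an XSLT implementation would have to reproduce, sketched here in Python purely for illustration (not a drop-in answer to the XSLT question):

# Illustrative sketch (Python, not XSLT): split the mask into octets,
# concatenate their bits, and count the leading 1s after checking that
# the mask is contiguous (all 1s followed by all 0s).
def mask_to_prefix(mask: str) -> str:
    octets = [int(part) for part in mask.split(".")]
    if len(octets) != 4 or any(o < 0 or o > 255 for o in octets):
        raise ValueError("not a dotted-quad mask")
    bits = "".join(format(o, "08b") for o in octets)  # 32-bit string
    ones = bits.rstrip("0")                           # drop trailing zeros
    if "0" in ones:                                   # must be contiguous
        raise ValueError("non-contiguous mask")
    return "/" + str(len(ones))

print(mask_to_prefix("255.255.255.0"))  # -> /24
print(mask_to_prefix("255.255.0.0"))    # -> /16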

Another alternative would be to step outside XSLT and write, and plug in, an extension function. In fact, if there's an existing Java static function which will do this, some XSLT processors (e.g. Apache Xalan) will let you invoke it fairly directly. That does mean giving up some portability, since you need to make sure the same extension will be available everywhere you run the stylesheet, but sometimes it is the best solution.
Unfortunately, I don't think the standardized EXSLT extension functions include anything for this purpose.


How to write efficient XSLT

I recently found these tips in a post from the year 2000:
Eight tips for how to write efficient XSLT:
1. Avoid repeated use of "//item".
2. Don't evaluate the same node-set more than once; save it in a variable.
3. Avoid <xsl:number> if you can. For example, by using position().
4. Use <xsl:key>, for example to solve grouping problems.
5. Avoid complex patterns in template rules. Instead, use <xsl:choose> within the rule.
6. Be careful when using the preceding[-sibling] or following[-sibling] axes. This often indicates an algorithm with n-squared performance.
7. Don't sort the same node-set more than once. If necessary, save it as a result tree fragment and access it using the node-set() extension function.
8. To output the text value of a simple #PCDATA element, use <xsl:value-of> in preference to <xsl:apply-templates>.
Mike Kay
Obviously, this advice relates to XSLT 1.0. My question is how much are these tips still relevant today with XSLT 3.0? I realize that performance can vary between different processors, so I am tagging this question as Saxon, since that is the processor I currently use.
I am particularly interested in tips #5 and #8, because these are the ones I often see being ignored today. The others seem to me to be self-evident at all times.
I suspect I might have written that myself...
These days I think I would confine myself to one meta-tip: don't do anything to improve performance unless you first have a reliable measurement framework to assess the impact. If you can measure performance reliably, then improving it is usually very easy; if you can't, then it's impossible.
Optimizers are certainly smarter now than they were in 2000, but they are also a lot less predictable. For example, they'll do a very good job with some flavours of xsl:number, but struggle with others. That means (a) you need to make your own measurements, and (b) it's a good idea to write code that doesn't rely too heavily on the optimizer if you can. For example, define explicit keys rather than relying on the optimizer to work out where introducing an index would be useful. Another example: move code out of a loop into a variable rather than relying on the optimizer to do it for you.
The other advice I would give is to read before you code. Don't copy and paste code from other people if you don't understand what it does. Don't use "//" rather than "/" on the basis that "I don't really understand the difference but that's what worked last time". The worst inefficiencies come from code written by people who just didn't understand what they were writing.
As regards #5, there are two things that can make template rules inefficient: one is having lots of complex template rules that involve repeated effort to see which of them applies, for example:
<xsl:template match="a[.//x[1]='foo']"/>
<xsl:template match="a[.//x[1]='bar']"/>
<xsl:template match="a[.//x[1]='baz']"/>
.. plus 20 more similar ..
Here the problem is that each node is being tested against many different rules. (Saxon-EE now goes to some lengths to try and optimise that, but it doesn't always succeed)
and the second is single template rules that do expensive navigation:
<xsl:template match="a[count(preceding::a) = 1]"/>
Here the problem is typically that testing many nodes individually against this predicate is less efficient than a bulk query to find all of those nodes in a single pass; it will often be better to select the nodes that meet the criteria first, and then test each node to see if it is a member of this set.
(Again, in this particular example Saxon will probably manage to optimize it: with count(XXX) = 1 it should stop evaluating XXX once it knows there is more than one. Without that optimization, you could be counting 100,000 nodes and then comparing 100,000 with 1 and finding it's not a match.)
As regards #8, I've no idea why that's in the list; I very much doubt it's something likely to affect the bottom line significantly.
I think many XSLT processors have learned to optimize //item; nevertheless, many beginners do throw too many occurrences of // into their expressions, sometimes without wanting or needing them at all, either by starting what should be a relative path with // or by putting it in the middle of path expressions. But I would say that a justified use, where you genuinely want to search a nested subtree for item elements, is fine.
With XSLT 2 and 3, of course, you would use xsl:for-each-group for grouping and usually not a key. Performance should be just as good as with a key, and some xsl:for-each-group code is easier to maintain and understand than the equivalent Muenchian grouping code.
While xsl:number can of course perform badly in complex uses, I would nevertheless say it has its uses in XSLT 2 and/or 3 where you can't get everything done with position(). XSLT 3 also has accumulators, which can solve many of the cases that xsl:number was previously used for.
General hints for good coding and, hopefully, good performance in XSLT 2/3 are to use the as attribute to declare the types of variables, parameters and function results.
Also, most parameter/variable bindings are better done with <xsl:variable name="foo" select="some-xpath-expression"/> or <xsl:param name="foo" select="some-xpath-expression"/> than with the nested <xsl:variable name="foo"><xsl:value-of select="some-xpath-expression"/></xsl:variable> that beginners tend to use.
I am not a heavy user of schema-aware XSLT 2 and/or 3, but I think Michael Kay has mentioned in the past that it gives you improved error messages and additional safety if you validate both your input and your output.
In XSLT 2/3, to return the result of an xsl:function, I would say use xsl:sequence as a rule and probably never xsl:value-of (as the latter creates a text node rather than, say, a boolean, a string, or a sequence of whatever values the select expression of xsl:sequence returns).
xsl:iterate is also a new feature in XSLT 3, meant to let you avoid the performance and stack-overflow problems that come with deep recursion.
#5 I have never heard discussed in the context of performance problems, but rather in the context of how to structure your code, and in my view 80 or 90% of opinion favours several templates rather than a single one that uses xsl:choose/xsl:when inside.
#8 also surprises me; I certainly rely on the identity transformation (now declared in XSLT 3 with <xsl:mode on-no-match="shallow-copy"/>) to simply have text node values copied through, without needing to output them explicitly with xsl:value-of.
Anyway, I would think that writing declarative code like XSLT 3 is not done with performance as the first priority; if some of my XSLT performs really badly, I would look into measuring what exactly causes it and how to remedy it, but I would not be permanently preoccupied with using or avoiding particular constructs for performance reasons.

boost spirit 2: is there a way to know what the parser's progression percentage is?

I managed to parse a PGN file into several games, mainly thanks to this forum.
However, as the files I have to deal with contain so many games, the process can take two minutes on my fairly recent computer. That's why I would like to animate a progress bar in the GUI application using this parser.
I think the easiest way would be to "ask" Spirit how many characters it has already processed and how many characters remain (or how many lines have been processed and how many remain).
Is it possible? If so, how do I need to modify the parser file in order to get this ratio?
You can use line_pos_iterator and potentially the iter_pos primitive from the repository.
(@GuyGreer:) There is no way to know the amount of backtracking involved (otherwise, there would be no need for backtracking in the first place). So the best thing to do is accept that you get some kind of "average throughput" that can be a little bursty or laggy at times. If your grammar is so unbalanced that these variations are more than noise, you should consider fixing the grammar/parser definitions in the first place.
As for the "problem" of not knowing the stream length: you cannot fix it other than by not treating the input as a stream.
I'd suggest memory mapping. You can use the facilities from boost::iostreams, boost::interprocess or just mmap.
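The progress calculation itself is trivial once the whole input is mapped and its total size is known. A rough sketch of the idea in Python's mmap (not Spirit, and the file name below is made up), just to show percentage = bytes consumed / total size:

# Sketch only: map the file, advance through it chunk by chunk
# (here one line at a time as a stand-in for "parse one game"),
# and report the percentage to the GUI callback.
import mmap

def parse_with_progress(path, report):
    with open(path, "rb") as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as data:
        total = len(data)                     # known up front, unlike a stream
        pos = 0
        while pos < total:
            nl = data.find(b"\n", pos)        # stand-in for parsing one unit
            pos = total if nl == -1 else nl + 1
            report(100.0 * pos / total)       # percentage for the progress bar

# Example (hypothetical file name):
# parse_with_progress("games.pgn", lambda pct: print(f"{pct:5.1f}%"))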
I estimate I have at least 3 answers demonstrating each of the techniques mentioned in this answer, so I'd just search this site for them.

creating a regular expression for a list of strings

I have extracted a series of tables from the scientific literature, each consisting of columns where each column is of a distinct type. Here is an example.
I'd like to be able to automatically generate regular expressions for each column. Obviously there are trivial solutions such as .*, so I would add the constraint that they use only:
[A-Z] [a-z] [0-9]
explicit punctuation (e.g. ',', ''')
"simple" quantifiers (e.g. {3,4})
A "best" answer for the table above would be:
[A-Z]{3}
[A-Za-z\s\.]+
\d{4}\sm
\d{2}\u00b0\d{2}'\d{2}"N,\d{2}\u00b0\d{2}'\d{2}"E
(speciosissima|intermediate|troglodytes)
(hf|sr)
\d{4}
Of course the 4th regex would break if we moved outside the geographical area, but the software doesn't know that. The aim would be to collect many regexes for, say, "Coordinates" and generalize them, probably partially manually. The enums would only be created if there were a small number of distinct strings.
I'd be grateful for examples of (especially F/OSS) software that can do this, especially in Java. (It's similar to Google's Refine.) I am aware of this question from 4 years ago, but it didn't really answer the question, and of the text2re site, which appears to be interactive.
NOTE: I note a vote to close as "too localised". This is a very general problem (the table given is only an example), as shown by Google/Freebase developing Refine to tackle it. It potentially applies to a very wide variety of tables (e.g. financial, journalism, etc.). Here's one with floating-point values:
It would be useful to determine automatically that some authorities report ages in real numbers (e.g. not months, days) and use 2 digits of precision.
Your particular issue is a special case of "programming by demonstration". That is, given a bunch of input/output examples, you want to generate a program. For you, the inputs are strings and the output is whether each string belongs to the given column. In the end, you want to generate a program in the language of limited regular expressions that you proposed.
This particular instance of programming by demonstration seems closely related to Flash Fill, a recent project from MSR. There, instead of generating regular expressions to match data, they automatically generated programs to transform string data based on input/output examples.
I only skimmed through one of their papers, but I'll try to lay out what I understand here.
There are basically two important insights in this paper. The first was to design a small programming language to represent string transformations. Even using full-on regular expressions created too many possibilities to search through quickly. They designed their own abstract language for manipulating strings; however, your constraints (e.g. only using simple quantifiers) would probably play the same role as their custom language. This is largely possible because your particular problem has a somewhat smaller scope than theirs.
The second insight was on how to actually find programs in this abstract language that match with given input/output pairs. My understanding is that the key idea here is to use a technique called version space algebra. The rough idea about version space algebra is that you maintain a representation of the space of possible programs and repeatedly prune it by introducing additional constraints. The exact details of this process fall well outside my main interests, so you're better off reading something like this introduction to version space algebra, which includes some sample code as well.
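To make the pruning idea concrete, here is a toy sketch in Python; the candidate patterns and labelled examples are made up, and a real version space algebra represents the hypothesis space far more compactly than a plain set:

# Toy illustration: keep a set of candidate regexes and discard any
# candidate that disagrees with a new positive or negative example.
import re

candidates = {r"[A-Z]{3}", r"[A-Z]+", r"[A-Za-z]+", r"\d{4}", r".*"}
examples = [("ABW", True), ("FJI", True), ("1998", False)]

for text, should_match in examples:
    candidates = {p for p in candidates
                  if bool(re.fullmatch(p, text)) == should_match}

print(candidates)  # only the three letter-based patterns survive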
They also have some clever approaches to rank different candidate programs and even guess which inputs might be problematic for an already-generated program. I saw a demo where they generated a program without giving it enough input/output pairs, and the program could actually highlight new inputs that were likely to be incorrect. This sort of ranking is very interesting, but requires some more sophisticated machine learning techniques and is probably not immediately applicable to your use case. Might still be interesting though. (Also, this might have been detailed in a different paper than the one I linked.)
So yeah, long story short, you can generate your expressions by feeding input/output examples into a system based on version space algebra. I hope that helps.
I'm currently researching the same thing (or something similar) (here). In general, this is called grammar induction, or in the case of regular expressions, induction of regular languages. There is the StaMinA competition in this field. Common algorithms are RPNI and Blue-Fringe.
Here is another related question. And here another one. And here another one.
My own approach (which I have partially prototyped) is heuristic and based on the premise that a given column will often have entries of the same or similar character length with similar punctuation. I would welcome comments (and the resulting code will be Open Source).
flatten [A-Z] to 'A'
flatten [a-z] to 'a'
flatten [0-9] to '0'
flatten any other special codepoint sets (e.g. greek characters) to a single character (e.g. alpha)
The columns then become:
"AAA"
"Aaaaaaaaaa", "Aaaaaaaaaaaaa", "Aaa aaa Aaaaaa", etc.
"0000 a"
"00\u00b000'00"N,00\u00b000'00"E
...
...
"0000"
I shall then replace these by regular expressions such as
"([A-Z])([A-Z])([A-Z])"
...
"(\d)(\d)(\d)(\d)\s([0-9])"
and capture the individual characters into sets. This will show that (say) in 3. the final char is always "m", so \d\d\d\d\s[m], and for 7. the value is [2][0][0][458].
For the columns that don't fit this model we search using "(.*)" and see if we can create useful sets (cols 5. and 6.) with a heuristic such as "at least 2 repeated strings and no more than 50% unique strings".
By using dynamic programming (cf. Kruskal) I hope to be able to align similar regexes, which will be useful for me, at least!
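For illustration, here is a partial Python sketch of the flattening step described above; the sample column is made up, and special code-point sets (e.g. Greek letters) are left out:

# Map uppercase letters to 'A', lowercase to 'a', digits to '0',
# keep punctuation/whitespace as-is, then count the distinct "shapes".
from collections import Counter

def flatten(value: str) -> str:
    out = []
    for ch in value:
        if ch.isupper():
            out.append("A")
        elif ch.islower():
            out.append("a")
        elif ch.isdigit():
            out.append("0")
        else:
            out.append(ch)
    return "".join(out)

column = ["ABW", "FJI", "GAB"]            # hypothetical 3-letter codes
shapes = Counter(flatten(v) for v in column)
print(shapes)                             # Counter({'AAA': 3}) -> suggests [A-Z]{3}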

Equation parser efficiency

I sank about a month of full-time work into a native C++ equation parser. It works, except that it is slow (between 30 and 100 times slower than a hard-coded equation). What can I change to make it faster?
I read everything I could find on efficient code. In broad strokes:
The parser converts a string equation expression into a list of "operation" objects.
An operation object has two function pointers: a "getSource" and an "evaluate".
To evaluate an equation, all I do is a for loop on the operation list, calling each function in turn.
There isn't a single if / switch encountered when evaluating an equation - all conditionals are handled by the parser when it originally assigned the function pointers.
I tried inlining all the functions to which the function pointers point - no improvement.
Would switching from function pointers to functors help?
How about removing the function pointer framework, and instead creating a full set of derived "operation" classes, each with its own virtual "getSource" and "evaluate" functions? (But doesn't this just move the function pointers into the vtable?)
I have a lot of code. Not sure what to distill / post. Ask for some aspect of it, and ye shall receive.
In your post you don't mention that you have profiled the code. This is the first thing I would do if I were in your shoes. It'll give you a good idea of where the time is spent and where to focus your optimization efforts.
It's hard to tell from your description whether the slowness includes parsing, or whether it is just the interpretation time.
The parser, if you write it as recursive-descent (LL1) should be I/O bound. In other words, the reading of characters by the parser, and construction of your parse tree, should take a lot less time than it takes to simply read the file into a buffer.
The interpretation is another matter.
Interpreted code is usually 10-100 times slower than compiled code, unless the basic operations themselves are lengthy.
That said, you can still optimize it.
You could profile, but in such a simple case, you could also just single-step the program, in the debugger, at the level of individual instructions.
That way, you are "walking in the computer's shoes" and it will be obvious what can be improved.
Whenever I'm doing what you're doing, that is, providing a language to the user, but I want the language to have fast execution, what I do is this:
I translate the source language into a language I have a compiler for, and then compile it on-the-fly into a .dll (or .exe) and run that.
It's very quick, and I don't need to write an interpreter or worry about how fast it is.
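As a rough analogue of that compile-and-run idea, sketched in Python rather than a .dll (the expression string below is made up, purely for illustration):

# "Compile" the user's expression once, then reuse the compiled form
# for every evaluation instead of re-interpreting the string each time.
import math

user_expr = "3*x**2 + math.sin(x)"           # hypothetical user input
code = compile(user_expr, "<expr>", "eval")  # one-time compilation step

def evaluate(x: float) -> float:
    # Evaluate the pre-compiled expression with x bound in a small namespace.
    return eval(code, {"math": math, "x": x})

print(evaluate(2.0))  # ~12.909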
The very first thing is: Profile what actually went wrong. Is the bottleneck in parsing or in evaluation? valgrind offers some tools that can help you here.
If it's in parsing, boost::spirit might help you. If it's in evaluation, remember that virtual functions can be pretty slow to call. I've had pretty good experiences with recursive boost::variants.
You know, building a recursive-descent expression parser is really easy; the LL(1) grammar for expressions is only a couple of rules. Parsing then becomes a linear affair, and everything else can work on the expression tree (essentially while parsing): you collect the data from the lower nodes and pass it up to the higher nodes for aggregation.
This avoids function/class pointers for determining the call path at runtime altogether, relying instead on proven recursion (or you can build an iterative LL parser if you wish).
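To show how small such a parser can be, here is a hedged sketch in Python (not the asker's C++) of an LL(1) recursive-descent evaluator that computes the value while parsing, with the usual expr/term/factor rules:

# Grammar: expr := term (('+'|'-') term)*
#          term := factor (('*'|'/') factor)*
#          factor := NUMBER | '(' expr ')'
import re

def parse(src):
    tokens = re.findall(r"\d+\.?\d*|[()+\-*/]", src)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def factor():
        if peek() == "(":
            eat()
            val = expr()
            eat()                     # consume the closing ')'
            return val
        return float(eat())

    def term():
        val = factor()
        while peek() in ("*", "/"):
            op = eat()
            rhs = factor()
            val = val * rhs if op == "*" else val / rhs
        return val

    def expr():
        val = term()
        while peek() in ("+", "-"):
            op = eat()
            rhs = term()
            val = val + rhs if op == "+" else val - rhs
        return val

    return expr()

print(parse("(3 + 4) * 2 - 1"))  # 13.0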
It seems that you're using a quite complicated data structure (as I understand it, a syntax tree with pointers etc.). Walking through pointer dereferences is therefore not very efficient memory-wise (lots of random accesses) and could slow you down significantly. As Mike Dunlavey proposed, you could compile the whole expression at runtime using another language or by embedding a compiler (such as LLVM). As far as I know, Microsoft .NET provides this feature (dynamic compilation) with Reflection.Emit and Linq.Expression trees.
This is one of those rare times that I'd advise against profiling just yet. My immediate guess is that the basic structure you're using is the real source of the problem. Profiling the code is rarely worth much until you're reasonably certain the basic structure is reasonable, and it's mostly a matter of finding which parts of that basic structure can be improved. It's not so useful when what you really need to do is throw out most of what you have, and basically start over.
I'd advise converting the input to RPN. To execute this, the only data structure you need is a stack. Basically, when you get to an operand, you push it on the stack. When you encounter an operator, it operates on the items at the top of the stack. When you're done evaluating a well-formed expression, you should have exactly one item on the stack, which is the value of the expression.
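For illustration, a minimal sketch (in Python rather than the asker's C++) of the stack-based evaluation just described, taking tokens already in RPN:

# Operands are pushed; each operator pops two values and pushes the result.
import operator

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def eval_rpn(tokens):
    stack = []
    for tok in tokens:
        if tok in OPS:
            b = stack.pop()
            a = stack.pop()
            stack.append(OPS[tok](a, b))
        else:
            stack.append(float(tok))
    assert len(stack) == 1, "malformed expression"
    return stack[0]

# (3 + 4) * 2  ->  RPN: 3 4 + 2 *
print(eval_rpn("3 4 + 2 *".split()))  # 14.0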
Just about the only thing that will usually give better performance than this is to do as @Mike Dunlavey advised and just generate source code and run it through a "real" compiler. That is, however, a fairly "heavy" solution. If you really need maximum speed, it's clearly the best solution -- but if you just want to improve what you're doing now, converting to RPN and interpreting that will usually give a pretty decent speed improvement for a small amount of code.

Create a program that inputs a regular expression and outputs strings that satisfy that regular expression

I think that the title accurately summarizes my question, but just to elaborate a bit.
Instead of using a regular expression to verify properties of existing strings, I'd like to use the regular expression as a way to generate strings that have certain properties.
Note: The function doesn't need to generate every string that satisfies the regular expression (because that would be an infinite number of strings for a lot of regexes). Just a sampling of the many valid strings is sufficient.
How feasible is something like this? If the solution is too complicated/large, I'm happy with a general discussion/outline. Additionally, I'm interested in any existing programs or libraries (.NET) that do this.
Well, a regex is convertible to a DFA, which can be thought of as a graph. To generate a string given this DFA graph, you'd just find a path from the start state to an accepting state. You'd just have to think about how you want to handle cycles (maybe traverse every cycle at least once to get a sampling? n times?), but I don't see why it wouldn't work.
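To make the path-walking idea concrete, here is a small Python sketch; the DFA is hand-built for the toy regex ab*c (an assumption, not produced by any regex-to-DFA converter), and cycles are bounded by a maximum length:

# Enumerate paths from the start state to an accepting state, breadth-first.
from collections import deque

dfa = {
    0: [("a", 1)],
    1: [("b", 1), ("c", 2)],   # the b-loop is the cycle to bound
    2: [],
}
accepting = {2}

def sample_strings(start=0, max_len=5):
    out = []
    queue = deque([(start, "")])
    while queue:
        state, prefix = queue.popleft()
        if state in accepting:
            out.append(prefix)
        if len(prefix) < max_len:          # bound cycle traversal
            for ch, nxt in dfa[state]:
                queue.append((nxt, prefix + ch))
    return out

print(sample_strings())  # ['ac', 'abc', 'abbc', 'abbbc']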
This utility on UtilityMill will invert some simple regexen. It is based on this example from the pyparsing wiki. The test cases for this program are:
[A-EA]
[A-D]*
[A-D]{3}
X[A-C]{3}Y
X[A-C]{3}\(
X\d
foobar\d\d
foobar{2}
foobar{2,9}
fooba[rz]{2}
(foobar){2}
([01]\d)|(2[0-5])
([01]\d\d)|(2[0-4]\d)|(25[0-5])
[A-C]{1,2}
[A-C]{0,3}
[A-C]\s[A-C]\s[A-C]
[A-C]\s?[A-C][A-C]
[A-C]\s([A-C][A-C])
[A-C]\s([A-C][A-C])?
[A-C]{2}\d{2}
#|TH[12]
#(#|TH[12])?
#(#|TH[12]|AL[12]|SP[123]|TB(1[0-9]?|20?|[3-9]))?
#(#|TH[12]|AL[12]|SP[123]|TB(1[0-9]?|20?|[3-9])|OH(1[0-9]?|2[0-9]?|30?|[4-9]))?
(([ECMP]|HA|AK)[SD]|HS)T
[A-CV]{2}
A[cglmrstu]|B[aehikr]?|C[adeflmorsu]?|D[bsy]|E[rsu]|F[emr]?|G[ade]|H[efgos]?|I[nr]?|Kr?|L[airu]|M[dgnot]|N[abdeiop]?|Os?|P[abdmortu]?|R[abefghnu]|S[bcegimnr]?|T[abcehilm]|Uu[bhopqst]|U|V|W|Xe|Yb?|Z[nr]
(a|b)|(x|y)
(a|b) (x|y)
This can be done by traversing the DFA (includes pseudocode) or else by walking the regex's abstract-syntax tree directly or converting to NFA first, as explained by Doug McIlroy: paper and Haskell code. (He finds the NFA approach to go faster, but he didn't compare it to the DFA.)
These all work on regular expressions without back-references -- that is, 'real' regular expressions rather than Perl regular expressions. To handle the extra Perl features it'd be easiest to add on a post-filter.
Added: code for this in Python, by Peter Norvig and me.
Since it is trivially possible to write a regular expression that matches no strings at all, and I believe it is also possible to write a regular expression for which calculating a matching string requires an exhaustive search of possible strings of all lengths, you'll probably need an upper bound when requesting an answer.
The easiest approach to implement, but definitely the most CPU-intensive, would be to simply brute-force it.
Set up a character table with the characters that your string should contain and then just sequentially generate strings and do a Regex.IsMatch on them.
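A minimal sketch of that brute-force idea, using Python's re module in place of .NET's Regex.IsMatch (the alphabet, lengths and pattern below are arbitrary):

# Enumerate candidate strings over a small character table and keep
# the ones the pattern matches in full, up to some limit.
import itertools
import re

def brute_force(pattern, alphabet="abc01", max_len=4, limit=10):
    rx = re.compile(pattern)
    found = []
    for length in range(max_len + 1):
        for chars in itertools.product(alphabet, repeat=length):
            candidate = "".join(chars)
            if rx.fullmatch(candidate):
                found.append(candidate)
                if len(found) == limit:
                    return found
    return found

print(brute_force(r"[ab]{2}\d"))  # e.g. ['aa0', 'aa1', 'ab0', ...]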
I, personally, believe that this is the holy grail of reg-ex. If you could implement this -- even only 3/4 working -- I have no doubt that you'd be rich in about 5 minutes.
All joking aside, I'm not sure that what you are truly going after is feasible. Reg-Ex is a very open, flexible language, and giving the computer enough sample input to truly and accurately find what you need is probably not feasible.
If I'm proven wrong, I wish kudos to that developer.
To look at this from a different perspective, this is almost (not quite) like giving a computer its output and having it -- based on that -- write a program for you. This is a little overboard, but it kind of illustrates my point.