I recently found these tips in a post from year 2000:
Eight tips for how to write efficient XSLT:
1. Avoid repeated use of "//item".
2. Don't evaluate the same node-set more than once; save it in a variable.
3. Avoid <xsl:number> if you can. For example, by using position().
4. Use <xsl:key>, for example to solve grouping problems.
5. Avoid complex patterns in template rules. Instead, use <xsl:choose> within the rule.
6. Be careful when using the preceding[-sibling] or following[-sibling] axes. This often indicates an algorithm with n-squared performance.
7. Don't sort the same node-set more than once. If necessary, save it as a result tree fragment and access it using the node-set() extension function.
8. To output the text value of a simple #PCDATA element, use <xsl:value-of> in preference to <xsl:apply-templates>.
Mike Kay
Obviously, this advice relates to XSLT 1.0. My question is how much are these tips still relevant today with XSLT 3.0? I realize that performance can vary between different processors, so I am tagging this question as Saxon, since that is the processor I currently use.
I am particularly interested in tips #5 and #8 because these are the ones I often see being ignored today. The others seem to me self-evident at all times.
I suspect I might have written that myself...
These days I think I would confine myself to one meta-tip - don't do anything to improve performance unless you first have a reliable measurement framework to assess the impact. If you can measure performance reliably, then improving it is usually very easy, if you can't then it's impossible.
Optimizers are certainly smarter now than they were in 2000, but they are also a lot less predictable. For example, they'll do a very good job with some flavours of xsl:number, but struggle with others. That means (a) you need to make your own measurements, and (b) it's a good idea to write code that doesn't rely too heavily on the optimizer if you can. For example, define explicit keys rather than relying on the optimizer to work out where introducing an index would be useful. Another example: move code out of a loop into a variable rather than relying on the optimizer to do it for you.
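For instance, an explicit key looks like this (the element and attribute names here are invented purely for illustration):

<xsl:key name="product-by-id" match="product" use="@id"/>

<xsl:template match="order-line">
  <!-- an indexed lookup instead of searching the whole document on every call -->
  <xsl:value-of select="key('product-by-id', @ref)/name"/>
</xsl:template>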
The other advice I would give is to read before you code. Don't copy and paste code from other people if you don't understand what it does. Don't use "//" rather than "/" on the basis that "I don't really understand the difference but that's what worked last time". The worst inefficiencies come from code written by people who just didn't understand what they were writing.
As regards #5, there are two things that can make template rules inefficient: one is having lots of complex template rules that involve repeated effort to see which of them applies, for example:
<xsl:template match="a[.//x[1]='foo']"/>
<xsl:template match="a[.//x[1]='bar']"/>
<xsl:template match="a[.//x[1]='baz']"/>
.. plus 20 more similar ..
Here the problem is that each node is being tested against many different rules. (Saxon-EE now goes to some lengths to try and optimise that, but it doesn't always succeed)
and the second is single template rules that do expensive navigation:
<xsl:template match="a[count(preceding::a) = 1]"/>
Here the problem is typically that testing many nodes individually against this predicate is less efficient than a bulk query to find all of those nodes in a single pass; it will often be better to select the nodes that meet the criteria first, and then test each node to see if it is a member of this set.
(Again, in this particular example Saxon will probably manage to optimize it: with count(XXX) = 1 it should stop evaluating XXX once it knows there is more than one. Without that optimization, you could be counting 100,000 nodes and then comparing 100,000 with 1 and finding it's not a match.)
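One way to write the bulk-query version in XSLT 3.0, which allows global variable references in match patterns, is to compute the qualifying node once and then test node identity; a sketch, reusing the a element from the example above:

<xsl:variable name="second-a" select="(//a)[2]"/>

<xsl:template match="a[. is $second-a]">
  <!-- handle the one qualifying node; every other a falls through to other rules -->
</xsl:template>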
As regards #8, I've no idea why that's in the list, I very much doubt this is something likely to affect the bottom line significantly.
I think that many XSLT processors have learned to optimize //item; nevertheless, many beginners throw far too many // steps into their expressions, sometimes without wanting or needing them at all, either by starting what should be a relative path with // or by dropping it into the middle of path expressions. But I would say that a justified use, where you genuinely want to search a nested subtree for item elements, is fine.
I would think that with XSLT 2 and 3 you would of course use xsl:for-each-group for grouping and usually not a key. Performance that way should be as good as with a key; the difference is that xsl:for-each-group code is often easier to maintain and understand than the equivalent Muenchian grouping code.
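For example, a grouping that would otherwise need a Muenchian key might look like this (element and attribute names are made up):

<xsl:for-each-group select="item" group-by="@category">
  <group key="{current-grouping-key()}">
    <xsl:copy-of select="current-group()"/>
  </group>
</xsl:for-each-group>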
While xsl:number can of course perform badly for complex uses, I would nevertheless think that it has its uses in XSLT 2 and/or 3 where you can't get everything done with position(). XSLT 3 also has accumulators, which can solve many of the cases that previously needed xsl:number.
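A small sketch of numbering with an accumulator (the chapter element is just an example; note that the mode has to opt in via use-accumulators):

<xsl:mode use-accumulators="#all"/>

<xsl:accumulator name="chapter-count" initial-value="0">
  <xsl:accumulator-rule match="chapter" select="$value + 1"/>
</xsl:accumulator>

<xsl:template match="chapter">
  <!-- accumulator-before gives the count up to and including this chapter -->
  <h2><xsl:value-of select="accumulator-before('chapter-count')"/>. <xsl:apply-templates/></h2>
</xsl:template>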
General hints for XSLT 2/3, in terms of good coding and hopefully good performance, are to use the as attribute to declare the types of variables, parameters and function results.
Also, most parameter/variable bindings are better done with <xsl:variable name="foo" select="some-xpath-expression"/> or <xsl:param name="foo" select="some-xpath-expression"/> than with the nested <xsl:variable name="foo"><xsl:value-of select="some-xpath-expression"/></xsl:variable> form that beginners tend to use, as the latter builds a temporary document containing a text node instead of binding the value directly.
I am not a heavy user of schema-aware XSLT 2 and/or 3, but I think Michael Kay has mentioned in the past that it gives you improved error messages and additional safety if you validate both your input and output.
In XSLT 2/3, to return the result of an xsl:function, I would say use xsl:sequence as a rule and probably never xsl:value-of (as the latter creates a text node rather than, say, a boolean, a string, or a sequence of values, whatever the select expression of xsl:sequence returns).
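For example (the function name and namespace prefix are mine):

<xsl:function name="f:is-blank" as="xs:boolean">
  <xsl:param name="input" as="xs:string"/>
  <!-- xsl:sequence returns the boolean itself, not a text node -->
  <xsl:sequence select="normalize-space($input) = ''"/>
</xsl:function>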
xsl:iterate is also a new feature in XSLT 3, meant to let you avoid the performance problems, stack overflows in particular, that deep recursion can cause.
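A sketch of a running total with xsl:iterate, something that would otherwise need recursion (element and attribute names are invented):

<xsl:iterate select="transaction">
  <xsl:param name="balance" select="0" as="xs:double"/>
  <total id="{@id}"><xsl:value-of select="$balance + @amount"/></total>
  <xsl:next-iteration>
    <xsl:with-param name="balance" select="$balance + @amount"/>
  </xsl:next-iteration>
</xsl:iterate>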
#5 I have never heard in the context of performance problems, but rather in the context of how to structure your code, and in my view opinion is 80 or 90% in favour of separate templates rather than a single one that uses xsl:choose/xsl:when inside.
#8 also surprises me; I certainly rely on the identity transformation (now declared in XSLT 3 with xsl:mode on-no-match="shallow-copy") to simply have text node values copied through, without needing to output them explicitly with xsl:value-of.
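For reference, that XSLT 3 form with a single overriding template (the price element is just an example):

<xsl:mode on-no-match="shallow-copy"/>
<!-- everything without a more specific template, including text nodes, is copied through -->

<xsl:template match="price">
  <price><xsl:value-of select="format-number(., '0.00')"/></price>
</xsl:template>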
Anyway, I would think that writing declarative code like XSLT 3 is not done with performance as the first priority. If some of my XSLT performs really badly, I look into measuring what exactly causes it and how to remedy it, but I would not be permanently preoccupied with using or avoiding particular constructs for performance reasons.
Related
I'm trying to figure out an efficient algorithm that takes in two QAbstractItemModels (trees) (A,B) and computes the differences between them, such that I get a list of Items that are not present in A (but are in B - added), or items that have been modified / deleted.
The only current way I can think of is doing a breadth-first search of A for every item in B. But this doesn't seem very efficient. Any ideas are welcome.
Have you tried using magic?
Seriously though, this is a very broad question, especially if we consider the fact that it is a QAbstractItemModel and not a QAbstractListModel. For a list it would be much simpler, but an abstract item model implements a tree structure, so there are a lot of variables:
do you check the total item count?
do you check the item count per level?
do you check whether an item is contained in both models?
if so, is it contained at the same level?
if so, is it contained at the same index?
is the item in its original state, or has it been modified?
You need to make all those considerations and come up with an efficient solution. And don't expect it will be as simple as a "by the book" algorithm. The good news is that since you are dealing with isolated items it will be easier than trying to do this for text; with text you can't hope to get anywhere near as concise a result as with isolated items. I've had my fair share of absurdly mindless GitHub diff results.
And just in case that's your actual goal: it will be much easier to achieve by tracking the history of the derived data set than by doing a blind comparison. Tracking history is much easier if you want to establish what was added, deleted, moved and modified, because it considers the actual event flow rather than just comparing end results. That matters especially if you don't have any persistent ID scheme implemented; without one there is no reliable way to tell whether item X has been deleted or merely moved to a new level/index and modified, and so on.
Also, worry about efficiency only after you have empirically established a performance issue. Some algorithms may seem overly complex, but modern machines are overly fast, and unless you are running that in a tight loop you shouldn't really worry about it. In the end it doesn't boil down to how complex it is; it boils down to whether it is fast enough or not.
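For the simple "present in B but not in A" part, one pass over each tree is usually enough. A minimal sketch, assuming an item is identified by its Qt::DisplayRole text (which may well not hold for your models):

#include <QAbstractItemModel>
#include <QModelIndex>
#include <QSet>
#include <QString>

// Collect the Qt::DisplayRole text of every item in a model (one depth-first pass).
static void collectTexts(const QAbstractItemModel &model, const QModelIndex &parent, QSet<QString> &out)
{
    for (int row = 0; row < model.rowCount(parent); ++row) {
        const QModelIndex idx = model.index(row, 0, parent);
        out.insert(idx.data(Qt::DisplayRole).toString());
        collectTexts(model, idx, out);   // recurse into children
    }
}

// Items present in B but missing from A: two passes instead of one search of A per item of B.
static QSet<QString> addedInB(const QAbstractItemModel &a, const QAbstractItemModel &b)
{
    QSet<QString> inA, inB;
    collectTexts(a, QModelIndex(), inA);
    collectTexts(b, QModelIndex(), inB);
    return inB - inA;                    // QSet set difference
}

If display text is not a stable identity in your models, substitute whichever role or column actually identifies an item.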
I'm running into a bit of a pickle where I'm converting network masks into shorthand (i.e. 255.255.255.0 = /24).
I did a whole bunch of googling, and weirdly enough, no one has ever asked how to compute this in XSL.
So I came up with my own solution: why not do a whole bunch of statements like:
<xsl:choose>
<xsl:when test=". = '255.255.255.0'">
<xsl:value-of select="'/24'"/>
</xsl:when>
<xsl:when test=". = '255.255.0.0'">
<xsl:value-of select="'/16'"/>
</xsl:when>
...
and so on and so forth. Then I realized I'm probably going about this the wrong way: there are too many possible network masks to list, so there has to be a way to calculate it. Does anyone know how to calculate it?
Theoretically it is possible, but it's probably going to require extraordinarily verbose code, especially in XSLT 1.0.
Network masks rely heavily on bitwise logic, which this answer covers in XSLT. But not only that: you'll need to tokenize the string first, which isn't easy or short in XSLT 1.0.
Then you need to verify that each octet is correct (i.e. consecutive 1s followed by consecutive 0s).
All in all, it might just be shorter code-wise to list the 31 cases and check against them in a little named template of their own. It would possibly even be computationally quicker, since the string tokenization would be recursive, as would the bit logic.
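If you can use XSLT 2.0 or later, a per-octet lookup avoids the bit arithmetic entirely. A sketch (the function name and prefix are mine, and it assumes the input is a well-formed mask):

<xsl:function name="f:cidr" as="xs:string">
  <xsl:param name="mask" as="xs:string"/>
  <!-- every legal octet of a contiguous mask is one of these nine values;
       its position in the list, minus one, is its number of 1-bits -->
  <xsl:variable name="octets" select="('0','128','192','224','240','248','252','254','255')"/>
  <xsl:sequence select="concat('/', sum(for $o in tokenize($mask, '\.') return index-of($octets, $o) - 1))"/>
</xsl:function>

With that, f:cidr('255.255.255.0') returns '/24'.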
Another alternative would be to step outside XSLT and write, and plug in, an extension function. In fact, if there's an existing Java static function which will do this, some XSLT processors (e.g. Apache Xalan) will let you invoke it fairly directly. That does mean giving up some portability, since you need to make sure the same extension will be available everywhere you run the stylesheet, but sometimes it is the best solution.
Unfortunately I don't think the standardized EXSLT extension functions include anything for this purpose.
I sank about a month of full-time work into a native C++ equation parser. It works, except that it is slow (between 30 and 100 times slower than a hard-coded equation). What can I change to make it faster?
I read everything I could find on efficient code. In broad strokes:
The parser converts a string equation expression into a list of "operation" objects.
An operation object has two function pointers: a "getSource" and an "evaluate".
To evaluate an equation, all I do is a for loop on the operation list, calling each function in turn.
There isn't a single if / switch encountered when evaluating an equation - all conditionals are handled by the parser when it originally assigned the function pointers.
I tried inlining all the functions to which the function pointers point - no improvement.
Would switching from function pointers to functors help?
How about removing the function pointer framework, and instead creating a full set of derived "operation" classes, each with its own virtual "getSource" and "evaluate" functions? (But doesn't this just move the function pointers into the vtable?)
I have a lot of code. Not sure what to distill / post. Ask for some aspect of it, and ye shall receive.
In your post you don't mention that you have profiled the code. This is the first thing I would do if I were in your shoes. It'll give you a good idea of where the time is spent and where to focus your optimization efforts.
It's hard to tell from your description whether the slowness includes parsing or whether it is just the interpretation time.
The parser, if you write it as recursive-descent (LL1) should be I/O bound. In other words, the reading of characters by the parser, and construction of your parse tree, should take a lot less time than it takes to simply read the file into a buffer.
The interpretation is another matter.
The speed differential between interpreted and compiled code is usually 10-100 times slower, unless the basic operations themselves are lengthy.
That said, you can still optimize it.
You could profile, but in such a simple case, you could also just single-step the program, in the debugger, at the level of individual instructions.
That way, you are "walking in the computer's shoes" and it will be obvious what can be improved.
Whenever I'm doing what you're doing, that is, providing a language to the user, but I want the language to have fast execution, what I do is this:
I translate the source language into a language I have a compiler for, and then compile it on-the-fly into a .dll (or .exe) and run that.
It's very quick, and I don't need to write an interpreter or worry about how fast it is.
The very first thing is: Profile what actually went wrong. Is the bottleneck in parsing or in evaluation? valgrind offers some tools that can help you here.
If it's in parsing, boost::spirit might help you. If it's in evaluation, remember that virtual functions can be pretty slow to call. I've had pretty good experiences with recursive boost::variants.
You know, building a recursive descent parser for expressions is really easy; the LL(1) grammar for expressions is only a couple of rules. Parsing then becomes a linear affair and everything else can work on the expression tree (while parsing, basically): you collect the data from the lower nodes and pass it up to the higher nodes for aggregation.
This would avoid function/class pointers for determining the call path at runtime altogether, relying instead on proven recursion (or you can build an iterative LL parser if you wish).
It seems that you're using a quite complicated data structure (as I understand it, a syntax tree with pointers etc.). Walking it through pointer dereferences is not very memory-efficient (lots of random accesses) and could slow you down significantly. As Mike Dunlavey proposed, you could compile the whole expression at runtime using another language or by embedding a compiler (such as LLVM). As far as I know, Microsoft .NET provides this feature (dynamic compilation) with Reflection.Emit and Linq.Expression trees.
This is one of those rare times that I'd advise against profiling just yet. My immediate guess is that the basic structure you're using is the real source of the problem. Profiling the code is rarely worth much until you're reasonably certain the basic structure is reasonable, and it's mostly a matter of finding which parts of that basic structure can be improved. It's not so useful when what you really need to do is throw out most of what you have, and basically start over.
I'd advise converting the input to RPN. To execute this, the only data structure you need is a stack. Basically, when you get to an operand, you push it on the stack. When you encounter an operator, it operates on the items at the top of the stack. When you're done evaluating a well-formed expression, you should have exactly one item on the stack, which is the value of the expression.
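A minimal sketch of that evaluation loop (the token representation is a deliberate simplification of whatever your parser actually produces):

#include <cstdio>
#include <stack>
#include <stdexcept>
#include <vector>

// Simplified token: either a literal value (op == 0) or an operator character.
struct Token {
    char op;
    double value;
};

// Evaluate a token sequence that is already in RPN (postfix) order.
double evalRpn(const std::vector<Token>& rpn) {
    std::stack<double> st;
    for (const Token& t : rpn) {
        if (t.op == 0) {                 // operand: push it
            st.push(t.value);
            continue;
        }
        if (st.size() < 2) throw std::runtime_error("malformed expression");
        double rhs = st.top(); st.pop();
        double lhs = st.top(); st.pop();
        switch (t.op) {                  // operator: combine the two topmost values
            case '+': st.push(lhs + rhs); break;
            case '-': st.push(lhs - rhs); break;
            case '*': st.push(lhs * rhs); break;
            case '/': st.push(lhs / rhs); break;
            default:  throw std::runtime_error("unknown operator");
        }
    }
    if (st.size() != 1) throw std::runtime_error("malformed expression");
    return st.top();
}

int main() {
    // (3 + 4) * 2 in RPN: 3 4 + 2 *
    std::vector<Token> rpn = { {0, 3}, {0, 4}, {'+', 0}, {0, 2}, {'*', 0} };
    std::printf("%g\n", evalRpn(rpn));   // prints 14
}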
Just about the only thing that will usually give better performance than this is to do as @Mike Dunlavey advised, and just generate source code and run it through a "real" compiler. That is, however, a fairly "heavy" solution. If you really need maximum speed, it's clearly the best solution -- but if you just want to improve what you're doing now, converting to RPN and interpreting that will usually give a pretty decent speed improvement for a small amount of code.
I'm thinking about the tokenizer here.
Each token calls a different function inside the parser.
What is more efficient:
A map of std::functions/boost::functions
A switch case
I would suggest reading switch() vs. lookup table? from Joel on Software. Particularly, this response is interesting:
" Prime example of people wasting time
trying to optimize the least
significant thing."
Yes and no. In a VM, you typically
call tiny functions that each do very
little. It's the not the call/return
that hurts you as much as the preamble
and clean-up routine for each function
often being a significant percentage
of the execution time. This has been
researched to death, especially by
people who've implemented threaded
interpreters.
In virtual machines, lookup tables storing computed addresses to call are usually preferred to switches (direct threading, or "labels as values": the label address stored in the lookup table is called directly). That's because, under certain conditions, it reduces branch mispredictions, which are extremely expensive in long-pipelined CPUs (a misprediction forces a pipeline flush). It does, however, make the code less portable.
This issue has been discussed extensively in the VM community; I would suggest you look for scholarly papers in this field if you want to read more about it. Ertl & Gregg wrote a great article on this topic in 2001, The Behavior of Efficient Virtual Machine Interpreters on Modern Architectures.
But as mentioned, I'm pretty sure that these details are not relevant for your code. These are small details, and you should not focus too much on them. The Python interpreter uses switches because its developers think it makes the code more readable. Why don't you pick the usage you're most comfortable with? The performance impact will be rather small; you'd better focus on code readability for now ;)
Edit: if it matters, using a hash table will always be slower than a lookup table. For a lookup table you use enum values as your "keys", and the value is retrieved by a single indexed load and an indirect jump: O(1). A hash table lookup first requires calculating a hash and then retrieving the value, which is considerably more expensive.
Using an array where the function addresses are stored, accessed using values of an enum, is good; using a hash table to do the same adds significant overhead.
To sum up, we have:
cost(Hash_table) >> cost(direct_lookup_table)
cost(direct_lookup_table) ~= cost(switch) if your compiler translates switches into lookup tables.
cost(switch) >> cost(direct_lookup_table) (O(N) vs O(1)) if your compiler does not translate switches and uses conditionals instead, but I can't think of any compiler doing that.
But inlined direct threading makes the code less readable.
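For what it's worth, the enum-indexed table described above can be as simple as this (token kinds and handlers are made up):

#include <array>
#include <cstdio>

// Hypothetical token kinds; the real set depends on your language.
enum TokenKind { TOK_NUMBER, TOK_PLUS, TOK_MINUS, TOK_COUNT };

void handleNumber() { std::puts("number"); }
void handlePlus()   { std::puts("plus"); }
void handleMinus()  { std::puts("minus"); }

// Enum-indexed table of handlers: dispatch is one indexed load and one indirect call,
// with no hashing and no string comparison.
const std::array<void (*)(), TOK_COUNT> dispatch = {
    handleNumber, handlePlus, handleMinus
};

int main() {
    const TokenKind stream[] = { TOK_NUMBER, TOK_PLUS, TOK_NUMBER };
    for (TokenKind t : stream)
        dispatch[t]();
}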
The STL map that comes with Visual Studio 2008 will give you O(log(n)) for each function call, since it hides a tree structure beneath.
With a modern compiler (depending on the implementation), a switch statement will give you O(1): the compiler translates it to some kind of lookup table.
So in general, switch is faster.
However, consider the following facts:
The difference between map and switch is that a map can be built dynamically while a switch can't, and a map can have any arbitrary type as a key while a switch is limited to C++ integral types (char, int, enum, etc.).
By the way, you can use a hash map to achieve nearly O(1) dispatching (though, depending on the hash table implementation, it can be O(n) in the worst case). Even so, switch will still be faster.
Edit
I am writing the following only for fun and for the sake of the discussion.
I can suggest a nice optimization for you, but it depends on the nature of your language and whether you can predict how your language will be used.
When you write the code:
Divide your tokens into two groups: one of very frequently used tokens and one of rarely used tokens. Also sort the frequently used tokens by frequency.
For the frequent tokens you write an if-else series, with the most frequently used coming first; for the rarely used ones you write a switch statement.
The idea is to use the CPU's branch prediction in order to avoid another level of indirection (assuming the condition checking in the if statements is nearly costless).
In most cases the CPU will pick the correct branch without any level of indirection; there will be a few cases, however, where the branch goes to the wrong place.
Depending on the nature of your language, statistically it may give better performance.
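A sketch of the idea (token kinds and handlers are invented; which tokens count as "hot" is something you would have to measure for your own language):

#include <cstdio>

// Assume profiling showed TOK_IDENT and TOK_NUMBER dominate real input.
enum TokenKind { TOK_IDENT, TOK_NUMBER, TOK_LBRACE, TOK_RBRACE };

void handleIdent()  { std::puts("ident"); }
void handleNumber() { std::puts("number"); }
void handleLBrace() { std::puts("lbrace"); }
void handleRBrace() { std::puts("rbrace"); }

void dispatch(TokenKind t) {
    // Hot tokens first: a short, predictable if chain the branch predictor learns quickly.
    if (t == TOK_IDENT)  { handleIdent();  return; }
    if (t == TOK_NUMBER) { handleNumber(); return; }
    // Rare tokens fall through to a switch (typically compiled to a jump table).
    switch (t) {
        case TOK_LBRACE: handleLBrace(); break;
        case TOK_RBRACE: handleRBrace(); break;
        default: break;
    }
}

int main() {
    dispatch(TOK_IDENT);
    dispatch(TOK_LBRACE);
}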
Edit: due to some comments below, I changed the sentence claiming that compilers will always translate a switch into a lookup table.
What is your definition of "efficient"? If you mean faster, then you probably should profile some test code for a definite answer. If you're after flexible and easy-to-extend code though, then do yourself a favor and use the map approach. Everything else is just premature optimization...
Like yossi1981 said, a switch could be optimized into a fast lookup table, but there is no guarantee: every compiler has its own heuristics to decide whether to implement the switch as consecutive ifs, as a fast lookup table, or as a combination of both.
To get a fast switch, your values should meet the following rule:
they should be consecutive, e.g. 0, 1, 2, 3, 4. You can leave a few values out, but things like 0, 1, 2, 34, 43 are extremely unlikely to be optimized.
The question really is: is the performance of such significance in your application?
And wouldn't a map which loads its values dynamically from a file be more readable and maintainable instead of a huge statement which spans multiple pages of code?
You don't say what type your tokens are. If they are not integers, you don't have a choice - switches only work with integer types.
The C++ standard says nothing about the performance of its requirements, only that the functionality should be there.
These sorts of questions about which is better or faster or more efficient are meaningless unless you state which implementation you're talking about. For example, the string handling in a certain version of a certain implementation of JavaScript was atrocious, but you can't extrapolate that to being a feature of the relevant standard.
I would even go so far as to say it doesn't matter regardless of the implementation since the functionality provided by switch and std::map is different (although there's overlap).
These sorts of micro-optimizations are almost never necessary, in my opinion.
Imagine that you have an internally controlled list of vendors. Now imagine that you want to match unstructured strings against that list. Most will be easy to match, but some may be reasonably impossible. The algorithm will assign a confidence to each match, but a human needs to confirm all matches produced.
How could this algorithm be unit tested? The only idea I have had so far is to take a sample of pairs matched by humans and make sure the algorithm is able to successfully match those, omitting strings that I couldn't reasonably expect our algorithm to handle. Is there a better way?
I'd try some 'canonical' pairs, both "should match" and "shouldn't match" pairs, and test only whether the confidence is above (or below) a given threshold.
Maybe you can also do some ordering checks, such as "no pair should have greater confidence than the exact-match pair", or "a pair that matches all the consonants should score at least as high as one that matches only the vowels".
You can also test whether the confidence for strings your algorithm won't handle well is sufficiently low. That way you can also see whether there is a threshold above which you can trust your algorithm.
An interesting exercise would be to store the human answers that correct your algorithm and try to see if you could improve your algorithm to not get them wrong.
If you can, add the new matches to the unit tests.
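A sketch of what such canonical-pair tests could look like, written against an assumed matcher interface (the vendor names and thresholds are placeholders):

#include <cassert>
#include <functional>
#include <string>

// The matcher under test, passed in as a callable so the sketch stays independent of the
// real implementation: it takes (unstructured string, vendor) and returns a confidence.
using Matcher = std::function<double(const std::string&, const std::string&)>;

// Canonical "should match" / "shouldn't match" pairs, checked only against thresholds.
void testCanonicalPairs(const Matcher& confidence) {
    assert(confidence("ACME Corporation", "Acme Corp.") > 0.8);   // should match
    assert(confidence("Globex Ltd.", "Acme Corp.") < 0.3);        // shouldn't match
    // Ordering check: no pair should beat the exact match.
    assert(confidence("Acme Corp.", "Acme Corp.") >= confidence("ACME Corporation", "Acme Corp."));
}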
I don't think there's a better way than what you describe; effectively, you're just using a set of predefined data to test that the algorithm does what you expect. For any very complicated algorithm which has very nonlinear inputs and outputs, that's about the best you can do; choose a good test set, and assure that you run properly against that set of known values. If other values come up which need to be tested in the future, you can add them to the set of tested values.
That sounds fair. If it's possible (given time constraints) to get as large a sample of human matches as you can, you will get a picture of how well your algorithm is doing. You could design specific unit tests which pass if they're within X% of correctness.
Best of luck.
I think there are two issues here: the way your code behaves according to the algorithm, and how successful the algorithm is (i.e. it does not accept answers which a human later rejects, and does not reject answers a human would accept).
Issue 1 is regular testing. For issue 2 I would go with previous result sets (i.e. compare the algorithm's results to human ones).
What you describe is the best way, because what counts as the best match is subjective; only a human can come up with the appropriate test cases.
It sounds as though you are describing an algorithm which is deterministic, but one which is sufficiently difficult that your best initial guess at the correct result is going to be whatever your current implementation delivers to you (aka deterministic implementation to satisfy fuzzy requirements).
For those sorts of circumstances, I will use a "Guru Checks Changes" pattern. Generate a collection of inputs, record the outputs, and in subsequent runs of the unit tests, verify that the outcome is consistent with the previous results. Not so great for ensuring that the target algorithm is implemented correctly, but it is effective for ensuring that the most recent refactoring hasn't changed the behavior in the test space.
A variation of this, which may be more palatable for your circumstances, is to start from the same initial data collection, but rather than trying to preserve precisely the same result every time, you predefine some buckets and flag any time an implementation change moves a test result from one confidence bucket to another.
Samples that have clearly correct answers (exact matches, null matches, high value corner cases) should be kept in a separate test.