High-speed calculation of Coq's theorems - profiling

I have to wait until Coq finishes its computations, even in very simple cases.
I know about "Asynchronous and Parallel Proof Processing", but I suppose that my code has inherent vices, so I'd like to get some references or advice on guidelines/best practices for styling proofs.
e.g.:
try to use Definitions instead of Theorems,
Use a compiler. Use parallel processing. Use better hardware.
Do not use placeholders; fill in every argument, like (@functionname var1 ... varn).
Use semicolons (;) instead of periods (.).
It is much faster to use Definitions in a Section instead of "set (f:=term)." in the proof (possibly because every "set" takes additional time to be printed, even for Check Empty).
How can I accelerate Coq? (Please say if I have an error in the items above; they are derived from my practice.)
What are the most critical stages of the computation, and how should I work with them?

The holy grail is to profile your code and optimize the hotspots. In Coq >= 8.6, you can Set Ltac Profiling. and Reset Ltac Profile. and Show Ltac Profile. to profile tactics. If you invoke coqc with the -time argument, you will get line-by-line timing information; a bit of sed trickery can sort the information for you:
coqc -time foo.v | sed s'/^\(.*\) \([0-9\.]\+ secs.*\)$/\2 \1/g' | sort -h
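For the Ltac profiler mentioned above, a minimal usage sketch looks like this (wrap it around whatever proof scripts you want to measure):
Set Ltac Profiling.
(* ... the proofs you want to measure ... *)
Show Ltac Profile.   (* prints the table of time spent per tactic *)
Reset Ltac Profile.  (* discards the collected data *)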
In Coq >= 8.8, you can Set NativeCompute Profiling to profile native computations (such as Eval native_compute in (slow program here)). This produces traces that you can visualize with perf report on GNU/Linux. See this bug report for more info.
If you hit other bottlenecks, you either have to be good at guessing, convince the devs to add more profiling tools, profile the running coq binary, or convince the devs that it's worth their time to profile your code for you. (I can sometimes get Pierre-Marie Pédrot to profile my code when the slowness points to an efficiency bug in Coq.)
One useful practice is to always profile your code on every single commit. In Coq >= 8.7, there are Makefile targets make-pretty-timed-before, make-pretty-timed-after, and print-pretty-timed-diff to get nice sorted tables of per-file compilation time diffs between two states of your repo. You can get per-line information with make TIMING=before, make TIMING=after, make all.timing.diff.
You may also be interested in looking at Experience Implementing a Performant Category-Theory Library in Coq (more media here), and perhaps also the slide deck for the presentation of that paper (pdf) (pptx with presenter notes).
Coq can be slow in a number of places, though most of the things you mention are irrelevant. Going through yours in order:
Theorem and Definition are synonyms; the only difference is that Definition supports := and Theorem doesn't.
There are no optimizing compilers for Coq, though better hardware, more RAM, and faster CPUs definitely help, as does parallel processing. On this note, file-level parallelism tends to work better than proof-level parallelism. I tend to split my files up as much as possible, have fine-grained imports so that library loading time isn't an issue, and use make -j.
Filling every argument is more likely to hurt than to help. You incur additional unification costs, especially if the terms you are filling the arguments with are big. The real thing you do by filling in the arguments is trading evar creation against unification. Unification is usually slower. However, if you have any holes that are filled by canonical structure resolution, by typeclass resolution, or which require other backtracking and unfolding things or refreshing universe variables, filling them in can speed things up a lot.
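As a small illustration of the trade-off (a sketch with a hypothetical definition, not from the question):
(* id' has an implicit type argument A *)
Definition id' {A : Type} (x : A) : A := x.
Check @id' nat 0.  (* every argument filled in explicitly *)
Check id' 0.       (* the type argument is left as a hole for unification to solve *)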
I think the difference between semicolon and period in proof scripts only matters in interactive mode (in coqtop/CoqIDE/ProofGeneral/etc, not when running make). Let me know if you test it and discover otherwise.
This is very true and has impacts on both printing and other things. The set itself would not be slow due to printing, but instead due to the fact that it tries to find all occurrences of the term in your goal (stupidly, up to some reduction (beta-iota? I can't recall), rather than up to syntactic equality), and replace them with the new hypothesis. If you don't need this behavior, use pose rather than set. Additionally, big context variables can slow down tactics that depend on the size of the context, and this is why Definition is faster in sections.
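A tiny example of the difference (a sketch): pose just adds the local definition, while set also rewrites the goal.
Goal 2 + 2 = 4.
  pose (n := 2 + 2).  (* n := 2 + 2 is added to the context; the goal is untouched *)
  set (m := 2 + 2).   (* occurrences in the goal are replaced: the goal becomes m = 4 *)
Abort.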
Other thoughts on things I've run into:
Pick good abstraction barriers, and respect them religiously. You will pay in sweat and tears every single time you break an abstraction barrier. (Picking good abstraction barriers is incredibly hard. Usually I do this by copying existing mathematical abstractions or published papers. I've managed to generate a good abstraction barrier entirely from scratch, in some sense, exactly once in the past five years. The abstraction barrier, in hindsight, can be described with the "insight" that "when writing C-like code, every function takes one tuple as an argument and returns one tuple as a result.")
If you're doing reflective proofs, pick good algorithms. If you use unary natural numbers, or your algorithm has exponential running time, your proof will be slow.
In Coq < 8.7, evar-map normalization incurs massive overhead on large terms. (Props to Pierre-Marie for fixing this with his EConstr branch.)
Slow End Section (and sometimes slow Defineds) are frequently caused by issues with rehashconsing. To fix this, cross your fingers and pray. (Or become a Coq dev and fix Coq.)
The guardedness checker is extremely slow at checking bare fixes with large bodies (maybe only in the presence of beta-iota-zeta redexes?). The work-around is to extract the body into a separate definition. For example, rather than writing
Fixpoint length A (ls : list A) : nat :=
  match ls with
  | nil => 0
  | cons _ xs => S (length _ xs)
  end.
you could write
Definition length_step
  (length : forall A, list A -> nat)
  A (ls : list A)
  : nat
  := match ls with
     | nil => 0
     | cons _ xs => S (length _ xs)
     end.
Fixpoint length A (ls : list A) : nat
  := length_step length A ls.
Be aware that Coq will liberally (sometimes inconsistently) inline let x := ... in ... statements, which can easily lead to exponential definition internalization time.
When doing reification, canonical structures are fast but hard to read, Ltac is about 2x slower, and typeclass resolution can be 2x slower again (or can be the same speed as Ltac reification). Hopefully things will be better when Ltac2 is finished.
simpl is very slow on large terms.
I may come back and add more later (and feel free to suggest things in the comments for me to add), but this is a half-decent start.

Related

What string search algorithm does strstr use?

I was reading through the String searching algorithm wikipedia article, and it made me wonder what algorithm strstr uses in Visual Studio? Should I try and use another implementation, or is strstr fairly fast?
Thanks!
The implementation of strstr in Visual Studio is not known to me, and I am not sure it is known to anyone outside the implementers. However, I found these interesting sources and an example implementation. The latter shows that the algorithm runs in worst-case quadratic time with respect to the size of the searched string; the aggregate cost should be less than that, and that is roughly the algorithmic limit of simple, non-randomized solutions.
What is actually the case is that, depending on the size of the input, different algorithms may be used, mainly optimized close to the metal. However, one cannot really bet on that. If you are doing DNA sequencing, strstr and family are very important, and most probably you will have to write your own customized version. Standard implementations are usually optimized for the general case, but on the other hand the people working on compilers and runtime libraries know their stuff. At any rate, you should not bet your own skills against the pros.
But really, all this discussion about development time is hurting the effort to write good software. Be certain that the benefit of rewriting a custom strstr outweighs the effort that is going to be needed to maintain and tune it for your specific case, before you embark on this task.
As others have recommended: Profile. Perform valid performance tests.
Without the profile data, you could be optimizing a part of the code that runs 20% of the time, a waste of ROI.
Development costs are the prime concern with modern computers, not execution time. The best use of time is to develop the program to operate correctly with few errors before entering System Test. This is where the focus should be. Also due to this reasoning, most people don't care how Visual Studio implements strstr as long as the function works correctly.
Be aware that there is a line, or point, where a linear search outperforms other searches. This line depends on the size of the data or the search criteria. For example, a linear search using a processor with branch prediction and a large instruction cache may outperform other techniques for small and medium data sizes. A more complicated algorithm may have more branches that cause reloading of the instruction cache or data cache (wasting execution time).
Another method for optimizing your program is to make the data organization easier for searching. For example, making the string small enough to fit into a cache line. This also depends on the quantity of searching. For a large amount of searches, optimizing the data structure may gain some performance.
In summary, optimize if and only if the program is not working correctly, the User is complaining about speed, it is missing timing constraints or it doesn't fit in the allocated memory. Next step is then to profile and optimize the areas where most of the time is spent. Any other optimization is futile.
The C++ standard refers to the C standard for the description of what strstr does. The C standard doesn't seem to put any restrictions on the complexity, so pretty much any algorithm that finds the first instance of the substring would be compliant.
Thus different implementations may choose different algorithms. You'd have to look at your particular implementation to determine which it uses.
The simple, brute-force approach is likely O(m×n) where m and n are the lengths of the strings. If you need better than that, you can try other libraries, like Boost, or implement one of the sub-linear searches yourself.
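For reference, the brute-force approach behind that O(m×n) bound is essentially the following (a sketch, not any particular vendor's implementation):
#include <stddef.h>

/* naive substring search: for every position in haystack, compare the needle
   character by character; worst case O(m*n) comparisons */
const char *naive_strstr(const char *haystack, const char *needle)
{
    if (*needle == '\0')
        return haystack;
    for (; *haystack != '\0'; ++haystack) {
        size_t i = 0;
        while (needle[i] != '\0' && haystack[i] == needle[i])
            ++i;
        if (needle[i] == '\0')
            return haystack;
    }
    return NULL;
}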

F# code optimization or is it really that slow?

I was looking for a way to do proper algorithmic coding using .NET with all the benefits of modern languages (e.g. I like strong type checking, operator overloading, lambdas, generic algorithms). Usually I write my algorithms (mainly image processing) in C++. As F# as a language seems quite interesting, I played with it a bit, but it seems to be very slow. As a most simple test I just did some array manipulation -> brightness increase of an image:
let r1 = rgbPixels |> Array.map (fun x -> x + byte(10) )
It seems to be at least a factor of 8 slower than a comparable C++ implementation - even worse for more complex algorithms, e.g. 2D convolution.
Is there some faster way or do I miss any specific compiler settings (yes, building release with optimization on...)?
I'm willing to pay for the nice and high abstraction, but such an overhead is not nice (I would need to parallelize on 8 cores to compensate :) ) - at least it destroys the motivation to learn further... My other option would be to leave my heavier algos in C++ and interface with managed C++, but this is not nice, as maintaining the managed wrapper would be quite a burden.
If you are worried about performance one of the important things to remember is that F# by default does not mutate anything. This requires copying in many naive implementations of algorithms, such as the one you described.
EDIT: I have no idea why, but simple tests of the following code provide inferior results to Array.map. Be sure to profile any algorithm you try when performing these kinds of optimizations. However, I got very similar results between for and map.
Array.map creates a new array for the result of the operation; instead you want Array.iteri.
rgbPixels |> Array.iteri (fun i x -> rgbPixels.[i] <- x + 10uy)
Note that this could be wrapped up in your own module as below
module ArrayM =
    let map f a = a |> Array.iteri (fun i x -> a.[i] <- f x)
Unfortunately this is a necessary evil, since one of the primary tenets of functional programming is to stick to immutable objects as much as your algorithm will allow, and then, once finished, change to mutation where performance is critical. If you know your performance is critical from the get-go, you will need to start with these kinds of helpers.
Also note that there is probably a library out there that provides this functionality, I just don't know of it off hand.
I think that it's safe to say that idiomatic F# will often fail to match the performance of optimized C++ for array manipulation, for a few reasons:
Array accesses are checked against the bounds of the array in .NET to ensure memory safety. The CLR's JIT compiler is able to elide bounds checks for some stereotypical code, but this will typically require you to use a for loop with explicit bounds rather than more idiomatic F# constructs.
There is typically a slight amount of overhead to using abstractions like lambdas (e.g. fun i -> ...). In tight loops, this small overhead can end up being significant compared to the work that's being done in the loop body.
As far as I am aware, the CLR JIT compiler doesn't take advantage of SSE instructions to the same degree that C++ compilers are able to.
On the other side of the ledger,
You will never have a buffer overflow in F# code.
Your code will be easier to reason about. As a corollary, for a given level of total code complexity, you can often implement a more complex algorithm in F# than you can in C++.
If necessary you can write unidiomatic code which will run at closer to the speed of C++, and there are also facilities for interoperating with unsafe C++ code if you find that you need to write C++ to meet your performance requirements.
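For example, an unidiomatic in-place loop over the pixel buffer from the question (a sketch; the function name is made up) avoids allocating a result array and gives the JIT a better chance to elide bounds checks:
// mutates the input array instead of allocating a new one
let brightenInPlace (pixels : byte[]) (delta : byte) =
    for i in 0 .. pixels.Length - 1 do
        pixels.[i] <- pixels.[i] + delta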

Real world Clojure performance tuning tips?

I am seeing some serious slowdowns in "some" parts of a Clojure application. Does anyone have any real world tips as to how they have done performance tuning on a clojure application?
My personal tips:
Check your algorithm first - are you incurring O(n^2) cost when it really should be O(n log n)? If you've picked a bad algorithm, the rest of tuning is a waste of time.
Be aware of common "gotchas" like the O(n) cost of traversing a list / sequence.
Take advantage of nice features in Clojure, like the O(1) cost of copying a large persistent data structure or the O(log32 n) cost of map/set/vector accesses.
Choose among Clojure's core constructs wisely:
An atom is great when you need some mutable data, e.g. updating some data in a loop
If you are going to traverse some data in sequence, use a list rather than a vector or map since this will avoid creating temporary objects while traversing the sequence.
Use deftype/defrecord/defprotocol where appropriate. These are heavily optimised, and in particular should be preferred to defstruct/multimethods as of Clojure 1.2 onwards.
Take advantage of Clojure's concurrency capabilities:
pmap and future are both relatively easy ways to take advantage of multiple cores when you are doing a number of independent computations at the same time.
Remember that because of Clojure's immutable persistent data structures, making and working on multiple copies of data is very inexpensive. You also don't have to worry about locking when taking snapshots.....
If you are interfacing with Java code, use "(set! *warn-on-reflection* true)" and eliminate every reflection warning (see the sketch after this list). Reflection is one of the most expensive operations, and will really slow your application down if done repeatedly.
If you still need more performance, identify the most performance critical parts of your code (e.g. the 5% of lines where the application spends 90%+ of CPU time), analyse this section in detail and judiciously apply the following rules:
Avoid laziness. Laziness is a great feature but comes with some additional overhead. Be aware that many of Clojure's common sequence / list functions are lazy (e.g. for, map, partition). loop/recur, dotimes and reduce are your non-lazy friends.
Use primitive hints and unchecked arithmetic to make arithmetic / numerical code faster. Primitives are much faster than Clojure's default BigInteger arithmetic.
Minimise memory allocations - try to avoid creating too much unnecessary intermediate data (vectors, lists, maps, non-primitive numbers etc.). All allocations impose a small amount of extra overhead, and also result in more/longer GC pauses over time (this is likely to be a bigger concern if you are writing a game / soft realtime app.).
(Ab)use Java arrays - not really idiomatic in Clojure, but aget / aset / areduce and friends are very fast (they benefit from a lot of JVM optimisations!!). (Ab)use primitive arrays for extra bonus points.
Use macros - to generate ugly-but-fast code at compile time where possible
Doing all the above should get pretty good performance out of Clojure code - with careful tuning I've usually been able to get reasonably near to pure Java performance, which is pretty impressive for a dynamic language!
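For the reflection point above, a minimal sketch of what eliminating a warning looks like (the functions are made-up examples):
(set! *warn-on-reflection* true)

(defn str-len-slow [s] (.length s))          ; emits a reflection warning
(defn str-len-fast [^String s] (.length s))  ; the type hint resolves the call statically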
You can use JVisualVM to do profiling of Clojure code (see JVisualVM and Clojure for an example). This should at least point you in the right direction of the slow code.
This question covers profiling with Clojure: Profiling tool for Clojure?
I'm sure you'll find some good tips there.
Then straight from the horse's mouth: http://clojure.org/getting_started#Getting%20Started-Profiling

Producing the fastest possible executable

I have a very large program which I have been compiling under visual studio (v6 then migrated to 2008). I need the executable to run as fast as possible. The program spends most of its time processing integers of various sizes and does very little IO.
Obviously I will select maximum optimization, but it seems that there are a variety of things that can be done which don't come under the heading of optimization which do still affect the speed of the executable. For example selecting the __fastcall calling convention or setting structure member alignment to a large number.
So my question is: Are there other compiler/linker options I should be using to make the program faster which are not controlled from the "optimization" page of the "properties" dialog.
EDIT: I already make extensive use of profilers.
Another optimization option to consider is optimizing for size. Sometimes size-optimized code can run faster than speed-optimized code due to better cache locality.
Also, beyond optimization operations, run the code under a profiler and see where the bottlenecks are. Time spent with a good profiler can reap major dividends in performance (especially it if gives feedback on the cache-friendliness of your code).
And ultimately, you'll probably never know what "as fast as possible" is. You'll eventually need to settle for "this is fast enough for our purposes".
Profile-guided optimization can result in a large speedup. My application runs about 30% faster with a PGO build than a normal optimized build. Basically, you run your application once and let Visual Studio profile it, and then it is built again with optimization based on the data collected.
1) Reduce aliasing by using __restrict.
2) Help the compiler in common subexpression elimination / dead code elimination by using __pure.
3) An introduction to SSE/SIMD can be found here and here. The internet isn't exactly overflowing with articles about the topic, but there's enough. For a reference list of intrinsics, you can search MSDN for 'compiler intrinsics'.
4) For 'macro parallelization', you can try OpenMP. It's a compiler standard for easy task parallelization -- essentially, you tell the compiler using a handful of #pragmas that certain sections of the code are reentrant, and the compiler creates the threads for you automagically (a minimal pragma sketch follows this list).
5) I second interjay's point that PGO can be pretty helpful. And unlike #3 and #4, it's almost effortless to add in.
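A minimal OpenMP sketch (hypothetical data; build with /openmp in Visual Studio):
#include <vector>

void scale(std::vector<int>& v, int factor)
{
    // the pragma asks the compiler to split the loop iterations across threads
    #pragma omp parallel for
    for (int i = 0; i < static_cast<int>(v.size()); ++i)
        v[i] *= factor;
}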
You're asking which compiler options can help you speed up your program, but here's some general optimisation tips:
1) Ensure your algorithms are appropriate for the job. No amount of fiddling with compiler options will help you if you write an O(shit squared) algorithm.
2) There's no hard and fast rules for compiler options. Sometimes optimise for speed, sometimes optimise for size, and make sure you time the differences!
3) Understand the platform you are working on. Understand how the caches for that CPU operate, and write code that specifically takes advantage of the hardware. Make sure you're not following pointers everywhere to get access to data which will thrash the cache. Understand the SIMD operations available to you and use the intrinsics rather than writing assembly. Only write assembly if the compiler is definitely not generating the right code (i.e. writing to uncached memory in bad ways). Make sure you use __restrict on pointers that will not alias. Some platforms prefer you to pass vector variables by value rather than by reference as they can sit in registers - I could go on with this but this should be enough to point you in the right direction!
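As one concrete instance of the aliasing point (a sketch; the names are made up): __restrict promises the compiler that the two pointers never alias, so it can keep loaded values in registers instead of re-reading memory on every iteration.
void add_offset(int* __restrict dst, const int* __restrict src, int n, int offset)
{
    for (int i = 0; i < n; ++i)
        dst[i] = src[i] + offset;
}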
Hope this helps,
-Tom
Forget micro-optimization such as what you are describing. Run your application through a profiler (there is one included in Visual Studio, at least in some editions). The profiler will tell you where your application is spending its time.
Micro-optimization will rarely give you more than a few percentage points increase in performance. To get a really big boost, you need to identify areas in your code where inefficient algorithms and/or data structures are being used. Focus on those, for example by changing algorithms. The profiler will help identify these problem areas.
Check which /fp precision mode you are using. Each one generates quite different code, and you need to choose based on what accuracy is required in your app. Our code needs precision (geometry, graphics code) but we still use /fp:fast (C/C++ -> Code Generation options).
Also make sure you have /arch:SSE2, assuming your deployment covers processors that all support SSE2. This can result in quite a big difference in performance, as the compiled code will use fewer cycles. Details are nicely covered in the blog SomeAssemblyRequired.
Since you are already profiling, I would suggest checking for loop unrolling if it is not happening. I have seen VS2008 fail to do it fairly frequently (with templates, references, etc.).
Use __forceinline in hotspots if applicable.
Change hotspots of your code to use SSE2 etc as your app seems to be compute intense.
You should always address your algorithm and optimise that before relying on compiler optimisations to get you significant improvements in most cases.
Also you can throw hardware at the problem. Your PC may already have the necessary hardware lying around mostly unused: the GPU! One way of improving performance of some types of computationally expensive processing is to execute it on the GPU. This is hardware specific but NVIDIA provide an API for exactly that: CUDA. Using the GPU is likely to get you far greater improvement than using the CPU.
I agree with what everyone has said about profiling. However you mention "integers of various sizes". If you are doing much arithmetic with mismatched integers a lot of time can be wasted in changing sizes, shorts to ints for example, when the expressions are evaluated.
I'll throw in one more thing too. Probably the most significant optimisation is in choosing and implementing the best algorithm.
You have three ways to speed up your application:
Better algorithm - you've not specified the algorithm or the data types (is there an upper limit to integer size?) or what output you want.
Macro parallelisation - split the task into chunks and give each chunk to a separate CPU, so, on a two core cpu divide the integer set into two sets and give half to each cpu. This depends on the algorithm you're using - not all algorithms can be processed like this.
Micro parallelisation - this is like the above but uses SIMD. You can combine this with point 2 as well.
You say the program is very large. That tells me it probably has many classes in a hierarchy.
My experience with that kind of program is that, while you are probably assuming that the basic structure is just about right, and to get better speed you need to worry about low-level optimization, chances are very good that there are large opportunities for optimization that are not of the low-level kind.
Unless the program has already been tuned aggressively, there may be room for massive speedup in the form of mid-stack operations that can be done differently. These are usually very innocent-looking and would never grab your attention. They are not cases of "improve the algorithm". They are usually cases of "good design" that just happen to be on the critical path.
Unfortunately, you cannot rely on profilers to find these things, because they are not designed to look for them.
This is an example of what I'm talking about.

What is more efficient a switch case or an std::map

I'm thinking about the tokenizer here.
Each token calls a different function inside the parser.
What is more efficient:
A map of std::functions/boost::functions
A switch case
I would suggest reading switch() vs. lookup table? from Joel on Software. Particularly, this response is interesting:
" Prime example of people wasting time
trying to optimize the least
significant thing."
Yes and no. In a VM, you typically
call tiny functions that each do very
little. It's the not the call/return
that hurts you as much as the preamble
and clean-up routine for each function
often being a significant percentage
of the execution time. This has been
researched to death, especially by
people who've implemented threaded
interpreters.
In virtual machines, lookup tables storing computed addresses to call are usually preferred to switches (direct threading, or "labels as values": the dispatch jumps directly to the label address stored in the lookup table). That's because it allows, under certain conditions, reducing branch misprediction, which is extremely expensive in long-pipelined CPUs (it forces a pipeline flush). It does, however, make the code less portable.
This issue has been discussed extensively in the VM community; I would suggest you look for scholarly papers in this field if you want to read more about it. Ertl & Gregg wrote a great article on this topic in 2001, The Behavior of Efficient Virtual Machine Interpreters on Modern Architectures.
But as mentioned, I'm pretty sure that these details are not relevant for your code. These are small details, and you should not focus too much on them. The Python interpreter uses switches, because its developers think it makes the code more readable. Why don't you pick the usage you're most comfortable with? The performance impact will be rather small; you'd better focus on code readability for now ;)
Edit: If it matters, using a hash table will always be slower than a lookup table. For a lookup table, you use enum types for your "keys", and the value is retrieved using a single indirect jump. This is a single assembly operation. O(1). A hash table lookup first requires calculating a hash, then retrieving the value, which is way more expensive.
Using an array where the function addresses are stored, accessed using values of an enum, is good. But using a hash table to do the same adds significant overhead.
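A minimal sketch of that enum-indexed dispatch table (hypothetical token names and handlers):
#include <cstdio>

enum Token { TOK_PLUS, TOK_MINUS, TOK_COUNT };

static void parse_plus()  { std::puts("plus");  }
static void parse_minus() { std::puts("minus"); }

// enum value -> handler: dispatch is one indexed load plus one indirect call, O(1)
static void (*const handlers[TOK_COUNT])() = { parse_plus, parse_minus };

void dispatch(Token t) { handlers[t](); }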
To sum up, we have:
cost(Hash_table) >> cost(direct_lookup_table)
cost(direct_lookup_table) ~= cost(switch) if your compiler translates switches into lookup tables.
cost(switch) >> cost(direct_lookup_table) (O(N) vs O(1)) if your compiler does not translate switches and uses conditionals instead, but I can't think of any compiler doing this.
But inlined direct threading makes the code less readable.
The STL map that comes with Visual Studio 2008 will give you O(log(n)) for each function call, since it hides a tree structure beneath.
With a modern compiler (depending on the implementation), a switch statement will give you O(1): the compiler translates it to some kind of lookup table.
So in general , switch is faster.
However , consider the following facts:
The difference between map and switch is that a map can be built dynamically while a switch can't, and a map can have any arbitrary type as a key while a switch is limited to C++ integral types (char, int, enum, etc.).
By the way, you can use a hash map to achieve nearly O(1) dispatching (though, depending on the hash table implementation, it can sometimes be O(n) in the worst case). Even so, switch will still be faster.
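For comparison, the map approach from the question looks roughly like this (a sketch with made-up handler names); lookup is O(log n), but the table can be built and extended at run time:
#include <functional>
#include <map>
#include <string>

std::map<std::string, std::function<void()>> handlers = {
    { "plus",  [] { /* parse '+' */ } },
    { "minus", [] { /* parse '-' */ } },
};

void dispatch(const std::string& token)
{
    auto it = handlers.find(token);
    if (it != handlers.end())
        it->second();  // call the handler registered for this token
}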
Edit
I am writing the following only for fun and for the sake of the discussion.
I can suggest a nice optimization for you, but it depends on the nature of your language and whether you can predict how your language will be used.
When you write the code:
Divide your tokens into two groups, one of very frequently used tokens and one of infrequently used tokens. Also sort the frequently used tokens by frequency.
For the frequent tokens you write an if-else series with the most frequent coming first; for the infrequent tokens you write a switch statement (as in the sketch below).
The idea is to use the CPU's branch prediction to avoid another level of indirection (assuming the condition checking in the if statements is nearly costless).
In most cases the CPU will pick the correct branch without any level of indirection. There will be a few cases, however, where the branch goes to the wrong place.
Depending on the nature of your language, statistically this may give better performance.
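A self-contained sketch of this hybrid dispatch (token names and handlers are made up):
enum Token { TOK_IDENT, TOK_NUMBER, TOK_PLUS, TOK_MINUS };

void parse_ident()  {}
void parse_number() {}
void parse_plus()   {}
void parse_minus()  {}

void dispatch(Token t)
{
    // the two most frequent tokens get a predictable if-else chain
    if (t == TOK_IDENT)  { parse_ident();  return; }
    if (t == TOK_NUMBER) { parse_number(); return; }
    switch (t)  // everything else falls through to the switch
    {
    case TOK_PLUS:  parse_plus();  break;
    case TOK_MINUS: parse_minus(); break;
    default:        break;
    }
}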
Edit: Due to some comments below, I changed the sentence claiming that compilers will always translate a switch into a lookup table.
What is your definition of "efficient"? If you mean faster, then you probably should profile some test code for a definite answer. If you're after flexible and easy-to-extend code though, then do yourself a favor and use the map approach. Everything else is just premature optimization...
Like yossi1981 said, a switch can be optimized into a fast lookup table, but there is no guarantee; every compiler has its own algorithm to determine whether to implement the switch as consecutive ifs, as a fast lookup table, or maybe as a combination of both.
To gain a fast switch, your values should meet the following rule:
they should be consecutive, e.g. 0, 1, 2, 3, 4. You can leave some values out, but things like 0, 1, 2, 34, 43 are extremely unlikely to be optimized.
The question really is: is the performance of such significance in your application?
And wouldn't a map which loads its values dynamically from a file be more readable and maintainable instead of a huge statement which spans multiple pages of code?
You don't say what type your tokens are. If they are not integers, you don't have a choice - switches only work with integer types.
The C++ standard says nothing about the performance of its requirements, only that the functionality should be there.
These sort of questions about which is better or faster or more efficient are meaningless unless you state which implementation you're talking about. For example, the string handling in a certain version of a certain implementation of JavaScript was atrocious, but you can't extrapolate that to being a feature of the relevant standard.
I would even go so far as to say it doesn't matter regardless of the implementation since the functionality provided by switch and std::map is different (although there's overlap).
These sort of micro-optimizations are almost never necessary, in my opinion.