Pyomo/CBC: Get last iteration of CBC optimizer before infeasible - pyomo

I'm migrating an optimization script (using Pyomo and CBC) from a local Windows 10 machine to a Linux-based platform (Dataiku). For some reason either Pyomo or CBC behaves differently, even though there have been no changes to the optimization part of the script. On the local Windows machine, even when the optimization was infeasible, I still got the last-iteration solution from before it turned out to be infeasible. For our purposes that was good enough, since the infeasibility appears in unimportant places and, with about 100k variables, it did not make sense to hunt it down. On the Linux-based platform, when the problem is infeasible, I only get the initial values back - essentially zeros in all non-fixed positions. I assumed turning off presolve would fix this, but it still returns the initial values. Does anyone know how to get the last iteration of the optimizer?
Thanks

Related

Standard math functions reproducibility on different CPU's

I am working on a project with a lot of math calculations. After switching to a new test machine, I noticed that a lot of tests failed. It is also important to note that the tests failed on my development machine and on some other developers' machines as well. After tracing values and comparing them with values from the old machine, I found that some functions from math.h (so far I have only found cosine) sometimes return slightly different values (for example: 40965.8966304650828827e-01 vs 40965.8966304650828816e-01, and -3.3088623618085204e-08 vs -3.3088623618085197e-08).
New CPU: Intel Xeon Gold 6230R (Intel64 Family 6 Model 85 Stepping 7)
Old CPU: Exact model is unknown (Intel64 Family 6 Model 42 Stepping 7)
My CPU: Intel Core i7-4790K
Test results don't depend on the Windows version (7 and 10 were tested).
I tried testing with a binary statically linked against the standard library, to rule out loading of different libraries for different processes and Windows versions, but all results were the same.
The project is compiled with /fp:precise; switching to /fp:strict changed nothing.
MSVC from Visual Studio 15 is used: 19.00.24215.1 for x64.
How to make calculations fully reproducible?
Since you are on Windows, I am pretty sure the different results occur because the UCRT detects at runtime whether FMA3 (fused multiply-add) instructions are available on the CPU and, if so, uses them in transcendental functions such as cosine. This gives slightly different results. The solution is to place the call _set_FMA3_enable(0); at the very start of your main() or WinMain() function, as described here.
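For illustration, a minimal sketch of where that call would go, assuming the x64 UCRT's _set_FMA3_enable (declared in <math.h> on that platform) is available in your toolchain:

#include <math.h>  // the x64 UCRT exposes _set_FMA3_enable here

int main()
{
#if defined(_MSC_VER) && defined(_M_X64)
    // Disable the FMA3-based code paths in the transcendental functions so
    // results match machines whose CPUs lack FMA3 support.
    _set_FMA3_enable(0);
#endif
    // ... rest of the program ...
    return 0;
}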
If you want to have reproducibility also between different operating systems, things become harder or even impossible. See e.g. this blog post.
In response also to the comments stating that you should just use some tolerance: I do not agree with this as a general statement. Certainly, there are many applications where this is the way to go. But I do think it can be a sensible requirement to get exactly the same floating-point results for some applications, at least when staying on the same OS (Windows, in this case). In fact, we had the very same issue with _set_FMA3_enable a while ago.

I am a software developer for a traffic simulation, and minor differences such as 10^-16 often build up and eventually lead to entirely different simulation results. Naturally, one is supposed to run many simulations with different seeds and average over all of them, making the different behavior irrelevant for the final result. But sometimes customers have a problem at a specific simulation second for a specific seed (e.g. an application crash or incorrect behavior of an entity), and not being able to reproduce it on our developer machines because of a different CPU makes it much harder to diagnose and fix the issue.

Moreover, if the test system consists of a mixture of older and newer CPUs and test cases are not bound to specific resources, tests can sometimes deviate seemingly without reason (flaky tests). This is certainly not desired. Requiring exact reproducibility also makes writing the tests much easier, because you do not need heuristic thresholds (e.g. a tolerance or some guessed number of samples).

Finally, our customers expect the results to remain stable for a specific version of the program, since they calibrated (more or less...) their traffic networks to real data. This is somewhat questionable, since (again) one should actually look at averages, but the naive expectation usually wins in reality.
IEEE-754 double-precision binary floating point provides no more than 15 significant decimal digits of precision. You are looking at the "noise" of different library implementations and possibly different FPU implementations.
How to make calculations fully reproducible?
That is an X-Y problem. The answer is you can't. But it is the wrong question. You would do better to ask how you can implement valid and robust tests that are sympathetic to this well-known and unavoidable technical issue with floating-point representation. Without providing the test code you are trying to use, it is not possible to answer that directly.
Generally you should avoid comparing floating-point values for exact equality. Instead, subtract the result from the expected value and test that the discrepancy is within the supported precision of the FP type used. For example:

#include <math.h>  /* for fabs() */

#define EXPECTED_RESULT  40965.8966304650
#define RESULT_PRECISION 00000.0000000001

double actual_result = test();
bool error = fabs( actual_result - EXPECTED_RESULT ) > RESULT_PRECISION;
First of all, 40965.8966304650828827e-01 cannot be a result of the cos() function: for real-valued arguments, cos(x) always returns a value in the interval [-1.0, 1.0], so the value shown cannot be its output.
Second, you have probably read somewhere that double values have a precision of roughly 17 significant digits, yet you are trying to show 21 digits. You cannot get correct data past ...508, because you are asking for digits beyond that 17-digit limit.
The reason you get different results on different computers is that whatever is printed after the precise digits is essentially implementation-dependent noise, so it is normal to see different values (you could even get different values on different runs on the same machine with the same program).
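As a small illustration of that limit (my own snippet, not from the original post): printing a double with more significant digits than it can hold just exposes the nearest representable binary value.

#include <cstdio>

int main()
{
    // The literal has fewer decimal digits than were printed in the question;
    // anything past roughly the 17th significant digit printed below is
    // representation noise, not data.
    double x = 4.0965896630465082;
    std::printf("%.21g\n", x);
    return 0;
}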

How can I safely unit test a function without testing every possible 64-bit input value?

I have a function which checks the parity of a 64-bit word. Sadly the input value really could be anything so I cannot bias my test to cover a known sub-set of values, and I clearly cannot test every single possible 64-bit value...
I considered using random numbers so that each time the test was run, the function gained more coverage however unit tests should be consistent.
Ignoring my specific application, is there a sensible way to ensure a reasonable level of coverage, which is highly likely to expose errors introduced in the future, whilst not taking the best part of a billion years to run?
The following argumentation assumes that you have written / have access to the source code and do white box testing.
Depending on the level of confidence you need, you might consider proving the algorithm correct, possibly using automated provers. But, under the assumption that your code is not part of an application which demands this level of confidence, you probably can gain sufficient confidence with a comparably small set of unit-tests.
Let's assume that your algorithm somehow loops over the 64 bits (or is intended to do so, because you still need to test it). This means that the 64 bits are handled in a very regular way. Now, there could be a bug in your code such that, in the body of the loop, a value of 0 is always used by mistake instead of the respective bit of the 64-bit input. This bug would mean that you always get a parity of 0 as the result. This particular bug can be found by any input value with an expected parity of 1.
From this example we can conclude that for every bug that could realistically occur, you need one corresponding test case that can find it. Therefore, if you look at your algorithm and think about which bugs might be present, you may come up with, say, x bugs. Then you will need no more than x test cases to find these bugs. (Some of your test cases will likely find more than one of the bugs.)
This kind of consideration has led to a number of strategies for deriving test cases, such as equivalence partitioning and boundary testing. With boundary testing, for example, you would put special focus on bits 0 and 63, which sit at the boundaries of the loop's indices. This way you catch many of the classic off-by-one errors.
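As a sketch of what such a small, boundary-focused test set could look like (parity64 here is a hypothetical function under test that returns 1 for an odd number of set bits):

#include <cassert>
#include <cstdint>

uint64_t parity64(uint64_t x);  // hypothetical function under test

void test_parity64()
{
    assert(parity64(0x0000000000000000ULL) == 0);  // no bits set
    assert(parity64(0x0000000000000001ULL) == 1);  // boundary: bit 0 only
    assert(parity64(0x8000000000000000ULL) == 1);  // boundary: bit 63 only
    assert(parity64(0x8000000000000001ULL) == 0);  // both boundary bits
    assert(parity64(0xFFFFFFFFFFFFFFFFULL) == 0);  // all 64 bits set
    assert(parity64(0xAAAAAAAAAAAAAAAAULL) == 0);  // alternating pattern (32 bits set)
    assert(parity64(0x0000000100000000ULL) == 1);  // a single bit in the upper half
}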
Now, what about the situation where the algorithm changes in the future (since you asked about errors introduced in the future)? Instead of looping over the 64 bits, the parity can be calculated by xor-ing in various ways. For example, to improve speed you might first xor the upper 32 bits with the lower 32 bits, then take the result and xor its upper 16 bits with its lower 16 bits, and so on.
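A sketch of that folding idea (my own illustration of the approach described above):

#include <cstdint>

// Each xor folds the upper half onto the lower half; the parity of the
// remaining bits stays equal to the parity of the original word, and after
// six folds bit 0 holds the result.
uint64_t parity64_folded(uint64_t x)
{
    x ^= x >> 32;
    x ^= x >> 16;
    x ^= x >> 8;
    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    return x & 1;
}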
This alternative algorithm will have a different set of possible bugs. To be future-proof with your test cases, you may also have to consider such alternative algorithms and their corresponding bugs. Most likely, however, the test cases for the first algorithm will find a large portion of those bugs as well, so the number of tests will probably not increase too much. The analysis, however, becomes more complex.
In practice, I would focus on the currently chosen algorithm and rather take the approach to re-design the test suite in the case that the algorithm is changed fundamentally.
Sorry if this answer is too generic. But, as should have become clear, a more concrete answer would require more details about the algorithm that you have chosen.

Reducing size of switch statement in emulator?

I started writing a DCPU-16 emulator using this v1.7 spec. I started laying down the architecture, and I don't like the fact that I'm using very long switch statements. This is my first time writing an emulator, so I don't know if there's a better way to do it. The switches aren't that large yet, due to the DCPU's small number of opcodes (and the fact that I haven't actually implemented the instructions yet), but I can imagine that if I were writing an emulator for a larger instruction set, the switch statements would be huge.
Anywhom, here's my code.
EDIT: I forgot to get my question across:
Is there a better way to design an emulator than using a massive switch?
This approach seems reasonable to me. It is certainly how I would do it (I have written a few CPU emulators and similar types of code).
The nearest alternative is a set of function pointers, but some of your cases will probably be rather simple (e.g. cpu_regs.flags &= ~CARRY; or if (cpu_regs.flags & CARRY) do_rel_jump(next_byte());), so using function pointers will slow you down.
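For comparison, a minimal sketch of what such a function-pointer table might look like (the names cpu_t, op_set and so on are made up for illustration, not taken from the linked code):

#include <cstdint>

struct cpu_t { uint16_t regs[8]; uint16_t pc; };  // hypothetical CPU state

typedef void (*op_handler)(cpu_t& cpu, uint16_t a, uint16_t b);

void op_set(cpu_t& cpu, uint16_t a, uint16_t b) { cpu.regs[a & 7] = b; }
void op_add(cpu_t& cpu, uint16_t a, uint16_t b) { cpu.regs[a & 7] += b; }

// One slot per basic opcode; only a few are filled in this sketch.
op_handler dispatch[0x20] = { nullptr, op_set, op_add /* ... */ };

void execute(cpu_t& cpu, uint16_t opcode, uint16_t a, uint16_t b)
{
    op_handler h = dispatch[opcode & 0x1F];
    if (h) h(cpu, a, b);  // empty slots act as "No Operation Specified yet"
}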
You can bunch all the "No Operation Specified yet" cases into one place; that will make the switch a lot shorter in lines, but the number of cases will of course still be the same [unless you put them in default:].

How to test scientific software?

I'm convinced that software testing is indeed very important, especially in science. However, over the last 6 years I have never come across a scientific software project that was under regular testing (and most of them were not even version controlled).
Now I'm wondering how you deal with software tests for scientific codes (numerical computations).
From my point of view, standard unit tests often miss the point, since there is no exact result, so using assert(a == b) might prove a bit difficult due to "normal" numerical errors.
So I'm looking forward to reading your thoughts about this.
I am also in academia, and I have written quantum mechanical simulation programs to be executed on our cluster. I made the same observation regarding testing and version control. It was even worse: in my case I am using a C++ library for my simulations, and the code I got from others was pure spaghetti code - no inheritance, not even functions.
I rewrote it and also implemented some unit testing. You are correct that you have to deal with numerical precision, which can differ depending on the architecture you are running on. Nevertheless, unit testing is possible, as long as you take these numerical rounding errors into account. Your result should not depend on the rounding of the numerical values; otherwise you would have a different problem with the robustness of your algorithm.
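A sketch of the sort of comparison that amounts to (the helper name and tolerance values are mine, not from any particular framework):

#include <algorithm>
#include <cassert>
#include <cmath>

// Mixed absolute/relative tolerance: the absolute part handles values near
// zero, the relative part handles large magnitudes.
bool almost_equal(double a, double b, double abs_tol = 1e-12, double rel_tol = 1e-9)
{
    const double diff = std::fabs(a - b);
    return diff <= abs_tol || diff <= rel_tol * std::max(std::fabs(a), std::fabs(b));
}

int main()
{
    const double computed = 0.1 + 0.2;     // accumulates rounding error
    assert(!(computed == 0.3));            // exact comparison fails
    assert(almost_equal(computed, 0.3));   // tolerance-based comparison passes
    return 0;
}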
So, to conclude, I use unit testing for my scientific programs, and it really makes one more confident about the results, especially with regards to publishing the data in the end.
I've just been looking at a similar issue (google: "testing scientific software") and came up with a few papers that may be of interest. These cover both mundane coding errors and the bigger issue of knowing whether the result is even right (depth of the Earth's mantle?).
http://http.icsi.berkeley.edu/ftp/pub/speech/papers/wikipapers/cox_harris_testing_numerical_software.pdf
http://www.cs.ua.edu/~SECSE09/Presentations/09_Hook.pdf (broken link; new link is http://www.se4science.org/workshops/secse09/Presentations/09_Hook.pdf)
http://www.associationforsoftwaretesting.org/?dl_name=DianeKellyRebeccaSanders_TheChallengeOfTestingScientificSoftware_paper.pdf
I thought the idea of mutation testing described in 09_Hook.pdf (see also matmute.sourceforge.net) is particularly interesting as it mimics the simple mistakes we all make. The hardest part is to learn to use statistical analysis for confidence levels, rather than single pass code reviews (man or machine).
The problem is not new. I'm sure I have an original copy of "How accurate is scientific software?" by Hatton et al., Oct 1994, which even then showed how different implementations of the same theories (as algorithms) diverged rather rapidly. (It's also ref 8 in the Kelly & Sanders paper.)
Update (Oct 2019): more recently, see "Testing Scientific Software: A Systematic Literature Review".
I'm also using cpptest for its TEST_ASSERT_DELTA. I'm writing high-performance numerical programs in computational electromagnetics and I've been happily using it in my C++ programs.
I typically go about testing scientific code the same way as I do with any other kind of code, with only a few retouches, namely:
I always test my numerical code on cases that make no physical sense and make sure the computation actually stops before producing a result. I learned this the hard way: I had a function that computed some frequency responses, then supplied a matrix built from them to another function as an argument, which eventually gave its answer as a single vector. The matrix could have been any size depending on how many terminals the signal was applied to, but my function was not checking whether the matrix size was consistent with the number of terminals (2 terminals should have meant a 2 x 2 x n matrix); the code itself was wrapped so as not to depend on that, and it didn't care what size the matrices were, since it just had to do some basic matrix operations on them. Eventually, the results were perfectly plausible, well within the expected range and, in fact, partially correct - only half of the solution vector was garbled. It took me a while to figure it out. If your data looks correct, is assembled in a valid data structure and has sane numerical values (e.g. no NaNs or negative numbers of particles) but doesn't make physical sense, the function has to fail gracefully (see the dimension-check sketch after this list).
I always test the I/O routines even if they are just reading a bunch of comma-separated numbers from a test file. When you're writing code that does twisted math, it's always tempting to jump into debugging the part of the code that is so math-heavy that you need a caffeine jolt just to understand the symbols. Days later, you realize you are also adding the ASCII value of \n to your list of points.
When testing for a mathematical relation, I always test it "by the book", and I also learned this by example: I've seen code that was supposed to compare two vectors but only checked the elements for equality and never checked that the lengths were equal (see the length-checking sketch after this list).
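The dimension-check sketch mentioned in the first item might look like this (the names and the n x n x n_freq layout are made up to mirror the story, not code from that project):

#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical frequency-response data: response[i][j][k] for terminal pair
// (i, j) at frequency sample k.
using ResponseMatrix = std::vector<std::vector<std::vector<double>>>;

void check_response_dimensions(const ResponseMatrix& response, std::size_t n_terminals)
{
    // For n terminals we expect an n x n x n_freq block; fail loudly instead
    // of silently producing a half-garbled solution vector.
    if (response.size() != n_terminals)
        throw std::invalid_argument("response matrix rows do not match terminal count");
    for (const auto& row : response)
        if (row.size() != n_terminals)
            throw std::invalid_argument("response matrix columns do not match terminal count");
}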
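And the length-checking comparison from the last item, as a minimal sketch:

#include <cmath>
#include <cstddef>
#include <vector>

// Two vectors only match if they have the same length; comparing elements
// alone can "pass" even when one vector is missing entries.
bool vectors_match(const std::vector<double>& a, const std::vector<double>& b, double tol)
{
    if (a.size() != b.size())
        return false;
    for (std::size_t i = 0; i < a.size(); ++i)
        if (std::fabs(a[i] - b[i]) > tol)
            return false;
    return true;
}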
Please take a look at the answers to the SO question How to use TDD correctly to implement a numerical method?

Producing the fastest possible executable

I have a very large program which I have been compiling under visual studio (v6 then migrated to 2008). I need the executable to run as fast as possible. The program spends most of its time processing integers of various sizes and does very little IO.
Obviously I will select maximum optimization, but it seems that there are a variety of things that can be done which don't come under the heading of optimization which do still affect the speed of the executable. For example selecting the __fastcall calling convention or setting structure member alignment to a large number.
So my question is: are there other compiler/linker options I should be using to make the program faster that are not controlled from the "Optimization" page of the Properties dialog?
EDIT: I already make extensive use of profilers.
Another optimization option to consider is optimizing for size. Sometimes size-optimized code can run faster than speed-optimized code due to better cache locality.
Also, beyond the optimization options, run the code under a profiler and see where the bottlenecks are. Time spent with a good profiler can reap major dividends in performance (especially if it gives feedback on the cache-friendliness of your code).
And ultimately, you'll probably never know what "as fast as possible" is. You'll eventually need to settle for "this is fast enough for our purposes".
Profile-guided optimization can result in a large speedup. My application runs about 30% faster with a PGO build than a normal optimized build. Basically, you run your application once and let Visual Studio profile it, and then it is built again with optimization based on the data collected.
1) Reduce aliasing by using __restrict (see the sketch after this list).
2) Help the compiler in common subexpression elimination / dead code elimination by using __pure.
3) An introduction to SSE/SIMD can be found here and here. The internet isn't exactly overflowing with articles about the topic, but there's enough. For a reference list of intrinsics, you can search MSDN for 'compiler intrinsics'.
4) For 'macro parallelization', you can try OpenMP. It's a compiler standard for easy task parallelization -- essentially, you tell the compiler using a handful of #pragmas that certain sections of the code are reentrant, and the compiler creates the threads for you automagically.
5) I second interjay's point that PGO can be pretty helpful. And unlike #3 and #4, it's almost effortless to add in.
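A minimal sketch combining points 1 and 4 (the function and array names are illustrative only; build with /openmp on MSVC or -fopenmp on GCC/Clang):

// __restrict promises the compiler that dst and src never alias, which helps
// common-subexpression elimination and vectorization; the OpenMP pragma
// splits the loop iterations across the available cores.
void scale_add(int* __restrict dst, const int* __restrict src, int n, int k)
{
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        dst[i] += k * src[i];
}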
You're asking which compiler options can help you speed up your program, but here are some general optimisation tips:
1) Ensure your algorithms are appropriate for the job. No amount of fiddling with compiler options will help you if you write an O(shit squared) algorithm.
2) There are no hard and fast rules for compiler options. Sometimes optimise for speed, sometimes optimise for size, and make sure you time the differences!
3) Understand the platform you are working on. Understand how the caches for that CPU operate, and write code that specifically takes advantage of the hardware. Make sure you're not following pointers everywhere to get access to data which will thrash the cache. Understand the SIMD operations available to you and use the intrinsics rather than writing assembly. Only write assembly if the compiler is definitely not generating the right code (i.e. writing to uncached memory in bad ways). Make sure you use __restrict on pointers that will not alias. Some platforms prefer you to pass vector variables by value rather than by reference as they can sit in registers - I could go on with this but this should be enough to point you in the right direction!
Hope this helps,
-Tom
Forget micro-optimization such as what you are describing. Run your application through a profiler (there is one included in Visual Studio, at least in some editions). The profiler will tell you where your application is spending its time.
Micro-optimization will rarely give you more than a few percentage points increase in performance. To get a really big boost, you need to identify areas in your code where inefficient algorithms and/or data structures are being used. Focus on those, for example by changing algorithms. The profiler will help identify these problem areas.
Check which floating-point (/fp) mode you are using. Each one generates quite different code, and you need to choose based on the accuracy your app requires. Our code needs precision (geometry, graphics code), but we still use /fp:fast (C/C++ -> Code Generation options).
Also make sure you have /arch:SSE2, assuming your deployment only covers processors that support SSE2. This makes quite a big difference in performance, since the generated code needs far fewer cycles. Details are nicely covered in the blog SomeAssemblyRequired.
Since you are already profiling, I would suggest loop unrolling if it is not happening already. I have seen VS2008 fail to do it fairly often (with templates, references, etc.).
Use __forceinline in hotspots if applicable.
Change the hotspots of your code to use SSE2 etc., as your app seems to be compute-intensive.
In most cases you should address your algorithm and optimise that first, before relying on compiler optimisations for significant improvements.
Also, you can throw hardware at the problem. Your PC may already have the necessary hardware lying around mostly unused: the GPU! One way of improving the performance of some types of computationally expensive processing is to execute it on the GPU. This is hardware-specific, but NVIDIA provides an API for exactly that: CUDA. Using the GPU is likely to get you a far greater improvement than the CPU alone.
I agree with what everyone has said about profiling. However, you mention "integers of various sizes". If you are doing a lot of arithmetic with mismatched integer sizes, a lot of time can be wasted on size conversions - shorts to ints, for example - when the expressions are evaluated.
I'll throw in one more thing too. Probably the most significant optimisation is in choosing and implementing the best algorithm.
You have three ways to speed up your application:
Better algorithm - you've not specified the algorithm or the data types (is there an upper limit to integer size?) or what output you want.
Macro parallelisation - split the task into chunks and give each chunk to a separate CPU core; on a two-core CPU, for example, divide the integer set into two halves and give one half to each core (see the std::thread sketch after this list). This depends on the algorithm you're using - not all algorithms can be processed like this.
Micro parallelisation - this is like the above but uses SIMD. You can combine this with point 2 as well.
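A sketch of that kind of split using std::thread (C++11, so newer than the Visual Studio versions mentioned in the question, but it shows the idea; process_chunk stands in for whatever the real integer processing is):

#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Stand-in for the real per-element computation.
void process_chunk(std::vector<int>& data, std::size_t begin, std::size_t end)
{
    for (std::size_t i = begin; i < end; ++i)
        data[i] *= 2;
}

void process_on_two_cores(std::vector<int>& data)
{
    const std::size_t mid = data.size() / 2;
    // Each thread owns its own half of the data, so no locking is needed.
    std::thread lower(process_chunk, std::ref(data), std::size_t{0}, mid);
    std::thread upper(process_chunk, std::ref(data), mid, data.size());
    lower.join();
    upper.join();
}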
You say the program is very large. That tells me it probably has many classes in a hierarchy.
My experience with that kind of program is that, while you are probably assuming the basic structure is just about right and that to get better speed you need to worry about low-level optimization, chances are very good that there are large opportunities for optimization that are not of the low-level kind.
Unless the program has already been tuned aggressively, there may be room for massive speedup in the form of mid-stack operations that can be done differently. These are usually very innocent-looking and would never grab your attention. They are not cases of "improve the algorithm". They are usually cases of "good design" that just happen to be on the critical path.
Unfortunately, you cannot rely on profilers to find these things, because they are not designed to look for them.
This is an example of what I'm talking about.