Is it possible to engage multiple cores (like gcc -j8) when solving with Pyomo?

The power flow library PyPSA uses Pyomo. I am trying to reduce the computational cost of each linear optimal power flow simulation.
I read through the Pyomo docs. Nothing sticks out at me yet. Perhaps it is not possible to split up the processing when solving linear optimisation problems.
Ubuntu 19.04, Intel i5-4210U @ 1.70 GHz, 8 GB RAM

When you talk about processing, there are two things to consider: the processing needed to write the .lp file, and the processing needed to solve the problem with an optimization solver.
First, writing the .lp file is, to my knowledge, not yet parallelized in Pyomo. The PyPSA developers created Linopy to parallelize that step, reducing RAM requirements and increasing speed.
Second, parallelizing the solve itself depends on the solver. PyPSA-Eur has an example of such an integration for Gurobi and CPLEX, and the performant open-source solver HiGHS supports something similar as well (see here).
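For example, here is a minimal sketch of passing a thread count through to the solver from Pyomo (assuming Gurobi is installed and licensed; the tiny LP only stands in for the model PyPSA builds, and the parameter name is solver-specific):

```python
import pyomo.environ as pyo

# Tiny placeholder LP; in PyPSA this model is built for you by the LOPF call.
model = pyo.ConcreteModel()
model.x = pyo.Var(range(3), domain=pyo.NonNegativeReals)
model.obj = pyo.Objective(expr=sum(model.x[i] for i in range(3)))
model.con = pyo.Constraint(expr=sum((i + 1) * model.x[i] for i in range(3)) >= 10)

# The parallelism happens inside the solver, not in Pyomo itself.
opt = pyo.SolverFactory("gurobi")
opt.options["Threads"] = 8   # Gurobi's parameter name
# CPLEX calls the equivalent parameter "threads";
# HiGHS uses "threads" together with "parallel" = "on".
results = opt.solve(model, tee=True)
```

Note that for an LP most of the benefit comes from the barrier algorithm running in parallel; simplex gains little from extra threads.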

Related

Drawbacks of avoiding crossover after barrier solve in linear program

I am running a large LP (approximately 5M non-zeros) and I want to speed up the solving process. I tried a concurrent solve to test which algorithm solves my problem the quickest and I found that the barrier method is the clear winner (solver = Xpress MP, but I guess that it would be the same for other solvers).
However, I want to further speed it up. I noticed that the real barrier solve takes less than 1% of the total solution time. The remainder of the time is spent in the crossover (~40%) and the primal solve (to find the corner solution in the new basis) (~60%). Unfortunately, I couldn't find a setting to tell the solver to do the dual crossover (there is one in Cplex, but I don't have a license for Cplex), so I couldn't compare whether that would be quicker.
I therefore tried turning off the crossover, which yields a huge speed increase, but according to the documentation there are some disadvantages. So far, the drawbacks that I know of are:
Barrier solutions tend to be midface solutions.
Barrier without crossover does not produce a basic solution (although the solver settings mention that "The full primal and dual solution is available whether or not crossover is used.").
Without a basis, you will not be able to optimize the same or similar problems repeatedly using advanced start information.
Without a basis, you will not be able to obtain range information for performing sensitivity analysis.
My question(s) is (are) simple: are there other drawbacks important enough to justify the very inefficient crossover step (both Cplex and Xpress MP enable crossover by default)? Alternatively, is my problem exceptional, and is the crossover step very quick in other problems? Finally, what is wrong with having midface solutions (this would mean that a corner optimum is not unique either)?
Sources:
http://www.cs.cornell.edu/w8/iisi/ilog/cplex101/usrcplex/solveBarrier2.html (Barrier algorithm high-level theory)
http://tomopt.com/docs/xpress/tomlab_xpress008.php (Xpress MP solver settings)
https://d-nb.info/1053702175/34 (p87)
Main disadvantage: the solution will be "ugly", that is, many values like 0.000001 and 0.9999999. Secondly, you may get somewhat different duals. Finally, a basis is required for fast "hot starts". One possible way to speed up large models is to use a simplex method with an advanced basis taken from a representative base run.
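To make the switch concrete, here is a hedged sketch of running barrier without crossover through Pyomo. The question above uses Xpress MP; I am showing Gurobi's parameter names as a stand-in, and the tiny LP is only a placeholder so the snippet runs on its own. CPLEX and Xpress expose the same idea under their own parameter names.

```python
import pyomo.environ as pyo

# Tiny placeholder LP; replace with your real model.
model = pyo.ConcreteModel()
model.x = pyo.Var(domain=pyo.NonNegativeReals)
model.y = pyo.Var(domain=pyo.NonNegativeReals)
model.obj = pyo.Objective(expr=2 * model.x + 3 * model.y, sense=pyo.minimize)
model.con = pyo.Constraint(expr=model.x + model.y >= 4)

opt = pyo.SolverFactory("gurobi")
opt.options["Method"] = 2      # 2 = barrier (interior point)
opt.options["Crossover"] = 0   # 0 = keep the barrier solution, skip crossover
results = opt.solve(model, tee=True)

# Expect a non-basic, possibly "midface" solution: values like 0.9999999
# instead of 1, no basis for warm starts, and no range information.
```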

Working with many fixed-size matrices in CUDA kernels

I am looking to work with about 4000 fixed-size (3x3, 4x4) matrices, doing things such as matrix inversion and eigendecomposition.
It seems to me the best way to parallelize this would be to let each of the many GPU threads work on a single instance of the problem.
Is there a reasonable way to do this? I have read: http://www.culatools.com/blog/2011/12/09/batched-operations/ but as far as I can tell, it's always something that is "being worked on" with no solution in sight. Three years later, I hope there is a good solution.
So far, I have looked at:
Using Eigen in CUDA kernels: http://eigen.tuxfamily.org/dox-devel/TopicCUDA.html. But this is in its infancy: it doesn't seem to work well and some things are not implemented. Moreover, I am not sure if it is optimized for CUDA at all. There is almost no documentation and the only example code is a test file (eigen/test/cuda_basic.cu). When I tried using Eigen in CUDA kernels, simple things like declaring an Eigen::MatrixXf in a kernel did not survive compilation with nvcc V7.0.27 and Eigen 3.2.90 (mercurial).
Using the cuBLAS device API library to run BLAS routines within a kernel. It seems cuBLAS and its ilk are written to be parallelized even for small matrices, which seems overkill and likely slow for the 3x3 and 4x4 matrices I am interested in. Also, I'm not sure if there is anything like cuBLAS that can also do eigendecomposition or SVD. (As far as I know, CULA does not support calling its routines from within kernels).
Batch processing kernels using CUDA streams. In Section 2.1.7 "Batching Kernels" of the cuBLAS documentation for the CUDA Toolkit v7.0, this is suggested. But "in practice it is not possible to have more than 16 concurrent kernels executing at the same time", and consequently it would be terrible for processing 4000 small matrices. From the aforementioned CULA blog post, I quote: "One could, in theory, use a CUDA stream per problem and launch one problem at a time. This would be ill-performing for two reasons. First is that the number of threads per block would be far too low; [...] Second is that the overhead incurred by launching thousands of operations in this manner would be unacceptable, because the launch code is as expensive (if not more expensive) as just performing the math on the CPU."
Implementing my own matrix multiplication and eigendecomposition in kernels. This is likely to be very slow, and may in addition be time consuming to implement.
At this point I am tempted to give up on doing this on the GPU at all. It is a pity, since I was hoping for real time performance for an algorithm that requires inverting 4000 3x3 matrices about 100 times every 0.1 seconds.
The cuBLAS functions getrfBatched and getriBatched are designed for batch inversion of small matrices. This should be quicker than either dynamic parallelism or streams (your 2nd and 3rd approaches). Also, a batch solver is available in source code form that can do matrix inversions; you will need to log in as a registered developer at developer.nvidia.com to access this link.
Also, I'm not sure if there is anything like cuBLAS that can also do eigendecomposition or SVD. (As far as I know, CULA does not support calling its routines from within kernels).
cuSOLVER provides some eigensolver functions. However, they are neither batched nor callable from device code, so beyond that you are left with streams as the only option.
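As a Python-level illustration of the batched idea (my own substitution, not the cuBLAS/batch-solver C API the answer above refers to), PyTorch's batched linear algebra will invert and eigendecompose thousands of tiny matrices in a single call each, relying internally on batched GPU routines of the kind described here:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# 4000 independent 3x3 matrices handled as one batch; made symmetric
# positive definite so both inversion and eigh are well defined.
A = torch.randn(4000, 3, 3, device=device)
A = A @ A.transpose(-1, -2) + 3.0 * torch.eye(3, device=device)

inverses = torch.linalg.inv(A)           # batched inversion, one call
eigvals, eigvecs = torch.linalg.eigh(A)  # batched symmetric eigendecomposition

print(inverses.shape)  # torch.Size([4000, 3, 3])
print(eigvals.shape)   # torch.Size([4000, 3])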

Want to improve the computing speed of matrix calculation, OpenMP or CUDA?

My program has a bunch of matrix multiplications and inversions, which are time consuming.
My computer: CPU: Intel i7; GPU: 512 MB NVIDIA Quadro NVS 3100M
Which one is better for improving computing speed? OpenMP or CUDA?
(P.S. I think that, generally, a GPU has more cores than a CPU, so CUDA could give a bigger improvement than OpenMP?)
From my experience (I worked with both for a school project): in most conditions the calculation time for a medium-sized array, say less than 2000 x 2000, is almost the same. The actual calculation time depends on the workload of your machine (usually when you work with OpenMP you share a cluster with other people, so make sure you run your application alone to get a better result).
But if you are good at CUDA, the GPU is very powerful for this kind of calculation; when I was working on my CUDA project there were lots of good materials on the official website. OpenMP is just a library, and if you are good at C or C++ it should not be a problem to use (although OpenMP compilers can be buggy, so don't trust them blindly and log everything).
Assuming you have some experience with CUDA, it is not hard to find good examples. But CUDA is hard to debug, so I recommend trying OpenMP first; it should be easier.
I'd guess it depends on what your application is and how you go about trying to implement improvements. Keep in mind that every optimization has tradeoffs. For instance, GPUs typically favor lower-precision floating point, and there are compiler options that let you bypass some aspects of the IEEE standard, which brings you some extra speed at the expense of precision, etc.

openmp vs opencl for computer vision

I am creating a computer vision application that detects objects via a web camera. I am currently focusing on the performance of the application.
My problem is in the part of the application that generates the XML cascade file using Haar training. This is very slow and takes about 6 days. To get around this problem, I decided to use multiprocessing to minimize the total time needed to generate the Haar training XML file.
I found two options: OpenCL, and OpenMP together with OpenMPI.
Now I'm confused about which one to use. I read that OpenCL can use multiple CPUs and the GPU, but only on the same machine. Is that so? On the other hand, OpenMP is for multiprocessing, and with OpenMPI we can use multiple CPUs over the network, but OpenMP has no GPU support.
Can you please suggest the pros and cons of using either of these libraries?
OpenCL is for using the GPU stream processors. http://en.wikipedia.org/wiki/Opencl
OpenMP is for using the CPU cores. http://en.wikipedia.org/wiki/Openmp
OpenMPI is for using a distributed network cluster. http://en.wikipedia.org/wiki/Openmpi
Which is best to use depends on your problem specification, but I would try OpenMP first because it is the easiest way to port a single-threaded program. Sometimes you can just add a pragma telling it to parallelize a main loop, and you can get speedups on the order of the number of CPU cores.
If your problem is very data-parallel and floating-point heavy, then you can get better performance out of the GPU, but you have to write a kernel in a C-like language and map or read/write memory buffers between the host and the GPU. It's a hassle, but the performance gains in some cases can be on the order of 100x, as GPUs are specifically designed for data-parallel work.
OpenMPI will get you the most performance, but you need a cluster (a bunch of servers on the same network), and those are expensive.
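For what it's worth, the "just put a pragma on the hot loop" idea has a close Python analogue in Numba's prange; Numba is my substitution purely for illustration, since the answer above is talking about OpenMP in C/C++:

```python
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def frame_sums(frames):
    # prange plays the role of "#pragma omp parallel for": iterations of
    # this loop are spread across the available CPU cores.
    n = frames.shape[0]
    out = np.empty(n)
    for i in prange(n):
        out[i] = frames[i].sum()
    return out

frames = np.random.rand(1000, 480, 640)
print(frame_sums(frames)[:5])
```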
Could the performance problem be in the XML file itself?
Have you tried to use a different, lighter file format?
I think that an XML file that takes 6 days to be generated must be quite long and complex. If you have control on this data format, try Google's Protocol Buffers.
Before digging into OpenMP, OpenCL or whatever, check how much time is spent accessing the hard disk; if that is the issue, the parallel libraries won't improve things.
Research OpenCV and see if it might help.

How to improve computational time by parallelizing

I have written C++ code (using the STL), and due to heavy computations it takes about one hour to produce its output. I looked into parallelizing on the GPU and on the CPU. I have an ATI graphics card and a Core i7 processor. Which one should I parallelize on for better results?
Also, can you please suggest reading material on how to set up my compiler for parallelizing on either of these platforms, and how do I start parallelizing?
For general libraries regarding multi-core/GPU programming:
Thrust for GPU/CPU STL-like interface programming
OpenMP for multi-threaded parallel code
TBB Intel Threading Building Blocks, lots of primitive data structures for parallel programming
In general, this area is absolutely vast, and no single answer can do justice to the topic. There are many ways to approach parallelization, and it begins with analysing your logic, looking for parts that can be efficiently computed in parallel, and designing (or redesigning) your algorithms around those results.
You could also consider recoding your numerical kernels using OpenCL (and its ATI Stream implementation for your graphics card).
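If you go the OpenCL route, a minimal host-plus-kernel sketch looks like the following. It is shown through PyOpenCL purely for brevity (pyopencl and a working OpenCL driver are assumed), and the element-wise kernel is just the smallest possible stand-in for a real numerical kernel:

```python
import numpy as np
import pyopencl as cl

a = np.random.rand(1_000_000).astype(np.float32)
b = np.random.rand(1_000_000).astype(np.float32)

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags

# Copy the inputs to device buffers and allocate an output buffer.
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

# The kernel itself is plain OpenCL C, compiled at runtime.
program = cl.Program(ctx, """
__kernel void add(__global const float *a,
                  __global const float *b,
                  __global float *out)
{
    int gid = get_global_id(0);
    out[gid] = a[gid] + b[gid];
}
""").build()

program.add(queue, a.shape, None, a_buf, b_buf, out_buf)

out = np.empty_like(a)
cl.enqueue_copy(queue, out, out_buf)
print(np.allclose(out, a + b))
```

The same structure (create context, move buffers, compile and launch a kernel, copy results back) carries over directly to the C/C++ OpenCL API.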