High memory demand of gurobi_direct - pyomo

I have some doubts regarding the performance of pyomo Gurobi solvers.
I want to solve a pretty large model of ~4 million continuous variables plus ~8 million integer variables with Gurobi. I am working on a machine with 32 GB of RAM and 32 GB allocated for SWAP.
Initially, I intended to use the 'gurobi_direct' interface. It is my understanding that directly building gurobipy objects preserves the numeric precision declared during the model construction better than writing the LP. However, by doing this I was unable to solve this model. The memory demand was too high for my machine. For quantification, I solved a simplification of the model by reducing the data input to ~1/12, which led to 348243 continuous + 1035781 integer variables. Filprofiler showed a maximum memory peak demand of ~7000 MiB, where ~70% corresponds to the method SolverFactory.solve(), with ~12% and ~30% of those 7000 MiB due to the constraint and variable declaration inside the direct solver.
However, when I switch to GUROBISHELL this maximum memory peak falls to 3700 MiB. Now, only 40% of that peak is due to SolverFactory.solve(). By using GUROBISHELL I was able to solve the full model without the memory crash. The resulting LP had a size of ~1.5GB.
I would like to better understand why writing the LP is more efficient memory-wise than directly declaring the gurobipy objects. Is there any way to improve pyomo's performance for GUROBIDIRECT by the user?
I've been working on other sources of improvement for this problem, such as memory increase and model simplification. However, at this point, before further considering these, I would like to better understand the reasons for this high memory demand and the reasons behind the difference between these two solvers.

Related

Is this behavior showing that I have a memory problem?

I have an LP problem with ~4 million variables and ~4 million constraints. I use Gurobi to solve it. My PC has 4 cores and 8 GB of memory.
According to the log file, it takes ~100 seconds to find the optimal solution. Then the CPU is released, but almost all of the memory is still in use. It hangs there, doing nothing for hours, until the script continues (e.g. with the print command) after the solve.
results = opt.solve(model, tee=True)
print("model solved")
I used the barrier method with crossover disabled, which worked best. I also tried different thread counts; using 4 turned out to be best in terms of the hanging time (but it still takes hours).
This hanging significantly increases the total run time, which is not desired.
I plan to upgrade the memory, but want confirmation from the community that it is indeed a memory issue. Is this a memory problem?
Likely the problem does not fit in memory and virtual memory (i.e. disk) is used. This is called thrashing when it is really bad. It can bring your machine to its knees. Depending on the number of nonzeros in the problem, the presolve statistics and the number of threads you are using, you need at least 16 GB (and may be more like 32 GB).
Also: try to reduce the number of threads Gurobi is using. It may be better to use 1 thread (after benchmarking which LP algorithm works best: primal simplex, dual simplex, or a barrier method). By default a concurrent LP method is used, which runs different LP algorithms in parallel and significantly increases the memory footprint.
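The advice above could be applied through Pyomo's solver options. This is a minimal sketch, assuming the Gurobi parameter names Threads, Method (2 selects barrier) and Crossover (0 disables it). The helper names low_memory_lp_options and apply_options are hypothetical, and the right Method value should come from the benchmarking described above.

```python
def low_memory_lp_options(threads=1, method=2, crossover=0):
    """Return Gurobi parameters that avoid the concurrent LP default.

    Method: 0 = primal simplex, 1 = dual simplex, 2 = barrier.
    Crossover=0 skips crossover after barrier.
    """
    return {"Threads": threads, "Method": method, "Crossover": crossover}


def apply_options(opt, options):
    """Copy the parameters onto a solver object that has an .options mapping."""
    for name, value in options.items():
        opt.options[name] = value
    return opt


# Usage with Pyomo (not executed here; needs pyomo and a Gurobi license):
#   from pyomo.environ import SolverFactory
#   opt = apply_options(SolverFactory("gurobi"), low_memory_lp_options())
#   results = opt.solve(model, tee=True)
```

Pinning a single algorithm and a single thread trades some solve speed for a smaller footprint, which matters most when the machine is already close to swapping.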

At what code complexity does an OpenACC kernel lose efficiency on common GPU?

At about what code complexity do OpenACC kernels lose efficiency on a common GPU, so that registers, shared memory operations, or some other aspect start to bottleneck performance?
Also is there some point where too few tasks and overhead of transferring to GPU and cores would become a bottleneck?
Would cache sizes and if code fits indicate optimal task per kernel or something else?
About how big is the OpenACC overhead per kernel compared to potential performance and does it vary a lot with various directives?
I would refrain from using the complexity of the code as an indication of performance. You can have a highly complex code run very efficiently on a GPU and a simple code run poorly. Instead, I would look at the following factors:
Data movement between the device and host. Limit the frequency of data movement and try to transfer data in contiguous chunks. Use OpenACC unstructured data regions to match the host allocation on the device (i.e. use "enter data" at the same time as you allocate data via "new" or "malloc"). Move as much compute to the GPU as you can and only use the OpenACC update directive to synchronize host and device data when absolutely necessary. In cases where data movement is unavoidable, investigate using the "async" clause to interleave the data movement with compute.
Data access on the device and limiting memory divergence. Be sure to lay out your data so that the stride-1 (contiguous) dimension of your arrays is accessed contiguously across the vectors.
Have a high compute intensity, which is the ratio of computation to data movement. The more compute and the less data movement, the better. However, lower compute intensity loops are fine if there are other high intensity loops and the cost to move the data to the host would offset the cost of running the kernel on the device.
Avoid allocating data on the device since it forces threads to serialize. This includes using Fortran "automatic" arrays, and declaring C++ objects that include allocation in their constructors.
Avoid atomic operations. Atomic operations are actually quite efficient when compared to host atomics, but still should be avoided if possible.
Avoid subroutine calls. Try to inline routines when possible.
Occupancy. Occupancy is the ratio of the number of threads that can potentially be running on the GPU over the maximum number of threads that could be running. Note that 100% occupancy does not guarantee high performance, but you should try to get above 50% if possible. The limiters to occupancy are the number of registers used per thread (vector) and the shared memory used per block (gang). Assuming you're using the PGI compiler, you can see the limits of your device by running the PGI "pgaccelinfo" utility. The number of registers used will depend upon the number of local scalars used (explicitly declared by the programmer and temporaries created by the compiler to hold intermediate calculations), and the amount of shared memory used will be determined by the OpenACC "cache" and "private" directives when "private" is used on a "gang" loop. You can see how much each kernel uses by adding the flag "-ta=tesla:ptxinfo". You can limit the number of registers used per thread via "-ta=tesla:maxregcount:". Reducing the number of registers will increase the occupancy but also increase the number of register spills. Spills are fine so long as they only spill to L1/L2 cache. Spilling to global memory will hurt performance. It's often better to suffer lower occupancy than to spill to global memory.
Note that I highly recommend using a profiler (PGPROF, NVprof, Score-P, TAU, Vampir, etc.) to help discover a program's performance bottlenecks.
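To make the occupancy limiters concrete, here is a rough estimate in Python. It is a simplified sketch: the hardware limits used as defaults (registers, shared memory and threads per multiprocessor) are illustrative assumptions rather than the values of any particular device, and real GPUs also round allocations up and cap the number of resident blocks, which this ignores. The function name occupancy is hypothetical.

```python
def occupancy(regs_per_thread, threads_per_block, smem_per_block,
              regs_per_sm=65536, smem_per_sm=98304, max_threads_per_sm=2048):
    """Estimate the occupancy of one multiprocessor for a kernel.

    All *_per_sm defaults are assumed example limits; query your own
    device (e.g. with pgaccelinfo) for the real ones.
    """
    # Number of blocks (gangs) that fit under each limiter.
    by_threads = max_threads_per_sm // threads_per_block
    by_regs = regs_per_sm // (regs_per_thread * threads_per_block)
    by_smem = smem_per_sm // smem_per_block if smem_per_block else by_threads
    blocks = min(by_threads, by_regs, by_smem)
    # Resident threads divided by the maximum the multiprocessor can hold.
    return blocks * threads_per_block / max_threads_per_sm


half = occupancy(regs_per_thread=64, threads_per_block=256, smem_per_block=0)
full = occupancy(regs_per_thread=32, threads_per_block=256, smem_per_block=0)
```

Under these assumed limits, halving the register count per thread doubles the estimate, which is the trade-off against register spills described above.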

Not enough memory in C++ : write to file instead, read data in when needed?

I'm developing a tool for wavelet image analysis and machine learning on Linux machines in C++.
It is limited by the size of the images, the number of scales and their corresponding filters (up to 2048x2048 doubles) for each of N orientations as well as additional memory and processing overhead by a machine learning algorithm.
Unfortunately my skills of Linux system programming are shallow at best, so I'm currently using no swap but figure it should be possible somehow?
I'm required to keep the imaginary and real part of the filtered images of each scale and orientation, as well as the corresponding wavelets for reconstruction purposes. I keep them in memory for additional speed for small images.
Regarding the memory use: I already
store everything no more than once,
only what is needed,
cut out any double entries or redundancy,
pass by reference only,
use pointers over temporary objects,
free memory as soon as it is not required any more and
limit the number of calculations to the absolute minimum.
As with most data processing tools, speed is of the essence. As long as there is enough memory, the tool is about 3x as fast as the same implementation in Matlab code.
But as soon as I'm out of memory nothing goes any more. Unfortunately most of the images I'm training the algorithm on are huge (raw data 4096x4096 double entries, after symmetric padding even larger), therefore I hit the ceiling quite often.
Would it be bad practise to temporarily write data that is not needed for the current calculation / processing step from memory to the disk?
What approach / data format would be most suitable to do that?
I was thinking of using rapidXML to read and write an XML to a binary file and then read out only the required data. Would this work?
Is a memory-mapped file what I need? https://en.wikipedia.org/wiki/Memory-mapped_file
I'm aware that this will result in performance loss, but it is more important that the software runs smoothly and does not freeze.
I know that there are libraries out there that can do wavelet image analysis, so please spare the "Why reinvent the wheel, just use XYZ instead". I'm using very specific wavelets, I'm required to do it myself and I'm not supposed to use external libraries.
Yes, writing data to the disk to save memory is bad practice.
There is usually no need to manually write your data to the disk to save memory, unless you are reaching the limits of what you can address (4GB on 32bit machines, much more in 64bit machines).
The reason for this is that the OS is already doing exactly the same thing. It is very possible that your own solution would be slower than what the OS is doing. Read this Wikipedia article if you are not familiar with the concept of paging and virtual memory.
Did you look into using mmap and munmap to bring the images (and temporary results) into your address space and discard them when you no longer need them? mmap allows you to map the content of a file directly into memory: no more fread/fwrite, just direct memory access. Writes to the memory region are written back to the file too, and bringing back that intermediate state later on is no harder than redoing an mmap.
The big advantages are:
no encoding in a bloated format like XML
perfectly suitable for transient results such as matrices that are represented in contiguous memory regions.
Dead simple to implement.
Completely delegate to the OS the decision of when to swap in and out.
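The idea can be sketched with Python's mmap module, which wraps the same operating system facility as the C mmap/munmap calls mentioned above. This is only an illustration under assumptions: the helper names spill_doubles and load_doubles and the file name are hypothetical, and a C++ tool would call mmap directly on a raw array of doubles instead of packing values.

```python
import mmap
import os
import struct
import tempfile


def spill_doubles(path, values):
    """Write a non-empty sequence of doubles to a file through a mapping."""
    size = len(values) * 8
    with open(path, "wb") as f:
        f.truncate(size)  # reserve the file space first
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), size) as mm:
            mm[:] = struct.pack(f"{len(values)}d", *values)
            mm.flush()  # push dirty pages back to the file


def load_doubles(path, start, count):
    """Map the file read-only and decode only the requested window."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return list(struct.unpack_from(f"{count}d", mm, start * 8))


# Hypothetical file holding one intermediate result.
path = os.path.join(tempfile.mkdtemp(), "scale0_real.bin")
spill_doubles(path, [0.5 * i for i in range(1000)])
window = load_doubles(path, start=10, count=3)
```

Only the pages that are actually touched get read from disk, so loading a small window of a large intermediate result stays cheap.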
This doesn't solve your fundamental problem, but: Are you sure you need to be doing everything in double precision? You may not be able to use integer coefficient wavelets, but storing the image data itself in doubles is usually pretty wasteful. Also, 4k images aren't very big ... I'm assuming you are actually using frames of some sort so have redundant entries, otherwise your numbers don't seem to add up (and are you storing them sparsely?) ... or maybe you are just using a large number at once.
As for "should I write to disk"? This can help, particularly if you are getting a 4x increase (or more) by taking image data to double precision. You can answer it for yourself though: just measure the time to load and compare it to your compute time to see if this is worth pursuing. The wavelet itself should be very cheap, so I'm guessing you're mostly dominated by your learning algorithm. In that case, go ahead and throw out original data or whatever until you need it again.
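A quick back-of-the-envelope calculation shows how much the sample type matters. The 4096x4096 size comes from the question; the number of scales and orientations below are made-up example values, since the question does not state them, and the function names are hypothetical.

```python
MIB = 2 ** 20


def plane_bytes(height, width, bytes_per_sample):
    """Memory for one real-valued image plane."""
    return height * width * bytes_per_sample


def working_set_bytes(height, width, bytes_per_sample, scales, orientations):
    """Real and imaginary planes for every scale and orientation."""
    return 2 * scales * orientations * plane_bytes(height, width, bytes_per_sample)


double_plane_mib = plane_bytes(4096, 4096, 8) / MIB  # double precision
float_plane_mib = plane_bytes(4096, 4096, 4) / MIB   # single precision

# Assumed example: 5 scales and 8 orientations (not from the question).
double_total_mib = working_set_bytes(4096, 4096, 8, 5, 8) / MIB
float_total_mib = working_set_bytes(4096, 4096, 4, 5, 8) / MIB
```

With those assumed counts, the filtered images alone take 10 GiB in double precision and half that in single precision, before padding, wavelets and the learning algorithm are added.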

3D FFT with data larger than cache

I have searched for an answer to this question but have not found anything that can directly help me.
I am working on a 3D numerical integrator for a non-linear PDE using the parallel FFT library included in MKL.
My arrays consist of 2^30 data points which is much much larger than the cache. This results in ~50% of cache references being misses, which appears to add a massive amount of overhead accessing memory.
Is there a clever way I can deal with this? Is it expected to have 50% cache misses using an array this large?
Any help would be much appreciated.
Thanks,
Dylan
2^30 data points in a single FFT counts as being quite big!
The data plus the exponentials and the output array is several thousand times bigger than the L3 cache, and millions of times bigger than L1.
Given that disparity one might argue that a 50% cache miss rate is actually quite good, especially for an algorithm like an FFT which accesses memory in non-sequential ways.
I don't think that there will be much you can do about it. The MKL is quite good, and I'm sure that they've taken advantage of whatever cache hinting instructions there are.
You might try contacting Mercury Systems Inc. (www.mrcy.com) and ask them about their Scientific Algorithms Library (SAL). They have a habit of writing their own math libraries, and in my experience they are pretty good at it. Their FFT on PowerPC was 30% quicker than the next best one; quite an achievement. You can try an un-optimised version of SAL for free (http://sourceforge.net/projects/opensal/). The real optimised for Intel SAL is definitely not free though.
Also bear in mind that no matter how clever the algorithm is, with a data set that size you're always going to be fundamentally stuck with main memory bandwidths, not cache bandwidths.
GPUs might be worth a look, but you'd need one with a lot of memory to hold 2^30 data points (complex values with 32-bit real and imaginary parts take 8 bytes each, so 8 GiB, the same again for the output array, plus exponentials, etc).
I think the problem of excessive misses is due to a failure of the cache prefetch mechanism, but not knowing the details of the memory accesses I can't tell you exactly why.
It does not matter that your arrays are very large; 50% misses is excessive. The processor should avoid misses by detecting that you are iterating over an array and loading ahead of time the data elements you are likely to use.
Either the pattern of array accesses is not regular, so the prefetcher in the processor does not figure out a pattern to prefetch, or you have a cache associativity problem, that is, elements in your iteration might be mapped to the same cache slots.
For example, assume a cache size of 1 MB and a set associativity of 4. In this example, the cache maps memory to an internal slot using the lower 20 bits of the address. If you stride by 1 MB, that is, each iteration advances exactly 1 MB, then the lower 20 bits are always the same and every new element shares a cache slot with the old ones. When you get to the fifth element, all four positions are used up, and from then on there are only misses; in that case your cache is effectively one single slot. If you stride by half the cache size, the effective number of slots is 2, which might be enough to have no misses at all, 100% misses, or anything in between, depending on whether your access pattern requires both slots simultaneously.
To convince yourself of this, write a toy program with varying stride sizes; you'll see that strides that divide the cache size or are multiples of it increase misses. You can measure this with valgrind --tool=cachegrind.
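The effect can also be reproduced without hardware counters by simulating a small set-associative cache. This is a toy model under assumptions: LRU replacement, 64-byte lines, and the usual mapping of line address modulo the number of sets, which simplifies real hardware and differs in detail from the bit count used in the example above. The function name miss_rate is hypothetical.

```python
from collections import OrderedDict


def miss_rate(stride, n_elems, passes,
              cache_bytes=2 ** 20, ways=4, line_bytes=64):
    """Miss rate when n_elems addresses, stride bytes apart, are swept repeatedly."""
    n_sets = cache_bytes // (ways * line_bytes)
    sets = [OrderedDict() for _ in range(n_sets)]  # one LRU list per set
    misses = accesses = 0
    for _ in range(passes):
        for i in range(n_elems):
            line = (i * stride) // line_bytes
            lru = sets[line % n_sets]
            tag = line // n_sets
            accesses += 1
            if tag in lru:
                lru.move_to_end(tag)         # hit: mark as most recently used
            else:
                misses += 1
                lru[tag] = True
                if len(lru) > ways:
                    lru.popitem(last=False)  # evict the least recently used line
    return misses / accesses


conflict = miss_rate(stride=2 ** 20, n_elems=8, passes=10)  # all in one set
fits = miss_rate(stride=2 ** 20, n_elems=4, passes=10)      # exactly 4 ways
friendly = miss_rate(stride=64, n_elems=8, passes=10)       # spread over sets
```

In this model, eight addresses that are 1 MB apart all land in the same 4-way set and evict each other on every pass, while four such addresses, or eight consecutive lines, only miss on the first pass.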
You should first make sure you know what is causing the cache misses; they may be the fault of other code you've written rather than the FFT library. In fact, I expect that is very likely the case.
The rest of this post assumes that the FFT is really at fault and we need to optimize.
The standard trick to get data locality out of an FFT is to
Arrange the data in a two-dimensional array
Do an FFT along each row
Apply twiddle factors
Do a matrix transpose
Do an FFT along each row
This is the Cooley-Tukey algorithm, in the case where we factor 2^(m+n) = 2^m * 2^n.
The point of this is that the recursive calls to the FFT are much much smaller, and may very well fit in cache. And if not, you can apply this method recursively until things do fit in cache. And if you're ambitious, you do a lot of benchmarking to figure out the optimal way to do the splitting.
Thus, assuming you also use a good matrix transpose algorithm, the end result is a relatively cache-friendly FFT.
The library you're using really should be doing this already. If it's not, then some options are:
Maybe it exposes enough lower-level functionality that you can tell it to use Cooley-Tukey in an efficient way even though the high-level routines don't.
You could implement Cooley-Tukey yourself, using the given library to do the smaller FFTs.
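The five steps listed above can be sketched in plain Python. This is an illustration of the index bookkeeping only, not a fast implementation: a naive O(n^2) dft stands in for the smaller FFTs that a real library would provide, and the names dft and four_step_fft are hypothetical.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform, standing in for a small library FFT."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


def four_step_fft(x, n1, n2):
    """Length n1*n2 transform built from length-n1 and length-n2 transforms."""
    n = n1 * n2
    assert len(x) == n
    # 1. Arrange as a 2D array: row j2 holds the samples x[n2*j1 + j2].
    rows = [[x[n2 * j1 + j2] for j1 in range(n1)] for j2 in range(n2)]
    # 2. Transform each row (length n1).
    rows = [dft(r) for r in rows]
    # 3. Apply twiddle factors exp(-2*pi*i*j2*k1/n).
    rows = [[rows[j2][k1] * cmath.exp(-2j * cmath.pi * j2 * k1 / n)
             for k1 in range(n1)] for j2 in range(n2)]
    # 4. Transpose, giving n1 rows of length n2.
    rows = [list(col) for col in zip(*rows)]
    # 5. Transform each row again (length n2).
    rows = [dft(r) for r in rows]
    # Row k1, column k2 is output element k1 + n1*k2.
    out = [0j] * n
    for k1 in range(n1):
        for k2 in range(n2):
            out[k1 + n1 * k2] = rows[k1][k2]
    return out


signal = [complex(i % 5, -0.5 * i) for i in range(12)]
max_error = max(abs(a - b)
                for a, b in zip(four_step_fft(signal, 3, 4), dft(signal)))
```

Each row transform touches only n1 or n2 elements at a time, which is what lets the sub-problems fit in cache when the full array does not.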

CUDA - operations on single elements of a matrix - getting ideas

I'm about to write a CUDA kernel to perform a single operation on every element of a matrix (e.g. taking the square root of every element, exponentiation, or calculating the sine/cosine if all the numbers are in [-1;1], etc.).
I chose the blocks/threads grid dimensions and I think the code is pretty straightforward and simple, but I'm asking myself... what can I do to maximize coalescence/SM occupancy?
My first idea was to have every half-warp (16 threads) load data ensemble from global memory and then set them all to computing, but it turns out that there is not enough overlap between memory transfers and calculations. I mean, all threads load data, then compute, then load data again, then compute again. This sounds really poor in terms of performance.
I thought using shared memory would be great, maybe using some sort of locality to make a thread load more data than it actually needs to facilitate other threads' work, but this sounds stupid too because the second would wait for the former to finish loading data before starting its work.
I'm not really sure I gave the right idea regarding my problem, I'm just getting ideas before commencing to work on something concrete.
Any comment, suggestion, or criticism is welcome, and thanks.
If you have defined the grid so that threads read along the major dimension of the array containing your matrix, then you have already guaranteed coalesced memory access, and there is little else to be done to improve performance. These sorts of O(N) complexity operations really do not contain sufficient arithmetic intensity to give good parallel speedup over an optimized CPU implementation. Often the best strategy is to fuse multiple O(N) operations together into a single kernel to improve the FLOP to memory transaction ratio.
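The fusion idea is independent of CUDA, so here is a CPU-side analogy in Python. It is a sketch under assumptions: each elementwise function is counted as one operation, values are taken to be 4-byte floats, and the function names are hypothetical. A real kernel would apply the same idea inside one CUDA kernel body.

```python
import math


def unfused(data):
    """Two passes: each one reads and writes the whole array."""
    roots = [math.sqrt(v) for v in data]
    return [math.exp(r) for r in roots]


def fused(data):
    """One pass: a single read and a single write per element."""
    return [math.exp(math.sqrt(v)) for v in data]


def ops_per_byte(ops_per_element, bytes_per_value=4):
    """Operations per byte moved, assuming one load and one store per element."""
    return ops_per_element / (2 * bytes_per_value)


data = [0.25 * i for i in range(1000)]
same = unfused(data) == fused(data)
single_op = ops_per_byte(1)  # one operation per pass
two_ops = ops_per_byte(2)    # two operations fused into one pass
```

Fusing k elementwise operations multiplies the operations-per-byte ratio by k while the memory traffic stays the same, which is why it helps bandwidth-bound kernels.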
In my eyes your problem is this
load data ensemble from global memory
It seems that your algorithm idea is:
1. Do something on the CPU; have some matrix
2. Transfer the matrix from host (CPU) memory to device memory
3. Perform your operation on every element
4. Transfer the matrix back from device memory to host (CPU) memory
5. Do something else on the CPU; sometimes go back to step 1.
These kinds of computations are almost always I/O-bandwidth limited (I/O = memory I/O), not compute limited. GPGPU computations can sustain a very high memory bandwidth, but only between device memory and the GPU; transfers from host (CPU) memory always go over the very slow PCIe bus (slow compared to the device memory connection, which can deliver 160 GB/s or more on fast cards). So one main thing for getting good results is to keep the data (matrix) in device memory, and preferably even generate it there if possible (this depends on your problem). Never migrate data back and forth between CPU and GPU, as the transfer overhead eats up all your speedup. Also keep in mind that your matrix must have a certain size to amortize the transfer overhead that you can't avoid (computing on a matrix with 10 x 10 elements would gain almost nothing; it would even cost more).
The alternating transfer/compute/transfer pattern is perfectly fine, that's how such GPU algorithms work, but only if the transfer is from device memory.
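A rough estimate shows how lopsided the two costs are. The 160 GB/s device bandwidth is the figure quoted above; the 8 GB/s PCIe bandwidth is an assumed round number for illustration, and the function name transfer_vs_kernel is hypothetical. Latency and kernel launch overhead are ignored.

```python
def transfer_vs_kernel(n_elements, bytes_per_value=4,
                       pcie_gb_s=8.0, device_gb_s=160.0):
    """Return (PCIe seconds, kernel seconds) for one elementwise operation.

    pcie_gb_s is an assumed value; measure your own system before relying on it.
    """
    nbytes = n_elements * bytes_per_value
    pcie_s = 2 * nbytes / (pcie_gb_s * 1e9)      # copy to the device and back
    kernel_s = 2 * nbytes / (device_gb_s * 1e9)  # one read and one write on the device
    return pcie_s, kernel_s


pcie_s, kernel_s = transfer_vs_kernel(10 ** 8)
ratio = pcie_s / kernel_s
```

With these assumed bandwidths, the round trip over PCIe costs 20 times as much as the bandwidth-bound kernel itself, so the data has to stay on the device across many operations for the GPU to pay off.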
The GPU for something this trivial is overkill and will be slower than just keeping it on the CPU. Especially if you have a multicore CPU.
I have seen many projects showing the "great" advantages of the GPU over the CPU. They rarely stand up to scrutiny. Of course, goofy managers who want to impress their own managers want to show how "leading edge" their group is.
Someone in the department toils for months getting silly GPU code optimized (which is generally 8x harder to read than the equivalent CPU code), then has the "equivalent" CPU code written by the cheapest outsourcing shop available (the programmer whose last project was PGP), compiles it with the slowest version of gcc they can find, with no optimization, and then touts the 2x speed improvement. And by the way, many overlook I/O speed as somehow not important.