Is this behavior showing that I have a memory problem? - linear-programming

I have an LP problem with ~4 million variables and ~4 million constraints. I use Gurobi to solve it. My PC has 4 cores and 8 GB of memory.
According to the log file, it takes ~100 seconds to find the optimal solution. Then the CPU is released, but almost all of the memory is still in use. The process hangs there, apparently doing nothing for hours, before it finally continues with the rest of the script (e.g. the print statement) after the solve:
results = opt.solve(model, tee=True)
print("model solved")
I used the barrier method with crossover disabled, which worked best. I also tried different numbers of threads; 4 turned out to give the shortest hang time (but it still takes hours).
This hanging significantly increases the total run time, which is not desirable.
I plan to upgrade the memory, but first I want confirmation from the community that this really is a memory issue. Is this a memory problem?

Likely the problem does not fit in memory and virtual memory (i.e. disk) is used. When it is really bad, this is called thrashing, and it can bring your machine to its knees. Depending on the number of nonzeros in the problem, the presolve statistics and the number of threads you are using, you need at least 16 GB (and maybe more like 32 GB).
Also: try to reduce the number of threads Gurobi is using. It may be better to use 1 thread (after benchmarking which LP algorithm works best: primal simplex, dual simplex or a barrier method). By default a concurrent LP method is used, which runs different LP algorithms in parallel and significantly increases the memory footprint.
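As an illustration (this is not the Pyomo code from the question), here is roughly how those settings look through Gurobi's C++ API; the same parameters (Threads, Method, Crossover) can also be passed through Pyomo's solver options. Treat the values as starting points to benchmark, not as a recommendation.
// Sketch of benchmarking-oriented Gurobi settings via the C++ API.
// "model.lp" is a hypothetical file standing in for the actual model.
#include "gurobi_c++.h"

int main() {
    GRBEnv env;
    GRBModel model(env, "model.lp");

    model.set(GRB_IntParam_Threads, 1);    // 1 thread to cut the memory footprint
    model.set(GRB_IntParam_Method, 2);     // 2 = barrier (0 = primal, 1 = dual simplex)
    model.set(GRB_IntParam_Crossover, 0);  // crossover disabled, as in the question

    model.optimize();
    return 0;
}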

Memset too slow on large data. Any alternatives?

I have a large cv::Mat with dimensions (100,32768). I update it for every frame in a video stream. Before updating, I need to set everything back to zero so I execute
memset(myMat.data, 0, 100 * 32768 * sizeof(int));
which takes 5ms on average.
Surprisingly (at least to me), in debug mode I get the same times, if not faster ones. While I'd appreciate an explanation of why this happens (Google gives me loads of reasons, so I will eventually figure it out), what I really need is a faster alternative. Is there anything I can do?
DDR4-3000-ish caps out at a bit under 100 GB/s; DDR3-800 is 6.4 GB/s.
Your speed is about 12 MB per 5 ms, or 2.4 GB/s.
So depending on your RAM, you might be near the maximum speed of your hardware. A factor of 2 ain't bad.
You are working on a modestly sparse array. It is possible that a non-contiguous buffer might be a better plan, depending on how your data is arranged. Also, GPUs tend to have higher internal memory bandwidth than CPUs, so moving your work there could help.
The problem could also be latency; maybe clearing one buffer in another thread while working on another would help.
Massively reducing your memory usage, and making it more local, may have a larger impact than you expect. It is plausible that your non-zeroing code is RAM-speed constrained already.
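A small sketch of the double-buffering idea mentioned above (the frame loop and names are illustrative, not the questioner's code): zero the buffer that was just used on a background thread while the main thread works on the other, already-zeroed buffer.
// Double-buffering sketch: clear one buffer asynchronously while using the other.
#include <cstddef>
#include <cstring>
#include <future>
#include <vector>

int main() {
    const std::size_t elems = 100 * 32768;           // same element count as the question
    const std::size_t bytes = elems * sizeof(int);
    std::vector<int> bufs[2] = {std::vector<int>(elems), std::vector<int>(elems)};
    std::future<void> cleared[2];                    // pending clears, one per buffer
    int active = 0;

    for (int frame = 0; frame < 1000; ++frame) {     // illustrative frame loop
        if (cleared[active].valid()) cleared[active].wait();  // must be zero again by now
        int* data = bufs[active].data();

        // ... fill `data` for this frame and use it ...

        // Hand the just-used buffer to a background thread for zeroing.
        cleared[active] = std::async(std::launch::async,
                                     [data, bytes] { std::memset(data, 0, bytes); });
        active ^= 1;                                 // next frame uses the other buffer
    }
    return 0;
}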

In the worst case, how much can QPI latency slow down an arbitrary application?

I'm developing a low-latency HFT trading application.
I'm using a single-CPU machine, because it's much easier to configure and maintain (no need to tune NUMA). Also, obviously, assuming we have enough resources, it should definitely not be slower than a dual-CPU setup, and it will likely be a little faster, because there is no QPI/NUMA latency.
HFT requires a lot of resources, and I now realize I want many more cores. Also, colocating two 1U single-CPU machines is much more expensive than colocating one 1U dual-CPU machine, so even assuming I can "split" my program in two, it still makes sense to use a 1U dual-CPU machine.
So how bad is QPI/NUMA latency? If I move my application from a single-CPU machine to a dual-CPU machine, how much slower can it get? I can afford a delay of at most several microseconds, but not more. Can QPI/NUMA introduce a significant delay if not tuned correctly, and how significant would that delay be?
Is it possible to write an application that runs much slower (more than several microseconds slower) on a dual-CPU setup than on a single-CPU setup, i.e. runs much slower on a faster computer? (Of course this assumes we have the same processors, memory, network card and everything else.)
This is not trivially answerable, since it depends on so many factors. Is the code written for NUMA?
Is the code doing mostly reads, mostly writes or about equal? How much data is shared between threads that run on separate CPUs? How often is such data written to, forcing cache-refresh?
How do tasks get scheduled, and how and when does the OS decide to move threads from one CPU socket to the other?
Does the code and data fit in cache?
Those are just a few factors that will change the results dramatically between a "works really well" and "gives really poor performance".
As with EVERYTHING performance-related, details can make a huge difference, and reading answers like this one on the internet will not give you a reliable answer that applies to YOUR situation. Benchmark your application, check performance counters and tweak based on that. [Given the price of a machine with the specs you describe in the comments above, I'd expect the supplier would allow some sort of test, demo, "try before you buy", etc.]
Assume a worst-case scenario: a memory access straddles two cache lines (an unaligned access of an 8-byte value, for example), the split falls between your worst-placed CPUs, the MMU needs reloading, each of those page-table entries also lives on the worst possible CPU, and since the memory for that pair of locations is in different places, you need new TLB entries for each of the two 4-byte reads that make up your 64-bit value. (Each TLB entry is a separate location.)
This means roughly 2 x 4 x n memory operations, where n is something like 50-100 ns each; pile the remote-node page-walk penalties on top and one memory access could, at least in theory, take around 1600 ns, i.e. 1.6 microseconds. It's unlikely that you will get MUCH worse than this for a single operation, and the overhead is a lot less than, for example, swapping to disk, which can add milliseconds to your execution time.
It is not very hard to write code that updates the same cache line from multiple CPUs and thus causes a dramatic reduction in performance. I remember, a long time back, when I first had an Athlon SMP system running a simple benchmark, the author had done exactly this in a Dhrystone benchmark:
int numberOfRuns[MAX_CPUS];
Now, numberOfRuns holds the outer loop counter for each CPU, and updating it on every iteration, on either CPU, caused "false sharing": the counters sit in the same cache line, so each time one CPU updated its counter, the other CPU had to flush that cache line.
Running this on a 2-core SMP system gave 30% of the single-CPU performance, so 3 times SLOWER than one CPU, rather than faster as you'd expect. (This was some 12 or so years ago, so my memory may be a little "off" on the exact details, but the essence of the story is still true: a badly written application can run slower on multiple cores than on a single core.)
I'd expect performance at least that bad on a modern system where you have false sharing of commonly used variables.
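A small sketch (my reconstruction, not the original Dhrystone code) of the usual fix: pad each per-thread counter to its own cache line so the cores stop invalidating each other's lines. The 64-byte line size is an assumption; check it for your hardware.
// Padded per-thread counters: each one occupies its own (assumed) 64-byte cache line.
// Needs C++17 for over-aligned allocation inside std::vector.
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

constexpr std::size_t kCacheLine = 64;        // typical x86 line size (assumption)

struct alignas(kCacheLine) PaddedCounter {
    long value = 0;                           // one counter per CPU/thread
};

int main() {
    const unsigned nThreads = std::max(2u, std::thread::hardware_concurrency());
    std::vector<PaddedCounter> numberOfRuns(nThreads);   // one cache line per entry

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nThreads; ++t) {
        workers.emplace_back([&numberOfRuns, t] {
            for (long i = 0; i < 100000000; ++i)
                ++numberOfRuns[t].value;      // no other thread touches this cache line
        });
    }
    for (auto& w : workers) w.join();
    return 0;
}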
In comparison, well-written code should run nearly N times faster if there is little or no sharing between CPU cores. I have a highly CPU-bound, multithreaded calculator for weird numbers, which gives a near-n-times performance gain both on my single-socket system at home and on my two-socket system at work.
$ time ./weird -t 1 -e 100000
real 0m22.641s
user 0m22.660s
sys 0m0.003s
$ time ./weird -t 6 -e 100000
real 0m5.096s
user 0m25.333s
sys 0m0.005s
So about 11% overhead (comparing total CPU time: 25.3 s vs. 22.7 s of user time). That comes from sharing one variable [the current number], which is atomically updated between threads (using C++ standard atomics). Unfortunately, I don't have a good example of "badly written code" to contrast this against.
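A minimal sketch of that kind of low-sharing work distribution (this is not the actual "weird numbers" program): the only shared state is one atomic counter handing out the next candidate, plus an atomic tally of results.
// Low-contention work sharing: one atomic counter hands out candidates to all threads.
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    std::atomic<std::uint64_t> next{2};          // the shared "current number"
    std::atomic<std::uint64_t> found{0};         // shared result tally
    const std::uint64_t limit = 1000000;

    auto worker = [&] {
        for (;;) {
            std::uint64_t n = next.fetch_add(1, std::memory_order_relaxed);
            if (n >= limit) return;
            // ... per-number work goes here; this test is a stand-in ...
            if (n % 9973 == 0) found.fetch_add(1, std::memory_order_relaxed);
        }
    };

    const unsigned nThreads = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < nThreads; ++t)
        pool.emplace_back(worker);
    for (auto& t : pool) t.join();

    std::printf("found %llu\n", (unsigned long long)found.load());
    return 0;
}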

Cuda Stream Processing for multiple kernels Disambiguation

Hi, a few questions regarding CUDA stream processing for multiple kernels.
Assume s streams and a kernel on a compute capability 3.5 Kepler device, where s <= 32.
The kernel uses a dev_input array of size n and a dev_output array of size s*n.
The kernel reads data from the input array, stores the value in a register, manipulates it and writes its result back to dev_output at position s*n + tid (s here being the index of the stream the kernel instance runs in).
We aim to run the same kernel s times, using one of the s streams each time, similar to the simpleHyperQ example. Can you comment on whether and how any of the following affect concurrency, please?
1. dev_input and dev_output are not pinned;
2. dev_input as it is vs. dev_input of size s*n, where each kernel reads unique data (no read conflicts);
3. kernels read data from constant memory;
4. 10 KB of shared memory are allocated per block;
5. kernel uses 60 registers.
Any good comments will be appreciated...!!!
cheers,
Thanasio
Robert,
thanks a lot for your detailed answer; it has been very helpful. I edited point 4: it is 10 KB per block. In my situation, I launch grids of 61 blocks with 256 threads each. The kernels are rather compute-bound. I launch 8 streams of the same kernel, profile them, and see a very good overlap between the first two; then it gets worse and worse. The kernel execution time is around 6 ms. After the first two streams execute almost perfectly concurrently, the rest have a 3 ms gap between them. Regarding point 5, I use a K20, which allows up to 255 registers per thread, so I would not expect drawbacks from there. I really cannot understand why I do not achieve concurrency equivalent to what is specified for the GK110.
Please take a look at the following link. There is an image called kF.png; it shows the profiler output for the streams:
https://devtalk.nvidia.com/default/topic/531740/cuda-programming-and-performance/concurrent-streams-and-hyperq-for-k20/
Concurrency amongst kernels depends upon a number of factors, but one that many people overlook is simply the size of the kernel (i.e. number of blocks in the grid.) Kernels that are of a size that can effectively utilize the GPU by themselves will not generally run concurrently to a large degree, and there would be little throughput advantage even if they did. The work distributor inside the GPU will generally begin distributing blocks as soon as a kernel is launched, so if one kernel is launched before another, and both have a large number of blocks, then the first kernel will generally occupy the GPU until it is nearly complete, at which point blocks of the second kernel will then get scheduled and executed, perhaps with a small amount of "concurrent overlap".
The main point is that kernels that have enough blocks to "fill up the GPU" will prevent other kernels from actually executing, and apart from scheduling, this isn't any different on a compute 3.5 device. In addition, rather than just specifying a few parameters for the kernel as a whole, specifying launch parameters and statistics (such as register usage, shared memory usage, etc.) at the block level is helpful for providing crisp answers. The benefits of the compute 3.5 architecture in this area will still mainly come from "small" kernels of "few" blocks attempting to execute together; compute 3.5 has some advantages there.
You should also review the answer to this question.
1. When global memory used by the kernel is not pinned, it affects the speed of data transfer, and also the ability to overlap copy and compute, but it does not affect the ability of two kernels to execute concurrently. Nevertheless, the limitation on copy and compute overlap may skew the behavior of your application.
2. There shouldn't be "read conflicts"; I'm not sure what you mean by that. Two independent threads/blocks/grids are allowed to read the same location in global memory. Generally this will get sorted out at the L2 cache level. As long as we are talking about just reads, there should be no conflict and no particular effect on concurrency.
3. Constant memory is a limited resource, shared amongst all kernels executing on the device (try running deviceQuery). If you have not exceeded the total device limit, then the only issue will be one of utilization of the constant cache, and things like cache thrashing. Apart from this secondary relationship, there is no direct effect on concurrency.
4. It would be more instructive to identify the amount of shared memory per block rather than per kernel. This will directly affect how many blocks can be scheduled on an SM. But answering this question would also be much crisper if you specified the launch configuration of each kernel, as well as the relative timing of the launch invocations. If shared memory happened to be the limiting factor in scheduling, then you can divide the total available shared memory per SM by the amount used by each kernel to get an idea of the possible concurrency based on this. My own opinion is that the number of blocks in each grid is likely to be a bigger issue, unless you have kernels that use 10k per grid but only have a few blocks in the whole grid.
5. My comments here would be nearly the same as my response to 4. Take a look at deviceQuery for your device, and if registers became a limiting factor in scheduling blocks on each SM, then you could divide the available registers per SM by the register usage per kernel (again, it makes a lot more sense to talk about register usage per block and the number of blocks in the kernel) to discover what the limit might be.
Again, if you have reasonable sized kernels (hundreds or thousands of blocks, or more) then the scheduling of blocks by the work distributor is most likely going to be the dominant factor in the amount of concurrency between kernels.
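A small sketch of the per-SM occupancy arithmetic described above. The device limits are typical CC 3.5 / K20 values and are assumptions here; query them with cudaGetDeviceProperties (or look at deviceQuery) for your actual card.
// Back-of-the-envelope occupancy check along the lines described above.
#include <algorithm>
#include <cstdio>

int main() {
    // Assumed device limits (typical CC 3.5 / K20):
    const int maxBlocksPerSM  = 16;
    const int sharedPerSM     = 48 * 1024;   // bytes, default L1/shared split
    const int regsPerSM       = 65536;
    const int numSMs          = 13;

    // Kernel characteristics taken from the question:
    const int threadsPerBlock = 256;
    const int sharedPerBlock  = 10 * 1024;   // 10 KB
    const int regsPerThread   = 60;

    const int bySharedMem = sharedPerSM / sharedPerBlock;                    // 4
    const int byRegisters = regsPerSM / (regsPerThread * threadsPerBlock);   // 4
    const int blocksPerSM = std::min({maxBlocksPerSM, bySharedMem, byRegisters});

    std::printf("blocks per SM limited to %d (shared mem: %d, registers: %d)\n",
                blocksPerSM, bySharedMem, byRegisters);
    std::printf("device-wide at any instant: ~%d blocks\n", blocksPerSM * numSMs);
    return 0;
}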
EDIT: in response to the new information posted in the question, I've looked at kF.png.
First let's analyze from a blocks per SM perspective. CC 3.5 allows 16 "open" or currently scheduled blocks per SM. If you are launching 2 kernels of 61 blocks each, that may well be enough to fill the "ready-to-go" queue on the CC 3.5 device. Stated another way, the GPU can handle 2 of these kernels at a time. As the blocks of one of those kernels "drains" then another kernel is scheduled by the work distributor. The blocks of the first kernel "drain" sufficiently in about half the total time, so that the next kernel gets scheduled about halfway through the completion of the first 2 kernels, so at any given point (draw a vertical line on the timeline) you have either 2 or 3 kernels executing simultaneously. (The 3rd kernel launched overlaps the first 2 by about 50% according to the graph, I don't agree with your statement that there is a 3ms distance between each successive kernel launch). If we say that at peak we have 3 kernels scheduled (there are plenty of vertical lines that will intersect 3 kernel timelines) and each kernel has ~60 blocks, then that is about 180 blocks. Your K20 has 13 SMs and each SM can have at most 16 blocks scheduled on it. This means at peak you have about 180 blocks scheduled (perhaps) vs. a theoretical peak of 16*13 = 208. So you're pretty close to max here, and there's not much more that you could possibly get. But maybe you think you're only getting 120/208, I don't know.
Now let's take a look from a shared memory perspective. A key question is what is the setting of your L1/shared split? I believe it defaults to 48KB of shared memory per SM, but if you've changed this setting that will be pretty important. Regardless, according to your statement each block scheduled will use 10KB of shared memory. This means we would max out around 4 blocks scheduled per SM, or 4*13 total blocks = 52 blocks max that can be scheduled at any given time. You're clearly exceeding this number, so probably I don't have enough information about the shared memory usage by your kernels. If you're really using 10kb/block, this would more or less preclude you from having more than one kernel's worth of threadblocks executing at a time. There could still be some overlap, and I believe this is likely to be the actual limiting factor in your application. The first kernel of 60 blocks gets scheduled. After a few blocks drain (or perhaps because the 2 kernels were launched close enough together) the second kernel begins to get scheduled, so nearly simultaneously. Then we have to wait a while for about a kernel's worth of blocks to drain before the 3rd kernel can get scheduled, this may well be at the 50% point as indicated in the timeline.
Anyway I think the analyses 1 and 2 above clearly suggest you're getting most of the capability out of the device, based on the limitations inherent in your kernel structure. (We could do a similar analysis based on registers to discover if that is a significant limiting factor.) Regarding this statement: "I really cannot understand why i do not achieve concurrency equivalent to what is specified for gk110s.." I hope you see that the concurrency spec (e.g. 32 kernels) is a maximum spec, and in most cases you are going to run into some other kind of machine limit before you hit the limit on the maximum number of kernels that can execute simultaneously.
EDIT: regarding documentation and resources, the answer I linked to above from Greg Smith provides some resource links. Here are a few more:
The CUDA C Programming Guide has a section on Asynchronous Concurrent Execution.
The GPU Concurrency and Streams presentation by Dr. Steve Rennich at NVIDIA is on the NVIDIA webinar page.
My experience with HyperQ so far is 2-3 (sometimes 3.5) times parallelization of my kernels, as my kernels are usually larger, doing somewhat more complex calculations. With small kernels it's a different story, but usually my kernels are more complicated.
This is also addressed by NVIDIA in their CUDA 5.0 documentation: more complex kernels will reduce the amount of parallelization.
But still, the GK110 has a great advantage simply in allowing this.

OpenMP and cores/threads

My CPU is a Core i3 330M with 2 cores and 4 threads. When I execute the command cat /proc/cpuinfo in my terminal, it is as if I have 4 CPUs. When I use the OpenMP function omp_get_num_procs() I also get 4.
Now I have a standard C++ vector class, that is, a fixed-size double array class that does not use expression templates. I have carefully parallelized all the methods of my class and I get the "expected" speedup.
The question is: can I guess the expected speedup in such a simple case? For instance, if I add two vectors without parallelized for-loops I get some time (using the shell time command). Now if I use OpenMP, should I get a time divided by 2 or 4, according to the number of cores/threads? I emphasize that I am only asking for this particular simple problem, where there is no interdependence in the data and everything is linear (vector addition).
Here is some code:
Vector Vector::operator+(const Vector& rhs) const
{
    assert(m_size == rhs.m_size);
    Vector result(m_size);
    #pragma omp parallel for schedule(static)
    for (unsigned int i = 0; i < m_size; i++)
        result.m_data[i] = m_data[i] + rhs.m_data[i];
    return result;
}
I have already read this post: OpenMP thread mapping to physical cores.
I hope that somebody will tell me more about how OpenMP gets the work done in this simple case. I should say that I am a beginner in parallel computing.
Thanks!
EDIT : Now that some code has been added.
In that particular example, there is very little computation and lots of memory access. So the performance will depend heavily on:
The size of the vector.
How you are timing it (do you have an outer loop for timing purposes?).
Whether the data is already in cache.
For larger vector sizes, you will likely find that the performance is limited by your memory bandwidth. In which case, parallelism is not going to help much. For smaller sizes, the overhead of threading will dominate. If you're getting the "expected" speedup, you're probably somewhere in between where the result is optimal.
I refuse to give hard numbers because in general, "guessing" performance, especially in multi-threaded applications is a lost cause unless you have prior testing knowledge or intimate knowledge of both the program and the system that it's running on.
Just as a simple example taken from my answer here: How to get 100% CPU usage from a C program
On a Core i7 920 @ 3.5 GHz (4 cores, 8 threads):
If I run with 4 threads, the result is:
This machine calculated all 78498 prime numbers under 1000000 in 39.3498 seconds
If I run with 4 threads and explicitly (using Task Manager) pin the threads on 4 distinct physical cores, the result is:
This machine calculated all 78498 prime numbers under 1000000 in 30.4429 seconds
So this shows how unpredictable it is for even a very simple and embarrassingly parallel application. Applications involving heavy memory usage and synchronization get a lot uglier...
To add to Mystical's answer: your problem is purely memory-bandwidth bound. Have a look at the STREAM benchmark. Run it on your computer in the single- and multi-threaded cases and look at the Triad results - this is your case (well, almost, since your output vector is at the same time one of your input vectors). Calculate how much data you move around and you will know exactly what performance to expect.
Does multi-threading work for this problem? Yes. It is rare that a single CPU core can saturate the entire memory bandwidth of the system. Modern computers balance the available memory bandwidth against the number of cores available. From my experience, you need around half of the cores to saturate the memory bandwidth with a simple memcpy operation. It might take a few more if you do some calculations on the way.
Note that on NUMA systems you will need to bind the threads to CPU cores and use local memory allocation to get optimal results. This is because on such systems every CPU has its own local memory, to which access is fastest. You can still access the entire system memory as on usual SMPs, but this incurs communication cost - CPUs have to explicitly exchange data. Binding threads to CPUs and using local allocation is extremely important; failing to do this kills scalability. Check libnuma if you want to do this on Linux.
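A rough, unofficial triad-style probe (not the actual STREAM code) that you could run with OMP_NUM_THREADS=1, 2, 4, ... to see where the bandwidth ceiling is; the array size is illustrative.
// Triad-style bandwidth probe in the spirit of STREAM. Compile with -O2 -fopenmp.
#include <cstdio>
#include <vector>
#include <omp.h>

int main() {
    const long long n = 20000000;                // ~160 MB per double array
    std::vector<double> a(n, 1.0), b(n, 2.0), c(n, 0.0);
    const double scalar = 3.0;

    const double t0 = omp_get_wtime();
    #pragma omp parallel for schedule(static)
    for (long long i = 0; i < n; ++i)
        c[i] = a[i] + scalar * b[i];             // triad: two reads and one write per element
    const double t1 = omp_get_wtime();

    const double bytes = 3.0 * n * sizeof(double);   // nominal traffic, ignoring write-allocate
    std::printf("threads=%d  time=%.3f s  ~%.1f GB/s\n",
                omp_get_max_threads(), t1 - t0, bytes / (t1 - t0) / 1e9);
    return 0;
}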

Initializing Billion Integers to value 1

What is a good POSIX thread design to initialize a billion integers using C/C++ on a Linux platform with an 8-core CPU and 32 GB of DRAM?
Thanks for your help.
This is a trivial operation and you need not consider multi-threading. Just do it with a memcpy in a single thread.
The exact number of threads will not be such a limiting factor, but sometimes for these questions it is worth overcommitting, say using 2 threads per physical core.
But the real bottleneck will be I/O, writing the data into RAM. You'd have to take care that the data to be replaced is never read before you erase it. Then you should ensure that writes to memory happen in large chunks and (if possible) as "write through"; modern CPUs have instructions for the latter.
Usually something like memcpy from a fixed-size buffer (a few pages) that contains the pattern you want to see is optimized quite well.
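A minimal sketch of that pattern-buffer approach (buffer size and names are illustrative): fill a small buffer with the value 1 once, then tile it over the big array with memcpy.
// Tile a small all-ones pattern buffer over the destination with memcpy.
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <vector>

void fill_with_ones(int* dst, std::size_t count) {
    const std::size_t patElems = 4096;            // a few pages' worth of ints (assumption)
    std::vector<int> pattern(patElems, 1);        // the pattern: all ones

    std::size_t done = 0;
    while (done < count) {
        const std::size_t chunk = std::min(patElems, count - done);
        std::memcpy(dst + done, pattern.data(), chunk * sizeof(int));
        done += chunk;
    }
}

int main() {
    std::vector<int> big(100000000);              // 100 million ints for the demo (~400 MB)
    fill_with_ones(big.data(), big.size());
    return 0;
}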
What is that for? Depending on usage, the following scenario might work: you initialize one memory page (that's several KB) to all 1's. Then you map that page into the virtual address space as many times as needed with a copy-on-write flag. This way, on reading you'll get all ones from all those virtual pages, on writing the system will allocate more physical pages as needed.
Perhaps a divide and conquer algorithm? Partition the memory containing the integers by some number corresponding to the number of threads optimal for your system. Then launch one thread per partition which initializes all of its integers.
If you do attempt multithreading, aligning your writes with the native cache line size will likely provide optimal memory throughput. As everyone says, the memory throughput will dominate the performance but there is some portion of CPU time required for these writes. Minimizing that time with multithreading and vectorized instructions may be helpful.
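A minimal sketch of that partitioned, per-thread initialization (thread count and sizes are illustrative, not tuned). Note that new int[n] leaves the memory uninitialized, so the first touch of each slice happens on the thread that fills it.
// Divide-and-conquer initialization: one contiguous slice per thread.
#include <algorithm>
#include <cstddef>
#include <memory>
#include <thread>
#include <vector>

int main() {
    const std::size_t n = 1000000000;                     // one billion ints (~4 GB)
    std::unique_ptr<int[]> data(new int[n]);              // uninitialized; first touch below

    const unsigned nThreads = std::max(1u, std::thread::hardware_concurrency());
    const std::size_t chunk = (n + nThreads - 1) / nThreads;

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nThreads; ++t) {
        const std::size_t begin = std::min(n, t * chunk);
        const std::size_t end   = std::min(n, begin + chunk);
        workers.emplace_back([&data, begin, end] {
            std::fill(data.get() + begin, data.get() + end, 1);   // this thread's slice
        });
    }
    for (auto& w : workers) w.join();
    return 0;
}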
The real answer is to profile your system (since you stated a very specific target, it sounds like you don't want to design a balanced algorithm which is good enough for most targets). Modern CPUs that have access to 32 GB of DRAM often have hardware performance counters (Intel's and AMD's do) which make finding out CPU and caching activity pretty easy.