Memset too slow on large data. Any alternatives? - c++

I have a large cv::Mat with dimensions (100,32768). I update it for every frame in a video stream. Before updating, I need to set everything back to zero so I execute
memset(myMat.data, 0, 100 * 32768 * sizeof(int));
which takes 5ms on average.
Surprisingly (at least to me), in debug mode I get the same times, if not faster ones. While I'd appreciate an explanation of why this happens (Google gives me loads of reasons, so I will eventually figure it out), what I really need is a faster alternative. Is there anything I can do?

DDR4-3000 or so caps out at a bit under 100 GB/s; DDR3-800 is 6.4 GB/s.
Your speed is about 12 MB/5ms, or 2.4 GB/s.
So depending on your RAM, you might be near max speed for your hardware. A factor of 2 ain't bad.
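The arithmetic behind that estimate can be checked with a few lines. This is only a sketch: the helper name bandwidth_gb_per_s and the constant kBufferBytes are made up for illustration, and a 4-byte int is assumed. With those assumptions the buffer is about 13.1 MB, so 5 ms works out to roughly 2.6 GB/s, the same ballpark as the figure above.

```cpp
#include <cassert>
#include <cstddef>

// Effective bandwidth in GB/s (1 GB = 1e9 bytes) for moving `bytes` in `seconds`.
// Hypothetical helper, not part of OpenCV or the standard library.
double bandwidth_gb_per_s(std::size_t bytes, double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(bytes) / seconds / 1e9;
}

// Size of the buffer from the question: 100 x 32768 elements of sizeof(int) bytes.
constexpr std::size_t kBufferBytes = 100u * 32768u * sizeof(int);
```

Calling bandwidth_gb_per_s(kBufferBytes, 0.005) gives the rate to compare against the peak bandwidth of your own RAM.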
You are working on a modestly sparse array. It is possible that a non-contiguous buffer would be a better plan, depending on how your data is arranged. Also, GPUs tend to have higher internal memory bandwidth than CPUs; moving your work there could help.
The problem could also be latency; maybe clearing one buffer in another thread while using another would help.
Massively reducing your memory usage, and making it more local, may have a larger impact than you expect. It is plausible that your non-zeroing code is already constrained by RAM speed.
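The idea of clearing one buffer in another thread while using the other could look roughly like the sketch below. The class name DoubleBuffer and its interface are hypothetical, it uses plain std::vector<int> rather than cv::Mat, and it assumes the per-frame work takes at least as long as the memset so the background clear is usually finished by the time it is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <future>
#include <utility>
#include <vector>

// Hypothetical helper: two equally sized buffers. One is used for the current
// frame while the other one is zeroed on a background thread.
class DoubleBuffer {
public:
    explicit DoubleBuffer(std::size_t count) : front_(count, 0), back_(count, 0) {
        assert(count > 0);
    }

    // Wait for a background clear that may still be running.
    ~DoubleBuffer() {
        if (pending_.valid()) pending_.get();
    }

    // Buffer for the current frame. It is all zeros right after construction
    // and right after every call to swap().
    std::vector<int>& front() { return front_; }

    // Call once per frame when you are done with front(): the buffer that was
    // cleared in the background becomes the new front, and the dirty one
    // starts being cleared.
    void swap() {
        if (pending_.valid()) pending_.get();  // wait until the old clear finished
        std::swap(front_, back_);
        int* data = back_.data();
        const std::size_t count = back_.size();
        pending_ = std::async(std::launch::async, [data, count] {
            std::memset(data, 0, count * sizeof(int));
        });
    }

private:
    std::vector<int> front_;
    std::vector<int> back_;
    std::future<void> pending_;
};
```

This doubles the memory footprint and does not reduce the total memory traffic; it only moves the clearing off the critical path, so it helps only if latency rather than bandwidth is the bottleneck.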

Related

Is this behavior showing that I have a memory problem?

I have an LP problem with ~4 million variables and ~4 million constraints. I use Gurobi to solve it. My PC has 4 cores and 8 GB of memory.
According to the log file, it takes ~100 seconds to find the optimal solution. Then the CPU is released, but almost all of the memory is still in use. It hangs there, doing nothing for hours, until it continues to run the script (e.g. the print command) after the solve.
results = opt.solve(model, tee=True)
print("model solved")
I used the barrier method with crossover disabled, which worked best. I also tried different numbers of threads; it turned out that using 4 is best in terms of the hanging time (but it still takes hours).
This hanging significantly increases the total run time, which is not desired.
I plan to upgrade the memory, but want to get answers from the community that it indeed is a memory issue. Is this a memory problem?
Likely the problem does not fit in memory and virtual memory (i.e. disk) is used. This is called thrashing when it is really bad. It can bring your machine to its knees. Depending on the number of nonzeros in the problem, the presolve statistics and the number of threads you are using, you need at least 16 GB (and may be more like 32 GB).
Also: try to reduce the number of threads Gurobi is using. It may be better to use 1 thread (after benchmarking which LP algorithm works best: primal simplex, dual simplex, or a barrier method). By default a concurrent LP method is used: it runs different LP solvers in parallel, which significantly increases the memory footprint.

Extremely slow ffmpeg/sws_scale() - only on heavy duty

I am writing a video player using ffmpeg (Windows only, Visual Studio 2015, 64 bit compile).
With common videos (up to 4K @ 30FPS), it works pretty well. But with my maximum target, 4K @ 60FPS, it fails. Decoding is still fast enough, but the YUV/BGRA conversion is simply not fast enough, even though it's done in 16 threads (one thread per frame on a 16/32 core machine).
So as a first countermeasure I skipped the conversion of some frames and got a stable frame rate of ~40 that way. Comparing the two versions in Concurrency Visualizer, I found a strange issue whose cause I don't understand.
Here's an image of the frameskip version:
You see that the conversion is pretty quick (average roughly ~35ms)
Thus, as multiple threads are used, it also should be quick enough for 60FPS, but it isn't!
The image of the non-frameskip version shows why:
The conversion of a single frame has become ten times slower than before (average roughly ~350ms). Now a heavy workload on many cores would of course cause a minor slowdown per core due to reduced turbo - let's say 10 or 20%. But never an extreme slowdown of ~1000%.
An interesting detail is that the stack trace of the non-frameskip version shows some system activity I don't really understand, beginning with ntoskrnl.exe!KiPageFault+0x373. There are no exceptions, other error messages or such; it just becomes extremely slow.
Edit: A colleague just told me that this looks like a memory problem with paged-out memory at first glance - but my memory utilization is low (below 1GB, and more than 20GB free)
Can anyone tell me what could be causing this?
This is probably too old to be useful, but just for the record:
What's probably happening is that you're allocating 4K frames over and over again in multiple threads. The Windows allocator really doesn't like that access pattern.
The malloc itself will not show up in the profiler, because the OS only fetches the pages when the memory is actually accessed. This shows up as ntoskrnl.exe!KiPageFault and gets attributed to the function that first accesses the new memory.
Solutions include:
Using a different allocator (e.g. tbb_malloc, mimalloc, etc.)
Using your own per-thread or per-process frame pool. ffmpeg does something similar internally; maybe you can just use that.
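A per-process frame pool along those lines might be sketched as follows. FramePool and its methods are hypothetical names, not an ffmpeg API; the point is only that buffers are handed back and reused, so their pages have already been faulted in and the conversion threads stop hitting KiPageFault for every frame.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical pool that recycles frame-sized buffers instead of allocating
// a fresh one for every frame.
class FramePool {
public:
    using Buffer = std::unique_ptr<std::uint8_t[]>;

    explicit FramePool(std::size_t frame_bytes) : frame_bytes_(frame_bytes) {
        assert(frame_bytes > 0);
    }

    // Returns a recycled buffer if one is available, otherwise allocates one.
    Buffer acquire() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (!free_.empty()) {
                Buffer buffer = std::move(free_.back());
                free_.pop_back();
                return buffer;
            }
        }
        return Buffer(new std::uint8_t[frame_bytes_]);
    }

    // Hands a buffer back so that its already-faulted pages get reused.
    void release(Buffer buffer) {
        std::lock_guard<std::mutex> lock(mutex_);
        free_.push_back(std::move(buffer));
    }

    std::size_t frame_bytes() const { return frame_bytes_; }

private:
    std::size_t frame_bytes_;
    std::mutex mutex_;
    std::vector<Buffer> free_;
};
```

A single mutex-protected pool is the simplest variant; a per-thread pool would avoid the lock at the cost of holding more idle buffers.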

C++ Reading from several sections of a file is too slow

I need to read byte arrays from several locations of a big file.
I have already optimized the file so that as few sections as possible have to be read, and the sections are as close together as possible.
I have 20 calls like this one:
m_content.resize(iByteCount);
fseek(iReadFile,iStartPos ,SEEK_SET);
size_t readElements = fread(&m_content[0], sizeof(unsigned char), iByteCount, iReadFile);
iByteCount is around 5000 on average.
Before using fread, I used a memory-mapped file, but the results were approximately the same.
My calls are still too slow (around 200 ms) when called for the first time. When I repeat the same call with the same sections of bytes to read, it is very fast (around 1 ms), but that does not really help me.
The file is big (around 200 MB).
After this call, I have to read double values from a different section of the file, but I cannot avoid this.
I don't want to split it up in 2 files. I have seen the "huge file approach" used by other people, too, and they overcame this problem somehow.
If I use memory-mapping, the first read is always slow. If I then repeat reading from this section, it is lightning fast. When I then read from a different section, it is slow the first time, but then lightning fast the second time.
I have no idea why this is so.
Does anybody have any more ideas for me?
Thank you.
Disk drives have two (actually three) factors that limit their speed: access time, sequential bandwidth, and bus latency/bandwidth.
What you feel most is access time. Access time is typically in the millisecond ballpark. Having to do a seek takes upwards of 5 (often more than 10) milliseconds on a typical hard disk. Note that the number printed on a disk drive is the "average" time, not the worst time (and in some cases it seems to be much closer to "best" than "average").
Sequential read bandwidth is typically upwards of 60-80 MiB/s even for a slow disk, and 120-150 MiB/s for a faster disk (or >400 MiB/s on solid state). Bus bandwidth and latency are something you usually don't care about, as bus speed usually exceeds the drive speed (except if you use a modern solid state disk on SATA-2, a 15k hard disk on SATA-1, or any disk over USB).
Also note that you cannot change the drive's bandwidth, nor the bus bandwidth. Nor can you change the seek time. However, you can change the number of seeks.
In practice, this means you must avoid seeks as much as you can. If that means reading in data that you do not need, do not be afraid of doing so. It is much faster to read 100 kiB than to read 5 kiB, seek ahead 90 kiB, and read another 5 kiB.
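As an illustration of trading extra bytes for fewer seeks, the sketch below reads the single byte range that covers all requested sections with one fseek and one fread, then slices the sections out of it. The Section struct and the function name are hypothetical, and the approach only makes sense when the sections are reasonably close together, as the question says they are.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical descriptor of one section to read.
struct Section {
    long offset;
    std::size_t length;
};

// Reads the one byte range covering all sections with a single fseek + fread,
// then copies each section out of it. Returns an empty result on I/O failure.
std::vector<std::vector<unsigned char>>
read_sections_coalesced(std::FILE* file, const std::vector<Section>& sections) {
    assert(file != nullptr);
    std::vector<std::vector<unsigned char>> out;
    if (sections.empty()) return out;

    // Find the smallest range [first, last) that contains every section.
    long first = sections[0].offset;
    long last = first;
    for (const Section& s : sections) {
        first = std::min(first, s.offset);
        last = std::max(last, s.offset + static_cast<long>(s.length));
    }

    std::vector<unsigned char> span(static_cast<std::size_t>(last - first));
    if (std::fseek(file, first, SEEK_SET) != 0) return out;
    if (std::fread(span.data(), 1, span.size(), file) != span.size()) return out;

    for (const Section& s : sections) {
        auto begin = span.begin() + (s.offset - first);
        out.emplace_back(begin, begin + static_cast<std::ptrdiff_t>(s.length));
    }
    return out;
}
```

How large a gap is still worth reading through depends on the drive, so the cut-off between "one big read" and "several small reads" is something to measure.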
If you can, read the whole file in one go, and only use the parts you are interested in. 200 MiB should not be a big hindrance on a modern computer. Reading in 200 MiB with fread into an allocated buffer might however be prohibitive (that depends on your target architecture and on what else your program is doing). But don't worry, you already have the best solution to the problem: memory mapping.
While memory mapping is not a "magic accelerator", it is nevertheless as close to "magic" as you can get.
The big advantage of memory mapping is that you can directly read from the buffer cache. Which means that the OS will prefetch pages, and you can even ask it to more aggressively prefetch, so effectively all your reads will be "instantaneous". Also, what is stored in the buffer cache is in some sense "free".
Unluckily, memory mapping is not always easy to get right (especially since the documentation and the hint flags typically supplied by operating systems are deceptive or counter-productive).
While you have no guarantee that what has been read once stays in the buffers, in practice this is the case for anything of "reasonable" size. Of course the operating system cannot and will not keep a terabyte of data in RAM, but something around 200 MiB will quite reliably stay in the buffers on a "normal" modern computer. Reading from buffers works more or less in zero time.
So, your goal is to get the operating system to read the file into its buffers, as sequentially as possible. Unless the machine runs out of physical memory so it is forced to discard buffer pages, this will be lightning fast (and if that happens, every other solution will be equally slow).
Linux has the readahead syscall which lets you prefetch data. Unluckily, it blocks until the data has been fetched, which is probably not what you want (you would thus have to use an extra thread for this). madvise(MADV_WILLNEED) is a less reliable, but probably better alternative. posix_fadvise may work too, but note that Linux limits the readahead to twice the default readahead size (i.e. 256 kiB).
Do not let yourself be fooled by the docs, as the docs are deceptive. It may seem that MADV_RANDOM is a better choice, as your access is "random". It makes sense to be honest to the OS about what you're doing, doesn't it? Usually yes, but not here. This simply turns off prefetching, which is the exact opposite of what you really want. I don't know the rationale behind this, maybe some ill-advised attempt to conserve memory; in any case it is detrimental to your performance.
Windows (since Windows 8, for desktop only) has PrefetchVirtualMemory which does exactly what one would want here, but unluckily it's only available on the newest version. On older versions, there is just... nothing.
A very easy, efficient, and portable way of populating the pages in your mapping is to launch a worker thread that faults every page. This sounds horrendous, but it works very nicely, and is operating-system agnostic.
Something like volatile int x = 0; for(int i = 0; i < len; i += 4096) x += map[i]; is entirely sufficient. I am using such code to pre-fault pages prior to accessing them; it works at speeds unrivalled by any other method of populating buffers and uses very little CPU.
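Written out as a function, the loop quoted above might look like this. The name prefault_pages is made up, and the 4096-byte page size is an assumption carried over from the quoted loop; query the real page size from the OS if it matters.

```cpp
#include <cassert>
#include <cstddef>

// Touches one byte in every page of [map, map + len) so that the OS has to
// bring each page in. Returns the number of pages touched.
// Hypothetical helper; the default page size of 4096 is an assumption.
std::size_t prefault_pages(const unsigned char* map, std::size_t len,
                           std::size_t page_size = 4096) {
    assert(page_size > 0);
    assert(map != nullptr || len == 0);
    volatile unsigned char sink = 0;  // volatile keeps the reads from being optimized away
    std::size_t pages = 0;
    for (std::size_t i = 0; i < len; i += page_size) {
        sink = map[i];
        ++pages;
    }
    (void)sink;
    return pages;
}
```

To get the behaviour described in the answer, run it on a worker thread (for example with std::thread) right after mapping the file, and keep the mapping alive until that thread has finished.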
(moved to an answer as requested by the OP)
You cannot read from a file any quicker (there is no magic flag to say "read faster"). Either there is an issue with your hardware, or 200 ms is how long it is supposed to take.
1) The difference in access speed between your first read and subsequent ones is perfectly understandable: your first call actually reads the file from the disk, and this takes time. However, your kernel (not to mention the disk controller) keeps the accessed data buffered, so when you access it a second time it is a pure memory access (1 ms).
Even if you only need to access really tiny portions of the file, libc/kernel/controller optimizations access the disk in quite large chunks. You can read the libc/OS/controller docs to try to align your reads on these chunks.
2) You're using stream input; try using the direct open/read/close functions: low-level I/O has less overhead (obviously). Nothing gets faster than this, so if you still find this too slow, you have an OS or hardware issue.
As it looks like you have a good benchmark, try switching the size and count arguments in your fread call: reading 1 element of 1000 bytes will be faster than reading 1000 elements of 1 byte.
Disk is slow, and as you pointed out, the delay comes from the first access - that's the disk spinning up and accessing the sectors necessary. You're always going to pay that cost one time.
You could improve your performance a little by using memory mapped IO. See either mmap (Linux) or CreateFileMapping+MapViewOfFile (Windows).
I have already optimized the file so that as few sections as possible have to be read
Correct me if I'm wrong, but in reference to the file being optimised, I'm assuming you mean you've ordered the sections to minimize the number of reads that take place and not what I'm going to suggest.
Being bound by IO here is likely due to the seek times, so other than getting a faster storage medium, your options are limited.
Two possible ideas I had are:
1) Compress the data that is stored, which may give you slightly faster read times, but will still not help with seek time. You'd have to test if this benefits at all.
2) If relevant, as soon as you've retrieved one block of data, move it to a thread and start processing it while another read takes place. You may be doing this already, but if not, I thought it worth mentioning.

CUDA - operations on single elements of a matrix - getting ideas

I'm about to write a CUDA kernel to perform a single operation on every element of a matrix (e.g. taking the square root of every element, exponentiation, or calculating the sine/cosine if all the numbers are in [-1;1], etc.).
I chose the blocks/threads grid dimensions and I think the code is pretty straightforward and simple, but I'm asking myself... what can I do to maximize coalescence/SM occupancy?
My first idea was making each half-warp (16 threads) load data ensemble from global memory and then putting them all to compute, but it turns out that there is not enough overlap between memory transfers and calculations. I mean, all threads load data, then compute, then load data again, then compute again... this sounds really poor in terms of performance.
I thought using shared memory would be great, maybe using some sort of locality to make a thread load more data than it actually needs to facilitate other threads' work, but this sounds stupid too because the second would wait for the former to finish loading data before starting its work.
I'm not really sure I gave the right idea regarding my problem, I'm just getting ideas before commencing to work on something concrete.
Any comment, suggestion, or criticism is welcome, and thanks.
If you have defined the grid so that threads read along the major dimension of the array containing your matrix, then you have already guaranteed coalesced memory access, and there is little else to be done to improve performance. These sorts of O(N) complexity operations really do not contain sufficient arithmetic intensity to give a good parallel speedup over an optimized CPU implementation. Often the best strategy is to fuse multiple O(N) operations together into a single kernel to improve the ratio of FLOPs to memory transactions.
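The fusion idea is not specific to CUDA, so here is a plain C++ illustration rather than kernel code. The two function names and the choice of square root followed by sine are assumptions picked from the operations mentioned in the question. The unfused version loads and stores every element twice; the fused version does the same arithmetic with one load and one store per element.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Two separate O(N) passes: every element is loaded and stored twice.
void sqrt_then_sin_two_passes(std::vector<float>& values) {
    for (float& x : values) x = std::sqrt(x);
    for (float& x : values) x = std::sin(x);
}

// Fused version: one load and one store per element, same arithmetic.
void sqrt_then_sin_fused(std::vector<float>& values) {
    for (float& x : values) {
        assert(x >= 0.0f);  // the square root needs a non-negative input
        x = std::sin(std::sqrt(x));
    }
}
```

In a CUDA kernel the same change means one global-memory read and one write per element instead of two of each, which is where the improved FLOP to memory transaction ratio comes from.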
In my eyes your problem is this
load data ensemble from global memory
It seems that your algorithm idea is:
1) Do something on the CPU - have some matrix
2) Transfer the matrix from global to device memory
3) Perform your operation on every element
4) Transfer the matrix back from device to global memory
5) Do something else on the CPU - sometimes go back to 1.
These kinds of computations are almost always I/O-bandwidth limited (I/O = memory I/O), not compute limited. GPGPU computations can sustain a very high memory bandwidth, but only from device memory to the GPU; transfers from global memory always go over the very slow PCIe bus (slow compared to the device memory connection, which can deliver 160 GB/s or more on fast cards). So one main thing for getting good results is to keep the data (the matrix) in device memory, and preferably even generate it there if possible (this depends on your problem). Never migrate data back and forth between CPU and GPU, as the transfer overhead eats up all your speedup. Also keep in mind that your matrix must have a certain size to amortize the transfer overhead that you can't avoid (computing on a matrix with 10 x 10 elements would gain almost nothing; heck, it would even cost more).
Alternating transfer/compute/transfer is perfectly OK, that's how such GPU algorithms work, but only if the transfer is from device memory.
The GPU for something this trivial is overkill and will be slower than just keeping it on the CPU. Especially if you have a multicore CPU.
I have seen many projects showing the "great" advantages of the GPU over the CPU. They rarely stand up to scrutiny. Of course, goofy managers who want to impress their own managers want to show how "leading edge" their group is.
Someone in the department toils for months on getting silly GPU code optimized (which is generally 8x harder to read than the equivalent CPU code), then has the "equivalent" CPU code written by the cheapest contractor available (the programmer whose last project was PGP), compiles it with the slowest version of gcc they can find, with no optimization, and then touts the 2x speed improvement. And BTW, many overlook I/O speed as somehow not important.

What is the ideal memory block size to use when copying?

I am currently using 100 megabytes per memory block to copy large files.
Is there a "good" amount that people normally use?
Edit
Thanks for all the great responses.
I'm still quite new to these concepts so I'll try to understand a lot of the ones that have been said (e.g. write back cache). I keep learning new things :)
A block size between 4 KB and 32 KB is the typical choice. Using 100 MB is counter-productive: you are occupying RAM with a buffer when that RAM could be put to much better use as the file system writeback cache.
Copying files is very fast when the file fits completely in the cache: the WriteFile() call is then a simple memory-to-memory copy, and the cache manager lazily writes the data out to the disk. But when there's no more room in the cache, the copy speed drops off a cliff, because WriteFile() has to wait for space to be made available. It now goes at disk write speeds.
I would recommend that you benchmark this, and remember to include much smaller block sizes. In my own tests on this, I got quite counterintuitive results.
When reading and writing from the hard drive, all (power of two) block sizes between 512 bytes and 512 kB gave the same speed. Increasing the block size from 512 kB to 1 MB reduced the copying speed to about 60%. Increasing the block size further increased the speed again, but never all the way back to the speed of using small blocks.
When all the copied data was in the cache memory, the (much faster) copying speed improved with increasing block sizes, flattening out at around 32 kB blocks, and then suddenly dropped to about half the previous speed when going from 256 kB to 512 kB blocks, never to return to the previous speeds.
After this test, I dropped read/write block sizes in several of my programs from around 1 MB to 32 kB.
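For trying out different block sizes, a copy loop with the block size as a parameter is enough. The sketch below uses portable stdio rather than the ReadFile()/WriteFile() calls the answers above have in mind, so stdio's own buffering sits on top of the OS cache; the function name copy_stream and the 32 kB default are assumptions taken from the numbers above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <vector>

// Copies everything from `in` to `out` using a buffer of `block_size` bytes.
// Returns the number of bytes copied, or -1 on a read or write error.
// Hypothetical helper; the default of 32 kB follows the measurements above.
long long copy_stream(std::FILE* in, std::FILE* out,
                      std::size_t block_size = 32 * 1024) {
    assert(in != nullptr && out != nullptr && block_size > 0);
    std::vector<unsigned char> buffer(block_size);
    long long total = 0;
    std::size_t n;
    while ((n = std::fread(buffer.data(), 1, buffer.size(), in)) > 0) {
        if (std::fwrite(buffer.data(), 1, n, out) != n) return -1;
        total += static_cast<long long>(n);
    }
    return std::ferror(in) ? -1 : total;
}
```

Timing this function for a range of block_size values on the target machine, with both cached and uncached source files, is the kind of benchmark the answers recommend.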
There's generally little benefit in using blocks that large.
Suppose your operating system is super-naive and every read or write operation incurs a hard disc seek (in practice you will often find that writes get queued and reads get read-ahead-buffered, reducing the benefit of using large buffers in your application code).
Then every block costs you (say) 2x10ms for two seeks (one to read and one to write) and there's little point increasing your block size once the time for the actual reading and writing is substantially more than that. A really fast HD might read and write at 150MB/s, in which case that 10ms would correspond to 1.5MB of reading/writing, and you'd be gaining little for blocksizes beyond 15MB.
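The numbers in that estimate can be reproduced with two small helpers. Both names are hypothetical, and the model is deliberately the naive one described above: a fixed seek followed by a transfer at constant bandwidth. With a 10 ms seek and 150 MB/s, a 1.5 MB block spends half of its time seeking, while a 15 MB block spends roughly 9%.

```cpp
#include <cassert>

// Bytes that can be transferred at `bytes_per_second` in the time of one seek.
double bytes_per_seek(double seek_seconds, double bytes_per_second) {
    assert(seek_seconds >= 0.0 && bytes_per_second > 0.0);
    return seek_seconds * bytes_per_second;
}

// Fraction of the time per block that is spent seeking rather than
// transferring, in the naive "one seek per block" model.
double seek_overhead(double block_bytes, double seek_seconds,
                     double bytes_per_second) {
    assert(block_bytes > 0.0 && bytes_per_second > 0.0);
    const double transfer_seconds = block_bytes / bytes_per_second;
    return seek_seconds / (seek_seconds + transfer_seconds);
}
```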
In practice, (1) your seek time will probably be less, (2) your read and write bandwidth will probably be more, and (3) your OS and drive hardware will probably be caching and queueing things for you; you'll probably see little or no benefit from blocksizes above about 100KB.
(You should probably benchmark a variety of blocksizes and see what you get on your own system.)
That's a pretty excessive amount. Consider that you don't even start writing data before reading 100 MB, so the filesystem driver doesn't even have an opportunity to write any of the destination file while you're reading. The disk could be writing parts of the file that happen to pass under the head as it's reading the source file (see elevator seek for example).
I think that it depends on size of free memory that you have.
If you use 100 MB blocks to copy on a machine that has, for example, 30 MB of free memory, then it will take much more time to copy than using a smaller (20 MB) block.
If your copy buffer is larger than the available free memory, then due to virtual memory swapping your copying will be slower than expected.
Given that the drive must seek when it changes tracks, might not a block size of say 63 x 512 = 32256 produce optimum results?