Swap memory speed in Linux - c++

I have a process on 64-bit Linux (Red Hat Enterprise) that loads one million records into memory. Each record is 4 KB, so total memory consumption is about 4 GB.
My computer has 2 GB of RAM and 3 GB of swap, so obviously part of the data ends up in swap. The problem is that traversing all those records takes far too long, and I don't know why. I have a function that loops through each record and does some processing. It works well with about 500,000 records: the function needs only a couple of minutes to finish. However, with double that number, i.e. 1,000,000 records, the same function needs hours. I used the top command to check the CPU load and saw about 90%wa (time spent waiting for I/O). I guess this might be the cause, but I really don't know why it happens.
Thank you very much for any helpful ideas.

Swap area is disk. Disk bandwidth is two or three orders of magnitude lower than memory bandwidth.

There are two options:
1. The process works over the records sequentially. Then it was the stupidest thing on Earth to load them all into memory.
   - If you can fix the process, fix it to only load a bit at a time.
   - If you can't fix the process, you'll have to buy more memory.
2. The process works over the records in random order or multiple times (and can't do otherwise). Well, you'll have to buy more memory.

If you want to use your swap space efficiently, you should make sure that you traverse your data sequentially in contiguous memory blocks, i.e. blocks of several megabytes. That way, when a new chunk is loaded into RAM from swap space, this chunk will contain the next few records as well.

Sounds like either cache or swap thrashing is happening. Check vmstat to verify. You can remedy swap thrashing if you load only as much data as you can fit into memory, process them, load another block, and so on. This way you don't have to impose processing order (random or sequential doesn't matter much). Alternatively, we'd have to have more details on your algorithm / program architecture to comment.
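The "load a block, process it, load the next block" idea can be sketched roughly as follows. This is only an illustration under assumptions: the records are taken to live in a flat binary file of fixed-size 4 KB records (the size mentioned in the question), and `processInBatches`, `kRecordSize` and the `handle` callback are hypothetical names, not part of the original program.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <functional>
#include <string>
#include <vector>

// Assumed record size, taken from the question (4 KB per record).
constexpr std::size_t kRecordSize = 4096;

// Reads `path` in batches of at most `recordsPerBatch` records and calls
// `handle` once per complete record. Only one batch is resident at a time,
// so the working set stays at recordsPerBatch * kRecordSize bytes.
// Returns the number of records processed (0 if the file cannot be opened).
std::size_t processInBatches(
    const std::string& path, std::size_t recordsPerBatch,
    const std::function<void(const char*, std::size_t)>& handle) {
    assert(recordsPerBatch > 0);
    std::ifstream in(path, std::ios::binary);
    if (!in) return 0;

    std::vector<char> batch(recordsPerBatch * kRecordSize);
    std::size_t processed = 0;
    while (in) {
        in.read(batch.data(), static_cast<std::streamsize>(batch.size()));
        const std::size_t got = static_cast<std::size_t>(in.gcount());
        // A trailing partial record, if any, is ignored.
        for (std::size_t off = 0; off + kRecordSize <= got; off += kRecordSize) {
            handle(batch.data() + off, kRecordSize);
            ++processed;
        }
    }
    return processed;
}
```

With 2 GB of RAM, a batch of something like 100,000 records (about 400 MB) would leave plenty of headroom; the right figure depends on what else is running and is best found by measuring.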

The speed of your swap memory depends on the speed of the underlying hardware where the swap resides.
Usually the swap (Windows calls it pagefile.sys, Linux uses one or more swap partitions) lives on one of the hard drives in the system, so it is orders of magnitude slower than RAM.

Before buying more RAM, you could try using part of your RAM as compressed swap. I have heard of compcache, but I have not used it myself. The idea is the following:
If the data you put in RAM can be compressed (let's say at a ratio of 3 to 1),
allocate 1 GB of your 2 GB of RAM to an *in-memory* swap.
You then have 4 GB of low-latency RAM.
I would be curious to know if it increases the number of records you can handle without thrashing.
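The 4 GB figure comes from adding the uncompressed part of RAM to the compressed part multiplied by the compression ratio. A tiny sketch of that arithmetic (the function name is hypothetical, and the 3 to 1 ratio is the assumption made above, not a measured value):

```cpp
#include <cassert>

// Effective capacity in GB when `compressedGb` out of `ramGb` is set aside
// as an in-memory compressed swap that stores data at `ratio` to 1.
double effectiveCapacityGb(double ramGb, double compressedGb, double ratio) {
    assert(compressedGb >= 0.0 && compressedGb <= ramGb && ratio >= 1.0);
    return (ramGb - compressedGb) + compressedGb * ratio;
}
```

For the numbers above, `effectiveCapacityGb(2.0, 1.0, 3.0)` gives 1 GB of plain RAM plus 3 GB of compressed data, i.e. 4 GB. Real data may compress far worse than 3 to 1, in which case the gain shrinks accordingly.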


Should I reserve memory for my process?

I wrote a program that manipulates files. In the idle state, the process occupies ~60 MB of memory. Periodically, say every 2 minutes, the process allocates memory (~40 MB), does some work on the files, then frees the allocated memory. The procedure takes around 10 to 20 seconds. As a result, the memory usage of my process looks like the picture below:
My question is: should I reserve some memory in advance and then use it when I need it? This would make the memory usage trend more stable. And that stability would be better for the system, am I right?
No, any 21st century OS won't blink an eye if you do this. Things might get interesting if you try to allocate more than 4GB per millisecond across 100 CPU cores, but you're not even close to that.
If you allocate the memory in advance, say an additional 50 MB, the graph would be a straight line (i.e. stable), but other programs competing for memory might not get enough and would suffer for it.

Initializing Billion Integers to value 1

What is a good POSIX thread design for initializing a billion integers using C/C++ on Linux, on an 8-core CPU with 32 GB of DRAM?
Thanks for your help.
This is a trivial operation and you need not consider multi-threading. Just do it with a memcpy in a single thread.
The exact number of threads will not be much of a limiting factor, but for this kind of question it is sometimes worth overcommitting, say using 2 threads per physical core.
But the real bottleneck will be I/O: writing the data to RAM. You'd have to make sure that the data being replaced is never read before you overwrite it. Then you should ensure that writes to memory happen in large chunks and (if possible) as "write through"; modern CPUs have instructions for the latter.
Usually something like memcpy with a fixed sized buffer (some pages) that contains the pattern that you want to see should be optimized quite well.
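One way to read the "memcpy with a fixed-size pattern buffer" suggestion is sketched below. The buffer size `kPatternInts` (4096 ints, i.e. 16 KB with 4-byte ints) and the function name are assumptions for illustration; whether this beats a plain `std::fill` depends on the compiler and CPU and would need measuring.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Fills dst[0..count) with `value` by repeatedly copying a small,
// pre-filled pattern buffer with memcpy.
void fillWithPattern(int* dst, std::size_t count, int value) {
    assert(dst != nullptr || count == 0);
    // Assumed pattern size: a few pages' worth of ints; tune by measurement.
    constexpr std::size_t kPatternInts = 4096;
    const std::vector<int> pattern(kPatternInts, value);

    std::size_t done = 0;
    while (done < count) {
        const std::size_t n = std::min(kPatternInts, count - done);
        std::memcpy(dst + done, pattern.data(), n * sizeof(int));
        done += n;
    }
}
```

The pattern buffer is small enough to stay in cache, so each memcpy is essentially a streaming write to the destination.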
What is that for? Depending on usage, the following scenario might work: you initialize one memory page (that's several KB) to all 1's. Then you map that page into the virtual address space as many times as needed with a copy-on-write flag. This way, on reading you'll get all ones from all those virtual pages, on writing the system will allocate more physical pages as needed.
Perhaps a divide and conquer algorithm? Partition the memory containing the integers by some number corresponding to the number of threads optimal for your system. Then launch one thread per partition which initializes all of its integers.
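A minimal sketch of that partitioning, using `std::thread` (which sits on top of POSIX threads on Linux) rather than raw pthread calls. The function name and the choice to pass the thread count as a parameter are assumptions; with GCC or Clang this typically needs to be built with `-pthread`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Splits data[0..count) into `numThreads` contiguous partitions and lets
// one thread fill each partition with `value`.
void parallelFill(int* data, std::size_t count, int value, unsigned numThreads) {
    assert(numThreads > 0);
    // Round up so that the partitions cover the whole range.
    const std::size_t chunk = (count + numThreads - 1) / numThreads;

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < numThreads; ++t) {
        const std::size_t begin = std::min(count, t * chunk);
        const std::size_t end = std::min(count, begin + chunk);
        if (begin == end) break;  // nothing left to hand out
        workers.emplace_back([=] { std::fill(data + begin, data + end, value); });
    }
    for (auto& w : workers) w.join();
}
```

Because each thread writes its own contiguous range, no locking is needed. Since memory bandwidth is likely to be the limit, more threads than cores is unlikely to help much; the best count is something to measure.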
If you do attempt multithreading, aligning your writes with the native cache line size will likely provide optimal memory throughput. As everyone says, the memory throughput will dominate the performance but there is some portion of CPU time required for these writes. Minimizing that time with multithreading and vectorized instructions may be helpful.
The real answer is to profile your system (since you stated a very specific target, it sounds like you don't want to design a balanced algorithm which is good enough for most targets). Modern CPUs which have access to 32GB of DRAM often have hardware performance counters (Intel's and AMD's do) which make finding out CPU, caching activity pretty easy.

What is the ideal memory block size to use when copying?

I am currently using 100 megabytes per memory block to copy large files.
Is there a "good" amount that people normally use?
Thanks for all the great responses.
I'm still quite new to these concepts so I'll try to understand a lot of the ones that have been said (e.g. write back cache). I keep learning new things :)
A block size between 4 KB and 32 KB is the typical choice. Using 100 MB is counter-productive. You are occupying RAM with a buffer that could be put to much better use as the file system writeback cache.
Copying files is very fast when the file fits completely in the cache; the WriteFile() call is then a simple memory-to-memory copy. The cache manager lazily writes the data out to the disk. But when there's no more room in the cache, the copy speed drops off a cliff because WriteFile() has to wait for space to be made available. It now goes at disk write speed.
I would recommend that you benchmark this, and remember to include much smaller block sizes. In my own tests, I got quite counterintuitive results.
When reading from and writing to the hard drive, all (power of two) block sizes between 512 bytes and 512 kB gave the same speed. Increasing the block size from 512 kB to 1 MB reduced the copying speed to about 60%. Increasing the block size further increased the speed again, but never all the way back to the speed of using small blocks.
When all the copied data was in the cache memory, the (much faster) copying speed improved with increasing block sizes, flattening out at around 32 kB blocks, and then suddenly dropped to about half the previous speed when going from 256 kB to 512 kB blocks, never to return to the previous speeds.
After this test, I dropped read/write block sizes in several of my programs from around 1 MB to 32 kB.
There's generally little benefit in using blocks that large.
Suppose your operating system is super-naive and every read or write operation incurs a hard disc seek (in practice you will often find that writes get queued and reads get read-ahead-buffered, reducing the benefit of using large buffers in your application code).
Then every block costs you (say) 2x10ms for two seeks (one to read and one to write) and there's little point increasing your block size once the time for the actual reading and writing is substantially more than that. A really fast HD might read and write at 150MB/s, in which case that 10ms would correspond to 1.5MB of reading/writing, and you'd be gaining little for blocksizes beyond 15MB.
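The 1.5 MB figure is simply bandwidth multiplied by seek time. As a small sketch (the function name is made up for illustration):

```cpp
#include <cassert>
#include <cstdint>

// Number of bytes the drive could have transferred in the time one seek
// takes: bandwidth (bytes/s) * seek time (ms) / 1000.
std::uint64_t bytesPerSeek(std::uint64_t bytesPerSecond, std::uint64_t seekMs) {
    // Guard against overflow in the multiplication below.
    assert(seekMs == 0 || bytesPerSecond <= UINT64_MAX / seekMs);
    return bytesPerSecond * seekMs / 1000;
}
```

`bytesPerSeek(150000000, 10)` is 1,500,000 bytes. A block ten times that size (15 MB) spends roughly ten times as long transferring as seeking, which is why larger blocks gain little.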
In practice, (1) your seek time will probably be less, (2) your read and write bandwidth will probably be more, and (3) your OS and drive hardware will probably be caching and queueing things for you; you'll probably see little or no benefit from blocksizes above about 100KB.
(You should probably benchmark a variety of blocksizes and see what you get on your own system.)
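A rough harness for such a benchmark might look like the sketch below. It is portable C++ using iostreams rather than the Win32 calls mentioned elsewhere in this thread, and `timedCopy` is a hypothetical helper. The timings include the effect of the OS file cache, so repeated runs on the same file can look much faster than the first one.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Copies `src` to `dst` using a buffer of `blockSize` bytes.
// Returns the elapsed time in seconds, or -1.0 if a file could not be
// opened or a write failed.
double timedCopy(const std::string& src, const std::string& dst,
                 std::size_t blockSize) {
    assert(blockSize > 0);
    std::ifstream in(src, std::ios::binary);
    std::ofstream out(dst, std::ios::binary | std::ios::trunc);
    if (!in || !out) return -1.0;

    std::vector<char> buffer(blockSize);
    const auto start = std::chrono::steady_clock::now();
    while (in) {
        in.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        out.write(buffer.data(), in.gcount());  // gcount() may be a short last block
    }
    out.flush();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return out ? elapsed.count() : -1.0;
}
```

Calling it in a loop over block sizes such as 4 KB, 32 KB, 256 KB, 1 MB and 100 MB, each on a file larger than RAM, gives a first impression of where the curve flattens on a given machine.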
That's a pretty excessive amount. Consider that you don't even start writing data before reading 100 MB, so the filesystem driver doesn't even have an opportunity to write any of the destination file while you're reading. The disk could be writing parts of the file that happen to pass under the head as it's reading the source file (see elevator seek for example).
I think that it depends on the amount of free memory that you have.
If you use 100 MB blocks to copy on a machine that has, for example, 30 MB of free memory, then it will take much longer to copy than with smaller (20 MB) blocks.
If your copy buffer is larger than the available free memory, then virtual memory swapping will make your copying slower than expected.
Given that the drive must seek when it changes tracks, might not a block size of say 63 x 512 = 32256 produce optimum results?

Memory mapped files performance - memory management when working with large data sets

I have a situation where I need to work with a number (15-30) of large (several hundred MB each) data structures. They won't fit into memory all at the same time. To make things worse, the algorithms operating on them work across all those structures, i.e. not first one, then the other, etc. I need to make this as fast as possible.
So I figured I'd allocate memory on disk, in files that are basically direct binary representations of the data when it's loaded into memory, and use memory mapped files to access the data. I use mmap 'views' of for example 50 megabytes (50 mb of the files are loaded into memory at a time), so when I have 15 data sets, my process uses 750 mb of memory for the data. Which was OK initially (for testing), when I have more data I adjust the 50 mb down at the cost of some speed.
However this heuristic is hard-coded for now (I know the size of the data set I will test with). 'In the wild', my software will need to be able to determine the 'right' amount of memory to allocate to maximize performance. I could say 'I will target a memory use of 500 mb' and then divide 500 by the amount of data structures to come to a mmap view size. I have found that when trying to set this 'target memory usage' too high, that the virtual memory manager disk thrashing will (almost) lock up the machine and render it unusable until the processing finishes. This is to be avoided in my 'production' solution.
So my questions, all somewhat different approaches to the problem:
What is the 'best' target size for a single process? Should I just try to max out the 2 GB that I have (assuming 32-bit Win XP and up, non-/3GB for now) or try to keep my process size smaller so that my software won't hog the machine? When I have two Visual Studio instances, Outlook and Firefox open on my machine, those easily use 1/2 GB of virtual memory by themselves; if I let my software use 2 GB of virtual memory, the swapping will severely slow down the machine. But then how do I determine the 'best' process size?
What can I do to keep performance of the machine in check when working with memory-mapped files? My application does fairly simple numerical operations on the data, which basically means that it zips over hundreds of megabytes of data real quick, causing the whole memory-mapped files (several gigabytes) to be loaded into memory and swapped out again very quickly, again and again (think Monte Carlo style simulation).
Is there any chance that not using memory-mapped files and just using fseek/fgets is going to be faster or less intrusive than using memory mapped files?
Any articles, papers or books I can read about this? Either with 'cookbook' style solutions or fundamental concepts.
It occurs to me that you could set some predefined threshold for "too darn slow" and use the computer's wall-clock to make your alterations on the fly.
Start conservatively low. If the processing time is below your "too darn slow" threshold, bump the size up a little bit for the next file. Do this iteratively. When you go above the threshold, slowly back the size off, again iteratively.
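That feedback loop could be sketched as a small controller like the one below. Everything here is an assumption for illustration: the class name, the fixed step size, and the idea of feeding it the wall-clock time of the previous pass; units of the size are whatever the caller uses (e.g. MB of mmap view).

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>

// Grows a size while passes finish under the "too slow" threshold and
// shrinks it again once a pass takes longer than that.
class AdaptiveSize {
public:
    AdaptiveSize(std::size_t start, std::size_t step, std::size_t minSize,
                 std::size_t maxSize, std::chrono::milliseconds tooSlow)
        : size_(start), step_(step), min_(minSize), max_(maxSize),
          tooSlow_(tooSlow) {
        assert(minSize <= start && start <= maxSize && step > 0);
    }

    std::size_t current() const { return size_; }

    // Report how long the last pass took; returns the size to use next.
    std::size_t update(std::chrono::milliseconds elapsed) {
        if (elapsed < tooSlow_) {
            size_ = std::min(max_, size_ + step_);  // fast enough: grow
        } else {
            // Too slow: back off by one step, but never below the minimum.
            size_ = (size_ - min_ > step_) ? size_ - step_ : min_;
        }
        return size_;
    }

private:
    std::size_t size_, step_, min_, max_;
    std::chrono::milliseconds tooSlow_;
};
```

The caller would time each pass with `std::chrono::steady_clock` and pass the duration to `update`. A fixed step is the simplest choice; backing off faster than growing would be a reasonable variation if thrashing is expensive.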
I think it's a good place to try Address Windowing Extensions: http://msdn.microsoft.com/en-us/library/aa366527(v=VS.85).aspx
It will allow you to use more than 4 GB of memory by providing a sliding window. The drawback is that not all versions of Windows have it.
I probably wouldn't use a memory-mapped file for this app. Memory-mapped files work best when you have a large virtual address space (at least relative to the size of the data you're processing). You map the entire file, and let the OS decide which pieces remain resident.
However, if you're repeatedly mapping and unmapping segments of the file (rather than the entire file), you'll probably end up doing just as well by reading chunks via fseek and fread -- note, however, that you do not want to read individual pieces of data this way (ie, do one large read rather than a lot of small reads).
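A minimal sketch of "one large read" with fseek and fread, assuming a hypothetical helper named `readChunk`. Note that `std::fseek` takes a `long` offset, which limits it to 2 GB on platforms where `long` is 32 bits; files larger than that need a platform-specific 64-bit seek instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <vector>

// Reads up to `length` bytes starting at `offset` using a single fread
// call. Returns fewer bytes near the end of the file, and an empty vector
// if the file cannot be opened or the seek fails.
std::vector<char> readChunk(const char* path, long offset, std::size_t length) {
    assert(offset >= 0);
    std::vector<char> chunk;
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return chunk;

    if (std::fseek(f, offset, SEEK_SET) == 0) {
        chunk.resize(length);
        // One large read; shrink to what was actually read.
        chunk.resize(std::fread(chunk.data(), 1, length, f));
    }
    std::fclose(f);
    return chunk;
}
```

In a real loop the file would be opened once and kept open; it is reopened on every call here only to keep the example short.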
The one way that manually segmented memory-mapped files might win is if you have sparse reads: if you'll only be touching, say 10% of a given file. In this case, memory mapping means the OS will read only those pages that are touched, whereas explicit reads will load the entire file.
Oh, and I would definitely not spend time trying to control my resource consumption. The OS will do that better than you can, because it knows about all competing processes.
It will probably be best to fix the size of the memory-mapped file to be some percentage of the total system memory, probably with a set minimum.
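Combined with the question's idea of dividing a memory target by the number of data structures, that rule could be sketched as below. The function and parameter names are hypothetical, and the total physical memory would have to come from the platform (on Windows, for example, from `GlobalMemoryStatusEx`).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Per-data-set view size: a percentage of total physical memory, split
// evenly across the data sets, but never below `minViewBytes`.
std::uint64_t viewSizeBytes(std::uint64_t totalPhysicalBytes, unsigned percent,
                            std::uint64_t minViewBytes, unsigned numDataSets) {
    assert(percent <= 100 && numDataSets > 0);
    // Divide before multiplying to avoid overflow on very large totals.
    const std::uint64_t budget = totalPhysicalBytes / 100 * percent;
    return std::max(minViewBytes, budget / numDataSets);
}
```

For example, 25% of 3,000,000,000 bytes split over 15 data sets gives 50,000,000 bytes per view, which matches the 15 x 50 MB = 750 MB setup described in the question. Note that the minimum can push the total above the budget when there are many data sets.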
Remember that the operating system will effectively load a whole memory page when you access a single byte. This may well happen in the background, but it will only be fast if sequential data accesses tend to be close together.
You should therefore try to keep sequential accesses to your data as close together in memory/the file as possible. You can also look at preloading strategies that access your data speculatively before it is actually required. These are the same considerations that you will need when optimizing for memory cache efficiency.
If sequential data accesses are scattered widely in your file, you may be better off using fseek and fread to access the data since this will give you better fine-grain control of what data is written to memory when.
Also remember that there are no hard and fast rules. Optimizations can sometimes be counter-intuitive so try a whole bunch of different things and see which works best on the platform that this will need to operate on.
Perhaps you can use the /LARGEADDRESSAWARE linker option in Visual Studio, and use bcdedit so that your process can use more than 2 GB of memory.

Staying away from virtual memory in Windows\C++

I'm writing a performance-critical application where it's essential to store as much data as possible in physical memory before dumping to disc.
I can use ::GlobalMemoryStatusEx(...) and ::GetProcessMemoryInfo(...) to find out what percentage of physical memory is reserved\free and how much memory my current process handles.
Using this data I can make sure to dump when ~90% of the physical memory is in use or ~90% of the 2 GB per-application limit is hit.
However, I would like a method for simply finding out how many bytes are actually left before the system starts using virtual memory, especially as the application will be compiled for both 32-bit and 64-bit, and on 64-bit the 2 GB limit doesn't exist.
How about this function:
size_t bytesLeftUntilVMUsed() {
    return 0;
}
It should give the correct result in nearly all cases, I think ;)
Imagine running Windows 7 in 256 MB of RAM (MS suggests 1 GB minimum). That's effectively what you're asking the user to do by wanting to reserve 90% of available RAM.
The real question is: Why do you need so much RAM? What is the 'performance critical' criteria exactly?
Usually, this kind of question implies there's something horribly wrong with your design.
Using top-of-the-range RAM (DDR3) would give you a theoretical transfer speed of 12 GB/s, which equates to reading one 32-bit value every clock cycle with some bandwidth to spare. I'm fairly sure that it is not possible to do anything useful with data coming into the CPU at that speed; instruction processing stalls would interrupt the flow. The extra, unused bandwidth can be used to page data to/from a hard disk. Using RAID, this transfer rate can be quite high (about 1/16th of the RAM bandwidth). So it would be feasible to transfer data to/from the disk and process it without any degradation of performance: 16 cycles between reads is all it would take (OK, my maths might be a bit wrong here).
But if you throw Windows into the mix, it all goes to pot. Your memory can go away at any moment, your application can be paused arbitrarily, and so on. Locking memory in RAM would have adverse effects on the whole system, thus defeating the purpose of locking the memory.
If you explain what you're trying to achieve and the performance criteria, there are many people here who will help develop a suitable solution, because if you have to ask about system limits, you really are doing something wrong.
Even if you're able to stop your application from having memory paged out to disk, you'll still run into the problem that the VMM might be paging out other programs to disk, and that might affect your performance as well. Not to mention that another application might start up and consume memory that you're currently occupying, resulting in some of your application's memory being paged out. How are you planning to deal with that?
There is a way to use non-pageable memory via the non-paged pool but (a) this pool is comparatively small and (b) it's used by device drivers and might only be usable from inside the kernel. It's also not really recommended to use large chunks of it unless you want to make sure your system isn't that stable.
You might want to revisit the design of your application and try to work around the possibility of having memory paged to disk before you either try to write your own VMM or turn a Windows machine into essentially a DOS box with more memory.
The standard solution is to not worry about "virtual" and worry about "dynamic".
The "virtual" part of virtual memory has to be looked at as a hardware function that you can only defeat by writing your own OS.
The dynamic allocation of objects, however, is simply your application program's design.
Statically allocate simple arrays of the objects you'll need. Use those arrays of objects. Increase and decrease the size of those statically allocated arrays until you have performance problems.
Ouch. Non-paged pool (the amount of RAM which cannot be swapped or allocated to processes) is typically 256 MB. That's 12.5% of RAM on a 2 GB machine. If another 90% of physical RAM were allocated to a process, that would leave -2.5% for all other applications, services, the kernel and drivers. Even if you allocated only 85% to your app, that would still leave only 2.5% = 51 MB.
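The percentages above can be checked with a couple of lines. The function name is made up, and the 256 MB pool size is the typical figure quoted above, not a value queried from the system.

```cpp
#include <cassert>

// Percentage of RAM left for everything else once the non-paged pool and
// the application's share have been subtracted.
double remainingPercent(double ramMb, double nonPagedPoolMb, double appPercent) {
    assert(ramMb > 0.0);
    return 100.0 - appPercent - 100.0 * nonPagedPoolMb / ramMb;
}
```

`remainingPercent(2048, 256, 90)` is -2.5 and `remainingPercent(2048, 256, 85)` is 2.5; 2.5% of 2048 MB is about 51 MB.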