How to achieve low memory consumption? - memory-consumption

I want to know which technique antivirus programs use for scanning disk or files and maintaining low memory consumption. They don't affect the user activity either.
I am looking for an approach by which we can achieve disk scanning with low memory consumption.

They don't. Every scanner I know uses a lot of memory and has an impact on performance.

I agree with most people that antivirus software has never had low memory or CPU consumption. However, here are a few ideas off the top of my head:
Scan only the files the user opens, only when he opens them.
Only scan risky files - like executables or scripts, not all files.
The scanning is usually done by hashing the file and matching the hash against known virus hashes. To minimize memory usage you could just keep the known hashes on disk and search them when needed, but that would be very slow. The fastest way would be to keep them all in RAM and forbid the OS to swap them out, but that would use a lot of memory. A tradeoff can be achieved with several levels of hash caches, like this:
1st level cache contains 24-bit hashes as a bitmask. With one bit per possible hash value this occupies 2^24 bits, about 2MB of RAM, and can be kept completely in RAM (forbidding the OS to swap it out). Checking this can be done very quickly.
2nd level cache contains full 128-bit or larger hashes and is kept on disk. Only if first level cache gets hit, is the second level cache tested. Because the hash space of 1st level cache is small, it is likely to get a lot of false positives, so the second level cache has to be checked.
Cache the results of the last, say, 1000 files scanned. This way you don't have to do all the hashing and checking over and over again for files that are often used.
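The layered lookup described above could be sketched roughly like this. It is only an illustration under several assumptions: SHA-256 stands in for whatever hash a real scanner uses, an in-memory set stands in for the on-disk second level, and the class name, the cache key (say, path plus modification time) and the counter are made up for the example.

```python
import hashlib
from collections import OrderedDict

PREFIX_BITS = 24  # the first level indexes on the first 24 bits of the hash


class TwoLevelHashCheck:
    """First level: in-RAM bitmask. Second level: full hashes (on disk in practice)."""

    def __init__(self, known_hashes, result_cache_size=1000):
        # One bit per possible 24-bit prefix: 2^24 bits = 2 MiB.
        self.bitmask = bytearray((1 << PREFIX_BITS) // 8)
        # Assumption: a set stands in for the on-disk store of full hashes.
        self.full_hashes = set(known_hashes)
        for digest in self.full_hashes:
            idx = self._prefix(digest)
            self.bitmask[idx >> 3] |= 1 << (idx & 7)
        self.results = OrderedDict()  # verdicts for the most recent files
        self.result_cache_size = result_cache_size
        self.second_level_lookups = 0

    @staticmethod
    def _prefix(digest):
        return int.from_bytes(digest[:3], "big")  # 3 bytes = 24 bits

    def is_known_bad(self, key, data):
        # key identifies the file (hypothetical: path plus mtime).
        if key in self.results:
            self.results.move_to_end(key)
            return self.results[key]
        digest = hashlib.sha256(data).digest()
        idx = self._prefix(digest)
        verdict = False
        # Only a first-level hit pays for the expensive second-level lookup.
        if self.bitmask[idx >> 3] & (1 << (idx & 7)):
            self.second_level_lookups += 1
            verdict = digest in self.full_hashes
        self.results[key] = verdict
        if len(self.results) > self.result_cache_size:
            self.results.popitem(last=False)  # evict the oldest verdict
        return verdict
```

A clean file almost always stops at the bitmask test, so the slow store is touched only for real matches and the occasional prefix collision.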

NOD32 has a pretty small footprint, but still 10-20MB in memory.
Keep in mind what AV has to do for the most part: look at the executable part of each file for malicious bytes. A traditional virus is typically less than 1000 bytes, and the identifiable pattern may be only 50 bytes. So for AV to protect you against 100K virus patterns, it only needs a pattern database of 50*100K=5MB.

I think you are overestimating the leanness of these scanning tools. I've seen them routinely take huge chunks of memory and occasionally spike the CPU for a while. They also hijack your startup to make sure they start up first, which holds up your startup.

You should explore memory mapped files. They allow one to process huge files without loading the entire file into memory at one time.
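As a rough illustration of the memory-mapped approach, here is a minimal sketch that searches a file for a byte pattern without reading the whole file into a buffer first. The function name and the idea of searching for a single signature are assumptions for the example, not how any particular scanner works.

```python
import mmap
import os


def find_signature(path, signature):
    """Return the offset of the first occurrence of signature, or -1."""
    if os.path.getsize(path) == 0:
        return -1  # an empty file cannot be mapped
    with open(path, "rb") as f:
        # Length 0 maps the whole file; the OS pages data in lazily as
        # the search walks through it and can drop pages again under pressure.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm.find(signature)
```

The process only needs address space for the mapping; how much of the file is resident at any moment is left to the OS.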

Scan the NTFS MFT directly; figure out the NTFS filesystem structures (there are open source implementations available). That is the best way to write the most efficient code: do it yourself.
I believe antivirus programs use low-level device drivers and aggressive memory caches to speed up the so-called no-impact access. My AV (Norton) never scans unless the screensaver is active.
The problem is that your users' hardware is still cheap. Hard disk drives, for the most part, are notorious for being slow. Ask your users to upgrade to a Solid State Drive if the performance is too slow. Also, laptop drives are even slower.

Related

Optimising data-structures so that they take advantage of virtual memory

I would like to know how to optimise data structures in OpenCV (the mat type specifically) so that I am able to leverage the operating system's built-in memory/virtual memory management.
For full context please read the Q and A here, but otherwise the situation can be summed up as follows: I have a large collection of mats* that I'll need to access arbitrarily and rapidly. The main complication is that the full amount of data is well above the amount of RAM available.
(*Conceptually the data is a recursively defined 3D array of 3D arrays, but let's not muddy the water with that confusion!)
Rather than build my own LRU cache and RAM-hungry and inefficient 'page' addressing strategies to access it, I'd rather let the OS do this for me.
I think I get the concepts, but when it comes to the actual implementation I'm twiddling thumbs:
Is this a generic C++ consideration, or something I need to address at the OpenCV level?
Is it as simple as making the granularity of the data close to (but not over) 4KB? (see the solution here for the 4KB motivation)
How would the mat(s) actually be saved, accessed and represented on disk? (is this how memory-mapping is involved?)
Is this a generic C++ consideration, or something I need to address at the openCV level?
You just allocate and use boatloads of memory. The whole point of paging / virtual memory is that it's completely transparent. Everything gets extremely slow, but keeps working. You don't get ENOMEM until you're out of swap space + RAM.
On a normal Linux system, your normal swap partition should be very small (under 1GB), so you'll probably need to dd a swap file, and mkswap / swapon on it. Make sure the swap file has read-write permission for root only. Obviously every major OS will have its own procedures.
Is it as simple as making the granularity of the data close to (but not over) 4KB? (see the solution here for the 4KB motivation)
If you have pointers to other data, make sure you keep them together. You want all the small "hot" data to be in only a few pages that a decent OS LRU algorithm won't page out.
If you have hot data mixed with cold data, it will easily get paged out and lead to an extra page-file round trip before the cache miss for the final data can even happen.
Like Yakk says, sequential access patterns will do much better, because disk I/O does better with multi-block reads. (Even SSDs have better throughput with larger blocks). This also allows prefetching, which allows one I/O request to start before the previous one's data arrives. Maxing out I/O throughput requires pipelining requests.
Try to design your algorithms to do sequential accesses when possible. This is advantageous at all levels of memory, from paging all the way up to L1 cache. Sequential access even enables auto-vectorization with vector-registers.
Cache blocking (aka loop tiling) techniques are also applicable to page misses. Google for details, but the main idea is to do all the steps of your algorithm over a subset of the data, instead of touching all the data at each step. Then each piece of data only has to be loaded into cache once total, instead of once for each step of your algorithm.
Think of DRAM as a cache for your giant virtual address space.
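The tiling idea can be shown with a small sketch. It assumes every step is element-wise, so running all steps on one tile gives the same result as sweeping the whole data set once per step; the function names and the tile size are made up. With plain Python lists the speed difference will not show, the point is only the order in which data is touched.

```python
def process_step_by_step(data, steps):
    """Naive order: every step sweeps the whole data set."""
    for step in steps:
        data = [step(x) for x in data]
    return data


def process_tiled(data, steps, tile_size=4096):
    """Tiled order: run all steps on one tile before touching the next."""
    out = []
    for start in range(0, len(data), tile_size):
        tile = data[start:start + tile_size]
        for step in steps:
            # The tile was just touched, so it is still cached / resident.
            tile = [step(x) for x in tile]
        out.extend(tile)
    return out
```

In the tiled version each piece of input is brought in once, instead of once per step, which is what matters when the "cache miss" is a page fault that goes to disk.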
How would the mat(s) actually be saved, accessed and represented on disk? (is this how memory-mapping is involved?)
Swap space / the pagefile is the backing store for your process's address space. So yes, it's very similar to what you'd get if you allocated memory by mmaping a big file instead of making an anonymous allocation.
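To make the comparison with mmap concrete, here is a small sketch of an array whose backing store is a file rather than swap. It assumes a non-zero element count and native 8-byte doubles; the function name is hypothetical.

```python
import mmap


def sum_file_backed_doubles(path, count):
    """Create a file-backed array of count doubles, fill it, and sum it."""
    size = count * 8  # 8 bytes per double
    with open(path, "w+b") as f:
        f.truncate(size)  # reserve the space; the file is the backing store
        with mmap.mmap(f.fileno(), size) as mm:
            base = memoryview(mm)
            view = base.cast("d")  # index the mapping as an array of doubles
            try:
                for i in range(count):
                    view[i] = float(i)  # dirty pages are written back by the OS
                return sum(view)
            finally:
                # The views must be released before the mapping can be closed.
                view.release()
                base.release()
```

From the program's point of view this is ordinary indexed memory; which pages are resident is decided by the OS, just as with an anonymous allocation backed by swap.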

Is using istream::seekg too expensive?

In C++, how expensive is it to use the istream::seekg operation?
EDIT: How much can I get away with seeking around a file and reading bytes? What about frequency versus magnitude of offset?
I have a large file (4GB) that I am parsing, and I want to know if it's necessary to try to consolidate some of my seekg calls. I would assume that the magnitude of differences in file location play a role--like if you seek more than a page in memory away, it will impact performance--but small seeking is of no consequence. Is this correct?
This question is heavily dependent on your operating system and disk subsystem.
Obviously, the seek itself will take essentially zero time, since it just updates an offset. Actually reading will pull some data off of disk...
...but how much data depends on many things. Your disk has a cache which may have its own block size and may do some sort of read-ahead. Your RAID controller (if any) will have its own cache, possibly with its own block size and read-ahead.
Your kernel has a page cache -- all of free RAM, essentially -- and it will also probably do some sort of read-ahead. On Linux this is configurable, and the kernel will adapt it based on how sequential your access patterns appear to be, whether you have called posix_fadvise, etc.
All of these caches mean if you access some data, then access nearby data later, there is a chance the second access will not actually touch the disk at all.
If you have the option of coding so that you access the file sequentially, that is certainly going to be faster than random reads, especially small random reads. Seeking on a single mechanical disk takes something like 10ms, so you can do the math here. (Although seeking on a solid state drive is around 100 times faster.)
Large reads are generally better than small reads... Although processing data a few kilobytes at a time can be faster than larger blocks if it allows the processing to stay in cache.
In short, you will need to provide a lot more details about your system and your application to get a proper answer, and even then the most likely answer is "benchmark it".
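If you do decide to consolidate seeks, one possible approach is to sort the requested ranges and merge neighbours into a single larger read. The sketch below uses Python file objects rather than istream, and the 4096-byte gap threshold and the function names are assumptions to be tuned by benchmarking.

```python
def coalesce(requests, max_gap=4096):
    """Merge (offset, length) requests whose gaps are at most max_gap bytes."""
    merged = []
    for offset, length in sorted(requests):
        end = offset + length
        if merged and offset - merged[-1][1] <= max_gap:
            merged[-1][1] = max(merged[-1][1], end)  # extend the current range
        else:
            merged.append([offset, end])  # start a new range
    return [(start, end - start) for start, end in merged]


def read_ranges(f, requests, max_gap=4096):
    """Return {(offset, length): bytes} using one seek + read per merged range."""
    results = {}
    pending = sorted(requests)
    for start, span in coalesce(pending, max_gap):
        f.seek(start)
        block = f.read(span)
        # Slice every request that falls inside this block out of it.
        for offset, length in pending:
            if start <= offset and offset + length <= start + span:
                rel = offset - start
                results[(offset, length)] = block[rel:rel + length]
    return results
```

Sorting also turns the access pattern into a forward-only sweep, which gives the kernel's read-ahead a better chance of helping.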

Memory mapped files performance - memory management when working with large data sets

I have a situation where I need to work with a number (15-30) of large (several hundred MB each) data structures. They won't fit into memory all at the same time. To make things worse, the algorithms operating on them work across all those structures, i.e. not first one, then the other etc. I need to make this as fast as possible.
So I figured I'd allocate memory on disk, in files that are basically direct binary representations of the data when it's loaded into memory, and use memory mapped files to access the data. I use mmap 'views' of for example 50 megabytes (50 mb of the files are loaded into memory at a time), so when I have 15 data sets, my process uses 750 mb of memory for the data. Which was OK initially (for testing), when I have more data I adjust the 50 mb down at the cost of some speed.
However this heuristic is hard-coded for now (I know the size of the data set I will test with). 'In the wild', my software will need to be able to determine the 'right' amount of memory to allocate to maximize performance. I could say 'I will target a memory use of 500 mb' and then divide 500 by the amount of data structures to come to a mmap view size. I have found that when trying to set this 'target memory usage' too high, that the virtual memory manager disk thrashing will (almost) lock up the machine and render it unusable until the processing finishes. This is to be avoided in my 'production' solution.
So my questions, all somewhat different approaches to the problem:
What is the 'best' target size for a single process? Should I just try to max out the 2gb that I have (assuming 32 bit Win XP and up, non-/3GB for now) or try to keep my process size smaller so that my software won't hog the machine? When I have two Visual Studio instances, Outlook and a Firefox open on my machine, those use 1/2 gb of virtual memory easily by themselves - if I let my software use 2 gb of virtual memory the swapping will severely slow down the machine. But then how do I determine the 'best' process size?
What can I do to keep performance of the machine in check when working with memory-mapped files? My application does fairly simple numerical operations on the data, which basically means that it zips over hundreds of megabytes of data real quick, causing the whole memory-mapped files (several gigabytes) to be loaded into memory and swapped out again very quickly, again and again (think Monte Carlo style simulation).
Is there any chance that not using memory-mapped files and just using fseek/fgets is going to be faster or less intrusive than using memory mapped files?
Any articles, papers or books I can read about this? Either with 'cookbook' style solutions or fundamental concepts.
Thanks.
It occurs to me that you could set some predefined threshold for "too darn slow" and use the computer's wall-clock to make your alterations on the fly.
Start conservatively low. If this is below your "too darn slow" threshold, bump the size up a little bit for the next file. do this iteratively. When you go above the threshold, slowly back the size off iteratively.
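A minimal sketch of that feedback loop might look like the following. The 10% step, the size limits and the function name are assumptions; the elapsed time would come from timing each pass with the wall clock.

```python
def next_view_size(size, elapsed, threshold, step=0.1,
                   min_size=1 << 20, max_size=1 << 30):
    """Grow the view while passes stay under the threshold, back off otherwise."""
    if elapsed < threshold:
        size = int(size * (1 + step))  # fast enough: try a slightly bigger view
    else:
        size = int(size * (1 - step))  # too darn slow: shrink a little
    return max(min_size, min(max_size, size))
```

Called once per file (or per pass), this settles around the largest size that stays under the threshold on that machine, and follows changes when other programs start competing for memory.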
I think it's a good place to try Address Windowing Extensions: http://msdn.microsoft.com/en-us/library/aa366527(v=VS.85).aspx
It allows a process to use more than 4GB of memory by providing a sliding window. The drawback is that not all versions of Windows have it.
I probably wouldn't use a memory-mapped file for this app. Memory-mapped files work best when you have a large virtual address space (at least relative to the size of the data you're processing). You map the entire file, and let the OS decide which pieces remain resident.
However, if you're repeatedly mapping and unmapping segments of the file (rather than the entire file), you'll probably end up doing just as well by reading chunks via fseek and fread -- note, however, that you do not want to read individual pieces of data this way (ie, do one large read rather than a lot of small reads).
The one way that manually segmented memory-mapped files might win is if you have sparse reads: if you'll only be touching, say 10% of a given file. In this case, memory mapping means the OS will read only those pages that are touched, whereas explicit reads will load the entire file.
Oh, and I would definitely not spend time trying to control my resource consumption. The OS will do that better than you can, because it knows about all competing processes.
It will probably be best to fix the size of the memory mapped file to be some percentage of the total system memory, probably with a set minimum.
Remember that the operating system will effectively load a whole memory page when you access a single byte, this may well happen in the background but will only be fast if sequential data accesses tend to be close together.
You should therefore try to keep sequential accesses to your data as close together in memory/the file as possible. You can also look at preloading strategies that access your data speculatively before it is actually required. These are the same considerations that you will need when optimizing for memory cache efficiency.
If sequential data accesses are scattered widely in your file, you may be better off using fseek and fread to access the data since this will give you better fine-grain control of what data is written to memory when.
Also remember that there are no hard and fast rules. Optimizations can sometimes be counter-intuitive so try a whole bunch of different things and see which works best on the platform that this will need to operate on.
Perhaps you can use /LARGEADDRESSAWARE for your Visual Studio linker, and use bcdedit so that your process can use more than 2GB of memory.

Writing data chunks while processing - is there a convergence value due to hardware constraints?

I'm processing data from a hard disk from one large file (processing is fast and not a lot of overhead) and then have to write the results back (hundreds of thousands of files).
I started writing the results straight away in files, one at a time, which was the slowest option. I figured it gets a lot faster if I build a vector of a certain amount of the files and then write them all at once, then go back to processing while the hard disk is occupied in writing all that stuff that i poured into it (that at least seems to be what happens).
My question is, can I somehow estimate a convergence value for the amount of data that I should write from the hardware constraints ? To me it seems to be a hard disk buffer thing, I have 16MB buffer on that hard disk and get these values (all for ~100000 files):
Buffer size    Time (minutes)
------------------------------
no buffer      ~ 8:30
1 MB           ~ 6:15
10 MB          ~ 5:45
50 MB          ~ 7:00
Or is this just a coincidence ?
I would also be interested in experience / rules of thumb about how writing performance is to be optimized in general, for example are larger hard disk blocks helpful, etc.
Edit:
Hardware is a pretty standard consumer drive (I'm a student, not a data center): WD 3.5" 1TB/7200/16MB/USB2, HFS+ journalled, OS is MacOS 10.5. I'll soon give it a try on Ext3/Linux and an internal disk rather than an external one.
Can I somehow estimate a convergence value for the amount of data that I should write from the hardware constraints?
Not in the long term. The problem is that your write performance is going to depend heavily on at least four things:
Which filesystem you're using
What disk-scheduling algorithm the kernel is using
The hardware characteristics of your disk
The hardware interconnect you're using
For example, USB is slower than IDE, which is slower than SATA. It wouldn't surprise me if XFS were much faster than ext2 for writing many small files. And kernels change all the time. So there are just too many factors here to make simple predictions easy.
If I were you I'd take these two steps:
Split my program into multiple threads (or even processes) and use one thread to deliver system calls open, write, and close to the OS as quickly as possible. Bonus points if you can make the number of threads a run-time parameter.
Instead of trying to estimate performance from hardware characteristics, write a program that tries a bunch of alternatives and finds the fastest one for your particular combination of hardware and software on that day. Save the fastest alternative in a file or even compile it into your code. This strategy was pioneered by Matteo Frigo for FFTW and it is remarkably effective.
Then when you change your disk, your interconnect, your kernel, or your CPU, you can just re-run the configuration program and presto! Your code will be optimized for best performance.
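The "try a bunch of alternatives" step could be sketched as below. This is only an illustration: the batching scheme, file naming and function names are assumptions, and a real tuning run would use realistic file counts and sizes and repeat each measurement, since OS caching distorts a single timing.

```python
import os
import tempfile
import time


def flush(batch):
    """Write every (path, payload) pair in the batch to disk."""
    for path, payload in batch:
        with open(path, "wb") as f:
            f.write(payload)


def write_batched(directory, payloads, batch_bytes):
    """Collect outputs until batch_bytes is reached, then write them in one go."""
    batch, batched = [], 0
    for i, payload in enumerate(payloads):
        batch.append((os.path.join(directory, f"{i}.out"), payload))
        batched += len(payload)
        if batched >= batch_bytes:
            flush(batch)
            batch, batched = [], 0
    flush(batch)  # whatever is left over


def pick_fastest(payloads, candidates):
    """Time every candidate batch size on this machine and return the quickest."""
    timings = {}
    for batch_bytes in candidates:
        with tempfile.TemporaryDirectory() as d:
            start = time.perf_counter()
            write_batched(d, payloads, batch_bytes)
            timings[batch_bytes] = time.perf_counter() - start
    return min(timings, key=timings.get), timings
```

The winning value can then be saved to a small config file and re-measured whenever the disk, interconnect or kernel changes.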
The important thing here is to get as many outstanding writes as possible, so the OS can optimize hard disk access. This means using async I/O, or using a task pool to actually write the new files to disk.
That being said, you should look at optimizing your read access. The OS (at least Windows) is already really good at helping write access via buffering "under the hood", but if you're reading serially there isn't too much it can do to help. If you use async I/O or (again) a task pool to process/read multiple parts of the file at once, you'll probably see increased performance.
Parsing XML should be doable at practically disk read speed, tens of MB/s. Your SAX implementation might not be doing that.
You might want to use some dirty tricks. Writing hundreds of thousands of files is not going to be efficient with the normal API.
Test this by writing sequentially to a single file first, not 100,000. Compare the performance. If the difference is interesting, read on.
If you really understand the file system you're writing to, you can make sure you're writing a contiguous block you just later split into multiple files in the directory structure.
You want smaller blocks in this case, not larger ones, as your files are going to be small. All free space in a block is going to be zeroed.
[edit] Do you really have an external need for those 100K files? A single file with an index could be sufficient.
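The single-file-with-index idea can be sketched in a few lines. The function names and the in-memory dictionary index are assumptions; in practice the index would be saved alongside the data file (for example as JSON).

```python
def pack(records, out):
    """Append every (name, payload) to one stream; return {name: (offset, length)}."""
    index = {}
    for name, payload in records:
        index[name] = (out.tell(), len(payload))  # where this record starts
        out.write(payload)
    return index


def unpack(name, index, src):
    """Read a single record back using its index entry."""
    offset, length = index[name]
    src.seek(offset)
    return src.read(length)
```

All writes become one sequential stream, and a reader that needs one result pays for a single seek instead of a directory lookup among 100K entries.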
Expanding on Norman's answer: if your files are all going into one filesystem, use only one helper thread.
Communication between the read thread and write helper(s) consists of a two-std::vector double-buffer per helper. (One buffer owned by the write process and one by the read process.) The read thread fills the buffer until a specified limit then blocks. The write thread times the write speed with gettimeofday or whatever, and adjusts the limit. If writing went faster than last time, increase the buffer by X%. If it went slower, adjust by –X%. X can be small.
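A simplified version of this hand-off, with one helper thread and a bounded queue in place of the explicit double buffer, might look like this. The limit is fixed here; the timing-based adjustment described above is left out. All names are hypothetical, and make_output is an assumed callback that turns an input item into a (path, payload) pair.

```python
import queue
import threading


def writer(jobs, errors):
    """Helper thread: drain batches from the queue and write them out."""
    while True:
        batch = jobs.get()
        if batch is None:  # sentinel: the producer is finished
            break
        try:
            for path, payload in batch:
                with open(path, "wb") as f:
                    f.write(payload)
        except OSError as exc:
            # Record the error but keep draining so the producer never hangs.
            errors.append(exc)


def process_and_write(items, limit_bytes, make_output):
    """Fill a batch up to limit_bytes, hand it to the writer, keep processing."""
    jobs = queue.Queue(maxsize=1)  # at most one finished batch waits its turn
    errors = []
    helper = threading.Thread(target=writer, args=(jobs, errors))
    helper.start()
    batch, size = [], 0
    for item in items:
        path, payload = make_output(item)
        batch.append((path, payload))
        size += len(payload)
        if size >= limit_bytes:
            jobs.put(batch)  # blocks while the writer is still behind
            batch, size = [], 0
    if batch:
        jobs.put(batch)
    jobs.put(None)
    helper.join()
    return errors
```

Timing each batch inside writer and nudging limit_bytes up or down by a small percentage would add the adaptive part.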

What approach works best for quickly reading files off of optical drives?

When reading files off of a hard drive, mmap is generally regarded as a good way to quickly get data into memory. When working with optical drives, accesses take more time and you have a higher latency to worry about. What approach/abstraction do you use to hide/eliminate as much latency and/or overall load time of the optical drive as possible?
There's no real abstraction you can employ. Optical drives have very specific characteristics that must be optimized for to get the best performance.
Some tips:
The biggest killer on optical drives is seek time. Where possible make sure all the files you are reading are sequential on disc and as closely packed as possible. If you must seek then seek in one direction and as infrequently as possible.
Asynchronous reading can also massively improve performance. If you need to load and process files A,B & C then before processing A you should start reading file B, and while processing B you should be reading file C and so on.
Generally the more data you can read in one go the better, e.g. avoid lots of little read() calls. You will only get the theoretical throughput of a disc while reading large amounts of data. Some OSs/drivers will minimize the penalty of reading lots of little files by caching sectors, some will not.
Doing lots of exists(filename) checking can also be detrimental on some filesystems / OSs where only parts of the TOC are cached.
In our applications we usually pack files into one or more "lumped" files and have them ordered sequentially based on their access order. Some files (and directories) are compressed and read in their entirety before being decompressed in memory. This can be a win if you have a directory that contains a multitude of small files (e.g XML or scripts).
Basically lots of benchmarking and tweaking :)
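The asynchronous-reading tip (start reading B before processing A) could be sketched with a single background reader. The function names are made up, and process is an assumed callback supplied by the caller.

```python
from concurrent.futures import ThreadPoolExecutor


def read_file(path):
    with open(path, "rb") as f:
        return f.read()  # one large read per file


def process_with_prefetch(paths, process):
    """Process files in order while the next one is already being read."""
    results = []
    if not paths:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(read_file, paths[0])
        for next_path in list(paths[1:]) + [None]:
            data = pending.result()  # wait for the file currently needed
            if next_path is not None:
                # Kick off the next read before doing any processing.
                pending = pool.submit(read_file, next_path)
            results.append(process(data))
    return results
```

Using a single worker keeps the reads strictly in order, so the drive never has to seek back and forth between two files.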
Minimize or eliminate seeks by reading in giant chunks of data sequentially from a few files (optimally one).
First, keep in mind that modern optical drives are quite fast at reading sequential data, but seeking is still a lot slower than on HDs. So if you must seek a lot within a big file (e.g. jump randomly around within a 500+ MB file), it might actually be faster to first copy the whole 500 MB to HD (into a temporary file), which will be done in fast, sequential reads, perform the operation on the temp file (much faster thanks to the much shorter access times on HD) and delete the file again when you are done with it.
The same applies to a few big vs. many small files. Working with a couple of big files is much faster than with many small files, since every time you switch from one small file to another the huge seek time will give you headaches again. This is the reason why many games that ship on optical media pack game data in huge archive files (e.g. all textures of one level are in one huge file instead of having one small file per texture), so try keeping data well structured in big files that you can read as sequentially as possible.
HD caching itself is a good technique. There is this game I remember, though I forgot the title, that always kept the 3D data of your environment on HD. While you were moving through the world, it was constantly copying data from DVD to HD. Thus the surrounding 3D landscape was always available on HD for fast access. However, not the whole DVD was copied; only about 200-300 MB were temporarily cached on HD to save HD space. The only annoying thing about that was that you often had DVD access "noise" while playing the game, but most of the time the whole process was happening only during CPU idle times, so it did not really affect game play. Only if you ran very fast and constantly in the same direction could it happen that the DVD drive fell behind and all of a sudden the game stopped with a loading indicator for a couple of seconds. However, I played this game for days and saw this loading indicator maybe three times within a single week. If you were moving slowly or not constantly in the same direction, there never was a loading indicator.
Slow drives are going to be slow. Sorry. However, optical drive hardware will normally be optimized to do sequential reads, so if you can make your code work that way you might see some improvement. I doubt you'll see much difference between mmap(), fread(), et al, for sequential access. You might also be able to tune your read buffer size to be a multiple of the drive's block size, if your OS isn't already doing that for you. Optical drives can have large block sizes compared to hard drives, and if your buffers aren't large enough you're paying a price.
I'm not sure that there is a lot that you can do by the time that you are reading it. You could look at the CreateFile API: you can pass some hints to Windows that tell it that you are opening the file for sequential or random access. That is supposed to allow Windows to optimize the caching strategy used for the file.
You can tune the "chunks" that you bite off when reading your file to make them larger or smaller. You might get a slight improvement if you read in chunks that are multiples of the allocation unit size on the disk.
The hardware and media can make a difference. Say you have a DVD drive that reads at 16x. It will require media that is rated at 16x or higher, and some drives don't work well with some media brands. So even if the media meets the ratings, you might not be reading at the maximum speed. (usually a good hardware review on an optical drive will include details like this).
The layout of the files on the optical disk could be important. Was it burned all at once? Was it just mounted as a disk (like a packet-mode R/W?). I don't have experience with this, but given the longer seek times on an optical drive, fragmented files might have a greater impact than they do with a modern hard drive.