How to build an application-layer pre-fetching system - C++

I'm working on a mixed C/C++ project that has the following situation.
I need to iterate through very small chunks (and occasionally larger ones) in a file, one by one. Ideally, I should just read them once, consecutively. I think a better solution in this case would be to read a big chunk into a buffer and consume it later, rather than reading each chunk from the file at the moment I need it.
The problem is, how do I balance the cache size? Is there any well-known algorithm/library that I can take advantage of?
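Something like the following rough sketch is what I have in mind (the 1 MB buffer size is an arbitrary guess; choosing that size well is exactly what I'm asking about):

    #include <cstdio>
    #include <cstring>
    #include <vector>

    // Read one large block at a time, then hand out small chunks from it.
    class ChunkReader {
    public:
        explicit ChunkReader(std::FILE* f, std::size_t bufSize = 1 << 20)
            : file_(f), buf_(bufSize), pos_(0), end_(0) {}

        // Returns a pointer to `n` bytes, refilling the buffer when needed.
        const char* next(std::size_t n) {
            if (pos_ + n > end_) refill();
            if (pos_ + n > end_) return nullptr;  // EOF, or chunk bigger than buffer
            const char* p = buf_.data() + pos_;
            pos_ += n;
            return p;
        }

    private:
        void refill() {
            std::size_t tail = end_ - pos_;
            std::memmove(buf_.data(), buf_.data() + pos_, tail);  // keep unread bytes
            end_ = tail + std::fread(buf_.data() + tail, 1,
                                     buf_.size() - tail, file_);
            pos_ = 0;
        }

        std::FILE* file_;
        std::vector<char> buf_;
        std::size_t pos_, end_;
    };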
UPDATE (changed the title):
Thanks for your replies; I understand there are caching mechanisms working at different levels in our boxes. But that's not enough in my case.
I think I left out something important here. I'm actually building an application on top of an existing framework, in which issuing frequent read requests to the engine costs too much for me. (Yes, I believe the engine does take advantage of OS- and disk-level caches.) And what I'm trying to do is indeed build an application-level pre-fetching system.
Thoughts?

In general you should try to use what the OS gives you, rather than creating your own cache (because you run the risk of caching everything twice). On Linux, you can request OS-level read-ahead via readahead(); I don't know what the Windows equivalent would be.
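For example, a minimal Linux-only sketch (error handling omitted; "data.bin" and the 16 MB figure are placeholders):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    int main() {
        int fd = open("data.bin", O_RDONLY);

        // Tell the kernel we will read this file sequentially, so it
        // can schedule more aggressive read-ahead on its own.
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

        // Explicitly pull the first 16 MB into the page cache, so the
        // subsequent small read() calls are served from memory.
        readahead(fd, 0, 16 * 1024 * 1024);

        // ... consume the small chunks with ordinary read() calls ...
        close(fd);
        return 0;
    }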
Looking into this some more, there is also a block-level (i.e. disk) read-ahead parameter, set via blockdev --setra. It's probably not a good idea to change that on your system (unless the machine is dedicated to just this one task), but if the value there (see blockdev --getra) is already larger than your typical chunk size, you may not need to do anything else.
[And just to address the other point mentioned in the question comments: while an OS will cache file data in free memory, I don't believe it will pre-emptively read an otherwise unread file (apart from meeting the read-ahead settings above). But if anyone knows otherwise, please post details...]

Have you tried mmap()ing the file instead of read()ing from it? In some cases this might be more efficient, in others it might not. However, it is usually best to let the system optimize for you, since it knows more about the hardware than an application does. mmap() lets the system know that you need the whole file, so it may simply do the smarter thing.
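For what it's worth, a minimal sketch of the mmap() approach on POSIX (error handling omitted; "data.bin" is a placeholder):

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main() {
        int fd = open("data.bin", O_RDONLY);
        struct stat st;
        fstat(fd, &st);

        // Map the whole file; nothing is read from disk yet.
        char* p = static_cast<char*>(
            mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));

        // Tell the kernel we'll walk the mapping front to back,
        // so it can read ahead aggressively.
        madvise(p, st.st_size, MADV_SEQUENTIAL);

        // Consume the chunks straight out of the mapping; pages are
        // faulted in (and prefetched) as they are touched.
        long checksum = 0;
        for (off_t i = 0; i < st.st_size; ++i)
            checksum += p[i];

        munmap(p, st.st_size);
        close(fd);
        return static_cast<int>(checksum & 0x7f);  // use the result
    }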

Related

Is there a way to write data directly to the hard drive (similar to how one can do with RAM)?

My question concerns C/C++. It is possible to manipulate data in RAM with great flexibility. You can also give the GPU direct commands using OpenGL, allowing one to manipulate VRAM as well.
My curiosity is whether it is possible to do the same with the hard drive (even though this would likely be a horrible idea with many, many possibilities of corrupting existing data). The logic of my question comes from the assumption that the hard drive is similar to RAM and VRAM (bytes of data), just slower to access.
I'm not asking about how to perform file IO, but instead how to directly modify bytes of memory on the hard drive (maybe via some sort of "hard-drive pointer").
If my assumption is totally off, a detailed correction about how the hard drive's data storage is different from RAM or VRAM would be very helpful. Thank you!
Modern operating systems in combination with modern CPUs offer the ability to memory-map disk clusters to memory pages.
The memory pages are initially marked as invalid, and as soon as you try to access them an invalid page "trap" or "interrupt" occurs, which is handled by the operating system, which loads the corresponding cluster into that memory page.
If you write to that page there is either a hardware-supported "dirty" bit, or another interrupt mechanism: the memory page is initially marked as read-only, so the first time you try to write to it there is another interrupt, which simply marks the page as dirty and turns it read-write. Then, you know that the page needs to be flushed to disk at a convenient time.
Note that reading and writing is usually done via Direct Memory Access (DMA) so the CPU is free to do other things while the pages are being transferred.
So, yes, you can do it, either with the help of the operating system, or by writing all that very complex code yourself.
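A minimal sketch of the OS-assisted version on POSIX (error handling omitted; it assumes "file.dat" exists and is at least one page long):

    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main() {
        int fd = open("file.dat", O_RDWR);
        const size_t len = 4096;

        // MAP_SHARED means stores through the pointer become writes to
        // the underlying file; this is, in effect, the "hard-drive
        // pointer" asked about.
        char* p = static_cast<char*>(
            mmap(nullptr, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));

        p[0] = 'X';              // dirty the page via ordinary memory access
        msync(p, len, MS_SYNC);  // force the dirty page out to the drive now

        munmap(p, len);
        close(fd);
        return 0;
    }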
Not for you. Being able to write directly to the hard drive would give you infinite potential to mess up things beyond all recognition. (The technical term is FUBAR, and the F doesn't stand for Mess).
And if you write hard disk drivers, I sincerely hope you are not trying to ask for help here.

Force executable into memory?

I have a C++ executable (statically linked against its libraries), about 1 MB in size. When I run the exe, it consumes less than 200 KB of memory.
From what I understand, this means the computer reads the exe from the HDD little by little, as it's needed.
I want to improve the performance, even a bit, so, how can I say "load the exe into memory" and don't touch the HDD? Will this bring any performance improvement?
The OS will load parts of the executable into memory as they are needed. This is where knowing more about the instruction cache might be useful. The idea is to structure your program so that commonly used code is grouped together. For example, you might have some functions that are getting inlined; inlining duplicates the code at every call site, so the OS has to load the same code in multiple places, which might be slow. By removing the inline you'd have the code in one chunk in memory, which would get cached and thus reduce loading time.
I would agree with the others, though, that this type of optimization should really be reserved until after you profile and know for sure that this is the bottleneck (which is very unlikely).
If you really want to do this, you need to touch the memory pages by reading from them. But forcing pages into memory once does not guarantee that they will remain in memory. An apparent alternative would be to VirtualLock the region, but in practice this function doesn't work the way you'd think (at least not on any system where I've used it), even if you have the appropriate privileges.
Note that the default minimum working set is only 16 MB, so for larger executables, forcing pages into RAM will necessarily push other pages (which you need!) out of the working set, so this is in fact an anti-optimization. Unless, that is, you have the necessary privileges to increase the working set size.
It's a bit tedious to find out where the executable's mapping starts and ends. Not that it's impossible, but it's much more complicated than just mapping the file again. Then you simply run a loop which reads one byte every 4096 bytes, and you are done. This will consume twice as much address space, but the same amount of RAM (thanks to how memory mapping works).
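For illustration, a sketch of that loop on Windows (error handling omitted; it assumes the usual 4096-byte page size and an executable smaller than 4 GB):

    #include <windows.h>

    void touch_own_executable() {
        char path[MAX_PATH];
        GetModuleFileNameA(NULL, path, MAX_PATH);

        // Map the executable file a second time; this costs address
        // space but no extra RAM, as explained above.
        HANDLE file = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ,
                                  NULL, OPEN_EXISTING, 0, NULL);
        HANDLE map  = CreateFileMappingA(file, NULL, PAGE_READONLY, 0, 0, NULL);
        const char* view =
            static_cast<const char*>(MapViewOfFile(map, FILE_MAP_READ, 0, 0, 0));

        DWORD size = GetFileSize(file, NULL);
        volatile char sink = 0;                 // keeps the reads from
        for (DWORD i = 0; i < size; i += 4096)  // being optimized away
            sink = view[i];                     // one byte per page

        UnmapViewOfFile(view);
        CloseHandle(map);
        CloseHandle(file);
    }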
But, realistically, you will gain absolutely nothing from doing this.
The operating system does not need to load the entire executable and does not need to keep it resident at all times. Part of your executable will be debug info or import info, which the loader will maybe look at once (or won't look at) and never need afterwards. Forcing that stuff into memory only means you purge useful pages from the working set.
The OS likely has those parts (or most of them) in the buffer cache anyway, even though that isn't visible to you; but even if that isn't the case, you will hardly ever notice a difference.
Globally, forcing all of the program into RAM will slow it down: there are usually large parts of the code which aren't executed in any given run, and there's no need to ever read those from disk.
Where forcing all or part of the program into RAM can make a difference is latency. If you're responding in real time to external events, having to load the code in order to respond will add latency. This can only be done using a system-specific request (e.g. mlock under POSIX systems supporting the real-time extension). You'll probably need special rights to be able to do it, though. In practice, it should only be used on machines dedicated to a specific application, since it can have a very negative impact on total system performance. (There's a reason it's in the real-time extensions, and not in basic POSIX.) Locking the addresses used by the function in memory means that there can be no page faults when it is executed.
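A minimal sketch of that request on a POSIX system (it assumes you have the necessary privileges, e.g. CAP_IPC_LOCK or a raised RLIMIT_MEMLOCK on Linux):

    #include <sys/mman.h>
    #include <cstdio>

    int main() {
        // Lock everything currently mapped (code, data, stack) into RAM,
        // plus anything mapped in the future: no page faults afterwards.
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
            std::perror("mlockall");
            return 1;
        }

        // ... latency-critical work here ...

        munlockall();
        return 0;
    }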

Keeping memory usage within available amount

I'm writing a program (a theorem prover as it happens) whose memory requirement is "as much as possible, please"; that is, it can always do better by using more memory, for practical purposes without upper bound, so what it actually needs to do is use just as much memory as is available, no more and no less. I can figure out how to prioritize data to delete the lowest value stuff when memory runs short; the problem I'm trying to solve is how to tell when this is happening.
Ideally I would like a system call that returns "how much memory is left" or "are we out of memory yet?"; as far as I can tell, no such thing exists?
Of course, malloc can signal out of memory by returning 0 and new can call a handler; these aren't ideal signals, but would be better than nothing. A problem, however, is that I really want to know when physical memory is running out, so I can avoid going deep into swap and thereby making everything grind to a halt; I don't suppose there's any way to ask "are we having to swap yet?" or tell the operating system "don't swap on my account, just fail my requests if it comes to that"?
Another approach would be to find out how much RAM is in the machine, and monitor how much memory the program is using at the moment. As far as I know, there is generally no way to tell the former? I also get the impression there is no reliable way to tell the latter except by wrapping malloc/free with a bookkeeper function (which is then more problematic in C++).
Are there any approaches I'm missing?
The ideal would be a portable solution, but I suspect that's not going to happen. Failing that, a solution that works on Windows and another one that works on Unix would be nice. Failing that, I could get by with a solution that works on Windows and another one that works on Linux.
I think the most useful and flexible way to use all the memory available is to let the user specify how much memory to use.
Let the user write it in a config file or through an interface, then create an allocator (or something similar) that will not provide more than this memory.
That way, you don't have to gather statistics about the current computer, as these will always be skewed by the fact that the OS may be running other programs as well. Never mind the way the OS manages its caches, the differences between 32-bit and 64-bit address spaces limiting your allocations, etc.
In the end, human intelligence (assuming the user knows the context of use) is cheaper to implement when provided by the user.
To find out how much system memory is still unused, under Linux you can parse the file /proc/meminfo and look for a line starting with "MemFree:". Under Windows you can use GlobalMemoryStatusEx http://msdn.microsoft.com/en-us/library/aa366589%28VS.85%29.aspx
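A minimal sketch of the Linux side (it assumes the usual /proc/meminfo layout, where values are reported in kB):

    #include <cstdlib>
    #include <fstream>
    #include <string>

    long free_memory_kb() {
        std::ifstream meminfo("/proc/meminfo");
        std::string line;
        while (std::getline(meminfo, line)) {
            // A typical line looks like: "MemFree:   123456 kB"
            if (line.compare(0, 8, "MemFree:") == 0)
                return std::strtol(line.c_str() + 8, nullptr, 10);
        }
        return -1;  // line not found
    }

You may also want to add the "Buffers:" and "Cached:" lines to the total, since that memory is reclaimable by the OS.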
Relying on malloc to return 0 when no memory is available might cause problems on Linux, because Linux overcommits memory allocations. malloc will usually return a valid pointer (unless the process is out of virtual address space), but accessing the memory it points to may trigger the "OOM killer", a mechanism that kills your process or another process on the system. The system administrator can tune this behavior.
The best solution I can think of might be to query how many page faults have occurred within, say, the last second. If there's a lot of swapping going on, you should probably release some memory, and if not, you can try allocating more memory.
On Windows, WMI can probably give you some statistics you can use.
But it's a tough problem, since there is no hard limit you can ask the OS for and then stay below. You can keep allocating memory far beyond the point where you've run out of physical memory, which just means you'll cripple your process with excessive swapping.
So the best you can really do is some kind of approximation.
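If you want to try the page-fault approach on Unix, getrusage() reports the number of major faults (the ones that actually required I/O) the process has taken; a sketch:

    #include <sys/resource.h>

    // Call this periodically (say, once a second); a rising difference
    // between successive values suggests the process is being swapped
    // and should release some memory.
    long major_faults_so_far() {
        struct rusage ru;
        getrusage(RUSAGE_SELF, &ru);
        return ru.ru_majflt;
    }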
You can keep allocating memory beyond the point where it is useful to do so, i.e. the point at which the OS has to swap or page out important things. The trouble is that it's not necessarily easy to tell where that point is.
Also, if your task does any (significant) IO, you will need to leave some memory for the OS buffers.
I recommend just finding out how much RAM is in the machine, then allocating an amount as a function of that (a proportion of it, or leaving some fixed amount free, etc.).

Permanent Memory Address

With my basic knowledge of C++, I've managed to whip together a simple program that reads some data from a program (using ReadProcessMemory) and sends it to my web server every five minutes, so I can see the status of said program while I'm not at home.
I found the memory addresses to read from using a program designed to hack games called "Memory Hacking Software." The problem is, the addresses change whenever I move the program to another machine.
My question is: is there a way to find a 'permanent' address that is the same on any machine? Or is this simply impossible? Excuse me if this is a dumb question, but I don't know a whole lot about the subject. Alternatively, is there another means of accessing information from a running program?
Thanks for any and all help!
There are ways to do it, such as recognising memory patterns around the thing you're looking for. Crackers can use this to find memory locations to patch even in software that "moves around", so to speak (as with operating systems that provide randomisation of address spaces).
For example, if you know that there are fixed character strings always located X bytes beyond the area of interest, you can scan the whole address space to find them, then calculate the area of interest from that.
However, it's not always as reliable as you might think.
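For illustration, a sketch of such a scan using the same ReadProcessMemory API you're already using (error handling omitted; the pattern and the offset are placeholders you would have to work out yourself):

    #include <windows.h>
    #include <cstring>
    #include <vector>

    // Scan the target's committed memory for `pattern`, then return the
    // address of the area of interest, `offsetBefore` bytes before it.
    char* find_area(HANDLE proc, const char* pattern, size_t patLen,
                    size_t offsetBefore) {
        MEMORY_BASIC_INFORMATION mbi;
        char* addr = nullptr;
        std::vector<char> buf;

        while (VirtualQueryEx(proc, addr, &mbi, sizeof(mbi)) != 0) {
            if (mbi.State == MEM_COMMIT && !(mbi.Protect & PAGE_NOACCESS)) {
                buf.resize(mbi.RegionSize);
                SIZE_T got = 0;
                if (ReadProcessMemory(proc, mbi.BaseAddress, buf.data(),
                                      mbi.RegionSize, &got)) {
                    for (size_t i = 0; i + patLen <= got; ++i)
                        if (std::memcmp(buf.data() + i, pattern, patLen) == 0)
                            return static_cast<char*>(mbi.BaseAddress)
                                   + i - offsetBefore;
                }
            }
            addr = static_cast<char*>(mbi.BaseAddress) + mbi.RegionSize;
        }
        return nullptr;  // pattern not found
    }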
I would instead be thinking of another way to achieve your ends, one that doesn't involve battling the features that are protecting such software from malicious behaviour.
Think of questions like:
Why exactly do you need access to the address space at all?
Does the program itself provide status information in a more workable manner?
If the program is yours, can you modify it to provide that information?
If you only need to know if the program is doing its job, can you simply "ping" the program (e.g., for a web server, send an HTTP request and ensure you get a valid response)?
As a last resort, can you convince the OS to load your program without address space randomisation then continue using your (somewhat dubious) method?
Given your comment that:
I use the program on four machines and I have to "re-find" the addresses (8 of them) on all of them every time they update the program.
I would simply opt for automating this process. This is what some cracking software does. It scans files or in-memory code and data looking for markers that it can use for locating an area of interest.
If you can do it manually, you should be able to write a program that can do it. Have that program locate the areas of interest (by reading the process address space) and, once they're found, just read your required information from there. If the method of finding them changes with each release (rather than just the actual locations), you'll probably need to update your locator routines with each release of their software; unfortunately, that's the price you pay for the chosen method.
It's unlikely the program you're trying to read will be as secure as some - I've seen some move their areas of interest around as the program is running, to try and confuse crackers.
What you are asking for is impossible by design. ASLR is designed specifically to prevent this kind of snooping.
What kind of information are you getting from the remote process?
Sorry, this isn't possible. The memory layout of processes isn't going to be reliably consistent.
You can achieve your goal in a number of ways:
Add a client/server protocol that you can connect to and ask "what's your status?" (this also lends itself nicely to asking for more info).
Have the process periodically touch a file, the "monitor" can check the modification time of that file to see if the process is dead.
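The second option is only a few lines; a rough sketch (the file path and the timeout are placeholders):

    #include <fstream>
    #include <ctime>
    #include <sys/stat.h>

    // The monitored process calls this periodically (say, once a minute).
    void touch_heartbeat() {
        std::ofstream("/tmp/myapp.heartbeat") << std::time(nullptr);
    }

    // The monitor checks how stale the file's modification time is.
    bool looks_alive(long maxAgeSeconds = 120) {
        struct stat st;
        if (stat("/tmp/myapp.heartbeat", &st) != 0) return false;
        return std::time(nullptr) - st.st_mtime <= maxAgeSeconds;
    }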

Decreasing performance writing large binary file

In one of our software products we create records and store them in a binary file. Once the writing operation is completed, we read this binary file back. The issue is that if the binary file is less than 100 MB its performance is good enough, but once the file grows larger, its performance suffers.
So, I thought of splitting this large binary file (> 100 MB) into smaller ones (< 100 MB). But it seems this solution does not recover the performance. So I was wondering: what might be a better approach to handle this scenario?
It would be a really great help if you guys could comment on this.
Thanks
Maybe you could try using an SQLite database instead.
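If you want to try it, the C API is small; a minimal sketch (error handling mostly omitted; the table layout is a placeholder):

    #include <sqlite3.h>

    int main() {
        sqlite3* db = nullptr;
        sqlite3_open("records.db", &db);

        sqlite3_exec(db,
            "CREATE TABLE IF NOT EXISTS records ("
            "  id INTEGER PRIMARY KEY, payload BLOB);",
            nullptr, nullptr, nullptr);

        // Wrapping many inserts in one transaction is what makes bulk
        // writes fast: one sync per transaction instead of per row.
        sqlite3_exec(db, "BEGIN;", nullptr, nullptr, nullptr);
        // ... insert the records with prepared statements here ...
        sqlite3_exec(db, "COMMIT;", nullptr, nullptr, nullptr);

        sqlite3_close(db);
        return 0;
    }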
It is always quite difficult to provide accurate answers with only a glimpse of the system, but have you actually measured your actual throughput?
As a first solution, I would simply recommend using a dedicated disk (so there are no concurrent read/write actions from other processes), and a fast one at that. This way it would just be the cost of a hardware upgrade, and we all know hardware is usually cheaper than software ;) You may even go to a RAID controller for maximizing throughput.
If you are still limited by disk throughput, there are new technologies out there based on Flash: USB keys (though they may not seem very professional) or the "new" solid-state drives may provide more throughput than a mechanical disk.
Now, if the disk approaches are not fast enough, or you can't get your hands on good SSDs, there are other solutions, but they involve software changes, and I propose them off the top of my head.
A socket approach: the second utility listens on a port and you send it the data there. On a local machine this is relatively fast, and you parallelize the work too, so even if the size of the data grows, you can still begin processing fairly quickly.
A memory-mapping approach: write to a dedicated area of live memory and have the utility read from that area (Boost.Interprocess may help; there are other solutions; see the sketch below).
Note that if the read is sequential, I find it more "natural" to try a "pipe" approach (a la Unix) so that the two processes execute concurrently. With a traditional pipe, the data need never hit the disk at all.
A shame, isn't it, that in this age of overwhelming processing power, we are still struggling with our disk IO?
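To make the memory-mapping approach above a bit more concrete, a minimal sketch of the writer side using Boost.Interprocess (the name and size are placeholders; the reader process would open the same name with open_only):

    #include <boost/interprocess/shared_memory_object.hpp>
    #include <boost/interprocess/mapped_region.hpp>

    using namespace boost::interprocess;

    int main() {
        // Create a named shared-memory block and map it into this process.
        // (create_only throws if a block with this name already exists.)
        shared_memory_object shm(create_only, "MySharedBlock", read_write);
        shm.truncate(1024 * 1024);  // reserve 1 MB
        mapped_region region(shm, read_write);

        char* mem = static_cast<char*>(region.get_address());
        // ... write records into `mem` for the reader to consume ...

        shared_memory_object::remove("MySharedBlock");
        return 0;
    }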
If your app reads the data sequentially, migrating to a DB would not help performance. If random access is used, you should consider moving the data into a DB, especially if different indices are used. You should also check whether enough resources are available: if the file is loaded completely into memory, virtual memory management could affect performance (swapping, paging). Depending on your OS settings, a limit on file I/O buffers could be reached. The file system itself could be fragmented.
To get a higher-quality answer you should provide information about your hardware, OS, memory and file system, and about the way your data file is used. Then you could get hints about kernel tuning, etc.
So what is the retrieval mechanism here? How does your application know which of the smaller files to look in to find a record? If you have split up the big file without implementing some form of keyed lookup - indexing, partitioning - you have not addressed the problem, just re-arranged it.
Of course, if you have implemented some form of indexing then you have started down the road of building your own database.
Without knowing more regarding your application it would be rash for us to offer specific advice. Maybe the solution would be to apply an RDBMS solution. Possibly a NoSQL approach would be better. Perhaps you need a text indexing and retrieval engine.
So...
How often does your application need to retrieve records?
How does it decide which records to get?
What is your definition of poor performance?
Why did you (your project) decide to use flat files rather than a database in the first place?
What sort of records are we talking about?