Ok so I'm working on a game project. Just finished rebuilding a game engine I designed some time ago. I'm looking at making a proprietary file type to store data rather than using a database like sqlite.
Looking at making this work with the game as efficiently and quickly as possible right off the bat without going too deep into it. And then improving over time.
My question is: Is it more efficient overall to load the data from the file and store it in a data manager class to be reused? Or is it more efficient overall to continually pull from the file?
Assuming the file follows some form of consistent structure for its data, and we're looking at the largest "table" being something like 30 columns with roughly 1000 rows of data.
Here's a handy chart of "Latency Numbers Every Computer Programmer Should Know"
The far right hand side of the chart (red) has the time it takes to read 1 MB from disk. The green column has the same value read from RAM.
What this shows us is that you should do almost anything to avoid having to directly interact with the disk. Keeping data in RAM is good. Keeping data on disk is bad. (Memory mapped files might provide a way to handle this.)
This aside, reinventing the wheel is almost always the wrong solution. Sqlite works and works well. If it's not ideally suited for your needs, there are other file types out there.
If you're "looking at making this work with the game as efficiently and quickly as possible right off the bat without going too deep into it. And then improving over time", you'll find that's easiest to do if you reuse preexisting solutions to common problems.
Repeatedly reading from a file is generally not a good idea. Modern operating systems do keep large I/O caches (so if you keep reading the same stuff it won't really hit the disk), but syscalls are of course far more expensive than accessing memory directly. Whether this is actually going to be a performance problem in your specific case is impossible to judge from the information you provided. On the other hand, if you have a lot of data to access, keeping it all in memory can be wasteful, slow to load and, under memory pressure, lead to paging.
The easy way out of this conundrum is to map the file in memory; the data is automatically fetched from disk when required and, unless the system is under memory pressure, frequently accessed pages remain cached in RAM, guaranteeing you fast access.
Of course this is feasible only if the data you need to map is smaller than the address space, but given the example you provided (30 columns/1000 rows, which is really small) it shouldn't be a problem at all.
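As a rough illustration of the mapping approach, here is a minimal sketch for Linux/POSIX. The names `MappedFile`, `map_readonly` and `unmap` are made up for this example, and it assumes the file is only read, never written.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Read-only view of a file mapped into memory (hypothetical helper type).
struct MappedFile {
    const std::uint8_t* data;
    std::size_t size;
};

// Maps the whole file; returns an empty view on failure or for an empty file.
inline MappedFile map_readonly(const char* path) {
    assert(path != nullptr);
    MappedFile m{nullptr, 0};
    int fd = open(path, O_RDONLY);
    if (fd < 0) return m;
    struct stat st;
    if (fstat(fd, &st) == 0 && st.st_size > 0) {
        const std::size_t len = static_cast<std::size_t>(st.st_size);
        void* p = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            m.data = static_cast<const std::uint8_t*>(p);
            m.size = len;
        }
    }
    close(fd);  // the mapping stays valid after the descriptor is closed
    return m;
}

// Releases the mapping and resets the view.
inline void unmap(MappedFile& m) {
    if (m.data != nullptr) munmap(const_cast<std::uint8_t*>(m.data), m.size);
    m = MappedFile{nullptr, 0};
}
```

After `map_readonly` succeeds, `m.data[i]` is an ordinary memory access; the OS decides which pages are actually resident.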
If you can hold the data in RAM, that is more efficient. It is quicker for your computer to access values that are in RAM, a cache or the CPU's registers than to fetch them from the hard drive. Every read from the drive has to go through the operating system and its drivers, which takes a lot of time compared with a memory access; holding the data in memory is therefore more efficient.
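To make the "load once, then reuse" option concrete, here is a minimal sketch. It assumes a hypothetical binary file of fixed-size rows with 30 32-bit integer columns each; `Row`, `DataManager` and the file layout are illustrative choices, not something from the question. At 30 columns and 1000 rows that is 120,000 bytes, which fits in memory easily.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical fixed-size row: 30 integer columns stored back to back in the file.
using Row = std::array<std::int32_t, 30>;

class DataManager {
public:
    // Reads the whole table from disk once; later lookups only touch RAM.
    explicit DataManager(const std::string& path) {
        std::FILE* f = std::fopen(path.c_str(), "rb");
        if (!f) throw std::runtime_error("cannot open " + path);
        Row row;
        while (std::fread(row.data(), sizeof(std::int32_t), row.size(), f) == row.size())
            rows_.push_back(row);
        std::fclose(f);
    }

    std::size_t size() const { return rows_.size(); }

    const Row& row(std::size_t i) const {
        assert(i < rows_.size());
        return rows_[i];
    }

private:
    std::vector<Row> rows_;
};
```

The rest of the game would then call `row(i)` on one shared `DataManager` instead of going back to the file.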
I want to know what's the optimal way to log to an SSD. Think of something like a database log, where you're writing append-only, but you also have to fsync() every transaction or few transactions to ensure application level data durability.
I'm going to give some background on how SSDs work, so if you already know all this, please skim it anyway in case I am wrong about something. Some good stuff for further reading is Emmanuel Goossaert's 6-part guide to coding for SSDs and the paper Don't Stack your Log on my Log [pdf].
SSDs write and read in whole pages only, where the page size differs from SSD to SSD but is typically a multiple of 4 KB. My Samsung EVO 840 uses an 8 KB page size (which, incidentally, Linus calls "unusable shit" in his usual colorful manner). SSDs cannot modify data in place; they can only write to free pages. So, combining those two restrictions, updating a single byte on my EVO requires reading the 8 KB page, changing the byte, writing it to a new 8 KB page, and updating the FTL page mapping (an SSD data structure) so the logical address of that page as understood by the OS now points to the new physical page. Because the file data is also no longer contiguous in the same erase block (the smallest group of pages that can be erased), we are also building up a form of fragmentation debt that will cost us in future garbage collection in the SSD. Horribly inefficient.
As an aside, looking at my PC filesystem with C:\WINDOWS\system32>fsutil fsinfo ntfsinfo c: it has a 512-byte sector size and a 4 KB allocation (cluster) size. Neither of these maps to the SSD page size, which is probably not very efficient.
There are some issues with just writing with e.g. pwrite() to the kernel page cache and letting the OS handle writing things out. First off, you'll need to issue an additional sync_file_range() call after calling pwrite() to actually kick off the IO, otherwise it will all wait until you call fsync() and unleash an IO storm. Secondly, fsync() seems to block future calls to write() on the same file. Lastly, you have no control over how the kernel writes things to the SSD, which it may do well, or it may do poorly, causing a lot of write amplification.
Because of the above reasons, and because I need AIO for reads of the log anyway, I'm opting for writing to the log with O_DIRECT and O_DSYNC and having full control.
As I understand it, O_DIRECT requires all writes to be aligned to sector size and in whole numbers of sectors. So every time I decide to issue an append to the log, I need to add some padding to the end to bring it up to a whole number of sectors (if all writes are always a whole number of sectors, they will also be correctly aligned, at least in my code.) Ok, that's not so bad. But my question is, wouldn't it be better to round up to a whole number of SSD pages instead of sectors? Presumably that would eliminate write amplification?
That could burn a huge amount of space, especially if writing small amounts of data to the log at a time (e.g. a couple hundred bytes). It also may be unnecessary. SSDs like the Samsung EVO have a write cache, and they don't flush it on fsync(). Instead they rely on capacitors to write the cache out to the SSD in the event of a power loss. In that case, maybe the SSD does the right thing with an append-only log being written sectors at a time: it may not write out the final partial page until the next append(s) arrives and completes it (or unless it is forced out of the cache due to large amounts of unrelated IOs). Since the answer to that likely varies by device and maybe filesystem, is there a way I can code up the two possibilities and test my theory? Some way to measure write amplification or the number of updated/RMW pages on Linux?
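The padding cost behind this trade-off is plain arithmetic, sketched below. The helper names `round_up` and `padding` are made up for this example, and the 200-byte record is just the "couple hundred bytes" case mentioned above; 512 and 8192 are the sector and page sizes from the question.

```cpp
#include <cassert>
#include <cstddef>

// Rounds len up to the next multiple of align (align must be non-zero).
constexpr std::size_t round_up(std::size_t len, std::size_t align) {
    return (len + align - 1) / align * align;
}

// Padding bytes needed to bring a record of len bytes up to a multiple of align.
constexpr std::size_t padding(std::size_t len, std::size_t align) {
    return round_up(len, align) - len;
}

// A 200-byte append padded to a 512-byte sector wastes 312 bytes...
static_assert(padding(200, 512) == 312, "sector-aligned append");
// ...while padding the same append to an 8 KB page wastes 7992 bytes.
static_assert(padding(200, 8192) == 7992, "page-aligned append");
```

So for small appends, page-sized padding costs roughly 25 times more space than sector-sized padding in this example; whether it buys anything in write amplification is exactly what would need measuring.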
I will try to answer your question, as I had the same task but with SD cards, which are also flash memory.
Short Answer
You can only write a full page of 512 bytes in flash memory. Given the flash memory has a poor write count, the driver chip is buffering/randomizing to improve your drive lifetime.
To write a bit in flash memory, you must erase the entire page (512 bytes) where it sits first. So if you want to append or modify 1 byte somewhere, first it has to erase the entire page where it resides.
The process can be summarized as:
* Read the whole page to a buffer
* Modify the buffer with your added content
* Erase the whole page
* Rewrite the whole page with the modified buffer
Long Answer
The sector (page) size is basically down to the hardware of the flash implementation and the physical flash driver, over which you have no control. That page has to be cleared and rewritten each time you change something.
As you probably already know, you cannot rewrite a single bit in a page without clearing and rewriting the entire 512 bytes. Now, Flash drives have a write cycle life of about 100'000 before a sector can be damaged. To improve lifetime, usually the physical driver, and sometimes the system will have a writing randomization algorithm to avoid always writing the same sector. (By the way, never do defragmentation on an SSD; it's useless and at best reduces the lifetime).
Concerning the cluster, this is handled at a higher level, which is related to the file system, and over this you do have control. Usually, when you format a new hard drive, you can select the cluster size, which on Windows is called the Allocation Unit Size in the format window.
Most file systems I know of work with an index located at the beginning of the disk. This index keeps track of each cluster and what is assigned to it. This means a file will occupy at least 1 cluster, even if it's much smaller.
Now the trade-off is that the smaller your cluster size, the bigger your index table will be, and it will occupy a lot of space. But if you have a lot of small files, you will waste less space.
On the other hand, if you only store big files, you want to select the biggest cluster size, just slightly higher than your file size.
Since your task is to perform logging, I would recommend logging to a single, huge file with a big cluster size. In my experience with this type of log, having a large number of files within a single folder can cause issues, especially on embedded devices.
Implementation
Now, if you have raw access to the drive and want to really optimize, you can directly write to the disk without using the file system.
On the upside
* Will save you quite some disk space
* Will render the disk tolerant in case of failure if your design is smart enough
* Will require much fewer resources if you are on a limited system
On the downside
* Much more work and debug
* The drive won't be natively recognized by the system.
If you only log, you don't need to have a file system, you just need an entry point to a page where to write your data, which will continuously increase.
The implementation I did on an SD card reserved 100 pages at the beginning of the flash to store information about the write and read locations. This information fit in a single page, but to avoid wearing that page out, I wrote it sequentially in a circular fashion over the 100 pages and had an algorithm to check which page contained the most recent information.
The position was stored every 5 minutes or so, which means that in case of a power outage I would lose only 5 minutes of the log. It is also possible to start from the last stored write location and check whether further sectors contain valid data before writing further.
This provided a very robust solution, as table corruption is much less likely.
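The circular position store described above could look roughly like the sketch below. It works on an in-memory array of slots standing in for the reserved pages; the `MetaSlot` layout, the `kMagic` marker and the use of a sequence number to find the newest slot are assumptions made for illustration, not necessarily how the original SD card code did it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical layout of one metadata page: the magic value marks a written slot,
// the sequence number orders the slots, and write_pos is the payload.
struct MetaSlot {
    std::uint32_t magic;
    std::uint64_t sequence;
    std::uint64_t write_pos;
};

constexpr std::uint32_t kMagic = 0x4C4F4731;  // arbitrary marker chosen for this sketch

// Returns the index of the valid slot with the highest sequence number, or -1 if none.
inline int find_latest(const std::vector<MetaSlot>& slots) {
    int best = -1;
    for (std::size_t i = 0; i < slots.size(); ++i) {
        if (slots[i].magic != kMagic) continue;  // never written (or erased)
        if (best < 0 || slots[i].sequence > slots[static_cast<std::size_t>(best)].sequence)
            best = static_cast<int>(i);
    }
    return best;
}

// Stores a new position in the slot after the latest one, wrapping around,
// so that writes are spread evenly over all reserved slots.
inline void store_position(std::vector<MetaSlot>& slots, std::uint64_t write_pos) {
    assert(!slots.empty());
    const int latest = find_latest(slots);
    const std::size_t next =
        latest < 0 ? 0 : (static_cast<std::size_t>(latest) + 1) % slots.size();
    const std::uint64_t seq =
        latest < 0 ? 1 : slots[static_cast<std::size_t>(latest)].sequence + 1;
    slots[next] = MetaSlot{kMagic, seq, write_pos};
}
```

On startup, `find_latest` over the reserved pages gives the most recent write position; a real implementation would probably also want a checksum per slot to reject a page that was only half written.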
I would also suggest buffering 512 bytes and writing page by page.
Others
You may also want to check some log specific file system, they might simply do the job for you: Log-structured file system
I'm developing a tool for wavelet image analysis and machine learning on Linux machines in C++.
It is limited by the size of the images, the number of scales and their corresponding filters (up to 2048x2048 doubles) for each of N orientations as well as additional memory and processing overhead by a machine learning algorithm.
Unfortunately my skills of Linux system programming are shallow at best, so I'm currently using no swap but figure it should be possible somehow?
I'm required to keep the imaginary and real part of the filtered images of each scale and orientation, as well as the corresponding wavelets for reconstruction purposes. I keep them in memory for additional speed for small images.
Regarding the memory use, I already:
* store everything no more than once,
* store only what is needed,
* cut out any double entries or redundancy,
* pass by reference only,
* use pointers over temporary objects,
* free memory as soon as it is not required any more and
* limit the number of calculations to the absolute minimum.
As with most data processing tools, speed is of the essence. As long as there is enough memory, the tool is about 3x as fast as the same implementation in Matlab code.
But as soon as I'm out of memory nothing goes any more. Unfortunately most of the images I'm training the algorithm on are huge (raw data 4096x4096 double entries, after symmetric padding even larger), therefore I hit the ceiling quite often.
Would it be bad practise to temporarily write data that is not needed for the current calculation / processing step from memory to the disk?
What approach / data format would be most suitable to do that?
I was thinking of using rapidXML to read and write an XML to a binary file and then read out only the required data. Would this work?
Is a memory-mapped file what I need? https://en.wikipedia.org/wiki/Memory-mapped_file
I'm aware that this will result in performance loss, but it is more important that the software runs smoothly and does not freeze.
I know that there are libraries out there that can do wavelet image analysis, so please spare the "Why reinvent the wheel, just use XYZ instead". I'm using very specific wavelets, I'm required to do it myself and I'm not supposed to use external libraries.
Yes, writing data to the disk to save memory is bad practice.
There is usually no need to manually write your data to the disk to save memory, unless you are reaching the limits of what you can address (4GB on 32bit machines, much more in 64bit machines).
The reason for this is that the OS is already doing exactly the same thing. It is very possible that your own solution would be slower than what the OS is doing. Read this Wikipedia article if you are not familiar with the concept of paging and virtual memory.
Did you look into using mmap and munmap to bring the images (and temporary results) into your address space and discard them when you no longer need them? mmap allows you to map the content of a file directly into memory: no more fread/fwrite, just direct memory access. Writes to the memory region are written back to the file too, and bringing back that intermediate state later on is no harder than redoing an mmap.
The big advantages are:
* No encoding in a bloated format like XML.
* Perfectly suitable for transient results such as matrices that are represented in contiguous memory regions.
* Dead simple to implement.
* Completely delegates to the OS the decision of when to swap in and out.
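A minimal sketch of what this could look like for one matrix of doubles, using only POSIX calls (no external libraries). The helper names `map_doubles` and `unmap_doubles` and the idea of one scratch file per matrix are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

// Maps `count` doubles backed by the file at `path`, creating or resizing the file
// as needed. Returns nullptr on failure.
inline double* map_doubles(const char* path, std::size_t count) {
    assert(path != nullptr && count > 0);
    const std::size_t bytes = count * sizeof(double);
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0) return nullptr;
    if (ftruncate(fd, static_cast<off_t>(bytes)) != 0) {
        close(fd);
        return nullptr;
    }
    // MAP_SHARED so that writes to the region end up in the file.
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    return p == MAP_FAILED ? nullptr : static_cast<double*>(p);
}

// Releases the mapping; the contents stay in the file for a later map_doubles call.
inline void unmap_doubles(double* data, std::size_t count) {
    munmap(data, count * sizeof(double));
}
```

A filtered image that is not needed for the current step can be unmapped and mapped again later with the same path and size; while it is mapped, the kernel is free to evict its pages under memory pressure and page them back in on access.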
This doesn't solve your fundamental problem, but: Are you sure you need to be doing everything in double precision? You may not be able to use integer coefficient wavelets, but storing the image data itself in doubles is usually pretty wasteful. Also, 4k images aren't very big ... I'm assuming you are actually using frames of some sort so have redundant entries, otherwise your numbers don't seem to add up (and are you storing them sparsely?) ... or maybe you are just using a large number at once.
As for "should I write to disk"? This can help, particularly if you are getting a 4x increase (or more) by taking image data to double precision. You can answer it for yourself though: just measure the time to load and compare it to your compute time to see if this is worth pursuing. The wavelet itself should be very cheap, so I'm guessing you're mostly dominated by your learning algorithm. In that case, go ahead and throw out original data or whatever until you need it again.
I am working on a mathematical problem that has the advantage of being able to "pre-compute" about half of the problem, save this information to file, and then reuse it many times to compute various 'instances' of my problem. The difficulty is that uploading all of this information in order to solve the actual problem is a major bottleneck.
More specifically:
I can pre-compute a huge amount of information - tons of probabilities (long double), a ton of std::map<int,int>, and much more - and save all this stuff to disk (several Gb).
The second half of my program accepts an input argument D. For each D, I need to perform a great many computations that involve a combination of the pre-computed data (from file), and some other data that are specific to D (so that the problem is different for each D).
Sometimes I will need to pick out certain pieces of pre-computed information from the files. Other times, I will need to upload every piece of data from a (large) file.
Are there any strategies for making the IO faster?
I already have the program parallelized (MPI, via boost::mpi) for other reasons, but regardless, accessing files on the disk is making my compute time unbearable.
Any strategies or optimizations?
Currently I am doing everything with cstdio, i.e. no iostream. Will that make a big difference?
Certainly the fastest (but the most fragile) solution would be to mmap the data to a fixed address. Slap it all in one big struct, and instantiate the std::map with an allocator which will allocate in a block attached to the end of the struct. It's not simple, but it will be fast; one call to mmap, and the data is in your (virtual) memory. And because you're forcing the address in mmap, you can even store the pointers, etc.
As mentioned above, in addition to requiring a fair amount of work, it's fragile. Recompile your application, and the targeted address might not be available, or the layout might be different, or whatever. But since it's really just an optimization, this might not be an issue; anytime a compatibility issue arises, just drop the old file and start over. It will make the first run after a change which breaks compatibility extremely slow, but if you don't break compatibility too often...
The stuff that isn't in a map is easy. You put everything in one contiguous chunk of memory that you know (like a big array, or a struct/class with no pointers), and then use write() to write it out. Later use read() to read it in, in a single operation. If the size might vary, then use one operation to read a single int with the size, allocate the memory, and then use a single read() to pull it in.
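The size-prefixed write()/read() scheme described above can be sketched as follows. The helpers `write_all`, `read_all`, `save_array` and `load_array` are names invented for this example, and it uses `double` rather than `long double` to keep the sketch short. Because write() and read() may transfer fewer bytes than requested, the helpers loop until everything has been moved.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <unistd.h>
#include <vector>

// Writes all n bytes, retrying on short writes. Returns false on error.
inline bool write_all(int fd, const void* buf, std::size_t n) {
    const char* p = static_cast<const char*>(buf);
    while (n > 0) {
        const ssize_t w = write(fd, p, n);
        if (w <= 0) return false;
        p += w;
        n -= static_cast<std::size_t>(w);
    }
    return true;
}

// Reads exactly n bytes, retrying on short reads. Returns false on error or early EOF.
inline bool read_all(int fd, void* buf, std::size_t n) {
    char* p = static_cast<char*>(buf);
    while (n > 0) {
        const ssize_t r = read(fd, p, n);
        if (r <= 0) return false;
        p += r;
        n -= static_cast<std::size_t>(r);
    }
    return true;
}

// Writes the element count, then the raw array, with no text conversion.
inline bool save_array(const char* path, const std::vector<double>& v) {
    assert(path != nullptr);
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) return false;
    const std::uint64_t count = v.size();
    const bool ok = write_all(fd, &count, sizeof count) &&
                    write_all(fd, v.data(), v.size() * sizeof(double));
    close(fd);
    return ok;
}

// Reads the count, allocates, then pulls the whole array in with one logical read.
inline bool load_array(const char* path, std::vector<double>& v) {
    assert(path != nullptr);
    int fd = open(path, O_RDONLY);
    if (fd < 0) return false;
    std::uint64_t count = 0;
    bool ok = read_all(fd, &count, sizeof count);
    if (ok) {
        v.resize(static_cast<std::size_t>(count));
        ok = read_all(fd, v.data(), v.size() * sizeof(double));
    }
    close(fd);
    return ok;
}
```

The resulting file is only readable on a machine with the same byte order and type sizes, which is the usual price of skipping the ASCII conversion.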
The map part is a bit harder, since you can't do it all in one operation. Here you need to come up with a convention for serializing it. To make the i/o as fast as possible, your best bet is to convert it from the map to an in-memory form that is all in one place and you can convert back to the map easily and quickly. If, for example your keys are ints, and your values are of constant size then you could make an array of keys, and an array of values, copy your keys into the one array and values into the other, and then write() the two arrays, possibly writing out their size as well. Again, you read things in with only two or three calls to read().
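For the map, the conversion to two flat arrays and back might look like this sketch. `FlatMap`, `flatten` and `unflatten` are hypothetical names; the two vectors are what you would then hand to write() and read(), each with its size.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Two parallel arrays holding the keys and values of a std::map<int, int>.
struct FlatMap {
    std::vector<int> keys;
    std::vector<int> values;
};

// Copies the map into two contiguous arrays, in key order.
inline FlatMap flatten(const std::map<int, int>& m) {
    FlatMap f;
    f.keys.reserve(m.size());
    f.values.reserve(m.size());
    for (const auto& kv : m) {
        f.keys.push_back(kv.first);
        f.values.push_back(kv.second);
    }
    return f;
}

// Rebuilds the map. The keys arrive sorted, so inserting with an end() hint
// avoids a tree search for every element.
inline std::map<int, int> unflatten(const FlatMap& f) {
    assert(f.keys.size() == f.values.size());
    std::map<int, int> m;
    for (std::size_t i = 0; i < f.keys.size(); ++i)
        m.emplace_hint(m.end(), f.keys[i], f.values[i]);
    return m;
}
```

If the maps are only ever searched after loading, it may not even be necessary to rebuild them: a binary search over the sorted `keys` array gives the same lookups.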
Note that nothing ever got translated to ASCII, and there are a minimum number of system calls. The file will not be human readable, but it will be compact, and fast to read in. Three things make i/o slow: 1) system calls, if you use small reads/writes; 2) translation to/from ASCII (printf, scanf); 3) disk speed. Hard to do much about 3) (other than an SSD). You can do the read in a background thread, but you might need to block waiting for the data to be in.
Some guidelines:
* multiple calls to read() are more expensive than a single call
* binary files are faster than text files
* a single file is faster than multiple files for large values of "multiple"
* use memory-mapped files if you can
* use a 64-bit OS to let the OS manage the memory for you
Ideally, I'd try to put all long doubles into memory-mapped file, and all maps into binary files.
Divide and conquer: if 64 bits is not an option, try to break your data into large chunks in a way that all chunks are never used together, and the entire chunk is needed when it's needed. This way you could load the chunks when they are needed and discard them when they are not.
These suggestions of loading all the data into RAM are good when two conditions are met:
* The sum of all I/O times during the run is much more than the cost of loading all the data into RAM.
* A relatively large portion of the data is accessed during the application run.
(They are usually met when an application runs for a long time processing different data.)
However for other cases other options might be considered.
E.g. it is essential to understand whether the access pattern is truly random. If not, look into reordering the data so that items that are accessed together are close to each other. This will ensure that OS caching performs at its best, and it will also reduce HDD seek times (not an issue for an SSD, of course).
If accesses are truly random, and the application does not run long enough to amortize the one-time data loading cost, I would look into the architecture, e.g. extracting this data manager into a separate module that keeps the data preloaded.
For Windows it might be a system service; for other OSes other options are available.
Cache, cache, cache. If it's only several GB it should be feasible to cache most if not all of your data in something like memcached. This is an especially good solution if you're using MPI across multiple machines rather than just multiple processors on the same machine.
If it's all running on the same machine, consider a shared memory cache if you have the memory available.
Also, make sure your file writes are being done on a separate thread. No need to block an entire process waiting for a file to write.
As was said, cache as much as you can in memory.
If you're finding that the amount you need to cache is larger than your memory will allow, try swapping the caches between memory and disk, the way virtual memory pages are swapped to disk. It is essentially the same problem.
One common method is the Least Recently Used Algorithm for determining which page will be swapped.
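For reference, a least-recently-used cache can be sketched in a few dozen lines with a list and a hash map. This `LruCache` class is an illustrative sketch, not taken from any library; what you store as key and value (file offsets, decoded chunks, and so on) depends on your data.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <unordered_map>
#include <utility>

template <typename K, typename V>
class LruCache {
public:
    explicit LruCache(std::size_t capacity) : capacity_(capacity) { assert(capacity > 0); }

    // Returns a pointer to the cached value and marks it most recently used,
    // or nullptr if the key is not cached.
    V* get(const K& key) {
        auto it = index_.find(key);
        if (it == index_.end()) return nullptr;
        items_.splice(items_.begin(), items_, it->second);  // move to front
        return &it->second->second;
    }

    // Inserts or updates a value, evicting the least recently used entry when full.
    void put(const K& key, V value) {
        auto it = index_.find(key);
        if (it != index_.end()) {
            it->second->second = std::move(value);
            items_.splice(items_.begin(), items_, it->second);
            return;
        }
        if (items_.size() == capacity_) {
            index_.erase(items_.back().first);
            items_.pop_back();
        }
        items_.emplace_front(key, std::move(value));
        index_[key] = items_.begin();
    }

    std::size_t size() const { return items_.size(); }

private:
    std::size_t capacity_;
    std::list<std::pair<K, V>> items_;  // front = most recently used
    std::unordered_map<K, typename std::list<std::pair<K, V>>::iterator> index_;
};
```

On a miss (`get` returns nullptr) you would read the item from disk and `put` it; both operations are constant time on average.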
It really depends on how much memory is available and what the access pattern is.
The simplest solution is to use memory mapped files. This generally requires that the file has been laid out as if the objects were in memory, so you will need to only use POD data with no pointers (but you can use relative indexes).
You need to study your access pattern to see if you can group together the values that are often used together. This will help the OS in better caching those values (ie, keeping them in memory for you, rather than always going to the disk to read them).
Another option will be to split the file into several chunks, preferably in a logical way. It might be necessary to create an index file that maps a range of values to the file that contains them.
Then you can access only the set of files required.
Finally, for complex data structures (where memory mapped files fail) or for sparse reading (when you only ever extract a small piece of information from a given file), it might be interesting to read about LRU caches.
The idea will be to use serialization and compression. You write several files, among which an index, and compress all of them (zip). Then, at launch time, you start by loading the index and save it in memory.
Whenever you need to access a value, you first try your cache; if it is not there, you access the file that contains it, decompress it in memory, and dump its content in your cache. Note: if the cache is too small, you have to be picky about what you dump in... or reduce the size of the files.
The frequently accessed values will stay in cache, avoiding unnecessary round-trip, and because the file is zipped there will be less IO.
Structure your data in a way that caching can be effective. For instance, when you are reading "certain pieces," if those are all contiguous it won't have to seek around the disk to gather all of them.
Reading and writing in batches, instead of record by record will help if you are sharing disk access with another process.
More specifically: I can pre-compute a huge amount of information - tons of probabilities (long double), a ton of std::map, and much more - and save all this stuff to disk (several Gb).
As far as I understand, the std::map objects are also pre-calculated and there are no insert/remove operations, only searches. How about replacing the maps with a hash table such as std::unordered_map (hash_map in older libraries) or sparsehash? In theory it can give a performance gain.
More specifically: I can pre-compute a huge amount of information - tons of probabilities (long double), a ton of std::map, and much more - and save all this stuff to disk (several Gb).
Don't reinvent the wheel. I'd suggest using a key-value data store, such as Berkeley DB: http://docs.oracle.com/cd/E17076_02/html/gsg/C/concepts.html
This will enable saving and sharing the files, caching the parts you actually use a lot and keeping other parts on disk.
I have a very latency sensitive routine that generates integers sequentially, but needs to store the last generated one to disk in case of a crash or re-start.
Currently I'm doing a seek to beginning of file then writing out the integer then flush each time a new int is generated. The flush is required so the write at least hits the battery-backed controller cache.
The seek is quite costly so I was thinking about just appending 4 bytes and if recovery is needed then to seek to the end and read the last 4 bytes. This previous statement obviously assumes that there isn't too much other disk activity happening, so the write head should ideally stay at end of the file.
The number won't typically go higher than 10,000,000 so 40MB isn't so bad.
Any advice as to how to achieve minimum latency without sacrificing integrity?
C or C++ on Linux 2.6+
I would think the fastest/easiest way to do this would be with mmap/msync -- mmap 1 page of the file into memory and store the value on that page. Any time the value changes, call msync(2) to force the page back to disk. This way you need only one system call per store.
If I read correctly, how about using a memory mapped file? Just write your number to the assigned address and it appears in the file. This assumes that the OS writes the cache to disk robustly when needed, but you might find it worth a try.
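A minimal sketch of the mmap/msync approach, assuming a dedicated one-page file for the counter. `MappedCounter`, `map_counter` and `store_counter` are names made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

// A counter living in the first bytes of a one-page shared file mapping.
struct MappedCounter {
    std::uint32_t* value;
    std::size_t length;
};

// Maps one page of the file read-write and shared, so that stores reach the file.
// Returns {nullptr, 0} on failure.
inline MappedCounter map_counter(const char* path) {
    assert(path != nullptr);
    MappedCounter c{nullptr, 0};
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0) return c;
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    if (ftruncate(fd, static_cast<off_t>(page)) == 0) {
        void* p = mmap(nullptr, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p != MAP_FAILED) {
            c.value = static_cast<std::uint32_t*>(p);
            c.length = page;
        }
    }
    close(fd);
    return c;
}

// Stores the value and blocks until the page has been written back:
// a plain memory store followed by a single msync() call.
inline bool store_counter(const MappedCounter& c, std::uint32_t value) {
    assert(c.value != nullptr);
    *c.value = value;
    return msync(c.value, c.length, MS_SYNC) == 0;
}
```

After a restart, `map_counter` on the same path gives back the last stored value in `*c.value`. Whether msync() is actually faster than pwrite() plus fdatasync() on a given setup is something to measure rather than assume.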
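#include <fcntl.h>     // open
#include <sys/mman.h>  // mmap
size_t len = sizeof(unsigned);
int fildes = open(...);  // open read-write; the file must be at least len bytes long
// PROT_WRITE and MAP_SHARED are needed for stores to reach the file
void* address = mmap(0, len, PROT_READ | PROT_WRITE, MAP_SHARED, fildes, 0);
unsigned* mappedNumber = (unsigned*)(address);
*mappedNumber can now contain your integer.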
Measure.
How much control do you have over the hardware? If anything less than full, you'll get no guarantees.
On Linux I'd probably try making a kernel driver that would do its writes with the highest priority, possibly even without using a file system.
But, theoretically... If it is enough for you to hit the controller cache, data will hit it every time you flush anything to disk. This means that regardless of whether there will be a physical seek inside the drive or not, the data will already be there. And because you'll never know what other applications will do, or how fast the disk rotates, your seeks will be random even if you keep the logical file handle at the beginning or end of the file.
And you can always ask your user to use a flash drive.
The fastest way to write a file is to map that file into memory and treat it as a char array.
You don't need to sync the file if you don't care about OS crashes (Linux never crashed on me in production). All your writes go to that file mapping bypassing the kernel, in other words, real zero-copy (you can't do that with sockets on the standard hardware yet). You may need to keep a header in that file that contains the number of records written, in case your application crashes while writing a record into the memory. I.e. write a record and only after that increment the record counter.
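The "write the record, then bump the counter" rule could be sketched like this. The `LogHeader` and `Record` layouts and the `append_record` helper are assumptions for illustration; `base` would be the start of the shared file mapping.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical layout at the start of the mapped file: a record count,
// followed by fixed-size records.
struct LogHeader {
    std::uint64_t count;
};

struct Record {
    std::uint32_t value;
};

// Appends a record to a log laid out in `base` (for example a MAP_SHARED mapping).
// The record is written first and the counter is incremented last, so a record
// that was only half written when the application crashed is never counted.
inline bool append_record(void* base, std::size_t capacity, const Record& r) {
    assert(base != nullptr);
    LogHeader* header = static_cast<LogHeader*>(base);
    char* records = static_cast<char*>(base) + sizeof(LogHeader);
    const std::size_t offset = static_cast<std::size_t>(header->count) * sizeof(Record);
    if (sizeof(LogHeader) + offset + sizeof(Record) > capacity)
        return false;  // full: the caller has to grow the file and remap
    std::memcpy(records + offset, &r, sizeof r);
    // Compiler barrier so the record store is not reordered after the counter update.
    std::atomic_signal_fence(std::memory_order_release);
    header->count += 1;
    return true;
}
```

On recovery, only the first `count` records are trusted. This protects against an application crash; surviving an OS crash or power loss would additionally need msync() at suitable points.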
Resizing this file requires an ftruncate()/mremap() sequence, which may take a bit too long, so you may want to minimize resizing by growing the file by a factor, the way std::vector<> grows geometrically (by 1.5x or 2x in common implementations) on push_back() when it overflows. Depending on your throughput and latency requirements, certain optimizations can be applied.
The kernel is going to write the file mapping to disk asynchronously (as if there were another thread in your application dedicated to writing to disk). There is a way to force the writes to disk if necessary by using msync(). This is only necessary, however, if you'd like to survive an OS crash. But surviving an OS crash requires sophisticated application design anyway, so in practice surviving the application crash is good enough.
Why does your application have to wait for the write to complete at all?
Write your data asynchronously, or perhaps from another thread.
You don't really have much low-level control over the hard drive. As long as you write so little data at a time, you're going to incur a lot of expensive seeks. But since you're only using it as "checkpoints" to recover from in case of a crash, there seems to be no reason why the write couldn't occur asynchronously.
Storing an int only takes one block on disc, regardless of block size. So you have to sync one block to disc, and it takes as long as it takes, and there is nothing you can do to make it faster.
Whatever else you do, fdatasync() will be the killer, time-wise. It will sync one block into your (battery-backed RAID) controller.
Unless you have some kind of non-volatile ram, all (sensible) methods are going to be exactly equivalent because they all require one block to be sync'd.
Doing a seek system call is not going to make any difference, as that has no effect on hardware. In any case, you can avoid it by using pwrite().
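For comparison, the pwrite()/fdatasync() variant is only a few lines. `checkpoint` and `recover` are hypothetical helper names, and the sketch assumes the counter sits at offset 0 of an already opened file.

```cpp
#include <cassert>
#include <cstdint>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

// Overwrites the 4 bytes at offset 0 and waits until the data has been handed to
// the device (or its battery-backed cache). pwrite() takes the offset directly,
// so no separate seek call is needed.
inline bool checkpoint(int fd, std::uint32_t value) {
    assert(fd >= 0);
    if (pwrite(fd, &value, sizeof value, 0) != static_cast<ssize_t>(sizeof value))
        return false;
    return fdatasync(fd) == 0;
}

// Reads the last checkpointed value back, for recovery after a restart.
inline bool recover(int fd, std::uint32_t& value) {
    assert(fd >= 0);
    return pread(fd, &value, sizeof value, 0) == static_cast<ssize_t>(sizeof value);
}
```

Because the same 4 bytes are overwritten each time, the file never grows, and fdatasync() does not have to flush a changed file size along with the data.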
Consider what "appending 4 bytes" means. Disks don't store files, or even bytes. They store clusters, and a fixed number of them. The notion of a file is created by the OS. It allocates some clusters to file system tables, to keep track of where a file is precisely located. Now, appending 4 bytes means at least writing the 4 bytes to a cluster. But that also means determining which cluster. What's the existing file size? Do we need a new cluster? If not, we need to read the last cluster, patch the 4 bytes in the correct position, and write back the cluster, then update the file size in the file system. If we do append a new cluster, we can write the 4 bytes followed by zeroes (don't need old value) but we need to do a whole lot of bookkeeping to add a cluster to a file.
So, the absolute fastest way cannot ever be to append 4 bytes. You must overwrite 4 existing bytes. Preferably in a sector that you already have in memory. Others have already pointed out that you can achieve this with mmap/msync.
Obviously, given current SSD and developer prices, and your 40 MB limit, you'll be using an SSD. It pays for itself if you save an hour. Therefore seek times are irrelevant; SSDs don't have physical heads.
There are a lot of people here talking about mmap() as if that will fix something, but your syscall overhead is basically zero compared to the disk write overhead. Remember that appending or writing to a file requires you to update the inode (mtime, filesize) anyway, so that means a disk seek.
I suggest you consider storing the integer somewhere other than a disk. For example:
* Write it to some nvram that you control (e.g. on an embedded system). (If your RAID controller has nvram for writing, it might do this for you. But if you're asking this question, it probably doesn't.)
* Write it to free bytes in the system CMOS memory (e.g. on PC hardware).
* Write it to another machine on the network (if it's a fast network) and get them to acknowledge.
* Redesign your application so you can get away with syncing after every n transactions, instead of after every transaction. That will be about n times faster than doing it every time.
* Redesign your application so that if the integer is lost, the changes from your most recent transaction are also lost. Then the fact that you've technically lost an integer update doesn't matter; when you reboot, it'll be as if you never incremented it, so you can just resume from there.
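The "sync after every n transactions" option from the list above could be sketched as follows. `BatchedCheckpoint` is a made-up class for illustration; it assumes the application can tolerate losing up to n - 1 of the most recent updates after a crash (for example by skipping ahead by n on recovery).

```cpp
#include <cassert>
#include <cstdint>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

class BatchedCheckpoint {
public:
    BatchedCheckpoint(int fd, unsigned interval) : fd_(fd), interval_(interval) {
        assert(fd >= 0 && interval > 0);
    }

    // Overwrites the stored value on every call, but only every interval-th call
    // pays for fdatasync(); the others return as soon as the page cache is updated.
    bool store(std::uint32_t value) {
        if (pwrite(fd_, &value, sizeof value, 0) != static_cast<ssize_t>(sizeof value))
            return false;
        if (++pending_ < interval_) return true;
        pending_ = 0;
        ++syncs_;
        return fdatasync(fd_) == 0;
    }

    // Number of fdatasync() calls issued so far.
    unsigned syncs() const { return syncs_; }

private:
    int fd_;
    unsigned interval_;
    unsigned pending_ = 0;
    unsigned syncs_ = 0;
};
```

With an interval of 3, seven stores cost two syncs instead of seven.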
You didn't explain why you need this behaviour; to be honest, if your app needs this, it sounds like your application is probably not designed very well. For example, some people suggested using a database because they do this sort of thing all the time; true, but databases do it by being slow (ie. syncing the disk every time), unless you create a transaction first, in which case the disk only needs to get synced when you do 'commit transaction'. But if you absolutely must have a sync after every integer, you'd be constantly committing transactions, and a database couldn't save you from that; there's no magical way a database could guarantee not to lose data unless it does at least fdatasync().