Decreasing performance writing large binary file - c++

In one of our applications we create records and store them in a binary file. Once the writing operation is complete, we read this binary file back. The issue is that as long as the file is smaller than 100 MB the performance is good enough, but once it grows beyond that the performance suffers.
So I thought of splitting this large binary file (> 100 MB) into smaller ones (< 100 MB), but that does not seem to recover the lost performance. What would be a better approach to handle this scenario?
It would be a great help if you could comment on this.
Thanks

Maybe you could try using an SQLite database instead.
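For illustration, here is a rough sketch of what that could look like with the SQLite C API. The database name, table layout, and record contents are placeholders I made up, not details from the question; the point is that SQLite handles indexing and paging, so performance stays predictable as the data grows. Build with -lsqlite3.

    // A rough sketch with the SQLite C API; database name, table layout and the
    // record contents are placeholders. Build with -lsqlite3.
    #include <sqlite3.h>

    int main() {
        sqlite3* db = nullptr;
        if (sqlite3_open("records.db", &db) != SQLITE_OK) return 1;

        sqlite3_exec(db,
            "CREATE TABLE IF NOT EXISTS records(id INTEGER PRIMARY KEY, payload BLOB);",
            nullptr, nullptr, nullptr);

        // Wrapping many inserts in one transaction keeps write performance stable
        // even as the database grows well past 100 MB.
        sqlite3_exec(db, "BEGIN;", nullptr, nullptr, nullptr);
        sqlite3_stmt* stmt = nullptr;
        sqlite3_prepare_v2(db, "INSERT INTO records(payload) VALUES(?1);", -1, &stmt, nullptr);
        const char record[] = "example record bytes";
        for (int i = 0; i < 1000; ++i) {
            sqlite3_bind_blob(stmt, 1, record, sizeof record, SQLITE_STATIC);
            sqlite3_step(stmt);
            sqlite3_reset(stmt);
        }
        sqlite3_finalize(stmt);
        sqlite3_exec(db, "COMMIT;", nullptr, nullptr, nullptr);
        sqlite3_close(db);
    }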

It is always quite difficult to provide accurate answers with only a glimpse of the system, but have you actually measured the throughput?
As a first solution, I would simply recommend using a dedicated disk (so there are no concurrent read/write operations from other processes), and a fast one at that. That way the only cost is a hardware upgrade, and we all know hardware is usually cheaper than software ;) You may even go for a RAID controller to maximize throughput.
If you are still limited by disk throughput, there are newer options based on flash memory: USB keys (though they may not seem very professional) or the "new" solid state drives may provide more throughput than a mechanical disk.
Now, if the disk approach is not fast enough or you can't get your hands on good SSDs, there are other solutions, but they involve software changes, and I propose them off the top of my head.
A socket approach: the second utility listens on a port and you send it the data there. On a local machine this is relatively fast, and you parallelize the work too, so even if the size of the data grows, you will still start processing it fairly quickly.
A memory mapping approach: write to a dedicated area in live memory and have the utility read from that area (Boost.Interprocess may help, there are other solutions).
Note that if the read is sequential, I find it more "natural" to try a 'pipe' approach (à la Unix) so that the two processes execute concurrently. With a traditional pipe, the data may never hit the disk at all.
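For what it's worth, here is a hypothetical POSIX sketch of that pipe approach: the writer and the reader run concurrently, and the records need never touch the disk. The fixed record size and the empty processing step are placeholders for illustration.

    // Hypothetical POSIX sketch of the pipe approach: writer and reader run
    // concurrently and the records never need to touch the disk. The fixed
    // record size and the empty "process" step are placeholders.
    #include <sys/wait.h>
    #include <unistd.h>
    #include <cstdio>
    #include <cstring>

    int main() {
        int fds[2];
        if (pipe(fds) != 0) { std::perror("pipe"); return 1; }

        if (fork() == 0) {                      // child process: the "reader" utility
            close(fds[1]);
            char record[256];
            ssize_t n;
            while ((n = read(fds[0], record, sizeof record)) > 0) {
                // ... process the n bytes just read ...
            }
            close(fds[0]);
            _exit(0);
        }

        close(fds[0]);                          // parent process: the "writer"
        char record[256];
        std::memset(record, 0, sizeof record);
        for (int i = 0; i < 1000; ++i)
            write(fds[1], record, sizeof record);
        close(fds[1]);                          // end-of-stream for the reader
        wait(nullptr);                          // reap the child
    }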
A shame, isn't it, that in this age of overwhelming processing power, we are still struggling with our disk I/O?

If your app reads the data sequentially, migrating to a DB will not help to increase performance. If random access is used, you should consider moving the data into a DB, especially if different indices are used. You should check whether enough resources are available; if the file is loaded completely into memory, virtual memory management could have an impact on performance (swapping, paging). Depending on your OS settings, a limit on file I/O buffers could be reached. The file system itself could be fragmented.
To get a higher quality answer you should provide information about the hardware, OS, memory, and file system, and about the way your data file is used. Then you could get hints about kernel tuning etc.

So what is the retrieval mechanism here? How does your application know which of the smaller files to look in to find a record? If you have split up the big file without implementing some form of keyed lookup - indexing, partitioning - you have not addressed the problem, just re-arranged it.
Of course, if you have implemented some form of indexing then you have started down the road of building your own database.
Without knowing more about your application it would be rash for us to offer specific advice. Maybe an RDBMS is the solution. Possibly a NoSQL approach would be better. Perhaps you need a text indexing and retrieval engine.
So...
How often does your application need to retrieve records? How does it decide which records to get? What is your definition of poor performance? Why did you (your project) decide to use flat files rather than a database in the first place? What sort of records are we talking about?

Related

Fast and frequent file access while executing C++ code

I am looking for suggestions on how best to implement my code for the following requirements. During execution of my C++ code, I frequently need to access data stored in a dictionary, which itself is stored in a text file. The dictionary contains 100 million entries, and at any point in time my code may query the data corresponding to any particular entry among those 100 million. There is no particular pattern to the queries, and not all entries are queried during the lifetime of the program. Also, the dictionary remains unchanged during the program's lifetime. The data corresponding to each entry is not all of the same length. The file size of my dictionary is ~24 GB, and I have only 16 GB of RAM. I need my application to be very fast, so I would like to know how best to implement such a system so that read access times are minimized.
I am also the one creating the dictionary, so I have the flexibility to break it down into several smaller volumes. While thinking about what I can do, I came up with the following ideas, but I am not sure whether either is good.
If I store the offset of each entry from the beginning of the file, then to read the data for an entry I can jump directly to the corresponding offset. Is there a way to do this using, say, ifstream without looping through all lines until the offset line? A quick search on the web seems to suggest this is not possible, at least with ifstream; are there other ways this can be done?
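To illustrate the first idea, here is a minimal sketch under the assumption that what gets stored is the byte offset of each entry rather than its line number (std::ifstream::seekg can jump straight to a byte offset, but not to a line number without scanning). The file name and the in-memory offset table are placeholders.

    // Sketch of the first idea, assuming the stored offsets are *byte* offsets
    // (std::ifstream::seekg can jump straight to a byte offset, but not to a
    // line number without scanning). File name and the offset table are
    // placeholders.
    #include <cstdint>
    #include <fstream>
    #include <string>
    #include <unordered_map>

    int main() {
        // Offsets would be recorded while the dictionary file is being written.
        std::unordered_map<std::string, std::uint64_t> offsets = {
            {"some_key", 123456789ULL},   // hypothetical entry
        };

        std::ifstream dict("dictionary.txt", std::ios::binary);
        dict.seekg(static_cast<std::streamoff>(offsets["some_key"]));  // jump directly
        std::string entry;
        std::getline(dict, entry);        // read just that entry's data
    }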
The other extreme thought was to create a single file for each entry in the dictionary, so I would have 100 million files. This approach has the obvious drawback of overhead in opening and closing the file stream.
In general I am not convinced either of the approaches I have in mind are good, and so I would like some suggestions.
Well, if you only need key-value access, and the data is larger than what can fit in memory, the answer is a NoSQL database. That means a hash-type index for the key and arbitrary values. If you have no other constraints such as concurrent access from many clients or extended scalability, you can roll your own. The most important question for a custom NoSQL database is the expected number of keys, which determines the size of the index file. You can find rather good hashing algorithms around, and you will have to make a trade-off between a larger index file and a higher risk of collisions. Anyway, unless you want to use a terabyte-sized index file, your code must be prepared for collisions.
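To make that concrete, here is a rough, hypothetical sketch of such a custom index: a fixed number of slots in an index file, each holding a key hash plus the offset and length of the value in a separate data file. The file names, the slot layout, and the linear-probing collision handling are all assumptions for illustration, not part of the answer above.

    // Rough, hypothetical sketch of a custom on-disk hash index. index.bin holds
    // kBuckets fixed-size slots; data.bin holds the variable-length values.
    // Collisions are handled with simple linear probing, and equal hashes are
    // (optimistically) treated as a match - a real index would also store and
    // compare the key itself.
    #include <cstdint>
    #include <fstream>
    #include <functional>
    #include <string>

    struct Slot {
        std::uint64_t hash;    // full hash of the key, 0 means "empty slot"
        std::uint64_t offset;  // byte offset of the value in data.bin
        std::uint32_t length;  // length of the value in bytes
    };

    constexpr std::uint64_t kBuckets = 1ULL << 27;  // sized from the expected key count

    std::string lookup(const std::string& key) {
        const std::uint64_t h = std::hash<std::string>{}(key);
        // Streams are opened per call only to keep the sketch short.
        std::ifstream index("index.bin", std::ios::binary);
        std::ifstream data("data.bin", std::ios::binary);

        for (std::uint64_t i = 0; i < kBuckets; ++i) {       // linear probing
            const std::uint64_t probe = (h + i) % kBuckets;
            Slot slot{};
            index.seekg(probe * sizeof(Slot));
            index.read(reinterpret_cast<char*>(&slot), sizeof slot);
            if (slot.hash == 0) break;        // empty slot: key is not present
            if (slot.hash != h) continue;     // another key hashed here: keep probing
            std::string value(slot.length, '\0');
            data.seekg(slot.offset);
            data.read(value.data(), slot.length);
            return value;
        }
        return {};                            // not found
    }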
A detailed explanation with examples is far beyond what I can write in an SO answer, but it should give you a starting point.
The next optimization is deciding what should be cached in memory. It depends on the query pattern you expect. If the same key is unlikely to be queried more than once, you can probably just rely on the OS and filesystem cache (a slight improvement would be memory-mapped files); otherwise caching (of the index and/or values) makes sense. Here again you can choose and implement a caching algorithm, as sketched below.
Or, if you think that this is too complex for too little gain, you can check whether one of the free NoSQL databases meets your requirements...
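Regarding the caching step mentioned above, a small LRU cache is one classic choice. The sketch below is a generic, hypothetical implementation (string keys and values, fixed capacity), not something specific to this answer.

    // A tiny, hypothetical LRU cache for looked-up values (string keys and
    // values, fixed capacity); one possible "caching algorithm" for the step
    // described above.
    #include <list>
    #include <optional>
    #include <string>
    #include <unordered_map>
    #include <utility>

    class LruCache {
    public:
        explicit LruCache(std::size_t capacity) : capacity_(capacity) {}

        std::optional<std::string> get(const std::string& key) {
            auto it = map_.find(key);
            if (it == map_.end()) return std::nullopt;
            order_.splice(order_.begin(), order_, it->second);  // mark as most recent
            return it->second->second;
        }

        void put(const std::string& key, std::string value) {
            auto it = map_.find(key);
            if (it != map_.end()) {                             // update existing entry
                it->second->second = std::move(value);
                order_.splice(order_.begin(), order_, it->second);
                return;
            }
            order_.emplace_front(key, std::move(value));
            map_[key] = order_.begin();
            if (map_.size() > capacity_) {                      // evict least recently used
                map_.erase(order_.back().first);
                order_.pop_back();
            }
        }

    private:
        std::size_t capacity_;
        std::list<std::pair<std::string, std::string>> order_;  // front = most recent
        std::unordered_map<std::string,
            std::list<std::pair<std::string, std::string>>::iterator> map_;
    };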
Once you decide to use an on-disk data structure, it becomes less of a C++ question and more of a system design question. You want to implement a disk-based dictionary.
The factors you should consider from now on are: what are your disk parameters? Is it an SSD or an HDD? What is your average lookup rate per second? Are you fine with 20 µs - 10 ms latencies for your Lookup() method?
On-disk dictionaries require random disk seeks. Such seeks have a latency of dozens of microseconds on an SSD and 3-10 ms on an HDD. Also, there is a limit on how many such seeks you can make per second. You can read this article for example. The CPU stops being the bottleneck and I/O becomes important.
If you want to pursue this direction, there are state-of-the-art C++ libraries that give you an on-disk key-value store (no need for an out-of-process database), or you can do something simple yourself.
If your application is a batch process and not a server/UI program, i.e. you have another finite stream of items that you want to join with your dictionary, then I recommend reading about external algorithms like hash join or MapReduce. In these cases it is possible to organize your data in such a way that instead of one huge 24 GB dictionary you have 10 dictionaries of 2.4 GB each, and you sequentially load and join each one of them (a partitioning sketch follows below). But for that, I would need to understand what kind of problem you are trying to solve.
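As a sketch of that partitioning idea (all file names and the record format here are assumptions for illustration): hash each key into one of ten partition files, and partition the query stream with the same hash, so that each join pass only ever needs one ~2.4 GB dictionary volume.

    // Hypothetical sketch: hash-partition dictionary.txt into 10 smaller volumes
    // so a batch join only needs one ~2.4 GB partition in memory at a time.
    // File names and the "key<TAB>value" line format are assumptions.
    #include <array>
    #include <fstream>
    #include <functional>
    #include <string>

    int main() {
        constexpr int kPartitions = 10;
        std::array<std::ofstream, kPartitions> parts;
        for (int i = 0; i < kPartitions; ++i)
            parts[i].open("dictionary.part" + std::to_string(i) + ".txt");

        std::ifstream dict("dictionary.txt");
        std::string line;
        while (std::getline(dict, line)) {
            const std::string key = line.substr(0, line.find('\t'));  // key = text before first tab
            const std::size_t p = std::hash<std::string>{}(key) % kPartitions;
            parts[p] << line << '\n';
        }
        // The stream of lookup items is partitioned with the same hash, so each
        // batch of queries only ever touches the matching dictionary partition.
    }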
To summarize, you need to design your system before coding the solution. mmap, tries, and the other tricks mentioned in the comments are local optimizations (if that), and they are unlikely to be game-changers. I would not rush to explore them before doing back-of-the-envelope calculations to understand the main direction.

C++: Is it more efficient to store data or continually read it

Ok so I'm working on a game project. I just finished rebuilding a game engine I designed some time ago. I'm looking at making a proprietary file type to store data rather than using a database like SQLite.
Looking at making this work with the game as efficiently and quickly as possible right off the bat without going too deep into it. And then improving over time.
My question is: Is it more efficient overall to load the data from the file and store it in a data manager class to be reused? Or is it more efficient overall to continually pull from the file?
Assume the file follows some form of consistent structure for its data, and that the largest "table" is something like 30 columns with roughly 1000 rows of data.
Here's a handy chart of "Latency Numbers Every Computer Programmer Should Know"
The far right hand side of the chart (red) has the time it takes to read 1 MB from disk. The green column has the same value read from RAM.
What this shows us is that you should do almost anything to avoid having to directly interact with the disk. Keeping data in RAM is good. Keeping data on disk is bad. (Memory mapped files might provide a way to handle this.)
This aside, reinventing the wheel is almost always the wrong solution. SQLite works and works well. If it's not ideally suited to your needs, there are other file types out there.
If you're "looking at making this work with the game as efficiently and quickly as possible right off the bat without going too deep into it. And then improving over time", you'll find that's easiest to do if you reuse preexisting solutions to common problems.
Repeatedly reading from a file is generally not a good idea; modern operating systems do keep large I/O caches (so if you keep reading the same data it won't really hit the disk), but syscalls are of course far more expensive than plain memory accesses - although whether this is actually going to be a performance problem in your specific case is impossible to judge from the information you provided. On the other hand, if you have a lot of data to access, keeping it all in memory can be wasteful, slow to load and, under memory pressure, lead to paging.
The easy way out of this conundrum is to map the file in memory; the data is automatically fetched from disk when required and, unless the system is under memory pressure, frequently accessed pages remain cached in RAM, guaranteeing you fast access.
Of course this is feasible only if the data you need to map is smaller than the address space, but given the example you provided (30 columns/1000 rows, which is really small) it shouldn't be a problem at all.
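A minimal POSIX sketch of that mapping approach is below; the file name and the error handling are placeholders, and on Windows you would use CreateFileMapping/MapViewOfFile instead of mmap.

    // Minimal POSIX sketch of mapping a data file read-only into memory.
    // "game_data.bin" is a placeholder name; on Windows you would use
    // CreateFileMapping / MapViewOfFile instead.
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        const int fd = open("game_data.bin", O_RDONLY);
        if (fd < 0) { std::perror("open"); return 1; }

        struct stat st {};
        if (fstat(fd, &st) != 0) { std::perror("fstat"); return 1; }

        // The whole file becomes addressable memory; pages are faulted in on
        // demand and cached by the OS, so repeated accesses do not hit the disk.
        void* base = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (base == MAP_FAILED) { std::perror("mmap"); return 1; }

        const unsigned char* bytes = static_cast<const unsigned char*>(base);
        std::printf("first byte: %u\n", static_cast<unsigned>(bytes[0]));

        munmap(base, st.st_size);
        close(fd);
    }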
If you can hold the data in RAM then it is more efficient. This is because it is quicker for your computer to access values that are in RAM, a cache or the CPU's registers than it is to fetch them from the hard drive. Reading from the hard drive requires a lot of time in the operating system's drivers; therefore holding the data in memory is more efficient.

Performance difference of Boost r-tree in memory vs. mapped file

I need to create a 3D R*-tree, perhaps for long-term storage, but performance will also be an issue.
In order to create the tree, I decided to use Boost's spatial index and basically found two possible methods.
Either I create it directly from objects as shown here: Index of polygons stored in vector; however, that does not allow me to store and load it without creating the R*-tree again.
Or I could use a mapped file as explained here: Index stored in mapped file using Boost.Interprocess; however, I am not sure whether query performance is good enough in that case.
My R-tree will contain several thousand entries, but most likely fewer than about 100,000. Now my question is: is there any significant performance penalty to using mapped files compared to using standard objects? Also, if building an R*-tree of about 100,000 values does not take a substantial amount of time (I could keep all bounding boxes and the corresponding keys/data in a file), might it be a better option to skip the mapped file and just create the tree every time I run the program?
Hopefully somebody can help me here, as the documentation does not really provide much information (though it's still worlds better than the documentation of libspatialindex).
A mapped file will behave mostly like regular memory (in fact, on Linux, memory allocation with new or malloc will use mmap [with "no file" backing storage] as the underlying allocation method). However, if you do many small writes "all over the place", and you are mapping a REAL FILE, then the OS will restrict the amount of buffered writes before writing them out to the file.
I did some experiments when the subject came up a while ago, and by adjusting the settings for how the OS deals with these "pending writes", I got reasonable performance even for file-backed memory mapping with a random read/write pattern [something I expect happens when you are building your tree].
Here's the "performance of mmap with random writes" question, which I think is highly related:
Bad Linux Memory Mapped File Performance with Random Access C++ & Python
(This answer applies to Linux - other OSes, in particular Windows, may well behave completely differently with regard to how they deal with writes to mapped files.)
Of course, it's pretty hard to say which is better, a memory-mapped file or rebuilding every time the program is run - it really depends on what your application does, whether you run it 100 times a second or once a day, how long it takes to rebuild [I have absolutely no idea!], and lots of other such things. There are two choices: build the simplest version and see if it's "fast enough", or build both versions, measure how much difference there is, and then decide which path to go down.
I tend to build the simple(ish) model, and if performance isn't good enough, figure out where the slowness comes from, and then fix that - it saves spending lots of time making something that takes 0.01% of the total execution time run 5 clock-cycles faster, and ending up with a big thinko somewhere else that makes it run 500 times slower than you expected...
Bulk-loading the index is much faster than repeated insertion, and yields a much more efficient tree. So if you can hold all your data in main memory, I suggest rebuilding the tree using STR bulk loading. In my experience this is more than fast enough (bulk loading time is dwarfed by I/O time).
The cost of STR is roughly that of sorting: O(n log n) theoretically, with very low constants (a less efficient implementation may be O(n log n log n), but that is still fairly cheap).
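For reference, Boost.Geometry's rtree performs this kind of bulk loading when you construct it from a range of values (its packing constructor). A minimal sketch follows; the value layout (box plus integer key) and the way the vector gets filled are assumptions for illustration.

    // Minimal sketch of bulk loading with Boost.Geometry's rtree: constructing
    // the tree from a range of values uses its packing algorithm instead of
    // repeated insertion. The value layout (box + integer key) and the way the
    // vector is filled are placeholders.
    #include <boost/geometry.hpp>
    #include <boost/geometry/index/rtree.hpp>
    #include <iterator>
    #include <utility>
    #include <vector>

    namespace bg  = boost::geometry;
    namespace bgi = boost::geometry::index;

    using Point = bg::model::point<double, 3, bg::cs::cartesian>;
    using Box   = bg::model::box<Point>;
    using Value = std::pair<Box, unsigned>;   // bounding box + key into your own data

    int main() {
        std::vector<Value> values;
        // ... fill `values` from the stored bounding boxes and keys ...
        values.emplace_back(Box(Point(0, 0, 0), Point(1, 1, 1)), 0u);

        // The range constructor bulk-loads (packs) the tree, which is much
        // faster than inserting the values one by one.
        bgi::rtree<Value, bgi::rstar<16>> tree(values.begin(), values.end());

        std::vector<Value> hits;
        tree.query(bgi::intersects(Box(Point(0, 0, 0), Point(2, 2, 2))),
                   std::back_inserter(hits));
    }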

What NoSQL solution is recommended for a write-mostly application?

We are planning to move some of the writes our back-end does from an RDBMS to NoSQL, as we expect them to be the main bottleneck.
Our business process has 95%-99% concurrent writes and only 1%-5% concurrent reads on average. There will be a massive amount of data involved, so an in-memory NoSQL DB won't fit.
What NoSQL DB on-disk would be optimal for this case?
Thanks!
If the concurrent writes create conflicts and data integrity is an issue, NoSQL probably isn't the way to go. You can easily test this with a data management system that supports "optimistic concurrency", as then you can measure the real-life locking conflicts and analyze them in detail.
I am a little bit surprised that you say you "expect problems" without any further details. Let me give you one answer based on the facts you gave us: what are the 100,000 sources, and what is the writing scenario? MySQL is not the best example of handling scalable concurrent writes, etc.
It would be helpful if you could provide some kind of use case(s), or anything else that helps us understand the problem in detail.
Let me take two examples. First: an in-memory database with an advanced write dispatcher, data versioning, etc. can easily take 1M "writers", the writers being network elements and the application an advanced NMS system. Lots of writes, no conflicts, optimistic concurrency, in-memory write buffering up to 16 GB, asynchronous parallel writing to 200+ virtual spindles (SSD or magnetic disks), etc. A real "sucker" for eating new data! An excellent candidate for scaling performance to its limits.
Second example: an MSC with a sparse number space, e.g. mobile numbers forming "clusters" of numbers. A huge number space, but at most 200M individual addresses, and very rare situations with conflicting writes. The RDBMS was replaced with memory-mapped sparse files, and the performance improvement was close to 1000x - yes, 1000x in the best case, and "only" 100x in the worst case. The replacement code was roughly 300 lines of C. That was true "big NoSQL", as it was a good fit for the problem to be solved.
So, in short, without knowing more details there is no "silver bullet" answer to your question. We're not after werewolves here, it's just "big bad data". When we don't know whether your workload is "transactional", i.e. sensitive to the number of I/Os and to latency, or "BLOB-like", i.e. streaming media, geodata, etc., it would be 100% wrong to promise anything. Bandwidth and I/O rate/latency/transactions are more or less a trade-off in real life.
See for example http://publib.boulder.ibm.com/infocenter/soliddb/v6r3/index.jsp?topic=/com.ibm.swg.im.soliddb.sql.doc/doc/pessimistic.vs.optimistic.concurrency.control.html for some further details.

Is there a fast in-memory queue I can use that swaps items as it reaches a certain size?

I've been using C/C++/CUDA for less than a week and am not familiar with all the options available in terms of libraries (sorry if my question is too wacky or impossible). Here's my problem: I have a process that takes data, analyzes it, and then does one of three things: (1) saves the results, (2) discards the results, or (3) breaks the data down and sends it back to be processed.
Often option (3) creates a lot of data and I very quickly exceed the memory available to me (my server has 16 GB), so the way I got around that was to set up a queue server (RabbitMQ) that I would send work to and receive work from (it swaps the queue to disk once it reaches a certain memory size). This worked perfectly when I used small servers with fast NICs to transfer the data, but lately I have been learning and converting my code from Java to C/C++ and running it on a GPU, which has made the queues a big bottleneck. The bottleneck was obviously the network I/O (profiling on cheap systems showed high CPU usage, and similar on old GPUs, but newer, faster CPUs/GPUs are not being utilized as much and network I/O is steady at 300-400 Mb/s). So I decided to try to eliminate the network entirely and run the queue server locally on the same machine, which made it faster, but I suspect it could be faster still if I used a solution that didn't rely on external network services (even if I am running them locally). It may not work, but I want to experiment.
So my question is: is there anything I can use like a queue, where I can remove entries as I read them, but which also swaps the queue to disk once it reaches a certain size (while keeping the in-memory part of the queue full so I don't have to wait on disk reads)? When learning about CUDA, I saw many examples of researchers running analysis on huge datasets; any ideas how they keep data coming in at the fastest rate the system can process (I imagine they aren't bound by disk/network, otherwise faster GPUs wouldn't really give them orders-of-magnitude increases in performance)?
Does anything like this exist?
P.S. If it helps, so far I have experimented with RabbitMQ (too slow for my situation), Apollo MQ (good but still network based), Redis (really liked it but it cannot exceed physical memory), playing with mmap(), and I've also compressed my data to get better throughput. I know the general solutions, but I'm wondering if there's something native to C/C++, CUDA or a library I can use (ideally, I would have a queue in CUDA global memory that swapped to host memory that swapped to disk, so the GPUs would always be at full speed, but that may be wishful thinking). If there's anything else you can think of, let me know and I'd enjoy experimenting with it (if it helps, I develop on a Mac and run on Linux).
Let me suggest something quite different.
Building a custom solution would not be excessively hard for an experienced programmer, but it is probably impossible for an inexperienced or even intermediate programmer to produce something robust and reliable.
Have you considered a DBMS?
For small data sets it will all be cached in memory. As it grows, the DBMS will have some very sophisticated caching/paging techniques. You get goodies like sorting/prioritisation, synchronisation/sharing for free.
A really well-written custom solution will be much faster than a DBMS, but will have huge costs in developing and maintaining the custom solution. Spend a bit of time optimising and tuning the DBMS and it starts looking pretty fast and will be very robust.
It may not fit your needs, but I'd suggest having a long hard look at a DBMS before you reject it.
There's an open-source implementation of the Standard Template Library containers that was created to address exactly this problem.
STXXL swaps data to disk nearly transparently for any of the standard STL containers. It's very well written and well maintained, and it is easy to adapt/migrate your code to it given its similarity to the STL.
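As a sketch of what that looks like in practice, here is a minimal disk-backed FIFO using stxxl::queue; the element type and sizes are placeholders, and it assumes STXXL is installed and configured with a disk file.

    // Minimal sketch of a disk-backed FIFO with stxxl::queue: a few blocks stay
    // in RAM and the rest spills to disk. Assumes STXXL is installed and a disk
    // configuration (e.g. a .stxxl file) is set up; the int payload is a
    // placeholder for your real work items.
    #include <stxxl/queue>
    #include <iostream>

    int main() {
        stxxl::queue<int> work;

        const long long n = 100000000;       // far more than a small RAM budget
        for (long long i = 0; i < n; ++i)
            work.push(static_cast<int>(i));  // older blocks are written to disk

        long long sum = 0;
        while (!work.empty()) {              // blocks stream back through RAM
            sum += work.front();
            work.pop();
        }
        std::cout << sum << '\n';
    }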
Another option is to use the existing STL containers but specify a disk-backed allocator. All the STL containers have a template parameter for the STL allocator, which specifies how the memory for entries is stored. There's a good disk-backed STL allocator on the tip of my tongue, but I can't seem to find it via Google (I'll update this if/when I do).
Edit: I see Roger had actually already mentioned this in the comments.