Suppose you are in an embedded environment and you have, oh, 1MB of RAM to play with. Now let's pretend you have a JSON-based file management system in which each file produced by the device (say, its metadata) gets an entry in this JSON file (or these files). You let the device do whatever it does, and a month later it comes back with 10,000 files stored on the file system and 10,000 entries in the JSON file. Each entry consumes around 200 bytes, so you have 10,000 * 200 = 2MB. Now you want to sort all those files by some piece of data, let's say by file name, which is 100 bytes each.
In order to sort, maybe alphabetically, do you need to load all 10,000 file names into RAM at once, or are there sequential ways to sort this kind of data? Maybe by first sorting it into subfiles and then sorting those files further? Is it even possible?
This is a C++ environment.
In order to sort, maybe alphabetically, do you need to load all 10,000 file names into RAM at once,…
No, you do not.
… or are there sequential ways to sort this kind of data?
Of course ways exist to sort external data, although they are not necessarily “sequential.” Sorting data that is not all in main memory at once is called external sorting.
Maybe by first sorting it into subfiles and then sorting those files further?
Stack Overflow is for answering specific questions or problems. Whether algorithms exist and what they are called are specific questions, so I have answered those. What the algorithms are and what their properties and benefits are is a general question, so you should do further research on your own regarding them.
Is it even possible?
Yes.
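For illustration, here is a minimal sketch of one common external-sorting technique, an external merge sort: read the input in chunks that fit in RAM, sort each chunk, write it out as a sorted run file, then k-way merge the runs. The one-name-per-line format, the chunk size, and the temporary file names are assumptions made for the example, not something the question specifies.

```cpp
#include <algorithm>
#include <fstream>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Sketch of an external merge sort: read the input in chunks that fit in RAM,
// sort each chunk, write it out as a sorted "run", then k-way merge the runs.
// File names, the line-per-record format, and the chunk size are assumptions.
void externalSort(const std::string& inputPath, const std::string& outputPath,
                  std::size_t maxLinesInRam = 2000)   // ~2000 * 100 B, well under 1MB
{
    // Phase 1: produce sorted runs.
    std::ifstream in(inputPath);
    std::vector<std::string> runNames;
    std::vector<std::string> chunk;
    std::string line;
    auto flushChunk = [&]() {
        if (chunk.empty()) return;
        std::sort(chunk.begin(), chunk.end());
        std::string runName = "run" + std::to_string(runNames.size()) + ".tmp";
        std::ofstream run(runName);
        for (const auto& s : chunk) run << s << '\n';
        runNames.push_back(runName);
        chunk.clear();
    };
    while (std::getline(in, line)) {
        chunk.push_back(line);
        if (chunk.size() >= maxLinesInRam) flushChunk();
    }
    flushChunk();

    // Phase 2: k-way merge of the runs using a min-heap of (current line, run index).
    std::vector<std::ifstream> runs;
    for (const auto& name : runNames) runs.emplace_back(name);
    using Entry = std::pair<std::string, std::size_t>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    for (std::size_t i = 0; i < runs.size(); ++i)
        if (std::getline(runs[i], line)) heap.emplace(line, i);

    std::ofstream out(outputPath);
    while (!heap.empty()) {
        auto [smallest, idx] = heap.top();
        heap.pop();
        out << smallest << '\n';
        if (std::getline(runs[idx], line)) heap.emplace(line, idx);
    }
}
```

With 10,000 names of ~100 bytes each, a handful of runs is enough, and peak RAM use is roughly the chunk size plus one buffered line per open run.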
If you need to access them in a specific order, it might be a good idea to just store them in this way on the file system in the first place.
As @Eric mentioned, this is not a specific question. You should just improve your C/C++ skills in order to answer these questions for yourself. There are a lot of free resources on the web.
Related
I am looking for suggestions on how best to implement my code for the following requirements. During execution of my c++ code, I frequently need to access data stored in a dictionary, which itself is stored in a text file. The dictionary contains 100 million entries, and at any point in time, my code would query data corresponding to some particular entry among those 100 million entries. There is no particular pattern in which those queries are made, and further during the lifetime of the program execution, not all entries in the dictionary are queried. Also, the dictionary will remain unchanged during the program's lifetime. The data corresponding to each entry is not all of the same length. The file size of my dictionary is ~24 GB, and I have only 16 GB of RAM memory. I need my application to be very fast, so I would like to know how best to implement such a system so that read access times can be minimized.
I am also the one who is creating the dictionary, so I do have the flexibility of breaking it down into several smaller volumes. While thinking about what I can do, I came up with the following, but I'm not sure whether either is good.
If I store the line offset for each entry in my dictionary from the beginning of the file, then to read the data for the corresponding entry, I can directly jump to the corresponding offset. Is there a way to do this using, say, ifstream without looping through all lines until the offset line? A quick search on the web seems to suggest this is not possible, at least with ifstream. Are there other ways this can be done?
The other extreme thought was to create a single file for each entry in the dictionary, so I would have 100 million files. This approach has the obvious drawback of overhead in opening and closing the file stream.
In general I am not convinced either of the approaches I have in mind are good, and so I would like some suggestions.
Well, if you only need key-value accesses, and if the data is larger than what can fit in memory, the answer is a NoSQL database. That means a hash-type index for the key and arbitrary values. If you have no other constraints like concurrent accesses from many clients or extended scalability, you can roll your own. The most important question for a custom NoSQL database is the expected number of keys, which determines the size of the index file. You can find rather good hashing algorithms around, and you will have to make a trade-off between a larger index file and a higher risk of collisions. Anyway, unless you want a terabyte-sized index file, your code must be prepared for possible collisions.
A detailed explanation with examples is far beyond what I can write in a SO answer, but it should give you a starting point.
The next optimization is deciding what should be cached in memory. It depends on the query pattern you expect. If the same key is unlikely to be queried more than once, you can probably just rely on the OS and filesystem cache (a slight improvement would be memory-mapped files); otherwise, caching (of the index and/or values) makes sense. Here again you can choose and implement a caching algorithm.
Or, if you think that is too complex for little gain, you can check whether one of the free NoSQL databases could meet your requirements...
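For what it's worth, here is a rough sketch of the roll-your-own idea, assuming two files you build offline: a flat index file of byte offsets addressed by hash(key) % NUM_BUCKETS, and a text data file of key/value records. Every file name, format, and constant is made up for illustration. It also touches the ifstream concern from the question: seeking to a stored byte offset with seekg() is cheap; it is seeking to a line number that would force you to scan.

```cpp
#include <cstdint>
#include <fstream>
#include <functional>
#include <optional>
#include <string>

// Minimal sketch of a custom on-disk key-value lookup, assuming two files
// that are built offline (names and formats are made up for the example):
//   dict.idx : flat array of uint64_t byte offsets, slot = hash(key) % NUM_BUCKETS
//   dict.dat : text records "key<TAB>value\n"
// Collision handling (several keys hashing to one slot) is omitted for
// brevity; a real index would store a small chain or probe nearby slots.
constexpr std::uint64_t NUM_BUCKETS = 1ull << 27;  // ~134M slots -> ~1 GB index file
constexpr std::uint64_t EMPTY = ~0ull;

std::optional<std::string> lookup(const std::string& key)
{
    // One small read in the index file: fetch the byte offset for this key's slot.
    std::ifstream idx("dict.idx", std::ios::binary);
    const std::uint64_t slot = std::hash<std::string>{}(key) % NUM_BUCKETS;
    idx.seekg(static_cast<std::streamoff>(slot * sizeof(std::uint64_t)));
    std::uint64_t offset = EMPTY;
    idx.read(reinterpret_cast<char*>(&offset), sizeof offset);
    if (!idx || offset == EMPTY) return std::nullopt;

    // Jump straight to the stored byte offset in the data file. Note that
    // ifstream::seekg works fine for this -- what you cannot do cheaply is
    // seek to a *line number*, which is why byte offsets are stored instead.
    std::ifstream dat("dict.dat", std::ios::binary);
    dat.seekg(static_cast<std::streamoff>(offset));
    std::string line;
    if (!std::getline(dat, line)) return std::nullopt;

    const auto tab = line.find('\t');
    if (tab != std::string::npos && line.compare(0, tab, key) == 0)
        return line.substr(tab + 1);
    return std::nullopt;   // collision or missing key in this sketch
}
```

Each lookup costs two small seeks (index plus data), which is where the disk-latency numbers mentioned in the other answer come in.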
Once you decide to use an on-disk data structure, it becomes less of a C++ question and more of a system-design question. You want to implement a disk-based dictionary.
The factors you should consider from now on are: What are your disk parameters? Is it an SSD or an HDD? What is your average lookup rate per second? Are you fine with 20 µs to 10 ms latencies for your Lookup() method?
On-disk dictionaries require random disk seeks. Such seeks have a latency of dozens of microseconds on an SSD and 3–10 ms on an HDD. Also, there is a limit on how many such seeks you can make per second. You can read this article, for example. The CPU stops being a bottleneck and IO becomes important.
If you want to pursue this direction, there are state-of-the-art C++ libraries that give you an on-disk key-value store (no need for an out-of-process database), or you can do something simple yourself.
If your application is a batch process and not a server/UI program, i.e. you have another finite stream of items that you want to join with your dictionary, then I recommend reading about external algorithms like hash join or MapReduce. In these cases, it's possible to organize your data in such a way that instead of having one huge 24 GB dictionary you have 10 dictionaries of 2.4 GB each, and you sequentially load and join each one of them. But for that, I would need to understand what kind of problem you are trying to solve.
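To make the partitioning idea concrete, here is a hedged sketch in the style of a grace hash join: both the dictionary and the query stream are split by hash(key) % N into partition files, then each dictionary partition (now small enough to fit in RAM) is loaded into an unordered_map and joined against its matching query partition. File names, formats, and the partition count are assumptions for the example.

```cpp
#include <fstream>
#include <string>
#include <unordered_map>
#include <vector>

// Sketch of a partitioned (grace-hash-join-style) batch lookup. Assumes the
// dictionary is a text file of "key<TAB>value" lines and the queries are bare
// keys, one per line -- names, formats, and the partition count are illustrative.
constexpr std::size_t N = 10;   // 24 GB dictionary -> ~2.4 GB per partition

std::size_t partitionOf(const std::string& key) {
    return std::hash<std::string>{}(key) % N;
}

// Phase 1: split an input file into N partition files by key hash.
void partitionFile(const std::string& input, const std::string& prefix) {
    std::vector<std::ofstream> parts;
    for (std::size_t i = 0; i < N; ++i)
        parts.emplace_back(prefix + std::to_string(i) + ".txt");
    std::ifstream in(input);
    std::string line;
    while (std::getline(in, line)) {
        const std::string key = line.substr(0, line.find('\t'));
        parts[partitionOf(key)] << line << '\n';
    }
}

// Phase 2: for each partition, load the dictionary part into RAM and stream
// the matching query part through it.
void joinPartitions(const std::string& outPath) {
    std::ofstream out(outPath);
    for (std::size_t i = 0; i < N; ++i) {
        std::unordered_map<std::string, std::string> dict;
        std::ifstream dpart("dict_part" + std::to_string(i) + ".txt");
        std::string line;
        while (std::getline(dpart, line)) {
            const auto tab = line.find('\t');
            dict.emplace(line.substr(0, tab), line.substr(tab + 1));
        }
        std::ifstream qpart("query_part" + std::to_string(i) + ".txt");
        while (std::getline(qpart, line)) {
            auto it = dict.find(line);
            if (it != dict.end()) out << line << '\t' << it->second << '\n';
        }
    }
}

int main() {
    partitionFile("dictionary.txt", "dict_part");
    partitionFile("queries.txt",    "query_part");
    joinPartitions("joined.txt");
}
```

The point is that every pass over the disk is sequential; the random accesses all happen against an in-memory hash map.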
To summarize, you need to design your system first before coding the solution. Using mmap or tries or the other tricks mentioned in the comments are local optimizations (if that), and they are unlikely to be game-changers. I would not rush into exploring them before doing back-of-the-envelope computations to understand the main direction.
I'm working on a program and my "database" is some .csv files.
I have a list of objects in a .csv with some information about each. What is the most appropriate way to handle the data?
A. Work with fstream, meaning that every time I want to modify the data or read something, I work directly on the files with the tools of fstream; or
B. At the beginning of the program, load the data into a vector, read from and write to the vector, and at the end of the program delete the previous file and write out the new one.
Performance-wise, will it make a difference, considering that the objects are numerous?
I think that it is more of a combination rather than choosing A or B. In particular, choosing only A is not safe, as multiple components could access the file simultaneously. Plus, if there are many updates, going through file streams each time could make your code very slow.
Therefore I believe that you should use B, but also take care to persist your data in a safe way (write your data back to the file).
Regarding the data structure, this depends on the usage. One important question to ask here is whether there are many insertions and deletions. If this is the case, then it would be more efficient to use a list instead of a vector, as a list provides constant-time insertion and deletion (given an iterator to the position), while a vector is not appropriate for this purpose.
If the data includes a unique attribute and fast lookups are needed, then a hash table or a map would be more suitable.
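A minimal sketch of option B with a safer write-back (write to a temporary file first, then replace the original, so a crash mid-write does not destroy the existing data). The Record fields, file names, and the unquoted comma-separated layout are assumptions for the example.

```cpp
#include <cstdio>       // std::remove, std::rename
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record -- the real layout depends on your CSV. Quoted fields
// and embedded commas are not handled in this sketch.
struct Record {
    std::string id;
    std::string name;
    double value = 0.0;
};

std::vector<Record> loadCsv(const std::string& path) {
    std::vector<Record> rows;
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream ss(line);
        Record r;
        std::string value;
        std::getline(ss, r.id, ',');
        std::getline(ss, r.name, ',');
        std::getline(ss, value, ',');
        r.value = value.empty() ? 0.0 : std::stod(value);
        rows.push_back(std::move(r));
    }
    return rows;
}

// Write to a temp file first, then replace the original. On POSIX, rename()
// alone atomically replaces the target; on Windows it fails if the target
// exists, hence the remove() first.
void saveCsv(const std::string& path, const std::vector<Record>& rows) {
    const std::string tmp = path + ".tmp";
    {
        std::ofstream out(tmp, std::ios::trunc);
        for (const auto& r : rows)
            out << r.id << ',' << r.name << ',' << r.value << '\n';
    }   // out flushed and closed here
    std::remove(path.c_str());
    std::rename(tmp.c_str(), path.c_str());
}
```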
Take the CSV parser from my CSVtoC utility.
http://www.malcolmmclean.site11.com/www/
CSV files are not good for dynamic updates, as the records are not fixed at known physical locations on disk. (The alternative is to contrive the CSV so that this doesn't hold, but it's a delicate and messy approach.)
Reading a CSV is hard, writing one is trivial.
I have to read binary data into char arrays from large (2GB) binary files in a C++ program. When reading the files for the first time from my SSD, reading takes about 6.4 seconds per file. But when running the same code again, or even after first running a different dummy program that does almost the same thing, the next reads take only about 1.4 seconds per file. The Windows Task Manager even shows much less disk activity on the second, third, fourth… run. So my guess is that Windows' file caching is sparing me from waiting for data from the SSD when filling the arrays another time.
Is there any clean option to read the files into the file cache before the customer runs the software? Any better option than just loading the files with fread in advance? And how can I make sure the data remains in the file cache until I need it?
Or am I totally wrong with my file cache assumption? Is there another (better) explanation for these different loading times?
Educated guess here:
You most likely are right with your file cache assumption.
Can you preload files before the user runs the software?
Not directly. How would your program know that it is going to be run in the next few minutes?
So you probably need a helper mechanism or tricks.
The options I see here are:
Indexing mechanisms that provide faster and better-aimed access to your data. This is helpful if you only need small chunks of information from these data at once.
Attempt to parallelize the loading of the data, so that even if it does not really get faster, the user has the impression it does, because they can already start working with the data they have while the rest is fetched in the background (see the sketch at the end of this answer).
Have a helper tool that starts up with the OS and pre-fetches everything, so you already have it in memory when required. Caution: this has serious implications, since you reserve either a large chunk of RAM or even SSD cache (depending on the implementation) for your tool from the start. Only consider doing this if the alternative is the apocalypse…
You can also try to combine the first two options. The key to faster data availability is to figure out what to read and in which order, instead of trying to load everything at once en bloc. Divide and conquer.
Without further details on the problem it is impossible to provide more specific solutions though.
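Here is a minimal sketch of the background warm-up idea from the second option above. It simply reads the file once in large blocks on another thread so the OS (hopefully) keeps it in its file cache; there is no hard guarantee the pages stay cached. The 4 MB block size and the file name are arbitrary.

```cpp
#include <fstream>
#include <future>
#include <string>
#include <vector>

// Read a file once, in large blocks, purely to pull it into the OS file cache.
// There is no guarantee the pages stay cached, but a subsequent "real" read
// usually becomes much faster, matching the 6.4 s -> 1.4 s effect described.
void warmFileCache(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    std::vector<char> block(4 * 1024 * 1024);           // 4 MB per read
    while (in.read(block.data(), static_cast<std::streamsize>(block.size())) ||
           in.gcount() > 0) {
        // Data is discarded; we only care about the side effect on the cache.
    }
}

int main()
{
    // Kick off the warm-up in the background while the program initializes,
    // shows its UI, etc.; wait for it before the real load if necessary.
    auto warm = std::async(std::launch::async, warmFileCache, "big_input_1.bin");

    // ... other start-up work happens here ...

    warm.wait();   // by now the file is (hopefully) sitting in the file cache
}
```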
I have a large dictionary of English words (around 70k of them) that I load into memory at the beginning of the program. They are loaded into a radix trie data structure, and each trie node often has many links to other nodes (for example, antonyms: "dead" -> "alive", "well"). Each node also has a std::vector<MetaData> in it, which contains various miscellaneous metadata for my program.
Now, the problem is the loading time of this file. Reading the file from disk, deserializing it, and allocating the data structure in general takes a lot of time (4–5 seconds).
Currently, I'm working on making the load asynchronous (or doing it bit by bit, a fraction per frame), but due to the nature of the application (it's a mobile keyboard), there are plenty of times when it simply has to be loaded quickly.
What can be done to speed up loading? Memory pool everything? I am benchmarking different parts to see what can be optimized, but it looks like, so far, it's just little things that add up.
If the trie is static (i.e. doesn't change when the program's running), then build an optimized version in an array using array indexes in place of pointers. You can then save that as your data file. Startup then amounts to just loading that block of data into memory.
Doing it that way makes some things less convenient (you'll have to use arrays rather than std::vector, for example), and you might have to do a bit of casting, but with a little thought you end up with a very compact and very fast data structure that doesn't suffer from the allocation overhead associated with creating an object for each node. Instead, it's essentially an array of varying length structures.
I did this for an application that used a directed acyclic word graph (DAWG). Rather than rebuild the DAWG every time the program was loaded (a time-consuming process), I had a utility program that created the DAWG and shipped that as the data file in place of the word list.
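A hedged sketch of the array-indexes-instead-of-pointers idea: nodes live in one contiguous array, children and siblings are referred to by index, and startup becomes a single bulk read of a pre-built file. The field layout, the first-child/next-sibling encoding, and the file name are assumptions; it also assumes the file was built by a tool compiled with the same struct layout and endianness.

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Flattened trie node: children and sibling links are indexes into one big
// array instead of heap pointers, so the whole structure is a single block.
// Index 0 is the root; 0 also serves as the "no link" sentinel, which works
// because the root is never the target of a child/sibling link.
struct FlatNode {
    char          ch;            // character on the edge into this node
    bool          isWord;        // does a word end here?
    std::uint32_t firstChild;    // index of first child, or 0 if none
    std::uint32_t nextSibling;   // index of next sibling, or 0 if none
};

// Startup is then just one bulk read of the pre-built file -- no per-node
// allocation, no deserialization logic. Assumes the builder used the exact
// same struct layout and byte order.
std::vector<FlatNode> loadTrie(const std::string& path)
{
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    const std::streamsize bytes = in.tellg();
    in.seekg(0);
    std::vector<FlatNode> nodes(static_cast<std::size_t>(bytes) / sizeof(FlatNode));
    in.read(reinterpret_cast<char*>(nodes.data()),
            static_cast<std::streamsize>(nodes.size() * sizeof(FlatNode)));
    return nodes;
}

bool contains(const std::vector<FlatNode>& trie, const std::string& word)
{
    std::uint32_t node = 0;                        // start at the root
    for (char c : word) {
        std::uint32_t child = trie[node].firstChild;
        while (child != 0 && trie[child].ch != c)
            child = trie[child].nextSibling;
        if (child == 0) return false;
        node = child;
    }
    return trie[node].isWord;
}
```

A real keyboard trie would also pack the per-node metadata into the same block (for example as offsets into a side array) rather than keeping a std::vector per node.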
Not knowing the details, only a vague idea:
Loading the bulk data (entries) will give you the basic dictionary.
For all the cross-references like synonyms, antonyms, and whatever else, load and process the data in the background, after you've shown "ready". Chances are that by the time the user has typed in the first query, you are ship-shape.
Later
If the file is rather big, reading a compressed version may be a gain.
Also, buffered reading with a suitably increased buffer size may help.
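In C++ terms, the increased-buffer suggestion could look like the sketch below. Whether pubsetbuf is honoured is implementation-defined, and the 1 MB size and file name are arbitrary choices for the example.

```cpp
#include <fstream>
#include <vector>

int main()
{
    // Give the ifstream a 1 MB buffer instead of the default (often a few KB).
    // pubsetbuf must be called before open() for most implementations to honour it.
    std::vector<char> buffer(1 << 20);
    std::ifstream in;
    in.rdbuf()->pubsetbuf(buffer.data(), static_cast<std::streamsize>(buffer.size()));
    in.open("dictionary.dat", std::ios::binary);

    // ... read and deserialize the dictionary as before ...
}
```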
You should review the structure of the data to make the data faster to load.
Also, splitting into multiple tables may speed things up.
For example, have one table for the words, another table for synonyms and additional tables for other relationships.
The first table should give each word an index (an ID). This allows the synonym table to be represented as pairs of word IDs, which should load fast.
You can then build any internal containers from the data loaded in. A reason for having different data structures for stored data vs. internal data is optimization: the structures used for data storage (and loading) are optimized for loading, while the structure for internal data is optimized for searching.
Another idea, based on the fact that it is a mobile keyboard application:
Some words are used more often than others, so maybe you could organize it so that the frequently used words are loaded first and leave the infrequently used ones to be loaded as they are needed (or as you have time).
Currently, we use C++ code to read in files line by line, sort the data, and save it in another format (a txt file); the data read in line by line is stored in a vector. This is all fine for small data files.
But now we need to support large data files, which crash our code (not enough memory for the vector to reallocate and store the data; we can't know how many lines of data we'll have, so we can't reserve a size for the vector).
So we are thinking we should probably redesign our code to deal with large data.
This time, we hope we can store the data in a way that lets us manipulate it (search, sort, insert, ...) both locally and as a whole.
I hope someone here could point me in the right direction for how I should do this: such as what languages, data structures, algorithms, etc. I can use.
Have you looked at using memory-mapped files? They allow files to be addressed as though they are part of the application's memory, even if they are larger than the actual available memory.
See the following links for more information on what they are:
http://msdn.microsoft.com/en-us/library/dd997372.aspx
http://blogs.msdn.com/b/khen1234/archive/2006/01/30/519483.aspx
These links are previous answers to questions about size limitations of memory-mapped files. Basically the file can be larger than the address space, but you may not be able to "view" all of it at once.
How big can a memory-mapped file be?
Reading huge files using Memory Mapped Files
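Since the links above are Windows-oriented, here is a hedged minimal sketch using the Win32 API (CreateFileMapping / MapViewOfFile); on POSIX systems the equivalents are open() and mmap(). The file name is made up, error handling is minimal, and for files larger than the address space you would map smaller sliding views instead of the whole file.

```cpp
#include <windows.h>
#include <cstdio>

int main()
{
    // Open the (possibly multi-GB) file. Name and access flags are illustrative.
    HANDLE file = CreateFileA("huge_data.csv", GENERIC_READ, FILE_SHARE_READ,
                              nullptr, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file == INVALID_HANDLE_VALUE) return 1;

    // Create a read-only mapping covering the whole file.
    HANDLE mapping = CreateFileMappingA(file, nullptr, PAGE_READONLY, 0, 0, nullptr);
    if (!mapping) { CloseHandle(file); return 1; }

    // Map a view. For files larger than the address space you would map and
    // unmap smaller windows (offset/length) instead of the whole file at once.
    const char* data = static_cast<const char*>(
        MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0));
    if (data) {
        LARGE_INTEGER size{};
        GetFileSizeEx(file, &size);
        // The file content is now addressable like ordinary memory; the OS
        // pages it in on demand. E.g. count lines without loading it all:
        long long lines = 0;
        for (long long i = 0; i < size.QuadPart; ++i)
            if (data[i] == '\n') ++lines;
        std::printf("lines: %lld\n", lines);
        UnmapViewOfFile(data);
    }
    CloseHandle(mapping);
    CloseHandle(file);
}
```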