I'm working on a program and my "database" is some .csv files.
I have a list of objects in a .csv with some information about each. Which is the best-appropriate way to treat the "data".
Work with fstream, meaning that everytime that I want to modify the data, or I want to read something I will directly work with my files with the tools of fstream
Or, at the beginning of the program I will load the data in a vector, read,write on the vector and in the end of the program I delete the previous file and load the new one.
In a matter of performance will it be different? Considering that the objects are numerous.
I think that it is more of a combination rather than choosing A or B. Especially choosing only A. is not safe as multiple components could access the file simultaneously. Plus if there are many updates then using streams each time could be make your code very slow.
Therefore I believe that you should use B but also take care of the implementation to persist your data in a safe way (write your data in the file).
Regarding the data structure, this depends on the usage. One important question to ask here is whether there are many insertions and deletions. If this is the case then it would be more efficient to use a list instead of a vector, as the list provides instant time insertions and vector is not appropriate for this purpose.
If the data include a unique attribute and fast lookups are needed then a hash or a map would be more suitable.
Take the CSV parser from my CSVtoC utility.
http://www.malcolmmclean.site11.com/www/
CSV files are not good for dynamic update as the records are not fixed in known disk physical locations. (The alternative is to contrive the CSV so that that doesn't hold, but it's a delicate and messy approach).
Reading a CSV is hard, writing one is trivial.
Related
I am looking for suggestions on how best to implement my code for the following requirements. During execution of my c++ code, I frequently need to access data stored in a dictionary, which itself is stored in a text file. The dictionary contains 100 million entries, and at any point in time, my code would query data corresponding to some particular entry among those 100 million entries. There is no particular pattern in which those queries are made, and further during the lifetime of the program execution, not all entries in the dictionary are queried. Also, the dictionary will remain unchanged during the program's lifetime. The data corresponding to each entry is not all of the same length. The file size of my dictionary is ~24 GB, and I have only 16 GB of RAM memory. I need my application to be very fast, so I would like to know how best to implement such a system so that read access times can be minimized.
I am also the one who is creating the dictionary, so I do have the flexibility in breaking down my dictionary into several smaller volumes. While thinking about what I can do, I came up with the following, but not sure if either are good.
If I store the line offset for each entry in my dictionary from the beginning of the file, then to read the data for the corresponding entry, I can directly jump to the corresponding offset. Is there a way to do this using say ifstream without looping through all lines until the offset line? A quick search on the web seems to suggest this is not possible atleast with ifstream, are there are other ways this can be done?
The other extreme thought was to create a single file for each entry in the dictionary, so I would have 100 million files. This approach has the obvious drawback of overhead in opening and closing the file stream.
In general I am not convinced either of the approaches I have in mind are good, and so I would like some suggestions.
Well, if you only need key value accesses, and if the data is larger than what can fit in memory, the answer is a NoSQL database. That mean a hash type index for the key and arbitrary values. If you have no other constraint like concurrent accesses from many clients or extended scalability, you can roll your own. The most important question for a custom NoSQL database is the expected number of keys that will give the size of the index file. You can find rather good hashing algorithms around, and will have to make a decision between a larger index file and a higher risk of collisions. Anyway, unless you want to use a tera bytes index files, your code must be prepared to possible collisions.
A detailed explaination with examples is far beyond what I can write in a SO answer, but it should give you a starting point.
The next optimization will be what should be cached in memory. It depends on the way you expect the queries. If it is unlikely to query more than one time the same key, you can probably just rely on the OS and filesystem cache, and a slight improvement would be memory mapped files, else caching (of index and/or values) makes sense. Here again you can choose and implement a caching algorithm.
Or if you think that it is too complex for little gain, you can search if one of the free NoSQL databases could meet your requirement...
Once you decide using on-disk data structure it becomes less a C++ question and more a system design question. You want to implement a disk-based dictionary.
You should consider the following factors from now on are - what's your disk parameters? is it SSD? HDD? what's your average lookup rate per second? Are you fine having 20usec - 10ms latencies for your Lookup() method?
On-disk dictionaries require random disk seeks. Such seeks have a latency of dozens of microseconds for SSD and 3-10ms for HDD. Also, there is a limit on how many such seeks you can make a second. You can read this article for example. CPU stops being a bottleneck and IO becomes important.
If you want to pursue this direction - there are state of art C++ libraries that give you on-disk key-value store (no need for out-of- process database) or you can do something simple yourself.
If your application is a batch process and not a server/UI program, i.e. you have another finite stream of items that you want to join with your dictionary then I recommend reading about external algorithms like Hash Join or a MapReduce. In these cases, it's possible to organize your data in such way that instead of having 1 huge dictionary of 24GB you can have 10 dictionaries of size 2.4GB and sequentially load each one of them and join. But for that, I need to understand what kind of problem you are trying to solve.
To summarize, you need to design your system first before coding the solution. Using mmap or tries or other tricks mentioned in the comments are local optimizations (if at all), they are unlikely game-changers. I would not rush exploring them before doing back-on-the-envelope computations to understand the main direction.
I have a large dictionary of english words (around 70k of them) that I load into memory at the beginning of the program. They are loaded into a radix trie data structure, and each trie node often has many links from one node to many others (for example the word antonyms, "dead" -> "alive", "well"). Each node also have a std::vector<MetaData> in it which contains various miscellaneous metadata for my program.
Now, the problem is with the loading time of this file. Reading the file from disk, deserialization and allocating the data structure for the thing in general takes a lot of time (4-5 seconds).
Currently, I'm working on making the load asynchronously (or bit by bit, a fraction of them per frame), but due to the nature of the application (it's a mobile keyboard), there's just plenty of times where it simply has to be loaded quickly.
What can be done to speed up loading? Memory pool everything? I am benchmarking different parts to see what can be optimized, but it looks like, so far, it's just little things that add up.
If the trie is static (i.e. doesn't change when the program's running), then build an optimized version in an array using array indexes in place of pointers. You can then save that as your data file. Startup then amounts to just loading that block of data into memory.
Doing it that way makes some things less convenient (you'll have to use arrays rather than std::vector, for example), and you might have to do a bit of casting, but with a little thought you end up with a very compact and very fast data structure that doesn't suffer from the allocation overhead associated with creating an object for each node. Instead, it's essentially an array of varying length structures.
I did this for an application that used a directed acyclic word graph (DAWG). Rather than rebuild the DAWG every time the program was loaded (a time consuming process), I had a utility program that created the DAWG and shipped that as the data file in place of the word list.
Not knowing the details, only a vague idea:
Loading the bulk data (entries) will give you the basic dictionary.
For all the cross references like syn- and antonyms and whatever, load and process the data in background, after you've shown "ready". Chances are, until A. User has typed in the first query, you are ship-shape.
Later
If the file is rather big, reading a compressed version may gain.
Also, a BufferedReader with a suitably increased buffer size may help.
You should review the structure of the data to make the data faster to load.
Also, splitting into multiple tables may speed things up.
For example, have one table for the words, another table for synonyms and additional tables for other relationships.
The first table should have organization. This allows the synonym table to be represented as ; which should load fast.
You can then build any internal containers from the data loaded in. A reason for having different data structures for store data vs. internal data is for optimization. The structures used for data storage (and loading) are optimized for loading. The structure for internal data is optimized for searching.
Another idea based on the fact that it is a mobile keyboard application.
Some words are used more often than others, so maybe you could organize it so the frequently used words are loaded first and leave the infrequently used ones to be loaded as it is needed (or as you have time).
I need to periodically write out a bunch of stuff to prevent data loss. The tricky bit is that not all the things written to the file change all the time but they might, the rate at which things change in various areas of the data is determined by the user and is unpredictable.
Since the structure of the data would stay the same, is it possible to only write parts of the file? Also how would this approach compare to storing the data into multiple files if, for example I stored each variable separately(would be thousands, I'm probably not going down this path in any case)?
It is possible to seek and write to an arbitrary location in a file, but this only works if the number of characters you're writing is the same as the number that you're replacing, or if the file structure has some other way of delimiting the data, and you keep that up-to-date with your writes.
At the end of the day, if you KNOW for certain all of the data elements you may ever need to store, you can develop your own STRICT file format, perhaps delimited by line, or some other method that allows you to easily GET-To, the location you want to change. If you are storing things like strings however, whose SIZE may change, then searching may be easy, but modifying may be more difficult as all contents below, would have to be pushed down, in order to add the characters if longer, or move up if the data is shrinking.
Perhaps the way I would think to do it is to write the entire file out when a value changes from what YOU KNOW is already in your file, else simply leave it alone.
I think what you also need to realize is that you probably wont realize any performance gain from trying to modify a PART OF your file, versus writing the entire file out again. It is a comparison of writing the entire file out, against the seeking through a file, plus the potential reading of everything past where you want to modify, modifying, then re-writing all contents past do to growth or shrinkage of the element you want to modify.
I am working on a mathematical problem that has the advantage of being able to "pre-compute" about half of the problem, save this information to file, and then reuse it many times to compute various 'instances' of my problem. The difficulty is that uploading all of this information in order to solve the actual problem is a major bottleneck.
More specifically:
I can pre-compute a huge amount of information - tons of probabilities (long double), a ton of std::map<int,int>, and much more - and save all this stuff to disk (several Gb).
The second half of my program accepts an input argument D. For each D, I need to perform a great many computations that involve a combination of the pre-computed data (from file), and some other data that are specific to D (so that the problem is different for each D).
Sometimes I will need to pick out certain pieces of pre-computed information from the files. Other times, I will need to upload every piece of data from a (large) file.
Are there any strategies for making the IO faster?
I already have the program parallelized (MPI, via boost::mpi) for other reasons, but regardless, accessing files on the disk is making my compute time unbearable.
Any strategies or optimizations?
Currently I am doing everything with cstdio, i.e. no iostream. Will that make a big difference?
Certainly the fastest (but the fragilest) solution would be to mmap the data to a fixed address. Slap it all in one big struct, and instantiate the std:::map with an allocator which will allocate in a block attached to the end of the struct. It's not simple, but it will be fast; one call to mmap, and the data is in your (virtual) memory. And because you're forcing the address in mmap, you can even store the pointers, etc.
As mentioned above, in addition to requiring a fair amount of work, it's fragile. Recompile your application, and the targeted address might not be available, or the layout might be different, or whatever. But since it's really just an optimization, this might not be an issue; anytime a compatibility issue arises, just drop the old file and start over. It will make the first run after a change which breaks compatibility extremely slow, but if you don't break compatibility too often...
The stuff that isn't in a map is easy. You put everything in one contiguous chunk of memory that you know (like a big array, or a struct/class with no pointers), and then use write() to write it out. Later use read() to read it in, in a single operation. If the size might vary, then use one operation to read a single int with the size, allocate the memory, and then use a single read() to pull it in.
The map part is a bit harder, since you can't do it all in one operation. Here you need to come up with a convention for serializing it. To make the i/o as fast as possible, your best bet is to convert it from the map to an in-memory form that is all in one place and you can convert back to the map easily and quickly. If, for example your keys are ints, and your values are of constant size then you could make an array of keys, and an array of values, copy your keys into the one array and values into the other, and then write() the two arrays, possibly writing out their size as well. Again, you read things in with only two or three calls to read().
Note that nothing ever got translated to ASCII, and there are a minimum number of system calls. The file will not be human readable, but it will be compact, and fast to read in. Three things make i/o slow: 1) system calls, if you use small reads/writes; 2) translation to/from ASCII (printf, scanf); 3) disk speed. Hard to do much about 3) (other than an SSD). You can do the read in a background thread, but you might need to block waiting for the data to be in.
Some guidelines:
multiple calls to read() are more expensive than single call
binary files are faster than text files
single file is faster than multiple files for large values of "multiple"
use memory-mapped files if you can
use 64 bit OS to let OS manage the memory for you
Ideally, I'd try to put all long doubles into memory-mapped file, and all maps into binary files.
Divide and conquer: if 64 bits is not an option, try to break your data into large chunks in a way that all chunks are never used together, and the entire chunk is needed when it's needed. This way you could load the chunks when they needed and discard them when they are not.
These suggestions of uploading the whole data to the RAM are good when two conditions are met:
Sum of all I/O times during is much more than cost of loading all data to RAM
Relatively large portion of all data is being accessed during application run
(they are usually met when some application is running for a long time processing different data)
However for other cases other options might be considered.
E.g. it is essential to understand if access pattern is truly random. If no, look into reordering data to ensure that items that are accessible together are close to each other. This will ensure that OS caching is performing at its best, and also will reduce HDD seek times (not a case for SSD of course).
If accesses are truly random, and application is not running as long as needed to ammortize one-time data loading cost I would look into architecture, e.g. by extracting this data manager into separate module that will keep this data preloaded.
For Windows it might be system service, for other OSes other options are available.
Cache, cache, cache. If it's only several GB it should be feasible to cache most if not all of your data in something like memcached. This is an especially good solution if you're using MPI across multiple machines rather than just multiple processors on the same machine.
If it's all running on the same machine, consider a shared memory cache if you have the memory available.
Also, make sure your file writes are being done on a separate thread. No need to block an entire process waiting for a file to write.
As was said, cache as much as you can in memory.
If you're finding that the amount you need to cache is larger than your memory will allow, try swapping out the caches between memory and disk how it is often done when virtual memory pages need to be swapped to disk. It is essentially the same problem.
One common method is the Least Recently Used Algorithm for determining which page will be swapped.
It really depends on how much memory is available and what the access pattern is.
The simplest solution is to use memory mapped files. This generally requires that the file has been layed out as if the objects were in memory, so you will need to only use POD data with no pointers (but you can use relative indexes).
You need to study your access pattern to see if you can group together the values that are often used together. This will help the OS in better caching those values (ie, keeping them in memory for you, rather than always going to the disk to read them).
Another option will be to split the file into several chunks, preferably in a logical way. It might be necessary to create an index file that map a range of values to the file that contain them.
Then, you can only access the set of files required.
Finally, for complex data structures (where memory mapped files fail) or for sparse reading (when you only ever extract only a small piece of information from a given file), it might be interesting to read about LRU caches.
The idea will be to use serialization and compression. You write several files, among which an index, and compress all of them (zip). Then, at launch time, you start by loading the index and save it in memory.
Whenever you need to access a value, you first try your cache, if it is not it, you access the file that contains it, decompress it in memory, dump its content in your cache. Note: if the cache is too small, you have to be picky about what you dump in... or reduce the size of the files.
The frequently accessed values will stay in cache, avoiding unnecessary round-trip, and because the file is zipped there will be less IO.
Structure your data in a way that caching can be effective. For instance, when you are reading "certain pieces," if those are all contiguous it won't have to seek around the disk to gather all of them.
Reading and writing in batches, instead of record by record will help if you are sharing disk access with another process.
More specifically: I can pre-compute a huge amount of information - tons of probabilities (long double), a ton of std::map, and much more - and save all this stuff to disk (several Gb).
As far as I understood the std::map are pre-calculated also and there are no insert/remove operations. Only search. How about an idea to replace the maps to something like std::hash_map or sparsehash. In theory it can give performance gain.
More specifically: I can pre-compute a huge amount of information - tons of probabilities (long double), a ton of std::map, and much more - and save all this stuff to disk (several Gb).
Don't reinvent the wheel. I'd suggest using a key-value data store, such as berkeley db: http://docs.oracle.com/cd/E17076_02/html/gsg/C/concepts.html
This will enable saving and sharing the files, caching the parts you actually use a lot and keeping other parts on disk.
I need to make a list of key-value pairs (similar to std::map<std::string, std::string>) that is stored on disk, can be accessed by multiple threads at once. keys can be added or removed, values can be changed, keys are unique. Supposedly the whole thing might not fit into memory at once, so updates to the map must be saved to the disk.
The problem is that I'm not sure how to approach this problem. I understand how to deal with multithreading issues, but I'm not sure which data structure is suitable for storing data on disk. Pretty much anything I can think of can dramatically change structure and cause massive overwrite of the disk storage, if I approach problem head-on. On other hand, relational databases and windows registry deal with this problem, so there must be a way to approach it.
Is there a data structure that is "made" for such scenario?
Or do I simply use any traditional data structure(trees or skip lists, for example) and make some kind of "memory manager" (disk-backed "heap") that allocates chunks of disk space, loads them into memory on request and unloads them onto disk, when necessary? I can imagine how to write such "disk-based heap", but that solution isn't very elegant, especially when you add multi-threading to the picture.
Ideas?
The data structure that is "made" for your scenario is B-tree or its variants, like B+ tree.
Long and short of it: once you write things to disk you are not longer dealing with "data structures" - you are dealing with "serialization" and "databases."
The C++ STL and its data structures do not really address these issues, but, fortunately, they have already been addressed thousands of times by thousands of programmers already. Chances are 99.9% that they've already written something that will work well for you.
Based on your description, sqlite sounds like it would be a decent, balanced choice for your application.
If you only need to do lookups (and insertions, deletions) by key, and not more complex field-based queries, BDB may be a better choice for your application.