I am looking for a main-memory database with a C++ interface. I want a database with a programmatic query interface, preferably one that works with native C++ types. SQLite, for example, takes queries as strings and needs to parse them, which is time-consuming. The operations I am looking for are:
Creation of tables of arbitrary dimensions (number of attributes) capable of storing integer types.
Support for insertion, deletion, selection, projection and (not a priority) joins.
The parsing time of SQLite isn't really that much (you can amortize it over many queries) unless you're substituting the values into the SQL query by hand. Substituting by hand is hard work, awkward, slow and probably unsafe too. Instead, you should be using bound parameters so that you can do things more directly (see http://www.sqlite.org/c3ref/bind_blob.html for the relevant API).
Note that if you switch to a different database, you will have the same issue; you only get high speed out of any SQL system by using bound parameters. (And consider not sweating over performance too much; the bits where it hits storage are the bottleneck…)
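For what it's worth, a minimal sketch of a bound-parameter insert with the SQLite C API might look like this (the table name and columns are placeholders, not anything from the question):

#include <sqlite3.h>

// Prepare once, bind and execute many times; the parse cost is paid once.
int insert_row(sqlite3* db, int a, int b) {
    sqlite3_stmt* stmt = nullptr;
    if (sqlite3_prepare_v2(db, "INSERT INTO t(a, b) VALUES(?1, ?2);",
                           -1, &stmt, nullptr) != SQLITE_OK)
        return -1;

    sqlite3_bind_int(stmt, 1, a);          // values go in directly, no string formatting
    sqlite3_bind_int(stmt, 2, b);
    int rc = sqlite3_step(stmt);           // SQLITE_DONE on success

    sqlite3_finalize(stmt);                // or sqlite3_reset() to reuse the statement
    return rc == SQLITE_DONE ? 0 : -1;
}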
Try Boost.MultiIndex.
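A minimal sketch of what a two-column integer "table" could look like with Boost.MultiIndex; the row type and the choice of indexes here are assumptions for illustration, not part of the original answer:

#include <boost/multi_index_container.hpp>
#include <boost/multi_index/ordered_index.hpp>
#include <boost/multi_index/member.hpp>

struct Row { int id; int value; };

namespace bmi = boost::multi_index;

// A "table" with two indexes: unique on id, non-unique on value.
typedef bmi::multi_index_container<
    Row,
    bmi::indexed_by<
        bmi::ordered_unique<bmi::member<Row, int, &Row::id> >,
        bmi::ordered_non_unique<bmi::member<Row, int, &Row::value> >
    >
> RowTable;

int main() {
    RowTable table;
    table.insert({1, 42});                 // insertion
    table.insert({2, 7});

    auto& byId = table.get<0>();           // selection by id uses the first index
    auto it = byId.find(1);
    if (it != byId.end())
        byId.erase(it);                    // deletion
}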
BerkeleyDB (now owned by Oracle) can store data entirely in memory (though it was originally designed for disk storage). TimesTen (now also owned by Oracle) was designed from the beginning for in-memory storage. Both of them support both SQL and an API for direct access from C, C++, etc.
I am looking for suggestions on how best to implement my code for the following requirements. During execution of my C++ code, I frequently need to access data stored in a dictionary, which itself is stored in a text file. The dictionary contains 100 million entries, and at any point in time, my code may query data corresponding to some particular entry among those 100 million. There is no particular pattern in which those queries are made, and not all entries in the dictionary are queried during the lifetime of the program. Also, the dictionary will remain unchanged during the program's lifetime. The data corresponding to each entry is not all of the same length. The file size of my dictionary is ~24 GB, and I have only 16 GB of RAM. I need my application to be very fast, so I would like to know how best to implement such a system so that read access times can be minimized.
I am also the one who is creating the dictionary, so I have the flexibility to break it down into several smaller volumes. While thinking about what I can do, I came up with the following, but I am not sure if either is good.
If I store the line offset for each entry in my dictionary from the beginning of the file, then to read the data for the corresponding entry I can jump directly to that offset. Is there a way to do this using, say, ifstream without looping through all lines up to the offset line? A quick search on the web seems to suggest this is not possible, at least with ifstream; are there other ways this can be done?
The other extreme thought was to create a single file for each entry in the dictionary, so I would have 100 million files. This approach has the obvious drawback of overhead in opening and closing the file stream.
In general I am not convinced either of the approaches I have in mind are good, and so I would like some suggestions.
Well, if you only need key-value access, and if the data is larger than what can fit in memory, the answer is a NoSQL database. That means a hash-type index for the key and arbitrary values. If you have no other constraints such as concurrent access from many clients or extended scalability, you can roll your own. The most important question for a custom NoSQL database is the expected number of keys, because that determines the size of the index file. You can find rather good hashing algorithms around, and you will have to make a trade-off between a larger index file and a higher risk of collisions. Anyway, unless you want to use a terabyte-sized index file, your code must be prepared for possible collisions.
A detailed explanation with examples is far beyond what I can write in a SO answer, but it should give you a starting point.
The next optimization is deciding what should be cached in memory. It depends on how you expect the queries to arrive. If the same key is unlikely to be queried more than once, you can probably just rely on the OS and filesystem cache, and a slight improvement would be memory-mapped files; otherwise caching (of the index and/or values) makes sense. Here again you can choose and implement a caching algorithm.
Or, if you think this is too complex for little gain, you can check whether one of the free NoSQL databases meets your requirements...
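If you do roll your own, a rough sketch of the hash-index idea could look like the following; it assumes you also write a companion index file of (key hash, byte offset) pairs when you build the dictionary, and every name here is illustrative rather than a recommendation:

#include <cstdint>
#include <fstream>
#include <functional>
#include <string>
#include <unordered_map>

class OffsetDictionary {
public:
    // The index file is assumed to hold (key hash, byte offset) pairs written
    // with the same std::hash<std::string> that is used here.
    OffsetDictionary(const std::string& dataPath, const std::string& indexPath)
        : data_(dataPath, std::ios::binary) {
        std::ifstream idx(indexPath, std::ios::binary);
        std::uint64_t hash = 0, offset = 0;
        while (idx.read(reinterpret_cast<char*>(&hash), sizeof hash) &&
               idx.read(reinterpret_cast<char*>(&offset), sizeof offset))
            offsets_[hash] = offset;       // only the index lives in RAM, not the 24 GB of data
    }

    // Returns the line stored for `key`, or an empty string if absent.
    // Because of hash collisions the caller should still verify the key.
    std::string lookup(const std::string& key) {
        auto it = offsets_.find(std::hash<std::string>{}(key));
        if (it == offsets_.end()) return std::string();
        data_.clear();                                            // reset any EOF state
        data_.seekg(static_cast<std::streamoff>(it->second));     // jump straight to the entry
        std::string line;
        std::getline(data_, line);
        return line;
    }

private:
    std::ifstream data_;
    std::unordered_map<std::uint64_t, std::uint64_t> offsets_;
};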
Once you decide to use an on-disk data structure, this becomes less a C++ question and more a system-design question. You want to implement a disk-based dictionary.
The factors to consider from now on are: what are your disk parameters - is it an SSD or an HDD? What is your average lookup rate per second? Are you fine with 20 usec - 10 ms latencies for your Lookup() method?
On-disk dictionaries require random disk seeks. Such seeks have a latency of dozens of microseconds on an SSD and 3-10 ms on an HDD. Also, there is a limit on how many such seeks you can make per second. You can read this article for example. CPU stops being the bottleneck and IO becomes important.
If you want to pursue this direction - there are state-of-the-art C++ libraries that give you an on-disk key-value store (no need for an out-of-process database), or you can do something simple yourself.
If your application is a batch process and not a server/UI program, i.e. you have another finite stream of items that you want to join with your dictionary, then I recommend reading about external algorithms like hash join or MapReduce. In these cases, it's possible to organize your data in such a way that instead of having one huge 24 GB dictionary you have 10 dictionaries of 2.4 GB each and sequentially load and join each one of them. But for that, I need to understand what kind of problem you are trying to solve.
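As a tiny illustration of that partitioning (file names and the key<TAB>value line format are purely hypothetical), each entry can be routed to one of the smaller volumes by hashing its key, so a batch join only ever needs one ~2.4 GB shard in memory at a time:

#include <cstddef>
#include <fstream>
#include <functional>
#include <string>
#include <vector>

int main() {
    const std::size_t kShards = 10;
    std::vector<std::ofstream> shards;
    for (std::size_t i = 0; i < kShards; ++i)
        shards.emplace_back("dict_shard_" + std::to_string(i) + ".txt");

    std::ifstream dict("dictionary.txt");
    std::string line;
    while (std::getline(dict, line)) {
        std::string key = line.substr(0, line.find('\t'));        // assumes key<TAB>value lines
        shards[std::hash<std::string>{}(key) % kShards] << line << '\n';
    }
}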
To summarize, you need to design your system before coding the solution. Using mmap or tries or the other tricks mentioned in the comments are local optimizations (if that), and unlikely to be game-changers. I would not rush to explore them before doing back-of-the-envelope computations to understand the main direction.
I have an application that makes database requests. I guess it doesn't actually matter what kind of database I am using, but let's say it's a simple SQLite-driven database.
Now, this application runs as a service and makes some number of requests per minute (this number might actually be huge).
I want to benchmark the queries to obtain their count and the maximal / minimal / average running time over some period, and I wish to design my own tool for this (obviously, there are existing ones, but I need my own for some appropriate reasons :).
So - could you advise an approach for this task?
I guess there are several possible cases:
1) I have access to the application source code. Here, obviously, I want to make some sort of cross-application integration, probably using pipes. Could you advise how this should be done and (if there is one) any other possible solution?
2) I don't have the sources. So, is it even possible to perform some neat injection from my application to benchmark the other one? I hope there is a way, maybe hacky, whatever.
Thanks a lot.
See C++ Code Profiler for a range of profilers.
Or C++ Logging and performance tuning library for rolling your own simple version
My answer is valid only for case 1).
In my experience profiling is a fun but difficult task. Using professional tools can be effective, but it can take a lot of time to find the right one and learn how to use it properly. I usually start in a very simple way. I have prepared two very simple classes. The first one, ProfileHelper, records the start time in its constructor and the end time in its destructor. The second class, ProfileHelperStatistic, is a container with extra statistical capability (a std::multimap plus a few methods to return the average, standard deviation and other funny stuff).
The ProfileHelper has a reference to the container, and on exit its destructor pushes the data into the container. You can declare the ProfileHelperStatistic in main(), and if you create a ProfileHelper on the stack at the beginning of a specific function, the job is done. The constructor of the ProfileHelper will store the starting time and the destructor will push the result into the ProfileHelperStatistic.
It is fairly easy to implement, and with minor modifications it can be made cross-platform. The time to create and destroy the object is not recorded, so it will not pollute the results. Calculating the final statistics can be expensive, so I suggest you run it once at the end.
You can also customize the information that you are going to store in ProfileHelperStatistic adding extra information (like timestamp or memory usage for example).
The implementation is fairly easy, two classes that are not bigger than 50 lines each. Just two hints, both reflected in the sketch below:
1) catch all in the destructor!
2) consider using a collection that takes constant time to insert if you are going to store a lot of data.
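A minimal sketch of the two classes might look like this; the names follow the description above, but every detail is my own guess rather than the answerer's actual code:

#include <chrono>
#include <cstddef>
#include <map>
#include <string>

class ProfileHelperStatistic {
public:
    void add(const std::string& tag, double seconds) {
        samples_.insert({tag, seconds});
    }
    double average(const std::string& tag) const {
        auto range = samples_.equal_range(tag);
        double sum = 0.0;
        std::size_t n = 0;
        for (auto it = range.first; it != range.second; ++it) { sum += it->second; ++n; }
        return n ? sum / n : 0.0;
    }
private:
    std::multimap<std::string, double> samples_;   // hint 2: cheap insertion
};

class ProfileHelper {
public:
    ProfileHelper(std::string tag, ProfileHelperStatistic& stats)
        : tag_(std::move(tag)), stats_(stats),
          start_(std::chrono::steady_clock::now()) {}
    ~ProfileHelper() {
        try {                                      // hint 1: never let an exception escape the destructor
            auto end = std::chrono::steady_clock::now();
            stats_.add(tag_, std::chrono::duration<double>(end - start_).count());
        } catch (...) {}
    }
private:
    std::string tag_;
    ProfileHelperStatistic& stats_;
    std::chrono::steady_clock::time_point start_;
};

// Usage: declare one ProfileHelperStatistic in main(), then create a
// ProfileHelper on the stack at the top of each function you want to time.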
This is a simple tool and it can help you profile your application in a very effective way. My suggestion is to start with a few macro functions (5-7 logical blocks) and then increase the granularity. Remember the 80-20 rule: 20% of the source code uses 80% of the time.
One last note about databases: a database tunes its performance dynamically; if you run a query several times, by the end it will be quicker than at the beginning (Oracle does this, and I guess other databases do as well). In other words, if you test the application heavily and artificially, focusing on just a few specific queries, you can get overly optimistic results.
I guess it doesn't actually matter what kind of database I am using, but let's say it's a simple SQLite-driven database.
It's very important what kind of database you use, because the database-manager might have integrated monitoring.
I can speak only about IBM DB/2, but I believe that IBM DB/2 is not the only DBMS with integrated monitoring tools.
Here, for example, is a short overview of what you can monitor in IBM DB/2:
statements (all executed statements, execution count, prepare-time, cpu-time, count of reads/writes: tablerows, bufferpool, logical, physical)
tables (count of reads / writes)
bufferpools (logical and physical reads/writes for data and index, read/write times)
active connections (running statements, count of reads/writes, times)
locks (all locks and type)
and many more
Monitor data can be accessed via SQL or an API from your own software, which is for example what DB2 Monitor does.
Under Unix, you might want to use gprof and its graphical front-end, kprof. Compile your app with the -pg flag (I assume you're using g++) and run it through gprof and observe the results.
Note, however, that this type of profiling will measure the overall performance of an application, not just the SQL queries. If it's the performance of queries you want to measure, you should use special tools that are designed for your DBMS - for example, MySQL has a built-in query profiler (for SQLite, see this question: Is there a tool to profile sqlite queries?)
There is a (Linux) solution you might find interesting since it could be used in both cases.
It's the LD_PRELOAD trick. LD_PRELOAD is an environment variable that lets you specify a shared library to be loaded right before your program is executed. The symbols loaded from this library will override any others available on the system.
The basic idea is to use this custom library as a wrapper around the original functions.
There are plenty of resources available that explain how to use this trick.
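As an illustration (assuming the target program links SQLite dynamically; the wrapper name and log format are made up), a preloaded library could time every sqlite3_step() call like this:

// build: g++ -shared -fPIC -o libsqlprof.so sqlprof.cpp -ldl
// run:   LD_PRELOAD=./libsqlprof.so ./the_service
#ifndef _GNU_SOURCE
#define _GNU_SOURCE                     // for RTLD_NEXT
#endif
#include <dlfcn.h>
#include <chrono>
#include <cstdio>

struct sqlite3_stmt;                    // opaque: we never look inside it

extern "C" int sqlite3_step(sqlite3_stmt* stmt) {
    typedef int (*real_fn)(sqlite3_stmt*);
    static real_fn real = (real_fn)dlsym(RTLD_NEXT, "sqlite3_step");

    auto start = std::chrono::steady_clock::now();
    int rc = real(stmt);                // call through to the real SQLite
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                  std::chrono::steady_clock::now() - start).count();

    std::fprintf(stderr, "sqlite3_step took %lld us\n", (long long)us);
    return rc;
}

Note that this only works if SQLite is loaded as a shared library; a statically linked SQLite cannot be intercepted this way.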
Here, obviously, I want to make some sort of cross-application integration, probably using pipes.
I don't think that's obvious at all.
If you have access to the application, I'd suggest dumping all the necessary information to a log file and processing that log file later on.
If you want to be able to activate and deactivate this behavior on-the-fly, without re-starting the service, you could use a logging library that supports enabling/disabling log channels on-the-fly.
Then you'd only need to send a message to the service by whatever means (socket connection, ...) to enable/disable logging.
If you don't have access to the application, then I think the best way would be what MacGucky suggested: let the profiling/monitoring tools of the DBMS do it. E.g. MS-SQL has a nice profiler that can capture requests to the server, including all kinds of useful data (CPU time for each request, IO time, wait time etc.).
And if it's really SQLite (plus you don't have access to the source) then your chances are rather low. If the program in question uses SQLite as a DLL, then you could substitute your own version of SQLite, modified to write the necessary log files.
Use Apache JMeter to test the performance of your SQL queries under high load.
I have an array of objects (say, images), which is too large to fit into memory (e.g. 40GB). But my code needs to be able to randomly access these objects at runtime.
What is the best way to do this?
From my code's point of view, it shouldn't matter, of course, if some of the data is on disk or temporarily stored in memory; it should have transparent access:
container.getObject(1242)->process();
container.getObject(479431)->process();
But how should I implement this container? Should it just send the requests to a database? If so, which one would be the best option? (If a database, then it should be free and not too much administration hassle, maybe Berkeley DB or sqlite?)
Should I just implement it myself, memoizing objects after access and purging the memory when it's full? Or are there good libraries (C++) for this out there?
The requirements for the container would be that it minimizes disk access (some elements might be accessed more frequently by my code, so they should be kept in memory) and allows fast access.
UPDATE: It turns out that STXXL does not work for my problem because the objects I store in the container have dynamic size, i.e. my code may update them (increasing or decreasing the size of some objects) at runtime. But STXXL cannot handle that:
STXXL containers assume that the data types they store are plain old data types (POD).
http://algo2.iti.kit.edu/dementiev/stxxl/report/node8.html
Could you please comment on other solutions? What about using a database? And which one?
Consider using the STXXL:
The core of STXXL is an implementation of the C++ standard template library STL for external memory (out-of-core) computations, i.e., STXXL implements containers and algorithms that can process huge volumes of data that only fit on disks. While the compatibility to the STL supports ease of use and compatibility with existing applications, another design priority is high performance.
You could also look into memory-mapped files, and access the objects through one of those.
I would implement a basic cache. With this working-set size you will have the best results with a set-associative cache with x-byte cache lines (x == whatever best matches your access pattern). Just implement in software what every modern processor already has in hardware. This should give you, IMHO, the best results. You could then optimize it further if you can make the access pattern somehow linear.
One solution is to use a structure similar to a B-Tree, indices and "pages" of arrays or vectors. The concept is that the index is used to determine which page to load into memory to access your variable.
If you make the page size smaller, you can store multiple pages in memory. A caching system based on frequency of use or other rule, will reduce the number of page loads.
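A bare-bones sketch of that paging idea, assuming fixed-size records and leaving out any eviction policy (record type, file name and page size are all placeholders):

#include <cstddef>
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

struct Record { std::int64_t data[16]; };   // fixed-size example payload

class PagedContainer {
public:
    PagedContainer(const std::string& path, std::size_t recordsPerPage)
        : file_(path, std::ios::binary), recordsPerPage_(recordsPerPage) {
        if (!file_) throw std::runtime_error("cannot open " + path);
    }

    const Record& get(std::size_t index) {
        std::size_t page = index / recordsPerPage_;
        auto it = cache_.find(page);
        if (it == cache_.end())
            it = cache_.emplace(page, loadPage(page)).first;      // "page fault": read from disk
        return it->second[index % recordsPerPage_];
    }

private:
    std::vector<Record> loadPage(std::size_t page) {
        std::vector<Record> buf(recordsPerPage_);
        file_.clear();
        file_.seekg(static_cast<std::streamoff>(page * recordsPerPage_ * sizeof(Record)));
        file_.read(reinterpret_cast<char*>(buf.data()),
                   static_cast<std::streamsize>(buf.size() * sizeof(Record)));
        return buf;
    }

    std::ifstream file_;
    std::size_t recordsPerPage_;
    std::unordered_map<std::size_t, std::vector<Record>> cache_;  // no eviction policy shown
};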
I've seen some very clever code that overloads operator[]() to perform disk access on the fly and load required data from disk/database transparently.
Is there a library usable from C++ for sharing fairly simple data (integers, floating-point numbers, strings) between cooperative processes?
It must be:
high-speed (SQL-based methods too slow due to parsing)
able to get, set, update and delete both fixed and variable data types (e.g. int and string)
ACID (atomic,consistent,isolated,durable)
usable under linux
usable by processes without a shared parent.
highly compatible license: e.g. LGPL,MIT,BSD
For bonus points:
ability to work across the network.
ability to handle aggregation/composition into more complicated structures
Take a look at boost::interprocess. For local use, you probably can't beat a map or hash table in shared memory. Allowing networking makes things more difficult, in that case something like memcached or CouchDB might be more appropriate.
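A minimal sketch of the shared-memory map idea with Boost.Interprocess; the segment name, size and key/value types are assumptions (and note that this alone does not give you the durability part of ACID):

#include <boost/interprocess/managed_shared_memory.hpp>
#include <boost/interprocess/containers/map.hpp>
#include <boost/interprocess/allocators/allocator.hpp>
#include <functional>
#include <utility>

namespace bip = boost::interprocess;

typedef std::pair<const int, double> ValueType;
typedef bip::allocator<ValueType, bip::managed_shared_memory::segment_manager> ShmAllocator;
typedef bip::map<int, double, std::less<int>, ShmAllocator> ShmMap;

int main() {
    // Writer side: create (or open) a 64 KB shared-memory segment and build
    // a map inside it; other processes open the same segment and find "Table".
    bip::managed_shared_memory segment(bip::open_or_create, "SharedData", 65536);
    ShmAllocator alloc(segment.get_segment_manager());
    ShmMap* table = segment.find_or_construct<ShmMap>("Table")(std::less<int>(), alloc);

    table->insert(ValueType(42, 3.14));     // visible to readers in other processes
}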
My app keeps track of the state of about 1000 objects. Those objects are read from and written to a persistent store (serialized) in no particular order.
Right now the app uses the registry to store each object's state. This is nice because:
It is simple
It is very fast
Individual object's state can be read/written without needing to read some larger entity (like pulling out a snippet from a large XML file)
There is a decent editor (RegEdit) which allows easy manipulation of individual items
Having said that, I'm wondering if there is a better way. SQLite seems like a possibility, but you don't have the same level of multiple-reader/multiple-writer that you get with the registry, and no simple way to edit existing entries.
Any better suggestions? A bunch of flat files?
If what you mean by 'multiple-reader/multiple-writer' is that you keep a lot of threads writing to the store concurrently, SQLite is threadsafe (you can have concurrent SELECTs and concurrent writes are handled transparently). See the FAQ at http://www.sqlite.org/faq.html and grep for 'threadsafe'.
If you do begin to experiment with SQLite, you should know that "out of the box" it might not seem as fast as you would like, but it can quickly be made to be much faster by applying some established optimization tips:
SQLite optimization
Depending on the size of the data and the amount of RAM available, one of the best performance gains will occur by setting sqlite to use an all-in-memory database rather than writing to disk.
For in-memory databases, pass NULL as the filename argument to sqlite3_open and make sure that TEMP_STORE is defined appropriately
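For reference, the documented ":memory:" filename is another way to get a purely in-memory database, alongside the NULL-filename/TEMP_STORE route above; a minimal sketch (the table layout is just an example):

#include <sqlite3.h>
#include <cstdio>

int main() {
    sqlite3* db = nullptr;
    if (sqlite3_open(":memory:", &db) != SQLITE_OK) {       // lives entirely in RAM
        std::fprintf(stderr, "open failed: %s\n", sqlite3_errmsg(db));
        return 1;
    }
    sqlite3_exec(db, "CREATE TABLE state(id INTEGER PRIMARY KEY, payload BLOB);",
                 nullptr, nullptr, nullptr);
    sqlite3_close(db);
    return 0;
}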
On the other hand, if you tell sqlite to use the harddisk, then you will get a similar benefit to your current usage of RegEdit to manipulate the program's data "on the fly."
The way you could simulate your current RegEdit technique with sqlite would be to use the sqlite command-line tool to connect to the on-disk database. You can run UPDATE statements on the sql data from the command-line while your main program is running (and/or while it is paused in break mode).
I doubt any sane person would go this route these days; however, some of what you describe could be done with Windows' Structured/Compound Storage. I only mention this since you're asking about Windows - and this is/was an official Windows way to do it.
This is how DOC files were put together (but not the new DOCX format). From MSDN it'll appear really complicated, but I've used it, it isn't the worst API in Win32.
it is not simple
it is fast; I would guess it's faster than the registry.
Individual object's state can be read/written without needing to read some larger entity.
There is no decent editor; however, there are some really basic tools (VC++ 6.0 had the "DocFile Viewer" under Tools - yeah, that's what that thing did). I found a few more online.
You get a file instead of registry keys.
You gain some old-school Windows developer geek-cred.
Other random thoughts:
I think XML is the way to go (despite the random-access issue). Heck, INI files may work. The registry gives you very fine-grained security if you need it - people seem to forget this when they claim using files is better. An embedded DB seems like overkill if I'm understanding what you're doing.
Do you need to persist the objects on each change event, or can you keep them in memory and store them on shutdown? If the latter, just load them up at startup and serialize them at the end; assuming your app runs for a long time (and you don't share that state with another program), keeping them in memory is going to be a winner.
If you've got fixed-size structures then you could consider just using a memory-mapped file and allocating memory from that.
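A rough sketch of that idea with the POSIX mmap API (on Windows the equivalent calls are CreateFileMapping/MapViewOfFile); the record type, file name and object count are placeholders:

#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

struct State { int id; int value; };        // placeholder fixed-size object

int main() {
    const std::size_t count = 1000;         // roughly the number of objects in the question
    int fd = open("state.bin", O_RDWR | O_CREAT, 0644);
    if (fd < 0) return 1;
    ftruncate(fd, count * sizeof(State));   // reserve room for every object

    void* p = mmap(nullptr, count * sizeof(State),
                   PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) return 1;

    State* objects = static_cast<State*>(p);
    objects[42].value = 7;                  // changes reach the file via the page cache

    munmap(p, count * sizeof(State));
    close(fd);
}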
If the only thing you do is serialize/deserialize individual objects (no fancy queries), then use a B-tree database, for example Berkeley DB. It is very fast at storing and retrieving chunks of data by key (I assume your objects have some ID that can be used as a key), and access by multiple processes is supported.
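A minimal sketch of storing and retrieving a serialized object by integer key through Berkeley DB's C API (callable from C++); the file name and payload are illustrative and error handling is trimmed:

#include <db.h>
#include <cstring>
#include <string>

int main() {
    DB* dbp = nullptr;
    db_create(&dbp, nullptr, 0);
    dbp->open(dbp, nullptr, "objects.db", nullptr, DB_BTREE, DB_CREATE, 0664);

    int id = 42;
    std::string blob = "serialized-object-bytes";

    DBT key, data;
    std::memset(&key, 0, sizeof key);
    std::memset(&data, 0, sizeof data);
    key.data = &id;
    key.size = sizeof id;
    data.data = (void*)blob.data();
    data.size = (u_int32_t)blob.size();

    dbp->put(dbp, nullptr, &key, &data, 0);    // store by key

    DBT out;
    std::memset(&out, 0, sizeof out);
    dbp->get(dbp, nullptr, &key, &out, 0);     // retrieve by key; out.data points at the value

    dbp->close(dbp, 0);
}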