Fast insertion of RDF triples using Redland/C++ - c++

I went over the Redland documentation and there are a couple of questions I couldn't answer with certainty.
Looking at it from the C++ side, suppose you generate numerous RDF triples over time for several different graphs, and that keeping all graphs in memory is not a primary interest:
Is it possible to use redland to perform single/bulk insertions (write into persistent storage) without keeping the graph in memory, and how would you tune such insertions?
If we forget about the querying, what would be a good persistent way of storage: files or databases?
What do you think?

Is it possible to use redland to perform single/bulk insertions (write into persistent storage) without keeping the graph in memory, and how would you tune such insertions?
Yes. Create a librdf_storage object describing where you want your data stored and pass it to librdf_new_model(). Then use any of the API functions, such as librdf_parser_parse_into_model(), to store data in that model, and it gets persisted in the storage.
The graph is only kept in memory if the librdf storage module is written that way.
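For illustration, here is a rough, untested sketch of that approach using the librdf C API from C++, assuming a Berkeley DB backed "hashes" storage (swap the storage type/options string for e.g. "mysql" if you prefer a database backend):

```cpp
#include <redland.h>

int main() {
    librdf_world* world = librdf_new_world();
    librdf_world_open(world);

    // On-disk storage; the graph is written through to it rather than kept in memory.
    librdf_storage* storage = librdf_new_storage(
        world, "hashes", "triples", "hash-type='bdb',dir='.'");
    librdf_model* model = librdf_new_model(world, storage, NULL);

    // Single insertion: build one triple and add it to the model.
    librdf_node* s = librdf_new_node_from_uri_string(
        world, (const unsigned char*)"http://example.org/subject");
    librdf_node* p = librdf_new_node_from_uri_string(
        world, (const unsigned char*)"http://example.org/predicate");
    librdf_node* o = librdf_new_node_from_literal(
        world, (const unsigned char*)"object value", NULL, 0);
    librdf_model_add(model, s, p, o);   // the model takes ownership of the nodes

    // Bulk insertion: parse a whole file directly into the model.
    librdf_parser* parser = librdf_new_parser(world, "rdfxml", NULL, NULL);
    librdf_uri* uri = librdf_new_uri(world, (const unsigned char*)"file:data.rdf");
    librdf_parser_parse_into_model(parser, uri, uri, model);

    librdf_free_uri(uri);
    librdf_free_parser(parser);
    librdf_free_model(model);
    librdf_free_storage(storage);
    librdf_free_world(world);
    return 0;
}
```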
If we forget about the querying, what would be a good persistent way of storage: files or databases?
The file storage is not really for serious business. It keeps the graph in memory and persists to disk by serializing to/from RDF/XML.
Use a database-backed storage such as mysql or BDB hashes.

Related

How do on-disk databases handle file reading and writing on a file system level?

Suppose I were to write my own database in C++, and suppose I would use a binary tree or a hash map as the underlying data structure. How would I handle updates to this data structure?
1) Should I first create the binary tree and then somehow persist it onto a disk? And every time data has to be updated I need to open this file and update it? Wouldn't that be a costly operation?
2) Is there a way to directly work on the binary tree without loading it into memory and then persisting again?
3) How do SQLite and MySQL deal with it?
4) My main question is: how do databases persist huge amounts of data, and concurrently make updates to it, without opening and closing the file each time?
Databases see the disk or file as one big block device and manage blocks in M-way balanced trees. They insert/update/delete records in these blocks and flush dirty blocks back to disk. They manage allocation tables of free blocks so the database does not need to be rewritten on each access. As RAM is expensive but fast, pages are kept in a RAM cache. Separate indexes (either separate files or just blocks) provide quick access based on keys. Blocks are often the native allocation size of the underlying filesystem (e.g. the cluster size). Undo/redo logs are maintained for atomicity, and so on.
Much more could be told, and this question actually belongs on Computer Science Stack Exchange. For more information read Horowitz & Sahni, "Fundamentals of Data Structures", p. 496.
As to your questions:
1) You open the file once and keep it open while your database manager is running. You allocate storage as needed and maintain an M-way tree as described above.
2) Yes. You read blocks that you keep in a cache.
3) and 4) See above.
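As a rough illustration of the block handling described above, here is a toy pager (names are made up; no free-block management, logging, eviction, or concurrency) that reads fixed-size blocks from one file kept open the whole time, and caches dirty pages until they are flushed:

```cpp
#include <cstdio>
#include <cstdint>
#include <unordered_map>
#include <vector>

constexpr std::size_t kPageSize = 4096;   // often the filesystem's allocation size

struct Page {
    std::vector<char> data = std::vector<char>(kPageSize);
    bool dirty = false;
};

class Pager {
public:
    explicit Pager(const char* path) : file_(std::fopen(path, "r+b")) {}
    ~Pager() { flush(); if (file_) std::fclose(file_); }

    // Returns a cached page, reading it from disk on first access.
    Page& get(std::uint64_t pageNo) {
        auto it = cache_.find(pageNo);
        if (it != cache_.end()) return it->second;
        Page page;
        std::fseek(file_, static_cast<long>(pageNo * kPageSize), SEEK_SET);
        std::fread(page.data.data(), 1, kPageSize, file_);
        return cache_.emplace(pageNo, std::move(page)).first->second;
    }

    // Writes all dirty pages back; the file is never reopened in between.
    void flush() {
        for (auto& [pageNo, page] : cache_) {
            if (!page.dirty) continue;
            std::fseek(file_, static_cast<long>(pageNo * kPageSize), SEEK_SET);
            std::fwrite(page.data.data(), 1, kPageSize, file_);
            page.dirty = false;
        }
        std::fflush(file_);
    }

private:
    std::FILE* file_;
    std::unordered_map<std::uint64_t, Page> cache_;   // no eviction in this toy version
};
```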
Typically, you would not do file I/O to access the data. Use mmap to map the data into the virtual address space of the process and let the OS block cache take care of the reads and writes.
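A minimal POSIX sketch of that idea (error handling omitted):

```cpp
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int main() {
    int fd = open("data.bin", O_RDWR);
    struct stat st;
    fstat(fd, &st);

    // Map the whole file; the OS page cache now handles reads and write-back.
    void* base = mmap(nullptr, st.st_size, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);

    // Treat the mapping as an array of records and update it in place.
    int* records = static_cast<int*>(base);
    records[42] = 7;

    msync(base, st.st_size, MS_SYNC);   // force dirty pages to disk
    munmap(base, st.st_size);
    close(fd);
}
```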

Why does the AWS Dynamo SDK not provide a means to store objects larger than 64KB?

I had a use case where I wanted to store objects larger than 64kb in Dynamo DB. It looks like this is relatively easy to accomplish if you implement a kind of "paging" functionality, where you partition the objects into smaller chunks and store them as multiple values for the key.
This got me thinking however. Why did Amazon not implement this in their SDK? Is it somehow a bad idea to store objects bigger than 64kb? If so, what is the "correct" infrastructure to use?
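As a rough sketch of the chunking idea from the question (the names are made up, and the actual PutItem calls through the AWS SDK are omitted), the splitting step might look like this; on reads you would fetch all parts for the object key and reassemble them, so you also need a way to detect an incomplete set, e.g. by storing the chunk count:

```cpp
#include <string>
#include <vector>
#include <cstddef>

struct Chunk {
    std::string key;    // e.g. "myObject#0", "myObject#1", ...
    std::string bytes;
};

// Split a large blob into chunks that each fit under the per-item size cap.
std::vector<Chunk> splitIntoChunks(const std::string& objectKey,
                                   const std::string& blob,
                                   std::size_t maxChunkSize = 60 * 1024) {
    std::vector<Chunk> chunks;
    for (std::size_t offset = 0, part = 0; offset < blob.size();
         offset += maxChunkSize, ++part) {
        chunks.push_back({objectKey + "#" + std::to_string(part),
                          blob.substr(offset, maxChunkSize)});
    }
    return chunks;
}
```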
In my opinion, it's an understandable trade-off DynamoDB made. To be highly available and redundant, they need to replicate data. To get super-low latency, they allowed inconsistent reads. I'm not sure of their internal implementation, but I would guess that the higher this 64KB cap is, the longer your inconsistent reads might be out of date with the actual current state of the item. And in a super low-latency system, milliseconds may matter.
This pushes the problem of an inconsistent Query returning chunk 1 and 2 (but not 3, yet) to the client-side.
As per question comments, if you want to store larger data, I recommend storing in S3 and referring to the S3 location from an attribute on an item in DynamoDB.
For the record, the maximum item size in DynamoDB is now 400K, rather than 64K as it was when the question was asked.
From a design perspective, I think a lot of cases where you would model your problem with >64KB chunks could also be translated to models where you split those chunks into <64KB chunks. And it is most often a better design choice to do so.
E.g. if you store a complex object, you could probably split it into a number of collections, each of which encodes one of the various facets of the object.
This way you probably get better, more predictable performance for large datasets as querying for an object of any size will involve a defined number of API calls with a low, predictable upper bound on latency.
Very often service operations people struggle to get this predictability out of the system so as to guarantee a given latency at the 90/95/99th percentile of the traffic. AWS just chose to build this constraint into the API, as they probably already do for their own website and internal developments.
Also, of course from an (AWS) implementation and tuning perspective, it is quite comfy to assume a 64KB cap as it allows for predictable memory paging in/out, upper bounds on network roundtrips etc.

Is there a multi-index container for hard-disk storage rather than memory?

I need a multi-index container based on red-black trees (something like boost::multi_index::multi_index_container) for the case of hard-disk storage. All data must be stored on the hard disk rather than in memory.
Is there an open-source container for which the described conditions hold?
Note. I use C++.
If you have an in-memory solution, you can use a memory-mapped file and a custom allocator to achieve persistent storage.
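A rough sketch of that combination, assuming Boost.Interprocess and Boost.MultiIndex are available (it follows the pattern of Boost.MultiIndex's interprocess-allocator example; file sizing and error handling are omitted):

```cpp
#include <boost/interprocess/managed_mapped_file.hpp>
#include <boost/interprocess/allocators/allocator.hpp>
#include <boost/multi_index_container.hpp>
#include <boost/multi_index/ordered_index.hpp>
#include <boost/multi_index/member.hpp>

namespace bip = boost::interprocess;
namespace bmi = boost::multi_index;

struct record {
    int id;
    double score;
};

// The container's allocator must allocate inside the mapped file.
using segment_allocator =
    bip::allocator<record, bip::managed_mapped_file::segment_manager>;

using record_set = bmi::multi_index_container<
    record,
    bmi::indexed_by<
        bmi::ordered_unique<bmi::member<record, int, &record::id>>,
        bmi::ordered_non_unique<bmi::member<record, double, &record::score>>
    >,
    segment_allocator
>;

int main() {
    // The container lives inside this file and survives process restarts.
    bip::managed_mapped_file file(bip::open_or_create, "records.bin", 1 << 20);
    record_set* records = file.find_or_construct<record_set>("records")(
        record_set::ctor_args_list(),
        segment_allocator(file.get_segment_manager()));
    records->insert({1, 0.5});
}
```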
I am afraid I don't know any.
For hard-disk storage I can only recommend a look at STXXL, which provides STL containers and algorithms adapted to data that can only fit on disk. They have implemented many things to allow for smoother manipulation, essentially by caching as much as possible in memory and delaying disk access when possible.
Now that won't get you a multi index, but at least you'll have a STL :)
Then, if you are determined, you can port Multi-Index to use the facilities provided by STXXL: they have decoupled the I/O access and memory caching from the containers themselves.
Or you can simply write what you need based on their STL-compliant containers.
How about SQLite? It can use disk as backing store, and supports multiple indexes on data, as Boost Multi Index does.
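For example, a minimal sketch with the SQLite C API, creating one table with two secondary indexes:

```cpp
#include <sqlite3.h>
#include <cstdio>

int main() {
    sqlite3* db = nullptr;
    if (sqlite3_open("multi_index.db", &db) != SQLITE_OK) return 1;

    // One table, indexed on two different columns, backed by the file on disk.
    const char* ddl =
        "CREATE TABLE IF NOT EXISTS item(id INTEGER PRIMARY KEY, name TEXT, price REAL);"
        "CREATE INDEX IF NOT EXISTS idx_item_name  ON item(name);"
        "CREATE INDEX IF NOT EXISTS idx_item_price ON item(price);";
    char* err = nullptr;
    if (sqlite3_exec(db, ddl, nullptr, nullptr, &err) != SQLITE_OK) {
        std::fprintf(stderr, "%s\n", err);
        sqlite3_free(err);
    }
    sqlite3_close(db);
}
```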

Main Memory Database with C++ Interface

I am looking for a main-memory database with a C++ interface. I am looking for a database with a programmatic query interface, preferably one that works with native C++ types. SQLite, for example, takes queries as strings and needs to parse them, which is time consuming. The operations I am looking for are:
Creation of tables of arbitrary dimensions (number of attributes) capable of storing integer types.
Support for insertion, deletion, selection, projection and (not a priority) joins.
The parsing time of SQLite isn't really that much (you can amortize it over many queries) unless you're substituting the values into the SQL query by hand. Substituting by hand is hard work, awkward, slow and probably unsafe too. Instead, you should be using bound parameters so that you can do things more directly (see http://www.sqlite.org/c3ref/bind_blob.html for the relevant API).
Note that if you switch to a different database, you will have the same issue; you only get high speed out of any SQL system by using bound parameters. (And consider not sweating over performance too much; the bits where it hits storage are the bottleneck…)
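A small sketch of the bound-parameter pattern with the SQLite C API: the INSERT statement is parsed once and then executed many times with different values.

```cpp
#include <sqlite3.h>

int main() {
    sqlite3* db = nullptr;
    sqlite3_open(":memory:", &db);
    sqlite3_exec(db, "CREATE TABLE t(a INTEGER, b INTEGER)",
                 nullptr, nullptr, nullptr);

    sqlite3_stmt* stmt = nullptr;
    sqlite3_prepare_v2(db, "INSERT INTO t(a, b) VALUES(?1, ?2)", -1,
                       &stmt, nullptr);

    for (int i = 0; i < 1000; ++i) {   // parse once, execute many times
        sqlite3_bind_int(stmt, 1, i);
        sqlite3_bind_int(stmt, 2, i * i);
        sqlite3_step(stmt);
        sqlite3_reset(stmt);
    }

    sqlite3_finalize(stmt);
    sqlite3_close(db);
}
```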
Try Boost.MultiIndex.
BerkeleyDB (now owned by Oracle) can store data entirely in memory (though it was originally designed for disk storage). TimesTen (now also owned by Oracle) was designed from the beginning for in-memory storage. Both of them support both SQL and an API for direct access from C, C++, etc.

Random-access container that does not fit in memory?

I have an array of objects (say, images), which is too large to fit into memory (e.g. 40GB). But my code needs to be able to randomly access these objects at runtime.
What is the best way to do this?
From my code's point of view, it shouldn't matter, of course, if some of the data is on disk or temporarily stored in memory; it should have transparent access:
container.getObject(1242)->process();
container.getObject(479431)->process();
But how should I implement this container? Should it just send the requests to a database? If so, which one would be the best option? (If a database, then it should be free and not too much administration hassle, maybe Berkeley DB or sqlite?)
Should I just implement it myself, memoizing objects after access and purging the memory when it's full? Or are there good C++ libraries for this out there?
The requirements for the container would be that it minimizes disk access (some elements might be accessed more frequently by my code, so they should be kept in memory) and allows fast access.
UPDATE: It turns out that STXXL does not work for my problem because the objects I store in the container have dynamic size, i.e. my code may update them (increasing or decreasing the size of some objects) at runtime. But STXXL cannot handle that:
STXXL containers assume that the data types they store are plain old data types (POD).
http://algo2.iti.kit.edu/dementiev/stxxl/report/node8.html
Could you please comment on other solutions? What about using a database? And which one?
Consider using the STXXL:
The core of STXXL is an implementation of the C++ standard template library STL for external memory (out-of-core) computations, i.e., STXXL implements containers and algorithms that can process huge volumes of data that only fit on disks. While the compatibility to the STL supports ease of use and compatibility with existing applications, another design priority is high performance.
You could also look into memory-mapped files, and then access the data through one of those.
I would implement a basic cache. With this working-set size you will have the best results with a set-associative cache with x-byte cache lines (x == whatever best matches your access pattern). Just implement in software what every modern processor already has in hardware. This should, imho, give you the best results. You could then optimize it further if you can make the access pattern more or less linear.
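A compact sketch of that idea (the sizes are made up, and there is no thread safety): each set holds a few lines, the least recently used way in a set is evicted on a miss, and the disk read is supplied by the caller:

```cpp
#include <array>
#include <cstdint>
#include <vector>

constexpr std::size_t kSets = 256;   // number of sets
constexpr std::size_t kWays = 4;     // lines per set (associativity)

struct Line {
    std::uint64_t tag = ~0ull;       // which object this line holds
    std::uint64_t lastUse = 0;       // for LRU within the set
    std::vector<char> data;
};

class SetAssociativeCache {
public:
    // Returns the cached bytes for the given object index, calling `load`
    // (a user-supplied disk read) on a miss and evicting the LRU way.
    template <class Loader>
    std::vector<char>& get(std::uint64_t index, Loader load) {
        auto& set = sets_[index % kSets];
        ++clock_;
        Line* victim = &set[0];
        for (auto& line : set) {
            if (line.tag == index) { line.lastUse = clock_; return line.data; }
            if (line.lastUse < victim->lastUse) victim = &line;
        }
        victim->tag = index;
        victim->lastUse = clock_;
        victim->data = load(index);   // e.g. read one object's bytes from disk
        return victim->data;
    }

private:
    std::array<std::array<Line, kWays>, kSets> sets_{};
    std::uint64_t clock_ = 0;
};
```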
One solution is to use a structure similar to a B-Tree, indices and "pages" of arrays or vectors. The concept is that the index is used to determine which page to load into memory to access your variable.
If you make the page size smaller, you can store multiple pages in memory. A caching system based on frequency of use or other rule, will reduce the number of page loads.
I've seen some very clever code that overloads operator[]() to perform disk access on the fly and load required data from disk/database transparently.
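Such a container might look roughly like this (the names are hypothetical, records are assumed to be fixed-size, and the cache here is unbounded, so a real version would need eviction):

```cpp
#include <fstream>
#include <string>
#include <unordered_map>
#include <vector>
#include <cstdint>
#include <stdexcept>

struct Image {
    std::vector<char> bytes;
    void process() { /* ... */ }
};

class DiskBackedContainer {
public:
    DiskBackedContainer(const std::string& path, std::size_t recordSize)
        : file_(path, std::ios::binary), recordSize_(recordSize) {
        if (!file_) throw std::runtime_error("cannot open " + path);
    }

    // Loads the record from disk the first time it is requested, then serves
    // later accesses from the in-memory cache.
    Image& operator[](std::uint64_t index) {
        auto it = cache_.find(index);
        if (it != cache_.end()) return it->second;

        Image img;
        img.bytes.resize(recordSize_);
        file_.seekg(static_cast<std::streamoff>(index * recordSize_));
        file_.read(img.bytes.data(), static_cast<std::streamsize>(recordSize_));
        return cache_.emplace(index, std::move(img)).first->second;
    }

private:
    std::ifstream file_;
    std::size_t recordSize_;
    std::unordered_map<std::uint64_t, Image> cache_;   // evict in real use
};

// Usage roughly matches the question's interface:
//   DiskBackedContainer container("images.bin", 1 << 20);
//   container[1242].process();
//   container[479431].process();
```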