I would like to persist some of my objects in a database (this could be relational (postgresql or MariaDB) or MongoDB). I have found a number of libraries that seem potentially useful, but I am missing the overall picture.
I have used boost::serialization serialize c++ to xml / binary, but it is not clear to me how to get this into the database (do I use the binary or xml format?)?
How do I get this into my mongoDB or postgresql?
You'd serialize to binary, as it is smaller and much faster. Also, the XML format isn't really pretty/easy to use outside of Boost Serialization anyways.
WARNING: Use Boost Portable Archive (EPA http://epa.codeplex.com/) if you need to use the format across different machines.
You'd usually store it in a column
text or CLOB (character large object) by encoding in base64 and putting that in the Database native charset (base64 is safe even for ASCII)
BLOB (binary large object) which doesn't bring the need to encode and could be more efficient storage wise.
Note: if you need to index, store the index properties in normal database columns.
Finally, if you like, I have recently made a streambuffer that allows you to stream data directly into a Sqlite BLOB column. Perhaps you can glean some ideas from this you could use:
How to boost::serialize into a sqlite::blob?
Related
I was thinking about the difference between those two approches.
Imagine you must handle information about pattern calls, which later should be
displayed to the user. A pattern call is a tuple consisting of a unique integer
identifier ("id"), a user defined name (“name"), a project relative path to the so
called pattern file ("patternFile") and a convenience flag, which states whether
the pattern should be called or not called. And the number of tuples are not known before and they won't be modified after initialization.
I thought that in this case a column based approach with big query for example would be better in terms of I/O and performance as well as the evolution of the schema. But actually I can't understand why. I would appreciate any help.
Amazon S3 is like a large key-value store. The Key is the filename (with full path) and the Value is the contents of the file. It's just a blob of data.
A columnar data store organizes data in such a way that specific data can be "jumped to", and only desired values need to be read from disk.
If you are wanting to perform a search on the data, then some form of logic is required on the data. This could be done by storing data in a database (typically a proprietary format) or by using a columnar storage format such as Parquet and ORC plus a query engine that understands this format (eg Amazon Athena).
The difference between S3 and columnar data stores is like the difference between a disk drive and an Oracle database.
I've been experimenting with Apache Arrow. I have used the column oriented memory mapped files for many years. In the past, I've used a separate file for each column. Arrow seems to like to store everything in one file. Is there a way to add a new column without rewriting the entire file?
The short answer is probably no.
Arrow's in-memory format & libraries support this. You can add a chunked array to a table by just creating a new table (this should be zero-copy).
However, it appears you are talking about storing tables in files. None of the common file formats in use (parquet, csv, feather) support partitioning a table in this way.
Keep in mind, if you are reading a parquet file, you can specify which column(s) you want to read and it will only read the necessary data. So if your goal is only to support individual column retrieval/query then you can just build one large table with all your columns.
I have been trying to insert key value pairs in database using leveldb and it works fine with simple strings. However if I want to store multiple attributes for a key or for example use JSON encoding , how can it be done in c++ . In Node.js leveldb package it can be done by specifying the encoding . Really can't figure this out
JSON is just a string, so I'm not completely sure where you're coming from here…
If you have some sort of in-memory representation of JSON you'll need to serialize it before writing it to the database and parse it when reading.
I'm rewriting an application which handles a lot of data (about 100 GB) which is designed as a relational model.
The application is very complex; it is some kind of conversion tool for open street map data of huge sizes (the whole world) and converts it into a map file for our own route planning software. The converter application for example holds the nodes in the open street map with their coordinate and all its tags (a lot of more than that, but this should serve as an example in this question).
Current situation:
Because this data is very huge, I split it into several files: Each file is a map from an ID to an atomic value (let's assume that the list of tags for a node is an atomic value; it is not but the data storage can treat it as such). So for nodes, I have a file holding the node's coords, one holding the node's name and one holding the node's tags, where the nodes are identified by (non-continuous) IDs.
The application once was split into several applications. Each application processes one step of the conversion. Therefore, such an application only needs to handle some of the data stored in the files. For example, not all applications need the node's tags, but a lot of them need the node's coords. This is why I split the relations into files, one file for each "column".
Each processing step can read a whole file at once into a data structure within RAM. This ensures that lookups can be very efficient (if the data structure is a hash map).
I'm currently rewriting the converter. It should now be one single application. And it should now not use separated files for each "column". It should rather use some well-known architecture to hold external data in a relational manner, like a database, but much faster.
=> Which library can provide the following features?
Requirements:
It needs to be very fast in iterating over the existing data (while not modifying the set of rows, but some values in the current row).
It needs to provide constant or near-constant lookup, similar to hash maps (while not modifying the whole relation at all).
Most of the types of the columns are constantly sized, but in general they are not.
It needs to be able to append new rows to a relation in constant or logarithmic time per row. Live-updating some kind of search index will not be required. Updating (rebuilding) the index can happen after a whole processing step is complete.
Some relations are key-value-based, while others are an (continuously indexed) array. Both of them should provide fast lookups.
It should NOT be a separate process, like a DBMS like MySQL would be. The number of queries will be enormous (around 10 billions) and will be totally the bottle neck of the performance. However, caching queries would be a possible workaround: Iterating over a whole table can be done in a single query while writing to a table (from which no data will be read in the same processing step) can happen in a batch query. But still: I guess that serializing, inter-process-transmitting and de-serializing SQL queries will be the bottle neck.
Nice-to-have: easy to use. It would be very nice if the relations can be used in a similar way than the C++ standard and Qt container classes.
Non-requirements (Why I don't need a DBMS):
Synchronizing writing and reading from/to the same relation. The application is split into multiple processing steps; every step has a set of "input relations" it reads from and "output relations" it writes into. However, some steps require to read some columns of a relation while writing in other columns of the same relation.
Joining relations. There are a few cross-references between different relations, however, they can be resolved within my application if lookup is fast enough.
Persistent storage. Once the conversion is done, all the data will not be required anymore.
The key-value-based relations will never be re-keyed; the array-based relations will never be re-indexed.
I can think of several possible solutions depending on lots of factors that you have not quantified in your question.
If you want a simple store to look things up and you have sufficient disk, SQLite is pretty efficient as a database. Note that there is no SQLite server, the 'server' is linked into your application.
Personally this job smacks of being embarrassingly parallel. I would think that a small Hadoop cluster would make quick work of the entire job. You could spin it up in AWS, process your data, and shut it down pretty inexpensively.
We have a situation in our product where for a long time some data has been stored in the application's database as SQL string (choice of MS SQL server or sybase SQL anywhere) which was encrypted via the Windows API function CryptEncrypt. (direct and de-cryptable)
The problem is that CryptEncrypt can produce NULL's in the output, meaning that when it's stored in the database, the string manipulations will at some point truncate the CipherText.
Ideally we'd like to use an algo that will produce CipherText that doesn't contain NULLs as that will cause the least amount of change to the existing databases (changing a column from string to binary and code to deal with binary instead of strings) and just decrypt existing data and re-encrypt with the new algorithm at database upgrade time.
The algorithm doesn't need to be the most secure, as the database is already in a reasonably secure environment (not an open network / the inter-webs) but does need to be better than ROT13 (which I can almost decrypt in my head now!)
edit: btw, any particular reason for changing ciphertext to cyphertext? ciphertext seems more widely used...
Any semi-decent algorithm will end up with a strong chance of generating a NULL value somewhere in the resulting ciphertext.
Why not do something like base-64 encode your resulting binary blob before persisting to the DB? (sample implementation in C++).
Storing a hash is a good idea. However, please definitely read Jeff's You're Probably Storing Passwords Incorrectly.
That's an interesting route OJ.
We're looking at the feasability of a non-reversable method (still making sure we don't explicitly retrieve the data to decrypt) e.g. just store a Hash to compare on a submission
It seems that the developer handling this is going to wrap the existing encryption with yEnc to preserve the table integrity as the data needs to be retrievable, and this save all that messy mucking about with infinite-improbab.... uhhh changing column types on entrenched installations.
Cheers Guys