Container for in-memory representation of a DB table - C++

Let's say I have a (MySQL) database. I want to automate the updating of this database via an application that will:
1. Import from DB
2. Calculate updated data
3. Export back updated data
The timing is important: I don't want to import while calculating; in fact, I don't want any queries running then. I want to import the table(s) as a whole, then calculate. So my question is: if a row is represented by an instance of a class, what container do I put these objects into?
A vector? A set? What about ordered vs. unordered? Should I just use what seems best for my case according to big-O times? Are there any special traps to fall into here? Is this case no different from data "born in memory", so the only things to consider besides size overhead are "do I want lookup or insertion to be faster"?
Probably the best route is to use some ORM, but let's say I don't want to.
I've seen some apps use boost::unordered_set, and I wondered if there is a particular reason for that choice...
I use a JDBC-like interface as the connector (libmysqlcpp).

I do not think the right container can be guessed from so little information. It mainly depends on the data size, the column types, and the algorithm you will run.
But my main concern with such a design is that it will quickly choke your network or your database. If you have a big table you'll:
select all the data from the table
retrieve all the data over the network
process part of the data (some columns?) or all of it on your machine
push the data over the network
update your rows (or erase/replace maybe)
Why not consider working directly on the MySQL server? You could create a user-defined function that works on the data directly, saving the network round trips and taking advantage of the fact that MySQL is built to handle gigantic amounts of data, something an in-memory container is not built for.
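If you do stay with the in-memory approach, here is a minimal sketch of the two obvious candidates, assuming a plain row struct with an integer primary key (all names here are made up for illustration): a std::vector when the calculation walks the whole table sequentially, and an unordered map when it needs lookups by key. As for boost::unordered_set, a common reason for it is simply that it predates std::unordered_set, so code written for pre-C++11 compilers used the Boost version.

#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// One row of the imported table; the fields are placeholders.
struct Row {
    int64_t     id;     // primary key
    std::string name;
    double      value;
};

int main() {
    // Sequential processing of the whole table: contiguous storage,
    // cheapest iteration, insertion at the end is amortized O(1).
    std::vector<Row> rows;
    rows.reserve(100000);                    // reserve if the row count is known

    // Random access by primary key: average O(1) lookup,
    // but more memory per element and slower iteration.
    std::unordered_map<int64_t, Row> byId;
    byId.reserve(100000);

    // Filling either container from the result set looks the same:
    Row r{1, "example", 3.14};
    rows.push_back(r);
    byId.emplace(r.id, r);

    return 0;
}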

Related

What's the cheapest way to store an auto increment indexed list of values in AWS?

I have a web application that uses DynamoDB to store my large JSON objects and perform simple CRUD operations on them via a web API. I would like to add a new table that acts as a categorization of these values. The user should be able to select from a selection box which category the object belongs to. If a suitable category does not exist, the user should be able to create a new category by specifying a name, which will then be available to other objects in the future.
It is critical to the application that each of these categories be given an integer ID, incrementing from 1. These auto-generated numbers will become reproducible serial numbers for back-end reports that will not use the user-visible text name.
So I would like to have a simple API available from the web frontend that allows me to:
A) GET /category : produces { int : string, ... } of all categories mapped to an ID
B) PUSH /category : accepts a string and stores it under the next integer ID
Here are some ideas for how to handle this kind of project.
Store it in DynamoDB with integer indexes. This has some benefits, but it leaves a lot to be desired. Firstly, there's no auto-incrementing ID in DynamoDB, but I could definitely read the state of the table, create a new ID, and store the result. This might have issues with consistency and race conditions, but there's probably a way to achieve it safely. It might, however, be a big anti-pattern to use DynamoDB this way.
Store it in DynamoDB as one object in a table with some arbitrary index; just store the mapping as a JSON object. This really abandons the notion of tables in DynamoDB and uses it as a simple file. It might also run into some issues with race conditions.
Use AWS ElastiCache to get a Redis key-value store. This might be "the right" decision, but the downside is that ElastiCache is an always-on DB offering where you pay per hour. For a low-traffic web site like mine I'd be paying a minimum of about $12/mo, I think, and I would really like for this to be pay-per-access/update due to the low volume. I'm not sure there's an auto-increment feature for Redis built in the way I'd need it, but it's pretty trivial to make a transaction that gets the length of the table, adds one, and stores a new value. Race conditions are easily avoided with this solution.
Use a SQL database like AWS Aurora or MySQL. This has the same upsides as Redis, but it's even more overkill, costs a lot more, and is still always on.
Run my own in-memory web service or MongoDB etc. ... but then you're still paying for constantly running containers. Writing my own thing is obviously silly, and I'm sure there are services that match this use case perfectly, but they'd all require a constantly running container.
Is there a good way to just store a simple list or integer mapping like this that doesn't incur a constant monthly cost? Is there a better way to do this with DynamoDB?
Store the maxCounterValue as an item in DynamoDB.
For the PUSH /category, perform the following:
Get the current maxCounterValue.
TransactWrite:
Put the category name and id into a new item with id = maxCounterValue + 1.
Update maxCounterValue to maxCounterValue + 1, adding a ConditionExpression to check that maxCounterValue = :valueFromGetOperation.
If the TransactWrite fails, start again at step 1 and retry up to X more times.
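For illustration, a rough sketch of that flow with the AWS SDK for C++ (C++ to match the rest of this page). The table names counters and categories, the key attribute pk, and the add_category helper are assumptions rather than anything from the question, and the retry loop around step 1 is left out.

#include <aws/core/Aws.h>
#include <aws/dynamodb/DynamoDBClient.h>
#include <aws/dynamodb/model/AttributeValue.h>
#include <aws/dynamodb/model/GetItemRequest.h>
#include <aws/dynamodb/model/Put.h>
#include <aws/dynamodb/model/TransactWriteItem.h>
#include <aws/dynamodb/model/TransactWriteItemsRequest.h>
#include <aws/dynamodb/model/Update.h>
#include <cstdlib>
#include <string>

using namespace Aws::DynamoDB::Model;

// Assumes Aws::InitAPI() has already been called and a client constructed.
// Returns false if the read or the conditional transaction fails; the caller
// then restarts from the GetItem (step 1), up to X times.
bool add_category(Aws::DynamoDB::DynamoDBClient &client, const std::string &name)
{
    // 1. Get the current maxCounterValue.
    GetItemRequest get;
    get.SetTableName("counters");
    get.AddKey("pk", AttributeValue("category_counter"));
    get.SetConsistentRead(true);
    auto getOutcome = client.GetItem(get);
    if (!getOutcome.IsSuccess())
        return false;
    const Aws::String current = getOutcome.GetResult().GetItem().at("maxCounterValue").GetN();
    const long long nextId = std::strtoll(current.c_str(), nullptr, 10) + 1;
    const Aws::String next(std::to_string(nextId).c_str());

    // 2a. Put the new category item with id = maxCounterValue + 1.
    Put put;
    put.SetTableName("categories");
    put.AddItem("id", AttributeValue().SetN(next));
    put.AddItem("name", AttributeValue(name.c_str()));

    // 2b. Bump the counter, but only if it still equals the value read in step 1.
    Update update;
    update.SetTableName("counters");
    update.AddKey("pk", AttributeValue("category_counter"));
    update.SetUpdateExpression("SET maxCounterValue = :next");
    update.SetConditionExpression("maxCounterValue = :current");
    update.AddExpressionAttributeValues(":next", AttributeValue().SetN(next));
    update.AddExpressionAttributeValues(":current", AttributeValue().SetN(current));

    // Both writes succeed or neither does.
    TransactWriteItemsRequest tx;
    tx.AddTransactItems(TransactWriteItem().WithPut(put));
    tx.AddTransactItems(TransactWriteItem().WithUpdate(update));
    return client.TransactWriteItems(tx).IsSuccess();
}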

Efficiency using triggers inside attached database with SQLite

Situation
I'm using multiple storage databases as attachments to one central "manager" DB.
The storage tables share one pseudo-AUTOINCREMENT index across all storage databases.
I need to iterate over the shared index frequently.
The final number and names of storage tables are not known on storage DB creation.
On some signal, a range of entries (specified at that time) will be deleted.
It is vital that no insertion fails and no entry gets deleted before its signal.
A power outage is possible; data loss in that case is hardly, if ever, tolerable. Any solutions that may cause it (in-memory databases etc.) are not viable.
Database access is currently controlled using strands. This takes care of sequential access.
Due to the high frequency of INSERT transactions, I must trigger WAL checkpoints manually. I've seen journals of up to 2GB in size otherwise.
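(For reference, a manual checkpoint of that kind could look roughly like this with the SQLite C API; the question doesn't show its own checkpoint code, so this is only an assumed sketch. TRUNCATE both checkpoints and resets the -wal file, which is what keeps it from growing to gigabytes under a heavy INSERT load.)

#include <sqlite3.h>

// Returns the sqlite3 result code; wal_frames/checkpointed report progress.
int checkpoint_now(sqlite3 *db) {
    int wal_frames = 0, checkpointed = 0;
    return sqlite3_wal_checkpoint_v2(db, nullptr, SQLITE_CHECKPOINT_TRUNCATE,
                                     &wal_frames, &checkpointed);
}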
Current solution
I'm inserting datasets using parameter binding to a precreated statement.
INSERT INTO datatable VALUES (:idx, ...);
Doing that, I remember the start and end index. Next, I bind them to an insert statement into the registry table:
INSERT INTO regtable VALUES (:idx, datatable);
My query determines the datasets to return like this:
SELECT MIN(rowid), MAX(rowid), tablename
FROM (SELECT rowid,tablename FROM entryreg LIMIT 30000)
GROUP BY tablename;
After that, I query
SELECT * FROM datatable WHERE rowid >= :minid AND rowid <= :maxid;
where I use predefined statements for each datatable and bind both variables to the first query's results.
This is too slow. As soon as I create the registry table, my insertions slow down so much I can't meet benchmark speed.
Possible Solutions
There are several other ways I can imagine it can be done:
Create a view of all indices as a UNION or OUTER JOIN of all table indices. This can't be done persistently on attached databases.
Create triggers for INSERT/REMOVE on table creation that fill a registry table. This can't be done persistently on attached databases.
Create a trigger for CREATE TABLE on database creation that will create the triggers described above. Requires user functions.
Questions
Now, before I go and add user functions (something I've never done before), I'd like some advice on whether this has any chance of solving my performance issues.
Assume I create the databases using a separate connection before attaching them. Can I create views and/or triggers on the database (as the main schema) that will still work later, when I connect to the database via ATTACH?
From what it looks like, an AFTER INSERT trigger will fire for every single inserted row. If it inserts into another table, does that mean I'm increasing my number of transactions from 2 to 1+N? Or is there a mechanism that speeds up triggered interaction? The first case would slow things down horribly.
Is there any chance that a FULL OUTER JOIN (I know I need to build it from other JOIN commands) is faster than filling a registry with insertion transactions every time? We're talking roughly ten transactions per second with an average of 1000 elements each (insert) vs. one query of 30000 rows every two seconds (query).
Open the sqlite3 databases in multi-threaded mode and handle the insert/update/query/delete functions on separate threads. I prefer to transfer the query results into an STL container for processing.
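A minimal sketch of that idea, assuming the datatable layout from the question (the payload column and the fetch_range helper are placeholders): the connection is opened with SQLITE_OPEN_FULLMUTEX so several threads may share it (each role could equally use its own connection), and each query's rows are copied into a std::vector before any processing happens.

#include <sqlite3.h>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

struct Dataset {
    int64_t     rowid;
    std::string payload;   // stands in for the real columns
};

// Runs the range query from the question and copies the rows out,
// so processing happens on plain STL data, outside any database work.
std::vector<Dataset> fetch_range(sqlite3 *db, int64_t minid, int64_t maxid) {
    sqlite3_stmt *stmt = nullptr;
    if (sqlite3_prepare_v2(db,
            "SELECT rowid, payload FROM datatable WHERE rowid >= ?1 AND rowid <= ?2;",
            -1, &stmt, nullptr) != SQLITE_OK)
        throw std::runtime_error(sqlite3_errmsg(db));
    sqlite3_bind_int64(stmt, 1, minid);
    sqlite3_bind_int64(stmt, 2, maxid);

    std::vector<Dataset> out;
    while (sqlite3_step(stmt) == SQLITE_ROW) {
        const unsigned char *text = sqlite3_column_text(stmt, 1);
        out.push_back({sqlite3_column_int64(stmt, 0),
                       text ? reinterpret_cast<const char *>(text) : ""});
    }
    sqlite3_finalize(stmt);
    return out;
}

int main() {
    sqlite3 *db = nullptr;
    // FULLMUTEX = serialized mode: the handle itself is safe to share
    // between the insert, query and delete threads.
    if (sqlite3_open_v2("storage.db", &db,
            SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE | SQLITE_OPEN_FULLMUTEX,
            nullptr) != SQLITE_OK)
        return 1;
    std::vector<Dataset> rows = fetch_range(db, 1, 30000);
    (void)rows;   // hand off to the processing code here
    sqlite3_close(db);
    return 0;
}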

Model / View: Filtering data beforehand in database or at runtime in proxy model?

Imagine an application that displays data from an SQLite database.
The app is making use of model/view programming.
It can have multiple views acting in parallel on different subsets of the same data (subsets made by filtering the required data types).
(Sidenote: I am using Qt, so there is no controller part, of course, but I did not find a more suitable tag.)
I am not sure which approach to take:
1a. Load all database data into one single model
1b. Then apply the model to all views, filtering the data inside the view with a proxy model
2a. One model for each view, but filtering done inside sqlite database.
Pros/Cons:
Idea 1:
(+) one model, makes use of model/view advantages (e.g. updating all connected views)
(-) memory usage could get huge because all data is loaded into a model, but only a subset is shown
Idea 2:
(+) theoretically lower memory usage because only the filtered data is loaded from the database
(-) the views can have filters that lead to intersecting data, meaning the same data would be stored in more than one model -> in practice, perhaps even higher memory usage than in Idea 1
The data being loaded here is just case metadata, e.g. title, description, datetime and so on. Bigger data such as images and files is not loaded here. So although the database could indeed grow big (big for this kind of application, say 200 GB for power users), that does not affect this question, because the metadata is much, much smaller and is proportional to the overall data count, not the data size.
Do you have practical experience with such a configuration, and can you suggest which one to use? It seems to me that Idea 1 is the way to go, but I am not sure about it.
In my experience, the less data is loaded from the database into memory, the better. It is not just the memory usage, but also startup time. If the data is delivered over the network, loading a few gigabytes can take forever.
So I would go for a variant of your second solution, where each table view has its own model. The model is an implementation of QAbstractItemModel that lazily fetches only the rows that currently need to be displayed. The models could, however, share a common cache. This will also make sure that they all display the same data where it intersects.
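A minimal sketch of such a model, assuming a hypothetical cases table with id, title and created_at columns and an already-open QSqlDatabase. Each view would get its own instance (possibly with its own WHERE filter baked into the query), and only the rows a view actually scrolls to are ever loaded; the shared cache mentioned above is left out to keep the sketch short.

#include <QAbstractTableModel>
#include <QSqlDatabase>
#include <QSqlQuery>
#include <QVector>

struct CaseRow {
    qlonglong id;
    QString   title;
    QString   createdAt;
};

class LazyCaseModel : public QAbstractTableModel {
public:
    explicit LazyCaseModel(QSqlDatabase db, QObject *parent = nullptr)
        : QAbstractTableModel(parent), m_db(db) {}

    int rowCount(const QModelIndex &parent = QModelIndex()) const override {
        return parent.isValid() ? 0 : m_rows.size();
    }
    int columnCount(const QModelIndex &parent = QModelIndex()) const override {
        return parent.isValid() ? 0 : 3;
    }
    QVariant data(const QModelIndex &index, int role = Qt::DisplayRole) const override {
        if (!index.isValid() || role != Qt::DisplayRole)
            return QVariant();
        const CaseRow &r = m_rows.at(index.row());
        switch (index.column()) {
        case 0: return r.id;
        case 1: return r.title;
        case 2: return r.createdAt;
        }
        return QVariant();
    }

    // The view calls these while scrolling, so only visible data gets loaded.
    bool canFetchMore(const QModelIndex &parent) const override {
        return !parent.isValid() && !m_exhausted;
    }
    void fetchMore(const QModelIndex &parent) override {
        if (parent.isValid() || m_exhausted)
            return;
        QSqlQuery q(m_db);
        q.prepare("SELECT id, title, created_at FROM cases "
                  "ORDER BY id LIMIT :limit OFFSET :offset");
        q.bindValue(":limit", int(kBatchSize));
        q.bindValue(":offset", int(m_rows.size()));
        q.exec();

        QVector<CaseRow> batch;
        while (q.next())
            batch.append({q.value(0).toLongLong(),
                          q.value(1).toString(),
                          q.value(2).toString()});
        if (batch.isEmpty()) {
            m_exhausted = true;
            return;
        }
        beginInsertRows(QModelIndex(), m_rows.size(),
                        m_rows.size() + batch.size() - 1);
        m_rows += batch;
        endInsertRows();
    }

private:
    static constexpr int kBatchSize = 256;
    QSqlDatabase m_db;
    QVector<CaseRow> m_rows;
    bool m_exhausted = false;
};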

Non-permanent huge external data storage in C++ application

I'm rewriting an application that handles a lot of data (about 100 GB) organized as a relational model.
The application is very complex; it is a kind of conversion tool for OpenStreetMap data of huge size (the whole world) that converts it into a map file for our own route-planning software. The converter, for example, holds the OpenStreetMap nodes with their coordinates and all their tags (it holds a lot more than that, but this should serve as the example for this question).
Current situation:
Because this data is huge, I split it into several files: each file is a map from an ID to an atomic value (let's assume the list of tags for a node is an atomic value; it is not, but the data storage can treat it as such). So for nodes, I have one file holding the nodes' coordinates, one holding their names, and one holding their tags, where the nodes are identified by (non-contiguous) IDs.
The application was once split into several applications, each processing one step of the conversion. Therefore, such an application only needs to handle some of the data stored in the files. For example, not all applications need the nodes' tags, but a lot of them need the nodes' coordinates. This is why I split the relations into files, one file for each "column".
Each processing step can read a whole file at once into a data structure in RAM. This ensures that lookups can be very efficient (if the data structure is a hash map).
I'm currently rewriting the converter. It should now be one single application, and it should no longer use separate files for each "column". It should instead use some well-known architecture to hold external data in a relational manner, like a database, but much faster.
=> Which library can provide the following features?
Requirements:
It needs to be very fast at iterating over the existing data (while not modifying the set of rows, only some values in the current row).
It needs to provide constant or near-constant lookup, similar to hash maps (while not modifying the relation at all).
Most of the column types have constant size, but in general they do not.
It needs to be able to append new rows to a relation in constant or logarithmic time per row. Live-updating any kind of search index will not be required; updating (rebuilding) the index can happen after a whole processing step is complete.
Some relations are key-value-based, while others are a (contiguously indexed) array. Both should provide fast lookups.
It should NOT be a separate process, like a DBMS such as MySQL would be. The number of queries will be enormous (around 10 billion) and would completely dominate performance. However, caching queries would be a possible workaround: iterating over a whole table can be done in a single query, while writing to a table (from which no data will be read in the same processing step) can happen as a batch query. But still, I suspect that serializing, transmitting between processes, and deserializing SQL queries would be the bottleneck.
Nice-to-have: easy to use. It would be very nice if the relations could be used in a similar way to the C++ standard and Qt container classes.
Non-requirements (Why I don't need a DBMS):
Synchronizing writing and reading from/to the same relation. The application is split into multiple processing steps; every step has a set of "input relations" it reads from and "output relations" it writes to. However, some steps require reading some columns of a relation while writing to other columns of the same relation.
Joining relations. There are a few cross-references between different relations; however, they can be resolved within my application if lookups are fast enough.
Persistent storage. Once the conversion is done, none of the data is required anymore.
The key-value-based relations will never be re-keyed; the array-based relations will never be re-indexed.
I can think of several possible solutions depending on lots of factors that you have not quantified in your question.
If you want a simple store to look things up in and you have sufficient disk, SQLite is pretty efficient as a database. Note that there is no SQLite server; the 'server' is linked into your application.
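A small sketch of what that could look like for this workload (the nodes schema and the PRAGMAs are assumptions, not something from the answer): appends go through one reused prepared statement inside a single transaction, lookups go through another prepared statement against the INTEGER PRIMARY KEY (a B-tree, so logarithmic rather than hash-map-constant lookups), and since persistence is not required the journal and sync overhead can be switched off, or the file replaced with ":memory:".

#include <sqlite3.h>
#include <cstdint>

int main() {
    sqlite3 *db = nullptr;
    sqlite3_open("converter.tmp.db", &db);              // or ":memory:" if it fits in RAM
    sqlite3_exec(db, "PRAGMA journal_mode=OFF; PRAGMA synchronous=OFF;",
                 nullptr, nullptr, nullptr);            // no durability needed
    sqlite3_exec(db,
        "CREATE TABLE nodes(id INTEGER PRIMARY KEY, lat REAL, lon REAL, tags BLOB);",
        nullptr, nullptr, nullptr);

    // Appends: one prepared statement, reused, inside a single transaction.
    sqlite3_stmt *ins = nullptr;
    sqlite3_prepare_v2(db, "INSERT INTO nodes VALUES(?1, ?2, ?3, ?4);", -1, &ins, nullptr);
    sqlite3_exec(db, "BEGIN;", nullptr, nullptr, nullptr);
    for (int64_t id = 0; id < 1000; ++id) {             // stand-in for the real input
        sqlite3_bind_int64(ins, 1, id);
        sqlite3_bind_double(ins, 2, 48.1);
        sqlite3_bind_double(ins, 3, 11.5);
        sqlite3_bind_null(ins, 4);
        sqlite3_step(ins);
        sqlite3_reset(ins);
    }
    sqlite3_exec(db, "COMMIT;", nullptr, nullptr, nullptr);
    sqlite3_finalize(ins);

    // Lookups: no SQL parsing per query, just a key probe on the primary key.
    sqlite3_stmt *sel = nullptr;
    sqlite3_prepare_v2(db, "SELECT lat, lon FROM nodes WHERE id = ?1;", -1, &sel, nullptr);
    sqlite3_bind_int64(sel, 1, 42);
    if (sqlite3_step(sel) == SQLITE_ROW) {
        double lat = sqlite3_column_double(sel, 0);
        double lon = sqlite3_column_double(sel, 1);
        (void)lat; (void)lon;                           // processing would go here
    }
    sqlite3_finalize(sel);
    sqlite3_close(db);
    return 0;
}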
Personally this job smacks of being embarrassingly parallel. I would think that a small Hadoop cluster would make quick work of the entire job. You could spin it up in AWS, process your data, and shut it down pretty inexpensively.

Design pattern for caching dynamic user content (in django)

On my website I'm going to award points for certain activities, similar to Stack Overflow. I would like to calculate the value based on many factors, so each computation for each user will take, for instance, 10 SQL queries.
I was thinking about caching it:
in memcache,
in the user's row in the database (so that wherever I need to fetch the user from the database I can easily show the points)
Storing it in the database seems easy, but on the other hand it is redundant information, so I decided to ask, since maybe there is an easier and prettier solution I have missed.
I'd highly recommend this app for storing the calculated values in the model: https://github.com/initcrash/django-denorm
Memcache is faster than the db... but if you already have to retrieve the record from the db anyway, having the calculated values cached in the rows you're retrieving (as a 'denormalised' field) is even faster, plus it's persistent.