I've been tinkering with SQLite3 for the past couple of days, and it seems like a decent database, but I'm wondering about its suitability for serialization.
I need to serialize a set of key/value pairs which are linked to another table, and this is how I've been doing it so far.
First there will be the item table:
CREATE TABLE items (id INTEGER PRIMARY KEY, details);
+----+---------+
| id | details |
+----+---------+
| 1  | 'test'  |
| 2  | 'hello' |
| 3  | 'abc'   |
+----+---------+
Then there will be a table for each item:
CREATE TABLE itemkv## (key TEXT, value); -- where ## is an 'id' field in TABLE items
+-------+---------+
| key   | value   |
+-------+---------+
| 'abc' | 'hello' |
| 'def' | 'world' |
| 'ghi' | 90001   |
+-------+---------+
This was working okay until I noticed that there was a one kilobyte overhead for each table. If I was only dealing with a handful of items, this would be acceptable, but I need a system that can scale.
Admittedly, this is the first time I've ever used anything related to SQL, so perhaps I don't know what a table is supposed to be used for, but I couldn't find any concept of a "sub-table" or "struct" data type. Theoretically, I could convert the key/value pairs into a string like "abc|hello\ndef|world\nghi|90001" and store that in a column, but that seems to defeat the purpose of using a database in the first place: if I'm going to the trouble of converting my structures anyway, they could just as easily be stored in a flat file.
I welcome any suggestions anybody has, including suggestions of a different library better suited to serialization purposes of this type.
You might try PRAGMA page_size = 512; prior to creating the db, or prior to creating the first table, or prior to executing a VACUUM statement. (The manual is a bit contradictory and it also depends on the sqlite3 version.)
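For instance, from Python's sqlite3 module (a minimal sketch; the file name is arbitrary):
import sqlite3

conn = sqlite3.connect("items.db")
# The page size only takes effect if set before the first table is created
# (or before a VACUUM on an existing database), so issue the PRAGMA first.
conn.execute("PRAGMA page_size = 512")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, details)")
conn.commit()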
It's also kind of rare to create tables dynamically at a high rate. It's good that you are normalizing your schema, but it's fine for columns to depend on a primary key, and while repeating groups are a sign of a lower normalization level, it's normal for foreign keys to repeat in a reasonable schema. That is, there's a good chance you need only one table of key/value pairs, with a column that identifies the client instance.
Keep in mind that flat files have allocation unit overhead as well. Watch what happens when I create a one byte file:
$ cat > /tmp/one
$ ls -l /tmp/one
-rw-r--r-- 1 ross ross 1 2009-10-11 13:18 /tmp/one
$ du -h /tmp/one
4.0K /tmp/one
$
According to ls(1) it's one byte, according to du(1) it's 4K.
Don't make a table per item. That's just wrong; it's like writing a class per item in your program. Make one table for all items, or perhaps store the common parts of all items in one table, with other tables referencing it with auxiliary information. Do yourself a favor and read up on database normalization rules.
In general, the tables in your database should be fixed, in the same way that the classes in your C++ program are fixed.
Why not just store a foreign key to the items table?
CREATE TABLE ItemsVK (
    ID     INTEGER PRIMARY KEY,
    ItemID INTEGER REFERENCES items(id),
    Key    TEXT,
    Value  TEXT
);
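A quick sketch of how you'd use it from Python's sqlite3 module, assuming the table above has already been created (the data is the sample from the question):
import sqlite3

conn = sqlite3.connect("items.db")
# One table holds the pairs for every item; ItemID says which item owns each row.
pairs = [("abc", "hello"), ("def", "world"), ("ghi", "90001")]
conn.executemany(
    "INSERT INTO ItemsVK (ItemID, Key, Value) VALUES (?, ?, ?)",
    [(1, k, v) for k, v in pairs])
conn.commit()

# Fetch all pairs for a given item with one query; no per-item table needed.
rows = conn.execute(
    "SELECT Key, Value FROM ItemsVK WHERE ItemID = ?", (1,)).fetchall()
print(dict(rows))  # {'abc': 'hello', 'def': 'world', 'ghi': '90001'}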
If it's just serialization, i.e. one-shot save to disk and then one-shot restore from disk, you could use JSON (there is a list of recommended C++ libraries).
Just serialize a data structure:
[
{'id':1,'details':'test','items':{'abc':'hello','def':'world','ghi':'90001'}},
...
]
If you want to save some bytes (in case that's a bottleneck), you can omit the id, details, and items keys and store a list instead:
[
[1,'test', {'abc':'hello','def':'world','ghi':'90001'}],
...
]
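For example, in Python (the question mentions C++, but the shape of the round trip is the same in any JSON library):
import json

items = [
    {"id": 1, "details": "test",
     "items": {"abc": "hello", "def": "world", "ghi": "90001"}},
]

# One-shot save to disk...
with open("items.json", "w") as f:
    json.dump(items, f)

# ...and one-shot restore.
with open("items.json") as f:
    restored = json.load(f)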
Related
I'm trying to learn DynamoDB for didactic purposes, so I set myself a small project: selling vehicles (cars, bikes, quad bikes, etc.) in order to learn and get some experience with NoSQL databases. I've read a lot of documentation about creating the right models, but I still cannot figure out the best way to store my data.
I want to get all the vehicles by filters like:
get all the cars not older than 3 months.
get all the cars not older than 3 months by brand, year and model.
And the same queries for bikes, quad bikes, etc.
After reading the official documentation and other pages with examples (https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-general-nosql-design.html#bp-general-nosql-design-approach , https://medium.com/swlh/data-modeling-in-aws-dynamodb-dcec6798e955 , Separate tables vs map lists - DynamoDB), which say that the best designs use only one table for storing everything, I ended up with a model like the one below:
-------------------------------------------------------------------------------------
Partition key | Sort key | Specific attributes for each type of vehicle
-------------------------------------------------------------------------------------
cars | date#brand#year#model | {main attributes for the car}
bikes | date#brand#year#model | {main attributes for the bike}
-------------------------------------------------------------------------------------
I've used a composite sort key because the docs say it's a good practice for searching data (https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-sort-keys.html).
But after defining my model I realized it would suffer from a problem called "hot spotting" or "hot key" (https://medium.com/expedia-group-tech/dynamodb-data-modeling-c4b02729ac08, https://dzone.com/articles/partitioning-behavior-of-dynamodb), because the official documentation recommends partition keys with high cardinality to avoid it.
So at this point I'm a little stuck about how to define a good, scalable model. Could you provide some help or examples of how to design a model that supports the queries mentioned above?
Note: I also considered creating a specific table for each vehicle but that would create more problems because to find the information I would need to perform a full table scan.
A few things...
Hot partitions only come into play if you have multiple partitions...
Just because you've got multiple partition (hash) keys doesn't automatically mean DDB will need multiple partitions. You'll also need more than 10 GB of data and/or more than 3,000 RCU or 1,000 WCU being used.
Next, DDB now supports "Adaptive Capacity", so hot partitions aren't as big a deal as they used to be (see "Why what you know about DynamoDB might be outdated").
In connection with the even newer "Instantaneous Adaptive Capacity", you've got DDB on demand.
One final note: you may be under the impression that a given partition (hash) key can only have a maximum of 10 GB of data under it. This is true if your table utilizes Local Secondary Indexes (LSI), but is not true otherwise. Thus, consider using Global Secondary Indexes (GSI). There's extra cost associated with GSIs, so it's a trade-off to consider.
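To make the query side concrete, here is a sketch in boto3 of "all the cars not older than 3 months" against the single-table design from the question. The table name ("vehicles"), the attribute names ("pk", "sk"), and the cutoff date are assumptions, and it relies on the sort key starting with an ISO date so lexicographic order matches chronological order:
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("vehicles")  # hypothetical table name

resp = table.query(
    # Everything in the "cars" partition whose sort-key date is on or after
    # the cutoff; narrowing further by brand/year/model would need a GSI
    # with a different key layout or a FilterExpression.
    KeyConditionExpression=Key("pk").eq("cars") & Key("sk").gte("2021-01-01#"),
)
items = resp["Items"]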
In the past, all my lists were put in the database. I did not know better and it seemed to me like data... so I put them in the database.
For some lists (e.g. countries), it is the right way to do it. But for others, like options that trigger different behaviour in your code, it is not.
For instance, let's say I have a User object and this object has a property called Status. The Status is tightly coupled to behaviour in my code:
Active: you can access the application.
Banned: you cannot access the application and can never reset your account.
Inactive: you did not use your access for X months; you can fill in a form to reactivate your account.
The old me would have created a table in the database called UserStatus with 3 rows in it. The table would have looked like this:
+----+----------+
| Id | Code |
+----+----------+
| 1 | ACTIVE |
| 2 | BAN |
| 3 | INACTIVE |
+----+----------+
Then, the Code column would have been used in my code to "bind" the user status to the right behaviour (and to the right display string according to the language). That said, editing the Code would have broken everything, and adding a new status row in the database would have no effect, since you need to add data in the database AND add code to handle the status. This seems like the wrong way of doing things.
Then I thought about using an enum. Now I don't have the database + code issue. I handle everything from the code, and it makes more sense because, in my opinion, this is not data that should be in a database. But with enums comes this problem: they are ints, but I need to use them as strings.
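(For illustration only: some languages solve exactly this with a string-backed enum. A Python sketch of the idea, since the question's snippets are C#, where you'd need attributes or a mapping to get the same effect:)
from enum import Enum

class Status(str, Enum):
    ACTIVE = "ACTIVE"
    BAN = "BAN"
    INACTIVE = "INACTIVE"

assert Status.ACTIVE == "ACTIVE"   # behaves as its string value
print([s.value for s in Status])   # easy to build a UI list from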
I also thought about setting constants in the User object:
public const string Active = "ACTIVE";
public const string Ban = "BAN";
public const string Inactive = "INACTIVE";
But now that I have figured out the right way to handle it in the code, I must display the list in the UI. It can be done with the enum, but it requires some hacking. Same thing with the constants, as the list needs to be handled manually. Using the database would make that so easy... damn!
In the end, my question becomes: what is the right structure for these "static" lists that are tightly coupled with your code but also need to be displayed?
Update:
Each user has one (and only one) Status and that Status can be changed.
I've contemplated the same question many, many times over the years. Here are some insights I've picked up along the way.
First, some clarification. I assume that each user will have a Status attribute and that the value can change for each user. In other words, the value belongs in the DB and it must be associated with each user.
My favorite approach in this case: store a string in the User table
My favorite approach is to create a Status column on the User table. Since each user can only have one status, it makes sense to just store this detail in the User table. Doing so has the advantage of making queries easier to write. There are two reasons I'd choose to store it as a string:
It makes perusing the DB easier. While debugging and just viewing raw data in the DB, it's easy for developers new to the project to understand what the Status column is at a glance, and it's easy to figure out the Status of any given User row at a glance. Making values easy to understand is, in my opinion, of the utmost importance when making software easy to develop and modify.
I assume that duplicating this string won't become a performance/storage problem. Now, if you have millions and millions (maybe even hundreds of millions) of user rows, then you may want to shrink the size of this column.
And one other thing I like to do is make sure that I use a file of constants (like what you described) and only use those constants when interacting with the DB table (for inserts/updates, that is). That's the perfect use case for constants, and you can create easy little utility functions that ensure the value you are about to insert is in that list of constants.
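A tiny sketch of that guard (names are illustrative, not from the question):
# Single source of truth for the allowed values.
VALID_STATUSES = {"ACTIVE", "BAN", "INACTIVE"}

def validate_status(value):
    # Raise before a bad value ever reaches an INSERT/UPDATE.
    if value not in VALID_STATUSES:
        raise ValueError("unknown status: %r" % value)
    return value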
Caveats that could change my answer, and alternative solutions
You have hundreds of millions of Users.
In this case, you may need to shrink the size of the column to save space. You can either use an enum of some sort in the DB, limit the field to a small varchar type, or use a small int to represent the Status. I'd still put that column on the User record to prevent extra joins in large queries. And if you went the int route, I would handle the translation of that int into a string in the code and in project documentation. Lookup tables that translate 1 to "Active" really aren't any more helpful than a file of constants in your code that does the same thing, in my opinion, and they just add a join to every query you write. Looping through result rows in your code is generally going to be so fast that there is no performance hit either way. This caveat probably isn't even worth mentioning, but I have run into it on rare occasions.
Each user can have multiple statuses. Well, in this case you'd need a join table of some sort. I'd probably make the join table look like this, if possible:
+--------+----------+
| UserId | Code |
+--------+----------+
| 1 | ACTIVE |
| 2 | BAN |
| 3 | INACTIVE |
| 1 | BAN |
+--------+----------+
Now I realize that your use-case doesn't really make sense for a one-to-many relationship like this. BUT, in other use-cases where you're debating where to stick constants like this, you may run into this situation. In these cases, I STILL like to put a string in the DB when possible. Unless I'm really worried about saving the bits and bytes, I'll use a string. And 99% of the time, a column like this isn't the place you're worried about saving space. It just makes perusing the DB and learning the DB so much easier.
Anyway, just some thoughts from my experiences, hope you find them helpful!
How do I choose the primary key in DynamoDB if the item can only be uniquely identified by three or more attributes? Or is this not the correct way to use a NoSQL database?
Generally, if your items are uniquely identified by three or more attributes, you can concatenate the attribute values to form a composite string key that you can use as the hash key in the Dynamo table.
You can duplicate the attributes from the hash key as separate attributes on the item if you need to create indexes on them, or if you need to use them in conditional expressions.
The normal-form rules for relational databases don't necessarily apply to NoSQL databases; in fact, a denormalized schema is usually preferred.
To expand on the concept: when designing relational database schemas, it is typical (and usually desirable) to use normalized form. One of the normal forms dictates that you should not duplicate data that represents the same "thing" in your database.
I'm going to use an example that has just two parts to the key but you can extend it further.
Let's say you're designing a table that contains geographical information for the United States. In the US, a ZIP code consists of 5 digits plus an additional 4 digits that may subdivide the region.
In a relational database you might use the following schema:
Zip | Plus4 | CityName | Population
---------+-----------+---------------+---------------
CHAR(5) | CHAR(4) | NVARCHAR(100) | INTEGER
With a composite primary key of Zip, Plus4
This is perfect because the combination of Zip and Plus4 is guaranteed to be unique and you can answer any query against this table regardless of whether you have both the Zip and the additional Plus4 code, or just the Zip. And you can also get all the Plus4 codes for a Zip code rather easily.
If you wanted to store the same information in Dynamo, you might create a hash key called "ZipPlus4" of type String, which consists of the Zip code concatenated with the Plus4 code (i.e. 60210-4598), and then also store two more attributes on the item: the Zip code by itself and the Plus4 by itself. So an item in your table might have the following attributes:
ZipPlus4 | Zip | Plus4 | CityName | Population
-----------+---------+----------+-------------+---------------
String | String | String | String | Number
The ZipPlus4 above would be the Hash key for the table.
Note that in the example above you could get away with having a hash key of "Zip" and a range key of "Plus4" but as you saw, when you have more than 2 parts you need something different.
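A sketch of writing such an item with boto3; the table name and the CityName/Population values are placeholders, and the Zip/Plus4 pair is the example from above:
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("ZipCodes")  # hypothetical table name

zip_code, plus4 = "60210", "4598"
table.put_item(Item={
    "ZipPlus4": zip_code + "-" + plus4,  # the concatenated hash key
    "Zip": zip_code,      # duplicated so it can back an index or a condition
    "Plus4": plus4,
    "CityName": "Anytown",   # placeholder value
    "Population": 12345,     # placeholder value
})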
In my database I have citations between objects as a ManyToMany field. Basically, every object can cite any other object.
In Postgres, this has created an intermediate table. The table has about 12 million rows, each looks roughly like:
id | source_id | target_id
----+-----------+-----------
81 | 798429 | 767013
80 | 798429 | 102557
Two questions:
What's the most Django-tastic way to select this table?
Is there a way to iterate over this table without pulling the entire thing into memory? I'm not sure Postgres or my server will be pleased if I do a simple select * from TABLE_FOO.
The solution I found to the first question was to grab the through table and then to use values_list to get a flattened result.
So, from my example, this becomes:
through_table = AcademicPaper.papers_cited.through
all_citations = through_table.objects.values_list('source_id', 'target_id')
Doing that runs the very basic SQL that I'd expect:
print all_citations.query
SELECT "source_id", "target_id" FROM my_through_table;
And it returns flattened tuples, which are fairly small and easy to work with. Even with 12M rows in my table, I was actually able to do this and hold it all in memory without the server freaking out too much.
So this solved both my problems, though I think the advice in the comments about cursors looks very sound.
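On the memory question, here is a sketch using QuerySet.iterator() (assuming a reasonably recent Django; on PostgreSQL this streams rows through a server-side cursor instead of loading everything at once):
through = AcademicPaper.papers_cited.through

for source_id, target_id in (
        through.objects.values_list('source_id', 'target_id')
                       .iterator(chunk_size=2000)):
    handle(source_id, target_id)  # handle() is a placeholder for your logic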
Recently I've been dealing with the following problem:
I am trying to build an "archive" for storing and retrieving data from various sources, so the data will always have a different number of columns and rows. I think that allowing the user to create new tables just to store those CSV files (each in a separate file) would be a serious violation of web development guidelines, and would also be difficult to achieve in Django. That's why I came up with the idea of an attribute-value format for storing the data, but I don't know how to implement it in Django.
I want to build a form in the Django admin to allow the user to upload a CSV file with N columns into a table that contains only two columns: 1) the name of the column from the CSV file and 2) the value for that column (more precisely, three value columns: one for integers, one for floats and one for strings). To do that I must of course "melt" the data from the CSV file into a "long" format, so the file:
col1 | col2 | col3
-----+------+-----
23   | 45.0 | 32
becomes:
key  | val
-----+------
col1 | 23
col2 | 45.0
col3 | 32
And that I know how to do. However, I do not know whether it is possible to process a file uploaded by the user into such a format and, later, how to retrieve the data in a simple, Django way.
Do you know of any such extensions/widgets, or how to approach the problem? Or even how to google for it? I have done my research; however, I found only general approaches for dynamic models, and I don't think my case requires using them:
http://en.wikipedia.org/wiki/Entity-attribute-value_model
and here's a dynamic-model approach:
https://pypi.python.org/pypi/django-dynamo - however, I am not sure it's the right answer.
So my guess is that I don't really understand Django that well, but I'd be grateful for some directions.
No, you don't need a dynamic model. And you should avoid EAV (Entity-Attribute-Value) schemas; it's bad design.
Read here for how to process an uploaded file.
See here for how to override the save() instance method. This is probably what you'll need to do.
Also, keep in mind that what you call melting is called serializing. It is helpful to know the right terms and definitions when searching for these topics.
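For completeness, a sketch of the melting step wired to Django's upload handling; the function name is made up, and how you persist each (key, value) pair into your two-column model is up to you:
import csv
import io

def melt_csv(uploaded_file):
    # uploaded_file is a Django UploadedFile (binary), hence the wrapper.
    text = io.TextIOWrapper(uploaded_file.file, encoding="utf-8")
    for row in csv.DictReader(text):        # one dict per CSV row
        for column_name, value in row.items():
            yield column_name, value        # the "long" format rows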