In my database I have citations between objects as a ManyToMany field. Basically, every object can cite any other object.
In Postgres, this has created an intermediate table. The table has about 12 million rows, each looks roughly like:
 id | source_id | target_id
----+-----------+-----------
 81 |    798429 |    767013
 80 |    798429 |    102557
Two questions:
What's the most Django-tastic way to select this table?
Is there a way to iterate over this table without pulling the entire thing into memory? I'm not sure Postgres or my server will be pleased if I do a simple select * from TABLE_FOO.
The solution I found to the first question was to grab the through table and then to use values_list to get a flattened result.
So, from my example, this becomes:
through_table = AcademicPaper.papers_cited.through
all_citations = through_table.objects.values_list('source_id', 'target_id')
Doing that runs the very basic SQL that I'd expect:
print(all_citations.query)
SELECT "source_id", "target_id" FROM my_through_table;
And it returns flat tuples, which are fairly small and very easy to work with. Even with 12M rows in the table, I was actually able to do this and pull it all into memory without the server freaking out too much.
So this solved both my problems, though I think the advice in the comments about cursors looks very sound.
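For the second question, here is a minimal sketch of that cursor idea using QuerySet.iterator(), which on PostgreSQL streams rows through a server-side cursor (chunk_size and the handle_citation() callback are placeholders, not part of the original post):

through_table = AcademicPaper.papers_cited.through

# iterator() skips queryset caching; on PostgreSQL, Django fetches the
# rows through a server-side cursor in chunks, so all 12M rows never
# have to sit in memory at once.
for source_id, target_id in (
    through_table.objects
                 .values_list('source_id', 'target_id')
                 .iterator(chunk_size=2000)
):
    handle_citation(source_id, target_id)  # hypothetical per-row handler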
I have an apparently simple task to perform: I have to convert several table columns from a string to a new entity (an integer FOREIGN KEY) value.
My DB has 10 tables with a column called "app_version", which at the moment is of type VARCHAR. Since I'm about to do a small project refactor, I'd like to convert those VARCHAR columns to a new column containing an ID that represents the newly mapped value, so:
V1 -> ID: 1
V2 -> ID: 2
and so on
I've prepared a Doctrine Migration (I'm using Symfony 3.4) which performs the conversion by DROPPING the old column and adding the new id column for the AppVersion table.
Of course, I need to preserve my existing data.
I know about preUp and postUp, but I can't figure out how to do this without hitting DB performance too hard. I could collect the data via SELECT in preUp, store it in some PHP variables, and use them later inside postUp to write the new values to the DB, but since I have 10 tables with many rows, this becomes a disaster real fast.
Do you have any suggestions for making this smooth and easy?
Please don't ask why I have to do this refactor now and didn't set up the DB correctly the first time. :D
Keywords for ideas: transaction? bulk query? avoiding PHP variable storage? writing an SQL file? Anything could be good.
I feel dumb, but the solution was much simpler: I created a custom migration with all the "ALTER TABLE [table_name] DROP app_version" statements, to be executed AFTER one that simply does:
UPDATE [table_name] SET app_version_id = 1 WHERE app_version = 'V1'
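For anyone wanting to sanity-check the ordering, here is a small self-contained illustration of the same backfill-then-drop pattern (in Python on SQLite rather than Doctrine/MySQL, and with a made-up table name):

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE some_table (id INTEGER PRIMARY KEY, app_version VARCHAR(10));
    INSERT INTO some_table (app_version) VALUES ('V1'), ('V2'), ('V1');

    -- Add the new column and backfill it from the old one...
    ALTER TABLE some_table ADD COLUMN app_version_id INTEGER;
    UPDATE some_table SET app_version_id = 1 WHERE app_version = 'V1';
    UPDATE some_table SET app_version_id = 2 WHERE app_version = 'V2';

    -- ...and only THEN drop the old column (SQLite needs >= 3.35 for this).
    ALTER TABLE some_table DROP COLUMN app_version;
""")
print(con.execute("SELECT id, app_version_id FROM some_table").fetchall())
# [(1, 1), (2, 2), (3, 1)]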
Currently I'm loading data from Google Storage to stage_table_orders using WRITE_APPEND. Since this loads both new and existing orders, the same order can appear with more than one version; the field etl_timestamp tells which row is the most up to date.
Then I WRITE_TRUNCATE my production_table_orders with a query like:
SELECT ...
FROM (
  SELECT *, ROW_NUMBER() OVER
    (PARTITION BY date_purchased, orderid ORDER BY etl_timestamp DESC) AS rn
  FROM `warehouse.stage_table_orders`
)
WHERE rn = 1
Then production_table_orders always contains the most up-to-date version of each order.
This process is supposed to run every 3 minutes.
I'm wondering if this is the best practice.
I have around 20M rows. It doesn't seem smart to WRITE_TRUNCATE 20M rows every 3 minutes.
Any suggestions?
We are doing the same. To help improve performance though, try to partition the table by date_purchased and cluster by orderid.
Use a CTAS statement (onto the table itself), as you cannot add partitioning after the fact.
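In case it helps, here's a sketch of that one-time CTAS driven from Python with the google-cloud-bigquery client (this assumes date_purchased is a TIMESTAMP; you can just as well paste the SQL into the console):

from google.cloud import bigquery

client = bigquery.Client()

# One-time rebuild onto a partitioned + clustered layout; BigQuery
# cannot add partitioning to an existing table in place.
ctas = """
CREATE OR REPLACE TABLE `warehouse.stage_table_orders`
PARTITION BY DATE(date_purchased)
CLUSTER BY orderid
AS SELECT * FROM `warehouse.stage_table_orders`
"""
client.query(ctas).result()  # .result() blocks until the job finishes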
EDIT: use 2 tables and MERGE
Depending on your particular use case, i.e. the number of fields that could change between old and new versions, you could use 2 tables, e.g. stage_table_orders for the imported records and final_table_orders as the destination table, and do a MERGE like so:
MERGE final_table_orders F
USING stage_table_orders S
ON F.orderid = S.orderid AND
   F.date_purchased = S.date_purchased
WHEN MATCHED THEN
  UPDATE SET field_that_change = S.field_that_change
WHEN NOT MATCHED THEN
  INSERT (field1, field2, ...) VALUES (S.field1, S.field2, ...)
Pro: efficient if only a few rows are "upserted" rather than millions (although not tested), and partition pruning should work.
Con: you have to explicitly list the fields in the UPDATE and INSERT clauses. A one-time effort if the schema is pretty much fixed.
There are many ways to de-duplicate and there is no one-size-fits-all. Search on SO for similar requests using ARRAY_AGG, or EXISTS with DELETE, or UNION ALL, ... Try them out and see which performs better for YOUR dataset.
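If you end up scripting the MERGE variant from Python, a rough sketch of one 3-minute cycle might look like this (table names from the thread, with an assumed `warehouse` dataset; the field names are still the placeholders from above):

from google.cloud import bigquery

client = bigquery.Client()

# Note: MERGE fails if one target row matches several source rows, so if
# the stage can hold several versions of one order, de-duplicate it first
# (e.g. with the ROW_NUMBER() query shown earlier in the thread).
merge_sql = """
MERGE `warehouse.final_table_orders` F
USING `warehouse.stage_table_orders` S
ON F.orderid = S.orderid AND F.date_purchased = S.date_purchased
WHEN MATCHED THEN
  UPDATE SET field_that_change = S.field_that_change
WHEN NOT MATCHED THEN
  INSERT (field1, field2) VALUES (S.field1, S.field2)
"""
client.query(merge_sql).result()

# Clear the stage table so the next load starts fresh.
client.query("TRUNCATE TABLE `warehouse.stage_table_orders`").result()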
In the past, all my lists were put in the database. I did not know better, and it seemed like data to me... so I put them in the database.
For some lists (e.g. countries), that is the right way to do it. But for others, like options that trigger different behaviour in your code, it is not.
For instance, let's say I have a User object and this object has a property called Status. The Status is tightly coupled to behaviour in my code:
Active: you can access the application.
Banned: you cannot access the application and can never reset your account.
Inactive: you did not use your access for X months; you can fill in a form to reactivate your account.
The old me would have created a table in the database called UserStatus with 3 rows in it. The table would have looked like this:
+----+----------+
| Id | Code |
+----+----------+
| 1 | ACTIVE |
| 2 | BAN |
| 3 | INACTIVE |
+----+----------+
Then, the Code column would have been used in my code to "bind" the user status to the right behaviour (and to the right 'display string' according to the language). That said, editing the Code would have screwed everything up. And adding a new status in the database would have no effect, as you need to add the data in the database AND add code to handle the status. This seems like the wrong way of doing things.
Then, I thought about using an enum. Now I don't have the database + code issue. I handle everything from the code, and it makes more sense because, in my opinion, this is not data that should be in a database. But with enums comes this problem: they are in fact ints, but I need to use them as strings.
I also thought about setting constants in the User object:
public const string Active = "ACTIVE";
public const string Ban = "BAN";
public const string Inactive = "INACTIVE";
But now that I have figured out the right way to handle it in the code, I must display the list in the UI. It can be done with the enum, but it requires some hacking. Same thing with the constants, as the list needs to be handled manually. Using the database would make that so easy... damn!
In the end, my question becomes: what is the right structure for these "static" lists that are tightly coupled with your code but also need to be displayed?
Update:
Each user has one (and only one) Status and that Status can be changed.
I've contemplated the same question many, many times over the years. Here are some insights I've picked up that may (perhaps) help.
First, some clarification. I assume that each user will have a Status attribute and that the value can change for each user. In other words, the value belongs in the DB and it must be associated with each user.
My favorite approach in this case - store a string in the User table
My favorite approach is to create a Status column on the User table. Since each user can only have 1 status, it makes sense to just store this detail in the User table. Doing so has the advantage of making queries easier to write. The reasons I'd choose to store it as a string are two:
It makes perusing the DB easier. While debugging and just viewing raw data in the DB, it's easy for developers new to the project to understand what the Status column is at a glance, and it's easy to figure out the Status of any given User row at a glance. Making values easy to understand is of the utmost importance, in my opinion, when making software easy to develop/modify.
I assume that duplicating this string won't become a performance/storage problem. Now, if you have millions and millions (maybe even hundreds of millions) of user rows, then you may want to shrink the size of this column.
And one other thing I like to do is to make sure that I use a file of constants (like what you described) and only use those constants when interacting with the DB table (for inserts/updates, that is). That's the perfect use case for constants, and you can create easy little utility functions that ensure the value you are about to insert is in that list of constants.
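As a sketch of that constants-plus-utility idea (shown in Python for brevity, since the question's snippet is C#; all names here are made up):

# The single source of truth for the status strings stored in the DB.
USER_STATUSES = ("ACTIVE", "BAN", "INACTIVE")

def checked_status(value):
    """Refuse to hand an unknown status string to an INSERT/UPDATE."""
    if value not in USER_STATUSES:
        raise ValueError("unknown user status: %r" % (value,))
    return value

# e.g. cursor.execute("UPDATE users SET status = ?", (checked_status("BAN"),))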
Caveats that could change my answer, and alternative solutions
You have hundreds of millions of Users.
In this case, you may need to shrink the size of the column to save space. You can either use an enum in the DB and limit the field to a small varchar type, or use a small int to represent the Status. I'd still put that column on the User record to prevent extra joins in large queries. And if you went the int route, I would handle the translation of that int into a string in the code and in project documentation. Lookup tables that translate 1 to "Active" really aren't any more helpful than a file of constants in your code that does the same thing, in my opinion, and they just add a join to every query that you write. Looping through result rows in your code is generally so fast that there is no performance hit either way. This caveat probably isn't even worth mentioning, but I have run into it on rare occasions.
Each user can have multiple statuses. Well, in this case you'd need a join table of some sort. I'd probably make the join table look like this, if possible:
+--------+----------+
| UserId | Code |
+--------+----------+
| 1 | ACTIVE |
| 2 | BAN |
| 3 | INACTIVE |
| 1 | BAN |
+--------+----------+
Now I realize that your use-case doesn't really make sense for a one-to-many relationship like this. BUT, in other use-cases where you're debating where to stick constants like this, you may run into this situation. In these cases, I STILL like to put a string in the DB when possible. Unless I'm really worried about saving the bits and bytes, I'll use a string. And 99% of the time, a column like this isn't the place you're worried about saving space. It just makes perusing the DB and learning the DB so much easier.
Anyway, just some thoughts from my experiences, hope you find them helpful!
Recently I have been dealing with the following problem:
I am trying to build an "archive" for storing and retrieving data from various sources, so the data will always have a different number of columns and rows. I think that allowing the user to create new tables just to store those CSV files (each in a separate file) would be a serious violation of web development guidelines, and it would also be difficult to achieve in Django. That's why I came up with the idea of an attribute-value format for storing the data, but I don't know how to implement it in Django.
I want to build a form in Django Admin that allows the user to upload a CSV file with N columns into a table that contains only two columns: 1) the name of the column from the CSV file and 2) the value for that column (more precisely: three value columns: one for integers, one for floats, and one for strings). To do that I must of course "melt" the data from the CSV file into a "long" format, so the file:
col1 | col2 | col3
  23 | 45.0 |   32
becomes:
key  | val
col1 | 23
col2 | 45.0
col3 | 32
And that I know how to do. However, I do not know if it is possible to process a file uploaded by the user into such a format and, later, how to retrieve the data in a simple, Django way.
Do you know of any such extensions/widgets, or how to approach the problem? Or even how to google it? I have done my research; however, I found only general approaches for dynamic models, and I don't think my case requires using them:
http://en.wikipedia.org/wiki/Entity-attribute-value_model
and here's a dynamic model approach:
https://pypi.python.org/pypi/django-dynamo - however, I am not sure it's the right answer.
So my guess is that I don't understand Django all that well, but I'd be grateful for some directions.
No, you don't need a dynamic model. And you should avoid EAV (Entity-Attribute-Value) schemas; it's bad design.
Read here for how to process an uploaded file.
See here for how to override the save() instance method. This is probably what you'll need to do.
Also, keep in mind that what you call melting is called serializing. It is helpful to know the right terms and definitions when searching for these topics.
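If you do go the long-format route despite the EAV caveat, a rough sketch of the melt on upload might look like this (the DataPoint model, its fields, and melt_csv() are hypothetical, not an existing API):

import csv
import io

from django.db import models

class DataPoint(models.Model):
    # One row per CSV cell: the column name plus one typed value column.
    key = models.CharField(max_length=100)
    int_value = models.IntegerField(null=True)
    float_value = models.FloatField(null=True)
    str_value = models.TextField(null=True)

def melt_csv(uploaded_file):
    """Reshape an N-column uploaded CSV into 'long' DataPoint rows."""
    text = uploaded_file.read().decode("utf-8")
    points = []
    for row in csv.DictReader(io.StringIO(text)):
        for key, raw in row.items():
            point = DataPoint(key=key)
            try:
                point.int_value = int(raw)
            except ValueError:
                try:
                    point.float_value = float(raw)
                except ValueError:
                    point.str_value = raw
            points.append(point)
    DataPoint.objects.bulk_create(points)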
I've been tinkering with SQLite3 for the past couple days, and it seems like a decent database, but I'm wondering about its uses for serialization.
I need to serialize a set of key/value pairs which are linked to another table, and this is the way I've been doing it so far.
First there will be the item table:
CREATE TABLE items (id INTEGER PRIMARY KEY, details);
| id | details |
+----+---------+
| 1  | 'test'  |
| 2  | 'hello' |
| 3  | 'abc'   |
Then there will be a table for each item:
CREATE TABLE itemkv## (key TEXT, value); -- where ## is an 'id' field in TABLE items
| key   | value   |
+-------+---------+
| 'abc' | 'hello' |
| 'def' | 'world' |
| 'ghi' | 90001   |
This was working okay until I noticed that there was a one kilobyte overhead for each table. If I was only dealing with a handful of items, this would be acceptable, but I need a system that can scale.
Admittedly, this is the first time I've ever used anything related to SQL, so perhaps I don't know what a table is supposed to be used for, but I couldn't find any concept of a "sub-table" or "struct" data type. Theoretically, I could convert the key/value pairs into a string like so, "abc|hello\ndef|world\nghi|90001" and store that in a column, but it makes me wonder if that defeats the purpose of using a database in the first place, if I'm going to the trouble of converting my structures to something that could be as easily stored in a flat file.
I welcome any suggestions anybody has, including suggestions of a different library better suited to serialization purposes of this type.
You might try PRAGMA page_size = 512; prior to creating the db, or prior to creating the first table, or prior to executing a VACUUM statement. (The manual is a bit contradictory and it also depends on the sqlite3 version.)
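A quick illustration of where the pragma has to go, using Python's sqlite3 module (this assumes the database file does not exist yet):

import sqlite3

con = sqlite3.connect("small_pages.db")
con.execute("PRAGMA page_size = 512")  # must run before the first table
con.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, details TEXT)")
con.execute("VACUUM")                  # rebuilds the file at the new size
print(con.execute("PRAGMA page_size").fetchone())  # (512,)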
I think it's also kind of rare to create tables dynamically at a high rate. It's good that you are normalizing your schema, but it's OK for columns to depend on a primary key and, while repeating groups are a sign of lower normalization level, it's normal for foreign keys to repeat in a reasonable schema. That is, I think there is a good possibility that you need only one table of key/value pairs, with a column that identifies client instance.
Keep in mind that flat files have allocation unit overhead as well. Watch what happens when I create a one byte file:
$ cat > /tmp/one
$ ls -l /tmp/one
-rw-r--r-- 1 ross ross 1 2009-10-11 13:18 /tmp/one
$ du -h /tmp/one
4.0K /tmp/one
$
According to ls(1) it's one byte, according to du(1) it's 4K.
Don't make a table per item. That's just wrong. It's similar to writing a class per item in your program. Make one table for all items, or perhaps store the common parts of all items in one table, with other tables referencing it with auxiliary information. Do yourself a favor and read up on database normalization rules.
In general, the tables in your database should be fixed, in the same way that the classes in your C++ program are fixed.
Why not just store a foreign key to the items table?
CREATE TABLE ItemsVK (ID INTEGER PRIMARY KEY, ItemID INTEGER, Key TEXT, Value TEXT);
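A minimal version of that single-table layout, in Python's sqlite3 (the item_pairs() helper is just for illustration):

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY, details TEXT);
    CREATE TABLE itemkv (
        item_id INTEGER REFERENCES items(id),
        key TEXT,
        value TEXT
    );
    INSERT INTO items VALUES (1, 'test');
    INSERT INTO itemkv VALUES (1, 'abc', 'hello'), (1, 'def', 'world');
""")

def item_pairs(item_id):
    """All key/value pairs for one item, from the one shared table."""
    return dict(con.execute(
        "SELECT key, value FROM itemkv WHERE item_id = ?", (item_id,)))

print(item_pairs(1))  # {'abc': 'hello', 'def': 'world'}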
If it's just serialization, i.e. one-shot save to disk and then one-shot restore from disk, you could use JSON (list of recommended C++ libraries).
Just serialize a data structure:
[
{'id':1,'details':'test','items':{'abc':'hello','def':'world','ghi':'90001'}},
...
]
If you want to save some bytes, you can omit the id, details, and items keys and save a list instead (in case that's a bottleneck):
[
[1,'test', {'abc':'hello','def':'world','ghi':'90001'}],
...
]
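And the one-shot save/restore itself is only a few lines in most languages; in Python, say:

import json

items = [
    {'id': 1, 'details': 'test',
     'items': {'abc': 'hello', 'def': 'world', 'ghi': 90001}},
]

with open("archive.json", "w") as f:
    json.dump(items, f)

with open("archive.json") as f:
    assert json.load(f) == items  # round-trips intact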