I am a little confused about which approach is better for a soft delete. There are two ways to implement it:
1. Create a table for deleted records. (We copy each record into the deleted-records table, then delete it from its original table.)
2. Create an extra column called deleted. (We only flip this field to true, and when displaying records we filter on it.)
Also, I want to store the changes made to a record after every update, so I think creating an extra table is more suitable. What is your opinion?
I agree with #web-engineer: adding a nullable column holding the datetime of when the row was soft-deleted is the best approach. I used this resource to do it.
And to answer the second part of your question: yes, an extra table will be needed. There is a third-party app named django-simple-history which handles it for you.
The best option is the second one; in your first example it is not a soft delete if you are deleting the row from the table. A soft delete should modify the data in a minimal way. Leaving the row in place is the whole purpose of a soft delete: it has minimal effect on the data and retains all attributes such as the primary key index value and any internals you can't see that the database might use.
Your first option is far less succinct, as it means duplicating data structures. A common approach is to add a "deleted_at" column (defaulting to NULL); a non-NULL value then positively identifies a deleted record.
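For illustration, a minimal sketch of that approach in PostgreSQL-style SQL (the orders table and its id column are hypothetical):

-- NULL means the row is live.
ALTER TABLE orders ADD COLUMN deleted_at timestamptz DEFAULT NULL;

-- "Deleting" a row just stamps it.
UPDATE orders SET deleted_at = now() WHERE id = 42;

-- Every read filters the flag out.
SELECT * FROM orders WHERE deleted_at IS NULL;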
Workflow
In a data import workflow, we are creating a staging table using a CREATE TABLE ... LIKE statement:
CREATE TABLE abc_staging (LIKE abc INCLUDING DEFAULTS);
Then, we run COPY to import CSV data from S3 into the staging table.
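Roughly like this (the bucket path, IAM role, and CSV options shown here are placeholders rather than our real values):

COPY abc_staging
FROM 's3://example-bucket/abc/2018/07/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS CSV
IGNOREHEADER 1;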
The data in the CSV is incomplete: the fields partition_0, partition_1, and partition_2 are missing from the CSV file, so we fill them in like this:
UPDATE
abc_staging
SET
partition_0 = 'BUZINGA',
partition_1 = '2018',
partition_2 = '07';
Problem
This query is expensive (it often takes ≈20 minutes), and I would like to avoid it. That would be possible if I could configure DEFAULT values for these columns when creating the abc_staging table. I did not find any way to do that, nor any explicit indication that it is impossible. So perhaps it is still possible and I am just missing how to do it?
Alternative solutions I considered
Drop these columns and add them again
That would be easy to do, but ALTER TABLE ... ADD COLUMN only appends columns to the end of the column list. In the abc table they are not at the end of the column list, so the schemas of abc and abc_staging would mismatch. That breaks the ALTER TABLE APPEND operation I use to move data from the staging table to the main table.
Note: reordering the columns in the abc table to alleviate this difficulty would require recreating the huge abc table, which I'd like to avoid.
Generate the staging table creation script programmatically with proper columns and get rid of CREATE TABLE LIKE
I will have to do that if I do not find any better solution.
Fill in the partition_* fields in the original CSV file
That is possible but would break backwards compatibility (I already have perhaps hundreds of thousands of files in there). Harder, but manageable.
As you are finding, you are not really creating a table exactly LIKE the original, and Redshift doesn't let you ALTER a column's default value. Your proposed path is likely the best: define the staging table explicitly.
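A sketch of what that explicit definition could look like (the non-partition columns here are stand-ins; the real list must match abc exactly in name, order, and type for ALTER TABLE APPEND to keep working):

CREATE TABLE abc_staging (
    id          BIGINT,
    payload     VARCHAR(256),
    partition_0 VARCHAR(32) DEFAULT 'BUZINGA',
    partition_1 VARCHAR(8)  DEFAULT '2018',
    partition_2 VARCHAR(8)  DEFAULT '07'
);

If the COPY then names only the columns that are present in the CSV, e.g. COPY abc_staging (id, payload) FROM ..., Redshift fills the omitted partition_* columns from their defaults and the UPDATE is no longer needed.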
Since I don't know your exact situation, other paths might be better, so let me explore a bit. First off, when you UPDATE the staging table you are in fact reading every row, invalidating it, and writing a new row (with the new values) at the end of the table. This leaves a lot of invalidated rows behind, and when you then do ALTER TABLE APPEND all of those invalidated rows are added to your main table, unless you vacuum the staging table beforehand. So you may not be getting the value you want out of ALTER TABLE APPEND.
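If you do keep the UPDATE, a sketch of reclaiming that space before the move (table names are from your question):

VACUUM DELETE ONLY abc_staging;
ALTER TABLE abc APPEND FROM abc_staging;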
You may be better off INSERTing the data into your main table with an ORDER BY clause. This is slower than ALTER TABLE APPEND, but you won't have to do the UPDATE, so the overall process could be faster, and you could come out further ahead because of the reduced need to VACUUM. Your situation will determine whether this is better or not; it's just another option for your list.
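A sketch of that alternative (the select list and sort column are illustrative; the constants come straight from your UPDATE):

INSERT INTO abc
SELECT id, payload, 'BUZINGA', '2018', '07'
FROM abc_staging
ORDER BY id;  -- ideally the main table's sort key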
I am curious about your UPDATE speed. This just needs to read and then write every row in the staging table. Unless the staging table is very large it doesn't seem like this should take 20 min. Other activity could be creating this slowdown. Just curious.
Another option would be to change your main table to have these 3 columns last (yes this would be some work). This way you could add the columns to the staging table and things would line up for ALTER TABLE APPEND. Just another possibility.
The easiest solution turned out to be adding the necessary partition_* fields to the source CSV files.
After making that change and removing the UPDATE from the importer pipeline, performance improved greatly. Imports now take ≈10 minutes each in total (covering COPY, deleting duplicates, and ALTER TABLE APPEND).
Disk space is no longer climbing up to 100%.
Thanks everyone for the help!
A User has an email address and a display name.
Both of these must be unique.
Both of these must be updatable as long as either is not being used already.
A User table will exist with additional non-key attributes and a guid ID.
How should I model this to support an efficient query that checks whether an email address or display name is already in use?
Should I create a table with the guid as the hash key, no range key, and two separate GSIs, one keyed on email and one on display name (each also carrying the user's guid)? Or should these be completely separate tables, or something else entirely?
Thoughts, is there a better way?
Thanks.
There are three ways I can think of to design this:
1. As you mentioned, a table keyed on the guid with two separate GSIs, one for email and one for display name.
2. Since both fields have to be unique, you could make one of them the hash key and create a GSI only for the other. (This runs into a problem because you also need to update email and name: to change the hash key you must delete the old record and add a new one with the same attributes and the updated key.) The advantage is that you pay less, as there is only one GSI compared to #1.
3. Another option is to use CloudSearch. Your DynamoDB table can be integrated with CloudSearch; in this option you simply create a table keyed on the guid, with no GSIs, and whenever you need to check whether a value exists you query CloudSearch. A further advantage of CloudSearch is that you can query on any attribute of the table and apply different filters.
One thing you need to look at is the price difference between #2 and #3; go with whichever is better suited in terms of price and functionality.
If you implement this another way, feel free to share it.
Hope that helps
I have a fairly large production database system, based on a large hierarchy of nodes, each with 10+ associated models. If someone deletes a node fairly high in the tree, thousands of model instances can be deleted, and if that deletion was a mistake, restoring them can be very difficult. I'm looking for a way to give myself an easy 'undo' option.
I've tried using django-reversion, but it seems that in order to get the functionality I want (easily reverting a large cascading delete) it needs to store a lot of information with each revision. When I created the initial revisions, the process was less than 10% done and was already using 8 GB in my database, which is not going to work for me.
So, is there a standard solution for this problem? Or a way to customize django-reversion to fit my use case?
What you're looking for is called a soft delete. Add a column named deleted, defaulting to false, to the table. When you want to "delete" a row, set deleted to true instead. Update all the code not to show rows marked as deleted (or rename the table and replace it with a view that hides them). Change all unique constraints into partial ones filtered with WHERE deleted = false, so you don't run into problems where something can't be added because a row the user can't even see already exists.
As for the cascades, you have two options: either write an ON UPDATE trigger that updates the child rows, or add the deleted column to the foreign key and define it as ON UPDATE CASCADE.
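A minimal sketch of the second option in PostgreSQL, with hypothetical parent/child tables (and including the filtered unique constraint mentioned above):

-- The parent exposes (id, deleted) so children can reference the flag too.
ALTER TABLE parent ADD COLUMN deleted boolean NOT NULL DEFAULT false;
ALTER TABLE parent ADD CONSTRAINT parent_id_deleted_uq UNIQUE (id, deleted);

-- Business-key uniqueness is enforced only among live rows.
CREATE UNIQUE INDEX parent_name_live_uq ON parent (name) WHERE NOT deleted;

-- The child carries the same flag and inherits changes to it.
ALTER TABLE child ADD COLUMN deleted boolean NOT NULL DEFAULT false;
ALTER TABLE child
    ADD CONSTRAINT child_parent_fk
        FOREIGN KEY (parent_id, deleted)
        REFERENCES parent (id, deleted)
        ON UPDATE CASCADE;

-- Soft-deleting a parent now soft-deletes its children as well.
UPDATE parent SET deleted = true WHERE id = 42;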
You'll get the whole undo functionality at the cost of one extra column per table (and of not being able to delete data to reclaim space unless you do it manually).
I am trying to retrieve a single row from a table. This row contains fields that hold foreign keys into another table, which in turn is related to yet another table. I am trying to get just one row returned, yet the problem is that it returns not only the row but ALL the objects related to that table as well. As I have to deal with a fairly large amount of data, the returned object is very cumbersome because it contains all the related data too. In some cases my script simply times out because there is far too much data to grab.
My question is: is there a way to retrieve just a single record without all the associated fluff? I am basically accessing the table via the entityManager from the repository, then trying to get my record using the ->find($id) method.
I am sure this is something stupidly simple but I can't seem to figure this out. Thanks in advance for any help, it is much appreciated.
Doctrine 2 uses "lazy loading", which means that the associated objects are not actually retrieved from the database until you try to access them.
So the find($id) is just fine.
I have a PostgreSQL database with about 150 tables (it's a Django 1.2 project). Django adds ON DELETE NO ACTION and ON UPDATE NO ACTION to foreign keys at the time of table creation.
Now I need to bulk delete data (about 800,000 records) from a bunch of tables based on certain condition.
Using Model.objects.filter().delete() is not an option because the data is huge and it takes a lot of time.
The only sane option seems to be a cascading delete, but since Django has added "ON DELETE NO ACTION", that does not seem to be an option either.
So my question: is there an easy way to change all the foreign keys (there are many of them) to ON DELETE CASCADE, or something similar?
(I am aware that I can manually write the SQL queries for each table, but that would be a monumental and difficult to maintain task.)
https://docs.djangoproject.com/en/dev/ref/models/fields/#django.db.models.ForeignKey.on_delete
As pointed out in the link which comprises Andrew's answer, if you set this to CASCADE in Django, then Django will go and do the deletes "retail". If it is set to NO ACTION you can create a database-level foreign key definition to handle things. That sounds like a reasonable plan to me.
Be sure you have an index defined on the referencing set of columns for every foreign key; otherwise you're going to see very slow performance. Some database products will automatically create such an index when you define a foreign key, but there are situations where that is not advantageous, so PostgreSQL puts the matter in your hands to optimize as you see fit. (Just as one example, it might not be worth the cost of maintaining the index during normal operations, but be worth building it before a purge and dropping it after.)
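A hedged sketch of what that looks like for one child table (the constraint and column names are illustrative; Django's generated names will differ):

-- Replace the Django-generated NO ACTION constraint with a cascading one.
ALTER TABLE child_table DROP CONSTRAINT child_table_parent_id_fkey;
ALTER TABLE child_table
    ADD CONSTRAINT child_table_parent_id_fkey
        FOREIGN KEY (parent_id) REFERENCES parent_table (id)
        ON DELETE CASCADE;

-- Index the referencing column so each cascade doesn't have to scan the whole child table.
CREATE INDEX child_table_parent_id_idx ON child_table (parent_id);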
One note: ON DELETE CASCADE performs miserably on bulk operations. The reason is that this is done as a trigger. Consequently the way it looks from an algorithmic perspective is:
for row in delete_set:
    for dependent row in (scan for referencing rows):
        delete dependent row
If you are deleting 800,000 rows in a parent table, this translates into 800,000 separate delete scans on the dependent tables. Even in the best case, with usable indexes, 800,000 separate index scans will be much slower than one sequential scan.
A better way to do this is to use a writeable common table expression in 9.1 or higher, or to just do separate delete statements in the same transaction. Something like:
WITH rows_to_delete (id) AS (
    SELECT id FROM mytable WHERE where_condition
),
deleted_rows (id) AS (
    DELETE FROM referencing_table
     WHERE mytable_id IN (SELECT id FROM rows_to_delete)
    RETURNING mytable_id
)
DELETE FROM mytable WHERE id IN (SELECT id FROM deleted_rows);
Algorithmically, this reduces to something like:
scan for rows to delete as delete_set
for dependent in (scan for rows dependent on delete_set):
    delete dependent
for to_delete in (scan for rows referenced by deleted dependents):
    delete to_delete
Getting rid of the forced nested loop scan will greatly speed things up.
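For completeness, the other alternative mentioned above, separate DELETE statements in one transaction for pre-9.1 servers, could look roughly like this with the same hypothetical table names:

BEGIN;
DELETE FROM referencing_table
 WHERE mytable_id IN (SELECT id FROM mytable WHERE where_condition);
DELETE FROM mytable WHERE where_condition;
COMMIT;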