Optimizing Database Population with Django

I have a 10GB CSV file (34 million rows) of data, without a column description/header, that needs to be loaded into a Postgres database.
Each row has columns that need to go into different models.
I have the following DB schema:
Currently what I do is:
For each row:
1. Create a B instance with specific columns from the row and append it to array_b.
2. Create a C instance with specific columns from the row and append it to array_c.
3. Create an A instance with specific columns from the row and relations to B and C, and append it to array_a.
Then bulk create in order: B, C and A.
This works perfectly fine; however, it takes 4 hours to populate the DB.
I wanted to optimize the population and came across the psql COPY FROM command.
So I thought I would be able to do something like:
1. Create instance A with specific columns from the file.
2. For the foreign key to C, create instance C with specific columns from the row.
3. For the foreign key to B, create instance B with specific columns from the row.
4. Go to 1.
After some research on how to do that, I found out that COPY does not allow table manipulation while copying data (such as looking up another table to fetch the proper foreign keys to insert).
Can anyone point me to another method or 'hack' for optimizing the data population?
Thanks in advance.
Management command:
import csv

from django.conf import settings

with open(settings.DATA_PATH, 'r') as file:
    csvreader = csv.reader(file)
    next(csvreader)  # skip the first line of the file
    b_array = []
    c_array = []
    a_array = []
    for row in csvreader:
        some_data = row[0]
        some_other_data = row[1]
        # ... more columns extracted from the row
        b = B(
            some_data=some_data
        )
        b_array.append(b)
        c = C(
            some_data=some_data
        )
        c_array.append(c)
        a = A(
            some_data=some_data,
            b=b,
            c=c
        )
        a_array.append(a)
    B.objects.bulk_create(b_array)
    C.objects.bulk_create(c_array)
    A.objects.bulk_create(a_array)
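One pattern worth looking into (not from the original post) is to COPY the whole CSV into an unlogged staging table and then fill B, C and A with plain INSERT ... SELECT statements, so the foreign keys are resolved inside Postgres rather than in Python. A minimal sketch, assuming psycopg2 under Django's default connection and hypothetical table/column names (staging, b_table, c_table, a_table, col_a/col_b/col_c):

from django.db import connection

def load_via_staging(path):
    with connection.cursor() as cur, open(path) as f:
        # Staging table mirrors the CSV layout; UNLOGGED skips WAL for speed.
        cur.execute("CREATE UNLOGGED TABLE staging (col_b text, col_c text, col_a text)")
        # copy_expert is a psycopg2 cursor method, reachable through Django's cursor wrapper.
        cur.copy_expert("COPY staging FROM STDIN WITH (FORMAT csv)", f)
        # Fill the parent tables first, then resolve the FKs with joins.
        cur.execute("INSERT INTO b_table (some_data) SELECT DISTINCT col_b FROM staging")
        cur.execute("INSERT INTO c_table (some_data) SELECT DISTINCT col_c FROM staging")
        cur.execute("""
            INSERT INTO a_table (some_data, b_id, c_id)
            SELECT s.col_a, b.id, c.id
            FROM staging s
            JOIN b_table b ON b.some_data = s.col_b
            JOIN c_table c ON c.some_data = s.col_c
        """)
        cur.execute("DROP TABLE staging")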
References:
Use COPY FROM command in PostgreSQL to insert in multiple tables
Postgres: \copy syntax
Create multiple tables using single .sql script file
https://www.sqlshack.com/different-approaches-to-sql-join-multiple-tables/

Related

Transaction Management with Raw SQL and Models in a single transaction Django 1.11.49

I have an API which reads from two main tables, Table A and Table B.
Table A has a column that acts as a foreign key to Table B entries.
Inside the API flow, I have a method which runs the logic below:
Raw SQL: join Table A with some other tables and fetch the entries that have an active status in Table A.
From the result of that query, take the values from the Table A column and fetch the related rows from Table B using Django models.
It is like
query = "Select * from A where status = 1" #Very simplified query just for example
cursor = db.connection.cursor()
cursor.execute(query)
results = cursor.fetchAll()
list_of_values = get_values_for_table_B(results)
b_records = list(B.objects.filter(values__in=list_of_values))
Now there is a background process which inserts or updates data in Table A and Table B. That process does everything through models, using
with transaction.atomic():
    do_update_entries()
However, the update is not just modifying the old row: it deletes the old row and its related rows in Table B, and then inserts new rows into both tables.
The problem is that if I run the API and the background job separately, everything works, but when both run simultaneously, many API calls get no data back from the second query (Table B), because the statements execute in the following order:
1. The raw SQL for Table A executes and reads the old data.
2. The background job runs in a single transaction, deletes the old data and inserts new data with different foreign key values relating it to Table B.
3. The model query on Table B executes, referring to values already deleted by the previous transaction, hence no records.
So, to do both reads in a single transaction, I tried the following:
with transaction.atomic():
    # Raw SQL for Table A
    # Models query for Table B
This didn't work and I am still getting the same issue.
I tried another way around it:
transaction.set_autocommit(False)
# Raw SQL for Table A
# Models query for Table B
transaction.commit()
transaction.set_autocommit(True)
But this didn't work either. How can I run both queries in a single transaction so that the background job's updates do not affect this read process?
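Note that atomic() by itself may not be enough here: under PostgreSQL's default READ COMMITTED isolation, each statement inside the transaction sees rows committed by other transactions in the meantime. One option to explore is forcing a consistent snapshot for the whole read with REPEATABLE READ. A minimal sketch, assuming PostgreSQL, that this is the outermost transaction, and the same names as in the question:

from django.db import connection, transaction

with transaction.atomic():
    with connection.cursor() as cursor:
        # Must be the first statement of the transaction; every later
        # statement then reads from the same snapshot.
        cursor.execute("SET TRANSACTION ISOLATION LEVEL REPEATABLE READ")
        cursor.execute("SELECT * FROM A WHERE status = 1")
        results = cursor.fetchall()
    list_of_values = get_values_for_table_B(results)
    b_records = list(B.objects.filter(values__in=list_of_values))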

sqlite3 & python: get list of primary and foreign keys

I am very new to SQL and intermediate at Python. Using sqlite3, how can I get a print() list of primary and foreign keys (per table) in my database?
Using Python2.7, SQLite3, PyCharm.
sqlite3.version = 2.6.0
sqlite3.sqlite_version = 3.8.11
Also note: when I set up the database, I enabled FKs as such:
conn = sqlite3.connect(db_file)
conn.execute('pragma foreign_keys=ON')
I tried the following:
conn=sqlite3.connect(db_path)
print(conn.execute("PRAGMA table_info"))
print(conn.execute("PRAGMA foreign_key_list"))
Which returned:
<sqlite3.Cursor object at 0x0000000002FCBDC0>
<sqlite3.Cursor object at 0x0000000002FCBDC0>
I also tried the following, which prints nothing (but I think this may be because it's a dummy database with tables and fields but no records):
conn=sqlite3.connect(db_path)
rows = conn.execute('PRAGMA table_info')
for r in rows:
    print r
rows2 = conn.execute('PRAGMA foreign_key_list')
for r2 in rows2:
    print r2
Unknown or malformed PRAGMA statements are ignored.
The problem with your PRAGMAs is that the table name is missing. You have to get a list of all tables, and then execute those PRAGMAs for each one:
rows = db.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
tables = [row[0] for row in rows]
def sql_identifier(s):
return '"' + s.replace('"', '""') + '"'
for table in tables:
print("table: " + table)
rows = db.execute("PRAGMA table_info({})".format(sql_identifier(table)))
print(rows.fetchall())
rows = db.execute("PRAGMA foreign_key_list({})".format(sql_identifier(table)))
print(rows.fetchall())
SELECT name
FROM sqlite_master
WHERE type = 'table'
  AND name NOT LIKE 'sqlite_%';
This SQL lists all tables in the database; for each table, run PRAGMA table_info(your_table_name); to get that table's primary key.
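Putting the two answers together, a loop like the sketch below prints the primary-key and foreign-key columns per table. This is only a sketch with a hypothetical example.db; it relies on SQLite's documented PRAGMA output, where table_info column 5 is the pk flag and foreign_key_list columns 2-4 are the referenced table, the local column and the referenced column:

import sqlite3

conn = sqlite3.connect("example.db")  # hypothetical database path
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' AND name NOT LIKE 'sqlite_%'")]
for table in tables:
    # table_info rows: (cid, name, type, notnull, dflt_value, pk)
    pk_cols = [row[1] for row in conn.execute('PRAGMA table_info("{}")'.format(table)) if row[5]]
    # foreign_key_list rows: (id, seq, table, from, to, on_update, on_delete, match)
    fks = [(row[3], row[2], row[4]) for row in conn.execute('PRAGMA foreign_key_list("{}")'.format(table))]
    print("{}: primary keys={}, foreign keys={}".format(table, pk_cols, fks))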

How can I update Cassandra table with only primary key and static columns?

I am using Cassandra 3.9 and the DataStax C++ driver 2.6. I have created a table that has only a primary key and static columns. I am able to insert data into the table, but I am not able to update it, and I don't know why. As an example, I created the table t defined here:
Cassandra Table with primary key and static column
Then I successfully inserted data into the table with the following CQL insert command:
"insert into t (k, s, i) VALUES('George', 'Hello', 2);"
Then, "select * from t;" results in the following:
k | i | s
-------+---+-------
George | 2 | Hello
However, if I then try to update the table using the following command:
"UPDATE t set s = "World" where k = "George";"
I get the following error:
SyntaxException: line 1:26 no viable alternative at input 'where' (UPDATE t set s = ["Worl]d" where...)
Does anyone know how to update a table with only static columns and a primary key (i.e. partition key + cluster key)?
Enclose strings in single quotes.
Example:
UPDATE t set s = 'World' where k = 'George';
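As a side note (not part of the original answer), binding values instead of splicing them into the CQL string avoids the quoting problem altogether. A minimal sketch using the DataStax Python driver, with a hypothetical contact point and keyspace (the C++ driver has equivalent statement-binding calls):

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])          # hypothetical contact point
session = cluster.connect("my_keyspace")  # hypothetical keyspace

# The driver quotes/escapes the bound values itself.
session.execute("UPDATE t SET s = %s WHERE k = %s", ("World", "George"))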

Informatica : something like CDC without adding any column in target table

I have a source table named A in Oracle.
Initially, Table A is loaded (copied) into Table B.
Next I run DML on Table A: inserts, deletes and updates.
How do I reflect those changes in Table B, without creating any extra column in the target table?
A timestamp for the row is not available, so I have to compare the rows in source and target.
E.g.: if a row is deleted in the source, it should be deleted in the target; if a row is updated, update it in the target; and if a row from the source is not available in the target, insert it into the target.
Please help!
Take A and B as sources.
Do a full outer join using a Joiner (or, if both tables are in the same database, you can join in the Source Qualifier).
In an Expression, create a flag based on the following scenarios:
- A key fields are null => flag = 'Delete'
- B key fields are null => flag = 'Insert'
- Both A and B key fields are present => compare the non-key fields of A and B; if any of them differ, set the flag to 'Update', else 'No Change'
Now you can send the records to the target (B), applying the appropriate operation with an Update Strategy; the flagging logic is sketched below.
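Outside of Informatica, the same decision can be written in a few lines. The sketch below is only an illustration of the flag logic above; it assumes each table has already been read into a dict keyed by the primary key, with the non-key fields as the values:

def classify_rows(source_a, target_b):
    """Full outer join on the key, then flag each row as in the mapping above."""
    flags = {}
    for key in set(source_a) | set(target_b):
        if key not in source_a:
            flags[key] = 'Delete'      # row exists only in the target
        elif key not in target_b:
            flags[key] = 'Insert'      # row exists only in the source
        elif source_a[key] != target_b[key]:
            flags[key] = 'Update'      # non-key fields differ
        else:
            flags[key] = 'No Change'
    return flags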
If you do not want to retain the operations done in the target table (as no extra column is allowed), the fastest way would simply be:
1) Truncate B
2) Insert A into B

SyncFramework: How to sync all columns from a table?

I created a program to sync tables between 2 databases.
I use this common code:
DbSyncScopeDescription myScope = new DbSyncScopeDescription("myscope");
DbSyncTableDescription tblDesc = SqlSyncDescriptionBuilder.GetDescriptionForTable("Table", onPremiseConn);
myScope.Tables.Add(tblDesc);
My program creates the tracking table with only the primary key (the id column).
Syncing works fine for deleting and inserting rows.
But updating doesn't: I need all the columns to be updated, and they are not (for example, a telephone column).
I read that I need to add the columns I want to sync MANUALLY with this code:
Collection<string> includeColumns = new Collection<string>();
includeColumns.Add("telephone");
...
includeColumns.Add(Last column);
And changing the table description this way:
DbSyncTableDescription tblDesc = SqlSyncDescriptionBuilder.GetDescriptionForTable("Table", includeColumns, onPremiseConn);
Is there a way to add all the columns of the table automatically?
Something like:
Collection<string> includeColumns = GetAllColums("Table");
Thanks,
SqlSyncDescriptionBuilder.GetDescriptionForTable("Table", onPremiseConn) will already include all the columns of the table.
The tracking table only stores the PK, the filter columns and some Sync Framework-specific columns.
Tracking is at row level, not column level.
During sync, the tracking table and its base table are joined to get the rows to be synced.