Large number of columns in Django - django

I have performance data for a model execution instance that I am trying to store in Django (sqlite backend). There are 435 timers with 11 different attributes per timer resulting in 4,785 unique values that need to be stored per run. I wrote code that generated the models.py code. It gets 1,564 columns in and throws the below error. There are no duplicate columns of that name as I have checked the models.py file. When I switch that line with the next in the models file (and make a new migration) it dies at the same line number but new column name.
First time through:
django.db.utils.OperationalError: duplicate column name: CAM_export_processes
Second time:
django.db.utils.OperationalError: duplicate column name: CAM_export_threads
What limitation am I hitting here?

The default setting for SQLITE_MAX_COLUMN is 2000.
This comes from the SQLite documentation. If you are storing over 1000 columns in a table you are most definitely doing something wrong (tm).
While your math is probably sound, there is a chance django is throwing in some extra columns for its own metadata which is pushing you closer to (or over) the limit.
Seriously reconsider your approach for storing this data.
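One way to reconsider the layout is to store one row per (run, timer, attribute) instead of one column per value. The sketch below is only an illustration using Python's built-in sqlite3 module; the table names (run, timer_value), the column names, and the store_run helper are assumptions made for the example, not anything from the question.

```python
import sqlite3

# In-memory database purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE run (id INTEGER PRIMARY KEY, label TEXT NOT NULL);
    CREATE TABLE timer_value (
        run_id    INTEGER NOT NULL REFERENCES run(id),
        timer     TEXT    NOT NULL,   -- e.g. 'CAM_export'
        attribute TEXT    NOT NULL,   -- e.g. 'processes'
        value     REAL,
        PRIMARY KEY (run_id, timer, attribute)
    );
    """
)


def store_run(conn, label, timers):
    """Insert one run; `timers` maps timer name -> {attribute: value}."""
    cur = conn.execute("INSERT INTO run (label) VALUES (?)", (label,))
    run_id = cur.lastrowid
    # Flatten the nested mapping into one row per (timer, attribute) pair.
    rows = [
        (run_id, timer, attribute, value)
        for timer, attributes in timers.items()
        for attribute, value in attributes.items()
    ]
    conn.executemany(
        "INSERT INTO timer_value (run_id, timer, attribute, value) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    return run_id


run_id = store_run(conn, "example run", {"CAM_export": {"processes": 4, "threads": 2}})
```

With this shape a run with 435 timers and 11 attributes becomes 4,785 rows in a four-column table, far below any column limit. A middle ground would be one row per (run, timer) with the 11 attributes as columns.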

Related

How to check that data is deleted or not in PostgreSQL

Yesterday we added a simple form to our site, and we implemented an API View using Django, which is connected to a PostgreSQL database.
Today I queried the database to see how many rows had been submitted, and I found something strange in the results. We created and migrated our model using the Django ORM, so the primary key is defined as an auto-increment integer field. The problem is that the row ids are not contiguous and vary widely: as I write this question, the max id value is 252, but we have only 72 records in the table.
I've seen this before in other tables, but those tables were subject to delete and update queries, whereas we only insert into this new table. My question is: has our data been deleted, or is this normal behavior in PostgreSQL?
I've searched on Google and it seems that the only way is to check the WAL logs, but we have not enabled that for our database yet. Is there another way to check whether the data is consistent?
Thanks.
Expect holes in a sequence
If you have multiple connections to a database that are adding rows, then you should expect to see holes in the sequence number results.
If Alice is adding a row, she may bump the sequence from 10 to 11 while not yet doing a COMMIT. Meanwhile, Bob adds a record, bumping the sequence to 12, and assigning 12 to his row, which he now commits. So the database has stored rows with ID field values of 10 and 12, but not 11.
If Alice commits, then 11 will appear in a query.
If Alice does a ROLLBACK, then 11 will never appear in a query.
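The Alice and Bob scenario can be replayed with a toy model. This is plain Python, not PostgreSQL; the Sequence class and the simulate function are hypothetical names made up for the illustration. The only property it models is that a sequence value, once handed out, is never handed back, even if the transaction that took it rolls back.

```python
class Sequence:
    """Toy stand-in for a database sequence: values are never returned."""

    def __init__(self, start):
        self.last = start

    def nextval(self):
        self.last += 1
        return self.last


def simulate(actions, start=10):
    """Replay (who, action) pairs; return (committed ids, last sequence value)."""
    seq = Sequence(start)
    pending = {}          # who -> id taken by their open transaction
    committed = [start]   # assume a row with id `start` already exists
    for who, action in actions:
        if action == "insert":
            pending[who] = seq.nextval()
        elif action == "commit":
            committed.append(pending.pop(who))
        elif action == "rollback":
            pending.pop(who)  # the id is discarded, the sequence is not rewound
    return sorted(committed), seq.last


with_rollback = simulate(
    [("alice", "insert"), ("bob", "insert"), ("bob", "commit"), ("alice", "rollback")]
)
with_commit = simulate(
    [("alice", "insert"), ("bob", "insert"), ("bob", "commit"), ("alice", "commit")]
)
```

In the rollback case the stored ids are 10 and 12 while the sequence sits at 12, so the next insert gets 13 and the hole at 11 stays.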

Does searching by id depend on the number of columns in Postgres?

I have the following query: MyModel.objects.filter(id__in=ids).
I noticed that increasing the number of columns in the table slows down the above query.
Why is that?
Query time in Postgres mostly consists of planning time, execution time and data fetch.
Planning time and execution time should not be affected by the number of columns in the table, but the data fetch phase definitely is, as you are returning more data.
An additional step is the mapping of the returned data into a Django QuerySet, which takes more time if more columns are involved.
To limit the scope of the data returned, where applicable, you can use values(), defer(), or only().
In some complex data-modeling situations, your models might contain a lot of fields, some of which could contain a lot of data (for example, text fields), or require expensive processing to convert them to Python objects. If you are using the results of a queryset in some situation where you don’t know if you need those particular fields when you initially fetch the data, you can tell Django not to retrieve them from the database.

Performance optimization on Django update or create

In a Django project, I'm refreshing tens of thousands of lines of data from an external API on a daily basis. The problem is that since I don't know if the data is new or just an update, I can't do a bulk_create operation.
Note: Some, or perhaps many, of the rows do not actually change on a daily basis, but I don't know which, or how many, ahead of time.
So for now I do:
for row in csv_data:
    try:
        MyModel.objects.update_or_create(id=row['id'], defaults={'field1': row['value1']....})
    except Exception:
        print('error!')
And it takes.... forever! One or two lines a second, max speed, sometimes several seconds per line. Each model I'm refreshing has one or more other models connected to it through a foreign key, so I can't just delete them all and reinsert every day. I can't wrap my head around this one -- how can I cut down significantly the number of database operations so the refresh doesn't take hours and hours.
Thanks for any help.
The problem is that you are doing a database action for each data row you grabbed from the API. You can avoid that by working out which rows are new (and bulk inserting all of them), which rows actually need an update, and which didn't change.
To elaborate:
Grab all the relevant rows from the database (meaning all the rows that can possibly be updated):
old_data = MyModel.objects.all()  # if possible, narrow this with MyModel.objects.filter(...)
Grab all the API data you need to insert or update:
api_data = [...]
For each row of data, work out whether it is new and put it in an array, or determine whether the row needs to update the DB:
new_rows_array = []
for row in api_data:
    if is_new_row(row, old_data):
        new_rows_array.append(row)
    else:
        if is_data_modified(row, old_data):
            ...
            # do the update
        else:
            continue
# bulk_create expects unsaved MyModel instances, so build them from the rows first if needed
MyModel.objects.bulk_create(new_rows_array)
is_new_row - decides whether the row is new, so that it can be added to the array that will be bulk created
is_data_modified - looks the row up in the old data and reports whether its data has changed, so that only changed rows are updated
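The classification step above can be sketched without any database at all. The function below is a minimal illustration, assuming the existing rows have been loaded into a dict keyed by primary key; the name partition_rows and the sample data are hypothetical, and it plays the role of is_new_row and is_data_modified combined.

```python
def partition_rows(api_rows, existing, key="id"):
    """Split incoming rows into (new, changed, unchanged).

    `existing` maps primary key -> dict of the field values currently stored.
    """
    new, changed, unchanged = [], [], []
    for row in api_rows:
        stored = existing.get(row[key])
        if stored is None:
            # No stored row with this key: candidate for a bulk insert.
            new.append(row)
        elif any(
            stored.get(field) != value
            for field, value in row.items()
            if field != key
        ):
            # At least one incoming field differs from what is stored.
            changed.append(row)
        else:
            unchanged.append(row)
    return new, changed, unchanged


existing = {1: {"field1": "a"}, 2: {"field1": "b"}}
api_rows = [
    {"id": 1, "field1": "a"},   # unchanged
    {"id": 2, "field1": "B"},   # changed
    {"id": 3, "field1": "c"},   # new
]
new, changed, unchanged = partition_rows(api_rows, existing)
```

Because the lookup is a dict access, the whole pass is linear in the number of incoming rows and issues no queries. In a Django project the dict could plausibly be built from a single query (for example with values() or in_bulk()), with only the new and changed lists written back.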
If you look at the source code for update_or_create(), you'll see that it's hitting the database multiple times for each call (either a get() followed by a save(), or a get() followed by a create()). It does things this way to maximize internal consistency - for example, this ensures that your model's save() method is called in either case.
But you might well be able to do better, depending on your specific models and the nature of your data. For example, if you don't have a custom save() method, aren't relying on signals, and know that most of your incoming data maps to existing rows, you could instead try an update() followed by a bulk_create() if the row doesn't exist. Leaving aside related models, that would result in one query in most cases, and two queries at the most. Something like:
updated = MyModel.objects.filter(field1="stuff").update(field2="other")
if not updated:
    MyModel.objects.bulk_create([MyModel(field1="stuff", field2="other")])
(Note that this simplified example has a race condition, see the Django source for how to deal with it.)
In the future there will probably be support for PostgreSQL's UPSERT functionality, but of course that won't help you now.
Finally, as mentioned in the comment above, the slowness might just be a function of your database structure and not anything Django-specific.
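To show what an UPSERT looks like at the SQL level, here is a small sketch using Python's built-in sqlite3 module rather than the Django ORM. It assumes the bundled SQLite is version 3.24 or newer, which is when INSERT ... ON CONFLICT ... DO UPDATE became available; the table name mymodel and its columns are made up for the example. PostgreSQL (9.5 and later) accepts the same ON CONFLICT form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mymodel (id INTEGER PRIMARY KEY, field1 TEXT)")
conn.execute("INSERT INTO mymodel (id, field1) VALUES (1, 'old')")

# id 1 already exists and is updated; id 2 does not exist and is inserted.
rows = [(1, "new"), (2, "created")]
conn.executemany(
    """
    INSERT INTO mymodel (id, field1) VALUES (?, ?)
    ON CONFLICT(id) DO UPDATE SET field1 = excluded.field1
    """,
    rows,
)
conn.commit()

result = conn.execute("SELECT id, field1 FROM mymodel ORDER BY id").fetchall()
```

Each row costs one statement regardless of whether it already exists, and the database resolves the conflict atomically, which sidesteps the race condition noted above.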
Just to add to the accepted answer. One way of recognizing whether the operation is an update or create is to ask the api owner to include a last updated timestamp with each row (if possible) and store it in your db for each row. That way you only have to check for those rows where this timestamp is different from the one in api.
I faced an exact issue where I was updating every existing row and creating new ones. It took a whole minute to update 8000 odd rows. With selective updates, I cut down my time to just 10-15 seconds depending on how many rows have actually changed.
I think the code below can do the same thing in place of update_or_create:
MyModel.objects.filter(...).update()
MyModel.objects.get_or_create()

Adding data to database in django issue

I have code like this:
form = TestForm(request.POST)
form.save(commit=False).save()
This code sometimes works and sometimes doesn't. The problem is in the auto-increment id.
When I have some data in the db that was not written by Django and I want to add data from Django, I get IntegrityError: id already exists.
If I have 2 rows in the db (not added by Django), I need to click "add data" 3 times. After the third time, when the id has incremented to 3, all is OK.
How to solve this?
These integrity errors appear when your table sequence is not updated after a new item is created, or when the sequence is out of sync with reality. For example, you import items from some source, and the items also contain an id that is higher than your table's sequence indicates. I have not seen a case where Django messes sequences up.
So my guess is that the other source that inserts data into your database also inserts ids, and the sequence is not updated. Fix that and your problems should disappear.

Accelerate bulk insert using Django's ORM?

I'm planning to upload a billion records taken from ~750 files (each ~250MB) to a db using django's ORM.
Currently each file takes ~20min to process, and I was wondering if there's any way to accelerate this process.
I've taken the following measures:
Use @transaction.commit_manually and commit once every 5000 records
Set DEBUG=False so that django won't accumulate all the sql commands in memory
The loop that runs over records in a single file is completely contained in a single function (minimize stack changes)
Refrained from hitting the db for queries (used a local hash of objects already in the db instead of using get_or_create)
Set force_insert=True in the save() in hopes it will save django some logic
Explicitly set the id in hopes it will save django some logic
General code minimization and optimization
What else can I do to speed things up? Here are some of my thoughts:
Use some kind of Python compiler or version which is quicker (Psyco?)
Override the ORM and use SQL directly
Use some 3rd party code that might be better (1, 2)
Beg the django community to create a bulk_insert function
Any pointers regarding these items or any other idea would be welcome :)
Django 1.4 provides a bulk_create() method on the QuerySet object, see:
https://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.bulk_create
https://docs.djangoproject.com/en/dev/releases/1.4/
https://code.djangoproject.com/ticket/7596
This is not specific to Django ORM, but recently I had to bulk insert >60 Million rows of 8 columns of data from over 2000 files into a sqlite3 database. And I learned that the following three things reduced the insert time from over 48 hours to ~1 hour:
increase the cache size setting of your DB to use more RAM (the defaults are always very small; I used 3GB); in sqlite, this is done by PRAGMA cache_size = n_of_pages;
do journalling in RAM instead of on disk (this does cause a slight problem if the system fails, but one I consider negligible given that you already have the source data on disk); in sqlite this is done by PRAGMA journal_mode = MEMORY
last and perhaps most important: do not build indexes while inserting. This also means not declaring UNIQUE or other constraints that might cause the DB to build an index. Build indexes only after you are done inserting.
As someone mentioned previously, you should also use cursor.executemany() (or just the shortcut conn.executemany()). To use it, do:
cursor.executemany('INSERT INTO mytable (field1, field2, field3) VALUES (?, ?, ?)', iterable_data)
The iterable_data could be a list or something alike, or even an open file reader.
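Putting the three tips and executemany() together, a minimal sketch with Python's built-in sqlite3 module might look like the following. The function name bulk_load, the cache_pages default, the column types of mytable and the index name are all assumptions chosen for illustration; suitable cache sizes depend on your machine.

```python
import sqlite3


def bulk_load(db_path, rows, cache_pages=100_000):
    """Load `rows` (an iterable of 3-tuples) into mytable, indexing afterwards."""
    conn = sqlite3.connect(db_path)
    # Tip 1: a larger page cache (the value is a number of pages).
    conn.execute(f"PRAGMA cache_size = {int(cache_pages)}")
    # Tip 2: keep the rollback journal in RAM instead of on disk.
    conn.execute("PRAGMA journal_mode = MEMORY")
    # Tip 3: create the table with no UNIQUE constraints or indexes yet.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS mytable "
        "(field1 INTEGER, field2 TEXT, field3 REAL)"
    )
    # One transaction for the whole batch; executemany accepts any iterable.
    with conn:
        conn.executemany(
            "INSERT INTO mytable (field1, field2, field3) VALUES (?, ?, ?)",
            rows,
        )
    # Build the index only once all rows are in place.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_mytable_field1 ON mytable (field1)")
    conn.commit()
    return conn


conn = bulk_load(":memory:", ((i, str(i), i * 0.5) for i in range(1000)))
row_count = conn.execute("SELECT COUNT(*) FROM mytable").fetchone()[0]
```

Passing a generator keeps memory use flat, since rows are consumed as they are inserted rather than collected into a list first.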
Drop to DB-API and use cursor.executemany(). See PEP 249 for details.
I ran some tests on Django 1.10 / Postgresql 9.4 / Pandas 0.19.0 and got the following timings:
Insert 3000 rows individually and get ids from populated objects using Django ORM: 3200ms
Insert 3000 rows with Pandas DataFrame.to_sql() and don't get IDs: 774ms
Insert 3000 rows with Django manager .bulk_create(Model(**df.to_records())) and don't get IDs: 574ms
Insert 3000 rows with to_csv to StringIO buffer and COPY (cur.copy_from()) and don't get IDs: 118ms
Insert 3000 rows with to_csv and COPY and get IDs via simple SELECT WHERE ID > [max ID before insert] (probably not threadsafe unless COPY holds a lock on the table preventing simultaneous inserts?): 201ms
from io import StringIO
def bulk_to_sql(df, columns, model_cls):
    """ Inserting 3000 takes 774ms avg """
    # ExcelImportProcessor is the author's own class; it returns a SQLAlchemy engine
    engine = ExcelImportProcessor._get_sqlalchemy_engine()
    df[columns].to_sql(model_cls._meta.db_table, con=engine, if_exists='append', index=False)
def bulk_via_csv(df, columns, model_cls):
    """ Inserting 3000 takes 118ms avg """
    engine = ExcelImportProcessor._get_sqlalchemy_engine()
    connection = engine.raw_connection()
    cur = connection.cursor()
    # Serialize the frame as tab-separated text in memory, then rewind for reading
    output = StringIO()
    df[columns].to_csv(output, sep='\t', header=False, index=False)
    output.seek(0)
    # copy_from streams the buffer into the table using COPY
    cur.copy_from(output, model_cls._meta.db_table, null="", columns=columns)
    connection.commit()
    cur.close()
The performance stats were all obtained on a table already containing 3,000 rows running on OS X (i7 SSD 16GB), average of ten runs using timeit.
I get my inserted primary keys back by assigning an import batch id and sorting by primary key, although I'm not 100% certain primary keys will always be assigned in the order the rows are serialized for the COPY command - would appreciate opinions either way.
Update 2020:
I tested the new to_sql(method="multi") functionality in Pandas >= 0.24, which puts all inserts into a single, multi-row insert statement. Surprisingly performance was worse than the single-row version, whether for Pandas versions 0.23, 0.24 or 1.1. Pandas single row inserts were also faster than a multi-row insert statement issued directly to the database. I am using more complex data in a bigger database this time, but to_csv and cursor.copy_from was still around 38% faster than the fastest alternative, which was a single-row df.to_sql, and bulk_import was occasionally comparable, but often slower still (up to double the time, Django 2.2).
There is also a bulk insert snippet at http://djangosnippets.org/snippets/446/.
This gives one insert command multiple value pairs (INSERT INTO x (val1, val2) VALUES (1,2), (3,4) --etc etc). This should greatly improve performance.
It also appears to be heavily documented, which is always a plus.
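For reference, the multi-row INSERT technique itself can be sketched in a few lines. This is not the code from the linked snippet; it is a minimal illustration using Python's built-in sqlite3 module, and the function name insert_multi, the chunk_size default and the table x are assumptions. The table and column names are interpolated into the SQL, so they must come from trusted code, never from user input.

```python
import sqlite3


def insert_multi(conn, table, columns, rows, chunk_size=100):
    """Insert rows using one multi-row INSERT statement per chunk."""
    rows = list(rows)
    # Placeholder group for a single row, e.g. "(?, ?)".
    one_row = "(" + ", ".join("?" for _ in columns) + ")"
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]
        sql = "INSERT INTO {} ({}) VALUES {}".format(
            table,
            ", ".join(columns),
            ", ".join([one_row] * len(chunk)),
        )
        # Flatten the chunk so the values line up with the placeholders.
        params = [value for row in chunk for value in row]
        conn.execute(sql, params)
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE x (val1 INTEGER, val2 INTEGER)")
insert_multi(conn, "x", ["val1", "val2"], [(i, i * 2) for i in range(250)])
total = conn.execute("SELECT COUNT(*) FROM x").fetchone()[0]
```

Chunking keeps each statement under the database's limit on bound parameters. Whether this beats executemany() or single-row inserts depends on the driver and database, as the timings reported above suggest, so it is worth measuring on your own data.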
Also, if you want something quick and simple, you could try this: http://djangosnippets.org/snippets/2362/. It's a simple manager I used on a project.
The other snippet wasn't as simple and was really focused on bulk inserts for relationships. This is just a plain bulk insert and just uses the same INSERT query.
Development django got bulk_create: https://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.bulk_create