How to store a very small (4-bit) integer in my database?
Hi.🙌
In my model I want to store a very small integer in my DB, and I don't want to use SmallIntegerField because Django would store it as a 2-byte (16-bit) value, which is more than I need. How can I store 4-bit integers, or even smaller ones, in PostgreSQL?
Thanks for your help.🙏
As Adrian said, the smallest storage for PostgreSQL itself is 2 bytes. There is nothing smaller than that.
If you want to enforce a constraint, add a maximum-value check in the model's save() method or attach a validator to the field. For example:
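A minimal sketch of the validator approach (the model and field names here are made up for illustration):

from django.core.validators import MaxValueValidator, MinValueValidator
from django.db import models

class Reading(models.Model):
    # Stored as a 2-byte smallint in PostgreSQL, but validated to the 4-bit range 0-15.
    # Note that validators run in full_clean() and in forms, not on a bare save().
    level = models.PositiveSmallIntegerField(
        validators=[MinValueValidator(0), MaxValueValidator(15)]
    )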
At the database level, 2 bytes is the smallest integer type; there is nothing smaller built in.
If you really want the database itself to enforce and store such a small value, you can install an extension and build your own DB field by following this:
https://dba.stackexchange.com/questions/159090/how-to-store-one-byte-integer-in-postgresql
I have two models, User and Course, that need to be matched with each other. To provide a basis for this matching, each model can rank instances of the other. I'm trying to figure out a database structure that stores these preferences efficiently. The problem is that the number of ranks is unknown in advance (it depends on the number of users/courses).
I'm using Django with Postgres, if this helps.
I tried storing the preferences as a field containing a pickled dictionary, but this causes problems whenever a model instance stored in the dict is updated, and it takes a long time to load.
I'm thankful for any tips or suggestions for improvement.
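For what it's worth, one common relational structure for this kind of data is a separate ranking table with one row per (user, course) pair. The sketch below is only an illustration, and all model and field names are invented:

from django.db import models

class User(models.Model):
    name = models.CharField(max_length=100)

class Course(models.Model):
    title = models.CharField(max_length=100)

class Preference(models.Model):
    # One row per (user, course) pair, so the number of ranks can grow
    # freely with the number of users and courses.
    user = models.ForeignKey(User, on_delete=models.CASCADE)
    course = models.ForeignKey(Course, on_delete=models.CASCADE)
    rank = models.PositiveIntegerField()
    # A direction flag (or a second rank column) could record whether the
    # rank was given by the user or by the course.

    class Meta:
        unique_together = [('user', 'course')]

Because each preference is its own row, updating a user or course never invalidates the stored data the way a pickled snapshot can.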
Dear All,
I had a problem retrieving a CLOB from the database and displaying it in an APEX checkbox. The reason is that it is a very large CLOB and APEX items have a 32K character (byte) limit: PL/SQL treats Oracle APEX page items as VARCHAR2s, not CLOBs, and VARCHAR2s have a maximum size. Anything over that size will not be displayed.
I found a blog post about this problem and applied its approach, but the problem is still not solved:
http://mayo-tech-ans.blogspot.com/2013/06/displaying-large-clobs-in-oracle-apex.html
Database Version ::: 12.1.0.2.0
Apex Version ::: 5.1.4
Thanks in advance
Regards,
Sultan
APEX items can never hold more than 32K; that's the PL/SQL VARCHAR2 limit, and there's not much you can do about it.
I question whether you really need a checkbox that has a value of > 32 KB. Could you use some other value like the primary key of the row for the CLOB, or an MD5 hash of it?
If you need to display some large content, like some large amount of text in a region, there is the OraOpenSource clob-load plugin.
You can also take a CLOB and chunk it up and write it out as HTML using the htp package. I gave an example in this answer.
Is there an advantage to using UUIDField with Django and PostgreSQL (native datatype) over a self-generated unique key?
Currently I use a randomly generated alphanumeric ID field on my models, and I am wondering whether the PostgreSQL native datatype and UUIDField are better for this purpose and whether there's a reason to switch over.
I generate the ID using random letters and digits; it's 25 characters long. I put a db_index on it for faster retrieval. I don't shard my DB. The reason is that some models cannot have consecutive IDs for business reasons.
Switching to UUID will have an advantage, particularly if you have a large number of records: lookups and inserts ought to be a tiny bit faster, and you will save about 9 bytes of storage per row, since a native UUID is only 128 bits (16 bytes) versus your 25-character string.
However, that doesn't mean your home-made primary key is a bad idea. Far from it: it's a good one, and a similar approach is used by Instagram, who also happen to use PostgreSQL and Django. Their scheme uses only 64 bits and still manages to squeeze information about the object's creation time into the primary key.
Their primary purpose is sharding, but it works very well even for non-sharded DBs; just use a random number for the 13 bits that carry their shard information. They have a SQL sample at the link above.
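A rough sketch of that 64-bit layout in Python, assuming Instagram's published split of 41 bits of milliseconds since a custom epoch, 13 bits of shard ID and 10 bits of a per-shard sequence (the epoch constant and function name below are illustrative, not from their post):

import time

# Any fixed point in the past works as the custom epoch.
CUSTOM_EPOCH_MS = 1_400_000_000_000

def make_id(shard_id: int, sequence: int) -> int:
    """Pack time, shard and sequence into a single 64-bit integer."""
    millis = int(time.time() * 1000) - CUSTOM_EPOCH_MS
    packed = millis << 23                 # 41 bits of time (64 - 41 = 23 bits remain)
    packed |= (shard_id % 8192) << 10     # 13 bits of shard id (2**13 = 8192)
    packed |= sequence % 1024             # 10 bits of sequence (2**10 = 1024)
    return packed

For a non-sharded database, shard_id can simply be a random value, as suggested above.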
I'm planning to upload a billion records taken from ~750 files (each ~250 MB) to a DB using Django's ORM.
Currently each file takes ~20min to process, and I was wondering if there's any way to accelerate this process.
I've taken the following measures:
Use @transaction.commit_manually and commit once every 5000 records (a sketch of this batching pattern follows the question)
Set DEBUG=False so that Django won't accumulate all the SQL commands in memory
The loop that runs over records in a single file is completely contained in a single function (minimizing stack changes)
Refrained from hitting the DB for queries (used a local hash of objects already in the DB instead of get_or_create)
Set force_insert=True in save() in the hope it will save Django some logic
Explicitly set the id in the hope it will save Django some logic
General code minimization and optimization
What else can I do to speed things up? Here are some of my thoughts:
Use some kind of Python compiler or version which is quicker (Psyco?)
Override the ORM and use SQL directly
Use some 3rd party code that might be better (1, 2)
Beg the django community to create a bulk_insert function
Any pointers regarding these items or any other idea would be welcome :)
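As a point of reference, here is a minimal sketch of the commit-every-5000-records / force_insert measures listed above, written with the modern transaction.atomic() rather than the old commit_manually API; MyModel and parse_line are hypothetical stand-ins:

from django.db import transaction

BATCH_SIZE = 5000

def flush(batch):
    # One transaction per batch; force_insert skips the update-or-insert check.
    with transaction.atomic():
        for obj in batch:
            obj.save(force_insert=True)

def load_file(path):
    batch = []
    with open(path) as fh:
        for line in fh:
            # parse_line() is assumed to build an unsaved MyModel instance.
            batch.append(parse_line(line))
            if len(batch) >= BATCH_SIZE:
                flush(batch)
                batch = []
    if batch:
        flush(batch)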
Django 1.4 provides a bulk_create() method on the QuerySet object, see:
https://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.bulk_create
https://docs.djangoproject.com/en/dev/releases/1.4/
https://code.djangoproject.com/ticket/7596
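A minimal usage sketch (the model and fields are hypothetical):

from myapp.models import MyModel  # hypothetical model

rows = [('a', 1), ('b', 2), ('c', 3)]
objs = [MyModel(name=name, value=value) for name, value in rows]
# One multi-row INSERT instead of one INSERT per row.
MyModel.objects.bulk_create(objs)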
This is not specific to the Django ORM, but recently I had to bulk insert over 60 million rows of 8 columns of data from over 2000 files into a sqlite3 database, and I learned that the following three things reduced the insert time from over 48 hours to about 1 hour:
Increase the DB's cache size setting to use more RAM (the default is always very small; I used 3 GB). In sqlite this is done with PRAGMA cache_size = n_of_pages.
Do journalling in RAM instead of on disk (this does cause a slight problem if the system fails, but I consider that negligible given that you already have the source data on disk). In sqlite this is done with PRAGMA journal_mode = MEMORY.
Last and perhaps most important: do not build indexes while inserting. That also means not declaring UNIQUE or other constraints that cause the DB to build an index. Build indexes only after you are done inserting.
As someone mentioned previously, you should also use cursor.executemany() (or just the shortcut conn.executemany()). To use it, do:
cursor.executemany('INSERT INTO mytable (field1, field2, field3) VALUES (?, ?, ?)', iterable_data)
The iterable_data could be a list or something similar, or even an open file reader.
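Putting the three settings above and executemany() together for sqlite3 (a sketch; the table, columns and file name are made up):

import sqlite3

conn = sqlite3.connect('bulk.db')
cur = conn.cursor()

# Bigger page cache and in-memory journal, as described above.
cur.execute('PRAGMA cache_size = -3000000')   # negative value = size in KiB (~3 GB)
cur.execute('PRAGMA journal_mode = MEMORY')

# No UNIQUE constraints or indexes yet; they are created after the load.
cur.execute('CREATE TABLE IF NOT EXISTS mytable (field1, field2, field3)')

def read_rows(path):
    with open(path) as fh:
        for line in fh:
            yield tuple(line.rstrip('\n').split('\t'))

cur.executemany('INSERT INTO mytable (field1, field2, field3) VALUES (?, ?, ?)',
                read_rows('data.tsv'))
conn.commit()

# Build the index once, after all inserts are done.
cur.execute('CREATE INDEX idx_field1 ON mytable (field1)')
conn.commit()
conn.close()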
Drop to DB-API and use cursor.executemany(). See PEP 249 for details.
I ran some tests on Django 1.10 / Postgresql 9.4 / Pandas 0.19.0 and got the following timings:
Insert 3000 rows individually and get ids from populated objects using Django ORM: 3200ms
Insert 3000 rows with Pandas DataFrame.to_sql() and don't get IDs: 774ms
Insert 3000 rows with Django manager .bulk_create(Model(**df.to_records())) and don't get IDs: 574ms
Insert 3000 rows with to_csv to StringIO buffer and COPY (cur.copy_from()) and don't get IDs: 118ms
Insert 3000 rows with to_csv and COPY and get IDs via simple SELECT WHERE ID > [max ID before insert] (probably not threadsafe unless COPY holds a lock on the table preventing simultaneous inserts?): 201ms
from io import StringIO

def bulk_to_sql(df, columns, model_cls):
    """Inserting 3000 rows takes 774 ms on average."""
    engine = ExcelImportProcessor._get_sqlalchemy_engine()
    df[columns].to_sql(model_cls._meta.db_table, con=engine, if_exists='append', index=False)

def bulk_via_csv(df, columns, model_cls):
    """Inserting 3000 rows takes 118 ms on average."""
    engine = ExcelImportProcessor._get_sqlalchemy_engine()
    connection = engine.raw_connection()
    cursor = connection.cursor()
    # Serialize the frame to an in-memory, tab-separated buffer...
    output = StringIO()
    df[columns].to_csv(output, sep='\t', header=False, index=False)
    output.seek(0)
    # ...and stream it into the table with PostgreSQL's COPY.
    cursor.copy_from(output, model_cls._meta.db_table, null="", columns=columns)
    connection.commit()
    cursor.close()
The performance stats were all obtained on a table already containing 3,000 rows running on OS X (i7 SSD 16GB), average of ten runs using timeit.
I get my inserted primary keys back by assigning an import batch id and sorting by primary key, although I'm not 100% certain primary keys will always be assigned in the order the rows are serialized for the COPY command - would appreciate opinions either way.
Update 2020:
I tested the new to_sql(method="multi") functionality in Pandas >= 0.24, which puts all inserts into a single multi-row INSERT statement. Surprisingly, performance was worse than the single-row version, whether for Pandas 0.23, 0.24 or 1.1. Pandas single-row inserts were also faster than a multi-row INSERT statement issued directly to the database. I am using more complex data in a bigger database this time, but to_csv plus cursor.copy_from was still around 38% faster than the fastest alternative, which was single-row df.to_sql, and bulk_import was occasionally comparable but often slower still (up to double the time, Django 2.2).
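For reference, the multi-row variant tested above is just the following call (assuming an SQLAlchemy engine is already at hand; the table name is illustrative):

# Groups rows into multi-row INSERT statements; chunksize caps rows per statement.
df.to_sql('my_table', con=engine, if_exists='append', index=False,
          method='multi', chunksize=1000)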
There is also a bulk insert snippet at http://djangosnippets.org/snippets/446/.
This gives one insert command multiple value pairs (INSERT INTO x (val1, val2) VALUES (1,2), (3,4) --etc etc). This should greatly improve performance.
It also appears to be heavily documented, which is always a plus.
Also, if you want something quick and simple, you could try this: http://djangosnippets.org/snippets/2362/. It's a simple manager I used on a project.
The other snippet wasn't as simple and was really focused on bulk inserts for relationships; this one is a plain bulk insert that just uses the same INSERT query.
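The same multi-value INSERT can also be written by hand against Django's connection if you would rather not depend on a snippet (a sketch; the table and column names are hypothetical):

from django.db import connection

def bulk_insert(rows):
    # rows is a list of (val1, val2) tuples.
    placeholders = ', '.join(['(%s, %s)'] * len(rows))
    params = [value for row in rows for value in row]
    with connection.cursor() as cursor:
        cursor.execute(
            'INSERT INTO x (val1, val2) VALUES ' + placeholders,
            params,
        )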
Development Django now has bulk_create: https://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.bulk_create