How to reclaim storage from my Postgres DB (Django)

I have a table named duplicates_duplicatebackendentry_documents that has a size of 49 GB. The table has two indexes of 25 GB each, and two constraints that are also 25 GB each.
The table is used by the duplicates module of a Django app I deployed. I have now turned the module off. I am unable to run VACUUM FULL because I do not have the space it needs. Dropping the table returns the storage (I tested this in a dev environment), but is there a way I can remove the bloat while keeping the table, its constraints and indexes? I just want to empty the bloat along with all the contents.

I just want to empty the bloat along with all the contents.
The canonical way to do that is
TRUNCATE duplicates_duplicatebackendentry_documents;
which will render the table and all its indexes empty.
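If you want to confirm how much space comes back, you can compare the total relation size (table plus indexes) before and after. A minimal sketch using Django's raw database connection, assuming the default connection points at this database:

from django.db import connection

TABLE = "duplicates_duplicatebackendentry_documents"

with connection.cursor() as cur:
    # Size of the table including all of its indexes
    cur.execute("SELECT pg_size_pretty(pg_total_relation_size(%s::regclass))", [TABLE])
    print("before:", cur.fetchone()[0])

    # Removes every row but keeps the table, its indexes and constraints defined
    cur.execute(f"TRUNCATE {TABLE}")

    cur.execute("SELECT pg_size_pretty(pg_total_relation_size(%s::regclass))", [TABLE])
    print("after:", cur.fetchone()[0])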

Related

Simultaneously `CREATE TABLE LIKE` in AWS Redshift and change a few columns' default values

Workflow
In a data import workflow, we are creating a staging table using a CREATE TABLE LIKE statement.
CREATE TABLE abc_staging (LIKE abc INCLUDING DEFAULTS);
Then, we run COPY to import CSV data from S3 into the staging table.
The data in the CSV is incomplete. Namely, the fields partition_0, partition_1 and partition_2 are missing from the CSV file; we fill them in like this:
UPDATE abc_staging
SET partition_0 = 'BUZINGA',
    partition_1 = '2018',
    partition_2 = '07';
Problem
This query is expensive (it often takes ≈20 minutes), and I would like to avoid it. That would be possible if I could configure DEFAULT values on these columns when creating the abc_staging table. I did not find any way to do that, nor any explicit indication that it is impossible. So perhaps it is still possible and I am just missing how to do it?
Alternative solutions I considered
Drop these columns and add them again
That would be easy to do, but ALTER TABLE ADD COLUMN only adds columns to the end of the column list. In the abc table they are not at the end of the column list, which means the schemas of abc and abc_staging would mismatch. That breaks the ALTER TABLE APPEND operation that I use to move data from the staging table to the main table.
Note. Reordering columns in abc table to alleviate this difficulty will require recreating the huge abc table which I'd like to avoid.
Generate the staging table creation script programmatically with proper columns and get rid of CREATE TABLE LIKE
I will have to do that if I do not find any better solution.
Fill in the partition_* fields in the original CSV file
That is possible but would break backwards compatibility (I already have perhaps hundreds of thousands of files in there). Harder but manageable.
As you are finding, you are not creating a table exactly LIKE the original, and Redshift doesn't let you ALTER a column's default value. Your proposed path is likely the best one (define the staging table explicitly).
Since I don't know your exact situation, other paths might be better, so let me explore a bit. First off, when you UPDATE the staging table you are in fact reading every row, invalidating it, and writing a new row (with the new information) at the end of the table. This leaves a lot of invalidated rows. When you then do ALTER TABLE APPEND, all these invalidated rows are added to your main table unless you vacuum the staging table beforehand. So you may not be getting the value you want out of ALTER TABLE APPEND.
You may be better off INSERTing the data into your main table with an ORDER BY clause (a sketch follows below). This is slower than ALTER TABLE APPEND, but you won't have to do the UPDATE, so the overall process could be faster. You could come out further ahead because of the reduced need to VACUUM. Your situation will determine whether this is better or not; it is just another option for your list.
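A rough sketch of that idea, assuming a psycopg2-compatible connection to the cluster; the column list (col_a, col_b), the sort column sort_col and the connection details are placeholders for your real schema, since partition_0..2 are not the last columns of abc:

import psycopg2  # Redshift accepts libpq/psycopg2 connections

# Placeholders: replace col_a, col_b and sort_col with the real columns of abc.
INSERT_SQL = """
    INSERT INTO abc (col_a, col_b, partition_0, partition_1, partition_2)
    SELECT col_a, col_b, 'BUZINGA', '2018', '07'   -- constants instead of a later UPDATE
    FROM abc_staging
    ORDER BY sort_col;
"""

conn = psycopg2.connect(host="my-cluster.example.com", port=5439,
                        dbname="mydb", user="loader", password="changeme")
with conn, conn.cursor() as cur:
    cur.execute(INSERT_SQL)
conn.close()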
I am curious about your UPDATE speed. It just needs to read and then write every row in the staging table. Unless the staging table is very large, it doesn't seem like this should take 20 minutes. Other activity could be creating this slowdown. Just curious.
Another option would be to change your main table to have these 3 columns last (yes this would be some work). This way you could add the columns to the staging table and things would line up for ALTER TABLE APPEND. Just another possibility.
The easiest solution turned out to be adding the necessary partition_* fields to the source CSV files.
After making that change and removing the UPDATE from the importer pipeline, performance improved greatly. Imports now take ≈10 minutes each in total (covering COPY, the DELETE of duplicates, and ALTER TABLE APPEND).
Disk usage no longer climbs to 100%.
Thanks everyone for the help!

Sitecore media conversion tool eating storage space

I have a question regarding the media conversion tool for Sitecore.
With this module you can convert media items between a hard drive location and a Sitecore database, and vice versa. But each time I convert some items, it takes up additional hard drive space.
So when I convert 3 GB to the hard drive it adds an additional 3 GB (which seems logical: 6 GB total), but when I then convert the items back to the blob format it adds another 3 GB (9 GB total) instead of overwriting the previous version in the database.
Is there a way to clean up the previous blobs or something? It is now using far too much hard drive space.
Thanks in advance.
Using "Clean Up Databases" should work, but if the size gets too large, as my client's blob table did, the clean up will fail due to either a SQL timeout or because SQL Server uses up all the available locks.
Another solution is to run a script to manually clean up the blobs table. We had this issue previously and Sitecore support was able to provide us with a script to do so:
DECLARE @UsableBlobs table(
    ID uniqueidentifier
);

INSERT INTO @UsableBlobs
SELECT convert(uniqueidentifier, [Value]) as EmpID
FROM [Fields]
WHERE [Value] != ''
  AND (FieldId = '{40E50ED9-BA07-4702-992E-A912738D32DC}' OR FieldId = '{DBBE7D99-1388-4357-BB34-AD71EDF18ED3}')

DELETE FROM [Blobs]
WHERE [BlobId] NOT IN (SELECT * FROM @UsableBlobs)
This basically looks for blobs that are still in use and stores their IDs in a table variable. It then compares the Blobs table against this list and deletes the blobs that aren't in it.
In our case, even this was bombing out due to the SQL Server locks problem, so I updated the delete statement to DELETE TOP (x) FROM [Blobs], where x is a number you feel is appropriate. I started at 1000 and eventually went up to deleting 400,000 records at a time. (Yes, it was that large.)
So try the built-in "Clean Up Databases" option first and failing that, try to run the script to manually clean the table.
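If you prefer to drive that batched delete from a small script rather than re-running it by hand, something along these lines should work. This is only a sketch: the connection string, driver and batch size are placeholders, and the field IDs are the same ones used in the script above.

import pyodbc

# Placeholder connection string; point it at the Sitecore database.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=Sitecore_Master;UID=sa;PWD=changeme"
)
cur = conn.cursor()

BATCH_SQL = """
DELETE TOP (10000) FROM [Blobs]
WHERE [BlobId] NOT IN (
    SELECT convert(uniqueidentifier, [Value]) FROM [Fields]
    WHERE [Value] != ''
      AND (FieldId = '{40E50ED9-BA07-4702-992E-A912738D32DC}'
           OR FieldId = '{DBBE7D99-1388-4357-BB34-AD71EDF18ED3}')
)
"""

# Delete in batches and commit after each one so the lock count stays bounded.
while True:
    cur.execute(BATCH_SQL)
    deleted = cur.rowcount
    conn.commit()
    if deleted <= 0:
        break

cur.close()
conn.close()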

Sqlite3/C++ executes DELETE statement without changing the db size

How is it possible? I have a simple C++ app that is using SQLite3 to INSERT/DELETE records.
I use a single database with a single table inside. When I choose to store some data in the db, the size of my.db increases, as expected.
The problem is with DELETE: the size does not decrease. If I do:
sqlite3 my.db
sqlite> select count(*) from mytable;
0 is returned, which is correct, but if I do ls -l on the folder containing my.db, the size is unchanged.
Can anybody explain?
When you execute a DELETE query, SQLite does not actually delete the records and rearrange the data. That would take too much time. Instead, it just marks the records as deleted and ignores them from then on.
If you actually want to reduce the file size, execute the VACUUM command. There is also an option for auto-vacuuming. See http://www.sqlite.org/lang_vacuum.html.
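For example, from Python's sqlite3 module (the same statements work from the C API or the sqlite3 shell); note that changing auto_vacuum on an existing database only takes effect after a VACUUM:

import sqlite3

conn = sqlite3.connect("my.db")

# Rebuild the file and return the freed pages to the operating system.
conn.execute("VACUUM")

# Optional: shrink the file automatically on future deletes.
conn.execute("PRAGMA auto_vacuum = FULL")
conn.execute("VACUUM")  # required for the pragma change to take effect

conn.close()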
The scenario is listed in the SQLite Frequently Asked Questions:
(12) I deleted a lot of data but the database file did not get any smaller. Is this a bug?
No. When you delete information from an SQLite database, the unused disk space is added to an internal "free-list" and is reused the next time you insert data. The disk space is not lost. But neither is it returned to the operating system.
If you delete a lot of data and want to shrink the database file, run the VACUUM command. VACUUM will reconstruct the database from scratch. This will leave the database with an empty free-list and a file that is minimal in size. Note, however, that the VACUUM can take some time to run (around a half second per megabyte on the Linux box where SQLite is developed) and it can use up to twice as much temporary disk space as the original file while it is running.
As of SQLite version 3.1, an alternative to using the VACUUM command is auto-vacuum mode, enabled using the auto_vacuum pragma.
The documentation is your friend; please use it.
Also from the documentation:
When information is deleted in the database, and a btree page becomes empty, it isn't removed from the database file, but is instead marked as 'free' for future use. When a new page is needed, SQLite will use one of these free pages before increasing the database size. This results in database fragmentation, where the file size increases beyond the size required to store its data, and the data itself becomes disordered in the file.
Another side effect of a dynamic database is table fragmentation. The pages containing the data of an individual table can become spread over the database file, requiring longer for it to load. This can appreciably slow database speed because of file system behavior. Compacting fixes both of these problems.
The easiest way to remove empty pages is to use the SQLite command VACUUM. This can be done from within SQLite library calls or the sqlite utility.
In-depth examples follow in that documentation.

Accelerate bulk insert using Django's ORM?

I'm planning to upload a billion records taken from ~750 files (each ~250 MB) to a database using Django's ORM.
Currently each file takes ~20 minutes to process, and I was wondering if there is any way to accelerate this process.
I've taken the following measures:
Use @transaction.commit_manually and commit once every 5000 records
Set DEBUG=False so that django won't accumulate all the sql commands in memory
The loop that runs over records in a single file is completely contained in a single function (minimize stack changes)
Refrained from hitting the db for queries (used a local hash of objects already in the db instead of using get_or_create)
Set force_insert=True in the save() in hopes it will save django some logic
Explicitly set the id in hopes it will save django some logic
General code minimization and optimization
What else can I do to speed things up? Here are some of my thoughts:
Use some kind of Python compiler or version which is quicker (Psyco?)
Override the ORM and use SQL directly
Use some 3rd party code that might be better (1, 2)
Beg the django community to create a bulk_insert function
Any pointers regarding these items or any other idea would be welcome :)
Django 1.4 provides a bulk_create() method on the QuerySet object, see:
https://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.bulk_create
https://docs.djangoproject.com/en/dev/releases/1.4/
https://code.djangoproject.com/ticket/7596
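A minimal usage sketch (the Entry model and its fields are hypothetical):

from myapp.models import Entry  # hypothetical model

# Build unsaved instances in memory, then insert them with multi-row INSERTs
# instead of one query per row (newer Django versions also accept batch_size).
objs = [Entry(name="row %d" % i, value=i) for i in range(100000)]
Entry.objects.bulk_create(objs)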
This is not specific to the Django ORM, but recently I had to bulk insert >60 million rows of 8-column data from over 2000 files into a sqlite3 database. And I learned that the following three things reduced the insert time from over 48 hours to ~1 hour:
Increase the cache size setting of your DB so it uses more RAM (the defaults are always very small; I used 3 GB); in sqlite, this is done by PRAGMA cache_size = n_of_pages.
Do journalling in RAM instead of on disk (this does cause a slight problem if the system fails, but I consider that negligible given that you already have the source data on disk); in sqlite this is done by PRAGMA journal_mode = MEMORY.
Last and perhaps most important: do not build indexes while inserting. This also means not declaring UNIQUE or other constraints that cause the DB to build an index. Build indexes only after you are done inserting (a combined sketch follows below).
As someone mentioned previously, you should also use cursor.executemany() (or just the shortcut conn.executemany()). To use it, do:
cursor.executemany('INSERT INTO mytable (field1, field2, field3) VALUES (?, ?, ?)', iterable_data)
The iterable_data could be a list or something similar, or even an open file reader.
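Putting the three settings and executemany together, a rough sqlite3 sketch; the table, columns and generated rows are made up for illustration:

import sqlite3

conn = sqlite3.connect("bulk.db")
cur = conn.cursor()

# Large page cache (a negative value is interpreted as KiB, here roughly 3 GB).
cur.execute("PRAGMA cache_size = -3000000")
# Keep the rollback journal in RAM; a crash mid-import can corrupt the DB,
# which is acceptable here because the source files still exist.
cur.execute("PRAGMA journal_mode = MEMORY")

# Plain table: no UNIQUE constraints or indexes yet.
cur.execute("CREATE TABLE IF NOT EXISTS mytable (field1, field2, field3)")

rows = ((i, i * 2, "row %d" % i) for i in range(1000000))  # any iterable works
cur.executemany("INSERT INTO mytable (field1, field2, field3) VALUES (?, ?, ?)", rows)
conn.commit()

# Build indexes only after the bulk insert is done.
cur.execute("CREATE INDEX IF NOT EXISTS idx_mytable_field1 ON mytable (field1)")
conn.commit()
conn.close()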
Drop to DB-API and use cursor.executemany(). See PEP 249 for details.
I ran some tests on Django 1.10 / Postgresql 9.4 / Pandas 0.19.0 and got the following timings:
Insert 3000 rows individually and get ids from populated objects using Django ORM: 3200ms
Insert 3000 rows with Pandas DataFrame.to_sql() and don't get IDs: 774ms
Insert 3000 rows with Django manager .bulk_create(Model(**df.to_records())) and don't get IDs: 574ms
Insert 3000 rows with to_csv to StringIO buffer and COPY (cur.copy_from()) and don't get IDs: 118ms
Insert 3000 rows with to_csv and COPY and get IDs via simple SELECT WHERE ID > [max ID before insert] (probably not threadsafe unless COPY holds a lock on the table preventing simultaneous inserts?): 201ms
from io import StringIO


def bulk_to_sql(df, columns, model_cls):
    """ Inserting 3000 takes 774ms avg """
    engine = ExcelImportProcessor._get_sqlalchemy_engine()
    df[columns].to_sql(model_cls._meta.db_table, con=engine, if_exists='append', index=False)


def bulk_via_csv(df, columns, model_cls):
    """ Inserting 3000 takes 118ms avg """
    engine = ExcelImportProcessor._get_sqlalchemy_engine()
    connection = engine.raw_connection()
    cursor = connection.cursor()
    output = StringIO()
    df[columns].to_csv(output, sep='\t', header=False, index=False)
    output.seek(0)
    # COPY the tab-separated buffer straight into the table
    cursor.copy_from(output, model_cls._meta.db_table, null="", columns=columns)
    connection.commit()
    cursor.close()
The performance stats were all obtained on a table already containing 3,000 rows running on OS X (i7 SSD 16GB), average of ten runs using timeit.
I get my inserted primary keys back by assigning an import batch id and sorting by primary key, although I'm not 100% certain primary keys will always be assigned in the order the rows are serialized for the COPY command - would appreciate opinions either way.
Update 2020:
I tested the new to_sql(method="multi") functionality in Pandas >= 0.24, which puts all inserts into a single, multi-row insert statement. Surprisingly performance was worse than the single-row version, whether for Pandas versions 0.23, 0.24 or 1.1. Pandas single row inserts were also faster than a multi-row insert statement issued directly to the database. I am using more complex data in a bigger database this time, but to_csv and cursor.copy_from was still around 38% faster than the fastest alternative, which was a single-row df.to_sql, and bulk_import was occasionally comparable, but often slower still (up to double the time, Django 2.2).
There is also a bulk insert snippet at http://djangosnippets.org/snippets/446/.
This gives one insert command multiple value pairs (INSERT INTO x (val1, val2) VALUES (1,2), (3,4) --etc etc). This should greatly improve performance.
It also appears to be heavily documented, which is always a plus.
Also, if you want something quick and simple, you could try this: http://djangosnippets.org/snippets/2362/. It's a simple manager I used on a project.
The other snippet wasn't as simple and was really focused on bulk inserts for relationships. This is just a plain bulk insert and just uses the same INSERT query.
Development django got bulk_create: https://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.bulk_create

SQLite - pre allocating database size

Is there a way to pre-allocate my SQLite database to a certain size? Currently I'm adding and deleting a number of records and would like to avoid this overhead at creation time.
The fastest way to do this is with the zeroblob function:
Example:
Y:> sqlite3 large.sqlite
SQLite version 3.7.4
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> create table large (a);
sqlite> insert into large values (zeroblob(1024*1024));
sqlite> drop table large;
sqlite> .q
Y:> dir large.sqlite
Volume in drive Y is Personal
Volume Serial Number is 365D-6110
Directory of Y:\
01/27/2011 12:10 PM 1,054,720 large.sqlite
Note: As Kyle properly indicates in his comment:
There is a limit to how big each blob can be, so you may need to insert multiple blobs if you expect your database to be larger than ~1GB.
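A sketch of that approach in Python's sqlite3 module, inserting several ~512 MB zero-blobs to stay under the per-blob limit; the target size and file name are arbitrary, and it assumes auto_vacuum is off (the default) so the freed pages stay allocated:

import sqlite3

TARGET_BYTES = 8 * 1024**3      # pre-allocate roughly 8 GB
CHUNK = 512 * 1024**2           # well under the ~1 GB per-blob limit

conn = sqlite3.connect("large.sqlite")
conn.execute("CREATE TABLE large (a)")
for _ in range(TARGET_BYTES // CHUNK):
    conn.execute("INSERT INTO large VALUES (zeroblob(?))", (CHUNK,))
conn.commit()

# Dropping the table leaves the pages on the free-list for later reuse.
conn.execute("DROP TABLE large")
conn.commit()
conn.close()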
There is a hack: insert a bunch of data into the database until the database size is what you want, and then delete the data. This works because:
"When an object (table, index, or
trigger) is dropped from the database,
it leaves behind empty space. This
empty space will be reused the next
time new information is added to the
database. But in the meantime, the
database file might be larger than
strictly necessary."
Naturally, this isn't the most reliable method. (Also, you will need to make sure that auto_vacuum is disabled for this to work). You can learn more here - http://www.sqlite.org/lang_vacuum.html