Speed up database inserts from ORM - Django

I have a Django view which performs 500-5000 new database INSERTs in a loop. The problem is that it is really slow! I'm getting about 100 inserts per minute on Postgres 8.3. We used to use MySQL on lesser hardware (a smaller EC2 instance) and never had these kinds of speed issues.
Details:
Postgres 8.3 on Ubuntu Server 9.04.
Server is a "large" Amazon EC2 with database on EBS (ext3) - 11GB/20GB.
Here is some of my postgresql.conf -- let me know if you need more
shared_buffers = 4000MB
effective_cache_size = 7128MB
My python:
for k in kw:
    k = k.lower()
    p = ProfileKeyword(profile=self)
    logging.debug(k)
    p.keyword, created = Keyword.objects.get_or_create(keyword=k, defaults={'keyword': k})
    if not created and ProfileKeyword.objects.filter(profile=self, keyword=p.keyword).count():
        # checking created is just a small optimization to save some database hits on new keywords
        pass  # duplicate entry
    else:
        p.save()
Some output from top:
top - 16:56:22 up 21 days, 20:55, 4 users, load average: 0.99, 1.01, 0.94
Tasks: 68 total, 1 running, 67 sleeping, 0 stopped, 0 zombie
Cpu(s): 5.8%us, 0.2%sy, 0.0%ni, 90.5%id, 0.7%wa, 0.0%hi, 0.0%si, 2.8%st
Mem: 15736360k total, 12527788k used, 3208572k free, 332188k buffers
Swap: 0k total, 0k used, 0k free, 11322048k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
14767 postgres 25 0 4164m 117m 114m S 22 0.8 2:52.00 postgres
1 root 20 0 4024 700 592 S 0 0.0 0:01.09 init
2 root RT 0 0 0 0 S 0 0.0 0:11.76 migration/0
3 root 34 19 0 0 0 S 0 0.0 0:00.00 ksoftirqd/0
4 root RT 0 0 0 0 S 0 0.0 0:00.00 watchdog/0
5 root 10 -5 0 0 0 S 0 0.0 0:00.08 events/0
6 root 11 -5 0 0 0 S 0 0.0 0:00.00 khelper
7 root 10 -5 0 0 0 S 0 0.0 0:00.00 kthread
9 root 10 -5 0 0 0 S 0 0.0 0:00.00 xenwatch
10 root 10 -5 0 0 0 S 0 0.0 0:00.00 xenbus
18 root RT -5 0 0 0 S 0 0.0 0:11.84 migration/1
19 root 34 19 0 0 0 S 0 0.0 0:00.01 ksoftirqd/1
Let me know if any other details would be helpful.

One common reason for slow bulk operations like this is that each insert happens in its own transaction. If you can get all of them to happen in a single transaction, it could go much faster.
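For example, the whole loop can be wrapped in one explicit transaction. A minimal sketch, assuming a reasonably recent Django (older versions used transaction.commit_on_success instead of atomic):

from django.db import transaction

# One commit for the whole batch instead of one per save().
with transaction.atomic():
    for k in kw:
        ...  # same get_or_create / save logic as in the question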

Firstly, ORM operations are always going to be slower than pure SQL. I once wrote an update to a large database in ORM code and set it running, but quit it after several hours when it had completed only a tiny fraction. After rewriting it in SQL the whole thing ran in less than a minute.
Secondly, bear in mind that your code here is doing up to four separate database operations for every row in your data set - the get in get_or_create, possibly also the create, the count on the filter, and finally the save. That's a lot of database access.
Bearing in mind that a maximum of 5000 objects is not huge, you should be able to read the whole dataset into memory at the start. Then you can do a single filter to get all the existing Keyword objects in one go, saving a huge number of queries in the Keyword get_or_create and also avoiding the need to instantiate duplicate ProfileKeywords in the first place.
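A rough sketch of that batched approach, reusing the models from the question (untested, and assuming the keyword list fits comfortably in memory):

keywords = set(k.lower() for k in kw)

# One query to fetch the Keyword rows that already exist.
existing = {obj.keyword: obj for obj in Keyword.objects.filter(keyword__in=keywords)}

# One query to find which keywords are already linked to this profile.
already_linked = set(
    ProfileKeyword.objects.filter(profile=self, keyword__in=existing.values())
                          .values_list('keyword__keyword', flat=True)
)

for k in keywords:
    keyword_obj = existing.get(k) or Keyword.objects.create(keyword=k)
    if k not in already_linked:
        ProfileKeyword(profile=self, keyword=keyword_obj).save()

On newer Django versions the remaining saves could be collapsed further with ProfileKeyword.objects.bulk_create().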

Related

Power BI Report with dynamic columns according to slicer selection

I have a table with the columns as below -
There are rows showing the allocation of various people to various projects. The months (columns) can extend to Dec,20 and continue on from Jan,21 in the same pattern as above.
One Staff can be tagged to any number of projects in a given month.
Now I want to prepare a Power BI report on this in the format as below -
Staff ID, Project ID and End Date are the slicers to be present.
For the End Date slicer we can select options in the format of (MMM,YY) (eg - Jan,23). On the basis of this slicer I want to show the preceding 6 months of data, as portrayed by the above sample image.
I have tried using parameters, but those have to be specified for each combination, so they are not usable for this data since it is going to grow over time.
Is there any way to do this or am I missing some simple thing in particular?
Any help on this will be highly appreciated.
Adding in the sample data as below -
Staff ID  Project ID  Jan,20  Feb,20  Mar,20  Apr,20  May,20  Jun,20  Jul,20
1         20          0       0       0       100     80      10      0
1         30          0       0       0       0       20      90      100
2         20          100     100     100     0       0       0       0
3         50          80      100     0       0       0       0       0
3         60          15      0       0       0       20      0       0
3         70          5       0       100     100     80      0       0
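Not a Power BI-specific answer, but the reshaping the report implies is easier to see on a tiny sample: the month columns need to be unpivoted into rows so that an End Date slicer can simply filter a Month column (in Power Query this is the built-in "Unpivot Columns" step). A hypothetical sketch in pandas, using the column names from the sample above:

import pandas as pd

# First three rows of the sample table above, in the original wide format.
df = pd.DataFrame({
    'Staff ID':   [1, 1, 2],
    'Project ID': [20, 30, 20],
    'Jan,20':     [0, 0, 100],
    'Feb,20':     [0, 0, 100],
    'Mar,20':     [0, 0, 100],
})

# Unpivot the month columns into (Month, Allocation) rows; "show the
# preceding 6 months of the selected End Date" then becomes a plain
# filter on the Month column instead of a dynamic set of columns.
long_df = df.melt(id_vars=['Staff ID', 'Project ID'],
                  var_name='Month', value_name='Allocation')
print(long_df)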

Why does nose ignore certain files on coverage report?

I am running tests on a project I've been assigned to. Everything is triggered by calling tox.
Default tests run with nose, which adds a coverage report; this is the command that tox calls:
django-admin test -s
and the settings file has this configuration for nose:
NOSE_ARGS = [
    '--with-coverage',
    '--cover-erase',
    '--cover-package=app_name',
    '--cover-inclusive'
]
This is the report that's shown while running nose with tox:
Name Stmts Miss Cover
---------------------------------------------------------------------------
app_name/__init__.py 0 0 100%
app_name/apps.py 3 0 100%
app_name/apps_settings.py 12 2 83%
app_name/base.py 118 27 77%
app_name/choices.py 18 0 100%
app_name/constants.py 6 0 100%
app_name/exceptions.py 10 0 100%
app_name/helpers/__init__.py 0 0 100%
app_name/helpers/util.py 20 10 50%
app_name/management/__init__.py 0 0 100%
app_name/migrations/0001_initial.py 9 0 100%
app_name/migrations/__init__.py 0 0 100%
app_name/mixins.py 6 0 100%
app_name/models.py 64 4 94%
app_name/permissions.py 7 3 57%
app_name/serializers/__init__.py 0 0 100%
app_name/serializers/address_serializer.py 7 0 100%
app_name/serializers/base_response_serializer.py 7 0 100%
app_name/serializers/body_request_user_serializer.py 14 0 100%
app_name/serializers/contact_serializer.py 4 0 100%
app_name/serializers/file_serializer.py 11 2 82%
app_name/serializers/iban_serializer.py 3 0 100%
app_name/serializers/identification_serializer.py 11 2 82%
app_name/serializers/payment_account_serializer.py 3 0 100%
app_name/serializers/transfer_serializer.py 20 10 50%
app_name/services/__init__.py 0 0 100%
app_name/services/authentication_service.py 7 0 100%
app_name/services/document_service.py 23 9 61%
app_name/services/user_service.py 37 21 43%
app_name/services/webhook_service.py 26 7 73%
app_name/storage_backends.py 10 0 100%
app_name/views/__init__.py 0 0 100%
app_name/views/webhook_view.py 25 8 68%
---------------------------------------------------------------------------
TOTAL 481 105 78%
----------------------------------------------------------------------
Ran 5 tests in 4.615s
But if I then run coverage report, this is shown:
Name Stmts Miss Cover
-----------------------------------------------------------------------------
app_name/__init__.py 0 0 100%
app_name/apps.py 3 0 100%
app_name/apps_settings.py 12 2 83%
app_name/base.py 118 27 77%
app_name/choices.py 18 0 100%
app_name/constants.py 6 0 100%
app_name/decorators.py 10 7 30%
app_name/exceptions.py 10 0 100%
app_name/helpers/__init__.py 0 0 100%
app_name/helpers/util.py 20 10 50%
app_name/management/__init__.py 0 0 100%
app_name/management/commands/__init__.py 0 0 100%
app_name/management/commands/generate_uuid.py 9 4 56%
app_name/migrations/0001_initial.py 9 0 100%
app_name/migrations/__init__.py 0 0 100%
app_name/mixins.py 6 0 100%
app_name/models.py 64 4 94%
app_name/permissions.py 7 3 57%
app_name/serializers/__init__.py 0 0 100%
app_name/serializers/address_serializer.py 7 0 100%
app_name/serializers/base_response_serializer.py 7 0 100%
app_name/serializers/body_request_token_serializer.py 4 0 100%
app_name/serializers/body_request_user_serializer.py 14 0 100%
app_name/serializers/contact_serializer.py 4 0 100%
app_name/serializers/document_serializer.py 7 0 100%
app_name/serializers/file_serializer.py 11 2 82%
app_name/serializers/files_serializer.py 4 0 100%
app_name/serializers/iban_serializer.py 3 0 100%
app_name/serializers/identification_serializer.py 11 2 82%
app_name/serializers/payment_account_serializer.py 3 0 100%
app_name/serializers/transfer_serializer.py 20 10 50%
app_name/serializers/user_information_serializer.py 7 0 100%
app_name/services/__init__.py 0 0 100%
app_name/services/account_service.py 62 45 27%
app_name/services/authentication_service.py 7 0 100%
app_name/services/document_service.py 23 9 61%
app_name/services/method_service.py 23 15 35%
app_name/services/user_service.py 37 21 43%
app_name/services/webhook_service.py 26 7 73%
app_name/storage_backends.py 10 0 100%
app_name/tests/__init__.py 0 0 100%
app_name/tests/apps/__init__.py 0 0 100%
app_name/tests/apps/apps_test.py 9 0 100%
app_name/tests/helpers/__init__.py 0 0 100%
app_name/tests/helpers/helpers_test.py 7 0 100%
app_name/tests/services/__init__.py 0 0 100%
app_name/tests/services/authentication_service_test.py 8 0 100%
app_name/tests/services/document_service_test.py 13 0 100%
app_name/tests/services/user_service_test.py 13 0 100%
app_name/urls/__init__.py 0 0 100%
app_name/urls/webhook_url.py 7 2 71%
app_name/views/__init__.py 0 0 100%
app_name/views/webhook_view.py 25 8 68%
-----------------------------------------------------------------------------
TOTAL 664 178 73%
Now, as you can see, certain files were ignored in the nose report but shown by coverage, like app_name/services/account_service.py. Since that file contains feature code, it should be shown in the report.
The interesting thing here is that, as far as I know, both libraries, nose and coverage, generate their reports from the same .coverage data file.
I guess this is default nose behavior. I'm not very familiar with nose, so perhaps someone can tell me why this difference in behavior happens.
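One thing worth checking (an assumption on my part, since the full tox/coverage setup isn't shown) is whether coverage itself has an explicit source configuration. With a .coveragerc like the hypothetical one below, both nose's in-run report and a later standalone coverage report measure the same set of files, including modules the tests never import:

[run]
source = app_name
omit =
    */migrations/*
    */tests/*

[report]
show_missing = True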

Scikit-learn labelencoder: how to preserve mappings between batches?

I have 185 million samples, each of which will be about 3.8 MB. To prepare my dataset, I will need to one-hot encode many of the features, after which I end up with over 15,000 features.
But I need to prepare the dataset in batches, since the memory footprint exceeds 100 GB for the features alone when one-hot encoding only 3 million samples.
The question is how to preserve the encodings/mappings/labels between batches?
The batches are not going to have all the levels of a category necessarily. That is, batch #1 may have: Paris, Tokyo, Rome.
Batch #2 may have Paris, London.
But in the end I need to have Paris, Tokyo, Rome, London all mapped to one encoding all at once.
Assuming that I cannot determine the levels of my Cities column across all 185 million samples at once, since they won't fit in RAM, what should I do?
If I apply the same LabelEncoder instance to different batches, will the mappings remain the same?
I will also need to do the one-hot encoding, either with scikit-learn or Keras' np_utils.to_categorical, in batches after this. So, the same question: how do I use those three methods in batches, or apply them all at once to a file format stored on disk?
I suggest using Pandas' get_dummies() for this, since sklearn's OneHotEncoder() needs to see all possible categorical values when you call .fit(); otherwise it will throw an error when it encounters a new one during .transform().
import pandas as pd

# Create a toy dataset and split it into batches
data_column = pd.Series(['Paris', 'Tokyo', 'Rome', 'London', 'Chicago', 'Paris'])
batch_1 = data_column[:3]
batch_2 = data_column[3:]

# Convert the categorical feature column to a matrix of dummy variables
batch_1_encoded = pd.get_dummies(batch_1, prefix='City')
batch_2_encoded = pd.get_dummies(batch_2, prefix='City')

# Row-bind (append) the encoded batches back together
final_encoded = pd.concat([batch_1_encoded, batch_2_encoded], axis=0)

# Final wrap-up: replace NaNs with 0 and convert the flags from float to int
final_encoded = final_encoded.fillna(0)
final_encoded[final_encoded.columns] = final_encoded[final_encoded.columns].astype(int)
final_encoded
Output:
   City_Chicago  City_London  City_Paris  City_Rome  City_Tokyo
0             0            0           1          0           0
1             0            0           0          0           1
2             0            0           0          1           0
3             0            1           0          0           0
4             1            0           0          0           0
5             0            0           1          0           0
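If the full list of city names can be gathered up front (for example in one cheap pass over just that one column, which is much smaller than the full rows), an alternative sketch is to pin the category set explicitly so every batch yields the same dummy columns in the same order, with no NaN-filling step needed. This assumes such a list is available:

import pandas as pd

all_cities = ['Paris', 'Tokyo', 'Rome', 'London', 'Chicago']  # assumed collected beforehand

batch_2 = pd.Series(['London', 'Chicago', 'Paris'])
encoded = pd.get_dummies(pd.Categorical(batch_2, categories=all_cities), prefix='City')
# encoded has all five City_* columns even though this batch only
# contains three of the cities, so batches line up without reindexing.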

Severe memory leak with Django

I am facing a problem of a huge memory leak on a server serving a Django (1.8) app with Apache or Nginx (the issue happens with both).
When I go to certain pages (let's say the specific request below), the server's RAM goes up to 16 GB in a few seconds (with only one request) and the server freezes.
def records(request):
    """Return the newest records (last 14 days) page."""
    values = []
    time = timezone.now() - timedelta(days=14)
    record = Records.objects.filter(time__gte=time)
    return render(request,
                  'record_app/records_newests.html',
                  {
                      'active_nav_tab': ["active", "", "", ""],
                      'record': record,
                  })
When I git checkout an older version, from back when there was no such problem, the problem survives and I have the same issue.
I did a memory check with Guppy (heapy) for the faulty request; here is the result:
>>> hp.heap()
Partition of a set of 7042 objects. Total size = 8588675016 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 1107 16 8587374512 100 8587374512 100 unicode
1 1014 14 258256 0 8587632768 100 django.utils.safestring.SafeText
2 45 1 150840 0 8587783608 100 dict of 0x390f0c0
3 281 4 78680 0 8587862288 100 dict of django.db.models.base.ModelState
4 326 5 75824 0 8587938112 100 list
5 47 1 49256 0 8587987368 100 dict of 0x38caad0
6 47 1 49256 0 8588036624 100 dict of 0x39ae590
7 46 1 48208 0 8588084832 100 dict of 0x3858ab0
8 46 1 48208 0 8588133040 100 dict of 0x38b8450
9 46 1 48208 0 8588181248 100 dict of 0x3973fe0
<164 more rows. Type e.g. '_.more' to view.>
After a day of searching I found my answer.
While investigating I checked statistics on my DB and saw that one table was 800 MB in size but had only 900 rows. That table contains a TextField without a max length. Somehow one text field got a huge amount of data inserted into it, and that row was slowing everything down on every page using this model.
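For anyone hitting the same thing: the oversized rows can be located from the ORM, and the big column kept out of list queries until the data is cleaned up. A hedged sketch, assuming the offending TextField on the Records model is called text:

from django.db.models.functions import Length

# Find the rows whose text field is unexpectedly large.
biggest = (Records.objects.annotate(text_len=Length('text'))
                          .order_by('-text_len')
                          .values_list('pk', 'text_len')[:5])

# Keep the huge column out of the page query until the bad row is fixed.
record = Records.objects.filter(time__gte=time).defer('text')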

Developing a GUI using OpenCV and Qt C++ - High-Resolution images

I have been chosen by my Computer Science laboratory to develop a Graphical User Interface (GUI) whose main goal is to allow the user to open High-Resolution images (TIFF format) and manipulate them (zoom in, zoom out, draw rectangles, create and edit annotations...).
I would like to build this GUI using Qt coupled with OpenCV. Nevertheless, I have some doubts about working with OpenCV for this project.
Therefore, my questions are:
Does Qt allow you to handle High-Resolution TIFF images (2000x3000 pixels)?
Does OpenCV allow you to handle High-Resolution TIFF images?
Is it easy to convert an OpenCV TIFF image into a Qt TIFF image?
Create and edit annotations? That sounds like going much further than just processing TIFF.
Supposing the original post was understood correctly - the motivation being to create a GUI capable of working with large TIFF image files that also lets users create and (later) edit annotations and other graphical elements (rectangles et al.) - the solution lies in a rather different direction than just how to handle TIFF images in OpenCV and similar tools.
If the goal is not to just reinvent the wheel again, there is a lovely and powerful tool, xfig, used at CERN for ages, which provides a powerful framework for your idea.
Using xfig's syntax engine and rendering options lets you keep objects composable and editable throughout the whole lifecycle of the work, and leaves you more time for developing your own concept of the GUI instead of spending man-years of effort on the internals of handling standard-format pixmaps and layered vector objects at the lowest level.
A meta-file looks like this:
#FIG 3.2
Landscape
Center
Inches
Letter
200.00
Single
-3
1200 2
6 8625 825 11775 7275
6 8625 1125 9075 3975
6 8625 1125 9075 2175
6 8625 1125 9075 1575
2 1 0 3 0 7 0 0 -1 0.000 1 0 -1 0 0 2
8700 1200 9000 1200
2 1 0 3 0 7 0 0 -1 0.000 1 0 -1 0 0 2
8700 1500 9000 1500
-6
6 8625 1725 9075 2175
2 1 0 3 0 7 0 0 -1 0.000 1 0 -1 0 0 2
8700 1800 9000 1800
2 1 0 3 0 7 0 0 -1 0.000 1 0 -1 0 0 2
8700 2100 9000 2100
...
4 0 0 0 0 0 20 0.0000 4 195 1080 35100 19800 Z-80 PIO\001
4 0 0 0 0 0 15 0.0000 4 180 930 34575 21075 CTL/DAT\001
4 0 0 0 0 0 15 0.0000 4 180 795 34575 21375 B/A SEL\001
4 0 0 0 0 0 15 0.0000 4 150 270 34575 21975 A6\001
4 0 0 0 0 0 15 0.0000 4 150 270 34575 22275 A5\001
4 0 0 0 0 0 15 0.0000 4 150 480 34575 22875 GND\001
4 0 0 0 0 0 15 0.0000 4 150 270 34575 24075 A0\001
4 0 0 0 0 0 15 0.0000 4 150 600 34575 24375 A STB\001
4 0 0 0 0 0 15 0.0000 4 150 570 34575 24675 B STB\001
4 0 0 0 0 0 15 0.0000 4 150 240 36525 21975 B6\001
and result may look like: