I am running tests on a project I've been assigned to. Everything is triggered by calling tox.
The default tests run with nose, which adds a coverage report. This is the command that tox calls:
django-admin test -s
and the settings file has this configuration for nose:
NOSE_ARGS = [
    '--with-coverage',
    '--cover-erase',
    '--cover-package=app_name',
    '--cover-inclusive'
]
This is the report that's shown while running nose with tox:
Name Stmts Miss Cover
---------------------------------------------------------------------------
app_name/__init__.py 0 0 100%
app_name/apps.py 3 0 100%
app_name/apps_settings.py 12 2 83%
app_name/base.py 118 27 77%
app_name/choices.py 18 0 100%
app_name/constants.py 6 0 100%
app_name/exceptions.py 10 0 100%
app_name/helpers/__init__.py 0 0 100%
app_name/helpers/util.py 20 10 50%
app_name/management/__init__.py 0 0 100%
app_name/migrations/0001_initial.py 9 0 100%
app_name/migrations/__init__.py 0 0 100%
app_name/mixins.py 6 0 100%
app_name/models.py 64 4 94%
app_name/permissions.py 7 3 57%
app_name/serializers/__init__.py 0 0 100%
app_name/serializers/address_serializer.py 7 0 100%
app_name/serializers/base_response_serializer.py 7 0 100%
app_name/serializers/body_request_user_serializer.py 14 0 100%
app_name/serializers/contact_serializer.py 4 0 100%
app_name/serializers/file_serializer.py 11 2 82%
app_name/serializers/iban_serializer.py 3 0 100%
app_name/serializers/identification_serializer.py 11 2 82%
app_name/serializers/payment_account_serializer.py 3 0 100%
app_name/serializers/transfer_serializer.py 20 10 50%
app_name/services/__init__.py 0 0 100%
app_name/services/authentication_service.py 7 0 100%
app_name/services/document_service.py 23 9 61%
app_name/services/user_service.py 37 21 43%
app_name/services/webhook_service.py 26 7 73%
app_name/storage_backends.py 10 0 100%
app_name/views/__init__.py 0 0 100%
app_name/views/webhook_view.py 25 8 68%
---------------------------------------------------------------------------
TOTAL 481 105 78%
----------------------------------------------------------------------
Ran 5 tests in 4.615s
But if I then run coverage report, this is what's shown:
Name Stmts Miss Cover
-----------------------------------------------------------------------------
app_name/__init__.py 0 0 100%
app_name/apps.py 3 0 100%
app_name/apps_settings.py 12 2 83%
app_name/base.py 118 27 77%
app_name/choices.py 18 0 100%
app_name/constants.py 6 0 100%
app_name/decorators.py 10 7 30%
app_name/exceptions.py 10 0 100%
app_name/helpers/__init__.py 0 0 100%
app_name/helpers/util.py 20 10 50%
app_name/management/__init__.py 0 0 100%
app_name/management/commands/__init__.py 0 0 100%
app_name/management/commands/generate_uuid.py 9 4 56%
app_name/migrations/0001_initial.py 9 0 100%
app_name/migrations/__init__.py 0 0 100%
app_name/mixins.py 6 0 100%
app_name/models.py 64 4 94%
app_name/permissions.py 7 3 57%
app_name/serializers/__init__.py 0 0 100%
app_name/serializers/address_serializer.py 7 0 100%
app_name/serializers/base_response_serializer.py 7 0 100%
app_name/serializers/body_request_token_serializer.py 4 0 100%
app_name/serializers/body_request_user_serializer.py 14 0 100%
app_name/serializers/contact_serializer.py 4 0 100%
app_name/serializers/document_serializer.py 7 0 100%
app_name/serializers/file_serializer.py 11 2 82%
app_name/serializers/files_serializer.py 4 0 100%
app_name/serializers/iban_serializer.py 3 0 100%
app_name/serializers/identification_serializer.py 11 2 82%
app_name/serializers/payment_account_serializer.py 3 0 100%
app_name/serializers/transfer_serializer.py 20 10 50%
app_name/serializers/user_information_serializer.py 7 0 100%
app_name/services/__init__.py 0 0 100%
app_name/services/account_service.py 62 45 27%
app_name/services/authentication_service.py 7 0 100%
app_name/services/document_service.py 23 9 61%
app_name/services/method_service.py 23 15 35%
app_name/services/user_service.py 37 21 43%
app_name/services/webhook_service.py 26 7 73%
app_name/storage_backends.py 10 0 100%
app_name/tests/__init__.py 0 0 100%
app_name/tests/apps/__init__.py 0 0 100%
app_name/tests/apps/apps_test.py 9 0 100%
app_name/tests/helpers/__init__.py 0 0 100%
app_name/tests/helpers/helpers_test.py 7 0 100%
app_name/tests/services/__init__.py 0 0 100%
app_name/tests/services/authentication_service_test.py 8 0 100%
app_name/tests/services/document_service_test.py 13 0 100%
app_name/tests/services/user_service_test.py 13 0 100%
app_name/urls/__init__.py 0 0 100%
app_name/urls/webhook_url.py 7 2 71%
app_name/views/__init__.py 0 0 100%
app_name/views/webhook_view.py 25 8 68%
-----------------------------------------------------------------------------
TOTAL 664 178 73%
Now, as you can see, certain files were ignored by the nose report but shown by coverage, such as app_name/services/account_service.py. Since that file contains feature code, it should appear in the report.
The interesting thing here is that, as far as I know, both libraries, nose and coverage, generate their reports from the same data file, .coverage.
I guess this is default nose behavior. I'm not very familiar with nose, so perhaps someone can tell me why this difference in behavior happens.
If one can understand how to store stuff in memory, know where it is, and know how to get it out again in an orderly way, it will go a long way toward achieving results in Perl. (And probably in all programming languages.)
I am not a programmer.
I am trying to extract data from 'older program' output and import it into a SQL database. The extraction is the thing.
My previous question was largely incorrect, as I found out when importing data into the table: I did not have enough data from the 'old program' output file.
I would like to learn from my mistake and re-ask my previous question, hopefully correctly this time.
I have included my poor effort at extracting the data, exactly as it was last time. It doesn't come anywhere near getting the correct data out.
I believe this is quite a complex question but maybe it isn't.
It is certainly above my level of Perl at present, and maybe ever.
Answers to my incorrectly phrased question have been partially understood. Thank you very much for them.
If I could summarize it, my main problem with this task is dealing with this type of question: 'If a line contains ..., get data from two lines up and insert it at the beginning of ...'. Seemingly impossible for me.
I tried regex searches spanning the end of a line, but was unable to get that to work.
I was also unable to arrange successive loops to insert data into lines as I wanted. If one loop worked, the next did not, and so on. I was prepared to work on successive files in a step-by-step process, but the 'two lines up' question stumped me completely.
I was able to extract other data from these output files relatively easily as they are very orderly files, but this particular question has me stumped.
My revised question is:
My input file consists of batches of data (roughly 50-70 lines long) in the following format:
1(P1) 3 P.ell 05/0120 W/P068819 0 12.0 98/99 380 380 C03 104 PROCESSED 21/02/16 TIME 22.16.52 KSINA=8
AGE SPH %THN %INC SV SI MAI20 HTPC VIPC AGE BA DBH HT SPH CIH% CIV% CVD BCON CMAI C0 C0CAL SI20
0 1100 .0 89.0%SPH 2 2 .00 .0 .0 20.00 1 .0 17.3 0 .0 .0 .0 .0 .0000 .000 0% .00
7 815 25.9 .0 2 2 9.90 75.5 47.2 20.00 1 26.6 17.3 330 .0 .0 .0 13.0 .2099 1.005 .000 17.30
13 550 32.5 .0
18 330 40.0 .0
45 0 100.0 .0
0SQ -4 -4 -4 = SI20 17 17 17 PLANTN---104 GREEN MEADOWS MODEL---P.ELLIOTTII MAC MAC SQ 10 SI20 22.90
HTPC 76 76 76 =MAI20 10 10 10 FROM HTPC HTPC 100 MAI20 20.71
VIPC 47 47 47 =MAI20 10 10 10 HTPC/VIPC REGRESSION---P.ELLIOTTII GENERAL 1/83 VIPC 100 MAI20 20.99
MAIDBH 0
INMAI==> 0
0INPUT FOR CALCULATING HTPC & VIPC = HT ---- ----
AGE DBH HT VTREE SPH BA TOTAL WS UTIL S A B C D TCAI CTCAI TMAI UCAI CUCAI UMAI SCAI CSCAI SMAI IAGE
1 .0 .2 .0000 979 0 0 0 0 0 0 0 0 0 .0 0 .0 .0 0 .0 .0 0 .0 1.0
2 .0 .9 .0000 979 0 0 0 0 0 0 0 0 0 .0 0 .0 .0 0 .0 .0 0 .0 2.0
3 3.9 2.0 .0007 979 1 1 1 0 0 0 0 0 0 .7 1 .2 .0 0 .0 .0 0 .0 3.0
4 7.1 3.4 .0041 979 4 4 3 1 1 0 0 0 0 3.4 4 1.0 .6 1 .2 .0 0 .0 4.0
5 9.4 4.6 .0102 979 7 10 5 5 5 0 0 0 0 5.9 10 2.0 4.1 5 .9 .0 0 .0 5.0
6 11.3 5.7 .0188 979 10 18 6 12 12 1 0 0 0 8.4 18 3.1 7.5 12 2.0 .0 0 .0 6.0
7 13.0 6.7 .0293 979 13 29 7 22 19 3 0 0 0 10.3 29 4.1 9.7 22 3.1 .0 0 .0 7.0
17%
THN 11.4 6.7 .0230 164 2 4 1 3 2 0 0 0 0
REM 13.4 6.7 .0315 815 12 26 6 20 17 3 0 0 0
8 15.0 7.6 .0453 815 14 37 6 31 21 10 0 0 0 11.2 40 5.0 10.9 33 4.1 .0 0 .0 7.6
9 16.4 8.5 .0607 815 17 49 6 43 23 20 0 0 0 12.5 52 5.8 12.2 45 5.0 .2 0 .0 8.6
10 17.4 9.4 .0771 815 19 63 7 56 24 30 2 0 0 13.4 66 6.6 13.1 58 5.8 1.3 2 .2 9.6
11 18.3 10.3 .0941 815 21 77 7 70 24 41 5 0 0 13.9 80 7.3 13.6 72 6.5 3.0 5 .4 10.6
12 19.0 11.3 .1118 815 23 91 7 84 24 50 10 0 0 14.4 94 7.8 14.1 86 7.2 5.4 10 .8 11.6
13 19.6 12.2 .1299 815 25 106 8 98 24 56 18 0 0 14.7 109 8.4 14.4 100 7.7 8.0 18 1.4 12.6
33%
THN 17.5 12.2 .1044 265 6 28 2 25 8 15 3 0 0
REM 20.6 12.2 .1421 550 18 78 5 73 16 42 15 0 0
14 21.3 13.0 .1636 550 20 90 6 84 16 44 25 0 0 11.8 121 8.6 11.6 112 8.0 10.0 28 2.0 10.4
15 22.0 13.7 .1864 550 21 103 6 97 16 45 36 0 0 12.5 133 8.9 12.3 124 8.3 11.0 39 2.6 11.2
16 22.7 14.5 .2100 550 22 116 6 109 15 46 48 0 0 13.0 146 9.1 12.7 137 8.6 12.0 51 3.2 12.0
17 23.3 15.3 .2345 550 23 129 6 123 15 46 61 0 0 13.5 160 9.4 13.2 150 8.8 12.9 64 3.8 12.8
18 23.9 15.9 .2598 550 25 143 7 136 15 46 74 1 0 13.9 174 9.6 13.6 164 9.1 13.8 78 4.3 13.6
40%
THN 21.6 15.9 .2142 220 8 47 2 45 6 19 20 0 0
REM 25.3 15.9 .2901 330 17 96 4 92 9 28 54 1 0
19 26.0 16.6 .3203 330 17 106 4 101 9 27 63 3 0 10.0 184 9.7 9.8 174 9.1 10.5 88 4.6 11.0
20 26.6 17.3 .3519 330 18 116 5 112 9 27 71 5 0 10.4 194 9.7 10.2 184 9.2 10.6 99 4.9 11.7
21 27.2 18.0 .3849 330 19 127 5 122 9 27 80 8 0 10.9 205 9.8 10.7 194 9.3 11.1 110 5.2 12.4
22 27.9 18.7 .4192 330 20 138 5 133 8 26 87 11 0 11.3 216 9.8 11.1 206 9.3 11.5 121 5.5 13.2
23 28.4 19.3 .4546 330 21 150 5 145 8 26 94 16 0 11.7 228 9.9 11.4 217 9.4 11.8 133 5.8 14.0
24 29.0 20.0 .4914 330 22 162 5 157 8 26 101 22 0 12.2 240 10.0 11.9 229 9.5 12.3 145 6.1 14.9
25 29.6 20.6 .5292 330 23 175 6 169 8 25 106 29 0 12.5 253 10.1 12.2 241 9.6 12.6 158 6.3 15.7
26 30.2 21.2 .5682 330 24 188 6 182 8 25 112 37 0 12.9 265 10.2 12.6 254 9.8 13.0 171 6.6 16.5
27 30.7 21.8 .6083 330 25 201 6 194 8 25 115 46 0 13.2 279 10.3 13.0 267 9.9 13.3 184 6.8 17.3
28 31.3 22.4 .6492 330 25 214 7 208 8 24 119 56 1 13.5 292 10.4 13.2 280 10.0 13.6 198 7.1 18.2
29 31.9 23.0 .6908 330 26 228 7 221 8 24 122 65 2 13.7 306 10.5 13.5 293 10.1 13.8 212 7.3 19.0
30 32.4 23.5 .7332 330 27 242 7 235 8 24 123 77 3 14.0 320 10.7 13.7 307 10.2 14.0 226 7.5 19.8
31 33.0 23.9 .7766 330 28 256 7 249 8 24 125 88 5 14.3 334 10.8 14.0 321 10.4 14.3 240 7.7 20.4
32 33.6 24.4 .8202 330 29 271 8 263 8 23 126 99 7 14.4 349 10.9 14.1 335 10.5 14.4 255 8.0 21.0
Firstly, the two variables in the first line (the one starting '1(P1'), in this case 'C03 104', need to be extracted from it and sent to OUTPUT. (Same as the previous question, but the output position changes.)
Secondly, all lines beginning with 'THN' need to be extracted as they are, except that the 'THN' itself can be dropped.
If there are two, three, four or even five 'THN' lines, they all need to be extracted from the batch and sent to OUTPUT. (Roughly the same as the previous question.)
Thirdly, although sequentially this is the second step, the last figure in the 'AGE' column of the main tabular data just before each 'THN' line needs to be attached to the extracted 'THN' line directly below it (in this case the figures 7, 13 and 18). These need to be added to their respective THN lines. See the expected output below, where the ages have been inserted after the two 'C03 104' variables in each line.
If there are no 'THN' lines in a given batch, the entire batch should be ignored, with no output, and the next batch (starting with '1(P1)' again) considered.
The correct output expected from the above batch is:
C03 104 7 11.4 6.7 .0230 164 2 4 1 3 2 0 0 0 0
C03 104 13 17.5 12.2 .1044 265 6 28 2 25 8 15 3 0 0
C03 104 18 21.6 15.9 .2142 220 8 47 2 45 6 19 20 0 0
As will be seen from this, the two variables from the top line are inserted at the start of each output THN data line. The age figure read from the input batch is then inserted into its respective THN line, and thereafter the rest of the THN line data is attached.
My effort from some time ago, not updated since, is as follows:
while ( my $line = <INPUT> ) {
    if ($line =~ /\s{6,11}(\w{1}\d{1}\w{0,5})\s{0,5}(\d{3})/) {
        my @c_no = "$1,$2\n";
        foreach (@c_no) {
            print OUTPUT $_;
        }
        if ($line =~ /^(\s{1}THN)(\s{1,3}\d{0,2}.\d)(\s{1,3}\d{0,2}.\d)(\s{1,2}\d{0,1}.\d{4})(\s{1,2}\d{2,4})
                      (\s{2,3}\d{1,2})(\s{1,6}\d{1,4})(\s{1,2}\d{1,2})(\s{1,5}\d{1,4})(\s{1,4}\d{1,4})
                      (\s{1,4}\d{1,4})(\s{1,4}\d{1,4})(\s{1,4}\d{1,4})(\s{1,4}\d{1,4})|(^1(P1))/x) {
            print OUTPUT "$1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14\n";
        }
    }
}
Advice, guidance and help would be greatly appreciated.
This is much more simply done using split to separate each line into space-delimited fields, and it is quite straightforward if you maintain state variables for the two fields from the header row and for the age from any row whose first field is entirely digits. Then all that is necessary is to print these three values before the numbers on any line that starts with THN.
Note that it's simplest to pass the name of the input file as a parameter on the command line. Then all you have to do is read from <>; all the opening and error handling is already done for you.
The output format you've asked for is rather esoteric. I can't see any pattern to the column widths, so I've had to write a custom printf format to recreate it. If you need something else, all the values in each output line are in the @data array, which you can use as you wish.
use strict;
use warnings 'all';

my ($c1, $c2, $age);

while ( <> ) {

    next unless /\S/;

    my @fields = split;

    if ( $fields[0] eq '1(P1)' ) {
        ($c1, $c2) = @fields[10,11];
    }
    elsif ( $fields[0] !~ /\D/ ) {
        $age = $fields[0];
    }
    elsif ( $fields[0] eq 'THN' ) {
        my @data = ( $c1, $c2, $age, @fields[1..13] );
        printf "%4s %5s %5d %5.1f%5.1f%7.4f%5d%4d%7d%3d%6d%5d%5d%5d%5d%5d\n", @data;
    }
}
output
C03 104 7 11.4 6.7 0.0230 164 2 4 1 3 2 0 0 0 0
C03 104 13 17.5 12.2 0.1044 265 6 28 2 25 8 15 3 0 0
C03 104 18 21.6 15.9 0.2142 220 8 47 2 45 6 19 20 0 0
I copied and modified your example data, so this hasn't had a really good test. And I'm printing to STDOUT for testing purposes, but that should be easy to change.
The trick is to recognize that you've got line matching to do, which is great with regexes, and other processing, which is probably better with plain old code. So build a little loop, and process the lines with equal precedence (this is important for detecting errors in the file - don't try to nest things too much). Put in some state variables to help keep track of what comes next, and be sure to reset them appropriately.
Also, one thing I noticed in your example code is that you spent a lot of time getting spacing and number-of-digits right for the fields. That was almost certainly wasted time in this context, since the key was the "THN" at the start of the line. One trick with processing text is to focus on the things you really need, and use .* for the other stuff. That way, line noise or a syntax error or some strange formatting glitch won't screw up your program. (Sometimes .* becomes [^"]* or whatever, but you take the point...)
my ($line_prefix, $have_age_col, $age_col);

while (<>) {
    if (/^1\(P1\).*\s(?P<two_vars>\w+\s+\w+)\s+PROCESSED .* TIME .* KSINA=.*$/) {
        # Start new section
        $line_prefix = $+{two_vars};
        $have_age_col = 0;
        $age_col = undef;
    }
    if (/^AGE /) {
        $have_age_col = 1;
    }
    if ($have_age_col && /^\s{0,5}(\d+)/) {
        $age_col = substr "    ".$1, -5;
    }
    if (/^THN /) {
        die "THN encountered without header"
            unless $line_prefix;
        die "THN encountered without age column"
            unless $have_age_col and $age_col;
        s/^THN \s*//;
        s/\s+$//;
        my $output = "$line_prefix $age_col $_\n";
        print STDOUT $output;
    }
}
I have a table A with real time values in it.
Amount Count Pct1 Pct2
300 2 0.000 100.000
1,891 2 0.001 100.000
500 2 0.000 100.000
100 2 0.000 100.000
1,350 2 0.001 100.000
2,648 2 0.001 100.000
2,255 2 0.001 100.000
500 2 0.000 100.000
200 2 0.000 30.441
10 2 0.000 100.000
1,928 2 0.001 100.000
40 2 0.000 100.000
200 2 0.000 100.000
256 2 0.000 100.000
254 2 0.000 100.000
100 2 0.001 100.000
50 1 0.000 33.333
1,512 2 0.001 100.000
I have a table B with a set of conditions. I want to generate the condition success count in SAS, i.e. if I pass row 1 of the table below as a condition against table A, it succeeds 2 times. I am using a join to generate a Cartesian product and it's not efficient. I want an efficient way to solve this problem (similar to what the COUNTIFS function does in Excel). Thanks a lot for your help.
Amount Count Pct1 Pct2 Condition Success Count
1,576 2 0 100 4
1,537 2 0 100 4
1,484 2 0 100 5
1,405 2 0 100 5
1,290 2 0 100 6
1,095 2 0 100 6
948 2 0 100 6
932 2 0 100 6
914 2 0 100 6
887 2 0 100 6
850 2 0 100 6
774 2 0 100 6
707 2 0 100 6
704 2 0 100 6
695 2 0 100 6
646 2 0 100 6
50 1 0 5.42 16
50 1 0 5.42 16
You said that you have tried a join to make a Cartesian product. However, since you didn't post any code, I am not sure whether you tried to create the full product and then count the rows. Doing the counting in one SQL statement is much faster, since the full Cartesian product never actually gets written anywhere. Like this:
proc sql;
    create table tableC as
    select c.*, coalesce(s, 0) as SuccessCount
    from TableB c
    left join (
        select b.id, count(*) as s
        from TableA a, TableB b
        where
            a.amount >= b.amount and
            a.count  >= b.count  and
            a.pct1   >= b.pct1   and
            a.pct2   >= b.pct2
        group by b.id
    ) as d
    on c.id = d.id
    ;
quit;
Note that tableB needs to have some unique id column. You should always have a column to use as an id, but if you don't already have one, simply create it, for example like this:
data tableB;
    set tableB;
    id = _N_;
run;
I am following a tutorial for a word processor for my Qt module at uni.
It has asked me to set this attribute:
MainWindow::setAttribute(Qt::WA_DeleteOnClose);
The problem comes when I run the application: it causes an error saying that the application has closed unexpectedly.
Also, it asked me to make an actionExit action and add it to the File toolbar, which doesn't show. I am guessing that this is because I am writing it on OS X, where exit/quit is taken care of for you with the Cmd+Q shortcut.
I was wondering if anyone could shed some light on this problem for me so that I know for future reference. If needed I can post the tutorial and source code.
Thanks
Edit: backtrace from the debugger (hope this is correct)
0 __pthread_kill 0 0x7fff8eaff212
1 pthread_kill 0 0x7fff86f7eaf4
2 abort 0 0x7fff86fc2dce
3 free 0 0x7fff86f96959
4 MainWindow::~MainWindow mainwindow.cpp 22 0x100002cff
5 QObject::event 0 0x100e48906
6 QWidget::event 0 0x1000ecd5e
7 QMainWindow::event 0 0x10049cadb
8 QApplicationPrivate::notify_helper 0 0x10009593d
9 QApplication::notify 0 0x10009bdc4
10 QCoreApplication::notifyInternal 0 0x100e3417c
11 QCoreApplicationPrivate::sendPostedEvents 0 0x100e355a0
12 __CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE0_PERFORM_FUNCTION__ 0 0x7fff90925101
13 __CFRunLoopDoSources0 0 0x7fff90924a25
14 __CFRunLoopRun 0 0x7fff90947dc5
15 CFRunLoopRunSpecific 0 0x7fff909476b2
16 RunCurrentEventLoopInMode 0 0x7fff8d0f60a4
17 ReceiveNextEventCommon 0 0x7fff8d0f5d84
18 BlockUntilNextEventMatchingListInMode 0 0x7fff8d0f5cd3
19 _DPSNextEvent 0 0x7fff91a00613
20 -[NSApplication nextEventMatchingMask:untilDate:inMode:dequeue:] 0 0x7fff919ffed2
... <More>
Is your MainWindow object declared on the stack, by any chance? If so, then DeleteOnClose is not a good idea, simply because deleting an object that is on the stack is an error.
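For illustration, a minimal sketch of the difference (not from the tutorial; the header and class name are assumed from the backtrace):

#include <QApplication>
#include "mainwindow.h"   // assumed project header

int main(int argc, char *argv[])
{
    QApplication app(argc, argv);

    // Stack-allocated window: with Qt::WA_DeleteOnClose set, closing it makes
    // Qt call delete on a stack object, which aborts inside free() much like
    // the backtrace above.
    // MainWindow w;
    // w.show();

    // Heap-allocated window: here Qt::WA_DeleteOnClose is safe, and the
    // window deletes itself when it is closed.
    MainWindow *w = new MainWindow;
    w->setAttribute(Qt::WA_DeleteOnClose);
    w->show();

    return app.exec();
}

If the tutorial's main() declares the window on the stack, either drop WA_DeleteOnClose or switch to the heap-allocated form.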
I have a Django view which creates 500-5000 new database INSERTS in a loop. Problem is, it is really slow! I'm getting about 100 inserts per minute on Postgres 8.3. We used to use MySQL on lesser hardware (smaller EC2 instance) and never had these types of speed issues.
Details:
Postgres 8.3 on Ubuntu Server 9.04.
Server is a "large" Amazon EC2 with database on EBS (ext3) - 11GB/20GB.
Here is some of my postgresql.conf -- let me know if you need more
shared_buffers = 4000MB
effective_cache_size = 7128MB
My python:
for k in kw:
    k = k.lower()
    p = ProfileKeyword(profile=self)
    logging.debug(k)
    p.keyword, created = Keyword.objects.get_or_create(keyword=k, defaults={'keyword': k})
    if not created and ProfileKeyword.objects.filter(profile=self, keyword=p.keyword).count():
        # checking created is just a small optimization to save some database hits on new keywords
        pass  # duplicate entry
    else:
        p.save()
Some output from top:
top - 16:56:22 up 21 days, 20:55, 4 users, load average: 0.99, 1.01, 0.94
Tasks: 68 total, 1 running, 67 sleeping, 0 stopped, 0 zombie
Cpu(s): 5.8%us, 0.2%sy, 0.0%ni, 90.5%id, 0.7%wa, 0.0%hi, 0.0%si, 2.8%st
Mem: 15736360k total, 12527788k used, 3208572k free, 332188k buffers
Swap: 0k total, 0k used, 0k free, 11322048k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
14767 postgres 25 0 4164m 117m 114m S 22 0.8 2:52.00 postgres
1 root 20 0 4024 700 592 S 0 0.0 0:01.09 init
2 root RT 0 0 0 0 S 0 0.0 0:11.76 migration/0
3 root 34 19 0 0 0 S 0 0.0 0:00.00 ksoftirqd/0
4 root RT 0 0 0 0 S 0 0.0 0:00.00 watchdog/0
5 root 10 -5 0 0 0 S 0 0.0 0:00.08 events/0
6 root 11 -5 0 0 0 S 0 0.0 0:00.00 khelper
7 root 10 -5 0 0 0 S 0 0.0 0:00.00 kthread
9 root 10 -5 0 0 0 S 0 0.0 0:00.00 xenwatch
10 root 10 -5 0 0 0 S 0 0.0 0:00.00 xenbus
18 root RT -5 0 0 0 S 0 0.0 0:11.84 migration/1
19 root 34 19 0 0 0 S 0 0.0 0:00.01 ksoftirqd/1
Let me know if any other details would be helpful.
One common reason for slow bulk operations like this is each insert happening in its own transaction. If you can get all of them to happen in a single transaction, it could go much faster.
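As a rough sketch of that idea (not your original code verbatim; transaction.atomic is the Django 1.6+ spelling, and older versions used transaction.commit_on_success for the same purpose):

from django.db import transaction

# One commit for the whole batch instead of one per save().
with transaction.atomic():
    for k in kw:
        k = k.lower()
        p = ProfileKeyword(profile=self)
        p.keyword, created = Keyword.objects.get_or_create(keyword=k, defaults={'keyword': k})
        if created or not ProfileKeyword.objects.filter(profile=self, keyword=p.keyword).count():
            p.save()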
Firstly, ORM operations are always going to be slower than pure SQL. I once wrote an update to a large database in ORM code and set it running, but quit it after several hours when it had completed only a tiny fraction. After rewriting it in SQL the whole thing ran in less than a minute.
Secondly, bear in mind that your code here is doing up to four separate database operations for every row in your data set - the get in get_or_create, possibly also the create, the count on the filter, and finally the save. That's a lot of database access.
Bearing in mind that a maximum of 5000 objects is not huge, you should be able to read the whole dataset into memory at the start. Then you can do a single filter to get all the existing Keyword objects in one go, saving a huge number of queries in the Keyword get_or_create and also avoiding the need to instantiate duplicate ProfileKeywords in the first place.
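A rough sketch of that approach (not tested against your models; the field names are inferred from your snippet, and bulk_create and transaction.atomic need a reasonably recent Django):

from django.db import transaction
# Keyword and ProfileKeyword are the models from your snippet above;
# import them from wherever they live in your app.

def add_keywords(profile, kw):
    wanted = {k.lower() for k in kw}

    # One query: every Keyword that already exists for these strings.
    existing = {obj.keyword: obj for obj in Keyword.objects.filter(keyword__in=wanted)}

    # One query: which of those this profile is already linked to.
    already_linked = set(
        ProfileKeyword.objects.filter(profile=profile, keyword__keyword__in=wanted)
                              .values_list('keyword__keyword', flat=True)
    )

    with transaction.atomic():
        new_links = []
        for word in wanted:
            keyword = existing.get(word)
            if keyword is None:
                keyword = Keyword.objects.create(keyword=word)  # only truly new keywords hit the DB
            if word not in already_linked:
                new_links.append(ProfileKeyword(profile=profile, keyword=keyword))
        ProfileKeyword.objects.bulk_create(new_links)  # a single bulk INSERT for the new links

That cuts the per-row get_or_create / filter / save round trips down to a handful of queries for the whole batch.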