Increasing speed of a for loop through a List and dataframe - python-2.7

Currently I am dealing with a massive amount of data, originally in the form of lists generated through combinations. I am running conditions on each set of lists in a for loop. The problem is that this small for loop is taking hours with the data. I'm looking to optimize the speed by changing some functions or by vectorizing it.
I know one of the biggest no-nos is doing Pandas or DataFrame operations inside for loops, but I need to sum up the columns and organize the data a little to get what I want. It seems unavoidable.
So you have a better understanding, each list looks something like this when it's thrown into a DataFrame:
Name Role Cost Value
0 Johnny Tsunami Driver 1000 39
1 Michael B. Jackson Pistol 2500 46
2 Bobby Zuko Pistol 3000 50
3 Greg Ritcher Lookout 200 25
Name Role Cost Value
4 Johnny Tsunami Driver 1000 39
5 Michael B. Jackson Pistol 2500 46
6 Bobby Zuko Pistol 3000 50
7 Appa Derren Lookout 250 30
This is the current loop. Any ideas?
df2 = pd.DataFrame()  # df2 accumulates the combinations that pass the filter
for element in itertools.product(*combine_list):
    combo = list(element)
    df = pd.DataFrame(np.array(combo).reshape(-1, 11))
    df[[2, 3]] = df[[2, 3]].apply(pd.to_numeric)
    if df[2].sum() <= 5000 and df[3].sum() > 190:
        df2 = pd.concat([df2, df], ignore_index=True)
A couple of things I've done have sliced off some time, but not enough:
* Changing df[2].sum() to df[2].values.sum() -- it's faster.
* For the concat inside the if statement, I've tried append and also collecting the DataFrames in a list; concat is usually about 2 seconds faster, or ends up at about the same speed.
* Changing .apply(pd.to_numeric) to .astype(np.int64) is faster as well.
I'm currently looking at PyPy and Cython as well, but I want to start here first before I go through that headache.
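One direction worth trying before PyPy or Cython is to do the filtering with plain Python sums and only build a DataFrame for the combinations that pass. This is a sketch only: it assumes each element yielded by itertools.product(*combine_list) is a tuple of 11-field rows with cost at index 2 and value at index 3, matching the reshape(-1, 11) in the loop above.

import itertools

import numpy as np
import pandas as pd

kept_rows = []
for element in itertools.product(*combine_list):
    # Sum cost and value as plain ints instead of building a DataFrame
    # for every combination.
    total_cost = sum(int(row[2]) for row in element)
    total_value = sum(int(row[3]) for row in element)
    if total_cost <= 5000 and total_value > 190:
        kept_rows.extend(element)          # keep the raw rows, no DataFrame yet

# Build a single DataFrame once, only from the combinations that passed.
df2 = pd.DataFrame(np.array(kept_rows).reshape(-1, 11))
df2[[2, 3]] = df2[[2, 3]].astype(np.int64)

Constructing and concatenating a DataFrame once per combination is what dominates the runtime; keeping the inner loop to integer sums and deferring the DataFrame to a single construction at the end should remove most of that overhead.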

Related

Maximizing score by scheduling tasks

So we have a timeline of T days in which some tasks have to be performed.
Every task has a penalty score. If a task is not performed within the given timeline, its score adds to the final penalty score. Every task can be performed only on or after its given starting time.
The input will be given in the format:
T
Score Quantity_of_task Starting_time
For example:
T = 10
140 5 4
This means that 5 tasks with penalty score 140 have to be performed from the 4th day onwards.
You can perform at most 1 task on a particular day.
The goal is to minimize the final penalty score.
What I tried to do:
Example -
T = 10
Input size = 5
150 4 1
120 4 3
200 2 7
100 10 5
50 5 1
I sorted the list according to the penalty score, and greedily assigned the tasks with the highest penalty scores to their corresponding days, i.e.
the 2 tasks with the highest score (200) are assigned to days 7 and 8,
the 4 tasks with the next highest score (150) are assigned to days 1, 2, 3 and 4,
the 4 tasks with the next highest score (120) are assigned to days 5, 6, 9 and 10,
which gives the schedule as
150 150 150 150 120 120 200 200 120 120
Left out tasks:
10 tasks with 100 score = 1000 penalty
5 tasks with 50 score = 250 penalty
Final penalty = 1250.
This requires O(T * input_size). Is there a more elegant and optimized way of doing it?
Both input size and T have a constraint of 10^5.
Thanks.
If you store the available days in an ordered set, then you can perform your algorithm much faster.
For example, C++ provides an ordered set with a lower_bound method that will find, in O(log n) time, the first available day on or after the starting time.
Overall this should give an O(n log n) algorithm where n = T + input_size.
For example, I suspect that when you have your 4 tasks of penalty 120 to assign from day 3 onwards, your current code will loop over days 3, 4, 5, etc. until you find a day that has not been assigned. You can now replace this O(n) loop with a single O(log n) call to lower_bound to find the first unassigned day. When you greedily assign the days, you should also remove them from the set so they won't be assigned twice.
Note that there are only T days, so there will be at most T day assignments. For example, suppose all tasks have starting time 1 and quantity T. Then the first task will take O(T log n) time to assign, but all subsequent tasks will only need a single call to lower_bound (because there are no days left to assign), so will take O(log n) each.
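A Python sketch of the same idea, assuming the third-party sortedcontainers package is available; SortedList.bisect_left plays the role of C++'s lower_bound here, and tasks are given as (score, quantity, start_day) tuples matching the input format above.

from sortedcontainers import SortedList  # pip install sortedcontainers

def min_penalty(T, tasks):
    # tasks: list of (score, quantity, start_day) tuples
    free_days = SortedList(range(1, T + 1))    # days that are still unassigned
    penalty = 0
    for score, quantity, start in sorted(tasks, reverse=True):  # highest penalty first
        for _ in range(quantity):
            i = free_days.bisect_left(start)   # first free day >= start (lower_bound)
            if i == len(free_days):
                penalty += score               # no free day left: pay the penalty
            else:
                free_days.pop(i)               # assign that day and remove it from the set
    return penalty

print(min_penalty(10, [(150, 4, 1), (120, 4, 3), (200, 2, 7), (100, 10, 5), (50, 5, 1)]))

On the worked example above this prints 1250, matching the hand computation.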

Fast Update For a Table

I need to update CustomerValue table for 4000 customers for 20 different options.
It exactly comes out to 80,000 records.
I wrote this:
Update CustomerValue Set Value = 100 where Option in
(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20);
But it is taking too long. I was wondering if I can use a PL/SQL block or any other way to make it run faster. A few minutes are okay... it ran for 11 minutes, so I cancelled it.
Note: There is no ROWID in that table.
Thanks
If your condition is a regular, contiguous range like
(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)
you can replace the IN list with a range predicate. In a test on 1,100,000 rows this took about 6 seconds:
UPDATE CustomerValue
SET Value = 100
WHERE Option >= 1 AND Option <= 20;

Pandas groupby + transform taking hours for 600 Million records

My DataFrame has 3 fields: account, month and salary.
account month Salary
1 201501 10000
2 201506 20000
2 201506 20000
3 201508 30000
3 201508 30000
3 201506 10000
3 201506 10000
3 201506 10000
3 201506 10000
I am doing a groupby on account and month and calculating the sum of salary for each group, then removing duplicates.
MyDataFrame['Salary'] = MyDataFrame.groupby(['account', 'month'])['Salary'].transform(sum)
MyDataFrame = MyDataFrame.drop_duplicates()
Expecting output like below:
account month Salary
1 201501 10000
2 201506 40000
3 201508 60000
3 201506 40000
It works well for a few records. I tried the same on 600 million records and it has been running for 4-5 hours. Initially, when I loaded the data using pd.read_csv(), it took 60 GB of RAM; for the first 1-2 hours RAM usage stayed between 90 and 120 GB. After 3 hours the process is using 236 GB of RAM and it is still running.
Please suggest any faster alternative way of doing this.
EDIT:
Now it finishes in 15 minutes with df.groupby(['account', 'month'], sort=False)['Salary'].sum()
Just to follow up on chrisb's answer and Alexander's comment, you will indeed get more performance out of the .sum() and .agg('sum') methods. In a Jupyter %%timeit comparison of the three approaches, the answers that chrisb and Alexander mention came out about twice as fast on your very small example dataset.
Also, according to the Pandas API documentation, adding the kwarg sort=False will also help performance. So your groupby should look something like df.groupby(['account', 'month'], sort=False)['Salary'].sum(). Indeed, when I ran it, it was about 10% faster than the other runs.
Unless I'm misunderstanding something, you're really doing an aggregation - transform is for when you need the data in the same shape as the original frame. This should be somewhat faster and does it all in one step.
df.groupby(['account', 'month'])['Salary'].agg('sum')
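For reference, here is a minimal runnable version of that aggregation on the sample data from the question (column names account, month and Salary taken from the sample above; reset_index is only there to get back a flat DataFrame shaped like the expected output):

import pandas as pd

# Small, self-contained copy of the example data above.
MyDataFrame = pd.DataFrame({
    'account': [1, 2, 2, 3, 3, 3, 3, 3, 3],
    'month':   [201501, 201506, 201506, 201508, 201508,
                201506, 201506, 201506, 201506],
    'Salary':  [10000, 20000, 20000, 30000, 30000,
                10000, 10000, 10000, 10000],
})

# Aggregate instead of transform + drop_duplicates; sort=False skips sorting
# the group keys, which saves a little more time.
result = (MyDataFrame
          .groupby(['account', 'month'], sort=False)['Salary']
          .sum()
          .reset_index())
print(result)

This prints the four aggregated rows shown in the expected output.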
It might be worth downloading the development version of Pandas 0.17.0. They are releasing the GIL (which limits multithreading) natively in groupby operations, and this blog post suggested speed increases of 3x on a group-mean example.
http://continuum.io/blog/pandas-releasing-the-gil
http://pandas.pydata.org/

MIP GAP couldn't be set properly

I am solving a MIP in IBM ILOG CPLEX. I have set the relative MIP gap and the absolute MIP gap to 0, but the gap reported in the engine log was greater than 0. Also, when I run the model with the default values (1.0E-4, 1.0E-6), the gap reported in the engine log is greater than 1.0E-4 (sometimes even 6%). The surprising thing is that the solve time is small (below 1 sec). I think maybe other settings are needed, besides setting the MIP gap to zero, to obtain the optimal value of the objective function. All my other settings are at their defaults. I would appreciate it if anyone can help me.
This is the result of one of my runs (the relative MIP gap is set to 0, but the reported gap is 1.13%, as you can see):
Nodes Cuts/
Node Left Objective IInf Best Integer Best Node ItCnt Gap
0 0 15619.2777 30 15619.2777 204
0 0 21532.4345 31 Cuts: 92 300
0 0 22240.7958 65 Cuts: 50 389
0 0 22374.7172 46 Cuts: 63 452
0 0 22428.5062 28 Cuts: 31 475
0 0 22447.7754 48 Cuts: 28 517
0 0 22486.3137 39 Cuts: 34 542
0 0 22486.3137 40 Cuts: 13 557
0 0 22486.3137 30 ZeroHalf: 4 558
0 0 22486.3137 28 Cuts: 15 583
* 0+ 0 23225.6696 22486.3137 583 3.18%
0 2 22486.3137 28 23225.6696 22486.3137 583 3.18%
Elapsed real time = 0.36 sec. (tree size = 0.01 MB, solutions = 1)
* 26 20 integral 0 22743.1173 22486.3137 1126 1.13%
GUB cover cuts applied: 2
Clique cuts applied: 23
Cover cuts applied: 9
Implied bound cuts applied: 105
Flow cuts applied: 1
Mixed integer rounding cuts applied: 30
Zero-half cuts applied: 74
Gomory fractional cuts applied: 3
Root node processing (before b&c):
Real time = 0.31
Parallel b&c, 4 threads:
Real time = 0.25
Sync time (average) = 0.02
Wait time (average) = 0.06
-------
Total (root+branch&cut) = 0.56 sec.
Thanks in advance for your help.
As Tim said in the comment, it is the last line of the log file that actually shows the final solution (upper and lower bounds) and not the last line of the tree log part.
It seems that the situation is as follows:
CPLEX reports an optimal solution, within the default gap of 1e-4, but the log shows 1.13%.
You have found a better solution, which you have checked is feasible.
For the first point, CPLEX does not actually report that the final gap is 1.13%. The gap was 1.13% at the moment that line was printed during the search.
The fact that it returns a solution that should obey the default tolerances is proof that, with your formulation, the optimum cannot be less than what is reported.
Since you are sure that there is a better feasible solution, you have a few options.
Try to inject your known solution into CPLEX before it starts the optimization. One way to do this is using goals. This might be somewhat complicated, so you might want to ..
Print out the model and check manually that the solution you claim is better is feasible with respect to the constraints you have entered (recommended method).
Add a constraint that the objective function value should be less than or equal to your solution's objective value plus a small quantity (something around 1e-3 should be OK), as sketched below. If your solution is feasible for the formulation, you should get it; otherwise there is a bug and the solution is infeasible. (I do not like this method because if there are multiple optimal solutions you can get funny results, but usually it works.)
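The question seems to use OPL/CPLEX Optimization Studio, so the following is only a toy illustration of that last option using CPLEX's docplex Python API, with a made-up three-variable model standing in for the real one:

from docplex.mp.model import Model   # pip install docplex

m = Model(name="bound_check")
x = m.integer_var_list(3, lb=0, ub=10, name="x")
total_cost = m.sum(c * v for c, v in zip([3, 5, 7], x))
m.add_constraint(m.sum(x) >= 4)      # stand-in for the real constraints
m.minimize(total_cost)

known_obj = 12.0                     # objective value of the solution you believe is feasible
eps = 1e-3                           # small slack so rounding does not cut it off
m.add_constraint(total_cost <= known_obj + eps, ctname="objective_bound")

sol = m.solve()
if sol is None:
    print("Infeasible: the claimed solution violates the model as formulated.")
else:
    print("Objective:", sol.objective_value)

If the solve with the bound comes back infeasible, the "better" solution is not actually feasible for the formulation, which points to a modelling bug rather than a tolerance problem.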
All in all, try to debug the model and let us know. It is a fairly complicated model, and it is easy to have missed something (e.g., adding a constraint that you did not mean to add).
If all this fails and you still find yourself wondering what is going on, you might want to submit your model to IBM's official forum. If there is indeed a bug in the solver, they will take care of it and let you know.
I hope this helps

Loading first few observations of data set without reading entire data set (Stata 13.1)?

(Stata/MP 13.1)
I am working with a set of massive data sets that take an extremely long time to load. I am currently looping through all the data sets to load them each time.
Is it possible to just tell Stata to load in the first 5 observations of each dataset (or, in general, the first n observations in each use command) without actually having to load the entire data set? Otherwise, if I load the entire data set and then just keep the first 5 observations, the process takes an extremely long time.
Here are two workarounds I have already tried:
use in 1/5 using mydata: I think this is more efficient than loading the data and then keeping the observations you want in a separate line, but I think it still reads in the entire data set.
First load all the data sets, then save copies of all the data sets with just the first 5 observations, and then use the copies: this is cumbersome as I have a lot of different files; I would much prefer a direct way to read in the first 5 observations without having to resort to this method and without having to read the entire data set.
I'd say using in is the natural way to do this in Stata, but testing shows
you are correct: it really makes no "big" difference, given the size of the data set. An example is (with 148,000,000 observations)
sysuse auto, clear
expand 2000000
tempfile bigfile
save "`bigfile'", replace
clear
timer on 1
use "`bigfile'"
timer off 1
clear
timer on 2
use "`bigfile'" in 1/5
timer off 2
timer list
timer clear
Resulting in
. timer list
1: 6.44 / 1 = 6.4400
2: 4.85 / 1 = 4.8480
I find that surprising since in seems really efficient in other contexts.
I would contact Stata Tech support (and/or search around, including www.statalist.com) just to ask why in isn't much faster
(independently of you finding some other strategy to handle this problem).
It's worth using, of course; but not fast enough for many applications.
In terms of workflow, your second option might be the best. Leave the computer running while the smaller datasets are created (use a loop over the files), and get back to your regular coding/debugging once that's finished. This really depends on what you're doing, so it may or may not work.
Actually, I found the solution. If you run
use mybigdata if runiform() <= 0.0001
Stata will take a random sample of 0.0001 of the data set without reading the entire data set.
Thanks!
Vincent
Edit: 4/28/2015 (1:58 PM EST)
My apologies. It turns out the above was actually not a solution to the original question. It seems that on my system there was high variability in the speed of using
use mybigdata if runiform() <= 0.0001
each time I ran it. When I posted that the above was a solution, the run just happened to be a faster instance. However, now that I am repeatedly running
use mybigdata if runiform() <= 0.0001
vs.
use in 1/5 using mydata
I am actually finding that
use in 1/5 using mydata
is on average faster.
In general, my question is simply how to read in a portion of a Stata data set, for computational purposes, without having to read in the entire data set, especially when the data set is really large.
Edit: 4/28/2015 (2:50 PM EST)
In total, I have 20 datasets, each with between 5 and 15 million observations. I only need to keep 8 of the variables (there are 58-65 variables in each data set). Below is the output from the first four "describe, short" statements.
2004 action1
Contains data from 2004action1.dta
obs: 15,039,576
vars: 64 30 Oct 2014 17:09
size: 2,827,440,288
Sorted by:
2004 action2578
Contains data from 2004action2578.dta
obs: 13,449,087
vars: 59 30 Oct 2014 17:16
size: 2,098,057,572
Sorted by:
2005 action1
Contains data from 2005action1.dta
obs: 15,638,296
vars: 65 30 Oct 2014 16:47
size: 3,143,297,496
Sorted by:
2005 action2578
Contains data from 2005action2578.dta
obs: 14,951,428
vars: 59 30 Oct 2014 17:03
size: 2,362,325,624
Sorted by:
Thanks!
Vincent