I have a GeoPandas dataframe with NaN values in a column. I want to impute each NaN with the average value of its neighbors. I made up the following example and would appreciate it if somebody could help me with the final steps. Thanks.
# Load libraries
import numpy as np
import pandas as pd
import geopandas as gpd
from libpysal.weights.contiguity import Queen
# Make data
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
africa = world[world['continent'] == 'Africa'].copy()  # copy so the assignments below don't hit a view
africa.reset_index(inplace=True, drop=True)
africa.loc[[2,8,15,22,30,35,39,43],'pop_est'] = np.nan # Make NaN value for pop_est
africa
# Generate weight
w = Queen.from_dataframe(africa)
w.neighbors[2] # Check neighbors of index 2
For example, index 2 has a missing population estimate and its neighbors are [0, 35, 36, 48, 49, 50, 27, 28, 31]. I want to replace the NaN with the mean population estimate of those neighbors. Thanks.
I finally figured out how to do it.
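One way to do it is to loop over the rows with a missing pop_est, look up their neighbor ids in w.neighbors, and fill each NaN with the neighbors' mean. A minimal sketch, assuming the africa GeoDataFrame and the Queen weights w from the question:
missing = africa.index[africa['pop_est'].isna()]
for idx in missing:
    neighbor_ids = w.neighbors[idx]                      # contiguous neighbors of this row
    neighbor_vals = africa.loc[neighbor_ids, 'pop_est']  # their population estimates
    # mean() skips NaN neighbors; if every neighbor is NaN the value stays NaN
    africa.loc[idx, 'pop_est'] = neighbor_vals.mean()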
I have a log like this:
API code n.º 1111111111, registered transaction
NAT Code: 8500/1500
6 value
User code: 51000
Start Time 18-09-2019 22:45:59 CET [18-09-2019 16:45:59 ET]
End Time 18-09-2019 23:00:47 CET [18-09-2019 17:00:47 ET]
1: cod_user1 (online), 19.236 (99%)
2: cod_user2 (online), 5.244 (88%)
3: cod_user3 (online),
4: cod_user4 (online),
5: cod_user5 (offline),
6: cod_user6 (offline),
Queue 542. End transaction.
API code n.º 2222222222, registered transaction
NAT Code: 8500/1500
6 value
User code: 51000
Start Time 18-09-2019 22:45:59 CET [18-09-2019 16:45:59 ET]
End Time 18-09-2019 23:00:47 CET [18-09-2019 17:00:47 ET]
1: cod_user1 (online), 19.236 (99%)
2: cod_user2 (online), 5.244 (88%)
3: cod_user3 (online),
4: cod_user4 (online),
5: cod_user5 (offline),
6: cod_user6 (offline),
Queue 542. End transaction.
API code n.º 3333333333, registered transaction
NAT Code: 8500/1500
6 value
User code: 51000
Start Time 18-09-2019 22:45:59 CET [18-09-2019 16:45:59 ET]
End Time 18-09-2019 23:00:47 CET [18-09-2019 17:00:47 ET]
1: cod_user1 (online), 19.236 (99%)
2: cod_user2 (online), 5.244 (88%)
3: cod_user3 (online),
4: cod_user4 (online),
5: cod_user5 (offline),
6: cod_user6 (offline),
Queue 542. End transaction.
There are N iterations, and I need to extract the data as a tab-separated list, something like:
apicode nat code value Start date End date Queue
1111111111 8500/1500 6 value 18-09-2019 22:45:59 18-09-2019 23:00:47 542
2222222222 8500/1500 6 value 18-09-2019 22:45:59 18-09-2019 23:00:47 542
and so on ....
I also need to extract the complete list of users, with their status and data, for every iteration of the API code, like this (tab-separated):
apicode user status data eff
1111111111 cod_user1 online 19.236 99
1111111111 cod_user2 online 5.244 88
1111111111 cod_user3 online
1111111111 cod_user4 online
1111111111 cod_user5 offline
1111111111 cod_user6 offline
2222222222 cod_user1 online 19.236 99
2222222222 cod_user2 online 5.244 88
2222222222 cod_user3 online
2222222222 cod_user4 online
2222222222 cod_user5 offline
2222222222 cod_user6 offline
3333333333 cod_user1 online 19.236 99
3333333333 cod_user2 online 5.244 88
3333333333 cod_user3 online
3333333333 cod_user4 online
3333333333 cod_user5 offline
3333333333 cod_user6 offline
Is this possible with a regular expression?
What I've got so far:
https://regex101.com/r/1bpioM/1
I have a first script and I've got the list using substitutions, BUT the script is adding a line break between each row.
\1\t\2\t\3\t\4\t\5\t\6\t\10
1111111111 8500/1500 6 51000 18-09-2019 22:45:59 18-09-2019 23:00:47 542
2222222222 8500/1500 6 51000 18-09-2019 22:45:59 18-09-2019 23:00:47 542
3333333333 8500/1500 6 51000 18-09-2019 22:45:59 18-09-2019 23:00:47 542
What I have not achieved is the list of users, since the script only locates the first user of each item in the log.
Could you help me to revise this?
Thank you
Assuming the text looks like that, you can highlight the interesting parts with:
(?<=API\scode\sn\.º\s)\d{10}|(?<=NAT\sCode:\s)\d{4}\/\d{4}|\d\svalue|(?<=User\scode:\s)\d{5}|(?<=(Start|End)\sTime\s)\d{2}-\d{2}-\d{4}\s\d{2}:\d{2}:\d{2}\sCET|(?<=Queue\s)\d{3}
It selects every piece that is preceded by a specific string (via lookbehind). The "pieces" are formatted like the ones in your file, and so are the "specific strings".
See a better explanation here.
Check whether it works with your engine, because it uses constructs (positive lookbehind) that are not always supported.
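If the lookbehinds are a problem, a short Python script with the re module can produce both tab-separated lists directly. A minimal sketch, assuming the log sits in a file called transactions.log (the file name and the exact field patterns are assumptions based on the sample above; header rows are left out):
import re

text = open("transactions.log", encoding="utf-8").read()

# One match per "API code ... End transaction." block
block_re = re.compile(
    r"API code n\.º\s*(?P<api>\d+).*?"
    r"NAT Code:\s*(?P<nat>\S+).*?"
    r"(?P<value>\d+)\s+value.*?"
    r"Start Time\s+(?P<start>\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2}).*?"
    r"End Time\s+(?P<end>\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2}).*?"
    r"Queue\s+(?P<queue>\d+)\. End transaction\.",
    re.S,
)
# One match per numbered user line inside a block; data and eff are optional
user_re = re.compile(
    r"^\d+:\s*(?P<user>\S+)\s*\((?P<status>online|offline)\),\s*"
    r"(?:(?P<data>[\d.]+)\s*\((?P<eff>\d+)%\))?",
    re.M,
)

summary_rows = []
user_rows = []
for m in block_re.finditer(text):
    summary_rows.append("\t".join([m["api"], m["nat"], m["value"], m["start"], m["end"], m["queue"]]))
    for u in user_re.finditer(m.group(0)):
        user_rows.append("\t".join([m["api"], u["user"], u["status"], u["data"] or "", u["eff"] or ""]))

print("\n".join(summary_rows))   # first list: one row per API code
print("\n".join(user_rows))      # second list: one row per user and API code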
I have a dataframe with repeat values in column A. I want to drop duplicates, keeping the row with the highest value in column B.
So this:
A B
1 10
1 20
2 30
2 40
3 10
Should turn into this:
A B
1 20
2 40
3 10
I'm guessing there's probably an easy way to do this—maybe as easy as sorting the DataFrame before dropping duplicates—but I don't know groupby's internal logic well enough to figure it out. Any suggestions?
This takes the last. Not the maximum though:
In [10]: df.drop_duplicates(subset='A', keep="last")
Out[10]:
A B
1 1 20
3 2 40
4 3 10
You can also do something like:
In [12]: df.groupby('A', group_keys=False).apply(lambda x: x.loc[x.B.idxmax()])
Out[12]:
A B
A
1 1 20
2 2 40
3 3 10
The top answer is doing too much work and looks to be very slow for larger data sets. apply is slow and should be avoided if possible. ix is deprecated and should be avoided as well.
df.sort_values('B', ascending=False).drop_duplicates('A').sort_index()
A B
1 1 20
3 2 40
4 3 10
Or simply group by all the other columns and take the max of the column you need. df.groupby('A', as_index=False).max()
Simplest solution:
To drop duplicates based on one column:
df = df.drop_duplicates('column_name', keep='last')
To drop duplicates based on multiple columns:
df = df.drop_duplicates(['col_name1','col_name2','col_name3'], keep='last')
I would sort the dataframe first by column B descending, then drop duplicates on column A and keep the first:
df = df.sort_values(by='B', ascending=False)
df = df.drop_duplicates(subset='A', keep="first")
without any groupby
Try this:
df.groupby(['A']).max()
I was brought here by a link from a duplicate question.
For just two columns, wouldn't it be simpler to do:
df.groupby('A')['B'].max().reset_index()
And to retain a full row (when there are more columns, which is what the "duplicate question" that brought me here was asking):
df.loc[df.groupby(...)[column].idxmax()]
For example, to retain the full row where 'C' takes its max, for each group of ['A', 'B'], we would do:
out = df.loc[df.groupby(['A', 'B'])['C'].idxmax()]
When there are relatively few groups (i.e., lots of duplicates), this is faster than the drop_duplicates() solution (less sorting):
Setup:
n = 1_000_000
df = pd.DataFrame({
'A': np.random.randint(0, 20, n),
'B': np.random.randint(0, 20, n),
'C': np.random.uniform(size=n),
'D': np.random.choice(list('abcdefghijklmnopqrstuvwxyz'), size=n),
})
(Adding sort_index() so both approaches return identically ordered results):
%timeit df.loc[df.groupby(['A', 'B'])['C'].idxmax()].sort_index()
# 101 ms ± 98.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.sort_values(['C', 'A', 'B'], ascending=False).drop_duplicates(['A', 'B']).sort_index()
# 667 ms ± 784 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
I think in your case you don't really need a groupby. I would sort your B column in descending order, then drop duplicates on column A; if you want, you can also get a nice, clean new index like this:
df.sort_values('B', ascending=False).drop_duplicates('A').sort_index().reset_index(drop=True)
Easiest way to do this:
# First sort the DataFrame by column A ascending and column B descending,
# then drop the duplicate values in column A.
# Optionally, reset the index to get a clean DataFrame again.
# All of this is shown in one step below.
d = {'A': [1,1,2,3,1,2,3,1], 'B': [30, 40,50,42,38,30,25,32]}
df = pd.DataFrame(data=d)
df
A B
0 1 30
1 1 40
2 2 50
3 3 42
4 1 38
5 2 30
6 3 25
7 1 32
df = df.sort_values(['A','B'], ascending =[True,False]).drop_duplicates(['A']).reset_index(drop=True)
df
A B
0 1 40
1 2 50
2 3 42
You can try this as well
df.drop_duplicates(subset='A', keep='last')
I took this from https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
Here's a variation I had to solve that's worth sharing: for each unique string in columnA I wanted to find the most common associated string in columnB.
df.groupby('columnA').agg({'columnB': lambda x: x.mode().any()}).reset_index()
The .any() picks one if there's a tie for the mode. (Note that using .any() on a Series of ints returns a boolean rather than picking one of them.)
For the original question, the corresponding approach simplifies to
df.groupby('columnA').columnB.agg('max').reset_index()
While the already-given answers cover the question, I made a small change by adding the name of the column on which the max() function is applied, for better code readability.
df.groupby('A', as_index=False)['B'].max()
A very similar method to the selected answer, but sorting the data frame by multiple columns might be an easier way to code it.
First, sort the data frame by both the "A" and "B" columns; ascending=False ensures it is ranked from the highest value to the lowest:
df.sort_values(["A", "B"], ascending=False, inplace=True)
Then, drop duplication and keep only the first item, which is already the one with the highest value:
df.drop_duplicates(subset="A", keep="first", inplace=True)
This also works:
a = pd.DataFrame({'A': a.groupby('A')['B'].max().index, 'B': a.groupby('A')['B'].max().values})
I am not going to give you the whole answer (I don't think you're looking for the parsing and writing-to-file part anyway), but a pivotal hint should suffice: use Python's set() function, and then sorted() or .sort() coupled with .reverse():
>>> a=sorted(set([10,60,30,10,50,20,60,50,60,10,30]))
>>> a
[10, 20, 30, 50, 60]
>>> a.reverse()
>>> a
[60, 50, 30, 20, 10]
I'm learning the C++ API of CBC, and I'm having trouble matching the performance of the CBC command line utility: a compiled C++ program that loads an MPS file and solves it with the CbcModel class is far slower than simply opening the command line utility, importing the same file and using solve. The command line utility solves the MIP in 1 second, while the C++ program doesn't terminate within 10 minutes.
I figure the problem is that when using the C++ API I have to configure all the parameters explicitly, and it seems that the default parameters used by the command line utility are pretty well rounded for your average MIP model.
Is there a list of the default parameters for presolve, heuristics and cuts that are used by the command line utility and that I should activate in my C++ program to match its performance? Maybe someone has played around with these parameters and found a good set empirically.
The C++ program is:
#include <cassert>
#include <cstdio>
#include "OsiClpSolverInterface.hpp"
#include "CbcModel.hpp"
#include "CglClique.hpp"
#include "CglMixedIntegerRounding.hpp"

int main ()
{
OsiClpSolverInterface solver1;
solver1.setLogLevel(0);
// Read in example model in MPS file format
// and assert that it is a clean model
int numMpsReadErrors = solver1.readMps("generic_mip.mps","");
assert(numMpsReadErrors==0);
// Pass the solver with the problem to be solved to CbcModel
CbcModel model(solver1);
model.setLogLevel(0);
// Add clique cut generator
CglClique clique_generator;
model.addCutGenerator(&clique_generator,-1, "Clique");
// Add mixed integer rounding cut generator
CglMixedIntegerRounding mixedGen;
model.addCutGenerator(&mixedGen, -1, "Rounding");
model.setNumberThreads(4);
model.messageHandler()->setPrefix(false);
model.branchAndBound();
// bestSolution() holds the variable values; the objective value is queried separately
const double * solution = model.bestSolution();
printf("Optimal value is %.2f\n", model.getObjValue());
return 0;
}
The MIP model in question can be downloaded from HERE. Optimal objective value: -771.2957.
Cbc command line utility log that indicates all kinds of advanced features are activated (preprocessing, primal heuristics and strong branching):
Continuous objective value is -798.689 - 0.03 seconds
Cgl0002I 21 variables fixed
Cgl0003I 0 fixed, 175 tightened bounds, 1972 strengthened rows, 0 substitutions
Cgl0004I processed model has 3731 rows, 3835 columns (3835 integer (3660 of which binary)) and 37873 elements
Cbc0038I Initial state - 365 integers unsatisfied sum - 129.125
Cbc0038I Pass 1: (0.18 seconds) suminf. 58.66667 (121) obj. -572.133 iterations 510
Cbc0038I Pass 2: (0.18 seconds) suminf. 58.66667 (121) obj. -572.133 iterations 23
Cbc0038I Pass 3: (0.18 seconds) suminf. 58.66667 (121) obj. -572.133 iterations 1
Cbc0038I Pass 4: (0.20 seconds) suminf. 69.00000 (138) obj. -299.496 iterations 589
Cbc0038I Pass 5: (0.20 seconds) suminf. 54.00000 (109) obj. -287.063 iterations 194
Cbc0038I Pass 6: (0.21 seconds) suminf. 54.00000 (109) obj. -287.063 iterations 12
Cbc0038I Pass 7: (0.21 seconds) suminf. 49.00000 (100) obj. -273.321 iterations 33
Cbc0038I Pass 8: (0.22 seconds) suminf. 48.00000 (97) obj. -269.421 iterations 14
Cbc0038I Pass 9: (0.22 seconds) suminf. 48.00000 (98) obj. -268.624 iterations 8
Cbc0038I Pass 10: (0.23 seconds) suminf. 48.00000 (97) obj. -264.813 iterations 4
Cbc0038I Pass 11: (0.23 seconds) suminf. 47.00000 (94) obj. -261.75 iterations 8
Cbc0038I Pass 12: (0.24 seconds) suminf. 47.00000 (94) obj. -261.75 iterations 3
Cbc0038I Pass 13: (0.24 seconds) suminf. 47.00000 (94) obj. -261.75 iterations 3
Cbc0038I Pass 14: (0.25 seconds) suminf. 57.75000 (118) obj. -103.115 iterations 508
Cbc0038I Pass 15: (0.26 seconds) suminf. 49.00000 (98) obj. -97.4793 iterations 163
Cbc0038I Pass 16: (0.26 seconds) suminf. 49.00000 (98) obj. -97.4793 iterations 3
Cbc0038I Pass 17: (0.27 seconds) suminf. 48.75000 (98) obj. -101.421 iterations 24
Cbc0038I Pass 18: (0.27 seconds) suminf. 47.00000 (94) obj. -103.346 iterations 25
Cbc0038I Pass 19: (0.28 seconds) suminf. 47.00000 (94) obj. -103.346 iterations 2
Cbc0038I Pass 20: (0.28 seconds) suminf. 47.00000 (94) obj. -103.346 iterations 21
Cbc0038I Pass 21: (0.29 seconds) suminf. 51.50000 (107) obj. 60.0315 iterations 469
Cbc0038I Pass 22: (0.30 seconds) suminf. 40.00000 (80) obj. 59.913 iterations 168
Cbc0038I Pass 23: (0.30 seconds) suminf. 40.00000 (80) obj. 59.913 iterations 2
Cbc0038I Pass 24: (0.31 seconds) suminf. 39.50000 (79) obj. 59.913 iterations 27
Cbc0038I Pass 25: (0.31 seconds) suminf. 39.00000 (78) obj. 59.913 iterations 23
Cbc0038I Pass 26: (0.32 seconds) suminf. 39.00000 (78) obj. 59.913 iterations 13
Cbc0038I Pass 27: (0.33 seconds) suminf. 50.00000 (101) obj. 124.699 iterations 504
Cbc0038I Pass 28: (0.34 seconds) suminf. 41.00000 (82) obj. 118.624 iterations 174
Cbc0038I Pass 29: (0.34 seconds) suminf. 41.00000 (82) obj. 118.624 iterations 5
Cbc0038I Pass 30: (0.34 seconds) suminf. 41.00000 (82) obj. 118.624 iterations 19
Cbc0038I No solution found this major pass
Cbc0038I Before mini branch and bound, 2356 integers at bound fixed and 0 continuous
Cbc0038I Mini branch and bound did not improve solution (0.41 seconds)
Cbc0038I After 0.41 seconds - Feasibility pump exiting - took 0.25 seconds
Cbc0031I 583 added rows had average density of 8.2024014
Cbc0013I At root node, 583 cuts changed objective from -798.68913 to -771.29565 in 10 passes
Cbc0014I Cut generator 0 (Probing) - 541 row cuts average 2.0 elements, 0 column cuts (0 active) in 0.044 seconds - new frequency is 1
Cbc0014I Cut generator 1 (Gomory) - 751 row cuts average 116.6 elements, 0 column cuts (0 active) in 0.108 seconds - new frequency is 1
Cbc0014I Cut generator 2 (Knapsack) - 451 row cuts average 2.0 elements, 0 column cuts (0 active) in 0.040 seconds - new frequency is 1
Cbc0014I Cut generator 3 (Clique) - 0 row cuts average 0.0 elements, 0 column cuts (0 active) in 0.004 seconds - new frequency is -100
Cbc0014I Cut generator 4 (MixedIntegerRounding2) - 155 row cuts average 16.9 elements, 0 column cuts (0 active) in 0.028 seconds - new frequency is 1
Cbc0014I Cut generator 5 (FlowCover) - 0 row cuts average 0.0 elements, 0 column cuts (0 active) in 0.008 seconds - new frequency is -100
Cbc0014I Cut generator 6 (TwoMirCuts) - 1171 row cuts average 20.0 elements, 0 column cuts (0 active) in 0.068 seconds - new frequency is 1
Cbc0010I After 0 nodes, 1 on tree, 1e+50 best solution, best possible -771.29565 (1.18 seconds)
Cbc0004I Integer solution of -771.29565 found after 2671 iterations and 1 nodes (1.24 seconds)
Cbc0001I Search completed - best objective -771.2956521739131, took 2671 iterations and 1 nodes (1.24 seconds)
Cbc0032I Strong branching done 22 times (542 iterations), fathomed 0 nodes and fixed 0 variables
Cbc0035I Maximum depth 0, 0 variables fixed on reduced cost
Cuts at root node changed objective from -798.689 to -771.296
Probing was tried 12 times and created 552 cuts of which 0 were active after adding rounds of cuts (0.044 seconds)
Gomory was tried 12 times and created 756 cuts of which 0 were active after adding rounds of cuts (0.116 seconds)
Knapsack was tried 12 times and created 456 cuts of which 0 were active after adding rounds of cuts (0.044 seconds)
Clique was tried 10 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.004 seconds)
MixedIntegerRounding2 was tried 12 times and created 155 cuts of which 0 were active after adding rounds of cuts (0.036 seconds)
FlowCover was tried 10 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.008 seconds)
TwoMirCuts was tried 12 times and created 1197 cuts of which 0 were active after adding rounds of cuts (0.084 seconds)
ImplicationCuts was tried 2 times and created 11 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
Result - Optimal solution found
Objective value: -771.29565217
Enumerated nodes: 1
Total iterations: 2671
Time (CPU seconds): 1.27
Time (Wallclock seconds): 1.30
I was able to use the same settings as the command line utility by calling the solver with the following code:
const char *argv[] = {"", "-solve"};
CbcMain1(2, argv, model);
Of course, you can first set the log level, the number of threads, etc. This way you do not have to copy the code from CbcSolver.cpp, as one might conclude from sascha's answer.
Maybe this part of the official code helps. Its doc comment is called "Set up likely cut generators and defaults".
CBC's code is hard to read and it's hard to analyze what kind of default behaviour is there without investing some time.
But the linked code above looks a bit like the defaults activated within some cmd-call.
Which compiler do you use?
Is debugging enabled, or optimization disabled?
E.g. for Visual Studio, this makes a huge difference in performance and might be the reason that your compiled code is much slower.
I am using pandas to handle some time series data. I have a data frame in the following format:
Date Time Reading
552726 2016/08/01 0: 0: 0 17.28
552727 2016/08/01 0: 0: 5 17.28
552728 2016/08/01 0: 0:10 17.21
552729 2016/08/01 0: 0:15 17.16
552730 2016/08/01 0: 0:20 17.11
552731 2016/08/01 0: 0:25 17.08
552732 2016/08/01 0: 0:30 17.18
552733 2016/08/01 0: 0:35 17.18
etc...
I want to average the Reading column over a 10 minute window, moving this window across the time series. Then I want the data frame to be updated with the new averaged values and the corresponding timestamps, so it would look like this:
Date Time Reading
552726 2016/08/01 0: 0: 0 17.30
552727 2016/08/01 0: 10:0 17.35
552728 2016/08/01 0: 20:0 17.20
etc...
What is the best way to do this in pandas? I tried the rolling-mean method, setting a frequency for the rolling window, but then I have to rebuild the data frame with new timestamps myself, and I think there's a cleaner, easier way to do this.
Thank you, and please let me know if I can clarify things better.
Given your data, say I wanted to calculate the average over 15-second intervals.
I simply did:
#frame contains your data
n_obs = 3
result = frame.rolling(window = n_obs, min_periods = 1).mean().iloc[::n_obs,:]
# Date Time Reading
# 0 2016/08/01 0: 0: 0 17.280000
# 3 2016/08/01 0: 0:15 17.216667
# 6 2016/08/01 0: 0:30 17.123333
The main "trick" is selecting only the observations at indices that are multiples of n_obs.
This should work for you using n_obs = 120, although it implies calculating many more averages than you actually need.
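If you would rather bin by the actual timestamps than by a row count, resample is another option. A minimal sketch, offered as an alternative to the rolling approach above; it assumes the Date and Time columns look exactly like the sample in the question:
import pandas as pd

frame = frame.copy()
# Build a real timestamp; the Time values like "0: 0: 5" contain stray spaces
frame['Timestamp'] = pd.to_datetime(
    frame['Date'] + ' ' + frame['Time'].str.replace(' ', ''),
    format='%Y/%m/%d %H:%M:%S',
)
# One mean per 10-minute bin, labelled with the start of the bin
result = (
    frame.set_index('Timestamp')['Reading']
         .resample('10min')
         .mean()
         .reset_index()
)
If you need the original Date/Time layout back, you can split result['Timestamp'] with dt.strftime afterwards.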