Given the following table, I want to create another table with the total video_duration and the total duration of videos that have events.
A video can have numerous events, but the duration is of course the same per video file; hence I can have rows for the same video with different events, but the video duration stays the same.
input:

filename  event  video_duration
A         RUN    20
A         WALK   20
B         FIGHT  10
B         RUN    10
C                30
D         WALK   25
D         FALL   25
E                15
desired output:

total_videos_duration  videos_with_events_duration
100                    55
What I've tried:
I created a calculated field

C_total_videos_duration = sum(max({video_duration}, [{filename}]))

which gave me the desired output (100). But, for god's sake, I can't figure out how to get the "videos with events duration".
I have tried:

sumIf(max({video_duration}, [{filename}]), isNotNull({event}))
ERROR: the calculation operated on LAC agg expressions is not valid

sum(maxIf({video_duration}, isNotNull({event})), [{filename}])
ERROR: Nesting of aggregate functions like NESTED_SUM and NESTED_SUM(MAX(CASE WHEN "id" IS NOT NULL THEN video_duration ELSE NULL END), filename) is not allowed

ifelse(isNotNull({event}), sum(max({video_duration}, [{filename}])), 0)
ERROR: Mismatched aggregation. Custom aggregations can't contain both aggregate SUM and non-aggregated fields SUM(NESTED_MAX(video_duration, filename)) in any combination
The only thing that partially works is

sumOver(maxIf({video_duration}, isNotNull(id)), [filename], POST_AGG_FILTER)
but here I get:
filename  total_videos_duration  videos_with_events_duration
A         20                     20
B         10                     10
C         30
D         25                     25
E         15
Total     100                    55
I don't want this output because I have A LOT of videos; I just want to get the total durations.
thank you!
I figured it out just now!
I did

sum(max(ifelse(isNotNull(id), {video_duration}, 0), [filename]))

and it worked. Thank you, Stack!
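If you want to sanity-check the logic outside QuickSight, here is a minimal pandas sketch of the same two totals, built from the sample table above (column names taken from the question; everything else is illustrative):

import pandas as pd

df = pd.DataFrame({
    "filename":       ["A", "A", "B", "B", "C", "D", "D", "E"],
    "event":          ["RUN", "WALK", "FIGHT", "RUN", None, "WALK", "FALL", None],
    "video_duration": [20, 20, 10, 10, 30, 25, 25, 15],
})

# Duration is constant per file, so take the max per filename, then sum
total_videos_duration = df.groupby("filename")["video_duration"].max().sum()  # 100

# Same thing, restricted to files that have at least one event
with_events = df.dropna(subset=["event"])
videos_with_events_duration = with_events.groupby("filename")["video_duration"].max().sum()  # 55

print(total_videos_duration, videos_with_events_duration)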
I have this scenario, in which I have 3 "solutions" (s1, s2, s3); each solution has an "impact" % on the original value.

Value = 100

I want to calculate the percentage of s1 from Value, then the percentage of s2 from what remains, and so on for s3. For example:

STEP 1: 50% (s1) of 100 = 50
STEP 2: then 50% (s2) of WHAT IS REMAINING from STEP 1 = 50% of 50 = 25
STEP 3: then 50% (s3) of WHAT IS REMAINING from STEP 2 = 50% of 25 = 12.5

solution  impact  value  explanation
s1        50%     50     50% of the original 100 value
s2        50%     25     50% of the "last row" value, which is 50, so 50% of 50 = 25
s3        50%     12.5   and now 50% of 25

How do I build a DAX measure for this, please? Or is there another way to make this work?
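This isn't the DAX itself, but here is the arithmetic such a measure has to implement, as a minimal Python sketch (the processing order of the solutions is an assumption; each step's base is whatever remains after all previous steps):

value = 100.0
impacts = {"s1": 0.50, "s2": 0.50, "s3": 0.50}

remaining = value
for name, impact in impacts.items():
    step = remaining * impact   # this solution's share of what is left
    remaining -= step           # the next solution works on the remainder
    print(name, step)           # s1 50.0, s2 25.0, s3 12.5

Equivalently, step_i = Value * impact_i * product over all earlier j of (1 - impact_j), which is the closed form a DAX measure could compute with a running product over the preceding rows.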
I have a dataframe with repeat values in column A. I want to drop duplicates, keeping the row with the highest value in column B.
So this:
A B
1 10
1 20
2 30
2 40
3 10
Should turn into this:
A B
1 20
2 40
3 10
I'm guessing there's probably an easy way to do this—maybe as easy as sorting the DataFrame before dropping duplicates—but I don't know groupby's internal logic well enough to figure it out. Any suggestions?
This takes the last. Not the maximum though:
In [10]: df.drop_duplicates(subset='A', keep="last")
Out[10]:
A B
1 1 20
3 2 40
4 3 10
You can also do something like:
In [12]: df.groupby('A', group_keys=False).apply(lambda x: x.loc[x.B.idxmax()])
Out[12]:
A B
A
1 1 20
2 2 40
3 3 10
The top answer is doing too much work and looks to be very slow for larger data sets. apply is slow and should be avoided if possible. ix is deprecated and should be avoided as well.
df.sort_values('B', ascending=False).drop_duplicates('A').sort_index()
A B
1 1 20
3 2 40
4 3 10
Or simply group by all the other columns and take the max of the column you need. df.groupby('A', as_index=False).max()
Simplest solution:
To drop duplicates based on one column:
df = df.drop_duplicates('column_name', keep='last')
To drop duplicates based on multiple columns:
df = df.drop_duplicates(['col_name1','col_name2','col_name3'], keep='last')
I would sort the dataframe first with Column B descending, then drop duplicates for Column A and keep first
df = df.sort_values(by='B', ascending=False)
df = df.drop_duplicates(subset='A', keep="first")
without any groupby
Try this:
df.groupby(['A']).max()
I was brought here by a link from a duplicate question.
For just two columns, wouldn't it be simpler to do:
df.groupby('A')['B'].max().reset_index()
And to retain a full row (when there are more columns, which is what the "duplicate question" that brought me here was asking):
df.loc[df.groupby(...)[column].idxmax()]
For example, to retain the full row where 'C' takes its max, for each group of ['A', 'B'], we would do:
out = df.loc[df.groupby(['A', 'B'])['C'].idxmax()]
When there are relatively few groups (i.e., lots of duplicates), this is faster than the drop_duplicates() solution (less sorting):
Setup:

import numpy as np
import pandas as pd

n = 1_000_000
df = pd.DataFrame({
    'A': np.random.randint(0, 20, n),
    'B': np.random.randint(0, 20, n),
    'C': np.random.uniform(size=n),
    'D': np.random.choice(list('abcdefghijklmnopqrstuvwxyz'), size=n),
})
(Adding sort_index() to ensure the two solutions produce identical output):
%timeit df.loc[df.groupby(['A', 'B'])['C'].idxmax()].sort_index()
# 101 ms ± 98.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.sort_values(['C', 'A', 'B'], ascending=False).drop_duplicates(['A', 'B']).sort_index()
# 667 ms ± 784 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
I think in your case you don't really need a groupby. I would sort column B in descending order, then drop duplicates on column A; if you want, you can also get a nice, clean index like this:
df.sort_values('B', ascending=False).drop_duplicates('A').sort_index().reset_index(drop=True)
Easiest way to do this:
# First, sort the DataFrame by column A (ascending) and column B (descending)
# Then drop the duplicate values in column A
# Optional: reset the index to get a nice data frame again
# I'm going to show it all in one step.
d = {'A': [1,1,2,3,1,2,3,1], 'B': [30, 40,50,42,38,30,25,32]}
df = pd.DataFrame(data=d)
df
A B
0 1 30
1 1 40
2 2 50
3 3 42
4 1 38
5 2 30
6 3 25
7 1 32
df = df.sort_values(['A', 'B'], ascending=[True, False]).drop_duplicates(['A']).reset_index(drop=True)
df
A B
0 1 40
1 2 50
2 3 42
You can try this as well
df.drop_duplicates(subset='A', keep='last')
I took this from https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
Here's a variation I had to solve that's worth sharing: for each unique string in columnA I wanted to find the most common associated string in columnB.
df.groupby('columnA').agg({'columnB': lambda x: x.mode().any()}).reset_index()
The .any() picks one if there's a tie for the mode. (Note that using .any() on a Series of ints returns a boolean rather than picking one of them.)
For the original question, the corresponding approach simplifies to
df.groupby('columnA').columnB.agg('max').reset_index().
The already-given posts answer the question; I just made a small change by naming the column the max() function is applied to, for better code readability.
df.groupby('A', as_index=False)['B'].max()
Very similar to the selected answer, but sorting the data frame by multiple columns might be an easier way to code.

First, sort the data frame by both the "A" and "B" columns; ascending=False ensures it is ranked from highest value to lowest:

df.sort_values(["A", "B"], ascending=False, inplace=True)

Then drop the duplicates on column "A" and keep only the first item, which is already the one with the highest value:

df.drop_duplicates(subset="A", inplace=True)
This also works:

g = a.groupby('A')['B'].max()
a = pd.DataFrame({'A': g.index, 'B': g.values})
I am not going to give you the whole answer (I don't think you're looking for the parsing and writing-to-file part anyway), but a pivotal hint should suffice: use Python's set() function, and then sorted() or .sort() coupled with .reverse():
>>> a=sorted(set([10,60,30,10,50,20,60,50,60,10,30]))
>>> a
[10, 20, 30, 50, 60]
>>> a.reverse()
>>> a
[60, 50, 30, 20, 10]
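For what it's worth, the reverse sort can also be done in a single call:

>>> sorted(set([10, 60, 30, 10, 50, 20, 60, 50, 60, 10, 30]), reverse=True)
[60, 50, 30, 20, 10]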
I have intra-day price data for stock trades and need to write code to find the instances where the following condition is met: the price goes up for at least 10 consecutive trades.
Here is a sample of my data (time is the number of minutes into the day: if it's 1 am, my time will be 60; if it's 2 am, 120; and so on):
Obs Time Symbol Price
1 288 AA 36.2800
2 304 AA 36.2800
3 305 AA 36.3400
4 307 AA 36.2800
5 311 AA 36.1500
6 337 AA 36.2000
How can I write this code? A loop is probably necessary, but I cannot figure it out. Thank you.
Assuming no missing values, something like:
data want ;
  set have ;
  lagPrice = lag(Price) ;  /* previous trade's price */
  /* running count of consecutive increases (the sum statement retains it) */
  if Price > lagPrice and not missing(lagPrice) then Increasing + 1 ;
  else Increasing = 0 ;
  if Increasing >= 10 then Trend = 1 ;  /* 10th consecutive increase and beyond */
run ;
That will flag the 10th record of an increasing trend, and all those after. Is that what you want? Or are you looking for a way to flag all records involved in the trend? Or something else?
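If it ever helps to prototype the same running counter outside SAS, here is a rough pandas sketch of the logic (illustrative only; it assumes a DataFrame with the Time, Symbol, and Price columns from the sample):

import pandas as pd

df = pd.DataFrame({
    "Time":   [288, 304, 305, 307, 311, 337],
    "Symbol": ["AA"] * 6,
    "Price":  [36.28, 36.28, 36.34, 36.28, 36.15, 36.20],
})

def consecutive_increases(price):
    up = price.diff() > 0    # True where the price rose vs. the previous trade
    blocks = (~up).cumsum()  # new block id whenever the streak breaks
    return up.groupby(blocks).cumsum()

df["Increasing"] = df.groupby("Symbol")["Price"].transform(consecutive_increases)
df["Trend"] = df["Increasing"] >= 10  # 10th consecutive increase and beyond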
"0(607.0/60.0)"
"1(149.0/14.0)"
I know that 607 and 149 represent the total number of examples covered by each leaf.
I want to know what the numbers "60" and "14" after the '/' represent?
The second number is the number (weight) of those instances that are misclassified.
The first number is the total number of instances (weight of instances) reaching the leaf. The second number is the number (weight) of those instances that are misclassified.
Source: https://weka.wikispaces.com/What+do+those+numbers+mean+in+a+J48+tree%3F
For the sample dataset, the decision tree result is:

physician-fee-freeze = n: democrat (253.41/3.75)

The first number indicates the number of correct instances that reach that node (in this case, democrats), and the second number, after the "/", shows the number of incorrect instances that reach that node (in this case, republicans).

Total number of instances: 435
Total number of "no" (also the integer part of the correct count): 253
Probability of "no": 253/435 = 0.58

Total number of instances with missing data: 11
Number of those that come with "no": 8
Probability: 8/11 = 0.72

Total probability that a missing value is "no": 0.58 × 0.72 = 0.42
Total number of correct instances: 253 + 0.42 = 253.42 ≈ 253.41

The number after the "/" shows the number of incorrect instances that reach that node. If you look at the data, there are five incorrect instances where "republican" is the class while "physician fee freeze" is "n" (or "?"). Those five split as follows: incorrect instances with "n": 2; incorrect instances with "?": 3.

Similar formula: 2 + (253/435) × 3 = 3.75
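A quick way to check that arithmetic in plain Python:

p_no = 253 / 435         # probability of "no" among the non-missing instances
p_missing_no = 8 / 11    # probability that a missing value comes with "no"

print(253 + p_no * p_missing_no)  # 253.42..., i.e. the 253.41 in the tree (up to rounding)
print(2 + p_no * 3)               # 3.74..., i.e. the 3.75 after the "/" (up to rounding)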