If I run Stata's timer:
timer on 1
(code lines)
timer off 1
timer list 1
I cannot read the result:
timer list 1
1: 325.15 / 2 = 162.5725
The next time the timer produces:
timer list 1
1: 622.47 / 3 = 207.4883
It seems to be dividing 325.15 by 2 and 622.47 by 3.
Why? What does the number before the division mean? What does the number after it mean?
I tried reading the manual on the topic and other information online, but I couldn't find an answer.
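The first number is the total time elapsed in seconds, the second is the number of times the timer was turned on and off, and the third is their ratio: the average time per on/off cycle. The timer keeps accumulating until it is cleared with timer clear.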
Using the example from the help file:
program tester
version 13
forvalues repeat=1(1)100 {
timer on 1
quietly summarize price
timer off 1
}
timer list 1
return list
end
And the toy dataset auto.dta:
sysuse auto, clear
timer clear 1
tester
1: 0.01 / 100 = 0.0001
scalars:
r(N) = 74
r(sum_w) = 74
r(mean) = 6165.256756756757
r(Var) = 8699525.974268788
r(sd) = 2949.495884768919
r(min) = 3291
r(max) = 15906
r(sum) = 456229
r(t1) = .008
r(nt1) = 100
tester
1: 0.02 / 200 = 0.0001
scalars:
r(N) = 74
r(sum_w) = 74
r(mean) = 6165.256756756757
r(Var) = 8699525.974268788
r(sd) = 2949.495884768919
r(min) = 3291
r(max) = 15906
r(sum) = 456229
r(t1) = .017
r(nt1) = 200
If you clear the timer again:
timer clear 1
tester
1: 0.01 / 100 = 0.0001
scalars:
r(N) = 74
r(sum_w) = 74
r(mean) = 6165.256756756757
r(Var) = 8699525.974268788
r(sd) = 2949.495884768919
r(min) = 3291
r(max) = 15906
r(sum) = 456229
r(t1) = .007
r(nt1) = 100
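As a rough illustration of that bookkeeping, here is a minimal Python sketch. It is not Stata code, and the class AccumulatingTimer and its method names are hypothetical; it only mimics the behaviour seen above, where the total and the count keep growing until the timer is cleared.

```python
import time


class AccumulatingTimer:
    """Hypothetical stand-in that mimics how a numbered Stata timer accumulates."""

    def __init__(self):
        self.total = 0.0    # seconds accumulated over all on/off cycles
        self.count = 0      # number of completed on/off cycles
        self._start = None

    def on(self):
        self._start = time.perf_counter()

    def off(self):
        self.total += time.perf_counter() - self._start
        self.count += 1
        self._start = None

    def clear(self):
        self.total, self.count = 0.0, 0

    def average(self):
        # The number after "=" in the listing: total time / number of cycles
        return self.total / self.count if self.count else 0.0

    def listing(self):
        return f"{self.total:.2f} / {self.count} = {self.average():.4f}"


timer = AccumulatingTimer()
for _ in range(100):
    timer.on()
    sum(range(1000))  # placeholder for the timed work
    timer.off()
print(timer.listing())
```

Running the loop a second time without calling clear() would report 200 cycles, which matches the second tester run above.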
I have the following need:
For each material, calculate the ratio between the sum of the amounts of tickets with status Finalized and the sum of the amounts of all finalized tickets.
My fact table looks like this:
TicketID StatusID MaterialID CategoryID Amount FKDATE
123 3 45 9 150 12/03/2021
124 5 50 4 569 11/03/2021
125 3 78 78 556 14/03/2021
126 -1 -1 -1 -1 12/03/2021
My dimension Status is like below :
StatusID Status
1 Open
2 In Process
3 Finalized
My dimension Material is like below :
MaterialID MaterielLabel
1 Bikes
.. ..
I want to exclude the TicketID with MaterialID = -1.
Try the following :
AmountFinalizedByMaterial :=
VAR AmountFinalizedByMaterialGroup =
    CALCULATE (
        SUM ( yourFactTable[Amount] ),
        Status[Status] = "Finalized",
        yourFactTable[MaterialID] <> -1
    )
VAR TotalAmountFinalized =
    CALCULATE (
        SUM ( yourFactTable[Amount] ),
        Status[Status] = "Finalized",
        ALL ( Material )
    )
RETURN
    DIVIDE (
        AmountFinalizedByMaterialGroup,
        TotalAmountFinalized
    )
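To sanity-check the numbers such a measure should produce, the same ratio can be sketched outside the model in pandas. This is only an illustration on the sample rows from the question, not a replacement for the DAX measure. One assumption to note: the sketch drops MaterialID = -1 from both the numerator and the denominator, whereas the measure above applies that filter only to the numerator.

```python
import pandas as pd

# Sample rows from the question (CategoryID and FKDATE omitted for brevity)
fact = pd.DataFrame({
    "TicketID": [123, 124, 125, 126],
    "StatusID": [3, 5, 3, -1],
    "MaterialID": [45, 50, 78, -1],
    "Amount": [150, 569, 556, -1],
})
status = pd.DataFrame({
    "StatusID": [1, 2, 3],
    "Status": ["Open", "In Process", "Finalized"],
})

# Attach the status label, then keep finalized tickets with a real material
merged = fact.merge(status, on="StatusID", how="left")
finalized = merged[(merged["Status"] == "Finalized") & (merged["MaterialID"] != -1)]

# Amount per material divided by the total finalized amount
by_material = finalized.groupby("MaterialID")["Amount"].sum()
ratio = by_material / by_material.sum()
print(ratio)
```

With these four rows only tickets 123 and 125 are finalized, so the two ratios are 150/706 and 556/706.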
I have a series of numbers, and I would like to know the percentage of numbers falling into each bucket defined by a dataframe.
df['cuts'] has 10, 20 and 50 as values. Specifically, I would like to know what percentage of the series falls in the [0-10], (10-20] and (20-50] bins, and this should be appended to the df dataframe.
I wrote the following code, but I feel it could be improved. Any help is appreciated.
bin_cuts = [-1] + list(df['cuts'].values)
out = pd.cut(series, bins = bin_cuts)
df_pct_bins = pd.value_counts(out, normalize= True).reset_index()
df_pct_bins = pd.concat([df_pct_bins['index'].str.split(', ', expand = True), df_pct_bins['cuts']], axis = 1)
df_pct_bins[1] = df_pct_bins[1].str[:-1].astype(str)
df['cuts'] = df['cuts'].astype(str)
df_pct_bins = pd.merge(df, df_pct_bins, left_on= 'cuts', right_on= 1)
Consider the sample data df and s
df = pd.DataFrame(dict(cuts=[10, 20, 50]))
s = pd.Series(np.random.randint(50, size=1000))
Option 1
np.searchsorted
c = df.cuts.values
df.assign(
pct=df.cuts.map(
pd.value_counts(
c[np.searchsorted(c, s)],
normalize=True
)))
cuts pct
0 10 0.216
1 20 0.206
2 50 0.578
Option 2
pd.cut
c = df.cuts.values
df.assign(
pct=df.cuts.map(
pd.cut(
s,
np.append(-np.inf, c),
labels=c
).value_counts(normalize=True)
))
cuts pct
0 10 0.216
1 20 0.206
2 50 0.578
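Because the sample series above is random, it can help to check both options on a small fixed series where the bin membership is easy to verify by hand. The sketch below is one way to do that; the sample values are made up for illustration, and it uses Series.value_counts rather than the top-level pd.value_counts.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"cuts": [10, 20, 50]})
# Hand-picked values, including ones that sit exactly on a bin edge
s = pd.Series([0, 5, 10, 11, 20, 21, 35, 50])

c = df["cuts"].to_numpy()

# Option 1: searchsorted (default side="left") maps a value equal to an
# edge into that edge's bin, i.e. bins are closed on the right.
labels = pd.Series(c[np.searchsorted(c, s.to_numpy())])
pct_searchsorted = labels.value_counts(normalize=True).reindex(c).to_numpy()

# Option 2: pd.cut with an open lower bound and the upper edges as labels
binned = pd.cut(s, np.append(-np.inf, c), labels=c)
pct_cut = binned.value_counts(normalize=True).sort_index().to_numpy()

out = df.assign(pct=pct_searchsorted)
print(out)
```

Note that the searchsorted variant raises an IndexError if the series contains a value larger than the last cut, while pd.cut marks such values as NaN.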
I have a dataframe(df1) as following:
datetime m d 1d 2d 3d
2014-01-01 1 1 2 2 3
2014-01-02 1 2 3 4 3
2014-01-03 1 3 1 2 3
...........
2014-12-01 12 1 2 2 3
2014-12-31 12 31 2 2 3
Also I have another dataframe(df2) as following:
datetime m d
2015-01-02 1 2
2015-01-03 1 3
...........
2015-12-01 12 1
2015-12-31 12 31
I want to merge the values of the 1d, 2d and 3d columns from df1 into df2.
There are two conditions:
(1) rows are merged only where m and d are the same in both df1 and df2.
(2) rows of df2 whose index satisfies index % 30 == 0 are not merged; their 1d, 2d and 3d values should be NaN.
The new df2 should look like this:
datetime m d 1d 2d 3d
2015-01-02 1 2 Nan Nan Nan
2015-01-03 1 3 1 2 3
...........
2015-12-01 12 1 2 2 3
2015-12-31 12 31 2 2 3
Thanks in advance!
I think you need to add NaNs with loc and then merge with a left join:
np.random.seed(10)
N = 365
rng = pd.date_range('2015-01-01', periods=N)
df_tr_2014 = pd.DataFrame(np.random.randint(10, size=(N, 3)), index=rng).reset_index()
df_tr_2014.columns = ['datetime','7d','15d','20d']
df_tr_2014.insert(1,'month', df_tr_2014['datetime'].dt.month)
df_tr_2014.insert(2,'day_m', df_tr_2014['datetime'].dt.day)
#print (df_tr_2014.head())
N = 366
rng = pd.date_range('2016-01-01', periods=N)
df_te = pd.DataFrame(index=rng)
df_te['month'] = df_te.index.month
df_te['day_m'] = df_te.index.day
df_te = df_te.reset_index()
#print (df_te.tail())
df2 = df_te.copy()
df1 = df_tr_2014.copy()
df1 = df1.set_index('datetime')
df1.index += pd.offsets.DateOffset(years=1)
#correct 29 February
y = df1.index[0].year
df1 = df1.reindex(pd.date_range(pd.Timestamp(y,1,1), pd.Timestamp(y,12,31)))
idx = df1.index[(df1.index.month == 2) & (df1.index.day == 29)]
df1.loc[idx, :] = df1.loc[idx - pd.Timedelta(1, unit='d'), :].values
df1.loc[idx, 'day_m'] = idx.day
df1[['month','day_m']] = df1[['month','day_m']].astype(int)
df1[['7d','15d', '20d']] = df1[['7d','15d', '20d']].astype(float)
df1.loc[np.arange(len(df1.index)) % 30 == 0, ['7d','15d','20d']] = 0
df1 = df1.reset_index()
print (df1.iloc[57:62])
index month day_m 7d 15d 20d
57 2016-02-27 2 27 2.0 0.0 1.0
58 2016-02-28 2 28 2.0 3.0 5.0
59 2016-02-29 2 29 2.0 3.0 5.0
60 2016-03-01 3 1 0.0 0.0 0.0
61 2016-03-02 3 2 7.0 6.0 9.0
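The left-join idea can also be sketched directly on the m and d columns from the question. This is a minimal sketch with made-up random values; the frames are rebuilt here so that the block runs on its own, and it ignores the 29 February handling shown above because neither sample year is a leap year.

```python
import numpy as np
import pandas as pd

value_cols = ["1d", "2d", "3d"]

# df1: one row per day of 2014 with three value columns (random sample data)
df1 = pd.DataFrame({"datetime": pd.date_range("2014-01-01", periods=365)})
df1["m"] = df1["datetime"].dt.month
df1["d"] = df1["datetime"].dt.day
rng = np.random.default_rng(0)
for col in value_cols:
    df1[col] = rng.integers(0, 10, size=len(df1))

# df2: one row per day of 2015, no value columns yet
df2 = pd.DataFrame({"datetime": pd.date_range("2015-01-01", periods=365)})
df2["m"] = df2["datetime"].dt.month
df2["d"] = df2["datetime"].dt.day

# Left join on (m, d): keeps every df2 row and pulls in the matching values
merged = df2.merge(df1[["m", "d"] + value_cols], on=["m", "d"], how="left")

# Blank out every 30th row; cast to float first so NaN can be stored
merged[value_cols] = merged[value_cols].astype(float)
merged.loc[merged.index % 30 == 0, value_cols] = np.nan
print(merged.head())
```

Any (m, d) pair in df2 with no counterpart in df1 also ends up as NaN, which is the normal behaviour of a left join.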
Why don't you just remove the rows in df1 that don't match (m, d) pairs in df2?
df_new = df2.drop(df2[~df2.set_index(['m', 'd']).index.isin(df1.set_index(['m', 'd']).index) | (df2.index % 30 == 0)].index)
Or something along those lines.
Link to a related answer.
I'm not enormously familiar with Pandas and have not tested the above example.
Here df_te is df2 and df_tr_2014 is df1.
The columns 7d, 15d and 20d correspond to 1d, 2d and 3d in the question. size_df_te is the length of df_te; month and day_m are m and d in df2.
df_te['7d'] = 0.0
df_te['15d'] = 0.0
df_te['20d'] = 0.0
size_df_te = len(df_te)
for i in range(size_df_te):
    # rows whose index is a multiple of 30 are skipped and keep their initial 0
    if i % 30 != 0:
        m = df_te.loc[i, 'month']
        d = df_te.loc[i, 'day_m']
        # 29 February has no match in a non-leap year, so fall back to 28 February
        if (m == 2) and (d == 29):
            d = 28
        dk_7 = df_tr_2014.loc[(df_tr_2014['month'] == m) & (df_tr_2014['day_m'] == d)]['7d']
        dk_15 = df_tr_2014.loc[(df_tr_2014['month'] == m) & (df_tr_2014['day_m'] == d)]['15d']
        dk_20 = df_tr_2014.loc[(df_tr_2014['month'] == m) & (df_tr_2014['day_m'] == d)]['20d']
        df_te.loc[i, '7d'] = float(dk_7.iloc[0])
        df_te.loc[i, '15d'] = float(dk_15.iloc[0])
        df_te.loc[i, '20d'] = float(dk_20.iloc[0])
EDIT:
Sample data:
np.random.seed(10)
N = 365
rng = pd.date_range('2014-01-01', periods=N)
df_tr_2014 = pd.DataFrame(np.random.randint(10, size=(N, 3)), index=rng).reset_index()
df_tr_2014.columns = ['datetime','7d','15d','20d']
df_tr_2014.insert(1,'month', df_tr_2014['datetime'].dt.month)
df_tr_2014.insert(2,'day_m', df_tr_2014['datetime'].dt.day)
#print (df_tr_2014.head())
N = 365
rng = pd.date_range('2015-01-01', periods=N)
df_te = pd.DataFrame(index=rng)
df_te['month'] = df_te.index.month
df_te['day_m'] = df_te.index.day
df_te = df_te.reset_index()
#print (df_te.head())
How could I add up some Int values as minutes and sum them? The result should be in hours and minutes, like 1:15 or 6:30.
My playground gives 1.25, but I expected 1.15.
struct standardDayOfWork {
    var dailyHours : Double = 0
}
var dayToUse = standardDayOfWork()
enum hourFractions : Double {
    case quarter = 15
    case half = 30
    case threeQuarter = 45
    case hour = 60
}
dayToUse.dailyHours += hourFractions.half.rawValue
dayToUse.dailyHours += hourFractions.half.rawValue
dayToUse.dailyHours += hourFractions.quarter.rawValue
var total = dayToUse.dailyHours / 60 //1.25
Because in the decimal system a quarter is 0.25.
To get numeric 1.15 you could use this weird expression:
var total = Double(Int(dayToUse.dailyHours) / 60) + (dayToUse.dailyHours.truncatingRemainder(dividingBy: 60) / 100.0)
Or if you can live with a formatted "hh:mm" string I'd recommend
let formatter = DateComponentsFormatter()
formatter.allowedUnits = [.hour, .minute]
formatter.string(from: dayToUse.dailyHours * 60)
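The underlying arithmetic is just an integer division with remainder: whole hours are minutes divided by 60, and the leftover minutes are the remainder. A minimal sketch of that step, written in Python for illustration (the function name format_minutes is hypothetical):

```python
def format_minutes(total_minutes: int) -> str:
    """Format a number of minutes as h:mm, e.g. 75 -> '1:15'."""
    # divmod returns (quotient, remainder) in one step
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}:{minutes:02d}"


# Two half hours plus a quarter hour, as in the question
print(format_minutes(30 + 30 + 15))
```

Keeping the total as an integer number of minutes and only formatting at the end avoids mixing decimal fractions (1.25) with clock notation (1:15).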
I have a data frame as following:
ID Value
A 70
A 80
B 75
C 10
B 50
A 1000
C 60
B 2000
.. ..
I would like to group this data by ID, remove the outliers from the grouped data (the ones we see from the boxplot) and then calculate mean.
So far
grouped = df.groupby('ID')
statBefore = pd.DataFrame({'mean': grouped['Value'].mean(), 'median': grouped['Value'].median(), 'std' : grouped['Value'].std()})
How can I find the outliers, remove them, and then compute the statistics?
I believe the method you're referring to is the boxplot rule: remove values that lie more than 1.5 * the interquartile range below the first quartile or above the third quartile. So first, calculate your initial statistics per group:
statBefore = pd.DataFrame({'q1': grouped['Value'].quantile(.25),
                           'median': grouped['Value'].median(),
                           'q3': grouped['Value'].quantile(.75)})
And then determine whether values in the original DF are outliers:
def is_outlier(row):
    q1 = statBefore.loc[row.ID, 'q1']
    q3 = statBefore.loc[row.ID, 'q3']
    iq_range = q3 - q1
    # outside the whiskers: below q1 - 1.5*IQR or above q3 + 1.5*IQR
    return row.Value < (q1 - 1.5 * iq_range) or row.Value > (q3 + 1.5 * iq_range)
# apply the function to the original df:
df.loc[:, 'outlier'] = df.apply(is_outlier, axis=1)
# filter to only non-outliers:
df_no_outliers = df[~(df.outlier)]
Q1 = df['Value'].quantile(0.25)
Q3 = df['Value'].quantile(0.75)
IQR = Q3 - Q1
data = df[~((df['Value'] < (Q1 - 1.5 * IQR)) | (df['Value'] > (Q3 + 1.5 * IQR)))]
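The snippet above computes the quartiles over the whole column rather than per ID. A per-group version of the same 1.5 * IQR rule can be sketched with groupby().transform, which broadcasts each group's quartiles back onto its rows. The sample values below are made up for illustration; with only two or three rows per group the rule rarely flags anything, so each group is given six values here.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["A"] * 6 + ["B"] * 6,
    "Value": [70, 80, 75, 72, 78, 1000,
              50, 75, 60, 65, 55, 2000],
})

g = df.groupby("ID")["Value"]
# Per-row copies of each group's quartiles
q1 = g.transform(lambda v: v.quantile(0.25))
q3 = g.transform(lambda v: v.quantile(0.75))
iqr = q3 - q1

# Keep rows inside [q1 - 1.5*IQR, q3 + 1.5*IQR] for their own group
keep = df["Value"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

stats = df[keep].groupby("ID")["Value"].agg(["mean", "median", "std"])
print(stats)
```

This avoids the row-by-row apply and a hard-coded threshold, at the cost of evaluating the quantile twice per group.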
just do :
In [187]: df[df<100].groupby('ID').agg(['mean','median','std'])
Out[187]:
Value
mean median std
ID
A 75.0 75.0 7.071068
B 62.5 62.5 17.677670
C 35.0 35.0 35.355339