groupby datediff in pandas - python-2.7

I am trying to get the difference, in months, between the min date and max date on which a product is sold, in a new column. But I am getting an unusual return value when applying the function in a groupby.
Any help is much appreciated.
So my steps are:
Data:
pch_date day product qty unit_price total_price year_month
421 2013-01-07 tuesday p3 13 4.58 59.54 1
141 2015-09-13 monday p8 3 3.77 11.31 9
249 2015-02-02 monday p5 3 1.80 5.40 2
826 2015-10-09 tuesday p5 6 1.80 10.80 10
427 2014-04-18 friday p7 6 4.21 25.26 4
Function definition:
def diff_date(x):
    max_date = x.max()
    min_date = x.min()
    diff_month = (max_date.year - min_date.year) * 12 + max_date.month + 1
    return diff_month
When testing it directly:
print diff_date(prod_df['pch_date'])
49, which is correct.
But the problem:
print prod_df[['product','pch_date']].groupby(['product']).agg({'pch_date': diff_date}).reset_index()[:5]
The results come back as dates instead of integers:
product pch_date
0 p1 1970-01-01 00:00:00.000000049
1 p10 1970-01-01 00:00:00.000000048
2 p11 1970-01-01 00:00:00.000000045
3 p12 1970-01-01 00:00:00.000000049
4 p13 1970-01-01 00:00:00.000000045
How can I get the difference as an integer?

You can use GroupBy.apply instead, which returns integers and not datetime objects.
df.groupby(['product'])['pch_date'].apply(diff_date).reset_index()
As a workaround to prevent the integer values from being converted to their datetime equivalents, you can change the last line of your function to return str(diff_month) and continue using GroupBy.agg as shown:
df.groupby(['product'])['pch_date'].agg({'pch_date': diff_date}).reset_index()
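For reference, here is a minimal, self-contained sketch of the apply route; the sample frame below is made up for illustration, not the asker's data:
import pandas as pd

# illustrative sample data, not the asker's real data
prod_df = pd.DataFrame({
    'product':  ['p3', 'p3', 'p5', 'p5', 'p7'],
    'pch_date': pd.to_datetime(['2013-01-07', '2015-09-13',
                                '2015-02-02', '2015-10-09', '2014-04-18']),
})

def diff_date(x):
    max_date = x.max()
    min_date = x.min()
    # same month arithmetic as in the question
    return (max_date.year - min_date.year) * 12 + max_date.month + 1

# GroupBy.apply keeps the integer result instead of coercing it to datetimes
print(prod_df.groupby('product')['pch_date'].apply(diff_date).reset_index())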

Related

Calculate the average excluding some values

I have a table (TABLE1) similar to this:
DATE        VALUE
01/01/2022  4
01/01/2022  3
01/01/2022  5
01/01/2022  8
02/01/2022  9
02/01/2022  8
02/01/2022  7
02/01/2022  3
I would like to calculate for each day the average value excluding the values that are less than the general average.
For example, for 01/01/2022 the average is (4+3+5+8)/4 = 5, and the value that I want to calculate is the average excluding the values lower than this average: (5+8)/2 = 6.5.
Hope you can help me with a measure to calculate this.
Thanks!!
Test this measure:
AverageValue =
VAR AllAverage =
    CALCULATE(AVERAGE(TestTable[VALUE]), ALLEXCEPT(TestTable, TestTable[DATE]))
VAR TblSummary =
    ADDCOLUMNS(
        VALUES(TestTable[DATE]),
        "AvgValue", CALCULATE(AVERAGE(TestTable[VALUE]), TestTable[VALUE] >= AllAverage)
    )
RETURN
    AVERAGEX(TblSummary, [AvgValue])
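As a quick sanity check against the question's own numbers: for 02/01/2022 the general average is (9 + 8 + 7 + 3) / 4 = 6.75, so the measure should keep 9, 8 and 7 and return (9 + 8 + 7) / 3 = 8, in line with the 6.5 computed for 01/01/2022.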

Extracting the same groups from different regex patterns for different date formats

Based on a data frame like
import pandas as pd
string_1 = 'for urgent evaluation/treatment till first visit with Dr. Toney Winkler IN EIGHT WEEKS on 24 Jan 2001.'
string_2 = '03/25/93 Total time of visit (in minutes):'
string_3 = 'April 11, 1990 CPT Code: 90791: No medical services'
df = pd.Series([string_1,string_2,string_3])
each of the following statements successfully extracts the date from exactly one row:
print(df.str.extract(r'((?P<month>\d{1,2})[/-](?P<day>\d{1,2})[/-](?P<year>\d{2,4}))').dropna())
0 month day year
1 03/25/93 03 25 93
print(df.str.extract(r'(?P<month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sept|Oct|Nov|Dec)[a-z\.]*)[\s\.\-\,](?P<day>\d{2})[\-\,\s]*(?P<year>\d{4})').dropna())
month day year
2 April 11 1990
print(df.str.extract(r'((?P<day>\d{2})\s(?P<month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sept|Oct|Nov|Dec)[a-z\.]*)[\s\.\-\,]*(?P<year>\d{4}))').dropna())
0 day month year
0 24 Jan 2001 24 Jan 2001
How can the statements be combined to create the data frame
day month year
0 24 Jan 2001
1 25 03 93
2 11 April 1990
where the indices need to be the original indices?
You may use the PyPI regex module (install it with pip install regex) and join the patterns with OR inside a branch reset group:
import regex
import pandas as pd
string_1 = 'for urgent evaluation/treatment till first visit with Dr. Toney Winkler IN EIGHT WEEKS on 24 Jan 2001.'
string_2 = '03/25/93 Total time of visit (in minutes):'
string_3 = 'April 11, 1990 CPT Code: 90791: No medical services'
df = pd.Series([string_1,string_2,string_3])
pat1 = r'(?P<month>\d{1,2})[/-](?P<day>\d{1,2})[/-](?P<year>\d{2,4})'
pat2 = r'(?P<month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sept|Oct|Nov|Dec)[a-z.]*)[\s.,-](?P<day>\d{2})[-,\s]*(?P<year>\d{4})'
pat3 = r'(?P<day>\d{2})\s(?P<month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sept|Oct|Nov|Dec)[a-z.]*)[\s.,-]*(?P<year>\d{4})'
rx = regex.compile(r"(?|{}|{}|{})".format(pat1,pat2,pat3))
empty_val = pd.Series(["","",""], index=['month','day','year'])
def extract_regex(seq):
    m = rx.search(seq)
    if m:
        return pd.Series(list(m.groupdict().values()), index=['month','day','year'])
    else:
        return empty_val
df2 = df.apply(extract_regex)
Output:
>>> df2
month day year
0 Jan 24 2001
1 03 25 93
2 April 11 1990
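If the columns should come out in the day / month / year order asked for in the question, reindexing the df2 produced above should be enough (column names as defined in that snippet):
print(df2[['day', 'month', 'year']])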
An alternative using the standard re module, trying each pattern in turn:
import re
import pandas as pd

string_1 = 'for urgent evaluation/treatment till first visit with Dr. Toney Winkler IN EIGHT WEEKS on 24 Jan 2001.'
string_2 = '03/25/93 Total time of visit (in minutes):'
string_3 = 'April 11, 1990 CPT Code: 90791: No medical services'
df = pd.DataFrame([string_1, string_2, string_3])
patterns = [r'(?P<day>\d{1,2}) (?P<month>(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)) (?P<year>\d{4})',
            r'(?P<month>\d{1,2})[/-](?P<day>\d{1,2})[/-](?P<year>\d{2,4})',
            r'(?P<month>(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sept|Oct|Nov|Dec)[a-z\.]*) (?P<day>\d{2}), (?P<year>\d{4})']

def extract_date(s):
    result = None, None, None
    for p in patterns:
        m = re.search(p, s)
        if m:
            result = m.group('year'), m.group('month'), m.group('day')
            break
    return result

df['year'], df['month'], df['day'] = zip(*df[0].apply(lambda s: extract_date(s)))

How to get numeric time of day from XTS index, with correct timezone?

I need a fast numeric representation of the time of day.
Let's start with some basic data:
> z1 = structure(
+ c(1:5),.Dim = c(5L, 1L), .Dimnames = list(NULL, c("Hour")),
+ index = as.POSIXct(paste("2018-06-06",paste(1:5,":00:00",sep = ""),sep = " "), tz = 'America/Chicago'),
+ .indexCLASS = c("POSIXct", "POSIXt"), .indexTZ = 'America/Chicago',
+ tclass = c("POSIXct", "POSIXt"), tzone = 'America/Chicago', class = c("xts", "zoo"))
> z1
Hour
2018-06-06 01:00:00 1
2018-06-06 02:00:00 2
2018-06-06 03:00:00 3
2018-06-06 04:00:00 4
2018-06-06 05:00:00 5
> index(z1[1])
[1] "2018-06-06 01:00:00 CDT"
So I have 5 hourly times in Chicago time (CDT).
I need to be able to look at a time, like 1 AM, and get a numeric time like 1/24 = 0.0416666667.
The xts index is in datetime format, i.e. seconds since 1970-01-01, so the math should be simple using the modulo operator %%.
Let's try:
> cbind(z1,(unclass(index(z1)) %% (60*60*24))/(60*60*24),(unclass(index(z1)) %% (60*60*24))/(60*60*24)*24)
Hour ..2 ..3
2018-06-06 01:00:00 1 0.2500000 6
2018-06-06 02:00:00 2 0.2916667 7
2018-06-06 03:00:00 3 0.3333333 8
2018-06-06 04:00:00 4 0.3750000 9
2018-06-06 05:00:00 5 0.4166667 10
I unclass the index (to get the same value Rcpp will see), then take the modulo by the number of seconds in a day to get the fraction of the day left over, and then the hours left over.
The problem is obviously the timezone. The resulting time of day is in the UTC time zone, but I need it in Chicago time, just like the xts object. If I could simply get the numeric offset for the timezone it would be easy, but it seems getting the offset is not so simple.
So, I need a function in Rcpp that, given an xts time, would give me the time of day in the correct timezone. It can be in days, hours or anything else, as long as it is numeric and fast.
The intended use of this TimeOfDay function is, say, to run code during typical workday hours of 9 AM to 5 PM.
if(TimeOfDay(Index(1)) > 9.0 && TimeOfDay(Index(1)) < 17.0)
{
//Code to run.
}
Here is a very simple solution using the hour() convenience function in data.table:
R> z2 <- data.table(pt=index(z1))
R> z2[, myhour:=hour(pt)][]
pt myhour
1: 2018-06-06 01:00:00 1
2: 2018-06-06 02:00:00 2
3: 2018-06-06 03:00:00 3
4: 2018-06-06 04:00:00 4
5: 2018-06-06 05:00:00 5
R>
We just pass the POSIXct object in and derive the hour from it. You'd be hard-pressed to beat it with home-grown C/C++ code -- and this solution already exists.

There are two formats of datetime in the same time series; how to change them to one format?

I want to split a time series into two set: train and test.
Here's my code:
train = data.iloc[:1100]
test = data.iloc[1101:]
Here's what the time series looks like:
And here's the train series: there's no time, only the date, in the index.
Finally, the test:
How to change the index to the same form?
Consider the simplified series s
s = pd.Series(1, pd.date_range('2010-08-16', periods=5, freq='12H'))
s
2010-08-16 00:00:00 1
2010-08-16 12:00:00 1
2010-08-17 00:00:00 1
2010-08-17 12:00:00 1
2010-08-18 00:00:00 1
Freq: 12H, dtype: int64
But when I subset s leaving only Timestamps that need no time element, pandas does me the "favor" of not displaying a bunch of zeros for no reason.
s.iloc[::2]
2010-08-16 1
2010-08-17 1
2010-08-18 1
Freq: 24H, dtype: int64
But rest assured, the values are the same:
s.iloc[::2].index[0] == s.index[0]
True
And they have the same dtype and precision:
s.iloc[::2].index.values.dtype
dtype('<M8[ns]')
And
s.index.values.dtype
dtype('<M8[ns]')
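If a uniform textual form is genuinely wanted (say, for display or export), one option -- a sketch beyond the original answer, not required for the split itself -- is to format the index explicitly:
s2 = s.iloc[::2].copy()
s2.index = s2.index.strftime('%Y-%m-%d %H:%M:%S')
print(s2)  # every label now shows a time component, but as plain strings
Note this turns the index into strings, so for any further date arithmetic it is better to leave the DatetimeIndex alone, as both answers here suggest.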
I think if the same DataFrame is split by iloc, it is only that 00:00:00 is not shown. So adding times is not necessary, because both dtypes are datetime64[ns] (DatetimeIndex).
mux = pd.MultiIndex.from_product([['GOOG'],
pd.DatetimeIndex(['2010-08-16 00:00:00',
'2010-08-17 00:00:00',
'2010-08-18 00:00:00',
'2010-08-19 00:00:00',
'2010-08-20 15:00:00'])], names=('Ticker','Date'))
data = pd.Series(range(5), mux)
print (data)
Ticker Date
GOOG 2010-08-16 00:00:00 0
2010-08-17 00:00:00 1
2010-08-18 00:00:00 2
2010-08-19 00:00:00 3
2010-08-20 15:00:00 4
#splitting
train = data.iloc[:2]
test = data.iloc[2:]
print (train)
Ticker Date
GOOG 2010-08-16 0
2010-08-17 1
dtype: int32
It seems there are some times, as piRSquared mentioned:
print (test)
Ticker Date
GOOG 2010-08-18 00:00:00 2
2010-08-19 00:00:00 3
2010-08-20 15:00:00 4
dtype: int32
#check if same dtypes
print (train.index.get_level_values('Date').dtype)
datetime64[ns]
print (test.index.get_level_values('Date').dtype)
datetime64[ns]
# if you want to see only the rows whose time is not midnight in the test data
m = test.index.get_level_values('Date').time != pd.to_datetime('2015-01-01').time()
only_times = test[m]
print (only_times)
Ticker Date
GOOG 2010-08-20 15:00:00 4
dtype: int32

Stata: add values onto existing values

year
0
1
6
....
(omit)
....
77
90
....
(omit)
....
The "year" is a numeric variable. I need to add "200" before the 1-digit values, and "19" before the 2-digit values.
year
2000
2001
2006
....
1977
1990
....
How can I do this in Stata?
Be careful: the variable might be byte and that will bite.
This should work:
gen year2 = cond(year < 10, 2000 + year, 1900 + year)
tab year2
If year2 looks good,
drop year
rename year2 year
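As a quick check of the cond() rule: year 6 becomes 2000 + 6 = 2006 and year 77 becomes 1900 + 77 = 1977, matching the desired output above.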