Pandas time series: groupby and sum from noon to noon


My pandas dataframe is structured like this (with 'date' as index):
starttime duration_seconds
date
2012-12-24 11:52:00 31800
2012-12-23 0:28:00 35940
2012-12-22 2:00:00 26820
2012-12-21 1:57:00 23520
2012-12-20 1:32:00 23100
2012-12-19 0:50:00 25080
2012-12-18 1:17:00 24780
2012-12-17 0:38:00 25440
2012-12-15 10:38:00 32760
2012-12-14 0:35:00 23160
2012-12-12 22:54:00 3960
2012-12-12 0:21:00 24060
2012-12-10 23:45:00 900
2012-12-11 11:00:00 24840
2012-12-10 0:27:00 25980
2012-12-09 19:29:00 4320
2012-12-09 3:00:00 29880
2012-12-08 2:07:00 34380
I use the following to groupby date and sum the total seconds each day:
df_sum = df.groupby(df.index.date).sum()
What I'd like to do is sum duration_seconds from noon on one day to noon on the following day. Is there an elegant (pandas) way of doing this? Thanks in advance!

pd.TimeGrouper is a custom groupby class for time-interval grouping of NDFrames with a DatetimeIndex, TimedeltaIndex or PeriodIndex. (If your dataframe index is using date-strings, you'll need to convert it to a DatetimeIndex first by using df.index = pd.DatetimeIndex(df.index).)
df.groupby(pd.TimeGrouper('24H')).sum() groups df using 24-hour intervals starting at time 00:00:00.
df.groupby(pd.TimeGrouper('24H', base=12)).sum() groups df using 24-hour intervals starting at time 12:00:00:
In [90]: df.groupby(pd.TimeGrouper('24H', base=12)).sum()
Out[90]:
duration_seconds
2012-12-07 12:00:00 34380.0
2012-12-08 12:00:00 34200.0
2012-12-09 12:00:00 26880.0
2012-12-10 12:00:00 24840.0
2012-12-11 12:00:00 28020.0
2012-12-12 12:00:00 NaN
2012-12-13 12:00:00 23160.0
2012-12-14 12:00:00 32760.0
2012-12-15 12:00:00 NaN
2012-12-16 12:00:00 25440.0
2012-12-17 12:00:00 24780.0
2012-12-18 12:00:00 25080.0
2012-12-19 12:00:00 23100.0
2012-12-20 12:00:00 23520.0
2012-12-21 12:00:00 26820.0
2012-12-22 12:00:00 35940.0
2012-12-23 12:00:00 31800.0
Documentation on pd.TimeGrouper is a little sparse. It is a subclass of pd.Grouper, so many of its parameters have the same meaning as those documented for pd.Grouper. You can find more examples of pd.TimeGrouper usage in the Cookbook. I found the base parameter by inspecting the source code. The base parameter in pd.TimeGrouper has the same meaning as the base parameter in df.resample, which is not surprising since df.resample is implemented using pd.TimeGrouper.
In fact, come to think of it, another way to compute the desired result is
df.resample('24H', base=12).sum()
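pd.TimeGrouper and the base argument were removed in later pandas releases; there, the offset argument of resample plays the same role. Below is a minimal sketch of the noon-to-noon sum under that newer API. It uses only a few rows from the question and assumes that date and starttime have been combined into a single DatetimeIndex (the frame construction here is illustrative, not the asker's actual code), so each row is binned by its full timestamp.

```python
import pandas as pd

# A few rows from the question. Assumption: date and starttime are
# combined into one timestamp per row and used as the index.
df = pd.DataFrame(
    {"duration_seconds": [31800, 35940, 26820, 3960, 24060]},
    index=pd.to_datetime([
        "2012-12-24 11:52:00",
        "2012-12-23 00:28:00",
        "2012-12-22 02:00:00",
        "2012-12-12 22:54:00",
        "2012-12-12 00:21:00",
    ]),
).sort_index()

# 24-hour bins whose edges are shifted from midnight to noon.
# Each bin is labelled by its left edge (the noon that starts it).
noon_to_noon = df.resample(
    pd.Timedelta("24h"), offset=pd.Timedelta("12h")
)["duration_seconds"].sum()

print(noon_to_noon[noon_to_noon > 0])
```

With full timestamps in the index, the two rows dated 2012-12-12 land in different bins (one before noon, one after), which a date-only index cannot distinguish.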

Related

Graph the average for the day / multiple data daily - Power BI

I have 4 values collected daily.
I want to graph the average of the 4 values on a time series graph.
If I were to plot this, 1/03/2021 would show an average value of 15 and 2/03/2021 would show an average value of 35.
I tried using the quick measure for a rolling average (1 day before, 0 days after), but it gives me an error.
The DAX which I've tried didn't work either; I get "too many arguments were passed to the Values Function. the maximum argument count for the function is 1". This is me trying to follow some instructions online for the first time.
Day Avg = AVERAGEX(VALUES([Date], [Values]))
Thanks for the input.
Gem

Assuming your data looks like this
Table
Date Time Value
01/03/2021 00:01:00 10
01/03/2021 06:00:00 20
01/03/2021 12:00:00 15
01/03/2021 18:00:00 15
02/03/2021 00:01:00 30
02/03/2021 06:00:00 20
02/03/2021 12:00:00 40
02/03/2021 18:00:00 50
It seems your row context is at the table level, so you don't need to use VALUES.
AVG =
AVERAGEX ( 'Table', 'Table'[Value] )
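To sanity-check the expected numbers outside Power BI, the same per-day average can be sketched in pandas. This is not DAX; the frame and column names simply mirror the assumed table above, and grouping by Date stands in for the filter context a date axis would apply in the visual.

```python
import pandas as pd

# The assumed sample table from the answer.
table = pd.DataFrame({
    "Date": ["01/03/2021"] * 4 + ["02/03/2021"] * 4,
    "Time": ["00:01:00", "06:00:00", "12:00:00", "18:00:00"] * 2,
    "Value": [10, 20, 15, 15, 30, 20, 40, 50],
})

# Average of Value within each Date.
daily_avg = table.groupby("Date")["Value"].mean()
print(daily_avg)
```

The result is 15 for 01/03/2021 and 35 for 02/03/2021, matching the values the question expects.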

DAX SUM between Dates is not working as expected

I have two tables:
DateDim
Time
I am trying to get the sum of hours_actual from my Time table where they are between two dates from my DateDim. They have a relationship on the date shown in the following:
I am currently using the following DAX formula:
PreviousPeriod_Hours = CALCULATE(SUM('Time'[hours_actual])
,DATESBETWEEN(
DateDim[FullDateAlternateKey],
[Start of Previous Period],
[End of Previous Period]),
ALL(DateDim)
)
The values for [Start of Previous Period] and [End of Previous Period] are calculated DAX dates, that are showing as I would expect.
In order to arrive at those dates I create a few DAX functions first:
Start of This Period = FIRSTDATE(DateDim[FullDateAlternateKey])
End of This Period = LASTDATE(DateDim[FullDateAlternateKey])
Days in This Period = DATEDIFF([Start of This Period],[End of This Period],DAY)
End of Previous Period = PREVIOUSDAY(LASTDATE(DATEADD(DateDim[FullDateAlternateKey],-1*[Days in This Period],DAY)))
Start of Previous Period = PREVIOUSDAY(FIRSTDATE(DATEADD(DateDim[FullDateAlternateKey],-1*[Days in This Period] + IF(MOD(Year('MeasureTable'[End of This Period]),4) == 0,1,0),DAY)))
To quickly summarize the above, it is finding the days between a start and end date, and then subtracting these days from my start and end dates that are selected. If it is a leap year, then add a day.
The dax formula is giving me the correct sum total I am expecting. However, if I display the hours by month between the 2 dates, they are showing something different altogether from what it should be, and don't add to the sum it displays.
I was expecting the following values:
I am not sure where the 13 is coming from, and the 28.25 looks to be a repeat from the previous month of the following year. What am I missing here? Is my current approach correct and I am just doing something incorrectly, or am I taking the wrong approach altogether?
UPDATE - Adding in some of the data I am working with:
Then the DateDim is just a generated date table, for example, a row looks like the following (2016-2021): 
FullDateAlternateKey Year Month Month Name Quarter Week of Year Week of Month Day Day of Week Day of Year Day Name Fiscal Year Fiscal Period Fiscal Quarter
2016-01-02 2016 1 January 1 1 1 2 6 2 Saturday 2016 5 2
And the hours_actual and date look like the following: 
Date_Start hours_actual
2019-03-05 12:00:00 AM 5
2019-03-26 12:00:00 AM 3
2019-04-23 12:00:00 AM 0.75
2019-04-24 12:00:00 AM 0.08
2019-05-22 12:00:00 AM 4
2019-05-22 12:00:00 AM 2
2019-05-22 12:00:00 AM 1.75
2019-05-27 12:00:00 AM 8
2019-05-31 12:00:00 AM 0.25
2019-06-03 12:00:00 AM 0.25
2019-06-05 12:00:00 AM 0.25
2019-06-21 12:00:00 AM 1
2019-06-27 12:00:00 AM 2
2019-06-27 12:00:00 AM 0.5
2019-06-28 12:00:00 AM 1
2019-06-28 12:00:00 AM 3
2019-07-04 12:00:00 AM 3
2019-07-05 12:00:00 AM 3
2019-07-10 12:00:00 AM 2.5
2019-07-10 12:00:00 AM 0.5
2019-07-10 12:00:00 AM 1.5
2019-07-10 12:00:00 AM 0.5
2019-07-10 12:00:00 AM 2
2019-07-12 12:00:00 AM 2.5
2019-07-17 12:00:00 AM 1
2019-07-18 12:00:00 AM 0.5
2019-07-24 12:00:00 AM 0.5
2019-07-24 12:00:00 AM 1
2019-07-24 12:00:00 AM 1.5
2019-07-24 12:00:00 AM 1
2019-07-25 12:00:00 AM 1
2019-07-25 12:00:00 AM 0.5
2019-07-31 12:00:00 AM 1
2019-07-31 12:00:00 AM 1.5
2019-07-31 12:00:00 AM 1
2019-07-31 12:00:00 AM 0.5
2019-08-01 12:00:00 AM 2
2019-08-07 12:00:00 AM 4
2019-08-07 12:00:00 AM 3.75
2019-08-08 12:00:00 AM 4
2019-08-14 12:00:00 AM 1.25
2019-09-11 12:00:00 AM 3.5
2019-09-11 12:00:00 AM 2.5
2019-09-12 12:00:00 AM 3
2019-09-12 12:00:00 AM 1.75
2019-09-13 12:00:00 AM 4
2019-09-13 12:00:00 AM 1.75
2019-09-13 12:00:00 AM 3
2019-09-14 12:00:00 AM 2
2019-09-14 12:00:00 AM 3.25
2019-09-16 12:00:00 AM 0.5
2019-09-16 12:00:00 AM 0.5
2019-09-26 12:00:00 AM 2.5

After experimenting a little more, the DAX functions for the previous start and end dates were being picked up on a monthly basis as well as a yearly basis. My mistake was thinking the DAX function would only evaluate on the slicers and not on table values presented.
I took a different approach, and basically created a reference table of the Time table, and added a column that added a year to the date for each row. I then joined the reference table to my DateDim table by this future_date column. I was finally able to show the values by the current period and previous period and it accurately gave the results I was looking for.

How to generate_series of every hour of every day of 1 year from the current timestamp

I have a query that generates every day of the year (shown below). What if I want a series of every hour of every day of the year, counting back from the current timestamp? Example: today is July 23, 2019 10:30:00 AM, and the result I am hoping to get is below:
2019-07-23 20:30:00
2019-07-23 20:00:00
2019-07-23 19:00:00
2019-07-23 18:00:00
.
.
.
2018-07-23 20:00:00
This is a Redshift (PostgreSQL 8.0.2) query for Eclipse Birt. Hoping to create a parameter for both date and time but seems difficult to achieve if 2 separate ranges.
select cast(convert_timezone('UTC','AEST',cast(now() as timestamp without time zone)) as date) - generate_series(0, 365) date,
to_char(cast(convert_timezone('UTC','AEST',cast(now() as timestamp without time zone)) as date) - generate_series(0, 365), 'dd/mm/yyyy') date_disp;

This is similar to your previous question.
Use:
SELECT date_trunc('hour', now()::timestamp) - generate_series(0, 24 * 365) * interval '1 hour'
This outputs:
2019-07-23 05:00:00
2019-07-23 04:00:00
etc
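For readers who want to check the shape of that series outside the database, here is a small Python sketch of the same idea: truncate to the hour, then step back one hour at a time for 365 days. The function name and the fixed sample timestamp are illustrative assumptions, and no time zone conversion is applied.

```python
from datetime import datetime, timedelta


def hourly_series_back(now, days=365):
    """Return hourly timestamps from the top of the current hour back `days` days."""
    top_of_hour = now.replace(minute=0, second=0, microsecond=0)
    # range(...) includes both endpoints, like generate_series(0, 24 * 365).
    return [top_of_hour - timedelta(hours=h) for h in range(24 * days + 1)]


series = hourly_series_back(datetime(2019, 7, 23, 5, 30))
print(series[0], series[1], series[-1])
```

Because both endpoints are included, the list holds 24 * 365 + 1 entries.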

You can use the DATEADD Redshift function, using "h", "hr" or "hrs" as your first parameter. Documentation for this function can be found here and here.

Redshift - Adding timezone offset (Varchar) to timestamp column

As part of ETL to Redshift, one of the source tables has 2 columns:
original_timestamp - TIMESTAMP: which is the local time when the record was inserted in whichever region
original_timezone_offset - Varchar: which is the offset to UTC
The data looks something like this:
original_timestamp original_timezone_offset
2011-06-22 11:00:00.000000 -0700
2014-11-29 17:00:00.000000 -0800
2014-12-02 22:00:00.000000 +0900
2011-06-03 09:23:00.000000 -0700
2011-07-28 03:00:00.000000 -0700
2011-05-01 01:30:00.000000 -0700
In my target table, I need to convert this to UTC (using the offset). How do I do it?
So far I have tried multiple things but dateadd() seems to be the closest solution. But the problem with dateadd() is, when I say:
SELECT original_timestamp, original_timezone_offset
,dateadd(H, original_timezone_offset, original_timestamp) as original_utc_time
it is adding/subtracting '700'/'800' hours instead of 7/8 hrs to the original timestamp because the offset is a VARCHAR and the values are like: -0700 etc.
Did anyone see this issue before? Appreciate any help/inputs. Thanks.

Just take the 'hours' part of the offset:
WITH t as (
SELECT '2011-06-22 11:00:00.000000'::timestamp as original_timestamp, '-0700' as original_timezone_offset
UNION ALL
SELECT '2014-11-29 17:00:00.000000'::timestamp,'-0800'
UNION ALL
SELECT '2014-12-02 22:00:00.000000'::timestamp,'+0900'
)
SELECT
original_timestamp,
original_timezone_offset,
DATEADD(hour, SUBSTRING(original_timezone_offset, 1, 3)::INT, original_timestamp)
FROM t
2011-06-22 11:00:00 -0700 2011-06-22 04:00:00
2014-11-29 17:00:00 -0800 2014-11-29 09:00:00
2014-12-02 22:00:00 +0900 2014-12-03 07:00:00
You'll need some additional fancy code if you have non-full-hour offsets (eg +0730).

First, recognize that if your timestamps are already in local time of the given offset, then you need to subtract that offset to convert back to UTC. In that first example you gave, 2011-06-22 11:00:00 -0700 is equivalent to 2011-06-22 18:00:00 UTC.
However, rather than try to add or subtract these values yourself, you should let the AT TIME ZONE function do the work for you. It will create a timestamptz that is in your supplied offset, then you can use it again to convert to UTC.
(Note that you could use the CONVERT_TIMEZONE function instead, but that one is only understood by Redshift, where AT TIME ZONE works on regular PostgreSQL also.)
However, the problem you have is that your time zone offsets aren't in a format understood by these functions. See time zone usage notes. So, before we try to convert, let's translate your offset strings to an understood format.
We will want -0700 to become +07:00. The colon is required, and the sign must be flipped because it will be interpreted with the POSIX-style time zone format. In that format, positive values lie west of GMT instead of the usual conventions specified in ISO 8601.
concat(translate(substring(original_timezone_offset, 1, 3), '-+', '+-'),':',substring(original_timezone_offset, 4, 2))
Then we will use that with AT TIME ZONE to do the conversion:
(original_timestamp AT TIME ZONE <the above mess>) AT TIME ZONE 'UTC' AS utc_timestamp
Putting it all together...
WITH t as (
SELECT '2011-06-22 11:00:00.000000'::timestamp as original_timestamp, '-0700' as original_timezone_offset
UNION ALL
SELECT '2014-11-29 17:00:00.000000'::timestamp,'-0800'
UNION ALL
SELECT '2014-12-02 22:00:00.000000'::timestamp,'+0900'
)
SELECT
original_timestamp,
original_timezone_offset,
concat(translate(substring(original_timezone_offset, 1, 3), '-+', '+-'),':',substring(original_timezone_offset, 4, 2)) as modified_timezone_offset,
(original_timestamp AT TIME ZONE concat(translate(substring(original_timezone_offset, 1, 3), '-+', '+-'),':',substring(original_timezone_offset, 4, 2))) AT TIME ZONE 'UTC' AS utc_timestamptz
FROM t
Output:
2011-06-22 11:00:00 -0700 +07:00 2011-06-22 18:00:00
2014-11-29 17:00:00 -0800 +08:00 2014-11-30 01:00:00
2014-12-02 22:00:00 +0900 -09:00 2014-12-02 13:00:00
SQL Fiddle here.
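The expected UTC values can be cross-checked outside Redshift with a short Python sketch. It relies on the standard library's %z directive, which reads offsets such as -0700 using the usual ISO 8601 sign convention, so no sign flip is needed here. The helper name to_utc is a hypothetical one chosen for illustration.

```python
from datetime import datetime, timezone


def to_utc(local_timestamp, offset):
    """Interpret local_timestamp at the given +hhmm/-hhmm offset and return naive UTC."""
    aware = datetime.strptime(
        f"{local_timestamp} {offset}", "%Y-%m-%d %H:%M:%S.%f %z"
    )
    # Convert to UTC, then drop the tzinfo to mimic a plain TIMESTAMP column.
    return aware.astimezone(timezone.utc).replace(tzinfo=None)


rows = [
    ("2011-06-22 11:00:00.000000", "-0700"),
    ("2014-11-29 17:00:00.000000", "-0800"),
    ("2014-12-02 22:00:00.000000", "+0900"),
]
for ts, off in rows:
    print(ts, off, to_utc(ts, off))
```

The same parsing also handles offsets that are not whole hours, such as +0730.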

Resample pandas dataframe and count instances

If I have a dataframe such as:
index = pd.date_range(start='2014 01 01 00:00', end='2014 01 05 00:00', freq='12H')
df = pd.DataFrame(pd.np.random.randn(9),index=index,columns=['A'])
df
Out[5]:
A
2014-01-01 00:00:00 2.120577
2014-01-01 12:00:00 0.968724
2014-01-02 00:00:00 1.232688
2014-01-02 12:00:00 0.328104
2014-01-03 00:00:00 -0.836761
2014-01-03 12:00:00 -0.061087
2014-01-04 00:00:00 -1.239613
2014-01-04 12:00:00 0.513896
2014-01-05 00:00:00 0.089544
And I want to resample to daily frequency, it is quite easy:
df.resample(rule='1D',how='mean')
Out[6]:
A
2014-01-01 1.544650
2014-01-02 0.780396
2014-01-03 -0.448924
2014-01-04 -0.362858
2014-01-05 0.089544
However, I need to track how many instances are going into each day. Is there a good pythonic way of using resample to both perform the specified "how" operation AND track number of data points going into each mean value, e.g. yielding
Out[6]:
A Instances
2014-01-01 1.544650 2
2014-01-02 0.780396 2
2014-01-03 -0.448924 2
2014-01-04 -0.362858 2
2014-01-05 0.089544 2

Conveniently, how accepts a list:
df1 = df.resample(rule='1D', how=['mean', 'count'])
This will return a DataFrame with a MultiIndex column: one level for 'A' and another level for 'mean' and 'count'. To get a simple DataFrame like the desired output in your question, you can drop the extra level like df1.columns = df1.columns.droplevel(0) or, better, you can do your resampling on df['A'] instead of df.
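The how argument was removed from resample in later pandas releases. A minimal sketch of the same idea with the newer method-chaining style follows; it resamples the single column df['A'] so the result has flat column names, and it uses deterministic values in place of the random data so the output is easy to verify.

```python
import numpy as np
import pandas as pd

# Nine points spaced 12 hours apart, as in the question,
# with predictable values 0.0 .. 8.0 instead of random ones.
index = pd.date_range(start="2014-01-01 00:00", end="2014-01-05 00:00", periods=9)
df = pd.DataFrame({"A": np.arange(9, dtype=float)}, index=index)

# One row per day, with the mean and the number of points behind it.
daily = df["A"].resample("1D").agg(["mean", "count"])
print(daily)
```

With this index the last day contains a single point (midnight on 2014-01-05), so its count is 1 rather than 2.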
