Grab rows between two Datetimes and avoid iterating - python-2.7

I use pandas to retrieve a lot of data via an SQL query (from Hive). I now have a big DataFrame:
market_pings = pandas.read_sql_query(query, engine)
market_pings['event_time'] = pandas.to_datetime(market_pings['event_time'])
I have also calculated time-delta periods: whenever something interesting happens within the timeline of these events, I want only the rows of market_pings that fall inside that time interval.
To grab DataFrame rows where a column has certain values there is a cool trick:
value_list = ['value1', 'value2', 'value3']
df = df[df.column.isin(value_list)]
Does anyone have an idea how to do this for time periods, so that I get the events of certain times from the market_pings DataFrame without iterating row by row?
I can build a list of periods (1s accuracy) like:
2015-08-03 19:19:47
2015-08-03 19:20:00
But this means my value list becomes a list of tuples and I somehow have to compare dates.

You can create a list of timestamps as your value_list and then do the operation you intend to:
import pandas as pd
import numpy as np

time_list = [pd.Timestamp('2015-08-03 19:19:47'), pd.Timestamp('2015-08-03 19:20:00')]
One thing about between_time() is that the index has to be a datetime (or time) index; if it is not, you can set one with set_index().
mydf = pd.Series(np.random.randn(4),
                 index=pd.to_datetime(['2015-08-03 19:19:47', '2015-08-03 19:20:00',
                                       '2015-08-03 19:19:48', '2015-08-03 21:20:00']))
mydf
Out[123]:
2015-08-03 19:19:47 0.632509
2015-08-03 19:20:00 -0.234267
2015-08-03 19:19:48 0.159056
2015-08-03 21:20:00 -0.842017
dtype: float64
mydf.between_time(start_time=pd.Timestamp('2015-08-03 19:19:47'),
                  end_time=pd.Timestamp('2015-08-03 19:20:00'),
                  include_end=False)
Out[124]:
2015-08-03 19:19:47 0.632509
2015-08-03 19:19:48 0.159056
dtype: float64
mydf.between_time(start_time=pd.Timestamp('2015-08-03 19:19:47'),
                  end_time=pd.Timestamp('2015-08-03 19:20:00'),
                  include_start=False, include_end=False)
Out[125]:
2015-08-03 19:19:48 0.159056
dtype: float64
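Note that between_time() only compares the time-of-day part of the index. If you instead want the rows of market_pings that fall between two full datetimes, a plain boolean mask also works and avoids row-by-row iteration. A minimal sketch, assuming the market_pings frame and event_time column from the question (the bounds are just the example period; the periods list is hypothetical):

import pandas as pd

start = pd.Timestamp('2015-08-03 19:19:47')
end = pd.Timestamp('2015-08-03 19:20:00')

# Vectorised comparison: keep rows whose event_time lies in [start, end)
mask = (market_pings['event_time'] >= start) & (market_pings['event_time'] < end)
interesting = market_pings[mask]

# For several periods, OR the per-period masks together instead of iterating rows
periods = [(start, end)]  # hypothetical list of (start, end) tuples
mask = pd.Series(False, index=market_pings.index)
for s, e in periods:
    mask |= (market_pings['event_time'] >= s) & (market_pings['event_time'] < e)
interesting = market_pings[mask]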

Related

Serialize pandas dataframe containing NaN fields before sending as a response

I have a dataframe that has NaN fields in it. I want to send this dataframe as a response. Because it has NaN fields I get this error:
ValueError: Out of range float values are not JSON compliant
I don't want to drop the fields or fill them with a placeholder character, and the default response structure is ideal for my application.
Here is my views.py
...
forecast = forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']]
forecast['actual_value'] = df['y']  # <- NaN fields are added here
forecast.rename(
    columns={
        'ds': 'date',
        'yhat': 'predictions',
        'yhat_lower': 'lower_bound',
        'yhat_upper': 'higher_bound'
    }, inplace=True
)
context = {
    'detail': forecast
}
return Response(context)
Dataframe,
date predictions lower_bound higher_bound actual_value
0 2022-07-23 06:31:41.362011 3.832143 -3.256209 10.358063 1.0
1 2022-07-23 06:31:50.437211 4.169004 -2.903518 10.566005 7.0
2 2022-07-28 14:20:05.000000 12.085815 5.267806 18.270929 20.0
...
16 2022-08-09 15:07:23.000000 105.655997 99.017424 112.419991 NaN
17 2022-08-10 15:07:23.000000 115.347283 108.526287 122.152684 NaN
Hoping to find a way to send dataframe as a response.
You could try to use the fillna and replace methods to get rid of those NaN values.
Adding something like this should work as None values are JSON compliant:
forecast = forecast.fillna(np.nan).replace([np.nan], [None])
Using replace alone can be enough, but using fillna first prevents errors if you also have NaT values, for example.
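For completeness, a minimal sketch of the whole step, assuming numpy is imported and forecast is the frame from the views.py above; converting to a list of records is just one hedged way to hand Response something plainly serializable:

import numpy as np

# NaT -> NaN first, then every NaN -> Python None, which JSON encoders accept
forecast = forecast.fillna(np.nan).replace([np.nan], [None])

# A list of plain dicts, one per row, is straightforward to serialize
context = {'detail': forecast.to_dict(orient='records')}
return Response(context)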

Convert a number column into a time format in Power BI

I'm looking for a way to convert a decimal number into a valid HH:mm:ss format.
I'm importing data from an SQL database.
One of the columns in my database is labelled Actual Start Time.
The values in my database are stored in the following decimal format:
73758 // which translates to 07:27:58
114436 // which translates to 11:44:36
I cannot simply convert this Actual Start Time column into a Time format in my Power BI import as it returns errors for some values, saying it doesn't recognise 73758 as a valid 'time'. It needs to have a leading zero for cases such as 73758.
To combat this, I created a new Text column with the following code to append a leading zero:
Column = FORMAT([Actual Start Time], "000000")
This returns the following results:
073758
114436
-- which is perfect. Exactly what I needed.
I now want to convert these values into a Time.
Simply changing the data type field to Time doesn't do anything, returning:
Cannot convert value '073758' of type Text to type Date.
So I created another column with the following code:
Column 2 = FORMAT(TIME(LEFT([Column], 2), MID([Column], 3, 2), RIGHT([Column], 2)), "HH:mm:ss")
To pass the values 07, 37 and 58 into a TIME format.
This returns the following:
 _________________________________________
| Actual Start Time | Column | Column 2 |
|___________________|________|__________|
| 73758             | 073758 | 07:37:58 |
| 114436            | 114436 | 11:44:36 |
Which is what I wanted but is there any other way of doing this? I want to ideally do it in one step without creating additional columns.
You could use a variable as suggested by Aldert, or you can replace the [Column] reference with the FORMAT call directly:
Time Format = FORMAT(
    TIME(
        LEFT(FORMAT([Actual Start Time], "000000"), 2),
        MID(FORMAT([Actual Start Time], "000000"), 3, 2),
        RIGHT([Actual Start Time], 2)),
    "hh:mm:ss")
Edit:
If you want to do this in Power Query, you can create a custom column with the following calculation:
Time.FromText(
    if Text.Length([Actual Start Time]) = 5
    then Text.PadStart([Actual Start Time], 6, "0")
    else [Actual Start Time])
Once this column is created you can drop the old column, so that you only have one time column in the data. Hope this helps.
I show the concept of variables on purpose, so you can use it in the future with more complex queries.
TimeC =
VAR timeStr = FORMAT([Actual Start Time], "000000")
RETURN FORMAT(TIME(LEFT(timeStr, 2), MID(timeStr, 3, 2), RIGHT(timeStr, 2)), "HH:mm:ss")

Adding moving average column to dataframe per indexed category variable

I have a pandas dataframe of time-series weight data for over 100 scales, each named by a "short_id". I am having trouble figuring out the best way to apply a moving filter to each scale's weight data to remove outliers.
Here is a sample of the data:
Out[159]:
published_at short_id weight
0 2017-11-08 16:03:36 INT16 50.35
1 2017-11-08 16:02:43 INT1 45.71
2 2017-11-08 16:02:10 NOT11 35.52
3 2017-11-08 16:01:07 INT7 50.03
4 2017-11-08 16:00:23 INT3 47.04
Converting the dataframe into a dictionary per "short_id" and applying a moving filter to each dict item did not work out, nor did converting the data from "long" to "wide" format (using pandas.pivot_table).
It seems like it should be possible in one line using groupby and then applying the rolling function:
df['MovingFilt'] = df.groupby('short_id')['weight'].apply(pd.rolling(6).median())
but I receive an error: TypeError: incompatible index of inserted column with frame index. This is because sometimes there is weight data at the same time for certain scales, but not usually.
Is this the best way to approach the problem?
Creating a new dataframe per 'short_id' and then using the following seems not pythonic enough, although it runs fine:
INT16['MovingFilt'] = pd.Series.rolling(INT16['weight'], window=6, center=True).median()
The error is because you wrote the groupby wrong:
df['MovingFilt'] = df.groupby('short_id')['weight'].rolling(6).median().values
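If the frame is not sorted by short_id, assigning .values back can pair medians with the wrong rows; letting pandas align on the original index is a safer variant. A small sketch, assuming the df from the question and the centred window of 6 used in the per-scale version above:

# Rolling median per scale; result is indexed by (short_id, original row index)
rolled = df.groupby('short_id')['weight'].rolling(6, center=True).median()

# Drop the group level so the result aligns with df's own index on assignment
df['MovingFilt'] = rolled.reset_index(level=0, drop=True)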

Python 2.7 Find occurrences from datetime and plot

Since I didn't find this topic anywhere else, I will ask it here. I am getting data from a CSV file; one of its columns contains values in datetime format. I read that column with the pandas module, and then I need to count occurrences in specific time slots and plot them with matplotlib. Below you can see an example of the column.
Time and Date
0 2015-08-21 10:51:06.398000
1 2015-08-21 10:51:00.017000
2 2015-08-21 10:52:06.402000
3 2015-08-21 10:54:06.407000
...
I know how I can split time like so:
pd.date_range("10:50", "12:30", freq="1min").time
But how can I count the occurrences of the values read from the CSV in those slots and then plot them? Any advice or direction would help.
It's hard to tell what you want as you haven't posted the desired output, but if I understand you correctly you want to count the number of rows in time intervals of a certain length. You can do this by combining resample and len. To use resample, first set the index to 'Time and Date':
df = df.set_index('Time and Date', drop=False)
Note that drop=False is only necessary if the data frame has no other columns.
Then to get the number of rows in each 1-minute interval do
counts = df.resample('1min', len).astype(int)
If there are multiple dates and you want to sum the counts for each time interval over dates do
counts.groupby(lambda ts: ts.time()).sum()
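In newer pandas the how argument to resample is gone, so the same count can be written with .size(). A minimal sketch, assuming the 'Time and Date' column from the question (the file name events.csv is just a placeholder) and matplotlib for the plot:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('events.csv', parse_dates=['Time and Date'])  # hypothetical file name
df = df.set_index('Time and Date')

# Number of rows (occurrences) per 1-minute slot
counts = df.resample('1min').size()

counts.plot()  # or counts.plot(kind='bar') for discrete slots
plt.show()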

Filter a dataframe

I'm trying to filter a dataframe for a certain date in a column.
The column entries are timestamps, and I am trying to construct a boolean vector from them, checking for a certain date.
I tried:
filterfr = df[(df.expiration.month == 6) & (df.expiration.day == 22) & (df.expiration.year == 2002)]
It doesn't work, because 'Series' object has no attribute 'month'.
How can this be done?
When you do df.expiration, you get back a Series where the items are the expiration datetimes.
Try comparing to an actual datetime.datetime object:
filterfr = df[df['expiration'] == datetime.datetime(2002, 6, 22)]
You may want to look into using a DatetimeIndex, depending on your dataset. This lets you use the convenient syntax
df['2002-06-22']
To have access to the DatetimeIndex methods you have to wrap it in DatetimeIndex (currently*).
The fastest way is to access the day, month and year attributes (just like you attempted):
expir = pd.DatetimeIndex(df['expiration'])
(expir.day == 22) & (expir.month == 6) & (expir.year == 2002)
Alternative, but slower ways are to use the normalize method (to bring it to the start of the day), or to use the date attribute:
pd.DatetimeIndex(df['expiration']).normalize() == datetime.datetime(2002, 6, 22)
pd.DatetimeIndex(df['expiration']).date == datetime.date(2002, 6, 22)
*In 0.15 there will be a dt attribute so that you can access these as:
expir = df['expiration']
expir.dt.day ...
This
filterfr = df[df['expiration'] == datetime.datetime(2002, 6, 22)]
worked fine.
However, after doing some filtering, I got an error when trying to do filterfr.expiration[0] or filterfr['expiration'][0] to get the first element in the series.
KeyError: 0L is raised, although there are elements in the series.
The series looks like this:
Name: expiration, Length: 534668, dtype: datetime64[ns]
Shouldn't this actually always work?
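Not necessarily: the boolean filter keeps the original integer labels, so the row that was labelled 0 may simply have been filtered out, and [0] is a label lookup, not a positional one. A short sketch of the usual workarounds, assuming filterfr from above:

# Positional access always returns the first remaining element
first = filterfr['expiration'].iloc[0]

# Or renumber the index after filtering so label 0 exists again
filterfr = filterfr.reset_index(drop=True)
first = filterfr['expiration'][0]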