I have a Django model stored in a Postgres DB, consisting of count readings taken at irregular intervals:
WidgetCount
- Time
- Count
I'm trying to use a window function with Lag to give me the previous row's values as an annotation. My problem is that when I try to combine it with a distinct date truncation, the window function uses the source rows rather than the distinctly grouped ones.
For example if I have the following rows:
time count
2020-01-20 05:00 15
2020-01-20 06:00 20
2020-01-20 09:00 30
2020-01-21 06:00 35
2020-01-21 07:00 40
2020-01-22 04:00 50
2020-01-22 06:00 54
2020-01-22 09:00 58
If I want to return a queryset showing the first reading per day, I can use:
from django.db.models.functions import Trunc
WidgetCount.objects.distinct("date").annotate(date=Trunc("time", "day"))
Which gives me:
date      count
01/20/20  15
01/21/20  35
01/22/20  50
I would like to add an annotation which gives me yesterday's value (so I can show the change per day).
date      count  yesterday_count
01/20/20  15
01/21/20  35     15
01/22/20  50     35
If I do:
from django.db.models.functions import Trunc, Lag
from django.db.models import F, Window
WidgetCount.objects.distinct("date").annotate(date=Trunc("time", "day"), yesterday_count=Window(expression=Lag("count")))
The second row returned gives me 30 for yesterday_count - i.e., it's showing me the previous source row, before the distinct clause is applied.
If I add a partition clause like this:
WidgetCount.objects.distinct("date").annotate(date=Trunc("time", "day"), yesterday_count=Window(expression=Lag("count"), partition_by=F("date")))
Then yesterday_count is None for all rows.
I can do this calculation in Python if I need to, but it's driving me a bit mad and I'd like to find out if what I'm trying to do is possible.
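For what it's worth, the Python fallback I have in mind looks roughly like this (a minimal sketch that pulls every row ordered by time and keeps the first reading per day; model and field names are the ones from the example above):

from collections import OrderedDict

firsts = OrderedDict()
for wc in WidgetCount.objects.order_by("time"):
    firsts.setdefault(wc.time.date(), wc.count)  # keep only the first reading of each day

previous = None
for day, count in firsts.items():
    print(day, count, previous)  # previous is yesterday's first count, None on the first day
    previous = count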
Thanks!
I think the main problem is that you're mixing operations that, when used in an annotation, produce a grouped queryset (such as Sum) with operations that simply add a new field to each record of the given queryset (such as yesterday_count=Window(expression=Lag("count"))).
So ordering really matters here. When you try:
WidgetCount.objects.distinct("date").annotate(date=Trunc("time", "day"), yesterday_count=Window(expression=Lag("count")))
The resulting queryset is simply WidgetCount.objects.distinct("date") with the annotations added; no grouping is performed.
I would suggest decoupling your operations so it becomes easier to understand what is happening. Also notice that you're iterating over the resulting queryset in Python below, so no extra queries are made!
Note: I'm using Sum as the example because I'm getting an unexpected error with the FirstValue expression, so I'm posting with Sum to demonstrate the idea, which remains the same. For first values, just change acc_count=Sum("count") to first_count=FirstValue("count").
from django.db.models import Sum, Window
from django.db.models.functions import Trunc, Lag

for truncDate_groups in WidgetCount.objects.annotate(trunc_date=Trunc('time', 'day')).values("trunc_date")\
        .annotate(acc_count=Sum("count")).values("acc_count", "trunc_date")\
        .order_by('trunc_date')\
        .annotate(y_count=Window(Lag("acc_count")))\
        .values("trunc_date", "acc_count", "y_count"):
    print(truncDate_groups)
OUTPUT:
{'trunc_date': datetime.datetime(2020, 1, 20, 0, 0, tzinfo=<UTC>), 'acc_count': 65, 'y_count': None}
{'trunc_date': datetime.datetime(2020, 1, 21, 0, 0, tzinfo=<UTC>), 'acc_count': 75, 'y_count': 162}
{'trunc_date': datetime.datetime(2020, 1, 22, 0, 0, tzinfo=<UTC>), 'acc_count': 162, 'y_count': 65}
It turns out the FirstValue expression requires a window function, so you can't nest FirstValue and then calculate Lag over it; in this scenario I'm not exactly sure you can do it. The question becomes how to access the FirstValue column without nesting windows.
I haven't tested it out locally but I think you want to GROUP BY instead of using DISTINCT here.
from django.db.models import Sum, Window
from django.db.models.functions import Trunc, Lag

qs = WidgetCount.objects.values(
    date=Trunc('time', 'day'),
).order_by('date').annotate(
    date_count=Sum('count'),  # will trigger a GROUP BY date
).annotate(
    yesterday_count=Window(Lag('date_count')),
)
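If that queryset works as expected, iterating over it should give one dict per day with the day's total and the previous day's total. A minimal usage sketch, assuming the queryset above is assigned to qs:

for row in qs:
    print(row['date'], row['date_count'], row['yesterday_count'])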
I am using pvlib (https://pvlib-python.readthedocs.io/en/v0.9.0/generated/pvlib.iotools.get_psm3.html) to retrieve NSRDB PSM3 time-series data from the PSM3 API. I was able to retrieve data at a 30-minute interval, but it doesn't work when I change the interval to 5 minutes. Any help?
my_metadata, my_data = pvlib.iotools.get_psm3(
    latitude=xxx, longitude=xxx,
    interval=5,
    api_key=My_key, email='xxx',  # <-- any email works here fine
    # names='tmy'
    names='xxx')
Perhaps you're using an older version of pvlib? I would try updating to the latest version with pip install -U pvlib or similar. You can check your installed version with print(pvlib.__version__).
This snippet successfully retrieves 5-minute data for me with pvlib 0.9.1:
import pvlib
lat, lon = 40, -80
key = '<redacted>'
email = '<redacted>'
df, metadata = pvlib.iotools.get_psm3(lat, lon, key, email, names='2020',
                                      interval=5, map_variables=True)
It's worth pointing out that TMYs are always hourly; 5-minute data is only available for single (calendar) years like the 2020 request above. The get_psm3 docstring mentions this:
interval : int, {60, 5, 15, 30}
interval size in minutes, must be 5, 15, 30 or 60. Only used for
single-year requests (i.e., it is ignored for tmy/tgy/tdy requests).
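So a 5-minute request needs a single-year names value. A short sketch contrasting the two kinds of request (same lat, lon, key and email as above; just an illustration of the docstring note):

# 5-minute data: only available for single calendar years
df_2020, meta_2020 = pvlib.iotools.get_psm3(lat, lon, key, email, names='2020',
                                            interval=5, map_variables=True)
# TMY data: always hourly, so interval is ignored here
df_tmy, meta_tmy = pvlib.iotools.get_psm3(lat, lon, key, email, names='tmy',
                                          map_variables=True)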
This is just for clarification: I know exactly what a Q-pointer is, but today in a meeting the concept of a D-pointer was raised. Does anyone know what a "D" pointer refers to? I've never heard this term before.
This is a nice question because it helped me put together a couple of pieces I had rolling around in my head, so thanks for that!
D's are dictionary items that refer to a logical location in the data array, and you have probably seen them a million times in the DICT of any given file.
A D item in the VOC serves the same purpose and is valid with any query. Lots of shops have some generics (F1, F2, F3, F4, F5, F6, etc.) set up so you don't have to remember the dictionary name if you know which field you want. I think the precedence for dictionary items is DICT file -> VOC, but I could be wrong on that.
As an example to illustrate this, I went into HS.SALES, took one of the DICT items in the CUSTOMER table, and wrote it to the VOC after removing the conversion in field 3. I chose BUY_DATE because it had a conversion.
SORT CUSTOMER BUY_DATE  06:51:04am  10 Oct 2017  PAGE    1
CUSTOMER..  Date Purchased

1           01/07/91
10          01/28/91
            01/29/91
            01/30/91
Remove the conversion and save into the VOC.
>ED DICT CUSTOMER BUY_DATE
10 lines long.
0001: D Date of purchase
0002: 14
0003: D2/
0004: Date Purchased
0005: 8R
0006: M
0007: ORDERS
0008: INTEGER
0009:
0010:
----: 3
0003: D2/
----: R
0003:
----: SAVE VOC F14NOCON
"F14NOCON" filed in file "VOC".
----: Q
Now sort with the new D type. The values are from before 1995, when Pick internal dates were still 4 digits!
SORT CUSTOMER F14NOCON  06:45:25am  10 Oct 2017  PAGE    1
CUSTOMER..  Date Purchased

1           8408
10          8429
            8430
            8431
Good Luck!
I am trying to use the value.match command in OpenRefine 2.6 to split the information present in a column into (at least) two columns.
The data are, however, quite messy.
I have sometimes full dates:
May 30, 1949
Sometimes full dates are combined with other dates and attributes:
May 30, 1949, published 1979
May 30, 1949 and 1951, published 1979
May 30, 1949, printed 1980
May 30, 1949, print executed 1988
May 30, 1949, prints executed 1988
published 1940
Sometimes there is a timespan:
1905-05 OR 1905-1906
Sometimes only the year
1905
Sometimes year with attributes
August or September 1908
The values don't seem to follow any specific schema or order.
I would like to extract (at least) the start and end year, in order to have two columns:
-----------------------
|start_date | end_date|
|1905 | 1906 |
-----------------------
without the rest of the attributes.
I can find the last date using
value.match(/.*(\d{4}).*?/)[0]
and the first one with
value.match(/.*^(\d{4}).*?/)[0]
but I'm having some trouble with the two formulas.
The latter cannot match anything in the case of:
May 30, 1949 and 1951, published 1979
while in the case of:
Paris, winter 1911-12
the latter formula cannot match anything and the former formula matches 1911.
Does anyone know how I can resolve the problem?
I would need a solution that takes the first date as start_date and the final date as end_date, or better (I don't know if it is possible), the earliest date as start_date and the latest date as end_date.
Moreover, I would be glad to have some clue about how to extract other information, such as:
if published or printed or executed is present in the text -> copy the date to a new column named "execution".
It should be something like creating a new column with
if(value.match("string1|string2|string3" + (\d{4}), "perform the operation", do nothing)
value.match() is a very useful but sometimes tricky function. To extract a pattern from a text, I prefer to use Python/Jython's regular expressions:
import re
pattern = re.compile(r"\d{4}")
return pattern.findall(value)
From there, you can create a string with all the years concatenated:
return ",".join(pattern.findall(value))
Or select only the first:
return pattern.findall(value)[0]
Or the last:
return pattern.findall(value)[-1]
etc.
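Putting those pieces together for the start_date/end_date case, a minimal Jython sketch (it only looks at 4-digit years, so a span like "1911-12" would need extra handling; it returns "earliest|latest" in one string so you can split it into two columns afterwards with value.split("|")):

import re
years = re.findall(r"\d{4}", value)
if years:
    return min(years) + "|" + max(years)  # earliest and latest 4-digit year found
return None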
Same thing for your sub-question:
import re
pattern = re.compile(r"(published|printed|executed)\s+(\d+)")
return pattern.findall(value)[0][1]
Or:
import re
pattern = re.compile(r"(published|printed|executed)\s+(\d+)")
m = re.search(pattern, value)
return m.group(2)
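If some cells contain none of those keywords, the search comes up empty and the group lookup fails, which also covers the "do nothing" branch of your pseudocode. A slightly more defensive variant of the same pattern:

import re
pattern = re.compile(r"(published|printed|executed)\s+(\d+)")
m = pattern.search(value)
if m:
    return m.group(2)
return None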
Here is a regex which will extract start_date and end_date into named groups.
If there is only one date, it is considered the start_date:
((?<start_date>\d{4}).*?)?(?<end_date>\d{4}|(?<=-)\d{2})?$
I'm trying to filter a dataframe for a certain date in a column.
The column entries are timestamps, and I'm trying to construct a boolean vector from them,
checking for a certain date.
I tried:
filterfr = df[(df.expiration.month == 6) & (df.expiration.day == 22) & (df.expiration.year == 2002)]
It doesn't work, because 'Series' object has no attribute 'month'.
How can this be done?
When you do df.expiration, you get back a Series where the items are the expiration datetimes.
Try comparing to an actual datetime.datetime object:
filterfr = df[df['expiration'] == datetime.datetime(2002, 6, 22)]
You may want to look into using a DatetimeIndex, depending on your dataset. This lets you use the convenient syntax
df['2002-06-22']
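A minimal sketch of that approach (assuming the expiration column can serve as the index; .loc makes the row selection explicit):

df = df.set_index('expiration')
rows_for_day = df.loc['2002-06-22']  # partial string indexing on the DatetimeIndex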
To have access to the DatetimeIndex methods you have to wrap it in DatetimeIndex (currently*).
The fastest way is to access the day, month and year attributes (just like you attempted):
expir = pd.DatetimeIndex(df['expiration'])
(expir.day == 22) & (expir.month == 6) & (expir.year == 2002)
Alternative, but slower ways are to use the normalize method (to bring it to the start of the day), or to use the date attribute:
pd.DatetimeIndex(df['expiration']).normalize() == datetime.datetime(2002, 6, 22)
pd.DatetimeIndex(df['expiration']).date == datetime.date(2002, 6, 22)
*In 0.15 there will be a dt attribute so that you can access these as:
expir = df['expiration']
expir.dt.day ...
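For completeness, with that dt accessor the filter from the question becomes (a sketch, assuming the expiration column is already datetime64):

expir = df['expiration']
filterfr = df[(expir.dt.year == 2002) & (expir.dt.month == 6) & (expir.dt.day == 22)]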
This
filterfr = df[df['expiration'] == datetime.datetime(2002, 6, 22)]
worked fine.
However, after doing some filtering, I got an error when trying to do
filterfr.expiration[0]
or filterfr['expiration'][0]
to get the first element in the series.
KeyError: 0L is raised, although there are elements in the series.
The series looks like this:
Name: expiration, Length: 534668, dtype: datetime64[ns]
Shouldn't this actually always work?
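For reference, the boolean filter keeps the original integer labels, so label 0 may simply no longer be present in the filtered frame; a positional lookup avoids the KeyError:

first_expiration = filterfr['expiration'].iloc[0]  # first element by position, not by label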
I'd like to subtract two values, one in the current record and the next in the following record. They are time clock entries, and I want to calculate the amount of time an employee spends on his/her break, so I'll have to subtract the time the employee clocked out from the time the employee clocked back in. This will be done for several records, and at the end I also want a total of all the breaks taken.
So how can I do this? I'm doing this in Django BTW.
UPDATE
The records look a bit like this:
employee_id, rec_date, start_time, end_time
18, 2010-08-23, 09:58:00, 14:13:00
18, 2010-08-23, 14:39:00, 18:47:00
19, 2010-08-23, 14:15:00, 18:31:00
21, 2010-08-23, 12:05:00, 14:52:00
21, 2010-08-23, 15:23:00, 18:49:00
21, 2010-08-31, 08:00:00, 12:00:00
21, 2010-08-31, 12:45:00, 19:00:00
You'll have to iterate over the clock actions and do the calculation manually. Something of this sort:
breaks = 0
break_start = None
clock_actions = user.clock_actions.filter(date=desired_day).order_by('time')
for action in clock_actions:
    if action.type == 'CLOCK OUT':
        break_start = action.time  # a break starts when the employee clocks out
    elif break_start is not None:
        # the next 'CLOCK IN' ends the break
        breaks = breaks + (action.time - break_start)
        break_start = None
You may have to adjust the subtraction to work with timedelta objects, depending on your choice of field for time.
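Given the records in the update (one row per work segment with start_time and end_time), the same idea can be applied directly. A minimal sketch, assuming a model named TimeClock with those exact fields and a desired_day date to filter on; date and time are combined so the subtraction yields timedelta objects:

from datetime import datetime, timedelta

entries = TimeClock.objects.filter(employee_id=21, rec_date=desired_day).order_by('start_time')

total_breaks = timedelta(0)
previous_end = None
for entry in entries:
    if previous_end is not None:
        # the break is the gap between the previous clock-out and this clock-in
        total_breaks += (datetime.combine(entry.rec_date, entry.start_time)
                         - datetime.combine(entry.rec_date, previous_end))
    previous_end = entry.end_time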