Boto3 and DynamoDB - How to mimic Aggregation - amazon-web-services

I have a table in DynamoDB in the below format:
DeviceId (PK)   SensorDataType   SensorValue   CurrentTime (SK)
BSMD002         HeartRate        86            2021-03-13 14:50:17.292663
BSMD002         HeartRate        106           2021-03-13 14:50:17.564644
BSMD002         HeartRate        97            2021-03-13 14:50:17.854391
I am pulling the data from this table using boto3 and want to create a new table based on user input (DeviceId, date range). That table will hold, per sensor type, the max, min, and average values grouped by minute.
I know DynamoDB doesn't support aggregation and that Streams + Lambda is the more efficient way. But I want to understand whether there is a way to do this in boto3. So far I have pulled the data with the code below.
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('BSMDataTable')

devicetag = input("Enter the Device ID to find: ").upper()
datefrom = input("Enter Starting Date in YYYY-MM-DD format: ")
dateto = input("Enter Ending Date in YYYY-MM-DD format: ")

# The partition key needs an equality condition; the sort key can use between()
fe = Key('DeviceId').eq(devicetag) & Key('CurrentTime').between(datefrom, dateto)

response = table.query(
    KeyConditionExpression=fe
)
for i in response['Items']:
    print(i)

You're actually very close. All that's missing is the aggregation of the items from the response.
Here's an example: we first group the items by minute and then calculate the statistics for each minute.
import statistics
import itertools

# Sample data
response = {
    "Items": [
        {"DeviceId": "BSMD002", "SensorDataType": "HeartRate", "SensorValue": 86, "CurrentTime": "2021-03-13 14:50:17.123"},
        {"DeviceId": "BSMD002", "SensorDataType": "HeartRate", "SensorValue": 100, "CurrentTime": "2021-03-13 14:50:18.123"},
        {"DeviceId": "BSMD002", "SensorDataType": "HeartRate", "SensorValue": 19, "CurrentTime": "2021-03-13 14:51:17.123"},
    ]
}

# Group the response items per minute. itertools.groupby expects its input to
# be sorted by the grouping key, which holds here because CurrentTime is the
# sort key of the table.
items_by_minute = itertools.groupby(
    response["Items"],
    key=lambda x: x["CurrentTime"][:16]  # the first 16 characters, i.e. up to the minute
)

# Calculate the statistics for each minute
for minute, items in items_by_minute:
    values_per_minute = [item["SensorValue"] for item in items]
    avg = statistics.mean(values_per_minute)
    min_value = min(values_per_minute)
    max_value = max(values_per_minute)
    print(f"Minute: {minute} / Average {avg} / Min {min_value} / Max {max_value}")
Output
Minute: 2021-03-13 14:50 / Average 93 / Min 86 / Max 100
Minute: 2021-03-13 14:51 / Average 19 / Min 19 / Max 19
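Two more things to keep in mind: a single query() call returns at most 1 MB of data, so a wide date range needs pagination via LastEvaluatedKey, and if you want the stats per sensor type you can extend the grouping key with SensorDataType. Here is a minimal sketch of paginating and writing the per-minute aggregates into a second table, reusing fe, table, devicetag, itertools, and statistics from the code above; the table name 'BSMAggregateTable' and its DeviceId/Minute key schema are assumptions, not something that already exists:
import itertools
import statistics

items = []
response = table.query(KeyConditionExpression=fe)
items.extend(response['Items'])
# Keep querying until DynamoDB stops returning a pagination token
while 'LastEvaluatedKey' in response:
    response = table.query(
        KeyConditionExpression=fe,
        ExclusiveStartKey=response['LastEvaluatedKey'],
    )
    items.extend(response['Items'])

agg_table = dynamodb.Table('BSMAggregateTable')  # hypothetical target table
with agg_table.batch_writer() as batch:
    for minute, group in itertools.groupby(items, key=lambda x: x["CurrentTime"][:16]):
        values = [item["SensorValue"] for item in group]  # boto3 returns numbers as Decimal
        batch.put_item(Item={
            "DeviceId": devicetag,
            "Minute": minute,
            "Avg": statistics.mean(values),  # mean of Decimals stays a Decimal
            "Min": min(values),
            "Max": max(values),
        })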

Related

how to take average of every 10 minutes of a model django

I am using multiple APIs and saving them to the database. I have one model called Station (it has a DateTime field and some other fields) and every API is for one station. These APIs come from devices that measure some variables and they get updated every 3 minutes.
I wrote a background task that calls a saveToDB function and stores them in the database. For example:
1. station A "some variables" 2022/10/1 13:10
2. station B "some variables" 2022/10/1 13:10
3. station A "some variables" 2022/10/1 13:13
4. station B "some variables" 2022/10/1 13:13
Now I need to take an average of every station every 10 minutes, 2 hours, week, month, and year.
There are 30 stations. How can I do this?
If your question is what the Django code would look like to make these calculations, you should read up on aggregation in the Django docs; jump down to the "values()" section of the page. The code to group by station and calculate the average of all entries would be:
Station.objects.values('station_id').annotate(myAverage=Avg('some_variable'))
This will group all entries by station.
However, you can simplify by using a nested loop to isolate each station and run the average over each 10-minute interval. It isn't clear exactly which 10-minute intervals you need, but let's say you want every 10-minute interval from yesterday. Something like:
from datetime import datetime, timedelta
from django.db.models import Avg
from .models import Station

def test():
    # get yesterday's date at midnight
    yesterday_at_midnight = datetime.today().replace(hour=0, minute=0, second=0, microsecond=0) - timedelta(days=1)
    # loop over each distinct station in the database
    for station_id in Station.objects.values_list('station_id', flat=True).distinct():
        # reset the window to the first 10-minute interval of yesterday
        start_time = yesterday_at_midnight
        end_time = yesterday_at_midnight + timedelta(minutes=10)
        # loop over the 144 10-minute intervals in 24 hours
        for i in range(144):
            # average the entries that fall within the current 10-minute interval
            average = Station.objects.filter(
                station_id=station_id,
                created_at__gte=start_time,
                created_at__lt=end_time,
            ).aggregate(Avg('some_variable'))
            # do something with this dict
            print(average)
            # increment the start and end times by 10 minutes
            start_time += timedelta(minutes=10)
            end_time += timedelta(minutes=10)
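If 30 stations times 144 intervals turns out to be too many queries, the bucketing can also be pushed into the database with one grouped query for the whole day. A rough sketch, assuming the same created_at and some_variable field names and a backend where integer division truncates (e.g. PostgreSQL):
from datetime import datetime, timedelta
from django.db.models import Avg, ExpressionWrapper, IntegerField
from django.db.models.functions import ExtractHour, ExtractMinute
from .models import Station

def averages_for_yesterday():
    start = datetime.today().replace(hour=0, minute=0, second=0, microsecond=0) - timedelta(days=1)
    end = start + timedelta(days=1)
    # 10-minute bucket index 0-143, computed in the database as hour * 6 + minute / 10
    bucket = ExpressionWrapper(
        ExtractHour('created_at') * 6 + ExtractMinute('created_at') / 10,
        output_field=IntegerField(),
    )
    return (Station.objects
            .filter(created_at__gte=start, created_at__lt=end)
            .annotate(bucket=bucket)
            .values('station_id', 'bucket')
            .annotate(avg=Avg('some_variable'))
            .order_by('station_id', 'bucket'))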

Athena query performance between similar queries differs significantly

Noticed the other day that there are some significant differences in query performance when running two nearly identical queries.
QUERY 1:
SELECT * FROM "table"
WHERE (badge = 'xyz' or badge = 'abc')
and ((year = '2021' and month = '11' and day = '1')
or (year = '2021' and month = '10' and day = '31'))
ORDER BY timestamp
Runtime: 40.751 sec
Data scanned: 94.06 KB
QUERY 2:
SELECT * FROM "table"
WHERE (badge = 'xyz' or badge = 'abc')
and ((year = '2021' and month = '10' and day = '30')
or (year = '2021' and month = '10' and day = '31'))
ORDER BY timestamp
Runtime: 1.78 sec
Data scanned: 216.86 KB
The only major difference between the two is that one query looks at 11/1 & 10/31 and the other looks at 10/31 & 10/30. So there is an additional month partition being looked at in QUERY 1.
When running both queries with EXPLAIN, I noticed that QUERY 2 uses a TableScan while QUERY 1 uses a ScanFilter.
Anyone know why this might be the case between these two queries?
Additional Details:
Time in queue for both queries was sub 1 second.
In s3, the data is structured as follows:
badge=%s/year=%s/month=%s/day=%s/hour=%s
badge, year, month, day & hour are all partition keys defined via Partition Projection.
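For anyone trying to reproduce the comparison, the runtime and data-scanned numbers in the console can also be pulled programmatically with boto3, which makes A/B-testing the two date ranges easier. A minimal sketch; the S3 output location is an assumption and must be a bucket you own:
import time
import boto3

athena = boto3.client('athena')

def run_and_profile(sql, output_location='s3://my-athena-results/'):  # hypothetical bucket
    qid = athena.start_query_execution(
        QueryString=sql,
        ResultConfiguration={'OutputLocation': output_location},
    )['QueryExecutionId']
    # Poll until the query reaches a terminal state
    while True:
        execution = athena.get_query_execution(QueryExecutionId=qid)['QueryExecution']
        state = execution['Status']['State']
        if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
            break
        time.sleep(1)
    stats = execution['Statistics']
    return state, stats['EngineExecutionTimeInMillis'], stats['DataScannedInBytes']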

How to sum data from an annotated query

I'm trying to create a report that shows the number of males, females, and total customers in a given day by hour.
The data is inserted into the database as a transaction whenever somebody enters the building. It stores their gender there.
The query to gather the data looks as follows:
initial_query = EventTransactions.objects.using('reportsdb').filter(
    building_id=id,
    actualdatetime__lte=end_date,
    actualdatetime__gte=beg_date,
)
From there, I annotate the query to extract the date:
ordered_query = initial_query.annotate(
    year=ExtractYear('actualdatetime'),
    month=ExtractMonth('actualdatetime'),
    day=ExtractDay('actualdatetime'),
    hour=ExtractHour('actualdatetime'),
    male=Coalesce(Sum(Case(When(customer_gender='M', then=1)), output_field=IntegerField()), Value(0)),
    female=Coalesce(Sum(Case(When(customer_gender='F', then=1)), output_field=IntegerField()), Value(0))
).values(
    'year', 'month', 'day', 'hour', 'male', 'female'
)
How do I then sum the male customers and female customers by hour?
By this, I mean that I wish to provide a table to the user which contains each hour of the day (can just be a number from 0-23 at this point), total males for that hour, total females for that hour, and total customers for that hour:
TIME | MALE | FEMALE | TOTAL
   0 |   12 |      4 |    16
   1 |    5 |      8 |    13
   2 |    2 |      3 |     5
   3 |   20 |     38 |    58
etc.
I'd be happy to provide more information if necessary. Thank you!
You're nearly there. Do the annotation for the male/female counts after calling values(), so that the database aggregates over the groups formed by the extracted datetime values.
ordered_query = initial_query.annotate(
    year=ExtractYear('actualdatetime'),
    month=ExtractMonth('actualdatetime'),
    day=ExtractDay('actualdatetime'),
    hour=ExtractHour('actualdatetime'),
).values(
    'year', 'month', 'day', 'hour',
).annotate(
    male=Coalesce(Sum(Case(When(customer_gender='M', then=1)), output_field=IntegerField()), Value(0)),
    female=Coalesce(Sum(Case(When(customer_gender='F', then=1)), output_field=IntegerField()), Value(0))
)
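The TOTAL column you asked for can then be derived from the two counts in the same query. A small sketch using an F expression on top of the grouped queryset:
from django.db.models import F

# Add the requested TOTAL column by summing the two grouped counts
report = ordered_query.annotate(total=F('male') + F('female'))
for row in report:
    print(row['hour'], row['male'], row['female'], row['total'])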

counting number of new values per month in pandas dataframe

I have a huge list (pandas dataframe) that looks like this:
            user       userID
Date
1/1/2018    Annual     12345
1/3/2018    Annual     12345
1/5/2018    One Time
1/11/2018   One Time
1/12/2018   One Time
1/13/2018   Annual     98765
.
.
2/1/2018    Annual     12345
2/3/2018    Annual     12345
2/5/2018    One Time
2/11/2018   One Time
2/12/2018   One Time
2/13/2018   Annual     98765
This is a history of user activity: every time someone uses this service, it is recorded. There are annual membership holders and one-time users.
What I want to do is count the number of new annual membership purchases per month.
A membership is valid for one year, so I assume that if a membership is purchased on 1/1/2017, it is valid until 12/31/2017. In the example list above, user 12345 used the service twice, but the second use shouldn't count because user 12345 purchased an annual membership on 1/1/2018. Similarly, user 12345's activity on 2/1/2018 shouldn't count as a new membership purchase because the membership was purchased on 1/1/2018.
It is also assumed that the annual membership was purchased at the first use of the service as an annual membership holder (so userID 12345 purchased his/her membership on 1/1/2018).
EDIT
example
import pandas as pd
from random import randint, randrange
from datetime import timedelta, datetime

start = datetime.strptime('1/1/2017', '%m/%d/%Y')
end = datetime.strptime('12/31/2017', '%m/%d/%Y')

def random_date(start, end):
    delta = end - start
    int_delta = (delta.days * 24 * 60 * 60) + delta.seconds
    random_second = randrange(int_delta)
    return start + timedelta(seconds=random_second)

userIDs = []
dates = []
userType = []
for i in range(10000):
    userIDs.append(randint(100, 999))
    dates.append(random_date(start, end))
    userType.append(randint(1, 2))

df = pd.DataFrame({'ID': userIDs, 'date': dates, 'type': userType})
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
You could try groupings (by year and userID, then by year and month), but working with the expiry date would require many maneuvers. I believe a more mechanical solution handles the problem pretty straightforwardly.
from dateutil.relativedelta import relativedelta

count = {}  # number of new subscriptions per month
since = {}  # each member's date of subscription
# assumes 'date' is a regular column; call df.reset_index() first if it is the index
for i, r in df[df.type == 1].sort_values('date').iterrows():
    if r.ID in since and r.date < since[r.ID] + relativedelta(years=1):
        continue  # still a valid member, not counted as a new purchase
    since[r.ID] = r.date
    ym = r.date.year, r.date.month
    count[ym] = count.setdefault(ym, 0) + 1
I prefer not to use date as the index, because two members should be able to subscribe at the same time.
Printing count in order gives something like:
(2017, 1) 94
(2017, 2) 7
(2018, 1) 76
(2018, 2) 20
(2018, 3) 5
(2019, 1) 50
(2019, 2) 39
(2019, 3) 10
(2019, 4) 2
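If you want the result as a pandas object rather than a plain dict (for plotting or joining with other data), the count dict converts directly:
import pandas as pd

# Turn the (year, month) -> count dict into a Series sorted by month
new_memberships = pd.Series(count).sort_index()
print(new_memberships)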

How do I filter a queryset to dates within the next x months?

I have a model called Post with an attribute date.
I want to filter the queryset so that I retrieve posts only in the next x months. e.g. next 3 months
import calendar
from datetime import date, datetime

def add_months(sourcedate, months):
    month = sourcedate.month - 1 + months
    year = sourcedate.year + month // 12  # integer division for Python 3
    month = month % 12 + 1
    # clamp the day so e.g. Jan 31 + 1 month lands on the end of February
    day = min(sourcedate.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)

today = datetime.today()
Post.objects.filter(date__range=[today.date(), add_months(today, 3)])
Reference
- How to increment datetime by custom months in python without using library
- Django database query: How to filter objects by date range?
Take a look at the __range lookup: build the dates and pass them to the QuerySet. See the related questions above for reference.
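If you can add python-dateutil as a dependency, relativedelta handles the month arithmetic for you, including the end-of-month clamping that add_months does by hand:
from datetime import date
from dateutil.relativedelta import relativedelta

today = date.today()
# relativedelta clamps the day to the end of the month where needed
Post.objects.filter(date__range=[today, today + relativedelta(months=3)])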