Group By with Django's queryset - django

I have a model in Django; this is what it looks like with only the relevant fields:
I want to group the rows by buy_price_per_unit and, at the same time, get the total units on sale for each buy_price_per_unit.
In our case only two distinct buy_price_per_unit values exist (9 and 10), so the query should return only two rows, like this:
The last condition I have to meet is that the result should be in descending order of buy_price_per_unit.
This is what I have tried so far:
from django.db.models import Sum

orders = Orders.objects.values('id', 'buy_price_per_unit')\
    .annotate(units=Sum("units"))\
    .order_by("-buy_price_per_unit")
The response to the query above was:
[
    {
        "id": 13,
        "buy_price_per_unit": 10,
        "units": 1
    },
    {
        "id": 12,
        "buy_price_per_unit": 9,
        "units": 10
    },
    {
        "id": 14,
        "buy_price_per_unit": 9,
        "units": 2
    },
    {
        "id": 15,
        "buy_price_per_unit": 9,
        "units": 1
    }
]
The problem with this response is that even for the same price multiple records are being returned.

This is happening because you have id in .values(), so the underlying query groups on both id and buy_price_per_unit.
So simply remove id from .values():
orders = Orders.objects.values('buy_price_per_unit')\
    .annotate(units=Sum("units"))\
    .order_by("-buy_price_per_unit")


Oracle Apex 22.21 - REST data source - nested JSON array - discovery

I need to get an APEX REST Data Source to parse my JSON, which has a nested array. I've read that nested JSON arrays are not supported, but there must be a way.
I have a REST API that returns data via JSON as per below. In APEX, I've created a REST Data Source following the tutorial on this Oracle blog link.
However, Auto-Discovery does not 'discover' the nested array. It only returns the root-level data.
[
    {
        "order_number": "so1223",
        "order_date": "2022-07-01",
        "full_name": "Carny Coulter",
        "email": "ccoulter2#ovh.net",
        "credit_card": "3545556133694494",
        "city": "Myhiya",
        "state": "CA",
        "zip_code": "12345",
        "lines": [
            {
                "product": "Beans - Fava, Canned",
                "quantity": 1,
                "price": 1.99
            },
            {
                "product": "Edible Flower - Mixed",
                "quantity": 1,
                "price": 1.50
            }
        ]
    },
    {
        "order_number": "so2244",
        "order_date": "2022-12-28",
        "full_name": "Liam Shawcross",
        "email": "lshawcross5#exblog.jp",
        "credit_card": "6331104669953298",
        "city": "Humaitá",
        "state": "NY",
        "zip_code": "98670",
        "lines": [
            {
                "order_id": 5,
                "product": "Beans - Green",
                "quantity": 2,
                "price": 4.33
            },
            {
                "order_id": 1,
                "product": "Grapefruit - Pink",
                "quantity": 5,
                "price": 5.00
            }
        ]
    }
]
So in the JSON above, it only 'discovers' order_number through zip_code. The 'lines' array with attributes order_id, product, quantity, and price does not get 'discovered'.
I found this SO question in which Carsten suggests creating the REST Data Source manually. I've tried changing the Row Selector to "." (a dot) and leaving it blank. That still returns the root-level data.
Changing the Row Selector to 'lines' returns only one element from each 'lines' array.
So in the JSON example above, it would only 'discover':
{
    "product": "Beans - Fava, Canned",
    "quantity": 1,
    "price": 1.99
}

{
    "order_id": 5,
    "product": "Beans - Green",
    "quantity": 2,
    "price": 4.33
}
and not the complete array.
This is how the Data Profile is set up when creating the Data Source manually.
There's another SO question with a similar situation, so I followed some of its steps, such as setting the data type for 'lines' to JSON Document. I feel I've tried almost every selector and data type, but obviously not enough.
The docs are not very helpful on this subject and it's been difficult finding links on Google, Oracle Blogs, or SO.
My end goal is to have the two tables below auto-synchronizing from the API.
orders
    id            pk
    order_number  num
    order_date    date
    full_name     vc(200)
    email         vc(200)
    credit_card   num
    city          vc(200)
    state         vc(200)
    zip_code      num

lines
    order_id      /fk orders
    product       vc(200)
    quantity      num
    price         num

view orders_view (orders + lines)
As you're correctly stating, REST Data Sources do not support nested arrays - a REST Source can only "extract" one flat table from the JSON response. In your example, the JSON as such is an array ("orders"). The Row Selector in the Data Profile would thus be "." (to select the "root node").
That gives you all the order attributes, but discovery would skip the lines array. However, you can manually add a column to the Data Profile, of the JSON Document data type, using lines as the selector.
As a result, you'd still get a flat table from the REST Data Source, but that table contains a LINES column with the "JSON Fragment" for the order line items. You could then synchronize the REST Source to a local table ("REST Synchronization"), and then use some custom code to extract the JSON fragments into an ORDER_LINES child table.
Does that help?

Efficient way of running Django query over list of dates

I am working on an investment app in Django which requires calculating portfolio balances and values over time. The database is currently set up this way:
class Ledger(models.Model):
    asset = models.ForeignKey('Asset', ....)
    amount = models.FloatField(...)
    date = models.DateTimeField(...)
    ...

class HistoricalPrices(models.Model):
    asset = models.ForeignKey('Asset', ....)
    price = models.FloatField(...)
    date = models.DateTimeField(...)
Users enter transactions in the Ledger, and I update prices through APIs.
To calculate the balance for a day (note multiple Ledger entries for the same asset can happen on the same day):
from django.db.models import Sum

def balance_date(date):
    return (Ledger.objects
            .filter(date__date__lte=date)
            .values('asset')
            .annotate(total_amount=Sum('amount')))
Trying to then get values for every day between the date of the first Ledger entry and today becomes more challenging. Currently I am doing it this way, assuming start_date and end_date are datetime.date() objects and tr_dates is a list of the unique dates on which transactions actually occurred (to avoid calculating balances on days where nothing happened):
import pandas as pd
idx = pd.date_range(start_date, end_date)
main_df = pd.DataFrame(index=tr_dates)
main_df['date_send'] = main_df.index
main_df['balances'] = main_df['date_send'].apply(lambda x: balance_date(x))
main_df = main_df.sort_index()
main_df.index = pd.DatetimeIndex(main_df.index)
main_df = main_df.reindex(idx, method='ffill')
This works but my issue is performance. It takes at least 150-200ms to run this, and then I need to get the prices for each date (all of them, not just transaction dates) and somehow match and multiply by the correct balances, which makes the run time about 800 ms or more.
Given this is a web app, a view taking at least 800 ms to compute is hardly scalable, so I was wondering if anyone has a better way to do this?
EDIT - Simple example of expected input / output
Ledger entries (JSON format) :
[
    {
        "asset": "asset_1",
        "amount": 10,
        "date": "2015-01-01"
    },
    {
        "asset": "asset_2",
        "amount": 15,
        "date": "2017-10-15"
    },
    {
        "asset": "asset_1",
        "amount": -5,
        "date": "2018-02-09"
    },
    {
        "asset": "asset_1",
        "amount": 20,
        "date": "2019-10-10"
    },
    {
        "asset": "asset_2",
        "amount": 3,
        "date": "2019-10-10"
    }
]
Sample Price from Historical Prices:
[
    {
        "date": "2015-01-01",
        "asset": "asset_1",
        "price": 5
    },
    {
        "date": "2015-01-01",
        "asset": "asset_2",
        "price": 15
    },
    {
        "date": "2015-01-02",
        "asset": "asset_1",
        "price": 6
    },
    {
        "date": "2015-01-02",
        "asset": "asset_2",
        "price": 11
    },
    ...
    {
        "date": "2017-10-15",
        "asset": "asset_1",
        "price": 20
    },
    {
        "date": "2017-10-15",
        "asset": "asset_2",
        "price": 30
    }
]
In this case:
tr_dates is ['2015-01-01', '2017-10-15', '2018-02-09', '2019-10-10']
date_range is ['2015-01-01', '2015-01-02', '2015-01-03', ..., '2019-12-14', '2019-12-15']
Final output I am after: Balances by date with price by date and total value by date
date        asset    balance  price  value
2015-01-01  asset_1  10       5      50
2015-01-01  asset_2  0        10     0
....        (balances do not change as there are no new Ledger entries, but prices change)
2015-01-02  asset_1  10       6      60
2015-01-02  asset_2  0        11     0
....        (all dates between 2015-01-02 and 2017-10-15: no change in balance, but prices change)
2017-10-15  asset_1  10       20     200
2017-10-15  asset_2  15       30     450
...         (dates in between)
2018-02-09  asset_1  5        ..     etc. based on price
2018-02-09  asset_2  15       ..     etc. based on price
...         (dates in between)
2019-10-10  asset_1  25       ..     etc. based on price
2019-10-10  asset_2  18       ..     etc. based on price
...         (continues until the end of date_range)
I have managed to get this working, but it takes about a second to compute, and I ideally need it to be at least 10x faster if possible.
EDIT 2 - Following ac2001's method:
from django.db.models import F, Sum, Window

ledger = (Ledger
          .transaction
          .filter(portfolio=p)
          .annotate(transaction_date=F('date__date'))
          .annotate(transaction_amount=Window(expression=Sum('amount'),
                                              order_by=[F('asset').asc(), F('date').asc()],
                                              partition_by=[F('asset')]))
          .values('asset', 'transaction_date', 'transaction_amount'))
df = pd.DataFrame(list(ledger))
df.transaction_date = pd.to_datetime(df.transaction_date).dt.date
df.set_index('transaction_date', inplace=True)
df.sort_index(inplace=True)
df = df.groupby(by=['asset', 'transaction_date']).sum()
yields the following dataframe (with multiindex):
transaction_amount
asset transaction_date
asset_1 2015-01-01 10.0
2018-02-09 5.0
2019-10-10 25.0
asset_2 2017-10-15 15.0
2019-10-10 18.0
These balances are correct (and also yield correct results on more complex data), but now I need to find a way to forward-fill these results to all the dates in between, as well as from the last date (2019-10-10) to today (2019-12-15), and I am not sure how that works given the MultiIndex.
Final solution
Thanks to ac2001's code and pointers I have come up with the following:
ledger = (Ledger
          .objects
          .annotate(transaction_date=F('date__date'))
          .annotate(transaction_amount=Window(expression=Sum('amount'),
                                              order_by=[F('asset').asc(), F('date').asc()],
                                              partition_by=[F('asset')]))
          .values('asset', 'transaction_date', 'transaction_amount'))

df = pd.DataFrame(list(ledger))
df.transaction_date = pd.to_datetime(df.transaction_date)
df.set_index('transaction_date', inplace=True)
df.sort_index(inplace=True)
df['date_cast'] = pd.to_datetime(df.index).dt.date
df_grouped = df.groupby(by=['asset', 'date_cast']).last()
df_unstacked = df_grouped.unstack(['asset'])
df_unstacked.index = pd.DatetimeIndex(df_unstacked.index)
df_unstacked = df_unstacked.reindex(idx)
df_unstacked = df_unstacked.ffill()
This gives me a matrix of assets by dates. I then get a matrix of prices by dates (from the database) and multiply the two matrices.
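As a rough sketch of that last step (reusing idx, the pd.date_range from earlier, and df_unstacked from above; the HistoricalPrices field names come from the model at the top, and the asset columns will be the foreign-key values returned by .values('asset')):

# Build a date x asset matrix of prices
prices_qs = (HistoricalPrices
             .objects
             .annotate(price_date=F('date__date'))
             .values('asset', 'price_date', 'price'))
prices = pd.DataFrame(list(prices_qs))
prices.price_date = pd.to_datetime(prices.price_date)

price_matrix = (prices
                .pivot_table(index='price_date', columns='asset', values='price')
                .reindex(idx)
                .ffill())

# Balances are already a date x asset matrix after the unstack above
balance_matrix = df_unstacked['transaction_amount']

# Element-wise multiplication gives the value of each asset on each date
value_matrix = balance_matrix * price_matrix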
Thanks
I think this might take some back and forth. I think the best approach is to do this in a couple steps.
Let's start with getting asset balances daily and then we will merge the prices together. The transaction amount is a cumulative total. Does this look correct? I don't have your data so it is a little difficult for me to tell.
ledger = (Ledger
          .objects
          .annotate(transaction_date=F('date__date'))
          .annotate(transaction_amount=Window(expression=Sum('amount'),
                                              order_by=[F('asset').asc(), F('date').asc()],
                                              partition_by=[F('asset')]))
          .values('asset', 'transaction_date', 'transaction_amount'))

df = pd.DataFrame(list(ledger))
df.transaction_date = pd.to_datetime(df.transaction_date)
df.set_index('transaction_date', inplace=True)
df = df.groupby('asset').resample('D').ffill()
df = df.reset_index()  # <-- added this line here
Edit: Then create a dataframe from HistoricalPrices and merge it with the ledger. You might have to adjust the merge criteria to ensure you are getting what you want, but I think this is the correct path.
ledger = df
prices = (HistoricalPrices
          .objects
          .annotate(transaction_date=F('date__date'))
          .values('asset', 'price', 'transaction_date'))
prices = pd.DataFrame(list(prices))
result = ledger.merge(prices, how='left', on=['asset', 'transaction_date'])
Depending on how you are using the data later, if you need a list of dicts (which is a preferred format in Django templates), you can do that conversion with df.to_dict(orient='records').
If you want to group your Ledger entries by date and then calculate the daily asset amount:
Ledger.objects.values('date__date').annotate(total_amount=Sum('amount'))
this should help (edit: fix typo)
second edit: assuming you want to group them by asset as well:
Ledger.objects.values('date__date', 'asset').annotate(total_amount=Sum('amount'))
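With the sample Ledger entries from the question, the per-asset version would return one row per (day, asset) combination, roughly like this (asset is shown by name purely for readability; in practice you get the foreign-key value):

<QuerySet [
    {'date__date': datetime.date(2015, 1, 1), 'asset': 'asset_1', 'total_amount': 10.0},
    {'date__date': datetime.date(2017, 10, 15), 'asset': 'asset_2', 'total_amount': 15.0},
    {'date__date': datetime.date(2018, 2, 9), 'asset': 'asset_1', 'total_amount': -5.0},
    {'date__date': datetime.date(2019, 10, 10), 'asset': 'asset_1', 'total_amount': 20.0},
    {'date__date': datetime.date(2019, 10, 10), 'asset': 'asset_2', 'total_amount': 3.0}
]>

Note that these are daily net amounts; turning them into the running balances the question ultimately needs still requires a cumulative sum over dates.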

How to show the percentage of uptime of an AWS service on the dashboard of CloudWatch?

I want to build a dashboard that displays, for each month, the uptime percentage of an Elastic Beanstalk service in my company.
So I used boto3's get_metric_data to retrieve the EnvironmentHealth CloudWatch metric data and calculate the percentage of time my service was not in the Severe state.
from datetime import datetime
import boto3
SEVERE = 25
client = boto3.client('cloudwatch')
metric_data_queries = [
    {
        'Id': 'healthStatus',
        'MetricStat': {
            'Metric': {
                'Namespace': 'AWS/ElasticBeanstalk',
                'MetricName': 'EnvironmentHealth',
                'Dimensions': [
                    {
                        'Name': 'EnvironmentName',
                        'Value': 'ServiceA'
                    }
                ]
            },
            'Period': 300,
            'Stat': 'Maximum'
        },
        'Label': 'EnvironmentHealth',
        'ReturnData': True
    }
]

response = client.get_metric_data(
    MetricDataQueries=metric_data_queries,
    StartTime=datetime(2019, 9, 1),
    EndTime=datetime(2019, 9, 30),
    ScanBy='TimestampAscending'
)
health_data = response['MetricDataResults'][0]['Values']
total_times = len(health_data)
severe_times = health_data.count(SEVERE)
print(f'total_times: {total_times}')
print(f'severe_times: {severe_times}')
print(f'healthy percent: {1 - (severe_times/total_times)}')
Now I'm wondering how to show the percentage on the dashboard on CloudWatch. I mean I want to show something like the following:
Does anyone know how to upload the healthy percent I've calculated to the dashboard of CloudWatch?
Or is there any other tool that is more appropriate for displaying the uptime of my service?
You can do math with CloudWatch metrics:
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html
You can create a metric math expression with the metrics you have in metric_data_queries and get the result on the graph. Metric math also works with the GetMetricData API, so you could move the calculation into a metric math expression in MetricDataQueries and get the number you need directly from CloudWatch.
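For example, a minimal sketch of that approach (it reuses the metric from the question as m1 with ReturnData turned off, and an expression along the lines of the graph source shown further below; treat the exact expression as an assumption to adapt to your data):

metric_data_queries = [
    {
        'Id': 'm1',
        'MetricStat': {
            'Metric': {
                'Namespace': 'AWS/ElasticBeanstalk',
                'MetricName': 'EnvironmentHealth',
                'Dimensions': [{'Name': 'EnvironmentName', 'Value': 'ServiceA'}]
            },
            'Period': 300,
            'Stat': 'Maximum'
        },
        'ReturnData': False  # only used as input to the expression below
    },
    {
        'Id': 'uptime',
        # same math as in the dashboard source below; 25 is the SEVERE value
        'Expression': '(m1*0 + (1 - AVG(CEIL(ABS(m1-25)/MAX(m1)))))*100',
        'Label': 'Service Uptime',
        'ReturnData': True
    }
]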
It looks like you need a number saying for what percentage of datapoints in the last month the metric value equaled 25.
You can calculate it like this (this is the graph source, which you can paste into the CloudWatch console on the Source tab; make sure the region matches your region and the metric matches your metric):
{
    "metrics": [
        [
            "AWS/ElasticBeanstalk",
            "EnvironmentHealth",
            "EnvironmentName",
            "ServiceA",
            {
                "label": "metric",
                "id": "m1",
                "visible": false,
                "stat": "Maximum"
            }
        ],
        [
            {
                "expression": "25",
                "label": "Value for severe",
                "id": "severe_c",
                "visible": false
            }
        ],
        [
            {
                "expression": "m1*0",
                "label": "Constant 0 time series",
                "id": "zero_ts",
                "visible": false
            }
        ],
        [
            {
                "expression": "1-AVG(CEIL(ABS(m1-severe_c)/MAX(m1)))",
                "label": "Percentage of times value equals severe",
                "id": "severe_pct",
                "visible": false
            }
        ],
        [
            {
                "expression": "(zero_ts+severe_pct)*100",
                "label": "Service Uptime",
                "id": "e1"
            }
        ]
    ],
    "view": "singleValue",
    "stacked": false,
    "region": "eu-west-1",
    "period": 300
}
To explain what is going on there (what is the purpose of each element above, by id):
m1 - This is your original metric. Setting stat to Maximum.
severe_c - Constant you want to use for your SEVERE value.
zero_ts - Creating a constant time series with all values equal zero. This is needed because constants can't be graphed and the final value will be constant. So to graph it, we'll just add the constant to this time series of zeros.
severe_pct - this is where you actually calculate the percentage of value that are equal SEVERE.
m1-severe_c - sets the datapoints with value equal to SEVERE to 0.
ABS(m1-severe_c) - makes all values positive, keeps SEVERE datapoints at 0.
ABS(m1-severe_c)/MAX(m1) - dividing by maximum value ensures that all values are now between 0 and 1.
CEIL(ABS(m1-severe_c)/MAX(m1)) - snaps all values that are different than 0 to 1, keeps SEVERE at 0.
AVG(CEIL(ABS(m1-severe_c)/MAX(m1))) - Because the metric is now all 1s and 0s, with 0 meaning SEVERE, taking the average gives you the percentage of non-severe datapoints.
1-AVG(CEIL(ABS(m1-severe_c)/MAX(m1))) - Finally you need the percentage of severe values, and since values are either severe or not severe, subtracting from 1 gives you the needed number.
e1 - The last expression gave you a constant between 0 and 1. You need a time series between 0 and 100. This is the expression that gives you that: (zero_ts+severe_pct)*100. Note that this is the only result you're returning; all other expressions have "visible": false.
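If you also want to create the dashboard programmatically rather than pasting the source in the console, here is a hedged sketch using boto3's put_dashboard (the widget layout values and dashboard name are placeholders; the properties object is a compacted version of the source above, using a single combined expression):

import json
import boto3

cloudwatch = boto3.client('cloudwatch')

graph_source = {
    "metrics": [
        ["AWS/ElasticBeanstalk", "EnvironmentHealth", "EnvironmentName", "ServiceA",
         {"id": "m1", "stat": "Maximum", "visible": False}],
        [{"expression": "(m1*0 + (1 - AVG(CEIL(ABS(m1-25)/MAX(m1)))))*100",
          "label": "Service Uptime", "id": "e1"}]
    ],
    "view": "singleValue",
    "region": "eu-west-1",
    "period": 300
}

dashboard_body = {
    "widgets": [
        {
            "type": "metric",
            "x": 0, "y": 0, "width": 6, "height": 6,  # placeholder layout
            "properties": graph_source
        }
    ]
}

cloudwatch.put_dashboard(
    DashboardName="service-uptime",  # placeholder name
    DashboardBody=json.dumps(dashboard_body)
)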

PowerBI Custom Visual - Table data binding

Also asked this on the PowerBI forum.
I am trying to change the sampleBarChart PowerBI visual to use a "table" data binding instead of the current "categorical" one. The first goal is to build a simple table visual with inputs "X", "Y" and "Value".
Both data bindings are described on the official wiki. This is all I could find:
I cannot find any example visuals which use it and are based on the new API.
From the image above, a table object has "rows", "columns", "totals" and "identities". So it looks like rows and columns are my x/y indexes, and totals are my values?
This is what I tried. (Naming is slightly off as most of it came from existing barchart code)
Data roles:
{ "displayName": "Category1 Data",
"name": "category1",
"kind": 0},
{ "displayName": "Category2 Data",
"name": "category2",
"kind": 0},
{ "displayName": "Measure Data",
"name": "measure",
"kind": 1}
Data view mapping:
"table": {
"rows": {"for": {"in": "category1"}},
"columns": {"for": {"in": "category2"}},
"totals": {"select": [{"bind": {"to": "measure"}}]}
}
Data Point class:
interface BarChartDataPoint {
    value: number;
    category1: number;
    category2: number;
    color: string;
};
Relevant parts of my visualTransform():
...
let category1 = categorical.rows;
let category2 = categorical.columns;
let dataValue = categorical.totals;
...
for (let i = 1, len = category1.length; i <= len; i++) {
    for (let j = 1, jlen = category2.length; j <= jlen; j++) {
        {
            barChartDataPoints.push({
                category1: i,
                category2: j,
                value: dataValue[i,j],
                color: "#555555" // for now
            });
        }
...
Test data looks like this:
__1_2_3_
1|4 4 3
2|4 5 5
3|3 6 7 (total = 41)
The code above fills barChartDataPoints with just six data points:
(1; 1; 41),
(1; 2; undefined),
(2; 1; 41),
(2; 2; undefined),
(3; 1; 41),
(3; 2; undefined).
Accessing zero indices results in nulls.
Q: Is totals not the right measure to access value at (x;y)? What am I doing wrong?
Any help or direction is much appreciated.
User RichardL shared this link on the PowerBI forum, which helped quite a lot.
"Totals" is not the right measure to access value at (x;y).
It turns out Columns contain column names, and Rows contain value arrays which correspond to those columns.
From the link above, this is what the table structure looks like:
{
    "columns": [
        {"displayName": "Year"},
        {"displayName": "Country"},
        {"displayName": "Cost"}
    ],
    "rows": [
        [2014, "Japan", 25],
        [2015, "Japan", 30],
        [2016, "Japan", 18],
        [2015, "North America", 14],
        [2016, "North America", 30],
        [2016, "China", 100]
    ]
}
You can also view the data as your visual receives it by placing this
window.alert(JSON.stringify(options.dataViews))
in your update() method, or write it into the HTML contents of your visual.
This was very helpful, but it shows up a few fundamental problems with PowerBI's data management for a custom visual. There is no documentation, and the process from roles to mapping to visualTransform is horrendous because it takes so much effort to rebuild the data into a format that can be used consistently with D3.
Commenting on user5226582's example: for me, columns is presented in a form where I have to look up the Roles property to understand the order of the data presented in the rows column array. displayName offers no certainty. For example, if a user uses the same field in two different dataRoles, then it all goes awry.
I think the safest approach is to build a new array inside visualTransform using the known field names (the "name" property in dataRoles), then iterate over columns, interrogating the Roles property to establish an index into the rows array items. Then use that index to populate the new array reliably. D3 then gobbles that up.
I know that's crazy, but at least it means reasonably consistent data and allows for the user selecting the same data field more than once or choosing count instead of column value.
All in all, I think this area needs a lot of attention before custom Visuals can really take off.

kairosdb aggregate group by

I have one year of 15-minute interval data in my KairosDB. I need to do the following things sequentially:
- filter the data using a tag
- group the filtered data using a few tags. I am not specifying tag values because I want the data to be grouped automatically by tag value at runtime.
- once grouped on those tags, aggregate the 15-minute interval data into monthly sums.
I wrote this query to run from a Python script, based on information available on the KairosDB Google Code forum. But the aggregated values seem incorrect; the output seems skewed. I want to understand where I am going wrong. Here is my JSON query:
agg_query = {
    "start_absolute": 1412136000000,
    "end_absolute": 1446264000000,
    "metrics": [
        {
            "tags": {
                "insert_date": ["11/17/2015"]
            },
            "name": "gb_demo",
            "group_by": [
                {
                    "name": "time",
                    "range_size": {
                        "value": "1",
                        "unit": "months"
                    },
                    "group_count": "12"
                },
                {
                    "name": "tag",
                    "tags": ["usage_kind", "building_snapshot_id", "usage_point_id", "interval"]
                }
            ],
            "aggregators": [
                {
                    "name": "sum",
                    "sampling": {
                        "value": 1,
                        "unit": "months"
                    }
                }
            ]
        }
    ]
}
For reference: Data is something like this:
[[1441065600000,53488],[1441066500000,43400],[1441067400000,44936],[1441068300000,48736],[1441069200000,51472],[1441070100000,43904],[1441071000000,42368],[1441071900000,41400],[1441072800000,28936],[1441073700000,34896],[1441074600000,29216],[1441075500000,26040],[1441076400000,24224],[1441077300000,27296],[1441078200000,37288],[1441079100000,30184],[1441080000000,27824],[1441080900000,27960],[1441081800000,28056],[1441082700000,29264],[1441083600000,33272],[1441084500000,33312],[1441085400000,29360],[1441086300000,28400],[1441087200000,28168],[1441088100000,28944],[1443657600000,42112],[1443658500000,36712],[1443659400000,38440],[1443660300000,38824],[1443661200000,43440],[1443662100000,42632],[1443663000000,42984],[1443663900000,42952],[1443664800000,36112],[1443665700000,33680],[1443666600000,33376],[1443667500000,28616],[1443668400000,31688],[1443669300000,30872],[1443670200000,28200],[1443671100000,27792],[1443672000000,27464],[1443672900000,27240],[1443673800000,27760],[1443674700000,27232],[1443675600000,27824],[1443676500000,27264],[1443677400000,27328],[1443678300000,27576],[1443679200000,27136],[1443680100000,26856]]
This is a snapshot of some data from Sep and Oct 2015. When I run this with a start timestamp in September, it sums the September data correctly, but for October it does not.
I believe your group-by on time will create groups by calendar month (January to December), but your sum aggregator will sum values over a running month starting with your start date... which seems a bit weird. Could that be the cause of what you see?
What is the data like? What is the aggregated result like?
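If that is indeed the mismatch, one thing that may be worth experimenting with (an assumption on my part, not something stated in the answer above) is the alignment option that KairosDB range aggregators accept, so that the sum windows are aligned to the sampling unit rather than floating from the first datapoint, e.g.:

"aggregators": [
    {
        "name": "sum",
        "align_sampling": True,
        "sampling": {
            "value": 1,
            "unit": "months"
        }
    }
]

(There is also an align_start_time option; see the KairosDB aggregator documentation for the exact semantics of both.)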