I am working on an investment app in Django which requires calculating portfolio balances and values over time. The database is currently set up this way:
class Ledger(models.Model):
    asset = models.ForeignKey('Asset', ....)
    amount = models.FloatField(...)
    date = models.DateTimeField(...)
    ...

class HistoricalPrices(models.Model):
    asset = models.ForeignKey('Asset', ....)
    price = models.FloatField(...)
    date = models.DateTimeField(...)
Users enter transactions in the Ledger, and I update prices through APIs.
To calculate the balance for a day (note multiple Ledger entries for the same asset can happen on the same day):
def balance_date(date):
    return Ledger.objects.filter(date__date__lte=date).values('asset').annotate(total_amount=Sum('amount'))
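For reference, this returns one dict per asset (the 'asset' key holds the ForeignKey id); with the sample data further below, a date after 2019-10-10 would give something like (ids illustrative):

[{'asset': 1, 'total_amount': 25.0}, {'asset': 2, 'total_amount': 18.0}]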
Trying to then get values for every day between the date of the first Ledger entry and today becomes more challenging. Currently I am doing it this way, assuming a start_date and end_date that are datetime.date objects and tr_dates, a list of the unique dates on which transactions occurred (to avoid calculating balances on days where nothing happened):
import pandas as pd
idx = pd.date_range(start_date, end_date)
main_df = pd.DataFrame(index=tr_dates)
main_df['date_send'] = main_df.index
main_df['balances'] = main_df['date_send'].apply(lambda x: balance_date(x))
main_df = main_df.sort_index()
main_df.index = pd.DatetimeIndex(main_df.index)
main_df = main_df.reindex(idx, method='ffill')
This works, but my issue is performance. It takes at least 150-200 ms to run, and then I need to get the prices for each date (all of them, not just transaction dates) and match and multiply them by the correct balances, which brings the total run time to about 800 ms or more.
Given this is a web app, a view taking at least 800 ms to compute is hardly scalable, so I was wondering if anyone has a better way to do this?
EDIT - Simple example of expected input / output
Ledger entries (JSON format):
[
{
"asset":"asset_1",
"amount": 10,
"date": "2015-01-01"
},
{
"asset":"asset_2",
"amount": 15,
"date": "2017-10-15"
},
{
"asset":"asset_1",
"amount": -5,
"date": "2018-02-09"
},
{
"asset":"asset_1",
"amount": 20,
"date": "2019-10-10"
},
{
"asset":"asset_2",
"amount": 3,
"date": "2019-10-10"
}
]
Sample prices from HistoricalPrices:
[
{
"date": "2015-01-01",
"asset": "asset_1",
"price": 5
},
{
"date": "2015-01-01",
"asset": "asset_2",
"price": 15
},
{
"date": "2015-01-02",
"asset": "asset_1",
"price": 6
},
{
"date": "2015-01-02",
"asset": "asset_2",
"price": 11
},
...
{
"date": "2017-10-15",
"asset": "asset_1",
"price": 20
},
{
"date": "2017-10-15",
"asset": "asset_2",
"price": 30
}
]
In this case:
tr_dates is ['2015-01-01', '2017-10-15', '2018-02-09', '2019-10-10']
date_range is ['2015-01-01', '2015-01-02', '2015-01-03', ..., '2019-12-14', '2019-12-15']
Final output I am after: Balances by date with price by date and total value by date
date asset balance price value
2015-01-01 asset_1 10 5 50
2015-01-01 asset_2 0 15 0
.... balances do not change as there are no new Ledger entries but prices change
2015-01-02 asset_1 10 6 60
2015-01-02 asset_2 0 11 0
.... all dates between 2015-01-02 and 2017-10-15 (no change in balance but change in price)
2017-10-15 asset_1 10 20 200
2017-10-15 asset_2 15 30 450
... dates in between
2018-02-09 asset_1 5 .. etc based on price
2018-02-09 asset_2 15 .. etc based on price
... dates in between
2019-10-10 asset_1 25 .. etc based on price
2019-10-10 asset_2 18 .. etc based on price
... goes until the end of date_range
I have managed to get this working, but it takes about a second to compute, and ideally I need it to be at least 10x faster.
EDIT 2 - Following ac2001's method:
ledger = (Ledger
          .transaction
          .filter(portfolio=p)
          .annotate(transaction_date=F('date__date'))
          .annotate(transaction_amount=Window(expression=Sum('amount'),
                                              order_by=[F('asset').asc(), F('date').asc()],
                                              partition_by=[F('asset')]))
          .values('asset', 'transaction_date', 'transaction_amount'))

df = pd.DataFrame(list(ledger))
df.transaction_date = pd.to_datetime(df.transaction_date).dt.date
df.set_index('transaction_date', inplace=True)
df.sort_index(inplace=True)
df = df.groupby(by=['asset', 'transaction_date']).sum()
yields the following dataframe (with a multi-index):
                         transaction_amount
asset   transaction_date
asset_1 2015-01-01                     10.0
        2018-02-09                      5.0
        2019-10-10                     25.0
asset_2 2017-10-15                     15.0
        2019-10-10                     18.0
These balances are correct (and also yield correct results on more complex data), but now I need a way to ffill these results to all the dates in between, as well as from the last date (2019-10-10) to today (2019-12-15), and I am not sure how that works given the multi-index.
Final solution
Thanks to @ac2001's code and pointers, I have come up with the following:
ledger = (Ledger
          .objects
          .annotate(transaction_date=F('date__date'))
          .annotate(transaction_amount=Window(expression=Sum('amount'),
                                              order_by=[F('asset').asc(), F('date').asc()],
                                              partition_by=[F('asset')]))
          .values('asset', 'transaction_date', 'transaction_amount'))

df = pd.DataFrame(list(ledger))
df.transaction_date = pd.to_datetime(df.transaction_date)
df.set_index('transaction_date', inplace=True)
df.sort_index(inplace=True)
df['date_cast'] = pd.to_datetime(df.index).dt.date
df_grouped = df.groupby(by=['asset', 'date_cast']).last()
df_unstacked = df_grouped.unstack(['asset'])
df_unstacked.index = pd.DatetimeIndex(df_unstacked.index)
df_unstacked = df_unstacked.reindex(idx)
df_unstacked = df_unstacked.ffill()
This gives me a matrix of assets by dates. I then get a matrix of prices by dates (from the database) and multiply the two matrices.
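For completeness, a rough sketch of that last step, assuming prices are pulled and unstacked the same way (the price_date alias and the final multiply are illustrative, not the exact code used):

prices = (HistoricalPrices
          .objects
          .annotate(price_date=F('date__date'))
          .values('asset', 'price_date', 'price'))

prices_df = pd.DataFrame(list(prices))
# pivot to a dates x assets matrix, align to the full date range, and ffill gaps
prices_df = prices_df.groupby(['price_date', 'asset'])['price'].last().unstack('asset')
prices_df.index = pd.DatetimeIndex(prices_df.index)
prices_df = prices_df.reindex(idx).ffill()

# balances (dates x assets) * prices (dates x assets), aligned element-wise
values_df = df_unstacked['transaction_amount'] * prices_df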
Thanks
I think this might take some back and forth. The best approach is to do it in a couple of steps.
Let's start with getting daily asset balances, and then we will merge the prices in. The transaction amount is a cumulative total. Does this look correct? I don't have your data, so it is a little difficult for me to tell.
ledger = (Ledger
          .objects
          .annotate(transaction_date=F('date__date'))
          .annotate(transaction_amount=Window(expression=Sum('amount'),
                                              order_by=[F('asset').asc(), F('date').asc()],
                                              partition_by=[F('asset')]))
          .values('asset', 'transaction_date', 'transaction_amount'))

df = pd.DataFrame(list(ledger))
df.transaction_date = pd.to_datetime(df.transaction_date)
df.set_index('transaction_date', inplace=True)
# assign the resampled result back, dropping the duplicated asset column
df = df.groupby('asset').resample('D').ffill().drop(columns='asset')
df = df.reset_index()  # <-- added this line here
Edit: then create a dataframe from HistoricalPrices and merge it with the ledger. You might have to adjust the merge criteria to ensure you are getting what you want, but I think this is the correct path.
ledger = df
prices = (HistoricalPrices
          .objects
          .annotate(transaction_date=F('date__date'))
          .values('asset', 'price', 'transaction_date'))

prices = pd.DataFrame(list(prices))
# make sure both frames' transaction_date columns share a dtype before merging
prices.transaction_date = pd.to_datetime(prices.transaction_date)
result = ledger.merge(prices, how='left', on=['asset', 'transaction_date'])
Depending on how you use the data later: if you need a list of dicts (the form that works best in Django templates), you can do that conversion with df.to_dict(orient='records').
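For example (a small sketch, assuming the merged result from above):

rows = result.to_dict(orient='records')  # list of dicts, one per (asset, date) row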
If you want to group your Ledger entries by date and then calculate the daily asset amount:
Ledger.objects.values('date__date').annotate(total_amount=Sum('amount'))
This should help.
Second edit: assuming you want to group them by asset as well:
Ledger.objects.values('date__date', 'asset').annotate(total_amount=Sum('amount'))
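With the sample Ledger entries from the question, this would return something like (assets shown by name here; in practice the FK id is returned):

[{'date__date': datetime.date(2015, 1, 1), 'asset': 'asset_1', 'total_amount': 10.0},
 {'date__date': datetime.date(2017, 10, 15), 'asset': 'asset_2', 'total_amount': 15.0},
 {'date__date': datetime.date(2018, 2, 9), 'asset': 'asset_1', 'total_amount': -5.0},
 {'date__date': datetime.date(2019, 10, 10), 'asset': 'asset_1', 'total_amount': 20.0},
 {'date__date': datetime.date(2019, 10, 10), 'asset': 'asset_2', 'total_amount': 3.0}]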
Related
I have a model in Django, and this is what it looks like with fewer fields -
I want to group the rows by buy_price_per_unit, and at the same time I also want to know the total units on sale for each buy_price_per_unit.
So in our case only two distinct buy_price_per_unit values exist (9 and 10). Hence the query should return only two rows, like this -
The last condition I have to meet is that the query result should be in descending order of buy_price_per_unit.
This is what I have tried so far -
orders = Orders.objects.values('id', 'buy_price_per_unit')\
                       .annotate(units=Sum("units"))\
                       .order_by("-buy_price_per_unit")
The response for the query above was -
[
{
"id": 13,
"buy_price_per_unit": 10,
"units": 1
},
{
"id": 12,
"buy_price_per_unit": 9,
"units": 10
},
{
"id": 14,
"buy_price_per_unit": 9,
"units": 2
},
{
"id": 15,
"buy_price_per_unit": 9,
"units": 1
}
]
The problem with this response is that even for the same price multiple records are being returned.
This is happening because you have id in .values(); based on the underlying query, it is grouping on both id and buy_price_per_unit.
So simply remove id from .values():
orders = Orders.objects.values('buy_price_per_unit')\
                       .annotate(units=Sum("units"))\
                       .order_by("-buy_price_per_unit")
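With the sample response above, the price-9 rows collapse into one (10 + 2 + 1 = 13 units):

[
{
"buy_price_per_unit": 10,
"units": 1
},
{
"buy_price_per_unit": 9,
"units": 13
}
]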
According to the documentation, the end_time is the lookback cutoff for the data set:
The end_time property indicates a data set's lookback cutoff date; data older than this value is not included in the data set's calculation.
When looking at online_followers in insights, the data looks like this:
{
"value": {
"0": 18634,
"1": 18604,
"2": 19849,
"3": 21491,
"4": 23519,
"5": 25000,
"6": 24772,
"7": 25081,
"8": 25408,
"9": 25883,
"10": 26216,
"11": 26591,
"12": 27182,
"13": 27398,
"14": 25384,
"15": 19336,
"16": 13968,
"17": 11596,
"18": 10770,
"19": 10156,
"20": 9967,
"21": 11243,
"22": 14837,
"23": 18040
},
"end_time": "2021-07-01T07:00:00+0000"
Do the numbers refer to the hour of the day? Or do they refer to the number of hours that have passed since 07:00:00? If the latter, would this data be for 2021-07-21 and 2021-07-22?
According to the documentation:
"Metrics that support lifetime periods will have results returned in
an array of 24 hour periods, with periods ending on UTC−07:00"
UTC−07:00 corresponds to midnight (the start of a new day) in US Pacific Daylight Time (PDT),
so for anyone in that time zone, the number of hours passed since that UTC timestamp is equal to the hour of the day. In your example, the data is for every hour of the day 2021-07-01 in PDT.
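A small sketch of that interpretation (this is the assumed reading described above, not code from the API docs; values truncated from the sample):

from datetime import datetime, timedelta

# Assumed interpretation: key N is the Nth hour of the PDT day starting at end_time
end_time = datetime.strptime("2021-07-01T07:00:00+0000", "%Y-%m-%dT%H:%M:%S%z")
values = {"0": 18634, "1": 18604, "23": 18040}  # truncated sample

for key, followers in values.items():
    hour_utc = end_time + timedelta(hours=int(key))
    hour_pdt = hour_utc - timedelta(hours=7)  # shift UTC to PDT wall-clock time
    print(hour_pdt.strftime("%Y-%m-%d %H:00"), followers)
# -> 2021-07-01 00:00 ... 2021-07-01 23:00, i.e. the PDT day 2021-07-01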
I have a table that looks like this
date car_crashes city
01.01 1 Washington
01.02 4 Washington
01.03 0 Washington
01.04 2 Washington
01.05 0 Washington
01.06 3 Washington
01.07 4 Washington
01.08 1 Washington
01.01 0 Detroit
01.02 2 Detroit
01.03 4 Detroit
01.04 2 Detroit
01.05 0 Detroit
01.06 3 Detroit
01.07 1 Detroit
I want to know how many car crashes for each day happened in the entire nation, and I can do that with this:
Model.values("date") \
.annotate(car_crashes=Sum('car_crashes')) \
.values("date", "car_crashes")
Now, let's suppose I have an array like this:
weights = [
{
"city": "Washington",
"weight": 1,
},
{
"city": "Detroit",
"weight": 2,
}
]
This means that Detroit's car crashes should be multiplied by 2 before being aggregated with Washington's.
It can be done like this:
from django.db.models import Case, F, IntegerField, Sum, When
when_list = [When(city=w['city'], then=w['weight']) for w in weights]
case_params = {'default': 1, 'output_field': IntegerField()}
Model.objects.values('date') \
.annotate(
weighted_car_crashes=Sum(
F('car_crashes') * Case(*when_list, **case_params)
))
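For the sample data and weights above (Detroit counted twice), this would yield, e.g.:

[{'date': '01.01', 'weighted_car_crashes': 1},   # 1*1 + 0*2
 {'date': '01.02', 'weighted_car_crashes': 8},   # 4*1 + 2*2
 {'date': '01.03', 'weighted_car_crashes': 8},   # 0*1 + 4*2
 # ...
 {'date': '01.08', 'weighted_car_crashes': 1}]   # Washington only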
However, this generates very slow SQL code, especially as more properties and a larger array are introduced.
Another solution, which is way faster but still sub-optimal, is using pandas:
aggregated = False
for w in weights:
    ag = Model.objects.filter(city=w['city']).values("date") \
              .annotate(car_crashes=Sum('car_crashes') * w['weight']) \
              .values("date", "car_crashes")
    if aggregated is False:
        aggregated = ag
    else:
        aggregated = aggregated.union(ag)
aggregated = pd.DataFrame(aggregated)
if len(weights) > 1:
    aggregated = aggregated.groupby("date", as_index=False).sum()
This is faster, but still not as fast as what happens if, before calling pandas, I take the aggregated.query string and wrap it with a few lines of SQL:
SELECT "date", sum("car_crashes") FROM (
// String from Python
str(aggregated.query)
) as "foo" GROUP BY "date"
This works perfectly when pasted into my database's SQL console. I could do this in Python/Django using .raw(), but the documentation says to ask before using .raw(), as almost anything can be accomplished with the ORM.
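Executed from Python, that wrapping could look roughly like this (a sketch only, assuming aggregated is still the unioned queryset; note that str(aggregated.query) does not reliably quote parameters, so this is illustration rather than production-safe SQL):

from django.db import connection

# wrap the ORM-generated union in an outer GROUP BY
sql = ('SELECT "date", SUM("car_crashes") AS car_crashes '
       'FROM ({}) AS foo GROUP BY "date"').format(aggregated.query)

with connection.cursor() as cursor:
    cursor.execute(sql)
    rows = cursor.fetchall()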
Yet, I don't see how. Once I call .union() on 2 querysets, I cannot aggregate further.
aggregated.union(ag).annotate(cc=Sum('car_crashes'))
gives
Cannot compute Sum('car_crashes'): 'car_crashes' is an aggregate
Is this possible to do with the Django ORM or should I use .raw()?
A very similar post was made about this issue here. In cloudant, I have a document structure storing when users access an application, that looks like the following:
{"username":"one","timestamp":"2015-10-07T15:04:46Z"}---| same day
{"username":"one","timestamp":"2015-10-07T19:22:00Z"}---^
{"username":"one","timestamp":"2015-10-25T04:22:00Z"}
{"username":"two","timestamp":"2015-10-07T19:22:00Z"}
What I want is to count the number of unique users for a given time period. For example:
2015-10-07 = {"count": 2} two different users accessed on 2015-10-07
2015-10-25 = {"count": 1} one different user accessed on 2015-10-25
2015 = {"count" 2} two different users accessed in 2015
This all becomes tricky because, for example, on 2015-10-07 username "one" has two access records, but they should only contribute a count of 1 to the total of unique users.
I've tried:
function(doc) {
    var time = new Date(Date.parse(doc['timestamp']));
    emit([time.getUTCFullYear(), time.getUTCMonth(), time.getUTCDay(), doc.username], 1);
}
This suffers from several issues, which are highlighted by Jesus Alva who commented in the post I linked to above.
Thanks!
There's probably a better way of doing this, but off the top of my head ...
You could try emitting an index for each level of granularity:
function(doc) {
    var time = new Date(Date.parse(doc['timestamp']));
    var year = time.getUTCFullYear();
    var month = time.getUTCMonth() + 1;
    var day = time.getUTCDate();

    // day granularity
    emit([year, month, day, doc.username], null);
    // year granularity
    emit([year, doc.username], null);
}
// reduce function - `_count`
Day query (2015-10-07):
inclusive_end=true&
start_key=[2015, 10, 7, "\u0000"]&
end_key=[2015, 10, 7, "\uefff"]&
reduce=true&
group=true
Day query result - your application code would count the number of rows:
{"rows":[
{"key":[2015,10,7,"one"],"value":2},
{"key":[2015,10,7,"two"],"value":1}
]}
Year query:
inclusive_end=true&
start_key=[2015, "\u0000"]&
end_key=[2015, "\uefff"]&
reduce=true&
group=true
Query result - your application code would count the number of rows:
{"rows":[
{"key":[2015,"one"],"value":3},
{"key":[2015,"two"],"value":1}
]}
I am looking for a fast method to count a model's objects created within the past 30 days, for each day separately. For example:
27.07.2013 (today) - 3 objects created
26.07.2013 - 0 objects created
25.07.2013 - 2 objects created
...
27.06.2013 - 1 objects created
I am going to use this data with the Google Charts API. Do you have any idea how to get it efficiently?
items = Foo.objects.filter(createdate__lte=datetime.datetime.today(), createdate__gt=datetime.datetime.today()-datetime.timedelta(days=30)).\
values('createdate').annotate(count=Count('id'))
This will (1) filter results to contain the last 30 days, (2) select just the createdate field and (3) count the id's, grouping by all selected fields (i.e. createdate). This will return a list of dictionaries of the format:
[
{'createdate': <datetime.date object>, 'count': <int>},
{'createdate': <datetime.date object>, 'count': <int>},
...
]
EDIT:
I don't believe there's a way to get all dates, even those with count == 0, with just SQL. You'll have to insert each missing date through Python code, e.g.:
import datetime

# needed to use .append() later on
items = list(items)

dates = [x.get('createdate') for x in items]
# compare dates (not full datetimes), otherwise no generated day would ever match
for d in (datetime.date.today() - datetime.timedelta(days=x) for x in range(0, 30)):
    if d not in dates:
        items.append({'createdate': d, 'count': 0})
I think this can be a somewhat more optimized version of @knbk's solution above. It has fewer iterations, and set operations are highly optimized in Python (both in processing and in CPU cycles).
from_date = datetime.date.today() - datetime.timedelta(days=7)
orders = Order.objects.filter(created_at__gte=from_date, dealer__executive__branch__user=user)
orders = orders.values('created_at').annotate(count=Count('id')).order_by('created_at')

if len(orders) < 7:
    orders_list = list(orders)
    # the 7 calendar days we want to report on
    dates = set(datetime.date.today() - datetime.timedelta(days=i) for i in range(7))
    order_dates = set(o['created_at'] for o in orders)
    # add a zero count for every day that had no orders
    for dt in (dates - order_dates):
        orders_list.append({'created_at': dt, 'count': 0})
    orders_list = sorted(orders_list, key=lambda item: item['created_at'])
else:
    orders_list = orders