I am working on a Bayesian hierarchical linear model in PyMC3.
The model has three input variables at the daily level: number of users, product category, and product SKU; the output variable is revenue. In total the data consists of roughly 73,000 records with 180 categories and 12,000 SKUs. Moreover, some categories/SKUs are heavily represented while others are scarce. An example of the data is shown in the link:
Preview of the data
As the data at the SKU level is very sparse, a hierarchical model was chosen, with the intent that SKUs with little data shrink towards the category-level mean, and that scarce categories in turn shrink towards the overall mean.
In the final model the categories are label encoded and the continuous variables users and revenue are min-max scaled.
At this point the model is formalized as follows:
import pymc3 as pm

with pm.Model() as model:
    # hyperpriors (HalfNormal takes a sigma, not a mu)
    sigma_overall = pm.HalfNormal("sigma_overall", sigma=50)
    sigma_category = pm.HalfNormal("sigma_category", sigma=sigma_overall)
    # one scale per SKU, shrunk towards the category-level scale
    sigma_sku = pm.HalfNormal("sigma_sku", sigma=sigma_category, shape=n_sku)
    beta = pm.HalfNormal("beta", sigma=sigma_sku, shape=n_sku)
    epsilon = pm.HalfCauchy("epsilon", 1)
    # beta has shape (n_sku,), so it is indexed by SKU only
    y = pm.Deterministic("y", beta[sku_idx] * df['users'].values)
    y_likelihood = pm.Normal("y_likelihood", mu=y, sigma=epsilon, observed=df['revenue'].values)
    trace = pm.sample(2000)
The main hurdle is that the model is very slow: it takes hours, sometimes a full day, to finish sampling. Switching between Metropolis and NUTS sampling, with find_MAP() for the starting point, did not make a difference. Furthermore, I doubt whether the model is formalized correctly, as I am pretty new to PyMC3.
A review of the model and advice to speed it up is very welcome.
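For what it's worth, hierarchical models like this often sample dramatically faster in PyMC3 with a non-centered parameterization, where each SKU coefficient is written as a category-level mean plus a scaled standard-normal offset. Below is a minimal sketch of that idea, not a definitive rewrite; it assumes hypothetical index arrays sku_idx (one entry per row) and category_of_sku (mapping each SKU to its category), plus counts n_sku and n_category:

import pymc3 as pm

with pm.Model() as noncentered_model:
    # category-level mean and scale for the per-SKU user coefficient
    mu_category = pm.Normal("mu_category", mu=0, sigma=1, shape=n_category)
    sigma_category = pm.HalfNormal("sigma_category", sigma=1, shape=n_category)

    # non-centered trick: sample standard-normal offsets, then shift/scale them
    beta_offset = pm.Normal("beta_offset", mu=0, sigma=1, shape=n_sku)
    beta = pm.Deterministic(
        "beta",
        mu_category[category_of_sku] + beta_offset * sigma_category[category_of_sku],
    )

    epsilon = pm.HalfCauchy("epsilon", 1)
    mu = beta[sku_idx] * df['users'].values
    pm.Normal("y_likelihood", mu=mu, sigma=epsilon, observed=df['revenue'].values)

    # more tuning and a higher target_accept usually pay off for hierarchies
    trace = pm.sample(2000, tune=2000, target_accept=0.9)

This keeps the intended shrinkage (SKUs towards their category, categories towards the overall prior) while giving NUTS a much easier posterior geometry, and the vectorized indexing avoids any per-SKU Python loops.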
I have a database with the following details:
Product
    Name
    SKU
    UOM (there is a UOM master, so all purchases and sales are converted to the base UOM and stored in the DB)
    Some other details
    has_attribute
    has_batch
Attributes
    Name
    Details/Remarks
Product-Attribute
    Product (FK)
    Attribute (FK)
    Value of attribute
Inventory Details
    # A row is added for every product lot bought; quantity_available is updated after every sale
    Product (FK)
    Warehouse (FK to warehouse model)
    Purchase Date
    Purchase Price
    MRP
    Tentative sales price
    Quantity_bought
    Quantity_available
    Other batch details if applicable (batch_id, manufactured_date, expiry_date)
Inventory Ledger
    # This table records all in & out movements of inventory
    Product
    Warehouse (FK to warehouse model)
    Transaction Type (Purchase/Sales)
    Quantity_transacted (i.e. quantity purchased/sold)
    Inventory_Purchase_cost (so as to calculate inventory valuation)
Now, my problem is:
I need to find out the historical inventory cost. For example, say I need the value of inventory on 10th Feb 2017. What I can do with the current tables is not very efficient: I would take the current inventory and walk back through the ledger for all 1000-1500 SKUs, at roughly 100 transactions per SKU per day, for more than 120 days. That's about 1500*100*120 = 18 million rows to process. It's huge. Is there a better DB design to handle this case?
Firstly, have you tested it? 1500*100*120 is not that huge. It may be acceptable performance and there is no problem to be solved!
I'm not 100% clear on how you compute the value. Do you sum up the InventoryLedger rows for each Product in each Warehouse? If so, it's easy to put a value on the purchases, but how do you value the sales? I'm going to assume that you value the sales using the Inventory_Purchase_cost (so it should perhaps be called TransactionValue instead).
If you must optimise it, I suggest populating a record each day with the valuation of each product in each warehouse. The following StockValuation table could be populated daily, and it would allow quick computation of the valuation for any historical day.
Diagram made using QuickDBD, where I work.
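Since the other questions here use Django models, here is a minimal sketch of what that daily snapshot table could look like; the model and field names (StockValuation, valuation_date, and so on) are just assumptions for illustration:

from django.db import models

class StockValuation(models.Model):
    # one snapshot row per product, per warehouse, per day
    product = models.ForeignKey('Product', on_delete=models.CASCADE)
    warehouse = models.ForeignKey('Warehouse', on_delete=models.CASCADE)
    valuation_date = models.DateField(db_index=True)
    quantity_on_hand = models.DecimalField(max_digits=12, decimal_places=3)
    # quantity_on_hand times the purchase cost, precomputed by a nightly job
    valuation = models.DecimalField(max_digits=14, decimal_places=2)

    class Meta:
        unique_together = ('product', 'warehouse', 'valuation_date')

The inventory value on any historical date is then a single SUM over that date's rows instead of a replay of the whole ledger.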
I have a simple eCommerce site with Product and Variation.
Each variation is definitely going to have different weights, and each weight will have a quantity.
I have implemented a Weight model with a ForeignKey relationship to Variation since each variation can have different weights and each weight will have a quantity.
class Weight(models.Model):
    variation = models.ForeignKey(Variation)
    size = models.DecimalField(decimal_places=3, max_digits=8,
                               validators=[MinValueValidator(Decimal('0.10'))])
    quantity = models.PositiveIntegerField(null=True, blank=True, help_text="Select Quantity for this size")
I added Weight as an inline, so I can add multiple weight values and quantities within a variation. Please see this: http://imgur.com/XLM6sQJ
One might think this could be handled by creating a variation for each weight, but since it is certain that each product will have different weights, there is no need to create a variation per weight.
Now the problem I am facing: each variation will have different weights, e.g. a variation could have weights of 1 lb, 2 lb, and 3 lb, and each of these creates a new Weight object for that variation. This means that if another variation also has weights of 1 lb, 2 lb, and 3 lb, new objects are created instead of the existing ones being reused. This will result in a huge DB table with many duplicate weight values. It is a problem because any product only ever needs a limited range of weight and quantity values (weight = 1 lb to 100 lb and quantity = 1 to 100), so these should ideally be reused.
To avoid this I am thinking of giving the Weight model a ManyToMany field to Variation, with quantity as a dropdown for each selected weight. This would let me store values for both weight and quantity, and have each product reuse the same values.
The problem I have is:
1. Is this the correct approach?
2. if not what is the best approach to do this?
3. If this is the correct approach how to do this?
4. If this is the correct approach how do I display this in admin site since each weight should also have a quantity (I have no clue how to do this)?
5. Is there a better way to achieve this, and if so how?
You have a clear understanding of how you want to do it.
I would agree with you about reusing the same weights for different variations rather than creating new ones that would again hold the same weights.
Here is what I think would be better; there may be multiple ways to do it.
To answer your question, please try this Model relations in your app:
class Quantity(models.Model):
    quantity = models.PositiveIntegerField()

class Weight(models.Model):
    weight = models.PositiveIntegerField()
    quantity = models.ManyToManyField(Quantity)

class Variation(models.Model):
    name = models.CharField(max_length=100)  # max_length is required for CharField
    weight = models.ManyToManyField(Weight)
Then add all the weights you require to the Weight model individually. After that, whenever you need to attach a weight to the Variation table, you can select it from the Weight records that have already been added.
In this way you can reuse the same weight for different variations without even having to have duplicate records.
Make sure you have registered both the models in the admin for easy access.
This should solve your problem for having multiple records in the weight table.
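As a minimal sketch of that admin registration (the filter_horizontal widget is just an optional convenience, not something required by the approach):

from django.contrib import admin
from .models import Quantity, Weight, Variation

class VariationAdmin(admin.ModelAdmin):
    # two-pane picker that makes selecting many weights less painful
    filter_horizontal = ('weight',)

admin.site.register(Quantity)
admin.site.register(Weight)
admin.site.register(Variation, VariationAdmin)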
class Offer(models.Model):
    match = models.ForeignKey(Match, related_name='offers')
    # arbitrary information

class Odds(models.Model):
    offer = models.ForeignKey(Offer, related_name='odds')
    time = models.DateTimeField(db_index=True)

    class Meta:
        get_latest_by = 'time'
    # arbitrary information
I have a potentially huge set of Offers where I need to get the latest Odds object. The query I am performing right now is the following
for m in Match.objects.all():
    odds = [o.odds.latest() for o in m.offers.all()]
The rest of the Odds objects connected to Offer are stored for historical purposes and should not be used in the computations that would follow.
The problem is that this issues one query per Offer object, which is a huge performance drag that I'm trying hard to fix.
TLDR;
I want to get one Odds object for each Offer, using ORDER BY time.
Any help is truly appreciated.
You can use prefetch_related. It does a separate lookup for each relationship and does the ‘joining’ in Python, replacing the many small per-object queries issued from your code.
For Django 1.6 and earlier:
class Offer(models.Model):
    match = models.ForeignKey(Match, related_name='offers')

    def latest_odds(self):
        # max() over the prefetched list avoids an extra query per offer
        return max(self.odds.all(), key=lambda odds: odds.time)
...
for m in Match.objects.all().prefetch_related('offers__odds'):
    odds = [o.latest_odds() for o in m.offers.all()]
---------
3 queries
In Django 1.7, there is a Prefetch() object that allows you to control the behaviour of prefetch_related (note the descending order, so that .first() returns the latest Odds):
from django.db.models import Prefetch

for m in Match.objects.all().prefetch_related(
        Prefetch("offers__odds", queryset=Odds.objects.order_by('-time'))):
    odds = [o.odds.first() for o in m.offers.all()]
---------
3 queries
If Odds is really huge, you should look towards denormalization, e.g. caching the latest Odds directly on the Offer row.
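A minimal sketch of that denormalization, assuming a hypothetical latest_odds pointer on Offer that is refreshed whenever a new Odds row is saved:

class Offer(models.Model):
    match = models.ForeignKey(Match, related_name='offers')
    # denormalized cache of the most recent Odds row for this offer
    latest_odds = models.ForeignKey('Odds', null=True, blank=True, related_name='+')

class Odds(models.Model):
    offer = models.ForeignKey(Offer, related_name='odds')
    time = models.DateTimeField(db_index=True)

    def save(self, *args, **kwargs):
        super(Odds, self).save(*args, **kwargs)
        # keep the cache current; assumes odds arrive in chronological order
        Offer.objects.filter(pk=self.offer_id).update(latest_odds=self)

Reading the latest odds for every offer then needs no per-offer query at all: Offer.objects.select_related('latest_odds').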
My model consists of a Portfolio, a Holding, and a Company. Each Portfolio has many Holdings, and each Holding is of a single Company (a Company may be connected to many Holdings).
Portfolio -< Holding >- Company
I'd like the Portfolio query to return the sum of the product of the number of Holdings in the Portfolio, and the value of the Company.
Simplified model:
class Portfolio(models.Model):
    pass  # some fields omitted

class Company(models.Model):
    closing = models.DecimalField(max_digits=10, decimal_places=2)

class Holding(models.Model):
    portfolio = models.ForeignKey(Portfolio)
    company = models.ForeignKey(Company)
    num_shares = models.IntegerField(default=0)
I'd like to be able to query:
Portfolio.objects.some_function()
and have each row annotated with the value of the Portfolio, where the value is equal to the sum of the product of the related Company.closing, and Holding.num_shares. ie something like:
annotate(value=Sum('holding__num_shares * company__closing'))
I'd also like to obtain a summary row, which contains the sum of the values of all of a user's Portfolios, and a count of the number of holdings. ie something like:
aggregate(Sum('holding__num_shares * company__closing'), Count('holding__num_shares'))
I would like to have a similar summary row for a single Portfolio: the sum of the values of each holding, and a count of the total number of holdings in the portfolio.
I managed to get part of the way there using extra:
return self.extra(
    select={
        'value': 'select sum(h.num_shares * c.closing) from portfolio_holding h '
                 'inner join portfolio_company as c on h.company_id = c.id '
                 'where h.portfolio_id = portfolio_portfolio.id'
    }).annotate(Count('holding'))
but this is pretty ugly, and extra seems to be frowned upon, for obvious reasons.
My question is: is there a more Djangoistic way to summarise and annotate queries based on multiple fields, and across related tables?
These two options seem to move in the right direction:
Portfolio.objects.annotate(Sum('holding__company__closing'))
(ie this demonstrates annotation/aggregation over a field in a related table)
Holding.objects.annotate(Sum('id', field='num_shares * id'))
(this demonstrates annotation/aggregation over the product of two fields)
but if I attempt to combine them: eg
Portfolio.objects.annotate(Sum('id', field='holding__company__closing * holding__num_shares'))
I get an error: "No such column 'holding__company__closing'".
So far I've looked at the following related questions, but none of them seem to capture this precise problem:
Annotating django QuerySet with values from related table
Product of two fields annotation
Do I just need to bite the bullet and use raw SQL / extra? I'm hoping the Django ORM will prove an exception to the rule that ORMs only work as designed for simple queries and models, and that anything beyond the basics requires either seriously gnarly tap-dancing or stepping out of the abstraction, which somewhat defeats the purpose...
Thanks in advance!
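For reference, from Django 1.8 onwards this became expressible without extra(), because F() expressions can be combined inside aggregate functions. A sketch under that assumption (the user field on Portfolio is hypothetical):

from django.db.models import Count, DecimalField, F, Sum

value_expr = Sum(
    F('holding__num_shares') * F('holding__company__closing'),
    output_field=DecimalField(),
)

# per-portfolio value, one annotated row per Portfolio
portfolios = Portfolio.objects.annotate(value=value_expr)

# summary row across one user's portfolios
summary = Portfolio.objects.filter(user=user).aggregate(
    total_value=value_expr,
    num_holdings=Count('holding'),
)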
I'm building a food logging database in Django and I've got a query related problem.
I've set up my models to include (among other things) a Food model connected to the User model through an M2M-field "consumer" via the Consumption model. The Food model describes food dishes and the Consumption model describes a user's consumption of Food (date, amount, etc).
class Food(models.Model):
    food_name = models.CharField(max_length=30)
    # use a string reference, since Consumption is defined below
    consumer = models.ManyToManyField("User", through="Consumption")

class Consumption(models.Model):
    food = models.ForeignKey("Food")
    user = models.ForeignKey("User")
I want to create a query that returns all Food objects ordered by the number of times that Food object appears in the Consumption table for that user (the number of times the user has consumed the food).
I'm trying something in the line of:
Food.objects.all().annotate(consumption_times=Count('consumer')).order_by('consumption_times')
But this will of course count all Consumption objects related to the Food object, not just the ones associated with the user. Do I need to change my models or am I just missing something obvious in the queries?
This is a pretty time-critical operation (among other things, it's used to fill an autocomplete field in the frontend) and the Food table has a couple of thousand entries, so I'd rather do the sorting on the database end than brute-force it by iterating over the results doing:
Consumption.objects.filter(food=food, user=user).count()
and then sorting them with Python. I don't think that method would scale very well as the user base grows, and I want to design the database to be as future-proof as I can from the start.
Any ideas?
Perhaps something like this?
Food.objects.filter(consumer=user)\
    .annotate(consumption_times=Count('consumer'))\
    .order_by('consumption_times')
I am having a very similar issue. Basically, I know that the SQL query you want is:
SELECT food.*, COUNT(IF(consumption.user_id=123, TRUE, NULL)) AS consumption_times
FROM food LEFT JOIN consumption ON (food.id = consumption.food_id)
GROUP BY food.id
ORDER BY consumption_times;
What I wish is that you could mix aggregate functions and F expressions, annotate F expressions without an aggregate function, have a richer set of operations/functions for F expressions, and have virtual fields that are basically automatic F expression annotations. Then you could do:
Food.objects.annotate(consumption_times=Count(If(F('consumer') == user, True, None)))\
    .order_by('consumption_times')
Also, being able to add your own complex aggregate functions more easily would be nice, but in the meantime here's a hack that adds an aggregate function to do this (note that it reaches into pre-1.8 Django internals and uses MySQL's IF syntax):
from django.db.models import aggregates, sql

# custom SQL aggregate: a COUNT over an IF
class CountIf(sql.aggregates.Count):
    sql_template = '%(function)s(IF(%(field)s=%(equals)s,TRUE,NULL))'
sql.aggregates.CountIf = CountIf

# extra kwargs such as equals=... are interpolated into sql_template
consumption_times = aggregates.Count('consumer', equals=user.id)
consumption_times.name = 'CountIf'  # makes annotate() resolve to our CountIf
rows = Food.objects.annotate(consumption_times=consumption_times)\
    .order_by('consumption_times')