Django QuerySet slicing returning unexpected results

I'm dynamically building a Django query and am receiving unexpected results depending on the slice parameters. For example, if I request queryset[0:10] and queryset[10:20], I receive some of the same items in query2 that I found in query1.
Searching around, the issue I'm facing appears similar to:
Simple Django Query generating confusing Queryset results
except I am defining an order_by for my query, so it doesn't appear to be an exact match.
Viewing the queryset.query for my two queries:
queryset[0:10] generates:
SELECT "intercache_localinventorycountsummary"."id",
"intercache_localinventorycountsummary"."part",
"intercache_localinventorycountsummary"."site",
"intercache_localinventorycountsummary"."location",
"intercache_localinventorycountsummary"."hadTransactionsDuring"
FROM "intercache_localinventorycountsummary"
ORDER BY "intercache_localinventorycountsummary"."hadTransactionsDuring" DESC
LIMIT 10
queryset[10:20] generates:
SELECT "intercache_localinventorycountsummary"."id",
"intercache_localinventorycountsummary"."part",
"intercache_localinventorycountsummary"."site",
"intercache_localinventorycountsummary"."location",
"intercache_localinventorycountsummary"."hadTransactionsDuring"
FROM "intercache_localinventorycountsummary"
ORDER BY "intercache_localinventorycountsummary"."hadTransactionsDuring" DESC
LIMIT 10 OFFSET 10
Per request, I've listed the literal SQL generated by Django and run it manually against the DB.
Results for Query1:
id | part | site | location | hadTransactionsDuring
------+---------+------+----------+-----------------------
2787 | 2217-1 | 01 | Bluebird | t
2839 | 2215 | 01 | 2600 FG | t
2558 | R4367 | 01 | 2600 Raw | t
2637 | 4453 | 01 | 2600 FG | t
2810 | 1000 | 01 | 2600 FG | t
2531 | 3475 | 01 | 2600 FG | t
2526 | 4596Z | 01 | 2550 FG | t
2590 | 3237-12 | 01 | 2600 Raw | t
3077 | 4841Y | 01 | 2600 FG | t
2919 | 3407 | 01 | 2600 FG | t
Results for Query2:
id | part | site | location | hadTransactionsDuring
------+--------------+------+----------+-----------------------
2598 | 2217-2 | 01 | 2600 Raw | t
2578 | 2216-5 | 01 | 2600 Raw | t
2531 | 3475 | 01 | 2600 FG | t
3010 | 3919 | 01 | 2600 FG | t
2558 | R4367 | 01 | 2600 Raw | t
2637 | 4453 | 01 | 2600 FG | t
2526 | 4596Z | 01 | 2550 FG | t
2590 | 3237-12 | 01 | 2600 Raw | t
2570 | R3760-BRN-GS | 01 | 2600 Raw | f
2569 | 4098 | 01 | 2600 FG | f
(You can see that ids 2558, 2637, 2526, and 2590 are returned by both queries.)
Any guesses what I'm doing wrong here? It seems I must be fundamentally misunderstanding something about how QuerySet slicing works.
Update:
The DB schema is as follows... are result orderings unreliable when ordering by non-indexed fields, perhaps?
\d intercache_localinventorycountsummary
Table "public.intercache_localinventorycountsummary"
Column | Type | Modifiers
-----------------------+--------------------------+------------------------------------------------------------------------------------
id | integer | not null default nextval('intercache_localinventorycountsummary_id_seq'::regclass)
_domain_id | integer |
_created | timestamp with time zone | not null
_synced | timestamp with time zone |
_active | boolean | not null default true
dirty | boolean | not null default true
lastRefresh | timestamp with time zone |
part | character varying(18) | not null
site | character varying(8) | not null
location | character varying(8) | not null
quantity | numeric(16,9) |
startCount | timestamp with time zone |
endCount | timestamp with time zone |
erpCountQOH | numeric(16,9) |
hadTransactionsDuring | boolean | not null default false
quantityChangeSince | numeric(16,9) |
hadManualDating | boolean | not null
variance | numeric(16,9) |
unitCost | numeric(16,9) |
countCost | numeric(16,9) |
varianceCost | numeric(16,9) |
Indexes:
"intercache_localinventorycountsummary_pkey" PRIMARY KEY, btree (id)
"intercache_localinventorycount__domain_id_5691b6f8cca017dc_uniq" UNIQUE CONSTRAINT, btree (_domain_id, part, site, location)
"intercache_localinventorycountsummary__active" btree (_active)
"intercache_localinventorycountsummary__domain_id" btree (_domain_id)
"intercache_localinventorycountsummary__synced" btree (_synced)
Foreign-key constraints:
"_domain_id_refs_id_163d40e6b21ac0f9" FOREIGN KEY (_domain_id) REFERENCES intercache_domain(id) DEFERRABLE INITIALLY DEFERRED

The problem lies with this:
ORDER BY "intercache_localinventorycountsummary"."hadTransactionsDuring" DESC
Apparently you've overridden the ordering, either explicitly in the query or in the model's Meta options (see Model Meta options: ordering).
If you want to order by hadTransactionsDuring but have predictable ordering, you should add a second ordering to break ties where the first field has the same value. For example:
queryset.order_by("-hadTransactionsDuring", "id")
Keep in mind that RDBMSes, be it PostgreSQL or MySQL, never guarantee any order at all unless it is explicitly specified with ORDER BY. Most queries happen to return rows in primary-key order, but that is a happy coincidence that depends on the internal implementation of the table storage, not something you can rely on. In other words, you cannot assume that a Django queryset is ordered on any field besides the ones you've specified in order_by.
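A minimal sketch of the fix (the app and model names are assumed from the table name above):

from intercache.models import LocalInventoryCountSummary  # assumed app/model names

# Adding a unique tiebreaker ("id") makes the ordering total, so
# consecutive slices can never overlap.
queryset = LocalInventoryCountSummary.objects.order_by("-hadTransactionsDuring", "id")

page1 = queryset[0:10]   # LIMIT 10
page2 = queryset[10:20]  # LIMIT 10 OFFSET 10 -- now disjoint from page1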

Related

Filter out outliers dynamically using PERCENTILE

I'm building a sales dashboard in PowerBI.
I have a Sales table.
My source of data is declarative, so I have a few extreme values caused by human errors and mistypes, etc.
Let's say I want to build a histogram with:
On the X axis, the stock aging of any sales. Which is "how long the product has been in stock at the time of sale". It is given by the [Product_Age] column
On values, the number of sales.
What I want to do is exclude the top 1% of extreme values from my calculations (averages, etc.) and visualizations.
I've created a measure:
SalesByAge_Adjusted =
VAR TEMP =
    FILTER(
        SALES;
        VAR StockAgingMAX =
            PERCENTILE.INC(
                SALES[Sales_Age];
                0,99
            )
        RETURN
            SALES[Sales_Age] < StockAgingMAX
    )
RETURN
    COUNTROWS(TEMP)
It uses PERCENTILE.INC to get the 99th percentile of Sales_Age values in the current context, and I try to use it as a filter.
However, it just won't work.
I can display the measure on its own, showing how many sales I have. But as soon as I drag and drop "Sales_Age" to summarize the values, it shows nothing.
I have created the following table as an example.
+-------+--------+
| Axis | Values |
+-------+--------+
| 1 | 1067 |
| 2 | 1725 |
| 4 | 298 |
| 8 | 402 |
| 16 | 1848 |
| 32 | 1395 |
| 64 | 1116 |
| 128 | 1027 |
| 256 | 1948 |
| 512 | 790 |
| 1024 | 2173 |
| 2048 | 2025 |
| 4096 | 104 |
| 8192 | 1243 |
| 16384 | 1676 |
| 32768 | 1285 |
| 65536 | 806 |
+-------+--------+
For filtering out the values that fall outside the 99th percentile, I've created the following measure. Basically, it computes an overall percentile ignoring the filter context and compares it to each Axis value:
Filter = IF(CALCULATE(PERCENTILE.INC('Table'[Axis],0.99),ALL('Table'))>=MAX('Table'[Axis]),1,0)
In the chart's visual-level filters, use the Filter measure (keeping only rows where it equals 1) to exclude your outliers.
In this case, it will filter out the last value of the table: 65,536.
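If you want to sanity-check the cutoff outside Power BI, the same idea in Python/pandas looks like this (a sketch; the column names follow the example table):

import pandas as pd

df = pd.DataFrame({
    "Axis":   [1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
               1024, 2048, 4096, 8192, 16384, 32768, 65536],
    "Values": [1067, 1725, 298, 402, 1848, 1395, 1116, 1027, 1948, 790,
               2173, 2025, 104, 1243, 1676, 1285, 806],
})

# pandas' default linear interpolation matches PERCENTILE.INC's inclusive behaviour.
cutoff = df["Axis"].quantile(0.99)

# The analogue of the Filter measure: keep rows at or below the 99th percentile.
kept = df[df["Axis"] <= cutoff]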

PowerBI: Use non-shown values for Drillthrough

I am trying to build a Power BI report for data from a SQL database where I have to show detail pages using Drillthrough. The only viable way to connect the datasets is using the database row ids.
From a user's perspective the row ids would add no value, only noise.
Is there a way to drillthrough using the row ids without showing them in a visual?
Yes, this is possible in the current release of Power BI Desktop using a workaround that involves hiding the row id column on the parent (or summary) page.
Take the following tables as example:
ALBUM
+---------+------------------------+
| AlbumId | AlbumName |
+---------+------------------------+
| 1 | Hoist |
+---------+------------------------+
| 2 | The Story Of the Ghost |
+---------+------------------------+
TRACK
+---------+---------+--------------------------+
| TrackId | AlbumId | TrackName |
+---------+---------+--------------------------+
| 1 | 1 | Julius |
+---------+---------+--------------------------+
| 2 | 1 | Down With Disease |
+---------+---------+--------------------------+
| 3 | 1 | If I Could |
+---------+---------+--------------------------+
| 4 | 1 | Riker's Mailbox |
+---------+---------+--------------------------+
| 5 | 1 | Axilla, Part II |
+---------+---------+--------------------------+
| 6 | 1 | Lifeboy |
+---------+---------+--------------------------+
| 7 | 1 | Sample In a Jar |
+---------+---------+--------------------------+
| 8 | 1 | Wolfmans Brother |
+---------+---------+--------------------------+
| 9 | 1 | Scent of a Mule |
+---------+---------+--------------------------+
| 10 | 1 | Dog Faced Boy |
+---------+---------+--------------------------+
| 11 | 1 | Demand |
+---------+---------+--------------------------+
| 12 | 2 | Ghost |
+---------+---------+--------------------------+
| 13 | 2 | Birds of a Feather |
+---------+---------+--------------------------+
| 14 | 2 | Meat |
+---------+---------+--------------------------+
| 15 | 2 | Guyute |
+---------+---------+--------------------------+
| 16 | 2 | Fikus |
+---------+---------+--------------------------+
| 17 | 2 | Shafty |
+---------+---------+--------------------------+
| 18 | 2 | Limb by Limb |
+---------+---------+--------------------------+
| 19 | 2 | Frankie Says |
+---------+---------+--------------------------+
| 20 | 2 | Brian and Robert |
+---------+---------+--------------------------+
| 21 | 2 | Water in the Sky |
+---------+---------+--------------------------+
| 22 | 2 | Roggae |
+---------+---------+--------------------------+
| 23 | 2 | Wading in the Velvet Sea |
+---------+---------+--------------------------+
| 24 | 2 | The Moma Dance |
+---------+---------+--------------------------+
| 25 | 2 | End of Session |
+---------+---------+--------------------------+
Add them as data sources. The 1:many relationship on AlbumId should be created. Create a parent page with a table containing AlbumId and AlbumName. Then create the details page with a table containing only the TrackName column. In the Drillthrough filter field of the details page, drag the Album table -> AlbumId to this field.
Now go back to the parent page and notice that when you right click on an album, you get the drillthrough menu to the details page. This works, but now you have a messy AlbumId column on your parent page.
The workaround is to hide the AlbumId on the parent page. First go to the Format (paint roller) menu of the table on the parent page and, under Column headers -> Word wrap, turn this off. Then drag the column separator of the table to hide the AlbumId. See the before and after images below.
BEFORE HIDE
AFTER HIDE
I have the powerbi file posted here if you want to see it in action.

DynamoDB with daily/weekly/monthly aggregated values

My application creates a log file every 10 minutes, which I want to store in DynamoDB in an aggregated way: 144 log files per day (6 per hour × 24 hours), 1008 per week, or ~4,400 per month.
I have different partition keys, but for sake of simplicity I have used only a single partition key in the following examples.
The straightforward solution would be to have different tables, e.g.
Table "TenMinLogsDay":
id (=part.key) | date (=sort key) | cntTenMinLogs | data
-------------- | ---------------- | ------------- | -------------------------------
1 | 2017-04-30 | 144 | some serialized aggregated data
1 | 2017-05-01 | 144 | some serialized aggregated data
1 | 2017-05-02 | 144 | some serialized aggregated data
1 | 2017-05-03 | 144 | some serialized aggregated data
Table "TenMinLogsWeek":
id (=part.key) | date (=sort key) | cntTenMinLogs | data
-------------- | ---------------- | ------------- | -------------------------------
1 | 2017-05-01 | 1008 | some serialized aggregated data
1 | 2017-05-08 | 1008 | some serialized aggregated data
1 | 2017-05-15 | 1008 | some serialized aggregated data
Table "TenMinLogsMonth":
id (=part.key) | date (=sort key) | cntTenMinLogs | data
-------------- | ---------------- | ------------- | -------------------------------
1 | 2017-05-01 | 4464 | some serialized aggregated data
1 | 2017-06-01 | 4320 | some serialized aggregated data
1 | 2017-07-01 | 4464 | some serialized aggregated data
However, I would prefer a combined table, which DynamoDB does not seem to support out of the box.
Also, I want to query either the daily OR the weekly OR the monthly aggregated items, so I don't want to use the filter feature for this.
The following solution would be possible, but seems like a poor hack:
Table "TenMinLogsCombined":
id (=part.key) | date (=sort key) | week (=LSI sort key) | month (=LSI sort key) | cntTenMinLogs | data
-------------- | ---------------- | -------------------- | --------------------- | ------------- | -----
1 | 2017-04-30 | (empty) | (empty) | 144 | ...
1 | 2017-05-01 | (empty) | (empty) | 144 | ...
1 | 0017-05-01 | 2017-05-01 | (empty) | 1008 | ...
1 | 1017-05-01 | (empty) | 2017-05-01 | 4464 | ...
1 | 2017-05-02 | (empty) | (empty) | 144 | ...
1 | 2017-05-03 | (empty) | (empty) | 144 | ...
Explanation:
By using the years "0017" and "1017" instead of "2017", I can query a date range such as 2017-05-01 to 2017-05-04 and DynamoDB won't read the weekly or monthly items, since their sort keys start with 0017 or 1017 (a query sketch follows below).
For week or month range queries, such a hack is not required, as empty LSI sort keys are possible.
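For concreteness, here is a boto3 sketch of such range queries against the combined table (illustrative only; the table, key, and index names follow the example above):

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("TenMinLogsCombined")

# Daily items only: the "0017-..." and "1017-..." sort keys fall outside
# this range, so the weekly/monthly rows are never read.
daily = table.query(
    KeyConditionExpression=Key("id").eq(1)
    & Key("date").between("2017-05-01", "2017-05-04")
)

# Weekly items: query the LSI on the "week" attribute instead
# ("week-index" is a hypothetical index name).
weekly = table.query(
    IndexName="week-index",
    KeyConditionExpression=Key("id").eq(1)
    & Key("week").between("2017-05-01", "2017-05-15")
)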
Does anybody know of a better way to achieve this?

Pattern matching with regular expression in spark dataframes using spark-shell

Suppose we are given a dataset ("DATA") like:
YEAR | FIRST NAME | LAST NAME | VARIABLES
2008 | JOY | ANDERSON | spark|python|scala; 45;w/o sports;w datascience
2008 | STEVEN | JOHNSON | Spark|R; 90|56
2006 | NIHA | DIVA | w/o sports
and we have another dataset ("RESULT") like:
YEAR | FIRST NAME | LAST NAME
1992 | EMMA | CENA
2008 | JOY | ANDERSON
2008 | STEVEN | ANDERSON
2006 | NIHA | DIVA
and so on.
The output ("RESULT") should be:
YEAR | FIRST NAME | LAST NAME | SUBJECT | SCORE | SPORTS | DATASCIENCE
1992 | EMMA | CENA | | | |
2008 | JOY | ANDERSON | SPARK | 45 | FALSE | TRUE
2008 | JOY | ANDERSON | PYTHON | 45 | FALSE | TRUE
2008 | JOY | ANDERSON | SCALA | 45 | FALSE | TRUE
2008 | STEVEN | ANDERSON | | | |
2006 | NIHA | DIVA | | | FALSE |
2008 | STEVEN | JOHNSON | SPARK | 90 | |
2008 | STEVEN | JOHNSON | SPARK | 56 | |
2008 | STEVEN | JOHNSON | R | 90 | |
2008 | STEVEN | JOHNSON | R | 56 | |
and so on.
Please note that there are some rows in DATA which are not present in RESULT, and vice versa. For example, "2008,STEVEN,JOHNSON" is not present in RESULT but is present in DATA, and such entries should be added to the RESULT dataset. The columns {SUBJECT, SCORE, SPORTS, DATASCIENCE} are made by my intuition that "spark" refers to the SUBJECT, and so on.
I hope you understand my query. I am using spark-shell with Spark dataframes.
Note that "Spark" and "spark" should be considered the same.
As explained in the comments, you can implement some of the tricky logic as in the answers to splitting row in multiple row in spark-shell
data:
val df = List(
  ("2008", "JOY ", "ANDERSON ", "spark|python|scala;45;w/o sports;w datascience"),
  ("2008", "STEVEN ", "JOHNSON ", "Spark|R;90|56"),
  ("2006", "NIHA ", "DIVA ", "w/o sports")
).toDF("YEAR", "FIRST NAME", "LAST NAME", "VARIABLE")
I only highlight the relatively tricky parts; you can figure out the details yourself. I suggest handling the "w" and "w/o" tags separately. Furthermore, you have to explode the languages in separate "sql" statements. This gives:
val sep = ";;;"  // assumed here: any separator string that cannot occur in the data
val step1 = df
  .withColumn("backrefReplace",
    split(regexp_replace('VARIABLE, "^([A-z|]+)?;?([\\d\\|]+)?;?(w.*)?$", "$1" + sep + "$2" + sep + "$3"), sep))
  .withColumn("letter", explode(split('backrefReplace(0), "\\|")))
  .select('YEAR, $"FIRST NAME", $"LAST NAME", 'VARIABLE, 'letter,
    explode(split('backrefReplace(1), "\\|")).as("digits"),
    'backrefReplace(2).as("tags"))
which gives
scala> step1.show(false)
+----+----------+---------+----------------------------------------------+------+------+------------------------+
|YEAR|FIRST NAME|LAST NAME|VARIABLE |letter|digits|tags |
+----+----------+---------+----------------------------------------------+------+------+------------------------+
|2008|JOY |ANDERSON |spark|python|scala;45;w/o sports;w datascience|spark |45 |w/o sports;w datascience|
|2008|JOY |ANDERSON |spark|python|scala;45;w/o sports;w datascience|python|45 |w/o sports;w datascience|
|2008|JOY |ANDERSON |spark|python|scala;45;w/o sports;w datascience|scala |45 |w/o sports;w datascience|
|2008|STEVEN |JOHNSON |Spark|R;90|56 |Spark |90 | |
|2008|STEVEN |JOHNSON |Spark|R;90|56 |Spark |56 | |
|2008|STEVEN |JOHNSON |Spark|R;90|56 |R |90 | |
|2008|STEVEN |JOHNSON |Spark|R;90|56 |R |56 | |
|2006|NIHA |DIVA |w/o sports | | |w/o sports |
+----+----------+---------+----------------------------------------------+------+------+------------------------+
Then you have to handle capitalisation and the tags. For the tags, you can write relatively generic code using explode and pivot, but you have to do some cleaning to match your exact result. Here is an example:
List(("a;b;c")).toDF("str")
.withColumn("char",explode(split('str,";")))
.groupBy('str)
.pivot("char")
.count
.show()
+-----+---+---+---+
| str| a| b| c|
+-----+---+---+---+
|a;b;c| 1| 1| 1|
+-----+---+---+---+
Read more about pivot here
The final step is simply to do a left join on the second dataset (the one called "RESULT"), keeping all of its rows.
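For illustration, the shape of that final join in PySpark (the question itself uses spark-shell/Scala; the stand-in dataframes and values below are mine):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# RESULT dataset (names only), as in the question.
result = spark.createDataFrame(
    [("1992", "EMMA", "CENA"), ("2008", "JOY", "ANDERSON")],
    ["YEAR", "FIRST NAME", "LAST NAME"],
)

# Stand-in for the processed DATA dataframe built in the steps above.
processed = spark.createDataFrame(
    [("2008", "JOY", "ANDERSON", "SPARK", "45")],
    ["YEAR", "FIRST NAME", "LAST NAME", "SUBJECT", "SCORE"],
)

# A left join keeps every RESULT row; unmatched rows get nulls in the new columns.
final = result.join(processed, ["YEAR", "FIRST NAME", "LAST NAME"], "left")
final.show()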

How to run raw query with a model with dynamic fields in Django 1.9?

I have a complex result that requires writing raw SQL queries.
See https://stackoverflow.com/a/38548462/80353
The expected result is a table showing several columns.
The first column header is simply Product and the other column headers are store names.
The values are simply the product names and the aggregated sales values of the product in these stores.
Which stores are shown is entirely dynamic, with a maximum of 9 stores.
The same in text format:
Store table
------------------------------
| id | code | address |
|-----|------|---------------|
| 1 | S1 | Kings Row |
| 2 | S2 | Queens Street |
| 3 | S3 | Jacks Place |
| 4 | S4 | Diamonds Alley|
| 5 | S5 | Hearts Road |
------------------------------
Product table
------------------------------
| id | code | name |
|-----|------|---------------|
| 1 | P1 | Saucer 12 |
| 2 | P2 | Plate 15 |
| 3 | P3 | Saucer 13 |
| 4 | P4 | Saucer 14 |
| 5 | P5 | Plate 16 |
| and many more .... |
|1000 |P1000 | Bowl 25 |
|----------------------------|
Sales table
----------------------------------------
| id | product_id | store_id | amount |
|-----|------------|----------|--------|
| 1 | 1 | 1 |7.05 |
| 2 | 1 | 2 |9.00 |
| 3 | 2 | 3 |1.00 |
| 4 | 2 | 3 |1.00 |
| 5 | 2 | 5 |1.00 |
| and many more .... |
| 1000| 20 | 4 |1.00 |
|--------------------------------------|
The relationships are:
Sales belongs to Store
Sales belongs to Product
Store has many Sales
Product has many Sales
What I want to achieve
I want to display by pagination in the following manner:
Given the stores S1-S3:
-------------------------
| product | S1 | S2 | S3 |
|---------|----|----|----|
|Saucer 12|7.05|9 | 0 |
|Plate 15 |0 |0 | 2 |
| and many more .... |
|------------------------|
For more details of the schema, check the question in How to get back aggregate values across 2 dimensions using Python Cubes?
My question
The schema is not super important to my question, which is:
Since I am going to write a complex raw query, is there a way to map the query result to a model where the fields are dynamic?
I found documentation about how to execute raw queries in Django, and how to map raw queries onto existing models with fixed fields and a matching table.
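For reference, those documented patterns look roughly like this (a sketch; the app, model, and table names are placeholders):

from django.db import connection
from myapp.models import Product  # hypothetical app/model

# Pattern 1: map a raw query onto an existing model with fixed fields.
products = Product.objects.raw("SELECT id, code, name FROM myapp_product")

# Pattern 2: bypass the model layer entirely and build dicts by hand.
with connection.cursor() as cursor:
    cursor.execute("SELECT code, name FROM myapp_product")
    columns = [col[0] for col in cursor.description]
    rows = [dict(zip(columns, row)) for row in cursor.fetchall()]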
My question is: is it possible to do that for a model that has no matching table and whose fields are dynamic?
If so, how?
Or, if I choose to use a materialised view in PostgreSQL, how do I match it with a model class?
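For the materialised view route, one common pattern is an unmanaged model whose db_table points at the view (a sketch; the field and view names here are placeholders, not from the schema above):

from django.db import models

class ProductSalesSummary(models.Model):
    # Field names must match the view's columns; one of them must act as
    # the primary key, since the view has no auto-generated id column.
    product = models.CharField(max_length=255, primary_key=True)
    total = models.DecimalField(max_digits=16, decimal_places=2)

    class Meta:
        managed = False                      # Django won't create or migrate this
        db_table = "product_sales_summary"   # name of the materialised view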