The performance of the where clause in DolphinDB

I have a DFS table that stores 5 billion rows of tick data for one year.
The table is partitioned by 'date' and 'code', and its schema is as follows:
date | time | code | bid | ask | bidvol | askvol
2020.03.05 |18:00:00.001 | 20012 | 0.01 | 0.02 | 100 | 200
I want to select the data from 9:00 this morning to 16:00 the next afternoon. My code:
tb = loadTable("dfs://db","tick")
timer select * from tb where code='2993' , concatDateTime(Date,Time) between pair(2020.03.05T07:00:00.000, 2020.03.05T18:00:00.000)
Time elapsed: 161.352 ms
But if I select two whole days of data, the query is much faster:
timer select * from tb where code='2993' , Date between pair(2020.03.05,2020.03.06)
Time elapsed: 41.813 ms
What's the reason?

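I think this is due to partition pruning in DolphinDB. The second query filters directly on Date, a partitioning column, so only the partitions for those two days need to be read. The first query filters on concatDateTime(Date,Time), a value computed from the column, so the engine probably cannot use that condition to rule out partitions and has to scan far more data. Check out this link from DolphinDB's manual: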
https://www.dolphindb.com/help/index.html?Newtopic10.html
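To make the idea concrete, here is a small Python sketch of partition pruning. It is only a toy model and not DolphinDB's engine: the `partitions` dictionary, the function names, and the ten days of hourly rows are all made up for illustration. It shows that a filter on the raw partitioning column can skip partitions, while a filter on a derived timestamp touches all of them unless a plain date condition is added as well.

```python
from datetime import date, datetime, time

# Toy "database": one partition per calendar day, each holding (date, time) rows.
# Hypothetical data; this only imitates the idea of partition pruning.
partitions = {
    date(2020, 3, d): [(date(2020, 3, d), time(h, 0)) for h in range(24)]
    for d in range(1, 11)
}


def query_on_partition_column(start_day, end_day):
    """Filter on the raw partitioning column: irrelevant partitions are skipped."""
    touched = [day for day in partitions if start_day <= day <= end_day]
    rows = [row for day in touched for row in partitions[day]]
    return rows, len(touched)


def query_on_derived_value(start_ts, end_ts):
    """Filter on a value computed from the columns: every partition is read."""
    touched = list(partitions)
    rows = [
        row
        for day in touched
        for row in partitions[day]
        if start_ts <= datetime.combine(row[0], row[1]) <= end_ts
    ]
    return rows, len(touched)


def query_with_both(start_ts, end_ts):
    """Prune on the date first, then apply the finer timestamp filter."""
    candidates, touched = query_on_partition_column(start_ts.date(), end_ts.date())
    rows = [
        row
        for row in candidates
        if start_ts <= datetime.combine(row[0], row[1]) <= end_ts
    ]
    return rows, touched
```

In the question's terms, this suggests keeping the concatDateTime condition but adding a plain condition on Date next to it, so the engine has something to prune on. I have not timed that against the table in the question.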

Handle account balance during concurrent transactions

I've been developing an application that handles accounts and the transactions made against them.
Currently the MariaDB table the application uses is modeled as follows.
The id column in account_transaction is the primary key and auto-increments.
account_transaction
+------+-------------+----------------------+---------+------------------+-----+
| id | account_id | date | value | resulting_amount | ... |
+------+-------------+----------------------+---------+------------------+-----+
| 101 | 100 | 03/may/2012 10:13:33 | 2000 | 2000 | ... |
| 102 | 100 | 03/may/2012 10:13:33 | 500 | 2500 | ... |
| 103 | 100 | 03/may/2012 10:13:34 | -1000 | 1500 | ... |
| 104 | 200 | 03/may/2012 10:13:35 | 1300 | 1300 | ... |
| 105 | 200 | 03/may/2012 10:13:36 | 200 | 1500 | ... |
| 106 | 200 | 03/may/2012 10:13:37 | -500 | 1000 | ... |
+------+-------------+----------------------+---------+------------------+-----+
The query to credit 300 to account_id 100 is:
INSERT INTO account_transaction (account_id, date, value, resulting_amount)
VALUES (100, NOW(), 300, COALESCE((SELECT at.resulting_amount
                                   FROM account_transaction at
                                   WHERE at.account_id = 100
                                   ORDER BY at.date DESC, at.id DESC
                                   LIMIT 1), 0) + 300)
The query to debit 300 from account_id 100 is:
INSERT INTO account_transaction (account_id, date, value, resulting_amount)
VALUES (100, NOW(), -300, COALESCE((SELECT at.resulting_amount
                                    FROM account_transaction at
                                    WHERE at.account_id = 100
                                    ORDER BY at.date DESC, at.id DESC
                                    LIMIT 1), 0) - 300)
I am using a subquery to find the latest balance while inserting the new transaction, with COALESCE covering the case where the account has no transactions yet.
I could have run the subquery below separately to find the current balance and then used it in the new transaction, but then multiple concurrent transactions read the same balance, which leads to a balance discrepancy and a loss for the company. So I put the subquery inside the INSERT to avoid the discrepancy:
SELECT at.resulting_amount
FROM account_transaction at
WHERE at.account_id = 100
ORDER BY at.date DESC, at.id DESC
LIMIT 1
The subquery-inside-INSERT approach handled the discrepancy as long as there were fewer than 50 concurrent requests.
With more than 50 concurrent transactions, a balance discrepancy sometimes occurs.
Example of a balance discrepancy: if the account balance is 1000 and 2 concurrent transactions each want to debit 100, the resulting_amount for both transactions would be 900, which is incorrect.
Please suggest a better approach to handle the balance discrepancy when a large number of concurrent transactions are placed. If you suggest a locking approach, please use a column-level lock (lock on the account_id column).
The easy answer is: don't keep resulting_amount in the transaction table; keep just a balance in a separate table (with primary key account_id).
Or do that and, within a single transaction, update the account balance and use the new balance as the resulting_amount to insert.
Your existing code assumes that ORDER BY at.date DESC, at.id DESC will always find the most recently inserted record, and that isn't going to hold true with concurrent requests.
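A rough sketch of the second option, written in Python against an in-memory SQLite database so it can be run as-is. The `account_balance` table and the `open_ledger` and `post_transaction` helpers are names I made up, and `BEGIN IMMEDIATE` and `INSERT OR IGNORE` are SQLite syntax, so treat this as an outline of the flow and not as MariaDB code.

```python
import sqlite3


def open_ledger():
    """Create an in-memory database with a balance table and a transaction log."""
    # isolation_level=None puts the connection in autocommit mode,
    # so the BEGIN / COMMIT statements below are fully explicit.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.executescript(
        """
        CREATE TABLE account_balance (
            account_id INTEGER PRIMARY KEY,
            balance    INTEGER NOT NULL
        );
        CREATE TABLE account_transaction (
            id               INTEGER PRIMARY KEY AUTOINCREMENT,
            account_id       INTEGER NOT NULL,
            value            INTEGER NOT NULL,
            resulting_amount INTEGER NOT NULL
        );
        """
    )
    return conn


def post_transaction(conn, account_id, value):
    """Update the balance and log the transaction inside one database transaction."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading anything
    try:
        # Make sure the account has a balance row (SQLite-specific syntax).
        conn.execute(
            "INSERT OR IGNORE INTO account_balance (account_id, balance) VALUES (?, 0)",
            (account_id,),
        )
        conn.execute(
            "UPDATE account_balance SET balance = balance + ? WHERE account_id = ?",
            (value, account_id),
        )
        (balance,) = conn.execute(
            "SELECT balance FROM account_balance WHERE account_id = ?",
            (account_id,),
        ).fetchone()
        # The freshly updated balance becomes the resulting_amount of the log row.
        conn.execute(
            "INSERT INTO account_transaction (account_id, value, resulting_amount) "
            "VALUES (?, ?, ?)",
            (account_id, value, balance),
        )
        conn.execute("COMMIT")
        return balance
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

In MariaDB with InnoDB, the UPDATE on the balance row takes a row lock that is held until the transaction commits, so two concurrent transactions on the same account run one after the other while other accounts are unaffected. That is probably the closest thing to the "lock on account_id" the question asks for.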

Adding a measure which finds the next row value for every row (similar to SQL Lead window function)

I would be very grateful if you could share your experience and advice on the following problem in Power BI.
There are 3 tables in the data model:
calendar dimension table
fact table on sessions
fact table on spending
| CW | Total cost | Sessions | Expected Column 1 | Expected Column 2 |
+----+-------------+-----------+-------------------+-------------------+
| 1 | 1200 | 50 | | |
| 2 | 1500 | 60 | 1200 | 50 |
| 3 | 1700 | 48 | 1500 | 60 |
| 4 | 1150 | 36 | 1700 | 48 |
| 5 | 900 | 29 | 1150 | 36 |
+----+-------------+-----------+-------------------+-------------------+
The CW column is the calendar week and comes from the calendar table. Sessions and Total cost come from the sessions and spending tables respectively. Data is aggregated and visualized at calendar week level.
Problem: I need to create measures to derive Expected Column 1 and Expected Column 2 from the Total cost and Sessions columns. Basically, getting the next values for each row, similar to the lead window function.
I have checked the Power BI community and there are several ideas (for example here https://community.powerbi.com/t5/Desktop/DAX-Query-to-Find-Next-Value/td-p/833896).
But these solutions assume all columns are from the same table, whereas in the case described above
all 3 columns are from different tables.
Will it be possible to get Expected Columns 1 and 2, and how? Many thanks in advance!
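Not DAX, but a small Python sketch of the logic may help pin down what the measures have to do. Both facts are first aggregated to the shared calendar week, and the shift then happens on that key, so it does not matter that the values come from different tables. Note that in the sample table the expected columns hold the value of the week before (a lag). The helper name `previous_week_values` is made up, and week numbers are assumed to be consecutive within one year.

```python
def previous_week_values(by_week):
    """Map each calendar week to the value of the week before it (None if absent)."""
    return {cw: by_week.get(cw - 1) for cw in by_week}


# Weekly aggregates taken from the sample table in the question.
total_cost = {1: 1200, 2: 1500, 3: 1700, 4: 1150, 5: 900}
sessions = {1: 50, 2: 60, 3: 48, 4: 36, 5: 29}

expected_column_1 = previous_week_values(total_cost)
expected_column_2 = previous_week_values(sessions)
```

A measure would follow the same pattern: evaluate the existing measure under a filter that moves the calendar week by one, using the calendar table that both fact tables relate to.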

How to append current and previous sessions side by side filtered by two independent slicers

Objective: I would like to obtain the difference between current and previous sessions based on date slicers.
I want the output to have 4 columns:
Date
Current Sessions (see measure below)
Previous Sessions (see measure below)
Difference (no measure calculated yet).
Situation:
I currently have two measures
Current Sessions: SUM(Sales[Sessions])
Previous Sessions (thanks to @Alexis Olson):
VAR datediffs = DATEDIFF(
    CALCULATE ( MAX ( 'Date'[Date] ) ),
    CALCULATE ( MAX ( 'Previous Date'[Date] ) ),
    DAY
)
RETURN
CALCULATE(
    SUM ( Sales[Sessions] ),
    USERELATIONSHIP ( 'Previous Date'[Date], 'Date'[Date] ),
    DATEADD ( 'Date'[Date], datediffs, DAY )
)
I have three tables.
Sales
Date
Previous Date (carbon copy of Date table)
My Previous Date table has an inactive 1:1 relationship with the Date table. The Date table has an active one-to-many relationship
with my Sales table.
I have two slicers at all times, comparing the same number of days from different time periods (e.g. Jan 1st to Jan 7th 2019 vs Dec 25th to Dec 31st 2019).
If I put current sessions, previous sessions, and a date column from any of the three tables into a table visual, I would like the result to look like this:
+----------+------------------+-------------------+------------+
| date | current sessions | previous sessions | difference |
+----------+------------------+-------------------+------------+
| Jan 8th | 10000 | 7000 | 3000 |
| Jan 9th | 20000 | 10000 | 10000 |
| Jan 10th | 15000 | 16000 | -1000 |
| Jan 11th | 14000 | 12000 | 2000 |
| Jan 12th | 12000 | 14000 | -2000 |
| Jan 13th | 11000 | 16000 | -5000 |
| Jan 14th | 15000 | 18000 | -3000 |
+----------+------------------+-------------------+------------+
When I put the Sessions date on the table along with sessions and previous sessions, I get the right session amounts for each day, but the previous session amounts don't calculate correctly, I assume because they are being filtered by the date rows.
How can I override that table filter and force it to get the exact previous session amounts? Basically, I want both results appended to each other. The following shows my problem: the previous sessions value is the same on each day and is basically the amount of dec 31st jan 2018, because the max date is different for each row, but I want it to be based on the slicer.
The mistake was in the first part of the datediffs variable within the Previous Sessions formula. Replacing it with the following fixed it:
CALCULATE(LASTDATE('Date'[Date]),ALLSELECTED('Date'))
This always calculates the last date of the slicer selection for each row, overriding the date value of the row itself.
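The same logic as a Python sketch, in case it helps to see why the offset has to come from the slicers and not from the row. The function names, date ranges, and session numbers are made up for illustration, and DAX filter context is only imitated by plain dictionaries here.

```python
from datetime import date, timedelta


def date_range(start, days):
    """Return a list of consecutive dates, standing in for a slicer selection."""
    return [start + timedelta(days=i) for i in range(days)]


def previous_sessions_by_row(sessions, current_range, previous_range):
    """Look up, for each current date, the sessions of the matching previous date.

    The offset is computed once from the last date of each slicer selection,
    not from the date of the row being evaluated.
    """
    offset = max(previous_range) - max(current_range)
    return {d: sessions.get(d + offset, 0) for d in current_range}


# Hypothetical slicer selections and session counts.
previous_range = date_range(date(2019, 1, 1), 3)
current_range = date_range(date(2019, 1, 8), 3)
sessions = {
    date(2019, 1, 1): 7000,
    date(2019, 1, 2): 10000,
    date(2019, 1, 3): 16000,
    date(2019, 1, 8): 10000,
    date(2019, 1, 9): 20000,
    date(2019, 1, 10): 15000,
}

previous = previous_sessions_by_row(sessions, current_range, previous_range)
difference = {d: sessions[d] - previous[d] for d in current_range}
```

If the offset were computed from the current row's date, it would change from row to row and every row would end up pointing at the same previous date, which matches the symptom described in the question.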

SUM of column conditional to many values of another column

I am trying to accomplish something, but don't know how to do it.
I have a Dimension (Table called TEntry) that represents time entries for employees like so :
Id | EmployeeId | EntryDT | TimeInMinutes | PriceAgreementId
------ | ---------- | ---------- | ------------- | ----------------
1 | 1 | 2017-03-20 | 100 | 1
2 | 1 | 2017-03-31 | 50 | null
3 | 2 | 2017-03-21 | 100 | 1
4 | 2 | 2017-03-23 | 125 | 2
5 | 3 | 2017-03-15 | 90 | null
6 | 3 | 2017-03-25 | 60 | 1
Sometimes they work on "PriceAgreements", and sometimes they don't.
In my dashboard, I have a table that groups TEntry by EmployeeId and sums TimeInMinutes. I also have a slicer for EntryDT:
EmployeeId | TimeInMinutes
-------------- | -------------
1 | 150
2 | 225
3 | 150
I need to create 2 new columns that represent :
The total TimeInMinutes an Employee has worked on all PriceAgreements
So for EmployeeId #1, the Total would be 100.
The total TimeInMinutes ALL Employees have worked, but only for the PriceAgreements the current Employee (current row) has worked on.
The Table would look like this (without the PriceAgreementIds in parenthesis) :
EmployeeId | TimeInMinutes | TimeInMinutes on PriceAgreements | TimeInMinutes on PriceAgreements ALL other EmployeeIds
-------------- | ------------- | -------------------------------- | ------------------------------------------------------
1 | 150 | 100 (PriceAgreementId=1) | 260 (PriceAgreementId=1)
2 | 225 | 225 (PriceAgreementId=1 and 2) | 385 (PriceAgreementId=1 and 2)
3 | 150 | 150 (PriceAgreementId=1) | 260 (PriceAgreementId=1)
Column "TimeInMinutes on PriceAgreements" is quite easy, but for the other one I cannot find a solution.
I have this DAX expression I started, but it is not complete:
CALCULATE(SUM(TEntry[TimeInMinutes]), NOT ISBLANK(TEntry[PriceAgreementId]), ALL(TEmployee))
TEmployee is a Dimension linked to the main TEntry Table.
Any help would be appreciated.
Thank you
I'm throwing this on as an answer because (a) it might get you (or someone else) going in the right direction and (b) if it's guaranteed that an Employee would only ever have time entries corresponding to 2 price agreements, this would work - which is unlikely the case for you, but might be the case for others trying to accomplish a similar thing.
Measure =
CALCULATE (
    SUM ( TEntry[TimeInMinutes] ),
    FILTER (
        ALL ( TEntry ),
        (
            TEntry[PriceAgreementID] = MIN ( TEntry[PriceAgreementID] )
                || TEntry[PriceAgreementID] = MAX ( TEntry[PriceAgreementID] )
        )
            && TEntry[PriceAgreementID] <> BLANK ()
    )
)
This measure is saying: SUM the TimeInMinutes for all records in the TEntry table where the PriceAgreementID matches either the minimum OR maximum PriceAgreementID (in the context of the current row) AND the PriceAgreementID isn't blank.
The fatal flaw in this answer is in the MIN and MAX. For Employee ID 2, who has 2 PriceAgreementIDs (1 & 2) - the MIN will calculate the minutes for PriceAgreementID 1 and the MAX will calculate the minutes for PriceAgreementID 2. However, to expand to a case where there might be more than 2 PriceAgreements...I don't know how to do that.
It does work on the sample data in your question, though (since there is a max of 2 price agreements per employee).
Typically when I'm faced with a problem like this that isn't easy to solve, I think about my data model and make sure that it conforms to a star schema as closely as possible.
In your case, an employee can have multiple price agreements, and a price agreement can be associated with many employees. That, to me, suggests a many-to-many relationship. I'd strongly recommend reading more about many-to-many relationships and whether restructuring the underlying tables (e.g. to include a bridge table) would help get you closer to the answer you need.
A good starting point might be: https://www.sqlbi.com/articles/many-to-many-relationships-in-power-bi-and-excel-2016/
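For what it's worth, here is the target calculation for any number of price agreements, sketched in Python and not in DAX. The function name is made up; the rows are the sample data from the question. The idea is to collect the set of agreements each employee has worked on, total the minutes per agreement across all employees, and then sum those totals over each employee's set.

```python
entries = [
    # (EmployeeId, TimeInMinutes, PriceAgreementId) from the question's sample data
    (1, 100, 1),
    (1, 50, None),
    (2, 100, 1),
    (2, 125, 2),
    (3, 90, None),
    (3, 60, 1),
]


def minutes_on_shared_agreements(rows):
    """For each employee, total the minutes of ALL employees on that employee's agreements."""
    agreements_by_employee = {}
    minutes_by_agreement = {}
    for employee, minutes, agreement in rows:
        agreements_by_employee.setdefault(employee, set())
        if agreement is None:
            continue  # entries without a price agreement do not count
        agreements_by_employee[employee].add(agreement)
        minutes_by_agreement[agreement] = minutes_by_agreement.get(agreement, 0) + minutes
    return {
        employee: sum(minutes_by_agreement[a] for a in agreements)
        for employee, agreements in agreements_by_employee.items()
    }


result = minutes_on_shared_agreements(entries)
```

The set of agreements per employee is what a bridge table would hold in the data model, which is why the many-to-many restructuring seems like the natural route.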

% of Grand Total of a Measure that uses other Measures and is Crossfiltered

This seems so easy in my head, but I haven't been able to get it for the last few hours.
I have a Table visualization that provides Cost by Hour using measures.
Category | Total Cost | Hours | Cost per Hour
A | 1000 | 10 | 100
B | 2000 | 100 | 20
C | 100 | 4 | 25
D | -500 | 100 | -5
Total | 2600 | 214 | 12.1495
For my purposes, I would also like to create a % of Grand Total of Cost per Hour to add to a treechart visualization. However, if I simply add [Cost per Hour] to the treechart again and use the "quick calc" functionality on the field, it returns 823.1% for the first record in the above table, since 100/12.1495 = 8.2308. I would like this % of GT of Cost per Hour to use the total sum of the Cost per Hour column. Desired result:
Category | Total Cost | Hours | Cost per Hour | % of Cost per Hour
A | 1000 | 10 | 100 | 71.4%
B | 2000 | 100 | 20 | 14.3%
C | 100 | 4 | 25 | 17.9%
D | -500 | 100 | -5 | -3.6%
Total | 2600 | 214 | 12.1495 | 100%
A few things to note that make the application of any DAX challenging. All of the measures below are filtered by multiple filter visualizations from Tables 1-5 and by page-level filters from Tables 1-5.
The table visualization exists in Table1. Costs exist in Tables 2-5 and are related to Table1 using a Many-to-One Single Direction Filter Relationship.
[Total Cost] is a measure that adds together values from 4 different tables. E.g.: Total Cost = sum(table2[value])+sum(table3[value])+sum(table4[value])+sum(table5[value])
[Hours] is a measure that sums a column from a table and divides by the distinct count of records in that table. E.g.: Hours = sum(table1[hours])/Distinctcount(table1[records])
[Cost per Hour] is a measure built from the two other measures: Cost per Hour = [Total Cost] / [Hours]
I sort of feel like this is similar to people wanting to add percentages to pie charts... I'm just trying to ascribe a real number to express the proportion displayed in the TreeChart visualization. I really hope that this is easier than it seems.
EDIT @alejandrozuleta:
Table1 is the original table from which tables 2-5 are referenced and created. An index number was assigned in Table1 and tables 2-5 are linked on this reference number. The reason that tables 2-5 exist separately is that they contain separate cost "types", and a join that occurs in these tables adds additional columns that are only applicable to specific cost types. For example, Table2 is Personnel Costs:
index | Category | Cost Type | Value | Age of Personnel
1 | A | Personnel | 1 | 33
and Table3 is Maintenance Costs:
index | Category | Cost Type | Value | Scheduled or UnScheduled Maint
2 | A | Maintenance | 5 | Scheduled
If [Age of Personnel] existed in Table3, it would have a "null" for any record of the Maintenance [Cost Type]; vice versa, [Scheduled or UnScheduled Maint] would have a "null" if it existed in Table2. Because I don't want to deal with filter visualizations needing to select "(blanks)" for certain cost types, the data relationship between these tables is a Many-to-One Single Direction Filter using [index] as the key.
EDIT2:
Working .pbix file with notional data and the data model I described is linked:
StackOverflow_GTofMeasure_Crosfilltered.pbix
I think this solution could work for you. Basically I've created two helper measures (which you don't have to show in your table):
CostPerHourHelper = SUMX(TableName,[Cost per Hour])
CostPerHourTotal = SUMX(ALL(TableName),[Cost per Hour])
Now you can create your % of Cost per Hour measure using this expression:
% Cost Per Hour = [CostPerHourHelper]/[CostPerHourTotal]
It should produce:
UPDATE:
Use the ALLSELECTED() function to preserve the explicit filters you applied.
% Cost Per Hour = SUMX ( TableName, [Cost per Hour] )
/ SUMX ( ALLSELECTED ( TableName ), [Cost per Hour] )
Let me know if this helps.
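To check the arithmetic, here is the same calculation as a Python sketch over the sample table from the question. The function name is made up, and the dictionary stands in for the rows left after filtering. The point is that the denominator is the sum of the per-category Cost per Hour values (140 for the sample), not the Cost per Hour of the grand total row (12.1495).

```python
# (Total Cost, Hours) per category, taken from the sample table in the question.
rows = {
    "A": (1000, 10),
    "B": (2000, 100),
    "C": (100, 4),
    "D": (-500, 100),
}


def share_of_cost_per_hour(table):
    """Return each category's Cost per Hour as a fraction of the summed Cost per Hour."""
    cost_per_hour = {name: cost / hours for name, (cost, hours) in table.items()}
    denominator = sum(cost_per_hour.values())  # plays the role of SUMX over all rows
    return {name: value / denominator for name, value in cost_per_hour.items()}


shares = share_of_cost_per_hour(rows)
```

For category A this gives 100 / 140, or about 71.4%, and the shares add up to 100%.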