Is there an alternative to UNION that does fewer scans?


See the db-fiddle.
On the following table
CREATE TABLE foo (x INTEGER PRIMARY KEY, y INTEGER);
INSERT INTO foo VALUES (0,41), (1, 23), (2,45), (3,32), ...
I need the x and y that have min(y) within each group of 10 consecutive x values, and the same for max(y):
SELECT x, min(y) FROM foo GROUP BY (x/10)
UNION
SELECT x, max(y) FROM foo GROUP BY (x/10);
The EXPLAIN QUERY PLAN output shows that two scans of the table are performed
`--COMPOUND QUERY
   |--LEFT-MOST SUBQUERY
   |  |--SCAN TABLE foo
   |  `--USE TEMP B-TREE FOR GROUP BY
   `--UNION ALL
      |--SCAN TABLE foo
      `--USE TEMP B-TREE FOR GROUP BY
Is there any way to reword the query so that only one scan is performed?
What I've done in the meantime is to select all rows (SELECT x, y FROM foo;) and manually aggregate min/max as rows are returned to the host language:
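#include <climits>  // INT_MAX, INT_MIN

int lastGroup = -1;  // no group seen yet (assumes x >= 0)
int minX = 0, minY = INT_MAX;
int maxX = 0, maxY = INT_MIN;
while (sqlite3_step(query) == SQLITE_ROW) {
    int x = sqlite3_column_int(query, 0);
    int y = sqlite3_column_int(query, 1);
    int group = x / 10;
    if (group != lastGroup) {
        if (lastGroup != -1) {
            // save minX, minY, maxX, maxY in a list somewhere
            // ...
        }
        // reset minY and maxY so the first row of the new group replaces them
        minY = INT_MAX;
        maxY = INT_MIN;
        lastGroup = group;
    }
    // the first row of a group is both its min and its max, so test both
    if (y < minY) {
        minX = x;
        minY = y;
    }
    if (y > maxY) {
        maxX = x;
        maxY = y;
    }
}
// after the loop, save the final group's minX, minY, maxX, maxY as well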
This achieves a single scan and the whole process is more than twice as fast... but I'd rather express this logic declaratively in SQL if possible.

Why not just do one group by with more columns?
SELECT (x/10) * 10, min(y), max(y)
FROM foo
GROUP BY (x/10)
If you want multiple rows, you can unpivot afterwards:
SELECT f.x, (CASE WHEN w.which = 1 THEN f.min_y ELSE f.max_y END) as min_max_y
FROM (SELECT (x/10) * 10 as x, min(y) as min_y, max(y) as max_y
      FROM foo
      GROUP BY (x/10)
     ) f CROSS JOIN
     (SELECT 1 as which UNION ALL SELECT 2) w;
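To check the single-scan claim, here is a minimal sketch using Python's built-in sqlite3 module. The table definition and the first four y values come from the question; the remaining y values, the column aliases, and the ORDER BY are made up for illustration. It runs the one-pass GROUP BY and counts how many plan steps scan foo. The exact wording of the plan text varies between SQLite versions, so the sketch only looks for steps that start with SCAN and mention foo.

```python
import sqlite3

# Hypothetical sample data: the first four y values are from the question,
# the rest are invented so that each group of 10 has a distinct min and max.
YS = [41, 23, 45, 32, 7, 19, 88, 54, 60, 11,
      5, 72, 39, 90, 14, 66, 28, 81, 3, 50]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (x INTEGER PRIMARY KEY, y INTEGER)")
conn.executemany("INSERT INTO foo VALUES (?, ?)", enumerate(YS))

QUERY = """
    SELECT (x/10) * 10 AS group_start, min(y) AS min_y, max(y) AS max_y
    FROM foo
    GROUP BY x/10
    ORDER BY group_start
"""

rows = conn.execute(QUERY).fetchall()

# Each EXPLAIN QUERY PLAN row has four columns; the last one is the description.
plan = [r[3] for r in conn.execute("EXPLAIN QUERY PLAN " + QUERY)]
scans_of_foo = [d for d in plan if d.startswith("SCAN") and "foo" in d]

print(rows)
print(plan)
print("scans of foo:", len(scans_of_foo))
```

With this data the query returns one row per group of ten x values, and the plan should contain a single scan of foo.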
EDIT:
Your query relies on a SQLite extension (selecting the bare column x alongside min(y) or max(y)), which is not consistent with the standard or with any other SQL dialect. A better way to write this uses window functions:
select x, y
from (select f.*,
row_number() over (partition by (x/10) order by y asc) as seqnum_asc,
row_number() over (partition by (x/10) order by y desc) as seqnum_desc
from foo f
) f
where 1 in (seqnum_asc, seqnum_desc);
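As a rough check of the row_number() approach, the sketch below runs the same query through Python's sqlite3 module. It assumes the bundled SQLite is version 3.25 or newer (the release that added window functions); the y values beyond the first four and the trailing ORDER BY are made up for illustration.

```python
import sqlite3

# Hypothetical sample data: the first four y values are from the question,
# the rest are invented and contain no ties within a group.
YS = [41, 23, 45, 32, 7, 19, 88, 54, 60, 11,
      5, 72, 39, 90, 14, 66, 28, 81, 3, 50]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (x INTEGER PRIMARY KEY, y INTEGER)")
conn.executemany("INSERT INTO foo VALUES (?, ?)", enumerate(YS))

# Rank every row twice within its group of 10: once ascending by y and once
# descending. Rows ranked first in either direction are the min and max rows.
extremes = conn.execute("""
    SELECT x, y
    FROM (SELECT f.*,
                 row_number() OVER (PARTITION BY x/10 ORDER BY y ASC)  AS seqnum_asc,
                 row_number() OVER (PARTITION BY x/10 ORDER BY y DESC) AS seqnum_desc
          FROM foo f
         ) f
    WHERE 1 IN (seqnum_asc, seqnum_desc)
    ORDER BY x
""").fetchall()

print(extremes)
```

Each group contributes two rows, the (x, y) pair at its minimum y and the pair at its maximum y. If several rows tie for the minimum or maximum, row_number() keeps only one of them, and which one is not specified.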
Or, using first_value() if you don't like subqueries:
select distinct (x/10)*10, -- this is not necessary but helps to make the purpose clear
first_value(x) over (partition by (x/10) order by y asc) as x_at_min_y,
min(y) over (partition by x/10) as min_y,
first_value(x) over (partition by (x/10) order by y desc) as x_at_max_y,
max(y) over (partition by x/10) as max_y
from foo;
Here is a db-fiddle.
If you like, you can unpivot afterwards, as illustrated above.

Related

Power BI - Matching closest 3D points from two tables

I have two tables (Table 1 and Table 2) both containing thousands of three dimensional point coordinates (X, Y, Z), Table 2 also has an attribute column.
Table 1
X     Y      Z
6007  44268  1053
6020  44269  1051
Table 2
X     Y      Z     Attribute
6011  44310  1031  A
6049  44271  1112  B
I need to populate a calculated column in Table 1 with an attribute from Table 2 based on the minimum distance between points in 3D space. Basically, match the points in Table 1 to the closest point in Table 2 and then fetch the attribute from Table 2.
So far I have tried rounding X, Y and Z in both tables, then concatenating the rounded values into a separate column in each table. I then use DAX:
CALCULATE(FIRSTNONBLANK(Table2[Attribute], 1), FILTER(ALL(Table2), Table2[XYZ] = Table1[XYZ]))
This has given me reasonable success depending on the degree of rounding applied to the coordinates.
Is there a better way to achieve this in Power BI?

This is similar to this post, except with a simpler distance function. See also this post.
Assuming you want the standard Euclidean Distance:
ClosestPointAttribute =
MINX (
TOPN (
1,
Table2,
( Table2[X] - Table1[X] ) ^ 2 +
( Table2[Y] - Table1[Y] ) ^ 2 +
( Table2[Z] - Table1[Z] ) ^ 2,
ASC
),
Table2[Attribute]
)
Note: I've omitted the SQRT from the formula because we don't need the actual distance, just the ordering (and SQRT preserves order since it's a strictly increasing function). You can include it if you prefer.
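The same nearest-point lookup can be sketched outside Power BI to sanity-check the logic. The Python below is only an illustration, not a DAX equivalent: the function names are hypothetical, and the data is the two sample rows from each table in the question. It compares squared distances, for the same reason the SQRT is omitted above.

```python
def squared_distance(p, q):
    """Sum of squared coordinate differences between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def closest_attribute(point, candidates):
    """Return the attribute of the candidate point nearest to `point`.

    `candidates` is a list of (coordinates, attribute) pairs. On a tie,
    min() keeps the first candidate it encounters.
    """
    _, attribute = min(candidates, key=lambda c: squared_distance(point, c[0]))
    return attribute


# Sample rows from the question.
table1 = [(6007, 44268, 1053), (6020, 44269, 1051)]
table2 = [((6011, 44310, 1031), "A"), ((6049, 44271, 1112), "B")]

matches = [closest_attribute(p, table2) for p in table1]
print(matches)  # ['A', 'A']
```

For the first Table 1 point the squared distances are 2264 to A and 5254 to B, so A is the closer match.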

A function in M Code:
(p1 as list, q1 as list)=>
let
f = List.Generate(
()=> [x = Number.Power(p1{0}-q1{0},2), idx=0],
each [idx]<List.Count(p1),
each [x = Number.Power(p1{[idx]+1}-q1{[idx]+1},2), idx=[idx]+1],
each [x]
),
r = Number.Sqrt(List.Sum(f))
in
r
Each list is a set of coordinates, and the function returns the distance between p1 and q1.
The above function (which I named fnDistance) can be incorporated into Power Query code as in this example:
let
//Read in both tables and set data types
Source2 =Excel.CurrentWorkbook(){[Name="Table_2"]}[Content],
table2 = Table.TransformColumnTypes(Source2,{{"X", Int64.Type}, {"Y", Int64.Type}, {"Z", Int64.Type},{"Attribute", Text.Type}}),
Source = Excel.CurrentWorkbook(){[Name="Table_1"]}[Content],
table1 = Table.TransformColumnTypes(Source,{{"X", Int64.Type}, {"Y", Int64.Type}, {"Z", Int64.Type}}),
//calculate distances from Table 1 coordinates to each of the Table 2 coordinates and store in a List
custom = Table.AddColumn(table1,"Distances", each
let
t2 = Table.ToRecords(table2),
X=[X],
Y=[Y],
Z=[Z],
distances = List.Generate(()=>
[d=fnDistance({X,Y,Z},{t2{0}[X],t2{0}[Y],t2{0}[Z]}),a=t2{0}[Attribute], idx=0],
each [idx] < List.Count(t2),
each [d=fnDistance({X,Y,Z},{t2{[idx]+1}[X],t2{[idx]+1}[Y],t2{[idx]+1}[Z]}),a=t2{[idx]+1}[Attribute], idx=[idx]+1],
each {[d],[a]}),
//determine set of coordinates with the minimum distance and return associated Attribute
minDistance = List.Min(List.Alternate(List.Combine(distances),1,1,1)),
attribute = List.Range(List.Combine(distances), List.PositionOf(List.Combine(distances),minDistance)+1,1){0}
in
attribute, Text.Type)
in
custom

Postgres C extended data type definition

For the following problem, Postgres makes it a bit tricky to deal with more complex structures. I want to build a two-dimensional array of structures, but I don't know how to do this with Postgres C extensions. Does anyone have any ideas?
Table
id contents(text) num(double)
1 I love you. {1,3,4,5,6,7,8,10}
2 why do it? {3,4,2,11,12,33,44,15}
3 stopping. {22,33,11,15,14,22,11,55}
4 try it again. {15,12,11,22,55,21,31,11}
Sort the rows by each position of the array to get the following structure. The first row of the result below corresponds to the first position of the num array, and so on. The count 4 means that only the first n sorted entries are returned.
select my_func(contents, num, 4) from table;
expected result:
result
{('stopping.', 22), ('try it again.', 15), ('why do it?', 3), ('I love you.', 1)}
{('stopping.', 33), ('try it again.', 12), ('why do it?', 4), ('I love you.', 3)}
{('stopping.', 11), ('try it again.', 11), ('I love you.', 4), ('why do it?', 2)}
......
......
Thanks in advance.

I'm not sure why you need a C extended data type, but the following will give you what you want and can be implemented as a PL/pgSQL function.
WITH t1 AS (
SELECT id, contents, unnest (num) AS n FROM table
),
t2 AS (
SELECT id, contents, n,
row_number () OVER (PARTITION BY id ORDER BY id) AS o
FROM t1 ORDER BY o ASC, n DESC, id ASC
),
t3 AS (
SELECT array_agg (ROW (contents, n)) AS a, o
FROM t2 GROUP BY o ORDER BY o
)
SELECT array_agg (a ORDER BY o) FROM t3;
UPDATE: A problem with the above may be the undefined order of 'unnest'.
The following gives a consistent relation between index and num, but requires writing the size of the num array explicitly.
WITH RECURSIVE t1 (i, id, contents, num) AS (
SELECT 1, id, contents, num[1] FROM table
UNION ALL
SELECT t1.i + 1, table.id, table.contents, table.num[t1.i + 1]
FROM t1, table
WHERE t1.id = table.id AND t1.i < 8 -- put here size of array
),
t2 (i, d) AS (
SELECT i, array_agg (ROW (contents, num) ORDER BY num DESC)
FROM t1 GROUP BY i
)
SELECT array_agg (d ORDER BY i) FROM t2;
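To make the intended result concrete, here is a small Python sketch of the same per-position sort. It is not Postgres code, the function name is hypothetical, and the data is the four sample rows from the question. For each array position it pairs every row's contents with its value at that position, sorts the pairs in descending order, and keeps the first n.

```python
# Sample rows from the question: (contents, num).
rows = [
    ("I love you.",   [1, 3, 4, 5, 6, 7, 8, 10]),
    ("why do it?",    [3, 4, 2, 11, 12, 33, 44, 15]),
    ("stopping.",     [22, 33, 11, 15, 14, 22, 11, 55]),
    ("try it again.", [15, 12, 11, 22, 55, 21, 31, 11]),
]


def top_n_per_position(rows, n):
    """Return one list per array position with the n largest (contents, value) pairs."""
    contents = [c for c, _ in rows]
    # Transpose the num arrays so each tuple holds one position across all rows.
    columns = zip(*(nums for _, nums in rows))
    result = []
    for column in columns:
        # sorted() is stable, so rows with equal values keep their table order.
        pairs = sorted(zip(contents, column), key=lambda p: p[1], reverse=True)
        result.append(pairs[:n])
    return result


result = top_n_per_position(rows, 4)
for line in result[:3]:
    print(line)
```

The first three printed lines correspond to the three result rows listed in the question.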

How to calculate performance curve for each row of data

I want to plot a performance curve for each row of data I have.
A simple version of what I want to do is plot the function with the equation as Y= m*X+b, where I have a table with m and b values and I want Y values for X = 1 to 10.
How is this calculated?
A Y = mX + b example can be seen in the following plot:

The following works:
WITH NUMBERS AS
(
SELECT N FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10))N(N)
),
Examples AS
(
SELECT m,b FROM (VALUES (1,2),(2,2))N(m,b)
)
SELECT
'Y = ' + CAST(Examples.m as varchar(10)) + 'X + ' + CAST(Examples.b as varchar(10)) AS Formula
,Numbers.N AS X
, Numbers.N * Examples.m + Examples.b AS Y
FROM Examples
CROSS JOIN NUMBERS
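The cross join simply pairs every (m, b) row with every X value. As a quick illustration of the arithmetic, the Python sketch below does the same pairing for the two example rows in the query, (m, b) = (1, 2) and (2, 2); the variable names are hypothetical.

```python
# (m, b) pairs taken from the Examples CTE in the query above.
examples = [(1, 2), (2, 2)]

# For each line, evaluate Y = m*X + b at X = 1..10.
curves = {
    (m, b): [(x, m * x + b) for x in range(1, 11)]
    for m, b in examples
}

for (m, b), points in curves.items():
    print(f"Y = {m}X + {b}:", points)
```

Each entry holds ten (X, Y) points, which can then be handed to whatever plotting tool is in use.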

pyspark mathematical computation in a dataframe

I have extracted a DataFrame from a larger DataFrame, and now I need to do simple computations like addition and division on it.
A sample DataFrame looks like this:
item counts
z 23156
x 15462
What I need to do is divide x by the sum of x and z,
for example:
value = x / (x + z)

You must first compute the sum of x and z, then divide x by sum(x) + sum(z),
for example:
Table 1(original table):
x z
1 2
3 4
Table 2 (Aggregated table):
table2 = sqlCtx.sql("select sum(x) + sum(z) as sum_xz from table1")
table2.registerTempTable("table2")
sum_xz
10
Then join both table and divide
table3 = sqlCtx.sql("select a.x / b.sum_xz from table1 a join table2 b")
For your reference.
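For the two counts shown in the question, the arithmetic can be checked in plain Python (not Spark; the variable names are hypothetical). The sketch also shows why the denominator needs parentheses: without them, division binds tighter than addition and the expression means something else.

```python
# Counts from the sample DataFrame in the question.
counts = {"z": 23156, "x": 15462}

x = counts["x"]
z = counts["z"]

# Intended ratio: x divided by the sum of x and z.
value = x / (x + z)

# Without parentheses, x / x is evaluated first, giving 1 + z.
unparenthesized = x / x + z

print(round(value, 4))
print(unparenthesized)
```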

Piecewise linear regression with SAS PHREG

How to implement a piecewise linear regression model in PHREG procedure of SAS?
For example with one knot at X=T:
Y = β_10 + β_11 . X if X ≤ T
Y = β_20 + β_21 . X if X >T
Given the model with the constraint of continuity:
Y = β_10 + β_11 . X if X ≤ T
Y = β_10 + (β_11 - β_21) T + β_21 . X if X >T
i.e :
Y= β_0 + β_1 . X + S_1
where
S_1 = ( β_11 - β_21 ) T if X >T and 0 otherwise.
Finally, I would like to include it in a Cox model:
Proc PHREG ;
Model time * cas (censure) = X S_1 ;
Run ;
But the problem is S_1 has unknown beta coefficients in it.
Thanks for your help!
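One observation that may help, offered as a sketch and not as a SAS recipe: if the knot T is assumed to be known, the continuous model above can be rewritten as Y = β_10 + β_11 . X + (β_21 - β_11) . max(X - T, 0). The term max(X - T, 0) can be computed from the data, and its coefficient is the change in slope after the knot, so the extra covariate no longer contains unknown betas. The Python below only checks numerically that the two forms agree; the function names and coefficient values are made up.

```python
import math


def piecewise(x, b10, b11, b21, t):
    """Continuous two-segment model exactly as written in the question."""
    if x <= t:
        return b10 + b11 * x
    return b10 + (b11 - b21) * t + b21 * x


def hinge_form(x, b10, b11, b21, t):
    """Same model written with the computable covariate max(x - t, 0)."""
    s1 = max(x - t, 0.0)
    return b10 + b11 * x + (b21 - b11) * s1


# Hypothetical coefficients and knot, chosen only for the check.
b10, b11, b21, t = 1.5, 0.8, -0.3, 4.0
xs = [0.0, 2.0, 4.0, 4.5, 7.0, 10.0]

agree = all(
    math.isclose(piecewise(x, b10, b11, b21, t), hinge_form(x, b10, b11, b21, t))
    for x in xs
)
print(agree)
```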
