How do I group by into struct in BigQuery?

I have the following data:
player_id | level | talent_id
1 | 1 | a
1 | 2 | b
1 | 3 | c
2 | 1 | d
2 | 2 | e
And I want to group by player_id and turn the rows into structs, with the level values as struct field names:
player_id | data
1 | {_1 = a, _2 = b, _3 = c}
2 | {_1 = d, _2 = e, _3 = null}
The level column always takes values from the set {1, 2, 3}, but some levels might be missing (null).
What I've got so far is an aggregation by player_id with an attached array of results:
talents as (
  select
    p.player_id,
    array_agg(struct(p.level, p.talent_id)) as talents
  from source.player_talent p
  group by player_id
),
player_id | talents
1 | [{1, a}, {2, b}, {3, c}]
2 | [{1, d}, {2, e}]
Now I need to map this array to a struct with the fixed property names _1, _2, _3.

This returns the expected results:
WITH Players AS (
  SELECT 1 AS player_id, 1 AS level, 'a' AS talent_id UNION ALL
  SELECT 1, 2, 'b' UNION ALL
  SELECT 1, 3, 'c' UNION ALL
  SELECT 2, 1, 'd' UNION ALL
  SELECT 2, 2, 'e'
)
SELECT
  player_id,
  STRUCT(
    MAX(IF(level = 1, talent_id, NULL)) AS _1,
    MAX(IF(level = 2, talent_id, NULL)) AS _2,
    MAX(IF(level = 3, talent_id, NULL)) AS _3) AS data
FROM Players
GROUP BY player_id
The technique here is known as pivoting (converting rows to columns).
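If you'd rather keep the intermediate talents array from your own attempt, a correlated subquery per level also works. A sketch building on the question's talents CTE; it assumes at most one talent per level per player, since a scalar subquery errors on multiple rows:
SELECT
  player_id,
  STRUCT(
    (SELECT t.talent_id FROM UNNEST(talents) t WHERE t.level = 1) AS _1,
    (SELECT t.talent_id FROM UNNEST(talents) t WHERE t.level = 2) AS _2,
    (SELECT t.talent_id FROM UNNEST(talents) t WHERE t.level = 3) AS _3) AS data
FROM talents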

Related

Postgres C extended data type definition

Postgres is a bit tricky when dealing with more complex structures. I want to set up a two-dimensional array of structs, but I don't know how to do this with Postgres C. Does anyone have any ideas?
Table:
id | contents (text) | num (double[])
1 | I love you. | {1,3,4,5,6,7,8,10}
2 | why do it? | {3,4,2,11,12,33,44,15}
3 | stopping. | {22,33,11,15,14,22,11,55}
4 | try it again. | {15,12,11,22,55,21,31,11}
Sort the rows at each position of the array to get the following structure. The first row of the result below corresponds to the first position of the num array, and so on. The count 4 refers to returning the first n rows of each sort.
select my_func(contents, num, 4) from table;
expected result:
result
{('stopping.', 22), ('try it again.', 15), ('why do it?', 3), ('I love you.', 1)}
{('stopping.', 33), ('try it again.', 12), ('why do it?', 4), ('I love you.', 3)}
{('stopping.', 11), ('try it again.', 11), ('I love you.', 4), ('why do it?', 2)}
......
......
Thanks in advance.
I'm not sure why you need a C extension data type, but the following will give you what you want, and it can be implemented as a PL/pgSQL function.
WITH t1 AS (
  SELECT id, contents, unnest(num) AS n FROM "table"
),
t2 AS (
  SELECT id, contents, n,
         row_number() OVER (PARTITION BY id ORDER BY id) AS o
  FROM t1 ORDER BY o ASC, n DESC, id ASC
),
t3 AS (
  SELECT array_agg(ROW(contents, n)) AS a, o
  FROM t2 GROUP BY o ORDER BY o
)
SELECT array_agg(a ORDER BY o) FROM t3;
UPDATE: A problem with the above may be the undefined order of unnest.
The following gives a consistent relation between index and num, but you need to write the size of the num array explicitly.
WITH RECURSIVE t1 (i, id, contents, num) AS (
  SELECT 1, id, contents, num[1] FROM "table"
  UNION ALL
  SELECT t1.i + 1, t.id, t.contents, t.num[t1.i + 1]
  FROM t1, "table" t
  WHERE t1.id = t.id AND t1.i < 8 -- put here the size of the array
),
t2 (i, d) AS (
  SELECT i, array_agg(ROW(contents, num) ORDER BY num DESC)
  FROM t1 GROUP BY i
)
SELECT array_agg(d ORDER BY i) FROM t2;
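On Postgres 9.4 and later there is a third option: unnest ... WITH ORDINALITY gives every element an explicit array index, which avoids both the undefined unnest order and the hard-coded array size. A sketch (the source table is quoted as "table" to match the question; the final aggregation mirrors the queries above):
WITH t1 AS (
  SELECT id, contents, u.n, u.o
  FROM "table", unnest(num) WITH ORDINALITY AS u(n, o)
),
t2 AS (
  SELECT o, array_agg(ROW(contents, n) ORDER BY n DESC) AS d
  FROM t1 GROUP BY o
)
SELECT array_agg(d ORDER BY o) FROM t2;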

Django filtering with list and specific data

I'm sorry for the weird title; I don't know how to explain my problem in a short sentence. I'm trying to filter my model with a list, but sometimes the query returns multiple rows. For example:
all_pos = [1, 2, 3]
query = MyModel.objects.filter(pos__in=all_pos)
The query above returns a list of rows from the database, but the second item in the list matches two rows, with B and C in the second column.
1, A, word
2, B, word
2, C, word
3, A, word
4, C, word
But I only want the row with B for the second value, without losing the 4th row with C. How can I filter this further to achieve the result below?
1, A, word
2, B, word
3, A, word
4, C, word
Your specifications are still not very clear. You need to specify by what logic you keep one result over another.
What I deduce is that you want the results ordered by col1 ascending and col2 alphabetically, and distinct on col1.
all_pos = [1, 2, 3, 4]
query = MyModel.objects.filter(pos__in=all_pos).order_by('col1', 'col2').distinct('col1')
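For reference, distinct() with field names only works on PostgreSQL, where that queryset translates to roughly the SQL below; the table and column names are placeholders for whatever your model defines:
SELECT DISTINCT ON (col1) id, col1, col2, word
FROM myapp_mymodel
WHERE pos IN (1, 2, 3, 4)
ORDER BY col1, col2;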
Otherwise you will need two passes, maybe like:
all_pos = [1, 2, 3, 4]
final_results = []
qs = MyModel.objects.filter(pos__in=all_pos)
for row in qs:
    if custom_logic_with(row.col2):
        final_results.append(row)

Google BigQuery - Execute dynamically generated queries from a select statement

I have a huge table in Google BigQuery (more than 100 million rows) with the following structure:
name | departments
abc | 1,2,5,6
xyz | 4,5
pqr | 3,4,6
I want to convert the data into the following format:
name | 1 | 2 | 3 | 4 | 5 | 6
abc | 1 | 1 | | | 1 | 1
xyz | | | | 1 | 1 |
pqr | | | 1 | 1 | | 1
As of now, I am able to generate the queries required to prepare the dataset in this format by using the CONCAT and REGEXP_REPLACE functions:
SELECT ' insert into dataset.output ( name, ' +
CONCAT(
'_' , replace(departments,',',',_') )
+ ' ) values( \'' + name +'\','+ REGEXP_REPLACE(departments, "([^,\n]+)", "1") +')'
FROM (
select name, departments from dataset.input )
This generates 100M insert queries, which can be used to create the data in the required structure.
However, I now have the following questions:
Can we execute the output of this query (100M insert statements) directly using BigQuery SQL, or would we need to fire each insert one by one?
I believe there is no way of pivoting or transposing the data in a column with multiple comma-separated values. Is that right?
Is there a more optimal way of achieving this using BigQuery SQL, without writing custom Java code?
Thanks.
Below is an example for BigQuery Standard SQL:
#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'abc' name, '1,2,5,6' departments UNION ALL
  SELECT 'xyz', '4,5' UNION ALL
  SELECT 'pqr', '3,4,6'
)
SELECT
  name,
  IF(departments LIKE '%1%', 1, 0) AS d1,
  IF(departments LIKE '%2%', 1, 0) AS d2,
  IF(departments LIKE '%3%', 1, 0) AS d3,
  IF(departments LIKE '%4%', 1, 0) AS d4,
  IF(departments LIKE '%5%', 1, 0) AS d5,
  IF(departments LIKE '%6%', 1, 0) AS d6
FROM `project.dataset.table`
with the result as
Row | name | d1 | d2 | d3 | d4 | d5 | d6
1 | abc | 1 | 1 | 0 | 0 | 1 | 1
2 | xyz | 0 | 0 | 0 | 1 | 1 | 0
3 | pqr | 0 | 0 | 1 | 1 | 0 | 1
So you need to run the above with the destination set to whatever new table you have prepared.
Note, the above assumes you have just 6 departments and, most importantly, that there is no ambiguity between the numbers, e.g. 1 does not conflict with 10.
If you do have such a case, you need to transform lines like
IF(departments LIKE '%2%', 1, 0) AS d2,
into
IF(CONCAT(',', departments, ',') LIKE '%,2,%', 1, 0) AS d2 ...
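An alternative guard against the same 1-vs-10 ambiguity (my variant, not from the original answer) is to split the string and test exact membership, so no delimiter padding is needed:
#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'abc' name, '1,2,5,6' departments
)
SELECT
  name,
  IF('1'  IN UNNEST(SPLIT(departments)), 1, 0) AS d1,
  IF('10' IN UNNEST(SPLIT(departments)), 1, 0) AS d10  -- '1' no longer false-matches '10'
FROM `project.dataset.table`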
And of course, you can use just one simple INSERT statement:
INSERT `project.dataset.new_table` (name, d1, d2, d3, d4, d5, d6)
SELECT
  name,
  IF(departments LIKE '%1%', 1, 0) AS d1,
  IF(departments LIKE '%2%', 1, 0) AS d2,
  IF(departments LIKE '%3%', 1, 0) AS d3,
  IF(departments LIKE '%4%', 1, 0) AS d4,
  IF(departments LIKE '%5%', 1, 0) AS d5,
  IF(departments LIKE '%6%', 1, 0) AS d6
FROM `project.dataset.table`
So, the final point of all this is: instead of generating an INSERT statement for each and every row in the original table, you should generate a simple SELECT statement that does the "pivoting".
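As for the first question: you no longer need to fire generated statements one by one, since BigQuery scripting can build the pivot SELECT and execute it in a single pass with EXECUTE IMMEDIATE. A sketch, assuming the same sample table; the generated column list and the new_table name are illustrative:
DECLARE cols STRING;
-- Build one IF(...) expression per distinct department found in the data.
SET cols = (
  SELECT STRING_AGG(
           FORMAT("IF(',' || departments || ',' LIKE '%%,%d,%%', 1, 0) AS d%d", d, d),
           ', ' ORDER BY d)
  FROM (
    SELECT DISTINCT CAST(dep AS INT64) AS d
    FROM `project.dataset.table`, UNNEST(SPLIT(departments)) AS dep
  )
);
-- Execute the generated pivot SELECT in one statement.
EXECUTE IMMEDIATE FORMAT("""
  CREATE OR REPLACE TABLE `project.dataset.new_table` AS
  SELECT name, %s FROM `project.dataset.table`
""", cols);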
Update for "extremely" minimized generated code.
See an example:
#standardSQL
CREATE TEMP FUNCTION c(departments STRING, department INT64) AS (
  IF(departments LIKE CONCAT('%', CAST(department AS STRING), '%'), 1, 0)
);
WITH `project.dataset.table` AS (
  SELECT 'abc' name, '1,2,5,6' departments UNION ALL
  SELECT 'xyz', '4,5' UNION ALL
  SELECT 'pqr', '3,4,6'
), temp AS (
  SELECT name, departments AS d
  FROM `project.dataset.table`
)
SELECT
  name,
  c(d,1)d1,
  c(d,2)d2,
  c(d,3)d3,
  c(d,4)d4,
  c(d,5)d5,
  c(d,6)d6
FROM temp
As you can see, each of your 10,000 lines will now look like c(d,N)dN, the longest being c(d,10000)d10000, so you have a chance of fitting into the query size limit.
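And if you ever need to generate those 10,000 compact lines rather than type them, the generator itself can be a query (a sketch):
#standardSQL
SELECT STRING_AGG(FORMAT('c(d,%d)d%d', n, n), ',\n' ORDER BY n)
FROM UNNEST(GENERATE_ARRAY(1, 10000)) AS n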

PL/SQL: split one row into many rows

I have a table like this.
|PARAMKEY | PARAMVALUE
----------+------------
KEY |[["PAR_A",2,"SCH_A"],["PAR_B",4,"SCH_B"],["PAR_C",3,"SCH_C"]]
I need to split the values into three columns, and I use REGEXP_SUBSTR. Here is my code.
SELECT REGEXP_SUBSTR(paramvalue, '[^],["]+', 1,1 ) PARAMETER
,REGEXP_SUBSTR(paramvalue, '[^],[",]+', 1, 2) VERSION
,REGEXP_SUBSTR(paramvalue, '[^],["]+', 1, 3) SCHEMA
FROM tmp_param_table
where paramkey = 'KEY'
UNION ALL
SELECT REGEXP_SUBSTR(paramvalue, '[^],["]+', 1, 4 ) PARAMETER
,REGEXP_SUBSTR(paramvalue, '[^],[",]+', 1, 5) VERSION
,REGEXP_SUBSTR(paramvalue, '[^],["]+', 1, 6) SCHEMA
FROM tmp_param_table
where paramkey = 'KEY'
UNION ALL
SELECT REGEXP_SUBSTR(paramvalue, '[^],["]+', 1, 7 ) PARAMETER
,REGEXP_SUBSTR(paramvalue, '[^],[",]+', 1, 8) VERSION
,REGEXP_SUBSTR(paramvalue, '[^],["]+', 1, 9) SCHEMA
FROM tmp_param_table
where paramkey = 'KEY';
And this is the result that I need:
PARAMETER | VERSION | SCHEMA
---------+---------+-------
PAR_A |2 |SCH_A
PAR_B |4 |SCH_B
PAR_C |3 |SCH_C
But the value is too long, and I hope there is another way to make it simpler by using a loop or anything similar.
Thanks
Try something like this:
with tmp_param_table as
(
  select 'KEY' as PARAMKEY,
         '[["PAR_A",2,"SCH_A"],["PAR_B",4,"SCH_B"],["PAR_C",3,"SCH_C"],["PAR_D",4,"SCH_D"]]' as PARAMVALUE
  from dual
),
levels as (select level as lv from dual connect by level <= 156),
steps as (select lv - 2 as step from levels where MOD(lv, 3) = 0)
select step,
       (SELECT REGEXP_SUBSTR(paramvalue, '[^],["]+', 1, step)     FROM tmp_param_table where paramkey = 'KEY') parameter,
       (SELECT REGEXP_SUBSTR(paramvalue, '[^],["]+', 1, step + 1) FROM tmp_param_table where paramkey = 'KEY') version,
       (SELECT REGEXP_SUBSTR(paramvalue, '[^],["]+', 1, step + 2) FROM tmp_param_table where paramkey = 'KEY') schema
from steps
Here:
levels returns the numbers from 1 to 156 (52*3) (or whatever you need)
steps are the numbers 1, 4, 7, etc., increasing in steps of 3
Results:
1 PAR_A 2 SCH_A
4 PAR_B 4 SCH_B
7 PAR_C 3 SCH_C
10 PAR_D 4 SCH_D
13
etc.
I have tried using a regular expression to parse the paramvalue column value into comma-separated values:
SELECT
  REGEXP_SUBSTR(COL, '[^],["]+', 1, 1) PARAMETER,
  REGEXP_SUBSTR(COL, '[^],[",]+', 1, 2) VERSION,
  REGEXP_SUBSTR(COL, '[^],["]+', 1, 3) SCHEMA
FROM
(
  SELECT paramkey, REGEXP_SUBSTR(to_char(paramvalue), '[^][^]+', 1, level) COL
  FROM tmp_param_table
  CONNECT BY REGEXP_SUBSTR(to_char(paramvalue), '[^][^]+', 1, level) IS NOT NULL
)
WHERE COL <> ','
I hope this may help.
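One more option: PARAMVALUE is valid JSON, so on Oracle 12c (12.1.0.2 and later) JSON_TABLE can do the split without regular expressions. A sketch; the column types and sizes are guesses:
SELECT jt.parameter, jt.version, jt.schema_name
FROM tmp_param_table t,
     JSON_TABLE(t.paramvalue, '$[*]'
       COLUMNS (parameter   VARCHAR2(30) PATH '$[0]',
                version     NUMBER       PATH '$[1]',
                schema_name VARCHAR2(30) PATH '$[2]')) jt
WHERE t.paramkey = 'KEY';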

Sorting a number of lists according to indexes and priority

I have a collection of lists, each containing around 6 to 7 values, like:
list1 = 2,4,7,4,9,5
list2 = 4,3,7.3,9,8,1.2
list3 = 2,2.4,7,9,8,5
list4 = 9,1.6,4,3,4,1
list5 = 2,5,7,9,1,4
list6 = 6,8,7,2,1,5
list7 = 4,2,5,2,1,3
Now I want to sort these with index1 as the primary key, index3 as secondary, index2 as tertiary, and so on. That is, the output should be:
2,2.4,7,9,8,5
2,4,7,4,9,5
2,5,7,9,1,4
4,2,5,2,1,3
6,8,7,2,1,5
9,1.6,4,3,4,1
I want the lists sorted by index1 first; if the values for index1 are the same, sorting is done on index3, and if those are also the same, on index2. The number of lists is small here but can increase to 20, and the number of indexes can grow up to 20 as well.
The algorithm I want is the same as iTunes song sorting, where songs are sorted by album, then by artist, then by rank, then by name: if the album names are the same, sorting is done on the artist; if those are the same too, by rank, and so on. The code can be in C/C++/Tcl/shell.
sort -n -t ',' -k1,1 -k3,3 -k2,2
Feed the lists as individual lines into it.
To do this in Tcl, assuming there aren't huge amounts of data (a few MB wouldn't be “huge”), the easiest way would be:
# Read the values in from stdin, break into lists of lists
foreach line [split [read stdin] "\n"] {
    lappend records [split $line ","]
}
# Sort once per key, least significant key first (lsort is _stable_):
# tertiary = index 1, secondary = index 2, primary = index 0
set records [lsort -index 1 -real $records]
set records [lsort -index 2 -real $records]
set records [lsort -index 0 -real $records]
# Write the values back out to stdout
foreach record $records {
    puts [join $record ","]
}
If you're using anything more complex than simple numbers, consider using the csv package in Tcllib for parsing and formatting, as it will deal with many syntactic issues that crop up in Real Data. If you're dealing with a lot of data (where “lot” depends on how much memory you deploy with) then consider using a more stream-oriented method for handling the data (and there are a few other optimizations in the memory handling) and you might also want to use the -command option to lsort to supply a custom comparator so you can sort only once; the performance hit of a custom comparator is quite high, alas, but for many records the reduced number of comparisons will win out. Or shove the data into a database like SQLite or Postgres.
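Picking up that last suggestion: once the rows are loaded into SQLite or Postgres, the multi-key sort is a single ORDER BY. A sketch, assuming a hypothetical table with one column per position:
-- hypothetical table: one row per list, one column per position
CREATE TABLE lists (c1 REAL, c2 REAL, c3 REAL, c4 REAL, c5 REAL, c6 REAL);
-- primary key index1 (c1), secondary index3 (c3), tertiary index2 (c2)
SELECT * FROM lists ORDER BY c1, c3, c2;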
You can use STL's sort, and then all you have to do is to write a comparison function that does what you want (the example in the link should be good enough).
Since you asked for a Tcl solution:
set lol {
    {2 4 7 4 9 5}
    {4 3 7.3 9 8 1.2}
    {2 2.4 7 9 8 5}
    {9 1.6 4 3 4 1}
    {2 5 7 9 1 4}
    {6 8 7 2 1 5}
    {4 2 5 2 1 3}
}
set ::EPS 10e-6
proc compareLists {ixo e1 e2} {
    foreach ix $ixo {
        set d [expr {[lindex $e1 $ix] - [lindex $e2 $ix]}]
        if {abs($d) > $::EPS} {
            return [expr {($d>0)-($d<0)}]
        }
    }
    return 0
}
foreach li [lsort -command [list compareLists {0 2 1}] $lol] {
    puts $li
}
Hope that helps.
Here is a C++ solution:
#include <iostream>
#include <vector>
#include <algorithm>

template <typename Array, typename CompareOrderIndex>
struct arrayCompare
{
private:
    size_t size;
    CompareOrderIndex index;

public:
    arrayCompare(CompareOrderIndex idx) : size(idx.size()), index(idx) {}

    // Compare a1 and a2 on the remaining `num` keys, most significant first.
    bool helper(const Array &a1, const Array &a2, unsigned int num) const
    {
        size_t i = index[size - num];
        if (a1[i] > a2[i])
        {
            return false;
        }
        if (a1[i] < a2[i])
        {
            return true;
        }
        // Equal on this key: fall through to the next key, if any.
        if (1 != num)
        {
            return helper(a1, a2, num - 1);
        }
        // Equal on all keys: must return false for a strict weak ordering.
        return false;
    }

    bool operator()(const Array &a1, const Array &a2) const
    {
        return helper(a1, a2, size);
    }
};

int main()
{
    std::vector<std::vector<float>> lists = { { 2, 4, 7, 4, 9, 5 },
                                              { 4, 3, 7.3, 9, 8, 1.2 },
                                              { 2, 2.4, 7, 9, 8, 5 },
                                              { 4, 2, 5, 2, 1, 3 },
                                              { 9, 1.6, 4, 3, 4, 1 },
                                              { 2, 5, 7, 9, 1, 4 },
                                              { 6, 8, 7, 2, 1, 5 },
                                              { 4, 2, 5, 2, 1, 1 },
                                            };
    //
    // Specify the column indexes to compare and the order to compare.
    // In this case it will first compare column 1, then 3, and finally 2.
    //
    //std::vector<int> indexOrder = { 0, 2, 1, 3, 4, 5 };
    std::vector<int> indexOrder = { 0, 2, 1 };

    std::sort(lists.begin(), lists.end(),
              arrayCompare<std::vector<float>, std::vector<int>>(indexOrder));

    for (auto p : lists)
    {
        // Echo the key columns first (in indexOrder order), then the rest in place.
        for (unsigned int i = 0; i < p.size(); ++i)
        {
            unsigned int idx = (i > (indexOrder.size() - 1) ? i : indexOrder[i]);
            std::cout << p[idx] << ", ";
        }
        std::cout << std::endl;
    }
}