Google BigQuery - Execute dynamically generated queries from a select statement - google-cloud-platform

I have a huge table in Google BigQuery (> 100 million rows) with the following structure:
name | departments
abc | 1,2,5,6
xyz | 4,5
pqr | 3,4,6
I want to convert the data into the following format:
name | 1 | 2 | 3 | 4 | 5 | 6
abc | 1 | 1 | | | 1 | 1
xyz | | | | 1 | 1 |
pqr | | | 1 | 1 | | 1
So far, I am able to generate the queries required to prepare the dataset in this format by using the CONCAT and REGEXP_REPLACE functions:
SELECT ' insert into dataset.output ( name, ' +
CONCAT(
'_' , replace(departments,',',',_') )
+ ' ) values( \'' + name +'\','+ REGEXP_REPLACE(departments, "([^,\n]+)", "1") +')'
FROM (
select name, departments from dataset.input )
This generates 100 M INSERT statements, which can be used to create the data in the required structure.
However, I have the following questions:
Can we execute the output of this query (100 M INSERT statements) directly using BigQuery SQL, or would we need to fire each insert one by one?
I believe there is no way of pivoting or transposing the data in a column with multiple comma-separated values. Is that right?
Is there a more optimal way of achieving this using BigQuery SQL, without writing custom Java code?
Thanks.

Below is an example for BigQuery Standard SQL:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 'abc' name, '1,2,5,6' departments UNION ALL
SELECT 'xyz', '4,5' UNION ALL
SELECT 'pqr', '3,4,6'
)
SELECT
name,
IF(departments LIKE '%1%', 1, 0) AS d1,
IF(departments LIKE '%2%', 1, 0) AS d2,
IF(departments LIKE '%3%', 1, 0) AS d3,
IF(departments LIKE '%4%', 1, 0) AS d4,
IF(departments LIKE '%5%', 1, 0) AS d5,
IF(departments LIKE '%6%', 1, 0) AS d6
FROM `project.dataset.table`
with the result:
Row name d1 d2 d3 d4 d5 d6
1 abc 1 1 0 0 1 1
2 xyz 0 0 0 1 1 0
3 pqr 0 0 1 1 0 1
So you need to run the above with the destination set to whatever new table you prepared.
Note that the above assumes you have just 6 departments and, most importantly, that there is no ambiguity in the numbers; for example, 1 must not conflict with 10.
If you do have such a case, you need to transform lines like the one below
IF(departments LIKE '%2%', 1, 0) AS d2,
into
IF(CONCAT(',', departments, ',') LIKE '%,2,%', 1, 0) AS d2 ...
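The ambiguity described above can be checked outside BigQuery. The following Python sketch is illustrative only: `has_department_naive` and `has_department` are hypothetical helper names, not part of the answer. They mirror the plain `LIKE '%N%'` test and the comma-wrapped `LIKE '%,N,%'` test respectively.

```python
def has_department_naive(departments: str, department: int) -> bool:
    # Mirrors: departments LIKE '%N%' (a plain substring test)
    return str(department) in departments


def has_department(departments: str, department: int) -> bool:
    # Mirrors: CONCAT(',', departments, ',') LIKE '%,N,%'
    return f",{department}," in f",{departments},"


if __name__ == "__main__":
    # The substring test finds "1" inside "10", the wrapped test does not.
    print(has_department_naive("10,12", 1))
    print(has_department("10,12", 1))
    print(has_department("1,2,5,6", 1))
```

Wrapping both sides in the delimiter means a department number only matches when it is a complete list element.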
And of course, you can use just one simple INSERT statement
INSERT `project.dataset.new_table` (name, d1, d2, d3, d4, d5, d6)
SELECT
name,
IF(departments LIKE '%1%', 1, 0) AS d1,
IF(departments LIKE '%2%', 1, 0) AS d2,
IF(departments LIKE '%3%', 1, 0) AS d3,
IF(departments LIKE '%4%', 1, 0) AS d4,
IF(departments LIKE '%5%', 1, 0) AS d5,
IF(departments LIKE '%6%', 1, 0) AS d6
FROM `project.dataset.table`
So, the final point of all this is:
instead of generating an INSERT statement for each and every row in the original table, you should generate one simple SELECT statement that does the "pivoting".
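Generating one pivoting SELECT, rather than one INSERT per row, could be scripted roughly as follows. This is a minimal sketch in Python, chosen here only for illustration. The function name `build_pivot_select` is hypothetical, and the sketch assumes the list of department numbers is known in advance. It emits the comma-wrapped LIKE form so that 1 cannot match 10.

```python
from typing import Iterable


def build_pivot_select(table: str, departments: Iterable[int]) -> str:
    """Return one SELECT with a 0/1 column per department number."""
    columns = [
        f"  IF(CONCAT(',', departments, ',') LIKE '%,{n},%', 1, 0) AS d{n}"
        for n in departments
    ]
    select_list = ",\n".join(["  name"] + columns)
    return f"SELECT\n{select_list}\nFROM `{table}`"


if __name__ == "__main__":
    # Table name taken from the example above; department list is an assumption.
    print(build_pivot_select("project.dataset.table", [1, 2, 3, 4, 5, 6]))
```

The generated text grows with the number of departments, not with the number of rows in the table.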
Update: "extreme" minimization of the generated code
See an example:
#standardSQL
CREATE TEMP FUNCTION c(departments STRING, department INT64) AS (
IF(departments LIKE CONCAT('%',CAST(department AS STRING),'%'), 1, 0)
);
WITH `project.dataset.table` AS (
SELECT 'abc' name, '1,2,5,6' departments UNION ALL
SELECT 'xyz', '4,5' UNION ALL
SELECT 'pqr', '3,4,6'
), temp AS (
SELECT name, departments AS d
FROM `project.dataset.table`
)
SELECT
name,
c(d,1)d1,
c(d,2)d2,
c(d,3)d3,
c(d,4)d4,
c(d,5)d5,
c(d,6)d6
FROM temp
As you can see, each of your 10000 lines will now look like c(d,N)dN, the longest being c(d,10000)d10000, so you have a chance of fitting within the query size limit.

Related

Extract Numbers from String - Custom

I'd like to extract "Most" numbers from a string and Add "JW" at the end.
My values look like:
RFID_DP_IDS339020JW3_IDMsg --> 339020JW
RFID_DP_IDSA72130JW_IDMsg --> 72130JW
RFID_DP_IDS337310JW1_IDMsg --> 337310JW
Basically I would remove all first letters, keep all numbers and JW
For now I had this
regexp_replace(Business_CONTEXT, '[^0-9]', '')||'JW' RegistrationPoint
But that would include the numbers AFTER 'JW'
Any idea?
How about this?
result returns a run of digits followed by exactly two letters.
result2 returns a run of digits followed by JW.
Pick the one you find most appropriate.
with test (col) as
(select 'RFID_DP_IDS339020JW3_IDMsg' from dual union all
select 'RFID_DP_IDSA72130JW_IDMsg' from dual union all
select 'RFID_DP_IDS337310JW1_IDMsg' from dual
)
select col,
regexp_substr(col, '\d+[[:alpha:]]{2}') result,
regexp_substr(col, '\d+JW') result2
from test;
COL                        RESULT   RESULT2
-------------------------- -------- --------
RFID_DP_IDS339020JW3_IDMsg 339020JW 339020JW
RFID_DP_IDSA72130JW_IDMsg  72130JW  72130JW
RFID_DP_IDS337310JW1_IDMsg 337310JW 337310JW
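For readers who want to check the pattern outside Oracle, here is a rough Python equivalent of the `'\d+JW'` variant. The helper name `registration_point` is made up for this sketch. Python's `re.search` returns the first match, much as `regexp_substr` does with default arguments.

```python
import re
from typing import Optional


def registration_point(value: str) -> Optional[str]:
    # First run of digits immediately followed by the literal "JW"
    match = re.search(r"\d+JW", value)
    return match.group(0) if match else None


if __name__ == "__main__":
    for sample in (
        "RFID_DP_IDS339020JW3_IDMsg",
        "RFID_DP_IDSA72130JW_IDMsg",
        "RFID_DP_IDS337310JW1_IDMsg",
    ):
        print(sample, registration_point(sample))
```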
If you really want to extract the longest digit string out of your given strings you can use the following:
WITH test (Business_CONTEXT) AS
(SELECT 'RFID_DP_IDS339020JW3_I9DMsg' from dual union all
SELECT 'RFID_DP_IDSA72130JW_IDMsg' from dual union all
SELECT 'RFID_DP_IDS337310JW1_IDMsg' from dual
)
SELECT Business_CONTEXT
, (SELECT MAX(regexp_substr(Business_CONTEXT, '\d+', 1, LEVEL))
KEEP (dense_rank last ORDER BY LENGTH(regexp_substr(Business_CONTEXT, '\d+', 1, LEVEL)))
FROM dual
CONNECT BY regexp_substr(Business_CONTEXT, '\d+', 1, LEVEL) IS NOT NULL) num
FROM test
Result:
Business_CONTEXT | NUM
----------------------------+-----
RFID_DP_IDS339020JW3_I9DMsg | 339020
RFID_DP_IDSA72130JW_IDMsg | 72130
RFID_DP_IDS337310JW1_IDMsg | 337310
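The "longest digit string" idea can be sketched in Python as follows (`longest_digit_run` is a hypothetical name). One difference to be aware of: when two runs have the same length, this sketch returns the first one, whereas the Oracle query above takes the MAX of the tied values.

```python
import re
from typing import Optional


def longest_digit_run(value: str) -> Optional[str]:
    runs = re.findall(r"\d+", value)
    # max() with key=len returns the first run of maximal length
    return max(runs, key=len) if runs else None


if __name__ == "__main__":
    print(longest_digit_run("RFID_DP_IDS339020JW3_I9DMsg"))
```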

extract all numbers in a string

How can I extract all numbers in a string?
Sample inputs:
7nr-6p
12c-18L
12nr-24L
11nr-12p
Expected Outputs:
{7,6}
{12,18}
{12,24}
etc...
The following is tested with the first one, 7nr-6p:
select regexp_split_to_array('7nr-6p', '[^0-9]') AS new_volume from mytable;
Gives: {7,"","",6,""} // Why is a numeric-only match returning spaces?
select regexp_matches('7nr-6p', '[0-9]*'::text) from mytable;
Gives: {7} // Why isn't this continuing?
select regexp_matches('7nr-6p', '\d'::text) from mytable;
Gives: {7}
select NULLIF(regexp_replace('7nr-6p', '\D',',','g'), '')::text from mytable;
Gives: 7,,,6,
The following query:
select regexp_split_to_array(regexp_replace('7nr-6p', '^[^0-9]*|[^0-9]*$', '', 'g'), '[^0-9]+')
AS new_volume from mytable;
"trims" the non-numeric prefix and suffix and then splits on the remaining runs of non-digits.
select regexp_matches('7nr-6p', '[0-9]*'::text) from mytable;
Gives: {7} // Why isn't this continuing?
Because without the 'g' flag, the regex stops at the first match.
Add the 'g' flag, and use + instead of * so that the pattern cannot match an empty string:
select regexp_matches('7nr-6p', '[0-9]+'::text, 'g') from mytable;
You can replace all text and then split:
SELECT regexp_split_to_array(
regexp_replace('7nr-6p', '[a-zA-Z]', '','g'),
'[^0-9]'
)
This returns {7,6}
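The empty strings in the question's first attempt come from splitting on every single non-digit character. Python's `re.split` behaves the same way on this input, so the following sketch (Python used purely for illustration) shows the effect alongside two ways of avoiding it.

```python
import re

sample = "7nr-6p"

# Splitting on each single non-digit leaves an empty string between
# adjacent delimiters and after a trailing one.
single = re.split(r"[^0-9]", sample)

# Splitting on runs of non-digits removes the inner empties, but the
# trailing delimiter still leaves one empty string at the end.
runs = re.split(r"[^0-9]+", sample)

# Matching the digit runs directly avoids empty strings altogether.
numbers = [int(n) for n in re.findall(r"[0-9]+", sample)]

print(single)   # ['7', '', '', '6', '']
print(runs)     # ['7', '6', '']
print(numbers)  # [7, 6]
```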
SELECT id, (regexp_matches(string, '\d+', 'g'))[1]::int AS nr
FROM (
VALUES
(1, '7nr-6p')
, (2, '12c-18L')
, (3, '12nr-24L')
, (4, '11nr-12p')
) tbl(id, string);
Result:
id | nr
----+----
1 | 7
1 | 6
2 | 12
2 | 18
3 | 12
3 | 24
4 | 11
4 | 12
I wanted them in a single cell so I could extract them as needed
SELECT id, trim(regexp_replace(string, '\D+', ',', 'g'), ',') AS nrs
FROM (
VALUES
(1, '7nr-6p')
, (2, '12c-18L')
, (3, '12nr-24L')
, (4, '11nr-12p')
) tbl(id, string);
Result:
id | nrs
----+-------
1 | 7,6
2 | 12,18
3 | 12,24
4 | 11,12
Here is a more robust solution
CREATE OR REPLACE FUNCTION get_ints_from_text(TEXT) RETURNS int[] AS $$
select array_remove(regexp_split_to_array($1,'[^0-9]+','i'),'')::int[];
$$ LANGUAGE SQL IMMUTABLE;
Example
select get_ints_from_text('7nr-6p'); -- 7,6
-- also resilient in situations like
select get_ints_from_text('-7nr--6p'); -- 7,6
Here is a link to try
http://sqlfiddle.com/#!17/c6ac7/2
I feel that wrapping this functionality into an immutable function is prudent. This is a pure function: it does not mutate data, and it returns the same result given the same input. Functions marked IMMUTABLE can have performance benefits.
By using a function we also benefit from abstraction. There is one source to update should this functionality need to improve in the future.
For more information about immutable functions see
https://www.postgresql.org/docs/10/static/sql-createfunction.html

PL/SQL split one to many rows

I have a table like this.
|PARAMKEY | PARAMVALUE
----------+------------
KEY |[["PAR_A",2,"SCH_A"],["PAR_B",4,"SCH_B"],["PAR_C",3,"SCH_C"]]
I need to split the values into three columns and I use REGEXP_SUBSTR. Here is my code.
SELECT REGEXP_SUBSTR(paramvalue, '[^],["]+', 1,1 ) PARAMETER
,REGEXP_SUBSTR(paramvalue, '[^],[",]+', 1, 2) VERSION
,REGEXP_SUBSTR(paramvalue, '[^],["]+', 1, 3) SCHEMA
FROM tmp_param_table
where paramkey = 'KEY'
UNION ALL
SELECT REGEXP_SUBSTR(paramvalue, '[^],["]+', 1, 4 ) PARAMETER
,REGEXP_SUBSTR(paramvalue, '[^],[",]+', 1, 5) VERSION
,REGEXP_SUBSTR(paramvalue, '[^],["]+', 1, 6) SCHEMA
FROM tmp_param_table
where paramkey = 'KEY'
UNION ALL
SELECT REGEXP_SUBSTR(paramvalue, '[^],["]+', 1, 7 ) PARAMETER
,REGEXP_SUBSTR(paramvalue, '[^],[",]+', 1, 8) VERSION
,REGEXP_SUBSTR(paramvalue, '[^],["]+', 1, 9) SCHEMA
FROM tmp_param_table
where paramkey = 'KEY';
and this is the result that i need.
PARAMETER | VERSION | SCHEMA
---------+---------+-------
PAR_A |2 |SCH_A
PAR_B |4 |SCH_B
PAR_C |3 |SCH_C
But the value is too long, and I hope there is another way to make it simpler by using a loop or something similar.
Thanks
Try something like this:
with tmp_param_table as
(
select 'KEY' as PARAMKEY , '[["PAR_A",2,"SCH_A"],["PAR_B",4,"SCH_B"],["PAR_C",3,"SCH_C"]],["PAR_D",4,"SCH_D"]]' as PARAMVALUE from dual
),
levels as (select level as lv from dual connect by level <= 156),
steps as (select lv-2 as step from levels where MOD(lv,3)=0)
select step, (SELECT REGEXP_SUBSTR(paramvalue, '[^],["]+',1, step ) PARAMETER FROM tmp_param_table where paramkey = 'KEY') parameter,
(SELECT REGEXP_SUBSTR(paramvalue, '[^],["]+',1, step+1 ) PARAMETER FROM tmp_param_table where paramkey = 'KEY') version,
(SELECT REGEXP_SUBSTR(paramvalue, '[^],["]+',1, step+2 ) PARAMETER FROM tmp_param_table where paramkey = 'KEY') schema
from steps
Here:
levels - returns the numbers from 1 to 156 (52*3) (or whatever you need)
steps - are the numbers 1, 4, 7 etc., in steps of 3
Results:
1 PAR_A 2 SCH_A
4 PAR_B 4 SCH_B
7 PAR_C 3 SCH_C
10 PAR_D 4 SCH_D
13
etc..
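The "every third token starts a new row" idea behind levels and steps can be sketched in Python as follows. This is illustrative only: `split_params` is a hypothetical helper, and Python slices are 0-based, so the row starts are 0, 3, 6 instead of 1, 4, 7.

```python
import re
from typing import List, Tuple


def split_params(paramvalue: str) -> List[Tuple[str, ...]]:
    # Same character class as the REGEXP_SUBSTR pattern: runs of
    # characters that are not ']', ',', '[' or '"'
    tokens = re.findall(r'[^\],\["]+', paramvalue)
    # Group the tokens three at a time, ignoring an incomplete trailing group
    return [tuple(tokens[i:i + 3]) for i in range(0, len(tokens) - 2, 3)]


if __name__ == "__main__":
    value = '[["PAR_A",2,"SCH_A"],["PAR_B",4,"SCH_B"],["PAR_C",3,"SCH_C"]]'
    for parameter, version, schema in split_params(value):
        print(parameter, version, schema)
```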
I tried using a regular expression
to split the paramvalue column value into comma-separated values:
SELECT
REGEXP_SUBSTR(COL, '[^],["]+', 1, 1) PARAMETER,
REGEXP_SUBSTR(COL, '[^],[",]+', 1, 2) VERSION,
REGEXP_SUBSTR(COL, '[^],["]+', 1, 3) SCHEMA
FROM
(
SELECT paramkey,REGEXP_SUBSTR(to_char(paramvalue),'[^][^]+',1,level ) COL
from tmp_param_table
connect by regexp_substr(to_char(paramvalue),'[^][^]+',1, level) is not null
)
WHERE COL <>','
I hope this may help.

DB2 IF and LENGTH usage

I have this DB2 table
A | B | C
aaaa |123 |
bbbb |1 |
cccc |123456 |
All columns are varchars. I would like to have the column C filled up with the contents of B concatenated with the contents of A.
BUT the max length of C is 8. So if the concatenated string exceeds 8 characters, I would like to have only the first 5 characters + "...".
Basically:
if (length(A) + length(B) > maximum(C)) {
//display only the first (maximum(C) - 3) characters, then add "..."
} else {
// display B + A
}
How can I do this in DB2?
One good option would be to define column C as generated column so you do not have to handle anything.
create table t3 (A varchar(10),
B varchar(10),
C varchar(8) generated always as (case when length(concat(A, B)) > 8 then substr(concat(A,B),1,5) || '...' else concat(A, B) end)
);
insert into t3 (A,B) values ('This', ' is a test');
insert into t3 (A,B) values ('ABCD', 'EFGH');
select * from t3;
will return
A     B           C
----  ----------  --------
This   is a test  This ...
ABCD  EFGH        ABCDEFGH
Alternatives could be triggers, procedures, explicit code etc.
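For the "explicit code" route, the same rule as the CASE expression can be written as a small function. The Python sketch below is only an illustration of the logic. `fill_c` and its `max_len` parameter are hypothetical names, and it concatenates A then B, as the generated column above does.

```python
def fill_c(a: str, b: str, max_len: int = 8) -> str:
    combined = a + b
    if len(combined) > max_len:
        # Keep the first (max_len - 3) characters and append "..."
        return combined[: max_len - 3] + "..."
    return combined


if __name__ == "__main__":
    print(fill_c("This", " is a test"))  # This ...
    print(fill_c("ABCD", "EFGH"))        # ABCDEFGH
```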

using Oracle REGEXP_INSTR to find exact word

I want to return the following position from the strings using REGEXP_INSTR.
I am looking for the word car with exact match in the following strings.
car,care,oscar - 1
care,car,oscar - 6
oscar,care,car - 12
something like
SELECT REGEXP_INSTR('car,care,oscar', 'car', 1, 1) "REGEXP_INSTR" FROM DUAL;
I am not sure what kind of escape operators to use.
A simpler solution is to surround the source string and search string with commas and find the position using INSTR.
SELECT INSTR(',' || 'car,care,oscar' || ',', ',car,') "INSTR" FROM DUAL;
Example:
with x(y) as (
SELECT 'car,care,oscar' from dual union all
SELECT 'care,car,oscar' from dual union all
SELECT 'oscar,care,car' from dual union all
SELECT 'car' from dual union all
SELECT 'cart,care,oscar' from dual
)
select y, ',' || y || ',' , instr(',' || y || ',',',car,')
from x
| Y | ','||Y||',' | INSTR(','||Y||',',',CAR,') |
|-----------------|-------------------|----------------------------|
| car,care,oscar | ,car,care,oscar, | 1 |
| care,car,oscar | ,care,car,oscar, | 6 |
| oscar,care,car | ,oscar,care,car, | 12 |
| car | ,car, | 1 |
| cart,care,oscar | ,cart,care,oscar, | 0 |
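The comma-wrapping trick is not specific to Oracle. A rough Python equivalent follows (`find_exact` is a hypothetical name). `str.find` is 0-based and returns -1 when nothing is found, so adding 1 gives the same 1-based positions, and 0 for "not found", as INSTR.

```python
def find_exact(items: str, word: str, sep: str = ",") -> int:
    # Wrap both the list and the search word in the separator, so that
    # only complete list elements can match.
    return f"{sep}{items}{sep}".find(f"{sep}{word}{sep}") + 1


if __name__ == "__main__":
    for y in ("car,care,oscar", "care,car,oscar", "oscar,care,car",
              "car", "cart,care,oscar"):
        print(y, find_exact(y, "car"))
```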
The following query handles all scenarios. It returns the starting position if the string begins with car, or the whole string is just car. It returns the starting position + 1 if ,car, is found or if the string ends with ,car to account for the comma.
SELECT
CASE
WHEN REGEXP_LIKE('car,care,oscar', '^car,|^car$') THEN REGEXP_INSTR('car,care,oscar', '^car,|^car$', 1, 1)
WHEN REGEXP_LIKE('car,care,oscar', ',car,|,car$') THEN REGEXP_INSTR('car,care,oscar', ',car,|,car$', 1, 1)+1
ELSE 0
END "REGEXP_INSTR"
FROM DUAL;
I like Noel's answer, as it performs very well! Another approach is to create separate rows from a delimiter-separated string:
pm.nodes = 'a;b;c;d;e;f;g'
(select regexp_substr(pm.nodes,'[^;]+', 1, level)
from dual
connect by regexp_substr(pm.nodes, '[^;]+', 1, level) is not null)