BigQuery Limit Rows Scanned by Merge DML - google-cloud-platform

Given the DML statement below, is there a way to limit the number of rows scanned in the target table? For example, let's say we have a field shard_id that the table is partitioned by. I know beforehand that all updates should happen within some range of shard_id. Is there a way to specify a WHERE clause for the target to limit the number of rows that need to be scanned, so the update does not have to do a full table scan to look for an id?
MERGE dataset.table_target target
USING dataset.table_source source
ON target.id = "123"
WHEN MATCHED THEN
UPDATE SET some_value = source.some_value
WHEN NOT MATCHED BY SOURCE AND id = "123" THEN
DELETE

The ON condition is where you need to write your WHERE-style clause:
ON target.id = "123" AND DATE(target.shard_id) BETWEEN date1 AND date2

For your case, it's incorrect to do the partition pruning in the ON condition. Instead, you should do it in the WHEN clause.
There is an example for exactly this scenario at https://cloud.google.com/bigquery/docs/using-dml-with-partitioned-tables#pruning_partitions_when_using_a_merge_statement.
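For example, here is a sketch of the statement from the question with the pruning filter moved into the WHEN clauses (shard_id is assumed to be the partitioning column, and shard_id_low / shard_id_high are placeholders for your known range):
MERGE dataset.table_target target
USING dataset.table_source source
ON target.id = "123"
WHEN MATCHED AND target.shard_id BETWEEN shard_id_low AND shard_id_high THEN
  UPDATE SET some_value = source.some_value
WHEN NOT MATCHED BY SOURCE AND id = "123" AND shard_id BETWEEN shard_id_low AND shard_id_high THEN
  DELETE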
Basically, the ON condition is used as the matching condition to join the target and source tables in a MERGE; it is not a filter on the target. The following two queries show the difference between a join condition and a WHERE clause:
Query 1:
with
t1 as (
select '2018-01-01' pt, 10 v1 union all
select '2018-01-01', 20 union all
select '2000-01-01', 10),
t2 as (select 10 v2)
select * from t1 left outer join t2 on v1=v2 and pt = '2018-01-01'
Result:
pt v1 v2
2018-01-01 10 10
2018-01-01 20 NULL
2000-01-01 10 NULL
Query 2:
with
t1 as (
select '2018-01-01' pt, 10 v1 union all
select '2018-01-01', 20 union all
select '2000-01-01', 10),
t2 as (select 10 v2)
select * from t1 left outer join t2 on v1=v2 where pt = '2018-01-01'
Result:
pt v1 v2
2018-01-01 10 10
2018-01-01 20 NULL

Related

Informatica Cloud Data Integration - find non matching rows

I am working on Informatica Cloud Data Integration. I have 2 tables - Tab1 and Tab2. The joining column is id. I want to find all records in Tab1 that do not exist in Tab2. What transformations can I use to achieve this?
Tab1
id name
1 n1
2 n2
3 n3
Tab2
id
1
5
6
I want to get the records with id 2 and 3 from Tab1, as they do not exist in Tab2.
You can use a SQL override in the database Source Qualifier:
Select * from Tab1 where id not in (select id from Tab2)
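If id can be NULL in Tab2, NOT IN returns no rows at all; a NOT EXISTS form of the same override (a sketch, using the Tab1/Tab2 names from the question) avoids that:
select t1.id, t1.name
from Tab1 t1
where not exists (
    select 1
    from Tab2 t2
    where t2.id = t1.id
)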
Or you can do it in Informatica like below.
Do a lookup on Tab2, with the join condition on id.
In an Expression transformation, create a flag:
out_flag = iif(isnull(:lkp(id)), 'pass', 'fail')
Put a Filter transformation next and keep the condition as out_flag = 'pass'.
The whole mapping should be like this:
        LKP
         |
SQ --> EXP --> FIL --> TGT

How to make a measure that counts summarized values that are over a certain number

That title was kinda hard to phrase, but I'll try to explain myself a bit better.
So I have a few tables with a few relationships between them:
Table 1 (T1 in the picture) has a column that holds an ID. The ID can be in a format such as: id1, id2, id3 or id1 or id1;id2;id3.
T4 is the table with the column that I want to relate to the ID column from the T1 table. For that purpose, I have created a distinct version of the ID column in the T2 table from T1 (it removes duplicate values).
T1 is related to T2 with a many-to-one relationship. From T2 I have created T3 as a duplicate where the ID column is split into rows by the delimiters , and ; and related to T2 with a one-to-many relationship. This creates a table where the ID values are separated but still related to T1 via the relationship to T2.
Finally, T3's split ID column is related to the T4 ID column via a many-to-one relationship.
Now the real question: how can I count how many IDs in T4 have more than 5 related rows in the T1 table?
I have placed the ID from T4 and the count of T1's ID on a table visual, which shows the rows, but I don't really know how to count the ones that surpass that requirement.
The result I want is something like:
IDs with more than 5 related rows : 350
IDs with related rows : 474
First off, I'm not 100% sure I understood your ID format, but this might be worth a try.
Create a calculated column as:
countT1Rows =
VAR currentId = T4[id]
RETURN
    COUNTROWS ( FILTER ( T1, T1[id] = currentId ) )
For every row in T4, the formula counts the number of rows in T1 that have an identical ID.
Then you can use a slicer to filter only the rows with 5 or more related rows, or you could wrap it in an IF clause.

Redshift. Convert comma-delimited values into rows

I am wondering how to convert comma-delimited values into rows in Redshift. I am afraid that my own solution isn't optimal. Please advise. I have a table where one of the columns contains comma-separated values. For example:
I have:
user_id|user_name|user_action
-----------------------------
1 | Shone | start,stop,cancell...
I would like to see
user_id|user_name|parsed_action
-------------------------------
1 | Shone | start
1 | Shone | stop
1 | Shone | cancell
....
A slight improvement over the existing answer is to use a second "numbers" table that enumerates all of the possible list lengths and then use a cross join to make the query more compact.
Redshift does not have a straightforward method for creating a numbers table that I am aware of, but we can use a bit of a hack from https://www.periscope.io/blog/generate-series-in-redshift-and-mysql.html to create one using row numbers.
Specifically, if we assume the number of rows in cmd_logs is larger than the maximum number of commas in the user_action column, we can create a numbers table by counting rows. To start, let's assume there are at most 99 commas in the user_action column:
select
(row_number() over (order by true))::int as n
into numbers
from cmd_logs
limit 100;
If we want to get fancy, we can compute the number of commas from the cmd_logs table to create a more precise set of rows in numbers:
select
n::int
into numbers
from
(select
row_number() over (order by true) as n
from cmd_logs)
cross join
(select
max(regexp_count(user_action, '[,]')) as max_num
from cmd_logs)
where
n <= max_num + 1;
Once there is a numbers table, we can do:
select
user_id,
user_name,
split_part(user_action,',',n) as parsed_action
from
cmd_logs
cross join
numbers
where
split_part(user_action,',',n) is not null
and split_part(user_action,',',n) != '';
Another idea is to transform your CSV string into JSON first, followed by a JSON extract, along the following lines:
... '["' || replace( user_action, ',', '", "' ) || '"]' AS replaced
... JSON_EXTRACT_ARRAY_ELEMENT_TEXT(replaced, numbers.n - 1) AS parsed_action
Where "numbers" is the table from the first answer. The advantage of this approach is the ability to use built-in JSON functionality.
If you know that there are not many actions in your user_action column, you can use recursive sub-querying with UNION ALL and thereby avoid the auxiliary numbers table.
But it requires you to know the number of actions for each user, so either adjust the initial table or make a view or a temporary table for it.
Data preparation
Assuming you have something like this as a table:
create temporary table actions
(
user_id varchar,
user_name varchar,
user_action varchar
);
I'll insert some values in it:
insert into actions
values (1, 'Shone', 'start,stop,cancel'),
(2, 'Gregory', 'find,diagnose,taunt'),
(3, 'Robot', 'kill,destroy');
Here's an additional table with the temporary counts:
create temporary table actions_with_counts
(
id varchar,
name varchar,
num_actions integer,
actions varchar
);
insert into actions_with_counts (
select user_id,
user_name,
regexp_count(user_action, ',') + 1 as num_actions,
user_action
from actions
);
This would be our "input table" and it looks just as you expected
select * from actions_with_counts;
id | name    | num_actions | actions
2  | Gregory | 3           | find,diagnose,taunt
3  | Robot   | 2           | kill,destroy
1  | Shone   | 3           | start,stop,cancel
Again, you can adjust the initial table and thereby skip adding the counts as a separate table.
Sub-query to flatten the actions
Here's the unnesting query:
with recursive tmp (user_id, user_name, idx, user_action) as
(
select id,
name,
1 as idx,
split_part(actions, ',', 1) as user_action
from actions_with_counts
union all
select user_id,
user_name,
idx + 1 as idx,
split_part(actions, ',', idx + 1)
from actions_with_counts
join tmp on actions_with_counts.id = tmp.user_id
where idx < num_actions
)
select user_id, user_name, user_action as parsed_action
from tmp
order by user_id;
This will create a new row for each action, and the output would look like this:
user_id | user_name | parsed_action
1       | Shone     | start
1       | Shone     | stop
1       | Shone     | cancel
2       | Gregory   | find
2       | Gregory   | diagnose
2       | Gregory   | taunt
3       | Robot     | kill
3       | Robot     | destroy
Here are two ways to achieve this.
In my example, I'm assuming that I am accepting a comma separated list of values. My values look like schema.table.column.
The first involves using a recursive CTE.
drop table if exists #dep_tbl;
create table #dep_tbl as
select 'schema.foobar.insert_ts,schema.baz.load_ts' as dep
;
with recursive tmp (level, dep_split, to_split) as
(
select 1 as level
, split_part(dep, ',', 1) as dep_split
, regexp_count(dep, ',') as to_split
from #dep_tbl
union all
select tmp.level + 1 as level
, split_part(a.dep, ',', tmp.level + 1) as dep_split_u
, tmp.to_split
from #dep_tbl a
inner join tmp on tmp.dep_split is not null
and tmp.level <= tmp.to_split
)
select dep_split from tmp;
the above yields:
|dep_split|
|schema.foobar.insert_ts|
|schema.baz.load_ts|
The second involves a stored procedure.
CREATE OR REPLACE PROCEDURE so_test(dependencies_csv varchar(max))
LANGUAGE plpgsql
AS $$
DECLARE
dependencies_csv_vals varchar(max);
BEGIN
drop table if exists #dep_holder;
create table #dep_holder
(
avoid varchar(60000)
);
IF dependencies_csv is not null THEN
dependencies_csv_vals:='('||replace(quote_literal(regexp_replace(dependencies_csv,'\\s','')),',', '\'),(\'') ||')';
execute 'insert into #dep_holder values '||dependencies_csv_vals||';';
END IF;
END;
$$
;
call so_test('schema.foobar.insert_ts,schema.baz.load_ts');
select
*
from
#dep_holder;
The above yields:
|avoid|
|schema.foobar.insert_ts|
|schema.baz.load_ts|
In conclusion:
If you only care about one single column in your input (the X delimited values), then I think the stored procedure is easier/faster.
However, if you have other columns you care about and want to keep those columns along with your comma-separated-value column now transformed to rows, or if you want to know the argument (the original list of delimited values), I think the recursive CTE is the way to go. In that case, you can just add those other columns to the columns selected in the recursive query.
You can get the expected result with the following query. I'm using UNION ALL to convert a column to rows.
select user_id, user_name, split_part(user_action,',',1) as parsed_action from cmd_logs
union all
select user_id, user_name, split_part(user_action,',',2) as parsed_action from cmd_logs
union all
select user_id, user_name, split_part(user_action,',',3) as parsed_action from cmd_logs
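If a row has fewer actions than the number of SELECT branches, split_part returns an empty string for the missing positions; a sketch of a variant that filters those out:
select user_id, user_name, parsed_action
from (
    select user_id, user_name, split_part(user_action, ',', 1) as parsed_action from cmd_logs
    union all
    select user_id, user_name, split_part(user_action, ',', 2) from cmd_logs
    union all
    select user_id, user_name, split_part(user_action, ',', 3) from cmd_logs
) t
where parsed_action != '';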
Here's my equally terrible answer.
I have a users table, and then an events table with a column that is just a comma-delimited string of the users at said event, e.g.
event_id | user_ids
1 | 5,18,25,99,105
In this case, I used LIKE and wildcards to build a new table that represents each event-user edge.
SELECT e.event_id, u.id as user_id
FROM events e
LEFT JOIN users u ON e.user_ids like '%' || u.id || '%'
It's not pretty, but I throw it in a WITH clause so that I don't have to run it more than once per query. I'll likely just build an ETL to create that table every night anyway.
Also, this only works if you have a second table that has one row per unique possibility. If not, you could do LISTAGG to get a single cell with all your values, export that to a CSV, and re-upload it as a table to help.
Like I said: a terrible, no-good solution.
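One way to reduce the false substring matches this allows (user id 5 would also match 25 or 105, for example) is to pad both sides with the delimiter before comparing; a sketch of that variant, with an explicit cast in case id is numeric:
SELECT e.event_id, u.id AS user_id
FROM events e
LEFT JOIN users u ON ',' || e.user_ids || ',' LIKE '%,' || u.id::varchar || ',%'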
Late to the party, but I got something working (albeit very slowly).
with nums as (select n::int n
from
(select
row_number() over (order by true) as n
from table_with_enough_rows_to_cover_range)
cross join
(select
max(json_array_length(json_column)) as max_num
from table_with_json_column )
where
n <= max_num + 1)
select *, json_extract_array_element_text(json_column,nums.n-1) parsed_json
from nums, table_with_json_column
where json_extract_array_element_text(json_column,nums.n-1) != ''
and nums.n <= json_array_length(json_column)
Thanks to the answer by Bob Baxley for the inspiration.
Just an improvement on the answer above: https://stackoverflow.com/a/31998832/1265306
Generate the numbers table using the following SQL, taken from
https://discourse.looker.com/t/generating-a-numbers-table-in-mysql-and-redshift/482
SELECT
p0.n
+ p1.n*2
+ p2.n * POWER(2,2)
+ p3.n * POWER(2,3)
+ p4.n * POWER(2,4)
+ p5.n * POWER(2,5)
+ p6.n * POWER(2,6)
+ p7.n * POWER(2,7)
as number
INTO numbers
FROM
(SELECT 0 as n UNION SELECT 1) p0,
(SELECT 0 as n UNION SELECT 1) p1,
(SELECT 0 as n UNION SELECT 1) p2,
(SELECT 0 as n UNION SELECT 1) p3,
(SELECT 0 as n UNION SELECT 1) p4,
(SELECT 0 as n UNION SELECT 1) p5,
(SELECT 0 as n UNION SELECT 1) p6,
(SELECT 0 as n UNION SELECT 1) p7
ORDER BY 1
LIMIT 100
"ORDER BY" is there only in case you want paste it without the INTO clause and see the results
Create a stored procedure that will parse the string dynamically and populate a temp table; then select from the temp table.
Here is the code:
CREATE OR REPLACE PROCEDURE public.sp_string_split( "string" character varying )
AS $$
DECLARE
cnt INTEGER := 1;
no_of_parts INTEGER := (select REGEXP_COUNT ( string , ',' ));
sql VARCHAR(MAX) := '';
item character varying := '';
BEGIN
-- Create table
sql := 'CREATE TEMPORARY TABLE IF NOT EXISTS split_table (part VARCHAR(255)) ';
RAISE NOTICE 'executing sql %', sql ;
EXECUTE sql;
<<simple_loop_exit_continue>>
LOOP
item = (select split_part("string",',',cnt));
RAISE NOTICE 'item %', item ;
sql := 'INSERT INTO split_table SELECT '''||item||''' ';
EXECUTE sql;
cnt = cnt + 1;
EXIT simple_loop_exit_continue WHEN (cnt >= no_of_parts + 2);
END LOOP;
END ;
$$ LANGUAGE plpgsql;
Usage example:
call public.sp_string_split('john,smith,jones');
select *
from split_table
You can try the COPY command to copy your file into Redshift tables:
copy table_name from 's3://mybucket/myfolder/my.csv' CREDENTIALS 'aws_access_key_id=my_aws_acc_key;aws_secret_access_key=my_aws_sec_key' delimiter ','
You can use the delimiter ',' option.
For more details on COPY command options, you can visit this page:
http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html

SQLite C++ Compare two tables within the same database for matching records

I want to be able to compare two tables within the same SQLite database, using a C++ interface, for matching records. Here are my two tables:
Table name : temptrigrams
ID TEMPTRIGRAM
---------- ----------
1 The cat ran
2 Compare two tables
3 Alex went home
4 Mark sat down
5 this database blows
6 data with a
7 table disco ninja
++78
Table Name: spamtrigrams
ID TRIGRAM
---------- ----------
1 Sam's nice ham
2 Tuesday was cold
3 Alex stood up
4 Mark passed out
5 this database is
6 date with a
7 disco stew pot
++10000
The first table has two columns and 85 records and the second table has two columns with 10007 records.
I would like to take the first table, compare the records in the TEMPTRIGRAM column against the TRIGRAM column in the second table, and return the number of matches across the tables. So if ID 1, 'The cat ran', appears in spamtrigrams, I would like that counted and included in the total returned at the end as an integer.
Could somebody please explain the syntax for the query to perform this action?
Thank you.
This is a join query with an aggregation. My guess is that you want the number of matches per trigram:
select t1.temptrigram, count(t2.trigram)
from table1 t1 left outer join
table2 t2
on t1.temptrigram = t2.trigram
group by t1.temptrigram;
If you just want the number of matches:
select count(t2.trigram)
from table1 t1 join
table2 t2
on t1.temptrigram = t2.trigram;
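Applied to the table names in the question, and assuming a match means exact string equality between the two columns, the total count would be:
select count(*)
from temptrigrams t1
join spamtrigrams t2 on t1.temptrigram = t2.trigram;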

Compare Tables in BigQuery

How would I compare two tables (Table1 and Table2) and find all the new entries or changes in Table2?
Using SQL Server I can use:
Select * from Table1
Except
Select * from Table2
Here is a sample of what I want:
Table1
A | 1
B | 2
C | 3
Table2
A | 1
B | 2
C | 2
D | 4
So, if I comparing the two tables I want my results to show me the following
C | 2
D | 4
I tried a few statements with no luck.
Now that I have your actual sample dataset, I can write a query that finds every domain in one table that is not on the other table:
https://bigquery.cloud.google.com/table/inbound-acolyte-377:demo.1024 has 24,729,816 rows. https://bigquery.cloud.google.com/table/inbound-acolyte-377:demo.1025 has 24,732,640 rows.
Let's look at everything in 1025 that is not in 1024:
SELECT a.domain
FROM [inbound-acolyte-377:demo.1025] a
LEFT OUTER JOIN EACH [inbound-acolyte-377:demo.1024] b
ON a.domain = b.domain
WHERE b.domain IS NULL
Result: 39,629 rows.
(8.1s elapsed, 2.04 GB processed)
To get the differences (given that tkey is your unique row identifier):
SELECT a.tkey, a.name, b.name
FROM [your.tableold] a
JOIN EACH [your.tablenew] b
ON a.tkey = b.tkey
WHERE a.name != b.name
LIMIT 100
For the new rows, one way is the one you proposed:
SELECT col1, col2
FROM table2
WHERE col1 NOT IN
(SELECT col1 FROM Table1)
(you'll have to switch to a JOIN EACH when Table1 gets too large)
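Putting the two pieces together for the sample data in the question (with hypothetical column names col1 for the key and col2 for the value), a sketch of one query that returns both the changed and the new rows, in the same legacy-SQL JOIN EACH style as above:
SELECT b.col1, b.col2
FROM [your.table2] b
LEFT OUTER JOIN EACH [your.table1] a
ON a.col1 = b.col1
WHERE a.col1 IS NULL     -- new rows, e.g. D | 4
   OR a.col2 != b.col2   -- changed rows, e.g. C | 2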