Redshift: convert comma-delimited values into rows

I am wondering how to convert comma-delimited values into rows in Redshift. I am afraid that my own solution isn't optimal. Please advise. I have a table in which one of the columns contains comma-separated values. For example:
I have:
user_id|user_name|user_action
-----------------------------
1 | Shone | start,stop,cancell...
I would like to see
user_id|user_name|parsed_action
-------------------------------
1 | Shone | start
1 | Shone | stop
1 | Shone | cancell
....

A slight improvement over the existing answer is to use a second "numbers" table that enumerates all of the possible list lengths and then use a cross join to make the query more compact.
Redshift does not have a straightforward method for creating a numbers table that I am aware of, but we can use a bit of a hack from https://www.periscope.io/blog/generate-series-in-redshift-and-mysql.html to create one using row numbers.
Specifically, if we assume the number of rows in cmd_logs is larger than the maximum number of commas in the user_action column, we can create a numbers table by counting rows. To start, let's assume there are at most 99 commas in the user_action column:
select
(row_number() over (order by true))::int as n
into numbers
from cmd_logs
limit 100;
If we want to get fancy, we can compute the number of commas from the cmd_logs table to create a more precise set of rows in numbers:
select n::int
into numbers
from
    (select row_number() over (order by true) as n
     from cmd_logs)
cross join
    (select max(regexp_count(user_action, '[,]')) as max_num
     from cmd_logs)
where n <= max_num + 1;
Once there is a numbers table, we can do:
select
    user_id,
    user_name,
    split_part(user_action, ',', n) as parsed_action
from cmd_logs
cross join numbers
where split_part(user_action, ',', n) is not null
  and split_part(user_action, ',', n) != '';

Another idea is to transform your CSV string into JSON first, followed by JSON extract, along the following lines:
... '["' || replace( user_action, '.', '", "' ) || '"]' AS replaced
... JSON_EXTRACT_ARRAY_ELEMENT_TEXT(replaced, numbers.i) AS parsed_action
Where "numbers" is the table from the first answer. The advantage of this approach is the ability to use built-in JSON functionality.

If you know that there are not many actions in your user_action column, you can use recursive sub-querying with union all and thereby avoid the auxiliary numbers table.
But it requires you to know the number of actions for each user, so either adjust the initial table or make a view or a temporary table for it.
Data preparation
Assuming you have something like this as a table:
create temporary table actions
(
user_id varchar,
user_name varchar,
user_action varchar
);
I'll insert some values into it:
insert into actions
values (1, 'Shone', 'start,stop,cancel'),
(2, 'Gregory', 'find,diagnose,taunt'),
(3, 'Robot', 'kill,destroy');
Here's an additional table with the temporary counts:
create temporary table actions_with_counts
(
id varchar,
name varchar,
num_actions integer,
actions varchar
);
insert into actions_with_counts (
select user_id,
user_name,
regexp_count(user_action, ',') + 1 as num_actions,
user_action
from actions
);
This will be our "input table", and it looks just as you expected:
select * from actions_with_counts;
id | name    | num_actions | actions
------------------------------------
2  | Gregory | 3           | find,diagnose,taunt
3  | Robot   | 2           | kill,destroy
1  | Shone   | 3           | start,stop,cancel
Again, you can adjust the initial table instead and skip adding the counts as a separate table.
Sub-query to flatten the actions
Here's the unnesting query:
with recursive tmp (user_id, user_name, idx, user_action) as
(
    select id,
           name,
           1 as idx,
           split_part(actions, ',', 1) as user_action
    from actions_with_counts
    union all
    select user_id,
           user_name,
           idx + 1 as idx,
           split_part(actions, ',', idx + 1)
    from actions_with_counts
    join tmp on actions_with_counts.id = tmp.user_id
    where idx < num_actions
)
select user_id, user_name, user_action as parsed_action
from tmp
order by user_id;
This will create a new row for each action, and the output would look like this:
user_id | user_name | parsed_action
-----------------------------------
1       | Shone     | start
1       | Shone     | stop
1       | Shone     | cancel
2       | Gregory   | find
2       | Gregory   | diagnose
2       | Gregory   | taunt
3       | Robot     | kill
3       | Robot     | destroy

Here are two ways to achieve this.
In my example, I'm assuming that I am accepting a comma-separated list of values. My values look like schema.table.column.
The first involves using a recursive CTE.
drop table if exists #dep_tbl;
create table #dep_tbl as
select 'schema.foobar.insert_ts,schema.baz.load_ts' as dep;
with recursive tmp (level, dep_split, to_split) as
(
    select 1 as level
         , split_part(dep, ',', 1) as dep_split
         , regexp_count(dep, ',') as to_split
    from #dep_tbl
    union all
    select tmp.level + 1 as level
         , split_part(a.dep, ',', tmp.level + 1) as dep_split
         , tmp.to_split
    from #dep_tbl a
    inner join tmp
        on tmp.dep_split is not null
       and tmp.level <= tmp.to_split
)
select dep_split from tmp;
the above yields:
|dep_split|
|schema.foobar.insert_ts|
|schema.baz.load_ts|
The second involves a stored procedure.
CREATE OR REPLACE PROCEDURE so_test(dependencies_csv varchar(max))
LANGUAGE plpgsql
AS $$
DECLARE
    dependencies_csv_vals varchar(max);
BEGIN
    drop table if exists #dep_holder;
    create table #dep_holder
    (
        avoid varchar(60000)
    );
    IF dependencies_csv is not null THEN
        dependencies_csv_vals := '(' || replace(quote_literal(regexp_replace(dependencies_csv, '\\s', '')), ',', '\'),(\'') || ')';
        execute 'insert into #dep_holder values ' || dependencies_csv_vals || ';';
    END IF;
END;
$$
;
call so_test('schema.foobar.insert_ts,schema.baz.load_ts');
select
*
from
#dep_holder;
the above yields:
|avoid|
|schema.foobar.insert_ts|
|schema.baz.load_ts|
In conclusion:
If you only care about a single column in your input (the X-delimited values), then I think the stored procedure is easier/faster.
However, if you have other columns you care about and want to keep those columns along with your comma-separated-value column now transformed to rows, OR if you want to know the argument (the original list of delimited values), I think the recursive query is the way to go. In that case, you can just add those other columns to the columns selected in the recursive query.

You can get the expected result with the following query. I'm using UNION ALL to convert a column to rows.
select user_id, user_name, split_part(user_action,',',1) as parsed_action from cmd_logs
union all
select user_id, user_name, split_part(user_action,',',2) as parsed_action from cmd_logs
union all
select user_id, user_name, split_part(user_action,',',3) as parsed_action from cmd_logs
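Note that split_part returns an empty string once the index exceeds the number of parts, so users with fewer than three actions will produce blank rows. A minimal sketch of a filtered variant:
select *
from (
    select user_id, user_name, split_part(user_action, ',', 1) as parsed_action from cmd_logs
    union all
    select user_id, user_name, split_part(user_action, ',', 2) from cmd_logs
    union all
    select user_id, user_name, split_part(user_action, ',', 3) from cmd_logs
) t
where parsed_action != '';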

Here's my equally-terrible answer.
I have a users table, and then an events table with a column that is just a comma-delimited string of users at said event, e.g.:
event_id | user_ids
1 | 5,18,25,99,105
In this case, I used the LIKE and wildcard functions to build a new table that represents each event-user edge.
SELECT e.event_id, u.id as user_id
FROM events e
LEFT JOIN users u ON e.user_ids like '%' || u.id || '%'
It's not pretty, but I throw it in a WITH clause so that I don't have to run it more than once per query. I'll likely just build an ETL to create that table every night anyway.
Also, this only works if you have a second table that has one row per unique possibility. If not, you could do LISTAGG to get a single cell with all your values, export that to a CSV, and re-upload it as a table to help.
Like I said: a terrible, no-good solution.
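One refinement worth noting: a bare LIKE will false-match substrings (user 5 would also match an event containing user 105). Wrapping both sides in delimiters avoids that; a sketch against the same tables:
SELECT e.event_id, u.id AS user_id
FROM events e
LEFT JOIN users u
    ON ',' || e.user_ids || ',' LIKE '%,' || u.id::varchar || ',%';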

Late to the party, but I got something working (albeit very slowly).
with nums as (
    select n::int as n
    from
        (select row_number() over (order by true) as n
         from table_with_enough_rows_to_cover_range)
    cross join
        (select max(json_array_length(json_column)) as max_num
         from table_with_json_column)
    where n <= max_num + 1
)
select *, json_extract_array_element_text(json_column, nums.n - 1) as parsed_json
from nums, table_with_json_column
where json_extract_array_element_text(json_column, nums.n - 1) != ''
  and nums.n <= json_array_length(json_column);
Thanks to Bob Baxley's answer for the inspiration.

Just an improvement on the answer above (https://stackoverflow.com/a/31998832/1265306): generate the numbers table using the following SQL, taken from
https://discourse.looker.com/t/generating-a-numbers-table-in-mysql-and-redshift/482
SELECT
p0.n
+ p1.n*2
+ p2.n * POWER(2,2)
+ p3.n * POWER(2,3)
+ p4.n * POWER(2,4)
+ p5.n * POWER(2,5)
+ p6.n * POWER(2,6)
+ p7.n * POWER(2,7)
as number
INTO numbers
FROM
(SELECT 0 as n UNION SELECT 1) p0,
(SELECT 0 as n UNION SELECT 1) p1,
(SELECT 0 as n UNION SELECT 1) p2,
(SELECT 0 as n UNION SELECT 1) p3,
(SELECT 0 as n UNION SELECT 1) p4,
(SELECT 0 as n UNION SELECT 1) p5,
(SELECT 0 as n UNION SELECT 1) p6,
(SELECT 0 as n UNION SELECT 1) p7
ORDER BY 1
LIMIT 100
"ORDER BY" is there only in case you want paste it without the INTO clause and see the results

Create a stored procedure that parses the string dynamically and populates a temp table, then select from the temp table.
Here is the magic code:
CREATE OR REPLACE PROCEDURE public.sp_string_split("string" character varying)
AS $$
DECLARE
    cnt INTEGER := 1;
    no_of_parts INTEGER := (select REGEXP_COUNT("string", ','));
    sql VARCHAR(MAX) := '';
    item character varying := '';
BEGIN
    -- Create the target temp table
    sql := 'CREATE TEMPORARY TABLE IF NOT EXISTS split_table (part VARCHAR(255))';
    RAISE NOTICE 'executing sql %', sql;
    EXECUTE sql;
    <<simple_loop_exit_continue>>
    LOOP
        item := (select split_part("string", ',', cnt));
        RAISE NOTICE 'item %', item;
        sql := 'INSERT INTO split_table SELECT ''' || item || '''';
        EXECUTE sql;
        cnt := cnt + 1;
        EXIT simple_loop_exit_continue WHEN (cnt >= no_of_parts + 2);
    END LOOP;
END;
$$ LANGUAGE plpgsql;
Usage example:
call public.sp_string_split('john,smith,jones');
select *
from split_table;

You can use the COPY command to load your file into a Redshift table:
copy table_name from 's3://mybucket/myfolder/my.csv' CREDENTIALS 'aws_access_key_id=my_aws_acc_key;aws_secret_access_key=my_aws_sec_key' delimiter ','
Note the delimiter ',' option. For more details on COPY command options, see
http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html
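If you'd rather not embed access keys, here is an equivalent sketch using an IAM role instead (the role ARN below is a hypothetical placeholder):
copy table_name
from 's3://mybucket/myfolder/my.csv'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole' -- hypothetical role ARN
delimiter ','
ignoreheader 1; -- only if the file has a header row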

Related

Does AWS Athena supports Order by in Array_AGG?

I'm working with AWS Athena to concatenate a few rows into a single row.
Example table (name: unload):
xid | pid | sequence | text
---------------------------
1   | 1   | 0        | select * from
1   | 1   | 1        | mytbl
1   | 1   | 2        |
2   | 1   | 0        | update test
2   | 1   | 1        | set mycol=
2   | 1   | 2        | 'a';
So I want to concat the text column.
Expected Output:
xid | pid | text
----------------
1   | 1   | select * from mytbl
2   | 1   | update test set mycol='a';
I ran the following query to partition the rows in the proper order first and then do the concat.
with cte as
(SELECT
xid,
pid,
sequence,
text,
row_number()
OVER (PARTITION BY xid,pid
ORDER BY sequence) AS rank
FROM unload
GROUP BY xid,pid,sequence,text
)
SELECT
xid,
pid,
array_join(array_agg(text),'') as text
FROM cte
GROUP BY xid,pid
But as you can see in the output below, the order got scrambled.
xid | pid | text
----------------
1   | 1   | mytblselect * from
2   | 1   | update test'a'; set mycol=
I checked the Presto documentation; the latest version supports ORDER BY in array_agg, but Athena uses Presto 0.172, so I'm not sure whether it is supported or not.
What is the workaround for this in Athena?
One approach:
- create records with a sortable format of text
- aggregate into an unsorted array
- sort the array
- transform each element back into the original value of text
- convert the sorted array to a string output column
WITH cte AS (
SELECT
xid, pid, text
-- create a sortable 19-digit ranking string
, SUBSTR(
LPAD(
CAST(
ROW_NUMBER() OVER (PARTITION BY xid, pid ORDER BY sequence)
AS VARCHAR)
, 19
, '0')
, -19) AS SEQ_STR
FROM unload
)
SELECT
xid, pid
-- make sortable string, aggregate into array
-- then sort array, revert each element to original text
-- finally combine array elements into one string
, ARRAY_JOIN(
TRANSFORM(
ARRAY_SORT(
ARRAY_AGG(SEQ_STR || text))
, combined -> SUBSTR(combined, 1 + 19))
, ' '
, '') AS TEXT
FROM cte
GROUP BY xid, pid
ORDER BY xid, pid
This code assumes:
- xid + pid + sequence is unique for all input records
- there are not many combinations of xid + pid + sequence (e.g., not more than 20 million)

How to Abort or Exit from Redshift Query with a conditional expression?

I'm trying to abort/exit a query based on a conditional expression using a CASE statement:
- If the table has 0 rows, the query should take the happy path.
- If the table has > 0 rows, the query should abort/exit.
drop table if exists #dups_tracker ;
create table #dups_tracker
(
column1 varchar(10)
);
insert into #dups_tracker values ('John'),('Smith'),('Jack') ;
with c1 as
(select
0 as denominator__v
,count(*) as dups_cnt__v
from #dups_tracker
)
select
case
when dups_cnt__v > 0 THEN 1/denominator__v
else
1
end Ind__v
from c1
;
Here is the Error Message :
Amazon Invalid operation: division by zero; 1 statement failed.
There is no concept of aborting an SQL query. It either compiles into a query or it doesn't. If it does compile, the query runs.
The closest option would be to write a Stored Procedure, which can include IF logic. So, it could first query the contents of a table and, based on the result, decide whether it will perform another query.
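For illustration, here is a minimal sketch of such a procedure (the procedure name is hypothetical; it assumes the #dups_tracker temp table from above exists in the session, and RAISE EXCEPTION aborts execution):
CREATE OR REPLACE PROCEDURE check_dups()
LANGUAGE plpgsql
AS $$
DECLARE
    dups_cnt__v INTEGER;
BEGIN
    SELECT count(*) INTO dups_cnt__v FROM #dups_tracker;
    IF dups_cnt__v > 0 THEN
        -- abort: raising an exception stops execution here
        RAISE EXCEPTION 'aborting: % duplicate rows found', dups_cnt__v;
    END IF;
    -- happy path: run the follow-up work here
END;
$$;
call check_dups();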
Here is the logic I was able to write to abort a SQL statement in the positive use case:
/* Dummy Table to Abort Dups Check process if Positive */
--Dups Table
drop table if exists #dups;
create table #dups
(
dups_col varchar(1)
);
insert into #dups values('A');
--Dummy Table
drop table if exists #dummy ;
create table #dummy
(
dups_check decimal(1,0)
)
;
--When Table is not empty and has Dups
insert into #dummy
select
count(*) * 10
from #dups
;
/*
[Amazon](500310) Invalid operation: Numeric data overflow (result precision)
Details:
-----------------------------------------------
error: Numeric data overflow (result precision)
code: 1058
context: 64 bit overflow
query: 3246717
location: numeric.hpp:158
process: padbmaster [pid=6716]
-----------------------------------------------;
1 statement failed.
*/
--When Table is empty and doesn't have dups
truncate #dups ;
insert into #dummy
select
count(*) * 10
from #dups
;
drop table if exists temp_table;
create temp table temp_table (field_1 bool);
insert into temp_table
select case
when false -- or true
then 1
else 1 / 0
end as field_1;
This should compile, and fail when the condition isn't met.
Not sure why it's different from your example, though...
Edit: the above doesn't work querying against a table. Leaving it here for posterity.

BigQuery Limit Rows Scanned by Merge DML

Given the DML statement below, is there a way to limit the number of rows scanned in the target table? For example, let's say we have a field shard_id that the table is partitioned on. I know beforehand that all updates fall within some range of shard_id. Is there a way to specify a WHERE clause for the target to limit the number of rows that need to be scanned, so the update does not have to do a full table scan to look for an id?
MERGE dataset.table_target target
USING dataset.table_source source
ON target.id = "123"
WHEN MATCHED THEN
UPDATE SET some_value = source.some_value
WHEN NOT MATCHED BY SOURCE AND id = "123" THEN
DELETE
The ON condition is the WHERE clause here; that's where you need to write your predicate:
ON target.id = "123" AND DATE(target.shard_id) BETWEEN date1 AND date2
For your case, it's incorrect to do the partition pruning in the ON condition. Instead, you should do it in the WHEN clause.
There is an example for exactly this scenario at https://cloud.google.com/bigquery/docs/using-dml-with-partitioned-tables#pruning_partitions_when_using_a_merge_statement.
Basically, the ON condition is used as the matching condition to join the target and source tables in MERGE. The following two queries show the difference between a join condition and a WHERE clause.
Query 1:
with
t1 as (
select '2018-01-01' pt, 10 v1 union all
select '2018-01-01', 20 union all
select '2000-01-01', 10),
t2 as (select 10 v2)
select * from t1 left outer join t2 on v1=v2 and pt = '2018-01-01'
Result:
pt         | v1 | v2
2018-01-01 | 10 | 10
2018-01-01 | 20 | NULL
2000-01-01 | 10 | NULL
Query 2:
with
t1 as (
select '2018-01-01' pt, 10 v1 union all
select '2018-01-01', 20 union all
select '2000-01-01', 10),
t2 as (select 10 v2)
select * from t1 left outer join t2 on v1=v2 where pt = '2018-01-01'
Result:
pt         | v1 | v2
2018-01-01 | 10 | 10
2018-01-01 | 20 | NULL
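Applying that to the original statement, here is a sketch with the pruning predicate moved into the WHEN clauses (the shard_id bounds are hypothetical placeholders):
MERGE dataset.table_target target
USING dataset.table_source source
ON target.id = "123"
WHEN MATCHED AND target.shard_id BETWEEN "shard_001" AND "shard_099" THEN
  UPDATE SET some_value = source.some_value
WHEN NOT MATCHED BY SOURCE AND id = "123" AND shard_id BETWEEN "shard_001" AND "shard_099" THEN
  DELETE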

Proc sql - Group by aggregate function from subquery in main query

I have two data sets containing millions of rows. Table1 contains two different ID numbers, ID1 and ID2. It also contains a variable (y1) explaining which group a certain ID belongs to.
The second table (Table2) contains two variables from the first table and an additional one.
I want to join the two tables together, but before the join I want table1 to contain only information grouped by ID1, including which group each ID belongs to.
I could do this in two PROC SQL steps, first creating a table from table1 grouped by ID1 and then merging it onto table2 in a second step. However, this is rather inefficient as my tables contain so many rows, and I would therefore like to do it in one run. Hence I have instead created a subquery that does what I want. My problem is that I get an error saying I can't group by the variable "WhichGroup" from my subquery, as it stems from an aggregate function. Is there a good workaround for what I want to achieve?
Many thanks in advance!
Example code:
data table1;
input ID1 $ ID2 $ x1 2. y1 $;
datalines;
1 p1 10 Group1
1 p2 20 Group2
2 p3 50 Group1
;
run;
data table2;
input ID1 $ x1 x2;
datalines;
1 10 500
1 20 600
2 50 700
;
run;
Proc sql;
Create table Test
as select
t1.WhichGroup
,sum(t1.Sum_x1) as Sum_x1
,sum(t2.x2) as Sum_x2
from (select
a.ID1
,case when max(case when a.y1 = 'Group1' then 1 else 0 end) = 0 then 'Group2'
when max(case when a.y1 = 'Group2' then 1 else 0 end) = 0 then 'Group1'
else 'Both' end as WhichGroup
,Sum(a.x1) as Sum_x1
from work.table1 as a
group by 1
) as t1
left join
work.table2 as t2
on t1.ID1 = t2.ID1
Group by 1;
Quit;
- Answering my own question -
I am not sure why this is happening, but I have encountered a very interesting phenomenon and potentially a bug in SAS.
It appears that the whole reason the query doesn't work is that SAS does not understand the GROUP BY statement if it is given as a column position rather than the explicit variable name you want to group by. Perhaps SAS gets lost in the column order?
Has anyone else encountered this phenomenon in SAS?
Hence the query works if the following code is used:
Proc sql;
Create table Test
as select
t1.WhichGroup
,sum(t1.Sum_x1) as sum_x1
,sum(t2.x2) as Sum_x2
from (select
a.ID1
,case when max(case when a.y1 = 'Group1' then 1 else 0 end) = 0 then 'Group2'
when max(case when a.y1 = 'Group2' then 1 else 0 end) = 0 then 'Group1'
else 'Both' end as WhichGroup
,Sum(a.x1) as Sum_x1
from work.table1 as a
group by 1
) as t1
left join
work.table2 as t2
on t1.ID1 = t2.ID1
Group by WhichGroup;
Quit;

How do I extract a pattern from a table in Oracle 11g?

I want to extract text from a column using regular expressions in Oracle 11g. I have 2 queries that do the job, but I'm looking for a cleaner/nicer way to do it, perhaps by combining the queries into one or finding a new equivalent query. Here they are:
Query 1: identify rows that match a pattern:
select column1 from table1 where regexp_like(column1, pattern);
Query 2: extract all matched text from a matching row.
select regexp_substr(matching_row, pattern, 1, level)
from dual
connect by level <= regexp_count(matching_row, pattern);
I use PL/SQL to glue these 2 queries together, but it's messy and clumsy. How can I combine them into one query? Thank you.
UPDATE: sample data for pattern 'BC':
row 1: ABCD
row 2: BCFBC
row 3: HIJ
row 4: GBC
Expected result is a table of 4 rows of 'BC'.
You can also do it in one query, functions/procedures/packages not required:
WITH t1 AS (
SELECT 'ABCD' c1 FROM dual
UNION
SELECT 'BCFBC' FROM dual
UNION
SELECT 'HIJ' FROM dual
UNION
SELECT 'GBC' FROM dual
)
SELECT c1, regexp_substr(c1, 'BC', 1, d.l, 'i') thePattern, d.l occurrence
FROM t1 CROSS JOIN (SELECT LEVEL l FROM dual CONNECT BY LEVEL < 200) d
WHERE regexp_like(c1,'BC','i')
AND d.l <= regexp_count(c1,'BC');
C1 THEPATTERN OCCURRENCE
----- -------------------- ----------
ABCD BC 1
BCFBC BC 1
BCFBC BC 2
GBC BC 1
SQL>
I've arbitrarily limited the number of occurrences to search for to 200; YMMV.
Actually there is an elegant way to do this in one query, if you do not mind running some extra miles. Please note that this is just a sketch; I have not run it, and you'll probably have to correct a few typos in it.
create or replace package yo_package is
type word_t is record (word varchar2(4000));
type words_t is table of word_t;
end;
/
create or replace package body yo_package is
    function table_function(in_cur in sys_refcursor, pattern in varchar2)
        return words_t
        pipelined parallel_enable (partition in_cur by any)
    is
        next varchar2(4000);
        match varchar2(4000);
        word_rec word_t;
    begin
        word_rec.word := null;
        loop
            fetch in_cur into next;
            exit when in_cur%notfound;
            -- this is your inner loop, where you iterate over the matches within next
            -- you have to implement this
            loop
                -- TODO get the next match from next
                word_rec.word := match;
                pipe row (word_rec);
            end loop;
        end loop;
        return;
    end table_function;
end;
/
select *
from table(
    yo_package.table_function(
        cursor(
            -- this is your first select
            select column1 from table1 where regexp_like(column1, pattern)
        ),
        pattern
    )
);