Does AWS Athena supports Order by in Array_AGG?

Does AWS Athena supports Order by in Array_AGG? - amazon-web-services

Im working with AWS Athena to concat a few rows to a single row.
Example table:(name: unload)
xid pid sequence text
1 1 0 select * from
1 1 1 mytbl
1 1 2
2 1 0 update test
2 1 1 set mycol=
2 1 2 'a';
So want to contact the text column.
Expected Output:
xid pid text
1 1 select * from mytbl
2 1 update test set mycol='a';
I ran the following query to partition it first with proper order and do the concat.
with cte as
(SELECT
xid,
pid,
sequence,
text,
row_number()
OVER (PARTITION BY xid,pid
ORDER BY sequence) AS rank
FROM unload
GROUP BY xid,pid,sequence,text
)
SELECT
xid,
pid,
array_join(array_agg(text),'') as text
FROM cte
GROUP BY xid,pid
But if you see the below output the order got misplaced.
xid pid text
1 1 mytblselect * from
2 1 update test'a'; set mycol=
I checked the Presto documentation, the latest version supports order by in array agg, but Athena is using Presto 0.172, so Im not sure whether it is supported or not.
What is the workaround for this in Athena?

One approach:
create records with a sortable format of text
aggregate into an unsorted array
sort the array
transform each element back into the original value of text
convert the sorted array to a string output column
WITH cte AS (
SELECT
xid, pid, text
-- create a sortable 19-digit ranking string
, SUBSTR(
LPAD(
CAST(
ROW_NUMBER() OVER (PARTITION BY xid, pid ORDER BY sequence)
AS VARCHAR)
, 19
, '0')
, -19) AS SEQ_STR
FROM unload
)
SELECT
xid, pid
-- make sortable string, aggregate into array
-- then sort array, revert each element to original text
-- finally combine array elements into one string
, ARRAY_JOIN(
TRANSFORM(
ARRAY_SORT(
ARRAY_AGG(SEQ_STR || text))
, combined -> SUBSTR(combined, 1 + 19))
, ' '
, '') AS TEXT
FROM cte
GROUP BY xid, pid
ORDER BY xid, pid
This code assumes:
xid + pid + sequence is unique for all input records
There are not many combinations of xid + pid + sequence (eg, not more than 20 million)

Related

Informatica Cloud Data Integration - find non matching rows

I am working on Informatica Cloud Data
Integraion.I have 2 tables- Tab1 and Tab2.The joining column is id.I want to find all records in Tab1 that do not exist in Tab2.What transformations can I use to achieve this?
Tab1
id name
1 n1
2 n2
3 n3
Tab2
id
1
5
6
I want to get records with id 2 and 3 from tab1 as they do not exist in tab2

You can use database source qualifier overwrite sql
Select * from table1 where id not in ( select id from table2)
Or else you can use informatica like below.
Do a lookup on table2, on join condition on id.
In exp transformation, create a flag
out_flag= iif(isnull (:lkp(id)),'pass','fail')
Put a filter next and keep the condition as out_flag= 'pass'
Whole map should be like this
Lkp
|
Sq --exp|-----> fil---tgt

Concatenating row values in Athena Aws

I've 2 cols lets say id and values. I want to concatenate values grouped by id col.
for eg.
I've
ID Values
1 a
1 b
2 a
2 b
I need the output as
ID Values
1 a,b
2 a,b

You can use an array_agg followed by an array_join
select id, array_join(array_agg(values),',') from table group by 1
The array_agg will give you an array of all values with the same id, and the array_join will concatenate them into a string. See the docs.

Siddhi App, 'Output Rate Limiting' clause misbehaviour in Joins

I am facing a non-expected behaviour when using the clause output every along with table join clause.
I have an basic app, with one input stream, and 2 tables, which store a different list of values. Then, there are also 2 queries,
The first query1 will join with table1, and when there is a match will output first every 5 sec.
Second query2 will do similarly, will join table2, and will output first value found every 5 sec.
The goal of this is, every 5 seconds, when there is a value in input stream which is contained into table 1, there will be a match, and if there is a value contained into table 2, there will be a different match, and both queries will keep silent until next 5 seconds block.
the app is the following
#App:name("delays_tables_join")
define stream input(value string);
define stream table_input(value string);
define table table1(value string);
define table table2(value string);
#sink(type='log')
define stream LogStream (value string);
-- fill table1
#info(name='insert table 1')
from table_input[value == '1']
insert into table1;
-- fill table2
#info(name='insert table 2')
from table_input[value == '2']
insert into table2;
-- query input join with table 1, output once every 5 sec
#info(name='query1')
from input join table1 on input.value == table1.value
select input.value
output first every 5 sec
insert into LogStream;
-- query input join with table 2, output once every 5 sec
#info(name='query2')
from input join table2 on input.value == table2.value
select input.value
output first every 5 sec
insert into LogStream;
When this app is run,first its sent to table_input the values 1, and 2 to fill both tables
And then, it starts sending to the input stream repeatedly values: 1, 2, 1, 2, 1, 2...
It is expected to have in LogStream 2 values every 5 seconds, the first appearance of 1 value, and the first appearance of value 2.
But instead, just the first occurrence of value 1 appears all the time, but not the value 2
[2020-04-02_18-55-16_498] INFO {io.siddhi.core.stream.output.sink.LogSink} - delays_tables_join : LogStream : Event{timestamp=1585846516098, data=[1], isExpired=false}
[2020-04-02_18-55-21_508] INFO {io.siddhi.core.stream.output.sink.LogSink} - delays_tables_join : LogStream : Event{timestamp=1585846521098, data=[1], isExpired=false}
Please, note that, when there are no table joins involved, both queries work as expected. Example without joins:
#App:name("delays")
define stream Input(value string);
#sink(type='log')
define stream LogStream (value string);
#info(name='query1')
from Input[value == '1']
select value
output first every 5 sec
insert into LogStream;
#info(name='query2')
from Input[value == '2']
select value
output first every 5 sec
insert into LogStream;
this will produce the following output:
[2020-04-02_18-53-50_305] INFO {io.siddhi.core.stream.output.sink.LogSink} - delays : LogStream : Event{timestamp=1585846430304, data=[1], isExpired=false}
[2020-04-02_18-53-50_706] INFO {io.siddhi.core.stream.output.sink.LogSink} - delays : LogStream : Event{timestamp=1585846430305, data=[2], isExpired=false}
[2020-04-02_18-53-55_312] INFO {io.siddhi.core.stream.output.sink.LogSink} - delays : LogStream : Event{timestamp=1585846438305, data=[1], isExpired=false}
[2020-04-02_18-53-56_114] INFO {io.siddhi.core.stream.output.sink.LogSink} - delays : LogStream : Event{timestamp=1585846439305, data=[2], isExpired=false}
.
I was wondering if this behaviour is expected, or there is any error at all in the design of the application.
Many thanks!

I was able to get the results as in the "without join" by fixing the "insert table 2" query by changing the table1 to table2 in insert into line
-- fill table2
#info(name='insert table 2')
from table_input[value == '2']
insert into table1;

Hive: First and last occurrence in a string

I have an id column and a string column as follows:
id values
1 AD123~DF123~SQ345
2 CF234~DF234
3 BG123
I need the first occurrence and last occurrence of columns below in Hive
id first last
1 AD123 SQ345
2 CF234 DF234
3 BG123 BG123
I have already tried using the HIVE split function to solve it
select id, split(values, '\~') [0] as first, reverse(split(reverse(values), '\~')[0]) from demo;
I keep getting a syntax error in Hive saying that [ is unexpected.
Another alternative I found is regex but I am new to Hive, can some one please help me out here with regex or split. Thanks

Using split:
with your_table as(
select stack(3,
1, 'AD123~DF123~SQ345',
2, 'CF234~DF234',
3, 'BG123'
) as (id,values)
) --use your_table instead of this
select id, values[0] as first, values[size(values)-1] as last
from
(
select id, split(values,'~') values
from your_table t
)s
;
Returns:
id first last
1 AD123 SQ345
2 CF234 DF234
3 BG123 BG123
Using regexp:
select id,
regexp_extract(values,'^([^~]*)',1) as first,
regexp_extract(values,'([^~]*)$',1) as last
from your_table t
;

Redshift. Convert comma delimited values into rows

I am wondering how to convert comma-delimited values into rows in Redshift. I am afraid that my own solution isn't optimal. Please advise. I have table with one of the columns with coma-separated values. For example:
I have:
user_id|user_name|user_action
-----------------------------
1 | Shone | start,stop,cancell...
I would like to see
user_id|user_name|parsed_action
-------------------------------
1 | Shone | start
1 | Shone | stop
1 | Shone | cancell
....

A slight improvement over the existing answer is to use a second "numbers" table that enumerates all of the possible list lengths and then use a cross join to make the query more compact.
Redshift does not have a straightforward method for creating a numbers table that I am aware of, but we can use a bit of a hack from https://www.periscope.io/blog/generate-series-in-redshift-and-mysql.html to create one using row numbers.
Specifically, if we assume the number of rows in cmd_logs is larger than the maximum number of commas in the user_action column, we can create a numbers table by counting rows. To start, let's assume there are at most 99 commas in the user_action column:
select
(row_number() over (order by true))::int as n
into numbers
from cmd_logs
limit 100;
If we want to get fancy, we can compute the number of commas from the cmd_logs table to create a more precise set of rows in numbers:
select
n::int
into numbers
from
(select
row_number() over (order by true) as n
from cmd_logs)
cross join
(select
max(regexp_count(user_action, '[,]')) as max_num
from cmd_logs)
where
n <= max_num + 1;
Once there is a numbers table, we can do:
select
user_id,
user_name,
split_part(user_action,',',n) as parsed_action
from
cmd_logs
cross join
numbers
where
split_part(user_action,',',n) is not null
and split_part(user_action,',',n) != '';

Another idea is to transform your CSV string into JSON first, followed by JSON extract, along the following lines:
... '["' || replace( user_action, '.', '", "' ) || '"]' AS replaced
... JSON_EXTRACT_ARRAY_ELEMENT_TEXT(replaced, numbers.i) AS parsed_action
Where "numbers" is the table from the first answer. The advantage of this approach is the ability to use built-in JSON functionality.

If you know that there are not many actions in your user_action column, you use recursive sub-querying with union all and therefore avoiding the aux numbers table.
But it requires you to know the number of actions for each user, either adjust initial table or make a view or a temporary table for it.
Data preparation
Assuming you have something like this as a table:
create temporary table actions
(
user_id varchar,
user_name varchar,
user_action varchar
);
I'll insert some values in it:
insert into actions
values (1, 'Shone', 'start,stop,cancel'),
(2, 'Gregory', 'find,diagnose,taunt'),
(3, 'Robot', 'kill,destroy');
Here's an additional table with temporary count
create temporary table actions_with_counts
(
id varchar,
name varchar,
num_actions integer,
actions varchar
);
insert into actions_with_counts (
select user_id,
user_name,
regexp_count(user_action, ',') + 1 as num_actions,
user_action
from actions
);
This would be our "input table" and it looks just as you expected
select * from actions_with_counts;
id
name
num_actions
actions
2
Gregory
3
find,diagnose,taunt
3
Robot
2
kill,destroy
1
Shone
3
start,stop,cancel
Again, you can adjust initial table and therefore skipping adding counts as a separate table.
Sub-query to flatten the actions
Here's the unnesting query:
with recursive tmp (user_id, user_name, idx, user_action) as
(
select id,
name,
1 as idx,
split_part(actions, ',', 1) as user_action
from actions_with_counts
union all
select user_id,
user_name,
idx + 1 as idx,
split_part(actions, ',', idx + 1)
from actions_with_counts
join tmp on actions_with_counts.id = tmp.user_id
where idx < num_actions
)
select user_id, user_name, user_action as parsed_action
from tmp
order by user_id;
This will create a new row for each action, and the output would look like this:
user_id
user_name
parsed_action
1
Shone
start
1
Shone
stop
1
Shone
cancel
2
Gregory
find
2
Gregory
diagnose
2
Gregory
taunt
3
Robot
kill
3
Robot
destroy

Here are two ways to achieve this.
In my example, I'm assuming that I am accepting a comma separated list of values. My values look like schema.table.column.
The first involves using a recursive CTE.
drop table if exists #dep_tbl;
create table #dep_tbl as
select 'schema.foobar.insert_ts,schema.baz.load_ts' as dep
;
with recursive tmp (level, dep_split, to_split) as
(
select 1 as level
, split_part(dep, ',', 1) as dep_split
, regexp_count(dep, ',') as to_split
from #dep_tbl
union all
select tmp.level + 1 as level
, split_part(a.dep, ',', tmp.level + 1) as dep_split_u
, tmp.to_split
from #dep_tbl a
inner join tmp on tmp.dep_split is not null
and tmp.level <= tmp.to_split
)
select dep_split from tmp;
the above yields:
|dep_split|
|schema.foobar.insert_ts|
|schema.baz.load_ts|
The second involves a stored procedure.
CREATE OR REPLACE PROCEDURE so_test(dependencies_csv varchar(max))
LANGUAGE plpgsql
AS $$
DECLARE
dependencies_csv_vals varchar(max);
BEGIN
drop table if exists #dep_holder;
create table #dep_holder
(
avoid varchar(60000)
);
IF dependencies_csv is not null THEN
dependencies_csv_vals:='('||replace(quote_literal(regexp_replace(dependencies_csv,'\\s','')),',', '\'),(\'') ||')';
execute 'insert into #dep_holder values '||dependencies_csv_vals||';';
END IF;
END;
$$
;
call so_test('schema.foobar.insert_ts,schema.baz.load_ts')
select
*
from
#dep_holder;
the above yields:
|dep_split|
|schema.foobar.insert_ts|
|schema.baz.load_ts|
in conclusion
If you only care about one single column in your input (the X delimited values), then I think the stored procedure is easier/faster.
However, if you have other columns you care about and want to keep those columns along with your comma separated value column now transformed to rows, OR, if you want to know the argument (original list of delimited values), I think the stored procedure is the way to go. In that case, you can just add those other columns to your columns selected in the recursive query.

You can get the expected result with the following query. I'm using "UNION ALL" to convert a column to row.
select user_id, user_name, split_part(user_action,',',1) as parsed_action from cmd_logs
union all
select user_id, user_name, split_part(user_action,',',2) as parsed_action from cmd_logs
union all
select user_id, user_name, split_part(user_action,',',3) as parsed_action from cmd_logs

Here's my equally-terrible answer.
I have a users table, and then an events table with a column that is just a comma-delimited string of users at said event. eg
event_id | user_ids
1 | 5,18,25,99,105
In this case, I used the LIKE and wildcard functions to build a new table that represents each event-user edge.
SELECT e.event_id, u.id as user_id
FROM events e
LEFT JOIN users u ON e.user_ids like '%' || u.id || '%'
It's not pretty, but I throw it in a WITH clause so that I don't have to run it more than once per query. I'll likely just build an ETL to create that table every night anyway.
Also, this only works if you have a second table that does have one row per unique possibility. If not, you could do LISTAGG to get a single cell with all your values, export that to a CSV and reupload that as a table to help.
Like I said: a terrible, no-good solution.

Late to the party but I got something working (albeit very slow though)
with nums as (select n::int n
from
(select
row_number() over (order by true) as n
from table_with_enough_rows_to_cover_range)
cross join
(select
max(json_array_length(json_column)) as max_num
from table_with_json_column )
where
n <= max_num + 1)
select *, json_extract_array_element_text(json_column,nums.n-1) parsed_json
from nums, table_with_json_column
where json_extract_array_element_text(json_column,nums.n-1) != ''
and nums.n <= json_array_length(json_column)
Thanks to answer by Bob Baxley for inspiration

Just improvement for the answer above https://stackoverflow.com/a/31998832/1265306
Is generating numbers table using the following SQL
https://discourse.looker.com/t/generating-a-numbers-table-in-mysql-and-redshift/482
SELECT
p0.n
+ p1.n*2
+ p2.n * POWER(2,2)
+ p3.n * POWER(2,3)
+ p4.n * POWER(2,4)
+ p5.n * POWER(2,5)
+ p6.n * POWER(2,6)
+ p7.n * POWER(2,7)
as number
INTO numbers
FROM
(SELECT 0 as n UNION SELECT 1) p0,
(SELECT 0 as n UNION SELECT 1) p1,
(SELECT 0 as n UNION SELECT 1) p2,
(SELECT 0 as n UNION SELECT 1) p3,
(SELECT 0 as n UNION SELECT 1) p4,
(SELECT 0 as n UNION SELECT 1) p5,
(SELECT 0 as n UNION SELECT 1) p6,
(SELECT 0 as n UNION SELECT 1) p7
ORDER BY 1
LIMIT 100
"ORDER BY" is there only in case you want paste it without the INTO clause and see the results

create a stored procedure that will parse string dynamically and populatetemp table, select from temp table.
here is the magic code:-
CREATE OR REPLACE PROCEDURE public.sp_string_split( "string" character varying )
AS $$
DECLARE
cnt INTEGER := 1;
no_of_parts INTEGER := (select REGEXP_COUNT ( string , ',' ));
sql VARCHAR(MAX) := '';
item character varying := '';
BEGIN
-- Create table
sql := 'CREATE TEMPORARY TABLE IF NOT EXISTS split_table (part VARCHAR(255)) ';
RAISE NOTICE 'executing sql %', sql ;
EXECUTE sql;
<<simple_loop_exit_continue>>
LOOP
item = (select split_part("string",',',cnt));
RAISE NOTICE 'item %', item ;
sql := 'INSERT INTO split_table SELECT '''||item||''' ';
EXECUTE sql;
cnt = cnt + 1;
EXIT simple_loop_exit_continue WHEN (cnt >= no_of_parts + 2);
END LOOP;
END ;
$$ LANGUAGE plpgsql;
Usage example:-
call public.sp_string_split('john,smith,jones');
select *
from split_table

You can try copy command to copy your file into redshift tables
copy table_name from 's3://mybucket/myfolder/my.csv' CREDENTIALS 'aws_access_key_id=my_aws_acc_key;aws_secret_access_key=my_aws_sec_key' delimiter ','
You can use delimiter ',' option.
For more details of copy command options you can visit this page
http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Does AWS Athena supports Order by in Array_AGG? - amazon-web-services

Related

Informatica Cloud Data Integration - find non matching rows

Concatenating row values in Athena Aws

Siddhi App, 'Output Rate Limiting' clause misbehaviour in Joins

Hive: First and last occurrence in a string

Redshift. Convert comma delimited values into rows

Categories

Resources