I am facing unexpected behaviour when using the output every clause along with a table join.
I have a basic app with one input stream and 2 tables, each storing a different list of values. There are also 2 queries:
The first query, query1, joins with table1 and, when there is a match, outputs the first event every 5 sec.
The second query, query2, does the same with table2, outputting the first matching value every 5 sec.
The goal is that, every 5 seconds, when a value arrives on the input stream that is contained in table1 there is one match, when a value arrives that is contained in table2 there is a different match, and both queries stay silent until the next 5-second block.
The app is the following:
#App:name("delays_tables_join")
define stream input(value string);
define stream table_input(value string);
define table table1(value string);
define table table2(value string);
@sink(type='log')
define stream LogStream (value string);
-- fill table1
@info(name='insert table 1')
from table_input[value == '1']
insert into table1;
-- fill table2
@info(name='insert table 2')
from table_input[value == '2']
insert into table1;
-- query input join with table 1, output once every 5 sec
@info(name='query1')
from input join table1 on input.value == table1.value
select input.value
output first every 5 sec
insert into LogStream;
-- query input join with table 2, output once every 5 sec
@info(name='query2')
from input join table2 on input.value == table2.value
select input.value
output first every 5 sec
insert into LogStream;
When this app is run, first the values 1 and 2 are sent to table_input to fill both tables.
Then values 1, 2, 1, 2, 1, 2... are sent repeatedly to the input stream.
The expectation is to see 2 values in LogStream every 5 seconds: the first appearance of value 1 and the first appearance of value 2.
But instead, only the first occurrence of value 1 appears every time, and value 2 never does:
[2020-04-02_18-55-16_498] INFO {io.siddhi.core.stream.output.sink.LogSink} - delays_tables_join : LogStream : Event{timestamp=1585846516098, data=[1], isExpired=false}
[2020-04-02_18-55-21_508] INFO {io.siddhi.core.stream.output.sink.LogSink} - delays_tables_join : LogStream : Event{timestamp=1585846521098, data=[1], isExpired=false}
Please note that, when there are no table joins involved, both queries work as expected. Example without joins:
#App:name("delays")
define stream Input(value string);
@sink(type='log')
define stream LogStream (value string);
@info(name='query1')
from Input[value == '1']
select value
output first every 5 sec
insert into LogStream;
@info(name='query2')
from Input[value == '2']
select value
output first every 5 sec
insert into LogStream;
This will produce the following output:
[2020-04-02_18-53-50_305] INFO {io.siddhi.core.stream.output.sink.LogSink} - delays : LogStream : Event{timestamp=1585846430304, data=[1], isExpired=false}
[2020-04-02_18-53-50_706] INFO {io.siddhi.core.stream.output.sink.LogSink} - delays : LogStream : Event{timestamp=1585846430305, data=[2], isExpired=false}
[2020-04-02_18-53-55_312] INFO {io.siddhi.core.stream.output.sink.LogSink} - delays : LogStream : Event{timestamp=1585846438305, data=[1], isExpired=false}
[2020-04-02_18-53-56_114] INFO {io.siddhi.core.stream.output.sink.LogSink} - delays : LogStream : Event{timestamp=1585846439305, data=[2], isExpired=false}
I was wondering whether this behaviour is expected, or whether there is an error in the design of the application.
Many thanks!
I was able to get the same results as in the "without join" example by fixing the "insert table 2" query: changing table1 to table2 in the insert into line:
-- fill table2
@info(name='insert table 2')
from table_input[value == '2']
insert into table2;
Related
I am unable to insert records from one table into another using PySpark SQL in an AWS Glue job. It does not raise any error either.
Table1 name: details_info
spark.sql('select count(*) from details_info').show();
count: 50000
Table2 name: goals_info
spark.sql('select count(*) from goals_info').show();
count: 0
I am trying to insert data from the "details_info" table into the "goals_info" table with the below query:
spark.sql("INSERT INTO goals_info SELECT * FROM details_info")
I am expecting the count of goals_info to be 50000, but it still shows 0 after executing the above SQL statement:
spark.sql('select count(*) from goals_info').show();
count: 0
The code block executes without throwing any error, but the data is not inserted, as the count still shows 0. Could anybody help me understand what the reason might be?
I even tried the write.insertInto() PySpark method, but the count is still showing zero:
view_df = spark.table("details_info")
view_df.write.insertInto("goals_info")
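A hedged first check, not from the original post (it assumes goals_info is a Glue/Hive catalog table and that these statements are run through spark.sql): refresh Spark's cached metadata for the table and, if it is partitioned, repair its partition listing before counting again, to rule out stale metadata rather than a failed insert.
-- refresh cached metadata for the target table (assumption: a catalog table named goals_info)
REFRESH TABLE goals_info;
-- only relevant if goals_info is partitioned and its partitions are tracked in the Glue catalog
MSCK REPAIR TABLE goals_info;
-- re-check the row count after the refresh
SELECT count(*) FROM goals_info;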
I'm trying to run this query
SELECT
id AS id,
ARRAY_AGG(DISTINCT users_ids) AS users_ids,
MAX(date) AS date
FROM
users,
UNNEST(users_ids) AS users_ids
WHERE
users_ids != " 1111"
AND users_ids != " 2222"
GROUP BY
id;
Where the users table is a sharded table, over 4 TB, with an id column, a users_ids (comma-separated) column, and a date column,
and running the query gives me a resources error:
Resources exceeded during query execution: Your project or organization exceeded the maximum disk and memory limit available for shuffle operations.
Any idea why?
id userids date
1 2,3,4 1-10-20
2 4,5,6 1-10-20
1 7,8,4 2-10-20
So the final result I'm trying to reach is:
id userids date
1 2,3,4,7,8 2-10-20
2 4,5,6 1-10-20
Execution details:
It's constantly repartitioning - I would guess that you're trying to cram too much stuff into the aggregation part. Just remove the aggregation part - I don't even think you have to cross join here.
Use a subquery instead of this cross join + aggregation combo.
Edit: I just realized that you want to aggregate the arrays, but with distinct values:
WITH t AS (
  SELECT
    id AS id,
    ARRAY_CONCAT_AGG(ARRAY(SELECT DISTINCT uids FROM UNNEST(users_ids) AS uids
      WHERE uids != " 1111" AND uids != " 2222")) AS users_ids,
    MAX(date) AS date
  FROM
    users
  GROUP BY id
)
SELECT
  id,
  ARRAY(SELECT DISTINCT * FROM UNNEST(users_ids)) AS users_ids,
  date
FROM t
This is just a draft - I assume id is unique - but it should be something along those lines. Grouping by arrays is not possible.
array_concat_agg() has no DISTINCT, so the de-duplication comes in a second step.
I'm working with AWS Athena to concat a few rows into a single row.
Example table (name: unload):
xid pid sequence text
1 1 0 select * from
1 1 1 mytbl
1 1 2
2 1 0 update test
2 1 1 set mycol=
2 1 2 'a';
So I want to concat the text column.
Expected Output:
xid pid text
1 1 select * from mytbl
2 1 update test set mycol='a';
I ran the following query to first partition it in the proper order and then do the concat.
with cte as
(SELECT
xid,
pid,
sequence,
text,
row_number()
OVER (PARTITION BY xid,pid
ORDER BY sequence) AS rank
FROM unload
GROUP BY xid,pid,sequence,text
)
SELECT
xid,
pid,
array_join(array_agg(text),'') as text
FROM cte
GROUP BY xid,pid
But as you can see in the output below, the order gets misplaced:
xid pid text
1 1 mytblselect * from
2 1 update test'a'; set mycol=
I checked the Presto documentation; the latest version supports ORDER BY in array_agg, but Athena uses Presto 0.172, so I'm not sure whether it is supported or not.
What is the workaround for this in Athena?
One approach:
1. create records with a sortable format of text
2. aggregate into an unsorted array
3. sort the array
4. transform each element back into the original value of text
5. convert the sorted array to a string output column
WITH cte AS (
SELECT
xid, pid, text
-- create a sortable 19-digit ranking string
, SUBSTR(
LPAD(
CAST(
ROW_NUMBER() OVER (PARTITION BY xid, pid ORDER BY sequence)
AS VARCHAR)
, 19
, '0')
, -19) AS SEQ_STR
FROM unload
)
SELECT
xid, pid
-- make sortable string, aggregate into array
-- then sort array, revert each element to original text
-- finally combine array elements into one string
, ARRAY_JOIN(
TRANSFORM(
ARRAY_SORT(
ARRAY_AGG(SEQ_STR || text))
, combined -> SUBSTR(combined, 1 + 19))
, ' '
, '') AS TEXT
FROM cte
GROUP BY xid, pid
ORDER BY xid, pid
This code assumes:
xid + pid + sequence is unique for all input records
There are not many combinations of xid + pid + sequence (eg, not more than 20 million)
I'm trying to abort/exit a query based on a conditional expression using a CASE statement:
If the table has 0 rows then the query should take the happy path.
If the table has > 0 rows then the query should abort/exit.
drop table if exists #dups_tracker ;
create table #dups_tracker
(
column1 varchar(10)
);
insert into #dups_tracker values ('John'),('Smith'),('Jack') ;
with c1 as
(select
0 as denominator__v
,count(*) as dups_cnt__v
from #dups_tracker
)
select
case
when dups_cnt__v > 0 THEN 1/denominator__v
else
1
end Ind__v
from c1
;
Here is the error message:
Amazon Invalid operation: division by zero; 1 statement failed.
There is no concept of aborting an SQL query. It either compiles into a query or it doesn't; if it does compile, the query runs.
The closest option would be to write a stored procedure, which can include IF logic. It could first query the contents of a table and, based on the result, decide whether to perform another query.
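For illustration, a minimal sketch of that idea (not from the original answer; the procedure name is made up, and it assumes Redshift stored-procedure support with the #dups_tracker temp table created earlier in the same session):
-- raise an exception when the tracker table is not empty, otherwise fall through
CREATE OR REPLACE PROCEDURE check_dups_tracker()
AS $$
DECLARE
    dups_cnt__v INT;
BEGIN
    SELECT INTO dups_cnt__v count(*) FROM #dups_tracker;
    IF dups_cnt__v > 0 THEN
        RAISE EXCEPTION 'found % duplicate row(s), aborting', dups_cnt__v;
    END IF;
    -- happy path: run the follow-up statements here
END;
$$ LANGUAGE plpgsql;

CALL check_dups_tracker();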
Here is the logic I was able to write to abort a SQL statement in the positive use case:
/* Dummy Table to Abort Dups Check process if Positive */
--Dups Table
drop table if exists #dups;
create table #dups
(
dups_col varchar(1)
);
insert into #dups values('A');
--Dummy Table
drop table if exists #dummy ;
create table #dummy
(
dups_check decimal(1,0)
)
;
--When Table is not empty and has Dups
insert into #dummy
select
count(*) * 10
from #dups
;
/*
[Amazon](500310) Invalid operation: Numeric data overflow (result precision)
Details:
-----------------------------------------------
error: Numeric data overflow (result precision)
code: 1058
context: 64 bit overflow
query: 3246717
location: numeric.hpp:158
process: padbmaster [pid=6716]
-----------------------------------------------;
1 statement failed.
*/
--When Table is empty and doesn't have dups
truncate #dups ;
insert into #dummy
select
count(*) * 10
from #dups
;
drop table if exists temp_table;
create temp table temp_table (field_1 bool);
insert into temp_table
select case
when false -- or true
then 1
else 1 / 0
end as field_1;
This should compile, and fail when the condition isn't met.
Not sure why it's different from your example, though...
Edit: the above doesn't work querying against a table. Leaving it here for posterity.
Given the DML statement below, is there a way to limit the number of rows scanned in the target table? For example, let's say we have a field shard_id that the table is partitioned by, and I know beforehand that all updates should happen in some range of shard_id. Is there a way to specify a WHERE clause for the target to limit the number of rows that need to be scanned, so the update does not have to do a full table scan to look for an id?
MERGE dataset.table_target target
USING dataset.table_source source
ON target.id = "123"
WHEN MATCHED THEN
UPDATE SET some_value = source.some_value
WHEN NOT MATCHED BY SOURCE AND id = "123" THEN
DELETE
The ON condition is the WHERE statement where you need to write your clause:
ON target.id = "123" AND DATE(target.shard_id) BETWEEN date1 AND date2
For your case, it's incorrect to do the partition pruning in the ON condition. Instead, you should do it in the WHEN clauses.
There is an example for exactly this scenario at https://cloud.google.com/bigquery/docs/using-dml-with-partitioned-tables#pruning_partitions_when_using_a_merge_statement.
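As a hedged sketch of what that could look like for the statement above (assuming shard_id is the partitioning column; the 100-200 range is a made-up placeholder), the pruning filter goes into the search conditions of the WHEN clauses, following the pattern in the linked documentation:
MERGE dataset.table_target target
USING dataset.table_source source
ON target.id = "123"
-- filter on the partitioning column in the matched branch
WHEN MATCHED AND target.shard_id BETWEEN 100 AND 200 THEN
  UPDATE SET some_value = source.some_value
-- and in the not-matched-by-source branch (unqualified columns refer to the target)
WHEN NOT MATCHED BY SOURCE AND id = "123"
  AND shard_id BETWEEN 100 AND 200 THEN
  DELETE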
Basically, the ON condition is used as the matching condition to join the target and source tables in MERGE. The following two queries show the difference between a join condition and a WHERE clause.
Query 1:
with
t1 as (
select '2018-01-01' pt, 10 v1 union all
select '2018-01-01', 20 union all
select '2000-01-01', 10),
t2 as (select 10 v2)
select * from t1 left outer join t2 on v1=v2 and pt = '2018-01-01'
Result:
pt v1 v2
2018-01-01 10 10
2018-01-01 20 NULL
2000-01-01 10 NULL
Query 2:
with
t1 as (
select '2018-01-01' pt, 10 v1 union all
select '2018-01-01', 20 union all
select '2000-01-01', 10),
t2 as (select 10 v2)
select * from t1 left outer join t2 on v1=v2 where pt = '2018-01-01'
Result:
pt v1 v2
2018-01-01 10 10
2018-01-01 20 NULL