I need to implement a filtering feature on events using WSO2 CEP 4.1.0.
My filters are stored in a PostgreSQL database.
To do this, I created an event table configuration, and I join my event stream with this event table.
My filters can have default values, so I need a complex joining condition:
from my_stream#window.length(1) left outer join my_event_table as filter
on (filter.field1 == '' OR stream.field1 == filter.field1)
AND (filter.field2 == '' OR stream.field2 == filter.field2)
(I do a LEFT OUTER JOIN because I must have a different process depending on whether the filter is found or not: if I find the filter, I complete my_stream with information from it and save the event in a database table; if not, I save the event in another database table.)
The problem is that when the system extracts the join condition to interpret it, it removes the parentheses, so the boolean interpretation is wrong:
on filter.field1 == '' OR stream.field1 == filter.field1
AND filter.field2 == '' OR stream.field2 == filter.field2
Is there a way to implement this kind of feature without creating a plugin?
Regards.
EDIT: This is the current solution I found, but I am worried about its performance and complexity, so I am looking for another one:
#first, I left join on my event_table
from my_stream#window.length(1) left outer join my_event_table as filter
on (filter.field1 == '' OR stream.field1 == filter.field1)
select stream.field1, stream.field2, stream.field3, filter.field1 as filter_field1, filter.field2 as filter_field2, filter.field3 as filter_field3, filter.info1
insert into tempStreamJoinProblemCount
#if the join returns nothing, then there is no filter for my line
from tempStreamJoinProblemCount[filter_field1 IS NULL]
insert into filter_not_found
#if the join returns some lines, maybe 1 of these lines can match, so I continue to check
from tempStreamJoinProblemCount[NOT filter_field1 IS NULL]
select field1, field2, field3, info1,
#I check my complex joining condition and store it in a boolean for later: 1 means my filter matches, 0 means no match
convert(
(filter_field2=='' OR field2 == filter_field2)
AND (filter_field3=='' OR field3 == filter_field3),'int') as filterMatch
insert into computeFilterMatchInformation
#if filterMatch is 1, I extract the filter information (info1), else I put a default value (minimal value); custom:ternaryInt is just the ternary function: boolean_condition?value_if_true:value_if_false
from computeFilterMatchInformation
select field1, field2, field3, custom:ternaryInt(filterMatch==1, info1, 0) as info1, filterMatch
insert into filterMatchGroupBy
#As we did not join on all fields, 1 line has been expanded into several lines, so we group the lines, to remove these generated lines and keep only 1 initial line;
from filterMatchGroupBy#window.time(10 sec)
#max(info1) returns only the filter value (because the value 0 from the previous stream is the minimal value);
#sum(filterMatch) returns 0 if there is no match, and 1+ if there is a match
select field1, field2, field3, max(info1) as info1, sum(filterMatch) as filter_match
group by field1, field2, field3
insert into filterCheck
#we found no match
from filterCheck[filter_match == 0]
select field1, field2, field3
insert into filter_not_found
#we found a match, so we extract filter information (info1)
from filterCheck[filter_match > 0]
select field1, field2, field3, info1
insert into filter_found
Fundamentally, a left outer join might not work with an event table, because an event table is not an active construct (like a stream), so we cannot assign a window to it. However, in order to perform outer joins, each side should be associated with a window. Since we cannot do that with event tables, outer joins wouldn't work anyway.
However, to address your scenario, you can join my_stream with my_event_table without any conditions and emit the resulting events into an intermediate stream, and then check the conditions on that intermediate stream. Try something similar to this:
from my_stream join my_event_table
select
my_stream.field1 as streamField1,
my_event_table.field1 as tableField1,
my_stream.field2 as streamField2,
my_event_table.field2 as tableField2
insert into intermediateStream;
from intermediateStream[((tableField1 == '' OR streamField1 == tableField1) AND (tableField2 == '' OR streamField2 == tableField2))]
select *
insert into filterMatchedStream;
from intermediateStream[not ((tableField1 == '' OR streamField1 == tableField1) AND (tableField2 == '' OR streamField2 == tableField2))]
select *
insert into filterUnMatchedStream;
I've got a table which I'm trying to pivot in Redshift:
UUID | Key  | Value
-----+------+------
a123 | Key1 | Val1
b123 | Key2 | Val2
c123 | Key3 | Val3
Currently I'm using the following code to pivot it, and it works fine. However, when I replace the IN part with a subquery, it throws an error.
select *
from (select UUID ,"Key", value from tbl) PIVOT (max(value) for "key" in (
'Key1',
'Key2',
'Key3'
))
Question: What's the best way to replace the IN part with a subquery that takes the distinct values from the Key column?
What I am trying to achieve:
select *
from (select UUID ,"Key", value from tbl) PIVOT (max(value) for "key" in (
select distinct "keys" from tbl
))
From the Redshift documentation - "The PIVOT IN list values cannot be column references or sub-queries. Each value must be type compatible with the FOR column reference." See: https://docs.aws.amazon.com/redshift/latest/dg/r_FROM_clause-pivot-unpivot-examples.html
So I think this will need to be done as a sequence of 2 queries. You likely can do this in a stored procedure if you need it as a single command.
Updated with requested stored procedure with results to a cursor example:
In order to make this supportable by you, I'll add some background info and a description of how this works. First off, a stored procedure cannot produce results straight to your bench. It can either store the results in a (temp) table or in a named cursor. A cursor is just storing the results of a query on the leader node, where they wait to be fetched. The lifespan of the cursor is the current transaction, so a commit or rollback will delete the cursor.
Here's what you want to happen as individual SQL statements, but first let's set up the test data:
create table test (UUID varchar(16), Key varchar(16), Value varchar(16));
insert into test values
('a123', 'Key1', 'Val1'),
('b123', 'Key2', 'Val2'),
('c123', 'Key3', 'Val3');
The actions you want to perform are first to create a string for the PIVOT clause IN list like so:
select '\'' || listagg(distinct "key",'\',\'') || '\'' from test;
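Against the test data above, this should return a single string along the lines of the following (the order of the keys may vary, since no ordering is specified):
'Key1','Key2','Key3'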
Then you want to take this string and insert it into your PIVOT query which should look like this:
select *
from (select UUID, "Key", value from test)
PIVOT (max(value) for "key" in ( 'Key1', 'Key2', 'Key3')
);
But doing this in the bench will mean taking the result of one query and copy/pasting it into a second query, and you want this to happen automatically. Unfortunately, Redshift does not allow sub-queries in the PIVOT statement, for the reason given above.
We can take the result of one query and use it to construct and run another query in a stored procedure. Here's such a stored procedure:
CREATE OR REPLACE procedure pivot_on_all_keys(curs1 INOUT refcursor)
AS
$$
DECLARE
row record;
BEGIN
select into row '\'' || listagg(distinct "key",'\',\'') || '\'' as keys from test;
OPEN curs1 for EXECUTE 'select *
from (select UUID, "Key", value from test)
PIVOT (max(value) for "key" in ( ' || row.keys || ' )
);';
END;
$$ LANGUAGE plpgsql;
What this procedure does is define and populate a "record" (1 row of data) called "row" with the result of the query that produces the IN list. Next it opens a cursor, whose name is provided by the calling command, with the contents of the PIVOT query which uses the IN list from the record "row". Done.
When executed (by running call) this function will produce a cursor on the leader node that contains the result of the PIVOT query. In this stored procedure the name of the cursor to create is passed to the function as a string.
call pivot_on_all_keys('mycursor');
All that needs to be done at this point is to "fetch" the data from the named cursor. This is done with the FETCH command.
fetch all from mycursor;
I prototyped this on a single node Redshift cluster and "FETCH ALL" is not supported in this configuration, so I had to use "FETCH 1000". So if you are also on a single node cluster you will need to use:
fetch 1000 from mycursor;
The last point to note is that the cursor "mycursor" now exists and if you tried to rerun the stored procedure it will fail. You could pass a different name to the procedure (making another cursor) or you could end the transaction (END, COMMIT, or ROLLBACK) or you could close the cursor using CLOSE. Once the cursor is destroyed you can use the same name for a new cursor. If you wanted this to be repeatable you could run this batch of commands:
call pivot_on_all_keys('mycursor'); fetch all from mycursor; close mycursor;
Remember that the cursor has a lifespan of the current transaction, so any action that ends the transaction will destroy the cursor. If you have AUTOCOMMIT enabled in your bench, this will insert COMMITs, destroying the cursor (you can run the CALL and FETCH in a batch to prevent this in many benches). Also, some commands perform an implicit COMMIT and will destroy the cursor as well (like TRUNCATE).
For these reasons, and depending on what else you need to do around the PIVOT query, you may want to have the stored procedure write to a temp table instead of a cursor. Then the temp table can be queried for the results. A temp table has a lifespan of the session so is a little stickier but is a little less efficient as a table needs to be created, the result of the PIVOT query needs to be written to the compute nodes, and then the results have to be sent to the leader node to produce the desired output. Just need to pick the right tool for the job.
===================================
To populate a table within a stored procedure you can just execute the commands. The whole thing will look like:
CREATE OR REPLACE procedure pivot_on_all_keys()
AS
$$
DECLARE
row record;
BEGIN
select into row '\'' || listagg(distinct "key",'\',\'') || '\'' as keys from test;
EXECUTE 'drop table if exists test_stage;';
EXECUTE 'create table test_stage AS select *
from (select UUID, "Key", value from test)
PIVOT (max(value) for "key" in ( ' || row.keys || ' )
);';
END;
$$ LANGUAGE plpgsql;
call pivot_on_all_keys();
select * from test_stage;
If you want this new table to have keys for optimizing downstream queries, you will want to create the table in one statement and then insert into it, but the above is the quick path.
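For example, the two EXECUTE statements in the procedure could become something like the following. This is only a sketch: it assumes the pivoted key set (and therefore the column list) is known up front, the DISTKEY/SORTKEY choices are placeholders, and the IN-list order determines which pivoted value lands in which column:
-- sketch: create the target table with explicit keys, then insert the PIVOT result into it
EXECUTE 'drop table if exists test_stage;';
EXECUTE 'create table test_stage (
             UUID varchar(16),
             Key1 varchar(16),
             Key2 varchar(16),
             Key3 varchar(16))
         distkey(UUID) sortkey(UUID);';
EXECUTE 'insert into test_stage
         select *
         from (select UUID, "Key", value from test)
         PIVOT (max(value) for "key" in ( ' || row.keys || ' ));';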
A little off-topic, but I wonder why Amazon couldn't introduce a simpler syntax for pivot. IMO, if GROUP BY is replaced by PIVOT BY, it gives the interpreter enough of a hint to transform rows into columns. For example:
SELECT partname, avg(price) as avg_price FROM Part GROUP BY partname;
can be written as:
SELECT partname, avg(price) as avg_price FROM Part PIVOT BY partname;
Even multi-level pivoting could be handled with the same syntax.
SELECT year, partname, avg(price) as avg_price FROM Part PIVOT BY year, partname;
I'm trying to get a substring dynamically and group by it. So if my uri column contains records like /uri1/uri2 and /somelongword/someotherlongword, I would like to get everything up to the second delimiter, namely up to the second /, and count it. I'm using this query, but obviously it cuts the string statically (6 letters after the first one).
SELECT substr(uri, 1, 6) as URI,
COUNT(*) as COUNTER
FROM staging
GROUP BY substr(uri, 1, 6)
ORDER BY COUNTER DESC
How can I achieve that?
You can use a combination of SUBSTRING() and POSITION().
schema:
CREATE TABLE Table1
(`uri` varchar(10))
;
INSERT INTO Table1
(`uri`)
VALUES
('some/text'),
('some/text1'),
('some/text2'),
('aa/bb'),
('aa/cc'),
('bb/cc')
;
query
SELECT
SUBSTRING(uri,1,POSITION('/' IN uri)-1),
COUNT(*)
FROM Table1
GROUP BY SUBSTRING(uri,1,POSITION('/' IN uri)-1);
http://sqlfiddle.com/#!9/293dd3/3/0
edit: here is the Amazon Athena documentation: https://docs.aws.amazon.com/athena/latest/ug/presto-functions.html and here is the Presto string function documentation: https://prestodb.io/docs/0.217/functions/string.html
my answer above still stands, but you might need to change SUBSTRING to SUBSTR
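Adapted to the staging table from the question, that would look roughly like this (a sketch assuming Athena/Presto, where SUBSTR and POSITION('/' IN uri) are available; note that URIs with a leading / would need the position of the second slash instead):
-- sketch for Athena/Presto: group by everything before the first '/'
SELECT SUBSTR(uri, 1, POSITION('/' IN uri) - 1) AS uri_prefix,
       COUNT(*) AS counter
FROM staging
GROUP BY SUBSTR(uri, 1, POSITION('/' IN uri) - 1)
ORDER BY counter DESC;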
edit 2: it seems there's a special function to achieve this in Amazon Athena, called SPLIT_PART()
query:
SELECT SPLIT_PART(uri, '/', 1), COUNT(*) FROM tbl GROUP BY SPLIT_PART(uri, '/', 1)
from docs:
split_part(string, delimiter, index) → varchar
Splits string on delimiter and returns the field index. Field indexes start with 1. If the index is larger than the number of fields, then null is returned.
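For the question's leading-slash URIs, the first field before the first / is empty, so the second field is the one you want. A quick illustration (a sketch; the literals are just assumed sample values):
-- sketch: how split_part behaves with the question's style of URIs
SELECT split_part('/uri1/uri2', '/', 2);   -- 'uri1'
SELECT split_part('some/text', '/', 1);    -- 'some'
SELECT split_part('some/text', '/', 5);    -- null (index past the last field)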
The following statement retrieves the value of the sub-tag msg_id from the MISC column if the sub-tag contains a value like %PACS%.
SELECT REGEXP_SUBSTR(MISC, '(^|\s|;)msg_id = (.*?)\s*(;|$)',1,1,NULL,2) AS TRANS_REF FROM MISC_HEADER
WHERE MISC LIKE '%PACS%';
I notice the query returns records with a null value (without msg_id) as well. Is there any way to exclude those null records within the REGEXP_SUBSTR syntax itself, without adding any extra WHERE clause?
Sample data of MISC:
channel=atm ; phone=0123 ; msg_id=PACS00812 ; ustrd=U123
channel=pos; phone=9922; ustrd=U156
The second record has no msg_id, so it needs to be excluded.
This method does not use REGEXP, so it may not be suitable for you. However, it does satisfy your requirement.
It takes your embedded list, breaks it out into a row for each component for an ID (I've assumed you do have something that uniquely identifies each record).
It then only returns the original row where one of the rows for the ID has 'PACS' in it.
WITH thedata
AS (SELECT 1 AS theid
, 'channel=atm ; phone=0123 ; msg_id=PACS00812 ; ustrd=U123'
AS msg_id
FROM DUAL
UNION ALL
SELECT 2, 'channel=pos; phone=9922; ustrd=U156' FROM DUAL)
, mylist
AS (SELECT theid, COLUMN_VALUE AS msg_component
FROM thedata
, XMLTABLE(('"' || REPLACE(msg_id, ';', '","') || '"')))
SELECT *
FROM thedata td
WHERE EXISTS
(SELECT 1
FROM mylist m
WHERE m.theid = td.theid
AND m.msg_component LIKE '%PACS%')
The thedata sub-query simply generates a couple of records to pretend to be your table. You could remove it and substitute your actual table name.
There are other ways to break up an embedded list, including ones that use REGEXP; I just find the XMLTABLE method 'cleaner'.
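For reference, here is a hedged sketch of one REGEXP-based alternative that could replace the mylist sub-query above, using the same thedata sample (the CONNECT BY / SYS_GUID() trick is a common way to split a delimited string per row, not something from the original answer):
-- sketch: split the ';'-separated list with REGEXP_SUBSTR instead of XMLTABLE
SELECT theid,
       TRIM(REGEXP_SUBSTR(msg_id, '[^;]+', 1, LEVEL)) AS msg_component
  FROM thedata
CONNECT BY LEVEL <= REGEXP_COUNT(msg_id, ';') + 1
       AND PRIOR theid = theid
       AND PRIOR SYS_GUID() IS NOT NULL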
I have a page item :P1_STUDY_SEARCH and a shuttle :P1_STUDY_CODES
The query driving the shuttle looks as below:
SELECT
DISTINCT
D.INTERNAL_REF_NO AS d,
D.INTERNAL_REF_NO AS v
FROM
ARIEL.DIM_DRUG_PRODUCT A,
ARIEL.DIM_REGISTRATION_SET B,
ARIEL.v_rep_includes C,
ARIEL.dim_registration_additional D
WHERE
A.DRUG_PRODUCT_ID = B.DRUG_PRODUCT_ID
AND
B.VERSION_SEQ = C.VERSION_SEQ
AND
B.REGISTRATION_SET_ID = D.REGISTRATION_SET_ID
AND
B.APPLICATION_TYPE IN ('CAT','DOG')
AND
B.DATA_STATE = 'C'
AND
D.INTERNAL_REF_NO IS NOT NULL
AND
D.INTERNAL_REF_NO LIKE 'D%'
AND
LENGTH(D.INTERNAL_REF_NO) >=10
AND
1 = (CASE WHEN :P1_STUDY_SEARCH IS NULL THEN 1 ELSE
CASE WHEN D.INTERNAL_REF_NO LIKE '%' || :P1_STUDY_SEARCH || '%' THEN 1 ELSE 0 END END)
ORDER BY 1;
This query restricts the values in the left hand side of the shuttle based on the values of the search term.
The shuttle source SQL is a colon delimited list:
SELECT LISTAGG(STUDY_CODE,':') WITHIN GROUP (ORDER BY STUDY_CODE)
FROM GRET_STUDIES WHERE GRET_ID = :P1_GRET GROUP BY GRET_ID
This all works beautifully until a dynamic action attached to the :P1_STUDY_SEARCH item is called. On key down in this search field, the dynamic action refreshes the :P1_STUDY_CODES page item.
The idea is the list gets restricted based on the search term. This part works, but the right hand side of the shuttle loses all of its values.
I suspect the reason is that you can't have ANYTHING in the right-hand side of the shuttle (the SOURCE query) if it ISN'T also part of the result set of the List of Values query...?
This seems quite bad, as the source query is distinct from the LOV query?!
I want to mask sensitive information on multiple columns in a table named my_table using ProxySQL.
I've followed this tutorial to successfully mask a single column named column_name in a table using the following mysql_query_rules:
/* only show the first character in column_name */
INSERT INTO mysql_query_rules (rule_id,active,username,schemaname,match_pattern,re_modifiers,replace_pattern,apply)
VALUES (1,1,'developer','my_table','(\(?)(`?\w+`?\.)?\`?column_name\`?(\)?)([ ,\n])','caseless,global',
"\1CONCAT(LEFT(\2column_name,1),REPEAT('X',CHAR_LENGTH(column_name)-1))\3 column_name\4",1);
But when I add a second rule for masking another column called second_column_name in the table, ProxySQL fails to mask the second column. Here's the second rule:
/* masking the last 3 characters in second_column_name */
INSERT INTO mysql_query_rules (rule_id,active,username,schemaname,match_pattern,re_modifiers,replace_pattern,apply)
VALUES (2,1,'developer','my_table','(\(?)(`?\w+`?\.)?\`?second_column_name\`?(\)?)([ ,\n])','caseless,global',
"\1CONCAT(LEFT(\2second_column_name,CHAR_LENGTH(second_column_name)-3),REPEAT('X',3))\3 second_column_name\4",1);
Here's the query result after the 2 rules are added:
SELECT column_name FROM my_table; returns a masked column_name.
SELECT second_column_name FROM my_table; returns a masked second_column_name.
SELECT column_name, second_column_name FROM my_table; returns data with column_name masked, but second_column_name is not masked.
SELECT second_column_name, column_name FROM my_table; also returns data with column_name masked, but second_column_name is not masked.
Does this mean that 1 query can only be matched with 1 rule?
How can I mask data in multiple columns with ProxySQL?
Using flagIN, flagOUT, and apply allows me to mask data on multiple columns.
Here's the final mysql_query_rules I have:
/* only show the first character in column_name */
INSERT INTO mysql_query_rules (rule_id,active,username,schemaname,flagIN,match_pattern,re_modifiers,flagOUT,replace_pattern,apply)
VALUES (1,1,'developer','my_db',0,'(\(?)(`?\w+`?\.)?\`?column_name\`?(\)?)([ ,\n])','caseless,global',6, "\1CONCAT(LEFT(\2column_name,1),REPEAT('X',CHAR_LENGTH(column_name)-1))\3 column_name\4",0);
/* masking the last 3 characters in second_column_name */
INSERT INTO mysql_query_rules (rule_id,active,username,schemaname,flagIN,match_pattern,re_modifiers,flagOUT,replace_pattern,apply)
VALUES (2,1,'developer','my_db',6,'(\(?)(`?\w+`?\.)?\`?second_column_name\`?(\)?)([ ,\n])','caseless,global',NULL,
"\1CONCAT(LEFT(\2second_column_name,CHAR_LENGTH(second_column_name)-3),REPEAT('X',3))\3 second_column_name\4",1);
The meanings of the 3 variables are as follows:
flagIN, flagOUT, apply - these allow us to create "chains of rules" that get applied one after the other. An input flag value is set to 0, and only rules with flagIN=0 are considered at the beginning. When a matching rule is found for a specific query, flagOUT is evaluated and, if NOT NULL, the query will be flagged with the specified flag in flagOUT. If flagOUT differs from flagIN, the query will exit the current chain and enter a new chain of rules having flagIN as the new input flag. If flagOUT matches flagIN, the query will be re-evaluated against the first rule with said flagIN. This happens until there are no more matching rules, or apply is set to 1 (which means this is the last rule to be applied).
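To see why that chaining matters here: with the rules above, a query such as SELECT column_name, second_column_name FROM my_table first matches rule 1 (flagIN=0, flagOUT=6, apply=0), which masks column_name, then re-enters the chain with flag 6 and matches rule 2, which masks second_column_name and stops (apply=1). The effect of the two replace patterns is roughly the following rewritten query (a sketch of what the patterns produce, not output captured from ProxySQL):
-- sketch: approximate query after both masking rules have fired
SELECT
  CONCAT(LEFT(column_name, 1),
         REPEAT('X', CHAR_LENGTH(column_name) - 1)) column_name,
  CONCAT(LEFT(second_column_name, CHAR_LENGTH(second_column_name) - 3),
         REPEAT('X', 3)) second_column_name
FROM my_table;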