How to excldue null values using REGEXP_SUBSTR - regex

The following statement retrieve the value of sub tag msg_id from MISC column if the sub stag contain value like %PACS%.
SELECT REGEXP_SUBSTR(MISC, '(^|\s|;)msg_id = (.*?)\s*(;|$)',1,1,NULL,2) AS TRANS_REF FROM MISC_HEADER
WHERE MISC LIKE '%PACS%';
I notice the query return record with null value (without msg_id) as well. Any idea if can exclude those null records from the syntax of REGEXP_SUBSTR, without adding any where clause.
Sample data of MISC:
channel=atm ; phone=0123 ; msg_id=PACS00812 ; ustrd=U123
channel=pos; phone=9922; ustrd=U156
The second record without msg_id, so it need to be excluded.

This method does not use REGEXP so may not be suitable for you.
However, it does satisfy your requirement.
This takes your embedded list of msg_id, breaks it out to a row for each component for an ID (I've assumed you do have something uniquely identifies each record).
It then only returns the original row where one of the rows for the ID has 'PACS' in it.
WITH thedata
AS (SELECT 1 AS theid
, 'channel=atm ; phone=0123 ; msg_id=PACS00812 ; ustrd=U123'
AS msg_id
FROM DUAL
UNION ALL
SELECT 2, 'channel=pos; phone=9922; ustrd=U156' FROM DUAL)
, mylist
AS (SELECT theid, COLUMN_VALUE AS msg_component
FROM thedata
, XMLTABLE(('"' || REPLACE(msg_id, ';', '","') || '"')))
SELECT *
FROM thedata td
WHERE EXISTS
(SELECT 1
FROM mylist m
WHERE m.theid = td.theid
AND m.msg_component LIKE '%PACS%')
Thedata sub-query is simply to generate a couple of records and pretend to be your table. You could remove that and substitute your actual table name.
There are other ways to break up an embedded list including ones that use REGEXP, I just find the XMLTABLE method 'cleaner'.

Related

use regexp_extract [duplicate]

Let's say I have a column called 'Youtube' and I want to extract the string after the last slash of a URL. How would I do this in BigQuery Standard SQL?
Examples:
https://youtube.com/user/HaraldSchmidtShow
https://youtube.com/user/applesofficial
https://youtube.com/user/GrahamColton
Essentially, I want:
HaraldSchmidtShow
applesofficial
GrahamColton
An alternative to the previous answer, which also works when there's a '/' at the end:
WITH data AS(
SELECT 'https://youtube.com/user/HaraldSchmidtShow' AS url UNION ALL
SELECT 'https://youtube.com/user/applesofficial' UNION ALL
SELECT 'https://youtube.com/user/GrahamColton' UNION ALL
SELECT 'https://youtube.com/user/GrahamColton/'
)
SELECT REGEXP_EXTRACT(url, r'/([^/]+)/?$') name
FROM `data`
This might already do the trick for you:
WITH data AS(
SELECT 'https://youtube.com/user/HaraldSchmidtShow' AS url UNION ALL
SELECT 'https://youtube.com/user/applesofficial' UNION ALL
SELECT 'https://youtube.com/user/GrahamColton'
)
SELECT
SPLIT(url, '/')[SAFE_OFFSET(ARRAY_LENGTH(SPLIT(url, '/')) - 1)] AS name
FROM `data`
It just splits the string and goes for the last value.
Below is for BigQuery Standard SQL
#standardSQL
SELECT url,
(SELECT v FROM UNNEST(SPLIT(url, '/')) v WITH OFFSET o
WHERE v != '' ORDER BY o DESC LIMIT 1
) last_string
FROM `data`
You can test, play with above using dummy data as
#standardSQL
WITH data AS(
SELECT 'https://youtube.com/user/HaraldSchmidtShow' AS url UNION ALL
SELECT 'https://youtube.com/user/applesofficial' UNION ALL
SELECT 'https://youtube.com/user/GrahamColton/' UNION ALL
SELECT 'youtube.com/channel/UCEDBbJXgUqRQXCOsluJJ0FQ'
)
SELECT url,
(SELECT v FROM UNNEST(SPLIT(url, '/')) v WITH OFFSET o
WHERE v != '' ORDER BY o DESC LIMIT 1
) last_string
FROM `data`
with result
Row url last_string
1 https://youtube.com/user/HaraldSchmidtShow HaraldSchmidtShow
2 https://youtube.com/user/applesofficial applesofficial
3 https://youtube.com/user/GrahamColton/ GrahamColton
4 youtube.com/channel/UCEDBbJXgUqRQXCOsluJJ0FQ UCEDBbJXgUqRQXCOsluJJ0FQ
Obviously, using regular expression functions as in Felipe's answer - more elegant and easier to read.
But in some cases using above approach still has practical value so I wanted to bring it to that post

Redshift Pivot Function

I've got a similar table which I'm trying to pivot in Redshift:
UUID
Key
Value
a123
Key1
Val1
b123
Key2
Val2
c123
Key3
Val3
Currently I'm using following code to pivot it and it works fine. However, when I replace the IN part with subquery it throws an error.
select *
from (select UUID ,"Key", value from tbl) PIVOT (max(value) for "key" in (
'Key1',
'Key2',
'Key3
))
Question: What's the best way to replace the IN part with sub query which takes distinct values from Key column?
What I am trying to achieve;
select *
from (select UUID ,"Key", value from tbl) PIVOT (max(value) for "key" in (
select distinct "keys" from tbl
))
From the Redshift documentation - "The PIVOT IN list values cannot be column references or sub-queries. Each value must be type compatible with the FOR column reference." See: https://docs.aws.amazon.com/redshift/latest/dg/r_FROM_clause-pivot-unpivot-examples.html
So I think this will need to be done as a sequence of 2 queries. You likely can do this in a stored procedure if you need it as a single command.
Updated with requested stored procedure with results to a cursor example:
In order to make this supportable by you I'll add some background info and description of how this works. First off a stored procedure cannot produce results strait to your bench. It can either store the results in a (temp) table or to a named cursor. A cursor is just storing the results of a query on the leader node where they wait to be fetched. The lifespan of the cursor is the current transaction so a commit or rollback will delete the cursor.
Here's what you want to happen as individual SQL statements but first lets set up the test data:
create table test (UUID varchar(16), Key varchar(16), Value varchar(16));
insert into test values
('a123', 'Key1', 'Val1'),
('b123', 'Key2', 'Val2'),
('c123', 'Key3', 'Val3');
The actions you want to perform are first to create a string for the PIVOT clause IN list like so:
select '\'' || listagg(distinct "key",'\',\'') || '\'' from test;
Then you want to take this string and insert it into your PIVOT query which should look like this:
select *
from (select UUID, "Key", value from test)
PIVOT (max(value) for "key" in ( 'Key1', 'Key2', 'Key3')
);
But doing this in the bench will mean taking the result of one query and copy/paste-ing into a second query and you want this to happen automatically. Unfortunately Redshift does allow sub-queries in PIVOT statement for the reason given above.
We can take the result of one query and use it to construct and run another query in a stored procedure. Here's such a store procedure:
CREATE OR REPLACE procedure pivot_on_all_keys(curs1 INOUT refcursor)
AS
$$
DECLARE
row record;
BEGIN
select into row '\'' || listagg(distinct "key",'\',\'') || '\'' as keys from test;
OPEN curs1 for EXECUTE 'select *
from (select UUID, "Key", value from test)
PIVOT (max(value) for "key" in ( ' || row.keys || ' )
);';
END;
$$ LANGUAGE plpgsql;
What this procedure does is define and populate a "record" (1 row of data) called "row" with the result of the query that produces the IN list. Next it opens a cursor, whose name is provided by the calling command, with the contents of the PIVOT query which uses the IN list from the record "row". Done.
When executed (by running call) this function will produce a cursor on the leader node that contains the result of the PIVOT query. In this stored procedure the name of the cursor to create is passed to the function as a string.
call pivot_on_all_keys('mycursor');
All that needs to be done at this point is to "fetch" the data from the named cursor. This is done with the FETCH command.
fetch all from mycursor;
I prototyped this on a single node Redshift cluster and "FETCH ALL" is not supported at this configuration so I had to use "FETCH 1000". So if you are also on a single node cluster you will need to use:
fetch 1000 from mycursor;
The last point to note is that the cursor "mycursor" now exists and if you tried to rerun the stored procedure it will fail. You could pass a different name to the procedure (making another cursor) or you could end the transaction (END, COMMIT, or ROLLBACK) or you could close the cursor using CLOSE. Once the cursor is destroyed you can use the same name for a new cursor. If you wanted this to be repeatable you could run this batch of commands:
call pivot_on_all_keys('mycursor'); fetch all from mycursor; close mycursor;
Remember that the cursor has a lifespan of the current transaction so any action that ends the transaction will destroy the cursor. If you have AUTOCOMMIT enable in your bench this will insert COMMITs destroying the cursor (you can run the CALL and FETCH in a batch to prevent this in many benches). Also some commands perform an implicit COMMIT and will also destroy the cursor (like TRUNCATE).
For these reasons, and depending on what else you need to do around the PIVOT query, you may want to have the stored procedure write to a temp table instead of a cursor. Then the temp table can be queried for the results. A temp table has a lifespan of the session so is a little stickier but is a little less efficient as a table needs to be created, the result of the PIVOT query needs to be written to the compute nodes, and then the results have to be sent to the leader node to produce the desired output. Just need to pick the right tool for the job.
===================================
To populate a table within a stored procedure you can just execute the commands. The whole thing will look like:
CREATE OR REPLACE procedure pivot_on_all_keys()
AS
$$
DECLARE
row record;
BEGIN
select into row '\'' || listagg(distinct "key",'\',\'') || '\'' as keys from test;
EXECUTE 'drop table if exists test_stage;';
EXECUTE 'create table test_stage AS select *
from (select UUID, "Key", value from test)
PIVOT (max(value) for "key" in ( ' || row.keys || ' )
);';
END;
$$ LANGUAGE plpgsql;
call pivot_on_all_keys();
select * from test_stage;
If you want this new table to have keys for optimizing downstream queries you will want to create the table in one statement then insert into it but this is quickie path.
A little off-topic, but I wonder why Amazon couldn't introduce a simpler syntax for pivot. IMO, if GROUP BY is replaced by PIVOT BY, it can give enough hint to the interpreter to transform rows into columns. For example:
SELECT partname, avg(price) as avg_price FROM Part GROUP BY partname;
can be written as:
SELECT partname, avg(price) as avg_price FROM Part PIVOT BY partname;
Even multi-level pivoting can also be handled in the same syntax.
SELECT year, partname, avg(price) as avg_price FROM Part PIVOT BY year, partname;

MYSQL get substring

I'm trying to get substring dynamically and group by it. So if my uri column contains records like: /uri1/uri2 and /somelongword/someotherlongword I would like to get everything up to second delimiter, namely up to second / and count it. I'm using this query but obviously it is cutting string statically (6 letters after the first one).
SELECT substr(uri, 1, 6) as URI,
COUNT(*) as COUNTER
FROM staging
GROUP BY substr(uri, 1, 6)
ORDER BY COUNTER DESC
How can I achieve that?
You can use combination of SUBSTRING() and POSITION()
schema:
CREATE TABLE Table1
(`uri` varchar(10))
;
INSERT INTO Table1
(`uri`)
VALUES
('some/text'),
('some/text1'),
('some/text2'),
('aa/bb'),
('aa/cc'),
('bb/cc')
;
query
SELECT
SUBSTRING(uri,1,POSITION('/' IN uri)-1),
COUNT(*)
FROM Table1
GROUP BY SUBSTRING(uri,1,POSITION('/' IN uri)-1);
http://sqlfiddle.com/#!9/293dd3/3/0
edit: here I found amazon athena documentation: https://docs.aws.amazon.com/athena/latest/ug/presto-functions.html and here is the string function documentation: https://prestodb.io/docs/0.217/functions/string.html
my answer above still stands, but you might need to change SUBSTRING to SUBSTR
edit 2: it seems there's a special function to achieve this in amazon athena called SPLIT_PART()
query:
SELECT SPLIT_PART(uri, '/', 1), COUNT(*) FROM tbl GROUP BY SPLIT_PART(uri, '/', 1)
from docs:
split_part(string, delimiter, index) → varchar
Splits string on delimiter and returns the field index. Field indexes start with 1. If the index is larger than than the number of fields, then null is returned.

Parse data from table Greenplum

I have a table scheme2.central_id__new_numbers in Greenplum.
I need select data from scheme2.central_id__new_numbers in the form of a many-to-many relationship.
Also I write the code but must have made a wrong turn somewhere (the code doesn't work):
CREATE FUNCTION my_scheme.parse_new_numbers (varchar) RETURNS SETOF varchar as
$BODY$
declare
i int;
BEGIN
FOR i IN 1..10 LOOP
select
central_id,
(select regexp_split_to_table((select new_numbers
from scheme2.central_id__new_numbers limit 1 offset i), '\s+'))
from scheme2.central_id__new_numbers limit 1 offset i
END LOOP;
END;
$BODY$
LANGUAGE plpgsql;
I'd recommend using the UNNEST() function instead, i.e. assuming new_numbers column is of int[] data type,
SELECT central_id
, UNNEST(new_numbers) AS new_numbers
FROM central_id__new_numbers;
If new_columns column is not an array data type then you need to use i.e. string_to_array() or similar before using UNNEST().

Remove rows from SQL DB that appear in a Array

I Develop with MFC Visual C++ and Oracle SQL Server.
I have SQL table with: IDs, value and time, when the application insert a new row: some ID, some Value and time being inserted.
My goal is to delete rows of values that were changed between certain time. since the data that was inserted during that time has incorrect value.
Where is the catch ? I dont need to delete all the rows that were updated in that time period, only the rows with IDs that appear on a certain CArray.
I can go through each ID from CArray and execute a delete query to that certain ID in that time period (whether there is entry or not) - problem since i can have 150K IDs to iterate
on..
Thanks
DELETE FROM table-name WHERE id in (...)
transform your array into a tempTable with one column and then delete from your destiantion table where ID in (select Id from temptable)
Here is an example:
declare #RegionID varchar(50)
SET #RegionID = '853,834,16,467,841'
declare #S varchar(20)
if LEN(#RegionID) > 0 SET #RegionID = #RegionID + ','
CREATE TABLE #ARRAY(region_ID VARCHAR(20))
WHILE LEN(#RegionID) > 0 BEGIN
SELECT #S = LTRIM(SUBSTRING(#RegionID, 1, CHARINDEX(',', #RegionID) - 1))
INSERT INTO #ARRAY (region_ID) VALUES (#S)
SELECT #RegionID = SUBSTRING(#RegionID, CHARINDEX(',', #RegionID) + 1, LEN(#RegionID))
END
delete from from your_table
where regionID IN (select region_ID from #ARRAY)