use regexp_extract [duplicate] - regex

Let's say I have a column called 'Youtube' and I want to extract the string after the last slash of a URL. How would I do this in BigQuery Standard SQL?
Examples:
https://youtube.com/user/HaraldSchmidtShow
https://youtube.com/user/applesofficial
https://youtube.com/user/GrahamColton
Essentially, I want:
HaraldSchmidtShow
applesofficial
GrahamColton

An alternative to the previous answer, which also works when there's a '/' at the end:
WITH data AS(
SELECT 'https://youtube.com/user/HaraldSchmidtShow' AS url UNION ALL
SELECT 'https://youtube.com/user/applesofficial' UNION ALL
SELECT 'https://youtube.com/user/GrahamColton' UNION ALL
SELECT 'https://youtube.com/user/GrahamColton/'
)
SELECT REGEXP_EXTRACT(url, r'/([^/]+)/?$') name
FROM `data`
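To see why the optional trailing slash in the pattern matters, here is a minimal Python sketch that applies the same regular expression to the sample URLs. Python's `re` module only stands in for BigQuery's regex engine here, and the helper name `last_segment` is made up for illustration.

```python
import re

# Same pattern as in the query above; Python's re stands in for BigQuery's engine.
LAST_SEGMENT = re.compile(r"/([^/]+)/?$")


def last_segment(url):
    """Return the last non-empty path segment, or None when nothing matches."""
    match = LAST_SEGMENT.search(url)
    return match.group(1) if match else None


urls = [
    "https://youtube.com/user/HaraldSchmidtShow",
    "https://youtube.com/user/applesofficial",
    "https://youtube.com/user/GrahamColton",
    "https://youtube.com/user/GrahamColton/",
]
names = [last_segment(u) for u in urls]
print(names)
```

The `/?$` part lets the match end either at the end of the string or just before a final slash, so the last two URLs give the same name.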

This might already do the trick for you:
WITH data AS(
SELECT 'https://youtube.com/user/HaraldSchmidtShow' AS url UNION ALL
SELECT 'https://youtube.com/user/applesofficial' UNION ALL
SELECT 'https://youtube.com/user/GrahamColton'
)
SELECT
SPLIT(url, '/')[SAFE_OFFSET(ARRAY_LENGTH(SPLIT(url, '/')) - 1)] AS name
FROM `data`
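It just splits the string on '/' and takes the last element. Note that for a URL ending in '/' the last element is an empty string, so this returns '' in that case.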

Below is for BigQuery Standard SQL
#standardSQL
SELECT url,
(SELECT v FROM UNNEST(SPLIT(url, '/')) v WITH OFFSET o
WHERE v != '' ORDER BY o DESC LIMIT 1
) last_string
FROM `data`
You can test and play with the above using dummy data:
#standardSQL
WITH data AS(
SELECT 'https://youtube.com/user/HaraldSchmidtShow' AS url UNION ALL
SELECT 'https://youtube.com/user/applesofficial' UNION ALL
SELECT 'https://youtube.com/user/GrahamColton/' UNION ALL
SELECT 'youtube.com/channel/UCEDBbJXgUqRQXCOsluJJ0FQ'
)
SELECT url,
(SELECT v FROM UNNEST(SPLIT(url, '/')) v WITH OFFSET o
WHERE v != '' ORDER BY o DESC LIMIT 1
) last_string
FROM `data`
with result
Row  url                                           last_string
1    https://youtube.com/user/HaraldSchmidtShow    HaraldSchmidtShow
2    https://youtube.com/user/applesofficial       applesofficial
3    https://youtube.com/user/GrahamColton/        GrahamColton
4    youtube.com/channel/UCEDBbJXgUqRQXCOsluJJ0FQ  UCEDBbJXgUqRQXCOsluJJ0FQ
Obviously, using regular expression functions as in Felipe's answer is more elegant and easier to read.
But in some cases the above approach still has practical value, so I wanted to add it to this post.
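The idea behind the UNNEST query (split, drop empty pieces, keep the piece with the highest offset) can be sketched in Python as below. This only mirrors the logic and is not BigQuery code; the function names are hypothetical. A plain "take the last element" version is included for comparison.

```python
def last_by_split(url):
    """Plain split: take whatever comes after the final '/'."""
    return url.split("/")[-1]


def last_non_empty(url):
    """Split, ignore empty pieces, and keep the last remaining one."""
    pieces = [piece for piece in url.split("/") if piece != ""]
    return pieces[-1] if pieces else None


with_slash = "https://youtube.com/user/GrahamColton/"
no_scheme = "youtube.com/channel/UCEDBbJXgUqRQXCOsluJJ0FQ"

print(repr(last_by_split(with_slash)))
print(last_non_empty(with_slash))
print(last_non_empty(no_scheme))
```

The plain version gives an empty string for the URL with a trailing slash, while the filtered version still finds GrahamColton.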

Related

MYSQL get substring

I'm trying to get substring dynamically and group by it. So if my uri column contains records like: /uri1/uri2 and /somelongword/someotherlongword I would like to get everything up to second delimiter, namely up to second / and count it. I'm using this query but obviously it is cutting string statically (6 letters after the first one).
SELECT substr(uri, 1, 6) as URI,
COUNT(*) as COUNTER
FROM staging
GROUP BY substr(uri, 1, 6)
ORDER BY COUNTER DESC
How can I achieve that?
You can use a combination of SUBSTRING() and POSITION():
schema:
CREATE TABLE Table1
(`uri` varchar(10))
;
INSERT INTO Table1
(`uri`)
VALUES
('some/text'),
('some/text1'),
('some/text2'),
('aa/bb'),
('aa/cc'),
('bb/cc')
;
query
SELECT
SUBSTRING(uri,1,POSITION('/' IN uri)-1),
COUNT(*)
FROM Table1
GROUP BY SUBSTRING(uri,1,POSITION('/' IN uri)-1);
http://sqlfiddle.com/#!9/293dd3/3/0
Edit: here is the Amazon Athena documentation: https://docs.aws.amazon.com/athena/latest/ug/presto-functions.html and here is the Presto string function documentation: https://prestodb.io/docs/0.217/functions/string.html
My answer above still stands, but you might need to change SUBSTRING to SUBSTR.
Edit 2: Amazon Athena appears to have a dedicated function for this, SPLIT_PART().
query:
SELECT SPLIT_PART(uri, '/', 1), COUNT(*) FROM tbl GROUP BY SPLIT_PART(uri, '/', 1)
from docs:
split_part(string, delimiter, index) → varchar
Splits string on delimiter and returns the field index. Field indexes start with 1. If the index is larger than the number of fields, then null is returned.
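As a rough illustration of the documented behaviour (1-based field index, null when the index is out of range), here is a Python emulation together with the grouped count on the sample rows. The Python function `split_part` is a stand-in written for this sketch, not the Athena implementation.

```python
from collections import Counter


def split_part(string, delimiter, index):
    """Emulate the documented contract: 1-based index, None when out of range."""
    if index < 1:
        raise ValueError("field indexes start with 1")
    fields = string.split(delimiter)
    return fields[index - 1] if index <= len(fields) else None


uris = ["some/text", "some/text1", "some/text2", "aa/bb", "aa/cc", "bb/cc"]

# Equivalent of: SELECT SPLIT_PART(uri, '/', 1), COUNT(*) ... GROUP BY 1
counts = Counter(split_part(uri, "/", 1) for uri in uris)
print(counts.most_common())
```

For URIs that start with '/', as in the question, this emulation puts an empty string in field 1, so the first real segment is field 2. I have not verified that Athena treats a leading delimiter the same way, so check it on your own data.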

Remove duplicate substring from a string in oracle

I have strings like below in my table
2001,2452,2452,2421,2421,2495
2001,2483,2421,2421,2482
2001,2420,2421,2421,2425
2001,2420,2421,2421,2422
2001,2452,2452,2421,2421,2464
I want to remove the repeated numbers like 2452 and 2421 and show them only once in the data like
2001,2452,2421,2495
2001,2483,2421,2482
2001,2420,2421,2425
2001,2420,2421,2422
2001,2452,2421,2464
Has anyone done something like this? please let me know how to solve this
Thanks!
In Oracle SQL, you can use a hierarchical query and LISTAGG as follows (cte stands for your table):
select str, listagg(str_distinct, ',') within group (order by 1) as distinct_str from
(select distinct str, regexp_substr(str,'[^,]+',1,column_value) str_distinct from cte
cross join table(
cast(multiset(
select level lvl
from dual
connect by level <= regexp_count(str, '[^,]+'))
as sys.odcivarchar2list)
) lvls)
group by str;
db<>fiddle for one of the input strings.
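The split, de-duplicate and re-join steps can be sketched outside the database as well. The Python version below keeps the first occurrence of each number, which is the order shown in the desired output; `dedupe_csv` is a name invented for this example.

```python
def dedupe_csv(value):
    """Drop repeated items from a comma-separated string, keeping first occurrences."""
    # dict.fromkeys preserves insertion order, so the original order is kept.
    return ",".join(dict.fromkeys(value.split(",")))


rows = [
    "2001,2452,2452,2421,2421,2495",
    "2001,2483,2421,2421,2482",
    "2001,2420,2421,2421,2425",
    "2001,2420,2421,2421,2422",
    "2001,2452,2452,2421,2421,2464",
]
cleaned = [dedupe_csv(row) for row in rows]
for line in cleaned:
    print(line)
```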

How to exclude null values using REGEXP_SUBSTR

The following statement retrieves the value of the sub-tag msg_id from the MISC column if the sub-tag contains a value like %PACS%.
SELECT REGEXP_SUBSTR(MISC, '(^|\s|;)msg_id = (.*?)\s*(;|$)',1,1,NULL,2) AS TRANS_REF FROM MISC_HEADER
WHERE MISC LIKE '%PACS%';
I notice the query also returns records with a null value (those without msg_id). Is there any way to exclude those null records through the REGEXP_SUBSTR syntax itself, without adding a WHERE clause?
Sample data of MISC:
channel=atm ; phone=0123 ; msg_id=PACS00812 ; ustrd=U123
channel=pos; phone=9922; ustrd=U156
The second record has no msg_id, so it needs to be excluded.
This method does not use REGEXP so may not be suitable for you.
However, it does satisfy your requirement.
This takes your embedded list, breaks it out into one row per component for each ID (I've assumed you have something that uniquely identifies each record).
It then returns only the original rows where one of the component rows for that ID has 'PACS' in it.
WITH thedata
AS (SELECT 1 AS theid
, 'channel=atm ; phone=0123 ; msg_id=PACS00812 ; ustrd=U123'
AS msg_id
FROM DUAL
UNION ALL
SELECT 2, 'channel=pos; phone=9922; ustrd=U156' FROM DUAL)
, mylist
AS (SELECT theid, COLUMN_VALUE AS msg_component
FROM thedata
, XMLTABLE(('"' || REPLACE(msg_id, ';', '","') || '"')))
SELECT *
FROM thedata td
WHERE EXISTS
(SELECT 1
FROM mylist m
WHERE m.theid = td.theid
AND m.msg_component LIKE '%PACS%')
The thedata subquery simply generates a couple of records and stands in for your table. You can remove it and substitute your actual table name.
There are other ways to break up an embedded list including ones that use REGEXP, I just find the XMLTABLE method 'cleaner'.
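For comparison, the same "break the list apart, then test each component" idea looks like this in Python. It assumes the MISC values are semicolon-separated key=value pairs as in the sample data; `parse_misc` is a hypothetical helper, not part of the Oracle answer.

```python
def parse_misc(misc):
    """Turn 'k1=v1 ; k2=v2' into a dict, ignoring components without '='."""
    pairs = {}
    for component in misc.split(";"):
        key, sep, value = component.partition("=")
        if sep:
            pairs[key.strip()] = value.strip()
    return pairs


rows = [
    "channel=atm ; phone=0123 ; msg_id=PACS00812 ; ustrd=U123",
    "channel=pos; phone=9922; ustrd=U156",
]

# Keep only rows that actually carry a msg_id containing 'PACS'.
trans_refs = [
    parsed["msg_id"]
    for parsed in map(parse_misc, rows)
    if "PACS" in parsed.get("msg_id", "")
]
print(trans_refs)
```

The second row drops out because it has no msg_id at all, which is the behaviour asked for in the question.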

Using another table in BigQuery Regex

I would like to map a string column to a category based on a regular expression match.
Is it possible to use another bigquery table containing the regular expressions and corresponding category for this? This would make it easier for me to update only a table when adding new categories/updating the regex, instead of having to update all queries that would use this lookup.
Query:
CASE
-- Use the entries from another table here
WHEN REGEXP_MATCH(string_to_check, cat1regex) THEN cat1
WHEN REGEXP_MATCH(string_to_check, cat2regex) THEN cat2
etc.
END
Mapping table:
Regex category
pagex|pagey xy
pagez|page1 z1
It's also possible there is another simple way to do something similar that I'm not thinking of, answers pointing those out are welcome too.
Any help would be appreciated.
Below is for BigQuery Standard SQL
#standardSQL
SELECT
string_to_check,
MAX(IF(REGEXP_CONTAINS(string_to_check, reg), category, NULL)) AS category
FROM yourTable
CROSS JOIN mappingTable
GROUP BY string_to_check
You can test and play with it using the dummy data below, taken from your question:
#standardSQL
WITH `mappingTable` AS (
SELECT r'pagex|pagey' AS reg, 'xy' AS category UNION ALL
SELECT r'pagez|page1', 'z1'
),
`yourTable` AS (
SELECT string_to_check
FROM UNNEST(["pagex.com", "pagez#example.org", "page.example.net"]) AS string_to_check
)
SELECT
string_to_check,
MAX(IF(REGEXP_CONTAINS(string_to_check, reg), category, NULL)) AS category
FROM yourTable
CROSS JOIN mappingTable
GROUP BY string_to_check
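What the CROSS JOIN plus MAX(IF(...)) does can be pictured as trying every row of the mapping table against each string and keeping the category of a matching row. Below is a minimal Python sketch of that logic using the dummy data; `categorize` is a name made up here, and Python's `re` stands in for REGEXP_CONTAINS.

```python
import re

# Stand-in for mappingTable: (regex, category) rows.
mapping = [
    (r"pagex|pagey", "xy"),
    (r"pagez|page1", "z1"),
]


def categorize(string_to_check):
    """Return the largest matching category, or None when no regex matches."""
    matched = [cat for pattern, cat in mapping if re.search(pattern, string_to_check)]
    # max() mirrors MAX(...) in the query: with several matches, the greatest wins.
    return max(matched) if matched else None


strings = ["pagex.com", "pagez#example.org", "page.example.net"]
result = {s: categorize(s) for s in strings}
print(result)
```

If several patterns can match the same string, MAX picks the lexicographically largest category, so you may want an explicit priority column in the mapping table instead.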

Searching jsonb array in PostgreSQL

I'm trying to search a JSONB object in PostgreSQL 9.4. My question is similar to this thread.
However my data structure is slightly different which is causing me problems. My data structure is like:
[
{"id":1, "msg":"testing"},
{"id":2, "msg":"tested"},
{"id":3, "msg":"nothing"}
]
and I want to search for matching objects in that array by msg (RegEx, LIKE, =, etc). To be more specific, I want all rows in the table where the JSONB field has an object with a "msg" that matches my request.
The following shows a structure similar to what I have:
SELECT * FROM
(SELECT
'[{"id":1,"msg":"testing"},{"id":2,"msg":"tested"},{"id":3,"msg":"nothing"}]'::jsonb as data)
as jsonbexample;
This shows an attempt to implement the answer to the above link, but does not work (returns 0 rows):
SELECT * FROM
(SELECT
'[{"id":1,"msg":"testing"},{"id":2,"msg":"tested"},{"id":3,"msg":"nothing"}]'::jsonb as data)
as jsonbexample
WHERE
(data #>> '{msg}') LIKE '%est%';
Can anyone explain how to search through a JSONB array? In the above example I would like to find any row in the table whose "data" JSONB field contains an object where "msg" matches something (for example, LIKE '%est%').
Update
This code creates a new type (needed for later):
CREATE TYPE AlertLine AS (id INTEGER, msg TEXT);
Then you can use this to rip apart the column with JSONB_POPULATE_RECORDSET:
SELECT * FROM
JSONB_POPULATE_RECORDSET(
null::AlertLine,
(SELECT '[{"id":1,"msg":"testing"},
{"id":2,"msg":"tested"},
{"id":3,"msg":"nothing"}]'::jsonb
as data
)
) as jsonbex;
Outputs:
 id |   msg
----+---------
  1 | testing
  2 | tested
  3 | nothing
And putting in the constraints:
SELECT * FROM
JSONB_POPULATE_RECORDSET(
null::AlertLine,
(SELECT '[{"id":1,"msg":"testing"},
{"id":2,"msg":"tested"},
{"id":3,"msg":"nothing"}]'::jsonb
as data)
) as jsonbex
WHERE
msg LIKE '%est%';
Outputs:
 id |   msg
----+---------
  1 | testing
  2 | tested
So the part of the question still remaining is how to put this as a clause in another query.
So, if the output of the above code = x, how would I ask:
SELECT * FROM mytable WHERE x > (0 rows);
You can use exists:
SELECT * FROM
(SELECT
'[{"id":1,"msg":"testing"},{"id":2,"msg":"tested"},{"id":3,"msg":"nothing"}]'::jsonb as data)
as jsonbexample
WHERE
EXISTS (SELECT 1 FROM jsonb_array_elements(data) as j(data) WHERE (data#>> '{msg}') LIKE '%est%');
To query table as mentioned in comment below:
SELECT * FROM atable
WHERE EXISTS (SELECT 1 FROM jsonb_array_elements(columnx) as j(data) WHERE (data#>> '{msg}') LIKE '%est%');
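The EXISTS form works because jsonb_array_elements expands the array into one row per object, so the msg lookup is applied to each object rather than to the array itself (the original attempt looked for msg on the top-level array, which has no such key). The same "does any element match" check, sketched in Python with a made-up helper name and a plain substring test standing in for LIKE '%est%':

```python
import json


def has_matching_msg(data_json, needle):
    """True when any object in the JSON array has a "msg" containing needle."""
    elements = json.loads(data_json)
    return any(
        isinstance(element, dict) and needle in str(element.get("msg", ""))
        for element in elements
    )


data = '[{"id":1,"msg":"testing"},{"id":2,"msg":"tested"},{"id":3,"msg":"nothing"}]'
print(has_matching_msg(data, "est"))
```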